Mathematics and the Genome
Mathematics and the Genome: Near and Far (Strings)
Feature Column Archive
5. Near and Far (Strings)
One dramatic way that mathematics has come in handy for the study of the genome
involves the concept of distance. What is the distance between New York and Boston?
is a very familiar use of the word distance. Yet the word distance is very rich
in linguistic connotations. We speak of distant relatives and distant memories.
Distance suggests not only the ability to tell how close or far apart things are
but also issues of similarity and dissimilarity. It is because of the varied contexts
in which distance arises that a mathematical framework has been helpful. Given
a collection of objects (cities, bird songs, genes, languages, manuscripts), one
can assign a number to pairs of these objects which tell one how distant or how
similar or how close they are. What properties do we want the number associated
with the pair of objects to have?
We want these properties:
1. The distance between two objects is zero or a positive number.
2. The distance between two objects is zero if and only if the two objects are
the same.
3. The distance between objects A and B is the same as the distance between objects
B and A.
4. The distance between objects A and B added to the distance between B and C
is at least as large as the distance between A and C.
These 4 properties capture the distance with which you are probably most familiar
- Euclidean distance. Given the locations O = opera house = (0, 0) and C = commuter
rail station = (3, 4) in the grid below, the distance from O to C as the crow
or a helicopter would fly would be

which is derived from the Pythagorean Theorem.
This theorem states that in a right angle triangle, the side c opposite the
right angle is related to the other two sides of the triangle by:
Given two points with coordinates (x1, x2) and (y1,
y2), then the Euclidean distance between them will be:
However, the context here illustrates an important point: often there is more
than one way to define the distance between the same objects. Here, in getting
between two locations in an urban setting, we can not get between the opera house
and the commuter rail terminal as the crow flies. We will probably want to take
a taxi. The taxicab distance from O to C will not be 5 but 7, because the taxi
might take the route going from O to B (3 units) and then from B to C (4 units).
In general, taxicab distance between two points (x
1, x
2)
and (y
1, y
2) is given by:
This illustrates that often one gets a new and different insight into the same
objects by changing perspective on them.
How can a molecular geneticist put this lesson to work? Given two species how
far apart are they? Given two proteins, how far apart are they? Given two genes,
how far apart are they? Everyone is familiar with how to find the distance between
two points when the points are given in coordinate form. But how would one find
the distance between two pictures, two genes or two bird songs?
One clue came from Richard Hamming's pioneering work on error correction codes.
Hamming realized that by coding information in binary (or more generally with
entries from a finite field) that one could correct errors if one represented
information in the form of code words of fixed length with the property that they
differed in many positions. For example, the distance between the two binary strings
:
would be 6 because the strings, when lined up one on top of the other differ in
six columns. This distance, which is a distance because it obeys the 4 rules described
earlier, is today called Hamming Distance in honor of Richard Hamming.
However, for work in mathematical biology it is necessary to be able to extend
this notion of distance between strings to strings with different lengths. As
has often been the case, when people interested in mathematical biology turned
to mathematics for new tools to attack the new kinds of problems they had to deal
with, they found that mathematics for its own sake had developed tools that were
of help to them. These tools included a family of distance functions called edit
distance or Levenschtein distance. Strictly speaking, this is not one distance
function but a related family of distance functions for computing the distance
between two strings with similar features. These distance functions can be adapted
to a wide variety of situations, including situations where the strings have different
lengths; the treatment of what happens in different positions of the strings is
different.
Here is an illustration of the idea of edit distance in the context of English
strings. For the genome project one is typically interested in the distance between
DNA strings. Suppose one wishes to see how far apart the English words grind and
agreed are. Note one string has length 5 and the other 6. We begin by making the
strings the same length by introducing dashes into one or both of the strings.
One way to do this would be : - g r i n d and a g r e e d. The result of adding
these dashes may be to make both of the strings longer than either of the originals.
Now we line up the equal length strings into an alignment of the two strings:
The operations we will allow to transform one string into another, which also
permits the addition of dashes as noted above, are:
1. Insertion or deletion of a character in either of the strings. (These are referred
to as indels (insertions/deletions) in parts of the biology and computer science
literature.)
2. A substitution.
Above, we have matches in columns 2 and 6 but there fail to be matches in columns
1, 4, and 5. We think of the situation of having a dash in column 1 as inserting
a symbol, in this case a in the first word. We think of e substituting for e in
column 4 and n substituting for i in column 5. In the simplest version of edit
distance, we assume that the cost of making an insert or deletion has the same
cost as a substitution. The distance between two strings is the minimum cost of
any alignment for the two given strings. (The process of taking strings and finding
the edit distance between them is called sequence alignment.) In the case of grind
and agreed the edit distance would be 3. One can also use different weights for
insertions, deletions, and substitutions. Further complications which have been
taken into account use the fact that for the distance between DNA pieces which
are parts of genes, triples of letters play a special role. Since DNA, protein
sequences, genes, and many other objects can be represented by strings, the edit
distance becomes a powerful tool for determining closeness and similarity. In
particular, edit distance can be used to find the closest match between a possible
gene and a list of genes in a data base. The goal is to find the best alignment
with respect to some criterion. If one is given many DNA sequences, one can align
pairs of these sequences, but another and perhaps more informative procedure is
to find a multiple sequence alignment. The goal here is to align the sequences
in a way that brings to the fore the common features of the sequences as a group.
Thus, a multiple alignment of a group of parts of genes in different species may
show that portions of the genes are essential for a specific function.
- Introduction
- Mathematics
and Classical Genetics (The Early Days)
- Mathematics
and Clasical Genetics (1900-1953)
- Molecular
Genetics (1953-Present)
- Near
and Far (Strings)
- Near
and Far (Trees)
- The
Wider Picture and the Future
- References