Feature Column from the AMS

Mathematics and the Genome: Near and Far (Strings)

5. Near and Far (Strings)

One dramatic way that mathematics has come in handy for the study of the genome involves the concept of distance. What is the distance between New York and Boston? is a very familiar use of the word distance. Yet the word distance is very rich in linguistic connotations. We speak of distant relatives and distant memories. Distance suggests not only the ability to tell how close or far apart things are but also issues of similarity and dissimilarity. It is because of the varied contexts in which distance arises that a mathematical framework has been helpful. Given a collection of objects (cities, bird songs, genes, languages, manuscripts), one can assign a number to pairs of these objects which tell one how distant or how similar or how close they are. What properties do we want the number associated with the pair of objects to have?

We want these properties:

1. The distance between two objects is zero or a positive number.

2. The distance between two objects is zero if and only if the two objects are the same.

3. The distance between objects A and B is the same as the distance between objects B and A.

4. The distance between objects A and B added to the distance between B and C is at least as large as the distance between A and C.

These 4 properties capture the distance with which you are probably most familiar - Euclidean distance. Given the locations O = opera house = (0, 0) and C = commuter rail station = (3, 4) in the grid below, the distance from O to C as the crow or a helicopter would fly would be which is derived from the Pythagorean Theorem.

This theorem states that in a right angle triangle, the side c opposite the right angle is related to the other two sides of the triangle by: Given two points with coordinates (x₁, x₂) and (y₁, y₂), then the Euclidean distance between them will be:

However, the context here illustrates an important point: often there is more than one way to define the distance between the same objects. Here, in getting between two locations in an urban setting, we can not get between the opera house and the commuter rail terminal as the crow flies. We will probably want to take a taxi. The taxicab distance from O to C will not be 5 but 7, because the taxi might take the route going from O to B (3 units) and then from B to C (4 units). In general, taxicab distance between two points (x₁, x₂) and (y₁, y₂) is given by:

This illustrates that often one gets a new and different insight into the same objects by changing perspective on them.

How can a molecular geneticist put this lesson to work? Given two species how far apart are they? Given two proteins, how far apart are they? Given two genes, how far apart are they? Everyone is familiar with how to find the distance between two points when the points are given in coordinate form. But how would one find the distance between two pictures, two genes or two bird songs?

One clue came from Richard Hamming's pioneering work on error correction codes. Hamming realized that by coding information in binary (or more generally with entries from a finite field) that one could correct errors if one represented information in the form of code words of fixed length with the property that they differed in many positions. For example, the distance between the two binary strings :

would be 6 because the strings, when lined up one on top of the other differ in six columns. This distance, which is a distance because it obeys the 4 rules described earlier, is today called Hamming Distance in honor of Richard Hamming.

However, for work in mathematical biology it is necessary to be able to extend this notion of distance between strings to strings with different lengths. As has often been the case, when people interested in mathematical biology turned to mathematics for new tools to attack the new kinds of problems they had to deal with, they found that mathematics for its own sake had developed tools that were of help to them. These tools included a family of distance functions called edit distance or Levenschtein distance. Strictly speaking, this is not one distance function but a related family of distance functions for computing the distance between two strings with similar features. These distance functions can be adapted to a wide variety of situations, including situations where the strings have different lengths; the treatment of what happens in different positions of the strings is different.

Here is an illustration of the idea of edit distance in the context of English strings. For the genome project one is typically interested in the distance between DNA strings. Suppose one wishes to see how far apart the English words grind and agreed are. Note one string has length 5 and the other 6. We begin by making the strings the same length by introducing dashes into one or both of the strings. One way to do this would be : - g r i n d and a g r e e d. The result of adding these dashes may be to make both of the strings longer than either of the originals. Now we line up the equal length strings into an alignment of the two strings:

The operations we will allow to transform one string into another, which also permits the addition of dashes as noted above, are:

1. Insertion or deletion of a character in either of the strings. (These are referred to as indels (insertions/deletions) in parts of the biology and computer science literature.)

2. A substitution.

Above, we have matches in columns 2 and 6 but there fail to be matches in columns 1, 4, and 5. We think of the situation of having a dash in column 1 as inserting a symbol, in this case a in the first word. We think of e substituting for e in column 4 and n substituting for i in column 5. In the simplest version of edit distance, we assume that the cost of making an insert or deletion has the same cost as a substitution. The distance between two strings is the minimum cost of any alignment for the two given strings. (The process of taking strings and finding the edit distance between them is called sequence alignment.) In the case of grind and agreed the edit distance would be 3. One can also use different weights for insertions, deletions, and substitutions. Further complications which have been taken into account use the fact that for the distance between DNA pieces which are parts of genes, triples of letters play a special role. Since DNA, protein sequences, genes, and many other objects can be represented by strings, the edit distance becomes a powerful tool for determining closeness and similarity. In particular, edit distance can be used to find the closest match between a possible gene and a list of genes in a data base. The goal is to find the best alignment with respect to some criterion. If one is given many DNA sequences, one can align pairs of these sequences, but another and perhaps more informative procedure is to find a multiple sequence alignment. The goal here is to align the sequences in a way that brings to the fore the common features of the sequences as a group. Thus, a multiple alignment of a group of parts of genes in different species may show that portions of the genes are essential for a specific function.