Feature Column from the AMS

Mathematics and the Genome: Molecular Genetics (1950-Present)

4. Molecular Genetics (1953-Present)

Given the short history of the evolution of molecular genetics, the support mathematics has provided has been equally dramatic. Intriguingly, history has repeated itself to some extent in this saga: new methods to deal with fast developing problems are being found, but there has also been an available reservoir of tools developed without specific reference to the needs of molecular genetics that are proving valuable off the shelf. Many of the ways that mathematics has been put to use is that DNA can be thought of as a string made up of an alphabet of 4 letters, and a protein can be thought of as a string whose letters are the amino acids which make it up.

Problems that have attracted mathematical interest include:

1. Finding an exact match for a given string within a long string.

2. Finding an approximate match for a given string within a long string.

3. Finding the best match for a string in a list of strings with respect to some optimization criterion.

4. Finding the best approximate match for a string in a list of strings with respect to some optimization criterion.

5. Finding the largest common string in a list of strings.

6. Given a set of strings with information about the overlap properties of these strings, to reconstruct a string which has the original strings as substrings.

7. Given a set of strings, find the shortest string which has the given strings as a substring.

8. Given a set of strings, find the shortest string which has the given strings or the reversals of these strings as a substring.

9. Given a pair of strings, what is the distance between them (see next section).

10. Finding trees that help represent relationships between objects.

Some of these problems have been shown to have polynomial time complexity but others have been shown to be NP-complete. Since the problems that biologists are interested in are quite large, computers have been used from the very early days. However, the implications in some of the problems listed for having approximate answers in a biological setting are more problematical than in some other situations. For example, if one is designing a spell checker, one wants to locate strings in a dictionary that are close to a string (which presumably is not a true word) which is not in the dictionary. The consequence of not finding the word that the typist meant to enter is rather different from reaching the wrong conclusion about relationships between species.

Methods exist for taking a long stretch of DNA whose structure is not known and cleaving it at particular sites. One then needs to sequence the resulting pieces and to be able to reconstruct the original DNA from these pieces. In the traditional method of sequencing, one locates particular stretches of DNA as being associated with known sections of DNA from specific chromosomes. In the so-called shotgun method the whole genome is in small pieces and one reconstructs the structure of all of the chromosomes from the small pieces. Many people have been involved in various parts of the designing of algorithms for reconstruction of long stretches of DNA from shorter pieces. However, the work of E. W. Myers is especially noteworthy in promoting this approach, whose success from a biological viewpoint is still being debated. However, much new has been learned in a mathematical sense.