DNA provides the blueprint by which organisms are constructed and subsequently function; it is the basis of life. As new advances in laboratory technology now allow us to manipulate DNA with astonishing precision, there is also need for mathematical techniques to transform data coming from the laboratory into a detailed understanding of the structure of DNA.
In this column, I will describe current technology that is used in genome sequencing and see how mathematics plays a crucial role in processing the data. First, let's review what DNA is and what kind of information we would like to learn about it. Of course, this is a subject with many, many details, and the presentation here will necessarily be cursory; my aim is to describe how a mathematical problem arises in a biological context and then present a solution to that problem.
What is DNA?
DNA, or deoxyribonucleic acid, is a complex organic molecule that is found in the nucleus of a living cell in units called chromosomes. The nuclei of human cells, for instance, contain 23 pairs of chromosomes with both parents contributing one chromosome to each pair.
The most common form of DNA appears in the form of a double helix, which resembles a twisted ladder. The molecule consists of a sugar-phosphate backbone, which forms the outside of the ladder, along with pairs of nucleotide bases, which form the rungs of the ladder. To give a sense of scale, the base-pair rungs of the ladder are separated by 0.34 nanometers. These bases will be of most interest to us for they are what encode the design of the organism.
There are four different bases that appear in DNA--adenine (A), guanine (G), cytosine (C), and thymine (T)--and the unique underlying structures of the four bases allow them to pair in a rung of the ladder only in limited ways. For instance, adenine may pair only with thymine to form an A-T pair, and guanine may pair only with cytosine to form a G-C pair. We say that A and T are complements as are C and G.
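This pairing rule is simple enough to express in a few lines of code. Here is an illustrative Python sketch, included only to make the rule concrete:

```python
# Each base pairs with exactly one partner, so the bottom strand of the
# ladder is completely determined by the top strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the complementary strand, base by base."""
    return "".join(COMPLEMENT[b] for b in strand)

print(complement("ACGT"))  # TGCA
```

Notice that taking the complement twice returns the original strand, just as A-T and G-C pairing is symmetric.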
For our purposes, it will be useful to imagine the molecule as being untwisted and rotated so that the rungs of the ladder are oriented vertically and the base-pairs may be read from left to right:
In fact, the ends of the sugar-phosphate chains are distinguishable from one another by the nature of an unbound carbon atom: one end has an unbound 5' carbon atom, while the other end has an unbound 3' carbon atom. When we represent a DNA molecule with a diagram, such as that shown above, it is understood that the top chain moves from the 5' end on the left to the 3' end on the right. It turns out that the bottom chain is oppositely oriented, from 3' on the left to 5' on the right.
In this way, we may read the genome sequence, the sequence of bases from left to right across the top strand. Shown below is a portion of a genome sequence, known as NCBI Build 36, taken from chromosome 8 of a human. Across the top, we see the number of each base-pair measured from the 5' end of the top chain. That is, we are looking at a section about 2.4 million base-pairs from the 5' end of the top chain.
You may explore more of this sequence at the National Center for Biotechnology Information. We will have more to say about sequences such as this in a moment. In the meantime, however, let's see why this kind of information might be important.
What do we hope to learn from genome sequences?
There are many answers to this question, and our discussion will, once again, be regrettably brief. We have said that DNA contains all the information necessary for the growth and functioning of an organism. Let's begin with a quick summary of how this works.
Through a process called transcription, DNA encodes its information in the form of RNA, ribonucleic acid, which may be thought of as a sequence of bases that is copied from portions of the DNA molecule. The bases in this sequence are grouped into triplets, called codons, and each codon is responsible for the creation of an amino acid through a process known as translation. For instance, the codon GCC is translated into the amino acid alanine, while GGC is translated into glycine. Notice that the difference in a single base may cause the creation of a different amino acid.
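To make the codon idea concrete, here is a toy Python sketch of translation using only the two codons mentioned above; a real codon table has 64 entries:

```python
# A two-entry stand-in for the full 64-entry codon table.
CODON_TABLE = {"GCC": "alanine", "GGC": "glycine"}

def translate(sequence):
    """Group a base sequence into triplets (codons) and translate each one."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    return [CODON_TABLE[c] for c in codons]

print(translate("GCCGGC"))  # ['alanine', 'glycine']
```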
Amino acids are the building blocks for proteins that control the growth and function of cells. Regions in the genome sequence called genes then work together as a unit to create sequences of amino acids that combine to form specific proteins that fill specific roles in the functioning of the cell. Shown below is an enlarged view of the genome sequence (from above) that shows the presence of genes as green bars.
In brief, this is the mechanism by which an organism's genotype, or genetic structure, is related to its phenotype, or physical structure, and there is reason to believe that a deeper understanding of this relationship will lead to much useful information. For example, it is now possible to identify regions in the genome sequence that are related to an individual's susceptibility to certain diseases. The plot below shows the association of various locations on chromosome 12 with elevated levels of HDL cholesterol.
Image created with LocusZoom using data from Teslovich et al.
This kind of information can be used in at least two ways. First, an individual's susceptibility may be identified, in some cases, before the onset of a disease, and strategies taken to avoid or lessen the disease's impact. Second, this information may provide a more complete understanding of some diseases. For instance, if we know that a particular region of the genome sequence is related to a particular disease, we may be able to identify the resulting sequence of amino acids and proteins and therefore gain insight into the chemical basis of the disease.
Other uses of genome sequences include a richer understanding of the human family tree as well as effects of natural selection.
How are genome sequences obtained?
There are several methods by which genome sequences are read. Because I hope to describe a particular mathematical question, this discussion will focus on a current technology known as next-generation sequencing.
Let's first give a scale to this problem. Across the 23 human chromosomes, there are approximately 3 billion base-pairs. What is remarkable is that the sequences from two individuals are 99.9% identical; that is, only one out of roughly every thousand bases in the two sequences will differ. Remember that portion of the genome sequence we looked at earlier? It shows about 50 bases. Chances are that your genome sequence looks exactly like this.
Differences in the sequences arise from mutations that occur, for instance, when the DNA of a parent is incorrectly transmitted to a child in the next generation. Geneticists estimate that mutations occur at the rate of roughly 1 mutation for every 100 million bases transmitted across a generation. This is a remarkably low rate, especially considering the complexity of the chemical processes involved in the transmission of DNA across generations. It says roughly this: when a child's sequence of 3 billion bases is created, there are roughly 30 mutations made. This low mutation rate explains why two genome sequences so closely resemble one another.
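The arithmetic behind that estimate takes only a line or two:

```python
# One mutation per 100 million transmitted bases, over a 3-billion-base genome.
genome_size = 3_000_000_000
transmission_rate = 1 / 100_000_000
expected_mutations = genome_size * transmission_rate
print(expected_mutations)  # 30.0
```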
In the early 2000s, two parallel projects, led by the government-sponsored Human Genome Project and the private firm Celera, created the first complete genome sequences of two individuals. These were the result of multi-year efforts by both groups; in fact, the Human Genome Project lasted roughly 13 years.
What is the value in having the complete sequence for a single individual? After all, it is the differences in genome sequences that account for differences in individuals so we will need to look at the sequences of many individuals if we hope to understand how differences in genotype manifest as differences in phenotype. To answer this question, let's consider how DNA is sequenced now.
First, several molecules of DNA are extracted from an individual and subjected to an acoustic vibration that causes the molecules to break into fragments of, say, 100 bases. Here is a representation of one such fragment:
Since the full genome contains 3 billion bases, there are several tens of millions of these fragments for the whole genome. Through a process known as ligation, a special piece of DNA, with a specific sequence of bases, is added to the end of each fragment.
The ligated fragments are then passed over a surface or chip that contains special pieces of DNA composed of base sequences that are complementary to the ligated ends.
These complementary DNA sequences bind the fragment to the chip's surface.
In fact, the chip is designed to bind hundreds of millions of fragments:
Through a process known as amplification, each fragment is copied many times so that you may imagine a small forest of identical copies around each original fragment. The chip is then immersed in a bath containing bases where each of the four base types has a different fluorescent dye added. In this bath, exactly one complementary base is added to a fragment, as in the figure below.
A laser then excites the dyed bases so that they emit a wavelength of light that characterizes the type of base present. Due to the amplification process, there are many closely grouped fragments emitting the same wavelength of light, which may then be captured on an image. This process is then repeated, adding and reading one base at a time, until the fragments have been read.
Each image, which will contain tens of millions of small spots of light, is then analyzed, sometimes using software originally developed for astronomical observations, to determine the particular base present in the fragment at that stage of the process.
A video outlining this method may be seen at the Illumina website. (Select the "Video: Genome Analyzer workflow" link and the "Technology" tab.)
After this process and analysis are complete, we then know the sequence of each of the tens of millions of fragments.
Of course, there is some error in this process: typically, we read about 1 in 100 bases incorrectly. Since the total genome has 3 billion bases, this would mean we make about 30 million errors in one read of the genome. Since there are only about 3 million differences between two genomes, this is an unacceptably high error rate. We therefore read the genome many times, as many as 80 times, in an effort to catch errors. As our understanding of the genome has increased, sophisticated statistical methods allow us to correct errors using as few as 4 passes over the entire genome.
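A quick computation shows why a single pass is not enough:

```python
genome_size = 3_000_000_000
read_error_rate = 1 / 100      # about 1 base in 100 is read incorrectly
true_differences = 3_000_000   # roughly 0.1% of bases differ between two genomes

errors_per_pass = genome_size * read_error_rate
print(int(errors_per_pass))                # 30000000
print(errors_per_pass / true_differences)  # 10.0: errors outnumber real differences
```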
The problem now is that we need to put the fragments back together. It's as if we've dumped a 10 million piece puzzle out onto the table in front of us and need to assemble the puzzle by correctly locating each of the pieces. To do this, we'd like to be able to look at a picture of the completed puzzle. This is where it is useful to have a complete genome available as a reference. The genome we are trying to assemble does not perfectly match the reference, but it is in 99.9% agreement. We will therefore compare the reads from our fragments to the reference genome to determine how all the pieces fit together.
This is the mathematical issue that I would like to explain. The reference genome is a string composed of 3 billion letters "A", "T", "C", and "G". If we have a string with 100 letters coming from the read of a fragment, how can we determine where it came from in the reference genome? (This task is called alignment.) Moreover, since we have over ten million fragments, we need to do it efficiently.
Also, each fragment may not be an exact match to a portion of the reference genome so we need to find substrings of the genome that are close matches to our fragments. The types of possible differences are shown below.
At some locations, the fragment may not agree with the reference genome due to either an error in the sequencing read or a real difference between the fragment and the reference. Errors will hopefully be detected by reading this section of the genome many times. A difference between the fragment and the reference genome is known as a SNP, or single nucleotide polymorphism, and a great deal of current research involves relating SNPs to the etiology of certain diseases.
There may be regions in the fragment, called insertions, that are not present in the reference genome. In the same way, regions in the reference may be deleted in the fragment. Insertions and deletions are known as indels and are another important source of genetic variation.
The Burrows-Wheeler transform
Now we have uncovered a mathematical problem. The reference genome is a sequence of 3 billion bases A, T, C, and G. The sequencing data gives us tens of millions of sequence reads of 100 bases, each of which should resemble a subsequence of the reference genome without necessarily being an exact match. To reconstruct the sampled genome, we need an algorithm that can find the best alignment between a sequence read and a subsequence of the reference genome.
I will describe an algorithm that was recently introduced by Li and Durbin and that is based on the Burrows-Wheeler transform. An implementation of this algorithm is widely used for aligning sequence reads.
Let's begin with the Burrows-Wheeler transform, which takes a string of text X and transforms it into a new string XBW. In our application, we will begin with the reference genome and apply the Burrows-Wheeler transform. For the purposes of this discussion, we will transform the example string "banana".
To start, we append the symbol "$" to the end of the string so that we obtain "banana$". The individual letters in the string will be considered to be ordered in the usual alphabetical order; the special symbol "$" will be considered to come before all the other letters in the alphabet.
Next, create a table in which each row is a cyclic permutation of the letters in the string X. The figure below indicates the number of each row, though these numbers are not part of the permuted strings. Notice that this number gives the location in the string X of the first letter in the row (starting from zero).
Now we will sort the rows in lexicographical, or dictionary, order. In the figure below, the original row numbers are indicated next to each row.
The Burrows-Wheeler transform of the original string X = "banana$" is now the string in the rightmost column of the table, which is shaded in the figure below. In our example, the transformed string is XBW = "annb$aa".
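For readers who would like to experiment, here is a short Python sketch of this construction. Building the full table of rotations is fine for small examples like ours, though practical implementations avoid it:

```python
def bwt(text):
    """Burrows-Wheeler transform via the sorted table of cyclic rotations.

    As in the text: append "$", form every cyclic rotation of the string,
    sort the rotations, and read off the last column. (In ASCII, "$"
    already sorts before every letter, matching our convention.)
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt("banana"))  # annb$aa
```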
Through a process that need not concern us here, the transform may be inverted; that is, if we know XBW, we may recover X. The transform tends to group together letters that occur in similar substrings, a feature that makes it useful for compressing data represented as a string as well as for our purposes.
Now that we have the Burrows-Wheeler transform in hand, we may also introduce the suffix array. First, note that the suffixes of the original string X (in our example, these are "$", "a$", "na$", "ana$", "nana$", "anana$", "banana$") may also be sorted in lexicographical order. In fact, they appear in lexicographical order in the sorted table we constructed above.
The suffix array S is defined like this: consider the kth suffix in the lexicographical order. The kth element of the suffix array, S(k), is the position in the original string X of the first letter of that suffix. If this seems like a mouthful, the example below should make it clear.
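The definition translates directly into a Python sketch; sorting positions by the suffixes they begin gives the suffix array:

```python
def suffix_array(text):
    """Sort the suffixes of text + "$" and record where each one starts.

    S[k] is the position in the original string of the first letter of
    the kth suffix in lexicographical order.
    """
    s = text + "$"
    return sorted(range(len(s)), key=lambda i: s[i:])

print(suffix_array("banana"))  # [6, 5, 3, 1, 0, 4, 2]
```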
Finding exact matches
We will now describe how to use these ideas to find substrings of the original string X. Notice that every substring of X is a prefix of a suffix. In our example, for instance, the substring "nan" is a prefix of the suffix "nana$".
If we now look for occurrences of the substring "an", we notice that it is a prefix to two suffixes, "ana$" and "anana$", and that these suffixes are sorted together in lexicographical order.
More generally, if we have a substring W of X, we may define lower and upper bounds:
- L(W), the smallest k so that W is a prefix of the kth sorted suffix;
- U(W), the largest k so that W is a prefix of the kth sorted suffix.
In our example, we have L("an") = 2 and U("an") = 3. We call the interval [L(W), U(W)] the suffix array interval of W. Finding the suffix array interval of a string W identifies all the occurrences of W in X:
L(W) ≤ k ≤ U(W) if and only if S(k) is the starting position of an occurrence of W in X.
This provides a way of formulating our problem: we would like to find all occurrences of a string W in the string X. If we can find the suffix array interval of W, then we have solved this problem.
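Before turning to the Burrows-Wheeler machinery, it may help to see the suffix array interval computed naively. The following Python sketch simply binary-searches the sorted list of suffixes; storing every suffix is far too costly for a genome, but it makes the definition concrete:

```python
from bisect import bisect_left, bisect_right

def suffix_interval(text, w):
    """Naive suffix array interval, straight from the definition.

    Returns (L, U) such that the kth sorted suffix has w as a prefix
    exactly when L <= k <= U, or None if w does not occur in text.
    """
    s = text + "$"
    suffixes = sorted(s[i:] for i in range(len(s)))
    lo = bisect_left(suffixes, w)
    # Suffixes that start with w sort between w itself and w followed
    # by a sentinel larger than any letter ("~" is past "z" in ASCII).
    hi = bisect_right(suffixes, w + "~") - 1
    return (lo, hi) if lo <= hi else None

print(suffix_interval("banana", "an"))  # (2, 3)
```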
We will describe an algorithm that finds the suffix array interval for any string by working recursively: if x is a letter in our alphabet and we know the suffix array interval of W, then we will find the suffix array interval of xW, the string formed by adding x to the beginning of W. We need to define two quantities first:
- If x is a letter in the alphabet, let C(x) be the number of letters in X that are lexicographically smaller than x. For instance, in our example, C("n") = 5 since X contains the letters "$", "b", "a", "a", and "a" that are lexicographically smaller than "n". Alternatively, C(x) is the value of k labeling the first row that begins with x. So it must be the case that L(xW) ≥ C(x).
- We will also need to consider O(x, k), which we define to be the number of occurrences of x in the first k+1 letters of XBW. In other words, O(x, k) is the number of times x appears in the rightmost column of the table in the rows labeled 0, 1, 2, ..., k.
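Both quantities are easy to precompute. Here is an illustrative Python sketch; the full table of values O(x, k) shown here uses more memory than practical implementations, which store only sampled checkpoints:

```python
def precompute_tables(bw):
    """Tabulate C(x) and O(x, k) from a Burrows-Wheeler string bw.

    C[x] is the number of letters in bw lexicographically smaller than x,
    and O[x][k] is the number of occurrences of x among the first k + 1
    letters of bw (that is, in rows 0, 1, ..., k).
    """
    alphabet = sorted(set(bw))
    C = {x: sum(1 for y in bw if y < x) for x in alphabet}
    O = {x: [] for x in alphabet}
    counts = {x: 0 for x in alphabet}
    for letter in bw:
        counts[letter] += 1
        for x in alphabet:
            O[x].append(counts[x])
    return C, O

C, O = precompute_tables("annb$aa")   # the transform of "banana$"
print(C["n"])     # 5
print(O["a"][4])  # 1: one "a" among the first five letters "annb$"
```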
Let's describe this algorithm by working through an example: We'll suppose we know that the suffix array interval of the substring "na" is [5, 6] and use this information to find the suffix array interval of the substring "ana". Consider modifying the table by taking the rightmost column and placing it before the leftmost column:
is modified to
Of course, this changes the lexicographical order of the rows. However, if we consider only the rows that begin with x, then the relative order of those rows is not changed by this operation. Shown below is the effect of this modification on rows beginning with "a":
is modified to
As an example, let's ask ourselves how far the row beginning with "ana" is below the first row beginning with "a".
is modified to
Now undo the modification
This shows that, in the original table, the row beginning with "ana" is below the first row beginning with "a" by the number of times that "a" occurs in the rightmost column above the first row beginning with "na". Stated more generally, this means that
L(xW) = C(x) + O(x, L(W)-1).
In the same way, we find that
U(xW) = C(x) + O(x, U(W)) - 1.
To summarize, here is the result as stated and proven by Ferragina and Manzini:
Suppose that the suffix array interval of W is [L(W), U(W)]. For x a letter in our alphabet, define
L(xW) = C(x) + O(x, L(W)-1),
U(xW) = C(x) + O(x, U(W)) - 1.
If L(xW) ≤ U(xW), then xW is a substring of X with suffix array interval [L(xW), U(xW)].
If L(xW) > U(xW), then xW does not appear as a substring of X.
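Here is a self-contained Python sketch of this backward search applied to our example. It recomputes the transform and tables from scratch each call, which a real aligner would of course do only once for the reference:

```python
def backward_search(text, w):
    """Exact string search with the Burrows-Wheeler transform.

    An illustrative sketch of the backward search of Ferragina and
    Manzini: starting from the interval of the empty string, prepend the
    letters of w one at a time using
        L(xW) = C(x) + O(x, L(W) - 1),
        U(xW) = C(x) + O(x, U(W)) - 1.
    Returns the suffix array interval (L, U), or None if w is absent.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bw = "".join(row[-1] for row in rotations)
    C = {x: sum(1 for y in bw if y < x) for x in set(bw)}

    def O(x, k):  # occurrences of x among the first k + 1 letters of bw
        return bw[:k + 1].count(x) if k >= 0 else 0

    lo, hi = 0, len(s) - 1           # interval of the empty string
    for x in reversed(w):            # work through w from back to front
        if x not in C:
            return None
        lo = C[x] + O(x, lo - 1)
        hi = C[x] + O(x, hi) - 1
        if lo > hi:
            return None              # xW does not occur in text
    return lo, hi

# Recover starting positions with the suffix array, as in the text.
S = sorted(range(7), key=lambda i: "banana$"[i:])
lo, hi = backward_search("banana", "ana")
print(lo, hi)                             # 2 3
print([S[k] for k in range(lo, hi + 1)])  # [3, 1]
```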
Let's see how this works in practice. Before beginning the algorithm, we determine XBW, the Burrows-Wheeler transform of the string X and the suffix array S. We will also determine the values C(x) and O(x, k). Since X plays the role of the reference genome, we only need to do this once.
Imagine that we want to search for W = "ana" in X = "banana$". We first consider the empty string "", which is a prefix of every sorted suffix, so that L("") = 0 and U("") = 6, the index of the last row. We then work through W from its right-hand end:
C("a") + O("a", -1) = 1
C("a") + O("a", 6) - 1 = 3.
Then we find
C("n") + O("n", 0) = 5
C("n") + O("n", 3) - 1 = 6
C("a") + O("a", 4) = 2
C("a") + O("a", 6) - 1 = 3
This shows that W = "ana" is a substring of X = "banana$" with suffix array interval [2, 3]. Therefore, "ana" appears in "banana" twice with starting positions S(2) = 3 and S(3) = 1.
Notice that the algorithm finds all occurrences of W in X. Also, if W were not a substring of X, we would have arrived at L > U at some step of the algorithm.
How to handle mismatches, insertions, and deletions
Of course, in our application, we are not necessarily looking for exact matches; however, it is fairly straightforward to modify the algorithm to allow mismatches, insertions, and deletions. To do this, we will set a limit d on the number of mismatches, insertions, and deletions we will allow. As we work through the string W from back to front, we will let i denote the position of the next letter we wish to match. We then allow the algorithm to branch according to the four possibilities:
- Match: Decrease i by one to move to the next letter and hold d fixed since we have not found a difference.
- Deletion: Allow a match with any letter in the reference X without changing the current position in W. This has the effect of keeping i fixed and decreasing d by one since we will now allow one fewer difference.
- Insertion: Skip to the next letter in W without changing the current position in the reference. This means we decrease both i and d by one.
- Mismatch: Allow a match to a letter other than the current one in W. Here, we decrease both i and d.
We continue until either
- d < 0, which means the current branch has too many differences to yield an acceptable match.
- i < 0, which means that, up to an acceptable number of differences, we have found positions in the reference string matching W.
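The branching rules above can be sketched in Python as follows. Unlike Li and Durbin's implementation, this simplified version does no pruning and recomputes the transform from scratch, but it follows the same four branches:

```python
def inexact_search(text, w, d):
    """Sketch of bounded inexact search in the style of Li and Durbin.

    Works through w from back to front, branching on match, mismatch,
    insertion, and deletion, with at most d differences in total.
    Returns the set of suffix array intervals of approximate matches.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bw = "".join(row[-1] for row in rotations)
    alphabet = sorted(set(bw) - {"$"})
    C = {x: sum(1 for y in bw if y < x) for x in set(bw)}

    def O(x, k):
        return bw[:k + 1].count(x) if k >= 0 else 0

    def recur(i, d, lo, hi):
        if d < 0:            # too many differences on this branch
            return set()
        if i < 0:            # all of w matched within d differences
            return {(lo, hi)}
        hits = set()
        # Insertion: skip w[i] without moving in the reference.
        hits |= recur(i - 1, d - 1, lo, hi)
        for x in alphabet:
            new_lo = C[x] + O(x, lo - 1)
            new_hi = C[x] + O(x, hi) - 1
            if new_lo <= new_hi:
                # Deletion: consume a reference letter, stay at w[i].
                hits |= recur(i, d - 1, new_lo, new_hi)
                if x == w[i]:
                    # Match: no difference spent.
                    hits |= recur(i - 1, d, new_lo, new_hi)
                else:
                    # Mismatch: one difference spent.
                    hits |= recur(i - 1, d - 1, new_lo, new_hi)
        return hits

    return recur(len(w) - 1, d, 0, len(s) - 1)

print(inexact_search("banana", "ana", 0))  # {(2, 3)}
```

With d = 0 this reduces to the exact search; with d = 1, searching for "ama" also recovers the interval of "ana" by spending one mismatch.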
This algorithm is written out in pseudo-code in Li and Durbin's quite readable paper. For reads of approximately 100 bases, they recommend setting d, the maximum number of allowed differences, to 5.
Since it is possible that a given read will match to more than one location in the reference genome, it is useful to generate a map quality score for each possible alignment. For instance, alignments to substrings with greater differences will have a lower map quality. In fact, more refined measurements are possible by giving different weights to the map quality defects associated to insertions, deletions, and mismatches.
In addition, each read of a base typically has a quality value that reflects the laboratory's confidence in its correctness. This provides further assistance in choosing the best possible alignment.
Now that the work of constructing the reference genome is complete, we would like to be able to find the genome sequences for large numbers of individuals. Such data would enable us to deepen our understanding of the relationship between genotype and phenotype. Indeed, several such projects are underway; one example is the 1000 Genomes Project, which was begun in 2008.
New advances in laboratory technology are decreasing both the time and cost of creating a genome sequence. Indeed, one company manufacturing this equipment claims that the day of a $1000, 1-day genome is near.
This raises an interesting issue. I am struck by what seems to be a fortunate historical coincidence. At the same time that laboratory technology is enabling us to rapidly generate genetic data, the computational tools we have on hand are commensurate to the task of processing the data. For instance, a genome is 3 billion letters, while computer memory is now measured in billions of bytes (gigabytes) and processors run at billions of cycles per second (gigahertz).
However, it appears that our ability to generate data is growing at a rate faster than computational power is growing. This discipline presents an unusual prospect: compared to the problems it is being asked to solve, the power of our computational tools is effectively decreasing. Because of this, we need increasingly better mathematical techniques to create more efficient algorithms for working with this data.
It has been said that if the twentieth century was the century of physics then the twenty-first will be the century of biology. It appears that mathematics will play a significant role in this transition.
I would like to thank the many people at the Center for Statistical Genetics in the University of Michigan's School of Public Health who have made me feel welcome and introduced me to this area. I would like to thank especially Tanya Teslovich for assistance with one of the figures and for several helpful conversations.
- NCBI Science Primer
- U.S. National Library of Medicine
These sites give a great deal of background information about DNA, genetics, and the genome and are written for a lay audience. The illustration of a DNA molecule at the beginning of this column is taken from and used with permission of the U.S. National Library of Medicine.
- Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.
This paper gives an accessible description of the string matching algorithm described here as well as details on its implementation.
- Burrows, M. and Wheeler, D. (1994) A block-sorting lossless data compression algorithm. Technical Report 124, Palo Alto, CA, Digital Equipment Corporation.
This is the original paper on the Burrows-Wheeler transform and explains its effectiveness in data compression.
- Ferragina, P. and Manzini, G. (2000) Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC 2000), ACM, 397-406.
A preliminary version of the paper with complete proofs of all results.
- The 1000 Genomes Project Consortium. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061-1073.
This describes recent work from the 1000 Genomes Project and what we've learned by looking at a large number of genome sequences.
- Teslovich, T.M., et al. (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466 (7307): 707-713.
This paper contains data used to create the figure illustrating the association of SNPs to HDL cholesterol.
- Human DNA Contamination Seen in Genome Databases, New York Times, February 16, 2011.
While preparing this column, I coincidentally saw this news article describing how human DNA has contaminated the reference genomes for other species.