A representation of DNA primary sequences by random walk☆
Introduction
In recent years, scientists have adopted various techniques to study DNA sequences, and quantitative methods of symbolic sequence analysis are now widely applied. Because the DNA molecule carries abundant chemical and biological information, statistical analysis of DNA sequences is an important way to extract useful information; such analysis involves both mapping rules for the bases and statistical methods. There are several ways to convert DNA sequences into numerical sequences, and many statistical methods, such as root-mean-square fluctuation, entropy near method, Fourier transform and wavelet transform, are used to analyze them [1], [2], [3], [4], [5], [6], [7], [8]. In particular, Peng et al. [1] reported long-range correlations in nucleotide sequences and introduced a quantitative method for measuring them. Dodin et al. [2] used Fourier analysis and the wavelet transform as tools for visualizing DNA sequences. Tsonis et al. [3] explored the local structure of DNA sequences with the continuous wavelet transform and examined its significance for gene evolution through experiment. Luo et al. [4], [5] studied the fractal structure of the DNA molecule and calculated fractal dimensions for the DNA of several organisms, finding that the fractal dimension is closely related to evolutionary level.
In the meantime, some researchers have proposed graphical representations of DNA primary sequences. For example, Hamori and Ruskin [9] and Zhang and Zhang [10] considered a DNA primary sequence as an H-curve and a Z-curve, respectively. Gates [11], Leong and Morgenthaler [12] and Nandy [13], [14], [15], [16], [17] considered several 2-D graphical representations of DNA. Extensions to 3-D and 4-D representations have been made by Randic et al. [18], [19] and others [20]. These methods provide a simple way of viewing, sorting and comparing various gene structures, and of analyzing the similarity between DNA sequences.
Motivated by these works, and building in particular on the DNA walk of Peng et al. [1], which uses the purine and pyrimidine classification of the four bases, together with two further classifications (amino group versus keto group, and weak versus strong hydrogen bond), we give three mappings from the four bases {A, C, G, T} of a DNA sequence to the number set {+1, −1}. Using each of the three mappings, we convert a DNA sequence into two random numeric sequences and then select some numerical characterizations of these random sequences as new invariants of the DNA primary sequence. With these invariants, we compare the similarities and dissimilarities among the primary sequences of exon 1 of the β-globin gene from nine species.
Section snippets
Random walk and primary sequences
A DNA sequence can be identified with a word over the alphabet {A, C, G, T}. We can consider a DNA primary sequence to be a finite totally ordered set with t elements, denoted [t] ≔ {1, 2, …, t}. As in Ref. [10], the four bases A, C, G, T of a DNA primary sequence can be divided into classes according to their chemical structure, i.e., purine R = {A, G} and pyrimidine Y = {C, T}; amino group M = {A, C} and keto group K = {G, T}; according to the strength of the hydrogen bond, the bases can still be classified
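The mapping from bases to {+1, −1} can be illustrated with a minimal Python sketch. Several details here are assumptions rather than the paper's definitions: which class receives +1 under each classification, the weak/strong hydrogen-bond classes written as W = {A, T} and S = {C, G}, the names `CLASSIFICATIONS`, `to_steps` and `walk`, and the short example string, which is not taken from the paper's data.

```python
# Three two-class partitions of the four bases. Which class maps to +1 is an
# assumption made for illustration; the source only fixes the target set {+1, -1}.
CLASSIFICATIONS = {
    "RY": {"A": +1, "G": +1, "C": -1, "T": -1},  # purine R={A,G} vs pyrimidine Y={C,T}
    "MK": {"A": +1, "C": +1, "G": -1, "T": -1},  # amino M={A,C} vs keto K={G,T}
    "WS": {"A": +1, "T": +1, "C": -1, "G": -1},  # weak W={A,T} vs strong S={C,G} hydrogen bond
}


def to_steps(sequence, scheme):
    """Map each base of the sequence to +1 or -1 under the chosen classification."""
    table = CLASSIFICATIONS[scheme]
    return [table[base] for base in sequence.upper()]


def walk(steps):
    """Running sum of the steps: the walker's position after each base."""
    positions = []
    total = 0
    for step in steps:
        total += step
        positions.append(total)
    return positions


seq = "ATGGTGCAC"  # short illustrative string, not data from the paper
for scheme in CLASSIFICATIONS:
    steps = to_steps(seq, scheme)
    print(scheme, steps, walk(steps))
```

Each classification yields a different ±1 sequence for the same bases, so one DNA sequence gives three walks whose statistics can be compared.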
Numerical characterization of DNA primary sequence
Because the DNA sequence may be regarded as a piece of information composed of four parts (A, C, G, T), we may regard the random numerical sequence as composed of two parts (−1, +1). Since these two parts appear randomly (Markov characteristic), we use the average information content to characterize the DNA sequence. Namely, we may quantitatively investigate the DNA sequence by the information entropy as follows, where p_i is the transition
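Since the entropy formula itself is cut off in this excerpt, the following Python sketch only assumes the standard Shannon form, −Σ p_i log2 p_i, applied to 1-step transition probabilities estimated from adjacent pairs of a ±1 sequence. The base-2 logarithm, the per-row reading of "transition entropy", and the function names are assumptions, not the paper's stated definitions.

```python
from collections import Counter
from math import log2


def transition_probabilities(steps):
    """Estimate P(next = b | current = a) for a +/-1 sequence from adjacent pairs."""
    pair_counts = Counter(zip(steps, steps[1:]))
    from_counts = Counter(steps[:-1])  # how often each value has a successor
    probs = {}
    for a in (+1, -1):
        for b in (+1, -1):
            n = from_counts[a]
            probs[(a, b)] = pair_counts[(a, b)] / n if n else 0.0
    return probs


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


steps = [1, -1, 1, 1, -1, 1, -1, 1, -1]  # illustrative +/-1 sequence
probs = transition_probabilities(steps)
for a in (+1, -1):
    row = [probs[(a, +1)], probs[(a, -1)]]
    print(a, row, entropy(row))
```

With three mappings and four transitions (+1→+1, +1→−1, −1→+1, −1→−1) per mapping, this kind of estimate yields twelve probabilities per sequence.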
Similarities and dissimilarities
Taking the 1-step transition probabilities, transition entropies and mean square deviations of the random sequences as invariants of DNA primary sequences, we compare the similarities and dissimilarities of nine exon-1 genes in this section. For each of the nine exon-1 genes, we first construct a 12-component vector and two 3-component vectors whose components are, respectively, the probabilities, transition entropies and mean square deviations given in Tables 2, 3 and 4. Through comparisons
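The excerpt does not say how the vectors are compared, so the sketch below simply assumes the Euclidean distance between feature vectors, with a smaller distance read as greater similarity. The labels `seq_a`, `seq_b`, `seq_c` and their values are hypothetical placeholders, not entries from Tables 2, 3 or 4.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(features):
    """Distance for every unordered pair of named vectors."""
    names = sorted(features)
    return {
        (a, b): euclidean(features[a], features[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical 3-component vectors, for illustration only.
features = {
    "seq_a": [0.0, 0.0, 0.0],
    "seq_b": [3.0, 4.0, 0.0],
    "seq_c": [3.0, 4.0, 12.0],
}
distances = pairwise_distances(features)
closest = min(distances, key=distances.get)  # most similar pair under this measure
print(distances)
print(closest)
```

The same functions apply unchanged to 12-component vectors; other measures, such as the angle between vectors, could be substituted in `euclidean`.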
Concluding remarks
Description, comparison and similarity analysis of DNA sequences remain important subjects in bioinformatics. DNA sequence databases have accumulated much data on biological evolution over billions of years; consequently, novel concepts and methods are needed to reveal the biological functions encoded in DNA sequences and to investigate the relationships of DNA sequences with biological evolution, cellular function, genetic mechanisms and the occurrence of disease. In this paper, with description of DNA
References (25)
- et al., Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences, J. Theor. Biol. (2000)
- et al., Information parameters of nucleic acid and molecular evolution, J. Theor. Biol. (1988)
- et al., Wavelet based fractal analysis of DNA sequences, Physica D (1996)
- et al., H curves, a novel method of representation of nucleotide series especially suited for long DNA sequence, J. Biol. Chem. (1983)
- et al., On the uniqueness of quantitative DNA difference descriptors in 2D graphical representation models, Chem. Phys. Lett. (2003)
- et al., Novel 2-D graphical representation of DNA sequences and their numerical characterization, Chem. Phys. Lett. (2003)
- et al., Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation, Chem. Phys. Lett. (2003)
- et al., New 3D graphical representation of DNA sequence and their numerical characterization, Chem. Phys. Lett. (2003)
- et al., Analysis of similarity/dissimilarity of DNA sequences based on 3-D graphical representation, Chem. Phys. Lett. (2004)
- et al., Long-range correlations in nucleotide sequences, Nature (1992)
- Wavelet analysis of DNA sequences, Phys. Rev. E
- Fractal dimension of nucleic acid and its relation to evolutionary level, Chem. Phys. Lett.
☆ This work is supported by the National Natural Science Foundation of China (10571019).