Elsevier

Mathematical Biosciences

Volume 209, Issue 1, September 2007, Pages 282-291
A representation of DNA primary sequences by random walk

https://doi.org/10.1016/j.mbs.2006.06.004

Abstract

We describe DNA primary sequences by means of a random walk. From this description we obtain two random sequences {Ym} and {Xn} corresponding to a DNA sequence, as well as graphical representations of DNA sequences. We further prove that {Ym} and {Xn} are both Markov chains. Based on the transition probability distributions of these Markov chains and some numerical characterizations of the random sequences, we introduce new invariants for DNA primary sequences. Using these invariants, we compare the primary sequences of exon 1 of the β-globin genes of nine species to analyze their similarity and dissimilarity.

Introduction

In recent years, scientists have adopted various techniques to study DNA sequences, and quantitative methods of symbolic sequence analysis are now widely applied to them. The DNA molecule carries abundant chemical and biological information, so it is important to analyze DNA sequences statistically in order to extract useful information; such analysis involves both mapping rules for the bases and statistical analytical methods. There are several methods for converting DNA sequences into digital sequences, and many statistical methods, such as root-mean-square fluctuation, entropy near method, Fourier transform and wavelet transform, are used to analyze DNA sequences [1], [2], [3], [4], [5], [6], [7], [8]. In particular, Peng et al. [1] reported long-range correlations among the characters of nucleotide sequences and introduced a quantitative method for measuring them. Dodin et al. [2] used Fourier analysis and the wavelet transform as visualization tools for DNA sequences. Tsonis et al. [3] explored the local structure of DNA sequences with the continuous wavelet transform and examined its significance for gene evolution experimentally. Luo et al. [4], [5] studied the fractal structure of the DNA molecule and calculated fractal dimension values for the DNA of several organisms; they found that fractal dimension and evolutionary level are closely related.

In the meantime, some researchers have proposed graphical representations of DNA primary sequences. For example, Hamori and Ruskin [9] and Zhang and Zhang [10] represented a DNA primary sequence as an H-curve and a Z-curve, respectively. Gates [11], Leong and Morgenthaler [12] and Nandy [13], [14], [15], [16], [17] considered several 2-D graphical representations of DNA. Extensions to 3-D and 4-D representations have been made by Randić et al. [18], [19] and others [20]. These methods provide a simple way of viewing, sorting and comparing gene structures, and of analyzing the similarity between DNA sequences.

Motivated by these works, and especially by the DNA walk of Peng et al. [1], which uses the purine/pyrimidine classification of the four bases, we also consider two further classifications: amino group versus keto group, and weak versus strong hydrogen bond. These give three mappings from the four bases {A, C, G, T} of a DNA sequence to the number set {+1, −1}. Using each of the three mappings, we convert a DNA sequence into two random numeric sequences and then select some numerical characterizations of these random sequences as new invariants for DNA primary sequences. Using these invariants, we compare the similarities and dissimilarities among the primary sequences of exon 1 of the β-globin genes of nine species.
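As a minimal sketch of the three base-to-{+1, −1} mappings described above, the following Python fragment converts a sequence into a step sequence and its cumulative walk. The sign convention (which class receives +1), the weak/strong grouping W = {A, T}, S = {C, G}, and the names `MAPPINGS`, `to_steps` and `to_walk` are assumptions made for illustration; this excerpt does not give the paper's exact definitions of {Ym} and {Xn}, so the code should not be read as the authors' construction.

```python
from itertools import accumulate

# Three two-class partitions of the four bases.
# ASSUMPTION: which class maps to +1 is chosen arbitrarily here.
MAPPINGS = {
    "RY": {"A": +1, "G": +1, "C": -1, "T": -1},  # purine vs pyrimidine
    "MK": {"A": +1, "C": +1, "G": -1, "T": -1},  # amino vs keto
    "WS": {"A": +1, "T": +1, "C": -1, "G": -1},  # weak vs strong hydrogen bond
}


def to_steps(seq, mapping):
    """Map each base of seq to +1 or -1 under the named partition."""
    table = MAPPINGS[mapping]
    return [table[base] for base in seq.upper()]


def to_walk(steps):
    """Return the running sum of the steps (the positions of the walk)."""
    return list(accumulate(steps))


if __name__ == "__main__":
    steps = to_steps("ATGGTGCAC", "RY")  # hypothetical short fragment
    print(steps)
    print(to_walk(steps))
```

Plotting the walk position against the base index gives one simple 2-D picture of the sequence for each of the three mappings.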

Section snippets

Random walk and primary sequences

A DNA sequence can be identified with a word over an alphabet N = {A, C, G, T}. We can consider a DNA primary sequence to be a finite totally ordered set with n elements, denoted as [t] = {1, 2, …, t}. As in Ref. [10], in DNA primary sequences, the four bases A, C, G, T can be divided into classes according to their chemical structure, i.e., purine R = {A, G} and pyrimidine Y = {C, T}; amino group M = {A, C} and keto group K = {G, T}; according to the strength of the hydrogen bond, the bases can still be classified

Numerical characterization of DNA primary sequence

Because a DNA sequence may be regarded as information composed of four parts (A, C, G, T), we may regard the random numerical sequence $Y_m^u$ as composed of two parts (−1, +1). Since these two parts appear randomly (Markov characteristic), we use the average information content to characterize $Y_m^u$, and hence the DNA sequence. Namely, we may quantitatively investigate the DNA sequence by the information entropy as follows:

$$H(p) = -\sum_i p_i \log p_i \quad \text{or} \quad H(p) = -\frac{1}{L} \sum_i p_i \log p_i, \quad i \in \{-1, 1\},$$

where $p_i$ is the transition
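The first form of the entropy above can be sketched numerically as follows, assuming that the p_i are empirical 1-step transition probabilities estimated from a ±1 step sequence. The base-2 logarithm, the estimation by pair counting, and the names `transition_probabilities` and `entropy` are assumptions for illustration; the normalized form with the factor 1/L is not implemented because L is not defined in this excerpt.

```python
from collections import Counter
from math import log2


def transition_probabilities(steps):
    """Estimate P(next = b | current = a) for a, b in {-1, +1} by counting pairs."""
    pair_counts = Counter(zip(steps, steps[1:]))
    from_counts = Counter(steps[:-1])  # how often each state has a successor
    return {
        (a, b): pair_counts[(a, b)] / from_counts[a]
        for a in (-1, 1)
        for b in (-1, 1)
        if from_counts[a]  # skip states that never occur as a predecessor
    }


def entropy(probs):
    """Shannon entropy -sum(p * log2(p)); zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    steps = [1, 1, -1, 1, -1, -1]  # hypothetical step sequence
    P = transition_probabilities(steps)
    for state in (-1, 1):
        row = [P[(state, nxt)] for nxt in (-1, 1)]
        print(state, row, entropy(row))
```

Each row of the estimated transition matrix is a probability distribution over the next step, so one entropy value per starting state is obtained.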

Similarities and dissimilarities

Using the 1-step transition probabilities, transition entropies and mean square deviations of the random sequences as invariants of DNA primary sequences, we compare the similarities and dissimilarities of nine exon-1 genes in this section. For each of the nine exon-1 genes, we first construct a 12-component vector and two 3-component vectors whose components are, respectively, the probabilities, transition entropies and mean square deviations given in Tables 2, 3 and 4. Through comparisons

Concluding remarks

Description, comparison and similarity analysis of DNA sequences remain important subjects in bioinformatics. DNA sequence databases have accumulated a great deal of data reflecting billions of years of biological evolution. Consequently, novel concepts and methods are needed to reveal the biological functions encoded in DNA sequences and to investigate the relationships of DNA sequences with biological evolution, cellular function, genetic mechanisms and the occurrence of disease. In this paper, with description of DNA

References (25)

  • A.A. Tsonis et al., Wavelet analysis of DNA sequences, Phys. Rev. E (1996)
  • L.F. Luo et al., Fractal dimension of nucleic acid and its relation to evolutionary level, Chem. Phys. Lett. (1988)

This work is supported by the National Natural Science Foundation of China (10571019).
