Poster Abstract

Mutual Information Reveals Species-Independent Statistical Differences Between Coding and Non-coding DNA

Ivo Grosse (1), Hanspeter Herzel (2), Sergey V. Buldyrev (1) and H. Eugene Stanley (1)

(1) Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215, USA

(2) Institute for Theoretical Biology, Humboldt University, Invalidenstr. 43, D-10115 Berlin, Germany

The search for statistical patterns that distinguish coding from non-coding DNA has attracted considerable interest as genome projects turn from mapping to large-scale sequencing. Many such patterns have been found, but none of them seems to be universal across the entire phylogenetic tree. Here we report that the probability distribution functions of the Average Mutual Information (AMI) for coding and non-coding DNA, while significantly different from each other, are almost the same for organisms of all taxonomic classes. To provide evidence for this species independence, we perform an exhaustive comparison of all GenBank data from different taxonomic orders, classes, phyla, and kingdoms.
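As an illustration of the quantity involved, the following minimal sketch estimates the mutual information between nucleotides separated by a distance k, using the standard definition I(k) = sum over pairs (a, b) of p_ab(k) log2[p_ab(k) / (p_a p_b)]. The function names `mutual_information` and `average_mutual_information`, the choice of distances, and the plain arithmetic mean are assumptions made for this example; the abstract does not state how the AMI is averaged, so this should not be read as the authors' exact procedure.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Estimate the mutual information (in bits) between symbols k positions apart."""
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    if n == 0:
        raise ValueError("sequence must be longer than k")
    joint = Counter(pairs)                 # counts of (first, second) symbol pairs
    left = Counter(a for a, _ in pairs)    # marginal counts of the first symbol
    right = Counter(b for _, b in pairs)   # marginal counts of the second symbol
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with counts to limit rounding error
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi


def average_mutual_information(seq, distances):
    """Plain mean of I(k) over the given distances (an assumed averaging scheme)."""
    distances = list(distances)
    return sum(mutual_information(seq, k) for k in distances) / len(distances)
```

For example, `average_mutual_information(window, range(1, 7))` would assign one value to a sequence window; a histogram of such values over many windows would approximate the kind of distribution discussed here. Note that estimates from short windows are biased upward, which this sketch does not correct for.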

To allow for an observer-independent measurement of the species dependence of different statistical quantities, we define the Degree of Species Dependence (DSD) based on the standard chi-squared distance between two histograms. We compute the DSD of the codon usage as well as of the AMI for all available DNA sequences from GenBank. We find that the codon usage exhibits differences between organisms as great as the differences between coding and non-coding DNA, whereas the differences between the AMI distributions of different organisms are only a few percent of the differences between coding and non-coding DNA. Moreover, the AMI is as accurate in predicting coding regions as the most specialized coding measures and can, therefore, be used to identify coding regions in genomes of all animals, plants, or bacteria without prior training.
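The histogram comparison underlying the DSD can be sketched as follows. The abstract does not give the exact formula for the chi-squared distance or for the DSD itself, so this example assumes one common symmetric form, d = (1/2) * sum_i (p_i - q_i)^2 / (p_i + q_i), applied to histograms normalized to unit sum. The function name `chi_squared_distance` is hypothetical.

```python
def chi_squared_distance(hist_a, hist_b):
    """Symmetric chi-squared distance between two histograms with the same bins.

    Both histograms are normalized to unit sum first. With the factor 1/2 the
    result lies in [0, 1]: 0 for identical shapes, 1 for non-overlapping ones.
    This particular form is an assumption, not taken from the abstract.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must not be empty")
    distance = 0.0
    for a, b in zip(hist_a, hist_b):
        p, q = a / total_a, b / total_b
        if p + q > 0:  # bins that are empty in both histograms contribute nothing
            distance += (p - q) ** 2 / (p + q)
    return 0.5 * distance
```

Under this assumed form, comparing the distance between histograms of two organisms with the distance between coding and non-coding histograms gives a ratio in the spirit of the "few percent" statement above, although the precise DSD definition may differ.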

The capacity of the AMI to distinguish coding from non-coding DNA irrespective of its phylogenetic origin may become of practical importance as numerous new genomes are sequenced for which no training sets exist.