Searching DNA databases for similarities to DNA sequences: when is a match significant?

Isobel Anderson and Andy Brass

School of Biological Sciences, University of Manchester, 2.205 Stopford Building, Oxford Road, Manchester M13 9PT, England.

Searching DNA sequences against a DNA database is an essential element of sequence analysis, especially in cases where a coding region cannot be identified, yet the biological significance of such search results is not well understood. To tackle this problem we have constructed a test set of artificially evolved DNA sequences, which has been used to test various database searching algorithms (BLAST, Smith-Waterman, FASTA). We have looked at which methods are best suited to DNA database searching and determined how sensitive we can realistically expect them to be. ROC (Relative Operating Characteristic) curve analysis allowed the discriminatory power of the search to be investigated (Swets and Pickett, 1982; Shah and Hunter, 1997). A set of guidelines by which to assess the significance of DNA database search results has been produced. We have also found that the 'twilight zone' of sequence similarity for DNA sequences occurs when the PAM distance between the corresponding aligned protein sequences is about 130. This corresponds to approximately 35% amino acid identity between two protein sequences (Dayhoff et al., 1978). Searching a DNA sequence against a DNA database is therefore sensitive enough to find other DNA sequences that code for structurally and functionally similar proteins.
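The ROC analysis mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `roc_points` and `roc_area` and the example scores are hypothetical. The sketch assumes that each database hit carries a search score (higher meaning a better match) and a flag saying whether the hit is a true relative of the query, which is known in advance when the test sequences are artificially evolved.

```python
def roc_points(scores, is_true_hit):
    """Return (false positive rate, true positive rate) points
    obtained by lowering the score threshold one distinct score at a time."""
    positives = sum(1 for flag in is_true_hit if flag)
    negatives = len(is_true_hit) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true hit and one false hit")

    # Rank hits from best score to worst.
    ranked = sorted(zip(scores, is_true_hit), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Hits with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical search results: four hits, two of them true relatives.
example_scores = [90, 80, 70, 60]
example_truth = [True, False, True, False]
example_area = roc_area(roc_points(example_scores, example_truth))
```

An area of 1.0 means every true relative outscores every unrelated sequence, while 0.5 is what a search with no discriminatory power would give on average.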

References

Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A Model of Evolutionary Change in Proteins. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure, Volume 5, Suppl. 3, National Biomedical Research Foundation, Washington, DC, pp. 345-352.

Shah,I. and Hunter,L. (1997) Predicting Enzyme Function from Sequence: A Systematic Appraisal. In Gaasterland,T., Karp,P., Karplus,K., Ouzounis,C., Sander,C. and Valencia,A. (eds), Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, pp. 276-283.

Swets,J. and Pickett,R.M. (1982) Measuring the Accuracy of Diagnostic Systems. Academic Press, New York.