A Scale Space Approach for Biological Sequence Comparison

Gautam B. Singh

Algorithms Research Division, National Center for Genome Resources, Santa Fe, NM 87505

It is well known that sequence similarity methods often fail to detect homology even when the sequences are related at the functional and structural level. For example, in spite of the strong structural similarity between sperm whale myoglobin and human hemoglobin, the sequence-level homology is poor. This implies that an alternative strategy is needed for sequence comparison, one that uses the overall information content as the basis for comparison rather than searching for statistically significant stretches of matching subsequences.

One such holistic approach is based on comparing sequences represented in a scale-space. The N-word frequency profile of a sequence is built from the frequencies of all possible N-mers found in it. The size of each profile in this family is independent of the sequence length and depends only upon the scale of comparison. For example, regardless of the sequence length, a 4-mer scale results in a space of 4^4 = 256 frequency values, a 5-mer scale in a space of 4^5 = 1024 frequency values, and so on. The correspondence between frequency profiles is subsequently used as the basis for establishing functional similarity between a given set of sequences. This is in contrast to the use of a single region of strong local match as a basis for homology.
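The profile construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `kmer_profile` and `profile_correlation` are hypothetical, counts are normalized to frequencies, and Pearson correlation is used as a stand-in for the "correspondence" between profiles, since the abstract does not say which measure was used.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k, alphabet="ACGT"):
    """Return the frequencies of all len(alphabet)**k words, in lexicographic order.

    The output length depends only on k and the alphabet, not on len(seq).
    """
    seq = seq.upper()
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing other symbols, e.g. N
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def profile_correlation(p, q):
    """Pearson correlation between two equal-length profiles.

    Assumption: the source does not specify its similarity measure.
    """
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0
    return cov / sqrt(var_p * var_q)
```

Calling `kmer_profile(seq, 4)` on sequences of any length yields vectors of 256 values, so two sequences of very different sizes can be compared directly with `profile_correlation`.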

The method is applied to several DNA sequence comparisons, including 3-mer, 4-mer and 5-mer scale-space comparisons of the alpha- and beta-globin gene clusters (sizes 235 kb and 79 kb, respectively). The result of applying it to cluster the envelope protein DNA sequences from various strains of the Human Immunodeficiency Virus (HIV) is presented. An adaptation of the scale-space methodology for the comparison of protein sequences is described, in which the polarity and the charge of each amino acid are used to reduce the 20-character alphabet to a 3-character alphabet. The comparison of different protein families using their transformed scale-space representation is presented.
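The alphabet reduction for proteins might look like the following Python sketch. The grouping shown here (nonpolar, polar uncharged, charged) is an assumption chosen for illustration; the abstract does not list the actual classes, and the placement of residues such as glycine, cysteine and histidine varies between classification schemes. The names `GROUPS`, `reduce_protein` and `reduced_profile` are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumed grouping, NOT taken from the source:
#   H = nonpolar, P = polar uncharged, C = charged
GROUPS = {
    "H": "AVLIMFWPG",
    "P": "STNQYC",
    "C": "DEKRH",
}
REDUCE = {aa: code for code, members in GROUPS.items() for aa in members}


def reduce_protein(seq):
    """Map a protein sequence onto the 3-letter alphabet, dropping unknown symbols."""
    return "".join(REDUCE[aa] for aa in seq.upper() if aa in REDUCE)


def reduced_profile(seq, k):
    """Frequencies of all 3**k words of the reduced sequence, in lexicographic order."""
    reduced = reduce_protein(seq)
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    total = sum(counts.values())
    words = ["".join(p) for p in product("CHP", repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]
```

With a 3-character alphabet a k-mer scale gives 3^k values (27 for k = 3), compared with 20^k for the unreduced alphabet, which keeps the profiles small enough to populate from typical protein lengths.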