Poster Abstract

Finding Genes by Hidden Markov Models with Protein Motifs

Kiyoshi Asai (1), Yutaka Ueno (1), Katunobu Itou (1) and Tetsushi Yada (2)

(1) {asai ueno kito}@etl.go.jp, Electrotechnical Laboratories, 1-1-4 Umezono, Tsukuba 305, Japan

(2) yada@tokyo.jst.go.jp, Japan Science and Technology Corporation, 5-3 Yonbancho, Chiyoda-ku, Tokyo 102 Japan

We introduce GeneDecoder, a gene-finding system for eukaryotes based on hidden Markov models (HMMs) combined with a stochastic grammar and a dictionary that includes protein motifs.

The structure of eukaryotic genes is expressed by a stochastic grammar and a dictionary whose components are HMMs. The HMMs represent nucleic acid bases, codons, and amino acids. The genetic words in the dictionary are described as sequences of these HMMs and represent exons, introns, intergenic regions, protein motifs, and signals in DNA sequences. The statistical relationships among these components are expressed by the grammar, which is a stochastic network of the genetic words. The recognition process is the same as stochastic parsing of speech with a grammar defined over words, and it uses word-level pruning and N-best parsing techniques.
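The parsing step can be illustrated on a much smaller problem. The following is a minimal sketch of Viterbi decoding over a two-state HMM, assuming hypothetical states (`coding`, `noncoding`) and made-up probabilities. It is not GeneDecoder's grammar, dictionary, or parameters, and it omits word-level pruning and N-best parsing.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for `obs` under a discrete HMM."""
    log = math.log
    # score[s]: best log-probability of any path that ends in state s
    score = {s: log(start[s]) + log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor of s at this position
            best = max(states, key=lambda p: prev[p] + log(trans[p][s]))
            score[s] = prev[best] + log(trans[best][s]) + log(emit[s][symbol])
            pointers[s] = best
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy parameters: assumptions for illustration only, not taken from the system
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

path = viterbi("AAAAAGGGGG", STATES, START, TRANS, EMIT)
```

With these toy numbers the decoded path labels the A-rich prefix `noncoding` and the G-rich suffix `coding`. A word-level parser applies the same dynamic programming over a network of word models rather than over single states.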

The protein motifs are used during the parsing of the DNA sequences. We extracted 1149 motif entries from PROSITE release 13.0 and, using a score based on the specificity of the patterns, selected 933 motif patterns as genetic words in the dictionary. Regions that match protein sequence motifs are highly likely to lie in an exon. At the same time, stochastic features of donor/acceptor sites, di-codon statistics, and other important features are integrated into stochastic scores during the parsing. As a result, while the system parses DNA sequences and finds the exon/intron structures, the protein motifs are automatically annotated in those regions. This helps to identify the functions of the genes and reduces the cost of a homology search for each hypothetical coding region.
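As a rough illustration of what matching a protein motif against DNA involves, the sketch below translates the three forward reading frames and searches each for a PROSITE-style pattern converted to a regular expression. It is a standalone simplification, not the HMM-based motif words described above. The function names are hypothetical, the converter handles only the common pattern elements, and the example pattern (the P-loop pattern, PROSITE PS00017) is used only because it is familiar; the abstract does not say which patterns were selected.

```python
import re

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons of `dna`; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to a regular expression.

    Handles x, [..], {..}, (n), (n,m), < and > only.
    """
    regex = pattern.rstrip(".").replace("-", "").replace("x", ".")
    regex = regex.replace("{", "[^").replace("}", "]")
    regex = regex.replace("(", "{").replace(")", "}")
    return regex.replace("<", "^").replace(">", "$")


def scan_frames(dna, pattern):
    """Return (frame, dna_start, dna_end, peptide) for each motif hit."""
    motif = re.compile(prosite_to_regex(pattern))
    hits = []
    for frame in range(3):
        protein = translate(dna[frame:])
        for m in motif.finditer(protein):
            start = frame + 3 * m.start()
            end = frame + 3 * m.end()
            hits.append((frame, start, end, m.group()))
    return hits


# Hypothetical input: one leading base, then codons for G-P-S-G-S-G-K-S
dna = "C" + "GGTCCTTCTGGTTCTGGTAAATCT"
hits = scan_frames(dna, "[AG]-x(4)-G-K-[ST]")
```

Here the only hit falls in the second forward frame, which is the kind of evidence that would raise the score of an exon hypothesis in that frame.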

Reading frames are handled by three types of exon fragments at the donor/acceptor sites. The intron model behaves differently in three contexts, depending on how many bases of the incomplete codon lie in the exon fragment at the donor site. Splice site statistics are distributed between the donor/acceptor fragment models of the exons and the fixed-length parts of the intron models, much like weight matrix models of splice sites.
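The three intron contexts correspond to how many bases of a codon (0, 1, or 2) precede the donor site. A minimal sketch of that bookkeeping follows, assuming exons are given as half-open `(start, end)` coordinates on the forward strand; the function names and the toy sequence are hypothetical and are not part of GeneDecoder.

```python
def intron_phases(exons):
    """Return, for each intron, how many bases of an incomplete codon
    precede it (0, 1, or 2).

    `exons` lists the coding parts as half-open (start, end) intervals
    in transcript order.
    """
    phases = []
    coding_length = 0
    for start, end in exons[:-1]:  # one intron follows every exon but the last
        coding_length += end - start
        phases.append(coding_length % 3)
    return phases


def spliced_codons(dna, exons):
    """Join the exons and split the result into complete codons."""
    coding = "".join(dna[start:end] for start, end in exons)
    return [coding[i:i + 3] for i in range(0, len(coding) - 2, 3)]


# Toy gene: exons in upper case, introns (gt...ag) in lower case
dna = "ATGG" + "gtaaag" + "CTAAA" + "gtccag" + "TGA"
exons = [(0, 4), (10, 15), (21, 24)]

phases = intron_phases(exons)        # one phase per intron
codons = spliced_codons(dna, exons)  # reading frame restored after splicing
```

The first intron interrupts a codon after one base and the second falls between codons, so an intron model that tracks these contexts can hand the correct frame to the next exon.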

We evaluated the system on the test set of 570 vertebrate sequences (Burset and Guigo, 1996) and achieved 87%/82% sensitivity/specificity at the base level and 62%/52% at the exon level; 123 genes were correctly predicted with their complete exon/intron structures. Of the 241 occurrences of 97 motifs in the data, 167 were correctly annotated by GeneDecoder. The remaining 74 occurrences were missed because those motifs are not completely contained in the exons. We also tested the system on the non-redundant sets of human genes (Kulp and Reese, 1996) and achieved 87%/72% sensitivity/specificity at the base level and 45%/52% at the exon level. These results show that the method finds and annotates protein motifs in the exons of eukaryotes with reasonable accuracy.
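For readers unfamiliar with these measures, the sketch below computes them on a toy example. It assumes the convention common in gene-prediction benchmarks, where sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP), and where an exon counts as correct only if both boundaries match exactly. The abstract does not spell out its definitions, so this is an assumption, and the coordinates are invented.

```python
def base_level(true_exons, predicted_exons):
    """Return (sensitivity, specificity) over individual coding bases."""
    true_bases = {p for s, e in true_exons for p in range(s, e)}
    pred_bases = {p for s, e in predicted_exons for p in range(s, e)}
    tp = len(true_bases & pred_bases)
    return tp / len(true_bases), tp / len(pred_bases)


def exon_level(true_exons, predicted_exons):
    """Return (sensitivity, specificity) counting exactly matching exons."""
    correct = len(set(true_exons) & set(predicted_exons))
    return correct / len(true_exons), correct / len(predicted_exons)


# Toy annotation and prediction as half-open (start, end) intervals
true_exons = [(0, 10), (20, 30)]
predicted_exons = [(0, 10), (22, 30), (40, 45)]

base_sn, base_sp = base_level(true_exons, predicted_exons)
exon_sn, exon_sp = exon_level(true_exons, predicted_exons)
```

In this example the second predicted exon overlaps the true one but misses its start, so it counts toward the base-level scores and not toward the exon-level scores. This is why exon-level figures are typically lower than base-level figures. Both functions assume non-empty inputs.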