Toward the Integration of Gene Identification and Its Functional Prediction in Microbial Genomes

Tetsushi Yada, Yasushi Totoki (1), Reiko Tanaka (1), Takahiro Ishii (2), and Kenta Nakai (2)

Japan Science and Technology Corporation (JST), 5-3 Yonbancho, Chiyoda-ku, Tokyo 102, Japan

(1) Information and Mathematical Science Laboratory, Inc., 2-43-1 Ikebukuro, Toshima-ku, Tokyo 171, Japan

(2) Institute for Molecular and Cellular Biology, Osaka University, 1-3 Yamada-oka, Suita 565, Japan

Interest in integrating gene identification and promoter recognition tools has increased since Bucher et al. suggested that gene function might be inferred from transcriptional context [1]. We have performed two independent studies, each applying a hidden Markov model (HMM): one on gene identification in the Synechocystis genome [2], and the other on the prediction of gene function in the Bacillus subtilis genome [3].

The former HMM, which represents start codon frequencies and the di-codon statistics of the sequence that follows, can identify ORFs in given sequences. In our cross-validation test, it achieved 93.4% sensitivity (Sn) and 99.4% specificity (Sp), compared with 90.2% and 98.8%, respectively, for GeneMark. Moreover, integrating information on Shine-Dalgarno sequences enabled us to identify translational initiation sites with 83.3% accuracy. Further investigation revealed that (1) the HMM is well suited to detecting short ORFs (< 150 bp), with an Sn of 47.4% versus 21.1% for GeneMark, and (2) another model whose parameters are optimized for exogenous genes is desirable for more reliable prediction, because 54.1% of false-negative ORFs were transposase genes.
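As a rough illustration of how start codon frequencies and di-codon statistics can be combined into a single ORF score, the following Python sketch sums the log-probability of the start codon and the log-probabilities of each consecutive codon pair. It is a simplified first-order Markov chain over codons, not the HMM of [2]; the function names, the `floor` pseudo-probability, and all numeric parameters are hypothetical toy values chosen for the example.

```python
import math

def split_codons(orf):
    """Split an in-frame ORF into consecutive 3-base codons."""
    if len(orf) == 0 or len(orf) % 3 != 0:
        raise ValueError("ORF length must be a non-zero multiple of 3")
    return [orf[i:i + 3] for i in range(0, len(orf), 3)]

def orf_log_score(orf, start_prob, dicodon_prob, floor=1e-6):
    """Log-score = log P(start codon) + sum of log P(codon pair).

    `floor` is an assumed pseudo-probability for unseen events.
    """
    codons = split_codons(orf.upper())
    score = math.log(start_prob.get(codons[0], floor))
    for prev, cur in zip(codons, codons[1:]):
        score += math.log(dicodon_prob.get((prev, cur), floor))
    return score

# Toy parameters (hypothetical, not estimated from any genome).
START_PROB = {"ATG": 0.8, "GTG": 0.15, "TTG": 0.05}
DICODON_PROB = {
    ("ATG", "AAA"): 0.5,
    ("AAA", "GCT"): 0.25,
    ("GCT", "TAA"): 0.1,
}

example_score = orf_log_score("ATGAAAGCTTAA", START_PROB, DICODON_PROB)
```

In practice such a score would be compared against a background (non-coding) model, and the parameters would be estimated from annotated genes; both steps are omitted here.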

The latter HMM represents eight kinds of consensus patterns, each corresponding to the binding sites of a sigma factor, and can predict the sigma factor dependence of a given promoter. In bacterial cells, gene expression is regulated by multiple sigma factors, each with its own promoter specificity, according to cellular conditions (stress, starvation, etc.). Thus, if we can determine which sigma factor binds to a given promoter, we can predict the conditions under which the gene will be expressed. In our cross-validation test, the HMM showed a prediction accuracy of 75.5%. Moreover, in an open test on 1415 candidate promoters, the predictions appeared highly plausible when compared with the corresponding gene annotations. These results indicate that inferring gene function from regulatory sequences is feasible in the Bacillus subtilis genome.
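The idea of assigning a promoter to the sigma factor whose consensus pattern it matches best can be sketched as follows. This Python example replaces the HMM with plain consensus matching: each candidate pattern is a pair of upstream and downstream boxes separated by a variable spacer, and the promoter is assigned to the pattern with the most matching positions. The labels `factor_1` and `factor_2`, the box sequences, and the spacer range are illustrative assumptions, not the eight patterns or parameters of the model in [3].

```python
def match_count(window, consensus):
    """Number of positions where the window agrees with the consensus."""
    return sum(1 for a, b in zip(window, consensus) if a == b)

def best_promoter_score(seq, upstream, downstream, spacers=range(16, 19)):
    """Best combined match over all placements and assumed spacer lengths."""
    best = 0
    for i in range(len(seq) - len(upstream) + 1):
        up = match_count(seq[i:i + len(upstream)], upstream)
        for gap in spacers:
            j = i + len(upstream) + gap
            if j + len(downstream) > len(seq):
                continue  # downstream box would run past the sequence end
            down = match_count(seq[j:j + len(downstream)], downstream)
            best = max(best, up + down)
    return best

def predict_sigma(seq, patterns):
    """Return the best-matching pattern label and all pattern scores."""
    seq = seq.upper()
    scores = {
        name: best_promoter_score(seq, up, down)
        for name, (up, down) in patterns.items()
    }
    return max(scores, key=scores.get), scores

# Toy consensus patterns (hypothetical labels and sequences).
TOY_PATTERNS = {
    "factor_1": ("TTGACA", "TATAAT"),
    "factor_2": ("GGCACG", "CCGATA"),
}

promoter = "CC" + "TTGACA" + "A" * 17 + "TATAAT" + "GG"
label, scores = predict_sigma(promoter, TOY_PATTERNS)
```

A probabilistic model such as an HMM generalizes this by weighting each position and spacer length with learned probabilities rather than counting exact matches.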

We have designed an HMM that integrates the above two models and are applying it to the analysis of the Bacillus subtilis genome. The HMM can simultaneously predict gene locations and their functions in given DNA sequences. The suggestion by Bucher et al. is thus on the point of being realized in microbial genomes.

References

[1] Bucher, P., Fickett, J. W. and Hatzigeorgiou, A.: Computational analysis of transcriptional regulatory elements: a field in flux, Comput. Appl. Biosci., Vol. 12, pp. 361-362 (1996).

[2] Yada, T. and Hirosawa, M.: Gene Recognition in Cyanobacterium Genomic Sequence Data Using the Hidden Markov Model, Proc. of the 4th Int. Conf. on Intelligent Systems for Molecular Biology, Menlo Park, Calif., AAAI Press, pp. 252-260 (1996).

[3] Yada, T., Totoki, Y., Ishii, T. and Nakai, K.: Functional prediction of bacillus genes using hidden Markov model, Proc. of the 5th Int. Conf. on Intelligent Systems for Molecular Biology, Menlo Park, Calif., AAAI Press, pp. 354-357 (1997).