Eukaryotic GeneMark.hmm model description

Notation for GeneMark.hmm models:

ES-3.0: E - eukaryotic; S - self-training; 3.0 - the version.

E-3.0: the supervised (not self-training) version 3.0.

 

Supervised model parameterization for GeneMark.hmm E-3.0.

The statistical model employed in the GeneMark.hmm algorithm is a hidden Markov model with duration (1), also known as a hidden semi-Markov model (HSMM). Hidden states for initial, internal and terminal exons, introns, intergenic regions and single-exon genes form the HSMM architecture (2). The architecture also includes hidden states for the start site (initiation site), the stop site (termination site), and the donor and acceptor splice sites. In what follows, we refer to these hidden states as site states.
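The state inventory above can be written down as a small transition table. The sketch below is an assumption for illustration only: it covers a single strand, ignores exon phase, and the state names and allowed transitions are chosen here rather than taken from the GeneMark.hmm implementation.

```python
# Hypothetical, simplified single-strand topology; names are illustrative.
SITE_STATES = {"start", "stop", "donor", "acceptor"}

TRANSITIONS = {
    "intergenic": {"start"},
    "start": {"initial_exon", "single_exon_gene"},
    "initial_exon": {"donor"},
    "donor": {"intron"},
    "intron": {"acceptor"},
    "acceptor": {"internal_exon", "terminal_exon"},
    "internal_exon": {"donor"},
    "terminal_exon": {"stop"},
    "single_exon_gene": {"stop"},
    "stop": {"intergenic"},
}


def is_valid_parse(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS.get(a, ()) for a, b in zip(states, states[1:]))
```

For example, `is_valid_parse(["intergenic", "start", "single_exon_gene", "stop", "intergenic"])` holds under this table, while a parse that jumps from an intron directly to an exon without an acceptor site does not.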

The site states emit nucleotide sequences of fixed length modeled by positional (inhomogeneous) Markov chains (3,4).
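As a minimal sketch of such a positional model, the code below estimates a first-order chain whose transition table differs at every position of an aligned, fixed-length site. The chain order, the pseudocount, and the function names `train_site_model` and `site_log_prob` are assumptions made for illustration; the text above does not specify them.

```python
import math


def train_site_model(sites, alphabet="ACGT", pseudocount=1.0):
    """Estimate a first-order positional Markov chain from aligned sites.

    Returns (initial, transitions), where initial[b] is P(b at position 0)
    and transitions[i][a][b] is P(b at position i+1 | a at position i).
    Sequences are assumed to contain only letters from `alphabet`.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all site sequences must have the same length")
    n = len(alphabet)
    initial = {
        b: (sum(s[0] == b for s in sites) + pseudocount)
        / (len(sites) + pseudocount * n)
        for b in alphabet
    }
    transitions = []
    for i in range(length - 1):
        # A separate count table for every position makes the chain positional.
        counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
        for s in sites:
            counts[s[i]][s[i + 1]] += 1
        transitions.append(
            {a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
             for a in alphabet}
        )
    return initial, transitions


def site_log_prob(seq, model):
    """Log-probability of a candidate site of the trained length."""
    initial, transitions = model
    if len(seq) != len(transitions) + 1:
        raise ValueError("candidate site has the wrong length")
    lp = math.log(initial[seq[0]])
    for i, table in enumerate(transitions):
        lp += math.log(table[seq[i]][seq[i + 1]])
    return lp
```

Trained on a handful of donor-like sites such as `["GTAAG", "GTGAG", "GTAAG"]`, the model assigns a higher log-probability to `"GTAAG"` than to an unrelated string such as `"CCCCC"`.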

The protein-coding states (initial, internal and terminal exons, and single-exon gene) emit nucleotide sequences modeled by fifth-order three-periodic inhomogeneous Markov chains (5-7).
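A three-periodic chain keeps one transition table per codon position. The following is a sketch under stated assumptions: training sequences are in-frame coding sequences starting at codon position 0, contain only A, C, G and T, and a simple pseudocount stands in for whatever smoothing the real program uses. The function names are hypothetical.

```python
import math
from itertools import product


def train_three_periodic(cds_list, order=5, pseudocount=1.0, alphabet="ACGT"):
    """Return model[phase][context][base] = P(base | context, codon position).

    `phase` (0, 1 or 2) is the codon position of the emitted base.
    """
    contexts = ["".join(p) for p in product(alphabet, repeat=order)]
    counts = [
        {c: {b: pseudocount for b in alphabet} for c in contexts}
        for _ in range(3)
    ]
    for cds in cds_list:
        for i in range(order, len(cds)):
            counts[i % 3][cds[i - order:i]][cds[i]] += 1
    return [
        {c: {b: row[b] / sum(row.values()) for b in alphabet}
         for c, row in phase.items()}
        for phase in counts
    ]


def coding_log_prob(seq, model, order=5, frame=0):
    """Log-probability of seq[order:] given its first `order` bases.

    `frame` is the codon position of seq[0].
    """
    lp = 0.0
    for i in range(order, len(seq)):
        lp += math.log(model[(i + frame) % 3][seq[i - order:i]][seq[i]])
    return lp
```

Scoring the same fragment with `frame` set to 0, 1 and 2 and comparing the results is one way to see why the periodicity matters: a fragment in the correct frame typically scores higher than the same fragment shifted by one base.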

The non-coding states (intron and intergenic region) emit sequences modeled by homogeneous Markov chains (5-7). The parameters of the intron and intergenic region models are estimated from the set of intron sequences taken in both direct and reverse-complement orientation.
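Estimating a homogeneous chain from both orientations amounts to counting every sequence together with its reverse complement. The sketch below assumes a second-order chain and a pseudocount of 1; neither value is given in the text above, and the function names are illustrative.

```python
from collections import defaultdict


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def train_homogeneous(seqs, order=2, pseudocount=1.0, alphabet="ACGT"):
    """Return model[context][base] = P(base | context), pooled over both strands.

    Contexts that never occur in the training data are absent from the result.
    """
    counts = defaultdict(lambda: {b: pseudocount for b in alphabet})
    for seq in seqs:
        for strand in (seq, reverse_complement(seq)):
            for i in range(order, len(strand)):
                counts[strand[i - order:i]][strand[i]] += 1
    return {
        ctx: {b: row[b] / sum(row.values()) for b in alphabet}
        for ctx, row in counts.items()
    }
```

Because each sequence is counted on both strands, the resulting model is strand-symmetric: the probability of A following A equals the probability of T following T.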

Hidden state durations are derived from the length distributions of the training sequences associated with a particular hidden state (2).

In the absence of a reliable set of intergenic regions, a uniform probability distribution is used for the intergenic state duration.
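The two cases, an empirical length distribution and the uniform fallback, can be sketched as follows. This is an illustration only: the histogram is left unsmoothed, and the `min_samples` threshold used to decide whether a training set counts as reliable is an assumption, not a value from the text.

```python
from collections import Counter


def empirical_duration(lengths, max_len):
    """P(duration = d) for d = 1..max_len, as a normalized histogram."""
    counts = Counter(length for length in lengths if 1 <= length <= max_len)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no training lengths within range")
    return {d: counts[d] / total for d in range(1, max_len + 1)}


def uniform_duration(min_len, max_len):
    """Uniform distribution over durations min_len..max_len."""
    p = 1.0 / (max_len - min_len + 1)
    return {d: p for d in range(min_len, max_len + 1)}


def duration_model(lengths, min_len, max_len, min_samples=50):
    """Use observed lengths when enough are available, else fall back to uniform."""
    if len(lengths) < min_samples:
        return uniform_duration(min_len, max_len)
    return empirical_duration(lengths, max_len)
```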

 

Unsupervised gene finding algorithm GeneMark.hmm ES-3.0 (2)

The algorithm of parallel unsupervised (automatic) training and gene prediction consists of the following steps: i) all parameters of the HSMM with reduced architecture are initialized; ii) GeneMark.hmm E-3.0 is run to parse the genomic sequence into “coding” and “non-coding” regions, and the input sequence is labeled according to this parse; iii) subsets of the uniformly labeled fragments (selected as described below in the training set refinement procedure) are used to re-estimate the parameters of the HSMM. Steps ii) and iii) are repeated until convergence.
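The iteration in steps i) to iii) has the shape of the loop below. It is a skeleton, not the ES-3.0 implementation: the four callables are hypothetical placeholders supplied by the caller, and the stopping rule (an unchanged labeling between two consecutive iterations, or a cap of `max_iter` iterations) is an assumption, since the text only says "until convergence".

```python
def self_train(genome, init_params, predict, select_training_set, estimate,
               max_iter=20):
    """Alternate prediction and parameter re-estimation until labels stabilize.

    init_params()                        -> initial model parameters (step i)
    predict(genome, params)              -> labeling of the genome (step ii)
    select_training_set(genome, labels)  -> refined training fragments
    estimate(training_set)               -> re-estimated parameters (step iii)

    Returns (params, labels, number_of_iterations_run).
    """
    params = init_params()
    previous = None
    labels = None
    for iteration in range(1, max_iter + 1):
        labels = predict(genome, params)
        if labels == previous:
            # The parse did not change, so re-estimation would change nothing.
            return params, labels, iteration
        training_set = select_training_set(genome, labels)
        params = estimate(training_set)
        previous = labels
    return params, labels, max_iter
```

Passing the steps in as functions keeps the control flow separate from the model details, so the same loop can drive any predictor and estimator pair that follows these signatures.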

 

1. Rabiner, L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257-286.

2. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. and Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res.

3. Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res, 12, 505-519.

4. Zhang, M.Q. and Marr, T.G. (1993) A weight array method for splicing signal analysis. Comput Appl Biosci, 9, 499-509.

5. Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol, 268, 78-94.

6. Borodovsky, M.Y., Sprizhitskii, Y.A., Golovanov, E.I. and Aleksandrov, A.A. (1986) Statistical patterns in primary structures of functional regions in E. coli genome: III. Computer recognition of coding regions. Mol. Biol., 20, 1145-1150.

7. Borodovsky, M.Y. and McIninch, J.D. (1993) GeneMark: parallel gene recognition for both DNA strands. Comput. Chem., 17, 123-133.

8. Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, New York.