Eukaryotic GeneMark.hmm model description
Notation for GeneMark.hmm models: ES-3.0 (E - eukaryotic; S - self-training;
3.0 - the version). Without S, i.e. E-3.0, the notation refers to the
supervised version 3.0.
Supervised model parameterization for GeneMark.hmm E-3.0
The statistical model employed in the GeneMark.hmm algorithm is a hidden Markov
model with duration (1) or a hidden semi-Markov model (HSMM). Hidden states for
initial, internal and terminal exons, introns, intergenic regions and single
exon genes form the HSMM architecture (2). It also includes hidden states for
start site (initiation site), stop site (termination site), and donor and
acceptor splice sites. In what follows, we refer to such hidden states as site
states.
The site states emit nucleotide sequences of fixed length modeled by positional
(inhomogeneous) Markov chains (3,4).
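As an illustration of a positional Markov chain for a site state, the sketch below estimates a first-order chain in which every position of a fixed-length site has its own transition table. The order, the pseudocount, and the names `train_positional_chain` and `site_log_prob` are assumptions made for this example; they are not taken from GeneMark.hmm.

```python
import math

ALPHABET = "ACGT"

def train_positional_chain(sites, pseudocount=1.0):
    """Estimate a first-order positional (inhomogeneous) Markov chain.

    Returns (initial, transitions) where initial[b] = P(b at position 0) and
    transitions[i][a][b] = P(b at position i+1 | a at position i).
    All sites must be uppercase ACGT strings of the same length.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all site sequences must have the same length")

    # Distribution of the first nucleotide of the site.
    initial = {b: pseudocount for b in ALPHABET}
    for s in sites:
        initial[s[0]] += 1
    total = sum(initial.values())
    initial = {b: c / total for b, c in initial.items()}

    # One transition table per position: this is what makes the chain positional.
    transitions = []
    for i in range(length - 1):
        counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
        for s in sites:
            counts[s[i]][s[i + 1]] += 1
        table = {}
        for a in ALPHABET:
            row_total = sum(counts[a].values())
            table[a] = {b: counts[a][b] / row_total for b in ALPHABET}
        transitions.append(table)
    return initial, transitions

def site_log_prob(seq, initial, transitions):
    """Log-probability that the site model emits seq (same length as training sites)."""
    lp = math.log(initial[seq[0]])
    for i in range(len(seq) - 1):
        lp += math.log(transitions[i][seq[i]][seq[i + 1]])
    return lp

# Toy aligned sites (made-up data, not real splice sites).
toy_sites = ["AGGTA", "AGGTG", "CGGTA", "AGGTA"]
initial, transitions = train_positional_chain(toy_sites)
```

A sequence resembling the training sites, such as `AGGTA`, receives a higher log-probability from `site_log_prob` than an unrelated sequence of the same length.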
The protein-coding states (initial, internal, terminal exons and single exon
gene) emit nucleotide sequences modeled by fifth-order three-periodic
inhomogeneous Markov chains (5-7).
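A three-periodic Markov chain keeps a separate transition table for each of the three codon positions. The sketch below shows one way to count and score such a model; the pseudocount, the requirement that training sequences start in frame, and the function names are assumptions of this example rather than details of GeneMark.hmm. The order defaults to 5 to match the text, while the toy call uses order 2 so that the counts are easy to check by hand.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_three_periodic(coding_seqs, order=5, pseudocount=1.0):
    """Count (context -> next base) separately for each codon position.

    Sequences are assumed to be uppercase ACGT and to start at codon
    position 0, so i % 3 is the codon position of nucleotide i.
    """
    counts = [defaultdict(lambda: dict.fromkeys(BASES, pseudocount))
              for _ in range(3)]
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1
    return counts

def cond_prob(counts, phase, context, base):
    """P(base | context) at the given codon position (phase 0, 1 or 2)."""
    row = counts[phase][context]  # unseen contexts fall back to uniform
    return row[base] / sum(row.values())

def coding_log_prob(counts, seq, order=5):
    """Log-probability of an in-frame sequence, skipping the first `order` bases."""
    return sum(math.log(cond_prob(counts, i % 3, seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

# Toy in-frame training sequence (made-up data).
toy_counts = train_three_periodic(["ATG" + "GCT" * 5 + "TAA"], order=2)
```

In the toy model, the context `GC` is followed by `T` only at the third codon position, so the same context yields different probabilities in different phases.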
The non-coding states (intron and intergenic region) emit sequences modeled by
homogeneous Markov chains (5-7). The parameters of the intron and intergenic
region models are estimated from the set of intron sequences taken in both the
direct and the reverse complement orientation.
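A homogeneous chain uses a single transition table for all positions. The sketch below pools counts from each training sequence and its reverse complement, as the paragraph above describes; the order, the pseudocount, and the function names are assumptions chosen for this example.

```python
from collections import defaultdict

def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def train_homogeneous(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from both strands."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for seq in seqs:
        # Each sequence contributes counts in both orientations.
        for strand in (seq, reverse_complement(seq)):
            for i in range(order, len(strand)):
                counts[strand[i - order:i]][strand[i]] += 1
    return {ctx: {b: c / sum(row.values()) for b, c in row.items()}
            for ctx, row in counts.items()}

# Toy intron-like sequence (made-up data).
toy_model = train_homogeneous(["GTAAGTTTTTCTTTCAG"], order=1)
```

Because both orientations are counted, the resulting model treats a sequence and its reverse complement symmetrically; for example, training on `AAAA` alone gives the same probability for `A` after `A` as for `T` after `T`.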
Hidden state durations are derived from the length distributions of the
training sequences associated with a particular hidden state (2).
In the absence of a reliable set of intergenic regions, a uniform probability
distribution is used for the intergenic state duration.
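The two kinds of duration distribution mentioned above can be sketched as follows. This is a minimal illustration that uses raw length frequencies without smoothing; the length bounds and the function names are assumptions of the example.

```python
from collections import Counter

def empirical_duration(lengths, max_len):
    """P(duration = d) for d = 1..max_len, from observed training lengths."""
    counts = Counter(n for n in lengths if 1 <= n <= max_len)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no training lengths within 1..max_len")
    return [counts.get(d, 0) / total for d in range(1, max_len + 1)]

def uniform_duration(min_len, max_len):
    """Uniform P(duration = d) on min_len..max_len, zero below min_len."""
    p = 1.0 / (max_len - min_len + 1)
    return [p if d >= min_len else 0.0 for d in range(1, max_len + 1)]

# Toy lengths (made-up data): index 0 of each list corresponds to duration 1.
exon_durations = empirical_duration([3, 3, 6], max_len=6)
intergenic_durations = uniform_duration(min_len=2, max_len=5)
```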
Unsupervised gene finding algorithm GeneMark.hmm ES-3.0 (2)
The algorithm of parallel unsupervised (automatic) training and gene
prediction consists of the following steps: i) all parameters of the HSMM with
reduced architecture are initialized; ii) GeneMark.hmm E-3.0 is run to parse
the genomic sequence into “coding” and “non-coding” regions, and the input
genomic sequence is labeled according to this parse; iii) subsets of the
uniformly labeled fragments (selected as described below in the training set
refinement procedure) are used to re-estimate the parameters of the HSMM.
Steps ii) and iii) are repeated until convergence.
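The control flow of steps i) to iii) can be sketched as a generic loop. Everything concrete here is an assumption made for illustration: the stopping rule (the labeling no longer changes), the `max_iter` cap, and the toy stand-ins for prediction, selection and re-estimation, which classify windows by a GC-content threshold and have nothing to do with the actual HSMM.

```python
def self_train(sequence, params, predict, select, reestimate, max_iter=20):
    """Alternate prediction and re-estimation until the labeling is stable.

    `params` is the initialized model (step i). Returns the final parameters,
    the final labels and the number of prediction rounds performed.
    """
    previous = None
    labels = None
    for iteration in range(1, max_iter + 1):
        labels = predict(sequence, params)       # step ii: parse and label
        if labels == previous:                   # assumed convergence test
            return params, labels, iteration
        training = select(sequence, labels)      # training set refinement
        params = reestimate(training, params)    # step iii: re-estimation
        previous = labels
    return params, labels, max_iter

# Toy stand-ins: a "sequence" is a list of window GC fractions and the
# "model" is a single threshold separating coding from non-coding windows.
def toy_predict(windows, threshold):
    return [gc >= threshold for gc in windows]

def toy_select(windows, labels):
    return list(zip(windows, labels))

def toy_reestimate(training, threshold):
    coding = [gc for gc, is_coding in training if is_coding]
    noncoding = [gc for gc, is_coding in training if not is_coding]
    if not coding or not noncoding:
        return threshold  # keep the old value if one class is empty
    return (sum(coding) / len(coding) + sum(noncoding) / len(noncoding)) / 2

toy_windows = [0.30, 0.35, 0.62, 0.58, 0.33, 0.65]
final_threshold, final_labels, rounds = self_train(
    toy_windows, 0.34, toy_predict, toy_select, toy_reestimate)
```

Passing the three stages as functions keeps the loop independent of the model, so the same skeleton applies whether the stages are toy functions or a full gene finder.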
1. Rabiner, L.R. (1989) A tutorial on hidden Markov models and selected
applications in speech recognition. Proc. IEEE, 77, 257-286.
2. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. and Borodovsky, M. Gene
identification in novel eukaryotic genomes by self-training algorithm. Nucleic
Acids Research
3. Staden, R. (1984) Computer methods
to locate signals in nucleic acid sequences. Nucleic Acids Res,
12, 505-519.
4. Zhang, M.Q. and Marr, T.G. (1993) A weight array method for
splicing signal analysis. Comput Appl
Biosci, 9, 499-509.
5. Burge, C. and Karlin, S. (1997)
Prediction of complete gene structures in human genomic DNA. J Mol Biol, 268, 78-94.
6. Borodovsky, M.Y., Sprizhitskii, Y.A., Golovanov, E.I. and Aleksandrov, A.A.
(1986) Statistical patterns in primary structures of functional regions in
E. coli genome: III. Computer recognition of coding regions. Mol. Biol., 20,
1145-1150.
7. Borodovsky,
M.Y. and McIninch, J.D. (1993) GeneMark: parallel gene recognition for both DNA
strands. Comput. Chem., 17, 123-153.
8. Silverman,
B.W. (1986) Density Estimation for
Statistics and Data Analysis. Chapman and Hall,