Abstract

Integrating multiple evidence to predict and annotate genes in genomic sequences

Roderic Guigo

As the Human Genome Project enters the large-scale sequencing phase, computational gene identification methods are becoming essential for the automatic analysis and annotation of large uncharacterized genomic sequences. Substantial progress has been made in recent years in the field of computational gene identification, and when the location of the genes in a genomic sequence is approximately known, computer programs exist that can predict the exon/intron boundaries with high accuracy. However, currently available programs are still unable to cope successfully with anonymous sequences a few megabases long containing an unknown number of genes, which are the sequences typically produced in the large Genome Centers. Moreover, finding the genes and deciphering gene structure is only the first step towards the automatic annotation of genomic sequences; attaching relevant functional information to the predicted genes is also essential. Here, we will discuss recent developments in the GeneID program that address both of these problems: predicting genes in very long anonymous genomic sequences, and automatically attaching functional annotation to the predicted genes. In particular, we will describe the methodology used to assign functional descriptions to the predicted genes based on the functional annotation of similar amino acid sequences in the public databases. By means of a process which we term "reverse querying of a database", we find the first-order Boolean formula, built on the annotation of a protein sequence database, that best describes the set of amino acid sequences showing similarity to the amino acid sequence encoded by a predicted gene. Such a formula is assumed to be the best description of the function of the gene. A measure of quality is computed for the descriptions obtained; thus, the ability to assign a good functional description to a predicted gene may reinforce confidence in the reliability of the prediction.
Functional annotation is also attempted for connected regions of similarity to amino acid sequences along the DNA sequence, which may not be assembled into genes. In cases of low or controversial similarity, the quality of the assigned functional prediction can be used to independently assess the biological significance of the amino acid matches.
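
To make the idea of reverse querying concrete, the following is a minimal sketch in Python and not the GeneID implementation. It assumes that each database entry is annotated with a set of keywords, that candidate formulas are limited to conjunctions and disjunctions of at most two keywords, and that the quality of a formula is the Jaccard overlap between the set of entries it retrieves and the set of similar sequences. The names `reverse_query`, `annotations`, `hits`, and `max_terms`, as well as the choice of quality measure, are hypothetical and introduced only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reverse_query(annotations, hits, max_terms=2):
    """Return (formula, score) for the keyword formula whose result set
    best matches `hits`. Illustrative assumption, not the GeneID method.

    annotations: dict mapping sequence id -> set of annotation keywords
    hits: set of sequence ids similar to the predicted protein
    """
    # Invert the database: keyword -> set of ids annotated with it.
    index = {}
    for seq_id, keywords in annotations.items():
        for kw in keywords:
            index.setdefault(kw, set()).add(seq_id)

    # Only keywords that occur in at least one hit can help describe the hits.
    candidates = sorted(kw for kw, ids in index.items() if ids & hits)

    best = (None, 0.0)
    for n in range(1, max_terms + 1):
        for combo in combinations(candidates, n):
            sets = [index[kw] for kw in combo]
            # A conjunction retrieves the intersection of the keyword sets.
            options = [(" AND ".join(combo), set.intersection(*sets))]
            if n > 1:
                # A disjunction retrieves their union.
                options.append((" OR ".join(combo), set.union(*sets)))
            for formula, retrieved in options:
                score = jaccard(retrieved, hits)
                if score > best[1]:
                    best = (formula, score)
    return best


# Toy database (invented entries, for illustration only).
annotations = {
    "P1": {"kinase", "ATP-binding"},
    "P2": {"kinase", "membrane"},
    "P3": {"kinase", "ATP-binding"},
    "P4": {"protease"},
    "P5": {"ATP-binding", "transport"},
}
hits = {"P1", "P3"}  # entries similar to the predicted gene product
print(reverse_query(annotations, hits))
```

In this toy case neither keyword alone retrieves exactly the similar sequences, but their conjunction does, so it receives the highest score. A low best score would indicate that no simple description fits the hits well, which is the sense in which the quality of the description can inform confidence in the prediction.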