From Sequence Analysis of Antigenic Peptides to Possible Mechanism for Proteasome Cleavage

Yael Altuvia and Hanah Margalit
The Hebrew University, Jerusalem, Israel

Proteasomal cleavage of proteins is the first step in the processing of most antigenic peptides that are presented to cytotoxic T cells. Still, its specificity and mechanism are not fully understood. To identify preferred sequence signals that are used for generation of antigenic peptides by the proteasome, we perform a rigorous analysis of the residues at the termini and flanking regions of naturally processed peptides eluted from MHC class I molecules. Our results suggest that both the C-terminus (position P1 of the cleavage site) and its immediate flanking position (P1') possess cleavage signals, and that their contributions are additive. The N-termini of the peptides show these signals only weakly, consistent with previous findings that antigenic peptides may be cleaved by the proteasome with N-terminal extensions. However, we demonstrate indirectly that the N-terminal cleavage sites show the same preferred signals at position P1'. This implies that the residues at the P1' position of a cleavage site participate in determining the cleavage specificity, in addition to the already known contribution of position P1. Our results apply to the generation of antigenic peptides and bear direct implications for the mechanism of proteasomal cleavage. We propose a model for the proteasomal cleavage mechanism in which both ends of cleaved fragments are determined by the same cleavage signals, involving preferred residues at both P1 and P1' positions of a cleavage site. The compatibility of this model with experimental data on protein degradation products and generation of antigenic peptides is demonstrated.

The Biomolecular Interaction Network Database (BIND) as a Resource and a Research Tool

G.D. Bader, T. Pawson and C.W.V. Hogue
Samuel Lunenfeld Research Institute/University of Toronto, Toronto, Ontario, Canada

Each protein expressed in a cell can interact with various different proteins and other molecules in the course of its function. Protein-protein interactions are often mediated by modular protein domains. One example is the SH3 domain, which binds a proline-rich motif. These "interaction networks" form conventional signaling cascades, transcription activation complexes, vesicle controlling mechanisms, and cellular growth and differentiation systems, among other cellular machinery. Known cellular protein interactions will eventually comprise more information than the Human Genome Project. We present a data specification for a new public submission database called BIND (Biomolecular Interaction Network Database). This database will span the complexity of interaction information gathered through experimental studies of biomolecular interactions. Interaction information will come from the literature, submitters and other databases. BIND contains interaction, molecular complex and pathway records. An interaction record is based on the interaction between two objects. An object can be a protein, DNA, RNA, ligand or molecular complex. Description of an interaction encompasses cellular location, experimental conditions used to observe the interaction, conserved sequence, molecular location of interaction, chemical action, kinetics, thermodynamics, and chemical state. Molecular complexes are defined as collections of more than two interactions that form a complex, with extra descriptive information such as complex topology. Pathways are defined as collections of more than two interactions that form a pathway, with extra descriptive information such as cell cycle stage.
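
The three record types described above can be pictured with a small data-structure sketch. This is only an illustration of the relationships stated in the abstract, not the BIND data specification itself: the class names, field names and the validation helper are all assumptions made for this example, and only a few of the descriptive fields are shown.

```python
from dataclasses import dataclass
from typing import List

# Object kinds listed in the abstract.
OBJECT_KINDS = {"protein", "DNA", "RNA", "ligand", "molecular complex"}


@dataclass
class BioObject:
    """One participant in an interaction (hypothetical name)."""
    name: str
    kind: str

    def __post_init__(self):
        if self.kind not in OBJECT_KINDS:
            raise ValueError(f"unknown object kind: {self.kind}")


@dataclass
class Interaction:
    """An interaction record: exactly two objects plus descriptive fields."""
    a: BioObject
    b: BioObject
    cellular_location: str = ""
    experimental_conditions: str = ""
    chemical_action: str = ""


def _require_more_than_two(interactions):
    # The abstract defines complexes and pathways as collections of
    # "more than two" interactions; this check follows that wording.
    if len(interactions) <= 2:
        raise ValueError("expected more than two interactions")


@dataclass
class MolecularComplex:
    interactions: List[Interaction]
    topology: str = ""

    def __post_init__(self):
        _require_more_than_two(self.interactions)


@dataclass
class Pathway:
    interactions: List[Interaction]
    cell_cycle_stage: str = ""

    def __post_init__(self):
        _require_more_than_two(self.interactions)
```

A record for the SH3 example would then be built as `Interaction(BioObject("SH3 domain protein", "protein"), BioObject("proline-rich peptide", "ligand"))`.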

An Automated Comparative Analysis of Seventeen Complete Microbial Genomes

Arvind K. Bansal
Department of Mathematics and Computer Science, Kent State University, Kent, Ohio, USA

As sequenced genomes become larger and sequencing becomes faster, there is a need to develop accurate automated genome comparison techniques and databases to facilitate derivation of genome functionality; identification of enzymes, putative operons, and metabolic pathways; and derivation of phylogenetic classification of microbes. This paper [3] modifies and extends an automated pair-wise genome comparison technique [1, 2], used to identify orthologs and gene-groups, in order to derive orthologous genes in a group of genomes, to identify genes with conserved functionality, and to identify genes specific to groups of genomes. Seventeen microbial genomes archived at ftp://ncbi.nlm.nih.gov/genbank/genomes have been compared using the extended technique to derive orthologs, orthologous gene-groups, duplications, gene-fusions, genes with conserved functionality, and genes specific to groups of genomes.

The comparison results [3] for E. coli and B. subtilis, two of the microbes thoroughly explored in wet laboratories, are consistent with the NCBI annotations. The results reveal that the genomes within the same family have a higher percentage of orthologs and orthologous gene-groups in terms of size of the smaller genomes in the genome-pairs. However, genome-pairs with a large number of genes share a large number of orthologs and orthologous gene-groups. There are large numbers of gene-group duplications and duplications of single genes. Duplication of gene-groups is largely a function of genome size, and to a lesser extent is a function of genomes being in the same family. The duplication of single genes is random for some genomes. Fused genes are small in number. Around 85 genes have conserved function. The functions of many genes involved in transcription and translation are conserved. 21 genes corresponding to ribosomal proteins have no orthologs in archaeal microbes. Archaeal genomes share a relatively higher percentage of orthologs among themselves. There are a number of genes which are specific to E. coli and various subsets of eight pathogens.

[1] Bansal, A. K., Bork, P., and Stuckey, P., "Automated Pair-wise Comparisons of Complete Microbial Genomes", Mathematical Modeling and Scientific Computing, 9, 1 - 23, (1998).
[2] Bansal, A. K., and Bork, P., "Applying Logic Programming to Derive Novel Functional Information in Microbial Genomes", Lecture Notes in Computer Science, Springer Verlag, 1551, 274 - 289, (1999).
[3] Bansal, A. K., "An Automated Comparative Analysis of Seventeen Complete Microbial Genomes", Bioinformatics, in press.

Heuristic Approach for Building Markov Models for Gene Prediction

John Besemer and Mark Borodovsky
School of Biology, Georgia Institute of Technology, Atlanta

We have developed a simple approach for building inhomogeneous Markov models of protein coding regions requiring only a small fragment of unannotated DNA as opposed to the large sets of experimentally validated genes or anonymous DNA sequence used previously. This new method builds models 'on the fly' through our web server for any sequence longer than 400 nt. Tests of this method on 10 complete bacterial genomes using the GeneMark.hmm program have shown that the new models predict 93.1% of annotated genes on average, while models built using traditional methods predict a comparable 93.9%. Models derived through the heuristic method can be used in cases where there is not enough coding sequence available to produce sound models, such as the extremely small genomes of viruses, plasmids and organelles, as well as sequencing projects at their beginning. A further application of this method is in highly inhomogeneous genomes, where optimizing the model to fit local sequence composition is advantageous. Extension of this approach for use with eukaryotes and implications of the method for possible mechanisms of codon usage pattern evolution will be presented as well.

GeneMark.hmm: A Gene Finding Tool for Eukaryotic Genomes

Mark Borodovsky1, John Besemer1, Natalia Milshina2#, George Tarasenko2 and Alexander Lukashin1*
1 - School of Biology, Georgia Institute of Technology, Atlanta, GA, USA
2 - Gene Pro, Inc. Atlanta, GA, USA
# - currently at Celera Genomics, Rockville, MD, USA
* - currently at Biogene, Cambridge, MA, USA

Gene prediction tools developed for prokaryotic genomes are generally inadequate for prediction of exon-intron gene structures in eukaryotic genomes. The GeneMark.hmm algorithm, previously described for gene finding in prokaryotic DNA and utilizing inhomogeneous Markov models in a hidden Markov model with duration framework, has been extended for analyzing eukaryotic DNA and finding split genes. The more complex grammar of eukaryotic DNA required, in addition to species-specific Markov models of coding and non-coding sequence, the use of site models such as models for donor and acceptor sites and for start and stop codon contexts. To properly use the hidden Markov model with duration, the probability distributions for exon, intron and intergenic region lengths were derived and used as well. The GeneMark.hmm program was tested on long genomic sequences of several eukaryotic species such as human, A. thaliana, C. elegans, C. reinhardtii, D. melanogaster and rice. The program's performance was observed to be at the same level as, or higher than, that of other frequently used gene finders for eukaryotes.

Functional and Evolutionary Relations of HSP60 Proteins

Luciano Brocchieri and Samuel Karlin*
Department of Mathematics, Stanford University, Stanford, CA 94305-2125, USA.

*Supported in part by Grant NIH-5R01GM10452-34, NIH-5R01HG00335-11, and NSF-DMS9704552

HSP60 (GroEL) proteins are ubiquitously expressed in eubacteria and in eukaryotic organelles. We examine HSP60 similarities using our new SSPA (Significant Segment Pair Alignment) method and multiple sequence ITERALIGN program, and interpret them with respect to function and evolution. The HSP60 proteins are largely conserved, with unaligned N-terminal segments in organellar sequences (leader peptides) and unaligned repetitive elements at the C-terminus. Unaligned regions between blocks of alignment, the three longest of about five residues, are generally exposed to the external wall of the Anfinsen cage complex. Among the most conserved regions is the first shell of residues surrounding the ATP and Mg++ binding sites. Conservation declines in the second shell. Hydrophobic residues that putatively interact with the substrate are highly conserved, affirming their important functional role. However, a second set of residues, observed to contact a histidine-rich peptide in a mini-chaperone crystal, is poorly conserved and apparently less relevant. A large number of charged residues line the central cavity of the GroEL-GroES complex in the substrate-releasing (cis) conformation. These residues encompass a statistically significant intra-monomer structural charge cluster which is highly conserved among sequences and is likely to play an important functional role interacting with the substrate. In the substrate-binding conformation (trans) most of these residues become buried between monomers of the heptameric ring, where they establish inter-monomer mixed charge clusters. Similarity comparisons between sequences and the analysis of the multiple alignment imply that the HSP60 sequences do not support the hypothesis that animal mitochondria arose from a Rickettsial bacterial endosymbiont. In particular, Rickettsia is strongly divergent in the substrate binding Apical Domain whereas Ehrlichia is mostly divergent in the multimer assemblage/ATP binding Equatorial Domain.
A sequence from Plasmodium falciparum, previously characterized as mitochondrial, appears instead as the non-functional remnant of a secondary symbiont chloroplast sequence.

Establishing the Role of Variable Residues Important for Functional Specificity within the CheY Family

Sean Bulloch (2), Robert B. Bourret (2) and Igor B. Zhulin (1)
(1) Department of Microbiology and Molecular Genetics, Loma Linda University, Loma Linda, California 92350, USA
(2) Department of Microbiology and Immunology, University of North Carolina, Chapel Hill 27599, USA

The CheY protein is a prototypical member of the functional superfamily of bacterial response regulators and the structural superfamily of the Rossmann fold. In E. coli, it functions as a regulator which, upon phosphorylation by a chemotaxis kinase, binds to the flagellar motor. CheY is a single-domain protein; however, it has recently been reported as a domain in hybrid chemotaxis proteins (a CheY-like domain). In some alpha-proteobacteria, more than one copy of the CheY protein was found. One of the two CheY proteins in S. meliloti was shown to have a different function: it does not bind to flagellar motors and plays the role of a "phosphatase", competing with a major CheY protein for a phosphate. In order to analyze the diversity within the CheY family, we have constructed a multiple alignment of all known and putative CheY proteins and CheY-like domains. Calculating a consensus identified highly conserved residues, which along with known CoC residues were mapped onto the 3D model of E. coli CheY. All of them were located within the "active site". Residues involved in phosphorylation and interaction with other chemotaxis proteins were first mapped onto the alignment, and their conservation within subsets of sequences was examined. Residues involved in CheY phosphorylation were among the most conserved, reflecting a common function for all proteins of the superfamily.

In many CheY sequences, some of the residues required for CheY binding to the flagellar switch protein FliM in E. coli were not conserved. This prompted a similar analysis of FliM. We have demonstrated that the interface of the FliM protein, which interacts with CheY, is also variable in many species. Mutual variation of interacting surfaces of two proteins may adjust the chemotaxis pathway to particular types of flagellar motors. We have found that among multiple CheY proteins within a given genome, there is one CheY protein which has seven conserved FliM-binding residues (presumably a real CheY homolog), whereas in other CheY proteins two of these seven residues are variable. These two residues, however, are highly conserved between "multiple-copy" CheY proteins and CheY-like domains that are known not to interact with FliM. The CheY residues involved in interaction with the CheZ phosphatase in E. coli were conserved only in gamma-proteobacteria. The BLAST search of the non-redundant database (including unfinished microbial genomes) revealed that CheZ phosphatase is present only in gamma-proteobacteria. Variable residues responsible for functional diversity within the CheY family were mapped onto the 3D structure of E. coli CheY and found to cluster on the surface of two exposed alpha helices.

Small changes in critical positions on the protein sequences that apparently caused a dramatic change in function appear to occur on the background of similar changes throughout the protein length. Phylogenetic analysis placed CheZ-interacting, FliM-interacting and FliM-non-interacting CheY proteins in distinct clusters.

Predicting Protein Family -Function, -Local Structure and -Global Fold by Comparing Local Sequence Motifs

Bob Chan, Gila Lithwick, Einat Sitbon, Victor Kunin and Shmuel Pietrokovski
Fred Hutchinson Cancer Research Center, Seattle, USA and
The Weizmann Institute of Science, Rehovot, Israel

We present a method to identify functional and structural similarities between protein families using motif sequence similarity. The method is based on the depiction of each protein family by a set of local ungapped multiple alignments (blocks) and on sophisticated sequence analysis programs. A very sensitive block-to-block comparison (LAMA) is followed by a highly selective consistency analysis (CYRCA). This analysis identifies groups of blocks with consistent and transitive relations to each other. Careful inspection of many such groups shows that each contains protein families with the same function, specific structural motifs or even global structural fold. Most of these relations cannot be identified by other advanced sequence-to-sequence and sequence-to-multiple-alignment comparisons. Thus, our method enables the prediction of function, local structure and global fold from the comparison of multiply aligned protein sequences. Our poster will outline the method and present representative examples. More details on the approach are available, and will continue to be posted, on the Blocks WWW site (http://blocks.fhcrc.org).

Identification and Automating Calculation of Homologous Core Structures

Jie Chen, Aron Marchler-Bauer and Stephen H. Bryant
NCBI, NIH, Bethesda, Maryland, USA

Using a large database of protein structure-structure and sequence-sequence alignments, we test a new method for distinguishing homologous and analogous structural neighbors. The homologous neighbors in the test set show no detectable sequence similarity, but they may be well superimposed and belong to the same superfamily according to the SCOP database (Murzin et al, JMB 247:536-540). Analogous neighbors also show no sequence similarity and may be well superimposed, but their structural similarity may be the result of convergent evolution. In our previous research we defined the homologous core structure (HCS) as the subset of alpha-carbon coordinates that may be well superimposed on homologous neighbors. In a cross-validated trial, we showed that a test for the presence of the HCS can distinguish well between homologous and analogous neighbors (Matsuo and Bryant, Proteins 35:70-790, 1999). In this previous work homologous neighbors were identified by their SCOP classifications, which are based on manual examination. We would like to automate definition of the HCS, however, so as to allow fully automatic ranking of structural neighbors according to the extent of conservation of the HCS, as an indicator of evolutionary distance. Here we investigate whether this may be accomplished by a kind of "bootstrap" procedure: 1) An initial set of homologous structural neighbors is identified by PSI-BLAST (Altschul et al, NAR 25:3389-3402). 2) An initial HCS is defined from these neighbors. 3) Other structural neighbors are identified as homologous based on the presence of the HCS. 4) The HCS definition is updated, followed by iteration (with bounds) of steps 3 and 4. In the poster we present the results to date from this investigation.
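
The control flow of the four-step "bootstrap" procedure can be sketched as a bounded iteration. The sketch below is a toy under strong assumptions and is not the authors' implementation: each structure is reduced to a set of residue positions, the HCS is taken to be the positions shared by all current homologs, and a neighbor "has" the HCS when it covers at least a chosen fraction of it. The function names, the `min_fraction` threshold and the `max_iter` bound are all hypothetical; the seed set stands in for the PSI-BLAST hits of step 1.

```python
def define_hcs(names, structures):
    """Toy HCS: positions shared by every structure in `names` (assumption)."""
    sets = [set(structures[n]) for n in names]
    return frozenset(set.intersection(*sets))


def has_hcs(name, hcs, structures, min_fraction=0.75):
    """Toy test: the structure covers at least `min_fraction` of the HCS."""
    if not hcs:
        return False
    return len(hcs & set(structures[name])) / len(hcs) >= min_fraction


def bootstrap_hcs(seed_homologs, neighbors, structures, max_iter=10):
    """Steps 2-4 of the procedure, iterated until stable or max_iter is hit."""
    homologs = set(seed_homologs)            # step 1 result (given)
    hcs = define_hcs(homologs, structures)   # step 2
    for _ in range(max_iter):                # "iteration (with bounds)"
        # Step 3: accept neighbors that contain the current HCS.
        updated = set(seed_homologs) | {
            n for n in neighbors if has_hcs(n, hcs, structures)
        }
        if updated == homologs:
            break                            # homolog set has converged
        homologs = updated
        hcs = define_hcs(homologs, structures)  # step 4
    return hcs, homologs


# Toy data: residue positions that superimpose well (made up for illustration).
structures = {
    "seed1": {1, 2, 3, 4, 5},
    "seed2": {1, 2, 3, 4, 6},
    "n1": {1, 2, 3, 7},
    "n2": {1, 2, 8, 9},
}
hcs, homologs = bootstrap_hcs({"seed1", "seed2"}, ["n1", "n2"], structures)
```

In this toy run `n1` is accepted in the first round, which shrinks the core, while `n2` never reaches the coverage threshold.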

Exon Detection by Comparison Between Two Distant Vertebrate Genome Sequences

H. Roest Crollius (1), O. Jaillon (1), C. Dasilva (1), L. Bouneau (1), C. Fizames (1), A. Billault (2), A. Bernot (1), F. Quetier (1), J. Weissenbach (1), W. Saurin (1)
(1) Genoscope, 2 rue Gaston Cremieux, CP 5706, 91057 Evry Cedex, France
(2) CEPH, 27 rue Juliette Dodu, 75010 Paris, France

The conservation of coding information between two genomes is driven by its importance as a functional element, and generally decreases as evolution progresses and species drift apart. However, regions of less or no functional relevance mutate and change at a faster rate. This characteristic has been successfully exploited to detect coding regions in genomic sequence. To achieve this, it is necessary to compare the sequences of two genomes that have diverged to a point where coding and non-coding regions are clearly separated. This should reveal functionally important elements such as exons and regulatory elements, and provide a wealth of secondary information on gene evolution, structure and organisation within a genome.

We have tested this approach on a set of homologous genes selected in the human and the tetraodontiform Fugu rubripes genomes respectively. Starting with the 17 genes that have been sequenced and annotated in both genomes (204 human exons) and deposited in public databases, we have retained those showing more than 40% protein similarity over their complete length (13 genes). We have first performed pairwise comparisons between homologous exons, then between homologous genes, then between homologous genomic regions containing the genes and finally between both genome samples. This gradual increase in non-coding sequence and complexity in the set used for comparison enabled us to calibrate the parameters of the algorithms to reach maximum sensitivity while controlling the emergence of potential loss in specificity. A variety of comparison methods were used, all based on the BLAST algorithm. Maximum sensitivity and specificity are obtained with TBLASTX alignments using a scoring matrix that does not allow for amino acid substitutions. Hence the T value that specifies the threshold score for building the dictionary of initial search words can be adjusted to the score of an exact match of length W (the length of the initial search word). This scoring scheme eliminates the construction and use of a list of neighbour search words. The speed of TBLASTX searches is therefore increased by approximately two orders of magnitude compared to searches performed with substitution matrices such as BLOSUM. This aspect is critical when dealing with large fractions of vertebrate genomes.
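
The effect of setting T to the score of an exact match of length W can be shown with a small sketch. It enumerates the neighbourhood of a word under an identity-style scoring scheme; the match and mismatch values, the 20-letter alphabet and the function names are assumptions chosen for illustration, not the parameters used in the study.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_score(a, b, match=1, mismatch=-1):
    """Scoring with no credit for substitutions (values are assumptions)."""
    return match if a == b else mismatch


def word_score(w1, w2):
    """Sum of per-position scores for two words of equal length."""
    return sum(identity_score(a, b) for a, b in zip(w1, w2))


def neighbourhood(word, threshold, alphabet=AMINO_ACIDS):
    """All words of the same length scoring >= threshold against `word`."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if word_score(word, cand) >= threshold
    ]


W = 3
word = "MKV"
exact_T = W * identity_score("A", "A")   # score of an exact match of length W
only_self = neighbourhood(word, exact_T)  # neighbour list collapses to the word
looser = neighbourhood(word, 1)           # a lower T admits one mismatch
```

With T equal to the exact-match score, the neighbourhood contains only the word itself, so no list of neighbour words has to be built or searched; lowering T to 1 in this toy scheme already admits 58 words.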

This work is the basis of a sequencing program initiated at Genoscope that aims at sequencing a large fraction of the genome of another tetraodontiform, Tetraodon nigroviridis (400 Mb), to help identify coding regions in the human and other vertebrate genomes. Tetraodon has a compact genome approximately 8 times smaller than human or mouse, while containing a similar gene complement. It is therefore particularly well suited to serve as a basis for comparative genomics at the sequence level, and is situated at a suitable evolutionary distance to ensure that conserved amino acid stretches will be of some functional importance. We have sequenced 20% of this genome in a mostly non-redundant and random fashion (http://genoscope.cns.fr). This sample, the largest available for a vertebrate after human, has been compared to a set of several hundred human genes. Preliminary results suggest that 20% of human exons, distributed in 50% of the genes, may be detectable with over 95% specificity.

Facilitation of Comparative Genomics Analyses by the Integration of YPD and WormPD

Michael E. Cusick, Maria C. Costanzo, Peter D. Hodges, Jennifer D. Hogan, Jodi Lew-Smith, Kevin J. Roberg-Perez and James I. Garrels
Proteome Inc., 100 Cummings Center, Beverly, MA 01915, USA

Two highly integrated proteome databases of model organisms are now publicly available in the BioKnowledge Library produced by Proteome, Inc. at . The Yeast Proteome Database (YPD) for the yeast Saccharomyces cerevisiae was the first comprehensively curated model organism database. Its facile presentation, detailed information about all aspects of yeast biology, and in-depth curation of the full research literature on yeast have been a boon to researchers in many fields. Now YPD is joined by WormPD, covering C. elegans biology with parallel presentation and detail. YPD and WormPD are both presented as lucid Protein Reports containing Title Lines, experimental and predicted Protein Properties, detailed free-text Annotations, and References. Links between the two species are available from any Protein Report, and are based on BLAST similarities, protein family memberships, and cross-referenced annotations. YPD and WormPD are both freely available to academic labs, and to corporate entities by licensed subscription.

With two comprehensively curated databases now available, for the first time bioinformatics researchers can make detailed interspecies comparisons of pathways, complexes, protein families, and regulation. As an example of what can be done, a comparative analysis of protein complexes was carried out using the extensive descriptions of protein complexes within YPD. Complexes for which all members are conserved in C. elegans (over 50 complexes) define common cellular machinery. With other yeast complexes no member has a significant match to a C. elegans protein, indicating that the complex is likely fungal-specific. Similar comparative analyses will be shown for subcellular localization. The expansive information available for yeast proteins in YPD has been used to predict properties and functions for uncharacterized orthologs in C. elegans and from there on to other higher species, including human.

A major bottleneck in interpretation of the immense amount of functional genomics data now becoming available is understanding the thousands of research leads that are generated. The high-quality annotation present in YPD and WormPD provides ready passage through this bottleneck. Two features are especially useful when YPD, and soon WormPD, are used as the platform for presentation of functional genomics results. 1) The Title Line on each Protein Report provides a concise, one-line description of the protein. Title Lines are continuously updated, and as such reflect the best synopsis of what is currently known about the protein. 2) Every protein is classified by Biochemical Function and Cellular Role, by virtue of a controlled vocabulary constructed for those two properties.

Model for the Unfolded State of Proteins

Howard J. Feldman, Mark A. Kotowycz, Thanh-Van T. Le and Christopher W. V. Hogue
Samuel Lunenfeld Research Institute, Mount Sinai Hospital/Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada

A method to generate protein conformers of arbitrary amino acid composition in O(N log N) time has been developed, taking only the primary sequence as input. These conformers possess physically and chirally valid backbones with all bond lengths, angles and dihedrals within the allowable tolerances. The method is based upon a 2-D probability distribution function for C-alpha placement called a 'trajectory graph', previously described.

The algorithm has been shown to be useful for both reconstructing backbones of real proteins, and generating random proteins. These modes may be mixed, making it possible to sample unknown domain structures and linker regions while reconstructing domains with known structure simultaneously.

To determine just how accurate random structures can be, 10,000 random conformers of proteins representing a wide variety of folds were generated. We report the structure with the smallest RMSD to the crystal structure in each case.

The random conformer generator may also be used to generate starting points for molecular dynamics simulations or ensembles of a protein for comparison with experimental studies of disordered structures. Experimental distance restraints such as NOEs or hydrogen bonds can be added to bias the random walk, if known, as well.

We have further developed a system that allows us to compute protein dynamic trajectories, based on a physical model of protein backbone motion. We show protein unfolding movies, and the energies calculated for these at each step, using an atom based potential. Through analogy to a 2-D gas, a relation has been drawn between this energy score and the motion of a given residue.

Predicting Protein Distance Constraints with Enhanced Performance Using Sequence Motifs and Neural Networks

J. Gorodkin, O. Lund, C. A. Andersen and S. Brunak
Department of Ecology and Genetics, The Institute of Biological Sciences, University of Aarhus, Denmark

For each sequence separation (in residues) of any pair of amino acids in polypeptide chains where the 3-dimensional structure is known, we investigate the predictability of the physical distance (in Angstroms). It is found that the distance distributions for small sequence separations are bimodal, whereas for large sequence separations they converge towards a universal shape, even though the mean value of the distances increases as the sequence separation increases. Similar to the change in distance distributions, the sequence motifs also change for increasing sequence separation. A sequence motif is constructed for the residues for which the distance between the C-alpha atoms is smaller than the mean value at that separation. When the separation is small the motif consists of a single peak located in between the two residues. As the sequence separation increases, additional peaks around the two separated residues appear, and when the separation is large, the center peak is smeared out. This analysis shows why a neural network prediction scheme performs better for this task than simple statistical data-driven approaches such as pair probability density functions. Using the knowledge from the investigation to design a new neural network architecture, a large improvement in performance is obtained for sequence separations of 10 to 30 residues. The change of sequence motifs and shape of distance distributions account qualitatively for the network performance with increasing sequence separation. A WWW server is made available at http://www.cbs.dtu.dk/services/distanceP/.
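
The first step of the analysis, grouping C-alpha distances by sequence separation and flagging pairs that lie closer than the mean at that separation, can be sketched as follows. This is a minimal illustration under assumptions: chains are given as lists of (x, y, z) C-alpha coordinates in Angstroms, the function names are hypothetical, and the toy coordinates are invented rather than taken from any structure.

```python
import math
from collections import defaultdict


def distances_by_separation(chains):
    """Map sequence separation -> list of C-alpha distances over all chains."""
    by_sep = defaultdict(list)
    for coords in chains:
        for i in range(len(coords)):
            for j in range(i + 1, len(coords)):
                by_sep[j - i].append(math.dist(coords[i], coords[j]))
    return by_sep


def mean_distances(by_sep):
    """Mean distance at each sequence separation."""
    return {sep: sum(d) / len(d) for sep, d in by_sep.items()}


def below_mean_pairs(coords, separation, means):
    """Residue index pairs at `separation` closer than the mean distance."""
    return [
        (i, i + separation)
        for i in range(len(coords) - separation)
        if math.dist(coords[i], coords[i + separation]) < means[separation]
    ]


# Two toy three-residue chains: one straight, one bent by 90 degrees.
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
means = mean_distances(distances_by_separation([straight, bent]))
```

The pairs returned by `below_mean_pairs` are the ones whose surrounding sequence would feed the motif construction described above; the motif building and the neural network are not shown.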

Computational Characterization of 3'-end-processing Control Phrases

Joel H. Graber, Charles R. Cantor, Scott C. Mohr and Temple F. Smith
Center for Advanced Biotechnology, Boston University, Boston, MA, USA

Nucleic acid control sequences (phrases) are difficult to recognize because they are relatively small and display wide variation in fidelity and complexity. We have shown that 3'-end-processing controls consist of multiple elements, where the individual elements can vary widely from a consensus sequence and yet remain functional as part of the whole. Such variability, common among control phrases, makes a bioinformatic analysis a natural approach for characterization. The large sequence databases currently available provide sufficient data for such investigations, given a suitable biological hypothesis for selection of candidate sequences.

Nearly all mature eukaryotic mRNAs terminate in polyadenylate (poly(A)) tails. The site of 3'-end-processing (cleavage and polyadenylation) is determined by control phrases within the immature RNA sequence. Experimental studies have demonstrated a wide range of functional 3'-end-processing phrases within many organisms and only weak conservation when examined across multiple species. We have searched for 3'-end-processing phrases within Expressed Sequence Tags (ESTs), cDNA sequences that are typically generated from oligothymidylate primers that ostensibly hybridize to the mRNA poly(A) tail. The 3'-end of the EST sequence identifies the 3'-end-processing site.

We have collected large (> 1000) groups of yeast, rice, arabidopsis, fruitfly, mouse, and human EST sequences judged as highly likely to have originated at the 3' end of the EST. We have identified several motifs with statistically significant abundance, indicating probable biological function. Cross-species comparison reveals that the use and conservation of the canonical AAUAAA element varies widely among the six species and is especially weak in plants and yeast. In all species examined, the complete 3'-end-processing control appears to consist of a complex aggregate of multiple elements. We present a broadened model of 3'-end-processing control phrases to explain the varied phenomena seen in both our results and previous investigations.
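
A first pass at finding over-represented elements is to count how many sequences carry each short word in a window just upstream of the processing site. The sketch below does only that counting step and makes several assumptions that are not in the abstract: sequences are oriented 5' to 3' with the poly(A) tail already trimmed, the word length is 6, the window is 40 nt, and the toy sequences are invented. No significance test is included.

```python
from collections import Counter


def upstream_kmer_counts(sequences, k=6, window=40):
    """Count, for each k-mer, how many sequences contain it in the
    last `window` nucleotides (each sequence is counted at most once)."""
    counts = Counter()
    for seq in sequences:
        region = seq.upper()[-window:]
        seen = {region[i:i + k] for i in range(len(region) - k + 1)}
        counts.update(seen)
    return counts


# Invented toy sequences; AATAAA is the cDNA form of the AAUAAA element.
toy_ests = ["GGGGAATAAACCCC", "TTTTAATAAAGGGG", "CCCCCCCCCCCC"]
counts = upstream_kmer_counts(toy_ests)
fraction_with_signal = counts["AATAAA"] / len(toy_ests)
```

Comparing such fractions across species is one simple way to see the kind of variation in AAUAAA usage reported here.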

Phylogenetic Analysis on Complete-genome Scale Using Distributions of Evolutionary Rates among Proteins

Nick V. Grishin, Yuri I. Wolf, Eugene V. Koonin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Accumulation of complete genome sequences of diverse organisms creates new possibilities for evolutionary inferences from whole-genome comparisons. Here we analyze the distributions of substitution rates among proteins encoded in 19 complete genomes (the inter-protein rate distribution). To estimate these rates, it was necessary to employ another fundamental distribution, that of the substitution rates among sites in individual proteins (the intra-protein distribution). Using two independent approaches, we show that intra-protein substitution rate variability appears to be significantly greater than generally accepted. We demonstrate that the inter-protein rate distributions inferred from the genome-to-genome comparisons are similar to each other and can be approximated by a single distribution with a long exponential shoulder. This suggests that the molecular clock hypothesis may be valid on a genome scale. We use the scaling parameter of this distribution to build a rooted whole-genome phylogenetic tree whose topology is largely compatible with that of global rRNA-based trees.

FramePlus: A Sensitive Algorithm for Aligning DNA to Protein Sequences

Eran Halperin, Simchon Faigler and Raveh Gill-More
Compugen Ltd., 72 Pinchas Rosen Street, Tel Aviv 69512, Israel

Biological sequence alignment algorithms have become extremely popular in the last few years, and are now being used by thousands of researchers as arguably the most important annotation tool in bioinformatics. However, the search algorithm used may have a crucial effect on the success of an annotation project: different algorithms will find (and miss) different hits under different circumstances.

Frame algorithms are a special case of sequence alignment algorithms, used when one wishes to compare a nucleic acid sequence to an amino acid sequence. They are particularly useful for annotating Expressed Sequence Tags (ESTs). The first frame algorithm developed was Translated Search (also known as six-frame translation), which is based directly on the Smith-Waterman algorithm. Heuristic database search packages (such as BLAST and FASTA) incorporated frame algorithms early on. However, early frame algorithms were not very error tolerant, especially when the errors caused frame-shifts. A solution to this was introduced by FrameSearch.
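As a minimal sketch of the six-frame translation idea described above (not the authors' implementation), the Python below translates a DNA sequence in all three forward and three reverse-complement reading frames using the standard genetic code. The function names are hypothetical. In a translated search, each resulting protein string would then be aligned to the protein query, which is why a single frame-shifting sequencing error can break an alignment that stays within one frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Map frame labels (+1..+3, -1..-3) to their conceptual translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in sorted(six_frames("ATGGCCTAA").items()):
        print(label, protein)
```

Stop codons are rendered as `*`; trailing bases that do not form a full codon are ignored in each frame.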

In this work we introduce a new frame algorithm called FramePlus, an extension of FrameSearch in which we model sequencing errors separately from indels of amino acids caused by evolution. Since these are two different phenomena, it is reasonable to expect that this better modeling will result in increased sensitivity. In order to test this conjecture, we have utilized and customized ideas by Brenner et al. for algorithm benchmarking based on the SCOP database of structurally classified proteins, and implemented a general framework for benchmarking frame algorithms. We used this framework to compare all the above-mentioned algorithms.

Our results suggest that FramePlus is significantly more sensitive than other algorithms, and in cases of low sequence identity may find as much as 13% more true hits than any of the other tested algorithms. Although FramePlus is slower than heuristic algorithms such as BlastX when implemented on a standard computer, it can be accelerated by up to 3 orders of magnitude on special purpose hardware. The FramePlus source code is freely available, at ftp.compugen.co.il/pub.

Comparative Tests of Methods for Detecting Neutral Rate Violation in Protein-Coding Genes

A. P. Jason de Koning and Caro-Beth Stewart
Department of Biological Sciences, University at Albany, SUNY, Albany, NY 12222, USA

An important, but daunting, challenge in comparative genomics is to identify those genetic differences between species that became fixed by positive Darwinian selection for new function, rather than by neutral genetic drift. Neutral theory predicts that genes which are under no selective pressure will evolve such that the rate of nonsynonymous nucleotide substitution (dN) will approximately equal the synonymous substitution rate (dS). Significant elevation of dN relative to the neutral substitution rate of the locus, as measured by dS of the gene, is taken as strong evidence of positive selection for changes in the protein sequence.

Although numerous methods for estimating dN and dS have been proposed, little is known about their relative strengths and weaknesses when applied to real DNA sequence data. One reason for this dearth of comparative studies is that most of the available dN/dS methods are implemented on different computer platforms, use different input files, and display results in ways that make direct comparisons difficult.

To facilitate comparative dN/dS studies, we are developing a new computer program, FENS (Facilitated Estimates of Nucleotide Substitutions), that calculates dN and dS between homologous protein-coding genes by a variety of published and newly-developed methods. The methods of analysis currently implemented include those of Li et al. (1985), Nei and Gojobori (1986) [as described in the original publication, not as programmed in MEGA], Pamilo & Bianchi (1993) and Li (1993) [as implemented in the computer program, Li93, which corrects a mathematical error in the original publications], and Ina (1995) [by both methods 1 and 2]. Additional options are available, including a gamma-correction for among-site rate variation, and an adjustment to the Nei-Gojobori and Ina methods for cases where stop codons could be easily reached by point mutations. FENS also calculates t-tests for significant differences between dN and dS, as proposed by Hughes and Nei (1988). Results from all methods are displayed in compact matrices, with optional output of all calculations.
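As a rough illustration of the final step shared by several of these methods, the sketch below applies the Jukes-Cantor multiple-hit correction to observed proportions of nonsynonymous and synonymous differences, as is done in the Nei and Gojobori (1986) approach. This is not FENS code: the function names are hypothetical, and the numbers of sites and differences (the part where the methods actually differ) are assumed to have been counted already.

```python
import math


def jukes_cantor(p):
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75) for the correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, dN/dS) from pre-computed counts (assumed inputs)."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    ratio = d_n / d_s if d_s > 0 else float("nan")
    return d_n, d_s, ratio


if __name__ == "__main__":
    # Hypothetical counts for one pair of sequences.
    print(dn_ds(nonsyn_diffs=12, nonsyn_sites=300.0, syn_diffs=10, syn_sites=100.0))
```

Under the neutral expectation discussed in this abstract, the ratio returned for pseudogene pairs should be close to 1 on average.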

Here we will present analyses of large pseudogene datasets by all methods implemented in FENS, comparing the behaviors of the methods with respect to neutral expectation (dN = dS, on average). Combined with simulation studies, such comparative studies using real DNA sequences should help us understand which of the various methods are most appropriate for the detection of adaptive molecular evolution.

FENS is being released as a beta edition, and will be available at the poster session. The program currently runs on Power Mac computers, although other platforms will be supported in the future. Input files use a standard Nexus format, so that data are readily portable to other commonly-used evolutionary analysis programs.

We thank M. Nachman for unpublished sequences, and the NSF for support.

Use of Secondary Structure State Hidden Markov Models for Gene Identification and Protein Fold Recognition

Peter J. Lammers, John B. Spalding and Steven P. Duran
New Mexico State University, Las Cruces, NM, USA

We have tested the simple hypothesis that protein secondary structures predicted from amino acid sequences can be used as the basis for identifying structural homologs in the Protein Data Bank (PDB). By concentrating on groups of closely related proteins, the secondary structure prediction accuracy is improved and the results can be used to construct a profile hidden Markov Model (HMM) for each group based only on helix, sheet or coil designations. The resulting HMM provides a sensitive tool for searching the PDB resource. The validity of this approach was tested with 37 groups of protein sequences related by varying degrees to homologous proteins in the PDB. A single query sequence was used to create each group of related sequences using a fully automated process. Secondary structures were predicted for each training set protein using two methods: Predator and DSC. HMMs were built for each group using HMMER 2.1.1 and scored against the STRIDE database of PDB-derived secondary structures. The Predator method proved to be superior, as the top scoring protein was a true positive for 31/37 models (84%). Models derived from the secondary structures predicted by the DSC method were correct in 23/37 cases (62%). The likelihood of success was not correlated with the degree of sequence identity between the initial query protein and its closest PDB homolog, or with the length of the protein. However, the size of the training set used to build the HMM did have an effect. Four of the six misses by the Predator-derived models came from training sets with 13 or fewer proteins.

Rooting the Kinesin Superfamily: A Comprehensive Phylogenomic Analysis

Lawrence, C.J. (1), Malmberg, R.L. (1), Muszynski, M.G. (2) and Dawe, R.K. (1&3)
(1) University of Georgia, Department of Botany, Athens, GA, USA
(2) Pioneer Hi-Bred Intl., Inc. Athens, GA, USA
(3) University of Georgia, Department of Genetics, Athens, GA, USA

Kinesins constitute a diverse, anciently derived superfamily of microtubule-based motor proteins. By building phylogenetic trees and mapping function onto monophyletic clades, we hope to reconstruct the evolution of unique functions within the kinesin superfamily. We include two bacterial sequences for MukB, kinesin's prokaryotic ancestor, and four kinesin sequences from Giardia lamblia, an anciently diverged amitochondriate protist, to root the pan-kinesin tree. In addition to classifying previously described kinesins from protists, fungi, and animals, we classify 13 unique kinesins we sequenced from the monocot Zea mays as well as many newly reported dicot sequences as representatives of the plant kingdom. Preliminary results of our phylogenetic analysis indicate that (1) plants have both plus- and minus-end directed kinesins, (2) minus-end directed kinesins form a monophyletic clade, suggesting that a single evolutionary event accounts for the origin of reversed motor directionality, and (3) plants may have a nuclear copy of MukB, presumably necessary for chloroplast or mitochondrion replication.


Lee, D. A., Pearl, F. M. G. and Orengo, C. A.
Biomolecular Structure and Modelling Group, University College London, Gower Street, London WC1E 6BT, UK.

CATH (1) is a system of classification of the structures of proteins that have been deposited in the PDB (2, 3). It is a domain-wise, hierarchical classification, the four principal levels being: Class; Architecture; Topology; and Homology. PSI-BLAST (4) is an efficient and powerful tool for the detection of significant sequence similarities between proteins. In this study, CATH and PSI-BLAST have been used together to help investigate the relationship between the sequence and structure of proteins. A secondary aspect of the study is validation of the CATH classifications.

The study is divided into two main sections. Section one concerns the screening, using PSI-BLAST, of sequences with unknown structure against the CATH sequence dataset. Results are analysed in light of the structure associated with each CATH sequence.

In section two, the sequences of a representative from each CATH family are screened, using PSI-BLAST, against the latest release of the GENBANK (5) non-redundant protein sequence dataset. A procedure is described for the recruitment of putative homologues to CATH families.

A CATH-PSI-BLAST server is under construction at

1) Orengo, C. A. et al. 1997. Structure. 5:1093-1108.
2) Abola, E. E. et al. 1987. In: Crystallographic databases - information content, software systems, scientific applications, F. H. Allen, G. Bergerhoff, and R. Sievers, eds. Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester. pp 107-132.
3) Abola, E. E. et al. 1997. In: Methods in Enzymology, C. W. Carter Jr. and R. M. Sweet, eds. Academic Press, San Diego. Vol. 277, pp 556-571.
4) Altschul, S. F. et al. 1997. Nucleic Acids Research. 25:3389-3402.
5) Baskin, Y. 1983. Science Digest. 91:94-95.

Hierarchical Effects Model (HEM) for Anti-cancer Gene Discovery Using Markov Chain Monte Carlo and Web-based Development on Bioinformatic and Statistical Analysis Tools

Jae K. Lee
Laboratory of Molecular Pharmacology, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA

Since the end of the last decade, NCI has been generating and collating a rich set of data on anticancer drugs based on a pool of 60 cell lines of various types of cancer. In parallel with this massive drug database, several large databases of microarray and oligonucleotide cDNA expression data and some molecular targets on the 60 cancer cell lines are now available. To investigate these multi-GB data rigorously and effectively, we need to develop innovative bioinformatic and statistical investigation methods. I propose a novel statistical modeling approach to rigorously estimate the effects, especially interaction effects, of various biological factors simultaneously and to identify interesting (potentially clinically important) drugs and genes. This approach is based on the construction of a Hierarchical Effects Model (HEM) and estimation of the model parameters using Markov Chain Monte Carlo, an advanced computer-intensive statistical technique. The vitality of such a statistical/bioinformatic development on vast amounts of biological and clinical data strongly depends both on intensive interaction and collaboration between statistical and biological researchers and on the flexibility of our investigational tools to interpret the data from various perspectives. Fully utilizing modern statistical packages, such as S-PLUS, we have developed a web-based system to provide our statistical analysis tools directly to biological and clinical researchers.

How Much Accuracy can RBS Model Bring to Translation Start Recognition?

Ping Li and Mark Borodovsky
School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA

Accurate prediction of translation start sites is still an open problem. The GeneMark gene prediction program uses Markov chain models. Predicting the translation start is difficult when using only models of protein-coding and non-coding regions. The Ribosome Binding Site (RBS) is usually located in a region from -19 to -4 upstream of the translation initiation site. The latest version of GeneMark uses an RBS model to aid translation start recognition. Knowing the distribution of error rates of such predictions is critical for interpreting GeneMark predictions. In this study, a large number of artificial model sequences were generated by Markov chain models and RBS models. A new algorithm is suggested in which two scores, instead of the single score used in the current GeneMark, are calculated to distinguish the real translation start from false ones. The dependence of the prediction error rate on model parameters, represented by the Kullback-Leibler distance, was determined to provide guidance for gene prediction in different prokaryotic genomes. It was shown that the new algorithm potentially has a higher prediction accuracy than the start site prediction procedure currently used in GeneMark.
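To make the Kullback-Leibler distance mentioned above concrete, here is a minimal sketch that measures how far a position-specific nucleotide model (such as an RBS model) lies from a background composition. The function names, the toy frequencies, and the choice to sum over motif columns are assumptions for illustration; the abstract does not specify how the distance was computed.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    p and q are sequences of probabilities over the same alphabet;
    q is assumed to be strictly positive wherever p is positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def motif_distance(columns, background):
    """Sum the per-column divergence of a position-specific model."""
    return sum(kl_divergence(col, background) for col in columns)


if __name__ == "__main__":
    background = [0.25, 0.25, 0.25, 0.25]          # A, C, G, T (toy values)
    rbs_like = [
        [0.70, 0.05, 0.20, 0.05],                  # hypothetical column 1
        [0.10, 0.05, 0.80, 0.05],                  # hypothetical column 2
    ]
    print(motif_distance(rbs_like, background))
```

A model whose columns match the background gives a distance of zero, so larger values indicate a signal that is easier to separate from surrounding sequence.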

The Quality of merC, a Module of the mer Mosaic

Cynthia A. Liebert, Alice L. Watson and Anne O. Summers
Department of Microbiology, The University of Georgia, Athens, GA 30602-2605, USA

We examined a region of high variability in the mosaic mercury resistance (mer) operon of natural bacterial isolates from the primate intestinal microbiota. The region between the merP and merA genes (PA) of nine mer loci was sequenced; it contained either merC, merF, or no gene. Two novel merC genes were identified. Overall nucleotide diversity, π (per 100 sites), of the merC gene was greater (49.63) than that of the adjacent merP (35.82) and merA (32.58) genes. However, the consequences of this variability for the predicted structure of the MerC protein are limited and, with two exceptions, putative functional elements (metal-binding ligands and transmembrane domains) are strongly conserved. Possible agents of the diversity in the PA region include homologous recombination mediated by Chi sites in and near mer. There is also evidence of vestigial sequences suggesting the activities of site-specific recombinases in and near some of the mer operons.
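For readers unfamiliar with the statistic, the following sketch computes nucleotide diversity under its conventional definition: the average number of differences per site over all pairs of aligned sequences. It assumes gap-free sequences of equal length and uses a hypothetical function name; the abstract does not state which estimator or correction the authors applied.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site for equal-length aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    # Count mismatching columns for every pair of sequences.
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)


if __name__ == "__main__":
    toy_alignment = ["AAAA", "AAAT", "AATT"]       # toy data, not mer sequences
    print(100 * nucleotide_diversity(toy_alignment), "per 100 sites")
```

Multiplying by 100 expresses the value per 100 sites, the unit used in the abstract.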

G-Protein Coupled Receptor Clustering by Hierarchical Pattern Discovery

Agatha H. Liu, Gustavo Stolovitzky, Ajay Royyuru, Andrea Califano
Computational Biology Center, IBM TJ Watson Research Center, USA

The G-Protein Coupled Receptor superfamily is probably the largest and most functionally differentiated gene family in our genome. Its members can exhibit a surprising level of sequence similarity but still have radically different function. An example of this can be found in parathyroids and calcitonins. As a consequence, from a Comparative Genomics perspective identifying the sequence elements that confer specific functional traits to members of the GPCR families is still very much an open question. This is even truer in the absence of a true baseline for structural models of these transmembrane proteins, which expose a large hydrophobic area and are therefore extremely hard to crystallize.

This paper presents an unsupervised, top down approach that allows researchers to efficiently identify sequence regions that confer progressively more specificity to the function of each of the GPCR proteins in SWISS-PROT Release 36. This method is based on the recursive identification of statistically significant conserved regions through deterministic sparse pattern discovery via the Splash algorithm.

Two approaches are studied. In the first one, at each step, the pattern (or patterns) that are most conserved in a protein set A0 are discovered and used to build a local HMM representation. The latter is used to divide the set in two subsets: a set A01 that scores above statistical significance with respect to the HMM, and a set A00 that does not. After removing the HMM region from proteins in the A01 set, the procedure is repeated for both A01 and A00 (yielding the sets A011, A010, A001, and A000) until a complete classification tree is obtained and statistically significant patterns can no longer be identified.
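The recursive splitting in this first approach can be sketched as follows. This is only a toy under loud assumptions: the most widely shared 3-mer stands in for Splash pattern discovery, "contains the 3-mer" stands in for scoring above significance against a local HMM, and replacing the 3-mer with X stands in for removing the HMM region. All names and thresholds are hypothetical.

```python
from collections import Counter

K = 3            # toy pattern length (assumption)
MIN_SUPPORT = 2  # toy significance: pattern must occur in at least 2 sequences


def discover_pattern(seqs):
    """Return the most widely shared unmasked k-mer, or None if not 'significant'."""
    counts = Counter()
    for s in seqs.values():
        kmers = {s[i:i + K] for i in range(len(s) - K + 1)}
        counts.update(k for k in kmers if "X" not in k)
    if not counts:
        return None
    # Sorting first makes ties resolve to the alphabetically smallest k-mer.
    pattern, support = max(sorted(counts.items()), key=lambda kv: kv[1])
    return pattern if support >= MIN_SUPPORT else None


def split(seqs, label="A0"):
    """Build the classification tree: label + '1' for hits, label + '0' for misses."""
    pattern = discover_pattern(seqs)
    node = {"label": label, "pattern": pattern, "members": sorted(seqs), "children": []}
    if pattern is None:
        return node
    # Hits have the matched region masked before the next round.
    hits = {n: s.replace(pattern, "X" * K) for n, s in seqs.items() if pattern in s}
    misses = {n: s for n, s in seqs.items() if pattern not in s}
    node["children"].append(split(hits, label + "1"))
    if misses:
        node["children"].append(split(misses, label + "0"))
    return node


if __name__ == "__main__":
    toy = {"a": "MKTAYW", "b": "GGTAYC", "c": "PLLQRS", "d": "HHLQRE"}
    print(split(toy))
```

Recursion stops when no pattern reaches the support threshold, mirroring the stopping rule of "statistically significant patterns can no longer be identified".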

In the second approach the procedure is repeated but the sets are not immediately split. That is, the HMM region is masked in all sequences in A0 that are also in A01, and then the pattern discovery is repeated on the entire set A0, yielding the sets A02, A03, and so on. When statistically significant patterns can no longer be discovered in A0, the procedure is repeated for each of the subsets A01, A02, etc. until statistically significant patterns can no longer be discovered. Finally, by determining the amount of overlap in the sequences that score above the significance threshold for each pair of HMMs, a full graph of the protein cluster relationships is built. Each method is interesting on its own merits and yields biologically significant results.

Due to the efficiency of the pattern discovery algorithm the entire procedure can be completed in minutes on a workstation for more than 1000 GPCRs. This makes this methodology useful for clustering large protein databases such as the full SWISS-PROT.

A comparative analysis of the results with respect to previous techniques is reported. A number of interesting functional protein clusters, not previously reported, will also be discussed. Some HMMs generated by this approach have been used to screen the dbEST database, producing several new GPCR candidates at various levels of granularity.

Local Multiple Sequence Alignment Using Dead-end Elimination

Alexander V. Lukashin and Joseph J. Rosa
Biogen, Inc., 14 Cambridge Center, Cambridge, MA 02142, USA

Local multiple sequence alignment is a basic tool for extracting functionally important regions shared by a family of protein sequences. We present an algorithm for rigorously solving the local multiple alignment problem. The algorithm is based on the dead-end elimination procedure that makes it possible to avoid an exhaustive search. Certain rejection criteria are derived in order to eliminate those sequence segments and segment pairs that can be mathematically shown to be inconsistent (dead-ending) with the globally optimal alignment. Iterative application of the elimination criteria results in a rapid reduction of combinatorial possibilities without considering them explicitly. In the vast majority of cases, the procedure converges to a unique globally optimal solution. In contrast to the exhaustive search, whose computational complexity is combinatorial, the algorithm is computationally feasible because the number of operations required to eliminate the dead-ending segments and segment pairs grows quadratically and cubically, respectively, with the total number of sequence elements. The method is illustrated on a set of protein families for which the globally optimal alignments are well recognized.

Xenologous Gene Displacement in Archaea and Bacteria

Kira S. Makarova, L. Aravind and E. V. Koonin
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health, Bldg. 38A
Bethesda, MD 20894, USA

Perhaps the most unexpected result of the comparative analysis of completely sequenced genomes of bacteria and archaea is the apparent high rate of horizontal gene transfer, which seems to occur even between phylogenetically distant microbes. One of the possible results of horizontal gene transfer is the replacement of a gene by its ortholog from a distant species which is assumed to proceed via an intermediate stage when both genes are present in the genome. We termed this evolutionary phenomenon xenologous gene displacement (XGD). Using the complete sets of proteins encoded in 5 Archaeal and 15 Bacterial genomes, we attempted to assess the contribution of XGD events to the evolution of these prokaryotes. In order to detect relatively recent cases of XGD, groups of closely related genomes were compared, such as E.coli-Haemophilus influenzae-Rickettsia prowazekii, Treponema pallidum-Borrelia burgdorferi, Chlamydia pneumoniae-C.trachomatis, and Mycoplasma genitalium-M.pneumoniae. We found that certain organisms, such as R.prowazekii among the Proteobacteria and two spirochaetes - B.burgdorferi and T.pallidum, are particularly prone to XGD. In order to detect potential ancient XGD events, we searched for "archaeal" genes in bacteria and, conversely, for "bacterial" genes in archaea. Several cases of XGD in different groups of Archaea and Bacteria were convincingly supported by phylogenetic analysis. In general, the results suggest that the amount of XGD is roughly proportional to the evolutionary distance between the compared genomes. The sources of gene acquisition vary in each case but there is a clear connection with the organism's lifestyle. For example, spirochaetes primarily acquire eukaryotic genes or genes from other pathogenic bacteria, whereas in hyperthermophilic bacteria, there is a strong trend towards acquisition of archaeal genes.

HOPS: Hybrid Optimizer of Protein Structure

Alberto Maria Segre and Sean Forman
University of Iowa, Iowa City, Iowa, USA

Composing the protein folding problem as a computer model is notoriously hard due to the number of potential conformations. Many techniques utilize a simplified protein model and allow the model to move freely. We have developed a structure prediction method utilizing a full protein representation. Rather than allowing the model to move freely, the full representation is folded in a mostly discrete manner. Bond angles and lengths are fixed and a discrete number of phi/psi angle pairs are selected off-line using each amino acid's Ramachandran Plot and a clustering algorithm.

The chosen phi/psi angle pairs form a search tree of potential conformations. The search algorithm folds the protein from left to right attempting to find a minimum value for our scoring function. A partial fold is scored using each amino acid's accessible surface area (computed incrementally), the number of hydrogen bonds formed in the partial fold, and an estimated contribution from the unfolded portion of the protein. If the partial fold has a sufficiently good score, the algorithm moves forward and sets the phi/psi angle values for the next amino acid. Partial folds resulting in steric clashes or unfavorable scores are pruned, and the search backtracks to the previous amino acid or sets a new phi/psi angle combination at the current amino acid.
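The search described above follows the general shape of depth-first search with backtracking and pruning, which can be sketched as below. This is a toy under stated assumptions, not HOPS itself: each residue chooses among a few labeled states with a non-negative cost (a stand-in for the real scoring function), a set of forbidden adjacent state pairs stands in for steric clashes, and a branch is pruned when its partial score cannot beat the best complete fold found so far. All names and numbers are hypothetical.

```python
def fold(options, clashes):
    """Find the lowest-cost assignment of one state per residue.

    options[i] is a list of (state, cost) choices for residue i, with
    non-negative costs; clashes is a set of (previous_state, state)
    pairs that are rejected outright.
    """
    best = {"score": float("inf"), "path": None}

    def extend(i, path, score):
        if score >= best["score"]:
            return                      # prune: cannot improve on the best fold
        if i == len(options):
            best["score"], best["path"] = score, list(path)
            return
        # Try the cheapest choices first so good bounds are found early.
        for state, cost in sorted(options[i], key=lambda sc: sc[1]):
            if path and (path[-1], state) in clashes:
                continue                # reject clashing neighbours
            path.append(state)
            extend(i + 1, path, score + cost)
            path.pop()                  # backtrack to try the next choice

    extend(0, [], 0.0)
    return best["path"], best["score"]


if __name__ == "__main__":
    toy_options = [
        [("alpha", 1.0), ("beta", 2.0)],
        [("alpha", 1.0), ("beta", 1.5)],
        [("alpha", 2.0), ("beta", 0.25)],
    ]
    print(fold(toy_options, clashes={("alpha", "beta")}))
```

The abstract's scoring also includes an estimated contribution from the still-unfolded portion of the chain; a bound of that kind would tighten the pruning test in `extend` but is omitted here.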

Parallel partitioning techniques rarely provide a high speedup factor for the solution of search trees. We implement HOPS in a parallel manner, but use a new technique called nagging. The solution time in search trees often varies with the order in which the variables are searched. Nagging takes advantage of this variance by searching the tree in a variety of orders.

DBAli: A Collection of Alignments and Tools for Protein Sequence-Structure Comparison

Marc A. Martí-Renom and Andrej Sali
Laboratories of Molecular Biophysics
Pels Family Center for Biochemistry and Structural Biology
The Rockefeller University
1230 York Ave, New York, NY 10021, USA

Analysis of many comparisons of known protein structures is essential for improving the alignment of protein sequences with related structures. The aim of DBAli is to facilitate such analysis. DBAli consists of many alignments and Perl programs for deriving distributions of and correlations between a number of sequence and structure properties of proteins. Currently, DBAli includes ~2000 reference pairwise alignments from SCOP [1] and ~125 multiple structural alignments from HOMSTRAD [2]. DBAli also has links to other internal and external resources. For example, Compare3D applet [3] is used to visualize sequence alignments and structure superpositions. Three applications of DBAli are described. First, structural environments of insertions and deletions have been characterized. This information will be used to devise a better gap penalty function for sequence-structure alignment in comparative protein structure modeling. Second, multiple structural alignments of similar structures have been used to construct various matrices for dipeptide-dipeptide substitutions. These new substitution matrices will be evaluated for their performance in sequence-structure alignment. Third, to learn about the difficulties encountered by several sequence alignment programs, sequences from the reference alignments were re-aligned by these programs. The new alignments are also part of DBAli. For pairwise alignments, ALIGN [4], ALIGN2D [4], CLUSTALW [5] and PSI-BLAST [6] programs were used. For multiple alignments, MALIGN [4] and CLUSTALW were used. The alignment errors made by these programs are described.

[1] Hubbard, T., Murzin, A., Brenner, S., and Chothia, C. Nucleic Acids Res 25, 236-9 (1997).
[2] Mizuguchi, K., Deane, C., Blundell, T., and Overington, J. Protein Sci 7, 2469-71 (1998).
[3] Shindyalov, I. and Bourne, P. http://www.sdsc.edu/pb/Software.htm.
[4] Sali, A., Sánchez, R., Badretdinov, A., Fiser, A., Melo, F., Overington, J., Feyfant, E., and Martí-Renom, M.A. http://guitar.rockefeller.edu/modeller/ (1999).
[5] Thompson, J., Higgins, D., and Gibson, T. Nucleic Acids Res 22, 4673-80 (1994).
[6] Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. Nucleic Acids Res 25, 3389-402 (1997).

Finding Sequencing Errors in DNA Sequence Based on Intrinsic Properties of Coding Regions: What About Available Complete Prokaryotic Genomes?

Claudine Médigue (1,2), Alain Viari (3) and Antoine Danchin (1)
(1) Institut Pasteur- REG - 28 rue du Docteur Roux, 75724 Paris Cedex 15, France
(2) GENOPOLE- Lab. d'Annotation des Génomes - 7 rue Montespan, 91000 Evry, France
(3) Atelier de BioInformatique - Université Paris VI - 12 rue Cuvier 75005, Paris, France

During the determination of a DNA sequence, the introduction of artefactual frameshifts and/or in-frame stop codons in putative CDSs can lead to mistranslation and premature termination of the inferred transcripts. Detection of such errors using a method based on protein similarity matching is only possible when related sequences are available in databases [1,2]. We have developed a new method to detect frameshift errors in partial or complete genomes. The method, called ProFED (Prokaryotic Frameshift Errors Detection), is based on the intrinsic properties of the coding sequences and combines the results of two complementary DNA analyses: the search for translational initiation/termination sites and the prediction of coding regions using the GeneMark method [3]. The ProFED method is embedded into our Imagene platform dedicated to sequence annotation and analysis [4]. In a first step, the method was used to screen the complete Bacillus subtilis genome sequence, and experimental verifications (i.e., re-sequencing) were performed on the predicted erroneous regions. This procedure validates the overall quality of the data and allows the sequence to be corrected accordingly. Interestingly, in several cases in-frame termination codons or frameshifts were not sequencing errors but were confirmed to be present in the chromosome, indicating that the genes are either non-functional (pseudogenes) or subject to regulatory processes such as programmed translational frameshifts. In a second step, the ProFED method was used to screen twenty other available prokaryotic genome sequences. The predicted sequencing errors have not, in these cases, been validated by a re-sequencing procedure. Analysis of the results obtained shows that our strategy seems to be a reliable tool for assessing the quality of the final sequences for new genome projects. A web site including the results of our analysis is under construction. We hope that such a compilation of putative sequencing errors will help biologists in the correction of current genome annotations.

References :
[1] Claverie, J.-M. 1993. Detecting frame shifts by amino acid sequence comparison. J. Mol. Biol. 234 : 1140-1157.
[2] Brown, N.P., C. Sander, and P. Bork. 1998. Frame : detection of genomic sequencing errors. Bioinformatics 14 : 367-371.
[3] Borodovsky, M. and J.D. McIninch. 1993. GeneMark : Parallel gene recognition for both DNA strands. Comp. Chem. 17 : 123-133.
[4] Médigue, C., F. Rechenmann, A. Danchin, and A. Viari. 1999. Imagene : an integrated computer environment for sequence annotation and analysis. Bioinformatics 15: 2-15.

Statistical Potentials for Fold Assessment in Comparative Modeling

Francisco Melo, Roberto Sanchez and Andrej Sali
The Rockefeller University, Laboratory of Molecular Biophysics, 1230 York Avenue, #270, New York, 10021, USA

It is important to evaluate a comparative model before it is used to address the problem for which it was constructed. It is useful to assess first whether or not the model has at least the correct fold. The model will have the correct fold (good model) only if the template has the correct fold. In addition, the alignment between the template and the modeled sequence has to be substantially correct. Objective model evaluation is especially important in large-scale automated modeling of whole genomes where no user intervention is possible. A test set of approximately 10,000 correct and incorrect models has been built by automated comparative modeling for all non-redundant proteins in the Protein Data Bank. The test models span a wide range of size and fold type. The distribution of model accuracy is expected to be similar to that for genome-wide modeling calculations. A variety of model quality criteria and discrimination methods have been tested for their ability to distinguish between the good and bad models. The criteria have included one and two residue statistical potentials of mean force, the number of residues in the model, percent sequence identity between the target sequence and the template structure, compactness of the model, significance score for the target-template alignment, and the number of heteroatoms in the template structure. The discrimination methods have included linear and non-linear discriminant analysis, genetic algorithms, and Bayesian models. The inter-dependency, complementarity, and relationships between the quality criteria have been explored. This analysis allowed us to improve the accuracy of model classification. The current method correctly evaluates 95% of the models in the test set, with 5.0 and 5.9% of false positives and false negatives, respectively. The method performs well over a wide range of sensitivity and specificity.

Integrated Sequence Database System with an HTTP Programming Interface

Katerina Michalickova and Christopher W.V. Hogue
Samuel Lunenfeld Research Institute, Mount Sinai Hospital,
600 University Avenue, Toronto, Ontario, Canada
Department of Biochemistry, University of Toronto, Faculty of Medicine, Medical Sciences Building, Toronto, Ontario, Canada

Our bioinformatics research required a fast, simple and reliable in-house database system containing the same information as found in public biological sequence databases. We took advantage of the resources available at the National Center for Biotechnology Information ftp site, which contains all GenBank, SwissProt and PDB sequences in ASN.1 binary form. We parsed the ASN.1 files for indexing information and stored it together with the original ASN.1 binary data in CodeBase software (Sequiter Software Inc., Alberta). The CodeBase database system enables us to maintain all nucleotide, protein and 3-D data in-house in a few individual databases. The content is the same as the latest GenBank release and can be updated daily from the NCBI ftp site. At the present stage, the web interface facilitates database searches for sequences based on unique geninfo identifiers (GI), GenBank accession numbers, original sequence names, NCBI taxonomy identifiers, Medline identifiers, molecular modeling database (MMDB) identifiers and protein databank (PDB) identifiers. All sequences and 3-D structures can be displayed in several formats, such as definition line, FASTA format, ASN.1 print format, GenBank flat file and PDB flat file. The query also triggers a search for linked nucleic acids or proteins. Taxonomy and Medline searches offer a direct link to the NCBI to obtain full information about a particular taxon or a published article concerning the sequence of interest. We developed our own application programming interface (API), which utilizes the in-house databases to retrieve data both from a local disk and remotely through an http interface. The API performs some operations that are not addressed in Entrez, such as obtaining a non-redundant set of sequences from a given taxon, better control of protein sequence subsets, Clustal file format support and information about protein amino acid compositions.

Universally Conserved Positions in Protein Folds: Reading Evolutionary Signals about Stability, Folding Kinetics and Function

Leonid Mirny and Eugene Shakhnovich
Department of Chemistry,
Harvard University, Cambridge, MA, USA

In this work we analyze the molecular evolution of the five most populated protein folds: immunoglobulin fold, oligonucleotide binding fold, Rossmann fold, alpha/beta-plait, and TIM-barrels. In order to distinguish between "historic", functional and structural reasons for amino acid conservation, we consider proteins that adopt the same fold and have no evident sequence homology. For each fold we identify positions that are conserved within each individual family and coincide when non-homologous proteins are structurally superimposed. As a baseline for statistical assessment we use the conservatism expected according to solvent accessibility. The analysis is based on a new concept of "Conservatism of Conservatism". This approach allows identification of the structural features that are stabilized in all proteins having a given fold, despite the fact that the actual interactions that provide such stabilization may vary from protein to protein. Comparison with experimental data on thermodynamics, folding kinetics and function of the proteins reveals that such universally conserved clusters correspond to either (i) super-sites or (ii) folding nuclei whose stability is an important determinant of folding rate, or both (in the case of the Rossmann fold). The presented analysis also helps to clarify the relation between folding and function, which is apparent for some folds.

Threading with Explicit Models for Evolutionary Conservation of Structure and Sequence
Anna R. Panchenko, Aron Marchler-Bauer and Stephen H. Bryant
Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA

Due to the rapid evolution of protein sequences, the similarity between proteins is sometimes not evident at the sequence level, although their structures can be quite similar. These remote homologs and analogs comprise the majority of the potential targets of fold recognition. To recognize such distant relationships we have attempted to combine a physically plausible contact-based potential with quantitative descriptions of evolutionary conservation within protein families. First, we deliberately constructed a benchmark containing cases spanning different ranges of difficulty for fold recognition. Then, for each protein from our test set we defined a position-specific score matrix based on multiple sequence alignments and the conserved core elements from multiple structure-structure superpositions. Threading was done using a core element threading algorithm, which did not allow gaps within the core elements. The performance of the combined scoring function was measured relative to the contribution of the contact and sequence conservation terms to analyze the importance of physical and evolutionary signals. We showed that the largest improvement in threading significance as well as alignment accuracy is observed when the contact and motif terms are combined in equal proportions, in the region where percent identity exceeds 15% and the fraction of conserved contacts is more than 50%. This in turn implies that contact-based and motif-matching scoring functions do indeed complement each other, since interactions encoded in the contact potentials determine the overall protein topology whereas family-specific sequence motifs define the unique protein structure.

Combining Evidence from Different Gene-Structure Prediction Programs

Sanja Rogic(1) Francis Ouellette(2) Alan Mackworth(1)
(1) Computer Science Department, The University of British Columbia, Vancouver, Canada
(2) Center for Molecular Medicine and Therapeutics, The University of British Columbia, Vancouver, Canada

Over the last decade many programs have been developed for computational gene finding. They use different methods to identify gene structure, from basic open reading frame finding to sophisticated machine learning and statistical methods. It has been observed [1] that these different techniques will often correctly predict different elements of the gene, suggesting that they could complement each other, yielding better prediction.

The goal of our ongoing research is to test this hypothesis by combining predictions from two gene finding programs, GENSCAN [2] and FGENES [3]. The programs have been tested on an independent dataset and their predictions are used to build decision trees, which classify predicted exons according to their expected accuracy. High scoring exons are then integrated into a plausible gene structure. Preliminary experiments show that extracting correctly predicted exons from these two programs' predictions could increase the percentage of correctly identified exons by 10% (currently that percentage for each program is around 75%).

In order to further improve gene identification, especially in sequences with multiple genes, we plan to integrate NNPP [4], a promoter finding program, into our system. Low information content around the ATG start site makes it difficult for gene finders to identify initial exons correctly (they are usually predicted as part of internal exons), so gene boundaries are missed and neighbouring genes are joined. NNPP's promoter predictions would give additional evidence for where the 5' end of a gene should be.

An important part of our project is the generation of a non-redundant dataset that excludes sequences used for training GENSCAN or FGENES. It contains 579 human and mouse sequences with either complete or partial genes that have passed all the standard filter procedures for gene-finding datasets.

[1] K. Murakami and T.Takagi. Gene recognition by combination of several gene-finding programs. Bioinformatics, Vol. 14 no.8: 665-675, 1998.
[2] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268: 78-94, 1997.
[3] http://genomic.sanger.ac.uk/gf/gf.html
[4] M. Reese and F. Eeckman. Time-Delay Neural Networks for Eukaryotic Promoter Prediction. In preparation, 1999.

Sequence Annotation by Splash

Ajay K. Royyuru(1), Andrea Califano(1), Gustavo Stolovitzky (1), and Lawrence Shapiro(2)
(1) Computational Biology Center, IBM Thomas J. Watson Research Center, PO Box 704, Yorktown Heights, NY 10598, USA
(2) Structural Biology Program, Department of Physiology and Biophysics, Mount Sinai School of Medicine, 1425 Madison Avenue, New York, NY 10029, USA

The major goal of the Human Genome Project is to obtain the sequences for all genes in the human genome. The Expressed Sequence Tag (EST) sequencing strategy provides an effective means to identify the human "transcriptome". This, however, provides only a raw sequence readout and leaves unaddressed the formidable task of attaching a functional annotation to each new sequence.

Splash is an algorithm for discovering the sets of sequence patterns that characterize a given family of related protein sequences [1]. It is well suited for discovery of sequence signatures in functionally related proteins.

Here we describe a protocol for annotating ESTs using Splash. The protocol comprises the following steps:
1. Gather a set of functionally related protein sequences,
2. Use Splash to identify the set of statistically significant sequence patterns that characterize this functionally related protein family,
3. Analyze and identify the relative order and position of the sequence patterns in all occurrences in the sequence database,
4. Scan the EST database (dbEST) to identify EST sequences that contain these sequence patterns in the prescribed relative order and position.
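
The scanning step (step 4) can be illustrated with a minimal sketch. It assumes that each pattern is written as a fixed-length regular expression (literal residues plus single-position wildcards) and that the EST sequences have already been translated to protein. The function name and the toy patterns are hypothetical and are not the patterns reported by Splash.

```python
import re

def contains_in_order(sequence, patterns):
    """Return True if all patterns occur in the given order without overlapping.

    `patterns` is a list of fixed-length regular expressions (hypothetical
    stand-ins for Splash patterns). For fixed-length patterns, taking the
    leftmost match of each pattern in turn is sufficient to decide whether
    an ordered, non-overlapping arrangement exists.
    """
    pos = 0
    for pat in patterns:
        match = re.compile(pat).search(sequence, pos)
        if match is None:
            return False
        pos = match.end()  # the next pattern must start after this match
    return True

# Toy example: two made-up patterns and a made-up protein fragment.
toy_patterns = ["G.W", "P.N"]
print(contains_in_order("MKAGYWCLLPGNF", toy_patterns))
```

Constraints on the distance between patterns, which the protocol also prescribes, are left out of this sketch.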

We have applied this protocol to several families of proteins and successfully identified new candidates in dbEST. For example, starting with a database of 68 sequences belonging to the C1q/TNF superfamily [2], Splash finds two statistically significant sequence patterns that characterize this family:
These patterns overlap exactly with the regions identified as the structural core responsible for the structural similarity between the non-homologous families of C1q and TNF proteins. On scanning dbEST for occurrence of these patterns, we find 17 sequences already annotated as members of C1q/TNF superfamily. In addition, we identify 6 new (previously unannotated) sequences as potential members of the C1q/TNF superfamily.

The efficiency and deterministic nature of Splash enables the use of this protocol for rapid annotation in high throughput sequencing projects.

[1] A. Califano. SPLASH: Structural pattern localization analysis by sequential histograms. Bioinformatics (Communicated, 1999).
[2] L. Shapiro and P. E. Scherer. The crystal structure of a complement-1q family protein suggests an evolutionary link to tumor necrosis factor. Current Biology, 8:335 - 338 (1998).

A Computer Program for Prediction of Gene Domain on Rice Genome Sequence

Katsumi Sakata (1), Hideki Nagasaki (2), Atsuko Idonuma (2), Kazunori Waki (2), Masaki Kise (3) and Takuji Sasaki (1)
(1) Rice Genome Research Program (RGP), National Institute of Agrobiological Resources, Tsukuba, Japan
(2) Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries, Tsukuba, Japan
(3) Mitsubishi Space Software Co., Ltd., Tokyo, Japan

Rice is one of the major cereal crops and is the principal source of food for about half of the world's population. In terms of genome analysis, it has an advantage over other cereals because it has the smallest genome, estimated at 430 Mb. At the Rice Genome Research Program (RGP), sequencing of the entire genome was launched in 1998 and nearly 1 Mb of genome sequence has already been finished and made publicly available through the DNA Data Bank of Japan (DDBJ) and the RGP home page (http://www.dna.affrc.go.jp:82/). The finished sequences were annotated to determine the potential protein-coding genes and/or gene segments. As a part of the annotation scheme, gene domain prediction programs were used to predict the coding regions and/or biological signals such as splice sites. Some representative programs such as GENSCAN for maize and Arabidopsis were evaluated and were found to be comparatively useful for rice genome sequences. However, the results were not completely satisfactory because some gene candidate regions with similarities to rice cDNAs could not be predicted. We have been developing a new computer program to predict gene domains on rice genome sequence based on a probabilistic model using a catalog of rice ESTs developed at RGP. This catalog is composed of nearly 15,000 cDNAs corresponding to about one third of all rice genes. A prototype version has been completed and evaluated. The program predicts gene candidate regions using the probabilistic scheme of a hidden Markov model (HMM). Among the major features of the program are: (i) a detailed model for the 3' untranslated region using more than 5000 cDNA sequences, and (ii) an algorithm that incorporates some characteristics of the genome sequence as a medium of transmitting and storing data.

EuGene: a Simple yet Effective Gene Finder for Eucaryotic Organisms (Arabidopsis thaliana)

Thomas Schiex, Annick Moisan, Lucien Duret, Pierre Rouze
INRA, Chemin de Borde Rouge, BP 27, Castanet-Tolosan, 31326 Cedex, France

It is standard, in a thorough sequence annotation, to take into account several sources of information in order to locate genes (exons/introns) precisely in eucaryotic sequences. The sources of information exploited typically include matches against databases (EST or protein databases), output of signal prediction software such as NetGene2 or NetStart (www.cbs.dtu.dk/services/{NetGene2,NetStart}) and more or less sophisticated "integrated" gene finding software such as GeneMark.hmm (genemark.biology.gatech.edu/GeneMark/) and/or GENSCAN (gnomic.stanford.edu/~chris/GENSCANW.html).

Along these lines, we have designed a simple, general, efficient and yet effective graph-based approach for gene finding that allows researchers to combine several sources of evidence. For a given sequence, the basic idea is to build a directed acyclic weighted graph such that every possible gene structure is represented by a path in the graph. The weights of the edges of the graph are defined using the available evidence in such a way that a shortest path in the graph corresponds to the gene structure that "best respects" this evidence. A simple linear time, linear space shortest path algorithm such as Bellman's algorithm then outputs the best possible gene structure. The approach is comparable (although not equivalent) to an explicit state duration Hidden Markov Model with uniform duration densities.
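
The shortest-path idea can be sketched as follows. This is a generic dynamic-programming pass over a directed acyclic graph, not EuGene's actual graph construction: the node numbering, the toy edges and the function name are assumptions made for illustration. Nodes are assumed to be numbered in topological order (for a sequence graph, left to right), so a single sweep relaxes every edge once, giving time and space linear in the size of the graph.

```python
def shortest_path_dag(n_nodes, edges, source, target):
    """Shortest path in a DAG whose nodes are numbered in topological order.

    edges: iterable of (u, v, weight) with u < v.
    Returns (distance, path); the path is empty if target is unreachable.
    """
    inf = float("inf")
    outgoing = [[] for _ in range(n_nodes)]
    for u, v, w in edges:
        if not u < v:
            raise ValueError("edges must go from a lower to a higher node index")
        outgoing[u].append((v, w))

    dist = [inf] * n_nodes
    prev = [None] * n_nodes
    dist[source] = 0.0
    for u in range(n_nodes):            # one sweep in topological order
        if dist[u] == inf:
            continue
        for v, w in outgoing[u]:
            if dist[u] + w < dist[v]:   # relax edge u -> v
                dist[v] = dist[u] + w
                prev[v] = u

    if dist[target] == inf:
        return inf, []
    path, node = [], target
    while node is not None:             # walk predecessors back to the source
        path.append(node)
        node = prev[node]
    return dist[target], path[::-1]

# Toy graph (illustrative only): lower weights stand for better-supported choices.
toy_edges = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.5), (1, 3, 5.0), (2, 3, 1.0)]
print(shortest_path_dag(4, toy_edges, 0, 3))
```

In a gene-finding setting the nodes would correspond to positions and states (exon, intron, intergenic) along the sequence, and the recovered path would be read back as a gene structure.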

A first prototype called EuGene has been built that integrates the following sources of information for Arabidopsis thaliana:
- output of five interpolated Markov models (IMM) for respectively frame 1, 2, 3 exons, introns and intergenic sequences (estimated on AraClean v1.1 dataset, www.cbs.dtu.dk/databases/ARACLEAN).
- the output of NetPlantGene and NetGene2 for splice site strength (some parameters used to compute the weights from the output have been estimated on AraClean).
- the output of NetStart for ATG strength (some parameters used to compute the weights from the output have been estimated on AraClean).

The structure and weights of the graph can be defined or modified using a very simple language that allows statements such as "start f1371 0.4" (a forward Start occurs at position 1371 with strength 0.4). Similar statements allow the inclusion of information about acceptors, donors, and exonic/intronic/intergenic state strengths on a per-nucleotide basis. This file is built automatically by a Perl script. Both the file and the Perl script can easily be modified by the user to include other sources of information if desired.

A second version adds to this basic information the results of EST and protein database searches. The current use of this information is still very preliminary: EST hits simply remove intronic edges and protein hits slightly enhance exonic strengths.

This approach has been assessed on "AraSet" (not AraClean!), a recent dataset of precisely annotated DNA sequences of Arabidopsis thaliana that has already been used to assess several existing gene/signal finding pieces of software (see http://sphinx.rug.ac.be:8080/biocomp/GeneComp/index.html, full paper presented at this conference). On this dataset, GeneMark.hmm was the best available software with a gene sensitivity of 40% and a gene specificity of 32%. The first version of EuGene directly yields a gene sensitivity of 57% with a specificity of 48%. Further taking into account EST and protein (SPTR) BLAST hits gives a gene sensitivity of 67% with a specificity of 54%.

This report is very preliminary and we expect to significantly enhance EuGene's effectiveness in the near future (and to apply it to other organisms). Indeed, compared to other gene finding algorithms, EuGene is extremely simple: it uses a linear time algorithm and a single Markov model set, and does not take into account the length of exons/introns or other signals such as polyA or promoters. This should leave room for many improvements.

Learning Hidden Markov Model Topology for Sequence Analysis

Alexander Schliep
ZAIK/ZPR, University of Cologne, Cologne, Germany

Hidden Markov Models (HMMs) are a widely and successfully used tool in statistical modeling and statistical pattern recognition, with gene finding being one of the prime examples in computational biology. One fundamental problem in the application of Hidden Markov Models is finding the HMM's underlying architecture or topology, especially when there is no strong evidence towards a specific choice from the application domain (e.g., when doing black box modeling), or, similarly, when the existence of rarely used or too frequently used states after training suggests that the chosen topology does not fit the data well.

Topology is important with regard to good parameter estimates and with regard to performance: a model with "too many" states - and hence too many parameters - requires too much training data, while a model with "not enough" states prevents the HMM from capturing subtle statistical patterns.

To determine the "optimal" topology, either knowledge from the application domain is used or a trial and error procedure using ad-hoc methods (i.e., model surgery) is employed; systematic procedures have rarely been considered (e.g., Bayesian Model merging, Stolcke and Omohundro). We have developed a novel algorithm that infers an HMM representation of the (ergodic) process generating a sequence, without prespecifying the topology of the model. That is, we infer the number of hidden states, the transitions allowed, and the transition and emission probabilities. We use a Bayesian approach where a suitable prior on one crucial parameter forces generalization (and thus necessarily reduces data likelihood) from the maximum likelihood model.

We will present the algorithm, some of our theoretical results and results from numerical experiments on biological DNA and protein sequence data.

The Identification of Novel Signals Regulating mRNA Translation: Effects of Gene Context

Mark Schreiber and Chris Brown
Department of Biochemistry, University of Otago, P.O. Box 56 Dunedin, New Zealand

It is well known that the context of a gene regulates the efficiency and accuracy of its translation from mRNA to protein. Several elements have already been identified. Initiation of translation is regulated by the Shine-Dalgarno ribosome binding site and the downstream box in many bacteria, or the Kozak consensus in Eukaryotes. The use of a biased subset of codons has been shown to enhance translation in many organisms. The efficiency of termination is also affected by the identity of surrounding nucleotides, such as the residue following Escherichia coli stop codons. Using the TransTerm database developed at the University of Otago, we have identified two putative new signals in Synechocystis sp. PCC6803 that may regulate translation. Unexpectedly, the genes of Synechocystis appear to lack the conventional bacterial Shine-Dalgarno box. Instead, a previously unobserved consensus sequence sandwiches the start codon (CYAUGR) with strong bias at the -2 position. The information content of alignments to the start codons shows this element may be sufficient for recognition by the ribosome. The termination context of Synechocystis is also unusual. In Escherichia coli the identity of the +1 nucleotide (fourth base) is highly biased and affects termination efficiency at stop codons. Conversely, the +1 nucleotide of Synechocystis is biased. Genetic reporter systems are being designed to study the effects of these elements in vivo.
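
As an illustration of what "information content" of an alignment means, the sketch below computes the standard per-position Shannon measure (maximum entropy minus observed entropy, in bits) for a set of aligned, equal-length RNA sequences. It is a generic calculation without small-sample correction and is not necessarily the exact procedure used in this study; the function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def information_content(aligned, alphabet="ACGU"):
    """Per-position information content (bits) of equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    max_bits = math.log2(len(alphabet))      # 2 bits for a 4-letter alphabet
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        total = sum(counts[a] for a in alphabet)
        if total == 0:                       # no usable symbols in this column
            profile.append(0.0)
            continue
        entropy = 0.0
        for a in alphabet:
            if counts[a]:
                p = counts[a] / total
                entropy -= p * math.log2(p)
        profile.append(max_bits - entropy)
    return profile

# Toy alignment around a start codon (made-up sequences).
toy = ["CCAUGA", "CUAUGG", "ACAUGA", "CCAUGG"]
print([round(bits, 2) for bits in information_content(toy)])
```

Fully conserved positions such as the AUG itself score the maximum of 2 bits, while positions with no preference score 0 bits.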

Protein Tertiary Structure Modelling with SWISS-MODEL and the SwissPdbViewer

Torsten F. Schwede, Nicolas Guex & Manuel C. Peitsch
GlaxoWellcome Experimental Research SA, 16 Chemin des Aulx, 1228 Plan-les-Ouates, Geneva, Switzerland

The insights, which a 3-D structure of a protein can provide, are of great assistance during the rational design of mutagenesis experiments. Experimental protein structure determination methods are often hampered by technical difficulties and are time and resource intensive. The number of known 3-D structures of proteins thus only represents a small fraction of the known protein sequences. In this context it is not surprising that theoretical approaches have been explored, of which comparative protein modelling is by far the most reliable.

SWISS-MODEL and the Swiss-PdbViewer
We have developed an environment for comparative protein modelling that consists of SWISS-MODEL (http://www.expasy.ch/swissmod/), a server for automated comparative protein modelling, and of the Swiss-PdbViewer (http://www.expasy.ch/spdbv/) [1]. The Swiss-PdbViewer not only acts as a client for SWISS-MODEL, but also provides a large selection of structure analysis and display tools. The software framework of the SWISS-MODEL server can be used to generate large collections of protein models. During the 3DCrunch of 1997, a very large scale modelling experiment, 64,000 sequences from the SWISS-PROT and trEMBL databases were modelled by SWISS-MODEL [2], and more than 30,000 requests per year are received via the WWW interface. By making such tools freely available to the scientific community, we hope to make protein modelling accessible to biochemists and molecular biologists worldwide.

Recent Improvements (version 3.5)
The SwissModel server version 3.5 provides better stability and overcomes several limitations of earlier versions of the ExPDB template database. On demand, a request can be forwarded to the PredictProtein secondary structure prediction server [3] or the 3DPSSM fold recognition server (http://www.bmm.icnet.uk/~3dpssm/). The quality of the final model is evaluated by WhatCheck [4] and a detailed report is sent back, together with a project file containing the template structures and the underlying structural alignment. The close integration of SPDBV and SwissModel allows high flexibility in the submitted requests, including the use of your own template structures. The functionality of the Swiss-PdbViewer, which is used as the graphical user interface to SwissModel, has been extended. It provides a clear graphical display (OpenGL, supporting hardware stereo) and several tools for model building and analysis, e.g. energy minimization and surface representations. The direct server connection allows structures and sequences to be imported from different databases. SPDBV is a full sequence-to-structure workbench, running on PC, Linux, Macintosh and SGI [5].
1. Guex, N. & Peitsch, M.C. (1997). SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modelling. Electrophoresis, 18, 2714-2723.
2. Peitsch M.C. & Guex N. (1997) Large-scale comparative protein modelling. in: Proteome research: new frontiers in functional genomics, p. 177-186, Wilkins MR, Williams KL, Appel RO, Hochstrasser DF eds., Springer.
3. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. in Enzym., 266, 525-539.
4. Hooft, R.W.W., Vriend, G., Sander, C. & Abola, E.E., (1996). Errors in protein structures. Nature 381, 272-272.
5. Guex N, Diemand A and Peitsch M.C. (1999) Protein modelling for all. TiBS, 24, 364-367.

A Database of Remote Homologue Clusters

Lorenzo Segovia and Ricardo Ciria
Instituto de Biotecnología. UNAM, Mexico

Several approaches have been undertaken to study structure and function relationships in proteins. Koonin et al. have created a database of orthologues (Clusters of Orthologous Groups, COGs), classified by similarity and function, based on different genomes from 6 major phylogenetic groups. This effort has been complemented by analyses such as the GeneCensus approach by Gerstein et al. who have studied structure and function relationships in the PDB databank looking for correlations between the SCOP and EC classifications.

Method and Results:
There are around 1400 different entries in the PDB databank corresponding to proteins of known structure with less than 90% identity between them due to the very large number of mutant structures deposited. We used this subset to center our analysis. Considering that homologues share the same fold, the analysis of sequence clusters should allow us to draw general conclusions about each fold in particular. We searched in Swissprot37 using Psi-blast, looking for convergent groups (30 iterations with default parameters) and then purged the hits to leave only sequences with less than 90% identity (using L. Holm's nrdb90 perl script). We then looked for common motifs in each cluster using MEME 2.0 (-mod OOPS -nmotifs 5) and annotated the motifs using the corresponding Swissprot entries.

We are analysing the results from two different points of view. One is to determine which catalytic activities coexist most commonly and on which folds; the other is the conservation of functional motifs in each cluster and the mapping of possibly relevant amino acids in non-annotated remote homologues. Some additional benefits are the creation of a database that could be used for "sequence hopping" in fold recognition and, conversely, the identification of null hits which could be candidates for new folds. We will present the results found so far and some examples of clusters.

Finding Prokaryotic Genes by the "Frame-by-Frame" Algorithm: Targeting Gene Starts and Overlapping Genes

Anton M. Shmatkov, Arik A. Melikyan, Felix L. Chernousko and Mark Borodovsky1
Russian Academy of Science, The Institute for Problems in Mechanics, Moscow 11526, Russia
1School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA

Tightly packed prokaryotic genes frequently overlap with each other. This feature, rarely seen in eukaryotic DNA, makes detection of translation initiation sites and, therefore, exact predictions of prokaryotic genes notoriously difficult. Improving the accuracy of precise gene prediction in prokaryotic genomic DNA remains an important open problem. A software program implementing a new algorithm utilizing a uniform Hidden Markov model for prokaryotic gene prediction was developed. The algorithm analyzes a given DNA sequence in each of six possible global reading frames independently. Twelve complete prokaryotic genomes were analyzed using the new tool. The accuracy of gene finding, predicting locations of protein-coding ORFs, as well as the accuracy of precise gene prediction, detecting the whole gene including translation initiation codon, were assessed by comparison with existing annotation. It was shown that in terms of gene finding, the program performs at least as well as the previously developed tools, such as GeneMark and GLIMMER. In terms of precise gene prediction the new program was shown to be more accurate, by several percentage points, than earlier developed tools, such as GeneMark.hmm, ECOPARSE and ORPHEUS. The results of testing the program indicated the possibility of systematic bias in start codon annotation in several early sequenced prokaryotic genomes. The new gene finding program can be accessed through the web site: http://dixie.biology.gatech.edu/GeneMark/fbf.cgi
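
To make the notion of six global reading frames concrete, the sketch below splits a DNA sequence into codons for each of the three offsets on the forward strand and on the reverse complement. It only illustrates the frame decomposition that precedes any per-frame analysis; the Hidden Markov model itself is not shown, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper- or lower-case ACGT sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def six_frames(seq):
    """Yield (strand, offset, codons) for the six reading frames of seq."""
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept; trailing 1-2 bases are dropped.
            codons = [strand_seq[i:i + 3]
                      for i in range(offset, len(strand_seq) - 2, 3)]
            yield strand, offset, codons

for strand, offset, codons in six_frames("ATGAAATAG"):
    print(strand, offset, codons)
```

Treating each frame as a separate pass is what allows genes that overlap in different frames to be reported independently.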

MetaFam: A Unification of Protein Families

Elizabeth Shoop
Academic Health Center, Computational Biology Centers, University of Minnesota, Minneapolis, MN, USA

We describe MetaFam, a protein family characterization derived from a comprehensive set-theoretic comparison of 10 publicly accessible protein family databases (BLOCKS, DOMO, Pfam, PIR, PRINTS, PROSITE, ProDom, PROTOMAP, SBASE, and SYSTERS). Families of one database are matched to those in another when the overlap in their membership is maximal. Pairwise family matches are drawn together transitively to create a new list of protein family supersets. These supersets have several advantages: (1) our supersets contain the most members, because each of the component family databases works with a subset of our full non-redundant set of proteins; (2) questionable assignments of individual family databases can be found quickly, since our analysis identifies individual members that are in conflict with the majority consensus; (3) family descriptions which may be absent from automated databases can now be assigned; (4) statistics have been computed comparing domain boundaries, family superset/subset relationships, and domain associations; (5) the supersets have been loaded into an Oracle database to allow for complex queries and visualization of the connections between families in a superset, and the consensus of individual members. Public access to the data is available through our web site http://metafam.ahc.umn.edu/.
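
The matching and transitive grouping can be sketched as below. The sketch assumes each database is given as a mapping from family name to a set of protein identifiers, uses raw intersection size as the overlap measure, and merges matches with a union-find structure; these choices, the function names and the toy data are assumptions for illustration, not the MetaFam implementation.

```python
def best_match(family, candidates):
    """Name of the candidate family sharing the most members, or None."""
    best_name, best_overlap = None, 0
    for name, members in candidates.items():
        overlap = len(family & members)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name

def build_supersets(databases):
    """databases: dict db_name -> dict family_name -> set of protein IDs.

    Returns a sorted list of groups; each group is a sorted list of
    (db_name, family_name) pairs linked directly or transitively.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for db_a, families_a in databases.items():
        for fam_a, members_a in families_a.items():
            find((db_a, fam_a))             # register unmatched families too
            for db_b, families_b in databases.items():
                if db_b == db_a:
                    continue
                match = best_match(members_a, families_b)
                if match is not None:
                    union((db_a, fam_a), (db_b, match))

    groups = {}
    for key in parent:
        groups.setdefault(find(key), set()).add(key)
    return sorted(sorted(group) for group in groups.values())

# Toy data: three made-up databases with integer protein IDs.
toy = {
    "A": {"a1": {1, 2, 3}, "a2": {7, 8}},
    "B": {"b1": {2, 3, 4}, "b2": {8, 9}},
    "C": {"c1": {4, 5}},
}
for group in build_supersets(toy):
    print(group)
```

In the toy data, family c1 shares no member with a1 but is joined to it through b1, which is the transitive step described above.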

A Novel Gene within the Neisserial Division and Cell Wall Synthesis Gene Cluster

Lori A. Snyder (1) and William M. Shafer (1,2)
(1) Department of Microbiology and Immunology, Emory University School of Medicine, Atlanta, GA, 30322, USA
(2) Laboratories of Microbial Pathogenesis, VA Medical Center, Decatur, GA, 30033, USA

During a screen of the University of Oklahoma Gonococcal Genome Sequencing Project database for additional binding sites of the transcriptional regulator MtrR, a putative binding site associated with the Division and Cell Wall (DCW) Synthesis Cluster of Neisseria gonorrhoeae was identified. Sequence analysis of the gonococcal DCW gene cluster revealed its strong homology with the DCW clusters previously described in Escherichia coli, Haemophilus influenzae and Bacillus subtilis. The differences between these clusters, and the DCW cluster located within the Neisseria meningitidis serogroup A sequence strain Z2491, from the Sanger Centre Neisseria meningitidis Genome Sequencing Project, are presented here. Sequence comparison revealed notable differences between the gonococcal and meningococcal DCW clusters and those of other bacteria. These include the addition of at least three open reading frames, the largest of which, orfA, has been selected for further study. Genome sequence comparison highlights this reading frame as unusual in that it is inserted into a region that is normally highly conserved in terms of homology, gene organization and presumably essential function across both Gram-negative and Gram-positive species. The results of comparative sequence analysis, cloning and expression of the protein encoded by orfA and the results of knock-out experiments will be presented.

WEIGHBOR: Fast and More Accurate Distance-Based Phylogeny Reconstruction

Nicholas D. Socci [1], Aaron L. Halpern [2] and William J. Bruno [3].
[1] The Rockefeller University, New York, NY 10021, USA
[2] University of New Mexico, Albuquerque, NM 87131, USA
[3] Los Alamos National Laboratory, Los Alamos, NM 87574, USA

Sequence analysis using multiple sequences presupposes relationships among the sequences. Any rigorous statistical analysis requires that the evolutionary tree be reconstructed so that it can be taken into account. Maximum Likelihood tree reconstruction would ideally be used to build the tree, but it is too slow to be used on large alignments.

We introduce a new, weighted neighbor-joining method called WEIGHBOR. This method uses weights that accurately reflect the exponential increase of variances and covariances with distance. The weights are used both in determining which pair is joined and in computing branch lengths.

Tests show that WEIGHBOR is superior to other methods (Maximum Parsimony, Neighbor Joining, BIONJ, and Fitch-Margoliash) in avoiding the "long branches attract" bias. WEIGHBOR also does not suffer from "long branch distracts," which causes unnecessary errors in trees built by Neighbor Joining and BIONJ. WEIGHBOR is much faster than the Fitch-Margoliash or Maximum Likelihood methods on large problems, and can easily handle hundreds of sequences. WEIGHBOR is much more efficient than Neighbor Joining and BIONJ, and in our tests is 80% to 95% as efficient as Maximum Likelihood.

Visit www.t10.lanl.gov/billb/weighbor to download the program.

Genomic Signature: Short DNA Fragments are Eligible

Alexandra Vaury, Alain Giron, Joseph Vilain, Bernard Fertil and Patrick Deschavanne
INSERM - U 494 - CHU Pitié-Salpêtrière, 91 boulevard de l'hôpital, 75634 Paris cedex 13 - France

The recent availability of long and even complete genomic sequences opens a new field of research devoted to the general analysis of their global structure, without regard to gene interpretation. Our approach takes advantage of the CGR (Chaos Game Representation), modified here to allow for quantification, which produces pictures displaying usage, in terms of frequencies, of words (small sequences of up to 8 nucleotides) and revealing nested patterns in DNA sequences. It has proved to be a quick and robust method for extracting information from long DNA sequences, allowing comparison of sequences and detection of anomalies in word frequency. We observed that subsequences of a genome exhibit the main characteristics of the whole genome in such a way that a specific image can be associated with each species and may therefore be considered a genomic signature. The distance between images may quantify phylogenetic proximity. Eukaryotes and Prokaryotes, for example, can be discriminated on the mere basis of their DNA structure. This work addresses two related issues about the genomic signature: (i) how long a DNA fragment must be to yield a worthwhile signature, and (ii) whether there is an optimal length for the words to be analyzed. Sixteen complete genomes (or very long genomic sequences) were sliced into subsequences ranging from 100 kb down to 1 kb. The images obtained from the fragments were compared and classified using a principal component analysis as a preprocessing step (to reduce the amount of information) followed by an unsupervised clustering algorithm. It was found that the origin of most DNA fragments can be properly determined. As a general rule, recognition of fragments improves with the size of the fragments and the length of the words, reaching an almost perfect result with 25 kb fragments and 5-letter words. It thus appears possible to perform global comparison of species by means of genome fragments found in databases.
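
A frequency-based CGR image can be sketched as follows. Each word of length k is mapped to one cell of a 2^k by 2^k grid and the cell stores the relative frequency of that word. The corner assignment of the four nucleotides, the use of Euclidean distance between images, and all function names are assumptions for illustration; conventions differ between CGR implementations and the source does not specify them.

```python
from collections import Counter

# Assumed corner assignment as (x bit, y bit); other conventions exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def word_frequencies(seq, k):
    """Relative frequencies of all overlapping k-letter words made of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(letter in CORNERS for letter in word):
            counts[word] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

def cgr_cell(word):
    """Grid cell (x, y) of a word; the last letter selects the coarsest quadrant."""
    x = y = 0
    for letter in reversed(word):
        bit_x, bit_y = CORNERS[letter]
        x = (x << 1) | bit_x
        y = (y << 1) | bit_y
    return x, y

def cgr_image(seq, k):
    """2^k x 2^k matrix of word frequencies laid out as a CGR image."""
    size = 2 ** k
    image = [[0.0] * size for _ in range(size)]
    for word, freq in word_frequencies(seq, k).items():
        x, y = cgr_cell(word)
        image[y][x] = freq
    return image

def image_distance(img_a, img_b):
    """Euclidean distance between two images of the same size (an assumed metric)."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(img_a, img_b)
               for a, b in zip(row_a, row_b)) ** 0.5

print(image_distance(cgr_image("ACGTACGTAC", 2), cgr_image("AAAAACCCCC", 2)))
```

Images computed this way for fragments of different genomes can then be compared pairwise or passed to a dimensionality reduction and clustering step, as described in the abstract.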

From Genome to Protein Sequence to 3D Structure: Protein Neighbors in Entrez Genomes

Yanli Wang, Tatiana Tatusova, Roman Tatusov, Steven Bryant
National Center for Biotechnology Information,
National Library of Medicine,
National Institutes of Health, Bethesda, MD, USA

A new WWW application is presented that links genomic protein sequences to 3D structures using protein sequence similarity information from BLAST searches. This WWW site is a highly integrated bioinformatics resource. The results were pre-computed for all the proteins from the complete microbial genomes in the Entrez Genomes database, and neighbor relationships to proteins with known 3-dimensional structures were detected. Pairwise sequence alignments are presented graphically and linked to the Cn3D viewer, which displays 3-dimensional structures, sequences, and text sequence alignments simultaneously. In addition, links to MMDB (the Molecular Modeling Database), Entrez's 3D database, provide users with pre-computed structure neighbors from VAST (the Vector Alignment Search Tool); such structure neighbors often identify distant homologs. Recent advances in sequencing efforts have resulted in 22 complete microbial genomes. The majority of the genes have no reliable functional annotations. Searching for well-annotated homologues in the databases, particularly in the structure databases, is an important way to understand the functions of these proteins. In our current neighboring system, across the more than 20 complete genomes, about 20% of the genes have neighbors in the MMDB structure database detected simply by the BLAST algorithm with strict criteria. Entrez's 3D viewer, Cn3D, greatly eases the analysis and visualization of sequence-structure alignments. Sequence and structure comparisons taken together can provide a powerful methodology for functional annotation of microbial proteins. We plan to perform this analysis for complete eukaryotic genomes in the future.

OGI(TM) - Java-Based Software for Gel Analysis

Mark Welsh, Hong Guo, Martin D. Leach
Bioinformatics, CuraGen Corporation, New Haven, CT, USA

Large scale sequencing projects require high quality gel analysis without compromising on speed. To meet such needs, CuraGen has developed OGI(TM) (Open Genome Initiative), a web-based client-server application in Java for high-throughput gel analysis. This client-server design allows an operator, using any web browser, to control processing on many OGI servers, each of which takes output from several sequencers. Currently, OGI supports sequencing on the ABI 377(TM) and MegaBACE(TM) 1000 machines. Within a web browser, the Java applet communicates with the server using RMI (Remote Method Invocation). A multi-threaded Java application on the server schedules CPU-intensive image processing steps. Sequence traces are analyzed using CuraGen's versatile DOLPHIN(TM) trace processor, and then base-called using PHRED (Ewing et al., 1998). OGI has been designed as an open and extensible framework, which will accept new processing steps and whole new data-flows with ease. The ability of OGI to coordinate data processing and analysis using the internet makes it ideal for high-throughput sequencing facilities. OGI's Java and ANSI-C executables will be made available through our web site: www.curagen.com.

This research was supported by a grant from the NIH.

GeneHacker Plus: An Integrated HMM for Bacterial Gene-Finding

Tetsushi Yada, Yasushi Totoki (1), Kenta Nakai (2)
(1) Genome Sciences Center, RIKEN, Japan
(2) Human Genome Center, IMS, University of Tokyo, Japan

The entire genomes of various bacterial species continue to be sequenced at an ever faster pace. Therefore, there is a great need for bacterial gene-finding programs with higher accuracy, especially in defining the translational start sites. We have thoroughly revised our gene-finding program, GeneHacker (Yada & Hirosawa, ISMB 4, 252-260, 1996). Some of its new features are similar to those of a well-known program, GeneMark.hmm (Lukashin & Borodovsky, 1998): it employs a duration-type HMM that incorporates the length distributions of both coding sequences and their upstream spacers, and it automatically defines a class of potentially exogenous coding sequences. One notable feature of our new "GeneHacker Plus" is that an automatically defined model of the ribosome-binding sequence is directly embedded in the HMM, enabling an objective treatment of ribosome-binding sites and removing the need for post-processing. Since GeneHacker Plus extracts predicted coding sequences by successive local searches, it can be safely used with overlapping genes. A preliminary assessment shows that GeneHacker Plus performs better than GeneMark.hmm in terms of the averaged sensitivity and specificity. Furthermore, a version that directly embeds homology modeling in the HMM is under construction.
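
The abstract does not define how the averaged sensitivity and specificity are computed. The sketch below shows one plausible reading, under stated assumptions: genes are represented as (start, end, strand) tuples, a prediction counts as correct only on an exact match, and specificity follows the usage common in gene-finding evaluations (the fraction of predictions that are correct). The function name averaged_accuracy is hypothetical.

```python
def averaged_accuracy(predicted, annotated):
    """Return (sensitivity, specificity, average) for exact gene matches.

    Assumptions (not taken from the abstract):
      - genes are hashable tuples such as (start, end, strand);
      - sensitivity = correct predictions / annotated genes;
      - specificity = correct predictions / predicted genes,
        as commonly used when evaluating gene finders.
    """
    predicted = set(predicted)
    annotated = set(annotated)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Usage with made-up coordinates, for illustration only.
predicted_genes = [(1, 100, "+"), (200, 300, "-"), (400, 500, "+")]
annotated_genes = [(1, 100, "+"), (200, 300, "-"), (600, 700, "+"), (800, 900, "-")]
sens, spec, avg = averaged_accuracy(predicted_genes, annotated_genes)
```

A looser matching rule, for example counting a prediction as correct when only the stop codon agrees, would score start-site errors differently. Exact matching is used here because the abstract emphasizes accuracy in defining translational start sites.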

Please send questions or comments about this site to John D. Besemer, john@amber.biology.gatech.edu