Ensembl, Annotation of Large Metazoan Genomes

Ewan Birney
European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SA, UK

The Ensembl project (based at www.ensembl.org) aims to provide an entirely open suite of data and software for large eukaryotic genomes. Ensembl provides an actively maintained dataset for the human and mouse genomes. All the data produced by Ensembl is placed in the public domain; the software is licensed under an extremely open Apache style license. There are over 20 sites with an externally running copy of the Ensembl web server and two sites with a full installation of the web site and underlying analysis system.
Building Ensembl has required facing many bioinformatics and software engineering challenges, from algorithmic issues for gene prediction, through software engineering challenges for large-scale compute management, to user interface design. In my talk I will introduce the Ensembl database and its uses, at the same time touching on some of the challenges we have met whilst building the system.

Comparative Analysis of Protein Lengths and Amino Acid Usages among the Three Domains of Life

Luciano Brocchieri
Department of Mathematics, Stanford University, Stanford, CA 94305-2125 USA

We analyze variation of protein lengths and amino acid usages in the proteomes from 5 complete eukaryotic genomes, 11 archaeal genomes, and 36 bacterial genomes. Eukaryotic proteins are longer (median range 346-384 aa) than bacterial proteins (260-295 aa), while archaeal proteins are the shortest (generally 230-250 aa). The greater length of eukaryotic proteins probably reflects an intrinsically greater complexity of their structure and function (multifunctionality, intron-exon structure, alternative splicing). Comparing quantile distributions of amino acids, among acidic residues glutamate (Glu) is used consistently more than aspartate (Asp) in most species across the three domains. A few exceptions pertain to prokaryotic species of high G+C genomic content. Among these, Halobacterium sp. exhibits an exceptionally high frequency of Asp, probably as an adaptation to high salt concentrations. The median usage of hydrophobic residues in eukaryotes is lower than in virtually all prokaryotes. In particular, eukaryotes have a lower than expected frequency of isoleucine compared to prokaryotes. Among eukaryotes, human sequences have a lower frequency of asparagine, a phenomenon that might relate to the peculiar absence of runs of this amino acid in human sequences.
The frequencies of amino acids encoded by strong bases {Ala, Gly, Pro} are positively correlated with the genomic G+C content and those encoded by weak bases {Lys, Ile, Phe, Tyr, Asn, but not Met} are negatively correlated. Amino acid usages are also studied and compared restricted to specific functional classes (transcriptional classes, DNA replication and repair, chaperones, etc.), among proteins conserved in all species, and within regions of low or high conservation. Results are discussed in relation to their functional interpretation and their implications for phylogenetic studies.

Predicting Splicing Enhancers

Chris Burge
Department of Biology, Massachusetts Institute of Technology, Cambridge, MA

RNA splicing is an essential step in the expression of most eukaryotic genes. An important goal of research on this process is to determine a set of rules, perhaps encoded in a computer algorithm, that accurately predicts the splicing pattern of primary transcripts.  I will discuss some of our recent work on this problem focusing on:
1) modeling the splicing of short introns in five different organisms (yeast, fly, worm, mustard weed and human);
2) a computational method for predicting which short oligonucleotides function as exonic splicing enhancers and some preliminary experimental data testing the function of candidate enhancer motifs.

Isochore Organization of Mammalian Genomes: Selection or Neutral Evolution?

Laurent Duret, Christian Gautier, Dominique Mouchiroud
Laboratoire de Biometrie et Biologie Evolutive, UMR 5558 – CNRS,
Universite Claude Bernard, 43, Bd du 11 Novembre 1918, 69622 Villeurbanne cedex, FRANCE

Pioneering work by Bernardi and colleagues in the 1970s demonstrated that base composition is spatially structured in mammalian genomes: chromosomes can be seen as mosaics of long (>300 kb) GC-rich and GC-poor fragments called isochores. The sequencing of the human genome confirmed the existence of substantial variations in average GC-content among large fragments (from 33% to 62% G+C), although these isochores do not appear to be as homogeneous as was expected according to Bernardi's model. The isochore structure is correlated with various genomic features, including repeat element distribution, methylation pattern, replication, recombination and, most remarkably, gene density. However, the biological significance of this large-scale variation in GC-content remains highly debated. Does the isochore organization result from selective pressure on base composition, or does it simply reflect a neutral evolutionary process? We will present recent results on the dating of the origin of GC-rich isochores in amniotes, on the relationships between isochores and gene expression patterns, and on the variation of mutation and substitution patterns along chromosomes (analyses of polymorphism data and substitution in pseudogenes). We will discuss the selectionist and neutralist models in the light of these new results.

Integrative Genomics beyond the Genes: Computational Analyses of Pseudogenes and Expression Data

M Gerstein, P Harrison, J Qian, V Alexandrov, P Bertone, R Das, D Greenbaum, R Jansen, W Krebs, N Echols, J Lin, C Wilson, A Drawid, Z Zhang, Y Kluger, N Lan, N Luscombe
Molecular Biophysics & Biochemistry Department, Yale University, New Haven, CT 06520 USA

I will talk about using the properties and attributes of proteins in two different types of large-scale genomic analyses. First, I will survey the occurrence of pseudogenes in several large eukaryotic genomes, focussing on grouping them into families and functional categories and comparing these groupings with those of existing "living" genes. Second, I will talk about using protein categories and features to mine the data from microarray experiments. In particular, I will present a new method of clustering expression timecourses that finds "time-shifted" relationships and also a Bayesian method of predicting subcellular localization from expression data.


J Qian, B Stenger, CA Wilson, J Lin, R Jansen, SA Teichmann, J Park, WG Krebs, H Yu, V Alexandrov, N Echols, M Gerstein (2001). "PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information." Nucleic Acids Res 29: 1750-64

PM Harrison, N Echols, M Gerstein (2001). "Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome." Nucleic Acids Res 29: 818-30

A Drawid, M Gerstein (2000). "A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome." J Mol Biol 301: 1059-75

R Jansen, M Gerstein (2000). "Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins." Nucleic Acids Res 28: 1481-8

Remote Homology Detection and Protein Classification

Nick V. Grishin
Howard Hughes Medical Institute/Dept. of Biochemistry, Rm. L4.247A, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX, 75390-9050 USA

Approaches integrating sequence, structure and functional information with evolutionary considerations have proven to be the most effective for understanding weak similarities between proteins. Several examples of remote homology detection using a combination of computational methods will be discussed. In particular, the power of transitive sequence similarity searches for reliable detection of homologs at close to and below random sequence identity will be illustrated. Several pairs of proteins with statistically supported sequence similarity that adopt different structural folds will be shown.

Testing Hypotheses of Genome Duplication

Austin L. Hughes
Department of Biological Sciences, University of South Carolina, Columbia SC 29208 USA

Though widely cited, the hypothesis of ancient genome duplication, particularly in the case of the vertebrates, has only recently been tested rigorously. The availability of complete or nearly complete genomic sequences for a number of eukaryotic species will greatly facilitate rigorous testing of genome duplication hypotheses. We review evidence from a number of recent tests of these hypotheses, particularly of the hypothesis that the vertebrates underwent two rounds of genome duplication by polyploidization early in their history (the 2R hypothesis). Tests of the 2R hypothesis include the following:
(1) comparison of gene numbers in homologous families between vertebrate and invertebrate genomes;
(2) phylogenetic tests of the hypothesis that genes in apparently duplicated genomic blocks actually duplicated simultaneously, as expected to occur by polyploidization;
(3) phylogenetic tests of the hypothesis that gene duplications in 4-member vertebrate families occurred early in vertebrate history, as predicted by the 2R hypothesis;
(4) phylogenetic tests of the hypothesis that four-member families have the topology of the form predicted under the 2R hypothesis; i.e., a topology of the form (AB) (CD) or two clusters of two.
Application of these approaches to the complete human genome provides no support for the 2R hypothesis. On the contrary, these analyses suggest that genomes are structured in ways that are so far largely unexplored.

A Direct Estimate of Human per Nucleotide Spontaneous Mutation Rates at 20 Loci Causing Mendelian Diseases

Alexey S. Kondrashov
National Center for Biotechnology Information, NIH, 45 Center Drive, MSC 6510, Bethesda, MD 20892, USA

I estimate per nucleotide rates of spontaneous mutations of different kinds in humans from the data on per locus mutation rates and on sequences of de novo loss-of-function alleles at 8 loci causing autosomal dominant and 12 loci causing X-linked diseases. I use only those mutations that surely inactivate the locus, i.e. nonsense nucleotide substitutions and frameshifts. For most of the loci, estimates of the combined rate of all mutations are between 1x10^-8 and 3x10^-8. The coefficient of variation of per nucleotide mutation rates at different loci is much smaller than that of per locus rates, and there is only a slight tendency for loci with high per locus rates to also have high per nucleotide rates. Substitutions are much more common than length difference mutations, and deletions are ~4 times more common than insertions. Mutation hot spots with per nucleotide rates above 10^-6 make only a minor contribution to the overall human mutation. There is a close agreement between direct estimates of per nucleotide mutation rates and their indirect estimates, obtained by comparison of human and chimpanzee pseudogenes. Thus, a human zygote carries at least ~100 de novo mutations, and perhaps over 10 of them are deleterious.
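The closing estimate can be checked with back-of-the-envelope arithmetic. The sketch below assumes a diploid genome of about 6.4 x 10^9 nucleotides, a figure that is not stated in the abstract, and applies the per nucleotide rate range quoted above. The function and constant names are illustrative only.

```python
# Assumption (not from the abstract): a human zygote carries roughly
# 2 x 3.2e9 = 6.4e9 nucleotides (diploid genome).
DIPLOID_GENOME_SIZE = 6.4e9


def expected_de_novo_mutations(rate_per_nucleotide,
                               genome_size=DIPLOID_GENOME_SIZE):
    """Expected number of new mutations per zygote per generation."""
    return rate_per_nucleotide * genome_size


low = expected_de_novo_mutations(1e-8)   # lower end of the quoted range
high = expected_de_novo_mutations(3e-8)  # upper end of the quoted range
print(f"{low:.0f} to {high:.0f} de novo mutations per zygote")
```

Under that assumed genome size, the quoted rate range brackets the order of magnitude of the "~100 de novo mutations" figure given in the abstract.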

Back to RNA World through Comparative Genomics

Eugene V. Koonin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA

With multiple, complete genome sequences now available for representatives of all three primary kingdoms, bacteria, archaea and eukaryotes, a reliable reconstruction of the protein repertoire of the Last Universal Common Ancestor (LUCA) becomes possible. Examination of this ancestral protein set reveals many groups of paralogs, which provides for the reconstruction of even earlier stages of evolution, leading back to the ribozyme-dominated RNA world, from which the first proteins emerged. The results of evolutionary reconstructions for central components of the translation machinery, such as aminoacyl-tRNA synthetases and translation factors, indicate that a substantial diversity of protein domains has evolved even before the modern-type translation system, based on protein catalysts, was established. The reconstruction of pre-LUCA steps of evolution indicates that the first proteins included a small set of low-specificity, multifunctional, RNA-binding and nucleotide-binding domains, which facilitated RNA replication and translation.

Origin and Evolution of Meiosis: Complex Machinery of Sex

John M. Logsdon, Jr.
Department of Biology, Emory University, Atlanta GA 30322

The origin and evolution of sexual reproduction in eukaryotes is a major unsolved puzzle for biology. In particular, questions about the initial establishment of meiosis, the central process by which sexual reproduction proceeds, are largely unanswered. Molecular phylogenetic studies have been initiated for meiotic genes obtained from a wide range of species, focusing mainly on protozoa which represent most of eukaryotic phylogenetic diversity. When in the history of eukaryotes did meiosis arise? The answer to this question will provide a phylogenetic framework for further studies to understand the origin, evolution and function of meiotic sex. To initiate this work, we are studying genes for the eukaryotic homologs of the bacterial recombinase, recA. Two major eukaryotic recA paralogs, RAD51 and DMC1, have been isolated from a diversity of protists. In addition to the RAD51/DMC1 gene duplication serving as a probable marker for the origin of meiosis, the presence of DMC1 in species diverging after the duplication may itself indicate sexuality (with its absence suggesting asexuality). Our results indicate that the RAD51/DMC1 gene duplication occurred early in eukaryotic evolution: prior to the divergence of those protist lineages from which recA genes were obtained. If the RAD51/DMC1 gene duplication is coincident with the origin of meiosis, this indicates that either meiosis is a process ancestral to all extant eukaryotes or that we have not yet sampled key protist species representing early-diverging lineage(s). Giardia lamblia is a putatively deep-branching, possibly asexual species and perhaps the best candidate for a primitively ameiotic group. Surprisingly, it encodes two paralogs of DMC1 but it is also the only species in which RAD51 has not been found. The presence of DMC1 genes indicates that G. lamblia may be cryptically sexual. In fact, starting from the partially-sequenced G. lamblia genome and using a combination of bioinformatic and directed isolation methods, we have sequenced a number of additional meiotic genes from G. lamblia. These and other results have allowed us to begin describing a conserved "core" meiotic machinery in eukaryotes. Progress on the isolation and analysis of additional meiotic genes using bioinformatic and directed efforts will be presented along with some considerations on the origin of the meiotic machinery itself.

From Database Information to Prediction of Protein-DNA and Protein-Protein Interaction

Hanah Margalit
Department of Molecular Genetics and Biotechnology, Faculty of Medicine, The Hebrew University of Jerusalem, P.O.B. 12272, Ein Kerem Jerusalem 91120 ISRAEL

The data accumulated in biological databases present a challenge to extract biological insight from this information, and to use this knowledge in prediction. Here we demonstrate how we have addressed this challenge regarding two major questions in molecular biology, of protein-DNA recognition and of protein-protein interaction. For the protein-DNA recognition problem we extracted information from crystallographically solved protein-DNA complexes and from databases of transcription factors and their binding sites. We demonstrate how these types of information have allowed us to derive quantitative parameters for amino acid-base interaction, which can be used in turn for prediction of transcription factor binding sites in gene-upstream regions. In regard to the protein-protein interaction problem we have demonstrated that characteristic pairs of sequence-signatures can be learned from a database of experimentally determined interacting proteins, where one protein contains the one sequence-signature and its interacting partner contains the other sequence-signature. It is proposed that these correlated sequence-signatures can be used as markers for predicting putative pairs of interacting proteins in the cell.

The Natural History of Domains

Chris Ponting
MRC Functional Genetics Unit, University of Oxford, Department of Human Anatomy and Genetics, South Parks Road, Oxford OX1 3QX, UK

Domains have represented the most persevering units of protein structure throughout evolution. Fusions with other domains, to form repertoires of domain architectures, and other mutational events have contributed greatly to functional innovation. The wealth of sequence and structure data available from diverse species now allows us to trace the propagation of domains from ancient times to the present day. Such studies show that the majority of domain families are demonstrably ancient and that sequence divergence has masked the long-standing heritage of many of the remaining families. Modern sequences even hint at the protein structures of the pre-domain world. The abundance of short repeat-containing domains and, more rarely, inserted motifs argues in favour of the evolution of modern single polypeptide domains from ancient short peptide ancestors. These findings argue that there is a need for domain families to be classified within a hierarchy similar to Linnaeus' Systema Naturae, the classification of species.

Genome Archeology Leading to the Characterization and Classification of Transport Proteins

Milton H. Saier, Jr.
Division of Biology, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA, 92093-0116, USA

In the study of transmembrane transport, molecular phylogeny provides a reliable guide to protein structure, catalytic and noncatalytic transport mechanisms, mode of energy coupling and substrate specificity. It also allows prediction of the evolutionary history of a transporter family, leading to estimations of its age, source, and route of appearance. Phylogenetic analyses, therefore, provide a rational basis for the characterization and classification of transporters.  A universal classification system has been described, based on both function and phylogeny, which has been designed to be applicable to all currently recognized and yet-to-be discovered transport proteins found in living organisms on Earth.

Probabilistic Codes for DNA-Protein Interactions

Gary D. Stormo
Washington University Medical School, St. Louis, MO

The search for a "recognition code" that would allow prediction of high affinity DNA-protein interactions has continued for over two decades. The original hope for a simple, deterministic code was undone by the first few DNA-protein complex structures that were solved crystallographically. But clear preferences for specific combinations of interacting base pairs and amino acids have led to qualitative rules that are used to explain, and sometimes to predict, preferred protein-binding site combinations. At the same time efforts to develop more quantitative relationships have emerged and shown some success. This talk will describe our approach to determine a probabilistic code for the interaction of EGR family zinc finger proteins with DNA binding sites. It will describe the model for interaction that we employ and the similarities and differences with previous models. It will also describe the method we use to obtain the maximum likelihood estimates for the parameters of the model and the status of the current results.

The 2R Hypothesis and the Human Genome Sequence

Kenneth H. Wolfe, Aoife McLysaght, and Karsten Hokamp
Department of Genetics, Smurfit Institute, University of Dublin, Trinity College, Dublin 2, IRELAND

We are investigating whether the draft sequence of the human genome provides support for the 2R hypothesis of two rounds of genome duplication, first proposed by Ohno. Our dataset was release 1.0 of Ensembl (April 2001), which contains 27,615 predicted proteins. After removal of alternative splice variants, highly similar tandem repeat genes, and unmapped genes, we were left with 20,830 proteins encoded by genes that appear on the UCSC Golden Path (Dec. 2000). All-against-all BLASTP searches were carried out on these proteins using a 20-processor Linux cluster. Dot-matrix plots of the results show that the human genome does not contain large duplicated regions on the scale of those found in Saccharomyces cerevisiae or Arabidopsis thaliana. However, many duplicated chromosomal regions can be identified; their number and extent depend greatly on the parameters used to define them. To focus on gene duplications that occurred within the chordate lineage, as envisaged by the 2R hypothesis, we used Drosophila and Caenorhabditis sequences as a heuristic orthology threshold and only searched for duplicated blocks composed of human paralogs that are more similar to each other (by BLASTP E-value) than to their closest invertebrate homologs. We also required a maximum BLASTP expectation value of E ≤ 1e-7, and a maximum gap size of 30 unduplicated genes between any two paralogs making up a block. Using these parameters we find that the human genome contains many more paralogous regions than expected by chance. Ninety-six pairs of large duplicated regions, each containing at least 6 duplicated genes, cover 44% of the genome. These apparently duplicated chromosomal regions in human are statistically significant, as judged by comparisons to computer simulations where gene locations were randomized.
In an independent search, all gene families in human (not just those in duplicated chromosomal regions) were identified for which Drosophila and Caenorhabditis outgroup sequences were known. Phylogenetic trees drawn from these families, followed by molecular clock estimation of the human gene duplication date, showed a small excess of gene duplications with ages roughly 333–583 Mya (0.4–0.7 × the age of the split between human and Drosophila). This may indicate some sort of increased duplication activity at that time.
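The two pairwise filters described in the first paragraph (a maximum E-value of 1e-7, and human paralogs being closer to each other than to their best invertebrate hit) can be sketched as a single predicate. This is a minimal illustration, not the authors' pipeline: the function name, the argument layout and the choice to compare against the smaller of the two invertebrate E-values are all assumptions.

```python
E_VALUE_CUTOFF = 1e-7  # maximum BLASTP expectation value quoted in the abstract


def is_candidate_paralog_pair(e_pair, e_invertebrate_a, e_invertebrate_b,
                              cutoff=E_VALUE_CUTOFF):
    """Return True if two human proteins pass both filters.

    e_pair: BLASTP E-value between the two human proteins.
    e_invertebrate_a, e_invertebrate_b: E-value of each protein's best
        Drosophila or Caenorhabditis hit (float('inf') when there is none).
    A lower E-value means a stronger similarity, so the human pair must
    have a lower E-value than either protein's invertebrate hit.
    """
    closer_than_outgroup = e_pair < min(e_invertebrate_a, e_invertebrate_b)
    return e_pair <= cutoff and closer_than_outgroup
```

For example, a pair with `e_pair=1e-50` and invertebrate hits of `1e-20` and `1e-30` would pass, while a pair with `e_pair=1e-10` and an invertebrate hit of `1e-20` would be rejected as predating the chordate lineage.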

Can Bioinformatics Tackle Signal Transduction?

Igor B. Zhulin
School of Biology, Georgia Institute of Technology, 310 Ferst Drive, Atlanta, GA 30332-0230, USA

One of the major goals of comparative genomics is to predict a biological function for proteins. A variety of bioinformatics tools has been developed and successfully implemented in order to achieve this goal. For many enzymes, not only the mode of action, but also exact substrate specificity can be predicted with confidence. However, function prediction for some classes of proteins is difficult due to their mosaic structure and the presence of highly variable domains. Proteins comprising signal transduction pathways are probably the best example to illustrate such a problem. Current annotation of signal transduction proteins in both prokaryotes and eukaryotes is limited to the identification of a superfamily based on the presence of one highly conserved domain, for example a histidine kinase, a response regulator, a bHLH transcription factor, etc. An approach is presented in which accurate predictions of a biological function are achieved by combining bioinformatics tools, such as sensitive similarity searches (PSI-BLAST), conservation patterns of multiple alignments, and phylogenetics, with current biological knowledge on structure and function of individual proteins and domains. Examples of refined predictions of a biological function will be given for several superfamilies of signal transduction proteins, including histidine kinases, response regulators, chemotaxis transducers and guanylyl cyclases/phosphodiesterases.

Enrichment of Regulatory Signals in Conserved Non-Coding Genomic Sequence

Samuel Levy, Sridhar Hannenhalli and Christopher Workman
Celera Genomics, 45 West Gude Drive, Rockville, MD 20850 USA

Motivation: Whole genome shotgun sequencing strategies generate sequence data prior to the application of assembly methodologies that result in contiguous sequence. Sequence reads can be employed to indicate regions of conservation between closely related species for which only one genome has been assembled. Consequently, by using pairwise sequence alignment methods it is possible to identify novel, non-repetitive, conserved segments in non-coding sequence that exist between the assembled human genome and mouse whole genome shotgun sequencing fragments. Conserved non-coding regions identify potentially functional DNA that could be involved in transcriptional regulation.
Results: Local sequence alignment methods were applied to mouse fragments and the assembled human genome. In addition, transcription factor binding sites were detected by aligning their corresponding positional weight matrices to the sequence regions. These methods were applied to a set of transcripts corresponding to 502 genes associated with a variety of different human diseases taken from the Online Mendelian Inheritance in Man database. Using statistical arguments we have shown that conserved non-coding segments contain an enrichment of transcription factor binding sites when compared to the sequence background in which the conserved segments are located. This enrichment of binding sites was not observed in coding sequence. Conserved non-coding segments are not extensively repeated in the genome and therefore their identification provides a rapid means of finding genes with related conserved regions, and consequently potentially related regulatory mechanisms. Conserved segments in upstream regions are found to contain binding sites that are co-localized in a manner consistent with experimentally known transcription factor pairwise co-occurrences and afford the identification of novel co-occurring TF pairs. This study provides a methodology and further evidence to suggest that conserved non-coding regions are biologically significant, since they contain a statistical enrichment of regulatory signals and pairs of signals that enable the construction of regulatory models for human genes.
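Aligning a positional weight matrix to a sequence region amounts to sliding the matrix along the sequence and scoring each window. The following is a minimal sketch of that idea using log-odds scores. The uniform background, the pseudocount, the threshold and the toy motif are assumptions made for illustration and are not taken from the study.

```python
import math


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    counts: list of dicts, one per motif position, mapping A/C/G/T to counts.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy motif with hypothetical counts (not a real transcription factor).
toy_counts = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 8, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 8},
]
toy_matrix = log_odds_matrix(toy_counts)
hits = scan("TTACGTAACGTT", toy_matrix, threshold=4.0)
```

Enrichment could then be assessed by comparing the hit density inside conserved segments with the density in the surrounding background sequence, which is the comparison the abstract describes.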

Birth of Scale-Free Molecular Networks and the Number of Distinct DNA and Protein Domains Per Genome

Andrey Rzhetsky (1,2), and Shawn M. Gomez (1)
(1)Columbia Genome Center and (2)Department of Medical Informatics, Columbia University, New York 10032, USA

Motivation: Current growth in the field of genomics has provided a number of exciting approaches to the modeling of evolutionary mechanisms within the genome. Separately, dynamical and statistical analyses of networks such as the World Wide Web and the social interactions existing between humans have shown that these networks can exhibit common fractal properties – including the property of being scale-free. This work attempts to bridge these two fields and demonstrate that the fractal properties of molecular networks are linked to the fractal properties of their underlying genomes.
Results: We suggest a stochastic model capable of describing the evolutionary growth of metabolic or signal-transduction networks. This model generates networks that share important statistical properties (so-called scale-free behavior) with real molecular networks. In particular, the frequency of vertices connected to exactly k other vertices follows a power-law distribution. The shape of this distribution remains invariant to changes in network scale: A small subgraph has the same distribution as the complete graph from which it is derived. Furthermore, the model correctly predicts that the frequencies of distinct DNA and protein domains also follow a power-law distribution. Finally, the model leads to a simple equation linking the total number of different DNA and protein domains in a genome with both the total number of genes and the overall network topology.
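Written out, the power-law statement says that the fraction of vertices with exactly k neighbours decays as a power of k. The exponent \gamma and the normalising constant C are symbols introduced here for illustration; the abstract does not give their values.

```latex
P(k) = C\,k^{-\gamma}, \qquad k = 1, 2, \dots, \qquad \gamma > 0,
\qquad\text{so that}\qquad
\log P(k) = \log C - \gamma \log k .
```

Scale invariance follows directly from this form: for any factor a > 0, the ratio P(ak)/P(k) = a^{-\gamma} does not depend on k, so the distribution looks the same on a log-log plot whatever range of k is inspected.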

Availability: MatLab (MathWorks, Inc.) programs described in this manuscript are available on request from the authors.
Contact: ar345@columbia.edu

Clustering Protein Sequences - Structure Prediction by Transitive Homology

Eva Bolten, Alexander Schliep, Sebastian Schneckener, Dietmar Schomburg, and Rainer Schrader
ZPR/ZAIK, University of Cologne, Weyertal 80, 50931 Cologne, GERMANY

It is widely believed that for two proteins A and B a sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary condition for structural similarity, the question remains what other criteria can be used to identify remote homologues. Transitivity refers to the concept of deducing a structural similarity between proteins A and C from the existence of a third protein B, such that A and B as well as B and C are homologues, as ascertained if the sequence identity between A and B as well as that between B and C is above the aforementioned threshold. It is not fully understood whether transitivity always holds and whether it can be extended ad infinitum. We developed a graph-based clustering approach, where transitivity plays a crucial role. We determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm. This data was transformed into a weighted directed graph, where protein sequences constitute vertices and weights correspond to alignment scores effectively scaled by sequence length. This asymmetric distance and the subsequent clustering based on strongly connected components seem to be robust with respect to problems caused by increased noise levels in larger databases or multidomain proteins. The method was evaluated on two releases from SCOP and showed a drastic improvement over pair-wise comparisons in terms of detecting remote homologues. We also discuss a very favorable comparison with PSI-Blast.
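Clustering by strongly connected components can be illustrated with a small, generic sketch. It uses Kosaraju's algorithm on a thresholded directed graph; the threshold, the score dictionary and the function names are assumptions for illustration and do not reproduce the authors' scaling of alignment scores.

```python
from collections import defaultdict


def strongly_connected_components(vertices, edges):
    """Kosaraju's algorithm; edges is an iterable of (source, target) pairs."""
    forward, backward = defaultdict(list), defaultdict(list)
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)

    # Pass 1: record vertices in order of DFS completion on the forward graph.
    visited, finish_order = set(), []
    for root in vertices:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(forward[root]))]
        while stack:
            node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    advanced = True
                    break
            if not advanced:
                finish_order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for root in reversed(finish_order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = [], [root]
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in backward[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components


def cluster(scores, threshold):
    """scores maps (query, subject) to a length-scaled alignment score."""
    vertices = sorted({v for pair in scores for v in pair})
    edges = [pair for pair, score in scores.items() if score >= threshold]
    return strongly_connected_components(vertices, edges)


# Hypothetical asymmetric scores between four sequences.
toy_scores = {("A", "B"): 0.9, ("B", "A"): 0.8, ("B", "C"): 0.7,
              ("C", "B"): 0.75, ("C", "D"): 0.9, ("D", "C"): 0.2}
clusters = cluster(toy_scores, threshold=0.5)
```

In the toy data, A and C end up in the same cluster through B although they share no direct edge, which is the transitive step the abstract describes. D stays separate because its link to C holds in one direction only, showing how the asymmetric score can keep a one-sided match out of a cluster.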

Detection of cis-element clusters in higher eukaryotic DNA

M.C. Frith(1), U. Hansen(2), and Z. Weng(3)
(1) Bioinformatics Program, Boston University, 44 Cummington St., Boston, MA 02215, USA;
(2)Department of Biology, Boston University, 5 Cummington St., Boston MA 02215, USA;
(3)Department of Biomedical Engineering, Boston University, 44 Cummington St., Boston, MA 02215, USA

Motivation: Computational prediction and analysis of transcription regulatory regions in DNA sequences has the potential to accelerate greatly our understanding of how cellular processes are controlled. We present a hidden Markov model based method for detecting regulatory regions in DNA sequences, by searching for clusters of cis-elements.
Results: When applied to regulatory targets of the transcription factor LSF, this method achieves a sensitivity of 67%, while making one prediction per 33 kb of non-repetitive human genomic sequence. When applied to muscle specific regulatory regions, we obtain a sensitivity and prediction rate that compare favorably with one of the best alternative approaches. Our method, which we call Cister, can be used to predict different varieties of regulatory region by searching for clusters of cis-elements of any type chosen by the user. Cister is simple to use and is available on the web.
Availability: http://sullivan.bu.edu/~mfrith/cister.shtml
Contact: mfrith@bu.edu; zhiping@bu.edu

Model-Based Clustering and Data Transformations for Gene Expression Data

Ka Yee Yeung, Walter L. Ruzzo
University of Washington, Department of Computer Science, Box 352350, Seattle, WA 98195 USA

Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions, such as multivariate normal distributions. The issues of selecting a "good" clustering method and determining the "correct" number of clusters are reduced to model selection problems in the probability framework.
We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model.

Prediction of disulfide connectivity in proteins

Piero Fariselli and Rita Casadio
CIRB Biocomputing Unit, Laboratory of Biophysics, Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, ITALY

Motivation: A major problem in protein structure prediction is the correct location of disulfide bridges in cysteine-rich proteins. In protein-folding prediction, the location of disulfide bridges can strongly reduce the search in the conformational space. Therefore the correct prediction of the disulfide connectivity starting from the protein residue sequence may also help in predicting its 3D structure.
Results: In this paper we equate the problem of predicting the disulfide connectivity in proteins to a problem of finding the graph matching with the maximum weight. The graph vertices are the cysteine residues that form disulfide bridges, and the edge weights are contact potentials. In order to solve this problem we develop and test different residue contact potentials. The best performing one, based on the Edmonds–Gabow algorithm and Monte-Carlo simulated annealing, reaches an accuracy significantly higher than that obtained with a general mean force contact potential. Significantly, in the case of proteins with four disulfide bonds in the structure, the accuracy is 17 times higher than that of a random predictor. The method presented here can be used to locate putative disulfide bridges in protein-folding.
Availability: The program is available upon request from the authors.
Contact: Casadio@alma.unibo.it; Piero@biocomp.unibo.it
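The matching formulation above can be illustrated with a small sketch. It is not the authors' implementation: it enumerates every perfect matching by brute force instead of using the Edmonds–Gabow algorithm, which is only practical for a handful of cysteines, and the cysteine positions and pair scores are made up for illustration.

```python
from itertools import combinations

def perfect_matchings(items):
    """Yield every way to pair up an even-length list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail

def best_connectivity(cysteines, weight):
    """Return the pairing whose summed edge weight is largest."""
    return max(perfect_matchings(list(cysteines)),
               key=lambda m: sum(weight[frozenset(p)] for p in m))

# Hypothetical cysteine positions and made-up contact scores.
cys = [3, 15, 27, 40]
w = {frozenset(p): 0.1 for p in combinations(cys, 2)}
w[frozenset((3, 27))] = 0.9
w[frozenset((15, 40))] = 0.8

print(best_connectivity(cys, w))   # [(3, 27), (15, 40)]

# Number of candidate patterns for four bridges (eight cysteines).
n_patterns = sum(1 for _ in perfect_matchings(list(range(8))))
print(n_patterns)                  # 105
```

With four disulfide bonds there are 105 possible connectivity patterns, so a predictor that guesses uniformly among them (an assumption about what "random" means here) would be right a little under 1% of the time.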

DIANA-EST: a statistical analysis

Artemis G. Hatzigeorgiou(1,2), Petko Fiziev(1) and Martin Reczko(2)
(1)Metagen GmbH, Ihnestr.63, 14195 Berlin, Germany and (2) Synaptic Ltd, Science and Technology Park of Crete, PO Box 1447, Voutes Heraklion, 71110 Greece

Motivation: Expressed Sequence Tags (ESTs) are, next to cDNA sequences, the most direct way to locate genes in the genome in silico and determine their structure. Currently ESTs make up more than 60% of all the database entries. The goal of this work is the development of a new program called DNA Intelligent Analysis for ESTs (DIANA-EST), based on a combination of Artificial Neural Networks (ANN) and statistics, for the characterization of the coding regions within ESTs and the reconstruction of the encoded protein.
Results: 89.7% of the nucleotides from an independent test set with 127 ESTs were predicted correctly as to whether they are coding or non coding.
Availability: The program is available upon request from the author.
Contact: Present address: Department of Genetics, University of Pennsylvania, School of Medicine, 475 Clinical Research Building, 415 Curie Boulevard, Philadelphia, PA 19104-6145, USA. artemis@pcbi.upenn.edu.

A Biosystems Network Ontology Based on Petri Nets

John Ambrosiano(1) and Joseph S. Oliviera (2)
(1)Los Alamos National Laboratory and (2)Pacific Northwest National Laboratory, USA

Complex biological systems on many levels, from genetic regulatory networks to communities of cells and organisms, can be viewed conceptually as self-regulating control networks. Unfortunately in biology, the diversity of interpretations that must be applied to this simple concept is enormous. This introduces substantial practical difficulties in designing ontologies for biological networks because we want them to be general enough to accommodate a broad range of interpretations, and yet still support data structures that can be customized to specific bioinformatics applications. While there are many good efforts underway to define knowledge ontologies for systems biology [1], we believe that the key to eventual success, that is a truly generic conceptual framework for biosystems networks, remains a challenge.
We will describe a conceptual framework under development for biosystems ontologies based on Petri nets. In the past, Petri nets have been applied successfully in the analysis of complex networks occurring in a number of settings such as parallel computing and manufacturing-distribution systems. Recently, Petri net models have also been applied to biomolecular networks [2,3].
The formal Petri net model has a number of features that appear ideal for capturing fundamental relationships in control systems; and models of biosystems ranging from reaction kinetics to logic circuits seem to map onto them well. Earlier work in "event nets," as these systems were once called, suggests that category theory may provide the formal basis on which to build useful mappings for ontology interchange. This can in turn provide a solid foundation for generic, object-oriented implementations of bioinformatics software that would be capable of handling the complex and diverse data sets expected to emerge from rapidly expanding research in systems biology.


1. M. Hucka, A. Finney, H. Sauro, H. Bolouri, "Introduction to the Systems Biology Workbench," California Institute of Technology (2001). See: www.cds.caltech.edu/erato/the_project.html

2. Joseph S. Oliviera, Colin G. Bailey, Janet B. Jones-Oliveira, and David Dixon, "An algebraic-combinatorial model for the identification and mapping of biochemical pathways," to appear in Bull. Math. Biol.

3. Peter J.E. Goss, Jean Peccoud, "Quantitative modeling of stochastic systems in molecular biology by using stochastic Petri nets," PNAS, 95, 6750, June 1998.

Prediction of Structure, Function and Evolution of a Putative β-xylosidase in Escherichia coli using Bioinformatic Techniques

Anuradha R, L. O. Ingram and J. F. Preston
Department of Microbiology and Cell Science, IFAS, University of Florida, Gainesville FL 32611

The main objective of this work is to explore, through bioinformatics, the potential of the putative E. coli gene (yagH) to express a functional enzyme, β-xylosidase. Statistically based sequence similarity methods are often used to predict protein function. The yagH gene shares 52% homology with a 56 kDa functional β-xylosidase (xynB) present in B. pumilus. There are at least 10 other xylosidases/arabinofuranosidases (known and putative) that share homologous domains with yagH and xynB, also belonging to Class 43 of glycosyl hydrolases. The xynB (β-xylosidase/α-arabinosidase) from Butyrivibrio fibrisolvens, belonging to GH43, has been shown to cleave the glycosidic bond with an inversion of anomeric configuration. Glycosidases in this class have similar catalytic residues and hydrolysis occurs with inversion at the anomeric carbon. Based on the Pfam classification of proteins, these enzymes can be divided into three domains. The region containing amino acids 127 to 309 includes the catalytic domain of known enzymes in this family. The evolutionary and functional implications of the domain architectures were analysed using phylogenetic bootstrapped NJ trees. The gene yagH is flanked on one side by yagG (a putative permease) and on the other side by yagI (a putative transcriptional regulator). Computational predictions strongly suggest that the transcription unit yagG_yagH comprises an operon. Codon usage and factorial correspondence analysis of E. coli genes show that the yagH gene belongs to the class III cluster, which in turn strongly indicates inheritance by horizontal gene transfer, probably from a Bacillus species. Predicted values of free energy of folding, isoelectric point and linear charge density, based upon primary structure, are similar for B. pumilus and E. coli, implying similar structure and function.
Most functional restraints on evolutionary divergence operate at the level of tertiary structure and hence 3 dimensional structures are more conserved in evolution than are sequences. In the absence of solved structures, the ROSETTA method (which accounts for both local and non-local interactions) was used to generate tertiary structures for short peptides of the catalytic domains with the lowest free energy minimum. The structures generated have similar secondary structure and folding patterns, allowing similar catalytic activity. Thus all data generated through computational predictions indicate that yagH encodes a functional β-xylosidase.

Reconstructing ORFs for the EST and mRNA Assemblies in the AllGenes Gene Index Project

Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Li Li, Christian Stoeckert
Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA, USA

AllGenes (www.allgenes.org) is a gene index project created at the University of Pennsylvania based on EST and mRNA sequences. Currently the AllGenes project contains up to a million assembled nucleic acid sequences (assemblies) for mouse and human. These assembly sequences were generated from the set of EST and mRNA in GenBank (as of August, 2001) using the CAP4 program (Paracel). To provide Open Reading Frames (ORFs) for AllGenes assemblies, four different programs for ORF reconstruction were compared to select the one with the best combination of running time and accuracy, and the framefinder program (www.hgmp.mrc.ac.uk/~gslater/estateman/framefinder.html) was chosen for further use. We have developed a statistical model for assessing the a posteriori significance of ORF reconstruction based on the nucleic acid and ORF lengths. This gives us the ability to identify poor ORFs and thus reduce the noise from pseudo-coding regions as well as assess the significance of ORF length. The statistic is based on the Bernoulli extreme value model with Poisson approximation, similar to the p-value statistic implemented in BLAST.
We ran framefinder on 363520 mouse assemblies consisting of 71709 non-singletons and 291811 singletons (assemblies with one input sequence). It took 56 hours to reconstruct and submit mouse ORFs to the GUS database underlying AllGenes on a Dell Dual Pentium III 450 MHz. We identified significant (p<0.05) ORFs in 50934 cases. Approximately half of these ORFs start with methionine (corresponding to the start ATG codon). To evaluate the quality of the ORFs obtained and validate the results of framefinder, we performed blastx and blastp similarity searches against nrdb (ncbi.nlm.nih.gov) for nucleic acid sequences (assemblies) and ORFs, respectively. For that we restricted our attention to the ORFs with p-values less than 0.05. We found that in 98% of cases the blastp subjects for ORFs were consistent with the blastx subjects for this subclass. This high degree of consistency provides an internal check of the validity of the translations. Cases where ORFs had no homology to known proteins (11% of the total set with p<0.05) are therefore not likely to be artifacts. We did identify at least some cases in which trivial translation is more efficient than using framefinder. A range of illustrative examples is presented underlining the features, caveats and advantages of framefinder application.
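The abstract does not give the formula behind its significance statistic, so the following is only a sketch of one common way to build a length-based p-value from a Bernoulli model with a Poisson approximation. It assumes all 64 codons are equally likely (so a codon is a stop with probability 3/64) and counts the three forward frames only; the function name and parameters are hypothetical.

```python
import math

STOP_FRACTION = 3 / 64  # assumption: uniform codon usage

def orf_p_value(seq_len_nt, orf_len_codons, stop_fraction=STOP_FRACTION):
    """Approximate P(a random sequence contains a stop-free run >= L codons)."""
    # Probability that L consecutive codons contain no stop codon.
    open_run = (1 - stop_fraction) ** orf_len_codons
    # Expected number of such runs: roughly one candidate start per
    # nucleotide (three frames of N/3 codons), each of which must follow
    # a stop codon and then stay open for L codons.
    expected = seq_len_nt * stop_fraction * open_run
    # Poisson approximation: P(at least one run) = 1 - exp(-expected).
    return -math.expm1(-expected)

for codons in (100, 200, 300):
    print(codons, orf_p_value(900, codons))
```

Under these assumptions a 100-codon ORF in a 900-nucleotide assembly is unremarkable, while a 200-codon ORF falls well below the 0.05 cutoff used in the abstract.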

Gene Finding Applications of New Models of RNA-mRNA Interactions

John Besemer, Alex Lomsadze and Mark Borodovsky
School of Biology, Georgia Institute of Technology, Atlanta, Georgia, 30332-0230 USA

Binding of functional RNAs to mRNA is an important feature of numerous cellular processes including translation initiation, intron splicing and polyadenylation. Typically, multiple sequence alignment algorithms, such as Gibbs sampling and simulated annealing, or maximum likelihood approaches, such as hidden Markov models, are used to elucidate the motifs that play a role in these RNA-RNA interactions. The result of these methods is typically a position specific frequency matrix of nucleotide composition. While models based on these methods have been useful towards the improvement of gene predictions, they are not adequate for understanding the in vivo mechanisms. Here we propose a new approach towards building models for these sites, focusing on the binding between the 16S rRNA and mRNA. These new models were used to improve gene finding accuracy in the most recent versions of GeneMark.hmm and GeneMarkS.

Robust Cluster Analysis of DNA Microarray Data: An Application of Nonparametric Correlation Dissimilarity

David R. Bickel
Medical College of Georgia, Office of Biostatistics and Bioinformatics 1120 Fifteenth St., AE-3037, Augusta, GA 30912-4900 USA

Several methods have been proposed for using expression data to classify genes into biologically meaningful groups. Although some of these techniques do not explicitly specify their assumptions about the data, the success of each method depends on how well its underlying model describes the patterns of expression. Outlier-resistant and distribution-free clustering of genes can be performed with nonparametric measures of the (dis)similarity of expression values such as intensity ratios or average differences; e.g., a simple robust measure of the similarity between genes is Spearman's correlation, R, computed by ranking the values across microarrays. A dissimilarity metric is then defined as the Euclidean distance D=sqrt{1-[(R+1)/2]^C} or D=sqrt{1-[abs(R)]^C}, C>0. (A distance between the vectors of ranks can also quantify dissimilarity.) Given a (dis)similarity measure, genes can be clustered by optimizing the sum of (dis)similarities of each gene from the closest of k central genes. Each cluster is then described by the range of data for each microarray for a fixed proportion of genes in that cluster that are closest to its central gene; error bars can similarly be computed for other ways of clustering around k central objects. These methods are applied to the data of DeRisi et al. (1997), with an evaluation of the performance of D relative to the analogous distance based on Pearson's correlation of the logs of expression ratios. Such methods are generally applicable to other types of data.
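The two dissimilarity formulas above can be written out as a short sketch. The ranking uses average ranks for ties, which the abstract does not specify, and the expression values are invented; profiles with no variation across microarrays would need separate handling because the correlation is undefined for them.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's R: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def dissimilarity(x, y, c=1.0, signed=True):
    """D = sqrt(1 - [(R+1)/2]^C) if signed, else D = sqrt(1 - |R|^C)."""
    r = max(-1.0, min(1.0, spearman(x, y)))  # guard against rounding
    base = (r + 1) / 2 if signed else abs(r)
    return math.sqrt(max(0.0, 1 - base ** c))

# Invented expression values across four microarrays.
gene_a = [1.0, 2.0, 3.0, 50.0]   # 50.0 is an outlier
gene_b = [10.0, 20.0, 30.0, 31.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

print(dissimilarity(gene_a, gene_b))
print(dissimilarity(gene_a, gene_c))
print(dissimilarity(gene_a, gene_c, signed=False))
```

Because only ranks enter the calculation, the outlier in gene_a has no effect: gene_a and gene_b are at distance close to 0. The signed form puts the anticorrelated gene_c at distance close to 1, while the absolute-value form treats it as close to gene_a.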

A Distributed Protein Visualization Application

Tolga Can
Department of Computer Science University of California, Santa Barbara, CA, USA

Protein visualization has become increasingly popular, especially since the completion of the Human Genome Project. Although several visualization software packages are available to scientists, few address the aspect of collaboration, e.g. simultaneous access to the same protein model. Most of the current systems are standalone applications, and researchers have to share their ideas by exchanging snapshots of the protein models.
We have developed a distributed protein visualization application, in which a protein molecule can be viewed synchronously by many users in different geographical locations. Our system provides the 3D representations found in many of today's protein visualization systems. These representations include: backbone model, balls-sticks model, space-fill model, and ribbon model. The structure information of protein molecules is obtained in the form of a Protein Data Bank (PDB) file. The 3D models are built as Java3D scene graphs using the atomic coordinate information contained in the PDB file. Users can interact with the 3D models using zoom, pan and rotation functions. Furthermore, we provide textual information in the form of a "molecule information window" and a "tree view window". The former includes information such as molecule name, number of amino acids in the molecule, the amino acid chain as one-letter symbols, and currently selected amino acid. The latter describes the hierarchy of the protein molecule in terms of both primary and secondary structure. We implemented two-way interaction between the hierarchical representation and the 3D models in the following sense. Users can select a sub-structure, e.g. an amino acid or an atom, in the molecule using either the tree view or the 3D view, and the corresponding structure is highlighted in the other view.
A session server handles the communication between the users. Users share the same view of a 3D protein model by using a locking mechanism. Our implementation is based on Java. It allows users on different platforms to connect to the same collaboration session.
We plan to add new 3D representations, such as electron density map and solid surface model, to our visualization system. We are also considering incorporating a protein folding algorithm, which will enable users not only to visualize proteins of unknown structure, but also to model and create new proteins on the fly by changing the amino acid sequence.
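The step of reading atomic coordinates from a PDB file can be sketched as follows. This is an illustration in Python, not the Java code of the system described above; it relies only on the fixed-column layout of PDB ATOM records, and the record shown is built from invented values.

```python
def parse_atom_record(line):
    """Extract name, residue, chain and coordinates from a PDB ATOM line."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "name": line[12:16].strip(),      # columns 13-16
        "residue": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "xyz": (float(line[30:38]),       # columns 31-38
                float(line[38:46]),       # columns 39-46
                float(line[46:54])),      # columns 47-54
    }

# Build one record with the standard column widths (values are invented).
record = "ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)

atom = parse_atom_record(record)
print(atom)
```

Each parsed atom carries the residue and chain identifiers needed to build both a tree view and a 3D model from the same record, which is what makes selection in one view easy to mirror in the other.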

Genome-wide Comparative Analysis of Transcriptional Regulatory Regions

Yu Chen(1), Victor Olman(2), Ying Xu(1,2), Dong Xu(1,2)
(1)University of Tennessee-ORNL Graduate School of Genome Science and Technology, Knoxville, TN 37996 USA and (2)Oak Ridge National Laboratory, Oak Ridge, TN, 37831 USA

Knowledge of the transcriptional regulatory network is an indispensable prerequisite for understanding cellular function. However, the evolution of regulatory regions is not well understood compared to the evolution of coding regions. Sequence comparison between the regulatory regions of orthologs and paralogs may provide some insight into the evolution of the regulatory regions. For this purpose, we used the 51 archaeal and bacterial genomes with gene annotations from NCBI. From the COG database, we selected several orthologs that appear in many genomes, such as orthologs of flavohemoprotein. Genes with significant sequence similarity to each other in the same genome are presumed to be paralogs. As expected, gene regulatory regions are less conserved than gene coding regions among orthologs or among paralogs. However, the correlation between the sequence identity in the coding regions and the sequence identity in their regulatory regions is stronger among orthologs than among paralogs. We also carried out a comparative promoter analysis of the genes among the orthologs and paralogs using AlignACE. We found that orthologs have more conserved patterns in their promoter regions than paralogs. The patterns of regulatory regions provide quantitative measurements for the divergence of gene functions among paralogs and the convergence of gene functions among orthologs. This confirms that pattern changes in regulatory regions play an important role in genome evolution. We also analyzed the occurrence of dimeric tandem repeats, which are remarkably abundant in eukaryote DNA. We found that gene regulatory regions have a stronger long-range correlation of dimeric tandem repeats than coding regions. This suggests that mutations in the coding regions may be more independent of each other (that is, less correlated) than mutations in the regulatory regions during evolution.

Parallelism between Fusion Peptides and Other Fusion Systems Revealed by an Exhaustive Search for Sequences with Potential for Dynamic Insertion into Membranes

Victoria Dominguez Del Angel, Jean-Paul Mornon and Isabelle Callebaut
Laboratoire de Mineralogie-Cristallographie, CNRS UMR C7590, Universites Pierre et Marie Curie (P6) et Denis Diderot (P7), Case 115, 4, place Jussieu F-75252 Paris, FRANCE

Main aspects of protein function analysis include the detection of functional homologs potentially omitted by simple sequence alignment methods. This study is based on special short fragments (up to 20 amino acids) from viral fusion proteins whose sequences are highly conserved within one virus family, but not among different families: the fusion peptides. Fusion peptides are involved in membrane fusion processes. They facilitate membrane fusion by inserting deeply into the lipid bilayer of the target membrane, destabilizing lipids and thus leading to hemifusion. We used a well-characterized, non-redundant set of fusion proteins to perform a statistical analysis and identify amino acids that are preferentially found in fusion systems. A preference was found for alanine and threonine, involved in mobility through the lipids and in alpha-helical structures, and for isoleucine and methionine, for hydrophobicity. In light of these results, we designed software for screening the protein database (Swiss-Prot) and searching for putative functional homologs that could play critical roles in membrane fusion in other virus families or fusion systems.

Comparing Protein Clustering Methods Using the Arabidopsis Proteome

Christine G. Elsik and William R. Pearson
Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA

Protein classification based on pairwise sequence comparison has been used in comparative genomics, representative protein selection, gene prediction assessment, and function and structure assignment. We wish to assemble a comprehensive, but non-redundant, set of protein clusters with homogenous domain organization within clusters. The simplest method of clustering by transitive sequence similarity, single linkage, frequently generates clusters that are much too large, because multidomain proteins have pulled unrelated proteins into the same cluster. One way to avoid this is to cluster domains instead of complete proteins. However, some applications benefit from classification of complete proteins. Consequently, alternative algorithms for clustering complete proteins have been developed to prevent incorrect merging of multidomain proteins.
We used the Arabidopsis proteome (25,549 proteins) to evaluate the ability of complete protein clustering methods to minimize total cluster number while maintaining reasonable cluster sizes and consistent domain organizations. A pairwise similarity measure, E()-value, was the linkage criterion for four linkage-based methods: single linkage (similar to the SEALS grouper program and GEANFAMMER's clustering algorithm), average linkage (similar to ProtoMap), complete linkage and fractional linkage (similar to grouper options and algorithms used by Celera). We also evaluated single linkage based on pairwise identity and alignment coverage (similar to BLASTCLUST). The methods were assessed using a cluster domain consistency (CDC) score after comparing proteins with Pfam, Smart and ProDom to identify domains. Each method was tested with a range of linkage thresholds, resulting in 7702 to 25053 clusters, with 4940 to 24,744 singletons, respectively.
At the most relaxed threshold (E() < 10^-10), single linkage based on E()-value alone generated the smallest cluster number, with the largest cluster containing 5852 proteins, which is unlikely to be a biologically meaningful grouping. At the same threshold, average linkage produced 9607 clusters; the largest cluster included 395 sequences. Although cluster sets produced by E()-based single linkage had the poorest CDC scores, sets of > 16,000 clusters generated by identity/alignment-based single linkage (> 40% identity) had the best CDC scores of all methods. Average and fractional linkage performed better than identity/alignment-based single linkage for sets with less than 16,000 clusters. Using a conservative threshold of E() < 10^-10, we found at least 9600 unique domain organizations in the Arabidopsis proteome.
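The chaining effect that makes single linkage produce oversized clusters can be seen in a small sketch. It clusters by transitive similarity with a union-find structure; the protein names and E()-values are invented, and it does not reproduce any of the specific programs named above.

```python
def single_linkage(proteins, hits, threshold):
    """Cluster proteins by transitive similarity (E()-value < threshold)."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b, evalue in hits:
        if evalue < threshold:
            parent[find(a)] = find(b)      # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())

# Invented example: K1 and K2 share one domain, Z1 and Z2 share another,
# and the multidomain protein KZ carries both.
proteins = ["K1", "K2", "KZ", "Z1", "Z2", "U1"]
hits = [
    ("K1", "K2", 1e-40),
    ("K2", "KZ", 1e-25),
    ("KZ", "Z1", 1e-30),
    ("Z1", "Z2", 1e-50),
    ("U1", "Z2", 1e-3),
]

relaxed = single_linkage(proteins, hits, 1e-10)
strict = single_linkage(proteins, hits, 1e-28)
print(relaxed)  # [['K1', 'K2', 'KZ', 'Z1', 'Z2'], ['U1']]
print(strict)   # [['K1', 'K2'], ['KZ', 'Z1', 'Z2'], ['U1']]
```

At the relaxed threshold the multidomain protein KZ pulls two unrelated families into one cluster even though K1 and Z2 share no domain; tightening the threshold separates them here only because the K2-KZ hit happens to be the weakest link.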

New Features of FPC (FingerPrinted Contigs) V6.0

F. Engler(1), J. Hatfield(1), S. Blundy(1), S. Ness(2), C. Soderlund(1)
(1)Clemson University Genomics Institute, Clemson University Clemson, SC, 29634, USA and (2)Genome Sequence Centre, British Columbia Cancer Agency, 600 West 10th Avenue, Vancouver, BC V5Z 4E6, CANADA

Already used worldwide for the physical mapping of nontrivial genomes such as human, rice, and mouse, FPC (FingerPrinted Contigs) continues to grow. Recent improvements include a port to the Gimp Toolkit graphics library, which provides better graphics and higher compatibility than the previously used Athena widgets. Also, recent additions to the code will exploit parallelism when run on multiprocessor clusters. Three new features have been added that will make FPC even more useful. First, BSS provides the ability to run BLAST searches from FPC and use the results to map sequence back to the physical map. This tool has many applications, including that of picking a minimal tiling path and providing information needed to merge contigs. Second, FPC Simulated Digest takes sequence files as input and digests them in silico, outputting band files needed to add clones to FPC. With this tool, new sequence can be downloaded from global databases such as GenBank, converted to band files, and added as clones to FPC.
Finally, WebFPC provides a Java display for FPC, allowing any user to view the vital information from FPC online, as well as linking the user to databases that contain additional information on selected clones.
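The idea behind an in silico digest can be sketched in a few lines. This is not FPC Simulated Digest itself and does not write FPC band files; it only cuts a linear sequence at every occurrence of a recognition site and reports the fragment lengths. HindIII (A^AGCTT) is used as the example enzyme, and the clone sequence is invented.

```python
def simulated_digest(sequence, site, cut_offset):
    """Return sorted fragment lengths after cutting at every site."""
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)  # position of the cut in the site
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Fragment lengths are the gaps between consecutive cut positions.
    return sorted(b - a for a, b in zip(bounds, bounds[1:]) if b > a)

# HindIII recognizes AAGCTT and cuts after the first A.
clone = "GGGG" + "AAGCTT" + "CCCCCCCC" + "AAGCTT" + "TT"
bands = simulated_digest(clone, "AAGCTT", 1)
print(bands)  # [5, 7, 14]
```

The resulting list of fragment sizes plays the role of a fingerprint for the clone, which is the information a band file would need to carry.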

In Silico Prediction of the Transcriptional Regulation of Human Genes

M.C. Frith(1), J. Spouge(2), U. Hansen(3), and Z. Weng(4)
(1) Bioinformatics Program, Boston University, 44 Cummington St., Boston, MA 02215, USA;
(2)National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA;
(3)Department of Biology, Boston University, 5 Cummington St., Boston MA 02215, USA;
(4)Department of Biomedical Engineering, Boston University, 44 Cummington St., Boston, MA 02215, USA

The control of gene expression by modulation of transcription rate is one of the most fundamental processes in human physiology and disease. Transcription is regulated by the binding of transcription factors to cis-elements in the DNA sequence, which are often, but not always, located close to the 5' end of the gene. Since the entire human genome sequence is (almost) available, it ought to be possible to detect these transcription factor binding signals for any known human gene. Although traditional methods for cis-element detection can accurately predict affinities of transcription factors for naked DNA in vitro, they utterly fail at predicting functional binding sites in vivo. Two ideas for improving this pessimistic situation are presented in this poster. The first approach is to detect statistically significant clusters of multiple cis-elements, motivated by the observation that higher eukaryotic genes are typically regulated by multiple transcription factors. The second method is to search for conserved binding sites in alignments of orthologous DNA from human and another species such as mouse, assuming that cis-elements tend to be conserved across evolution.

Sequencing and Comparison of Orthopoxviruses

Michael Frace, Melissa Olsen-Rasmussen, Roger Morey, Yu Li, Miriam Laker, Richard Kline, Scott Sammons, Inger Damon, Robert Wohlhueter, Joseph J. Esposito, Ming Zhang
Centers for Disease Control and Prevention, 1600 Clifton Rd. NE, MailStop G-36, Atlanta, GA 30333 USA

The genomes of six variola major strains were sequenced by using long-distance, high-fidelity PCR of overlapping amplicons as templates for fluorescence-based sequencing. Each genome is approximately 186 kb of double-stranded DNA with between 190 and 250 predicted open reading frames (ORFs) of greater than 60 amino acids. The most highly conserved ORFs are located in the center portion of the genome, and the majority have known functions involving transcription, DNA replication and repair, protein processing, virion structure, and nucleotide metabolism.
To minimize the amount of poxvirus needed, 15 mg of purified genomic DNA was used as template for approximately 1800 primer-walking sequencing reactions. The reactions were set up using robotic assistance and subjected to thermocycling, and the reaction products were separated by capillary electrophoresis (Beckman Coulter CEQ 2000XL). Sequencing of variola strains Congo-1970, Somalia-1977, India-1964, Horn-1948, Nepal-1973, and Afganistan-1970 has been completed.
Output sequence trace files were edited, evaluated for quality, and then assembled by using Phred/Phrap/Consed software until about a 10-fold redundancy of high-quality sequence data was attained. The ORFs were predicted using Glimmer, GeneMark, and getorf. Each ORF sequence has been compared with the five other locally sequenced strains and with sequences of previously published orthopoxviruses Bangladesh-1975 (L22579), India-1967 (X69198), and vaccinia virus Copenhagen (M35027). These predicted ORFs were also analyzed for the presence of known early, middle, and late promoter sequences. All results and analyses will be integrated into a relational database customized for orthopox viral genomes.

Identification of Sequence and Structural Determinants of Functional Diversification Using Site Specific Amino Acid Variation Profiles

Daniel S. Gonzalez(1), G. Reid Bishop(2) and I. King Jordan(3)
(1)United States Department of Agriculture, Aquatic Animal Health Research Unit, Auburn, Alabama 36831, USA;
(2)Department of Chemistry, Millsaps College, Jackson, Mississippi 39210, USA;
(3)National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA

The extraction of position specific variability information from alignments of homologous proteins is emerging as a powerful method for gleaning meaningful biological information from sequence data. This approach was employed here to identify sequence and structural elements that have likely been shaped by natural selection during the functional diversification of organelle-specific groups of class I processing α-mannosidases. Class I α-mannosidases make up a homologous and functionally diverse family of glycoside hydrolases. Phylogenetic analysis based on an amino acid sequence alignment of the catalytic domain of class I α-mannosidases reveals four well-supported phylogenetic groups within this family. These groups include a number of paralogous members generated by gene duplications that occurred as far back as the initial divergence of the crown-group of eukaryotes. Three of the four phylogenetic groups consist of enzymes that have group-specific biochemical specificity and sites of activity. An attempt was made to uncover the role that natural selection played in the sequence and structural divergence between the phylogenetically and functionally distinct endoplasmic reticulum (ER) and Golgi apparatus groups. Comparison of site-specific amino acid variability profiles for the ER and Golgi groups revealed statistically significant evidence for functional diversification at the sequence level and indicated a number of residues that are most likely to have played a role in the functional divergence between the two groups. The majority of these sites appear to contain residues that have been fixed within one organelle-specific group by positive selection. Somewhat surprisingly, these selected residues map to the periphery of the α-mannosidase catalytic domain tertiary structure. Changes in these peripherally located residues would not seem to have a gross effect on protein function. Thus diversifying selection between the two groups may have acted in a gradual manner consistent with the Darwinian model of natural selection.

Prediction of N-Glycosylation Sites in Proteins.

Ramneek Gupta(1), Eva Jung(2) and Soren Brunak(1).
(1)Center for Biological Sequence Analysis, Bldg-208, Technical University of Denmark, Lyngby, DENMARK and (2)Swiss Institute of Bioinformatics, Geneva, SWITZERLAND

Contrary to widespread belief, acceptor sites for N-linked glycosylation on protein sequences are not well characterised. The consensus sequence, Asn-Xaa-Ser/Thr (where Xaa is not Pro), is known to be a prerequisite for the modification. However, not all of these sequons are modified, and the consensus is thus not discriminatory between glycosylated and non-glycosylated asparagines. We train artificial neural networks on the surrounding sequence context, in an attempt to discriminate between acceptor and non-acceptor sequons. In a cross-validated performance, the networks could identify 86% of the glycosylated and 61% of the non-glycosylated sequons, with an overall accuracy of 76%. The method can be optimised for high specificity or high sensitivity. Apart from characterising individual proteins, the prediction method can rapidly scan complete proteomes.
Glycosylation is an important post-translational modification, and is known to influence protein folding, localisation and trafficking, protein solubility, antigenicity, biological activity and half-life, as well as cell-cell interactions. We investigate the spread of known and predicted N-glycosylation sites across functional categories of the human proteome.
An N-glycosylation site predictor for human proteins shall be made available at www.cbs.dtu.dk/services/NetNGlyc
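The consensus described above (Asn-Xaa-Ser/Thr, Xaa not Pro) only enumerates candidate sites; it is the step that would precede any context-based classifier. A minimal Python sketch of that enumeration follows. The function name and the example sequence are illustrative assumptions and are not taken from NetNGlyc.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping sequons (e.g. "NNTS") all be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(seq):
    """Return 1-based positions of the Asn in every N-X-S/T sequon (X != P)."""
    return [m.start() + 1 for m in SEQUON.finditer(seq.upper())]

# Made-up example sequence: N2 (NGS), N9 (NNT) and N10 (NTS) qualify; N6 (NPT) does not.
print(find_sequons("MNGSANPTNNTSK"))  # [2, 9, 10]
```

Every position returned is only a candidate; as the abstract stresses, many such sequons are never glycosylated.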

Prediction of Gene Function within a Family of Related Proteins: A Case Study of the Xanthine Oxidase Family

Nikolai V. Ivanov(1,2) and Dale E. Edmondson(2)
(1)Department of Chemistry and (2)Department of Biochemistry, Emory University, Atlanta, GA 30322 USA

The problem of correct gene assignment for the genomes sequenced to date is being addressed using numerous methods. Most of the effort is directed toward creating a complete library of families and superfamilies for all known genes and proteins. A finer problem remains, however: predicting the function of a gene already assigned to a particular superfamily. In this work we present a method for further classifying proteins within the xanthine oxidase family. The method is based on analysis of multiple alignments of characterized proteins of known function, as well as on site-directed mutagenesis, kinetic and crystallographic data. The multiple alignment data help to locate the conserved residues of interest within superfamily genes, while the mutagenesis, kinetic, and crystallographic data provide information on the importance of those conserved residues. Each residue is classified as having structural significance or as being involved in cofactor or substrate/ligand binding. These residues are compared between the enzymes of the xanthine oxidase family based on several characteristics: substrate preference, nature of cofactor, and structural variations. A score is attributed to a gene of unknown function to account for the presence of residues characteristic of a particular function or binding site. To test this method, we took several genes of known function and constructed the knowledge set from the rest of the characterized proteins of the xanthine oxidase family. Our poster presents the successes and pitfalls of our prediction. Being able to predict function from the gene sequence is very important for correct assignment of newly sequenced genes, as well as for prediction and interpretation of the results of site-directed mutagenesis of fairly well-studied proteins.

Genomic Scale Relative Rates Test and the Detection of Functional Diversification among Bacterial, Archaeal and Eukaryotic Proteins

I. King Jordan, Fyodor A. Kondrashov, Igor B. Rogozin, Roman L. Tatusov, Yuri I. Wolf and Eugene V. Koonin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA

Detection of changes in a protein's evolutionary rate may reveal cases of change in that protein's function. We developed and implemented a simple relative rates test in an attempt to assess the rate constancy of protein evolution and to detect cases of functional diversification between orthologous proteins. The test was performed on clusters of orthologous protein sequences from complete bacterial genomes (Chlamydia trachomatis, C. muridarum, and Chlamydophila pneumoniae), complete archaeal genomes (Pyrococcus horikoshii, P. abyssi, and P. furiosus) and partially sequenced mammalian genomes (human, mouse, and rat). Amino acid sequence evolution rates are significantly correlated on different branches of phylogenetic trees representing the great majority of analyzed orthologous protein sets from all three domains of life. However, approximately 1% of the proteins from each group of species deviates from this pattern and instead shows variation that is consistent with an acceleration of the rate of amino acid substitution which may be due to functional diversification. Most of the putative functionally diversified proteins from all three species groups are predicted to function at the periphery of the cells and mediate their interaction with the environment. Relative rates of protein evolution are remarkably constant for the three species groups analyzed here. Deviations from this rate constancy are probably due to changes in selective constraints associated with diversification between orthologs. Functional diversification between orthologs is thought to be a relatively rare event. However, the resolution afforded by the test designed specifically for genomic scale data sets allowed us to identify numerous cases of possible functional diversification between orthologous proteins.
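For readers unfamiliar with relative rates tests, the arithmetic behind the classical three-taxon version can be sketched as follows. This is a generic illustration, assuming pairwise distances between two ingroup sequences A and B and an outgroup C; it is not necessarily the exact statistic implemented by the authors, and the function names are hypothetical.

```python
def branch_lengths(d_ab, d_ac, d_bc):
    """Three-point estimate of the branches leading to A and to B
    from their common ancestor, using C as the outgroup."""
    a = (d_ab + d_ac - d_bc) / 2.0
    b = (d_ab + d_bc - d_ac) / 2.0
    return a, b

def rate_difference(d_ab, d_ac, d_bc):
    """Excess change on the A lineage relative to B.
    Algebraically this equals d_ac - d_bc; zero means equal rates."""
    a, b = branch_lengths(d_ab, d_ac, d_bc)
    return a - b
```

Under rate constancy the two branch estimates should be roughly equal for an orthologous set; a markedly longer branch on one lineage is the kind of deviation the abstract interprets as possible functional diversification.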

PlasmoDB: An Example of Using GUS and RAD to Build a Database for Malaria Researchers that Combines Mapping, Sequence and Expression Data.

Jessica C. Kissinger(1), Brian Brunk(2), Jonathan Crabtree(2), Sharon J. Diskin(2), Martin J. Fraunholz(1), Gregory R. Grant(2), Dinesh Gupta(1), Shannon McWeeney(1), Arthur J. Milgram(1), David S. Roos(1), Jonathan Schug(2), and Christian J. Stoeckert Jr.(2)
(1)Department of Biology and (2)Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104 USA

PlasmoDB (PlasmoDB.org) is the official database of the Plasmodium falciparum genome sequencing consortium. The relational schemas used to build PlasmoDB (GUS, Genomics Unified Schema, and RAD, RNA Abundance Database) employ a highly structured format to accommodate the diverse data types generated by sequence and expression projects. PlasmoDB currently houses sequence information (both finished and unfinished) from five Plasmodium species, and provides tools for cross-species comparisons. Sequence information is integrated with other genomic-scale data emerging from the Plasmodium research community, including gene expression analysis from EST, SAGE, and microarray projects. A variety of tools allow researchers to formulate complex, biologically based queries of the database. A version of the database is also available via CD-ROM (Plasmodium GenePlot), facilitating access to the data in situations where internet access is difficult (e.g. by malaria researchers working in the field). The goal of PlasmoDB is to enhance utilization of the vast quantities of data emerging from genome-scale projects by the global malaria research community.

An Analysis of Gene-Finding Approaches for Neurospora crassa

Eileen Kraemer(1), Jian Wang(1,2), Jinhua Guo(1), Samuel Hopkins(1), Jonathan Arnold(2)
(1)Computer Science Department and (2)Genetics Department, The University of Georgia, Athens, GA 30602 USA

Motivation: Computational gene identification plays an important role in genome projects. The approaches used in gene identification programs are often tuned to one particular organism, and accuracy for one organism or class of organism does not necessarily translate to accurate predictions for other organisms. We evaluated five computer programs on their ability to locate coding regions and to predict gene structure in Neurospora crassa. One of these programs (FFG) was designed specifically for gene-finding in Neurospora crassa, but the model parameters have not yet been fully "tuned", and the program should thus be viewed as an initial prototype. The other four programs were neither designed nor tuned for N. crassa.
We evaluated five programs (GenScan, HMMGene, GeneMark, Pombe and FFG) on data sets from the University of Mexico, the University of Georgia, and the PEDANT database at MIPS (Munich Information Center for Protein Sequences). Our results show that overall the GenScan program has the best performance on sensitivity and ME (missing exons), while the HMMGene and FFG programs perform well at locating exons approximately. However, the reader is cautioned as to the reliability of the annotated data sets, as GenScan was used in the annotation of some sequences.
The importance of evaluating programs based on the particular organism one wishes to study is clear. Most of the gene-finding programs evaluated are inappropriate for finding genes in N. crassa. Additional work motivated by this study includes the creation of a tool for the automated and rapid evaluation of gene-finding programs, the collection of larger and more reliable data sets for N. crassa, parameterization of the model used in FFG to produce a more accurate gene-finding program for this species, and a more in-depth evaluation of the reasons that existing programs generally fail for N. crassa.
Links to the programs, data sets, and results may be found at: jerry.cs.uga.edu/~wang/genefind.html
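The exon-level measures mentioned above (sensitivity and missing exons) can be illustrated with a short Python sketch. The definitions used here follow common gene-finding evaluation practice and are an assumption, since the abstract does not spell them out: an exon is correct only if both boundaries match, specificity is the fraction of predicted exons that are correct, and a missing exon is an annotated exon with no overlapping prediction. The function name and coordinates are hypothetical.

```python
def exon_metrics(annotated, predicted):
    """Exon-level accuracy from collections of (start, end) coordinates."""
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted  # both boundaries match exactly

    def overlaps(exon, others):
        # Closed intervals overlap if each starts before the other ends.
        return any(exon[0] <= o[1] and o[0] <= exon[1] for o in others)

    missing = [e for e in annotated if not overlaps(e, predicted)]
    return {
        "sensitivity": len(correct) / len(annotated) if annotated else 0.0,
        "specificity": len(correct) / len(predicted) if predicted else 0.0,
        "missing_exons": len(missing) / len(annotated) if annotated else 0.0,
    }

# Toy coordinates: one exact hit, one partial overlap, two exons missed entirely.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 390), (900, 950)]
print(exon_metrics(annotated, predicted))
```

The partial overlap at (300, 390) counts against sensitivity but not as a missing exon, which is why a program can locate exons "approximately" and still score poorly on exact-boundary measures.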

Automatic Rule Generation for Protein Annotation with the C4.5 Data Mining Algorithm Applied on SWISS-PROT

Ernst Kretschmann
European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SA, UK

The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools without the use of automated annotation systems is not able to keep up with the ever increasing quantity of data that is submitted. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation automatic tools are needed that support manual annotation, automatically increase the amount of reliable information and help to detect inconsistencies in manually generated annotations.
The standard data mining algorithm C4.5 was successfully applied to gain knowledge about the Keyword annotation in SWISS-PROT. 11306 rules were generated, which are provided in a database and can be applied to as yet unannotated protein sequences and viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that by applying them to arbitrary proteins, 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.
The results of the automatic data mining process can be browsed on www.ebi.ac.uk/spearmint
The source code is available upon request.
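C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by the split information of the attribute. The Python sketch below shows that criterion for a single categorical attribute. It illustrates the selection step only and is not the code used to produce the rules above; the attribute values and keyword labels in the example are invented.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(attribute_values, labels):
    """Information gain of splitting on the attribute, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0

# Invented example: does a signature match predict presence of a keyword?
signature = ["match", "match", "none", "none"]
keyword = ["yes", "yes", "no", "no"]
print(gain_ratio(signature, keyword))  # 1.0
```

An attribute that separates the keyword labels perfectly scores 1.0 here, while one that is independent of the labels scores 0.0.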

LumberJack Generates a Forest of Trees by Jackknifing Alignments

Carolyn J. Lawrence (1), R. Kelly Dawe(1,2), and Russell L. Malmberg(2)
(1)Department of Botany and (2)Department of Genetics, University of Georgia, Athens, Georgia USA

Phylogenomics is a method of sequence-based function prediction by phylogenetic analysis (Eisen 1998). The phylogenomic method often yields more accurate functional hypotheses than techniques based solely upon sequence similarity (such as BLAST). It is implemented by constructing a reasonable phylogenetic tree for a given dataset, then mapping the functions of experimentally analyzed proteins onto the tree. While one might prefer to build such trees using ML-based algorithms, most implementations are computationally impractical for large datasets (especially those consisting of protein sequences). We are developing an ML heuristic search tool that we call LumberJack. LumberJack progressively jackknifes an alignment to generate multiple neighbor joining trees, then compares those trees statistically on the basis of their relative likelihood scores. Trees built after misleading blocks within the alignment were removed have a better likelihood score than those built using the entire alignment, and the likelihood score is worse for trees built after phylogenetically informative blocks of the alignment were removed. Thus not only does LumberJack quickly generate a distribution of phylogenetic trees for further analyses, it also maps phylogenetic information onto the alignment. Using the kinesin dataset of Lawrence et al. (2001), we find that the most likely tree discovered using the LumberJack protocol is similar to the Lawrence et al. tree (built using ML star decomposition and placing individual sequences by hand). LumberJack also revealed a region of the alignment that misled both parsimony heuristics and neighbor joining treebuilding.
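The block-removal step at the heart of the jackknifing idea can be sketched in a few lines of Python. This shows a single pass that deletes one block of columns at a time; the progressive search, tree building and likelihood scoring performed by LumberJack are not shown, and the function name, block size and toy alignment are assumptions made for illustration.

```python
def jackknife_blocks(alignment, block_size):
    """Yield (start, end, reduced_alignment) with one column block removed.

    alignment maps sequence names to aligned strings of equal length;
    start and end are 0-based, end exclusive.
    """
    length = len(next(iter(alignment.values())))
    for start in range(0, length, block_size):
        end = min(start + block_size, length)
        reduced = {name: seq[:start] + seq[end:] for name, seq in alignment.items()}
        yield start, end, reduced

# Toy alignment, two sequences, blocks of two columns.
toy = {"seqA": "ACGTAC", "seqB": "AC-TAA"}
for start, end, reduced in jackknife_blocks(toy, 2):
    print(start, end, reduced)
```

Each reduced alignment would then be passed to a tree-building program, and the resulting likelihood scores compared block by block to flag misleading or informative regions.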

mRNA Segment Scores of Neurospora crassa Genes Decrease Following Intron Splicing

By Tong Lee(1), April C. Ashford(2), Kaee N. Ross(2), Giovanni Carter(2), LaTreace Harris(2), and William Seffens(2)
(1)Department of Mathematics, Georgia State University, Atlanta, GA, USA and (2)Department of Biological Sciences, Clark Atlanta University, Atlanta 30314 USA

Free energies of folded mRNAs are usually more negative than those of mononucleotide-shuffled sequences (Seffens and Digby, 1999). A segment score is the difference between the folding free energy of an mRNA and the mean free energy of folded shuffled sequences, divided by the standard deviation of the shuffled set. Thirteen genes from Neurospora crassa with introns were studied. mRNAs with introns were found to have segment scores that were more positive (less stable) than those of processed mRNAs. This suggests that intron splicing yields mRNAs that possess more secondary structure than expected compared to mononucleotide-shuffled sequences.
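The segment score defined above is a z-score and can be written directly in Python. The folding energies themselves would come from an RNA folding program and are simply passed in here; the use of the sample standard deviation and the example energies are assumptions made for illustration.

```python
import statistics

def segment_score(native_energy, shuffled_energies):
    """(native free energy - mean of shuffled set) / standard deviation of shuffled set.

    Negative scores mean the native mRNA folds more stably than its shuffled versions.
    """
    mean = statistics.mean(shuffled_energies)
    sd = statistics.stdev(shuffled_energies)  # sample standard deviation (assumed)
    return (native_energy - mean) / sd

# Made-up energies in kcal/mol.
print(segment_score(-27.0, [-24.0, -26.0, -28.0]))  # -0.5
```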
Intron splicing is a phenomenon found in eukaryotic genes. Eukaryotic mRNAs exhibit posttranscriptional modifications from large RNA precursors, hnRNA (pre-mRNA), that are acted upon by snRNPs, which form a vital part of the spliceosome that processes mRNA. Free energy is released as RNA structures are formed, creating a more stable structure. The activity of RNA is determined by its structure, the way it is folded back on itself; and cases have been described where the secondary structure plays an important role in gene regulation (for instance the trp operon in E. coli). Thirteen N. crassa mRNA sequences were selected from the GenBank database and analyzed using the GCG Wisconsin package version 10.2-Unix (Oxford Molecular Co.). N. crassa mRNA sequences less than 1200 bases long were randomly selected; each had an identifiable start site, a termination signal, and introns. The thirteen mRNA sequences examined in N. crassa that possess introns were found to have an average segment score of –0.074. After intron splicing, the average calculated segment score is –0.276.
Neurospora is typical of many eukaryotes, in that more than 50% of its genes have introns. The significant difference in the segment scores between the hnRNA and the processed mRNAs is a novel observation. These results may be generally applicable to other organisms. For randomly selected genes from a variety of other organisms the segment score was found to be –1.23 (Seffens and Digby, 1999). However, this observation was for a set of genes that did not possess introns. The decrease in segment scores indicates that processed mRNA has more secondary structure than hnRNA.
This work was performed during an undergraduate summer research experience sponsored by the University of Georgia at Athens with Dr. Jonathan Arnold (Genetics Department).


Seffens, W. and Digby, D. (1999) "mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences" Nuc.Acids Res. 27:1578-1584.

Rule and Dictionary-Based Text Mining to Cross-Reference Genes and Proteins to Relevant Journal Articles, and Clustering of Microarray Results by their Medline Associations Using the CELL Platform

Julie Leonard(1,2), Toby Segaran(1), Hong Dang(1), Jeff Colombe(1), Jennifer Pan(1), Josh Levy(1)
(1)Incellico, Inc., 2327 Englert Drive, Durham, NC, 27713, USA and (2)Bioinformatics Department, North Carolina State University, Raleigh, NC, USA

In order to obtain information about a biological entity of interest, such as a gene or a protein, biologists frequently perform a literature search using Medline or PubMed. In many cases, biological entities have more than one name or alias, requiring multiple literature searches. Consequently, finding all relevant literature references for a given list of biological entities can be very labor-intensive. In order to streamline this process, we have used both rule- and dictionary-based text mining methods to cross-reference biological entities with relevant articles in the scientific literature. The dictionary was constructed from gene and protein symbols in HUGO, OMIM, TAIR and LocusLink, while the rules for finding symbols in Medline abstracts were empirically determined using a training set of Medline entries. Included are rules that only take genetics-related abstracts, use organism-specific dictionaries, and add dictionary terms during analysis. We have cross-referenced the complete set of Medline abstracts with biological entities from human, mouse, drosophila, zebrafish, arabidopsis and rat. The result is a rich network of searchable cross-references between biological entities and literature entries, much more comprehensive than the set of literature cross-references contained in Genbank and Swissprot. Links between biological entities and literature references were given a score based on a statistical model for each biological symbol's relevance to each abstract. The model was based on the comparison of the occurrence of significant words in an abstract above the Medline background set of occurrences with other abstracts containing that symbol. Using our methods, we obtained a precision rate of 89.9% and a recall rate of 91.5% on a small annotated test set derived from Medline.
We also demonstrate the utility of such a rich network of cross-references by showing the results of clustering gene expression data by both expression and co-occurrence of those genes in Medline abstracts using the CELL platform.
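A bare-bones version of dictionary-based symbol matching, together with the precision and recall measures quoted above, might look like the Python sketch below. It is a simplified illustration and not the CELL platform's implementation: the alias dictionary, the tokenisation rule and the case-sensitive matching are all assumptions.

```python
import re

def find_symbols(abstract, dictionary):
    """Return canonical symbols whose aliases occur as whole tokens.

    dictionary maps alias -> canonical symbol. Matching is case-sensitive,
    one simple rule that keeps the symbol WAS from matching the word "was".
    """
    tokens = set(re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", abstract))
    return {canonical for alias, canonical in dictionary.items() if alias in tokens}

def precision_recall(predicted, expected):
    """Precision and recall of a predicted set against a curated set."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# Hypothetical alias dictionary and abstract text.
aliases = {"TP53": "TP53", "p53": "TP53", "BRCA1": "BRCA1", "WAS": "WAS"}
text = "Mutations in p53 and BRCA1 were found; the protein was stable."
print(find_symbols(text, aliases))
```

Aliases collapse onto one canonical symbol, so a single search covers every name of an entity, which is the labor the abstract sets out to remove.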

Application of Error-Driven Learning to Biologically Significant Patterns in Protein Sequences

Sergei Levin and Birgit H. Satir.
Department of Anatomy and Structural Biology, Albert Einstein College of Medicine, Bronx, NY 10064 USA

Biologically significant protein sequence patterns, especially the ones responsible for post-translational modification and signaling, are sometimes highly variable and difficult to pinpoint with the naked eye. Thus, automated acquisition of correct consensus sequences from protein sequences is a very important task. We used error-driven learning to acquire protein consensus sequences for N and O glycosylation from pre-annotated protein sequences. The error-driven learning starts with learning of the base pattern, an amino-acid pattern that is present in all sequences with positive annotation. In the case of glycosylation, it is a sequence always present at the glycosylation site. Next, the default value for the base pattern is learned, which implies that the base pattern can either be specific to glycosylation (most frequently met at glycosylation sites) or non-specific (most frequently met at non-glycosylation sites). Subsequently, the alterations to the base pattern that lead to non-default assignments are learned via an iterative application of the base pattern and analysis of the cases where the assumption of the default value was incorrect. These are the patterns that specifically lead to glycosylation and are learned based on the errors made while applying the base pattern. Application of the error-driven learning to the protein sequences for proteins with N and O glycosylation revealed that the base pattern is very simple and has a non-specific default value (no glycosylation). However, upon error-driven learning of the modified patterns, a high number of amino-acid patterns with confidence levels of 0.8 and higher have been obtained. Combined with other methods, this technique holds potential for automated pattern acquisition from biological sequences.

Strategies for Improving Multiple Alignment of Retrotransposon Sequences

Renyi Liu and Eileen Kraemer
Department of Computer Science, University of Georgia, Athens, GA 30602 USA

Multiple sequence alignment plays a crucial role in extracting structural, functional, and evolutionary information from the exponentially growing sequence data from the ongoing genome sequencing. Although there are a number of multiple sequence alignment algorithms and programs available, biologists often find it difficult or time consuming to choose the appropriate algorithm and to interact to refine the resulting alignment.
In this work, we first conducted a comparative study of three alignment programs, DIALIGN, ClustalW, and Prrn, which are representatives of local, progressive, and iterative programs, respectively. Entropy was used as the alignment quality indicator. It was shown that the performance of ClustalW and Prrn was similar and better than that of DIALIGN. We then experimented with some strategies to improve alignment quality, such as realigning certain sequences or sequence ranges with different programs or parameters and hand editing, with the alignment of some retrotransposon sequences as a case study. A graphical tool, named AlignAgain, was built to display alignments, evaluate alignment quality, and improve resulting alignments. AlignAgain is written in Java and allows users to realign whole or partial sequences either with different programs such as ClustalW and Prrn or with the same program but different parameters, conduct alignments locally or remotely, edit alignments by inserting or deleting gap letters, and append sequences with profile alignment.
Detailed results of the comparison study and links to AlignAgain may be found at: jerry.cs.uga.edu/~renyi
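One common way to use entropy as an alignment quality indicator is to average the Shannon entropy of the columns, with lower values indicating more consistent columns. The Python sketch below illustrates that idea. Treating the gap character as an ordinary symbol and averaging over all columns are assumptions; the exact scoring used in AlignAgain may differ.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in Counter(column).values())

def mean_alignment_entropy(sequences):
    """Average column entropy over equal-length aligned sequences."""
    entropies = [column_entropy(col) for col in zip(*sequences)]
    return sum(entropies) / len(entropies)

# Toy alignment: first column fully conserved, second split evenly.
print(mean_alignment_entropy(["AC", "AG"]))  # 0.5
```

Comparing this value for the same sequences aligned by different programs, or before and after realigning a region, gives a quick numeric check on whether a change helped.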

Refining Function Prediction by Analyzing Site Specific Amino Acid Conservation: A Case of PAS Domain-Containing Chemoreceptors

Qinhong Ma(1), Barry L. Taylor(1) and Igor B. Zhulin(2)
(1)Department of Microbiology and Molecular Genetics, School of Medicine, Loma Linda University, Loma Linda, CA 92350 USA and (2)School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230 USA

One of the major goals of comparative genomics is to predict a biological function for proteins by using a variety of bioinformatics tools. Function prediction for some classes of proteins is difficult due to their mosaic structure and the presence of highly variable domains. Even when all domains can be detected in a given protein, prediction of exact biological function might be a challenge.
PAS domains are sensory elements in various classes of signal transduction proteins in organisms ranging from Bacteria and Archaea to humans. PAS domains are implicated in sensing oxygen, redox potential, light and small ligands inside a living cell. The Aer chemoreceptor of Escherichia coli is a model PAS domain-containing sensor, which governs bacterial motility in response to changes in the redox potential. The protein sequence of Aer_Ecoli contains an N-terminal PAS domain and a C-terminal chemoreceptor domain (MA domain in SMART database). Using BLAST searches of microbial databases, we have identified 55 apparent homologs of Aer that have N-terminal PAS domain(s) and a C-terminal MA domain. Phylogenetic analysis revealed that all PAS-containing receptors belong to several distinct classes. Proteins from the first class all have a single PAS domain, where amino acids are conserved in specific positions crucial for FAD binding and signaling by Aer, as revealed by multiple sequence alignments and mapping on known 3D structures. Therefore, they all are predicted to be sensors of redox potential that utilize the signaling mechanism of Aer_Ecoli. Proteins from other phylogenetic clusters have one to three repeats of the PAS domain; however, most residues essential for FAD, FMN or heme binding and signaling are not conserved within their PAS domains. This rules out the possibility that these receptors are sensors of redox potential or oxygen. Our results demonstrate that similarity searches and analysis of the domain architecture are not sufficient for accurate prediction of biological function for signal transduction proteins. Analysis of site-specific conservation of amino acids known to be essential for the function is one of the approaches to improve in silico predictions.
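The site-specific check described above amounts to asking, for each homolog, how many functionally essential alignment columns still carry the residue found in the characterized reference protein. A minimal Python sketch follows. The function name, the toy aligned strings and the column indices are invented for illustration and are not the real Aer alignment or its FAD-binding positions.

```python
def conserved_fraction(alignment, reference_name, key_columns):
    """For each sequence, the fraction of key columns matching the reference residue.

    alignment maps names to equal-length aligned strings;
    key_columns are 0-based alignment column indices.
    """
    reference = alignment[reference_name]
    result = {}
    for name, seq in alignment.items():
        matches = sum(1 for col in key_columns if seq[col] == reference[col])
        result[name] = matches / len(key_columns)
    return result

# Toy strings, not real sequences.
toy = {"Aer_Ecoli": "MRHGW", "homolog1": "MRHAW", "homolog2": "MK-AF"}
print(conserved_fraction(toy, "Aer_Ecoli", [1, 2, 4]))
```

In this toy case homolog1 retains every key residue despite a substitution elsewhere, whereas homolog2 retains none, mirroring the contrast the abstract draws between the first class and the other phylogenetic clusters.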

A DNA Repair System Specific for Thermophilic Archaea and Bacteria Predicted by Genomic Context Analysis

Kira S. Makarova (1,2), L. Aravind(1), Nick V. Grishin(3), Igor B. Rogozin(1), Eugene V. Koonin(1)
(1)National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA;
(2)Department of Pathology, F.E. Hebert School of Medicine, Uniformed Services University of the Health Sciences, Bethesda, MD 20814-4799 USA;
(3)Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA

During a systematic analysis of conserved gene context in prokaryotic genomes, a previously undetected, complex, partially conserved neighborhood consisting of more than 20 genes was discovered in most archaea (with the exception of Thermoplasma acidophilum and Halobacterium) and some bacteria, including the hyperthermophiles Thermotoga maritima and Aquifex aeolicus. The gene composition and gene order in this neighborhood vary greatly between species, but all versions have a stable, conserved core that consists of five genes. One of the core genes encodes a predicted DNA helicase, often fused to a predicted HD-superfamily hydrolase, and another encodes a RecB-family exonuclease; three core genes remain uncharacterized, but one of these might encode a nuclease of a new family. Two more genes that belong to this neighborhood and are present in most of the genomes in which the neighborhood was detected encode, respectively, a predicted HD-superfamily hydrolase (possibly a nuclease) of a distinct family and a predicted, novel DNA polymerase. Another characteristic feature of this neighborhood is the expansion of a superfamily of paralogous, uncharacterized proteins, which are encoded by at least 20-30% of the genes in the neighborhood. The functional features of the proteins encoded in this neighborhood suggest that it encodes a previously undetected DNA repair system, which, to our knowledge, is the first repair system largely specific for thermophiles to be identified. This hypothetical repair system might be functionally analogous to the bacterial-eukaryotic system of translesion, mutagenic repair whose central components are DNA polymerases of the UmuC-DinB-Rad30-Rev1 superfamily, which typically are missing in thermophiles.

Predicting Class II MHC/Peptide Multi-Level Binding with an Iterative Stepwise Discriminant Analysis Meta-Algorithm

Ronna R. Mallios
Office of Sponsored Projects and Research, University of California, San Francisco, 2615 East Clinton Avenue, Fresno, CA, 93703 USA

The initial immune response to an extra-cellular pathogen begins with the capture of the pathogen by a macrophage, dendritic cell, or B lymphocyte. In the cell's interior, the protein portion of the pathogen is degraded into peptide fragments. Class II major histocompatibility complex (MHC) molecules bind to areas of the peptide fragments that are designated agretopes. The agretope/MHC complex travels to the cell surface where the class II MHC molecule displays the fragment to nearby CD4 T lymphocytes. When a CD4 T lymphocyte binds to the exposed area of the peptide fragment, designated an epitope, an immune response is initiated.
Each binding peptide fragment is comprised of a linear arrangement of amino acid residues. Knowledge of the amino acid sequence of an agretope is useful in vaccine development and immunotherapy. A motif or quantitative model that recognizes agretopes can be used to screen large numbers of potential binding peptides, reducing laboratory time and costs.
Previous efforts have developed algorithms that successfully separate binding peptides from non-binding peptides for various HLA-DR molecules. The problem of classifying peptides into three or more categories of binding affinity is much more difficult than the dichotomous problem. A large part of the difficulty is due to the fact that the binding affinities found in public databases are produced by a variety of experimental methods. As such, a peptide reported as a high-binder by one method might be classified as a moderate-binder by another method.
This study explores expansion of a dichotomous iterative Stepwise Discriminant Analysis (SDA) meta-algorithm to the general multi-level problem. It seeks to ascertain if the algorithm is relevant and if so, how it compares with other approaches.
HLA-DR1 was selected as the class II MHC molecule of investigation. A database of peptides classified as high binding, moderate binding or non-binding was assembled from the MHCPEP internet database and the published literature. In accordance with published literature, agretopes of length 9 were selected as the units of investigation.
The general algorithm is as follows:
Initialization: (1.) A permanent non-binding dataset is created by entering every subsequence of length 9 from each non-binding peptide. (2.) An initial binding dataset is created by entering every subsequence of length 9 from each binding peptide. (3.) An initial application of SDA produces one classification function for each of the three binding levels.
Repeat until Convergence is Reached: (1.) Create a new binding dataset utilizing the current classification functions. Select from each binding peptide the subsequence that scores the highest according to the appropriate classification function. (2.) Apply SDA to produce new classification functions for each of the three binding levels.
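The control flow of the initialization and repeat steps above can be sketched in Python. Stepwise discriminant analysis itself is not implemented here: `fit` is a hypothetical stand-in that takes the current binding dataset and returns one scoring function per binding level, and it is assumed to combine that dataset with the permanent non-binding dataset internally. All function names are illustrative, and every peptide is assumed to be at least 9 residues long.

```python
def ninemers(peptide, k=9):
    """Every contiguous subsequence of length k."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]

def best_subsequence(peptide, score, k=9):
    """The length-k subsequence with the highest score (first one on ties)."""
    return max(ninemers(peptide, k), key=score)

def refine(binders, fit, max_rounds=50):
    """Iteratively re-select one 9-mer per binding peptide.

    binders: list of (peptide, level) pairs, e.g. level "high" or "moderate".
    fit: hypothetical stand-in for SDA; fit(dataset) -> {level: score_function}.
    """
    # Initialization: every 9-mer from every binding peptide.
    dataset = [(sub, level) for pep, level in binders for sub in ninemers(pep)]
    functions = fit(dataset)
    selected = None
    for _ in range(max_rounds):
        # Keep only the best-scoring 9-mer under the level's current function.
        new_selected = [(best_subsequence(pep, functions[level]), level)
                        for pep, level in binders]
        if new_selected == selected:  # convergence: selection no longer changes
            break
        selected = new_selected
        functions = fit(selected)
    return selected, functions
```

Convergence is taken here to mean that the selected subsequences stop changing between rounds, which is one reasonable reading of the stopping rule; `max_rounds` is a safety limit added for the sketch.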
The resulting model correctly classifies over 85% of the peptides in the database. The HLA-DR1 multi-level binding motif is in agreement with other studies and the level of accuracy is competitive. The results suggest that moderate-binders follow a different pattern from high-binders.
A similar study using regression analysis can corroborate or challenge this conclusion. Regression analysis, however, requires standardized reliable measurements of binding affinity. A well maintained website specializing in standardized binding affinities for peptide/HLA-DR complexes (including non-binders) would expedite the investigation of this problem.

Prediction of the Transmembrane Regions of Beta Barrel Membrane Proteins with a Predictor Based on HMM and Neural Networks

P.L.Martelli(1), A.Krogh(2) and R.Casadio(1,3)
(1)Laboratory of Biocomputing, Centro Interdipartimentale per le Ricerche Biotecnologiche (CIRB), Bologna, ITALY;
(2)Centre for Biological Sequence Analysis, the Technical University of Denmark, Lyngby, DENMARK;
(3)Laboratory of Biophysics,Department of Biology, University of Bologna, Bologna, ITALY

Beta-barrel membrane proteins are inserted in the outer membranes of bacteria, mitochondria and chloroplasts by means of antiparallel beta strands [1]. The prediction of the structure of these proteins consists in the prediction of the position of beta-strands along the sequence. A method based on neural networks is trained and tested on a non-redundant set of beta-barrel membrane proteins known at atomic resolution with a jack-knife procedure [2]. This method predicts the topography of transmembrane beta strands with residue accuracy as high as 78 % when evolutionary information is used as input to the network. The neural network results are improved with a post-processing procedure based on Hidden Markov Models (HMM). The new algorithms we developed make it possible to train HMMs on the basis of neural network outputs and to perform predictions that include the typical topological constraints of this class of proteins (e.g. segment lengths, even number of beta-strands). HMMs based on evolutionary information can be trained by means of similar algorithms. After a jack-knife procedure, the predictor assigns the correct structure to 80 % of the residues, the correct position to 95 % of the 158 beta-strands included in the training set, and the correct number of beta-strands along the sequence for 10 out of the 11 examples of the training set. We propose this as a general method to fill the gap in the prediction of the structure of beta-barrel membrane proteins. Furthermore, the HMM based on evolutionary information can filter beta-barrel membrane proteins out from a set containing globular and all-alpha membrane proteins.


1. Schulz, GE (2000) "Beta-barrel membrane proteins." Curr Op Struct Biol 10: 443-447.

2. Jacoboni I, Martelli PL, Fariselli P, De Pinto V and Casadio R (2001) "Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor." Protein Sci 10:779-787

3. Durbin R, Eddy S, Krogh A, Mitchison G (1998) "Biological sequence analysis: probabilistic models of proteins and nucleic acids." Cambridge Univ Press, Cambridge.

Automated Annotation of Viral Genomes

Ryan Mills, John Besemer, Alex Lomsadze and Mark Borodovsky
School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30332-0230 USA

The GenBank database currently contains over 700 viral genomes. The diversity of this set makes it difficult to find gene boundaries in these genomes using a unified approach. Genome size and the host that the virus infects are two factors that determine which of the available gene finding algorithms is most appropriate. The smallest viruses and phages do not contain enough DNA for training with traditional methods. For these cases, the GeneMark.hmm program along with heuristically derived models of protein-coding DNA is the suggested method.
For phage genomes larger than 10 kb, the GeneMarkS program, which utilizes an iterative self-training algorithm, produces accurate results. GeneMarkS makes use of a ribosomal binding site model to aid in the prediction of the starts of genes.
The genomes of viruses that infect eukaryotes can be analyzed with a new modified version of the GeneMarkS program, called GeneMarkS EV. This self-training program builds a model of start codon context along with the protein-coding and non-coding DNA models in each of its iterations.
A database of the predictions made utilizing these approaches will be made available on our web site at: opal.biology.gatech.edu/GeneMark.

Comparative Genomics of Two-Component Signal Transduction in Pseudomonas aeruginosa and Vibrio cholerae

Christophe Mougel and Igor B. Zhulin
School of Biology, Georgia Institute of Technology, 310 Ferst Drive, Atlanta GA 30332-0230 USA

Living organisms monitor the environment and adjust their behavior, metabolism and development in response to changes in physico-chemical parameters. Even simple prokaryotic organisms possess sophisticated signal transduction networks that contain specialized receptors directly interacting with environmental cues. Two-component regulatory systems are the main means of signal transduction in Bacteria and are also present in Archaea, lower eukaryotes and plants.
We performed a comparative genomic analysis of the two-component (histidine kinase – response regulator) systems of two species of pathogenic gamma-proteobacteria, P. aeruginosa and V. cholerae, that diverged 7 million years ago. Sixty-four sensor histidine kinases were identified in P. aeruginosa, and forty in V. cholerae. However, only ten sensors are conserved between the two species. The domain architecture was determined for all sensory proteins, and a correlation was found between the domain architecture and the phylogeny of histidine kinases. Phylogenetic analysis resulted in the identification of several paralogous sensors in both species and allowed us to predict a possible function for response regulators whose genes are not paired with histidine kinases on the chromosome. Comparative genomics of signal transduction provides a useful tool for understanding the biology of these important pathogens.

A New Approach to Sequence Assembly using Divide-and-Conquer Algorithms

Hasan H. Otu and Khalid Sayood
Department of Electrical Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0511 USA

We propose a new algorithm for assembling fragments from a long DNA sequence obtained by shotgun sequencing. The proposed algorithm solves the orientation, overlap and layout phases simultaneously. The fragments are clustered by their Average Mutual Information (AMI) profiles using the k-means algorithm. The AMI profiles of the fragments prove to be a distinctive measure, as fragments that belong to the same region of the target sequence have similar AMI profiles. Moreover, AMI profiles are robust to errors and remain unchanged when calculated for the reverse complement of the fragment. Clustering the fragments removes the unnecessary computational burden of considering the collection of fragments as a whole. Instead, orientation and overlap detection are solved efficiently within the clusters, as each cluster contains a feasible number of fragments coming from the same region of the target sequence. The consensus sequences of the clusters are taken to form a new set of fragments and the basic approach is repeated. This recursion is repeated until there is one cluster left or until no new cluster is formed. In the first case we have a final consensus sequence for the target; in the second case, we end up with a number of contigs which can be ordered arbitrarily. The simulation results are very promising both for artificial and real data sets. In the case of zero sequencing error, the final consensus sequence is identical to the target sequence using a coverage of five. When the error rate is increased to 5%, using a coverage of five, the algorithm reconstructs the target sequence with over 99% similarity and within 2% of its length. Future work can focus on investigating different methods for clustering (as an alternative to k-means) and different measures of similarity (as an alternative to AMI profiles).
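The abstract does not give the formula for the AMI profile, so the following is only a minimal sketch under a stated assumption: the profile is taken to be the mutual information between bases k positions apart, for k = 1..max_lag, with single-base frequencies estimated from the whole fragment. The function names and the max_lag parameter are illustrative and not the authors'. Under this definition the profile of a fragment and of its reverse complement coincide, which is the property the abstract relies on.

```python
import math
from collections import Counter


def ami_profile(seq, max_lag=8):
    """Mutual information (in bits) between bases k apart, for k = 1..max_lag.

    Assumed definition, not taken from the abstract:
    I(k) = sum over x, y of p_k(x, y) * log2(p_k(x, y) / (p(x) * p(y))),
    where p(x) is the base frequency over the whole fragment.
    """
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    profile = []
    for k in range(1, max_lag + 1):
        pair_counts = Counter(zip(seq, seq[k:]))  # pairs of bases at distance k
        total = n - k
        info = 0.0
        for (x, y), count in pair_counts.items():
            p_xy = count / total
            info += p_xy * math.log2(p_xy / (base_freq[x] * base_freq[y]))
        profile.append(info)
    return profile


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

Profiles computed this way could then be passed to any k-means implementation as fixed-length feature vectors, one vector per fragment.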

SUBLIME 1.0: a Java-Based Tool to Automate BLAST Searches and to Comprehensively Identify Homologs of a Query Sequence

Jerry J. Palmer and John M. Logsdon, Jr
Program in Genetics and Molecular Biology and Department of Biology, Emory University, Atlanta, GA 30322 USA

BLAST is a widely used similarity search and alignment tool designed to explore sequence databases for similarities to a given query sequence. To assess whether a given hit, or alignment, constitutes evidence of homology, the strength of the alignment is compared, under a statistical model, to what can be expected from chance alone. A BlastP search (which typically takes 1-10 minutes when done at NCBI) may yield N significant hits for a particular protein query (with significance defined by some threshold expect value). While these hits may represent N proteins that are likely homologous to the initial query, they are often only an incomplete set of all homologous sequences in the database. To obtain a more complete list of homologs, a BlastP search can be carried out against each of the N protein hits. This procedure may result in additional, unique hits at the cost of several man-hours of repetitive work (if done manually); the process of BLASTing each result and adding the results to a master list would be repeated until no more new sequences were found. To automate this process, we have created a Java program called SUBLIME, for Search Using BLast Iteratively for Molecular Evolution. This multi-threaded application, which directly queries the NCBI databases, has already proven useful: this previously time-consuming and tedious work can now be accomplished in minutes (often < 15 min.). Upon completion, SUBLIME automatically generates a web page that contains a master list of proteins (putative homologs), the BlastP and TBlastN results for each protein query and a link for each protein to its Entrez record at NCBI. We are currently adding further features to the SUBLIME application and we will present the results of some of this ongoing development.
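The repeat-until-no-new-hits procedure described above can be sketched as a simple closure computation. This is an illustration only, written in Python rather than Java, and it is not SUBLIME's code: the search argument is a hypothetical callable standing in for "run BlastP and return the identifiers of the significant hits", and no real NCBI interface is used.

```python
from collections import deque


def iterative_homolog_search(query, search):
    """Collect every sequence reachable by repeatedly searching with each new hit.

    `search` is a hypothetical callable: it takes one sequence identifier and
    returns an iterable of identifiers of its significant hits.
    """
    found = {query}          # master list of putative homologs
    pending = deque([query])  # sequences not yet used as a query
    while pending:
        current = pending.popleft()
        for hit in search(current):
            if hit not in found:
                found.add(hit)
                pending.append(hit)
    return found
```

Each sequence is used as a query exactly once, so the number of searches equals the size of the final master list; this is also the part that lends itself to running several searches in parallel.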

Integrated Genetic Map Service (IGMS)

Harald Pankow, Heike Pospisil, Alexander Herrmann, and Jens G. Reich
Max-Delbrueck-Center for Molecular Medicine, Department of Bioinformatics, Robert-Roessle-Str.10 13092 Berlin-Buch, GERMANY

We present three novel functions implemented in the IGMS in Berlin-Buch.
The IGMS is a comprehensive information system that combines the knowledge from genomic sequence, genetic map and genetic disorders databases. This system is updated weekly and focuses on the analysis of EST data.
The first application identifies UniGene clusters that are differentially expressed in different types of cancer with respect to different reference tissues, using as criteria, for example, defined ratios of the number of ESTs found in the tumour tissue to the number found in normal tissues and a defined number of ESTs per cluster. The results can be combined with clinical data to assess the potential relevance of specific genes for patient survival or metastatic spread.
The second application maps ESTs with a specific expression profile, e.g. representing genes overexpressed in breast cancer, to the corresponding regions of the genome and vice versa, e.g. maps all genes on chromosome 8 that are overexpressed in breast cancer.
The third application generates a database of alternative splice forms for eight organisms from EST and mRNA sequence data. The results can be used to find splicing patterns specific for certain tissues or tumour types.

Dealing with Errors in Interactive Sequencing by Hybridization

Vinhthuy Phan and Steven Skiena
Computer Science Department, State University of New York at Stony Brook, Stony Brook NY, 11794-4400 USA

A realistic approach to Sequencing by Hybridization must deal with realistic sequencing errors. The results of such a method can surely be applied to similar sequencing tasks.
We provide the first algorithms for interactive sequencing by hybridization which are robust in the presence of hybridization errors. Under a strong error model allowing both positive and negative hybridization errors without repeated queries, we demonstrate accurate and efficient reconstruction with error rates up to 7%, using 11 DNA sequences from GenBank. Under the weaker traditional error model of Shamir and Tsur, RECOMB 2001, p269-277, we obtain accurate reconstructions with up to 20% false negative hybridization errors.
Finally, we establish theoretical bounds on the performance of the sequential probing algorithm of Skiena and Sundaram, J. Computational Biology, 1995, p333-353, under the strong error model.

Mining SNPs for Associating Disease with Transcription Factor Binding Site Altered by Mutation

Julia Ponomarenko, Tatyana Merkulova, Galya Orlova, Elena Gorshkova, and Misha Ponomarenko
Institute of Cytology and Genetics, Novosibirsk, 630090, RUSSIA

SNP-related alterations in conserved codons and splice sites, and hence in protein structure-function relationships, are easier to explain than alterations in the variable DNA sites that bind transcription factors (TFs). That is why we have developed the system rSNP_Guide, wwwmgs.bionet.nsc.ru/mgs/systems/rsnp, which associates SNP-caused disease with TF sites altered by mutation [Ponomarenko et al., NAR, 2001, 29, 312-316]. Our system treats two sorts of experimentally detected alterations: in the DNA sequence and in the pattern of DNA binding to an unknown TF. By applying rSNP_Guide, it is possible to predict known TF sites from alterations in the sequence-dependent recognition score that are consistent with the experimental alterations in DNA binding to the unknown TF. Our system provides both brief and in-depth SNP analysis, depending on the number of known TFs whose sites the user wants examined. We have already tested our system on many genes with experimentally known TF/disease associations. Among these control data, CETP (Sp1, dietary cholesterol response); TGM1 (AP-1 and CRE, squamous metaplasia); factor-IX (Ets, Leyden form of hemophilia B); gpD (GATA, Duffy blood group Fy{a-b-}); hMPO (Sp1, myelocytic leukemia); hAG (ER, myocardial infarction); h-delta-G (GATA, delta-thalassemia); and hRB (Sp1, abnormal tumor suppression) were examined. In addition, we have tested our system on site-directed mutagenesis data of both the "multiple substitutions" and "deletion" types. In this case, the known TF sites damaged artificially in the regulatory regions of the genes rAT(1A)R-C (MEF-2), hCD4 (Ets and ATF), hTOP3 (YY1 and USF), AchR-delta (MyoD and E2A), p53 (NFkB), c-myc (NFkB), iNOS (IRF-1) and hsp70 (HSF) were treated. Finally, two novel TF sites in which SNP-caused alterations could be associated with diseases were predicted and then successfully confirmed experimentally. First, a GATA site in the second intron of the mK-ras gene causes lung tumors. Second, the absence of a YY1 site in the sixth intron of the hTDO2 gene causes mental disorders. With this in mind, we hope that our system rSNP_Guide will be applicable to SNP-related analysis.

Presence of ATG Triplets in 5’ Untranslated Regions of Eukaryotic cDNAs Correlates with a "Weak" Context of the Start Codon

Igor B. Rogozin(1), Alexey V. Kochetov(2), Fyodor A. Kondrashov(1), Eugene V. Koonin(1) and Luciano Milanesi(3)
(1)National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA;
(2)Institute of Cytology and Genetics, 10. Lavrentyev Ave., Novosibirsk 630090, RUSSIA;
(3)Istituto di Tecnologie Biomediche Avanzate, Consiglio Nazionale Delle Ricerche, via Fratelli Cervi 93, 20090 Segrate (MI), ITALY

The context of the start codon (typically, AUG) and the features of the 5' untranslated regions (5'UTRs) are important for understanding translation regulation in eukaryotic mRNAs and for accurate prediction of the coding region in genomic and cDNA sequences. The presence of AUG triplets in 5'UTRs (upstream AUGs) might affect the initiation rate and, in the context of gene prediction, could reduce the accuracy of the identification of the authentic start. To reveal potential connections between the presence of upstream AUGs and other features of 5'UTRs, such as their length and the start codon context, we undertook a systematic analysis of the available eukaryotic 5'UTR sequences. We show that a large fraction of 5'UTRs in the available cDNA sequences, 15-53% depending on the organism, contain upstream ATGs. A negative correlation was observed between the information content of the translation start signal and the length of the 5'UTR. Similarly, a negative correlation exists between the "strength" of the start context and the number of upstream ATGs. Typically, cDNAs containing long 5'UTRs with multiple upstream ATGs have a "weak" start context, and in contrast, cDNAs containing short 5'UTRs without ATGs have "strong" starts. These counter-intuitive results may be interpreted in terms of upstream AUGs having an important role in the regulation of translation efficiency by ensuring a low basal translation level via double negative control and creating the potential for additional regulatory mechanisms. One such mechanism, supported by experimental studies of some mRNAs, involves removal of the AUG-containing portion of the 5'UTR by alternative splicing. Availability: An ATG_EVALUATOR program is available upon request from I.B. Rogozin (rogozin@ncbi.nlm.nih.gov).
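The two quantities being correlated can be illustrated with a short sketch. It is not the ATG_EVALUATOR program, and the context score is a deliberately crude stand-in for the information-content measure mentioned in the abstract: it is assumed here that a "strong" context means a purine at position -3 and a G at position +4 (Kozak's rule). The function names and the 0-based cds_start argument (index of the A of the annotated start codon) are illustrative.

```python
def count_upstream_atgs(cdna, cds_start):
    """Number of ATG triplets lying entirely within the 5'UTR (any frame)."""
    utr = cdna[:cds_start].upper()
    return sum(1 for i in range(len(utr) - 2) if utr[i:i + 3] == "ATG")


def start_context_score(cdna, cds_start):
    """Crude context score from 0 (weak) to 2 (strong).

    Assumption for illustration: one point for a purine at -3 and one point
    for G at +4, counting the A of the start codon as +1.
    """
    seq = cdna.upper()
    minus3 = seq[cds_start - 3] if cds_start >= 3 else ""
    plus4 = seq[cds_start + 3] if cds_start + 3 < len(seq) else ""
    return int(minus3 in ("A", "G")) + int(plus4 == "G")
```

Applied to a set of cDNAs with annotated starts, the two functions give one pair of numbers per sequence, which is the kind of table on which a correlation between upstream ATG count and context strength could be computed.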

Sequence-Structure Space and Resultant Data Redundancy in the Protein Data Bank

I.N. Shindyalov(1) and P.E. Bourne(2,3)
(1)San Diego Supercomputer Center, University of California San Diego,  9500 Gilman Drive, La Jolla, CA 92093-0537 USA;
(2)Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA;
(3)The Burnham Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037 USA

A study of sequence-structure space and resultant data redundancy has been performed using the Combinatorial Extension (CE) algorithm for determining structural alignment and BLAST for determining sequence similarity. Significant clusters in sequence-structure space associated with recurrent structures (convergent evolution) and protein superfamilies (divergent evolution) have been described. These observations have been compared to the SCOP classification of protein domains, which defines similar features. Both methods indicate an enormous redundancy of data in the Protein Data Bank (PDB), and hence a need to define representative (non-redundant) sets of proteins, especially for use in various computational analyses. Various representations of the PDB as sequence and/or structure non-redundant sets of protein chains have been defined, containing from 1200 to 6000 representatives. It was demonstrated that the commonly used sequence similarity criterion alone is not very efficient in selecting unique proteins. We demonstrate that sequence- or structure-based representative sets of single polypeptide chains contain approximately 20-30% redundancy by the complementary (sequence vs. structure) criterion. We propose here an approach for building representative sets using a combined sequence and structure similarity criterion, with additional conditions requiring adequate representation of proteins excluded from the set. Representative sets obtained using these various criteria, and the correlation between different sets, are analyzed. Representative sets are updated on a weekly basis and are available from cl.sdsc.edu/nr.html.

Identification and Accurate Modeling of Motifs Specifying Start Codon Locations in the Genome of an Unusual Cyanobacterium

Mark J Schreiber(1), John D Besemer(2), Mark Borodovsky(2), Chris M Brown(1)
(1)Department of Biochemistry, University of Otago, PO Box 56, Dunedin, New Zealand and (2)Department of Biology, Georgia Institute of Technology, 310 Ferst Drive, Atlanta, GA 30332, USA

Using the genome sequence of the cyanobacterium Synechocystis sp. PCC6803 and publicly available gene loci predictions we identified a previously unobserved element surrounding the start codon. Notably the Shine-Dalgarno ribosome-binding site conserved in almost all bacteria was found to be absent. Information Theory predicts that this element contains sufficient information to allow discrimination of the start codon by the ribosome.
To determine whether systematic error in genome annotation had caused this observation, we assessed the accuracy of the start codon predictions using a set of N-terminally mapped 2D gel spots from the organism. While the accuracy of start codon predictions was found to be only 75%, we did not believe that this would seriously bias the analysis. However, to further improve the predictions we developed a technique of iterative training that could provide start codon predictions with greater than 95% accuracy using only very small verified datasets. This technique was evaluated in both Synechocystis and E. coli, showing no prior dependence on any organism-specific motifs.
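The abstract does not say how the information in the element around the start codon was measured. A common way to quantify it, shown here purely as an illustrative sketch, is the per-position information content of a set of aligned sites, 2 - H bits for DNA, where H is the Shannon entropy of the observed base frequencies in that column. The sketch assumes equiprobable background bases and applies no small-sample correction; the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(sites):
    """Per-position information (bits) of equal-length aligned DNA sites.

    Assumptions for illustration: uniform background (maximum of 2 bits per
    position) and no correction for small sample size.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    bits = []
    for column in zip(*sites):
        total = len(column)
        entropy = -sum(
            (count / total) * math.log2(count / total)
            for count in Counter(column).values()
        )
        bits.append(2.0 - entropy)
    return bits
```

Summing the returned values over the positions of the element gives its total information, which can be compared with the information needed to locate the start sites in the genome.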

The Phosphoproteome Predicted: Using Neural Networks for Predicting Kinase Substrate Sites

Thomas Sicheritz-Ponten, Nikolaj Blom and Soren Brunak
Center for Biological Sequence Analysis, Technical University of Denmark, Bldg 208, DK-2800 Lyngby, DENMARK

Protein phosphorylation is the primary means of switching the activity of a cellular protein rapidly from one state to another. Thus, protein phosphorylation is considered to be a key event in many signal transduction pathways of biological systems. Phosphorylation of substrate sites at serine, threonine or tyrosine residues is performed by members of the protein kinase family. This gene family consists of approximately 860 members and is the second largest family in the human genome.
We aim to describe the complete predicted phosphoproteome: a description of the entire collection of phosphoproteins in the eukaryotic cell, the sites of reversible phosphorylation and the kinase subtype performing the phosphorylation event. Earlier, we developed a method, NetPhos[1], for predicting the general probability of a given residue being a potential phosphorylation site or not. In order to predict the identity of the most probable kinase for each site we have now developed NetPhosK[2].
To validate our approach, we are using information about evolutionary conservation from related species. For example, if a specific serine residue is predicted as a potential PKA site in human protein X and is also predicted to be a PKA site in the conserved rat and mouse homologs of protein X, we consider this to add strong confidence to the prediction. On the other hand, a high kinase score in combination with a lack of conservation of the acceptor residue in related species indicates that the site is specific to a given species or that the site could be phosphorylated in vitro only, lacking a physiological role. In order to characterize the Human PhosphoProteome we apply the predictor to the draft genome containing 24819 genes from Ensembl (version 1.0) and present statistics on potential acceptor sites, overlapping specificities and orphan protein families. The kinase-specific prediction server will be made publicly available on the Internet.


1. "Sequence- and Structure-Based Prediction of Eukaryotic Protein Phosphorylation Sites.", Blom, N., Gammeltoft, S., and Brunak, S. (1999), Journal of Molecular Biology: 294(5), 1351-1362

2. "NetPhosK: Prediction of Protein Kinase specificity of eukaryotic phosphorylation sites", Thomas Sicheritz-Ponten, Nikolaj Blom and Soren Brunak (manuscript in preparation)

Phylogenomic Atlases for Sequenced Microbial Genomes

T. Sicheritz-Ponten, J.O. Andersson, D. Ussery, A.J. Roger, J. Logsdon, R. Hirt and T.M. Embley
Center for Biological Sequence Analysis, Technical University of Denmark, Bldg 208, DK-2800 Lyngby, DENMARK

We have developed a method which combines phylogenomic information with DNA structural parameters. Phylogenetic trees are constructed for each gene in the genome, using PyPhy, and the results are visualized using DNA atlases. The original idea of PyPhy has been extended for quick tree-mining of ongoing EST projects.
Raw sequence data from ongoing EST projects is often incomplete and less reliable than edited and annotated end-product sequences. In order to facilitate the automated generation of phylogenetic trees even from partial sequence data, we transfer the phylogenetic start sequence to a so-called seed, which by our definition is the first and best match of the partial sequence against the non-redundant sequence database. The program identifies the first best match (via blastx) and uses this "seed" sequence to automatically select homologues, which are used together with the translated partial sequence (from the blastx result) for the alignment and phylogenetic reconstructions.
In order to facilitate the discovery of "interesting" features, we integrated AutoTreeS into the DNA Atlases, which plot structural measures for all positions in a long DNA sequence (an entire chromosome) in the form of color-coded wheels that combine evolutionary information from PyPhy and provide an excellent genomic data mining tool. In completely sequenced genomes, the order and the position of individual genes are known, which facilitates the drawing of phylomes. Sequences from EST projects are, most of the time, of unknown position relative to each other. In order to draw phylomes for them we developed an additional array-structure-based phylome visualization.

Splice Site Prediction by Using Neural Networks, Revisited Topic

Yuan Tian, Naira Hovakimyan and Mark Borodovsky
School of Biology, Georgia Institute of Technology, Atlanta, Georgia, 30332-0230 USA

Artificial neural networks have been applied to the prediction of the splice sites of experimentally validated genes. From chromosome I of Caenorhabditis elegans, sequences of 100 nt length have been collected from genes, with AGs (for acceptor sites) and GTs (for donor sites) in the middle positions. These sequences have been split into three sets, one used for NN training and the other two for NN validation. We used the SNNS, Stuttgart Neural Network Simulator [1], to build neural networks with a single hidden layer. The number of hidden layer neurons ranged from 3 to 15. Our experiments have shown that 5 neurons in the hidden layer give the best result. With our method of training we were able to detect in the test sets 88% of the acceptor sites with 0.023% false positive predictions. When different lengths of sliding window were tested, the 100 nt window gave the best prediction accuracy for acceptor sites. Predictions have also been made for donor sites.
The results were compared with those reported in the literature. Neural networks with a 61 nt long input window and 15 neurons in the hidden layer were applied to Arabidopsis thaliana DNA [2]. Information from the global coding/non_coding network was also used. That network could detect 80% of the acceptor sites with a 0.034% false positive rate. Neural networks with a 41 nt long input window and 20 neurons in the hidden layer were applied to human DNA [3]. In combination with the global coding/non_coding network, 90% of the true acceptor sites were predicted with a 0.162% false positive rate. Our predictions for validated genes of C. elegans have been compared with the performance of Netgene2, originally described in [3]. For this species Netgene2 detected 70% of acceptor sites. In the current paper we will present the results of splice site prediction for human and Arabidopsis thaliana genomic sequences as well.
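The data preparation step described above (windows with AG or GT in the middle positions) can be sketched as follows. The abstract does not state how the sequences were presented to the network, so the four-bits-per-base encoding below is an assumption, chosen because it is a common input representation for neural networks on DNA; the function names are illustrative and unrelated to SNNS.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def candidate_windows(sequence, dinucleotide="AG", window=100):
    """Return (position, window) pairs with the dinucleotide in the middle.

    For an even `window`, the dinucleotide occupies the two central positions.
    Candidates too close to either end of the sequence are skipped.
    """
    seq = sequence.upper()
    half = window // 2
    result = []
    for i in range(half - 1, len(seq) - half):
        if seq[i:i + 2] == dinucleotide:
            result.append((i, seq[i - half + 1:i + half + 1]))
    return result


def encode(window_seq):
    """Flatten a window into a 0/1 vector, four inputs per base (assumed encoding)."""
    vector = []
    for base in window_seq:
        vector.extend(ONE_HOT.get(base, (0, 0, 0, 0)))  # unknown bases map to zeros
    return vector
```

With window=100 each candidate becomes a 400-element input vector; windows around annotated sites would serve as positive examples and the remaining AG or GT windows as negatives.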


1. www.informatik.uni-stuttgart.de/ipvr/bv/projekte/snns/snns.html

2. S.M. Hebsgaard, P.G. Korning, N. Tolstrup, J. Engelbrecht, P. Rouze, S. Brunak (1996), "Splice site prediction in Arabidopsis thaliana pre mRNA by combining local and global sequence information.", Nucleic Acids Res 24(17):3439-52

3. S. Brunak, J. Engelbrecht, S. Knudsen (1991), "Prediction of Human mRNA Donor and Acceptor Sites from the DNA Sequence.", J Mol Biol 220:49-65

Determining the Minimum Number of Types Necessary to Represent the Sizes of Protein Atoms

Jerry Tsai, Neil Voss, and Mark Gerstein
Department of Biochemistry & Biophysics, 103 Biochemistry/ Biophysics Building, Texas A&M University, 2128 College Station, Texas 77843-2128 USA and Department of Molecular Biophysics and Biochemistry Yale University, Bass Center, 266 Whitney Avenue P.O. Box 208114, New Haven, CT 06520-8114 USA

Traditionally, for packing calculations people have collected atoms together into a number of distinct "types". These types, in fact, often represent a heavy atom and its associated hydrogens (i.e. a united atom model) since hydrogens are not usually resolved in protein crystal structures. Also, atom typing is traditionally done strictly according to basic chemistry. This usually gives rise to 20 to 30 types of atoms in proteins -- such as carbonyl carbons, carbonyl oxygens, methyl groups, and hydroxyl groups. No one has yet investigated how similar in packing these chemically derived types are. Here we address this question in detail, using Voronoi volume calculations on a set of high-resolution crystal structures. We perform a rigorous clustering analysis with cross-validation on tens of thousands of atom volumes and attempt to compile them into types based purely on packing criteria. From this analysis, we are able to determine a "minimal" set of 18 atom types that most efficiently represents the spectrum of packing in proteins. Our analysis highlights a number of inconsistencies in traditional chemical typing schemes. Some united atoms exhibit unintuitive packing volumes. In particular, tetrahedral carbons with two hydrogens are almost identical in size to many aromatic carbons with a single hydrogen, which are thought to be smaller in size. Our programs are available from bioinfo.mbb.yale.edu/geometry and molmovdb.org.

Reannotation of the E. coli K12 Genome

Vera van Noort(1,2), Marie Skovgaard(1), Thomas Schou Larsen(1) and David Ussery(1)
(1)Centre for Biological Sequence Analysis, The Technical University of Denmark, DENMARK and (2)Theoretical Biology / Bioinformatics, Utrecht University, The NETHERLANDS

E. coli K12 MG1655 was sequenced in 1997. The genes that were annotated either had experimental evidence or were predicted using codon usage statistics. In 1998 the annotation was updated by using the GeneMark program for prediction of genes. A recent study showed that the number of annotated protein coding genes in E. coli K12 MG1655 is about 15 percent higher than the number of expected genes, calculated based on stop codon frequency and matches of Long Open Reading Frames (ORFs) to SwissProt. This 15 percent consists of ORFs that occur in the genome by chance, but are not real genes. In biology and bioinformatics the annotation of genomes is used for a number of purposes, for example the choice of probes in microarray experiments, whole genome analysis, and inclusion of hypothetical proteins in protein databases like SwissProt, upon which a lot of analyses are based. Thus an accurate annotation is necessary. The annotation that we have made represents a more reliable set of genes than the current annotation. Furthermore, we have given a measure of reliability to all GenBank annotated genes and genes that were annotated by us. We have done this, firstly, by finding E. coli proteins with experimental evidence in SwissProt and mapping them to the genome.
Secondly, genefinding was done using Profinder. Profinder is an HMM-based genefinder, which is trained on high quality training sets of gene-containing sequences constructed from extensions of ORF homology hits in the SwissProt database. Null states are estimated from the shadows of these high-confidence genes. Using posterior log-odds decoding, DNA sequences may then be scored for gene content using the trained HMM. Only high scoring genes were included in the reliable gene set.
Thirdly, homology searches were done against non-E. coli genes in SwissProt and against translated ORFs from fully sequenced genomes. Again, only high scoring genes were included as reliable. These three methods led to a set of reliable genes that were visually inspected using the Artemis program developed by the Sanger Centre. This made it clear that most unreliable genes were short ORFs lying on the opposite strand from genes in their direct environment. Such ORFs are also questionable because prokaryotes tend to organize their genes in operons.
Apart from a measure of reliability, we also included the information of whether or not a gene was found in a transcript during experiments. Using Affymetrix microarray technology, binding levels of probes to mRNA were measured. It is known that probes can have different binding affinities, thereby displaying up to a 50-fold difference in mRNA level for the same gene. Using neural networks, we modelled these binding affinities based on the sequence of the probes and their deviation from the expression level of the gene they belong to. We corrected the probe levels for the calculated binding affinities, thereby obtaining probe levels that can be compared. Using these corrected probe levels, gene levels were calculated and 'low', 'medium' or 'highly expressed' was added as a label to our annotated genes.
As it is impossible for people other than the submitters of a genome to suggest changes in GenBank entries, our annotation will be available on a webserver at www.cbs.dtu.dk.

Identifying Number of Clusters in Gene Expression Data

Dali Wang, Habtom Ressom, Mohamad T. Musavi, and Cristian Domnisoru
University of Maine, Department of Electrical & Computer Engineering, Intelligent Systems Laboratory, 201 Barrows Hall, Orono, ME 04469, USA

Motivation: Clustering is a very useful and important technique for analyzing gene expression data. The Self-Organizing Map (SOM) is one of the most useful clustering algorithms and has been used to cluster gene expression data. SOM algorithms require the number of clusters as one of the initialization parameters before clustering. However, we have no prior information about the number of clusters in a gene expression data set. The method currently in use is to validate the results of SOM runs to find the best number. This approach is very inconvenient and time-consuming.
This paper applies a novel SOM model, called Double SOM (DSOM), to cluster gene expression data; it overcomes this limitation by showing clearly and visually how many clusters would be best. To validate this technique, we also use a novel validation technique known as figure of merit (FOM).
Results: We use DSOM to cluster an artificial data set and two kinds of real gene expression data sets. Our results reveal that DSOM can not only cluster the whole data set but also tell us the best number of clusters in it quickly and clearly.

Availability: All materials related to this paper are available upon request from the authors.
Contact: dwang@eece.maine.edu
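
The abstract does not define the figure of merit. As a minimal sketch, assuming the leave-one-condition-out form commonly used to compare clusterings of expression data, the Python below clusters the genes with one condition withheld and then measures how tightly the withheld values group within each cluster (lower is better). The small k-means routine only stands in for DSOM for illustration; all function names and the toy data are hypothetical.

```python
import math

def kmeans(rows, k, iters=20):
    """Tiny deterministic k-means (illustrative stand-in, not DSOM)."""
    # Seed the centroids with k evenly spaced rows.
    centroids = [list(rows[i * len(rows) // k]) for i in range(k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])))
            for r in rows
        ]
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def figure_of_merit(data, k, cluster=kmeans):
    """Aggregate leave-one-condition-out FOM (an assumed definition)."""
    n_genes, n_cond = len(data), len(data[0])
    total = 0.0
    for e in range(n_cond):
        # Cluster the genes on every condition except e.
        reduced = [row[:e] + row[e + 1:] for row in data]
        labels = cluster(reduced, k)
        # Within-cluster squared deviation of the withheld condition.
        squared = 0.0
        for c in set(labels):
            vals = [data[g][e] for g in range(n_genes) if labels[g] == c]
            mean = sum(vals) / len(vals)
            squared += sum((v - mean) ** 2 for v in vals)
        total += math.sqrt(squared / n_genes)
    return total

# Toy expression matrix (rows = genes, columns = conditions), made up here.
toy = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [5.0, 5.0, 5.0], [5.1, 5.0, 5.1]]
for k in (1, 2, 3):
    print(k, figure_of_merit(toy, k))
```

Scanning k in this way is the repeated validation step that the abstract describes as inconvenient; it is shown only to make the FOM idea concrete.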

Genome Trees Constructed Using Five Different Approaches Suggest New Major Bacterial Clades

Yuri I. Wolf, Igor B. Rogozin, Nick V. Grishin, Roman L. Tatusov, Eugene V. Koonin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 USA

The availability of multiple complete genome sequences from diverse taxa prompts the development of new phylogenetic approaches, which attempt to incorporate information derived from comparative analysis of complete gene sets or large subsets thereof. Such attempts are particularly relevant because of the major role of horizontal gene transfer and lineage-specific gene loss, at least in the evolution of prokaryotes.
Five largely independent approaches were employed to construct trees for completely sequenced bacterial and archaeal genomes: i) presence-absence of genomes in clusters of orthologous genes; ii) conservation of local gene order (gene pairs) among prokaryotic genomes; iii) parameters of identity distribution for probable orthologs; iv) analysis of concatenated alignments of ribosomal proteins; v) comparison of trees constructed for multiple protein families. All constructed trees support the separation of the two primary prokaryotic domains, bacteria and archaea, as well as some terminal bifurcations within the bacterial and archaeal domains. Beyond these obvious groupings, the trees made with different methods appeared to differ substantially in terms of the relative contributions of phylogenetic relationships and similarities in gene repertoires caused by similar lifestyles and horizontal gene transfer to the tree topology. The trees based on presence-absence of genomes in orthologous clusters and the trees based on conserved gene pairs appear to be strongly affected by gene loss and horizontal gene transfer. The trees based on identity distributions for orthologs and particularly the tree made of concatenated ribosomal protein sequences seemed to carry a stronger phylogenetic signal. The latter tree supported three potential high-level bacterial clades: i) Chlamydia-Spirochetes, ii) Thermotogales-Aquificales (bacterial hyperthermophiles), and iii) Actinomycetes-Deinococcales-Cyanobacteria. The latter group also appeared to join the low-GC Gram-positive bacteria at a deeper tree node. These new groupings of bacteria were supported by the analysis of alternative topologies in the concatenated ribosomal protein tree using the Kishino-Hasegawa test and by a census of the topologies of 132 individual groups of orthologous proteins.
Additionally, the results of this analysis put into question the sister-group relationship between the two major archaeal groups, Euryarchaeota and Crenarchaeota, and suggest instead that Euryarchaeota might be a paraphyletic group with respect to Crenarchaeota.
We conclude that, the extensive horizontal gene flow and lineage-specific gene loss notwithstanding, extension of phylogenetic analysis to the genome scale has the potential of uncovering deep evolutionary relationships between prokaryotic lineages.
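
To illustrate the first of the five approaches, the following Python sketch turns presence-absence profiles into a pairwise distance matrix from which a tree could be built. The abstract does not say which distance measure was used; Jaccard distance is an assumption made here, and the genome names and cluster identifiers are invented for the example.

```python
def jaccard_distance(a, b):
    """1 - |shared clusters| / |clusters present in either genome|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(profiles):
    """profiles maps a genome name to the set of orthologous clusters it has."""
    names = sorted(profiles)
    matrix = [[jaccard_distance(profiles[x], profiles[y]) for y in names]
              for x in names]
    return names, matrix

# Hypothetical presence-absence data (cluster IDs are made up).
profiles = {
    "genomeA": {"c1", "c2", "c3", "c4"},
    "genomeB": {"c1", "c2", "c3"},
    "genomeC": {"c4", "c5"},
}
names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

Because such a distance depends only on shared gene content, two genomes that lost or acquired the same genes look close regardless of ancestry, which is consistent with the abstract's observation that these trees are strongly affected by gene loss and horizontal gene transfer.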

Expression Profiler: Software to Analyze and Visualize Gene Expression Profiles

Tao Wu and Eileen Kraemer
Computer Science Department, The University of Georgia, Athens, GA 30602 USA

Gene expression profiling has become an important method in genomic research. Current software systems for visualizing and analyzing large amounts of expression profiling data suffer from insufficient flexibility in zooming and manipulating graphical representations of the expression data. This limits the degree of detail at which a user is able to explore the expression data and examine the results of numerous analysis methods on these data.
We have developed the ExpressionProfiler, a software system, written in Java, for visualizing and analyzing gene expression profiling data. The ExpressionProfiler allows very flexible zooming on the graphical representation of the expression data, and supports various operations for editing the data, and interacting with their graphical representation.
In the ExpressionProfiler, we have implemented two different views of the expression data and one clustering algorithm, the Unweighted Pair-Group Method with Arithmetic Mean (UPGMA). However, the ExpressionProfiler has been built as an extensible framework -- additional analysis algorithms and associated visualizations can easily be added to the existing system and still enjoy the flexible zooming capability that the current system provides. Interactions with the current visualizations include selection of subsets of genes and/or conditions, tree restructuring, and reordering and regrouping of clusters. In addition, the user is able to write out the resulting trees in standard formats, and to save or print images of the trees and heat maps.
The ExpressionProfiler achieves all this with limited memory requirement -- it maintains a buffered image, which is only part of the entire graphical representation of the data. In this way, the ExpressionProfiler creates an impression of smooth scrolling as the user requests different parts of the visualization, without excessive use of memory.
Images of the ExpressionProfiler visualizations, as well as class files, instructions for installation and use, and sample input files may be found at: jerry.cs.uga.edu/~twu
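
For readers unfamiliar with UPGMA, the clustering algorithm named above, here is a minimal Python sketch of the textbook procedure. It is not taken from the ExpressionProfiler source (which is written in Java); the function name, the nested-tuple tree representation and the toy distances are assumptions for illustration.

```python
def upgma(names, dist):
    """Build a UPGMA tree from a symmetric distance matrix (list of lists).

    Returns nested tuples (left, right, height); leaves are plain names.
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (names[i], 1) for i in range(len(names))}
    # Pairwise distances between active clusters, keyed by (low id, high id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), closest = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop distances involving the two merged clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, v in merged.items():
            d[(c, next_id)] = v  # existing ids are always below next_id
        clusters[next_id] = ((tree_a, tree_b, closest / 2), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Toy distances between three expression profiles (made up).
names = ["g1", "g2", "g3"]
dist = [[0, 2, 6],
        [2, 0, 6],
        [6, 6, 0]]
print(upgma(names, dist))
```

The third element of each tuple is the height at which the two subtrees join, which is the quantity a dendrogram view would draw on its distance axis.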

Analysis of Gene Expression Data by Ellipsoid ART and ARTMAP

Rui Xu, Donald C. Wunsch II
Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, University of Missouri – Rolla, MO 65409-0249 USA

1. Purpose
Advances in DNA microarray techniques make it possible to measure the expression levels of thousands of genes simultaneously under different conditions or treatments. Finding the biological information behind this large amount of data has become a big challenge for computational biologists. Many unsupervised clustering methods and supervised learning algorithms have been successfully used in the field. In this study, we use a new family of neural network architectures, Ellipsoid ART and ARTMAP (EA/EAM), to analyze the AML/ALL data set and the human cancer cell line (NCI60) data set.
2. Method
EA/EAM derives from the ideas of Fuzzy ART and ARTMAP (FA/FAM). In this architecture, hyper-ellipsoids, instead of hyper-rectangles, are used to represent the shapes of the generated categories. EA/EAM keeps all the properties of FA/FAM and may describe the data structure more efficiently.
3. Result
Two data sets are presented to EA/EAM. One is the leukemia data set, which includes samples of two classes of leukemia (acute myeloid leukemia and acute lymphoblastic leukemia). EA/EAM classifies all the training samples and 33 of the 34 test samples correctly, and the one misclassified sample is widely regarded as an outlier by most other classifiers. The NCI60 data set consists of expression profiles for 1376 genes in a set of 60 human cancer cell lines. Most cell lines whose tissues have a common origin are clustered in the same category.
4. Conclusion
The results show that EA/EAM is a very useful technique for analyzing large-scale gene expression data, both for classification and for clustering.


T. Golub et al. "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring." Science, 286: 531-537, 1999.

G. Anagnostopoulos, M. Georgiopoulos. "Ellipsoid ART and ARTMAP for incremental clustering and classification." IJCNN01, pp. 1221-1226, 2001.

U. Scherf, D. T. Ross, et al. "A gene expression database for the molecular pharmacology of cancer." Nature Genetics, 24(3): 236-244, 2000.

R. Tibshirani et al. "Clustering methods for the analysis of DNA microarray data." Technical report, Department of Statistics, Stanford University.

DIGIT: A Novel Gene Finding Program by Combining Gene-Finders

Tetsushi Yada(1), Yasushi Totoki(2), Yoshio Takaeda(3), Yoshiyuki Sakaki(1), Toshihisa Takagi(1)
(1)Human Genome Center, Institute of Medical Science, University of Tokyo, JAPAN;
(2)Genomic Sciences Center, RIKEN, JAPAN;
(3)Mitsubishi Research Institute, Inc., JAPAN

We have developed a general-purpose algorithm which finds genes by combining multiple existing gene-finders. The algorithm has been implemented in a novel gene-finder named DIGIT. An outline of the algorithm is as follows. First, existing gene-finders are applied to an uncharacterized genomic sequence (the input sequence). Next, DIGIT produces all possible exons from the results of the gene-finders, and assigns them their exon types, reading frames and exon scores. Finally, DIGIT searches for a set of exons whose additive score is maximized under their reading frame constraints. A Bayesian procedure and a hidden Markov model are used to infer the exon scores and to search for the exon set, respectively. We have designed DIGIT so as to combine FGENESH, GENSCAN and HMMgene, and have assessed its prediction accuracy using recently compiled benchmark data sets. For all data sets, DIGIT successfully discarded many false positive exons predicted by the gene-finders and yielded remarkable improvements in sensitivity and specificity at the gene level compared with the best gene-level accuracies achieved by any single gene-finder.
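
The final step, choosing a set of exons whose additive score is maximal under reading frame constraints, can be pictured with a small dynamic program. The Python sketch below is a deliberately simplified stand-in and not DIGIT's method: the abstract states that a hidden Markov model performs the search, whereas this uses plain chaining over non-overlapping exons. The Exon fields, the phase convention (bases at the start of an exon that complete the previous codon) and the rule that a chain starts at a phase-0 exon are assumptions made for illustration.

```python
from typing import NamedTuple

class Exon(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # inclusive
    phase: int    # bases at the exon start that finish the previous codon (0-2)
    score: float  # exon score, e.g. as inferred from the gene-finders' output

def end_remainder(exon):
    """Bases left over at the exon end that begin an incomplete codon."""
    return (exon.end - exon.start + 1 - exon.phase) % 3

def compatible(prev, nxt):
    """nxt may follow prev if they do not overlap and the frame is kept."""
    return prev.end < nxt.start and nxt.phase == (3 - end_remainder(prev)) % 3

def best_exon_chain(exons):
    """Return (total score, exon list) of the best frame-consistent chain."""
    order = sorted(exons, key=lambda e: e.end)
    best = []  # best[i]: best (score, chain) ending at order[i], or None
    for i, exon in enumerate(order):
        candidates = []
        if exon.phase == 0:  # assumed rule: chains start in phase 0
            candidates.append((exon.score, [exon]))
        for j in range(i):
            if best[j] is not None and compatible(order[j], exon):
                candidates.append((best[j][0] + exon.score, best[j][1] + [exon]))
        best.append(max(candidates, key=lambda c: c[0]) if candidates else None)
    valid = [b for b in best if b is not None]
    return max(valid, key=lambda c: c[0]) if valid else (0.0, [])

# Hypothetical candidate exons pooled from several gene-finders.
candidates = [
    Exon(1, 10, 0, 5.0),
    Exon(5, 15, 0, 6.0),
    Exon(20, 30, 2, 4.0),
    Exon(20, 31, 0, 8.0),
]
score, chain = best_exon_chain(candidates)
print(score, [(e.start, e.end) for e in chain])
```

In this toy input the highest-scoring single exon loses to a lower-scoring pair that keeps the reading frame, which is the kind of trade-off a frame-constrained search has to resolve.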

A Visualization System for Protein Interaction Mapping

Yong Zhang(1), Hui Tian(1), Jonathan Arnold(2), Eileen Kraemer(1)
(1)Computer Science Department and (2)Genetics Department, University of Georgia, Athens, GA, USA

An exciting challenge in science today is to use sequenced genomes to predict how living systems function and evolve. The goal is to develop a new systems approach using sequenced genomes to identify the molecular machines underlying fundamental processes like transcription, metabolism, development, biological clocks, transvection, mating, aging, and pathogenicity. Protein-protein interactions are crucial to understanding these biological processes, and thus protein interaction mapping is an important element of this work. Visualization can provide scientists with insight into the relationships among these proteins.
We have developed a tool designed to assist scientists in identifying clusters in protein-interaction data, through visualization and interaction. To begin, the user may select or provide a data set representing protein-protein interactions. Input may consist of either a simple listing of names of interacting pairs, as with the mapping data from Ito et al. (2000) on S. cerevisiae, or may include a numerical value representing the strength of the interaction. Users may then select from among several graph clustering, layout, coloring, and shading algorithms, view a 3-dimensional display of the protein-protein interaction map, and interact with this display to:
  • search for a particular protein in the graph
  • obtain additional information about nodes (proteins), edges (interactions), and strongly connected components (interaction clusters)
  • adjust the graph layout by moving or deleting nodes or clusters
  • select a node to serve as a "center", hide other nodes, and then interactively add nodes back in a step-by-step fashion
  • selectively color nodes to emphasize similarities or differences
  • apply graded coloration techniques to highlight the relative distance of various proteins from a selected node or cluster
  • modify the user's perspective on the graph through position and rotation control

The tool is implemented in Java-3D, which facilitates web-based interaction and distribution of results. Several heuristic algorithms have been implemented (simple, cluster, spring embedder, simulated annealing), and the results compared, both for the usefulness of the resulting graph for emphasizing clusters, and the time required to produce the graph. The executable version of the program, instructions for download and use, and details of the comparison are available for download at: jerry.cs.uga.edu/~yozhang
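
The graded coloration feature listed above highlights how far each protein lies from a selected node. One plausible way to compute such a grading, sketched here in Python rather than the tool's Java and not taken from its source, is a breadth-first search over the interaction pairs followed by a linear mapping from hop count to shade. The function names and protein labels are hypothetical.

```python
from collections import deque

def hop_distances(pairs, source):
    """Breadth-first search: number of interactions from source to each protein."""
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist  # proteins unreachable from source are absent

def graded_shades(dist):
    """Map hop counts to shades: 1.0 at the selected node, 0.0 at the farthest."""
    farthest = max(dist.values()) or 1  # avoid dividing by zero for a lone node
    return {node: 1.0 - d / farthest for node, d in dist.items()}

# Hypothetical interacting pairs, in the simple "list of names" input form.
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
shades = graded_shades(hop_distances(pairs, "P1"))
print(shades)
```

Proteins with no path to the selected node receive no shade at all, so a viewer could render them in a neutral color or hide them.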


    Ito, T.; Tashiro, K.; Muta, S.; et al (2000). "Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins." Proc Natl Acad Sci USA 97(3):1143-1147.
