The Fifth Georgia Tech International Conference on Bioinformatics

Charting Chemical Space with Computers
Pierre Baldi
University of California, Irvine, CA

Small molecules with at most a few dozen atoms play a fundamental role in organic chemistry and biology. They can be used as combinatorial building blocks for chemical synthesis, as molecular probes for perturbing and analyzing biological systems, and for the screening/design/discovery of new drugs. As datasets of small molecules become increasingly available, it becomes important to develop computational methods to store, search, classify, and analyze small molecules and in particular to predict their physical, chemical, and biological properties.
We will describe databases and machine learning methods, in particular kernel methods, for chemical molecules represented by 1D strings, 2D graphs of bonds, and 3D structures. We will demonstrate state-of-the-art results for the prediction of physical, chemical, and biological properties of small molecules and the discovery of new reactions and compounds. More broadly, we will discuss some of the challenges and opportunities for computer science, AI, and machine learning in chemistry.
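
As a rough illustration of what a kernel on molecules can look like, the sketch below computes Tanimoto similarity between binary fingerprints and assembles a kernel matrix. The abstract does not say which kernels are used; the Tanimoto measure is swapped in here as a common choice for fingerprint data, and the molecule names and bit sets are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def kernel_matrix(fps):
    """Return sorted molecule names and the pairwise similarity matrix."""
    names = sorted(fps)
    return names, [[tanimoto(fps[p], fps[q]) for q in names] for p in names]


# Hypothetical fingerprints: each integer stands for one substructure feature.
fingerprints = {
    "mol_a": frozenset({1, 4, 7, 9}),
    "mol_b": frozenset({1, 4, 8}),
    "mol_c": frozenset({2, 3}),
}

names, K = kernel_matrix(fingerprints)
for name, row in zip(names, K):
    print(name, [round(v, 2) for v in row])
```

A matrix like K could be passed to any kernel method that accepts a precomputed similarity matrix; real fingerprints would come from a cheminformatics toolkit rather than hand-written sets.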

Comparative analysis of gene expression: insight into the evolution of transcription regulation
Naama Barkai
Weizmann Institute of Science

Evolution of gene expression plays a prominent role in generating phenotypic diversity, but little is known about the genetic basis underlying broad modulations of the genome-wide transcription program. To gain insights into the principles underlying variations in gene expression between closely related species, our lab focuses on the analysis of yeast species of varying evolutionary distances. I will describe recent results concerning both the genetic basis underlying gene expression evolution and generic properties which influence this evolution. Possible implications of the results for models of gene expression evolution will be discussed.

Nature vs Nurture Studied Using Protein Structure
Philip E. Bourne
University of California San Diego

In recent work we have shown that protein structure is a useful tool in the study of evolution [1]. We were able to construct a reasonably accurate tree of life from the simple presence or absence of fold superfamilies. We have taken this work in several directions, of which three will be discussed. First, what does the species-specific use of fold space tell us about the usage of functional space [2]? Second, there is evidence that nurture - in the form of influence from the environment - impacted what protein folds emerged and how they have been used [3]. Third, each protein superfamily has its own evolutionary story to tell, and we have looked at the protein kinase-like superfamily as a recent example [4].

[1] S. Yang, R.F. Doolittle and P.E. Bourne 2005 Phylogeny Determined through Protein Domain Content Proc. Natl. Acad. Sci. (USA) 102(2): 373-378.
[2] L. Xie and P.E. Bourne 2005 Functional Coverage of the Human Genome by Existing Structures, Structural Genomics Targets and Homology Models. PLoS Comp Biol 1(3) e31.
[3] C. Dupont, K. Briedis, S. Yang, B. Palenik, P.E. Bourne 2005 in preparation
[4] E. Scheeff and P.E. Bourne 2005 Structural Evolution of the Protein Kinase-Like Superfamily PLoS Comp Biol in early release.

Incorporation of splice site probability models for non-canonical introns improves gene structure prediction in plants
Volker Brendel
Iowa State University

The vast majority of introns in protein-coding genes of higher eukaryotes have a GT dinucleotide at their 5'-terminus and an AG dinucleotide at their 3'-end. About 1-2% of introns are non-canonical, with the most abundant subtype of introns being characterized by GC and AG dinucleotides at their 5'- and 3'-termini, respectively. Most current gene prediction software, whether based on ab initio or spliced alignment approaches, does not include explicit models for non-canonical introns or may exclude their prediction altogether. With present amounts of genome and transcript data it is possible to apply statistical methodology to non-canonical splice site prediction. We pursued one such approach and describe the training and implementation of GC-donor splice site models for Arabidopsis and rice, with the goal of exploring whether specific modeling of non-canonical introns can enhance gene structure prediction accuracy. Our results indicate that the incorporation of non-canonical splice site models yields dramatic improvements in annotating genes containing GC-AG and AT-AC non-canonical introns.
Comparison of models shows differences between monocot and dicot species, but also suggests GC-intron specific biases independent of taxonomic clade. We also present evidence that GC-AG introns occur preferentially in genes with atypically high exon counts.
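
The idea of a splice site probability model can be sketched with a simple position weight matrix scored as a log-odds ratio against a uniform background. This is only an illustration of the general approach, not the models trained in this work; the nine-base window (3 exon bases followed by 6 intron bases), the pseudocount, and the training sequences are all hypothetical.

```python
import math

BASES = "ACGT"


def train_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned donor-site windows."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def log_odds(seq, pwm, background=0.25):
    """Sum of log2(P_model / P_background) over all positions of the window."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(seq))


# Hypothetical GC-donor windows: 3 exon bases, then the intron starting with GC.
gc_donor_sites = ["CAGGCAAGT", "AAGGCAAGT", "CAGGCAAGA", "TAGGCGAGT"]
pwm = train_pwm(gc_donor_sites)

score_site = log_odds("CAGGCAAGT", pwm)   # resembles the training sites
score_other = log_odds("TTTATTTCC", pwm)  # unrelated window
print(round(score_site, 2), round(score_other, 2))
```

A positive score means the window looks more like the training sites than like background sequence. A gene finder would combine such scores with the other evidence it uses for exon and intron assignment.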

E. coli: Curation and Analysis of the Currently Largest Electronically-encoded Regulatory Network.
Julio Collado-Vides
Program of Computational Genomics, Center for Genomic Sciences, UNAM, Cuernavaca, Mexico

In the first part of my talk, I will summarize recent progress in curation of what currently constitutes the largest electronically-encoded transcriptional regulatory network of a free-living organism, that of Escherichia coli K-12. My lab has been curating operon organization and transcriptional regulation in E. coli for years. This effort feeds both RegulonDB and EcoCyc. Navigation features of RegulonDB version 5.0 will be described.
The second part of my talk will be devoted to some examples of biological analyses we have recently performed with this accumulated knowledge. We have recently classified transcription factors or regulators (TFs) into three classes based on the origin of their allosteric metabolite, as either internal, external or hybrid sensing systems. The TF repertoire is shown to be mostly governed by the internal sensing subset of TFs, due mostly to interactions by global regulators. Topological properties of the network will also be discussed.

[1] Salgado, H., et al. (2004) RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32 Database issue, D303-306.
[2] Martinez-Antonio, A., and Collado-Vides, J. (2003) Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol 6, 482-489.
[3] Martínez-Antonio A., Janga S.C., Salgado H. and Collado-Vides J. (2006) "The Internal sensing machinery directs the activity of the regulatory network in Escherichia coli" Trends in Microbiol. (in press).

A simple physical model for scaling in protein-protein interaction networks
Eric Deeds
Harvard University

It has recently been demonstrated that many biological networks exhibit a "scale-free" topology where the probability of observing a node with a certain number of edges follows a power law. This observation has been explained in terms of dynamical evolutionary models. Here we consider the network of protein-protein interactions and demonstrate that two published independent measurements of these interactions produce graphs that are only weakly correlated with one another despite their strikingly similar topology. We then propose a physical model based on the fundamental principle that (de)solvation is a major physical factor in protein-protein interactions. This model reproduces not only the scale-free nature of such graphs but also all higher-order correlations in these networks. Key support for the model is provided by the discovery of a significant correlation between the number of interactions made by a protein and the fraction of hydrophobic residues on its surface.

Designing RNA conformational changes
Andrew Ellington
University of Texas at Austin

Nucleic acids can be selected that bind to ligands and catalyze reactions. The functionalities of selected RNA molecules are dependent in large measure on their secondary structures, which can be readily predicted and engineered. In particular, it has proven possible to engineer nucleic acid secondary structures to undergo programmed conformational changes in response to ligands. The resultant aptamer beacons and aptazymes can be used as biosensors for reporting on individual analytes. These molecules may also have applications in vivo as part of engineered genetic circuits for signal transduction.

Mining sequence annotation databanks for association patterns
Dmitrij Frishman
Technische Universität, Munich, Germany

Motivation: Millions of protein sequences currently being deposited to sequence databanks will never be annotated manually. Similarity based annotation generated by automatic software pipelines unavoidably contains spurious assignments due to the imperfection of bioinformatics methods. Examples of such annotation errors include over- and under-predictions caused by the use of fixed recognition thresholds and incorrect annotations caused by transitivity based information transfer to unrelated proteins or transfer of errors already accumulated in databases. One of the most difficult and timely challenges in bioinformatics is the development of intelligent systems aimed at improving the quality of automatically generated annotation. A possible approach to this problem is to detect anomalies in annotation items based on association rule mining.

Results: We present the first large-scale analysis of association rules derived from two large protein annotation databases - Swiss-Prot and PEDANT - and reveal novel, previously unknown tendencies of rule strength distributions. Most of the rules are either very strong, or very weak, with rules in the medium strength range being relatively infrequent. Based on dynamics of error correction in subsequent Swiss-Prot releases and on our own manual analysis we demonstrate that exceptions from strong rules are, indeed, significantly enriched in annotation errors and can be used to automatically flag them. We identify different strength dependencies of rules derived from different fields in Swiss-Prot. A compositional breakdown of association rules generated from PEDANT in terms of their constituent items indicates that most of the errors that can be corrected are related to gene functional roles. Swiss-Prot errors are usually caused by under-annotation due to its conservative approach, while automatically generated PEDANT annotation suffers from over-annotation.

Availability: All data generated in this study are available for download and browsing at http://pedant.gsf.de/ARIA/index.htm.
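
The flagging of exceptions to strong rules can be sketched as follows, under simplifying assumptions: rules are restricted to single-item pairs (A implies B), rule strength is taken to be plain confidence, and the protein identifiers, annotation items, and thresholds are hypothetical. The rule definitions and strength measure used in the study itself may differ.

```python
from collections import Counter
from itertools import permutations


def mine_rules(records, min_support=2, min_confidence=0.75):
    """Return {(lhs, rhs): confidence} for pairwise rules 'lhs implies rhs'."""
    item_counts = Counter()
    pair_counts = Counter()
    for items in records.values():
        item_counts.update(items)
        # Ordered pairs, so (A, B) and (B, A) are counted as separate rules.
        pair_counts.update(permutations(sorted(items), 2))
    rules = {}
    for (lhs, rhs), n_both in pair_counts.items():
        confidence = n_both / item_counts[lhs]
        if n_both >= min_support and confidence >= min_confidence:
            rules[(lhs, rhs)] = confidence
    return rules


def flag_exceptions(records, rules):
    """List entries that carry a rule's left-hand item but lack its right-hand item."""
    return sorted(
        (pid, lhs, rhs)
        for pid, items in records.items()
        for (lhs, rhs) in rules
        if lhs in items and rhs not in items
    )


# Hypothetical annotation records: protein id -> set of annotation items.
records = {
    "P1": {"kw:Kinase", "kw:ATP-binding"},
    "P2": {"kw:Kinase", "kw:ATP-binding"},
    "P3": {"kw:Kinase", "kw:ATP-binding"},
    "P4": {"kw:Kinase", "kw:ATP-binding"},
    "P5": {"kw:Kinase"},
    "P6": {"kw:ATP-binding", "kw:Helicase"},
    "P7": {"kw:ATP-binding", "kw:Helicase"},
}

rules = mine_rules(records)
flagged = flag_exceptions(records, rules)
print(flagged)
```

In this toy data the rule "Kinase implies ATP-binding" holds for four of five kinases, so the fifth entry is flagged as a candidate under-annotation for manual review.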

Conservation and divergence of mammalian gene co-expression networks
King Jordan
NCBI/NIH

Divergence of gene expression patterns is an important part of the evolutionary process and represents a link between genotypic and phenotypic evolution. The recent accumulation of high-throughput gene expression data sets allows for systematic genome-scale comparisons of gene expression pattern divergence between species. I will present an evolutionary comparison of human and mouse gene expression patterns. Just over 9,000 orthologous human-mouse gene pairs were analyzed with respect to expression profiles measured across 28 tissue samples. The approach we employed is based on the analysis of gene co-expression networks where genes represent nodes that are connected by edges if they are considered to be co-expressed. Human and mouse gene expression networks show similar topological properties at the macroscopic level. For instance, they have comparable node degree distributions, average path lengths and clustering coefficients. However, the human and mouse gene co-expression networks have diverged substantially when considered at a more microscopic level. Less than 10% of edges are preserved in the intersection of the human and mouse gene co-expression networks, and the node degree correlation between the two networks is low. The evolutionary implications of this distinction between macroscopic conservation and microscopic divergence of mammalian gene co-expression networks will be explored.
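
A minimal sketch of the network comparison described above, assuming that two genes are joined by an edge when the Pearson correlation of their tissue profiles reaches a fixed threshold, and that edge preservation is measured as shared edges over the union of edges. Both choices, the threshold of 0.7, and the four-gene, four-tissue profiles are hypothetical and stand in for the 9,000 ortholog pairs and 28 tissues of the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(profiles, threshold=0.7):
    """Edges between gene pairs whose profiles correlate at or above the threshold."""
    return {
        frozenset((g, h))
        for g, h in combinations(sorted(profiles), 2)
        if pearson(profiles[g], profiles[h]) >= threshold
    }


# Hypothetical profiles keyed by a shared ortholog identifier.
human = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1], "g4": [1, 3, 2, 4]}
mouse = {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 5], "g3": [1, 2, 4, 3], "g4": [4, 1, 3, 2]}

human_edges = coexpression_edges(human)
mouse_edges = coexpression_edges(mouse)
shared = human_edges & mouse_edges
preserved = len(shared) / len(human_edges | mouse_edges)
print(len(human_edges), len(mouse_edges), len(shared), preserved)
```

Because orthologs share an identifier, the two edge sets can be intersected directly. Degree distributions and other macroscopic properties would be computed on each edge set separately.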

Somatic evolution and cancer
Natalia Komarova
University of California, Irvine, CA

Even though much progress has been made in mainstream experimental cancer research at the molecular level, traditional methodologies alone are insufficient to resolve many important conceptual issues in cancer biology. For example, for the most part, it is still unknown how cancer originates, what drives its progression, and how treatment failure can be prevented. In this talk, I will describe novel mathematical tools which help obtain new insights into these processes. I will also show how the mathematical insights are combined with experimental studies through collaborations with cancer biologists. The main idea is to study cancer as an evolutionary dynamical system on a selection-mutation network. I will discuss the following topics: stem cells and tissue architecture, geometric constraints in cancer dynamics, and drug resistance in cancer.

Detecting selection at the sequence level
Alex Kondrashov
NCBI, NLM, NIH

Sequences of nucleic acids and of proteins are chronicles of past allele substitutions and, thus, carry ample information on natural selection. Most of the currently used methods of inferring selection from the sequence data rely on comparing the rates of substitutions which may be under selection to some reference rate of supposedly neutral substitutions. I will consider several new approaches to detecting selection at the sequence level. Clumps of nonsynonymous substitutions, patterns in parallel substitutions, elevated evolution rates due to constant selection, and joint analysis of data on pathogenic mutations, SNPs and allele replacements will be reviewed.

Unifying measures of gene function and evolution
Eugene V. Koonin, Yuri I. Wolf, and Liran Carmel
NCBI, NIH

Recent genome analyses revealed intriguing correlations between variables characterizing the functioning of a gene, such as expression level, connectivity of genetic and protein-protein interaction networks, and knockout effect, and variables describing gene evolution, such as sequence evolution rate and propensity for gene loss. Typically, variables within each of these classes are positively correlated, e.g., products of highly expressed genes also tend to have many protein-protein interactions, whereas variables between classes are negatively correlated, e.g., highly expressed genes tend to evolve slowly. Here we describe principal component (PC) analysis of 7 genome-related variables and propose biological interpretations for the first three principal components. The first PC reflects different aspects of a gene's "importance", or the "status" of a gene in the genomic community, with positive contributions from knockout lethality, expression level and the number of paralogs, and negative contributions from sequence evolution rate and gene loss propensity. The second and third PC may be interpreted as reflecting different aspects of a gene's "adaptability" whereby genes with high adaptability tend to evolve fast, are relatively often lost during evolution, readily duplicate and are highly expressed, but only under certain conditions. Functional classes of genes substantially vary in status and adaptability, with the highest status characteristic of the translation system and cytoskeletal proteins, and highest adaptability seen in metabolic enzymes and transporters.

Scrambled Genes: Genetic and genomic rearrangements during development and evolution
Laura Landweber
Princeton University

All ciliated protozoa have two types of nuclei in a single cell: a germline micronucleus and a somatic macronucleus responsible for most mRNA production. At the genomic level, some ciliates undergo massive DNA elimination and rearrangement of their ~1 Gb micronuclear genome to construct a set of ~2 kb "nano-chromosomes" that comprise their ~50 Mb gene-rich macronuclear genome. In many species, we estimate that 20-30% of all genes are scrambled; i.e. both fragmented and permuted into several small unordered segments in the germline. These segments can be present on either strand within a locus, or even dispersed over unlinked loci in the germline. Experiments in our laboratory have surveyed the origin, evolution, and developmental processing of scrambled genes in ciliates. I will describe new complex patterns of scrambled genes. For example, in one case the coding segments for two independent transcripts are intertwined on two separate germline loci.

The genomics of bacterial gene flow
Jeffrey Lawrence
University of Pittsburgh

Unlike crown eukaryotic species, microbial lineages are created by continual processes of gene loss and acquisition promoted by horizontal genetic transfer. The amount of foreign DNA in bacterial genomes, and the rate at which it is acquired, is consistent with gene transfer being the primary catalyst for microbial lineage differentiation. Therefore, the higher-ordered taxonomic relationships among microorganisms reflect not only their shared evolutionary history, but the ongoing processes of gene acquisition and gene loss. Understanding the mechanisms that control the flow of genes among taxa will provide a framework for understanding microbial diversification. It has been proposed that the likelihood of successful horizontal transfer between taxa may be a function of their phylogenetic distance, whereby more closely related taxa have a higher probability of exchanging genes - even by illegitimate processes that are independent of the degree of nucleotide sequence divergence - than do more distantly related lineages. That is, different lineages may not be equivalent in their ability to act as potential donors for horizontal gene transfer, even if the genes they bear offer the same physiological benefit. Molecular mechanisms which could underlie such barriers to horizontal gene transfer are proposed, whereby strand-specific, asymmetrically-distributed sequences control DNA replication and segregation. Bioinformatic analyses demonstrate that such sequences appear to be shared among more closely related lineages, but comprise non-overlapping sets among distantly related sequences. As a result, incoming DNA from distantly-related taxa may bear sequences which would interfere with DNA replication and segregation in their new host genome, thereby reducing the probability of a successful transfer. 
These mechanistic constraints may serve to shape the flow of genes among bacteria by lateral transfer processes, and result in higher-ordered bacterial clades that reflect propensity for gene transfer as much as common evolutionary histories.

Phylogenomic mining: a novel approach to select phylogenetically informative genes from genome data
John Logsdon
University of Iowa

The "gene tree/species tree" problem is a fundamental issue in molecular phylogenetics. Which gene or genes provide the best estimators of organismal phylogeny? With the availability of abundant genomic data from diverse taxa, there are now many potential genes that could be used for phylogenetic reconstruction. However, not all genes are equally informative for phylogeny, and the selection of those genes that are most effective remains difficult. We have developed a new phylogenomic mining method that seeks to identify phylogenetically informative genes from a large potential set. Genes selected by this method have markedly reduced incongruence between their individual gene phylogenies and conform to trees constructed from concatenations of large numbers of these genes (which converge to the species tree). In this phylogenomic mining method, we first employ a self-organizing map (SOM) to cluster the gene data set. Second, the maximum entropy (ME) genes from each cluster are selected and phylogenetic trees are inferred from them, both individually and in concatenation. These ME genes appear to represent the most phylogenetically informative genes. We have validated our method using a test data set of orthologous genes that were selected from genomes of yeast species and had been previously analysed with a random gene concatenation method. Our approach performs significantly better than random gene selection using these data. We are attempting to generalize our approach to other data sets. Our initial results suggest that phylogenomic mining will be a useful method to efficiently identify phylogenetically informative genes from large genomic data sets.

The Origins of Eukaryotic Gene Structure.
Michael Lynch
Indiana University

Most of the phenotypic diversity that we perceive in the natural world is directly attributable to the peculiar structure of the eukaryotic gene, which harbors numerous embellishments relative to the situation in prokaryotes. These include introns that must be spliced out of precursor mRNAs, transcribed but untranslated leader and trailer sequences (UTRs), modular regulatory elements that drive patterns of gene expression, and expansive intergenic regions that harbor control mechanisms. Explaining the origins of these features is difficult because they each impose an intrinsic fitness disadvantage by increasing the genic mutation rate to defective alleles. To address these issues, a general hypothesis for the emergence of eukaryotic gene structure will be presented. Extensive observations on population sizes, recombination rates, and mutation rates strongly support the view that eukaryotes have reduced genetic effective population sizes relative to prokaryotes, with especially extreme reductions occurring in multicellular lineages. The resultant increase in the power of random genetic drift is sufficient to overwhelm the weak mutational disadvantages associated with most novel aspects of the eukaryotic gene, supporting the idea that most such changes arose as nonadaptive by-products rather than direct products of natural selection. However, by establishing a population-genetic environment permissive to the genome-wide repatterning of gene structure, the eukaryotic condition also promoted a reliable resource from which natural selection could secondarily build novel forms of organismal complexity.

Male mutation bias in the age of genomics: Who should be in the driver's seat?
Kateryna Makova
Penn State University

Male mutation bias is a higher mutation rate in mammalian males than in mammalian females thought to result from the greater number of germline cell divisions in males. If errors in DNA replication cause most mutations, then the magnitude of male mutation bias should reflect the relative excess of male vs. female germline cell divisions. Substitution rates averaged among all sites in a sequence and compared between mammalian sex chromosomes were shown to be indeed higher in males than in females. Do individual classes of substitutions as well as other mutations (e.g., insertions and deletions) exhibit male bias? Our genome-wide analysis of CpG transitions that occurred between human and chimpanzee indicated only a weak male mutation bias, suggesting that such mutations are largely replication-independent. In contrast, a strong male bias was observed in a genome-wide study of small insertions and deletions occurring between mouse and rat. This implies that these mutations might result from errors in DNA replication.
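
The link between chromosome-specific substitution rates and the male-to-female mutation rate ratio (alpha) can be made concrete with the standard relations usually attributed to Miyata and colleagues. The sketch assumes that an X chromosome spends two thirds of its time in females, an autosome half, and a Y chromosome none, and that substitution rates are proportional to mutation rates. The abstract does not state which estimator was used, so these functions are illustrative only and their names are hypothetical.

```python
def alpha_from_x_to_autosome(ratio):
    """Solve X/A = 2(2 + alpha) / (3(1 + alpha)) for alpha."""
    denom = 3 * ratio - 2
    if denom <= 0 or ratio > 4 / 3:
        raise ValueError("X/A ratio must lie in (2/3, 4/3]")
    return (4 - 3 * ratio) / denom


def alpha_from_y_to_x(ratio):
    """Solve Y/X = 3 * alpha / (2 + alpha) for alpha."""
    if not 0 <= ratio < 3:
        raise ValueError("Y/X ratio must lie in [0, 3)")
    return 2 * ratio / (3 - ratio)


def alpha_from_y_to_autosome(ratio):
    """Solve Y/A = 2 * alpha / (1 + alpha) for alpha."""
    if not 0 <= ratio < 2:
        raise ValueError("Y/A ratio must lie in [0, 2)")
    return ratio / (2 - ratio)


# With alpha = 3 the expected ratios are X/A = 5/6, Y/X = 1.8 and Y/A = 1.5,
# so each estimator should recover 3 from its own ratio.
print(alpha_from_x_to_autosome(5 / 6))
print(alpha_from_y_to_x(1.8))
print(alpha_from_y_to_autosome(1.5))
```

Under these relations a ratio close to 1 for any pair of chromosomes gives an alpha close to 1, which is the pattern expected for mutations that do not depend on replication.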

LTR retrotransposons and evolution: a comparative genomics approach
John McDonald
Georgia Tech

Once considered selfish DNA of little or no evolutionary significance, transposable elements are today widely recognized as major contributors to genome evolution. Contemporary interest in transposable elements focuses on various aspects of the evolutionary history of the elements themselves, as well as their impact on the evolution of the host genomes in which they reside. Our laboratory is particularly interested in the evolution and significance of LTR retrotransposons. A combined bioinformatics and molecular biology approach has been taken to study LTR retrotransposons in a variety of species from yeast to humans. A summary of our methods and selected results will be presented.

Treading On the Interface between Molecular and Organismal Evolution
Boris Shakhnovich
Boston University

Proteins are the substrates of selection: genes are selected for or against because the proteins they code for are more or less fitted to their environments. The study of the evolution of life is therefore the study of protein evolution. Understanding the mechanisms and driving forces behind molecular evolution is the defining challenge of computational biology.
Recent investigations of high-throughput genomic and phenomic data have uncovered significant correlations between functional characteristics of a gene such as essentiality or expression and evolutionary rate of divergence. However, the importance of each characteristic in determining the strength of selection remains controversial. Furthermore, the relationship between paralogy, selection and dynamics of duplication remains largely unknown. First, I use a graph-theoretic representation of genomes to naturally partition genes into families of paralogs. For genes with paralogs, I show that strength of purifying selection does not necessarily depend on any phenotypic determinant of that gene. Instead, selection is characteristic of membership in a gene family. Surprisingly, while mutations accumulate uniformly slower for all genes in families under strong purifying selection, paralogs in those families diverge farther away in sequence. Observation of fewer pseudogenization events in families under strong selection supports a model where paralogs preferentially divide ancestral function after duplication.
Together, the observations presented in this paper offer a novel insight into how organismal constraints can have stark consequences on the evolution of genes and families. Furthermore, I examine the role of physical constraints such as structure on molecular evolution.
Finally, I discuss the relative contributions of "history and physics" on the imprint of molecular evolution in structure space.

Incorporating phenotype into models of sequence evolution
Jeffrey L. Thorne
North Carolina State University

The relationship between genotype and phenotype is central to evolution. Although genotype is specified via DNA sequence, widely used models for the evolution of DNA sequence ignore phenotype. Diverse in silico procedures have been introduced by computational biologists for predicting aspects of phenotype from genotype.
We have been developing statistical techniques for incorporating these in silico procedures into models of sequence evolution. These procedures and some results will be presented. Although our models of sequence change are designed for analyzing homologous sequences from different species, the models can be interpreted at the population genetic level.

Phylogeny of Mixture Models
Eric Vigoda
Georgia Tech

It is well known that phylogenetic trees can vary between genes. Even within regions having the same tree topology, the mutation rates often vary. This motivates the study of phylogenetic reconstruction in heterogeneous settings. We study the (im)possibility of reconstructing the underlying phylogeny when data is generated from a mixture of trees (same topology, different branch lengths). We first show the pitfalls of popular methods - maximum likelihood and BMCMC. We then determine in which evolutionary models, reconstructing the tree topology, under a mixture distribution, is impossible (due to ambiguity) or possible (via a linear test).

Last Modified: Thursday Nov 17, 2005 11:34 AM EST