KEYNOTE AND PLENARY LECTURE ABSTRACTS
Charting Chemical Space with Computers
University of California, Irvine Irvine, CA
Small molecules with at most a few dozen atoms play a
fundamental role in organic chemistry and biology. They can be used
as combinatorial building blocks for chemical synthesis, as
molecular probes for perturbing and analyzing biological systems,
and for the screening/design/discovery of new drugs. As datasets of
small molecules become increasingly available, it becomes important
to develop computational methods to store, search, classify, and
analyze small molecules and in particular to predict their physical,
chemical, and biological properties.
We will describe databases and machine learning methods, in
particular kernel methods, for chemical molecules represented by 1D
strings, 2D graphs of bonds, and 3D structures. We will demonstrate
state-of-the-art results for the prediction of physical, chemical, and
biological properties of small molecules and the discovery of new
reactions and compounds. More broadly, we will discuss some of the
challenges and opportunities for computer science, AI, and machine
learning in chemistry.
Comparative analysis of gene expression: insight into the evolution of transcription regulation
Weizmann Institute of Science
Evolution of gene expression plays a prominent role in generating
phenotypic diversity, but little is known about the genetic basis
underlying broad modulations of the genome-wide transcription program.
To gain insights into the principles underlying variations in gene
expression between closely related species, our lab focuses on the
analysis of yeast species of varying evolutionary distances. I will
describe recent results concerning both the genetic basis underlying
gene expression evolution and generic properties which
influence this evolution. Possible implications of the results for
models of gene expression evolution will be discussed.
Nature vs Nurture Studied Using Protein Structure
Philip E. Bourne
University of California San Diego
In recent work we have shown that protein structure is a useful tool in the study of evolution.
We were able to construct a reasonably accurate tree of life from the simple presence or absence
of fold superfamilies. We have taken this work in several directions, of which three will be discussed.
First, what does the species-specific use of fold space tell us about the usage of functional space?
Second, there is evidence that nurture - in the form of influence from the environment - impacted which
protein folds emerged and how they have been used. Third, each protein superfamily has its own evolutionary
story to tell, and we have looked at the protein kinase-like superfamily as a recent example.
 S. Yang, R.F. Doolittle and P.E. Bourne 2005 Phylogeny Determined through Protein Domain Content Proc. Nat. Acad. Sci. (USA) 102(2): 373-378.
 L. Xie and P.E. Bourne 2005 Functional Coverage of the Human Genome by Existing Structures, Structural Genomics Targets and Homology Models. PLoS Comp Biol 1(3) e31.
 C. Dupont, K. Briedis, S. Yang, B. Palnik, P.E. Bourne 2005 in preparation
 E. Scheeff and P.E. Bourne 2005 Structural Evolution of the Protein Kinase-Like Superfamily PLoS Comp Biol in early release.
Incorporation of splice site probability models for non-canonical introns improves gene structure prediction in plants
Iowa State University
The vast majority of introns in protein-coding genes of higher eukaryotes have a GT dinucleotide
at their 5'-terminus and an AG dinucleotide at their 3'-end. About 1-2% of introns are non-canonical,
with the most abundant subtype of introns being characterized by GC and AG dinucleotides at their
5'- and 3'-termini, respectively. Most current gene prediction software, whether based on ab initio or spliced
alignment approaches, does not include explicit models for non-canonical introns or may exclude their prediction altogether.
With present amounts of genome and transcript data it is possible to apply statistical methodology to non-canonical
splice site prediction. We pursued one such approach and describe the training and implementation of GC-donor splice
site models for Arabidopsis and rice, with the goal of exploring whether specific modeling of non-canonical introns
can enhance gene structure prediction accuracy. Our results indicate that the incorporation of non-canonical splice
site models yields dramatic improvements in annotating genes containing GC-AG and AT-AC non-canonical introns.
Comparison of models shows differences between monocot and dicot species, but also suggests GC-intron specific biases
independent of taxonomic clade. We also present evidence that GC-AG introns occur preferentially in genes with atypically high exon counts.
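As an illustration of what a GC-donor splice site model might look like, the sketch below trains a position weight matrix (PWM) on aligned donor-site windows and scores a candidate by log-odds against a uniform background. The abstract does not state the model form that was used, so the PWM itself, the window layout (two exonic bases, the donor dinucleotide, four intronic bases), the pseudocount, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_pwm(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all training sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

def log_odds(pwm, seq, background=0.25):
    """Sum of log2(p_model / p_background) over the window."""
    return sum(math.log2(col[b] / background) for col, b in zip(pwm, seq))

# Hypothetical training windows: 2 exonic bases + GC + 4 intronic bases.
gc_sites = ["AGGCAAGT", "AGGCAAGT", "TGGCAAGA", "AGGCGAGT"]
gc_model = train_pwm(gc_sites)

score_gc = log_odds(gc_model, "AGGCAAGT")  # candidate with a GC donor
score_gt = log_odds(gc_model, "AGGTAAGT")  # same context, canonical GT donor
print(score_gc > score_gt)
```

In a gene finder, a score of this kind would be one term among several (exon content, acceptor score, intron length), so the example only shows how a dedicated GC-donor model can separate GC from GT candidates.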
E. coli: Curation and Analysis of the Currently Largest Electronically-encoded Regulatory Network.
Program of Computational Genomics, Center for Genomic Sciences, UNAM, Cuernavaca, Mexico
In the first part of my talk, I will summarize recent progress in curation of what constitutes currently the largest electronically-encoded
transcriptional regulatory network of a free-living organism, that of Escherichia coli K-12. My lab has been curating operon organization
and transcriptional regulation in E. coli for years. This effort feeds both RegulonDB and EcoCyc. Navigation features of RegulonDB
version 5.0 will be described.
The second part of my talk will be devoted to some examples of biological analyses we have recently performed with this accumulated
knowledge. We have recently classified transcription factors or regulators (TFs) into three classes based on the origin of their allosteric
metabolite, as either internal, external or hybrid sensing systems. The TF repertoire is shown to be mostly governed by the internal sensing
subset of TFs, due mostly to interactions by global regulators. Topological properties of the network will also be discussed.
 Salgado, H., et al. (2004) RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32 Database issue, D303-306.
 Martinez-Antonio, A., and Collado-Vides, J. (2003) Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol 6, 482-489.
 Martínez-Antonio A., Janga S.C., Salgado H. and Collado-Vides J. (2006) "The Internal sensing machinery directs the activity of the regulatory network in Escherichia coli" Trends in Microbiol. (in press).
A simple physical model for scaling in protein-protein interaction networks
It has recently been demonstrated that many biological networks
exhibit a "scale-free" topology where the probability of observing a
node with a certain number of edges follows a power law. This
observation has been explained in terms of dynamical evolutionary
models. Here we consider the network of protein-protein
interactions and demonstrate that two published independent
measurements of these interactions produce graphs that are only
weakly correlated with one another despite their strikingly similar
topology. We then propose a physical model based on the fundamental
principle that (de)solvation is a major physical factor in
protein-protein interactions. This model reproduces not only the
scale-free nature of such graphs but also all higher-order
correlations in these networks. A key support of the model is
provided by the discovery of a significant correlation between
number of interactions made by a protein and the fraction of
hydrophobic residues on its surface.
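To make the "scale-free" statement concrete: a power-law degree distribution has the form P(k) ~ k^(-gamma). The sketch below estimates gamma by maximum likelihood under a continuous approximation and checks the estimator on synthetic data. The estimator, the sampling routine, the cutoff k_min, and the value gamma = 2.5 are assumptions chosen for illustration; the abstract does not report an exponent or a fitting procedure.

```python
import math
import random

def powerlaw_exponent(degrees, k_min=1.0):
    """Maximum-likelihood gamma for P(k) ~ k**(-gamma), continuous approximation."""
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0:
        raise ValueError("need at least one value above k_min")
    return 1.0 + len(tail) / log_sum

def sample_powerlaw(n, gamma, k_min=1.0, seed=0):
    """Draw n continuous power-law values by inverse-transform sampling."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the power below is always defined.
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]

synthetic = sample_powerlaw(20000, gamma=2.5)  # hypothetical "degree" values
estimate = powerlaw_exponent(synthetic)
print(round(estimate, 2))
```

Real node degrees are integers, so for an actual interaction graph a discrete estimator would be more appropriate; the continuous form is used here only because it has a closed-form solution.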
Designing RNA conformational changes
University of Texas at Austin
Nucleic acids can be selected that bind to ligands and catalyze
reactions. The functionality of selected RNA molecules is dependent in
large measure on their secondary structures, which can be readily
predicted and engineered. In particular, it has proven possible to
engineer nucleic acid secondary structures to undergo programmed
conformational changes in response to ligands. The resultant aptamer
beacons and aptazymes can be used as biosensors for reporting on
individual analytes. These molecules may also have applications in
vivo as part of engineered genetic circuits for signal transduction.
Mining sequence annotation databanks for association patterns
Technische Universität, Munich, Germany
Motivation: Millions of protein sequences currently being deposited to
sequence databanks will never be annotated manually. Similarity based
annotation generated by automatic software pipelines unavoidably
contains spurious assignments due to the imperfection of
bioinformatics methods. Examples of such annotation errors include
over- and under-predictions caused by the use of fixed recognition
thresholds and incorrect annotations caused by transitivity based
information transfer to unrelated proteins or transfer of errors
already accumulated in databases. One of the most difficult and timely
challenges in bioinformatics is the development of intelligent systems
aimed at improving the quality of automatically generated annotation.
A possible approach to this problem is to detect anomalies in
annotation items based on association rule mining.
Results: We present the first large-scale analysis of association
rules derived from two large protein annotation databases - Swiss-Prot
and PEDANT - and reveal novel, previously unknown tendencies of
rule strength distributions. Most of the rules are either very strong,
or very weak, with rules in the medium strength range being relatively
infrequent. Based on dynamics of error correction in subsequent
Swiss-Prot releases and on our own manual analysis we demonstrate that
exceptions from strong rules are, indeed, significantly enriched in
annotation errors and can be used to automatically flag them. We
identify different strength dependencies of rules derived from
different fields in Swiss-Prot. A compositional breakdown of
association rules generated from PEDANT in terms of their constituent
items indicates that most of the errors that can be corrected are
related to gene functional roles. Swiss-Prot errors are usually caused
by under-annotation due to its conservative approach, while
automatically generated PEDANT annotation suffers from
Availability: All data generated in this study are available for
download and browsing at http://pedant.gsf.de/ARIA/index.htm.
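The idea of flagging exceptions to strong association rules can be sketched as follows. Each database entry is treated as a set of annotation items; a rule "A implies C" is strong when most entries containing A also contain C, and the few entries that contain A but not C are flagged for review. The restriction to single-item antecedents, the thresholds, and the item names are assumptions for this example and are not taken from the study.

```python
from itertools import permutations

def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    covered = [r for r in records if antecedent in r]
    if not covered:
        return 0.0, 0.0
    hits = sum(1 for r in covered if consequent in r)
    return hits / len(records), hits / len(covered)

def flag_exceptions(records, min_confidence=0.8, min_cover=3):
    """List (record index, antecedent, consequent) for exceptions to strong rules."""
    items = sorted(set().union(*records))
    flagged = []
    for a, c in permutations(items, 2):
        covered = [i for i, r in enumerate(records) if a in r]
        _, confidence = rule_stats(records, a, c)
        # Strong but not perfect: a perfect rule has no exceptions to flag.
        if len(covered) >= min_cover and min_confidence <= confidence < 1.0:
            flagged.extend((i, a, c) for i in covered if c not in records[i])
    return flagged

# Hypothetical annotation records: nine consistent entries and one outlier.
records = [{"kw:kinase", "ec:2.7"} for _ in range(9)] + [{"kw:kinase", "ec:3.1"}]
exceptions = flag_exceptions(records)
print(exceptions)
```

A flagged entry is only a candidate error; as the abstract notes, exceptions are enriched in errors rather than guaranteed to be wrong, so the output would still need manual or downstream checking.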
Conservation and divergence of mammalian gene co-expression networks
Divergence of gene expression patterns is an important part of the
evolutionary process and represents a link between genotypic and phenotypic
evolution. The recent accumulation of high-throughput gene expression data
sets allows for systematic genome-scale comparisons of gene expression
pattern divergence between species. I will present an evolutionary
comparison of human and mouse gene expression patterns. Just over 9,000
orthologous human-mouse gene pairs were analyzed with respect to expression
profiles measured across 28 tissue samples. The approach we employed is
based on the analysis of gene co-expression networks where genes represent
nodes that are connected by edges if they are considered to be co-expressed.
Human and mouse gene expression networks show similar topological properties
at the macroscopic level. For instance, they have comparable node degree
distributions, average path lengths and clustering coefficients. However,
the human and mouse gene co-expression networks have diverged substantially
when considered at a more microscopic level. Less than 10% of edges are
preserved in the intersection of the human and mouse gene co-expression
networks, and the node degree correlation between the two networks is low.
The evolutionary implications of this distinction between macroscopic
conservation and microscopic divergence of mammalian gene co-expression
networks will be explored.
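A minimal sketch of the network comparison described above: genes are nodes, an edge joins two genes whose expression profiles across tissues are strongly correlated, and the two species are compared by the share of edges they have in common. The use of Pearson correlation, the 0.7 threshold, the definition of "preserved" as shared edges over the union of edges, and the toy profiles are all assumptions for illustration; the abstract does not specify them.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(profiles, threshold=0.7):
    """Edges (gene_a, gene_b) whose profile correlation reaches the threshold."""
    return {
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) >= threshold
    }

def preserved_fraction(edges_1, edges_2):
    """Shared edges divided by all edges seen in either network."""
    union = edges_1 | edges_2
    return len(edges_1 & edges_2) / len(union) if union else 0.0

# Hypothetical orthologous genes measured across five tissues.
human = {"g1": [1, 2, 3, 4, 5], "g2": [2, 4, 6, 8, 10],
         "g3": [5, 4, 3, 2, 1], "g4": [1, 3, 2, 5, 4]}
mouse = {"g1": [1, 2, 3, 4, 5], "g2": [5, 4, 3, 2, 1],
         "g3": [2, 4, 6, 8, 10], "g4": [1, 3, 2, 5, 4]}

human_edges = coexpression_edges(human)
mouse_edges = coexpression_edges(mouse)
print(preserved_fraction(human_edges, mouse_edges))
```

In this toy case both networks have three edges and the same degree sequence, yet only one edge is shared, which mirrors the contrast between macroscopic similarity and microscopic divergence.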
Somatic evolution and cancer
University of California, Irvine Irvine, CA
Even though much progress has been made in mainstream
experimental cancer research at the molecular level, traditional
methodologies alone are insufficient to resolve many important conceptual
issues in cancer biology. For example, for the most part, it is still
unknown how cancer originates, what drives its progression, and how
treatment failure can be prevented. In this talk, I will describe novel
mathematical tools which help obtain new insights into these processes. I
will also show how the mathematical insights are combined with
experimental studies through collaborations with cancer biologists. The
main idea is to study cancer as an evolutionary dynamical system on a
selection-mutation network. I will discuss the following topics: Stem
cells and tissue architecture, Geometric constraints in cancer dynamics
and Drug resistance in cancer.
Detecting selection at the sequence level
NCBI, NLM, NIH
Sequences of nucleic acids and of proteins are chronicles of past allele
substitutions and, thus, carry ample information on natural selection. Most
of the currently used methods of inferring selection from the sequence data
rely on comparing the rates of substitutions which may be under selection to
some reference rate of supposedly neutral substitutions. I will consider
several new approaches to detecting selection at the sequence level. Clumps
of nonsynonymous substitutions, patterns in parallel substitutions, elevated
evolution rates due to constant selection, and joint analysis of data on
pathogenic mutations, SNPs and allele replacements will be reviewed.
Unifying measures of gene function and evolution
Eugene V. Koonin, Yuri I. Wolf, and Liran Carmel
Recent genome analyses revealed intriguing correlations between variables
characterizing the functioning of a gene, such as expression level, connectivity of
genetic and protein-protein interaction networks, and knockout effect, and variables
describing gene evolution, such as sequence evolution rate and propensity for gene loss.
Typically, variables within each of these classes are positively correlated, e.g., products
of highly expressed genes also tend to have many protein-protein interactions, whereas
variables between classes are negatively correlated, e.g., highly expressed genes tend
to evolve slowly. Here we describe principal component (PC) analysis of 7 genome-related
variables and propose biological interpretations for the first three principal components.
The first PC reflects different aspects of a gene's "importance", or the "status" of a gene in the
genomic community, with positive contributions from knockout lethality, expression level and the
number of paralogs, and negative contributions from sequence evolution rate and gene loss propensity.
The second and third PC may be interpreted as reflecting different aspects of a gene's "adaptability"
whereby genes with high adaptability tend to evolve fast, are relatively often lost during evolution,
readily duplicate and are highly expressed, but only under certain conditions. Functional classes of
genes substantially vary in status and adaptability, with the highest status characteristic of the translation
system and cytoskeletal proteins, and highest adaptability seen in metabolic enzymes and transporters.
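The kind of analysis described above can be sketched with a small principal component computation: standardize each variable, form the correlation matrix, and extract the leading component by power iteration. The three variables, their values, and the use of power iteration instead of a full eigendecomposition are assumptions for this example; the actual study used 7 variables and real genome data.

```python
import math

def standardize(rows):
    """Center each column and scale it to unit (population) variance."""
    n = len(rows)
    z_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        if sd == 0:
            raise ValueError("constant column cannot be standardized")
        z_cols.append([(v - mean) / sd for v in col])
    return [list(r) for r in zip(*z_cols)]

def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(m)] for i in range(m)]

def first_pc(matrix, iterations=200):
    """Leading eigenvector (loadings) and eigenvalue by power iteration."""
    m = len(matrix)
    v = [1.0 / (i + 1) for i in range(m)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m))
    if v[0] < 0:  # fix the sign so the first loading is positive
        v = [-x for x in v]
    return v, eigenvalue

# Hypothetical genes (rows) by variables (columns):
# expression level, number of interactions, sequence evolution rate.
genes = [[1, 2, 5], [2, 4, 4], [3, 6, 3], [4, 8, 2], [5, 10, 1]]
loadings, eigenvalue = first_pc(correlation_matrix(standardize(genes)))
explained = eigenvalue / len(loadings)  # share of total variance on PC1
print([round(x, 3) for x in loadings], round(explained, 3))
```

Here the two "functional" variables load with one sign and the evolutionary rate with the opposite sign, the same qualitative pattern the abstract describes for its first component.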
Scrambled Genes: Genetic and genomic rearrangements during development and evolution
All ciliated protozoa have two types of nuclei in a single cell: a germline micronucleus and a
somatic macronucleus responsible for most mRNA production. At the genomic level, some ciliates
undergo massive DNA elimination and rearrangement of their ~1 Gb micronuclear genome to construct
a set of ~2 kb "nano-chromosomes" that comprise their ~50 Mb gene-rich macronuclear genome.
In many species, we estimate that 20-30% of all genes are scrambled; i.e. both fragmented
and permuted into several small unordered segments in the germline. These segments can be present
on either strand within a locus, or even dispersed over unlinked loci in the germline. Experiments
in our laboratory have surveyed the origin, evolution, and developmental processing of scrambled
genes in ciliates. I will describe new complex patterns of scrambled genes. For example, in one
case the coding segments for two independent transcripts are intertwined on two separate germline loci.
The genomics of bacterial gene flow
University of Pittsburgh
Unlike crown eukaryotic species, microbial lineages are created by continual
processes of gene loss and acquisition promoted by horizontal genetic
transfer. The amount of foreign DNA in bacterial genomes, and the rate at
which it is acquired, is consistent with gene transfer being the primary
catalyst for microbial lineage differentiation. Therefore, the
higher-ordered taxonomic relationships among microorganisms reflect not only
their shared evolutionary history, but the ongoing processes of gene
acquisition and gene loss. Understanding the mechanisms that control the
flow of genes among taxa will provide a framework for understanding
microbial diversification. It has been proposed that the likelihood of
successful horizontal transfer between taxa may be a function of their
phylogenetic distance, whereby more closely related taxa have a higher
probability of exchanging genes - even by illegitimate processes that are
independent of the degree of nucleotide sequence divergence - than do more
distantly related lineages. That is, different lineages may not be
equivalent in their ability to act as potential donors for horizontal gene
transfer, even if the genes they bear offer the same physiological benefit.
Molecular mechanisms which could underlie such barriers to horizontal gene
transfer are proposed, whereby strand-specific, asymmetrically-distributed
sequences control DNA replication and segregation. Bioinformatic analyses
demonstrate that such sequences appear to be shared among more closely
related lineages, but comprise non-overlapping sets among distantly related
sequences. As a result, incoming DNA from distantly-related taxa may bear
sequences which would interfere with DNA replication and segregation in
their new host genome, thereby reducing the probability of a successful
transfer. These mechanistic constraints may serve to shape the flow of
genes among bacteria by lateral transfer processes, and result in
higher-ordered bacterial clades that reflect propensity for gene transfer as
much as common evolutionary histories.
Phylogenomic mining: a novel approach to select phylogenetically informative genes from genome data
University of Iowa
The “gene tree/species tree” problem is a fundamental issue in molecular phylogenetics.
Which gene or genes provide the best estimators of organismal phylogeny? With the availability
of abundant genomic data from diverse taxa, there are now many potential genes that could be
used for phylogenetic reconstruction. However, not all genes are equally informative for phylogeny,
and the selection of those genes that are most effective remains difficult. We have developed a new
phylogenomic mining method that seeks to identify phylogenetically informative genes from a large potential
set. Genes selected by this method have markedly reduced incongruence between their individual gene
phylogenies and conform to trees constructed from concatenations of large numbers of these genes
(which converge to the species tree). In this phylogenomic mining method, we first employ a self-organizing map (SOM)
to cluster the gene data set. Second, the maximum entropy (ME) genes from each cluster are selected and phylogenetic
trees are inferred from them, both individually and in concatenation. These ME genes appear to represent the most
phylogenetically informative genes. We have validated our method using a test data set of orthologous genes that
were selected from genomes of yeast species and had been previously analysed with a random gene concatenation method.
Our approach performs significantly better than random gene selection using these data. We are attempting to generalize
our approach to other data sets. Our initial results suggest that phylogenomic mining will be a useful method to
efficiently identify phylogenetically informative genes from large genomic data sets.
The Origins of Eukaryotic Gene Structure.
Most of the phenotypic diversity that we perceive in the natural world is directly attributable to the peculiar
structure of the eukaryotic gene, which harbors numerous embellishments relative to the situation in prokaryotes.
These include introns that must be spliced out of precursor mRNAs, transcribed but untranslated leader and
trailer sequences (UTRs), modular regulatory elements that drive patterns of gene expression, and expansive
intergenic regions that harbor control mechanisms. Explaining the origins of these features is difficult because
they each impose an intrinsic fitness disadvantage by increasing the genic mutation rate to defective alleles.
To address these issues, a general hypothesis for the emergence of eukaryotic gene structure will be presented.
Extensive observations on population sizes, recombination rates, and mutation rates strongly support the view that
eukaryotes have reduced genetic effective population sizes relative to prokaryotes, with especially extreme reductions
occurring in multicellular lineages. The resultant increase in the power of random genetic drift is sufficient to overwhelm
the weak mutational disadvantages associated with most novel aspects of the eukaryotic gene, supporting the idea that
most such changes arose as nonadaptive by-products rather than direct products of natural selection. However, by
establishing a population-genetic environment permissive to the genome-wide repatterning of gene structure, the
eukaryotic condition also promoted a reliable resource from which natural selection could secondarily build novel forms
of organismal complexity.
Male mutation bias in the age of genomics: Who should be in the driver's seat?
Penn State University
Male mutation bias is a higher mutation rate in mammalian males than in mammalian females, thought to
result from the greater number of germline cell divisions in males. If errors in DNA replication cause most
mutations, then the magnitude of male mutation bias should reflect the relative excess of male vs. female
germline cell divisions. Substitution rates averaged among all sites in a sequence and compared between
mammalian sex chromosomes were shown to be indeed higher in males than in females. Do individual
classes of substitutions as well as other mutations (e.g., insertions and deletions) exhibit male bias?
Our genome-wide analysis of CpG transitions that occurred between human and chimpanzee indicated only
a weak male mutation bias, suggesting that such mutations are largely replication-independent. In contrast,
a strong male bias was observed in a genome-wide study of small insertions and deletions occurring
between mouse and rat. This implies that these mutations might result from errors in DNA replication.
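One standard way to turn chromosome comparisons into an estimate of male mutation bias is sketched below. It assumes that an X chromosome spends two thirds of its time in females, an autosome half, and a Y chromosome none, so that the ratio alpha of male to female mutation rates can be solved from a ratio of substitution rates. The abstract does not say which estimator was used, so these formulas, the function names, and the example values are assumptions for illustration.

```python
def expected_ratios(alpha):
    """Expected substitution-rate ratios for a given alpha = mu_male / mu_female."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return {
        "X/A": (2.0 / 3.0) * (2.0 + alpha) / (1.0 + alpha),
        "Y/A": 2.0 * alpha / (1.0 + alpha),
        "Y/X": 3.0 * alpha / (2.0 + alpha),
    }

def alpha_from_y_over_x(ratio):
    """Invert Y/X = 3*alpha / (2 + alpha); valid for 0 < ratio < 3."""
    if not 0.0 < ratio < 3.0:
        raise ValueError("Y/X ratio must lie strictly between 0 and 3")
    return 2.0 * ratio / (3.0 - ratio)

def alpha_from_x_over_a(ratio):
    """Invert X/A = (2/3)*(2 + alpha)/(1 + alpha); valid for 2/3 < ratio < 4/3."""
    if not 2.0 / 3.0 < ratio < 4.0 / 3.0:
        raise ValueError("X/A ratio must lie strictly between 2/3 and 4/3")
    return (4.0 - 3.0 * ratio) / (3.0 * ratio - 2.0)

# Hypothetical observed ratio: Y-linked sequence evolving twice as fast as X-linked.
alpha_hat = alpha_from_y_over_x(2.0)
print(alpha_hat, expected_ratios(alpha_hat))
```

Under this model an alpha near 1 for a class of mutations, as suggested above for CpG transitions, corresponds to rate ratios close to 1 for every chromosome comparison.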
LTR retrotransposons and evolution: a comparative genomics approach
Once considered selfish DNA of little or no evolutionary significance, transposable elements are today
widely recognized as major contributors to genome evolution. Contemporary interest in transposable elements
focuses on various aspects of the evolutionary history of the elements themselves, as well as their impact
on the evolution of the host genomes in which they reside. Our laboratory is particularly interested in the
evolution and significance of LTR retrotransposons. A combined bioinformatics and molecular biology approach
has been taken to study LTR retrotransposons in a variety of species from yeast to humans.
A summary of our methods and selected results will be presented.
Treading On the Interface between Molecular and Organismal Evolution
Proteins are the substrates of selection: genes are selected for or
against because the proteins they code for are more or less fitted to
their environments. The study of the evolution of life is therefore
the study of protein evolution. Understanding the mechanisms and
driving forces behind molecular evolution is the defining challenge of
Recent investigations of high-throughput genomic and phenomic data
have uncovered significant correlations between functional
characteristics of a gene such as essentiality or expression and
evolutionary rate of divergence. However, the importance of each
characteristic in determining the strength of selection remains
controversial. Furthermore, the relationship between paralogy, selection
and dynamics of duplication remains largely unknown. First, I use a
graph-theoretic representation of genomes to naturally partition genes
into families of paralogs. For genes with paralogs, I show that strength
of purifying selection does not necessarily depend on any phenotypic
determinant of that gene. Instead, selection is characteristic of
membership in a gene family. Surprisingly, while mutations accumulate
uniformly slower for all genes in families under strong purifying
selection, paralogs in those families diverge farther away in sequence.
Observation of fewer pseudogenization events in families under strong
selection supports a model where paralogs preferentially divide
ancestral function after duplication.
Together, the observations presented in this paper offer a novel
insight into how organismal constraints can have stark consequences on
the evolution of genes and families. Furthermore, I examine the
role of physical constraints such as structure on molecular evolution.
Finally, I discuss the relative contributions of "history and
physics" on the imprint of molecular evolution in structure space.
Incorporating phenotype into models of sequence evolution
Jeffrey L. Thorne
North Carolina State University
The relationship between genotype and phenotype is central to evolution.
Although genotype is specified via DNA sequence, widely used models for the
evolution of DNA sequence ignore phenotype. Diverse in silico procedures have been
introduced by computational biologists for predicting aspects of phenotype from genotype.
We have been developing statistical techniques for incorporating these in silico procedures into models of sequence evolution.
These procedures and some results will be presented. Although our models of sequence
change are designed for analyzing homologous sequences from different species, the models can be interpreted at the population genetic level.
Phylogeny of Mixture Models
It is well known that phylogenetic trees can vary between genes. Even within regions having the same tree topology, the mutation rates often vary. This motivates the study of phylogenetic reconstruction in heterogeneous settings. We study the (im)possibility of reconstructing the underlying phylogeny when data is generated from a mixture of trees (same topology, different branch lengths). We first show the pitfalls of popular methods - maximum likelihood and BMCMC. We then determine in which evolutionary models, reconstructing the tree topology, under a mixture distribution, is impossible (due to ambiguity) or possible (via a linear test).