KEYNOTE AND PLENARY LECTURE ABSTRACTS
 
 
 
Charting Chemical Space with Computers
Pierre Baldi
 University of California, Irvine, CA
 
 Small molecules with at most a few dozen atoms play a
fundamental role in organic chemistry and biology. They can be used
as combinatorial building blocks for chemical synthesis, as
molecular probes for perturbing and analyzing biological systems,
and for the screening/design/discovery of new drugs. As datasets of
small molecules become increasingly available, it becomes important
to develop computational methods to store, search, classify, and
analyze small molecules and in particular to predict their physical,
chemical, and biological properties.
 We will describe databases and machine learning methods, in
particular kernel methods, for chemical molecules represented by 1D
strings, 2D graphs of bonds, and 3D structures. We will demonstrate
state-of-the-art results for the prediction of physical, chemical, and
biological properties of small molecules and the discovery of new
reactions and compounds. More broadly, we will discuss some of the
challenges and opportunities for computer science, AI, and machine
learning in chemistry.
 
 
Comparative analysis of gene expression: insight into the evolution of transcription regulation
Naama Barkai
 Weizmann Institute of Science
 
 Evolution of gene expression plays a prominent role in generating
phenotypic diversity, but little is known about the genetic basis
underlying broad modulations of the genome-wide transcription program.
To gain insights into the principles underlying variations in gene
expression between closely related species, our lab focuses on the
analysis of yeast species of varying evolutionary distances. I will
describe recent results concerning both the genetic basis underlying
gene expression evolution and generic properties which
influence this evolution. Possible implications of the results for
models of gene expression evolution will be discussed.
 
 
Nature vs Nurture Studied Using Protein Structure
Philip E. Bourne
 University of California San Diego
 
 In recent work we have shown that protein structure is a useful tool in the study of evolution [1].
We were able to construct a reasonably accurate tree of life from the simple presence or absence
of fold superfamilies. We have taken this work in several directions, of which three will be discussed.
First, what does the species-specific use of fold space tell us about the usage of functional space [2]?
Second, there is evidence that nurture, in the form of influence from the environment, impacted what
protein folds emerged and how they have been used [3]. Third, each protein superfamily has its own evolutionary
story to tell, and we have looked at the protein kinase-like superfamily as a recent example [4].
 
 [1] S. Yang, R.F. Doolittle and P.E. Bourne 2005 Phylogeny Determined through Protein Domain Content. Proc. Natl. Acad. Sci. (USA) 102(2): 373-378.
 [2] L. Xie and P.E. Bourne 2005 Functional Coverage of the Human Genome by Existing Structures, Structural Genomics Targets and Homology Models. PLoS Comp Biol 1(3): e31.
 [3] C. Dupont, K. Briedis, S. Yang, B. Palenik, P.E. Bourne 2005 in preparation.
 [4] E. Scheeff and P.E. Bourne 2005 Structural Evolution of the Protein Kinase-Like Superfamily. PLoS Comp Biol, in early release.
 
 
 
Incorporation of splice site probability models for non-canonical introns improves gene structure prediction in plants
Volker Brendel
 Iowa State University
 
 The vast majority of introns in protein-coding genes of higher eukaryotes have a GT dinucleotide 
at their 5'-terminus and an AG dinucleotide at their 3'-end.  About 1-2% of introns are non-canonical, 
with the most abundant subtype being characterized by GC and AG dinucleotides at their
5'- and 3'-termini, respectively. Most current gene prediction software, whether based on ab initio or spliced 
alignment approaches, does not include explicit models for non-canonical introns or may exclude their prediction altogether.  
With present amounts of genome and transcript data it is possible to apply statistical methodology to non-canonical 
splice site prediction. We pursued one such approach and describe the training and implementation of GC-donor splice 
site models for Arabidopsis and rice, with the goal of exploring whether specific modeling of non-canonical introns
 can enhance gene structure prediction accuracy.  Our results indicate that the incorporation of non-canonical splice 
 site models yields dramatic improvements in annotating genes containing GC-AG and AT-AC non-canonical introns.
 Comparison of models shows differences between monocot and dicot species, but also suggests GC-intron specific biases 
 independent of taxonomic clade.  We also present evidence that GC-AG introns occur preferentially in genes with atypically high exon counts.
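
 One simple form a splice site probability model can take is a position weight matrix scored as a log-odds ratio against background. The sketch below illustrates that general idea only; the abstract does not say which statistical model was used, and the toy sequences, the window layout, the pseudocount, and the function names are all hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

def log_odds(seq, pwm, background=0.25):
    """Sum of per-position log2 ratios of site probability to a uniform background."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(seq))

# Made-up aligned windows around hypothetical GC donor sites: three exon bases
# followed by six intron bases, so the intron starts with GC at offsets 3-4.
gc_donor_sites = ["AAGGCAAGT", "CAGGCAAGT", "AAGGCAAGA", "AAGGCGAGT"]
pwm = train_pwm(gc_donor_sites)

candidate = "AAGGCAAGT"  # resembles the training sites
decoy = "TTTGCCCCC"      # has GC at the same offset but a poor flanking context
score_candidate = log_odds(candidate, pwm)
score_decoy = log_odds(decoy, pwm)
```

 In this toy setting the candidate scores higher than the decoy because the flanking positions, not just the GC dinucleotide, contribute to the score. A real model would be trained on many confirmed sites and compared against a realistic background.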
 
 
E. coli: Curation and Analysis of the Currently Largest Electronically-encoded Regulatory Network
Julio Collado-Vides
 Program of Computational Genomics, Center for Genomic Sciences, UNAM, Cuernavaca, Mexico
 
 In the first part of my talk, I will summarize recent progress in the curation of what is currently the largest electronically-encoded
transcriptional regulatory network of a free-living organism, that of Escherichia coli K-12. My lab has been curating operon organization
and transcriptional regulation in E. coli for years. This effort feeds both RegulonDB and EcoCyc. Navigation features of RegulonDB
version 5.0 will be described.
 The second part of my talk will be devoted to some examples of biological analyses we have recently performed with this accumulated
knowledge. We have recently classified transcription factors or regulators (TFs) into three classes based on the origin of their allosteric
metabolite, as either internal, external or hybrid sensing systems.  The TF repertoire is shown to be mostly governed by the internal sensing
subset of TFs, due mostly to interactions by global regulators. Topological properties of the network will also be discussed.
 
 [1] Salgado, H., et al. (2004) RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32 Database issue, D303-306.
 [2] Martinez-Antonio, A., and Collado-Vides, J. (2003) Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol 6, 482-489.
 [3] Martínez-Antonio A., Janga S.C., Salgado H. and Collado-Vides J. (2006) "The Internal sensing machinery directs the activity of the regulatory network in Escherichia coli" Trends in Microbiol. (in press).
 
 
 
A simple physical model for scaling in protein-protein interaction networks
Eric Deeds
 Harvard University
 
 It has recently been demonstrated that many biological networks 
exhibit a "scale-free" topology where the probability of observing a 
node with a certain number of edges follows a power law.  This 
observation has been explained in terms of dynamical evolutionary 
models.  Here we consider the network of protein-protein 
interactions and demonstrate that two published independent 
measurements of these interactions produce graphs that are only 
weakly correlated with one another despite their strikingly similar 
topology.  We then propose a physical model based on the fundamental 
principle that (de)solvation is a major physical factor in 
protein-protein interactions.  This model reproduces not only the 
scale-free nature of such graphs but also all higher-order 
correlations in these networks.  A key support of the model is 
provided by the discovery of a significant correlation between 
the number of interactions made by a protein and the fraction of 
hydrophobic residues on its surface.
 
 
Designing RNA conformational changes
Andrew Ellington
 University of Texas at Austin
 
 Nucleic acids can be selected that bind to ligands and catalyze
reactions.  The functionality of selected RNA molecules depends in
large measure on their secondary structures, which can be readily
predicted and engineered.  In particular, it has proven possible to
engineer nucleic acid secondary structures to undergo programmed
conformational changes in response to ligands.  The resultant aptamer
beacons and aptazymes can be used as biosensors for reporting on
individual analytes.  These molecules may also have applications in
vivo as part of engineered genetic circuits for signal transduction.
 
 
Mining sequence annotation databanks for association patterns
Dmitrij Frishman
 Technische Universität, Munich, Germany
 
 Motivation: Millions of protein sequences currently being deposited in 
sequence databanks will never be annotated manually. Similarity-based 
annotation generated by automatic software pipelines unavoidably 
contains spurious assignments due to the imperfection of 
bioinformatics methods. Examples of such annotation errors include 
over- and under-predictions caused by the use of fixed recognition 
thresholds and incorrect annotations caused by transitivity-based 
information transfer to unrelated proteins or transfer of errors 
already accumulated in databases. One of the most difficult and timely 
challenges in bioinformatics is the development of intelligent systems 
aimed at improving the quality of automatically generated annotation. 
A possible approach to this problem is to detect anomalies in 
annotation items based on association rule mining.
 
 Results: We present the first large-scale analysis of association 
rules derived from two large protein annotation databases, Swiss-Prot 
and PEDANT, and reveal novel, previously unknown tendencies of 
rule strength distributions. Most of the rules are either very strong, 
or very weak, with rules in the medium strength range being relatively 
infrequent. Based on dynamics of error correction in subsequent 
Swiss-Prot releases and on our own manual analysis we demonstrate that 
exceptions from strong rules are, indeed, significantly enriched in 
annotation errors and can be used to automatically flag them. We 
identify different strength dependencies of rules derived from 
different fields in Swiss-Prot. A compositional breakdown of 
association rules generated from PEDANT in terms of their constituent 
items indicates that most of the errors that can be corrected are 
related to gene functional roles. Swiss-Prot errors are usually caused 
by under-annotation due to its conservative approach, while 
automatically generated PEDANT annotation suffers from 
over-annotation.
 
 Availability: All data generated in this study are available for 
download and browsing at http://pedant.gsf.de/ARIA/index.htm.
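
 The idea of flagging exceptions to strong association rules can be sketched in a few lines. The example below is only an illustration under stated assumptions: it measures rule strength as confidence, P(B present | A present), restricts itself to single-item rules A -> B, and uses invented entry names and annotation items. The abstract does not specify the strength measure or the rule-mining procedure actually used.

```python
from collections import Counter
from itertools import permutations

def rule_confidences(entries):
    """Confidence of each single-item rule A -> B: P(B present | A present)."""
    item_counts = Counter()
    pair_counts = Counter()
    for items in entries.values():
        item_counts.update(items)
        pair_counts.update(permutations(sorted(items), 2))
    return {(a, b): pair_counts[(a, b)] / item_counts[a] for (a, b) in pair_counts}

def flag_exceptions(entries, min_confidence=0.75):
    """List (entry, rule) pairs where an entry has A but lacks B for a strong rule A -> B."""
    confidences = rule_confidences(entries)
    # Rules with confidence 1.0 have no exceptions, so only the range below 1.0 matters.
    strong = {rule for rule, c in confidences.items() if min_confidence <= c < 1.0}
    flagged = []
    for name, items in entries.items():
        for a, b in strong:
            if a in items and b not in items:
                flagged.append((name, (a, b)))
    return sorted(flagged)

# Hypothetical annotation entries: each maps an entry name to its set of annotation items.
entries = {
    "P1": {"kinase", "ATP-binding"},
    "P2": {"kinase", "ATP-binding"},
    "P3": {"kinase", "ATP-binding"},
    "P4": {"kinase", "ATP-binding"},
    "P5": {"kinase"},
    "P6": {"transporter", "membrane"},
}
flagged = flag_exceptions(entries)
```

 Here the rule kinase -> ATP-binding holds in four of five entries, so P5 is flagged as a candidate for a missing annotation. Flagged entries are candidates for review, not confirmed errors.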
 
 
Conservation and divergence of mammalian gene co-expression networks
King Jordan
 NCBI/NIH
 
 Divergence of gene expression patterns is an important part of the
 evolutionary process and represents a link between genotypic and phenotypic
 evolution.  The recent accumulation of high-throughput gene expression data
 sets allows for systematic genome-scale comparisons of gene expression
 pattern divergence between species.  I will present an evolutionary
 comparison of human and mouse gene expression patterns.  Just over 9,000
 orthologous human-mouse gene pairs were analyzed with respect to expression
 profiles measured across 28 tissue samples.  The approach we employed is
 based on the analysis of gene co-expression networks where genes represent
 nodes that are connected by edges if they are considered to be co-expressed.
 Human and mouse gene expression networks show similar topological properties
 at the macroscopic level.  For instance, they have comparable node degree
 distributions, average path lengths and clustering coefficients.  However,
 the human and mouse gene co-expression networks have diverged substantially
 when considered at a more microscopic level.  Fewer than 10% of edges are
 preserved in the intersection of the human and mouse gene co-expression
 networks, and the node degree correlation between the two networks is low.
 The evolutionary implications of this distinction between macroscopic
 conservation and microscopic divergence of mammalian gene co-expression
 networks will be explored.
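
 The network comparison described above can be sketched as follows. This is a minimal illustration with assumptions that are not taken from the abstract: co-expression is defined here as a Pearson correlation at or above a fixed threshold, orthologous genes share one identifier in both species, edge conservation is computed as intersection over union, and the expression profiles are invented toy values.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(profiles, threshold=0.9):
    """Edges between gene pairs whose profiles correlate at or above the threshold."""
    return {
        frozenset((g1, g2))
        for g1, g2 in combinations(sorted(profiles), 2)
        if pearson(profiles[g1], profiles[g2]) >= threshold
    }

def edge_conservation(edges_a, edges_b):
    """Fraction of edges in the union of two networks that occur in both."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 0.0

# Toy expression profiles across four tissues, keyed by a shared ortholog ID.
human = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1], "g4": [1, 3, 2, 4]}
mouse = {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 5], "g3": [1, 2, 3, 4], "g4": [4, 1, 3, 2]}

human_edges = coexpression_edges(human)
mouse_edges = coexpression_edges(mouse)
conserved_fraction = edge_conservation(human_edges, mouse_edges)
```

 In this toy case only the g1-g2 edge appears in both networks, out of three distinct edges overall. A different denominator, such as the edge count of one network, would give a different conservation figure, so the choice should be stated when reporting such numbers.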
 
 
Somatic evolution and cancer
Natalia Komarova
 University of California, Irvine, CA
 
 Even though much progress has been made in mainstream
experimental cancer research at the molecular level, traditional
methodologies alone are insufficient to resolve many important conceptual
issues in cancer biology. For example, for the most part, it is still
unknown how cancer originates, what drives its progression, and how
treatment failure can be prevented. In this talk, I will describe novel
mathematical tools which help obtain new insights into these processes. I
will also show how the mathematical insights are combined with
experimental studies through collaborations with cancer biologists. The
main idea is to study cancer as an evolutionary dynamical system on a
selection-mutation network. I will discuss the following topics: stem
cells and tissue architecture, geometric constraints in cancer dynamics,
and drug resistance in cancer.
 
 
Detecting selection at the sequence level
Alex Kondrashov
 NCBI, NLM, NIH
 
 Sequences of nucleic acids and of proteins are chronicles of past allele
substitutions and, thus, carry ample information on natural selection. Most
of the currently used methods of inferring selection from the sequence data
rely on comparing the rates of substitutions which may be under selection to
some reference rate of supposedly neutral substitutions. I will consider
several new approaches to detecting selection at the sequence level. Clumps
of nonsynonymous substitutions, patterns in parallel substitutions, elevated
evolution rates due to constant selection, and joint analysis of data on
pathogenic mutations, SNPs and allele replacements will be reviewed.
 
 
Unifying measures of gene function and evolution
Eugene V. Koonin, Yuri I. Wolf, and Liran Carmel
 NCBI, NIH
 
 Recent genome analyses revealed intriguing correlations between variables 
characterizing the functioning of a gene, such as expression level, connectivity of 
genetic and protein-protein interaction networks, and knockout effect, and variables 
describing gene evolution, such as sequence evolution rate and propensity for gene loss. 
Typically, variables within each of these classes are positively correlated, e.g., products 
of highly expressed genes also tend to have many protein-protein interactions, whereas 
variables between classes are negatively correlated, e.g., highly expressed genes tend  
to evolve slowly. Here we describe principal component (PC) analysis of 7 genome-related  
variables and propose biological interpretations for the first three principal components.  
The first PC reflects different aspects of a gene's "importance", or the "status" of a gene in the  
genomic community, with positive contributions from knockout lethality, expression level and the 
number of paralogs, and negative contributions from sequence evolution rate and gene loss propensity. 
The second and third PC may be interpreted as reflecting different aspects of a gene's "adaptability"  
whereby genes with high adaptability tend to evolve fast, are relatively often lost during evolution,  
readily duplicate and are highly expressed, but only under certain conditions. Functional classes of  
genes substantially vary in status and adaptability, with the highest status characteristic of the translation 
system and cytoskeletal proteins, and highest adaptability seen in metabolic enzymes and transporters.
 
 
Scrambled Genes: Genetic and genomic rearrangements during development and evolution
Laura Landweber
 Princeton University
 
 All ciliated protozoa have two types of nuclei in a single cell: a germline micronucleus and a 
somatic macronucleus responsible for most mRNA production.  At the genomic level, some ciliates 
undergo massive DNA elimination and rearrangement of their ~1 Gb  micronuclear genome to construct 
a set of ~2 kb "nano-chromosomes" that comprise their ~50 Mb gene-rich macronuclear genome.  
In many species, we estimate that 20-30% of all genes are scrambled; i.e. both fragmented 
and permuted into several small unordered segments in the germline. These segments can be present 
on either strand within a locus, or even dispersed over unlinked loci in the germline. Experiments
 in our laboratory have surveyed the origin, evolution, and developmental processing of scrambled 
 genes in ciliates. I will describe new complex patterns of scrambled genes. For example, in one 
 case the coding segments for two independent transcripts are intertwined on two separate germline loci.
 
 
The genomics of bacterial gene flow
Jeffrey Lawrence
 University of Pittsburgh
 
 Unlike crown eukaryotic species, microbial lineages are created by continual
  processes of gene loss and acquisition promoted by horizontal genetic
  transfer.  The amount of foreign DNA in bacterial genomes, and the rate at
  which it is acquired, are consistent with gene transfer being the primary
  catalyst for microbial lineage differentiation. Therefore, the
  higher-order taxonomic relationships among microorganisms reflect not only
  their shared evolutionary history, but also the ongoing processes of gene
  acquisition and gene loss. Understanding the mechanisms that control the
  flow of genes among taxa will provide a framework for understanding
  microbial diversification. It has been proposed that the likelihood of
  successful horizontal transfer between taxa may be a function of their
  phylogenetic distance, whereby more closely related taxa have a higher
  probability of exchanging genes - even by illegitimate processes that are
  independent of the degree of nucleotide sequence divergence - than do more
  distantly related lineages. That is, different lineages may not be
  equivalent in their ability to act as potential donors for horizontal gene
  transfer, even if the genes they bear offer the same physiological benefit.
  Molecular mechanisms which could underlie such barriers to horizontal gene
  transfer are proposed, whereby strand-specific, asymmetrically-distributed
  sequences control DNA replication and segregation. Bioinformatic analyses
  demonstrate that such sequences appear to be shared among more closely
  related lineages, but comprise non-overlapping sets among distantly related
  sequences.  As a result, incoming DNA from distantly-related taxa may bear
  sequences which would interfere with DNA replication and segregation in
  their new host genome, thereby reducing the probability of a successful
  transfer.  These mechanistic constraints may serve to shape the flow of
  genes among bacteria by lateral transfer processes, and result in
  higher-order bacterial clades that reflect propensity for gene transfer as
  much as common evolutionary histories.
 
 
Phylogenomic mining: a novel approach to select phylogenetically informative genes from genome data
John Logsdon
 University of Iowa
 
 The “gene tree/species tree” problem is a fundamental issue in molecular phylogenetics.
 Which gene or genes provide the best estimators of organismal phylogeny? With the availability
 of abundant genomic data from diverse taxa, there are now many potential genes that could be 
used for phylogenetic reconstruction. However, not all genes are equally informative for phylogeny, 
and the selection of those genes that are most effective remains difficult. We have developed a new 
phylogenomic mining method that seeks to identify phylogenetically informative genes from a large potential 
set. Genes selected by this method have markedly reduced incongruence between their individual gene 
phylogenies and conform to trees constructed from concatenations of large numbers of these genes 
(which converge to the species tree). In this phylogenomic mining method, we first employ a self-organizing map (SOM) 
to cluster the gene data set. Second, the maximum entropy (ME) genes from each cluster are selected and phylogenetic 
trees are inferred from them, both individually and in concatenation. These ME genes appear to represent the most 
phylogenetically informative genes. We have validated our method using a test data set of orthologous genes that 
were selected from genomes of yeast species and had been previously analysed with a random gene concatenation method. 
Our approach performs significantly better than random gene selection using these data. We are attempting to generalize
 our approach to other data sets. Our initial results suggest that phylogenomic mining will be a useful method to 
efficiently identify phylogenetically informative genes from large genomic data sets.
 
 
The Origins of Eukaryotic Gene Structure
Michael Lynch
 Indiana University
 
 Most of the phenotypic diversity that we perceive in the natural world is directly attributable to the peculiar 
structure of the eukaryotic gene, which harbors numerous embellishments relative to the situation in prokaryotes. 
These include introns that must be spliced out of precursor mRNAs, transcribed but untranslated leader and 
trailer sequences (UTRs), modular regulatory elements that drive patterns of gene expression, and expansive 
intergenic regions that harbor control mechanisms. Explaining the origins of these features is difficult because 
they each impose an intrinsic fitness disadvantage by increasing the genic mutation rate to defective alleles. 
To address these issues, a general hypothesis for the emergence of eukaryotic gene structure will be presented. 
Extensive observations on population sizes, recombination rates, and mutation rates strongly support the view that 
eukaryotes have reduced genetic effective population sizes relative to prokaryotes, with especially extreme reductions 
occurring in multicellular lineages. The resultant increase in the power of random genetic drift is sufficient to overwhelm 
the weak mutational disadvantages associated with most novel aspects of the eukaryotic gene, supporting the idea that 
most such changes arose as nonadaptive by-products rather than direct products of natural selection. However, by
 establishing a population-genetic environment permissive to the genome-wide repatterning of gene structure, the 
 eukaryotic condition also promoted a reliable resource from which natural selection could secondarily build novel forms 
 of organismal complexity.
 
 
Male mutation bias in the age of genomics: Who should be in the driver's seat?
Kateryna Makova
 Penn State University
 
 Male mutation bias is a higher mutation rate in mammalian males than in mammalian females, thought to 
result from the greater number of germline cell divisions in males. If errors in DNA replication cause most 
mutations, then the magnitude of male mutation bias should reflect the relative excess of male vs. female
 germline cell divisions. Substitution rates averaged among all sites in a sequence and compared between
 mammalian sex chromosomes were indeed shown to be higher in males than in females. Do individual 
  classes of substitutions as well as other mutations (e.g., insertions and deletions) exhibit male bias? 
  Our genome-wide analysis of CpG transitions that occurred between human and chimpanzee indicated only 
 a weak male mutation bias, suggesting that such mutations are largely replication-independent. In contrast,
  a strong male bias was observed in a genome-wide study of small insertions and deletions occurring
   between mouse and rat. This implies that these mutations might result from errors in DNA replication.
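
 A commonly used way to turn a comparison of sex chromosomes into a number is to estimate the male-to-female mutation rate ratio, alpha, from the ratio of Y to X substitution rates. The sketch below assumes the standard relation Y/X = 3*alpha / (2 + alpha), which follows from the Y chromosome always residing in males while the X spends one third of its time in males. The abstract does not state which estimator was used, and the function name and the input rates are hypothetical.

```python
def alpha_from_y_over_x(y_rate, x_rate):
    """Male-to-female mutation rate ratio from Y and X substitution rates.

    Assumes Y/X = 3 * alpha / (2 + alpha), i.e. the Y is always in males and
    the X spends one third of its time in males and two thirds in females.
    """
    ratio = y_rate / x_rate
    if ratio >= 3:
        raise ValueError("Y/X ratio must be below 3 for a finite alpha")
    # Solving ratio = 3a / (2 + a) for a gives a = 2 * ratio / (3 - ratio).
    return 2 * ratio / (3 - ratio)

# Hypothetical substitution rates, chosen only to illustrate the arithmetic.
strong_bias = alpha_from_y_over_x(0.02, 0.01)  # Y evolves twice as fast as X
no_bias = alpha_from_y_over_x(0.01, 0.01)      # equal rates on Y and X
```

 Under this relation a Y/X ratio of 2 corresponds to alpha = 4, and equal rates correspond to alpha = 1, meaning no male bias. The estimate becomes unstable as the ratio approaches 3, which is one reason such estimates are usually reported with confidence intervals.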
 
 
LTR retrotransposons and evolution: a comparative genomics approach
John McDonald
 Georgia Tech
 
 Once considered selfish DNA of little or no evolutionary significance, transposable elements are today 
widely recognized as major contributors to genome evolution. Contemporary interest in transposable elements 
focuses on various aspects of the evolutionary history of the elements themselves, as well as their impact 
on the evolution of the host genomes in which they reside.  Our laboratory is particularly interested in the 
evolution and significance of LTR retrotransposons. A combined bioinformatics and molecular biology approach 
has been taken to study LTR retrotransposons in a variety of species from yeast to humans. 
A summary of our methods and selected results will be presented.
 
 
Treading On the Interface between Molecular and Organismal Evolution
Boris Shakhnovich
 Boston University
 
 Proteins are the substrates of selection: genes are selected for or
against because the proteins they code for are more or less fitted to
their environments. The study of the evolution of life is therefore
the study of protein evolution. Understanding the mechanisms and
driving forces behind molecular evolution is the defining challenge of
computational biology.
 Recent investigations of high-throughput genomic and phenomic data
have uncovered significant correlations between functional
characteristics of a gene such as essentiality or expression and
evolutionary rate of divergence. However, the importance of each
characteristic in determining the strength of selection remains
controversial. Furthermore, the relationship between paralogy, selection
and dynamics of duplication remains largely unknown.  First, I use a
graph-theoretic representation of genomes to naturally partition genes
into families of paralogs. For genes with paralogs, I show that strength
of purifying selection does not necessarily depend on any phenotypic
determinant of that gene. Instead, selection is characteristic of
membership in a gene family. Surprisingly, while mutations accumulate
uniformly slower for all genes in families under strong purifying
selection, paralogs in those families diverge farther away in sequence.
Observation of fewer pseudogenization events in families under strong
selection supports a model where paralogs preferentially divide
ancestral function after duplication.
 Together, the observations presented in this paper offer a novel
insight into how organismal constraints can have stark consequences on
the evolution of genes and families.  Furthermore, I examine the
role of physical constraints such as structure on molecular evolution.
 Finally, I discuss the relative contributions of "history and
physics" on the imprint of molecular evolution in structure space.
 
 
Incorporating phenotype into models of sequence evolution
Jeffrey L. Thorne
 North Carolina State University
 
 The relationship between genotype and phenotype is central to evolution. 
Although genotype is specified via DNA sequence, widely used models for the 
evolution of DNA sequence ignore phenotype. Diverse in silico procedures have been
introduced by computational biologists for predicting aspects of phenotype from genotype.
We have been developing statistical techniques for incorporating these in silico procedures
into models of sequence evolution. These procedures and some results will be presented.
Although our models of sequence change are designed for analyzing homologous sequences
from different species, the models can be interpreted at the population genetic level.
 
 
Phylogeny of Mixture Models
Eric Vigoda
 Georgia Tech
 
 It is well known that phylogenetic trees can vary between genes.  Even within regions having the same tree topology, the mutation rates often vary.  This motivates the study of phylogenetic reconstruction in heterogeneous settings.  We study the (im)possibility of reconstructing the underlying phylogeny when data are generated from a mixture of trees (same topology, different branch lengths). We first show the pitfalls of popular methods: maximum likelihood and BMCMC.  We then determine for which evolutionary models reconstructing the tree topology under a mixture distribution is impossible (due to ambiguity) or possible (via a linear test).
 
 
 