As described in the call for abstracts, @trayc7, Felipe Zapata and I have organized a spotlight session on next-generation phylogenetic inference at Evolution 2016. The main site has the schedule, but I wanted to post the great abstracts we got as well.
All talks Monday, June 20, Ballroom A.
- 1:00-1:30 Sebastien Roch (University of Wisconsin-Madison) Large-scale phylogenetic inference: information-theoretic insights
- 1:30-1:45 Bret Larget (University of Wisconsin-Madison) Bayesian phylogenetics with importance sampling
- 1:45-1:50 Guifang Zhou (Louisiana State University) A network framework to explore phylogenetic structure in genome data
- 1:50-1:55 Dahiana Arcila (George Washington University) Genome-wide gene genealogy interrogation advances resolution of recalcitrant groups in the Tree of Life
- 1:55-2:00 August Guang (Brown University) Summarizing population genome variation in phylogenetic analyses
- 2:00-2:15 Xavier Meyer (University of Lausanne) Accelerating Bayesian inference for evolutionary biology models
2:15-2:45 pm COFFEE BREAK
- 2:45-3:00 Arman Bilge (University of Auckland) Hamiltonian Monte Carlo on the space of phylogenies
- 3:00-3:15 Crystal Zhao (University of British Columbia) Bayesian analysis of continuous-time Markov chain parameters using Hamiltonian Monte Carlo
- 3:15-3:20 Emily Jane McTavish (University of Kansas) Continually updated phylogenies
- 3:20-3:25 Rutger Vos (Naturalis Biodiversity Center) SUPERSMART: the Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages, and Relationships of Taxa
- 3:25-3:30 Siavash Mirarab (University of California, San Diego) Fast coalescent-based computation of local branch support from quartet frequencies
- 3:30-3:45 Alex Gavryushkin (University of Auckland) Nearest neighbors of phylogenetic time-trees
- 3:45-4:00 Huw Ogilvie (Australian National University) StarBEAST2 improves convergence and can infer per-species substitution rates
Session 1 Abstracts
Large-scale phylogenetic inference: information-theoretic insights
Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison.
How much data is needed to estimate a large phylogeny? I will survey some old and new theoretical results on this question. In particular, I will discuss how the establishment of fundamental information-theoretic limits has produced a quantitative understanding of the statistical challenges involved in large-scale phylogenetic inference. In addition, these impossibility results have spurred the development of novel algorithmic ideas that - at least in theory - can be used to achieve best-possible reconstruction performance. I will also briefly mention a few open questions in this area.
Importance sampling of phylogenetic trees
Bret Larget; Claudia Solis Lemus, University of Wisconsin-Madison
All major Bayesian phylogenetic methods depend on the use of Markov chain Monte Carlo (MCMC) methods to sample phylogenies from posterior distributions. The computational complexity of MCMC is a major challenge for Bayesian analysis with large trees and/or large data sets. We present an alternative method for Bayesian phylogenetic inference that uses importance sampling instead of MCMC, allowing independent sampling of phylogenies.
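For readers unfamiliar with the idea, here is a minimal sketch of self-normalized importance sampling on a toy one-dimensional problem. It is not the authors' method: the target (an unnormalized Gamma density standing in for a posterior on a single branch length), the exponential proposal, and the function name `importance_estimate` are all assumptions chosen for illustration. The point is only that draws are independent and are reweighted by target/proposal, rather than generated by a Markov chain.

```python
import math
import random

def importance_estimate(log_target, sample_proposal, log_proposal, f, n, rng):
    """Self-normalized importance sampling estimate of E_target[f(x)]."""
    xs = [sample_proposal(rng) for _ in range(n)]          # independent draws
    log_w = [log_target(x) - log_proposal(x) for x in xs]  # log importance weights
    m = max(log_w)                                         # subtract max for stability
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    estimate = sum(wi * f(x) for wi, x in zip(w, xs)) / total
    ess = total ** 2 / sum(wi * wi for wi in w)            # effective sample size
    return estimate, ess

# Toy target (assumption): unnormalized Gamma(shape=3, rate=2), whose mean is 1.5.
log_target = lambda x: 2.0 * math.log(x) - 2.0 * x
# Toy proposal (assumption): Exponential(rate=0.5), easy to sample and heavier-tailed.
sample_proposal = lambda rng: rng.expovariate(0.5)
log_proposal = lambda x: math.log(0.5) - 0.5 * x

rng = random.Random(42)
estimate, ess = importance_estimate(
    log_target, sample_proposal, log_proposal, lambda x: x, 20000, rng
)
print(f"posterior mean estimate: {estimate:.3f}, effective sample size: {ess:.0f}")
```

The effective sample size is the usual diagnostic here: if the proposal is a poor match for the target, a few draws carry most of the weight and the estimate becomes unreliable.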
A network framework to explore phylogenetic structure in genomic data
Guifang Zhou, Louisiana State University; Jeremy Ash, North Carolina State University; Wen Huang, Université catholique de Louvain; Melissa Marchand, Florida State University; David Morris, Louisiana State University; Paul Van Dooren, Université catholique de Louvain; Jim Wilgenbusch, University of Minnesota; Jeremy Brown, Louisiana State University; Kyle Gallivan, Florida State University
The use of genome-wide sampling involving hundreds or thousands of genes is rapidly becoming the norm for modern phylogenetic studies. Such large datasets not only pose new challenges for the speed of phylogenetic analyses, they offer new and largely untapped opportunities to understand genome-wide variation in phylogenetic signals. These opportunities have not yet been realized because methods for efficiently and intuitively summarizing phylogenetic information from large sets of trees are in their infancy. Most standard approaches, developed in the context of much smaller datasets, rely on point estimates and discard a large amount of useful information. In recent years, interest has grown in the use of network analysis to explore phylogenetic structure. For example, clustering methods have been applied to matrices of tree distances to identify genes that share common evolutionary histories. However, the clustering methods used so far have limited flexibility and cannot handle negative edge weights, which might arise when different types of networks are constructed. To address these issues, we are exploring and extending a family of community detection methods that can accommodate both positive and negative weights, as well as explore the scale at which clustering is most natural. In addition, we are applying these methods to two types of networks: one where trees are nodes and edge weights correspond to their affinities, and another where bipartitions are nodes and edge weights correspond to their covariance across trees. The bipartition covariance can be both positive and negative. Here, we present some results on the performance of our approaches with illustrative, simulated, and empirical examples in different biological contexts. We also highlight areas in need of future work.
Gene genealogy interrogation (GGI) advances resolution of recalcitrant phylogenies
Dahiana Arcila; Guillermo Ortí, The George Washington University; Ricardo Betancur-R, University of Puerto Rico – Río Piedras, San Juan, Puerto Rico; Liam J. Revell, University of Massachusetts Boston, Boston, Massachusetts.
We present a new statistical approach to address difficult problems in phylogenomics when neither concatenation nor species tree methods deliver unambiguous results. While gene tree estimation error thwarts the application of summary coalescent approaches, large concatenated matrices are known to amplify biases and likewise produce inconsistent results. We propose using topology tests to ask which history, among a set of predefined alternatives, is supported by each gene with highest probability. Our method (namely Gene Genealogy Interrogation or GGI) implements constrained ML searches for each gene alignment under each hypothesis. Our implementation is based on the approximately unbiased (AU) test, which uses multi-scale bootstrapping applied to simultaneous comparisons of multiple trees. In this study, we generated and analysed one of the largest phylogenomic data sets, with 1051 genetic loci and 225 species, to resolve phylogenetic relationships among the Otophysi, a clade of fishes (ca. 10,000 species) that dominates freshwater habitats throughout the world. Our results [...]. Of greater concern, we note that these results [...]. In contrast, by implementing GGI, we demonstrate how genome-level data can be analysed to produce an unambiguous result and, in this case, reconcile genome-level data with a long-dismissed morphological hypothesis which has remained elusive from previous molecular studies. We also apply GGI to published phylogenomic data sets to study recalcitrant nodes in the tree of life, resolving relationships among metazoans, yeast, birds, and mammals.
Summarizing population genome variation in phylogenetic analyses
August Guang; Casey Dunn; Charles Lawrence; Rami Kantor; Mark Howison, Brown University
Most genome sequences are derived from multiple genomes, whether they are multiple cells from a tissue sample or multiple individuals. Genome assemblies therefore summarize diversity across pooled genomes. In some cases the diversity of genome sequences is very low, as for most somatic tissue samples, but in other cases genome diversity can be high, as when rapidly evolving viruses are sampled from a patient. In such cases, identifying the composition of each individual viral genome (phasing) within a patient is a difficult problem. We thus simulated HIV sequence data and explored the impact of alternative strategies to phasing for summarizing within-patient sequence variation for reconstructing the between-patient phylogenetic relationships of the virus population. We found that a single sample or consensus sequence from the pooled patient viral population is insufficient for accurately inferring the phylogeny. However, we found that we could bypass the problem of phasing completely and still capture within-patient variation by simulating synthetic sequences from the observed within-patient variant structure. At their core, simulations are samples drawn from a generative model. This means simulations can summarize the variation present in an analysis result without necessarily having to fully describe every observation. Phylogenetic analyses of these synthetic sequences drawn from the within-patient variant structure resulted in between-patient trees that are as good as ones constructed from fully resolved viral genomes. Simulations from explicit generative models will become essential to address and solve other hard problems in phylogenetics beyond epidemiology, such as gene tree reconstruction in the presence of assembly errors.
Accelerating Bayesian inference for evolutionary biology models
Xavier Meyer, University of Lausanne, Switzerland; B. Chopard, University of Geneva; N. Salamin, University of Lausanne
Bayesian inference of phylogenies relies largely on computational Monte Carlo methods, in particular Markov chain Monte Carlo (MCMC) samplers. However, the design of more complex and realistic models and the ever-growing availability of novel data are pushing the limits of these methods, which are constrained by two main limiting factors. First, efficiently exploring a mixed continuous and topological parameter space is challenging. Second, parallel computing resources are hard to exploit to reduce the computational cost induced by phylogenies with a large number of taxa. We present here a parallel Metropolis-Hastings (M-H) framework built with a novel combination of enhancements aiming to address these limitations. First, we propose an efficient multivariate adaptive proposal for continuous parameters that improves mixing and takes advantage of correlations between them. Second, we employ a parallel MCMC method that exploits parallel computing resources to estimate several likelihoods concurrently. While topological proposals do not benefit from the proposed adaptive proposal, their high rejection rate is directly exploited by this parallel approach. Lastly, we show that, under a precise coupling of our adaptive proposal and parallel MCMC methods, we can achieve performance gains that exceed the sum of their parts. We demonstrate the performance of our new M-H framework by comparing it with MrBayes on two phylogenetic-based models. Using a codon-substitution model with a fixed tree topology, we first show increases in sampling efficiency of up to 10 times with 32 processors. Finally, we show that our framework achieves up to a 20-fold faster convergence rate in tree reconstruction using 32 processors.
Session 2 Abstracts
Hamiltonian Monte Carlo on the space of phylogenies
Arman Bilge, The University of Auckland; Vu Dinh; Erick Matsen, Fred Hutchinson Cancer Research Center
Evolutionary tree inference, or phylogenetics, is an essential tool for understanding biological systems from deep-time divergences to recent viral transmission. The Bayesian paradigm is now commonly used in phylogenetics to describe support for estimated phylogenies or to test hypotheses that can be expressed in phylogenetic terms. However, current Bayesian phylogenetic inference algorithms are limited to about 1,000 sequences, which is far fewer than are available via modern sequencing technology. Here we develop phylogenetic Hamiltonian Monte Carlo (HMC) as a new approach to enable phylogenetic inference on larger data sets. HMC is an existing computational statistical method that scales to large datasets by using Newton's laws of motion to efficiently explore various parameter values. However, because a phylogenetic tree parameter includes both its branch lengths and topology, we must go beyond the current implementations of HMC, which cannot consider this special structure of trees. To do so, we develop a probabilistic version of the physics simulator within HMC, which can explore tree space. This algorithm generalizes previous algorithms by doing classical HMC on the branch lengths when considering a single topology, but making random choices between the tree topologies at the "intersection" between various trees. We show that our algorithm correctly explores the entire tree space and provide a proof-of-concept implementation in open-source software.
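As background for this abstract, the sketch below shows classical HMC only, i.e. the part that operates on continuous parameters for a single fixed topology. It is not the authors' algorithm and does not attempt the topology moves they describe. The target is a toy two-dimensional standard normal standing in for a posterior, and the names `leapfrog`, `hmc_step`, `U`, and `grad_U` are illustrative assumptions.

```python
import math
import random

def leapfrog(q, p, grad_U, step, n_steps):
    """Simulate Hamiltonian dynamics (position q, momentum p) with leapfrog."""
    q, p = list(q), list(p)
    g = grad_U(q)
    for _ in range(n_steps):
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half step for momentum
        q = [qi + step * pi for qi, pi in zip(q, p)]         # full step for position
        g = grad_U(q)
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # second half step
    return q, p

def hmc_step(q, U, grad_U, step, n_steps, rng):
    """One HMC update: draw a momentum, simulate, then accept or reject."""
    p0 = [rng.gauss(0.0, 1.0) for _ in q]
    q_new, p_new = leapfrog(q, p0, grad_U, step, n_steps)
    h_old = U(q) + 0.5 * sum(pi * pi for pi in p0)
    h_new = U(q_new) + 0.5 * sum(pi * pi for pi in p_new)
    # Metropolis correction for the integrator's energy error.
    if math.log(1.0 - rng.random()) < h_old - h_new:
        return q_new
    return list(q)

# Toy target (assumption): standard normal, so U(q) = |q|^2 / 2 and grad U(q) = q.
def U(q):
    return 0.5 * sum(x * x for x in q)

def grad_U(q):
    return list(q)

rng = random.Random(1)
q = [3.0, -3.0]
samples = []
for _ in range(3000):
    q = hmc_step(q, U, grad_U, step=0.2, n_steps=10, rng=rng)
    samples.append(q[0])

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"sample mean {mean:.2f}, sample variance {var:.2f}")
```

The gradient is what lets each proposal travel far while still being accepted with high probability. The difficulty the abstract addresses is that tree topology is discrete, so there is no gradient to follow across topologies.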
Bayesian analysis of continuous-time Markov chain parameters using Hamiltonian Monte Carlo
Tingting (Crystal) Zhao, University of British Columbia
Bayesian analysis of continuous time, discrete state space time series is an important and challenging problem, where incomplete observation and large parameter sets call for user-defined priors based on known properties of the process.
Generalized linear models have a largely unexplored potential to construct such prior distributions. We show that an important challenge with Bayesian generalized linear modelling of continuous time Markov chains is that classical Markov chain Monte Carlo techniques are too ineffective to be practical in that setup. We address this issue using an auxiliary variable construction combined with an adaptive Hamiltonian Monte Carlo algorithm. This sampling algorithm and model make it efficient both in terms of computation and analyst's time to construct stochastic processes informed by prior knowledge, such as known properties of the states of the process. We demonstrate the flexibility and scalability of our framework using synthetic and real phylogenetic protein data, where a prior based on amino acid physicochemical properties is constructed to obtain accurate rate matrix estimates.
Continually updated phylogenies
Emily Jane McTavish; Mark T. Holder, University of Kansas
In order to determine the phylogenetic position of a newly sequenced taxon, researchers usually take one of two approaches: phylogenetic placement, which adds the new taxon onto a tree without affecting the existing relationships among taxa already in the tree, or a full phylogenetic analysis including putative close relatives of that taxon. The former approach provides information about the likely phylogenetic position of the novel taxon, whereas the latter approach both provides that information and allows the sequence data from the new taxon to affect and update the branch lengths, topology, and confidence of previously estimated relationships. However, when trees are very large (either in terms of number of tips or amount of sequence data per tip) a full tree search can be very slow. We have developed a tool, Physcraper, for continual updating of phylogenies. Physcraper automates searching databases for sequences homologous to those in an existing alignment, and then uses the existing tree to inform an updated multiple sequence alignment and serve as a starting tree for additional analyses, and then repeats the procedure. This work is an update and extension of Izquierdo-Carrasco et al.'s (2014) PUmPER. Using empirical data we demonstrate that in large phylogenies the addition of one or a few taxa affects relatively few topological relationships. To improve the efficiency of the phylogenetic estimation aspect of this tool we are exploring a maximum likelihood hybrid placement and tree-search algorithm which uses alternative phylogenetic placements to assess what regions of the phylogeny are potentially affected by taxon addition, and performs branch swapping only in those regions. Rapid automated updating of large phylogenies has many applications including systematics of undescribed species, understanding the context of disease outbreaks, cataloging microbial diversity and adding taxa to the Open Tree of Life.
SUPERSMART: the Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages, and Relationships of Taxa
Rutger A. Vos, Naturalis Biodiversity Center, Leiden, the Netherlands; Alexandre Antonelli, University of Gothenburg, Department of Biological and Environmental Sciences, Gothenburg Botanical Garden; Hannes Hettling, Naturalis Biodiversity Center; Fabien Condamine, University of Gothenburg, Department of Biological and Environmental Sciences; CNRS, UMR 5554 Institut des Sciences de l’Evolution (Université de Montpellier); Karin Vos, University of Gothenburg, Department of Biological and Environmental Sciences; R. Nilsson, University of Gothenburg, Department of Biological and Environmental Sciences; Michael Sanderson, University of Arizona, Ecology and Evolutionary Biology; Herve Sauquet, Universite Paris-Sud, Lab. Ecologie, Systematique, Evolution (ESE); Ruud Scharn, University of Gothenburg, Department of Biological and Environmental Sciences; Daniele Silvestro, University of Gothenburg, Department of Biological and Environmental Sciences; University of Lausanne, Department of Ecology and Evolution; Mats Töpel, Swedish Bioinformatics Infrastructure for Life Sciences; University of Gothenburg, Department of Marine Sciences; Christine Bacon, University of Gothenburg, Department of Biological and Environmental Sciences; Bengt Oxelman, University of Gothenburg, Department of Biological and Environmental Sciences
Rapidly growing biological data (including molecular sequences and fossils) hold an unprecedented potential to reveal how evolutionary processes generate and maintain biodiversity. However, most studies integrating these data use an idiosyncratic step-by-step approach for the reconstruction of time-calibrated phylogenies. In addition, divergence times estimated under different methods and assumptions, and based on data of various quality and reliability, are not directly comparable. Here we introduce a modular framework termed SUPERSMART (Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages, and Relationships of Taxa), and provide a proof of concept for dealing with the moving targets of evolutionary and biogeographical research. This framework assembles comprehensive datasets of molecular and fossil data for any taxa and infers dated phylogenies using robust species tree methods combined with so-called “exa-scale” backbone topologies, also allowing the inclusion of genomic data produced through next-generation sequencing techniques. We exemplify the practice of our method by presenting comprehensive phylogenetic and dating analyses for the mammal order Primates and for the flowering plant family Arecaceae (palms). We believe that this framework will provide a valuable tool for a wide range of hypothesis-driven research questions in systematics, biogeography and evolution. SUPERSMART will also accelerate our inference of the “Dated Tree of Life” whose node ages are directly comparable.
Fast coalescent-based computation of local branch support from quartet frequencies
Siavash Mirarab; Erfan Sayyari, University of California, San Diego
Species tree reconstruction is complicated by effects of Incomplete Lineage Sorting (ILS), commonly modeled by the multi-species coalescent model. While there has been substantial progress in developing methods that estimate a species tree given a collection of gene trees, less attention has been paid to fast and accurate methods of quantifying support. In this paper, we propose a fast algorithm to compute quartet-based support for each branch of a given species tree with regard to a given set of gene trees. We then show how the quartet support can be used in the context of the multi-species coalescent model to compute i) the local posterior probability that the branch is in the species tree and ii) the length of the branch in coalescent units. We evaluate the precision and recall of the local posterior probability on a wide set of simulated and biological datasets, and show that it has very high precision and improved recall compared to multi-locus bootstrapping. The estimated branch lengths are highly accurate when gene tree estimation error is low, but are underestimated when gene tree estimation error increases. Computation of both the branch length and the local posterior probability is implemented as new features in ASTRAL. http://mbe.oxfordjournals.org/cgi/content/abstract/msw079?ijkey=OTHYAZPfjJsY2Ce
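To make the link between quartet frequencies and branch lengths concrete, the sketch below uses the standard multi-species coalescent result for a four-taxon species tree: with an internal branch of length d in coalescent units, a gene tree matches the species tree with probability 1 - (2/3)e^(-d), and each of the two alternative topologies has probability (1/3)e^(-d). Inverting that relation gives a branch-length estimate from the observed fraction of matching quartets. This illustrates the relationship only; it is not ASTRAL's code, it does not compute the local posterior probability, and the function names are assumptions.

```python
import math

def quartet_probabilities(d):
    """Expected quartet topology frequencies around a branch of length d
    (coalescent units): (matching, alternative 1, alternative 2)."""
    discordant = math.exp(-d) / 3.0
    return 1.0 - 2.0 * discordant, discordant, discordant

def branch_length_from_support(z1, n):
    """Estimate branch length from z1 of n gene-tree quartets that match
    the species tree, by inverting 1 - (2/3) * exp(-d)."""
    freq = z1 / n
    if freq <= 1.0 / 3.0:
        return 0.0       # no more agreement than expected for a zero-length branch
    if freq >= 1.0:
        return math.inf  # no discordance observed
    return -math.log(1.5 * (1.0 - freq))

# Usage: 700 of 1000 gene trees agree with the species tree around this branch.
d_hat = branch_length_from_support(700, 1000)
print(f"estimated branch length: {d_hat:.3f} coalescent units")
print("expected frequencies at that length:", quartet_probabilities(d_hat))
```

This also shows why gene tree estimation error leads to underestimated branch lengths, as the abstract reports: error adds spurious discordance, which lowers the matching frequency and therefore the estimate.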
Nearest neighbors of phylogenetic time-trees
Alex Gavryushkin, Centre for Computational Evolution, The University of Auckland, NZ
This is joint work in progress with Erick Matsen and Chris Whidden of the Fred Hutchinson Cancer Research Center, Seattle, WA, based on earlier work with, and communicated by, Alexei Drummond of the University of Auckland.
With phylogenetic methods being employed in various areas of science, the information carried by the tree may have substantially different meanings. Examples include gene trees, species trees, transmission trees, language trees, etc. In many of these applications, the concept of time is explicitly present in the data and is often an objective of phylogenetic time-tree inference. By time here we mean actual absolute time that can be put on a calendar, as opposed to relative measures that entangle time and mutation rates.
An important property that distinguishes a time-tree from a classical phylogenetic tree is that all nodes of the tree (divergence and sample events) are ranked according to their time. For example, the fact that the MRCA of human and chimp is younger than that of elephant and hyrax is expressed in a time-tree but not in a classical tree. Furthermore, incorporating time into Bayesian tree search algorithms greatly improves their efficiency (e.g. MCMC moves in BEAST).
Many phylogenetic comparative methods, as well as tree search methods, use elementary modifications of trees to explore the space. The best-known and most widely used modifications at present are NNI, SPR, and TBR. They constitute the main bottleneck for computations.
Surprisingly little is known about elementary modifications of time-trees. Indeed, there exists no standard coordinate system for continuous time-trees that induces a natural notion of an elementary modification for discrete time-trees. Furthermore, unlike traditional tree discretizations such as the NNI graph, there is not yet any way to express time information in graph structure.
In this work, we suggest an approach to fill this gap by providing a novel coordinate system for phylogenetic time-trees. This system scales naturally from continuous to discrete trees by hierarchically approximating continuous time by discrete time segments. Although elementary moves between trees are inherited from the NNI move, geometric and algorithmic properties of the moves are greatly different.
In this talk, I will introduce the coordinate system and motivate it by popular applications in computational phylogenetics. I will compare the system with classical phylogenetic trees and demonstrate its algorithmic and statistical potential. I will finish with a list of open problems that have a wide range of applications in phylogenetics and beyond.
StarBEAST2 improves convergence and can infer per-species substitution rates
Huw A. Ogilvie, Research School of Biology, Australian National University; Alexei Drummond, Centre for Computational Evolution, University of Auckland
Maximum likelihood concatenation is a popular estimator of species tree topology but is statistically inconsistent inside the anomaly zone of short branch lengths. We have recently also shown that the ages of extant species inferred using concatenation can be severely overestimated. These known biases are motivating the development and use of multispecies coalescent (MSC) methods of species tree inference. Summary MSC methods like ASTRAL are becoming increasingly popular as they are fast enough to use with large phylogenomic data sets, but cannot estimate divergence times and extract less information per gene compared to fully Bayesian MSC methods like *BEAST. In response we have developed and present StarBEAST2, a fully Bayesian method with improved computational performance. This increase in performance is achieved through a combination of analytical integration of population sizes and new MCMC operators for integration of other parameters. StarBEAST2 will enable inference of more accurate and precise species trees and divergence times by increasing the maximum number of genes that is practical for a given analysis. StarBEAST2 also adds support for applying a relaxed clock to the species tree, so that relative or absolute substitution rates may be estimated for each extant and ancestral species.