Just to be clear, by “online” I mean in the algorithmic sense http://en.wikipedia.org/wiki/Online_algorithm. That is, offline methods assume we have all the data to handle, and we build a tree (e.g., compute pairwise distances for all sequences, permute tree for all sequences to explore tree space). An **online algorithm** builds the tree as data comes to hand. Now, I guess the initial tree building steps in a tool like PAUP are effectively “online”: we start with, say, three sequences, then add each remaining sequence in turn. But I wonder if anyone’s done some work on this sort of problem. A use case would be, for example, having an existing phylogeny and adding new data to that tree without recomputing the whole tree. An old skool approach to this would be to take the existing tree as a constraint tree and add the new sequence to that tree - is there anything more to this problem than that? Googling is messy as “online” usually means “on the web”.

# “Online” algorithm for building phylogeny

This is a major interest for my group.

You may be interested in a recent paper by @blackrim and @Alexis_RAxML:

http://bioinformatics.oxfordjournals.org/content/early/2014/02/28/bioinformatics.btu053

I gave a talk at Evolution 2013 on our efforts (joint with @koadman) to develop online Bayesian phylogenetic methods. A couple of months after the Evolution conference, @cmccoy and I got obsessed with B cell immunology, so didn’t have time to pursue this work.

However, more recently @bcclaywell has been doing some really nice work combining our previous efforts with our special purpose surrogate function for phylogenetic likelihood (you can see @cmccoy’s Evolution talk on that). This provides much better performance than represented in my Evolution talk.

We will release a paper this summer.

@ematsen I sure hope we can get a paper out this austral winter!

Having online tree inference is in many ways just the first step. Working out which data to delete from the tree may be just as important as figuring out how to add new data. Think of all that Illumina 16S amplicon data piling up in databases. Do we really want a tree with trillions of tips? If we don’t delete then at least we need meaningful abstractions. And of course there’s a range of things that people do with trees that would need to go online as well. I’m thinking of ways to update marginalized posterior distributions and possibly also hypothesis tests or other approaches to model selection.

Have you checked out: Kannan, Sampath K., Eugene L. Lawler, and Tandy J. Warnow. “Determining the evolutionary tree using experiments.” Journal of Algorithms 21.1 (1996): 26-50.

Not quite what you want, but maybe in the ball park?

D.

Link to paper and link to PDF.

How time flies. We finally got two papers out about this.

Fourment, M., Claywell, B. C., Dinh, V., McCoy, C., Matsen, F. A., & Darling, A. E. (2017, June 2). Effective Online Bayesian Phylogenetics Via Sequential Monte Carlo With Guided Proposals. bioRxiv. https://doi.org/10.1101/145219

Dinh, V., Darling, A. E., & Matsen, F. A., IV. (2016, October 26). Online Bayesian phylogenetic inference: theoretical foundations via Sequential Monte Carlo. arXiv [q-bio.PE]. Retrieved from [1610.08148] Online Bayesian phylogenetic inference: theoretical foundations via Sequential Monte Carlo

If you don’t like the measure theory of the second one, we’re going to post a revised version with less technical explanations in a few weeks.

And a blog post about the first paper: