Imagine that I want to build a phylogeny for all DNA barcodes (for this purpose, let’s restrict it to animal barcodes, i.e. COI). We have over a million barcode sequences (http://www.boldsystems.org/index.php/datarelease), so how do we get a tree for this number of sequences? Why do I want a tree? Well, imagine that instead of measuring biodiversity by counting species I want to compute phylogenetic diversity for an arbitrary region of the planet. So, I’d like a tree for all COI sequences, then be able to extract the subtree for taxa within a geographic bounding box, then get the length of that tree. Presumably I’d need a divide and conquer approach that relied on the fact that the barcode sequences are already clustered into BINs, and we have information on higher-level taxonomy, but do we have tools in place that could tackle something on this scale?
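To make the target computation concrete, here is a minimal Python sketch of the last two steps: select the tips inside a bounding box, then sum the branch lengths of the subtree that connects them. The toy tree, the coordinates, and the names `TREE`, `COORDS`, `tips_in_box` and `subtree_length` are all made up for illustration; a real analysis would read a tree file and BOLD collection metadata instead.

```python
# Hypothetical toy tree: node -> (parent, length of the branch above the node).
# The root has parent None.
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 0.1),
    "n2": ("root", 0.3),
    "A": ("n1", 0.2),
    "B": ("n1", 0.25),
    "C": ("n2", 0.1),
    "D": ("n2", 0.15),
}

# Hypothetical collection sites: tip -> (latitude, longitude).
COORDS = {
    "A": (10.0, 20.0),
    "B": (11.0, 21.0),
    "C": (50.0, -3.0),
    "D": (12.0, 19.0),
}


def tips_in_box(coords, lat_min, lat_max, lon_min, lon_max):
    """Return the sorted tip names whose coordinates fall inside the box."""
    return sorted(
        tip
        for tip, (lat, lon) in coords.items()
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
    )


def subtree_length(tree, tips, include_root_path=False):
    """Sum the branch lengths of the subtree spanned by `tips`.

    By default only the branches needed to connect the tips to each other
    are counted. With include_root_path=True the path from their common
    ancestor up to the root is added as well.
    """
    tips = set(tips)
    if not tips:
        return 0.0
    # counts[node] = number of selected tips at or below that node.
    counts = {}
    for tip in tips:
        node = tip
        while tree[node][0] is not None:
            counts[node] = counts.get(node, 0) + 1
            node = tree[node][0]
    n = len(tips)
    total = 0.0
    for node, count in counts.items():
        # A branch shared by all n tips lies above their common ancestor.
        if include_root_path or count < n:
            total += tree[node][1]
    return total


selected = tips_in_box(COORDS, 9.0, 13.0, 18.0, 22.0)
print(selected, subtree_length(TREE, selected))
```

In this toy case the box picks up A, B and D, and the connecting subtree has a total length of about 1.0. Whether the path back to the root should be counted depends on which definition of phylogenetic diversity you want, hence the `include_root_path` switch.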
It is only remotely related, but we have explored a divide and conquer strategy for phylogenetic reconstruction using concatenated alignments (https://peerj.com/preprints/223/) that goes in that direction. The idea is implemented as a Python package for species and gene family tree reconstruction. It is not yet released, but the code is available at https://github.com/jhcepas/npr
In principle it would allow running nested reconstructions in which faster methods are used for basal (large) nodes and more exhaustive approaches for the tips. Although I never tried anything larger than 8000 seqs…
I have been working on building such large trees for a while. I think that with more memory and parallelizing some of the code I can build a tree for the BOLD database, or at least come close.
Obviously, there are many caveats. I use a “primitive” phylogenetic method (UPGMA-like). UPGMA is statistically consistent under the assumption of a strict molecular clock, but no one could justify a strict clock for those time scales. Still, I think those trees are very useful for analysing NGS data. If nothing else, some trees I built from BOLD data indicate that we can pick out some inconsistencies in the data (I assume either human error or contamination). To fully utilize the BOLD database we need to wait a few more years until they release the full taxonomic classifications (up to species level).
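For readers who have not met UPGMA, here is a minimal textbook sketch in plain Python. It is not the UPGMA-like method from the manuscript, just the classic average-linkage algorithm on uncorrected p-distances, and the naive implementation is far too slow for a million sequences. The function names `p_distance` and `upgma` and the four toy sequences are made up for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Build an ultrametric tree from {name: aligned sequence}.

    Returns a Newick string. Ties are broken by cluster index, so the
    result is deterministic.
    """
    names = sorted(seqs)
    # cluster id -> (Newick label, number of tips, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {
        (i, j): p_distance(seqs[names[i]], seqs[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (i, j), d = min(dist.items(), key=lambda kv: (kv[1], kv[0]))
        (label_i, size_i, height_i) = clusters.pop(i)
        (label_j, size_j, height_j) = clusters.pop(j)
        height = d / 2.0  # the new node sits halfway between the two clusters
        label = (
            f"({label_i}:{height - height_i:.4f},"
            f"{label_j}:{height - height_j:.4f})"
        )
        # Size-weighted average distance from the merged cluster to the rest.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = (label, size_i + size_j, height)
        next_id += 1
    ((newick, _, _),) = clusters.values()
    return newick + ";"


toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "AAAATTTT", "D": "AAATTTTT"}
print(upgma(toy))
```

Every tip in the result ends up at the same distance from the root, which is exactly the strict-clock assumption mentioned above.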
I put a copy of the relevant parts of a manuscript we just submitted here: https://dl.dropboxusercontent.com/u/5675908/NGStree.pdf
@jheled Thanks for the link. I also came across this paper which looks potentially useful if going down the UPGMA route:
Loewenstein, Y., Portugaly, E., Fromer, M., & Linial, M. (2008, July 1). Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics. Oxford University Press (OUP). http://dx.doi.org/10.1093/bioinformatics/btn174
The BOLD dataset is pre-clustered into BINs, so I guess we could be clever and use that to help constrain the search space. While BOLD’s failure to release full taxonomic identifications is a pain (and the reason the NCBI dumped a lot of their sequences from GenBank), it is still useful, especially if we just want a measure of sequence diversity at a site.
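One way the BINs might constrain the search space, sketched in Python: keep a single representative per BIN for the backbone tree and resolve the members of each BIN separately. The sketch below picks the medoid of each BIN (the member with the smallest summed distance to the others). The record layout, the BIN labels and the names `hamming` and `bin_medoids` are assumptions for illustration, not anything BOLD provides.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def bin_medoids(records):
    """Pick one representative sequence per BIN.

    `records` is an iterable of (seq_id, bin_id, aligned_sequence).
    Returns {bin_id: seq_id of the medoid}; ties are broken by seq_id.
    """
    by_bin = defaultdict(list)
    for seq_id, bin_id, seq in records:
        by_bin[bin_id].append((seq_id, seq))
    medoids = {}
    for bin_id, members in by_bin.items():
        best = min(
            members,
            key=lambda m: (sum(hamming(m[1], other) for _, other in members), m[0]),
        )
        medoids[bin_id] = best[0]
    return medoids


# Hypothetical records; the BIN labels here are placeholders.
records = [
    ("s1", "BIN_1", "AAAA"),
    ("s2", "BIN_1", "AAAT"),
    ("s3", "BIN_1", "AATT"),
    ("s4", "BIN_2", "CCCC"),
]
print(bin_medoids(records))
```

The backbone would then only need one tip per BIN rather than one per sequence, and each BIN can be expanded independently afterwards. The medoid step is quadratic in BIN size, so very large BINs would need subsampling.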
We are trying to make this possible with the SUPERSMART pipeline (http://www.supersmart-project.org), which takes a recursive divide-and-conquer approach, though we are going through PhyLoTA to get more loci than just the barcoding ones. Work in progress, obviously.