This is an archived static version of the original discussion site.

Building a tree for all DNA barcodes (> 1 million taxa)


Imagine that I want to build a phylogeny for all DNA barcodes (for this purpose, let’s restrict it to animal barcodes, i.e. COI). We have over a million barcode sequences, so how do we get a tree for that many sequences? Why do I want a tree? Well, imagine that instead of measuring biodiversity by counting species I want to compute phylogenetic diversity for an arbitrary region of the planet. So, I’d like a tree for all COI sequences, then be able to extract the subtree for taxa within a geographic bounding box, then get the length of that tree. Presumably I’d need a divide and conquer approach that relied on the fact that the barcode sequences are already clustered into BINs, and we have information on higher-level taxonomy, but do we have tools in place that could tackle something on this scale?
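To make the target computation concrete, here is a minimal sketch of the last two steps (pick the tips inside a bounding box, then sum the branch lengths of the subtree that connects them). The tree, the coordinates and all function names are made up for illustration; a real pipeline would read the tree and localities from files and would need something far more memory-efficient for a million tips.

```python
# Toy tree stored as node -> (parent, branch_length); the root has parent None.
# All names and numbers are hypothetical.
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 0.5),
    "n2": ("root", 0.3),
    "A": ("n1", 0.2),
    "B": ("n1", 0.1),
    "C": ("n2", 0.4),
    "D": ("n2", 0.6),
}

# Hypothetical collection localities as (longitude, latitude) per barcode.
COORDS = {
    "A": (10.0, 50.0),
    "B": (11.0, 51.0),
    "C": (-60.0, -3.0),
    "D": (12.0, 49.0),
}


def tips_in_box(coords, min_lon, min_lat, max_lon, max_lat):
    """Return the set of tips whose locality falls inside the bounding box."""
    return {
        tip
        for tip, (lon, lat) in coords.items()
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
    }


def path_to_root(tree, node):
    """Yield every node on the way up from `node`, excluding the root itself.

    Each yielded node stands for the branch joining it to its parent.
    """
    while tree[node][0] is not None:
        yield node
        node = tree[node][0]


def subtree_length(tree, tips):
    """Total branch length of the minimal subtree connecting `tips`."""
    tips = list(tips)
    if len(tips) < 2:
        return 0.0
    paths = [set(path_to_root(tree, tip)) for tip in tips]
    union = set().union(*paths)
    # Branches shared by every path lie above the tips' common ancestor.
    shared = set.intersection(*paths)
    return sum(tree[node][1] for node in union - shared)


if __name__ == "__main__":
    selected = tips_in_box(COORDS, 9.0, 48.0, 13.0, 52.0)
    print(sorted(selected), subtree_length(TREE, selected))
```

Some definitions of phylogenetic diversity also count the branches leading back to the root of the full tree; in this sketch that would mean summing over `union` instead of `union - shared`.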


It is only remotely related, but we have explored a divide and conquer strategy for phylogenetic reconstruction using concatenated alignments that goes in that direction. The idea is implemented as a Python package for species and gene family tree reconstruction. Not yet released, but available at

In principle it would allow running nested reconstructions in which faster methods are used for basal (large) nodes and more exhaustive approaches for the tips. Although I have never tried anything larger than 8000 seqs…


I have been working on building such large trees for a while. I think that with more memory and parallelizing some of the code I can build a tree for the BOLD database, or at least come close.

Obviously, there are many caveats. I use a “primitive” phylogenetic method (UPGMA-like). UPGMA is statistically consistent under the assumption of a strict molecular clock, but no one could justify a strict clock for those time scales. Still, I think those trees are very useful for analysing NGS data. If nothing else, some trees I built from BOLD data indicate we can pick out some inconsistencies in the data (I assume either human errors or contaminations). To fully utilize the BOLD database we need to wait a few more years until they release the full taxonomic classifications (up to species level).
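For readers who have not met UPGMA, here is a minimal textbook sketch of the plain algorithm, not the UPGMA-like method from the manuscript. It repeatedly merges the two closest clusters and averages their distances to everything else, weighted by cluster size. The function name and the input format (a dict keyed by `frozenset` pairs) are choices made for this example only, and the naive pair search would be hopeless at BOLD scale.

```python
from itertools import combinations


def upgma(names, dist):
    """Naive UPGMA on a small distance table.

    `dist` maps frozenset({a, b}) to the distance between tips a and b.
    Returns (tree, root_height), where tree is a nested tuple of tip names.
    """
    # Each cluster is identified by its own subtree; the value is its tip count.
    sizes = {name: 1 for name in names}
    d = dict(dist)
    height = 0.0
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a = sizes.pop(a)
        size_b = sizes.pop(b)
        # Under a strict clock the new node sits halfway between the clusters.
        height = d[frozenset((a, b))] / 2.0
        merged = (a, b)
        # Distance to every remaining cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    (tree,) = sizes
    return tree, height


if __name__ == "__main__":
    # Made-up distances for four tips.
    toy = {
        frozenset(("A", "B")): 2.0,
        frozenset(("C", "D")): 4.0,
        frozenset(("A", "C")): 8.0,
        frozenset(("A", "D")): 8.0,
        frozenset(("B", "C")): 8.0,
        frozenset(("B", "D")): 8.0,
    }
    print(upgma(["A", "B", "C", "D"], toy))
```

The halving step is where the strict-clock assumption enters: every tip in a cluster is placed at the same distance from the cluster's root.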

I put a copy of the relevant parts of a manuscript we just submitted here.

Joseph Heled


@jheled Thanks for the link. I also came across this paper which looks potentially useful if going down the UPGMA route:

Loewenstein, Y., Portugaly, E., Fromer, M., & Linial, M. (2008, July 1). Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics. Oxford University Press (OUP).

The BOLD dataset is pre-clustered into BINs, so I guess we could be clever and use that to help constrain the search space. While BOLD’s failure to release full taxonomic identifications is a pain (and the reason the NCBI dumped a lot of their sequences from GenBank), it is still useful, especially if we just want a measure of sequence diversity at a site.
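As a rough illustration of how much pre-clustering could shrink the problem, the sketch below counts pairwise comparisons for a flat all-against-all approach versus a two-level one. The two-level scheme (compare sequences only within their own BIN, then compare one representative per BIN) is an assumption made for this example, and the BIN labels are invented.

```python
from collections import Counter


def n_pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def comparison_counts(bin_of):
    """Return (flat, two_level) pairwise comparison counts.

    `bin_of` maps each sequence id to its BIN id.
    """
    bin_sizes = Counter(bin_of.values())
    flat = n_pairs(len(bin_of))
    within_bins = sum(n_pairs(size) for size in bin_sizes.values())
    between_bins = n_pairs(len(bin_sizes))  # one representative per BIN
    return flat, within_bins + between_bins


if __name__ == "__main__":
    # Hypothetical assignment of six sequences to three BINs.
    toy = {
        "s1": "BIN1", "s2": "BIN1", "s3": "BIN1",
        "s4": "BIN2", "s5": "BIN2",
        "s6": "BIN3",
    }
    print(comparison_counts(toy))
```

The saving grows with the number of BINs, because the flat count is quadratic in the number of sequences while the two-level count is quadratic only in BIN sizes and in the number of BINs.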


We are trying to make this possible using the SUPERSMART pipeline, using a recursive divide-and-conquer approach, though we are going through PhyLoTA to get more loci than just the barcoding ones. Work in progress, obviously.


@rutgeraldo Nice. PhyLoTA is a greatly underappreciated resource. I harvested the phylogenies, added geography and publications from EBI records for sequences, and dumped it into The phylogenies are full of useful info, particularly at lower taxonomic levels.