@jheled Thanks for the link. I also came across this paper, which looks potentially useful if we go down the UPGMA route:
Loewenstein, Y., Portugaly, E., Fromer, M., & Linial, M. (2008, July 1). Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics. Oxford University Press (OUP). http://dx.doi.org/10.1093/bioinformatics/btn174
The BOLD dataset is pre-clustered into BINs, so I guess we could be clever and use that to help constrain the search space. BOLD's failure to release full taxonomic identifications is a pain (and the reason NCBI dumped a lot of their sequences from GenBank), but the dataset is still useful, especially if we just want a measure of sequence diversity at a site.
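
To make the "constrain the search space" idea a bit more concrete, here is a rough sketch of what I have in mind. It is only an illustration under my own assumptions: the records, BIN labels and function names are all made up, the sequences are assumed to be already aligned, and using the first member of each BIN as its representative is an arbitrary choice, not something taken from BOLD or from the paper above. The point is just that comparing all pairs within a BIN, plus one representative per BIN across BINs, needs far fewer distance calculations than a full all-against-all matrix.

```python
from collections import defaultdict
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def group_by_bin(records):
    """Map each BIN label to the list of (seq_id, sequence) assigned to it."""
    bins = defaultdict(list)
    for seq_id, bin_id, seq in records:
        bins[bin_id].append((seq_id, seq))
    return dict(bins)


def constrained_pairs(bins):
    """Yield only the pairs we bother to compare.

    All pairs inside each BIN, then one representative per BIN against the
    other representatives (assumption: first member stands in for the BIN).
    """
    for members in bins.values():
        yield from combinations(members, 2)
    reps = [members[0] for members in bins.values()]
    yield from combinations(reps, 2)


# Made-up toy records: (sequence id, BIN label, aligned sequence).
records = [
    ("s1", "BIN_A", "ACGTACGT"),
    ("s2", "BIN_A", "ACGTACGA"),
    ("s3", "BIN_B", "TTGTACGT"),
    ("s4", "BIN_B", "TTGTACCT"),
    ("s5", "BIN_C", "ACGGGCGT"),
]

bins = group_by_bin(records)
distances = {
    (a[0], b[0]): p_distance(a[1], b[1]) for a, b in constrained_pairs(bins)
}

n = len(records)
all_pairs = n * (n - 1) // 2  # size of the full all-against-all comparison
print(f"{len(distances)} constrained comparisons vs {all_pairs} unconstrained")
```

The within-BIN distances could then be fed to UPGMA one BIN at a time, with the representative distances used to join the BINs afterwards. Whether that approximation is good enough for our purposes is something we would need to check.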