This is an archived static version of the original phylobabble.org discussion site.

Additive distance matrices, caterpillar trees, and validation of cluster analysis

thenomen

Dear All,

I am new to the forum and happy that it exists.

I have been using phylogenetic tree reconstruction methods on non-biological data for some years now, yet I still get into trouble from time to time.

My main question is this: What is the relationship between additive distance matrices (or tree-like data) and the need for validating your cluster analysis?

I know that if a distance matrix is additive, most algorithms (e.g., Neighbor-Joining or UPGMA) will reconstruct the “correct tree”. However, in the broader field of cluster analysis and statistics, people usually expect you to perform a validation of your analysis using, for example, PCA or silhouettes.
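For reference, additivity is usually defined through the four-point condition: for every four taxa, the two largest of the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) must be equal. The following is a minimal Python sketch of that check, assuming the matrix is given as a symmetric list of lists. The function names `four_point_deviation` and `is_additive`, the tolerance, and the example matrices are illustrative assumptions, not taken from any library or from this thread.

```python
from itertools import combinations


def four_point_deviation(d):
    """Return the largest violation of the four-point condition in d.

    d is a symmetric matrix (list of lists) of pairwise distances.
    For each quartet, the gap between the two largest of the three
    pairwise sums is zero when d is additive; we report the largest gap.
    """
    worst = 0.0
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        worst = max(worst, sums[2] - sums[1])
    return worst


def is_additive(d, tol=1e-9):
    """True if no quartet violates the four-point condition by more than tol."""
    return four_point_deviation(d) <= tol


# Hypothetical example: path lengths on the tree ((A,B),(C,D)) with
# pendant edges 1, 2, 3, 4 and an internal edge of length 5.
tree_like = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

# Same matrix with d(A,D) changed from 10 to 12, which breaks additivity.
perturbed = [
    [0, 3, 9, 12],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [12, 11, 7, 0],
]

print(is_additive(tree_like), four_point_deviation(tree_like))
print(is_additive(perturbed), four_point_deviation(perturbed))
```

The deviation returned here is only one possible way to quantify how far a matrix is from being tree-like. The loop visits every quartet, so the cost grows with the fourth power of the number of taxa.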

So again, how do we go about validation if the data are provably treelike, e.g., according to one of the tests from J. A. Hartigan, "Statistical theory in clustering", Journal of Classification 2, 63-76 (1985)?

Also, are there any publications that deal with the properties of caterpillar trees w.r.t. the data they are based on?

I hope this is the right place to post such a question. If not, I would be very glad if you could direct me to the correct one.

Many thanks and best wishes, Tudor

GrahamJones

There is a philosophical issue here. In phylogenetic analysis it is normally assumed that evolution is a branching process, so that the result must be a tree, even if the distance matrix is not additive. So questions about the number of clusters, and clustering tendency, do not arise. We assume that clades are real things from the outset. I don’t know if you can make such an assumption even if your data is ‘provably treelike’.

‘Clustering tendency’ comes from the book ‘Algorithms for Clustering Data’ by Jain and Dubes. I am more familiar with this than with Hartigan. Also, it’s freely available at homepages.inf.ed.ac.uk/rbf/BOOKS/JAIN/Clustering_Jain_Dubes.pdf

In phylogenetic analysis, the validity of individual clades is certainly of interest, and the usual metrics are bootstrap support for maximum likelihood methods and posterior probability for Bayesian methods. NJ and UPGMA are less accurate for biological phylogenetics, and don’t provide a measure of uncertainty about their results.

This probably means that you are on your own. But not completely. Evolutionary biologists are increasingly interested in species delimitation, and that is a type of cluster analysis.

mtholder

Daniel Huson and @mathmomike had a 2004 Syst. Biol. paper (http://sysbio.oxfordjournals.org/content/53/2/327.abstract) showing that, under some forms of under-correcting distances for multiple hits, you can get a distance matrix that is additive… but additive on the wrong tree!

So there are clearly cases in which we cannot use deviations from additivity as the sole criterion for the suitability of a distance matrix for phylogenetic analysis. I think that paper should be very sobering for advocates of reconstructing trees from distances.