How to recognize a rearranged tree?


@cwhidden and I had an entertaining discussion about the following question, which apparently came up in the @Alexis_RAxML lab as well. By tree rearrangement I mean an operation on some class of trees, such as subtree-prune-regraft (SPR) on unrooted trees.

Given some collection A of trees and a member t of A, how can we decide if a rearranged version of t is a member of A?

It seems like we’d like some sort of means of serializing trees that is compatible with the class of tree rearrangements. @armanbilge suggested thinking of a tree as the collection of its clades, and encoding that collection of clades as a binary vector indexed by the power set on the taxa (i.e. the set of all taxon subsets), and then keep track of what clades appear and disappear upon rearrangement. This would work, though the power set gets pretty big.

Thoughts? cc @mathmomike @david_bryant


Let C(t) = {c_1, c2, …, c_L} be the set of clades of tree t. Now imagine we take our set A ={ t_1, t2, … t_k} of trees and compute the set U of unique clades in A. That is, we compute c*(t_i) = intersect(c(t_i), c(t_{i-1})), i > 1, thus U = c*(t_k). Now we have a “query” tree t*, which is a re-arrangement of tree t. All we want to know is if t* is in A. A necessary and sufficient condition is that intersect(c(t*), U) = c(t*), meaning that all the clades in t* are in U. In other words, if c(t*) is in U, then “we’ve seen this tree before”.

I think this bypasses having to deal with power set of clades – one can compare clades directly, “object-wise”, provided there is a unique way of writing them in string form – and also having to compute which clades would disappear upon rearrangement.

This feels too easy, though. What am I missing?


Hi Max!

It’s certainly necessary to have intersect(c(t*), U) = c(t*), but I don’t think it’s sufficient. Couldn’t we have the clades of t* appearing in the various trees of A although no single tree has all of them?


I really don’t know. Let’s think of this in terms of clade “compatibility”. A tree is a set of compatible clades. Could it be the case that U has all the compatible clades that make up t*?


To expand a little on your previous comment, we might have the following situation. Given A = {t_1, t_2, … t_k}, we might have one clade of t* appear in t_1, another in t_2, etc for all of them, but not have all of the clades appear in any one single tree. Thus we would have c(t*) contained in U without t* appearing in A.

On the other hand, you definitely have the core of a good idea there. Having a good necessary condition is certainly helpful if we can just check slowly if the tree is in the set. We could think of also keeping track of all of the pairs of clades that appear in trees of our set, and perhaps use that as an additional criterion in a similar way.


Thanks for the simple counter-example; certainly not a sufficient condition…
Having a necessary condition means we can quickly dispatch candidates that do not belong in A. I wonder if we can borrow from the ideas here: to get a better sense of what do once said condition is found to hold.


(Just pasting the pubmed link in on its own line makes a nice little box… :smile:)


Is there a good way to represent just the unlabelled tree topology? Perhaps equivalency of the topologies can be checked first before doing more work to verify the labels. Given the same tree topology, maybe checking the labels would be easier as well.


Is there a good way to represent just the unlabelled tree topology?

The last paragraph of the ‘computational complexity’ section of describes something that sounds like this.


The Prüfer code is a neat way of serialising trees. And to spell out one way a serialisation could be used: serialise A, serialise the set B of rearrangements of t, sort A, use binary search. Time is O( (|A|+|B|) log(|A|) ).

The following is something I just worked out. If you restrict to unrooted binary trees with labelled tips, but unlabelled internal nodes, then the Prüfer sequences - in essence - are as follows. Suppose there are m internal nodes. Take two copies each of the numbers 1,2,…m, and make any ordering of these so that a 1 appears first, a 2 is the next to appear and so on, to make a sequence of length 2m. Each one corresponds to a tree. The tip labels do not appear, but they are implied by the ordering of the sequence.


Maybe I should provide some background why we are interested in this:

If you do tree searches on very large trees (talking about 100,000 taxa or more) and you do SPR moves you might want to maintain a list/population of N good/promising candidate trees that are generated by these SPR moves. Now if you have a new tree generated by a SPR move and it has a good score you’d like to insert it into this list, but need to check if one with an identical topology (but potentially different likelihood) has already been inserted there. The problem is that for very large trees, this comparison of tree structures, no matter how you do it, will become so time-intensive that it will dominate execution times, that is, it will take more time to compare trees than to calculate likelihoods. Thus, we need a method for systematically applying SPR moves to a tree such that we know that all generated tree topologies to be inserted in this list are unique.



Is any information about distances (corresponding to the rearrangement) between trees in A available? E.g. if you have clustered A into a number of neighbourhoods of small radius, you might want to check first to what neighbourhood tree t belongs. Also, if computing the distance is feasible, you can reduce your search to only trees at a particular distance (from some `origin’ tree, for example).


In part inspired by this discussion, @cwhidden and I have a new manuscript:

Whidden, C., & Matsen, F. A., IV. (2016). Efficiently Inferring Pairwise Subtree Prune-and-Regraft Adjacencies between Phylogenetic Trees.

Here’s the corresponding blog post.

It’s really designed for the setting in which we have a collection of trees in advance and want to calculate the corresponding SPR subgraph on those trees, rather than the original formulation of the question, which would be more relevant to tree space exploration. We’re still thinking…