Long Branches Attract problem, and forcing a fix on it?


With HIV-1 phylogenies we quite often have data sets where we know important details about the true evolutionary history. For example, in a local transmission chain such as husband to wife to infant, we often know for certain who was infected first and the dates of the transmission events. In other cases the rough epidemiology is known, so we can tell that an epidemic spread out from a point-source introduction.

Very often, or perhaps always given certain relative levels of diversity, the phylogenetic trees produced from such a data set show a misrooting of one or more subclades, which essentially turns that subclade “inside out” by placing its more diverse sequences closest to the outgroup. The explanation seems to be the “long branches attract” problem. In the father->mother->infant case for example, the sequences from the infant often appear on a branch in between the father and mother, when we know for certain that this is “out of order”.

What I want is a tool that allows me to “fix” a known misrooting of a clade, and then calculate the likelihood value (or other such measurement) of the “correct” tree vs the misrooted tree. I can provide sample data sets and tree results to anyone who is interested in this.


FastTree supports topological constraints, where you can force the final tree to have a certain split. It would probably work for this situation.
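As a rough sketch of what that could look like: as far as I understand the FastTree documentation, the `-constraints` option takes an alignment-style file in which each column describes one split, with `1` and `0` marking the two sides and `-` meaning unconstrained. The sequence names and the helper `constraint_fasta` below are hypothetical, made up only to illustrate the father->mother->infant case.

```python
def constraint_fasta(all_taxa, clade):
    """Build a one-column constraint 'alignment' in FASTA format.

    Taxa listed in `clade` get '1', all other taxa get '0', so the single
    column asks for a split separating `clade` from everything else.
    """
    clade = set(clade)
    unknown = clade - set(all_taxa)
    if unknown:
        raise ValueError(f"clade members not in taxon list: {sorted(unknown)}")
    lines = []
    for name in all_taxa:
        lines.append(f">{name}")
        lines.append("1" if name in clade else "0")
    return "\n".join(lines) + "\n"


# Hypothetical sequence names; replace with the names in the real alignment.
taxa = ["outgroup", "father_1", "father_2", "mother_1", "mother_2", "infant_1"]

# Ask for the mother and infant sequences to sit together on one side of a split.
constraints = constraint_fasta(taxa, ["mother_1", "mother_2", "infant_1"])
print(constraints)

# To save it for FastTree (file name is an assumption):
# with open("constraints.fa", "w") as handle:
#     handle.write(constraints)
```

The file would then be passed together with the real alignment, something like `FastTree -nt -gtr -constraints constraints.fa alignment.fa > constrained.tree`; check `FastTree -help` for the exact options in your version. Note that a constraint of this kind forces a split in the unrooted tree; it does not by itself choose a root.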


Right: I believe most tree-building programs offer such topological constraints. This seems to be what you want.


Hi Brian,

This is with reference to the following statement.

“In the father->mother->infant case for example, the sequences from the infant often appear on a branch in between the father and mother, when we know for certain that this is “out of order”.”

As you explained, this may be a result of long-branch-attraction. The following paper provides an alternate explanation. Please take a look at the cases that are presented in Fig. 1.

Best, Prabhav


Here is an example tree, and a couple of figures drawn from it. It is usually not the whole tree that is misrooted or “inside out”, but just one clade within the tree. I am attaching here a tree that appears to have an example of the issue. The HIV-1 M group subtype B epidemic began in Haiti, and then there were hundreds of exports of the virus from Haiti to other parts of the world. The introduction of subtype B into South Korea happened in 1989 or 1990 and most likely did not come straight out of Haiti. Subtype B spread into Trinidad and Tobago much earlier than 1990, but there was not much sampling and sequencing of viruses there until more recently, so the isolates from there are biased toward the post-2005 period, whereas for the USA and Europe we have a lot of sequences from viruses sampled in the 1984-2000 period.

In this tree, we might get the impression that the viruses from South Korea and Trinidad/Tobago originate near the “root” of the subtype B clade, which I have labeled “root 1” in red. But it is far more likely, given all that we know of the epidemiology of HIV-1 subtype B, that the true root of the B clade is close to the green node I have labeled “root 2”.

When the tree is drawn as a radial tree like this, it gives us a somewhat different impression than when it is drawn as a cladogram. And in either tree view, if we measure distances from tips to “root 1” we get different values than if we measure distances from tips to “root 2”.
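To make the point about distances concrete, here is a minimal sketch that sums branch lengths from a chosen internal node to every tip. The tiny tree, its tip names and its branch lengths are invented for illustration and are not taken from the attached tree; only the two node labels follow the figure.

```python
from collections import defaultdict


def build_graph(edges):
    """Turn (node_a, node_b, branch_length) triples into an undirected graph."""
    graph = defaultdict(dict)
    for a, b, length in edges:
        graph[a][b] = length
        graph[b][a] = length
    return graph


def tip_distances(graph, start):
    """Sum of branch lengths from `start` to every tip (node with one neighbor)."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in graph[node].items():
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                stack.append(neighbor)
    return {n: d for n, d in dist.items() if len(graph[n]) == 1 and n != start}


# Made-up unrooted tree: two long branches hang off "root1",
# two short branches hang off "root2".
edges = [
    ("root1", "KR_tip", 0.08),
    ("root1", "TT_tip", 0.09),
    ("root1", "root2", 0.03),
    ("root2", "US_1984", 0.02),
    ("root2", "FR_1990", 0.04),
]
graph = build_graph(edges)

from_root1 = tip_distances(graph, "root1")
from_root2 = tip_distances(graph, "root2")
for tip in sorted(from_root1):
    print(f"{tip}: from root1 = {from_root1[tip]:.2f}, from root2 = {from_root2[tip]:.2f}")
```

Every tip's distance shifts by the length of the branch between the two candidate roots, in one direction for tips on one side and in the other direction for tips on the other side, which is why the choice of root changes any root-to-tip comparison.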

The chat software here does not allow me to upload the treefile or alignment, but I can send it to anyone who wants it.


Another example.