Maximum Likelihood tree reconstruction pipeline


#1

I recently had need to reconstruct a reasonably large Ebola virus tree and I wanted a maximum likelihood tree with good branch length estimates and bootstrap values for all the nodes in the ML tree. We wrote this pipeline that bats back and forth between phyml (to get an initial NJ tree and optimise the branch lengths) and raxml (to find ML topologies and bootstrap).

I thought I would post it here for discussion and to see if anyone would do things differently. On a fixed topology phyml did a better job of optimising branch lengths but this may be because I couldn’t find the appropriate settings in raxml.

#phyml to find initial tree using BioNJ
phyml --quiet -i $1.phy -q -t e -a e -o lr -b 0
mv $1.phy_phyml_tree.txt initial_tree.newick

#raxml to find ml topology
raxml -f d -T 6 -j -s $1.fasta -n topology -m GTRGAMMA -t initial_tree.newick

#phyml to optimise branch lengths on ml topology
phyml --quiet -i $1.phy -q -t e -a e -o lr -u RAxML_result.topology -b 0
mv $1.phy_phyml_tree.txt $1.ml.tree

#use raxml to produce boostrap trees - the seed needs to be set here in $2
raxml -f a ­p $2 -s $1.fasta ­x 12345 -#200 -m GTRGAMMA -n bootstrap

#use raxml to project boostrap values on to ML tree
raxml -f b -t $1.ml.tree -z RAxML_bootstrap.bootstrap -m GTRGAMMA -n final
mv RAxML_result.final $1.ml_bootstrap.tree

#2

Perhaps you could be a little more specific what you saw in the phyml branch lengths that you preferred to the raxml branch lengths. Once the branch lengths were optimized by phyml, are they judged to have a higher likelihood by raxml than the branch lengths it inferred?

With that sort of information I bet that @Alexis_RAxML will be happy to advise.


#3

Hi @arambaut the settings look okay, could you be a bit more specific though about what you mean by better branch lengths? Did PHYML yield a better LnL?


#4

These are the likelihoods after each stage:

Phyml BioNJ				-43819.695026
Raxml topology search			-43686.158939
Phyml branch length optimization	-43673.144059

The main thing is that the branch lengths are very small because the sequences are long (19kb). This means they are arbitrarily interacting with a lower bound in the optimization algorithm. The branches that are affected are those with the maximum likelihood value of zero.

Here is a plot of the equivalent terminal branch lengths:

The difference in a branch length of 1E-6 and 1E-8 does actually matter here because there are some large polytomies that result in a ‘ladderised’ affect.

Possibly there is an option to set the lower bound on branch lengths? Interestingly Phyml is clearly going hard against its lower bound of 1E-8 but Raxml has a lower bound of 1.47E-6 which seems unlikely to be a selected constant so I suspect this is the result of some other stopping criterion?

raxml			phyml
0.000054351	0.000054140
0.000001470	0.000000010
0.000001470	0.000000010
0.000001470	0.000000010
0.000001470	0.000000010
0.000001470	0.000000010
0.000053708	0.000053520
0.000269037	0.000268020
0.000053768	0.000053570
0.000001470	0.000000010
0.000101194	0.000106900
0.000054020	0.000053820
0.000053750	0.000053550

I suspect that a general solution would be to introduce a constant scaling factor for the branch lengths between the optimization search algorithm and the likelihood evaluation. A suggestion would be to use the ratio of the sequence length to the number of unique site patterns.


#5

Hi @arambaut, yes now it makes sense, the lower bound in RAxML is indeed determined dynamically depending on other model params, but there is indeed a lower bound that is hit by RAxML, thanks for pointing this out, I’ll put this on the TODO list,

Alexis