How to infer topology and clade support with very low diversity alignments?


I was reading a paper recently that did a full-on BEAST analysis on what appeared to be an alignment with very few informative columns: independent relaxed clocks, skyline plots, etc. It raised my eyebrows.

But what is best practice with that sort of alignment? My sense is that because there is little data, parsimony would be best. The only paper I could think of, though, was Sanderson and Kim’s paper on “parametric phylogenetics.”

What about clade support? Non-parametric bootstrapping seems a bit strange because it equates the empirical distribution on columns with the distribution generated by evolutionary processes. I suppose parametric bootstrapping would be good, conditioning on the number of informative sites being the same as in the original alignment?


Thanks for the paper. Looks interesting.

For your informative site bootstrapping approach, say you noticed informative sites occurred only at some third position codon sites and nowhere else. In this case I’d be inclined to generate bootstrap samples from third position codon sites rather than from informative sites. This would introduce some sampling error into the number of informative sites per bootstrap replicate (my intuition is this would weaken clade support values).
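A minimal sketch of that resampling scheme, under one reading of it (resample the third-position columns with replacement, leaving the first and second positions untouched); the toy alignment and seed are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-taxon alignment (rows = taxa, columns = sites); purely illustrative.
alignment = np.array([list("ATGATGATGATG"),
                      list("ATGATGATGATA"),
                      list("ATGATAATGATG"),
                      list("ATGATGATAATG")])

# Third codon positions are columns 2, 5, 8, ... (0-based indexing).
third_pos = np.arange(2, alignment.shape[1], 3)

def bootstrap_replicate(aln):
    """Resample third-position columns with replacement; keep first and
    second positions fixed, so only the partition carrying the variation
    is resampled."""
    out = aln.copy()
    resampled = rng.choice(third_pos, size=third_pos.size, replace=True)
    out[:, third_pos] = aln[:, resampled]
    return out

rep = bootstrap_replicate(alignment)
```

Because only third-position columns enter the resampling pool, the number of informative sites per replicate now varies with the draw, which is the sampling error mentioned above.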

On a side note, I think a Bayesian approach could work well with due scrutiny given to the prior and model specification (e.g. model testing, prior sensitivity analysis). If the posterior resembles the prior for a broad class of models and priors, I’d take this as a red flag that there may not be enough data to resolve the tree.

There must be some low-data threshold at which there is no power to identify the tree. Since sequencing is so cheap, in many cases one might address the problem (not your question) by generating more data!


At first I thought that you might be trying to fashion a silk purse out of a sow’s ear, but now I’m wondering.

Could it be that having only a few informative sites (and, I assume, a large number of constant sites) could say more about, say, topology than many more ‘faster rate’ sites? Because the rate is supposedly low, backward mutations are highly unlikely, so any informative sites are likely to nail a split in the tree. It could be that the posterior distribution (perhaps on multifurcating trees?) is more spiked than in other circumstances.

A single, clear, fingerprint on the lead piping is more informative than a potential weapon covered with fingerprints…

Is there some way that this can be formalised?



I think we are on the same page, but I’m not sure, so just to make sure:

  • By informative sites in this setting I mean sites that are non-constant.
  • The idea would be to generate a parametric bootstrap sample of the same alignment length as before, but conditioned on having the same number of informative (non-constant) sites as in the original alignment.
  • I was thinking of doing a nucleotide tree (parsimony or IID model) which wouldn’t respect position.
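The conditioned parametric bootstrap in the second bullet can be sketched by rejection sampling: simulate, and keep only replicates with the target number of non-constant columns. The simulator below is a toy stand-in (a real analysis would simulate on the estimated tree with something like Seq-Gen); its column model and parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def n_informative(aln):
    """Count non-constant columns (the sense of 'informative' used here)."""
    return sum(len(set(col)) > 1 for col in aln.T)

def simulate_alignment(n_taxa, n_sites, p_var=0.05):
    """Toy stand-in for a real sequence simulator: each column is constant
    with probability 1 - p_var, otherwise drawn independently per taxon."""
    aln = np.empty((n_taxa, n_sites), dtype="U1")
    bases = np.array(list("ACGT"))
    for j in range(n_sites):
        if rng.random() < p_var:
            aln[:, j] = rng.choice(bases, size=n_taxa)
        else:
            aln[:, j] = rng.choice(bases)
    return aln

def conditioned_replicate(n_taxa, n_sites, target, max_tries=10_000):
    """Rejection sampling: simulate until the replicate has exactly
    `target` non-constant columns."""
    for _ in range(max_tries):
        aln = simulate_alignment(n_taxa, n_sites)
        if n_informative(aln) == target:
            return aln
    raise RuntimeError("no replicate matched the target within max_tries")

rep = conditioned_replicate(n_taxa=4, n_sites=200, target=10)
```

Rejection is cheap here because the target is near the expected count; conditioning on a very atypical number of informative sites would need a smarter scheme.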

WRT Bayesian methods, are there any phylogenetic methods that perform shrinkage towards topological multifurcations? That would seem useful in this regime.

Yes, but I think having methods to assess whether there is enough data would be useful. Of course we always need more data, but even if sequencing were free, sample collection and prep never will be.

@david_bryant seems like (ahem) a simulation study would be useful here.


Lewis, Holder, and Holsinger (2005) describe one way to assign priors to polytomies vs. bifurcations, using rjMCMC to move through the space of trees with varying numbers of polytomies. It looks like it’s implemented in Phycas if you’re curious to try it.


Great conversation. Coincidentally, Joseph Brown and I are doing just this right now (basically looking at methods for seeing if there is enough data to estimate a branch). Should have some code and a write-up soon. But I completely agree. Especially when looking at large phylogenies (thousands of tips), this is a necessary operation.


Very cool how thinking about these kinds of questions aligns with old-school parsimony papers that dealt with extremely small data sets (and hence the information content of, and compatibility amongst, expensive individual characters was of interest). With the sea of data currently available, the assumption seems to be that the signal will win out. Where that signal comes from, and where it is absent, seems to have fallen out of interest.

Of course, with these large empirical alignments we have the additional concern of data coverage, formalized by [Sanderson & Steel’s][1] concept of “decisiveness”. Does the distribution of data coverage allow the recovery of specific bipartitions (without some sort of extrapolation from the model)?

So, it seems to me this is two-pronged: 1) what bipartitions can possibly be reconstructed? and 2) of those (conflicting) bipartitions, which are supported by the data (and by which characters)? [1]:


@josephwb yes. Note that there are very important current applications with the best of today’s technology and 100% coverage. For example, take the recent case of tracking hospital infections using whole genome sequencing. Sequence the whole genome of the bugs and you get…

@david_bryant when are you going to tell us to use SNAPP?

Are there any electronic copies of the “Parametric bootstrapping in molecular phylogenetics: Applications and performance” paper out there? It was cited in @arambaut’s original Seq-Gen paper.


This may be completely naive, but I often find it useful to understand what should be straightforward things by running simple visual experiments. Here, I’m trying to understand the difference between a bootstrap interval and a proper posterior credible interval when the amount of data is small. (I’m sure this has been formally done by others…)

I’m using a simple Bernoulli model with a true underlying p of 0.5. In this case, I make 100 observations from this distribution and observe 47 successes. The point estimate of p is 0.47. I show both a bootstrap resampling (in black) and a Bayesian posterior (in red) assuming a uniform prior on p.

We see that the bootstrap distribution and the Bayesian posterior distribution are pretty much identical. However, if I reduce n to 4 (representing little data) and observe 2 successes, I get the following:

It looks like the bootstrap distribution is wider than the Bayesian posterior distribution.
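For what it’s worth, the spreads in this toy comparison can be computed directly; a sketch (the seed and replicate count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_sd(successes, n, reps=200_000):
    """Nonparametric bootstrap sd of the success proportion: resampling n
    0/1 observations with replacement is Binomial(n, p_hat) / n."""
    props = rng.binomial(n, successes / n, size=reps) / n
    return props.std()

def posterior_sd(successes, n):
    """Posterior under a uniform (Beta(1,1)) prior is
    Beta(successes + 1, n - successes + 1); return its exact sd."""
    a, b = successes + 1, n - successes + 1
    return np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Lots of data: n = 100, 47 successes -> the two spreads nearly coincide.
big_boot, big_post = bootstrap_sd(47, 100), posterior_sd(47, 100)

# Little data: n = 4, 2 successes -> the bootstrap spread is clearly wider.
small_boot, small_post = bootstrap_sd(2, 4), posterior_sd(2, 4)
```

With n = 4 the bootstrap sd is 0.25 while the Beta(3, 3) posterior sd is about 0.19, matching the picture described above.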

I swear I’ve read that Bayesian clade credibility intervals overestimate clade support (I’m assuming in relation to bootstrap clade confidence intervals). Is this possibly the same basic effect as I’m showing in the simple Bernoulli example? Although in this example, it looks like the bootstrap with little data is giving overly wide confidence intervals.


I don’t necessarily think that this kind of analysis is a problem, as long as (as others have said) the authors make a serious attempt to figure out what’s coming from the data and what’s coming from the priors. Just running the analysis without the data (i.e. all sites changed to gaps) would let you sample from the real priors. Nicolas Lartillot has some interesting posts on his blog that are relevant (start here and work forwards), which describe why massive overparameterisation is often not an issue for Bayesian stats.

There are some related issues with clade support in Bayesian analyses that may be relevant here (e.g. the star-tree paradox), which Ziheng Yang (and others) have worked on a fair bit:

I guess best practice depends on the question. And at first glance I agree that very few informative columns is not a great start…


Thanks for your thoughtful response.

With respect to @nicolas_lartill’s blog post, I don’t think that even he would say that “massive overparameterisation is often not an issue for Bayesian stats.” The point of that post was that if we set things up right with regularization, problems with over-parameterization can be avoided. Here, as @mlandis points out, the relevant regularization is a way to contract bifurcations to multifurcations, and that requires a complex rjMCMC not implemented in BEAST.

With respect to the star tree paradox (high clade support even when there is no clade structure whatsoever), my impression was that it is a real problem that does not go away even in the case of lots of data, which is a better situation than the one I describe. This further motivates a way to return multifurcations as an end result.

I guess that to me, running an rjMCMC when you have an alignment like I describe above just seems like overkill.


Yes, sorry - lazy hyperbole on my part to say ‘massive overparameterisation is OK’.


Hi All,

(and thanks for the invitation!)

I think that, in principle, both maximum likelihood and Bayesian analysis on small, low-diversity alignments (few non-constant columns) will not behave too badly, in the sense that you should not get strong support for wrong clades in this regime.

But then, most often, I guess you will just not get strong support for any clade. So, at the end of the analysis, you are most often supposed to conclude that the alignment was indeed too small to be of much use.

However, I can imagine that, in many cases, people will instead try to compensate for the small sample size and start to interpret patterns that have insufficiently strong support: they will take clades that have pp > 0.7 seriously, for instance. Yet those clades are probably highly unstable: they may just be statistical fluctuations.

So, perhaps, the question should not be: which method is suited to analyzing very small alignments? The question should perhaps more fundamentally be: what are the limits, in terms of minimum sample size and minimum statistical support, below which we can consider that we are just in a fundamentally unreliable regime, whichever method we use?


I’d be interested to hear people’s opinions here, but for what it’s worth, I think we can quite easily figure out the inferential limits on any given dataset, as long as we can frame alternative hypotheses in terms of the topologies they generate.

Let’s imagine that we have two hypotheses that generate different sets of trees of size N and M respectively. In a Bayesian framework we could just analyse the posterior distribution of trees and see how frequently N and M appear in the prior (i.e. no data) versus the posterior (i.e. with data) and use that to estimate a Bayes factor. There could be issues if N or M is small, but these are not intractable, e.g. we could only sample from sets N and M in the prior (simple if N and M are easy to specify as constraints, a bit harder otherwise).
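Under one possible reading of this (treating N and M as the two sets of trees, with hits counted among sampled topologies), the Bayes factor estimate is just a ratio of odds; a sketch with made-up sample counts:

```python
def bayes_factor(post_in_N, post_in_M, prior_in_N, prior_in_M):
    """Bayes factor for hypothesis N over M, estimated from how often
    sampled trees fall in each set: (posterior odds) / (prior odds)."""
    return (post_in_N / post_in_M) / (prior_in_N / prior_in_M)

# Hypothetical counts: of 1000 posterior and 1000 prior tree samples,
# how many landed in each hypothesis's tree set.
bf = bayes_factor(post_in_N=800, post_in_M=200,
                  prior_in_N=500, prior_in_M=500)
```

As noted, when either set is rarely hit in the prior, these frequency estimates get unstable, which is where sampling under constraints would come in.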

Similarly, in an ML framework we could e.g. estimate the ML tree from set N and set M, and then use e.g. SOWH tests (i.e. parametric bootstraps) to simulate data and ask whether the overall ML tree (i.e. best of N and M) is sufficiently good to reject the alternative hypothesis.

My suspicion is that the Bayesian approach (or some modification) would be best when information content is low, because it seems odd to consider only a single ML tree (as in the SOWH test) if there’s very little information on topology.


yes, I agree that the Bayesian method should make a more efficient use of the available information by averaging over “nuisance parameters” (here averaging over trees). On the other hand, it does that by integrating over the prior, which is itself not always under full control. If this prior has nothing to do with the generating process, then, for small sample size, there will be some distortion on parameter estimation or on hypothesis testing.

For example, I remember some analyses by Kolaczkowski and Thornton (2007, Mol Biol Evol 24:2108), suggesting that, for very short alignments, you can easily get a relatively strong support for clades that don’t exist. But this should probably be revisited. In particular, if I remember correctly, they analyzed mostly 4-taxon trees. It would be interesting to see what happens with more taxa, and under various models and priors.

In any case, with very small sample size, everything becomes more touchy.

But we are supposed to be in a “big data” era anyway, so why not instead try to convince ourselves and other people that “single gene” analyses will always be too inconclusive to be worth the effort?

Or are there still situations today where we really have to accept working with small datasets?


@rob_lanfear I’m afraid I can’t follow your post. [quote=“rob_lanfear, post:14, topic:32”] Let’s imagine that we have two hypotheses that generate different sets of trees of size N and M respectively. [/quote] OK, so N and M are real numbers.

Wait, are N and M trees or tree sets?

Are they sequence data now? Sorry I can’t follow this.


WRT the star tree paradox, may I courteously suggest that the rigorous mathematical analysis by Susko and the further simulations in Yang’s 2007 papers are more informative than the K&T paper. Yang’s solution of employing a “data dependent prior” seems a bit strange to me, but this figure was worth a lot of words:

That sure looks like a problem, doesn’t it? I don’t remember any papers saying that this goes away when such a 4-taxon tree is embedded in a larger tree (though memory is not my strong point).

For a solution, if it must be Bayesian, I don’t see anything other than the rjMCMC.


We will never escape some “small data” problems: see my post above to see an example with few informative sites from whole-genome sequencing.


OK, your older post is indeed convincing: there are cases where you will never have more than a few informative sites.

If we need to formalize and get a bit more quantitative here, for me, it would be this:

(1) What proportions of true/false discoveries do we get for a given number of informative sites, for a given problem and under a particular inference method (and using a given threshold)? Ideally, one would like methods to advertise a reasonably reliable estimate of the risk that they make such false discoveries. If they can do this even in a low-diversity regime, then users are left with sufficient information to make up their own minds.

(2) Then, once the rate of false discovery is at least approximately known, the second question is: for a given false discovery rate, imposed as a prerequisite, which method gives you the most informative answer?

In general, we care about such good calibration properties mostly in the strong-support (low false discovery rate) regime: in many practical situations, we don’t really care if a nominal rate of 20% false discoveries in fact corresponds to 50%, because 20% is already too inconclusive anyway. But we do care about whether an advertised 1% really means 1% false discovery (or at least no more than a few percent), because these strongly supported claims will be the ones relied upon for downstream publication or decision making.

In any case, I think that the false/true discovery rate is the crucial property we care about. And note that this is not just a small-sample-size question, but it can become particularly critical with low-diversity alignments. In fact, we can discuss this further, if you want, in a separate channel.

Concerning polytomies, rjMCMC and the star tree paradox: I think this is a different question. Priors that deal with polytomies are important when polytomies exist in the true tree in the first place. But here, it is not that we believe that the tree is intrinsically polytomous. Instead, it is just that we do not have enough information for estimating a tree which may just be boringly bifurcating (of course, it may also be polytomous, but then it is another problem, fundamentally one of model violation, not a sample-size issue).


Perhaps not totally explicit in my last post is what I mean by false discoveries in the present context.

If you take all clades with pp support > 0.95, what fraction of them are in fact incorrect? This would be your true rate of false discovery (here, given that the nominal rate is 5%). And I would personally like this to be on the order of 5 percent (up to 10% would still be acceptable in practice, IMO). A more stringent limit is pp > 0.99, for which I would ideally expect 1%, and at most a few percent, of false discoveries.
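A toy version of that calibration check, assuming simulations where each clade’s correctness is known against the true tree (the data structure and numbers are made up for illustration):

```python
def observed_fdr(clades, threshold=0.95):
    """clades: list of (posterior_prob, is_correct) pairs, e.g. pooled
    across simulations with a known true tree. Returns the fraction of
    incorrect clades among those whose support exceeds the threshold."""
    accepted = [ok for pp, ok in clades if pp > threshold]
    if not accepted:
        return 0.0  # nothing exceeded the threshold, so no discoveries
    return 1 - sum(accepted) / len(accepted)

# Hypothetical pooled results: (pp, is the clade in the true tree?)
clades = [(0.99, True), (0.97, True), (0.96, False), (0.80, False)]
fdr = observed_fdr(clades)  # 1 of the 3 accepted clades is wrong
```

Good calibration would mean this observed fraction stays near 1 minus the threshold, even for low-diversity alignments.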

If this holds also for small alignments with current methods, then I would personally consider that it’s not that bad.

Once this is guaranteed, then the next question is: are there ways to increase sensitivity (more clades come up with support >0.99) without increasing the rate of false discovery? But this should be considered as a subsidiary question.

In any case, SOWH, for instance, does not control for false discovery, but for type I error. And I do not think that this is really useful (see current discussions about p-values).


Well, neat. It appears that we have wandered into unknown territory: the upper-end calibration of low-diversity alignments. I’d love to see some simulation results.

Imagine a scenario (modeled after the above paper on Klebsiella transmissions in a hospital) where perhaps because of the transfer of medical devices there were four transmissions of a microbe to a new host within minutes of each other from a clonal population. These populations then go on and evolve within the hosts and are transferred to others. How would you simulate sequences under such a scenario? I would make a polytomy for the four simultaneous transmissions.


Oh yes, sure: if there are good reasons to believe that there are polytomies, then one should certainly put positive prior probability on them, and rjMCMC as in Lewis, Holder and Holsinger 2005 is a very neat way to do this. I totally agree on that. In fact, I even tend to think that implementing LHH2005 priors on tree topologies should be more seriously considered in current programs (in particular because we are starting to confront phylogenies in which there is positive evidence for radiation-like patterns).

I guess I was just saying that there is no reason to be more concerned about this issue of polytomies just because the number of informative positions is smaller: for me, these are just two independent problems.

Yes, I think that some validation/simulation results would be nice here (and more generally).