This is an archived static version of the original discussion site.

Let’s quantify the impact of missing data


Whenever I see an acrimonious debate about something where the evidence offered by either side consists of a collection of data sets along with simulations, it makes me wonder: where are the theoreticians?

I think that there is such a debate concerning the impact of missing data in phylogenetics (Wiens 2003; Lemmon et al. 2009; Wiens and Morrill 2011; Simmons 2012; Roure et al. 2012). With pplacer, I have noticed that masking non-informative columns can have surprising effects on the relative likelihoods in cases where the data are weak.

I think that the overall effect in the case of standard phylogenetic analysis is probably weak when the gaps are uniformly distributed, but when they are not, I don’t think the effect is negligible. And because of primer bias, there is an interesting joint distribution of amplification probability and sequence identity for cases like RAD-seq.

It’s not uncommon to see people running trees on alignments that have a very high proportion of gaps. Sanderson, McMahon, and @mathmomike did some interesting related work in their phylogenetic terraces paper, but for me this doesn’t quite do what I would like. I would just like to know what the contribution to the phylogenetic likelihood is of adding a column with various patterns of gaps, given various phylogenetic trees.

In principle I think I have all of the skills to do this on my own, but it would be more fun to have others involved, especially someone who would be willing to do some computer work. Anyone want to play? We could do the work as an open Massively Multiplayer Online Research Project, with phylobabblers kibitzing from the sidelines.

Or is this a bad idea? Much ado about nothing?


@ematsen I enjoy playing. What computer work do you have in mind?


Just calculating likelihoods of single-column and multiple-column alignments on various trees with various models to consider their contribution to an overall likelihood. We’ve had good success using Bio++ for things like this in the past.
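To make the target concrete, here is a minimal sketch of the per-column likelihood computation: Felsenstein pruning under Jukes–Cantor, with a gap at a tip treated as missing data (a partial-likelihood vector of all ones). This is plain Python rather than Bio++, and the tree encoding and function names are my own for illustration, not anyone's actual pipeline.

```python
import math

STATES = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor transition probability between states i and j
    along a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def tip_vector(char):
    """Partial likelihoods at a tip.  A gap ('-') is treated as
    missing data: every state is equally compatible."""
    if char == "-":
        return [1.0] * 4
    return [1.0 if s == char else 0.0 for s in STATES]

def column_loglike(tree, column):
    """Felsenstein pruning for a single alignment column.
    A leaf is (taxon_name, branch_length); an internal node is
    ((child, child), branch_length).  `column` maps taxon -> char."""
    def partial(node):
        children, _ = node
        if isinstance(children, str):            # leaf
            return tip_vector(column[children])
        result = [1.0] * 4
        for child in children:
            cvec = partial(child)
            t = child[1]
            for i in range(4):
                result[i] *= sum(jc_prob(i, j, t) * cvec[j]
                                 for j in range(4))
        return result
    root = partial(tree)
    return math.log(sum(0.25 * p for p in root))  # uniform base freqs
```

One immediate consequence of this treatment: an all-gap column has likelihood exactly 1 (log-likelihood 0), so it contributes nothing at all; the interesting cases are the partially gapped columns this thread is about.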


Hi Erick,

We could maybe use the phylogenetic likelihood library we are developing for this; it might be fairly easy to implement.

We are also playing around with a very simple method for predicting the missing sequences and trying to quantify its accuracy right now.



@Alexis_RAxML sure, that would be great. Should I go ahead and start an (open) GitHub repository?

But first, I was hoping to get more negative feedback from everyone. Don’t be shy! Is this a bad idea? I will send anyone a box of girl scout cookies, or a smoked salmon, or whatever such thing you want if you convince me that it would not be a useful project.


Regarding the repository: let me first see whether I have the resources (in terms of man-hours) to do this.

If you want some rant: the key problem might be that we need to somehow sample appropriate trees for this in an intelligent way. It might be difficult to quantify the effect of missing data and the respective impact on the tree topology simultaneously.


The idea would be to assume that other parts of the alignment have determined the primary structure of the tree, and so we fix the tree except for several taxa. We would then try different patterns of bases and gaps and see how those determine the positioning of those few taxa. Seem reasonable?


You seem to be discussing columns in an alignment where a large percentage of taxa have insertions or deletions. Another type of missing data arises when researchers build trees with long sequences such as complete mitochondrial genomes, but include many species for which they have only one or two mitochondrial genes, with the rest of the sequence filled in with ----, NNNN, or ??? characters. This type of missing data, in rows, can have impacts on the results as well. I have many examples from the published literature.


I think that Erick wants to address both issues. Erick, I am still a little bit skeptical as to what the criterion would be to select the taxa that need to be re-positioned. One may just choose those that have missing data above some threshold, but then the absence of data might also have effects on the underlying tree structure that you want to keep fixed.


A recent paper on this subject:

Personally, I’ve experienced the trouble of having some taxa represented by only a few markers while others are represented by complete mitochondrial genomes. Misplacement and node instability are among the main consequences I’ve observed so far.




It looks like the figure captions for Figures 1 & 2 are mixed up for that paper. I’m contacting Simmons to get it fixed.


I just drafted a manuscript on missing data with the title and abstract below, and would be happy to send you the manuscript for comments and suggestions:

Title: Phylogenetic bias in the likelihood method caused by missing data coupled with among-site rate variation: an analytical approach

ABSTRACT More and more researchers in phylogenetics are concatenating gene sequences to produce supermatrices in the hope that larger data sets will lead to better phylogenetic resolution. Almost all of these supermatrices contain a high proportion of missing data, which could potentially cause phylogenetic bias. Previous studies aiming to identify the missing-data-mediated bias in the maximum likelihood method have noted a bias associated with among-site rate variation. However, this finding is based on sequence simulations and has been challenged by other simulation studies, with the controversy still unresolved. Here I illustrate analytically this bias caused by missing data coupled with among-site rate variation. This approach allows one to see how much the bias can contribute to likelihood differences among different topologies. The study highlights the point that, while supermatrices may lead to “robust” trees, such “robust” trees may be purchased with illegal phylogenetic currency.


This is so fantastic! I’m glad to know that you are thinking about this. Rather than sending around a preprint, it’s safer to post your paper to a preprint server, because you establish ownership of your ideas. That way more people can see it too.

I’m looking forward to reading your work.


   In this data set, I again [find a lot of missing data][1].  The DNA alignment is 32,767 columns long, but after stripping sites where one or more sequences is represented by a gap, there are only 885 columns remaining.

In addition to this, I find that many sites are invariant, so there is no phylogenetic signal in them, and there are a few “odd” places in the alignment where I am not sure whether there was sequencing error, replacement of a small region of a gene such that the region is not truly “homologous”, or some other type of data error. I could attach a JPG image of a small region of the alignment that is rather typical of the whole data set and illustrates these points, but as a new user here I am not yet allowed to upload.

With such a huge data set, I cannot fault anyone (or any machine) for making a few errors, missing a few misaligned sites, or whatever. But my main interest in all of this is: what is the relative importance of having a very good data set, compared to doing the very best analysis of a “run of the mill” data set? The whole focus of this publication was the claim that using a very sophisticated model of evolution, and phylogenetic reconstruction based on that model, would produce the “right” tree. But they have apparently taken only protein-coding regions, for which models of the selection pressure are not well developed, and they are not treating first, second, and third codon positions separately (unless I am missing something in my understanding of their methods). I am not suggesting that the tree they created is “wrong” or “bad”, but I am questioning whether it is “the best we can do” with this type of genomic data.
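For what it’s worth, the site-stripping criterion described above (drop any column in which at least one sequence has a gap) is easy to state as code. Here is a small sketch in plain Python; the function name and the exact set of characters counted as missing are my own choices, not anything from the paper in question.

```python
def gap_free_columns(seqs):
    """Return the indices of alignment columns in which no sequence
    has a gap or ambiguity character.  `seqs` maps taxon names to
    aligned sequences of equal length."""
    missing = set("-?NnXx")            # characters treated as missing
    length = len(next(iter(seqs.values())))
    return [i for i in range(length)
            if all(s[i] not in missing for s in seqs.values())]
```

Relaxing the criterion (e.g. allowing columns where at most some fraction of taxa are gapped) gives the usual family of masking strategies, which is exactly where the debate about what masking does to the likelihood begins.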


Maybe take a dataset where all genes are available for all taxa and the tree is reasonably resolved. The testing could be done on subsets of this dataset.


@Alexis_RAxML The in/out taxa are chosen ahead of time. The idea is that we have lots of data for n taxa, and we have mostly missing data for the (n+1)st taxon X. We assume that the columns that are informative for X do not have enough information to change the topology for the first n taxa. Thus we leave the tree on the first n taxa fixed, and we want to see how the pattern of missing data determines X’s position.

I realize of course that this is a bizarre set-up, but it is one that distills the effects of the missing data in a way that I think would be comprehensible, and perhaps even amenable to a mathematical approach.
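Here is a toy version of that set-up, again as a hedged sketch in plain Python (Jukes–Cantor, all branch lengths equal; the names and the brute-force approach are mine, not a real pplacer/EPA workflow): three backbone taxa A, B, C plus the extra taxon X, and two candidate attachment points for X, compared column by column. Summing over the two internal-node states by brute force keeps everything transparent.

```python
import math

STATES = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor transition probability along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def tip_term(char, parent_state, t):
    """Contribution of one tip given its parent's state.
    A gap marginalizes to 1, since sum_j P(parent, j, t) = 1."""
    if char == "-":
        return 1.0
    return jc_prob(parent_state, STATES.index(char), t)

def quartet_loglike(pairs, column, t=0.1):
    """Log-likelihood of one column on an unrooted quartet.
    pairs = ((a, b), (c, d)): tips a,b join internal node u,
    tips c,d join internal node v, and u-v is the central edge.
    All five branches have length t; brute-force sum over u and v."""
    (a, b), (c, d) = pairs
    total = 0.0
    for u in range(4):
        for v in range(4):
            lk = 0.25 * jc_prob(u, v, t)         # uniform prior on u
            lk *= tip_term(column[a], u, t) * tip_term(column[b], u, t)
            lk *= tip_term(column[c], v, t) * tip_term(column[d], v, t)
            total += lk
    return math.log(total)

# Two candidate placements of X on the fixed (A, B, C) backbone:
near_A = (("A", "X"), ("B", "C"))   # X attaches next to A
near_C = (("A", "B"), ("X", "C"))   # X attaches next to C
```

For an informative column such as {A: 'A', X: 'A', B: 'C', C: 'C'} the near-A placement wins; replace X’s base with '-' and the log-likelihood gap between the two placements shrinks (though it need not vanish, since the backbone taxa still discriminate). Tabulating that difference over gap patterns is exactly the kind of thing we would want to do.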


I am not sure if the missing data link worked. Here it is again:


Hi Erick, that makes sense, so essentially you’d do some sort of leave-one-out test using pplacer or the EPA and then try to quantify how the taxon X moves around the tree or how the distribution of likelihood weights changes as a function of the fraction of missing data.



I think that we are talking about the same thing. I wouldn’t call it a leave-one-out analysis, because we fix our original n sequences making up the background tree, and then the sequence for X can be whatever we like it to be.