Distance-based methods and incomplete sequence matrix


There is a problem (at least for me).

I am analysing an incomplete sequence matrix, in which the sequences for some genes have not been obtained for all the samples in the analysis. For example, I’ve got the Co-1 gene but failed to obtain the CytB gene for a given sample. As a result, some pairs of sequences in my matrix have no nucleotide positions in common at all.

So my questions are:

Are there any algorithms to calculate the distance between two sequences that have no positions in common, and to reconstruct a distance-based phylogeny using (for instance) the available information about the distances between the other sequences in the matrix?

If so, is it possible to evaluate the branch support for a given phylogeny?

Thank you!


This paper might be useful: “On the extension of a partial metric to a tree metric”, Discrete Math. 276 (2004), 229–248, by Guénoche, Leclerc and Makarenkov. The idea is that if the distances do fit a tree, but you only know the distances between some pairs of leaves (taxa), then you can apply an iterative rule to fill in some of the missing entries. If you are lucky, this fills in the entire table; there are known sufficient conditions for when that will be the case, e.g. if the known entries form a ‘shellable lasso’ for a tree (as defined in a recent paper by Dress, Huber and me). The fill-in relies only on the 4-point condition. In general, though, there’s no guarantee that such an iterative rule will fill in the whole table, and in any case it assumes that the known distances do fit some tree.
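To make the fill-in idea concrete, here is a minimal Python sketch of one simple rule that follows from the 4-point condition. It is an illustration under stated assumptions, not the algorithm from the paper: the function name `fill_missing_distances`, the use of NaN for unknown entries and the tolerance `tol` are all made up for this example. For four taxa x, y, u, v, the 4-point condition says that the two largest of the sums d(x,y)+d(u,v), d(x,u)+d(y,v) and d(x,v)+d(y,u) are equal. So if d(x,y) is the only unknown and the other two sums differ, d(x,y) must equal the larger sum minus d(u,v). If the other two sums are equal, this quartet only gives an upper bound and the entry stays empty.

```python
import itertools
import math


def fill_missing_distances(d, tol=1e-9):
    """Fill NaN entries of a symmetric distance matrix (list of lists).

    Hypothetical helper for illustration. Assumes the known entries
    fit some tree exactly; it does no consistency checking.
    """
    n = len(d)
    d = [row[:] for row in d]  # work on a copy
    changed = True
    while changed:  # repeat, since a filled entry may unlock others
        changed = False
        for x, y in itertools.combinations(range(n), 2):
            if not math.isnan(d[x][y]):
                continue
            for u, v in itertools.combinations(range(n), 2):
                if len({x, y, u, v}) < 4:
                    continue
                known = [d[u][v], d[x][u], d[y][v], d[x][v], d[y][u]]
                if any(math.isnan(k) for k in known):
                    continue
                s2 = d[x][u] + d[y][v]
                s3 = d[x][v] + d[y][u]
                # Only when s2 != s3 does the 4-point condition force
                # d(x,y) + d(u,v) to equal the larger of the two sums.
                if abs(s2 - s3) > tol:
                    d[x][y] = d[y][x] = max(s2, s3) - d[u][v]
                    changed = True
                    break
    return d


# Toy tree ab|cd: pendant edges a=1, b=2, c=3, d=4, internal edge 5.
nan = math.nan
partial = [
    [0.0, 3.0, nan, 10.0],   # d(a,c) is missing (true value 9)
    [3.0, 0.0, 10.0, 11.0],
    [nan, 10.0, 0.0, 7.0],
    [10.0, 11.0, 7.0, 0.0],
]
filled = fill_missing_distances(partial)
print(filled[0][2])
```

In this toy example the missing d(a,c) is recovered from the quartet. If instead d(a,b) were the missing entry (the two taxa forming a cherry), the other two sums would be equal and the entry would stay NaN, which is one way of seeing why the rule is not guaranteed to complete the table. With real, noisy distances the known entries will not fit a tree exactly, so a rule like this can only be a rough guide.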


Thank you so much for taking the time! This is very helpful information.


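Andrei Popescu has implemented some methods for building trees from incomplete distance matrices in R as part of the ape package (see, for example, the njs and bionjs functions).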