Reconstructing identity of amino acids that are 'missing data' in extant sequences


Hi, I’m wondering whether any of the ancestral sequence reconstruction software packages can reconstruct states in extant taxa that are coded as ambiguous or missing data. For example, if a stretch of an extant amino acid sequence were missing, is there software that will estimate the states for that missing portion using joint and/or marginal ancestral sequence reconstruction methods?

Sincerely, Andrew Roger


Hi Andrew,

We are working on such software. So far we have used it exclusively with nucleotide sequences, but I just added an amino acid/profile map. No amino acid model is implemented yet (other than the trivial one), but if you have enough data, it will infer a model for you. The project is here:

you can use the script “”, for example like this

python ancestral_inference.py --aln my_alignment.fasta --tree mytree.newick --marginal --prot

It will fill in every X in the alignment (every N for nucleotides) with the most likely state. In fact, any character not in the alphabet is treated as missing data in the same way.
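
To make the idea concrete for a single alignment column, here is a minimal sketch of a marginal reconstruction at a leaf with missing data. It is not the project's code: it assumes a star tree (every leaf attached directly to the root), an equal-rates substitution model over the 20 amino acids, and branch lengths in expected substitutions per site. All function names are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acids


def transition(t, n=len(ALPHABET)):
    """Equal-rates model (assumed for illustration): probability of
    ending in the same / a specific different state after time t."""
    e = math.exp(-n * t / (n - 1))
    same = 1.0 / n + (1.0 - 1.0 / n) * e
    diff = 1.0 / n - e / n
    return same, diff


def leaf_profile(char):
    """Indicator vector for a known state; any character outside the
    alphabet (e.g. 'X' or '-') is treated as missing: all ones."""
    if char in ALPHABET:
        return [1.0 if s == char else 0.0 for s in ALPHABET]
    return [1.0] * len(ALPHABET)


def marginal_at_leaf(column, branch_lengths, target):
    """Posterior over states at leaf `target` for one alignment column.

    column:         dict mapping leaf name -> character
    branch_lengths: dict mapping leaf name -> branch length to the root
    """
    n = len(ALPHABET)
    # Root message: uniform prior times the likelihood of every other leaf.
    root = [1.0 / n] * n
    for name, char in column.items():
        if name == target:
            continue
        same, diff = transition(branch_lengths[name])
        prof = leaf_profile(char)
        for r in range(n):
            root[r] *= sum((same if r == s else diff) * prof[s]
                           for s in range(n))
    # Propagate the root message down the branch leading to the target leaf.
    same, diff = transition(branch_lengths[target])
    post = [sum(root[r] * (same if r == s else diff) for r in range(n))
            for s in range(n)]
    total = sum(post)
    return {ALPHABET[s]: post[s] / total for s in range(n)}


column = {"taxon1": "X", "taxon2": "K", "taxon3": "K"}
lengths = {"taxon1": 0.1, "taxon2": 0.1, "taxon3": 0.1}
posterior = marginal_at_leaf(column, lengths, "taxon1")
best_state = max(posterior, key=posterior.get)  # state used to fill in the X
```

On a real tree the same messages are passed along every branch rather than through a single root, but the treatment of the missing leaf is the same: its profile carries no information, so its posterior comes entirely from the rest of the tree.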

Best,
Richard

PS: The joint ML reconstruction is not quite correct. This is fixed in another branch but hasn’t hit master yet.

PPS: Let me know if you have questions or comments.