Reconstructing identity of amino acids that are 'missing data' in extant sequences


#1

Hi, I’m wondering whether any of the ancestral sequence reconstruction software packages allow the reconstruction of states in extant taxa that are coded as ambiguous/missing data. For example, if I were missing a stretch of an extant amino acid sequence, is there software that will estimate the states for that missing portion using joint and/or marginal ancestral sequence reconstruction methods?

Sincerely, Andrew Roger


#2

Hi Andrew,

We are working on such software. So far we have used nucleotide sequences exclusively, but I just added an amino acid/profile map. No amino acid model is implemented yet (other than the trivial one), but if you have enough data, it will infer a model for you. The project is here: https://github.com/neherlab/treetime

You can use the script “ancestral_inference.py”, for example like this:

python ancestral_inference.py --aln my_alignment.fasta --tree mytree.newick --marginal --prot

It will fill in every X in the alignment with the most likely state (every N for nucleotides). In fact, any character not in the alphabet is treated as missing data and filled in the same way.
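To illustrate what filling in a missing character by marginal reconstruction means, here is a minimal sketch in plain Python. It is not TreeTime's code or API: the function names, the star-shaped tree (all leaves attached directly to one root), and the equal-rates substitution model are assumptions made purely for illustration. Any character outside the alphabet gets a flat profile, so it carries no information of its own, and its state is estimated from the other leaves.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def profile(char):
    """One-hot likelihood vector for a known residue, all ones otherwise."""
    if char in ALPHABET:
        return [1.0 if a == char else 0.0 for a in ALPHABET]
    return [1.0] * len(ALPHABET)  # X, '-', '?', ... are treated as missing


def transition(t):
    """Equal-rates substitution probabilities P[i][j] for branch length t
    (an assumed stand-in for a real amino acid model)."""
    n = len(ALPHABET)
    decay = math.exp(-n * t / (n - 1))
    same = 1.0 / n + (1.0 - 1.0 / n) * decay
    diff = 1.0 / n - decay / n
    return [[same if i == j else diff for j in range(n)] for i in range(n)]


def fill_missing_leaf(observed, target_branch):
    """Marginal posterior for one site of a leaf with missing data.

    observed: list of (character, branch_length) for the other leaves.
    target_branch: branch length leading to the leaf with missing data.
    Returns (most likely state, its posterior probability).
    """
    n = len(ALPHABET)
    # Root weights: uniform prior times the message from each observed leaf.
    root = [1.0 / n] * n
    for char, t in observed:
        P = transition(t)
        leaf = profile(char)
        for r in range(n):
            root[r] *= sum(P[r][s] * leaf[s] for s in range(n))
    # Propagate the root weights down the branch to the missing leaf.
    P = transition(target_branch)
    post = [sum(root[r] * P[r][s] for r in range(n)) for s in range(n)]
    total = sum(post)
    post = [p / total for p in post]
    best = max(range(n), key=lambda s: post[s])
    return ALPHABET[best], post[best]


# One alignment column: two relatives carry L, the third sequence has X.
state, prob = fill_missing_leaf([("L", 0.1), ("L", 0.1)], target_branch=0.1)
print(state, round(prob, 3))
```

On a real tree the same idea applies, but the messages are passed along every branch (up to the root and back down) rather than through a single root node.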

Best,
Richard

PS: The joint ML reconstruction is not quite correct. This is fixed in another branch but hasn’t hit master yet.

PPS: Let me know if you have questions/comments.