Calculating distances with ambiguity codes (R, Y, N, etc.) in the data


#1

Are there any distance calculators, such as DNAdist in PHYLIP, that can be set to treat ambiguity codes as partial matches? For example, I want an R to be counted as half a match to A or G. I believe that PHYLIP DNAdist counts R as a full match to either A or G.

For diploid organisms, an “R” usually indicates that one allele had A and the other G. But for populations such as a swarm of HIV-1 in a single patient, an R usually means that part of the population had A and the other part G.
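To make the "half a match" idea concrete, here is a minimal Python sketch of a p-distance in which each ambiguity code is treated as an equal mixture of the bases it stands for. It is not taken from PHYLIP or any other package; the names `match_fraction` and `partial_p_distance` are hypothetical, and the equal-weight averaging is an assumption.

```python
# IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def match_fraction(x, y):
    """Probability that x and y resolve to the same base.

    Assumption: each code is an equal mixture of its possible bases,
    and the two sites are resolved independently.
    """
    bx, by = IUPAC[x.upper()], IUPAC[y.upper()]
    shared = len(set(bx) & set(by))
    return shared / (len(bx) * len(by))


def partial_p_distance(seq1, seq2, skip="-?"):
    """Proportion of differing sites, with fractional credit for ambiguities.

    Sites where either sequence has a character in `skip` are ignored.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    mismatch = 0.0
    sites = 0
    for x, y in zip(seq1, seq2):
        if x in skip or y in skip:
            continue
        mismatch += 1.0 - match_fraction(x, y)
        sites += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return mismatch / sites


# R against A is half a match, so one such site out of four
# contributes 0.5 / 4 to the distance.
print(partial_p_distance("ACGT", "RCGT"))
```

Under this averaging rule R against R also scores 0.5, not 1. Whether two identical ambiguity codes should count as a full match is a separate design choice.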


#2

There was a discussion of this topic on a mailing list a couple of years ago that offered some possible options:

http://grokbase.com/t/r/r-sig-phylo/13517bj6z4/dist-dna-inconsistent-behavior-with-ambiguous-sequences

– Chris


#3

There’s also this:

I have some Julia code to calculate distances with ambiguities too, but I haven’t committed the code to the BioJulia repository yet.

Best,

Simon


#4

Hi Brian,

As @sdwfrost suggests, take a look at our TN93 calculator (https://github.com/veg/tn93). It has fairly comprehensive ambiguity handling, including partial matching (like what you want), partial matching subject to constraints (e.g. two-fold ambiguities only), and corresponding ambiguity-to-ambiguity matching. The code is easy to modify to compute any other nucleotide distance for which you have a closed-form expression.
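As a rough illustration of what "closed-form expression" means here, the Jukes-Cantor (JC69) distance can be computed directly from the proportion of differing sites p as d = -(3/4) ln(1 - 4p/3). The Python sketch below is not tn93 code, and the function name `jc69_distance` is hypothetical. Passing in a p that includes fractional mismatches from ambiguity codes is an assumption, not something the formula itself prescribes.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor (1969) distance from a proportion of differing sites p.

    p may be fractional if ambiguity codes were scored as partial matches
    (an assumption of this sketch).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# A p of 0.125 could come from one half-mismatch in four sites.
print(jc69_distance(0.125))
```

The corrected distance is always at least as large as p, since it accounts for multiple substitutions at the same site.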

Best, Sergei