Reading codon alphabets when you have ambiguity codes

Hi everyone,

I was wondering if people have any helpful hints on reading in codon alignments while handling ambiguity codes. I need to rewrite how I do this, because my current approach takes a ton of time and memory when the number of ambiguous patterns (e.g. NGC, or RGC) gets large. As a result, I currently only allow Y, R, W, S, and N in codons, and disallow K, M, B, D, H, and V. Any thoughts?

How are you encoding this now? For a single nucleotide, you can encode each site as an integer from 0 to 15 using four bits. For a codon, the analogous encoding is one bit per unambiguous codon (4^3 = 64 of them), i.e. integers from 0 to 2^64 − 1 using 64 bits, which might be in the ballpark of the large numbers you’re talking about.
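For concreteness, here’s the kind of 4-bit nucleotide encoding I mean — a sketch, with a made-up function name and an assumed bit order of A=1, C=2, G=4, T=8:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helper: map an IUPAC nucleotide letter to a 4-bit mask,
// one bit per base (A=1, C=2, G=4, T=8). An ambiguity code is just the
// OR of the bases it can stand for.
uint8_t nuc_mask(char c) {
    switch (c) {
        case 'A': return 0b0001;
        case 'C': return 0b0010;
        case 'G': return 0b0100;
        case 'T': return 0b1000;
        case 'R': return 0b0101;  // A or G (purine)
        case 'Y': return 0b1010;  // C or T (pyrimidine)
        case 'S': return 0b0110;  // C or G
        case 'W': return 0b1001;  // A or T
        case 'K': return 0b1100;  // G or T
        case 'M': return 0b0011;  // A or C
        case 'B': return 0b1110;  // C, G or T
        case 'D': return 0b1101;  // A, G or T
        case 'H': return 0b1011;  // A, C or T
        case 'V': return 0b0111;  // A, C or G
        case 'N': return 0b1111;  // any base
        default:  return 0;       // gap/unknown
    }
}
```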

Does the state-space need to be expanded in this way throughout the analysis? For example, ambiguous states are only needed to initialize the tip states of a tree to compute the likelihood and can then be thrown away. If this is the case, you may be able to allocate states dynamically per dataset, since a single dataset is unlikely to contain all possible ambiguous states (e.g. assign integer 30234 to NGC when it is detected). These integers would then constitute the data matrix and would be pretty straightforward to decode with a map with integers as keys and 3x4 bit-matrices as values.
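Something like the following sketch, where the struct and member names are entirely made up for illustration, and each ambiguous codon is represented by its three per-position nucleotide masks:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Sketch: assign integer indices to ambiguous codons only when first seen
// in a dataset. Names (CodonMasks, state_of) are hypothetical, not any
// real library's API. Each codon is three 4-bit per-position masks.
struct CodonMasks {
    // First unused index; the 61 unambiguous sense codons occupy 0..60.
    int next_index = 61;
    std::map<std::array<uint8_t,3>, int> index_of;  // per-position masks -> state
    std::vector<std::array<uint8_t,3>> masks;       // masks[state - 61]

    int state_of(const std::array<uint8_t,3>& m) {
        auto it = index_of.find(m);
        if (it != index_of.end()) return it->second;
        index_of[m] = next_index;
        masks.push_back(m);
        return next_index++;
    }
};
```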

Is this for general use in a software library? Or more of a constrained problem in a stand-alone program?

To take the second question first: the state space does NOT have to be expanded throughout the analysis. I’m implementing a classic pruning algorithm for the substitution likelihood, so ambiguous letters are only relevant as observations for leaf branches. I’m assuming that, for most purposes, you can index the (unambiguous) letters by 0…N-1. This causes problems with codon alphabets: since we have to eliminate stop codons, the index of a codon stops being an obvious function of the component letters of the codon.

To answer the first question: right now every codon (e.g. ATA) or ambiguity code (e.g. NTY) gets its own index. It also gets an entry in a giant array that maps the index to a bitmask. The array takes a lot of RAM and also takes a lot of time to construct.

I guess the alphabet, sequence, and alignment-reading code are supposed to be general purpose. I haven’t factored them out into a library yet, but that would be nice, as they could then benefit from improvements by other people.

The alphabet code is here:

https://github.com/bredelings/BAli-Phy/blob/master/src/sequence/alphabet.C

My impression is that ambiguity codes are low-quality data. I’d just replace the whole codon with a gap.

It is not clear to me that ambiguity codes can be completely dismissed as just low-quality data. One possibility would be to read codons like RGC as NGC. However, I am interested in what algorithms people might use to actually read codons like KBN.

Does `Alphabet::letter_masks_` map the ambiguous codon integer-value to a 61-bit vector? And is this the RAM-intensive data structure?

If so, you might want to use `int*` as a 1D representation of a 2D array instead of `std::vector<std::vector<bool> >`, for two reasons. First, each STL container instance carries something like 24 bytes (192 bits) of header overhead on a 64-bit platform, while your actual bit data is only about 61 bits per state — so if you have many states mapping to small bitfields, most of the memory may be spent on container headers rather than data. Second, `std::vector<bool>` is actually a specialized template, which optimizes for space at the (potential) expense of speed, and is generally considered a mistake in the C++ language definition. With `int*` you should get good speed and storage, and more predictable behavior (although a less intuitive interface).
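In fact, since a codon alphabet has at most 64 unambiguous states (61 sense codons), a single 64-bit integer per state is even simpler than a general flat array — no per-state allocation at all. A sketch, with illustrative names only:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: one uint64_t per (possibly ambiguous) state replaces a
// 61-entry vector<bool>, and one flat vector replaces the
// vector-of-vectors. Bit i is set iff codon state i is compatible.
using CodonMask = uint64_t;

std::vector<CodonMask> letter_masks;  // indexed by state

bool compatible(int state, int codon) {
    return (letter_masks[state] >> codon) & 1;
}
```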

Also, I definitely think you should be able to expand your alphabet if you define ambiguous codons only for those seen in the current dataset. Not sure if this is too restrictive for software like BAli-Phy that might need to dynamically insert characters into (and remove them from) the alphabet as part of alignment.

Finally, nothing like a profiler to figure out what’s hogging the CPU/RAM.

Fun problem to think about!

Yup, back to the benchmarking board for me.

> my current approach takes a ton of time and memory when the number of ambiguous patterns (e.g. NGC, or RGC) gets large. As a result, I currently only allow Y, R, W, S, and N in codons, and disallow K, M, B, D, H, and V. Any thoughts?

> right now every codon (e.g. ATA) or ambiguity code (e.g. NTY) gets its own index.

> I am interested in what algorithms people might use to actually read codons like KBN.

I use an unnecessarily complicated data representation system that, for each node in the tree, associates an emission likelihood with each possible state. So in the traditional setting, unobserved internal nodes would have 1.0 associated with each codon state, and leaves would have 1.0 associated with only the observed codon and 0.0 associated with all other codons. In this setting of complete overkill, if a taxon has the ambiguously coded codon KBN at a given site in the alignment, then the corresponding leaf would have 1.0 associated with all codons compatible with KBN and 0.0 associated with all codons not compatible with KBN.
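In that representation, initializing a leaf from a compatibility mask is trivial — a hedged sketch, where `tip_likelihoods` and the 64-bit mask are my own illustrative names, not anyone’s actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: build a leaf's emission vector from a 64-bit codon mask.
// Codons compatible with the observation get likelihood 1.0, all
// others 0.0. n_states would be 61 for the universal genetic code.
std::vector<double> tip_likelihoods(uint64_t mask, int n_states) {
    std::vector<double> L(n_states, 0.0);
    for (int i = 0; i < n_states; ++i)
        if ((mask >> i) & 1)
            L[i] = 1.0;
    return L;
}
```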

This is probably not your question, but instead of precomputing a mask for each possible triple (e.g. KBN) you could precompute a mask for each ambiguous nucleotide for each of the three possible sites within a codon, and then binary-and the three masks together later when you need it. So for example the mask associated with KBN could be something like mask[K0] & mask[B1] & mask[N2].
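A sketch of that factoring, assuming codon states are numbered naively as three 2-bit bases (a real codon alphabet would renumber to skip stop codons, per the indexing problem discussed above); the table and function names are made up:

```cpp
#include <cassert>
#include <cstdint>

// Precompute, for each of the 16 nucleotide sets at each of the three
// codon positions, a 64-bit mask over codon states. Only 3 x 16 = 48
// masks are needed, instead of one per ambiguous triple.
uint64_t pos_mask[3][16];  // pos_mask[p][s]: codons whose base at
                           // position p lies in the 4-bit set s

void build_pos_masks() {
    for (int p = 0; p < 3; ++p)
        for (int s = 0; s < 16; ++s) {
            uint64_t m = 0;
            for (int c = 0; c < 64; ++c) {             // c = three 2-bit bases
                int base = (c >> (2 * (2 - p))) & 3;   // base at position p
                if (s & (1 << base))
                    m |= uint64_t(1) << c;
            }
            pos_mask[p][s] = m;
        }
}

// e.g. KBN: mask(K, pos 0) & mask(B, pos 1) & mask(N, pos 2)
uint64_t codon_mask(int s0, int s1, int s2) {
    return pos_mask[0][s0] & pos_mask[1][s1] & pos_mask[2][s2];
}
```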

Thanks for the feedback. I went ahead and just added all the ambiguity codes, and it actually didn’t take very much time or memory. Apparently my own memory of this problem was bad; I think I actually had a different reason for not enabling all the ambiguity codes. Oops.

If I were to allow all possible subsets of codons, then I’d need 2^64 table entries, which would be too large. But right now ambiguity for each codon position is independent, so I only need (2^4)^3 = 2^12 possible subsets. This is sufficient to handle ambiguity codes like RGN.

I went back and replaced `vector<bool>` with `boost::dynamic_bitset<>`. This has given me big speed improvements in the past, but here it didn’t really make much of a difference in speed. (For bipartitions, or other things where you are doing bitwise operations, `dynamic_bitset<>` can give a 20-fold speed improvement over `vector<bool>`, I think.)

Hmm… do you mean binary OR?