Compute 2D representation of sequence space without comparing sequences


#1

This may seem like a strange question, but here goes. I’m looking for a computationally efficient way to represent a set of sequences in a 2D space. For example, imagine that we have 10,000 DNA barcodes. I could compute a tree, but (a) that get’s computationally hard (if it doesn’t seem hard, make it 100,000, or 1M), and (b) a tree drawing isn’t stable in the sense that there’s no global coordinate system that helps us compare trees for different subsets of data.

I thought about using something like DNA walks, where we start at 0,0 in a x-y graph, walk along a sequence, and make moves -1, 1 in the x or y direction depending on the next base in the sequence. For example, we could plot the final x,y coordinates at the end of the walk, and we’d have a simple to compute measure that depends solely on the sequence at hand, and which locates a sequence in a shared coordinate space. What I’d really like are broad-brush clusters that are recognisable enough to say “OK, over there are fish, these are insects, that cluster is molluscs”). I’m guessing this might work if there are clade-specific sequence properties such as base-composition, etc., otherwise, not so much.

Hope this doesn’t sound to ridiculous. I’m curious as to whether there’s a method for getting a quick sense of the taxonomic composition of a large set of N sequences that doesn’t require the N^2 comparisons needed to compare the sequences in order to build a tree.


#2

Lareo and Acevedo came up with an approach to position sequences into a 3D space (or 2D) which doesn’t require pairwise comparison of all sequences: http://www.ncbi.nlm.nih.gov/pubmed/10582276 The problem might be that sequences that have one substitution difference from each other will not necessarily be equidistant from each other in this space.


#3

@blJOg Many thanks for that reference (DOI http://dx.doi.org/10.1023/A:1002018205360 ). I’ll need to explore it further. So long as there’s some preservation of relative similarity in the resulting 2D space it may well do the trick.


#4

You could use something like “landmark MDS”. The idea is like multidimensional scaling except without requiring all of the N^2 entries of the distance matrix.