This is an archived static version of the original discussion site.

Compute 2D representation of sequence space without comparing sequences


This may seem like a strange question, but here goes. I’m looking for a computationally efficient way to represent a set of sequences in a 2D space. For example, imagine that we have 10,000 DNA barcodes. I could compute a tree, but (a) that get’s computationally hard (if it doesn’t seem hard, make it 100,000, or 1M), and (b) a tree drawing isn’t stable in the sense that there’s no global coordinate system that helps us compare trees for different subsets of data.

I thought about using something like DNA walks, where we start at 0,0 in a x-y graph, walk along a sequence, and make moves -1, 1 in the x or y direction depending on the next base in the sequence. For example, we could plot the final x,y coordinates at the end of the walk, and we’d have a simple to compute measure that depends solely on the sequence at hand, and which locates a sequence in a shared coordinate space. What I’d really like are broad-brush clusters that are recognisable enough to say “OK, over there are fish, these are insects, that cluster is molluscs”). I’m guessing this might work if there are clade-specific sequence properties such as base-composition, etc., otherwise, not so much.

Hope this doesn’t sound to ridiculous. I’m curious as to whether there’s a method for getting a quick sense of the taxonomic composition of a large set of N sequences that doesn’t require the N^2 comparisons needed to compare the sequences in order to build a tree.


Lareo and Acevedo came up with an approach to position sequences into a 3D space (or 2D) which doesn’t require pairwise comparison of all sequences: The problem might be that sequences that have one substitution difference from each other will not necessarily be equidistant from each other in this space.


@blJOg Many thanks for that reference (DOI ). I’ll need to explore it further. So long as there’s some preservation of relative similarity in the resulting 2D space it may well do the trick.


You could use something like “landmark MDS”. The idea is like multidimensional scaling except without requiring all of the N^2 entries of the distance matrix.