This may seem like a strange question, but here goes. I’m looking for a computationally efficient way to represent a set of sequences in a 2D space. For example, imagine that we have 10,000 DNA barcodes. I could compute a tree, but (a) that get’s computationally hard (if it doesn’t seem hard, make it 100,000, or 1M), and (b) a tree drawing isn’t stable in the sense that there’s no global coordinate system that helps us compare trees for different subsets of data.

I thought about using something like DNA walks, where we start at 0,0 in a x-y graph, walk along a sequence, and make moves -1, 1 in the x or y direction depending on the next base in the sequence. For example, we could plot the final x,y coordinates at the end of the walk, and we’d have a simple to compute measure that depends solely on the sequence at hand, and which locates a sequence in a shared coordinate space. What I’d really like are broad-brush clusters that are recognisable enough to say “OK, over there are fish, these are insects, that cluster is molluscs”). I’m guessing this might work if there are clade-specific sequence properties such as base-composition, etc., otherwise, not so much.

Hope this doesn’t sound to ridiculous. I’m curious as to whether there’s a method for getting a quick sense of the taxonomic composition of a large set of N sequences that doesn’t require the N^2 comparisons needed to compare the sequences in order to build a tree.