Pruning an alignment, rows or columns


#1

The ElimDupes program can be set to eliminate identical or nearly identical sequences from a multiple sequence alignment. This is useful, for example, for retaining one sequence per species.
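To make the idea concrete, here is a rough Python sketch of removing identical or nearly identical rows. It is not ElimDupes' actual code or algorithm: the function name `drop_near_duplicates` and the `max_mismatches` parameter are made up for illustration, gaps are compared like any other character, and the first sequence of each group is the one that is kept.

```python
def drop_near_duplicates(records, max_mismatches=0):
    """Keep the first of each group of (near-)identical aligned sequences.

    records: list of (name, aligned_sequence) pairs, all the same length.
    max_mismatches: hypothetical threshold; 0 means exact duplicates only.
    """
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("all sequences must have the same aligned length")

    kept = []
    for name, seq in records:
        # Compare against every sequence kept so far, column by column.
        is_duplicate = any(
            sum(a != b for a, b in zip(seq, kept_seq)) <= max_mismatches
            for _, kept_seq in kept
        )
        if not is_duplicate:
            kept.append((name, seq))
    return kept


if __name__ == "__main__":
    records = [("a", "ACGT"), ("b", "ACGT"), ("c", "ACGA"), ("d", "TTTT")]
    print(drop_near_duplicates(records))
    print(drop_near_duplicates(records, max_mismatches=1))
```

Because the pass is greedy, the result depends on input order: sorting the records first (for example, longest sequence first) changes which representative survives.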

The GapStreeze program can be set to eliminate columns in an alignment where one or more of the sequences (or a chosen x percent of them) have a gap character. It can also be set to preserve codons, so that the reading frame of the alignment is not thrown out of sync.
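As an illustration of column pruning with a gap threshold, here is a minimal Python sketch. It is not GapStreeze's implementation; the function name, the `max_gap_fraction` and `codon_aware` parameters, and the rule used for codons are all assumptions. In codon-aware mode the sketch assumes the alignment starts in frame at the first column and drops a whole block of three columns whenever any column in it is too gappy.

```python
def strip_gappy_columns(seqs, max_gap_fraction=0.0, codon_aware=False, gap_chars="-"):
    """Drop columns whose fraction of gaps exceeds max_gap_fraction.

    seqs: list of aligned strings, all the same length.
    max_gap_fraction: 0.0 drops any column that contains a gap.
    codon_aware: judge columns in blocks of three (assumes frame starts at column 0).
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")

    n = len(seqs)

    def too_gappy(col):
        gaps = sum(1 for s in seqs if s[col] in gap_chars)
        return gaps / n > max_gap_fraction

    step = 3 if codon_aware else 1
    keep = []
    for start in range(0, length, step):
        block = range(start, min(start + step, length))
        # A block is kept only if none of its columns is too gappy.
        if not any(too_gappy(c) for c in block):
            keep.extend(block)
    return ["".join(s[c] for c in keep) for s in seqs]


if __name__ == "__main__":
    aln = ["ATG-AA", "ATGCAA", "ATG-AA"]
    print(strip_gappy_columns(aln, max_gap_fraction=0.5))
    print(strip_gappy_columns(aln, max_gap_fraction=0.5, codon_aware=True))
```

Removing whole codon blocks costs a few extra columns but keeps the remaining columns a multiple of three, which is what stops the reading frame from shifting.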


#2

Interesting. My command-line tool alignment-thin can eliminate identical or near-identical sequences. It can also eliminate rows containing fewer than N non-gap characters. It doesn’t handle codons, though it could be extended to do so.
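For anyone who wants to see what the row filter amounts to, here is a rough Python sketch. It is not alignment-thin's code and does not reflect its command-line options; the function name `drop_sparse_rows` and the `min_letters` parameter are hypothetical.

```python
def drop_sparse_rows(records, min_letters, gap_chars="-"):
    """Keep only rows with at least min_letters non-gap characters.

    records: list of (name, aligned_sequence) pairs.
    """
    return [
        (name, seq)
        for name, seq in records
        if sum(1 for ch in seq if ch not in gap_chars) >= min_letters
    ]


if __name__ == "__main__":
    records = [("a", "AC--"), ("b", "ACGT"), ("c", "----")]
    print(drop_sparse_rows(records, min_letters=2))
```

After dropping rows, some columns may be left containing only gaps, so a column-pruning pass afterwards is often worthwhile.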


#3

If anyone is interested, the alignment-thin program is a command-line tool that is part of the BAli-Phy codebase. You can download compiled binaries at http://www.bali-phy.org. The source code is available at https://github.com/bredelings/BAli-Phy. There isn’t a GUI version available, but there is usage information in the BAli-Phy manual under ‘alignment utilities’.


#4

Excellent. GapStreeze just went into my class computer workshop. Thanks for making this available.


#5

In this context, MaxAlign (Paper, Server, Perl-standalone) may also be of interest to some:

MaxAlign […] maximizes the number of nucleotide (or amino acid) symbols that are present in gap-free columns - the alignment area - by selecting the optimal subset of sequences to exclude from the alignment.
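The quantity described in that quote, the alignment area, is easy to compute directly. The Python sketch below shows the objective and a naive search over single-sequence exclusions. It is not MaxAlign's algorithm, which selects a whole subset of sequences to exclude; the function names are made up for illustration.

```python
def alignment_area(seqs, gap_chars="-"):
    """Number of symbols in gap-free columns: gap-free columns x sequences."""
    if not seqs:
        return 0
    gap_free = sum(
        1 for col in zip(*seqs) if not any(ch in gap_chars for ch in col)
    )
    return gap_free * len(seqs)


def best_single_exclusion(seqs, gap_chars="-"):
    """Return (index, area) for the one sequence whose removal most increases the area.

    Returns (None, current_area) if no single removal improves on the full alignment.
    """
    best = (None, alignment_area(seqs, gap_chars))
    for i in range(len(seqs)):
        rest = seqs[:i] + seqs[i + 1:]
        area = alignment_area(rest, gap_chars)
        if area > best[1]:
            best = (i, area)
    return best


if __name__ == "__main__":
    aln = ["ACGT", "A--T", "ACGT"]
    print(alignment_area(aln))
    print(best_single_exclusion(aln))
```

The trade-off is visible even in this toy case: dropping the gappy row loses one sequence but frees two columns, so the area grows from 6 to 8.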