Does anyone know of a (halfway standardized) file format or annotation scheme for taxonomies and/or for identifying clades on a tree?
One of our programs, Sativa (https://github.com/amkozlov/sativa), for example, uses a taxonomy notation like this:
Taxon_name <tab> Rank_1;Rank_2;...
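To make the notation concrete, here is a minimal Python sketch of reading such lines. It is only an illustration under assumptions: the function name `parse_taxonomy` and the example taxa are made up, and this is not meant to mirror how Sativa itself parses its input.

```python
def parse_taxonomy(lines):
    """Map each taxon name to its list of ranks, highest rank first."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Everything before the first tab is the taxon name,
        # everything after it is the semicolon-separated lineage.
        name, _, lineage = line.partition("\t")
        ranks = [r.strip() for r in lineage.split(";") if r.strip()]
        taxonomy[name.strip()] = ranks
    return taxonomy


# Hypothetical example input, one taxon per line.
example = [
    "Taxon_A\tBacteria;Proteobacteria;Gammaproteobacteria",
    "Taxon_B\tBacteria;Firmicutes;Bacilli",
]
taxonomy = parse_taxonomy(example)
```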
To use this for identifying clades on a tree, one can pick a particular rank of the taxonomy and put all taxa that share this rank (and all higher-level ranks) into the same clade. This defines a clade on the tree iff the group is monophyletic in the tree. That is, there has to be a branch that splits the tree into two subtrees, one containing all taxa of the clade (and no other taxa), and the other containing all remaining taxa.
This is a bit ad hoc in two ways: the file format is not standardized, and using it to define clades on a tree comes with the caveat that someone first has to make sure that the resulting groups indeed form monophyletic subtrees.
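The grouping and the monophyly check described above can be sketched roughly as follows. This is a toy illustration, not our actual implementation: the tree is assumed to be given as nested Python tuples with leaf names as strings, the tree is treated as unrooted (a group counts as a clade if either it or its complement is the leaf set of some subtree), and all names (`group_by_rank`, `is_monophyletic`, the taxa A to E) are hypothetical.

```python
from collections import defaultdict


def group_by_rank(taxonomy, depth):
    """Group taxa that share their first `depth` ranks."""
    groups = defaultdict(set)
    for name, ranks in taxonomy.items():
        if len(ranks) >= depth:
            groups[tuple(ranks[:depth])].add(name)
    return dict(groups)


def leaf_sets(tree, out):
    """Return the leaves below `tree`; record every subtree's leaf set in `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(leaves)
    return leaves


def is_monophyletic(tree, taxa):
    """True if some branch separates exactly `taxa` from all other leaves."""
    subtrees = set()
    all_leaves = leaf_sets(tree, subtrees)
    taxa = frozenset(taxa)
    if not taxa or not taxa <= all_leaves:
        return False
    # Unrooted view: either side of the branch may be the recorded subtree.
    return taxa in subtrees or (all_leaves - taxa) in subtrees


# Hypothetical toy data: five taxa, two ranks each.
taxonomy = {
    "A": ["X", "P"], "B": ["X", "P"], "C": ["X", "Q"],
    "D": ["Y", "R"], "E": ["Y", "R"],
}
tree = ((("A", "B"), "C"), ("D", "E"))

clades = group_by_rank(taxonomy, 1)
ok = {rank: is_monophyletic(tree, taxa) for rank, taxa in clades.items()}
```

Here `ok` reports, per rank prefix, whether the group can be used as a clade on this tree; a group such as {A, C} would fail the check because no branch separates exactly these two taxa from the rest.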
So, my questions are:
- Is there a standard file format for taxonomies?
- Is there one for dividing a tree into clades?
Thanks and cheers,
PS: I don't want to define the 1000th Newick extension...