File format for defining a taxonomy and clades on a tree


#1

Hi all,

does anyone know of a (halfway standardized) file format or annotation scheme for taxonomies and/or for identifying clades on a tree?

One of our programs, Sativa (https://github.com/amkozlov/sativa), for example uses a taxonomy notation like this:

Taxon_name <tab> Rank_1;Rank_2;...
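
For illustration, a minimal Python sketch of reading such a file. This is not Sativa's own parser; the function name `parse_taxonomy` and the example taxa and ranks are made up, and it assumes one taxon per line with ranks listed from highest to lowest.

```python
def parse_taxonomy(text):
    """Parse lines of the form 'Taxon_name<tab>Rank_1;Rank_2;...'.

    Returns a dict mapping each taxon name to its list of ranks.
    Hypothetical helper for illustration, not part of Sativa.
    """
    taxonomy = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split at the first tab: taxon name on the left, rank path on the right.
        name, _, path = line.partition("\t")
        ranks = [rank.strip() for rank in path.split(";") if rank.strip()]
        taxonomy[name.strip()] = ranks
    return taxonomy


# Made-up example input; the taxon and rank names are placeholders.
example = "A\tRank_X;Rank_X1\nB\tRank_X;Rank_X2\n\nD\tRank_Y;Rank_Y1\n"
taxonomy = parse_taxonomy(example)
```

A dict of lists keeps the rank order intact, so a particular rank level can later be selected by index.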

In order to use this for identifying clades of a tree, one can then pick a particular rank of this taxonomy and put every taxon that shares this rank (and all the higher-level ranks) into the same clade. This defines a clade on the tree iff the taxa of the clade are monophyletic in the tree. That is, if there is a branch of the tree that splits it into two subtrees, one containing all of the taxa of the clade (and no other taxa) and the other containing all remaining taxa.
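
A rough Python sketch of this idea, under some assumptions: the taxonomy is already available as a dict from taxon name to rank list, the tree is written as nested tuples instead of Newick, and all names (`clades_at_level`, `is_monophyletic`, the taxa and ranks) are made up for the example. Every non-root node stands for the branch above it, so a group counts as a clade if some node's leaf set equals the group or its complement.

```python
from collections import defaultdict


def clades_at_level(taxonomy, level):
    """Group taxa that share all ranks down to the given level (0-based)."""
    groups = defaultdict(set)
    for taxon, ranks in taxonomy.items():
        if len(ranks) > level:  # taxa without that rank are left out
            groups[tuple(ranks[:level + 1])].add(taxon)
    return dict(groups)


def subtree_leaf_sets(tree):
    """Return (all leaves, leaf sets below every non-root node)."""
    sets = []

    def visit(node, is_root):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child, False) for child in node))
        if not is_root:
            sets.append(leaves)  # one entry per branch of the tree
        return leaves

    return visit(tree, True), sets


def is_monophyletic(tree, taxa):
    """True if some branch separates exactly `taxa` from all other leaves."""
    all_leaves, sets = subtree_leaf_sets(tree)
    taxa = frozenset(taxa)
    rest = all_leaves - taxa
    return any(s == taxa or s == rest for s in sets)


# Made-up example data; taxon and rank names are placeholders.
taxonomy = {
    "A": ["Rank_X", "Rank_X1"],
    "B": ["Rank_X", "Rank_X1"],
    "C": ["Rank_X", "Rank_X2"],
    "D": ["Rank_Y", "Rank_Y1"],
    "E": ["Rank_Y", "Rank_Y1"],
}
tree = ((("A", "B"), "C"), ("D", "E"))

for path, taxa in clades_at_level(taxonomy, 1).items():
    print(";".join(path), sorted(taxa), is_monophyletic(tree, taxa))
```

Checking the complement as well treats the tree as unrooted, which matches the branch-based definition above; for a strictly rooted notion of monophyly one would drop the `s == rest` test.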

This is a bit ad hoc in two ways: the file format is not standardized, and using it for defining clades on a tree comes with the caveat that someone first has to make sure that the resulting groups indeed form monophyletic subtrees.

So, my questions are:

  • Is there a standard file format for taxonomies?
  • Is there one for dividing a tree into clades?

Thanks and cheers, Lucas

PS: I don’t want to define the 1000th Newick extension…


#2

Maybe you can use the format that NCBI uses for the taxonomy dump? That way there are at least already some Bio* toolkits that can read it.


#3

Thanks, @rutgeraldo. I had a look at it, and it seems to be a very involved format, with different files (nodes, names, …), referring to each other via IDs, and allowing multiple entries per ID.

For NCBI, such complexity is probably needed. It is, however, total overkill for projects where the taxa, their tree and taxonomy are relatively small (some thousands) and fixed (no need for taxon synonyms).