Mesquite and TreeBASE NEXUS file formats


#1

For my work, I usually use FASTA format for multiple sequence alignments and Newick format for phylogenetic trees. On the downside, these formats are very simple: they don’t define partition blocks etc. in the alignments, and don’t provide information about fonts or colors etc. in the tree files. On the plus side, the formats are very simple and almost all phylogenetic analysis programs can take them as input and write them out as output.

When I do need a more complex format, there are tools available to convert FASTA to NEXUS or other formats.
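As a rough illustration of what such a conversion involves, here is a minimal Python sketch that turns an aligned FASTA string into a generic NEXUS DATA block. The function names `parse_fasta` and `fasta_to_nexus` are made up for this example, it assumes the labels are already free of whitespace, and the output is only one plain NEXUS flavor, not necessarily the one Mesquite or TreeBASE wants.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def fasta_to_nexus(text, datatype="DNA"):
    """Write an aligned FASTA string as a simple NEXUS DATA block."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no sequences found")
    for name, _ in records:
        # Assumption: labels with whitespace would need quoting or renaming.
        if any(ch.isspace() for ch in name):
            raise ValueError(f"label contains whitespace: {name!r}")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; input is not an alignment")
    nchar = lengths.pop()
    width = max(len(name) for name, _ in records)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(records)} NCHAR={nchar};",
        f"  FORMAT DATATYPE={datatype} MISSING=? GAP=-;",
        "  MATRIX",
    ]
    for name, seq in records:
        lines.append(f"  {name.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"
```

Calling `fasta_to_nexus(open("alignment.fasta").read())` (the file name is hypothetical) returns the NEXUS text as a string. Refusing unaligned input up front seemed safer than writing a file that a downstream program rejects with an unhelpful message.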

Working at the HIV Databases, where people write to us with their problems, I have found that even the simplest formats can present a variety of problems that are trivial but often difficult to recognize and solve. For example, the “white space” characters (tabs, spaces, carriage returns, and line feeds) can differ between operating systems and cause trouble. The file looks perfectly fine to the human eye, but one program or another will tell you that the format is not valid (and usually give you no clue at all as to what type of problem there is).
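To make that concrete, here is a small Python sketch that counts the different line-ending styles and tabs in a file's raw bytes and rewrites everything to Unix-style line feeds. The function names are invented for this example, and whether a given program wants LF or CRLF is something you would still have to check for that program.

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CRLF), old Mac (lone CR), and Unix (lone LF) line endings, plus tabs."""
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf, "TAB": data.count(b"\t")}


def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF; other bytes are left untouched."""
    # CRLF must be replaced first so it does not turn into two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Reading the file in binary mode (`open(path, "rb").read()`) matters here, because text mode in Python translates line endings on the way in and hides exactly the differences you are looking for. A file reporting more than one nonzero line-ending count is a likely culprit.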

Today, I am working with a researcher who needs to upload his published (in press) alignment and tree to TreeBASE. I am not a regular Mesquite or TreeBASE user, so I am getting frustrated with attempting to create the “flavor” of NEXUS file that these two programs or sites are happy with. I have reported the issues to TreeBASE at GitHub, but I also wanted to bring it up here, as an issue to consider when developing a tool or web site for broad use. It seems to me that any tool that requires a special format should provide some help with converting a standard format to the required special format.


#2

To be fair, the TreeBASE site does provide a nice tutorial on how to create a good TreeBASE entry. But the example begins with reading a file into Mesquite that seems to be ideal for a TreeBASE entry. Watching the tutorial does not tell us much of the backstory on TreeBASE data requirements, such as never abbreviating the Genus_species names. Most of the data I work with does not even fit a “Genus_species” format, because it is from viruses, where we usually need other details such as sample country and sample date, as well as other information, in order to make the analysis useful. When I am not working with virus data, I am often working with data downloaded from GenBank, and I am at the mercy of the authors of that data for how they may have coded or miscoded their data. The text of the tutorial also says to “Avoid using quotation marks, brackets, parentheses, commas, colons, and semicolons.” when in fact at least some of these characters are absolutely forbidden.
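For what it is worth, stripping those characters from labels before building the NEXUS file is easy to script. The Python sketch below replaces runs of whitespace and the punctuation listed in the tutorial with a single underscore. The function name and the choice of underscore are my own assumptions, and I do not know the exact set of characters TreeBASE actually rejects, so treat the character list as a starting point.

```python
import re

# Whitespace plus the characters named in the tutorial text:
# quotation marks, brackets, parentheses, commas, colons, semicolons.
_UNSAFE = re.compile(r"""[\s'"\[\]():,;]+""")


def sanitize_label(label: str) -> str:
    """Replace each run of unsafe characters with one underscore and trim the ends."""
    return _UNSAFE.sub("_", label.strip()).strip("_")


def sanitize_labels(labels):
    """Sanitize a list of labels and refuse to continue if two labels collide."""
    cleaned = [sanitize_label(label) for label in labels]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("two or more labels became identical after sanitizing")
    return cleaned
```

The collision check is there because virus labels often differ only in the punctuated parts (country, date), and silently merging two taxa would be worse than an error message.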


#3

The NEXUS standard has unfortunately devolved into a family of sometimes mutually incompatible dialects. As such, the TreeBASE submission functionality is not a case of going from a “standard” format to a “special” format. Rather, it is a case of having to pick one of the multiple “special” formats and sticking to it.

There is nothing that TreeBASE can do about this situation other than recommend, as it does, that you use Mesquite to prepare your data so that TreeBASE at least knows which dialect it is going to receive (there is no reliable way to sniff this out based on the file contents).

The requirement to use Genus_species is a weak requirement in the sense that this is only necessary if you want your sequence / tip labels to be recognized by the taxonomic name reconciliation service (TNRS) that TreeBASE uses. If you want to submit a tree of viruses that doesn’t conform to the Genus_species paradigm then you are free to do so; it simply means that TNRS will fail.

As for the punctuation marks: if your file is “valid” in the sense that it is a NEXUS file that has passed through Mesquite, then this is a non-issue: Mesquite won’t let those punctuation marks appear anywhere in the file where they could do harm (i.e. invalidate the file syntax).

By the way, I notice that your GitHub issue was responded to within 12 hours, with Bill Piel himself correcting your input file by hand. Honestly I don’t know what more Team TreeBASE is supposed to do here.


#4

I’d say the one thing that TreeBASE could be doing but isn’t is to accept NeXML format for submission. That it shouldn’t take a database editor with internal expert knowledge about file format and implicit metadata expectations to tweak a submitted file into the right form is, I think, a valid point. NeXML can be independently validated. (Though I suspect another data format isn’t what @BrianFoley was hoping for :slight_smile:)

Even then, though, we don’t have well-documented and widely implemented conventions for how to express common metadata for phylogenetic data. Hence, for a database such as TreeBASE that has strong expectations about certain metadata, a validatable format by itself isn’t going to solve the whole problem.

@BrianFoley if all that you care about is archiving your tree and alignment in perpetuity, you can use Dryad. That may well limit their usefulness for synthesis efforts such as Open Tree of Life, though, which do need good metadata.


#5

I was not meaning to criticize Mesquite or TreeBASE specifically. I think TreeBASE is a great idea, and well implemented. I was more wanting to open a discussion about the problems users can have with file formats, and maybe hear some ideas about how to assist users with those problems. At the HIV Databases, we provided a web interface to a format converter that was originally based on READSEQ and then expanded to read and write more formats than READSEQ handles. But even this is not the solution to all problems.

I think the BEAST team did a great job with creating the BEAUti program to assist users with assembling the data, for example.