
Distances are important. Arthropods vs Mammals for example


I am working with the Arthropod data (80 taxa, 62 nuclear genes, 41,985 DNA base columns in the alignment) which has been the subject of papers by Regier and Zwick.

Regier JC, Zwick A. Sources of signal in 62 protein-coding nuclear genes for higher-level phylogenetics of arthropods. PLoS One. 2011;6(8):e23408. doi: 10.1371/journal.pone.0023408. Epub 2011 Aug 4. PubMed PMID: 21829732. PubMed Central PMCID: PMC3150433.

Regier JC, Shultz JW, Zwick A, Hussey A, Ball B, Wetzer R, Martin JW, Cunningham CW. Arthropod relationships revealed by phylogenomic analysis of nuclear protein-coding sequences. Nature. 2010 Feb 25;463(7284):1079-83. doi: 10.1038/nature08742. Epub 2010 Feb 10. PubMed PMID: 20147900.

They discuss many data partitioning options, such as analyzing genes that evolve faster or slower; using first, second, or third codon positions; or using amino acid translations rather than DNA. My experience with data from viruses, bacteria, mammals, vertebrates and other organisms is that there is no “best answer” for how to partition data that fits all problems. When the taxa being studied are very closely related, such as populations of modern humans, we must use a lot of single nucleotide polymorphisms. When the taxa being studied are highly diverged, such as retroviruses, all vertebrates, or the arthropods, the more divergent sites are saturated with mutations and we need to focus on more conserved sites.
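To make the codon-position idea concrete, here is a minimal sketch in plain Python. It assumes the sequences are already aligned and in frame (the first column is a first codon position), and the two toy sequences are made up for illustration; they are not from the arthropod data. The function names are my own, not from any package. It splits each sequence by codon position and computes an uncorrected p-distance for each partition, skipping sites where either sequence is missing.

```python
def split_by_codon_position(seq):
    """Return the 1st, 2nd and 3rd codon-position subsequences of an in-frame sequence."""
    return seq[0::3], seq[1::3], seq[2::3]


def p_distance(a, b, missing="N-?"):
    """Proportion of differing sites among sites where both sequences have a base."""
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in missing or y in missing:
            continue  # skip sites that are missing in either sequence
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else float("nan")


# Toy in-frame sequences (hypothetical, for illustration only).
taxon_a = "ATGGCTAAAGGTCTG"
taxon_b = "ATGGCCAAGGGACTT"

parts_a = split_by_codon_position(taxon_a)
parts_b = split_by_codon_position(taxon_b)
for label, sa, sb in zip(("pos1", "pos2", "pos3"), parts_a, parts_b):
    print(label, p_distance(sa, sb))
```

In this toy pair all the differences fall at third positions, which is the usual pattern in protein-coding genes. On deeply diverged taxa, a third-position p-distance that stops growing while first and second positions keep diverging is one rough sign of saturation.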

The data set contains a lot of missing data, some of which is coded as strings of NNNNNNNN and some of which is coded as strings of ---------- characters. Some programs will treat the two characters the same, but some will treat an N differently from a dash.
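Before running anything, it can help to see how much of each kind of missing data every taxon has. The sketch below is only an illustration in plain Python: the dictionary of sequences is a made-up stand-in for a parsed alignment, and the function names are hypothetical. It counts N, dash and ? characters per taxon, and optionally recodes N and dash to a single symbol so that a program cannot treat them differently. Whether recoding gaps as missing is appropriate depends on the analysis, since it discards any indel information.

```python
from collections import Counter


def missing_summary(alignment):
    """Map each taxon name to its counts of 'N', '-' and '?' characters."""
    summary = {}
    for name, seq in alignment.items():
        counts = Counter(seq.upper())
        summary[name] = {"N": counts["N"], "-": counts["-"], "?": counts["?"]}
    return summary


def unify_missing(seq, to="?"):
    """Recode both N and '-' as one symbol (an assumption: gaps carry no signal here)."""
    return "".join(to if c in "Nn-" else c for c in seq)


# Toy alignment (hypothetical taxa and sequences).
alignment = {
    "taxon_1": "ACGTNNNN--AC",
    "taxon_2": "AC------GTAC",
}

for name, counts in missing_summary(alignment).items():
    print(name, counts)

recoded = {name: unify_missing(seq) for name, seq in alignment.items()}
```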

In almost all papers like this that I read, the authors are very interested in their data and in how to make the best use of it through various partitioning schemes, but they don’t use a second data set with a better-known fossil record (such as vertebrates) for comparison. In the vertebrates we may not be confident which lineage of placental mammals branched off first (most data suggest the elephants), but it is very solid that monotremes and marsupials diverged before the placental mammals, and that amphibians and other tetrapods preceded them.

It is relatively easy to add more data, either more genes to a 60-taxon set or more taxa to a 60-gene set. But I suspect that more data is not always better, and we want to make careful choices about which taxa or genes to add. Yet the most common question I see in phylogenetic discussions is “how do I work on my data set?” and almost never “how do I gather a good data set to work on?”