A good dataset for undergraduate coursework


I’m about to run my Final Year BSc (hons) molecular phylogenetics unit again and am looking for some inspiration. In the past I have used Trypanosome genes as a dataset for the coursework exercises, but it’s starting to feel stale to me. Does anyone have any suggestions for an interesting phylogenetic question and dataset that would allow students to collect sequences, align, make inferences and thereby test some sort of hypothesis? Preferably (and this is the hard bit) not previously published so they can’t just crib from the papers.

Many thanks in advance.


Is there something that’s making the Trypanosome sequences feel stale? I.e. do you want more or longer sequences?

When you say “collect”, do you want the sequences to be available on a public server?


Hi Erick. One problem with setting the same questions year on year is that students from the previous year sometimes repeat the year, or share the feedback they received with current students. So mixing up the dataset and questions helps to avoid these problems. In addition, the Trypanosoma paraphyly question has been done to death in the literature, so it’s easy for students to look up the “answer”, as it were. (I initially picked the system only because I was familiar with it from my time working with Ford Doolittle, but it was getting long in the tooth even then.) Lastly, the GPDA sequences the inference is based on are somewhat problematic: they are incomplete for some species, and this causes the students all sorts of problems (they find bioinformatics in general, let alone phylogenetics, challenging since they have had so little previous experience of it).

What I’m largely interested in is an engaging evolutionary question that a phylogenetic inference will answer (or help answer) and for which there is publicly available sequence data. I’d like the students to go through the process of collecting the sequences, aligning them, carrying out inferences and confidence assessments, and then interpreting the tree(s) in relation to the original hypothesis so they can come to an evaluated conclusion.

I’m sure I can come up with something if I trawl the literature some more, but time is pressing with the new academic year only two weeks away!


I find that mammalian, insect or cone snail mtDNA is nice and tidy for students. In particular, COI has been reliable and useful because it is easy to align and works for both deep and recent divergences, thanks to its combination of conserved and fast-evolving sites. Not so with the rRNAs, although these are good for demonstrating noisy data compared to phylogenetically informative data. I really like the combination of ND1, ND2 and COI.
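
For anyone who wants students to see the mix of conserved and fast sites in a protein-coding gene like COI for themselves, here is a minimal Python sketch that counts variable alignment columns by codon position. It assumes the alignment is in frame and starts at codon position 1; the function name and the toy sequences are made up for illustration and are not real COI data.

```python
def variable_sites_by_codon_position(aligned_seqs):
    """Count variable alignment columns at codon positions 1, 2 and 3.

    Assumption: the alignment is in frame and column 0 is codon position 1.
    Columns containing gaps or ambiguity codes are skipped.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must all be aligned to the same length")

    counts = {1: 0, 2: 0, 3: 0}
    for i in range(length):
        column = {seq[i].upper() for seq in aligned_seqs}
        if not column <= set("ACGT"):
            continue  # skip columns with gaps or ambiguous bases
        if len(column) > 1:
            counts[i % 3 + 1] += 1
    return counts


# Toy, invented sequences (three codons each), purely for illustration.
toy = [
    "ATGGCATTA",
    "ATGGCGTTG",
    "ATGGCTCTA",
]
print(variable_sites_by_codon_position(toy))
```

On real COI alignments one would typically expect most of the variable columns to fall at third positions, which is one way to show students where the "fast" sites are.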

For nuclear genes and cone snails, I like to compare a longer sequence like an rRNA with the tidy little gamma-glutamyl carboxylase intron 9. It turns out that the intron 9 sequence is more phylogenetically informative than the larger gene. ITS is nice for a wide variety of organisms.


Thanks for your suggestion. Not only has that given me food for thought on a dataset, it’s also given me some ideas on inference exercises the students can carry out. Thanks again.


The complete mitochondrial genomes of thousands of humans, a few Neanderthals, Denisova man, dozens of chimpanzees, a few gorillas, orangutans, etc. have been sequenced. Over the past few years there have been a lot of popular press stories, books, and scientific papers about what all this data can tell us. Did pre-humans mate with chimpanzees? Did prehistoric Europeans mate with Neanderthals? Etc. It’s all quite interesting, and there is fossil and/or archeological data to go with the DNA data.

I don’t think most of the questions are the type that students can answer with a few genomes sampled from the thousands done. But there are dozens of very good questions they can answer, such as: Do trees built from the COI gene give the same answer as trees built from complete mitochondrial genomes of the same samples (maybe three human, one Neanderthal, three chimpanzee, two gorilla and one orangutan genomes in the data set to be analyzed)? If you pick a 100-base region at random, is the tree good? What about a 500-base region? What happens if you use macaque or baboon as the outgroup instead of orangutan? Do maximum likelihood and neighbor-joining methods give similar or identical results? Does selecting the model of evolution (Kimura, F84, GTR; with or without a gamma distribution) make a large difference? If we use 8 million years as the date of the common ancestor between chimpanzee and human, when did Neanderthal share a common ancestor with human?
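
A couple of these questions can be sketched in a few lines of Python. The snippet below is only a rough illustration, not a replacement for a proper phylogenetics package: it cuts the same random window out of every aligned sequence, computes a Kimura 2-parameter distance between two sequences, and rescales a distance to a date under the assumption of a strict molecular clock. All function names and the numbers in the usage lines are hypothetical.

```python
import math
import random


def random_window(alignment, width, rng=random):
    """Cut the same randomly placed window out of every aligned sequence.

    `alignment` is a dict of name -> aligned sequence (all the same length).
    """
    length = len(next(iter(alignment.values())))
    if width > length:
        raise ValueError("window is wider than the alignment")
    start = rng.randrange(length - width + 1)
    return {name: seq[start:start + width] for name, seq in alignment.items()}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance; sites with gaps or ambiguity are ignored.

    Raises ValueError (from math.log/math.sqrt) if the pair is too divergent
    for the correction to be defined.
    """
    purines = {"A", "G"}
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a in purines) == (b in purines):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def scaled_divergence_time(d_target, d_calibration, t_calibration):
    """Date a split by simple proportion, ASSUMING a strict molecular clock."""
    return t_calibration * d_target / d_calibration


# Hypothetical usage: the distances here are invented, not measured values.
# With an 8-million-year human-chimpanzee calibration:
print(scaled_divergence_time(d_target=0.02, d_calibration=0.32, t_calibration=8.0))
```

Students could call `random_window` repeatedly with widths of 100 and 500 and compare the resulting distance matrices (or the trees built from them) with the full-genome result.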

I can think of thousands of great HIV data sets, but as with Trypanosomes or fruit flies or most other organisms, the students are unlikely to understand how the organisms evolved, and so are unlikely to be able to assess when the data support or refute various ideas about their evolution based on other factors such as phenotypes.


The complete mitochondrial genomes of many other vertebrates are also available. Humans tend to think we are so very different from chimpanzees, and that poodles are very different from Great Danes, but we think all Plethodont salamanders look alike. Does genetic diversity correlate with phenotypic diversity? Do all toads share one common ancestor separate from frogs, or do toads/frogs intermingle on a tree?