Diversification rates and sampling fraction


Hello All, I work on a genus of frogs that is distributed across several islands and thought it would be interesting to compare rates of diversification between landmasses (using TESS, RPanda etc…). However, all of the methods I have come across require beforehand knowledge of the number of species in the clade (i.e., a sampling fraction). The genus is poorly characterized, and this information is not available. Pybus et al. 2002 used their gamma statistic to estimate the number of flaviviruses under a Yule model, which would be a start. I think they needed a diversification rate (which is what I am ultimately looking for), and I cannot find a package or program that implements their methods.

So…are there methods for inferring diversification rates without a sampling fraction? Absent that, are there methods for inferring clade size based on time, a tree model, sample size, and the distribution of nodes?

Any and all input would be appreciated…including “don’t bother”




Hi Elijah,

What you have to keep in mind is that however fancy the diversification-through-time model is, at its core you’re still estimating speciation and extinction rates. To estimate a rate, you need the number of things that happened (speciation events) and the amount of time in which they happened (the duration of the clade). This applies whether we are using RevBayes or BEAST to estimate a time tree and diversification rates at the same time (from constant-rate models or fancier models), or taking clade ages and species counts to get simple constant-rate estimates.

Incomplete sampling is important because, although my phylogeny may have n species in it, the size of the clade, m, may be much larger than n, so any rate based on n will be wrong. Imagine we wanted to ask which has the greater speciation rate, birds or mammals. From mammals, which are ~435 million years old, we have sampled 2000 species. From birds, which are ~111 million years old, our grant money ran out and we only got 10 species. Following Magallon and Sanderson (2001), we estimate the net diversification rate as r = ln(n / 2) / t. So we estimate that mammals have a net diversification rate of r = 0.0158799, while birds have a net diversification rate of r = 0.01449944, and we conclude that mammals diversify faster. But there are really ~5400 mammals and ~10000 birds, so the rates are really something like r = 0.01816323 for mammals and r = 0.07673147 for birds. By missing samples, we not only estimated wildly incorrect rates, we flipped the direction of the result. Now, obviously it’s a bit hard to get the wrong result when one group is diversifying over 4x as fast as the other, but that’s a rather large difference, and we can’t expect the groups we actually study to display such a clear one. That’s why m matters.
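To make the arithmetic easy to check, here is a minimal Python sketch of the estimator quoted above, r = ln(n / 2) / t, applied to the same made-up mammal and bird numbers. The function and variable names are mine, chosen only for illustration.

```python
import math

def net_diversification_rate(n_species, clade_age):
    """Simple constant-rate estimate quoted above: r = ln(n / 2) / t."""
    return math.log(n_species / 2) / clade_age

# (species counted, clade age in Myr), using the numbers from the example above
sampled = {"mammals": (2000, 435), "birds": (10, 111)}
actual = {"mammals": (5400, 435), "birds": (10000, 111)}

r_sampled = {clade: net_diversification_rate(*v) for clade, v in sampled.items()}
r_actual = {clade: net_diversification_rate(*v) for clade, v in actual.items()}

for clade in sampled:
    print(f"{clade}: r from sampled n = {r_sampled[clade]:.7f}, "
          f"r from true m = {r_actual[clade]:.7f}")
```

With the sampled counts mammals come out ahead; with the true counts birds do, which is the reversal described above.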

Now, how to get m? As you say, taxonomy gives us some answer. You also mention using a diversification rate, but now we’re caught in a circle: we need r to get m, but we can’t get r without m. Anything we do, any assumption we make about how this group is diversifying, will completely color our results, as our results will be the assumption we made.

The other thing to keep in mind is that comparing diversification rates across landmasses is probably either invalid or requires a more complicated model. Unless the landmasses all have reciprocally monophyletic clades on them, you cannot split the tree apart by island, because the groups on those islands share a tree and are therefore non-independent. It would be possible to do a joint reconstruction of island and diversification rate, as in a MuSSE model. Now, for a BiSSE model, you need something like a few hundred species to get a good estimate of the effect of the binary trait on speciation and extinction rates, and one imagines that even more species are required for MuSSE. Even if the clades are reciprocally monophyletic, confidence intervals on speciation parameters are not small, and you may have a hard time saying anything about diversification rates.

The bottom line is this:

  1. If the islands don’t have monophyletic clades: a) If you don’t have samples for >100 species in the genus, there’s absolutely no way. b) If you do, and you only care about the differences between the islands, you may not have to worry too much about the sampling fraction (unless sampling is biased).

  2. If the islands have monophyletic clades: a) If you have some idea about the range of species missing, you could run multiple analyses of each clade across the range of possible clade sizes. Then you have some idea of whether one group is diversifying faster. But between the uncertainty inherent in estimating diversification rates and the uncertainty in what the true number of species is, unless there is a rather large effect, you probably couldn’t say anything remotely conclusive. b) If you really don’t have any idea, there’s absolutely no way to say anything.
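As a rough illustration of option 2a, the sketch below runs the simple r = ln(m / 2) / t estimator at the low and high ends of a guessed clade size for each island and checks whether the resulting ranges overlap. The island names, species ranges, and ages are hypothetical placeholders, and this only captures uncertainty in m, not the statistical uncertainty in the rate estimate itself.

```python
import math

def net_diversification_rate(n_species, clade_age):
    """Simple constant-rate estimate: r = ln(n / 2) / t."""
    return math.log(n_species / 2) / clade_age

def rate_range(m_low, m_high, clade_age):
    """Rates implied by the smallest and largest plausible clade size."""
    return (net_diversification_rate(m_low, clade_age),
            net_diversification_rate(m_high, clade_age))

def ranges_overlap(a, b):
    """True if two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical inputs: plausible species counts and clade ages (Myr)
island_a = rate_range(m_low=8, m_high=20, clade_age=5.0)
island_b = rate_range(m_low=10, m_high=40, clade_age=8.0)

print("island A rate range:", island_a)
print("island B rate range:", island_b)
print("ranges overlap:", ranges_overlap(island_a, island_b))
```

If the ranges overlap before estimation error is even considered, as they do for these placeholder values, the comparison is unlikely to be conclusive.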


Thank you!

I should have mentioned that each island is inhabited by a reciprocally monophyletic clade.


“… we need r to get m, but we can’t get r without m. Anything we do, any assumption we make about how this group is diversifying, will completely color our results, as our results will be the assumption we made.”

…is the exact problem.

I was checking to see if there has been any work on deriving r or m independently of the other, or if there were any ideas about how to go about it. I hadn’t found any.

How sensitive are analyses to mischaracterization of m? I imagine that the number of clades for which there is a known number of species is relatively small.




Yeah, the only way to derive r or m independently would be if there were a general rule of speciation rates, but there is no evidence for any such general relationship between clade age and species number.

What you have to realize about diversification analyses is that they are by nature sensitive and fickle. They’re sensitive to the time frame (certain patterns, like slowdowns, are erased after enough time), to unmodeled factors (variation in rates, mass extinctions, etc.), and even to assumptions that go into the tree-making process (two-step analyses are especially iffy, but even in joint analyses of the divergence times and the diversification model, a bad clock model could ruin inference). Heck, they can even be sensitive to how we model the way missing species are missing.

When we have n and m, we can calculate rho, the sampling fraction. As it turns out, rho behaves a bit like extinction. And sure, we have some information about rates since we have data, but we also have the following non-identifiability issue. Do we have some small n species in some long time t because r is low and extinction is low and rho is high? Because r is high and extinction is high and rho is high? Because r is quite large and rho is low and extinction is low? Or because of some other combination of rho, speciation, and extinction rates?
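A toy sketch of that trade-off, under simplifying assumptions of my own: it uses only the expected number of sampled species from a constant-rate birth-death process started from two lineages, rho * 2 * exp((speciation - extinction) * t), without conditioning on survival. A real analysis also uses the branching times, so this is not the actual likelihood, but it shows how quite different parameter combinations can predict the same species count. All numbers are hypothetical.

```python
import math

def expected_sampled_species(speciation, extinction, rho, clade_age):
    """Expected sampled tips: rho * 2 * exp((speciation - extinction) * t)."""
    return rho * 2 * math.exp((speciation - extinction) * clade_age)

def rho_for_target(speciation, extinction, clade_age, target_n):
    """Sampling fraction at which the expected sampled count equals target_n."""
    return target_n / (2 * math.exp((speciation - extinction) * clade_age))

clade_age = 10.0   # hypothetical clade age (Myr)
target_n = 20.0    # hypothetical number of sampled species

# Hypothetical (speciation, extinction) pairs
rate_pairs = [(0.25, 0.0), (0.9, 0.6), (0.5, 0.1)]

scenarios = []
for speciation, extinction in rate_pairs:
    rho = rho_for_target(speciation, extinction, clade_age, target_n)
    scenarios.append((speciation, extinction, rho))
    n = expected_sampled_species(speciation, extinction, rho, clade_age)
    print(f"speciation={speciation}, extinction={extinction}, "
          f"rho={rho:.3f} -> expected sampled species = {n:.1f}")
```

Every row predicts the same expected count, so the count alone cannot separate the three scenarios.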

My best advice to you would be the following:

  1. Plug your sequence data into RevBayes (you can probably do this in BEAST as well) and do analyses separately for each clade.

  2. Figure out what your best guess on the range of m is and use this to figure out what the range of rho is. Put this as a uniform prior on rho, uniform(low_guess, high_guess).

  3. Estimate the divergence times (and the tree, preferably, but at least the times) and speciation/extinction under this model.

  4. Now you have posterior distributions that capture the full uncertainty for each clade. Compare the posterior distributions on the diversification parameters between the clades and draw conclusions (if anything can be concluded decisively).
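Steps 2 and 4 can be sketched in a few lines of Python. This is not RevBayes or BEAST code: it only shows how a guessed range for m turns into bounds on rho (since rho = n / m), and one simple way to compare posterior draws from two clades that were analyzed separately. The function names, species counts, and the short lists of "posterior draws" are placeholders I made up for illustration; in practice the draws would come from the MCMC output for each clade.

```python
def rho_bounds(n_sampled, m_low, m_high):
    """Bounds on the sampling fraction rho = n / m for m in [m_low, m_high]."""
    if not n_sampled <= m_low <= m_high:
        raise ValueError("expected n_sampled <= m_low <= m_high")
    return n_sampled / m_high, n_sampled / m_low

def prob_greater(samples_a, samples_b):
    """Fraction of all (a, b) pairings of posterior draws in which a > b.

    Pairing every draw with every other draw assumes the two posteriors
    are independent, as they are when each clade is analyzed separately.
    """
    wins = sum(a > b for a in samples_a for b in samples_b)
    return wins / (len(samples_a) * len(samples_b))

# Step 2 (hypothetical): 12 sampled species, true clade size between 12 and 30
low_guess, high_guess = rho_bounds(n_sampled=12, m_low=12, m_high=30)
print(f"prior on rho: uniform({low_guess}, {high_guess})")

# Step 4 (placeholder values standing in for real posterior draws)
draws_island_a = [0.31, 0.28, 0.35, 0.40, 0.26]
draws_island_b = [0.22, 0.30, 0.19, 0.27, 0.33]
p = prob_greater(draws_island_a, draws_island_b)
print(f"posterior probability that island A diversifies faster: {p:.2f}")
```

A probability near 0.5 means the data cannot tell the clades apart; only values close to 0 or 1 would support a decisive conclusion.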