This is an archived static version of the original discussion site.

Mike Steel’s 2011 predictions– how is he doing so far?


Every five years, Mike Steel posts on his webpage predictions of five directions in phylogenetics that will grow over the next five years.

His predictions for the period 2011-2016:

  1. Development of network-based methods to display ‘evolution as it happened’, including reticulation (LGT, endosymbiosis, hybrid species, etc.) up to the limits of what can be discerned from extant data.
  2. Phylogenetic approaches for handling patchy taxon coverage and analyzing large numbers of short reads from next-generation sequencing.
  3. Phylogenetic approaches to early life using non-stationary models and protein structural constraints.
  4. Statistical approaches for analyzing non-aligned sequence data.
  5. More realistic models of speciation and extinction that better describe the shape of ‘real’ phylogenies.

I can say that people are certainly doing #2, but it seems to me that most folks are just applying classical approaches to datasets with a high percentage of gaps.

A lot of these touch on topics that I don’t know anything about, such as early life.

Thoughts, anyone?


I was just at the NZ phylogenetics meeting with Mike Steel and a bunch of others. My general feeling from that meeting (abstracts here) on these points is:

  1. Some interesting work here, and in particular an awesome talk on figuring out the limits of what can be discerned from extant data.
  2. Not much new here. In fact, it strikes me that patchy coverage is still an open question (even to the point that we don’t have a great picture from simulations). Separately from the NZ meeting, I know that some folks in Allen Rodrigo’s group at Duke are doing awesome work on using short reads directly (rather than mapping them first and calling genotypes), and integrating that into BEAST.
  3. Non-stationary models still seem like a sticking point, though there was one talk at the NZ meeting that proposed an interesting model for this, in which the amino acid matrices grow over time as new amino acids are added to the code. (With empirical analyses too…)
  4. Nothing from the meeting. But Ben Redelings (Duke / NESCent) is still working on BALi-phy which does this. Anyone know of any other work?
  5. Lots of this. Usually involving Tanja Stadler from ETH Zurich.

Funny, when I first read #4 I thought he was talking about

and I’m actually not sure now. Let’s see if we can get him to join the forum.


Actually, a very interesting (theoretical) paper on this has just appeared in Ann. Appl. Prob. by Daskalakis and Roch (“Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis”).


Bouchard-Côté and Jordan’s Poisson Indel Process is similar in spirit to TKF91, but formulating the indel events as a Poisson process lets the method forgo the need for an alignment while still remaining tractable for large numbers of taxa. I haven’t seen it in action, though.
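As rough intuition for the “Poisson” part, here is a toy single-branch sketch of TKF91-style indel dynamics (not the PIP’s actual tree-wide construction; the rates, the placeholder character, and the example sequence are all made up for illustration):

```python
import math
import random

def poisson(lam, rng):
    """Sample a Poisson(lam) variate via Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def evolve_branch(seq, t, ins_rate, del_rate, rng):
    """Toy indel process on one branch of length t: each existing character
    survives with probability exp(-del_rate * t), and a Poisson(ins_rate * t)
    number of new characters is inserted at uniformly random positions."""
    survive_p = math.exp(-del_rate * t)
    out = [c for c in seq if rng.random() < survive_p]
    n_ins = poisson(ins_rate * t, rng)
    for _ in range(n_ins):
        pos = rng.randrange(len(out) + 1)
        out.insert(pos, "*")  # placeholder for a freshly drawn character
    return "".join(out)

print(evolve_branch("ACGTACGT", t=1.0, ins_rate=0.5, del_rate=0.1,
                    rng=random.Random(42)))
```

The point of the PIP is that this kind of process can be marginalized analytically, so the likelihood of a set of unaligned sequences is computable without summing over alignments; the sketch above only shows the generative side.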


Yes, for those interested in the PIP, Bouchard-Côté described it in a phyloseminar:


@mathmomike if this is the same as the version of their paper on arXiv, then this is a distance-based algorithm, not a likelihood-based one, right? So would that satisfy your 2011 criteria?


Yes, it’s distance-based, but it is still ‘statistical’ – i.e. they prove consistency and convergence results. It might not be as efficient as ML, but at least it is tractable!
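To illustrate what “distance-based yet statistical” can look like in the simplest case, here is a minimal sketch of the four-point condition, which is consistent whenever the distances are additive on the true tree. (This is not the Daskalakis–Roch algorithm; the taxon names and distances are made up.)

```python
def quartet_topology(d, taxa):
    """Four-point condition: among the three ways to pair up four taxa,
    the true split minimizes d(pair1) + d(pair2) when d is additive."""
    a, b, c, e = taxa
    pairings = {
        ((a, b), (c, e)): d[a][b] + d[c][e],
        ((a, c), (b, e)): d[a][c] + d[b][e],
        ((a, e), (b, c)): d[a][e] + d[b][c],
    }
    return min(pairings, key=pairings.get)

# Toy additive distances on the tree ((A,B),(C,D)) with an internal edge
d = {
    "A": {"B": 2, "C": 3, "D": 3},
    "B": {"A": 2, "C": 3, "D": 3},
    "C": {"A": 3, "B": 3, "D": 2},
    "D": {"A": 3, "B": 3, "C": 2},
}
print(quartet_topology(d, ["A", "B", "C", "D"]))  # (('A', 'B'), ('C', 'D'))
```

The statistical content is in showing that distances estimated from finite data are close enough to additive that the right split still wins, which is the kind of sample-complexity result the paper proves.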


I could not find the abstract that corresponds to your description. Do you remember the names of the authors? I’m intrigued! Thanks!


Hmmm. I can’t find it either. It must have been a replacement talk. @mathmomike can you remember? It’s the NZ physicist at Canterbury (I think) who works on models of amino acid replacement that account for the addition of new amino acids…


It was Assoc. Prof. Peter Wills, a physicist from Auckland University; his website is

  1. We developed a signature-based method for searching for horizontal gene transfer in eukaryotes, called SigHunt (Jaron et al. 2014, Bioinformatics). I was fighting against self-promotion here for a few days, but now we’re looking for collaborators for an upcoming grant application. If you’re interested in a specific organism in this respect, let me know and we could discuss it.

I have some doubts about “phylogenetic approaches for … analyzing large numbers of short reads from next generation sequencing.” I think short-read data is generally low-quality data, and anything that can be done differently should be. Even for environmental samples there are starting to be alternatives to big piles of unassembled short reads, such as better metagenomic assembly tools, longer-read technologies, or Illumina’s Moleculo.


You can read a paper on the background to the approach I am taking to the phylogeny of the coding enzymes (and hope to apply to other ancient structures) using increasing-size alphabets at:

Wills PR. Genetic information, physical interpreters and thermodynamics; the material-informatic basis of biosemiosis. Biosemiotics DOI 10.1007/s12304-013-9196-2 (online, 5 October 2013)


Thanks very much for this reference!


The pagination is now available: Biosemiotics 7, 141–165 (2014).


That link seems broken now. Here is one I found that works:


@mathmomike – it seems like these predictions are an every-5-year event. Are we going to get some new ones?


Hi Erick,

good point - yes, it’s always in April, so I have only one more week to go… I’ll upload something before the end of the month! Any suggestions meantime most welcome :wink:

best, Mike


Well, if it’s not already obvious, I think that in 5 years we’re going to have Bayesian samplers that are based on foundations other than vanilla MCMC. I also think that we’ll see new and better methods to extend existing inferences with more taxa.


“the best way to predict the future is to invent it”!