[Paper] A synchronized global sweep of the internal genes of modern avian influenza virus


By Worobey, Han, and @arambaut.


Zoonotic infectious diseases such as influenza continue to pose a grave threat to human health. However, the factors that mediate the emergence of RNA viruses such as influenza A virus (IAV) are still incompletely understood. Phylogenetic inference is crucial to reconstructing the origins and tracing the flow of IAV within and between hosts. Here we show that explicitly allowing IAV host lineages to have independent rates of molecular evolution is necessary for reliable phylogenetic inference of IAV and that methods that do not do so, including ‘relaxed’ molecular clock models, can be positively misleading. A phylogenomic analysis using a host-specific local clock model recovers extremely consistent evolutionary histories across all genomic segments and demonstrates that the equine H7N7 lineage is a sister clade to strains from birds (as well as those from humans, swine and the equine H3N8 lineage), sharing an ancestor with them in the mid to late 1800s. Moreover, major western and eastern hemisphere avian influenza lineages inferred for each gene coalesce in the late 1800s. On the basis of these phylogenies and the synchrony of these key nodes, we infer that the internal genes of avian influenza virus (AIV) underwent a global selective sweep beginning in the late 1800s, a process that continued throughout the twentieth century and up to the present. The resulting western hemispheric AIV lineage subsequently contributed most of the genomic segments to the 1918 pandemic virus and, independently, the 1963 equine H3N8 panzootic lineage. This approach provides a clear resolution of evolutionary patterns and processes in IAV, including the flow of viral genes and genomes within and between host lineages.

The bolding up there is my own. Definitely take a look at the corresponding figure, Figure 2 from that paper.

H/T @trvrb.


The relaxed clock model in BEAST assumes each branch draws its rate independently from a distribution with estimated mean and variance. Thus, there is no auto-correlation across branches. Although the host-specific rate estimates are quite cool, Mike and Andrew’s results suggest that a relaxed clock model with auto-correlation might perform better.
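For concreteness, that uncorrelated assumption amounts to something like the following toy simulation (not BEAST code; the lognormal parameterization and all values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_branch_rates_uncorrelated(n_branches, mean=1.0, sigma=0.5):
    """Uncorrelated relaxed clock sketch: every branch rate is an
    independent draw from one lognormal with a shared (estimated) mean
    and variance; a branch's rate says nothing about its parent's."""
    # Parameterize so the lognormal's expectation equals `mean`.
    mu = np.log(mean) - 0.5 * sigma**2
    return rng.lognormal(mean=mu, sigma=sigma, size=n_branches)

rates = draw_branch_rates_uncorrelated(10_000)
```

The sample mean sits near the shared mean, and adjacent branches are, by construction, independent — which is exactly the property at issue in this thread.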

I wonder if Alexei and Marc’s random-local-clock model might do better than the standard relaxed clock in this case. I haven’t read the paper closely enough to know whether this was one of the models compared.


We are working on a mixture of local clock and relaxed clock, and I certainly predict it will fit better. In this case there were only a few, quite obvious, host jumps, so a specified local clock is going to work better than a random local clock, but I would imagine the latter would probably not do badly (worth a comparison).


Thanks for clarifying @arambaut. I completely agree with using the pre-specified host-clocks for this analysis. But it was surprising how poorly the standard relaxed clock performed in this case. It would be good to have a relaxed clock model that is more robust to these sorts of strong lineage effects, which we might expect to be common.

I think I’ve asked you about this before, but have you experimented with auto-correlated relaxed clock models, in which each branch samples its rate with mean equal to the rate of its parent branch? Mixing is obviously going to be an issue, but I wonder if GMRF-style operators could help with this? And I’m not sure how mixing in this case would compare to the random-local-clock model.
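That parent-centred scheme can be sketched in a few lines (a toy simulation; the tree, the lognormal kernel, and all values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy tree as parent pointers: branch i's parent is parents[i]; -1 is the root.
parents = [-1, 0, 0, 1, 1, 2, 2]

def draw_rates_autocorrelated(parents, root_rate=1.0, nu=0.3):
    """Autocorrelated relaxed clock sketch: each branch draws its log-rate
    from a normal centred on its parent branch's log-rate (sd `nu`), so
    rates drift gradually along the tree rather than being redrawn
    independently on every branch."""
    log_rates = np.empty(len(parents))
    for i, p in enumerate(parents):  # parents are listed before children
        centre = np.log(root_rate) if p < 0 else log_rates[p]
        log_rates[i] = rng.normal(loc=centre, scale=nu)
    return np.exp(log_rates)

rates = draw_rates_autocorrelated(parents)
```

Because each draw is conditioned on the parent, sibling branches are correlated through their shared ancestor — the auto-correlation absent from the standard BEAST relaxed clock.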


@trvrb it seems like one problem with the standard relaxed clocks in this setting is that all branch rates are drawn from a single distribution, whereas we clearly have three distributions here, with very different means.
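That single-distribution point can be made concrete with a toy simulation (the per-host rates below are invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-host mean rates (subs/site/year) -- illustrative only.
host_means = {"avian": 2.0e-3, "human/swine": 1.2e-3, "equine": 0.6e-3}

# Host-specific clocks: each host's branches scatter around its own mean...
per_host = {h: rng.lognormal(np.log(m), 0.2, 200) for h, m in host_means.items()}

# ...versus one shared distribution over all branches: the pooled sample
# must inflate its variance to span three different host means at once.
pooled = np.concatenate(list(per_host.values()))
cv_within = np.mean([r.std() / r.mean() for r in per_host.values()])
cv_pooled = pooled.std() / pooled.mean()
```

The pooled coefficient of variation comes out well above the within-host one: a single-distribution relaxed clock has to absorb the between-host rate differences as extra branch-rate variance, which is one way of seeing why it struggles here.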

@arambaut I couldn’t tell if the random local clock had the topology constrained such that the viruses in the three animals were each monophyletic, as was the case for the HSLC.


This makes me wonder to what extent one could estimate the number of clocks and their branch assignments directly from the data, in a categorical version of what @nicolas_lartill does in a continuous framework here: http://onlinelibrary.wiley.com/doi/10.1111/j.1558-5646.2011.01558.x/full


@arambaut: a nice paper, and an interesting model. One can imagine many extensions here (e.g. by making an explicit model of the switches between hosts based on a metapopulation model).

@rob_lanfear: yes, there are possible connections here. On my side, I was thinking about modeling the log-rate as an Ornstein-Uhlenbeck process with a host-dependent mean. But other models would certainly be possible.
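Roughly, the sort of thing I have in mind could be sketched as an Euler-Maruyama simulation of the log-rate with a host-dependent stationary mean (all parameter values purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def ou_log_rate_path(host_optima, dt=0.01, theta=2.0, sigma=0.5):
    """Euler-Maruyama sketch of a log-rate following an Ornstein-Uhlenbeck
    process whose stationary mean depends on the current host:
    d(log r) = theta * (mu_host - log r) dt + sigma dW."""
    log_r = host_optima[0]           # start the lineage at its host's optimum
    path = [log_r]
    for mu in host_optima:           # one host label per time step
        log_r += theta * (mu - log_r) * dt + sigma * np.sqrt(dt) * rng.normal()
        path.append(log_r)
    return np.exp(np.array(path))

# A lineage whose host optimum jumps from log-rate 0 to 1 halfway along:
# after the jump the rate relaxes toward the new host's mean.
optima = [0.0] * 500 + [1.0] * 500
path = ou_log_rate_path(optima)
```

After the host switch the simulated rate drifts up and settles around the new host's optimum, rather than jumping instantaneously — which is the qualitative behaviour one would want such a model to capture.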

However, in the present case, I wonder if autocorrelated clocks would do much better: I am quite surprised by the biased divergence time estimates shown in Figure 1c.

I could imagine an interaction with the prior on divergence times here: is it something like a standard coalescent or a birth-death? If so, is that expected to correctly describe the structured populations represented by these data?

My idea is this: if the prior on divergence times, for some reason, really wants to push estimates for ancient nodes downwards, then, just because the relaxed clock happens to be more flexible, it will not offer enough resistance to this bias. Hence what we see in Figure 1c.


Autocorrelated relaxed clocks are implemented in BEAST but would make no sense in this case: the host switches are few, and the AC model would expect a change drawn from the same distribution at every node. Perhaps it would work if the changes were drawn from an exponential, so that most changes are small and a few are big.


This makes me wonder to what extent one could estimate the number of clocks and their branch assignments directly from the data, in a categorical version of what @nicolas_lartill does in a continuous framework here: http://onlinelibrary.wiley.com/doi/10.1111/j.1558-5646.2011.01558.x/full

Is this not the same as what Alexei Drummond and Marc Suchard’s random local clock does?


Sorry - what I wrote was ambiguous. I meant using the trait data at the tips (in this case, host data) to help the process.


I understand. In this case the major host jumps are few and have 1.0 support, but generally, yes: if there were lots of jumping, you could have a process describing the jumping that controls the rate changes. We have been doing this using Markov jumps (a method of efficiently obtaining realisations and expectations of transitions from a continuous-time Markov model).
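For illustration, a single Gillespie-style realisation of host jumps from a continuous-time Markov model might look like this (toy two-host rate matrix; all rates hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

def realise_ctmc(Q, start, t_total):
    """One Gillespie-style realisation of a continuous-time Markov chain
    along a branch of length t_total: returns the sequence of states
    visited, from which jump counts and timings can be read off."""
    state, t, visited = start, 0.0, [start]
    while True:
        exit_rate = -Q[state, state]            # total rate of leaving `state`
        t += rng.exponential(1.0 / exit_rate)   # waiting time to the next jump
        if t >= t_total:
            return visited
        off = Q[state].copy()
        off[state] = 0.0                        # must jump to a different state
        state = int(rng.choice(len(off), p=off / off.sum()))
        visited.append(state)

# Toy two-host rate matrix (jump rates per unit branch length);
# states 0 and 1 stand in for two host species.
Q = np.array([[-0.5, 0.5],
              [0.3, -0.3]])
visited = realise_ctmc(Q, start=0, t_total=10.0)
n_jumps = len(visited) - 1                      # realised number of host jumps
```

Averaging such realisations over many draws (or computing the expectations analytically, as the Markov jumps machinery does) gives the expected number of host transitions on each branch.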

This sort of process was used in the following paper to model rabies virus jumps between bat host species: http://dx.doi.org/10.1098/rstb.2012.0196 and in this one for foot-and-mouth disease virus: http://mbio.asm.org/content/4/5/e00591-13.full These two papers didn’t use host-specific rates, but that is being worked on.