Current best practice for simulating viral sequences?


I would like to simulate sequences for a population undergoing mutation and recombination in the presence of strong selection. I would like to be able to specify an arbitrary fitness landscape, hopefully with some nice parameterization. I don’t want to start from scratch and write a forward time simulator. I’m hoping to get some feedback from you all about how to do this.
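To make the request concrete, here is a toy sketch of the kind of model meant: a Wright-Fisher loop with selection, point mutation and single-breakpoint recombination, where the fitness landscape is just a function passed in. Everything in it is an assumption made for illustration (the function names, the parameters `mu` and `rho`, the constant population size, and the multiplicative example landscape); it is not taken from any of the packages discussed in this thread and is no substitute for a tested simulator.

```python
import random

ALPHABET = "ACGT"

def mutate(seq, mu, rng):
    """Replace each site by a different random base with probability mu."""
    out = []
    for base in seq:
        if rng.random() < mu:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def recombine(a, b, rho, rng):
    """With probability rho, join a prefix of a to the suffix of b."""
    if len(a) > 1 and rng.random() < rho:
        cut = rng.randrange(1, len(a))  # breakpoint strictly inside the sequence
        return a[:cut] + b[cut:]
    return a

def next_generation(pop, fitness, mu, rho, rng):
    """One generation: parents are drawn in proportion to their fitness."""
    weights = [fitness(s) for s in pop]
    n = len(pop)
    first = rng.choices(pop, weights=weights, k=n)
    second = rng.choices(pop, weights=weights, k=n)
    return [mutate(recombine(a, b, rho, rng), mu, rng)
            for a, b in zip(first, second)]

def simulate(founder, n, generations, fitness, mu=1e-3, rho=0.1, seed=0):
    """Evolve n copies of founder for a number of generations."""
    rng = random.Random(seed)
    pop = [founder] * n
    for _ in range(generations):
        pop = next_generation(pop, fitness, mu, rho, rng)
    return pop

def make_multiplicative_landscape(optimum, s):
    """Hypothetical landscape: each site matching optimum multiplies fitness by 1 + s."""
    def fitness(seq):
        matches = sum(x == y for x, y in zip(seq, optimum))
        return (1.0 + s) ** matches
    return fitness

if __name__ == "__main__":
    landscape = make_multiplicative_landscape("CCCCCCCC", 0.5)
    final_pop = simulate("AAAAAAAA", n=100, generations=50, fitness=landscape)
    print(sorted(set(final_pop))[:5])
```

Because `fitness` is an ordinary callable on the sequence, a mutant automatically gets a different fitness than its predecessor, which is the property I am after.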

Here is what I have seen so far:

@tgvaughan has written the very nice looking MASTER. However, my impression is that this involves a fixed number of types, and relative fitness is implicitly defined by the master equations determining how those types leave progeny. I would like a mutant to have a different fitness than its predecessor.

I found a software package called VIRAPOPS which is distributed as binaries. I contacted the author and the f77 code is not open-source. Not going to use a simulator that I can’t look at under the hood of, thankyouverymuch.

There is a cool-looking package called forqs that is open-source C++, but it is really designed for diploid populations that recombine via crossing-over. I’ve been in touch with the author, who was very friendly and helpful, but using it here would require some deep recoding, or some abuse.

Am I forgetting something obvious?


Sorry, I do not have a simulator in mind. But I wanted to make sure you were not overlooking “deep sequencing” data sets from HIV-1 infected individuals. HIV packages 2 copies of the viral RNA genome, and “recombination” happens when the reverse transcriptase switches from one RNA genome to the other during reverse transcription. It is not strand break/repair, but results in recombinant cDNA and provirus. Anyway, HIV is actually undergoing mutation and recombination in the presence of several types of strong selection (CTL and antibody immune responses remove more than 99% of the virions produced each day; antiretroviral drugs can remove more than 99.999%), and we have some stunningly huge data sets now from single time points and from patients followed over time.
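The template-switching mechanism described above can be sketched in a few lines. This is a deliberately simplified illustration: the function name, the per-site `switch_rate` parameter and the assumption of two aligned genomes of equal length are all hypothetical, and it ignores the direction of synthesis, obligatory strand transfers and any sequence dependence of switching.

```python
import random

def reverse_transcribe(genome_a, genome_b, switch_rate, rng):
    """Copy one product from two co-packaged genomes of equal length.

    Before each site after the first, the copying enzyme jumps to the
    other template with probability switch_rate (a hypothetical parameter).
    """
    if len(genome_a) != len(genome_b):
        raise ValueError("this sketch assumes aligned genomes of equal length")
    templates = (genome_a, genome_b)
    current = rng.randrange(2)  # start on either template at random
    product = []
    for i in range(len(genome_a)):
        if i > 0 and rng.random() < switch_rate:
            current = 1 - current  # template switch, no break/repair involved
        product.append(templates[current][i])
    return "".join(product)

if __name__ == "__main__":
    rng = random.Random(42)
    print(reverse_transcribe("AAAAAAAAAA", "CCCCCCCCCC", 0.2, rng))
```

Note that in this picture switching only produces a visibly recombinant product when the two packaged genomes differ; with identical copies every switch is silent.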


Hi @ematsen!

I actually have some Python scripts to leverage Vaughan’s MASTER to do something like this:


It basically generates a rather large and complex configuration file for MASTER rather than being any clever code modification.

It probably is not ideal for your purposes, because you have to set the maximum number of strains that can evolve (and things get pretty slow once you get above 5 or 6 strains). I allow fitness to vary as well, but the ways in which it can vary are limited. So again, not quite what you are looking for. But maybe with some imagination you can tweak it to get what you want?

I actually did begin work on a from-the-ground-up viral phylogeny simulation in C++11 last year. It has been on the back-burner for a long time, mainly because it was difficult to justify working on it when other things could get me almost what I want with a little bit of hacking, so I would not expect anything quickly! Looking forward to someone else suggesting an alternative so that I do not disappear down the rabbit-hole to work on this toy (at the expense of other, ongoing “real” projects).


Hi Eric, you’re right about MASTER probably not being the best choice here. While it’s strictly possible to express non-neutral population genetics models within the chemical master equation framework the program operates in, this isn’t practical. (Each genetically distinct sub-population would need to be given its own type/location tag.) MASTER is mostly about neutral phylodynamics at this stage.
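A quick back-of-the-envelope count shows why one type per genotype does not scale. The sketch below only does arithmetic; the function names are made up for illustration and nothing here uses or describes MASTER's actual input format. It assumes a four-letter alphabet and counts one directed reaction per single-site substitution.

```python
def count_types(num_sites, alphabet_size=4):
    """Number of distinct genotypes, i.e. types that would each need a tag."""
    return alphabet_size ** num_sites

def count_mutation_reactions(num_sites, alphabet_size=4):
    """Directed single-site mutation reactions between those types.

    Each genotype can mutate to (alphabet_size - 1) * num_sites neighbours.
    """
    neighbours = (alphabet_size - 1) * num_sites
    return count_types(num_sites, alphabet_size) * neighbours

if __name__ == "__main__":
    for num_sites in (2, 5, 10):
        print(num_sites, count_types(num_sites), count_mutation_reactions(num_sites))
```

Even a 10-site region already needs on the order of a million types, before any recombination reactions between pairs of types are added.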


I find these data sets exceedingly interesting, but we don’t know the sequence of events that led to those deep sequencing data sets (unless you have something I don’t know about over at LANL!).

I can’t wait for those to become public!

@jeetsukumaran, you are a madman. I mean this in the most complimentary sense.

@tgvaughan… that “mostly” is of course piquing my interest. Do you have some future plans?


No, nothing solid, but it’s a problem I’ve thought about often!


Ages ago we (Alexei Drummond, Koen Deforche and others) wrote a simulator like this. You set it up using an XML description language. The aim was to simulate complete genomes with selection and recombination for population sizes of 10^6 and higher.

I haven’t had time to work on it and I have basically let people in Annemie Vandamme’s group try to take it on. It is FOSS Java code and can be found here:

Documentation is here:


Some applications only need to model evolution in a single potential epitope region or even at a single amino acid site. I wonder if MASTER would be useful there.


@arambaut, I just had a chance to look through the documentation. I’m impressed, especially by the component-wise flexibility you built into it.

Can I assume that the functionality in the wiki works as described?


It seems like this sort of data set is not kept on the LANL HIV site, and rather lives on the SRA. Do you have any collections of links to such data sets? I’m especially interested in longitudinal samples through time. Thanks, @BrianFoley!


The LANL HIV Database is also offering to store next generation data here. Our goal is to have not only the “raw data” but also alignments in a more useful format, with sequence names (or sequence ID plus an accompanying spreadsheet of sequence information: patient ID, sample date, etc.) that are informative.


This looks great, @BrianFoley, but I wish there were more! I know that this is a question of users not depositing data at LANL, but still, a list of data sets available at SRA or something would be really nice.

By the way, I have found the DNAnexus interface to the SRA to be much better than the original.