I’m working on a project in which we are trying to infer infection time using viral sequence data. Some subjects have samples from multiple time points, while others have only one. We are using a tree height estimate from BEAST for our timing inference.
Sometimes the data sets contain quite a number of duplicate sequences. Because these sequences are barcoded, we believe they come from different template molecules, rather than just being PCR replicates.
Including duplicates, it’s not uncommon for us to have > 2,000 sequences. This is too many for BEAST to run on happily. When we deduplicate, we get down to about 200. However, this violates our demographic assumptions by sampling sequences in a biased way: collapsing identical sequences over-represents rare variants relative to common ones, so it will appear that our virus has diversified more rapidly than if we hadn’t deduplicated.
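For concreteness, our deduplication step amounts to something like the sketch below (the function name and toy data are made up for illustration). Note that we do retain the per-sequence multiplicities, so in principle a method could use those counts to undo the sampling bias:

```python
from collections import Counter

def deduplicate(seqs):
    """Collapse identical sequences, keeping the count of each.

    `seqs` is a list of aligned sequence strings. Returns the unique
    sequences plus a Counter of multiplicities, so the duplicate
    information survives for any downstream reweighting.
    """
    counts = Counter(seqs)
    return list(counts.keys()), counts

# Toy example (hypothetical reads, not our real alignment):
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTTT"]
unique, counts = deduplicate(reads)
# 6 reads collapse to 3 unique sequences; counts["ACGT"] == 3
```

In our real data this is what takes us from > 2,000 sequences down to about 200 unique ones, at the cost of distorting the sequence frequency spectrum that the demographic model sees.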
What do you suggest? It seems like a question that others must have encountered, and perhaps fertile ground for new methods.