This is an archived static version of the original phylobabble.org discussion site.

# [Paper] Hypermutable DNA chronicles the evolution of human colon cancer

ematsen

Here’s an example figure, but note that the “alignment” is the transpose of how we usually think of one.

Phylogenetic Reconstruction. We reconstructed phylogenies using two independent approaches. First, we calculated a distance matrix for each patient using an “equal or not” distance (31). This method increases the distances between two samples if they have unequal genotypes, regardless of the magnitude of the difference. We then used neighbor-joining (51) in R to infer the phylogenetic relationships between samples. In the very rare case of missing values, we imputed them using the nearest neighbor. We used bootstrapping with 1,000 replicates to test the reliability of the resulting trees (52) and collapsed all interior branches with bootstrap values below 70% into polytomies. Next, we used Bayesian inference of phylogeny, a methodology that relies on a fundamentally different set of principles than neighbor-joining, to construct the phylogenies. The results were almost identical in all cases, confirming the robustness of our approach. Bayesian phylogenies and posterior probability values for all clades are presented in SI Appendix, Fig. S10. We used the software MrBayes (53) with the same model parameters that were previously used for the analysis of poly-G tract mutation profiles (21).
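
For anyone who wants to play with the “equal or not” distance described in the quoted methods, here is a minimal Python sketch. The function name, the toy genotype matrix, and the choice to report a raw count rather than a fraction of markers are my assumptions, not details taken from the paper; the authors did their analysis in R.

```python
import numpy as np

def equal_or_not_distance(genotypes):
    """Pairwise count of markers at which two samples have unequal genotypes.

    genotypes: (n_samples, n_markers) array of genotype calls.
    """
    g = np.asarray(genotypes)
    # Compare every pair of samples marker by marker. Any difference counts
    # as 1, regardless of how far apart the two genotype values are.
    unequal = g[:, None, :] != g[None, :, :]
    return unequal.sum(axis=2)

# Hypothetical genotypes: 4 samples x 5 markers (values are illustrative only).
genotypes = np.array([
    [0, 0, 0, 0, 0],
    [1, 0, 0, 2, 0],
    [1, 0, 0, 5, 0],
    [1, 1, 0, 5, 3],
])
D = equal_or_not_distance(genotypes)
```

Note that samples 2 and 3 of the toy matrix (rows 1 and 2) differ by 3 units at one marker but still get distance 1. The resulting matrix could then be handed to any neighbor-joining implementation.
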
ematsen

And another paper about phylogenetics and cell lineages, but with a different data type.

I think it would be fun to have a phyloseminar on these papers. Opinions?

@mathmomike — has there been any recent work from the “tree of cells” people?

ematsen

If anyone else is interested in following this topic, there has been an interesting recent contribution:

which really takes a completely different approach. Rather than focusing on phylogenetic inference itself, this new work focuses on inferring the posterior on per-site clonal genotypes in a mixed sample. They then build phylogenetic trees by either manually solving the perfect phylogeny problem or, if that is not possible, finding “nearby” genotype inferences for which the perfect phylogeny problem is soluble.
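
As a rough illustration of what “the perfect phylogeny problem is soluble” means for binary genotypes, here is a small Python sketch of the classical pairwise compatibility check. It assumes an unmutated (all-zero) ancestor and binary present/absent genotypes, and the function name and toy matrices are mine; it is not the procedure used in the paper.

```python
from itertools import combinations

def admits_perfect_phylogeny(matrix):
    """True if a binary genotype matrix (rows = clones, columns = sites,
    1 = mutated, all-zero ancestor assumed) admits a perfect phylogeny."""
    n_sites = len(matrix[0]) if matrix else 0
    # For each site, collect the set of clones that carry the mutation.
    carriers = [frozenset(i for i, row in enumerate(matrix) if row[j])
                for j in range(n_sites)]
    for a, b in combinations(carriers, 2):
        # Compatible pairs are disjoint or nested; a partial overlap
        # would require the same mutation to arise twice (or revert).
        if a & b and not (a <= b or b <= a):
            return False
    return True

# Hypothetical genotype matrices for illustration.
compatible = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
conflicting = [[1, 0], [1, 1], [0, 1]]
```

When the check fails, the “nearby genotypes” idea amounts to searching for a small change to the matrix that makes every pair of sites compatible again.
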

ematsen

Related papers keep coming!

http://genome.cshlp.org/content/early/2014/07/24/gr.180281.114.abstract

TITAN: Inference of copy number architectures in clonal cell populations from tumor whole genome sequence data

Gavin Ha¹, Andrew Roth¹, Jaswinder Khattra¹, Julie Ho², Damian Yap¹, Leah M Prentice², Nataliya Melnyk², Andrew McPherson¹, Ali Bashashati¹, Emma Laks¹, Justina Biele¹, Jiarui Ding¹, Alan Le¹, Jamie Rosner¹, Karey Shumansky¹, Marco A Marra³, C Blake Gilks⁴, David G Huntsman², Jessica N McAlpine⁵, Samuel Aparicio¹ and Sohrab P Shah¹,⁶

¹ BC Cancer Agency; ² Centre for Translational and Applied Genomics; ³ Genome Sciences Centre; ⁴ Vancouver General Hospital; ⁵ University of British Columbia. Corresponding author email: sshah@bccrc.ca

The evolution of cancer genomes within a single tumor creates mixed cell populations with divergent somatic mutational landscapes. Inference of tumor subpopulations has been disproportionately focused on the assessment of somatic point mutations, whereas computational methods targeting evolutionary dynamics of copy number alterations (CNA) and loss of heterozygosity (LOH) in whole genome sequencing data remain under-developed. We present a novel probabilistic model, TITAN, to infer CNA and LOH events while accounting for mixtures of cell populations, thereby estimating the proportion of cells harboring each event. We evaluate TITAN on idealized mixtures, simulating clonal populations from whole genome sequences taken from genomically heterogeneous ovarian tumor sites collected from the same patient. In addition, we show in 23 whole genomes of breast tumors that inference of CNA and LOH using TITAN critically inform population structure and the nature of the evolving cancer genome. Finally, we experimentally validated subclonal predictions using fluorescence in situ hybridization (FISH) and single-cell sequencing from an ovarian cancer patient sample, thereby recapitulating the key modeling assumptions of TITAN.

ematsen

And another!

BayClone: Bayesian Nonparametric Inference of Tumor Subclones Using NGS Data

Subhajit Sengupta¹, Jin Wang², Juhee Lee³, Peter Muller⁴, Kamalakar Gulukota⁵, Arunava Banerjee⁶, Yuan Ji¹,⁷

¹ Center for Biomedical Research Informatics, NorthShore University HealthSystem; ² Department of Statistics, University of Illinois at Urbana-Champaign; ³ Department of Applied Mathematics and Statistics, University of California Santa Cruz; ⁴ Department of Mathematics, University of Texas Austin; ⁵ Center for Molecular Medicine, NorthShore University HealthSystem; ⁶ Department of Computer & Information Science & Engineering, University of Florida; ⁷ Department of Health Studies, The University of Chicago

In this paper, we present a novel feature allocation model to describe tumor heterogeneity (TH) using next-generation sequencing (NGS) data. Taking a Bayesian approach, we extend the Indian buffet process (IBP) to define a class of nonparametric models, the categorical IBP (cIBP). A cIBP takes categorical values to denote homozygous or heterozygous genotypes at each SNV. We define a subclone as a vector of these categorical values, each corresponding to an SNV. Instead of partitioning somatic mutations into non-overlapping clusters with similar cellular prevalences, we took a different approach using feature allocation. Importantly, we do not assume somatic mutations with similar cellular prevalence must be from the same subclone and allow overlapping mutations shared across subclones. We argue that this is closer to the underlying theory of phylogenetic clonal expansion, as somatic mutations occurred in parent subclones should be shared across the parent and child subclones. Bayesian inference yields posterior probabilities of the number, genotypes, and proportions of subclones in a tumor sample, thereby providing point estimates as well as variabilities of the estimates for each subclone. We report results on both simulated and real data.
BayClone is available at http://health.bsd.uchicago.edu/yji/soft.html.
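
To get a feel for what a feature allocation prior looks like, here is a minimal Python sketch that draws a binary matrix from the standard Indian buffet process. This is the ordinary binary IBP, not the categorical cIBP defined in the paper, and reading rows as SNVs and columns as subclones is my assumption; the function name and parameter values are illustrative only.

```python
import numpy as np

def sample_ibp(n_customers, alpha, rng):
    """Draw a binary feature matrix from the standard Indian buffet process."""
    dish_counts = []  # number of earlier customers who took each dish
    rows = []
    for i in range(1, n_customers + 1):
        # Take each existing dish k with probability m_k / i.
        row = [int(rng.random() < m / i) for m in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        # Then sample Poisson(alpha / i) brand-new dishes.
        n_new = rng.poisson(alpha / i)
        row.extend([1] * n_new)
        dish_counts.extend([1] * n_new)
        rows.append(row)
    # Pad earlier rows with zeros for dishes introduced later.
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

# Illustrative draw: 10 rows (read here as SNVs), concentration alpha = 2.
Z = sample_ibp(10, alpha=2.0, rng=np.random.default_rng(0))
```

A row may have several 1s, which is the point of the abstract's remark about overlapping mutations: a mutation can belong to more than one subclone, unlike in a clustering model.
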

ematsen

And another!

Comparing Nonparametric Bayesian Tree Priors for Clonal Reconstruction of Tumors

Amit G. Deshwar, Shankar Vembu, Quaid Morris

Statistical machine learning methods, especially nonparametric Bayesian methods, have become increasingly popular to infer clonal population structure of tumors. Here we describe the treeCRP, an extension of the Chinese restaurant process (CRP), a popular construction used in nonparametric mixture models, to infer the phylogeny and genotype of major subclonal lineages represented in the population of cancer cells. We also propose new split-merge updates tailored to the subclonal reconstruction problem that improve the mixing time of Markov chains. In comparisons with the tree-structured stick breaking prior used in PhyloSub, we demonstrate superior mixing and running time using the treeCRP with our new split-merge procedures. We also show that given the same number of samples, TSSB and treeCRP have similar ability to recover the subclonal structure of a tumor.

http://arxiv.org/abs/1408.2552
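
For readers who have not met the Chinese restaurant process that treeCRP builds on, here is a minimal Python sketch of the standard CRP as a clustering prior. It is only the plain, flat CRP: the tree structure and split-merge updates from the paper are not represented, and the function name and parameter values are my own.

```python
import numpy as np

def sample_crp(n_items, alpha, rng):
    """Assign items to clusters by the standard Chinese restaurant process."""
    table_sizes = []   # current number of items at each table (cluster)
    assignments = []   # table index chosen by each item, in arrival order
    for _ in range(n_items):
        # An existing table k has weight n_k; a new table has weight alpha.
        weights = np.array(table_sizes + [alpha], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, table_sizes

# Illustrative draw: 20 items (read here as mutations), concentration alpha = 1.
assignments, table_sizes = sample_crp(20, alpha=1.0, rng=np.random.default_rng(0))
```

Each item ends up in exactly one cluster, and the number of clusters is not fixed in advance, which is what makes the construction attractive when the number of subclonal lineages is unknown.
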

ematsen

… !!!

Bayesian Inference for Tumor Subclones Accounting for Sequencing and Structural Variants

Juhee Lee, Peter Mueller, Subhajit Sengupta, Kamalakar Gulukota, Yuan Ji

Tumor samples are heterogeneous. They consist of different subclones that are characterized by differences in DNA nucleotide sequences and copy numbers on multiple loci. Heterogeneity can be measured through the identification of the subclonal copy number and sequence at a selected set of loci. Understanding that the accurate identification of variant allele fractions greatly depends on a precise determination of copy numbers, we develop a Bayesian feature allocation model for jointly calling subclonal copy numbers and the corresponding allele sequences for the same loci. The proposed method utilizes three random matrices, L, Z and w to represent subclonal copy numbers (L), numbers of subclonal variant alleles (Z) and cellular fractions of subclones in samples (w), respectively. The unknown number of subclones implies a random number of columns for these matrices. We use next-generation sequencing data to estimate the subclonal structures through inference on these three matrices. Using simulation studies and a real data analysis, we demonstrate how posterior inference on the subclonal structure is enhanced with the joint modeling of both structure and sequencing variants on subclonal genomes. Software is available at this http URL

http://arxiv.org/abs/1409.7158
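
To make the roles of the three matrices concrete, here is a small Python sketch with made-up numbers. It treats each sample as a plain mixture of subclones and computes an expected variant allele fraction per sample and locus; this simple mixture, the absence of a normal-cell background, and all the values are my assumptions, and the paper's actual sampling model may differ.

```python
import numpy as np

# Hypothetical values: 3 loci, 2 subclones, 2 samples.
L = np.array([[2, 3], [2, 2], [2, 1]])   # subclonal copy numbers, loci x subclones
Z = np.array([[0, 1], [1, 1], [0, 1]])   # variant allele counts, loci x subclones
w = np.array([[0.7, 0.3], [0.2, 0.8]])   # cellular fractions, samples x subclones

# Average copy number and average variant copies per sample and locus,
# weighting each subclone by its cellular fraction in that sample.
mean_copies = w @ L.T      # samples x loci
mean_variants = w @ Z.T    # samples x loci

# Expected variant allele fraction under this simple mixture assumption.
expected_vaf = mean_variants / mean_copies
```

The sketch shows why the abstract stresses joint calling: the same `Z` gives different expected allele fractions once the copy numbers in `L` change.
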