This is an archived static version of the original phylobabble.org discussion site.

[Paper] Hypermutable DNA chronicles the evolution of human colon cancer

ematsen

Here’s an example figure, but note that the “alignment” is the transpose of how we usually think of it.

Phylogenetic Reconstruction. We reconstructed phylogenies using two independent approaches. First, we calculated a distance matrix for each patient using an “equal or not” distance (31). This method increases the distances between two samples if they have unequal genotypes, regardless of the magnitude of the difference. We then used neighbor-joining (51) in R to infer the phylogenetic relationships between samples. In the very rare case of missing values, we imputed them using the nearest neighbor. We used bootstrapping with 1,000 replicates to test the reliability of the resulting trees (52) and collapsed all interior branches with bootstrap values below 70% into polytomies. Next, we used Bayesian inference of phylogeny, a methodology that relies on a fundamentally different set of principles than neighbor-joining, to construct the phylogenies. The results were almost identical in all cases, confirming the robustness of our approach. Bayesian phylogenies and posterior probability values for all clades are presented in SI Appendix, Fig. S10. We used the software MrBayes (53) with the same model parameters that were previously used for the analysis of poly-G tract mutation profiles (21).
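
For anyone who wants to see the “equal or not” distance from the quoted methods concretely, here is a minimal Python sketch. It assumes the distance is simply the count of markers at which two samples differ; the paper may normalize or weight it differently. The sample names and genotype values are made up for illustration, and the neighbor-joining and bootstrap steps are not shown.

```python
from itertools import combinations


def equal_or_not_distance(a, b):
    """Count the markers at which two samples have unequal genotypes.

    The size of the difference is ignored: any mismatch adds exactly 1.
    """
    if len(a) != len(b):
        raise ValueError("samples must be genotyped at the same markers")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(samples):
    """Build a symmetric nested-dict distance matrix over all samples."""
    names = list(samples)
    dist = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = equal_or_not_distance(samples[n], samples[m])
        dist[n][m] = dist[m][n] = d
    return dist


# Hypothetical genotypes at five markers (illustrative values only).
samples = {
    "normal": [0, 0, 0, 0, 0],
    "primary": [1, 0, -2, 0, 0],
    "met": [1, 0, -3, 1, 0],
}
matrix = distance_matrix(samples)
```

In this toy input, "primary" and "met" differ at the third marker by one unit (-2 versus -3), and that mismatch counts the same as any larger difference would. A matrix like this is what would then be handed to a neighbor-joining implementation.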

ematsen

And another paper about phylogenetics and cell lineages, but with a different data type.

ncbi.nlm.nih.gov

Phylogenetic quantification of intra-tumour heterogeneity.

RF Schwarz, A Trinh, B Sipos, JD Brenton, N Goldman and F Markowetz, PLoS computational biology, Apr 2014

Intra-tumour genetic heterogeneity is the result of ongoing evolutionary change within each cancer. The expansion of genetically distinct sub-clonal populations may explain the emergence of drug resistance, and if so, would have prognostic and predictive utility. However, methods for objectively quantifying tumour heterogeneity have been missing and are particularly difficult to establish in cancers where predominant copy number variation prevents accurate phylogenetic reconstruction owing to horizontal dependencies caused by long and cascading genomic rearrangements. To address these challenges, we present MEDICC, a method for phylogenetic reconstruction and heterogeneity quantification based on a Minimum Event Distance for Intra-tumour Copy-number Comparisons. Using a transducer-based pairwise comparison function, we determine optimal phasing of major and minor alleles, as well as evolutionary distances between samples, and are able to reconstruct ancestral genomes. Rigorous simulations and an extensive clinical study show the power of our method, which outperforms state-of-the-art competitors in reconstruction accuracy, and additionally allows unbiased numerical quantification of tumour heterogeneity. Accurate quantification and evolutionary inference are essential to understand the functional consequences of tumour heterogeneity. The MEDICC algorithms are independent of the experimental techniques used and are applicable to both next-generation sequencing and array CGH data.

I think it would be fun to have a phyloseminar on these papers. Opinions?

@mathmomike — has there been any recent work from the “tree of cells” people?

ematsen

If anyone else is interested in following this topic, there has been an interesting recent contribution:

ncbi.nlm.nih.gov

Inferring clonal composition from multiple sections of a breast cancer.

H Zare, J Wang, A Hu, K Weber, J Smith, D Nickerson, C Song, D Witten, CA Blau and WS Noble, PLoS computational biology, Jul 2014

Cancers arise from successive rounds of mutation and selection, generating clonal populations that vary in size, mutational content and drug responsiveness. Ascertaining the clonal composition of a tumor is therefore important both for prognosis and therapy. Mutation counts and frequencies resulting from next-generation sequencing (NGS) potentially reflect a tumor's clonal composition; however, deconvolving NGS data to infer a tumor's clonal structure presents a major challenge. We propose a generative model for NGS data derived from multiple subsections of a single tumor, and we describe an expectation-maximization procedure for estimating the clonal genotypes and relative frequencies using this model. We demonstrate, via simulation, the validity of the approach, and then use our algorithm to assess the clonal composition of a primary breast cancer and associated metastatic lymph node. After dividing the tumor into subsections, we perform exome sequencing for each subsection to assess mutational content, followed by deep sequencing to precisely count normal and variant alleles within each subsection. By quantifying the frequencies of 17 somatic variants, we demonstrate that our algorithm predicts clonal relationships that are both phylogenetically and spatially plausible. Applying this method to larger numbers of tumors should cast light on the clonal evolution of cancers in space and time.

which really takes a completely different approach. Rather than focusing on phylogenetic inference itself, this new work focuses on inferring the posterior on per-site clonal genotypes in a mixed sample. They then build phylogenetic trees by either manually solving the perfect phylogeny problem or, if that is not possible, finding “nearby” genotype inferences for which the perfect phylogeny problem is soluble.
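
As a rough illustration of the perfect phylogeny condition mentioned above, here is a small Python sketch. It assumes binary genotypes (1 = clone carries the mutation), an unmutated root, and no back mutation, in which case a perfect phylogeny exists exactly when the carrier sets of any two sites are either disjoint or nested. The function name and the toy matrices are hypothetical, and this is not the procedure used in the paper.

```python
from itertools import combinations


def admits_perfect_phylogeny(genotypes):
    """Check whether a binary clone-by-site matrix fits a rooted perfect phylogeny.

    genotypes: list of rows, one per clone, each a list of 0/1 values per site.
    Assumes the root carries no mutations and mutations are never lost.
    """
    if not genotypes:
        return True
    n_sites = len(genotypes[0])
    # For each site, the set of clones that carry the mutation.
    carriers = [
        frozenset(i for i, row in enumerate(genotypes) if row[j])
        for j in range(n_sites)
    ]
    for a, b in combinations(carriers, 2):
        overlapping = bool(a & b)
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False  # the two sites conflict
    return True


# Site 0 is shared by all clones, site 1 by clones 1 and 2, site 2 by clone 2 only.
nested_clones = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

# Clone 1 carries both mutations, but clones 0 and 2 each carry only one.
conflicting_clones = [[1, 0], [1, 1], [0, 1]]
```

Here `admits_perfect_phylogeny(nested_clones)` is True and `admits_perfect_phylogeny(conflicting_clones)` is False. The "nearby" genotype search described above would amount to perturbing a conflicting matrix until a check like this passes.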

ematsen

Related papers keep coming!

http://genome.cshlp.org/content/early/2014/07/24/gr.180281.114.abstract

TITAN: Inference of copy number architectures in clonal cell populations from tumor whole genome sequence data

Gavin Ha (1), Andrew Roth (1), Jaswinder Khattra (1), Julie Ho (2), Damian Yap (1), Leah M Prentice (2), Nataliya Melnyk (2), Andrew McPherson (1), Ali Bashashati (1), Emma Laks (1), Justina Biele (1), Jiarui Ding (1), Alan Le (1), Jamie Rosner (1), Karey Shumansky (1), Marco A Marra (3), C Blake Gilks (4), David G Huntsman (2), Jessica N McAlpine (5), Samuel Aparicio (1) and Sohrab P Shah (1,6)

Author affiliations: (1) BC Cancer Agency; (2) Centre for Translational and Applied Genomics; (3) Genome Sciences Centre; (4) Vancouver General Hospital; (5) University of British Columbia. Corresponding author email: sshah@bccrc.ca

The evolution of cancer genomes within a single tumor creates mixed cell populations with divergent somatic mutational landscapes. Inference of tumor subpopulations has been disproportionately focused on the assessment of somatic point mutations, whereas computational methods targeting evolutionary dynamics of copy number alterations (CNA) and loss of heterozygosity (LOH) in whole genome sequencing data remain under-developed. We present a novel probabilistic model, TITAN, to infer CNA and LOH events while accounting for mixtures of cell populations, thereby estimating the proportion of cells harboring each event. We evaluate TITAN on idealized mixtures, simulating clonal populations from whole genome sequences taken from genomically heterogeneous ovarian tumor sites collected from the same patient. In addition, we show in 23 whole genomes of breast tumors that inference of CNA and LOH using TITAN critically inform population structure and the nature of the evolving cancer genome. Finally, we experimentally validated subclonal predictions using fluorescence in situ hybridization (FISH) and single-cell sequencing from an ovarian cancer patient sample, thereby recapitulating the key modeling assumptions of TITAN.

ematsen

And another!

BayClone: Bayesian Nonparametric Inference of Tumor Subclones Using NGS Data

Subhajit Sengupta (1), Jin Wang (2), Juhee Lee (3), Peter Muller (4), Kamalakar Gulukota (5), Arunava Banerjee (6), Yuan Ji (1,7)

(1) Center for Biomedical Research Informatics, NorthShore University HealthSystem; (2) Department of Statistics, University of Illinois at Urbana-Champaign; (3) Department of Applied Mathematics and Statistics, University of California Santa Cruz; (4) Department of Mathematics, University of Texas Austin; (5) Center for Molecular Medicine, NorthShore University HealthSystem; (6) Department of Computer & Information Science & Engineering, University of Florida; (7) Department of Health Studies, The University of Chicago

In this paper, we present a novel feature allocation model to describe tumor heterogeneity (TH) using next-generation sequencing (NGS) data. Taking a Bayesian approach, we extend the Indian buffet process (IBP) to define a class of nonparametric models, the categorical IBP (cIBP). A cIBP takes categorical values to denote homozygous or heterozygous genotypes at each SNV. We define a subclone as a vector of these categorical values, each corresponding to an SNV. Instead of partitioning somatic mutations into non-overlapping clusters with similar cellular prevalences, we took a different approach using feature allocation. Importantly, we do not assume somatic mutations with similar cellular prevalence must be from the same subclone and allow overlapping mutations shared across subclones. We argue that this is closer to the underlying theory of phylogenetic clonal expansion, as somatic mutations occurred in parent subclones should be shared across the parent and child subclones. Bayesian inference yields posterior probabilities of the number, genotypes, and proportions of subclones in a tumor sample, thereby providing point estimates as well as variabilities of the estimates for each subclone. We report results on both simulated and real data.
BayClone is available at http://health.bsd.uchicago.edu/yji/soft.html.
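
The feature allocation idea in this abstract can be illustrated with a toy Python example. This is only a sketch under simplifying assumptions: genotypes are reduced to binary presence or absence (the abstract describes categorical homozygous or heterozygous values), and the SNV names, subclone proportions, and helper function are all invented for illustration. It is not BayClone's model or inference.

```python
from fractions import Fraction

# Rows are SNVs, columns are subclones; 1 means the subclone carries the SNV.
# Subclone 1 is imagined as a child of subclone 0, so it inherits snv_a.
Z = {
    "snv_a": [1, 1, 0],
    "snv_b": [0, 1, 0],
    "snv_c": [0, 0, 1],
}

# Hypothetical proportions of the three subclones in one sample (sum to 1).
proportions = [Fraction(1, 5), Fraction(3, 10), Fraction(1, 2)]


def cellular_prevalence(carriers, proportions):
    """Fraction of cells carrying an SNV: total weight of the subclones that have it."""
    return sum(w for has, w in zip(carriers, proportions) if has)


prevalence = {snv: cellular_prevalence(row, proportions) for snv, row in Z.items()}
```

In this toy case snv_a and snv_c both come out at a prevalence of 1/2 although they sit in different subclones, so clustering mutations by prevalence alone would wrongly group them. Meanwhile snv_a is shared by a parent and its child, which a strict partition of mutations into clusters cannot express.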

ematsen

And another!

Comparing Nonparametric Bayesian Tree Priors for Clonal Reconstruction of Tumors

Amit G. Deshwar, Shankar Vembu, Quaid Morris

Statistical machine learning methods, especially nonparametric Bayesian methods, have become increasingly popular to infer clonal population structure of tumors. Here we describe the treeCRP, an extension of the Chinese restaurant process (CRP), a popular construction used in nonparametric mixture models, to infer the phylogeny and genotype of major subclonal lineages represented in the population of cancer cells. We also propose new split-merge updates tailored to the subclonal reconstruction problem that improve the mixing time of Markov chains. In comparisons with the tree-structured stick breaking prior used in PhyloSub, we demonstrate superior mixing and running time using the treeCRP with our new split-merge procedures. We also show that given the same number of samples, TSSB and treeCRP have similar ability to recover the subclonal structure of a tumor.
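
For readers who have not met the Chinese restaurant process before, here is a minimal Python sketch of sampling from the plain CRP. This is the standard construction only, not the treeCRP extension or the split-merge updates described in the abstract; the function name and the concentration parameter name `alpha` are my own choices.

```python
import random


def sample_crp(n_items, alpha, rng):
    """Draw a partition of n_items from a Chinese restaurant process.

    Each new item joins an existing table with probability proportional to
    that table's current size, or opens a new table with probability
    proportional to alpha (which must be positive).
    Returns (assignments, table_sizes).
    """
    assignments = []
    table_sizes = []
    for _ in range(n_items):
        # Weights: one per existing table, plus alpha for a new table.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)  # open a new table
        else:
            table_sizes[k] += 1
        assignments.append(k)
    return assignments, table_sizes


rng = random.Random(0)  # fixed seed so the draw is reproducible
assignments, table_sizes = sample_crp(20, alpha=1.0, rng=rng)
```

In the subclonal setting, the "items" would be mutations and the "tables" would be clusters of mutations. The number of tables is not fixed in advance, which is the appeal of this family of priors.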

http://arxiv.org/abs/1408.2552

ematsen

… !!!

Bayesian Inference for Tumor Subclones Accounting for Sequencing and Structural Variants

Juhee Lee, Peter Mueller, Subhajit Sengupta, Kamalakar Gulukota, Yuan Ji

Tumor samples are heterogeneous. They consist of different subclones that are characterized by differences in DNA nucleotide sequences and copy numbers on multiple loci. Heterogeneity can be measured through the identification of the subclonal copy number and sequence at a selected set of loci. Understanding that the accurate identification of variant allele fractions greatly depends on a precise determination of copy numbers, we develop a Bayesian feature allocation model for jointly calling subclonal copy numbers and the corresponding allele sequences for the same loci. The proposed method utilizes three random matrices, L, Z and w to represent subclonal copy numbers (L), numbers of subclonal variant alleles (Z) and cellular fractions of subclones in samples (w), respectively. The unknown number of subclones implies a random number of columns for these matrices. We use next-generation sequencing data to estimate the subclonal structures through inference on these three matrices. Using simulation studies and a real data analysis, we demonstrate how posterior inference on the subclonal structure is enhanced with the joint modeling of both structure and sequencing variants on subclonal genomes. Software is available at this http URL

http://arxiv.org/abs/1409.7158
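
To make the roles of the three matrices in the last abstract a bit more concrete, here is a small Python sketch of how L, Z and w might combine into an expected variant allele fraction at a single locus in a single sample. The formula used (weighted variant copies divided by weighted total copies) is my simplified reading of the abstract, and the inclusion of a normal cell population as one column is an assumption. The paper's actual likelihood may differ.

```python
from fractions import Fraction


def expected_vaf(copy_numbers, variant_copies, weights):
    """Expected variant allele fraction at one locus in one sample.

    copy_numbers:   total copies of the locus per subclone (a row of L)
    variant_copies: copies carrying the variant per subclone (a row of Z)
    weights:        cellular fraction of each subclone in the sample (a row of w)
    """
    variant = sum(w * z for w, z in zip(weights, variant_copies))
    total = sum(w * l for w, l in zip(weights, copy_numbers))
    return variant / total


# Hypothetical locus: a normal population plus two subclones with a copy gain.
copy_numbers = [2, 3, 3]
variant_copies = [0, 1, 2]
weights = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

vaf = expected_vaf(copy_numbers, variant_copies, weights)  # Fraction(3, 10)
```

The point of the toy numbers is that the same observed fraction could arise from quite different copy number and genotype combinations, which is why the abstract argues for calling the two jointly.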