Deep multi-region whole-genome sequencing reveals heterogeneity and gene-by-environment interactions in treatment-naive, metastatic lung cancer

Our understanding of genomic heterogeneity in lung cancer is largely based on the analysis of early-stage surgical specimens. Here we used endoscopic sampling of paired primary and intrathoracic metastatic tumors from 11 lung cancer patients to map genomic heterogeneity inoperable lung cancer with deep whole-genome sequencing. Intra-patient heterogeneity in driver or targetable mutations was predominantly in the form of copy number gain. Private mutation signatures, including patterns consistent with defects in homologous recombination, were highly variable both within and between patients. Irrespective of histotype, we observed a smaller than expected number of private mutations, suggesting that ancestral clones accumulated large mutation burdens immediately prior to metastasis. Single-region whole-genome sequencing of from 20 patients showed that tumors in ever-smokers with the strongest tobacco signatures were associated with germline variants in genes implicated in the repair of cigarette-induced DNA damage. Our results suggest that lung cancer precursors in ever-smokers accumulate large numbers of mutations prior to the formation of frank malignancy followed by rapid metastatic spread. In advanced lung cancer, germline variants in DNA repair genes may interact with the airway environment to influence the pattern of founder mutations, whereas similar interactions with the tumor microenvironment may play a role in the acquisition of mutations following metastasis.

. Overview of all cases analyzed with whole genome sequencing. Oncoprint outputs depicting SNV, Indel and CNV events in pan-cancer driver genes for each sample are shown with heterogeneous mutations highlighted. Lung cancer driver genes are highlighted in bold. LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; SCLC, small cell lung cancer.     . Circos plots from multi-region samples analyzed by WGS with patterns of structural variants consistent with unstable genomes. P, primary tumor; metastases are indicated by lymph node station except for Pl (pleura) and IP (intrapulmonary). LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; SCLC, small cell lung cancer. Figure S8. Overview of data summarized case-by-case.

TCGA data set
Patients from TCGA LUAD and LUSC datasets 5,6 were selected based upon documented smoking history, and no record of previous malignancy or neoadjuvant therapy. Patients with tumors with EGFR mutations or EML-ALK fusions were excluded. Patients with multiple tumor samples were considered as one case and the tumor sample randomly selected. Data were obtained from TCGA as BAM files and variants were called and processed using the same approach as for the EBUS-TBNA data.

Somatic signatures
Somatic signatures were determined using somatic variants identified by Strelka. We used Strelka default PASS variants only, which enriches for real somatic variants by excluding variants present in the germline and variants with insufficient evidence of somatic status, and was then used to process the variants and calculate motif matrices of the 96 possible mutation contexts. These matrices were compared against the 30 COSMIC mutational signature profiles identified as part of The Cancer Genome Atlas 7 to determine loadings, fits and residuals using the non-negative least squares algorithm implemented in the nnls R package (cran.r-project.org/web/packages/nnls/). Resulting profiles were visualised with custom ggplot2 8 plots to show the observed mutational signature profile, the fit against the 30 COSMIC profiles 9 , the differences (residuals) between the two and an overall percentage match of the observed profile to each of the 30 COSMIC profiles.

Somatic copy number variant calling and analysis
Sequenza applies an algorithmic approach to estimate overall tumor cellularity which we used to determine whether samples should be sequenced to a greater depth. It also produces segments for each copy number change through the genome, including the ploidy of the A and B allele for each segment. Analysis was performed to identify all segments that overlapped by 1 or more bases with any of the genes in the consensus cancer driver gene list described above. All reported variants were manually validated using IGV. 4

Copy number variation
To compare copy number profiles between samples, events smaller than 10kb were excluded on the basis that they did not validate in IGV. To minimise noise, for each sample we averaged the copy number values across cytoband coordinates in proportion to the number of bases covered by each variant in the cytoband, assuming diploid status for coordinates without variants. This resulted in a profile of copy number values for each of the 811 autosomal cytobands per sample.
To compare genome-wide CNV profile changes, we calculated pairwise coefficients of determination (R 2 ) between all samples and generated heatmaps using the heatmap.2 function from the gplots package for R. To visually display profiles across all samples for each patient, we generated segmented data files (*.seg) containing cytoband coordinates and copy number values and used IGV to plot these inline and scaled per patient. All reported variants were manually validated using IGV. 4

Phylogenetic reconstruction.
Accurate phylogenetic reconstruction requires high confidence in shared and private events. To achieve this for our multi-region cases, we first used GATK CombineVariants v3.3-0-g37228af 2 on all short somatic variants per patient to determine all variants present in one or more samples. We then ran Strelka again for each sample and forced it to call variants at every position where a variant was present in any sample from the patient. Finally, we combined the re-called variants to obtain genotype and sequencing depth information for each sample, even when there was initially not enough evidence for denoting somatic status. We proceeded to exclude variants deemed to have insufficient quality or information to enable determining shared or private status. We applied exclusion regions using BEDTools v2.22.0 10 to discard variants called in difficult to sequence regions of the genome. Variants were excluded where they did not pass default Strelka quality thresholds in at least one sample. We also excluded variants where no information was obtained for at least one sample during the re-calling, multi-allelic variants and those outside the autosome. To ensure we only analyzed somatic events, we discarded all variants where the germline had one or more reads supporting the alternate allele. To ensure adequate sequencing coverage, we discarded all variants where any one of the somatic samples had an overall depth of below 50. To ensure sufficient evidence of somatic status we excluded variants where no sample had 6 or more high quality reads supporting the alternate allele. Finally, to control for cellularity differences, a sample needed at least 2 reads supporting the alt allele to be classed as sharing a variant.