Genomic–transcriptomic evolution in lung cancer and metastasis

Intratumour heterogeneity (ITH) fuels lung cancer evolution, which leads to immune evasion and resistance to therapy1. Here, using paired whole-exome and RNA sequencing data, we investigate intratumour transcriptomic diversity in 354 non-small cell lung cancer tumours from 347 out of the first 421 patients prospectively recruited into the TRACERx study2,3. Analyses of 947 tumour regions, representing both primary and metastatic disease, alongside 96 tumour-adjacent normal tissue samples implicate the transcriptome as a major source of phenotypic variation. Gene expression levels and ITH relate to patterns of positive and negative selection during tumour evolution. We observe frequent copy number-independent allele-specific expression that is linked to epigenomic dysfunction. Allele-specific expression can also result in genomic–transcriptomic parallel evolution, which converges on cancer gene disruption. We extract signatures of RNA single-base substitutions and link their aetiology to the activity of the RNA-editing enzymes ADAR and APOBEC3A, thereby revealing otherwise undetected ongoing APOBEC activity in tumours. Characterizing the transcriptomes of primary–metastatic tumour pairs, we combine multiple machine-learning approaches that leverage genomic and transcriptomic variables to link metastasis-seeding potential to the evolutionary context of mutations and increased proliferation within primary tumour regions. These results highlight the interplay between the genome and transcriptome in influencing ITH, lung cancer evolution and metastasis.


Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code

Data collection
No software was used for data collection.

Data analysis

RNA-seq alignment and QC
- Illumina adapters were trimmed from raw sequencing reads using Cutadapt (v2.10).
- The quality of the trimmed reads was estimated per flow cell lane using FastQC (v0.11.9).
- Fastq read files were aligned to the Hg19 human reference genome using STAR (v2.5.2a).
- Duplicated reads in each BAM file were marked with the MarkDuplicates function from GATK (v4.1.7.0).
- Aligned reads were quality checked using QoRTs (v1.3.6) to assess RNA integrity.
- Somalier (v0.2.7) was used to detect potential instances of sample mislabelling.
- FastQC, QoRTs and Somalier outputs were visualised using MultiQC (v1.9).
- RSEM (v1.3.3) was used with default parameters to quantify gene expression based on the BAM files aligned to the transcriptome.
- RNA coverage was calculated for single nucleotide variants (SNVs) detected in matched whole exome sequencing data per tumour region using SAMtools (v1.9) mpileup.
- All steps described were implemented through the Nextflow (v20.07.1) pipeline manager.

Reduced-representation Bisulfite Sequencing (RRBS)
- FastQC (v0.11.2) was used for quality control.
- Trim Galore! (Babraham Institute, https://www.babraham.ac.uk/), a wrapper around Cutadapt (v2.10), was used to trim reads.
- The bisulfite-converted DNA sequence aligner Bismark (v0.14.4) was used to align reads to the UCSC reference genome Hg19.
- PCR deduplication was carried out using NuDup (v2.3), leveraging NuGEN's molecular tagging technology (https://github.com/nugentechnologies/nudup).

Most analyses were run using the R coding environment (v3.6.3).

nature portfolio | reporting summary March 2021
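As a reading aid, the tools and versions listed for the two sequencing workflows can be written down as a small manifest. This is only a sketch: the tool names and versions are copied from the text, the steps are kept in the order the text lists them (which may not be the exact execution order), and the `Step` class and `version_conflicts` helper are hypothetical names introduced for illustration. It is not the authors' Nextflow pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    order: int
    tool: str
    version: str
    purpose: str


# Tool names and versions are taken from the text; everything else here
# (class name, field names, helper) is an illustrative assumption.
RNASEQ_STEPS = [
    Step(1, "Cutadapt", "2.10", "trim Illumina adapters from raw reads"),
    Step(2, "FastQC", "0.11.9", "quality of trimmed reads per flow cell lane"),
    Step(3, "STAR", "2.5.2a", "align reads to the Hg19 reference genome"),
    Step(4, "GATK MarkDuplicates", "4.1.7.0", "mark duplicated reads"),
    Step(5, "QoRTs", "1.3.6", "quality check aligned reads"),
    Step(6, "Somalier", "0.2.7", "detect potential sample mislabelling"),
    Step(7, "MultiQC", "1.9", "visualise QC outputs"),
    Step(8, "RSEM", "1.3.3", "quantify gene expression"),
    Step(9, "SAMtools mpileup", "1.9", "RNA coverage at WES-detected SNVs"),
]

RRBS_STEPS = [
    Step(1, "FastQC", "0.11.2", "quality control"),
    # The text gives the Cutadapt version only, not a Trim Galore! version.
    Step(2, "Cutadapt", "2.10", "trim reads via the Trim Galore! wrapper"),
    Step(3, "Bismark", "0.14.4", "align bisulfite-converted reads to Hg19"),
    Step(4, "NuDup", "2.3", "PCR deduplication"),
]


def version_conflicts(*pipelines):
    """Return tools that appear with more than one version across pipelines."""
    seen = {}
    for steps in pipelines:
        for step in steps:
            seen.setdefault(step.tool, set()).add(step.version)
    return {tool: sorted(versions) for tool, versions in seen.items() if len(versions) > 1}


print(version_conflicts(RNASEQ_STEPS, RRBS_STEPS))
```

Written this way, it is easy to see that FastQC is the only tool reported with two different versions (one for RNA-seq, one for RRBS), while Cutadapt is the same version in both workflows.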

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

The RNA sequencing (RNA-seq), whole exome sequencing (WES) and reduced representation bisulfite sequencing (RRBS) data (in each case from the TRACERx study) used during this study have been deposited at the European Genome-phenome Archive (EGA), which is hosted by the European Bioinformatics Institute (EBI) and the Centre for Genomic Regulation (CRG), under the accession codes EGAS00001006517 (RNA-seq), EGAS00001006494 (WES) and EGAS00001006523 (RRBS); access is controlled by the TRACERx data access committee. Details on how to apply for access are available at the linked page.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

- Life sciences
- Behavioural & social sciences
- Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
The sample size (421 patients) represents the half-way point of the TRACERx longitudinal study. In total, we analysed paired whole exome sequencing and RNA-seq data from 347 patients that passed quality check filters for RNA.

TRACERx is a programme of work of multiple projects built around a single observational cohort study. It is not possible to perform a sample size calculation for each project, especially post hoc. The cohort size was determined in relation to tumour heterogeneity and disease-free survival: the sample size is based on demonstrating a relationship between tumours with divergent intratumour heterogeneity index values and clinical outcome. Patients will be split evenly into those with a low and high intratumour heterogeneity index value (and other splits will be considered).

Assuming a median disease-free survival (DFS) of 30 months and a hazard ratio (HR) of 0.77, with a 2-sided 5% significance level, 90% power, an accrual period of 3 years and 5 years of follow-up after the end of accrual, the sample size required is almost 400 per group (total of 800 patients). Assuming a 5% dropout rate, a total of 842 patients (421 per group) are required. At 85% power, 705 patients would be required in total, which could be the minimum target. However, we will instead aim for 750 patients, and recruitment will continue for the length of time that is funded for accrual in order to get as close as possible to the ideal target of 842 patients.

A study size of 842 is also large enough to detect a 10 percentage point improvement in the 5-year OS rate, from 46% in the high Intratumour Heterogeneity Index (ITB) group to 56% in the low Intratumour Heterogeneity Index group (HR = 0.75), with 80% power and a 2-sided type I error set at 5% (logrank test). A high/low ITB value will be defined as values above/below the 50th percentile (median ITB). We have a target DFS effect of a 23% reduction in risk (hazard ratio 0.77), which means that our study is powered for an effect at least this large, including a 30% difference (which has been the target for progression-free survival in trials of advanced NSCLC, in relation to expected effects on OS).
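The text does not state which formula or software produced these figures. As a hedged illustration, the sketch below applies Schoenfeld's approximation for the number of events in a two-sided log-rank test, then converts events to patients assuming exponential survival and uniform accrual. It also assumes the 30-month median DFS refers to the higher-risk group. All function names are hypothetical. Under these assumptions the calculation gives about 615 events and roughly 780 patients, in the same range as the "almost 400 per group" quoted, but it should not be read as the study team's actual calculation.

```python
from math import exp, log
from statistics import NormalDist


def required_events(hr, alpha=0.05, power=0.90):
    """Schoenfeld approximation: total events needed for a two-sided
    log-rank test with equal allocation between two groups."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 4 * z ** 2 / log(hr) ** 2


def event_probability(median, accrual, follow_up):
    """Probability of an event under exponential survival, with uniform
    accrual followed by a fixed follow-up period (same time unit throughout)."""
    rate = log(2) / median
    surviving = (exp(-rate * follow_up) - exp(-rate * (accrual + follow_up))) / (rate * accrual)
    return 1 - surviving


def inflate_for_dropout(n, dropout):
    """Scale a sample size up so that n patients remain after dropout."""
    return n / (1 - dropout)


def hazard_ratio_from_survival(s_reference, s_comparison):
    """Hazard ratio implied by two survival proportions at the same time
    point, assuming proportional hazards."""
    return log(s_comparison) / log(s_reference)


events = required_events(0.77, alpha=0.05, power=0.90)

# Assumption: 30 months is the median DFS of the higher-risk group, and the
# other group's median is 30 / 0.77 months. Times are in months.
p_high = event_probability(30.0, accrual=36.0, follow_up=60.0)
p_low = event_probability(30.0 / 0.77, accrual=36.0, follow_up=60.0)
total_patients = events / ((p_high + p_low) / 2)

print(round(events))                                      # events at 90% power
print(round(total_patients))                              # patients before dropout
print(round(inflate_for_dropout(800, 0.05)))              # the 800 -> 842 step in the text
print(round(hazard_ratio_from_survival(0.46, 0.56), 2))   # HR implied by 46% vs 56% OS
```

The last two lines reproduce two figures from the text directly: inflating 800 patients for 5% dropout gives 842, and 5-year OS rates of 46% and 56% imply a hazard ratio of about 0.75 under proportional hazards.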

Data exclusions
Data were excluded only on the basis of:
- Non-eligibility for the TRACERx clinical trial due to failure of the patient's data to comply with the study protocol (see below)
- Failure of the sequenced data to pass our quality check filters

Replication
TRACERx is a prospective longitudinal study. As such, the results shown here are not the result of an experimental setup. This is the half-way point of the study (the TRACERx 421 cohort) and reflects hypothesis-generating analysis.

Randomization
Given the observational nature of the TRACERx longitudinal study, no experimental groups were allocated beforehand. Factors that could affect the interpretation of our results, such as the background genetic makeup of each patient or the histological subtype of tumours, were taken into account in all our statistical analyses by including them as covariates in hypothesis testing. For instance, we used tumour ID as a random effect factor in linear mixed effects models for many of our analyses.

Blinding
Blinding is not relevant as this is an observational study. Patients were not allocated to any intervention and they were followed up and assessed as per routine practice. No biomarker results (tissue and bloods) are reported back to patients, so there is no likelihood of people changing their behaviours based on these findings. The laboratory analyses were all performed without knowing the outcome (DFS or survival) status of the patients, which represents a form of blinding.

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study.

Recruitment
When patients are initially diagnosed with stage I-III lung cancer and then referred for surgical resection, a research nurse identifies them on a clinic/operating list. The patient has an initial eligibility assessment, is then provided with written information about the TRACERx study, and can ask the research nurse any questions.
Patients have to agree to provide serial blood samples whenever they attend clinic for routine blood sampling, so this represents the only main potential self-selecting bias (i.e. only patients willing to do this would participate). However, it is unclear how this would affect the biomarker analyses. Also, the gender and ethnicity characteristics are in line with patients seen in routine practice.
Inclusion and exclusion criteria are summarised above.
Informed consent for entry into the TRACERx study was mandatory and obtained from every patient.
If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

at least 15mm to allow for sampling of at least two tumour regions (if 15mm, a high likelihood of nodal involvement on pre-operative imaging required to meet eligibility according to stage, i.e. T1N1-3)

Exclusion criteria:
- Any other* malignancy diagnosed or relapsed at any time, which is currently being treated (including by hormonal therapy).
- Any other* current malignancy or malignancy diagnosed or relapsed within the past 3 years**.
- Psychological condition that would preclude informed consent
- Treatment with neo-adjuvant therapy for current lung malignancy deemed necessary
- Post-surgery stage IV
- Known Human Immunodeficiency Virus (HIV), Hepatitis B Virus (HBV), Hepatitis C Virus (HCV) or syphilis infection.
- Sufficient tissue, i.e. a minimum of two tumour regions, is unlikely to be obtained for the study based on pre-operative imaging

*Exceptions are: non-melanomatous skin cancer, stage 0 melanoma in situ, and in situ cervical cancer
**An exception will be made for malignancies diagnosed or relapsed more than 2, but less than 3, years ago only if a preoperative biopsy of the lung lesion has confirmed a diagnosis of NSCLC.

Patient ineligibility following registration:
- There is insufficient tissue
- The patient is unable to comply with protocol requirements
- There is a change in histology from NSCLC following surgery, or NSCLC is not confirmed during or after surgery.
- Change in staging to IIIC or IV following surgery
- The operative criteria are not met (e.g. incomplete resection with macroscopic residual tumours (R2)). Patients with microscopic residual tumours (R1) are eligible and should remain in the study
- Adjuvant therapy other than platinum-based chemotherapy and/or radiotherapy is administered.