Club cells employ regeneration mechanisms during lung tumorigenesis

The high plasticity of lung epithelial cells, has for many years, confounded the correct identification of the cell-of-origin of lung adenocarcinoma (LUAD), one of the deadliest malignancies worldwide. Here, we employ lineage-tracing mouse models to investigate the cell of origin of Eml4-Alk LUAD, and show that Club and Alveolar type 2 (AT2) cells give rise to tumours. We focus on Club cell originated tumours and find that Club cells experience an epigenetic switch by which they lose their lineage fidelity and gain an AT2-like phenotype after oncogenic transformation. Single-cell transcriptomic analyses identified two trajectories of Club cell evolution which are similar to the ones used during lung regeneration, suggesting that lung epithelial cells leverage on their plasticity and intrinsic regeneration mechanisms to give rise to a tumour. Together, this study highlights the role of Club cells in LUAD initiation, identifies the mechanism of Club cell lineage infidelity, confirms the presence of these features in human tumours, and unveils key mechanisms conferring LUAD heterogeneity.

distal airway, both AT1 and AT2 cells repair the injured alveoli by selfrenewal and differentiation into each other [12][13][14] . Besides these welldefined epithelial cells, a rare population of double-positive CCSP/SPC bronchioalveolar stem cells (BASCs), located in the terminal and respiratory bronchioles, can maintain both airway and alveoli 15,16 .
Recently, single-cell RNA sequencing (scRNA-seq) has contributed to the identification of a rare population of H2-K1 high Club-like progenitors 11 , the transitional stem cell state of Krt8ADI 17 , AT2-cellderived damage-associated transient progenitors (DATPs) 18 , and pre-AT1 transitional cells (PATS) 19 , all of which contribute to lung regeneration. It remains controversial whether trans-differentiation of cells can take place during the initial transformation of normal cells, before giving rise to tumours. In this context, further characterization of this initial transition would be critical for the correct identification of the cell of origin of LUAD.
The cellular origin of NSCLC has been investigated over the last years using genetically engineered mouse models (GEMMs) demonstrating the importance of both the cell of origin as well as the genetic mutation spectrum in shaping lung cancer phenotypes 20 . For instance, many studies suggest AT2 cells to be the predominant cell of origin of Kras-driven LUAD [21][22][23][24] . However, rather than employing a stochastic system in which all cells have an equal chance of being transformed, these tumours were induced by forcefully expressing Kras in one specific cell type.
It is still highly debatable whether trans-differentiation of Club cells into AT2 cells can take place under specific environmental conditions or whether other genetic factors might alter the tumourinitiating cells 25 . We and others have demonstrated that Club cells can also induce LUAD. Employing engineered reporters together with toxic chemicals found in tobacco smoke, we proved that tobacco-induced tumours originate from airway epithelial cells, which attain alveolar characteristics, and not exclusively from cells originating in the alveoli 26 . In addition, two recent papers showed that mutant Kras expressed in Club cells induces LUAD in mice 27,28 . Altogether, these studies suggest that Club cells give rise to oncogene and chemicalinduced LUAD by attaining an AT2 phenotype; however, there is a general lack of knowledge of the developmental routes and pathways that Club cells use to become a tumour.
As opposed to LUAD with Kras mutations, the cellular origin of ALK-translocated LUAD has not yet been addressed. In the present study, we investigate tumour initiation and development of Club cells by employing a stochastic LUAD model. We induce the endogenous oncogenic Eml4-Alk rearrangement using an adenoviral system 29 that can infect any cell type within the lung epithelium. We combine the Eml4-Alk adenovirus with lineage-tracing mouse models, DNA methylome, and single-cell RNA transcriptome analysis and demonstrate that Club cells upon Eml4-Alk fusion trans-differentiate early during tumour development, gain the expression of alveolar markers, and yield heterogeneous tumour subgroups by using their intrinsic regeneration mechanisms.

Tumour initiation and development induced by Eml4-Alk fusion in distinct lung cell types
To faithfully recapitulate ALK-translocated human LUAD in mice, we used a published CRISPR-Cas9 construct to induce the endogenous Eml4-Alk (EA) oncogenic rearrangement 29 (Fig. 1a). In this model, intratracheal instillation of an adenovirus (hereafter Ad-EA) encoding a single guide RNA (sgRNA) targeting Eml4, a sgRNA targeting Alk and the Cas9 protein sequence, gives rise to LUAD 8 weeks after instillation. At the early stages of tumorigenesis, Eml4-Alk induced early hyperplasia in the bronchi that partially lost the Club cell marker CCSP, and acquired alveolar markers such as SPC, similar to Kras mutant LUAD, described by the authors in ref. 28 (Fig. 1b, c). Moreover, among more advanced tumour stages (adenoma and adenocarcinoma), almost every tumour was SPC positive ( Supplementary Fig. 1a), suggesting a linage switch from Club to AT2-like cells triggered by the oncogenic transformation. Compared to those mice that only received Ad5-CMV-Cre (hereafter Ad-Cre) as control, we also found an increase of double-positive CCSP/SPC cells in the bronchiolar lesions 8 weeks after instillation of Ad-EA ( Supplementary Fig. 1b). We speculate that this increase of dual positive cells in the early lesions is a consequence of a lineage switch from Club to AT2-like cells triggered by the oncogenic transformation and not necessarily an increase in BASCs, since these cells are localized in the bronchioles and not the BADJ (Supplementary Fig. 1c).
The advantage of using adenovirus to induce the Eml4-Alk rearrangement is that any cell type within the lung epithelium has the chance of being infected. To define the cell tropism of the adenovirus, we used Ad-Cre to instil a mouse strain that switches membranous tdTomato to membranous EGFP fluorescence (mT/mG) 30 upon Cremediated recombination. We identified that virtually all lung epithelial cell types got infected with Ad-Cre (Fig. 1d, e and Supplementary  Fig. 1d), being AT1 cells the largest cell population infected, followed by AT2, Club, and Ciliated cells. We rarely found infected BASCs possibly due to their low frequency compared to the other cell types (Supplementary Fig. 1e and Supplementary Table 1).
From these results, we hypothesize that the Eml4-Alk oncogenic rearrangement can initiate tumours in Club and AT2 cells, although late-stage tumours homogeneously express the AT2 cell marker, SPC.
Subsequently, we induced LUAD in the lineage-tracing mice by intratracheal instillation of Ad-EA four weeks after TAM injection. When tumours became discernible by μCT, we analysed the percentage of GFP + tumours in each line (Fig. 2c-e and Supplementary  Table 3). We found 44% ± 5 of GFP + lesions in Scgb1a1 mice and 12%±9 in Sftpc (Fig. 2d, e and Supplementary Fig. 2b). Surprisingly, almost one-fifth of the tumours in Hopx mice were also GFP + (19% ± 14), although AT1 cells were rarely reported to contribute to LUAD 34 . Due to labelling promiscuity of the Hopx line, we further tested the possibility that these GFP + tumours were arising from Club rather than AT1 cells. Therefore, we infected mT/mG; Krt5-CreER mice (hereafter Krt5) 35 with Ad-EA, since only AT1 cells in the distal lung, besides Basal cells in the trachea, are labelled in this model (Fig. 3a, b and Supplementary Fig. 3a). However, we failed to observe any GFP + tumour in Krt5 mice (0 out of 130 analysed tumours), suggesting that neither Basal cells from upper airways nor AT1 cells contributed to Eml4-Alk LUAD initiation (Fig. 3c, Supplementary Fig. 3b and Supplementary Table 3). Therefore, the GFP + tumours arising from the Hopx line most probably originated from unspecifically labelled Club cells (Fig. 3d, e). In addition, all tumours developed in Foxj1 mice were negative for GFP (0 out of 113 analysed tumours) (Fig. 2d, e, Supplementary Fig. 2b and Supplementary Table 3). Notably, the GFP + tumours from Hopx, Scgb1a1 and Sftpc lines exhibited the same expression pattern: positive for SPC but negative for CCSP ( Fig. 3e and Supplementary Fig. 3c). Taken together, these findings establish both Club and AT2 cells as the cell of origin of Eml4-Alk induced LUAD. Additionally, the phenotypic similarity of the tumours suggests that Club cells might undergo a lineage switch upon oncogenic transformation.
Lung cancer and cell-type-specific DNA methylation patterns Epigenetic reprogramming during differentiation is tightly regulated in a lineage-specific manner 5 , resulting in cell-type-specific DNA methylation 36 . Patterns of different developmental routes are preserved as "epigenetic memories" within each cell type 37 . Therefore, we sought to uncover signatures of each specific cell type originated tumour and to trace them back to their cells of origin. We captured the DNA methylation landscape of sorted GFP + cell populations from lineage-labelled healthy mice and tumours from the Sftpc, Hopx and Scgb1a1 lines ( Supplementary Fig. 4a) by tagmentation-based wholegenome bisulfite sequencing (TWGBS) 38 .
Using TWGBS data, we determined regions which could distinguish different cell types based on DNA methylation differences. We found that poised enhancers, as opposed to active enhancers and promoters, were specifically powerful in discriminating normal lung   cell types. However, in principal component analysis (PCA), tumours showed homogenous methylation patterns and did not cluster according to their cellular origin, as did the normal cell types ( Fig. 4a and Supplementary Fig. 4b). To infer cell type composition, we applied a reference-free deconvolution method, MeDeCom, to the DNA methylation data of the most variable CpG sites overlapping with bivalent enhancers. MeDeCom identifies so-called latent methylation components (LMCs) designed to capture the proportion of cell typespecific methylation signatures in each sample 39,40 .
After optimization and quality control of the methylome deconvolution model (for details, see "Methods"), we identified three main LMCs: LMC1 and LMC3 represented normal samples from Scgb1a1 and Sftpc lineages, respectively, whereas LMC2 represented a common lung cancer-specific signature (Fig. 4b). We could not identify a specific component for AT1 (appearing in Hopx lineage) or Ciliated cells (appearing in Scgb1a1 lineage). To define the cell types represented by each LMC, we performed an enrichment analysis of the cell typespecific markers, previously identified by scRNA-seq 17,41,42 . Since promoter methylation is considered to correlate negatively with gene expression, the enriched gene signatures captured the cell types associated with the given component. We, therefore, correlated the DNA methylation levels of each promoter with the proportion of each LMC and determined whether cell type-specific markers were over-represented among the inversely correlating genes (Fig. 4c). The results revealed that LMC1 represented not only Club but also Ciliated cells. Similarly, LMC3 captured both AT2 and AT1 cells (Fig. 4c). Importantly, tumour samples, despite showing a homogeneous pattern in LMC2, still resembled the specific signature of their cell of origin, classified by a higher contribution (>10%) of the respective components (Fig. 4b). A high proportion (>10%) of LMC1 in both tumours from the Hopx line supported our previous result that these tumours are derived from Club cells.
Differential methylation analysis revealed cell-type-specific differences between AT2 and Club cells. Notably, tumours, irrespective of their cell of origin or the mouse line from which they originated, (Scgb1a1, Sftpc or Hopx), showed AT2-like methylation patterns (Fig. 4d). These results strengthened the assumption that Club cell originating tumours switch their identity during tumorigenesis. This was also supported by the hypermethylation of the Scgb1a1 promoter in Club originated tumours, impairing the expression of this marker, and hypomethylation of the Sftpc promoter similar to that in normal AT2 cells, indicating the linage switch ( regions that were hypermethylated in Club cell originating tumours, as well as in normal AT2 cells compared to normal Club cells. Forkhead TFs regulate lung secretory epithelial fate 43 and play a role in maintaining epithelial cell identity 44 . Therefore, increased methylation of their binding site might contribute to derailing the normal lung regeneration processes. Tumour-specific hypomethylation events were enriched in TF binding motifs of the JUN/FOS proteins, building blocks of the activator protein-1 (AP-1) transcription factor known for their role in lung tumorigenesis as well as their involvement within other signal transduction pathways 45 . Taken together, the DNA methylation data supports the Club cell origin of labelled tumours from Scgb1a1 and Hopx lines. Furthermore, the DNA methylation patterns and differential methylation of TF binding motifs suggest an epigenetically driven lineage switch of Club cells into an AT2-like phenotype before or during tumorigenesis. Single-cell RNA sequencing identifies multiple cell states after the transformation of Club cells Lineage-tracing mouse models and DNA methylation data indicated that tumours originating from Club cells acquire an AT2-like phenotype. To elucidate the molecular processes of tumorigenesis, we designed a time-course experiment based on scRNA-seq of the Scgb1a1 line ( Fig. 5a). We collected two control samples; one with only TAM injection and another one 18 weeks after TAM and Adeno-Cas9 infection (Ad-Cas9); two early samples 2 weeks after Ad-EA and one intermediate sample 4 weeks after Ad-EA. We isolated live Club cells (EpCAM + /CD45 -/CD31 -/tdTomato -/GFP + /CD24 -/ß4 + /CD200 + ) from these samples ( Fig. 5a and Supplementary Fig. 5a). To discern possible implications of the labelled AT2 cells in the Scgb1a1 model, we included all GFP + cells from a second intermediate sample-4 weeks after Ad-EA-as well as all GFP + tumour cells from late-stage tumour nodules ( Fig. 5a and Supplementary Fig. 5a) and performed scRNA-seq.
Next, clusters were annotated using cell-type-specific markers 42,46 and previously published cell type signatures (Supplementary Data 3) 11,17,47 . As expected, multiple clusters were annotated as Club cells and in line with our previous results, tumour clusters presented high levels of AT2 markers, except for cluster 13, which unexpectedly showed increased levels of AT1 markers (Fig. 5e, f and Supplementary Data 3). Furthermore, due to the high expression of AW112010, H2-K1 and Cd74, we identified cluster 9 to be similar to the recently described H2-K1 high Club-like progenitor cells that were shown to contribute to lung regeneration after injury 11,18,19 (Supplementary Fig. 5b and Supplementary Data 3).
To verify whether tumour cells contained the Eml4-Alk rearrangement, we checked for the aberrant expression of the Alk region affected by the translocation. We found Alk expression not only in tumour clusters 12, 13 and 14 but also in cluster 11 (Fig. 5g). Moreover, we observed a progressive loss of Club cell identity and gain of AT2-like features throughout tumour development ( Supplementary Fig. 6a). Notably the tumour sample contained very few cells still expressing the Club cell marker Scgb1a1 (Supplementary Fig. 6b). Cluster 11 still showed a Club signature and low expression of tumour markers ( Supplementary Fig. 6a, c, d), suggesting its pre-tumour stage.
Interestingly, this cluster showed expression of the Basal markers Trp63 and Krt5 (Supplementary Fig. 6e).
Collectively, our scRNA-seq analyses identified distinct cell populations that were enriched in different stages of Club cell tumorigenesis. A dynamic change in cell-type composition was observed with the rise of progenitor-like cell states in early time points and distinct tumour clusters being discernible as early as 4 weeks after oncogenic induction ( Fig. 5h and Supplementary Table 4).

Club cells employ lung regeneration mechanisms during tumorigenesis
We next sought to model the differentiation trajectory of Club cells towards LUAD. RNA velocity 48 estimates the ratio of spliced and unspliced mRNA predicting the future state of cells. Partition-based graph abstraction (PAGA) analysis 49 generates a map, where nodes are connected by weighted edges representing the connectivity between clusters. Using RNA velocity ( Supplementary Fig. 7a) to direct the PAGA edges ( Supplementary Fig. 7b), we obtained an unbiased estimation of linage trajectories 49,50 (Fig. 6a).
To better understand the transitions of Club cell states during tumour progression, we explored gene expression activity programmes using consensus non-negative matrix factorization (cNMF) 47,51 . We identified nine activity programmes ( Fig. 6b and Supplementary Data 4). The intermediate Club-like progenitors (cluster 9) showed high activity of two immune-related programmes: an interferon programme and an immune activation/inflammation programme ( Fig. 6c and Supplementary Data 4). Additionally, three programmes depicted the tumour clusters (12, 13 and 14) described above: a stem-like tumour module correlated with cluster 14, and was characterized by stem cell markers such as Id2 and Sox9 together with AT2 identity genes like Sftpc, Lamp3 and Lcn2 (Fig. 6b, c and Supplementary Data 4); a regeneration-like tumour module that was mainly enriched in cluster 12 and similar to lung regeneration, this module showed high activity in HIF1 signalling, TGF-beta pathway, IL-17 signalling and metabolic pathways such as glycolysis (Supplementary Fig. 7c and Supplementary Data 4) 18 . The third one, an AT1-like tumour programme, was highly active in cluster 13, although also present in cluster 12 (Fig. 6c). This module showed high similarity to recently described signatures implicated in lung regeneration [17][18][19] as well as to "mixed AT1/AT2" and "highly mixed" programmes described in AT2 originated Kras mutant lung tumours 47 (Supplementary Fig. 7d, e and Supplementary Data 3). These results suggest that Eml4-Alk Club celloriginated tumours are highly heterogeneous, similar to the Kras AT2 originated ones 47 . Since cluster 6 consisted of two separated cell types, we split the cluster into two for further trajectory analysis: 6 for the Activated Club cells and 6b for the Activated AT2s.
RNA velocity ( Fig. 6a and Supplementary Fig. 7a) and PAGA analysis ( Supplementary Fig. 7b) predicted cluster 9 to be . The shape of the dots reflects the sample type, as normal lung (Normal) or GFP + tumours (Tumour). The colour indicates mouse lineages used. Coloured shading is drawn around the Normal (light blue) and Tumour (purple) samples. b DNA methylation deconvolution by MeDeCom analysis shows three identified latent methylation components (LMC 1-3). Colours indicate the proportion of LMCs for each sample. Samples are clustered within each mouse lineage (Sftpc, Scgb1a1 or Hopx) and according to sample type (Normal/Tumour). c Association of LMCs with cell type-specific gene signatures based on published single-cell RNA sequencing data 17 . Genes are ranked by their correlation of promoter methylation with the respective LMC. Enrichment scores are running sums calculated using the Gene Set Enrichment Analysis algorithm. d Heatmap of regions that were differentially methylated between Scgb1a1 and Sftpc normal samples. The methylation value is shown as beta values ranging from 0 to 1 visualized by the heatmap colours blue to red. For tumours, the label of each column shows the proposed originating cell type, while the colours below depict their originating lineage. e Average methylation across the Scgb1a1 and Sftpc genes for normal samples from Scgb1a1 and Sftpc lines and Club and AT2 originated tumours. Pink bars highlight regions with DNA methylation differences between normal lineages, while blue bars highlight regions with differences in tumours from distinct origins. ENCODE cCRE, showing candidate Cis-Regulatory elements, cHMM lung P0 showing ChromHMM regions in mouse lung, postnatal 0 days, and ATAC lung P0 showing ATAC-seq peaks in postnatal 0-day-old lung are UCSC tracks under the same name. f TF motif enrichment analysis of the differentially methylated regions (DMRs). Each column represents one comparison (e.g. Scgb1a1 normal vs. tumour) in one direction (e.g. hypermethylation in second, as in hypermethylated in tumours). The colour of the dots shows the enrichment of the DMRs for the motifs compared to random genomic regions. The size of the dots reflects on the -log 10 (p value) of the enrichment analysis. Empty lines show the lack of significant enrichment.
connected to the tumour clusters. Therefore, we hypothesize that these Club-like progenitor cells could represent a crucial state in determining the fate of Club cell-originated tumours. Based on these results, we postulated that progenitor-like cells (cluster 9) with active interferon and inflammation signalling can follow two major routes towards tumorigenesis. The first one is through proliferating cells (cluster 10), before giving rise to tumour cells (cluster 14). The second route underwent the pre-tumour state (cluster 11) and further differentiated into tumour cells in clusters 12 and 13, with the latter also connected with cluster 9. Notably, the second route was also partly shared by a subpopulation of Club cells in cluster 4 that showed a Goblet-like gene signature (upregulated Ltf, Bpifa1, Bpifb1, Reg3g expression), and differentiated into cluster 11 (Supplementary Data 3), retaining the expression of Ltf. Our results show that although different tumour clusters arise on diverse paths, these paths are interconnected. As representatives of these routes, we selected two candidate genes. Ltf, a key player of the second route, was highly expressed in clusters 4 and 11, although RNA velocity showed spliced transcripts in cluster 11 predicting its downregulation (Fig. 6d). The second candidate, Trp63, was highly expressed in cluster 11 ( Fig. 6d and Supplementary Fig. 6e).
To further confirm these tumorigenic routes, we checked the expression of Ltf and Trp63 in lung sections of different stages of tumour development and compared them to control mice instilled with Ad-Cas9 (Fig. 6e). In agreement with our computational analysis, we identified cytoplasmic Ltf expression in Club cells in early lesions, but not in normal Club cells or tumour cells, therefore validating the involvement of Club cells with Goblet-like (Ltf) signature in tumour progression (Fig. 6f). Additionally, few cells showed nuclear expression of Trp63 in Club cells in early lesions, but not in normal Club cells or tumour cells (Supplementary Fig. 7f). Transcriptional activity programmes of Club originated tumours are present in human LUAD Next, we analysed two published scRNA datasets on human LUAD. We compared tumour cells specific for 5 patients 52 , and different transcriptional states of LUAD (Fig. 7a) and metastatic (malignant) cells 53 (Fig. 7b). We concluded that our tumour modules are present in LUAD and in non-lung metastasis, and show high variation between patients as well as between transcriptional states.
Then, we investigated the relevance of the 9 identified modules on LUAD patient prognosis taking advantage of the survival information of The Cancer Genome Atlas (TCGA) and bulk RNA-seq data 54 . We used the human orthologs of the module-specific genes to calculate the hazard ratio (Fig. 7c). We found that the Stem-like, AT1-like and Regeneration-like tumour signatures were associated with poor survival, with the last one being the most prominent indicator of poor survival. Finally, to exclude that the observed signatures were simply tumour signatures and to validate whether Club originated tumours match better with LUAD than AT2 originated ones, we integrated our data with recently published 10× scRNA-seq data of AT2 originating Kras mutant;p53 null tumours 55 (Fig. 7d and Supplementary Fig. 7g, h). We found marked differences between our main Club-cell originating tumour clusters and those of AT2 origin, while the AT1-like tumour cluster showed high similarity with the AT1-like cells. We also investigated how the gene expression module scores of our tumour clusters (AT1-like tumour, Regeneration-like tumour, and Stem-like tumour modules) are distributed among our clusters and the ones from ref. 55 (Fig. 7e). We found that both the AT1-like and the Regeneration-like tumour modules were higher in our clusters and were most similar to the AT1 and Late Gastric clusters from ref. 55 All tumour modulesincluding the Stem-like-scored significantly higher in our clusters. Altogether, these results suggest that the gene programmes associated with Club-originated tumorigenesis play a role in a subset of human LUAD.

Discussion
The understanding of the cell of origin of lung adenocarcinoma is a debatable topic in cancer research. Here, we combined state-of-theart lineage-tracing mouse models of lung cancer, DNA methylome and single-cell transcriptome analysis to unveil the tumorigenesis of Club-originated tumours. We showed that due to the high plasticity of lung epithelial cells, tumour characteristics are only partly dependent on their originating cell type. Previous studies focused on defining the cellular origin of LUAD, by using Kras mutant models that were based on tumour induction in specific cell types [21][22][23]26,28,56 . We studied the cell of origin of Eml4-Alk LUAD using a stochastic adenoviral system and employed lineage-tracing mouse models to label Club, AT2, AT1, Ciliated and Basal cells. We identified Club and AT2 cells as the main cell types implicated in Eml4-Alk LUAD development and showed that Eml4-Alk fusion gene in AT1, Ciliated or Basal cells did not give rise to tumours.
Remarkably, regardless of the originating cell type, all Eml4-Alk tumours were positive for the AT2 cell marker, SPC. In line with this observation, Rosigkeit et al. 28 have recently shown that Cluboriginated tumours become AT2-like by expressing SPC. DNA methylation was shown to act as cellular memory, storing information on previous differentiation states 37 . After deconvolution of the DNA methylomes from Club and AT2 originating tumours, cell typedependent signatures were retained despite the large similarity between tumours. Although these signals could be due to contaminated normal cells in the tumours, we believe this is not the case, since based on the scRNA-seq analysis only very small number of cells in the tumour sample was expressing Scgb1a1, making it very unlikely that these cells generate a strong signal in the bulk methylome. In our study, analysis of the DNA methylome showed that Club tumours' methylation pattern was similar to that in AT2 cells, suggesting that Club cells switch their cellular identity and rewire their epigenomic landscape during tumorigenesis.
Our findings are especially interesting in comparison with cell-oforigin studies in haematological cancers. Normal haematopoiesis is characterized by unidirectional methylation changes 5,57 and leukaemia cells preserve large parts of these epigenomes 58 . This is in contrast to the patterns seen in LUAD where a dynamic, network-like differentiation landscape establishes highly similar tumour methylomes, arising on diverse pathways. Remarkably, the DNA methylation patterns in both cases can be used to infer the cell of origin.
Tumorigenesis can occur as a disruption of normal regeneration processes, as recently described in pancreatic cancer 59 . Club cells are able to regenerate the lung by renewing their own population 60 and transdifferentiating into AT2 and AT1 cells. Similar to lung regeneration, this high plasticity of Club cells was observed during Eml4-Alk tumorigenesis, where oncogenic stimuli rewired the transcriptome and epigenome of Club cells to evolve to an alternative fate programme. Our scRNA-seq identified several cellular stages which until now had only been shown to be transient states during lung regeneration after injury. For instance, an intermediate cell state derived from Goblet cells (expressing Bpifb1) dedifferentiated towards terminal Basal cells (Krt5, Trp63) upon lung regeneration 61 . Similarly, we observed that Goblet-like Club cells (cluster 4), marked by the expression of Ltf and Bpifb1, transition into a pre-tumour stage (cluster 11), which in addition to expressing high levels of Alk, shows the expression of Ltf, Krt5 and Trp63. We speculate that our pre-tumour stage is similar to the Goblet-Basal intermediate stage found in lung regeneration and that Club cells mimic this state during tumour initiation. Additionally, we identified tumour-associated activity programmes marked by the expression of signatures also identified in intermediate states between Club or AT2 cells towards AT1 cells upon lung regeneration [17][18][19] . Our regeneration-like activity programme, associated mainly to tumour cluster 12, showed activation of HIF1 signalling, TGF-beta pathway, IL-17 signalling and glycolysis; pathways described to be essential in AT1 differentiation upon lung injury and regeneration 18 . Moreover, one of the oncogene-driven routes that Club cells follow involving cluster 9-11-12-13 shows high similarity to the signatures of Krt8ADI and PATS; two cellular states described in lung regeneration. Altogether, our data reinforce the importance of lung cell plasticity during tumorigenesis and stabilize a link between lung regeneration and lung tumorigenesis 17,19 .
Finally, we found that all our identified tumour-related transcriptional programmes were present in human LUAD and associated with poor survival, the regeneration-like module being the most significant.
Interestingly, the tumorigenic routes followed by Club cells show correlation to some activity programmes previously reported in AT2originated tumours 47,62,63 . This suggests that although lung tumours can have different cells of origin they can converge into similar stages of tumorigenesis.
By integrating scRNA-seq with whole-genome bisulfite sequencing analysis of Eml4-Alk tumours, we show that LUADs, unlike other tumour types, are only partly dependent on their originating cell type. In contrast, they are primarily determined by the oncogenic signals and further signalling pathways that are hijacked by the tumour cells. We identified that Club cells exploit signalling cascades that frequently govern normal lung development and regeneration, and we describe the paths that they follow resulting in a highly heterogenous state through two major tumorigenic routes. Characterization of these Fig. 6 | Activity programmes. a UMAP embedding of combined RNA velocity and PAGA. Connections between clusters are established based on PAGA, while the direction of the arrows is inferred from the RNA velocity analysis. b Heatmap of the module scores of each cluster (Y axis) for the identified activity programmes (cNMF modules, "Methods"). The top 10 genes were used to calculate the module score. c Activity of the transcription programmes identified by the cNMF analysis for each identified cluster. The heatmap colours represent the z-score of the gene expression for the top 10 marker genes for each activity programme. d RNA velocity and gene expression of candidate genes Ltf (upper) and Trp63 (lower). The left side of the plots shows the current gene expression levels. The right side of the plots shows the RNA velocity. The green colour represents a high velocity, therefore upregulation, while the red colour denotes downregulation. e Immunohistochemistry of Ltf in Ad-Cas9 control samples and in representative Eml4-Alk lung sections at different time points after adenoviral instillation. Scale bar 20 μm. Below, a quantification of the number of Ltf positive cells mm −2 of bronchi of the animals. Each dot represents a bronchus. Mann-Whitney test, two-tailed. Data are presented as mean values +/− SEM. n = 3 mice per group were analysed except Ad-Cas9 2 weeks and Ad-EA 4 weeks (n = 4) and Ad-Cas9 6 endpoint (n = 2). f Schematic representation of the two routes followed by Club cells upon Eml4-Alk transformation. mechanisms, which are shared or unique in the different cancer subtypes, will help to decipher the routes of lung cancer initiation and identify actionable targets for personalized treatment of LUAD. directed to and will be fulfilled by the lead contact, Rocio Sotillo (r.sotillo@dkfz-heidelberg.de).

Mice, tamoxifen and adenoviral infection
To label epithelial cells in the lung, 6-12-week-old female and male mice were injected i.p. with 200 μg g −1 tamoxifen (Sigma, T5648, 30 mg ml −1 dissolved in corn oil) during 4 consecutive days. To select the optimal chasing period, we compared the number of labelled cells using 4 and 8 weeks chasing time in Scgb1a1 mice. Since there were no differences in labelled cells in these two time points and previous data from the literature suggested that a period over 3 weeks was sufficient for TAM to label cells 64 , we used 4 weeks of chasing time in all the experiments.
To induce Eml4-Alk rearrangement in the lung, mice were anaesthetized by intraperitoneal injection of 100 μg g −1 ketamine and 14 μg g −1 xylazine and intratracheally instilled with Eml4-Alk adenovirus, designed in the Ventura Lab 29 and purchased from Viraquest. The use of genetically modified organisms (GMO) was approved by the government of Baden-Wurttemberg, Germany (project No. 81078, 81155). Mice were randomly assigned to different experiments and investigators were not blind to the mice allocation during experiments and analysis. Mice were killed by cervical dislocation at the indicated time points or when the biggest tumour reached 0.5 cm 3 . The maximal tumour size in the stated experiments was never exceeded. The lungs were perfused with 10 ml PBS through the right ventricle.

Human samples
For the survival analysis, TCGA-LUAD RNASeq data was downloaded from the GDC Data Portal (https://portal.gdc.cancer.gov/, Data Release 24.0). For each cNMF module, the top 200 genes were selected. Using the human orthologs of these genes, we calculated a z-score on the primary tumours of TCGA-LUAD (n = 533). We used a Cox proportional hazard ratio model to examine the modules' effect on survival. The models were adjusted for age, gender and tumour stage. To visualize, we used a forest plot depicting the hazard ratio (HR) and the confidence interval (CI).
Human single-cell transcriptome datasets generated by Zilinois et al. 52 and Kim et al. 53 were downloaded to investigate the presence of cNMF modules in human lung cancer. Module scores were calculated as described by TCGA-LUAD.

Lung/tumour cell isolation and FACS sorting
For normal cells, lungs were minced into smaller pieces, and for tumour cells, visible nodules were carefully picked out, and the surrounding healthy tissue was removed. For TWGBS, both healthy lungs and tumours were dissociated into single cells using the lung dissociation kit (Miltenyi Biotech, 130-095-927) in a gentleMACS Octo Dissociator (Miltenyi Biotech, 130-095-937). After depletion of red blood cells with the red blood cell lysis buffer (Sigma, R7757), cells were incubated with a tumour cell isolation kit (Miltenyi Biotech, 130-110-187) following the manufacturer's instruction and passed through the LS columns (Miltenyi Biotech, 130-042-401). Cells from the flow-through were collected and DAPI (1 μg ml −1 ) was added as a viability marker. Cells were then sorted in a BD FACSAria cell sorter and the DAPI -;tdTomato -; GFP + population was collected for downstream applications.

DNA isolation and library preparation for bisulfite sequencing
Lung cell dissociation and sorting of GFP + cells were done as mentioned above. Each sample was a pool of 1-4 mice. DNA was isolated with QIAamp DNA Micro Kit (Qiagen, 56304). Tagmentation-based whole genome bisulfite sequencing (TWGBS) libraries were generated as described previously 38 using 20-30 ng genomic DNA as input. Per sample, four sequencing libraries with different barcodes were prepared and pooled in equimolar amounts to a final concentration of 2-10 nM. Pools were sequenced paired-end, 125 bp, on one lane of a HiSeq2000 sequencer (Illumina). Raw fastq files were aligned to the mm10 reference genome using methylCtools 65 as implemented by the Genomics and Proteomics Core Facility of the German Cancer Research Center.

DNA methylation analysis
Methylation levels were called by MethylDackel (https://github.com/ dpryan79/MethylDackel). Due to the specifics of the TWGBS, the following parameters were used to remove M-bias:-nOB 2,11,11,2-nOT 8,1,2,11. Quality control of the results included checking M-bias, bisulfite conversion rate and global methylation. Two samples were removed due to quality issues. Methylation data were analysed using R/Bioconductor 4.0 with packages Methrix 66 and bsseq 67 . CpG sites overlapping with single-nucleotide polymorphisms (SNPs) in any of the mouse strains were excluded based on data downloaded from Mouse Genome Project, database version 142 68 . The quality of the samples is shown in Supplementary Data 5.
For reference-free cell type deconvolution, we used MeDeCom, which allows the decomposition of DNA methylation into latent methylation components 39 . Regions were selected based on the 15-state ChromHMM model for mouse lung, postnatal 0 days downloaded from ENCODE 69,70 (ENCSR538YJF). The 100000 most variable CpG sites overlapping with poised or bivalent enhancers were included in the model. MeDeCom model was run using multiple lambda and K parameters with the following arguments: NINIT = 10, NFOLDS = 10, ITER-MAX = 300. The final model with K = 4 and lambda=0.0001 was selected based on cross-validation error. Labelled tumours were re-categorized based on their suspected cell of origin. Tumours with >10% proportion in LMC1 or LMC3 were categorized as Club or AT2 originating, respectively. To identify gene promoters negatively associated with the LMCs, we calculated the Pearson correlation coefficient between LMC proportions and promoter methylation. Using this as ranking, we performed a Gene Set Enrichment Analysis (GSEA), as implemented in the fgsea package 71 with the following parameters: minSize = 3, maxSize = 500, nperm = 1000. For each set of marker genes, the running score was visualized.
Transcription factor binding motif enrichment analysis was performed with Homer 77 based on DMRs with the following parameters: -len 8,10,12 -size 100 -S 8 -cache 6921 -fdr 0. Enrichment analysis was performed as implemented in Homer. Multiple testing correction was done using false discovery rate (FDR) for all regions included.

Single-cell RNA sequencing and data analysis
The time points and gating strategy used to collect the samples are shown in Fig. 5a and Supplementary Fig. 5A. Lung cells were dissociated as mentioned above and GFP + cells were harvested as indicated in Supplementary Fig. 5A. Libraries were prepared using Chromium Next GEM Single Cell 3ʹ GEM, Library & Gel Bead Kit v3.1 (10× Genomics) according to manufacturer's instruction. Gene expression counts were acquired using Cell Ranger 3.1 count from 10× Genomics with its default settings. The gene expression counts were analysed in R 4.0.3 using Seurat 3.2.2 78 unless indicated otherwise. Based on visual inspection, the following number of detected gene thresholds were used to exclude bad quality cells among the samples: 1000-6000 for TAM, 1000-6000 for Cas9, 1000-6000 for 2 wk-1, 2000-6000 for 2 wk-2, 1500-6000 for 4 wk-1, 2000-6000 for 4-wk GFP + , 1000-6000 for tumour. Cells with more than 10% mitochondrial genes were removed from all samples. Samples were normalized using a scaling factor of 10,000. Variable features were selected using the FindVariableFeatures function of Seurat.
The cell-cycle stage of the cells was calculated using the CellCy-cleScoring function of Seurat with a publicly available dataset 79 . Genes were converted from human to mouse using the BioMart database with biomaRt R package 80,81 .
Batch effect correction was done using Harmony 82 based on the PCA dimensionality reduction. Based on the resulted embeddings, we performed Uniform Manifold Approximation and Projection (UMAP) 83 and clustering. Clusters were calculated using a shared nearest neighbour (SNN) modularity optimization-based clustering algorithm. First, the k-nearest neighbours were calculated based on the first 30 dimensions of the Harmony embedding and an SNN graph was constructed. An optimal number of clusters was selected based on a suggestion by an elbow plot, as implemented in Seurat. Clustering was performed using the Louvain algorithm with a resolution of 0.6. Overexpressed markers of the clusters were selected if they were expressed in more than 25% of the cluster with a log fold change (FC) of 0.25.
After the initial analysis, a small cluster of Ciliated cells marked by high expression of Foxj1 was identified and excluded. The above process was then repeated without Ciliated cells.
Alk expression was calculated by counting the reads overlapping with the part of Alk gene affected by the translocation (chr17: 71867045-71898183). The counts were lognormalized to the number of transcripts in each cell.
Consensus non-negative matrix factorization (cNMF) 51 v.1.2 was run using 100 iterations and the 2000 most variable genes. Based on the optimization, a model with 9 components was selected. For the consensus estimates of the programmes a local density threshold of 0.01 was used. The other parameters were used as default.
Overrepresentation analysis of the genes representing the modules based on KEGG 84 and WikiPathways 85 databases was performed using the clusterProfiler Bioconductor package 86 .
Module and signature scores were calculated using the AddMo-duleScore function of Seurat based on the default parameters.
RNA velocity models the direction and speed of individual cells in the gene expression space by estimating the ratio of spliced and unspliced mRNA. Using this ratio, it predicts the future state of individual cells on a short timescale 48 . Here, we calculated the velocity using scVelo 0.2.3 50 . Since our data was batch corrected using Harmony the calculated velocity was projected on the UMAP based on the Harmony embedding.
Partition-based graph abstraction (PAGA) 49 generates a topologyreserving map of single cells. Here we used PAGA as implemented in scanpy 1.7.2 87 . For the standalone PAGA visualization, we used the algorithm on the Harmony embedding. We also used RNA velocity to direct the PAGA edges, as implemented in scVelo 49,50 .
Data integration of the time-course data from ref. 55 was performed using reciprocal PCA analysis of the Seurat package based on the most variable 4000 genes. We used a resolution of 0.9 to find clusters in the integrated dataset. We used the AddModuleScore Seurat command to calculate the scores for each module. The modules consist of the top 200 genes for each cNMF module coming from our dataset.
μCT imaging μCT imaging was performed using the Inveon multi-modality μPET/ SPECT/CT system (Siemens Medical Solutions, Knoxville, USA). Acquisitions covering the thorax and lungs were performed using a tube voltage of 80 kV and a tube current of 500 μA. A total of 360 projections were acquired over 360°with an integration time of 200 ms each. The detector was operated using a 4 × 4 binning mode resulting in a resolution of approximately 100 μm in the centre of rotation. Image reconstruction with isotropic resolution was performed using the Feldkamp algorithm with a Shepp-Logan kernel onto a 512 × 512 × 928 grid with appropriately sized voxels. Image analysis was performed using ImageJ.

Statistics and reproducibility
The statistical information for the experiments is detailed in the text, figure legends, and figures. The correlation between LMCs and gene promoter methylation was calculated using Pearson correlation. If not otherwise stated, p value < 0.05 was considered significant. False discovery rate (FDR) was used as a multiple test correction method, where appropriate. In this case, FDR q < 0.05 was considered significant.
Boxplots if otherwise not indicated are showing with the lower and upper hinges the first and third quartiles (the 25th and 75th percentiles). The upper and lower whiskers extend to ±1.5 × inter-quartile range from the upper/lower hinge. Data points outside of these ranges are plotted individually as outliers.
Two samples were excluded from the DNA methylation analysis, based on quality issues. Mice were randomized to different time point groups without selection. The investigators were not blinded to allocation during experiments and outcome assessment. No statistical method was used to predetermine sample size.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The single-cell RNA-seq and DNA methylation data have been deposited at the GEO database under accession code GSE176186 and are publicly available. TCGA-LUAD RNASeq data and clinical data on survival were downloaded from the GDC Data Portal [https://portal.gdc. cancer.gov/projects/TCGA-LUAD]. Single-cell RNA-seq published by ref. 55 was downloaded from Zenodo 88 . Publicly available human single-cell transcriptome datasets generated by Zilinois et al. 52 and Kim et al. 53 were downloaded from GEO [https://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc=GSE127465] and [https://www.ncbi.nlm.nih. gov/geo/query/acc.cgi?acc=GSE131907], respectively. All other data are available within the Supplementary Data and Source data provided with this manuscript. Source data are provided with this paper.