An atlas of epithelial cell states and plasticity in lung adenocarcinoma

Understanding the cellular processes that underlie early lung adenocarcinoma (LUAD) development is needed to devise intervention strategies1. Here we studied 246,102 single epithelial cells from 16 early-stage LUADs and 47 matched normal lung samples. Epithelial cells comprised diverse normal and cancer cell states, and diversity among cancer cells was strongly linked to LUAD-specific oncogenic drivers. KRAS mutant cancer cells showed distinct transcriptional features, reduced differentiation and low levels of aneuploidy. Non-malignant areas surrounding human LUAD samples were enriched with alveolar intermediate cells that displayed elevated KRT8 expression (termed KRT8+ alveolar intermediate cells (KACs) here), reduced differentiation, increased plasticity and driver KRAS mutations. Expression profiles of KACs were enriched in lung precancer cells and in LUAD cells and signified poor survival. In mice exposed to tobacco carcinogen, KACs emerged before lung tumours and persisted for months after cessation of carcinogen exposure. Moreover, they acquired Kras mutations and conveyed sensitivity to targeted KRAS inhibition in KAC-enriched organoids derived from alveolar type 2 (AT2) cells. Last, lineage-labelling of AT2 cells or KRT8+ cells following carcinogen exposure showed that KACs are possible intermediates in AT2-to-tumour cell transformation. This study provides new insights into epithelial cell states at the root of LUAD development, and such states could harbour potential targets for prevention or intervention.

from carcinogenesis and lineage-tracing mouse models that recapitulate the disease, with a focus on how specific populations evolve to give rise to malignant tumours.

Epithelial transcriptional landscape
Our study combined in-depth scRNA-seq of early-stage LUAD clinical specimens and cross-species analysis and lineage tracing in a human-relevant model of LUAD development following exposure to tobacco carcinogen (Fig. 1a).We used scRNA-seq to study EPCAM-enriched epithelial cell subsets from early-stage LUAD samples from 16 patients and 47 paired NL samples spanning a topographical continuum from the LUADs, that is, tumour-adjacent, tumourintermediate and tumour-distant locations 15 (Fig. 1a, Supplementary Fig. 1 and Supplementary Tables 1 and 2).We also collected tumour and normal tissue sets from the same regions for whole-exome sequencing (WES) profiling and high-resolution spatial transcriptomics (ST) and protein analyses (Fig. 1a).
Malignant cells showed low-to-no expression of lineage-specific markers and, overall, reduced lineage identity (Fig. 1b, bottom).Malignant cells formed 14 clusters (Fig. 1c) that were primarily patient-specific (Extended Data Fig. 1c, left), which signified strong inter-patient heterogeneity.Overall, malignant cells showed high levels of aneuploidy (Extended Data Fig. 1c, middle).We did not detect any distinct clustering pattern with respect to smoking status (Extended Data Fig. 1d).Annotation based on genomic profiling (by WES) showed that malignant cells from 3 patients with KRAS mutant LUADs (KM-LUADs; patients P2, P10 and P14) clustered closely together.By contrast, malignant cells from other LUADs showed a more dispersed clustering pattern (Fig. 1d, Extended Data Fig. 1c,e and Supplementary Table 1).scRNA-seq analysis confirmed the presence of copy number variations (CNVs) and KRAS G12D mutations in patient-specific tumour clusters and the absence of KRAS G12D in KRAS wild-type LUADs (KW-LUADs) (Extended Data Fig. 1c).

LUAD malignant transcriptional programs
Malignant cells from KM-LUADs clustered together and distinctively from those of EGFR mutant LUADs (EM-LUADs) or MET mutant LUADs (MM-LUADs) (Fig. 1e).KM-LUADs showed more transcriptomic similarity (that is, shorter Bhattacharyya distances) at both sample and cell levels (Extended Data Fig. 1f, left and right, respectively) compared with other LUADs (P < 2.2 × 10 −16 ).Distances between KM-LUADs (KM-KM) were significantly smaller compared with those between EM-LUADs (EM-EM; P = 0.02) or other LUADs (other-other; P = 0.03; Extended Data Fig. 1f, left).Clustering of malignant cells, following adjustment for patient-specific effects, showed that cluster 5 was enriched with cells from KM-LUADs (patients P2, P10 and P14; Extended Data Fig. 1g).Most of the KRAS mutant malignant cells clustered separately from other cells, which indicated the presence of distinct transcriptional programs in KRAS mutant cells (Fig. 1f).In line with previous reports 15,17 , malignant cells from KM-LUADs were chromosomally more stable than those from EM-LUADs (P < 2.2 × 10 −16 ; Extended Data Fig. 1h, left).CNV burden was significantly higher in malignant cells from patients who smoke than in patients who never smoked (P < 2.2 × 10 −16 ; Extended Data Fig. 1h, right).Differentiation states of malignant cells exhibited high inter-patient heterogeneity.That is, irrespective of tumour mutation load, KM-LUAD cells were the least differentiated, as indicated by their highest CytoTRACE 18 scores, followed by EM-LUADs (P < 0.001; Fig. 1g,h and Supplementary Table 4).There was intra-tumour heterogeneity (ITH) in differentiation states (for example, patients P2, P9, P14 and P15), whereby malignant cells from 7 out of the 14 patients with detectable malignant cells exhibited a broad distribution of CytoTRACE scores, with KM-LUADs showing a trend for higher variability in differentiation (greater Wasserstein distances) than EM-LUADs or other LUADs (Fig. 1h and Extended Data Fig. 1i).
Clustering of malignant cells (Meta C1 to Meta C5) based on levels of 23 recurrent meta-programs (MPs) 19 showed that Meta C1 comprised cells mostly from KM-LUADs (92%).Cells in Meta C1 also displayed the highest expression of gene modules associated with KRAS G12D present in pancreatic ductal adenocarcinoma (MP30) 19 , epithelial-to-mesenchymal transition (EMT-III; MP14) and epithelial senescence (MP19), and, conversely, the lowest levels of alveolar MP (MP31) (Extended Data Fig. 2a-c and Supplementary Table 5).Notably, malignant cells from patients P2, P10 and P14 with KM-LUADs showed significantly higher expression of MP30 than those from patients with KW-LUADs (P < 2.2 × 10 −16 ; Extended Data Fig. 2d).Malignant cell states also exhibited ITH in KM-LUADs (for example, patient P14; Extended Data Fig. 2e).A subset of KRAS G12D cells showed activation of MP30, and there were diverse activation patterns for other MPs (for example, cell respiration) across the mutant cells (Extended Data Fig. 2e, middle, 2f).Overall, malignant cells bearing KRAS G12D mutations showed reduced differentiation (Extended Data Fig. 2e, right), which was concordant with the loss of alveolar differentiation (MP31) in KM-LUADs (Extended Data Fig. 2a,b).Malignant cell clusters from patient P14 exhibited different levels of CNVs 15 , whereby a cluster enriched in KRAS G12D cells harboured relatively late CNV events (for example, chromosome 1p loss, chromosome 8 and chromosome 12 gains) and reduced alveolar signature scores, a result in line with attenuated differentiation (Extended Data Fig. 2g,h).A KRAS signature was derived based on distinct expression features of KRAS mutant malignant cells from our cohort (that is, specific to cluster 5; Extended Data Fig. 1g), which was strongly and significantly correlated with the MP30 signature (R = 0.92, P < 2.2 × 10 −16 , Extended Data Fig. 2i and Supplementary Table 6).KM-LUADs from The Cancer Genome Atlas (TCGA) cohort and with relatively high expression of our KRAS signature were enriched with activated KRAS MP30 and with other MPs that were increased in Meta C1 (Extended Data Fig. 2j).KW-LUADs in TCGA with a relatively higher expression of the KRAS signature displayed significantly lower overall survival (OS; P = 0.02; Extended Data Fig. 2k).A similar trend was observed when analysing KRAS G12D mutant LUADs alone despite the small cohort size (P = 0.3; Extended Data Fig. 2k).These data highlight the extensive transcriptomic heterogeneity between LUAD cells and transcriptional programs that are biologically and possibly clinically relevant to KM-LUAD.

AICs in LUAD
In contrast to AT2 cells, which were overall decreased in LUADs compared with multi-region NL samples (P = 0.002), AICs showed the opposite pattern (P = 0.02; Extended Data Fig. 3a,b).AT2 cell Article fractions were gradually reduced with increasing tumour proximity across multi-region NL samples from 7 out of the 16 patients with LUAD (P = 0.004; Extended Data Fig. 3c,d).No significant changes in fractions were found for other major lung epithelial cell types (Extended Data Fig. 3e).AICs were intermediary along the AT2-to-AT1 cell developmental and differentiation trajectories (Fig. 2a and Extended Data Fig. 3f,g), a result reminiscent of intermediary alveolar cells in cancer-free mice exposed to acute lung injury 20 .The proportion of least-differentiated AICs in LUAD tissues was higher than that of their more differentiated counterparts (29% compared with 11%, respectively; Extended Data Fig. 3h).Notably, AICs were inferred to transition to malignant cells, including KRAS mutant cells that were more developmentally late relative to EGFR mutant malignant cells (P < 2.2 × 10 −16 ; Fig. 2a and Extended Data Fig. 3f).Further analysis of AICs identified a subpopulation that had a distinctly high expression of KRT8 (Fig. 2b).These KACs had increased expression of CDKN1A, CDKN2A, PLAUR and the tumour marker CLDN4 (Fig. 2b, Extended Data Fig. 3i and Supplementary Table 7).KACs were also significantly less differentiated (P < 2.2 × 10 −16 ; Fig. 2c) and more developmentally late (P = 1.2 × 10 −11 ; Extended Data Fig. 3j) than other AICs.Notably, KACs transitioned to KRAS mutant malignant cells in pseudotime, whereas other AICs were more closely associated with differentiation to AT1 cells (Extended Data Fig. 3j).Proportions of KACs among non-malignant epithelial cells were strongly and significantly increased in LUADs relative to multi-region NL tissues (P = 2.4 × 10 −4 ; Fig. 2d), and were significantly higher in LUADs than in AT1, AT2 or other AIC fractions (P < 2.2 × 10 −16 ; Fig. 2e).Notably, tumour-associated KACs clustered farther away from AICs compared with NL-derived KACs (Extended Data Fig. 3k).High-resolution, multiplex imaging analysis of KRT8, CLDN4 and pan-cytokeratin (PanCK) showed that KACs were enriched in tumouradjacent normal regions (TANs) and were found immediately next to malignant cells showing high expression of KRT8 and CLDN4 (Fig. 2f and Extended Data Fig. 4a).Although KACs were also found in the uninvolved NL samples, consistent with our scRNA-seq analysis, only in the TANs did they display features of 'reactive' epithelial cells (Fig. 2f and Extended Data Fig. 4a).ST analysis of tumour tissue from patient P14 demonstrated increased expression of KRT8 in tumour regions (with high CNV scores) and in TAN regions that histologically comprised highly reactive pneumocytes and exhibited moderate-to-low CNV scores (Fig. 2g).Deconvolution showed that KACs were closer to tumour regions relative to alveolar cells (Extended Data Fig. 4b).ST analysis of a KAC-enriched region showed that KACs were intermediary in the transition of alveolar parenchyma to tumour cells (Extended Data Fig. 4b).Tumour regions had markedly reduced expression of NKX2-1 and the alveolar signature (Extended Data Fig. 4b), a result in line with reduced alveolar differentiation in KM-LUADs (Extended Data Fig. 2b).
KAC markers (Fig. 2b) were high in tumour regions and in TANs with reactive pneumocytes, and they spatially overlapped with the KRAS signature (Fig. 2g).Similar to KRAS, but unlike the AT1 and alveolar signatures, a KAC signature we derived was highest in KACs relative to AT1, AT2 or other AICs (Fig. 2h, Extended Data Fig. 4c,d and Supplementary Table 8).A signature pertinent to other AICs we derived was evidently lower in KACs relative to other AICs (Extended Data Fig. 4e).In KACs from all samples, KAC and KRAS signatures positively correlated

A KAC state is linked to mouse KM-LUAD
We next performed scRNA-seq analysis of lung epithelial cells from mice in which the lung lineage-specific G protein-coupled receptor a gene, Gprc5a, is knocked out (Gprc5a −/− ) 21,22 and which develop KM-LUADs following tobacco carcinogen exposure.We analysed lungs from Gprc5a −/− mice treated with nicotine-derived nitrosamine ketone (NNK) or saline (as control) at the end of exposure (EOE) and at 7 months after exposure, the time point of KM-LUAD onset (n = 4 mice per group and time point; Fig. 3a and Supplementary Fig. 4).Clustering analysis of 9,272 high-quality epithelial cells revealed distinct lineages, including KACs that clustered between AT1 and AT2 cell subsets and close to tumour cells (Extended Data Fig. 7a).Similar to their human counterparts, malignant cells displayed low expression of lineage-specific genes (Extended Data Fig. 7b and Supplementary Table 10).Consistently, cells from the malignant cluster had high CNV scores, expressed Kras G12D mutations and showed increased expression of markers associated with loss of alveolar differentiation (Kng2 and Meg3) and immunosuppression (Cd24a) 23 (Extended Data Fig. 7c,d).Malignant cells were present only at 7 months after NNK treatment and were absent at EOE to carcinogen and in saline-treated animals (Fig. 3b and Extended Data Fig. 7e,f).KAC fractions were markedly increased at EOE relative to control saline-treated littermates (P = 0.03), and they were, for the most part, maintained at 7 months after NNK treatment (Fig. 3b and Extended Data Fig. 7f,g).Immunofluorescence (IF) analysis showed that KRT8 + AT2-derived cells were present in NNK-exposed NL and were nearly absent in the lungs of saline-treated mice (Fig. 3c).LUADs also displayed high expression of KRT8 (Fig. 3c).KACs displayed a markedly increased prevalence of Kras G12D mutations, more so than CNV burden, and increased expression of genes (for example, Gnk2) associated with loss of alveolar differentiation 24 , albeit to lesser extents compared with malignant cells (Fig. 3d, Extended Data Fig. 7h and Supplementary Table 11).Of note, AT2 cell fractions were reduced with time (Extended Data Fig. 7f,g).ST analysis at 7 months after NNK treatment showed that tumour regions had significantly increased expression of Krt8 and Plaur and had spatially overlapping KAC and KRAS signatures (Fig. 3e and Extended Data Fig. 8a,c,e).In line with our human data, Krt8 high KACs with increased expression of KAC and KRAS signatures were enriched in 'reactive', non-neoplastic regions surrounding tumours and were themselves intermediary in the transition from normal to tumour cells (Fig. 3e and Extended Data Fig. 8).
Mouse (Extended Data Fig. 9a) and human (Extended Data Fig. 9b) KACs displayed commonly increased activation of pathways, including NF-κB, hypoxia and p53 signalling, among others.A p53 signature we derived was significantly increased in KACs at EOE, and more so at 7 months after exposure to NNK, compared with both AT2 and tumour cells (Extended Data Fig. 9c, left).Similar patterns were noted for the expression of p53 pathway-related genes and senescence markers, including Cdkn1a, Cdkn2b and Bax, as well as Trp53 itself (Extended Data Fig. 9c, right).Of note, activation of p53 has previously been reported in Krt8 + transitional cells 25 during bleomycin-induced alveolar regeneration, and which themselves showed overlapping genes with KACs from our study (32%; Extended Data Fig. 9d).A mouse KAC signature we derived and that was significantly enriched in mouse KACs and malignant cells (P < 2.2 × 10 −16 , Extended Data Fig. 9e) and in human LUADs (P = 1.2 × 10 −8 , Extended Data Fig. 9f, left) was also significantly increased in premalignant AAHs (P = 4.3 × 10 −4 ) and further increased in invasive LUADs (P = 1.5 × 10 −3 ) relative to matched NL tissues (Extended Data Fig. 9f, right).Similar to alveolar intermediates in acute lung injury 25,26 and KACs in human LUADs (Fig. 2), mouse KACs were probably AT2 cell-derived, acted as intermediate states in AT2-to-AT1 cell differentiation and were inferred to transition to malignant cells (Fig. 4a, top row, Supplementary Fig. 5 and Supplementary Table 12).KACs assumed an intermediate differentiation state that more closely resembled malignant cells than other alveolar subsets (Fig. 4a, middle).The KAC signature was increased in cancer stem cell and stem cell-like progenitor cells that we had cultured from the MDA-F471 LUAD cell line (derived from a Gprc5a −/− mouse exposed to NNK 27 ) relative to parental 2D cells (Extended Data Fig. 10a).KACs at EOE were less differentiated than those at 7 months after exposure (Fig. 4a, bottom right).Notably, the fraction of KACs with Kras G12D mutations was low at EOE (about 0.02) and was increased at 7 months after NNK (about 0.19) (Extended Data Fig. 10b).Kras G12D KACs from the late time point were significantly less differentiated (P = 7.8 × 10 −6 ; Extended Data Fig. 10c) and showed higher expression of KAC signature genes such as Cldn4, Krt8, Cavin3 and Cdkn2a than in Kras WT KACs (Extended Data Fig. 10d).Moreover, Kras WT KACs were more similar to previously reported Krt8 + intermediate cells 25 than Kras G12D KACs (20% overlap compared with 10%, respectively; Extended Data Fig. 10e,f).
We performed integrated scRNA-seq analysis of cells from our mouse cohort with those in mice driven by Kras G12D from two separate studies 28,29 .Cluster C5 comprised cells from all three studies with distinctly high expression of KAC markers and the KAC signature itself (Extended Data Fig. 10g-i).The majority of C5 cells were from our study; however, C5 cells from Kras G12D -driven mice still expressed higher levels of the mouse KAC signature compared with normal AT2 cells from all studies (Extended Data Fig. 10j).The mouse KAC signature was markedly and significantly increased in human AT2 cells with induced expression of KRAS G12D relative to those with KRAS WT from ref. 29 (P < 2.2 × 10 −16 ; Extended Data Fig. 10k).In agreement with these findings, the mouse KAC signature, like its human counterpart (Extended Data Fig. 4h), was significantly enriched in KACs and in malignant cells from KM-LUADs relative to EM-LUADs (P = 0.04 and P < 2.2 × 10 −16 , respectively; Extended Data Fig. 10l).
GFP + organoids from reporter mice at 3 months after NNK treatment showed significantly and markedly enhanced growth compared with those from saline-exposed animals, and were almost exclusively composed of cells with KAC markers (KRT8 + and CLDN4 + ; Extended Data Fig. 12a,e).Given that KACs, like early tumour cells, acquired Kras mutations, we examined the effects of targeted KRAS(G12D) inhibition on these organoids.We first tested effects of the KRAS(G12D) inhibitor MRTX1133 (ref.30) in vitro and found that it inhibited the growth of mouse MDA-F471 cells and LKR13 cells (derived from Kras LSL-G12D mice 31 ) in a dose-dependent manner (Extended Data Fig. 12b).This effect was accompanied by the suppression of phosphorylated levels of ERK1, ERK2 and S6 kinase in both cell lines (Extended Data Fig. 12c and Supplementary Fig. 9).Notably, MRTX1133-treated KAC marker-positive organoids showed significantly reduced sizes and KRT8 and CLDN4 expression intensities relative to DMSO-treated counterparts (P < 1.5 × 10 −10 ; Extended Data Fig. 12d,e).
To further confirm that KACs give rise to tumour cells, we labelled KRT8 + cells in Gprc5a −/− ;Krt8-creER;Rosa tdT/+ mice.Krt8-creER;Rosa tdT/+ mice were first used to confirm increased tdT + labelling (that is, higher number of KACs) in the lung parenchyma at EOE to NNK compared with control saline-treated mice (Fig. 4d and Extended Data Fig. 13a).We then analysed lungs of NNK-exposed Gprc5a −/− ;Krt8-creER;Rosa tdT/+ mice that were injected with tamoxifen immediately after NNK treatment (Fig. 4d).Of note, most tumours showed tdT + KRT8 + cells at varying  Article levels, with some tumours showing a strong extent of tdT labelling, which suggested oncogenesis of KRT8 + cells (Fig. 4d and Extended Data Fig. 13b,c).Most tdT + tumour cells were AT2 cell-derived (LAMP3 + ) (Fig. 4e and Extended Data Fig. 13b).The fraction of tdT + LAMP3 + cells out of the total tdT + cells was similar between EOE and follow-up after EOE to NNK (Fig. 4e).Normal-appearing regions also showed tdT + AT1 cells (NKX2-1 + LAMP3 − ), which indicated the possible turnover of AT2 cells and KACs to AT1 cells (Extended Data Fig. 13a).Taken together, our in vivo analyses identified KACs as an intermediate cell state in the early development of KM-LUAD and following tobacco carcinogen exposure.

Discussion
Our multi-modal analysis of epithelial cells from early-stage LUADs and the peripheral lung uncovered diverse malignant states, patterns of ITH and cell plasticity programs that are linked to KM-LUAD pathogenesis.Of these, we identified alveolar intermediary cells (KACs) that arise after activation of alveolar differentiation programs and that could act as progenitors for KM-LUAD (Fig. 4f).KACs were evident in normal-appearing areas in the vicinity of lesions in both mouse and patient samples, which suggested that the early appearance of these cells (for example, following tobacco exposure) may represent a 'field of injury' 11 .A pervasive field of injury is relevant to the development of human lung cancer and to the complex spectrum of mutations present in normal-appearing lung tissue 32,33 .We propose that KACs represent injured or mutated cells in the normal-appearing lung that have an increased likelihood of transformation to lung tumour cells (Fig. 4f).
Our analysis uncovered strong links and intimately shared properties between KACs and KRAS mutant lung tumour cells, including KRAS mutations, reduced differentiation and pathways.Notably, we showed that growth of KAC-rich and AT2 reporter-labelled organoids derived from lungs with early lesions was highly sensitive to KRAS(G12D) inhibition 34 .Although our in vivo findings are consistent with previous independent reports showing that AT2 cells are the preferential cells of origin in Kras-driven LUADs in animals [35][36][37] , they enable a deeper scrutiny of the specific attributes and states of alveolar intermediary cells in the trajectory towards KM-LUADs.Following acute lung injury, AT2 cells can differentiate into AICs that are characterized by high expression of Krt8 and are crucial for AT1 regeneration 25,26,38 .We found evidence of KAC-like cells with notable expression of the KAC signature in Kras G12D -driven mice, albeit at a reduced frequency compared with our tobacco-mediated carcinogenesis model.Thus, it is plausible that KACs can arise owing to an injury stimulus (here tobacco exposure) or mutant Kras expression or to both conditions.Our work raises questions that would be important to pursue in future studies.It is not clear whether KACs are a dominant or obligatory path in AT2-to-tumour transformation.Also, we do not know the effects of expressing mutant oncogenes, Kras or others, or tumour suppressors on the likelihood of KACs to divert away from mediating AT1 regeneration and, instead, transition to tumour cells.Recent studies suggest that p53 could curtail the oncogenesis of alveolar intermediate cells 39 .
Combining in-depth interrogation of early-stage human LUADs and Kras mutant lung carcinogenesis models, our study provided an atlas with an expansive number of epithelial cells.This atlas of epithelial and malignant cell states in human and mouse lungs underscores new cell-specific subsets that underlie inception of LUADs.Our discoveries may inspire the derivation of targets (for example, KAC signals such as early KRAS programs) to prevent the initiation and development of LUAD.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-024-07113-9.

Multi-regional sampling of human surgically resected LUADs and NL tissues
Study participants were evaluated at the MD Anderson Cancer Center and underwent standard-of-care surgical resection of early-stage LUAD (I-IIIA).Samples from all patients were obtained from banked or residual tissues under informed consent and approved by MD Anderson institutional review board protocols.Residual surgical specimens were then used for derivation of multi-regional samples for single-cell analysis (Supplementary Table 1).Immediately following surgery, resected tissues were processed by an experienced pathologist assistant.One side of the specimen was documented and measured, followed by tumour margin identification.Based on the placement of the tumour within the specimen, incisions were made at defined collection sites in one direction along the length of the specimen and spanning the entire lobe: tumour-adjacent and tumour-distant normal parenchyma at 0.5 cm from the tumour edge and from the periphery of the overall specimen or lobe, respectively.An additional tumour-intermediate normal tissue sample was selected for patients P2-P16 and ranged between 3 and 5 cm from the edge of the tumour.Sample collection was initiated with NL tissues that are farthest from the tumour moving inward towards the tumour to minimize cross-contamination during collection.

Single-cell isolation from tissue samples
Fresh tissues from human donors and mouse lungs were collected in RPMI medium supplemented with 2% FBS and maintained on ice for immediate processing.Tissues were placed in a cell culture dish containing Hank's balanced salt solution (HBSS) on ice, and extra-pulmonary airways and connective tissue were removed with scissors.Samples were transferred to a new dish on ice and minced into about 1 mm 3 pieces followed by enzymatic digestion.For human tissues, the enzymatic solution was composed of collagenase A (10103578001, Sigma Aldrich), collagenase IV (NC9836075, Thermo Fisher Scientific), DNase I (11284932001, Sigma Aldrich), dispase II (4942078001, Sigma Aldrich), elastase (NC9301601, Thermo Fisher Scientific) and pronase (10165921001, Sigma Aldrich) as previously described 40 .For mouse lung digestion, the enzymatic solution was composed of collagenase type I (CLS-1 LS004197, Worthington), elastase (ESL LS002294, Worthington) and DNase I (D LS002007, Worthington).Samples were transferred to 5 ml LoBind Eppendorf tubes and incubated in a 37 °C oven for 20 min with gentle rotation.Samples were then filtered through 70 μm strainers (Miltenyi Biotech, 130-098-462) and washed with ice-cold HBSS.Filtrates were then centrifuged and resuspended in ice-cold ACK lysis buffer (A1049201, Thermo Fisher Scientific) for red blood cell lysis.Following red blood cell lysis, samples were centrifuged and resuspended in ice-cold FBS, filtered (using 40 μm FlowMi tip filters; H13680-0040, Millipore) and an aliquot was taken to count cells and check for viability by Trypan blue (T8154, Sigma Aldrich) exclusion analysis.

Sorting and enrichment of viable lung epithelial singlets
Single cells from patient P1 were stained with Sytox Blue viability dye (S34857, Life Technologies) and processed on a FACS Aria I instrument.Cells from P2-P16 were stained with anti-EPCAM-PE (347198, BD Biosciences; 1:50 dilution in ice-cold PBS containing 2% FBS) for 30 min with gentle rotation at 4 °C.Mouse lung single cells were similarly stained but with a cocktail of antibodies (1:250 each) against CD45-PE/Cy7 (103114, BioLegend), ICAM2-A647 (A15452, Life Technologies), EPCAM-BV421 (118225, BioLegend) and ECAD-A488 (53-3249-80, eBioscience).Stained cells were then washed, filtered using 40 μm filters, stained with Sytox Blue (human) or Sytox Green (mouse) and processed on a FACS Aria I instrument (gating strategies for epithelial cell sorting are shown in Supplementary Figs. 1 and 4 for human and mouse cells, respectively).Doublets and dead cells were eliminated, and viable (Sytox-negative) epithelial singlets were collected in PBS containing 2% FBS.Cells were washed again to eliminate ambient RNA, and a sample was taken for counting by Trypan Blue exclusion before loading on 10X Genomics Chromium microfluidic chips.

Preparation of single-cell 5′ gene expression libraries
Up to 10,000 cells per sample were partitioned into nanolitre-scale Gel beads-in-emulsion (GEMs) using a Chromium Next GEM Single Cell 5′ Gel Bead kit v.1.1 (1000169, 10X Genomics) and by loading onto Chromium Next GEM Chips G (1000127, 10X Genomics).GEMs were then recovered to construct single-cell gene expression libraries using a Chromium Next GEM Single Cell 5′ Library kit (1000166, 10X Genomics) according to the manufacturer's protocol.In brief, recovered barcoded GEMs were broken and pooled, followed by magnetic bead clean-up (Dynabeads MyOne Silane, 37002D, Thermo Fisher Scientific).10X-barcoded full-length cDNA was then amplified by PCR and analysed using a Bioanalyzer High Sensitivity DNA kit (5067-4626, Agilent).Up to 50 ng of cDNA was carried over to construct gene expression libraries and was enzymatically fragmented and size-selected to optimize the cDNA amplicon size before 5′ gene expression library construction.Samples were then subjected to end-repair, A-tailing, adaptor ligation and sample index PCR using Single Index kit T Set A (2000240, 10X Genomics) to generate Illumina-ready barcoded gene expression libraries.Library quality and yield were measured using a Bioanalyzer High Sensitivity DNA kit (5067-4626, Agilent) and a Qubit dsDNA High Sensitivity Assay kit (Q32854, Thermo Fisher Scientific).Indexed libraries were normalized by adjusting for the ratio of the targeted cells per library as well as individual library concentration and then pooled to a final concentration of 10 nM.Library pools were then denatured and diluted as recommended for sequencing on an Illumina NovaSeq 6000 platform.

scRNA-seq data processing and quality control
Raw scRNA-seq data were pre-processed (demultiplex cellular barcodes, read alignment and generation of gene count matrix) using Cell Ranger Single Cell Software Suite (v.3.0.1) provided by 10X Genomics.For read alignment of human and mouse scRNA-seq data, human reference GRCh38 (hg38) and mouse reference GRCm38 (mm10) genomes were used, respectively.Detailed quality control metrics were generated and evaluated, and cells were carefully and rigorously filtered to obtain high-quality data for downstream analyses 15 .In brief, for basic quality filtering, cells with low-complexity libraries (in which detected transcripts were aligned to <200 genes such as cell debris, empty drops and low-quality cells) were filtered out and excluded from subsequent analyses.Probable dying or apoptotic cells in which >15% of transcripts derived from the mitochondrial genome were also excluded.For scRNA-seq analysis of Gprc5a −/− ;Sftpc creER/+ ;Rosa Sun1GFP/+ mice, cells with ≤500 detected genes or with a mitochondrial gene fraction that is ≥15% were filtered out using Seurat 41 .

Doublet detection and removal, and batch effect evaluation and correction
Probable doublets or multiplets were identified and carefully removed through a multi-step approach as described in previous studies 15,42 .In brief, doublets or multiplets were identified based on library complexity, whereby cells with high-complexity libraries in which detected transcripts are aligned to >6,500 genes were removed and, based on cluster distribution and marker gene expression, whereby doublets or multiplets forming distinct clusters with hybrid expression features and/or exhibiting an aberrantly high gene count were also removed.Expression levels and proportions of canonical lineage-related marker genes in each identified cluster were carefully reviewed.Clusters co-expressing discrepant lineage markers were identified and removed.Doublets or multiplets were also identified using the doublet detection algorithm DoubletFinder 43 .The proportion of expected doublets was estimated based on cell counts obtained before scRNA-seq library construction.Data normalization was then performed using Seurat 41 on the filtered gene-cell matrix.Statistical assessment of possible batch effects was performed on non-malignant epithelial cells using the R package ROGUE 36 , an entropy-based statistic, as described in previous studies 15,42 and Harmony 44 was run with default parameters to remove batch effects present in the PCA space.

Unsupervised clustering and subclustering analysis
The function FindVariableFeatures of Seurat 41 was applied to identify highly variable genes for unsupervised cell clustering.PCA was performed on the top 2,000 highly variable genes.The elbow plot was generated with the ElbowPlot function of Seurat and, based on which, the number of significant principal components (PCs) was determined.The FindNeighbors function of Seurat was used to construct the shared nearest neighbour (SNN) graph based on unsupervised clustering performed using the Seurat function FindClusters.Multiple rounds of clustering and subclustering analyses were performed to identify major epithelial cell types and distinct cell transcriptional states.Dimensionality reduction and 2D visualization of cell clusters was performed using UMAP 45 and the Seurat function RunUMAP.The number of PCs used to calculate the embedding was the same as that used for clustering.For analysis of human epithelial cells, ROGUE was used to quantify cellular transcriptional heterogeneity of each cluster.Subclustering analysis was then performed for low-purity clusters identified by ROGUE.Hierarchical clustering of major epithelial subsets was performed on the Harmony batch-corrected PCA dimension reduction space.For malignant cells, except for global UMAP visualization, downstream analyses, including identification of large-scale CNVs, inference of cancer cell differentiation states, quantification of meta-program expression, trajectory analysis and mutation analysis, were performed without Harmony batch correction.The hierarchical tree of human epithelial cell lineages was computed based on Euclidean distance using the Ward linkage method, and the dendrogram was generated using the R function plot.hc.For scRNA-seq analysis of Gprc5a −/− mice, the top-ranked ten PCs were selected using the elbowplot function.SNN graph construction was performed with resolution parameter = 0.4, and UMAP visualization was performed with default parameters.For scRNA-seq analysis of Gprc5a −/− ;Sftpc creER/+ ;Rosa Sun1GFP/+ mice, the top-ranked 20 Harmony-corrected PCs were used for SNN graph construction, and unsupervised clustering was performed with resolution parameter = 0.4.UMAP visualization was performed with the RunUMAP function with min.dist= 0.1.Differentially expressed genes (DEGs) of clusters were identified using the FindAllMarkers function with FDR-adjusted P value < 0.05 and log 2 (fold change) > 1.2.

Identification of malignant cells and mapping KRAS codon 12 mutations
Malignant cells were distinguished from non-malignant subsets based on information integrated from multiple sources as described in previous studies 15,42 .The following strategies were used to identify malignant cells.(1) Cluster distribution: owing to the high degree of inter-patient tumour heterogeneity, malignant cells often exhibit distinct cluster distribution compared with normal epithelial cells.Although non-malignant cells derived from different patients are often clustered together by cell type, malignant cells from different patients probably form separate clusters.(2) CNVs: we applied inferCNV 16 (v.1.3.2) to infer large-scale CNVs in each individual cell with T cells as the reference control.To quantify CNVs at the cell level, CNV scores were aggregated using a previously described strategy 16 .In brief, arm-level CNV scores were computed based on the mean of the squares of CNV values across each chromosomal arm.Arm-level CNV scores were further aggregated across all chromosomal arms by calculating the arithmetic mean value of the arm-level scores using the R function mean.(3) Marker gene expression: expression of lung epithelial lineage-specific genes and LUAD-related oncogenes was determined in epithelial cell clusters.(4) Cell-level expression of KRAS G12D mutations: as we previously described 15 , BAM files were queried for KRAS G12D mutant alleles, which were then mapped to specific cells.KRAS G12D mutations, along with cluster distribution, marker gene expression and inferred CNVs as described above, were used to distinguish malignant cells from non-malignant cells.Following clustering of malignant cells from all patients, an absence of malignant cells that were identified from P12 or P16 was noted.This can be possibly attributed to the low number of epithelial cells captured in tumour samples from these patients (Supplementary Table 2).
Mapping KRAS codon 12 mutations.To map somatic KRAS mutations at single-cell resolution, alignment records were extracted from the corresponding BAM files using mutation location information.Unique mapping alignments (MAPQ = 255) labelled as either PCR duplication or secondary mapping were filtered out.The resulting somatic variant carrying reads were evaluated using Integrative Genomics Viewer (IGV) 46 and the CB tags were used to identify cell identities of mutation-carrying reads.To estimate the VAF of KRAS G12D mutation and cell fraction of KRAS G12D -carrying cells within malignant and non-malignant epithelial cell subpopulations (for example, malignant cells from all LUADs, malignant cells from KM-LUADs, KACs from KM-LUADs), reads were first extracted based on their unique cell barcodes and BAM files were generated for each subpopulation using samtools (v.1.15).Mutations were then visualized using IGV, and VAFs were calculated by dividing the number of KRAS G12D -carrying reads by the total number of uniquely aligned reads for each subpopulation.A similar approach was used to visualize KRAS G12C -carrying reads and to calculate the VAF of KRAS G12C in KACs of normal tissues from KM-LUAD cases.To calculate the mutation-carrying cell fraction, extracted reads were mapped to the KRAS G12D locus from BAM files using AlignmentFile and fetch functions in pysam package.Extracted reads were further filtered using the 'Duplicate' and 'Quality' tags to remove PCR duplicates and low-quality mappings.The number of reads with or without KRAS G12D mutation in each cell was summarized using the CB tag in read barcodes.Mutation-carrying cell fractions were then calculated as the ratio of the number of cells with at least one KRAS G12D read over the number of cells with at least one high-quality read mapped to the locus.

PCA analysis of malignant cells and quantification of transcriptome similarity
Raw unique molecular identifier counts of identified malignant cells were log-normalized and used for PCA analysis using Seurat (RunPCA function).PCA dimension reduction data were extracted using the Embeddings function.The top three most highly ranked PCs were exported for visualization using JMP (v.15).3D scatterplots of PCA data were generated using the scatterplot 3D tool in JMP (v.15).Bhattacharyya distances were calculated using the bhattacharyya.distfunction from the R package fpc (v.2.2-9).The top 25 highly ranked PCs were used for both patient-level and cell-level distance calculations.For Bhattacharyya distance quantification at the cell level, 100 cells were randomly sampled for each patient group defined by driver mutations (for example, KM-LUADs).The random sampling process was repeated 100 times, and pairwise Bhattacharyya distances were then calculated between patient groups.Differences in Bhattacharyya distances between patient groups were tested using Wilcoxon rank-sum tests, and boxplots were generated using the geom_boxplot function from the R package ggplot2 (v.3.2.0).

Determination of non-malignant cell types and states
Non-malignant cell types and states were determined based on unsupervised clustering analysis following batch effect correction using Harmony 44 .Two rounds of clustering analysis were performed on non-malignant cells to identify major cell types and cell transcriptional states within major cell types.Clustering and UMAP visualization of Article human normal epithelial cells (Extended Data Fig. 1a) were performed using Seurat with default parameters.Specifically, the parameters k.param = 20 and resolution = 0.4 were used for SNN graph construction and cluster identification, respectively.UMAP visualization was performed with default parameters (min.dist= 0.3).For clustering analysis of airway and alveolar epithelial cells, the RunPCA function was used to determine the most contributing top PCs for each subpopulation and similar clustering parameters (k.param = 20 and resolution = 0.4) were used for SNN graph construction and cluster identification.UMAP plots were generated with min.dist= 0.3 using the RunUMAP function in Seurat.Density plots of alveolar intermediate cells were generated using the stat_densit_2d function in the R package ggplot2 (v.3.3.5) with the first two UMAP dimension reduction data as the input.DEGs for each cluster were identified using the FindMarkers function in Seurat with a FDR-adjusted P < 0.05 and a fold change cut-off > 1.2.Canonical epithelial marker genes from previously published studies by our group and others 15,47,48 were used to annotate normal epithelial cell types and states.Bubble plots were generated for select DEGs and canonical markers to define AT1 cells (AGER1 + ETV5 + PDPN + ), AT2 cells (SFTPB + SFTPC + ETV5 + ), SCGB1A1 + SFTPC + dual-positive cells, AICs (AGER1 + ETV5 + PDPN + and SFTPB + SFTPC + ), club and secretory cells (SCGB1A1 + SCGB3A1 + CYP2F1 + ), basal cells (KRT5 + TP63 + ), ciliated cells (CAPS + PIFO + FOXJ1 + ), ionocytes (ASCL3 + FOXI + ), neuroendocrine cells (CALCA + ASCL1 + ) and tuft cells (ASCL2 + MGST2 + PTGS1 + ).KACs were identified by unsupervised clustering of AICs and defined based on previously reported marker genes 25,26,49 , including significant upregulation of the following genes relative to other alveolar cells: KRT8, CLDN4, PLAUR, CDKN1A and CDKN2A.

Scoring of curated gene signatures
Genes in previously defined ITH MPs 19 were downloaded from the original study.Among a total of 41 consensus ITH MPs identified, MPs with unassigned functional annotations (unassigned MPs 38-41; n = 4), neural and haematopoietic lineage-specific MPs (MPs 25-29, MPs 33-37; n = 10) and cell-type-specific MPs irrelevant to LUAD (MPs 22-24 secreted/cilia, MP 32 skin-pigmentation; n = 4) were filtered out, resulting in 23 MPs that closely correlated with hallmarks of cancer and that were used for further analysis.Signature scores were computed using the AddModuleScore function in Seurat as previously described 15,42 .The KRAS signature used in this study was derived by calculating DEGs between the KRAS mutant malignant-cell-enriched cluster and other malignant cells (FDR-adjusted P value < 0.05, log(fold change) > 1.2; Extended Data Fig. 2i).Human and mouse KAC signatures and the human 'other AIC' signature were derived by calculating DEGs using FindAllMarkers among alveolar cells (FDR-adjusted P value < 0.05, log(fold change) > 1.2).Mouse genes in the p53 pathway were downloaded from the Molecular Signature Database (MSigDB; https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_P53_PATH-WAY;MM3896).Signature scores for KACs, other AICs, KRAS and p53 were calculated using the AddModuleScore function in Seurat.

Analysis of alveolar cell differentiation states and trajectories
Analysis of differentiation trajectories of lung alveolar and malignant cells was performed using Monocle 2 (ref.50) by inferring the pseudotemporal ordering of cells according to their transcriptome similarity.Monocle 2 analysis of malignant cells from P14 was performed using default parameters with the detectGenes function.Detected genes were further required to be expressed by at least 50 cells.For construction of the differentiation trajectory of lineage-labelled epithelial cells (GFP + ), the top 150 DEGs (FDR-adjusted P value < 0.05, log(fold change) > 1.5, expressed in ≥50 cells) ranked by fold-change of each cell population from NNK-treated samples were used for ordering cells with the setOrderingFilter function.Trajectories were generated using the reduceDimension function with the method set to 'DDRTree'.Trajectory roots were selected based on the following criteria: (1) inferred pseudotemporal gradient; (2) CytoTRACE score prediction; and (3) careful manual review of the DEGs along the trajectory.To depict expression dynamics of ITH MPs 19 , ITH MP scores were plotted along the pseudotime axis and smoothed lines were generated using the smoother tool in JMP Pro (v.15).Using the raw counts without normalization as input, CytoTRACE 18 was applied with default parameters to infer cellular differentiation states to complement trajectory analysis and further understand cellular differentiation hierarchies.The normalmixEM function from the R package mixtools was used to determine the CytoTRACE score threshold in AICs with k = 2.A final threshold of 0.58 was used to dichotomize AICs into high-differentiation and low-differentiation groups.The Wasserstein distance metric was applied using R package transport (v.0.13) to quantify the variability of distribution of CytoTRACE scores.The function wasserstein1d was used to calculate the distance between the distribution of actual CytoTRACE scores of one patient and the distribution of simulated data with identical mean and standard deviation.The robustness of Monocle 2-based pseudotemporal ordering prediction was validated by independent pseudotime prediction tools including Palantir 51 , Slingshot 52 and Cellrank 53 .Slingshot (v.2.6.0)pseudotime prediction was performed using slingshot function with reduceDim parameter set to 'PCA' and other parameters set to defaults.Cellrank prediction was performed using the CytoTRACEKernel function with default parameters from Cellrank python package (v.1.5.1).Palantir prediction was performed using Palantir python package (v.1.0.1).A diffusion map was generated using run_diffusion_maps function with n_components = 5.Palantir prediction was generated using run_palantir function with num_waypoints = 500 and other parameters set to defaults.Inferred pseudotime by the three independent methods was then integrated with that generated by Monocle 2 for each single cell, followed by pairwise mapping and correlation analysis.Cell density plots were generated using Contour tool in JMP (v.15) with n = 10 gradient levels and contour type parameter set to 'Nonpar Density'.To assess the pseudotime prediction consistency between Monocle 2 and the three independent methods, Spearman's correlation coefficients were calculated and statistically tested using cor.testfunction in R.

ST data generation and analysis
ST profiling of formalin fixation and paraffin-embedding (FFPE) tissues from P14 with LUAD and of three lung tissues from two Gprc5a −/− mice was performed using the Visium platform from 10X Genomics according to the manufacturer's recommendations and as previously reported 54 .P14 FFPE tissues were collected from areas adjacent to the tissues analysed by scRNA-seq.Regions of interest per tissue or sample, each comprising a 6.5 × 6.5 mm capture area, were selected based on careful annotation of H&E-stained slides that were digitally acquired using an Aperio ScanScope Turbo slide scanner (Leica Microsystems).HALO software (Indica Labs) was used for pathological annotation (tumour areas, blood vessels, bronchioles, lymphoid cell aggregates, macrophages, muscle tissue, normal parenchyma and reactive pneumocytes) of H&E histology images.Spot-level histopathological annotation and visualization was generated using loupe browser (v.6.3.0).In brief, cloupe files generated from Space Ranger were loaded into the loupe browser.Visualization of annotation was then generated in svg formats using the export plot tool.ST RNA-seq libraries were generated according to the manufacturer's instructions, each with up to about 3,600 uniquely barcoded spots.Libraries were sequenced on an Illumina NovaSeq 6000 platform to achieve a depth of at least 50,000 mean read pairs per spot and at least 2,000 median genes per spot.
Demultiplexed raw sequencing data were aligned, and gene level expression quantification was generated with analysis pipelines as previously described 54 .In brief, demultiplexed clean reads were aligned against the UCSC human GRCh38 (hg38) or the GRCm38 (mm10) mouse reference genomes by Spaceranger (v.1.3.0 for human ST data and v.2.0.0 for mouse ST data) and using default settings.Generated ST gene expression count matrices were then analysed using Seurat (v.4.1.0)to perform unsupervised clustering analysis.Using default parameters, the top-ranked 30 PCA components were used for SNN graph construction and clustering and for UMAP low-dimension space embedding with default parameters.UMAP analysis was performed using the RunUMAP function.The SpatialDimPlot function was used to visualize unsupervised clustering.The R package inferCNV 16 was used for copy number analysis.Reference spots used in CNV analysis were selected on the basis of careful review of cluster marker genes using the DotPlot function from Seurat and inspection of pathological annotation.CNV scores were calculated by computing the standard deviations of CNVs inferred across 22 autosomes.Loupe browser (v.6.3.0) was used for visualization of pathological annotation results.Expression levels of genes of interest (for example, KRT8) as well as signatures of interest (for example, KAC and KRAS) were measured and directly annotated on histology images with pixel-level resolution using the TESLA (v.1.2.2) machine learning framework 55 (https://github.com/jianhuupenn/TESLA).TESLA can compute superpixel-level gene expression and detect unique structures within and surrounding tumours by integrating information from high-resolution histology images.The annotation and visualize_annotation functions were used to annotate regions with high signature signals.KRT8, PLAUR, CLDN4, CDKN1A and CDKN2A were used for 'KAC markers' signature annotation in the human ST analysis.For mouse ST data, Krt8, Plaur, Cldn4, Cdkn1a and Cdkn2a were used for 'KAC signature' annotation.Gene level expression visualization of Krt8 and Plaur was generated using the scatter function from scanpy (v.1.9.1).Deconvolution analysis was conducted using CytoSPACE 56 (https://github.com/digitalcytometry/cytospace).Annotated scRNA-seq data were first transformed into a compatible format using function generate_cytospace_from_scRNA_seurat_object.Visium spatial data were prepared using the function generate_cyto-space_from_ST_seurat_object.Deconvolution was performed using CytoSpace function (v.1.0.4) with default parameters.To determine neighbouring cell composition for a specific cell population in Visium data, CytoSPACE was first applied to annotate every spot with the most probable cell type.Neighbouring spots were defined as the six spots surrounding each spot and, accordingly, the neighbouring cell composition for specific cell types were computed.Trajectory construction of ST data was performed using Monocle 2 (ref.18) with the DDRTree method using DEGs with FDR-adjusted P value < 0.05.

Bulk DNA extraction and WES
Total DNA was isolated from homogenized cryosections of human lung tissues and, when available, from frozen peripheral blood mononuclear cells (PBMCs) using a Qiagen AllPrep mini kit (80204) or a DNeasy Blood and Tissue kit (69504), respectively (both from Qiagen) according to the manufacturer's recommendations.Qubit 4 Fluorometer (Thermo Fisher Scientific) was used for measurement of DNA yield.TWIST-WES was performed on a NovaSeq 6000 platform at a depth of 200× for tumour samples and 100× for NL and PBMCs to analyse recurrent driver mutations and using either PBMCs or distant NL tissues when blood draw was not consented, as germline control.WES data were processed and mapped to the human reference genome, and somatic mutations were identified and annotated as previously described 57,58 with further filtration steps.In brief, only MuTect 59 calls marked as 'KEEP' were selected and taken into the next step.Mutations with a low VAF (<0.02) or low alt allele read coverage (<4) were removed.Then, common variants reported by ExAc (the Exome Aggregation Consortium, http://exac.broadinstitute.org),Phase-3 1000 Genome Project (http://phase3browser.1000genomes.org/Homo_sapiens/Info/Index) or the NHLBI GO Exome Sequencing Project (ESP6500) (http://evs.gs.washington.edu/EVS/)with minor allele frequencies greater than 0.5% were further removed.Intronic mutations, mutations at 3′ or 5′ UTR or UTR-flanking regions, and silent mutations were also removed.The mutation load in each tumour was calculated as the number of nonsynonymous somatic mutations (nonsense, missense, splicing, stop gain, stop loss substitutions as well as frameshift insertions and deletions).

Survival analysis
Analysis of OS in the TCGA LUAD and PROSPECT 60 cohorts was performed as previously described 15 .KRAS mutation status in TCGA LUAD samples was downloaded from cBioPortal (https://www.cbioportal.org, study ID: luad_tcga_pan_can_atlas_2018).For TCGA dataset, clinical data were downloaded from the PanCanAtlas study 18 .The logrank test and Kaplan-Meier methods were used to calculate P values between groups and to generate survival curves, respectively.Statistical significance testing for all survival analyses was two-sided.To control for multiple hypothesis testing, Benjamini-Hochberg method was applied to correct P values, and FDR q values were calculated where applicable.Results were considered significant at P value or FDR q value of <0.05.Multivariate survival analysis was performed using a Cox proportional hazards regression model that calculated the hazard ratio, the 95% confidence interval and P values when using pathologic stage, age, KAC and 'other AIC' signatures as covariables.

Analysis of public datasets
Publicly available datasets were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/)under accession numbers GSE149813, GSE154989, GSE150263, GSE102511 and GSE219124.Details of the studies 28,29 analysed are as follows: GSE149813 investigated single lung cells from Kras LSL-G12D;LSL-YFP mice with Ad5CMV-Cre infection 29 ; GSE154989 studied AT2 lineage-labelled cells from lungs of Kras LSL-G12D/+ ;Rosa26 LSL-tdTomato/+ mice 28 .Gene expression count matrices of dataset interrogating Kras G12D -driven mouse model from GSE149813 were pre-processed using Seurat following the same filtering steps in that original report.For the GSE154989 dataset 28 , cells used for analysis were the ones labelled as "PASSED_QC" in supplementary table S7 in that study.For the GSE149813 dataset 29 , cells with >500 median number of genes detected and <10% fraction of mitochondrial genome derived reads, and according to the pre-processing methods described in their original report 29 , were retained for analysis.Cells with >7,500 number of genes detected were further filtered to remove potential doublets or multiplets, resulting in 8,304 cells in total for downstream analysis.Both datasets were integrated with mouse cell data generated in this study using Harmony 18 with default parameters settings.The top ranked 20 Harmony-corrected PCs were used for clustering with the FindClusters function using resolution = 0.4.UMAP dimension reduction embedding was performed using the RunUMAP function with the same set of Harmony-corrected PCs.Gene expression levels and frequencies of representative cluster marker genes were visualized using DotPlot function from Seurat.The KAC signature score was calculated using the AddModuleScore function from Seurat.The mouse KAC signature was also studied in human AT2 cells with and without inducible KRAS G12D (dataset GSE150263) also from ref. 29.Cell filtration criteria described in the original report 29 were followed to filter out potential dead cells and doublets (number of detected genes > 800 and the percent of mitochondrial gene reads fraction < 25%).The 20 top-ranked PCs were used for clustering using the FindClusters function with resolution = 0.1.UMAP dimension reduction embeddings were computed using the same SNN graph.The KAC signature score was calculated using AddModuleScore function from Seurat package.
The bulk RNA-seq dataset GSE102511 was a previously published dataset by our group and comprised normal lung tissues, precursor AAHs and matched LUADs (n = 15, each) 61 .The previously published 62 bulk RNA-seq data GSE219124 were generated on cancer stem cell and stem cell-like progenitor cells, in the form of spheres, and their parental MDA-F471 counterparts (a cell line we had developed and cultured from a KM-LUAD of an NNK-exposed Gprc5a −/− mouse) 62 .To interrogate the association of KACs with tumour formation, gene expression matrices of bulk RNA-seq data GSE102511 (TPM count matrix) and GSE219124 (FPKM count matrix) were extracted and used for quantification of KAC signature expression using MCPcounter (v.1.2.0) R package.Heatmaps were generated using pheatmap (v.1.0.12)R package.
Mouse KACs from this study were compared to mouse Krt8 + transitional cells involved in alveolar regeneration post-acute lung injury from a previous study 25 .Overlapping marker genes between mouse KACs and the previously reported Krt8 + transitional cells were statistically evaluated using the ggvenn (v.0.1.9)R package using the top-ranked 50 marker genes based on fold change from each study.

Animal housing and tobacco carcinogen exposure experiments
Animal experiments were conducted according to Institutional Animal Care and Use Committee (IACUC)-approved protocols at the University of Texas MD Anderson Cancer Center.Mice were maintained in a pathogen-free animal facility.No statistical methods were used to predetermine sample sizes.In all animal experiments, sex-matched and age-matched mice were randomized to treatment groups.For all experiments and until end points were reached (up to 7 months after exposure to saline or NNK), mice were monitored for signs of ill health and their body weight was measured to ensure weight loss did not exceed 20% of body weight over 72 h.None of the mice developed these symptoms; therefore, they were all euthanized after reaching IACUC-approved end points.End points permitted by our IACUC protocols were not exceeded in any of the experiments.Analysis of data from animal experiments was performed in a blinded fashion.To study KACs in the context of KM-LUAD pathogenesis in vivo, Gprc5a −/− mice were interrogated because they form LUADs that are accelerated by tobacco carcinogen exposure and acquire somatic Kras G12D mutations-features that are highly pertinent to KM-LUAD development 21,63,64 and therefore to exploring KACs in this setting.Gprc5a −/− mice were generated as previously described 21,65 .Sex-matched and age-matched Gprc5a −/− mice were divided into starting groups of 4 mice per exposure (NNK or saline control) and time point (EOE or 7 months after exposure, n = 16 mice in total).Eight-week-old mice were intraperitoneally injected with 75 mg kg -1 of body weight NNK or vehicle 0.9% saline (control), 3 times per week for 8 weeks.At EOE or at 7 months after exposure, lungs were collected for derivation of live single cells for scRNA-seq.Whole lungs from additional mice treated as described above were processed by FFPE and for analysis by IF (n = 2 mice per treatment group at EOE and 7 months after exposure, 8 mice in total) and ST (3 lung tissues from n = 2 mice at 7 months after NNK exposure).
Sftpc creER/+ ;Rosa Sun1GFP/+ mice were provided by H. Chapman (University of California, San Francisco) and were crossed to Gprc5a −/− mice to generate Gprc5a −/− ;Sftpc creER/+ ;Rosa Sun1GFP/+ mice for analysis of lineage-labelled AT2 cells.Gprc5a −/− ;Sftpc creER/+ ;Rosa Sun1GFP/+ mice were treated with 75 mg kg -1 NNK or control saline (intraperitoneally), 3 times per week for 8 weeks.At week 6 of treatment (2 weeks before EOE), mice from both groups received 250 μg (intraperitoneally) tamoxifen dissolved in corn oil for four consecutive days.At EOE or 3 months after exposure to saline or NNK, lungs were digested to derive live (Sytox Blue-negative) GFP + single cells by flow cytometry using a FACS Aria I instrument as previously described 66 (the gating strategy for GFP cell sorting is shown in Supplementary Fig. 6).Sorted single cells were analysed by scRNA-seq (GFP + and GFP − fractions from n = 2 mice per treatment at 3 months after exposure to saline and NNK) or used to derive organoids (GFP + cells from n = 4 or 5 mice at EOE to saline or NNK, respectively, and from n = 10 or 13 mice at 3 months after saline or NNK, respectively).Whole lungs from additional mice treated with saline or NNK and tamoxifen as described above (n = 2 per treatment group) were collected (FFPE) at 3 months after NNK and analysed by IF.
Krt8-creER;Rosa tdT/+ animals were used to generate Gprc5a −/− ; Krt8-creER;Rosa tdT/+ mice for analysis of lineage-labelled KRT8 + cells.Krt8-creER (stock number 017947) and Rosa tdT/+ (Ai14; stock number 007914) mice were obtained from the Jackson Laboratory.Mice harbouring Krt8-creER;Rosa tdT/+ were first used for pilot studies to examine labelling of KRT8 + cells.Mice were exposed to control saline (n = 2 mice) or to 8 weeks of NNK (n = 3 mice) as described above followed by 1 mg tamoxifen for 6 continuous days, after which lungs were analysed at the end of tamoxifen exposure.To examine the relevance of labelled KRT8 + cells to tumour development, Gprc5a −/− ;Krt8-creER;Rosa tdT/+ mice were similarly exposed to NNK for 8 weeks followed by tamoxifen, and lungs were then analysed at 8-12 weeks after NNK exposure (n = 3 mice).All lungs were collected and processed for formalin fixation, OCT embedding and IF analysis.above and imaged using a Nikon A1plus confocal microscope.Cell counter ImageJ plugin was used to count tdT + cells within lesions and cells in normal-appearing areas, namely: AT2 cells (LAMP3 + ), tdT + AT2 cells (tdT + LAMP3 + ), AT1 cells (LAMP3 -NKX2-1 + , avoiding noticeable airways) and tdT + AT1 cells (tdT + NKX2-1 + LAMP3 -).Percentages of tdT + LAMP3 + and tdT + NKX2-1 + LAMP3 -cells out of total tdT + cells were computed.Counts were averages of triplicate images taken at ×20 magnification for each time point.The percent regional surface area covered by tdT + cells in normal-appearing regions was estimated by examining the tdT expression across entire lobe sections for each replicate.
3D culture and analysis of AT2-derived organoids Gprc5a −/− ;Sftpc creER/+ ;Rosa Sun1GFP/+ were treated with NNK or saline and tamoxifen as described above, and they were euthanized at EOE (4 saline-treated and 5 NNK-treated mice) or at 3 months after exposure (10 saline-treated and 13 NNK-treated mice).Lungs were collected, dissociated into single cells (see mouse single-cell derivation in the Methods section 'Single-cell isolation from tissue samples'), and live (Sytox Blue-negative) GFP + single cells were collected by flow cytometry using a FACS Aria I instrument as previously described 66 .GFP + AT2 cells from NNK-treated or saline-treated groups were immediately washed and resuspended at a concentration of 5,000 cells per 50 μl of 3D medium (F12 medium supplemented with insulin, transferrin and selenium, 10% FBS, penicillin-streptomycin and l-glutamine).GFP + cells were mixed at a 1:1 ratio (by volume) with 50,000 mouse endothelial cells (collected from mouse lungs by CD31 selection and expanded in vitro as previously described 67 ) and resuspended in 50 μl of Geltrex reduced growth factor basement membrane matrix (A1413301, Gibco).Next, 100 μl of 1:1 GFP + :endothelial cell mixture was plated on Transwell inserts with 0.4 μm pores and allowed to solidify for 30 min in a humidified CO 2 incubator (EOE: n = 3 wells per condition; 3 months after exposure: n = 4 wells for saline-derived organoids and n = 12 wells for NNK-derived organoids).Each well was then supplemented with 3D medium containing ROCK inhibitor (Y-27632, Millipore) and recombinant mouse FGF-10 (6224-FG, R&D Systems), and plates were incubated at 37 °C in a humidified CO 2 incubator.Wells were replenished with 3D medium every other day.For GFP + organoids derived from mice exposed to NNK, 200 nM KRAS(G12D)-specific inhibitor MRTX1133 or DMSO vehicle was added to the medium and replenished 3 times a week (n = 6 wells per condition).Organoids were monitored and analysed twice a week using an EVOS M7000 imaging system (Thermo Fisher Scientific), whereby the numbers and sizes of organoids greater than 100 μm in diameter were recorded.At end point, 3D organoids were collected from the basement membrane matrix using Gentle Cell Dissociation reagent (100-0485, StemCell Technologies), fixed with 4% paraformaldehyde, permeabilized, blocked and stained overnight at 4 °C with a mixture of IF primary antibodies raised against LAMP3, GFP, KRT8 and cavin 3. The next day, organoids were washed and stained with fluorophore-conjugated secondary antibodies overnight at 4 °C while being protected from light.Organoids were washed and stained with DAPI nuclear stain for 30 min, after which they were collected in Aqua-Poly/Mount (18606-20, Polysciences) and transferred to slides.Images of organoids were captured using an Andor Revolution XDi WD spinning disk confocal microscope and analysed using Imaris software (Oxford Instruments).

Western blot analysis
LKR13 and MDA-F471 cells were plated in 6-well plates (10 6 cells per well) and grown under different conditions as described above.Protein lysates were extracted at 3 h after treatment and analysed by western blotting following overnight incubation with antibodies to the following primary proteins: vinculin (E1E9V, rabbit, Cell Signaling Technology, 13901; 1:1,000); phosphorylated p44/42 MAPK (ERK1/2, rabbit, Cell Signaling Technology, 9101; 1:2,000); phosphorylated S6 ribosomal protein (Ser 235/236, rabbit, Cell Signaling Technology, 4858; 1:2,000); p44/42 MAPK (ERK1/2, rabbit, Cell Signaling Technology, 9102; 1:2,000); or S6 (E.573.4,rabbit, Invitrogen, MA5-15164; 1:1,000).This was followed by 1 h of incubation with diluted secondary antibody (1706515 goat anti-rabbit IgG-HRP conjugate, Bio-Rad).Protein lysates from each cell line were analysed on multiple gels (four per cell line) with Precision Plus Protein Dual Color Standard (1610394, Bio-Rad) as the ladder and blotted to membranes to separately probe for phosphorylated and total forms of the same proteins, which have highly similar molecular weights (using phospho-specific antibodies or antibodies targeting total version of same protein).Vinculin protein levels were evaluated as loading control on each of the blots.Four blots (phospho-ERK, total ERK, phospho-S6 and total S6) for each of LKR13 and MDA-F471 are shown in Supplementary Fig. 9, each with its own analysis of equal protein loading (vinculin blot) and whereby only the ones indicated with green rectangles are presented in Extended Data Fig. 12c.Membranes were cut horizontally using molecular weight marker as a guide, and cut membranes were incubated with the specified antibodies (see Supplementary Fig. 9 for site of cutting and for overlay of colorimetric and chemiluminescent images of the same blot to display ladder and the analysed protein, respectively).Blots were imaged using the ChemiDoc Touch Imaging System (Bio-Rad) with Chemiluminescence and Colorimetric (for protein ladder) applications and auto expose or manual settings.

Chemicals and reagents
Tobacco-specific carcinogen (NNK) with a purity of 99.96% by HPLC was purchased from TargetMol.Tamoxifen and H&E staining reagents were purchased from Sigma Aldrich.The KRAS(G12D) inhibitor MRTX1133 was provided by J. Christensen (Mirati Therapeutics).

Statistical analyses
In addition to the algorithms and statistical analyses described above, all other basic statistical analyses were performed in the R statistical environment (v.4.0.0).The Kruskal-Wallis H-test was used to compare variables of interests across three or more groups.Wilcoxon rank-sum test was used for paired comparisons among matched samples from the same patients.Wilcoxon rank-sum test was used to compare other continuous variables such as gene expression levels and signature scores between groups.Spearman's correlation coefficient was calculated to assess associations between two continuous variables (for example, cellular proportions and gene signature scores).Fisher's exact test was used to identify differences in frequencies of groups based on two categorical variables.Ordinal logistic regression was performed using the polr function in the built-in R package MASS (v.7.3).Benjamin-Hochberg method was used to control for multiple hypothesis testing.All statistical tests performed in this study were two-sided.Results were considered significant at P values or FDR q values < 0.05.When a P value reported by R was smaller than 2.2e-16, it was reported as P < 2.2 × 10 −16 .Extended Data Fig. 6 | Prevalence of KRAS G12D mutant KACs in LUAD.a, UMAP clustering of alveolar subsets.b, Quantification of CNV scores across AT1, AT2, KACs and other AICs.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each group: AT2 = 146,776; AT1 = 25,561, Other AICs =8,593; KACs = 1,440; Malignant = 17,064.P values were calculated using two-sided Wilcoxon Rank-Sum test with a Benjamini-Hochberg correction.KRAS G12D variant allele frequencies (c) and fractions of KRAS G12D mutant cells (d) in alveolar and malignant cells from LUAD and normal samples and analysed by scRNA-seq.VAF for KRAS G12C variant in KACs from KM normal tissues is shown in green (c).n on top of each bar in d: number of KRAS G12D mutant cells.e, KRAS activation signature was statistically compared across KRAS G12D mutant KACs, KRAS wt KACs, AICs, and AT2 cells.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each box-and-whisker: KACs KRAS G12D = 15;

Extended Data
KACs KRAS wt = 1,425; Other AICs =8,593; AT2 = 146,776.P values were calculated using the two-sided Wilcoxon Rank-Sum test with a Benjamini-Hochberg correction.f, g, CytoTRACE scores in KACs versus other AICs from all cells of KM (f, left) and KW cases (f, right), in cells from normal lung tissues of patients with KM-LUAD (g, left), and cells from KM-LUAD (g, middle) and KW-LUAD (g, right) tissues.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each box-and-whisker: KM cases, KACs = 719; KM cases, Other AICs = 2,414; KW cases, KACs = 721; KM cases, Other AICs = 6,179; KM normal tissues, KACs = 408; KM normal tissues, Other AICs = 2,286; KM-LUADs, KACs = 311; KM-LUADs, Other AICs = 128; KW-LUADs, KACs = 295; KW-LUADs, Other AICs = 940.P values were calculated using two-sided Wilcoxon Rank-Sum tests with Benjamini-Hochberg adjustment for multiple testing correction.Extended Data Fig. 8 | ST analysis of KACs in tobacco-associated development of KM-LUAD.a, ST analysis of the same tumour-bearing mouse lung in Fig. 3e with cell clusters identified by Seurat (inlet) and mapped spatially (left).Spatial maps with scaled expression of Krt8 and Plaur are shown on the right.b, Pseudotime trajectory analysis of C0 (alveolar parenchyma), C2 (reactive area with KACs nearby tumours), and clusters C7 and C8 (representing two tumours) from the same tumour-bearing mouse lung in a. c, ST analysis of another tumour-bearing lung region from the same NNK-exposed mouse as in panel a, and showing histological spot-level annotation of H&E-stained images (left) followed by spatial maps with scaled expression of Krt8, Plaur, and KAC signature (right).d, Cell clusters identified by Seurat (top left) and mapped spatially (top right) from the same mouse tumour-bearing lung in c. bottom of panel k: Pseudotime trajectory analysis of C0 (alveolar parenchyma), C8 (reactive area with KACs nearby the tumour), and C5 (representing one tumour) from the mouse tumour-bearing lung in c. e, ST analysis of a tumour-bearing lung from an additional mouse at 7 months following NNK showing histological spot-level annotation of H&E-stained images (left) followed by spatial maps with scaled expression of Krt8 (middle, top), Plaur (middle, bottom), and KAC signature (right).

Extended Data
Extended Data Fig. 10 | Mouse KACs exist in a continuum, bear strong resemblance to human KACs, and are present in independent KRAS G12Ddriven mouse models of LUAD.a, Mouse KAC signature score (left) and heatmap showing expression of select KAC marker genes (right) in bulk transcriptomes of MDA-F471-derived 3D spheres versus parental MDA-F471 cells grown in 2D.P value was calculated using two-sided Wilcoxon Rank-Sum test.Box-and-whisker definitions are similar to Extended Data Fig. 1f.b, Fraction of Kras G12D mutant cells in different mouse alveolar cell subsets including when separating KACs into early KACs at EOE and late KACs at 7 months following NNK.Numbers of Kras G12D mutant cells are indicated on top of each bar.c, CytoTRACE scores in late KACs with Kras G12D mutation and in those with wild type KRAS (Kras wt ).P value was calculated using two-sided Wilcoxon Rank-Sum test.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each box-and-whisker: Kras G12D = 72; Kras wt = 564.d, Proportions and average expression levels of select marker genes for the different subsets indicated.Pie charts showing percentages of unique and overlapping DEG sets between Krt8 + transitional cells identified by Strunz and colleagues and either Kras G12D (e) or Kras wt (f) KACs from this study.g, UMAP clustering of cells integrated from our mouse cohort with cells in the scRNAseq datasets from studies by Marjanovic

Article
Extended Data Fig. 11 | KACs are enriched in lungs and they precede the formation of Kras G12D tumours in an AT2 lineage reporter tobacco carcinogenesis mouse model.a, Representative IF analysis of KRT8, GFP, and LAMP3 in GFP-labelled AT2-derived mouse lung organoids (n = 3 wells per condition) derived from tamoxifen-exposed AT2 reporter mice at EOE to saline (n = 4 mice) or NNK (n = 5 mice).Scale bar: 10 μm.b, UMAP distribution of GFP + cells at 3 months following NNK exposure or saline and coloured by alveolar or tumour subsets.c, Proportions and average expression levels of select marker genes for mouse normal alveolar cell lineages and tumour cells defined in b. d, Fraction of Kras G12D cells across alveolar and early tumour subsets.Absolute numbers of Kras G12D cells are indicated on top of each bar.e, UMAPs of GFP + cells from tumour-bearing AT2 reporter mice at 3 months following NNK or saline and coloured by presence of Kras G12D mutation or expression of KAC, AT1, and AT2 signatures.f, UMAPs showing distribution of alveolar and tumour cell subsets (left) as well as cells with Kras G12D mutation (right) by treatment (saline or NNK).g, Trajectories of GFP + cells from tumour-bearing reporter mice at 3 months following NNK or saline coloured by inferred pseudotime (left), differentiation (middle), and cell lineage and showing subset composition (right).h, CytoTRACE (left) and pseudotime (right) scores across GFP + subsets.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each box-and-whisker: AT2 = 144; Early-AT2-like tumour = 144; KAC-KAC-like = 288; AT1 = 72.
Extended Data Fig. 12 | KAC-rich organoids are sensitive to targeted inhibition of KRAS.a, Size quantification of organoids derived from GFP + lungs cells of mice treated with saline (derived from 10 mice and plated into 4 wells) or NNK (derived from 13 mice and plated into 12 wells) at 3 months postexposure.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n organoids in each group: Saline = 63; NNK = 66.P value was calculated using two-sided Wilcoxon Rank-Sum test.b, Analysis of relative viability 4 days post treatment of LKR13 and MDA-F471 cells following treatment with increasing concentrations of MRTX1133.n samples in each group of LKR13 cells: -= 7; 1 = 7; 10 = 3; 40 = 4; 100 = 3. n samples in each group of MDA-F471 cells: -= 8; 1 = 8; 10 = 7; 40 = 11; 100 = 6.n.s: non-significant (P > 0.05).Error-bars: standard deviations of means.P values were calculated using an ordinary one-way ANOVA with Dunnett's post-test.Results are representative of two independent experiments.c, Western blot analysis for the indicated proteins and phosphorylated proteins at 3 h post-treatment to EGF without or with increasing concentrations of the KRASG12D inhibitor MRTX1133 (from Mirati Therapeutics, Inc.).Proteins were run on additional gels (4 per cell line) to separately blot with antibodies against phosphorylated and total forms of each of the indicated proteins (Supplementary Fig. 9).Vinculin protein levels were analysed as loading control for each gel whereby four LKR13 and four MDA-F471 blots are shown in Supplementary Fig. 9.For lysates from each of the two cell lines, vinculin blots from Gel 1 (Supplementary Fig. 9) are selected and shown in this figure panel.Uncropped images of western blots with molecular weight ladder are also shown in Supplementary Fig. 9. Results are representative of three independent experiments.EGF: epidermal growth factor.d, Size quantification of organoids derived from GFP + lungs cells of NNK-treated AT2 reporter mice and treated with 200 nM MRTX1133 or control DMSO in vitro (n = 6 wells per condition).Box-and-whisker definitions are similar to Extended Data Fig. 1f.n samples (organoids) in each group: DMSO = 38; MRTX1133 = 53.P value was calculated using two-sided Wilcoxon Rank-Sum test.e, IF analysis showing representative organoids derived from sorted GFP + cells from AT2 reporter mice that were exposed to saline (top two rows; n = 4 wells) or exposed to NNK and then treated ex vivo with DMSO (middle two rows; n = 6 wells) or 200 nM MRTX1133 (bottom two rows; n = 6 wells).Scale bars = 50 μm except for the first DMSO-treated organoid (third row) whereby scale bar = 100 μm.Staining was repeated three times with similar results.

March 2021
Probable doublets or multiplets were identified and carefully removed through a multi-step approach as described in previous studies.In brief, doublets or multiplets were identified based on library complexity, whereby cells with high-complexity libraries in which detected transcripts are aligned to >6,500 genes were removed and, based on cluster distribution and marker gene expression,whereby doublets or multiplets forming distinct clusters with hybrid expression features and/or exhibiting an aberrantly high gene count were also removed.Expression levels and proportions of canonical lineage-related marker genes in each identified cluster were carefully reviewed.Clusters coexpressing discrepant lineage markers were identified and removed.Doublets or multiplets were also identified using the doublet detection algorithm DoubletFinder.The proportion of expected doublets was estimated based on cell counts obtained before scRNA-seq library construction.Data normalization was then performed using Seurat.on the filtered gene-cellmatrix.Statistical assessment of possible batch effects was performed on non-malignant epithelial cells using the R package ROGUE, an entropy-based statistic, as described in previous studies, and Harmony was run with default parameters to remove batcheffects present in the PCA space.

Unsupervised clustering and subclustering analysis
The function FindVariableFeatures of Seurat was applied to identify highly variable genes for unsupervised cell clustering.PCA was performed on the top 2,000 highly variable genes.The elbow plot was generated with the ElbowPlot function of Seurat and, based on which, the number of significant principal components (PCs) was determined.The FindNeighbors function of Seurat was used to construct the shared nearest neighbour (SNN) graph based on unsupervised clustering performed using the Seurat function FindClusters.Multiple rounds of clustering and subclustering analyses were performed to identify major epithelial cell types and distinct cell transcriptional states.Dimensionality reduction and 2D visualization of cell clusters was performed using UMAP and the Seurat function RunUMAP.The number of PCs used to calculate the embedding was the same as that used for clustering.For analysis of human epithelial cells, ROGUE was used to quantify cellular transcriptional heterogeneity of each cluster.Subclustering analysis was then performed for low-purity clusters identified by ROGUE.Hierarchical clustering of major epithelial subsets was performed on the Harmony batch-corrected PCA dimension reduction space.For malignant cells, except for global UMAP visualization, downstream analyses, including identification of large-scale CNVs, inference of cancer cell differentiation states, quantification of meta-program expression, trajectory analysis and mutation analysis, were performed without Harmony batch correction.The hierarchical tree of human epithelial cell lineages was computed based on Euclidean distance using the Ward linkage method, and the dendrogram was generated using the R function plot.hc.For scRNA-seq analysis of Gprc5a-/-mice, the top-ranked ten PCs were selected using the elbowplot function.SNN graph construction was performed with resolution parameter = 0.4, and UMAP visualization was performed with default parameters.For scRNA-seq analysis of Gprc5a-/-;SftpcCreER/+;RosaSun1GFP/+ mice, the top-ranked 20 Harmony-corrected PCs were used for SNN graph construction, and unsupervised clustering was performed with resolution parameter = 0.4.UMAP visualization was performed with the RunUMAP function with min.dist= 0.1.Differentially expressed genes (DEGs) of clusters were identified using the FindAllMarkers function with FDR-adjusted P value < 0.05 and log (fold change) > 1.2.

Identification of malignant cells and mapping KRAS codon 12 mutations
Malignant cells were distinguished from non-malignant subsets based on information integrated from multiple sources as described in previous studies.The following strategies were used to identify malignant cells.(1) Cluster distribution: owing to the high degree of interpatient tumour heterogeneity, malignant cells often exhibit distinct cluster distribution compared with normal epithelial cells.Although nonmalignant cells derived from different patients are often clustered together by cell type, malignant cells from different patients probably form separate clusters.(2) CNVs: we applied inferCNV (v.1.3.2) to infer large-scale CNVs in each individual cell with T cells as the reference control.
To quantify CNVs at the cell level, CNV scores were aggregated using a previously described strategy.In brief, arm-level CNV scores were computed based on the mean of the squares of CNV values across each chromosomal arm.Arm-level CNV scores were further aggregated across all chromosomal arms by calculating the arithmetic mean value of the arm-level scores using the R function mean.(3) Marker gene expression: expression of lung epithelial lineage-specific genes and LUAD-related oncogenes was determined in epithelial cell clusters.(4) Celllevel expression of KRASG12D mutations: as we previously described, BAM files were queried for KRASG12D mutant alleles, which were then mapped to specific cells.KRASG12D mutations,along with cluster distribution, marker gene expression and inferred CNVs as described above, were used to distinguish malignant cells from non-malignant cells.Following clustering of malignant cells from all patients, an absence of malignant cells that were identified from P12 or P16 was noted.This can be possibly attributed to the low number of epithelial cells captured in tumour samples from these patients (Supplementary Table 2).

Mapping KRAS codon 12 mutations
To map somatic KRASG12D mutations at single-cell resolution, alignment records were extracted from the corresponding BAM files using mutation location information.Unique mapping alignments (MAPQ = 255) labelled as either PCR duplication or secondary mapping were filtered out.The resulting somatic variant carrying reads were evaluated using Integrative Genomics Viewer (IGV) and the CBtags were used to identify cell identities of mutation-carrying reads.To estimate the VAF of KRASG12D mutation and cell fraction of KRASG12D-carrying cells within malignant and non-malignant epithelial cell subpopulations (for example, malignant cells from all LUADs, malignant cells from KM-LUADs, KACs from KM-LUADs), reads were first extracted based on their unique cell barcodes and BAM files were generated for each subpopulation using samtools (v.1.15).Mutations were then visualized using IGV, and VAFs were calculated by dividing the number of KRASG12D -carrying reads by the total number of uniquely aligned reads for each subpopulation.A similar approach was used to visualize KRASG12C-carrying reads and to calculate the VAF of KRASG12C in KACs of normal tissues from KM-LUAD cases.To calculate the mutationcarrying cell fraction, extracted reads were mapped to the KRASG12D locus from BAM files using AlignmentFile and fetch functions in pysam package.Extracted reads were further filtered using the'Duplicate' and 'Quality' tags to remove PCR duplicates and low-quality mappings.The number of reads with or without KRASG12D mutation in each cell was summarized using the CB tag in read barcodes.Mutation-carrying cell fractions were then calculated as the ratio of the number of cells with at least one KRASG12D read over the number of cells with at least one high-quality read mapped to the locus.PCA analysis of malignant cells and quantification of transcriptome similarity Raw unique molecular identifier counts of identified malignant cells were log-normalized and used for PCA analysis using Seurat (RunPCA function).PCA dimension reduction data were extracted using the Embeddings function.The top three most highly ranked PCs were exported for visualization using JMP (v.15).3D scatterplots of PCA data were generated using the scatterplot 3D tool in JMP (v.15).Bhattacharyya distances were calculated using the bhattacharyya.distfunction from the R package fpc (v.2.2-9).The top 25 highly ranked PCs were used for both patient-level and cell-level distance calculations.For Bhattacharyya distance quantification at the cell level,100 cells were randomly sampled for each patient group defined by driver mutations (for example, KM-LUADs).The random sampling process was repeated 100 times, and pairwise Bhattacharyya distances were then calculated between patient groups.Differences in Bhattacharyya distances between patient groups were tested using Wilcoxon rank-sum tests, and boxplots were generated using the geom_boxplot function from the R package ggplot2 (v.3.2.0).

Determination of non-malignant cell types and states
Non-malignant cell types and states were determined based on unsupervised clustering analysis following batch effect correction using nature portfolio | reporting summary

March 2021
Harmony.Two rounds of clustering analysis were performed on non-malignant cells to identify major cell types and cell transcriptional states within major cell types.Clustering and UMAP visualization of human normal epithelial cells (Extended Data Fig. 1a) were performed using Seurat with default parameters.Specifically, the parameters k.param = 20 and resolution = 0.4 were used forSNN graph construction and cluster identification, respectively.UMAP visualization was performed with default parameters (min.dist= 0.3).For clustering analysis of airway and alveolar epithelial cells, the RunPCA function was used to determine the most contributing top PCs for each subpopulation and similar clustering parameters (k.param = 20 and resolution = 0.4) were used for SNN graph construction and cluster identification.UMAP plots were generated with min.dist= 0.3 using the RunUMAP function in Seurat.Density plots of alveolar intermediate cells were generated using the stat_densit_2d function in the R package ggplot2 (v.3.3.5) with the first two UMAP dimension reduction data as the input.DEGs for each cluster were identified using the FindMarkers function in Seurat with a FDR-adjusted P < 0.05 and a fold change cut-off > 1.2.Canonical epithelial marker genes from previously published studies by our group and others were used to annotate normal epithelial cell types and states.Bubble plots were generated for select DEGs and canonical markers to define AT1 cells (AGER1+ ETV5+ PDPN+), AT2 cells (SFTPB, SFTPC, ETV5+), SCGB1A1+SFTPC+ dual-positive cells, AICs (AGER1+ETV5+PDPN+ and SFTPB+SFTPC+), club and secretory cells (SCGB1A1 +SCGB3A1+CYP2F1+), basal cells (KRT5+TP63+), ciliated cells (CAPS+PIFO+FOXJ1+), ionocytes (ASCL3+FOXI+), neuroendocrine cells (CALCA +ASCL1+) and tuft cells (ASCL2+MGST2+PTGS1+). KACs were identified by unsupervised clustering of AICs and defined based on previously reported marker genes, including significant upregulation of the following genes relative to other alveolar cells: KRT8, CLDN4, PLAUR, CDKN1A and CDKN2A.
Scoring of curated gene signatures Genes in previously defined ITH MPs were downloaded from the original study.Among a total of 41 consensus ITH MPs identified, MPs with unassigned functional annotations (unassigned MPs 38-41; n = 4), neural and haematopoietic lineage-specific MPs (MPs 25-29, MPs 33-37; n = 10) and cell-type-specific MPs irrelevant to LUAD (MPs 22-24 secreted/cilia, MP 32 skin-pigmentation; n = 4) were filtered out, resulting in 23 MPs that closely correlated with hallmarks of cancer and which were used for further analysis.Signature scores were computed using the AddModuleScore function in Seurat as previously described.The KRAS signature used in this study was derived by calculating DEGs between the KRAS mutant malignant-cell-enriched cluster and other malignant cells (FDR-adjusted P value < 0.05, log(fold change) > 1.2; Extended Data Fig. 2i).Human and mouse KAC signatures and the human 'other AIC'signature were derived by calculating DEGs using FindAllMarkers among alveolar cells (FDR-adjusted P value < 0.05, log(fold change) > 1.2).Mouse genes in the p53 pathway were downloaded from the Molecular Signature Database (MSigDB; https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_P53_PATHWAY;MM3896).Signature scores for KACs, other AICs, KRAS andp53 were calculated using the AddModuleScore function in Seurat.

Analysis of alveolar cell differentiation states and trajectories
Analysis of differentiation trajectories of lung alveolar and malignant cells was performed using Monocle 2 by inferring the pseudotemporal ordering of cells according to their transcriptome similarity.Monocle 2 analysis of malignant cells from P14 was performed using default parameters with the detectGenes function.Detected genes were further required to be expressed by at least 50 cells.For construction of the differentiation trajectory of lineage-labelled epithelial cells (GFP), the top 150 DEGs (FDR-adjusted P value < 0.05, log(fold change) > 1.5, expressed in 50 cells or more) ranked by fold-change of each cell population from NNK-treated samples were used for ordering cells with the setOrderingFilter function.Trajectories were generated using the reduceDimension function with the method set to 'DDRTree'.Trajectory roots were selected based on the following criteria: (1) inferred pseudotemporal gradient; (2) CytoTRACE score prediction; and (3) careful manual review of the DEGs along the trajectory.To depict expression dynamics of ITH MPs, ITH MP scores were plotted along the pseudotime axis and smoothed lines were generated using the smoother tool in JMP Pro(v.15).Using the raw counts without normalization as input, CytoTRACE was applied with default parameters to infer cellular differentiation states to complement trajectory analysis and further understand cellular differentiation hierarchies.The normalmixEM function from the R package mixtools was used to determine the CytoTRACE score threshold in AICs with k = 2.A final threshold of 0.58was used to dichotomize AICs into high-differentiation and lowdifferentiation groups.The Wasserstein distance metric was applied usingR package transport (v.0.13) to quantify the variability of distribution of CytoTRACE scores.The function wasserstein1d was used tocalculate the distance between the distribution of actual CytoTRACE scores of one patient and the distribution of simulated data with identical mean and standard deviation.The robustness of Monocle 2-based pseudotemporal ordering prediction was validated by independent pseudotime prediction tools including Palantir, Slingshot and Cellrank.Slingshot (v.2.6.0)pseudotime prediction was performed using slingshot function with reduceDim parameter set to 'PCA' and other parameters set to defaults.Cellrank prediction was performed using the CytoTRACEKernel function with default parameters from Cellrank python package (v.1.5.1).Palantir prediction was performed using Palantir python package (v.1.0.1).A diffusion map was generated using run_diffusion_maps function withn_components = 5.Palantir prediction was generated using run_palantir function with num_waypoints = 500 and other parameters set to defaults.Inferred pseudotime by the three independent methods was then integrated with that generated by Monocle 2 for each single cell,followed by pairwise mapping and correlation analysis.Cell density plots were generated using Contour tool in JMP (v.15) with n = 10 gradient levels and contour type parameter set to 'Nonpar Density'.To assess the pseudotime prediction consistency between Monocle 2and the three independent methods, Spearman's correlation coefficients were calculated and statistically tested using cor.testfunction in R.
ST data generation and analysis ST profiling of formalin fixation and paraffin-embedding (FFPE) tissues of P14 LUAD sample and of three lung tissues from two Gprc5a-/-mice was performed using the Visium platform from 10X Genomics according to the manufacturer's recommendations and as previously reported.P14 FFPE tissues were collected from areas adjacent to the tissues analysed by scRNA-seq.Regions of interest per tissue or sample,each comprising a 6.5 × 6.5 mm capture area, were selected based on careful annotation of H&E-stained slides that were digitally acquired using an Aperio ScanScope Turbo slide scanner (Leica Microsystems).HALO software (Indica Labs) was used for pathological annotation (tumour areas, blood vessels, bronchioles, lymphoid cell aggregates, macrophages, muscle tissue, normal parenchyma and reactive pneumocytes) of H&E histology images.Spot-level histopathological annotation and visualization was generated using loupebrowser (v.6.3.0).In brief, cloupe files generated from Space Ranger were loaded into the loupe browser.Visualization of annotation was then generated in svg formats using the export plot tool.ST RNA-seq libraries were generated according to the manufacturer'sinstructions, each with up to about 3,600 uniquely barcoded spots.Libraries were sequenced on an Illumina NovaSeq 6000 platform to achieve a depth of at least 50,000 mean read pairs per spot and at least 2,000 median genes per spot.Demultiplexed raw sequencing data were aligned, and gene level expression quantification was generated with analysis pipelines as previously described.In brief, demultiplexed clean reads were aligned against the UCSC human GRCh38 (hg38) or the GRCm38(mm10) mouse reference genomes by Spaceranger (v.1.3.0 for human ST data and v.2.0.0 for mouse ST data) and using default settings.Generated ST gene expression count matrices were then analysed using Seurat (v.4.1.0)to perform unsupervised clustering analysis.Using default parameters, the topranked 30 PCA components were used for SNN graph construction and clustering and for UMAP low-dimension space embedding with default parameters.UMAP analysis was performed using the RunUMAP function.The SpatialDimPlotfunction was used to visualize unsupervised clustering.The R package inferCNV was used for copy number analysis.Reference spots used in CNV analysis were selected on the basis of careful review of cluster marker genes using the DotPlot function from Seurat and inspection of pathological annotation.CNV scores were nature portfolio | reporting summary March 2021 calculated by computing the standard deviations of CNVs inferred across 22autosomes.Loupe browser (v.6.3.0) was used for visualization of pathological annotation results.Expression levels of genes of interest (for example, KRT8) as well as signatures of interest (for example, KAC and KRAS) were measured and directly annotated on histology images with pixel-level resolution using the TESLA (v.1.2.2) machine learning framework (https://github.com/jianhuupenn/TESLA). TESLA can compute superpixel-level gene expression and detect unique structures within and surrounding tumours by integrating information from high-resolution histology images.The annotation and visualize_annotation functions were used to annotate regions with high signature signals.KRT8, PLAUR, CLDN4, CDKN1A and CDKN2A were used for 'KAC markers' signature annotation in the human ST analysis.For mouse ST data, Krt8, Plaur, Cldn4, Cdkn1a and Cdkn2a were used for 'KAC signature' annotation.Gene level expression visualization of Krt8 and Plaur was generated using the scatter function from scanpy (v.1.9.1).Deconvolution analysis was conducted using CytoSPACE (https://github.com/digitalcytometry/cytospace).Annotated scRNA-seq data were first transformed intoa compatible format using function generate_cytospace_from_scRNA_seurat_object.Visium spatial data were prepared using the functiongenerate_cytospace_from_ST_seurat_object.Deconvolution was performed using CytoSpace function (v.1.0.4) with default parameters.To determine neighbouring cell composition for a specific cell population in Visium data, CytoSPACE was first applied to annotate every spot with the most probable cell type.Neighbouring spots were defined as the six spots surrounding each spot and, accordingly, the neighbouring cell composition for specific cell types were computed.Trajectory construction of ST data was performed using Monocle 2 with the DDRTree method using DEGs with FDR-adjusted P value < 0.05.
Bulk DNA extraction and WES Total DNA was isolated from homogenized cryosections of human lung tissues and, when available, from frozen peripheral blood mononuclear cells (PBMCs) using a Qiagen AllPrep mini kit (80204) or a DNeasy Blood and Tissue kit (69504), respectively (both from Qiagen) according to the manufacturer's recommendations.Qubit 4 Fluorometer (Thermo Fisher Scientific) was used for measurement of DNA yield.TWIST-WES was performed on a NovaSeq 6000 platform at a depth of 200x for tumour samples and 100x for NL and PBMCs to analyse recurrent driver mutations and using either PBMCs or distant NL tissues when blood draw was not consented, as germline control.WES data were processed and mapped to the human reference genome, and somatic mutations were identified and annotated as previously described with further filtration steps.In brief, only MuTect calls marked as 'KEEP' were selected and taken into the next step.Mutations with a low VAF (<0.02) or low alt allele read coverage (<4) were removed.Then, common variants reported by ExAc (the Exome Aggregation Consortium, http://exac.broadinstitute.org),Phase-3 1000 Genome Project (http://phase3browser.1000genomes.org/Homo_sapiens/Info/Index)or the NHLBI GO Exome Sequencing Project (ESP6500) (http://evs.gs.washington.edu/EVS/)with minor allele frequencies greater than 0.5% were further removed.Intronic mutations, mutations at 3' or 5' UTR or UTR-flanking regions, and silent mutations were also removed.The mutation load in each tumour was calculated as the number of nonsynonymous somatic mutations (nonsense, missense, splicing, stop gain, stop loss substitutions as well as frameshift insertions and deletions).

Survival analysis
Analysis of OS in the TCGA LUAD and PROSPECT cohorts was performed as previously described.KRAS mutation status in TCGA LUAD samples was downloaded from cBioPortal (https://www.cbioportal.org,study ID: luad_tcga_pan_can_atlas_2018).For TCGA dataset, clinical data were downloaded from the PanCanAtlas study.The logrank test and Kaplan-Meier methods were used to calculate P values between groups and to generate survival curves, respectively.Statistical significance testing for all survival analyses was two-sided.To control for multiple hypothesis testing, Benjamini-Hochberg method was applied to correct P values, and FDR q values were calculated where applicable.Results were considered significant at P value or FDR q value of <0.05.Multivariate survival analysis was performed using a Cox proportional hazards regression model that calculated the hazard ratio, the 95% confidence interval and P values when using pathologic stage, age, KAC and 'other AIC' signatures as covariables.

Analysis of public datasets
Publicly available datasets were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/)under accession numbers GSE149813, GSE154989, GSE150263, GSE102511 and GSE219124.Details of the studies analysed are as follows: GSE149813 investigated single lung cells from KrasLSL-G12D;LSL-YFP mice with Ad5CMV-Cre infection; GSE154989 studied AT2 lineagelabelled cells from lungs of KrasLSL-G12D/+;Rosa26LSL-tdTomato/+ mice.Gene expression count matrices of dataset interrogating KrasG12Ddriven mouse model from GSE149813 were pre-processed using Seurat following the same filtering steps in that original report.For the GSE154989 dataset, cells used for analysis were the ones labelled as "PASSED_QC" in supplementary table S7 in that study.For the GSE149813 dataset, cells with >500 median number of genes detected and <10% fraction of mitochondrial genome derived reads, and according to the pre-processing methods described in their original report, were retained for analysis.Cells with >7,500 number of genes detected were further filtered to remove potential doublets or multiplets, resulting in 8,304 cells in total for downstream analysis.Both datasets were integrated with mouse cell data generated in this study using Harmony with default parameters settings.The top ranked 20 Harmony-corrected PCs were used for clustering with the FindClusters function using resolution = 0.4.UMAP dimension reduction embedding was performed using the RunUMAP function with the same set of Harmony-corrected PCs.Gene expression levels and frequencies of representative cluster marker genes were visualized using DotPlot function from Seurat.The KAC signature score was calculated using the AddModuleScore function from Seurat.The mouse KAC signature was also studied in human AT2 cells with and without inducible KRASG12D (dataset GSE150263).Cell filtration criteria described in the original report were followed to filter out potential dead cells and doublets (number of detected genes > 800 and the percent of mitochondrial gene reads fraction < 25%).The 20top-ranked PCs were used for clustering using the FindClusters function with resolution = 0.1.UMAP dimension reduction embeddings were computed using the same SNN graph.The KAC signature score was calculated using AddModuleScore function from Seurat package.The bulk RNA-seq dataset GSE102511 was a previously published dataset by our group and comprised normal lung tissues, precursor AAHs and matched LUADs (n = 15, each).The previously published bulk RNA-seq data GSE219124 were generated on cancer stem cell and stem cell-like progenitor cells, in the form of spheres, and their parental MDA-F471 counterparts (a cell line we had developed and cultured from a KM-LUAD of an NNK-exposed Gprc5a-/-mouse).To interrogate the association of KACs with tumour formation, gene expression matrices of bulk RNA-seq data GSE102511 (TPM count matrix) and GSE219124 (FPKM count matrix) were extracted and used for quantification of KAC signature expression using MCPcounter (v.1.2.0) R package.Heatmaps were generated using pheatmap (v.1.0.12)R package.Mouse KACs from this study were compared to mouse Krt8 transitional cells involved in alveolar regeneration post-acute lung injury from a previous study.Overlapping marker genes between mouse KACs and the previously reported Krt8 transitional cells were statistically evaluated using the ggvenn (v.0.1.9)R package using the top-ranked 50 marker genes based on fold change from each study.
3D culture and analysis of AT2-derived organoids Gprc5a-/-;SftpcCreER/+;RosaSun1GFP/+ were treated with NNK or saline and tamoxifen as described above, and they were euthanized at EOE (4 saline-treated and 5 NNK-treated mice) or at 3 months after exposure (10 saline-treated and 13 NNK-treated mice).Lungs were collected, dissociated into single cells (see mouse single-cell derivation in the Methods section 'Single-cell isolation from tissuesamples'), and live (Sytox Blue-negative) GFP+ single cells were collected by flow cytometry using a FACS Aria I instrument as previously described.GFP+ AT2 cells from NNK-treated or saline-treated groups were immediately washed and resuspended at a concentration of 5,000 cells per 50 microliters of 3D medium (F12 medium supplemented with insulin, transferrin and selenium, 10% FBS,penicillin-streptomycin and l-glutamine).GFP+ cells were mixed at a 1:1 ratio (by volume) with 50,000 mouse endothelial cells (collected from mouse lungs by CD31 selection and expanded in vitro as previously described and resuspended in 50 microliters of Geltrex reduced growth factor basement membrane matrix (A1413301, Gibco).Next, 100 microliters of 1:1 GFP+:endothelial cell mixture was plated on Transwell inserts with 0.4 micrometer pores and allowed to solidify for 30 min in a humidified CO2 incubator (EOE: n = 3 wells per condition; 3 months after exposure: n = 4 wells for saline-derived organoids and n = 12 wells for NNK-derived organoids).Each well was then supplemented with 3D medium containing ROCK inhibitor (Y-27632, Millipore) and recombinant mouse FGF-10 (6224-FG, R&D Systems), and plates were incubated at 37 °C in a humidified CO2 incubator.Wells were replenished with 3D medium every other day.For GFP+ organoids derived from mice exposed to NNK, 200 nM KRAS(G12D)-specific inhibitor MRTX1133 or DMSO vehicle was added to the medium and replenished 3 times a week (n = 6 wells per condition).Organoids were monitored and analysed twice a week using an EVOS M7000 imaging system (Thermo Fisher Scientific), whereby the numbers and sizes of organoids greater than 100 micrometers in diameter were recorded.At end point, 3D organoids were collected from the basement membrane matrix using Gentle Cell Dissociation reagent(100-0485, StemCell Technologies), fixed with 4% paraformaldehyde, permeabilized, blocked and stained overnight at 4 °C with a mixture of IF primary antibodies raised against LAMP3, GFP, KRT8 and cavin 3. The next day, organoids were washed and stained with fluorophore-conjugated secondary antibodies overnight at 4 °C while being protected from light.Organoids were washed and stained with DAPI nuclear stain for 30 min, after which they were collected in Aqua-Poly/Mount (18606-20, Polysciences) and transferred to slides.Images of organoids were captured using an Andor Revolution XDi WD spinning disk confocal microscope and analysed using Imaris software (Oxford Instruments).
Western blot analysis LKR13 and MDA-F471 cells were plated in 6-well plates (10 cells per well) and grown under different conditions as described above.Protein lysates were extracted at 3 h after treatment and analysed by western blotting following overnight incubation with antibodies tot he following primary proteins: vinculin (E1E9V, rabbit, Cell Signaling Technology, 13901; 1:1,000); phosphorylated p44/42 MAPK(ERK1/2, rabbit, Cell Signaling Technology, 9101; 1:2,000); phosphorylated S6 ribosomal protein (Ser 235/236, rabbit, Cell Signaling Technology, 4858; 1:2,000); p44/42 MAPK (ERK1/2, rabbit, Cell Signaling Technology, 9102; 1:2,000); or S6 (E.573.4,rabbit, Invitrogen, MA5-15164; 1:1,000).This was followed by 1 h of incubation with diluted secondary antibody (1706515 goat anti-rabbit IgG-HRP conjugate, Bio-Rad).Protein lysates from each cell line were analysed on multiple gels (four per cell line) with Precision Plus Protein Dual Color Standard (1610394, Bio-Rad) as the ladder and blotted to membranes to separately probe for phosphorylated and total forms of the same proteins, which have highly similar molecular weights (using phospho-specific antibodies or antibodies targeting total version of same protein).Vinculin protein levels were evaluated as loading control on each of the blots.Four blots (phospho-ERK, total ERK, phospho-S6 and total S6) for each of LKR13 and MDA-F471 are shown in Supplementary Fig. 9 each with its own analysis of equal protein loading (vinculin blot) and whereby only the ones indicated with green rectangles are presented in Extended Data Fig. 12c.Membranes were cut horizontally using molecular weight marker as a guide, and cut membranes were incubated with the specified antibodies (see Supplementary Fig. 9 for site of cutting and for overlay of colorimetric and chemiluminescent images of the same blot todisplay ladder and the analysed protein, respectively).Blots were imaged using the ChemiDoc Touch Imaging System (Bio-Rad) with Chemiluminescence and Colorimetric (for protein ladder) applications and auto expose or manual settings.

Statistical analyses
In addition to the algorithms and statistical analyses described above, all other basic statistical analyses were performed in the R statistical environment (v.4.0.0).The Kruskal-Wallis H -test was used to compare variables of interests across three or more groups.Wilcoxon rank-sum

Fig. 1 |
Fig. 1 | Transcriptional landscape of lung epithelial and malignant cells in early-stage LUAD.a, Schematic overview of the experimental design and analysis workflow.Composition, composition of cell subsets; Program, transcriptional programs in malignant cells; Spatial, in situ spatial transcriptome and protein analyses; State, cellular transcriptional state.b, Proportions and average expression levels (scaled) of selected marker genes for ten normal epithelial and one malignant cell subset.NE, neuroendocrine.c, Unsupervised clustering of 17,064 malignant cells coloured by cluster identity.Top right inset shows malignant cells coloured by KRAS G12D mutation status identified by scRNA-seq.d, Uniform manifold approximation and projection (UMAP) of malignant cells shown in c and coloured by driver mutations identified in each tumour sample using WES.e, Principal component analysis (PCA) plot of malignant cells coloured by driver mutations identified in each tumour sample by WES.f, UMAP plots of malignant cells coloured by patient identifier and grouped by driver mutation status.g, Top, UMAP of malignant cells by differentiation state inferred by CytoTRACE.Bottom, comparison of CytoTRACE scores between malignant cells from samples with different driver mutations.Boxes indicate the median ± interquartile range; whiskers, 1.5× the interquartile range; centre line, median.n cells in each box-and-whisker (left to right): 9,135, 5,457 and 2,472.P values were calculated using two-sided Wilcoxon rank-sum test with Benjamini-Hochberg correction.diff., differentiated.h, Per sample distribution of malignant cell CytoTRACE scores.The schematic in a was created using BioRender (https://www.biorender.com).

Fig. 2 |
Fig. 2 | Identification and characterization of KACs in human LUAD.a, Pseudotime analysis of alveolar and malignant cells.b, Left, subclustering analysis of AICs.Right, proportions and average expression levels (scaled) of representative KAC marker genes.c, CytoTRACE score in KACs versus other AICs.n cells (left to right): 8,591 and 1,440.P value was calculated using twosided Wilcoxon rank-sum test.d, Proportion of KACs among non-malignant epithelial cells.n samples (left to right): 16, 15, 16 and 16.P value was calculated using Kruskal-Wallis test.e, Fraction of alveolar cell subsets coloured by sample type.P values were calculated using two-sided Fisher's exact tests with Benjamini-Hochberg correction.f, Top, haematoxylin and eosin (H&E) staining of LUAD tumour (T), TAN displaying reactive hyperplasia of AT2 cells and uninvolved NL tissue.Bottom, digital spatial profiling showing KRT8, PanCK, CLDN4, Syto13 blue nuclear stain and composite image.Magnification, ×20.Scale bar, 200 μm.Staining was repeated four times with similar results.Dashed white lines represent the margins separating tumours and TAN regions.g, ST analysis of LUAD from patient P14 showing histologically annotated H&E-stained Visium slide (left) and spatial heatmaps (right) depicting CNV score and scaled expression of KRT8, KAC markers (b) and KRAS signature.h, Expression (top) and correlation (bottom) analyses of KAC, KRAS and alveolar signatures.n = 1,440 (KACs), 8,593 (other AICs), 146,776 (AT2) and 25,561 (AT1).R, Spearman's correlation coefficient.P values were calculated using Spearman's correlation test.i, KAC signature expression in premalignancy cohort (15 samples each).P values were calculated using two-sided Wilcoxon signed-rank test with Benjamini-Hochberg correction.j, Fraction of KRAS G12D cells in different subsets.For c,d,h and i, box-and-whisker definitions are the same as Fig. 1g.

Fig. 3 |
Fig. 3 | KACs evolve early and before tumour onset during tobaccoassociated KM-LUAD pathogenesis.a, Schematic view of the in vivo experimental design.b, Fraction of malignant cells (left) and KACs (right) across treatment groups and time points.Box-and-whisker definitions are same as in Fig. 1g.n = 4 biologically independent samples per condition.P values were calculated using two-sided Mann-Whitney U-test.NS, not significant.c, IF analysis of KRT8, LAMP3 and PDPN in mouse lung tissues.Scale bar, 10 μm.Results are representative of two independent biological replicates per treatment and timepoint.Staining was repeated three times with similar results.d, Top, distribution of CNV scores among alveolar and malignant cells.n on top of each bar denotes the numbers of Kras G12D mutant cells in each cell group.Bottom, fraction of Kras G12D mutant cells in KACs, malignant, AT1 and AT2 subsets.n = 496 (AT1), 1,320 (AT2), 512 (KACs) and 1,503 (malignant) cells.P values were calculated using two-sided Mann-Whitney U-test with Benjamini-Hochberg correction.e, ST analysis of lung tissue at 7 months after exposure to NNK and showing histological annotation of H&E-stained Visium slide (left) and spatial heatmaps showing scaled expression of KRT8 as well as KAC and KRAS signatures.ST analysis was done on three different tumour-bearing mouse lung tissues from two mice at 7 months following NNK.The schematic in a was created using BioRender (https://www.biorender.com).

Fig. 4 |
Fig. 4 | KACs are implicated in the transition of AT2 to Kras mutant tumour cells.a, Trajectories of alveolar and malignant cells coloured by inferred pseudotime, cell differentiation status and cell type (top left to right).Distribution of inferred pseudotime (bottom left) and CytoTRACE (bottom middle) scores across the indicated cell subsets.Bottom right panel shows CytoTRACE score distribution in KACs at the two time points.Box-and-whisker definitions are the same as in Fig. 1g.n cells (left to right): 1,791, 1,693, 636, 580, 1,791, 1,693, 636, 580, 301 and 335.b, Schematic overview showing analysis of Gprc5a −/− mice with reporter-labelled AT2 cells (Gprc5a −/− ;Sftpc creER/+ ;Rosa Sun1GFP/+ ).TMX, tamoxifen.c, Fractions of AT1, AT2, KACs and KAC-like cells (KAC-KAC-like) and early tumour and AT2-like tumour cells (early-AT2-like tumour) within GFP + cells from lungs of two NNK-treated and two saline-treated mice analysed at 3 months after exposure.d, IF analysis of tdT and KRT8 expression at EOE to NNK (first column; EOE) and at 8-12 weeks following NNK (follow-up after EOE) in normal-appearing regions (second column) and tumours (last two columns) of Gprc5a −/− ;Krt8-creER;Rosa tdT/+ mice.Tamoxifen (1 mg per dose) was delivered immediately after EOE to NNK for six continuous days.Results are representative of three biological replicates per condition.Staining was performed two times with similar results.Magnification, ×20.Scale bar, 10 μm.e, Left, percentage of lung tissue areas containing tdT + cells.Right, percentage of tdT + LAMP3 + cells among total tdT + cells in normal-appearing regions at different time points.Error bars show the mean ± s.d. of n biologically independent samples (left to right): 6, 6, 6, 6 and 10.P values were calculated using Mann-Whitney U-test.f, Proposed model for alveolar plasticity, whereby a subset of AICs in the intermediate AT2to-AT1 differentiation state are KACs and, later, acquire KRAS G12D mutations and are implicated in KM-LUAD development from a particular region in the lung.The schematics in b and f were created using BioRender (https://www.biorender.com).

Fig. 1 |
Analysis of normal lung epithelial and malignant subsets in early-stage LUADs.a,b, UMAP plots of 229,038 normal epithelial cells from 63 samples.Each dot represents a single cell coloured by major cell lineage (a, left), airway sub-lineage (a, top right) and alveolar sub-lineages (a, bottom right).SCGB1A1/SFTPC dual positive cells (SDP) cells were separately coloured to show their position on the UMAP (b).c,d UMAP plots of 17,064 malignant cells coloured by patient ID (c, left), CNV score (c, middle), presence of KRAS G12D mutation (c, right) and smoking status (d).e, Analysis of recurrent driver mutations identified by WES.f, Transcriptomic variances quantified by Bhattacharyya distances at the sample (left) and cell (right) levels among LUADs with driver mutations in KRAS (KM), EGFR (EM), and MET (MM), or LUADs that are wild type (WT) for these genes.Box, median ± interquartile range; whiskers, 1.5× interquartile range; centre line: median.n cells in each box-and-whisker in the left panel: KM-KM = 3; KM-EM = 15; KM-MM = 6; KM-Other = 12; EM-EM = 10; EM-MM = 10; EM-Other = 20; MM-Other = 8; Other-Other = 6.n cells in each box-and-whisker in the right panel: 100.P values were calculated by two-sided Wilcoxon Rank-Sum test with a Benjamini-Hochberg correction.g, Harmonycorrected UMAP plot of malignant cells coloured by cluster ID (left) and cluster distribution by sample (right).h, UMAP plots of malignant cells coloured by CNV scores (top left), smoking status (top right).Comparison of CNV scores between malignant cells from samples carrying different driver mutations (bottom left) or between smokers and never smokers (bottom right).Box-andwhisker definitions are similar to panel f.n cells in each box-and-whisker: EGFR = 5,457; Other = 9,135; KRAS = 2,472; Smoker = 5,999; Never smoker = 11,065.P values were calculated by two-sided Wilcoxon Rank-Sum test with a Benjamini-Hochberg correction.i, Analysis of Wasserstein distances among KM-LUADs, EM-LUADs, and LUADs with WT KRAS and EGFR (Double WT).Box-and-whisker definitions are similar to panel f.n samples in each box-andwhisker: 3; 5; 6. P value was calculated by a two-sided Wilcoxon Rank-Sum test.ExtendedData Fig. 2 | Characterization of inter-and intra-tumour heterogeneity of LUAD malignant cells.a, Unsupervised clustering of malignant cells based on expression of 23 previously defined consensus cancer cell meta-programs (MPs).b, Distribution of signature scores of 4 representative MPs across clusters from a. Box-and-whisker definitions similar to Extended Data Fig. 1f.n cells in each box-and-whisker: C1 = 2,600; C2 = 3,968; C3 = 1,647; C4 = 7,182; C5 = 1,667.c, Enrichment of clusters (C1-C5) in cells colour coded by recurrent driver mutation status (left) and patients (right).**: P < 2.2 × 10 −16 .P value was calculated using two-sided Fisher's exact test with a Benjamini-Hochberg correction.d, MP30 was computed in malignant cells in each patient (left) and in KM-LUADs versus KRAS WT LUADs (KW-LUADs, right).n cells in each box-and-whisker: P14 = 1,614; P10 = 326; P2 = 532; P1 = 64; P6 = 2,604; P7 = 823; P8 = 147; P15 = 1,819; P4 = 404; P9 = 25; P3 = 2,419; P5 = 5,872; P11 = 375; P13 = 40; KM-LUADs = 2,472; KW-LUADs = 14,592.Box-and-whisker definitions are similar to Extended Data Fig. 1f.P values were calculated using two-sided Wilcoxon Rank-Sum test with a Benjamini-Hochberg correction.e, Profiling of ITH in malignant cells from P14 LUAD.UMAP plots show malignant cells coloured by (top left to top right) KRAS G12D mutation status, KRAS signature expression, and cell differentiation status (CytoTRACE).Trajectories of P14 malignant cells coloured by (bottom left to bottom right) the presence of KRAS G12D mutation, inferred pseudotime, and differentiation status.f, UMAP plots showing P14 malignant cells coloured by expression of the 3 indicated MPs.g, Unsupervised clustering analysis of P14 malignant cells based on inferred CNV profiles (left).UMAP of P14 malignant cells (middle) and inferred trajectory (top right) coloured by CNV clusters, as well as KRAS G12D mutation expression status along pseudotime trajectory (bottom right).h, Alveolar MP expression across the CNV clusters shown in panel g.n cells in each group: 477; 464; 673.P values were calculated using two-sided Wilcoxon Rank-Sum test with a Benjamini-Hochberg correction.i, Harmony-corrected UMAP plot of malignant cells coloured by KRAS signature score (left).Correlation between MP30 expression and KRAS signature score in malignant cells of KM-LUADs (right).P value was calculated with Spearman correlation test.R denotes the Spearman correlation coefficient.j, Heatmap showing score distribution of the indicated MPs and signatures in TCGA LUAD samples.k, Kaplan-Meier plot showing differences in the survival probability between samples with high and low levels of KRAS signature (KRAS sig.), and those with KRAS G12D mutation.OS: overall survival.KRAS sig.high: samples within top quartile of KRAS signature score.KRAS sig.low: samples below the third quartile of KRAS signature score.mo.: months.P value was calculated with logrank test.Extended Data Fig. 3 | Phenotypic diversity and states of human normal lung epithelial cells.a, Composition of normal epithelial lineages across spatial regions as defined in Fig. 1a.Dis: distant normal.Int: intermediate normal.Adj: adjacent normal.NE: neuroendocrine.b, Changes in cellular fractions of AT2 cells (left) and AICs (right) across the spatial samples.Box-andwhisker definitions are similar to Extended Data Fig. 1f.n samples in each boxand-whisker (left to right): 16; 15; 16; 16.P values were calculated with Kruskal-Wallis test.c, Composition of normal epithelial lineages across the spatial regions at the sample level.d, Fractional changes of AT2 cells among all epithelial cells across the spatial regions at the patient level.c and d: Cases showing gradually reduced AT2 fractions with increasing tumour proximity (7 of the 16 patients; P = 0.004 by ordinal regression analysis in d).e, Fractions of AT1, basal, ciliated, and club and secretory cells along the continuum of the spatial samples.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n samples in each box-and-whisker (left to right): 16; 16; 15; 16.P values were calculated with Kruskal-Wallis test.f, Distribution of CytoTRACE scores in AICs, AT1 and AT2 cells (left).Distribution of pseudotime scores in malignant cells from EGFR-or KRAS-mutant tumours (right).P value was calculated with two-sided Wilcoxon Rank-Sum test.Box-and-whisker definitions are similar to Extended Data Fig. 1f with n cells: AT2 = 14,649; AICs = 974; AT1 = 2,529; EGFR = 1,711; KRAS = 1,326.g, Pseudotime trajectory analysis of alveolar and malignant subsets coloured by tissue location.h, Distribution and composition of AICs with low (left) or high (right) CytoTRACE score.i, DEGs between KACs and other AICs.j, Pseudotime trajectory analysis of malignant and alveolar subsets colour-coded by cell lineage and presence of KRAS G12D mutation (top).Pseudotime score in KACs versus other AICs (bottom).Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each box-and-whisker: KACs = 157; Other AICs = 817.P value was calculated by two-sided Wilcoxon Rank-Sum test.k, Differences in cell densities between LUAD (top) and NL tissues (bottom).Extended Data Fig. 4 | Spatial and molecular attributes of human KACs.a, Microphotographs of P10 (left) and P15 (right) LUAD and paired uninvolved NL tissues.Top panels: H&E staining showing LUAD T and TAN (left columns) regions, and uninvolved NL (right columns).DSP analysis of KRT8 (red), CLDN4 (yellow), and pan-cytokeratin (PanCK; green) in LUAD, TAN, and NL regions.Blue nuclear staining was done using Syto13.Magnification, ×20.Scale bar = 200 μm.Staining was repeated four times with similar results.b, CytoSPACE deconvolution and trajectory analysis of P14 LUAD ST data.The left spatial map is coloured by deconvoluted cell types.Top middle panel shows the neighbouring cell composition of KACs, and the bottom middle panel depicts inferred trajectory and pseudotime prediction using Monocle 2. Scaled expression of NKX2-1 and alveolar signature are shown in the rightmost top and bottom panels, respectively.c-e, Expression of KRAS (c), AT1 (d), and other AIC (e) signatures across AT1, AT2, KACs and other AICs.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each group: KACs = 1,440; Other AICs = 8,593; AT2 = 146,776; AT1 = 25,561.f, g, Correlation analysis between Other AIC and KRAS (f) or alveolar (g) signature scores.P values were calculated with Spearman correlation test.R denotes the Spearman correlation coefficients.h, Enrichment of KAC signature among KACs (left) and malignant cells (right) from KM-or EM-LUAD samples.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each box-and-whisker (left to right): KACs, EM-LUADs = 135; KACs, KM-LUADs = 719; Malignant, EM-LUADs = 5,457; Malignant, KM-LUADs = 2,472.P values were calculated by two-sided Wilcoxon Rank-Sum test.

Fig. 7 |
scRNA-seq analysis of epithelial subsets in a tobacco carcinogenesis mouse model of KM-LUAD.a, UMAP distribution of mouse epithelial cell subsets.b, Proportions and average expression levels of select marker genes for mouse normal epithelial cell lineages and malignant cell clusters as defined in panel a. c, UMAP plots of alveolar and malignant cells coloured by CNV score, presence of Kras G12D mutation, or expression levels of Kng2 and Meg3.d, UMAP (top) and violin (bottom) plots showing expression level of Cd24a in malignant and alveolar subsets.Box-and-whisker definitions are similar to Extended Data Fig. 1f.n cells in each group: Malignant = 1,693; AT1 = 580; KACs = 636; AT2 = 1,791.e, UMAP distribution of alveolar and malignant cells coloured by cell lineage, Kras G12D mutation status, and CNV score at EOE or 7 months following NNK.f, Proportions of normal epithelial cell lineages and malignant cells in each sample.g, Fractional changes of malignant cells, KACs, AT2 and AT1 cells between EOE and 7 months post treatment with NNK or saline; n = 4 biologically independent samples in each group.Whiskers, 1.5× interquartile range; Center dot: median.h, UMAP (top) and violin (bottom) plots showing expression levels of Gkn2 in malignant and alveolar cell subsets.n cells in each group: Malignant = 1,693; AT1 = 580; KACs = 636; AT2 = 1,791.