Molecular characteristics of primary pulmonary lymphoepithelioma-like carcinoma based on integrated genomic analyses

Primary pulmonary lymphoepithelioma-like carcinoma (pLELC) is a rare non-small cell lung cancer (NSCLC) subtype. Clinical features have been described in our previous report, but molecular characteristics remain unclear. Herein, pLELC genomic features were explored. Among 41,574 lung cancers, 128 pLELCs and 162 non-pLELC NSCLCs were enrolled. Programmed cell death ligand 1 (PD-L1) and protein 53 (p53) expression was detected in 47 surgically resected pLELC samples by immunohistochemical assays. Multiomics genomic analyses, including whole-genome sequencing (WGS), RNA whole-transcriptome sequencing (RNA-seq), and Epstein-Barr virus (EBV) integration analyses, were performed on eight frozen pLELC tissues and compared with 50 lung adenocarcinomas (LUADs) and 50 lung squamous cell carcinomas (LUSCs) from The Cancer Genome Atlas (TCGA) and another 26 EBV-positive nasopharynx cancers (EBV+-NPCs). Progression-free survival (PFS) and overall survival (OS) of pLELC patients were better than those of non-pLELC patients. High PD-L1 or p53 expression was associated with extended disease-free survival (DFS). pLELC had 14 frequently mutated genes (FMGs). Somatically mutated genes and enrichment of genetic lesions were found, which differed from observations in LUAD, LUSC, and EBV+-nasopharyngeal carcinoma (NPC). Three tumor-associated genes, zinc finger and BTB domain-containing 16 (ZBTB16), peroxisome proliferator activated receptor gamma (PPARG), and transforming growth factor beta receptor 2 (TGFBR2), were downregulated with copy number variation (CNV) loss. EBV was prone to integrating into intergenic and intronic regions with two upregulated miR-BamH1-A rightward transcripts (BARTs), BART5-3P and BART20-3P. Our findings reveal that pLELC has a distinct genomic signature. Three tumor-associated genes with CNV loss and two miR-BARTs might be involved in pLELC tumorigenesis.


INTRODUCTION
Lymphoepithelioma-like carcinoma (LELC) is a histologically distinct tumor subtype that may invade several organs, such as the lungs, nasopharynx, and stomach. 1,2 Primary pulmonary LELC (pLELC) is categorized into the other and unclassified non-small cell lung cancer (NSCLC) group according to the 2015 World Health Organization (WHO) classification, 3 accounting for <1% of all NSCLC cases. 4 In our previous investigation, the clinicopathological features of 42 pLELCs and 134 pulmonary squamous carcinomas were compared. The results revealed that, unlike pulmonary squamous carcinoma, middle-aged women and nonsmokers predominated among pLELC patients. No epidermal growth factor receptor (EGFR) mutation, anaplastic lymphoma kinase (ALK) gene rearrangement, or c-ros oncogene 1 (ROS1) fusion was found. 5 Due to its low incidence and the lack of information regarding the molecular mechanism of its tumorigenesis, treatment strategies for pLELC have not been fully defined, although surgery, chemotherapy, and radiation are often used. 6 With the rise of immunotherapies, monoclonal antibodies against programmed cell death ligand 1 (PD-L1) have attracted researchers' attention in pLELC treatment. A potential benefit has been reported in some pLELC cases. 7 However, the prognostic significance of PD-L1 expression in pLELC is still completely controversial [8][9][10] In recent years, a few studies have reported the mutational landscape of pLELC using multiple approaches, including wholeexome sequencing, targeted deep sequencing, and singlenucleotide polymorphism (SNP) arrays. 11,12 However, analysis at the RNA level, e.g., transcriptome sequencing, seems to be absent in relevant research. In addition, Epstein-Barr virus (EBV), which has been detected in various tumors, such as Burkitt's lymphoma, EBV-associated gastric cancer, and nasopharyngeal carcinoma (NPC), 2 as well as pLELC, 5 either exists typically as an episome in cells or tends to integrate into the host genome. This mechanism facilitates the expression of viral protein and microRNAs (miRNAs) and in turn promotes the pathogenesis and progression of tumors. 13 Our previous study suggested that the serum EBV DNA level appeared to be a predictor of progression-free survival (PFS) for pLELC patients, 5 but the exact EBV integration loci and pathogenesis are still unknown.
To explore the molecular characteristics of pLELC with multigenomic analyses, we extended our pLELC sample size to 128 cases, with 162 non-pLELC NSCLCs as controls. All the subjects were identified from a database of 41,574 lung cancers. Wholegenome sequencing (WGS) and RNA whole-transcriptome sequencing (RNA-seq) combined with EBV integration analysis were performed on eight fresh-frozen tissues of pLELC, and distinct molecular features were identified on comparison of lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), and EBV-positive nasopharynx cancer (EBV + -NPC). To evaluate factors influencing survival in pLELC, PD-L1 expression was detected by immunohistochemical (IHC) analysis of surgically resected paraffin-embedded samples, as well as TP53 mutation, which has been implicated in the regulation of PD-L1 in oncogenesis. 14,15 The results of this study confirmed a favorable outcome of pLELC in a relatively large cohort. High expression of PD-L1 or wild-type p53 was associated with extended survival. The genome studies revealed a distinct mutational signature of pLELC different from those of LUAD, LUSC, and EBV + -NPC. Three tumorassociated genes were identified, which were clearly downregulated with simultaneous copy number variation (CNV) loss. EBV tends to integrate into intergenic and intronic regions in pLELC with two miR-BamH1-A rightward transcripts (BARTs), BART5-3P and BART20-3P, which are upregulated. These findings provide novel insights into the molecular mechanisms of pLELC.

Clinical features of pLELC and non-pLELC NSCLC patients
The workflow of this study is shown in Fig. 1. The clinicopathological characteristics of the 128 primary pLELC and 162 non-pLELC NSCLC cases are summarized in Table 1. The mean age of the pLELC patients was substantially younger than that of the controls, and a female predominance was noted. Smoking was much less common among pLELC patients. For tumor, node, metastasis (TNM) and clinical staging, pLELC patients seemed to be more prone to distant lymph node invasion and distant metastasis when diagnosed; therefore, fewer pLELC patients had stage I and II disease compared to NSCLC patients. These data suggested that pLELC may be a special type of NSCLC.
The median follow-up time was 25.6 months (2.87-128.6 months). As shown in Fig. 2a, c, disease progression was observed more infrequently in pLELC patients than in controls. Consistently, the median PFS of pLELC patients, estimated by the mean PFS value due to the high censor rate, was 78.4 months, which was substantially more favorable than that of non-pLELC patients.
In the overall survival (OS) analyses (Fig. 2b, c), the number of patients who had died since diagnosis was 4 vs. 123 in the pLELC and non-pLELC NSCLC cohorts, respectively. The median OS of the pLELC patients, also estimated by the mean OS value because of the high censor rate, was 124.0 months, which was also remarkably better than that of the controls. Fig. 1 Study profile. This flow chart depicts the enrollment and exclusion of all subjects. pLELC pulmonary lymphoepithelioma-like carcinoma, NPC nasopharyngeal carcinoma, LUAD lung adenocarcinoma LUSC lung squamous cell carcinoma, WGS whole-genome sequencing, RNA-seq RNA whole-transcriptome sequencing. The eighth patient (P8) numbered in the multi-genomic sequencing was excluded from the whole data analysis process because of metastatic NPC even though the sample had already been sequenced. Therefore, P9 was included in the integrated genomic analyses but P8 was not Prognostic significance of PD-L1 and p53 expression in pLELC Among the 128 pLELC patients, 47 patients who underwent radical surgery were enrolled for expression detection and prognostic significance analyses of nuclear p53 and membranous/cytoplasmic PD-L1. Considering that no deaths occurred in the pLELC cohort until the last follow-up exam and only ten patients showed disease progression, disease-free survival (DFS) was analyzed rather than OS. As shown in Table 2, the mean age of this cohort was 54.4 years. Then the patients were divided into those aged <50 years and those aged >50 years for DFS association analysis, but no significance was found. Negative correlations were also found for gender, smoking, family history of cancer, and clinical stage.  For the PD-L1 and p53 expression investigation, 61.7% (29/47) of the samples were positive for PD-L1, including 12 strong and 17 moderate cases. Notably, patients with strong expression of PD-L1 demonstrated the longest median DFS among the three groups ( Fig. 2d, f). Moreover, mutation-type p53 expression was observed in more than half of the cohort (26/47), and this group showed shorter DFS than the group with normal p53 expression (median DFS 20.0 months vs. 45.1 months; Fig. 2e, f). As abnormal expression of p53 is a strong predictor of an underlying TP53 mutation, the results indicated that TP53 mutation status is correlated with the prognosis of pLELC.
Taken together, the prognosis data above suggest that pLELC might be a particular type of NSCLC, which is usually at an advanced stage at the time of diagnosis but has a favorable prognosis, especially with high expression of PD-L1 and wild-type p53. However, molecular features of pLELC remain unclear, which prompted us to carry out subsequent multiomics studies.
Integrated genomic analyses of pLELC Mutation burden and mutation spectrum of pLELC. To further understand the molecular characteristics of pLELC and explore its mechanisms of tumorigenesis, WGS was performed on eight tumor specimens, the clinicopathological characteristics of which are listed in Supplementary Table 1. Because pLELC is extremely similar to LUSC in histopathology, complicating the differential diagnosis, their genomes were compared first and then together with that of LUAD. The overall nonsynonymous mutation rate of pLELC was significantly lower than that of LUSC (P = 0.00002; unpaired t test; Fig. 3a). Strong enrichment of the C>T transition was found in all eight pLELC samples (Fig. 3b), followed by enrichment of T>G transitions (P1 and P6). A combination of nonnegative matrix factorization clustering and correlations with the 30 curated mutational signatures defined by the Catalogue of Somatic Mutations in Cancer (COSMIC) database 16,17 revealed four dominant signatures (Fig. 3b). The predominant signatures were the deamination process of 5-methyl-cytosine (signature 1), followed by defective DNA mismatch repair (signatures 3+15) and the APOBEC/AID signature (signature 13) (Fig. 3b, c).
Chromothripsis is a phenomenon where hundreds of chromosomal rearrangements are generated via a single catastrophic event. In this study, chromothripsis was found in 12.5% (1/8) of pLELC samples ( Supplementary Fig. 1a-c). Compared with the other pLELC patients, the patient with chromothripsis had considerably shorter DFS ( Supplementary Fig. 1d). However, whether chromothripsis is prevalent in pLELC requires further investigation in larger cohorts.
Somatic mutations of pLELC. A total of 14 frequently mutated genes (FMGs) with abnormally high levels were found in the pLELC group (Fig. 3d). To explore the unique driver genes in pLELC, typical somatic mutation profiles were compared with those of other lung cancer subtypes (50 LUADs and 50 LUSCs in The Cancer Genome Atlas (TCGA)) and 26 EBV + -NPC cases. 18 Although carcinogenesis of both pLELC and NPC is related to EBV infection, somatic mutations of pLELC were found to be dramatically different from those of EBV + -NPC (Fig. 3e). Simultaneously, the typical driver genes reported in LUAD and LUSC were scarcely detected in pLELC except for TP53 and CDKN2A (Fig. 3f). Beyond the different mutation spectra, the number of FMGs in pLELC was much lower than those in LUAD, LUSC, and EBV + -NPC. These data suggest a unique etiological mechanism of pLELC.
Then somatic copy number alterations were profiled using the Control-FREEC tool 19 and GISTIC software with a cutoff of 0.05. CNV loss was widely observed in pLELCs (Fig. 3g), and the most significant CNV loss occurred in 3p24.1, 5q12.3, 11q24.3, and 14q11.2 (Fig. 3h). The results show that CNV loss rather than single-nucleotide variations may underlie the carcinogenesis of pLELC.
Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment with differentially expressed unigenes (DEGs). DEG analysis was performed for each patient by comparing tumors and adjacent normal lung tissues using Cufflinks and RSEM, and KEGG pathway enrichment was analyzed by EdgeR 20,21 with a corrected P < 0.05 (multiple-hypothesis correction by the Benjamini-Hochberg false discovery rate method) indicating significant enrichment. A total of 458 DEGs were enriched with 20 significant signaling pathways, ranging from 7.6 to 48.2% (Fig. 3i). The four mainly enriched KEGG pathways were further mapped into 74, 32, 19, and 11 subpathways ( Supplementary  Fig. 2).
Integrated genomic and transcriptome analyses. Using integrated genomic and transcriptome analyses, copy number and expression decreases in five tumor-associated genes (zinc finger and BTB domain-containing 16 (ZBTB16), peroxisome proliferator activated receptor gamma (PPARG), transforming growth factor beta receptor 2 (TGFBR2), early B cell factor 1 (EBF1), and phosphoinositide-3-kinase, regulatory subunit 1 (alpha) (PIK3R1)) were detected in pLELC, especially ZBTB16, PPARG, and TGFBR2 (Fig. 3j). Patients in the more advanced stage also seemed to exhibit a more distinct reduction in tumor-associated genes, which might be a potential pathogenesis of pLELC (Fig. 3j).
EBV integration analysis. The WGS data were aligned with EBV reference genomes to perform comprehensive profiling of EBV integration. The average percentage of EBV sequences in the WGS data was 0.14% (0.10-0.25%), and the coverage depth was 220.69× on average (40-1005×). Moreover, EBV genome contents in tumor tissues were found to be greater than those in normal samples, with a larger percentage and a greater depth of reads mapped on the EBV genome ( Supplementary Fig. 3). A total of 288 EBV integration breakpoints were identified, and 268 breakpoint sites were found in five patients (P2, P3, P4, P5, and P7) (Supplementary Material 1, available online only). These breakpoint sites were dispersed over 23 human chromosomes, except for the Y chromosome (Fig. 4a, b). Eight pairs of integration breakpoints were found to be ubiquitous, among others, by analyzing the breakpoint distribution in each patient (Supplementary Fig. 4 and Supplementary Table 2). Through gene annotation of the integration sites, EBV was found to have a strong tendency to integrate into intergenic rather than intragenic regions, predominantly within intronic regions (Fig. 4c). In addition, 19 (6.6%) integration sites were localized within noncoding RNA introns, two of which were localized within the introns of HNF1A-AS1 and LINC00673 in two patients (P3 and P5, Supplementary Fig. 5). EBV integration hotspots seemed to be fragile genomic regions, which were vulnerable to DNA damage. This feature increased the likelihood of EBV DNA insertion through microhomology-mediated DNA repair, as shown in Supplementary Fig. 5.    Fig. 6). The individual miR-BART frequency ranged from 9.1 to 34.1% (Supplementary Table 3). In addition, among 43 identified miRNAs in EBV, a total of 31 (72.1%) differentially expressed miR-BARTs were detected compared with adjacent normal lung tissues (Supplementary Material 2, available online only). Two BARTs (BART5-3P and BART20-3P) were frequently (7/8, 87.5%) upregulated, and the other three (BART10-5P, BART20-5P, and BART11-3P) were relatively less changed ( Fig. 4d and Supplementary Table 4). Additionally, the six patients with both BART5-3P and BART20-3P upregulation had low p53 and PD-L1 expression and displayed a relatively poor prognosis.

DISCUSSION
pLELC is a rare subtype of NSCLC characterized by EBV infection without a unique pathological presentation. In the present study, we assessed PD-L1 and p53 expression in 47 pLELCs and performed an integrated genomic analysis of 8 pLELCs using WGS and RNA-seq. Our work confirmed previous results reported in other studies indicating that females and nonsmokers are predominantly affected by pLELC and that patients with pLELCs usually show high PD-L1 expression. 10 Furthermore, the mutation spectrum of pLELC was found to be distinct from those of other subtypes of NSCLC and EBV + -NPC, with unique FMGs and widespread CNV loss. As EBV infection plays an important role in the tumorigenesis of pLELC, we studied EBV integration loci as well as EBV-encoded miRNA expression in host cells. In summary, our work shed lights on the distinct genetic features of pLELC and the mechanism underlying its pathological characteristics. In this study, 61.7% (29/47) of samples were positive for PD-L1 by IHC 28-8. This result is basically similar to previous findings indicating that 66-76% of pLELCs are positive for PD-L1-stained tumor cells. 22 High expression of PD-L1 has been suggested to be a biomarker for survival prediction and optional immunotherapy, while TP53 mutation and membranous PD-L1 expression levels were highly correlated in NSCLC. 22 Here we explored the prognostic significance of PD-L1 expression and TP53 mutation status in pLELC and found that pLELC patients with high PD-L1 expression tend to have longer DFS, which is consistent with previous studies. 22 In contrast, patients with wild-type p53 expression showed a better prognosis in our cohort, although the interpretation of p53 IHC varies across studies. Further studies with subjects from multiple centers are needed to confirm the value of PD-L1 and p53 in pLELC.
In our previous report, in a susceptible population, computed tomography (CT) appearances and prognosis of pLELC were all markedly different from those of LUSC. 5 We extended the cohorts of pLELC and controls for systematic comparison at a molecular level. First, the genomic mutation profile of pLELC was compared with those of LUSC and LUAD. Genes with high mutation frequencies in thoracic malignancies were barely observed in pLELC. The lack of powerful mutation drivers might render pLELC tumor cells less aggressive and may lead to increased survival in pLELC patients. Moreover, as an EBV-inter-related neoplasm, pLELC is very closely related to EBV + -NPC in histopathological characteristics, complicating the differential diagnosis in clinical practice. Therefore, genomic mutations of pLELC and EBV + -NPC were compared. However, no similarity was found. Lin et al.
demonstrated that mutations of chromosomal-modifying genes and the ERBB-PIK3CA signaling pathway were important in NPC, directly affecting epigenetic modification and promoting invasion. 18 These abnormalities were widely observed in other malignant tumors. 23,24 However, in the present study, no mutations of chromosome-modification pathways were detected; additionally, no mutations of RTK cycle pathway-related genes were tested, which may explain why pLELC is less aggressive than NPC.
In the integrated genomic and transcriptome analyses, the tumor-associated genes ZBTB16, PPARG, and TGFBR2 were found to be significantly downregulated with simultaneous CNV loss. ZBTB16 is also known as promyelocytic leukemia zinc-finger protein (PLZF). As a transcriptional repressor, ZBTB16 has an important role in cell death, differentiation, and tumor progression. 25 Downregulation of ZBTB16 was observed in pancreatic cancer and prostate cancer, 26,27 and the high expression of ZBTB16 was associated with long-term survival in cancer patients. 28,29 In our study, ZBTB16 was reduced, and this reduction might be one of the possible factors associated with pLELC tumorigenesis.
PPARG belongs to the nuclear receptor superfamily of PPARs. Several studies have concentrated on the key role of PPARG in cancers with inflammatory regulation. 30 PPARG was also found to be involved in the regulation of epithelial-mesenchymal transition (EMT). 31 However, activated PPARG could reverse the mesenchymal-epithelial phenotype and inhibit EMT. 32 In hepatocellular carcinoma and breast cancer, PPARG showed antitumorigenic effects. 30 In contrast, PPARG has been reported as an immunomodulator. High expression of PPARG impairs CD8+ T cell infiltration in bladder cancer and confers resistance to immunotherapies, while knockdown/inhibition of PPARG increases cytokine expression and revives immunosurveillance. 33 Therefore, we speculated that the loss of PPARG in pLELC might improve immunosurveillance, leading to a good prognosis in pLELC cases.
TGF-β is the prototype of the TGF-β family. During tumor progression, TGF-β becomes a stimulating molecule that promotes the growth, invasion, and metastasis of tumor cells by inducing immune escape. 34 Furthermore, TGF-β acts as an immunosuppressor by affecting the regulation of T cells. 35 A study on urothelial cancer showed that TGF-β expression was increased in PD-L1 nonresponders and that simultaneously blocking TGF-β and PD-L1 could facilitate T cell infiltration and provoke antitumor immunity and tumor regression. 36 Accordingly, TGF-β inhibition in pLELC patients was inferred to induce an immune response to promote survival.
Taken together, the loss of ZBTB16 can be hypothesized to improve pLELC tumorigenesis, while reductions in PPARG and TGFBR2 may promote antitumor immunity, leading to a favorable prognosis.
Similar to previous studies by Hong et al. and Xie et al., 11,12,37 TP53 mutation was also detected in our study. Except for TP53, no other common FMGs were observed within these studies, but a variety of similar CNVs were identified at both the segment and gene levels. We assume that the reasons included the different panels used in NGS and variable sample sizes. In addition, Signature 13 detected in the present study is essentially similar to Signature 2 reported by Hong et al. 11 because both of their biological functions are related to overactivity of APOBEC/AID. The number discrepancy is due to different sig database versions.
EBV infection has been linked to various human malignancies. 2 The EBV genome, which typically exists as an episome or is integrated within the host genome through the microhomology of integration, can promote the pathogenesis and progression of tumors. 13 In this study, all pLELC tumor tissues were positive in the EBV-encoded RNA (EBER) test. In addition, a total of 288 EBV integration breakpoints were determined by aligning the WGS data to the EBV reference genome. Two breakpoint sites were localized within the introns of HNF1A-AS1 and LINC00673 in pLELC-P5 and pLELC-P3, respectively. Similar integration was also found in EBV-associated colon cancer and regulated the progression of malignant cells. 38 Unfortunately, no tumor suppressor or oncogenes were discovered through inspection of the upstream (20 kb) and downstream (200 kb) breakpoints. In NPC, EBV is integrated into introns of the inflammation-related genes PARK2, tumor necrosis factor alpha-induced protein 3 (TNFAIP3), and cyclin-dependent kinase 15 (CDK15), which regulate the nuclear factor-κB pathway. 13 However, this behavior was not observed in pLELC. Moreover, EBV was reported to integrate near common fragile sites, avoiding short interspersed nuclear elements in NPC. 13 In contrast, intergenic regions were the favored integration site for EBV in pLELC in our study. These fragile regions are vulnerable to DNA damage, which increases the likelihood of EBV DNA insertion into host genomes. The functional impact of such integration needs to be investigated in future studies.
EBV encodes two groups of miRNAs, BHRF1 and BART. A total of 48 mature miRNAs, including 4 miR-BHRF1s and 44 miR-BARTs, are generated from 25 miRNA precursors. 39 In NPC, high expression of miR-BARTs contributes to cancer development by targeting various cellular and viral genes. 40 Moreover, BART3 and BART9 were observed to be closely associated with cell growth and proliferation, 41 while BART5 and BART10 played crucial roles in cell apoptosis. 42 In addition, BART1 induced tumor metastasis and recurrence in NPC by directly targeting the dominant tumor suppressor, phosphatase and tensin homolog (PTEN), and consequently activating PTEN-dependent pathways and inducing EMT. 43 BART5-3P and BART10-5P were also reported to be upregulated in NPC and to participate in the regulation of cell apoptosis and proliferation. 44 In this study, BART5-3P and BART20-3 were found to be upregulated in pLELC, but BART1 was not identified. The expression of these EBV-miRNAs was inferred to contribute to pLELC tumorigenesis and development.
In addition, BART5-3P has been reported to inhibit p53 protein expression. 45 In the current study, cases with BART5-3P overexpression were found to have low expression of p53 and PD-L1 and a poor prognosis. Combined with report indicating that PD-L1 was correlated with p53 in NSCLC and oral squamous cell carcinoma, 14,15 EBV insertion can be assumed to promote the expression of BART5-3P, consequently inhibiting p53 and PD-L1. However, the effects of BART20-3 on tumors have not been reported. Investigating the function of these miR-BARTs in pLELC and validating their regulatory role and downstream molecular mechanism may be of great significance in the future.
In conclusion, this study identified a distinct mutational signature in pLELC. Three tumor-associated genes were found to be downregulated with simultaneous CNV losses. In addition, this study provided an unbiased large-scale genome-wide analysis of the EBV integration landscape in pLELC. The potential targets of miRNA-derived EBV provided another clue to explore the pathogenesis of EBV in pLELC. Based on these findings, ZBTB16 with missing CNVs and EBV-encoded BART5-3P expression might be key in the tumorigenesis of pLELC, whereas the reductions in PPARG and TGFBR2 and the low mutation rate partially explained the favorable prognosis of pLELC.
Certainly, several limitations exist in this study. The most obvious weakness is the small sample size. Due to the rare incidence, only 128 patients had primary pLELC among the 41,574 cases. Therefore, this cohort is actually relatively large. Multicenter clinical research might be required to enroll more subjects. For genomic sequencing, only fresh-frozen tissues qualified for the analysis, and we recruited eight patients over the past few years. Although cases were limited, the comparatively deep analysis partly compensated for these shortcomings. Additionally, this is an observational investigation without interventional exploration. We will conduct further research on the mechanism of pLELC in the future.

Study population
A total of 41,574 cases of lung cancer were enrolled in the Newly Diagnosed Lung Cancer Database of West China Hospital of Sichuan University, China between January 2008 and December 2019. A total of 136 patients were diagnosed with pLELC. However, eight patients were excluded due to metastatic NPC, and the remaining 128 primary pLELC patients, accounting for 0.31% of all lung cancer cases, were recruited. Another 162 non-pLELC NSCLC patients who were hospitalized on the same day and treated by the same doctor as the pLELC patients were qualified as controls. These enrollment criteria minimized the effects of doctors' clinical preferences as much as possible. The control group included 82 LUADs, 73 LUSCs, and 7 adenosquamous carcinoma cases.
All patients were pathologically diagnosed according to the WHO classification criteria, 3 and all of the pLELCs were undifferentiated carcinomas with positive EBER staining. The TNM stage classification was based on the Eighth Edition of International Association for the Study of Lung Cancer International Staging Project. 46 Tumor tissues from 47 radical operation patients among the 128 pLELCs were analyzed using IHC to detect the expression and clinical significance of PD-L1 and p53.
Surgically resected frozen samples, including both tumor tissues and adjacent normal lung tissues, of eight pLELC patients were subjected to WGS, RNA-seq, and EBV integration analyses. Notably, the eighth patient (P8) numbered in the multigenome sequencing was excluded from the whole data analysis process because of metastatic NPC even though the sample had been sequenced. Therefore, patient nine (P9) was included in the integrated genomic analyses instead of P8. Ethical approval was obtained from the Institutional Review Board at our center, and all patients provided written informed consent for the investigation. 47 Clinicopathological information and follow-up parameters Clinical data, including demographic information (sex, age, smoking status, and family history of cancer), clinical staging, and pathology reports, were obtained from the electronic medical records retrospectively, whereas the prognostic data for OS, PFS, and DFS were received by telephone interview and systemic assessments of clinical examination and CT scans according to the Response Evaluation Criteria in Solid Tumors 1.1. 48 The time from diagnosis until death resulting from any cause was calculated as OS. In the whole cohort of 128 pLELC patients and 162 non-pLELC NSCLC patients, the time from diagnosis to clinical or radiological progression or death was defined as PFS. However, for 47 pLELC cases receiving radical resection with PD-L1 and p53 detection, only 10 patients showed disease progression. Therefore, DFS was utilized and measured from the date of diagnosis to locoregional or distant recurrence, second primary malignancy, or death, whichever occurred first. 49 Patients with no evidence of events or no follow-up information were documented as censored at the date of last contact on January 4, 2020. 48 IHC assay, WGS, and RNA-seq The steps for the IHC assay, WGS, and RNA-seq are provided in Supplementary Materials and Methods.
Significant mutations and oncogene/tumor-suppressor gene (TSG) analysis MuSiC 50 was used to identify FMGs in pLELC with a convolution test (P < 0.2). Somatic copy number alterations were analyzed using the GISTIC software with residual q values of 0.25 to identify wide peak limits. In addition, the COSMIC cancer gene census database was used to extract oncogenes/TSGs with tier 1 evidence. Somatic signature analyses were performed according to a previously described approach. 16 The nonsynonymous mutation rates of pLELC were compared with those of other lung cancer subtypes, including 50 LUADs and 50 LUSCs from TCGA and 26 EBV + -NPCs obtained from another group.
RNA quantification, DEG screening, and pathway enrichment analysis Transcripts were quantified using Cuffdiff, and the differential expression of transcripts between different groups was calculated and analyzed using the edgeR software. Records with a P value <0.05 were retained for differential screening. The clusterProfiler was used to select KEGG pathways and to analyze the pathway enrichment of DEGs. 20 Functional annotation and filtration ANNOVAR was performed to annotate the variant call format. Consensus Coding Sequence, RefSeq, Ensembl, and UCSC were used to determine amino acid variations. The annotation content contained variant positions, variant types, conservative predictions, etc. from the dbSNP, COSMIC, 1000g2015aug_all, and Exome Aggregation Consortium databases. ALL databases were also used to obtain the population frequencies of mutations.
EBV-encoded miRNA detection Details of genome alignment, host integration of EBV, and variant calling are provided in Supplementary Materials and Methods. Bowtie was used to match the length-screened sRNAs to the EBV reference sequence. Whether the reads mapped to the EBV genome were known EBV-encoded miRNAs was determined based on reports. 40,51 Statistical analysis The chi-square test was utilized to compare the clinical characteristic distribution between pLELCs and non-pLELC NSCLCs for categorical factors (Fisher's exact tests when necessary), and Student's t test was used for continuous variables. Kaplan-Meier method was employed to estimate the OS, PFS, and DFS with the log-rank test for comparison. Differences in the clinical data were considered statistically significant when the P value was <0.05 in SPSS version 20.0 (SPSS, Inc., USA). The integrated genomic analyses were conducted as reports, and methods are clarified in the corresponding sections with references marked.

DATA AVAILABILITY
Raw data of this project have been uploaded to the https://bigd.big.ac.cn/gsa and the code of our data was HRA000291. The online version of this article contains supplementary materials, which is available to authorized users.