Rare deleterious germline variants and risk of lung cancer

Recent studies suggest that rare variants exhibit stronger effect sizes and might play a crucial role in the etiology of lung cancers (LC). Whole exome plus targeted sequencing of germline DNA was performed on 1045 LC cases and 885 controls in the discovery set. To unveil the inherited causal variants, we focused on rare and predicted deleterious variants and small indels enriched in cases or controls. Promising candidates were further validated in a series of 26,803 LCs and 555,107 controls. During discovery, we identified 25 rare deleterious variants associated with LC susceptibility, including 13 reported in ClinVar. Of the five validated candidates, we discovered two pathogenic variants in known LC susceptibility loci, ATM p.V2716A (Odds Ratio [OR] 19.55, 95%CI 5.04–75.6) and MPZL2 p.I24M frameshift deletion (OR 3.88, 95%CI 1.71–8.8); and three in novel LC susceptibility genes, POMC c.*28delT at 3′ UTR (OR 4.33, 95%CI 2.03–9.24), STAU2 p.N364M frameshift deletion (OR 4.48, 95%CI 1.73–11.55), and MLNR p.Q334V frameshift deletion (OR 2.69, 95%CI 1.33–5.43). The potential cancer-promoting role of selected candidate genes and variants was further supported by endogenous DNA damage assays. Our analyses led to the identification of new rare deleterious variants with LC susceptibility. However, in-depth mechanistic studies are still needed to evaluate the pathogenic effects of these specific alleles.


INTRODUCTION
Lung cancer (LC), the leading cause of cancer mortality in the US, has recently shown substantial drops in mortality, largely attributed to reduced smoking rates and improvement in new treatments such as immunotherapy 1 . Prior genome-wide association studies (GWAS) identified novel genetic factors influencing LC risk, which are sometimes modulated by smoking behavior 2 . Notably, in the 15q25.1 region that shows the most significant and consistent genetic signal, a missense p.D398N and a 22-bp deletion (del) in the core promoter region of CHRNA5 have been identified that affect the function and expression 3,4 . Carriers of these variants find quitting smoking more difficult than noncarriers 5 and may benefit from a targeted smoking cessation intervention 6 .
Previous studies have estimated heritability of LC to be 18% 7 . Recent genetic studies suggest that rare variants (minor allele frequency [MAF] < 1%) that are functionally deleterious, exhibit far larger effect sizes than common variants 8-10 as they display signs of stronger selective pressure 11,12 , and could account for missing heritability unexplained by common variants 11 . Fewer than 3% of protein-coding single nucleotide variants (SNVs) corresponding to approximately 300 genes per genome are predicted to result in loss of protein function (LoF) through the introduction of stopgain, frameshift, or the disruption of an essential splice site 13 . Insertions (ins) or deletions (indels) have been understudied, though they are the second most abundant source of human genetic variation. Selected indels have been identified as playing a key role in causing LC, such as p.E746_A750del in EGFR [14][15][16] .
Supporting the hypothesis that deleterious mutations will show lower MAF are recent identifications of several rare missense variants that have a moderate-to-large effect on LC risk, for example, PARK2 p.R275W (OR 5.24) 17 , BRCA2 p.K3326X (OR 2.47), CHEK2 p.I157T (OR 0.38) 18 , LTB p.L87F (OR 7.52), P3H2 p.Q185H (OR 5.39) 19 , DBH p.V26M 20 , and ATM p.L2307F (OR 8.82) 21 . Because of 1 the stronger evolutionary pressure and weak linkage disequilibrium (LD) with common SNPs used in GWAS, finding these rare variants through population-based studies can be challenging 22 . To maximize the potential for the detection of large-effect, rare deleterious variants (SNVs and small indels ≤21 bp), we employed whole exome sequencing (WES) plus targeted sequencing on healthy controls and selected high-risk LC cases enriched with the highest genetic risk of LC, for example, early-onset or family history of LC (FHLC) 7,23,24 .

Demographics of study subjects
As shown in Table 1, the vast majority of subjects in the discovery study ─ Transdisciplinary Research in Cancer of the Lung (TRICL; 1,045 LCs vs. 885 controls) ─ and the validation sets (26,803 LCs and 555,107 controls) were primarily of European-descent ( Supplementary Fig. 1). LC cases were significantly more likely to be smokers and with higher pack-years than controls (P-value < 0.0001). The TRICL and Genetic Epidemiology of LC Consortium (GELCC) cases were enriched for having FHLC.
Identification of rare and deleterious variants in the TRICL discovery set In the discovery set, a total of 2,182,753 variants were detected. Applying a three-step filtering method based on allele frequency (MAF < 1% in non-Finnish European [NFE] population from the Genome Aggregation Database [gnomAD]), variant class (missense, protein-truncating and regulatory), and functional effects (predicted deleterious and or with clinical significance from ClinVar), we identified 67,470 rare and putatively deleterious variants: 63% missense, 16% frameshift (fs), 12% in-frame indels, 6% regulatory (untranslated region [UTR] and splice acceptor/ donor), and 3% stop-gain. Single variant association analysis identified 75 potential candidates.
Given the known challenge of excessive false-positive indel detection rates caused by the high frequency of homopolymerassociated sequencing errors [25][26][27][28] , we subjected these 75 potential candidates to additional filtering and manual inspection using Genome Browser (Supplementary Table 1). Twenty-five of the 75 were high-confidence putative candidates (two SNVs, four ins, and 19 del). Supplementary Fig. 2 shows the variant visualization map for the candidates and variant carriers (read alignment and depth). Thirteen out of the 25 candidates (in 24 genes) reported clinical significance in ClinVar, and eight were classified as pathogenic. Also, 5/24 genes were mapped to known LC-GWAS loci, such as 3q28 TP63 29  We next assessed the dose-effect of the 25 candidates: 16 were enriched in LCs (risk-conferring alleles) and 9 were enriched in controls (protective alleles). Compared with subjects with zero risk-and protective-alleles, the groups carrying one, and two riskalleles (5 LCs) showed a progressively increased risk, whereas groups carry one, and two protective-alleles (6 controls) demonstrated a gradually reduced risk (Supplementary Table 2). All 6 controls harbored MOB3A p.F69_I75del, whereas 4/5 LCs harbored STAU2 p.N364M fs*67del.
Studying the demographics of the mutation carriers, there was no significant difference in smoking (status and pack-years) or FHLC between carriers and non-carriers. Notably, 5/6 two-protectivealleles carriers were male, whereas 4/5 two-risk-alleles carriers were female and had adenocarcinoma (AD). Overall, age did not differ significantly between carriers and non-carriers ( Supplementary  Fig. 3). However, in LC cases, onset-age in risk-allele carriers (54 yrs for two-risk-alleles carriers, 62 yrs for one-risk-allele carriers) were significantly younger than the protective-allele carriers (69 yrs; Supplementary Table 2).
We subsequently conducted gene-based rare variant burden tests for the 24 genes harboring potential candidates, five genes, namely, MLNR, CCDC105, BMP8A, MME, and NPHP3, showed suggestive associations (Table 2). We also performed exome wide gene-based tests, however, none showed strong association after multiple testing corrections ( Supplementary Fig. 4).

Meta-analyses of the discovery and validation sets
In the seven validation datasets, of the 25 candidates, 100% were covered by the gnomAD, 22 (88%) in TCGA, 16 (64%) in COPDGene, nine (36%) in GELCC, and nine (36%) were covered in one of the three case-control studies (OncoArray, Affymetrix, and UKB) with genotyping data. Table 3 summarizes the top five candidates with consistent associations from the meta-analysis.
Candidate gene prioritization As shown in Table 2, of the 24 candidate genes, the most evolutionarily constrained (intolerance) genes with the lowest LoF observed/expected (o/e) values were PHF13, TP63, and STAU2; whereas the genes with the highest LC-correlated PhoRank scores were CHEK2, ATM, TP63, and MME. The most interesting protein interaction network consists of eight genes and is centered on three known DNA damage response genes, CHEK2-ATM-TP63, linking five other genes ( Supplementary Fig. 5). GO enrichment analysis highlighted genes involved in replicative senescence (which triggers a DNA damage response); whereas KEGG pathway analysis revealed that genes were involved in small cell LC (Supplementary Table 5).
Endogenous DNA damage assay Large conserved networks of E. coli and human proteins were recently discovered to promote endogenous DNA damage when overproduced 37 . These networks are known as DNA damageome proteins (DDPs) 37 . The DNA damageome also includes LoF variants that show DNA damage-up phenotypes 38 , most of which are not directly related to DNA repair but rather participate in the DNA damage production. We selected six prioritized genes for the assay: CHEK2, ATM, MPZL2, MLNR, POMC, and MME. We discovered the knockdown of five genes, overproduction of the mutant MLNR p.Q334V fs*3del and wildtype POMC promote DNA damage. Specifically, we first used pooled small interfering RNAs (siRNAs) that minimize off-target effects, and observed significantly increased DNA damage levels (γH2AX) for 5/6 genes ( Fig. 2a-c), including two well-known DNA repair genes (CHEK2 and ATM) and three newly discovered DDPs (POMC, MLNR, and MME). By contrast, the knockdown of MPZL2 did not affect DNA damage. For the three newly discovered DDPs, we further validated their DNA damage phenotypes using different individual siRNAs ( Fig. 2d-f). Moreover, overproducing the mutant MLNR p.Q334V fs*3del and the wildtype POMC open reading frame (ORF) from the plasmid promote DNA damage in the lung fibroblast-derived cell line (Fig. 2g-i).

DISCUSSION
Our analyses led to the identification of 25 rare deleterious candidates (in 24 genes) that may be associated with LC susceptibility. Of the five validated variants, we rediscovered two pathogenic variants mapped to known LC susceptibility loci, ATM p.V2716A and MPZL2 p.I24M fs*22del; and identified three deletions in novel LC susceptibility genes, POMC 3′ UTR c.*28delT, STAU2 p.N364M fs*67del, and MLNR p.Q334V fs*3del. Our GxE analysis also suggests some of these associations may be further modified by smoking (MLNR p.Q334V fs*3del and MOB3A p. F69_I75del) and FHLC (TXNDC15 p.E9G fs*68del). Additionally, our assays of cellular DNA damage identified POMC and MLNR as part of the DNA damageome, and confirmed a double-strand break repair role of ATM. This study confirms a robust association between LC susceptibility and ATM and discovered a new pathogenic p.V2716A, that reside in the PI3K catalytic domain. We also found this association is more evident in AD, which is consistent with several previous studies 21,39,40 . ATM is a critical first responder to DNA damage in the cell and essential for genome stability. Several association studies have indicated that common variants of ATM are linked to cancer susceptibility, including LC [41][42][43] . Expression of the PI3K domain in ataxia-telangiectasia cells resulted in complemented radiosensitivity and reduced chromosomal breakage after irradiation [44][45][46] , suggesting the PI3K domain contains many of the significant activity of ATM 47 . Our DNA damage assay also shows elevated DNA damage in lung fibroblasts confirming the previous finding that ATM defective cells accumulate more double-strand breaks 48 . Further, the presence of additional rare deleterious variants, together with previously identified p.P1054R 31 and p. L2307F 21 , strongly suggests that the ATM gene plays a role in LC susceptibility.
Another known LC locus we rediscovered is MPZL2 (also called Epithelial v-like antigen 1, EVA), and the pathogenic frameshift p.I24M fs*22del. MPZL2 is located at 11q23.3, a known GWAS locus for LC 31,49 and hearing loss 50,51 . MPZL2 is one of the top candidate target genes at this locus based on the expression quantitative trait loci (eQTLs) mapping 31 . MPZL2 is a member of the immunoglobulin superfamily, preferentially expressed in lung and thymus epithelium with a potential role as a favorable prognostic marker in thyroid cancer 52 . Interestingly, the MAF of p. I24M fs*22del in the AJ population was 5-fold higher than the general population in gnomAD. There are several examples where rare causal variants (e.g., variants in the P53, CFTR, and BRCA1/2) have higher frequencies within the AJ population [53][54][55][56] . In our DNA damage assay, MPZL2 expression levels do not affect endogenous DNA damage in lung fibroblasts, implying the need to investigate alternative mechanisms in future functional studies.
The most consistent and interesting findings are two new deletions: POMC 3′ UTR c.*28delT and MLNR p.Q334V fs*3del. POMC encodes a polypeptide hormone precursor that regulating energy metabolism, nicotinic-induced weight loss, and immune reactions [57][58][59] . In particular, POMC plays a role in UV-induced DNA damage through interactions with TP53 and is associated with skin cancer susceptibility [60][61][62][63][64] . Abnormal expression of POMC was a poor prognostic marker for LC [65][66][67][68] . Using in vitro models, Derghal et al. evaluated putative miRNA (i.e., miR-383, miR-384-3p, and Although the pathogenic variant, CHEK2 p.S428F with lower LC risk is not statistically significant in the meta-analysis, its protective effect is consistent with another known pathogenic low-frequency variant, CHEK2 p.I157T, associated with reduced risk of smokingrelated cancers (lung, laryngeal, urinary, and upper aerodigestive tract) 18,[73][74][75] . In contrast, both p.I157T and p.S428F showed an increased risk of breast cancer [75][76][77][78][79] . The mechanism underlying this effect is an ongoing question with unknown impact, perhaps related to smoking exposure and cell cycle checkpoint signaling/ apoptosis 75 . STAU2 is a double-stranded RNA-binding protein and a major regulator of mRNA transport, decay, and translation 80 . It was reported that STAU2 downregulation enhances levels of DNA damage (γH2AX) and promotes apoptosis (PARP1 cleavage) in camptothecin-treated cells 81,82 . The role of STAU2 in LC requires future investigations.
A main strength of the study is the focus on LC patients with extreme phenotypes of known risk factors (i.e., early-onset, FHLC, or familial cases in high-risk families), which provide >5 times statistical power 10 . Another strength was the relatively large sample size, which is by far the largest collection of LC rare variant analysis to our knowledge. It should be noted however that our study still has limited power to detect association for ultra-rare variants and those candidates (16/25) that could not be assessed in the validation. Third, our exome plus customized captures (50 Mb + 250 kb) in the discovery offers an efficient method for analyzing known susceptibility regions at greater depth and better coverage, particularly for indels that are often poorly captured in GWAS. Last, we have focused on the investigation of predicted LoF variants which provide directionality of effect. Notably, 14/25 candidates we identified were frameshift deletions that result in either truncated proteins or nonsense-mediated mRNA decay. In the discovery, we observed non-coding variants reside in regulatory regions that may influence target gene expression; however, the lack of population frequency information and insufficient coverage in the validation, limits our ability to explore this aspect for some non-coding variants.
There exist various challenges using the gnomAD as controls, including lack of individual-level data, inability to perform GxE interaction, gene-burden tests, and differences in platforms/ coverage. Additionally, there were some racial differences in  non-white between TCGA cases (27%) and gnomAD controls (30%), that could cause biased effect sizes in the meta-analysis. Genetic ancestry analysis shows 90% TCGA-LCs were inferred as genetic European ancestry 83 . However, it is possible that a small portion of European ancestry TCGA-patients has AJ origin, given that 7% of ovarian cancer 84 and 24% of endometrial cancer 85 are of AJ heritage. It is of note that in our dataset, none of the variant allele carriers of the 25 candidates were found to have Africanancestry. Therefore, we expect this potential population stratification effect to be relatively small on rare variant associations, particularly in non-Africans that have not experienced severe population bottlenecks [86][87][88] .
Although we demonstrated strong joint-effect of the 25 potential candidates (Supplementary Table 2), it is challenging to detect tissue-specific eQTL effects, identify mutational signatures, or construct polygenic risk score (PRS) based on these rare or ultra-rare candidates, due to their low frequencies and weak LD among rare or with common variants. We found some lung-tissue specific eQTL variants from The Genotype-Tissue Expression project (GTEx): three SNPs for ATM, 61 SNPs for POMC, 75 SNPs for MPZL2, and 141 SNPs for STAU2; but none of them overlap or are in LD with the 25 candidates we are reporting. Future studies could integrate single-cell transcriptomic sequencing and epigenomic maps in cells and tissues relevant to LC, to establish mutation signatures (i.e., DNA mismatch repair) and explore the application of PRS to clinical care.
In conclusion, our results provide evidence that rare deleterious variants with moderate to large effect sizes, in particular ATM p. V2716A, MPZL2 p.I24M fs*22del, STAU2 p.N364M fs*67del, POMC 3′ UTR c.*28delT, and MLNR p.Q334V fs*3del, contribute to LC susceptibility. Additional targeted studies using CRISPR/Cas9 mutagenesis could be performed for each variant, to evaluate more comprehensively what its effects are on gene functions and the underlying molecular mechanisms. Future extremely largescale multi-ancestry studies may also provide additional opportunities to assess ancestry-specific predisposing variants, and discover new genetic alterations with relatively large attributable risk for LC.

Study population in the discovery set
The discovery set included 1094 LC cases and 933 controls from the TRICL study 89 . All study subjects and biospecimens were collected with informed consent under institutional review board (IRB) approved protocols. Subjects were selected from four sites: Harvard School of Public Health (HSPH), International Agency for Research on Cancer (IARC), University of Liverpool, and Mount Sinai Hospital and Princess Margaret Hospital (MSH-PMH) in Toronto 89 . Cases were selected because they reported FHLC (firstdegree) or were early-onset (<60 yrs) or had specimens available (Table 1). Never smokers were defined as persons who had smoked fewer than 100 cigarettes in their lifetimes. The ethnicities were inferred using FastPop 90 .
WES and variant calling in the discovery set WES was performed using captures with Agilent SureSelect v5 (50 Mb, Agilent Technologies) and custom capture targeted known LC-GWAS region 91,92 (250 kb). Germline DNA was sequenced at the Center for Inherited Disease Research. The mean on-target coverage was 52x for each sequencing experiment and greater than 97% of on-target bases had a depth greater than 10x. Sequence reads were mapped to the human reference GRCh37/hg19 using the Burrows-Wheeler Aligner. SNVs and indels were called based on the union of raw GATK v3.3-0 and Atlas2. QC process involved the following user-definable criteria: i) low-complexity repeats and segmental duplications were filtered out; ii) quality score ≥20, depth ≥10, and AB ≥ 0.2 for heterozygous calls; iii) call rate ≥0.85; and iv) samples with abnormal heterozygosity rate, sex discordance, <95% completion rates, and unexpected relatedness (identity-by-state >10%) were filtered out.

Rare variant filtering and functional annotation in the discovery set
Following variant calling, rare variants were further enriched by the application of three-steps: i) Variant with MAF < 1% in the gnomAD (NFE ancestry, v2.1); ii) Variants class, including missense, protein-truncating, and regulatory; and iii) Mutation effects, i.e., variant results in protein truncation and predicted to be deleterious from 4/6 prediction tools (SIFT, Polyphen-2, MutationTaster, MutationAssessor, FATHMM, and FATHMM-MKL). The miRNAs putatively bound to the sequence containing UTR variants were identified by the TargetScan 35 . We additionally incorporated rare variants classified as pathogenic, likely pathogenic, or VUS from the ClinVar database, which compiles clinically observed human variants.

Single variant association test in the discovery set
For variants derived from the above automated filtering schema, we conducted the association test using Fisher's exact test. We used the Genome Browser (Golden Helix) visualization tool to verify the presence of the potential candidates in each carrier. By manual review of the variants' coverage plot (read depth) and pile-up plot (read alignment), we rule out low-confidence variants resulting from mapping error, strand bias, and weak exon conservation.

Gene-environment interaction and gene-based burden analysis in the discovery set
For the candidates identified from the association test, we performed G×E interaction (i.e., age-onset, sex, smoking status, pack-years, and FHLC), using the mixed linear regression model. To measure the cumulative effect of the rare deleterious variants within the gene, we performed collapsing tests using the CMC and the KBAC tests 93,94 .

Study populations in the validation sets and meta-analysis
The candidate variants were further examined in seven validation datasets, aggregated from different centers and across several platforms (four WES data and three genome-wide genotyping datasets as shown in Table 1). We tabulated the variant carrier counts per candidate and performed meta-analyses using the inverse-variance-weighted fixed-effects (assume the true effect size is the same in all studies).  i Representative histograms of (g). *P-value < 0.05, **P-value < 0.01, n.s not significant (P-value > 0.05).
LC Study Subjects. The sporadic LC patients were selected from our previous WES study 19,20  . There were 1162 participants in the OncoArray consortium who were also exome-sequenced in the TRICL discovery, and therefore these samples were excluded from the analysis in the validation phase. 6. Affymetrix case-control studies (5364 LCs vs. 5724 controls; dbGaP phs001681.v1.p1). This is a large pooled sample was assembled consisting of 10 independent case-control studies which previously described elsewhere 99,101 . Study participants were genotyped on an Axiom Exome Plus Array (Affymetrix) 99,101 , which contains a custom panel of key LC GWAS markers, and rare coding SNVs and indels 102 . There were 992 participants in the Affymetrix that were also exomesequenced in the TRICL discovery, and therefore these samples were excluded from the analysis in the validation phase. 7. UKB (UK Biobank cohort 103

Gene prioritization based on functional annotations and protein interactions network
To better reprioritize genes and candidates, we used three prioritization tools: 1) Gene evolutionary constraint to LoF variation, which using the o/e ratio from the gnomAD. 2) Phevor PhoRank algorithm 105 , which ranks the genes based on their phenotypic relevance as defined by diverse biomedical ontologies. 3) Protein-Protein interactions (PPI) network using the STRING database 106 , with an interaction score cut-off ≥0.15 (low confidence).  108  assays by flow cytometry were performed as previously described 37,38 . γH2AX primary antibody (Sigma, Catalog #05-636) and goat anti-mouse secondary antibody, Alexa Fluor 647 (Thermo Fisher, Catalog #A21236) were used to stain cells. Stained cells were then analyzed by a BD LSRFortessa flow cytometer. FCS files were analyzed by FlowJo 10.5 software. For siRNA experiments, cells were collected 72 h post transfection and median fluorescence intensity was quantified. Also, to quantify the DNA-damage positive subpopulations, 0.5% of the mock cells were gated as the γH2AX threshold as previously demonstrated. The percentage of γH2AX positive cells in each sample was calculated and compared to its corresponding nontargeting siRNA control. For overproduction experiments, mocktransfected cells were used to set the gates to determine the GFP and γH2AX positive cells. 0.5% of the mock cells were gated as the γH2AX threshold. The DNA-damage ratios by protein overproduction for 72 h are calculated as described. Briefly, the damage ratio is defined as (Q2/Q3)/(Q1/Q4), where Q2 is the portion of transfected γH2AX-positive cells; Q3 is the portion of transfected, γH2AX -negative cells; Q1 is the portion of untransfected, γH2AX-positive cells; and Q4 is the portion of untransfected, γH2AX-negative cells. The DNA damage ratios by candidate protein overproduction were compared with GFP-Tubulin as previously described.
Y. Liu et al.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The data generated and/or analyzed during the related study are described in the figshare metadata record: https://doi.org/10.6084/m9.figshare.13280387 109 . The data that support the findings of this study are available via the dbGaP (database of genotypes and phenotypes) repository. The data are controlled-access, so interested parties will need to request accessinformation on how to do so can be found on pages linked to below. The access numbers are https://identifiers.org/dbgap: phs000878.v2.p1 110 Supplementary Fig. 3) and 'TRICL WES.bam' (underlying Supplementary Fig. 2). These data are only available to authorized researchers who have submitted an IRB application. Please email the corresponding author for access. Data underlying Supplementary Received: 27 May 2020; Accepted: 11 December 2020;