Structure-based classification predicts drug response in EGFR-mutant NSCLC

Epidermal growth factor receptor (EGFR) mutations typically occur in exons 18–21 and are established driver mutations in non-small cell lung cancer (NSCLC)1–3. Targeted therapies are approved for patients with ‘classical’ mutations and a small number of other mutations4–6. However, effective therapies have not been identified for additional EGFR mutations. Furthermore, the frequency and effects of atypical EGFR mutations on drug sensitivity are unknown1,3,7–10. Here we characterize the mutational landscape in 16,715 patients with EGFR-mutant NSCLC, and establish the structure–function relationship of EGFR mutations on drug sensitivity. We found that EGFR mutations can be separated into four distinct subgroups on the basis of sensitivity and structural changes that retrospectively predict patient outcomes following treatment with EGFR inhibitors better than traditional exon-based groups. Together, these data delineate a structure-based approach for defining functional groups of EGFR mutations that can effectively guide treatment and clinical trial choices for patients with EGFR-mutant NSCLC and suggest that a structure–function-based approach may improve the prediction of drug sensitivity to targeted therapies in oncogenes with diverse mutations.

Patients with classical EGFR mutations (L858R or exon 19 deletions (Ex19del)) show marked improvements in clinical outcomes when treated with first-, second- or third-generation tyrosine kinase inhibitors (TKIs) 4–6,11 . Other EGFR mutations in the kinase domain (exons 18–21) have also been established as oncogenic drivers of NSCLC 8 . Patients with atypical EGFR mutations show heterogeneous and reduced responses to EGFR inhibitors 1,3,7–10 , including osimertinib 12 . Atypical EGFR mutations with US Food and Drug Administration (FDA)-approved treatments are EGFR S768I, L861Q and G719X, for which afatinib was deemed effective on the basis of retrospective studies 13–15 , and the EGFR/MET bispecific antibody amivantamab for exon 20 insertions (Ex20ins) 16 . There are no clear established guidelines for EGFR TKI treatment for patients with atypical EGFR mutations without an FDA-approved TKI, often resulting in patients receiving chemotherapy. Clinical trial design and treatment of patients with atypical EGFR mutations often rely on mutated-exon location to predict treatment, although heterogeneity in drug sensitivity across a single exon has been observed 1,8,17 . Therefore, there is an unmet clinical need to establish a system for classifying EGFR mutations that is predictive of drug sensitivity and for more robust clinical trial design.
To assess the effect of atypical EGFR mutations on patient outcomes, we determined the time to treatment failure 18 (TTF) of patients with NSCLC containing classical or atypical EGFR mutations. When treated with an EGFR TKI, patients with atypical EGFR mutations had a shorter TTF compared with patients with classical EGFR mutations (Fig. 1d, hazard ratio (HR) = 1.8, P < 0.0001), even when patients with Ex20ins were excluded from the analysis (Fig. 1e, HR = 1.6, P < 0.0001) or when patients were stratified by mutation exon location (Fig. 1e, Extended Data Fig. 1c). When patients were stratified by TKI treatment, those with classical EGFR mutations had a longer TTF than those with atypical EGFR mutations when treated with first-generation (HR = 1.9, P = 0.0005) or third-generation TKIs (HR = 3.0, P < 0.0001) (Extended Data Fig. 1d, e). A similar trend was observed for second-generation TKIs; however, the difference was not statistically significant (HR = 1.7, P = 0.10) (Extended Data Fig. 1f). Validating these findings in the cBioPortal database, patients with atypical EGFR mutations had a shorter progression-free interval 19 and overall survival, irrespective of treatment (Extended Data Fig. 1g, h).

Structural groups predict drug response
We generated a panel of 76 cell lines expressing EGFR mutations spanning exons 18-21 and screened these cell lines against 18 EGFR inhibitors representing first- (non-covalent), second- (covalent) and third- (covalent, T790M-targeting) generation and Ex20ins-active TKIs (Supplementary Table 1). Using hierarchical clustering of in vitro selectivity over WT EGFR and mutational mapping of EGFR mutations, we observed four distinct subgroups of EGFR mutations: classical-like mutations that were distant from the ATP-binding pocket (Extended Data Fig. 2a), T790M-like mutations in the hydrophobic core (Extended Data Fig. 2b), insertions in the loop at the C-terminal end of the αC-helix in exon 20 (Ex20ins-L; Extended Data Fig. 2c), and mutations on the interior surface of the ATP-binding pocket or C-terminal end of the αC-helix, which were predicted to be P-loop and αC-helix compressing (PACC) (Fig. 2a, Extended Data Fig. 2d). Supervised heat maps of mutant/wild-type ratios by exon location (Extended Data Fig. 3a) and structure-function groups (Extended Data Fig. 3b) showed distinct differences, suggesting that structure-function-based groups better defined groups of mutations by drug sensitivity than exon-based classification. To test this hypothesis, we calculated the correlations of drug sensitivity and selectivity for each mutation to the predicted drug sensitivity by exon or structure-function groups (Extended Data Fig. 4a) and then compared the median rho value of each correlation for both groups. We found that structure-function-based groups were more predictive of mutation sensitivity than exon-based groups (P < 0.0001) (Fig. 2b). We used a secondary approach employing machine learning to analyse data by classification and regression trees (CART) and determine variable importance 20,21 (Extended Data Fig. 4b).
Structure-function-based groups had a higher variable importance than exon-based groups, suggesting that structure-function-based groups were more predictive of which mutational groups would be most sensitive to a particular drug compared with exon-based groups (P < 0.0001) (Fig. 2c). We validated these findings without T790M mutations (Extended Data Fig. 4c, d), and structure-function-based groups remained more predictive of mutation and drug sensitivity than exon-based groups (P = 0.0034 and P < 0.0001, respectively) (Extended Data Fig. 4e, f).
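As a rough illustration of the variable-importance measure used here (defined in Methods as the sum of the goodness-of-split measures for each split, scaled to sum to 100 for a tree), the following minimal Python sketch tallies importance per predictor. It does not reproduce the rpart implementation; the function name, the predictor labels and the split values are hypothetical and invented for illustration only.

```python
from collections import defaultdict

def scaled_variable_importance(splits):
    """splits: iterable of (variable_name, goodness_of_split) for one tree.

    Returns the summed goodness of split per variable, rescaled so that
    the values across all variables sum to 100.
    """
    totals = defaultdict(float)
    for variable, goodness in splits:
        totals[variable] += goodness
    grand_total = sum(totals.values())
    return {variable: 100 * value / grand_total for variable, value in totals.items()}

# Hypothetical splits for one drug's tree (invented numbers, not study data).
toy_splits = [
    ("structure_function_group", 6.0),
    ("structure_function_group", 2.0),
    ("exon_20", 1.5),
    ("exon_18", 0.5),
]
importance = scaled_variable_importance(toy_splits)
```

In this toy tree the structure-function predictor accounts for most of the total goodness of split, which is the pattern the comparison above is designed to detect.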
Classical-like, atypical EGFR mutations were predicted to have little effect on the overall structure of EGFR compared with wild-type EGFR (Extended Data Fig. 5a-d) and were sensitive and selective for all classes of EGFR TKIs, particularly third-generation TKIs, in vitro and in vivo (Extended Data Fig. 5e-g). Mutations and assigned groups are in Supplementary Table 4.

Exon 20 mutations are heterogenous
Studies showed that exon 20 mutations are heterogenous in their response to TKIs 10,22 . Insertions in the αC-helix (for example, an FQEA insertion at A763 (A763insFQEA)) were pan-sensitive to EGFR TKIs 23,24 , whereas those in the loop following the αC-helix (A767-C775) were not 25,26 , and the T790M mutation was sensitive to third- but not first- or second-generation TKIs. We found that most exon 20 point mutations were PACC mutations; that exon 20 insertions in the αC-helix were classical-like mutations; and that the remainder of exon 20 insertions, occurring in the C-terminal loop of the αC-helix, were a distinct subgroup: exon 20 loop insertions (Ex20ins-L) (Fig. 2a). Ex20ins-L were sensitive only to select second-generation TKIs (that is, poziotinib and tarlox-TKI) and Ex20ins-active TKIs in vitro and in vivo (Fig. 2a). However, even within Ex20ins-L mutations, some degree of heterogeneity in drug sensitivity was observed (Fig. 2a). In cell lines expressing Ex20ins-L mutations spanning A767-V774, we found that Ex20ins-L mutations could be subdivided into two subgroups: near- and far-loop Ex20ins (Extended Data Fig. 6d). Exon 20 near-loop insertions (Ex20ins-NL) were more sensitive to second-generation and Ex20ins-active TKIs compared with exon 20 far-loop insertions (Ex20ins-FL) (P = 0.0025 and P = 0.027, respectively) (Extended Data Fig. 6e). These data exemplify that mutations within an exon are heterogenous and that an exon-based classification is unlikely to be optimal for guiding treatment decisions.

Drug repurposing for resistant mutations
Although all T790M-like mutants had at least one mutation in the hydrophobic core, there were two distinct subgroups of T790M-like mutants: third-generation TKI sensitive (T790M-like-3S) and third-generation TKI resistant (T790M-like-3R) (Extended Data Fig. 7a). Previous reports have shown that protein kinase C 27 (PKC) and anaplastic lymphoma kinase 28,29 (ALK) inhibitors exhibit off-target activity for EGFR mutations including T790M, and the non-covalent nature of these compounds predicts that they retain activity against mutations that interrupt covalent binding. T790M-like-3S mutants had high selectivity for third-generation TKIs and some Ex20ins-active inhibitors and moderate selectivity for ALK and PKC inhibitors (Extended Data Fig. 7b). T790M-like-3R mutants, complex mutations comprising T790M and a known drug-resistance mutation (that is, C797S 30 , L718X 31 or L792H 18,31 ), were resistant to classical EGFR TKIs but retained selectivity for select ALK and PKC inhibitors such as brigatinib or midostaurin (Extended Data Fig. 7c). These data support expanding testing of ALK and/or PKC inhibitors or development of novel non-covalent inhibitors for the broader group of T790M-like-3R mutations.

Second-generation TKIs inhibit PACC mutations
PACC mutations comprise mutations spanning exons 18-21 including G719X, L747X, S768I, L792X and T854I and were predicted to alter the orientation of the P-loop or αC-helix (Extended Data Fig. 8a, b). In silico analysis of the interaction of osimertinib with PACC mutations G719S and L718Q predicted that changes in the orientation of the P-loop alter the position of TKI stabilization points, tilting the indole ring of osimertinib away from the P-loop and destabilizing drug binding (Extended Data Fig. 8c, d). By contrast, second-generation TKIs do not interact with the P-loop of EGFR and maintain interaction points in the hydrophobic cleft (Extended Data Fig. 8d, e). When we compared the selectivity of EGFR TKIs for PACC mutations, we found that second-generation TKIs were significantly more selective for PACC mutations than any other TKI class (Fig. 3a). In vivo, we also observed that NSCLC patient-derived xenografts (PDXs) containing G719A mutations were resistant to the third-generation TKI osimertinib, but most sensitive to the second-generation TKI poziotinib (Fig. 3b, Extended Data Fig. 8f). Notably, a patient with a complex PACC mutation, E709K/G719S, saw clinical benefit and tumour shrinkage with afatinib treatment after progressing on osimertinib (Extended Data Fig. 8g). Similarly, acquired PACC mutations co-occurring with primary classical EGFR mutations retained sensitivity to second-generation TKIs while acquiring resistance to third-generation TKIs in an allele-specific manner (Fig. 3c, d). In silico analysis predicted that the acquired PACC mutation G796S, co-occurring with Ex19del, confers resistance to third-generation TKIs such as osimertinib by shifting the hinge region of the receptor, preventing stabilization of osimertinib at M793 and displacing the acrylamide group of osimertinib away from C797 (Extended Data Fig. 8h). However, second-generation TKIs were less affected by shifts in the hinge region of the receptor and were predicted to maintain the orientation of the acrylamide group (Extended Data Fig. 8f).
Previous studies have reported that C797S mutations confer resistance to third-generation TKIs even without the presence of T790M 30 (Fig. 3e). Similarly, C797S mutations without T790M conferred resistance to Ex20ins-active inhibitors, but not first- or second-generation TKIs unless T790M was present (Fig. 3e). Retrospectively, we identified three patients with NSCLC containing EGFR L858R mutations who received first-line osimertinib and subsequently developed an EGFR-dependent mechanism of resistance. In all patients, a PACC mutation was identified upon biopsy at progression (Extended Data Fig. 9a-c). Two patients acquired an L718V mutation, and one acquired two PACC mutations (V765L and C797S). All patients were treated with a second-generation TKI and experienced clinical benefits of stable disease and tumour shrinkage (Extended Data Fig. 9a-c). These data demonstrate that both primary and acquired PACC mutations are sensitive to second-generation TKIs, and structure-function-based groupings could identify a novel class of mutations, PACC mutations, for which second-generation TKIs had higher selectivity and efficacy than third-generation drugs.

Structure-based groups predict outcomes
To determine whether structure-function-based groups could identify patients who are most likely to benefit from a treatment better than exon-based groups, we used a publicly available database of outcomes for patients harbouring atypical EGFR mutations treated with afatinib 32 and determined overall response rate (ORR) and duration of treatment (DOT). Structure-function-based grouping showed clear differences between sensitive (classical-like and PACC) and resistant (T790M-like and Ex20ins-L) subgroups (ORR of 63% versus 20%), whereas exon-based groups had less variation between groups (Extended Data Fig. 10a, b). Structure-function-based groups identified that patients with PACC mutations (n = 156) had a significantly longer DOT for afatinib than other structure-based groups (DOT: 17.1 months, P < 0.0001) (Fig. 4a, b). Using exon-based groups, we also identified that patients with exon 18 mutations (n = 87) had a longer DOT than patients with mutations in exons 19-21 (DOT: 17.4 months, P < 0.0001) (Fig. 4b, Extended Data Fig. 10c); however, the structure-based approach identified nearly twice as many individuals who benefited from afatinib treatment.
To determine whether structure-based groups could identify which class of inhibitors would provide the most benefit to patients with atypical EGFR mutations, we performed retrospective analyses of TTF of patients with atypical EGFR mutations treated with EGFR TKIs in MD Anderson Cancer Center GEMINI and Moffitt Cancer Center databases, and TTF was determined for the first EGFR TKI with which patients were treated. Most patients (80%) were stage IV at diagnosis, and there were no statistical differences in patient characteristics (Supplementary Tables 5, 6). When stratified by structure-function-based groups, patients with PACC mutations treated with second-generation TKIs had a significantly longer TTF than patients treated with either first- or third-generation TKIs (21.7 months versus 10.0 months and 4.1 months, respectively; P < 0.0001, HR = 0.23) (Fig. 4c, d). By contrast, TTF was not significantly different between classes of EGFR TKIs in patients with non-PACC mutations (Fig. 4d, Extended Data Fig. 10d). Further, patients with PACC mutations had a longer TTF than patients with non-PACC mutations when treated with second-generation TKIs (21.7 months versus 10.0 months, respectively; HR = 2.6, P = 0.0068) (Fig. 4e). When patients were stratified by exon and TTF was calculated for first-, second- and third-generation TKIs, significant differences were observed only in patients with exon 18 mutations treated with second-generation TKIs compared with third-generation TKIs (20.9 months versus 5.5 months; P = 0.001, HR = 0.29) (Fig. 4d, Extended Data Fig. 10e-h). Therefore, structure-function classification identified not only a larger subgroup of patients, but also a subgroup with greater benefit from second-generation TKIs than the exon-based classification.

Discussion
The diversity of atypical EGFR mutations, and their higher than previously appreciated prevalence, shown here highlight the necessity of comprehensive next-generation sequencing (NGS) for patients with NSCLC.
We show that EGFR mutations, including atypical mutations, can be divided into four distinct subgroups based on structure and function (Fig. 5), and that structure-function-based groups can predict drug sensitivity and patient outcomes better than exon-based groups. While previous studies have shown activity of second-generation TKIs in patients with specific exon 18 mutations 33,34 , structure-function-based grouping identified a larger subgroup of EGFR mutations, PACC mutants, for which second-generation TKIs were more selective than third-generation TKIs. Clinically, second-generation TKIs have been associated with inhibition of wild-type EGFR and related adverse events 15,35,36 ; however, most second-generation TKIs are dosed at the maximum tolerated doses, resulting in plasma concentrations 10-100-fold greater than concentrations necessary for inhibiting PACC mutations. Unlike osimertinib, second-generation TKIs have limited activity in the central nervous system (CNS), demonstrating the need for novel TKIs that can inhibit PACC mutants with reduced inhibition of wild-type EGFR and with CNS activity. These findings demonstrate that structure-function-based groups can identify classes of drugs that may be effective for whole groups of mutations, reflecting the observation that mutations in different regions of the gene may induce similar changes in protein structure. For example, L718Q, S768I, and T854I correspond to exons 18, 20, and 21, respectively, but are all PACC mutations with similar structural effects on drug binding. Conversely, mutations within the same exon may induce quite disparate changes. L747_K754del-insATSPE, L747P and E746-A750del mutations occur in exon 19 but are T790M-like, PACC, and classical mutations, respectively, with distinct drug sensitivity and structural effects. A clinical challenge for physicians treating patients with EGFR-mutant cancers is to appropriately identify and match patient mutations with the best EGFR TKI.
While a separate classification for each individual mutation could provide more precision than the groups described here, validating the clinical activities of different drugs for each mutation is not feasible. The classification presented here provides a framework through which clinicians, informed by internet-based tools or companies providing NGS reports, could more effectively personalize EGFR TKI therapy. Finally, these findings support the notion that for cancers containing oncogenes with diverse mutations, adopting a structure-function-based approach may improve clinical trial design and drug development.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-021-03898-1.

Fig. 5 | EGFR mutations can be divided into four distinct subgroups.
Representative space-filling models of each EGFR subgroup demonstrate changes in overall shape of drug-binding pocket. The P-loop (blue), hinge region (ATP-binding site (orange)), hydrophobic cleft (green), and αC-helix (yellow) are shown. Red dots represent location of mutations. Arrows indicate location of structural changes compared with wild-type EGFR. The most common mutations are shown for each group, and drug sensitivity or selectivity is colour-coded and listed from most selective or sensitive (green) to resistant (red). PKCi, PKC inhibitor; ALKi, ALK inhibitor.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Methods
No sample size calculations were done to predetermine group sizes, and investigators were not blinded during randomization and outcome assessments.

Analysis of EGFR variants in MD Anderson Cancer Center GEMINI, Foundation Medicine, Guardant Health and cBioPortal databases
To analyse the numbers and frequencies of different EGFR mutations among patients with NSCLC in the MD Anderson Cancer Center GEMINI database, the database was queried for patients with EGFR mutations (n = 1,054) and manually curated as classical or atypical EGFR mutations. The MD Anderson Cancer Center GEMINI database is prospectively collected from patients consented and enrolled on protocol number PA13-0589 in accordance with the MD Anderson Institutional Review Board.
EGFR mutations were determined from formalin-fixed paraffin-embedded tumours or digital-droplet PCR for blood samples by CLIA-certified methods as previously described 18,37 . To identify patients with EGFR mutations in the Foundation Medicine database, patient samples taken between November 2011 and May 2020 and previously subjected to hybrid-capture-based comprehensive genomic profiling of formalin-fixed paraffin-embedded tissue or plasma using previously validated assays 38,39 were analysed for EGFR mutations (n = 10,221). Patients were stratified by EGFR mutation, and EGFR mutations were manually curated as atypical or classical EGFR mutations. Classical EGFR mutations were defined as L858R point mutations, T790M mutations, and various exon 19 deletions including any deletion in exon 19 beginning at amino acid E746 or L747 and ending at amino acid A755. Deletions also including insertions were allowed and still considered classical exon 19 deletions. Atypical EGFR mutations were defined as non-synonymous mutations that were not defined as classical mutations. Patients with EGFR mutations where the sequence of the mutation was unknown were excluded from the analysis.
To determine the frequency of individual EGFR variants reported across the MD Anderson GEMINI database, cBioPortal, Foundation Medicine and the Guardant Health database, each database was analysed separately, and the average of all databases was determined.
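The averaging step can be sketched in a few lines of Python. This is a minimal illustration only: it assumes an unweighted mean across databases and treats a variant that is absent from a database as having a frequency of zero there, neither of which is specified in the text. The database labels and counts are invented.

```python
from collections import Counter

def variant_frequencies(mutations):
    """Return {variant: fraction of all tabulated mutations} for one database."""
    counts = Counter(mutations)
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}

def average_across_databases(databases):
    """Unweighted mean of per-database frequencies (assumption).

    A variant missing from a database contributes a frequency of 0 (assumption).
    """
    per_db = [variant_frequencies(m) for m in databases.values()]
    variants = set().union(*per_db)
    return {v: sum(f.get(v, 0.0) for f in per_db) / len(per_db) for v in variants}

# Toy mutation calls, invented purely for illustration (not data from the study).
toy = {
    "db_a": ["L858R"] * 6 + ["G719A"] * 2 + ["S768I"] * 2,
    "db_b": ["L858R"] * 5 + ["G719A"] * 5,
}
avg = average_across_databases(toy)
```

An unweighted mean gives each database equal influence regardless of its size; a pooled (size-weighted) frequency would be an alternative design choice.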
To determine the frequency of atypical mutations in the MD Anderson GEMINI and Foundation Medicine databases, atypical mutations were identified as described above and the total number of known EGFR mutations across all patients was tabulated. For the analysis of cBioPortal, all non-overlapping studies were selected and exported. For overlapping studies, only the largest dataset was used, and all known EGFR mutations were tabulated. To determine the frequencies of EGFR variants from Guardant Health, a database of sequenced circulating free DNA (cfDNA), the Guardant360 clinical database was searched for NSCLC samples tested between November 2016 and November 2019 harbouring EGFR mutations (n = 5,026 patients). Guardant360 is a CLIA-certified, CAP/NYSDOH accredited comprehensive cfDNA NGS test that reports on SNVs, indels and fusions in up to 73 genes. The Guardant360 clinical database, and the four datasets reported here, are enriched in North American patients with NSCLC; the frequency of atypical EGFR mutations may differ in Asia or other regions.

Analysis of TTF in MD Anderson Cancer Center GEMINI and Moffitt Cancer Center
To determine TTF after EGFR TKI treatment, patients with NSCLC harbouring an EGFR mutation in the tyrosine kinase domain (exons 18-22) were identified in the MD Anderson GEMINI and Moffitt Cancer Center databases. Data collection for Moffitt Cancer Center (MCC) patients was performed under the protocol (MCC 19161), which was formally reviewed and granted approval by MCC in accordance with the Declaration of Helsinki and the 21st Century Cures Act. Outcomes were recorded only for the first EGFR TKI that each patient received. Patients were stratified as having classical (L858R or Ex19del, as defined above) or atypical (non-classical) mutations. There were 333 patients with NSCLC identified in the MD Anderson GEMINI database who had tumours expressing atypical mutations. Of these patients, 88 received at least one line of EGFR TKI treatment. In addition, at Moffitt Cancer Center, there were 21 patients with NSCLC with tumours harbouring atypical EGFR mutations. Clinical parameters were extracted from the respective databases. Patients previously receiving chemotherapy were included, and TTF was calculated for the first EGFR TKI received. TTF was determined as previously described 18 and defined as time from commencement of EGFR TKI to radiologic progression, TKI discontinuation, or death, and was not based on RECIST criteria. For patients treated beyond progression, radiologic progression was recorded as the end point, and data cut-off was May 2021. Median TTF was calculated using the Kaplan-Meier method. HR and P values were determined using GraphPad Prism software and two-sided Mantel-Cox log-rank tests.
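For readers unfamiliar with the Kaplan-Meier method, the following minimal Python sketch shows how a median TTF can be read off the product-limit estimate. It is not the GraphPad Prism implementation used in the study. The convention that the median is the first time at which estimated survival falls to 50% or below is an assumption, and the toy times and event flags are invented.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] steps of the Kaplan-Meier estimate.

    times  -- time to treatment failure or censoring (for example, months)
    events -- 1 if treatment failure was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        removed = 0
        # Group all patients sharing the same time point.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            removed += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

def median_time(curve):
    """First time at which estimated survival is <= 0.5 (assumed convention)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached

# Toy data, invented for illustration only.
toy_times = [2, 4, 4, 6, 9, 12]
toy_events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(toy_times, toy_events)
median_ttf = median_time(curve)
```

Censored patients leave the risk set without lowering the survival estimate, which is why the method suits cohorts with a fixed data cut-off.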

Analysis of OS and PFI from cBioPortal Database
For overall survival (OS) and progression-free interval (PFI), analysis of patients in cBioPortal was performed as previously described 19 for patients receiving any treatment who had survival information and a qualifying EGFR mutation. This information was curated from cBioPortal by selecting all non-overlapping studies of NSCLC. For overlapping studies, the largest database was selected. PFI and OS analyses were restricted to mutations in the tyrosine kinase domain. Median OS and median PFI were calculated using the Kaplan-Meier method. HR and P values were determined using GraphPad Prism software and two-sided Mantel-Cox log-rank tests.

Ba/F3 cell generation, drug screening and IC50 approximations
Ba/F3 cells were obtained as a gift from G. Mills (MD Anderson Cancer Center) and maintained in RPMI (Sigma) containing 10% FBS, 1% penicillin-streptomycin and 10 ng ml−1 recombinant mIL-3 (R&D Biosystems). To establish stable Ba/F3 cell lines, Ba/F3 cells were transduced with retroviruses containing mutant EGFR plasmids for 12-24 h. Retroviruses were generated using Lipofectamine 2000 (Invitrogen) transfections of Phoenix 293T-ampho cells (Orbigen) with pBabe-Puro based vectors listed in Supplementary Table 7. Vectors were generated by GeneScript or Bioinnovatise using parental vectors from Addgene listed in Supplementary Table 7. After 48-72 h of transduction, 2 µg ml−1 puromycin (Invitrogen) was added to Ba/F3 cell lines in complete RPMI. To select for EGFR-positive cell lines, cells were stained with PE-EGFR (Biolegend) and sorted by fluorescence-activated cell sorting. After sorting, EGFR-positive cells were maintained in RPMI containing 10% FBS, 1% penicillin-streptomycin, and 1 ng ml−1 EGF to support cell viability. Drug screening was performed as previously described 22,36 . Briefly, cells were plated in 384-well plates (Greiner Bio-One) at 2,000-3,000 cells per well in technical triplicate. Seven different concentrations of TKIs or DMSO vehicle were added to reach a final volume of 40 µl per well. After 72 h, 11 µl of Cell Titer Glo (Promega) was added to each well. Plates were incubated for a minimum of 10 min, and bioluminescence was determined using a FLUOstar OPTIMA plate reader (BMG LABTECH). Raw bioluminescence values were normalized to DMSO control-treated cells, and values were plotted in GraphPad Prism. Non-linear regressions were used to fit the normalized data with a variable slope, and IC50 values were determined by GraphPad Prism by interpolation of concentrations at 50% inhibition. Drug screens were performed in technical triplicate on each plate and in either duplicate or triplicate biological replicates.
Mutant to WT ratios for each drug were calculated by dividing the IC50 values of mutant cell lines by the average IC50 value of Ba/F3 cells expressing WT EGFR supplemented with 10 ng ml−1 EGF for each drug. Statistical differences between groups were determined by one-way ANOVA as described in the figure legends.
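The arithmetic of this step can be illustrated with a short Python sketch. It is deliberately simplified: instead of fitting a variable-slope non-linear regression as was done in GraphPad Prism, it interpolates linearly on log10(concentration) between the two tested concentrations that bracket 50% viability. All function names and numbers are hypothetical.

```python
import math

def normalize_to_dmso(raw, dmso_mean):
    """Express raw bioluminescence as a fraction of the DMSO control mean."""
    return [value / dmso_mean for value in raw]

def ic50_by_interpolation(concs, viability):
    """Concentration at 50% viability by linear interpolation on log10(conc).

    Simplified stand-in for reading the value off a fitted variable-slope curve.
    """
    points = sorted(zip(concs, viability))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 0.5 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 0.5) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None  # 50% inhibition not reached within the tested range

def mutant_to_wt_ratio(mutant_ic50, wt_ic50_values):
    """IC50 of the mutant divided by the average IC50 of WT EGFR cells."""
    return mutant_ic50 / (sum(wt_ic50_values) / len(wt_ic50_values))

# Toy values, invented for illustration (concentrations in nM).
concs = [1, 10, 100, 1000]
viability = normalize_to_dmso([980, 750, 250, 50], dmso_mean=1000)
ic50 = ic50_by_interpolation(concs, viability)
ratio = mutant_to_wt_ratio(ic50, [300.0, 340.0])
log_ratio = math.log10(ratio)
```

A ratio below 1 (negative log ratio) indicates that the drug inhibits the mutant at lower concentrations than WT EGFR, that is, that it is selective for the mutant.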

In silico mutational mapping and docking experiments
X-ray structures of wild-type EGFR in complex with AMP-PNP (Protein Data Bank (PDB) ID: 2ITX) and osimertinib (PDB ID: 4ZAU), and of the EGFR L858R mutant in complex with AMP-PNP (PDB ID: 2ITV), were retrieved from the Protein Data Bank. Molecular Operating Environment (2019.01; Chemical Computing Group CCCG) was used to generate mutant homology models, construct protein-ligand models and for visualization. Pymol was used for visualization of mutation location on WT EGFR (PDB ID: 2ITX) and structural alignment with EGFR D770insNPG (PDB ID: 4LRM) or EGFR G719S (PDB ID: 2ITN).

Heat map generation
Heat maps and hierarchical clustering were generated by plotting the median log (Mut/WT) value for each cell line and each drug using R and the ComplexHeatmap package 40 2.6.2 (R Foundation for Statistical Computing). Hierarchical clustering was determined by Euclidean distance between Mut/WT ratios. For co-occurring mutations, mutation order was assigned arbitrarily, and for acquired mutations, mutations were assigned in the order mutations are observed clinically. Structure-function groups were assigned based on predicted impact of mutation on receptor conformation.
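To make the clustering step concrete, the following Python sketch groups drug-sensitivity profiles by Euclidean distance using agglomerative clustering. The text specifies the distance metric but not the linkage rule, so complete linkage is an assumption here. The mutation labels and log(Mut/WT) values are invented, and the study itself used R with ComplexHeatmap rather than this code.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering with complete linkage (assumed linkage rule).

    profiles   -- {label: [log(Mut/WT) per drug]}
    n_clusters -- number of clusters at which to stop merging
    """
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster distance is that of the farthest pair.
            d = max(euclidean(profiles[a], profiles[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical profiles across three drugs (invented values).
toy_profiles = {
    "mut_A": [-2.0, -1.8, -2.2],
    "mut_B": [-1.9, -2.0, -2.1],
    "mut_C": [1.0, -1.5, 0.8],
    "mut_D": [1.3, -1.4, 0.9],
}
groups = complete_linkage(toy_profiles, n_clusters=2)
```

Mutations with similar selectivity profiles across drugs end up in the same cluster regardless of the exon in which they occur.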

Determination of EGFR groups and subgroups
Mutational mapping was used to separate EGFR mutations into distinct groups based on predicted drug sensitivity. Structural features of EGFR mutations with known drug sensitivity (that is, classical EGFR mutations 41,42 , T790M 43-45 and exon 20 insertions 22,25 ) were used as the basis for predicting the impact of mutations on drug sensitivity. Mutational mapping identified four distinct groups: (1) no obvious effect on the drug-binding pocket (similar to L858R); (2) a mutation in the hydrophobic core (similar to T790M); (3) a large inward shift of both the αC-helix and P-loop (similar to exon 20 insertions); and (4) a slight inward shift of the αC-helix and/or P-loop, due either to direct changes to the αC-helix and/or P-loop or to indirect effects of alterations in the β-pleated sheets that are predicted to affect the position of the αC-helix and/or P-loop. Groups were validated by hierarchical clustering of in vitro sensitivity of Ba/F3 cells expressing the various EGFR mutations. Subgroups such as T790M-like-3S/T790M-like-3R and Ex20ins-NL/Ex20ins-FL were defined based on cell line sensitivity data.

Statistical analyses of structure-function groups
Correlations for mutations were determined using Spearman's rho by correlating the median log (Mut/WT) value for each mutation and drug versus the average of the median log (Mut/WT) value for the structure-function-based group or exon-based group to which the mutation belongs. For each correlation, the mutation tested was removed from the average of the structure-function-based and exon-based groups. Average rho values were compared by two-sided Student's t-test. To determine whether structure-function groups or exon groups were a better predictor of drug sensitivity, we performed recursive-partitioning analyses to construct a decision tree for each drug using structure-function group and mutation data on exons 18, 19, 20, and 21 as predictors. The decision tree classified samples by posing a series of decision rules based on predictors. Each decision rule was contained in an internal node, and every internal node posed a yes-or-no question that resulted in a 'yes' or 'no' branch. We applied the CART algorithm 20,21 using the rpart R package. We calculated variable importance as the sum of the goodness of split measures for each split. These were scaled to sum to 100 for a tree. SAS version 9.4 and R version 3.5.6 were used to carry out the computations for all analyses. The structure-function group variable was involved in the first and second splits in all of the 18 regression trees of drug sensitivity. The variable importance of this variable was in a range of 66-94%. Both the order of the split and variable importance indicate that the structure-function group variable was more predictive than the exon-based variables in evaluation of drug sensitivity. Code for this analysis can be found at https://github.com/MD-Anderson-Bioinformatics/EGFR-Structure-Function-Nature-Manuscript.
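The leave-one-out correlation described at the start of this section can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the code in the repository cited above: Spearman's rho is computed as the Pearson correlation of average ranks, and the mutation labels and log (Mut/WT) values are invented.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def leave_one_out_rho(mutation, group):
    """Correlate one mutation's profile with the mean profile of the
    other members of its group (the tested mutation is excluded)."""
    others = [p for name, p in group.items() if name != mutation]
    mean_profile = [sum(col) / len(col) for col in zip(*others)]
    return spearman(group[mutation], mean_profile)

# Hypothetical log(Mut/WT) profiles across four drugs (invented values).
well_matched = {
    "mut_A": [-2.0, -1.0, 0.5, 1.0],
    "mut_B": [-1.8, -1.2, 0.4, 0.9],
    "mut_C": [-2.2, -0.8, 0.6, 1.2],
}
poorly_matched = {
    "mut_A": [-2.0, -1.0, 0.5, 1.0],
    "mut_X": [1.0, 0.5, -1.0, -2.0],
    "mut_Y": [1.2, 0.4, -0.8, -2.2],
}
rho_good = leave_one_out_rho("mut_A", well_matched)
rho_poor = leave_one_out_rho("mut_A", poorly_matched)
```

Removing the tested mutation from its own group average avoids inflating rho through self-correlation; a grouping scheme that yields higher rho values across mutations is the more predictive one.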

PDX generation and in vivo experiments
As part of the MD Anderson Cancer Center Lung Cancer Moon Shots program, PDXs harbouring EGFR G719A and EGFR L858R/E709K were generated and maintained in accordance with Good Animal Practices and with approval from MD Anderson Cancer Center Institutional Animal Care and Use Committee on protocol number PA140276 as previously described 46. Surgical samples were rinsed with serum-free RPMI supplemented with 1% penicillin-streptomycin then implanted into the right flank of 5- to 6-week-old NSG female mice within 2 h of resection. Tumours were validated for EGFR mutations by DNA fingerprinting and quantitative PCR as described 46. PDXs harbouring EGFR S768dupSVD were purchased from Jackson Laboratories (J100672). To propagate tumours, 5- to 6-week-old female NSG mice (NOD.Cg-Prkdcscid Il2rgtm1Wjl/SzJ) were purchased from Jackson Laboratories (005557). Fragments of NSCLC tumours expressing EGFR S768dupSVD, G719A or L858R/E709K were implanted into 6- to 8-week-old female NSG mice. Once tumours reached 2,000 mm3, they were collected and re-implanted into the right flank of 6- to 8-week-old female NSG mice. Tumours were measured three times per week, and mice were randomized into treatment groups when tumours reached a volume of 275-325 mm3 for the EGFR G719A and S768dupSVD models, and 150-175 mm3 for the L858R/E709K model. Treatment groups included vehicle control (0.5% methylcellulose, 0.05% Tween-80 in dH2O), 100 mg kg−1 erlotinib, 20 mg kg−1 afatinib, 2.5 mg kg−1 poziotinib, 5 mg kg−1 osimertinib, and 25 mg kg−1 osimertinib. During treatment, body weight and tumour volumes were measured three times per week, and mice received treatment five days per week (Monday to Friday). Dosing holidays were given if mouse body weight decreased by more than 10% or overall body weight dropped below 20 g. Maximum allowed tumour burden by approved IACUC protocol was a volume of 2,000 mm3. Mice were humanely euthanized if tumour sizes exceeded the maximum size.
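The enrollment, dosing and endpoint rules stated above can be written out as a short sketch. The thresholds are taken from the text; the function names are hypothetical, and the sketch assumes that the 10% body-weight decrease is measured against a pre-treatment baseline, which the text does not specify.

```python
from datetime import date

# Randomization windows (mm3) stated in the text for each PDX model.
ENROLLMENT_WINDOW_MM3 = {
    "G719A": (275, 325),
    "S768dupSVD": (275, 325),
    "L858R/E709K": (150, 175),
}
MAX_TUMOUR_MM3 = 2000  # maximum tumour burden allowed by the protocol


def eligible_for_randomization(model, volume_mm3):
    """True if the tumour volume falls inside the model's enrollment window."""
    low, high = ENROLLMENT_WINDOW_MM3[model]
    return low <= volume_mm3 <= high


def dose_today(day, baseline_weight_g, current_weight_g):
    """True if the mouse is dosed on this day (Monday to Friday, no holiday).

    Assumption: weight loss is computed relative to baseline_weight_g."""
    if day.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return False
    lost_fraction = (baseline_weight_g - current_weight_g) / baseline_weight_g
    if lost_fraction > 0.10 or current_weight_g < 20:
        return False  # dosing holiday
    return True


def exceeds_tumour_limit(volume_mm3):
    """True if the tumour is larger than the protocol maximum."""
    return volume_mm3 > MAX_TUMOUR_MM3
```

For example, `dose_today(date(2021, 3, 1), 25.0, 22.0)` returns `False`, because a drop from 25 g to 22 g is a 12% loss even though the date is a Monday and the weight is above 20 g.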

Case studies of patients treated with second-generation TKIs
Patients were consented under the GEMINI protocol (PA13-0589) which was approved in accordance with the MD Anderson Institutional Review Board, or protocol MCC 19161, which was formally reviewed and granted approval by Moffitt Cancer Center in accordance with the Declaration of Helsinki and the 21st Century Cures Act for retrospective analysis of patient outcomes and treatment course for case studies of patients presented. Both protocols include informed consent for publication of deidentified data.

Retrospective analysis of ORR and duration of treatment with afatinib
Response to afatinib and duration of afatinib treatment were tabulated from 803 patients in the Uncommon EGFR Database (www.uncommonegfrmutations.com). Objective response rate (ORR) was reported in 529 patients. Patients were stratified by either structure-function-based groups or exon-based groups, and ORR was determined by counting the number of patients reported to have complete response or partial response. Fisher's exact test was used to determine statistical differences between subgroups (structure-function-based or exon-based). Duration of treatment (DOT) was provided in the Uncommon EGFR Database for 746 patients. Patients were stratified by structure-function-based groups and exon-based groups, and median DOT was calculated using the Kaplan-Meier method. Statistical differences in Kaplan-Meier plots, HR and P values were generated using GraphPad Prism software and the Mantel-Cox log-rank method. When mutations were not explicitly stated (that is, exon 19 mutation), those patients were excluded from the structure-function-based analysis but included in the exon-based analysis.
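As an illustration of the ORR comparison, the sketch below computes a response rate and a two-sided Fisher's exact P value for a 2x2 table using only the Python standard library. The study itself used GraphPad Prism; the function names here are hypothetical, and the two-sided rule used (summing all tables no more probable than the observed one) is one common convention that may differ from the software's default.

```python
from math import comb


def objective_response_rate(responses):
    """Fraction of patients with a complete (CR) or partial (PR) response."""
    responders = sum(1 for r in responses if r in ("CR", "PR"))
    return responders / len(responses)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are subgroups; columns are responders and non-responders. The P value
    is the total hypergeometric probability of every table with the same
    margins that is no more likely than the observed table."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability that the first row contains x of the col1 responders.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    p = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p)
```

For a table of 3 responders and 1 non-responder in one subgroup versus 1 and 3 in another, `fisher_exact_two_sided(3, 1, 1, 3)` gives 34/70, or about 0.486.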

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability
Source data for all figures can be found at https://github.com/MD-Anderson-Bioinformatics/EGFR-Structure-Function-Nature-Manuscript. Public datasets used in this study include non-overlapping studies including NSCLC in cBioportal (www.cbioportal.org) and the Uncommon EGFR Database (www.uncommonegfrmutations.com). Details of specific studies accessed can be found in the Reporting Summary. Data from Foundation Medicine and Guardant Health were provided under data use agreements; however, summarized data used in Fig. 1 and Extended Data Fig. 1 are provided at https://github.com/MD-Anderson-Bioinformatics/EGFR-Structure-Function-Nature-Manuscript. Source data are provided with this paper.

Code availability
All code used in this study can be found at https://github.com/MD-Anderson-Bioinformatics/EGFR-Structure-Function-Nature-Manuscript.

When the delta bar shifts to the right, the Spearman rho value was higher for structure-function-based groups, and when the grey bar shifts to the left, the Spearman rho value was higher for the exon-based groups. b, Representative classification and regression trees for each indicated drug. Colors represent drug sensitivity (green) or resistance (red) as defined by log (mutant IC50/WT EGFR IC50). c, Bar plot of Spearman rho values for indicated mutations (excluding T790M mutations) compared to exon-based groups (yellow) or structure-function-based groups (green). The delta of the two rho values is shown as an overlapped grey bar. d, Representative classification and regression trees for each indicated drug excluding T790M from the analysis. Colors represent drug sensitivity (green) or resistance (red) as defined by log (mutant IC50/WT EGFR IC50). e, Dot plot of rho values from Spearman correlations of mutations vs exon-based group averages or structure-function-based averages for each drug excluding T790M mutations. Dots are representative of each mutation; bars are representative of the average rho value ± standard deviation (SD). p-value was determined using a paired two-sided t-test, and n = 59 cell lines/mutations. f, Dot plot of variable importance calculated as sum of the goodness of split for each split in the classification and regression trees (CART). Dots are representative of variable importance for each drug in the exon and structure-function-based groups as indicated and excluding T790M mutations. Bars are representative of the median + 95% confidence interval of variable importance for all drugs (Supplementary Table 3). p-value was determined using a paired two-sided t-test, and n = 18 drugs.

Data analysis
Heat maps and hierarchical clustering were generated by plotting the median log (Mut/WT) value for each cell line and each drug using R and the ComplexHeatmap package 2.6.2 (R Foundation for Statistical Computing, Vienna, Austria. Complex Heatmap). Hierarchical clustering was determined by Euclidean distance between Mut/WT ratios. To determine whether structure-function groups or exon groups were better predictors of drug sensitivity, we performed recursive-partitioning analyses to construct a decision tree for each drug using structure-function group and mutation data on exons 18, 19, 20 and 21 as predictors. The decision tree classified samples by posing a series of decision rules based on predictors. Each decision rule was contained in an internal node, and every internal node poses a yes-or-no question that results in a 'yes' or 'no' branch. We applied the classification and regression trees (CART) algorithm using the "rpart" R package 4.1-15. We calculated variable importance as the sum of the goodness of split measures for each split. These were scaled to sum to 100 for a tree. SAS version 9.4 and R version 3.5.6 were used to carry out the computations of the CART algorithm. All other statistical analyses were completed using GraphPad Prism version 8.0. Code used for data analysis can be found at https://github.com/MD-Anderson-Bioinformatics/EGFR-Structure-Function-Nature-Manuscript.
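The variable-importance scaling described above can be illustrated with a small sketch. It follows only the description in the text (sum the goodness of split per variable, then scale to 100 per tree) and is not a reimplementation of rpart, which may also credit surrogate splits. The function name, variable names and goodness values are hypothetical.

```python
from collections import defaultdict


def scaled_variable_importance(splits):
    """splits: iterable of (variable_name, goodness_of_split) for one tree.

    Returns variable -> importance, scaled so the values sum to 100."""
    raw = defaultdict(float)
    for variable, goodness in splits:
        raw[variable] += goodness  # sum goodness over every split on a variable
    total = sum(raw.values())
    return {variable: 100 * value / total for variable, value in raw.items()}


# Hypothetical splits from one regression tree (illustrative values only).
tree_splits = [
    ("structure_function_group", 6.0),
    ("structure_function_group", 2.0),
    ("exon_20", 1.5),
    ("exon_18", 0.5),
]
importance = scaled_variable_importance(tree_splits)
```

With these illustrative numbers, the structure-function variable accounts for 80 of the 100 importance units, and the two exon variables share the remaining 20.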

Authentication
Cell line identity was confirmed by DNA fingerprinting via short tandem repeats using the PowerPlex 1.2 kit (Promega).
Fingerprinting results were compared with reference fingerprints maintained by the primary source of the cell line.

Mycoplasma contamination
Regular mycoplasma testing is completed in our lab and cells were found to be free from mycoplasma.

Commonly misidentified lines (See ICLAC register)
No commonly misidentified cell lines were used in this study.

Animals and other organisms

Laboratory animals
For PDX models, female NSG mice were engrafted with tumor fragments at 6-8 weeks of age. Five to eight mice were implanted per group. Mice with tumors not meeting the indicated tumor sizes were not randomized. Mice were housed in a modified barrier, specific pathogen free facility including automatic 12 hour light/dark cycles. The facility is kept at an ambient temperature of 72 degrees Fahrenheit and 45% humidity. Food and water are provided ad libitum.

Wild animals
This study did not include wild animals.

Field-collected samples
This study did not include field-collected samples.

Ethics oversight
Patient derived xenografts were generated and maintained in accordance with Good Animal Practices and with approval from MD Anderson Cancer Center Institutional Animal Care and Use Committee (Houston, TX) on protocol number PA140276.

Human research participants

Population characteristics
There were no co-variate analyses completed based on patient characteristics in this study.

Recruitment
Patients were consented prospectively through the MD Anderson GEMINI protocol. Recruitment for the GEMINI protocol is performed at routine visits. Patients within the institution who are scheduled for appointments in the Thoracic Center will be identified by either an attending physician or by a study member in the Department of Thoracic/Head and Neck Medical Oncology. During clinic appointments, potential subjects will be informed of their eligibility and asked if they would be interested in research participation. The attending physician is also available to address any questions or concerns the subject may have. Subjects who agree to participate will sign the protocol-specific informed consent. A research nurse, research data coordinator, or designee with appropriate training and experience sufficient to address issues raised by potential subjects may obtain the informed consent.

Ethics oversight
The MD Anderson Cancer Center GEMINI database is prospectively collected from patients consented and enrolled on protocol number PA13-0589 in accordance with the MD Anderson Institutional Review Board. Data collection for Moffitt Cancer Center (MCC) patients was performed under the protocol (MCC 19161), which was formally reviewed and granted approval by MCC in accordance with the Declaration of Helsinki and the 21st Century Cures Act. Both protocols allow for publication of deidentified data.