A proteogenomic profile of early lung adenocarcinomas by protein co-expression network and genomic alteration analysis

The tumourigenesis of early lung adenocarcinomas, including adenocarcinoma in situ (AIS), minimally invasive adenocarcinoma (MIA), and lepidic predominant invasive adenocarcinoma (LPA), remains unclear. This study aimed to capture disease-related molecular networks characterising each subtype and tumorigenesis by assessing 14 lung adenocarcinomas (AIS, five; MIA, five; LPA, four). Protein–protein interaction networks significant to the three subtypes were elucidated by weighted gene co-expression network analysis and pairwise G-statistics based analysis. Pathway enrichment analysis for AIS involved extracellular matrix proteoglycans and neutrophil degranulation pathway relating to tumour growth and angiogenesis. Whereas no direct networks were found for MIA, proteins significant to MIA were involved in oncogenic transformation, epithelial-mesenchymal transition, and detoxification in the lung. LPA was associated with pathways of HSF1-mediated heat shock response regulation, DNA damage repair, cell cycle regulation, and mitosis. Genomic alteration analysis suggested that LPA had both somatic mutations with loss of function and copy number gains more frequent than MIA. Oncogenic drivers were detected in both MIA and LPA, and also LPA had a higher degree of copy number loss than MIA. Our findings may help identifying potential therapeutic targets and developing therapeutic strategies to improve patient outcomes.

www.nature.com/scientificreports/ Identification of key protein modules by WGCNA. We constructed a weighted gene co-expression network and clustered all the identified proteins, and identified 49 protein modules ( Fig. 2A). In the WGCNA, a soft threshold power of 10 was selected to define the adjacency matrix according to the criteria of approximate scale-free topology, with a minimum module size of 10 and a module detection sensitivity deepSplit of 4. The clinical traits for patients were set according to the lung adenocarcinoma subtype-AIS, MIA, or LPA. The correlations between the resultant modules and clinical traits were determined to identify the protein modules whose expressions were upregulated or downregulated in the subtype samples. A heatmap of the proteome abundance of eigen-proteins and samples and pairwise correlations between the modules in the heatmap of eigen-protein expressions were presented (Fig. 2B, C). We identified several modules that showed high and/or moderate correlations with clinical traits (correlation: |r|> 0.5) (Fig. 3). We performed multiple testing correction by the Benjamini-Hochberg method. Finally, the eight modules-WM26, WM27, WM 29, WM30, WM31, WM32, WM33, and WM35 (indicated by the red dashed squares) associated with LPA were found to be statistically significant whereas none of the modules associated with AIS or MIA remained significant.
Identification of the PPI networks associated with AIS and MIA by pairwise G-statistical analysis of identified proteins. Trait analysis in WGCNA trends to overlook important modules for investigating differential disease mechanisms whereas WGCNA is a powerful tool in identifying the co-expression of molecular modules. Indeed, we could not identify key co-expression modules associated with AIS and MIA with high statistical confidence. We adopted the pairwise G-statistics approach 21 that can identify individual proteins with significant differences in spectral counting (SpCs)-based proteome abundance among different patient groups, where William's correction for continuity was applied to the 2 × 2 tables 13 . The adjusted G-statistical calculation enabled us to handle the data containing small spectral counts including zero. A protein was defined to be significant to AIS or MIA when the protein was expressed in ≥ 60% samples of the group and has its relative abundance > 50% and q values < 0.05 that were corrected by the Benjamini-Hochberg method for pairwise www.nature.com/scientificreports/ G-statistics p values among the three groups. We thus identified thirty and eight proteins significant to AIS and MIA, respectively (Fig. 4).

Functional enrichment analysis of the PPI networks.
To characterise the key protein networks, the biological association among the proteins in each network was analysed by mapping the network proteins in the human protein-protein interaction (PPI) network and among the biological pathways by pathway enrichment analysis (Figs. 5 and 6). The PPI networks generated using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database 22 were reconstructed using the CYTOSCAPE version 3.7.2 software program (Institute for Systems Biology, Seattle, WA, USA) for the protein networks of AIS and MIA identified by the pairwise G-statistics and the eight WGCNA network modules associated with LPA (WM26, WM27, WM 29, WM30, WM31, WM32, WM33, and WM35). Top hub proteins were identified according to maximal clique centrality (MCC) by using in the CYTOHUBBA PLUGIN 23 . Figure 5 shows the protein networks of AIS, MIA, and the top three WGCNA (2) Reactome pathways: extracellular matrix organization, ECM proteoglycans, molecules associated with elastic fibres, and non-integrin membrane-ECM interactions. The top MCC-hub protein heparan sulfate proteoglycan 2 (HSPG2) (also known as Perlecan) is an integral component of basement membranes and interacts with other ECM components as captured in the PPI networks associated with AIS involving laminin (LAMB2 and LAMC1) and prolargin (PRELP) (Fig. 5A). Perlecan binds growth factors via heparan sulfate chains and interacts with vascular endothelial growth factor receptors 2 (VGFR2) which plays a major role in tumour angiogenesis. Dynamic remodelling of tissue architecture takes place during tumour growth and angiogenesis, where perlecan is one of the ECM constituents of the tumour microenvironment. Whereas the intact protein perlecan is known to possess pro-angiogenic properties, its C-terminal fragment, which is released by proteolysis during cancer remodelling and known as endorepellin, has the opposite functions of antiangiogenic activity 24  Recently, the role of the neutrophil degranulation has been discussed how the release of neutrophil granule proteins is associated with cancer development and tumour progression via neutrophil-mediated transport of cancer cells leading to different cellular phenotypes and into different tissues 25 . The subnetwork consisting of ACADSB and ECHDC1 may relate to the carboxylic acid catabolic process.
No protein networks were obtained in-network depth of 0 interactions mainly because of the limited number of proteins significant to MIA by which no hub proteins were identified (Fig. 5B). The only enriched pathway www.nature.com/scientificreports/ was obtained as interferon-gamma signalling in Reactome pathways (Fig. 6), which included MHC class II HLA-DR beta 1 chain (HLA-DRB1) and protein tyrosine phosphatase non-receptor type 1 (PTPN1). PTPN1 (also known as PTP1B) belongs to protein tyrosine phosphatases (PTPs) family. PTP1B has been known to have the two faces in tumorigenesis that PTP1B promotes tumour progression in some cancers but functions as a tumour suppressor in other cancers 26 , whereas the role of PTP1B in NSCLC has been unknown. It has been reported that the high expression of PTP1B in NSCLC tissues was associated with the stage and overall survival of NSCLC patients 26 . PTP1B inhibitors are currently considered a promising anti-cancer therapy due to its involvement in the progression of numerous types of cancers via oncogenic transformation 27 . Hematopoietic pre-B-cell leukemia transcription factor (PBX)-interacting protein (PBXIP1), also known as HPIP, is a corepressor of PBX. Shi et al. 28 demonstrated that HPIP silencing suppressed TGF-β1-induced epithelial-mesenchymal transition (EMT) in A549 lung cancer cells in vitro. DCXR encodes diacetyl/l-xylulose reductase, a multifunctional enzyme in glucose metabolism. Reactive carbonyls are known to cause severe respiratory diseases, which are detoxified by carbonyl reductases in the lung, in particular, DCXR that mediates chemical redox cycling 29 . The eight WGCNA modules significant to LPA ( (2) Reactome pathways: snRNP assembly, mRNA splicing-minor pathway, histone stem-loop-binding protein (SLBP) independent processing of histone pre-mRNAs, cleavage of a growing transcript in the termination region, and citric acid cycle (TCA cycle). The eigengene PPP1CB encodes serine/ threonine-protein phosphatase 1 (PP1)-beta catalytic subunit (PP-1B) which is one of the catalytic subunits of PP1 involved in regulating cell division. Low levels of PP1s and their key regulatory subunit PPP1R9B (known as Spinophilin) are related to a poor prognosis in numerous cancers including lung cancers, which was more involved in squamous cell lung carcinoma than lung adenocarcinoma 30 . It might be noteworthy that a malignant glioma of infancy was recently found to harbour a novel PPP1CB-ALK fusion protein 31 , to be compared with the EML4-ALK translocation mutation present in ca. 6% of NSCLC. The main PPI network (1) including PP-1B, damage specific DNA binding protein 1 (DDB1) and nuclear mitotic apparatus protein 1 (NUMA1) (Fig. 5C) is The enriched pathways of the WM30 module included (1) biological process (GO): chaperone-mediated protein folding, response to endoplasmic reticulum stress, and protein de-ubiquitination; (2) Reactome pathways: AU-rich element RNA-binding protein 1 (AUF1) (also known as hnRNP D0) binding and destabilisation of mRNA, activation of nuclear factor kappa B (NF-κB) in B-cells, and oxygen-dependent proline hydroxylation of hypoxia-inducible factor-alpha (HIF-alpha) (Figs. 5D and 6). The eigengene PSMC4 encodes 26S proteasome regulatory subunit 6B, a component of the 26proteasome, playing a key role in the maintenance of protein homeostasis by removing misfolded or damaged proteins. The translation elongation factor eukaryotic elongation factor 1-alpha1 (EEF1A1) was reported to participate in the entire heat shock response process from transcription through translation 32 . Depending on the type of stimulus, EEF1A1 can be phosphorylated and/or methylated and can interact with trans-acting factors including AUF1, which possibly determines the stability of specific mRNAs and its interaction with 70-kd heat shock protein (HSP70s) mRNA. Stress-induced phosphoprotein 1 (STIP1), also known as HOP/P60/STI1, is a chaperone protein that comprises three tetratricopeptide repeat domains, which can simultaneously bind HSP70s and HSP90s. STIP1 is a tumour-associated antigen (TAA) 33 ; its overexpression has been identified in numerous cancers, including colorectal carcinoma 34 , pancreatic cancer 35 , cholangiocellular carcinoma 36 and ovarian cancer 37 , and it is possibly associated with poor survival outcomes in patients with cancer 38 . ST13 encodes ST13 HSP70s-interacting protein (HIP)/putative tumour suppressor ST13/ suppression of tumorigenicity 13 protein, which is also involved in the regulation of heat shock factor 1 (HSF1)mediated heat shock response and which mediates the association of HSP70s and HSP90s. In an in vitro study on pancreatic ductal adenocarcinoma, Ma et al. 39 reported that signal recognition particle receptor subunit beta www.nature.com/scientificreports/ (SRPRB) plays a central role in the interaction between proteins and stress-associated endoplasmic reticulum (ER) protein 1 (SERP1), which is responsible for the accumulation of unfolded protein in ER stress. They also noted that the downregulation of SERP1 significantly increased SRPRB expression, leading to cell apoptosis through NF-κB activation. Hypoxia upregulated protein 1 (HYOU1) has a significant degree of interaction with SRPRB that well correlated with both the upregulated expressions of SRPRB and HYOU1 observed in our protein expression data. Yamaguchi et al. 40 found that SERP1 expression was enhanced in vitro by hypoxia, which was associated with the accumulation of unfolded protein in ER stress. The subnetwork (4) is involved in the pathway of carbohydrate metabolic process.
A genomic alteration analysis for MIA and LPA. Protein expression profiles were found to be highly differentiated between MIA and LPA (Figs. 1A and 2B). We then conducted NGS to obtain genomic alteration profiles in FFPE tissue specimens, which were obtained from the same FFPE blocks used for the proteomic analyses. NGS was performed to identify genomic alterations such as gene mutations including single-nucleotide variations, insertion/deletion variations, and gene CNVs, using the ACTONCO panel (ACT Genomics, Taipei, Taiwan), which comprises of 440 cancer-related genes. Oncoprints were prepared to visualise somatic mutations and CNVs by heatmap together with the TCGA lung adenocarcinoma (ADC) datasets for comparison (Fig. 7A,  B). TCGA lung ADC datasets (M stage: M0, Mx, NA; N stage: N0, NA; sample number: n = 471) were obtained from the cBioPortal Pan-Lung cancer (TCGA, Nat Genet 2016) (https ://www.cbiop ortal .org/). LPA was suggested to accompany both somatic mutations with loss of function and copy number gains more frequent than the MIA group. Given the small sample size, it is hard to perform statistical analysis or to find any enriched biomarkers in MIA or LPA1. However, it was shown that all both MIAs and LPAs harboured at least one oncogenic driver with EGFR oncogenic mutations being most prevalent. Other oncogenic events detected include ERBB2 p. Gly776delinValVal in the sample MIA_T75, and EGFR copy number gain in the samples of LPA_T87 and www.nature.com/scientificreports/ MIA_T80. Oncogenic drivers identified in our study were quite similar to those from the large-scale genomic studies 41 . Aside from the tumour which exhibited the APOBEC-like mutation signature, in our cohort, a high percentage of MIAs and LPAs were EGFR mutant, which may explain why the number of somatic mutations is low and no difference in tumour mutational burden was found between two groups.

Discussion
The tumourigenesis of early lung adenocarcinoma is thought to progress in a stepwise manner in the order of AIS, MIA, and LPA; however, this has not been completely elucidated to date. Therefore, understanding of diseaserelated molecular mechanisms and profiles in early lung adenocarcinomas would be markedly useful for selecting treatment strategies and could improve the treatment outcomes of individual patients. Protein-based networks by both WGCNA and pairwise G-statistics analyses identified several protein modules and networks that were potentially associated with disease mechanisms driven by distinct early lung adenocarcinomas. Most activities in ECM organization and integrin-mediated cell-cell adhesion characterised by AIS were consistent with its growth pattern restricted to neoplastic cells with pre-existing alveolar structures without stroma, vascular, or pleural invasion. MIA might be associated with oncogenic transformation and detoxification in lung cell-extracellular matrix interactions although no direct networks were identified. This seems to coincide with the observation of tumour histology defined as MIA presenting a limited size of invasion with a myofibroblastic stroma-associated invasive tumour. Pathways of cell surface-receptor signalling, and HSF1-mediated heat shock response regulation, cell cycle regulation were characteristic to LPA.
Assuming that the tumourigenesis proceeds along with the AIS-MIA-LPA axis, the key protein networks identified in this study appear to reflect the nature of the cancer cells of these three subtypes. Generally, the following two possible scenarios can be considered: either LPA cells emerge via transformation of MIA cells or from a different cell origin. The former scenario requires LPA cells to have the same gene mutation profiles as MIA cells and transformed LPA cells to possibly not grow so rapidly but instead co-exist with MIA cells. Conversely, the latter scenario instead requires both MIA and LPA cells to have distinctly different gene mutation profiles from one another. Cancer stem cells (CSCs) are known to emerge via the dysfunctional organogenesis of organ stem cells resulting from the loss of gene regulatory control 43 . In this process, new aberrant cells emerge and acquire a self-renewing capability via the neoplastic transformation of CSCs. Such aberrant cells would emerge as a subpopulation of tumour cells owing to genetic intra-tumour heterogeneity, and their rapid growth would disrupt tumour environment and result in predominant cell survival. The orthogonal partial least square-discriminant analysis (OPLS-DA) 44 performed for mutant proteins identified in the previous study revealed the profound differences in distance between the LPA group and the AIS plus MIA group 11,12 .
The mutational landscape shown in the larger study conducted by Qian et al. 42 suggests that a majority of patients in their study have an ethnic background different from ours. Our cohort has a higher rate of oncogenic EGFR mutations (57% vs 29%) and a lower rate of KRAS mutations (0% vs 25%). Moreover, in the study performed by Chen et al. 41 oncogenic drivers were not detected in up to half of their samples. Although our study sample size is limited and ethnic background is different, we found similarities between two studies of which oncogenic drivers were detected in MIA and LPA (Fig. 7A, B). Besides, we also found that LPA had a higher degree of copy number loss (one or two copy deletion) than MIA (Fig. 7C), which feature was also seen in the genomic study 42 .
In conclusion, our results could identify disease-related protein networks that are possibly associated with distinct early lung adenocarcinomas-AIS, MIA, and LPA. The large genomic studies suggested mutation signature profiling did not vary significantly throughout pre-invasive lung ADCs-AIS and MIA and invasive lung ADCs 41,42 . However, together with our protein network-based profiles and the previous OPLS discriminant analysis of mutant proteomes 11 , it remains disputable that LPA cells emerge via a direct transformation from AIS or MIA, whereas no evidence was attained by the genomic alteration analysis performed in this study. It should be also noted that their studies grouped all 11 subtypes 45 including abundant papillary and acinar ADCs as invasive lung ADCs. A large-scale genomic alteration study using tissue specimens histologically well-defined as LPA would be needed for further investigation. Network-based investigations regarding tumourigenesis will further provide clinically important information about proteogenomic landscapes in lung adenocarcinoma. www.nature.com/scientificreports/ Plex Diagnostics Inc.) according to the manufacturer's protocol 47 . The procedures have been described in detail elsewhere 12,13 . A total of eight tumour samples (four MIA and four LPA) selected from the specimens used for proteomic analyses, were subjected to NGS. The RECOVER ALL Total Nucleic Acid Isolation kit (Qiagen, Hilden, Germany) was used to isolate genomic DNA from FFPE tumour samples. The DNA concentration and integrity were analysed using the QUBIT-IT dsDNA HS assay (Invitrogen, Carlsbad, CA, USA) and FRAGMENT ANALYSER (Advanced Analytical Technologies, Ankey, IA, USA), respectively.

Liquid chromatography-tandem MS (LC-MS/MS)-based proteomic analysis.
A label-free quantitation approach using spectral counting by LC-MS/MS was adopted for the global proteomic analysis. The digested samples (5 μL for a single run) were analysed in triplicate by LC-MS/MS using a reverse-phase LC system interfaced with a Q EXACTIVE ORBITRAP mass spectrometer (Thermo Fisher Scientific, Waltham, MA, USA) via a nano-electrospray ionisation device (AMR Inc., Tokyo, Japan). LC-MS/MS analysis has been described in detail previously 13 . The expressions of the identified proteins were assessed by spectral count-based protein quantification 48 . Tumour sequencing and analysis of genetic alterations. The NGS was performed with the ACTONCO panel. The extracted genomic DNA was amplified by a polymerase chain reaction to enrich the targeting coding exons of the analysed genes. For variant analysis, raw reads were mapped to the hg19 reference genome using the ION TORRENT SUITE version 5.2 software program (Thermo Fisher Scientific). Coverage depth was calculated using TORRENT COVERAGE ANALYSIS PLUG-IN. Single-nucleotide variants and short insertion/deletions were identified using the TORRENT VARIANT CALLER PLUG-IN (version 5.2; Thermo Fisher Scientific). The coverage was down-sampled to 4,000. The VARIANT EFFECT PREDICTOR version 88 software program (European Bioinformatics Institute, Cambridge, UK) was used to annotate variants using datasets from the Catalogue Of Somatic Mutations In Cancer (COSMIC) version 86 resource and the GENOME AGGREGATION database (version 2.0.2; MacArthur Lab, Boston, MA, USA). Variants with coverage ≥ 25 and an allele frequency of ≥ 10% were retained for further analysis. Variants reported in the GENOME AGGREGA-TION database with a minor allele frequency of > 1% were considered as polymorphisms. The in-house peripheral blood mononuclear cell database of ACT Genomics was used to determine technical errors.
WGCNA. The similarity among protein expression patterns for all protein pairs was calculated according to their pairwise Pearson's correlation coefficient, i.e. the similarity between proteins i and j was defined as (1 − r i,j )/2, where r i,j is the correlation of the protein expression patterns between the two proteins i and j. An adjacency matrix was then computed by increasing the similarity matrix up to the power of 10 to generate a co-expression network with scale-free properties. Subsequently, from the resultant scale-free co-expression network, we generated a topological overlap matrix (TOM) that considers topological similarities between a pair of proteins in the network. Using the dissimilarity according to TOM (1-TOM), we conducted hierarchical clustering to generate a tree that clustered proteins in its branches. Dynamic tree cutting was used to trim the branches to identify protein modules. A protein module was summarised by the top hub protein (referred to as eigen-protein) with the highest connectivity in the module. To identify the protein modules associated with clinical traits, we calculated the correlation coefficients between the eigen-proteins and clinical traits. WGCNA was conducted using a GARUDA PLATFORM GADGET (The Systems Biology Institute, Tokyo, Japan) that implemented the WGCNA pipeline based on the WGCNA R-package 17 .
Protein-protein interaction network construction. To construct a protein interaction network for a protein module, we used the STRING database (version 11.0) 22 , which accumulates information on proteinprotein interactions from various other databases such as IntAct, Reactome, DIP, BioGRID, MINT, KEGG, NCI/ Nature PID, The Interactive Fly, and BioCyc. STRING networks were constructed under the criteria for linkage only with experiments, databases, text mining, and co-expression using the default settings, i.e. a medium confidence score of 0.400, a network depth of 0, or 50 interactions. Subsequently, protein networks imported from the STRING database were visualised using CYTOSCAPE version 3.7.2. Functional enrichment results were obtained for canonical pathways considering p < 0.05 to be statistically significant.