Identification of early and intermediate biomarkers for ARDS mortality by multi-omic approaches

The lack of successful clinical trials in acute respiratory distress syndrome (ARDS) has highlighted the unmet need for biomarkers predicting ARDS mortality and for novel therapeutics to reduce ARDS mortality. We utilized a systems biology multi-“omics” approach to identify predictive biomarkers for ARDS mortality. Integrating analyses were designed to differentiate ARDS non-survivors and survivors (568 subjects, 27% overall 28-day mortality) using datasets derived from multiple ‘omics’ studies in a multi-institution ARDS cohort (54% European descent, 40% African descent). ‘Omics’ data was available for each subject and included genome-wide association studies (GWAS, n = 297), RNA sequencing (n = 93), DNA methylation data (n = 61), and selective proteomic network analysis (n = 240). Integration of available “omic” data identified a 9-gene set (TNPO1, NUP214, HDAC1, HNRNPA1, GATAD2A, FOSB, DDX17, PHF20, CREBBP) that differentiated ARDS survivors/non-survivors, results that were validated utilizing a longitudinal transcription dataset. Pathway analysis identified TP53-, HDAC1-, TGF-β-, and IL-6-signaling pathways to be associated with ARDS mortality. Predictive biomarker discovery identified transcription levels of the 9-gene set (AUC-0.83) and Day 7 angiopoietin 2 protein levels as potential candidate predictors of ARDS mortality (AUC-0.70). These results underscore the value of utilizing integrated “multi-omics” approaches in underpowered datasets from racially diverse ARDS subjects.

Acute respiratory distress syndrome (ARDS) is a systemic disease which despite increasing health care improvements continues to exhibit an excessive mortality rate of 30-40% 1 . The pathogenesis of ARDS is extremely heterogeneous, involves multiple inciting stimuli, and is influenced by multiple comorbidities and genetic factors. The clinical and biological ARDS heterogeneity contributes to the inability of current clinical severity scoring systems such as the APACHE II score 2 or the lung injury severity score 3 to accurately predict ARDS mortality 4 . As clinical factors alone are recognized to poorly predict ARDS outcomes, omics-based biomarkers 5 offer a potentially unbiased tool for ARDS mortality prediction and may potentially identify specific endotypes and potential therapeutic targets worthy of further investigation. Previous genome-wide association studies (GWAS) identified single nucleotide polymorphisms (SNPs) associated with ARDS susceptibility and mortality which have been summarized in several reviews 5,6 . Multiple transcriptomic, methylation or proteomic studies related to ARDS have been recently summarized 7,8 , however, these studies have largely focused on ARDS susceptibility and a single 'omics' approach. In contrast, the potential exists for a multi-omics approach to elucidate pathogenic pathways that enhance understanding of biological pathways and therapeutic target discovery 9 . Systems biology approaches provide a global map of the functional relationships between cellular entities such as genetics, genomics, methylation, and proteomics, and, in the specific case of ARDS, a framework to develop a comprehensive algorithm to identify the molecular patterns associated with ARDS mortality. These findings may guide the utilization of molecular signatures to inform clinical decision-making in ARDS and ultimately identify novel therapeutic targets.
To address the unmet need for novel ARDS biomarkers, an integrated "multi-omics" approach was utilized in a racially-diverse ARDS cohort encompassing individually-underpowered ARDS GWAS, RNA transcription, and DNA methylation datasets with a goal to identify early and intermediate predictive biomarkers of ARDS mortality. These analyses yielded a 9-gene set that differentiated ARDS survivors from non-survivors www.nature.com/scientificreports/ and prioritized a biologic pathway list which included the p53-, HDAC-, TGF-β-, and IL-6-signaling pathways, results validated utilizing a longitudinal transcription dataset. Predictive biomarker discovery revealed that both transcription levels of the 9-gene set (TNPO1, NUP214, HDAC1, HNRNPA1, GATAD2A, FOSB, DDX17, PHF20, CREBBP) and Day 7 angiopoietin-2 protein levels significantly predict ARDS mortality. The application of a multi-omics integrated prediction model for ARDS mortality may be of substantial utility in the design of future ARDS clinical trials.

Results
Study population characteristics. A diverse cohort of 568 subjects (mean age of 53 ± 15 years, male 55%, 40% self-reported Blacks) with an overall day 28 mortality rate of 27% was included in the analysis (Fig. 1). Table 1 contains the basic cohort characteristics stratified by each 'omic' approach utilized (with overlap). Supplemental Figure S1 depicts the Venn diagram of overlapping 'omics' data from the study population. The majority of cohort subjects had genotyping and protein data available.
Genome-wide SNP analysis and identification of genes associated with ARDS mortality. GWAS association analysis of SNPs failed to identify a SNP that passed genome wide multiple testing cor- Figure 1. Flowchart of total study enrollment and subjects included in each of the "omics" platforms. Analysis of potential ARDS plasma biomarkers. We interrogated 11 plasma proteins previously implicated as potential ARDS biomarkers 6 with levels measured on entry to ICU (baseline at Day 0) and at Day 7 for survivors. No baseline protein level was associated with ARDS mortality whereas Day 7 plasma measurements of Ang2 protein were significantly associated with ARDS mortality. Each 1 ng/mL increase in plasma Ang2 protein at Day 7, produced an 8% increase in odds of mortality (p = 4.7 × 10 -6 ). We included analysis for protein measured on baseline and Day 7 individually to avoid the correlation between measurements on the same subject. A full result of the logistic regression is shown in Supplemental Table S4.
Candidate gene analyses. While no candidate gene reached the statistical threshold defined by FDR adjusted p-value < 0.05 in any 'omics' analysis, several candidate genes exhibited raw p-values < 0.05. This included differential methylation of cg18537894 in ADIPOQ (raw p-value = 0.02), upregulated NAMPT gene expression in non-survivors compared to the survivors (log fold change = 1.41, raw p = 0.01) and upregulated EGLN1 gene expression in non-survivors compared to survivors (log fold change = 0.65, raw p = 0.03) with significant differential EGLN1 methylations (cg20682143; cg21875980; cg18979762, raw p = 0.02). No significant associations of FER, IL1B, or TNF were found in any 'omics' types. The results were summarized in Table 2.
Multi-omic integration and pathway analysis. Using the Network approach and the dense module search method 10 , 9082 nodes (genes) and 51,871 edges (protein-protein interactions or PPIs) were included in the final network resulting in a total of 5854 modules identified. The top 0.1% of modules included six mod- www.nature.com/scientificreports/ ules and 25 unique genes with 9 of 25 unique genes also differentially methylated (raw p-value < 0.05): TNPO1, NUP214, HDAC1, HNRNPA1, GATAD2A, FOSB, DDX17, PHF20, and CREBBP. The results of these 9 genes in each omic type are shown in Supplemental Table S5. Further evaluation of this 9 differentiating gene-set, revealed 87 pathways with a q-value < 0.05 with the top unique pathways consisting of 'HDAC class I signaling' , 'TGF-β signaling pathway' , 'IL-6 signaling pathway' , 'Chromatin-modifying enzymes' , 'Sumoylation by ranbp2 regulates transcriptional repression' and 'Transcriptional regulation by TP53' . These pathways are listed in Table 3 and Fig. 2 and the full list is provided in Supplemental Table S6. Table 3. Significant pathways associated with ARDS mortality using the Network approach. a Numbers of genes in the pathway are from the 9 genes identified through the Network approach. www.nature.com/scientificreports/ Using the Overlap gene approach, we identified an overlap gene list consisting of ten genes (OCIAD2, MLST8, TOP3A, ZC3H8, C4orf3, PIP5K1A, PLD5, TNFRSF19, C14orf93, RUNX3) with the seven most significant pathways (q-value < 0.05) being 'RNA polymerase II transcription' , 'Regulation of TP53 activity' , 'PIP3 activates AKT signaling' , 'Intracellular signaling by second messengers' , 'Generic transcription pathway' , and 'Transcriptional regulation by TP53' (Table 4). Thus, there was a significant overlap between pathways (but not genes) identified utilizing both the Overlap and Network pathway approaches.

Validation using longitudinal transcription analysis. Temporal gene/pathway expression was signifi-
cantly different between ARDS survivors and non-survivors in five out of the six tested pathways ("sumoylation by ranbp2 regulates transcriptional repression" the exception). The Transcriptional Regulation by TP53 pathway was the most significant pathway (adjusted p-value = 6.84e−20) with median gene set expression (median of standardized gene expression) elevated at baseline, remaining elevated at Day 14 in non-survivors. In ARDS survivors, the median gene set expression was lower than non-survivors at baseline, elevated at Day 7 but returned to baseline values by Day 14. In addition, the patterns of pathway expression were reproducibly homogeneous in ARDS survivors when compared to non-survivors. Statistical differences in survivor and non-survivor patterns and in median gene expression changes over time are shown in Fig. 3. Table S5 and Table 2 summarize the overall results (detail result in each molecular level) for the multi-omics analysis with p-values for novel genes identified through the multi-omics approach and for candidate genes selected from the publicly available literature. The transcription prediction model included the nine genes in the logistic regression model with the derived AUC from the testing set (baseline RNA microarray) at 0.83 (90% C.I. 0.62-1.00). For the protein prediction model, the AUC of the prediction model, using baseline protein levels as predictors, was poor (< 0.5). However, when Day 7 proteins were used as candidate predictors, Ang-2 and IL-1R2 were predictors for ARDS mortality through the stepwise method with derived AUC of 0.72 (90% C.I. 0.52-0.91). We next focused on 172 subjects with both Day 7 Ang2 and IL-1R2 protein measurements and assessed AUCs of Ang2 and IL-1R2 which demonstrated that Ang2 alone (AUC 0.71, 90% C.I. 0.51-0.91) exhibited a higher AUC compared to IL-1R2 alone and was comparable to both Ang2 and IL-1R2 as predictors (Fig. 4A). Figure 4B shows the trend of Ang2 protein level changes in survivor and non-survivor groups with median Ang2 concentrations similar at baseline between the two groups. In contrast, Day 7 Ang2 levels decreased in survivors but remained elevated in non-survivors with the optimized cutoff point for ARDS mortality prediction shown in Fig. 4C. Using a cutoff point of 12.1 ng/ml, the sensitivity and specificity of Ang2 as a test for ARDS mortality were 0.45 and 0.89, respectively. The AUC of the validation cohort ( Fig. 4D) consisting of 93 subjects with Ang2 measurement was 0.70 (90% C.I. 0.57-0.83).

Discussion
The present study utilized multi-omic approaches to identify potential early and intermediate predictive indicators of and contributors to ARDS mortality which include: (i) a 9-gene expression set, (ii) plasma Ang2 protein levels, and iii) the p53-, TGFβ-, and IL-6 signaling pathways. Through the integration of genetics, transcriptomics, methylation, and protein-protein interaction, we identified novel genetic networks that potentially contribute to the pathogenesis of ARDS mortality. These genes and pathways were validated using longitudinal transcriptional data further strengthening the prospects of their involvement in ARDS pathobiology and mortality. A strength of our analyses was the utilization of a racially-and gender-diverse ARDS cohort exhibiting multiple ARDS-inciting causes. Whereas traditional approaches have individually evaluated each 'omic" dataset (genotype, transcription, methylation, protein), we chose to integrate underpowered genotyping, microarray, and sequencing datasets to embrace the relationship between different omics largely lost by traditional approaches. Genotypes and methylation data may alter gene transcription and result in altered protein expression and, therefore, may miss important biological information. Our systems biology studies indicate that combining several small, albeit woefully underpowered 'omic' studies, may increase the power of identifying significant and important biological pathways and biomarkers that can predict/contribute to ARDS mortality.
At the single gene level, nine genes were associated with ARDS mortality through a systems biology approach including HDAC1, encoding a histone deacetylase (HDAC) previously linked to acute inflammatory disorders [11][12][13] including acute lung injury with HDAC inhibitors shown to prevent endothelial hyperpermeability 14 , the major Table 4. Significant pathways associated with ARDS mortality using the Overlap gene approach. a Numbers of genes in the pathway are from the 10 genes identified through the Overlap gene approach. www.nature.com/scientificreports/ pathophysiological defect in acute lung injury [15][16][17] . HDAC inhibitors are often included in lists of potential therapeutic ARDS approaches 18 . Another 9-gene set member is HNRNPA1 which encodes heterogeneous nuclear ribonucleoprotein A, a recognized regulator of alternative splicing processes including alternate splicing of MYLK 19 , a well-described ARDS candidate gene 20 and via its encoded protein, non-muscle myosin light chain kinase isoform (nmMLCK), a key regulator of lung vascular integrity 21 . FOSB is involved in ventilator-induced lung injury (VILI) via evoked AP-1 binding and gene expression 22 , increased ROS burden, and VILI outcomes 23 . Transportin 1 (TNPO1) is a protein involved in importing proteins into the nucleus with prominent targets being RNA-binding proteins including HNRNPA1, the same gene identified by our 'omics' approach 24 . The PHF20 (PHF20) protein regulates p53 signaling and epigenetic regulation by chromatin regulation/histone deacetylases 25 , as does CREBBP protein a histone acetyltransferase and transcriptional coactivator 26 , findings in sync with the 9-gene set, DNA methylation, and protein datasets we analyzed. Less information is available for proteins encoded by GATAD2A, DDX17, NUP214 although they appear to share estrogen and androgen receptor regulatory properties [27][28][29][30] . Ang2 (angiopoietin 2) is a ligand of endothelial receptor Tie2 and a key mediator of pulmonary vascular permeability and Day 7 Ang2 protein levels were an intermediate biomarker for ARDS mortality, a finding consistent with a recent meta-analysis that higher plasma Ang2 levels are predictive of mortality in both ARDS and at-risk patients [31][32][33][34] . Our studies showed that Day 7 Ang2 has a better predictive value compared to baseline Ang2 which implies that Ang2 may respond to endothelial damages from ARDS in a later phase. In addition to outlining the relationship between mortality and the level of Ang2 at different stages, our study further provides a cutoff point of the Ang2 concentration for mortality prediction. We identified that Day 7 Ang2 level of 12.1 ng/mL can www.nature.com/scientificreports/ be used as a highly specific tool for the ARDS mortality and may offer potentially clinically useful information and promote therapies that target Ang2. IL-1R2 (IL-1 Receptor Type 2), which regulates the proinflammatory activities of IL-1β, is an ARDS biomarker 35 with IL1R2 gene expression upregulated in ARDS patients compared to controls. We found plasma IL1R2 protein levels between Days 7 and 14 higher in ARDS non-survivors compared to survivors, a trend not seen at days 0 to 3. The results support our findings that in addition to Ang2, Day 7 IL-1R2 levels, but not baseline IL1R2 levels, is a good biomarker for ARDS mortality prediction. At the pathway level, five of six pathways identified by our multi-omic approaches exhibited similar expression patterns over time and significantly distinguished ARDS survivors from non-survivors. The pathway expression patterns in survivors were low on Day 0 (baseline), while elevated at Day 7 and returned to baseline values on Day 14. In contrast, pathway expression in non-survivors was elevated at Day 0 compared to survivors, continued to rise at Day 7, and remained elevated at Day 14, results that suggest that activation these of pathways contributes to ARDS mortality. The TP53 pathway was the most significant pathway identified (via both Network and Overlap approaches) and centers on the functional strong anti-inflammatory role of p53 as a tumor suppressor protein. Inflammation involving NFkB activation reduces p53 function and p53 activation is an inflammationsuppressing response to reduce unchecked vascular leakage 36 . In contrast, HDAC pathway 18 and TGF-β activation may increase endothelial permeability and play important role in lung fluid balance in ARDS 37,38 . We also found genes in the well-known pro-inflammatory IL-6 pathway 39 , were highly expressed in ARDS non-survivors throughout the whole 14 days, a finding in sync with studies demonstrating persistently elevated plasma or BAL IL-6 levels in ARDS non-survivors 40 predicting poor ARDS outcomes (e.g., prolonged mechanical ventilation, organ dysfunction, and mortality) [41][42][43][44][45] . Taken together, except for ANGPT2 (the protein name is Ang2), none of the candidate genes (IL1R2, ADIPOQ, NAMPT, EGLN1, FER, IL1B, TNF, IL-6, IL-8 (CXCL8), VEGF (VEGFA), MIF, S1PR3, or HMGB1) were significant in our multi-omic analysis of ARDS mortality. The findings are not surprising since mortality is a complicated biologic process and is not highly unlikely to be driven by a single gene. Nevertheless, our pathway analysis implicated several candidate genes in pathways mortality affecting ARDS such as HDAC-and IL-6-signal pathways.
Clearly, our study has several limitations, the foremost limitation being that despite an excellent number of ARDS subjects, few subjects had available data from more than three omic types and limited time points www.nature.com/scientificreports/ availability, thereby straining the ability to expand the longitudinal analysis. As few subjects had multiple omics measured at the same time, this limited the robustness of our analytical methodologies. For example, a more complete and integrative predictive model using all omics data simultaneously cannot be applied in our dataset. In addition, our mortality outcome is a categorical outcome because we did not collect the day of the death which may reduce our power and the information derived from the analysis. An inherent limitation is that each "omic" data approach has advantages and disadvantages depending on study aims. For example, GWAS genotypes are easily measured and offer disease susceptibility information but cannot reflect the influence of environmental factors and therefore are limited in ARDS mortality associations. Gene expression and DNA methylation together reflect both genetic and environmental factors and are good 'omic' types for mechanism investigation, however, these require more sophisticated sample processing to obtain the data and due to high cost, have much smaller datasets. Protein biomarkers analytics is a good omic platform, however, as protein levels are affected by genetics, transcriptomics, methylation, and post-translational modification, it is difficult to explore the underlying biological mechanism for ARDS mortality with this platform alone. Additional limitations of our work were the diversity in platforms utilized for our genetic, genomic, and methylation studies including the varying time horizons with limited overlapping specimens, decreasing our statistical power to identify predictive biomarkers. Clearly, future well-designed prospective validation cohorts and timely sampling and analysis will obviate a number of these limitations. In summary, a well-characterized, diverse ARDS cohort was utilized to assess the potential for a unique multi-omics approach to address the unmet need for tools that allow the stratification of ARDS subjects for clinical trials. This "multi-omics" approach, integrating several 'omics' data types from individually underpowered ARDS GWAS, RNA transcription, and DNA methylation datasets, identified a 9-gene set and specific signaling pathways that differentiated ARDS survivors from non-survivors, results validated utilizing a longitudinal transcription dataset. The prioritized list of genes and p53-, HDAC-, TGF-β-, and IL-6-signaling pathways exhibit high biological plausibility to serve as early and intermediate predictive biomarkers and pathways for ARDS mortality. Although further studies are needed, these results suggest that the additional application of a multiomics integrated prediction model for ARDS mortality may be of utility in future ARDS clinical trial design.  48 . Enrollment included subjects admitted to the intensive care units with confirmed ARDS. Written consent was obtained for all the participants. Among the 749 subjects, a total of 568 ARDS subjects with mortality information (mortality at day 28) and with at least one type of 'omics' data available were included in the final analysis (Fig. 1).

Study population.
Genome-wide genotyping. Genome-wide genotyping was performed using two platforms for two distinct populations based on race: Affymetrix SNP 6.0 (Thermo Fisher Scientific, Santa Clara, CA) for the European-origin populations and Affymetrix Pan-African array (Thermo Fisher Scientific, Santa Clara, CA) for the African descent individuals. Subjects with a call rate of less than 98% were removed from the analysis. SNPs were removed if failing the Hardy-Weinberg equilibrium test (p-value < 10 -6 ) or a call rate of less than 98%. After the quality control steps, a total of 297 ARDS subjects: 80 European-origin subjects with 905,043 SNPs, and 217 African Americans with 146,595 SNPs, were included in the analysis.
Genome-wide transcription profiling. Genome-wide transcription data were obtained using two gene expression platforms, RNA microarrays, and sequencing. RNA was available for a total of 93 participants with a total of 136 samples at different time points. Among the 93 participants, 7 participants had two baseline RNA samples (sent for both RNA-sequencing and RNA-microarray), 8 participants had RNA samples on Day 7 after enrollment and 28 participants had RNA samples on Day 14 after enrollment. Peripheral blood mononuclear cell (PBMC) isolation was performed using the Ficoll-Paque™ (Sigma-Aldrich) method 49 . Total RNA was isolated from PBMCs using RNAeasy MiniKit Qiagen™ following the manufacturer's protocol. RNA concentration and quality (RIN > 7) were assayed by Nanodrop™ (Thermo Fisher) and 2100 Bioanalyzer RNA™ (Agilent). A total of 52 baseline samples, 8 Day 7 samples, and 28 Day 14 samples were sent for transcription profiling using RNA microarray Affymetrix GeneChip Human Gene 2.0 ST Array (Thermo Fisher). A detailed protocol and quality control procedure for RNA microarray profiling has been previously reported 50 . The gene expression levels were normalized with a log transformation. After quality control, a total of 17,641 probes were included in the downstream analysis.
Samples of 1-2 μg total RNA from a cohort of 48 ARDS patients were sent for transcription profiling via RNA sequencing with the RNA concentration and quality (RIN > 7) assessed by Nanodrop™ (Thermo Fisher) and corroborated by Bioanalyzer™ (Agilent). To prepare the sequencing library, total RNA was enriched by oligo (dT) magnetic beads (rRNA removed). The RNA-seq library preparation was performed using KAPA Stranded RNA-Seq Library Prep Kit (Illumina). The completed libraries were qualified with Agilent 2100 Bioanalyzer and quantified by the absolute quantification qPCR method. To sequence the libraries on the Illumina HiSeq 4000 instrument, the barcoded libraries were captured on the Illumina flow cell, amplified in situ, and subsequently sequenced for 150 cycles for both ends on the Illumina HiSeq instrument. R package Tximport was used to www.nature.com/scientificreports/ transform the raw counts from the transcripts into gene counts. Genes with low expression defined by low count (< 6) were removed, a total of 18,652 genes were included in the final analysis.
Genome-wide DNA methylation profiling. High-quality DNA from PBMCs was available from 61 ARDS subjects for DNA methylation analysis as previously described 51,52 . Additional protocol details are available on the manufacture's website (Bisulfite conversion Qiagen, hybridization, and Infinium Methylation Assay Illumina). CpG sites were quantitatively assessed utilizing two Methylation array platforms. The initial 18 subjects were assessed by HumanMethylation 450 K BeadChip (Illumina). Subsequently, DNA from a cohort of 43 ARDS subjects was assessed utilizing the MethylationEPIC 850 K BeadChip array (Illumina). Poor quality samples were excluded using a detection p-value cutoff greater than 0.05 using the R package minifi. After quality controls, data from the two assays involving 61 subjects were pooled for analysis. Methylation levels were represented as M-values which were calculated as log 2 (methylated/unmethylated) and were used in downstream analyses.

Plasma protein levels.
Plasma biomarker studies were focused on 11 selected proteins: eNAMPT, IL-1β, IL-1R2, IL-6, IL-8, VEGF, Ang2, MIF, S1PR3, RAGE, and HMGB1 with plasma levels measured using Bio-Rad (Hercules, CA), detailed in prior reports 53,54 . In brief, whole blood was collected in EDTA-treated tubes, centrifuged within 1 h from sample collection (2000 × g for 20 min, RCF), and stored at − 80 °C. Plasma concentrations of three biomarkers (IL-6, IL-8, IL-1β) were measured in duplicate using a custom Bio-Plex Pro Human Cytokine 5-plex immunoassay (Bio-Rad, Hercules, CA) and Bio-Plex MAGPIX instrument following the manufacturer's guidelines. Enzyme-linked immunosorbent assay (ELISA) techniques were utilized to quantify plasma levels of eNAMPT (an internally developed ELISA) 54 , MIF (R&D System®, Minneapolis, MN), and Ang-2 (R&D System ® , Minneapolis, MN) using commercially-available ELISAs according to the manufacturer's instructions as previously described 53 . Undetectable protein levels were assigned as zero. A total of 240 subjects with at least one protein level measured were included in our analysis.
Analytical approach and strategies. We first performed the analysis to identify the biomarkers that can differentiate ARDS non-survivor vs. survivor in each single omics platform independently. Second, we integrated multiple omics types to identify important candidate biomarkers and performed a pathway analysis. Finally, we developed a prediction model of ARDS mortality. An overview of the analysis flow is shown in Supplemental Figure S2.
Genome-wide association analysis. SNP level analysis was performed using PLINK 55 with a separate analysis of the African-Descent and European-Descent cohorts. First, the principal components (PCs) for population stratification were calculated and included in the first two PCs in the additive regression model. After obtaining individual p-values for each SNP, a gene-based analysis was performed using the method of versatile gene-based association study-2 (VEGAS2) to incorporate genes and accounts for linkage disequilibrium between markers by using simulations from the multivariate normal distribution 56 . To map identified SNPs to their related gene, the 1000 K Genome with African and European datasets were used as reference panels and the gene region was defined as + /− 50 k base pairs. Due to the distinct ethnicity and different platforms, gene-based analyses were conducted in each ethnicity independently and a meta-analysis was then performed to combine the p-values using Fisher's method.
Differential expression gene analysis. Differential expression gene (DEG) analysis was performed using the RNA sequencing data and the R package limma 57 . We normalized the raw RNA sequencing count data using variance stabilization transformation (VST) 58 . The significant DEG level was defined as the false-discovery rate (FDR) adjusted p < 0.05.

Differential methylation analysis. Combining data from the 450 K BeadChip and MethylationEPIC
BeadChip (Illumina) was used to stratify quantiles for data normalization. Probes were filtered for removal if failing in one or more samples or on the sex chromosome. The methylation data were normalized using the subset quantile normalization 59 . M values were obtained for the probe-wised differential analysis. The analysis was conducted using the R package minifi 60 . The threshold for the significance was the FDR adjusted P < 0.05.

Plasma protein analysis.
Logistic regression analysis was used to assess the association between protein levels and ARDS mortality. Age and gender were included in the model as covariates. Associations were tested using the selected 11 protein levels 6 measured on baseline and Day 7 respectively instead of including both baseline and Day 7 protein levels to avoid the correlation between these two measures in the same individual.
Candidate gene analysis. In addition to the genome-wide approach mentioned above, individual candidate genes, selected from previous publications 6  www.nature.com/scientificreports/ each approach were then compared and overlapping genes/pathways were identified that were associated with the ARDS mortality.
Network approach. The method of edge-weighted dense module search was first used to identify potential genes associated with ARDS mortality 10 . The R package, dmGWAS_3.0, was used to build the network by combining GWAS, protein-protein interactions (PPIs), and gene expression results. Network nodal weights were derived based on the p-value of each gene calculated through VEGAS2 (see genome-wide association analysis section). Edge weights (change of gene co-expression between death and alive) were obtained using the RNAsequencing gene expression data. Genes within the top modules (0.1%) and also differentially methylated were selected (raw p < 0.05) and used for downstream pathway analysis using ConsensusPathDB 63 . Significant pathways were defined as q-value < 0.05.
Overlap gene approach. A loosened significance threshold (raw p < 0.05) was used to generate gene lists from each 'omic' including GWAS, RNA sequencing, and DNA methylation. Overlapping genes among these generated gene lists were used for downstream pathway analysis using ConsensusPathDB 63 . Significant pathways were defined as q-value < 0.05.

Validation using longitudinal transcription analysis. Longitudinal transcription analysis of all genes
within the specific pathway was performed to validate identified overlapping pathways through the Network and Overlap pathway approaches. The top significant pathways identified through the network approach were validated using the longitudinal transcription cohort (microarrays gene expression-derived data at baseline, Day 7, and Day 14). The longitudinal gene expression change over the 14 days was tested using the variance component score test 64 using the R package TcGSA. This method is a linear mixed-effect model to account for repeat measurements from the same participants and a time variable is included. The pattern of gene expression changes in pathways differentially between ARDS non-survivors vs. survivors was tested with the null hypothesis that the patterns are the same between groups. Due to the small sample size (88 measurements), 1000 permutations were performed to calculate the p-value. The threshold for a significant difference in the patterns was set as an adjusted p-value < 0.05.
Prediction model development. The transcription prediction model was developed using a logistic regression model with the genes identified (Table 2) through the network approach as biomarkers. We divided the baseline RNA microarray cohort into training and testing sets at a 7:3 ratio. The model was tested using the testing set and area under the receiver-operating characteristic curve (AUC). Each AUC with a 90% confidence interval was calculated. A good prediction model was defined as a model with AUC > 0.70 with a lower limit of confidence interval > 0.50.
For the protein prediction model, to have sufficient samples for the training, testing, and validation cohort, proteins were excluded if measured in less than 150 subjects. The remaining seven proteins including eNAMPT, IL-1β, IL-1R2, IL-8, Ang2, MIF, and S1PR3 were included as candidate biomarkers for the prediction model. We used a logistic regression model with the stepwise method. The chosen significance level for entry (SLE) and the chosen significance level for stay (SLS) was set at 0.25. We divided the randomly selected 190 subjects into training and testing sets at a 7:3 ratio respectively. The remaining 100 subjects were reserved as a validation cohort. We developed the models using baseline protein levels and Day 7 protein levels independently and then validated them. Each AUC with a 90% confidence interval was calculated. A good prediction model was defined as a model with AUC > 0.70 with a lower limit of confidence interval > 0.50. After the model development, we used the R package cutpointr to find the optimized threshold of the protein level to predict ARDS mortality.

Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request and will also be deposited in the Gene Expression Omnibus (GEO176529).