Mortality prediction in sepsis via gene expression analysis: a community approach

Improved risk stratification and prognosis in sepsis is a critical unmet need. Clinical severity scores and available assays such as blood lactate reflect global illness severity with suboptimal performance, and do not specifically reveal the underlying dysregulation of sepsis. Here three scientific groups were invited to independently generate prognostic models for 30-day mortality using 12 discovery cohorts (N=650) containing transcriptomic data collected from primarily community-onset sepsis patients. Predictive performance was validated in 5 cohorts of community-onset sepsis patients (N=189) in which the models showed summary AUROCs ranging from 0.765-0.89. Similar performance was observed in 4 cohorts of hospital-acquired sepsis (N=282). Combining the new gene-expression-based prognostic models with prior clinical severity scores led to significant improvement in prediction of 30-day mortality (p<0.01). These models provide an opportunity to develop molecular bedside tests that may improve risk stratification and mortality prediction in patients with sepsis, improving both resource allocation and prognostic enrichment in clinical trials.


Introduction
Sepsis, recently defined as organ dysfunction caused by a dysregulated host response to infection 1 , contributes to half of all in-hospital deaths in the US and is the leading cost for the US healthcare system 2,3 . Although in-hospital sepsis outcomes have improved over the last decade with standardized sepsis care, mortality rates remain high (10-35%) 4 . Sepsis treatment still focuses on general management strategies including source control, antibiotics, and supportive care. Despite dozens of clinical trials, no treatment specific for sepsis has been successfully utilized in clinical practice 5 . Two consensus papers suggest that continued failure of proposed sepsis therapies is due to substantial patient heterogeneity in the sepsis syndrome and a lack of tools to accurately categorize sepsis at the molecular level 5,6 . Current tools for risk stratification include clinical severity scores such as APACHE or SOFA as well as blood lactate levels. While these measures assess overall illness severity, they do not adequately quantify the patient's dysregulated response to the infection and therefore fail to achieve the personalization necessary to improve sepsis care 7 .
A molecular definition of the severity of the host response in sepsis would provide several benefits. First, improved accuracy in sepsis prognosis would improve clinical care through appropriate matching of patients with resources: the very sick can be diverted to ICU for maximal intervention, while patients predicted to have a better outcome may be safely watched in the hospital ward or discharged early. Second, more-precise estimates of prognosis would allow for better discussions regarding patient preferences and the utility of aggressive interventions. Third, better molecular phenotyping of sepsis patients has the potential to improve clinical trials through both (1) patient selection and prognostic enrichment for drugs and interventions (e.g., excluding patients predicted to have good vs. bad outcomes), and (2) better assessments of observed-to-expected ratios for mortality 5,6 . Finally, as a direct quantitative measure of the dysregulation of the host response, molecular biomarkers could potentially help form a quantitative diagnosis of sepsis as distinct from non-septic acute infections 8,9 . Thus, overall, a quantitative test for sepsis could be a significant asset to clinicians if deployed as a rapid assay.
Quantitative evaluation of host response profiles based on these observations have been validated prospectively to show specific outcomes 12,13 , but none have yet been translated into clinical practice. Still, the availability of high-dimensional transcriptomic data from these accumulated studies has created unprecedented opportunities to address questions across heterogeneous representations of sepsis (different ages, pathogens, and patient types) that could not be answered by any individual cohort.
Transcription-based modeling has been deployed across many diseases to improve prognostic accuracy. These are typically developed in a method-specific manner using data collected from single cohorts. As a result, prognostic models often lack the generalizability that is necessary to confer utility in clinical applications 18 . In contrast, community modeling approaches (where multiple groups create models using the same training data) can provide an opportunity to explicitly evaluate predictive performance across a diverse collection of prognostic models sampled from across a broad solution space [19][20][21][22]23 . Here, we systematically identified a large collection of both public and privately-held gene expression data from clinical sepsis studies at the time of sepsis diagnosis. Three scientific groups were then invited to build models to predict 30-day mortality based on gene expression profiles. These three groups produced four different prognostic models, which were then validated in external validation cohorts composed of patients with either community-acquired sepsis or hospital-acquired infections (HAIs).

Systematic Search
Two public gene expression repositories (NCBI GEO 24 , EMBL-EBI ArrayExpress 25 ) were searched for all clinical gene expression microarray or next-generation sequencing (NGS / RNAseq) datasets that matched any of the following search terms: sepsis, SIRS, trauma, shock, surgery, infection, pneumonia, critical, ICU, inflammatory, nosocomial. Clinical studies of acute infection and/or sepsis using whole blood were retained. Datasets that utilized endotoxin or LPS infusion as a model for inflammation or sepsis were excluded. Datasets derived from sorted cells (e.g., monocytes, neutrophils) were also excluded.
Overall, 16 studies containing 17 different cohorts were included (Table 1a-b). These 16   studies include expression profiles from both adult 13,15,17,[26][27][28][29][30][31][32][33][34][35] and pediatric 31,[36][37][38][39] cohorts. In these cases, the gene expression data were publicly available. When mortality and severity phenotypes were unavailable in the public data, the data contributors were contacted for this information. This included datasets E-MTAB-1548 11,40 , GSE10474 27 13 , used the same inclusion/exclusion criteria, and were processed on the same microarray type. Thus, after re-normalizing from raw data, we used ComBat normalization 41 to co-normalize these two cohorts into a single cohort, which we refer to as E-MTAB-4421.51. For this study, data were included only for patients sampled on the day of hospital admission. In addition to the above 17 datasets, we identified four additional privatelyheld datasets (Table 1c) representing patients with HAI. In-depth summaries of each HAI cohort can be found in the supplementary text.
We selected cohorts as either discovery or validation based on their availability. Studies for which outcome data was readily available were included as discovery cohorts. Only GSE54514 15 was initially held out for validation given its large size and representative patient characteristics. After we had trained the models some outcomes data became newly available, so so these were added as validation cohorts 13,[33][34][35] . Additionally, given the known differences in sepsis pathophysiology and gene expression profiles as compared to patients with communityacquired sepsis 39,42 , the HAI datasets were set aside as a second validation cohort. The validation cohorts were not matched to the discovery cohort on any particular criteria but rather provide a validation opportunity across a heterogeneous range of clinical scenarios.

Gene Expression Normalization
All Affymetrix datasets were downloaded as CEL files and re-normalized using the gcRMA method (R package affy 43 ). Output from other array types were normal-exponential background corrected and then between-arrays quantile normalized (R package limma 44 ). For all gene analyses, the mean of probes for common genes was set as the gene expression level. All probe-to-gene mappings were downloaded from GEO from the most current SOFT files.
Two of the cohorts, CAPSOD 17 and the Duke HAI cohort, were assayed via NGS. For compatibility with micro-array studies, expression from NGS data sets were downloaded as counts per million total reads (CPM) and were normalized using a weighted linear regression model using the voom method 45 (R package limma 44 ). The estimated precision weights of each observation were then multiplied with the corresponding log2(CPM) to yield final gene expression values.

Prediction Models
Prediction models were built by comparing patients who died within 30 days of hospital admission with sepsis to patients who did not. In the CAPSOD dataset (which was used in model training) we excluded two patients with unclear mortality outcomes, and one patient who died inhospital but after 30 days. Mortality was modeled as a binary variable as since time-to-event data were not available. Overall, a total of four prognostic models were built by three different academic groups (Duke University, Sage Bionetworks, and Stanford University). All four models started with the same gene expression data in the discovery phase. Each model was built in two phases: a feature selection phase based on statistical thresholds for differential gene expression across all discovery cohorts, and then a model construction phase optimizing classification power. Full descriptions of the four models can be found in the supplementary text, and in Supplementary Figures 1-3.

Comparison with severity scores
We compared the prognostic accuracy of the gene scores with the prognostic accuracy of clinical severity scores (APACHE II, PELOD, PRISM, SAPS II, SOFA, and the Denver score) where such information was available. These clinical severity scores were not necessarily built to predict mortality in the specific populations in which they were used here, but nonetheless serve as important comparators for the gene expression models. To compare prognostic power, logistic regression was performed to predict mortality using either the clinical severity score or the given gene model's output score. We then tested a combined model (mortality as a function of clinical severity and gene score, without interaction term) and measured the AUROC of the combined model. Comparisons were made between AUROCs with paired t-tests.

Discriminatory Power Analyses
We examined class discriminatory power for separating survivors from non-survivors using receiver operating characteristic (ROC) curves of the gene scores within datasets. The area under the ROC curves (AUROC) was calculated using the trapezoidal method. Summary ROC curves were calculated via the method of Kester and Buntinx 46,47 . We examined the ability of the models to predict non-survivors using precision-recall curves generated from the gene scores in each examined dataset. Precision-recall curves of the gene scores were constructed within datasets, and the area under the precision-recall curve (AUPRC) was calculated using the trapezoidal method.

Enrichment Analysis
We conducted two analyses to evaluate the functional enrichment of the genes selected as predictors by the four models. This included a targeted enrichment analysis for cell types as previously described 39 and an exploratory enrichment analysis that assessed a large number of functionally annotated gene sets.
In a mixed tissue such as blood, shifts in gene expression can be caused by changes in cell type distribution. To check for this effect, we used gene expression profiles derived from known sorted cell types to determine whether a given set of genes is enriched for genes represented in a specific cell type. In each curated cell type vector, a 'score' is calculated by the geometric mean of the upregulated genes minus the geometric mean of the downregulated genes.
A higher 'score' represents a greater presence of the given cell type in the differential gene expression signature.
For exploratory enrichment, we curated thousands of gene sets from two widely-used databases: gene ontology (GO) 48 and the Reactome database of pathways and reactions in human biology 49,50 . Our 12 discovery cohorts had approximately 6,000 genes in common, which formed a 'background' set of genes. We removed all genes not in the background genes from the Reactome/GO sets. We then retained all Reactome/GO gene sets containing at least 10% and at least 3 genes overlapping with the predictor genes. The remaining Reactome/GO gene sets were removed to reduce the multiple testing burden. Fisher's Exact test was used to test enrichment in each of the curated reference gene sets. Both nominal and Benjamini-Hochberg-corrected significance were tested.

Statistics and R
All computation and calculations were carried out in the R language for statistical computing (version 3.2.0) and Matlab R 2016a (The MathWorks, Inc.). Significance levels for pvalues were set at 0.05 and analyses were two-tailed.

Analysis Overview
We used a community approach to build gene-expression-based models predictive of sepsis-induced mortality using all available gene expression datasets (21 total cohorts, Table 1).
In this community approach, three different teams (Duke University, Sage Bionetworks, and Stanford University) performed separate analyses using the same input data; we thus sampled the possible model space to determine whether output performance is a function of analytical approaches ( Figure 1). Two models (Duke and Stanford) used parameter-free difference-ofmeans formulae for integrating gene expression, and the other two models (both from Sage Bionetworks) used penalized logistic regression (LR) 51  Each of the four models was trained using 12 discovery cohorts (485 survivors and 157 non-survivors) composed primarily of patients with community-acquired sepsis. Performance was evaluated across two groups of heterogeneous validation data sets (5 community-acquired sepsis cohorts with 161 survivors and 28 non-survivors and 4 HAI cohorts with 258 survivors and 24 non-survivors, Table 1). The community-acquired sepsis and HAI cohorts were considered separately in validation because of their known differences in host-response profiles.
Due to the nature of public datasets, we had limited information on demographics, infection, severity and treatment and so these variables were not controlled for in model selection. The cohorts included patients from multiple age groups, countries, and hospital wards (emergency department, hospital ward, medical ICU, and surgical ICU). As expected in varied patient populations, mortality rates varied widely across cohorts (mean 23.2% ± 13.4%).

Prognostic Power Assessments
Model performance was primarily evaluated using ROC analysis separately in the discovery, validation, and HAI cohorts. Boxplots of the AUROCs for each model are shown in  Table 4). The AUPRCs for nonsurvivor prediction were notably lower than the AUROCs, suggesting that the models' primary utility may be in ruling out mortality for individuals much less likely to die within 30 days (those less likely to require substantial intervention) as opposed to accurately identifying the minority of patients who are highly likely to die within 30 days. On the contrary the AUPRC of the ensemble model was averaged at 0.428 in validation cohorts (Supplementary Table 3), suggesting complementarity in discriminatory power between individual models.

Comparison to Standard Predictors
We next assessed whether the performance of these gene expression-based predictors of mortality outperformed standard clinical severity scores. Notably, clinical measures of severity were only available in a subset of cohorts (8 discovery, 3 validation, 3 HAI; Supplemental Table   5). The mean differences in the AUROCs of the gene model over clinical severity scores were: Duke -0.044; Sage LR 0.010; Sage RF 0.094; Stanford 0.064; only the Stanford model trended towards significance (paired t-test p = 0.098). However, we combined gene models and clinical severity scores into joint predictors, and each combination significantly outperformed clinical severity scores alone (all paired t-tests p ≤ 0.01). This suggests that the gene expression-based predictors add significant prognostic utility to standard clinical metrics.

Comparison Across Models
We next studied whether models were correctly classifying the same patients or whether each model was correctly classifying different groups of patients. We tested model correlations across all patients by comparing the relative ranks of each patient within each model instead of comparing raw model scores. We found the models were moderately correlated (Spearman rho = 0.35 -0.61, Supplemental Figure 6). We then evaluated the agreement between the four models by comparing model-specific patient classifications (Supplementary Table 6). For this purpose, we chose cutoffs for each model that yielded 90% sensitivities for non-survivors. We then labeled patients as being either always misclassified, correctly classified by 1 or 2 models (no consensus), or correctly classified in at least 3 of 4 models (consensus). As expected by the 90% sensitivity threshold, 10% of patients were misclassified by all models. In the remaining cases, 63% were correctly predicted by consensus and 27% do not reach consensus. Together, the model correlation and consensus analyses showed that 73% of patients were classified by at least one model, with variance leading to discordance in the remaining 27%.

Biology of the Gene Signatures of Mortality
Gene predictors were chosen for both optimized prognostic power and sparsity in our data-driven approach and so do not necessarily represent key nodes in the pathophysiology of sepsis. Still, we examined whether interesting biology was being represented in the models. We first looked for overlap in the gene sets used for prediction across the four models, but found few genes in common ( Table 2). Since each signature had too few genes for robust analysis, we analyzed the genes from all four models in aggregate, resulting in 58 total genes (31 up-regulated and 27 down-regulated, Supplementary Table 7).
First, we studied whether the differential gene expression identified may be indicative of cell-type shifts in the blood. The pooled gene sets were tested in several known in vitro gene expression profiles of sorted cell types to assess whether gene expression changes are due to cell type enrichment (Supplemental Figure 7). No significant differences were found, but the trend showed an enrichment of M1-polarized macrophages and band cells (immature neutrophils), and underexpression in dendritic cells. This is consistent with a heightened pro-inflammatory response and a decrease in adaptive immunity in patients who ultimately progress to mortality 10 . We next tested the 58 genes for enrichment in gene ontologies and Reactome pathways, but after multiple hypothesis testing corrections, no pathways were significantly enriched. This may be either due to the relatively low number of genes in the predictor set, or it may indicate that there is not unified biology across the four models. In addition, the models were generated in a way that penalized the inclusion of genes that were redundant for classification purposes.

Discussion
Sepsis is a heterogeneous disease, including a wide possible range of patient conditions, pre-existing comorbidities, severity levels, infection incubation times, and underlying immune states. Many investigators have hypothesized that molecular profiling of the host response may better predict sepsis outcomes. Here, we extensively assessed the predictive performance of whole-blood gene expression using a community-based modeling approach. This approach was designed to evaluate predictive capabilities in a manner that was independent of specific methodological preferences, and instead created robust prognostic models across a broad solution space. We developed four state-of-the-art data-driven prognostic models using a comprehensive survey of available data including 21 different sepsis cohorts (both communityacquired and hospital-acquired, N=1,113 patients), with summary AUROCs around 0.85 for predicting 30-day mortality. We also showed that combining the gene-expression-based models with clinical severity scores leads to significant improvement in the ability to predict 30-day mortality, indicating clinical utility.
Prediction of outcomes up to 30 days after the time of sampling represents a difficult task, given that the models must account for all interventions that occur as part of the disease course. An accuracy of 100% is likely not only unachievable but also undesirable, as it would suggest that mortality is pre-determined and independent of clinical care. Given this background, and since similar prognostic power was observed across all individual models and the ensemble model, our prognostic accuracy may represent an upper bound on transcriptomic-based prediction of sepsis outcomes. In addition, since prognostic accuracy was retained across broad clinical phenotypes (children and adults, with bacterial and viral sepsis, with communityacquired and hospital-acquired infections, from multiple institutions around the world) the models appear to have successfully incorporated the broad clinical heterogeneity of sepsis.
Sepsis remains difficult to define. The most recent definition of sepsis (Sepsis-3) requires the presence organ dysfunction as measured by an increase in SOFA of two or more points over baseline 1 . Determining the SOFA score can help guide which organ systems are dysfunctional, but this fails to characterize the biological changes are driving the septic response. Molecular tools like the ones developed here provide an opportunity to provide a simple, informative prognosis for sepsis by improving patient risk stratification. Host response profiles could also help to classify patients with sepsis as opposed to non-septic acute infections. Identifying such high-risk patients may also lead to greater success in clinical trials through improved enrichment strategies. This identification of subgroups or 'endotypes' of sepsis has already been successfully applied to both pediatric and adult sepsis populations 12,13 .
The goal of this study was to generate predictive models but not necessarily to define sepsis pathophysiology. However, our community approach identified a large number of genes associated with sepsis mortality that may point to underlying biology. The association with immature neutrophils and inflammation in sepsis has been previously shown 53 . Results of this study confirm this finding as we note increases in the neutrophil chemoattractant IL-8 as well as neutrophil-related antimicrobial proteins (DEFA4, BPI, CTSG, MPO). These azurophilic granule proteases may indicate the presence of very immature neutrophils (metamyelocytes) in the blood 54 . Many of these genes have also been noted in the activation of neutrophil extracellular traps (NETs) 55,56 . NET activation leads to NETosis, a form of neutrophil cell death that can harm the host 56 . Whether these involved genes are themselves harmful or are markers of a broader pathway is unknown. Along with immune-related changes, there are changes in genes related to hypoxia and energy metabolism (HIF1A, NDUFV2, TRIB1). Of particular interest is the increase in HIF1A, a hypoxia-induced transcription factor. This upregulation is corroborated by previous findings in patients with higher early mortality in the larger E-MTAB-4421.51 cohort 13 . This may be evidence of either a worsening cytopathic hypoxia in septic patients who progress to mortality, or of a shift away from oxidative metabolism ("pseudo-Warburg" effect), or both 57 .
Modification of the Warburg effect due to sepsis has been implicated in immune activation 58,59 , trained immunity 60 , and immunoparalysis 61 .
The present study has several limitations. First, as a retrospective study of primarily publically available data, we are not able to control for demographics, infection, patient severity, or individual treatment. However, our successful representation of this heterogeneity likely contributed to the successful validation in external community-acquired and hospital-acquired sepsis cohorts. Second, despite a large amount of validation data, we do not present the results of any prospective clinical studies of these biomarkers. Prospective analysis will be paramount in translating the test to a clinically relevant assay. Third, the genes identified here were specifically chosen for their performance as biomarkers, not based on known relevance to the underlying pathophysiology of mortality in sepsis. As such, the biological insights gained from these biomarkers will need to be confirmed and expanded on by studies focused on the entire perturbation of the transcriptome during sepsis and through targeted study of individual genes and pathways. Fourth, the use of 30-day mortality as our endpoint is a crude measure of severity, and may miss important intermediate endpoints such as prolonged ICU stay or poor functional recovery. While such intermediate outcomes were not available in the current data, the models' abilities to predict these functional outcomes will need to be tested prospectively.
Researchers, clinicians, funding agencies, and the public are all advocating for improved platforms and policies that encourage sharing of clinical trial data 62 . Meta-analysis of multiple studies leads to results that are more reproducible than from similarly-powered individual cohorts (PMID: 27634930). The community approach used here has shown that aggregated transcriptomic data can be used to define novel prognostic models in sepsis. This collaboration of multidisciplinary teams of experts encompassed both analytical and statistical rigor along with deep understandings of both the transcriptomics data and clinical data. To advance beyond the work presented here, more data must be made available, including demographics, treatments, and clinical outcomes, as well as other data types like proteomics and metabolomics. Data-driven collaborative modelling approaches using these data can be effective in discovering new clinical tools.

Conclusions
We have shown comprehensively that patients with sepsis can be risk-stratified based on their gene expression profiles at the time of diagnosis. The overall performance of expressionbased predictors paired with clinical severity scores was significantly higher than clinical scores alone. These gene expression models reflect a patient's underlying biological response state and could potentially serve as a valuable clinical assay for prognosis and for defining the host dysfunction responsible for sepsis. These results serve as a benchmark for future prognostic model development and as a rich source of information that can be mined for additional insights.
Improved methods for risk stratification would allow for better resource allocation in hospitals and for prognostic enrichment in clinical trials of sepsis interventions. Ultimately, prospective clinical trials will be needed to confirm and extend the findings presented here.   Figure 5: Boxplots of other performance metrics for each model in individual cohorts, with cutoffs set to the sensitivity nearest to 90%. Supplemental Figure 6: Rank order correlation between sample scores across the four models for all samples. Supplemental Figure 7: Cell type enrichments of the entire set of 58 genes used across all four prediction models.