The innate immune system of humans and other mammals responds to pathogen-associated molecular patterns (PAMPs) that are conserved across broad classes of infectious agents such as bacteria and viruses. We hypothesized that a blood-based transcriptional signature could be discovered indicating a host systemic response to viral infection. Previous work identified host transcriptional signatures to individual viruses including influenza, respiratory syncytial virus and dengue, but the generality of these signatures across all viral infection types has not been established. Based on 44 publicly available datasets and two clinical studies of our own design, we discovered and validated a four-gene expression signature in whole blood, indicative of a general host systemic response to many types of viral infection. The signature’s genes are: Interferon Stimulated Gene 15 (ISG15), Interleukin 16 (IL16), 2′,5′-Oligoadenylate Synthetase Like (OASL), and Adhesion G Protein Coupled Receptor E5 (ADGRE5). In each of 13 validation datasets encompassing human, macaque, chimpanzee, pig, mouse, rat and all seven Baltimore virus classification groups, the signature provides statistically significant (p < 0.05) discrimination between viral and non-viral conditions. The signature may have clinical utility for differentiating host systemic inflammation (SI) due to viral versus bacterial or non-infectious causes.
Systemic inflammation (SI), as indicated by clinical signs such as fever and increased respiratory and heart rates, can be due to a variety of underlying non-infectious or infectious causes including trauma, thermal burns, surgery, ischemia-reperfusion events and viral or bacterial infections. Patients presenting with SI can pose a diagnostic challenge for clinicians in determining the underlying etiology; consequently it can be difficult to select the most appropriate options for treatment and patient management1,2,3,4,5. There is a clinical need for rapid diagnostic tests that can help clinicians distinguish between non-infectious, viral and bacterial etiologies of SI in (critically ill) patients. Without such tests, patients may be over-prescribed antibiotics when there is little clinical evidence of infection4, 6. Reducing inappropriate and unnecessary use of antibiotics, the concept of antibiotic stewardship, is essential in slowing the spread of resistant bacteria7.
Traditional reference methods for determining bacterial or viral causes of SI involve the culture, isolation and identification of causative pathogens from multiple specimens from a patient. Such an approach, however, has several limitations: (i) the causative pathogen might not be present in the specimens taken for examination; (ii) the specimens may become contaminated by organisms unrelated to the cause of infection; (iii) multiple organisms may be present in the specimens (e.g. due to contamination or non-harmful microbiota) and it can be difficult to determine which organism is the cause of the presenting clinical signs8,9,10. Furthermore, (iv) some sampling techniques (e.g. bronchoalveolar lavage, lumbar puncture) are relatively invasive. Finally, (v) some pathogens are not easily cultured. Although traditional culture-based methods are steadily being supplemented or displaced by immunological and molecular methods such as rapid immunoassays and polymerase chain reaction (PCR)11, 12, these newer methods also suffer from limitations, for example: (i) an inability to detect organisms not represented in an immunoassay or PCR panel; (ii) an inability to discriminate between live and dead organisms in a specimen; and (iii) a tendency to detect low levels of virus that may not be clinically relevant13.
Given these limitations, increasing attention is being paid to an alternative approach: that of identifying biomarkers that reflect the differential host response to underlying non-infectious, bacterial, or viral conditions14,15,16,17,18,19,20,21,22,23. Our current investigation builds upon and extends previous host biomarker studies by identifying a molecular signature that is demonstrably specific to SI caused by a broad range of pathogenic viruses that represent all seven Baltimore virus classification groups and that cause infection in different tissues in multiple mammalian species. We used, as a discriminating function, the Area Under the Curve (AUC) in Receiver Operating Characteristic (ROC) curve analysis, and boosted specificity by employing a filtering step in our discovery process whereby biomarkers with high AUCs for non-viral causes of SI were removed. Independent validation of the signature in adult and pediatric cohorts demonstrated a strong discrimination of viral vs. non-viral causes of SI. Notably, this viral signature relies on only four biomarkers, and this high degree of parsimony should help to ensure the performance robustness necessary for effective translation to a rapid point-of-care format.
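To make the discriminating function concrete, the following minimal sketch computes an empirical ROC AUC as the fraction of case/control pairs in which the case has the higher score, with ties counted as one half. The function name `roc_auc` and the example scores are illustrative assumptions and are not taken from the study data or analysis code.

```python
from typing import Sequence


def roc_auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Empirical ROC AUC: probability that a random case outscores a random control.

    Ties contribute one half, which matches the area under the empirical ROC curve.
    """
    if len(case_scores) == 0 or len(control_scores) == 0:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical signature scores (assumption, not study data).
viral = [2.1, 1.8, 2.5, 1.2]
non_viral = [0.4, 1.3, 0.9, 0.2]
print(roc_auc(viral, non_viral))  # 15 of 16 pairs are ordered correctly: 0.9375
```

An AUC of 0.5 corresponds to no discrimination and an AUC of 1.0 to complete separation of the two groups.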
Discovery of the pan-viral signature
An initial search was conducted across 13 Gene Expression Omnibus (GEO) datasets (Table 1) from human adult and pediatric subjects, and one GEO dataset from macaques. These 14 discovery datasets (comprising 417 cases and 182 controls) spanned two Baltimore Group I viruses (cytomegalovirus, human herpesvirus 6), one Group III virus (rotavirus), four Group IV viruses (enterovirus, dengue, hepatitis C, rhinovirus), and five Group V viruses (influenza, Lassa virus, lymphocytic choriomeningitis virus, respiratory syncytial virus, and measles virus). Next, a comprehensive, stepwise filtering approach was applied to 19 additional GEO datasets comprising a total of 1337 cases and 1106 controls (Table 1), to exclude genes that were differentially expressed in conditions that may present as SI but appear unrelated to viral systemic inflammation. After the filtering step, the end result was a “pan-viral” signature based on the expression levels of four genes: Interferon Stimulated Gene 15 (ISG15), Interleukin 16 (IL16), 2′,5′-Oligoadenylate Synthetase Like (OASL), and Adhesion G Protein Coupled Receptor E5 (ADGRE5). Table 2 summarizes what is known about the role, function and tissue expression of these four genes. Three of the genes (ISG15, OASL, IL16) have previously been reported to be associated with host response to viral infection, although they are not entirely specific to such a response. The four genes are all strongly expressed in whole blood and white blood cells, and to a lesser degree in most other tissues.
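The logic of the specificity filter can be sketched as follows, assuming per-gene AUCs have already been computed. The thresholds, the gene names, and the choice to fold AUCs around 0.5 (so that strong down-regulation also counts as a response) are all illustrative assumptions. The study's actual stepwise criteria are not reproduced here.

```python
from typing import Dict, List


def filter_candidates(
    viral_auc: Dict[str, float],
    non_viral_auc: Dict[str, List[float]],
    min_viral_auc: float = 0.8,      # hypothetical threshold
    max_non_viral_auc: float = 0.7,  # hypothetical threshold
) -> List[str]:
    """Keep genes that discriminate viral cases but not non-viral SI conditions."""
    kept = []
    for gene, auc in viral_auc.items():
        if auc < min_viral_auc:
            continue  # not discriminating enough in the viral discovery datasets
        # Assumption: an AUC of 0.2 is as informative as 0.8 with the direction
        # flipped, so each non-viral AUC is folded around 0.5 before comparison.
        folded = [max(a, 1.0 - a) for a in non_viral_auc.get(gene, [])]
        worst = max(folded, default=0.5)
        if worst <= max_non_viral_auc:
            kept.append(gene)
    return sorted(kept)


# Hypothetical AUC tables (placeholder gene names, not study results).
viral = {"GENE_A": 0.95, "GENE_B": 0.92, "GENE_C": 0.65, "GENE_D": 0.90}
non_viral = {
    "GENE_A": [0.55, 0.60],  # quiet in non-viral SI: kept
    "GENE_B": [0.85, 0.50],  # responds to a non-viral condition: removed
    "GENE_D": [0.15],        # strongly down in a non-viral condition: removed
}
print(filter_candidates(viral, non_viral))  # ['GENE_A']
```

Filtering on the worst case across the non-viral datasets is a conservative choice: a single confounding condition is enough to exclude a gene.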
Validation of the pan-viral signature in independent GEO datasets
To ensure the resulting pan-viral signature was not overfit to the discovery datasets and was generalizable across different viruses and mammalian species, we next validated its performance in 13 human and non-human mammalian datasets (11 from GEO and 2 from clinical trials, comprising a total of 332 cases and 302 controls). Importantly, these datasets were completely independent of those used during the discovery process. The validation datasets were chosen on the basis of (i) coverage of all seven of the Baltimore virus classification groups, and (ii) the potential impact of each virus on human health. In the case of the human datasets, the subjects had either naturally acquired viral infections, or had been vaccinated with attenuated viral vaccines (see Table 3 for details). The AUCs for performance of the pan-viral signature in the validation datasets ranged from 0.90 to 0.98.
GEO Validation Dataset #1: Adenovirus (Baltimore Group I, double-stranded DNA)
Fifty-one different serotypes of adenovirus are known to infect humans, and serotypes 1, 2, 3, 4, 5, 7, 21 in particular are significant causes of upper respiratory tract infections, especially in children24,25,26,27. For evaluation of the performance of the pan-viral signature in Baltimore Group I viral infections, we chose GEO dataset GSE4128 which was derived from a study28 of mice injected with adenovirus type 5 capsids (“vector”) or phosphate buffered saline (“mock”) (Fig. 1A). Adenovirus capsids are known to induce an innate inflammatory response28. Gene expression analyses were performed on liver samples taken six hours post-infection for both wild type mice, and mice rendered deficient for complement component 3 (C3) by gene targeting. We observed a clear difference (AUC = 1.00) in pan-viral signature values between infected and mock-infected mice. Whilst the authors found a “blunted” immune response to adenovirus injection in the C3-deficient mice, we found little overall difference in pan-viral signature response, suggesting that the absence of C3 does not affect the pan-viral signature value. Note that for analysis of dataset GSE4128, our pan-viral signature incorporated the mouse gene 2′–5′ Oligoadenylate Synthetase-Like 1 (OASL1), which is the ortholog of human OASL29. Also, two samples were omitted from our analysis because the study authors labeled each sample as both ‘mock’ and ‘virus-infected’ in the phenotypic table associated with GSE4128.
GEO Validation Dataset #2: Porcine Circovirus PCV2 (Baltimore Group II, single-stranded DNA)
There are few publicly available datasets, in either humans or other species, that describe host gene expression in response to infection by pathogenic Baltimore Group II viruses. Example viruses in this group include parvoviruses (B19, canine parvovirus, bocavirus, adeno-associated virus) and circoviruses (porcine circovirus, chicken anemia virus). Porcine circovirus, type 2 (PCV2) is the primary cause of post-weaning multi-systemic wasting syndrome in pigs, which has had a large economic impact in the food production industry30. We analyzed a time-course dataset (GSE14790) derived from peripheral blood samples from Landrace cesarean-derived colostrum-deprived (CDCD) piglets infected, at post-gestation day 7, with subclinical doses of porcine circovirus 2 (PCV2, Burgos isolate). This study30 used an Affymetrix 24K GeneChip Porcine Genome Array to generate gene expression data. This microarray unfortunately did not include the OASL gene. We therefore were limited to analyzing this dataset using a linear combination of just two of the four genes, ISG15 and IL16, which carries most of the diagnostic power of the signature. Figure 1B shows box and whisker plots for ISG15/IL16 performance on weekly whole blood samples out to 29 days post-inoculation in piglets infected with PCV2. The ISG15/IL16 component of the pan-viral signature produced AUC = 0.94 for day 7 vs. day 0 comparison, and AUC = 1.00 for days 14, 21, 29 vs. day 0 comparison.
GEO Validation Dataset #3: Rotavirus (Baltimore Group III, double-stranded RNA)
Rotaviruses are the most common cause of gastroenteritis worldwide in children less than five years of age, resulting in over 2 million hospitalizations annually31. Despite the main clinical signs of rotavirus infection being related to gastroenteritis, peripheral blood gene expression changes associated with infection have been reported32. We analyzed dataset E-GEOD-50628, generated from peripheral blood samples from six children with rotavirus infections in the acute phase (2–4 days from disease onset) versus recovery phase (7–11 days from disease onset)32. Figure 2 shows a box and whisker plot demonstrating that the pan-viral signature can differentiate children acutely infected with rotavirus from those in recovery (p < 0.05 by Mann-Whitney-U test).
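For small groups such as the six children here, a Mann-Whitney U p-value can be obtained exactly by enumerating every way of splitting the pooled scores into two groups. The sketch below illustrates this for a two-sided test. The function names and the example scores are assumptions made for illustration, and in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used instead.

```python
from itertools import combinations
from typing import Sequence, Tuple


def u_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Number of (x, y) pairs with x > y, counting ties as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def mann_whitney_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U, two-sided exact p-value) by enumerating all group assignments."""
    pooled = list(x) + list(y)
    n_x = len(x)
    mean_u = n_x * len(y) / 2.0  # expected U when the groups do not differ
    observed = u_statistic(x, y)
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_x):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count assignments at least as far from the null expectation as observed.
        if abs(u_statistic(group_x, group_y) - mean_u) >= abs(observed - mean_u):
            extreme += 1
    return observed, extreme / total


# Hypothetical signature scores for six subjects per phase (not study data).
acute = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0]
recovery = [1.2, 1.5, 0.9, 1.8, 1.1, 1.4]
u, p = mann_whitney_exact(acute, recovery)
print(u, p)
```

With two completely separated groups of six, only 2 of the 924 possible assignments are as extreme as the observed one, so the smallest attainable two-sided p-value at this sample size is 2/924 (about 0.002).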
GEO Validation Dataset #4: Yellow Fever Virus (Baltimore Group IV, positive-sense single-stranded RNA)
The Flaviviridae family includes yellow fever, dengue, hepatitis C, Japanese encephalitis and Zika viruses, which together impact the lives of millions of people33. Yellow fever virus is considered to be a prototypical flavivirus, for which single-dose vaccination with a live attenuated virus is an effective protection34. We analyzed GEO dataset GSE13699 from a yellow fever vaccination study35 in which two geographically separated groups of volunteers (Lausanne, n = 11; Montreal, n = 15) were vaccinated subcutaneously on day 0 with Stamaril vaccine (Sanofi-Pasteur YF17D-204 YF-VAX), a vaccine containing live attenuated yellow fever virus that confers protection from 10 days following vaccination. Whole blood samples were collected on days 0, 3 and 7 for the Lausanne cohort and on days 0, 3, 7, 10, 14, 28 and 60 for the Montreal cohort. The pan-viral signature value peaked on day 7 following vaccination and dropped to pre-vaccination levels by day 14 (Fig. 3). The temporal behavior of the pan-viral signature suggests that the vaccine engenders an immune response that peaks on day 7 but does not persist beyond day 14 (as might be expected for the response to an attenuated vaccine).
GEO Validation Dataset #5: Respiratory Syncytial Virus (Baltimore Group V; negative-sense single-stranded RNA)
The most common cause of acute lower respiratory infection in children less than five years of age is respiratory syncytial virus (RSV), with an estimated 3.4 million infected children requiring hospitalization each year worldwide36. We analyzed GEO dataset GSE69606, which was generated in a study designed to identify biomarkers of RSV infection severity in children37. In this study, peripheral blood samples were collected from children with mild (n = 9), moderate (n = 9) or severe (n = 8) clinical signs during the acute stage of infection. An additional set of samples was collected 4–6 weeks later from recovered children who originally displayed moderate or severe clinical signs. The pan-viral signature score showed a clear difference between acute and recovery stages (AUC = 0.903), but was invariant in the acute stage regardless of RSV infection severity (Fig. 4).
GEO Validation Dataset #6: HIV-1 Virus (Baltimore Group VI, positive-sense single-stranded RNA virus, replicating through a DNA intermediate)
The initial clinical signs of acute HIV-1 infection are relatively non-specific, involving fever and influenza-like illness, which bear a clinical resemblance to other types of infection including bacterial sepsis. We analyzed GEO dataset GSE29429 which was generated from a time-course study38 comparing (A) HIV-1 infected adults who first presented in the acute stage of infection but who did not receive antiretroviral therapy (ART; African, n = 43), versus (B) HIV-1 infected adults who presented similarly but did receive ART (USA, n = 15). The study also included two sets of matched healthy controls (n = 55). Blood samples were collected at study enrollment when the patients had a confirmed acute infection, and at post-enrollment weeks 1, 2, 4, 8, 12 and 24. Figure 5 shows AUCs over time for the pan-viral signature when comparing the healthy controls to either the untreated African patients (panel A) or the treated American patients (panel B). The pan-viral signature AUC when comparing the untreated African patients to the corresponding healthy African controls remained at or above 0.9 at all time points; in contrast, the AUC when comparing the treated American patients to the corresponding healthy American controls dropped from above 0.9 at enrollment to less than 0.5 by week 24 (panel C). The decrease in pan-viral signature values in the treated American patients also reflected a corresponding decrease in mean HIV-1 viral loads from ~800,000 virus particles/mL blood at study entry to ~2,000 virus particles/mL blood by week 24.
GEO Validation Dataset #7: Hepatitis B (Baltimore Group VII, double-stranded DNA virus, replicating through a single-stranded RNA intermediate)
We analyzed GEO dataset GSE68112 which was generated from a study of HBV infection of primary rat hepatocytes39. Figure 6 shows pan-viral signature scores over a 72-hour period in primary rat hepatocytes. In this study, primary rat hepatocytes were plated at 0 hours, then infected with an adenovirus-based construct containing either the gene for Green Fluorescent Protein (GFP) alone, or a copy of the Hepatitis B Virus (HBV) genome in combination with the GFP gene. Post-infection, an increase in the pan-viral signature score was observed in rat hepatocytes infected with the adenovirus + GFP + HBV construct, compared to infection with the adenovirus + GFP construct lacking HBV. At the 48 hour timepoint, this increase was small and not statistically significant (p > 0.05 by one-tailed t-test, unequal variances assumed). However, at 72 hours post-infection, the increase was much larger and statistically significant (p < 0.02 by one-tailed t-test, unequal variances assumed). The results at 72 hours post-infection indicate that the pan-viral signature responds to acute infection by HBV in rats, in tissues other than blood, in an in vitro study.
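The unequal-variance (Welch) t-test used for the 48 and 72 hour comparisons is built from two quantities, the t statistic and the Welch-Satterthwaite degrees of freedom. The sketch below computes both using only the Python standard library. The function name and the example values are illustrative assumptions, not data from GSE68112. The one-tailed p-value is the upper-tail probability of Student's t distribution at these degrees of freedom; it is not computed here and would normally come from a statistics library (for example `scipy.stats.ttest_ind` with `equal_var=False`).

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(treated: Sequence[float], control: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(treated), len(control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error of each group mean (sample variance / n).
    se1 = variance(treated) / n1
    se2 = variance(control) / n2
    t = (mean(treated) - mean(control)) / sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical signature scores for two constructs (assumption, not study data).
with_hbv = [2.0, 4.0, 6.0]
gfp_only = [1.0, 2.0, 3.0]
t, df = welch_t(with_hbv, gfp_only)
print(round(t, 4), round(df, 3))  # 1.5492 2.941
```

A positive t indicates a higher mean score in the first group, which is the direction tested by a one-tailed comparison of the HBV-containing construct against the GFP-only construct.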
In Figs 1–6 we have presented validation data representing all seven Baltimore viral classification groups. In Supplementary Figures S1–S5 we discuss additional GEO datasets, derived from human and animal peripheral blood samples, which were used to further validate the pan-viral signature. Human studies included rhinovirus (HRV) infection in children (Figure S1; Baltimore Group IV; AUC 0.81–0.90); and a time-course study in which adult volunteers were inoculated with influenza virus (Figure S2; Baltimore Group V; AUC up to 1.00). Animal studies included a time-course study of influenza in mice (Figure S3; Baltimore Group V; AUC up to 1.00), which parallels the aforementioned human study; inoculation of Hepatitis C and Hepatitis E in chimpanzees (Figure S4; Baltimore Group IV; AUC 0.96–1.00); and infection of macaque monkeys with Marburg virus (Figure S5; Baltimore Group V; AUC 0.98). Performance of the pan-viral signature was strong in all of these additional validation datasets, as indicated in Table 3 and in the Supplementary Figures.
Additional validation from clinical studies
The pan-viral signature was also tested in two clinical studies that were conducted to determine the signature’s ability to differentiate patients with virus-associated SI from those with SI due to other etiologies, including bacterially- and surgically-induced SI. Gene expression levels were inferred from RNA sequencing (RNA-seq) data obtained from whole blood samples collected in PAXgene blood RNA tubes.
Internal Validation Dataset #1: FEVER study
This study involved adult patients presenting to a UK emergency department with fever (see the Supplementary Text S1, Figure S6 and Table S1 for study details, and Table S2 for line data). All patients included in the study were admitted to hospital and received retrospective physician diagnosis (RPD), using all available clinical information at discharge, including any results of clinical microbiology and virus testing, to determine the presumptive etiology of the fever. Of the 90 patients comprising the FEVER study cohort, those with confirmed bacterial infections (N = 54) were identified by microbial culture of pathogenic bacteria from sterile sites. Confirmed viral infections (N = 14) were identified by positive nucleic acid detection or serological tests as ordered by the attending clinician (see Text S1). Patients who had no positive microbiological tests and recovered without empirical antimicrobial treatment (N = 22) were designated as indeterminate cases. Positively identified viruses in the ‘virally infected’ patients included Baltimore group I (herpes virus, varicella-zoster virus, Epstein-Barr virus, cytomegalovirus), Baltimore group IV (dengue virus), and Baltimore group V (Influenza A and B viruses). Figure 7, panel A shows box and whisker plots of the pan-viral signature, assayed in blood samples from the three patient groups. The pan-viral signature effectively separated febrile patients of confirmed viral etiology from those of confirmed bacterial etiology with AUC 0.93. All patients in this study had a fever (temperature >38.5 °C) at the time of presentation and blood sampling. The fact that the indeterminate cases recovered spontaneously may be most consistent with self-limiting viral illnesses, but interestingly only 2–3 of 22 indeterminate cases had pan-viral signature scores significantly higher than the proven cases of bacterial infection, suggesting that the majority of these indeterminate cases did not represent acute viral infections.
Internal Validation Dataset #2: GAPPSS study
A second clinical study40 (clinicaltrials.gov reference # NCT02728401) was undertaken that involved pediatric patients (age range: 38 weeks estimated gestational age – 18 years) in intensive care (see Supplementary Text S2 and Table S3 for study details, and Table S4 for line data). Using all available clinical information, including clinical microbiology and virus testing, the patients were retrospectively diagnosed with bacterial sepsis (n = 25), bacterial sepsis with a viral coinfection (n = 10), viral SI (n = 5), or sterile post-surgical SI (n = 29). Testing of respiratory samples from the cohort, using the BioFire FilmArray Respiratory Panel (Biofire Diagnostics, Utah, USA), identified viruses in Baltimore group I (varicella-zoster virus; herpes simplex virus), Baltimore group IV (rhinovirus/enterovirus; coronavirus HKU1; norovirus Type 2) and Baltimore group V (parainfluenza 3; respiratory syncytial virus; metapneumovirus). Results are displayed graphically in Fig. 7, Panel B and summarized in Table 3. Whilst only a limited number of viral patients were included in this study (n = 5), the pan-viral signature resolved viral SI from non-infectious SI with AUC 0.91, and resolved viral SI from bacterial sepsis with AUC 0.76. Similar to our observation in the adult study (Fig. 7, panel A above), the pan-viral signature was much less effective at separating bacterial sepsis from non-infectious SI (AUC 0.60), demonstrating that the signature is specific for viral systemic inflammation and not bacterial systemic inflammation. Discordance between RPD and the pan-viral score in some cases suggests that some patients had undetected viral infections, that the pan-viral signature had reduced specificity in children, or that the study was not sufficiently powered to draw definitive conclusions.
Resolution of viral vs. bacterial SI using two specific signatures
We have previously discovered and validated a four-gene host response signature (SeptiCyte™ LAB) for differentiating SI due to either bacterial or non-infectious etiology41. Given that the pan-viral signature was developed to be specific for discrimination of viral vs. non-infectious SI, and appears to be largely unaffected by bacterial infection, we hypothesized it would be possible to apply the two signatures simultaneously to allow a three-way discrimination between non-infectious SI, viral SI, and bacterial SI.
As an initial test of this hypothesis, we reanalyzed a dataset (GSE63990) from a study42 of patients with acute respiratory illness (ARI). This study enrolled 273 patients of which 115, 70 and 88 received retrospective clinical diagnoses of bacterial infection, viral infection, and non-infectious illness, respectively. We analyzed GSE63990 using an 8-gene classifier consisting of the four pan-viral signature genes (IL16, ISG15, OASL, ADGRE5) combined with the four genes (CEACAM4, LAMP1, PLA2G7, PLAC8) from SeptiCyte™ LAB. The line data used in our analysis is given in Supplementary Table S5. We applied a Random Forest - multidimensional scaling (RF-MDS) analysis43,44,45 using the combined eight genes. Figure 8 (Panels A, B) presents two different visual representations of the analysis, which show that the GSE63990 dataset has been resolved into the three patient subgroups of bacterial infection (green), viral infection (purple), and non-infectious illness (orange). An animated representation of this analysis, in which the figure is rotated in three dimensions, is provided in Supplementary Animation S1. To assess whether these 8 genes were contributing materially to the underlying biology, and thus to the clinical diagnoses of viral, bacterial or non-infectious illness, we used the resampling method described by Li et al.46 and created 2,000 permutations of GSE63990 in which the group labels were randomly shuffled. When applied to the permuted datasets, the Random Forest model failed to resolve the three groups. Thus the null hypothesis of no association between the 8 genes and the group labels was rejected. That is, the results presented in Fig. 8 illustrate a true dependency between the 8 genes and the response labels, at a significance level of p < 0.001. Additional details of the permutation test are provided in Supplementary Figures S7 and S8.
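The label-permutation logic can be sketched as follows. To stay self-contained, this sketch swaps in a simple nearest-centroid classifier where the analysis above used a Random Forest, and it runs on invented two-dimensional points where the real input was eight-gene expression profiles. The function names, the data and the choice of resubstitution accuracy as the test statistic are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def centroid_accuracy(points: List[Point], labels: List[str]) -> float:
    """Resubstitution accuracy of a nearest-centroid classifier (stand-in model)."""
    groups: Dict[str, List[Point]] = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }

    def nearest(point: Point) -> str:
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(point, centroids[lab])),
        )

    hits = sum(nearest(p) == lab for p, lab in zip(points, labels))
    return hits / len(points)


def permutation_p_value(
    points: List[Point], labels: List[str], n_permutations: int = 2000, seed: int = 0
) -> Tuple[float, float]:
    """Return (observed accuracy, p-value) under random shuffling of group labels."""
    rng = random.Random(seed)
    observed = centroid_accuracy(points, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if centroid_accuracy(points, shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (1 + at_least_as_good) / (1 + n_permutations)


# Invented, well-separated toy data for three groups (not GSE63990).
toy_points = [
    (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (0.0, -0.2),    # bacterial
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2), (5.0, 4.8),       # viral
    (10.0, 0.1), (10.2, -0.1), (9.9, 0.0), (10.1, 0.2), (10.0, -0.2), # non-infectious
]
toy_labels = ["bacterial"] * 5 + ["viral"] * 5 + ["non-infectious"] * 5
accuracy, p_value = permutation_p_value(toy_points, toy_labels)
print(accuracy, p_value)
```

With 2,000 permutations the smallest p-value this estimator can return is 1/2001 (about 0.0005), so a result in which no shuffled labelling matches the observed performance is consistent with reporting p < 0.001.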
We note the GSE63990 dataset was not used in the initial discovery or validation of either the pan-viral signature or SeptiCyte™ LAB signature. Also, the possibility of bacterial or viral co-infection was not considered in our analysis. Furthermore, the diagnostic performance of SeptiCyte™ LAB and the pan-viral signature is dependent upon the accuracy of retrospective physician diagnoses of acute respiratory illness cases. There is some degree of discordance between the retrospective physician diagnoses and our two signatures, a finding that was also reported in the original publication42 when classifiers reported in that paper were used (35 of 273 patients had a discordant result (12.8%)). Clearly further validation work is required to demonstrate the clinical utility of combining both signatures, but these data provide a valuable insight into the potential of an assay that combines viral and bacterial host responses.
In this paper we identify and validate a peripheral-blood signature based on the expression of four genes (ISG15, OASL, IL16, ADGRE5), which exhibits high AUC for discriminating viral from bacterial and non-infectious causes of SI. This signature has been validated using publicly available GEO datasets, and in our own clinical studies in adults and children. We have termed the signature “pan-viral” because it has demonstrable diagnostic power across six mammalian species (human, macaque, chimpanzee, pig, rat and mouse), in multiple tissue types, in vivo and in vitro, and in infections caused by viruses representing all seven Baltimore classification groups.
Because the direct sensing of different classes of viruses is mediated through different Pathogen-Associated Molecular Patterns (PAMPs), we hypothesize that the pan-viral signature most likely reflects some type of integrated downstream response47, 48. A plausible hypothesis regarding the functional significance of three of the genes in the pan-viral signature (ISG15, OASL, IL16) is that they relate to type 1 interferon signaling. ISG15, a well-studied component of the type 1 interferon-mediated response to viral infection, is a mediator of ISGylation, a protein modification similar to ubiquitination49,50,51,52. OASL is a non-enzymatic member of the highly conserved OAS gene family53 and is also a component of the type 1 interferon response to viral infection54, 55. IL16 is a cytokine with multiple functions, having been linked to inhibition of HIV-1 infection56, 57, modulation of HBV infection58, lentiviral infection59 and autoimmune and allergic disorders60, 61. An earlier study62 demonstrated that IFN-α induces the secretion of IL-16 by several cell types. A more recent paper63 reported a negative effect of IFN-β1a (a type 1 interferon) on the expression level of IL-16; thus IL16 may also be functionally related to the type 1 interferon pathway, although the linkage is not especially well studied or documented. Finally, although ADGRE5 has not been linked to type 1 interferon signaling, this gene has previously been directly associated with host response to infection by human papilloma virus64 and HIV65. Additionally, the ADGRE5 ligand DAF (decay accelerating factor) is the cellular receptor for both echovirus66, 67 and coxsackie virus68, 69.
Context for our work is found in prior studies describing transcriptional signatures that were designed to distinguish between some viral, bacterial, and non-infectious SI conditions. However, we have found that prior work was limited by either a large number of genes/probes required, a lack of specificity of the signatures in light of other possible causes of SI, or a lack of validation across a broad range of virus types. For example, Zaas et al.19 identified a 30-gene signature from microarray analysis of symptomatic vs. asymptomatic subjects infected with rhinovirus, respiratory syncytial virus, or influenza A; this signature was able to discriminate symptomatic influenza A-infected subjects from both healthy subjects and bacterially-infected subjects in a second independent cohort. Other researchers17, 18, 20, 70 have described signatures for discriminating between viral infections and other conditions, but with limitations relating to the large number of biomarkers in the signature (>18), a limited number of viruses examined, or a lack of demonstrated specificity with respect to possible bacterial co-infection or SI due to non-infectious causes. Tsalik et al.42 identified host gene expression signatures for viral, bacterial and non-infectious causes of acute respiratory inflammation. Whilst respiratory illness accounts for a large proportion of patients presenting to emergency clinics, the viral signature identified in this study consisted of a large number of genes (n = 33) and was not validated on patients with SI as a result of viral infection of body systems other than respiratory. Sweeney et al.22, 71 described an 11-gene signature for differentiating infectious and non-infectious SI, and also a 7-gene signature for differentiating bacterial and viral SI, but not non-infectious SI. The authors claimed that, used in succession, such signatures could serve as an “integrated antibiotics decision model”.
Finally, Herberg et al.23 described a two-gene signature for differentiating viral and bacterial infection in febrile children. This signature was developed without using a cohort of non-infectious SI and therefore the output is binary and assumes that patients have a viral or bacterial infection.
Our approach to discovery of host response viral biomarkers is novel in comparison to the prior studies because we have: (1) included representative pathogenic viruses from all seven Baltimore viral classification groups, thus providing evidence that innate immune response commonalities may be potentially harnessed for broad diagnostic utility across diverse viral infections; (2) incorporated datasets from multiple mammalian species to demonstrate the robustness that host response-based methods offer; (3) used non-infectious SI as our control group, recognizing the fact that discriminating viral, bacterial, and non-infectious causes of SI is a highly critical and difficult distinction to make on the basis of clinical features alone6, 41, 71, 72; (4) applied a comprehensive specificity screen to eliminate biomarkers that respond to potentially confounding medical conditions or demographic variables; and (5) applied strong selection pressure towards minimizing the number of biomarkers in the pan-viral signature to avoid overfitting and to enable a straightforward conversion to a practical assay format, such as a format employing reverse transcription - quantitative polymerase chain reaction (RT-qPCR).
A number of the discovery and validation datasets in our study (Tables 1 and 3 respectively) were derived from time course and/or challenge experiments. The use of such datasets is important because samples taken from subjects early in the viral pathogenesis, or from otherwise healthy subjects undergoing vaccine challenge, are most likely to reflect an infection with a single type of virus, rather than an infection with multiple virus types, or co-infection with bacteria. Analysis of the time-course datasets revealed that, in general, it took up to three days post-exposure for the pan-viral signature to first register a significant difference compared to pre-exposure samples; the pan-viral signature response coincided with the ability to first detect virus in tissue but preceded viremia, clinical signs and antibody response.
Our study has several limitations. First, the validation datasets we employed were generated from multiple sample types (blood, liver biopsy, cultured hepatocytes) using multiple experimental methods (microarrays, RNA-seq). This diversity of sample types and methods could contribute a significant amount of noise, which would tend to obscure relevant signals. Once the signature has been translated to a single assay technology and sample type, more precise comparisons between different viral infections and disease severities can be made. Second, Baltimore Group II is under-represented in our validation data. The dataset that we analyzed (GSE14790, porcine circovirus infection) did not include OASL. Because the genes comprising the pan-viral signature were discovered by a process in which gene pairs were linearly combined, we present results for the linear combination of ISG15 and IL16, which still carries significant diagnostic power in the cohort tested (GSE14790). We expect that additional Baltimore Group II datasets will eventually become available, which will allow a more in-depth validation of the pan-viral signature performance in this viral group.
Third, the FEVER and GAPPSS studies we have described in Fig. 7 are limited with respect to the size of the viral infection groups (n = 14 for FEVER, and n = 5 for GAPPSS). These studies are ongoing, and additional recruitment is expected over the coming months.
Fourth, definitive clinical utility of the pan-viral signature remains to be determined. Our observations from a variety of validation datasets suggest that the pan-viral signature could potentially have multiple clinical applications: as an early diagnostic tool, in monitoring recovery from viral infections, in monitoring host response to therapeutic interventions, in monitoring host response to vaccines, and/or in surveillance of populations at risk. For example, in combination with a bacterial signature that has inherent high negative predictive value, the pan-viral signature could potentially be a useful tool in an antibiotic stewardship program, or in providing guidance for ongoing diagnostic testing. It could also prove useful in identifying patients early in the course of a viral infection, which in turn could affect decisions on infection control and patient isolation, especially in disease outbreaks. Additional clinical studies will be needed to determine if the pan-viral signature has clinical utility for these or other purposes.
We believe a particular strength of our discovery approach was the resultant specificity of the pan-viral signature when compared to bacterial and non-infectious causes of SI. Such specificity allows this signature to be combined with our SeptiCyte™ LAB signature, which has specificity for bacterial SI. The combination of virus-specific and bacteria-specific host SI signatures may provide clinicians with timely information to aid decision making in patients presenting with SI, for example in deciding whether to initiate or cease antibiotics. Ultimately, clinical utility for a “pan-viral” signature may be found in combination with an infection status classifier, such as the one we have previously described40, whereby both the probability of systemic infection and the infection type (i.e. bacterial vs. viral) can be rapidly determined and factored into patient management and treatment decisions.
Several different statistical tests were used to evaluate the performance of classifiers. (1) When sufficient numbers of samples were available, ROC curve analysis was performed and AUCs were calculated. A resampling method was used to estimate the AUC 95% confidence interval (CI) associated with each ROC curve. Venkatraman’s method73, as implemented in the pROC package in R, was used to compare the AUC values between different biomarker combinations with p < 0.05 considered statistically significant. (2) For some performance estimates the Mann-Whitney U test was used, which gives an equivalent statistic to AUC74. (3) For some analyses with very small sample sizes, Student’s t-test was used, following appropriate small-sample guidelines75.
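The equivalence between the Mann-Whitney U statistic and the AUC noted in item (2) can be illustrated with a minimal sketch. The sample values below are invented for illustration and are not taken from any dataset in this study; the function names are hypothetical.

```python
def mann_whitney_u(cases, controls):
    """U statistic: number of (case, control) pairs in which the case
    scores higher, with ties counted as one half."""
    u = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def auc_from_u(cases, controls):
    """AUC equals U divided by the number of (case, control) pairs."""
    return mann_whitney_u(cases, controls) / (len(cases) * len(controls))


# Invented signature values for two small groups.
viral = [2.1, 3.4, 1.9, 2.8]
non_viral = [0.5, 1.9, 1.1, 0.2]
auc = auc_from_u(viral, non_viral)
print(auc)
```

Dividing U by the product of the two group sizes gives the probability that a randomly chosen case scores higher than a randomly chosen control, which is the AUC.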
Discovery of the pan-viral signature
In the discovery phase we searched for RNA transcripts or transcript combinations with expression levels that varied during a host response to viral infection. The initial search was conducted across 13 datasets from human adult and pediatric subjects, plus one set of data from macaques. We expected there to be some variability between datasets in quantification of the levels of particular RNA transcripts because different studies used different sample types, sample collection tubes, experimental platforms (microarrays, RNA-seq), and data reduction/processing methods to estimate gene expression levels. A considerable literature has arisen on comparing gene expression results across platforms76,77,78,79 and on estimating the biases that may arise specifically within microarray-based approaches80, 81 and RNA-seq-based approaches82,83,84. For each GEO dataset, we represented each gene’s RNA transcript family by the single microarray probe that gave the maximal average intensity for that gene, across all samples used in the analysis. Probe identities are listed in Supplementary Table S6.
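The probe selection rule described above (one probe per gene, chosen by maximal average intensity across samples) can be sketched as follows. The probe identifiers and intensities are placeholders, not entries from Supplementary Table S6, and the data layout is an assumption made for illustration.

```python
from statistics import mean


def select_probes(intensities, probe_to_gene):
    """Return gene -> probe, keeping for each gene the probe with the
    highest mean intensity across all samples."""
    best = {}
    for probe, values in intensities.items():
        gene = probe_to_gene[probe]
        avg = mean(values)
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probe, avg)
    return {gene: probe for gene, (probe, _) in best.items()}


# Placeholder probe IDs and intensities (three samples per probe).
intensities = {
    "probe_1": [7.0, 7.4, 7.2],
    "probe_2": [9.1, 8.8, 9.3],
    "probe_3": [6.5, 6.7, 6.6],
}
probe_to_gene = {"probe_1": "ISG15", "probe_2": "ISG15", "probe_3": "OASL"}
selected = select_probes(intensities, probe_to_gene)
print(selected)
```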
We began the search using four core datasets (GSE40366, GSE51808, GSE52428 and GSE41752). To decrease the dimensionality of the search space and to ensure that only those transcripts with moderate to high expression levels were examined, we applied a mean expression filter that allowed only the top 6,000 RNA transcripts from each of the core datasets to be retained. Regression analysis was then applied across the search space, with RNA transcripts combined in pairs, using a linear objective function with coefficients set to −1 or +1 for the log2 expression value of each transcript in a pair. In theory, each core dataset produced 36,000,000 transcript pairs to examine (not taking into account reciprocal pairs). Setting the coefficients to −1 or +1 (instead of allowing the coefficients to vary) reduced the computational effort to a manageable level. ROC curve analysis on each transcript pair then allowed the transcript pairs to be ranked by AUC for their ability to separate the case and control groups in each of the core datasets.
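A minimal sketch of the pairwise search is given below, under stated assumptions. It scores every transcript pair as a sum or difference of log2 expression values and ranks the pairs by AUC. Treating an AUC below 0.5 as equivalent to flipping both signs is our reading of the fixed ±1 coefficients, not a detail confirmed by the text. Gene names and values are placeholders.

```python
from itertools import combinations


def auc(cases, controls):
    """Fraction of (case, control) pairs ranked correctly; ties count half."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in cases for y in controls)
    return wins / (len(cases) * len(controls))


def rank_pairs(expr, is_case):
    """expr maps gene -> log2 values per sample; is_case flags case samples.
    Each pair is scored as a + b and a - b. The opposite sign patterns
    give 1 - AUC, so max(AUC, 1 - AUC) covers all four combinations."""
    results = []
    for a, b in combinations(sorted(expr), 2):
        for sign_b, op in ((1, "+"), (-1, "-")):
            scores = [xa + sign_b * xb for xa, xb in zip(expr[a], expr[b])]
            cases = [s for s, c in zip(scores, is_case) if c]
            controls = [s for s, c in zip(scores, is_case) if not c]
            value = auc(cases, controls)
            results.append((max(value, 1.0 - value), f"{a} {op} {b}"))
    results.sort(reverse=True)
    return results


# Placeholder data: three cases followed by three controls.
is_case = [True, True, True, False, False, False]
expr = {
    "G_UP": [5.0, 5.5, 4.0, 4.2, 3.0, 3.5],
    "G_DOWN": [2.0, 2.5, 1.0, 3.0, 2.2, 3.1],
    "G_FLAT": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
ranked = rank_pairs(expr, is_case)
print(ranked[0])
```

In this toy example neither gene separates the groups perfectly on its own, but their difference does, which is the kind of pairing the search is designed to surface.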
The RNA transcript pairs were then filtered by the following two-step process: (1) those with average AUC < 0.92 across the four core datasets were discarded; and (2) those with average AUC < 0.92 across ten additional viral-based “sensitivity” datasets (Table 1) were discarded. This resulted in a greatly reduced pool of transcript pairs (N = 856) with AUC ≥ 0.92. Next, the four “core” and ten “sensitivity” datasets (Table 1) were individually normalized, as follows. (1) The mean expression level of each RNA transcript was calculated across all samples in that dataset. (2) The expression level of this transcript in each sample was then adjusted by subtracting its mean value. (3) All expression values were then scaled to unit variance. This procedure was performed for every transcript in each individual dataset. All 14 viral datasets were then merged into a single expression matrix.
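The per-dataset normalization and merge can be sketched as follows. Two choices here are assumptions, not details from the text: the population standard deviation is used for the unit-variance scaling, and only transcripts present in every dataset are kept in the merged matrix. The expression values are invented.

```python
from statistics import mean, pstdev


def standardize(values):
    """Centre on the mean, then scale to unit (population) variance."""
    m = mean(values)
    centered = [v - m for v in values]
    sd = pstdev(centered)
    if sd == 0:
        return centered  # constant transcript: leave centred at zero
    return [c / sd for c in centered]


def normalize_dataset(dataset):
    """Apply the standardization to every transcript within one dataset."""
    return {gene: standardize(vals) for gene, vals in dataset.items()}


def merge(datasets):
    """Concatenate samples across datasets for transcripts shared by all."""
    shared = set.intersection(*(set(d) for d in datasets))
    return {g: [v for d in datasets for v in d[g]] for g in sorted(shared)}


# Invented values on two very different measurement scales.
ds_a = {"ISG15": [2.0, 4.0, 6.0]}
ds_b = {"ISG15": [10.0, 20.0]}
merged = merge([normalize_dataset(ds_a), normalize_dataset(ds_b)])
print(merged)
```

Standardizing each dataset before merging puts values from different platforms on a common scale, so that no single dataset dominates the combined matrix.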
Specificity screen with independent GEO datasets
To ensure that candidate transcript pairs were associated uniquely with a viral host response and not a host response due to confounding phenotypes, they were individually assessed against 19 “specificity” datasets. The specificity datasets were derived from bacterial-positive patients, some of whom were classified as septic (GSE3341, GSE16129, GSE40396), patients with SIRS (GSE40012), healthy subjects ranging in age from childhood to nonagenarian (GSE40366), patients with inflammation not associated with positive viral infection (GSE42834, GSE17755, GSE19301, GSE47655, GSE38485, GSE36809, GSE29532, GSE61672), neonatal and pediatric bacterial sepsis patients (GSE25504, GSE30119, GSE6269), patients with anxiety (GSE61672), subjects administered dexamethasone (GSE46743), and healthy subjects displaying demographic confounders such as age, ethnicity and gender (GSE35846). Candidate transcript pairs having AUC >0.80 in more than 3 of the 19 specificity datasets were discarded. A total of 473 candidate transcript pairs remained after this step.
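The discard rule of the specificity screen (AUC > 0.80 in more than 3 of the 19 specificity datasets) can be expressed as a short filter. The pair names and AUC values below are invented, and the function name is hypothetical.

```python
def passes_specificity(aucs, auc_cutoff=0.80, max_hits=3):
    """Keep a pair unless it exceeds the AUC cutoff in more than
    max_hits specificity datasets."""
    hits = sum(1 for a in aucs if a > auc_cutoff)
    return hits <= max_hits


# Invented AUCs for two candidate pairs across 19 specificity datasets.
candidates = {
    "PAIR_A": [0.55] * 17 + [0.85, 0.90],              # 2 datasets above 0.80
    "PAIR_B": [0.60] * 15 + [0.81, 0.85, 0.90, 0.95],  # 4 datasets above 0.80
}
kept = [name for name, aucs in candidates.items() if passes_specificity(aucs)]
print(kept)
```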
Final selection step
Finally, a greedy forward search was performed on the reduced pool of highest-ranked RNA transcript pairs according to previously described methods41. The end product of this search was the final pan-viral signature containing two upregulated and two downregulated RNA transcripts as a linear sum: (ISG15 + OASL) − (IL16 + ADGRE5).
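As a minimal illustration, the signature value for a single sample can be computed as below. The sketch assumes log2 expression values, as used in the pairwise search, and the numbers are invented.

```python
def pan_viral_score(log2_expr):
    """(ISG15 + OASL) - (IL16 + ADGRE5) for one sample."""
    up = log2_expr["ISG15"] + log2_expr["OASL"]
    down = log2_expr["IL16"] + log2_expr["ADGRE5"]
    return up - down


# Invented log2 expression values for one sample.
sample = {"ISG15": 9.0, "OASL": 7.5, "IL16": 8.0, "ADGRE5": 6.5}
score = pan_viral_score(sample)
print(score)
```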
Validation in independent GEO datasets
The pan-viral signature was then tested against 11 independent “validation” datasets (Table 3). These datasets were derived from six mammalian species (human, macaque, chimpanzee, pig, mouse and rat), all seven Baltimore groups, and various tissue types (blood, liver biopsies, in vitro primary hepatocytes), and included time course and vaccination studies in humans. It should be noted that differences in the y-axis scale (pan-viral signature value) between various studies, as indicated in figures in the text and Supplementary Material, result from differences in the various gene expression measurement platforms across studies.
Validation in independent clinical studies
The pan-viral signature received additional validation from two independent clinical studies, FEVER and GAPPSS, which were conducted on adult and pediatric patients respectively. Details of the FEVER study are provided in Supplementary Tables S1 and S2, Figure S6 and Text S1, and details of the GAPPSS study are provided in Supplementary Tables S3 and S4, Text S2, and the publication by Zimmerman et al.40 The GAPPSS study was an institutional review board-approved prospective, observational study (Seattle Children’s Hospital IRB #14761). Parental informed permission was obtained prior to sample and data collection. All sample and data collection was carried out in accordance with approved protocols and procedures. The FEVER study was also an institutional review board-approved prospective, observational study (UK National Research Ethics services reference number: 09/H0701/103). All participants provided written informed consent, prior to sample and data collection. All sample and data collection was carried out in accordance with approved protocols and procedures.
The FEVER study cohort consisted of adult patients presenting with fever to the Emergency Department, and then admitted to hospital. A comparison was made between those retrospectively diagnosed with a viral infection (n = 15), with bacterial sepsis (n = 55) or with infection-negative SI (n = 22). In the FEVER study, testing for viral infections was only performed on those patients suspected of a viral infection, and involved use of one or more single-virus diagnostic tests based on the clinician’s judgment and according to hospital procedures85 (e.g. PCR for influenza, serology for dengue, etc.). The GAPPSS study cohort consisted of pediatric intensive care patients retrospectively diagnosed with a viral infection (n = 5), bacterial sepsis (n = 25), or bacterial sepsis with a viral co-infection (n = 10), as well as infection-negative SI controls undergoing cardio-pulmonary bypass surgery (n = 29). All patients in the GAPPSS study, except for one bacterial sepsis patient who was omitted from the analysis, were tested for the presence of viral nucleic acid sequences in nasal swabs using the Biofire FilmArray Respiratory Panel (Biofire Diagnostics, Utah, USA). Supplementary Tables S3 and S4 present the relative gene expression values for ISG15, IL16, OASL, ADGRE5 derived from RNA-seq data for the FEVER and GAPPSS patients, respectively. For each of the two datasets (FEVER or GAPPSS), we represented the expression level of a gene of interest by Fragments Per Kilobase of transcript per Million mapped reads (FPKM)86. This measure of gene expression should be independent of whether the data are in the form of single-end reads (FEVER) or paired-end reads (GAPPSS).
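The FPKM measure can be sketched in a few lines. The counts below are invented, and the function is illustrative only, not the pipeline used for the FEVER or GAPPSS data.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Invented example: 500 fragments on a 2,000 bp transcript in a library
# of 20 million mapped fragments.
value = fpkm(500, 2000, 20_000_000)
print(value)
```

A read pair counts as one fragment, as does a single-end read, which is why the measure is expected to be comparable between the two library types.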
Combination of SeptiCyte™ LAB and pan-viral signature
To demonstrate the utility of a combined bacterial and viral host response assay, we analyzed GEO dataset GSE63990 using an 8-gene classifier consisting of the four pan-viral signature genes (IL16, ISG15, OASL, ADGRE5) combined with the four genes (CEACAM4, LAMP1, PLA2G7, PLAC8) from SeptiCyte™ LAB. The class labels used in GSE63990 were: bacterial infection, viral infection, and non-infectious illness. Line data from GSE63990 are presented in Supplementary Table S5. To assess whether a significant biological response exists from the eight genes, we performed a permutation test. Under this statistical framework the dependency between the feature space and the response (class labels) is broken, allowing us to understand the behavior of the model under the null hypothesis that the explanatory variables and response labels are independent. The model, in this case, consisted of a supervised Random Forest analysis43 constructed from 1000 trees and allowing √f features to be selected randomly at each split, where f = 8 represents the number of gene targets. The class labels were randomly permuted 2,000 times, which allowed for a 0.05 alpha level with a 0.01 precision87. Each permuted dataset was then modeled using Random Forests, and the multiclass log-loss of each null model was calculated to construct the null distribution, against which the result obtained with the true response labels was assessed.
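The logic of the permutation test can be sketched as follows. To keep the example self-contained, a simple nearest-centroid classifier with leave-one-out evaluation stands in for the Random Forest model; this is a substitution, not the method used in the study. The data, the number of permutations and the function names are likewise invented for illustration.

```python
import math
import random


def centroid_probs(train_x, train_y, x, classes):
    """Class probabilities from a softmax over negative squared distance
    to each class centroid (a stand-in for the Random Forest)."""
    scores = []
    for c in classes:
        rows = [r for r, y in zip(train_x, train_y) if y == c]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        scores.append(-sum((a - b) ** 2 for a, b in zip(x, centroid)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loo_log_loss(xs, ys):
    """Leave-one-out multiclass log-loss."""
    classes = sorted(set(ys))
    loss = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        probs = centroid_probs(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:],
                               x, classes)
        loss -= math.log(max(probs[classes.index(y)], 1e-15))
    return loss / len(xs)


def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Compare the observed log-loss with losses under shuffled labels."""
    rng = random.Random(seed)
    observed = loo_log_loss(xs, ys)
    null = []
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)  # breaks the feature-label dependency
        null.append(loo_log_loss(xs, shuffled))
    as_good = sum(1 for v in null if v <= observed)
    return observed, (as_good + 1) / (n_perm + 1)


# Invented two-feature data with three well-separated classes.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
      [3.0, 3.1], [3.2, 2.9], [2.9, 3.0],
      [6.0, 0.1], [6.1, 0.0], [5.9, 0.2]]
ys = ["bacterial"] * 3 + ["viral"] * 3 + ["non-infectious"] * 3
observed, p_value = permutation_p_value(xs, ys)
print(observed, p_value)
```

On the number of permutations: with 2,000 permutations, the binomial standard error of an estimated p-value near 0.05 is √(0.05 × 0.95 / 2000) ≈ 0.005, or roughly ±0.01 at two standard errors. This is one plausible reading of the stated 0.01 precision, offered here as an interpretation only.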
This work was funded by Immunexpress, Seattle Children’s Research Institute, and the National Institute for Health Research University College London Hospitals Biomedical Research Centre. An Australian provisional patent (AU2015/903986) has been submitted covering aspects of the work presented. We thank the anonymous reviewer and Editorial Board member for their constructive and helpful reviews.