Introduction

Respiratory syncytial virus (RSV) causes infections that range from the common cold to severe lower respiratory tract infection, which in some instances has a fatal outcome. Infants, the elderly and patients with underlying chronic disorders are especially prone to severe RSV infections1,2. In infants, RSV is the leading cause of lower respiratory tract infections (LRTI) and is responsible for 80% of acute bronchiolitis cases3. RSV infections pose a huge burden on society in terms of disease, logistics and socio-economic sequelae. There is an unmet need for an RSV vaccine: despite considerable research efforts, no licensed vaccine has been developed.

In industrialized countries, 1–5% of infants with RSV infection are hospitalized4,5,6,7. Some of these infants already suffer from severe disease upon admission, while others are admitted without severe symptoms, since the course of bronchiolitis is highly variable and the need for supportive care cannot be predicted8,9. Several risk factors for developing severe RSV disease in infants have been identified, including preterm birth, young age, sex and environmental factors like in-house smoking10. Notwithstanding these known risk factors, current medical practice does not allow accurate prediction of whether an infant will progress to severe RSV disease or could instead be sent home safely. Genomic technologies have contributed to the study of virus-host interactions, including virus discovery, pathogenesis studies, the design of antiviral strategies and the identification of biomarkers to support clinical management of infectious diseases11,12,13,14. For RSV infections, this has supported the characterization of vaccine-induced skewed host responses upon infection15,16. Mejias et al.17 recently used blood transcriptome profiles obtained within 3 days of hospitalization to characterize the host response to RSV infection in infants compared with rhinovirus or influenza infections and identified transcriptional profiles that associate with RSV disease severity. However, a prognostic model for RSV severity based on gene expression profiles collected at admission to the hospital has not been developed.

In this study we aim to identify and validate a gene signature that discriminates severe RSV LRTI from less severe RSV LRTI that does not require advanced support. Such a signature, together with other clinical parameters, may improve prognosis and help identify less severe patients who could be safely sent home.

Material and Methods

Study design

Study subjects were recruited at Canisius Wilhelmina Hospital, Radboud University Medical Center, Nijmegen, and Erasmus Medical Center, Rotterdam, The Netherlands. Nasopharyngeal wash and blood samples were prospectively obtained from patients less than 2 years of age with viral bronchiolitis. Patient enrolment occurred 7 days a week and samples were taken within 24 hours after first contact with the hospital. Seventy-three percent of all eligible bronchiolitis patients agreed to participate in the study. The major reasons for non-inclusion were limited parental availability to sign consent and hesitancy about the venipuncture. Only patients with an RSV infection, as retrospectively determined by PCR, were included in the study. Exclusion criteria were: immunodeficiency, systemic steroid treatment in the previous 2 weeks, blood transfusion, congenital heart disease and chronic lung disease. A Tempus tube (Tempus™, Applied Biosystems, Austria) and a sodium heparin tube were each filled with 3 ml of blood. Medical history, demographic and clinical data were collected from medical records and questionnaires. The (hospitalized) patients were followed until recovery and were retrospectively classified as: mild for children without hypoxia, moderate for patients requiring supplemental oxygen (oxygen saturation <90% for ≥10 minutes) and severe for children requiring mechanical ventilation due to apnea, exhaustion and/or respiratory failure. Recovery samples were obtained after 4–6 weeks, during home visits. Blood samples were also obtained from healthy controls (HC) without underlying diseases or medication who underwent elective surgery.

Study approval

The study protocol was approved by the Regional Committees on Research involving Human Subjects of Arnhem-Nijmegen and Rotterdam, and the study was conducted in accordance with the principles of the Declaration of Helsinki. Written informed consent was obtained from the parents of all children prior to inclusion in the study.

Sample processing and blood transcriptome profiling

Inclusion of patients and sample collection were performed by a single MD at the hospitals. Multiplex RT-PCR was performed to test the nasopharyngeal washes for 15 different viral pathogens, as previously described18. Blood was collected in Tempus tubes for immediate stabilization of RNA and subsequently stored at −80 °C. Total RNA was isolated from each blood sample, processed, assessed, labelled and hybridized to a single Affymetrix Human Genome U133 Plus 2 GeneChip; image analysis was performed in the same lab and by one technician, as described in supplementary material. The raw data have been deposited in the ArrayExpress database under accession number E-MTAB-5195.

Data preprocessing

Microarray data was preprocessed using R 3.1.2 (ref. 19) and Bioconductor20. After initial quality control and VSN normalization (to render samples comparable), probeset (a combination of multiple probes) summarization was performed by median polish21,22. Unless otherwise stated, all probesets/genes present on the Affymetrix GeneChip were used for data analysis. Samples were labelled and hybridized in two batches, which did not correspond to any biological variable because samples were randomly assigned to the batches. The normalized expression values were adjusted for a batch effect (see supplementary Fig. S1) using ComBat23. Additionally, we assessed the confounding effects of the clinical parameters age and sex on the gene expression–severity relationship using “biasograms”24.
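
To make the probeset summarization step concrete, the following is a minimal Python sketch of Tukey's median polish applied to one probeset (rows are probes, columns are samples, values are normalized log-scale intensities). It is an illustration only and not the Bioconductor implementation used in the study; the function names `median_polish` and `summarize_probeset`, the iteration limit and the tolerance are assumptions made for this example.

```python
from statistics import median


def median_polish(table, n_iter=10, tol=1e-9):
    """Tukey's median polish: table = overall + row effect + column effect + residual."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the row medians out of the residuals into the row (probe) effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            for j in range(n_cols):
                resid[i][j] -= row_med[i]
        # Re-centre the column effects and move their median into the overall term.
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the column medians out of the residuals into the column (sample) effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Re-centre the row effects in the same way.
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


def summarize_probeset(table):
    """One expression value per sample: overall level plus the sample (column) effect."""
    overall, _, col_eff, _ = median_polish(table)
    return [overall + c for c in col_eff]
```

For a purely additive table such as `[[8, 9, 11], [6, 7, 9], [10, 11, 13]]`, the residuals vanish and `summarize_probeset` returns one value per sample. Probe-specific offsets are absorbed by the row effects, which is why the summary is robust to individual outlying probes.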

Differential expression analysis

To obtain a global view of the blood transcriptome changes in response to RSV infection (i.e. to evaluate whether whole transcriptome changes associate with severity), an exploratory principal component analysis (PCA) was performed on the age by sex standardized data. Next, a differential expression (DE) analysis was performed on the normalized, batch-adjusted data, controlling for an age by sex effect, using empirical Bayes linear models25 implemented in the R package limma26. Details of the models are found in supplementary material. We controlled for multiple testing via the false discovery rate (FDR) using the Benjamini and Hochberg procedure27. Gene set enrichment analysis was performed using Ingenuity Pathway Analysis (IPA, www.qiagen.com/ingenuity).
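
As an illustration of the multiple-testing step, the sketch below computes Benjamini and Hochberg adjusted p-values in plain Python and then keeps transcripts that pass both an FDR threshold and an absolute fold change threshold. It is not the limma code used in the study; the function names are hypothetical, and the example assumes that the fold change cutoff of 2 is on the linear scale, which corresponds to an absolute log2 fold change of 1.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_de(p_values, log2_fc, fdr=0.05, min_abs_fc=2.0):
    """Indices of transcripts with adjusted p <= fdr and |fold change| >= min_abs_fc."""
    adjusted = benjamini_hochberg(p_values)
    min_log2 = math.log2(min_abs_fc)  # assumed: cutoff given on the linear scale
    return [
        i
        for i in range(len(adjusted))
        if adjusted[i] <= fdr and abs(log2_fc[i]) >= min_log2
    ]
```

With raw p-values `[0.01, 0.04, 0.03, 0.5]`, only the first transcript keeps an adjusted p-value below 0.05, so a transcript with a raw p-value of 0.03 or 0.04 would no longer be called DE after correction.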

Identification and evaluation of prognostic biomarkers

Since we are interested in identifying RSV-infected infants that will progress to a severe stage after presentation to the hospital, we grouped mild and moderate samples and aimed to separate these from samples of infants who presented with or progressed to severe disease after hospitalization. We chose to utilize probabilistic predictors (to predict the chance that an RSV-infected infant becomes severe) because in clinical applications probabilities are more informative than absolute yes or no predictions28. Several probabilistic predictors exist in the literature and their performance depends on the type of data to which they are applied29. Using the results of refs 29,30 and observed correlations in the data, three probabilistic classification functions that could be optimal for this data were chosen as described in supplementary material. These functions were support vector machines (SVM)31, shrunken centroids discriminant analysis (SCDA)32 and random forest (RF)33.

For each classification function, the experimental data was split into a learning set and a test set using leave-one-out cross-validation (LOOCV). Cross-validation reduces optimistic bias by ensuring that our models are evaluated on data that was not used to construct them. Most probabilistic classification functions require hyper-parameters to perform variable selection among the huge number of variables (probesets). Usually, the best values for these hyper-parameters are also determined by cross-validation. Thus, the parameter(s) of the function were optimized using an inner loop of five-fold cross-validation on the learning set. Next, a prognostic model was built with the optimal parameter(s) on the entire learning set and evaluated with the test set, as described in supplementary material. The R packages CMA34, e1071 (ref. 35), pamr36 and randomForest37 were utilized for class prediction. The best calibrated and refined function amongst the three functions was selected and its performance evaluated using the area under the receiver operating characteristic (ROC) curve (AUC). Finally, the transcripts that maximized the binomial log-likelihood function with the leave-one-out cross-validated data were retained as a gene signature from the selected function, as described in supplementary material.
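
The nested scheme described above (an outer leave-one-out loop for evaluation, an inner five-fold loop for hyper-parameter tuning) can be sketched in plain Python as follows. This is a structural illustration only: a k-nearest-neighbour probability estimate stands in for the SVM, SCDA and RF functions of the study, the Brier score stands in for the tuning criterion, and all function names and the grid of `k` values are assumptions made for this example.

```python
import random


def knn_prob(train_x, train_y, x, k):
    """Stand-in probabilistic classifier: share of severe (1) labels among k nearest samples."""
    dist = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_x, train_y)
    )
    nearest = dist[:k]
    return sum(y for _, y in nearest) / len(nearest)


def brier(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def inner_cv_choose_k(x, y, k_grid, n_folds=5, seed=0):
    """Inner loop: pick k by five-fold cross-validation on the learning set only."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    best_k, best_score = None, float("inf")
    for k in k_grid:
        probs, labels = [], []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            tr_x = [x[i] for i in train]
            tr_y = [y[i] for i in train]
            for i in fold:
                probs.append(knn_prob(tr_x, tr_y, x[i], k))
                labels.append(y[i])
        score = brier(probs, labels)
        if score < best_score:
            best_k, best_score = k, score
    return best_k


def nested_loocv(x, y, k_grid=(1, 3, 5)):
    """Outer loop: each sample is predicted by a model that never saw it."""
    predictions = []
    for i in range(len(x)):
        learn_x = x[:i] + x[i + 1:]
        learn_y = y[:i] + y[i + 1:]
        k = inner_cv_choose_k(learn_x, learn_y, k_grid)  # tuned without sample i
        predictions.append(knn_prob(learn_x, learn_y, x[i], k))
    return predictions
```

The point of the design is that the left-out sample influences neither the hyper-parameter choice nor the fitted model, so the resulting probabilities can be used for an approximately unbiased performance estimate.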

Comparison of biomarkers to clinical parameters

Age and sex are readily available clinical parameters that have been determined to be associated with RSV disease severity38. To assess the gain attained with a genomic model over a model with these clinical parameters, and the effect of standardization, the leave-one-out cross-validated predicted probabilities of progressing to severe disease for all samples were transformed to genomic scores (a genomic score is a single measure of the genome of a sample as predicted by a model) for models with unstandardized and standardized data. Logistic regression models (see supplementary material) were then fitted with the genomic scores and/or clinical parameters and their AUCs compared.
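
Because the models are compared by their AUCs, a short sketch of how an AUC can be computed from predicted scores may be helpful. The version below uses the pairwise (Mann-Whitney) formulation: the AUC equals the fraction of severe/non-severe pairs in which the severe sample receives the higher score, with ties counted as one half. The function name `auc` is an assumption for this example, and this is not the code used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 labels (1 = severe)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # severe sample ranked above non-severe sample
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

For example, `auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` gives 0.75, since three of the four severe/non-severe pairs are ordered correctly.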

Validation of biomarkers

For an independent validation, a subset of the Illumina RSV data of Mejias et al.17 was used. Since the experimental data and validation data were obtained using different platforms, we linked the data using gene symbols and applied a cross-platform transformation (to render gene expression comparable across datasets) as described in supplementary material. The transformed data was supplied to our prognostic model for predictions of probabilities of severity. As a confirmatory analysis of how well our prognostic model performs, we built and evaluated a prediction model with the chosen function (the same function used to build our prognostic model) on the entire Illumina data and compared our validation performance to the performance from this (unrestricted) data.
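
The linking step can be illustrated with a small Python sketch that collapses probe-level data to gene symbols and keeps the genes present on both platforms. The averaging of probes that map to the same gene is an assumption made for this example, as are the function names and the probe identifiers; the actual aggregation and the cross-platform transformation are described in the supplementary material and are not reproduced here.

```python
from collections import defaultdict


def collapse_to_genes(expr, probe_to_symbol):
    """Average all probes mapping to the same gene symbol (assumed aggregation rule).

    expr maps a probe id to a list of expression values, one per sample.
    """
    grouped = defaultdict(list)
    for probe, values in expr.items():
        symbol = probe_to_symbol.get(probe)
        if symbol:  # skip probes without a gene symbol annotation
            grouped[symbol].append(values)
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in grouped.items()
    }


def common_genes(genes_a, genes_b):
    """Gene symbols measured on both platforms, in a stable order."""
    return sorted(set(genes_a) & set(genes_b))
```

A usage example with hypothetical probe ids: `collapse_to_genes({"p1": [2.0, 4.0], "p2": [4.0, 6.0]}, {"p1": "MMP8", "p2": "MMP8"})` yields a single `MMP8` row with the per-sample means.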

Results

Study subjects and sampling

Thirty-nine infants hospitalized with acute RSV bronchiolitis were included in the study. Nasopharyngeal wash and whole blood samples for mRNA profiling were collected within 24 hours of hospital admission. Table 1 presents the characteristics of the study subjects. As expected, patients with the most severe course of RSV bronchiolitis were significantly younger than those with a relatively mild or moderate course of this disease. The variables related to disease severity (duration of oxygen supply and length of stay in the hospital) were highest in the severe group, with ventilation indicating the method by which oxygen was supplied. The proportion of co-infections was lower in severe patients than in the other severity categories. There were no differences in the occurrence of other known risk factors.

Table 1 Patient characteristics (n represents the number of samples per group).

Age and sex as confounders of gene expression–severity relationship

Figure 1a,b illustrate the confounding effects of sex and age, respectively, on the gene expression-disease severity relationship. These figures show that whereas age is negatively correlated with severity, sex is uncorrelated with severity. Nevertheless, the high positive/negative correlations of a considerable number of transcripts with sex and age, as well as with severity, indicate a confounding effect of these variables on the expression-severity relationship of these transcripts and thus warrant adjustment. Figure 1d,e illustrate the “biasograms” after an age by sex standardization. These figures show that standardization has no effect on severity-correlated transcripts but, as expected, transcripts that were originally correlated with age and sex become uncorrelated. A positive correlation of age with sex, which signifies an age by sex interaction as a potential confounder of the gene expression-severity relationship, was also observed (Fig. 1c) and eliminated after standardization (Fig. 1f).

Figure 1

Confounding effect of Sex, Age and Age by Sex on the gene expression–severity relationship before (a,b,c) and after (d,e,f) an age by sex standardization.

The blue and green lines represent the clinical variables, and the cosine of the angle between the lines represents the correlation between them (Sex is not correlated with Severity; Age is negatively correlated with Severity, i.e. younger infants develop severe disease; and Age is positively correlated with Sex, i.e. females are older). The cloud of points represents the transcripts and their correlations with both variables: most transcripts are uncorrelated with the variables (yellow cloud), while a considerable number (black cloud) are correlated with Severity, Sex, Age and Age*Sex. The associations between the transcripts and Sex, Age or Age*Sex are significantly eliminated after standardization, while those with Severity are retained.

Global blood transcriptome profiles associate with RSV disease severity

Figure 2 illustrates a PCA on the whole transcriptome. The first principal component accounts for 25% of the variance in the transcriptomes and associates with disease severity. Transcriptome profiles of HC and recovery samples group together on the first principal component and are located opposite to the profiles of severe infants. The distinct groups do not form discrete clusters in the PCA but shift gradually from mild through moderate to severe, with considerable overlap. This shows that the blood mRNA profiles substantially capture the severity of lower respiratory tract RSV infection.

Figure 2

Global blood transcriptome profiling with principal component analysis: the first principal component (PC1) accounts for 25% of the variance in the dataset and associates with disease severity.

This can be observed as a shift from healthy controls and recovery cases (left) through mild and moderate to severe cases (right), with considerable overlap.

Number of differentially expressed genes relates to RSV disease severity

Table 2 presents the results of the differential gene expression analysis and reveals that the number of DE transcripts increases with disease severity. No DE transcript was identified between mild and HC samples when applying an FDR of 5% and an absolute fold change (FC) threshold of 2. However, 17 and 221 transcripts were DE between moderate and severe versus HC, respectively. Interestingly, all transcripts that are DE in the moderate class are also DE in the severe class, with larger FC. About 90% of these DE transcripts are up-regulated. Comparison of HC with recovery samples revealed a single down-regulated transcript. Moderate versus mild yielded no DE transcript, whereas severe versus mild and severe versus moderate yielded 178 and 49 DE transcripts, respectively. Lastly, 95 transcripts were DE between severe and combined mild/moderate samples.

Table 2 Number of differentially expressed transcripts for each contrast at FDR of 5% and absolute fold change cutoff of 2.

RSV induced blood transcriptome profiles reveal an inflammatory response

Figure 3 shows that multiple relevant categories of molecular and cellular functions are significantly enriched when comparing severe to HC samples. With “Cell-to-Cell Signaling and Interaction” as the top category, gene sets related to the activation of several types of immune cells, including lymphocytes, granulocytes and specifically neutrophils, are most significantly enriched. In addition, gene sets that are involved in the migration and tissue infiltration of these same activated cell types are most significantly enriched within the category “Cellular Movement”, which ranks third in this figure. Finally, several high-ranking molecular and cellular function categories and their underlying gene sets indicate that the immune cells involved are strongly proliferating. A list of genes involved in each of these pathways is presented in supplementary Table S1. Taken together, blood transcriptome changes in RSV disease reveal a typical inflammatory response to a viral infection.

Figure 3

Ingenuity pathway analysis (IPA) Molecular and Cellular functions gene set analysis for severe vs healthy control contrast.

Early blood transcriptome changes to predict a severe outcome of RSV infection

To construct a predictive model, we combined mild and moderate cases into a single group, and three probabilistic classification functions were chosen based on supplementary Fig. S2 and the results of Jong et al. and Kim and Simon29,30. Using these functions, classifiers were built and evaluated using LOOCV on the experimental data. SVM was chosen as the best calibrated and refined function, as shown in supplementary Fig. S3, and was henceforth used for all analyses. The LOOCV predicted probabilities from SVM were used to evaluate its performance and are plotted in Fig. 4a against the true RSV status as retrospectively determined. This figure shows that 5 samples out of 39 were misclassified at a 50% cutoff and that, when applying a proposed uncertainty band of 30–70%, just one false negative is observed. Evaluation of the clinical characteristics of the single false negative patient, as well as of those patients with uncertain predictions (plotted within the proposed uncertainty band), did not reveal any recognizable pattern. The false negative patient was infected with RSV only, and only a single patient plotted within the uncertainty band had RSV plus other virus(es). Figure 4b presents the corresponding ROC curve from the LOOCV predicted probabilities, and the AUC of 0.966 demonstrates the high discriminative power of our prognostic model.
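
The two decision rules mentioned above, a single 50% cutoff and a 30–70% uncertainty band, can be expressed as a short Python sketch. The function names and the label strings are assumptions for this illustration, and how probabilities exactly on a band boundary are handled is a choice made here, not something stated in the text.

```python
def band_call(prob_severe, low=0.30, high=0.70):
    """Three-way call: confident non-severe, confident severe, or uncertain."""
    if prob_severe < low:
        return "non-severe"
    if prob_severe > high:
        return "severe"
    return "uncertain"  # inside the band (boundaries included by assumption)


def misclassified_at_cutoff(probs, labels, cutoff=0.5):
    """Number of samples whose call at a single cutoff disagrees with the true label."""
    return sum((p >= cutoff) != bool(y) for p, y in zip(probs, labels))
```

With an uncertainty band, samples whose probabilities fall between the two limits are withheld from an automatic call, which trades a smaller number of confident errors for a group of patients that still needs clinical judgement.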

Figure 4

Internal validation of gene signature.

(a) shows the samples’ predicted probabilities of being severe. (b) shows the ROC curve and the AUC for the predicted probabilities. The AUC value of approximately 1 indicates how accurately our signature performs on this internal validation set. (c) shows that a genomic model from the age by sex standardized data outperforms that from the unstandardized data. In addition, there is a significant difference between a model with clinical parameters and one with a genomic score, and just a slight improvement when both are included in a model.

Genomic biomarkers outperform clinical parameters age and sex

Figure 4c presents the ROC curves from logistic models of clinical parameters and/or genomic scores. The genomic score model from the standardized data (AUC = 0.966) outperforms that from the unstandardized data (AUC = 0.915). In addition, there is a significant difference between a model with age and sex only (AUC = 0.833) and one with a standardized (AUC = 0.966) or unstandardized genomic score (AUC = 0.915). Whereas combining the clinical parameters with the standardized genomic score yields a slight improvement (AUC = 0.971), combining them with the unstandardized genomic score yields no improvement (AUC = 0.911), indicating that standardization indeed completely removed the age-by-sex effect from the gene expression data.

An 84-gene expression signature predicts absence of disease progression

To extract the prognostic signature, we selected the top transcripts maximizing the binomial log-likelihood function using LOOCV predicted probabilities, as illustrated in supplementary Fig. S4. This figure depicts a 1-SE maximum of 95 transcripts corresponding to 84 unique genes, which are displayed in Table 3. Of the 95 transcripts constituting the prognostic signature, 81 (85.26%) were found to be significantly DE between severe and non-severe patients (FDR cutoff of 5%, supplementary Table S2). The inclusion of non-DE transcripts in the classification model is expected, since not only DE genes are instrumental in class discrimination, as illustrated by the two-dimensional scenario in supplementary Fig. S5.
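
For reference, the binomial log-likelihood of a set of predicted probabilities can be computed as in the sketch below. It only shows the scoring function; the procedure for ranking transcripts and applying the 1-SE rule is described in the supplementary material and is not reproduced here. The function name and the clipping constant `eps` are assumptions for this example.

```python
import math


def binomial_log_likelihood(probs, labels, eps=1e-12):
    """Sum over samples of y*log(p) + (1 - y)*log(1 - p), with y = 1 for severe."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Higher (less negative) values indicate predicted probabilities that are both better calibrated and more confident, so candidate signatures of different sizes can be compared by evaluating this function on their cross-validated probabilities.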

Table 3 Gene signature of 95 transcripts (probesets) corresponding to 84 genes.

Performance of the genomic signature retained on an independent dataset

For an independent validation, a subset of the Illumina RSV data of Mejias et al.17 was used. Since the experimental data and validation data were obtained using different platforms, we linked the data using gene symbols and applied a cross-platform transformation (to render gene expression comparable across datasets), as shown in supplementary Fig. S6 and extensively described in the supplementary material. Figure 5a presents the predicted probabilities of being severe on the validation data using the 75 of the 84 genes in our prognostic signature that were common to both the experimental and validation datasets, while Fig. 5b presents the LOOCV predicted probabilities from SVM on the entire Illumina data. Both figures show that using the unrestricted data leads to more certain probabilities and slightly improves specificity compared to our signature. Nonetheless, Fig. 5c illustrates a large agreement between the predicted probabilities of the two models, while Fig. 5d reveals that both models perform alike, as demonstrated by the AUCs of 0.858 and 0.856 for our signature and the unrestricted model, respectively. To assess the concordance of the expression patterns of our signature in both datasets (Affymetrix and Illumina), we plotted the log2 fold changes of the 75 common genes, as shown in Fig. 6. This figure shows a high concordance in the direction of expression across datasets. Where there are slight differences, these differences are not significant, as shown by a non-significant p-value in at least one of the datasets.

Figure 5

Predicted probabilities of being severe from the validation data against true RSV status using our diagnostic signature (a) and LOOCV on unrestricted data (b). (c) illustrates the agreement of the predictions from both models: green regions are perfect agreement, blue regions are disagreements at a 50% cutoff and red regions are disagreements at a 30–70% uncertainty band. Finally, (d) presents the ROC curves and AUCs from both models, illustrating similar AUC values.

Figure 6

Log2 Fold change between Severe vs Non-Severe infants for 75 common genes in Affymetrix and Illumina datasets.

Red represents up-regulation while green represents down-regulation, and the significant FDR-adjusted p-values are placed in the cells. There is a large overlap in the direction of expression across datasets. Where there are slight differences, these differences are not significant, as shown by a non-significant p-value in at least one of the datasets.

Discussion

RSV infection in infants may cause life-threatening disease. No vaccine is yet available and triage of patients is challenging since RSV infections may rapidly progress to severe disease. Nor is there a reliable prognostic model to predict which RSV patients will not progress to severe disease and could be safely sent home. Thus, clinical care is symptom-based and a significant proportion of RSV-infected infants is hospitalized for observation purposes. We have provided an 84-gene signature that discriminates hospitalized infants with less severe RSV infection from those with severe RSV disease. The identified signature yielded a LOOCV AUC of 0.966 on the experimental data, was independently validated with an AUC of 0.858, and might serve as a basis to develop a prognostic test for clinical management of RSV disease.

In line with epidemiological observations38 and the observations of Mejias et al.17, we showed the confounding effects of age and sex on the gene expression-severity relationship for RSV disease. Studies in any RSV patient cohort with a naturally occurring “skewed” distribution of age and sex can be standardized for these parameters. By adjusting for an age-by-sex effect in our analyses, we obtained age-by-sex independent results that can be applied to any patient(s). The high performance of our signature on the age and sex matched validation data signifies the age-by-sex independence and robustness of this signature. Fewer co-infections were observed in severe patients (Table 1). A similar trend has been described previously39. In our cohort study we did not take co-infections into account since no consistent association between the occurrence or absence of co-infections and RSV disease severity has been reported39,40,41,42,43. Furthermore, we aimed at the identification of a gene signature in a natural “real-life” cohort of patients not stratified according to age or occurrence of co-infections.

We hypothesized that changes in blood cell type distribution and/or mRNA expression changes of the circulating cells collected from peripheral blood reflect local lung host response characteristics that associate with disease severity. PCA and DE analysis indeed revealed significant changes in the transcriptome profile of whole blood. Gene set analysis further shows that relevant processes are monitored including the activation, migration and tissue infiltration of lymphocytes, granulocytes and neutrophils. Individual DE genes in severe RSV disease revealed overexpression of the neutrophil associated genes MMP8 and MMP9, which have previously been related to severe RSV disease44. ARG1 and CHI3L1 that have been linked to alternatively activated macrophages in a mouse model for vaccine enhanced RSV disease16 were also found to be strongly up-regulated. This suggests that the collected blood transcriptome profiles indeed reflect local lung host response.

In our class prediction analysis, three functions were evaluated and the best was chosen. While it has been pointed out (refs 45,46,47,48) that selecting a minimal-error classifier leads to a selection bias that should be corrected, the literature does not stipulate a selection bias when using calibration and refinement scores as evaluation measures. Nevertheless, we employed the nested cross-validation correction of selection bias46 in our model building procedure by splitting our experimental data into learning and test sets, with an inner loop split on the learning set for parameter optimization. Although it is known to have high variance, we utilized leave-one-out cross-validation for the test set because it yields an approximately unbiased estimate of the true (expected) prediction error49 and because we were interested in the individual sample predicted probability of being severe and not solely in the expected prediction error. Where we were interested in the expected prediction error, as in the optimization of parameters, we utilized five-fold cross-validation as recommended by Breiman and Spector50. To validate the identified signature, an independent dataset generated on a different platform was used. This validation faced (i) several sources of variability between our experimental data and the validation data, stemming from, but not limited to, array platforms and different clinical cutoffs of RSV severity statuses, (ii) a different time of profiling, 1–3 days after hospitalization, and (iii) loss of information due to a reduction of the signature, because some transcripts had no counterpart on the Illumina platform and multiple transcripts were aggregated to genes. Despite this, our signature yielded an AUC of 0.858, which was comparable to the accuracy (AUC of 0.856) obtained when using the Illumina data (validation set) as the experimental set. Cross-platform validation is rare due to a lack of guidance on how it can be done reliably. We presented a cross-platform validation procedure.

The RSV patients enrolled in the study displayed varying disease severities but were all hospitalized, thus representing a severe disease enriched subset of RSV-infected infants. The patients enrolled, however, also represent a natural cohort that includes a significant number of patients who eventually did not require extensive medical care and could have been discharged home. Since the blood samples were collected soon after hospital admission, the generated blood transcriptomes and the derived gene signature may serve as a basis for the development of a novel genomic tool to support clinical management of RSV disease, including triage of patients presenting at the hospital, provided that a rapid (real time) gene test can be developed. Larger transcriptome data sets are, however, required to construct predictive models that may also allow for discriminating mild from moderate and moderate from severe cases. Ultimately, one would like to extend the RSV biomarker program to samples from earlier time points (e.g. obtained when visiting a general practitioner) and to samples collected from patients infected by other (respiratory) infectious agents or with other pathological conditions (comorbidities) in order to identify specific respiratory viral prognostic biomarkers. To this end, a novel gene signature has to be developed using a much larger early blood sample cohort. The current results support the development of diagnostic tests for personalized medicine that provide information not only on the causative infectious agent, but also on the disease severity that may be expected.

Additional Information

How to cite this article: Jong, V. L. et al. Transcriptome assists prognosis of disease severity in respiratory syncytial virus infected infants. Sci. Rep. 6, 36603; doi: 10.1038/srep36603 (2016).
