Transcriptome assists prognosis of disease severity in respiratory syncytial virus infected infants

Respiratory syncytial virus (RSV) causes infections that range from common cold to severe lower respiratory tract infection requiring high-level medical care. Prediction of the course of disease in individual patients remains challenging at the first visit to the pediatric wards and RSV infections may rapidly progress to severe disease. In this study we investigate whether there exists a genomic signature that can accurately predict the course of RSV. We used early blood microarray transcriptome profiles from 39 hospitalized infants that were followed until recovery and of which the level of disease severity was determined retrospectively. Applying support vector machine learning on age by sex standardized transcriptomic data, an 84 gene signature was identified that discriminated hospitalized infants with eventually less severe RSV infection from infants that suffered from most severe RSV disease. This signature yielded an area under the receiver operating characteristic curve (AUC) of 0.966 using leave-one-out cross-validation on the experimental data and an AUC of 0.858 on an independent validation cohort consisting of 53 infants. A combination of the gene signature with age and sex yielded an AUC of 0.971. Thus, the presented signature may serve as the basis to develop a prognostic test to support clinical management of RSV patients.


Study design
Nasopharyngeal wash (using Cheiron Dynamic II apparatus) and blood samples were prospectively obtained from patients less than 2 years of age with a bronchiolitis. Patient enrolment occurred 7 days a week and samples were taken within 24 hours after first contact with the hospital. Only patients with an RSV infection, as determined by PCR retrospectively, were included in the study. Exclusion criteria were: immunodeficiency, systemic steroid treatment in the previous 2 weeks, blood transfusion, congenital heart and chronic lung disease. Patients were followed until recovery and were retrospectively classified as: mild for children without hypoxia, moderate for patients requiring supplemental oxygen (oxygen saturations <90%, ≥10 minutes) and severe for children requiring mechanical ventilation due to apnea, exhaustion and/or respiratory failure. Recovery samples were obtained after 4-6 weeks, during home visits. Blood samples were obtained from healthy patients without underlying diseases or medication subjected to elective surgery.

Sample processing and blood transcriptome profiling
Multiplex RT-PCR was performed to test the nasopharyngeal washes on 15 different viral pathogens, as previously described (E1). Blood was collected in Tempus tubes and stored at -80°C. Total RNA was isolated from the blood using Tempus Spin RNA isolation kit (Applied Biosystems, Bleiswijk, The Netherlands). Globin mRNA was removed from total RNA preparations using the Globinclear kit (Life Technologies). RNA concentrations and OD 260/280 ratios were measured with the NanoDrop ND-1000 UV-VIS spectrophotometer (NanoDrop Technologies, Wilmington, USA). Assessment of RNA quality and purity was performed with the RNA 6000 Nano assay on the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). RNA (200ng) was labelled using the MessageAmp Premier RNA Amplication kit (Applied Biosystems) and hybridized to Human Genome U133 plus 2 gene chips (Affymetrix), according to the manufacturer's recommendations. Image analysis was performed using GeneChip Operating Software (Affymetrix). Microarray Suite version 5.0 software (Affymetrix) was used to generate .dat and .cel files.

Differential expression analysis
For = 1 … samples and = 1, … , transcripts (probesets) the linear model for each probeset was as follows: Where Y i is a vector of the expression values of probeset i, Status is an indication matrix of the samples by RSV categories and β i is a vector of coefficients of the RSV categories for probeset i. The DE analysis was then performed by comparing the contrasts of Status for each probeset.

Identification and evaluation of prognostic biomarkers
Given that we are interested in genomic prognostic biomarker(s), we retained a sex by age standardized dataset by fitting the linear models in equation (1) above using limma (E2) and from those models the sex by age standardized expression set was: As class labels, we combined the mild and moderate groups as one class (class 0) with focus on predicting severe cases (class 1) from others. For clinical application, a prediction of the probability to progress to severe disease is of primary interest than direct classification (E3). Therefore, using results of (E4, E5)  For each classification function, a prediction model was built and evaluated using leave one out cross validation with an inner loop of 5-fold cross validation for parameter(s) optimization as shown on Fig.   S6 step 1, optimizing the parameters by maximizing the binomial log-likelihood function. We evaluated the prediction models using calibration score (CS) and refinement score (RS) (E5) which is a decomposition of the Brier score. The calibration score expresses on one hand, how well the predicted probabilities agree with the true chances of patients and it is equal to zero in case of perfect agreement. On the other hand, the refinement score expresses how uncertain the predicted probabilities are; that is how close the predicted probabilities are to 0.5. The closer the predicted probabilities are to 0.5, the higher the uncertainty and the poorer the model. A good class prediction model has a CS and RS of zero. The best calibrated and refined function amongst the three classification functions ( The list of transcripts from the optimal parameter(s) was retained (Fig. S6, step 3) as a gene signature.

Logistic regression models
Let be the leave one out cross-validated predicted probability of sample to progress to severe state by a genomic signature, then the genomic score is given as: The general logistic regression model is then written as: where ( ) is the probability of sample to progress to severe state, a vector or matrix of predictive parameters (i.e. Genomic score, Age, and/or Sex) and is a vector of parameter(s) estimate(s). Let be the predicted value of sample from equation (3), the probability of that sample to progress to severe state is computed from its predicted value using the inverse logistic function as: ( ) = 1− . With these predicted probabilities, the AUCs were computed.

Validation of biomarkers
For an independent (external) validation, a subset of the Illumina RSV data set of (E9) was used. Since the experimental data and validation data are from different platforms, we opted to link the data using gene symbols. To achieve this, the signature transcripts were annotated to gene symbols and unannotated transcripts were eliminated if any (Fig. S6, step 4). The Illumina data was also annotated to gene symbols (Fig. S6, step 5) and the common genes between our annotated gene signature and the annotated Illumina data were extracted using common gene symbols as shown in step 6 of Fig. S6 and were referred to as the surrogate signature (step 7). The final expression set of the surrogate signature was computed by assigning the median expression value of a gene with multiple transcripts to the corresponding gene (step 8) and was referred to as unique surrogate signature. The final model was then built with the unique surrogate signature expression set from the Affymetrix data (step 9) and validated with the unique surrogate signature expression data from Illumina (step 12).
To be able to validate with the Illumina data, all the beads corresponding to the genes in the surrogate signature were retrieved (step 10) and multiple beads per gene were combined by calculating their median expression values (step 11). Since the expression values of a single gene might be measured on different scales across Illumina and Affymetrix platforms, we rescaled the Illumina data of the gene signature to same scale as Affymetrix data (step 12). Suppose for each gene i the expression scale in To transform an expression value of gene i in sample j from Illumina to Affymetrix scale, we use the following function: The scaled Illumina data was then predicted using our prognostic model ( For a confirmatory analysis of our validation performance, using the chosen classification function we built class prediction model on the entire Illumina data using LOOCV as shown on step 1 of Fig. S6.