Early and accurate diagnosis of stroke improves the probability of positive outcome. The objective of this study was to identify a pattern of gene expression in peripheral blood that could potentially be optimised to expedite the diagnosis of acute ischaemic stroke (AIS). A discovery cohort was recruited consisting of 39 AIS patients and 24 neurologically asymptomatic controls. Peripheral blood was sampled at emergency department admission, and genome-wide expression profiling was performed via microarray. A machine-learning technique known as genetic algorithm k-nearest neighbours (GA/kNN) was then used to identify a pattern of gene expression that could optimally discriminate between groups. This pattern of expression was then assessed via qRT-PCR in an independent validation cohort, where it was evaluated for its ability to discriminate between an additional 39 AIS patients and 30 neurologically asymptomatic controls, as well as 20 acute stroke mimics. GA/kNN identified 10 genes (ANTXR2, STK3, PDK4, CD163, MAL, GRAP, ID3, CTSZ, KIF1B and PLXDC2) whose coordinate pattern of expression was able to identify 98.4% of discovery cohort subjects correctly (97.4% sensitive, 100% specific). In the validation cohort, the expression levels of the same 10 genes were able to identify 95.6% of subjects correctly when comparing AIS patients to asymptomatic controls (92.3% sensitive, 100% specific), and 94.9% of subjects correctly when comparing AIS patients with stroke mimics (97.4% sensitive, 90.0% specific). The transcriptional pattern identified in this study shows strong diagnostic potential, and warrants further evaluation to determine its true clinical efficacy.
Stroke is currently the leading cause of disability and the fifth leading cause of death in the United States.1 It is well established that early and accurate diagnosis improves outcome by increasing the probability of successful intervention;2,3 however, the diagnostic tools currently available to clinicians for the identification of stroke have significant limitations.
Although neuroradiological imaging is the gold standard for diagnosis of stroke,4 it is inaccessible in the field and at the initial point of contact in emergency departments. Furthermore, such imaging techniques are often not immediately available in hospitals without dedicated stroke centres, such as smaller facilities and those which serve rural areas.5 As a result, crucial decisions regarding the triage of potential strokes by emergency department staff and emergency medical technicians are based on the assessment of overt patient symptoms using stroke recognition and severity scales such as the Cincinnati pre-hospital stroke scale (CPSS) and the National Institutes of Health stroke scale (NIHSS).4 In the hospital setting, the ability to identify stroke with such assessments is highly inconsistent, with an estimated sensitivity ranging from 44 to 85%, and specificity ranging from 64 to 98%.6 The sensitivity and specificity of these assessments are even lower in the pre-hospital setting,7 where the ability to quickly identify stroke facilitates the transfer of patients to stroke-ready hospitals, increasing the chances of appropriate treatment and positive outcome.8 Due to these current limitations, a rapidly measurable blood-based biomarker panel could be invaluable in informing pre-hospital and in-hospital decisions early in the acute phase of care, and could ultimately expedite access to interventional treatment.9
As a result, there has been a substantial push for the identification of stroke-associated peripheral blood biomarkers. The earliest stroke biomarker studies focused on the peripheral blood proteome, and countless protein-based biomarker panels have been evaluated to date. While a handful of these protein-based panels have demonstrated a strong ability to differentiate between stroke patients and healthy controls lacking the presence of cardiovascular disease (CVD) risk factors, a majority have failed to achieve specificities and sensitivities approaching 90% when tested against clinically relevant control groups.9,
Analysis of high-dimensional gene expression data using a pattern-recognition approach known as genetic algorithm k-nearest neighbours (GA/kNN) has been successfully used in a small number of cancer studies to identify diagnostically relevant biomarker panels with strong discriminatory ability.18,
While GA/kNN has proven robust in several applications in the field of cancer, it has yet to be utilised for biomarker discovery in the realm of cardiovascular disease (CVD). In this study, we applied the GA/kNN approach to analyse peripheral blood gene expression data generated via microarray to identify transcriptional patterns which could potentially be optimised for the detection of AIS in the acute phase of care.
To identify potential transcriptional biomarkers of AIS, we first recruited a discovery cohort consisting of 39 AIS patients and 24 neurologically asymptomatic controls. In terms of demographic and clinical characteristics, AIS patients were older than controls, and displayed a higher prevalence of CVD risk factors such as hypertension and dyslipidaemia (Table 1). Furthermore, AIS patients displayed a more substantial history of cardiac conditions such as myocardial infarction and atrial fibrillation, and a higher proportion of AIS patients reported currently taking antihypertensives and anticoagulants.
Peripheral whole blood was sampled from patients at emergency department admission, and genome-wide expression profiling was performed via microarray. Gene expression data were subjected to GA/kNN analysis, and genes were ranked based on the ability of their expression levels to discriminate between AIS patients and controls, according to the number of times they were selected as part of a near-optimal solution (Figure 1a). The expression levels of the top 50 genes identified by GA/kNN displayed a strong ability to discriminate between groups using kNN in leave-one-out cross-validation; a combination of just the top 10 ranking genes (ANTXR2, STK3, PDK4, CD163, MAL, GRAP, ID3, CTSZ, KIF1B and PLXDC2) was able to classify 98.4% of subjects in the discovery cohort correctly, with a sensitivity of 97.4% and a specificity of 100% (Figure 1b).
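The ranking procedure described above can be illustrated with a simplified sketch. It is not the C implementation of Li et al. used in this study: it assumes a single population, elitist selection and single-gene mutation, and the names `loo_knn_accuracy`, `evolve_solution` and `rank_genes`, the population size and the generation cap are all hypothetical. Only the chromosome length of five, the five nearest neighbours, the majority rule and the 0.97 cutoff come from the methods.

```python
import numpy as np
from collections import Counter


def loo_knn_accuracy(X, y, k=5):
    """Leave-one-out kNN accuracy with a simple majority vote (labels 0/1)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=2)           # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)           # a subject is never its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y[nearest].sum(axis=1)           # number of neighbours labelled 1
    preds = (2 * votes > k).astype(int)
    return float((preds == y).mean())


def evolve_solution(X, y, rng, chrom_len=5, pop_size=30, cutoff=0.97, max_gen=200):
    """Evolve gene subsets until one reaches the cutoff (or max_gen is hit)."""
    n_genes = X.shape[1]
    pop = [rng.choice(n_genes, size=chrom_len, replace=False) for _ in range(pop_size)]
    best, best_fit = pop[0], -1.0
    for _ in range(max_gen):
        fits = np.array([loo_knn_accuracy(X[:, c], y) for c in pop])
        order = np.argsort(fits)[::-1]
        if fits[order[0]] > best_fit:
            best, best_fit = pop[order[0]].copy(), float(fits[order[0]])
        if best_fit >= cutoff:
            break
        # Assumed scheme: keep the fitter half, refill with one-gene mutants.
        parents = [pop[i] for i in order[: pop_size // 2]]
        children = []
        for parent in parents:
            child = parent.copy()
            candidates = np.setdiff1d(np.arange(n_genes), child)
            child[rng.integers(chrom_len)] = rng.choice(candidates)
            children.append(child)
        pop = parents + children
    return best, best_fit


def rank_genes(X, y, n_solutions=20, seed=0, **ga_kwargs):
    """Count how often each gene appears in a near-optimal solution."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    rng = np.random.default_rng(seed)
    cutoff = ga_kwargs.get("cutoff", 0.97)
    counts = Counter()
    for _ in range(n_solutions):
        genes, fitness = evolve_solution(X, y, rng, **ga_kwargs)
        if fitness >= cutoff:                # only near-optimal solutions are counted
            counts.update(int(g) for g in genes)
    return counts.most_common()
```

Calling `rank_genes(X, y)` on a z-scored subjects-by-genes matrix returns `(gene index, selection count)` pairs sorted by frequency, which mirrors the ranking shown in Figure 1a in spirit only.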
To evaluate the robustness of our GA/kNN analysis in terms of its ability to select optimally discriminative genes, we compared the ability of the expression levels of the top 50 genes selected by GA/kNN to differentiate between stroke patients and controls to that of genes selected at random. Specifically, we compared the accuracy of GA/kNN-selected genes to the accuracy of 50 sets of 50 genes randomly generated from the total pool of gene expression data, as well as to the accuracy of 50 sets of 50 genes randomly selected from a subpool of genes that displayed greater than 1.7-fold differential regulation between groups. The top genes selected by GA/kNN performed significantly better than genes selected at random genome wide, as well as significantly better than genes selected at random from those which were differentially regulated greater than 1.7-fold (Figure 1c). Collectively, the results of this analysis, in combination with the levels of accuracy observed, suggest that our biomarker discovery strategy was effective at selecting genes with optimal diagnostic potential in terms of the subjects of the discovery cohort. Because the use of genes beyond the top 10 did not appear to improve overall accuracy (Figure 1b), and displayed diminishing diagnostic robustness relative to genes selected at random (Figure 1c), we chose to focus on only the top 10 genes for the remainder of our analysis.
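A comparison against random gene sets of this kind could be sketched as follows. The toy data, the function names and the final summary (the fraction of random sets that match or beat the selected set) are assumptions for illustration; the source does not state which statistical test was used. Drawing without replacement via `rng.choice` plays the role of the R `sample()` call mentioned in the methods.

```python
import numpy as np


def loo_knn_accuracy(X, y, k=5):
    """Leave-one-out kNN accuracy with a simple majority vote (labels 0/1)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)           # exclude each subject from its own vote
    nearest = np.argsort(dist, axis=1)[:, :k]
    preds = (2 * y[nearest].sum(axis=1) > k).astype(int)
    return float((preds == y).mean())


def random_set_accuracies(X, y, pool, set_size=50, n_sets=50, seed=0):
    """Accuracy of n_sets gene sets drawn without replacement from `pool`."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(pool)
    accs = []
    for _ in range(n_sets):
        genes = rng.choice(pool, size=set_size, replace=False)
        accs.append(loo_knn_accuracy(X[:, genes], y))
    return np.array(accs)


# Toy data (hypothetical): 40 subjects, 200 genes, the first 10 informative.
rng = np.random.default_rng(42)
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 200))
X[y == 1, :10] += 4.0

selected = np.arange(10)                     # stands in for GA/kNN-selected genes
selected_acc = loo_knn_accuracy(X[:, selected], y)
random_accs = random_set_accuracies(X, y, pool=np.arange(200), set_size=10)

# Fraction of random sets that match or beat the selected set.
print(selected_acc, random_accs.mean(), (random_accs >= selected_acc).mean())
```

Restricting `pool` to the indices of genes above the 1.7-fold threshold gives the second, more stringent baseline described in the text.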
When comparing the peripheral blood expression levels of the top 10 genes between AIS patients and controls, the magnitude of differential expression was modest in terms of fold change in the case of most genes; however, differences in expression levels between groups were highly consistent across all subjects, which was reflected by high levels of statistical significance in parametric statistical testing (Figure 2a). The combined discriminatory power of the top 10 genes was evident when their coordinate expression levels were plotted on a continuum for each individual subject; the overall pattern of expression was strikingly different between AIS patients and controls, and it was clear that the overall pattern of expression was more diagnostically powerful than the expression levels of any given gene on its own (Figure 2b).
To more intuitively explore the relationship between the pattern of gene expression observed across the top 10 genes and relevant clinical characteristics, we first used principal components analysis to describe the expression levels of the top 10 genes as a single composite RNA expression variable. The expression levels of the top 10 genes were highly correlated, and a single principal component was able to describe 70% of the collective variance in expression (Supplementary Table 1A). The resulting component scores (composite RNA expression) were strongly correlated with the expression levels of each of the individual candidate genes (Supplementary Table 1B), and visually appeared to summarise the gene expression pattern well (Figure 2c).
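As an illustration of how a composite variable of this kind can be derived, the sketch below extracts the first principal component from standardised expression values. It assumes PCA on z-scored genes computed via singular value decomposition; the software settings actually used are not stated in the source, and `composite_expression` is a hypothetical name.

```python
import numpy as np


def composite_expression(X):
    """First principal component scores of a subjects-by-genes matrix.

    Returns (scores, fraction of variance explained, gene loadings).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-score each gene
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[0]
    if loadings.sum() < 0:                   # the sign of a component is arbitrary;
        loadings = -loadings                 # orient it so most loadings are positive
    scores = Z @ loadings
    explained = float(s[0] ** 2 / np.sum(s ** 2))
    return scores, explained, loadings
```

The returned `explained` value corresponds to the kind of figure quoted above (70% for the top 10 genes), and correlating `scores` with each column of `X` reproduces the style of check reported in Supplementary Table 1B.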
We first used this composite RNA expression variable to examine the influence of potentially confounding intergroup differences in clinical and demographic characteristics on the expression levels of the top 10 genes. Stroke, age, anticoagulant status, hypertension, antihypertensive status, dyslipidaemia, history of myocardial infarction and history of atrial fibrillation were regressed against the composite RNA expression levels of the top 10 genes using multiple regression. We then performed variance decomposition via the Lindeman-Merenda-Gold (LMG) method to estimate the relative contributions of each regressor to the total variance in composite RNA expression explained by the resultant regression model.21 Stroke remained significantly associated with the composite RNA expression levels of the top 10 genes after accounting for all potentially confounding factors included in the model (Figure 3a), and was responsible for a majority of the explained variance (77.9%, Figure 3b). In terms of potentially confounding factors, both antihypertensive status and anticoagulant status were significantly associated with the composite RNA expression levels of the top 10 genes after accounting for all other regressors (Figure 3a); however, these associations only accounted for a small amount of the variance in composite RNA expression explained by the model (6.5% and 4.5%, respectively, Figure 3b). Results of this multiple regression analysis were supported by the results of a more traditional logistic regression analysis in which the composite RNA expression levels of the top 10 genes were identified as the only significant predictor of stroke when considering the same potentially confounding covariates (Supplementary Table 2). Taken as a whole, these findings suggest that the pattern of differential expression observed across the top 10 genes between groups is highly associated with stroke independently of the assessed potential confounding factors. 
Although these findings do suggest that antihypertensive status and anticoagulant status may influence the expression levels of the top 10 genes, the effect of this influence on expression levels is likely minimal relative to the effect of stroke, and intergroup differences in these factors were likely not significant drivers of the selection of these genes by GA/kNN.
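The variance decomposition used here can be sketched in a few lines. The study used the relaimpo R package; the Python below is only an illustration of the LMG idea, in which each regressor's share is its average increase in R² over all orders of entry into the model (equivalently, a Shapley value over subsets). The function names are hypothetical.

```python
import numpy as np
from itertools import combinations
from math import factorial


def r_squared(X, y, cols):
    """R^2 of an ordinary least-squares fit of y on the given columns."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()


def lmg_shares(X, y):
    """LMG share of R^2 for each regressor, plus the full-model R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    cache = {}

    def r2(cols):
        key = tuple(sorted(cols))
        if key not in cache:
            cache[key] = r_squared(X, y, key)
        return cache[key]

    shares = np.zeros(p)
    for j in range(p):
        others = [c for c in range(p) if c != j]
        for size in range(p):
            # Weight of subsets of this size among all orders of entry.
            w = factorial(size) * factorial(p - size - 1) / factorial(p)
            for S in combinations(others, size):
                shares[j] += w * (r2(S + (j,)) - r2(S))
    return shares, r2(tuple(range(p)))
```

Dividing `shares` by the full-model R² gives each regressor's percentage of the explained variance, the quantity reported in Figure 3b. With eight regressors, as in the model above, this requires 256 subset fits.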
We next used this composite RNA expression variable to examine the potential influence of stroke severity and time to blood draw on the pattern of gene expression observed across the top 10 genes. The composite RNA expression levels of the top 10 genes displayed a significant positive association with stroke severity as assessed by the NIHSS (Figure 4a), suggesting that the expression levels of the top 10 genes are likely directly responsive to stroke pathology. We observed a weak nonsignificant negative relationship between the composite RNA expression levels of the top 10 genes and the time from symptom onset to blood draw (Figure 4b). However, this negative relationship was likely driven by the influence of stroke severity, given that the composite expression levels of these genes were positively associated with stroke severity, and patients undergoing more severe strokes generally presented to the emergency department earlier than patients undergoing less severe strokes (Figure 4b). Collectively, these observations suggest that the stroke-induced differential expression of the top 10 genes may have additional utility for the stratification of stroke severity, and is relatively temporally stable during the acute phase of care.
We then tested the diagnostic ability of the gene expression pattern identified in the discovery cohort in an independent validation cohort enrolled via a second geographically and socioeconomically distinct clinical site (see Materials and methods section). This validation cohort included an additional 39 AIS patients along with two different control groups, one consisting of 30 neurologically asymptomatic controls and the other consisting of 20 acute stroke mimics. As in the discovery cohort, AIS patients were older than neurologically asymptomatic controls; however, AIS patients and asymptomatic controls were better matched in terms of the prevalence of comorbidities and CVD risk factors (Table 2). AIS patients were also significantly older than stroke mimics, but were extremely well matched in terms of all other clinical and demographic characteristics (Table 2).
Peripheral blood samples were once again obtained from patients at emergency department admission, and the expression levels of the top 10 genes identified by GA/kNN in the discovery cohort were measured via qRT-PCR. The overall pattern of differential expression between AIS patients and asymptomatic controls observed across the top 10 genes in the discovery cohort was also seen when comparing AIS patients and asymptomatic controls in the validation cohort (Figure 5a). The strong ability of the top 10 genes to differentiate between stroke patients and asymptomatic controls in the discovery cohort using kNN was also recapitulated in the validation cohort; the expression levels of the top 10 genes used in combination were able to classify 95.6% of subjects correctly with a sensitivity of 92.3% and a specificity of 100% (Figure 5b).
When comparing AIS patients to stroke mimics, the overall pattern of differential expression observed across the top 10 genes was identical to that observed when comparing AIS patients with asymptomatic controls; however, the magnitude of these expression differences was smaller in the case of several genes (Figure 5c). Despite this reduction in the magnitude of differential expression, the expression levels of the top 10 genes used in combination were still able to accurately discriminate between AIS patients and stroke mimics, classifying 94.9% of subjects correctly with a sensitivity of 97.4% and a specificity of 90.0% (Figure 5d). Notably, however, all 10 genes were required to achieve high levels of diagnostic accuracy when comparing AIS patients with stroke mimics (Figure 5d), whereas similar levels of accuracy could be achieved with as few as the top four markers when comparing AIS patients with neurologically asymptomatic controls in both the discovery cohort (Figure 1b) and the validation cohort (Figure 5b). Despite this, the collective validation cohort results supported those of the discovery cohort, and provide further evidence that the top 10 markers selected by GA/kNN have high potential performance for identification of AIS.
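For reference, the sensitivity, specificity and overall accuracy figures quoted above follow directly from classification counts. The counts used in the example below (38 of 39 AIS patients and 18 of 20 mimics classified correctly) are not reported in the text; they are back-calculated from the stated percentages and cohort sizes and should be read as an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)             # cases correctly called stroke
    specificity = tn / (tn + fp)             # non-cases correctly called non-stroke
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# AIS versus stroke mimics; counts inferred from the reported percentages.
sens, spec, acc = diagnostic_metrics(tp=38, fn=1, tn=18, fp=2)
print(f"{sens:.1%} sensitive, {spec:.1%} specific, {acc:.1%} accurate")
```

Under these inferred counts the function returns values that round to the reported 97.4%, 90.0% and 94.9%.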
The primary objective of this study was to apply the GA/kNN approach to identify a pattern of gene expression in peripheral blood that could potentially be optimised to identify AIS in the acute phase of care. The 10 transcriptional markers identified by GA/kNN in our analysis proved robust in their combined ability to differentiate between AIS patients and controls in both the discovery cohort and the independent validation cohort; not only did these markers display levels of diagnostic accuracy that exceed those reported in a majority of previous stroke biomarker studies, they also demonstrated characteristics that suggest they have the potential to be clinically useful. Besides having diagnostic utility, some of the markers identified in this study may represent viable therapeutic targets in the context of stroke immunopathology.
Among the many peripheral blood biomarker explorations that have been performed to date, to our knowledge, only one prior investigation has reported levels of diagnostic accuracy similar to those which we observed in this study in terms of discriminating between stroke patients and clinically relevant control populations. Dambinova et al.22 recently reported that plasma levels of brain-derived NR2 peptide, a degradation product of N-methyl-D-aspartate receptor cleavage, could be used to differentiate between stroke patients and a combination of acute stroke mimics and neurologically asymptomatic controls with 92% sensitivity and 96% specificity.22 However, a majority of blood samples in this prior study were obtained between 24 and 72 h post-symptom onset, and it is currently unknown whether NR2 peptide would exhibit an equivalent level of diagnostic performance early in the acute phase of care. The 10-marker panel identified in our analysis was tested earlier in the progression of pathology, and thus exhibits an obvious advantage in that it has the potential to provide actionable diagnostic information at an early enough time point to influence critical triage decisions that have an impact on outcome.
The 10-marker panel identified in our analysis displayed several favourable characteristics that could make it well suited for identification of ischaemic stroke in the acute care setting. Most notably, the pattern of differential expression we observed between AIS patients and controls appeared to be relatively temporally stable. This is clinically relevant because it is well established that acute stroke patients tend to arrive at the emergency department in two waves, the first within 4 h from symptom onset (typically patients with more severe overt symptoms), and the second more than 8 h from symptom onset (typically patients with milder symptoms).23 For this reason, a potential diagnostic for identification of acute stroke needs to be diagnostically robust across a wide time window with regards to the progression of stroke pathology. Another diagnostically beneficial characteristic we observed was that the stroke-associated pattern of expression across these 10 markers was positively correlated with the NIHSS. Thus, these markers may have utility in stratifying injury severity, information that is commonly considered when making decisions regarding the prescription of interventional treatment.4 These characteristics, along with the fact that we observed levels of sensitivity and specificity that well exceed those achievable via the tools currently available to clinicians for the identification of stroke during acute triage, suggest that the 10-marker panel identified in our analysis has legitimate potential for future clinical implementation.
Besides having diagnostic utility, some of the markers identified in this study may represent potential therapeutic targets in the context of stroke immunopathology. Perhaps the most interesting of these markers from this standpoint is CD163. It is well established that stroke induces a state of peripheral adaptive immune suppression characterised by a limited capacity of lymphoid cells to respond to antigen.24,25 This suppressed adaptive immune state leaves patients highly susceptible to post-stroke infection,26 which is the leading cause of death in the post-acute phase of care.27 CD163 encodes a protein known as cluster of differentiation 163 (CD163), a membrane-bound scavenger receptor for extracellular haemoglobin, which is predominantly expressed on immune populations of myeloid lineage.28,29 Mature CD163 is known to undergo ectodomain shedding to generate a soluble truncated peptide (sCD163), which has been shown in multiple studies to directly interact with lymphocytes and inhibit antigen-mediated activation.30,
In addition to CD163, the markers identified in this study included several other genes that may be pathologically relevant within the context of the stroke-induced peripheral immune response. We observed downregulated expression levels of MAL and GRAP in the peripheral blood of AIS patients; both genes encode proteins that are critically involved in T-cell receptor activation and signal transduction.34,35 Furthermore, AIS patients exhibited elevated expression levels of STK3, a gene encoding a serine/threonine kinase involved in pro-apoptotic signal transduction36,37 and suppression of lymphocyte proliferation.38 Taken as a whole, the differential regulation we observed across these genes is consistent with the suppressed adaptive immune state induced in response to stroke, and may be mechanistically involved in blunting the responsiveness of the adaptive immune system following ischaemic brain injury. Conversely, two of the markers identified as being upregulated in the peripheral blood of AIS patients in this study, KIF1B and ANTXR2, may be mechanistically involved in the innate immune response to ischaemic insult. It is well established that stroke induces robust recruitment of myeloid-derived innate immune populations such as neutrophils and monocytes from the peripheral blood into the brain parenchyma;39,40 both genes encode proteins that have been shown to have a role in cellular adhesion and migration,41,
Collectively, the findings reported here are exciting; however, it is important to note that this study was not without limitations. Perhaps most notable was the fact that AIS patients and neurologically asymptomatic controls in our discovery cohort were not well matched with regards to several clinical and demographic characteristics; thus, intergroup differences in these factors had the potential to confound the selection of stroke-specific genes in our GA/kNN analysis. To account for this possible limitation, we utilised a relatively high termination cutoff for optimal solution selection; under these conditions, a confounding factor would have to be almost ubiquitously present in one group, and nearly ubiquitously absent in the other, for it to influence the selection of candidate genes. The results of our multiple regression analysis suggest that this strategy was largely successful; however, they did imply that medication status may influence the expression of the candidate genes. Despite this, the 10 candidate genes were still able to demonstrate high levels of diagnostic accuracy when discriminating between groups that were better matched in terms of these factors in the validation cohort.
Taken as a whole, the results of this preliminary study demonstrate that a highly accurate RNA-based companion diagnostic for AIS is plausible using a relatively small number of markers, and also highlight the potential power of machine-learning approaches for biomarker discovery in the realm of CVD. The 10 transcriptional biomarkers identified in this study displayed levels of diagnostic performance that well exceed those reported in a majority of previous stroke biomarker investigations, as well as several characteristics that suggest that they may have true clinical utility for identification of ischaemic stroke during the acute phase of care. Furthermore, future exploration of these markers may reveal novel mechanisms that underlie the peripheral immune response to stroke, and lead to novel therapeutic targets in the context of stroke-induced immunopathology. Owing to the robust results of this preliminary analysis, the 10 transcriptional biomarkers identified in this study warrant further evaluation to determine their true clinical efficacy.
Materials and methods
Discovery cohort patients
Acute ischaemic stroke patients and neurologically asymptomatic controls were recruited at Suburban Hospital, Bethesda, MD, USA, which serves an upper-class metro area bordering Washington DC. AIS cases were of mixed aetiology, and diagnosis was confirmed using magnetic resonance imaging according to the established criteria for diagnosis of acute ischaemic cerebrovascular syndrome.45 The median time from symptom onset to blood draw was 5.3 h, as determined by the time the patient was last known to be free of AIS symptoms. In the case of patients who received thrombolytic therapy, blood samples were collected before the administration of recombinant tissue plasminogen activator. Injury severity was determined according to NIHSS at the time of blood draw. Control subjects were deemed neurologically normal by a trained neurologist at the time of enrolment. Demographic information was collected from either the subject or significant other by a trained clinician. All procedures were approved by the institutional review boards of the National Institute of Neurological Disorders/National Institute on Aging at the National Institutes of Health and Suburban Hospital. Written informed consent was obtained from all subjects or their authorised representatives before any study procedures.
Blood collection and RNA extraction
Peripheral whole-blood samples were collected via PAXgene RNA tubes (Qiagen, Valencia, CA, USA) and stored at −80 °C until RNA extraction. Total RNA was extracted via the PreAnalytiX PAXgene blood RNA Kit (Qiagen) and automated using the QIAcube System (Qiagen). Quantity and purity of isolated RNA were determined via spectrophotometry (NanoDrop, Thermo Scientific, Waltham, MA, USA). Quality of RNA was confirmed by chip capillary electrophoresis (Agilent 2100 Bioanalyzer, Agilent Technologies, Santa Clara, CA, USA).
RNA amplification and microarray
RNA was amplified and biotinylated using the TotalPrep RNA Amplification Kit (Applied Biosystems, Grand Island, NY, USA). Samples were hybridised to HumanRef-8 expression bead chips (Illumina, San Diego, CA, USA) containing 25,000 unique probes and scanned using the Illumina BeadStation. Raw probe intensities were background-subtracted, quantile-normalised and then summarised at the gene level using Illumina GenomeStudio. Sample labelling, hybridisation and scanning were performed per standard Illumina protocols. Raw data are accessible through the National Center for Biotechnology Information Gene Expression Omnibus via accession number GSE16561.
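For readers unfamiliar with quantile normalisation, the generic procedure can be sketched as below. This is the textbook algorithm, not a reproduction of GenomeStudio's implementation, whose handling of ties and other details may differ; `quantile_normalise` is a hypothetical name and the matrix is assumed to be probes by samples.

```python
import numpy as np


def quantile_normalise(M):
    """Force every sample (column) to share the same intensity distribution."""
    M = np.asarray(M, dtype=float)           # rows: probes, columns: samples
    order = np.argsort(M, axis=0)            # per-sample ranking of probes
    sorted_vals = np.take_along_axis(M, order, axis=0)
    reference = sorted_vals.mean(axis=1)     # mean intensity at each rank
    out = np.empty_like(M)
    # Give each probe the reference value for its rank within its sample.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out
```

After normalisation, sorting any column of the output yields the same reference distribution, so between-sample differences in overall intensity are removed while each probe's within-sample rank is kept.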
Normalised microarray data were filtered based on absolute fold difference between stroke and control; genes exhibiting a greater than 1.7 absolute fold difference in expression between AIS and control were retained for analysis. Filtered gene expression data were z-transformed and GA/kNN analysis was performed using C source code developed by Li et al.20 and compiled in Linux Mint. Two thousand near-optimal solutions were collected per sample using five nearest neighbours, majority rule, a chromosome length of five and a termination cutoff of 0.97. Leave-one-out cross-validation was performed using the top 50 ranked genes. The top 50 genes were tested against random gene combinations, which were selected using the R sample() function (R 2.14, R Project for Statistical Computing).
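The filtering and z-transformation steps can be illustrated with a short sketch. It assumes that fold difference is the ratio of group means on the linear intensity scale, that "absolute" means the ratio is taken in whichever direction exceeds one, and that z-scores are computed per gene across subjects; none of these details is stated explicitly in the text, and `filter_and_zscore` is a hypothetical name.

```python
import numpy as np


def filter_and_zscore(expr, is_case, min_fold=1.7):
    """Keep genes above an absolute fold threshold, then z-score each gene.

    expr: subjects-by-genes matrix of linear-scale intensities.
    is_case: boolean vector marking AIS subjects.
    """
    expr = np.asarray(expr, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    ratio = expr[is_case].mean(axis=0) / expr[~is_case].mean(axis=0)
    abs_fold = np.maximum(ratio, 1.0 / ratio)    # direction-independent fold
    keep = abs_fold > min_fold
    kept = expr[:, keep]
    z = (kept - kept.mean(axis=0)) / kept.std(axis=0, ddof=1)
    return z, keep
```

The boolean mask `keep` identifies the retained genes, and `z` is the matrix that would then be passed to the GA/kNN step.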
Validation cohort patients
AIS patients, acute stroke mimics and neurologically asymptomatic controls were recruited at Ruby Memorial Hospital, Morgantown, WV, USA, which serves an impoverished rural region of West Virginia that displays some of the highest CVD rates in the nation.1 As with the discovery cohort, AIS cases were of mixed aetiology, and diagnosis was confirmed via neuroradiological imaging. Patients admitted to the emergency department as suspected strokes based on the overt presentation of stroke-like symptoms, but receiving a negative diagnosis for stroke upon imaging according to the established acute ischaemic cerebrovascular syndrome diagnostic criteria,45 were identified as acute stroke mimics. Discharge diagnoses of stroke mimics included cases of seizures, complex migraines and other conditions, which induce neurological symptoms such as hypertensive encephalopathy. The median time from symptom onset to blood draw was 4.6 h and all blood was sampled before the administration of recombinant tissue plasminogen activator. Assessment of injury severity, screening of neurologically asymptomatic controls and collection of demographic information were performed in an identical manner. All procedures were approved by the institutional review boards of West Virginia University and Ruby Memorial Hospital. Written informed consent was obtained from all subjects or their authorised representatives before study procedures.
Quantitative reverse transcription PCR
Complementary DNA was generated from purified RNA using the Applied Biosystems high-capacity reverse transcription kit. For qPCR, target sequences were amplified from 10 ng of complementary DNA input using sequence-specific primers (Supplementary Table 3) and detected via SYBR green (PowerSYBR, Thermo Fisher, Waltham, MA, USA) on the RotorGeneQ (Qiagen). Raw amplification plots were background-corrected and CT values were generated via the RotorGeneQ software package. All reactions were performed in triplicate. Transcripts of B2M, PPIB and ACTB were amplified as references, and normalisation was performed using the NORMAgene data-driven normalisation algorithm.46
Parametric statistical analysis was performed using SPSS (IBM, Chicago, IL, USA) in combination with R 2.14 via the SPSS R integration plug-in. χ²-tests were used for comparison of dichotomous variables, whereas Student's t-tests were used for comparison of continuous variables. Spearman’s rho was used to assess the strength of correlational relationships. For multiple regression analysis, variance decomposition was performed using the relaimpo R package.21 Penalised logistic regression was performed using the logistf R package.47 The level of significance was established at 0.05 for all parametric statistical testing. In the cases of multiple comparisons, P-values were adjusted using Holm’s Bonferroni method.48
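The Holm adjustment mentioned above is a step-down procedure, sketched below for illustration. The analysis itself was run in SPSS and R; `holm_adjust` is a hypothetical name for an equivalent calculation.

```python
import numpy as np


def holm_adjust(pvals):
    """Holm step-down adjusted P-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The smallest P-value is multiplied by m, the next by m - 1, and so on.
    stepped = (m - np.arange(m)) * p[order]
    # Enforce monotonicity and cap at 1.
    stepped = np.minimum(np.maximum.accumulate(stepped), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = stepped
    return adjusted
```

For example, raw P-values of 0.01, 0.04 and 0.03 adjust to 0.03, 0.06 and 0.06, so only the first remains below the 0.05 threshold.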
The authors would foremost like to thank the subjects and their families, as this work was truly made possible by their selfless contribution. The authors also thank the stroke team at Ruby Memorial Hospital and the NIH stroke team at Suburban Hospital for supporting this research effort. Work was partially funded via a Robert Wood Johnson Foundation Nurse Faculty Scholar award to TLB (70319) and a National Institutes of Health CoBRE sub-award to TLB (P20 GM109098).