A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data

Asthma is a common, under-diagnosed disease affecting all ages. We sought to identify a nasal brush-based classifier of mild/moderate asthma. 190 subjects with mild/moderate asthma and controls underwent nasal brushing and RNA sequencing of nasal samples. A machine learning-based pipeline identified an asthma classifier consisting of 90 genes interpreted via an L2-regularized logistic regression classification model. This classifier performed with strong predictive value and sensitivity across eight test sets, including (1) a test set of independent asthmatic and control subjects profiled by RNA sequencing (positive and negative predictive values of 1.00 and 0.96, respectively; AUC of 0.994), (2) two independent case-control cohorts of asthma profiled by microarray, and (3) five cohorts with other respiratory conditions (allergic rhinitis, upper respiratory infection, cystic fibrosis, smoking), where the classifier had a low to zero misclassification rate. Following validation in large, prospective cohorts, this classifier could be developed into a nasal biomarker of asthma.


Supplementary Figure: Visual description of the machine learning pipeline used to select predictive features (genes) and develop classification models based on them in the RNAseq development set. By considering 100 splits of the development set into training and holdout sets (dotted box), many such models were evaluated for classification performance and then compared statistically using Friedman and Nemenyi tests. From this comparison, the best combination of predictive genes and global classification algorithms was determined, which was then executed on the development set to train the final asthma classifier model. This model was applied to an independent RNAseq test set and external microarray-derived cohorts with asthma and other respiratory conditions for final evaluation.

Given a training set, this component used a 5x5 nested (outer and inner) cross-validation (CV) setup to select sets of predictive features (genes). The inner CV round was used to determine the optimal number of features to be selected, and the outer round was used to select the set of predictive genes based on this number, thus reducing the cumulative effect of potential sources of overfitting. The selection of features itself was performed using the Recursive Feature Elimination (RFE) algorithm in combination with wrapper Logistic Regression and linear-kernel SVM classification algorithms.

Precision, recall, positive predictive value, and negative predictive value are summarized. F-measure, which is a harmonic (conservative) mean of precision and recall that is computed separately for each class, provides a more comprehensive and reliable assessment of model performance when classes are imbalanced, as is frequently the case in biomedical scenarios.

Figure 6: Performance of permutation-based random classification models in test sets of independent subjects with asthma and controls. To determine the extent to which the performance of the classifier could have been due to chance, 100 permutation-based random models were obtained by randomly permuting the labels of the samples in the development set and executing each of the feature selection-global classification combinations on these randomized data sets in the same way as described above for the real development set. These random models were then applied to each of the asthma test sets considered in our study, and their performances were also evaluated in terms of the F-measure.

Figure 7: Performance of permutation-based random classification models in test sets of independent subjects with non-asthma respiratory conditions and controls. 100 permutation-based random models were obtained by randomly permuting the labels of the samples in the development set and executing each of the feature selection-global classification combinations on these randomized data sets in the same way as described above for the real development set. These random models were then applied to these test sets, and their performances were also evaluated in terms of the F-measure.
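
The permutation procedure and the per-class F-measure described in these captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: a nearest-centroid classifier stands in for the feature selection-global classification combinations, the expression data are synthetic, and all names (`f_measure`, `train_centroids`, `predict`, `permutation_f_measures`, `make_samples`) are hypothetical.

```python
import random
from statistics import mean


def f_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train_centroids(X, y):
    """Stand-in model (assumption): the mean profile of each class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each sample to the class with the nearest centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda lab: sq_dist(x, centroids[lab])) for x in X]


def permutation_f_measures(X_dev, y_dev, X_test, y_test, positive,
                           n_perm=100, seed=0):
    """F-measures of models trained on label-permuted development sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perm):
        shuffled = list(y_dev)
        rng.shuffle(shuffled)  # break the link between profiles and labels
        model = train_centroids(X_dev, shuffled)
        scores.append(f_measure(y_test, predict(model, X_test), positive))
    return scores


def make_samples(rng, n, shift):
    """Synthetic 5-gene profiles (assumption): Gaussian noise around `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(5)] for _ in range(n)]


rng = random.Random(1)
X_dev = make_samples(rng, 20, 0.0) + make_samples(rng, 20, 2.0)
y_dev = ["control"] * 20 + ["asthma"] * 20
X_test = make_samples(rng, 10, 0.0) + make_samples(rng, 10, 2.0)
y_test = ["control"] * 10 + ["asthma"] * 10

real_model = train_centroids(X_dev, y_dev)
real_f = f_measure(y_test, predict(real_model, X_test), "asthma")
null_f = permutation_f_measures(X_dev, y_dev, X_test, y_test, "asthma")
print(f"real F-measure: {real_f:.2f}; mean permuted F-measure: {mean(null_f):.2f}")
```

Comparing the real model's F-measure with the distribution of permuted scores, rather than with a single random model, is what allows a statement about how likely the observed performance is to have arisen by chance.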

Supplementary Figure 8: Distribution of DESeq2 FDR values of differentially expressed genes in the asthma classifier (blue bars) vs. other genes in the RNAseq development set (coral bars). The Y-axis shows the probability of a gene having a -log10(FDR) value in the corresponding bin. This plot shows that the genes in the asthma classifier were likely to be more differentially expressed, i.e., to have higher -log10(FDR) values (lower differential expression FDRs), than other genes in the development set.

Race Caucasian n/a n/a n/a 57% 82% 89%

Supplementary
African American n/a n/a n/a 17% 9% 0%
Hispanic n/a n/a n/a 13% 0% 0%
Asian/Other n/a n/a n/a 13% 9% 11%
n/a n/a n/a n/a
PC20 (mg/ml) n/a n/a n/a 4.
*For Asthma2, data that the authors deposited in GEO GSE46171 are a subset of their published results [32]. GSE46171 has data for 16 of the 23 subjects with controlled asthma, 7 of the 11 subjects with uncontrolled asthma, and 5 of the 9 controls reported in the authors' publication [32]. We indicate the number of subjects with publicly available data (GSE46171) that were used in our analyses. The summary statistics shown are drawn from the authors' publication on their reported sample.
† Median (range)

Supplementary Table 4: Characteristics of the external cohorts with non-asthma respiratory conditions and controls used for testing the asthma classifier
Allergic Rhinitis [35], GEO GSE43523*
URI Day 2 [32], GEO GSE46171^
URI Day 6 [32], GEO GSE46171^
Cystic Fibrosis [36], GEO GSE40445
Smoking [12], GEO GSE8987
Results are number (%) or mean (SD) unless otherwise indicated.
*Data that the authors deposited in GEO GSE43523 are a subset of their published results [35]. GSE43523 has data for 7 of the 15 subjects with allergic rhinitis, and 5 of the 13 controls reported in the authors' publication [35]. We indicate the number of subjects with publicly available data (GSE43523) that were used in our analyses. The summary statistics shown are drawn from the authors' publication on their reported cohort.
^Each subject provided a URI and control sample. The data that the authors deposited in GEO GSE46171 are a subset of their published results [32]. GSE46171 has data for 6 of the 9 healthy subjects reported in the authors' publication who provided samples during URI, and 5 of the 9 healthy subjects who provided samples after resolution of their URI [32]. We indicate the number of subjects with publicly available data (GSE46171) that were used in our analyses. The summary statistics shown are drawn from the authors' publication on their reported cohort.
† Median (range)

Gene
Annotation References

ALOX15B
Member of lipoxygenase family whose members can affect bronchiolar constriction, cytokine secretion, and immune cell migration. The ALOX15B isoform may regulate cytokine secretion by macrophages and macrophage differentiation.

C3
Central role in classical and alternative complement pathway system activation; downstream effects include smooth muscle contraction, vascular permeability, histamine release.

CD177
Glycoprotein expressed by neutrophils. Used as a cell surface marker in studies of IL-17RB granulocytes in asthma.

CDH26
One of eight genes targeted in a candidate gene study of asthma; negative results.

CDHR3
Variant in this gene associated with rhinovirus-induced wheezing and rhinovirus C illness. May function as a rhinovirus C receptor. 6, 7

CDKN1A
Mediator of microRNA-221-modulated airway smooth muscle hyperproliferation in cell culture studies of severe asthma. 8

CEBPD
CEBPD gene expression in bronchial specimens from asthma subjects associated with asthma susceptibility and inhaled corticosteroid treatment. 9

CLEC7A
Expression on CD11b+ dendritic cells plays a role in house dust mite-induced allergic airway inflammation in murine models. 10

CPA3
Mast cell mediator whose gene expression in epithelial brushings is upregulated in mild asthma and suppressed by corticosteroids in moderate asthma.

CYFIP2
One of 237 candidate genes targeted in a candidate gene study of asthma in Mexicans.

CYP1B1
One of 25 candidate genes targeted in a candidate gene study of xenobiotic-metabolizing enzymes in asthma among Russians. 13

DEFB1
Protein level elevated in induced sputum from severe asthmatics vs. controls. Also studied as one of 44 candidate genes in a candidate gene study of innate immune pathways in asthma and eczema among children from Boston and Connecticut.

DUSP1
Expression in bronchial epithelial cell culture increased by dexamethasone, leading to suppression of p38 MAPK signaling and cytokine inhibition. 16

ESR1
SNPs in this gene associated with bronchial hyperresponsiveness and FEV1 decline, especially in females. miRNAs may impact pathogenesis of dust mite-induced asthma via regulation of ESR1. 16, 17, 18

FOS
Encodes a transcription factor involved in anti-inflammatory activity of steroid action in asthma. 19, 20

GSTT1
Modifies the impact of air pollution exposure on asthma. 20

LPHN1
SNP in LPHN1 associated with asthma and found to regulate airway smooth muscle cell adhesion and proliferation in vitro. 22

LTBP1
siRNA knockdown of LTBP1 inhibited TGFbeta1 release in airway fibroblasts from asthma subjects. 23

MMP9
Released by neutrophils in allergic asthma subjects and in murine models of asthma. 11, 24, 25

NMU
Neuropeptide that amplifies Type 2 innate lymphoid cell-driven allergic lung inflammation in murine models. 26

S100A7
Antimicrobial peptide induced by IL-22 in T-cell lines derived from lung biopsy specimens of asthmatic subjects.

S100A8
Anti-apoptotic protein detected in supernatant of neutrophils treated with house dust mite extract and elevated in BAL from asthmatic vs. control subjects. 28

SCD
Inhibition of SCD in mice promoted airway hyperresponsiveness. SCD1 expression reduced in bronchial epithelial cells from asthma subjects vs. controls. 29

SCGB1A1
Levels in induced sputum higher in subjects with severe asthma vs. mild-moderate and healthy controls. BAL levels of SCGB1A1 correlated with epithelial detachment in bronchial biopsies. 30

SEMA5A
One of 11 genes mapped by 1000 of the top SNPs shared across European, African, and Hispanic populations in a rank-based analysis of shared genetic factors for asthma. 31

SERPINB3
In vitro-polarized Th2 cells from subjects with grass pollen allergy expressed higher mRNA levels of this serine protease inhibitor relative to CD27+CD4+ cells. Mediates mucus production in murine models of asthma. 32, 33

SERPINE2
Selected SNPs in this gene were associated with asthma and related traits.

SLC26A4
Up-regulated in airway epithelial cells in association with mucus overproduction in murine models. 36, 37

SPRR1A
Intratracheal inoculation of mice with IL-13 induced more gene expression of SPRR1A than inhalation of IL-4. 38

TFPI
TFPI concentration studied in 17 subjects with asthma during early and late stages of reaction. 39

TPSAB1
Mast cell biomarker used to classify sputum subtypes in a study of eosinophilia and corticosteroid response in asthma. 40

TPSB2
Encodes mMCP-6, which is required for airway hyperresponsiveness in murine models of asthma.