Predicting susceptibility to tuberculosis based on gene expression profiling in dendritic cells

Tuberculosis (TB) is a deadly infectious disease, which kills millions of people every year. The causative pathogen, Mycobacterium tuberculosis (MTB), is estimated to have infected up to a third of the world’s population; however, only approximately 10% of infected healthy individuals progress to active TB. Despite evidence for heritability, it is not currently possible to predict who may develop TB. To explore approaches to classify susceptibility to TB, we infected with MTB dendritic cells (DCs) from putatively resistant individuals diagnosed with latent TB, and from susceptible individuals that had recovered from active TB. We measured gene expression levels in infected and non-infected cells and found hundreds of differentially expressed genes between susceptible and resistant individuals in the non-infected cells. We further found that genetic polymorphisms nearby the differentially expressed genes between susceptible and resistant individuals are more likely to be associated with TB susceptibility in published GWAS data. Lastly, we trained a classifier based on the gene expression levels in the non-infected cells, and demonstrated reasonable performance on our data and an independent data set. Overall, our promising results from this small study suggest that training a classifier on a larger cohort may enable us to accurately predict TB susceptibility.

Using this approach, previous studies have characterized gene expression profiles in innate immune cells isolated from individuals known to be susceptible or resistant to infectious diseases, including those with latent or active TB 20 and acute rheumatic fever 21 .
We hypothesized that gene expression profiles in innate immune cells may be used to classify individuals with respect to their susceptibility to develop active TB. To test this hypothesis, we differentiated dendritic cells (DCs) from monocytes isolated from individuals that had recovered from a past episode of active TB, which we refer to as susceptible, and from individuals with confirmed latent TB, which we refer to as putatively resistant (this group is enriched in resistant individuals but we cannot exclude that some still have the potential to develop active TB 22 ). We infected the DCs with MTB because these innate immune cells help shape the adaptive immune response, which is critical for fighting MTB 23,24 . We discovered that the gene expression differences between innate immune cells from resistant and susceptible individuals were present primarily in the non-infected state, that these differentially expressed genes were enriched for nearby SNPs with low p-values in TB susceptibility GWAS, and furthermore, that these gene expression levels could be used to classify individuals based on their susceptibility status.

Results
Susceptible individuals have an altered transcriptome in the non-infected state. We obtained whole blood samples from 25 healthy male Caucasian individuals (Supplementary Data S1). Six of the donors had recovered from active TB, and are thus putatively susceptible. The remaining 19 tested positive for latent TB without ever experiencing symptoms of active TB, and are thus putatively resistant. We isolated dendritic cells (DCs) and treated them with Mycobacterium tuberculosis (MTB) or a mock control for 18 hours. To measure genome-wide gene expression levels in infected and non-infected samples, we isolated and sequenced RNA using a processing pipeline designed to minimize the introduction of unwanted technical variation ( Supplementary  Fig. S1). We obtained a mean (±SEM) of 48 ± 6 million raw reads per sample. We performed quality control analyses to remove non-expressed genes ( We performed a standard differential expression analysis using a linear modeling framework (Supplementary Data S3), defined in equation (1). As expected, there was a strong response to infection with MTB in both resistant and susceptible individuals ( Supplementary Fig. S8). Considering the putatively resistant individuals, we identified 3,486 differentially expressed (DE) genes between the non-infected and infected states at a q-value of 10% and an arbitrary absolute log-fold change greater than 1. Similarly, 3,789 genes were classified as DE between the non-infected and infected states in the putatively susceptible individuals. In both classes of samples, the DE genes included the important immune response factors IL12B, REL, and TNF. While the treatment effect was obvious in all individuals, of most interest were the patterns of gene expression differences between the putatively susceptible and resistant individuals in either the non-infected or infected states (Fig. 1). We identified 645 DE genes between putatively resistant and susceptible individuals in the non-infected state at a q-value of 10%, including ATPV1B2, FEZ2, PSMA2, TNFRSF25, and TRIM38. The log 2 fold change in gene expression for these 645 genes was relatively small (less than 2), which was expected given that susceptibility to TB is a polygenic trait influenced by many genes. In contrast, no genes were DE between putatively resistant and susceptible individuals in the infected state (at a q-value of 10%).
Differentially expressed genes are enriched with TB susceptibility loci. We next sought evidence that genes classified as DE in our in vitro experimental system play a role in determining susceptibility to TB. To do this, we intersected our data with results from TB susceptibility GWAS conducted in Russia 18 , The Gambia 13 , Ghana 13 , and Uganda and Tanzania 19 . We also included data from a height GWAS conducted in individuals of European ancestry 25 as a negative control. To perform a combined analysis of our gene expression data and the GWAS results, we had to define pairs of genes (for which we have expression data) and SNPs (for which we obtained GWAS P values). Thus, each gene in our expression data was coupled with the GWAS SNP with the lowest p-value among all tested SNPs located within 50 kb of the gene's transcription start site (Supplementary Data S4; this gene-SNP definition was performed separately for each GWAS data set).
Once we defined gene-SNP pairs, we asked whether differences in gene expression levels between putatively resistant and susceptible individuals could help us identify genetic variation that is associated with susceptibility to TB. In other words, we asked whether increasing evidence for DE genes is associated with low GWAS p-values. To do so, we calculated the fraction of SNPs with a GWAS p-value lower than 0.05 among SNPs that were paired with ranked subsets of genes whose expression profiles show increasing effect size of expression differences between putatively resistant and susceptible individuals. In order to assess the significance of the observations, we performed 100 permutations of the enrichment analysis to derive an empirical p-value.
Using this approach, we observed a clear enrichment (empirical P < 0.01) of low p-values for TB GWAS SNPs that are paired with genes that are differentially expressed between susceptible and resistant individuals in the non-infected state (Fig. 2a). In fact, we observed significant enrichments of lower GWAS p values (empirical P < 0.01) in all 4 TB susceptibility GWAS (Russia, The Gambia, Ghana, Uganda and Tanzania) ( Supplementary  Fig. S10) for all 4 differential expression contrasts, namely resistant vs. susceptible individuals in the non-infected state (Fig. 2c), resistant vs. susceptible individuals in the infected state (Fig. 2d), effect of treatment in resistant individuals (Fig. 2e), and effect of treatment in susceptible individuals (Fig. 2f). Reassuringly, we did not observe an enrichment of low p values (empirical P > 0.01) when we used the same approach to consider data from the height GWAS (Fig. 2b-f).
Scientific RepoRts | 7: 5702 | DOI:10.1038/s41598-017-05878-w Susceptibility status can be predicted based on gene expression data. Next we attempted to build a gene expression-based classifier to predict TB susceptibility status (Supplementary Data S5). We focused on the gene expression levels measured in the non-infected state both because this is where we observed the largest gene regulatory differences between putatively susceptible and resistant individuals (Fig. 1a,c), and also because, from the perspective of an ultimate translational application, it is more practical to obtain gene expression data from non-infected DCs. We trained a support vector machine using the 99 genes that were differentially expressed between resistant and susceptible individuals in the non-infected state at a q-value less than 5% (see Methods for a full description of how we selected this model). Encouragingly, we observed a clear separation between putatively susceptible and resistant individuals when comparing the predicted probability of being susceptible to TB for each sample obtained from leave-one-out-cross-validation (Fig. 3a). Using a cutoff of 0.25 for the predicted probability of being susceptible to TB, we obtained a sensitivity of 100% (5 out of 5 susceptible individuals classified as susceptible), a specificity of 88% (15 out of 17 resistant individuals classified as resistant), and a positive predictive value (PPV) of 71% (5 of 7 individuals classified as susceptible were susceptible).
Unfortunately our current data set is too small to properly split into separate training and testing sets (it is challenging to collect samples from previous TB patients, who are healthy and have no medical reason to go back for a GP visit). To our knowledge, there are also no other suitable data sets available with which to test out classifier (that said, see Supplementary Fig. S17 for the results of applying the classifier to a non-ideal data set, which measured gene expression in macrophages from a small number of individuals 20 ). Thus, in order to further assess the plausibility of our model, we applied the classifier to data from an independent study, which collected p-values for testing the null of no differential expression between susceptible and resistant individuals in the (a) non-infected or (b) infected state. The bottom panels show the corresponding volcano plots for the (c) non-infected and (d) infected states. The x-axis is the log fold change in gene expression level between susceptible and resistant individuals and the y-axis is the −log 10 p-value. Red indicates genes that are classified as differentially expressed with a q-value less than 10%.
Scientific RepoRts | 7: 5702 | DOI:10.1038/s41598-017-05878-w genome-wide gene expression levels in DCs from 65 healthy individuals 24 , none with a previous history of TB. Using the cutoff of 0.25 for the probability of being susceptible to TB (determined to be optimal in the training set), 11% (7 of 65) of the individuals were classified as susceptible to TB (Fig. 3b). Adjusting for the PPV obtained from the training set (71%), our model predicted that 7.7% of the healthy individuals were susceptible. While we cannot confirm this result (the true susceptibility status of these 65 individuals is unknown), this observation is encouraging because our estimate is similar to the commonly used inference that roughly 10% of the general population is susceptible to progression of infection from latent to active TB.

Discussion
We obtained dendritic cells (DCs) from individuals that were known to be putatively susceptible or resistant to developing active tuberculosis (TB) and measured genome-wide gene expression levels in non-infected DCs and DCs infected with Mycobacterium tuberculosis (MTB) for 18 hours. As expected, there were large changes in gene expression due to MTB infection in the DCs from both putatively resistant and susceptible individuals ( Supplementary Fig. S8). We identified 645 genes which were differentially expressed (DE) between susceptible and resistant individuals in the non-infected state; whereas, we did not observe any DE genes between susceptible and resistant individuals in the infected state ( Fig. 1). This suggests that the differences in the transcriptomes between DCs of resistant and susceptible individuals are present pre-infection. Yet, 18 hours after infection gene expression profiles in both susceptible and resistant individuals have converged to a similar gene regulatory network, presumably to fight the infection. We confirmed that the absence of DE genes in the infected state is not caused by a decrease in statistical power due to an overall increase in gene expression variance upon infection ( Supplementary Fig. S9). We chose to measure gene expression 18 hours post-infection because this time point was previously associated with a large change in genome-wide gene expression levels 26 . Given our observations, however, future studies investigating the difference in the innate immune response between individuals resistant and susceptible to TB may want to focus on earlier time points post-infection.
It is important to note that our study was not designed to uncover the mechanisms underlying susceptibility or resistance to TB, but to try and find a gene regulatory signature that might allow us to classify individuals as either susceptible or resistant. That said, among the 645 DE genes between resistant and susceptible individuals in the non-infected state, there were many interesting genes involved in important innate immune activities critical for fighting MTB and other pathogens such as autophagy 27,28 , phagolysosomal acidification, and antigen processing. In particular, FEZ2, a suppressor of autophagosome formation 29 , was down-regulated when DCs were infected with MTB; however, in the non-infected DCs, this gene has elevated expression level in susceptible compared with resistant individuals. In turn, ATP6V1B2, a gene coding for a subunit of the proton transporter responsible for acidifying phagolysosomes [30][31][32] , has increased expression in susceptible individuals compared to resistant in the non-infected state. Lastly, genes coding for nine subunits of the proteasome, which is critical for processing of MTB antigens to be presented via major histocompatibility complex (MHC) class I molecules [33][34][35][36] , have increased expression in susceptible individuals compared to resistant in the non-infected state. These genes are candidates for future functional studies investigating the mechanisms of TB susceptibility.
We observed that DE genes in our in vitro experimental system were enriched for lower GWAS p-values (Fig. 2). This suggests that such in vitro approaches are informative for interrogating the genetic basis of disease susceptibility. That being said, we recognize a major caveat with this analysis is that assigning SNPs to their nearest gene on the linear chromosome is problematic because regulatory variants can have longer range effects. Nevertheless, considering this limitation, it was encouraging that we were able to detect evidence of the genetic basis of TB susceptibility in this system.
Not only did this analysis identify a global enrichment of TB susceptibility loci, but by intersecting the expression and GWAS data, we were able to identify a few interesting candidate genes, which were only marginally significant in the differential expression analysis (Supplementary Data S3) and in the original GWAS (Supplementary Data S4). Here we highlight two genes (CCL1 and UNC13A), which have been previously implicated in the response to MTB infection. CCL1 is a chemokine that stimulates migration of monocytes 37 . In our study, it was upregulated in susceptible individuals compared to resistant in both the non-infected and infected states (but did not reach statistical significance in either) and was statistically significantly upregulated with MTB treatment. Furthermore, the nearby SNP assigned to CCL1 had a p-value less than 0.01 in the TB susceptibility GWAS from The Gambia and Ghana. A previous differential expression study of TB susceptibility (discussed in more detail below) found that CCL1 was upregulated to a greater extent 4 hours post-infection with MTB in macrophages isolated from individuals with active TB (i.e. susceptible) compared to individuals with latent TB (i.e. resistant) 20 . Additionally they performed a candidate gene association study and found that SNPs nearby CCL1 were associated with TB susceptibility. In our previous study, we discovered that CCL1 was one of only 288 genes that were differentially expressed in macrophages 48 hours post-infection with MTB and related mycobacterial species but not unrelated virulent bacteria 38 . UNC13A is involved in vesicle formation 39 . In our study, it was downregulated in susceptible individuals compared to resistant in both the non-infected and infected states (but did not reach statistical significance in either) and was statistically significantly upregulated with MTB treatment. Furthermore, the nearby SNP assigned to UNC13A had a p-value less than 0.01 in the TB susceptibility GWAS from Russia, The Gambia, and Ghana. In our past study mapping expression quantitative trait loci (eQTLs) in DCs 18 hours post-infection with MTB, UNC13A was one of only 98 genes which were associated with an eQTL post-infection but not pre-infection, which we called MTB-specific eQTLs 24 . Thus our new results increased the evidence that CCL1 and UNC13A may potentially impact TB susceptibility.
Previous attempts to use gene expression based classifiers in the context of TB have focused on predicting the status of an infection rather than the susceptibility status of an individual 8,40,41 . In other words, the goal of most previous studies was to detect individuals in the early stages of active TB when antibiotic intervention would be most effective or to monitor the effectiveness of a treatment regimen 42 . In contrast, our goal was not to distinguish between active and latent TB, but instead to be able to determine susceptibility status before individuals are infected with MTB. Even with our small sample size, we were able to successfully train a classier with high sensitivity and reasonable specificity. Because such a classification of susceptibility status could affect the decision of whether or not to take antibiotics to treat latent TB 6 , false negatives (susceptible individuals mistakenly classified as resistant) would be much more harmful than false positives (resistant individuals mistakenly classified as susceptible). For that reason, we emphasized sensitivity over specificity.
While the small sample size was the main concern in interpreting the classifier results, we recognize multiple other potential caveats of our study design. First, because the cells used in the experiments were collected from Caucasians, it is unknown if the classifier would perform well when applied to other human populations. Second, because the classifier focused solely on the transcriptional response of the host immune cell to infection with MTB H37Rv, it may not be suitable for the detection of differences in susceptibility with respect to other, more or less virulent MTB strains 43 . Third, because the susceptible individuals had already experienced active TB, it is possible that their innate immune cells retained memory of the infection 44,45 . If this innate immune memory affected how the monocytes differentiated into DCs in response to cytokine stimulation, or in general affected gene regulation, this would drastically change the interpretation of our results.
With respect to the third caveat, while we cannot entirely exclude the possibility of trained immunity, we do not believe there is sufficient evidence to support this possibility in our case. Innate immune memory is known to be a short-term phenomenon 45 , its affects may be erased when immune cells are moved to a new microenvironment 46 , and the whole blood transcriptional signature of active TB disappears 6-12 months after the initiation of treatment 40,47 . Moreover, the GWAS enrichment results (Fig. 2) suggest that trained immunity cannot explain our observations. Indeed, the differences in gene expression between susceptible and resistant individuals in the non-infected state were enriched for low p-values from TB susceptibility GWAS. This observation cannot be explained by possible downstream consequences of experiencing active TB because enrichment of DE genes in GWAS data supports a causal relationship. In other words, if most of the DE genes we observed could be explained by trained immunity, one would not expect these genes to be associated with susceptibility to the disease. That said, we recognize that the ideal study would be a long-term prospective design in which blood samples were obtained from a large cohort of individuals prior to MTB infection.
To our knowledge, our study was only the second to collect data from in vitro MTB-infected innate immune cells isolated from individuals known to be putatively susceptible to MTB 20 . However, there were substantial differences between our study and that of Thuong et al. 20 . First, they derived and infected macrophages, the primary target host cell in which MTB resides; whereas, we derived and infected DCs, which play a larger role in stimulating the adaptive immune response to MTB. Second, we collected samples from a larger number of putatively resistant individuals (19 versus 4), increasing our power to distinguish between the gene expression profiles of susceptible and resistant individuals. Third, they measured gene expression with microarrays; whereas, we used RNA-sequencing. Considering the substantial technical differences between the methods used and the biological differences between DCs and macrophages 26,48 , unsurprisingly, we were unable to identify the susceptible individuals from Thuong et al. 20 using our classifier (Supplementary Fig. S17).
Indeed, at this time, we are not aware of any other data set from healthy individuals known to be susceptible to TB, with which we can further test our classifier. When we applied our classifier to an independent set of non-infected DCs isolated from healthy individuals of unknown susceptibility status, our model predicted that 7.7-11% of the individuals were susceptible to TB, which reassuringly is similar to the average in the general population (10%). Despite this, our results must be interpreted cautiously; at best as a proof-of-principle, due to our very small sample size of only 5 susceptible individuals. That said, our promising results in this small study suggest that collecting blood samples from a larger cohort of susceptible individuals would enable building a gene expression based classifier able to confidently assess risk of TB susceptibility. By reducing the number of resistant individuals receiving treatment for latent TB, we can eliminate the adverse health effects of a 6 month regimen of antibiotics for these individuals and also reduce the selective pressures on MTB to develop drug resistance.
Scientific RepoRts | 7: 5702 | DOI:10.1038/s41598-017-05878-w Methods Ethics statement. We recruited 25 subjects to donate a blood sample for use in our study. All methods were carried out in accordance with relevant guidelines and regulations. All participants gave written informed consent in accordance with the Declaration of Helsinki principles. Peripheral human blood was collected from patients at ICAReB platform of Institut Pasteur Paris and at the Centre for Infectious Disease Prevention, University hospital Caen. The Protocol has been approved by French Ethical Committee (CPP North Ouest III, no. A12 -D33 -VOL.13), and by the Institutional Review Boards of the University of Chicago (10-504-B) and the Institut Pasteur (IRB00006966).
Sample collection. We collected whole blood samples from healthy, HIV-negative Caucasian male individuals living in France. The putatively resistant individuals tested positive for latent TB in an interferon-γ release assay, but had never developed active TB. The putatively susceptible individuals had developed active TB in the past, but were currently healthy after having completed at least 6 months of antibiotic chemotherapy. All putatively susceptible individuals had finished their treatment at least 6 months prior to sample collection and were considered fully recovered based on a chest X-ray, sputum culture, and lack of clinical symptoms. We do not know which strain of MTB had infected any of the individuals in the study.

Isolation and infection of dendritic cells.
We performed these experiments as previously described 24 .
RNA extraction and sequencing. We extracted RNA using the Qiagen miRNeasy Kit and prepared sequencing libraries using the Illumina TruSeq Kit. We sent the master mixes to the University of Chicago Functional Genomics Facility to be sequenced on an Illumina HiSeq 4000. We designed the batches for RNA extraction, library preparation, and sequencing to balance the experimental factors of interest and thus avoid potential technical confounders (Supplementary Fig. S1).

Read mapping.
We mapped reads to human genome hg38 (GRCh38) using Subread 49 and discarded non-uniquely mapping reads. We downloaded the exon coordinates of 19,800 Ensembl 50 protein-coding genes (Ensembl 83, Dec 2015, GRCh38.p5) using the R/Bioconductor 51 package biomaRt 52,53 and assigned mapped reads to these genes using featureCounts 54 . Quality control. First we filtered genes based on their expression level by removing all genes with a transformed median log 2 counts per million (cpm) of less than zero. This step resulted in a set of 11,336 genes for downstream analysis ( Supplementary Fig. S2, Supplementary Data S2). Next we used principal components analysis (PCA) and hierarchical clustering to identify and remove 6 outlier samples (Supplementary Figs S3, S4 and S5). We did this systematically, by removing any sample whose data projections did not fall within two standard deviations of the mean for any of the first six PCs (for the first PC, which separated the samples by treatment, we calculated a separate mean for the non-infected and infected samples). The remaining 44 high-quality samples included 17 putatively resistant individuals and 5 putatively susceptible individuals in both the non-infected and infected states (Supplementary Fig. S5). Note that after outlier removal, the study design was no longer completely paired because a subset of individuals was only represented by available data from either the non-infected or the infected state.
After filtering lowly expressed genes and removing outliers, we performed the PCA again to check for any potential confounding technical batch effects ( Supplementary Fig. S6). Reassuringly, the major sources of variation in the data were from the biological factors of interest. PC1 was strongly correlated with the effect of treatment, and PCs 2-6 were correlated with inter-individual variation. The only concerning technical factor was the infection experiments, which were done in 12 separate batches (Supplementary Fig. S1). Infection batch correlated with PCs 3 and 5; however, we verified that this variation was not confounded with our primary outcome of interest, TB susceptibility ( Supplementary Fig. S7).

Differential expression analysis.
We used limma+voom [55][56][57] to implement the following linear model to test for differential expression: where β 0 is the mean expression level in non-infected cells of resistant individuals, β treat is the fixed effect of treatment in resistant individuals, β status is the fixed effect of susceptibility status in non-infected cells, β treat,status is the fixed interaction effect of treatment in susceptible individuals (i.e. modeling the interaction between treatment and susceptibility status), and I is the random effect of individual. The random individual effect was implemented using the limma function duplicateCorrelation 58 . To jointly model the data with voom and dupli-cateCorrelation, we followed the recommended best practice of running both voom and duplicateCorrelation twice in succession 59 . We used the model to test different hypotheses (Supplementary Data S3). We identified genes which were differentially expressed (DE) between infected and non-infected DCs of resistant individuals by testing β treat = 0, genes which were DE between infected and non-infected DCs of susceptible individuals by testing β β + =0 . We corrected for multiple testing using q-values estimated via adaptive shrinkage 60 and considered differentially expressed genes as those with a q-value less than 10%.
Note that we also tested the interaction term, β = 0 treat status , , to identify genes in which the difference in expression level between the infected and non-infected states was significantly different between susceptible and resistant individuals. However, as expected since no DE genes were identified between susceptible and resistant individuals in the infected state (see Results), the results of testing the interaction term were partially redundant with the results of testing differences between susceptible and resistant individuals in the non-infected state, and thus we ignored these results throughout this study.
Combined analysis of gene expression data and GWAS results. The GWAS p-values were from previously published studies of TB susceptibility conducted in Russia 18 , The Gambia 13 , Ghana 13 , and Uganda and Tanzania 19 (and a height GWAS in individuals of European descent 25 ). To perform a combined analysis of the gene expression and the summary statistics from each GWAS, we assigned each gene to the SNP with the minimum GWAS p-value out of all the SNPs located within 50 kb up or downstream of its transcription start site. Specifically, we obtained the genomic coordinates of the SNPs with the R/Bioconductor 51 package SNPlocs. Hsapiens.dbSNP144.GRCh38 and matched SNPs to nearby genes using GenomicRanges 61 . 10,265 to 11,060 of the 11,336 genes were assigned an association p-value depending on the GWAS (Supplementary Data S4). For each of the 4 differential expression contrasts we tested (resistant vs. susceptible individuals in the non-infected state, resistant vs. susceptible individuals in the infected state, effect of treatment in resistant individuals, effect of treatment in susceptible individuals), we also performed an enrichment analysis. To do so, we calculated the fraction of genes assigned a GWAS SNP with p-value less than 0.05 for bins of genes filtered by increasingly stringent cutoffs for the observed differential expression effect size (the absolute value of the log fold change). The effect size cutoffs were chosen such that on average each subsequent bin differed by 25 genes. To measure enrichment, we calculated the area under the curve using the R package flux 62 (we also subtracted the background area under the line y = 1 because the number of genes assigned a SNP varied across the GWAS). In order to assess significance, we calculated the area under the curve for 100 permutations of the data. All differential expression tests were statistically significantly enriched for SNPs with low GWAS p-values in every TB susceptibility GWAS (empirical P < 0.01) and not enriched for the height GWAS (empirical P > 0.01) ( Fig. 2; Supplementary Fig. S10).
Classifier. The training set included data from the 22 high-quality non-infected samples from this study with known susceptibility status. The test set included the 65 non-infected samples from one of our previous studies in which the susceptibility status is unknown 24 , and thus assumed to be similar to that in the general population (10%) (we also tested the classifier on data from a small study of macrophages 20 , Supplementary Fig. S17). Because the two studies are substantially different, we took multiple steps to make them comparable. First, we subset to include only those 9,450 genes which were assayed in both. Second, because the dynamic range obtained from RNA-seq (current study) and microarrays (previous study 24 ) were different, we normalized the gene expression levels to a standard normal (μ = 0, σ = 1) distribution (Supplementary Fig. S11; note however that this strategy is unable to correct for the inability of microarrays to accurately quantify genes with expression levels that result in fluorescence levels below the background level or above the saturation limit). Third, we corrected for the large, expected batch effect between the two studies by regressing out the first PC of the combined expression data using the limma function removeBatchEffect 57 (Supplementary Fig. S12).
To identify genes to use in the classifier, we performed a differential expression analysis on the normalized, batch-corrected data from the current study using the same approach described above (with the exception that we no longer used voom 56 since the data were no longer counts). Specifically, we tested for differential expression between susceptible and resistant individuals in the non-infected state and identified sets of genes to use in the classifier by varying the q-value cutoff. Cutoffs of 5%, 10%, 15%, 20%, and 25% corresponded to gene set sizes of 99, 385, 947, 1,934, and 3,697, respectively. We used the R package caret 63 to train 3 different machine learning models: elastic net 64 , support vector machine 65 , and random forest 66 (the parameters for each individual model were selected using the Kappa statistic). To assess the results of the model on the training data, we performed leave-one-out-cross-validation (LOOCV). In order to choose the model with the best performance, we calculated the difference between the mean of the LOOCV-estimated probabilities of being TB resistant for the samples known to be TB resistant and the corresponding mean for the samples known to be TB susceptible. This metric emphasized the ability to separate the susceptible and resistant individuals into two separate groups. Using this metric, the best performing model was the support vector machine with the 99 genes that are significantly differentially expressed at a q-value of 5% (Fig. 3a, Supplementary Fig. S13, Supplementary Data S5); however, both the elastic net ( Supplementary Fig. S14) and random forest ( Supplementary Fig. S15) had similar performance. Lastly, we tested the classifier by predicting the probability of being TB susceptible in the 65 healthy samples (Fig. 3b). For evaluating the predictions on the test set of individuals with unknown susceptibility status, we used a relaxed cutoff of the probability of being TB susceptible of 0.25, which was based on the ability of the model at this cutoff to classify all TB susceptible individuals in the training set as susceptible with only 2 false positives. As expected, the 99 genes used in the classifier had similar normalized, batch-corrected median expression levels in the non-infected state across both studies (Supplementary Fig. S16).
Software implementation. We automated our analysis using Python (https://www.python.org/) and Snakemake 67 . Our processing pipeline used the general bioinformatics software FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), MultiQC 68 , samtools 69 , and bioawk (https://github.com/lh3/bioawk). We used R 70 for all statistics and data visualization. We obtained gene annotation information from the Ensembl 50 and Lynx 71 databases. The computational resources were provided by the University of Chicago Research Computing Center. All code is available for viewing and reuse at https://github.com/jdblischak/tb-suscept. Data availability. The raw fastq files have been deposited in NCBI's Gene Expression Omnibus 72 and are accessible through GEO Series accession number GSE94116 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?ac-c=GSE94116). The RNA-seq gene counts and other summary data sets are included as Supplementary Data and are also available for download at https://github.com/jdblischak/tb-suscept.
While a small subset of the GWAS summary statistics required for partially reproducing our results were included in Supplementary Data S4, we do not have permission to share the full set of summary statistics from the previously published TB susceptibility GWAS that would be required for fully reproducing our results. To access the summary statistics, contact the authors directly: Russia -Sergey Nejentsev (sn262@cam.ac.uk), Ghana and The Gambia -Thorsten Thye (thye@bnitm.de), Uganda and Tanzania -Scott M. Williams (smw154@case.edu). The summary statistics for the height GWAS can be downloaded from the GIANT Consortium's website (http:// portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files).