Progression of recent Mycobacterium tuberculosis exposure to active tuberculosis is a highly heritable complex trait driven by 3q23 in Peruvians

Among 1.8 billion people worldwide infected with Mycobacterium tuberculosis, 5-15% are expected to develop active tuberculosis (TB). Approximately half of these will progress to active TB within the first 18 months after infection, presumably because they fail to mount the initial immune response that contains the local bacterial spread. The other half will reactivate their latent infection later in life, likely triggered by a loss of immune competence due to factors such as HIV-associated immunosuppression or ageing. This natural history suggests that undiscovered host genetic factors may control early progression to active TB. Here, we report results from a large genome-wide genetic study of early TB progression. We genotyped a total of 4,002 active TB cases and their household contacts in Peru and quantified genetic heritability of early TB progression to be 21.2% under the liability scale. Compared to the reported of genome-wide TB susceptibility (15.5%), this result indicates early TB progression has a stronger genetic basis than population-wide TB susceptibility. We identified a novel association between early TB progression and variants located in an enhancer region on chromosome 3q23 (rs73226617, OR=1.19; P < 5×10−8). We used in silico and in vitro analyses to identify likely functional variants and target genes, highlighting new candidate mechanisms of host response in early TB progression.

develop immediately (within the first 18 months) after recent M.tb infection or after many years 23 of latency, presumably caused via distinct disease mechanisms. Late progression or TB 24 reactivation is more likely the consequence of acquired immune compromise due to other 25 diseases or ageing, whereas early progression is presumably due to failure in mounting the 26 initial immune response that contains the bacterial spread. Previous studies have indicated a 27 strong heritable component of population-wide TB susceptibility, that includes early disease 28 progression, reactivation and infection [3][4][5] . But whether early progression has a different genetic 29 architecture compared to population-wide susceptibility has yet to be defined. 30 31 Reported associations for TB, and other infectious diseases, has to be considered in the context 32 of TB diagnostic criteria and selected control groups 6,7 . To date, genome-wide association 33 studies (GWAS) of TB have compared mixed pools of TB patients with early progression or 34 reactivation, to population controls, who may not have been exposed to M.tb at all [8][9][10][11][12]  To identify host factors that drive pulmonary early TB progression, we conducted a large, 44 longitudinal genetic study in Lima, Peru (Fig. 1b), where the TB incidence rate is one of the 45 highest in the region 2 . We enrolled patients with microbiologically confirmed pulmonary TB. 46 Within two weeks of enrolling an index patient, we identified their household contacts (HHCs) 47 and screened for infection as measured by a tuberculin skin test (TST) and for signs and 48 symptoms of pulmonary and extra-pulmonary TB. HHCs were re-evaluated at two, six and 49 twelve months. We considered individuals to be early progressors if they are (1) index patients 50 whose M.tb isolates shared a molecular fingerprint with isolates from other enrolled patients; (2) 51 HHCs who developed TB disease within one year after exposure to an index patient and (3) 52 index patients who were 40 years old or younger at time of diagnosis. We considered HHCs 53 who were TST positive at baseline or any time during the 12 month follow up period, but who 54 had no previous history of TB disease and remained disease free, as non-progressing controls 55 (Methods, Figure 1b). In total, we genotyped 2,175 recently exposed pulmonary TB cases 56 (early progressors) versus 1,827 HHCs with latent tuberculosis infection, who had not 57 progressed to active TB during one year of follow-up (non-progressors), as controls (Methods, 58 Supplementary Table 1). 59 60 To our knowledge, this represents the most extensive genetic study conducted in Peru to date. 61 Peru is a country with a complex demographic history and underexplored genomic variation. 62 When Spanish conquistadors arrived in the region in the 16th century, Peru was the center of 63 the vast Inca Empire and was inhabited by a large Native American population 13,14 . During the 64 colonial period, Europeans and Africans (brought in as slaves) arrived in large numbers to Peru. 65 After Peru gained its independence in 1821, there was a flow of immigrants from southern 66 China to all regions of Peru as a replacement for slaves 15,16 . As a result, the genetic background 67 of the current Peruvian population is shaped by different levels of admixture between Native 68 Americans, Europeans, African and Asian immigrants that arrived in waves with specific and 69 dated historical antecedents. When compared to individuals from other South American 70 countries 17,18 , Peruvians tend to share a greater genetic similarity with Andean indigenous 71 people such as Quechua and Aymara (Figure 2, Supplementary Figure 1 Figure 2). When compared to other more comprehensive genotyping 78 platforms available at the time, LIMAArray showed an approximately 5% increase in imputation 79 accuracy, particularly for population-specific and low-frequency variants (Supplementary Table  80 3). We derived estimated genotypes for ~8 million variants using the 1000 Genomes Project 81 Phase 3 17 as the reference panel and tested single marker and rare-variant burden associations 82 with linear mixed models that account for both population stratification and relatedness in the 83 cohort ( Supplementary Figure 3-4, Methods). Genome-wide association results of 2,160 84 cases and 1,820 controls after quality control (Methods) are summarized in Supplementary 85   22 . In contrast, using LD Score regression 24 on summary 101 statistics from a GWAS of population-wide TB susceptibility in Russia 10 , we estimated the of 102 population-wide TB susceptibility to be 15.5% (s.e.=0.04, = 5.33×10 !! ) with assumed 103 prevalence of 0.04 25 . These data suggest that recently exposed TB progression may have a 104 stronger host genetic basis compared to population-wide TB susceptibility, and larger 105 progression studies may be well-powered to discover additional variants. 106 107 We next identified a novel risk locus associated with TB progression on chromosome 3q23, 108 which is comprised of 11 variants in non-coding regions downstream of RASA2 and upstream of 109 (Figure 3, Supplementary Table 6). The strongest association was at a 110 genotyped variant rs73226617 (OR=1.19; = . × ! ). To test for artifacts and to identify 111 stronger associations that might have been missed due to genotyping and imputation, we first 112 checked the genotype intensity cluster plot of the top associated variant which showed clear 113 separation between genotypes AA, AG and GG (Supplementary Figure 6). We then designed 114 individual TaqMan genotyping assays for four top associated variants (Methods, 115 Supplementary Table 7). We genotyped these four SNPs in 4,002 initial subjects and 116 concluded that all four variants show a high concordance rate (>99%) with imputed genotypes 117 (Supplementary Table 6 Table 8). These 141 SNPs were less frequent (<1%) in the African populations than in the European and Peruvian 142 populations, resulting in lower statistical power to detect association. We therefore examined 143 the SNPs in two previously published Russian 10 (5,530 TB cases and 5,607 controls) and 144 Icelandic 27 (4,049 TB cases and 6,543 TST+ controls) GWAS datasets. We observed that the 145 effects in the Russian cohort were similar, as they shared comparable ORs of 1.10 (Peru) and 146 1.18 (Russia) for rs73226617 (P Russia =0.065). In contrast, there was no signal observed in the 147 Icelandic cohort (OR=1.06, P Iceland =0.437). Consistent with our previous case-only analysis, the 148 weaker signals observed in both European cohorts indicate that 3q23 is specifically associated 149 with early TB progression. The association signals were therefore most likely diluted due to the 150 inclusion of reactivation cases and non-infected controls in the cohort collection. 151 We next examined how previously published TB GWAS risk loci are associated in this study. 153 We detected evidence of association in a previously reported TB locus at rs9272785 in the HLA 154 region 27 (OR=1.04, = 4.49×10 !! ), but did not detect signals at other reported risk loci 155 (Supplementary Table 9). Thus, previously reported loci may relate to infection or reactivation 156 phenotypes, rather than early TB progression whereas HLA association may affect both early 157 progression and reactivation. The strongest association observed in the HLA region after 158 imputation (Method, Supplementary Figure 9 other Type I interferon response genes, was detected early in tuberculosis contacts who 176 progressed to active disease 30,31 . Overall, we saw an enrichment of the interferon response 177 factor in the 3q23 locus (Figure 3d). We then performed electrophoretic mobility shift assays (EMSA) and luciferase assays to functionally identify the most likely causal variant among the 179 seven variants that constitute 90% credible set (Method). EMSA tests whether the variants 180 differentially bound nuclear complexes in an allele-specific manner. Four variants (rs73226617, 181 rs148722713, rs11710569 and rs73226608) showed differential EMSA signals in the risk 182 variants that were suppressed with unlabeled probes, consistent with allele-specific protein 183 binding in the Jurkat76 cell line. We performed luciferase assays on the four candidate variants 184 but failed to detect allele-specific enhancer activity on human embryonic kidney (HEK293T) 185 cells (Methods, Supplementary Figure 10). This negative result may be driven by the 186 sensitivity limits of the assay or the variants having cell-type-specific activities which might not Preparation of genome-wide genetic data 227 We enrolled index cases as adults (aged 15 and older) who presented with clinically suspected 228 pulmonary TB at any of 106 participating health centers. We excluded patients who resided 229 outside the catchment area, who had received treatment for TB before and those who were and symptoms of pulmonary and extra-pulmonary TB disease at two, six and 12 months after 237 enrollment. To select cases who were likely to have recently exposed TB, we chose HIV-238 negative, culture-positive, drug-sensitive pulmonary TB cases from one of three groups: (1) 239 exposed HHCs who developed active TB during a 12 month follow up period; (2) index patients 240 whose M.tb isolates shared a molecular fingerprint with isolates from other enrolled patients and 241 (3) index patients who were 40 years old or younger at time of diagnosis. To maximize the 242 likelihood that controls were exposed to M.tb but did not develop active disease, we chose them 243 from among TST positive HHCs with no previous history of TB disease, and who remained 244 disease free at at the time of recruitment both by directly re-contacting individuals to inquire 245 about their latest medical history and by checking their names against lists of notified TB 246 patients at all of the 106 health clinics. Where possible, we chose controls who are less than 247 second-degree related to the index cases. 248 Luo et al.

genetics study of TB progression
Customized Axiom array for Peruvian populations 249 We developed a custom array (LIMAArray) based on whole-exome sequencing data from 116 250 active TB cases to optimize the capture of genome-wide genetic variation in Peruvians. Many 251 markers were included because of known associations with, or possible roles in, phenotypic 252 variation, particularly TB-related (Supplementary Table 11). The array also includes coding 253 variants across a range of minor allele frequencies (MAFs), including rare markers (<1% MAF), 254 and markers that provide good genome-wide coverage for imputation in Peruvian populations in 255 the common (>5%), low frequency (1-5%) and rare (0.5-1%) MAF ranges (Supplementary 256 Table 3). This approach allowed the detection of rare population specific coding variants and 257 those which predisposed individuals to TB risk. 258 Genotyping and quality control 259 We extracted genomic DNA from whole blood of the participating subjects. Genotyping of all 260 samples was performed using our customized Affymetrix LIMAArray. Genotypes were called in 261 a total of 4,002 samples using the apt-genotype-axiom 35 . Individuals were excluded if they were 262 missing more than 5% of the genotype data, had an excess of heterozygous genotypes (±3.5 263 standard deviations, Supplementary Table 12), duplicated with identity-by-state >0.9 or index 264 cases with age at diagnosis greater than 40 years old. After excluding these individuals, we 265 excluded variants with a call rate less than 95%, with duplicated position markers, those with a 266 batch effect ( < 1×10 !! ), Hardy-Weinberg (HWE) P-value below 10 !! in controls, and a 267 missing rate per SNP difference in cases and controls greater than 10 !! (Supplementary 268 Table 13). In total, there were 3,980 samples and 677,232 SNPs left for imputation and 269 association analyses after quality control. 270 Imputation and association analyses 271 The genotyped data were pre-phased using SHAPEIT2 36 . IMPUTE2 37 was then used to impute 272 genotypes at untyped genetic variants using the 1000 Genomes Project Phase 3 dataset 17 as a 273 reference panel. For chromosome X, males are coded as diploid. That is male genotypes are 274 coded as 0/2 and females genotypes are coded as 0/1/2. HLA imputation was performed using 275 SNP2HLA 38 and a multi-ethnic HLA imputation reference pane 39 . Imputed SNPs were excluded 276 if the imputation quality score ! was less than 0.4, HWE P-value < 10 !! in controls or a 277 missing rate per SNP greater than 5%. After filtering, 7,756,401 SNPs were left for further 278 association analyses. 279

280
Common single variant associations were tested with a linear mixed model (LMM) implemented 281 in GEMMA 40 version 0.94.1 on genotype likelihood from imputation assuming an additive 282 genetic model. We use the genetic relatedness matrix (GRM) as random effects to correct for 283 cryptic relatedness between collected individuals. Sex and age were included as fixed effects to 284 correct for population stratification (Supplementary Figure 2). The GRM was obtained from an 285 LD-pruned (r 2 < 0.2), with MAF ≥1% after removing large high-LD regions 41 (Supplementary 286  ( !" ) for variants after imputation was 1.03 and 1.00 for common and rare association study 293 respectively (Supplementary Figure 4), indicating that we have successfully controlled for any 294 residual population structure or cryptic relatedness between genotyped samples. 295

296
To avoid false-positive signals due to population stratification and heterogeneity of effects due 297 to differential LD in admixed populations, we also computed GRMs based on methods 20,21 that 298 account for inflation of identity-by-state statistics due to admixture LD. LMM with admixture-299 aware GRMs resulted in numerically similar association statistics to those from unadjusted 300 analyses (Supplementary Table 15). 301

302
To identify likely causal variants in the identified risk locus, we used FINEMAP 28 method to 303 calculate marginal likelihoods and Bayes factor for each variant assuming that there is one true 304 causal variant in the region, and it has been included in the analysis and has been well imputed 305 (--n-causal-max 1). We used the in-sample LD scores calculated using LDstore 43 to further 306 increase the accuracy of the fine-mapping analysis. 307 308 TaqMan SNPs and Genotyping 309 Selection of SNPs in the 3q23 locus was conducted based on information from the dbSNP 310 database (http://www.ncbi.nlm.nih.gov/projects/SNP/). Two polymorphisms rs73226617, 311 rs73226619, rs73239724 and rs73226608 were included for the genotyping tests. Real-time 312 PCR using the following calculations: 2.5uL Genotyping Master Mix, 0.25uL SNP Assay-probes, 313 and 2.25uL DNA template (at 5ng/uL= 11.25ng total).Thermal cycling conditions were as 314 follows: 60C 30secs Pre-read, 95 °C for 10 min, followed by 40 cycles at 95°C for 15 s and at 60 315 °C for 1 min, then 60C 30secs Post-read. Genotyping of the polymorphisms was carried out 316 using the 5' exonuclease TaqMan Allelic Discrimination assay, which was performed utilizing 317 minor groove binder probes fluorescently labeled with VIC or FAM and the protocol 318 recommended by the supplier (Applied Biosystems, Foster City, CA, USA). Analysis for 319 interpretation was performed with Via7 software and Taqman Genotyper software calls. Per 320 variant concordance rate was obtained by comparing genotypes obtained from imputation and 321 from TaqMan assays (Supplementary Table 6 In silico functional annotation of candidate causal variants 332 We combined multiple sources of in silico genome-wide functional annotations from publicly 333 available databases to help identify potential functional variants and target genes in the 3q23 334 novel risk locus. To investigate functional elements enriched across the region encompassing 335 the strongest candidate causal variants,We aggregated approximately 400 genomic and 336 sequence annotations including cell-type-specific annotation types such as ATAC-seq, DNase-337 seq, FAIRE-seq, HiChIP-H3K27ac, HiChIP-CTCF, polymerase and elongation factor ChIP-seq, 338 and histone modification ChIP-seq, as well as cell-type-nonspecific annotations such as 339 conservation, coding annotation, and distance to TSS. A list of all included resources is 340 summarized in Supplementary Table 10   are outlined, with the percentages of individuals progressing between steps taken from the 539 WHO TB report 2 . (b) Schema of cohort collection. In this study, we focus on a genetic study 540 between recently exposed active pulmonary TB cases (progressors) and subjects with 541 tuberculin skin test (TST) positive results, who did not progress to active TB (non-progressors). 542 Index cases had sputum with confirmed TB. Controls were recruited in the same household as 543 index cases, with 12 month follow-up periods to confirm infection status using TST. 544