Discerning asthma endotypes through comorbidity mapping

Asthma is a heterogeneous, complex syndrome, and identifying asthma endotypes has been challenging. We hypothesize that distinct endotypes of asthma arise in disparate genetic variation and life-time environmental exposure backgrounds, and that disease comorbidity patterns serve as a surrogate for such genetic and exposure variations. Here, we computationally discover 22 distinct comorbid disease patterns among individuals with asthma (asthma comorbidity subgroups) using diagnosis records for >151 M US residents, and re-identify 11 of the 22 subgroups in the much smaller UK Biobank. GWASs to discern asthma risk loci for individuals within each subgroup and in all subgroups combined reveal 109 independent risk loci, of which 52 are replicated in multi-ancestry meta-analysis across different ethnicity subsamples in UK Biobank, US BioVU, and BioBank Japan. Fourteen loci confer asthma risk in multiple subgroups and in all subgroups combined. Importantly, another six loci confer asthma risk in only one subgroup. The strength of association between asthma and each of 44 health-related phenotypes also varies dramatically across subgroups. This work reveals subpopulations of asthma patients distinguished by comorbidity patterns, asthma risk loci, gene expression, and health-related phenotypes, and so reveals different asthma endotypes.

probability of getting a statistic at least as extreme was computed to be 4.89×10 −5 . We then adjusted this probability value to 2.81×10 −3 (< 0.05) after controlling the false discovery rate, and thus claimed that rs71465403 had a significantly stronger effect in the GI subgroup than that in the any-CDs group by rejecting the null hypothesis. In this way, we subjected all the SNPs that met the suggestive threshold (p < 10 −5 ) to the test, and in total, we identified 182 loci that showed comparably stronger effects in subgroups (see Supplementary Data 8 for a detailed summary).

Detecting genomic regions with shared genetic influences in asthma subgroups and in the any-CDs group
To complete our query into the GWAS findings, we tested not only the genetic specificities between asthma subgroups and the any-CDs group, but also their genetic commonalities; in other words, whether genomic regions share asthma-associated influences between the two. For this purpose, we first divided the 22 autosomes into 1703 approximately independent regions based on patterns of linkage disequilibrium 1 , and then for each region applied a hierarchical model to estimate the probability that a genomic locus contained at least one variant that influenced the both 2 (see Methods for details). A systematic scan through all the regions located 73 unique genomic regions that had shared influences between the subgroup(s) and the any-CDs group (see Supplementary Data 9 for the complete results). In Supplementary Fig. 2, these regions were colored in blue. We can see their great abundance in the relatively densely populated subgroups 3, 5 and 8, and their scarcity in sparsely populated subgroups 4 and 6 possibly due to limited detection powers. In the same way, we also investigated the genetic commonality between asthma subgroups, finding 21 unique genomic regions with shared influences (see Supplementary Data 10 for the complete results).
Supplementary Fig. 6 shows eight genomic regions in which genetic influences observed in the any-CDs group are shared with four or more different subgroups, and we consider them as conserved regions. In particular, two consecutive genomic regions, No. 655 and 656, covering SNP positions from 30 Mb to 32 Mb on chromosome 6, are among the most conserved regions (shared by five subgroups), and therein lie the Human Leukocytes Antigen (HLA) super-loci, which have been reported to associate with various asthma phenotypes 3 and with regulation of immune system, as well as some other fundamental cellular processes 4 . We applied topic modeling to five different cohorts (see b-f ), generating a total of 33 subgroups (see Methods "The US MarketScan Commercial database and topic modeling for asthma subgroup identification"). Out of these subgroups, we particularly numbered the eleven stable subgroups that can be commonly found in all the five cohorts and thus were further discussed in this study (see the first eleven rows), while named the other 22 subgroups as "Extra" subgroups (see the last 22 rows).

Supplementary Table 1. Genome-wide significant associations with asthma in the general population with any comorbid diseases (the any-CDs group) or in their subgroups (see Methods "UK Biobank database and GWAS")
b We applied topic modeling to a population of 6,048,247 asthma patients aged between 15 and 70 in the US MarketScan insurance claims database, identifying a total of 22 subgroups (see the filled rows). Here, we report the top five most abundant comorbid diseases therein contained as well as their respective occurring frequencies (see Supplementary Data 1 for the complete subgroup profiles).
c The same topic modeling procedure was applied to a population of 3,152,519 individuals in the US MarketScan data who were aged between 15 and 70, but carried at least two asthma codes. This sensitivity analysis 1 generated a total of 25 subgroups, of which 21 subgroups had been seen among the subgroups found in the discovery cohort b (see Supplementary Data 2 for the complete subgroup profiles).
d The same topic modeling procedure was applied to a population of 3,401,250 individuals in the US MarketScan data who carried at least one asthma code, but were aged between 40 and 70. This sensitivity analysis 2 generated a total of 23 subgroups, of which 20 subgroups had been seen among the subgroups found in the discovery cohort b (see Supplementary Data 3 for the complete subgroup profiles).
e The same topic modeling procedure was also applied to a population of 3,687,965 individuals in the US MarketScan data who not only were aged between 15 and 70 and carried at least one asthma code, but also had at least one type of asthma drug prescriptions. This sensitivity analysis 3 generated a total of 26 subgroups, of which 22 subgroups had been seen among the subgroups found in the discovery cohort b (see Supplementary Data 4 for the complete subgroup profiles).
f The same topic modeling procedure was also applied to a population of 66,448 individuals enrolled in UK Biobank who carried at least one asthma code and were aged between 39 and 72. This sensitivity analysis 4 generated a total of 18 subgroups, of which eleven subgroups had been seen among the subgroups found in the discovery cohort b (see Supplementary Data 5 for the complete subgroup profiles).
g In order to assess whether any of the subgroups generated based on the cohorts for sensitivity analyses can be claimed as successful replications of the subgroups discovered based on the discovery cohort, we computed their Pearson's correlations based on the median frequency profiles of comorbid diseases in the respective subgroups. We only claim a successful replication, if the respective correlation is determined to be significant. The Pearson's correlation coefficients and their corresponding two-sided p-values out of Student's t tests are shown here. The assay was performed on blood samples which were obtained during UK Biobank assessment center visit. Eosinophil count in the table, for example, is the median proportion of (eosinophils/100) × white blood cell count given in 10 9 cells/liter (i.e., unit of measurement here), and the values in parentheses are interquartile ranges; b The white British subset of UK Biobank.  b FVC stands for forced vital capacity, and we report its fraction of predicted FVC value here (see Methods "Associating with health-related phenotypes based on UKB phenotypic data"); c FEV1 stands for forced expiratory volume in one second, and we report its fraction of predicted FEV1 value here; d PEF stands for peak expiratory flow, and we report its fraction of predicted PEF value here; e FEV1/FVC is the ratio of FEV1 to FVC, and we report its fraction of predicted ratio value here.  b BMI stands for body mass index, and it was constructed from height (in meters) and weight (in kilograms) measured during the initial Assessment Centre visit. Here we report the median BMI value given in kilogram/meter 2 , and the values in parentheses are interquartile ranges; c The two percentage values in each cell are the percentages of the participants, who smoked previously (have stopped now) and are still smoking now, respectively; a Distribution similarity is measured by overlapped area relative to each pair of distributions; the value ranges from 0 to 1. The distribution similarity metric is equal to 1 for two identical distributions and 0 for two completely dissimilar ones.  Fig. 1. Visualizations of the identified eleven asthma subgroups (Related to Fig. 1).

Supplementary Table 13. Summary statistics of individuals' durations (in weeks) of enrollment and actual diagnosis recordings in each subgroup and in these subgroups combined (based on US MarketScan data).
(a) The elbow method determined the optimal threshold number of cluster points for claiming a stable subgroup. Varying the threshold numbers, we computed the mean stability score of the resulting subgroups for each threshold number. Here, we plot the mean stability scores (y-axis) against different threshold numbers (x-axis). The threshold number of 50 appears optimal, because it is where the increase of the mean stability score switches from fast to slow (i.e., the "elbow" location, indicated by a dashed line).
(b) The t-SNE projection of the identified asthma subgroups. Applying the flowchart shown in Fig.1 generated eleven asthma subgroups (shown in different colors). We named each subgroup after the broader category to which several most frequently occurring diseases belonged, and also numbered the subgroups for easier reference. Singleton subgroup points were treated as noises and thus excluded from display. This two-dimensional t-SNE projection here is for visualization purpose only, while the actual subgrouping was done based on all the dimensions of 567 diseases (see Methods). are mutually exclusive, and each subgroup is defined by a unique distribution of co-occurring diseases. Supplementary Fig. 2. GWAS Manhattan plots with stronger-effect and shared-effect loci annotated (Related to Fig. 2b).
On the basis of UK Biobank data analysis, Manhattan plots were generated to indicate statistical significance of genetic associations between states of SNPs and asthma in individual subgroups. We selected subgroups 3 "GI," 4 "Lymphoma," 5 "Musculoskeletal," 6 "Lung," and 8

K R T 8 P 3 7 ( r s 2 7 6 5 4 0 0 )
"Cardiovascular" that contained genome-wide significant loci for display here (as labelled in each plot's title, there also shows asthma case count versus non-asthma control count in the respective subgroup). All p-values are shown on a -log10 scale on the y-axis, and genomic locations are shown on the x-axis. The threshold of genome-wide significance (5×10 −8 ) is indicated as a red horizontal line in each plot. Some loci shared the similar effects in subgroups as those in the any-CDs group, and we color these loci in blue; other loci showed significantly stronger effects in subgroups (see Methods) and are marked in orange along with their nearest genes. In addition, we highlighted the genome-wide significant loci that are subgroup-specific in purple and labelled them with the nearest genes and the SNPs (in parentheses). Detailed results can be found in Supplementary Data 8-10.