Multi-ancestry transcriptome-wide association analyses yield insights into tobacco use biology and drug repurposing

Most transcriptome-wide association studies (TWASs) so far focus on European ancestry and lack diversity. To overcome this limitation, we aggregated genome-wide association study (GWAS) summary statistics, whole-genome sequences and expression quantitative trait locus (eQTL) data from diverse ancestries. We developed a new approach, TESLA (multi-ancestry integrative study using an optimal linear combination of association statistics), to integrate an eQTL dataset with a multi-ancestry GWAS. By exploiting shared phenotypic effects between ancestries and accommodating potential effect heterogeneities, TESLA improves power over other TWAS methods. When applied to tobacco use phenotypes, TESLA identified 273 new genes, up to 55% more compared with alternative TWAS methods. These hits and subsequent fine mapping using TESLA point to target genes with biological relevance. In silico drug-repurposing analyses highlight several drugs with known efficacy, including dextromethorphan and galantamine, and new drugs such as muscle relaxants that may be repurposed for treating nicotine addiction.


Supplementary Figures
Distribution of genomic control values. Panels a-d show violin plots of the genomic control values of four TWAS methods using GTEx eQTL data (i.e., TESLA, FE-TWAS, RE-TWAS and EURO-TWAS) for four smoking traits: a) AgeInit, b) CigDay, c) SmkInit and d) SmkCes. Each data point in a violin plot represents the genomic control value from one tissue. The line in the middle of each box represents the median, and the bold red dot in the middle of the box represents the mean. The lower and upper bounds of the box represent the 25th and 75th percentiles, and the whiskers extend to 1.5 times the interquartile range (IQR). The contours of each "violin" represent the density function of the data points. Outliers beyond 1.5 times the IQR are plotted individually. All methods have well-behaved genomic control values across all scenarios. Panels e-l display quantile-quantile plots of p-values for four TWAS methods using the LIBD eQTL dataset from nucleus accumbens in African and European ancestries. Panels e and f are the results for trait AgeInit for EUR-TWAS and AFR-TWAS, respectively. Panels g and h are the results for trait CigDay for EUR-TWAS and AFR-TWAS, respectively. Panels i and j are the results for trait SmkCes for EUR-TWAS and AFR-TWAS, respectively. Panels k and l are the results for trait SmkInit for EUR-TWAS and AFR-TWAS, respectively. Shaded areas in the plots represent the 95% confidence band of the quantiles of $-\log_{10}(p\text{-value})$.

Supplementary Figure 9: Sankey plot of drug pathway enrichment analysis. We visualize enrichment results, linking smoking traits to mechanisms of action through the drug target genes and drug categories. The widths of the bands between traits and pathways represent the fractions of TESLA hits that belong to a given drug pathway. The widths of the bands between pathways and drug indications represent the fraction of pathways that are targeted by drugs with a given indication.
Table 7 Comparison of TESLA results using eQTL datasets of European and African American ancestries from LIBD. We list the number of loci that were identified in each ancestry (with two-sided p-value < $2.5 \times 10^{-6}$) and the number of loci that also remained significant in the other ancestry.

The meta-regression models that produce minimal p-values inform the extent of phenotypic effect heterogeneity. TESLA fits multiple meta-regression models with varying numbers of allele frequency principal components. Based on the estimated phenotypic effects in each meta-regression model, we perform TWAS using two-sided tests and combine the results using minimal p-values. The meta-regression model that yields the minimal p-value informs the extent of phenotypic heterogeneity between ancestries. For example, the meta-regression model with 0 PCs is equivalent to a fixed-effect meta-analysis. As the first PC separates African and non-African ancestry, the meta-regression model with 1 PC is likely to produce the most significant p-values for genes showing different effects between African American and other ancestries. Here, among genes that reach the significance level $\alpha = 2.5 \times 10^{-6}$, we report the fraction and number of different meta-regression models that produce minimal p-values.

For each trait, we performed fine mapping for TESLA results. We define a locus as a 1 million base pair window surrounding a sentinel gene with the most significant two-sided p-value (< $2.5 \times 10^{-6}$). The posterior inclusion probability (PIP) was calculated for the sentinel gene, and the genes within the 90% credible interval were also reported. For secondary association signals, iterative conditional analysis was performed by conditioning on the most significant gene in the previous iteration; the conditional TESLA two-sided p-values and converted Z-scores were used to estimate PIPs for secondary signals.
Supplementary Table 11 [Excel spreadsheet]: GO term enrichment results. We created gene sets based upon pathways in the Gene Ontology database and performed enrichment analysis using TESLA results. We used a parametric bootstrap to control the family-wise error rate (FWER). We report GO terms with significant two-sided p-values (FWER<0.05), as well as the categories of each GO term. As a sensitivity analysis on the impact of the pathway database used, we also include significant enrichment hits using KEGG, Reactome, and WikiPathways following the same pipeline.

C. Smoking Cessation (SmkCes)
1. Binary phenotype with current smokers coded as "2", former smokers coded as "1", and never smokers coded as missing.
2. Does not include information about pipes/cigars/chew, or other non-cigarette forms of tobacco use.
3. Usually measured through a combination of questions, including:
   a. Do you currently smoke and have you ever smoked regularly?
   b. Do you smoke and have you smoked over 100 cigarettes in your entire life?

D. Smoking Initiation (SmkInit)
1. This is a binary phenotype. Any participant reporting ever being a regular smoker in their life (current or former) was coded "2", while any participant who reported never being a regular smoker in their life was coded "1".
2. Does not include information about pipes/cigars/chew, or other non-cigarette forms of tobacco use.
3. This phenotype was measured in a variety of ways according to reference 2:
   a. Have you smoked over 100 cigarettes over the course of your life?
   b. Have you ever smoked every day for at least a month?
   c. Have you ever smoked regularly?

Dataset description
Below, we describe the transcriptomics datasets used in our manuscript.

Genotype Tissue Expression (GTEx) Project Data
We obtained pre-computed gene expression prediction model weights of 48 tissues from the PrediXcan website, which were based on GTEx (version 7) 3 .

Lieber Institute for Brain Development (LIBD) Human Brain Repository Data
To complement GTEx nucleus accumbens data, we leveraged RNA-seq and genotype data from postmortem nucleus accumbens samples of physiologically normal human brains. Compared to GTEx data from the same tissue, our data are more ancestrally diverse and include a greater fraction of African American samples: 53% of individuals are of European ancestry (N=104) and 47% are of African ancestry (N=94).
We used paired-end, stranded RNA-seq and Illumina genotyping array data from postmortem nucleus accumbens of physiologically normal human brains. Details on their data collection, RNA-seq and genotyping data processing, and quality control (QC) are described in Markunas et al 4 .
For eQTL analyses, we included samples with RNA integrity number (RIN) >6, gene assignment rate (GAR; proportion of reads mapping to a gene annotation) >0.3, mitochondrial mapping rate (proportion of reads mapping to mitochondrial DNA) >0.11, and overall mapping rate >0.5. Among the samples passing these QC filters, lowly expressed genes (genes with <10% of samples having 1 transcript per million mapped reads (TPM) or <10% of samples having 10 raw counts) were excluded. Both TPM and raw count values were quantified using the software Salmon version 1.5.2 (https://salmon.readthedocs.io). Across-sample normalization was then applied to the raw counts using median-of-ratios normalization in DESeq2 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html), followed by a variance stabilizing transformation. Data processing was done separately by ancestry, resulting in 20,486 genes among European ancestry samples and 20,973 genes among African American ancestry samples for eQTL analyses.
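The lowly-expressed-gene filter described above can be illustrated with a minimal sketch. It assumes that "having 1 TPM" and "having 10 raw counts" mean "at least" those values, which the text does not state explicitly; the function name `keep_gene` and its parameters are hypothetical and not part of any published pipeline.

```python
def keep_gene(tpm, counts, min_tpm=1.0, min_count=10, min_frac=0.10):
    """Return True if a gene passes both expression filters.

    tpm, counts: per-sample TPM and raw-count values for one gene.
    Assumption: a sample "has" the expression level when its value is
    at least min_tpm (or min_count); the thresholds follow the text.
    """
    n = len(tpm)
    frac_tpm = sum(v >= min_tpm for v in tpm) / n
    frac_count = sum(c >= min_count for c in counts) / n
    # A gene is excluded when either fraction falls below min_frac.
    return frac_tpm >= min_frac and frac_count >= min_frac


# Hypothetical gene expressed in 1 of 10 samples: exactly 10%, so it is kept.
print(keep_gene([0, 0, 0, 0, 0, 0, 0, 0, 0, 2.5], [0] * 9 + [15]))
```

Applying the filter separately within each ancestry group would mirror the ancestry-specific processing described in the text.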
We estimated latent factors using probabilistic estimation of expression residuals (PEER) 5 separately for each ancestry. Following PEER, we tested 12 million 1000 Genomes-imputed genetic variants for association with genes in cis (genes within ±1 Mb of each variant) using Matrix eQTL 6 . eQTL models included age at death, sex, RIN, GAR, mitochondrial mapping rate, top 5 genotype principal components, and 4 PEER factors as covariates among European ancestry samples (N=104), and sex, RIN, top 5 genotype principal components, and 6 PEER factors as covariates among African American ancestry samples (N=94). Gene expression prediction models were also generated separately by ancestry using elastic net (as in PrediXcan), which resulted in 19,566 models for European ancestry and 16,526 models for African American ancestry.

Single Cell RNA-seq Data from Entire Mouse Nervous System
To identify smoking-relevant cell types, we made use of an existing gene expression dataset derived from 500,000 single cells from 19 regions in the mouse nervous system 7 . These single cells were then classified into 39 cell types. For each cell type, we created gene sets that consist of top 10% most highly expressed genes as "celltype specific genes", which were used in enrichment analysis to identify cell types relevant for smoking behaviors 7 .
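As an illustration of how such a gene set could be built, the sketch below selects the top 10% most highly expressed genes for one cell type. The function name, the input format (a gene-to-expression dictionary) and the rounding rule are assumptions made for this example only.

```python
def top_expressed_genes(expression, fraction=0.10):
    """Return the set of genes in the top `fraction` by expression.

    expression: dict mapping gene name -> expression level in one cell type.
    Assumption: the set size is floor(n * fraction), with at least one gene.
    """
    n_top = max(1, int(len(expression) * fraction))
    ranked = sorted(expression, key=expression.get, reverse=True)
    return set(ranked[:n_top])


# Hypothetical cell type with 20 genes: the 2 most highly expressed are kept.
example = {f"gene{i}": float(i) for i in range(20)}
print(sorted(top_expressed_genes(example)))
```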

Simulation Evaluation for TESLA
We conducted extensive simulations to evaluate the type I error and power of TESLA. We used real haplotype data to simulate genotype data that reflect realistic allele frequency and LD patterns. To assess power under the alternative hypothesis, for each simulated gene, we randomly selected a model from the PrediXcan database and used real eQTL weights as eQTL effects, which mediate the phenotypic effects in samples of European ancestry. For other ancestries, we considered different scenarios where the phenotypic effects and the set of causal variants are the same as in European ancestry (i.e., homogeneous model) and where the causal variants and genetic effects are ancestry-specific (i.e., Eurasia, European only, or Admixed). We also examined the scenario where the eQTL effects differ between ancestries (i.e., heterogeneous effect model), which may be due to different ancestries having different eQTL SNPs or effect heterogeneities. We varied the fraction of samples of European ancestry in the multi-ancestry studies to compare different methods. Type I errors and power were evaluated using 100 million replicates in each scenario under the Bonferroni threshold for testing up to 20,000 expressed genes ($\alpha = 2.5 \times 10^{-6}$). Details of the different phenotypic effect models are shown in Suppl. Table 1 and Supplementary Text.
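The threshold arithmetic can be checked with a short sketch. It assumes a family-wise error rate of 0.05, which is implied by (but not stated alongside) the threshold of 2.5 × 10⁻⁶ for 20,000 genes; `empirical_type1_error` is a hypothetical helper showing how an empirical rate would be tallied from null replicates.

```python
n_genes = 20_000            # expressed genes tested (from the text)
family_wise_alpha = 0.05    # assumption: conventional family-wise level
alpha = family_wise_alpha / n_genes   # Bonferroni threshold per gene

n_replicates = 100_000_000  # null replicates per scenario (from the text)
expected_rejections = alpha * n_replicates


def empirical_type1_error(p_values, threshold):
    """Fraction of null replicates whose p-value falls below the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)


print(alpha, expected_rejections)
```

Under a well-calibrated test, roughly 250 of the 100 million null replicates would fall below the threshold, which is enough events to estimate a rate this small with reasonable precision.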
The type I error was controlled for all scenarios (Suppl. Table 2). In the scenarios with ancestry-specific effects, TESLA outperforms alternative methods in power. For example, in the Eurasia model, the phenotypic effects are only present in European and Asian samples. When 40% (20%) of the studies were of European (Asian) ancestry and the expression effect was $c = 0.5$, the power for TESLA, FE-TWAS and EURO-TWAS was 93%, 88%, and 75%, respectively. If the phenotypic effect was only present in European samples, EURO-TWAS is expected to be the most powerful method, but its power was only slightly better than that of TESLA. In this case, FE-TWAS has much lower power as it ignores the heterogeneity of effect sizes (Suppl. Table 2). For the Admixed model, when the expression effect was $c = 0.25$ and 40% of the cohorts were of European ancestry, TESLA was the most powerful method (58%). FE-TWAS does not incorporate effect heterogeneity, violates the proportionality condition, and has less power (55%). EURO-TWAS does not fully utilize non-European samples and was also underpowered (49%) (Suppl. Table 2).
In the presence of phenotypic effect heterogeneity, the power advantage of TESLA over FE-TWAS increases with the fraction of non-European samples. For example, in the European effects only model ($c = 0.25$), when 80% of the samples were of European ancestry, the power for TESLA, FE-TWAS, and EURO-TWAS was 67%, 63%, and 68%, respectively. As expected, EURO-TWAS was the most powerful method, but TESLA's power was comparable and FE-TWAS's power was only slightly lower. However, when only 20% of the samples came from European ancestry, the power for the three methods was reduced to 32%, 11%, and 36%, respectively. The power for TESLA remains comparable to that of EURO-TWAS, the optimal method in this scenario, but FE-TWAS becomes severely underpowered.
The power comparison changed under the homogeneous phenotypic effect model: FE-TWAS was the most powerful method, yet the power for TESLA remained within 2% of FE-TWAS (Suppl. Table 2). The power of EURO-TWAS decreased dramatically when the fraction of European ancestry decreased. RE-TWAS consistently performed worse than FE-TWAS across all scenarios due to the conservativeness of the RE method in GWAS meta-analysis 8 . Some more advanced random effect meta-analysis methods do not produce effect size estimates, and hence cannot be used for TWAS. Across all comparisons, TESLA was consistently the most powerful method or a close second, and, when phenotypic effect heterogeneity was present, its power advantage usually increased with the fraction of non-European samples. On the other hand, FE-TWAS, RE-TWAS, and EURO-TWAS can be substantially underpowered in scenarios that do not favor their assumptions. Given that the phenotypic model is unknown in practice and that human genetic studies are expected to include more non-European samples, TESLA is a robust choice for TWAS.

Extent of Phenotypic Effect Heterogeneity
In TESLA, to model the phenotypic effects across studies, we fitted multiple meta-regression models with different numbers of PCs to capture the extent of phenotypic effect heterogeneity among ancestries. For each fitted model, we estimated phenotypic effects and performed TWAS separately. The meta-regression model that yielded the minimal p-value could inform the extent of heterogeneity of phenotypic effects across ancestries. We conducted the analyses using PrediXcan weights from GTEx samples of European ancestry as an example. Not surprisingly, the phenotypic model with no PCs (equivalent to a fixed effects meta-analysis model) yielded minimal p-values for 77% of the genes, as a large proportion of phenotypic effects are expected to be homogeneous across ancestries. The first PC separates cohorts of African ancestry from the rest of the ancestries (Suppl. Figure 4). The model with 1 PC yielded minimal p-values for 11% of the genes, which indicated that these eQTL SNPs may show distinct effects in African ancestry samples. A small fraction of genes (12%) showed greater heterogeneity in phenotypic effects, as the minimal p-values were produced from models with more than 1 PC (Suppl. Table 9, Suppl. Figure 5). As the phenotypic effects of eQTL SNPs vary between phenotypes, and different tissues have different sets of eQTL SNPs used in gene prediction models, the fractions of loci where models with 0/1/2/3 PCs yield minimal p-values differed slightly between phenotypes and tissues.

Fine-mapping of TESLA Identified Gene x Trait Associations
We performed fine mapping (see details in Supplementary Text) across the 4475 loci for all 48 tissues and four tobacco use phenotypes. Among these loci, 77% were fine mapped to a single gene in the 90%-credible set (Suppl. Table 10). Our results point to novel target genes with biological relevance, pleiotropic effects on neuropsychiatric traits, and tissue-specific effects.
First, fine-mapping identified potential causal genes with biological relevance to tobacco use. For example, in the hypothalamus, a brain region that regulates body homeostasis, stress hormone release, and circadian rhythm, HEY1 (Hes Related Family BHLH Transcription Factor with YRPW Motif 1) was identified. Overexpression of HEY1 in hypothalamus is associated with an increase in CigDay (TESLA max Z-score 4.83, multi-tissue two-sided p-value $2.4 \times 10^{-6}$, posterior inclusion probability (PIP) = 1) (Suppl. Figure 6a). This gene is a target for the Notch signaling pathway, an important regulator of neuronal development and proper network development in the brain 9 . It has also been identified as a potential candidate gene for the regulation of the dopamine transporter (SLC6A3) gene 10 . The TESLA analysis using African ancestry eQTL data from the LIBD nucleus accumbens dataset identified DTX4 to be significantly associated with CigDay (two-sided p = $1.8 \times 10^{-7}$). Ubiquitylation of Notch1 by the E3 ubiquitin ligase DTX4 is known to promote the internalization of Notch1 in response to ligand binding 11 , which also highlights the impact of the Notch signaling pathway. DTX4 was not significant in the TESLA analysis using the European ancestry eQTL datasets from GTEx.
Fine-mapping further identified the gene ASIP as a potential causal gene, where an increased level of genetically regulated gene expression in brain cortex leads to decreased CigDay (TESLA Z-score statistic -4.63 with two-sided multi-tissue p-value $4.5 \times 10^{-8}$, PIP = 1) (Suppl. Figure 6b). ASIP encodes the agouti-signaling protein, which acts as an antagonist to melanocortin receptors (MCR), similarly to agouti-related protein. The MC1R gene is typically associated with skin pigmentation, but alterations in the gene have also been associated with modulated pain sensitivity 12 . The MC4R gene has been associated with several psychological diseases such as depression and anxiety 13,14 . This receptor may impact hypothalamic-pituitary-adrenal (HPA) stress axis functionality 15 . Antagonists for the receptor have even been suggested for preventing or treating post-traumatic stress disorder (PTSD) 16 . Nicotine has also been shown to change the expression pattern of MC4R in the brain 17 .
Additionally, fine-mapping identified PTPRD (protein tyrosine phosphatase receptor type D), which has been associated with cocaine addiction. Specifically, an increase in the genetically regulated expression level of PTPRD in amygdala leads to decreased risk of smoking initiation (TESLA Z-score statistic -4.0 with two-sided multi-tissue p-value $4.8 \times 10^{-7}$, PIP = 0.94) (Suppl. Figure 6c). Previous research has shown that PTPRD knock-out mice have reduced overall use of cocaine and reduced conditioned place preference for the drug 18 .

Correlation between TESLA Statistics
In TESLA, we model the genetic effect heterogeneity using meta-regression models. Specifically, the meta-regression model with a given number of allele frequency PCs is fitted to the study-level phenotypic effect estimates, and its regression coefficients are estimated by the weighted least squares method. From the fitted model and the allele frequency PCs of the eQTL dataset, we estimate the phenotypic effect in the ancestry of the eQTL dataset. For each model, we construct a TWAS statistic from these estimated phenotypic effects, together with its variance. Finally, we combine the results based upon the different models using the minimal p-value statistic.
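The displayed equations of this derivation are not reproduced here. As a reading aid, one standard way to write a meta-regression of this kind is sketched below. All notation is an assumption introduced for illustration and may differ from the authors' own: $b_{jk}$ and $s_{jk}$ are the effect estimate and standard error of variant $j$ in study $k$ ($k = 1, \ldots, K$), $C_{kl}$ is the $l$-th allele frequency PC of study $k$, $X$ is the $K \times (L+1)$ design matrix with rows $(1, C_{k1}, \ldots, C_{kL})$, $C^{e}_{l}$ are the PCs of the eQTL dataset, $w_j$ are eQTL weights, and $V^{(L)}$ is the covariance matrix of the predicted effects across variants.

```latex
% Sketch under assumed notation; not the authors' exact formulation.
\begin{align*}
b_{jk} &= \gamma_{j0} + \sum_{l=1}^{L} \gamma_{jl}\, C_{kl} + \varepsilon_{jk},
  \qquad \varepsilon_{jk} \sim N\!\left(0,\, s_{jk}^{2}\right), \\
\hat{\gamma}_{j} &= \left(X^{T} W_{j} X\right)^{-1} X^{T} W_{j}\, b_{j},
  \qquad W_{j} = \operatorname{diag}\!\left(s_{j1}^{-2}, \ldots, s_{jK}^{-2}\right), \\
\tilde{b}_{j}^{(L)} &= x_{e}^{T}\, \hat{\gamma}_{j},
  \qquad x_{e} = \left(1, C^{e}_{1}, \ldots, C^{e}_{L}\right)^{T}, \\
T^{(L)} &= \sum_{j} w_{j}\, \tilde{b}_{j}^{(L)} = w^{T} \tilde{b}^{(L)},
  \qquad \operatorname{Var}\!\left(T^{(L)}\right) = w^{T} V^{(L)} w .
\end{align*}
```

With $L = 0$ the design matrix reduces to a column of ones and $\hat{\gamma}_{j0}$ becomes the inverse-variance weighted average of the study-level estimates, which matches the statement that the model with no PCs is equivalent to a fixed-effect meta-analysis.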
In order to evaluate the statistical significance of the minimal p-value statistic, we need to calculate the correlations between the TWAS statistics from the meta-regression models with 0, 1, 2 and 3 PCs. As the calculated TWAS statistics are functions of the phenotypic effect estimates, a critical step is to estimate the covariance between the phenotypic effect estimates. For this, we only need the covariance between the study-level phenotypic effect estimates of each pair of variants, which is a diagonal matrix. The correlation between phenotypic effects can be approximated by the LD coefficients. Given that the ancestry of participating cohorts may differ, it is important to choose an appropriate LD reference panel for each cohort based upon its ancestry. In our analysis, we used TOPMed sequence data as the reference panel. For each cohort, we chose individuals of the same ancestry from TOPMed for use as the reference panel and estimated correlations between phenotypic effect estimates. We assumed and verified that the cohorts are independent of each other.
As the TWAS statistics are linear combinations of the phenotypic effect estimates, their covariance follows directly from the covariance matrix of those estimates. With the covariance between TWAS statistics, the minimal p-values can be evaluated based on multivariate normal distribution functions.
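The evaluation of a minimal p-value against correlated test statistics can be sketched as follows. This example substitutes plain Monte Carlo sampling for the multivariate normal distribution functions mentioned in the text, and it assumes the correlation matrix of the model-specific Z-statistics is already available and positive definite. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky(m):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def min_p_value(p_values, corr, n_draws=20000, seed=1):
    """Monte Carlo p-value of the smallest of several two-sided p-values.

    p_values: two-sided p-values from the models with different PC counts.
    corr: assumed correlation matrix of the corresponding Z-statistics.
    """
    # |Z| cutoff that corresponds to the smallest observed two-sided p-value.
    cutoff = NormalDist().inv_cdf(1.0 - min(p_values) / 2.0)
    low = cholesky(corr)
    rng = random.Random(seed)
    k = len(p_values)
    hits = 0
    for _ in range(n_draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(k)]
        z = [sum(low[i][j] * e[j] for j in range(i + 1)) for i in range(k)]
        # Count null draws in which any statistic is at least as extreme.
        if max(abs(v) for v in z) >= cutoff:
            hits += 1
    return hits / n_draws
```

For independent statistics the result approaches 1 − (1 − p_min)^k, and it shrinks toward p_min as the statistics become more strongly correlated. Thresholds as small as 2.5 × 10⁻⁶ would need far more draws or an analytic multivariate normal calculation, which is presumably why the latter is used in the paper.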

Equivalence of FE-TWAS and meta-analysis of TWAS statistics from participating studies.
FE-TWAS is a special case of TESLA where no allele frequency PC is included. In this section, we will establish theoretically that when eQTL weights used in different studies are the same, FE-TWAS is equivalent to meta-TWAS, which is to conduct TWAS for each participating cohort and then combine the TWAS statistics across studies using inverse-variance weighted meta-analysis.
The equivalence is intuitively clear: meta-TWAS performs TWAS within each ancestry/study, aggregates information across variant sites, and then conducts meta-analysis across studies/ancestries. On the other hand, FE-TWAS first aggregates information across studies/ancestries and then across variant sites. The two methods only differ in the order of data integration, i.e., either aggregating over variant sites first (meta-TWAS) or over studies first (FE-TWAS). As summation is commutative, exchanging the order of summation (across variant sites vs. across studies/ancestries) yields identical results.
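Specifically, the FE-TWAS method conducts TWAS using fixed-effect GWAS meta-analysis results, i.e., the inverse-variance weighted combination of the study-level phenotypic effect estimates, scaled by a normalizing constant. The FE-TWAS statistic is then formed from these meta-analysis results and the eQTL weights. It is easy to verify that the FE-TWAS statistic can be calculated based on cohort-specific LD panels, as we describe in the previous section.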
On the other hand, meta-TWAS first performs TWAS in each ancestry/study and then combines the results using inverse-variance weighted meta-analysis. In each study/ancestry, TWAS analyzes imputed gene expression in a linear model. Without loss of generality, and to simplify notation, we assume that the trait residuals (after adjusting for non-genetic covariates) are standardized to have a mean of 0 and a variance of 1.

The regression model for TWAS regresses the standardized trait residuals on the imputed gene expression, with a residual error term. The effect of the imputed expression is estimated by least squares in each study. The meta-TWAS statistic is given by the inverse-variance weighted meta-analysis of the study-specific TWAS estimates. Exchanging the order of summation over variant sites and over studies shows that this statistic coincides with the FE-TWAS statistic, which establishes the equivalence of meta-TWAS and FE-TWAS.
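The order-of-summation argument can be checked numerically. The sketch below works with score statistics, which is an assumed parameterization chosen for illustration: for study k, `scores[k]` plays the role of GᵀY and `covs[k]` the role of GᵀG (with unit residual variance), and `w` holds eQTL weights shared across studies. All names and inputs are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quad_form(w, v):
    """Compute w^T V w for a square matrix V given as nested lists."""
    n = len(w)
    return sum(w[i] * v[i][j] * w[j] for i in range(n) for j in range(n))


def meta_twas_z(w, scores, covs):
    """TWAS within each study first, then inverse-variance weighted meta-analysis."""
    num = den = 0.0
    for u_k, v_k in zip(scores, covs):
        info = quad_form(w, v_k)      # inverse variance of the study estimate
        theta = dot(w, u_k) / info    # study-specific TWAS effect estimate
        num += info * theta
        den += info
    # Pooled estimate num/den has variance 1/den, so Z = num / sqrt(den).
    return num / math.sqrt(den)


def fe_twas_z(w, scores, covs):
    """Pool score statistics across studies first, then perform one TWAS."""
    m = len(w)
    u = [sum(s[j] for s in scores) for j in range(m)]
    v = [[sum(c[i][j] for c in covs) for j in range(m)] for i in range(m)]
    return dot(w, u) / math.sqrt(quad_form(w, v))
```

Both functions reduce to the same double sum over studies and variant sites, so they agree up to floating-point error whenever the same weights are used in every study. With study-specific weights the two would generally differ, consistent with the condition stated at the start of this section.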

Simulation Study:
We conducted extensive simulations to evaluate the type I error and power of TESLA. We used real haplotype data to simulate genotype data that reflects realistic allele frequency and LD patterns. We considered a number of scenarios where the phenotypic effects and the set of causal variants differ between ancestries and where the phenotypic effects and causal variants remain the same. We also varied the fraction of samples of European ancestry in the multi-ancestry studies.
More specifically, we first simulated a meta-analysis of 20 studies. Genotypes for each gene were simulated by pairing randomly selected haplotypes from TOPMed for the given ancestry. We varied the fraction of European studies in the meta-analysis from 20% to 80%. Half of the non-European cohorts were generated using haplotypes from African American ancestry and the other half were simulated using haplotypes from East Asian ancestry. For each replicate, a gene was randomly chosen from the PrediXcan database of GTEx whole blood tissue. The phenotypic effect $\beta_j$ of variant $j$ was simulated using the weight $w_j$ from the chosen gene and the gene expression effect on the phenotype ($c$), i.e., $\beta_j = c\, w_j$. We varied $c$ among a set of plausible values (i.e., 0.25, 0.33, or 0.5). We considered scenarios with different phenotypic effects in non-European populations, including the scenario where the phenotypic effects are homogeneous across ancestries and scenarios where the phenotypic effects were only present in a subset of ancestries. We also considered a scenario of admixed effects, where only the allele of European descent has a non-zero phenotypic effect (in samples of European ancestry and samples of African American ancestry). A summary of the simulation models can be found in Suppl. Table 1.
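Two steps of this design, pairing haplotypes into genotypes and scaling eQTL weights by the expression effect, can be sketched in a few lines. The haplotype panel below is a toy stand-in for TOPMed haplotypes, sampling with replacement is an assumption, and the function names are hypothetical.

```python
import random


def simulate_genotypes(haplotypes, n_samples, rng):
    """Pair two randomly drawn haplotypes per sample.

    haplotypes: list of 0/1 allele vectors for one ancestry (toy panel).
    Returns one genotype vector (allele counts 0, 1 or 2) per sample.
    Assumption: haplotypes are drawn with replacement.
    """
    genotypes = []
    for _ in range(n_samples):
        h1 = rng.choice(haplotypes)
        h2 = rng.choice(haplotypes)
        genotypes.append([a + b for a, b in zip(h1, h2)])
    return genotypes


def phenotypic_effects(weights, c):
    """Phenotypic effect of each variant: expression effect times eQTL weight."""
    return [c * w for w in weights]


rng = random.Random(0)
toy_panel = [[0, 1, 0], [1, 1, 0], [0, 0, 1]]
print(simulate_genotypes(toy_panel, 2, rng))
print(phenotypic_effects([0.2, -0.4, 0.0], c=0.5))
```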

Enrichment Analysis
Here we describe enrichment analyses using TESLA hits as well as the application to evaluate the drug target enrichment for drug repurposing. As a comparison, we also conducted enrichment analysis based on GWAS hits using MAGMA.

Quantifying Pathway Enrichment of TESLA Hits
Gene-level association results, when combined with pathway information, can be used to prioritize key pathways for tobacco use phenotypes. We used the same weighted regression approach 19 as MAGMA to quantify the enrichment of target genes in each pathway, which we call eTESLA. Unlike MAGMA, which calculates a gene-level statistic from single SNP p-values in GWAS, the eTESLA statistic is based upon TESLA p-values from each tissue. To implement weighted regression, we need to calculate the correlation between TESLA statistics of different genes using a Monte Carlo algorithm.
As the first step of eTESLA, we converted the TESLA p-values to Z-scores using the inverse normal transformation. We denote the converted vector of Z-scores as $Z = (Z_1, \ldots, Z_M)^{T}$, where $Z_i$ is the Z-score for gene $i$. We also encoded the membership of each gene in different pathways using an indicator matrix $X = (X_{ij})$, where $X_{ij}$ equals 1 if gene $i$ belongs to pathway $j$, and 0 otherwise. A weighted regression analysis was then conducted by regressing the gene-level Z-scores on the pathway membership covariates, i.e., $Z = X\beta + \epsilon$. The model can be fitted using weighted least squares, i.e., $\hat{\beta} = (X^{T} \Sigma^{-1} X)^{-1} X^{T} \Sigma^{-1} Z$, where $\Sigma$ is the covariance matrix between the Z-score statistics converted from TESLA p-values, which we calculate using a Monte Carlo algorithm as described below.

Monte Carlo Algorithm for Calculating the Covariance between Z-scores
For each pair of genes (denoted gene 1 and gene 2, with variants $j_1 = 1, \ldots, J_1$ and $j_2 = J_1 + 1, \ldots, J_1 + J_2$), we repeat steps 1-3 10,000 times. In iteration $a$:
Step 1: We simulate phenotypic effects from a multivariate normal distribution whose correlation matrix is the correlation matrix between the estimated phenotypic effects (as detailed in Supplementary Text).
Step 2: For each simulated vector of phenotypic effects, we calculate the TESLA statistic using different numbers of PCs and calculate the p-values for the TESLA statistic of genes 1 and 2, respectively.
Step 3: We convert the p-values for genes 1 and 2 to Z statistics $Z_{1,a}$ and $Z_{2,a}$.
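Steps 1-3, followed by a sample covariance over iterations, can be sketched as below. The sketch is deliberately simplified: each gene's statistic is a single weighted sum of simulated effects (one model, no allele frequency PCs) rather than the full TESLA minimal p-value, effects are drawn with mean zero, and p-values are converted with Z = Φ⁻¹(1 − p). Function names and inputs are hypothetical, and the correlation matrix is assumed positive definite.

```python
import math
import random
from statistics import NormalDist

ND = NormalDist()


def cholesky(m):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def quad_form(w, v):
    n = len(w)
    return sum(w[i] * v[i][j] * w[j] for i in range(n) for j in range(n))


def p_to_z(p, eps=1e-15):
    """Convert a p-value to a Z-score, clamping to keep inv_cdf defined."""
    p = min(max(p, eps), 1.0 - eps)
    return ND.inv_cdf(1.0 - p)


def two_sided_p(w, b, sd):
    t = sum(wj * bj for wj, bj in zip(w, b)) / sd
    return 2.0 * (1.0 - ND.cdf(abs(t)))


def mc_z_covariance(w1, w2, corr, n_iter=10000, seed=1):
    """Monte Carlo covariance of the p-value-converted Z-scores of two genes.

    w1, w2: eQTL weights of gene 1 and gene 2 (variants ordered gene 1 first).
    corr: correlation matrix of effect estimates over all J1 + J2 variants.
    """
    pad1 = list(w1) + [0.0] * len(w2)   # gene 1 weights over all variants
    pad2 = [0.0] * len(w1) + list(w2)   # gene 2 weights over all variants
    sd1 = math.sqrt(quad_form(pad1, corr))
    sd2 = math.sqrt(quad_form(pad2, corr))
    low = cholesky(corr)
    n = len(corr)
    rng = random.Random(seed)
    z1, z2 = [], []
    for _ in range(n_iter):
        # Step 1: simulate correlated phenotypic effects under the null.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        # Step 2: gene-level statistics and their two-sided p-values.
        p1 = two_sided_p(pad1, b, sd1)
        p2 = two_sided_p(pad2, b, sd2)
        # Step 3: convert the p-values to Z-scores.
        z1.append(p_to_z(p1))
        z2.append(p_to_z(p2))
    m1 = sum(z1) / n_iter
    m2 = sum(z2) / n_iter
    return sum((a - m1) * (c - m2) for a, c in zip(z1, z2)) / (n_iter - 1)
```

Genes whose variants are in no LD give a covariance near zero, while genes sharing variants in strong LD give a clearly positive covariance, which is what the weighted regression needs to down-weight redundant signals.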
The covariance between the TESLA p-value converted Z-scores is given by the sample covariance of the converted Z-scores of the two genes across the 10,000 iterations.

GO enrichment and semantic similarity analysis
GO term gene sets were retrieved from the Molecular Signatures Database (MSigDB) ontology gene set collection (c5), which consists of a comprehensive catalog of known disease-associated proteins 20,21 . To reduce the redundancy of GO terms for each trait and tissue pair, we leveraged REVIGO 22 to calculate the semantic similarity measures between GO terms and then clustered similar GO terms. REVIGO uses a simple clustering algorithm to summarize a list of GO terms using their semantic similarity measures. It relies on pre-computed information content for GO terms. This method can reduce redundant and tangled raw GO analysis results by choosing a representative subset of the terms, which facilitates visualization and interpretation. We also compared the results with other tools (e.g., simplifyEnrichment 23 ) that use information theoretic similarity 24 , to verify the robustness of the results. The REVIGO results are visualized using CirGO 25 to deliver more comprehensive and intuitive information. As a sensitivity analysis on the impact of the pathway database used, we also performed enrichment analysis using KEGG 26 , Reactome 27 , and WikiPathways 28 following the same pipeline. All reported p-values are two-sided.

Drugbank sets enrichment Analysis for Drug Repurposing Analysis
We leveraged enrichment analysis to prioritize key drug pathways that are enriched with TESLA hits, which were then used to identify putative drugs that may be repurposed for smoking cessation treatment. We made use of DrugBank 29 , a publicly available database that contains >10,000 FDA-approved drugs along with data on ~5000 unique drug targets, to compile gene sets that consist of target genes for each drug. We removed drugs with fewer than two drug target genes, as enrichment analysis cannot be evaluated for gene sets with only one gene 30,31 . The resulting gene sets cover 1,642 drugs for enrichment analysis. All reported p-values are two-sided.
To further explore the relationships between original drug indications and different underlying molecular pathways, we classified all the identified drugs based on indications and molecular mechanism of action, respectively.
First, we manually reviewed and curated all the significant drugs' indications and major target gene groups. We grouped drug indications into 15 groups including:

MAGMA Enrichment Analysis for Identifying Relevant Tissues and Cell Types
In order to pinpoint tissues or cell types for tobacco use phenotypes, MAGMA (v1.08) was used to assess enrichment of GWAS signals in the top 10% highly expressed genes in each tissue or cell type. As MAGMA was developed for samples from a single ancestry, we only used GWAS fixed effect meta-analysis of samples with European ancestry. We conducted the analysis using default parameters, and calculated p-values for enrichment as well as false discovery rates for each gene set. All reported p-values are two-sided.

Fine Mapping TESLA Results
TESLA statistics adaptively combine the p-values from each sub-model with a different number of principal components, in order to accommodate different extents of phenotypic effect heterogeneity between ancestries.
Here, we performed fine-mapping of TWAS hits by extending existing methods based upon the Gaussian copula. We first transformed the two-sided TWAS p-value $p$ of each gene to a Z-score statistic [i.e., $Z = \Phi^{-1}(1 - p)$, where $\Phi$ is the cumulative distribution function of a standard normal random variable], then estimated the correlation between converted Z-scores using the Monte Carlo approach as in the enrichment analysis, and finally calculated Bayes factors and posterior inclusion probabilities to quantify the probability that each gene was causal. This is conceptually similar to other GWAS/TWAS fine-mapping methods using approximate Bayes factors 32 . However, by working with p-values, this method applies to a broader class of statistical methods, including those based upon combined p-values.
Specifically, for fine-mapping, we defined each locus as a 1 Mb window surrounding a significant TESLA signal. The TESLA p-values are denoted by $\tilde{p}_1, \ldots, \tilde{p}_M$ and the single SNP p-values in the gene region that are not used in gene expression models are denoted by $p_1, \ldots, p_L$. We converted these p-values to Z-score statistics using the inverse normal transformation, which we denote as $\tilde{Z}_1, \ldots, \tilde{Z}_M, Z_1, \ldots, Z_L$ (i.e., $Z = \Phi^{-1}(1 - p)$, where $\Phi$ is the cumulative distribution function of a standard normal random variable). Next, we estimated correlations between the converted statistics. Under the null hypothesis, the single variant association statistics follow a multivariate normal distribution, with correlation matrix equal to the LD coefficients. To calculate the correlation between the statistics $\tilde{Z}_1, \ldots, \tilde{Z}_M, Z_1, \ldots, Z_L$, we employ a Monte Carlo approach as described for eTESLA.
Similar to several other fine-mapping methods 33,34 , we took an iterative approach, which allows us to fine-map the top signal in the locus first and then secondary signals using conditional association results. We calculated the approximate Bayes factor for the primary signal, i.e., gene H, following Wakefield 32 , with the standard deviation of the prior on the effect sizes set to 0.1.
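For orientation, the sketch below shows how an approximate Bayes factor and the resulting PIPs are commonly computed. It is not the authors' exact formula, which is not reproduced in this text. It assumes the standard Wakefield form for evidence in favour of association, $\mathrm{ABF} = \sqrt{V/(V+W)}\,\exp\{Z^2 W / [2(V+W)]\}$, with prior variance $W = 0.1^2$. The effect-estimate variance $V$ (`effect_var`) must be supplied by the caller and is an assumption of this sketch, as are a single causal signal per locus and equal prior weights.

```python
import math


def log_abf(z, effect_var, prior_sd=0.1):
    """Log of Wakefield's approximate Bayes factor in favour of association.

    z          : Z-score of the gene or variant
    effect_var : variance V of the effect estimate (assumed to be known)
    prior_sd   : standard deviation of the normal prior on the effect size
    """
    prior_var = prior_sd ** 2
    shrink = prior_var / (effect_var + prior_var)  # W / (V + W)
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z * z * shrink


def posterior_inclusion_probabilities(z_scores, effect_vars, prior_sd=0.1):
    """PIPs assuming one causal signal in the locus and equal prior weights."""
    log_bfs = [log_abf(z, v, prior_sd) for z, v in zip(z_scores, effect_vars)]
    top = max(log_bfs)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp(b - top) for b in log_bfs]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical Z-scores for three candidates in one locus.
    print(posterior_inclusion_probabilities([5.0, 1.0, 0.5], [0.01] * 3))
```

Working on the log scale keeps the computation stable for large Z-scores. For secondary signals, the same function could be applied to conditional Z-scores.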
The PIP for gene H is then calculated from the approximate Bayes factors in the locus. When multiple association signals are present, conditional analysis of Z-scores is performed, conditioning on the top gene/variants in each locus. Fine-mapping of secondary association signals is conducted using the conditional Z-scores.
REDCap (Research Electronic Data Capture) is a secure, web-based application designed to support data capture for research studies, providing 1) an intuitive interface for validated data entry; 2) audit trails for tracking data manipulation and export procedures; 3) automated export procedures for seamless data downloads to common statistical packages; and 4) procedures for importing data from external sources.
Due to their ancestral history, the OOA are enriched for rare exonic variants that arose in the population from a single founder (or a small number of founders) and propagated through genetic drift. Many of these variants have large effect sizes, and identifying them can lead to new biological insights about health and disease. The parent study for this WGS project provides one of multiple examples. In our parent study, we identified through a genome-wide association analysis a haplotype, highly enriched in the OOA, that is associated with very high LDL-cholesterol levels. At present, the identity of the causative SNP (and even of the implicated gene) is not known, because the associated haplotype contains numerous genes, none of which are obvious lipid candidate genes. A major goal of the WGS that will be obtained through the NHLBI TOPMed Consortium is to identify functional variants that underlie some of the large-effect associations observed in this unique population.

ARIC (Atherosclerosis Risk in Communities)
The Cohort Component began in 1987, and each ARIC field center randomly selected and recruited a cohort sample of approximately 4,000 individuals aged 45-64 from a defined population in their community, to receive extensive examinations, including medical, social, and demographic data. Follow-up also occurs semi-annually, by telephone, to maintain contact and to assess health status of the cohort.
In the Community Surveillance Component, the four communities are investigated to determine the long-term trends in hospitalized myocardial infarction (MI) and coronary heart disease (CHD) deaths in approximately 470,000 men and women aged 35-84 years.
Objectives of the study include: (1) Examine the ARIC cohort to characterize heart failure stages in the community, identify genetic and environmental factors leading to ventricular dysfunction and vascular stiffness, and assess longitudinal changes in pulmonary function and identify determinants of its decline. (2) Cohort follow-up for cardiovascular events, including CHD, heart failure, stroke, and atrial fibrillation; and for the study of risk factors related to progression of subclinical to clinical CVD. (3) Enhance the ARIC cohort study with cardiovascular outcomes research to assess quality and outcomes of medical care for heart failure and heart failure risk factors. (4) Community surveillance to monitor long-term trends in hospitalized MI, CHD deaths, and heart failure (inpatient and outpatient). (5) Provide a platform for ancillary studies, training for new investigators, and data sharing.
The Atherosclerosis Risk in Communities study has been funded in whole or in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services, under Contract nos. (HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700005I, HHSN268201700004I). The authors thank the staff and participants of the ARIC study for their important contributions.

BAGS (Barbados Asthma Genetics Study)
Epidemiologic studies of asthma have been underway in Barbados since 1991, when PI Barnes reported a relationship between modernization of the domestic environment in Barbados and increased risk of asthma. The baseline prevalence of asthma in Barbados is high (~20%), and from admixture analyses, we have determined that the proportion of African ancestry among Barbadian founders is similar to U.S. African Americans, rendering this a unique population to disentangle the genetic basis for asthma disparities among African ancestry populations in general. The primary outcome measure is asthma, and the approach for characterizing asthma in the Barbados population is based on the validated Respiratory Health Questionnaire (RHQ) designed from the 1978 American Thoracic Society questionnaire. Additional phenotype data include lung function measures, asthma severity, total serum IgE, and serum levels of various cytokines. In 1993, the Barbados Asthma Genetics Study (BAGS) was initiated on nuclear and extended asthmatic families who self-reported as African Caribbean, resulting in the first evidence for linkage for asthma and tIgE in an African-ancestry population, and the development of novel family-based methods. Recruitment into the BAGS program was enhanced through its involvement in the international Genetics of Asthma International Network (1999-2001), and the current sample of >1300 participants continues to grow through the efforts of collaborators and nursing staff at the Chronic Disease Research Centre in Barbados. Pediatric probands were recruited through referrals at local polyclinics or the Accident and Emergency Department at the Queen Elizabeth Hospital, and their nuclear and extended family members were subsequently recruited. All subjects gave verbal and written consent as approved by the Johns Hopkins Institutional Review Board (IRB) and the Barbados Ministry of Health.
In 2007 we performed a genome-wide association study (GWAS) on 655,352 SNPs using the Illumina Infinium™ II HumanHap650Y BeadChip v.1.0 (Illumina Inc.) on a subset of 1,000 Barbados participants. This represented the first GWAS of asthma focusing exclusively on populations of African ancestry, and data from this study also contributed to the NHLBI-supported EVE Consortium. BAGS also contributed 96 samples to Phase 2 of the Thousand Genomes Project (TGP). Subsequently, BAGS samples were included in the NHLBI-supported parent grant, entitled "New Approaches for Empowering Studies of Asthma in Populations of African Descent" (R01 HL104608-01), in which whole genome sequencing (WGS) was performed on ~1,000 individuals from North, Central, and South American and Caribbean and two West African populations. These populations constitute the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), which aims to discover genes influencing risk for asthma, and catalog genetic diversity in descendants of the African Diaspora in the Americas. So far, CAAPA sequencing has greatly expanded the lexicon of human diversity, as we have observed >20% more variants than reported in the Thousand Genomes Project (TGP). Using these WGS data, a custom, gene-centric SNP genotyping array was developed by Illumina, Inc., called the African Diaspora Power Chip (ADPC), to complement current, commercially available genome-wide chips, which provide sub-optimal tagging of genes among individuals of African ancestry. This ADPC was recently genotyped on all BAGS samples, with a goal of combining ADPC data with existing GWAS data from the 650Y to test for association with asthma. The initial goals of the parent grant did not include validating the ADPC. Moreover, the ADPC, combined with existing GWAS data, will be limited in detecting contributions of rare and structural variants, which may account for some of the "missing heritability" of asthma.
We are therefore performing WGS on 1,100 asthmatics and family members from the BAGS, in order to (i) expand the CAAPA WGS dataset and thereby the genomic catalog of African ancestry for the research community; (ii) validate the ADPC by capturing information from both common and rare variants; and (iii) generate additional discovery of rare and structural variants that may control risk of asthma. Tools resulting from this study will enable substantial advancements in the technology available for identifying genes relevant to disease in under-represented minorities.
Given the data available on this large, deeply genotyped cohort from a relatively homogeneous environment representing an underrepresented minority group suffering most from asthma, the BAGS sample provides a unique opportunity to employ novel genomics.
We gratefully acknowledge the contributions of Pissamai and Trevor Maul, Paul Levett, Anselm Hennis, P. Michele Lashley, Raana Naidu, Malcolm Howitt and Timothy Roach, and the numerous health care providers, and community clinics and co-investigators who assisted in the phenotyping and collection of DNA samples, and the families and patients for generously donating DNA samples to the Barbados Asthma Genetics Study (BAGS). Funding for BAGS was provided by National Institutes of Health (NIH) R01HL104608, R01HL087699, and HL104608 S1.

BEAGESS (The Barrett's and Esophageal Adenocarcinoma Genetic Susceptibility Study)
This study made use of data generated by investigators in the BEACON

CADD (Center on Antisocial Drug Dependence)
The Center on Antisocial Drug Dependence (CADD) data were funded by grants from the National Institute on Drug Abuse (P60 DA011015, R01 DA012845, R01 DA021913, R01 DA021905, R01 DA035804). For more information about this study, contact John K. Hewitt (john.hewitt@colorado.edu).

EOCOPD (Boston Early-Onset COPD Study)
The Boston Early-Onset COPD (EOCOPD) study was designed to study genetic factors for early-onset and severe COPD 14 . Probands were selected to be physician-diagnosed COPD cases with FEV1 ≤ 40% predicted and age ≤ 53. Subjects with severe alpha-1 antitrypsin deficiency and other chronic lung diseases (except asthma) were excluded. All subjects completed a questionnaire and spirometry testing before and after bronchodilator administration. Blood samples and written informed consent were obtained for each study subject. A subset of the most severe unrelated probands from this study were sent for whole-genome sequencing through the TOPMed project. For the current cross-sectional WGS effort, only baseline spirometry data were available and utilized for analyses.
The Boston Early-Onset COPD Study was supported by R01 HL113264 and U01 HL089856 from the National Heart, Lung, and Blood Institute.

CFS (The Cleveland Family Study)
Obstructive Sleep Apnea (OSA) affects more than 10% of the population, especially minorities, and is associated with significant cardio-metabolic morbidity. We propose using data from the Cleveland Family Study (CFS), a genetic epidemiological study of OSA, as well as data from cohorts studied as part of our collaborations with other NHLBI cohorts to enhance the identification of genes that increase susceptibility to OSA, with a focus on those variants that increase susceptibility in African Americans.
The CFS is a genetic epidemiological study of 352 rigorously phenotyped families ascertained through probands with OSA identified through Cleveland, OH area sleep centers, neighborhood controls, and the spouses and first and second-degree relatives of probands. Participants were studied on up to 4
We used WGS and highly sophisticated statistical tools to completely characterize the genetic variation in richly phenotyped multi-ethnic populations and in families enriched for OSA as well as CVD and pulmonary traits. We aim to more completely and definitively characterize the allelic spectrum of functional genetic variation associated with OSA, as well as to contribute to consortia-wide activities to identify causal variation for other HLB phenotypes. We propose to conduct WGS in 1000 Cleveland Family Study family members as well as to collaborate with other WGS consortium members (e.g., Jackson Heart Study) where sleep phenotyping is available. Complete characterization of genetic variation with WGS will allow for direct interrogation of causal functional variation irrespective of whether it is coding or regulatory; common, rare, private or de novo, thus improving upon data from exome sequencing. We will apply existing and newly developed analytical tools for detecting associations informed by linkage, and for conducting gene-based tests, bioinformatics pathway analyses, fine-mapping of GWAS and linkage signals using functional annotation, cross-phenotype analyses, and heritability partitioning to identify causal variants and reveal the allelic architecture of OSA, facilitating the discovery of physiological pathways.
More comprehensive sequencing data, including a complete catalogue of genetic variation in each sequenced participant, will improve the ability to identify important inherited and de novo functional coding and regulatory variants outside of exomic regions for OSA and to fine-map GWAS and linkage signals, and will contribute to the discovery and fine-mapping of variants for a broad range of CVD, blood and pulmonary phenotypes collected in these cohorts. We will focus on the major metrics that characterize OSA, such as the Apnea Hypopnea Index, as well as highly heritable traits that provide information on physiological mechanisms underlying OSA, such as hypopnea duration and overnight oxygenation.

CHS (The Cardiovascular Health Study)
The Cardiovascular Health Study (CHS) is a population-based, longitudinal study of risk factors for coronary heart disease and stroke (REF). The study included 5,888 adults 65 years of age or older from four field centers. Participants were sampled from local Medicare eligibility lists, and baseline visits were in 1989-90 for the first cohort (n=5,201) and 1992-94 for the second cohort (n=687, predominantly African-American). At each study visit, physical and laboratory evaluations were performed to identify and characterize the severity of cardiovascular disease risk factors. Blood samples for DNA extraction were from the baseline study visit. CHS was approved by institutional review committees at each field center, and individuals in the present analysis had available DNA and gave informed consent, including consent to use of genetic information for the study of cardiovascular disease.

COGEND (Collaborative Genetic Study of Nicotine Dependence)
This research was supported by P01 CA089392, U01 HG004422, and R01 DA036583. Funding support for genotyping which was performed at the Johns Hopkins University Center for Inherited Disease Research was provided by the NIH "Genome-wide Association Studies in the Genes and Environment Initiative" (U01HG004438) and the NIH contract "High throughput genotyping for studying the genetic contributions to human disease" (HHSN268200782096C). We also acknowledge funding from R01 DA026911, and we thank Weimin Duan for analytic assistance. For more information about this study, contact Laura J. Bierut (laura@wustl.edu).

COPDGene (Genetics of Chronic Obstructive Pulmonary Disease)
COPDGene (also known as the Genetic Epidemiology of COPD Study) is an NIH-funded, multicenter study. A study population of more than 10,000 smokers (1/3 African American and 2/3 non-Hispanic White) has been characterized with a study protocol including pulmonary function tests, chest CT scans, six-minute walk testing, and multiple questionnaires. Five years after this initial visit, all available study participants are being brought back for a follow-up visit with a similar study protocol. This study has been used for epidemiologic and genetic studies. Previous genetic analysis in this study has been based on genome-wide SNP genotyping data. Approximately 1900 subjects will undergo whole genome sequencing in this NHLBI WGS project, including severe COPD subjects and resistant smoking controls. The COPDGene Study web site is: http://www.copdgene.org/ .
The COPDGene project described was supported by Award Number U01 HL089897 and Award Number U01 HL089856 from the National Heart, Lung, and Blood Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. The COPDGene project is also supported by the COPD Foundation through contributions made to an Industry Advisory Board comprised of AstraZeneca, Boehringer Ingelheim, GlaxoSmithKline, Novartis, Pfizer, Siemens and Sunovion. A full listing of COPDGene investigators can be found at: http://www.copdgene.org/directory .

CRA_CAMP (The Genetic Epidemiology of Asthma in Costa Rica and the Childhood Asthma Management Program)
From February 2001 to August 2008, questionnaires were sent to the parents of 16,912 children (ages 6-14 years) enrolled in 140 Costa Rican schools; 9,180 (54.3%) questionnaires were returned. Children were eligible for the study if they had asthma (physician-diagnosed asthma and ≥ 2 respiratory symptoms or asthma attacks in the prior year) and a high probability of having ≥ 6 great-grandparents born in the Central Valley of Costa Rica (as determined by the study genealogist on the basis of the paternal and maternal last names of each of the child's parents). Of the 9,180 children screened, 3,113 (33.9%) had asthma. By the close of the recruitment and enrollment of the CRA Study in 2011, samples from 4,245 individuals have been collected for the main study and related subsequent studies. Children, parents and pedigree relatives gave blood samples for DNA extraction. All probands/children completed a protocol including questionnaires, spirometry, methacholine challenge testing (if their FEV1 was ≥ 65% of predicted), allergy skin testing, and collection of blood (for plasma, DNA and RNA extraction, and measurement of serum total and allergen-specific IgE) and house dust (for measurement of dust mite/cockroach allergens) samples.
The parent grant, R37 HL066289 was initially funded as an R01 and then was successfully renewed with a 1st percentile score and was then converted to an R37 MERIT Award that has subsequently been renewed. The grant is currently in its 13th year and has two years remaining in its current segment; thus, it is eligible for this administrative supplement.
The unique aspects of this population are that the asthma prevalence rates in Costa Rica (CR) are among the highest in the world and that the Central Valley of Costa Rica, the primary recruitment/enrollment area, is a relative genetic isolate where we have had the ability to ascertain relatives in large extended pedigrees and phenotype these subjects for asthma and related traits. A total of 671 subjects in 8 large pedigrees were ascertained and phenotyped, and we have collected an additional 1053 trios. Initial efforts were focused on linkage studies and then we subsequently moved to genetic association studies, at which time we added the trios (N=720) in the Childhood Asthma Management Program (CAMP) to the grant as a replication population for our results, as these two populations have exactly the same study design (e.g., trios) and identical protocols for phenotyping subjects. We have approximately 100 phenotypes relevant to heart, lung, blood and sleep disorders, including asthma, obesity, height, COPD and blood and lipid disorders via metabolomics. It is highly unlikely that any application has the ability to test private mutations in extended pedigrees, as well as generalize from inbred to outbred populations to the extent that we do in this proposal.
We have been productive in publishing our work. A total of 114 linkage and association papers have been written using these two study populations. Some of the most important findings from the initial 13 years of the study are the following: we have (1) identified MMP12 as an important gene for asthma, COPD and lung function decline; (2) identified allele-specific chromatin remodeling via an insulator on chromosome 17 that coregulates the ZPBP2/GSDMB/ORMDL3 locus; (3) identified, through linkage and fine-mapping, PRKCA as a novel gene for asthma and obesity; (4) replicated GWAS results for PDE4D from CAMP; (5) identified a novel gene for obesity, ROBO1; and (6) identified vitamin D deficiency as an important risk factor for increased asthma severity. In the third five-year cycle, we added GWAS and integrative genomics, including whole blood gene expression and methylation arrays on 384 probands. The current aims of the grant are to perform GWAS and association analysis on Costa Rica and replicate them in CAMP and other Hispanic populations. This aim is directly related to this administrative supplement. The GWAS analyses are just now being completed and the gene expression, microRNA and methylation arrays are just being run. This means that the aims of the parent grant are directly aligned with the aims of the proposed administrative supplement. Therefore, with two years remaining on the parent grant, there is ample time to perform the whole genome sequencing proposed and report the results from the project. As described in the significance section of the administrative supplement.

deCODE (deCODE Genetics/AMGEN, Inc.)
The authors are thankful to the Icelandic participants and staff at the Patient Recruitment Center. The work at deCODE genetics / Amgen was supported in part by the National Institute of Drug Abuse (NIDA grants, R01-DA017932 and R01-DA034076). For more information about this study, email info@decode.is.

ECLIPSE (Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints)
The "Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints" (ECLIPSE) study was a longitudinal, multicenter, observational investigation of 2164 COPD subjects and a smaller number of smoking controls (337) and nonsmoking controls (245)

EGCUT (Estonian Genome Center)
The EGCUT studies were financed by Estonian Government (grants IUT20-60 and IUT24-6) and by European Commission through the

FHS (Framingham Heart Study)
Whole genome sequencing (WGS) in the Framingham Heart Study (FHS) would provide a unique opportunity to study the association of genome-wide genetic variation with heart, lung, blood, and sleep traits that are available in our six constituent cohorts (with nearly 15,000 participants). The Framingham Heart Study (FHS) acknowledges the support of contracts NO1-HC-25195, HHSN268201500001I and 75N92019D00031 from the National Heart, Lung and Blood Institute and grant supplement R01 HL092577-06S1 for this research. We also acknowledge the dedication of the FHS study participants without whom this research would not be possible. Dr. Vasan is supported in part by the Evans Medical Foundation and the Jay and Louis Coffman Endowment from the Department of Medicine, Boston University School of Medicine.

GeneSTAR (Genetic Studies of Atherosclerosis Risk)
GeneSTAR began in 1982 as the Johns Hopkins Sibling and Family Heart Study, a prospective longitudinal family-based study conducted originally in healthy adult siblings of people with documented early onset coronary disease under 60 years of age. Commencing in 2003, the siblings, their offspring, and the coparent of the offspring participated in a 2 week trial of aspirin 81 mg/day with pre and post ex vivo platelet function assessed using multiple agonists in whole blood and platelet rich plasma. Extensive additional cardiovascular testing and risk assessment was done at baseline and serially. Follow-up was carried out to determine incident cardiovascular disease, stroke, peripheral arterial disease, diabetes, cancer, and related comorbidities, from 5 to 30 years after study entry. The goal of several additional phenotyping and interventional substudies has been to discover and amplify understanding of the mechanisms of atherogenic vascular diseases and attendant comorbidities.
GeneSTAR was supported by the National Institutes of Health/National Heart, Lung, and Blood Institute (U01 HL72518, HL087698, HL112064) and by a grant from the National Institutes of Health/National Center for Research Resources (M01-RR000052) to the Johns Hopkins General Clinical Research Center.

SAFS (San Antonio Family Studies)
The SAFS population includes Mexican Americans in SAFHS extended pedigrees. The San Antonio Family Heart Study (SAFHS) is a complex pedigree-based mixed longitudinal study designed to identify low frequency or rare variants influencing susceptibility to cardiovascular disease, using whole genome sequence (WGS) information from 2,590 individuals in large Mexican American pedigrees from San Antonio, Texas. The major objectives of this study are to identify low frequency or rare variants in and around known common variant signals for CVD, as well as to find novel low frequency or rare variants influencing susceptibility to CVD.

SARP (Severe Asthma Research Program)
SARP is the world's most comprehensive study of adults and children with severe asthma, linking 7 leading asthma clinical university centers and 1 data coordinating center through a National Institutes of Health-sponsored network. SARP is not a clinical trial but rather an intensive characterization study of adults and children with asthma. Now in its third phase (SARP III), SARP has enrolled over 700 participants in its program, including over 500 adults and 180 children aged 6-17 years. The SARP network's mission is to improve the understanding of severe asthma in order to develop better treatments. Through SARP, we are gaining insight into how severe asthma develops in patients and learning about the molecular, cellular and biological mechanisms that lead to different types of asthma. Participants enrolled in SARP are being followed over at least three years in order to determine how these characteristics develop or change with time.
The overall goal of the Severe Asthma Research Program (SARP) is to identify and characterize subjects with severe asthma to understand pathophysiologic mechanisms in severe asthma. Subjects with mild and moderate asthma were recruited for comparison, but the program was enriched for subjects with severe asthma from multiple centers. Subjects were comprehensively phenotyped for asthma-related traits including lung function, atopy, questionnaires on medical and family history, exhaled nitric oxide and health care utilization including exacerbations and symptoms. Asthma is a heterogeneous disease. Cluster analysis in SARP has shown multiple subphenotypes and endotypes.

Samoan (Samoan Adiposity Study)
The parent Samoan Adiposity Study ("Samoan", formerly "SAS") is a population-based genome-wide association study (GWAS) of adiposity and cardiometabolic phenotypes among adults from the independent nation of Samoa in the South Pacific. The research goal of this study is to identify genetic variation that increases susceptibility to obesity and cardiometabolic phenotypes. Over 3,400 individuals ages 25-65 years were recruited in 2010 from 33 villages from all census regions of the nation, which is experiencing economic development and the nutrition transition. Eligibility was based on self-report of having four Samoan grandparents, not being pregnant, and not having severe physical impairment which would prohibit collection of anthropometric, biomarker and questionnaire measures, nor cognitive impairment which would not allow informed consent about the genetic purposes of the study. We collected overnight fasting blood samples and assayed glucose, insulin, leptin, adiponectin, total cholesterol, high-density and low-density lipoprotein cholesterol, and triglycerides. Anthropometric and bioelectrical impedance measurements provided measures of weight, height, body circumferences, skinfold thicknesses, BMI and other indices, as well as estimation of percent body fat and lean tissue. Questionnaires assessed socio-demographic characteristics, physical activity, dietary intake using food frequency questionnaires, medication use, history of prior diagnoses of type 2 diabetes, hypertension and cardiovascular disease, and alcohol and tobacco use. DNA was collected and the Affymetrix 6.0 chip used for SNP genotyping. After quality control checks on genotyping and excluding individuals with key missing data we have a final sample of 3,122 adults with high-quality genome-wide marker data.
Participation in the NHLBI TOPMed WGS project will enable us to more thoroughly investigate the genetic architecture of Samoan cardiometabolic conditions by establishing a Samoan-specific reference panel for imputation. Specifically, the NHLBI TOPMed WGS project will perform whole-genome sequencing in an optimally-selected subset of more than 400 individuals from our GWAS sample. After quality control work, we will use this Samoan reference panel to impute genotypes for the rest of our discovery sample. Using the imputed genotypes, we will carry out association analyses for each of our cardiometabolic traits, primarily using gene-based association tests.
Data collection was funded by NIH grant R01-HL093093. We thank the Samoan participants of the study and local village authorities. We acknowledge the Samoan Ministry of Health and the Samoa Bureau of Statistics for their support of this research.

NINDS SiGN (The National Institute of Neurological Disorders and Stroke Genetics Network)
The NINDS International Stroke Genetics Consortium Study dataset was funded by the National Institute of Neurological Disorders and Stroke Cooperative Agreement Award 1U01NS069208 (Steven Kittner). The dataset used for the analyses described in this manuscript was obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000615.v1.p1.

THRV (Taiwan Study of Hypertension using Rare Variants)
The THRV-TOPMed study consists of three cohorts: The SAPPHIRe Family cohort (N=1,271), TSGH (Tri-Service General Hospital, a hospital-based cohort, N=160), and TCVGH (Taichung Veterans General Hospital, another hospital-based cohort, N=922), all based in Taiwan. 1,271 subjects were previously recruited as part of the NHLBI-sponsored SAPPHIRe Network (which is part of the Family Blood Pressure Program, FBPP). The SAPPHIRe families were recruited to have two or more hypertensive sibs, some families also with one normotensive/hypotensive sib. The two Hospital-based cohorts (TSGH and TCVGH) both recruited unrelated subjects with different recruitment criteria (matched with SAPPHIRe subjects for age, sex, and BMI category).
The Rare Variants for Hypertension in Taiwan Chinese (THRV) is supported by the National Heart, Lung, and Blood Institute (NHLBI) grant (R01HL111249) and its participation in TOPMed is supported by an NHLBI supplement (R01HL111249-04S1). THRV is a collaborative study between Washington University in St. Louis, LA BioMed at Harbor UCLA, University of Texas in Houston, Taichung Veterans General Hospital, Taipei Veterans General Hospital, Tri-Service General Hospital, National Health Research Institutes, National Taiwan University, and Baylor University. THRV is based (substantially) on the parent SAPPHIRe study, along with additional population-based and hospital-based cohorts. SAPPHIRe was supported by NHLBI grants (U01HL54527, U01HL54498) and Taiwan funds, and the other cohorts were supported by Taiwan funds.

UKB (UK Biobank)
This research has been conducted using the UK Biobank Resource under Application Number 16651. Informed consent was obtained from UK Biobank subjects.

VTE (Venous Thromboembolism project)
This study consists of 338 VTE cases from an inception cohort of Olmsted County, MN residents (OC) with a first lifetime objectively-diagnosed idiopathic VTE during the 40-year study period, 1966-2005. All living study subjects were invited to provide a whole blood sample at the Mayo Clinical Research Unit for leukocyte genomic DNA and plasma collection. For living study subjects who did not provide a blood sample, we retrieved any leftover blood ("waste" blood) from samples collected as part of routine clinical diagnostic testing and used this to extract DNA after obtaining patient consent. For deceased cases, with IRB approval, we extracted DNA from any available stored tissue within the Mayo Tissue Archive. This "tissue" DNA has been successfully genotyped in prior studies. Three trained and experienced study nurse abstractors reviewed the complete medical records in the community of all potential cases.
Funded, in part, by grants from the National Institutes of Health, National Heart, Lung and Blood Institute (HL66216 and HL83141), the National Human Genome Research Institute (HG04735, HG06379), and research support provided by Mayo Foundation.

WGHS (The Women's Genome Health Study)
The Women's Genome Health Study (WGHS) is a prospective cohort comprised of over 25,000 initially healthy female health professionals enrolled in the Women's Health Study, which began in 1992-1994. All participants in WGHS provided baseline blood samples and extensive survey data. Women who reported atrial fibrillation during the course of the study were asked to report diagnoses of AF at baseline, 48 months, and then annually thereafter. Participants enrolled in the continued observational follow-up who reported an incident AF event on at least one yearly questionnaire were sent an additional questionnaire to confirm the episode and to collect additional information. They were also asked for permission to review their medical records, particularly available ECGs, rhythm strips, 24-hour ECGs, and information on cardiac structure and function. For all deceased participants who reported AF during the trial and extended follow-up period, family members were contacted to obtain consent and additional relevant information. An end-point committee of physicians reviewed medical records for reported events according to predefined criteria. An incident AF event was confirmed if there was ECG evidence of AF or if a medical report clearly indicated a personal history of AF. The earliest date in the medical records when documentation was believed to have occurred was set as the date of onset of AF.