Article | Open

Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry

  • Nature Communications 7, Article number: 12521 (2016)
  • doi:10.1038/ncomms12521
  • Download Citation
Received:
Accepted:
Published online:

Abstract

To characterize the extent and impact of ancestry-related biases in precision genomic medicine, we use 642 whole-genome sequences from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) project to evaluate typical filters and databases. We find significant correlations between estimated African ancestry proportions and the number of variants per individual in all variant classification sets but one. The source of these correlations is highlighted in more detail by looking at the interaction between filtering criteria and the ClinVar and Human Gene Mutation databases. ClinVar’s correlation, representing African ancestry-related bias, has changed over time amidst monthly updates, with the most extreme switch happening between March and April of 2014 (r=0.733 to r=−0.683). We identify 68 SNPs as the major drivers of this change in correlation. As long as ancestry-related bias when using these clinical databases is minimally recognized, the genetics community will face challenges with implementation, interpretation and cost-effectiveness when treating minority populations.

Introduction

The idiom ‘searching for a needle in a haystack’ is frequently used in genomics, and is especially apt for describing the search for causal alleles in patients with non-canonical diseases of likely genetic origin. As a field, we tend to be singularly focused on the needle and forget that the complexity of the haystack is actually a highly rate-limiting step of this search. The motivation of this project is to characterize the complex interaction between variant prioritization and ancestry, often believed to be largely affected by the predominance of European-based data within clinical databases1,2,3, to better understand the application of clinical genomics to minority populations. Any ancestry-related biases that exist when using typical filters and databases to implement variant prioritization and other similar precision genomic medicine techniques can have profound confounding effects, as most methodological biases do. Therefore, here we quantify the extent of ancestry-related biases inherent to approaches and databases typically used for precision genomic medicine, and we present how such biases have changed over time. We also show how these biases translate to the level of the individual and their proportion of African ancestry, with implications for diagnostic accuracy and cost.

To explore the role ancestry plays in variant prioritization approaches often implemented in genomic medicine, we utilize whole-genome sequencing data from 642 study individuals in the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA). The CAAPA project represents a diverse group of admixed individuals of African descent with no suspected Mendelian conditions. It has been shown that there is a strong correlation between the overall number of variants found per individual and African ancestry4,5,6,7. Furthermore, significant differences exist between populations in the number of variants per individual considered disease causing by the two popular clinical databases, Human Gene Mutation Database (HGMD) and ClinVar6. On the basis of annotations from HGMD, individuals with predominantly African ancestry have by far the most variants considered disease causing, whereas variants prioritized as disease causing based on annotations from ClinVar are most abundant in individuals with predominantly European ancestry and are of intermediate to below-average abundance in predominantly African-ancestry individuals6. These population-based discrepancies reflect differences between databases, and suggest that the interplay between database and sample ancestry is important. The CAAPA cohort utilized here serves as an appropriate sample, with representative quantities of variation (that is, similar-sized haystacks), for evaluating whether biases exist when applying precision genomic medicine to African-ancestry individuals. Any biases and/or population specificities for African-ancestry patients that inflate the number of prioritized variants (that is, make the haystack bigger), would result in increased effort (that is, time and money) to identify a causative variant (that is, find the needle) in African-ancestry patients.

Results

Variant classification

We initially classified variants into two main groups, with pathogenic annotated variants (PAVs) comprising those identified as disease-causing in the Online Mendelian Inheritance in Man (OMIM)8, HGMD9 or ClinVar10 databases, and non-annotated variants (NAVs) consisting of those not annotated as disease-causing in these databases. Unless otherwise noted, we used a an allele frequency filter, and excluded common variants with a minor allele frequency (MAF) >5% from our analyses (Methods). Each category was then sub-classified as deleterious or non-deleterious based on computational predictions (Fig. 1a)11,12,13,14,15,16,17,18,19,20, which we consider as a type of filter based on deleteriousness and note when used for categorization. Since there is evidence for all PAVs (deleterious and non-deleterious) and deleterious NAVs to be further evaluated as higher priority, variants in these categories often require time-consuming and costly follow-up review by a clinical team1,21,22 to identify causative variants with a low false-negative rate.

Figure 1: Variant classification workflow and correlation with African ancestry.
Figure 1

(a) The pipeline used to categorize variation into four groups (deleterious (Del.) PAVs, non-deleterious (Non-del.) PAVs, deleterious NAVs and non-deleterious NAVs) each with different levels of clinical relevance (see Methods for further explanation). The three key filters used in separating variants are (1) MAF from multiple databases, (2) pathogenic annotation (as defined by the ClinVar and/or HGMD) and (3) deleterious prediction. For be the x axis is the proportion of African ancestry as estimated by ADMIXTURE. The corresponding y axes represent the total number of variants per individual for the following groups: (b) deleterious PAVs, (c) non-deleterious PAVs, (d) deleterious NAVs and (e) non-deleterious NAVs. Colours of each individual reflect the population sampling location.

Correlations with African ancestry and variant counts

We find significant correlations between estimated African ancestry (Supplementary Fig. 1) and the number of variants per individual in all variant sets except deleterious PAVs (Fig. 1b–d). Both deleterious and non-deleterious NAVs show similar levels of correlation with African ancestry as does all genomic variation pooled together7. When we remove the aforementioned MAF and deleteriousness filters, as well as a filter on stop/splice sites, and identify PAVs from either HGMD or ClinVar databases separately, we find a strong positive correlation between estimated African ancestry and variants identified in HGMD (r=0.992, P=6.12 × 10−14) and a modest positive correlation between African ancestry and variants in ClinVar (r=0.539, P=0.031). The correlation becomes less positive or even negative (Supplementary Table 1) as we re-add our two main filters: (1) inclusion of variants with MAF <5% (MAF filter); and (2) inclusion of variants called deleterious by at least 2 of 11 in silico predictions (deleterious filter).

One possible explanation for this general reduction of the positive correlation with ancestry is that these filters effectively remove functionally neutral variants, of which there are more in persons of African ancestry. Assuming this, one would predict a reduction in the positive correlation with African ancestry, as long as the filters remove a higher number of functionally neutral variants, relative to causative variants, from African populations compared with European populations. Given recent studies showing that African populations have more genetic variation than European populations5,23,24, but that the number of deleterious alleles in an individual is independent of demography or lower in Africans, depending on the level of deleteriousness of these alleles25,26,27,28,29, one would expect all filters to remove higher numbers of non-causal variants from individuals with greater African ancestry, as is consistent with what we report here. Specifically, as we apply the MAF filter and exclude all common variants, we are eliminating variants that have been misidentified in databases as disease causing22, of which there are more among individuals of African ancestry. Similarly, as we use in silico predictors to filter out putatively non-deleterious variants, we remove more functionally neutral variants from Africans than from Europeans. For instance, the number of non-deleterious PAVs per individual increases with African ancestry, whereas the number of deleterious PAVs per individual does not. Furthermore, because the number of deleterious mutations in African individuals is not greater than in European individuals25,26,27,28, these filters do not remove more deleterious variants from Africans. This disproportionate removal of functionally neutral variants will more effectively reduce the number of incorrectly characterized variants in each class in African-ancestry individuals, and explains the reduction of the positive correlation with African ancestry as filters are applied.

Deleterious predictors are different depending on annotation

While filtering significantly reduces the correlation between the number of deleterious PAVs and African ancestry, it does not impact the correlation between the number of deleterious NAVs and African ancestry. One possible explanation is that the effects of the filters differ between the two categories of variants, with functionally neutral variation filtered out more efficiently for PAVs. Because we require at least 2 of 11 predictors to call a variant putatively deleterious, it is possible that predictors calling PAVs deleterious are consistently different than those calling NAVs deleterious. This is what we observe (P=10−15, χ2-test of independence), with predictors that use clinical databases to train their algorithms being over represented in deleterious PAV calls, and algorithms that are agnostic to clinical databases making up a larger percentage of the deleterious NAV calls (Supplementary Table 5). One possibility is that the machine learning-based algorithms preferentially optimize for patterns within the PAVs, and can thus inherit an ancestry-specific bias. Supporting this is the notion that most new African-specific causal variants will initially be identified as NAVs and may thus be less likely to be called by the currently trained predictors. Alternatively, though not mutually exclusively, conservation algorithms may be better able to remove background variants from the NAVs if the conservation score range for NAVs is significantly larger than PAVs. Given that NAVs are not annotated and are less processed than PAVs, they are more likely to be sampled equally across the entire distribution of conservation scores, and to therefore represent a wider range of conservation scores than PAVs. This is consistent with what others have observed2, and might explain why conservation algorithms predominate in the separation of deleterious NAVs from non-deleterious NAVs, compared with PAVs. While this differences in the type of predictors used in distinguishing deleterious and non-deleterious variants of different classification may represent the potential extension of ancestry related biases to deleterious predictors, this needs to be studied in more detail.

ClinVar correlation with African ancestry over time

To explore the historical context of recognized PAVs, and evaluate how ancestry related biases may have impacted the reproducibility of previous clinical applications relying on ClinVar, we conducted an analysis of how biases in archived versions of ClinVar have changed over time. ClinVar, a developing database of pathogenic variation officially released in April 2013, was chosen for this analysis as it has monthly updates that allow us to easily track changes over time. As of March 2015, the number of known pathogenic variants has almost doubled from 14,697 to 26,409. In Fig. 2, we show the correlation over time between African-ancestry proportion in our CAAPA individuals and counts of ClinVar-based pathogenic variants in these same individuals for each update between 16 June 2012 (pre-official release) and 5 March 2015. As seen from this figure, the content of the database is highly susceptible to ancestry-related biases, which affects the interpretation of results. Furthermore, these biases can change over time, further complicating the ability to interpret results and account for ancestry-related biases. The largest change happens over a single month, from March to April of 2014, when a significant positive correlation (r=0.733, P=0.001) switches to a significant negative correlation (r=−0.683, P=0.004). An analysis of differences between the March and April 2014 releases identifies 68 single-nucleotide polymorphisms (SNPs) that drive this marked change, and more details are presented in the supplement (Supplementary Note 1 and Supplementary Table 3).

Figure 2: Historical view of African-ancestry biases in ClinVar.
Figure 2

The x axis represents the various archived versions of ClinVar. In the black y axis on the right, we see the number of PAVs recorded from each version of the database. There are a few decreases in numbers, but overall this number shows continuous growth. In the blue y axis on the left, we see the correlation coefficient estimated between the number of PAVs per CAAPA individual and their proportion of African ancestry. The dotted grey line represents the date of the first official release of ClinVar. The blue trend line shows the instability across different ClinVar releases of the correlation of African-ancestry proportion with average number of pathogenic variants per individual. The change in correlation is particularly notable for sequential releases between March and April 2014, after which the correlation remains significantly negative for 3 months (April–July 2014) before once again becoming significantly positive. The red trend line represents the same relationship between ancestry–pathogenicity correlation and ClinVar release over time after applying filters, and shows a significant change in correlation during the same 3-month period of April–July 2014 despite an overall reduction in movement of the correlation across time.

The red line near the bottom of Fig. 2 shows the same correlation over time after filtering the data, again by MAF, mutation type and deleterious predictions (Fig. 1a). Similar to the unfiltered data, the filtered data show the first major shift in correlation from March to April of 2014, but the shift is in the opposite direction, with April showing a significantly less negative correlation (stats test) compared with March. The filtered data continue to show a less negative correlation for 3 months, before the pattern returns to a more significant negative correlation in July 2014, which is again similar in trend to but opposite in direction from the pattern seen in the unfiltered analysis. These simultaneous similarities and differences in the shift of the correlation between ancestry and pathogenic variation across database releases and filtering procedures reflect the precariousness of the current clinical databases, particularly when prioritizing variants of individuals with significant non-predominantly European ancestry. In contacting ClinVar about any possible curation differences for the March to July 2014 releases, we learned that ClinVar received a large deposit in April 2014 from the Breast Cancer Information Core database30 with significant amounts of non-European data. While further information about this deposit is unavailable and exactly why it caused a marked change from positive to negative correlation is currently unclear, these observations further support our message that database content reflects ancestry-related biases and can impact overall reproducibility.

Analysis of ancestral biases at the gene level

To explore ancestral biases at a gene level, we evaluated the correlation between the number of PAVs per gene and African ancestry using the March 2015 release of ClinVar. After correcting for multiple testing, we found a significant negative correlation with African ancestry for 10 genes (Supplementary Table 2). These genes represent a subset with the strongest bias, and while we suspect this negative correlation with African ancestry is likely due to some type of technical or ascertainment bias, it is nevertheless possible that this bias could have some biological basis. These genes will require particular care in clinical analysis and represent an interesting set for follow-up investigation, as African causal variants in these genes are more likely to be labelled as NAVs and require a greater identification effort. We find, in general, that the subset of genes with significant positive or negative correlations (P<0.05, uncorrected) are not enriched for those genes associated with known Mendelian diseases or those found in the GWAS catalogue31 (Methods).

Discussion

The ability to accurately report whether a genetic variant is responsible for a given disease or phenotypic trait depends in part on the confidence in labelling a variant as pathogenic. Such determination can often be more difficult in persons of predominantly non-European ancestry, as there is less known about the pathogenicity of variants that are absent from or less frequent in European populations. A key part of this are the differences between pathogenic variants, deleterious variants and prioritized variants, which are merely members of the proverbial haystack with differing levels of evidence for potential disease causality. It is important to note that a deleterious variant will only be labelled as pathogenic if its effect size is large enough to directly cause disease and this effect has been seen and annotated, and that a pathogenic variant will only be deleterious if it negatively impacts reproductive fitness. These terms are not the same, nor synonyms of true causality, but the use of deleteriousness as evidence for true disease causality is predicated on the fact that deleteriousness and pathogenicity should be correlated. While we cannot be sure which of these variants are truly disease-causing (actual ‘needles’ rather than haystack members) without additional functional or association-based evidence, we believe that discrepancies between true pathogenicity and annotated pathogenicity are a major source of the biases we report. A likely contributor to this incongruity is that databases are missing population-specific pathogenicity information, and with regard to the results we report here, African-specific pathogenicity data. Therefore, true causal variants for predominantly non-European patients are likely to fall into the NAV categories. Since NAVs have the highest degree of positive correlation with African ancestry (that is, bias), causal variants falling into this group are more difficult to distinguish, as they exist amongst a larger number of high-priority background variants (that is, larger haystack). This problem is compounded in individuals of substantial African ancestry, as their larger amount of overall genetic variation5,23,24 results in an even greater number of deleterious NAVs requiring adjudication.

As a consequence, review of genomic test results for persons of predominantly non-European ancestry could be both more challenging and costly. Positive correlations of African-ancestry proportion with non-deleterious PAVs and deleterious NAVs result in more variants to evaluate for African-ancestry individuals (that is, larger haystack), which leads to higher costs and longer turnaround times. Assuming a cost of $500 per variant for Sanger confirmation in a CLIA-certified laboratory (see Supplementary Table 6 for the range of costs found in clinical laboratories), and given gene candidate prioritization approaches that use phenotype to gene mapping32 and limit variants receiving follow-up confirmation to those in about 1% of the genome (that is, about 200 genes), we estimate an African-ancestry patient would have about 4.5 prioritized variants needing validation compared with 2.8 in an individual of European ancestry. This translates to a 1.6-fold increase in the number of variants prioritized, and represents a confirmation cost difference of over $800 per patient. Notably, these estimates are simplified and conservative, as we do not consider the substantial cost of having each of these variants reviewed by a clinician.

A potential solution would be to reserve follow-up confirmation for deleterious PAVs, which are uncorrelated to African ancestry and should therefore not be more common in individuals of African ancestry. However, doing this would limit the diagnostic landscape for both Europeans and non-Europeans to only previously found variation, and would greatly undermine the promise that sequencing technology holds for clinical genomics. Furthermore, this would limit the field to Euro-centric databases that would frequently miss causal variants in minority populations. In these situations, the missed causal variants would only be represented among the NAVs, which underlines the importance of not excluding prioritized NAVs from follow-up analysis.

These limitations translate into serious challenges, and despite the increased costs, provide good reason to cast a wider net for variant prioritization and confirmation when applying genomic testing to patients of African ancestry, and likely other predominantly non-European ancestries. As long as ancestry-related biases are not addressed, and most studies continue to predominantly sample from European populations, the genetics community will face challenges with implementation, interpretation and cost-effectiveness when treating minority populations.

Methods

Filtering pipeline

We annotate all variation using ANNOVAR33, a programme that facilitates the comprehensive and integrative annotation of multiple data types for each variant. Variants are divided into two main classes, each with two subgroups, for a total of four categories. PAVs consist of variants annotated as pathogenic in clinically annotated genetic databases, and are subdivided into deleterious and non-deleterious subgroups as determined by in silico predictions. NAVs include all variants not annotated as pathogenic (labelled as disease mutations), as well as those entirely absent from clinically annotated databases, and are also subdivided into deleterious and non-deleterious subgroups. Using customized ANNOVAR index tables, we annotate variants with 11 in silico predictors of function11,12,13,14,15,16,17,18,19,20 (Supplementary Table 4), functional information about protein-coding effect, clinical variation knowledge from ClinVar10 (all archived versions from 2012 to March 2015), the professional version of HGMD9 (fourth quarter version of 2014) and allele frequencies from multiple population sequencing projects, including the 1000 Genomes Project (phase 3)23, the ExAC database (http://exac.broadinstitute.org) and the Exome Sequencing Project5. We also integrate the final output as a list of variants belonging to each of the variant classes described above (Fig. 1a)1.

Filtering criteria

For variants from the OMIM8, HGMD9 and ClinVar10 databases, only those found in protein-coding genes are included. We also remove variants with MAF >5% in any of the 1000 Genome super-populations, ExAC populations or Exome Sequencing Project populations. With regard to the analysis portrayed in Fig. 1, if a variant is not found in any of the clinical databases, we use an allele frequency cutoff of 2% (Fig. 1a) and include only protein-altering variants found in the three following gene annotation databases: ENSEMBL GENE; KnownGene; or RefSeq. We also filter variants on the basis of in silico prediction, and require that at least 2 of 11 in silico prediction methods identify variants as deleterious (see Supplementary Table 4 for individual predictor cutoffs). An exception to this is that nonsense and splice site variants are called deleterious irrespective of their in silico predictors. Situations where these predicted deleteriousness filters are not applied are identified as exceptions in the text.

Variant classes

The first variant class, deleterious PAVs, are defined as variants with exact matches in genes in the OMIM8, HGMD9 or ClinVar10 databases, and are known to be associated with disease phenotypes. In addition, this class has to meet the above in silico prediction filter. The second class of variants is non-deleterious PAVs, and they only differs from the first category in that the requirement of being deleterious is removed. Deleterious NAVs make up the third class. This class is not annotated as pathogenic in any of the clinical databases, but these variants are identified by at least two in silico predictors as being deleterious. Finally, variants neither previously annotated as pathogenic nor predicted to be deleterious by at least two in silico predictors are classified as non-deleterious NAVs; they are seen as the least likely to be causative for a known disorder. Non-deleterious NAVs are also filtered by the frequency filters described above. Overall, <1% of NAVs are found in databases but not annotated as disease-causing; the remaining NAVs are not identified in any database.

Whole-genome sequencing data from the CAAPA Project

CAAPA consists of high coverage (30 × ) whole-genome sequence data (N=642) and provides a catalogue of genetic diversity from multiple populations of African descent. These populations include individuals from North America, South America, the Caribbean and continental Africa7, and study individuals are categorized as being cases and controls for asthma (sampling and variant calling are presented in more detail in Mathias et al.7) However, we do not suspect an atypical number of clinically relevant pathogenic variants among cases with such a complex disease phenotype as asthma. We have sampled 16 populations, including 8 different African American populations. Assembly of individual genomes, as well as variant calls are done using the Consensus Assessment of Sequence and Variation (CASAVA) package34. Using probabilistic models to build probability distributions over all diploid genotypes at every genomic site, genotypes are called after numerous quality control filtering steps. For each genomic position, a set of candidate SNPs becomes output. Multi-sample VCFs are generated at Knome Inc. (Cambridge, MA, USA), using VCFtools v0.1.11 (ref. 35) and custom scripts for additional data processing. While everything we report is based on a genome sequencing data set containing 642 samples, when we repeated our analyses on an expanded set of samples (total N=950) containing additional currently unpublished CAAPA data, our results were unchanged.

Estimation of ancestry proportions

To estimate ancestry we combine the CAAPA data with phase 1 of the 1000 Genomes Project, and data from previously published studies, which genotyped Hispanic and Native American samples on an Affymetric 6.0 chip36,37. All A–T and G–C SNPs are removed, and a missingness filter of 5% and a MAF filter of <5% are applied. The resulting SNPs are then LD pruned with plink38 using windows of 50 SNPs and removing SNPs with an r2>0.25, then iterating by 5 SNPs (that is, plink command—indep-pairwise 50 5 0.25). This results in 167,987 SNPs for admixture analysis.

We estimate ancestry proportions using the software package ADMIXTURE39. After performing 30 replicates modelling four clusters, we select the parameter values with the highest negative log likelihood. We identify the cluster that represents African ancestry by using the African groups from the 1000 Genomes Project as a reference (that is, the cluster where they have >99% membership), and we extract the proportion estimates for each of our CAAPA samples from this cluster. These become the values used to estimate the correlations. We present them as a bar plot in Supplementary Fig. 1.

Statistics to accommodate sampling structure

Owing to our population sampling approach, the full cohort does not represent an unstructured selection of individuals of African ancestry. To account for this when performing correlation analysis, we use the approach implemented in the R package ‘psych’40. The approach estimates correlations within each single population, which represent the pillars of the population substructure, and then combines these estimates weighted by sample size. Reported correlation coefficients and the P values are from the ‘weights’41 package, and significance is reported with a false discovery rate approach to correct for multiple testing.

Data availability

The whole-genome sequence data referenced in this study were generated by the CAAPA7 and have been deposited in dbGAP with the accession code phs001123.v1.p1.

Additional information

How to cite this article: Kessler, M. D. et al. Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry. Nat. Commun. 7:12521 doi: 10.1038/ncomms12521 (2016).

References

  1. 1.

    et al. Clinical whole-exome sequencing for the diagnosis of Mendelian disorders. N. Engl. J. Med. 369, 1502–1511 (2013).

  2. 2.

    et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA 312, 1880–1887 (2014).

  3. 3.

    et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA 312, 1870–1879 (2014).

  4. 4.

    et al. Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation. Am. J. Hum. Genet. 91, 660–671 (2012).

  5. 5.

    et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).

  6. 6.

    1000 Genomes Project Consortium. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  7. 7.

    et al. A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome. Nat. Commun. 7, 12522 (2016).

  8. 8.

    , , , & OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 43, D789–D798 (2015).

  9. 9.

    et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum. Genet. 133, 1–9 (2014).

  10. 10.

    et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).

  11. 11.

    , & in Research in Computational Molecular Biology 190–205Springer (2006).

  12. 12.

    & Identification of deleterious mutations within three human genomes. Genome Res. 19, 1553–1561 (2009).

  13. 13.

    et al. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 25, i54–i62 (2009).

  14. 14.

    , & Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073–1081 (2009).

  15. 15.

    et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).

  16. 16.

    et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).

  17. 17.

    , & Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 39, e118 (2011).

  18. 18.

    , , , & Predicting the functional consequences of cancer-associated amino acid substitutions. Bioinformatics 29, 1504–1510 (2013).

  19. 19.

    et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

  20. 20.

    et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 24, 2125–2137 (2015).

  21. 21.

    et al. Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units. Sci. Transl. Med. 4, 154ra135 (2012).

  22. 22.

    et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).

  23. 23.

    1000 Genomes Project Consortium. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).

  24. 24.

    et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013).

  25. 25.

    , , & Characteristics of neutral and deleterious protein-coding variation among individuals and populations. Am. J. Hum. Genet. 95, 421–436 (2014).

  26. 26.

    , , & The deleterious mutation load is insensitive to recent population history. Nat. Genet. 46, 220–224 (2014).

  27. 27.

    et al. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015).

  28. 28.

    , , , & Estimating the mutation load in human genomes. Nat. Rev. Genet. 16, 333–343 (2015).

  29. 29.

    et al. Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. Proc. Natl Acad. Sci. USA 113, E440–E449 (2016).

  30. 30.

    , , & The breast cancer information core: database design, structure, and scope. Hum. Mutat. 16, 123 (2000).

  31. 31.

    et al. A Catalog of Published Genome-Wide Association Studies. (European Bioinformatics Institute) Available at: (Date accessed 14 October 2015).

  32. 32.

    et al. The human phenotype ontology: semantic unification of common and rare disease. Am. J. Hum. Genet. 97, 111–124 (2015).

  33. 33.

    , & ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

  34. 34.

    CASAVA v1.8.2 (Illumina Inc., 2014).

  35. 35.

    et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

  36. 36.

    et al. Identifying signatures of natural selection in Tibetan and Andean populations using dense genome scan data. PLoS Genet. 6, e1001116 (2010).

  37. 37.

    et al. Genetic variation in Native Americans, inferred from Latino SNP and resequencing data. Mol. Biol. Evol. 28, 2231–2237 (2011).

  38. 38.

    et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).

  39. 39.

    , & Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).

  40. 40.

    psych: Procedures for Personality and Psychological Research. R package version 1 (Northwestern University, Evanston, Illinois, USA, 2014).

  41. 41.

    , , & Weights: Weighting and Weighted Statistics. Computer software. CRAN. Version 0.80. CRAN, 04 March 2014 (accessed on 14 October 2015) (2014).

Download references

Acknowledgements

We acknowledge the contributions of Paul Levett, Anselm Hennis, P. Michele Lashley, Raana Naidu, Malcolm Howitt and Timothy Roach (BAGS); Audrey Grant, Eduardo Viera Ponte, Alvaro A. Cruz and Edgar Carvalho (BIAS); Susan Balcer-Whaley, Maria Stockton-Porter and Mao Yang (GRAAD); Mario Meraz, Jaime Nuñez and Eileen Fabiani Herrera Mejía (HONDAS); Deanna Ashley (JAAS); Silvia Jimenez, Nathalie Acevedo and Dilia Mercado (PGCA); Ann Jedlicka (REACH); Addison K. May, Caroline Gilmore and Patricia Minton (Vanderbilt University); Qun Niu (University of Chicago); and Adeyinka Falusi and Abayomi Odetunde (University of Ibadan, Nigeria). We also acknowledge the support of John Jay Shannon (Cook County Health Systems) and Kevin Weiss (Northwestern University); Regina Miranda and the Indians Zenues guards (San Basilio de Palenque, Bolivar, Colombia); Ulysse Ateba Ngoa (Leiden University); and Charles Rotimi, Adeyemo Adebowale, Floyd J Malveaux and Elena Reece (Howard University). We thank the numerous health-care providers and community clinics and co-investigators who assisted in the phenotyping and collection of DNA samples, and the families and patients for generously donating DNA samples to BAGS, BIAS, BREATHE, CAG, GRAAD, HONDAS, REACH, SAGE II, VALID, SAPPHIRE, SARP, COPDGene, JAAS, GALA II, PGCA and AEGS. Special thanks to community leaders, teachers, doctors and personnel from health centres at the Garifuna communities for organizing the medical brigades and to the medical students at Universidad Católica de Honduras, Campus San Pedro y San Pablo for their participation in the fieldwork related to HONDAS; study coordinator Sandra Salazar, and the recruiters in SAGE and GALA: Duanny Alva, MD; Gaby Ayala-Rodriguez; Ulysses Burley; Lisa Caine; Elizabeth Castellanos; Jaime Colon; Denise DeJesus; Iliana Flexas; Blanca Lopez; Brenda Lopez, MD; Louis Martos; Vivian Medina; Juana Olivo; Mario Peralta; Esther Pomares, MD; Jihan Quraishi; Johanna Rodriguez; Shahdad Saeedi; Dean Soto; Ana Taveras; Emmanuel Viera; Dr Michael LeNoir; Dr Kelley Meade; Mindy Jensen; and Adam Davis; and health liaisons and public health officers of the main Conde office, Adaliudes Conceição, Luciana Quintela, Ivanice Santos, Analú Lima, Benivaldo Valber Oliveira Silva and Iraci Santos Araujo, and students from the Federal University of Bahia who assisted in data collection in BIAS: Rafael Santana; Roberta Barbosa; Ana Paula Santana; Charlton Barros; Marcele Brandão; Ludmila Almeida; Thiago Cardoso; and Daniela Costa. We are grateful for the support from the international state governments and universities from Honduras, Colombia, Brazil, Gabon, Nigeria, Netherlands, Jamaica, Barbados and the United States who made this work possible. We also thank Robert Genuario for invaluable assistance in the whole-genome sequencing at Illumina, Inc.; Gonçalo Abecasis, William Cookson and Miriam Moffatt for helpful discussions; Pat Oldewurtel and Murali Bopparaju for technical support; Shuai Yuan for software support; and Kit Rees and Cate Kiefe for artistic contributions. We thank Steven Salzberg and Alex Szalay for computing and data storage resources available on the Data-Scope instrument at the Institute for Data Intensive Science (IDIES), Johns Hopkins University. We also thank Aksinija A. Shamah for artistic assistance. We acknowledge the support from James Kiley, Susan Banks-Schlegel and Weiniu Gan at the National Heart, Lung, and Blood Institute. Funding for this study was provided by the National Institutes of Health (NIH) R01HL104608 and the Center for Health Related Informatics and Bioimaging at the University of Maryland (M.D.K., A.C.S. and T.D.O.). Additional NIH funding includes the following: NCI: R21CA178706 (R.D.H.), U01CA161032 and P50CA125183 (O.I.O.); NCRR: G12RR003048 (G.M.D.) and RR24975 (TH); NHGRI: R01HG007644, R21HG007233 (R.D.H.), R21HG004751 (H.R.J., J.G. and Z.S.Q.) and T32HG000044 (C.R.G.); NHLBI: R01HL087699 (K.C.B.), R01HL118267 (L.K.W.), R01HL117004, R01HL088133, R01HL004464 (E.G.B.), HL081332, HL112656 (L.B.W.), R01HL69167, U01HL109164 (E.B. and D.M.), RC2HL101651, RC2HL101543, U01HL49596, R01HL072414 (C.O.), R01HL089897, R01HL089856, K01HL092601(M.G.F.), R01HL51492, R01HL/AI67905 (J.G.F.), HHSN268201300046C, HHSN268201300047C, HHSN268201300048C, HHSN268201300049C and HHSN268201300050C (J.G.W.); NIAID: K08AI01582 (T.H.), R01AI079139 (L.K.W.) and U19AI095230 (C.O.); NIEHS: R01ES015794 (E.G.B.); NIGMS: S06GM08016 (M.U.F.) and T32GM07175 (C.R.G.); NIMHD: P60MD006902 (E.G.B.), 8U54MD007588 and P20MD0066881 (M.G.F.); NSFGRF #1144247 (R.T.). Additional sources of funding include the following: American Asthma Foundation (L.K.W. and E.G.B.); American Lung Association Clinical Research Grant (T.H.); Colombian Government (Colciencias) 331-2004 and 680-2009 (L.C.); EDCTP:CT.2011.40200.025 (A.A.A.); EU-IDEA HEALTH-F3-2009-241642 and EU-TheSchistoVac HEALTH-Fe-2009-242107 (M.Y.); Ernest Bazley Fund (P.C.A., R.K., L.G. and R.S.); and the Fund for Henry Ford Hospital (L.K.W.). The Jamaica 1986 Birth Cohort Study was supported by grants from the Caribbean Health Research Council, Caribbean Cardiac Society, National Health Fund (Jamaica) and Culture Health Arts Sports and Education Fund (Jamaica). Study nurses were supported by the University Hospital of the West Indies (T.F. and J.K.M.), Ralph and Marion Falk Medical Trust (C.O.O., O.O., O.O. and G.A.), UCSF Dissertation Year Fellowship (C.R.G.), Universidad Católica de Honduras, San Pedro Sula (E.H.P.), University of Cartagena (J.M.) and Wellcome Trust 072405/Z/03/Z, 088862/Z/09/Z (P.J.C.). The Jackson Heart Study is supported by contracts HHSN268201300046C, HHSN268201300047C, HHSN268201300048C, HHSN268201300049C and HHSN268201300050C from the NHLBI and the NIMHD. E.G.B. was funded by Flight Attendant Medical Research Institute, RWJF Amos Medical Faculty Development Award and the Sandler Foundation; the Sloan Foundation to R.D.H.; C.R.G. was supported in part by the UCSF Chancellor’s Research Fellowship and Dissertation Year Fellowship. K.C.B. was supported in part by the Mary Beryl Patch Turnbull Scholar Program. R.A.M. was supported in part by the MOSAIC Initiative Awards from Johns Hopkins University. M.P.-Y. was funded by a Postdoctoral Fellowship from Fundación Ramón Areces. M.I.A. is an investigator supported by National Council for Scientific and Technological Development (CNPq). T.V.H. was supported in part by K24 AI 77930, UL1 TR00445 and U19 AI95227. R.O. was funded by NHLBI Diversity Supplement R01HL104608. Funding for the cohorts was provided by the following: AEGS, BAGS, BIAS, BREATHE (K08AI001582 and RR24975), CAG, COPDGene, GALA II, GRAAD, HONDAS, JAAS (The Jamaica 1986 Birth Cohort Study was supported by grants from the Caribbean Health Research Council, Caribbean Cardiac Society, National Health Fund (Jamaica) and Culture Health Arts Sports and Education Fund (Jamaica). The study nurses were supported by the University Hospital of the West Indies), PGCA (University of Cartagena and Colciencias Contracts 183-2002, 680-2009), REACH, SAGE II, SAPPHIRE, SARP, SCAALA and VALID.

Author information

Author notes

Affiliations

  1. Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA

    • Michael D. Kessler
    • , Amol C. Shetty
    • , Wei Song
    •  & Timothy D. O’Connor
  2. Department of Medicine, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA

    • Laura Yerges-Armstrong
    • , Wei Song
    •  & Timothy D. O’Connor
  3. Program in Personalized and Genomic Medicine, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA

    • Laura Yerges-Armstrong
    • , Kristin Maloney
    • , Linda Jo Bone Jeng
    • , Wei Song
    •  & Timothy D. O’Connor
  4. Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland 21287, USA

    • Margaret A. Taub
    •  & Ingo Ruczinski
  5. Department of Public Health Sciences, Henry Ford Health System, Detroit, Michigan 48202, USA

    • Albert M. Levin
  6. Center for Health Policy & Health Services Research, Henry Ford Health System, Detroit, Michigan 48202, USA

    • L. Keoki Williams
    •  & Badri Padhukasahasram
  7. Department of Internal Medicine, Henry Ford Health System, Detroit, Michigan 48202, USA

    • L. Keoki Williams
    •  & Badri Padhukasahasram
  8. Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland 21205, USA

    • Terri H. Beaty
    • , Rasika A. Mathias
    • , Kathleen C. Barnes
    •  & Jean G. Ford
  9. Department of Medicine, Johns Hopkins University, Baltimore, Maryland 21224, USA

    • Rasika A. Mathias
    • , Kathleen C. Barnes
    • , Meher Preethi Boorgula
    • , Monica Campbell
    • , Sameer Chavan
    • , Cassandra Foster
    • , Li Gao
    • , Nadia N. Hansel
    • , Edward Horowitz
    • , Lili Huang
    • , Romina Ortiz
    • , Joseph Potee
    • , Nicholas Rafaels
    • , Alan F. Scott
    •  & Candelaria Vergara
  10. Department of Medicine, University of Colorado, Aurora, Colorado 80045, USA

    • Kathleen C. Barnes
  11. Department of Medicine, The Brooklyn Hospital Center, Brooklyn, New York, USA

    • Jean G. Ford
  12. Data and Statistical Sciences, AbbVie, North Chicago, Illinois, USA

    • Jingjing Gao
  13. Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, USA

    • Yijuan Hu
    • , Henry Richard Johnston
    •  & Zhaohui S. Qin
  14. Department of Microbiology, Howard University College of Medicine, Washington, DC, USA

    • Georgia M. Dunston
  15. National Human Genome Center, Howard University College of Medicine, Washington, DC, USA

    • Georgia M. Dunston
    •  & Mezbah U. Faruque
  16. Department of Genetics, Stanford University School of Medicine, Stanford, California, USA

    • Eimear E. Kenny
    • , Carlos Bustamante
    • , Francisco M. De La Vega
    • , Chris R. Gignoux
    • , Suyash S. Shringarpure
    • , Shaila Musharoff
    •  & Genevieve Wojcik
  17. Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai, New York, New York, USA

    • Eimear E. Kenny
  18. Illumina, Inc., San Diego, California, USA

    • Kimberly Gietzen
    • , Mark Hansen
    • , Rob Genuario
    • , Dave Bullis
    •  & Cindy Lawley
  19. Knome Inc., Cambridge, Massachusetts, USA

    • Aniket Deshpande
    • , Wendy E. Grus
    •  & Devin P. Locke
  20. Pulmonary and Critical Care Medicine, Morehouse School of Medicine, Atlanta, Georgia, USA

    • Marilyn G. Foreman
  21. Department of Medicine, Northwestern University, Chicago, Illinois, USA

    • Pedro C. Avila
    •  & Leslie Grammer
  22. Department of Preventive Medicine, Northwestern University, Chicago, Illinois, USA

    • Kwang-YounA Kim
  23. Department of Pediatrics, Northwestern University, Chicago, Illinois, USA

    • Rajesh Kumar
  24. The Ann & Robert H. Lurie Children’s Hospital of Chicago, Chicago, Illinois, USA

    • Rajesh Kumar
  25. Department of Medicine, Northwestern Feinberg School of Medicine, Chicago, Illinois, USA

    • Robert Schleimer
  26. Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California, USA

    • Esteban G. Burchard
    • , Ryan D. Hernandez
    •  & Zachary A. Szpiech
  27. Department of Medicine, University of California, San Francisco, San Francisco, California, USA

    • Esteban G. Burchard
    • , Celeste Eng
    • , Maria Pino-Yanes
    •  & Dara G. Torgerson
  28. Department of Neurology, University of California, San Francisco, San Francisco, California, USA

    • Pierre-Antoine Gourraud
    •  & Antoine Lizee
  29. Institute for Human Genetics, University of California, San Francisco, San Francisco, California, USA

    • Ryan D. Hernandez
  30. California Institute for Quantitative Biosciences, University of California, San Francisco, San Francisco, California, USA

    • Ryan D. Hernandez
  31. CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain

    • Maria Pino-Yanes
  32. Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, California, USA

    • Raul Torres
  33. Department of Medicine, University of Chicago, Chicago, Illinois, USA

    • Dan L. Nicolae
    • , Olufunmilayo Olopade
    •  & Oluwafemi Oluwole
  34. Department of Statistics, University of Chicago, Chicago, Illinois, USA

    • Dan L. Nicolae
  35. Department of Human Genetics, University of Chicago, Chicago, Illinois, USA

    • Carole Ober
  36. Department of Medicine and Center for Global Health, University of Chicago, Chicago, Illinois, USA

    • Christopher O. Olopade
  37. Department of Chemical Pathology, University of Ibadan, Ibadan, Nigeria

    • Ganiyu Arinola
  38. Department of Biostatistics, SPH II, University of Michigan, Ann Arbor, Michigan, USA

    • Goncalo Abecasis
  39. Department of Medicine, University of Mississippi Medical Center, Jackson, Mississippi, USA

    • Adolfo Correa
    •  & Solomon Musani
  40. Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, Mississippi, USA

    • James G. Wilson
  41. Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA

    • Leslie A. Lange
  42. Department of Genomic Sciences, University of Washington, Seattle, Washington, USA

    • Joshua Akey
    • , Wenqing Fu
    •  & Deborah Nickerson
  43. Department of Pediatrics, University of Washington, Seattle, Washington, USA

    • Michael Bamshad
    •  & Jessica Chong
  44. University of Washington, Seattle, Washington, USA

    • Alexander Reiner
  45. Department of Medicine, Vanderbilt University, Nashville, Tennessee, USA

    • Tina Hartert
    •  & Lorraine B. Ware
  46. Department of Pathology, Microbiology and Immunology, Vanderbilt University, Nashville, Tennessee, USA

    • Lorraine B. Ware
  47. Center for Human Genomics and Personalized Medicine, Wake Forest School of Medicine, Winston-Salem, North Carolina, USA

    • Eugene Bleecker
    • , Deborah Meyers
    •  & Victor E. Ortega
  48. Genetics and Epidemiology of Asthma in Barbados, The University of the West Indies, West Indies

    • Maul R. N. Pissamai
    •  & Maul R. N. Trevor
  49. Faculty of Medical Sciences Cave Hill Campus, The University of the West Indies, West Indies

    • Harold Watson
  50. Queen Elizabeth Hospital, The University of the West Indies, West Indies

    • Harold Watson
  51. Immunology Service, Universidade Federal da Bahia, Salvador, Brazil

    • Maria Ilma Araujo
  52. Laboratório de Patologia Experimental, Centro de Pesquisas Gonçalo Moniz, Salvador, Brazil

    • Ricardo Riccio Oliveira
  53. Institute for Immunological Research, Universidad de Cartagena, Cartagena, Spain

    • Luis Caraballo
    • , Beatriz Martinez
    •  & Catherine Meza
  54. Instituto de Investigaciones Immunologicas, Universidad de Cartagena, Cartagena, Spain

    • Javier Marrugo
  55. Faculty of Medicine, Universidad Nacional Autonoma de Honduras en el Valle de Sula, San Pedro Sula, Honduras

    • Gerardo Ayestas
    • , Allan Saenz
    •  & Gloria Varela
  56. Facultad de Medicina, Universidad Catolica de Honduras, San Pedro Sula, Honduras

    • Edwin Francisco Herrera-Paz
    • , Pamela Landaverde-Torres
    • , Said Omar Leiva Erazo
    • , Rosella Martinez
    • , Luis F. Mayorga
    •  & Hector Ramos
  57. Centro de Neumologia y Alergias, San Pedro Sula, Honduras

    • Edwin Francisco Herrera-Paz
    • , Alvaro Mayorga
    •  & Delmy-Aracely Mejia-Mejia
  58. Faculty of Medicine, Centro Medico de la Familia, San Pedro Sula, Honduras

    • Edwin Francisco Herrera-Paz
    • , Delmy-Aracely Mejia-Mejia
    •  & Olga Marina Vasquez
  59. Tropical Medicine Research Institute, The University of the West Indies, West Indies

    • Trevor Ferguson
    • , Jennifer Knight-Madden
    •  & Rainford J. Wilks
  60. Department of Child Health, The University of the West Indies, West Indies

    • Maureen Samms-Vaughan
  61. Centre de Recherches Médicales de Lambaréné, Libreville, Gabon

    • Akim Adegnika
    •  & Ulysse Ateba-Ngoa
  62. Institut für Tropenmedizin, Universität Tübingen, Tübingen, Germany

    • Akim Adegnika
    •  & Ulysse Ateba-Ngoa
  63. Department of Parasitology, Leiden University Medical Center, Leiden, Netherlands

    • Akim Adegnika
    • , Ulysse Ateba-Ngoa
    •  & Maria Yazdanbakhsh

Consortia

  1. Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA)

Authors

  1. Search for Michael D. Kessler in:

  2. Search for Laura Yerges-Armstrong in:

  3. Search for Margaret A. Taub in:

  4. Search for Amol C. Shetty in:

  5. Search for Kristin Maloney in:

  6. Search for Linda Jo Bone Jeng in:

  7. Search for Ingo Ruczinski in:

  8. Search for Albert M. Levin in:

  9. Search for L. Keoki Williams in:

  10. Search for Terri H. Beaty in:

  11. Search for Rasika A. Mathias in:

  12. Search for Kathleen C. Barnes in:

  13. Search for Timothy D. O’Connor in:

Contributions

T.D.O. conceived the project; M.D.K. and T.D.O. designed and performed experiments, analysed data and wrote the manuscript; L.Y.-A., M.A.T., A.C.S., K.M., L.J.B.J., I.R., A.M.L., L.K.W., T.H.B., R.A.M. and K.C.B. provided technical assistance and assisted with the writing and review of the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Timothy D. O’Connor.

Supplementary information

PDF files

  1. 1.

    Supplementary Information

    Supplementary Figure 1, Supplementary Tables 1-6, Supplementary Note 1, Supplementary Discussion, Supplementary Methods and Supplementary References

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Creative Commons BYThis work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/