The Brazilian Initiative on Precision Medicine (BIPMed): fostering genomic data-sharing of underrepresented populations

Rocha, Cristiane S.; Secolin, Rodrigo; Rodrigues, Maíra R.; Carvalho, Benilton S.; Lopes-Cendes, Iscia

doi:10.1038/s41525-020-00149-6

Brief Communication
Open access
Published: 02 October 2020

npj Genomic Medicine volume 5, Article number: 42 (2020)

Abstract

The development of precision medicine strategies requires prior knowledge of the genetic background of the target population. However, despite the availability of data from admixed Americans within large reference population databases, we cannot use these data as a surrogate for that of the Brazilian population. This lack of transferability is mainly due to differences between ancestry proportions of Brazilian and other admixed American populations. To address the issue, a coalition of research centres created the Brazilian Initiative on Precision Medicine (BIPMed). In this study, we aim to characterise two datasets obtained from 358 individuals from the BIPMed using two different platforms: whole-exome sequencing (WES) and a single nucleotide polymorphism (SNP) array. We estimated allele frequencies and variant pathogenicity values from the two datasets and compared our results using the BIPMed dataset with other public databases. Here, we show that the BIPMed WES dataset contains variants not included in dbSNP, including 6480 variants that have alternative allele frequencies (AAFs) >1%. Furthermore, after merging BIPMed WES and SNP array data, we identified 809,589 variants (47.5%) not present within the 1000 Genomes dataset. Our results demonstrate that, through the incorporation of Brazilian individuals into public genomic databases, BIPMed not only was able to provide valuable knowledge needed for the implementation of precision medicine but may also enhance our understanding of human genome variability and the relationship between genetic variation and disease predisposition.

Introduction

Precision medicine combines molecular and clinical information to improve healthcare delivery. Since precision medicine uses individualised information from patients, such as genomic signatures, it allows for more accurate diagnoses and tailored treatment options^1,2. This approach is a significant improvement over the current paradigm in which physicians prescribe therapeutics designed to most effectively treat the average patient. However, precision medicine cannot be implemented without understanding the contribution of human genomic diversity to health and disease³. Therefore, the development of strategies used in precision medicine requires detailed knowledge of the genetic background of the population in which they will be applied. This knowledge is particularly important because the distribution of rare and common variants may differ depending on the population considered^4,5,6,7,8. This issue is more challenging for admixed American populations since their genomes present a mosaic of chromosomal tracts derived from different ancestral populations^9,10,11.

Large-scale genomic studies conducted using subjects not selected based on disease-related phenotypes (defined here as the reference population) have been performed to characterise the genetic architecture of specific populations. These studies include the HapMap project¹², 1000 Genomes Project⁴, Simons Genome Diversity Project¹³, and Genome Aggregation Database (gnomAD)¹⁴. More recently, national initiatives devoted to the development and improvement of precision medicine have been conducted in several countries, including the United States¹⁵, the United Kingdom¹⁶, the Netherlands¹⁷, Qatar¹⁸, Japan¹⁹, Australia²⁰, and some African countries²¹. Several of the projects relied on the findings of previous large-scale genomic studies to guide experimental design and analytical protocols, highlighting the importance of acquiring genomic information at the population level to facilitate the implementation of precision medicine.

However, despite the availability of reference genomes from some admixed American populations, this population group remains underrepresented in all large reference population databases, and especially in publicly available datasets²². For instance, we found that of the 2504 individuals who participated in the 1000 Genomes Project and 141,456 individuals included in the gnomAD v2.1 dataset, only 13.86% and 12.53% were admixed Americans, respectively. Even though Brazil has the largest population among all countries in Latin America and the Caribbean (32.57% in 2015) and the fifth-largest population worldwide (https://population.un.org/wpp/Download/Standard/Population/), the Brazilian population is underrepresented in both public genomic reference databases and genome-wide association studies (GWAS). This observation remains true even if one includes Latin American populations represented in worldwide collaborative studies, such as the 1000 Genomes Project and gnomAD, which involved Colombian, Peruvian, Puerto Rican, and Mexican populations^4,14,22. Indeed, among the 3529 studies published in the GWAS catalogue³, only 75 studies contain data from Brazilian individuals, and only three consist exclusively of Brazilian populations^23,24,25.

Similar to other admixed American populations, the Brazilian population is derived from sub-Saharan African, European, and Native American populations^25,26,27,28. However, we cannot use other admixed American populations as a reference for Brazilians due to differences in the proportions of ancestral populations from which the current Brazilian and other admixed Americans are derived^9,10,11,28,29. In this specific case, genomic markers detected in other admixed American populations have the potential to mischaracterise the genomic landscape of interest because the allele frequencies of some genetic markers are population-specific. In addition, due to the size and heterogeneous background of the Brazilian population, different ancestral proportions are likely to occur in different geographic regions of the country as a result of evolutionary and demographic events^26,27. Although previous reports have included genomic information from Brazilian populations, the quantity of variant information across the genome was limited²⁷ and the set of subpopulations evaluated was restricted^25,26. A greater volume of genomic data will therefore be needed for the adequate implementation of precision medicine in Brazil.

Importantly, data generated in the majority of previous studies that have examined the Brazilian population are not publicly available. To address this issue, a coalition of five research, dissemination, and innovation centres supported by the São Paulo Research Foundation (http://www.fapesp.br/) created the Brazilian Initiative on Precision Medicine (BIPMed; http://www.bipmed.org) in November 2015. The main objective of the BIPMed project is to facilitate the implementation of precision medicine in Brazil by acting as a catalyst that fosters collaboration among stakeholders, including physicians, scientists, health authorities, policymakers, and society. In this context, we aim to investigate the distribution of rare and common variants present in two BIPMed datasets and assess the composition of a sample of the Brazilian population from a large metropolitan area in São Paulo, the most populous state of Brazil, located in the southeast region of the country. In the current manuscript, we present evidence highlighting the importance of compiling and analysing genomic datasets from underrepresented populations in the context of genomic and precision medicine. First, we describe the two datasets available in BIPMed: a whole-exome sequencing (WES) dataset and a single nucleotide polymorphism (SNP) array genotyping dataset. Second, we present a comparison of variants identified from each dataset against those of publicly available databases. Finally, we compare the population genomic structure provided by information derived from WES and SNP array data.

Results

WES dataset

Overall, we found 851,109 different variants within 18,202 genes in the dataset, which included single nucleotide variants and small insertions and deletions. After removing variants with >20% missing data, 823,481 variants remained. Among these, 522,290 (63.4%) had alternative allele frequency (AAF) values <1%, and 96,971 (11.8%) were not present in the dbSNP database. Among the variants absent from dbSNP, 6480 had AAF values >1% (Supplementary Data). Interestingly, nine variants absent from dbSNP occurred at a high frequency within the BIPMed dataset (>90%).
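
The filtering and frequency steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes diploid genotypes coded as alternative-allele counts (0, 1, or 2) with `None` for a missing call, and the function names are hypothetical. Only the 20% missing-data cut-off and the 1% AAF threshold come from the text.

```python
from typing import Optional, Sequence

# Assumed encoding: number of alternative alleles (0, 1, 2); None = missing call.
Genotype = Optional[int]


def missing_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of individuals without a genotype call at this variant."""
    return sum(g is None for g in genotypes) / len(genotypes)


def alt_allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """AAF over called diploid genotypes: alt alleles / (2 * called individuals)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this variant")
    return sum(called) / (2 * len(called))


def classify_variant(genotypes: Sequence[Genotype],
                     max_missing: float = 0.20,
                     rare_below: float = 0.01) -> str:
    """Drop variants with >20% missing data, then label the rest by AAF."""
    if missing_rate(genotypes) > max_missing:
        return "filtered"
    return "rare" if alt_allele_frequency(genotypes) < rare_below else "common"


# Toy data: one heterozygous carrier among 100 called individuals.
print(classify_variant([0] * 99 + [1]))
```

In the toy call, one alternative allele among 200 called alleles gives an AAF of 0.5%, so the variant falls in the rare class.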

A comparison between the WES dataset and the ClinVar database revealed that 727 variants were classified as pathogenic and 41 as likely pathogenic. Of these, 509 (70.0%) pathogenic variants and 33 (80.5%) likely pathogenic variants were rare (AAF < 1%) in the BIPMed WES dataset. The AAF values of most of the common variants (AAF ≥ 1%) found in the WES dataset were similar to the gnomAD and TOPMed frequencies reported in dbSNP. Interestingly, none of the variants classified as pathogenic in the BIPMed WES data overlapped with the 1000 Genomes dataset.

SNP array dataset

After quality control procedures, the SNP array data contained 902,939 variants, of which 25,492 overlapped with the WES data and 897,990 (99.44%) were also present in the 1000 Genomes datasets. We identified 65,519 variants with AAF values between 1 and 5%, and 831,266 with AAF values >5%.

Comparing genomic population structure between WES and SNP array datasets

The PCA used to assess the two BIPMed datasets revealed that the WES and SNP array datasets produced similar results, which are in accordance with previous reports^25,26,27,28. PC1 shows variant frequencies similar to European populations, and PC2 indicates characteristics of both European and sub-Saharan African populations (Fig. 1a, b). In addition, the similarity between the PCAs performed on the two BIPMed datasets was reflected in the high correlation between WES and SNP array data for PC1 (ρ ≥ 0.90; Fig. 1c) and PC2 (ρ ≥ 0.95; Fig. 1d). According to Euclidean distance estimations, both the WES (Fig. 2a) and the SNP array (Fig. 2b) datasets were closest to the European population, followed by admixed American populations.

**Fig. 1: Comparison between principal components (PCs) of WES and SNP array data.**

**Fig. 2: Comparison of Euclidean distance estimations between the BIPMed datasets and continental populations from the 1000 Genomes Project (1KGP).**
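
The two comparisons in this section, the correlation of PC scores across platforms and the Euclidean distance to reference populations, can be illustrated with a short sketch. It rests on assumptions the text does not state: ρ is treated here as a Spearman rank correlation computed over individuals genotyped on both platforms, and distances are taken between population centroids in PC1/PC2 space. All function names and the toy scores are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def centroid(points: Sequence[Tuple[float, ...]]) -> Tuple[float, ...]:
    """Mean position of a population in PC space."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


# Toy PC1 scores for the same four individuals on each platform (invented).
pc1_wes = [-0.12, 0.03, 0.25, 0.40]
pc1_array = [-0.10, 0.01, 0.30, 0.38]
print(spearman(pc1_wes, pc1_array))

# Toy (PC1, PC2) coordinates for two groups (invented).
bipmed = [(0.10, 0.02), (0.14, 0.06)]
european = [(0.20, 0.00), (0.24, 0.04)]
print(math.dist(centroid(bipmed), centroid(european)))
```

A rank correlation is used here because it is insensitive to the arbitrary scaling of PC axes between runs. If the sign of a component flips between platforms, the absolute value of ρ is the quantity of interest.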

Comparing the BIPMed dataset with the 1000 Genomes dataset

To compare the allele frequencies of variants found within the BIPMed dataset with those in the 1000 Genomes dataset, we first merged the WES and SNP array data into a single dataset, which provided 1,626,829 unique autosomal variants. Allele frequencies were estimated from the merged data. The estimation revealed 1,136,454 (69.9%) common variants with a minor allele frequency (MAF) ≥ 1% and 490,375 (30.1%) rare variants with a MAF < 1%. After applying genotype and individual filtering³⁰, 817,240 (52.5%) autosomal variants could be found in the 1000 Genomes database. These results indicated that 809,589 variants (47.5%) present in the BIPMed reference population were not present in the 1000 Genomes datasets.
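
The merge-and-compare logic can be sketched as a set operation over variant keys. This is a simplified illustration under stated assumptions: variants are keyed by a hypothetical (chromosome, position, reference allele, alternative allele) tuple, frequencies are merged per platform rather than re-estimated from merged genotypes as in the text, and the WES value is arbitrarily kept when a variant appears on both platforms. The toy data are invented.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def minor_allele_frequency(aaf: float) -> float:
    """MAF is the frequency of the less common of the two alleles."""
    return min(aaf, 1.0 - aaf)


def merge_frequencies(wes: Dict[Variant, float],
                      array: Dict[Variant, float]) -> Dict[Variant, float]:
    """Union of both platforms; the WES value wins on overlap (an assumption)."""
    merged = dict(array)
    merged.update(wes)
    return merged


def summarise(merged: Dict[Variant, float],
              reference: Set[Variant],
              rare_below: float = 0.01) -> Dict[str, int]:
    """Count common/rare variants and those present or absent in a reference set."""
    rare = sum(minor_allele_frequency(f) < rare_below for f in merged.values())
    shared = sum(v in reference for v in merged)
    return {
        "total": len(merged),
        "common": len(merged) - rare,
        "rare": rare,
        "in_reference": shared,
        "not_in_reference": len(merged) - shared,
    }


# Toy data (invented for illustration only).
wes = {("1", 100, "A", "G"): 0.004, ("1", 250, "C", "T"): 0.30}
array = {("1", 250, "C", "T"): 0.31, ("2", 900, "G", "A"): 0.97}
reference = {("1", 250, "C", "T")}

print(summarise(merge_frequencies(wes, array), reference))
```

Note that a variant with an AAF of 0.97 counts as common, because its minor allele (here the reference allele) has a frequency of 3%.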

After comparing BIPMed data with the 1000 Genomes datasets, we found that some variants that were rare in European (75,584; 9.2%), sub-Saharan African (67,109; 8.2%), and admixed American populations (34,360; 4.2%) were common in the BIPMed database. In contrast, 7493 (0.9%) variants that were common in European populations, 65,565 (8.0%) in sub-Saharan African populations, and 12,132 (1.5%) in admixed American populations were rare in the BIPMed reference datasets (Table 1). Under the null hypothesis that variant frequencies are similar in the BIPMed and 1000 Genomes datasets, our results provide evidence that the data are not compatible with the null hypothesis (Fisher’s exact test p value = 2.2 × 10⁻¹⁶). It is important to point out that the BIPMed sample (N = 358) was similar in size to the other datasets used for the comparative analyses performed in the present work, which contained European (N = 404), African (N = 504), and admixed American (N = 347) populations.
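
For readers unfamiliar with Fisher's exact test, the sketch below shows the two-sided test on a 2 × 2 table of rare/common classifications. It is an illustration only: the text does not say which software or table layout the authors used, the function name is hypothetical, and the counts are invented rather than taken from Table 1.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so ties are not lost to floating-point error.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Invented counts: rows = rare/common in one dataset,
# columns = rare/common in the other.
print(fisher_exact_two_sided(8, 2, 1, 5))
```

With the very large counts reported in the text, a dedicated statistics package would be the practical choice; this pure-Python version is meant only to make the calculation explicit.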

Table 1 Distribution of minor allele frequencies (MAF) among variants.

Discussion

The application of precision medicine in admixed American populations requires a refined knowledge of the environmental exposure, lifestyle, biological susceptibility, and genomic structure of their admixed genomes^15,31. Indeed, studies have shown that risk-associated allele frequencies of different populations vary, a phenomenon which implies that risk-associated alleles identified in one population are not necessarily informative when predicting disease prevalence of all human groups^7,8. If physicians do not take this information into account when implementing precision medicine, they are likely to provide incorrect diagnoses of patients, and correspondingly, provide inadequate treatments³². This scenario is especially likely to occur in Brazilian admixed populations, which are remarkably underrepresented in public genomic databases²⁸.

Here we aimed to highlight the importance of compiling, analysing, and sharing genomic data obtained from an underrepresented population to enhance the application of precision medicine. We have also shown that even limited datasets can be of value in this scenario, since small datasets are better than no information at all. This point is particularly relevant for scientists and physicians in low- and middle-income countries, who often believe that new developments in precision medicine may not be of use to the populations they serve.

The sample studied was representative of the target population: patients followed at the University of Campinas (UNICAMP) hospital, that is, the population from the geographic region delimited by our study. However, based on the limited public data available for the Brazilian population^25,26,27,28, it is very likely that multiple datasets from different geographic regions will be needed to generate data for the application of precision medicine in the different areas of Brazil. This observation is probably valid for many other regions, if not all of the Americas, given the remarkable differences observed between population histories. These differences are based on the various origins of founder populations, migration waves, and other population genetics phenomena. Therefore, we strongly believe in the value of presenting BIPMed data, which contribute to a discussion that is relevant to any country with diverse and admixed populations.

In the BIPMed WES dataset, we identified 768 variants classified as pathogenic or likely pathogenic, according to ClinVar. This result could have a significant impact on disease risk estimates for the Brazilian population. In addition, we observed that 47.5% of the variants present in the BIPMed dataset were not present in the 1000 Genomes database, including 6480 variants with AAF values that were higher than 1% in BIPMed. Indeed, these novel variants have the potential to reveal new insights regarding genetic variation and the effects of complex traits in admixed Brazilians. However, we are aware that validation by other techniques, such as Sanger sequencing, will be needed to confirm the presence of the identified variants and exclude the possibility that they are false positives generated by the WES technique. Validation is especially important for the nine variants that are absent from dbSNP but appeared at a high frequency in BIPMed (>90%).

Other potential causes for the divergence observed in allele frequencies reported here should also be considered, such as the technical differences between WES (BIPMed) and whole-genome sequencing (WGS) platforms (1000 Genomes Project). In this case, bias and variability may be introduced by the use of different sequencing equipment and different exome capture libraries, which cover different genomic regions.

Given that the BIPMed reference databases provide two different types of genomic information for 239 individuals, we could also compare whether the two datasets produced similar estimates of population structure. Our results showed that, based on the first two principal components (which captured the highest proportion of the variability observed), WES and SNP array datasets provided similar information regarding the genomic structure (Fig. 1). The concordance between the two datasets was important since results obtained with the SNP array could have had a European bias³³. However, since the data generated by WES covered all coding regions, and therefore was not at risk of this bias, the concordance of results produced independently using the different platforms validates our results. Previous studies also compared WES and SNP array datasets from individuals that were predominantly from the Middle East, North Africa, Western Europe, and five admixed American individuals from Brazil, Colombia, and Mexico. They demonstrated that WES could provide population structure adjustments that were similar to those produced using SNP array data³⁴. Interestingly, the Euclidean distances reflected only the structure observed in PC1, in which BIPMed data were closer to those of European and admixed American/Asian populations than to those of African populations. In this case, we suggest that the Euclidean distance estimates are less robust than eigenvector and eigenvalue estimations from the PCA, and thus provide limited information regarding genomic structure.

The value of describing the BIPMed datasets can be further highlighted, since together they provide a more complete genomic map of variants within admixed Brazilian individuals: BIPMed contains data rich in variant information within the coding regions from WES, and additional information from the noncoding genomic regions provided by SNP array genotyping. By assessing the similarity between the frequencies of all variants identified by WES and SNP arrays in BIPMed and 1000 Genomes datasets, we found that they differ significantly. This result indicates that none of the admixed American populations present in the 1000 Genomes dataset can be used as a surrogate for studies of the Brazilian population, since the 1000 Genomes datasets produced significantly different allele frequencies for both common and rare variants than the BIPMed datasets. Nevertheless, we acknowledge that the 1000 Genomes dataset was built from WGS, which includes all variants within the genome. Indeed, differences in NGS platforms may influence our results because we did not evaluate all variants available in the 1000 Genomes database.

Similar to BIPMed, other Brazilian initiatives have aimed to make genomic data more transparent and reproducible³⁵. However, BIPMed is the first to provide the public with easy access to raw data (https://bipmed.org/datasharing/). Additionally, the data-sharing process in BIPMed has been facilitated by the federated model of genomic databases proposed and provided by the Global Alliance for Genomics and Health³⁶.

However, we are aware of the limitations of the data currently available in BIPMed. First, although we analysed individuals born in all five geographic regions of Brazil (Table 2), BIPMed samples were predominantly from the Southeast region (49.44%), followed by the Northeast (5.59%), and the South (2.79%). Therefore, we can only provide reliable genomic estimates of population structure for three of the five Brazilian geographic regions. Second, the BIPMed dataset does not contain all genome variations, and likely missed rare variants located outside coding regions and structural variants. However, the latter can be assessed based on the SNP array data provided. Both limitations are currently being addressed by expanding the geographic reach of BIPMed samples and by including whole-genome data from the Brazilian reference individuals. We also encourage other Brazilian research groups to help improve the BIPMed database by depositing data generated from individuals from different geographic regions of Brazil (https://bipmed.org/docs/2_DepositDataBIPMed.docx).

Table 2 Distribution of birth location of BIPMed reference individuals within five Brazilian geographic regions.

To date, BIPMed includes eight public databases, which contain information from 884 Brazilian admixed individuals distributed among six disease-specific datasets, and the two reference datasets included in this report. Though the disease-specific datasets in BIPMed do not include WES or SNP array data, BIPMed has provided valuable information for the application of precision medicine within the Brazilian admixed population.

One additional challenge in the implementation of precision medicine is related to the integration and sharing of genomic and clinical data generated by different groups and interested parties^36,37. The worldwide community, including the research community, would benefit significantly from increased cooperation. It will enhance the expansion and improve the availability of datasets, facilitating the detection of smaller genetic effects in complex disorders. It is well understood that the ability to access increased quantities of shared genomic and clinical data improves our understanding of the mechanisms underlying the diseases that affect individuals worldwide, and these diseases may have population-specific features. Through networking, clinicians will have access to improved information for performing risk assessment, prevention, and the delivery of optimised treatment regimens. Thus, in addition to its local importance for the full implementation of precision medicine in Brazil, we expect that BIPMed will catalyse similar initiatives within other underrepresented populations worldwide.

In conclusion, we showed that by studying two BIPMed datasets that included information from reference admixed Brazilian individuals from a specific geographic area, we detected a diverse population background, even when compared with other admixed American populations. The population structure estimations provided by WES and SNP array data were concordant. By incorporating admixed Brazilian individuals in public genomic databases, BIPMed not only contributes important knowledge for the proper implementation of precision medicine in Brazil, but it also enhances information regarding the variability of the human genome and the relationship between genetic variation and predisposition to diseases.