Abstract
Despite the widespread use of next-generation sequencing technologies, microarray-based genotyping combined with the imputation of untyped variants remains a cost-effective means to interrogate genetic variation across the human genome. This technology is widely used in genome-wide association studies (GWAS) at biobank scale and, more recently, in polygenic score (PGS) analysis to predict and stratify disease risk. Over the last decade, human genotyping arrays have grown tremendously in both number and content, making a comprehensive evaluation of their performance increasingly important. Here, we performed a comprehensive performance assessment of 23 available human genotyping arrays in 6 ancestry groups using diverse public and in-house datasets. The analyses focus on the performance of derived imputation (in terms of accuracy and coverage) and PGS (in terms of concordance with PGS estimated from whole-genome sequencing data) for three different traits and diseases. We found that arrays with a higher number of SNPs are not necessarily the ones with higher imputation performance; arrays that are well optimized for the targeted population can provide very good imputation performance. In addition, PGS estimated from imputed SNP array data is highly correlated with PGS estimated from whole-genome sequencing data in most cases. When optimal arrays are used, the correlations of PGS between the two types of data are higher than 0.97, but interestingly, arrays with high density can result in lower PGS performance. Our results underline the importance of selecting a suitable genotyping array for PGS applications. Finally, we developed a web tool that provides interactive analyses of tag SNP contents and imputation performance based on population and genomic regions of interest. This study can act as a practical guide for researchers designing genotyping array-based studies. 
The tool is available at: https://genome.vinbigdata.org/tools/saa/.
Introduction
Over the last decade, low-cost, robust genotyping platforms and large-scale genome variation projects such as the 1000 Genomes Project1 have facilitated genome-wide association studies (GWAS) on numerous human phenotypes, ranging from height to diseases2. To date, thousands of DNA loci that are significantly associated with complex traits and diseases have been discovered3. Among the numerous possible applications of GWAS results, disease risk prediction is rapidly gaining broad interest4,5,6. A polygenic score (PGS) or polygenic risk score (PRS) is an estimate of an individual’s genetic liability to a trait or disease, calculated from their genotype profile and relevant GWAS data7. In its most common form, a PGS is computed as the sum of risk allele counts (0, 1, or 2), each weighted by its effect size (i.e. log odds ratio or beta coefficient), over hundreds to thousands of associated SNPs. The outcome is a single score that aggregates each individual’s genetic loading proportional to the risk of a given disease or a quantitative trait6. Although the clinical utility of PGS has yet to be established, recent works have suggested that PGS may be used for disease risk stratification that potentially facilitates early disease detection, assists in diagnosis, or informs treatment choices4,5. For example, individuals with a PGS in the top 8, 3.5, and 1.5% for coronary artery disease, type 2 diabetes, and breast cancer, respectively, carry risks equivalent to that of a monogenic mutation conferring an odds ratio of three8.
Similar to GWAS analysis, PGS can be derived from various types of genotyping data such as those obtained by single-nucleotide polymorphism (SNP) microarrays or whole-genome sequencing (WGS). While WGS is attractive for its ability to interrogate variation across the entire human genome, SNP arrays are the dominant assays for obtaining genetic data for PGS calculation. They offer several advantages, such as cost-effectiveness and light computational requirements, which are preferable for population-scale screening, where PGS would be most useful9. Because the coverage of SNP arrays is typically limited to fewer than a million SNPs, a procedure involving haplotype phasing and genotype imputation of missing sites is usually employed to add genotyping information that can increase the power of these genetic studies7,10,11. Imputation performance is affected by three main factors: the choice of algorithm12, the imputation reference panel13,14, and the SNP array design15.
In principle, genotyping SNP arrays are designed by selecting a set of SNPs, commonly referred to as “tag SNPs”, which maximize coverage of ungenotyped DNA variants through associations between these alleles in the population (known as linkage disequilibrium, LD)16,17. Based on the target population, human genotyping SNP arrays can be classified into three categories: arrays optimized for the global population, for super-populations, or for specific targeted populations. In the early phase of development, genotyping SNP arrays focused on common genetic variation of the whole world population (minor allele frequency, MAF, of 0.10 or greater) based on the HapMap catalog18. The second generation of SNP arrays was designed to cover variants with MAF as low as 0.01 by providing SNP arrays specifically for European, East Asian, African American, and Latino race/ethnicity populations based on the 1000 Genomes Project (1KGP) catalog19,20. However, the fact that the majority of human genetic variants are rare and population-specific demands customized SNP arrays that improve over those designed for global or super-populations21,22. Indeed, population-specific genotyping arrays such as the UK Biobank Axiom Array2, the Axiom-NL Array23, the Japonica and Japonica NEO Arrays24,25, and the Axiom KoreanChip26 have been developed on top of the many existing commercial arrays. These arrays are not only optimized for genomic coverage based on their unique variant catalogs but also include a large number of functional variants. For example, the Axiom KoreanChip contains more than 200,000 nonsynonymous loci and the new Japonica NEO Arrays were designed with abundant disease risk variants25,26.
The development of customized arrays alongside the commercial arrays provided by genotyping platform producers has resulted in a large number of genotyping arrays. Each of these arrays has specific properties and contents, and thus there is an urgent demand for a systematic guideline to determine which array best suits specific research questions and populations. Although SNP array comparative studies exist, they are either not updated with the many recent arrays15,27 or limited to a small set of populations, and some studies focused on LD coverage27,28, which may not be relevant to current imputation practice for use in association studies and PGS analysis7,11. Moreover, although PGS is gaining increasing attention, a practical evaluation of the PGS performance of current genotyping arrays is still lacking. Here, we provide a comprehensive evaluation of imputation-based genomic coverage15,29 and PGS performance of 23 human genotyping arrays in diverse populations. These analyses are intended to be a practical guide for researchers in selecting the most suitable genotyping array for their genetic studies.
Materials and methods
Genotyping arrays
In this study, we benchmarked 23 different human genotyping arrays, including 14 arrays from Illumina and 9 arrays from Affymetrix. The number of tag SNPs on the examined arrays (array size) ranges from approximately 300,000 (Infinium HumanCytoSNP-12 v2.1) to more than 4,300,000 (Infinium Omni5 v1.2). They can be classified as older arrays such as the Genome-Wide Human SNP Array 6.0; population-specific optimized arrays such as the Axiom UK Biobank Array and Axiom Japonica Array NEO; arrays optimized for multiple populations such as the Infinium Multi-Ethnic Global v1.0 and Infinium Global Diversity Array v1.0; and arrays optimized for cytogenetics and cancer applications such as the Infinium CytoSNP-850K v1.2. Recently developed arrays include the Infinium Global Screening Array v3.0, Axiom Precision Medicine Research Array, and Axiom Precision Medicine Diversity Array. Manifests of the 23 examined arrays were obtained from the respective manufacturers’ websites. Genomic positions were further harmonized to UCSC hg38 reference genome coordinates with CrossMap v0.2.6 for arrays requiring liftover30. Details and component statistics of these arrays are shown in Table 1.
Genomic datasets and pipelines
An overview of our evaluation pipeline is presented in Fig. 1. In brief, the phased genomic data of 22 autosomal chromosomes in Variant Call Format (VCF) of 2,504 and 1,008 unrelated individuals from the 1000 Genomes Project samples that were re-sequenced by New York Genome Center (1KGP)31 and the 1000 Vietnamese Genomes Project (1KVG)32, respectively, were used to estimate imputation-based coverage and PGS performance of 23 different genotyping arrays by a tenfold cross-validation approach. In the 1KGP dataset, 26 populations were grouped into 5 super-populations according to their continental groups: East Asian (EAS), European (EUR), South Asian (SAS), African (AFR), and American (AMR). For consistent naming throughout the text, each of these continental groups is hereafter referred to as a population. This dataset was randomly divided into 10 batches equally distributed across populations (4 batches with 251 samples and 6 batches with 250 samples). Similarly, the Vietnamese population (VNP) was processed separately with 8 batches of 101 and 2 batches of 100 samples. In each turn, one batch was used as the test set and the remaining samples as the reference set. For each array, variants in the test set at the same positions as variants on the array were extracted with vcftools v0.1.1733 and phasing information was removed to generate the pseudo SNP array genotype data, while variants in the reference data were used as the pre-phasing and imputation reference panel. Pre-phasing and imputation were performed with SHAPEIT v4.1.334 and Minimac4 v1.0.212, respectively. Finally, the imputed genotype data of the 10 batches were combined to estimate imputation and PGS performance by population, including 504, 503, 489, 661, 347, and 1,008 individuals in EAS, EUR, SAS, AFR, AMR, and VNP, respectively. This approach is similar to the strategy used previously to estimate imputation-based genomic coverage15,29,35.
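The batch assignment described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `stratified_batches`, the sample identifiers, and the round-robin assignment rule are hypothetical and are not taken from the study's pipeline; only the population sizes and the number of batches come from the text.

```python
import random
from collections import defaultdict

def stratified_batches(sample_to_pop, n_batches=10, seed=0):
    """Assign samples to batches so that each population is spread evenly.

    Hypothetical helper: samples are shuffled within each population and
    then dealt round-robin, with one running counter shared across
    populations so that overall batch sizes differ by at most one.
    """
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for sample, pop in sample_to_pop.items():
        by_pop[pop].append(sample)

    batches = [[] for _ in range(n_batches)]
    k = 0  # running position in the round-robin deal
    for pop in sorted(by_pop):
        members = by_pop[pop]
        rng.shuffle(members)
        for sample in members:
            batches[k % n_batches].append(sample)
            k += 1
    return batches

# Toy cohort using the 1KGP population sizes quoted in the text
# (sample names are made up).
pop_sizes = {"EAS": 504, "EUR": 503, "SAS": 489, "AFR": 661, "AMR": 347}
samples = {f"{pop}_{i}": pop for pop, n in pop_sizes.items() for i in range(n)}
batches = stratified_batches(samples)
batch_sizes = sorted(len(b) for b in batches)

# Tenfold cross-validation: each batch is the test set once,
# and all remaining samples form the reference panel.
folds = []
for i, test_set in enumerate(batches):
    reference = [s for j, b in enumerate(batches) if j != i for s in b]
    folds.append((test_set, reference))
```

With 2,504 samples and 10 batches, this rule yields batch sizes consistent with the 4 batches of 251 and 6 batches of 250 reported above, although the actual randomization used in the study may differ.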
Imputation performance evaluation
Both GWAS and PGS often require genotype imputation, which involves the prediction of untyped variants in the genome. While GWAS benefits from boosting the number of imputed SNPs that can be tested for association11, computation of PGS is conducted by summing the product of risk allele count (0, 1, or 2) and its effect size derived from the GWAS. Thus, imputation performance is expected to play a key role in PGS derivation. Here, we focus on the imputation \(r^2\) metric, although several other criteria can be used to assess imputation performance, such as allele concordance15, imputation quality28, and LD coverage36. We chose imputation \(r^2\) as the evaluation metric for the following reasons. First, it is more relevant to the context of GWAS and PGS analysis because the imputation \(r^2\) at a given variant is proportional to its \(\chi ^2\) statistic that results from an association test37,38,39,40. This leads to the interpretation that an increase in mean imputation \(r^2\) at genome-wide scale directly corresponds to an increase in statistical power37,40. Second, it is less sensitive to allele frequency than concordance15. Third, it incorporates imputation uncertainty by using expected allele dosage rather than the most likely genotype15. Finally, imputation \(r^2\) can be computed on a site-by-site basis, which enables a more detailed evaluation than at the allele frequency level40. In this evaluation setting, we treated genotypes derived from WGS datasets as the gold standard. Imputation performance is measured as imputation \(r^2\), that is, the SNP-wise squared Pearson’s correlation between the imputed dosages and the WGS genotypes, and imputation coverage is defined as the proportion of SNPs with imputation \(r^2\) passing the cut-off of 0.8. These metrics were stratified into three minor allele frequency (MAF) bins: (0–0.01], (0.01–0.05], and (0.05–0.5]. 
To reduce the data noise, multiallelic sites were not considered, and variants with allele count less than 2 were excluded in the bin of (0–0.01]. Of note, the MAF bin of (0.01–0.5], which is the most common cutoff for GWAS and PGS analysis, was also considered in the analysis7,41.
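As a minimal sketch of these two metrics, the following Python code computes SNP-wise imputation \(r^2\) and coverage per MAF bin. The function names and the toy matrices are hypothetical. Two details are assumptions of this sketch rather than statements from the study: "passing" the cut-off is implemented as \(r^2 \ge 0.8\), and SNPs whose \(r^2\) is undefined (no variance in the dosages or genotypes) are excluded from the summaries.

```python
import numpy as np

def imputation_r2(dosage, truth):
    """SNP-wise squared Pearson correlation between imputed dosages and
    gold-standard genotypes. Both inputs have shape (n_snps, n_samples)."""
    dosage = np.asarray(dosage, dtype=float)
    truth = np.asarray(truth, dtype=float)
    d = dosage - dosage.mean(axis=1, keepdims=True)
    t = truth - truth.mean(axis=1, keepdims=True)
    cov = (d * t).sum(axis=1)
    denom = (d ** 2).sum(axis=1) * (t ** 2).sum(axis=1)
    r2 = np.full(dosage.shape[0], np.nan)
    ok = denom > 0  # r2 is undefined when either vector is constant
    r2[ok] = cov[ok] ** 2 / denom[ok]
    return r2

def minor_allele_frequency(truth):
    """MAF per SNP from 0/1/2 genotypes of diploid samples."""
    af = np.asarray(truth, dtype=float).mean(axis=1) / 2.0
    return np.minimum(af, 1.0 - af)

# MAF bins used in the text, as half-open intervals (low, high].
MAF_BINS = [(0.0, 0.01), (0.01, 0.05), (0.05, 0.5), (0.01, 0.5)]

def summarize(r2, maf, cutoff=0.8):
    """Return {bin: (mean r2, coverage)}; coverage is the fraction of SNPs
    with r2 >= cutoff (assumed reading of 'passing the cut-off')."""
    out = {}
    for lo, hi in MAF_BINS:
        mask = (maf > lo) & (maf <= hi) & ~np.isnan(r2)
        if mask.any():
            out[(lo, hi)] = (float(r2[mask].mean()),
                             float((r2[mask] >= cutoff).mean()))
    return out

# Toy data: two SNPs, four samples (values are made up).
truth = np.array([[0, 1, 2, 1],
                  [0, 0, 1, 1]])
dosage = np.array([[0.0, 1.0, 2.0, 1.0],   # imputed perfectly
                   [1.0, 1.0, 1.0, 1.0]])  # constant dosage, r2 undefined
r2 = imputation_r2(dosage, truth)
maf = minor_allele_frequency(truth)
summary = summarize(r2, maf)
```

Using expected dosages rather than hard-called genotypes keeps the imputation uncertainty in the metric, which is the third reason given above for preferring \(r^2\).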
PGS performance assessment
Instead of using pre-tuned PGS models as in other studies9,40, PGS was computed in this study with a standard P+T (Pruning and Thresholding) approach implemented in PRSice-242. The main reason for using this approach is that we tried to mimic the real-life practice of PGS analysis, which involves running a PGS computational method with multiple parameters and selecting the best one7. Another reason is that using pre-built PGS models may introduce a potential bias in favor of specific arrays if they were used in tuning these established PGS models, i.e., we tried to avoid training on the same array twice. Using summary statistics for three phenotypes, namely height, body mass index (BMI), and type 2 diabetes (T2D), obtained from previous GWAS meta-analyses43,44, a PGS for an individual i was calculated as:
\[PGS_i(P_T) = \sum _{j=1}^{M} x_{ij}\hat{\beta _j}\]
where \(P_T\) is the p-value threshold (5e−08, 1e−07, 1e−06, 1e−05, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.5, and 1); M is the number of SNPs remaining after clumping with “--clump-kb 250kb” and “--clump-r2 0.1” and thresholding at \(P_T\); and \(x_{ij}\) and \(\hat{\beta _j}\) are the allele count and the marginal effect size derived from GWAS summary statistics of \(SNP_j\).
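The weighted sum described here can be sketched in Python as follows. This is an illustration only, not the PRSice-2 implementation: it assumes the SNPs passed in have already survived clumping, it uses a plain sum with no normalization that the tool may apply, and the function name `pgs_by_threshold` and the toy values are hypothetical.

```python
import numpy as np

# p-value thresholds listed in the text
P_THRESHOLDS = [5e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2,
                0.1, 0.2, 0.3, 0.5, 1.0]

def pgs_by_threshold(allele_counts, betas, pvalues, thresholds=P_THRESHOLDS):
    """Compute one score vector per p-value threshold.

    allele_counts: (n_individuals, n_snps) effect-allele counts or dosages
                   for SNPs assumed to have survived clumping.
    betas, pvalues: per-SNP effect sizes and GWAS p-values.
    Returns {threshold: array of n_individuals scores}.
    """
    x = np.asarray(allele_counts, dtype=float)
    beta = np.asarray(betas, dtype=float)
    p = np.asarray(pvalues, dtype=float)
    scores = {}
    for pt in thresholds:
        keep = p <= pt                      # thresholding step
        scores[pt] = x[:, keep] @ beta[keep]  # sum_j x_ij * beta_j
    return scores

# Toy example: 2 individuals, 3 clumped SNPs (values are made up).
x = [[0, 1, 2],
     [2, 0, 1]]
beta = [0.5, -0.2, 0.1]
p = [1e-9, 1e-4, 0.4]
scores = pgs_by_threshold(x, beta, p)
```

Looser thresholds admit more SNPs into the sum, which is why the evaluation repeats every comparison across the full list of thresholds instead of at a single tuned value.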
Similar to imputation performance evaluation, we treated PGS derived from WGS as the “gold standard”. PGS derived from 23 different SNP arrays were evaluated using Pearson’s correlation to PGS derived from WGS data under the same PRSice-2 parameter settings. In addition, absolute differences in PGS percentile ranking generated by array-imputed and the WGS data were also evaluated.
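A minimal Python sketch of the two comparison metrics is given below. The percentile definition (rank within the evaluated cohort, scaled to 0–100, with ties broken by input order) is an assumption of this sketch, since the text does not specify how percentile ranks were computed; the function names and toy scores are hypothetical.

```python
import numpy as np

def pgs_correlation(pgs_array, pgs_wgs):
    """Pearson's correlation between array-derived and WGS-derived PGS."""
    return float(np.corrcoef(pgs_array, pgs_wgs)[0, 1])

def percentile_rank(scores):
    """Percentile (0-100] of each individual within the cohort.
    Assumed definition: 100 * rank / n, ties broken by input order."""
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort(kind="stable").argsort() + 1  # 1-based ranks
    return 100.0 * ranks / len(scores)

def adpr(pgs_array, pgs_wgs):
    """Absolute difference in percentile ranking per individual."""
    return np.abs(percentile_rank(pgs_array) - percentile_rank(pgs_wgs))

# Toy scores for four individuals (values are made up).
pgs_wgs = [0.1, 0.5, 0.3, 0.9]
pgs_arr = [0.2, 0.4, 0.6, 0.8]
corr = pgs_correlation(pgs_arr, pgs_wgs)
diffs = adpr(pgs_arr, pgs_wgs)
mean_adpr = float(diffs.mean())
```

The two metrics are complementary: the correlation summarizes overall agreement, while the percentile difference shows how far an individual could move in a risk ranking, which matters when scores are used for stratification.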
Ethics approval and consent to participate
This study did not generate new datasets. Ethics approval and consent to participate were applied according to the corresponding original works. In the 1KVG study, subjects provided informed consent and the study was approved by the Vinmec International Hospital Institutional Review Board with number 543/2019/QD-VMEC in accordance with the relevant guidelines and regulations (e.g. Helsinki Declaration). In the 1KGP-NYGC study, genetic data are publicly available according to the original ethics approval.
Results
Imputation performance
Overall, we found two main factors affecting imputation accuracy and imputation coverage: array size and population optimization. The two densest arrays, the Infinium Omni2.5 v1.5 and Infinium Omni5 v1.2 with approximately 2.4 and 4.3 million tag SNPs, yielded the highest imputation performance. In contrast, the two sparsest SNP arrays, the Infinium HumanCytoSNP-12 v2.1 and Infinium Core-24 v1.2 with approximately 300,000 tag SNPs, obtained the poorest imputation performance in all six examined populations. At the MAF bin of (0.01–0.5], the Infinium Omni5 v1.2 yielded mean imputation accuracy \(r^2\) of 0.9032, 0.9144, 0.8644, 0.9176, 0.8873, 0.9499 and imputation coverage of 0.8721, 0.8813, 0.8019, 0.8885, 0.8344, 0.9207, while the Infinium HumanCytoSNP-12 v2.1 obtained 0.6682, 0.7708, 0.7112, 0.7608, 0.7218, 0.8635 for mean imputation accuracy \(r^2\) and 0.4031, 0.6265, 0.5879, 0.6297, 0.5731, 0.7655 for imputation coverage in the six populations AFR, AMR, EAS, EUR, SAS, and VNP, respectively. Details are reported in Fig. 2 and Tables 2, 3.
Regarding population optimization, imputation performance is generally better for arrays optimized specifically for the targeted populations. For example, the Axiom UK Biobank Array, which was optimized for the British population, outperformed most other arrays in the EUR population (except for the ultra-high-density arrays Infinium Omni2.5 v1.5 and Infinium Omni5 v1.2). In detail, at the MAF bin of (0.01–0.5], the Axiom UK Biobank Array, with a size of 844k SNPs, obtained a mean imputation coverage of 0.8389. This was higher than that of globally optimized, higher-density arrays such as the Axiom Precision Medicine Research Array (919k), Axiom Precision Medicine Diversity Array (922k), Genome-Wide Human SNP Array 6.0 (932k), Infinium Multi-Ethnic Global v1.0 (1784k), and Infinium Global Diversity Array v1.0 (1905k), with imputation coverage of 0.7814, 0.8078, 0.7513, 0.8228, and 0.8277, respectively, and lower than the 0.8409 and 0.8885 obtained by the Infinium Omni2.5 v1.5 and Infinium Omni5 v1.2 arrays with 2373k and 4327k SNPs. Similarly, the Axiom Japonica Array NEO (671k), which was designed for the Japanese population, also performed well against globally optimized, higher-density arrays. This array yielded mean imputation accuracy of 0.831 and 0.9333 and imputation coverage of 0.7642 and 0.9024 in the EAS and VNP populations, respectively. These performances were higher than those of multi-ethnic SNP arrays, even ones with higher density, including the Axiom Precision Medicine Research Array (919k), Axiom Precision Medicine Diversity Array (922k), and Genome-Wide Human SNP Array 6.0 (932k), as shown in Fig. 2 and Tables 2, 3. Notably, the Infinium OmniZhongHua v1.4 (a Chinese-optimized array) also outperformed other arrays in the EAS and VNP populations. 
Regarding the AFR population, the Axiom Genome-Wide PanAFR array, which is optimized for this population and contains 2265k SNPs, performed nearly equivalently to the Infinium Omni5 v1.2 array with 4327k SNPs (0.9002 versus 0.9032 for mean imputation accuracy, and 0.8700 versus 0.8721 in terms of imputation coverage). There were no SNP arrays with superior performance in the two remaining populations (AMR and SAS), although the Axiom UK Biobank Array and the Axiom Genome-Wide ASI obtained slightly better performance than other arrays of the same size when applied to the AMR and SAS populations. In this case, we focused on the MAF bin of (0.01–0.5], as this is the most common allele frequency cutoff in both GWAS and PGS analysis7,45. However, the results also generalized to the other bins, as shown in Fig. S.1 and Tables S.1–6.
PGS performance
We evaluated the PGS performance of these arrays based on two criteria: (i) Pearson’s correlation between PGS estimated from imputed SNP array data and PGS estimated from WGS data (hereafter referred to as PGS correlation for short), and (ii) the absolute difference in percentile ranking (ADPR) between PGS generated from array-imputed data and gold-standard WGS data. Both comparisons were made under various p-value cutoffs, which allows an unbiased evaluation of the PGS performance of these arrays. In general, we found that PGS performance was highly concordant with imputation performance, i.e. SNP arrays with better imputation performance showed higher PGS correlation and lower ADPR than arrays with poor imputation performance.
The summary results of Pearson’s correlation values of PGS from 23 genotyping SNP arrays for three different phenotypes are shown in Fig. 3 and in Tables S.7–9. In general, all examined arrays yielded high PGS correlations. Notably, the vast majority of PGS correlations ranged from 0.90 to 0.99, except for the two lowest-density arrays (Infinium HumanCytoSNP-12 v2.1 and Infinium Core-24 v1.2), which had the lowest values. Interestingly, when optimal arrays for the populations were used (the Axiom UK Biobank Array for the EUR population; the Axiom Japonica Array NEO and Infinium OmniZhongHua v1.4 for the EAS and VNP populations), the PGS correlations were higher than 0.97. The PGS correlation patterns were also highly concordant across all three evaluated traits, with comparable performance. As expected, SNP arrays with larger sizes showed higher PGS correlations. The lowest performer was the Infinium HumanCytoSNP-12 v2.1, with a correlation of 0.8731 for the height phenotype in the AFR population, while the highest performance was obtained by the Infinium Omni5 v1.2, with PGS correlation higher than 0.99 in all examined populations and traits. We also examined the deviation of PGS correlation across the various p-value settings. The results showed that SNP arrays with lower PGS correlation had a higher standard deviation of PGS correlation than the high-performance arrays. A possible explanation for this observation is that PGS estimated from arrays with low imputation performance are more vulnerable to the random pruning process than those from arrays with high imputation performance42. Notably, we also observed higher standard deviations of PGS correlation in EAS than in other populations.
In agreement with imputation performance, SNP arrays optimized specifically for targeted populations showed superior PGS correlation in the targeted or closely related populations. For instance, the Axiom Japonica Array NEO and Infinium OmniZhongHua v1.4, which were optimized for Japanese and Chinese populations, showed clear advantages in the EAS and VNP populations, while the Axiom UK Biobank Array yielded higher PGS correlation in the EUR population than the other size-comparable genotyping arrays. Taking height as a typical trait of interest, PGS correlations of the Japonica Array NEO were 0.9760 and 0.9847, while those of the Infinium OmniZhongHua v1.4 were 0.9879 and 0.9914, in EAS and VNP, respectively. Interestingly, we observed that the Infinium CytoSNP-850K v1.2 was the array with superior PGS correlations in all populations, for all three evaluated traits. For example, the PGS correlations of this array for the height phenotype in AFR, AMR, EAS, EUR, SAS, and VNP were 0.9679, 0.9876, 0.9789, 0.9908, 0.9844, and 0.988, respectively.
Regarding the ADPR metric, the performance of the arrays agreed with the trend observed for PGS correlation, i.e. ADPRs were also affected by array size and population optimization. ADPR measurements under different PRSice-2 p-value settings are shown in Figs. 4, S.2–12 and reported in Tables S.10–21. Most of the arrays yielded mean ADPR less than 10 in all three traits. The exceptions were low-density arrays in the AFR population. The highest-density array, i.e. Infinium Omni5 v1.2, had the highest performance, with ADPR less than 4. Notably, ADPR varied by population. Under-represented populations like AFR and EAS tended to exhibit higher ADPRs than the others. Taking the p-value cutoff of 5e−8 for the height phenotype as an example (Fig. 4), Infinium Omni5 v1.2 obtained ADPR means of 3.8600, 2.4774, 2.8884, 1.9758, 2.8391, and 2.3699 in AFR, AMR, EAS, EUR, SAS, and VNP, respectively. A consistent trend was also observed in the other traits, with the lowest performance in AFR and the highest performance in EUR, with ADPR means of 3.5974 and 1.8489 for BMI and of 3.7206 and 1.6592 for type 2 diabetes. Similar to the other experiments, population-specific arrays and the Infinium CytoSNP-850K v1.2 also showed their advantages on the ADPR metric. The Axiom UK Biobank Array obtained good performance for the EUR population, with ADPR means of 3.0584, 3.1714, and 2.2734 for height, BMI, and type 2 diabetes, respectively. This trend was also observed for the Axiom Japonica Array NEO and Infinium OmniZhongHua v1.4 applied to the EAS and VNP populations. Regarding the Infinium CytoSNP-850K v1.2 array, good performance in all traits and populations was observed. 
Specifically, ADPR means were 5.7141, 3.4914, 4.3753, 3.2501, 3.7638, 3.0267 for height; 4.9872, 2.5463, 4.1560, 2.6272, 3.5409, 3.1523 for BMI; and 5.2000, 2.5762, 3.7687, 2.6066, 2.4707, 2.3812 for type 2 diabetes in AFR, AMR, EAS, EUR, SAS, and VNP, respectively, all at the same p-value cutoff.
Comparative analysis of real SNP array genotyping data and simulated genotyping data
We further took advantage of the real genotyping data available in the 1KVG dataset, in which 24 of the 1008 samples were also genotyped with the Axiom Precision Medicine Research Array and the Infinium Global Screening Array v3.0, to investigate how our simulated array data performed relative to real array data. In brief, we generated pseudo genotyping data (termed simulated data) for the 24 samples by extracting variants from the WGS data that matched the Axiom Precision Medicine Research Array and the Infinium Global Screening Array v3.0 manifests and then removing phasing information. Regarding real genotyping data, processed VCF files (individual call rate filtering at 97% and Hardy-Weinberg test filtering at 1e−6) of the 24 samples were obtained from https://genome.vinbigdata.org/ with no further filtering or quality control applied. We then applied the same pipeline to compare the imputation performance of the simulated genotyping data against the results obtained from the real genotyping data. In detail, both simulated and real genotyping data were phased with SHAPEIT v4.1.334, and imputed with Minimac4 v1.0.212. Reference data for both phasing and imputation were the remaining 984 WGS samples. Finally, the imputation performance of both simulated and real arrays was estimated as described in the “Imputation performance evaluation” section. As expected, the imputation accuracies of simulated and real data were highly concordant for both examined arrays, as shown in Fig. 5 and Table 4. For example, the mean and standard deviation of imputation accuracy for the simulated Axiom Precision Medicine Research Array were 0.8144 ± 0.0359, 0.8971 ± 0.0273, 0.9459 ± 0.016, 0.9542 ± 0.014, and for the real data were 0.8173 ± 0.0379, 0.9013 ± 0.0285, 0.9492 ± 0.0158, 0.9573 ± 0.0135, in the four MAF bins of (0–0.01], (0.01–0.05], (0.01–0.5], and (0.05–0.5], respectively. 
Furthermore, relative performances between the Axiom Precision Medicine Research Array and the Infinium Global Screening Array v3.0 were equivalent in simulated and real data. These results indicated the robustness of our simulation approach in imputation performance evaluation of genotyping arrays in reality.
Discussions and conclusions
Even in a booming time of next-generation sequencing technologies, current large genotyping projects still use SNP arrays as the workhorse for generating valuable data, especially for biobank-scale projects2,25,26. Moreover, genotyping by SNP arrays produces exactly the information typically required for PGS analysis, which is based on summarizing effect sizes from individual SNPs and is a promising application of genomic research that is gaining increasing interest across the healthcare system and in direct-to-consumer genomic services based on polygenic scoring, such as 23andMe5,46. SNP arrays are clearly economical in data generation and analysis, an important factor in designing projects with large sample sizes and/or limited budgets. Although many human genotyping arrays optimized for various purposes are available, a comprehensive guideline for choosing the most suitable SNP arrays in multiple ancestry groups is still lacking. To address this gap, we have introduced a systematic approach to assess a large range of SNP arrays across multiple datasets. We performed imputation and PGS performance assessments for 23 available human genotyping arrays in six ancestry groups using both public and in-house datasets and various metrics. By comparing the performance of SNP arrays relative to WGS with 4 metrics, namely imputation accuracy, imputation coverage, PGS correlation, and ADPR, we obtained insights that can be used to suggest suitable arrays for genotyping-based studies on a specific population, especially under-represented populations.
Overall, we found that all 23 assessed arrays had high performance in both imputation and PGS. These commercial arrays differ markedly in design, i.e. in the number of markers on the arrays and the targeted ancestry groups, which causes performance deviations. An important finding of our analysis was that obtaining high imputation performance is not necessarily a matter of choosing a higher-density array: small to moderately sized arrays (approximately 650k–850k tag SNPs) that are well optimized for the targeted population could also produce high imputation and PGS performance. For example, the Japonica Array NEO and the UK Biobank Array showed the highest performance when compared with other arrays of the same size for the EAS and EUR populations, respectively. This indicates that using customized, small-size SNP arrays at the population-specific level can be a cost-effective genotyping solution without losing PGS performance22,47. We also observed that no specific moderately sized arrays had superior imputation performance in AFR and SAS, suggesting the need for genotyping arrays optimized for these populations. PGS performance was concordant with imputation performance in general. However, CytoSNP-850K v1.2 was an interesting array that showed superior PGS performance in all populations. This superior performance may be explained by the enrichment of cytogenetic regions in the design of the Infinium CytoSNP-850K v1.2 array48. The analyses also showed that underrepresented populations such as AFR and SAS exhibited lower PGS performance (and ADPRs tended to be higher in AFR and SAS) than other well-studied populations, even though sample sizes were not significantly different in these populations. A possible explanation for this lower performance is the use of meta-analysis GWAS summary statistics in the current study. 
The strong bias in GWAS participants toward populations of European descent could be a reason for lower PGS performance in other populations, as described previously43,44,49,50. In addition, the PGS performance of small-sized arrays was significantly lower in AFR, possibly due to the higher number of genetic variants in this population1.
Notably, PGS constructed from imputed genotypes were highly concordant with the original WGS PGS. The majority of PGS correlations ranged from 0.90 to 0.99. When optimal arrays for the targeted populations were used (the UK Biobank Array for the EUR population, the Japonica Array NEO for the EAS and VNP populations), the PGS correlation to WGS was higher than 0.97. In addition, PGS ranking differences between WGS and imputed array genotypes were small, with the majority of differences under 5 percentiles when optimal arrays were used. A possible reason for this observation is that current GWAS summary statistics were mostly generated from imputed array genotypes43,44, which have limited ability to detect rare associated markers. This indicates that using WGS for PGS analysis does not provide a significant improvement in terms of disease risk stratification at this time, although this may change in the future when GWAS summary statistics at higher resolution become widely available51.
Finally, to make this analysis capability available to a broad audience, we have developed a web tool that provides interactive analyses of SNP array contents and performance. As researchers may be interested in specific variants or regions, the tool aims to help researchers analyze SNP array contents and imputation performance based on population and genomic regions of interest. We hope this tool will help researchers design their SNP array-based studies.
Data availability
The 1KGP-NYGC datasets are freely available at the IGSR data portal (https://www.internationalgenome.org). The 1KVG WGS and genotyping datasets are available under agreement at the MASH data portal (https://genome.vinbigdata.org/). Data and source code to generate the figures of this study are available at: https://github.com/datngu/SNP_array_comparison. The SNP array analysis tool is available online at: https://genome.vinbigdata.org/tools/saa/. SNP-wise imputation performance estimates based on 1KGP-NYGC data are freely available at: https://zenodo.org/record/6548098. SNP-wise imputation performance estimates based on 1KVG data are available and can be supplied under an ethical policy agreement.
References
Consortium, G. P. et al. A global reference for human genetic variation. Nature 526, 68 (2015).
Bycroft, C. et al. The UK biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 19, 581–590 (2018).
Lewis, C. M. & Vassos, E. Polygenic risk scores: From research tools to clinical instruments. Genome Med. 12, 1–11 (2020).
Lambert, S. A., Abraham, G. & Inouye, M. Towards clinical utility of polygenic risk scores. Hum. Mol. Genet. 28, R133–R142 (2019).
Choi, S. W., Mak, T.S.-H. & O’Reilly, P. F. Tutorial: A guide to performing polygenic risk score analyses. Nat. Protocols 15, 2759–2772 (2020).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Chen, S.-F. et al. Genotype imputation and variability in polygenic risk score estimation. Genome Med. 12, 1–13 (2020).
Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).
Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Huang, J. et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat. Commun. 6, 1–9 (2015).
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Nelson, S. C. et al. Imputation-based genomic coverage assessments of current human genotyping arrays. G3 Genes Genomes Genet. 3, 1795–1807 (2013).
Gibbs, R. A. et al. The International Hapmap Project (2003).
Carlson, C. S. et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet. 74, 106–120 (2004).
Consortium, I. H. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851 (2007).
Hoffmann, T. J. et al. Next generation genome-wide association tool: Design and coverage of a high-throughput European-optimized SNP array. Genomics 98, 79–89 (2011).
Hoffmann, T. J. et al. Design and coverage of high throughput genotyping arrays optimized for individuals of east Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98, 422–430 (2011).
Consortium, G. P. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484 (2019).
Ehli, E. A. et al. A method to customize population-specific arrays for genome-wide association testing. Eur. J. Hum. Genet. 25, 267–270 (2017).
Kawai, Y. et al. Japonica array: Improved genotype imputation by designing a population-specific SNP array with 1070 Japanese individuals. J. Hum. Genet. 60, 581–587 (2015).
Sakurai-Yageta, M. et al. Japonica array neo with increased genome-wide coverage and abundant disease risk SNPs. bioRxiv (2020).
Moon, S. et al. The Korea biobank array: Design and identification of coding variants associated with blood biochemical traits. Sci. Rep. 9, 1–11 (2019).
Ha, N.-T., Freytag, S. & Bickeboeller, H. Coverage and efficiency in current SNP chips. Eur. J. Hum. Genet. 22, 1124–1130 (2014).
Verlouw, J. A. et al. A comparison of genotyping arrays. Eur. J. Hum. Genet. 29, 1611–1624 (2021).
Lindquist, K. J., Jorgenson, E., Hoffmann, T. J. & Witte, J. S. The impact of improved microarray coverage and larger sample sizes on future genome-wide association studies. Genet. Epidemiol. 37, 383–392 (2013).
Zhao, H. et al. CrossMap: A versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
Byrska-Bishop, M. et al. High coverage whole genome sequencing of the expanded 1000 genomes project cohort including 602 trios. bioRxiv (2021).
Tran, H. et al. Deep whole-genome sequencing in Vietnam. In-preparation (2022).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Delaneau, O., Zagury, J.-F., Robinson, M. R., Marchini, J. L. & Dermitzakis, E. T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 1–10 (2019).
Nguyen, D. T., Dinh, H. Q., Vu, G. M., Nguyen, D. T. & Vo, N. S. A comprehensive imputation-based evaluation of tag SNP selection strategies. In 2021 13th International Conference on Knowledge and Systems Engineering (KSE), 1–6 (IEEE, 2021).
Barrett, J. C. & Cardon, L. R. Evaluating coverage of genome-wide association studies. Nat. Genet. 38, 659–662 (2006).
Pritchard, J. K. & Przeworski, M. Linkage disequilibrium in humans: Models and data. Am. J. Hum. Genet. 69, 1–14 (2001).
Chapman, J. M., Cooper, J. D., Todd, J. A. & Clayton, D. G. Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Hum. Hered. 56, 18–31 (2003).
Marchini, J. Haplotype estimation and genotype imputation. In Handbook of Statistical Genomics: Two Volume Set 87–114 (2019).
Li, J. H., Mazur, C. A., Berisa, T. & Pickrell, J. K. Low-pass sequencing increases the power of GWAS and decreases measurement error of polygenic risk scores compared to genotyping arrays. Genome Res. 31, 529–537 (2021).
Marees, A. T. et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int. J. Methods Psychiatr. Res. 27, e1608 (2018).
Choi, S. W. & O’Reilly, P. F. PRSice-2: Polygenic risk score software for biobank-scale data. Gigascience 8, giz082 (2019).
Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body mass index in 700000 individuals of European ancestry. Hum. Mol. Genet. 27, 3641–3649 (2018).
Xue, A. et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat. Commun. 9, 1–14 (2018).
Visscher, P. M. et al. 10 years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).
Folkersen, L. et al. Impute.me: An open-source, non-profit tool for using data from direct-to-consumer genetic testing to calculate and interpret polygenic risk scores. Front. Genet. 11, 578 (2020).
Nguyen, D. T., Hoang Nguyen, Q., Thuy Duong, N. & Vo, N. S. LmTag: Functional-enrichment and imputation-aware tag SNP selection for population-specific genotyping arrays. Brief. Bioinform. 23(4), bbac252 (2022).
Illumina. Infinium cytosnp 850k genotyping array. https://www.illumina.com/products/by-type/clinical-research-products/infinium-cytosnp-850k.htm.
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019).
Wainschtein, P. et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat. Genet. 54, 263–273 (2022).
Acknowledgements
We especially thank Nguyen T. Nguyen for his kind help in downloading the 1KGP-NYGC datasets and Hoang H. Ho for help with deploying the web tool. We also thank the Vingroup Big Data Institute for providing computational resources.
Funding
This work is funded by Vingroup Big Data Institute internal funding and partly supported by the Vingroup Innovation Foundation under grant VINIF.DA.2020.02.
Author information
Contributions
D.T.N. initiated the study, designed experiments, analyzed data, interpreted results, developed the web tool, and drafted the manuscript. T.T.H.T., M.H.T., and N.T.D. contributed to the 1KVG data generation and preprocessing. K.T., D.P., Q.N., and N.S.V. contributed to the discussion, design and interpretation. N.S.V. and Q.N. revised the manuscript, coordinated the project, and supervised the study. All authors have read and approved the final manuscript.
Ethics declarations
Competing interests
T.T.H.T., M.H.T., N.T.D., and N.S.V. are current employees of GeneStory, Vietnam, a company that develops and markets products for genetic testing. The other authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nguyen, D.T., Tran, T.T.H., Tran, M.H. et al. A comprehensive evaluation of polygenic score and genotype imputation performances of human SNP arrays in diverse populations. Sci Rep 12, 17556 (2022). https://doi.org/10.1038/s41598-022-22215-y
DOI: https://doi.org/10.1038/s41598-022-22215-y