Differentiation of Hispanic biogeographic ancestry with 80 ancestry informative markers

Ancestry informative single nucleotide polymorphisms (SNPs) can identify biogeographic ancestry (BGA); however, population substructure and relatively recent admixture can make differentiation difficult in heterogeneous Hispanic populations. Utilizing unrelated individuals from the Genomic Origins and Admixture in Latinos dataset (GOAL, n = 160), we designed an 80 SNP panel (Setser80) that accurately depicts BGA through STRUCTURE and PCA. We compared our Setser80 to the Seldin and Kidd panels via resampling simulations, which models data based on allele frequencies. We incorporated Admixed American 1000 Genomes populations (1000 G, n = 347), into a combined populations dataset to determine robustness. Using multinomial logistic regression (MLR), we compared the 3 panels on the combined dataset and found overall MLR classification accuracies: 93.2% Setser80, 87.9% Seldin panel, 71.4% Kidd panel. Naïve Bayesian classification had similar results on the combined dataset: 91.5% Setser80, 84.7% Seldin panel, 71.1% Kidd panel. Although Peru and Mexico were absent from panel design, we achieved high classification accuracy on the combined populations for Peru (MLR = 100%, naïve Bayes = 98%), and Mexico (MLR = 90%, naïve Bayes = 83.4%) as evidence of the portability of the Setser80. Our results indicate the Setser80 SNP panel can reliably classify BGA for individuals of presumed Hispanic origin.

It is important to study the genetics of Hispanic populations to avoid oversimplifying this heterogeneous ethnicity into a single conglomerate. The identification of specific biogeographic ancestries (BGA) has implications both in clinical 1 and forensic 2 genetics. Clinically, a more complete description of the various Hispanic BGAs may result in identification of rare variants that may not have been previously described when grouping all Hispanic populations together 3 , or for controlling for population substructure in clinical trials 4,5 . Hispanic individuals are known to have differential predispositions for various diseases and ignoring this diversity restricts the generalizability of the results 6 . In forensics, BGA data could be used to investigate the origin of unidentified human remains (UHR) 7 , or locate the rightful parents/guardians of a child who is unable to identify where she/he is from 8 . It is the heterogeneous nature of Hispanic populations that has previously deterred full characterization of their substructure. However, in the past decade, there has been a movement to explore global human diversity and a variety of genetic panels have been designed for this purpose.
Early ancestry informative marker (AIMs) panels are "continental" in nature, focused on admixture mapping to determine from which of the six inhabited continents an individual has ancestry; these include: Seldin128 9 , Galanter et al. 's 446 10 , Kidd55 11 , EUROFORGEN 12 , Genetic Atlas 13 , Genographic Project 14 , Cuba by Marcheco-Teruel et al. 15 , and Cuba by Fortes-Lima et al. 16 . Although these studies assessed continental ancestry proportions (e.g. Seldin128) 9 , highly differentiated populations may be detected within continental panels, even identifying admixed populations such as Gujarati Indians in Houston, TX and Mexican ancestry from Los Angeles, CA 17 . The ability to separate small admixed populations among larger more homogenous populations supports the notion that continental SNPs with high genetic differentiation may still be informative on a more specific country level. The simultaneous description of highly divergent populations alongside less specific populations using the same SNP panel is central to the goals of our study. However, dual level analysis of admixed populations within continental panels is rare, as it tends to decrease the panel's performance 2,17 .
We performed a principal components analysis (PCA) for the AIMs panels in the GOAL population ( Fig. 1b,1d,1f). In the PCA of the Setser80, HUR clearly separated across PC1 and PC2, DOM separated from HUR across PC2, and COL separated from HUR across PC1 and from DOM across PC1 and PC2, which occupies three separate quadrants of the PCA (Fig. 1d). Seldin128 9 PCA showed HUR and COL separated together but apart from the other populations across PC1, and CUB and DOM separated together along PC2 (Fig. 1f). The Kidd55 11 performed poorly in PCA (Fig. 1b), not forming recognizable clusters, consistent with the genetic proportions generated in STRUCTURE 26 (Fig. 1a)  Classification of unknowns. Based on the GOAL 25 and 1000 Genomes Project 28 (1000 G) allele frequencies, we modeled populations to determine classification accuracy using the Snipper 2.5 app suite 29 . Snipper uses naïve Bayesian likelihood ratios and multinomial logistic regression (MLR) for prediction of unknowns via −log(likelihood) 29 . Despite the different algorithms, both analyses had similar results.
As expected, the Setser80 had the highest overall accuracy across the three panels in the simulated GOAL dataset (98.4%) by naïve Bayesian classification implemented via leave-one-out cross-validation. Additionally, the Setser80 achieved 90% accuracy in the 1000 G dataset and 91.5% in the 7 Populations Combined dataset, both of which include populations not involved in our SNP ascertainment (Table 2). In the latter, the Setser80 panel (98%) and the Seldin panel (98.8%) achieved approximately equal accuracy in PEL, a population on which the Setser80 was not trained. In the 1000 G simulations, the Seldin panel was more accurate overall (92.4%) in comparison to the Setser80 (90%).
Utilizing the MLR algorithm, Setser80 had the highest accuracy in GOAL and 7 Populations Combined (99% and 93.2%, respectively); the Setser80 and Seldin panel had equal accuracy in 1000 G (93.8%); and the Kidd panel had 80.5% in GOAL, 71.4% in 7 Populations Combined, and 82.2% overall in 1000 G ( Table 4). As expected, HUR achieved >95% accuracy in the Setser80 and the Seldin panel across all datasets. Surprisingly, PEL also achieved >95% and MXL > 90% accuracies using the Setser80, although the Setser80 had not been trained on these populations.
Despite performing best overall, the Setser80 did misclassify COL 22.5% of the time in the 7 Populations Combined dataset (Supplemental Table S3). When it misclassified COL, the individual was classified as MXL 77.8% and PUR 22.2% of the time. Conversely, even though MXL classified correctly 90% of the time, when individuals were misclassified they were misclassified as COL 100% of the time. In comparison, the Seldin panel misclassified COL 17.5% of the time spread across four countries, primarily into PUR (10%). The Kidd panel exhibited a similar trend where COL misclassified into five countries: PUR (15%), MXL (10%), HUR (7.5%), CUB (7.5%), and DOM (2.5%) in addition to one individual which could not be classified. When MXL was misclassified using the Kidd panel, it misclassified into PEL (7.5%), HUR (5%), and COL (5%). Additionally, the Kidd panel had high misclassification of HUR into MXL (20%), COL (15%), and PUR (7.5%).

Discussion
We report a panel of 80 AIMs for Hispanic BGA classification using Weir & Cockerham's estimator 30 of Wright's F ST 31 . Honduras (HUR) and DOM emerged first in STRUCTURE 26 and PCA, followed by COL at K = 4, which separated from CUB & PUR, indicating three distinct populations (Table 1). Based on the allele frequencies, we created a series of micro-simulations to compare the BGA classification of the Setser, Seldin, and Kidd panels. Overall, the Setser80 outperformed the Seldin and Kidd panels in naïve Bayesian classification and MLR classification accuracies in the GOAL dataset (naïve Bayes = 98.4%, MLR = 99%) and the 7 Populations Combined (naïve Bayes = 91.5%, MLR = 93.2%). Notably, PEL and MXL were classified with >95% and >80% accuracy, respectively, indicating the Setser80 panel is portable into other Hispanic datasets and populations. www.nature.com/scientificreports www.nature.com/scientificreports/ Many panels have sought country-level ancestry determination, using a variety of SNP ascertainment methods 19,21,28,29 . Continentally, the EUROFORGEN Global AIMs 12 and the Kidd55 11 panel used allele frequency differentials (δ). Within a country, the United States HapMap 3 populations 20 used PCA with receiver operating characteristics curve (ROC) 19 , and the Mexican mestizos panel used nested subsets with high SNP weights followed by the lowest number of SNPs with the highest PC1 21 . Similar to Kidd et al. 11 , we prioritized SNPs that distinguished populations with lower mean F ST per country. However, we focused on differentiating Hispanic instead of continental populations. Kosoy et al. 9 (Seldin128) also concentrates on continental differentiation, but they also evaluated their AIMs on African American, Puerto Rican, and Mexican/Mexican American populations.
We used the Snipper 2.5 app suite 29 that provided two classification methods: a naive Bayesian classifier and MLR 32 . This web-based classifier was designed for classification of externally visible characteristics 33-37 and ancestry 12,38-41 , particularly in forensics. Snipper has successfully analyzed admixed South American populations 34,42,43 , similar to those used here.
The classification accuracy of the Seldin and Kidd panels is due to both the composition of their SNP ascertainment datasets and the size of the panels. The Seldin panel (96.2%, 96.3%) was more accurate in MXL than the Setser80 (83.4%, 89.8%) in the 7 Populations Combined and 1000 G datasets, respectively. Its success is likely because 199 of their 825 samples were from admixed Latin American and Amerindian individuals (Mexico and Puerto Rico especially) 9 . The Kidd panel emphasized capturing diversity by using 63 global populations 11 including seven isolated Amerindian populations; they continue to add more populations via ALFRED 44 . The size of the Kidd panel and the ratio of SNPs to the number of samples (Kidd55 = 55 SNPs / 3071 samples = 0.0179; Seldin128 = 128 SNPs / 825 samples = 0.1552) suggests the number of SNPs, rather than SNP ascertainment population size, is the higher contributing factor to population differentiation. However, the number of individuals per population may also be a factor.
Our study's limitations include: genechip design, sample size and its effect on allele frequencies. The GOAL 25 study genechip 45 was built on 270 African (YRI), Caucasian (CEU), and East Asian population (CHB and JPT) samples from HapMap 1 46 , without any Amerindian component. Although, our SNP ascertainment dataset was small it was not inconsistent with other studies 11,18,20 where the larger overall size was coupled with small sub-populations. Therefore, we combined the GOAL 25 dataset with the 1000 Genomes Admixed American dataset (n = 347) 28 , merging COL with CLM (n = 147) and PUR with PUR (n = 122) due to negligible allele frequency differences, to create the 7 Populations Combined.
The design of the Setser80 is based on the balance of the countries via country attributable mean F ST and selection of SNPs with LD < 0.7. Using a dilution series of 234 to 44 SNPs, we evaluated the effect of panel size on classification accuracy in relation to Seldin and Kidd sized panels and found 80 SNPs to be sufficient. Therefore we chose 80 SNPs from 247 candidates by selecting SNPs such that ~20% could be attributed to each country. It is possible that other panels informative of Hispanic ancestry could be selected from the same candidates, but testing multiple different panels was beyond the scope of this study. Residual LD is possible despite our threshold where four pairs of SNPs had r 2 > 0.5; however, removing one of each pair and classifying two separate 76 SNP subsets had negligible effect on classification accuracy via naïve Bayes (Supplemental Table S4) or MLR (Supplemental Table S5). By treating these loci as independent, we may underestimate accuracy as Kidd et al. 2013 has shown that diplotypes are effective predictors of ancestry 47 .
We used micro-simulations in this study in order to normalize the size of each population and expand the analysis to seven Hispanic populations instead of the four publicly available through the 1000 Genomes Project 28 . Although real genotypes would have been preferable, widely variable population sizes could disproportionately   www.nature.com/scientificreports www.nature.com/scientificreports/ affect the classification accuracy for smaller populations, as may have been the case with the real MXL genotypes. Our analysis of additional populations is a more realistic representation of the challenges of a more granular classification of heterogeneous populations. Forensic labs may not have access to a sizeable Hispanic database of individuals from multiple different countries; therefore, we simulated datasets based on readily available allele frequencies from multiple sources. By doing so, we have allowed MXL to misclassify into HUR which otherwise do not exist within the same dataset.
Additionally, our use of a static model for BGA determination may have overestimated classification success; despite reasonable success by other research groups 48 . Finally, our imputation of the Seldin128 9 and Kidd55 11 into the GOAL 25 dataset required removal of ~30 loci to comply with the Setser80 QC filters. Missingness was not detrimental here because STRUCTURE disregards it 49,50 , and at 10% MLR is robust 49 . Alternatively, some missingness in micro-simulations may approximate the degraded forensic samples 51 .
Our findings indicate that the Setser80 can predict BGA of individuals of presumed Hispanic origin with high confidence. By selecting additional SNPs attributed to countries with lower average country attributable F ST (COL, CUB, and PUR) to create the panel, the Setser80 had similar accuracy overall in GOAL 25 and 7 Populations Combined. The Setser80 is robust as it clusters well with Bayesian model-based clustering and PCA, and classifies equally well in naïve Bayes classification and MLR. The Setser80 is portable and, to our knowledge, is the first BGA AIMs panel specifically for the Caribbean and surrounding mainland countries. In comparison to Seldin128 9 , Kidd55 11 , and 46 Consensus SNPs 24 , our 80 AIMs for Hispanic BGA is unique, both exact and by linkage disequilibrium. Therefore, it is our intention that the Setser80 be integrated into a future Western Hemisphere panel.     We calculated the mean F ST for each of the five countries and assigned each SNP to a country based on the highest mean F ST . The next highest mean F ST was designated the 2 nd country mean F ST . For example, rs3777908 is attributed to HUR because the average of HUR vs. DOM, HUR vs. COL, HUR vs. CUB, and HUR vs. PUR is [(0.27318 + 0.19754 + 0.19560 + 0.28808)/4] = 0.23860, which was the highest country mean F ST value for rs3777908. The 2 nd highest country mean F ST = 0.07442, corresponded to PUR (see Supplemental Table S1 for example calculations).

SNP Panel
We binned the 1509 SNPs by the 1 st and 2 nd highest country attributable mean F ST and removed SNPs where the 1 st country mean F ST < 0.11 and 2 nd country mean F ST < 0.09, resulting in 437 SNPs. Since 63.3% of the 1509 candidate SNPs were attributable to HUR or DOM, we removed SNPs where HUR and DOM had the 1 st and 2 nd highest country mean F ST , where HUR had the 2 nd highest country mean F ST , and the 100 lowest ranked SNPs where HUR or DOM had the highest country mean F ST . From the remaining 247 SNPs, we chose a subset of 80 in order to maintain ~20% contribution of SNPs for each country across 1 st and 2 nd country attribution. Therefore, we proceeded with the Setser80 (Supplemental Table S2), which has the following country attributable mean To assess the value of our panel, we compared it to two commonly sited AIMs panels 9,11 . Here, we refer to the panel developed by Kosoy et al., 2009 as the Seldin128 9 , and the 55 ancestry informative SNPs developed by Kidd et al., 2014 as the Kidd55 11 . We performed each analysis on the Setser80 in parallel with the Kidd and Seldin panels to evaluate the utility of our Hispanic AIMs panel.  www.nature.com/scientificreports www.nature.com/scientificreports/ Imputation. The SNPs on the Affymetrix 6.0 gene chip 45 were pre-determined and not all SNPs were included in the ABI Taqman assay used to genotype the Seldin128 9 and Kidd55 11 ; therefore, we imputed these two panels into the GOAL dataset 25 using IMPUTE2 56 on the full 250 individuals using a 5 Mb window centered on each SNP and an effective population size of 20,000 as seen in Instructions for IMPUTE version 2 57 . We used 2,504 individuals from 1000G 28 for the genetic map and legend and the strand alignment from dbSNP batch query. Given the use of genome builds NCBI35/hg17 to GRCh38/hg38, we converted all components to GRCh37/hg19 for analysis.
However, the gene chip used 45 was based on an early genome build (NCBI35/hg17) which did not have all the tag SNPs necessary (in comparison to the 1000 G Project) to reliably impute ~30 of the SNPs from Seldin128 9 and 11 from Kidd55 11 for each individual. We assessed the accuracy of the imputation using the concordance tables provided by IMPUTE2; of the ~160 imputed SNPs from 20 chromosomes the mean concordance = 92.6% and range = 85.3% to 96.4%. Of the ~30 SNPs with missingness >10%, there was no obvious pattern between missingness proportion and concordance. Despite multiple attempts with different intervals, rs10954737 from the Seldin128 9 was unable to be imputed due to the lack of Panel 2 SNPs. Because STRUCTURE and PCA ignore missing data 49,50 , the full Seldin128 9 and Kidd55 11 were used in these analyses. However, since the resampling approach to simulations is dependent upon the reliability of allele frequencies in our real data 58 26 to compare the self-reported to computer-determined (K) populations. We performed STRUCTURE analysis at K = 2 to K = 7 for each dataset/panel at 10 iterations each using the admixture model, no LOCPRIOR, 10,000 burn-in, and 100,000 Markov Chain Monte Carlo (MCMC) repetitions. The final STRUCTURE diagrams for each SNP panel were optimized and averaged through STRUCTURE Harvester 59 , CLUMPP 60 , and Distruct 61 to create the diagrams in Fig. 1.

Principal components analysis (PCA).
We analyzed the Setser80, Seldin128 9 , and Kidd55 11 on the GOAL dataset by PCA using EIGENSOFT v.6.1.4 62 and plotted the first three eigenvectors. Genesis 63 was used for improved visualization of clustering as seen in Fig. 1.

Linkage disequilibrium (LD) analysis.
Using the web-based tool LDmatrix 64 , we compared the Setser80 to the Seldin128 9 and Kidd55 11 , and the 46 Consensus SNPs described in a review article by Soundararajan et al. 24 . We used r 2 > 0.7 as the threshold to evaluate whether any SNP in the Setser80 was in strong LD with SNP(s) from Seldin128 9 and Kidd55 11 (tested together) or the 46 Consensus SNPs appearing in more than 3 of 21 panels of AIMs 24 .
Modeling for the prediction of unknowns. To model the data for BGA prediction of unknown individuals, we used a resampling approach based on calculated allele frequencies of the three SNP panels on each dataset 58 . We simulated a randomly mating population of 100-125 individuals within each country. Next, we assigned a genotype to individuals by generating a random number between 1 and 0 and comparing this number to the maf for the country at the specified locus. Any random number above the maf was assigned the major allele. All genotypes were created from 2 separate allele generations for each locus. The simulation of each population was performed at least 5 times for the GOAL and 1000 G countries. The 7 Populations Combined dataset was created by merging the countries from the 1000 G and GOAL simulations without regard to simulation number. We verified our model using a chi-square test for each panel and found the allele frequencies from the simulation sets were not significantly different from the true allele frequencies at α = 0.05 after Bonferroni correction.
Classification of unknowns. Snipper 2.5 app suite 29 is a web-based Naïve Bayes classifier, found here (http://mathgene.usc.es/snipper/), which calculates −log(likelihood) with leave-one-out cross-validation and multinomial logistic regression (MLR) options. Cross-validation divides a set of data into a training set and a testing set, and rotates the samples until all samples have been in the testing set. Using the "Thorough analysis of population data with a custom Excel file" option, Snipper calculated likelihood ratios (LR) of population vs. not the population and selected the country that corresponded to the highest LR. MLR is similar to STRUCTURE 26,32 , which calculated genetic proportions of individuals (as percent admixture) instead of whole populations, and categorized individuals based on those probabilities. We used 100-125 micro-simulations (individuals) each from population for references and selected 10% of profiles from a separate set of micro-simulations to predict unknowns. We evaluated potential overlap of MLR classification using the confusion matrix and assessed the validity of our classification by sensitivity, specificity, and positive predictive value from the naïve Bayes classification of the actual 1000 G genotypes (n = 347; CLM = 94, PUR = 104, PEL = 85, and MXL = 64).