A Simple Algorithm for Population Classification

Hu, Peng; Hsieh, Ming-Hua; Lei, Ming-Jie; Cui, Bin; Chiu, Sung-Kay; Tzeng, Chi-Meng

doi:10.1038/srep23491

Article
Open access
Published: 31 March 2016

Peng Hu^1,2,
Ming-Hua Hsieh³,
Ming-Jie Lei^1,2,
Bin Cui⁴,
Sung-Kay Chiu⁵ &
…
Chi-Meng Tzeng^1,2

Scientific Reports volume 6, Article number: 23491 (2016)


Abstract

A single-nucleotide polymorphism (SNP) is a variation in the DNA sequence that occurs when a single nucleotide in the genome differs across members of the same species. Variations in the DNA sequences of humans are associated with human diseases, which makes SNPs a key to opening the door to personalized medicine. SNPs can also be used for human identification and forensic applications. Compared to short tandem repeat (STR) loci, SNPs have much lower statistical power for individual recognition because there are only 3 possible genotypes for each SNP marker, but they may provide sufficient information to identify the population to which a given sample belongs. In this report, using eight SNP markers for 641 samples, we performed a standard statistical classification procedure and found that 86% of the samples could be classified accurately under a two-population model. This study suggests the potential use of SNPs in population classification with a small number (n ≤ 8) of genetic markers for forensic screening, biodiversity and disaster victim control.

Introduction

Because single nucleotide polymorphisms (SNPs) are genetic variations that distinguish members of the same species, they can be used to identify the source population of an individual. In recent years there have been several publications on the application of SNP technology in forensics, human identification and population classification^1,2,3,4. For example, Nina Zhou et al.¹ used a ranking measure, i.e., a modified t-test or F-statistics, combined with a support vector machine (SVM) classifier, and found that an average of 64 SNPs gave 82.46 ± 11.41% classification accuracy in a 3-population classification. In another study, Kohnemann et al.⁴ showed a potential forensic application of mitochondrial DNA (mtDNA) SNP analysis, in which 32 SNPs were detected in a multiplex polymerase chain reaction (PCR) assay and a multiplex SNaPshot analysis. In the cases analyzed, STR analysis and sequencing of the mtDNA hyper-variable region I (HVR I) failed, and the mtDNA SNP analysis was the only way to obtain satisfactory results, even in a case with mixed stains. Yet all these analyses are time-consuming and very expensive: to ensure the accuracy of the classification procedure and to obtain a feature subset of SNPs that is as small and as informative as possible, the number of SNPs must be at least 30–70 in all these analyses^5.

Nevertheless, when applied to an appropriate data set, data mining and machine learning techniques can be more effective at feature selection and therefore provide excellent classification accuracy. Sushmita Mitra et al.⁵ gave a positive summary of various machine learning techniques, a.k.a. soft computing, in bioinformatics. For example, a gene is a long DNA sequence, so each gene is much more powerful than a SNP for classification. In light of this difference between a gene and a SNP, Lipo Wang et al. did excellent work^6,7,8, providing effective methods of gene selection and finding that just a few genes can give very accurate cancer classification. In particular, they used an SVM and a Fuzzy Neural Network (FNN) to find the minimum gene subset after a gene importance ranking step⁶, and found that 2 genes are sufficient to produce high classification accuracy. Another work, based on spectral biclustering, found that just 2 genes provide 99.92% classification accuracy for lymphoma and that just one gene produces 98.7% classification accuracy for liver cancer⁷. In the other work, they provided an effective method for gene selection in high-dimensional data, such as microarray gene expression data⁸. Basically, they applied a dimension reduction technique first and then a voting scheme using binary SVMs. They found that t-test-based gene selection was the most effective of the methods compared. For the SRBCT and lymphoma data sets, they needed just 6 and 5 genes, respectively, to give a 100% accurate classification.

In this report, we adopted a simple extension of the standard likelihood ratio test and performed a standard statistical classification procedure to minimize the number of SNPs and enhance the classification accuracy. The procedure could be applied precisely and promptly either in forensics or in disaster control.

Problem Statement and Classification Procedure

Suppose there are two human populations, A and B, both in Hardy-Weinberg equilibrium. Populations A and B have different frequencies of SNP genotypes, and there is no linkage disequilibrium between SNP sites. The genotype of a given SNP (i) follows a trinomial distribution with parameters p_i, q_i, and r_i (p_i + q_i + r_i = 1) in population A, and the genotype of the same SNP (i) follows a trinomial distribution with parameters u_i, v_i, and w_i (u_i + v_i + w_i = 1) in population B. In particular, if the possible genotypes of SNP (i) are {CC, CT, TT}, then a randomly selected individual from population A has the following properties:

1. Prob (genotype of SNP (i) = CC) = p_i
2. Prob (genotype of SNP (i) = CT) = q_i
3. Prob (genotype of SNP (i) = TT) = r_i

If the individual is from population B, then the following is true:

4. Prob (genotype of SNP (i) = CC) = u_i
5. Prob (genotype of SNP (i) = CT) = v_i
6. Prob (genotype of SNP (i) = TT) = w_i

When the genotypes of SNP(i) consist of the data set {AA, AG, GG}, the probability statements are the same as above except CC, CT, and TT are replaced by AA, AG, and GG, respectively.

Given an individual sample S consisting of n SNP markers, the problem is to determine whether the individual comes from population A or B. The likelihood functions of S must be derived.

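7. L(A) = likelihood function (S is from population A) = \prod_{i=1}^{n} p_i^{I(g_i = CC)} q_i^{I(g_i = CT)} r_i^{I(g_i = TT)}

8. L(B) = likelihood function (S is from population B) = \prod_{i=1}^{n} u_i^{I(g_i = CC)} v_i^{I(g_i = CT)} w_i^{I(g_i = TT)}

where g_i is the genotype of SNP (i) observed in S and I(.) is the indicator function.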

A simple classification procedure can then be defined based on likelihood functions L (A) and L (B):

If L (A)/L (B) > 1, the individual is classified as coming from population A. Otherwise, the individual is classified as coming from population B.

The model we propose is in a general setting. A population can be any specific group of people, and population parameters can be estimated from representative samples. For example, in this paper, Population A is the group of people living in a specific geographical region and Population B is the general population. Parameters of Population A and B are estimated using representative samples from an epidemic society of Shanghai and NCBI, respectively. Such an application setting is common for life insurers. For example, to design and price a medical insurance contract for a 65-year-old male, an actuary needs to analyze two populations: Population A represents the 65-year-old males specific to the life insurer, owing to the company's screening process for policy holders, and Population B represents all 65-year-old males. Representative samples of Population A and B then come from the internal database of the insurer and from public organizations such as the National Association of Insurance Commissioners (NAIC) or the Society of Actuaries (SOA), respectively. The population parameters estimated from representative samples may entail some estimation error. However, as we enter the era of “Big Data”, the representative samples are converging to the true populations.
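The decision rule L (A)/L (B) > 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotypes are encoded as indices 0, 1, 2 (for example CC, CT, TT), the function names are hypothetical, and the frequencies in the usage note are invented rather than taken from the tables of this paper. Logarithms are compared instead of the raw ratio to avoid numerical underflow when many SNPs are used.

```python
import math

def log_likelihood(sample, freqs):
    """Log of the product of genotype probabilities over independent SNPs.

    sample: list of genotype indices (0, 1 or 2), one per SNP.
    freqs:  list of (f0, f1, f2) genotype frequencies, one triple per SNP.
    """
    total = 0.0
    for genotype_index, snp_freqs in zip(sample, freqs):
        prob = snp_freqs[genotype_index]
        if prob <= 0.0:
            return -math.inf  # genotype never observed in this population
        total += math.log(prob)
    return total

def classify(sample, freqs_a, freqs_b):
    """Return 'A' if L(A)/L(B) > 1, otherwise 'B'."""
    if log_likelihood(sample, freqs_a) > log_likelihood(sample, freqs_b):
        return "A"
    return "B"

# Invented genotype frequencies for two SNPs (illustration only).
FREQS_A = [(0.5, 0.4, 0.1), (0.6, 0.3, 0.1)]
FREQS_B = [(0.1, 0.4, 0.5), (0.2, 0.5, 0.3)]
```

With these invented frequencies, a sample carrying genotype index 0 at both SNPs has L (A) = 0.30 and L (B) = 0.02, so `classify([0, 0], FREQS_A, FREQS_B)` assigns it to population A, while `classify([2, 2], FREQS_A, FREQS_B)` assigns the sample to population B.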

The algorithm is a simple extension of the standard likelihood ratio test based on the Neyman-Pearson lemma⁹. Its statistical efficiency depends on the difference between the parameters of the two populations and on the number of SNPs used to calculate the likelihood functions. We investigate the empirical efficiency of this algorithm based on 8 selected SNPs in Section 4. The standard likelihood ratio test assumes that the selected individual comes from population A or B with equal probability. This assumption is suitable for general statistical hypothesis tests. However, if the population sizes are known and the individual is a random sample from these two populations, it is more appropriate to take the population sizes into consideration. In that case the algorithm can be presented as follows. Suppose there are two populations: population A with size N (A), and population B with size N (B). There is a sample with genotype (G). Let L(X) be the likelihood function defined above, and let E (A) = N (A) × L (A). Then E (A) is the expected number of samples (with genotype G) found in population A, and E (B) is defined in the same way for population B. It is then appropriate to use the ratio E (A)/E (B) to determine the origin of this sample.
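A minimal sketch of the size-weighted variant follows. The text states only that the ratio E (A)/E (B) is used; the threshold of 1 below is an assumption made by analogy with the L (A)/L (B) > 1 rule, and the function names, population sizes and frequencies are hypothetical.

```python
import math

def expected_count(pop_size, sample, freqs):
    """E(X) = N(X) * L(X): expected number of individuals in a population
    of size pop_size that carry the genotype combination in `sample`."""
    likelihood = math.prod(snp[g] for g, snp in zip(sample, freqs))
    return pop_size * likelihood

def classify_weighted(sample, freqs_a, size_a, freqs_b, size_b):
    """Return 'A' if E(A)/E(B) > 1 (assumed threshold), otherwise 'B'."""
    e_a = expected_count(size_a, sample, freqs_a)
    e_b = expected_count(size_b, sample, freqs_b)
    return "A" if e_a > e_b else "B"

# Invented genotype frequencies for two SNPs (illustration only).
FREQS_A = [(0.5, 0.4, 0.1), (0.6, 0.3, 0.1)]
FREQS_B = [(0.1, 0.4, 0.5), (0.2, 0.5, 0.3)]
```

For a sample with genotype index 0 at both SNPs, L (A) = 0.30 and L (B) = 0.02, so the unweighted rule favours A. With the invented sizes N (A) = 1,000 and N (B) = 100,000, E (A) is about 300 and E (B) is about 2,000, and the weighted rule favours B instead. This shows how a much larger population B can outweigh a likelihood ratio that favours A.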

Eight-Marker SNP Sample from a Chinese Population

Based on information from the Human Genome and International HapMap Projects¹⁰, eight SNP markers with a high minor allele frequency (0.249 < MAF < 0.355) were randomly selected, one from each of eight different chromosomes: rs2243191 (1q32), rs2856838 (2q14), rs583911 (3q25), rs2227306 (4q13), rs20541 (5q31), rs8193036 (6p12), rs4739139 (8q12), and rs741344 (12q15). A sample of 641 individuals was collected from an epidemic society of Shanghai. After filtering by minor allele frequency (MAF), the MAF of the eight SNPs lies between 0.249 and 0.355, which basically meets the experimental setting P ≈ 0.333. Therefore, we could take advantage of a higher equilibrium factor to group the independent classifications. SNPs can be picked at random, provided that they occur at high frequency in the species.
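The MAF filter described above can be illustrated with a short sketch. It assumes that the genotype frequencies of a biallelic SNP are already known; the function names and the example frequencies are hypothetical, and only the bounds 0.249 and 0.355 come from the text.

```python
def minor_allele_frequency(f_hom1, f_het, f_hom2):
    """MAF of a biallelic SNP from its three genotype frequencies.

    Each heterozygote carries one copy of each allele, so the frequency of
    the first allele is f_hom1 + f_het / 2.
    """
    total = f_hom1 + f_het + f_hom2
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype frequencies must sum to 1")
    allele1 = f_hom1 + f_het / 2
    return min(allele1, 1 - allele1)

def passes_maf_filter(maf, low=0.249, high=0.355):
    """True if the MAF lies in the open interval used in this study."""
    return low < maf < high
```

For example, invented genotype frequencies of (0.49, 0.42, 0.09) give an allele frequency of 0.70 for the first allele and therefore a MAF of about 0.30, which passes the filter.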

All gDNA extracted with the MagCore HF-16 (RBC Bioscience) was subjected to quality control using a 260/280 ratio threshold and validated at a final concentration of 10 ng/µl. SNPstream (Beckman Coulter) was used for SNP genotyping in this study. Primers were designed and generated using Autoprimer (http://www.autoprimer.com/). PCR amplification, amplicon purification, DNA hybridization, and data analysis were performed on an SNPstream automated analyzer.

Haploview provided a summary table for the SNP sample¹¹. The characteristics of the eight SNP markers, which are located on different chromosomes, are summarized in Table 1. The P-values from the Hardy-Weinberg equilibrium analysis of the markers were all greater than 0.05, which indicates that these alleles are in equilibrium¹².
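As a rough illustration of what a Hardy-Weinberg check involves, the sketch below uses a chi-square goodness-of-fit test with one degree of freedom. This is a swapped-in simplification and not the procedure used in the study, which relied on Haploview and cites an exact test¹². The function name and the genotype counts in the usage note are hypothetical.

```python
import math

def hwe_chi_square(n_hom1, n_het, n_hom2):
    """Chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium.

    Takes observed genotype counts and returns (chi2, p_value).
    """
    n = n_hom1 + n_het + n_hom2
    p = (2 * n_hom1 + n_het) / (2 * n)  # frequency of the first allele
    q = 1 - p
    if p <= 0.0 or q <= 0.0:
        raise ValueError("SNP is monomorphic; the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom1, n_het, n_hom2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Counts of 25, 50 and 25 match the Hardy-Weinberg expectation exactly, so the statistic is 0 and the P-value is 1. Counts of 50, 0 and 50 (no heterozygotes at all) give a statistic of 100 and a P-value far below 0.05.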

Table 1 Haploview summary.

The classification procedure is based on the genotype frequencies of the SNPs (0.249 < MAF < 0.355), which are summarized in Table 2. This population played the role of population A in the classification procedure.

Table 2 Frequencies of genotypes of SNP markers in population A.

Frequencies of SNP markers from the NCBI SNP database

To apply the classification procedure defined in Section 1, we obtained the genotype frequencies of the SNPs from the NCBI SNP database at http://www.ncbi.nlm.nih.gov/SNP/index.html. The collected information is summarized in Table 3. The last row of Table 3 shows the number of individuals used to compute the genotype frequencies of each SNP marker. We treated these genotype frequencies as coming from a “general” population. This population played the role of population B in the classification procedure.

Table 3 Frequencies of genotypes of SNP markers in population B.

In our hypothetical situation, in which the suspect “S” comes from population A, it is natural to set the general Chinese population as population B. If there are two suspects, it would be natural to set populations A and B as their respective origins.

Accuracy of the classification procedure

We tested the classification procedure using the frequency data described in Sections 2 and 3. To determine the impact of the number of SNP markers on the accuracy of the classification procedure, we set the number of SNP markers to n = 2, 4, 6, and 8. The accuracy of the classification depends on the multinomial parameters of the two populations and on factors such as population size, the difference between the two populations, and the selection and number of SNPs. It is common to use the power of the test to represent the accuracy of classification. If the parameters of the two populations were entirely different, the likelihood ratio in the algorithm would tend to infinity quickly, and in such a situation the power of the test is high. In terms of classification accuracy, increasing the number of SNPs will only increase the power (accuracy). The results are shown in Table 4. Accuracy clearly increased as the number of SNP markers increased. When n ≥ 8, for example 10–12, the accuracy saturates at around 88–90% (data not shown) without increasing significantly. The results indicate that a sample with a small number of SNP markers can be useful in identifying the population from which a given individual may have come.
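The dependence of accuracy on the number of markers can be explored with a small simulation. The sketch below is not the evaluation performed in this study. It assumes invented allele frequencies (0.30 in population A and 0.45 in population B at every SNP), Hardy-Weinberg genotype frequencies, independent SNPs, and equal numbers of simulated individuals from each population. All names are hypothetical, and the simulated accuracies are not the values reported in Table 4.

```python
import math
import random

def hwe_genotype_freqs(allele_freq):
    """Genotype frequencies (p^2, 2pq, q^2) under Hardy-Weinberg equilibrium."""
    p = allele_freq
    return (p * p, 2 * p * (1 - p), (1 - p) ** 2)

def draw_sample(freqs, rng):
    """Draw one genotype index (0, 1 or 2) per SNP from the given frequencies."""
    return [rng.choices((0, 1, 2), weights=f)[0] for f in freqs]

def log_likelihood(sample, freqs):
    return sum(math.log(f[g]) for g, f in zip(sample, freqs))

def simulated_accuracy(n_snps, trials, rng):
    """Fraction of simulated individuals assigned to their true population."""
    freqs_a = [hwe_genotype_freqs(0.30)] * n_snps  # invented parameter
    freqs_b = [hwe_genotype_freqs(0.45)] * n_snps  # invented parameter
    correct = 0
    for _ in range(trials):
        for truth, freqs in (("A", freqs_a), ("B", freqs_b)):
            sample = draw_sample(freqs, rng)
            is_a = log_likelihood(sample, freqs_a) > log_likelihood(sample, freqs_b)
            correct += ("A" if is_a else "B") == truth
    return correct / (2 * trials)

rng = random.Random(0)  # fixed seed so that repeated runs agree
results = {n: simulated_accuracy(n, 2000, rng) for n in (2, 4, 6, 8)}
for n, acc in results.items():
    print(f"n = {n}: simulated accuracy = {acc:.3f}")
```

Under these assumptions the simulated accuracy rises with n, because each additional independent SNP adds information to the likelihood ratio. How quickly it rises depends entirely on how far apart the two sets of frequencies are.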

Table 4 Frequencies of genotypes of SNP markers.

Discussion

With the Human Genome Project and haplotype-depth research programs, SNP genotyping has been applied in disease diagnosis, population genetics, pharmacogenomics, and many other fields. The development of restriction fragment length polymorphism (RFLP) and short tandem repeat (STR) microsatellite markers has led to widespread use of SNPs in many types of applications.

The most successful application of SNP detection is in the field of forensic genetics, where it is used to evaluate rare, degraded, and even nearly fossilized nucleic acid evidence. It has also been used in the identification of human beings, animals, and goods and in the study of race, migration, evolution, lineage, and intellectual property issues. FFPE samples from clinics and universities could be used in SNP analysis to decipher genetic markers relevant to risk assessment, prognosis, and therapeutic diagnosis^13,14,15. However, SNP analysis is restricted by the fact that there are only three possible polymorphisms per residue. To cover the population of the world, about 50–70 SNPs are needed to reach the identification power of 13 STRs¹⁶.

The purpose of this study is to take advantage of the efficiency and simplicity of SNP detection to process data from large populations and to reduce the number of SNP targets. STR analysis is restricted by the assigned primers and restriction enzymes, whereas SNP pools can be easily adjusted based on the frequency and identity of the mutation in question. Here we propose a statistical algorithm for empirical SNP detection that increases the power of classification and narrows down the factors for criminal screening.

Additional Information

How to cite this article: Hu, P. et al. A Simple Algorithm for Population Classification. Sci. Rep. 6, 23491; doi: 10.1038/srep23491 (2016).

References

1. Zhou, N. & Wang, L. Effective selection of informative SNPs and classification on the HapMap genotype data. BMC Bioinformatics. 8, 484 (2007).
2. Yamaguchi-Kabata, Y. et al. Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: effects on population-based association studies. Am J Hum Genet. 83(4), 445–456 (2008).
3. Paneto, G. G. et al. A single multiplex PCR and SNaPshot minisequencing reaction of 42 SNPs to classify admixture populations into mitochondrial DNA haplogroups. Mitochondrion. 11(2), 296–302 (2011).
4. Kohnemann, S. & Pfeiffer, H. Application of mtDNA SNP analysis in forensic casework. Forensic Sci Int Genet. 5(3), 216–221 (2011).
5. Mitra, S. & Hayashi, Y. Bioinformatics with soft computing. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on. 36(5), 616–635 (2006).
6. Wang, L., Chu, F. & Xie, W. Accurate cancer classification using expressions of very few genes. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 4(1), 40–53 (2007).
7. Liu, B., Wan, C. & Wang, L. An efficient semi-unsupervised gene selection method via spectral biclustering. NanoBioscience, IEEE Transactions on. 5(2), 110–114 (2006).
8. Chu, F. & Wang, L. Applications of support vector machines to cancer classification with microarray data. International journal of neural systems. 15(6), 475–484 (2005).
9. Rice, J. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury 2007.
10. International HapMap Consortium. A haplotype map of the human genome. Nature. 437(7063), 1229–320 (2005).
11. Barrett, J. C. et al. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 21(2), 263–265 (2005).
12. Wigginton, J. E., Cutler, D. J. & Abecasis, G. R. A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet. 76(5), 887–893 (2005).
13. Fondevila, M. et al. A multiplex assay with 52 single nucleotide polymorphisms for human identification. Electrophoresis. 27(9), 1713–1724 (2006).
14. Sanchez, J. J. et al. Forensic typing of autosomal SNPs with a 29 SNP-multiplex: results of a collaborative EDNAP exercise. Forensic Sci Int Genet. 2(3), 176–183 (2008).
15. Børsting, C. et al. Performance of the SNPforID 52 SNP-plex assay in paternity testing. Forensic Sci Int Genet. 2(4), 292–300 (2008).
16. Dixon, L. A. et al. Analysis of artificially degraded DNA using STRs and SNPs: results of a collaborative European (EDNAP) exercise. Forensic Sci Int. 164(1), 33–44 (2006).

Author information

Authors and Affiliations

Translational Medicine Research Center (TMRC), School of Pharmaceutical Science, Xiamen University, Xiamen, P.R. China
Peng Hu, Ming-Jie Lei & Chi-Meng Tzeng
Key Laboratory for Cancer T-Cell Theranostics and Clinical Translation (CTCTCT), Xiamen, P.R. China
Peng Hu, Ming-Jie Lei & Chi-Meng Tzeng
Department of Risk Management and Insurance, National Chengchi University, Taiwan
Ming-Hua Hsieh
Department of Endocrine and Metabolic Diseases, Rui-jin Hospital, Shanghai Jiao-tong University School of Medicine, Shanghai, 200025, China
Bin Cui
Department of Biology and Chemistry, City University of Hong-Kong, Hong Kong
Sung-Kay Chiu

Contributions

C.-M.T. and S.-K.C. formulated the original idea and participated in the design of the study and its coordination. C.-M.T. was the main author of the manuscript draft. B.C. provided the samples and introduced the application to forensics. M.-H.H. contributed to and improved the statistical algorithm. P.H. and M.-J.L. carried out the gDNA extraction and the SNP analysis.

Corresponding authors

Correspondence to Sung-Kay Chiu or Chi-Meng Tzeng.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Received: 20 October 2015
Accepted: 07 March 2016
Published: 31 March 2016
DOI: https://doi.org/10.1038/srep23491
