The Greater Middle East (GME) has been a central hub of human migration and population admixture. The tradition of consanguinity, variably practiced in the Persian Gulf region, North Africa, and Central Asia (refs. 1–3), has resulted in an elevated burden of recessive disease (ref. 4). Here we generated a whole-exome GME variome from 1,111 unrelated subjects. We detected substantial diversity and admixture in continental and subregional populations, corresponding to several ancient founder populations with little evidence of bottlenecks. Measured consanguinity rates were an order of magnitude above those in other sampled populations, and the GME population exhibited an increased burden of runs of homozygosity (ROHs) but showed no evidence for reduced burden of deleterious variation due to classically theorized 'genetic purging'. Applying this database to unsolved recessive conditions in the GME population reduced the number of potential disease-causing variants by four- to sevenfold. These results show variegated genetic architecture in GME populations and support future human genetic discoveries in Mendelian and population genetics.
Anwar, W.A., Khyatti, M. & Hemminki, K. Consanguinity and genetic diseases in North Africa and immigrants to Europe. Eur. J. Public Health 24 (Suppl. 1), 57–63 (2014).
Al-Gazali, L., Hamamy, H. & Al-Arrayad, S. Genetic disorders in the Arab world. Br. Med. J. 333, 831–834 (2006).
Hussain, R. & Bittles, A.H. The prevalence and demographic characteristics of consanguineous marriages in Pakistan. J. Biosoc. Sci. 30, 261–275 (1998).
Sheffield, V.C., Stone, E.M. & Carmi, R. Use of isolated inbred human populations for identification of disease genes. Trends Genet. 14, 391–396 (1998).
Sharp, J.M. The Broader Middle East and North Africa Initiative: an overview. in CRS Report for Congress Congressional Research Service. The Library of Congress. US Government. Vol. RS22053 (2005).
Hellenthal, G. et al. A genetic atlas of human admixture history. Science 343, 747–751 (2014).
Ravindranath, V. et al. Regional research priorities in brain and nervous system disorders. Nature 527, S198–S206 (2015).
Hunter-Zinck, H. et al. Population genetic structure of the people of Qatar. Am. J. Hum. Genet. 87, 17–25 (2010).
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
Moreno-Estrada, A. et al. Reconstructing the population genetic history of the Caribbean. PLoS Genet. 9, e1003925 (2013).
Botigué, L.R. et al. Gene flow from North Africa contributes to differential human genetic diversity in southern Europe. Proc. Natl. Acad. Sci. USA 110, 11791–11796 (2013).
Li, J.Z. et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100–1104 (2008).
Henn, B.M. et al. Genomic ancestry of North Africans supports back-to-Africa migrations. PLoS Genet. 8, e1002397 (2012).
Gérard, N., Berriche, S., Aouizérate, A., Diéterlen, F. & Lucotte, G. North African Berber and Arab influences in the western Mediterranean revealed by Y-chromosome DNA haplotypes. Hum. Biol. 78, 307–316 (2006).
Green, R.E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010).
Sankararaman, S. et al. The genomic landscape of Neanderthal ancestry in present-day humans. Nature 507, 354–357 (2014).
SIGMA Type 2 Diabetes Consortium. Sequence variants in SLC16A11 are a common risk factor for type 2 diabetes in Mexico. Nature 506, 97–101 (2014).
Pickrell, J.K. & Pritchard, J.K. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8, e1002967 (2012).
Tadmouri, G.O. et al. Consanguinity and reproductive health among Arabs. Reprod. Health 6, 17 (2009).
Leutenegger, A.L., Sahbatou, M., Gazal, S., Cann, H. & Génin, E. Consanguinity around the world: what do the genomic data of the HGDP-CEPH diversity panel tell us? Eur. J. Hum. Genet. 19, 583–587 (2011).
Pippucci, T., Magi, A., Gialluisi, A. & Romeo, G. Detection of runs of homozygosity from whole exome sequencing data: state of the art and perspectives for clinical, population and epidemiological studies. Hum. Hered. 77, 63–72 (2014).
Pemberton, T.J. et al. Genomic patterns of homozygosity in worldwide human populations. Am. J. Hum. Genet. 91, 275–292 (2012).
Szpiech, Z.A. et al. Long runs of homozygosity are enriched for deleterious variation. Am. J. Hum. Genet. 93, 90–102 (2013).
Itan, Y. & Casanova, J.L. Can the impact of human genetic variations be predicted? Proc. Natl. Acad. Sci. USA 112, 11426–11427 (2015).
MacArthur, D.G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012).
Sulem, P. et al. Identification of a large set of rare complete human knockouts. Nat. Genet. 47, 448–452 (2015).
Jones, S. The Darwin Archipelago (Yale University Press, 2011).
Haldane, J.B.S. The effect of variation of fitness. Am. Nat. 71, 337–349 (1937).
Overall, A.D., Ahmad, M. & Nichols, R.A. The effect of reproductive compensation on recessive disorders within consanguineous human populations. Heredity 88, 474–479 (2002).
Neale, B.M. et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245 (2012).
Simons, Y.B., Turchin, M.C., Pritchard, J.K. & Sella, G. The deleterious mutation load is insensitive to recent population history. Nat. Genet. 46, 220–224 (2014).
Casanova, J.L., Conley, M.E., Seligman, S.J., Abel, L. & Notarangelo, L.D. Guidelines for genetic studies in single patients: lessons from primary immunodeficiencies. J. Exp. Med. 211, 2137–2149 (2014).
MacArthur, D.G. et al. Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476 (2014).
Novarino, G. et al. Exome sequencing links corticospinal motor neuron disease to common neurodegenerative disorders. Science 343, 506–511 (2014).
Blackstone, C., O'Kane, C.J. & Reid, E. Hereditary spastic paraplegias: membrane traffic and the motor pathway. Nat. Rev. Neurosci. 12, 31–42 (2011).
Dixon-Salazar, T.J. et al. Exome sequencing can improve diagnosis and alter patient management. Sci. Transl. Med. 4, 138ra78 (2012).
Okada, S. et al. Impairment of immunity to Candida and Mycobacterium in humans with bi-allelic RORC mutations. Science 349, 606–613 (2015).
Alsalem, A.B., Halees, A.S., Anazi, S., Alshamekh, S. & Alkuraya, F.S. Autozygome sequencing expands the horizon of human knockout research and provides novel insights into human phenotypic variation. PLoS Genet. 9, e1004030 (2013).
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Patterson, N., Price, A.L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Cann, H.M. et al. A human genome diversity cell line panel. Science 296, 261–262 (2002).
Behar, D.M. et al. The genome-wide structure of the Jewish people. Nature 466, 238–242 (2010).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Pruitt, K.D. et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 42, D756–D763 (2014).
Alexander, D.H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer Science & Business Media, 2009).
Polasek, O. et al. Comparative assessment of methods for estimating individual genome-wide homozygosity-by-descent from human genomic data. BMC Genomics 11, 139 (2010).
Magi, A. et al. H3M2: detection of runs of homozygosity from whole-exome sequencing data. Bioinformatics 30, 2852–2859 (2014).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Adzhubei, I.A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80–92 (2012).
Davydov, E.V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).
Erichsen, A.K., Koht, J., Stray-Pedersen, A., Abdelnoor, M. & Tallaksen, C.M. Prevalence of hereditary ataxia and spastic paraplegia in southeast Norway: a population-based study. Brain 132, 1577–1588 (2009).
Stevanin, G. et al. Mutations in SPG11 are frequent in autosomal recessive spastic paraplegia with thin corpus callosum, cognitive decline and lower motor neuron degeneration. Brain 131, 772–784 (2008).
Vardi-Saliternik, R., Friedlander, Y. & Cohen, T. Consanguinity in a population sample of Israeli Muslim Arabs, Christian Arabs and Druze. Ann. Hum. Biol. 29, 422–431 (2002).
Shami, S.A., Qaisar, R. & Bittles, A.H. Consanguinity and adult morbidity in Pakistan. Lancet 338, 954 (1991).
Stoltenberg, C., Magnus, P., Lie, R.T., Daltveit, A.K. & Irgens, L.M. Birth defects and parental consanguinity in Norway. Am. J. Epidemiol. 145, 439–448 (1997).
Do, R. et al. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015).
SIGMA Type 2 Diabetes Consortium. Association of a low-frequency variant in HNF1A with type 2 diabetes in a Latino population. J. Am. Med. Assoc. 311, 2305–2314 (2014).
Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012).
Huerta-Sánchez, E. et al. Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA. Nature 512, 194–197 (2014).
Wang, S., Lachance, J., Tishkoff, S.A., Hey, J. & Xing, J. Apparent variation in Neanderthal admixture among African populations is consistent with gene flow from non-African populations. Genome Biol. Evol. 5, 2075–2081 (2013).
Lowery, R.K. et al. Neanderthal and Denisova genetic affinities with contemporary humans: introgression versus common ancestral polymorphisms. Gene 530, 83–94 (2013).
The authors thank S. Sunyaev and D. Reich for help with PolyPhen-2 and DAF corrections, M. Turchin for help with purging analysis, J. Pickrell for help with TreeMix, and V. Bafna, N. Schork, and S. Bonissone for suggestions. Work was supported by grants from the US National Institutes of Health (P01HD070494 and R01NS048453), the Qatari National Research Foundation (NPRP6-1463), the Simons Foundation Autism Research Initiative (175303 and 275275) to J.G.G., the Yale Center for Mendelian Disorders (U54HG006504), the Broad Institute (U54HG003067), the Rockefeller University CTSA (5UL1RR024143-04), the Howard Hughes Medical Institute (to J.G.G. and J.-L.C.), INSERM, the St. Giles Foundation, and the Candidoser Association and by grants R01AI088364, R37AI095983, P01AI061093, U01AI109697 (to J.-L.C.), U01AI088685 (to J.-L.C. and L.A.), R21AI107508 (to E. Jouanguy), the DHFMR Collaborative Research Grant, and KACST 13-BIO1113-20 (to F.S.A.).
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Country distribution of GME samples and designation of geographical subregions.
GME samples collected across 20 countries and territories from the GME. Pie size corresponds to the number of samples from each country, and each pie shows the proportion of samples filtered because of quality control and relationship status (Online Methods). Geographical subregions are colored to show the sets of grouped countries. Some non-uniformity of sampling was inevitable owing to the inaccessibility of some populations. Map downloaded from https://www.presentationmagazine.com/ then colored.
Supplementary Figure 2 Unbiased genetic clustering demonstrates shorter genetic distance between samples from proximal geographical subregions.
Dendrogram of unbiased genetic clustering correlated with geographical subregion designation. 2,497 samples underwent exome sequencing from the Greater Middle East Consortium, including 1,111 GME samples as well as samples from Africa, East Asia, Europe, the Americas, Oceania, and unknown regions. Calculated identity-by-state (IBS) distances between samples represent the number of non-identical positions. Concordance between recruitment location and IBS clustering for all GME subregions was observed. Some intermixing was evident, suggesting recent migration events.
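The IBS distance described in this legend, the number of non-identical positions between two samples, can be illustrated with a minimal sketch. The genotype coding (alternate-allele counts 0/1/2, with -1 for a missing call) and the function name `ibs_distance` are assumptions made for illustration and are not taken from the study's pipeline.

```python
from typing import Sequence


def ibs_distance(g1: Sequence[int], g2: Sequence[int]) -> int:
    """Count positions at which two samples carry different genotypes.

    Genotypes are assumed to be coded as alternate-allele counts (0, 1, 2);
    a negative value marks a missing call, and such positions are skipped.
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must cover the same positions")
    return sum(1 for a, b in zip(g1, g2) if a >= 0 and b >= 0 and a != b)


# Two hypothetical samples genotyped at four positions; the last call is missing.
sample_a = [0, 1, 2, 2]
sample_b = [0, 2, 2, -1]
distance = ibs_distance(sample_a, sample_b)
```

A pairwise matrix of such distances is the kind of input a hierarchical clustering routine would need to build a dendrogram like the one described.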
(a) Cross-validation errors for the ADMIXTURE results shown in Supplementary Figure 1. Analysis with k = 6 gave the lowest cross-validation error. (b) Cross-validation errors for GME and 1000 Genomes Project samples.
Results of ADMIXTURE analysis for LD-filtered variants for 1,111 GME samples across the six geographical subregions. Eleven iterations of k were run, from 2 to 12, to optimize clustering. Each vertical bar represents a single individual. The y axis shows the estimated proportion of the genome assigned to each ancestral cluster. Samples grouped by subregion and organized from west (left) to east (right), showing trends of overlap. Substantial substructure was apparent throughout much of the GME, but three apparent ‘sources’ of ancestral populations stem from the NWA (yellow), AP (red), and PP (green) subregions.
Supplementary Figure 5 Introgression analysis of GME and 1000 Genomes Project exome samples shows consistent Neanderthal introgression in all GME, European, and East Asian samples except for NWA.
(a) Individuals from the 1000 Genomes Project reference populations and GME subregions were projected onto the first two principal components calculated from Neanderthal, chimpanzee, and Denisovan genomes. PC1 separates ancient human populations from chimpanzee, and PC2 separates the Neanderthal and Denisovan populations. When human samples were projected onto these principal components, they clustered near the center of these three species. Arrows are drawn from the center of the sub-Saharan African populations to each of the ancestral human and chimpanzee points. The sub-Saharan African populations represent a control group, where only limited Neanderthal and Denisovan introgression should be present. (b) Magnified view of a showing the dispersal of human populations within these two principal components. Samples are colored on the basis of continental origin, and subpopulations are labeled to indicate the center of each population. African populations were found to be separate from the remaining populations, which were found from this adjusted origin along the Neanderthal vector. Most populations were tightly clustered, with only the TP and NWA populations showing clear separation, suggesting a common time point of introgression among the clustered populations. The NWA samples had less introgression than the other GME populations.
Supplementary Figure 6 Heat map of pairwise FST values among all 1000 Genomes Project and GME populations identifies three clusters with a low degree of differentiation.
Top right, Wright's fixation index; bottom left, standard error values. Populations are ordered on the basis of geographical location. Three distinct clusters of close populations (shown as a blue gradient) are evident: 1000 Genomes Project Africa (LWK and YRI); 1000 Genomes Project Europe (FIN, CEU, and TSI), and GME subregions (NWA, NEA, AP, SD, TP, and PP); and 1000 Genomes Project East Asia (JPT, CHS, and CHB). Among global populations, the GME and European populations were more closely related than any other two continental regions. The greatest distance between any two populations was estimated as 0.212 for YRI and JPT. As populations became more distant, standard error values increased but remained small for all comparisons.
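As a rough illustration of how a pairwise FST value can be computed from allele frequencies, the sketch below uses Hudson's estimator combined across SNPs as a ratio of averages. This legend does not state which estimator was used, so the choice of estimator, the function name `hudson_fst`, and the input layout are all assumptions.

```python
from typing import Sequence


def hudson_fst(p1: Sequence[float], p2: Sequence[float], n1: int, n2: int) -> float:
    """Hudson's FST between two populations (assumed estimator, see lead-in).

    p1, p2: per-SNP sample allele frequencies in each population.
    n1, n2: number of sampled chromosomes in each population.
    Numerators and denominators are summed over SNPs before dividing.
    """
    if len(p1) != len(p2):
        raise ValueError("frequency vectors must cover the same SNPs")
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size.
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two chromosomes, one per population, differ.
        denominator += a * (1 - b) + b * (1 - a)
    if denominator == 0:
        raise ValueError("no SNP is polymorphic across the two populations")
    return numerator / denominator


# Hypothetical frequencies at three SNPs, 200 chromosomes per population.
fst = hudson_fst([0.10, 0.50, 0.80], [0.15, 0.40, 0.85], 200, 200)
```

Small negative values can occur when two populations have nearly identical frequencies, because the sample-size correction is subtracted from the squared difference.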
Supplementary Figure 7 Principal-component analysis on GME and 1000 Genomes Project populations showed that PC3 and PC4 explained inter-GME variance.
Plots comparing all combinations of PC1, PC2, PC3, and PC4 and percentages of variance explained. GME populations are color-coded by geographical regions. PC1 (39.03%) and PC2 (31.38%) together accounted for the majority of variation in the data and were associated with separating Africans and East Asians from other samples, respectively. PC3 and PC4 separated GME and European populations along north–south and east–west axes, respectively. AP was the most distant cluster from the 1000 Genomes Project reference populations, showing the greatest separation along PC3. Both of the North African populations tended to cluster closer to the sub-Saharan African cluster, whereas PP and TP trended toward the East Asian cluster.
Supplementary Figure 8 Reported consanguineous marriage rates many fold higher in GME than in other continental populations.
Clinical survey results aggregated to estimate regional averages of the consanguineous marriage rate. Weighted averages, taking sample size into account, were calculated across all studies falling within a given region. The highest rates of consanguineous marriage were documented in PP and AP.
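The sample-size-weighted averaging described here can be sketched as follows. The function name `weighted_rate` and the example numbers are hypothetical and are not survey data from the study.

```python
from typing import Iterable, Tuple


def weighted_rate(studies: Iterable[Tuple[float, int]]) -> float:
    """Average per-study rates, weighting each study by its sample size.

    studies: (rate, sample_size) pairs, with rate expressed as a fraction.
    """
    total_weight = 0
    weighted_sum = 0.0
    for rate, size in studies:
        weighted_sum += rate * size
        total_weight += size
    if total_weight == 0:
        raise ValueError("at least one study with a nonzero sample size is required")
    return weighted_sum / total_weight


# Two made-up studies from one region: 50% of 100 couples and 20% of 300 couples.
regional_rate = weighted_rate([(0.5, 100), (0.2, 300)])
```

Weighting by sample size keeps a small survey with an extreme rate from dominating the regional estimate.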
Supplementary Figure 9 GME samples carried longer and rarer runs of homozygosity than 1000 Genomes Project populations.
(a) Cumulative proportion of total ROH length by bin for African, East Asian, European, and GME populations. African populations had the shortest accumulation of ROH spans, whereas GME populations showed the longest despite the limited influence of bottlenecks. (b) Distribution of total ROH length (in Mb) for all 1000 Genomes Project and GME populations. Wider distributions were evident for the GME populations owing to heterogeneity in long ROHs. (c) The total number of exomic bases found in ROHs binned by frequency in each population. GME ROHs tended to be unique in comparison to 1000 Genomes Project populations.
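One way to compute the quantity plotted in panel a, the cumulative proportion of total ROH length by length bin, is sketched below. The bin edges, the function name, and the example segment lengths are illustrative assumptions; the legend does not specify the binning that was actually used.

```python
from bisect import bisect_right
from typing import List, Sequence


def cumulative_roh_proportion(lengths_mb: Sequence[float],
                              bin_edges_mb: Sequence[float]) -> List[float]:
    """Cumulative share of total ROH length contributed by each length bin.

    bin_edges_mb must be sorted ascending; k edges define k + 1 bins, with a
    segment equal to an edge falling into the higher bin.
    """
    total = sum(lengths_mb)
    if total <= 0:
        raise ValueError("no ROH length to summarize")
    per_bin = [0.0] * (len(bin_edges_mb) + 1)
    for length in lengths_mb:
        per_bin[bisect_right(bin_edges_mb, length)] += length
    cumulative = []
    running = 0.0
    for bin_total in per_bin:
        running += bin_total
        cumulative.append(running / total)
    return cumulative


# Four hypothetical ROH segments (Mb) and three assumed bin edges (Mb).
curve = cumulative_roh_proportion([0.5, 1.5, 3.0, 5.0], [1.0, 2.0, 4.0])
```

A population whose curve rises late carries most of its ROH length in long segments, which is the pattern the legend reports for GME populations.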
Supplementary Figure 10 Identity-by-state distance comparing human and chimpanzee reference genomes showed burden bias associated with hg19 corrected using estimated ancestral alleles.
(a) Homozygous and heterozygous variant counts shown for samples using hg19 (left) and PanTro2 (right) as the reference genomes. PanTro2 alleles demonstrated a linear relationship between populations, arguing for no burden difference. (b) IBS distance to the reference for chimpanzee genomes PanTro2 and PanTro4 (x axis) versus human hg19 (y axis). Human populations stratify by IBS distance using the hg19 reference genome. With chimpanzee ancestral variants, populations were equidistant from the chimpanzee reference genome.
Supplementary Figure 11 Correction of PolyPhen-2 predictions for derived variants resolved missense burden bias.
(a) The proportions of derived (Der) and ancestral (Anc) variants falling into each PolyPhen-2 class (B, benign; P, possibly damaging; D, probably damaging), across 14 allele frequency bins. The bias was apparent in the absence of possibly damaging and probably damaging calls for derived variants across nearly all bins. This bias can misrepresent results when comparing populations. (b) The same proportions after correction of derived variant PolyPhen-2 classes (Online Methods). Derived variant classes reflect the distributions of the ancestral variants. The x axis shows derived allele frequency bins, with parentheses and square brackets designating exclusion and inclusion, respectively.
Supplementary Figure 12 Mean derived allele frequencies for GME and 1000 Genomes Project populations across seven functional and deleteriousness variant classes suggested equivalent selective pressure.
(a) Calculated mean DAFs and standard errors for GME and 1000 Genomes Project populations. Variants were separated by functional class (noncoding, synonymous, nonsynonymous, and LOF) and corrected PolyPhen-2 deleteriousness class (benign, possibly damaging, and probably damaging). Populations are ordered as indicated on the right. No significant difference between populations was found for any variant class. (b) Mean DAF comparison for the X chromosome. Large error bars for some classes reflect limited ascertainment of variants within those classes.
Supplementary Figure 13 Comparison of allele frequency estimates from Exome Variant Server European-American and African-American populations showed poor correlation.
Comparison of the distribution of estimated allele frequencies for shared variants from two populations, EA and AA, showed poor correlation (Pearson's r = 0.1147). Hexagonal bins are colored according to the abundance of variants falling within each region. The linear regression line (blue) and identity line (black) are shown.
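For readers who want to reproduce this kind of comparison on their own data, Pearson's r between two vectors of allele frequencies for shared variants can be computed as in the sketch below. The function name `pearson_r` and the example frequencies are made up for illustration; they are not Exome Variant Server values.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical allele frequencies of the same five variants in two populations.
freq_pop1 = [0.01, 0.05, 0.20, 0.40, 0.02]
freq_pop2 = [0.30, 0.02, 0.10, 0.05, 0.25]
r = pearson_r(freq_pop1, freq_pop2)
```

A value near 0, like the one reported in the legend, means that a variant's frequency in one population says little about its frequency in the other.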
About this article
Cite this article
Scott, E., Halees, A., Itan, Y. et al. Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery. Nat Genet 48, 1071–1076 (2016). https://doi.org/10.1038/ng.3592