Whole-exome analysis in Tunisian Imazighen and Arabs shows the impact of demography in functional variation

Human populations are genetically affected by their demographic history, which shapes the distribution of their functional genomic variation. However, the genetic impact of recent demography is debated. This issue has been studied in different populations, but never in North Africans, despite their notable cultural and demographic diversity. In this study we address the question by analyzing new whole-exome sequences from two culturally different Tunisian populations, an isolated Amazigh population and a nearby non-isolated Arabic-speaking population, focusing on the distribution of functional variation. The two populations show clear differences in their variant frequency distributions, both overall and for putatively damaging variation. This suggests a substantial effect of genetic isolation, drift, and inbreeding in the Amazigh population, pointing to relaxed purifying selection. We also find an enrichment in Imazighen of variation associated with specific diseases or phenotypic traits, but the scarcity of genetic and biomedical data from the region limits further interpretation. Our results show the genomic impact of recent demography and reveal a clear genetic differentiation probably related to culture. These findings highlight the importance of considering cultural and demographic heterogeneity within North Africa when defining population groups, and the need for more data to improve knowledge of the region’s health and disease landscape.


Fig. S1 ADMIXTURE Cross-validation error
Cross-validation error for the ADMIXTURE analysis in Fig. 1c. Only Tunisian populations are included. Error bars represent the 95% confidence intervals.

Fig. S10 Comparison of the per-individual number of derived alleles (Nalleles) and homozygous derived genotypes (Nhom) across populations for variants in different GERP RS score categories with the missense-synonymous filter
Pairwise population ratios of the mean per-individual number of derived alleles and homozygous derived genotypes, using the GERP RS score together with the missense-synonymous filter. Plot title indicates the range of GERP RS scores of the variants included. For GERP < 2, only synonymous variants are included. For GERP ≥ 2, only missense variants are included. Error bars represent the 0.025 and 0.975 quantiles of a block bootstrap over sites: the exome data were divided into 1,000 blocks, and the blocks were resampled 1,000 times. Statistical significance is shown in the following way: *p-value < 0.05, **p-value < 0.01, ***p-value < 0.001. To account for multiple testing, the significance threshold was set to p < 0.001. Population names abbreviated as in Fig. 1.
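The block bootstrap behind these error bars can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes, for illustration only, that each population is summarised as a list of per-site contributions to the mean per-individual derived allele count, that blocks are contiguous runs of sites of near-equal size, and that the denominator population has a positive total. The function names (`split_into_blocks`, `quantile`, `block_bootstrap_ratio`) are hypothetical.

```python
import random


def split_into_blocks(n_sites, n_blocks):
    """Return (start, end) index pairs for contiguous, near-equal blocks."""
    n_blocks = min(n_blocks, n_sites)
    bounds = [i * n_sites // n_blocks for i in range(n_blocks + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_blocks)]


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def block_bootstrap_ratio(site_a, site_b, n_blocks=1000, n_boot=1000, seed=0):
    """Ratio of population totals (A / B) with 0.025 and 0.975 bootstrap quantiles.

    site_a[i] and site_b[i] are the assumed per-site contributions of site i
    to the mean per-individual derived allele count in populations A and B.
    Assumes every resampled total for B is positive.
    """
    rng = random.Random(seed)
    blocks = split_into_blocks(len(site_a), n_blocks)
    # Sum each block once so every bootstrap replicate only adds block totals.
    block_sums = [(sum(site_a[s:e]), sum(site_b[s:e])) for s, e in blocks]
    point = sum(site_a) / sum(site_b)
    ratios = []
    for _ in range(n_boot):
        # Resample blocks with replacement, keeping the number of blocks fixed.
        sample = rng.choices(block_sums, k=len(block_sums))
        total_a = sum(a for a, _ in sample)
        total_b = sum(b for _, b in sample)
        ratios.append(total_a / total_b)
    ratios.sort()
    return point, quantile(ratios, 0.025), quantile(ratios, 0.975)


if __name__ == "__main__":
    # Toy data: 5,000 sites with random per-site contributions.
    toy_rng = random.Random(1)
    pop_a = [toy_rng.random() for _ in range(5000)]
    pop_b = [toy_rng.random() for _ in range(5000)]
    print(block_bootstrap_ratio(pop_a, pop_b, n_blocks=100, n_boot=200))
```

Resampling whole blocks rather than single sites keeps nearby, potentially linked sites together, which is the usual motivation for a block design.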

Fig. S11 Comparison of the per-individual number of derived alleles (Nalleles) and homozygous derived genotypes (Nhom) across populations for variants in different deleteriousness categories
Between-population ratios of the mean per-individual number of derived alleles and homozygous derived genotypes. Plot title indicates the range of (A) CADD scores and (B) PolyPhen scores of the variants included. Error bars represent the 0.025 and 0.975 quantiles of a block bootstrap over sites: the exome data were divided into 1,000 blocks, and the blocks were resampled 1,000 times. Statistical significance is shown in the following way: *p-value < 0.05, **p-value < 0.01, ***p-value < 0.001. Population names abbreviated as in Fig. 1.

Fig. S13 Ratio of missense 2 ≤ GERP RS score < 4 to synonymous homozygous sites inside and outside ROHs
Boxplots indicate the distribution of the per-individual ratio of missense homozygous sites with 2 ≤ GERP RS scores < 4 to synonymous homozygous sites in five different regions of the exome (left to right): inside all ROH regions, inside ROHs 1-2.5 Mb long, inside ROHs 2.5-5 Mb long, inside ROHs >5 Mb long, and outside ROH regions. Statistical significance is shown in the following way: *p-value < 0.05, **p-value < 0.01, ***p-value < 0.001, ****p-value < 0.0001. Population names abbreviated as in Fig. 1.
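The per-individual ratio plotted in these boxplots can be illustrated with a short Python sketch. It is not the authors' pipeline. It assumes that the homozygous sites of one individual have already been filtered to the relevant GERP RS category for missense variants, that ROHs are given as (chromosome, start, end) intervals, and that ROHs shorter than 1 Mb are ignored. The handling of sites exactly at class boundaries, the region labels, and the function names are hypothetical choices.

```python
from collections import Counter

MB = 1_000_000
REGIONS = ["roh_all", "roh_1_2.5mb", "roh_2.5_5mb", "roh_gt5mb", "outside_roh"]


def roh_length_class(start, end):
    """Assign an ROH to a length class; boundary handling is an assumption."""
    length = end - start
    if length > 5 * MB:
        return "roh_gt5mb"
    if length >= 2.5 * MB:
        return "roh_2.5_5mb"
    if length >= 1 * MB:
        return "roh_1_2.5mb"
    return None  # shorter than the smallest class named in the caption


def missense_to_synonymous_ratios(hom_sites, rohs):
    """Missense/synonymous homozygous-site ratio per region for one individual.

    hom_sites: iterable of (chrom, pos, consequence), with consequence being
               "missense" or "synonymous".
    rohs:      iterable of (chrom, start, end).
    Returns a dict mapping region label to the ratio, or None when the region
    has no synonymous homozygous sites.
    """
    classified = []
    for chrom, start, end in rohs:
        cls = roh_length_class(start, end)
        if cls is not None:  # ROHs under 1 Mb are skipped (assumption)
            classified.append((chrom, start, end, cls))

    counts = Counter()
    for chrom, pos, consequence in hom_sites:
        region = "outside_roh"
        for r_chrom, start, end, cls in classified:
            if r_chrom == chrom and start <= pos <= end:
                region = cls
                break
        counts[(region, consequence)] += 1
        if region != "outside_roh":
            # Every site inside a length-classified ROH also counts for "all ROHs".
            counts[("roh_all", consequence)] += 1

    ratios = {}
    for region in REGIONS:
        synonymous = counts[(region, "synonymous")]
        missense = counts[(region, "missense")]
        ratios[region] = missense / synonymous if synonymous else None
    return ratios


if __name__ == "__main__":
    sites = [
        ("1", 1_500_000, "missense"),
        ("1", 1_600_000, "synonymous"),
        ("2", 100, "synonymous"),
    ]
    print(missense_to_synonymous_ratios(sites, [("1", 1_000_000, 3_000_000)]))
```

The linear scan over ROHs is kept for readability. With many sites and intervals, an interval tree or a sorted sweep would be the more practical choice.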

Fig. S14 Ratio of missense 4 ≤ GERP RS score < 6 to synonymous homozygous sites inside and outside ROHs
Boxplots indicate the distribution of the per-individual ratio of missense homozygous sites with 4 ≤ GERP RS scores < 6 to synonymous homozygous sites in five different regions of the exome (left to right): inside all ROH regions, inside ROHs 1-2.5 Mb long, inside ROHs 2.5-5 Mb long, inside ROHs >5 Mb long, and outside ROH regions. Statistical significance is shown in the following way: *p-value < 0.05, **p-value < 0.01, ***p-value < 0.001, ****p-value < 0.0001. Population names abbreviated as in Fig. 1.

Fig. S15 Ratio of missense GERP RS score ≥ 6 to synonymous homozygous sites inside and outside ROHs
Boxplots indicate the distribution of the per-individual ratio of missense homozygous sites with GERP RS scores ≥ 6 to synonymous homozygous sites in five different regions of the exome (left to right): inside all ROH regions, inside ROHs 1-2.5 Mb long, inside ROHs 2.5-5 Mb long, inside ROHs >5 Mb long, and outside ROH regions. Statistical significance is shown in the following way: *p-value < 0.05, **p-value < 0.01, ***p-value < 0.001, ****p-value < 0.0001. Population names abbreviated as in Fig. 1.