The genetics of East African populations: a Nilo-Saharan component in the African genetic landscape

East Africa is a strategic region to study human genetic diversity due to the presence of ethnically, linguistically, and geographically diverse populations. Here, we provide new insight into the genetic history of populations living in the Sudanese region of East Africa by analysing nine ethnic groups belonging to three African linguistic families: Niger-Kordofanian, Nilo-Saharan and Afro-Asiatic. A total of 500 individuals were genotyped for 200,000 single-nucleotide polymorphisms. Principal component analysis, clustering analysis using ADMIXTURE, FST statistics, and the three-population test were used to investigate the underlying genetic structure and ancestry of the different ethno-linguistic groups. Our analyses revealed a genetic component for Sudanese Nilo-Saharan speaking groups (Darfurians and part of the Nuba populations) related to Nilotes of South Sudan, but not to other Sudanese populations or other sub-Saharan populations. Populations inhabiting the north of the region showed close genetic affinities with North Africa, with a component that could be a remnant of North Africans before the migrations of Arabs from Arabia. In addition, we found very low genetic distances between populations in genes important for anti-malarial and anti-bacterial host defence, suggesting similar selective pressures on these genes and stressing the importance of considering functional pathways to understand the evolutionary history of populations.

sub-Saharan Africa, might have contributed to East Africa having the greatest level of regional substructure in the continent and the world.
East Africa's complexity is reflected in the fact that all families of continental African languages are represented in the region. Continental African languages have been classified into four major language families: Afro-Asiatic, Nilo-Saharan, Niger-Kordofanian (or Niger-Congo), and Khoisan. Afro-Asiatic, spoken predominantly by northern and eastern African pastoralists and agro-pastoralists and covering North Africa, includes the Semitic, Cushitic, and ancient Egyptian (Coptic) languages. Nilo-Saharan, spoken predominantly by eastern and central African pastoralists, includes in its main Chari-Nile branch the Central Sudanic and Eastern Sudanic (also called Nilotic) languages. Niger-Kordofanian, spoken predominantly by agriculturalist populations across western, eastern, central, and southern Africa, includes the Bantu languages 3,4. It is interesting to note that the outlier Kordofanian branch, which expanded the previous Niger-Congo family, is represented in the present study.
In an extensive and detailed study, Tishkoff et al. 5 characterized the population substructure in Africa and identified 14 ancestral components predominantly associated with linguistic affiliations. Recent studies have further analysed these ancestral components to explain their origins 6,7 . Despite the genetic and linguistic complexity present in East Africa, there are some populations that have not been properly assessed and which might provide a complementary understanding of the population diversity in the region.
Here, we focus on the region of Sudan and South Sudan, together with some related external populations (Ethiopians in the east; Fulani in the west); we refer to this ensemble as the Sudanese region (see Fig. 1 and Table 1). The genetic population history of the Sudan has been interrogated using non-recombinant markers (mitochondrial DNA, Y-chromosome) 1,8 and a small number of autosomal markers 9. More recent studies have analysed a significant number of microsatellites 5,10 and of single nucleotide polymorphisms (SNPs) 6,11, with results suggesting that African populations may have maintained a large and subdivided population structure throughout most of their evolutionary history 5. However, as the Sudanese region is inhabited by ethnically, linguistically and culturally diverse populations, studies using a larger number of markers and representative samples of the ethno-linguistic groups of the area are needed for fine-scale population structure inference.
The first aim of this study was to provide new insights into the genetic history of East African populations by analysing six Sudanese ethnic groups belonging to the main African linguistic families spoken in the region (Afro-Asiatic, Nilo-Saharan and Niger-Kordofanian), in addition to neighbouring ethno-linguistic groups (Nilotes of South Sudan, nomadic Fulani from the Sahel, and Ethiopians). We assessed the genetic diversity and relationships between these different ethno-linguistic groups to clarify the genetic history of East Africa. The second aim of the study was to use genetic distance, estimated as FST, to identify putative signals of adaptive selection in these populations, with a focus on immunological adaptation in anti-malarial, anti-bacterial and anti-fungal defence genes.

Results
Population Structure. We applied a principal component analysis (PCA) to investigate the population structure of the new populations genotyped in this study from the Sudanese region (Supplementary Fig. S1a). PC1 (3.56% of the variation) follows a North-South cline and separates populations inhabiting the region between the Nile River and the Red Sea (Nubians and Arabs along the Nile, Beja and Ethiopians along the coast) from Darfurians and Nuba of South-West Sudan, and Nilotes of South Sudan. Copts form a separate group close to the North-East populations, in a more outlying position: they are the extreme of the northern genetic component. PC2 (0.7%) separates the nomadic Fulani from the other populations.
Next, we combined our new populations (140 K data set) with previously studied populations of special interest for this analysis: Qatar 12, Egypt 13, and three sub-Saharan populations (Luhya, Yoruba and Maasai) from the 1000 Genomes Project 14, to have external references both north and south of the Sudanese region. This new data set contains 14,343 SNPs (14 K data set). Even though the number of SNPs in this second set is small, it is enough to differentiate components in the African genetic landscape 15. Fig. 2 shows a PCA of this extended data set, where East African populations are distinct from both sub-Saharan and North African populations. PC1 (6.08%) separates populations from North Africa/Middle East from those of sub-Saharan Africa (Fig. 2a). Copts are closer to North African and Middle Eastern populations but remain a separate cluster when PC2 is considered. PC2 (1.46%), along with PC1, separates the two homogeneous clusters of North-East and South-West populations: Nubians, Arabs, Beja and Ethiopians on the one hand, and Nuba, Darfurians and Nilotes on the other. PC2 also separates all Sudanese and Ethiopian populations from the rest. PC3 (0.56%) differentiates West African populations (Fulani and Yoruba) from sub-Saharan East African populations (Maasai) (Fig. 2b). Both PCAs, although based on data sets with different numbers of SNPs, preserve the topology of the populations. As expected, with a low number of SNPs we observe a higher intra-population variation (Supplementary Fig. S1b).
To test whether these particular sets of Immunochip SNPs (140 K and 14 K data sets) can recover population structure, we extracted 1000 Genomes data from worldwide populations and observed that the genetic structure between them is maintained across the different SNP data sets used (Supplementary Fig. S2). In addition, the effect of ascertainment bias in the Immunochip was assessed using a subset of presumably neutral SNPs (SNPs located in intergenic regions) (Supplementary Fig. S3).
[...] (Supplementary Fig. S5). Populations that are geographically close had low average FST values, even though population-specific characteristics were emphasized by excluding population outliers (Supplementary Fig. S4). The lowest average FST (0.003) was found both in the pair Arabs and Nubians, located in the Nile River Valley, and in the pair Beja and Ethiopians, located on the coast.
To test the hypothesis that geographically close populations are genetically similar, we performed a Mantel test to determine to what extent geographic and genetic distances (as pairwise FST) between populations are correlated. We found a significant positive correlation between genetic and geographic distance (r = 0.5105, p-value < 0.0001).
[...] Nilo-Saharan component, which is also found at a lower percentage in the North-East cluster and Maasai, will be outlined in the discussion.
Copts share the same main ancestral component as North African and Middle Eastern populations (dark blue), supporting a common origin with Egypt (or other North African/Middle Eastern populations). They are known to be the most ancient population of Egypt, and at k = 4 (Fig. 3) they show their own component (dark green), different from that of the current Egyptian population, which is closer to the Arabic population of Qatar.
The case of the Fulani is noteworthy: they carry more Sudanese ancestry (>45%) than North African (<40%) or sub-Saharan (<15%) ancestry, and at k = 5 they show their own component (Fig. 3). Their high individual variance in component proportions suggests a recent admixture event in this population.
To formally test the results of the admixture analysis, we applied the three-population test (f3 statistics) 16. We used all possible pairs of populations as surrogates of the ancestral populations of each ethno-linguistic group. All populations that have a complex pattern of admixture (Fig. 3) showed statistically significant results (Z-score < −4, p-value < 3.2 × 10^−5): those of the North-East cluster (Beja, Ethiopians, Arabs and Nubians) and the Fulani. The four North-East populations (Table 2) may be explained as admixture products of an ancestral North African population (similar to Copts) and an ancestral South-West population (Nuba, even if in one case Darfurians give a better fit). These four populations had an intermediate position between Copts and South-West Sudanese populations in both the PCA and the admixture analysis.
Fulani, who are known to have West African ancestry, have a negative f3 with Copts and Yoruba as source populations (Table 2). As they have a complex history, with high levels of admixture with different populations and high individual variance, this three-population phylogeny seems too naïve to explain their population history. None of the South-West populations (Darfurians, Nuba and Nilotes) appear as admixed in the three-population test. This result fits the ADMIXTURE analysis (Fig. 3 and Supplementary Fig. S10) and confirms a specific ancestral component for these populations.
Low genetic distance between populations for genes involved in infectious diseases. We studied the effects of infectious pressures on the genetic make-up of populations in East Africa by calculating genetic distances (as FST) between populations using the genetic variation in genes involved in defence against different agents. Among the genes genotyped in the Immunochip, we selected those associated with resistance/susceptibility to malaria 17 (Supplementary Table S5), those related to host defence against bacteria 18 (Supplementary Table S6), and those related to host defence against fungi (Supplementary Table S7). For every pair of populations, the mean FST of those genes was compared to the mean FST of a set of randomly selected SNPs from genic regions with the same sample size and similar MAF, using a permutation test (10,000 permutations). All pairwise comparisons showed that the mean FST score of malaria-related genes was significantly lower than the mean FST score of the sampling distribution (Fig. 4). This suggests that all these populations have experienced strong selective pressure in the same direction in genes related to malaria resistance. In the case of antibacterial host defence genes, all comparisons except Copts versus the North-East populations had a mean FST score significantly lower than the sampling distribution mean (Fig. 5).
For the genes encoding proteins important for antifungal defence, only three comparisons showed a mean FST score lower than the sampling distribution: Copts compared to South-West populations, Copts compared to Fulani, and North-East populations compared to South-West populations (Fig. 6).
We tested whether the specific SNPs present in the Immunochip for genes related to infectious diseases are a representative sample of all the SNPs of those genes using 1000 Genomes data of African populations (Supplementary Table S8). Results show that the SNPs present in the Immunochip for the genes of interest can be considered as a representative sample of all the SNPs in those genes.

Discussion
In this study we present an extensive genome-wide data set characterizing East African human genetic diversity in populations from Sudan, South Sudan and Ethiopia. We further analyse the Nilo-Saharan ancestral component within the variation of sub-Saharan Africans. This component corresponds linguistically to the Eastern Sudanic languages and geographically to the south and west of Sudan and to South Sudan, and it groups highly diverse ethnic groups within a similar genetic background. This component was identified in previous studies using Nilotic populations, but it was not analysed in other Nilo-Saharan populations, such as Darfurians or the Nuba people. In addition, we show convergent evolutionary pressures exerted on genes involved in anti-malarial and anti-bacterial host defence processes.
Africa's genetic landscape is shaped by geographic barriers 19, but the forces clustering populations vary depending on the scale. On a regional scale, East African populations cluster mainly by linguistic affiliation 5. However, it has been previously reported that language plays a lesser role in the genetic clustering of Sudanese populations, as geography is the main factor that groups them 10. This observation is supported by our data, as shown in the PCA (Fig. 2), where PC1 represents a north-east to south-west axis delimited by the Nile River and its main tributaries: the Blue Nile and the White Nile. Genetic and geographic distances between populations of the Sudanese region are positively correlated (Mantel test; r = 0.5105, p-value < 0.0001), with Sudanese populations clustering in four groups according to their geographic location (Supplementary Fig. S1).
Nubians are the only Nilo-Saharan speaking group that does not cluster with groups of the same linguistic affiliation, but with Sudanese Afro-Asiatic speaking groups (Arabs and Beja) and Afro-Asiatic Ethiopians (Supplementary Fig. S1a). Y-chromosome and mitochondrial DNA studies reported Nubians to be more similar to Egyptians than to other Nilo-Saharan populations 1,8 : Nubians were influenced by Arabs as a direct result of the penetration of large numbers of Arabs into the Nile Valley over long periods of time following the arrival of Islam around 651 A.D 20 .
Interestingly, our analyses show a unique ancestry for Sudanese Nilo-Saharan speaking groups (Darfurians and Nuba) related to Nilotes of South Sudan, but not to other Sudanese populations or sub-Saharan populations (Fig. 3). This ancestral component is not present in places where the Bantu [...]
Table 2. Three-population test. Combinations of source populations that give the most negative f3 statistic (Z-score < −4, p-value < 3.2 × 10^−5) for each target population (αL is the lower bound and αU is the upper bound of α, where α is the admixture proportion by which the target population was formed from the ancestral population of source population 1). Yoruba was used as the outgroup population to estimate α, except for Fulani, where the outgroup population used was Luhya.
The presence of the core of Nilo-Saharan languages at the confluence of the two Nile rivers suggests that the Sudanese region is the place of origin of the Nilo-Saharan linguistic family despite its fragmented distribution, as shown by the location of the Nubian language 21,22. It is interesting to note that Nuba populations constitute a homogeneous group, even though some speak Kordofanian (of the Niger-Kordofanian family) and others speak different languages of two branches of the Nilo-Saharan family. Their genetic composition denotes their Nilo-Saharan origin, with linguistic replacements in some groups.
Population displacement, whether or not it is followed by cultural or genetic exchange with local populations, would explain why not every Nilo-Saharan speaking group has this genetic component (as is the case of Nubians) and why not every population that has it is mainly formed by Nilo-Saharan speakers (as is the case of the Niger-Kordofanian speaking Nuba).
The North African/Middle Eastern genetic component is identified especially in Copts. The Coptic population present in Sudan is an example of a recent migration from Egypt over the past two centuries. Copts are close to Egyptians in the PCA but remain a differentiated cluster, showing their own component at k = 4 (Fig. 3). Copts lack the influence from Qatar, an Arabic population, that is found in Egyptians. This may suggest that Copts have a genetic composition that could resemble that of the ancestral Egyptian population, without the strong Arab influence seen at present.
A population that shows signals of recent admixture is the Fulani. Fulani are nomadic pastoralists who speak a Niger-Kordofanian (Niger-Congo) language and occupy a large area in Africa's Sahel. Their origin is still controversial, as mitochondrial DNA indicates a West African origin with traces of North African origin 23, whereas Y-chromosome studies showed shared ancestry with Afro-Asiatic and Nilo-Saharan Sudanese populations 8. This shared ancestry with East African populations can be seen in Fig. 3 (k = 3), suggesting that they have admixed with local populations. This finding does not agree with studies of Fulani people in the Lake Chad Basin, which reported that Fulani from West Africa's Sahel usually have consanguineous marriages and do not seem to have admixed with local farmers 24. These data together [...]
The second objective of our study was to analyse how infectious pressures affected the genetic variation of East African populations. The a priori hypothesis was that selective pressures on host defence genes induced by similar infections would result in lower genetic distances between populations, as compared with a genome-wide distribution: a de facto convergent evolution in host defence. Similar signals of convergent evolution in the TLR1/2/6/10 cluster were recently reported between European and Rroma populations living in the same geographic area 25. It has been proposed that these similar effects on different populations were exerted by plague 25. Confirming this hypothesis, we see that most populations have experienced strong selective pressure in the same direction in genes related to host defence against bacteria and malaria, leading to smaller inter-population genetic distances (Figs. 4,5). No such strong effects were present when genes important for antifungal host defence mechanisms were assessed (Fig. 6).
This might be expected considering that life-threatening fungal infections occur mainly in immunocompromised settings due to either invasive medical procedures or HIV infections, both conditions not encountered in early history.

Conclusions
In this work, we analyse genotyping data for almost 140,000 SNPs in nine East African populations from Sudan, South Sudan and Ethiopia. Our main results add new and interesting features to the genetic complexity of North-East Africa, with new populations that define a genetic component in southern Nilo-Saharan speakers that cannot be related to North African or other sub-Saharan components. These populations should be included in further population genetics and epidemiological studies to obtain a representative sample of the genetic diversity of East Africa. Moreover, a functional analysis shows similar genetic signals in genes involved in the antimalarial and antibacterial immune response. These findings suggest convergent evolution of the immune system of various ethnic groups in East Africa due to the major common selective pressures attributable to parasitic and bacterial infections acting on these populations.

Materials and Methods
Samples. Saliva samples were collected from 500 individuals belonging to nine East African populations based on self-reported ethnicity. The samples used in the present research were collected and studied with ethical approval and informed consent. All experimental protocols were approved by the IRB of the University of Medical Sciences and Technology in Khartoum and that of Universitat Pompeu Fabra (CEIC-IMAS; Comitè Ètic d'Investigació Clínica) in Barcelona and were carried out in accordance with the approved guidelines. The population samples from the Sudan belonged to the Afro-Asiatic (Copts, n = 40; Beja, n = 40; and Arabs, n = 120), the Nilo-Saharan (Nubians, n = 80; Darfurians, n = 50; and Nuba, n = 21) and the Niger-Kordofanian (Nuba, n = 19) linguistic families. In addition to these populations, we also collected samples from neighbouring populations: Nilo-Saharan speaking Nilotes (n = 50) from South Sudan, and Afro-Asiatic speaking Ethiopians (n = 40) currently living in Khartoum (Sudan). Samples from Niger-Kordofanian speaking Fulani (n = 40), a nomadic group that usually traverses Africa's Sahel, were also analysed. These samples were genotyped on the Immunochip (Illumina Infinium single-nucleotide polymorphism microarray), a custom-made, high-density genotyping array containing 195,806 single-nucleotide polymorphisms (SNPs) and 718 small insertion-deletions 26. Additional information about these populations is available in Table 1 (and Supplementary Table S1) and the sampling locations are shown in Fig. 1.
For comparative studies, a Middle Eastern population (Qatar) 12, a North African population (Egypt) 13, and three sub-Saharan populations (Maasai, Luhya and Yoruba) from HapMap Phase 3 27 were merged with the 140 K data set. These populations had 14,343 SNPs in common ("14 K" data set). See Supplementary Information for details.
Population structure. To study the genetic relationships among East African ethno-linguistic groups, we used principal components analysis (PCA) as implemented in the Eigensoft package 28 .
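As an illustration of the kind of computation that underlies a genotype PCA, the following minimal sketch normalizes a matrix of 0/1/2 allele counts and extracts the first principal component by power iteration. It is not the Eigensoft implementation: the per-SNP scaling by sqrt(p(1 − p)), the function names and the toy genotypes are assumptions made for this example only.

```python
import math
import random

# Hypothetical toy data: six individuals (rows) by six SNPs (columns),
# coded as 0/1/2 allele counts. The first three rows mimic one group and
# the last three another. These are NOT data from the study.
TOY_GENOTYPES = [
    [2, 2, 2, 0, 0, 0],
    [2, 2, 1, 0, 0, 1],
    [2, 1, 2, 0, 1, 0],
    [0, 0, 0, 2, 2, 2],
    [0, 0, 1, 2, 2, 1],
    [0, 1, 0, 2, 1, 2],
]


def normalize_genotypes(genotypes):
    """Center each SNP and scale it by sqrt(p * (1 - p)) (assumed scaling)."""
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        mean = sum(column) / n
        p = mean / 2.0  # estimated allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # skip monomorphic SNPs, whose scale factor is zero
        scale = math.sqrt(p * (1.0 - p))
        columns.append([(g - mean) / scale for g in column])
    # Transpose back to an individuals x SNPs matrix.
    return [[column[i] for column in columns] for i in range(n)]


def first_principal_component(matrix, iterations=200, seed=0):
    """Return (scores, explained fraction) for PC1 via power iteration."""
    n = len(matrix)
    # Individuals x individuals matrix X X^T.
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[j]))
             for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        product = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    product = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * w for v, w in zip(vec, product))
    explained = eigenvalue / sum(gram[i][i] for i in range(n))
    return vec, explained


if __name__ == "__main__":
    scores, explained = first_principal_component(normalize_genotypes(TOY_GENOTYPES))
    print([round(s, 3) for s in scores], round(explained, 3))
```

In this toy example the two groups fall on opposite sides of PC1; the overall sign of a principal component is arbitrary, so only the relative positions are meaningful.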
Population differentiation was estimated using classical pairwise FST values 29 for each pair of Sudanese populations in the 140 K data set. We then applied a Mantel test to study the correlation between geographic distance and genetic distance, measured as pairwise FST between populations. The Mantel test was performed using the R ADE4 package 30, with 10,000 permutations to estimate statistical significance. Geographic distances were calculated as great-circle distances between populations. Nomadic Fulani were excluded from this last analysis due to their imprecise geographic distribution.
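The logic of a Mantel test on great-circle distances can be sketched as follows. This is a plain-Python illustration rather than the ADE4 routine used in the study: the haversine formula with a mean Earth radius of 6371 km, the one-sided p-value convention and the function names are assumptions of the example.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed for this sketch)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mantel_test(dist_a, dist_b, permutations=10000, seed=0):
    """Correlate two distance matrices; permute the rows/columns of dist_b.

    Returns (r, p), where p is the one-sided proportion of permutations
    with a correlation at least as large as the observed one.
    """
    n = len(dist_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    flat_a = [dist_a[i][j] for i, j in pairs]
    observed = pearson(flat_a, [dist_b[i][j] for i, j in pairs])
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        permuted = [dist_b[perm[i]][perm[j]] for i, j in pairs]
        if pearson(flat_a, permuted) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Hypothetical coordinates (latitude, longitude), not the sampling sites.
    sites = [(15.6, 32.5), (19.6, 37.2), (13.6, 25.3), (11.0, 29.7), (4.9, 31.6)]
    geo = [[great_circle_km(*a, *b) for b in sites] for a in sites]
    gen = [[d * 1e-5 for d in row] for row in geo]  # toy "FST" proportional to distance
    print(mantel_test(geo, gen, permutations=999))
```

Permuting rows and columns together, rather than shuffling individual matrix cells, preserves the dependence structure among the pairwise distances, which is the reason for using a Mantel test instead of an ordinary correlation test.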
Population admixture. Population admixture was analysed using ADMIXTURE 31. This analysis identifies the genetic components of each group analysed and the ancestral clusters of the samples. It was run both on the 140 K data set of nine populations and on the 14 K data set of 14 populations (Sudan, South Sudan, Ethiopia, Egypt, Qatar and HapMap populations). To control for sample size differences, a random subset of 18 individuals was chosen for each population. Values of k from 2 through 10 ancestral components were tested successively, and the optimal value of k was estimated by ten-fold cross-validation. Clustering results were visualized with Distruct 32.
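The equal-size subsampling and the choice of k can be illustrated with a short sketch. The sample identifiers, the helper names and the cross-validation errors below are hypothetical; running ADMIXTURE itself is outside the scope of the example.

```python
import random
from collections import defaultdict


def subsample_per_population(labels, size=18, seed=0):
    """Draw `size` individuals at random from every population.

    labels: dict mapping a sample identifier to its population name.
    Returns the sorted list of retained sample identifiers.
    """
    by_population = defaultdict(list)
    for sample_id, population in sorted(labels.items()):
        by_population[population].append(sample_id)
    rng = random.Random(seed)
    kept = []
    for population in sorted(by_population):
        members = by_population[population]
        if len(members) < size:
            raise ValueError(f"{population} has fewer than {size} samples")
        kept.extend(rng.sample(members, size))
    return sorted(kept)


def best_k(cv_errors):
    """Return the k with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


if __name__ == "__main__":
    # Hypothetical identifiers; group sizes follow the Samples section.
    labels = {f"Copt_{i}": "Copts" for i in range(40)}
    labels.update({f"Arab_{i}": "Arabs" for i in range(120)})
    print(len(subsample_per_population(labels)))  # 2 populations x 18 individuals
    # Made-up cross-validation errors, for illustration only.
    print(best_k({2: 0.55, 3: 0.52, 4: 0.53}))
```

Fixing the random seed makes the subsample reproducible, which matters when clustering runs at different values of k are compared on the same individuals.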
To formally test whether admixture happened within the studied populations, and to measure its extent, we used the three-population test implemented in the ADMIXTOOLS software package 16. This test is of the form f3(X; A, B), where a negative value of the f3 statistic implies that population X (target population) was the result of an admixture event between the two ancestral populations of A and B (source populations). We tried every combination of source populations for each of our nine target populations and estimated the mixing coefficient (α) with Yoruba as the outgroup population. The coefficient α is the proportion of the target population's ancestry contributed by source population A, while 1 − α is the proportion contributed by source population B. For each comparison we kept the results with a significantly negative value of the f3 statistic after multiple testing correction (Z-score < −4, p-value < 3.2 × 10^−5).
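A simplified sketch of the f3 statistic may help to fix ideas. It averages (x − a)(x − b) over SNPs, where x, a and b are allele frequencies in the target and the two source populations, and obtains a Z-score from a delete-one block jackknife. It omits the finite-sample bias correction and the genomic block definition used by ADMIXTOOLS, and it does not estimate α; the function names and toy frequencies are assumptions of the example.

```python
import math


def f3_terms(x, a, b):
    """Per-SNP products (x - a) * (x - b) from allele frequencies."""
    return [(xi - ai) * (xi - bi) for xi, ai, bi in zip(x, a, b)]


def f3_statistic(x, a, b):
    """Naive f3(X; A, B): the mean of the per-SNP products."""
    terms = f3_terms(x, a, b)
    return sum(terms) / len(terms)


def block_jackknife_z(terms, n_blocks):
    """Z-score of the mean of `terms` from a delete-one block jackknife.

    SNPs are split into `n_blocks` contiguous blocks of equal size
    (any leftover SNPs at the end are ignored in this sketch).
    """
    size = len(terms) // n_blocks
    used = terms[: size * n_blocks]
    total = sum(used)
    estimate = total / len(used)
    leave_one_out = []
    for k in range(n_blocks):
        block_sum = sum(used[k * size:(k + 1) * size])
        leave_one_out.append((total - block_sum) / (len(used) - size))
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((t - mean_loo) ** 2 for t in leave_one_out)
    return estimate / math.sqrt(variance)


def one_sided_p(z):
    """Lower-tail p-value of a standard normal Z-score."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


if __name__ == "__main__":
    # Toy frequencies: X is an exact 50/50 mixture of A and B.
    a = [0.1, 0.9, 0.2, 0.8]
    b = [0.9, 0.1, 0.8, 0.2]
    x = [0.5, 0.5, 0.5, 0.5]
    print(f3_statistic(x, a, b))
    print(block_jackknife_z(f3_terms(x, a, b), n_blocks=2))
    print(one_sided_p(-4.0))
```

The lower tail of a standard normal distribution at Z = −4 is about 3.17 × 10^−5, which is consistent with the p-value threshold quoted above.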
Infectious disease-related genes. To take advantage of the particular design of the array used, groups of functionally related genes were analysed to look for particular signals in a given population. Genes related to resistance/susceptibility to malaria 17, and genes related to host defence against bacteria 18 and fungi, were selected for specific analyses (see Supplementary Tables S5, S6 and S7). SNPs were assigned to a gene if they were up to 1 kb upstream or downstream of the transcription start site of that gene. SNPs were annotated using ANNOVAR 37. For each pairwise comparison between populations, and for each of the three functional categories of genes (malaria, bacterial, and fungal infections), the mean FST score of those genes was compared to the sampling distribution of the average FST value of a subset of randomly selected genomic SNPs with the same sample size and MAF values similar to those of the functional categories. P values were calculated using a permutation test (10,000 permutations).
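The comparison of a gene-set mean FST against MAF-matched random SNP sets can be sketched as follows. The matching scheme (ten equal-width MAF bins), the one-sided p-value and all names are assumptions of this illustration; the exact matching procedure used in the study is not specified here.

```python
import random
from collections import Counter, defaultdict


def maf_bin(maf, n_bins=10):
    """Assign a minor allele frequency in [0, 0.5] to one of `n_bins` bins."""
    return min(int(maf * 2 * n_bins), n_bins - 1)


def matched_resampling_test(gene_snps, background_snps, resamples=10000, seed=0):
    """Test whether the gene-set mean FST is lower than in matched random sets.

    gene_snps, background_snps: lists of (fst, maf) tuples.
    Each random set draws, per MAF bin, as many background SNPs as the
    gene set has in that bin. Returns (observed mean, one-sided p-value).
    """
    observed = sum(fst for fst, _ in gene_snps) / len(gene_snps)
    pools = defaultdict(list)
    for fst, maf in background_snps:
        pools[maf_bin(maf)].append(fst)
    needed = Counter(maf_bin(maf) for _, maf in gene_snps)
    rng = random.Random(seed)
    hits = 0
    for _ in range(resamples):
        total = 0.0
        for bin_index, count in needed.items():
            total += sum(rng.sample(pools[bin_index], count))
        if total / len(gene_snps) <= observed:
            hits += 1
    return observed, (hits + 1) / (resamples + 1)


if __name__ == "__main__":
    # Made-up (FST, MAF) values for illustration only.
    background = [(0.01 * (i % 20), 0.1 + 0.002 * (i % 50)) for i in range(200)]
    gene_set = [(0.0, 0.12), (0.01, 0.12), (0.0, 0.18)]
    print(matched_resampling_test(gene_set, background, resamples=2000))
```

Drawing the random sets bin by bin keeps their allele-frequency spectrum close to that of the gene set, so that a low mean FST cannot be attributed simply to a difference in MAF.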