Whole-genome sequencing for an enhanced understanding of genetic variation among South Africans

Choudhury, Ananyo; Ramsay, Michèle; Hazelhurst, Scott; Aron, Shaun; Bardien, Soraya; Botha, Gerrit; Chimusa, Emile R.; Christoffels, Alan; Gamieldien, Junaid; Sefid-Dashti, Mahjoubeh J.; Joubert, Fourie; Meintjes, Ayton; Mulder, Nicola; Ramesar, Raj; Rees, Jasper; Scholtz, Kathrine; Sengupta, Dhriti; Soodyall, Himla; Venter, Philip; Warnich, Louise; Pepper, Michael S.

doi:10.1038/s41467-017-00663-9

Download PDF

Article
Open access
Published: 12 December 2017

Whole-genome sequencing for an enhanced understanding of genetic variation among South Africans

Ananyo Choudhury¹^na1,
Michèle Ramsay^1,2^na1,
Scott Hazelhurst^1,3,
Shaun Aron¹,
Soraya Bardien⁴,
Gerrit Botha⁵,
Emile R. Chimusa⁶,
Alan Christoffels⁷,
Junaid Gamieldien ORCID: orcid.org/0000-0001-7614-1887⁷,
Mahjoubeh J. Sefid-Dashti⁷,
Fourie Joubert⁸,
Ayton Meintjes⁵,
Nicola Mulder⁵,
Raj Ramesar⁶,
Jasper Rees⁹,
Kathrine Scholtz¹⁰,
Dhriti Sengupta¹,
Himla Soodyall^2,11,
Philip Venter¹²,
Louise Warnich¹³ &
…
Michael S. Pepper¹⁴

Nature Communications volume 8, Article number: 2062 (2017) Cite this article

12k Accesses
72 Citations
40 Altmetric
Metrics details

Subjects

Abstract

The Southern African Human Genome Programme is a national initiative that aspires to unlock the unique genetic character of southern African populations for a better understanding of human genetic diversity. In this pilot study the Southern African Human Genome Programme characterizes the genomes of 24 individuals (8 Coloured and 16 black southeastern Bantu-speakers) using deep whole-genome sequencing. A total of ~16 million unique variants are identified. Despite the shallow time depth since divergence between the two main southeastern Bantu-speaking groups (Nguni and Sotho-Tswana), principal component analysis and structure analysis reveal significant (p < 10⁻⁶) differentiation, and F_ST analysis identifies regions with high divergence. The Coloured individuals show evidence of varying proportions of admixture with Khoesan, Bantu-speakers, Europeans, and populations from the Indian sub-continent. Whole-genome sequencing data reveal extensive genomic diversity, increasing our understanding of the complex and region-specific history of African populations and highlighting its potential impact on biomedical research and genetic susceptibility to disease.

Whole genomes from Angola and Mozambique inform about the origins and dispersals of major African migrations

Article Open access 02 December 2023

High-depth African genomes inform human migration and health

Article 28 October 2020

The landscape of genomic structural variation in Indigenous Australians

Article Open access 13 December 2023

Introduction

African populations harbor the greatest genetic diversity^1,2,3,4,5 and have the highest per capita health burden (WHO), yet they are rarely included in large genome studies of disease association^6,7,8. The complex history of the people of sub-Saharan Africa is reflected in the diversity of extant populations and recent migrations that have led to extensive regional admixture^9,10,11. This diversity provides both a challenge and an opportunity for biomedical research and the hope that Africans will one day benefit from genomic medicine.

Present day South Africans include a major ethnolinguistic group of black southeastern Bantu-speakers (79.2% of the population), an admixed population (including European, Southeast Asian, South Asian, Bantu-speaking African, and hunter gatherer ancestries) referred to as Coloured (COL)^12,13,14 (8.9%), whites of European origin (8.9%), an Indian population originating from the Indian sub-continent (2.5%), and a small proportion of additional ethnolinguistic affiliations not broadly covered in the aforementioned (http://www.statssa.gov.za/). The focus of this pilot study from the Southern African Human Genome Programme (SAHGP) is on the southeastern Bantu-speaker and COL populations.

Archeological evidence suggests that the migration of groups of Bantu-speaking agro-pastoralists into southern Africa was initiated about 2000 years ago^15,16,17. It further supports two different migration paths, one in the east and one in the west of Africa, giving rise to southeastern Bantu-speaker (SEBs) and southwestern Bantu-speakers (SWB)^{15, 18}. Migration of SEB is estimated to have occurred in multiple distinct waves (in the early, middle, and late iron age) along the eastern coast^{19,20,21,22,23}. The patterns of distribution of artifacts and rock art from different iron-age sites indicate the complex nature of the interactions between the Bantu-speaking immigrants and the Khoesan (KS) inhabitants²⁴. These involved long phases of coexistence, trade, assimilation of hunter–gatherer peoples into agro-pastoralist communities, and in some cases the displacement of KS groups^25,26,27,28. Such interactions have not only involved linguistic and cultural exchange but also admixture at the genetic level^{29, 30}. It can be postulated that the migration of each Bantu-speaking group into a new territory likely involved an independent set of interactions and admixture events with the resident agro-pastoralist Bantu-speaker and hunter–gatherer populations.

These migrations and interactions have led to the formation of ethnolinguistic divisions within the SEB of present day South Africa, of which the two major groups are the Nguni-speakers and Sotho-Tswana-speakers who are estimated to have diverged geographically over the past 500 years or so²¹. The Nguni-speakers expanded to occupy the coastal areas extending down the east coast of South Africa, whereas Sotho-Tswana-speakers expanded across the highland plateau between the eastern escarpment and the more arid regions in the west²¹. Although the details of the arrival of these populations are unclear, it is proposed that the Nguni- and Sotho-Tswana-speakers or their antecedents migrated to southern Africa and started occupying vast territories by the fifteenth century^{21, 22, 31}. Some of the boundaries between these populations have, however, been obscured by more recent migrations, conquests, admixture, and, in some cases, rapid language adaptations, especially over the last two centuries³². This makes the consideration of geography and language important when assessing the divergence of these groups and begs the question as to whether genetic studies would be sufficiently powered to detect population differences.

Southern African populations have recently been investigated using a number of genomic approaches including genotyping array and, more rarely, whole-genome sequencing (WGS) technologies^{9, 13, 29, 30, 33,34,35,36,37,38}. However, the focus of most of these studies has been to analyze the genomic diversity among hunter–gatherers and the extent of their admixture in the present day SEB^{13, 29, 30, 36, 38}. An early study, based on mtDNA, Y-chromosome, and a limited number of autosomal markers, suggested that the ethnolinguistic divisions between the major SEB groups were reflected by observed genetic divergence, although the clustering was not consistent for the three data types³⁹. More recent genome-scale studies have not replicated the substructure within the SEB^{9, 38} and in some cases the authors concluded that the SEB is genetically a relatively homogenous group. This assumption needs more thorough investigation.

Over the past century there has been extensive urbanization of SEB in South Africa and the migration to economic hubs has resulted in a confluence of multiple ethnolinguistically diverse groups (http://www.statssa.gov.za/). When recruiting study participants from urban settings, the ethnolinguistic boundaries become blurred and the distinctions are no longer evident. We have therefore purposely recruited the SEB for this study from rural and semi-urban regions where we anticipated little or no ethnolinguistic admixture.

The arrival and settlement of Europeans during the last 500 years is an important migration that has influenced the peopling of southern Africa^{12, 40}. Slave trade into the Western Cape from the 1600s also brought South Asian and Southeast Asian people to South Africa¹². The interactions between these populations, Bantu-speakers and KS, have given rise to complex admixed population, one of which is the five-way admixed Cape COL population¹². The recent and complex admixture patterns of the COL populations from different geographic regions and religious affiliations have been investigated in many different studies^{12,13,14, 30, 40, 41}. They confirmed the presence of at least five ancestral populations and demonstrate significantly different ancestry proportions among individuals sampled from different regions of South Africa^{13, 40, 42}. These studies were all based on SNP-array data and, to date, no WGS data have been published from COL individuals. Largely due to unavailability of data from appropriate ancestral populations, representation of populations from Southeast and South Asia may, in some instances, have biased the estimate of ancestry proportions.

The focus of the SAHGP pilot study is to provide a WGS-based, unbiased estimate of genetic variation in the region and to study the genetic differences between some of the major ethnolinguistic groups in the country. The study included 24 ethnically self-identified individuals comprising 16 SEBs (seven Sotho-Tswana- and nine Nguni-speakers) and 8 COL individuals. The first major aim was to study possible correlations between language groups and genetic clustering. The results suggest that, at least for individuals sampled on the basis of both language and geographic location, there is a discernable genetic separation between the two major SEB ethnolinguistic groups. The second major aim was to investigate the ancestral composition of the COL individuals based on novel WGS data and a comprehensive assortment of potential ancestral populations. As a result of the inclusion of additional representative populations our study demonstrates a much stronger South Asian ancestry in the COL when compared to previous studies. We document significant novel SNV discovery from the 24 WGS and highlight the potential implications for disease susceptibility in Africans.

Results

Description of variants discovered

This study used deep WGS (~50×; Supplementary Table 1) data to provide an unbiased assessment of genomic variation in 24 apparently healthy South African male individuals. In an attempt to capture a spectrum of diversity in under-represented populations we included eight individuals of mixed ancestry from the Western Cape (referred to as COL) and 16 black South African SEB (7 Sotho-speakers from the Free State (SOT), eight Xhosa-speakers (XHS) from the Eastern Cape and 1 Zulu-speaker (ZUS) from Gauteng; Fig. 1a). Single-nucleotide variants (SNVs) were called using three different approaches with high concordance and only SNVs called by all three were used for downstream analyses (Supplementary Table 2 and Supplementary Note 1). Indels and copy number variants (CNVs) were called according to the standard Illumina pipeline. The analysis approach is outlined in Supplementary Fig. 1a. The average number of SNVs, indels, and CNVs was markedly higher in the black South Africans compared to the COL individuals (Fig. 1b–d and Supplementary Tables 2, 3). Across the 24 samples, 16.3 million unique SNVs were identified. A significant proportion of the SNVs identified were singletons (Fig. 1e). Interestingly, the number of singletons in SOT and XHS was found to be higher in comparison to singletons detected in randomly selected low-coverage African WGS sets of equal size (Fig. 1f, Supplementary Notes 1, 2); however, the observed differences in addition to demographic factors might also reflect the differences in sequencing coverage among the studies^{5, 43, 44} (Supplementary Notes 1, 2).

SNVs and indels were annotated according to genic locations using ANNOVAR⁴⁵ (Supplementary Tables 4, 5). A total of 3936 unique loss of function (LOF) candidate variants, which included stop gain, stop loss, splice, and frameshift mutations, were observed (Supplementary Fig. 2). The list was pruned by excluding variants observed at a MAF > 0.01 in 1000 Genomes Project Phase 3 (KGP)⁴⁶ and the African Genome Variation Project (AGVP)⁹, resulting in 1703 variants. Their gene locations were determined and 146 genes had at least two LOF variants in the data set, of which 22 genes showed a potential knockout configuration (two heterozygous LOF variants in the same individual) in at least one individual. Six of the genes were excluded because they are listed in the false discovery panel⁴⁷ and the LOF variants in the remaining 16 genes are shown in Table 1 and Supplementary Table 6. These genes are not associated with known phenotypes in OMIM (http://www.omim.org/), with the exception of SLC17A9, which has variants segregating with autosomal dominant disseminated superficial actinic porokeratosis-8 in two unrelated Chinese families (http://wwwgenecards.org).

Table 1 SNVs showing potential knockout configurations in 16 genes that have two or more LOF mutations in the heterozygous state in the same individual

Full size table

Novel variants

Of the 16.3 million unique SNVs identified, 815,404 were detected to be novel (defined as absent from dbSNP142⁴⁸, KGP⁴⁶, and the AGVP study⁹ (Supplementary Table 7)). Novel SNVs were categorized according to minor allele count (MAC), with the largest proportion of variant alleles observed only once (Fig. 1g). The large number of novel variants demonstrates the potential for novel discovery in African populations. The representation of novel SNVs in various functional categories was also studied and is summarized in Fig. 1h (Supplementary Note 3). The distribution of novel SNVs across the genome is shown in Supplementary Fig. 3 and Supplementary Table 8 and highlights regions of high density and potential interest (Supplementary Note 3). Regions with high overall SNV density differences between black South Africans and other African populations were identified (Supplementary Note 4). Several of these regions were found to be associated with protein-coding genes (Supplementary Table 9). Local ancestry analysis of these regions may reveal hotspots for mutational activity or enrichment of haplotype blocks from specific ancestral populations (e.g., the KS).

Population structure and admixture

Recent historical events including geographic isolation, cultural practices, political conflict, colonization, and extensive admixture have shaped the genetic diversity among populations of southern Africa^{11, 30, 49}. Comparative studies for population structure and admixture were done using SNP-array data available in the public domain^{9, 30, 46} (see Supplementary Table 10 for the list of populations used). Fig. 2a shows global data and Fig. 2b focuses on Africa.

Principal component analysis (PCA) showed that the COL individuals form a dispersed cluster linked to African and non-African populations including European, South-Asian (Indian sub-continent), and Austronesian populations⁵⁰, confirming their parental contributions as reported in historical accounts (Fig. 2, Supplementary Fig. 4, and Supplementary Note 5). The analysis of ancestry proportions based on novel proxy populations provided an indication of substantive admixture from the Indian sub-continent along with contributions from the European, KS, SEB, and the Austronesians (Malay; Fig. 2c, Supplementary Table 10, Supplementary Fig. 5, and Supplementary Note 6). However, it needs to be noted that the admixture among groups of COL individuals is known to differ significantly along religious lines and geographic dispersal^{51, 52}.

Despite the recent linguistic and geographic divergence between the XHS and SOT groups, genetic data using PCA showed them to be significantly different (p < 10⁻⁶; Fig. 2 and Supplementary Fig. 6). The genetic structure between the two groups was also reflected in the structure analysis (Fig. 2c and Supplementary Fig. 7).

One individual who self-identified as XHS was found to have recent non-African admixture of European origin (he is identified as XHD in Fig. 2a), leaving 15 individuals in the SEB group. The ZUS individual did not seem to cluster with the AGV Zulu participants (see Supplementary Fig. 6e)⁹. The ZUS individual was recruited in Soweto, Johannesburg, which is a cosmopolitan area having attracted migrants from across southern Africa, with different ethnic backgrounds. Soweto has a complex history, including people who were forcibly relocated there under apartheid legislation from other urbanized areas in the 1950s⁵³. Thus, Soweto has an effective 120-year history of people from different backgrounds living together in an urbanized setting. Although there is only one ZUS individual in the study and so comment risks being anecdotal, the fact that the ZUS individual does not cluster in our sample with the ZUL individuals (from the AGVP data set) indicates that care needs to be taken in interpreting ethnic origin when recruiting from urbanized areas in African countries. Language and self-identity may not be good markers for genetic background.

Regions of genomic differentiation between Sotho and Xhosa

To investigate the differentiation of SOT and XHS further, we studied the distribution of average fixation index (F_ST) scores (in 25 kb windows) across the genome (Supplementary Note 7). The analysis of regional F_ST was able to identify genomic regions with high divergence between the two groups (Supplementary Fig. 8 and Supplementary Table 11). Although a large proportion of the high F_ST windows was found to occur in the intergenic regions and pseudogenes, some of the windows were found to include the olfactory receptor genes OR4S2 and OR4C6, and other genes like SEMA4F, EREG, PLN, and PTF1A (Supplementary Fig. 8 and Supplementary Table 11). The potential biological roles of these genes were inferred using the Genecards database (http://wwwgenecards.org). The gene SEMA4F encodes a transmembrane class IV semaphorin family protein, which plays a role in neural development. This gene has been suggested to be involved in neurogenesis related to prostate cancer, the development of neurofibromas, and breast cancer tumorigenesis. In addition to cancers, the SEMA4F gene has also been suggested to be involved in pulmonary tuberculosis and dyslexia. The EREG (epiregulin) gene can function as a ligand of EGFR (epidermal growth factor receptor), as well as a ligand of most members of the ERBB (v-erb-b2 oncogene homolog) family of tyrosine-kinase receptors and is known to be associated to diseases like colorectal cancer, and hypopharynx cancer. Similarly, the PTF1A gene is a transcription factor involved in pancreatic development. Diseases associated with PTF1A include pancreatic cancer and cerebellar agenesis. In the absence of data for variation in phenotype or disease incidence or prevalence in these groups, it is not possible to infer whether the highlighted genes have any medical or evolutionary significance for these populations. Nevertheless, the genetic differences in these regions might flag some of these diseases/traits for epidemiological investigation among the southern African populations.

Affinities of populations based on rare variant sharing

To further understand the genetic affinities of the South Africans, we performed f ₂ analysis using WGS data from the KGP⁴⁶. If a variant occurred only twice in the merged SAHGP-KGP data set, such that one copy was observed in the SEB, the KGP⁴⁶ population with the other copy was noted (Supplementary Note 8). The SEB included 7 SOT, 7 XHS, and the ZUS individual. The frequency of f ₂ variants shared with SEB by various KGP⁴⁶ populations is summarized in Fig. 3a, demonstrating that the majority of f ₂ variants was shared with the Luhya from Kenya. There was also significant but less sharing with other African populations (especially ACB and ESN) and low sharing with non-African populations. This trend was also observed when the analysis was extended to examining SNVs, irrespective of MAC, shared between the SEB and a single KGP⁴⁶ population (Supplementary Fig. 9). The COL also showed the same pattern of f ₂ variant sharing with the KGP⁴⁶ populations (Fig. 3b). The higher sharing of f ₂ variants between South African populations with the East African Niger-Congo-speakers compared to West African Niger-Congo-speakers is consistent with the historical accounts of Bantu migration^{49, 54}. The distribution of continent-specific variants also demonstrated a similar pattern (Supplementary Fig. 9 and Supplementary Note 8).

Characterization of Khoesan affinities

An important characteristic that distinguishes the SEB from other Africans is the relative proportion of KS admixture^{9, 29, 49, 55,56,57}. Genetic distance between the KS and the SAHGP populations was estimated using F_ST. As expected, the SOT and XHS showed closer affinity with the Zulu and other SEB from South Africa compared to the other Africans (Supplementary Fig. 10 and Supplementary Note 9). When comparing the COL, SOT, and XHS to different KS groups, the COL showed greater genetic distances than either the XHS or SOT. Relative to the XHS, the SOT consistently showed smaller F_ST values, demonstrating that the KS had contributed to the gene pools of the SOT, XHS, and COL populations to varying degrees (Fig. 3c, d). The genetic distances of the COL reflected the geographic proximity of current day KS population dispersal, suggesting that this provided the impetus for admixture. The genetic distance for SOT and XHS, however, showed a more complex pattern of variation with geography that could be due to variation in levels of Bantu-speaking admixture in the KS populations (e.g., the Khwe)¹⁵. To reduce bias on the F_ST estimates introduced due to admixture, we used PCAdmix to identify and mask genomic regions with non-Niger-Congo (NC) ancestry and repeated the analysis^{9, 58} (Supplementary Note 9). The results showed the estimates from using the genomic regions of only NC origin (i.e., non-NC regions masked) to be largely similar to the unmasked set, suggesting that these genetic distances are inherent to the Bantu-speaking populations and not only due to differential KS admixture (Supplementary Fig. 10 and Supplementary Note 9).

Analysis of runs of homozygosity

The distribution of the runs of homozygosity (ROH) segments in the SAHGP populations was compared to various populations from Africa and other continents (Supplementary Note 10). The comparison of ROH between African and non-African populations challenges the previous observations of uniformly lower ROH in Africans^{59, 60} and shows extreme diversity in ROH segments among African populations³⁰ (Fig. 4, Supplementary Fig. 11, Supplementary Table 12, and Supplementary Data Set 1). The southern African Bantu-speakers (shown in light purple in Fig. 4) were found, in general, to harbor longer and more abundant ROH segments in comparison to Bantu-speaking populations from the East, Central West, and West Africa. The KS exhibited large variations in ROH length and abundance. More northern KS populations, SWB, and the Somali were found to show the highest ROH length and abundance within the continent, in some cases comparable to non-African populations. The COL, along with other recently admixed populations like the ASW and ACB, shows the lowest total ROH as well as the smallest number of segments among the African populations (Fig. 4 and Supplementary Fig. 12), reflecting their relatively recent and complex multi-ancestral admixture¹³. The significance levels of differences in ROH length between populations were estimated using the Mann–Whitney U-test and are shown in Supplementary Fig. 11 and Supplementary Data Set 1.

Distribution of mitochondrial and Y chromosome haplogroups

Mitochondrial and Y chromosome haplogroups showed a gender-biased gene flow (Table 2). The mtDNA haplogroups were predominantly KS with two-thirds showing the L0d haplogroup in both the COL and black South Africans. Two of the COL individuals had Southeast Asian/European haplogroups and one had a haplogroup found in Bantu-speakers. Four of the black South Africans had mtDNA haplogroups found among other eastern Bantu-speakers and one a haplogroup common in West Africa. Conversely, the Y haplogroups showed significant differences between the COL group (predominantly of European origin, with one African haplogroup) and black South Africans with the latter having almost exclusively African haplogroups. The self-reported black South African (XHD) with significant recent admixture had a paternal lineage of Mediterranean origin (Table 2). The mtDNA and Y haplogroup findings are consistent with previous studies that indicated cross-cultural assimilation, favoring the inclusion of female hunter–gatherers into Bantu-speaking farming communities^{3, 41, 61,62,63,64}.

Table 2 Mitochondrial and Y chromosome haplogroup distribution in the 24 SAHGP individuals

Full size table

Discussion

The SAHGP study is the first report on the genetic architecture of Africans using high-coverage WGS data that is fully funded by an African government and analyzed and interpreted locally. It demonstrates capacity for genome analysis and highlights the high discovery rate of novel variants and a deeper understanding of population histories and affinities.

Although hinted at in an earlier study³⁹, population differentiation among the SEB has not been reported in any of the recent genome-scale studies. In fact, many of these studies have shown and/or assumed the SEB to be a genetically homogeneous population^{30, 38}. Despite the small number of samples, our study is the first genome-scale study to report genetic differentiation between the two major language divisions of the SEB in South Africa. We postulate that one reason is the locations from which participants were sampled. In our study, we purposely recruited the SEB from rural areas or regions with little ethnolinguistic diversity, whereas other studies may have recruited from urban settings with a multi-ethnic and multi-cultural mix of individuals. Careful scrutiny of the PCA plots in the AGVP study⁹, in the light of our findings, shows evidence of a tighter and more homogeneous clustering of the Zulu from Kwa-Zulu Natal and a more diffuse clustering of the Sotho who were recruited from urban Soweto, just outside Johannesburg. The latter self-reported as Sotho-speakers but may have had parents from two different ethnolinguistic groupings. Furthermore, in the detailed analysis of admixture reported in the AGVP study (Extended data Figure 7)⁹ clear differences in the nature, source, and timing of admixture in the Sotho and Zulu are evident.

A failure to detect genetic differences between SEB groups who speak different but related languages in some of the previous studies is likely due to large-scale demographic changes that have occurred over the last two centuries²³. These include migrations, displacements, admixture, and adoption of new languages, that might have rendered language alone an inadequate proxy for capturing underlying genetic differences, especially in urban centers. According to oral history, linguistic and archaeological evidence, a common ancestry is likely to be as recent as 1000–1200 years for the SOT and XHS^{3, 65, 66}. Therefore, the differences in their genetic structure, in addition to differential admixture, could represent the consequences of very recent geographic, linguistic, and cultural separation with concomitant genetic drift effects, given the small effective population sizes^{15, 16, 39, 67}. The small sample sizes for this pilot study, as well as the lack of population-scale WGS data from the KS populations, restricted our ability to investigate the role of genetic drift and selection in the genetic differentiation. Studies based on larger sample sizes will be necessary to assess the extent to which these factors have influenced the genomic differences. Given that differences were observed, this provides a compelling argument for investigating population substructure in South African studies as this may affect the outcomes and interpretation of biomedical genetic association and pharmacogenomics studies in the region.

Turning our attention to the admixed COL populations of South Africa, several studies have detected up to five distinct ancestry components, arising from KS, Bantu-speaker, East Asian/Southeast Asian, South Asian, and European admixture^{12,13,14, 30, 41, 51}. In most of these studies, the Chinese and the Gujrati populations from the HapMap data set⁶⁸ have been used to represent East/Southeast Asian and South Asian ancestries, respectively. A survey of the seventeenth century slave-trade routes, however, suggests these to be unsuitable proxies for the populations that might have contributed the East Asian and South Asian ancestry in COL individuals. Based on data from the KGP⁴⁶ and Malay genome studies⁵⁰, we were able to identify the Malay as a better proxy for the Southeast Asian and the Bengali (BEB) for the South Asian ancestry. Moreover, the geographic locations of these populations were found to be much closer to the seventeenth century Dutch trading posts⁶⁹ and historical accounts of the presence of these groups in the Cape during that time is also well documented⁷⁰. Based on the use of the more appropriate comparative populations, we were able demonstrate that the South Asian contribution was higher in comparison to the East Asian contribution. This was corroborated in an independent study³⁰ of COL individuals in South Africa.

In conclusion, the SAHGP pilot study emphasizes the high discovery rate of novel variants in African populations. Despite previous reports of relatively low genetic divergence among SEBs, we detected significant population differentiation between two SEB groups in South Africa, highlighting the need to consider population structure in disease-association studies involving southern African populations. Our study is limited by the small number of participants and lack of representation of additional ethnolinguistic groups in the region. In particular, the absence of population-scale WGS data for KS groups restricted our ability to fully utilize our WGS data in analyses such as admixture mapping and local ancestry detection. The availability of such data would enable a more comprehensive analysis and is expected to provide novel insights.

Methods

Participants and sample collection and DNA extraction

The study was approved by the Human Research Ethics Committee (HREC; Medical) of the University of the Witwatersrand, Johannesburg (Protocol number: M120223). Three groups of participants were enrolled and venous blood was collected into tubes containing EDTA anticoagulant. Inclusion criteria were as follows: male, over the age of 18 years, four grandparents who speak the same language as the participant (in the Bantu-speakers in order to avoid recently admixed individuals), not known to be related to the other participants in the study, and willing to provide broad informed consent (including consent to share data and DNA for future studies approved by the HREC (Medical)). Where feasible, community engagement preceded enrollment. Three main ethnolinguistic groups were included in this SAHGP pilot study. Individuals self-identified in terms of the ethnolinguistic group as part of the recruitment process. Group 1: individuals of mixed ancestry (referred to as COL in the South African context) were recruited through the Western Province Blood Transfusion Service by Sister Debbie Joubert under the guidance of Professor Soraya Bardien. Group 2: Sotho (Sotho-Tswana-speakers): seven individuals in this group were recruited from in and around the town of Ventersburg in the Free State Province, following community engagement done by Professor Michèle Ramsay and recruitment by Mr. and Mrs. Botha and Mrs. van den Berg. Group 3: Xhosa-speakers (Nguni language): eight individuals were recruited by Dr Nomlindo Makubalo from her medical clinic in the Eastern Cape Province. One individual was a Zulu-speaker (Nguni language) from Johannesburg. All DNA samples were extracted in the same laboratory using a modification of the salting out procedure⁷¹. The DNA was normalized and sent to the service provider (Illumina Fast Track) as a single batch at the same time, and all the data were returned in one batch.

Data generation and processing

The DNA samples were normalized to ~60 ng/µl and ~5 µg DNA was submitted to the Illumina Service Centre for sequencing on the Illumina HiSeq 2000 instrument (~100 bp paired-end reads, ~314 bp insert size) with a minimum of ×30 coverage. Initial analysis of the raw read data was conducted by Illumina FastTrack Sequencing Services using their in-house-developed Isaac analysis pipeline.

SNP array data

Each sample was also genotyped using the IlluminaOmni2.5 genotyping array.

Whole-genome alignment and BAM processing

Reads were aligned to NCBI 37 (hg19) of the human genome reference sequence using the Isaac Alignment Software⁷². During the mapping selection phase, low-quality 3′ ends and adaptor sequences were trimmed. Following the alignment-phase PCR duplicates were marked and indels realigned by the Isaac Alignment Software. Finally, the base-quality scores were recalibrated using GATK⁷³ to generate the final sorted, duplicate marked, indel-realigned BAM files that were used for variant calling (Supplementary Note 1). The quality of the alignment per sample was assessed using SAMtools version 1.1-26-g29b0367⁷⁴ to examine the percentage of duplicates and successfully mapped reads (Supplementary Table 1).

SNV calling

SNV calling was performed on all samples using the Isaac Variant Caller. The final data set of variants produced by the Isaac Variant Caller was filtered based on various features to generate a high-quality SNV data set (Supplementary Note 1). To assess the accuracy of the variant calls generated by the Isaac Variant Caller, two additional approaches were used to recall variants using the BAM files produced by the Isaac Alignment software. Variant calling was conducted using GATK’s HaplotypeCaller version 3.2-2⁷⁴. The variant calling was conducted independently at the University of the Witwatersrand (Wits) and the University of Pretoria (UP) using the same GATK pipeline with varying parameters (Supplementary Note 1). The Wits site conducted the variant calling using GATK’s suggested best practices, while UP used more stringent variant-calling parameters (Supplementary Note 1). Each of the GATK variant call data sets was filtered using the GATK Variant Quality Score Recalibration and the transition–transversion ratios assessed across the range of MACs (Supplementary Note 1, Supplementary Table 2, and Supplementary Fig. 1). The concordance between the three filtered data sets was examined and found to have an overlap of 97% for the SNVs called (Supplementary Fig. 1). In order to move forward with a high quality, robust set of SNVs, the intersection of filtered SNVs called by all three approaches was used for further downstream analysis.

Indels and structural variant calling

Indels and structural variants were called using the Isaac variant caller software according to the Illumina pipeline⁷².

Functional categories for SNVs and indels

The annotation was performed with the ANNOVAR software⁴⁵ using the database version (2015Mar22). Variant type counts for SNVs, indels and CNVs within each population was calculated.

Gene descriptions

The identification of genes in genomic regions of interests was performed using BioMart (http://www.ensembl.org/biomart/). The description of genes and their potential functions was inferred using GeneCards (http://www.genecards.org/).

Relatedness

As several of the analysis methods used in this study assume the use of unrelated samples for accurate results, we assessed the data set for relatedness using an identity-by-descent (IBD) approach in PLINK v1.9⁷⁵. The IBD approach is based on calculating genome-wide identity by state (IBS) for each pair of individuals, based on the average proportion of alleles shared in common at the genotyped SNPs. The genotype data set was used for the IBD analysis and revealed no level of relatedness based on the π_hat values generated, where values of greater than 0.1875 are indicative of closely related individuals.

Site frequency spectrum

To avoid bias due to possible incorrect assignment of ancestral alleles, a folded site frequency spectrum (SFS) based on MACs was calculated using a custom perl script. The script was used to study SFS in the three SAHGP populations along with eight randomly selected samples from representative African (YRI, ASW) and non-African populations (CEU) from the KGP⁴⁶ and the AGVP⁹ (ZUL) data sets. As the main application of this analysis was to compare the SFS within each data set, it needs to be noted that variation in sequencing depths among data sets might introduce some biases in cross data set comparisons.

Mitochondrial DNA haplotype calling

Haplogrep2⁷⁶ was used to identify mitochondrial haplotypes for each individual. For this, all reads were aligned using BWA-mem to the RSRS sequence. The BAM files produced were then uploaded to mtDNA-server service as suggested by the webserver documentation. This service performs QC filtering (Mapping Quality Score < 20; read alignment quality < 30; base quality < 20; heteroplasmy level < 1%; and BAQ filtering) and annotates regions of low complexity and NUMTS and finally assigns the most likely haplogroups.

Y chromosome haplogroup analysis

Y-chromosome haplogroup analysis was done using the AMY-tree algorithm and tool⁷⁷. For each person, the variants detected from the WGS were extracted, and converted into the correct format before being input into the AMY-tree program.

LOF analysis

The LOF mutations in our data set include Stop Gain, Stop Loss, Frameshifts (defined as indel in exon which in not a multiple of 3), and Splice Variants (defined as SNP/indel in position +1, +2, −1, −2 in introns). The above-mentioned categories of mutations in the whole-genome sequence data were identified using ANNOVAR⁴⁵. The SNVs showing MAF > 0.05 were excluded as they were assumed to be mutations of lower impact. The distribution of the LOF variants in each individual was analyzed and if an individual was found to contain two different heterozygous LOF mutations, one in each chromosome, as inferred from phased whole-genome sequence data, in the same gene, the individual was characterized as a potential “complete knockout” with respect to that gene. Not all SNVs could be phased accurately because they were novel and therefore, when in doubt, we made the assumption that they were in a trans-configuration.

Population structure and admixture and relationship analysis

We investigated population structure using both PCA and structure analysis. In choosing comparative populations we used prior work and historical knowledge. After some preliminary experimentation we chose specific data sets (Supplementary Table 10). The KGP⁴⁶ data for Yoruba in Ibadan (YRI), Luhya in Webuye, Kenya (LWK), Utah residents with Northern and Western European ancestry (CEU), and Bengali in Bangladesh (BEB) were used. In addition, Malays from the Singapore Sequencing Malay Project⁵⁰; Black South Africans from Soweto (SEB2)³⁴; several populations from the study by Schlebusch et al.³⁰, namely several Khoe-San (KS), and COL groups (COLC and COLW), South-east Bantu-speakers (SEB1) and SWB were included in the comparisons. Moreover, Baganda (BAG) and Zulu (ZUL) whole-genome sequences from the AGVP data set⁹ were also included in the analyses. The data were merged using PLINK v1.9⁷⁵, and filtered to exclude SNVs and/or individuals with poor quality. For both PCA and ADMIXTURE the SNVs were pruned to select sample SNVs not in LD with each other, leaving ~197 K SNVs for analysis.

PCA plots

PCA analysis was done using PLINK v1.9⁷⁵. Further analysis was done using EIGENSTRAT⁷⁸ in order to estimate the statistical difference between the XHS and SOT.

Population structure analysis

Structure analysis was done using ADMIXTURE⁷⁹. For K = 3,..,10, 40 independent runs were performed using ADMIXTURE, which were averaged using CLUMPP⁸⁰. The minimum cross-validation score computed by Admixture is for K = 7. The tool Genesis (http://www.bioinf.wits.ac.za/software/genesis) was used to visualize the results from the PCA and population structure analyses.

Population differentiation

The analysis of the fixation index (F_ST) at the whole-genome level provides an estimate of the genetic distance between two populations and has been used extensively in inferring relationships between a set of populations^{9, 81}. We investigated the relationship between the southern African populations in our data sets and two distinct sets of populations known to be related to them; the Bantu-speaking groups (from South, West, and East Africa) and the KS populations from southern Africa. For this a merged data set consisting of the SAHGP data, Schlebusch et al.³⁰ and AGVP⁹ was generated. The Weir and Cockerham’s (WC) F_ST estimate⁸² was computed between the SAHGP and other groups using PLINK v1.9⁷⁵.

Local ancestry-based masking

Three data sets—the SAHGP, KGP⁴⁶, and Schlebusch et al.³⁰, all genotyped on the Illumina Omni 2.5 M SNP chip—were merged together using PLINK v1.9⁷⁵. The merged data set was phased using SHAPEIT2⁸³ with standard parameters. Analysis of local ancestry was performed using PCAdmix⁵⁸, with Ju/’hoansi, YRI, and non-African (CEU, CHB, and JPT) as the three ancestral populations and the SEB2 from Schlebusch et al.³⁰ as the target population). Based on the ascertainment of ancestry of all the 20 SNP windows, the windows showing <20% of YRI ancestry were masked out to generate a minimal non-admixed SEB data set.

f2 and rare variant sharing analysis

To compare rare allele sharing between the SAHGP and the KGP⁴⁶ data set, we merged the 15 SEB individuals (7 SOT, 7 XHS, and the ZUS) with the KGP⁴⁶ data sets and identified those variants that occur precisely twice in the merged data set (f ₂ variants)^{46, 84}. As the sample sizes in the two data sets were not uniform and an unbiased estimate of f ₂ sharing was difficult, instead of performing a complete f ₂ analysis we focused on those f ₂ variants that occur at least once in one of the 15 SEB and once in the KGP⁴⁶ data set. A similar analysis was performed using SNVs shared between only two populations irrespective of the minor allele frequency. This was mainly done to compensate for the small sample size in our study, which might have considered some SNVs to be singletons that could have been present multiple times if we had included more samples. Both these analyses were performed for the COL individuals. We also identified SNPs that occur in only one of the five continental population sets in the KGP⁴⁶ data and studied their distribution in the 15 SEB and the 8 COL individuals.

SNV density comparisons

To study the variation in SNV enrichment patterns within Africa, we compared SNV densities in the YRI and LWK from the KGP⁴⁶ data set to Zulu from the AGVP⁹ and SEB from the SAHGP data set. For this, we scanned the genome using 1 Mb sliding windows (with no overlap) and computed the number of SNVs occurring in that region in each population. The empirical distribution of SNV densities thus obtained for each population was used to assign a rank score and p-value to the density level observed for each window in that population. A similar scan was conducted using 25 kb windows. We noted that there are marked differences in sample size and coverage between data sets such as KGP⁴⁶, AGVP⁹, and SAHGP, and these factors could also result in differences in estimation of SNV densities. Therefore, we considered only the regions for which both Zulu and SEB were found to show similar SNVs densities and vary strongly with both of the other African populations.

ROH analysis

Three data sets—the SAHGP, AGVP⁹, and Schlebusch et al.³⁰, all genotyped on Illumina Omni 2.5 M SNP chip, were merged using PLINK v1.9⁷⁵ (Supplementary Table 12). An overall QC was performed on the merged data and SNVs with missingness greater than 0.05 and individuals with missingness greater than 0.05 were removed. We also excluded SNVs showing extreme deviation from Hardy–Weinberg equilibrium (p-value <1 × 10⁻⁷) from the data. The populations were merged according to linguistic and geographic affinities into superpopulations (Supplementary Table 12). To correct for possible ascertainment bias, SNVs with frequencies lower than 0.01 in any of the merged superpopulations were removed (Supplementary Table 12). This resulted in a data set containing around 500 K SNPs (total genotyping rate in this data set was 0.999182). Total ROH length and number of ROH segments were estimated using PLINK v1.9⁷⁵. By default, in PLINK v1.9 only ROH containing at least 100 SNVs, and of total length ≥1000 kb are noted. Therefore, we performed an additional analysis with the ROH window size set to 500 kb. The scanning window contained 50 SNVs and a scanning window hit was allowed to contain at most one heterozygous call and five missing calls. The Mann–Whitney U-test was used to test differences between the total lengths of ROH distribution in population and superpopulation pairs.

Regions of extreme differentiation between Sotho and Xhosa

To identify regions that show high F_ST variation within the SOT and XHS, the WC F_ST estimate for each SNV was computed using PLINK v1.9⁷⁵. A sliding window of 25 kb was used to scan the distribution of average F_ST scores across the genome and the top 0.005% windows showing highest F_ST scores were identified. As the WGS data include a lot of novel and population-specific SNVs, only the windows containing at least 10 SNVs that were found to be present in both the populations were considered (Supplementary Table 11).

Novel SNV identification and their genomic distribution

The novel SNVs reported in this study were identified by comparing the presence of the SNVs occurring in the 15 SEB samples to all SNPs in the dbSNP142⁴⁸, in the KGP⁴⁶ and AGVP⁹. To identify genomic regions enriched in novel SNVs, a sliding window of 1 Mb was used to scan the genomes and the regions showing most number of novel SNVs were selected. A similar data set was generated using 25 kb sliding windows.

Data availability

The WGS and the SNP-array data that form the basis of the findings reported in the study have been deposited in the in the European Genome-phenome Archive (EGA; https://www.ebi.ac.uk/ega/home; accession numbers: study: EGAS00001002639, sequence data set: EGAD00001003791, array data set: EGAD00010001418). Access to data is determined by a Data Access Committee (DAC: EGAC00001000734). Data access decisions can be passed to the EGA by emailing ega-helpdesk@ebi.ac.uk with the email address of each applicant and confirmation of the dataset(s) to provide access. The EGA will then create an EGA account with the relevant access permissions.

References

Tishkoff, S. A. et al. The genetic structure and history of Africans and African Americans. Science 324, 1035–1044 (2009).
Article ADS CAS PubMed PubMed Central Google Scholar
Lachance, J. et al. Evolutionary history and adaptation from high-coverage whole-genome sequences of diverse African Hunter-gatherers. Cell 150, 457–469 (2012).
Article CAS PubMed PubMed Central Google Scholar
Marks, S. J. et al. Static and moving frontiers: the genetic landscape of Southern African Bantu-speaking populations. Mol. Biol. Evol 32, 29–43 (2015).
Article CAS PubMed Google Scholar
Henn, B. M. et al. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans. Proc. Natl Acad. Sci. USA 108, 5154–5162 (2011).
Article ADS CAS PubMed PubMed Central Google Scholar
Henn, B. M. et al. Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. Proc. Natl Acad. Sc. USA 113, E440–E449 (2016).
Article CAS Google Scholar
Bustamante, C. D., Burchard, E. G. & De la Vega, F. M. Genomics for the world. Nature 475, 163–165 (2011).
Article CAS PubMed PubMed Central Google Scholar
Ramsay, M., Tiemessen, C. T., Choudhury, A. & Soodyall, H. Africa: the next frontier for human disease gene discovery? Hum. Mol. Genet. 20, R214–R220 (2011).
Article CAS PubMed Google Scholar
H3Africa Consortium. et al. Research capacity. Enabling the genomic revolution in Africa. Science 344, 1346–1348 (2014).
Article Google Scholar
Gurdasani, D. et al. The African genome variation project shapes medical genetics in Africa. Nature 517, 327–332 (2014).
Article ADS PubMed PubMed Central Google Scholar
Montinaro, F. et al. Unravelling the hidden ancestry of American admixed populations. Nat. Commun. 6, 6596 (2015).
Article CAS PubMed PubMed Central Google Scholar
Patin, E. et al. Dispersals and genetic adaptation of Bantu-speaking populations in Africa and North America. Science 356, 543–546 (2017).
Article ADS CAS PubMed Google Scholar
de Wit, E. et al. Genome-wide analysis of the structure of the south African coloured population in the Western Cape. Hum. Genet. 128, 145–153 (2010).
Article PubMed Google Scholar
Petersen, D. C. et al. Complex patterns of genomic admixture within southern Africa. PLoS Genet. 9, e1003309 (2013).
Article CAS PubMed PubMed Central Google Scholar
Patterson, N. et al. Genetic structure of a unique admixed population: implications for medical research. Hum. Mol. Genet 19, 411–419 (2010).
Article CAS PubMed Google Scholar
Phillipson, D. W. African Archaeology (Cambridge University Press, 2005).
Wallace, M. & Kinahan, J. A History of Namibia: from the Beginning to 1990 (Hurst & Company, 2011).
Badenhorst, S. Descent of iron age farmers in southern africa during the last 2000 years. African Archaeol. Rev. 27, 87–106 (2010).
Article Google Scholar
Huffman, T. N. & Herbert, R. K. New Perspectives on Eastern Bantu. Azania: Archaeological Research in Africa 29–30 (Taylor & Francis Group, 1994).
Russell, T. & Steele, J. A geo-referenced radiocarbon database for Early Iron Age sites in sub-Saharan Africa: initial analysis. South Afr. Humanit. 21, 327–344 (2009).
Google Scholar
Mitchell, P. & Whitelaw, G. The archaeology of southernmost Africa from c. 2000 BP to the early 1800s: a review of recent research. J. Afr. Hist. 46, 209–241 (2005).
Hall, S. Farming Communities of the Second Millennium: Internal Frontiers, Identity, Continuity and Change. The Cambridge History of South Africa (Cambridge University Press, 2010).
Huffman, T. N. Handbook to the Iron Age: the Archaeology of Pre-Colonial Farming Societies in Southern Africa (University of KwaZulu-Natal Press, 2007).
Hebinck, P. & van Averbeke, W. in Livelihoods and Landscapes: the people of Guquka and Koloni and their Resources (eds. Lent, P. & Hebinck, P.) 33–66(Brill, 2007).
Mitchell, P. in The Oxford Handbook of African Archaeology (eds. Mitchell, P. & Lane, P. J.) 471–488 (Oxford University Press, 2013).
Denbow, J. Congo to Kalahari: data and hypotheses about the political economy of the western stream of the Early Iron Age. African Archaeol. Rev. 8, 139–175 (1990).
Article Google Scholar
Hall, S. & Smith, B. Empowering places: rock shelters and ritual control in farmer-forager interactions in the northern province. Goodwin Ser. 8, 30 (2000).
Article Google Scholar
Jolly, P. Symbiotic interaction between black farmers and south-eastern san: implications for southern African rock art studies, ethnographic analogy, and hunter-gatherer cultural identity. Curr. Anthropol. 37, 277–305 (1996).
Article Google Scholar
Mitchell, P. in Interactions between Hunter-Gatherers and Farmers: from Prehistory to Present. National Museum of Ethnology (eds. Ikeya, K., Ogawa, H. & Mitchell, P.) 15–46 (Blackwell Publishing Ltd, 2009).
Pickrell, J. K. et al. The genetic prehistory of southern Africa. Nat. Commun. 3, 1143 (2012).
Article PubMed PubMed Central Google Scholar
Schlebusch, C. M. et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338, 374–379 (2012).
Article ADS CAS PubMed Google Scholar
Boeyens, J. C. A. The late iron age sequence in the marico and early tswana history. South African Archaeol. Bull 58, 63 (2003).
Article Google Scholar
Beck, R. B. The History of South Africa (Greenwood, 2013).
Schuster, S. C. et al. Complete Khoisan and Bantu genomes from southern Africa. Nature 463, 943–947 (2010).
Article ADS CAS PubMed PubMed Central Google Scholar
May, A. et al. Genetic diversity in black South Africans from Soweto. BMC Genomics 14, 644 (2013).
Article CAS PubMed PubMed Central Google Scholar
Shriner, D., Tekola-Ayele, F., Adeyemo, A. & Rotimi, C. N. Genome-wide genotype and sequence-based reconstruction of the 140,000 year history of modern human ancestry. Sci. Rep 4, 6055 (2014).
Article ADS PubMed PubMed Central Google Scholar
Kim, H. L. et al. Khoisan hunter-gatherers have been the largest population throughout most of modern-human demographic history. Nat. Commun. 5, 5692 (2014).
Article CAS PubMed PubMed Central Google Scholar
Montinaro, F. et al. Complex ancient genetic structure and cultural transitions in southern african populations. Genetics 205, 303–316 (2016).
Article PubMed PubMed Central Google Scholar
Chimusa, E. R. et al. A genomic portrait of haplotype diversity and signatures of selection in indigenous southern african populations. PLOS Genet. 11, e1005052 (2015).
Article PubMed PubMed Central Google Scholar
Lane, A. B. et al. Genetic substructure in South African Bantu-speakers: evidence from autosomal DNA and Y-chromosome studies. Am. J. Phys. Anthropol. 119, 175–185 (2002).
Article CAS PubMed Google Scholar
Uren, C. et al. Fine-scale human population structure in southern Africa reflects ecogeographic boundaries. Genetics 204, 303–314 (2016).
Article PubMed PubMed Central Google Scholar
Quintana-Murci, L. et al. Strong maternal Khoisan contribution to the South African coloured population: a case of gender-biased admixture. Am. J. Hum. Genet. 86, 611–620 (2010).
Article CAS PubMed PubMed Central Google Scholar
Daya, M. et al. A panel of ancestry informative markers for the complex five-way admixed South African coloured population. PLoS ONE 8, e82224 (2013).
Article ADS PubMed PubMed Central Google Scholar
Frazer, K. A. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861 (2007).
Article ADS CAS PubMed Google Scholar
Han, E., Sinsheimer, J. S. & Novembre, J. Fast and accurate site frequency spectrum estimation from low coverage sequence data. Bioinformatics 31, 720–727 (2015).
Article CAS PubMed Google Scholar
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Article PubMed PubMed Central Google Scholar
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Article ADS PubMed Google Scholar
Fuentes Fajardo, K. V. et al. Detecting false-positive signals in exome sequencing. Hum. Mutat. 33, 609–613 (2012).
Article CAS PubMed Google Scholar
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Article CAS PubMed PubMed Central Google Scholar
Busby, G. B. et al. Admixture into and within sub-Saharan Africa. Elife 5, e15266 (2016).
Article PubMed PubMed Central Google Scholar
Wong, L.-P. et al. Deep whole-genome sequencing of 100 southeast Asian Malays. Am. J. Hum. Genet. 92, 52–66 (2013).
Article CAS PubMed PubMed Central Google Scholar
Chimusa, E. R. et al. Genome-wide association study of ancestry-specific TB risk in the South African Coloured population. Hum. Mol. Genet. 23, 796–809 (2014).
Article CAS PubMed Google Scholar
Daya, M., van der Merwe, L., van Helden, P. D., Möller, M. & Hoal, E. G. The role of ancestry in TB susceptibility of an admixed South African population. Tuberculosis 94, 413–420 (2014).
Article PubMed Google Scholar
Bonner, P. & Segal, L. Soweto: A History (Maskew Miller Longman, 1998).
Li, S., Schlebusch, C. & Jakobsson, M. Genetic variation reveals large-scale population expansion and migration during the expansion of Bantu-speaking peoples. Proc. Biol. Sci. 281, 20141448 (2014).
Article PubMed PubMed Central Google Scholar
Pickrell, J. K. et al. Ancient west Eurasian ancestry in southern and eastern Africa. Proc. Natl Acad. Sci. USA 111, 2632–2637 (2014).
Article ADS CAS PubMed PubMed Central Google Scholar
Patin, E. et al. The impact of agricultural emergence on the genetic history of African rainforest hunter-gatherers and agriculturalists. Nat. Commun. 5, 3163 (2014).
Article PubMed Google Scholar
Barbieri, C. et al. Unraveling the complex maternal history of Southern African Khoisan populations. Am. J. Phys. Anthropol. 153, 435–448 (2014).
Article PubMed Google Scholar
Brisbin, A. et al. PCAdmix: principal components-based assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations. Hum. Biol. 84, 343–364 (2012).
Article PubMed PubMed Central Google Scholar
Pemberton, T. J. et al. Genomic patterns of homozygosity in worldwide human populations. Am. J. Hum. Genet. 91, 275–292 (2012).
Article CAS PubMed PubMed Central Google Scholar
Kirin, M. et al. Genomic runs of homozygosity record population history and consanguinity. PLoS ONE 5, e13996 (2010).
Article ADS PubMed PubMed Central Google Scholar
Coelho, M. et al. On the edge of Bantu expansions: mtDNA, Y chromosome and lactase persistence genetic variation in southwestern Angola. BMC Evol. Biol. 9, 80 (2009).
Article PubMed PubMed Central Google Scholar
Barbieri, C. et al. Ancient substructure in early mtDNA lineages of Southern Africa. The Am. J. Hum. Genet. 92, 285–292 (2013).
Article CAS PubMed Google Scholar
Schlebusch, C. M. et al. MtDNA control region variation affirms diversity and deep sub-structure in populations from southern Africa. BMC Evol. Biol. 13, 56 (2013).
Article CAS PubMed PubMed Central Google Scholar
Montinaro, F., Davies, J. & Capelli, C. Group membership, geography and shared ancestry: genetic variation in the Basotho of Lesotho. Am. J. Phys. Anthropol. 160, 156–161 (2016).
Article PubMed Google Scholar
Cavalli-Sforza, L. L. & Bodmer, W. F. The Genetics of Human Populations. (W.H. Freeman, 1971).
Nurse, G. T., Weiner, S. J. & Jenkins, T. The Peoples of Southern Africa and their Affinities (Oxford University Press, 1987).
Pleurdeau, D. et al. ‘Of sheep and men’: earliest direct evidence of caprine domestication in southern Africa at Leopard Cave (Erongo, Namibia). PLoS ONE 7, e40340 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
Article ADS PubMed Central Google Scholar
State, P. F. A Brief History of the Netherlands (Facts On File, 2008).
Worden, N. Indian ocean slaves in Cape Town, 1695–1807. J. South. Afr. Stud. 42, 389–408 (2016).
Article Google Scholar
Miller, S. A., Dykes, D. D. & Polesky, H. F. A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res. 16, 1215 (1988).
Article CAS PubMed PubMed Central Google Scholar
Raczy, C. et al. Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics 29, 2041–2043 (2013).
Article CAS PubMed Google Scholar
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Article CAS PubMed PubMed Central Google Scholar
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Article PubMed PubMed Central Google Scholar
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Article PubMed PubMed Central Google Scholar
Weissensteiner, H. et al. HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Res. 44, W58–W63 (2016).
Article CAS PubMed PubMed Central Google Scholar
Van Geystelen, A., Decorte, R. & Larmuseau, M. H. D. AMY-tree: an algorithm to use whole genome SNP calling for Y chromosomal phylogenetic applications. BMC Genomics 14, 101 (2013).
Article PubMed PubMed Central Google Scholar
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Article CAS PubMed Google Scholar
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Article CAS PubMed PubMed Central Google Scholar
Jakobsson, M. & Rosenberg, N. A. CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23, 1801–1806 (2007).
Article CAS PubMed Google Scholar
The Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 46, 818–825 (2014).
Article Google Scholar
Weir, B. S. & Cockerham, C. C. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984).
CAS PubMed Google Scholar
Delaneau, O., Zagury, J. F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2012).
Article Google Scholar
Mathieson, I. & McVean, G. Demography and the age of rare variants. PLoS Genet. 10, e1004528 (2014).
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

We are grateful to the South African National Department of Science and Technology for funding this initiative under the umbrella of the Southern African Human Genome Programme (SAHGP). We thank Sister Debbie Joubert and the Western Province Blood Transfusion Service for recruitment of the COL study participants, Dr Nomlindo Makubalo for recruitment of the Xhosa-speakers from the Eastern Cape and Mr. and Mrs. Botha and Mrs. van der Berg for recruitment of the Sotho-speakers from the Free state Province, as well as all the participants for generously agreeing to share their data and biological samples. We also thank the participants of the SAHGP launch meeting held in 2011 from which this pilot project was initiated (Supplementary Note 11). A.C. was supported by the AWI-Gen Collaborative Centre funded by the NIH (1U54HG006938) as part of the H3Africa Consortium. M.R. is a South African Research Chair in Genomics and Bioinformatics of African populations hosted by the University of the Witwatersrand, funded by the Department of Science and Technology and administered by National Research Foundation of South Africa (N.R.F.). M.S.P. was funded by the South African Medical Research Council (Flagship and Stem Cell Extramural Unit awards) and the Institute for Cellular and Molecular Medicine (University of Pretoria). N.M. and S.A. were supported by the H3ABioNet NIH grant (U41HG006941).

Author information

Ananyo Choudhury and Michèle Ramsay contributed equally to this work.

Authors and Affiliations

Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, 2193, South Africa
Ananyo Choudhury, Michèle Ramsay, Scott Hazelhurst, Shaun Aron & Dhriti Sengupta
Division of Human Genetics, School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, 2000, South Africa
Michèle Ramsay & Himla Soodyall
School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, 2050, South Africa
Scott Hazelhurst
Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Tygerberg, 7505, South Africa
Soraya Bardien
Computational Biology Division, Department of Integrative Biomedical Sciences, IDM, University of Cape Town, Cape Town, 7925, South Africa
Gerrit Botha, Ayton Meintjes & Nicola Mulder
Division of Human Genetics, Department of Pathology, IDM, Faculty of Health Sciences, University of Cape Town, Cape Town, 7925, South Africa
Emile R. Chimusa & Raj Ramesar
South African MRC Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, 7925, South Africa
Alan Christoffels, Junaid Gamieldien & Mahjoubeh J. Sefid-Dashti
Department of Biochemistry and Genomics Research Institute, Centre for Bioinformatics and Computational Biology, University of Pretoria, Pretoria, 0083, South Africa
Fourie Joubert
Agricultural Research Council, Pretoria, 0184, South Africa
Jasper Rees
Department of Preclinical Sciences, School of Health Care Sciences, Faculty of Health Sciences, University of Limpopo, Mankweng, 0727, South Africa
Kathrine Scholtz
National Health Laboratory Service, School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, 2000, South Africa
Himla Soodyall
Department of Medical Sciences, School of Health Sciences, Faculty of Health Sciences, University of Limpopo, Mankweng, 0727, South Africa
Philip Venter
Department of Genetics, Stellenbosch University, Stellenbosch, 7600, South Africa
Louise Warnich
Institute for Cellular and Molecular Medicine, Department of Immunology, Faculty of Health Sciences, University of Pretoria, Pretoria, 0084, South Africa
Michael S. Pepper

Authors

Ananyo Choudhury
View author publications
You can also search for this author in PubMed Google Scholar
Michèle Ramsay
View author publications
You can also search for this author in PubMed Google Scholar
Scott Hazelhurst
View author publications
You can also search for this author in PubMed Google Scholar
Shaun Aron
View author publications
You can also search for this author in PubMed Google Scholar
Soraya Bardien
View author publications
You can also search for this author in PubMed Google Scholar
Gerrit Botha
View author publications
You can also search for this author in PubMed Google Scholar
Emile R. Chimusa
View author publications
You can also search for this author in PubMed Google Scholar
Alan Christoffels
View author publications
You can also search for this author in PubMed Google Scholar
Junaid Gamieldien
View author publications
You can also search for this author in PubMed Google Scholar
Mahjoubeh J. Sefid-Dashti
View author publications
You can also search for this author in PubMed Google Scholar
Fourie Joubert
View author publications
You can also search for this author in PubMed Google Scholar
Ayton Meintjes
View author publications
You can also search for this author in PubMed Google Scholar
Nicola Mulder
View author publications
You can also search for this author in PubMed Google Scholar
Raj Ramesar
View author publications
You can also search for this author in PubMed Google Scholar
Jasper Rees
View author publications
You can also search for this author in PubMed Google Scholar
Kathrine Scholtz
View author publications
You can also search for this author in PubMed Google Scholar
Dhriti Sengupta
View author publications
You can also search for this author in PubMed Google Scholar
Himla Soodyall
View author publications
You can also search for this author in PubMed Google Scholar
Philip Venter
View author publications
You can also search for this author in PubMed Google Scholar
Louise Warnich
View author publications
You can also search for this author in PubMed Google Scholar
Michael S. Pepper
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

M.R. and M.S.P. co-lead the SAHGP initiative, and the project was designed and coordinated by the core working group including M.R., M.S.P., S.B., H.S., R.R., J.R., K.S., P.V., N.M., F.J., S.H., and L.V. M.R. and H.S. obtained ethics approval for the study. The data analysis team was led by S.H. (PCA; STRUCTURE and Y chromosome analysis) and included A.C. (novel SNV characterization, LOF variant, f ₂, F_ST, SFS, and ROH analysis), N.M. (functional analysis), F.J. (variant calling), S.A. (variant calling), G.B. (functional annotation and data curation), E.R.C. (admixture), J.G. (functional annotation), M.J.S.D. (functional annotation), A.M. (functional annotation, SNV characterization, data curation, and mtDNA analysis), and D.S. (regional F_ST analysis, data visualization). All authors wrote the Methods section and notes on their analyses. M.R. and A.C. drafted the manuscript, and A.C. was responsible for coordinating Tables and Figures (including the Supplement). All authors read, commented on, and approved the manuscript.

Corresponding authors

Correspondence to Michèle Ramsay or Michael S. Pepper.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Information

Supplementary Data 1

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Choudhury, A., Ramsay, M., Hazelhurst, S. et al. Whole-genome sequencing for an enhanced understanding of genetic variation among South Africans. Nat Commun 8, 2062 (2017). https://doi.org/10.1038/s41467-017-00663-9

Download citation

Received: 29 September 2016
Accepted: 17 July 2017
Published: 12 December 2017
DOI: https://doi.org/10.1038/s41467-017-00663-9

This article is cited by

Precision ophthalmology: a call for Africa not to be left in the dark
- Lisa Roberts
Gene Therapy (2024)
Exome-based mutation screening in South African children with primary congenital glaucoma
- Nadia Carstens
- Saadiah Goolam
- Susan Eileen Isabella Williams
Eye (2023)
A pan-African pathogen genomics data sharing platform to support disease outbreaks
- Alan Christoffels
- Gerald Mboowa
- Yenew Kebede Tebeje
Nature Medicine (2023)
A roadmap to increase diversity in genomic studies
- Segun Fatumo
- Tinashe Chikowore
- Karoline Kuchenbaecker
Nature Medicine (2022)
VariantStore: an index for large-scale genomic variant search
- Prashant Pandey
- Yinjie Gao
- Carl Kingsford
Genome Biology (2021)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.