Introduction

Copy number variations (CNVs) are major genomic structural variations (SVs) spanning large regions of chromosomes, ranging from several hundred base pairs to several megabases [1,2,3]. CNVs may exert profound effects on phenotype, primarily via gene dosage effects [4,5,6,7]; they may also create novel fusion genes, alter the distance of a gene from a regulatory element, or modify the number of protein-coding exons within a gene [8]. These genetic variations have been associated with a number of systemic autoimmune diseases [9,10,11], neuropsychiatric disorders [12,13,14], and other complex disorders [15,16,17]. Owing to their recombination mechanisms [18,19,20], CNVs are generally observed as a loss or gain of copies of a certain DNA segment compared to the reference genome. These copy number alterations often harbor genes that are responsive to environmental factors, such as the amylase gene (AMY) [6] and beta-defensin (DEFB) [21]. Therefore, CNVs are likely to be subject to natural selection [6, 22, 23].

Malaysia is a multi-ethnic, multilingual, and multicultural country. The Malays (MLY) comprise the major community, accounting for two-thirds of the population, whereas the indigenous populations (locally known as “Orang Asli”) comprise only ~0.6% of the total population in Malaysia [24]. The Orang Asli populations are generally categorized into three main tribes, namely, Negrito (NGO), Senoi (SNI), and Proto-Malay (PML). The NGOs are phenotypically similar to, but genetically distinct from, the African Pygmies. Most NGOs still live as “semi-” hunter-gatherers in remote areas. The SNIs are traditionally believed to be a slash-and-burn farming community, practicing both plantation and hunter-gathering lifestyles, whereas the PMLs are mainly farmers and workers. Both SNIs and PMLs have recently undergone different levels of urbanization. Numerous genome-wide SNP genotyping studies have indicated that these indigenous communities may have experienced long periods of isolation [24,25,26,27]; however, the impact of such isolation on the CNV architecture of these populations has not been examined.

While the majority of the populations in most parts of the world have been sampled, those in Southeast Asia (SEA) have rarely been included in such population studies [23]. To date, efforts in mapping CNVs of Southeast Asian populations are limited [28, 29], despite their contribution to genetic variation in the region, particularly the natives from Peninsular Malaysia (PM). Therefore, the present study aimed to construct a comprehensive CNV map for the four native populations of PM: MLY, NGO, SNI, and PML. Notably, we have included all six NGO sub-tribes from PM in this analysis, namely, Bateq, Mendriq, Kensiu, Jehai, Kintak, and Lanoh. The population structures of the native populations and the impact of local adaptation in shaping their genomes were assessed using the mapped CNVs.

Results

Characterization of CNVs

In this analysis, all calls for copy number polymorphisms (CNPs) and rare CNVs were integrated. Two sets of CNV maps were constructed, namely, (i) the four PM indigenous populations together with the publicly available global HapMap populations (denoted herein as “global data set”); and (ii) the Peninsular Malaysia populations only (denoted herein as “PM data set”).

The CNV calls were annotated to the human genome reference hg18. A total of 3102 copy number variable regions (CNVRs) were obtained from the global data set (denoted herein as global CNV map; Table S1), and 929 from the PM data set (denoted herein as PM CNV map; Table S2). On average, >100 CNVR events per individual were detected (Figure S1 and Table S3). The African population harbored the most CNVRs (156 CNVRs per individual), in agreement with the recently published CNV map [23], whereas the populations from PM harbored the lowest number of CNVRs (117–137 CNVRs per individual). This discrepancy may be attributed to lower CNV diversity in the PM populations, although we do not rule out the possibility of ascertainment bias. Within the four PM populations, the highest and lowest CNV loadings were observed in PML and NGO, respectively (141 CNVRs vs. 120 CNVRs per individual from the PM map). The vast majority (~80%) of the CNV events were deletions. The CNVs were mapped to autosomes (Fig. 1). The global CNV map covers 223.8 Mb (~7.8%) of the autosomal genome, with a mean size of 72 kb (median: 18 kb), whereas the PM CNV map encompasses 59.8 Mb (~2.08%) of the autosomal genome, with a mean size of 64 kb (median: 16 kb). In contrast, on average, less than 1% of each individual genome was covered by CNVs (ranging from 0.48 to 1.5% for the global map). When compared against the Database of Genomic Variants (DGV) (dated July 23, 2015), the reproducibility of our CNVRs gradually decreased as the stringency of the overlapping criteria increased [30] (Figure S2).

Fig. 1

CNV map of the native populations of Peninsular Malaysia (PM). The map was generated based on (i) the data set from PM populations only; and (ii) the four PM indigenous populations together with the publicly available global HapMap populations

The majority (>75% and >60% of the two CNV maps, respectively) of the CNVRs had low frequency (<1% for the global map and <5% for the PM map, considering that the total sample size is <100 for the PM populations) (Figure S3). We exercised caution here because the number of available samples per population limited our ability to define “rare” CNVs (i.e., frequency <0.1%) in our CNV map (Figure S3), particularly for the PM CNV map. Furthermore, considering the small sample size of the PML tribe, we did not remove singletons from the CNV map.

Shared CNVs between populations

To elucidate the relationships among populations, CNVs that were shared between populations were assessed (Fig. 2). A total of 111 CNPs and 1740 CNVRs that were found in the global populations data set occurred only once in the PM populations, whereas 157 CNPs and 148 CNVRs were observed in all populations studied (herein denoted as common CNVs). Approximately 47% of the CNVs and 57% of the CNPs called were shared by at least two populations. On a separate note, the “population-private” CNVs, defined as CNVs that were observed only in one population (top green part of Fig. 2 and S4A), accounted for a substantial proportion of the CNVs in each population because: (i) the rare CNVs identified in this study were more likely to be detected only within a particular population; and (ii) the population-private CNVRs were mainly singletons (shadowed with 45° sloping lines in Fig. 2). Africans were found to harbor the largest number of such private CNVs, which is in line with the observations of Sudmant et al. [23]. We also observed that the singletons (i.e., CNVs that occurred in only one sample in a particular population) within each population (the thin bar with 135° sloping lines in Fig. 2 and S4A) accounted for ~46% and 25% of the total number of CNVRs and CNPs for each population, respectively, many of which occurred once in a population but were common in other populations (thin blue bar in Fig. 2 and S4A). The East Asian (EA) populations (i.e., CHB, CHD, and JPT) shared relatively more reciprocal CNVs that were restricted to the PM populations and vice versa, suggesting that these populations are genetically closer to each other than to other populations. Within the four PM populations, the highest pairwise sharing of CNVs was detected between NGO and MLY (Figures S4B-S4D), possibly because of the higher number of CNVs detected in these two populations (Fig. 2 and S4A).

Fig. 2

CNVR sharing among all populations. CNVRs in each population are divided into different parts: common in every population (red, bottom), shared only with YRI (dark gold), CEU (forest green), East Asian populations (grass green), and Peninsular Malaysia populations (dark red), population-private (light green at the top), and others (blue). The thin bar represents the singletons (Pop) within each population (shadowed by the 135° sloping lines); the singletons (All) counted across all the samples are shadowed by the 45° sloping lines

Population structure and genetic relationships

The fine-scale population genetic structure and corresponding relationships were then assessed based on the generated CNV map. We conducted principal component analysis (PCA) [31] using the shared biallelic CNVs (biCNVs) (N = 878) among the populations. As expected, the populations from three main continents were well separated along PC1 (Africans and non-Africans) and PC2 (Europeans and Asians). The PM populations clustered close to the EA populations (Fig. 3a). We then looked at the EA and PM populations and found three distinct clusters, namely, the Japanese in Tokyo (JPT), Han Chinese (CHB+CHD), and PM populations. MLY clustered between the EA populations and the PM Austro-Asiatic populations, i.e., the NGO and the SNI (Fig. 3b), suggesting that these urban Malays may be genetically admixed between the EA and their neighboring natives. We then further zoomed into the genetic relationships of the four PM populations. With the 878 biCNVs identified, the Malays represented a distinct population separate from the NGOs and the SNIs, with a few outliers (Fig. 3c). The PCA also exhibited sub-structures within the six sub-tribes of Negritos (Figure S5). This is likely because these indigenous hunter-gatherers may have been isolated from other populations for a long period of time, thereby resulting in a distinct genetic background, and because sub-structures might have arisen within the NGOs.

Fig. 3

Principal component analysis (PCA) using biallelic CNVs (biCNVs). Three independent PCAs were performed with different population data sets: a global populations, where the blow-up section shows the relationships between the East Asian populations; b Asian populations; c native populations of Peninsular Malaysia

By including the commonly shared biCNVs among the studied populations, we performed STRUCTURE analysis [32,33,34,35] with K = 2–10 (with five replicates). Distinct clusters appeared as K increased (Fig. 4 and S6). When K = 3, the PM populations appeared to carry the same ancestral component as the EA populations, with a small amount of admixture of the African and European ancestral components (Figure S6). At K = 4, the distinguishable Austro-Asiatic component (here referring to NGO and SNI) appeared, and when K = 5, JPT was separated from the Han Chinese (CHB and CHD). Collectively, this implies that the NGO and SNI are genetically unique, whereas the MLY and PML exhibited admixed patterns, in agreement with the PCA results (Table S4).

Fig. 4

Neighbor-joining tree and STRUCTURE analyses. The NJ-tree based on FST (upper panel) values is in agreement with the results of STRUCTURE analysis (K = 5; bottom panel)

Subsequently, population differentiation was measured using the unbiased FST statistic [36]. The pairwise population FST values were computed after CNV allele frequencies were inferred using an EM algorithm [37]. A neighbor-joining (NJ) tree was constructed with the FST-based distance [38, 39] (Fig. 4). The pattern of the clades coincided with the results of STRUCTURE analysis, and an earlier study using SNP genotyping [26]. The NJ tree revealed three major branches, with YRI as outlier. The EA nodes revealed that MLY and PML were more closely related to the EA populations, thereby suggesting some degree of gene flow from the EA populations, whereas NGO and SNI formed a unique clade distinct from the MLY and EA populations.

Linkage disequilibrium between CNVs and SNPs

A previous study has shown a decrease in linkage disequilibrium (LD) between CNPs and the flanking SNPs in Chinese populations compared to the European population [37]. However, whether the PM populations exhibit the same trend remains unclear. Thus, we investigated the ‘taggability’ of CNPs by SNPs (i.e., the proportion of CNVs that are in high LD with flanking SNPs) in the PM populations. We exercised caution in our analysis as the small sample size of each PM population may result in large variations in LD. Therefore, we gathered the SNI, PML, and MLY as a group (denoted as SEA in the figures), and NGO as another. CN deletions that showed strong correlation (r2 ≥ 0.8) with the flanking SNPs (in either 3 Mb or 5 Mb searching windows) exhibited generally similar frequencies across populations, ranging from 58.36% (CEU), 56.86% (NGO), and 54.87% (CHB) to 47.76% (SEA), which was in agreement with the findings of a previous study [37] (Table S5A). Stronger taggability of CN deletions has also been reported earlier [23].

In contrast to deletions, CEU had the lowest number of CN duplications that exhibited strong correlations with the flanking SNPs, possibly because CEU has fewer CN “bi-duplications” (about 50% lower than those in CHB) (Table S5A). When calculating for CNVRs, CEU again showed the highest taggability among the investigated populations, even with a relatively smaller number of CNVRs (Table S5B).

Population-specific CNVs

Population isolation and forces such as natural selection and genetic drift may result in population-specific genomic variants. Therefore, we identified population-specific CNVs that occurred exclusively in the four PM populations or those that showed significantly higher frequencies than the global populations studied.

The number of CN deletions that were highly differentiated between PM (except PML) and the HapMap populations decreased when the geographical distance of the populations decreased (Fig. 5 and S8). Two CN duplications had significantly higher frequencies in SNI (CNP11172; chr7:g.138438_163742; 68.75% in SNI) and NGO (CNVR2705; chr16:g.(68534936_68545887)_(68818240_68824276); 45.45% in NGO) (Table S6). When the three EA populations were pooled together, more PM population-specific CNVs were revealed (Table S6). We acknowledge that the PM private CNVs were also included in the analysis, although they were primarily singletons (Table S7).

Fig. 5

Number of population-specific CNVs across all populations. The number of population-specific deletion CNPs between PM populations and HapMap global populations decreases as the population geographic distance decreases

The “population-specific” CNVs identified showed high concordance with the records in DGV, implying the reproducibility of our CNV calls. We suspect that genetic drift or environmental pressure could have contributed, at least in part, to the emergence of highly differentiated CNVs in the PM populations compared to other global populations. In contrast, less than half of the “population-private” CNVRs could be found in DGV when a 50% reciprocal overlap criterion was applied, suggesting that there may be novel alleles that have yet to be detected, and that these CNVRs thus complement global CNV surveys.

Signatures of local adaptation

A recent study suggested that population-specific CNVs may be attributed, at least in part, to recent local adaptation of the hunter-gatherer Negritos [28]. We therefore explored further supporting evidence of signatures of local adaptation on the CNVs in our samples. The unbiased FST [36] was used to scan the genome-wide CNVs between pairs of populations, each pair comprising one PM population and one global population from this study (as reference). Then, the loci in the top 1% of the FST values of each pair were listed as candidate regions of local adaptation of the PM populations. Candidate genes underlying the putative regions of local adaptation were then annotated with RefSeq (dated 08/2013) (Table S8). Annotations and the enrichment analyses were performed on the candidate genes using DAVID (version 6.7) (https://david.ncifcrf.gov/).

Because the number of genes overlapping CNPs was limited (Table S8A), we only assessed the enrichments in NGO, as well as the pooled PM populations (SEA). The enrichment scores were 6.85 and 5.38, respectively. The top enriched terms were “defensin” (p values after Benjamini correction were 4.7e-12 and 7.2e-10, respectively; the same order applies to the following terms), “antibiotic” (p = 6.6e-11 and 1.0e-8), “antimicrobial” (p = 5.8e-11 and 8.7e-9), and “defense response to bacterium” (p = 6.0e-9 and 1.6e-6). This is in agreement with the fact that the hot and humid climate of the tropical rainforest favors various parasites and pathogens. Therefore, it is likely that the Orang Asli, particularly the hunter-gatherer Negritos, were exposed to various infection-related stresses. As for CNVRs, which included both common CNPs and rare/de novo CNVs (Table S8B), the gene cluster involving “defensin” or response to bacterium remained at the top of the list (with enrichment scores >3 and p values <0.001 after Benjamini correction) in all PM populations. In addition, in SNI, genes related to sensory perception were identified, e.g., olfactory receptor (OR), EYS, and the family of the taste receptor protein (TAS2Rs).

The underlying genes in the population-specific CNVs were found to be enriched in the category “response to anti-microbiome” for both the NGOs and the SNIs (Table S6 and S7). Several candidate genes drew our attention (Table 1). Notably, APOBEC3A_B, a fusion of the APOBEC3A and APOBEC3B genes due to a deletion of the genomic region linking them, was reported to contribute to immunity by restricting the transmission of foreign DNA. A recent study suggested that the deletion of APOBEC3B is strongly associated with susceptibility to malaria [40]. Furthermore, the beta-defensin gene family, including DEFB4B, DEFB103A, and DEFB104B, encodes a group of defensins that respond to bacterial invasion. In addition, the chemokine gene CCL3L1, which is associated with susceptibility to HIV-1 infection and autoimmune diseases, was identified as a signal of local adaptation. CNV duplications harboring the CCL3L1 gene occurred across all the studied populations, with the following frequencies in specific populations: CEU (97.87%), YRI (91.38%), NGO (78.18%), JPT (42.86%), CHB (38.20%), SNI (29.41%), PML (25.00%), and MLY (17.65%), thereby suggesting that this gene may have been subject to a different selective pressure in the Negritos compared to the other Peninsular Malaysians. CNVs that overlapped with the CES1 gene were significantly differentiated in MLY and SNI; this gene is responsible for the hydrolysis or transesterification of various xenobiotics and for drug clearance. In addition, a gene related to responses to radiation, MAT2A, was also observed, which may be attributed to exposure to sunlight (hence UV exposure), an important environmental stress in tropical rainforests.

Table 1 Candidate genes harbouring the CNVRs with signals of local adaptation. *Candidate genes derived from CNPs

Discussion

The complex histories and multiple inflows of populations in PM and the environmental pressures due to the distinct climate of the tropical rainforest have essentially shaped the unique genetic diversity of the human populations in this region. Although the most comprehensive global CNV map has been reported, numerous regions remain unexplored, e.g., natives of PM [23]. This study has narrowed the gap of the current CNV map. Approximately 72% of the CNVs found in the PM map are novel.

One advantage of this study is that both the populations and the samples are identical to those of our earlier study [26], which revealed the population genetic structure of the PM populations using SNP data. Therefore, the findings of this study can be thoroughly assessed and verified. Although different scales of genomic variants would capture different genomic information, the results of population genetic analyses using SNPs and CNVs show high concordance. The small proportion of the European component observed in the Malay population in the STRUCTURE analysis (K = 4) is in agreement with the findings of an earlier study [26], thereby serving as supporting evidence for this observed concordance. Collectively, our analysis supports the idea that the Malay population is likely an admixed population comprising two major ancestral components, namely, East Asian populations and Southeast Asian indigenous populations. In addition, the closer genetic affinity observed between SNI and NGO, as well as between the MLY and EA populations, is in agreement with the findings of earlier reports [26, 41].

The taggability of the SNPs and CNVs was relatively higher in CEU than our populations (Table S5), which is in agreement with the results of earlier studies [37]. We agree with the postulation of Lou et al. [37] that higher taggability may be attributable to ascertainment bias in both SNP discovery and array design between Asian and European populations.

We acknowledge that some of the population-specific CNVs identified in this study did not match those identified in a previous investigation [28] (Table S10), which may be due to the following reasons: (i) Mokhtar et al. [28] used a very stringent criterion to construct the CNV set, which included only calls that overlapped in two out of the three algorithms; therefore, some of the true positives might have gone undetected; (ii) the definition of population-specific CNVs differed between the two studies. They defined population-specific CNVs as those present in one but not in other populations, whereas we extended this to CNVs that are of significantly higher occurrence in one population compared to another.

Although selection signals for SNPs in various populations have been extensively explored, such analyses for CNVs are relatively scarce. The candidate genes underlying the signals of local adaptation were significantly enriched with genes related to defense against microbes and immune responses and pathways involved in taste transduction, particularly those in NGO and SNI. Essentially, these findings coincide with the physical and environmental attributes of the nomadic lifestyle of the NGO and SNI during prehistoric days, which exposed them to various forms of transmissible diseases and undernourishment; hence, selective pressure may have acted in response to these challenges. The results of our enrichment analysis are in agreement with the results of Mokhtar et al. [28], in which enriched genes related to immune system processes were also identified. It is interesting to note that although different candidate genes were identified in these two studies, the same biological pathway was identified in both, implying that a complex host immune system is essential in tropical rainforests and that positive natural selection would have acted on these PM populations. We investigated whether these putative signals of local adaptation are correlated with those identified by SNPs [26]; however, no correlation was observed (r2 ≥ 0.5 and 3 Mb window size). Rather, some of these were mapped to putative CNV regions of local adaptation (Table S11). For instance, a CNV located at chr4:g.(68943659_68964800)_(69234339_69378740) harbored six candidate SNPs that were fixed (i.e., with 100% frequency) in MLY and SNI and that also showed high frequencies in NGO. While this observation further confirms the respective local adaptation signals, we acknowledge that false-positive signals may be introduced due to the distribution of the probe design; hence, caution should be exercised when interpreting the findings. Considering the limitations of the current CNV data set, further confirmatory analyses are warranted.

Intriguingly, only a limited number of candidate genes underlying the signatures of local adaptation were in agreement with the findings of Deng et al. [26] (Table S12). Further investigation revealed that all the CNV signals, except for one, were singletons. However, considering our limited sample size, we do not rule out the possibility that these genes may have experienced multiple mutation mechanisms (i.e., SNPs and CNVs) in the evolutionary process.

In summary, we constructed the first CNV map of native populations from PM, thereby providing further insights into the CNV landscape of the human genome. We have also reported the first assessment of LD between CNVs and SNPs in the native populations of PM. Several putative candidate genes underlying the CNVs were proposed to have undergone local adaptation. We acknowledge that constraints of sample size and potential ascertainment bias could have confounded our results; however, the results on the CNV characteristics and population genetic structures are in agreement with the findings of earlier reports. Essentially, forces such as natural selection and genetic drift have shaped the unique genomic structure of the PM populations, thereby contributing to the emergence of distinct genomic variants that are involved in specific biological processes. Further investigations on other indigenous populations from Southeast Asia may facilitate the generation of a denser CNV map for a better understanding of the evolution of the human genome and its related medical implications.

Materials and methods

Populations and samples

Peripheral blood samples of 100 unrelated individuals comprising 17 MLYs (from Kelantan), 62 NGOs (from Bateq, Mendriq, Kensiu, Jehai, Kintak, and Lanoh), 17 SNIs (from Temiar), and 4 PMLs (from Temuan) were collected from distinct regions of PM. Informed consent was obtained from the participants. All procedures were in accordance with the ethical standards of the Research and Ethics committee as approved by the local Ethical Committees and the Department of Orang Asli Development (Jabatan Kemajuan Orang Asli, JAKOA) and the Helsinki Declaration of 1975, as revised in 2000.

Genotyping, CNV detection, and quality control

Samples were genotyped with an Affymetrix Genome-wide SNP array 6.0 genotyping platform according to the manufacturer’s instructions. CNVs were called using Birdsuite (1.5.5). The CNVs called comprised two major components: (i) copy number of the 1316 pre-defined CNPs (i.e., copy number polymorphisms); and (ii) rare/de novo CNV segments of each individual. Birdsuite also provides results combining the two parts [42]. The HapMap samples, including 58 YRIs (Yoruba in Ibadan, Nigeria), 47 CEUs (Utah residents with Northern and Western European ancestry from the CEPH collection), 89 CHBs (Han Chinese in Beijing, China), 90 CHDs (Chinese in Metropolitan Denver, Colorado), and 91 JPTs (Japanese in Tokyo, Japan), were also included in this study and independently called using the same tool and same procedure. To minimize the bias introduced by sample size, we grouped the sub-tribes of NGO together in most of our analyses.

The threshold for QC call-rate was set at 0.80. Seven NGO samples that failed the QC were removed from subsequent analysis. The final data set consisted of 17 MLYs, 55 NGOs (11 Bateqs, 10 Mendriqs, 6 Kensius, 16 Jehais, 5 Kintaks, 7 Lanohs), 17 SNIs, 4 PMLs, 58 YRIs, 47 CEUs, 89 CHBs, 90 CHDs, and 91 JPTs (Table S13). Due to the uncertainty of the performance of Birdsuite on the sex chromosomes, only autosomal CNVs were called. All probe coordinates were mapped to the human genome assembly build36 (hg18).

Generation of the CNV map

One pre-defined CNP, CNP11594 (chr9: g.38906782_65412427), which spans an unusually large region of more than 26.5 Mb across the centromere of chromosome 9, was identified. We considered it an artifact and excluded it from subsequent analyses. A total of 1290 autosomal CNPs were included in the library (Table S14). A confidence score cut-off of 0.1 was then used to decide the true CNP calls, as suggested by Birdsuite.

For integrated CNPs and rare/de novo CNVs, segments with lengths <1 kb, confidence scores <5, or covered by <2 probes were first removed. Then, CNVRs were called by merging the individual CNV segments, each comprising at least two probes, that overlapped across individuals [3]. Consequently, a CNV map consisting of 3102 CNVRs was constructed for all the 468 individuals included in this study, which we refer to as the global CNV map, and a CNV map consisting of 929 CNVRs was constructed from the PM populations, which we refer to as the PM CNV map.
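The filtering and merging steps above can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the `Segment` class, the function names, and the example coordinates are hypothetical, and segments are merged whenever they overlap by at least one base pair, which is a simplifying assumption about the overlap rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One CNV call in one individual (hypothetical container)."""
    chrom: str
    start: int         # bp, inclusive
    end: int           # bp, inclusive
    confidence: float  # caller confidence score
    n_probes: int      # number of array probes covered


def passes_qc(seg, min_len=1000, min_conf=5.0, min_probes=2):
    """Keep segments >= 1 kb, with confidence >= 5 and >= 2 probes."""
    return (seg.end - seg.start + 1 >= min_len
            and seg.confidence >= min_conf
            and seg.n_probes >= min_probes)


def merge_into_cnvrs(segments):
    """Merge QC-passing segments that overlap (>= 1 bp, an assumption)
    into regions, returned as (chrom, start, end) tuples."""
    kept = sorted((s for s in segments if passes_qc(s)),
                  key=lambda s: (s.chrom, s.start))
    regions = []
    for s in kept:
        if regions and regions[-1][0] == s.chrom and s.start <= regions[-1][2]:
            chrom, start, end = regions[-1]
            regions[-1] = (chrom, start, max(end, s.end))  # extend region
        else:
            regions.append((s.chrom, s.start, s.end))      # open new region
    return regions


# Made-up calls pooled across individuals.
calls = [
    Segment("chr1", 1000, 5000, 10.0, 5),
    Segment("chr1", 4000, 9000, 8.0, 4),     # overlaps the first call
    Segment("chr1", 20000, 20500, 9.0, 3),   # shorter than 1 kb, dropped
    Segment("chr2", 100, 3000, 2.0, 6),      # confidence < 5, dropped
    Segment("chr2", 5000, 8000, 6.0, 1),     # fewer than 2 probes, dropped
    Segment("chr2", 10000, 12000, 7.0, 3),
]
cnvrs = merge_into_cnvrs(calls)
```

Sorting by chromosome and start position allows a single pass over the calls, so each region is extended only by calls that begin inside it.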

The CNVs called have been submitted to dbVar with the accession number nstd156.

Population genetic analysis

Analysis of population structures and relatedness was conducted using PCA (EIGENSOFT version 5.0.2) [31] and STRUCTURE (version 2.3.1) [32,33,34,35] of biallelic CNVs. The Expectation-Maximization (EM) algorithm was applied to calculate the allele frequency of the CNVs on the basis of Hardy-Weinberg equilibrium (HWE), followed by unbiased FST [36] for each pair of populations. The NJ tree was generated using PHYLIP [38], MEGA [39], and Dendroscope [43], respectively, based on the pairwise population FST values of the CNVs. Considering the limited sample size of PML, we bootstrapped the other 8 populations 100 times to construct a consensus tree using the same procedure (Figure S7A). A simple pairwise distance (i.e., the genotype differences between pairs of individuals, normalized by the total number of CNVs) was also measured between every pair of individuals to build a separate individual-level NJ tree (Figure S7B). Missing data were excluded from the analysis.
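As an illustration of how an EM algorithm can infer CNV allele frequencies under HWE, the sketch below assumes a three-allele model (alleles carrying 0, 1, or 2 copies) and diploid copy numbers from 0 to 4. The model, the function name `em_allele_freqs`, and the inputs are assumptions made for this example; the exact formulation used in the study follows [37] and may differ. The only ambiguous genotype in this model is a diploid copy number of 2, which can arise from (1,1) or (0,2).

```python
def em_allele_freqs(copy_numbers, n_iter=200, tol=1e-10):
    """Estimate frequencies [p0, p1, p2] of the 0-, 1- and 2-copy alleles
    from diploid copy numbers (0..4), assuming Hardy-Weinberg equilibrium.
    Entries equal to None are treated as missing and skipped."""
    obs = [cn for cn in copy_numbers if cn is not None]
    if not obs:
        raise ValueError("no observed copy numbers")
    n = len(obs)
    p = [1.0 / 3, 1.0 / 3, 1.0 / 3]  # uniform starting point
    for _ in range(n_iter):
        counts = [0.0, 0.0, 0.0]     # expected allele counts (E-step)
        for cn in obs:
            if cn == 0:              # genotype (0,0)
                counts[0] += 2
            elif cn == 1:            # genotype (0,1)
                counts[0] += 1
                counts[1] += 1
            elif cn == 2:            # ambiguous: (1,1) or (0,2)
                w11 = p[1] ** 2
                w02 = 2 * p[0] * p[2]
                total = w11 + w02
                post11 = w11 / total if total > 0 else 1.0
                counts[1] += 2 * post11
                counts[0] += 1 - post11
                counts[2] += 1 - post11
            elif cn == 3:            # genotype (1,2)
                counts[1] += 1
                counts[2] += 1
            elif cn == 4:            # genotype (2,2)
                counts[2] += 2
            else:
                raise ValueError(f"unsupported copy number: {cn}")
        new_p = [c / (2 * n) for c in counts]  # M-step
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Made-up copy numbers for one CNV in one population (None = missing).
freqs = em_allele_freqs([0, 1, 2, 2, 3, 4, None])
```

For strictly biallelic CNVs the genotype follows directly from the copy number, so the EM step matters mainly for loci where both deletion and duplication alleles segregate.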

Analysis of LD between CNVs and SNPs

We assessed for LD between CNVs and flanking SNPs within windows of 3–5 Mb. In this analysis, a biallelic model was applied. The CNVs were categorized into deletions and duplications. Analysis was performed using PLINK [44] (http://pngu.mgh.harvard.edu/purcell/plink/, version 1.07). When searching for the tagging SNPs of CNVs, the r2 cut-offs were set to 0.2, 0.5, and 0.8 respectively. We were aware that the limited sample size of our study may introduce bias to the results; therefore, during analysis, the samples from PM were classified into (i) NGO and (ii) Peninsular Malaysian (denoted as SEA), which comprised MLY, SNI, and PML.
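The tagging search can be illustrated with a short sketch. Here r2 is computed as the squared Pearson correlation between CNV and SNP allele dosages (0, 1, or 2 per individual), which is an assumption of this example and not a reimplementation of PLINK; the function names, SNP identifiers, positions, and genotypes are all hypothetical.

```python
from math import isnan


def r_squared(cnv_dosage, snp_dosage):
    """Squared Pearson correlation between two dosage vectors.
    Pairs with a missing value (None) are skipped."""
    pairs = [(c, s) for c, s in zip(cnv_dosage, snp_dosage)
             if c is not None and s is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_c = sum(c for c, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    cov = sum((c - mean_c) * (s - mean_s) for c, s in pairs)
    var_c = sum((c - mean_c) ** 2 for c, _ in pairs)
    var_s = sum((s - mean_s) ** 2 for _, s in pairs)
    if var_c == 0 or var_s == 0:
        return float("nan")  # monomorphic site: r2 undefined
    return cov * cov / (var_c * var_s)


def best_tag(cnv_start, cnv_end, cnv_dosage, snps, window=3_000_000):
    """Return (snp_id, r2) for the SNP with the highest r2 among SNPs
    lying within `window` bp of either CNV boundary."""
    best_id, best_r2 = None, 0.0
    for snp_id, (pos, dosage) in snps.items():
        if pos < cnv_start - window or pos > cnv_end + window:
            continue
        r2 = r_squared(cnv_dosage, dosage)
        if not isnan(r2) and r2 > best_r2:
            best_id, best_r2 = snp_id, r2
    return best_id, best_r2


# Made-up data: one deletion and three candidate SNPs.
cnv = [0, 1, 2, 1, 0, 2]
snps = {
    "snp_near": (1_500_000, [2, 1, 0, 1, 2, 0]),  # perfectly anti-correlated
    "snp_weak": (900_000, [0, 1, 0, 1, 0, 1]),
    "snp_far": (9_000_000, [0, 1, 2, 1, 0, 2]),   # outside the 3 Mb window
}
tag_id, tag_r2 = best_tag(1_000_000, 1_020_000, cnv, snps)
is_tagged = {cut: tag_r2 >= cut for cut in (0.2, 0.5, 0.8)}
```

A CNV would then count as taggable at a given cut-off if its best flanking SNP reaches that r2.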

Population-specific CNVs

Two strategies were applied to define population-specific CNVs. First, deletions or duplications with significantly high frequencies in the test population were identified (Fisher’s exact test, p value with Bonferroni correction; 3.8e-5 for CNPs and 1.6e-5 for CNVRs). Second, as a complement to the power deficiency of Fisher’s test on the population-private CNVs, the population-private CNVs were also included in the analysis.
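The first strategy can be sketched with a one-sided Fisher's exact test and a Bonferroni threshold. The choice of a one-sided test, the family-wise error rate of 0.05, and the carrier counts below are assumptions of this example and are not stated in the text. Under that error rate, dividing by the 1290 CNPs and 3102 CNVRs gives thresholds of about 3.88e-5 and 1.61e-5, consistent with the reported values.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Rows: test population, reference population.
    Columns: carriers, non-carriers.
    Returns P(X >= a) under the hypergeometric null."""
    row1 = a + b          # size of the test population
    col1 = a + c          # total number of carriers
    n = a + b + c + d
    denom = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / denom


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold (alpha = 0.05 is an assumption)."""
    return alpha / n_tests


# Hypothetical counts: 11 of 16 carriers in the test population
# versus 5 of 300 carriers in the reference populations.
p_value = fisher_greater(11, 5, 5, 295)
is_specific = p_value < bonferroni_threshold(3102)
```

The exact test is used here instead of a chi-squared approximation because several populations contribute only a handful of individuals.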

Signals of local adaptation and gene enrichment analysis

For all studied populations, the EM algorithm was used to infer CNV allele frequencies, and the pairwise FST values were subsequently calculated for each CNV by comparing the populations from PM with the HapMap populations. Then, the CNVs were ranked according to their FST values for each pair, and those among the top 1% in the PM populations were identified (Table S8). The population-specific CNVs were also taken into consideration.
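The ranking step can be sketched as follows, given per-CNV FST values for one population pair. The function name, the CNV identifiers, and the FST values are hypothetical, and the rounding and tie handling are assumptions of this example.

```python
def top_fraction(fst_by_cnv, fraction=0.01):
    """Return the CNV ids in the top `fraction` of FST values,
    highest first; at least one id is returned for non-empty input."""
    ranked = sorted(fst_by_cnv.items(), key=lambda kv: kv[1], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return [cnv_id for cnv_id, _ in ranked[:n_top]]


# Made-up FST values for 200 CNVs in one PM vs. HapMap comparison.
fst_values = {f"cnv{i}": i / 1000 for i in range(200)}
candidates = top_fraction(fst_values)
```

Repeating this for every PM and HapMap population pair, and taking the union of the resulting lists, would give a candidate set for gene annotation.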

Enrichment and functional analyses of the candidate genes harboring the CNVs that were identified as signals of local adaptation were performed using DAVID Bioinformatics Resources 6.7 (http://david.abcc.ncifcrf.gov/).