A map of copy number variations in the Tunisian population: a valuable tool for medical genomics in North Africa

Copy number variation (CNV) is considered as the most frequent type of structural variation in the human genome. Some CNVs can act on human phenotype diversity, encompassing rare Mendelian diseases and genomic disorders. The North African populations remain underrepresented in public genetic databases in terms of single-nucleotide variants as well as for larger genomic mutations. In this study, we present the first CNV map for a North African population using the Affymetrix Genome-Wide SNP (single-nucleotide polymorphism) array 6.0 array genotyping intensity data to call CNVs in 102 Tunisian healthy individuals. Two softwares, PennCNV and Birdsuite, were used to call CNVs in order to provide reliable data. Subsequent bioinformatic analyses were performed to explore their features and patterns. The CNV map of the Tunisian population includes 1083 CNVs spanning 61.443 Mb of the genome. The CNV length ranged from 1.017 kb to 2.074 Mb with an average of 56.734 kb. Deletions represent 57.43% of the identified CNVs, while duplications and the mixed loci are less represented. One hundred and three genes disrupted by CNVs are reported to cause 155 Mendelian diseases/phenotypes. Drug response genes were also reported to be affected by CNVs. Data on genes overlapped by deletions and duplications segments and the sequence properties in and around them also provided insights into the functional and health impacts of CNVs. These findings represent valuable clues to genetic diversity and personalized medicine in the Tunisian population as well as in the ethnically similar populations from North Africa.


INTRODUCTION
Copy number variations (CNVs) are considered as genomic structural variations ranging from 1 kb to multiple megabase pairs in length [1][2][3] . CNVs are likely caused by one single or a combination of multiple genomic rearrangements, such as unbalanced translocation, deletions, insertions, and duplications. Therefore, CNVs are generally observed as a gain or a loss of DNA segment copies that deviate from the normal diploid state. CNVs may influence phenotypes by changing gene dosage, interrupting coding sequences, creating novel fusion genes, or by altering the distance of a gene from its regulatory elements [4][5][6] . It has been assessed that up to 60% of the human genome encloses CNVs, which generally range in size from 100 to 500 kb 7 . These CNVs are major contributors to human genetic diversity.
Two models of CNV-phenotype associations have been suggested 8 . The first model encompasses common copy number polymorphisms (CNPs) with a frequency exceeding 1% in the general population. Genes spanned by CNPs are mainly enriched for biological functions and pathways related to drug response, immunity, and sensory perception 9,10 . They alter phenotypes by changing the dosage of genes or other functional elements, thus influencing complex traits such as HIV-1/AIDS susceptibility (MIM 609423), Crohn's disease (MIM 266600,) and glomerulonephritis in systemic lupus erythematosus (MIM 152700). CNVs also occur in genes encoding drug-metabolizing enzymes, including the cytochrome P450s (CYP2B6 and CYP2D6), which are susceptible to structural variations due to highly homologous pseudogenes. CNV distribution influences drug metabolism and are important in pharmacogenomics screening 11 . The second model involves rare and highly penetrant CNVs. These CNVs are responsible for the deletion or the duplication of large genomic segments resulting in genomic disorders such as Prader-Willi syndrome/Angelman syndrome (MIM 176270/105830, 15q11-q13 deletion), Williams-Beuren syndrome (MIM: 194050, 7q11.23 deletion), Potocki-Lupski syndrome (MIM:610883, 17p11.2 duplication), and Charcot-Marie-Tooth disease, type 1A (MIM:610098, 17p12 duplication) 12 .
In order to understand the extent to which CNVs influence phenotypes, deep analyses in both patient and healthy individuals are required. Different approaches, including quantification of hybridization to specific oligonucleotides 13 , clone arrays 14 , direct genome sequencing 15,16 , and single-nucleotide polymorphism (SNP) array [17][18][19] , allowed to explore CNVs, thus providing their global estimates of frequencies, distribution, and functional features in large population cohorts and HapMap samples 1,2,4,16,[19][20][21][22][23][24][25][26][27][28][29] . Although medical and clinical genetic studies have been widely performed in the Arab World known to display high rates of consanguinity and endogamy, little attention has been paid to potential variations linked to health in the region 30,31 . Therefore, information related to molecular pathogenesis and knowledge of gene variants segregating in the Arab genome is lacking as well as genotype-phenotype correlation of genetic conditions for both monogenic and multifactorial diseases.
Studies focusing on the characterization of CNVs in the Arab World are not available, except one on the Qatari population 16 . In this study, we applied Affymetrix Genome-Wide Human SNP Array 6.0, which was designed for both SNP and CNV detection, to explore genome-wide CNV in the Tunisian population. Tunisia is a North African country with 11 million inhabitants. The native background population is Berber and the genetic properties of the present population are shaped by the multiple invasions and the migratory waves of allogenic populations and ethnic groups mainly from the Middle East and Europe 32 . In addition, like other countries from North Africa and the Middle East, the Tunisian population depicts high rates of consanguinity and endogamy, leading to the expression of recessive genetic diseases at relatively high frequencies and in several cases leading to comorbidity 30,33,34 . Because of the relatively high inbreeding rates in this population 35 , it is likely that CNVs, alongside SNPs and indels, play a role in the inherited disease risk burden. In the present study, we provide the first comprehensive Tunisian CNV map and performed functional analysis on CNV overlapping genes in order to understand their role in conveying disease risk.

Characteristics of CNVs identified by PennCNV
A total of 4591 CNV calls on 102 individuals (73 males and 29 females) were merged into adjacent CNVs leading to 4573 CNVs. CNV carrier rate was 75.5%. After filtering unreliable CNV calls from telomeric, centromeric, and immunoglobin regions and removing CNV with <10 probes and according to length, 3964 CNV events were obtained. In this dataset, an average of 38.86 CNV per individual with a ratio of deletions to duplications of about 2:1 was identified (Supplementary Table 1). The number of CNVs per individual ranged from 17 to 63 ( Supplementary Fig. 1). The mean size of a CNV was 96.7 kb. The median size of duplications is~57.170 kb, which is larger than that of deletions (Wilcoxon test, p value < 2.2e − 16) (Supplementary Table 1 and Supplementary Fig. 2a, b).
Characteristics of CNV loci (CNVR) identified by PennCNV By merging overlapping CNVs into CNV regions (CNVRs), we identified 751 CNVRs with sizes ranging from 1.02 kb to 3.184 Mb and an average size of 104 kb (Supplementary Table 2). Among these CNVR, we identified 469 loci containing only deletions (lossloci), 173 loci containing only duplications (gain-loci), and 109 loci containing both deletions and duplications (mixed loci). These 751 CNVRs are covering 78.072 Mb of the genome with a sum of lossloci length of 23.102 Mb, which is slightly larger than that of gain-loci (22.458 Mb). Nevertheless, the CNVR deletion length median (19.380 kb) was significantly lower than that of gain-loci (61.430 kb) (Wilcoxon test p value < 2.2e − 16) similarly for the individual CNVs. Moreover,~40% of CNVR were <20 kb and the majority of these segments were <100 kb (78.5%) (Supplementary Table 3 and Supplementary Fig. 3a). About 60% of the CNVR in this size range (100 kb) were loss-loci, whereas duplication loci represented only 14.24%.

Characteristics of CNVs identified by Birdseye (Birdsuite)
After filtering spurious CNVs, a total of 6263 segments have been called with an average of~61.4 CNV per individual (Supplementary Table 1). The count of deletions was nearly three times that of duplications. The number of CNVs per individual ranged from 38 to 325. The mean size of a CNV was 46.770 kb (Supplementary Table 1). The median size of duplication is~42.770 kb, which is larger than that of deletions (Wilcoxon test, p value < 2.2e − 16) (Supplementary Table 1 and Supplementary Fig. 2c, d).
Characteristics of CNV loci (CNVR) identified by Birdseye (Birdsuite) We also merged Birdseye output that overlapped into CNVR similarly to the PennCNV output analysis. The 6263 called CNVs were collapsed into 1236 regions, of which 546 were loss-loci, 603 were dup-loci, and the remaining were 87 mixed loci (Supplementary Table 2). The 1236 CNVRs identified by the Birdseye data cover 65.607 Mb of the nucleotide sequence. The sum of the loss-locus lengths (18.971 Mb) is lower than that of duplication loci (29.068 Mb) (Supplementary Table 2). As for the PennCNV data, the median of the loss-locus lengths (13.330 kb) is lower than that of the duplication loci (16.140 kb) in the data generated by Birdseye (Supplementary Table 2). However, this difference is not significant (Wilcoxon test p value = 0.8451).
Comparison of CNVRs generated by both algorithms (PennCNV and Birdsuite) Significant differences were found when comparing CNVR parameters of data generated by PennCNV and Birdsuite (Supplementary Tables 2 and 3 and Supplementary Fig. 3a, b). The proportion of loss-loci (62.4%) identified by PennCNV data was higher than that for the Birdseye data (44.2%). Nevertheless, the proportions of the duplication (48.8%) were higher in the Birdseye data than that of PennCNV (23%). This difference was significant (χ 2 test p value < 2.2e − 16). In addition, the median length of CNVRs generated by PennCNV (28.010 kb) data is twice higher than that of Birdsuite data (16.1 kb) (Wilcoxon test p value < 2.2e − 16). In addition, PennCNV tends to call CNVs that collapse into deletion regions longer than those of Birdsuite (Wilcox test p value = 3.873e − 06). Similarly, the duplication CNVRs of PennCNV data are longer than that of Birdsuite (Wilcoxon test p value < 2.2e − 16). No significant difference was noted for mix-loci lengths between the two algorithms. In addition, CNVR length distributions between the two algorithms were significantly different (Kolmogorov-Smirnov test p value = 9.992e − 16). This was also the case for loss-loci and dup-loci lengths (Kolmogorov-Smirnov test p value = 4.987e − 05 and p value < 2.2e − 16, respectively) (Supplementary Tables 2 and 3).
Concordance of PennCNV CNVR As CNV detection using microarrays is usually plagued with poor specificity or a high false-positive rate, and as there is a significant difference between the performance of CNV detection algorithms as shown earlier (Supplementary Tables 1, 2, and 4), we aimed to look for overlapping regions between these two datasets as a concordance and an in silico validation step. Only those detected by both algorithms have been considered. Seventy-eight percent of PennCNV loci overlapped with those of Birdseye (Supplementary Fig. 4a). Consequently, we found 586 loci with 50% reciprocal overlap with Birdseye data on the 22 autosomes called validated CNVR (vCNVR) (Supplementary Fig. 4b). Moreover, deletion and duplication states of PennCNVR output were consistent with Birdseye data in 79% loci. The vCNVR length ranged from 1.02 kb to 2.074 Mb with an average of 90.3 kb. In all, 78.32% of the vCNVR were <100 kb ( Supplementary Fig. 5). About 56% of these segments were deletions (del-loci) and 23.2% duplications (dup-loci). Indeed, these 586 vCNVR comprise 102 homozygous deletions (CN = 0), 424 single-copy deletions (CN = 1), 187 singlecopy duplications (CN = 3), and 82 amplifications (CN = 4).
The CNP length ranged from 1.017 to 487.878 kb with an average of 22.827 kb. In all, 94.87% of the genotype CNPs are <100 kb ( Supplementary Fig. 7). About 61.05% of these segments were exclusively deletions (417 del-loci) and 16.54% exclusively duplications (113 dup-loci). The count of the CNP loci deletions was therefore near 3.69 times that of duplications (Supplementary  Table 6). Similarly, the gain CNP loci are longer than loss CNP loci (Wilcoxon test p value = 3.8 × 10 −08 ). In addition, CNP length distributions between the two types of loci were significantly different (Kolmogorov-Smirnov test p value = 4.43 × 10 −07 ).
Characteristics of the global CNV map As vCNVR and CNP overlapped (169 overlapping vCNVRs), we merged them by union together and checked them manually in order to provide a global CNV map of the Tunisian population. After removing likely false-positive segments with frequency >90% that were absent from the 1000 Genomes project and Genome Aggregation database (gnomAD) databases, the global CNV map is composed of 1083 CNVs (Supplementary Fig. 8  The pairwise correlation between the CNV frequency of our dataset overlapping with those of the 1000 Genomes was positive, high, and significant (Pearson correlation r = 0.70, 95% confidence interval (CI) = [0.67;0.73], p value < 2.2e − 16) (Fig. 2). The pairwise correlation was also positive, high, and significant at the population level among the five continental groups of the 1000 Genome project: Africa (AFR): r = 0. 6 Fig. 9a-e). These results suggest that the CNV profile of the Tunisian population is similar to the European population and different from the African.
Pathway analysis of these genes revealed 16 significant enriched pathways of potential concern for public health (Table 3). Furthermore, standard functional annotation using the GO terms and pathways showed significant cellular component terms relating to the nervous system (Supplementary Table  9). Biological processes and molecular functions of genes bearing CNVs were also related to physiology of the nervous system, drug metabolism, carbohydrate metabolism, immunological system, cardiac and lung organs, and olfactory and auditory systems (Supplementary Table 9).
Mapping CNV genes to diseases and phenotypes In order to determine whether CNV might play a role in disease expression in the Tunisian population, we queried the 597 RefSeq genes affected by the CNV segments against the GAD database 36 , which is an archive of human genetic association studies of complex diseases. We found genes associated with 255 diseases clustered in 10 disease classes in this database (Supplementary Table 10). Significant disease classes include diseases of the immune and blood systems and metabolic diseases (Supplementary Table 10). Diseases of the nervous system are also present. CNV genes seems also to underlie or to be associated with aging and ocular disorders.
In addition, these RefSeq genes were compared to the database of Online Mendelian Inheritance in Man (OMIM). One hundred and three genes were reported to cause 155 Mendelian diseases/ phenotypes (Supplementary Dataset). According to the World Health Organization international classification of diseases (WHO ICD), genic CNVs are responsible for three major disease groups: (1) diseases of the nervous system (20%), (2) the congenital malformations, deformations, and chromosomal abnormalities (19.35%), and (3) endocrine, nutritional, and metabolic diseases (13.55%). Mental and behavioral disorders as well as neoplasm represented 7.74% and 6.45% of the diseases, respectively (Fig. 6). Among the CNV predisposing diseases, 41.93% are autosomal recessive (AR) and 30.32% autosomal dominant (AD) (Fig. 5).
In order to identify genes that have evidence of disease, we performed a manual annotation of CNV-overlapped genes listed in OMIM to determine the number of exons and introns that were potentially disrupted by these CNVs. Consequently, we focused on functional deletions that affect exons as well as functional duplications that either overlap at least one entire gene, thus likely leading to an increased dosage, or those that are internal to the gene that potentially could disrupt the protein Heterozygous deletions leading to the loss of genomitranslation frame. In addition, we discarded all intronic events. Sixty-three (63) disease genes were identified as harboring 54 CNVs (Table 4). The list of OMIM disease genes was then split into two groups. The first one includes CNVs that have been previously reported in the 1000 Genome project, gnomAD, the DGV, and dbVar databases, as well as those identified in the studies of DDD and Ira's lab. The second one contains disease genes affected by the novel, likely Tunisianspecific CNV, that were not reported by the mentioned queried databases.
In the former group, we reported 56 genes affected by 53 known CNV segments (13 deletions, 27 duplications, and 13 mixed loci) ( Table 4). Frequency of these segments range from 0.98 to Table 2. 100% (Table 4). Among these CNV known segments, heterozygous deletions in two individuals have been detected spanning from exons 6 to 7 in the ABO gene. Three exons have been deleted at the heterozygous state in the STRC gene known to cause deafness.
Two exons are also deleted at the heterozygous state in the COX10 gene responsible for the Leigh syndrome. Heterozygous deletions leading to the loss of genomic segment spanning from the start of the transcript to intron 21 in the spermatogenic failure disorder       gene have also been reported. As these disorders are AR and the individuals harbor these deletions at the heterozygous state, consequently, they do not express the corresponding phenotypes. In addition, homozygous deletions (in 13% of the sample) and heterozygous deletions (in 23%) overlapping from intron 15 to the end of the CFH gene known to cause three diseases have been identified. These diseases are: basal laminar drusen (AD), complement factor H deficiency (AR, AD), and atypical hemolytic uremic syndrome (AR, AD). Unfortunately, no additional phenotype data are available to check if these CNVs were causative of one of these phenotypes in the genotyped individuals of our cohort. Moreover, 16 CNVs completely delete 18 genes including RHD. Complete gene deletions affect, for example, the CFHR1, CFHR3, and PMP22 genes known to predispose to hemolytic uremic syndrome and to cause Charcot-Marie-Tooth disease type 1A, Charcot-Marie-Tooth disease type 1E, Dejerine-Sottas disease, and Roussy-Levy syndrome ( Table 4).
Heterozygous deletion leading to the disruption of the coding sequence of the FRRS1L gene has been identified. This gene is the cause of the early infantile epileptic encephalopathy, an AR disease. Complete gene duplication in AR genes like CARS2 responsible for combined oxidative phosphorylation deficiency has been observed ( Table 4).
We also examined OMIM-gene-containing likely novel CNVs. In this group, we identified one singleton duplication affecting the ALG9 gene and the complete PPP2R1B gene ( Table 4). The two genes are known to be the causes of the congenital disorder of glycosylation type II and lung cancer, respectively (Table 4).
Linkage disequilibrium (LD) analysis of CNV with nearby SNP We investigated the "taggability" (measuring the CNVs that are in high LD with flanking SNPs) of SNPs to CNVs in the Tunisian population. We restricted our LD analysis to polymorphic biallelic deletions identified in at least four individuals (frequency ≥ 3.9%). Therefore, 223 bi-allelic deletions (bi-deletions) were retained for this analysis (Supplementary Dataset). The LD analysis between bi-deletions and flanking SNP was performed within five windows (200 kb, 500 kb, 1 Mb, 2 Mb, and 3 Mb). As expected, all the bi-deletions had at least one neighboring SNP within the genomic region of either breakpoint. The maximum number and the median of SNPs in LD with CNVs were 15 and 4, respectively. This result indicates that half the bi-allelic deletions could be tagged by >4 SNPs and that some of these deletions could be tagged by 15 SNPs. Nevertheless, only 47 CNVs (21.07%) were in strong correlation (r 2 ≥ 0.8) with at least one SNP in all five tested windows. A total of 111 SNPs was in strong correlation (r 2 ≥ 0.8). Consequently, these findings highlight that the Affymetrix 6.0 SNP array is not adapted to identify bi-allelic deletion as the majority of them were not well tagged by the nearby SNPs. Furthermore, we evidenced that the strength of the r 2 value decreases as the distance of the CNVs and the SNP increases (Fig. 6). In order to bring out whether not well-tagged bi-allelic deletions tend to be located in the genomic regions where SNP markers are sparse, we performed a correlation analysis. The Spearman test result suggests the absence of any pattern (Spearman's rank correlation rho p value = 0.56) (Supplementary Fig. 10). Nevertheless, a weak association is present between the correlation (r 2 ) and the distance of SNP from the CNVs (Spearman's rank correlation rho = 0.1; p value = 1.16 × 10 −3 ). However, smaller-sized CNVs were generally in strong correlation (Spearman's rank correlation rho = −0.21, p value = 1.08 × 10 −2 ) with more SNPs (Spearman's rank correlation rho = −0.3, p value = 2.09 × 10 −6 ) ( Supplementary Fig. 11).
The 111 tag-SNPs have been analyzed with RegulomeDB to assess their potential functional effect. Only eight variants (7.14%) are considered known expression quantitative trait loci (eQTL) for genes and thus have been shown to be associated with expression as their RegulomeDB score has been estimated to 1f, affecting significantly the expression of many genes in different tissues (data not shown). Three other tag-SNPs are putative functional variants but without eQTL data, and thus no known direct effect on binding was reported. Their scores were low (2a, 2b, and 3a). Among these 11 functional tag-SNPs, only one (rs7542235) has been described to be associated with advanced age-related macular degeneration, age-related macular degeneration with geographic atrophy, and age-related macular degeneration with neovascularization. Due to the unavailability of clinical information on the participants, these phenotypes could not be confirmed.
Furthermore, only 22 genic CNVs (22/223 = 9.9%) were tagged by array SNPs (r 2 > 0.8). Four bi-allelic deletions are overlapping with OMIM genes known to predispose to the hemolytic uremic syndrome, albinism, hereditary neuropathy, mental retardation, and ventricular fibrillation ( Table 5). The CNP147 is in high LD with the rs7542235 SNP. This 71-kb deletion spans the entire CFHR1 and CFHR3 locus that is associated with an increased risk of the hemolytic uremic syndrome and a decreased risk of age-related macular degeneration. The one-copy deletion frequency is 44% and the homozygous deletion frequency is 14%.
We also investigated the potential role of bi-allelic deletions in their etiology of complex diseases or traits. Therefore, we calculated the correlation r 2 between bi-allelic deletions and the SNPs available on the NHGRI-GWAS catalog. Only one GWAS-SNP (rs4704970) was found to be in high correlation with one CNP (CNP877) among the SNPs that have been found to be associated with various complex diseases and traits in the NHGRI-GWAS catalog. Therefore, we redefine our threshold to r 2 > 0.5. Only three SNPs in high correlation (r 2 > 0.68) with three CNVs were identified to be correlated with three diseases or phenotypes, such as high-density lipoprotein cholesterol and triglyceride levels, multiple sclerosis, and obesity (Table 6). One SNP was in perfect correlation (r 2 = 1) with one CNP. The rs4704970 was in perfect correlation with the CNP877. This CNV locus is a 5.957-kb deletion located in chromosome 5 and is located 18.263 kb from the SNP. No RefSeq genes overlap with this locus. The frequency of this deletion in the Tunisian population was high (24.75%), of which 23.76% were one-copy deletions and 0.99% were two-copy deletions. The remaining CNVs, CNP158 and CNVR_16_6_CNP2150, were also found in high LD with two SNPs (rs4844913 and rs11639988, respectively) known for their association with metabolic phenotypes. Fig. 6 The correlation between the r 2 and the distance between CNV deletion and single-nucleotide polymorphism (SNP). The decrease of the r 2 strength as the distance of the CNVs and the SNP increases is shown.

DISCUSSION
We present a pioneering study of the CNVs in the Tunisian population and provided for the first time a comprehensive map of these structural variations. In agreement with previous studies available from various cohorts and populations, our findings suggest that a high proportion of individuals (75.5%) carried at least one CNV 21,27,37 . We consistently identified reliable 1083 loci spanning 61.443 Mb of the Tunisian genome. It was shown to have more deletions than gains, while deletions seem to be shorter than duplications. Similar results have also been detected in other reports using different arrays, methods, and algorithms 8,23,27,38 . Such patterns have been associated partially with the bias of SNP genotyping arrays for detecting a greater number of deletions than duplications 39,40 . In addition, many detected CNVRs were featured by both losses and gains. These "mixed" loci potentially reflect recurrent copy number changes at the same locus.
The two used programs, PennCNV and Birdseye, implementing the hidden Markov model (HMM) algorithm, show better detection performance among the tools tested on the Affymetrix platform 41 . Nevertheless, we observed large variations in the generated calls, on which we applied a number of quality filtering criteria to minimize false positives and identify high-quality calls. It has been recommended that using multiple CNV calling algorithms and softwares designed for a genotyping platform instead of just one is a better strategy to decrease the falsenegative rate as well as to consider subsequently overlapping regions for downstream analysis 37,42 . In addition, for multi-allelic CNVs like those encompassing the salivary amylase and UDPglucuronosyltransferase genes, obtaining accurate genotypes is particularly challenging. This is observed not only for genotyping array technologies but also using sequencing techniques 43 . The availability of high-throughput sequencing (or next-generation sequencing (NGS)) projects would produce improved CNV calling standards, like the 1000 Human Genomes project and other new efforts 16,[44][45][46] . Large inconsistencies in the outputs of diverse CNV calling algorithms draw special attention to the importance to standardize array data collection, assessment of quality, and experimental validation. Consequently, in order to shed light on the burden of the CNVs on both disease susceptibility and normal phenotype variability, cautious experimental designs as well as accurate data filtering would be required. Hybrid approaches combining NGS and complementary analysis tools will help undoubtedly and correctly define the CNV structure and clear up its function.
In the absence of a Tunisian reference genome, our study gave important insights on CNV distribution that could be of a great impact on public health in Tunisia. In our sample, seven of the reported CNVs, which were all singleton CNVs, were not present in former data submitted to databases such as DGV, 1000 Genomes, gnomAD, and dbVar databases. Taking into consideration the relatively modest size of the studied sample, it is difficult to confirm the specificity of the identified events to the Tunisian population and to determine whether these singleton CNVs are true "rare event" or not.
Furthermore, the comparative analysis of our data with 1000 Genomes data gives insights on the Tunisian population structure. The pairwise correlation analysis suggests that the CNV profile of the Tunisian population is similar to the European and different from the African. Population stratification using principal component analysis with CNP genotype data shows that the Tunisian population clusters with European populations and differs from the African (Romdhane et al., unpublished). These results are in agreement with previous studies on the Tunisian population structure using different autosomal, Y chromosomal, and mitochondrial markers highlighting the heterogeneity of the Tunisian population and the contribution of the European in its genetic background [47][48][49][50][51] .
In general, CNVs can be categorized as pathogenic or benign. These latter CNVs are not often associated with detectable phenotypic outcomes as they overlap with non-functional genomic segments. Functional annotation of the reported CNVs has unveiled notable features as approximately half of the CNV in the Tunisian genome (48.38 %) affected known genes, suggesting that they could contribute to key traits in Tunisian individuals and might affect population health. Mapping genic CNVs to diseases highlighted the enrichment of three major diseases groups, which suggests that Tunisians are likely at risk of developing diseases of the nervous system, congenital malformations, chromosomal abnormalities, and endocrine, nutritional, and metabolic diseases if we take into account our previous results on genetic disease spectrum in the Tunisian population 33 .
The genic CNV segments were found to encompass coding and functional elements, several disease-related genes, and important drug-metabolizing genes that might contribute to the burden of chronic diseases. Homozygous deletions in our dataset are significantly gene poor as it has been previously reported 8 . It has also been shown that rare CNVs harbor more genes than common events 8 . Such relationship was not found in our study mainly due to the modest sample size that underpowered discovering such a pattern. We have also noticed an increased burden of CNVs in some Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways involved in infectious disease susceptibility. Among these genes, the KIR3DL1 gene that encodes killer cell immunoglobin-like receptors (KIR) was identified in a "mixed" locus. It has been reported that the counts of individual genes in the KIR locus directly influence early aspects of HIV-1 control 52 . This finding deserves further investigations in order to understand different degrees of susceptibility and resistance to HIV infection in Tunisia and other the Middle East and North African (MENA) populations 53,54 .
Furthermore, we noted enrichment in other medically relevant pathways, including starch and sucrose metabolism, carbohydrate digestion and absorption, and biosynthesis of unsaturated fatty acids. Among genes involved in such pathways were the salivary amylase gene AMY1 and the pancreatic amylase gene AMY2 for which copy number changes were previously shown to be determinant of metabolic states, such as body mass index, obesity, and insulin resistance 55,56 . These metabolic markers are present in diabetes, a serious population health problem that shows the highest prevalence in the MENA region with >10% of the population being diabetic 57 . Therefore, it would be preferable to investigate whether the amylase genes associated with diabetes in our population in addition to the predisposing genetic factors already described in Arabs 58,59 . Moreover, MGAM is another gene involved in such metabolic pathways and was associated with non-syndromic oral clefts, a common birth defect 60 . Deletions found at this gene should be further investigated as biomarkers of such congenital malformations in our population.
In addition, CNV genes are also enriched in drug metabolism and detoxification, as well as chemical carcinogenesis pathways. The most common example is UGT2B17, where CNV has been correlated with pancreatic, prostate, and colorectal cancer, in addition to other diseases related to differential testosterone concentrations [61][62][63] . UGT2B17 gene deletion (UGT2B17*2) has been associated with the pharmacokinetics of aromatase inhibitor drugs such as exemestane, as well as in bone health 64 .
It is important to recall that all genotyped individuals were phenotypically healthy at the moment of recruitment. Therefore, the possibility that the carried CNV might be responsible for a disease cannot be excluded in the absence of complete prospective phenotypic data as we observed a high proportion of deletions in coding sequences, even complete gene losses at homozygous states. Common polymorphic deletions have been essentially identified in coding exons of genes involved in sex steroid metabolism (UGT2B28 and UGT2B17), olfactory receptors such as OR51A2 and OR4F5, and drug response, namely CYP2A6, GSTT1, and GSTM1 65 . It has been suggested that these common deletion polymorphisms generally depict ancestral mutations being in LD with nearby SNPs 65 . Interestingly, non-coding sequences such as microRNAs and introns were found to overlap with several of our CNV loci. Consequently, they could affect different biological process regulation such as cell apoptosis, proliferation, and differentiation. MicroRNAs have also been linked to human diseases. Indeed, some studies have evidenced downregulation of microRNAs in tumors when compared with normal tissues as well as their upregulation expression due to CNV in lymphoma 66,67 . Moreover, it has been recently evidenced that intronic CNVs influence gene expression variability and splicing 68 . Therefore, the CNV overlapping coding genes, introns, and microRNAs found in our study could not only contribute to genetic diversity but also in disease susceptibility in our population.
The presence of CNVRs spanning functional sequences, including those associated with the disease, challenges the discrimination between benign and pathogenic CNVs 69 . Apparently, healthy individuals harboring deletions of partial and/or complete coding exons in genes known to be responsible for severe genetic diseases remain to be understood, as medically relevant genes harboring homozygous deletions have been identified in our study. Homozygously deleted genes have also been reported in the CNV map generated by Zarrei et al. 69 , involving OMIM genes such as UGT2B17, RHD, KIR3DL1, PSG1, HLA-DRB1, and HLA-DQA1. These genes were considered as non-essential and "dispensable" because they could be absent from the genomes of apparently healthy individuals 15,69 .
The assessment of the burden of "pathogenic" CNVs on health is challenging due to variable phenotypic manifestations between individuals harboring similar CNVs. A "double-hit" model has been advanced in order to explain the variable expressivity involving the 16p11.2 microdeletions and duplications [70][71][72][73][74] . These CNVs have been previously described in neurologic diseases like mental retardation, schizophrenia, and autism, and even in apparently healthy individuals [70][71][72][73][74] . In their study, the authors advanced that a first hit, such as the 16p12.1 microdeletion, in combination with a secondary hit like an epigenetic, environmental, or genetic insult could result in a more severe phenotype 75 . Phenotypic effects of CNVs on individuals could be influenced by several properties.
It has been reported that inherited CNVs are more likely to be benign, while those pathogenic tend to be enriched for de novo mutation 76,77 . In addition, the position of the CNV within the genome has a significant influence on the phenotype. Indeed, CNVs spanning genes important to development, dose-sensitive genes or regulatory sequences, are likely contributing to disease expression or predisposition. The production of a functional transcript has been evidenced after a gain or a loss of genetic material and depends on different gene features such as the coding phase of the non-affected exons and the presence of alternative splicing isoforms that may counterbalance for the depletion of the major transcript, as it has been illustrated by the study on the NRXN1 gene involved in schizophrenia 78 . Moreover, pathogenicity also appears to be proportional to the size of the CNV, as large CNVs likely affect multiple genes, whereas smaller CNVs affect fewer genes. Furthermore, the nature of the CNV itself is also mandatory for pathogenesis as duplications are noticed to have a smaller pathogenic burden than deletions 79 . This is the case for the SMN1 locus known to cause AR disease spinal muscular dystrophy. Deletions of this locus are found in 96% of SMA patients 80 , whereas persons harboring two or more copies of the SMN1 gene are typically healthy 81,82 . Decoding the complex relationship between CNV genotypes and apparent phenotype is challenged by confounding elements such as environmental factors and variable penetrance. CNV-associated phenotypes can be impacted by haploinsufficiency, genomic imprinting, and the presence of other genetic factors 77 . All these findings emphasize the importance of decoding this type of variation to decipher the required part of the genome for normal human development due to the gene function loss [83][84][85][86] . Our study is the first step in building a Tunisian-specific structural variation database, thus paving the way to assess the burden of rare and common CNV of the Tunisian "CNVariome" in a much larger cohort 87 .
In an additional analysis, we examined SNPs that are in high LD with CNVs. We took only 223 bi-allelic deletions. These variants (tag-SNPs for CNV) would serve to predict CNV alleles when genotyped and proxy CNVs in investigating associations between CNVs and disease. Such an approach would reduce genotyping costs because SNPs are presently much easier to genotype than CNVs. Some reports supposed that deletion polymorphisms are commonly in strong LD and segregate on ancestral SNP haplotypes 65,88 . Tagging SNPs were found for only 146 CNVs (65.76%). By this analysis, we aimed also to determine the usefulness of the Affymetrix 6.0 SNP chip to identify CNVs in the Tunisian population. The efficacy of this array was rather poor, with 40.54% of CNVs that were tagged at r 2 > 0.5 and 21.17% in strong LD. Similar findings were observed in the CNV study in the Qatari population when using the Illumina OMNI2.5M array 16 . This taggability gap could be attributed to the local SNP density paucity, thus influencing the LD level. A comparative LD analysis using two different SNP sets genotyped by different SNP arrays revealed that~80% of CNVs are in high LD (r 2 > 0.8) when HapMap phase 2 SNP set was used compared to~50% of CNVs that were in high LD with commercial SNP array sets 18 . An improvement of this taggability has been obtained with whole-genome sequencing SNPs as >70% of the deletions have been tagged by at least one SNP at r 2 > 0.5 and over 50% at r 2 > 0.8 in the Qatari dataset in agreement with LD data generated using 1000 Genome dataset 15,16 . Consequently, this fact suggests that CNV genotyping could be challenging using this or other commercial arrays in North African populations. Furthermore, correlation analysis between bi-allelic deletions and GWAS-SNP indicates that these loss segments could be likely causal variants because of their strong LD with GWAS-SNP. Notably, the high correlation between the CNP877 and the GWAS-SNPs near the SGCD gene involved in multiple sclerosis is consistent with a previous study 27 . Consequently, data on LD between SNPs and deletions highlight the advantage of a unique database integrating findings on SNP genotypes and structural variations and could be combined in eventual arrays conceived to genotype Tunisian cohorts, in addition, to imputing these deletions in Tunisian or in ancestrally and/or ethnically similar populations.
In summary, we provided the first genome-wide study of CNVs leading to a CNV map of the Tunisian population by generating a highly dense catalog of 1083 CNVRs in a cohort of 102 Tunisians using stringent QC criteria for CNV detection. Our study contributes to shedding light on this unstudied kind of variation in the Tunisian population. CNV genes reported here are involved in biological pathways relevant to public health. In addition, we brought out the first assessment of LD between bi-allelic deletions and SNPs in the Tunisian population, thus allowing their imputation in future studies of a matched cohort. Knowing the prevalence and characteristics of recurrent CNVs is clinically invaluable. Consequently, they should deserve further characterization and be systematically assessed in a larger cohort in order to assess significance and associations with the relevant diseases or traits in the Tunisian population. This specificity will be informative for the design of a population-specific clinical copy number array, the interpretation, and the assignment of the pathogenicity of such variations within our population. These findings gave first insights into the CNVariome of the Tunisian population and raise many questions regarding their contribution to health issues in the Tunisian population and ethnically similar North African populations, thus paving the way to precision medicine implementation in the MENA region.

Study population sample preparation and genotyping
A total of 135 healthy (free from any genomic disorder) unrelated Tunisian individuals (103 males and 32 females) originating from Northern, Central, and Southern Tunisia has been recruited. Participants' mean age is 48 ± 10 years. All individuals gave informed consent. Identities of the participants were kept anonymous and no personal identifiers were used. According to the Declaration of Helsinki Principles, ethical approval was obtained from the biomedical ethics committee of Pasteur Institute of Tunis (PV09/06, IRB# 0,000,000,044).
All samples have been genome-wide scanned using the Affymetrix Genome-Wide SNP Array 6.0 as mentioned in a previous study 89 . This array contains 906,600 polymorphic probes designed to detect both SNPs and CNVs as well as 946,000 non-polymorphic probes to call CVNs only. CNV probes were basically chosen for their genomic spacing and based on known CNVs available in the DGV (Affymetrix Inc.: Genome-wide human SNP array 6.0 Datasheet. Available at www.affymetrix.com 2009). Sixteen individuals were excluded from subsequent analysis as they had QC contrast values >0.4. The remaining 119 individuals were used for CNV detection.
Affymetrix Power Tool (APT) v1.8.6 was used to obtain genotype calls required for copy number estimation. Called SNPs were excluded if they do not fit the following criteria: minor allele frequency (MAF) >1%, genotype rate >95%, and p value for Hardy-Weinberg equilibrium (HWE) test over 10 −4 . This quality control step was performed using PLINK v1.07 90 leaving a total of 782,392 SNPs for subsequent analysis.

CNV detection algorithms and analyses
The two most used CNV detection algorithms, namely PennCNV v.1.0.3 91 and Birdsuite v1.5.5 92 , were used for both CNV detection and validation. In this study, we only focused on CNVs in the 22 autosomes because of the inaccuracy of CNV detection in sex chromosomes. Genomic coordinates for all CNVs detected in this study were mapped to the human genome assembly build 37 (hg19).
In the first step, CNV segments were identified using the HMM algorithm implemented in PennCNV. We followed the PennCNV-Affy Protocol available at this link (http://www.openbioinformatics.org/penncnv/ penncnv_tutorial_affy_gw6.html). The Log 2R and B-allele frequency values were obtained. PennCNV provided CNV-specific quality control (QC) metrics in order to find potentially poor CNV samples. Individuals with poor quality of signal intensity were removed if they had >100 CNV segments detected, along with wave factor >0.05 or Log 2R standard deviation >0.4 and B-allele drift >0.0125. The wave factor consists of the overall "waviness" or variation of signal intensity and the B-allele drift is the fraction of "abnormal" markers not clustering in the usual positions (0, 0.5, and 1). This value represents the median of all chromosomes and it is useful for detecting genotyping failure 42 .
In order to assess the copy number changes, we used the HMM model. It mainly executes segmentation of the log 2 ratio intensity data. In addition, it predicts copy number states for each segment. There are up to five delineating the following states: CN state = 0 (homozygous deletion), CN state = 1 (heterozygous deletion or single-copy loss), CN state = 2 (neutral copy number of normal diploid), CN state = 3 (single-copy gain), and CN state = 4 (amplification). CNVs with a normal copy (state = 2) are not incorporated in the final CNV report. In addition, we used a minimum number of ten SNPs overlapping with the CNV in order to prevent falsepositive CNV. We used also the same algorithm to merge adjacent CNVs because it splits large CNV (>500 kb) into smaller parts between 100 and 150 kb CNV calls. Spurious CNV calls in specific genomic regions (immunoglobulin, telomeric, and centromeric) were removed.
Along with PennCNV, we used the Birdsuite package 92 to detect copy number change. There are two components in Birdsuite. The first one is Birdseye, which is able to call rare copy number changes. As such, the CNVs detected by the two algorithms could be cross-validated on the 102 samples. Birdseye called 96,177 CNVs. Those with low confidence scores (<5), as recommended by the program, were excluded from subsequent analysis. Similarly, only CNVs on autosomes were used and we restricted the number of SNPs overlapping with each CNV to three markers. Therefore, 6670 CNVs with a normal copy are not included in the final CNV report. CNVs called in centromeric regions are likely to be false positives because SNP coverage in these regions is very low. Consequently, all the CNVs spanning centromeres, telomeres, and immunoglobulin regions were excluded from the analysis for both algorithm outputs. In addition, we excluded CNVs <1 kb or >3 Mb, leading to a final set of 6263 CNVs.
The genotype of 1316 CNPs can be defined by the second component of Birdsuite, which is a Canary used to determine the integer copy number state at each of these CNPs predefined on the Affymetrix 6.0 array 92 . These CNPs are distributed on all the autosomes as well as the heterochromosomes. They were identified in more than one HapMap II individuals and their sizes were also precisely determined 92 . The Canary algorithm was performed on the 102 Tunisians. As recommended by the software, only CNP on the 22 autosomes were kept for subsequent analysis as well as those with integer copy numbers detected with a high confidence score (>0.1). CNPs were filtered according to their size, only those ≥1 kb were kept.

Construction of CNV loci using PennCNV and Birdseye outputs
The CNVs identified by both PennCNV and Birdseye overlapped across individuals. Therefore, they were merged into discrete non-overlapping loci called CNVR. The boundaries of each locus were assessed by the union of all CNVs that are included in that given locus. A CNVR is then defined as the maximum region that is shared between all individuals harboring a CNV at the same locus. This step was achieved using the BedTools v2.25.0 utilities 93 . The CNVRs were classified into three classes, "loss" (loci encompassing deletions), "gain" (loci encompassing duplications), and "mixed" (loci encompassing both deletions and duplications). The CNV locus construction was performed in order to assess the CNV frequencies in the studied population.

Comparison of CNVRs detected by PennCNV and Birdseye
The CNVRs generated using the PennCNV and Birdseye outputs were compared as a concordance and in silico "validation" step. The reciprocal 50% overlapping method was used to compare the CNVRs identified by these two computational methods. CNVRs found to overlap with 50% of their lengths were considered as CNV locus and kept for further analyses. This final set of CNV was called vCNVRs.
Construction of the global CNV map and comparison to structural variants from the 1000 Genomes Project and gnomAD In order to construct the comprehensive autosomal CNV map in the Tunisian population, we merged vCNVRs and Canary results to generate the global CNV map. To further provide reliable data, we downloaded the 1000 Genomes phase 3 structural variant data in VCF format as reported in the original publication 15 (https://www.internationalgenome.org/) as well as the structural variant dataset from the gnomAD 94 . We then compared our CNVR dataset while requiring at least 50% of reciprocal overlap size using Bedtools. When CNVR candidates in our dataset matched multiple allele structural variants in these databases, we summed the frequency of all alternate alleles. CNVR with frequency >90% not overlapping with any structural variants from these databases were considered as potential false positive and removed. Pairwise correlation analysis of frequency between the CNV frequency was identified in the present study and that of the 1000 Genomes. The p value threshold for statistical significance is 5%.

Functional annotation of the global CNV map
Merged vCNVRs and allelic CNP of the global CNV map were searched to find overlapping genes using the RefSeq Gene annotations (the hg19 genome version) using the reciprocal 50% overlapping threshold. These genes were screened against the OMIM database to find out whether these copy number segments mapped to disease and phenotype genes. In addition, we also obtained disease-gene associations available from the GAD through the DAVID 6.8 bioinformatics suites in order to predict disease-metabolic pathway associations 84,95 . The GAD is the NIH-supported public collection of human genetic association studies of complex diseases 36 . This database includes the complete known gene-phenotype associations as well as non-Mendelian common complex diseases 36 . Furthermore, we considered gene ontology of the overlapping genes with this CNV map in order to estimate the enrichment of these genes when compared with other genes of the human genome using the DAVID bioinformatics resources 95 . In addition, pathway analysis of these genes was considered using the KEGG pathway database by the DAVID bioinformatics suite. The statistical significance cutoff was 0.05. In addition, the WHO ICD-10 version 2007 was used (http://apps.who.int/classifications/apps/icd/ icd10online/) in order to determine the genetic disorders distribution according to the affected tissue, process, system, or organ.

Identification of novel CNV loci
In order to establish whether a CNV locus is novel, we have compared our data to those available in the following databases: DGV, dbVar, the DDD study 96 , the study of Ira M. Hall's lab (IMH) 97 using the AnnotSV v2.3 software 98 in addition to those in 1000 Genomes, and gnomAD. A CNV was designed as a novel if it did not share at least 50% of its size with any CNV loci stored in these databases.

LD analysis
We performed a correlation analysis for CNV and nearby SNPs within five windows (200 kb, 500 kb, 1 Mb, 2 Mb, and 3 Mb). Called SNPs were previously subject to exclusion if they do not fit the following criteria: MAF over 1%, genotype rate over 95%, and p value for HWE test over 10 −4 . LD analysis was performed using the "--r2 --ld-snp" PLINK v1.07 command 90 . In this LD analysis, a bi-allelic model was applied as we only selected polymorphic bi-allelic deletions (frequency ≥ 4%). We used the squared Pearson's correlation (r 2 ) for correlation analysis implemented in the PLINK v1.07 software 90 . If an SNP with a high r 2 (>0.8) for a CNV locus allele was identified, then we designed the CNV allele as being in LD with that singlenucleotide variant. The "tag-SNP" was selected as the variant having the highest r 2 value from such SNPs.

Statistical analysis and data visualization
All the downstream analyses were performed using the statistical software R version 3.6.2 (http://www.r-project.org). For data visualization, the R package ggplot2 was used 100 . RCirocs was used to plot genome-wide distribution of CNV states 101 .

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.