Introduction

Copy number variation (CNV) represents the major portion of variation in the human genome with respect to size1, 2, 3, 4 and is known for its role in altering gene expression, thereby affecting genetic diversity, evolution and disease risk.5, 6, 7 The evaluation of the role of a CNV in disease risk relies on its frequency in normal population cohorts.8, 9 Many such cohorts have exhibited inter- and intra-population differences in CNV frequency distributions.10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 Beyond their relevance to disease risk, these differences explain a significant proportion of normal phenotypic variation.23, 24, 25, 26

In this context, we characterized the genome-wide architecture of CNVs in 286 healthy, unrelated subjects who had been assessed for musical aptitude and related traits.27 Our aim was to evaluate the role of CNV enrichment in music-related phenotypes. In a broader perspective, the sample set represents the isolated Finnish population, which has experienced multiple bottlenecks in its population history.28 Owing to founder effect and genetic drift, 36 monogenic disorders (caused by one major mutation) were enriched in this population, whereas many other rare monogenic disorders such as phenylketonuria and maple syrup urine disease were depleted.28, 29 Moreover, CNV remains poorly characterized in genetically isolated populations, including the Finnish population. Characterization of the genome-wide architecture of CNVs in this sample set thus enables genotype–phenotype correlation and provides novel insights into normal structural variation of a population isolate.

Methods

Study material

The study material comprised 286 healthy, unrelated Finnish subjects (167 females, 119 males; mean age of 55.21 years; range 18–94 years) who participate in the MUSGEN project, in which the molecular background of musical aptitude and related traits is studied.27 The participants neither reported any relatives in the study (based on a questionnaire) nor showed any close relatedness in identity-by-descent analysis. No medical information was available from the participants, but as far as we know, they are healthy. The Ethical Committee of Helsinki University Central Hospital approved this study. Informed consent was obtained from all participants. More information about the MUSGEN project and the sample recruitment can be found in the Supplementary information.

Genotyping

For genotyping, we used 200 ng of DNA extracted from the peripheral blood of each subject (no cell lines were used). All the samples were genotyped using the Illumina Infinium HumanOmniExpress-12v1.0 beadchip (730 K; San Diego, CA, USA), with an average overall call rate of 99.54%. Normalized signal intensity data were obtained through Illumina BeadStudio software. Normalized measures of total signal intensity (Log2 R ratios) and the relative allelic signal intensity ratio (B-allele frequencies) at each marker were used for CNV identification in all samples.

CNV detection

CNVs were identified using two algorithms, PennCNV30 and QuantiSNP,31 and only the consistent calls were retained for further analyses. All probe coordinates in this study were mapped to human genome build GRCh37/hg19. We followed two different approaches for constructing a CNV map: (1) the so-called copy number variable region (CNVR), built using the any-overlap criterion4, 8, 11, 12, 13, 16, 17, 18, 20, 21 and (2) the copy number variable cytogenetic region (CNVcR), representing any cytogenetic region that contains one or more CNVs. The study protocol is shown in Supplementary Figure S1. Detailed descriptions of all the methods with their associated references are provided in the methods section of the Supplementary material.
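As an illustration of the any-overlap criterion, the following is a minimal sketch of how overlapping CNV calls could be merged into CNVRs. It is not the pipeline used in this study: the function name `build_cnvrs`, the tuple layout, the inclusive coordinates and the example calls are assumptions made for the example, and deletions and duplications are not distinguished here.

```python
from collections import defaultdict


def build_cnvrs(cnvs):
    """Merge CNV calls into CNVRs with an any-overlap rule.

    cnvs: iterable of (chrom, start, end) with inclusive coordinates.
    Calls on the same chromosome that share at least one base end up
    in the same region, which spans their union.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in cnvs:
        by_chrom[chrom].append((start, end))

    regions = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # The call overlaps the region being grown: extend it.
                cur_end = max(cur_end, end)
            else:
                # No overlap: close the current region and start a new one.
                regions.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        regions.append((chrom, cur_start, cur_end))
    return regions


# Hypothetical coordinates, for illustration only.
calls = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 400, 500),
    ("chr2", 10, 20),
]
print(build_cnvrs(calls))
```

Sorting the calls by start position means each call only needs to be compared with the region currently being extended, so a single pass per chromosome is enough.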

Results

General characteristics of CNVs

We observed a total of 5493 CNV events in the 267 samples that passed the quality control evaluation. Of these, 3888 (70.7%) CNV events were deletions, whereas 1605 (29.3%) were duplications. Notably, 12.4% of the autosomal CNVs were homozygous deletions, whereas only 0.05% were four-copy duplications. On average, 20 CNV events (14 deletions and 6 duplications) were discovered per person. We observed at least one CNV of size >100 kb in almost 90% of the samples and of size >500 kb in 13% of samples, whereas 4% of the samples had a CNV >1 Mb (Supplementary Table S1). The total CNV size amounted to 287.83 Mb (147.67 Mb deletions and 140.15 Mb duplications), whereas the average size of a CNV per locus was 52.39 kb (37.98 kb for deletions and 87.32 kb for duplications). Approximately 75% of the total CNV events were <50 kb and 40% of them were <10 kb (Figure 1, Supplementary Table S2). A recent study32 has provided evidence for age-related accumulation of CNVs. In our sample, we did not find a statistically significant difference in the numbers of CNVs, large CNVs or novel CNVs between elderly (>60 years of age; N=107) and middle-aged (<55 years of age; N=133) subjects. Unfortunately, longitudinal follow-up of the appearance of CNVs was not available in this study.
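The per-sample and per-locus averages quoted above follow directly from the reported totals. The short sketch below reproduces that arithmetic; the variable names are ours, 1 Mb is taken as 1000 kb, and small differences in the last digit are expected because the inputs are already rounded.

```python
# Totals as reported in the text.
n_samples = 267
n_del, n_dup = 3888, 1605        # deletion and duplication events
del_mb, dup_mb = 147.67, 140.15  # cumulative size in Mb

n_total = n_del + n_dup
per_sample = n_total / n_samples                   # CNV events per person
del_dup_ratio = n_del / n_dup                      # deletion/duplication ratio
mean_all_kb = (del_mb + dup_mb) * 1000 / n_total   # mean CNV size in kb
mean_del_kb = del_mb * 1000 / n_del
mean_dup_kb = dup_mb * 1000 / n_dup

print(f"events: {n_total}, per sample: {per_sample:.1f}")
print(f"deletion/duplication ratio: {del_dup_ratio:.1f}:1")
print(f"mean size (kb): all {mean_all_kb:.2f}, "
      f"deletions {mean_del_kb:.2f}, duplications {mean_dup_kb:.2f}")
```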

Figure 1

Size distribution of CNVs in 267 unrelated Finnish subjects. The x-axis represents different size bins and the y-axis represents the proportion of CNVs falling into each size bin.

Genome-wide map of CNVs

We followed two different criteria for the construction of a genome-wide CNV map in this sample set. From the 5493 consistent CNV events, a total of 999 CNVRs (618 (61.9%) deletions, 381 (38.1%) duplications) were constructed. Among the 999 CNVRs, 631 (63.2%) regions contained rare CNVs (<1% population), whereas 368 (36.8%) regions contained polymorphic CNVs2, 3 (>1% population).
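The rare versus polymorphic split can be sketched as a simple frequency threshold. In the sketch below the function name `classify_cnvr` is hypothetical, and the handling of a frequency of exactly 1% is an assumption, since the text only defines <1% and >1%. With 267 samples a frequency of exactly 1% cannot occur, so the choice has no effect here.

```python
def classify_cnvr(n_carriers, n_samples, threshold=0.01):
    """Label a region by the fraction of samples carrying a CNV in it.

    Frequencies above the threshold are called polymorphic; everything
    else is called rare (assumption: exactly 1% would count as rare).
    """
    frequency = n_carriers / n_samples
    return "polymorphic" if frequency > threshold else "rare"


# With 267 samples, 1% corresponds to 2.67 carriers, so regions seen in
# one or two subjects are rare and regions seen in three or more are not.
for carriers in (1, 2, 3, 47):
    print(carriers, classify_cnvr(carriers, 267))
```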

A total of 467 different cytogenetic regions contained CNVs in this sample set. We define such cytogenetic regions as CNVcRs in this study. Of these 467 CNVcRs, 190 regions (40.68%) were found to be rare (<1% population), whereas 277 regions (59.31%) were polymorphic (>1% of population). We tested whether the highly frequent CNVcRs in this study (Table 1) were significantly overrepresented or underrepresented compared with other populations. For this, we computed the frequencies of CNVcRs from 39 studies representing 9793 samples. Specifically, we combined the CNV data from the Database of Genomic Variants1 (DGV; 37 studies, 8528 samples), Vogler et al16 (1167 samples) and Teo et al18 (98 samples). With the exception of 14q11.2, the four other highly frequent CNVcRs (6q14.1, 11q11, 2p22.3, 3q28) were found to be significantly overrepresented among the Finns (two-sided Fisher’s exact test; P-value (fdr) <0.05) (Table 1).
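A comparison of this kind can be sketched as a two-sided Fisher's exact test on a 2x2 table of carriers and non-carriers, followed by a false discovery rate adjustment. The sketch below is illustrative only: the text does not state which FDR procedure was applied, so the Benjamini-Hochberg adjustment is an assumption, and the function names and the carrier counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {
        x: comb(row1, x) * comb(row2, col1 - x) / denom
        for x in range(lo, hi + 1)
    }
    p_obs = probs[a]
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical carrier counts: (carriers among 267 Finnish samples,
# carriers among 9793 reference samples) for three made-up regions.
counts = [(120, 1500), (30, 1000), (5, 200)]
raw = [fisher_exact_two_sided(f, 267 - f, r, 9793 - r) for f, r in counts]
print(benjamini_hochberg(raw))
```

Exact integer binomial coefficients are used so that the sketch has no dependencies beyond the standard library; a statistics package would normally be preferred for large tables.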

Table 1 CNVcRs with their frequencies and gene content in the Finnish sample set (n=267)

Novel CNVR characteristics

We found that 6.9% of the CNVRs detected in this study were novel, whereas 93.1% were already known and cataloged in DGV. Nearly 83% of the novel CNVRs in our data were rare (<1% population). Approximately 93% of the novel CNVRs were <50 kb. Although 50% of the novel CNVRs overlapped with RefSeq genes, only half of them overlapped exonic regions. Notably, homozygous deletions were not observed in the novel CNVRs.

We further compared the CNVRs in this study with CNV calls from a few individual studies to assess the level of concordance. The highest concordance was observed with the study of Shaikh et al,14 which was based on the Illumina Infinium II HumanHap550 beadchip and in which 65% of the sample set comprised Caucasians. Overall, depending on the CNV detection methodology (platforms and algorithms) and ethnic background of the population, the level of concordance with different studies varied significantly (Supplementary Table S3).

Enrichment of CNVs and their phenotype–genotype correlation

We checked whether any particular CNV was significantly enriched or depleted in this sample set (two-sided Fisher’s exact test, P-value threshold: 0.05) compared with (1) mixed Caucasian and African–Americans,14 (2) African and Swiss populations16 and (3) Swedish population.18 Interestingly, several CNVRs showed significant enrichment and were observed only in the Finnish sample set (Table 2a). In fact, some of the enriched CNVRs intersected genes that affect brain function. For example, CNVRs overlapping the protocadherin alpha gene cluster (PCDHA1-9; 47 subjects), the galactose mutarotase gene (GALM; 45 subjects) and cGMP-dependent protein kinase type I (PRKG1; 23 subjects) are notably relevant for brain function and could be relevant candidates for musical traits. Several other common CNVRs in the Finnish sample set, such as chr8:51031221-51040022 containing SNTG1 (P=1.7 × 10^−37) and chr16:28615243-28620752 with SULT1A1 (P=1.2 × 10^−31), were not reported in other populations. In this connection, analysis of the music-related phenotypes (COMB scores and creativity in music; detailed in Supplementary information) among the enriched CNV carriers showed no significant excess of the phenotypes in either carriers or noncarriers (data not shown).

Table 2a Finnish CNVRs consistently enriched against CNVRs from Rwanda, Mixed, Swedish and Swiss populations

On the other hand, several CNVRs that were relatively common in other populations were not observed in the Finnish population (Table 2b). Putative functions of the genes intersected by these enriched and depleted CNVRs are shown in Supplementary Table S5.

Table 2b Finnish CNVRs consistently depleted against CNVRs from Rwanda, Mixed, Swedish and Swiss populations

Genomic impact of CNVs in the Finnish sample set

A total of 491 (49.1%) CNVRs overlapped with 835 RefSeq genes, of which 321 genes (38.4%) were deleted, whereas 423 genes (50.7%) were duplicated. In all, 91 genes (10.9%) had undergone both deletions and duplications, whereas 37 genes (4.4%) overlapped with novel CNVRs. Table 3 shows some of the clinically important CNVs and genes that were significantly polymorphic in the population.

Table 3 CNVs overlapping with genes of known clinical relevance

The most common CNVRs contained genes from the olfactory gene cluster (OR4C11, OR4P4, OR4S2), amylase gene cluster (AMY1A, AMY1B, AMY1C) and protocadherin alpha gene cluster (PCDHA1-9) (Supplementary Figure S2). Of the 835 RefSeq genes that fell within CNVs, 396 genes were present in the OMIM database, which contains information on all Mendelian disorders, whereas 593 genes were present in PharmGKB, the pharmacogenomics knowledge base (more functional categories detailed in Supplementary Table S4).

To identify the enriched functional categories falling within CNVRs of this study, we used a hypergeometric distribution test implemented in Genetrail.33 A stringent P-value threshold (fdr) of 0.01 resulted in the enrichment of only one KEGG pathway: olfactory transduction (P=2.07 × 10^−6). In addition to this, several gene ontology terms were significantly enriched in our CNV data set (P-value (fdr); threshold 0.01) (Supplementary Figure S3). These enriched functional categories included: (1) biological processes such as cell adhesion, sensory perception, cognition and neurological system process, (2) cellular component terms pertaining to the membrane parts and (3) molecular function terms such as molecular transducer activity, alpha-amylase activity and olfactory receptor activity.
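The idea behind a hypergeometric enrichment test can be sketched as follows: given a background of N genes of which K belong to a category, the P-value is the probability of seeing at least k category members among n genes drawn at random. This is a generic sketch, not Genetrail's implementation; the function name and all counts except the 835 CNV-overlapping genes are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, K, n, N):
    """Upper-tail hypergeometric P-value, P(X >= k).

    N: genes in the background set
    K: background genes annotated with the category
    n: genes in the test set (for example, genes overlapping CNVRs)
    k: test-set genes annotated with the category
    """
    upper = min(K, n)
    # comb() returns 0 when more genes are requested than exist, so
    # impossible configurations drop out of the sum automatically.
    numerator = sum(
        comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)
    )
    return numerator / comb(N, n)


# Hypothetical example: 835 CNV-overlapping genes out of an assumed
# background of 20000 genes, with 40 of them in a 400-gene category.
print(hypergeom_enrichment_p(k=40, K=400, n=835, N=20000))
```

When many categories are tested, the resulting P-values would still need a multiple-testing adjustment, as indicated by the fdr threshold in the text.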

Discussion

This study presents the first comprehensive CNV map of 286 healthy, unrelated subjects belonging to the MUSGEN project, who originate from the isolated Finnish population. Primarily, this genome-wide CNV investigation in the Finnish sample set demonstrated features that are characteristic of isolated populations. In particular, highly significant enrichment of certain CNVRs and a total depletion of other CNVRs in this population suggest a founder effect. Adding strength to this finding, even the most common CNV locations in the world’s normal population cohorts (6q14.1, 11q11) were overrepresented in this population (Table 1). In addition, CNVs in this population comprised a higher proportion of homozygous deletions than three other populations from a recent study,19 hinting at a founder effect.

Further, 6.9% of the CNVRs detected in this population sample were novel. The majority of those novel CNVRs were small (<50 kb), suggesting that smaller variants in the human genome have not been comprehensively characterized yet. Although the functional impact of such smaller variants has often been underestimated, recent studies34, 35 have shown that smaller variants (often intergenic) affect transcription factor binding and, consequently, gene expression. Moreover, the fact that 93% of the detected CNVRs were known variants indicates that the detection methodology used in this study was sufficiently sensitive to capture known variations.

The genes intersected by the enriched CNVRs, such as PCDHA1-9, GALM and PRKG1, are intriguing because of their remarkable relevance for brain function. The PCDHA1-9 gene cluster is related to the serotonergic systems that influence neurocognitive and motor functions,36 and was found to cosegregate with low music test scores in a recent family-based CNV study.37 GALM is associated with serotonin transporter binding potential in the human thalamus,38 whereas PRKG1, expressed in the neurons of the amygdala, was suggested to support synaptic plasticity.39 Although these CNVRs showed no statistically significant association with the music-related phenotypes, owing to a limited sample size, we cannot exclude their role as possible candidate genes for musical aptitude and related traits.

Several CNVs detected in this study have considerable clinical relevance. Specifically, some of the highly polymorphic (>5% population) CNVs intersected known disease-related genes (Table 3), which may be of interest to the public health sector. Because the study participants were not screened for disease markers, we cannot exclude the possibility that an identified CNV predisposes to disease. In addition, 13% of the individuals in this study carried at least one large, rare CNV. In this regard, it is worth noting that these large CNVs are typically rare in normal population cohorts,12 but have a potential role in neuropsychiatric diseases.5, 40, 41, 42 Further investigations are warranted but remain beyond the scope of this study.

The functional impact of common CNVs in the Finnish population appeared to be consistent with the world’s populations in a number of aspects. Firstly, duplications overlapped more genes than deletions in our study, supporting the presumption of deletions being biased away from genes.4, 43 Secondly, our findings aligned with the idea of rare CNVs being large and harboring more genes than common events.12 Further, significantly enriched gene ontology terms in the Finnish CNVs included extracellular biological processes, such as sensory perception, cognition and neurological system process, which is in accordance with the findings of previous population-based CNV studies.4, 15, 17, 22, 43 In fact, the genes affecting these biological processes are also intriguing for music-related phenotypes. Also, a recent CNV study in swine species44 reported similar enriched functional categories, which allows us to speculate that CNVs are comparable across different species. Focusing on individual genes, we found that genes from the olfactory gene cluster (OR4C11, OR4P4, OR4S2), amylase gene cluster (AMY1A, AMY1B, AMY1C) and protocadherin alpha gene cluster (PCDHA1-9) were relatively more frequent among the CNVs in this population. Regardless of their frequencies in the Finnish population, these genes have been widely described in CNVs in previous studies.45, 46, 47

The general characteristics of CNVs detected in this study are similar to those in previous studies that used the same technology platform. For instance, the number of CNVs per person, the deletion/duplication ratio and the average size of CNV (20, 2.4:1 and 52.39 kb, respectively) in this study are comparable to the statistics of Shaikh et al.14 and Xu et al.19 Both of these studies used an Illumina SNP array with relatively lower marker density. Moreover, when we estimated the degree of overlap between Finnish CNVs and CNVs from 10 different studies, we found a higher degree of overlap with studies based on Illumina SNP arrays.

In a broader perspective, array-based CNV studies often involve several discrepancies in detection, cataloging and comparisons. Most importantly, differences in the array architecture, choice of algorithms, population differences and phenotypes affect our understanding of CNVs.48, 49, 50, 51 Being aware of all such discrepancies, we chose to abide by widely adopted methods to make our data comparable across studies. However, an increase in the sample size and a higher marker density in this study would fine-tune the estimates of population frequencies and their correlation with the studied phenotypes.

In conclusion, this first-generation CNV map of the MUSGEN project originating from the isolated Finnish population shows features characteristic of a founder effect. It would be interesting to see whether a similar phenomenon can be detected in other population isolates. The enriched CNVRs and biological processes suggest pathways important in evolution52 and may serve as candidate genes for music-related traits. Finally, these findings help future studies on music-related phenotypes, human genetic diseases, and demographic history, as well as contribute to the global variation map of CNVs.