Introduction

The advent of new technologies has allowed the identification of structural variants, which have a more significant impact on human diversity than does the entire set of single-nucleotide polymorphisms (SNPs). Copy number variations (CNVs) are one such type of structural variant and constitute the largest proportion of genomic variations.1, 2, 3 CNVs result from the duplication or deletion of a DNA segment and are commonly observed in human genomes.4, 5, 6, 7 When a genomic event results in a CNV, it can alter not only the copy number of a gene but also its genic sequence. Therefore, CNVs can cause disease or contribute to disease susceptibility,8, 9, 10 and they have been compiled in several databases for public use.9, 10, 11

Although a number of deleterious changes may have been negatively selected during human evolution, it is likely that phenotypically neutral changes have been retained, transmitted and accumulated over generations. Increasing numbers of CNVs are found in phenotypically normal human individuals.1 Accordingly, each ethnic group tends to have distinct features in terms of the positions, copy numbers and frequencies of their CNVs, and it is possible that fixed CNVs have contributed to ethnic differences in phenotypic variations and disease susceptibility.12, 13, 14, 15 Therefore, it is important to have a list of CNVs for each ethnic group, especially for medical purposes. However, the number of reported CNVs from Asian populations is small compared with those of Europeans and Africans. Extensive examination of Asian CNVs is eagerly awaited by Asian researchers.

The compilation of nonpathogenic variations, in addition to disease-related variations, is also important for a better understanding of the genetic landscape of the human genome. Data sets including both these sorts of variations should be helpful in pinpointing causative mutations. Even when we search for variations using patient samples, most of the variations identified would be normal polymorphisms, together with a few pathogenic mutations. Although we can consider most of the available variation data nonpathogenic, it is difficult to know which variations are pathogenic. Therefore, the collection of data from normal controls is essential. To investigate phenotypically ‘normal’ samples in this study, we considered reproduction and child development, and chose parous Japanese women, who had experienced normal pregnancies and deliveries.

Although the origin of the Japanese population remains controversial, the last major migration to the Japanese Archipelago is thought to have occurred approximately 2000 years ago.16, 17 The population mixed extensively with various Asian ethnic groups during earlier migrations, but has remained relatively isolated for the past 2000 years. However, although the current population of Japan is 127 million, far fewer CNVs have been documented in Japanese samples than in European samples. Compiling a list of Japanese CNVs is therefore also important from the perspective of medical science in Japan.

Materials and methods

Subject recruitment and SNP genotyping with a high-resolution microarray

We examined 411 unrelated Japanese women who had had one or more normal deliveries, with no significant abnormalities in any of their pregnancies, deliveries or neonates. Ethical approval was obtained from the review board of each hospital that participated in the study, and informed consent was obtained from all subjects. To avoid cell-culture-induced chromosomal rearrangements, genomic DNAs were extracted directly from blood using the QIAsymphony DNA Midi Kit (Qiagen, Venlo, The Netherlands) with the QIAsymphony SP instrument and analyzed with a high-resolution SNP-based genotyping microarray, HumanOmni2.5-8 BeadChip (Illumina, San Diego, CA, USA). Only data that met the quality control guidelines of the manufacturer were used for further analyses.

Identification of CNVs and CNVRs

Two distinct algorithms were used to maximize the specificity of our CNV calling: a likelihood-based method with CNVPartition version 3.2.0 (http://www.illumina.com/software/illumina_connect.ilmn) and a hidden Markov method with PennCNV (version of 27 August 2009).18 The parameters applied with these tools followed those typically used by many research groups (at least three consecutive probes to define a CNV, use of the GC wave adjustment option, etc.). Both programs compute confidence scores that can be used to filter out CNV regions that are likely to be false positives. Note, however, that the two programs calculate these scores in different ways and on different scales. To minimize false positives, we first chose only CNVs with high confidence scores (more than 100 with CNVPartition), and then selected copy number variable regions (CNVRs) that overlapped those called by PennCNV for at least 80% of their lengths. For PennCNV, we generated a list of B allele frequencies using a collection of signal intensities for 47 samples from HapMap Japanese in Tokyo with the compile_pfb script (Figure 1a).
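The two-step filter described above (a confidence score above 100, then at least 80% of the call's length supported by PennCNV) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the tuple layouts, the function names and the choice to match calls by sample and chromosome only (without requiring the same gain or loss state) are assumptions made for illustration.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_fraction(start, end, merged):
    """Fraction of [start, end) covered by already-merged intervals."""
    covered = sum(max(0, min(end, m_end) - max(start, m_start))
                  for m_start, m_end in merged)
    return covered / (end - start)

def filter_calls(partition_calls, penncnv_calls, min_score=100, min_overlap=0.8):
    """Keep CNVPartition calls with score > min_score whose length is covered
    by PennCNV calls (same sample and chromosome) for at least min_overlap.

    partition_calls: (sample, chrom, start, end, score)  -- hypothetical layout
    penncnv_calls:   (sample, chrom, start, end)         -- hypothetical layout
    """
    by_key = defaultdict(list)
    for sample, chrom, start, end in penncnv_calls:
        by_key[(sample, chrom)].append((start, end))
    # Merge first so overlapping PennCNV calls are not counted twice.
    merged = {key: merge_intervals(ivals) for key, ivals in by_key.items()}

    kept = []
    for sample, chrom, start, end, score in partition_calls:
        if score <= min_score:
            continue
        frac = covered_fraction(start, end, merged.get((sample, chrom), []))
        if frac >= min_overlap:
            kept.append((sample, chrom, start, end, score))
    return kept
```

Merging the PennCNV intervals before measuring coverage is a design choice that keeps the covered fraction at or below 1 even when PennCNV reports fragmented, overlapping calls.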

Figure 1

(a) Data processing flow. The initial 26 150 regions identified with CNVPartition were validated with PennCNV. (b) Chromosomal distribution of the CNVRs. Each CNVR is shown by a horizontal bar. Gain- and loss-type CNVs are distinguished by bars on the left and right, respectively, and the number of each type of CNV is also shown. The panel was drawn with Idiographica (http://www.ncrna.org/idiographica/). CNV, copy number variation; CNVR, copy number variable region.

Multiplex PCR assay

Multiplex polymerase chain reaction (PCR) assay was used to confirm regions that had been called homozygously deleted. The reactions were performed with both a control primer pair that generated a 296-bp fragment and a test primer pair that amplified a target region. The thermal cycling conditions were initial denaturation at 95 °C for 2 min, followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at 60 °C for 30 s and extension at 72 °C for 30 s, and a final extension at 72 °C for 3 min. Detailed information on these primers is given in Supplementary Table S3.

Results

Genetic ancestry of the subjects

First, the population structure was inferred with the Structure software (http://pritchardlab.stanford.edu/structure.html) to confirm the Japanese ancestry of the subjects.19 A cluster analysis of our samples together with the sequences of 499 HapMap individuals from three ancestral populations (European, African and Asian) was performed using 1959 unlinked tag SNPs on chromosome 21. The expected ancestry of all the subjects was confirmed with a minimum coefficient of 0.85. We also performed a principal components analysis with the pca.jar program (Biobank Japan project; http://genome-analysis.src.riken.jp/PCP/). The results indicated that all but one subject were derived from the main islands of Japan and that the remaining singleton was Ryukyuan.20

Characterization of CNVs and CNVRs

The CNVPartition software (Illumina) identified 26 150 candidate regions as CNVs. We then used another program, PennCNV,18 which is based on an integrated hidden Markov algorithm, to maximize the specificity of the analysis. If a candidate CNV was also supported by PennCNV for at least 80% of its length, it was retained. In this way, we identified 6871 CNVs in 411 Japanese individuals, with an average of 16.7 CNVs per diploid genome (Supplementary Table S1). Detailed information on all the SNP probes used for the CNV calls is tabulated (Supplementary Table S2). The mean length of the CNVs was 79.9 kb, ranging from 169 bases to 2.27 Mb. These 6871 CNVs corresponded to 1043 CNVRs (588 losses and 455 gains), which are classified as gain regions or loss regions depending on whether their copy numbers have increased or decreased. Figure 1 shows the chromosomal distribution of the observed CNVRs. The total length of all of these CNVRs was 163 720 kb, which is equivalent to 0.5% of the whole human genome. Of the 1043 regions identified, 1033 overlap variants reported in the latest release of the Database of Genomic Variants (DGV; released on 23 July 2013). More than half the CNVRs, including 72% of the gain CNVRs and 36% of the loss CNVRs, intersect RefSeq gene loci.
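How individual CNV calls are collapsed into CNVRs is not spelled out in the text. One common convention, assumed here purely for illustration, is to merge overlapping calls of the same type (gain or loss) on the same chromosome across all samples. The sketch below applies that convention and also reproduces the per-genome average quoted above; the tuple layout and function names are hypothetical.

```python
def build_cnvrs(cnvs):
    """Collapse CNV calls into CNVRs by merging overlapping calls.

    cnvs: iterable of (chrom, kind, start, end) with half-open coordinates,
    where kind is "gain" or "loss" (hypothetical layout). Calls are merged
    only when chromosome and kind match and the intervals truly overlap.
    """
    regions = []
    for chrom, kind, start, end in sorted(cnvs):
        last = regions[-1] if regions else None
        if last and last[0] == chrom and last[1] == kind and start < last[3]:
            last[3] = max(last[3], end)
        else:
            regions.append([chrom, kind, start, end])
    return [tuple(region) for region in regions]

def mean_cnvs_per_sample(n_cnvs, n_samples):
    """Average number of CNV calls per individual."""
    return n_cnvs / n_samples

# Figures taken from the text: 6871 CNVs in 411 individuals.
print(round(mean_cnvs_per_sample(6871, 411), 1))
```

Keeping gains and losses separate mirrors the way the CNVRs are reported here (588 losses and 455 gains); merging both types into a single region would be an equally defensible alternative.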

As far as we know, three studies have examined the Japanese population with array-based methods: two of them used samples from HapMap and the other used healthy individuals.21, 22, 23 These results are summarized with our data set (Table 1). Although those three studies had together already reported 82 regions, more than half the regions reported in the present study were not detected by them. It is probable that the higher resolution of our analysis and our larger sample size allowed us to detect additional CNVRs. Depending on the types of platform used, array-based CNV studies occasionally show discrepancies in the regions of CNVs.21, 22 Differences in the array architectures, scanning machines and calling algorithms could affect the final data sets. Using reported CNV data from SNP arrays, we counted the overlapping regions among studies that focused on other populations or HapMap data13, 14, 24, 25, 26 (Table 2). The similarities among these studies are comparable, but our results suggest a greater similarity between the Japanese and Chinese populations.

Table 1 Comparison of the CNVRs with those reported in other studies and in the DGVs
Table 2 CNVRs overlapping between the Japanese and other populations

Homozygous deletions found in parous Japanese women

In our study, 1628 homozygous deletions that could affect 112 RefSeq gene loci were called in a total of 822 chromosomes. Although the CNV analysis was unable to determine the precise breakpoints, our data indicate that some exonic sequences are disrupted by homozygous deletions (Table 3). Using multiplex PCR with both control and test primer pairs, we confirmed the null genotypes caused by deletions (Supplementary Figure S1 and Supplementary Table S3). Five genes, FCGR3B, FCGR2B, UGT2B17, HLA-DRB1 and CYP2A6, are described as disease related in the OMIM database. The FCGR3B, FCGR2B and HLA-DRB1 genes have roles in the immune system. FCGR3B and FCGR2B encode receptors for the crystallizable fragment (Fc) region of immunoglobulin G. Several studies have shown that a low copy number at the FCGR3B/FCGR2B locus is associated with susceptibility to systemic lupus erythematosus in the Caucasian population,27, 28, 29 but not in the Chinese population.28 UGT2B17 encodes a protein that belongs to the UDP-glucuronosyltransferase family of enzymes, which catalyze the glucuronidation of steroid hormones. A case–control study of osteoporosis-related fracture suggested that a CNV at the UGT2B17 locus contributes to osteoporosis.30 Jakobsson et al.31 found that its null genotype was more common in Koreans (67%) than in Swedes (9%). Our array results also showed a high frequency (74%) of the null genotype. The CYP2A6 protein metabolizes nicotine and coumarin in the liver. The lack of a CYP2A6 gene may affect nicotine levels in individuals and probably has a protective effect against tobacco dependence.32 Another study reported that the frequency of homozygotes for the CYP2A6 gene deletion was lower in Japanese lung cancer patients than in control samples.33 Except for HLA-DRB1, these disease-related genes have been reported to be frequently deleted in Asian populations.25, 34, 35, 36 Because we limited our samples to parous women only, it is unlikely that the CNVRs identified in the present study impair human reproduction.

Table 3 List of genes lying within a homozygously deleted region

Discussion

In the present study, we compiled a catalog of copy number variable regions identified in phenotypically normal Japanese samples, specifically women with a history of full-term pregnancy and delivery without major complications. The data set will be useful in the search for novel or rare CNVs that increase an individual's susceptibility to congenital diseases and complications during pregnancy. It is unlikely that the newly identified CNVs are related to infertility or miscarriage. CNVs in parous women without complications have never before been investigated. Although the copy numbers of these regions were not thoroughly validated with other methods, such as quantitative PCR, most of the CNVRs identified here have been reported in previous studies according to the DGV, indicating that they should be observable by other methods or techniques. Because our identification strategy was based on a microarray technique, it is inevitable that some errors occurred. Besides routine data processing, we also carefully curated the data by examining the B allele frequencies and signal intensities (log R ratio) for each CNVR using the GenomeStudio software (Illumina) (Figure 2). We found that many implausible calls were situated in regions with high G+C contents; for example, in subtelomeric regions. All of them were copy number gain-type CNVs rather than copy number loss-type CNVs. Although further research is required, it is important to note that CNVRs tend to be detected in those regions by SNP microarrays. Even if such CNVRs are false positives, our data set is still useful for screening large numbers of candidate CNVs.

Figure 2

(a) A copy number variation (CNV) located on chromosome 7. The panel shows the region at nucleotides 108 541 155–108 698 960 in hg19. This CNV was called with high-intensity probes. The B allele frequencies (BAFs) were separated into four levels, which corresponded to AAA, AAB, ABB and BBB, respectively. (b) Another CNV located in the subtelomeric region on chromosome 4. The panel shows the region at nucleotides 863 513–1 113 194. Although high-intensity probes were used, as in the example shown above, the four levels of BAFs were not observed, suggesting that the call might be implausible. Such CNVs tended to be called in G+C-rich regions; for example, the G+C content was 58% in this case. The snapshot was made with the IGV program (http://www.broadinstitute.org/igv/). A full color version of this figure is available at Journal of Human Genetics online.
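The four BAF levels in panel (a) follow from simple allele counting: with three copies, a probe can carry 0, 1, 2 or 3 B alleles, giving idealized BAFs of 0, 1/3, 2/3 and 1. The Python sketch below makes that arithmetic explicit and adds a rough plausibility check of the kind described for panel (b). The function names, the heterozygous window (0.15 to 0.85), the tolerance and the 80% cutoff are illustrative assumptions, not values taken from GenomeStudio or from this study.

```python
from fractions import Fraction

def expected_baf_levels(copy_number):
    """Idealized BAF values k/n for k B alleles among n copies."""
    if copy_number < 1:
        raise ValueError("copy_number must be at least 1")
    return [Fraction(k, copy_number) for k in range(copy_number + 1)]

def nearest_level(baf, copy_number):
    """Expected BAF level closest to an observed BAF value."""
    return min(expected_baf_levels(copy_number),
               key=lambda level: abs(float(level) - baf))

def looks_like_three_copies(bafs, tol=0.08, min_share=0.8):
    """Heuristic check for a single-copy gain (assumed thresholds).

    Heterozygous probes should cluster near 1/3 or 2/3 rather than 1/2.
    Returns False when no heterozygous probe is available to judge.
    """
    het = [b for b in bafs if 0.15 < b < 0.85]
    if not het:
        return False
    near_thirds = sum(
        1 for b in het if min(abs(b - 1 / 3), abs(b - 2 / 3)) <= tol
    )
    return near_thirds / len(het) >= min_share
```

For a normal diploid region the same function yields three levels (0, 1/2 and 1), which is why a gain call whose heterozygous probes still sit near 1/2 deserves scrutiny.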

It is unclear whether CNVs are selectively neutral on the basis of genetic drift, but they are certainly distributed throughout all human populations. Using the genotypes of mitochondrial DNA and the Y chromosome, geneticists and anthropologists have surmised various intriguing scenarios about the history of humans.37, 38, 39, 40 However, these genetic materials have been transmitted exclusively through maternal and paternal lineages, respectively. In contrast, the CNVs reported here occur in the more extensive remaining genome regions; that is, on autosomes or the X chromosome. Therefore, they have been transmitted sometimes as maternal alleles and at other times as paternal alleles. They might also have been subjected to crossing over. CNV data from various parts of the world are essential to substantiate these hypothetical scenarios.

Chromosomal anomalies are found with conventional cytogenetic techniques in approximately half of all early sporadic miscarriages.41 It is possible that miscarriages and pregnancy losses are also caused by submicroscopic chromosomal changes, including CNVs. Twenty-eight CNVs were reported as candidate miscarriage-related variations when instances of recurrent pregnancy loss were examined by Rajcan-Separovic et al.42 In that study, 17 Caucasian and three African-American couples with recurrent pregnancy losses were examined together with their miscarriage samples.42 The authors reported 11 novel CNVs in the miscarriage samples and three in the parent samples, and suggested that these CNVs were probably mutations causing susceptibility to miscarriage. Of the 11 CNVs in the miscarriage samples, one on chromosome 12 (130 060 706–130 430 847 in hg18) and another on chromosome X (6 498 521–8 091 951) overlapped with our data set. Whereas the CNV on chromosome 12 was up to 370 kb in length and encompassed the GPR133 gene, the corresponding variable region in our data set is much shorter and includes no known genes. The GPR133 gene encodes one of the orphan G-protein-coupled receptors, but its function is unknown.43 It is possible that this receptor protein has a role in several signal-transduction pathways via classical receptor/G-protein interactions. Therefore, the chromosome 12 CNV may be a variant that causes miscarriage. However, the CNV on chromosome X is consistent with our data set, suggesting that it is a commonly observed variant. In fact, Rajcan-Separovic et al.42 tried to define the common CNVs using a collective repository in the DGV, but insufficient phenotypic information was available to refine the data. Taking these observations together, it seems that to define a set of common CNVs, it will be necessary to collect a large amount of control data focused on a specific phenotype, such as normal parity in this case.

The Japanese are an admixture of ancient Asian populations that inhabited regions outside the Japanese Archipelago. We investigated the similarities among the CNVRs detected in various populations and noted that around 15% of Japanese CNVRs overlap those of other populations (Table 2). It has been suggested that the number of overlapping CNVs is influenced by the number of subjects. For instance, the Japanese and Tibetan data showed dissimilarity because of the limited number of Tibetan subjects. Although the sample sizes of the Korean and Chinese populations are smaller than those of the European and African populations, the overlaps between the Japanese and other East Asian populations were comparable to those between the Japanese and the European and African populations. This probably suggests strong similarities between the Japanese and other East Asian populations.

Previous studies have predominantly targeted European and African populations, but CNVs have been observed at different frequencies or copy numbers in different populations; for example, variations in the salivary amylase gene.44 Many CNVs, such as those at the AMY1 locus, may be associated with diabetes, asthma, hypertension, allergy and other diseases of affluence in each ethnic group. Although CNVRs may result from the accumulation of tolerable structural mutations in the course of an ethnic history, they could start to influence the population's susceptibility to disease once its lifestyle is altered. The allelic frequencies of SNPs and short indels in each population have recently been documented.45 The complete documentation of the CNVRs in each ethnic group is similarly important. The development of an innovative method to achieve this, such as one involving next-generation sequencing and informatics, is another challenge.