Introduction

Human genetic variations comprise various types of structural genomic changes and single nucleotide polymorphisms (SNPs). Large microscopic changes affect more than tens of millions of bases (mb) in the genome and are rare in healthy individuals, but smaller structural variations ranging from 1 kb to hundreds of kb are frequent and widespread even in normal individuals, contributing to human genetic diversity or disease susceptibility (Feuk et al., 2006; Freeman et al., 2006). Such submicroscopic genomic variations have been defined in terms of copy number variations (CNVs) and include large-scale copy number variants (LCVs) (Iafrate et al., 2004), copy number polymorphisms (CNPs) (Sebat et al., 2004), and intermediate-sized variants (ISVs) (Tuzun et al., 2005), as well as other types of genomic variations such as low copy repeats (LCRs) (Lupski and Stankiewicz, 2005), multisite variants (MSVs) (Fredman et al., 2004), and paralogous sequence variants (PSVs) (Eichler, 2001). However, by convention, genomic variations do not include variants that arise from the insertion/deletion of transposable elements (Freeman et al., 2006).

Various experimental platforms and analytical tools such as array-based methods (SNP genotyping array, BAC- and oligonucleotide-array CGH) and clone-based large scale sequencing approaches have been utilized to study structural genomic variations in humans as well as other species (Li et al., 2004; Newman et al., 2005; Perry et al., 2006; Dumas et al., 2007; Graubet et al., 2007; Human Genome Structural Variation Working Group 2007). In humans, multiple studies, including the international HapMap project, have so far annotated CNVs to more than 4000 distinct regions spanning 600 mb, though their abundance and size are likely to be overestimated due to variability of methods and fewer cross-platform validations (Cooper et al., 2007). Since most CNV detection technologies rely on a comparison to a reference genome, CNVs are determined when cross-referenced to disease-affected individuals or different ethnic populations (Rodriguez-Revenga et al., 2007). Thus, absolute copy number information, which is especially important for clinical diagnosis or assessment of disease susceptibility, cannot be easily determined by current quantitative assays except Fiber-FISH. Nonetheless, recent studies have shown that DNA copy number variations are implicated in human diseases including glomerulonephritis (FCGR3B) (Aitman et al., 2006), HIV-1/AIDS (CCL3L1) (Gonzalez et al., 2005), bipolar disorder and schizophrenia (GLUR7, CACNG2 and AKAP5) (Wilson et al., 2006), muscular atrophy (SMN) (Kesari et al., 2005) and neoplasia (14q12) (Braude et al., 2006).

On the other hand, recent reports also suggest that different ethnic groups may represent different profiles of CNVs that are stratified in the human population (Redon et al., 2006; Kidd et al., 2007). In our previous BAC array CGH study, Korean copy number variants were discovered when compared to reference DNA from different ethnic groups (Jeon et al., 2007). In an attempt to obtain a standard CNV profile for the Korean population, which would facilitate association studies of CNVs with disease susceptibility as well as population genetic diversity, we analyzed a comprehensive CNV profile of 90 Korean individuals using the publicly available Korean HapMap SNP 50 k chip data sets and tested its application to population stratification.

Results

To generate CNV profiles of Koreans, we extracted DNA copy number information from the publicly available Korean HapMap SNP 50 k chip data (http://www.khapma.org) and then conducted P value-based and copy number-based CNV analyses, as well as a combined P value and copy number-based CNV analysis, using two different copy number reference genome assembly sets. Two reference sets were used to detect CNVs in the study subjects (n = 90): 1) the Korean reference set, generated from the genomes of all 90 individuals, and 2) the Affymetrix reference set, provided by Affymetrix Inc. as a copy number reference from multiple ethnic groups. We tested the validity of the different CNV calling criteria by quantitative multiplex PCR of short fluorescent fragments (QMPSF). The best validation rate was observed for the combined CNV calls using P and SD values.

P value-based CNV analysis (cutoff P < 0.01 or P < 0.001)

Our P value-based CNV analysis using the Korean reference set showed that the 90 Korean individuals had 435 copy number variation regions (CNVRs) covering 123 mb, equivalent to 4.1% of the genome, at a cutoff of P < 0.01, while the more stringent cutoff of P < 0.001 detected fewer CNVRs (n = 126) covering 35 mb (1.2%) (Supplemental Data Table S1). In contrast, when the Affymetrix reference set from multiple ethnic groups was used to detect CNVRs in Korean individuals, the more stringent cutoff of P < 0.001 was chosen because it provided enough stringency in CNV calling to yield a CNV profile with a reasonable number of CNVRs. Indeed, even this stringent criterion detected more CNVRs (n = 2034), covering 594 mb, equivalent to 19.8% of the genome (Supplemental Data Table S1). The proportion of CNVRs on a given chromosome varied from 11.3% on chromosome 14 to 44% on chromosome 12, with a mean of 19.8% across all chromosomes. Our P value-based CNV analysis using the Affymetrix reference set (P < 0.001) showed that CNVRs were uniformly distributed across the human chromosomes, and the population-wide occurrence of particular CNVRs ranged from zero to 72 out of 90 individuals (data not shown). According to the QMPSF results for the CNV calls detected by the P value-based CNV analysis using the Korean reference set, the validation rate was approximately 50% (3 out of 6 tested CNVRs) (Table 1; see also Supplemental Data Materials for CNV validation).

Table 1 PCR validation of the CNVRs detected by different CNV calls using the Korean reference set.

Copy number-based CNV analysis (cutoff SD ≥ 0.25)

We also employed the standard deviation of the copy numbers of each probe across the 90 individuals (SD ≥ 0.25) as the CNV calling criterion in the copy number-based CNV analysis, which detected population-wide CNVRs in the Korean population. This analysis detected 595 CNVRs (8.9% of the genome) and 790 CNVRs (11.8%) among the 90 individuals using the Korean reference and the Affymetrix reference sets, respectively (Supplemental Data Table S1). The average length of CNVRs was approximately 448 kb with both reference sets. The validation rate was approximately 46% (18 out of 39 tested CNVRs) (Table 1; see also Supplemental Data Materials for CNV validation).

Combined CNV analysis with P value (P < 0.01 or 0.001) and copy numbers (SD ≥ 0.25)

When compared with the Korean reference set using the combined criteria of P value (P < 0.01) and standard deviation of copy numbers (SD ≥ 0.25) of given probes among study subjects, Korean individuals (n = 90) exhibited 123 CNV regions (CNVRs) encompassing 27.2 mb, equivalent to 1.0% of the genome (Table 2, and see also Supplemental Data Table S4 for CNVR list). In contrast, when compared with the Affymetrix reference set, the combined CNV analysis (P < 0.001 and SD ≥ 0.25) detected more CNVRs (n = 643) encompassing 135.1 mb in larger proportions (5.0%) of the genome (Table 2, and see also Supplemental Data Table S5 for CNVR list). The proportion of copy number gains was lower than that of copy number losses when compared with the Korean reference set, whereas the ratio of gains to losses was higher when compared with the Affymetrix reference set (Figure 1).

Table 2 Summary of population-wide CNVRs from a Korean population (n = 90).
Figure 1

Composition of population-wide CNVRs. (A) Population-wide CNVRs were defined for two or more consecutive probes among study subjects (n = 90) using the combined CNV analysis of P and SD values. Genomic regions with high or low copy numbers relative to a reference genome assembly set were classified as CNVR gains or losses, respectively. A CNVR was classified as the mixed type when a genomic region encompassing two or more consecutive probes showed both gains and losses among the study population. (B) Composition of CNVR types. CNVRs were detected by the combined CNV analysis using the Korean and Affymetrix reference sets and were then classified into three CNVR types (gains, losses, or mixed types).

The standard deviation of copy numbers detects only population-wide CNVRs among the study subjects (n = 90) and therefore cannot provide the number of CNVRs per person, whereas the P value-based CNV analysis detects individual-based CNVRs. The combined criteria of P and SD values could therefore detect both population-wide and individual-based CNVRs, enhancing the reliability of CNV calls. Indeed, the validation rate was approximately 75% (9 out of 12 tested CNVRs), which was higher than those of the other CNV calls (Table 1 and Supplemental Data Materials for CNV validation).
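The three validation rates quoted in this section can be checked with a few lines of arithmetic. The sketch below only tallies the validated and tested CNVR counts given in the text; the dictionary keys and the function name `validation_rate` are illustrative and do not come from the original analysis.

```python
# Validated / tested CNVR counts as quoted in the text (Table 1).
qmpsf_counts = {
    "P value-based": (3, 6),
    "copy number-based (SD)": (18, 39),
    "combined (P and SD)": (9, 12),
}


def validation_rate(validated: int, tested: int) -> float:
    """Return the percentage of tested CNVRs that were confirmed by QMPSF."""
    if tested <= 0:
        raise ValueError("tested must be a positive count")
    return 100.0 * validated / tested


# Percentage per calling rule, and the rule with the highest rate.
rates = {name: validation_rate(v, t) for name, (v, t) in qmpsf_counts.items()}
best_rule = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```

With these counts the combined rule has the highest rate, in line with the comparison drawn in the text.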

Properties of Korean CNVs

According to the results of the combined CNV analysis, the average number of CNVRs per person was 2.1 ± 5.0 (ranging from 0 to 32 CNVRs) when compared to the Korean reference set (Figure 2). Thirteen individuals exhibited 5 or more CNVRs, and the three highest numbers of CNVRs were detected in individuals KR-41 (32 CNVRs), KR-72 (27 CNVRs) and KR-70 (15 CNVRs). Fifty individuals showed no CNVRs when compared to the Korean reference set using the combined criteria of P and SD values (Figure 2).

Figure 2

Numbers of CNVRs per person. The CNVRs were detected by the combined CNV analysis with P and SD values using the Korean and Affymetrix reference sets. The average number of CNVRs per person was 2.1 ± 5.0 (mean ± STD) (up to 32 CNVRs in the KR-41 sample) when compared with the Korean reference set, and 33.1 ± 26.6 (mean ± STD) (up to 124 CNVRs in the KR-41 sample) when compared with the Affymetrix reference set.

In contrast to the Korean reference set, the Affymetrix reference set allowed detection of more CNVRs in the combined CNV analysis (P < 0.001 and SD ≥ 0.25). The average number of CNVRs per person was 33.1 ± 26.6 (ranging from 2 to 124 CNVRs) (Figure 2). Thus, the average number of CNVRs per person was much higher in the combined CNV analysis using the Affymetrix reference than using the Korean reference. This observation may be ascribed to the ethnic difference between the Korean reference set, derived from a single population, and the Affymetrix reference set, derived from multiple ethnic groups. On the other hand, the KR-2 and KR-4 samples had the lowest numbers of CNVRs with both the Korean and Affymetrix reference sets, while the KR-41 and KR-72 samples had the highest numbers. The average length of individual-based CNVRs differed little in the combined CNV analysis between the Korean reference set (128 kb per CNVR) and the Affymetrix reference set (142 kb per CNVR).

The most frequent CNVR (KC16-T01) was detected in 7 of the 90 study subjects (7.8%) in the combined CNV analysis using the Korean reference set (Supplemental Data Table S4). Of the 123 CNVRs in total, 70 (57%) occurred in single individuals while 53 (43%) occurred in two or more individuals (7 individuals at most). The most frequent CNVR (KC16-T01) was localized to 16p12.1 and had a higher occurrence (n = 7) than the CNVR at the Ig locus (n = 4), suggesting that this region may be as susceptible to copy number variation as the Ig loci.

Chromosomes 21 and 18 showed the smallest proportions of CNVRs (0.03% and 1.86% of the corresponding chromosome) using the Korean reference set and the Affymetrix reference set, respectively (Table 2). This observation suggests that chromosome 21 may be relatively conserved for genomic structural variations within the Korean population. When the two criteria (P < 0.01 and SD ≥ 0.25) were combined in CNV calling using the Korean reference set, 64 CNVRs (52.1%) of 123 CNVRs were already listed in the Database of Genomic Variants (http://projects.tcag.ca/variation), while 59 CNVRs (47.9%) were not (Supplemental Data Table S4).

Identification of copy number invariant regions and qPCR validation of CNVRs

Based on the result of copy number-based CNV analysis, we first selected eleven copy number invariant regions showing the lowest standard deviation (SD) of GSA_CN values (Table 3). These regions exhibited almost no variation in DNA copy numbers among all individuals when compared to the self-including Korean reference set. One of them (2q36.1) was validated for no copy number variation in reference to copy numbers of Factor VIII gene among all study subjects (n = 90) (Figure 3).

Table 3 Copy number invariant regions (non-CNVRs).
Figure 3

Validation of copy number invariant regions. The DNA copy number of a copy number invariant region (CN2-2) at 2q36.1 was quantitated for all study subjects (n = 90) by the QMPSF method. The relative copy number of CN2-2 to Factor VIII was 1.32 ± 0.08, with a coefficient of variation of 6.3% among the test subjects. Error bars indicate standard deviations from three independent QMPSF measurements performed in triplicate.

Next, we evaluated the reliability of CNV calls from the P value-based and copy number-based analyses, as well as from the combined P value and copy number-based analysis, in order to determine the best CNV calling rule. Since our Korean reference set was generated from the averaged copy numbers of a given probe across all 90 individuals, the coefficient of variation (CV) was used as a statistical criterion of copy number difference among the tested subjects (n = 90). Accordingly, we considered a CNVR to be validated when the CV of its copy number measurements from the QMPSF results exceeded 8%. As a result, the validation rate of CNVRs was higher in the combined CNV analysis than in the P value-based or copy number-based CNV analyses (Table 1, and Supplemental Data Materials for CNV validation).
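The CV-based validation criterion can be illustrated with a minimal sketch. It assumes the CV is the sample standard deviation divided by the mean, expressed as a percentage; the text does not say whether the sample or population standard deviation was used, so that choice is an assumption. The function names and the example values are hypothetical and are not measured data.

```python
import statistics

CV_CUTOFF_PERCENT = 8.0  # validation threshold stated in the text


def coefficient_of_variation(values):
    """Return the CV in percent: sample SD divided by the mean, times 100.

    Using the sample SD (statistics.stdev) is an assumption of this sketch.
    """
    mean_value = statistics.mean(values)
    if mean_value == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / mean_value


def is_validated(relative_copy_numbers, cutoff=CV_CUTOFF_PERCENT):
    """Treat a CNVR as validated when its CV across subjects exceeds the cutoff."""
    return coefficient_of_variation(relative_copy_numbers) > cutoff


# Hypothetical relative copy numbers across five subjects (not measured data).
invariant_region = [1.00, 1.02, 0.98, 1.01, 0.99]
variable_region = [1.0, 1.5, 0.5, 1.0, 1.0]

print(is_validated(invariant_region))  # low spread, expected to fail the cutoff
print(is_validated(variable_region))   # high spread, expected to pass the cutoff
```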

Refining the population stratification of 90 individuals

CNV profiles identified in this study were applied to principal component analysis to stratify the 90 individuals. Multi-dimensional scaling showed that the first and second eigenvectors explained 21% and 9% of the distance variables, respectively (Figure 4). When referenced to the Affymetrix reference, the 90 individuals were dispersed in a plot of eigenvectors, suggesting that the Korean population can be stratified into subgroups according to DNA copy number variations. In contrast, most of the 90 individuals were clustered around the intersecting point of the two eigenvectors when referenced to the self-including Korean reference genome. This result suggests that a reference genome assembly from a more heterogeneous population gives more informative data for population stratification studies using CNV information. Thus, if an appropriate copy number reference genome is selected, genomic structural variation would provide a valuable source for refining stratification of ethnic groups within a single population.

Figure 4

Plots of two eigenvectors for 90 unrelated individuals from the P value-based analysis of DNA copy numbers. The first and second eigenvectors explain 21% and 9% of the distance variables, respectively. The Affymetrix reference that was generated from more heterogeneous populations resulted in a more widely dispersed plot than the self-including Korean reference that was generated from a single ethnic population.

Discussion

We analyzed the profiles of DNA CNVs of 90 Korean individuals using publicly available Korean HapMap SNP 50k chip data. Different CNV reference genome assembly sets (either the Korean reference or Affymetrix reference sets) were used for the combined CNV analysis with P and SD values, as well as for the P value-based and copy number-based analyses. According to the QMPSF experiments, the validation rate of CNVRs was higher in the combined CNV analysis than in the P value-based or copy number-based CNV analyses. We therefore generated the final CNV profile of a Korean population by the combined CNV analysis. The Korean HapMap samples (n = 90) showed about five times fewer CNVRs in total when referenced to the Korean reference set (123 CNVRs) than to the Affymetrix reference set (643 CNVRs), reflecting an ethnic difference in CNV profiles between the Korean population and other ethnic groups. In fact, it was reported that copy numbers of particular genes (e.g., CCL3L1, MAPT) varied among different ethnic groups (Gonzalez et al., 2005). Thus, this result suggests that each ethnic population has a distinct CNV profile applicable to various population genetic studies.

On the other hand, in the combined CNV analysis with both P and SD values, no CNVR was detected in 50 individuals when referenced to the self-including Korean reference set. These 50 individuals would represent the standard genome of the Korean population in terms of structural genomic variations because they did not exhibit a copy number difference relative to the Korean reference genome. Thus, genetic information of these 50 individuals could be a good standard reference for further population genetic studies.

Some array CGH studies detected 11 CNVs (Sebat et al., 2004) or 12.4 CNVs (Iafrate et al., 2004) on average in each person. Generally, array CGH assays identify a smaller number of CNVRs comprising large-insert clones, which results in the overestimation of CNV length (Redon et al., 2006). Moreover, the choice of reference genome set can affect the total and average numbers of CNVRs in normal healthy individuals. In this study, the Korean population exhibited 123 CNVRs in total and 2.1 ± 5.0 per person (up to 32 CNVRs) when compared with the Korean reference genome set in the combined CNV analysis with cutoffs of P < 0.01 and SD ≥ 0.25. These CNV calls were made in reference to the self-including Korean genome assembly, which was generated from the average copy numbers of the study subjects. Therefore, the choice of a self-including reference genome assembly set might contribute to the small number of CNV calls.

Generally, the term polymorphic is used to denote variants that occur in > 1% of the population (Strachen and Read, 1999). CNVRs identified in this study are considered to be potentially polymorphic rather than rare mutations in the Korean population, because each was detected at least once among the 90 individuals, which corresponds to a frequency above 1%. Furthermore, these structural genomic variations provide an additional dimension for human genetic diversity and disease susceptibility along with SNPs. Thus, it would be interesting to study the simultaneous association of these potentially polymorphic CNVRs and SNPs of particular genes with human complex diseases including diabetes and hypertension.

Many CNV analysis tools have been developed to extract accurate copy number information from probe intensities of SNP genotyping arrays (Lin et al., 2004; Nannya et al., 2005; Price et al., 2005; Slater et al., 2005; Fiegler et al., 2006; Hu et al., 2006). However, the detection of real CNVs still needs additional experiments for validation, such as PCR-based methods (e.g., quantitative real-time PCR, QMPSF, MAPH, MLPA) and hybridization-based methods (e.g., Fiber-FISH, Southern blotting) (Feuk et al., 2006). According to the QMPSF results, the validation rates of CNV calls from our different CNV analyses were 50%, 46% and 75% for the P value-based analysis, the copy number-based analysis, and the combined CNV analysis with P and SD values, respectively. The P value-based CNV analysis could detect individual-based CNVRs, whereas the copy number-based CNV analysis with SD values could detect population-wide CNVRs. Thus, our combined CNV analysis enabled the detection of population-wide as well as individual-based CNVRs with a higher validation rate (75%) of CNV calls. Redon et al. (2006) estimated the false positive rate using singleton CNVs (called in only a single individual). They found an initial validation failure rate of 24% and then estimated a false positive rate of 8% by extrapolating these validation rates across the entire data set (24% multiplied by the frequency of singleton CNVs called on only one platform).

There are several possible explanations for the low validation rates of our CNV calls. First, our CNV calls were made in reference to a genome assembly set from multiple individuals (n = 90) instead of a single genome, so no single reference DNA sample was available for the QMPSF experiments. Second, QMPSF primers might not exactly target authentic CNV regions because our CNV profile was generated from the 50 k Affymetrix chip data, which contain a relatively low density of SNP probes, resulting in low-resolution detection of CNVRs in the present study. Third, some CNVRs might be multi-allelic or complex CNVRs that are difficult to validate by conventional PCR-based methods. Rather, hybridization-based methods (e.g., Fiber-FISH, Southern blotting) or a whole genome sequencing approach would detect such multi-allelic or complex CNVRs (Rodriguez-Revenga et al., 2007). Taken together, the validation rate would be increased if CNV-affected samples were compared with a single genome reference in validation experiments of CNVRs. Moreover, more advanced chip platforms with a higher density of probes would give a higher-resolution map of CNVRs, which would support better experimental validation.

The longest CNVR detected by the P value-based CNV analysis was in fact a microscopic genome change encompassing approximately 31 mb at 4q13.1-4q22.13, which was validated in the corresponding EBV-transformed lymphoblastoid cell line (LCL) by the QMPSF method. The copy number change of this region was not found in blood DNA from the same donor as the LCL DNA (data not shown), suggesting that this copy number change was acquired during EBV-mediated B cell transformation. The LCL strain containing this CNVR would provide a valuable resource for a haploinsufficiency study of the corresponding genes at this locus.

An immunoglobulin locus (22q11.22) was also detected as a CNV-affected region, occurring in four of the 90 individuals (4/90) in the combined CNV analysis using the Korean reference set. Previous reports suggested that polymorphic gene duplication may frequently occur in Ig loci including 14q32.33 (Ig heavy variable cluster) and 22q11.2 (Ig heavy chain constant region and Ig lambda) (Sasso et al., 1995; van der Burg et al., 2002; Buckland et al., 2003). It is likely that the copy number variation of the Ig loci is not of germ line origin but a de novo CNV enriched during LCL generation. In fact, Ig locus CNVs have been excluded from CNV analyses (Redon et al., 2006).

Recent reports suggest that as much as 40% of the known CNVs occur in gene deserts while the other CNVs are enriched for genes involved in immunity and environmental responses (Derti et al., 2006; Rodriguez-Revenga et al., 2007). Copy number invariant regions may be evolutionarily conserved for DNA copy numbers because of the potential effects of gene dosage. In addition, with respect to genome stability, it would be interesting to know whether individuals who carry more CNVRs have a higher risk of cancer than individuals who carry fewer CNVRs. Therefore, the CNVRs and non-CNVRs identified in this study would be a good starting point for further CNV studies to determine their clinical relevance in complex traits or disease susceptibility in the Korean population.

Methods

DNA samples

We used the Affymetrix GeneChip Mapping 50 k array data set obtained from the Korean HapMap project for which DNA samples were selected to include 90 unrelated healthy Korean individuals with an equal sex ratio and age 40-69 years for the Korean Health and Genome Epidemiology Study (http://cgs.cdc.go.kr), as described in previous reports (Kim et al., 2006; Yoo et al., 2006). Genomic DNA was extracted from EBV-transformed B lymphoblastoid cell lines (LCLs) provided by the National Biobank of Korea, Korea National Institute of Health.

DNA copy number analysis

The Affymetrix GeneChip Mapping 50k_XBA240 array data for the Korean HapMap project (Korean HapMap 50 k) were generated according to manufacturer's instructions, as reported elsewhere (Herbert et al., 2006). Average call rate of the Korean HapMap 50 k data set was 98.4% with a standard deviation of 0.92%, ranging from 95.1% to 99.4%. For CNV detection, the Korean HapMap 50 k data were analyzed using Affymetrix GeneChip Chromosome Copy Number Analysis Tool (CNAT) 3.0. DNA copy numbers of individual Koreans (n = 90) were compared to two different copy number reference genome assembly sets: 1) the Affymetrix Mapping 50k_XBA240 reference data (hereinafter called the Affymetrix reference set) from three different ethnic groups (42 African Americans, 20 Asians, and 42 Caucasians) and the fourth group (24 PD panel) in which the 20 Asians did not include Koreans, 2) the self-including copy number reference genome assembly from 90 Korean individuals of the Korean HapMap 50 k (hereinafter called the Korean reference set).

With regard to CNV calling rules, we annotated CNVRs to particular chromosomal regions if the region was more than 1 kb in size, encompassed two or more consecutive probes, and met one of the following cutoff criteria: 1) P < 0.001 from the GSA_pVal (genome-smoothed analysis of the P-value) when compared with the Affymetrix reference, 2) P < 0.01 from the GSA_pVal when compared with the self-including Korean reference, 3) a standard deviation (SD ≥ 0.25) of GSA_CN values (genome-smoothed analysis of the copy number) in copy number-based analysis, 4) the combined criteria of P value (P < 0.01 or P < 0.001) and standard deviation (SD ≥ 0.25) of copy numbers. DNA copy number variations of the sex chromosomes were not analyzed in this study because of different sex ratios of test samples to reference genome assemblies.
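The combined calling rule can be sketched as a scan over probes ordered by genomic position. This is an illustration under stated assumptions, not the CNAT implementation: the function `call_cnvrs` and its argument names are hypothetical, the per-probe P values are taken to belong to one individual on one chromosome, the SD values are taken to be per-probe standard deviations across the population, and a probe is flagged only when it meets both cutoffs.

```python
from typing import List, Sequence, Tuple


def call_cnvrs(
    positions: Sequence[int],      # probe positions in bp, sorted, one chromosome
    p_values: Sequence[float],     # per-probe P values for one individual
    sds: Sequence[float],          # per-probe SD of copy number across subjects
    p_cutoff: float = 0.01,        # 0.01 (Korean reference) or 0.001 (Affymetrix)
    sd_cutoff: float = 0.25,
    min_probes: int = 2,
    min_span_bp: int = 1000,
) -> List[Tuple[int, int, int]]:
    """Return (start, end, n_probes) for runs of flagged consecutive probes."""
    if not (len(positions) == len(p_values) == len(sds)):
        raise ValueError("positions, p_values and sds must have equal length")

    # A probe is flagged when it passes both the P value and the SD cutoff.
    flagged = [p < p_cutoff and sd >= sd_cutoff for p, sd in zip(p_values, sds)]

    regions = []
    run_start = None
    # The trailing False acts as a sentinel that closes a run ending at the last probe.
    for i, is_flagged in enumerate(flagged + [False]):
        if is_flagged and run_start is None:
            run_start = i
        elif not is_flagged and run_start is not None:
            run_end = i - 1
            n_probes = run_end - run_start + 1
            span = positions[run_end] - positions[run_start]
            # Keep runs with enough consecutive probes that span more than 1 kb.
            if n_probes >= min_probes and span > min_span_bp:
                regions.append((positions[run_start], positions[run_end], n_probes))
            run_start = None
    return regions


# Hypothetical probes: two adjacent flagged probes form a region, a lone one does not.
example_positions = [1000, 5000, 9000, 20000, 20500, 40000]
example_p = [0.5, 0.001, 0.002, 0.5, 0.001, 0.5]
example_sd = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
example_regions = call_cnvrs(example_positions, example_p, example_sd)
print(example_regions)
```

Setting `sd_cutoff=0.0` reduces the sketch to a P value-only rule; a population-wide SD-only rule would instead flag probes on the SD values alone.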

CNV validation

DNA copy numbers were validated by the QMPSF method using fluorescein-labeled forward primers, as described previously (Charbonnier et al., 2002; Vaurs-Barriere et al., 2006). Briefly, 100 ng of DNA template was added to make 25 µl of PCR reaction mixture including 1× PCR Gold buffer, 2 mM MgCl2, 0.2 mM dNTP, appropriate PCR primers at concentrations of 0.04-0.09 µM, and 3 units of Ampli-Taq Gold (Applied Biosystems). One of the copy number invariant regions identified among the 90 individuals in this study (CN2-2) or the coagulation factor VIII gene was used as an internal reference for the normalization of input DNA (Levine et al., 2005; Jeon et al., 2007). Primer sequences of the reference genes were as follows. CN2-2: forward, 5'-CTTAGGTTCCCACGGTTTGA-3'; reverse, 5'-GCACTTGAAAGGTGCCTAGC-3'. Factor VIII: forward, 5'-TACCATCCAGGCTGAGGTTTAT-3'; reverse, 5'-AAAGAGTTGTAACGCCACCATT-3'. Hot start PCR was carried out at 95℃ for 10 min for a denaturation step, followed by 21 cycles of 94℃ for 30 s, 60℃ for 30 s, and then further extended at 72℃ for 50 min. Ethanol-precipitated PCR products were dissolved in 5 µl of water. One microliter of the purified PCR products was mixed with 1 µl of Gene Scan-500 LIZ size standards and 14 µl of HiDi formamide, and then run on the ABI3730 capillary sequencer (Applied Biosystems). Data analysis was performed using the GeneMapper software (Applied Biosystems). In order to obtain experimental copy numbers of CNVRs, the peak height of a given CNVR was divided by the peak height of the reference (Factor VIII or CN2-2) from the QMPSF chromatogram, which resulted in relative copy numbers of CNVRs (see Supplemental Data Materials for details of the QMPSF experiments).
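The normalization step at the end of this procedure is a simple ratio, which can be written out as follows. This is a minimal sketch; the function names and the peak heights are hypothetical, and averaging over replicate measurements is an assumption about how replicates might be summarized.

```python
from statistics import mean


def relative_copy_number(target_peak: float, reference_peak: float) -> float:
    """Divide the CNVR peak height by the reference peak height (Factor VIII or CN2-2)."""
    if reference_peak <= 0:
        raise ValueError("reference peak height must be positive")
    return target_peak / reference_peak


def mean_relative_copy_number(peak_pairs) -> float:
    """Average the relative copy number over (target, reference) replicate pairs."""
    return mean(relative_copy_number(t, r) for t, r in peak_pairs)


# Hypothetical peak heights from three replicate runs (arbitrary fluorescence units).
replicates = [(1300.0, 1000.0), (1340.0, 1000.0), (1320.0, 1000.0)]
average_ratio = mean_relative_copy_number(replicates)
print(round(average_ratio, 2))
```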

Multi-dimensional scaling (MDS)

After CNV profiles of the 90 individuals were obtained by the P value-based analysis in reference to the Affymetrix (P < 0.001) or Korean (P < 0.01) reference genome assembly sets, loss and gain of DNA copy numbers were coded as "1" and "2" for each probe, respectively. Probes with no copy number change were coded as zero ("0"). These codes of copy number patterns were used in principal component analysis. Briefly, the pairwise similarity coefficients (sij) of CNV for individuals were computed as;

where cijk is 1 (or 0) depending on the copy number change (or no change) between two individuals (i and j) at a marker position (k), N is the number of markers, and L (=M) is the number of subjects. In particular, the diagonal matrix elements (sii) are not normalized to be the same among individuals, since the number of accumulated copy number changes across the whole genome can differ between individuals. Using the similarity coefficients (sij), a Jacobi transformation was performed to calculate the eigenvectors (Press et al., 1988).
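The procedure described here can be sketched in pure Python. The similarity formula is not reproduced in this text, so the sketch assumes one plausible reading: cijk = 1 when individuals i and j carry the same non-zero code (both loss or both gain) at marker k, and sij is the sum of cijk over markers divided by N. Under that assumption sii is the fraction of markers changed in individual i, so the diagonal differs between individuals as the text describes. The eigen decomposition uses a textbook cyclic Jacobi rotation scheme, which is not necessarily the routine of Press et al. (1988). All function names are illustrative.

```python
import math


def similarity_matrix(codes):
    """codes[i][k] is 0 (no change), 1 (loss) or 2 (gain) for individual i at marker k.

    Assumed definition: s_ij = (number of markers where i and j share the same
    non-zero code) / N. This is an assumption, not the formula from the paper.
    """
    n_markers = len(codes[0])
    return [
        [sum(1 for a, b in zip(row_i, row_j) if a != 0 and a == b) / n_markers
         for row_j in codes]
        for row_i in codes
    ]


def jacobi_eigen(matrix, tol=1e-18, max_sweeps=50):
    """Cyclic Jacobi method for a symmetric matrix.

    Returns (eigenvalues, vectors), where column i of vectors is the
    eigenvector belonging to eigenvalues[i].
    """
    n = len(matrix)
    a = [list(map(float, row)) for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-300:
                    continue
                # Rotation angle that zeroes the (p, q) element.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # columns p and q: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rows p and q: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate rotations: V <- V J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


def top_components(sim, n_components=2):
    """Return the largest eigenvalues and each individual's coordinates on them."""
    eigenvalues, vectors = jacobi_eigen(sim)
    order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)
    top = order[:n_components]
    coords = [[vectors[k][i] for i in top] for k in range(len(sim))]
    return [eigenvalues[i] for i in top], coords


# Hypothetical codes for three individuals at four markers.
example_codes = [[0, 1, 2, 0], [0, 1, 0, 0], [2, 0, 0, 0]]
example_sim = similarity_matrix(example_codes)
example_values, example_coords = top_components(example_sim)
```

Plotting the two columns of `example_coords` against each other gives the kind of two-eigenvector scatter described for Figure 4; the share of variation attributed to each axis could be taken as each eigenvalue divided by the sum of all eigenvalues, which is again an assumption about how the reported percentages were derived.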