MSeq-CNV: accurate detection of Copy Number Variation from Sequencing of Multiple samples

Currently a few tools are capable of detecting genome-wide Copy Number Variations (CNVs) based on sequencing of multiple samples. Although aberrations in mate pair insertion sizes provide additional hints for the CNV detection based on multiple samples, the majority of the current tools rely only on the depth of coverage. Here, we propose a new algorithm (MSeq-CNV) which allows detecting common CNVs across multiple samples. MSeq-CNV applies a mixture density for modeling aberrations in depth of coverage and abnormalities in the mate pair insertion sizes. Each component in this mixture density applies a Binomial distribution for modeling the number of mate pairs with aberration in the insertion size and also a Poisson distribution for emitting the read counts, in each genomic position. MSeq-CNV is applied on simulated data and also on real data of six HapMap individuals with high-coverage sequencing, in 1000 Genomes Project. These individuals include a CEU trio of European ancestry and a YRI trio of Nigerian ethnicity. Ancestry of these individuals is studied by clustering the identified CNVs. MSeq-CNV is also applied for detecting CNVs in two samples with low-coverage sequencing in 1000 Genomes Project and six samples form the Simons Genome Diversity Project.

Copy Number Variation (CNV) and balanced rearrangements such as inversions and translocations are types of the large structural variations in the human genome and other organisms. In Copy Number Variation, a gene or a genomic region appears in different number of copies in different individuals or even in different cells of the same individual. CNVs are generally referred to as a duplication or deletion of a genomic region with at least 1 kb in length. However, several clinically important CNVs are shorter than 1 kb in length. CNV results in having variations in the gene expressions and abnormalities in the human phenotypes 1 . Moreover, CNV is envisaged to be associated with many human diseases such as autoimmune disease 2 , autism 1 and developmental disabilities 3 , diabetes, schizophrenia 4 , cancer 3 and obesity.
In the last decade, CNVs are studied via Microarray-based Comparative Genomic Hybridization (aCGH) methods [5][6][7][8][9][10] . However, the current aCGH platforms which benefit of more than 1 million genomic probes have a lower detection limit of CNVs of length ~5 kb to 25 kb 11,12 . In the recent years, Next Generation Sequencing (NGS) has provided new opportunities for the CNV studies with an unprecedented resolution [13][14][15] . In NGS, millions of single end or mate pair reads are generated from the sample genomes with shotgun sequencing. CNVs are then detected based on the frequency of the reads (read depth) or aberrations in the mate pairs, after mapping the short reads to the reference genome.
Multiple sequencing reduces the effect of the systematic errors and artifacts which are attributed to the library-preparation protocol or individual sample genome characteristics 35 . There are common CNVs which are shared by complex diseases 36 and can be detected from sequencing of multiple samples. Simultaneous analysis of multiple samples allows detecting read counts variations occurring due to the noise across samples, even in genomic positions with constant copy numbers. Therefore, to increase the detection power, more samples should be sequenced with a low sequencing coverage rather than for sequencing a few samples with high-coverage sequencing. Indeed, CNV detection methods which rely on a low-coverage sequencing data are more relevant in the future studies 37,38 . Currently, many individuals are sequenced with a low genome-wide coverage. For example, the 1000 genomes project carried out the whole-genome shotgun sequencing of 179 individuals with 2× to 4× coverages [39][40][41] .
Currently, there are a few tools which are capable of the simultaneous analysis of several sequenced samples 37,42,43 . However, a major drawback of these tools is relying only on read depth data which results in suffering from a low power or a high false positive rate, due to the large noise in read depth signals. Moreover, these tools do not take the observed aberrations in the mate pair reads into account. Indeed, besides read depth, mate pair insertion sizes provide another source of information for the genome-wide CNV detection with an increased resolution.
In this paper, MSeq-CNV is proposed for detecting recurrent genomic deletions and duplications across multiple individuals, by the simultaneous analysis of samples. To the best of our knowledge, MSeq-CNV is the first computational tool which takes both read depth and insertion size signals in several individuals into account. The MSeq-CNV applies a mixture density to model the distribution of the read counts and the distribution of the number of mate pairs with aberrations in insertion size.
Each component in the mixture density applies a Binomial distribution to model the number of mate pairs with insertion size aberrations and a Poisson distribution to model the read counts, in each genomic region. After estimating the model parameters based on Expectation-Maximization (EM) algorithm, the posterior probability of the digitized copy number of each segment in the sample genomes is computed. The resolution of the MSeq-CNV is evaluated on a set of samples with implanted CNVs, which are constructed based on the human reference genome. Compared to the other state of the art methods, MSeq-CNV has reached an unprecedented precision and recall values which allows detecting recurrent genomic variants, accurately.
The MSeq-CNV is also applied for the CNV detection in a set of six HapMap individuals with high-coverage sequencing, including a CEU trio of European ancestry (NA12891, NA12892, and NA12878) and a YRI trio of Yoruba Nigerian ethnicity (NA19238, NA19239, and NA19240).

Methods
Assume that there are k sample genomes which are sequenced using a Next Generation Sequencing platform and mate pair reads are generated. After mapping mate pairs, the reference genome is divided into T segments of length L. Here, we aim at estimating the copy number of each genomic segment in samples 1 to k. To estimate the copy number of each sample in the t th genomic segment where t = 1, 2, …, T, this paper relies on two signals: i) number of reads which are mapped to the segment, and ii) information from the mate pair whose insertion (un-sequenced) region is passing the t th genomic segment and its reads are flanking the corresponding genomic segment.
Here, number of reads which are generated from the j th sample and are mapped to the t th segment of reference genome (studied segment) is denoted by f j . Also, n j denotes the number of mate pairs which are generated from the j th sample and their insertion region is passing the t th segment, after mapping to it.
Signal characteristics in different genomic states. The studied segment in the j th sample has one of the following copy number states i.e. {homozygous deletion, heterozygous deletion, diploid and duplications}. In each state, the characteristics of the read counts and mate pair signals which are used for the mathematical modeling are described below: Diploid. in this state, j th sample carries two copies of the corresponding segment in the reference genome. Here, a mate pair which is generated from the j th sample aligns to the reference genome with a normal insertion size, distributed with the clone library insertion size distribution. Also, number of reads which are mapped to the corresponding segment in the reference genome i.e. f j is assumed to have a Poisson distribution with parameter λ.
Heterozygous deletion. in this state, the j th sample carries only one allele of the corresponding segment in the reference genome. Therefore, some mate pairs which are generated from the j th sample align to the reference with a normal insertion size, distributed with the clone library insertion size distribution. Other mate pairs align to the reference genome much further apart than expected. In this state, read counts are also distributed with a Poisson distribution with a parameter of 2 λ .
Homozygous deletion. in this state, both alleles are deleted from the j th sample. Therefore, a high percentage of the mate pairs which are generated from this region will map to the reference genome much further apart than the expected insertion size distribution in the clone library. Also, we consider a Poisson distribution with a parameter of ελ for the read counts, where ε is assumed to be a very small value.
Duplication. in this state, the sample genome carries more than two copies of the corresponding segment in the reference genome. However, mate pairs which are generated from duplicated regions will map to the reference with the insertion size distribution of the clone library. Read counts are also distributed with a Poisson distribution with a parameter of λ i /2, for the samples carrying i copies.
Mathematical modeling of mate pair insertion sizes and read counts. Consider a segment of the reference genome whose copy number in the j th sample is of interest, j = 1, 2, …, k. Also, let n j denote the total number of mate pairs which are generated from the j th sample and align to the reference genome with condition ii, as mentioned before. Also, n j1 denotes the number of mate pairs which are mapped to the reference with the insertion size distribution of the clone library and n j2 denotes the number of mate pairs which are mapped to the reference much further apart, compared to the insertion sizes in the clone library. Clearly, = + n n n j j j 1 2 . Here, we assumed that n j1 is binomially distributed as follows: where, n n 1, 2, , j j 1 = … . In the above distribution, i β indicates the probability of observing a mate pair mapped to the reference with a clone library insertion size distribution, when sample genome is in the i th CNV state. Where, i = 0, 1, 2, 3, …, m corresponds to {homozygous deletion, heterozygous deletion, diploid and duplica-tions}. Also, the maximum copy number of a genomic segment is denoted by m, i.e. i m ≤ . When j th sample has i copies of the studied segment of the reference genome, read count f j follows a Poisson distribution with a parameter of θ λ i : Also, from a total number of k samples, let α i denote the percentage of samples which have i copies of the studied segment of the reference genome. Taking the above descriptions into account, the probability of observing f n n ( , , ) j j j 1 2 in the j th sample genome can be written as follows: However, it should be added that in the above formulations n j1 and n j2 are not known and they depend on the unknown parameter β i . Also, the estimation of n j1 and n j2 , requires estimating the probability of each insertion size to be distributed with the clone library insertion size distribution. For this purpose, let o jr denote insertion size of the r th mate pair which was generated from the j th sample and was mapped to the studied segment of the reference genome. Where = … r n 1, 2, , j and j = 1, 2, …, k. Consequently, a random variable z jr is corresponded to each mate pair insertion size o jr : To estimate the expected value of n j1 and n j2 , we calculate the probability of having a z jr equal to 1, for = … r n 1, 2, , j and j k 1, 2, , = … , see Supplementary file 1 for a detailed description. Now, a Dirichlet prior distribution is defined for the parameter vector α Also, the prior of each β i is considered to be a beta distribution as follows: where, i = 0, 1, 2, …, m. Moreover, the prior distribution of λ is considered to be a uniform distribution, over the interval of t (0, ), where t is large enough.

Model parameters.
There are a number of parameters in the above mathematical model which have to be estimated. These parameters include λ, the average read counts in a genomic segment of diploid state. The parameter vector ( , , , ) represents the percentage of samples with copy numbers 0, 1, …, m of the studied segment of the reference genome. Also, for a sample genome with copy number state i, β i indicates the proportion of the mate pairs which are mapped to the reference genome much further apart than expected under the clone library insertion size distribution.
The parameters of the prior distribution over α are given values based on information from genome-wide CNV percentage. Since a high percentage of genomic segments in each sample are expected to be in diploid state, γ 2 is given a value much higher than γ i , ≠ i 2. Also, a beta distribution is defined as a prior distribution over each β i , i = 0, 1, 2, …, m. The parameters of the beta distribution i.e. ν i1 and ν i2 are given values In this study, we have considered a segment size of 150 bps in the simulated data analysis and a segment size of 100 bps in the real data analysis. Considering a longer segment size decreases the running time of the algorithm, with the cost of lower resolution. The methods which are compared to MSeq-CNV are also implemented with the same segment size, as MSeq-CNV. However, there are other methods which are specialized in detecting genomic rearrangements and tandem duplications using paired reads which get the nucleotide resolution breakpoint 25,29 . Data availability. BAM (Binary Alignment/Map) files of the alignment of the mate pair reads to the build 36 (hg18) of the human reference genome are available at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/. R program of MSeq-CNV and a detailed procedure for its running is available at https://github.com/CNVdetection/ MSeq-CNV. The list of detected CNVs in the studied individuals is also submitted to this webpage.

Results
Implementation To Real Data of The Human Reference Chromosome. We have constructed 40 sample genomes with implanted CNVs, from chromosome 3 of the human reference genome. After duplicating chromosome 3 of the reference genome, it was altered with implanted CNVs of length 250 bp, 500 bp, 750 bp, 1 kb, 1.5 kb, 2 kb, 2.5 kb, 3 kb, 3.5 kb, 4 kb, 4.5 kb, 5 kb. The position of each CNV is randomly chosen so that CNVs do not overlap along chromosome. After determining the CNV positions on the reference genome, CNVs are implanted into each sample genome.
Indeed, for each CNV region which is implanted in the reference genome, distributions and characteristics of CNVs across sample genomes are determined based on a previous analysis of the HapMap individuals.
Based on the characteristics of the HapMap individuals, 80% of the implanted CNVs were of type loss in which deletions occurs in some sample genomes. Also, 15% of the CNV regions were of type gain in which some sample genomes have an elevated number of copies. The other 5% of the implanted CNV regions were of type mixed in which sample genomes may either have copy loss or copy gain.
In each CNV region, the copy number of each sample was also drawn from the copy number distribution in HapMap individuals. For a genomic loss region, a sample has copy numbers 2, 1, and 0 with probabilities 0.8, 0.15 and 0.05, respectively. For a genomic gain region, a sample has copy numbers 2, 3, 4, and 5 with probabilities 0.85, 0.08, 0.06 and 0.01, respectively. Also, for a CNV region of type mixed, a sample has copy numbers 0, 1, 2, 3, 4 with probabilities 0.04, 0.16, 0.67, 0.11 and 0.02, respectively.
After constructing the sample genomes, MAQ is applied for generating mate pair reads from each sample genome. Mate pairs are then mapped to the human reference genome. After dividing the human reference genome into segments of length 150 bp, MSeq-CNV is applied for detecting CNVs in the corresponding segments of the constructed sample genomes. In Table 1, the performance of MSeq-CNV is reported for each CNV state i.e. homozygous deletion, heterozygous deletion, and duplications, for a genome-wide sequencing coverage of 10×. The performance of MSeq-CNV is also compared to the central CNV detection tools i.e. rSW-seq, CNV-seq and cnMOPS. These tools are selected for comparisons because of their high resolution and their capability in detecting both genome-wide deletions and duplications 37 . It should be added that rSW-seq and CNV-seq are not capable of detecting the digitized copy number of genomic regions i.e. these tools do not discriminate heterozygous deletions from homozygous deletions. However, MSeq-CNV resembles cn.MOPS in detecting the digitized copy number of each CNV region.
For calculating precisions and recalls, the whole simulation study was repeated five times for each setting and the average results across five repeats are summarized in Table 2. In this table, for a genome-wide sequencing coverage of 1×, 5×, 10×, MSeq-CNV is compared to the other tools in each genomic state i.e. homozygous deletion, heterozygous deletion, diploid and duplications. The F-score which is the harmonic mean of the precision and recall values are also calculated in Table 2, for each CNV state.
As shown in Table 2, for all coverage values and according to the F-score, the performance of MSeq-CNV has been superior to the compared tools in detecting genomic regions with deletions and duplications. In regions with diploid copies, MSeq-CNV outperformed the compared tools for a coverage of 1×. Also for a coverage of 5× and 10×, MSeq-CNV and CNV-seq are both ranked as the best tools in detecting regions with diploid copies.
In Table 3, the overall performance of MSeq-CNV is compared to the other tools. To calculate the overall performance of each tool in estimating the correct copy number state, number of nucleotides whose states were  Table 3. The overall performance of the MSeq-CNV in comparison with rSW-seq, CNV-seq and cn.MOPS. For each method, the average of overall accuracies over five different runs of the whole study is given in each cell. For each coverage, the highest accuracies are indicated in bold. **Number of nucleotides whose states were correctly predicted is divided by the total number of genomic nucleotides (genome length).
correctly predicted is divided by the genome length. As indicated in Table 3, the overall performance of the MSeq-CNV is superior to rSW-seq, CNV-seq and cnMOPS, for a coverage of 1× and 5×. Also, MSeq-CNV and CNV-seq outperformed rSW-seq and cn.MOPS with an overall accuracy of 0.98, for a 10× coverage. Performance of MSeq-CNV is also evaluated in terms of allele frequency of CNVs. As shown in Table 4, in CNV regions with copy loss, accuracy does not change with an increase in allele frequency i.e. MSeq-CNV is accurate in detecting genomic deletions. However, in CNV regions of type copy gain or mixed, overall accuracies decrease with an increase in allele frequency. This is associated with lower accuracies in detecting genomic duplications, compared to the other genomic regions. Figure 1 shows the RAM usage, running time and overall accuracy of MSeq-CNV in terms of sequence numbers, i.e. number of individuals which are compared to each other, for 10× sequencing coverage. To obtain these results, 4 computer cores were applied for running the parallel programming version of the MSeq-CNV, on a 64-bit windows operating system with Intel Core(TM) i7-4710HQ CPU @3.5 GHz processor. As shown in Fig. 1A,B, RAM usage and running time of MSeq-CNV both increase, with an increase in sequence numbers. However, as shown in Fig. 1C, analyzing more sample genomes at a time has a positive effect on the overall accuracy of MSeq-CNV.    In past studies 44 , to reduce the false discovery rate, CNVs' callset were heavily pre-filtered and only a high confident set of CNVs were reported. In our Bayesian framework, to reduce the false positive rate, we report a set of CNVs with high posterior probabilities. Indeed, those CNVs with posterior probabilities lower than a fixed threshold i.e. 0.5 are filtered out from the callset.

Results From The High-Coverage
In Table 5, number of CNVs and their total size (in Mega bp) are reported for deletion and duplication calls with at least 1 kb in size. As indicated in Table 5 The average number of CNVs in YRI trio i.e. 258, 508 (467.300 Mega bp) is slightly higher than the average calls in CEU i.e. 249, 152 (407.319 Mega bp). Also, as indicated in Table 5, the average number of deletion and duplication calls in YRI individuals (91, 609 and 166, 899) are both more than CEU trio (83, 778 and 165, 374), indicating the increased diversity of the African individuals in comparison with CEUs 46 . Moreover, in the studied individuals genomic deletions are less common compared to duplications 44,46,47 .
Total number of CNV calls in each chromosome is plotted in Fig. 2A, for each HapMap individual. Numbers of deletion and duplication calls are also given in Table S To investigate the validity of the CNV calls, their overlap with the Database of Genomic Variants (DGV) 48 , http://dgv.tcag.ca/dgv/ is studied. DGV includes 8, 599 CNVs from 40 HapMap individuals which are validated experimentally using aCGH methods.
The overlap of the detected CNVs with DGV are determined by the number of calls and also by the size of overlap, in base pairs. CNV calls in CEU trio NA12891, NA12892 and NA12878 overlap with DGV respectively with a ratio of 0.61, 0.62 and 0.61, for the number of calls. Also, a base which is called as a CNV in NA12891, NA12892 and NA12878 overlap with a base in DGV respectively with a ratio of 0.60, 0.61 and 0.61. The YRI individuals NA19238, NA19239 and NA19240 overlap with DGV respectively with a ratio of 0.62, 0.61 and 0.62 for the number of calls, and 0.61, 0.59 and 0.60 for the base pairs. Therefore, more than a half of CNV calls which are made by MSeq-CNV are previously validated using aCGH methods.
Size distribution of CNVs are also shown in Fig. 3A,B, respectively for the deletion and duplication calls. Clearly, the numbers of deletion and duplication calls decrease exponentially, with an increase in CNV size. As shown in Fig. 3A, deletion size distributions almost overlap in all studied individuals. Duplication size distributions are also very similar in all individuals, with CEU individuals having more CNVs of smaller sizes.  Moreover, we applied the hierarchical clustering algorithm to the matrix of CNV regions which are identified in the genome of six HapMap individuals. As shown in Fig. 4, although no information about the individual's identities are used in the hierarchical clustering, the algorithm has correctly segregated the ancestry of the six individuals in two groups. While one group includes the CEU individuals NA12891, NA12892 and NA12878 with European ancestry, the other group includes YRI individuals NA19238, NA19239 and NA19240 with Nigerian ancestry.

Results From The Low-Coverage Data of The 1000 Genomes Project.
MSeq-CNV is also applied for the CNV detection form the low-coverage data of two individuals i.e. NA12761 and NA12762, from 1000 Genome project. After downloading BAM files of the alignment of mate pair reads to the human reference genome, mate pairs with low mapping qualities (Q < 25) are filtered out.
MSeq-CNV called a total number of 145, 462 (191.495 Mega bp) and 119, 074 (169.339 Mega bp) CNVs in the genomes of NA12761 and NA12762, respectively. Also, in both individuals, genomic deletions are less common compared to the duplications 44,46,47 . Details of deletion and duplication calls are given in Table 5 and Table S.3.
The low number of CNV calls in the genome of NA12761 and NA12762 is potentially associated with lower accuracies in detecting genomic CNVs, especially duplications, from low-coverage sequencing data of NA12761 and NA12762 (see Table 2).
The overall number of detected CNVs in each chromosome is shown in Fig. 2B, for NA12761 and NA12762. Detected CNVs in NA12761 and NA12762 overlap with DGV respectively with a ratio of 0.65 and 0.66 for the number of calls, and 0.66 and 0.67 for base pairs.  The highest number of CNVs are detected in the South Asian individuals LP6005519-DNA_A04 and LP6005519-DNA_A05 (from INDIA) respectively with a total number of 262, 405 (431.409 Mega bp) and 341, 909 (545.435 Mega bp) calls. Extensive CNVs in Indian individuals, which is as many as YRI trio, were also previously reported in the admixed Indian population of African ancestry 47,50 , to adopt with environmental conditions. Details of deletion and duplication calls are given in Table S.4 and Table S.5. The overall number of detected CNVs in each chromosome is shown in Fig. 2C, for each individual.

Results
Detected CNVs in LP6005592-DNA_H03, LP6005442-DNA_E07, LP6005443-DNA_G05 LP6005519-DNA_ A04 and LP6005519-DNA_A05, and LP6005592-DNA_D01 overlap with DGV respectively with a ratio of 0.61, 0.62, 0.61, 0.60, 0.62, and 0.62, for the number of calls and 0.60, 0.60, 0.59, 0.60, 0.59, and 0.61, for the base pairs. Applications and Limitations. The MSeq-CNV can be applied for detecting the recurrent genome-wide CNVs from NGS data in the diploid genome of human and other organisms, as well. However, the current version of MSeq-CNV is not capable of detecting CNVs in the sequencing data of a haploid genome. The input NGS data for the MSeq-CNV are possibly the mate pair reads which are collected from sequencing with multiple platforms, multiple individuals and experimental conditions.
Although the current version of the MSeq-CNV is limited to the whole genome shotgun sequencing, further work is in progress to adopt MSeq-CNV with the exome or gene panel sequencing data.
Also, as mentioned above, the other attractive feature of the MSeq-CNV is in constructing the ancestry of the sequenced individuals, based on the detected CNV matrix.

Discussion
In this article we proposed MSeq-CNV as a new tool for detecting genome-wide deletions and duplications from sequencing of multiple samples. Simultaneous analysis of multiple samples allows detecting common CNVs which are shared by complex diseases. Also, read count variations which occur due to the sequencing noise can be detected by the analysis of several samples together. MSeq-CNV applies a novel probabilistic framework for modeling the read depth and insertion size signals, together.
The overall performance of MSeq-CNV has been superior to the central CNV detection tools such as rSW-seq, CNV-seq and cnMOPS. Specially, for a coverage of 1× which is fairly low, the overall performance MSeq-CNV has been considerably higher than the compared tools. Reaching a high performance in low coverage data is an advantage of MSeq-CNV. In future, CNV detection tools which rely on a low-coverage sequencing are more relevant 37,38 . Indeed, a low coverage sequencing is common in many individuals e.g. in the 1000 Genomes Project the shotgun sequencing of 179 individuals is carried out with a coverage of 2× to 4× 41 .
The MSeq-CNV works with the empirical distribution of the insertion sizes in clone library. Therefore, MSeq-CNV is robust to deviations from the theoretical insertion size distribution which occurs due to several artifacts, attributed to the library-preparation protocols.