Introduction

Copy Number Variation (CNV) and balanced rearrangements such as inversions and translocations are types of the large structural variations in the human genome and other organisms. In Copy Number Variation, a gene or a genomic region appears in different number of copies in different individuals or even in different cells of the same individual. CNVs are generally referred to as a duplication or deletion of a genomic region with at least 1 kb in length. However, several clinically important CNVs are shorter than 1 kb in length. CNV results in having variations in the gene expressions and abnormalities in the human phenotypes1. Moreover, CNV is envisaged to be associated with many human diseases such as autoimmune disease2, autism1 and developmental disabilities3, diabetes, schizophrenia4, cancer3 and obesity.

In the last decade, CNVs are studied via Microarray-based Comparative Genomic Hybridization (aCGH) methods5,6,7,8,9,10. However, the current aCGH platforms which benefit of more than 1 million genomic probes have a lower detection limit of CNVs of length ~5 kb to 25 kb11,12. In the recent years, Next Generation Sequencing (NGS) has provided new opportunities for the CNV studies with an unprecedented resolution13,14,15. In NGS, millions of single end or mate pair reads are generated from the sample genomes with shotgun sequencing. CNVs are then detected based on the frequency of the reads (read depth) or aberrations in the mate pairs, after mapping the short reads to the reference genome.

The majority of the current CNV detection tools analyze only one sample genome, at a time. These tools which are not capable of the simultaneous analysis of multiple samples rely either on read depth data e.g. CNV-seq14, rSW-seq16, m-HMM17, BIC-seq18, EWT19, SegSeq20, CNVwire21 and ReadDepth22 or on mate pair/split reads23,24,25,26,27,28,29,30,31,32,33,34. However, there are benefits in having the capability to analyze several sequencing samples, simultaneously.

Multiple sequencing reduces the effect of the systematic errors and artifacts which are attributed to the library-preparation protocol or individual sample genome characteristics35. There are common CNVs which are shared by complex diseases36 and can be detected from sequencing of multiple samples. Simultaneous analysis of multiple samples allows detecting read counts variations occurring due to the noise across samples, even in genomic positions with constant copy numbers.

Therefore, to increase the detection power, more samples should be sequenced with a low sequencing coverage rather than for sequencing a few samples with high-coverage sequencing. Indeed, CNV detection methods which rely on a low-coverage sequencing data are more relevant in the future studies37,38. Currently, many individuals are sequenced with a low genome-wide coverage. For example, the 1000 genomes project carried out the whole-genome shotgun sequencing of 179 individuals with 2× to 4× coverages39,40,41.

Currently, there are a few tools which are capable of the simultaneous analysis of several sequenced samples37,42,43. However, a major drawback of these tools is relying only on read depth data which results in suffering from a low power or a high false positive rate, due to the large noise in read depth signals. Moreover, these tools do not take the observed aberrations in the mate pair reads into account. Indeed, besides read depth, mate pair insertion sizes provide another source of information for the genome-wide CNV detection with an increased resolution.

In this paper, MSeq-CNV is proposed for detecting recurrent genomic deletions and duplications across multiple individuals, by the simultaneous analysis of samples. To the best of our knowledge, MSeq-CNV is the first computational tool which takes both read depth and insertion size signals in several individuals into account. The MSeq-CNV applies a mixture density to model the distribution of the read counts and the distribution of the number of mate pairs with aberrations in insertion size.

Each component in the mixture density applies a Binomial distribution to model the number of mate pairs with insertion size aberrations and a Poisson distribution to model the read counts, in each genomic region. After estimating the model parameters based on Expectation-Maximization (EM) algorithm, the posterior probability of the digitized copy number of each segment in the sample genomes is computed. The resolution of the MSeq-CNV is evaluated on a set of samples with implanted CNVs, which are constructed based on the human reference genome. Compared to the other state of the art methods, MSeq-CNV has reached an unprecedented precision and recall values which allows detecting recurrent genomic variants, accurately.

The MSeq-CNV is also applied for the CNV detection in a set of six HapMap individuals with high-coverage sequencing, including a CEU trio of European ancestry (NA12891, NA12892, and NA12878) and a YRI trio of Yoruba Nigerian ethnicity (NA19238, NA19239, and NA19240).

Methods

Assume that there are k sample genomes which are sequenced using a Next Generation Sequencing platform and mate pair reads are generated. After mapping mate pairs, the reference genome is divided into T segments of length L. Here, we aim at estimating the copy number of each genomic segment in samples 1 to k. To estimate the copy number of each sample in the tth genomic segment where t = 1, 2, …, T, this paper relies on two signals: i) number of reads which are mapped to the segment, and ii) information from the mate pair whose insertion (un-sequenced) region is passing the tth genomic segment and its reads are flanking the corresponding genomic segment.

Here, number of reads which are generated from the jth sample and are mapped to the tth segment of reference genome (studied segment) is denoted by \({f}_{j}\). Also, \({n}_{j}\) denotes the number of mate pairs which are generated from the jth sample and their insertion region is passing the tth segment, after mapping to it.

Signal characteristics in different genomic states

The studied segment in the jth sample has one of the following copy number states i.e. {homozygous deletion, heterozygous deletion, diploid and duplications}. In each state, the characteristics of the read counts and mate pair signals which are used for the mathematical modeling are described below:

Diploid

in this state, jth sample carries two copies of the corresponding segment in the reference genome. Here, a mate pair which is generated from the jth sample aligns to the reference genome with a normal insertion size, distributed with the clone library insertion size distribution. Also, number of reads which are mapped to the corresponding segment in the reference genome i.e. \({f}_{j}\) is assumed to have a Poisson distribution with parameter \(\lambda \).

Heterozygous deletion

in this state, the jth sample carries only one allele of the corresponding segment in the reference genome. Therefore, some mate pairs which are generated from the jth sample align to the reference with a normal insertion size, distributed with the clone library insertion size distribution. Other mate pairs align to the reference genome much further apart than expected. In this state, read counts are also distributed with a Poisson distribution with a parameter of \(\frac{\lambda }{2}\).

Homozygous deletion

in this state, both alleles are deleted from the jth sample. Therefore, a high percentage of the mate pairs which are generated from this region will map to the reference genome much further apart than the expected insertion size distribution in the clone library. Also, we consider a Poisson distribution with a parameter of \(\varepsilon \lambda \) for the read counts, where \(\varepsilon \) is assumed to be a very small value.

Duplication

in this state, the sample genome carries more than two copies of the corresponding segment in the reference genome. However, mate pairs which are generated from duplicated regions will map to the reference with the insertion size distribution of the clone library. Read counts are also distributed with a Poisson distribution with a parameter of \(i\,{\rm{\lambda }}/2\), for the samples carrying i copies.

Mathematical modeling of mate pair insertion sizes and read counts

Consider a segment of the reference genome whose copy number in the jth sample is of interest, j = 1, 2, …, k. Also, let \({n}_{j}\) denote the total number of mate pairs which are generated from the jth sample and align to the reference genome with condition ii, as mentioned before. Also, \({n}_{j1}\,\)denotes the number of mate pairs which are mapped to the reference with the insertion size distribution of the clone library and \({n}_{j2}\) denotes the number of mate pairs which are mapped to the reference much further apart, compared to the insertion sizes in the clone library. Clearly, \({n}_{j}={n}_{j1}+{n}_{j2}\). Here, we assumed that \({n}_{j1}\) is binomially distributed as follows:

$$p({n}_{j1},{n}_{j2})=(\begin{array}{c}{n}_{j1}+{n}_{j2}\\ {n}_{j1}\end{array}){\beta }_{i}^{{n}_{j1}}{(1-{\beta }_{i})}^{{n}_{j2}}$$
(1)

where, \({n}_{j1}=1,2,\ldots ,{n}_{j}\). In the above distribution, \({\beta }_{i}\) indicates the probability of observing a mate pair mapped to the reference with a clone library insertion size distribution, when sample genome is in the ith CNV state. Where, i = 0, 1, 2, 3, …, m corresponds to {homozygous deletion, heterozygous deletion, diploid and duplications}. Also, the maximum copy number of a genomic segment is denoted by m, i.e. \(i\le m\).

When jth sample has i copies of the studied segment of the reference genome, read count \({f}_{j}\) follows a Poisson distribution with a parameter of \({\theta }_{i}\lambda \):

$$p(\,{f}_{j})={e}^{-{\theta }_{i}\lambda }\frac{{({\theta }_{i}\lambda )}^{{f}_{j}}}{{f}_{j}!}$$
(2)

where, \({{\rm{\theta }}}_{0}={\rm{\varepsilon }}{\rm{\lambda }}\), and \({{\rm{\theta }}}_{{\rm{i}}}=i{\rm{\lambda }}/2\), for \({\rm{i}}\ge 1\).

Also, from a total number of k samples, let \({{\rm{\alpha }}}_{{\rm{i}}}\) denote the percentage of samples which have i copies of the studied segment of the reference genome. Taking the above descriptions into account, the probability of observing \(({f}_{j},{n}_{j1},{n}_{j2})\) in the jth sample genome can be written as follows:

$$\begin{array}{rcl}{p}(\,{{f}}_{{\boldsymbol{j}}},{{n}}_{{\boldsymbol{j}}1},{{n}}_{{\boldsymbol{j}}2}) & = & \sum _{{\boldsymbol{i}}=0}^{{m}}{{\alpha }}_{{\boldsymbol{i}}}{p}({{f}}_{{j}},{{n}}_{{j}1},{{n}}_{{j}2}|{state}\,{i})\\ & = & \sum _{{\boldsymbol{i}}=0}^{{m}}{{\alpha }}_{{i}}{{e}}^{-{{\boldsymbol{\theta }}}_{{\boldsymbol{i}}}{\boldsymbol{\lambda }}}\frac{{({{\theta }}_{{\boldsymbol{i}}}{\lambda })}^{{{\boldsymbol{f}}}_{{\boldsymbol{j}}}}}{{{f}}_{{\boldsymbol{j}}}!}[(\begin{array}{c}{{n}}_{{\boldsymbol{j}}1}+{{n}}_{{\boldsymbol{j}}2}\\ {{n}}_{{\boldsymbol{j}}1}\end{array}){{\beta }}_{{\boldsymbol{i}}}^{{{\boldsymbol{n}}}_{{\boldsymbol{j}}1}}{(1-{{\beta }}_{{\boldsymbol{i}}})}^{{{n}}_{{\boldsymbol{j}}2}}]\end{array}$$
(3)

However, it should be added that in the above formulations \({n}_{j1}\) and \({n}_{j2}\) are not known and they depend on the unknown parameter \({\beta }_{i}\). Also, the estimation of \({n}_{j1}\) and \({n}_{j2}\), requires estimating the probability of each insertion size to be distributed with the clone library insertion size distribution. For this purpose, let \({o}_{jr}\) denote insertion size of the rth mate pair which was generated from the jth sample and was mapped to the studied segment of the reference genome. Where \(r=1,\,2,\,\ldots ,\,\,{n}_{j}\) and j = 1, 2, …, k. Consequently, a random variable \({z}_{jr}\) is corresponded to each mate pair insertion size \({o}_{jr}\):

$${z}_{jr}=\{\begin{array}{l}1\,if\,{o}_{jr}\,comes\,from\,the\,insertion\,size\,distribution\,of\,the\,clone\,library\\ 0\,if\,{o}_{jr}\,comes\,from\,a\,shifted\,insertion\,size\,distribution\,of\,the\,clone\,library\end{array}$$

where, \({n}_{j1}=\sum _{r=1}^{{n}_{j}}{z}_{jr}\) and \({n}_{j2}=\sum _{r=1}^{{n}_{j}}(1-{z}_{jr})\). To estimate the expected value of \({n}_{j1}\) and \({n}_{j2}\), we calculate the probability of having a \({z}_{jr}\) equal to 1, for \(r=1,\,2,\,\ldots ,\,\,{n}_{j}\) and \(j=1,\,2,\,\ldots ,\,\,k\), see Supplementary file 1 for a detailed description.

Now, a Dirichlet prior distribution is defined for the parameter vector \({\boldsymbol{\alpha }}=({\alpha }_{0},\,{\alpha }_{1},\,\ldots ,\,{\alpha }_{m})\):

$${p}({\boldsymbol{\alpha }})\propto \prod _{{\boldsymbol{i}}=0}^{{\boldsymbol{m}}}{{\alpha }}_{{\boldsymbol{i}}}^{{{\boldsymbol{\gamma }}}_{{\boldsymbol{i}}}-1}$$
(4)

where, \({\alpha }_{0}=1-\sum _{i=1}^{m}{\alpha }_{i}\,\)and \({\gamma }_{s}=\sum _{i=0}^{m}{\gamma }_{i}\). Also, the prior of each \({\beta }_{i}\) is considered to be a beta distribution as follows:

$${p}({{\beta }}_{{\boldsymbol{i}}})=\frac{{\Gamma }({{\nu }}_{{\boldsymbol{i}}1}+{{\nu }}_{{\boldsymbol{i}}2})}{{\Gamma }({{\nu }}_{{\boldsymbol{i}}1}){\Gamma }({{\nu }}_{{\boldsymbol{i}}2})}{{\beta }}_{{\boldsymbol{i}}}^{{{\boldsymbol{\nu }}}_{{\boldsymbol{i}}1}-1}{(1-{{\beta }}_{{\boldsymbol{i}}})}^{{{\boldsymbol{\nu }}}_{{\boldsymbol{i}}2}-1}$$
(5)

where, i = 0, 1, 2, …, m. Moreover, the prior distribution of \(\lambda \) is considered to be a uniform distribution, over the interval of \((0,t)\), where t is large enough.

Model parameters

There are a number of parameters in the above mathematical model which have to be estimated. These parameters include \(\lambda \), the average read counts in a genomic segment of diploid state. The parameter vector \({\boldsymbol{\alpha }}=({\alpha }_{0},\,{\alpha }_{1},\,\ldots ,\,{\alpha }_{m})\) represents the percentage of samples with copy numbers 0, 1, …, m of the studied segment of the reference genome. Also, for a sample genome with copy number state i, \({\beta }_{i}\) indicates the proportion of the mate pairs which are mapped to the reference genome much further apart than expected under the clone library insertion size distribution.

The parameters of the prior distribution over \({\boldsymbol{\alpha }}=({\alpha }_{0},\,{\alpha }_{1},\,\ldots ,\,{\alpha }_{m})\) are given values based on information from genome-wide CNV percentage. Since a high percentage of genomic segments in each sample are expected to be in diploid state, \({\gamma }_{2}\) is given a value much higher than \({\gamma }_{i}\), \(i\ne 2\). Also, a beta distribution is defined as a prior distribution over each \({\beta }_{i}\), i = 0, 1, 2, …, m. The parameters of the beta distribution i.e. \({\nu }_{i1}\) and \({\nu }_{i2}\) are given values based on the expected number of mate pairs which are mapped to the reference much further apart, compared to the clone library insertion size distribution. In genomic diploid state and segments with an elevated number of copies \({\nu }_{i1}\gg {\nu }_{i2}\), for i = 2, 3, …, m. In genomic segments with heterozygous deletion \({\nu }_{11}\cong {\nu }_{12}\) and in genomic segments with homozygous deletions \({\nu }_{01}\ll {\nu }_{02}\).

Parameter estimation

MSeq-CNV applies the Expectation-Maximization (EM) algorithm, for parameter estimation. The parameter estimation details are given in Supplementary file 1.

Parameter initialization in EM algorithm

\(\lambda \) is initialized based on the number of reads that are expected to be generated from a genomic segment with diploid state, after taking the sequencing coverage into account. For example, for a coverage of 5×, a read length of 100 bp and genomic segments of length 100 bp, \(\lambda \) is initialized with a value of 20. Also, \({\alpha }_{2}=0.90\), \({\alpha }_{i}=0.02\) for i = 0, 1, 3, 4, 5, and \({\beta }_{2}={\beta }_{3}={\beta }_{4}={\beta }_{5}=1\), \({\beta }_{0}=0\), and \({\beta }_{1}=0.5\) are taken as the start point of the parameters in the EM algorithm. In the jth sample, \({\mu }_{j1},{\mu }_{j2}\) are initialized by comparing the mate pair insertion sizes with the clone library insertion size distribution.

In this study, we have considered a segment size of 150 bps in the simulated data analysis and a segment size of 100 bps in the real data analysis. Considering a longer segment size decreases the running time of the algorithm, with the cost of lower resolution. The methods which are compared to MSeq-CNV are also implemented with the same segment size, as MSeq-CNV. However, there are other methods which are specialized in detecting genomic rearrangements and tandem duplications using paired reads which get the nucleotide resolution breakpoint25,29.

Data availability

BAM (Binary Alignment/Map) files of the alignment of the mate pair reads to the build 36 (hg18) of the human reference genome are available at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/. R program of MSeq-CNV and a detailed procedure for its running is available at https://github.com/CNVdetection/MSeq-CNV. The list of detected CNVs in the studied individuals is also submitted to this webpage.

Results

Implementation To Real Data of The Human Reference Chromosome

We have constructed 40 sample genomes with implanted CNVs, from chromosome 3 of the human reference genome. After duplicating chromosome 3 of the reference genome, it was altered with implanted CNVs of length 250 bp, 500 bp, 750 bp, 1 kb, 1.5 kb, 2 kb, 2.5 kb, 3 kb, 3.5 kb, 4 kb, 4.5 kb, 5 kb. The position of each CNV is randomly chosen so that CNVs do not overlap along chromosome. After determining the CNV positions on the reference genome, CNVs are implanted into each sample genome.

Indeed, for each CNV region which is implanted in the reference genome, distributions and characteristics of CNVs across sample genomes are determined based on a previous analysis of the HapMap individuals.

Based on the characteristics of the HapMap individuals, 80% of the implanted CNVs were of type loss in which deletions occurs in some sample genomes. Also, 15% of the CNV regions were of type gain in which some sample genomes have an elevated number of copies. The other 5% of the implanted CNV regions were of type mixed in which sample genomes may either have copy loss or copy gain.

In each CNV region, the copy number of each sample was also drawn from the copy number distribution in HapMap individuals. For a genomic loss region, a sample has copy numbers 2, 1, and 0 with probabilities 0.8, 0.15 and 0.05, respectively. For a genomic gain region, a sample has copy numbers 2, 3, 4, and 5 with probabilities 0.85, 0.08, 0.06 and 0.01, respectively. Also, for a CNV region of type mixed, a sample has copy numbers 0, 1, 2, 3, 4 with probabilities 0.04, 0.16, 0.67, 0.11 and 0.02, respectively.

After constructing the sample genomes, MAQ is applied for generating mate pair reads from each sample genome. Mate pairs are then mapped to the human reference genome. After dividing the human reference genome into segments of length 150 bp, MSeq-CNV is applied for detecting CNVs in the corresponding segments of the constructed sample genomes. In Table 1, the performance of MSeq-CNV is reported for each CNV state i.e. homozygous deletion, heterozygous deletion, and duplications, for a genome-wide sequencing coverage of 10×.

Table 1 Precision and recall values of MSeq-CNV are evaluated in 1,120,000 genomic segments of length 150 bp (168 million base pairs are evaluated), for a genome-wide coverage of 10×. In columns 3 to 8, predicted states are reported versus the real states of the genomic segments.

The performance of MSeq-CNV is also compared to the central CNV detection tools i.e. rSW-seq, CNV-seq and cnMOPS. These tools are selected for comparisons because of their high resolution and their capability in detecting both genome-wide deletions and duplications37. It should be added that rSW-seq and CNV-seq are not capable of detecting the digitized copy number of genomic regions i.e. these tools do not discriminate heterozygous deletions from homozygous deletions. However, MSeq-CNV resembles cn.MOPS in detecting the digitized copy number of each CNV region.

For calculating precisions and recalls, the whole simulation study was repeated five times for each setting and the average results across five repeats are summarized in Table 2. In this table, for a genome-wide sequencing coverage of 1×, 5×, 10×, MSeq-CNV is compared to the other tools in each genomic state i.e. homozygous deletion, heterozygous deletion, diploid and duplications. The F-score which is the harmonic mean of the precision and recall values are also calculated in Table 2, for each CNV state.

Table 2 MSeq-CNV in comparison with rSW-seq, CNV-seq and cn.MOPS, for whole-genome sequencing coverage of 1×, 5×, 10×. The average precision and recall values are calculated across five different repeats of the whole study. The highest F-scores are indicated in bold.

As shown in Table 2, for all coverage values and according to the F-score, the performance of MSeq-CNV has been superior to the compared tools in detecting genomic regions with deletions and duplications. In regions with diploid copies, MSeq-CNV outperformed the compared tools for a coverage of 1×. Also for a coverage of 5× and 10×, MSeq-CNV and CNV-seq are both ranked as the best tools in detecting regions with diploid copies.

In Table 3, the overall performance of MSeq-CNV is compared to the other tools. To calculate the overall performance of each tool in estimating the correct copy number state, number of nucleotides whose states were correctly predicted is divided by the genome length. As indicated in Table 3, the overall performance of the MSeq-CNV is superior to rSW-seq, CNV-seq and cnMOPS, for a coverage of 1× and 5×. Also, MSeq-CNV and CNV-seq outperformed rSW-seq and cn.MOPS with an overall accuracy of 0.98, for a 10× coverage.

Table 3 The overall performance of the MSeq-CNV in comparison with rSW-seq, CNV-seq and cn.MOPS. For each method, the average of overall accuracies over five different runs of the whole study is given in each cell. For each coverage, the highest accuracies are indicated in bold.

Performance of MSeq-CNV is also evaluated in terms of allele frequency of CNVs. As shown in Table 4, in CNV regions with copy loss, accuracy does not change with an increase in allele frequency i.e. MSeq-CNV is accurate in detecting genomic deletions. However, in CNV regions of type copy gain or mixed, overall accuracies decrease with an increase in allele frequency. This is associated with lower accuracies in detecting genomic duplications, compared to the other genomic regions.

Table 4 The overall accuracy of MSeq-CNV, in terms of allele frequency in genomic regions with CNVs. Allele frequencies are categorized into 4 groups i.e. 0–5, 6–10, 11–15, and 16–25, in the simulated data of 40 samples.

Figure 1 shows the RAM usage, running time and overall accuracy of MSeq-CNV in terms of sequence numbers, i.e. number of individuals which are compared to each other, for 10× sequencing coverage. To obtain these results, 4 computer cores were applied for running the parallel programming version of the MSeq-CNV, on a 64-bit windows operating system with Intel Core(TM) i7-4710HQ CPU @3.5 GHz processor. As shown in Fig. 1A,B, RAM usage and running time of MSeq-CNV both increase, with an increase in sequence numbers. However, as shown in Fig. 1C, analyzing more sample genomes at a time has a positive effect on the overall accuracy of MSeq-CNV.

Figure 1
figure 1

Ram usage, run time and accuracy of MSeq-CNV is reported, in terms of sequence numbers, in a 5 Mega bp of the genomic region and for a coverage of 10×. These results are achieved using parallel programming version of the MSeq-CNV on a 64-bit windows operating system with Intel Core(TM) i7-4710HQ CPU @3.5 GHz processor. (A) Ram usage of MSeq-CNV in terms of sequence numbers. (B) Run time of MSeq-CNV in terms of sequence numbers. (C) The overall accuracy of MSeq-CNV, in terms of number of analyzed individuals.

Results From The High-Coverage Data of The 1000 Genomes Project

MSeq-CNV is applied for the CNV detection in the genome of six HapMap individuals. These genomes which are sequenced with a high coverage as part of the 1000 Genomes Project (http://www.1000genomes.org) consist of a CEU trio of European ancestry (NA12891, NA12892 and NA12878) and a YRI trio of Yoruba Nigerian ethnicity (NA19238, NA19239 and NA19240).

BAM (Binary Alignment/Map) files of the alignment of the mate pair reads to the build 36 (hg18) of the human reference genome are downloaded from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/. Mate pair reads with low mapping qualities (Q < 25) are then filtered out using SAMtools (samtools.sourceforge.net). Then, MSeq-CNV is applied for detecting deletions and duplications in the genome of CEU and YRI individuals, simultaneously.

In past studies44, to reduce the false discovery rate, CNVs’ callset were heavily pre-filtered and only a high confident set of CNVs were reported. In our Bayesian framework, to reduce the false positive rate, we report a set of CNVs with high posterior probabilities. Indeed, those CNVs with posterior probabilities lower than a fixed threshold i.e. 0.5 are filtered out from the callset.

In Table 5, number of CNVs and their total size (in Mega bp) are reported for deletion and duplication calls with at least 1 kb in size. As indicated in Table 5, numbers of CNVs in CEU trio NA12891, NA12892 and NA12878 are respectively 249, 404 (421.284 Mega bp), 248, 889 (390.479 Mega bp) and 249, 162 (410.194 Mega bp). Also, numbers of CNVs in YRI trio NA19238, NA19239 and NA19240 are respectively 278, 027 (459.610 Mega bp), 246, 229 (453.394 Mega bp) and 251, 269 (488.895 Mega bp). Also, similar to previous estimations45, CNV calls in NA12891, NA12892, NA12878, NA19238, NA19239 and NA19240 respectively cover 13.02%, 12.07%, 12.68%, 14.21%, 14.02% and 15.11% of the human genome.

Table 5 Number of CNV calls which are made by MSeq-CNV in the six high-coverage sequencing data the 1000Genomes Project i.e. NA12891, NA12892, NA12878 (CEU trio), NA19238, NA19239, NA19240 (YRI trio) and in the low-coverage sequencing data of NA12761, NA12762, and six individuals from the Simons Genome Diversity Project (SGDP).

The average number of CNVs in YRI trio i.e. 258, 508 (467.300 Mega bp) is slightly higher than the average calls in CEU i.e. 249, 152 (407.319 Mega bp). Also, as indicated in Table 5, the average number of deletion and duplication calls in YRI individuals (91, 609 and 166, 899) are both more than CEU trio (83, 778 and 165, 374), indicating the increased diversity of the African individuals in comparison with CEUs46. Moreover, in the studied individuals genomic deletions are less common compared to duplications44,46,47.

Total number of CNV calls in each chromosome is plotted in Fig. 2A, for each HapMap individual. Numbers of deletion and duplication calls are also given in Table S.1, for each chromosome. See Table S.2 for the size (in Mega bp) of deletion and duplication calls.

Figure 2
figure 2

Number of CNV calls which are made by MSeq-CNV. (A) In the high-coverage sequencing data of HapMap individuals NA12891, NA12892, NA12878, NA19238, NA19239 and NA19240, from 1000 Genomes Project. (B) In the low-coverage sequencing data of NA12761 and NA12762, from 1000Genomes Project. (C) In the six individuals from the Simons Genome Diversity Project (SGDP).

To investigate the validity of the CNV calls, their overlap with the Database of Genomic Variants (DGV)48, http://dgv.tcag.ca/dgv/ is studied. DGV includes 8, 599 CNVs from 40 HapMap individuals which are validated experimentally using aCGH methods.

The overlap of the detected CNVs with DGV are determined by the number of calls and also by the size of overlap, in base pairs. CNV calls in CEU trio NA12891, NA12892 and NA12878 overlap with DGV respectively with a ratio of 0.61, 0.62 and 0.61, for the number of calls. Also, a base which is called as a CNV in NA12891, NA12892 and NA12878 overlap with a base in DGV respectively with a ratio of 0.60, 0.61 and 0.61. The YRI individuals NA19238, NA19239 and NA19240 overlap with DGV respectively with a ratio of 0.62, 0.61 and 0.62 for the number of calls, and 0.61, 0.59 and 0.60 for the base pairs. Therefore, more than a half of CNV calls which are made by MSeq-CNV are previously validated using aCGH methods.

Size distribution of CNVs are also shown in Fig. 3A,B, respectively for the deletion and duplication calls. Clearly, the numbers of deletion and duplication calls decrease exponentially, with an increase in CNV size. As shown in Fig. 3A, deletion size distributions almost overlap in all studied individuals. Duplication size distributions are also very similar in all individuals, with CEU individuals having more CNVs of smaller sizes.

Figure 3
figure 3

The size distribution of the deletion and duplication calls – made by MSeq-CNV - in six individuals with high-coverage sequencing from 1000Genomes Project. (A) Deletion size distributions, (B) Duplication size distributions.

Moreover, we applied the hierarchical clustering algorithm to the matrix of CNV regions which are identified in the genome of six HapMap individuals. As shown in Fig. 4, although no information about the individual’s identities are used in the hierarchical clustering, the algorithm has correctly segregated the ancestry of the six individuals in two groups. While one group includes the CEU individuals NA12891, NA12892 and NA12878 with European ancestry, the other group includes YRI individuals NA19238, NA19239 and NA19240 with Nigerian ancestry.

Figure 4
figure 4

The hierarchical clustering algorithm is applied to the matrix of CNV regions which are detected using MSeq-CNV in the genome of HapMap individuals NA12891, NA12892, NA12878, NA19238, NA19239 and NA19240. Based on the CNV information, the hierarchical clustering algorithm has correctly segregated the ancestry of the six individuals in two groups i.e. a group includes CEU trio and a group includes YRI trio.

Results From The Low-Coverage Data of The 1000 Genomes Project

MSeq-CNV is also applied for the CNV detection form the low-coverage data of two individuals i.e. NA12761 and NA12762, from 1000 Genome project. After downloading BAM files of the alignment of mate pair reads to the human reference genome, mate pairs with low mapping qualities (Q < 25) are filtered out.

MSeq-CNV called a total number of 145, 462 (191.495 Mega bp) and 119, 074 (169.339 Mega bp) CNVs in the genomes of NA12761 and NA12762, respectively. Also, in both individuals, genomic deletions are less common compared to the duplications44,46,47. Details of deletion and duplication calls are given in Table 5 and Table S.3.

The low number of CNV calls in the genome of NA12761 and NA12762 is potentially associated with lower accuracies in detecting genomic CNVs, especially duplications, from low-coverage sequencing data of NA12761 and NA12762 (see Table 2).

The overall number of detected CNVs in each chromosome is shown in Fig. 2B, for NA12761 and NA12762. Detected CNVs in NA12761 and NA12762 overlap with DGV respectively with a ratio of 0.65 and 0.66 for the number of calls, and 0.66 and 0.67 for base pairs.

Results From The Simons Genome Diversity Project (SGDP)

MSeq-CNV is also applied for CNV detection in the genome of six individuals from the Simons Genome Diversity Project49 i.e. LP6005592-DNA_H03 (USA), LP6005442-DNA_E07 (Taiwan), LP6005443-DNA_G05 (Taiwan), LP6005519-DNA_A04 (India), LP6005519-DNA_A05 (India), and LP6005592-DNA_D01 (Finland).

As indicated in Table 5, in the analyzed individuals from SGDP, the lowest number of CNVs are called in the LP6005592-DNA_H03 (USA)46 (171, 410 CNVs with a total size of 303.064 Mega bp). Two East Asian individuals LP6005443-DNA_G05, LP6005442-DNA_E07(form TAIWAN) and West Eurasian individual LP6005592-DNA_D01 (from FINLAND) are the next, respectively with a total number of 183, 439 (317.397 Mega bp), 197, 435 (323.638 Mega bp), 185, 837 (323.829 Mega bp) calls.

The highest number of CNVs are detected in the South Asian individuals LP6005519-DNA_A04 and LP6005519-DNA_A05 (from INDIA) respectively with a total number of 262, 405 (431.409 Mega bp) and 341, 909 (545.435 Mega bp) calls. Extensive CNVs in Indian individuals, which is as many as YRI trio, were also previously reported in the admixed Indian population of African ancestry47,50, to adopt with environmental conditions.

Details of deletion and duplication calls are given in Table S.4 and Table S.5. The overall number of detected CNVs in each chromosome is shown in Fig. 2C, for each individual.

Detected CNVs in LP6005592-DNA_H03, LP6005442-DNA_E07, LP6005443-DNA_G05 LP6005519-DNA_A04 and LP6005519-DNA_A05, and LP6005592-DNA_D01 overlap with DGV respectively with a ratio of 0.61, 0.62, 0.61, 0.60, 0.62, and 0.62, for the number of calls and 0.60, 0.60, 0.59, 0.60, 0.59, and 0.61, for the base pairs.

Applications and Limitations

The MSeq-CNV can be applied for detecting the recurrent genome-wide CNVs from NGS data in the diploid genome of human and other organisms, as well. However, the current version of MSeq-CNV is not capable of detecting CNVs in the sequencing data of a haploid genome. The input NGS data for the MSeq-CNV are possibly the mate pair reads which are collected from sequencing with multiple platforms, multiple individuals and experimental conditions.

Although the current version of the MSeq-CNV is limited to the whole genome shotgun sequencing, further work is in progress to adopt MSeq-CNV with the exome or gene panel sequencing data.

Also, as mentioned above, the other attractive feature of the MSeq-CNV is in constructing the ancestry of the sequenced individuals, based on the detected CNV matrix.

Discussion

In this article we proposed MSeq-CNV as a new tool for detecting genome-wide deletions and duplications from sequencing of multiple samples. Simultaneous analysis of multiple samples allows detecting common CNVs which are shared by complex diseases. Also, read count variations which occur due to the sequencing noise can be detected by the analysis of several samples together. MSeq-CNV applies a novel probabilistic framework for modeling the read depth and insertion size signals, together.

The overall performance of MSeq-CNV has been superior to the central CNV detection tools such as rSW-seq, CNV-seq and cnMOPS. Specially, for a coverage of 1× which is fairly low, the overall performance MSeq-CNV has been considerably higher than the compared tools. Reaching a high performance in low coverage data is an advantage of MSeq-CNV. In future, CNV detection tools which rely on a low-coverage sequencing are more relevant37,38. Indeed, a low coverage sequencing is common in many individuals e.g. in the 1000 Genomes Project the shotgun sequencing of 179 individuals is carried out with a coverage of 2× to 4×41.

The MSeq-CNV works with the empirical distribution of the insertion sizes in clone library. Therefore, MSeq-CNV is robust to deviations from the theoretical insertion size distribution which occurs due to several artifacts, attributed to the library-preparation protocols.