Benchmarking germline CNV calling tools from exome sequencing data

Whole-exome sequencing is an attractive alternative to microarray analysis because of the low cost and potential ability to detect copy number variations (CNV) of various sizes (from 1–2 exons to several Mb). Previous comparison of the most popular CNV calling tools showed a high portion of false-positive calls. Moreover, due to a lack of a gold standard CNV set, the results are limited and incomparable. Here, we aimed to perform a comprehensive analysis of tools capable of germline CNV calling available at the moment using a single CNV standard and reference sample set. Compiling variants from previous studies with Bayesian estimation approach, we constructed an internal standard for NA12878 sample (pilot National Institute of Standards and Technology Reference Material) including 110,050 CNV or non-CNV exons. The standard was used to evaluate the performance of 16 germline CNV calling tools on the NA12878 sample and 10 correlated exomes as a reference set with respect to length distribution, concordance, and efficiency. Each algorithm had a certain range of detected lengths and showed low concordance with other tools. Most tools are focused on detection of a limited number of CNVs one to seven exons long with a false-positive rate below 50%. EXCAVATOR2, exomeCopy, and FishingCNV focused on detection of a wide range of variations but showed low precision. Upon unified comparison, the tools were not equivalent. The analysis performed allows choosing algorithms or ensembles of algorithms most suitable for a specific goal, e.g. population studies or medical genetics.

www.nature.com/scientificreports/ RC of target regions but also that of the off-target ones. To improve the efficiency of CNV calling, CLAMMS 17 and ExomeDepth also include a procedure of reference set optimization. CoNIFER 18 performs systematic bias correction using singular decomposition and CODEX 19 applies log-linear decomposition-based normalization. Both XHMM 20 and FishingCNV 21 use principal component analysis to reduce noise but FishingCNV applies circular binary segmentation (CBS) on the test sample and then performs the simple comparison of normalized coverage against background, while XHMM uses hidden Markov models on Z-RPKM (reads per kilo-base per million total reads) values. Moreover, some algorithms have other features: ExonDel 22 and HMZDelFinder 23 can only detect decrease in the number of copies in the genome, and PatternCNV 24 and CONTRA 25 are focused on the identification of exon-level CNVs. Among all tools, only DeAnnCNV 26 is available online and, in addition to CNV calling, includes the variation annotation module. CNVkit, CODEX, EXCAVATOR2, ExomeDepth, ExonDel, PatternCNV can be considered as more universal for WES analysis, they are designed for both germline and somatic CNV calling.
There are several studies focusing on comparison of sets of 3-6 existing CNV calling algorithms [27][28][29] . However, there are no works that would compare larger sets of currently available tools. Also, the criteria for estimation of their efficiency on validation data could be improved. Existing studies consider only well-known tools (XHMM, CONIFER, ExomeDepth, CONTRA, exomeCopy) and use different overlap criteria to confirm predictions. One group used CNVnator 30 calls from whole-genome sequencing (WGS) data as a CNV set standard; the other used calls obtained from CMA, including custom design arrays.
In addition to in-house generated data, public data is also used to evaluate algorithm performance. Since CNVs have been acknowledged as a natural part of the human genome, more than 70 studies have been performed to identify CNVs in the human genome. The most widely-used CNV call sets are: (1) the study of Conrad et al. (2010), in which 31 a set of 20 comparative hybridization arrays with 2.1 million probes to identify CNVs over 500 bp long has been used; (2) structural variations obtained during the pilot and/or phase 3 of the 1000 Genomes Project, predicted from the whole-genome data by 19 different algorithms (read-depth, read-pair, split-read, assembly-based) 32-34 ; (3) and high-confidence CNV calls from NA12878 sample supported by multiple signals using svclassify 35 . Specific features of CNV set formation, for example, accuracy of prediction from www.nature.com/scientificreports/ WGS or low resolution and choice of the reference sample in case of array technology, do not allow to obtain a comprehensive spectrum of variations and identify all the advantages and limitations of a CNV calling tool.
To perform a unified comparative analysis: (1) we chose NA12878 as one of the most characterized samples of the Genome in a Bottle project; (2) we used exon as a minimal unit for comparison, (3) we constructed the set of CNV and non-CNV exons based on available CNV sets for the NA12878 using Bayes model, and (4) we evaluated the performances of 16 existing germline CNV tools (Table 1) using the same reference set.

Methods
Study data. The collection of exome data mapped to GRCh37 decoy reference genome (hs37d5.fa) in the BAM format was downloaded from Phase 3 of the 1000 Genomes Project (hg19) (detailed description of samples is provided in Supplementary Table S1). All alignment data were processed by the standard data pre-processing protocol: sorting, filtering (deleting the reads with MAPQ > 10) with SAMTools v.0.1.19, and removing PCR duplicates by Picard MarkDuplicates v. 2.5.0. The average coverage for the data was 104.3X, and the average fraction of target regions covered with at least 10X and 20X were 89.9% and 83.1%, respectively.
To construct validation set structural variations previously detected in the NA12878 sample were received from the sixteen peer-reviewed research studies [31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][46][47] ; experiment design and data processing procedures for each of them are described in Supplementary Table 2. If calls were not available for hg19 genome build we converted their coordinates using UCSC LiftOver. To operate at the exon level, we considered an exon as an exon with copy number variation (CNV-exon) if the CNV region spans at least 50% of the exon. The set of known exons was obtained from GENCODE comprehensive gene annotation Release 19 (GRCh37.p13); to construct a set of non-overlapping exons per gene, the R GenomicFeatures package was used.
Parameters of CNV calling tools used. The CNV calling from exome data were performed by 16 tools (Table 1). As a reference set for the NA12878 sample we used the top 10 most correlated samples according to recommendations 11 (function select.reference.set implemented in the ExomeDepth 1.1.10 software package). Most tools were run with default parameters, all changes in standard pipelines for the rest of the tools are described above. For GC-content calculation implemented in some algorithms, GRCh37 decoy reference genome (hs37d5.fa) was used.
During the CONIFER normalization step, two singular value decomposition components were removed based on the inflection point of the scree plot for our data; for calling, threshold value 0.95 was used due to the small coverage variability for correlated samples. ExomeDepth and CLAMMS were applied without their reference set selection step; also all CNVs with non-Phred quality score < 0 identified by CLAMMS were filtered off. To predict CNV in a test sample by cn.MOPS, a pipeline for exome sequencing data (exomecn.mops) was used on the preliminarily calculated read count matrix. Since longer calls are more reliable, we consider ExonDel results found by a moving window with a maximum of the available number of exons (9). CNV calling with HMZDelFinder was performed without theAOH analysis. FishingCNV and PatternCNV calls with p value > 0.05 were excluded.
Performance evaluation. Analysis of the existing tools was performed only on autosomal chromosomes with respect to the characteristics of predicted CNVs (number of calls, size, total length, target regions), concordance between algorithms, and efficiency of the latters. To assess the prediction accuracy of an algorithm, we constructed the validation set of CNV-exons for NA12878 based on the previous research using Bayes model (See "Construction of a standard CNV-exon set based on Bayesian estimation" section). For correct estimation, we considered only those exons that are included in the target design and our validation set. Since the data is inherently unbalanced we used the following metrics for evaluation of a tool:

Results
Moderate concordance among CNV validation sets. A comprehensive validation set is needed to adequately compare existing methods. Due to the wide range of sizes and types of structural variation, the development of such is challenging. Although the NA12878 sample is one of the standards for benchmarking of multiple callers, a gold-standard CNV set doesn't exist. The recent set of high-confidence CNVs (svclassify) proposed by the GIAB Consortium includes only deletion events and besides it only spans 184 exons. Therefore, we collected CNV call sets for the NA12878 sample from previous studies (Supplementary Table S2).
To perform the pairwise comparison of the 16 CNV sets we evaluated the set size and the fraction of exons with the same status (CNV or non-CNV) in shared exons of any two validation sets (Fig. 1, Supplementary  Table S3). Early studies [36][37][38][39][40][41] (Supplementary Table S2; the mccarroll2006, conrad2006, wang2007, pinto2007, and coper2008 sets) detected a small number of CNVs but almost all of them were well supported by later research. www.nature.com/scientificreports/ The sets obtained by the integration signals from multiple technologies (e.g. svclassify, metasv) also showed a high level of concordance with other sets. Another problem of the available sets is their limitation to identify false positives. The 5 sets (mccarroll2008, conrad2010, hapmap, pilot, and phase3) contained information about true non-CNV exons, their pairwise comparison at the exon level is shown in Table 2.
Overall, the number of CNV-exons varied greatly among the sets; on average, they overlapped by 40%. These results can be explained by differences in the approaches, reference sets, and criteria used to call CNVs in different studies. Such differences in the validated exons cast doubt on the correctness of use of only one of these sets for the comparative analysis of CNV tools. Therefore, we made an attempt to generate our own validation set using available data.

Construction of a standard CNV-exon set based on Bayesian estimation. Since available CNV
sets describe different genome regions, and for some of them true non-CNV are not defined, it is impossible to formulate the majority rule for any exon. Assuming that all sets are independent, we considered that in conditions of conflicting and incomplete information, the Bayes estimator would be the most appropriate approach for making a decision. In Bayes estimation, the unknown parameter θ is viewed as random and the model is specified in term of the conditional probability density function f (x|θ) and prior distribution π(θ) on θ . By Bayes rule, the posterior distribution of given some measurement x: The quality of the estimate is measured by loss function L(θ , δ) and estimator δ which minimizes the expected posterior loss E[L(θ, δ)|x] for each x is called Bayes estimator 48 .
To construct the validation set we wanted to rank exons based on their estimation of being CNV-exon and consider the most highly ranked as CNV. Let ith exon is characterized by n number of CNV sets and x sets described ith exon as CNV-exon. Suppose that x ∼ Binomial(n, θ ), θ ∼ Beta(α, β) , where θ is the probability of being CNV-exon, then posterior distribution (1) π θ |x ∝ f x|θ π(θ ).  www.nature.com/scientificreports/ The most common estimation problem is squared error loss due to its simplicity and convenience. It is characterized by high sensitivity to outliers and the same magnitude for positive and negative errors. However, to focus on confident CNV-exons especially under the limited number of sets including data about true diploid genome regions we needed to penalize errors differently. Therefore we considered an asymmetric piecewise linear loss function 49 Then the expected posterior loss may be written Minimizing this expression with respect to d we get that Bayes estimator is t 2 t 1 +t 2 fractile of posterior distribution (Eq. 5): where I(a, b) is the regularized incomplete beta function.
To evaluate the parameters of a priori distribution, we used the stringent CNV map of the human genome constructed by Zarrei and co-authors 1 . Using the data on the European population, we constructed the probability distribution of the presence of CNVs in 21,242 exons that are crossed by the CNV regions. Probability distribution was fit using the maximum likelihood method (python function scipy.stats.beta.fit with location = 0 and scale = 1), which yielded α = 0.33 and β = 0.93.
The penalties were chosen out of concern that overestimation of the parameter was worse than underestimation. For simplicity, we defined the coefficients as the maximum number of sets that are available to characterize exon as CNV (t 1 = 16) and non-CNV (t 2 = 5). To solve the Eq. (5), the special.betaincinv function of the Python SciPy version 1.0.0 was used.
The distribution of the estimated exon's probability of being CNV is presented in Fig. 2. As we expected, exons are divided into groups depending on their representation in CNV sets. The highest left peak corresponded to exons described only as non-CNV. Data about exons with parameter value in the 0.2-0.4 range were the most contradictory, with a slight preponderance in favor of two copy number region or CNV; exons described by only one set were also ranked low. On other exons, we observed a prevalence of sets confirmed CNV. A cutoff of 0.45 for selecting confident CNV-exons based on their estimates was chosen to provide 95% accuracy on a set of 225 CNV-exons validated using PCR. Thus, our validation set for NA12878 contained 6853 CNV-exons and 103,197 non-CNV exons.
To compare the internal standard and the initial data, we estimated the percentage of CNVs in any CNV set included in our standard. The recent set of high-quality CNVs formed in the framework of the GIAB project and experimentally validated variants from Conrad and McCarroll studies from 2006 were fully integrated into the standard. The constructed validation set contains above 79% of CNV-exons discovered from WGS (The 1000 www.nature.com/scientificreports/ Genome Project, FDR < 5% by orthogonal validation methods), SNP-array data (hapmap, wang2007, pinto2007, cooper2008). Variations from mccarroll2008, redon2006 and conrad2010 discovered by genotyping array and CGH array respectively were presented only a half in the standard, the remaining CNV-exons were singleton (presented only in one CNV set) and insufficiently confirmed. The same situation was observed for predictions based on long reads or integration of different signals (paired-read, split-read, read depth). The observed relationships between sets allow us to consider the standard as reliable for benchmarking analysis.

Evaluation of algorithms.
To take into account the genome variability and reduce the number of falsepositive calls, we chose 10 exomes well-correlated by coverage with NA12878 as a reference sample set and run 16 germline CNV calling tools on these exome data.
Differences in count and size of predicted CNVs. The algorithms differed by an order of magnitude in terms of both the number of predicted variations and the number of affected target regions ( Table 3). The highest number of variations were predicted by FishingCNV (1210 CNV) and exomeCopy (845 CNV); the lowest, by a web-tool DeAnnCNV (2 CNV). CONTRA, EXCAVATOR2, ExomeDepth, and PatternCNV identified about 200-300 variations; in the case of CONTRA and PatternCNV these were single-exon CNVs. Other algorithms detected an average of 26 variations. Most algorithms preferentially detected deletions over duplications, which can be explained by peculiarities of the data and better distinguishes between decreases relative to diploidy than increases. The total length of CNVs ranged from 50 kb to 1304 Mb, which exceeds the known fraction of CNVs in the human genome. This indicates the need to filter the calls produced by some tools, in particular, by Fish-ingCNV and exomeCopy. We noticed that CNV tools can be divided into groups based on the length of identified CNVs. The algorithms Most algorithms detect CNVs within 1-50 kb with an average number of targeted regions equal to 7 (Fig. 3).
Concordance between CNV calling tools. Concordance check between tools was performed at the exon level. About 70,000 CNV-exons were singleton, i.e. detected by a single tool (Supplementary Table S4), and the main portion was called by exomeCopy and FishingCNV. The other predictions of these two algorithms mostly intersected with EXCAVATOR2. For ExonDel, CANOES, EXCAVATOR2, and DeAnnCNV, 25-30% of predicted CNV-exons were unique, while for the rest of the tools more than 90% predicted CNV-exons were confirmed by the others. The best confirmation was observed for HMZDelFinder and CONIFER calls (Fig. 4, Supplementary Table S6) with zero unique CNV-exons. Despite the high-level matching of calls with at least one tool, we observed a low number of common CNV-exons called by multiple tools (3 and more). The maximum available overlapping spanned 2 exons and included 11 algorithms (PatternCNV, CODEX, ExomeDepth, XHMM, FishingCNV, CONTRA, CANOES, HMZDelFinder, EXCAVATOR2, cn.MOPS, CNVkit). We also performed pairwise comparison of calling tools (Supplementary Table S6). CODEX and ExomeDepth showed the best Table 3. Description of predicted CNVs. *Target, number of target regions covered by all predicted CNVs. **Mean of target, average number of target regions covered by a CNV. www.nature.com/scientificreports/ concordance with almost 33% of common CNV-exons. exomeCopy and FishingCNV had 24% of concordant CNV-exons but unlike previous ones they practically do not intersect with other algorithms. On average, a CNV calling tool had some agreement (> 5%) with 3-4 others. The worst similarity was observed for CANOES (no more than 1%). cn.MOPS, CONTRA, PatternCNV, CoDEX and ExomeDepth paired with each other showed a moderate concordance level (10-21%).
Accuracy of CNV calling tools. We estimated the efficiency of CNV calling tools using the generated set of CNV and non-CNV exons (Supplementary Table S7, Fig. 5). Among all algorithms, CANOES turned out to be the worst with 3.9% precision and 0.2% recall. The highest recall was demonstrated by exomeCopy: it identified 65%

Discussion
CNV is an important type of structural variation, accurate detection and interpretation of which are essential for both population studies, medical genetics, evolution, and cancer research. Chromosomal microarray analysis often is limited by array resolution and detects mostly the major rearrangements. WGS and WES, in turn, can detect all levels of CNVs and are of interest in clinical practice, especially WES, which is the primary diagnostic test for many orphan diseases, spectrum disorders and syndromes. Despite the development of many CNV calling tools, detection of this type of variation remains challenging. The reason for this is not only poor sequencing efficiency in regions with low sequence complexity or regions with high GC-content, but also the challenge to construct a true set of CNVs for adequate evaluation of applied methods.
Using a sample from the "Genome in the Bottle" project as an example, we considered the contradictions between available CNV call sets arising from differences in the resolution of detection methods, choice of the reference pool (from one to several dozens of samples), and analysis tools. Since no set covers all CNVs to the full, we used a stringent CNV map of the human genome as prior knowledge and with the Bayes approach ranged the exons by the probability of being CNV-exons. Thus, we constructed an exon-level CNV set for the NA12878 sample which includes about 110 thousand exons and can be used for an independent evaluation of tools on exome data.
Comparative analysis of 16 germline CNV calling tools showed that each algorithm has its certain range of detected lengths: the minimum possible size is an exon as predicted by CONTRA or PatternCNV, and the maximum size is more than 1 Mb, as observed in CNVkit calls. Most algorithms call CNVs up to 100 kb in length and span 4-10 exons. Due to the nature of the sequencing data, tools preferentially detected loss of genetic material over its gain.
We observed low concordance between the results produced by different tools. In addition to the differences in models applied for CNV calling, one of the possible reasons for the situation can be the characteristics of the exons. In particular, part of the CNV-exon singletons was obtained for exons with extreme values of GC content and mappability, which are filtered off in some of the algorithms (Table 1).
Algorithms are most effective in detection of variation from 1 kb, it could be a feature of calling model fitting: choice of train data and evaluation criterion for variant (e.g. number of targets). Also we showed differential focus of tool performance on precision or recall. Most algorithms identify a small number of variations yet with satisfactory precision (about 70%). This group of algorithms can be of use in population studies. In case of clinical diagnostics, there is a need to identify as many variations as possible with exome sequencing even if at the expense of precision. Later on the variants will be evaluated based on the clinical presentation, de novo appearance compared to parents and joint effect of revealed variants on the phenotype-thus filtering the false-positive variants. www.nature.com/scientificreports/ Also, in case studies, additional validation of individual candidate CNVs using cheaper techniques is possible, which decreases the false-positive rate and enhances the importance of minimization of false-negative outputs. This condition is met by only three algorithms: EXCAVATOR2, exomeCopy, and FishingCNV with recall over 26%. Therefore, part of CMA performed in frames of diagnostics of hereditary disorders can be replaced with primary CNV calling from exome. The minimum precision threshold should be chosen depending on the aims of the analysis and in the case of medical applications, on the group of disorders under investigation. Limitations in the applicability of existing algorithms described in the article (number of CNV, expected length, calling efficiency) should be considered when using CNV calling. In the case of detecting small variations (1 exons), PatternCNV and CONTRA are practically equivalent in terms of the number of CNVs and prediction accuracy. The CNV calls intersection increases precision to 73%. At the main range (1-100 kb) we recommend using cn.MOPS if the goal is low false positive rate (20%) and EXCAVATOR for more sensitive CNV calling.
The findings of this study have to be seen in light of some limitations. The first is the analysis was conducted on only one sample. Since the development of a CNV benchmark set is a rather laborious process, the choice of samples for evaluating the algorithm's efficiency is limited. Expanding the set of samples would allow researchers to fully evaluate the characteristics of the predicted CNV and investigate performance results under different sequencing depths. The second is the formation of the validation set at the exon level. Such a description does not allow us to assess the efficiency of identifying CNV of different lengths due to the complexity of assigning "true" CNV-exon to one or another group. The third limitation concerns the fact that uniform conditions may not be optimal for algorithms. For example, the minimum number of samples required for quality identification or using only the default parameters. Comparison of the best performances is of particular interest for understanding the capabilities of CNV calling tools. Moreover, we think this may be a good practice in the future to set the parameters depending on the available data or alternatively use ensemble models, thereby increasing the detection efficiency.