CONY: A Bayesian procedure for detecting copy number variations from sequencing read depths

Copy number variations (CNVs) are genomic structural mutations consisting of abnormal numbers of fragment copies. Read-depth signals from next-generation sequencing mirror these variants. Several tools that predict CNVs from read depth have been published, but most can be applied to only a specific data type because of modeling limitations. We develop CONY, a Bayesian procedure for CNV detection that adopts a Bayesian hierarchical model and an efficient reversible-jump Markov chain Monte Carlo inference algorithm for read-depth data from whole-genome sequencing. CONY can be applied not only to individual samples for estimating the absolute number of copies but also to case-control pairs for detecting patient-specific variations. We evaluate the performance of CONY and compare it with competing approaches through simulations and by using experimental data from the 1000 Genomes Project. CONY outperforms the other methods in terms of accuracy in both single-sample and paired-samples analyses. In addition, CONY performs well regardless of whether the data coverage is high or low. CONY is useful for detecting both absolute and relative CNVs from read-depth data. The package is available at https://github.com/weiyuchung/CONY.


Supplementary Text 1. Hyperparameter settings
Here, the default settings of the hyperparameters, including those for the window boundary part, $W_0$ (for the copy number state part), and $\mu_0$, are described. We consider a single-sample analysis as an example. For the normal state (CN: $k=2$), the mean ($\mu_{02}$) is estimated by the sample median (the 50% quantile), which is robust against potential outliers from extreme CNVs. Copy losses and gains are assumed to occur in identical proportions that total 10%, which is approximately the same as that in the human genome (i.e., the CN state of RDSs
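The robustness argument for the median can be illustrated with a minimal Python sketch. The function name `estimate_normal_state_mean` and the read-depth values are hypothetical and chosen only for illustration; this is not the CONY package code, and it covers only the median step for the normal state.

```python
import statistics


def estimate_normal_state_mean(read_depths):
    """Estimate the mean read depth of the normal state (CN = 2) by the sample median.

    Illustrative helper only (hypothetical name, not part of the CONY package).
    """
    return statistics.median(read_depths)


# Hypothetical per-window read depths: most windows lie near 100,
# and a few extreme windows mimic strong copy losses and gains.
depths = [100, 98, 102, 101, 99, 100, 50, 150, 300, 0]

robust_estimate = estimate_normal_state_mean(depths)  # sample median
naive_estimate = statistics.mean(depths)              # sample mean, pulled by the extremes

print(robust_estimate, naive_estimate)
```

In this toy input the median stays at the depth shared by the bulk of the windows, whereas the mean is shifted upward by the single extreme gain.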

Supplementary Text 2. Acceptance probabilities
To balance the parameter status for the boundary-updating move, each strategy requires a reverse move. The four strategies mentioned above are grouped into three types of jumping pairs: merge/split, double merge/trifid, and boundary changes to the left and right. Each pair is illustrated in Figure 3, and their acceptance probabilities are derived as follows.
In the acceptance probabilities, the variable $u$ should be fixed when the state of region $m^*$ is determined, and its probability is set to 1. In addition, the Jacobian of the transformation equals 1.
For the reverse move split, we randomly select a window boundary that belongs to region $m$ and change its value to 1. The region is then split into $m^*$ and $m^{**}$. One of the two updated regions is randomly selected with equal probability, and its state is set equal to that of the original region. The state of the other region is restricted to be unequal to the original and adjacent states. According to the transition rule, the density function $q(\cdot)$ is derived. Moreover, the variable $u$ should also be fixed, and its probability is set to 1.
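The split proposal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CONY implementation: regions are represented as half-open window ranges rather than boundary indicators, states are coded as integers `0 .. n_states - 1`, and the state of the non-inherited region is drawn uniformly from the permitted states, which stands in for the transition rule that is not reproduced here. The function name `split_region` and all parameters are hypothetical.

```python
import random


def split_region(start, end, state, left_state, right_state, n_states, rng):
    """Propose a split of the region covering windows [start, end).

    Returns two (start, end, state) tuples for the left and right sub-regions.
    left_state / right_state are the states of the neighbouring regions
    (use None at the edge of the analysed section).
    Assumption: the new state is drawn uniformly from the permitted states.
    """
    if end - start < 2:
        raise ValueError("a region needs at least two windows to be split")

    # Randomly choose an interior boundary of the region and switch it on.
    cut = rng.randrange(start + 1, end)

    # One sub-region, chosen with equal probability, inherits the original state.
    left_inherits = rng.random() < 0.5

    # The other sub-region must differ from the original state and from the
    # state of the neighbouring region it touches.
    forbidden = {state, right_state if left_inherits else left_state}
    candidates = [k for k in range(n_states) if k not in forbidden]
    if not candidates:
        raise ValueError("no permitted state for the new sub-region")
    new_state = rng.choice(candidates)

    if left_inherits:
        return (start, cut, state), (cut, end, new_state)
    return (start, cut, new_state), (cut, end, state)


# Example: split a region of windows 10..19 in state 2, flanked by states 1 and 3.
rng = random.Random(0)
left, right = split_region(10, 20, state=2, left_state=1, right_state=3,
                           n_states=5, rng=rng)
print(left, right)
```

Working with cut positions instead of a binary boundary vector keeps the sketch short; switching a boundary indicator from 0 to 1 at position `cut` would be the equivalent operation in the boundary representation.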
In addition, the Jacobian of the transformation is equal to 1.
[...] to that of the adjacent region, then we merge these regions again. Based on the procedure introduced in the above subsection, the transition density function is calculated. Moreover, the variables $u_1$ and $u_2$ should be fixed, and their joint probability is set to 1. The Jacobian of the transformation is equal to 1.
For the reverse move trifid, we randomly select two window boundaries that belong to region $m$ and change their values. The region is then split into $m^*$, $m^{**}$, and $m^{***}$. The states of the three updated regions were illustrated previously, and the transition density function can be derived. The variables $u_1$ and $u_2$ should also be fixed with joint probability 1. The Jacobian of the transformation is equal to 1.
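For orientation, the general shape of a reversible-jump acceptance probability can be sketched in Python. The function below implements the generic Metropolis-Hastings-Green ratio in log space, not the specific expressions derived for CONY; the function names and arguments are hypothetical. Because the auxiliary variables have probability 1 and the Jacobian equals 1 in every move pair described here, both contribute zero on the log scale, which is why the Jacobian term defaults to 0.

```python
import math
import random


def log_acceptance(log_lik_new, log_lik_old,
                   log_prior_new, log_prior_old,
                   log_q_reverse, log_q_forward,
                   log_abs_jacobian=0.0):
    """Log acceptance probability of a generic reversible-jump proposal.

    log_q_forward: log density of proposing the new configuration from the old one.
    log_q_reverse: log density of the reverse move (e.g. merge for a split).
    log_abs_jacobian: log |Jacobian|; 0.0 when the Jacobian equals 1.
    """
    log_ratio = ((log_lik_new - log_lik_old)
                 + (log_prior_new - log_prior_old)
                 + (log_q_reverse - log_q_forward)
                 + log_abs_jacobian)
    return min(0.0, log_ratio)


def accept(log_alpha, rng):
    """Accept the proposal with probability exp(log_alpha)."""
    return rng.random() < math.exp(log_alpha)


# Example with made-up log values: the proposal lowers the log likelihood by 2.
rng = random.Random(1)
log_alpha = log_acceptance(-12.0, -10.0, 0.0, 0.0, 0.0, 0.0)
print(math.exp(log_alpha), accept(log_alpha, rng))
```

Computing the ratio in log space avoids numerical underflow when the likelihood is a product over many windows.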
Figure S2. Effects of analytic section length on the CNV detection rate. Results of a. the detection of copy losses in the single-sample analysis, b. the detection of copy gains in the single-sample analysis, c. the detection of copy losses in the paired-samples analysis, and d.