Introduction

Copy-number alterations (CNAs) are gains or losses in chromosomal segments that frequently occur in tumor cells. Recent surveys suggest that certain cancers are driven by CNAs1. In addition to directly affecting cancer genes (e.g., MYCN and MDM2 amplifications and RB1 and CDKN2A deletions), CNAs are known to be driver events in a wide variety of cancer types2,3,4,5. Several experimental methods are available to identify CNAs in tumor cells. Fluorescence in situ hybridization provides direct evidence of CNAs and is the gold standard for CNA detection in a targeted region6. Before the development of next-generation sequencing (NGS) technologies, array comparative genomic hybridization and high-resolution single nucleotide polymorphism (SNP) arrays permitted genome-wide evaluation of CNAs at 30-kb to 100-kb resolution.

The development of NGS, especially whole-genome sequencing (WGS) and whole-exome sequencing (WXS) platforms, has revolutionized the detection of somatic mutations, including CNAs, in cancer samples. For example, Copy Number Segmentation by Regression Tree in Next Generation Sequencing (CONSERTING)7 incorporates read-depth and structural-variation data from BAM files for accurate CNA detection in high-coverage WGS data. However, CONSERTING and other WGS-based CNA algorithms produce a fractured genome pattern (i.e., a hypersegmented CNA profile with an excessive number of intrachromosomal translocations) in samples with library construction artifacts7, which poses a major challenge for precise CNA inference. Although the frequency of observing the fracture genome pattern has dropped substantially with improvement of libraries preparation and sequencing methods, we still occasionally identify the pattern in samples sequenced with the latest technology. Our extensive analysis indicated that although CNA and structural-variation detection was severely impaired by library artifacts, point-mutation detection was largely unaffected7,8, suggesting that a robust CNA tool can be developed from the variant information. CONSERTING and other NGS algorithms require direct access to aligned BAM files. Moreover, advances in technology and declines in costs have made NGS a commodity for both basic research and clinical service. However, local installation of CONSERTING and other NGS algorithms often involves complicated steps, which may be challenging for individual groups without dedicated bioinformatics support. Cloud-based pipelines may require transfer of large BAM files, a current bottleneck in their applications. Therefore, a robust CNA analytical tool that is efficient, convenient, and robust to library artifacts is needed to manage the demands of NGS data analysis.

VCF2CNA is both a web-based (http://vcf2cna.stjude.org), and commandline (http://www.github.com/XCLab/VCF2CNA) tool for CNA analysis. The preferred input to VCF2CNA is a paired Tumor/Germline Variant Call Format (VCF) file. VCF is a widely adopted format for genetic variation data exchange, and VCF files are quite small compared to WGS BAM files. Each variant in a typical VCF file contains its chromosome position, reference/alternative alleles, and corresponding allele counts, which are used by VCF2CNA to identify copy-number alterations. This tool also accepts input in the Mutation Annotation Format (MAF) and the variant file format produced by the Bambino program9.

Results

VCF2CNA can be run through a simple web interface (Fig. 1A) or as a commandline line tool. For the web interface, the sole input is a VCF file (or a file in one of the other supported variant file formats) from a paired tumor–normal WGS or WXS analysis, which is uploaded via the interface to a web server where the application runs. The results are returned to a user-provided email address. For the commandline tool, the pipeline is run by invoking a single run command. VCF2CNA consists of two main modules: (1) SNP information retrieval and processing from the input data and (2) recursive partitioning–based segmentation using SNP allele counts (Fig. 1B). Actual running time for a typical WGS sample is approximately 30 to 60 minutes, depending on the complexity of the genome.

Figure 1
figure 1

Overview of the VCF2CNA process. (A) User interface with parameters. (B) Server side pipeline. A parallelogram depicts input or output files, a rectangle depicts an analytical process, and a diamond depicts the condition for a follow-up process.

To evaluate the utility of VCF2CNA, we ran it on 192 tumor–normal WGS data sets and 15 tumor–normal WXS data sets. These sequences comprised 46 WGS adult glioblastomas (GBMs) from The Cancer Genome Atlas (TCGA-GBM) dataset10, sequenced by Illumina technology, and 146 WGS pediatric neuroblastomas (NBLs) from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET-NBL) dataset11, sequenced by Complete Genomics, Inc. (CGI) technology. On average, VCF2CNA used approximately 2.8 million high-quality SNPs per sample (median 2,811,245; range, 2,029,467–3,519,454 in TARGET-NBL data) to derive CNA profiles. We further evaluated the consistency between WGS and WXS using 15 rhabdomyosarcoma samples that were sequenced on both platforms12 and estimated the tumor purity in these samples.

CNA analysis of TCGA-GBM data

The adult TCGA-GBM data downloaded from dbGaP (accession number: phs000178.v8.p7) included 46 samples. We first evaluated VCF2CNA’s resistance to library construction artifacts by using 24 samples from this set, which were previously identified as having a fractured genome pattern by CONSERTING and other CNA algorithms7. Indeed, VCF2CNA produced CNA profiles that are globally consistent with those of SNP array–derived CNA profiles (downloaded from TCGA, Supplementary File s1) and more robust to noise than those produced by CONSERTING. Specifically, VCF2CNA yielded a mean 59.4-fold reduction in the number of predicted segments than did CONSERTING (median, 46.2; range, 16.2–285.7; p = 3.0 × 10−6 by Wilcoxon signed-rank test, Fig. 2A and Supplementary File s1).

Figure 2
figure 2

A Circos plot that displays CNAs found by CONSERTING (outer ring), VCF2CNA (middle ring), and SNP array (inner ring) for (A) TCGA-GBM fractured sample 41-5651-01A and (B) TCGA-GBM unfractured sample 06-0125-01A. Alternating gray and black chromosomes are used for contrast. Yellow regions depict sequencing gaps, whereas red regions depict centromere location. Blue segments depict copy-number loss, and red segments indicate copy-number gain. Legend depicts CNA range for each track.

We used an F1 scoring metric13 to measure the consistency between the CNA profiles derived from VCF2CNA and CONSERTING in the remaining 22 high-quality sample pairs (Fig. 2B and Supplementary File s2). These programs identified approximately 700 Mb of the CNA regions in each sample (range, 92–2299 Mb) with high consistency (mean F1 score, 0.9941; range, 0.9699–0.9995) (Table 1).

Table 1 F1 score between CONSERTING and VCF2CNA and autosomal CNAs per sample in 22 TCGA samples.

We evaluated the segmental overlap between the CONSERTING outputs and the VCF2CNA outputs for each sample. A CNA segment detected by CONSERTING was classified as corroborated if 90% of the bases in the segment received the same type of CNA call from VCF2CNA (Table 2). The comparison shows that VCF2CNA faithfully recapitulated medium to large CNA segments (≥100 kb), whereas CONSERTING had greater power for identifying focal (<100 kb) low-amplitude (absolute log2 ratio change <1.0) CNAs (p = 1.306 × 10−5 by Wilcoxon signed-rank test). Furthermore, the segmental–based analysis revealed that the detection power was less affected in focal CNAs with large amplitudes (log2 ratio ≥ 3.0) (Fig. 3).

Table 2 Counts of corroborated and uncorroborated segments by segment length.
Figure 3
figure 3

Violin plot stratified by segment size and CNA intensity for all 22 TCGA-GBM unfractured samples. Gold diamond represents the mean fraction of matching segments between VCF2CNA and CONSERTING.

To further test whether VCF2CNA accurately captures the CNA patterns in samples with library artifacts, we applied the cghMCR algorithm14. This package in R Bioconductor provides functions to identify genomic regions of interest based on segmented copy number data from multiple samples. We used this functionality to depict these common gains and losses across all 46 samples from either VCF2CNA profiles or SNP array–derived CNA profiles (downloaded from TCGA). The results are quantified by a segment gain or loss (SGOL) score. Although the signal from VCF2CNA contained less noise than did the signal from the SNP array in most samples (Supplementary File s1), both profiles reveal common recurrently amplified and/or lost regions (Fig. 4). These changes included chromosome-level changes (i.e., chr7 amplifications and loss of chr10) and segmental CNAs (i.e., focal deletion of the CDKN2A/B locus on chr9p)15. Moreover, VCF2CNA identified recurrent losses in ERBB4 on chr2q and GRIK2 on chr6q that were absent in the SNP array profiles. ERBB4 encodes a transmembrane receptor kinase that is essential for neuronal development16. It is frequently mutated in patients with non-small cell lung cancer17, and silencing of ERBB4 through DNA hypermethylation is associated with poor prognosis in primary breast tumors18. Similarly, GRIK2 is a candidate tumor suppressor gene that is frequently deleted in acute lymphocytic leukemia19 and silenced by DNA hypermethylation in gastric cancer20.

Figure 4
figure 4

A chgMCR plot of 46 TCGA-GBM samples. (A) SNP array data and (B) VCF2CNA data are shown.

Amplifications such as double minute chromosomes and homogeneously staining regions represent a common mechanism of oncogene overexpression in tumors21. Among the 46 TCGA-GBM samples analyzed, VCF2CNA identified double minute chromosomes in 34 samples affecting the EGFR22, MDM223, MDM424, PDGFRA25, HGF26, GLI127, CDK428, and CDK629 genes (Fig. 5 and Supplementary File s3). These events consisted of high-level amplifications in 21 samples with potential fractured genome patterns (Supplementary File s3a) and 13 previously reported samples (Supplementary File s3b)7,30.

Figure 5
figure 5

A Circos plot of VCF2CNA (outer ring) and CONSERTING (inner ring), depicting high-amplitude focal CNA segments in TCGA-GBM sample 06-0152-01A. Included in these segments are the known cancer genes EGFR, CDK4, and MDM2. CNA range is specified for each sample.

CNA analysis of TARGET-NBL data

We applied VCF2CNA to the TARGET-NBL dataset11 downloaded from dbGap (assession number: phs000467). This dataset consists of 146 tumors with matched normal WGS samples, sequenced with CGI technology. Because the ligation-based CGI technology has notable differences in the detection of single nucleotide variants (SNVs) and insertions/deletions (indels) compared to Illumina systems31, this dataset provided an opportunity to evaluate VCF2CNA’s robustness using different sequencing platforms.

We used VCF2CNA to perform cghMCR analysis with CNA profiles and observed a genome pattern similar to that reported for SNP array platforms (Fig. 6A)32. In addition to loss of large regions on chr1p, 3p, and 11q and a broad gain of chr17q, VCF2CNA found frequent focal amplifications of MYCN in NBL tumors and several potential cancer-related CNAs, including high-level amplifications of CDK4 (1 tumor), and ALK (2 tumors) (Fig. 6B).

Figure 6
figure 6

Analysis of the TARGET-NBL dataset, consisting of 146 tumors. (A) A chgMCR plot in which green depicts regions of copy-number gain and red depicts regions of copy-number loss. (B) A Circos plot showing a focal gain on chromosome 2 for MYCN and ALK5 for sample PARETE-01A-01D. CNA range is specified.

High-level amplification of MYCN is a known oncogenic driver found in ~25% of pediatric patients with NBL, and is associated with aggressive tumors and poor prognosis33. A subset of 32 tumors in the TARGET-NBL cohort contains clinically validated amplifications of MYCN. Although the CGI’s hidden Markov CNA model (unpublished) reported MYCN amplifications in 15 of these 32 tumors, VCF2CNA successfully identified high-level amplifications in 31 tumors. In the clinically validated MYCN-amplified sample that went undetected by VCF2CNA, a follow-up review revealed that tumor heterogeneity and sampling bias most likely contributed to the discrepancy. Moreover, VCF2CNA predicted two additional MYCN amplification events among the remaining tumor samples, indicating that VCF2CNA can identify clinically relevant CNAs that were undetected by traditional methods of CNA detection. The high-level concordance with clinically validated data provides a strong indication that VCF2CNA is applicable to multiple tumor types collected from different sequencing platforms.

CNA analysis of rhabdomyosarcoma data to compare WXS and WGS

Although, WGS provides unbiased coverage measurements across the genome, whole exome sequencing (WXS) offers characterization of the coding regions of the genome (2% of genome) at much higher depth, which provides a convenient and inexpensive alternative to WGS and has been widely adopted in large scale genome profiling projects and clinical settings. Due to major design differences between the two platforms, we evaluated the consistency of copy number alteration detection between whole exome and whole genome sequencing, using a set of rhabdomyosarcoma samples that were sequenced on both platforms12. We observed highly consistent CNA profiles between WGS and WXS platforms (mean F1 score 0.97 on a set of 15 rhabdomyosarcoma xenograph samples). While focal changes are more likely to be missed in the WXS platform compared to the WGS platform, VCF2CNA reliably detects large CNAs from both WGS and WXS platforms (Fig. 7, Supplementary File s5).

Figure 7
figure 7

Somatic CNAs computed using VCF2CNA for paired whole-exome and whole-genome Rhabdomyosarcoma xenograph sample SJRHB000026_X1_G1.

CNA-based purity estimation

Using the absolute copy number result for each segment identified through VCF2CNA, and B-allele frequencies (BAFs) computed from the paired tumor-normal VCF file, we developed an algorithm to estimate tumor purity using segments with a single copy number gain or loss in VCF2CNA. Briefly, for germline heterogeneous single nucleotide polymorphisms (SNPs, base BAF of 0.5), the extent of loss of heterozygosity (LOH) can be measured by the absolute difference between the B-allele fraction in tumor and that in germline sample. LOH is the result of copy number alteration and/or copy neutral-LOH in tumor cells. We used LOH signals in copy neutral or single-copy gain/loss regions (between single-copy chromosome loss and single-copy chromosome gain) to estimate tumor purity.

Using purity estimates from various regions within the genome we performed an unsupervised clustering analysis using the mclust package (version 5.4) in R (version 3.4.0). The tumor purity of the sample was defined as the highest cluster center value among all clusters. We estimated Tumor purity for 15 matched tumor-normal xenograph Rhabdomyosarcoma WGS samples. (Table 3). All but one case had a tumor purity prediction near 100%, consistant with the notion that the most mouse-derived reads won’t be mapped to the human genome assembly34,35. The sample SJRHB010468_X1_G1 showed extensive subclonal CNAs across multiple chromosomes (Supplementary File s5). While subclonal CNAs are not indicative of low purity, the extensive subclonal copy number segments result in an incorrect tumor purity estimation (0.533), which is a limitation of the algorithm. The Mutant allele fraction (MAF) density plot for somatic single nucleotide variations (SNVs) detected in diploid regions, revealed a subclone in 50% of the tumor cells, which harbors more than 75% of the detected SNVs (Supplementary File s6).

Table 3 Purity estimation of Rhabdomyosarcoma paired tumor-normal WGS samples.

Discussion and Conclusions

We developed VCF2CNA for the systematic and robust detection of CNAs from VCF and other genotyping variant call formats, which can be derived from various sequencing platforms and/or alignment file formats (e.g. BAM, CRAM, Petagene, etc.). Analysis of 192 paired tumor–normal WGS samples sequenced on multiple platforms demonstrates that VCF2CNA is robust to library construction artifacts and captures medium to large CNA segments with high accuracy. The analysis in tumor samples sequenced on both WGS and WXS platforms further suggests that VCF2CNA produces highly consistent CNA profiles from both platforms. VCF2CNA identified recurrent losses in potential tumor suppressors that were undetectable by alternative approaches. The algorithm attempts to automatically determine the diploid region in tumor samples and uses that as the reference. However, in samples with genome-wide duplication, it will not be detected by VF2CNA or any read depth-based algorithm. To address this, VCF2CNA allows the user to define a reference chromosome/region.

VCF2CNA was designed with SNPs that were (on average) thousands of base pairs apart, which limits support for identifying focal copy-number changes. Therefore, state-of-the-art CNA algorithms have superior detection power for focal low-amplitude CNAs in high-quality, high-coverage WGS data.

VCF2CNA includes a method to estimate overall tumor purity for samples with sufficient number of purity estimates for segments containing single copy number gain or loss. Most WXS samples contained too few purity estimates to produce a reliable purity estimation, (a minimum of 20 segments required). The tumor purity estimation in VCF2CNA is derived from CNAs and LOH signals and therefore, the result will be biased if tumor cells do not have these leisions or these leisions are primarily identified in a subclone. The final tumor purity estimation should be compiled from various analyses, including CNA, SNV, and pathology-based evaluation, etc.

In conclusion, VCF2CNA is a web-based tool (with an option of local installation) that is capable of accurate and efficient detection of CNAs and tumor purity from variants called from high-coverage WGS and WXS data sequenced on various platforms.

Methods

Server availability

The webserver for VCF2CNA is available at https://vcf2cna.stjude.org. The downloadable executable is available at http://www.github.com/XCLab/VCF2CNA.

Parameter definitions

The Specify Diploid Chromosome parameter normalizes results by the specified chromosome. The Median Normal Coverage parameter permits input of the median coverage value of SNPs from normal samples. The Minimum Scale Factor (autosomes) parameter is multiplied by the median to compute the minimum coverage value. The Maximum Scale Factor (autosomes) parameter is multiplied by the median to compute the maximum coverage value. The Minimum X Scale Factor is the minimum scale factor for chromosome X. The Maximum X Scale Factor is the maximum scale factor for chromosome X. The Sample Order (VCF format only) parameter defines the ordering of tumor and normal samples. VCF inputs must include tumor and normal data after the FORMAT field. Selecting the Tumor/Normal button assigns the tumor data to the first field after FORMAT and normal data to the second field. The Normal/Tumor radio button specifies the reverse order.

Input data for VCF2CNA

The input for VCF2CNA analysis includes VCF, MAF, and the variant file format produced by the Bambino program. A fixed window size of 100 bp is used to obtain the mean coverage for each window. Windows with no variants are ignored. The mean read depth per window can be normalized to a set of reference diploid chromosomal regions by using the same criteria as CONSERTING or specified via the Specify Diploid Chromosome parameter.

Tumor purity estimation

Basic Definitions:

  1. 1.

    B Allele Fraction (BAF): the frequency a given base does not match the corresponding reference sequence, divided by the read depth at that position.

  2. 2.

    Loss of Heterozygosity (LOH): the absolute value of the difference between the BAF of the tumor sample and the BAF of the germline sample at heterozygous sites. This is also referred to as allelic imbalance (AI).

  3. 3.

    Copy Number Alteration (CNA): Inferred copy number change in the tumor sample, where +1/0/−1 represents one copy gain/no change/one copy loss, respectively.

Key Relationships:

  1. 1.

    BAF cluster 1: The location of the left cluster center

    $$\begin{array}{c}{\rm{Left}}\,{\rm{Center}}=\frac{1+{\rm{CNA}}}{2+{\rm{CNA}}}\end{array}$$
    (1)
  2. 2.

    BAF cluster 2: The location of the right cluster center

    $$\begin{array}{c}{\rm{Right}}\,{\rm{Center}}=\frac{1}{2+{\rm{CNA}}}\end{array}$$
    (2)
  3. 3.

    BAF separation: The distance between the cluster centers of two populations of BAFs is given by 2(LOH). This is also given by the absolute value of the difference between BAF cluster 1 and 2.

$$\begin{array}{c}{\rm{BAF}}\,{\rm{Distance}}=|\frac{1+{\rm{CNA}}}{2+{\rm{CNA}}}-\frac{1}{2+{\rm{CNA}}}|=|\frac{{\rm{CNA}}}{2+{\rm{CNA}}}|=2({\rm{LOH}})\end{array}$$
(3)

VCF2CNA output

The VCF2CNA pipeline produces an output text file including the following fields:

  1. 1.

    Seg.mean: A value of 1.0 corresponds to 2 copy gain, 0.5 corresponds to 1 copy gain, 0 corresponds to no gain or loss, −0.5 to 1 copy loss and −1.0 to 2 copy loss.

  2. 2.

    Gmean: A value of 2.0 corresponds to a diploid sample, 0.5 corresponds to 1 copy loss and 0 corresponds to 2 copy loss.

BAF calculation

The distribution of BAF values is related to the underlying copy number changes. They represent the total number of reads matching one of two allele types at a given heterozygous site. The A allele is the allele matching the germline genome, while the B allele is the corresponding unmatched allele. At a heterozygous site, the expected BAF value in the germline sample is 0.5. Copy number changes at these heterozygous sites in tumor samples may cause a deviation from 0.5. This LOH depends on the absolute copy number changes. Analysis of BAF plots of heterozygous sites for copy number change regions depict uni/bimodal distribution BAFS. The distance between BAF distributions varies due to varying combinations of tumor purity and copy number changes. Using absolute copy number changes computed from VCF2CNA and LOH measurements from heterozygous BAF sites we compute tumor purity.

To accomplish this we combine the segmented CNA output from VCF2CNA with BAF values from heterozygous sites in germline samples of paired tumor-germline samples. Each segment has the same CNA value. The j-th segment of the tumor genome is given by Cj with j = 1,2, …, J. Heterozygous sites in the corresponding normal genome are mapped to these segments using the starting and ending location of the segment. We specify (i,j) to index the i-th heterozygous site on segment j with i = 1, 2, …, Ij, where Ij is the total number of heterozygous sites in segment j. Only BAF sites that fall inside a given VCF2CNA segment are used in analysis.

Purity derivation

We assume the following populations:

  1. 1)

    x: fraction of cells with a single-copy CNA. (gain x > 0, loss x < 0)

  2. 2)

    y: fraction of cells with CN-LOH. (The chromosome lost is the chromosome lost in CN-LOH, or the chromosome gained is the chromosome gained in CN-LOH).

$$\begin{array}{c}{\rm{Measured}}\,{\rm{CNA}}=x\end{array}$$
(4)
$$\begin{array}{c}{\rm{Measured}}\,{{\rm{LOH}}}_{{\rm{loss}}}=(\frac{1+y}{2+x})-0.5\end{array}$$
(5)
$$\begin{array}{c}{\rm{Measured}}\,{{\rm{LOH}}}_{{\rm{gain}}}=(\frac{1+x+y}{2+x})-0.5\end{array}$$
(6)
$$\begin{array}{c}{\rm{Measured}}\,{{\rm{LOH}}}_{{\rm{combined}}}=(\frac{2y+|x|}{4+2x})\end{array}$$
(7)

Solve equation 7 for y:

$$\begin{array}{c}{\rm{y}}={\rm{LOH}}(2+{\rm{x}})-0.5|{\rm{x}}|\end{array}$$
(8)
$$\begin{array}{c}{\rm{purity}}=|{\rm{x}}|+{\rm{y}}={\rm{LOH}}(2+{\rm{CNA}})+\frac{|{\rm{CNA}}|}{2}\end{array}$$
(9)

Run-time analysis

Single VCF files must be converted to a paired tumor/normal file before uploading. Alternatively, VCF2CNA accepts MAF and Bambino variant file formats. After uploading files to the server, the median running time was 23 minutes on an intel Xeon E5-2680 processor at 2.70 Ghz with 64 GB RAM. Server processing occurs in two principal steps: (1) preprocessing and SNP information extraction from input files and (2) running the recursive partitioning segmentation.

F1 scoring metric and segmental corroboration

A genomic position was assigned a corroborated CNA call if its computed CNA type (gain or loss) by VCF2CNA matched the call computed by CONSERTING. A CNA segment in the CONSERTING profile was corroborated in the VCF2CNA profile if ≥90% of the segment positions were corroborated. The F1 score is given by \({F}_{1}=\frac{2(precision)\,(recall)}{precision+recall}\) . It was used to summarize the accuracy of VCF2CNA, compared with that of CONSERTING.

VCF2CNA web server pipeline

Step1 (snvcounts)

Single nucleotide variant frequencies are computed from the input file. For each chromosome and position, the values computed are TumorMutant, TumorTotal, NormalMutant, and NormalTotal. Additionally, the mean normal coverage is computed.

Step2 (consprep)

Using a collection of 625 WGS samples in the Pediatric Cancer Genome Project36, we generated an in-house blacklist of suspicious SNPs, many of which are potentially mapping artifacts in the blacklisted regions of ENCODE (https://personal.broadinstitute.org/anshul/projects/encode/rawdata/blacklists/hg19-blacklist-README.pdf). The consprep program reads the SNV count data and incorporates this list to identify heterozygous sites. It also reads a file specifying the number of 100-bp windows in each chromosome. If the total number of reads from the normal sample falls outside of the ranges specified by the options (median, minfactor, maxfactor, xminfactor, or xmaxfactor), the input position is ignored by the consprep step in the pipeline. The –xminfactor and –xmaxfactor settings apply to positions in chrX; the –minfactor and –maxfactor settings apply to all other chromosomes. The minimum coverage is the median multiplied by the –minfactor, and the maximum coverage is the median multiplied by the –maxfactor.

Application

To run VCF2CNA, users should navigate to the application home page and click “run application.” The application runs on Google Chrome, Safari, Mozilla Firefox, and Microsoft Internet Explorer 11. Users must provide a valid email in the email address text field. Users will select whether results will be sent to the provided email address as either an email attachment or a link to the result files stored on the server. Once the analysis is complete, the original input file is deleted from the server immediately. If an error occurs during analysis, the input file is stored on the server for 7 days and then purged from the system. The results of the analysis are stored on the server for 14 days. After that time-period, they are deleted from the server.

Default run parameters may be modified depending on job specifications. Users should select the input file and click the “upload/run” button. The browser window should not be killed during the file upload. Once the file has been successfully uploaded, a notification will be displayed in the browser window and the user may discard the window.

Rationale for not using the reciprocal-overlap rule

To compare CNA calls from different algorithms, the reciprocal 50% overlap criterion31 is commonly used. This rule is not suitable when two CNA calls are derived from platforms with different powers in detecting focal CNAs. A considerably larger average distance occurred between adjacent probes. VCF2CNA-derived CNA calls have an inherently lower resolution than does CONSERTING. When a focal CNA identified through CONSERTING occurs on top of a large CNA fragment, CONSERTING breaks the region into multiple segments. Although the CNA fragments in the region are largely corroborated between the two CNA callers, potentially none of these fragments satisfied the rule of reciprocal 50% overlap (Supplementary File S4).