CNspector: a web-based tool for visualisation and clinical diagnosis of copy number variation from next generation sequencing

Next generation sequencing is now routinely used in diagnostic pathology to detect clinically relevant somatic and germline sequence variations in patient samples. However, clinical assessment of copy number variations (CNVs) and large-scale structural variations (SVs) is still challenging. While tools exist to estimate both, their results are typically presented separately in tables or static plots, which can be difficult to read and cannot show the context needed for clinical interpretation and reporting. We have addressed this problem with CNspector, a multi-scale interactive browser that shows CNVs in the context of other relevant genomic features to enable fast and effective clinical reporting. We illustrate the utility of CNspector at different genomic scales across a variety of sample types in a range of case studies. We show how CNspector can be used for diagnosis and reporting of exon-level deletions, focal gene-level amplifications, chromosome-level and chromosome-arm-level amplifications and deletions, and complex genomic rearrangements. CNspector is a web-based clinical variant browser tailored to the clinical application of next generation sequencing for CNV assessment. We have demonstrated the utility of this interactive software in typical applications across a range of tissue types and disease contexts encountered in diagnostic pathology. CNspector is written in R and the source code is available for download under the GPL3 Licence from https://github.com/PapenfussLab/CNspector.


Generation of data displayed by CNspector
CNspector is agnostic to the tools used to generate the tables that it displays. For completeness, we outline the steps used to produce the data behind the figures generated by CNspector.
Alignment and breakpoint generation - reads were aligned using Subread v1.5.0-p3 against the human genome reference build GRCh37.73. Putative breakpoints were also generated by Subread and then tabulated for later display.
Read abundance estimation - the featureCounts() function from the R package Rsubread v1.24.0 was used to generate read abundance for all samples using non-overlapping fixed-width bins of width 5,000, 50,000 and 1,000,000 bp. For enriched samples, read abundance in targeted regions was estimated using bins defined by the genomic regions of the bait sequences. For unenriched whole-genome (WG) samples and for RNA-seq, read abundance in targeted regions was estimated using bins corresponding to the exons used for annotation.
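The fixed-width binning described above can be sketched in a few lines of Python. This is an illustration only, not the Rsubread implementation: the function names `make_bins` and `count_reads` are hypothetical, and reads are assigned to a bin by their start position, which is a simplification of overlap-based counting.

```python
def make_bins(chrom_length, width):
    """Return 1-based inclusive (start, end) bins tiling one chromosome.

    The last bin is truncated at the chromosome end.
    """
    return [(start, min(start + width - 1, chrom_length))
            for start in range(1, chrom_length + 1, width)]


def count_reads(read_starts, chrom_length, width):
    """Count reads per bin, assigning each read by its 1-based start position.

    Assumption: a read belongs to the bin containing its start coordinate.
    """
    counts = [0] * len(make_bins(chrom_length, width))
    for pos in read_starts:
        if 1 <= pos <= chrom_length:
            counts[(pos - 1) // width] += 1
    return counts


# Toy example: a 12,000 bp "chromosome" tiled with 5,000 bp bins.
bins = make_bins(12000, 5000)
counts = count_reads([1, 4999, 5000, 5001, 11999], 12000, 5000)
```

Running the same functions with widths of 50,000 and 1,000,000 would give the coarser resolutions mentioned in the text.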
GC correction of read abundance - read abundance values were GC corrected using LOESS fitting. All bins intersecting the ENCODE blacklisted regions (https://www.encodeproject.org/annotations/ENCSR636HFF/) were removed prior to fitting, as were bins in the targeted regions corresponding to baits whose read abundance is known not to vary as a function of GC content.
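As a rough illustration of GC correction, the sketch below fits a simplified LOESS curve (local linear regression with tricube weights, no robustness iterations) of read count against GC fraction and then rescales each bin by the fitted value. The names `loess_fit` and `gc_correct`, the `frac` span and the choice to rescale towards the median fitted value are all assumptions made for this example; the text does not specify these details.

```python
from statistics import median


def loess_fit(x, y, frac=0.5):
    """Simplified LOESS: local linear fit with tricube weights at each x."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th neighbour
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(gc, counts, frac=0.5):
    """Divide out the fitted GC trend, rescaling to the median fitted value."""
    fitted = loess_fit(gc, counts, frac)
    target = median(fitted)
    return [c * target / f if f > 0 else 0.0 for c, f in zip(counts, fitted)]


# Toy bins whose counts depend linearly on GC; correction flattens the trend.
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
counts = [100 + 200 * g for g in gc]
corrected = gc_correct(gc, counts)
```

In practice, blacklisted bins and excluded baits would be dropped from `gc` and `counts` before the fit, as described above.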
Copy number estimation - for each bin, copy number was estimated as two times (or one times for X and Y bins in male samples) the ratio of the normalised count in the bin to the median of the normalised counts in the corresponding bin across a set of reference samples. For each resolution of bins in each sample, normalisation was performed by scaling so that the median of the non-zero bins was one. The median absolute deviation (MAD) from the median was estimated for each bin across the reference samples and used to compute the standard deviation (SD). Under the assumption of normality this was taken to be equal to 1.4826 × MAD.
Multi-sample mode - copy number estimation is performed as described in SI Section 2, using the user-selected samples from the displayed batch as the reference set.
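The copy number calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the CNspector code: the function names are hypothetical, the SD is computed on the normalised reference counts (the text does not say whether it is further scaled to copy number units), and the factor 1.4826 is the usual constant relating the MAD to the SD of a normal distribution.

```python
from statistics import median

MAD_TO_SD = 1.4826  # consistency constant for normally distributed data


def normalise(counts):
    """Scale counts so that the median of the non-zero bins is one."""
    scale = median([c for c in counts if c > 0])
    return [c / scale for c in counts]


def estimate_copy_number(sample_counts, reference_counts, ploidy=2):
    """Return per-bin (copy_number, sd) lists.

    sample_counts    : counts for one sample, one value per bin
    reference_counts : list of count lists, one per reference sample
    ploidy           : 2 for autosomes, 1 for X and Y bins in male samples
    """
    sample = normalise(sample_counts)
    refs = [normalise(r) for r in reference_counts]
    copy_number, sd = [], []
    for i, s in enumerate(sample):
        ref_bin = [r[i] for r in refs]
        ref_med = median(ref_bin)
        mad = median([abs(v - ref_med) for v in ref_bin])
        # Bins with a zero reference median carry no usable estimate.
        copy_number.append(ploidy * s / ref_med if ref_med > 0 else float("nan"))
        sd.append(MAD_TO_SD * mad)
    return copy_number, sd


# Toy example: the third bin looks like a single-copy loss, the fourth a gain.
sample = [100, 100, 50, 200]
references = [[100, 100, 100, 100], [110, 90, 100, 100], [90, 110, 100, 100]]
cn, sd = estimate_copy_number(sample, references)
```

In multi-sample mode, `references` would simply be the count vectors of the user-selected samples from the displayed batch.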
Minimum displayed read support - by default, features are displayed regardless of the number of reads that support them. Adjusting the minimum displayed read depth slider restricts the display to features with read support greater than the selected value. This can be useful for removing noisy entries that clutter the display or confound interpretation.
Breakpoint display - breakpoints are displayed with an O at one end and an X at the other, joined by a line. The log of the read support is displayed, scaled so that all breakpoints fit in the region 0<CN<=4. This visually separates breakpoints that cluster near each other.
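One way to combine the read support filter with the log scaling of breakpoints is sketched below. The exact scaling used by CNspector is not given in the text, so the formula here (log(1 + support) divided by its maximum and multiplied by 4) is an assumption chosen only because it maps any positive read support into the interval 0<CN<=4; the function name `scale_breakpoints` is likewise hypothetical.

```python
import math


def scale_breakpoints(read_support, min_support=0, cn_max=4.0):
    """Filter breakpoints by read support and map support to a display height.

    Keeps breakpoints whose support is greater than min_support, then returns
    (support, height) pairs with 0 < height <= cn_max. The best-supported
    breakpoint is drawn at cn_max.
    """
    kept = [s for s in read_support if s > min_support]
    if not kept:
        return []
    top = math.log1p(max(kept))
    return [(s, cn_max * math.log1p(s) / top) for s in kept]


# Toy example: the breakpoint supported by a single read is filtered out.
display = scale_breakpoints([1, 10, 100], min_support=1)
```

Because the height grows with the log of the support, breakpoints with very different read counts remain distinguishable without the best-supported ones dominating the plot.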
Displaying allele frequencies - by default, all B-allele frequencies (BAFs) are displayed regardless of read support or significance. This is because sparsely sampled regions, such as unenriched areas in targeted sequencing, have few loci that can be used to reliably determine sequence variants or estimate allele frequencies. For visual inspection, even unreliable estimates, taken together, can give a visual indication of zygosity or sample heterogeneity. If required, poorly supported or noisy BAFs can still be removed by increasing the minimum displayed read support.