Framework for quality assessment of whole genome, cancer sequences

Working with cancer whole genomes sequenced over a period of many years in different sequencing centres requires a validated framework to compare the quality of these sequences. The Pan-Cancer Analysis of Whole Genomes (PCAWG) of the International Cancer Genome Consortium (ICGC), a project with a cohort of over 2800 donors, provided us with the challenge of assessing the quality of the genome sequences. A non-redundant set of five quality control (QC) measurements was assembled and used to establish a star rating system. These QC measures reflect known differences in sequencing protocol and provide a guide to downstream analyses of these whole genome sequences. The resulting QC measures also allowed the exclusion of samples of poor quality, giving researchers within PCAWG, and other researchers once the data is released, a good idea of the sequencing quality. For researchers wishing to apply the QC measures to their own data, we provide a Docker Container of the software used to calculate them. We believe that this is an effective framework of quality measures for whole genome, cancer sequences, which will be a useful addition to analytical pipelines, as it has been to the PCAWG project.


Introduction
[...] advantages: increased statistical power, the ability to extend hypotheses across several projects and the possibility of asking biological questions covering a wider range of phenomena. However, when the genome sequencing data comes from different centres, and was sequenced at different times and under different protocols, great care must be taken to ensure that the sequencing data is of comparable quality, to avoid drawing false conclusions. The Pan-Cancer Analysis of Whole Genomes (PCAWG) project provided us with a great opportunity to assemble, test and finalise the quality control measures that are important for comparing the quality of whole genome, cancer sequences.

The PCAWG project assembled a cohort of 48 projects encompassed in the International [...] the sequencing methodology was evolving rapidly). To be able to perform analysis across the whole data set, it was necessary that the quality of the sequencing be carefully assessed.
There are advantages in a comprehensive set of quality measures. First, we can exclude samples of low quality, which avoids running downstream analyses on them and saves both computational resources and researchers' time. Second, for researchers in PCAWG studying driver mutations, the measures provide a sanity check: a driver mutation found only in low quality samples may be a weaker candidate than one supported by high quality samples. As PCAWG will release the data for the community to use, our quality measures will provide a guide to the quality of the whole genome sequences within. For researchers who wish to assess the quality of their whole genome [...] to those that had many sequencing quality issues.

Results
All our analyses are based on the aligned sequences from the PCAWG core pipeline [11]. Within the aligned sequences we excluded duplicate reads, reads with a mapping quality of zero and supplementary alignments (reads that map to more than one place in the genome). The first three quality control measures (mean coverage, evenness of coverage and somatic mutation calling coverage) are linked to different aspects of the coverage of the genomic sequence. The other two measures indicate discrepancies between the paired reads: mapping to different chromosomes and the ratio of edits between the paired reads compared to the reference genome. Finally, we summarise these five measures into a star rating, for easy comparison of the quality of each sample pair.
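The read exclusions described above can be sketched as a simple predicate. This is an illustrative sketch only, not the PCAWG implementation: it assumes the aligner marks duplicates and supplementary alignments with the standard SAM FLAG bits (0x400 and 0x800), and the function name `keep_read` is hypothetical.

```python
# Standard SAM FLAG bits (assumed to be how the exclusions are encoded).
FLAG_DUPLICATE = 0x400       # PCR or optical duplicate
FLAG_SUPPLEMENTARY = 0x800   # supplementary alignment


def keep_read(flag: int, mapq: int) -> bool:
    """Return True if a read should be counted for the QC measures.

    Hypothetical helper: drops duplicates, supplementary alignments
    and reads with a mapping quality of zero.
    """
    if flag & FLAG_DUPLICATE:
        return False
    if flag & FLAG_SUPPLEMENTARY:
        return False
    if mapq == 0:
        return False
    return True


# Example: (flag, mapq) pairs as they might be read from a SAM record.
reads = [(99, 60), (99 | FLAG_DUPLICATE, 60), (147, 0), (2048 + 16, 30)]
kept = [r for r in reads if keep_read(*r)]
```

Only the first read in the example survives the filter; the others are a duplicate, a read with mapping quality zero and a supplementary alignment.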

Mean Coverage
When deciding to what depth cancer genomes should be sequenced, a trade-off has to be made between the advantages of high coverage and the cost of sequencing. The greater the depth to which a cancer genome is sequenced, the greater the confidence in calling somatic events (see Alioto et al. [12] for a comparison of somatic mutation calling at depths up to 300X). A precondition for the inclusion of a donor in the PCAWG study was the availability of a whole genome sequence of the normal and tumour with 25X coverage or greater. We found that a number of the projects submitting these genomes had calculated coverage differently. For standardization, the mean number of reads covering each position in the genome was calculated after low quality and duplicate reads were excluded, so as not to inflate the number of reads (see Supplementary Methods for the exact methods used). As shown in Supplementary Figure S1, the normal samples were most commonly sequenced to around 30X, while there was a bimodal distribution for the tumour samples with maxima at 38X and 60X. To provide a meaningful guide to the quality of the genomes in PCAWG, we therefore set the thresholds for the mean coverage, after aligning, to 25X for normal samples and 30X for tumour samples. This resulted in 0.4% of normal and 2.2% of tumour samples not reaching these minimum criteria (Figure S1).
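As a minimal sketch of the mean coverage check, assume that per-position read depths have already been computed after excluding low quality and duplicate reads. The exact procedure is in the Supplementary Methods and is not reproduced here; the names `mean_coverage` and `passes_mean_coverage` are hypothetical, and only the 25X and 30X thresholds come from the text.

```python
# Minimum mean coverage after alignment, as stated in the text.
MIN_MEAN_COVERAGE = {"normal": 25.0, "tumour": 30.0}


def mean_coverage(depths):
    """Mean number of reads covering each position (hypothetical helper)."""
    depths = list(depths)
    if not depths:
        raise ValueError("no positions supplied")
    return sum(depths) / len(depths)


def passes_mean_coverage(depths, sample_type):
    """True if the sample reaches the minimum mean coverage for its type."""
    return mean_coverage(depths) >= MIN_MEAN_COVERAGE[sample_type]


# Toy example: a sample with a mean depth of 26X passes as a normal
# sample but would fail the stricter tumour threshold.
toy_depths = [24, 26, 28]
normal_ok = passes_mean_coverage(toy_depths, "normal")
tumour_ok = passes_mean_coverage(toy_depths, "tumour")
```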

Evenness of Coverage
To confidently identify germline variants and somatic mutations, an even coverage across the target area [13], in this case the entire genome, is ideal. For this QC measure we used two methods to test whether the genome is evenly covered. One method is to calculate the ratio of the median coverage over the mean coverage (MoM). An evenly covered sequence should have a ratio of one, with the mean value the same as the median [...] Figure S2).

The second measure of evenness looks at the variation of the normalised coverage in ten kilobase genomic windows, after correction for GC-dependent coverage bias using the somatic CNV calling algorithm ACEseq [14] (Figure 2). The main cloud, which corresponds to the main copy number state of the sample, is determined (as shown by the red dots in [...] for example large deletions could lead to a more unevenly covered sample. If the normal sample is unevenly covered, it is more likely due to a sequencing artefact. Hence, we are more stringent for the normal than the tumour samples.

The two evenness measures identify different samples as having uneven coverage (Figure 3). Spearman's correlation coefficient suggests that the two measures are not correlated for the normal (ρ = 0.24) or tumour (ρ = −0.06) samples. FWHM is insensitive to GC bias, as the CNV caller corrects for this, while MoM identifies other evenness outliers.

A sample needs to be within the respective range of the MoM and below the threshold for FWHM, for the normal and the tumour, to pass the evenness quality measure; 6.28% of normal and 5.81% of tumour samples did not pass. [...] Figure S4). [...] next step was to summarise them, to give an overall score for quality for the other researchers in PCAWG to use.
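The median-over-mean (MoM) ratio used as the first evenness measure can be sketched as follows. This is an illustration under the assumption that per-position (or per-window) depths are available as a list; the function name `median_over_mean` is hypothetical, and the pass ranges applied in PCAWG are not reproduced here.

```python
from statistics import mean, median


def median_over_mean(depths):
    """Ratio of median to mean coverage; 1.0 indicates even coverage."""
    depths = list(depths)
    average = mean(depths)
    if average == 0:
        raise ValueError("mean coverage is zero")
    return median(depths) / average


# Evenly covered toy sample: the median equals the mean.
even = median_over_mean([30, 30, 30, 30])

# Skewed toy sample: one high-depth position raises the mean (25)
# while the median stays at 10, so the ratio drops below one.
skewed = median_over_mean([10, 10, 10, 70])
```

A ratio well away from one indicates that a minority of positions carry a disproportionate share of the reads.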

Star rating system
We used the five quality measures to construct a star rating for each cancer genome [...] Figure S9), which had metadata recorded in CGHub [21] about the time and instruments used to sequence. We hypothesise that this will be true for other projects as well.

Having calculated the star rating for the sequences, we examined how our QC measures relate to the calling of somatic single nucleotide variants (SNVs) [11], somatic insertions and deletions (indels) [11] and somatic structural variants (SVs) [22] in PCAWG. An advantage of using these PCAWG datasets is that four callers were used for each. The proportion of calls supported by all four callers gives us a good idea of how the quality of sequencing influences the identification of unambiguous somatic mutations. While this proportion varies greatly by sample, we find that samples with four stars or more tended to have higher proportions than samples with fewer than four stars for SNVs, indels and SVs (with p-values of ~10^-5, ~10^-5 and ~10^-18 respectively, using the Mann-Whitney U test, also see [...]).

The results from this analysis suggest that the quality of sequencing, as measured by our star rating, does have a measurable effect on the downstream analyses. As our QC measures reflect different aspects of sequencing quality, they also have varying levels of importance when these sequences are used in the calling of SNVs, indels and SVs.

Discussion
The established star rating system allows grading the normal and tumour sample [...]

For those projects in PCAWG for which we had metadata, we found that sequencing quality improved over the period 2009-2014 in which the samples were sequenced. Our results for the CLLE-ES project suggest that a protocol change to PCR-free methods was in part responsible for this improvement, in line with best practices from a recent benchmarking exercise [12].

Another advantage of our quality control is the link to the downstream analyses. In aggregate, higher quality sequences had a higher proportion of somatic SNVs, indels and SVs identified by all the callers for each type of somatic mutation. These results suggest overall that higher quality sequence will identify true positive somatic mutations with higher probability. Our data suggest that when pre-amplification of DNA is needed for WGS, for example for DNA isolated from formalin fixed, paraffin embedded tissue, the star rating system will be helpful when the variants and mutations are interpreted.

We believe that our method can be adapted for similar projects that look to use whole genome sequences from a variety of sources. The thresholds we used, based on our experience and applied to this dataset of 2959 cancer genomes, can also be used as a guide to the quality of sequences. It is worth noting that they represent a trade-off between being severe enough to penalise poor quality and not discriminating against samples with valid biological causes. We would also recommend that other groups use our methods to ascertain sequence quality before downstream analyses. To enable others to use our approach, a Docker Container is available at https://github.com/eilslabs/PanCanQC. We provide a framework for quality assessment, which opens the door to large-scale meta-analysis on a more robust footing.

Competing financial interests
The authors declare no competing financial interests.

[...] than four stars. This is significant using the Mann-Whitney U test, with p-value ~10^-8.