Genome-wide mutational signatures in low-coverage whole genome sequencing of cell-free DNA

Mutational signatures accumulate in somatic cells as an admixture of endogenous and exogenous processes that occur during an individual’s lifetime. Since dividing cells release cell-free DNA (cfDNA) fragments into the circulation, we hypothesize that plasma cfDNA might reflect mutational signatures. Point mutations in plasma whole genome sequencing (WGS) are challenging to identify through conventional mutation calling due to low sequencing coverage and low mutant allele fractions. In this proof of concept study of plasma WGS at 0.3–1.5x coverage from 215 patients and 227 healthy individuals, we show that both pathological and physiological mutational signatures may be identified in plasma. By applying machine learning to mutation profiles, patients with stage I-IV cancer can be distinguished from healthy individuals with an Area Under the Curve of 0.96. Interrogating mutational processes in plasma may enable earlier cancer detection, and might enable the assessment of cancer risk and etiology.

high sequencing depths of WGS, dilution of mutant DNA in wild-type cfDNA would still result in many true mutation loci being observed with one mutant read at best when ctDNA fractions are low 16,20 . This is due to the long tail of low allele fraction mutations in the tumor being occasionally sampled in plasma.
Here, we show an approach called Pointy, which enables the analysis of genome-wide mutational signatures from low-coverage plasma WGS (0.3-1.5× depth). We demonstrate single-base substitution (SBS) signature profiling and sample classification using a combination of signature extraction and machine learning (Fig. 1). Germline sequencing is not performed to maximize scalability, and we implement methods to mitigate technical and biological noise. We identify mutational signatures in the plasma of individuals with and without cancer, which may be leveraged for early cancer detection.

Characterizing and normalizing Pointy data
We developed a pipeline to extract point mutations from low-coverage plasma WGS called Pointy ("Methods" and Supplementary Fig. 1). We first explored a cohort of patients with stage IV colorectal cancer (CRC, n = 16), many of whom had mismatch repair deficiency (MMR-D) and/ or microsatellite instability 21 (MSI). Healthy controls from the same cohort were used (n = 19). Each library was sequenced to a median of 31.0 × 10 6 reads, with a median duplication rate of 0.37%. Data were downsampled to a target of 0.3× (10M paired-end reads), which resulted in a median of 10.0 × 10 6 reads. Samples with fewer than 90% of the target number of reads were not evaluated (n = 2). A median of 79.3% of genomic positions had zero coverage, and 14% of bases had 1× coverage, equating to a mean coverage of 0.28× (95% confidence interval (CI) 0.26-0.29×, Supplementary Fig. 2a).
In this study, error-suppression by read collapsing of duplicates is limited by the low duplication rate of WGS (<0.5% duplication rate, Supplementary Fig. 2b). Instead, we utilized error-suppression filters based on previous work 16,20 , as follows: minimum base quality (BQ) threshold of 30, mean BQ threshold of 30, requiring mutations to be present in both read 1 (R1) and read 2 (R2), and mapping quality (MQ) threshold of 60. After applying these filters, a mean of 9886 mutations per sample was retained, prior to SNP filtering (95% CI 8782-10,990, Supplementary Fig. 2c).
The samples from the PGDX cohort were sequenced in two batches from the same sequencing instrument, so we explored data from healthy individuals for batch effects. In healthy samples, there was no significant difference in the mean number of mutations between batches (9049 vs. 10,089, p = 0.47, two-sided Wilcoxon test, Supplementary Fig. 3a). However, Principal Component Analysis (PCA) of SBS profiles revealed a difference in mutation profile ( Supplementary  Fig. 3b), which may arise from differences in GC-bias between sequencing runs ( Supplementary Fig. 3c). We identified a significant difference in the mean contribution of PC2 per sample (unadjusted p = 0.022, two-sided Wilcoxon test, Supplementary Fig. 3d). The largest contributors to PC2 were contexts at the extremes of GC content ( Supplementary Fig. 3e). Therefore, the GC bias for each sample was determined, as was the average GC profile of the sequencing batch, which was combined to normalize the SBS profile of each sample ("Methods"). This approach is analogous to GC-correction methods used to correct whole-genome copy-number 5,22 or fragmentation profiles 9 . After GC-correction, there were no significant differences in any PC between the two sequencing runs (unadjusted p > 0.05, twosided Wilcoxon test, Supplementary Fig. 3f). In Supplementary Fig. 3g, we show high cosine similarities between samples even without GCcorrection, which increased significantly following GC-correction (0.995 vs 0.999, p < 2.2 × 10 −16 , two-sided Wilcoxon test). Differences in SBS profile with and without GC-correction are shown in Supplementary Fig. 3h.
Following GC-bias normalization, cancer patient plasma samples and healthy controls showed SBS mutation profiles that had a cosine similarity of 0.999 (95% CI 0.999-0.999, Supplementary Fig. 4a), although this included SNPs. Samples from cancer patients showed significantly more point-mutated reads compared to healthy controls (median 11,786, vs. 9322, p = 0.028, two-sided Wilcoxon test, Supplementary Fig. 4b).

Detection of mutational signatures in CRC plasma
Mutational signatures were fitted to mutation profiles after background subtraction ("Methods"), i.e. for each sample, for each SBS context, the median number of mutations in controls was subtracted ("Methods"). Sequencing artifact signatures were included in the database used for signature fitting to minimize the misattribution of mutations to biologically relevant signatures.
We assessed the sensitivity and specificity of signature fitting in silico. Between 10 and 1000 mutations belonging to each SBS signature were spiked into a randomly selected control sample, repeated 100 times (Supplementary Methods). The contribution of each signature was assessed pre-and post-spike. Across all SBS signatures, the mean sensitivity of spiking was 38% for 10 mutations and 93% for 1000 mutations ( Supplementary Fig. 5a). An example control SBS profile is shown following each spike-in (Supplementary Fig. 5b). Signatures whose profiles were concentrated in few SBS contexts, such as SBS1 and SBS2, were more efficiently fitted to than flat signatures such as SBS5 ( Supplementary Fig. 5c).
To assess the performance of signature recovery in the setting of multiple signatures, we iteratively spiked in signatures and simultaneously spiked in SBS1 at a ratio of 1:1 or 10:1 ( Supplementary Fig. 5d and Supplementary Methods). At a 1:1 ratio of spike-in of both signatures, there was no impact on signature fitting. However, when 10×  Data 1). Mutational signatures were extracted from these data, enabling signature profiling and sample classification ("Methods"), tested in two independent cohorts. more SBS1 mutations were spiked in compared to the signature of interest, the rate of on-target signature fitting was reduced in multiple signatures (Benjamini-Hochberg corrected p < 0.05), especially in signatures with low cosine similarity to SBS1 (linear regression p = 1.5 × 10 −9 ). In contrast, signatures with high similarity to SBS1 gained mutations directly from SBS1 (q > 0.05). We show the extent of false positive signature fitting in the context of a singly spiked signature in Supplementary Fig. 6, where the proportion of mutations that were misattributed ranged from 1.7% with 10 mutations spiked, to 0.1% with 1000 mutations spiked.
Given the role of aging and MSI in this CRC cohort, we studied aging and MSI signatures, including SBS1, SBS5, SBS20, and SBS21. Both aging and MSI signatures had significantly higher contributions in the plasma of patients with CRC ( Fig. 2a, b), and remained significant when iteratively downsampled to 10M reads 50 times ( Supplementary  Fig. 7). We tested whether these plasma signature contributions correlated with both ctDNA fraction and tumor mutation burden (TMB). ctDNA fraction was determined by ichorCNA 5 and tumor mutation burden was determined by targeted panel sequencing of plasma 21 . Multiple aging and MSI-associated signatures showed a significant correlation with ctDNA fraction (Fig. 2c) and TMB (adjusted p ≤ 0.05, Fig. 2d).
To detect individual signatures per sample, signature detection was performed ("Methods"). The healthy samples were used as a panel of normals, with a threshold of 95% specificity for detection of each signature. Aging signatures were detected in 10 out of 16 (62.5%) patients, and MSI signatures in 11 out of 16 (69.0%, Fig. 2e). Patients with MSI-H tumors had significantly greater SBS20 and SBS21 contributions than controls, whereas patients with MSS tumors were nonsignificantly different ( Supplementary Fig. 8).
Signatures identified in CRC patient samples were compared against signatures fitted to targeted sequencing mutation calls on the same samples 21 (Supplementary Methods). Both approaches identified aging and MSI signatures, with 77.6% agreement across all signatures ( Supplementary Fig. 9). Targeted sequencing identified SBS15 (Supplementary Fig. 9a), which was not detected with 95% specificity in Pointy data. We suggest that SBS15 mutations may have been misattributed to SBS1 given their high cosine similarity ( Supplementary  Fig. 9b), combined with the relatively low sensitivity of Pointy for SBS15 from spike-in benchmarking ( Supplementary Fig. 5a). When the cluster of similar signatures identified in Supplementary Fig. 9b (namely, SBS1 and SBS6) were excluded from signature fitting, SBS15 could be observed ( Supplementary Fig. 9c-e). Germline subtraction and mutation calling would likely improve the resolution of signature profiling, although this would conventionally require 1-2 orders of magnitude greater sequencing 17,20 .
SNP subtraction and signature fitting. Signature fitting was repeated on the same Pointy data with SNP subtraction. Following SNP subtraction, SBS1´(SBS1´= SBS1 with SNP subtraction) and SBS5´were assigned a median of zero mutations each (Supplementary Fig. 10a and Supplementary Data 3). We hypothesized that the SNP database contained aging mutations, which had been subtracted from Pointy data.
To assess the bias introduced by SNP-subtraction, the SBS profile of the aggregated SNPs from the 1000 Genomes database 23 was generated ( Supplementary Fig. 10b). We confirmed that the majority of SNPs fitted SBS1 (12.1%) and SBS5 (63.2%, Supplementary Fig. 10c). Despite the mutation profile bias introduced by SNP subtraction, removal of mutated reads at SNP sites reduced the cosine similarity between the SBS profiles of cases and controls to a mean of 0.982 using bootstrapping with 50 iterations (95% CI 0.982-0.983, Supplementary Fig. 10d), compared to 0.999 when SNPs were retained (95% CI 0.999-0.999, Supplementary Fig. 4a). Together, these data suggest that SNP-subtracted data may be more suited to cancer classification, whereas SNP-retained data may provide a less biased profile of fitted signatures.

Colorectal cancer detection
We next sought to classify samples into cancer vs. healthy based on their SBS mutation profile. To maximize the signal-to-noise ratio, SNPs were subtracted. Then, SBS´(SNP-subtracted) mutation profiles underwent dimensionality reduction using Principal Component Analysis (PCA), and the principal components of SBS profiles (analogous to mutational signatures) were used for machine learning classification (Methods).
PCA showed separation of cases and controls based on two Principal Components, particularly in PC2 (Fig. 3a). As SNP-subtracted data were used, few mutations were fitted to aging signatures due to the bias introduced by SNP-subtraction ( Supplementary Fig. 10a). PC1 and PC2 from SNP-subtracted data were both significantly correlated with ctDNA fraction (p < 0.0076, Fig. 3b). The signature contributions to PC1 and PC2 were assessed, which showed SBS8´was the greatest contributor to PC2 (Fig. 3c). The SBS1 signature contribution was significantly correlated with SBS8´(Pearson r = 0.63, p = 5.6 × 10 −5 , Fig. 3d), suggesting aging mutations were being fitted to SBS8´when SNPs were subtracted.
To classify samples as either cancer or healthy, we used SBS mutation profiles as input to a machine learning model. Four methods were tested, including xgboost 24 , random forest (RF), support vector machine (SVM), and logistic regression. For each condition, nested ten-fold cross-validation was performed, repeated 10 times ("Methods"). First, we assessed whether raw SNP-subtracted 96-SBS profiles could be used as input to the xgboost model, or if dimensionality should be reduced. Raw mutation input resulted in an AUC of 0.82 (95% CI 0.71-0.90, Supplementary Fig. 11a), whereas PCA-transformed input gave an AUC of 0.92 (95% CI 0.88-0.97, Supplementary Fig. 11b). Across all models, with SNPs subtracted, a median AUC of 0.96 was reached (range 0.92-0.98, Supplementary Fig. 11b-e), with the RF model performing best (AUC 0.98, 95% CI 0.90-1.00). Adding ichor ctDNA fraction to the model improved the AUC of the RF model to 1.00 (95% CI 1.00-1.00, Supplementary Fig. 11f), which was selected for subsequent analyses. We confirmed this result with ten-fold crossvalidation, repeated 500 times (AUC 0.99, 95% CI 0.95-1.00, Fig. 3e).
To confirm the enhanced signal-to-noise ratio following removal of SNPs from Pointy data, classification was performed using RF with SNPs retained, which showed an AUC of 0.65 (95% CI 0.53-0.76, Supplementary Fig. 11h). We also quantified the effect of error-suppression, i.e., requiring mutations to be supported in both F and R mate pairs vs. being supported in either F or R only. This showed a significant increase in AUC associated with error suppression (0.93 without vs. 0.98 with error-suppression, p = 0.004, Wilcoxon test, Supplementary  Fig. 11d, i). Therefore, for subsequent analyses for cancer detection, we processed data by using (a) SNP-subtraction, (b) PC-transformation, (c) error-suppression, and (d) detection using an RF model ("Methods").
Signature detection in plasma across multiple cancer types. We next applied Pointy to the Cristiano et al. 9 plasma WGS dataset to test  Fig. 12a), which showed batch differences in SBS profile in healthy individuals despite correction for GC-bias ( Supplementary Fig. 12b). Therefore, cases were compared against controls from the same sequencer. Data were processed similarly to the previous cohort, with downsampling to 10M reads.    Signature fitting was performed with SNPs retained, as before, and all signature contributions are shown in Supplementary Data 4. Signatures previously identified in tumors by Alexandrov et al. 12 for each cancer type were used for signature detection with 95% specificity. Across the cohort, the proportion of patients with ≥1 signature detected ranged from 0.85 in NSCLC to 0.38 in pancreatic cancer (bootstrapped median, 100 iterations, Fig. 4a). By stage, the rate of detection of ≥1 signature ranged from 0.70 in stage I disease to 0.75 in stage IV disease (bootstrapped median, 100 iterations, Fig. 4b). Across all cancer types tested, none showed a significant correlation between ichorCNA ctDNA fraction and the total number of plasma mutations following correction for multiple testing (Supplementary Fig. 12c). However, this correlation is limited by the ability to quantify ctDNA fractions in this cohort: only a median of 11.4% of samples was detected using ichorCNA with a 95% specificity threshold.
In patients with stage I-IV colorectal cancer, plasma signatures were detected in 21 out of 27 patients (78%, Fig. 4c). Aging signatures were detected in 12 (44%), APOBEC signatures in 14 (51%), and MSI signatures in 11 out of 27 patients (41%). In stage I-IV NSCLC, signatures known to be associated with lung cancer 12 and tobacco exposure 18 were assessed. APOBEC signatures were the most prevalent signature detected (in 24 out of 37 samples, 65%), followed by aging (21 out of 37, 57%, Fig. 4d). SBS4 was observed in 3 out of 37 samples (8%), all of which were in stage IV disease. In patients with breast cancer, plasma signatures were detected in 67% (Fig. 4e); aging signatures were the most frequent (27 out of 48, 56%), followed by APOBEC signatures (16 out of 48, 33%). In patients with gastric cancer, APOBEC signatures were the most prevalent (11 out of 27, 41%), and evidence of MSI signatures was found in 22% (Fig. 4f). Two patients with stage 0 disease were included from the DELFI cohort, identifying an APOBEC signature in plasma in one case. In patients with stage I-IV ovarian cancer, aging signatures were the most frequent (9 out of 36, 35%, Fig. 4g). In patients with stage I-III pancreatic cancer, APOBEC signatures were the most frequently detected (11 out of 35, 31%), although overall detection rates were low (Fig. 4h).
The ratio between short (<150 bp) to long mutant fragments (>150 bp) was assessed as a quality control metric (Supplementary Methods). Patients with CRC, NSCLC, and breast cancer showed significantly shorter fragments than healthy controls (q < 0.005, two-sided Wilcoxon test, Supplementary Fig. 12d). These findings of short ctDNA fragments are consistent with the previous literature 8, 25 . cfDNA from patients with pancreatic and gastric cancer were non-significantly longer in fragment size compared to healthy individuals (q = 0.53). There was a significant correlation between ichorCNA tumor fraction and short:long fragment ratio (Pearson r = 0.41, p = 6.9 × 10 −8 ).
Given the high prevalence of SBS2 mutations detected with 95% specificity in ref. 9 sequencing data, we tested whether sequencing noise might contribute to SBS2 mutations. To quantify noise, we utilized the discordant mutations in the overlapping region of paired-end sequencing reads in each sample ( Supplementary Fig. 12e). Discordance in mutations between overlapping R1 and R2 reads likely arise from sequencing noise 20 , whereas true mutations would be present in both R1 and R2. The number of discordant mutations per sample was constant across each of the SBS contexts of SBS2 in patients with NSCLC compared to healthy individuals (q > 0.05, Supplementary Fig. 12f), suggesting that SBS2 calls are unlikely to arise from sequencing noise.

Aging signatures in healthy individuals
Given the predominance of aging/clock-like signatures in Pointy data, we explored their relationship with chronological age in healthy individuals. Individuals with cancer were not used for this analysis to exclude tumor cells as a source of aging mutations. We expected the magnitude of any relationship to be small based on previous estimates of aging mutation rates 18 , combined with recent evidence for aging signatures varying between tissues 26 . One hundred and fifty-nine healthy individuals' plasma data arising from the same sequencer from ref. 9 study were used. Data were downsampled to 50M reads (1.5×) WGS, GC-normalized, and signatures fitted with SNPs retained. The age range of healthy individuals in this cohort was 49-75 years old, with a median age of 54 (Supplementary Data 5). Clock-like mutational signatures 27 (SBS1 and SBS5) were compared against chronological age using SNPretained data. Both SBS1 and SBS5 showed a significant correlation with biological age (p < 0.022, one-sided Pearson correlation, Fig. 5).
To explore aging signals in other SBS signatures, signatures that were significantly correlated with SBS1 were identified as putative aging-correlated signatures (Supplementary Fig. 13a). These additional SBS1-correlated signatures were then compared against the chronological age of each healthy individual. Following correction for multiple testing, 13 further signatures showed a significant correlation with chronological age (q < 0.05, one-sided Pearson correlation, Supplementary Fig. 13b). With SNPsubtracted data, no mutations fitted to SBS1´in this case due to bias introduced by SNP-subtraction ( Supplementary Fig. 14a). Nonetheless, SBS2´, SBS30´, SBS33´, and SBS46´mutation counts were significantly correlated with age (q < 0.02, Supplementary  Fig. 14b). These data suggest that aging mutations may be detected in the plasma of healthy individuals, both in SBS1 and SBS1correlated signatures, though the latter may be due to misattribution of aging mutations to other signatures due to signature similarity or biased fitting due to SNP removal.
Based on differences observed in PC1 and PC2 between samples using PCA ( Supplementary Fig. 15l), we assessed whether samples could be classified into individual cancer types. We selected patient samples sequenced on the same sequencer from the DELFI study (n = 70, Supplementary Methods). Healthy samples were excluded. Classification of individual cancer types achieved a median balanced accuracy of 0.80 (95% CI 0.52-0.89).
Lastly, we assessed the generalizability of this approach across cohorts, as patients with CRC were common to both cohorts (Supplementary Methods). We identified evidence of batch effect affecting SNP-subtracted mutation profiles of healthy controls between the two studies ( Supplementary Fig. 16a), despite using quality filters and GCbias correction. This may be due to differences in sample collection location, as cases were collected across multiple academic sites, with controls in the former study sourced from a commercial site 9,21 . To mitigate this, we pooled healthy and CRC patient samples across the  two studies in equal numbers to allow training across batches. Ten-fold nested CV was performed using RF, resulting in an AUC of 0.93 (95% CI 0.87-0.96, Supplementary Fig. 16b).

Discussion
In this study, we identified mutational signatures in low-coverage plasma WGS in two independent datasets. Both exogenous and endogenous mutational processes were identified in plasma, including aging, smoking, APOBEC, and MSI signatures. We demonstrated sensitive cancer detection using these plasma signatures. Additionally, in healthy individuals, an age-correlated mutational signature was identified in plasma, consistent with previous findings in human tissues 11,13,18,28 . This study has notable limitations. We carried out this analysis with low-coverage WGS without matched germline samples, which improves the scalability of the approach but limits sensitivity for signature detection. This is particularly relevant for signature fitting, where the resolution for low-abundance and similar signatures was hampered by noise. Performing germline sequencing with low coverage sequencing alone would be of limited benefit: if 0.3× WGS (10M reads) were used to sequence a matched germline, only 13% of SNPs would be identified in the germline with 1 mutant read each. Germline calling conventionally requires >15× coverage 17,20 , representing a 1-2 order of magnitude increase in sequencing, and would also need to be performed for healthy individuals to enable comparison. Future studies using deep sequencing in plasma and matched germline samples are needed to fully characterize circulating signatures.
To mitigate noise, we leveraged machine learning for the classification of samples and used quality filters ("Methods"). Differences in noise profile observed between control samples arising from independent studies highlight the importance of validation of this approach on a larger scale across multiple cohorts. In the future, errorsuppression using read collapsing may be employed, for example, by combining Pointy with laboratory methods to increase duplication rates 29 . Even with error-suppression, plasma signature profiles may still appear different from tumor signature profiles due to the effects of sampling error in plasma: low-frequency tumor mutational signatures may be missed, similar to the lower representation of heterogeneous tumor mutations in plasma 30 .
Although subject to technical and biological noise, this proof-ofconcept study of low-coverage plasma WGS provides an insight into circulating signatures in cfDNA and their potential utility in oncology. Deeper sequencing in future would enable the exploration of plasma signatures with greater detail. These signatures, whose exposures may be operative both before and during cancer development 10,13 , might be used for earlier cancer detection. Sensitivity may be boosted further through utilizing ensemble machine learning approaches that combine mutational signatures with other parameters, such as cfDNA fragmentation patterns, fragment ends 31,32 , and/or preferred end coordinates 33 . Lastly, deeper exploration of circulating mutational signatures in healthy individuals might enable the evaluation of cancer risk and the etiology of cancers.

Patient and sample characteristics
In this study, cfDNA WGS data were analyzed from a total of 215 patients and 227 healthy control individuals across two cohorts (Supplementary Data 1). For the initial cohort (PGDX), 16 patients with stage IV CRC provided plasma samples following written informed consent for research use as part of clinical trial NCT01876511. This protocol was approved by the Johns Hopkins Institutional Review Board 21,34 . Plasma samples from 21 healthy control individuals were procured through BioIVT 21 .
We next studied 199 patients and 206 healthy control individuals from the DELFI 9 dataset following approval from their Data Access Committee (DAC). Samples in this cohort were obtained under Institutional Review Board approved protocols, with informed consent from all participants for research use at participating institutions 9 . Patients with the following cancer types were included from the DELFI study: CRC (n = 27), gastric (n = 27), NSCLC (n = 37), ovarian (n = 26), breast (n = 48), and pancreatic (n = 34). Only pre-treatment timepoints from the DELFI study were used. For this proof-of-concept study, no blinding or randomization was performed.

Plasma sample preparation and sequencing
Plasma whole-genome library preparation was performed as described in refs. 21 and 9. Briefly, for both cohorts, cell-free DNA (cfDNA) was extracted from plasma using the QIAamp Circulating Nucleic Acid Kit. Libraries were prepared with 5-250 ng of cfDNA using the NEBNext DNA Library Prep Kit. Whole-genome libraries were sequenced using 100 bp paired-end runs on HiSeq 2000/ 2500 sequencers (Illumina).

Whole-genome sequencing data processing
An overview of the pipeline used is shown in Supplementary  Fig. 1. Raw FASTQ files were trimmed using trimmomatic (version 0.39) 35 in paired-end mode, with the following settings: all reads were cropped to 100 bp (CROP: 100), Illumina sequencing adaptors were removed (ILLUMINACLIP: 2:30:10:2:keepBothReads), leading and trailing 3 bp were trimmed if they were low quality (LEADING: 3, TRAILING: 3), and reads with an average base quality <30 were removed (AVGQUAL: 30). For public datasets, where BAM files were provided, we converted each BAM file to FASTQ using Bedtools (version 2.28.0) bamtofastq prior to running trimmomatic. Trimmed FASTQ files were aligned to the hg38 genome using BWA (version 0.7.15) mem, sorted and indexed with samtools (version 1.7), and duplicates marked and removed with Picard (version 2.19.0) MarkDuplicates. Indel realignment was performed with GATK (version 3.8). Each BAM was downsampled using Picard (version 2.19.0) DownsampleSam to either 10M reads (PGDX cohort -signature profiling and classification; DELFI cohort -signature profiling), 25 M reads (DELFI cohort -classification), or 50M reads (DELFI cohort -signatures in healthy individuals, Supplementary Data 1). BAM files with <90% of the target number of reads for downsampling were excluded. To maximize the quality of the mapped reads, downsampled BAMs were intersected with UCSC tracks WindowMasker 36 and RepeatMasker to remove repeats, then were intersected to retain only regions in the GATK WGS calling regions BED from the GATK hg38 resource bundle. Reads with secondary mapping positions were removed with grep. Reads with a fragment length of zero were removed with awk, as were reads with any supplementary alignments.
Each BAM file was converted to SAM using samtools (version 1.7) and then filtered using awk to retain mutant reads containing a single point mutation only. Samtools mpileup (version 1.7) was used to identify point mutations, considering only reads with a mapping quality of 60 (-q) and considering mutations only if they had a minimum base quality of 30 (-Q). Indels were removed from the mutation VCF using grep. ANNOVAR (version 2018-04-16) was used to annotate variants using RefSeq 37 and dbSNP 151 38 . Mutations were annotated as being either concordant, i.e., supported by both R1 and R2 of the same mate pair, or discordant. Annotated and filtered VCFs were read into R (version 4.1.2) and mutations were annotated with single base substitution contexts using the MutationalPatterns package (version 1.10.0) 39 and processed with dplyr (version 1.0.8) and plyr (version 1.8.7).
For all samples, the sequencer ID was obtained from the read header in the FASTQ file using a custom shell script. To minimize sequencer-specific batch effects on signature profile analysis and sample classification, all downstream analyses were batched by sequencer, with patient samples being controlled by healthy individuals on the same sequencer. Two sequencer IDs were excluded due to few samples or only healthy samples being present ( Supplementary  Fig. 12a).

GC normalization
A GC-normalization step was performed to correct for differences in GC profile between samples. First, a GC profile was first determined for each downsampled BAM file. GC bias metrics were generated using Picard (version 2.19.0) CollectGcBiasMetrics with a WINDOW_SIZE of 300 bp based on previous literature on GC bias in cfDNA 40 . Next, GC profiles for all samples belonging to the same sequencer ID were loaded into R, and a generalized additive model (GAM) was used to generate an averaged profile for the batch, using ggplot2 (version 3.3.5) geom_smooth() using method = "gam" and formula = "y~s(x, bs = "cs")." Supplementary Fig. 3c shows GC profiles for samples sequenced on two runs from the same sequencer, plus the GAM averaged GC profile.
The averaged GC profile was used to normalize the mutation counts of all samples, based on the GC content of each mutated read, as follows: a custom R script was used to annotate all mutations in each sample with their associated GC sequence content, rounded to the nearest 1%. The number of mutations in each GC content % bin was normalized relative to the averaged GC profile of that batch, aiming to mitigate differences in GC bias.

Mutational signature profiling and detection in patient samples
For analysis of mutational signatures in patient plasma samples in both cohorts, a 96-SBS mutation profile was generated for each sample, as described above. For each of the 96 SBS contexts in all samples (cases and controls), the median number of background mutations in that SBS context in control samples was subtracted. Background subtraction was performed relative to control samples sequenced on the same sequencer.
Mutational signatures were fitted to 96-SBS mutation profiles using the fit_to_signature() function from MutationalPatterns (version 1.10.0) 39 ; SBS signatures published by Alexandrov et al. 12 were used as the reference signature matrix. This function finds "the optimal nonnegative linear combination of mutation signatures to reconstruct the mutation matrix" 39 . Mutations that had been annotated as SNPs were retained for analysis of signature profiles, as we showed that removal of SNPs on these data can distort signature fitting processes due to high contributions of aging mutations among SNPs ( Supplementary  Fig. 10). After signature fitting, a matrix of signature contributions for each sample was generated.
To determine whether the signature contribution in an individual sample was significantly above background, we set 95% specificity thresholds for signature detection/calling based on values in control samples. Bootstrapping with 100 iterations in controls was used to generate each 95% specificity threshold.

Mutational signature profiling in healthy individuals
For signature profiling in healthy individuals from the DELFI study, all healthy individuals sequenced on the sequencer named HISEQ were analyzed (n = 159). Signature fitting was performed as above, except background subtraction was not performed (as all samples were controls). Signature contributions were correlated against healthy individuals' chronological age from DELFI metadata and visualized using ggplot2 (version 3.3.5) and ggpubr (version 0.4.0).

Sample classification
For sample classification, SNPs were subtracted to maximize signal:noise. 96-SBS mutation matrices were used as input. For all samples, PCA was used to reduce dimensionality using the stats package in R, and Principal Components with <1% variability were removed as a feature selection step. For each sample, a matrix of PCs, annotated with ichorCNA ctDNA fraction, was used as input for the classification model. Samples were classified using controls from the same study and from the same sequencer. For sample classification to either healthy or cancer, we tested multiple classification methods using a nested ten-fold cross-validation method 41 , repeated 10 times, using: xgboost, Random Forest (RF), Support Vector Machine (SVM) and Logistic Regression. Nested k-fold cross-validation develops a new model on each training set, with testing on the held-out fold, and has been suggested to be robust to limited sample size 41,42 . CreateFolds() from the caret package (version 6.0-90) was used to generate balanced folds for each round of cross-validation.
xgboost (version 1.5.2.1) was used in R with the default parameters and nrounds = 100. randomForest (version 4. [6][7][8][9][10][11][12][13][14] was used in R with the default parameters and ntree = 500. For SVM, svm() from the e1071 package (version 1.7.9) was used with default settings. For logistic regression, glm() from the stats package (version 4.1.2) was used with default settings. Following each iteration of cross-validation, a Pointy score for each sample was generated, ranging from 0 to 1 (higher represents more likely to be cancer). Classification performance characteristics were determined using the ci.cvAUC function from the cvAUC package (version 1.1.0) in R, using Pointy scores from all iterations as input. Random Forest showed the highest performance (Supplementary Fig. 11) and was selected for use for sample classification with nested 10-fold cross-validation with 500 iterations. Median Pointy scores are shown in Supplementary Data 1. Pointy scores from all iterations from all samples from each study were used as input into ci.cvAUC() to generate AUC values for each iteration by cancer type and stage ( Fig. 3e and Supplementary Fig. 15).

Classification of cancer type
For classification of samples to individual cancer types, cancer samples from the DELFI cohort one sequencer was used (HWI-D00837). Plasma WGS data were downsampled to 25M reads. For each sample, PCs were extracted from the 96-SBS mutation matrix belonging to each sample (as before), and these were used as input into a random forest classifier. Samples were classified to any of the cancer types present in the dataset using nested ten-fold cross-validation, repeated 10 times. This classifier generates a probability of matching the sample to each class (i.e., cancer type), and the highest scoring class was chosen as the predicted class. In the event of ties between classes, these were resolved using ties.method = "last". The classification performance was assessed using a confusion matrix from the caret package (version 6.0-90) in R.

ctDNA fraction quantification using ichorCNA
For all plasma and tumor samples, the ctDNA fraction (termed as the tumor.fraction) was calculated using ichorCNA (version 0.2.0) 5 , using a window size of 1 mb (--window), minimum quality of 20 (--quality), across all autosomes and sex chromosomes (--chromosome), with a maximum copy number of 3 (--maxCN). A panel of normals was not used; instead, ichorCNA was run across all healthy control samples within each batch. Detection thresholds for ichorCNA were determined in the DELFI cohort using a 95% specificity threshold of ctDNA fractions in healthy individuals in that cohort.

Fragmentation analysis
To analyze the fragment size of Pointy mutations, insert sizes were obtained from the SAM file belonging to each sample. Each raw mutation matrix containing concordant mutations (i.e., present in both F and R mate pairs) was annotated with the insert sizes from the SAM file using a custom R script. Fragments with an insert size >1000 bp were excluded. A short:long fragment size ratio was calculated for each sample using a threshold of 150 bp for short fragments.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Plasma whole-genome sequencing data have been deposited at the European Genome-phenome Archive (EGA), which is hosted by the EBI and the CRG, under accession number EGAS00001006377. Sequence data from the Cristiano et al. 9 study were previously deposited at the EGA, under accession number EGAD00001005339. Further information about EGA can be found on https://ega-archive.org "The European Genome-phenome Archive of human data consented for biomedical research" 43 . The sequencing data are available under restricted access to comply with patient consent for data sharing, access can be obtained by approval via their respective Data Access Committees via the EGA. The following public databases were used to annotate mutations: 1000 Genomes 23 , RefSeq 37 , and dbSNP 151 38 . Source data are provided with this paper.

Code availability
Code used in the Pointy pipeline is available for academic research purposes only 44 at https://doi.org/10.5281/zenodo.6666951. Code is in a restricted-access repository; users are required to agree to the license terms and conditions prior to approval. We aim to respond to data access requests within 5 working days.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/ licenses/by/4.0/.