Accuracy of mutational signature software on correlated signatures

Mutational signatures are characteristic patterns of mutations generated by exogenous mutagens or by endogenous mutational processes. Mutational signatures are important for research into DNA damage and repair, aging, cancer biology, genetic toxicology, and epidemiology. Unsupervised learning can infer mutational signatures from the somatic mutations in large numbers of tumors, and separating correlated signatures is a notable challenge for this task. To investigate which methods can best meet this challenge, we assessed 18 computational methods for inferring mutational signatures on 20 synthetic data sets that incorporated varying degrees of correlated activity of two common mutational signatures. Performance varied widely, and four methods noticeably outperformed the others: hdp (based on hierarchical Dirichlet processes), SigProExtractor (based on multiple non-negative matrix factorizations over resampled data), TCSM (based on an approach used in document topic analysis), and mutSpec.NMF (also based on non-negative matrix factorization). The results underscored the complexities of mutational signature extraction, including the importance and difficulty of determining the correct number of signatures and the influence of hyperparameters. Our findings indicate directions for improvement of the software and show a need for care when interpreting results from any of these methods, including assessing the sensitivity of the results to input parameters.


Results
Software tested. We considered 26 methods for signature extraction and found 18 of them suitable for testing 2,26-28,32-43 (Supplementary Tables S1, S2). We excluded methods that did not use the most common classification of mutations, shown in Fig. 1; Supplementary Table S2 details additional reasons for exclusion. When running the software, we specified arguments and hyperparameters as suggested in the relevant publications and documentation, and otherwise used default values. Supplementary Table S1 details the parameters selected and the rationales for selecting them. Because most methods rely on random sampling, results often vary from run to run on the same input data. Therefore, excluding 2 methods with hard-coded, fixed random seeds, we ran each method 20 times on each data set, each time with a different, specified random seed.
Synthetic data. We generated 20 sets of synthetic data, each consisting of 500 synthetic mutational spectra.
The data sets had a range of values for two parameters:
• SBS1:SBS5 Ratio, defined as the mean over the 500 spectra of (SBS1 mutation count) / (SBS5 mutation count). We generated data sets with SBS1:SBS5 Ratios of 0.1, 0.5, 1, 2, and 10.
• SBS1-SBS5 Correlation, defined as the Pearson R² of the correlation between log₁₀ of the number of mutations ascribed to SBS1 and log₁₀ of the number of mutations ascribed to SBS5. We generated data sets with SBS1-SBS5 Correlations of 0.1, 0.2, 0.3, and 0.6.
There was one data set for each of the 20 possible combinations of values for the SBS1:SBS5 Ratio and the SBS1-SBS5 Correlation. The synthetic data sets are at https://doi.org/10.5281/zenodo.5510836.
Evaluation measures. We assessed each method according to 4 measures:
• Cosine similarity to SBS1, the mean of the cosine similarities between SBS1 and each of the extracted signatures that are more similar to SBS1 than to SBS5, if any exist. Otherwise, if all signatures are more similar to SBS5 than to SBS1, the cosine similarity between SBS1 and the extracted signature most similar to SBS1.
• Cosine similarity to SBS5, analogous to cosine similarity to SBS1.
• Positive predictive value (PPV), the number of true positives (TP) divided by the number of extracted signatures. TP is defined as follows: let x₁ be the number of extracted signatures with cosine similarity to SBS1 > 0.9, and let x₅ be defined analogously for SBS5. Let c₁ be 1 if x₁ > 0, or 0 otherwise, and let c₅ be 1 if x₅ > 0, or 0 otherwise. Then TP = c₁ + c₅.
• True positive rate (TPR), TP divided by the number of ground-truth signatures, which is always 2.
To summarize the assessment of each method, we used a "Composite Measure", defined as the sum of the 4 individual measures.
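The evaluation measures can be sketched in a few lines of Python. This is an illustrative re-implementation based on the definitions above, not the code used in the study; the 0.9 similarity threshold and the assignment of each extracted signature to its closer ground-truth signature follow those definitions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two mutational-signature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate(extracted, sbs1, sbs5, threshold=0.9):
    """Compute the four per-method measures and their sum (the Composite
    Measure) for a list of extracted signature vectors."""
    sims1 = [cosine(s, sbs1) for s in extracted]
    sims5 = [cosine(s, sbs5) for s in extracted]
    closer1 = [s1 for s1, s5 in zip(sims1, sims5) if s1 > s5]
    closer5 = [s5 for s1, s5 in zip(sims1, sims5) if s5 >= s1]
    # Mean similarity over the signatures closer to each ground truth;
    # if none are closer, fall back to the single best match.
    sim1 = float(np.mean(closer1)) if closer1 else max(sims1)
    sim5 = float(np.mean(closer5)) if closer5 else max(sims5)
    # One true positive per ground-truth signature matched above threshold.
    tp = int(any(s > threshold for s in sims1)) + int(any(s > threshold for s in sims5))
    ppv = tp / len(extracted)  # positive predictive value
    tpr = tp / 2               # there are always 2 ground-truth signatures
    return {"sim1": sim1, "sim5": sim5, "ppv": ppv, "tpr": tpr,
            "composite": sim1 + sim5 + ppv + tpr}
```

A perfect extraction of both signatures yields a Composite Measure of 4, and every extra signature beyond the two ground-truth ones lowers the PPV term.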
Signature extraction when the number of signatures to extract was unspecified. We first evaluated signature extraction on each of the synthetic data sets without specifying the number of signatures to extract, which is the usual case in practice. As with much of unsupervised learning, determining the number of items to learn, in this case the number of signatures, is a central challenge. Ten methods provide functionality to select the number of signatures to extract, and 5 methods specify algorithms for selecting the number of signatures, which we implemented (Supplementary Table S1). For three methods, mutSignatures, signature.tools.lib, and SomaticSignatures.NMF, there is no implementation and no specified algorithm for choosing the number of extracted signatures, and we tested these only in a later part of this study (Supplementary Table S1). Two topic-model based methods (hdp and TCSM) and two NMF-based methods (SigProExtractor and mutSpec.NMF) stood out as best able to extract the ground-truth signatures when the number of signatures to extract was not specified (Figs. 2, 3, Supplementary Tables S3, S4, full results at https://doi.org/10.5281/zenodo.5512002) 26,31,32,38. These 4 methods usually extracted 2 signatures that were almost identical to SBS1 and SBS5, except at the most extreme SBS1:SBS5 Ratios (0.1 and 10) and the highest correlation (Supplementary Figs. S1-S4, Supplementary Tables S3, S4). They, as well as many other methods, usually extracted SBS1 more accurately than SBS5. This is consistent with our previous experience that sparse signatures such as SBS1, which consists almost entirely of only 4 mutation types, are more easily extracted than relatively flat signatures such as SBS5 (Fig. 1a,b) 2. We also noted that the Composite Measures for SigProExtractor and TCSM dropped at the most extreme SBS1:SBS5 Ratios. This was because, at these ratios, they did not extract SBS5, but rather a merge of SBS1 and SBS5 (denoted "SBS1 + 5", Fig. 4a).
EMu usually extracted 2 signatures, SBS1 and SBS1 + 5, but never extracted SBS5 regardless of the SBS1:SBS5 Ratio and correlation, and consequently had low TPRs (Supplementary Table S4). The remaining methods usually extracted ≥ 3 signatures, as many as 4.54 signatures on average for sigfit.EMu, and consequently had low PPVs (Supplementary Tables S3, S4). Two methods, sigminer and SignatureAnalyzer, often extracted SBS1 and one or two split versions of SBS5, in which the peaks for some mutation types were put in a separate, extremely sparse signature (Fig. 4b). They thus had low PPVs (Supplementary Table S4). MultiModalMuSig.LDA and MultiModalMuSig.MMCTM had highly variable results in each data set (Supplementary Figs. S13, S14). They also extracted signatures that included nearly identical duplicates of SBS1, SBS5, and SBS1 + 5, as well as signatures that did not closely resemble either SBS1 or SBS5 (Fig. 5). Two methods, sigfit.NMF and sigfit.EMu, extracted multiple, nearly indistinguishable versions of SBS1 and SBS5 (Fig. 6).
Signature extraction when the number of signatures to extract was specified as 2. When the number of ground-truth signatures was unspecified, 4 of the methods (hdp, SigProExtractor, TCSM, and mutSpec.NMF) had substantially higher Composite Measures than other methods, but it was not clear whether this was solely because of better estimation of the number of signatures or whether other factors contributed. Therefore, we evaluated signature extraction on the same 20 data sets, but this time specifying or suggesting 2 signatures (Fig. 7, Supplementary Fig. S15, full results at https://doi.org/10.5281/zenodo.5512018). The performance of sigminer and SignatureAnalyzer improved markedly, as they no longer split SBS5 into 2 signatures, as they had done previously (Fig. 4b, Supplementary Tables S5-S7). The performance of MultiModalMuSig.LDA also improved, because it less often extracted multiple versions of SBS1, which led to better PPV, although results were still variable from run to run on the same data set (Supplementary Fig. S13). The performance of hdp, which does not allow exact specification of the number of signatures to extract, declined slightly (Supplementary Fig. S8), because it less often accurately extracted SBS5 (Supplementary Table S5, cell AC9). The results of the other methods with the best performance when K, the number of signatures to extract, was unspecified (SigProExtractor, TCSM, mutSpec.NMF) changed very little when K was specified as 2 (Supplementary Figs. S5, S6, S7, Supplementary Table S5). The remaining methods still did not extract SBS5 from data sets with SBS1:SBS5 Ratios ≥ 2; instead, as when K was not specified, they extracted the SBS1 + 5 merge (Supplementary Tables S5-S7).
Variable results from 5 methods that use the same NMF implementation. A notable result from the analyses above was that five of the methods, mutSpec.NMF, MutationalPatterns, signature.tools.lib, maftools, and SomaticSignatures.NMF, performed differently even though they use the same implementation of the Brunet NMF algorithm 44. We therefore examined how the five methods used this NMF implementation. Among these 5 methods, signature.tools.lib is unusual in that it calls the nmf function multiple times on resampled data, as described in Table 1. For the input data in the current study, this strategy did not improve the Composite Measure compared to the single calls to the nmf function used by mutSpec.NMF and MutationalPatterns (Fig. 7, Supplementary Fig. S15). Among the 4 methods other than signature.tools.lib, in the simpler case, when K was specified as 2, there are 3 differences in how nmf is used (Table 1). First, the nrun argument to nmf is 200 in mutSpec.NMF and MutationalPatterns but 1 in maftools and SomaticSignatures.NMF. This argument specifies the number of matrix factorizations to be carried out, each starting at a different random initial state. The final return value is the factorization that generated the product with the lowest Kullback-Leibler divergence from the input matrix. Second, maftools and MutationalPatterns hard-code a random seed of 123,456. Third, for reasons we could not find explained, MutationalPatterns hard-codes the addition of a "pseudocount" of 10⁻⁴ to each cell of the input matrix. The 4 methods generate identical results when nmf is called in the same way (Supplementary Table S8).
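To make the role of the nrun argument concrete, the following sketch imitates the behavior described above: several factorizations from different random starts, keeping the one whose product has the lowest KL divergence from the input. This is a toy multiplicative-update NMF in the style of the Brunet algorithm written in Python, not the R NMF package itself, and the pseudocount parameter is included only to mirror the constant that MutationalPatterns adds.

```python
import numpy as np

def kl_div(V, WH):
    """Generalized Kullback-Leibler divergence D(V || WH)."""
    mask = V > 0
    return float((V[mask] * np.log(V[mask] / WH[mask])).sum() - V.sum() + WH.sum())

def nmf_kl(V, k, n_iter=300, seed=0):
    """One Brunet-style multiplicative-update factorization minimizing KL divergence."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
        W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
    return W, H

def nmf_best_of(V, k, nrun=10, seed=0, pseudocount=0.0):
    """Imitate the effect of the NMF package's `nrun` argument: run `nrun`
    factorizations from different random starts and keep the one with the
    lowest KL divergence from the (optionally pseudocount-shifted) input."""
    V = V + pseudocount
    best = None
    for i in range(nrun):
        W, H = nmf_kl(V, k, seed=seed + i)
        d = kl_div(V, W @ H)
        if best is None or d < best[0]:
            best = (d, W, H)
    return best
```

With nrun = 1 the result depends entirely on a single random initialization, which is one plausible reason that hard-coding nrun = 1 harms accuracy.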
For the 4 methods that use a single call to the nmf function, we dissected the reasons for the differences in mean Composite Measure for the case where K was specified as 2 (Table 1). MutationalPatterns had lower mean Composite Measures than mutSpec.NMF, for two reasons (Supplementary Fig. S16). When K is unspecified, nrun must also be specified as an argument to the nmfEstimateRank function, which selects K (the number of signatures, called the "factorization rank" in the NMF package). After K is selected, a possibly different value of nrun is supplied to the nmf function for a final factorization. Thus, if nmfEstimateRank estimates K correctly as 2, nmf is simply called with nrun = 2. For each of the factorizations carried out by nmfEstimateRank, mutSpec.NMF uses nrun = 50, while MutationalPatterns and maftools use nrun = 10. (SomaticSignatures.NMF and signature.tools.lib do not call nmfEstimateRank and do not automate selection of K.) For MutationalPatterns, the mean Composite Measure is lower when K is unspecified than when K is specified as 2 (Table 1). This was because, when K was not specified, the function nmfEstimateRank selected K = 3 in some tests. This in turn was because of three differences between how MutationalPatterns and mutSpec.NMF call nmfEstimateRank (Table 1, Supplementary Fig. S20, Supplementary Table S13). Although maftools also calls nmfEstimateRank with nrun = 10 and a single hard-coded seed, when run with no pseudocount, nmfEstimateRank always estimated K = 2.
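The idea behind rank estimation can be illustrated with a toy stand-in for nmfEstimateRank that picks the smallest K after which the best-of-nrun reconstruction error stops improving appreciably. The real function computes richer diagnostics (for example, the cophenetic correlation coefficient), so this Python sketch only conveys the general approach; the tolerance and candidate ranks here are arbitrary choices, not the package's defaults.

```python
import numpy as np

def nmf_kl(V, k, n_iter=300, seed=0):
    """Compact multiplicative-update NMF (KL divergence), one random start."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
        W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
    return W, H

def estimate_rank(V, ranks=(1, 2, 3, 4), nrun=5, tol=0.05):
    """Toy rank selection: for each candidate K, take the best of `nrun`
    random starts, then return the smallest K after which the relative
    reconstruction error stops improving by more than `tol`."""
    errs = []
    for k in ranks:
        best = min(
            np.linalg.norm(V - W @ H) / np.linalg.norm(V)
            for W, H in (nmf_kl(V, k, seed=s) for s in range(nrun))
        )
        errs.append(best)
    for i in range(1, len(ranks)):
        if errs[i - 1] - errs[i] < tol:
            return ranks[i - 1]
    return ranks[-1]
```

Even in this toy setting, the chosen K depends on nrun and on the tolerance, which echoes why different nmfEstimateRank settings led MutationalPatterns and mutSpec.NMF to different estimates of K.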

Discussion
We assessed 18 methods for extracting mutational signatures on 20 synthetic data sets constructed from mutational signatures SBS1 and SBS5 (Fig. 1). In these data sets, the number of mutations due to each signature and the correlations between the signatures varied. When the number of signatures to extract was not specified in advance, which is the usual situation in practice, 4 methods (hdp, SigProExtractor, TCSM, and mutSpec.NMF) most accurately extracted signatures (had the highest Composite Measures, Figs. 2, 3, Supplementary Tables S3, S4). When the number of signatures was specified or suggested in advance to be 2, sigminer and SignatureAnalyzer also extracted both signatures accurately (Fig. 7, Supplementary Fig. S15, Supplementary Tables S5, S6).
This study focused on a specific question regarding separating mutations generated by two correlated signatures, which let us dissect the reasons for differences in accuracy. The results highlighted the challenges and importance of accurately estimating the number of signatures. However, even when the correct number of signatures was specified to the software, there was considerable variability in performance. The most accurate methods were surprisingly robust both to the SBS1:SBS5 Ratio and to the SBS1-SBS5 Correlation. Indeed, their accuracy degraded only for data sets with the most extreme SBS1:SBS5 Ratios (0.1 or 10) and the highest correlation (Supplementary Figs. S1-S4, Supplementary Tables S4, S7). By contrast, many of the other methods did not extract SBS5 from data with an SBS1:SBS5 Ratio ≥ 2 (Supplementary Figs. S9-S12), but rather extracted a merge of SBS1 and SBS5 that we call SBS1 + 5 (Fig. 4a). Two methods extracted multiple instances of almost identical signatures but failed to merge them (Fig. 6), and two methods had extremely variable results from run to run on each data set (Supplementary Figs. S13, S14). Results from 5 methods based on the same implementation of the Brunet algorithm varied substantially, in some methods due to an inadequate, hard-coded number of factorizations (nrun = 1) in calls to the Brunet algorithm (Table 1). We are aware of only two previous studies that systematically assessed multiple methods on the same data sets 30,31. One of these studies evaluated 7 computational methods for signature extraction 30. This study was not designed to assess the critical aspect of whether methods were able to accurately estimate the number of signatures to extract, because it provided the correct number of signatures to the methods. The study used several measures to evaluate the methods' results. One measure was reconstruction error, i.e.
how well the input spectra could be reconstructed from the extracted signatures. This is rarely a question of interest because, as acknowledged in the study, most methods extract signatures that yield good reconstructions. A second measure was the specificity of the extraction, defined as the number of COSMIC signatures correctly not detected divided by the number of COSMIC signatures absent from the input synthetic data. However, incorrectly extracting a known but absent signature is rarely a problem, and therefore all methods did well by this measure. Finally, a third measure was sensitivity. This indeed is an issue in signature extraction, because methods often fail to extract signatures. Unfortunately, the sensitivity results in this previous study shed little light on the results of the current study because, of the 6 methods with the highest Composite Measures in the current study when K was specified as 2, the previous study analyzed only one: a version of SignatureAnalyzer (termed "bayesNMF" in ref. 30). In addition, the previous study did not report any measures of variability of sensitivity across replicates.
A more extensive previous study evaluated 14 methods on 37 synthetic data sets in a paper presenting the implementation of SigProExtractor 31. Two authors of the current study (YW and SGR) are also authors on this previous study. There were two major differences in approach compared to the current study. First, the previous study assessed the signature extraction methods on a wide range of synthetic data designed to mimic the signature exposures in tumors, while the current study was designed to allow detailed dissection of the behavior of methods in analyzing two correlated signatures. Second, SigProExtractor was optimized on the synthetic data presented in that study and outperformed the other methods, while the methods and parameters used in the current study were not optimized on the synthetic data. Of the 4 methods that had the highest Composite Measures in the current study when K was unspecified, only SigProExtractor and mutSpec.NMF were tested in the SigProExtractor study. SignatureAnalyzer extracted fewer false positives in the SigProExtractor study than in the current study, possibly because the data sets in the SigProExtractor study had many more signatures per input sample, so that SignatureAnalyzer may have overestimated K less often. An important conclusion from both the SigProExtractor study and the current study is the importance of assessing methods on synthetic data.
We draw some broader conclusions from the current study. First, there was substantial variability in the Composite Measure across the methods, and the behavior of some methods suggests that they were not extensively tested. Examples include the multiple, nearly identical duplicate signatures returned by sigfit.NMF and sigfit.EMu (Fig. 6), the variable results from MultiModalMuSig.LDA and MultiModalMuSig.MMCTM across different random seeds for the same data set (Supplementary Figs. S13, S14), and the lack of tuning of the nrun argument to the nmf function in the R NMF package by several methods (Table 1). This again points to the importance of testing on a range of data sets, including those presented here and those in refs. 1,2.
Second, the results underlined the importance of selecting the correct number of signatures. Notably, in the current study, SignatureAnalyzer and sigminer tended to extract too many signatures when the number of signatures was not specified, but they extracted highly accurate signatures when the correct number of signatures to extract was provided. The importance of determining the number of signatures was also evident in the SigProExtractor study 31 . For example, in that study, sigfit.NMF estimated a K that was on average only 34.8% of the true number of signatures, and consequently the average true positive rate was 33.1%. This in turn shows the importance of human judgement regarding the number of signatures present when assessing results in the light of all available evidence.
Third, the methods with the best performance in the current study have multiple parameters that affect the results, including parameters that govern the amount of sampling (the number of burn-in and Gibbs-sampling iterations or bootstrap replicates, Supplementary Table S1). TCSM and hdp also require additional parameters and hyperparameters. The importance of these parameters, over and above the critical question of estimating the number of signatures to extract, again implies that use of the software and interpretation of the results require considerable expertise and depend on human interpretation of results in the light of all available evidence.
Finally, the results of the current study lead to some recommendations for best practices in extraction of signatures: one should do multiple runs with different random seeds to determine stability, do multiple runs to test sensitivity to parameters and hyperparameters, and make use of available diagnostics, especially regarding selection of the number of signatures to extract.
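The first recommendation, multiple runs with different random seeds, can be implemented as a small harness around any extraction wrapper. This Python sketch assumes a caller-supplied extract(spectra, seed) function (a hypothetical wrapper around whichever extraction tool is in use) and reports the worst pairwise match between runs; it illustrates the practice rather than reproducing code from the study.

```python
import itertools
import numpy as np

def cosine(a, b):
    """Cosine similarity between two signature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def stability_check(extract, spectra, seeds):
    """Run `extract(spectra, seed)` once per seed.  For every pair of runs,
    greedily match each signature from one run to its most similar signature
    in the other, and return the worst such match.  A value near 1.0 for all
    pairs suggests the extraction is stable with respect to the random seed."""
    runs = [extract(spectra, s) for s in seeds]
    worst = 1.0
    for a, b in itertools.combinations(runs, 2):
        for sig in a:
            best = max(cosine(sig, other) for other in b)
            worst = min(worst, best)
    return worst
```

The same harness can be reused for the second recommendation by varying a hyperparameter instead of the seed across runs.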

Methods
Generating synthetic data. We generated one data set for each of the 20 possible combinations of values for the SBS1:SBS5 Ratio and the SBS1-SBS5 Correlation, using the CreateSBS1SBS5CorrelatedSyntheticData function in the SynSigGen package (https://github.com/steverozen/SynSigGen). The synthetic data sets are available at https://doi.org/10.5281/zenodo.5510836. We generated each synthetic data set as follows:
1. Designate the signature that will have the larger number of mutations as the "main signature" and the other signature as the "correlated signature".
2. Repeat the following steps until the Pearson R² of the correlation between the two signatures is within 0.01 of the desired SBS1-SBS5 Correlation:
2.1. Generate 500 exposures to the main signature from a log₁₀-normal distribution with µ = 2.5 and σ as specified in Supplementary Table S14. It was necessary to select the value of σ by trial and error to enable generation of data with the desired correlation. The µ of 2.5 represents a reasonable number of mutations ascribed to either SBS1 or SBS5, based on the numbers of mutations ascribed to them in Ref. 2. Discard and regenerate any exposures with < 100 mutations.
2.2. For each of the exposures, e, generated in Step 2.1, generate the exposure to the correlated signature by first drawing r from a log₁₀-normal distribution with µ = log₁₀(e) and with a σ selected by trial and error to enable the target correlation (Supplementary Table S14). Set the exposure to the correlated signature to r·(SBS1:SBS5 Ratio)⁻¹ if SBS1 is the main signature, or r·(SBS1:SBS5 Ratio) if SBS5 is the main signature. Discard and regenerate any exposures with < 1 mutation.
3. To generate each spectrum from the exposures to SBS1 and SBS5, multiply each exposure by the respective signature profile, add the two products, and then round. The profiles of SBS1 and SBS5 were taken from https://www.synapse.org/#!Synapse:syn12025148 (Ref. 2).
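The steps above can be sketched as follows. This is an illustrative Python condensation of the SynSigGen procedure with SBS1 taken as the main signature; the sigma values shown are placeholders that would be tuned by trial and error, and the outer repeat-until-target-correlation loop of Step 2 is noted only in comments.

```python
import numpy as np

def sample_log10_normal(rng, mu, sigma, n, minimum):
    """Draw n values from a log10-normal distribution, discarding and
    regenerating any value below `minimum` (the rejection rule in Steps
    2.1 and 2.2)."""
    out = np.empty(0)
    while out.size < n:
        draw = 10 ** rng.normal(mu, sigma, n - out.size)
        out = np.concatenate([out, draw[draw >= minimum]])
    return out

def make_correlated_exposures(n=500, ratio=2.0, sigma_main=0.3,
                              sigma_corr=0.2, mu=2.5, seed=0):
    """Steps 1-2 with SBS1 as the main signature.  The full procedure
    repeats this sampling until the Pearson R^2 of the log10 exposures is
    within 0.01 of the target SBS1-SBS5 Correlation; here we simply return
    the achieved R^2 instead of looping."""
    rng = np.random.default_rng(seed)
    main = sample_log10_normal(rng, mu, sigma_main, n, minimum=100)  # Step 2.1
    r = 10 ** rng.normal(np.log10(main), sigma_corr)                 # Step 2.2
    corr = np.maximum(r / ratio, 1.0)  # stand-in for discarding exposures < 1
    r2 = np.corrcoef(np.log10(main), np.log10(corr))[0, 1] ** 2
    return main, corr, r2

def spectra_from_exposures(sig_main, sig_corr, expo_main, expo_corr):
    """Step 3: multiply each exposure by its signature profile, add, round."""
    return np.round(np.outer(sig_main, expo_main) + np.outer(sig_corr, expo_corr))
```

In the real data the signature profiles are the 96-channel SBS1 and SBS5 vectors from the Synapse resource cited above; any fixed pair of probability vectors works for the sketch.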