SigLASSO: a LASSO approach jointly optimizing sampling likelihood and cancer mutation signatures

Multiple mutational processes drive carcinogenesis, leaving characteristic signatures on tumor genomes. Determining the active signatures from the full repertoire of potential ones can help elucidate the mechanisms underlying cancer initiation and development. This involves decomposing the frequency of cancer mutations categorized according to their trinucleotide context into a linear combination of known mutational signatures. We formulate this task as an optimization problem with L1 regularization and develop a software tool, sigLASSO, to carry it out efficiently. First, by explicitly adding multinomial sampling into the overall objective function, we jointly optimize the likelihood of sampling and signature fitting. This is especially important when mutation counts are low and sampling variance, high, such as the case in whole exome sequencing. sigLASSO uses L1 regularization to parsimoniously assign signatures to mutation profiles, leading to sparse and more biologically interpretable solutions. Additionally, instead of hard thresholding and choosing a priori, a discrete subset of active signatures, sigLASSO fine-tunes model complexity parameters, informed by the scale of the data and prior knowledge. Finally, it is challenging to evaluate sigLASSO signature assignments. To do this, we construct a set of criteria, which we can apply consistently across assignments.

tions. Additionally, instead of hard thresholding and choosing a priori, a discrete subset of active signatures, sigLASSO fine-tunes model complexity parameters, informed by the scale of the data and prior knowledge. Finally, it is challenging to evaluate sigLASSO signature assignments. To do this, we construct a set of criteria, which we can apply consistently across assignments.

Main
Mutagenesis is a fundamental process underlying cancer development. Examples of mutational mechanisms include spontaneous deamination of cytosines, the formation of pyrimidine dimers by ultraviolet (UV) light, and the crosslinking of guanines by alkylating agents. Multiple endogenous and exogenous mutational processes drive cancer mutagenesis and leave distinct fingerprints (1).
Mutation profiling of cancer samples at presentation has revealed that mutations accumulate over a lifetime; this includes somatic alterations generated by multiple mutational processes both before cancer initiation and during cancer development. In a generative model, multiple latent mutational processes generate mutations over time, drawing from their corresponding nucleotide context distributions ("mutation signature") (4; 5). Here a mutation signature is a multinomial probability distribution of mutations of various nucleotide contexts. In cancer samples, mutations from various mutational processes are mixed and observable by sequencing.
By applying unsupervised methods such as non-negative matrix factorization (NMF) and clustering to large-scale cancer studies, researchers have decomposed the mutation mixture and identified at least 30 distinct mutational signatures (2; 7). Many signatures have been linked with mutational processes with known etiologies, such as aging, smoking, or ApoBEC activity. Investigating the fundamental processes underlying mutagenesis can help elucidate cancer initiation and development.
One major task in cancer research is to leverage signature studies on large-scale cancer cohorts and efficiently attribute active signatures to new cancer samples. A popular previously published method, deconstructSigs (8), decomposes the mutation profile into a signature mixture using binary search to try coefficients one-by-one iteratively and then by pruning signatures with low estimated contribution to archive sparsity. Other approaches use linear programming (9) or iterate all combinations by brute force (10). None of these approaches explicitly formulates sampling uncertainty into the model or uses efficient regression techniques.
Although we do not fully know the latent mutational processes in cancer samples, we can make reasonable and logical assumptions about the optimal solutions of such studies. Here, we aimed to design a computational framework, sigLASSO, that could meet several criteria. First, the set of estimated mutational mechanisms should be small, as past studies indicate that not all signatures can be active in a single sample or even a given cancer type. We also prefer a sparser solution as it explains an observation in a simpler fashion, consistent with Occam's razor principle. Second, the estimated mutational mechanisms should be biologically interpretable and reflect some cancer type specificity. For example, we should not observe UV-associated signatures in tissues that are not exposed to UV. Likewise, we only expect to observe activation-induced cytidine deaminase (AID) mutational processes, which are biologically involved in antibody diversification, in B-cell lymphomas. Finally, the solution should be robust to the exact datasets for signatures and the data should control the model complexity.
In particular, it is a major challenge to reliably recover the signature composition when mutation number is low (8). Low mutation count results in high sampling variance, leading to an unreliable estimation of the mutation context probability distribution, which is the target for signature fitting.
A desirable signature identification tool should model the sampling process and take sampling variance into consideration.
In this work, we formulate the task as a joint optimization problem with L1 regularization. First, by jointly fitting signatures and the parameters of a multinomial sampling process, sigLASSO takes into account the sampling uncertainty. Cooperatively fitting a linear mixture and maximizing the sampling likelihood enables knowledge transfer and improves performance. Specifically, signature fitting imposes constraints on the previously unconstrained multinomial sampling probability distribution. Conversely, a better estimation of the multinomial sampling probability helps signature fitting. This property is especially critical in high sampling variance settings, for example, when we only observe low mutation counts in whole exome sequencing (WES). Second, sigLASSO penalizes the model complexity and achieves variable selection by regularization. The most straightforward way to do this would be to use the L0 norm (cardinality of active signatures), but this approach cannot be effectively optimized. Conversely, using the L2 norm flattened out at small values leads to many tiny, non-zero coefficients, which are hard to interpret biologically. sigLASSO uses the L1 norm, which promotes sparsity. The L1 norm is convex, and thus allows efficient optimization (11; 12). Additionally, this approach is able to harmoniously integrate prior biological knowledge into the solution by fine-tuning penalties on the coefficients. Compared with the approach of subsetting signatures before fitting, our soft thresholding method is more flexible for noise and unidentified signatures. Finally, sigLASSO is aware of data complexity such as mutational number and patterns in the observation. Our method is automatically parameterized empirically on performance, allowing data complexity to inform model complexity. In this way, our approach also promotes result reproducibility and fair comparison of datasets.
In sum, sigLASSO exploits constraints in signature identification and provides a robust framework for scientists to achieve biologically sound solutions. sigLASSO also can empower researchers to use and integrate their biological knowledge and expertise into the model. Unveiling the underlying mutational processes in cancer samples will enable us to recognize and quantify new mutagens, understand the mutagenesis and DNA repair processes, and develop new therapeutic strategies (9; 13; 14; 15; 16).

Results
The signature identification problem Mutational processes leave mutations in the genome within distinct nucleotide contexts. Specifically, we considered the mutant nucleotide context and looked one nucleotide ahead and behind. This divides mutations into 96 trinucleotide contexts. Each mutational process carries a unique signature, which is represented by a multinomial distribution over mutations of trinucleotide context (Fig. 1a).
Thirty signatures were identified by NMF (with Frobenius norm penalty) and clustering from largescale pan-cancer analyses (2; 3). Here, our objective was to leverage the pan-cancer analysis and decompose mutations from new samples into a linear combination of signatures. Mathematically, the problem is formulated as the following non-negative regression problem and maintains the original Frobenius norm: The mutation matrix, M contains mutations of each sample cataloged into 96 trinucleotide contexts. m i (i = 1 ... n) ∈ M denotes the mutation count of the i th category. S is a 96 × 30 signature matrix, containing the mutation probability in 96 trinucleotide contexts of the 30 signatures. W is the weights matrix, representing the contributions of 30 signatures in each sample.
Sampling variance In practice, this problem is optimized using continuous relaxation for efficiency and simplicity (8), ignoring the discrete nature of mutation counts. This approach essentially transforms observed mutations into a multinomial probability distribution, making model estimation insensitive to the total mutation count. Yet the total mutation count plays a critical role in inference. Assuming mutations are drawn from an underlying probability distribution, which is the mixture of several mutational signatures, the mutations follow a multinomial distribution. The total mutation count is the sample size of the distribution, thus affecting the variance of the inferred distribution.
For instance, 20 mutations within the 96 categories give us very little confidence in inferring the underlying mutation distribution. By constast, if we observed 2,000 mutations, we would have much higher confidence. Here, we aimed to use a likelihood-based approach to acknowledge the sampling variance and design a tool sensitive to the total mutation count. sigLASSO model We divided the data generation process into two parts. First, multiple mutational signatures mix together to form an underlying mutation distribution. Second, we observed a set of categorical data (mutations), which is a realization of the underlying mutation distribution.
We used m i (i = 1 ... n) to denote the mutation count of the ith category. − → p is the underlying mutation probability distribution with p j denoting the probability of the j th category.
To promote sparsity and interpretability of the solution, sigLASSO adds an L1 norm regularizer on the weights (i.e., coefficients) of the signatures. LASSO is mathematically justified and can be computationally solved efficiently (12). Adding an L1 norm regularizer is equivalent of placing a Laplacian prior on − → w (17). Thus, from this generative model, we can write down the likelihood function for one single sample.
The log-likelihood function, which is our objective function to maximize, is given as following: Here, α = 1/σ 2 . We can infer α from the residual errors from linear regression. − → c is a vector of k penalty weights (c 1 , c 2 , ... c k ), each indicating the strength to penalize the coefficient of a certain signature. This value should be tuned to reflect the level of confidence in prior knowledge. We also used − → c to perform adaptive LASSO (18) by initializjng − → c to 1/β OLS , where β OLS are the coefficients from nonnegative ordinary least square. Our aim was to obtain a less biased estimator by applying smaller penalties on larger values.
Optimizing sigLASSO The negative log likelihood is convex in respect to both − → p and − → w when evaluated individually. Hence the loss function is biconvex. Instead of using a generic optimizer, we exploited the biconvex nature of this problem and effectively optimized the function by Alternative Convex Search (ACS), which iteratively updates these two variables (19).
To begin the iteration, we initialized − → m using maximum likelihood estimation (MLE) and started with the − → w −step. The − → w −step is a nonnegative linear LASSO regression that can be efficiently solved by glmnet (18). λ is parameterized empirically (See Methods).
Next, we used the LASSO error variance estimator to estimate α (18). We solved the − → p with a Lagrange multiplier to maintain the linear summation constraint n i=1 p i = 1. The nonnegative constraint of p i is satisfied by only retaining the nonnegative root of the solution (see Methods).
In − → p −step, we tried to estimate − → p that optimizes the multinomial likelihood while constraining it to be not too far away from the fitted − → p . If we only used the point MLE of − → p based on sampling and did not perform the − → p −step, the model would assume the sampling is perfect and become insensitive to the total mutation counts. The trade-off in the − → p −step between the multinomial likelihood and the L2 loss reflects the sampling error. The sampling size (sum of m i ), the goodness of the signature fit (as reflected in α), and the overall shapes of − → p all affect the tension between sampling and linear fitting.
Algorithm 1 LIBRA algorithm 1: initialization: sigLASSO is aware of the sampling variance Jointly optimizing both sampling process and signature fitting, sigLASSO is aware of the sampling variance and infers an underlying mutational context distribution − → p . The underlying latent distribution is optimized in respect to both sampling likelihood and the linear fitting of signatures (Fig. 1b). In low mutation counts, the uncertainty in sampling increases and thus the estimated underlying distribution goes closer to the least square estimate (Fig. 1c). In contrast, when the total mutation count is high, the estimate of the distribution is closer to the MLE of the multinomial sampling process.
We illustrated how the mutation count affects the estimation of − → p using a simulated dataset (five signatures, noise level: 0.1, see Methods). When the sample size was small (≤ 100), high uncertainty in sampling pushed the inferred underlying mutational distribution − → p far from the MLE in exchange for better signature fitting. When the sample size increased, lower variance in sampling dragged − → p close to the sampling MLE and forced the signatures to fit with larger error.
Because linear fitting and sampling likelihood optimization mutually inform each other, concurrently learning an auxiliary sampling likelihood improves performance. We compared the accuracy of the estimation of − → p with and without this joint optimization (Fig. 1d). As expected, − → p estimation in low mutation count performs worse. sigLASSO is able to achieve a lower MSE in both estimating − → p and the underlying true signature mixture.
Performance on simulated datasets We first evaluated sigLASSO on simulated datasets. Both sigLASSO and deconstructSigs performed better with higher mutation number and lower noise ( Fig. 2a, S1). A decrease in mutation number leads to an increase of uncertainty in sampling, which is mostly negligible in the high mutation scenarios. As expected, the MSE jumped to the 0.05-0.3 range regardless of the noise level when the mutation number was low. Similar pattern is observed in support recovery (i.e. precision/recall/accurary). Thus, the error is dominated by undersampling rather than embedded noise. Overall, the performances, measured by MSE, of the two tools were comparable. sigLASSO produces a sparser solution. It maintains a higher precision level when mutation number decreases and/or noise increases. The precision of sigLASSO is less affected by the number of signatures. When the signature number is low (sparse settings), sigLASSO shows a better accuracy.
Using known signature to tune the weights boosts performance (Fig. 2b, S2). As the fraction of true signatures given as prior knowledge increased, the performance improved. When more false signatures were mixed with true signatures given as prior knowledge, the performance slowly deteriorated. Stronger priors had bigger effects to the solution as expected.
Evaluating criteria for signature assignment We next moved from synthetic datasets to real cancer mutational profiles. Real cancer mutational profiles are likely noisier than our simulation and exhibit a highly non-random distribution of signatures.
One of the limitations of cancer signature research is that the ground truth of real samples typically cannot be obtained. Previous large-scale signature studies largely relied on mutagen exposure asso- These expectations are not quantitative, but they help direct us to recognize the most plausible solution as well as the less favorable ones.
WGS scenario using renal cancer datasets We benchmarked the two methods using 35 WGS papillary kidney cancer samples (20) . The median mutation count was 4,528 (range: 912-9,257).
We found that without prior knowledge, both sigLASSO and deconstructSigs showed high contributions from signature 3 and 5 ( Fig. 3a,b). deconstructSigs also assigned a high proportion for signature 8, 9 and 16 . Signatures 3, 8, 9 and 16 were not found to be active in pRCC in previous studies and currently no biological support rationalizes their existence in pRCC (2).
However, if we naively subset the signatures and took the ones that were found to be active in previous studies, the signature profile was completely dominated by signature 5, to which only less than 5% mutations were assigned with other signature. This finding suggests possible an overly simply, underfitted model.
When sigLASSO took into account the prior knowledge of active signatures, the proportion of backbone signature 5 increased to about 80%, which is in line with previous reports. SigLASSO also assigned a small portion of mutations to signature 3 and 13.
WES scenario using esophageal carcinoma datasets We next aimed to evaluate the two methods on 182 WES esophageal carcinoma (ESCA) samples with more than 20 mutations. The median mutation count was 175.5 (range: 28-2,146), which is considerably lower than WGS but typical for WES. We did not use any prior knowledge because COSMIC does not have active signatures in esophageal cancers.
In sigLASSO, the L1 penalty strength is tuned based on model performance. In a low mutation count, ESCA WES dataset, the model variance is likely high, which pushes up the penalty. As DeconstructSigs has been shown to be able to distinguish between different histological types of esophageal cancer (8). We demonstrated that sigLASSO generates a sparser but comparable result with wider signature 3 and 25 gaps between the subtypes (Fig. 3c). The adenocarcinoma subtypes tended to have higher fractions of signature 1 and 25, and lower fractions of signature 3 and 9.
Performance on 8,892 The Cancer Genome Atlas (TCGA) samples We ran sigLASSO with step-by-step set-ups and deconstructSigs on 8,892 TCGA tumors (33 cancer types, S5) with more than 20 mutations (Fig. 4).
We noticed that simple nonnegative regression resulted in an overly dense matrix. Applying an L1 penalty made the solution remarkablely sparse. Then by incorporating the prior knowledge, the signature landscape further changed without significantly affecting on the assignment sparsity.
The solution is in consistent with the priors given. In comparison, the solution of deconstructSigs was less sparse.
sigLASSO is computationally efficient sigLASSO iteratively solves two convex problems. The − → w −step can be solved using a very efficient coordinate descent algorithm (glmnet) (12). The How to leverage results from large-scale signature studies and apply them to a small set of incoming samples is a very practical problem for many researchers. Although it might seem to be a simple linear system problem at first, the core challenge is how to prevent over-and underfitting on only one single sample, often with very few mutations (especially in WES) and promote sparsity.
First, under the current generative model, cancer draws mutations from a multinomial distribution of all active cancer signatures and then further draws from the multinomial nucleotide context distribution given by the signature. Mutations are first divided into several signatures and then categorized further into 96 types based on the nucleotide composition. With the mutation number less than a few hundred, sampling variance becomes a significant factor in reliable signature identification. Therefore, the fitting scheme should be aware of the sampling variance, which is especially pronounced in low mutation count scenarios (WES or cancer types with low mutation burden).
Ideally, the tool should be able to attribute the signatures by flexibly inferring the underlying true mutation distribution given the sampling variance and the signature fitting performance. In addition, sparse solutions can give better predictions due to lower estimator variance and are in line with Occam's razor principle, which prefers the simplest solution that explains an observation.
Third, a desirable method should be aware of data complexity and be parameterized accordingly to achieve optimum fitting. Finally, mutational signatures are not orthogonal due to their biological nature. Co-linearity of the signatures will lead to unstable fittings that change erratically with even a slight perturbation of the observation.
DeconstructSigs was the first tool to identify signatures even in a single tumor. This tool uses binary search to iteratively tune coefficients and archives sparsity by post-hoc pruning with a preset 6% cut-off value. The mutation spectrum is normalized before fitting, thus making mutation counts irrelevant to the model. Moreover, the greedy nature of stepwise coefficient tunning is prone to eliminating valuable predictors in later steps that are correlated with previously selected ones (22). Here, we describe sigLASSO, which simultaneously optimizes both the sampling process and an L1 regularized signature fitting. By explicitly formulating a multinomial sampling likelihood into the optimization, we designed sigLASSO to take into account the sampling variance.
Meanwhile, sigLASSO uses the L1 norm to penalize the coefficients, thus promoting sparsity and achieving effective variable selection. By fine-tuning the penalizing terms using prior biological knowledge, sigLASSO is able to further exploit previous signature studies from large cohorts and promote signatures that are believed to be active in a soft thresholding manner.
Jointly optimizing a mutation sampling process enables sigLASSO to be aware of the sampling variance. By additionally modeling an auxiliary multinomial sampling process and corresponding distribution, we demonstrated that sigLASSO achieves better signature attribution, especially in low mutation counts cases. In cancer research, WES data is abundant but it also suffers severely from undersampling in signatures attribution. In these cases, sigLASSO is able to simultaneously learn the linear regression of signatures with a multinomial sampling process, generating more reliable and robust solutions. Moreover, formulating the mutation sampling uncertainty can have further implications in mutagenesis modeling and parameter estimation, for example in estimating the nucleotide-specific background mutation rate in cancer.
Additionally, as the cost of sequencing drops rapidly, we expect an even greater number of cancer samples to be whole-genome sequenced (23). The vast amount of cancer genomics data will give scientists larger power to discern unknown or rare signatures. The growing number of signatures will eventually make the signature matrix underdetermined (when k > 96, i.e., the number of possible mutational trinucleotide contexts). A traditional simple solver method would give infinitude (noiseless) or unstable (noisy) solutions in this underdetermined linear system. However, by assuming the solution is sparse, we were able to apply regulation to achieve a simpler, sparser and stable solution.
Moreover, sigLASSO does not specify a noise level explicitly beforehand, but instead empirically tunes parameters based on model performance. On the other hand, deconstructSigs specifies a noise level of 0.05 to derive the cut-off of 0.06 to achieve sparsity. In general, sigLASSO lets the data itself control the model complexity and leave any post-hoc filtering to users.
Finally, due to the colinearity nature of signatures, pure mathematical optimization might lead al-gorithms to select wrong signatures that are highly correlated with truly active ones. To overcome this problem, sigLASSO allows researchers to incorporate domain knowledge to guide signature identification. This knowledge input could be cancer-type specific signatures or patient clinical information (e.g., smoking history or chemotherapy). We showcased the performance of sigLASSO on real cancer datasets. Although we lack the ground truth of the operative mutational signatures in tumors, we have several reasonable beliefs about the signature solution. sigLASSO produced signature solutions that are biologically interpretable, properly align with our current knowledge about mutational signatures, and well distinguish cancer types and histological subtypes.

Methods
Optimization of the p−step. In the − → p -step, we tried to solve the following problem withp i from estimation of the w−step:p t+1 We added the Lagrangian multiplier and take the derivatives in respect to p i , (i = 1, 2...n) and λ. Now we get n + 1 equations.
The roots of the quadratic equation are given by of the log-likelihood is − m i p i − α, which is strictly negative. Therefore, the root we found is a maximum.
We plugged all the roots in the last equation (ie. the linear constrain) and use the R function uniroot() to solve λ.
Parameter tuning. We tuned lambda by repeatedly splitting the nucleotide contexts into training and testing sets and testing the performance. Because mutations of the same single nucleotide substitution context are correlated, we split the data set into eight subsets. Each subset contained two of every single nucleotide substitutions. We then held one subset as the testing dataset and only fit the signatures on the remaining ones. After circling all eight subsets and repeating the process 20 times, we used the largest λ (which leads to a sparser solution) that gave MSE 0.5 SD from the minimum MSE. λ was tuned when − → p deviated far from the estimation from the previous round.
By adaptively learning − → p , sigLASSO avoids overestimating the errors in the signature fitting and thus allows a higher fraction of mutations to be assigned with signatures. We fixed λ when the deviation was small to avoid the inherited randomness in subsetting that affects convergence. Illustrating on real datasets. To assess the performance of our method on real-world cancer datasets, we used somatic mutations from various cancer types from TCGA. We downloaded MAF files from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). A detailed list of files used in this study can be found in S5.
We compared the signature composition results with a previous pan-cancer signature analysis (http://cancer.sanger.ac.uk/cosmic/signatures). We also extracted prior knowledge on active signatures in various cancer types from this source.
sigLASSO software suite. sigLASSO accepts processed mutational spectrums. We provided simple scripts to help parse mutational spectrums from VCF files. sigLASSO allows users to specify biological priors (i.e., signatures that should be active or inactive) and their weights. sigLASSO Competing Interests The authors declare that they have no competing financial interests.