## Introduction

Systemic light-chain amyloidosis (AL) is a monoclonal gammopathy characterized by the abnormal proliferation of a plasma cell clone producing large amounts of pathogenic immunoglobulin free light chains (LCs)1. LCs, mainly secreted as homodimers2, misfold, forming toxic species and amyloid fibrils that accumulate in target organs and lead to organ dysfunction and death1. Although LC deposition can occur in any organ except the brain, the kidney and heart are the most affected sites, with the latter bearing the worst prognosis. Symptoms of AL are non-specific and usually reflect advanced organ involvement. Therefore, an early diagnosis is essential to avoid irreversible organ damage. However, the complexity of the disease and its vague symptoms make a timely diagnosis of AL extremely challenging3,4.

Pre-existing monoclonal gammopathy of undetermined significance (MGUS) is a known risk factor for developing AL, with 9% of MGUS patients progressing to AL5,6,7. However, early diagnosis is still difficult since reliable diagnostic tests predicting whether MGUS patients are likely to develop AL are currently lacking7,8. Predicting the onset of AL is highly challenging, as each patient carries a different pathogenic LC sequence resulting from a unique rearrangement of variable (V) and joining (J) immunoglobulin genes and a unique set of somatic mutations (SMs) acquired during B cell affinity maturation9 (Fig. 1a). Therefore, the development of a specific prediction tool represents a crucial step to anticipate AL diagnosis and improve patients’ prognosis.

Machine learning techniques are becoming very prominent in different areas of science and are also gaining acceptance in medicine. Indeed, machine learning has been used in different areas of medicine, such as diagnosis10,11,12, prognosis13,14, drug discovery15,16 and drug sensitivity prediction17,18,19. In these approaches, machines learn information from data without being explicitly programmed and simulate human intelligence to make predictions20. The high diversity of LC sequences accountable for AL development and the possibility of accessing databases of pathogenic and non-pathogenic LC sequences prompted us to use a machine-learning-based strategy to devise a predictor of LC toxicity in AL named LICTOR (λ-LIght-Chain TOxicity predictoR).

LICTOR uses SMs as predictor variables based on the hypothesis that SMs are the main LC toxicity-discriminating factors. We assess LICTOR’s performance with an independent set of LCs with a known clinical phenotype but not used in the training. Furthermore, to experimentally validate LICTOR, we use our predictor to abolish the pathological phenotype of a cardiotoxic LC and verify the outcome with a Caenorhabditis elegans-based assay evaluating the reduction of the pharyngeal pumping rate after administration of cardiotoxic LCs as a measure of proteotoxicity. Taken together, these results confirm that LICTOR provides insights into specific features differentiating toxic and non-toxic LCs. Therefore, it may represent a powerful tool to improve AL diagnosis and unveil a novel strategy for patient treatment through personalized medicine.

## Results

### SMs are key LC toxicity-discriminating factors

To investigate the role of SMs in the generation of toxic LCs and validate their use as predictor variables in an LC toxicity predictor, we collected a database of 1075 λ LC sequences. The database included 428 “toxic” sequences (i.e. LCs responsible for the formation of toxic aggregates and AL development) extracted from AL patients (hereafter referred to as tox) and 647 “non-toxic” LCs (nox) comprising sequences from healthy donor repertoires, other autoimmune diseases or cancer, obtained from the Amyloid Light-chain Database (AL-Base)21 (428 tox, 590 nox) and an in-house LC database not related to AL (57 nox). We restricted our analysis to λ LCs since this isotype is more prevalent than the kappa (κ) isotype in AL patients (λ/κ = 3:1 compared to that of healthy individuals, λ/κ = 1:2)22. To identify SMs, all LCs were aligned to the corresponding germline (GL) sequence obtained using the IMGT database23. LCs were then numbered according to the Kabat-Chothia scheme (using a progressive enumeration from 1 to 125), allowing the structural comparison of LCs with different sequence lengths (“Methods” and Fig. 1b). Next, we counted the number of mutated (M) and non-mutated (NM) residues at each position i in tox and nox sequences (toxiM and noxiM, toxiNM and noxiNM, respectively) and used Fisher’s exact test24 to assess whether the frequencies of mutations (toxiM, noxiM) and non-mutations (toxiNM, noxiNM) were significantly (p < 0.05) different. Finally, the odds ratio (OR)24 was used to assess the association strength between mutations and toxicity at each position i in tox and nox sequences (“Methods” and Fig. 1c). Interestingly, 48 of 53 positions with a statistically significant difference (p < 0.05) between the two groups (Fig. 1c) showed a higher rate of mutation in the tox group (OR > 1), while only 5 positions reported a higher mutation rate in the nox group (OR < 1).
To exclude possible bias induced by the use of a group of nox sequences having an artificially low level of SMs, we randomly selected 1000 LC sequences from a healthy donor repertoire (hdnox)25 and compared the probability distributions of the number of SMs (PDSM) between the three groups. We observed similar PDSM between the nox and hdnox groups, while the PDSM of tox and hdnox, as well as those of tox and nox, were significantly different. This result supports nox sequences as a bona fide group of LCs (Supplementary Fig. 1). Overall, these findings suggest that SMs are key determinants of the toxicity of LCs and, thus, can be used as predictor variables to develop LC toxicity prediction tools.

### Prediction of LC toxicity using machine learning

The previous findings prompted us to use SMs as features to develop a machine learning approach automatically classifying LCs as either toxic or non-toxic in AL. To this end, we combined the information from SMs with knowledge of the 3D structure of LC homodimers26,27 to create three families of predictor variables used in the training of machine learning algorithms. The first family, termed AMP (amino acid at each mutated position), highlights sequence features, identifying the presence or absence of an SM at each position of the LC sequences. The second family, termed MAP (monomeric amino acid pairs), identifies the presence or absence of mutations in residues in close contact in the LC monomeric 3D structure (distance < 7.5 Å). Finally, the third family, named DAP (dimeric amino acid pairs), identifies the presence or absence of mutations at positions in close contact, but belonging to different chains. Next, four machine learning algorithms (Bayesian network, logistic regression, J48 and random forest)28 were evaluated for their ability to solve the classification problem, using our database as input. To assess the importance of the different classes of predictor variables, we performed 28 prediction experiments including all the possible combinations of AMP, MAP and DAP families. In addition, to avoid unbalanced class problems, i.e. the tendency of machine learning algorithms to assign sequences to the largest class in the dataset, nox in our case, each of the 28 experiments was performed with and without balancing the training set using an SMOTE (synthetic minority over-sampling technique) filter29. The assessment of algorithms and predictor variable combinations was performed using 10-fold cross-validation to avoid overfitting. 
We found that for all the tested machine learning algorithms, the best combination of predictor variable families provided an area under the receiver operating characteristic curve (AUC) that substantially differed from that of a random classifier (0.50), with random forest being the best classifier (0.87) and J48 being the worst (0.75) (Fig. 2a and Supplementary Data 1). Furthermore, all four classifiers relied on the AMP family to predict LC toxicity, while only random forest used all three families of predictor variables (AMP + MAP + DAP) in its best configuration. Overall, these findings highlight the importance of the structural context of SMs in defining the toxicity of an LC and identify random forest using AMP, MAP and DAP as the best approach in our case. For this reason, we used random forest in our implementation of LICTOR.

### Validation of LICTOR

Next, we sought to validate the prediction accuracy of LICTOR with a set of sequences with known clinical phenotypes but not present in the training set (valset)27. The valset (Supplementary Data 2) comprised a total of 12 LCs, including 7 sequences associated with AL with cardiac involvement (H3, H6, H7, H9, H15, H16 and H18) and 5 from multiple myeloma (MM) patients (M2, M7, M8, M9 and M10). LICTOR was able to correctly classify 10 (6 tox and 4 nox) of the 12 LCs as either toxic or non-toxic (Supplementary Data 2 and Fig. 3a). The probability of achieving a similar accuracy with a random classifier was 0.016, strengthening the argument that LICTOR is a robust and accurate tool to predict the clinical toxicity of previously unseen LCs.
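As a quick check of the random-classifier probability quoted above, the following minimal sketch evaluates a binomial model in which each of the 12 valset sequences is classified correctly with probability 0.5. That the reported value was obtained this way is an assumption, and the function names are illustrative. Under this model, the probability of exactly 10 correct classifications rounds to 0.016, while the probability of 10 or more is slightly larger.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(k: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k successes in n independent trials."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))


# Valset size and number of correct predictions, as reported in the text.
n_valset, n_correct = 12, 10

# Assumption: a random classifier is right with probability 0.5 per sequence.
p_exact = binom_pmf(n_correct, n_valset)       # exactly 10 of 12 correct
p_at_least = binom_tail(n_correct, n_valset)   # 10 or more of 12 correct

print(f"P(X = 10)  = {p_exact:.3f}")
print(f"P(X >= 10) = {p_at_least:.3f}")
```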

To further assess the robustness of LICTOR, we performed two additional tests. In the first test, we predicted the toxicity of 100 randomly selected LCs from the healthy donor repertoire hdnox (all absent from the training set). In this case, LICTOR correctly classified 80% of the sequences as non-toxic (Supplementary Data 3), thus confirming a similar level of accuracy as for the training set. In the second test, we further verified the absence of overfitting. To achieve this, we took all tox sequences and randomly labelled half of them as nox. Then, we trained a classifier using 10-fold cross-validation on this dataset, obtaining an AUC of 0.5 (Supplementary Data 4), equivalent to that of a random classifier. The same procedure was applied to the nox group (by randomly labelling half of the sequences as tox and training a classifier on the resulting dataset), again obtaining an AUC of 0.5 (Supplementary Data 5). These results further underline that tox and nox sequences have distinctive features allowing their discrimination.

### LC germline VJ rearrangement information does not improve prediction performance

To further underscore the role of SMs as main discriminants between toxic and non-toxic LCs, we trained the same machine learners employed before, using the LC germline VJ rearrangements as a unique predictor variable, given the well-documented overrepresentation of certain VL germline genes in AL30,31,32. All the resulting germline-based classifiers achieved an AUC of 0.77 in their best configuration (Fig. 2b and Supplementary Data 6), a value substantially better than that of a random classifier, although much lower than LICTOR’s score (0.87). Interestingly, adding LC germline VJ rearrangements to LICTOR did not improve its prediction performance (Supplementary Data 7).

Next, we computed the specificity and sensitivity of the two random forest predictors, LICTOR and the germline-based predictor, maximizing the Youden index (J)33 as a function of the confidence level of the random forest predictions, i.e. the probability that a sequence belongs to the predicted phenotype (Fig. 2c). LICTOR achieved a specificity of 0.82 and a sensitivity of 0.76 (J = 0.58, threshold = 0.46 in identifying tox), while the germline-based classifier showed a specificity of 0.69 and a sensitivity of 0.73 (J = 0.43, threshold = 0.48 in identifying tox). Overall, these data suggested that SMs harbour key information that can be used to discriminate between tox and nox, while LC germline VJ rearrangements do not seem to carry additional information that can improve the prediction performance of LICTOR.
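The threshold selection described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the toy scores and the threshold grid are assumptions, and a sequence is called toxic when its predicted probability of toxicity is at or above the threshold. Youden's index is J = sensitivity + specificity − 1, so the reported LICTOR values give J = 0.76 + 0.82 − 1 = 0.58.

```python
def youden_threshold(scores, labels, thresholds):
    """Return (J, threshold, sensitivity, specificity) maximizing Youden's J.

    scores: predicted probability of the toxic class for each sequence.
    labels: 1 for tox, 0 for nox.
    A sequence is called toxic when its score is >= threshold.
    """
    best = None
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1
        # Keep the first threshold that reaches the highest J.
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best


# Toy data (hypothetical): four tox and four nox sequences.
scores = [0.9, 0.8, 0.55, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
thresholds = [i / 100 for i in range(1, 100)]

j, t, sens, spec = youden_threshold(scores, labels, thresholds)
print(f"J = {j:.2f} at threshold {t:.2f} (sens = {sens:.2f}, spec = {spec:.2f})")

# Youden's J recomputed from the sensitivity and specificity reported for LICTOR.
j_lictor = 0.76 + 0.82 - 1
```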

### LICTOR unveils specific features of LC toxicity

To identify the key features leading to LC toxicity in AL, we ranked the predictor variables of LICTOR according to their “information gain”, a value representing the importance of the information carried by each predictor variable for the classification34. We found that among the top 10 most important features of the three families of predictor variables, feature 49-A, which denotes an SM to alanine at position 49, obtained the highest score in the AMP family ranking, as well as in the general ranking (Fig. 2d and Supplementary Data 8). Indeed, feature 49-A was present in 54 tox sequences but only 8 nox sequences. Furthermore, the 49-A mutation, which is located at the dimeric interface of LCs (Fig. 2e), was also ranked among the top 10 features in the DAP family in combination with no substitutions at other residue positions (Fig. 2d). Moreover, among the best-ranked predictor variables of the three families, those describing mutated positions were more frequent in tox sequences than in nox sequences (Fig. 2d). Interestingly, all these mutations were located at the LC homodimeric interface (Fig. 2e), suggesting that mutations in these positions may affect the structural integrity of the dimeric interface and/or induce local instability of the monomer, thus leading to LC misfolding and aggregation. A similar trend was also observed for other top-ranked features, where unmutated positions were, conversely, more frequent in nox sequences than in tox sequences (Fig. 2d).
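For readers unfamiliar with information gain, the sketch below computes it for a single binary feature (present or absent) against the tox/nox label, using the counts quoted above for feature 49-A (54 of 428 tox and 8 of 647 nox sequences). It is an illustration of the general definition, H(label) − H(label | feature), and not a reproduction of the ranking in Supplementary Data 8: the function names are assumptions, and the published scores may have been computed on a differently prepared dataset.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(tox_with, nox_with, n_tox, n_nox):
    """Information gain (bits) of a binary feature for the tox/nox label.

    tox_with / nox_with: number of tox / nox sequences carrying the feature.
    n_tox / n_nox: total number of tox / nox sequences.
    """
    tox_without = n_tox - tox_with
    nox_without = n_nox - nox_with
    n = n_tox + n_nox
    h_label = entropy([n_tox, n_nox])

    # Conditional entropy: weighted entropy of each feature value's subset.
    h_cond = 0.0
    n_with = tox_with + nox_with
    n_without = tox_without + nox_without
    if n_with:
        h_cond += n_with / n * entropy([tox_with, nox_with])
    if n_without:
        h_cond += n_without / n * entropy([tox_without, nox_without])
    return h_label - h_cond


# Counts for feature 49-A as quoted in the text.
ig_49a = information_gain(tox_with=54, nox_with=8, n_tox=428, n_nox=647)
print(f"Information gain of 49-A: {ig_49a:.4f} bits")
```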

To investigate the role of the top-ranked features in the prediction of LC toxicity, we performed a quantitative analysis of the importance of features identified by the feature selection technique. To this end, we trained 30 different classifiers (with a 10-fold cross-validation) adding successively the 10 most important features of each feature family according to their information gain. Results are reported in Supplementary Fig. 2 and Supplementary Data 9. Interestingly, the classifier using only the highest ranked feature 49-A achieves an accuracy of 64% with an AUC of 0.55, while to achieve an AUC above 0.77 at least 17 top features are required.

Taken together, these findings show that the presence or absence of specific mutations at specific positions of an LC are key features used by LICTOR to classify the LC phenotype. More importantly, this further underlines the pivotal role of SMs as causative of AL.

### Reverting the toxic phenotype of an LC using LICTOR

Having assessed the prediction accuracy of LICTOR, we sought to validate the key toxicity determinants identified previously through information gain by computationally reverting the toxic phenotype of an LC with LICTOR and verifying the results in a validated in vivo C. elegans model35,36. We therefore selected an LC from our database (tox153) previously described in the literature as cardiotoxic37, and thus having the worst prognosis in AL. For this sequence, we had access to a bone marrow sample, from which we obtained the full-length sequence of tox153 (see “Methods”). Despite its toxic phenotype, tox153 differs by only 5 SMs from the corresponding germline (Fig. 3b), hence representing a good candidate with which to perform our study. From the analysis of the 5 SMs of tox153, we found that the best candidate, i.e. the mutation with the largest information gain, was at position 52, in which the germline leucine (LEU) was somatically mutated to a valine (VAL). This feature was one of the top-ranked predictor variables in all three families (AMP, MAP and DAP). In fact, an unmutated amino acid at position 52 (52X) was significantly more frequent in nox sequences than in tox sequences (p = 1.5e-09, Fig. 2d). Therefore, 52X may represent a “non-toxic feature” and might be able to revert the phenotype of tox153 (Fig. 3c). To test this hypothesis, we restored the leucine of the germline sequence at position 52 of tox153 (tox153V52L) and used LICTOR to predict the toxicity of the new sequence. LICTOR predicted tox153V52L as a nox sequence, indicating that, according to the in silico prediction, this single point mutation is able to completely revert the toxic phenotype of tox153. Next, we analysed the other 4 SMs according to their information gain. Among these SMs, only the MAP feature 56X-59X (p = 2.2e-16, Fig. 2d) was included in the top rank.
Since tox153 is mutated at position 56 but not at position 59, we also reverted this SM in tox153V52L by mutating alanine to glycine (tox153V52LA56G) (Fig. 3c). Interestingly, in silico prediction by LICTOR confirmed the non-toxic phenotype of tox153V52LA56G. These results underline that key predictor variables identified in the feature selection process and used by LICTOR to perform the predictions represent molecular determinants of AL.

### Experimental validation of LICTOR using C. elegans

Next, we assessed the accuracy of LICTOR’s toxicity predictions in a validated in vivo model, exploiting the ability of C. elegans to specifically identify cardiotoxic LCs35,36. To this end, recombinantly expressed tox153, single mutant tox153V52L, double mutant tox153V52LA56G and tox153 germline protein (tox153GL), the protein without SMs, were administered to worms and their toxicity was evaluated by measuring alterations in the pharyngeal pumping rate35,36. Tox153 caused significant pharyngeal dysfunction (Fig. 3d) comparable to that induced by the administration of cardiotoxic LCs purified from patients suffering from AL (H6, H7, H9) (Fig. 3e)35,36,38, while the effect of tox153GL was comparable to that of the vehicle. The presence of a single mutation in tox153V52L significantly decreased the ability of the wild-type protein to cause pharyngeal toxicity (Fig. 3d). Notably, the double mutant tox153V52LA56G, similar to LCs purified from patients affected by MM (M2, M7, M8)35,36, did not display toxic activity (Fig. 3e).

Finally, we sought to validate our starting hypothesis that LCs acquire toxic features through the addition of specific SMs during the process of affinity maturation and that, conversely, germline LCs are never associated with AL development. To achieve this, we exploited the C. elegans model and tested a total of three recombinantly expressed germline LCs (H6GL, H9GL and tox153GL). Values of their pharyngeal toxicity were then compared to those of the corresponding cardiotoxic LCs (H6, H9 and tox153), for which the pharyngeal bulb contraction (pumps/min) values were either already published35,36 and obtained under the same experimental conditions, or measured by us (Fig. 3d,e). Interestingly, the three germline LCs did not show any significant proteotoxicity. In addition, we tested a further cardiotoxic LC (H18) belonging to the same germline family as tox153 (Fig. 3d), whose sequence was present in the valset. As expected, H18 caused a significant impairment of C. elegans pharyngeal activity, while none of the germline proteins affected the pharyngeal pumping rate (Fig. 3d, e).

Globally, these results experimentally validate our starting hypothesis that SMs are pivotal determinants of LC toxicity. Moreover, our in vivo analysis confirms the soundness of LICTOR predictions and the validity of key predictor variables identified by information gain as determinants of in vivo proteotoxicity.

## Discussion

Early diagnosis of AL is essential to readily apply therapeutic interventions and prevent permanent and fatal organ damage. However, AL is usually detected only once the symptoms reflecting advanced organ involvement occur, which results in poor patient prognosis. Moreover, although pre-existing MGUS is a known risk factor for AL, predicting whether MGUS patients will progress to AL remains an open, unsolved problem. The extreme sequence diversity of LCs responsible for AL, due to VJ recombination and SMs, further complicates this scenario. Consequently, to deepen our understanding of the AL determinants and ultimately foster early AL diagnosis, we investigated LC sequences with a known clinical phenotype, with the aim of devising a predictive tool that can flag toxic LCs (i.e. LCs responsible for the formation of toxic aggregates and AL development) in advance. To achieve this goal, we analysed a large dataset of toxic (tox) and non-toxic (nox) LCs of the λ isotype, the most frequent isotype in AL, following the hypothesis, also posed by other research groups39,40,41, that specific SMs can increase the propensity of LCs to cause AL. Therefore, we performed a statistical analysis of the distribution of SMs between tox and nox sequences. This analysis revealed that toxic LCs have significantly higher SM frequencies than non-toxic LCs (Fig. 1c). Based on these findings, we designed LICTOR, a machine learning approach using SMs to classify the LC phenotype. LICTOR achieved a specificity and a sensitivity of 0.82 and 0.76, respectively, with an AUC of 0.87, making it an unprecedented tool in early AL diagnosis. Interestingly, including LC germline VJ rearrangements as additional predictor variables in the LICTOR configuration did not improve prediction performance, further suggesting that, despite the prevalence of some VL germline genes in AL, SMs represent the critical driver of the disease.

LICTOR differs from approaches such as AGGRESCAN42, PASTA43, WALTZ44 and others45,46,47,48 that predict the aggregation propensity of a protein by identifying amyloidogenic regions. LICTOR, instead, aims at finding hotspots responsible for LC toxicity in AL amyloidosis, starting from the known clinical phenotype of LC sequences and following the assumption that SMs are the key determinants of LC proteotoxicity. Indeed, our approach uses an innovative encoding scheme to express LC sequences as the difference with respect to the germline. Then, this strategy is applied to extract sequence and structural features, and to investigate their role in the determination of LC proteotoxicity. Indeed, we showed that specific features (sequence AMP or structural MAP and DAP features) that provide the largest information gain to LICTOR harbour crucial information to accurately predict LC phenotype and can thus be regarded as effective AL molecular determinants. In fact, through the information gain feature selection process, we identified a set of features characterized by the strongest association with the LC phenotype, which, remarkably, were mainly located at the dimeric interface of the LC structure. This finding further emphasizes the key role of the structural context of SMs as drivers of LC proteotoxicity.

We also performed a comparison with AGGRESCAN, PASTA and WALTZ, using the aggregation propensities (PA for AGGRESCAN, PP for PASTA and PW for WALTZ) provided by the respective programs to construct three classifiers. Interestingly, in all three cases tox sequences are significantly more aggregation prone than nox sequences (for AGGRESCAN mean PA = −8.53 ± 4.2 vs −7.44 ± 3.8, p-value < 0.0001 unpaired t-test; for PASTA mean PP = −6.35 ± 1.08 vs −5.82 ± 1.15, p-value < 0.0001 unpaired t-test; and for WALTZ mean PW = 97.78 ± 0.83 vs 97.49 ± 0.89, p-value = 0.009 unpaired t-test). However, given the considerable overlap between the propensity distributions, classifiers based on aggregation propensity have a rather limited accuracy (AGGRESCAN accuracy = 0.59, PASTA accuracy = 0.68, WALTZ accuracy = 0.64). Nevertheless, these approaches indicate that toxic sequences are more prone to aggregate than non-toxic ones, a fact that is mirrored in LICTOR by the identification of SMs clustering at the dimer interface as drivers of proteotoxicity in AL.

Previous studies have analysed the fibril formation propensity of LC variable domains (VL), showing that destabilizing mutations at specific structural sites correlate with increased amyloid fibril formation49,50,51. Additional reports have also compared the stability and fibril formation propensity of variable domains of toxic LC sequences associated with AL and non-toxic ones associated with MM52,53, suggesting that specific mutations could induce a destabilization in toxic LCs. However, as pointed out in a recent analysis of full-length LCs associated with AL or MM27, despite significant differences in some properties such as the melting temperature (Tm), it is not possible to unequivocally differentiate between pathogenic and non-pathogenic LCs based on a single biophysical property. Only flexibility and susceptibility to protease cleavage emerged as discriminative factors of proteotoxicity.

Other reports have suggested that germline proteins are more stable than the corresponding pathological LCs39,54. Analysing the variable domains of the pathogenic light chain AL09 and its corresponding germline κI O18/O8, Baden et al. noticed that a non-conservative SM at the dimeric interface of AL09 (Y87H, according to the Kabat numbering scheme) induced an altered dimer interface, characterized by a 90° rotation with respect to the canonical homodimeric structure of the germline counterpart. Interestingly, the same position (position 99 according to our sequential numbering scheme, corresponding to position 87 in Kabat-Chothia numbering) was identified as one of the most important structural features (Fig. 2d and Supplementary Data 8) used by LICTOR to predict LC toxicity.

Cryo-EM structures of LC amyloid fibrils55,56 show an interesting rearrangement in the region comprising the intrachain disulfide bond; namely, in folded LCs these two cysteines connect parallel β-strands, while in the amyloid fibrils the two β-strands are antiparallel. These conformational rearrangements break the intrachain interaction between CDR1 and CDR3, as well as the intrachain interactions between FR2 and the end of FR3. Furthermore, the dimerization interface of the folded LC is disrupted in the fibrils, as the interface residues lie on opposite sides of the fibril layer. These findings are in line with our results, which suggest that SMs located at the LC homodimeric interface may impair the structural integrity of the protein–protein interface and/or induce local instability of the monomer, consequently triggering LC misfolding and the generation of toxic species.

The starting hypothesis that SMs are key determinants of AL, the accuracy of LICTOR and of our in silico findings were also experimentally confirmed in C. elegans, a validated in vivo model for assessing LC toxicity. We demonstrated that germline LCs are not able to induce proteotoxicity in vivo, validating the assumption that naive LC sequences acquire the toxic phenotype in AL during affinity maturation. Notably, as predicted by LICTOR, the toxic phenotype of an LC was abolished by reverting a single SM.

Taken together, these findings confirm the accuracy and robustness of our in silico approach in the identification of toxic and non-toxic LCs and suggest its usefulness as a diagnostic instrument for AL.

Machine learning relies on data. Larger datasets of LC sequences would, therefore, be beneficial for validation, as well as for the improvement of LICTOR accuracy. However, collecting large numbers of toxic LC sequences is difficult due to the low prevalence of the disease. We believe that the application of LICTOR as a possible diagnostic tool could encourage clinicians to obtain—and make available to the public—LC sequences of AL patients, thus increasing the size of LC sequence databases and consequently allowing improvement of LICTOR accuracy. Furthermore, other factors such as the increased protein dynamics of toxic LCs27 or the generation of LC glycosylation sites by SMs, may be included among the predictive features to improve LC toxicity classification, as suggested by previous reports51,57. This would not only improve the accuracy of LICTOR, but also deepen our understanding of AL determinants and shed light on the complex mechanism of AL development.

In conclusion, LICTOR represents the first method for the accurate prediction of LC toxicity from sequence, allowing the timely identification of high-risk patients, such as MGUS subjects likely to progress to AL. Using LICTOR can thus promote closer monitoring for AL development and foster early treatment and better patient prognosis. Finally, LICTOR may be used together with other recently proposed strategies, such as the differential recruitment efficacy of patient-derived full-length LCs by synthetic amyloid fibrils58, to predict the risk of AL development. Our approach may, furthermore, guide the development of novel predictive tools useful for other diseases, such as cancer, in which the prognosis may depend on SMs of specific tumour-linked proteins. LICTOR is available as a webservice at http://lictor.irb.usi.ch.

## Methods

### Dataset

The database used in the training was composed of 428 tox and 590 nox sequences of the λ isotype collected from AL-Base (http://albase.bumc.bu.edu). Furthermore, it contained 57 nox λ LC sequences that we collected at the Institute for Research in Biomedicine (IRB-DB), known to be non-toxic in the context of AL. The 1075 sequences were automatically aligned using a progressive Kabat-Chothia numbering scheme (http://www.bioinf.org.uk/abs/). According to this scheme, for example, the CDR1 of a given LC with Kabat-Chothia numbering 30A, 30B, 30C, 30D, 30E and 30F was assigned 31, 32, 33, 34, 35 and 36. For the ALBase sequences, germline information was taken from the database, while for IRB-DB LCs, the germline was assessed with an in-house script. Next, germline (GL) sequences were reconstructed using the IMGT database23.
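The progressive renumbering described above can be illustrated with a short sketch. It assumes that the numbering scheme is available as an ordered list of Kabat-Chothia labels and that, as in the example given in the text, the only insertion slots up to this point are 30A to 30F. The helper names and the scheme fragment are hypothetical; the full 125-position scheme is not reproduced here.

```python
import re


def label_key(label: str):
    """Sort key for a Kabat-Chothia label such as '30' or '30A'."""
    match = re.fullmatch(r"(\d+)([A-Z]?)", label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    return int(match.group(1)), match.group(2)


def progressive_numbering(labels):
    """Map each label of the scheme to a consecutive 1-based index."""
    ordered = sorted(set(labels), key=label_key)
    return {label: index for index, label in enumerate(ordered, start=1)}


# Hypothetical scheme fragment: positions 1-30 followed by six insertion
# slots after position 30 (an assumption matching the example in the text).
scheme = [str(i) for i in range(1, 31)] + [f"30{c}" for c in "ABCDEF"]
mapping = progressive_numbering(scheme)

print([mapping[f"30{c}"] for c in "ABCDEF"])
```

Because every sequence is mapped onto the same set of slots, a given index refers to the same structural position in all LCs, regardless of CDR length.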

The germline sequences were aligned with the same numbering scheme used for the LCs. Next, each LC in the dataset was compared with the corresponding germline to identify all SMs, with the differences encoded using an X for unmutated positions and the LC amino acid for SMs; this sequence was referred to as Smut. For example, an LC with sequence SYELTQPP and a corresponding germline with the sequence SYVLTQPP was encoded as XXEXXXXX since there is an SM (V→E) at position 3. To compare the presence of SMs in Smut at each position i in the Kabat-Chothia numbering scheme, the following four quantities were computed:

• toxiNM - the number of toxic sequences without an SM at position i;

• toxiM - the number of toxic sequences with an SM at position i;

• noxiNM - the number of non-toxic sequences without an SM at position i;

• noxiM - the number of non-toxic sequences with an SM at position i.
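The encoding and the four per-position counts listed above can be sketched as follows. This is a minimal illustration assuming that the LC and its germline are already aligned to the same length; the function names and the toy Smut sets are hypothetical.

```python
def encode_smut(lc: str, germline: str) -> str:
    """Encode an aligned LC as Smut: 'X' where it matches the germline,
    the LC amino acid where a somatic mutation is present."""
    if len(lc) != len(germline):
        raise ValueError("LC and germline must be aligned to the same length")
    return "".join("X" if a == g else a for a, g in zip(lc, germline))


def mutation_counts(smuts, position):
    """Return (mutated, non_mutated) counts at a 1-based position."""
    mutated = sum(1 for s in smuts if s[position - 1] != "X")
    return mutated, len(smuts) - mutated


# Example from the text: V -> E at position 3.
smut = encode_smut("SYELTQPP", "SYVLTQPP")
print(smut)

# Toy Smut sets (hypothetical) to illustrate the four quantities at i = 3.
tox_smuts = ["XXEXXXXX", "XXEXXAXX", "XXXXXXXX"]
nox_smuts = ["XXXXXXXX", "XXXXXAXX"]
tox_m, tox_nm = mutation_counts(tox_smuts, 3)
nox_m, nox_nm = mutation_counts(nox_smuts, 3)
print(tox_m, tox_nm, nox_m, nox_nm)
```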

### Statistical analysis

The fisher.test function in R version 3.5.1 with the arguments conf.int = TRUE and conf.level = 0.95 was used to assess significant differences in SMs between toxic and non-toxic sequences. The OR between toxiM/toxiNM and noxiM/noxiNM was computed as

$$\mathrm{OR}^{i}_{\mathrm{tox-nox}}=\frac{\mathrm{tox}^{i}_{\mathrm{M}}/\mathrm{tox}^{i}_{\mathrm{NM}}}{\mathrm{nox}^{i}_{\mathrm{M}}/\mathrm{nox}^{i}_{\mathrm{NM}}} \tag{1}$$

OR = 1 indicates that the event under study (i.e. the frequency of mutations at position i) is equally likely in the two groups (e.g. tox vs nox). OR > 1 indicates that the event is more likely in the first group (tox). OR < 1 indicates that the event is more likely in the second group (nox). The t.test function in R version 3.5.1 was used to evaluate whether the PDSM differed between the tox, nox and hdnox datasets. Data from C. elegans-based assays were analysed using GraphPad Prism 8.2.1 software by one-way ANOVA and Dunn’s post-test analysis. A p-value < 0.05 was considered significant.
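To make the per-position test concrete, the sketch below computes the odds ratio of Eq. (1) and a two-sided Fisher's exact p-value for a 2 × 2 table with rows tox/nox and columns M/NM. It is a pure-Python illustration, not the R code used in the study: the function names and the toy counts are assumptions. The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is the usual definition; note also that R's fisher.test reports a conditional maximum-likelihood estimate of the odds ratio, which can differ slightly from the sample ratio of Eq. (1).

```python
from math import comb


def odds_ratio(tox_m, tox_nm, nox_m, nox_nm):
    """Sample odds ratio of Eq. (1), written as a cross-product.

    Algebraically equal to (tox_m / tox_nm) / (nox_m / nox_nm);
    undefined (ZeroDivisionError) when tox_nm or nox_m is zero.
    """
    return (tox_m * nox_nm) / (tox_nm * nox_m)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    low, high = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables at most as likely as the observed one
    # (small relative tolerance to absorb floating-point noise).
    total = sum(
        prob(x) for x in range(low, high + 1) if prob(x) <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


# Toy counts (hypothetical): 3 of 4 tox and 1 of 4 nox sequences mutated.
toy_or = odds_ratio(tox_m=3, tox_nm=1, nox_m=1, nox_nm=3)
toy_p = fisher_exact_two_sided(3, 1, 1, 3)
print(f"OR = {toy_or:.1f}, p = {toy_p:.4f}")
```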

### Predictor variables used by the machine learners

Given a sequence, the following three families of features were extracted:

#### Amino acid at each mutated position (AMP)

From a sequence Smut, a list of predictor variables was extracted, each describing the type of amino acid added by the SM at a given position or the absence of a mutation at the position. Thus, each of these variables was a pair (position, amino acid), where we used the letter “X” instead of the amino acid at the positions for which no SMs were present.
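A minimal sketch of the AMP extraction from an Smut string follows. The "position-amino acid" label format is modelled on the feature name 49-A used in the Results and is an assumption, as are the function names.

```python
def amp_features(smut: str):
    """Return the AMP variables of an Smut string as (position, residue)
    pairs, with 1-based positions and 'X' marking unmutated positions."""
    return [(position, residue) for position, residue in enumerate(smut, start=1)]


def amp_labels(smut: str):
    """Human-readable labels, e.g. '3-E' for a mutation to E at position 3."""
    return [f"{position}-{residue}" for position, residue in amp_features(smut)]


labels_example = amp_labels("XXEXXXXX")
print(labels_example)
```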

#### Monomeric amino acid pairs (MAP)

LCs share a conserved 3D structure. Therefore, pairs of interacting residues were defined as amino acids having a distance between the respective Cβ atoms less than 7.5 Å in the X-ray structure (PDB ID: 2OLD).

#### Dimeric amino acid pairs (DAP)

Similarly, pairs of residues that interact at the LC–LC interface were defined using the 2OLD LC homodimeric X-ray structure. Two residues belonging to different chains were considered to interact if the distance between their Cβ atoms was less than 7.5 Å.
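The 7.5 Å Cβ distance criterion used for both MAP and DAP can be sketched as follows. The input format (a dictionary from position label to Cβ coordinates, already parsed from the structure) and the function names are assumptions for illustration. The sketch does not address choices that are not described here, such as the treatment of glycine (which has no Cβ atom), the exclusion of sequence neighbours, or how the resulting pairs were encoded as predictor variables.

```python
from itertools import combinations, product
from math import dist
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
CUTOFF = 7.5  # Å, Cβ-Cβ distance threshold from the text


def monomer_contacts(cb: Dict[int, Coord], cutoff: float = CUTOFF) -> List[Tuple[int, int]]:
    """Pairs of positions within one chain whose Cβ atoms are closer than cutoff."""
    return [(p, q)
            for (p, a), (q, b) in combinations(sorted(cb.items()), 2)
            if dist(a, b) < cutoff]


def dimer_contacts(cb_a: Dict[int, Coord], cb_b: Dict[int, Coord],
                   cutoff: float = CUTOFF) -> List[Tuple[int, int]]:
    """Pairs (position in chain A, position in chain B) closer than cutoff."""
    return [(p, q)
            for (p, a), (q, b) in product(sorted(cb_a.items()), sorted(cb_b.items()))
            if dist(a, b) < cutoff]
```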

### Machine learning algorithms

The Weka 3.8.1 (ref. 28) implementations of four machine learning algorithms (Bayesian network, logistic regression, J48 and random forest) were used to solve the classification task. For all algorithms, the default Weka parameters were used. The algorithms were evaluated by 10-fold cross-validation over the dataset. The performance of each algorithm was assessed in three stages: first, using a single family of features (AMP, MAP or DAP, for a total of three configurations); second, using pairs of families (e.g. AMP ∪ MAP, for a total of three configurations); third, using all three families together. This led to a total of 7 (feature configurations) × 4 (algorithms) = 28 prediction experiments. Moreover, each of the 28 experiments was performed with and without balancing the training set with SMOTE29 applied to the toxic sequences, so that the number of toxic instances was equal to the number of non-toxic instances in the training set of each of the ten cross-validation folds. This led to 28 × 2 (with/without SMOTE) = 56 total experiments.
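The experimental grid can be enumerated with a short sketch that reproduces the count of 56 experiments. It only lists the configurations and does not run Weka; the function names are hypothetical.

```python
from itertools import combinations, product
from typing import List, Tuple

FAMILIES = ("AMP", "MAP", "DAP")
ALGORITHMS = ("Bayesian network", "logistic regression", "J48", "random forest")


def feature_configurations(families: Tuple[str, ...] = FAMILIES) -> List[Tuple[str, ...]]:
    """All non-empty combinations of feature families: singles, pairs, then all three."""
    return [combo
            for size in range(1, len(families) + 1)
            for combo in combinations(families, size)]


def experiments() -> List[Tuple[Tuple[str, ...], str, bool]]:
    """Every (feature configuration, algorithm, use_smote) triple."""
    return list(product(feature_configurations(), ALGORITHMS, (False, True)))


grid = experiments()  # 7 configurations x 4 algorithms x 2 balancing options
```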

### Prediction performance

The various prediction algorithms were assessed by computing the following classification errors: (i) Type-I misclassifications, indicating non-toxic sequences misclassified as toxic (false positive, FP) and (ii) Type-II misclassifications, indicating toxic sequences incorrectly classified as non-toxic (false negative, FN). The correct classifications were instead indicated by the number of true positives, TP (a toxic sequence correctly classified) and true negatives, TN (a non-toxic sequence correctly classified). Based on TP, TN, FP and FN, the following metrics were used to evaluate the performance of our classifiers:

• Area under the receiver operating characteristic curve (AUC): The AUC is used to assess the performance of a two-class classifier (such as that in our study) and is equal to the probability that the classifier will rank a randomly chosen positive instance (in our case, a toxic sequence) higher than a randomly chosen negative instance (a non-toxic sequence). A random classifier has an AUC = 0.5, while the AUC is 1.0 for a perfect classifier.

• Sensitivity: Computed as TP/(TP + FN), this represents the percentage of toxic sequences correctly identified by the classifier.

• Specificity: Computed as TN/(TN + FP), this represents the percentage of non-toxic sequences correctly identified by the classifier.

• Accuracy: Computed as (TP + TN)/(TP + FP + TN + FN), this represents the overall percentage of correctly classified sequences.

• Balanced accuracy: Computed as (specificity + sensitivity)/2, this represents the arithmetic mean of sensitivity and specificity.

• F1 score: Computed as 2TP/(2TP + FP + FN), this represents the harmonic mean of the sensitivity and the precision, the latter being computed as TP/(TP + FP). F1 = 1 indicates perfect precision and sensitivity, while F1 = 0 represents the lowest possible value, reached if either the precision or the sensitivity is 0.
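The metrics listed above can be written compactly as a Python sketch. The confusion-matrix formulas follow the definitions in the list, with toxic sequences as the positive class; the AUC is computed from its probabilistic definition by comparing every toxic/non-toxic pair of scores, counting ties as one half, which is an assumption about tie handling. The function names are hypothetical, and the sketch assumes that every denominator is non-zero.

```python
from typing import Dict, Sequence


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Metrics derived from the confusion matrix (toxic = positive class)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def auc(tox_scores: Sequence[float], nox_scores: Sequence[float]) -> float:
    """Probability that a random toxic sequence is scored above a random non-toxic one."""
    wins = sum(1.0 if t > n else 0.5 if t == n else 0.0
               for t in tox_scores for n in nox_scores)
    return wins / (len(tox_scores) * len(nox_scores))


m = classification_metrics(tp=8, tn=6, fp=4, fn=2)
```

With these illustrative counts, the F1 score (16/22) equals the harmonic mean of the sensitivity (0.8) and the precision (8/12).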

### Youden index

The Youden (J) index was used to validate the effectiveness of the predictors and to find the optimal cut-off point separating toxic LCs associated with the disease from non-toxic LCs, using the following formula:

$$J=\max_{c}\left[\mathrm{Se}(c)+\mathrm{Sp}(c)-1\right] \tag{2}$$

where Se(c) and Sp(c) are the sensitivity and the specificity obtained at cut-off c.
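A minimal sketch of the search for the cut-off maximizing J is given below. It assumes that each sequence has a predicted toxicity score, that a sequence is called toxic when its score is greater than or equal to the cut-off c, and that the candidate cut-offs are the observed scores; none of these details is specified above, and the function name `youden_index` is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def youden_index(scores: Sequence[float],
                 is_toxic: Sequence[bool]) -> Tuple[float, Optional[float]]:
    """Return (J, cut-off) maximizing sensitivity + specificity - 1."""
    n_pos = sum(1 for y in is_toxic if y)
    n_neg = len(is_toxic) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both toxic and non-toxic sequences are required")
    best_j, best_c = -1.0, None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, is_toxic) if y and s >= c)
        tn = sum(1 for s, y in zip(scores, is_toxic) if not y and s < c)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # keep the lowest cut-off among ties
            best_j, best_c = j, c
    return best_j, best_c


j, cutoff = youden_index([0.2, 0.3, 0.7, 0.9], [False, False, True, True])
```

In this toy example the two classes are perfectly separated, so J reaches its maximum of 1 at a cut-off of 0.7.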

### Information gain feature selection

The InfoGainAttributeEval filter implemented in Weka 3.8.1 (ref. 34) was used to remove all features that did not contribute to the information available for the prediction of the sequence type. All features having an information gain less than 0.01 were removed. Given the computational cost of this procedure, this experiment was performed only for the best-performing algorithm and configuration identified in the previous 56 experiments. The full list of ranked features is shown in Supplementary Data 8.
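The selection criterion can be illustrated with a sketch that computes the information gain of a nominal feature as the reduction in class entropy, H(class) − H(class | feature), and keeps the features reaching the 0.01 threshold. This is a re-implementation for illustration under the assumption of nominal features, not the Weka code; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """H(class) - H(class | feature) for one nominal feature."""
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(features: Dict[str, Sequence[Hashable]],
                    labels: Sequence[Hashable],
                    threshold: float = 0.01) -> List[str]:
    """Keep the features whose information gain is at least the threshold."""
    return [name for name, values in features.items()
            if information_gain(values, labels) >= threshold]


# Toy data: one informative and one uninformative feature.
labels = ["tox", "tox", "nox", "nox"]
toy = {"pos3": ["E", "E", "X", "X"], "pos5": ["X", "X", "X", "X"]}
kept = select_features(toy, labels)
```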

#### Tox153 sequence

The LC was sequenced from circularized cDNA obtained from bone marrow cells as previously described59. First, 1 µL of circularized cDNA was mixed with 5 µL 5X Q5® Reaction Buffer (New England Biolabs), 0.5 µL 5X Q5® High GC Enhancer (New England Biolabs), 0.5 µL dNTP mix (25 mM), 1.25 µL primers (10 µM each) and 0.25 µL Q5® High-Fidelity DNA Polymerase (2 U/µL) (New England Biolabs) in a final volume of 25 µL. The sample was then denatured for 1 min at 98 °C, and 35 PCR cycles were performed under the following conditions: 98 °C (10 s), 67 °C (20 s) and 72 °C (40 s), followed by a final extension at 72 °C (2 min). Lambda light chains were amplified with the specific primers λ-CLA (5ʹ-AGT GTG GCC TTG TTG GCT TG-3ʹ) and λ-CLB (5ʹ-GTC ACG CAT GAA GGG AGC AC-3ʹ), and a library of unique sequences was obtained using a Zero Blunt® TOPO® PCR Cloning Kit (Life Technologies) and subsequent analysis of single colonies.

Acquisition, storage and use of biological samples were approved by the institutional review board (Comitato Etico Area di Pavia). Written informed consent was received from participants prior to inclusion in the study. The study was conducted in accordance with the Declaration of Helsinki.

#### Protein production and purification

Tox153, tox153V52L and tox153V52LA56G were custom expressed in a mammalian cell line (Expi293F), purified on an affinity column and analysed by SDS-PAGE and western blot by GenScript (New Jersey, USA). H6GL, H9GL, Tox153GL and H18 were expressed in a mammalian cell line (Expi293F), purified on HiTrap® LambdaFabSelect (GE Healthcare) and analysed by SDS-PAGE.

#### Effect of LCs on C. elegans

Bristol N2 nematodes were obtained from the Caenorhabditis Genetics Center (CGC, University of Minnesota, Minneapolis, MN) and propagated at 20 °C on solid nematode growth medium (NGM) seeded with Escherichia coli OP50 (CGC) as food. Worms were incubated with 100 µg/mL tox153 wild-type protein, tox153V52L or tox153V52LA56G (100 worms/100 µL) in 10 mM phosphate-buffered saline (PBS, pH 7.4)8,9. Hydrogen peroxide (1 mM) was administered under dark conditions as a positive control and 10 mM PBS (pH 7.4) as a negative control (vehicle). After 2 h of incubation with orbital shaking, worms were transferred onto NGM plates seeded with OP50 E. coli. The pharyngeal pumping rate, measured by counting the number of times the terminal bulb of the pharynx contracted over a 1 min interval, was scored 24 h later.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.