Exploring the sequence features determining amyloidosis in human antibody light chains

Light chain (AL) amyloidosis is caused by the aggregation of antibody light chains into amyloid fibrils. Many computational resources are available for predicting short aggregation-prone regions within proteins. However, predicting the amyloidogenic nature of a whole protein from sequence/structure information remains challenging. In the case of antibody light chains, the common architecture and known binding sites can provide vital information for predicting amyloidogenicity under physiological conditions. In this work, we compared classical sequence-based, aggregation-related features (such as hydrophobicity, presence of gatekeeper residues, disorder, β-propensity, etc.) calculated for the CDR, FR or VL regions of amyloidogenic and non-amyloidogenic antibody light chains and implemented the insights gained in a machine learning-based webserver called “VLAmY-Pred” (https://web.iitm.ac.in/bioinfo2/vlamy-pred/). The model shows a prediction accuracy of 79.7% (sensitivity: 78.7%; specificity: 79.9%) with a ROC value of 0.88 on a dataset of 1828 variable region sequences of antibody light chains. This model will be helpful for improving prognosis for patients likely to suffer from diseases caused by light chain amyloidosis, understanding the origins of aggregation in antibody-based biotherapeutics, large-scale in silico analysis of antibody sequences generated by next-generation sequencing, and the rational engineering of aggregation-resistant antibodies.

List of all single amino acid features. The first 49 features are taken from the literature [Gromiha et al., 1999] and the remaining alphanumeric values are AAindex IDs [Kawashima et al., 2000].

No.  Features*                                         Explanation
     Absolute solvent accessibility (ASA)              NetSurfP
4    Probability for α-helix                           NetSurfP
5    Probability for β-strand                          NetSurfP
6    Probability for coil                              NetSurfP
7    Number of Hot Spots (nHS)                         AGGRESCAN
8    Area of the profile above threshold (AAT)         AGGRESCAN
9    Total hot spot area (THSA)                        AGGRESCAN
10   Total area (TA)                                   AGGRESCAN
11   Normalized aggregation propensity (Na4vSS)        AGGRESCAN
12   Aromatic

Note: Shannon entropy is calculated for each position in the multiple sequence alignment that has more than 50% occupancy, which removes rare insertions/deletions. AL represents the amyloidogenic light chain dataset and Non-AL represents the non-amyloidogenic light chain dataset.

Supplementary Figure S2. Hydrophobicity of CDRs plotted against the percentage of gatekeeper residues in FRs for the amyloidogenic (♦) and non-amyloidogenic (□) light chain datasets. The amyloidogenic light chains are mainly present in the second quadrant (high hydrophobicity and low percentage of gatekeeper residues). The second quadrant contains 63.8% of the amyloidogenic light chains and 29% of the non-amyloidogenic light chains. The average values of "hydrophobicity of CDR region" and "percentage of gatekeeper residues for the FR region" were taken as the origin of the plot.

Supplementary Figure S3. Correlation between the features selected for the final model.

Supplementary Figure S4. Analysis of four additional features selected in the classification model for the amyloidogenic (red) and non-amyloidogenic (blue) light chain datasets.
Supplementary Table S5. The importance of the features measured after removing one feature or using one feature at a time.

Performance of aggregation-prone region prediction algorithms

Aggregation propensity and aggregation-prone regions
We analyzed the performance of aggregation-prone region prediction algorithms, such as TANGO and WALTZ, on the amyloidogenic and non-amyloidogenic light chain datasets (Supplementary Figure S5 (a)). A further classification of the antibody dataset into kappa (κ) and lambda (λ) isotypes showed similar results for the TANGO algorithm (Supplementary Figure S5 (b)). However, the WALTZ prediction for the lambda (λ) dataset showed a higher aggregation propensity for the amyloidogenic light chain dataset (Supplementary Figure S5 (c)). Since there was no significant variation in the aggregation propensity of the light chain variable region (VL) between the amyloidogenic and non-amyloidogenic datasets, we further analyzed the aggregation propensity of the aggregation-prone regions (APRs) (Supplementary Figure S6). As with the variable region analysis, the aggregation propensity of the APRs gave ambiguous results. TANGO predicted a slightly higher aggregation propensity for the amyloidogenic light chains, whereas WALTZ predicted almost the same aggregation propensity for amyloidogenic and non-amyloidogenic light chains (Supplementary Figure S6 (a)). For the kappa (κ) isotype, TANGO predicted a higher aggregation propensity for the amyloidogenic light chains, whereas WALTZ predicted a lower one. For the lambda (λ) isotype, TANGO and WALTZ both predicted almost the same aggregation propensity for amyloidogenic and non-amyloidogenic light chains (Supplementary Figure S6 (b,c)).
We further analyzed the position and average aggregation propensity of the APRs present in the variable region of the light chain (VL region) (Supplementary Table S6). Most of the APRs predicted by both algorithms were located in the CDR1-FR2, FR2-CDR2 and FR3 segments of the VL region (Supplementary Figure S7). TANGO also predicted significant APRs in CDR3-FR4. WALTZ predicted more APRs and a higher average APR aggregation propensity for the amyloidogenic light chains in CDR1-FR2 (percentage of APRs: 48% for amyloid and 43.1% for non-amyloid; average aggregation propensity of APRs: 71.5 for amyloid and 68.5 for non-amyloid) and FR2-CDR2 (percentage of APRs: 37.8% for amyloid and 28.4% for non-amyloid; average aggregation propensity of APRs: 55.7 for amyloid and 50.2 for non-amyloid). In contrast, TANGO showed a higher average aggregation propensity for the non-amyloidogenic light chains in CDR1-FR2 (42.4 for amyloid and 44.7 for non-amyloid) and FR2-CDR2 (53.2 for amyloid and 56.6 for non-amyloid). With TANGO, the percentage of APRs was higher in the amyloidogenic light chains for CDR1-FR2 (35% for amyloid and 32.3% for non-amyloid) and lower for FR2-CDR2 (21.4% for amyloid and 32% for non-amyloid). TANGO also predicted APRs in the CDR3-FR4 region, where amyloidogenic light chains have more APRs (23.6% for amyloid and 11.6% for non-amyloid) with a higher average aggregation propensity (56.7 for amyloid and 46.7 for non-amyloid).

Analysis of gatekeeper residues
The number of APRs, their positions and their aggregation propensity vary considerably between TANGO and WALTZ. However, both algorithms predicted APRs in the amyloidogenic as well as the non-amyloidogenic light chain dataset. The presence of gatekeeper residues in the flanks of an APR can substantially discourage amyloid formation. Hence, we further analyzed the presence of gatekeeper residues in the ±3 residue flanks of the APRs (Supplementary Table S7). The gatekeeper analysis shows that most of the APRs in the light chain datasets contain a gatekeeper in one flank, and that the non-amyloidogenic light chain dataset has a significantly higher percentage of such APRs than the amyloidogenic light chain dataset. The amyloidogenic light chain dataset also has a higher number of APRs with no gatekeeper in the ±3 residue flanks.
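The flank analysis described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the gatekeeper set (D, E, K, R and P), the 0-based half-open APR coordinates, and the function names `classify_apr_flanks` and `flank_class_percentages` are all introduced here for illustration.

```python
from collections import Counter

# Assumption: gatekeepers are the charged residues plus proline.
# The set actually used in the study is not stated in this section.
GATEKEEPERS = set("DEKRP")


def classify_apr_flanks(sequence, start, end, flank=3):
    """Classify one APR by which of its flanks contain a gatekeeper.

    `start` and `end` are 0-based, with `end` exclusive (hypothetical
    convention). Returns "none", "one" or "both".
    """
    # Flanks are clipped at the sequence ends by slicing.
    left = sequence[max(0, start - flank):start]
    right = sequence[end:end + flank]
    has_left = any(res in GATEKEEPERS for res in left)
    has_right = any(res in GATEKEEPERS for res in right)
    if has_left and has_right:
        return "both"
    if has_left or has_right:
        return "one"
    return "none"


def flank_class_percentages(records, flank=3):
    """Percentage of APRs in each flank class.

    `records` is an iterable of (sequence, start, end) tuples.
    """
    counts = Counter(
        classify_apr_flanks(seq, start, end, flank)
        for seq, start, end in records
    )
    total = sum(counts.values())
    return {
        label: (100.0 * counts[label] / total) if total else 0.0
        for label in ("none", "one", "both")
    }
```

Running `flank_class_percentages` separately on the APRs of the amyloidogenic and non-amyloidogenic datasets would give the kind of per-class percentages that the comparison relies on; the APR coordinates themselves would come from a predictor such as TANGO or WALTZ.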

Analysis of FR3 region
Interestingly, TANGO and WALTZ predicted a significant number of APRs in the relatively buried and conserved FR3 region. Our analysis in section 3.2.2 of the main text (Figure 2 (b)) has already shown that the FR3 region has a higher percentage of gatekeeper residues than the other regions. It is also the only region with an almost equal percentage of gatekeeper residues in the amyloidogenic and non-amyloidogenic light chain datasets. This suggests that APRs are inherently present in FR3 regions and are subdued by the higher percentage of gatekeeper residues. We further checked for the presence of gatekeeper residues in the ±3 residue flanks of the predicted APRs in the amyloidogenic and non-amyloidogenic light chain datasets. The amyloidogenic light chains had a significantly higher percentage of APRs, predicted by TANGO (amyloid: 45% and non-amyloid: 23.2%) and WALTZ (amyloid: 47.1% and non-amyloid: 6.8%), that do not contain any gatekeeper residues in the ±3 residue flanks. The analysis shows that FR3 regions, if exposed, might be potential nucleating regions in the light chain variable region of antibodies, and that the position of the gatekeeper residues relative to the APR(s) in the FR3 region can play an important role in the aggregation process.