Discovery of potent inhibitors of α-synuclein aggregation using structure-based iterative learning

Machine learning methods hold the promise to reduce the costs and the failure rates of conventional drug discovery pipelines. This issue is especially pressing for neurodegenerative diseases, where the development of disease-modifying drugs has been particularly challenging. To address this problem, we describe here a machine learning approach to identify small molecule inhibitors of α-synuclein aggregation, a process implicated in Parkinson’s disease and other synucleinopathies. Because the proliferation of α-synuclein aggregates takes place through autocatalytic secondary nucleation, we aim to identify compounds that bind the catalytic sites on the surface of the aggregates. To achieve this goal, we use structure-based machine learning in an iterative manner to first identify and then progressively optimize secondary nucleation inhibitors. Our results demonstrate that this approach leads to the facile identification of compounds two orders of magnitude more potent than previously reported ones.


Introduction
Parkinson's disease is the most common neurodegenerative movement disorder, affecting 2-3% of the population over 65 years of age [1][2][3][4][5] . The aggregation of αsynuclein (αS) has been associated with the initial neurodegenerative processes underlying this disease, in which the pathological accumulation of misfolded proteins results in neuronal toxicity beginning in the substantia nigra 1,2,4,6 . Since αS aggregates have been shown to exhibit various mechanisms of cellular toxicity 7,8 , major efforts are being invested into identifying compounds that can inhibit αS aggregation mechanisms [9][10][11][12] . This is a particularly pressing need given the lack of diseasemodifying therapies currently available to PD patients [13][14][15] . With the recent approval by the FDA of the first two disease-modifying drugs for Alzheimer's disease, aducanumab 16 and lecanemab 17 , approaches based on blocking secondary nucleation appear to be promising 18 .
Computational methods could be expected to reduce the time and cost of traditional drug discovery pipelines [19][20][21] . In this area, machine learning is rapidly emerging as a powerful drug discovery strategy 22 . To explore the potential of this strategy in drug discovery programs for Parkinson's disease and other synucleinopathies, we describe here a machine learning approach to explore the chemical space to identify compounds that inhibit the aggregation of αS. Our starting point is an approach that combines docking simulations with in vitro screening, which was recently employed to identify a set of compounds that bind to the fibril structures of αS, and prevent the autocatalytic proliferation of αS fibrils as a result 23 . Here, we used this initial set of compounds as input for a structure-based machine learning approach to identify chemical matter that is both efficacious and represents a significant departure from the parent structures, providing compounds that conventional similarity searches would have failed to efficiently identify.
This approach is based on the lessons learned using chemical kinetics about the importance of secondary nucleation in αS aggregation [24][25][26] . Because of the autocatalytic nature of this process, structure-based methods could be expected to effectively target the catalytic sites on the surface of αS aggregates 23 . As we show here, the implementation of this idea within an iterative machine learning procedure 4 leads to the identification and optimisation of compounds with high hit rates and great potency.

Results
Components of the machine learning method. The machine learning approach used here consists of 3 main components 27 : (1) the experimental data, i.e. a readout of the potency of the compounds in an aggregation assay, (2) the variational autoencoder required to represent the compounds as latent vectors, and (3) a model for training and prediction using these vectors and the assay readouts.
For component 1, we used a chemical kinetics assay 9,28,29 that provided both the initial data for the model training and the data that were iteratively fed back into the model at each cycle of testing and prediction. This assay identifies the top compounds that inhibit the surface-catalysed secondary nucleation step in the aggregation of αS.
For component 2, we used a junction tree variational autoencoder 30 , pre-trained on a set of 250,000 molecules 31 enabling accurate representation of a diverse population of molecular structures. Using this approach, SMILES strings were standardised using MolVS 32 and converted into latent vector representations.
For component 3, we used a random forest regressor (RFR) with a Gaussian process regressor (GPR) fitted to the residuals 33,34 of the RFR, with both regressors using the latent vectors as training features. The RFR provided the highest performance compared to other combinations of multi-layer perceptrons (MLPs), GPRs and linear regressors (LRs) in terms of R 2 score, mean absolute error (MAE) and root mean square error (RMSE). Performance and parameters are shown in Figure S1 and Table S1, respectively. Combining the RFR and GPR provided only a marginal improvement in the metrics of the RFR alone, but crucially enabled leveraging of the associated uncertainty measure of the GPR when ranking molecules during acquirement prioritisation 27 . Tuning the weighting applied to this uncertainty measure allowed a ranking based on both the predicted potency of the molecules and the uncertainty of that prediction. Component 3 was then trained on the 161 initial 5 experimental data points (see below). The best molecules predicted by the model were then tested in the same assay and the results fed back into the model in an iterative fashion (~55-65 new molecules tested at each iteration). The molecules used at each stage of the project are illustrated in Figure S2, together with the structures of the most potent hits at each stage. An overview of the pipeline is shown in Figure 1.
Initial set of small molecules. The initial set of molecules was identified via docking simulations to αS fibrils (see SI), followed by similarity searches around molecules that performed well in the chemical kinetics assay to identify further candidates 23 . The docking screening was carried out using the consensus strong binders predicted by AutoDock Vina 35 and Openeye's FRED [36][37][38] software.
The first round of in vitro experiments was performed using 68 lead compounds identified in an in silico structure-based docking study carried out previously 23 . In that study, the binding site encompassing residues His50−Lys58 and Thr72−Val77 was selected due to its propensity to form a pocket according to the Fpocket software 37 ( Figure S3A), and its mid to low solubility according to CamSol 41 ( Figure S3B).
Additionally, His50 is predicted to be protonated below the pH value (5.8) at which αS secondary nucleation more readily occurs 42 , which may be significant for initial interactions. 79 lead molecules were identified by docking of 2 million molecules with optimal CNS-MPO 43 properties using Autodock Vina to target the selected binding pocket ( Figure 1A). To increase the confidence of the calculations, the top-scoring 100,000 small molecules were selected and docked against the same αS binding site, using FRED 36 . The top-scoring, common 10,000 compounds in both docking protocols were selected and clustered using Tanimoto clustering 44  Subsequent experiments to test these predicted binders in aggregation assays identified 4 active compounds 23 labelled molecule 48, 52, 68 and 69, referred to as the 'docking set', ( Figure 1A). Here, using the Tanimoto similarity metric between Morgan Fingerprint representations (radius = 2, nbits = 2048) of the molecules, 2 similarity 6 searches were then carried out using these 4 structures as starting points ( Figure 1B).
Different Tanimoto similarity thresholds were used to specify molecule subsets for testing, from initial analogue searches to the creation of a library to screen from. As such a similarity value >0.5 was used for closely related molecules, >0.4 for loosely related molecules and >0.3 for very loosely related molecules. While this use of a structurally related screening library constrains the models ability to generalise, the lack of diversity in terms of hits also makes it unlikely for the model to perform well in chemical space significantly divergent from this region. We are thus carrying out an exploitation strategy here. We remove the need for a curated screening library in a parallel work by utilising generative modelling and reinforcement learning 45 , allowing for both exploitation and exploration strategies. A selection of closely related molecules (Tanimoto similarity > 0.5) to the parent compounds (referred to as the 'close similarity docking set', Figure 1B and Figure S2B) was tested in the aggregation assay. The hit selection was made according to a cut off corresponding to a normalised half-time of the aggregation (t1/2) of 2 times that of the negative control. This step was then followed by a larger selection of compounds with a looser cut-off of structural similarity (Tanimoto similarity > 0.4) to the parent compounds (referred to as the 'loose similarity docking set', Figure 1B). Although new hits featured amongst this set, the hit rate was low (4%), and both molecules 48 and 52, which had initially appeared the most promising of the parent structures, yielded poor results. From the 29 molecules related to molecule 48 in the loose similarity docking set, none were hits, while from the 24 molecules related to molecule 52, only 2 were hits. The functional range of molecules 48 and 52 appeared narrowly limited around the chemical space of the parent structures. Molecule 69 yielded 1 hit from 16 molecules. Overall, the hit rate from the loose similarity docking set was less than a quarter of that of the close similarity docking set and involved testing 3 times as many compounds.
These results suggest that it would be challenging to further explore the chemical space using conventional structure-activity relationship (SAR) techniques without significant attrition, since the hit rate worsened as the similarity constraint to the hits was loosened. To overcome this problem, the compounds resulting from these 7 experiments were then used as input for a machine learning method for an iterative exploration of the chemical space ( Figure 1C). The similarity searches removed the most obvious targets of the machine learning approach, but also increased the size of the dataset available for training. The training set, however, remained small by typical machine learning standards, consisting of 161 molecules. Since training sets of this size are common in early-stage research, a further aim of this work was to demonstrate that machine learning can be used effectively even in such data sparse scenarios.
Iterative application of the machine learning approach. One of the issues with applying machine learning to a data sparse scenario is that predictions are likely to be overconfident. While this problem can be addressed to an extent by utilising Gaussian processes, a complementary strategy is to restrict the search area to a region of chemical space that is more likely to yield successful results. To this end, a structural similarity search of the 4 hit molecules in the docking set was carried out on the 'clean' and 'in stock' subset of the ZINC database, comprising ~6 million molecules. Any molecules showing a Tanimoto similarity value of > 0.3 to any of the 4 structures of interest was included. This low threshold for Tanimoto similarity was intended to narrow the search space but without being overly restrictive of the available chemical landscape, yielding a dataset of ~9000 compounds which comprised the prospective 'evaluation set'. The distribution of this evaluation set in terms of the predicting binding energies is shown in Figure S4A.
Different machine learning models were initially trialled against the docking scores calculated for the evaluation set as a test of the project feasibility, and these models were then tuned on the much smaller aggregation data set. The best performing set up, the RFR-GPR stacked model, was then trained on the whole aggregation data set and used to predict the top set of molecules (see Machine Learning Implementation in Supplementary Information, and Figures S1, S5 and S6). For this work, the t1/2 for the light seeding assay was used as the metric of potency to be used in machine learning because of its robustness. For comparison, the amplification rate is more susceptible to small fluctuations in the slope of the aggregation fluorescence trace 23 ( Figure S7).
Molecules that achieved a t1/2 2-fold greater than that of the negative control under standard assay conditions (see Methods) were classed as hits 46 . The algorithm was run repeatedly from different random starting states and those molecules that 8 appeared in the top 100 ranked molecules more than 50% of the time (64 molecules) were chosen for purchase (first iteration). In this first iteration, there was an inherent bias towards the structure of molecule 69 in the dataset given the relative population sizes ( Figure S2A), but with the caveat that many of these structures were only loosely related to the parent (Tanimoto similarity < 0.4). Many of the hit molecules came from this group, suggesting chemical departures from the parent structure.
The dynamic range within the aggregation dataset in terms of potency was large, in that a majority of the molecules had no effect on aggregation, while initial docking hits exhibited relative t1/2 of up to 4-5 times that of the negative control (limited by the length of the experimental run) at 25 µM. Molecules then found via machine learning produced a relative t1/2 of ~4-5 at up to 8 fold lower concentration (3.12 µM, 0.3:1 molecule:protein) than that carried out in the initial screening (25 µM, 2.5:1 molecule:protein). This compares favourably with previous molecular matter tested in a less aggressive seeded aggregation assay such as the flavone derivatives, apigenin, baicalein, scutellarein, and morin which achieved relative t1/2 of 1-2 at a stoichiometry of 0.5:1 molecule:protein 9 . Anle-138b 12 is another example of a well-characterised small molecule inhibitor, which was also taken into clinical trials, whose relative t1/2 is 1.22 (Figure 2) at a ratio of 2.5:1 molecule:protein in the assay used in this work, which is significantly lower than any of the molecules discovered using the strategy employed here.
After the first iteration, the compound data were pooled together to extend the training set and a further 2 iterations were carried out, adding the resultant data to the training set at each iteration. This was followed by a fourth and final iteration trained on low dose (3.12 μM) data of all the previously obtained molecules. Example kinetic traces for a molecule from the fourth iteration are shown in Figure 2A. The molecules are labelled according to iteration number and hit identifier within that iteration. For example I4.05 is the fifth hit (05) within iteration 4 (I4). The dose-dependent potency in the aggregation assay was investigated (Figures 2A and S8) with all hit molecules exhibiting substoichiometric potency. For comparison Anle-138b is also shown.

Figure 2B
shows an approximate overall rate of aggregation at different concentrations of I4.05, Anle-138b and the parent molecule. This approximate rate was taken as 1/t1/2, and fitted to a Hill slope. A kinetic inhibitory constant (KIC50), the concentration of molecule at which the t1/2 is increased by 50% with respect to the control, as defined previously 46 , was then derived. The KIC50 values for the hits were in the range of 0.5-5 μM, which compare favourably with the parent of the hit molecules The elongation rate was largely unaffected in the presence of molecules at any concentration ( Figure 2C). This was expected given the designed mechanism of action of the small molecule. It was also reassuring, since compounds that inhibit elongation may increase the population of oligomers 46 , which are considered the most damaging of the aggregate species in vivo 7,8 . Then, using the amplification and elongation rates derived from Figure 2A,C, the oligomer population over time was calculated 9 (see Methods). These calculations are shown in Figure 2D for I4.05 and Figure S8 for the rest of the hits. All hits demonstrated a dose-dependent delay and reduction of the oligomer peak. Across all metrics, I4.05 performed significantly better than Anle-138b and the parent molecule at substoichiometric ratios, as do all of the hits obtained in previous iterations (Figures S8 and S9).
The aggregation data from the first 3 iterations are also shown in Figure 2E. Of the 64 molecules from iteration 1, 8 were strong hits, representing a hit rate of 12.5%, the second iteration showed a further increase, with 12 strong hits representing a 18.8% hit rate and the third iteration, with 12 hits, exhibited a hit rate of 21.4%. These hit rates represent an order of magnitude improvement over HTS (~1%) and, remarkably, an overall 45% improvement over the combined similarity search hit rates, which removed the most likely hit candidates. The potency of the machine learning hits was significantly higher on average than those identified by the similarity searches ( Figure   S10A), without compromising the CNS-MPO scores ( Figure S10B). The flow of molecules derived from each parent in terms of positives and negatives over the course of the project is illustrated in Figure 2F. The accumulated training data from all stages of the project for all molecules in terms of half time distribution is shown in Figure S4B and S4C.
Given that αS aggregation and toxicity has also been linked to membrane interactions 7,47 a parallel investigation was carried out with a lipid induced aggregation assay ( Figure S11) which was used as a validation of the molecules rather than for machine learning optimisation. The tested hit molecules also showed strong efficacy in this assay.  Figure S13A. The stacked RFR-GPR model assigned low uncertainty to areas of the chemical space proximal to the observed data, and the corresponding acquirement priority mirrored this when trained on the aggregation data ( Figure S13B-D). This figure also illustrates how the uncertainty weighting could be altered during the ranking, depending on how conservative a prediction was required. A drawback to a high uncertainty penalty was that the model remained in the chemical space it was confident in, while a lower uncertainty penalty ensured reasonable confidence of hit acquirement while still exploring the chemical space.
The changes in similarity of the hits to the parent structures are shown in Figure S14.
The similarity of the molecules to their parent structure dropped for all structures at successive stages of the investigation, reaching its lowest point at the iterations of the machine learning approach. The more potent hits mostly retained the central ring and benzene substituent of molecule 69 albeit with the addition of polar groups to the benzene ring, but featured significant alterations to the rest of the scaffold. For example, from iteration 1, I1.01 replaced the fused ring substructure of molecule 69 with a single substituted benzene ring, while I1.02 replaced it with a substituted furan ring, and subsequent iterations saw more complexity introduced. These changes were reflected in the Tanimoto similarity values, which were at the lower end of what was permitted in the evaluation set, 0.3 being the cut off. It was evident from this result that parts of the substructure were important to retain for potency, which the model did effectively while also identifying alterations in the rest of the scaffold that enhanced the potency considerably beyond that of the parent.
The observation that the QSAR model converges on the structures from two areas of the UMAP space related to structure 69 was encouraging in that it suggested the models were learning useful information and not selecting at random. While we have not tested a random set of molecules due to prohibitive resource cost, we do note that if a random selection of molecules were taken from the accumulated training data from all stages of the project, its hit rate (11%) would be lower than that of iterations 1, 2 and 3 on average. Though performance improves with additional data the QSAR performance in terms of R 2 remains modest ( Figure S1), but this is in part due to sparsity of training data. We would anticipate improvement if this approach could be implemented at medium scale with correspondingly more complex QSAR models, and we have an indication of this from trials of the this model set up against the docking scores of the evaluation set, where performance in terms of R 2 score is 3 fold higher for a slightly larger dataset.

Figure S15
shows 2D UMAP representations of the tested molecules, with the latent vector clustering indicated by colour and the SHAP clustering indicated by shape.
From the UMAP representation, we note that the SHAP clustering identified clusters more effectively than the hierarchical clustering. The SHAP values for each feature show the importance of that feature in the interpretation of potency, and this in turn could be used to identify which substructures within the molecules are relevant for potency by observing the structures that recurred in each cluster. For example, Figure   S15 shows the top dimensions of each SHAP cluster, revealing that dimension 24 at least partly encoded for the key sub-structure 3,5-pyrazolidinedione, which was present in every molecule in cluster α and a significant proportion of cluster b, while dimension 26 at least partly encoded for the key sub-structure of cluster d, a chromenone fused ring system which was present in every other molecule in the cluster. This confirmed the hypothesis previously put forward 30 that in a junction tree variational autoencoder, the latent space encoding preserved the key features of each molecule. Molecules which were clustered together shared many molecular substructures in common.

Measurement of binding affinity. A series of validation experiments were carried out
on the most potent hits from the machine learning iterations. We first tested the binding to fibrils using surface plasmon resonance (SPR, see Methods) under different buffer conditions. The results for molecule I4.05 vs Anle-138b are shown in Figure 3. The proposed mechanism of action is the binding of molecules to the fibrils thereby blocking nucleation sites for further aggregation. Support for this mechanism of action comes from the observations that the molecules function at significantly substoichiometric ratios, discounting monomer interactions, and also show negligible effect on elongation. Covalent interactions can also be discounted, as no mass change is observed of the αS monomer by mass spectrometry. The large effect observed in an assay that isolates secondary nucleation as the dominant mechanism implies that the molecules are specifically affecting this step, and the substoichiometry implies that the molecules must be interacting with the fibrils which are present in nM monomer equivalents at the start of the aggregation.
Proof of binding and evidence for this potential mechanism are shown by SPR in Figure 3. Figure 3A shows a schematic representation of molecule binding to the binding pocket targeted during the initial docking simulation. Figure 3B shows SPR response curves for a concentration range between 0.3 nM and 1.1 μM of I4.05, while Figure 3C shows the same experiment utilising Anle-138b from 1.1 μM to 5 μM. The binding was tested under the conditions of the light seeded assay, pH 4.8, and also at pH 8, allowing direct comparison to the seeding assay conditions of Aß42, which were tested as a control in Figure 3D. αS is highly charged at neutral pH and has a PI of 4.7 52 . It therefore requires a pH in this region to render the protein uncharged in order to aggregate on an experimentally accessible timescale under quiescent conditions, whereas Aß42 is highly aggregation prone and requires high pH to prevent it aggregating too rapidly 46 . At both pH values, I4.05 exhibited binding to αS fibrils, with kinetic fits giving KD values of 68 nM at the lower pH and 13 nM at the higher pH. The data for Anle-138b showed no response for pH 4.8 and so no KD could be obtained, while at pH 8 an approximate KD of 8.1 µM was obtained. It is evident that the 2 orders of magnitude improvement in KIC50 of I4.05 compared to Anle-138b was matched by a similar degree of improvement in terms of binding efficacy. Figure 3D shows that I4.05 has no effect on the seeded aggregation of Aß42, nor does it bind effectively to Aß42 fibrils, which suggests that this molecule is not a promiscuous aggregation inhibitor between different amyloidogenic proteins.
Inhibition of aggregation using brain-derived seeds. While this result was encouraging, with the recent determination of the pathological αS fibril structure 39 , it became clear that the recombinant in vitro fibril structure we had employed for computational and experimental work was different to that found in the brains of Parkinson's disease patients. To test whether these molecules might work against patient-derived fibrils, these molecules were tested in an RT-QuIC assay (Figure 4) that employs brain samples from patients suffering with Dementia with Lewy Bodies (DLB). The fibril structure found in DLB was found to match that found in Parkinson's disease 39 .
14 The RT-QuIC assay was initially introduced as a diagnostic assay 53, 54 , showing distinct aggregation curves in the presence of brain material derived from different pathologies 55 . In this case, we use it to test the ability for these molecules to slow the aggregation of αS induced by DLB brain material. As a negative control, samples from patients with a tauopathy (corticobasal degeneration, CBD) were also used, as these did not induce αS aggregation as no αS seeds were present (Figure 4A, B). The conditions are different to those initially screened, as this assay was carried out at pH 8 and utilised shaking to induce aggregation. This is a more challenging paradigm for the molecules to function in as multiple aggregation processes occur in tandem 42 . In addition to secondary nucleation from the fibril surfaces, fragmentation of the fibrils induced via shaking results in more fibril ends for elongation, which in turn provides more fibril surface for secondary nucleation.
Despite these challenges, and the different fibril structure present, the molecules still function well in inhibiting aggregation, and still at substoichiometric ratios ( Figure 4C).
There is again a clear improvement for the hits over Anle-138b, which in fact appears to accelerate aggregation in this example, and the parent molecule, although the ranking of the hits in terms of efficacy is altered compared to the screening assay. To understand these results we note that there is a similarity in the binding pockets in the structures 6CU7 (recombinant) and 8A9L (brain derived) ( Figure S16). We currently do not know whether or not this similarity is serendipitous, but binding pockets with similar features can also be observed via cryo-EM in the MSA-I and MSA-II fibril folds as well as the Lewy fold, with an unresolved species bound within the pocket 39 .
To account for differences in brain samples and also investigate potential efficacy against MSA derived brain material, we tested a single concentration of the same selection of molecules against 3 neuropathologically confirmed MSA brain samples (Figure S17A, C) and 2 further DLB brain samples (Figure S17A, D). As a further negative control, a sample with no seed was tested, to determine the degree of spontaneous nucleation in the absence of brain material ( Figure S17B). Aggregation in the negative control is effectively inhibited by all the ML molecules, given αS is likely to assume the 6CU7 polymorph in this condition, and not by Anle-138b which accelerates aggregation. It should be noted however that the CBD samples are the

Oligomer quantification by micro free-flow electrophoresis.
Having observed that molecule I3.02 was the most broadly effective in the RT-QuIC assay, an investigation was carried out to directly measure the oligomeric species formed during the reaction.
This was achieved using microfluidic free-flow electrophoresis (µFFE) 56 , a technique optimised using similar conditions to that used in the RT-QuIC assay, albeit at significantly higher αS concentration (100 µM). The results of this are shown in Figure   5. Aggregation time courses were tracked using AlexaFluor™ 488 labelled N122C rather than ThT. Figure 5 shows a schematic of the approach, where samples were extracted from an aggregation time course, centrifuged to remove insoluble aggregates, and finally submitted to µFFE. The degree of deflection and the photon count of each particle are proportional to the size and charge of the biomolecule. The former allows the separation of monomers from oligomers and the latter gives a measure of the number and size of the oligomers at a particular time point in the presence of different inhibitors. Oligomer electrophoretic mobility (μo) for an oligomer comprised of nm monomer units is proportional to oligomer charge (qo) and inversely proportional to oligomer hydrodynamic radius (ro) and so can be described by 56 where v is a scaling exponent linking qo with nm. Approximating the oligomers as spherical species yields 56 where the oligomer electrophoretic mobility is defined only in terms of the monomer number (nm) and hydrodynamic radius (rm), and the scaling exponent v* = v -1/3. The identification of inhibitors of αS aggregation based on chemical kinetics approaches has advanced to the point that specific steps in the aggregation process, including primary nucleation and secondary nucleation, can be targeted in a reproducible way 9,28,29 . The mechanism targeted in this work is the surface-catalysed secondary nucleation step, which is responsible for the autocatalytic proliferation of αS fibrils. In a recent initial report, initial hit molecules identified via docking simulations were shown to bind competitively with αS monomers along specific sites on the surface of αS fibrils 23,24,57 . Specific rate measures and other aggregation metrics were derived from these experiments allowing quantitative and reliable comparisons between molecules in terms of SAR and offering metrics to optimise structures of interest 9, 46 . This has been augmented with tests against diseased brain material and detailed, experimental fibril binding and oligomer flux analyses.

Discussion
The aim of this work was to develop a machine learning approach to drug discovery for protein aggregation diseases that could improve both the hit rate of the in vitro assays employed and provide novel chemical matter more efficiently than conventional approaches. As of the first iterations, the hit rate of the approach using initial hit compounds identified via docking simulations was an over 20-fold improvement over typical HTS hit rates (~0-1%) 58 . These structures also represent discoveries that could not have been obtained by staying close in chemical space to the parent structure, as would have been dictated by similarity search approaches.
There were ~4000 molecules in the test set that have Tanimoto similarity values in the range of these hits, and all of these would potentially have had to be screened to locate these hits using similarity searches alone, as demonstrated by the looser similarity search approach which exhibited a comparatively poor hit rate (4%) despite more conservative structural alterations to the parent hits. The machine learning method was therefore able to supply a degree of novelty as well as an improved hit rate.
A limitation of this approach is the requirement to select molecules from a pre-existing library. To resolve this limitation generative modelling combined with reinforcement learning has been applied in a parallel project to remove the need for a library to screen from 45,59 . A second limitation is the focus on one assay metric of interest as a learning parameter. Addressing this limitation will involve future work on multi-parameter optimisation, which is a challenging area in rapid development [60][61][62] . Another topic of great interest in drug discovery approaches based on machine learning besides potency prediction is the prediction of pharmacokinetics and toxicity 63 . It could be possible to achieve this multi-parameter optimisation utilising multiple models in parallel and then employing a joint ranking metric, or architectures that screen for individual metrics in series, although this has primarily been demonstrated with predicted chemical properties such as clogP and QED rather than experimental results [60][61][62] . The molecules in this work were derived from a set that passed CNS-MPO criteria in the initial docking simulation, and so the CNS-MPO score of the whole aggregation inhibitor set is relatively favourable with most hit molecules exceeding the common cut off value of 4 43 ( Figure S10B).
It would have been preferable to begin this approach using seeds derived from relevant pathological brain material, but this was not possible, as neither structures nor samples for these were available at the start of this study. Nonetheless, we have demonstrated that these molecules still function against disease relevant inducers, likely because of the degree of commonality between the binding sites of the fibril 18 polymorphs. The complete loss of function against another aggregation prone protein, Aß42, does however suggest specific functionality against αS alone.

Conclusions
The results that we have presented illustrate a drug discovery approach that involves an iterative structure-based machine learning strategy to generate potent protein aggregation inhibitors. The resulting hits offer a significant improvement in potency over the parent and competitor molecules and represented a major structural departure from them. We anticipate that using machine learning approaches of the type described here could be of significant benefit to researchers working in the field of protein misfolding diseases, and indeed early-stage drug discovery research in general.

Compounds and chemicals
Compounds were purchased from MolPort (Riga, Latvia) or Mcule (Budapest, Hungary) and prepared in DMSO to a stock of 5 mM. All chemicals used were purchased at the highest purity available.
Following transformation, the competent cells were grown in 6L 2xYT media in the presence of ampicillin (100 μg/mL). Cells were induced with IPTG, grown overnight at

Determination of the αS amplification rate constant
In the presence of low concentrations of seeds (~ nM), the fibril mass fraction, M(t),

22
The parameters and represent combinations for the effective rate constants for primary and secondary nucleation, respectively, and are defined as 65 where kn and k2 denote the rate constants for primary and secondary nucleation, respectively, and nc and n2 denote the reaction orders of primary and secondary nucleation, respectively. In this case, nc was fixed at 0.3 for the fitting of all data (corresponding to a reaction order of n2 = 4), and k2, the amplification rate, is expressed as a normalised reduction for αS in the presence of the compounds as compared to in its absence (1% DMSO).

Determination of the αS oligomer flux over time
The theoretical prediction of the reactive flux towards oligomers over time was calculated as 9,65 where r+ = 2k+m(0) is the apparent elongation rate constant extracted as described earlier, and m(0) refers to the total concentration of monomers at the start of the reaction. The sklearn library hierarchical clustering method was used to cluster latent vectors for comparison, with initial cluster number set to 7 72 .

Surface plasmon resonance
All work was carried out using Biacore T200 at 25 °C.

Preparation of human brain tissue homogenates
Deidentified postmortem human brain specimens used in the RT-QuIC assay are referenced in Table S2. These specimens were obtained from the NIH

Docking and Machine Learning Implementation
A full description of the initial docking approaches can be found in the previous work 1 , using AutoDock Vina 2 and FRED 3 docking software, but is also explained in overview here. As described in the main text, the binding site encompassing residues His50−Lys58 and Thr72−Val77 on PDB 6CU7 4 was selected due to its propensity to form a pocket according to Fpocket 5 software and its simultaneous mid to low solubility according to CamSol 6 ( Figure S3). Additionally, a key histidine residue in this site was predicted to protonate below the pH value where αS more readily aggregates (pH 5.8).
A binding box was selected that had size 12 Å by 12 Å by 9 Å centred at 10.00 Å, 9.89 Å, 11.52 Å on the 6CU7 PDB, encompassing the site of interest. The target protein was left rigid, while the ligand was flexible, able to translate and rotate (including rotation of internal bonds). We prepared (added hydrogens) the target protein using and this is what is shown in Figure S5 (though improvement is not drastic as further data is added, possibly due to the relative ease with which strong dockers are selected).
Different uncertainty penalties were tested during this process. We found that a low uncertainty penalty produced better results by removing the most overconfident predictions without placing too many limitations on the model. At the early stages most predictions with low uncertainty were those with predicted binding scores close to the mean of the training set. An excessive uncertainty penalty during these stages would cause the model to only predict molecules that it was confident in, which were also likely to be mild.
The same process was utilised using different parts of the molecular feature set (the latent vector consists of a tree vector representation of clusters within a molecule, plus a graph representation of the molecule), and it was found that GPR performance metrics were better when using the molecular graph alone compared with using the entire representation. In general, it is to be expected that fitting fewer features to a predicted value is easier for a regressor to achieve and so higher scores are obtained.
However, a better average R 2 score across the data set does not necessarily lead to a better result in terms of the actual molecules picked, and we found using the full representation led to more hits being identified ( Figure S5).
A snapshot of the results of this testing is shown in Figures S5 and S6. Figure S5 demonstrates 2 points: the performance was slightly improved using the Matérn kernel in place of the RBF kernel both in terms of overall hit selection and performance improvement with increasing training set size, and the full-length molecular representation gave a significant boost in terms of number of hits selected vs the truncated representation, despite lower R 2 scores. These results also provided some evidence that Gaussian process learning might work reasonably effectively even in this data sparse scenario albeit at a modest level. It was expected that fitting experimental data would prove more challenging, however, and so a boost in performance was sought for that would not compromise the simplicity of the model, through use of the coupled RFR-GPR model. Correlation values of 0.6-0.7 were 47 obtained using this set up on docking energies and a large portion of the dataset (4000 molecules), and this fell to between 0.2 and 0.3 for the aggregation data ( Figure S1), which while low was encouraging given the much smaller dataset and noisier data. MLP were trialled with their default parameters either alone or in conjunction with a GP, but showed poor performance so were not further investigated. (B) GP and RF models were the best performing and so were subjected to hyperparameter optimisation via grid search cross validation using the R 2 score as the optimisation metric. The best performing parameters are shown. The performance of these models is shown in Figure S1.