The early-stage drug discovery process relies on target identification, assay development and high-throughput screening (HTS) to identify lead compounds for chemical optimization and further preclinical development. Traditional HTS campaigns are often limited to 1–2 million compounds owing to the high costs and operational bottlenecks that limit the chance for lead identification1,2. However, recent advances in computational technologies have made it possible to virtually screen millions of compounds for potential biological activity2. Existing virtual screening (VS) methods can be grouped into two broad categories: ligand-based VS and target structure-based VS. Both methods depend on chemical structure information to make predictions, whereas the target-based approach in addition requires the availability of detailed target protein information. These severe dependencies have tended to limit the applicability of such methods to querying only in the close structural vicinity of already known ligand structures and drug targets.

A critical advance that enabled the development of the activity-based, rather than structure-based, paradigm described here was the large-scale application of quantitative HTS (qHTS)3, where every compound is tested in a broad concentration response format. The high-quality data from qHTS are, thus, substantially richer for use in computational modeling to predict activities of large compound libraries against new assays or new drug targets. In the past 15 years, our in-house collections of over half a million compounds have been screened in a wide spectrum of biological assays in qHTS format4, resulting in compound activity profiles that enabled the development of a biological activity-based modeling (BABM) approach complementary to traditional structure-based approaches. Among our in-house libraries, the National Center for Advancing Translational Sciences (NCATS) Pharmaceutical Collection (NPC)5 and the Library of Pharmacologically Active Compounds (LOPAC) have been screened in nearly every one of our ~2,000 assays, providing the most comprehensive set of activity profiles, which comprise an ideal training dataset for machine learning models.

Unlike traditional quantitative structure–activity relationship (QSAR) approaches (part of the ligand-based VS category)6,7, where similarity in chemical structure is used to infer biological activity, BABM builds on the hypothesis that compounds that show similar activity patterns tend to share similar targets or mechanisms of action8,9. In this approach, each assay is treated as an independent descriptor. Analogous to structure descriptors, where the presence and absence of certain structure features or properties are used to represent a compound, the presence and absence of activities against a panel of assays form the activity profile or signature of a compound. If extracted from across multiple screening campaigns, each at massive scale, such activity signatures can then be applied to infer compound activity in a completely new assay or against a completely new target.
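The idea of treating each assay as a descriptor can be illustrated with a minimal sketch. It is not the modeling procedure used in this study: the compound names, assay names and the choice of Tanimoto similarity are assumptions made purely for illustration.

```python
from typing import Dict, Set

# Hypothetical activity profiles: compound -> set of assays in which it was active.
# Compound and assay names are invented for illustration only.
profiles: Dict[str, Set[str]] = {
    "cmpd_A": {"assay_1", "assay_4", "assay_7"},
    "cmpd_B": {"assay_1", "assay_4", "assay_9"},
    "cmpd_C": {"assay_2", "assay_3"},
}


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two binary activity signatures."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query: str, library: Dict[str, Set[str]]) -> str:
    """Return the library compound whose signature is closest to the query's."""
    others = (name for name in library if name != query)
    return max(others, key=lambda name: tanimoto(library[query], library[name]))


nearest = most_similar("cmpd_A", profiles)
```

In this toy library, `cmpd_A` and `cmpd_B` share two of four distinct active assays, so they would be grouped together regardless of how different their chemical structures are.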

A fundamental difference compared to traditional QSAR modeling is, thus, that BABM does not require any chemical structure information to make predictions, such that its application domain is not limited to small molecules with well-defined structures. In fact, BABM can be applied to any substances with available biological profiling, including macromolecules and mixtures (for example, natural products). Of particular note is that compounds showing similar activities do not necessarily share similar structures10. Thus, the BABM approach has no intrinsic limitations in discovering new chemical scaffolds11. These new scaffolds can then serve as starting points for lead identification efforts and be used to construct new QSAR models for lead optimization.

The global pandemic of the highly contagious Coronavirus Disease 2019 (COVID-19), caused by SARS-CoV-2 (ref. 12), presents an urgent need for new methods that can quickly and systematically screen large compound libraries for new drug candidates. In this context, we first applied BABM to generate prediction models for two infectious diseases, Zika13 and Ebola14, to test the robustness of BABM and its applicability to different assay and data types and to benchmark against traditional QSAR methods. The BABM model identified actives that were experimentally verified with high confirmation rates (~50–80%). The approach was then applied to build prediction models for SARS-CoV-2. To build prediction models for these disease targets, we selected training data that included both qHTS assay data (SARS-CoV-2 and Zika virus (ZIKV) non-structural protein NS1 (ref. 15)) and data collected from published literature (SARS-CoV-2 and Ebola virus (EBOV)16). These models, mostly trained on the qHTS activity profiles of the NPC and LOPAC library compounds, were applied to predict the activity of all ~0.5 million compounds in our in-house library. Models were constructed using BABM, and the performances were compared to those of traditional QSAR models as well as a combination of both activity and structural features. A little over 300 compounds identified by the BABM models as potential anti-SARS-CoV-2 leads were then tested in a live virus assay, with ~100 confirmed (>30%), validating the utility and accuracy of the BABM approach. The confirmed anti-SARS-CoV-2 compounds were further investigated for their potential antiviral mechanisms in terms of viral entry inhibition17, SARS-CoV-2 main protease inhibition18 and autophagy modulation19. Some of the experimentally confirmed lead compounds might have the potential to be further developed into new antiviral therapies.


Model performance and validation

Table 1 provides an overview of the three viral targets (SARS-CoV-2, ZIKV and EBOV) used for modeling. The entire model training, testing and validation process is illustrated in Fig. 1. Model performance was measured by the area under the receiver operating characteristic curve (AUC-ROC; see Methods for details). Most models performed well on their corresponding test sets, with mean AUC-ROC values >0.8 (Fig. 2a and Supplementary Table 1). The structure–activity combined models (CMs) showed the best performances compared to the models built on activity (BABM) or structure (SBM) alone, with mean AUC-ROC values >0.83. Of the BABM models using data from different assay panels and compound libraries, the BABM-S and BABM-M models all showed good performances, with mean AUC-ROC values of 0.79 and 0.84, respectively (Supplementary Table 1). The BABM-G models with the smallest assay panel for training showed the lowest AUC-ROC values, averaging 0.75. The SBMs generally showed lower performances than the CM and BABM models, with a mean AUC-ROC of 0.72. Supplementary Fig. 1 shows example ROC curves from each of the three types of models.
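For readers less familiar with the metric, AUC-ROC can be read as the probability that a randomly chosen active receives a higher model score than a randomly chosen inactive. The sketch below uses this standard rank-based definition; the labels and scores are made-up values, and the function is not taken from the study's Methods.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 1 = active, 0 = inactive, with model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = auc_roc(labels, scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from inactives.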

Table 1 Overview of viral targets used for modeling and summary statistics of model-identified active compounds
Fig. 1: Biological activity-based modeling process.

a, BABM process overview. For any biological target of interest T (for example, SARS-CoV-2, ZIKV NS1 and EBOV), the model identifies the activity pattern of active versus inactive compounds based on the training data, which are activity profiles of a set of compounds across a diverse panel of assays, including T. The active signature is then matched against the activity profiles of a new set of compounds across the same assay panel. The ability of the model to use this signature to correctly identify actives from the new compound set is first tested using part of the data with known T activity (the test set). An AUC-ROC value is calculated using the test set to evaluate the model performance. The model is then applied to a set of compounds with unknown T activity (prediction set; for example, Sytravon, MLS and Genesis). Predictions are made on the new compounds based on their activity profile similarity to that of the active signature for T. The predicted T actives are further validated experimentally for their activity against T. Comparing experimental results with model predictions, true positives and false positives are counted to determine the performance of the model. In the heat maps, each row represents a compound, each column is an assay and the heat map is colored by the compound activity. b, Detailed flowchart of the modeling process. c, Types of signatures and fingerprints used in different models.

Fig. 2: Model performance and experimental validation.

a, Model performances on the test sets measured by AUC-ROC values. Mean AUC-ROC values from ten randomly generated test sets are plotted, with the error bars indicating the s.d. values. b, Model performances measured by external experimental validation PPV in comparison to training set active rates (for example, original assay hit rate) (Supplementary Table 2). Model-selected compounds are significantly enriched with true actives. Model type: SBM, structure-based model; BABM, activity-based model (Sytravon); CM, combined model (SBM + BABM).

To further validate the models and identify new compounds with antiviral activity, a subset of model-predicted actives was selected for each viral target for experimental validation (see Methods for details). For ZIKV, 1,676 selected actives predicted by the model were tested in the original NS1 assay15 that generated the data to train the models. To validate the EBOV models, the EBOV-eGFP infection assay15,20 was applied to test 96 selected model-predicted EBOV actives. All 96 compounds were first inspected at 30 µM for potential cytotoxicity, resulting in 62 compounds with <50% cell killing, which were further tested for EBOV infection inhibition. The EBOV inhibition activity of these 62 compounds was used to evaluate model performance. The positive predictive value [PPV = TP/(TP + FP)], that is, the fraction of model-predicted actives that are experimentally confirmed, was calculated for each model (Fig. 2b and Supplementary Table 1). The model PPVs ranged from 30% (SBM for NS1) to 89% (CM-G for EBOV). The EBOV models showed higher PPVs (~80%) than the NS1 models (~40%). Compared to the active rates in their corresponding training datasets (that is, original assay hit rate), all model-predicted active sets were significantly enriched with true active compounds (two-tailed Fisher exact test: P < 10⁻¹⁰) (Supplementary Table 1). For example, the active rate of the EBOV BABM-S model training set was 11.8%, and the corresponding model PPV (that is, experimental validation set active rate) was 80%. Thus, the enrichment of actives by the EBOV BABM-S model was 6.8-fold (80/11.8). The enrichment of actives for all models (Table 1) ranged from 2.7-fold (BABM-S for NS1; P < 10⁻²⁰) to 27.5-fold (SBM for NS1; P < 10⁻²⁰). Most models showed enrichments between five- and ten-fold when compared to the active rates in the training set. The potency ranges of the experimentally confirmed actives are summarized in Table 1 and Fig. 3. The models identified potent compounds for all three disease targets with half-maximal inhibitory concentrations (IC50s) in the nanomolar range (Fig. 3). Experimental validation data for all models are provided as Supplementary Data 1.
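The PPV and fold-enrichment arithmetic used above can be written out as a short sketch. The TP and FP counts in the first example are hypothetical; the 80% PPV and 11.8% training-set active rate are the EBOV BABM-S values quoted in the text.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value, PPV = TP / (TP + FP), as a fraction."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no predicted actives were tested")
    return true_positives / total


def fold_enrichment(ppv_percent: float, training_active_rate_percent: float) -> float:
    """Ratio of the validation-set active rate to the training-set active rate."""
    if training_active_rate_percent <= 0:
        raise ValueError("training-set active rate must be positive")
    return ppv_percent / training_active_rate_percent


# Hypothetical counts: 8 confirmed actives out of 10 tested predictions.
example_ppv = ppv(true_positives=8, false_positives=2)

# Values quoted in the text for the EBOV BABM-S model.
ebov_babm_s = fold_enrichment(ppv_percent=80.0, training_active_rate_percent=11.8)
```

Rounding `ebov_babm_s` to one decimal place gives the 6.8-fold enrichment reported for this model.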

Fig. 3: Experimental validation results from the secondary confirmation of model-predicted actives.

a, Potencies and examples of compounds confirmed in the EBOV inhibition assay with minimal cytotoxicity. Replicate data are presented as mean ± s.d. b, Potencies and examples of compounds confirmed in the ZIKV NS1 inhibition assay with minimal cytotoxicity. Replicate data (n = 3) are presented as mean ± s.d. c, Potencies and examples of compounds confirmed in the anti-SARS-CoV-2 CPE assay.

Identification of anti-EBOV and anti-ZIKV compounds

Of the 50 compounds with anti-EBOV activity confirmed at 30 µM, we selected 27 that showed >90% inhibition of EBOV infection with minimal cytotoxicity (>80% cell viability) to test in concentration response format (0.17 nM to 30 µM; 1:3 serial dilution; triplicate) to determine their EBOV inhibition potency. All 27 compounds showed concentration-dependent inhibition of EBOV infection with IC50s ranging from 25 nM to 25 µM (Fig. 3a and Supplementary Data 2). Seven of these compounds were potent, with IC50 ≤ 5 µM, and were either not apparently cytotoxic or at least six times more potent in the EBOV inhibition assay than in the cell viability counter assay. Two of the seven compounds, umifenovir and difeterol, are known drugs (see Supplemental Information for details). The other five compounds have no previously reported anti-EBOV activity.
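The concentration range quoted above can be reproduced with a simple serial-dilution sketch. The number of points (12) is not stated in the text; it is inferred here as the number of 1:3 steps needed to go from 30 µM down to approximately 0.17 nM.

```python
from typing import List


def dilution_series(top_nM: float, factor: float, n_points: int) -> List[float]:
    """Concentrations (nM) of a serial dilution, starting at the top concentration."""
    if factor <= 1 or n_points < 1:
        raise ValueError("factor must exceed 1 and n_points must be positive")
    return [top_nM / factor**i for i in range(n_points)]


# 30 µM top concentration, 1:3 dilution; 12 points is an inferred value.
series_nM = dilution_series(top_nM=30_000.0, factor=3.0, n_points=12)
lowest_nM = series_nM[-1]  # about 0.17 nM
```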

A subset (170) of the experimentally confirmed NS1 assay active compounds with relatively potent NS1 signal inhibition activity (IC50 < 10 µM) and no apparent cytotoxicity were selected for secondary confirmation, with compounds tested at 11 concentrations in triplicate (Fig. 3b and Supplementary Data 2). Ten of the 170 compounds did not show activity in the secondary confirmation assay, yielding a confirmation rate of 94% for the NS1 assay. Twenty-nine compounds showed potent inhibition, with IC50 < 1 µM, 17 of which were not apparently cytotoxic or at least three times more potent in the NS1 assay. Several of these potent compounds are known drugs or bioactive compounds (see Supplemental Information for details). The other eight potent compounds can potentially be developed into new antiviral therapies.

Identification of anti-SARS-CoV-2 compounds

The activity of 311 compounds predicted by the SARS-CoV-2 BABM models was tested in the live virus cytopathic effect (CPE) assay, 99 of which were confirmed as active, yielding a hit rate of 32% (Fig. 2b and Supplementary Table 1). The model PPVs ranged from 32% (CM-S) to 38% (BABM-S). Compared to the active rates in their corresponding training datasets, all model-predicted active sets were significantly enriched with true active compounds (two-tailed Fisher exact test: P < 10⁻³) (Fig. 3c and Supplementary Table 1). Compared to the hit rate of the original NPC screen (11%), the models were able to improve the hit rate by 2.8- to 3.3-fold (Table 1). The SBM was not used for compound selection because its performance (average AUC-ROC = 0.71) during model training and testing did not meet the 0.75 cutoff. Nonetheless, the SBM predictions made on the 311 compounds were used to assess the performance of the SBM on the experimental validation set in comparison with the BABM models (Fig. 2b and Supplementary Table 1). The PPV of the SBM was 31.6%, which was the lowest of all SARS-CoV-2 models. The potency ranges of the experimentally confirmed actives are summarized in Table 1 and Fig. 3. Experimental validation data for all 311 compounds are provided in Supplementary Data 1 and 3. The structures and weighted feature significance (WFS) scores of ~5,000 compounds predicted as active by at least one of the SARS-CoV-2 BABM models are provided as Supplementary Data 4.

The experimentally confirmed SARS-CoV-2-active compounds were further tested at eight concentrations (instead of three concentrations in the primary screen) to get more accurate potency measures (Supplementary Data 2 and 5). Nine of the 94 compounds were inactive in the secondary confirmation assay, yielding a confirmation rate of 90% for the SARS-CoV-2 CPE assay. The most potent compound (MLS000699212-03; benzaldehyde, 3-methyl-, 2-(2,6-di-4-morpholinyl-4-pyrimidinyl)hydrazone) had an IC50 of 500 nM. This compound showed slight cytotoxicity inhibiting 55% cell viability with an IC50 of 14 µM, indicating a large therapeutic window (selectivity index = 28). This compound has only one published study, which is a patent on a compound series described as autophagy modulators for treating neurodegenerative diseases21. Autophagy has been implicated in the entry of coronavirus into host cells, including SARS-CoV, MERS-CoV and SARS-CoV-2 (refs. 19,22). Another potent compound with IC50 < 1 µM (800 nM) is a synthetic compound with no previous literature report (NCGC00100647-01; N2,N4-bis(3-methylphenyl)-6-(4-morpholinyl)-1,3,5-triazine-2,4-diamine). In addition, 13 compounds had IC50 < 5 µM, eight of which are known drugs or bioactives (see Supplemental Information for details), and the other five are compounds without any well-annotated biological activity. Some of the known anti-SARS-CoV-2 compounds reported in the literature, especially those currently in clinical trials for COVID-19, were also screened in our CPE assay with varying potencies23—for example, remdesivir (10 µM), chloroquine (6.5 µM), lopinavir (12.6 µM), azithromycin (48 µM), apilimod (23 nM) and emetine (46 nM). In comparison, the potencies of the anti-SARS-CoV-2 compounds identified by our models fall within the range of the known anti-SARS-CoV-2 compounds.
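The selectivity index quoted for MLS000699212-03 follows from the ratio of the cytotoxicity IC50 to the antiviral IC50. A minimal sketch, with function and parameter names chosen here for illustration:

```python
def selectivity_index(cytotox_ic50_uM: float, antiviral_ic50_uM: float) -> float:
    """Ratio of cytotoxic to antiviral potency; larger values mean a wider window."""
    if antiviral_ic50_uM <= 0:
        raise ValueError("antiviral IC50 must be positive")
    return cytotox_ic50_uM / antiviral_ic50_uM


# Values quoted in the text: cytotoxicity IC50 = 14 µM, antiviral IC50 = 500 nM.
si = selectivity_index(cytotox_ic50_uM=14.0, antiviral_ic50_uM=0.5)
```

Both potencies must be expressed in the same unit before the ratio is taken; here 500 nM is entered as 0.5 µM.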

Antiviral mechanism of anti-SARS-CoV-2 compounds

There are multiple targets for therapeutic intervention against SARS-CoV-2 infection, including viral entry into host cells, proteolysis of the viral polypeptide by the 3C-like protease to release the non-structural proteins and the autophagy pathway in host cells24. We further investigated the potential antiviral mechanism of the 85 experimentally confirmed anti-SARS-CoV-2 compounds using three assays: the SARS-CoV-2 pseudotyped particle (PP) entry assay25,26,27, the SARS-CoV-2 3C-like protease (3CLpro) assay18 and the GFP-LC3 assay for autophagy modulators (see Supplemental Information for details)28. Of the 85 anti-SARS-CoV-2 compounds, 53 were viral entry inhibitors determined by the PP entry assay; 35 were identified as autophagy modulators in the GFP-LC3 assay by all three parameters; and 52 were active in at least one autophagy parameter (Fig. 4a). Two compounds showed marginal activity in the 3CLpro assay. The results from all three assays are summarized in Supplementary Data 2. These results suggest that autophagy plays a major role in the antiviral activity of the model-identified anti-SARS-CoV-2 compounds. Most of these compounds are viral entry inhibitors, and 3CLpro inhibition (related to viral replication) is not a major antiviral mechanism of these compounds. The most potent anti-SARS-CoV-2 compound, MLS000699212, showed potent inhibition (IC50 = 592 nM) of viral cell entry and was active in all three parameters of the autophagy assay, indicating a dual mechanism of action (Fig. 4).

Fig. 4: Mechanistic study of experimentally confirmed anti-SARS-CoV-2 compounds.

a, Activity distribution of compounds in the SARS-CoV-2 PP entry, 3CLpro and GFP-LC3 autophagy assays. Concentration response curves of the most potent compound, MLS000699212, in the SARS-CoV-2 3CLpro (b), PP entry (c) and GFP-LC3 autophagy (d) assays. EC50, half-maximal effective concentration.


Traditional QSAR models rely on chemical structure similarity to infer biological activity and, thus, are limited in their power to discover new chemical scaffolds. Consequently, biological activity predictions made on chemicals with structure types not included in the training set are often not reliable; this is commonly referred to as the ‘applicability domain’ (AD) issue29. QSAR models are, thus, fundamentally restricted by their ADs, namely, by the chemical spaces within which the models were originally trained. Incorporating biological response patterns into the models helps to alleviate this issue by expanding the model AD to cover structurally dissimilar chemicals that share similar activity profiles. Activity-based modeling is a relatively new concept, especially when applied to drug discovery. The prerequisite of activity-based modeling is the availability of sets of compounds tested consistently across multiple biological assays, with the results serving as compound descriptors or fingerprints. This is enabled by the recent advances in HTS technologies that have produced a tremendous amount of biological activity data in a relatively short amount of time. As a center specialized in HTS, NCATS has a data repository that hosts biological response data on over half a million compounds tested against thousands of assays, mostly in qHTS format, which form a rich set of activity profiles at unprecedented scale (over 130 million wells screened over the last 4 years)3,30. We show here that a subset of these data could be used to build activity-based models to identify antiviral compounds for Zika, Ebola and COVID-19.

Compared to traditional QSAR models built with chemical structure data alone, the BABM identified compounds that are structurally distinct from the training set and the compounds identified by the SBM (see Supplemental Information for details), demonstrating the advantage of the BABM in discovering new chemical types. Combining traditional SBMs with BABM can maximize the chance of identifying the best lead compounds as new candidates for any therapeutic target of interest. Both the BABM and CM used activity data in other assays as descriptors for training, whereas the CM used structure features in addition. The model predictions were further validated experimentally. Using the ZIKV NS1 models, for example, even though the BABM identified a larger portion of the experimentally confirmed actives (that is, was more sensitive), the CM had a lower FP rate (that is, was more specific). Adding structure information helped the CM to achieve a slightly improved PPV. For all three viral targets modeled in this study, the CM achieved the best overall performance compared to the SBM and BABM. More intriguingly, the sizes of the training sets for all the models were much smaller than the prediction sets on which the models were applied—30- to 100-fold for the BABM and CM and up to 300-fold for the SBM (Supplementary Table 2). That models built on a small training set performed well on predicting a much larger and more diverse set of compounds with accuracies on par with or better than most in silico screening approaches further demonstrated that the models were robust enough to be applicable to large and diverse compound collections to identify new leads1,31.

The SARS-CoV-2 BABM models identified ~100 compounds that were experimentally verified to show antiviral activity in a live virus assay. The results from further mechanism of action studies showed that most of these compounds inhibited SARS-CoV-2 cell entry and/or modulated the autophagy process in host cells. Models built for Zika and Ebola also identified new lead compounds. In addition, we provided the prediction results of ~5,000 compounds that were predicted as active by the SARS-CoV-2 BABM models as a resource to the scientific community to develop new anti-COVID-19 therapies. These results demonstrate that the activity-based approach can be rapidly applied to identify lead compounds for new targets or disease phenotypes.

As a complement to structure-based approaches, either ligand- or target structure-based, the additional information provided by activity data is shown here to significantly improve the predictive power of VS models. Furthermore, the assays, as part of the activity signature, that contributed the most to the predictive power of the BABM models could provide clues to the underlying targets or mechanisms of the disease for which the models were built, such as COVID-19 (ref. 32). The chemical scaffolds identified by BABM from an existing screening library can also be incorporated into QSAR models to more efficiently screen other chemical libraries for which no bioactivity profiles are available. Of note is that, in addition to HTS libraries, the general concept of BABM can be extended to any type of biological data, such as genomics and proteomics data33, data generated on mixtures or antibodies and clinical data, where clearly defined structure information is not available. As such, the BABM approach shows the promise of broad applications in different areas of biology.


SARS-CoV-2 CPE assay

Vero E6 cells (ATCC Vero E6 CRL-1586) previously selected for high angiotensin-converting enzyme 2 (ACE2) expression34 (grown in EMEM, 10% FBS and 1% penicillin–streptomycin) were cultured in T175 flasks and passaged at 95% confluency. Cells were washed once with PBS and dissociated from the flask using TrypLE. Cells were counted before seeding. A CPE assay previously used to measure antiviral effects against SARS-CoV35 was adapted for performance in 384-well plates to measure CPE of SARS-CoV-2 with the following modifications. Cells, harvested and suspended at 160,000 cells per ml in MEM/1% PSG/1% HEPES supplemented with 2% HI FBS, were batch inoculated with SARS-CoV-2 (USA_WA1/2020) at a multiplicity of infection (MOI) of approximately 0.002, which resulted in approximately 5% cell viability 72 h after infection. Compound solutions in dimethyl sulfoxide (DMSO) were acoustically dispensed into assay-ready plates (ARPs) as three-point 1:5 titrations (or eight-point 1:3 titrations for confirmation screen). ARPs were stored at −20 °C and shipped to a Biosafety Level (BSL)-3 facility (Southern Research Institute) for CPE assay. At the BSL-3 facility, ARPs were brought to room temperature and 5 µl of assay medium was dispensed to all wells. A 25-µl aliquot of virus-inoculated cells (4,000 Vero E6 cells per well) was then added to each well in columns 3–24. The wells in columns 23 and 24 contained virus-infected cells only (no compound treatment). A 25-µl aliquot of uninfected cells was added to columns 1 and 2 of each plate for the cell-only (no virus) controls. After incubating plates at 37 °C with 5% CO2 and 90% humidity for 72 h, 30 µl of CellTiter-Glo (Promega) was added to each well. After incubation at room temperature for 10 min, the plates were sealed with a clear cover and surface decontaminated, and luminescence was read using a PerkinElmer Envision plate reader to measure cell viability.
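The seeding arithmetic and plate layout described above can be checked with a short sketch. The function names are illustrative, and the role of columns 3–22 as compound-treated wells is inferred from the description of the control columns rather than stated explicitly.

```python
def cells_per_well(density_per_ml: float, dispense_volume_ul: float) -> float:
    """Cells delivered to one well for a given suspension density and volume."""
    return density_per_ml * dispense_volume_ul / 1000.0  # 1 ml = 1,000 µl


def well_role(column: int) -> str:
    """Role of a column on a 24-column (384-well) plate, as read from the text."""
    if not 1 <= column <= 24:
        raise ValueError("384-well plates have columns 1-24")
    if column <= 2:
        return "cell-only control (no virus)"
    if column >= 23:
        return "virus-only control (no compound)"
    return "compound + infected cells"  # inferred for columns 3-22


# 160,000 cells per ml dispensed as a 25-µl aliquot.
seeded = cells_per_well(density_per_ml=160_000, dispense_volume_ul=25)
```

The result, 4,000 cells per well, matches the seeding density stated for the virus-inoculated cells.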

NS1 TR-FRET assay

HEK293 cells were maintained in EMEM medium with 10% FBS and 1% penicillin–streptomycin (Gibco, cat. no. 15140–122). Cells were seeded at 1,000 cells per 3 µl per well in the white 1,536-well plate and incubated at 37 °C with 5% CO2 overnight. Compounds in dilution were added to cells at 23 nl per well and incubated for 1 h, followed by addition of 2 µl per well of the prototypic ZIKV strain MR766 solution to cells (MOI = 0.5). After an incubation at 37 °C for 24 h, 2.5 µl per well of detection reagent mixture of two labeled anti-ZIKV NS1 antibodies was added to assay plates. Time-resolved fluorescence resonance energy transfer signals were measured using an Envision plate reader (PerkinElmer). Compounds were tested as seven-point 1:5 titrations in the primary screen and 11-point 1:3 titrations in triplicate in the confirmation screen. Data were normalized by using the control wells (without addition of ZIKV) as a negative control (0% NS1) and positive wells (with ZIKV) as 100% NS1 level.
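The two-point normalization described above (wells without ZIKV as 0% NS1, wells with ZIKV as 100% NS1) amounts to a linear rescaling of the raw signal. The sketch below assumes that the control wells on each plate are averaged before scaling, which the text does not specify; the raw signal values are invented.

```python
from statistics import mean
from typing import Sequence


def percent_of_control(raw: float,
                       zero_controls: Sequence[float],
                       full_controls: Sequence[float]) -> float:
    """Rescale a raw signal so the 0% controls map to 0 and the 100% controls to 100."""
    lo = mean(zero_controls)
    hi = mean(full_controls)
    if hi == lo:
        raise ValueError("control means are identical; cannot normalize")
    return 100.0 * (raw - lo) / (hi - lo)


# Hypothetical raw TR-FRET signals.
no_virus_wells = [100, 110, 90]      # 0% NS1
virus_wells = [1100, 1000, 900]      # 100% NS1
ns1_level = percent_of_control(550, no_virus_wells, virus_wells)
```

A compound well reading halfway between the two control means is reported as 50% NS1. The same rescaling would apply to the other plate-based assays described here that define 0% and 100% control wells.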

ATP content assay for cell viability and compound cytotoxicity

Cells were seeded in 1,536-well assay plates and incubated for 16 h at 37 °C with 5% CO2. Test compounds dissolved in DMSO were added to assay plates at a volume of 23 nl per well by an automated pintool workstation (Wako Automation). Compounds were incubated with cells for 48 h at 37 °C with 5% CO2. ATPlite, the ATP monitoring reagent (PerkinElmer), was then transferred to the assay plates and incubated for 15 min at room temperature. The resulting luminescence was measured using the PHERAstar FSX plate reader (BMG LABTECH). Data were normalized using the wells without cells as a control for 100% cell killing, and cell-containing wells with DMSO control were used as full cell viability (0% cell killing).

EBOV-eGFP infection assay

As described previously15,20, Vero E6 cells were maintained in DMEM (HyClone) supplemented with 10% FBS (Sigma-Aldrich). The following Ebola virus was used: Ebola virus NML/H.sapiens-lab/COD/1976/Mayinga-eGFP-p3 (EBOV/May-eGFP) (derived from an Ebola virus, family Filoviridae, genus Ebolavirus, species Zaire ebolavirus, GenBank accession no. NC_002549). All work with infectious virus was performed in the BSL-4 facility at the National Microbiology Laboratory of the Public Health Agency of Canada in the Canadian Science Centre for Human and Animal Health. All procedures were conducted in accordance with international protocols appropriate for this level of biosafety. The toxicity of compounds was evaluated in Vero E6 cells by using the PrestoBlue cell viability reagent, which is a resazurin dye-based assay (Life Technologies). Cells were plated, allowed to adhere overnight and then treated with various compound concentrations for 2 h. Control cells received an equivalent volume of 10% DMSO only. PrestoBlue cell viability reagent was added according to the manufacturer’s protocol. Viability was determined by comparing fluorescence readings of treated cells to those of untreated controls.

3CLpro enzyme assay and counter-screen18

SARS-CoV-2 3CLpro, sensitive internally quenched fluorogenic substrate and assay buffer were obtained from BPS Bioscience. The enzyme was expressed in an Escherichia coli expression system and has a molecular weight of 34 kDa. The peptide substrate contains a 14-amino-acid sequence (KTSAVLQSGFRKME) with Dabcyl and Edans attached on its N- and C-termini, respectively. The reaction buffer is composed of 20 mM Tris-HCl (pH 7.3), 100 mM NaCl, 1 mM EDTA, 0.01% BSA and 1 mM DTT. The 3CLpro enzyme assay was carried out in 1,536-well black, medium-binding microplates (Greiner Bio-One), with a total volume of 4 µl that includes 2 µl of 2× enzyme (50 nM) in reaction buffer and 2 µl of 2× substrate (20 µM). The experiment was conducted at room temperature. In brief, 2 µl per well of enzyme was first added into a 1,536-well plate. Compounds in DMSO were then transferred as 23 nl per well with an automated pintool workstation. The compounds and enzyme were incubated for 30 min at room temperature. Afterwards, 2 µl per well of substrate was dispensed into the assay plate, followed by 1-h incubation for the enzyme reaction. The fluorescence intensity was measured on a PHERAstar FSX plate reader (BMG LABTECH) with excitation = 340 nm/emission = 460 nm. A counter-screen assay to eliminate the fluorescence-quenching compounds was carried out by dispensing 4 µl of substrate containing the fluorescent Edans fragment, SGFRKME-Edans, into 1,536-well assay plates in the absence of enzyme. Compounds were pin transferred as 23 nl per well, and the fluorescence signal was read. Compounds were tested as 11-point 1:3 titrations in duplicate for both enzyme assay and counter-screen.

PP entry assay in 1,536-well format

Cell line and cell culture

A HEK293 cell line with stable expression of human ACE2 (HEK293-ACE2) was generated by Codex BioSolutions36. In short, Expi293F cells (Thermo Fisher Scientific) were seeded into a six-well plate at 70–80% confluency. For each well, the cells were transfected with 2.5 µg of pCMV_ACE2_IRES_puromycin plasmid (Codex BioSolutions) using Lipofectamine 3000 (Thermo Fisher Scientific). Twenty-four hours later, the cells were dissociated with trypsin and transferred into 100-mm dishes. The cells were selected with 1 µg ml⁻¹ of puromycin for 2–3 weeks. Single colonies were picked into 24-well plates containing 1 ml of DMEM and 10% FBS supplemented with 1 µg ml⁻¹ of puromycin. Western blot was performed to screen the ACE2 expression clones with an ACE2-specific antibody. The positive clones were further confirmed with the SARS-CoV2-S PP entry assay.

PP generation

SARS-CoV2-S PP, VSV-G PP and delEnv (bald) PP were custom produced by Codex BioSolutions following previously reported methods based on a murine leukemia virus pseudotyping system25,26. The SARS-CoV2-S construct with Wuhan-Hu-1 sequence (BEI no. NR-52420) was C-terminally truncated by 19 amino acids to reduce endoplasmic reticulum retention27 for pseudotyping.

PP entry assay

HEK293-ACE2 cells were seeded in white, solid-bottom, 1,536-well microplates (Greiner Bio-One) at 2,000 cells per well in 2 µl per well of medium and incubated at 37 °C with 5% CO2 overnight (~16 h). Compounds were titrated 1:3 in DMSO and dispensed via pintool at 23 nl per well to assay plates. Cells were incubated with test articles for 1 h at 37 °C with 5% CO2, before 2 µl per well of PP was added. The plates were then spinoculated by centrifugation at 1,500 r.p.m. (453g) for 45 min and incubated for 48 h at 37 °C with 5% CO2 to allow cell entry of PP and expression of the luciferase reporter. After the incubation, the supernatant was removed with gentle centrifugation using a Blue Washer (BlueCatBio). Then, 4 µl per well of Bright-Glo Luciferase detection reagent (Promega) was added to assay plates and incubated for 5 min at room temperature. The luminescence signal was measured using a PHERAstar plate reader (BMG LABTECH). Compounds were tested as 11-point 1:3 titrations in duplicate. Data were normalized with wells containing PPs as 100% and wells containing control delEnv PP (no spike protein) as 0%.
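The two-point normalization and the 1:3 titration series described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the use of the mean of the control wells and the example top concentration are assumptions made for the sketch.

```python
from statistics import mean


def normalize_percent(raw, pos_ctrl, neg_ctrl):
    """Scale raw signals so the pos_ctrl mean maps to 100% and the neg_ctrl mean to 0%.

    pos_ctrl: signals from wells containing PPs (assumed 100% reference).
    neg_ctrl: signals from wells containing delEnv PP (assumed 0% reference).
    Using the mean of each control group is an assumption of this sketch.
    """
    hi = mean(pos_ctrl)
    lo = mean(neg_ctrl)
    if hi == lo:
        raise ValueError("control means are identical; cannot normalize")
    return [100.0 * (x - lo) / (hi - lo) for x in raw]


def titration(top_conc, points=11, dilution=3.0):
    """Concentrations of an 11-point 1:3 series, starting from a hypothetical top_conc."""
    return [top_conc / dilution ** k for k in range(points)]


if __name__ == "__main__":
    print(normalize_percent([1000, 550, 100], [1000, 1000], [100, 100]))
    print(titration(30.0)[:3])
```

In practice the control wells would be taken per plate, so that each plate is normalized against its own references.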

GFP-LC3 high-content assay

As previously described28, GFP-LC3 mouse embryonic fibroblast cells (provided by Wen-Xing Ding from the University of Kansas Medical Center) were dispensed at 800 cells per 5 µl per well in 1,536-well tissue culture-treated black/clear-bottom, collagen-coated plates (Corning) using a Flying Reagent Dispenser (Aurora Discovery). The assay plates with cells were incubated at 37 °C with 5% CO2 for 5 h, followed by the addition of 23 nl of compound or control, chloroquine diphosphate, into the assay wells using a Wako Pintool station. After 18-h incubation at 37 °C with 5% CO2, the cells were fixed with 4% (vol/vol) paraformaldehyde (EMS), and nuclei were stained with Hoechst 33342 (Invitrogen) for 30 min at room temperature. After washing twice with PBS using the Blue Washer, the assay plates were imaged for GFP-LC3 puncta formation using an Operetta CLS (PerkinElmer) through a ×20 objective in confocal format. The EGFP channel (excitation 460–490 nm/emission 500–550 nm) and the DAPI channel (excitation 355–385 nm/emission 430–500 nm) were used to measure the fluorescence intensities. Images were acquired from each well for one center field (around 25% of a single well area in a 1,536-well plate) and analyzed with Operetta Harmony 4.6 software. The compartment analysis algorithm was used to identify the nuclei, apply a cytoplasmic mask and quantitate GFP spots in the GFP channel. A nuclear mask was generated from DAPI-stained nuclei. Autophagosomal membrane-associated GFP-LC3 (puncta) was detected as GFP-fluorescent vesicular objects that exceeded a threshold defined by untreated cells and that were located exclusively in the cytoplasmic area. Data were expressed as three output parameters: % of positive cells, total spot area − mean per well, and relative spot intensity − mean per well. Compounds were tested as 11-point 1:3 titrations in triplicate.

In vitro assay and structure data

qHTS data generated on the NPC from the CPE assay, as well as compounds reported as active from recent anti-SARS-CoV-2 repurposing screens37,38,39 and drugs proposed by the scientific community as potential COVID-19 therapies40,41,42,43, were used to train the SARS-CoV-2 models. The detailed qHTS data analysis process, including data normalization, correction, classification of concentration response curves and activity assignment, was described previously44. Briefly, concentration response curves were fit to a four-parameter Hill equation, yielding half-maximal inhibitory concentration (IC50) and maximal response (efficacy) values3,45. From the CPE assay, compounds that showed a concentration-dependent response with >30% efficacy were considered active. Other compounds were considered inactive. Literature-reported anti-SARS-CoV-2 compounds were considered active.
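For illustration, the four-parameter Hill model and the >30% efficacy rule can be written as below. This is a simplified sketch: the parameter names are assumptions, the curve-fitting step itself is omitted, and the Boolean `concentration_dependent` flag stands in for the curve classification procedure described in the cited reference.

```python
def hill(conc, bottom, top, ic50, slope):
    """Four-parameter Hill equation: predicted response at concentration conc.

    bottom/top: lower and upper response asymptotes (for example, % activity).
    ic50: concentration giving a half-maximal response.
    slope: Hill coefficient.
    """
    return bottom + (top - bottom) / (1.0 + (ic50 / conc) ** slope)


def is_active(efficacy_percent, concentration_dependent, threshold=30.0):
    """Activity call used for the CPE training labels in this sketch.

    A compound is labeled active only if its curve is concentration dependent
    and its efficacy exceeds the threshold (30% in the text).
    """
    return bool(concentration_dependent) and efficacy_percent > threshold


if __name__ == "__main__":
    # At conc == ic50 the response is halfway between bottom and top.
    print(hill(1.0, 0.0, 100.0, 1.0, 1.0))
    print(is_active(45.0, True), is_active(25.0, True))
```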

qHTS data generated in house at NCATS were used to train the models for ZIKV NS1. NS1 activity data15 were generated in qHTS format on three bioactive collections: the LOPAC (1,280 compounds), the NPC (2,816 approved and investigational drugs)5 and the Mechanism Interrogation Plate (1,866 cancer drugs with known mechanism of action)46. Compounds that showed inhibition in both the ratio and 615-nm readouts were considered active. Compounds that were inactive in the ratio readout were considered inactive. Other compounds were considered inconclusive and excluded from modeling. An NCATS in-house collection of ~90,000 diverse compounds, the Genesis library, was also screened for NS1 activity at a single concentration (14 µM). From these results, compounds that showed >30% inhibition in both the ratio and 615-nm readouts were considered active, and other compounds were considered inactive.

The activity data on ~2,600 drugs screened in an EBOV assay from a literature report were used to train the EBOV activity models16. These compounds were mapped to 2,065 unique compounds in the NCATS compound library. The anti-EBOV activities (active or inactive) of these compounds were assigned according to the literature report16. All compounds and their assay activities (1 = active, 0 = inactive) used to train the SARS-CoV-2, ZIKV NS1 and EBOV models are provided as Supplementary Data 6.

A subset of the compounds in the bioactive collections, NPC and LOPAC in particular, were screened in nearly all the assays available at NCATS. Two NCATS in-house diverse compound libraries (Sytravon, which contains ~44,000 compounds, and Genesis, which contains ~90,000 compounds), a subset (~100,000 compounds) of the other NCATS bioactive libraries and a large diverse compound library (MLS) were also screened in subpanels of the NCATS assay portfolio. The bioactive compound activity profiles in the assays that also screened the Sytravon (130 readouts), Genesis (39 readouts) or MLS (225 readouts) library were used to train and test the activity-based models (BABM-S, BABM-G or BABM-M). Structure fingerprints were generated for all compounds using the ChemoTyper47 for the SBMs. Structure data on all the compounds with target activity data available were used to train and test the SBM. The compositions of these datasets are summarized in Supplementary Table 2, and the different types of models based on these datasets are summarized below and illustrated in Fig. 1c. The assay activity-based models (BABM-S, BABM-G and BABM-M) and the activity–structure combined models (CM-S, CM-G and CM-M) were applied to predict the target activity of the compounds with activity profiles available from the Sytravon/Genesis/MLS assays (Fig. 1). In the CMs, the activity profile and the structure fingerprint were concatenated to form a new fingerprint for each compound. The SBM was applied to predict the target activity of all ~600,000 compounds in the NCATS compound library. For activity-based models, only compounds that showed activity in at least 10% of the Sytravon, Genesis or MLS assay panel were kept for analyses. Here, the definition of ‘active’ is not as strict as what would normally be considered a ‘hit’ for lead identification. Any type of concentration-dependent activity observed, regardless of potency or efficacy, was labeled as ‘active’. As such, compounds that showed activities in multiple assays are not compounds that were deemed ‘promiscuous’ in the traditional sense.
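The 10% activity filter and the construction of the combined fingerprint can be sketched as follows. This is a minimal illustration that assumes binary (0/1) lists for both the activity profile and the structure fingerprint; the function names are hypothetical.

```python
def keep_for_babm(activity_profile, min_fraction=0.10):
    """Return True if the compound is active in at least min_fraction of readouts.

    activity_profile: list of 0/1 flags, one per assay readout in the panel.
    """
    if not activity_profile:
        return False
    return sum(activity_profile) / len(activity_profile) >= min_fraction


def combined_fingerprint(activity_profile, structure_bits):
    """Concatenate activity profile and structure fingerprint into one feature vector."""
    return list(activity_profile) + list(structure_bits)


if __name__ == "__main__":
    profile = [1] + [0] * 9          # active in 1 of 10 readouts (10%)
    structure = [0, 1, 1]            # toy 3-bit structure fingerprint
    print(keep_for_babm(profile))
    print(combined_fingerprint(profile, structure))
```

With real data the activity part would have 130, 39 or 225 entries (Sytravon, Genesis or MLS panel) and the structure part 729 bits.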

Models built on different training datasets

| Assay data source/model type | Chemical structure | Assay activity | Activity and structure combined |
| --- | --- | --- | --- |
| MLS | SBM | BABM-M | CM-M |
| Sytravon | SBM | BABM-S | CM-S |
| Genesis | SBM | BABM-G | CM-G |

SBM, structure-based model; BABM-M, activity-based model (MLS); BABM-S, activity-based model (Sytravon); BABM-G, activity-based model (Genesis); CM-M, combined model (SBM + BABM-M); CM-S, combined model (SBM + BABM-S); CM-G, combined model (SBM + BABM-G).


The weighted feature significance (WFS) method previously developed at NCATS48 was applied to construct the models. Briefly, WFS is a two-step scoring algorithm. In the first step, a two-tailed Fisher’s exact test is used to determine the significance of enrichment for each feature in the active compounds compared to inactive compounds, and a P value is calculated for all the features present in the dataset. For structure data, the feature value was set to 1 for compounds containing that structural feature and to 0 for compounds that do not have that feature. For assay activity data, each assay readout was treated as a feature, and the feature value was set to 1 for ‘active’ compounds and to 0 for inactive compounds. If a feature is less frequent in the active compound set than in the inactive compound set, then its P value is set to 1. These P values form what we call a ‘comprehensive’ feature fingerprint. In the second step, this fingerprint is used to score each compound for its active potential according to Equation (1), where Pi is the P value for feature i; C is the set of all features present in a compound; M is the set of features encoded in the ‘comprehensive’ feature fingerprint (that is, features present in at least one active compound); N is the number of features in the set indicated by its subscript; and α is the weighting factor, which is set to 1 in all the models described here so that all assay features and structure features are treated equally. A high WFS score indicates a strong potential to be active.

$$\mathrm{WFS}=\frac{\sum_{i\in M\cap C}\log(P_{i})}{\min(\log(P_{i}))\times(\alpha N_{C-M}+N_{M\cap C})} \tag{1}$$
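A minimal pure-Python sketch of the two WFS steps is given below. It is an illustration under stated assumptions, not the NCATS implementation: the sum is taken over features in both M and C, the minimum of log(P) is taken over all features in M, the two-tailed P value sums the probabilities of all tables no more likely than the observed one, and all function names are hypothetical.

```python
from math import comb, log


def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def feature_p_values(fingerprints, labels):
    """Step 1: P value per feature in M (features seen in at least one active)."""
    n_act = sum(labels)
    n_inact = len(labels) - n_act
    pvals = {}
    for j in range(len(fingerprints[0])):
        a = sum(fp[j] for fp, y in zip(fingerprints, labels) if y == 1)
        c = sum(fp[j] for fp, y in zip(fingerprints, labels) if y == 0)
        if a == 0:
            continue  # feature absent from every active compound: not in M
        if a * n_inact < c * n_act:
            pvals[j] = 1.0  # less frequent in actives than in inactives
        else:
            pvals[j] = fisher_two_tailed(a, n_act - a, c, n_inact - c)
    return pvals


def wfs_score(fp, pvals, alpha=1.0):
    """Step 2: score one compound fingerprint against the feature P values."""
    present = [j for j, bit in enumerate(fp) if bit]
    in_m = [j for j in present if j in pvals]
    n_outside = len(present) - len(in_m)          # N_{C-M}
    min_log = min((log(p) for p in pvals.values()), default=0.0)
    denom = min_log * (alpha * n_outside + len(in_m))
    if denom == 0:
        return 0.0  # no informative features; treated as zero in this sketch
    return sum(log(pvals[j]) for j in in_m) / denom


if __name__ == "__main__":
    fps = [[1, 0]] * 4 + [[0, 1]] * 4
    labels = [1] * 4 + [0] * 4
    p = feature_p_values(fps, labels)
    print(p, wfs_score([1, 0], p), wfs_score([1, 1], p))
```

Because both the numerator and the denominator are non-positive, the score in this sketch falls between 0 and 1, with compounds carrying only strongly enriched features scoring closest to 1.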

For each model, compounds were randomly split into two groups of approximately equal size, one used for training and the other for testing. Model performance was assessed by the area under the receiver operating characteristic curve (AUC-ROC); the ROC curve plots sensitivity [TP/(TP + FN)] against 1 − specificity, where specificity = TN/(TN + FP)49 and TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively. A perfect model would have an AUC-ROC of 1, whereas an AUC-ROC of 0.5 indicates a random classifier. To evaluate the robustness of the models, the random data split, model training and testing were repeated ten times, and the average AUC-ROC value was calculated for each model. For external experimental validation of models, model performance was measured by the positive predictive value (PPV = TP/(TP + FP)). Statistical significance was determined by the two-tailed Fisher’s exact test comparing model PPV with the active rate in the training dataset for the corresponding target being modeled.
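The evaluation metrics and the repeated random split can be sketched as below. This is an illustrative sketch only: the AUC-ROC is computed with the rank-based (Mann–Whitney) formulation, which equals the area under the ROC curve, and the function names and the fixed random seed are assumptions.

```python
import random


def auc_roc(scores, labels):
    """AUC-ROC as the probability that a random active outscores a random inactive."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count half
    return wins / (len(pos) * len(neg))


def ppv(tp, fp):
    """Positive predictive value: confirmed actives among predicted actives."""
    return tp / (tp + fp)


def random_half_splits(n_items, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated random ~50/50 splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)
        half = n_items // 2
        yield idx[:half], idx[half:]


if __name__ == "__main__":
    print(auc_roc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
    print(ppv(3, 1))
    for train, test in random_half_splits(6, repeats=2):
        print(train, test)
```

Averaging `auc_roc` over the ten test sets gives the per-model value used to judge robustness.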

Selection of model predicted actives

Models with AUC-ROC > 0.75 were considered for compound selection. WFS score cutoff values for model-predicted actives were determined using the ROC curves where both sensitivity and specificity were optimized. Only compounds that scored higher than the cutoff values were considered candidates for follow-up selection. Owing to the limitations of different assays and resources, for each target we selected compounds with the largest possible structure diversity that could fit into one 1,536-well plate for experimental validation. When the candidate pool was much larger than the target number of compounds, the candidates were narrowed down based on structure type. For this purpose, the entire NCATS in-house compound library was clustered based on structure similarity (729-bit ChemoTyper47 fingerprints) using the self-organizing map algorithm50. From the clusters that contain model-predicted actives, a fraction of the active compounds was selected from each cluster based on the WFS score and the number of models that predicted the compound as active. Because the EBOV assay could test only ~100 compounds, the anti-EBOV candidates were manually inspected and narrowed down further based on literature reports, structure novelty and absorption, distribution, metabolism and excretion properties. In most cases, the selection was driven by the availability of physical samples. All compounds that met the WFS score cutoff from a model were selected when fewer than 1,408 compounds had physical samples available for cherry-picking. The SARS-CoV-2 CPE assay (live virus) could be run only in 384-well format. Limited by the testing space available and physical sample availability, only 311 model-predicted compounds were selected for experimental confirmation in the SARS-CoV-2 live virus assay.
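The text does not state the exact criterion used to optimize sensitivity and specificity together. One common choice, shown here purely as an assumption, is to pick the cutoff that maximizes Youden's J statistic (sensitivity + specificity − 1); the function name is hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing sensitivity + specificity - 1.

    A compound is predicted active when its score is strictly higher than
    the cutoff, matching the 'scored higher than the cutoff' rule in the text.
    Youden's J is an assumed criterion, not one confirmed by the source.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one active and one inactive compound")
    best_j, best_cut = -1.0, None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cut)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j


if __name__ == "__main__":
    cut, j = youden_cutoff([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
    print(cut, j)
```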

Statistical analysis and illustrations

Principal component analysis (PCA) was performed in R version 3.4.3. The first three principal components (PC1, PC2 and PC3) were calculated based on the 729-bit ChemoTyper fingerprints. Three-dimensional PCA plots were generated using the first three PCs in TIBCO Spotfire version 7.11.1. Concentration response curve plots were generated using GraphPad Prism 8, with IC50 values calculated using a three-parameter logistic regression.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.