Each year vast international resources are wasted on irreproducible research. The scientific community has been slow to adopt standard software engineering practices, despite the growth of high-dimensional data and the increasing complexity of workflows and computational environments. Here we show how scientific software applications can be created in a reproducible manner when simple design goals for reproducibility are met. We describe the implementation of a test server framework and 40 scientific benchmarks, covering numerous applications in Rosetta bio-macromolecular modeling. Integration with a high-performance computing cluster allows these benchmarks to run continuously and automatically. Detailed protocol captures are useful for developers and users of Rosetta and other macromolecular modeling tools. The framework and design concepts presented here are valuable for developers and users of any type of scientific software and for the scientific community to create reproducible methods. Specific examples highlight the utility of this framework, and the comprehensive documentation illustrates the ease of adding new tests in a matter of hours.
Reproducibility in science is a systemic problem. In a survey published by Nature in 2016, 90% of scientists responded that there is a reproducibility crisis1. Over 70% of the over 1500 researchers surveyed were unable to reproduce another scientist’s experiments and over half were unable to reproduce their own experiments. Another analysis published by PLOS One in 2015 concluded that, in the US alone, about half of preclinical research was irreproducible, amounting to a total of about $28 billion being wasted per year2!
Reproducibility in biochemistry lab experiments remains challenging to address, as it depends on the quality and purity of reagents, unstable environmental conditions, and the accuracy and skill with which the experiments are performed. Even small changes in input and method ultimately lead to an altered output. In contrast, computational methods should be inherently scientifically reproducible since computer chips perform computations in the same way, removing some variations that are difficult to control. However, in addition to poorly controlled computing environment variables, computational methods have become increasingly complex pipelines of data handling and processing. This effect is further compounded by the explosion of input data through “big data” efforts and exacerbated by a lack of stable, maintained, tested, and well-documented software, creating a huge gap between the theoretical limit for scientific reproducibility and the current reality3.
These circumstances are often caused by a lack of best practices in software engineering or computer science4,5, errors in laboratory management during project or personnel transitions, and a lack of academic incentives for software stability, maintenance, and longevity6. Shifts in accuracy can occur when re-writing functionality or when several authors work on different parts of the codebase simultaneously. An increase in complexity of scientific workflows with many and overlapping options and variables can prevent scientific reproducibility, as can code implementations that lack or even prevent suitable testing4. The absence of testing and maintenance causes software erosion (also known as bit rot), leading to a loss of users and often the termination of a software project. Further, barriers are created through intellectual property agreements, competition, and refusal to share inputs, methods, and detailed protocols.
As an example, in 2011 the Open Science Collaboration in Psychology tried to replicate the results of 100 studies as part of the Reproducibility Project7. The collaboration, consisting of 270 scientists, could only reproduce 39% of study outcomes. Since then, some funding agencies and publishers have implemented data management plans or standards to improve reproducibility8,9,10,11, for instance, the FAIR data management principles12. Guidelines to enhance reproducibility13,14 are certainly applicable, are outlined in Table 3, and are discussed in detail in an excellent editorial15 describing the Ten Year Reproducibility Challenge16, published in the dedicated reproducibility journal ReScience C17. Other efforts focus directly on improving the methods with which the researchers process their data—for instance, the Galaxy platform fosters accessibility, transparency, reproducibility, and collaboration in biomedical data analysis and sharing13.
Reproducibility is also impacted by how methods are developed. Comparing a newly developed method to established ones, or an improved method to a previous version is important to assess its accuracy and performance, monitor changes and improvements over time and evaluate the cost/benefit ratio for software products to commercial entities. However, biases in publishing positive results or improvements to known methods, in conjunction with errors in methodology or statistical analyses18, lead to an acute need to test methods via third parties. Often, methods are developed and tested on a specific benchmark set created for that purpose and will perform better on that dataset than methods not trained on that dataset. A rigorous comparison and assessment require the benchmark to be independently created from the method, which unfortunately is rarely the case. Compounding issues are lack of diversity in the benchmark set (towards easier prediction targets) and reported improvements smaller than the statistical variation of the predicted results. Guidelines on how to create a high-quality benchmark19,20 are outlined in Table 3 below.
Scientific reproducibility further requires a stable, maintainable, and well-tested codebase. Software testing is typically achieved on multiple levels4,21. Unit tests check for scientific correctness of small, individual code blocks, integration tests check an entire application by integrating various code blocks, and profile and performance tests ensure consistency in runtime and program simplicity. Scientific tests or benchmarks safeguard the scientific validity and accuracies of the predictions. They are typically only carried out during or after the development of a new method (static benchmarking), as they require domain expertise and rely on vast computational resources to test an application on a larger dataset. However, the accuracy and performance of a method depend on the test set, the details of the protocol (i.e., specific command lines, options, and variables), and the software version. To overcome the static benchmarking approach, blind prediction challenges such as the Critical Assessments in protein Structure Prediction22, PRediction of protein Interactions23, Functional Annotation24, Genome Interpretation25, RNA Puzzles26, and Continuous Automated Model EvaluatiOn15,27 hold double-blind competitions at regular intervals. While these efforts are valuable to drive progress in method development in the scientific community, participation often requires months of commitment and does not address the reproducibility of established methods over time.
The Rosetta macromolecular modeling suite28,29 has been developed for over 20 years by a global community with now hundreds of developers at over 70 institutions4,30. This history and growth required us to adopt many best practices in software engineering4,29, including the implementation of a battery of tests. A detailed description of our community, including standards and practices, has previously been provided4. Scientific tests are important to maintain prediction accuracies for our own community and our users (including commercial users whose licensing fees, in our case, support much of Rosetta’s infrastructure and maintenance). We further want to directly compare different protocols and implementations and monitor the effect of score function changes on the prediction results. For many years, Rosetta applications31,32 and score functions33,34,35,36 have been tested independently using the static benchmarking approach20,37, often with complete protocol captures38,39. The disadvantage of static benchmarking is that the results become outdated due to the lack of automation. Reproducibility becomes impossible due to a lack of preservation of inputs, options, environment variables, and data analyses over time.
This background highlights the challenges in rigorously and continuously testing how codebase changes affect the scientific validity of a prediction method while maintaining or improving scientific reproducibility. Running scientific benchmarks continuously (1) suffers from a lack of incentive to set up, as the maintenance-oriented character of these tests collides with academic goals; (2) requires both scientific and programming/technical expertise to implement, interpret and maintain; (3) is difficult to interpret with pass/fail criteria; and (4) requires a continuous investment of considerable computational resources. Here, we address these challenges by introducing a general framework for continuously running scientific benchmarks for a large and increasing number of protocols in the Rosetta macromolecular modeling suite. We present the general setup of this framework, demonstrate how we solve each of the above challenges, and present the results of the individual benchmarks in the Supplementary Information of this paper, complete with detailed protocol captures. The results can be used as a baseline by anyone developing macromolecular modeling methods, and the code of this framework is sufficiently general to be integrated into other types of software. The design principles presented here can be used by anyone developing scientific software, independent of the size of the method. We highly encourage small software development groups to follow these guidelines, even though their technical and personnel setup might differ. Supplementary Note (1) describes several options that small groups have available to test their software with limited resources.
Successful software development can be achieved by following a number of guidelines that have previously been described in detail in ref. 4. Software testing is an essential part of this strategy which ties into scientific reproducibility. Over the past 15 years, the Rosetta community has created its own custom-built test server framework connected to a dedicated high-performance computing (HPC) cluster—its setup is shown in Fig. 1A and described in the Supplementary Information. The scientific testing setup is integrated into this framework.
Insights from the previous round of scientific tests led to specific goals
The Rosetta community learned valuable lessons from the long-term maintenance (or lack thereof) of several scientific benchmark tests set up over 10 years ago (see Supplementary Note 2). Their deterioration and development life cycle motivated specific goals that we think lead to more durable scientific benchmarks (Fig. 1B): (1) simplicity of the framework to encourage maintenance and support; (2) generalization to support all user interfaces to the Rosetta codebase (command line, RosettaScripts40, PyRosetta41,42); (3) automation to continuously run the tests on an HPC cluster with little manual intervention; (4) documentation on how to add tests and scientific details of each test to allow maintenance by anyone with a general science or Rosetta background; (5) distribution of the tests to both the Rosetta community and their users, and publicizing their existence to encourage the addition of new tests and maintenance by the community; and (6) maintenance of the tests, facilitated by each of the previous points.
Goal 1—simplicity: simple setup facilitates broad adoption and support from our community
To encourage our community to contribute as many tests as possible, the testing framework needs to be simple and support fast and easy addition of tests. We decided on a Python framework that integrates well with our pre-existing testing HPC cluster (Supplementary Note 3). We further require these tests to be able to run on local machines (with different operating systems) as well as various HPC clusters with minimal adjustments. Debugging the scripts should be as simple as possible. With these requirements in mind, we decided on a setup as shown in Fig. 1C. We provide a template directory with all necessary files (described in detail in Methods). Simple modifications like naming scripts in the order in which they run—e.g., 0.compile.py to 9.finalize.py—greatly facilitate debugging or extension by new users.
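The numbered-script convention can be sketched as follows. This discovery logic, and all script names other than 0.compile.py and 9.finalize.py, are hypothetical illustrations rather than the framework's actual code:

```python
import re
from pathlib import Path

def ordered_test_scripts(test_dir):
    """Collect a test's step scripts (e.g. 0.compile.py ... 9.finalize.py)
    and return them sorted by their leading step number, i.e. in the
    order in which a runner would execute them."""
    pattern = re.compile(r"^(\d+)\..+\.py$")
    steps = []
    for path in Path(test_dir).iterdir():
        match = pattern.match(path.name)
        if match:  # keep only files following the N.name.py convention
            steps.append((int(match.group(1)), path))
    return [path for _, path in sorted(steps)]
```

The payoff of the convention is that execution order is visible in a directory listing and needs no separate configuration, which makes debugging a failed step or appending a new one straightforward.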
Goal 2—generalization: new tests support interfaces of the command line, PyRosetta, or RosettaScripts
Rosetta supports several interfaces to facilitate quick protocol development while lowering the necessary expertise required by new developers to join our community4. Many mainstream protocols have been developed as standalone applications to be run via the command line, while customized protocols have been developed in RosettaScripts40 and PyRosetta41,42. For our test server framework, we sought a general code design that allows input from all three interfaces while supporting different types of outputs, quality measures, and analyses, sometimes even written in different scripting languages.
Goal 3—automation: tests require substantial compute power and are run on a dedicated test server
Running scientific benchmarks requires extensive CPU time; hence we chose to integrate them with our own custom-built test server framework connected to a dedicated HPC cluster (Fig. 1A and Supplementary Information). This test server framework consists of two main components: the back end holds low-level primitive code for compilation on different operating systems and HPC environments, cluster submission scripts, and web server integration code. The front end contains the test directories that are implemented by the test author. Our test server is accessible through a convenient web interface (Fig. 2A; available at https://benchmark.graylab.jhu.edu/). This framework has had a hugely positive impact on the growth and maintenance of both the Rosetta software and our community, due to its accessibility, GitHub integration, ease of use, and automation. In small software communities that lack the ability or resources to set up a dedicated test server, integration testing via external services like GitHub Actions43, Drone CI44, Travis CI45, or Jenkins46 is an excellent alternative. More details can be found in the Supplementary Information.
The RosettaCommons supports our benchmarking effort through expansion of our centralized test server cluster hardware and labor with an annual budget (see Supplementary Information and our previous publication4). Because the scientific tests are integrated into our test server framework, authors of the tests can focus on the scientific protocols (starting from a template directory set up as in Fig. 1C) instead of debugging errors in compilation, cluster submission, and computational environment. This pattern also makes these tests system-independent (the author writes the setup for a local machine and runs it on this server), i.e., portable between operating systems and computational environments. We currently limit the runtime per scientific test, typically to 1000–2000 CPU hours.
Due to the required computational resources, we are unable to test every code revision in the main development branch of Rosetta; instead, we dedicate computational nodes to the scientific tests and run tests such that the nodes are continuously occupied. We found that scheduling the earliest-run test on an individual rolling basis, as compute nodes become available, is most efficient in balancing the server load while keeping nodes available for tests in feature branches. Upon discovery of a test failure and to find the specific revision (and therefore the code change) that caused the failure, our bisect tool schedules intermediate revisions on a low-priority basis. All test results are stored in the database and are accessible through a web interface (Fig. 2).
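The bisect tool's strategy amounts to a binary search over the revision history. A minimal sketch, assuming a hypothetical run_test(revision) callable that returns True on pass, and the usual precondition that the first listed revision passes while the last fails:

```python
def bisect_failure(revisions, run_test):
    """Find the earliest failing revision in a chronologically ordered
    list, assuming revisions[0] passes and revisions[-1] fails, with a
    single code change flipping the outcome somewhere in between.
    run_test(revision) -> True if the scientific test passes."""
    lo, hi = 0, len(revisions) - 1  # lo always passes, hi always fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_test(revisions[mid]):
            lo = mid  # failure was introduced after mid
        else:
            hi = mid  # failure was introduced at or before mid
    return revisions[hi]  # earliest revision observed to fail
```

Scheduling these intermediate runs at low priority, as the text describes, keeps the search from displacing the regular rolling tests on the cluster.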
Goal 4—documentation: anyone can quickly and easily add new tests
Creating well-designed scientific benchmarks requires expertise in defining the scientific objective, establishing a protocol, and creating a high-quality test dataset. The last step of incorporating the test into our framework should be as simple as possible (as per our simplicity requirement). Once the dataset, interface (command line, RosettaScripts, or PyRosetta), specific command line, and quality measures have been chosen, the author can simply follow the individual steps outlined on the documentation page47 to contribute the test; the template guides the setup (Supplementary Note 4). We found that the setup is simple enough that untrained individuals can contribute a test in a few hours based on documentation alone—hence we achieved our goal of simplicity and detail in our documentation.
One of the reasons for the deterioration of earlier scientific tests was lack of maintenance due to insufficient documentation. Our goal is to drive the creation of extensive documentation for each test such that anybody with an average scientific knowledge of biophysics and introductory knowledge of programming in Rosetta can understand and maintain the tests. To ensure comprehensive documentation and consistency between tests, we provide a readme template with specific sections and questions that need to be answered for each test (see Supplementary Information). The template discourages writing short, insufficient, free-form documentation, and instead encourages the addition of important details and significantly lowers the barrier for writing extensive documentation. The questionnaire-style template also saves time in locating the details necessary to repair broken tests. The extent and quality of documentation is independently approved by a pull-request reviewer before the test is merged into the main repository. The benchmarking framework is configured such that documentation needs to be written once and is then directly embedded into the results page. Thus, the documentation is accessible both in the code and on the web interface while eliminating text duplication that could lead to discrepancies and confusion.
Goal 5—distribution: additions and usage of tests by our community requires broad distribution
Earlier scientific tests also deteriorated due to poor communication as to the existence of these tests, which resulted in a small pool of maintainers. Because our new scientific tests are integrated into our test server framework which most of our community uses and monitors, developers are immediately aware of the tests that exist and their pass/fail status. In conjunction with regular announcements to our community, this visibility should significantly broaden the number of people able and willing to sustain the scientific tests for a long time. If we nevertheless find that our new tests deteriorate, we will host a hackathon (eXtreme Rosetta Workshop4) to supplement or repair these tests in a concentrated effort.
Goal 6—maintenance: test failures are handled by a defined procedure
The often overlooked, real work in software development is not necessarily the development of the software itself, but its maintenance. We have a system in place outlining how test failures are handled and by whom. Each test has at least one dedicated maintainer (aka ‘observer’, usually the test author) who is notified of the test breakage via email and whose responsibility it is to repair the test. Test failures fall into three categories: technical, stochastic, or scientific. Technical failures (such as compiler errors, script failures due to new versions of programs, etc.) typically require small adjustments and fall under the responsibility of the test author and our dedicated test engineer.
Stochastic failures are an uncommon feature in software testing and are a rare but possible occurrence in this framework. Rosetta often uses Metropolis Monte Carlo algorithms and thus has an element of randomness present in most protocols. Setting specific seeds is done for integration tests in Rosetta (which are technically regression tests, discussed in the Supplementary Information of a previous publication4), which are not discussed here in detail. We refrain from setting random seeds in our scientific tests because the goal is to check whether the overall statistical and scientific interpretations hold after running the same protocol twice, irrespective of the initial seed. Further, a change in the vast Rosetta codebase that adds or removes a random number generator call is expected to cause trajectory changes even with set random seeds. The scientific tests are scaled so that individual trajectories are treated statistically, and the lack of response to both seed changes and minor code changes is a feature and goal of the test. Scientific tests are interpreted in a Boolean pass/fail fashion but generally have an underlying statistical interpretation, sampling from a distribution against a chosen target value. The statistical interpretation often varies from test to test and depends on the output of the protocol, the types of quality metrics, and sample sizes; therefore, we cannot provide specific suggestions as to which statistical measures should be used in general. Details about which statistics are used in which protocol are provided in the Supplementary Information and the linked tests. For these reasons, rare stochastic failures are not a concern in our case; their rarity points to a sensibly chosen cutoff value (Supplementary Note 5). Nevertheless, the randomness of Monte Carlo will occasionally cause a stochastic test failure because those runs happen to produce poor predictions by the tested metric.
This is handled by simply rerunning the test: failures that persist on rerun are either not stochastic—i.e., the test is signaling genuine breakage—or are a symptom that the structure or pass/fail criteria of the test are not working as intended.
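One way to read "a sensibly chosen cutoff value" is a threshold set far enough below typical results that seed-to-seed noise rarely crosses it. A sketch under that assumption; choose_cutoff, n_sigma, and the three-sigma margin are illustrative choices, not the framework's actual criteria:

```python
from statistics import mean, stdev

def choose_cutoff(baseline_runs, n_sigma=3.0):
    """Pick a pass/fail cutoff from repeated baseline runs of the same
    protocol: far enough below the typical aggregate result that
    ordinary seed-to-seed variation rarely trips it, while a genuine
    regression still falls below it. n_sigma is a hypothetical margin."""
    return mean(baseline_runs) - n_sigma * stdev(baseline_runs)

def passes(metric_value, cutoff):
    """A run passes when its aggregate quality metric clears the cutoff."""
    return metric_value >= cutoff
```

Under this scheme a rerun that passes confirms the earlier failure was noise near the threshold, whereas repeated failures indicate either a real regression or a cutoff set too aggressively.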
A scientific failure requires more in-depth troubleshooting and falls under the responsibility of the maintainer. If the maintainer does not fix the test, we have a rank-order of responsibilities to enforce the test repair. The principal investigator of the test designates someone in their lab. If the necessary expertise does not exist in the lab at the time (usually because people have moved on in their career), repairing the test becomes the responsibility of the person who broke it. If this developer lacks the expertise, the repair becomes community responsibility, which typically falls onto one of our senior developers.
Most major Rosetta protocols are now implemented as scientific benchmarks
Using the framework described above, our community implemented 40 scientific benchmarks spanning a broad range of applications including antibody modeling, docking, loop modeling, incorporation of NMR data, ligand docking, protein design, flexible peptide docking, membrane protein modeling, etc. (Table 1 and Supplementary Note 6). Each benchmark is unique in its selection of targets in the benchmark set, the specific protocol that is run, the quality metrics that are evaluated, and the analysis to define the pass/fail criterion. The details for all the benchmarks are provided in the comprehensive supplement to this paper. We further publish the benchmarks with results and protocol captures on our website (https://graylab.jhu.edu/download/rosetta-scientific-tests/) twice per year for our users to see, download, run, and compare their own methods against. This transparency is crucial for the representation of realistic performance and to enhance the scientific reproducibility of our tools.
Standardizing workflows highlights heterogeneity in score function implementations
Standardizing the workflows and creating this framework provides us with the possibility of running some protocols with different score functions. Rosetta has been developed over the past 25 years and the score function has been constantly improved over this timeframe. Details of this evolution and the latest standard score function REF2015 can be found in references35,36. The attempt to easily switch score functions for an application reveals a major challenge: many applications employ the global default score function differently, a problem exacerbated by the various user interfaces to the code (see Supplementary Note 7). This heterogeneity in implementations makes it impossible to easily test different score functions across all of the applications and hinders both progress on and unification of the score functions, possibly into a single one.
Use case (1): test framework allows comparison of score functions for multiple protocols
Using our framework allows us to directly compare runs with different variables. For instance, we can compare different score functions for various applications: protein–protein docking, high-resolution refinement, loop modeling, design, ligand docking, and membrane protein ddG’s (Table 2 and Figs. 3–5). We test the latest four score functions: score12, talaris2013, talaris2014, and REF2015 for all but ligand docking and membrane protein ddG’s. Ligand docking has a special score function and requires adjustments—we test the ligand score function, talaris2014, REF2015, and the experimental score function betaNov2016. Membrane protein ddG’s are tested on the membrane score functions mpframework2012, REF2015_mem, franklin2019, and the non-membrane score function REF2015 as a control.
The benchmark sets and quality metrics are described in Table 2 and in detail in the Supplementary Information. To compare the score functions, we plot each application’s quality metrics (for instance interface score vs. interface RMSD for protein-protein docking, total score vs. loop RMSD for loop modeling). We then evaluate the “funnel quality” by computing the PNear metric, which falls between 0 and 1, with higher values indicating higher quality48,49. For the protein design test, we compute the average sequence similarity of the 10 lowest-scoring (best) models instead of PNear and for the membrane ddG test, we use the Pearson correlation coefficient between experimental and predicted ddG’s. We further summarize the quality metrics per protocol and score function by a “winner-takes-all” comparison (Fig. 5A) and by an average metric over all targets per application and score function (Fig. 5B).
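PNear is commonly formulated as a Boltzmann-weighted measure of how much of the low-energy sampling lands near the target conformation. A sketch of that formulation; the lam (distance scale) and kbt (temperature factor) defaults here are illustrative, not Rosetta's actual parameter values:

```python
import math

def pnear(rmsds, energies, lam=2.0, kbt=1.0):
    """Funnel-quality metric between 0 and 1 (higher = better funnel).
    Each sample i contributes a Boltzmann weight exp(-E_i/kbt); the
    numerator additionally down-weights samples far from the target
    by exp(-rmsd_i^2/lam^2), so PNear ~ 1 when low-energy samples
    cluster near the target and ~ 0 when they do not."""
    weights = [math.exp(-e / kbt) for e in energies]
    near = [w * math.exp(-(r * r) / (lam * lam))
            for r, w in zip(rmsds, weights)]
    return sum(near) / sum(weights)
```

With this form, a protocol whose lowest-energy models sit at near-zero RMSD scores close to 1, while a flat or misfunneled energy landscape pulls the value toward 0, matching the 0-to-1 interpretation given above.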
A few main observations follow from this comparison. First, at a glance, REF2015 performs generally better overall, yet the best score function to use depends on the application—even different types of protocols can impact prediction accuracy. However, it should be noted that some tests have a small sample size due to the required computational resources, therefore impacting the statistical significance of these outcomes. Second, more recent score functions are not automatically better for any given application, likely because performance depends on how the score function was developed and tested. For a more detailed discussion, see the Supplementary Information. Third, results differ in some cases depending on how the data were summarized; the top-performing score functions per application from the “winner-takes-all” comparison are not necessarily the top performers when the average of the PNear value is used, as can be seen in ligand docking (Fig. 5B—reference50 discussed this in-depth).
Use case (2): scientific test framework facilitates bug fixes and maintenance
The scientific test framework is also useful for code maintenance, to ensure that the correction of bugs does not invalidate the scientific performance of the application. This can be achieved by comparing the scientific performance of a run before and after fixing a bug in the code. For example, in October 2019, we identified an integer division error in one of our core libraries: the fraction 2/3 was incorrectly assumed to evaluate to 0.6666…, when in fact integer division discards remainders, yielding 0. This calculation affected the computation of hydrogen bonding energies and their derivatives, and correcting it resulted in a small but perceptible change in some of the hydrogen bond energies. This created a tension between fixing the bug and preserving the existing scoring behavior, since the rest of the score function had been calibrated with the bug present. By running the scientific tests on a development branch in which we had fixed the bug, we confirmed that although the correction resulted in a small change in the energies, it had no perceptible effect on the scientific accuracy of large-scale sampling runs for structure prediction, docking, design, or any other protocol tested. This allowed us to make the correction without harming Rosetta’s scientific performance. We are certain that the scientific tests will be invaluable for ensuring that future bug-fixing and refactoring efforts do not hinder the scientific performance of our software, thus illustrating a key example of scientific benchmarks informing substantive decisions developers must make as they navigate code life cycles.
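This class of bug reproduces in any language with C-style integer division. A toy illustration (not the actual Rosetta hydrogen-bond code), using Python's // operator to mimic C++ integer division:

```python
def scale_buggy(numerator, denominator=3):
    """Mimics the reported bug: C++-style integer division silently
    discards the remainder, so 2/3 evaluates to 0 instead of 0.666...
    (Illustrative only; the real code computed hydrogen-bond terms.)"""
    return numerator // denominator  # integer division: 2 // 3 == 0

def scale_fixed(numerator, denominator=3):
    """The fix: perform true floating-point division before using the
    ratio in any downstream energy term."""
    return numerator / denominator   # true division: 2 / 3 == 0.666...
```

A scaling factor of 0 rather than 2/3 is exactly the kind of silent numerical error that unit tests on small code blocks and scientific tests on full protocols are positioned to catch at different levels.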
Use case (3): test framework allows detailed investigation of new score functions under development
Our framework can also be used to test how major code improvements would affect scientific performance before they are adopted as default options in the code. As an example, we can test how newly developed score functions perform: although small molecules and proteins are generally more rigid structures, intermediate-scale molecules are frequently disordered and flexible. A recent study shows that Rosetta’s estimates of rigidity (using the funnel quality metric PNear computed to a designed binding conformation) for peptides designed to bind to and inhibit a target of therapeutic interest correlate well with IC50 values51. Since this prediction has relevance to computer-aided drug development efforts, we want to ensure that future protocol development would not impair these predictions. We created a test (called peptide_pnear_vs_ic50) that performs rigidity analysis on a pool of peptides that were previously characterized experimentally and computes the correlation coefficient for the PNear values from predicted models to the experimentally measured IC50 values. We find that the current default score function, REF2015, produces much better predictions than the legacy talaris2013 and talaris2014 score functions (R2 = 0.53, 0.53, and 0.90 for talaris2013, talaris2014, and REF2015, respectively), indicating an improvement of the score function accuracy for this particular application35. However, this correlation is considerably worse with the score function Beta currently under development (R2 = 0.19). This reveals problems in this candidate next-generation score function that will have to be addressed before it becomes the default. Our scientific tests embedded in the test server framework provide a means of rapidly benchmarking and addressing these problems.
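The reported R2 values are coefficients of determination between predicted PNear and measured IC50 values. A generic least-squares sketch of that summary statistic (the paper's exact analysis pipeline may differ, e.g. in any transformation applied to IC50 before fitting):

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs,
    computed as the squared Pearson correlation. For the test described,
    xs would be per-peptide PNear values and ys the experimental
    measurements; both argument roles here are illustrative."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)          # variance term in x
    syy = sum((y - my) ** 2 for y in ys)          # variance term in y
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(xs, ys))            # covariance term
    return (sxy * sxy) / (sxx * syy)
```

Tracking this single scalar per score function is what lets the test flag a regression such as the drop from 0.90 to 0.19 automatically.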
Use case (4): this framework and tests encourage scientific reproducibility on several levels
How is this framework useful beyond the specific tests mentioned here? Its usefulness for Rosetta developers and users lies in the protocol captures, the specific performance of each protocol, and the knowledge that scientific performance is monitored over time. Developers of macromolecular modeling methods outside of Rosetta can use and run the exact test protocol captures to compare Rosetta’s results to their own, newly developed methods. The code for the general framework to run large-scale, continuous, automated tests is available under the standard Rosetta license and is useful for developers of any type of software. Lastly, the framework highlights how software can be developed in a scientifically reproducible manner, lessons of which are useful and necessary for the scientific community at large. While we recognize the time and work required to implement such tests and the underlying framework, the benefits far outweigh the effort spent in trying to reproduce results that were implemented in a manner that lacks necessary aspects for reproducibility, as discussed in Table 3.
Here, we present a test server framework for continuously running scientific benchmarks on an integrated HPC cluster and detail the way this framework has had a positive and substantive effect on our large community of scientists. The framework itself is sufficiently general that it could in principle be used on many types of scientific software. We use it on Rosetta protocols that cover the three main interfaces to the codebase: the command line, RosettaScripts, and PyRosetta. New benchmarks are easily added and debugged, and the workflow for setting them up is well-documented and general: new tests can be added in a matter of hours and require minimal coding experience in Rosetta. We provide detailed documentation and consistency in the presentation of results, thereby facilitating maintenance by more than just experts in the community and ensuring the longevity of these tests. Automated and continuous runs of these tests allow us to recognize shifts in performance, as development is simultaneously carried out on several interdependent but otherwise unrelated fronts. Thus, we can build a longitudinal map of accuracy and scientific correctness in a constantly evolving codebase (for ourselves and our users), provide realistic protocol captures of how to run applications, and build tools that follow guidelines for improving reproducibility. Diversity in the choice of targets in the benchmark sets provides a realistic performance estimate that is somewhat insulated from institutional and career incentives. So far, 40 benchmarks for various biomolecular systems and prediction tasks have been added to our server framework and more will be added over time. Due to the size of our software and the large number of protocols available, running these benchmarks requires a substantial amount of resources, which are funded through RosettaCommons, since such benchmarks are a priority for software sustainability.
Even though our setup involves the integration of a custom software framework and web interface with typical HPC hardware, we expect our design choices to be of general interest and integrable with paid services such as Drone CI44, Travis CI45, or Jenkins46, which are great options for small software development communities or labs that lack the hardware or personnel resources. This framework demonstrates how challenges in scientific reproducibility can be approached and handled in a general manner, even in a large and diverse community.
Implementation of a modular testing system addressing the goals above is a crucial step in achieving the reproducibility of scientific software. Yet, several challenges remain that are mostly due to a lack of incentive structure. (1) In the past several years, funding agencies and journals have introduced requirements for sharing and storing data and for ensuring reproducibility. However, even if data, detailed workflows, and output are shared and available, grant or paper reviewers are unlikely to take the time to run the code because it often comes with a substantial time investment for which the reviewers do not get much in return. We argue that offering high-value incentives, such as co-authorship on the paper, mini-grants, or other compensation to the reviewer, in return for running the code and comparing the data, could potentially make a huge difference in closing the gap in the reproducibility crisis. Alternatively, funding agencies and journals could require that another scientist, independent from the group publishing the method, serve as an independent code reviewer and become a co-author. (2) Both funding agencies and academic labs working on smaller software tools need to understand that the bulk of the work in developing a tool is not the development of the tool itself, but its maintenance, requiring years of sustained effort for it to mature into something valuable and useful with actual impact on the scientific community. (3) Similarly, funding agencies and labs need to understand that code is cheap but high-quality code is expensive to create. The short-term nature of most academic research labor (undergraduate, graduate student, and postdoctoral researcher) conflicts with the long-term necessity of maintenance. Sustainable research tooling requires careful oversight and long-term management by a project leader, ensuring that maintenance responsibilities are continually reassigned as the labor pool shifts.
The RosettaCommons community of developers has emphasized software testing for over 15 years. To support our community of hundreds of developers, our user base of tens of thousands of users, and the codebase of over 3 million lines of code4, we implemented a custom testing architecture to fit our needs. We use this platform (a.k.a. the “Benchmark Server”) to run all our tests including unit tests, integration tests, profile tests, style tests, score function tests, build tests, and others. Using this benchmark server to implement scientific tests is therefore a natural extension of its current use. Our custom testing software runs on a dedicated HPC cluster (which also runs the ROSIE server52), paid for by the RosettaCommons from government and non-profit funding, and commercial licensing fees.
The backend of the benchmark infrastructure
Our testing infrastructure consists of a number of machines:
Database server. The database server stores information about revisions, tests, and sub-test results, as well as auxiliary data such as comments on revisions and the list of branches currently tracked via GitHub4,53. We use PostgreSQL.
Web server. The web interface for Rosetta developers connects to the database server. When a developer requests a particular revision or test result, the web server gathers these data from the database server, generates the HTML page, and sends it to the developer’s web browser. The web server also allows developers to queue new tests through the submit page of the web interface.
Revision daemon. This application watches the state of various branches, queues tests, and sends notifications. The daemon tracks the list of branches and periodically checks whether a new revision has been committed to a particular branch. When a new revision has been committed, it schedules the default test set for that branch. The daemon also watches for open pull requests on GitHub and, for each pull request, checks for specific test labels (for instance, “standard tests”); the revision daemon then schedules the labeled tests for that pull request.
Because scientific tests require an enormous amount of computing power, we are currently unable to test every single revision in the Rosetta main branch. Instead, we run scientific tests on a best-effort basis. The tests run continuously, but because there are sometimes multiple updates to the main branch per day and it takes the scientific tests about a week to run, many revisions in the main branch remain untested. In case of a test failure, the revision daemon performs a binary search bisecting the untested revisions to determine the exact revision that is responsible for the breakage.
Testing daemons (4–N). The testing daemons run on various platforms: Mac, Linux, and Windows. We currently have 18 of these daemons, some of which are meant for build tests (e.g., on Windows) and some of which are capable of running tests on our HPC cluster. Each daemon periodically checks the list of queued tests from the database server. If there is a test that the daemon is capable of running, it runs the test and then uploads the test results (logs, result files, and test results encoded in JSON) to our SQL database.
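Conceptually, each testing daemon’s loop body looks something like the following sketch, with the database stubbed as an in-memory queue. The field names and capability labels here are hypothetical, not the actual server schema:

```python
import json

# Platforms this hypothetical daemon can run tests on.
CAPABILITIES = {"linux", "hpc"}

def poll_once(queue, results):
    """Take the first queued test this daemon can run, run it, and
    record a JSON-encoded result, mirroring one iteration of the
    daemon's poll loop."""
    for i, test in enumerate(queue):
        if test["platform"] in CAPABILITIES:
            test = queue.pop(i)              # claim the test
            outcome = test["run"]()          # e.g. launch jobs and wait
            results.append(json.dumps({"name": test["name"],
                                       "state": "passed" if outcome else "failed"}))
            return test["name"]
    return None  # nothing runnable; sleep and poll again later

queue = [{"name": "build_windows", "platform": "windows", "run": lambda: True},
         {"name": "mp_relax", "platform": "hpc", "run": lambda: True}]
results = []
print(poll_once(queue, results))  # claims and runs "mp_relax"
```

In the real system the queue and results live in the PostgreSQL database and the run step dispatches work to the HPC cluster, but the claim-run-upload cycle is the same.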
This backend code is specific to our hardware, HPC use patterns, and system administration environment, and is maintained separately from the code that performs or tests science. This code does not include the frontend scientific testing framework (next paragraph) and is not needed to replicate any of the scientific results. The frontend implementation of the scientific testing framework including all the scientific benchmarks is fully available under the RosettaCommons license.
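The binary search the revision daemon performs over untested revisions can be sketched as follows. The `test_passes` callback is a hypothetical stand-in for actually running the test at a revision; the search assumes the oldest revision passed, the newest fails, and the breakage persists once introduced:

```python
def find_breaking_revision(revisions, test_passes):
    """Binary search for the first revision at which a test fails,
    assuming revisions[0] passes, revisions[-1] fails, and the
    failure is monotonic (once broken, it stays broken)."""
    lo, hi = 0, len(revisions) - 1   # lo known-good, hi known-bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if test_passes(revisions[mid]):
            lo = mid                 # still good: breakage is later
        else:
            hi = mid                 # already bad: breakage is here or earlier
    return revisions[hi]

# Hypothetical usage: revisions r0..r9, with the breakage introduced at r6.
revs = [f"r{i}" for i in range(10)]
print(find_breaking_revision(revs, lambda r: int(r[1:]) < 6))  # → r6
```

Each probe costs one full test run, so bisecting n untested revisions takes about log2(n) runs rather than n, which matters when a single scientific test takes days.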
Setup of the scientific tests
We chose a simple setup as shown in Fig. 1C. Each scientific test requires a small number of files, available in a template directory. All files in this directory are well documented with comments, and the lines that require editing for specific tests are highlighted. Each scientific test directory starts from a template containing the following files:
input files—either located in this directory or in a parallel git submodule if the input files exceed 5 MB. This policy prevents our main code repository from becoming overly inflated with thousands of input files for scientific benchmarking.
0.compile.py—compiles the Rosetta and/or PyRosetta executable.
1.submit.py—submits the benchmark jobs either to the local machine or to the HPC cluster. Note that this “or” provides hardware independence: a developer writes and debugs locally and can then run at scale on the benchmark server.
2.analyze.py—analyzes the output data, depending on the scientific objective. Analysis functions that are specific to this particular test live in this script, while broadly useful analysis functions are located in a file that is part of the overall Python test server framework and that contains functions to evaluate quality measures.
3.plot.py—plots the output data via matplotlib54, or other plotting software as appropriate.
…—other numbered scripts can be added as needed; they will run consecutively as numbered.
9.finalize.py—gathers the output data, classifies the test as passed or failed, and creates an HTML page by combining the documentation from the readme file, the plots of the output data, and the pass/fail criterion. The HTML page is the main results page that developers, maintainers, and observers examine.
citation—includes all relevant citations that describe the protocol, the benchmark set, or the quality measures.
cutoffs—contains the cutoffs used for distinguishing between a pass or a failure for this test.
observers—email addresses of the developers who set up and/or maintain the test. If a test fails on the test server, an email is sent to the observers to inform them of the breakage.
readme.md—is a questionnaire-style markdown file that contains all necessary documentation to understand the purpose and detailed methods of the test. Obtaining detailed documentation is essential for the maintenance and longevity of the test. The goal is that anyone with basic Rosetta expertise and training can understand, reproduce, and maintain the test. The template readme file is provided in the Supplementary Information of this paper.
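As a sketch of the final classification step, a script in the role of 9.finalize.py might compare the quality measures produced by 2.analyze.py against the values stored in the cutoffs file. The field names and thresholds below are hypothetical, not the actual file format:

```python
# Hypothetical cutoffs, as would be parsed from the `cutoffs` file.
cutoffs = {"rmsd_max": 2.0, "r_squared_min": 0.8}
# Hypothetical quality measures, as produced by 2.analyze.py.
measures = {"rmsd": 1.4, "r_squared": 0.9}

def classify(measures, cutoffs):
    """Return ('passed'|'failed', reasons) by checking each quality
    measure against its cutoff."""
    failures = []
    if measures["rmsd"] > cutoffs["rmsd_max"]:
        failures.append("rmsd above cutoff")
    if measures["r_squared"] < cutoffs["r_squared_min"]:
        failures.append("correlation below cutoff")
    return ("passed" if not failures else "failed", failures)

state, why = classify(measures, cutoffs)
print(state)  # → passed
```

Keeping the thresholds in a separate cutoffs file, as the template does, lets maintainers retune a test without touching the analysis code.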
Most Rosetta protocols use Monte Carlo sampling to create protein or biomolecule conformations, which are then evaluated by a score function.
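As an illustration of this sampling scheme, here is a minimal, generic Metropolis Monte Carlo loop; `score` and `perturb` stand in for a Rosetta score function and mover, and the whole block is a toy sketch rather than Rosetta code:

```python
import math
import random

def metropolis_mc(score, perturb, x0, n_steps=1000, kT=1.0, seed=0):
    """Minimal Metropolis Monte Carlo loop of the kind Rosetta protocols
    build on: perturb a conformation, rescore it, and accept or reject
    the move by the Boltzmann criterion. Returns the lowest-energy
    state visited."""
    rng = random.Random(seed)
    x, e = x0, score(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = perturb(x, rng)
        e_new = score(x_new)
        # Accept downhill moves always; uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp((e - e_new) / kT):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Toy 1-D "energy landscape" with its minimum at x = 3.
best, e = metropolis_mc(lambda x: (x - 3.0) ** 2,
                        lambda x, rng: x + rng.uniform(-0.5, 0.5), 0.0)
print(round(best, 2), round(e, 3))
```

The stochastic nature of this loop is precisely why scientific tests must sample many trajectories and compare aggregate quality measures against cutoffs, rather than expecting bit-identical outputs.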
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
All Rosetta code and the frontend implementation of the scientific testing framework including all the scientific benchmarks are fully available under the RosettaCommons license. In addition, complete protocol captures for all benchmarks with input files, command lines, output files, analyses, and result summaries are publicly available to view and download at https://graylab.jhu.edu/download/rosetta-scientific-tests/. These complete protocol captures are available in two code revisions and will be automatically expanded with new revisions added about every 6 months. Older revisions remain on the server. Details about each of the 42 datasets with accession codes etc. are provided under the link above.
Rosetta is licensed and distributed through https://www.rosettacommons.org. Licenses for academic, non-profit, and government laboratories are free of charge; there is a license fee for industry users. A license is required to gain access to the GitHub repository. Specific version numbers are given in the Supplementary Information.
Baker, M. & Penny, D. Is there a reproducibility crisis? Nature 533, 452–454 (2016).
Freedman, L. P., Cockburn, I. M. & Simcoe, T. S. The economics of reproducibility in preclinical research. PLOS Biol. 13, e1002165 (2015).
Peng, R. D. Reproducible research in computational science. Science 334, 1226–1227 (2011).
Koehler Leman, J. et al. Better together: elements of successful scientific software development in a distributed collaborative community. PLOS Comput. Biol. 16, e1007507 (2020).
Adorf, C. S., Ramasubramani, V., Anderson, J. A. & Glotzer, S. C. How to professionally develop reusable scientific software—and when not to. Comput. Sci. Eng. 21, 66–79 (2019).
Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452 (2016).
Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
Stodden, V. et al. Enhancing reproducibility for computational methods. Science 354, 1240–1241 (2016).
Mervis, J. NSF to ask every grant applicant for data management plan. Science https://www.sciencemag.org/news/2010/05/nsf-ask-every-grant-applicant-data-management-plan (2010).
Editorial. Everyone needs a data-management plan. Nature 555, 286 (2018).
Williams, M., Bagwell, J. & Nahm Zozus, M. Data management plans: the missing perspective. J. Biomed. Inform. 71, 130–142 (2017).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 1–9 (2016).
Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 44, W3–W10 (2016).
Sandve, G. K., Nekrutenko, A., Taylor, J. & Hovig, E. Ten simple rules for reproducible computational research. PLoS Comput. Biol. 9, e1003285 (2013).
Perkel, J. M. Challenge to scientists: does your ten-year-old code still run? Nature 584, 656–658 (2020).
ReScience C—Ten Years Reproducibility Challenge. https://rescience.github.io/ten-years/.
ReScience C. http://rescience.github.io/.
Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J. & Reinero, D. A. Contextual sensitivity in scientific reproducibility. Proc. Natl Acad. Sci. USA 113, 6454–6459 (2016).
Peters, B., Brenner, S. E., Wang, E., Slonim, D. & Kann, M. G. Putting benchmarks in their rightful place: the heart of computational biology. PLOS Comput. Biol. 14, e1006494 (2018).
Ó Conchúir, S. et al. A web resource for standardized benchmark datasets, metrics, and Rosetta protocols for macromolecular modeling and design. PLoS ONE 10, e0130433 (2015).
Huizinga, D. & Kolawa, A. Automated Defect Prevention: Best Practices in Software Management (Wiley, 2007). https://www.wiley.com/en-us/Automated+Defect+Prevention%3A+Best+Practices+in+Software+Management-p-9780470042120.
Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. & Tramontano, A. Critical assessment of methods of protein structure prediction (CASP)—round XII. Proteins Struct. Funct. Bioinforma. 86, 7–15 (2018).
Wodak, S. J. & Janin, J. Modeling protein assemblies: critical assessment of predicted interactions (CAPRI) 15 years hence. Proteins Struct. Funct. Bioinforma. 85, 357–358 (2017).
Friedberg, I. & Radivojac, P. Methods Mol. Biol. 1446, 133–146 (2017).
Daneshjou, R. et al. Working toward precision medicine: predicting phenotypes from exomes in the Critical Assessment of Genome Interpretation (CAGI) challenges. Hum. Mutat. 38, 1182–1192 (2017).
Miao, Z. et al. RNA-Puzzles round IV: 3D Structure predictions of four ribozymes and two aptamers. RNA 26 (2020).
Haas, J. et al. Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins Struct. Funct. Bioinforma. 86, 387–398 (2018).
Leaver-Fay, A. et al. ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 487, 545–574 (2011).
Koehler Leman, J. et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat. Methods 17, 665–680 (2020).
Kaufmann, K. W. & Meiler, J. Using RosettaLigand for small molecule docking into comparative models. PLoS ONE 7, e50769 (2012).
Conway, P., Tyka, M. D., DiMaio, F., Konerding, D. E. & Baker, D. Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23, 47–55 (2014).
Leaver-Fay, A. et al. Scientific benchmarks for guiding macromolecular energy function improvement. Methods Enzymol. 523, 109–143 (2013).
O’Meara, M. J. et al. Combined covalent-electrostatic model of hydrogen bonding improves structure prediction with Rosetta. J. Chem. Theory Comput. 11, 609–622 (2015).
Park, H. et al. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. Theory Comput. 12, 6201–6212 (2016).
Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 1–35 (2017).
Alford, R. F., Samanta, R. & Gray, J. J. Diverse scientific benchmarks for implicit membrane energy functions. J. Chem. Theory. Comput. 17, 5248–5261 (2021).
Renfrew, P. D., Campbell, G., Strauss, C. E. M. & Bonneau, R. The 2010 Rosetta developers meeting: macromolecular prediction and design meets reproducible publishing. PLoS ONE 6, e22431 (2011).
Bender, B. J. et al. Protocols for Molecular Modeling with Rosetta3 and RosettaScripts. Biochemistry https://doi.org/10.1021/acs.biochem.6b00444 (2016).
Fleishman, S. J. et al. RosettaScripts: a scripting language interface to the Rosetta Macromolecular modeling suite. PLoS ONE 6, 1–10 (2011).
Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).
Gray, J. J., Chaudhury, S., Lyskov, S. & Labonte, J. W. The PyRosetta Interactive Platform for Protein Structure Prediction and Design: A Set of Educational Modules. http://www.amazon.com/PyRosetta-Interactive-Platform-Structure-Prediction/dp/1500968277 (2014).
GitHub Actions. https://github.com/features/actions.
Drone CI—Automate Software Testing and Delivery. https://www.drone.io/.
Travis CI—continuous integration. https://travis-ci.org/.
RosettaCommons. Rosetta documentation—Scientific Benchmarks. http://new.rosettacommons.org/docs/latest/development_documentation/test/Scientific-Benchmarks.
Bhardwaj, G. et al. Accurate de novo design of hyperstable constrained peptides. Nature 538, 329–335 (2016).
Hosseinzadeh, P. et al. Comprehensive computational design of ordered peptide macrocycles. Science 358, 1461–1466 (2017).
Smith, S. T. & Meiler, J. Assessing multiple score functions in Rosetta for drug discovery. PLoS ONE 15, e0240450 (2020).
Mulligan, V. K. et al. Computationally designed peptide macrocycle inhibitors of New Delhi metallo-β-lactamase 1. Proc. Natl Acad. Sci. USA 118 (2021).
Lyskov, S. et al. Serverification of molecular modeling applications: the Rosetta Online Server that Includes Everyone (ROSIE). PLoS ONE 8, e63906 (2013).
Matplotlib: Python plotting. https://matplotlib.org/.
Weitzner, B. D. et al. Modeling and docking of antibody structures with Rosetta. Nat. Protoc. 12, 401–416 (2017).
Weitzner, B. D. & Gray, J. J. Accurate structure prediction of CDR H3 loops enabled by a novel structure-based C-terminal constraint. J. Immunol. 198, 505–515 (2017).
Sircar, A. & Gray, J. J. SnugDock: paratope structural optimization during antibody-antigen docking compensates for errors in antibody homology models. PLoS Comput. Biol. 6, e1000644 (2010).
Nance, M. L., Labonte, J. W., Adolf-Bryfogle, J. & Gray, J. J. Development and evaluation of GlycanDock: a protein–glycoligand docking refinement algorithm in Rosetta. J. Phys. Chem. B https://doi.org/10.1021/ACS.JPCB.1C00910 (2021).
Labonte, J. W., Adolf-Bryfogle, J., Schief, W. R. & Gray, J. J. Residue-centric modeling and design of saccharide and glycoconjugate structures. J. Comput. Chem. 38, 276–287 (2017).
Adolf-Bryfogle, J. et al. Growing glycans in Rosetta: accurate de-novo glycan modeling, density fitting, and rational sequon design. Prep. (2021).
Song, Y. et al. High-resolution comparative modeling with RosettaCM. Structure 21, 1735–1742 (2013).
Kortemme, T., Kim, D. E. & Baker, D. Computational alanine scanning of protein-protein interfaces. Sci. STKE 2004, pl2 (2004).
Guffy, S. L., Teets, F. D., Langlois, M. I. & Kuhlman, B. Protocols for requirement-driven protein design in the Rosetta modeling program. J. Chem. Inf. Model. 58, 895–901 (2018).
Nivón, L. G., Bjelic, S., King, C. & Baker, D. Automating human intuition for protein design. Proteins 82, 858–866 (2014).
Maguire, J. B. et al. Perturbing the energy landscape for improved packing during computational protein design. Proteins Struct. Funct. Bioinforma. 89, 436–449 (2021).
Loshbaugh, A. L. & Kortemme, T. Comparison of Rosetta flexible-backbone computational protein design methods on binding interactions. Proteins Struct. Funct. Bioinforma. 88, 206–226 (2020).
Yachnin, B. J., Mulligan, V. K., Khare, S. D. & Bailey-Kellogg, C. MHCEpitopeEnergy, a flexible rosetta-based biotherapeutic deimmunization platform. J. Chem. Inf. Model. 61, 2368–2382 (2021).
Gray, J. J. et al. Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J. Mol. Biol. 331, 281–299 (2003).
Marze, N. A., Roy Burman, S. S., Sheffler, W. & Gray, J. J. Efficient flexible backbone protein–protein docking for challenging targets. Bioinformatics 34, 3461–3469 (2018).
Alam, N. & Schueler-Furman, O. Methods Mol. Biol. 1561 139–169 (Humana Press Inc., 2017).
Gront, D., Kulp, D. W., Vernon, R. M., Strauss, C. E. M. & Baker, D. Generalized Fragment Picking in Rosetta: Design, Protocols and Applications. 6, e23294 (2011).
Canutescu, A. A. & Dunbrack, R. L. Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci. 12, 963–972 (2003).
Mandell, D. J., Coutsias, E. A. & Kortemme, T. Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nat. Methods 6, 551–552 (2009).
Fernandez, A. J. et al. The structure of the colorectal cancer-associated enzyme GalNAc-T12 reveals how nonconserved residues dictate its function. Proc. Natl Acad. Sci. USA 116, 20404–20410 (2019).
Stein, A. & Kortemme, T. Improvements to robotics-inspired conformational sampling in rosetta. PLoS ONE 8, e63090 (2013).
Alford, R. F., Fleming, P. J., Fleming, K. G. & Gray, J. J. Protein structure prediction and design in a biologically realistic implicit membrane. Biophys. J. 118, 2042–2055 (2020).
Alford, R. F. et al. An integrated framework advancing membrane protein modeling and design. PLoS Comput. Biol. 11, e1004398 (2015).
Koehler Leman, J. & Bonneau, R. A novel domain assembly routine for creating full-length models of membrane proteins from known domain structures. Biochemistry 57, 1939–1944 (2018).
Koehler Leman, J., Lyskov, S. & Bonneau, R. Computing structure-based lipid accessibility of membrane proteins with mp_lipid_acc in RosettaMP. BMC Bioinforma. 18, 115 (2017).
Tyka, M. D. et al. Alternate states of proteins revealed by detailed energy landscape mapping. J. Mol. Biol. 405, 607–618 (2011).
Watkins, A. M., Rangan, R. & Das, R. FARFAR2: improved de novo Rosetta prediction of complex global RNA folds. Structure 28, 963–976.e6 (2020).
Watkins, A. M. et al. Blind prediction of noncanonical RNA structure at atomic accuracy. Sci. Adv. 4, eaar5316 (2018).
Kuenze, G., Bonneau, R., Leman, J. K. & Meiler, J. Integrative protein modeling in Rosetta NMR from sparse paramagnetic restraints. Structure 27, 1721–1734.e5 (2019).
ARO MURI W911NF-16-1-0372 to Watkins; American Heart Association 18POST34080422 to Kuenze; BSF 2015207 to Schueler-Furman, Ben-Aharon; Cancer Research Institute Irvington Postdoctoral Fellowship (CRI 3442) to Roy Burman; Canadian Institutes of Health Research Postdoctoral Fellowship to Yachnin; Cyrus Biotechnology to Lewis; Simons Foundation to Bonneau, Koehler Leman, Mulligan; German Research Foundation KU 3510/1-1 to Kuenze; H2020 MSCA IF CC-LEGO 792305 to Ljubetic; HHMI to Baker; Hertz Foundation Fellowship to Alford; ISF 717/2017 to Schueler-Furman, Ben-Aharon; Lundbeck Foundation Fellowship R272-2017-4528 to Stein; Mistletoe Research Foundation Fellowship to Yachnin; NCN 2018/29/B/ST6/01989 to Gront, Krys; NIAID R01AI113867 to Schief, Adolf-Bryfogle; NIEHS P42ES004699 to Siegel; NIH 1R01GM123089 to Farrell, DiMaio; NIH 2R01GM098977 to Bailey-Kellogg; NIH F31-CA243353 to Smith; NIH F31-GM123616 to Jeliazkov; NIH GM067553 to Maguire; NIH NCI R21 CA219847 and NIH R01 GM121487 to Das, Watkins; NIH NHLBI 2R01HL128537 to Yarov-Yarovoy; NIH NIAID R21 AI156570 and NIH NIBIB R21 EB028342 to Bahl; NIH NIAID U01 AI150739, NIH NIDA R01 DA046138 to Meiler, Moretti; NIH NIGMS R01 GM080403 to Meiler, Moretti and Kuenze; NIH NIGMS R01 GM073151 to Kuhlman, Gray, Leaver-Fay, Lyskov, Moretti, Meiler; NIH NIGMS R01 GM121487 and NIH NIGMS R35 GM122579 to Das; NIH NIGMS 1R01GM132110 and NIH NINDS 1R01NS103954 to Yarov-Yarovoy; NIH NINDS UG3NS114956 to Nguyen, Yarov-Yarovoy; NIH F32 CA189246 to Labonte; NIH R01 GM 076324-11 to Siegel; NIH R01 GM129261 to Woods; NIH R01 GM078221 to Harmalkar, Roy Burman, Jeliazkov, Nance, Samanta, and Gray; NIH R01 GM127578 to Gray and Labonte; NIH R01 GM110089 to Loshbaugh, Kortemme, Barlow; NIH R35 GM131923 to Leaver-Fay, Teets, Kuhlman; NIH R01 GM132565 to Hansen, Khare; NSF 1507736 to Gray, Roy Burman; NSF 1627539 and NSF 1827246 to Siegel; NSF 1805510 to Siegel, Fell; NSF 2031785 to Bahl; NSF DBI‐1564692 to Loshbaugh, Kortemme, Barlow and
O’Connor; NSF GRFP Fellowship to Alford; NSF CBET1923691 to Hansen, Khare; Novo Nordisk Foundation NNF18OC0033950 to Tiemann, Stein, Lindorff-Larsen; RosettaCommons Licensing Fund RC8010 to Bahl; RosettaCommons to Hansen, Moretti, Lyskov, Khare, Gray; NIH NRSA T32AI007244 and NIH U19AI117905 to Schief, Adolf-Bryfogle. The authors further thank Matt Mulqueen for expert administration of the multiple benchmark testing servers and cluster, RosettaCommons for hardware and staff support after the NIH ended their software infrastructure program, and companies that license Rosetta, providing support for critical software sustainability practices.
Rosetta software has been licensed to numerous non-profit and for-profit organizations. Rosetta Licensing is managed by UW CoMotion, and royalty proceeds are managed by the RosettaCommons. Under institutional participation agreements between the University of Washington, acting on behalf of the RosettaCommons, their respective institutions may be entitled to a portion of the revenue received on licensing Rosetta software including programs described here. D.B., J.J.G., R.B., O.S.F., D.G., T.K., J.M., and V.Y.Y. are unpaid board members of the RosettaCommons. As members of the Scientific Advisory Board of Cyrus Biotechnology, D.B. and J.J.G. are granted stock options. S.M.L., A.L.L., and D.F. are employed by or have a relationship with Cyrus Biotechnology. Cyrus Biotechnology distributes the Rosetta software, which includes the methods discussed in this study. V.K.M. is a co-founder of and shareholder in Menten Biotechnology Labs, Inc. The content of this manuscript is relevant to work performed at Menten. J.M. is employed by Menten with granted stock options. D.B. is a cofounder of, shareholder in, or advisor to the following companies: ARZEDA, PvP Biologics, Cyrus Biotechnology, Cue Biopharma, Icosavax, Neoleukin Therapeutics, Lyell Immunotherapeutics, Sana Biotechnology, and A-Alpha Bio. CBK is a co-founder and manager of Stealth Biologics, LLC, a biotechnology company. R.B. is executive director of Prescient Design/Genentech, a member of the Roche group. The remaining authors declare no competing interests.
Peer review information Nature Communications thanks Charlotte Deane, Thomas Kude, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Koehler Leman, J., Lyskov, S., Lewis, S.M. et al. Ensuring scientific reproducibility in bio-macromolecular modeling via extensive, automated benchmarks. Nat Commun 12, 6947 (2021). https://doi.org/10.1038/s41467-021-27222-7