With the steep decline in the cost of many high-throughput technologies, large amounts of biological data are being generated and made accessible to researchers. Machine learning (ML) has come into the spotlight as a powerful approach for understanding cellular1, genomic2, proteomic3, post-translational4, metabolic5 and drug discovery data6, with the potential to enable ground-breaking medical applications7,8. This is clearly reflected in the corresponding growth of ML publications (Fig. 1), which report a wide range of modeling techniques in biology. While ideally ML methods should be validated experimentally, this happens in only a fraction of publications9. We believe that the time is right for the ML community to develop standards for reporting ML-based analyses to enable critical assessment10 and improve reproducibility11,12.

Fig. 1: Exponential increase of ML publications in biology.

The number of ML publications per year is based on Web of Science from 1996 onwards using the topic category for “machine learning” in combination with each of the following terms: “biolog*”, “medicine”, “genom*”, “prote*”, “cell*”, “post translational”, “metabolic” and “clinical”.

Guidelines or recommendations on how to appropriately construct ML algorithms can help to ensure correct results and predictions13,14. In biomedical research, communities have defined standard guidelines and best practices for scientific data management15 and for the reproducibility of computational tools16,17. Within the ML community, there is demand for a cohesive set of recommendations covering the data, the optimization techniques, the final model and the evaluation protocols as a whole.

A recent comment highlighted the need for standards in ML18, arguing for the adoption of on-submission checklists10 as a first step toward improving publication standards. Through a community-driven consensus, we propose a list of minimal requirements, phrased as questions to ML implementers (Box 1), that, if addressed, will make it easier to assess the quality and reliability of reported methods. We have focused on data, optimization, model and evaluation (DOME), as each component of an ML implementation usually falls within one of these four topics. We do not propose new specific solutions, only recommendations (Table 1). A reporting checklist is also provided (Box 1). Our recommendations are made primarily for the case of supervised learning in biological applications in the absence of direct experimental validation, as this is the most common type of ML approach used. We do not discuss how ML can be used in clinical applications19,20. It also remains to be determined whether the DOME recommendations can be extended to other fields of ML, such as unsupervised, semisupervised and reinforcement learning.

Table 1 Supervised ML in biology: concerns, their consequences and recommendations

Development of the recommendations

The recommendations outlined below were initially formulated through the ELIXIR Machine Learning Focus Group after the publication of a Comment calling for the establishment of standards for ML in biology18. ELIXIR, initially established in 2014, is now a mature intergovernmental European infrastructure for biological data and represents over 220 research organizations in 22 countries across many aspects of bioinformatics21. Over 700 national experts participate in the development and operation of national services that contribute to data access, integration, training and analysis for the research community. Over 50 of these experts involved in the field of ML have established the ELIXIR Machine Learning Focus Group (https://elixir-europe.org/focus-groups/machine-learning), which held meetings to develop and refine recommendations based on a broad consensus.

Scope of the recommendations

The recommendations cover four major aspects of supervised ML according to the DOME acronym. The key points and rationale for each aspect of DOME are described below and summarized in Table 1. Box 1 provides an actionable checklist (with the recommendations codified as questions), which we suggest authors use as a guide when reporting ML-based methods in manuscripts.

Data

State-of-the-art ML models are often capable of memorizing all the variation in training data. Such models, when evaluated on data they were exposed to during training, create the illusion of mastering the task at hand; when tested on an independent set of data (termed a test or validation set), their performance appears far less impressive, revealing low generalization power. To tackle this problem, the initial data should be divided randomly into non-overlapping parts. The simplest approach is to have independent training and testing sets (and possibly a third validation set). Alternatively, cross-validation or bootstrapping techniques, which repeatedly draw new training/testing splits from the available data, are often considered a preferred solution22.
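
To make the two strategies concrete, the sketch below illustrates a hold-out split and k-fold cross-validation using scikit-learn; the data, model and parameter choices are placeholders for illustration only, not a prescribed protocol.

```python
# A minimal, illustrative sketch of the two splitting strategies, assuming
# scikit-learn; the data, model and parameter choices are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # placeholder feature matrix
y = rng.integers(0, 2, size=500)    # placeholder binary labels

# Hold-out: independent, non-overlapping training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: several training/testing splits drawn from the same data.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```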

Overlap between training and testing data is particularly difficult to avoid in biology. For example, in predictions on entire gene and protein sequences, independence of training and testing data can be achieved by reducing the number of homologs in the data10,23. Modeling enhancer–promoter contacts requires a different criterion, for example, that training and testing pairs do not share an endpoint24. Modeling protein domains might require multidomain sequences to be split into their constituent domains before homology reduction25. In short, each area of biology has its own recommendations for handling overlapping data, and the previous literature is vital to putting forward a strategy. In Box 1, we propose a set of questions under the category ‘data splits’ that should help to evaluate potential overlap between training and testing data.
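
As one concrete illustration of a split that respects such domain-specific structure, the sketch below assumes that sequences have already been assigned to homology clusters by an external clustering step (not shown) and uses a group-aware splitter so that no cluster is shared between training and testing; all identifiers and dimensions here are hypothetical placeholders.

```python
# A hedged sketch of a group-aware split for sequence data, assuming each
# sequence has already been assigned to a homology cluster by an external
# clustering step (not shown); all identifiers here are hypothetical.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # placeholder per-sequence features
y = rng.integers(0, 2, size=100)          # placeholder labels
clusters = np.repeat(np.arange(20), 5)    # 20 hypothetical homology clusters of 5 sequences

# Keep every cluster entirely on one side of the split, so that no test
# sequence has a close homolog in the training set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=clusters))
assert set(clusters[train_idx]).isdisjoint(clusters[test_idx])
```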

Reporting statistics on dataset size and the distribution of data types can help show whether there is good domain representation in all sets. Simple plots and/or tables showing the number of examples per class (classification), a histogram of binned real values (regression) and the different types of biological molecules in the data are vital pieces of information for each set. Further, in classification, methods that address imbalanced classes26,27 should be included if the class frequencies are skewed. Models trained on one dataset may not deal successfully with data coming from adjacent but not identical datasets, a phenomenon known as covariate shift. The scale of this effect has been demonstrated in several recent publications—for example, for prediction of disease risk from exome sequencing28. Although covariate shift remains an open problem, several potential solutions have been proposed in the area of transfer learning29. Moreover, training ML models that generalize well from small training datasets usually requires special models and algorithms30.
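
Returning to the reporting of dataset composition, the sketch below illustrates one way to report class frequencies per split and to apply cost-sensitive class weighting when they are skewed; the labels, the imbalance threshold and the classifier are illustrative assumptions, and resampling is an equally common alternative.

```python
# An illustrative sketch of reporting class frequencies per split and applying
# cost-sensitive weighting when they are skewed; the labels, the imbalance
# threshold and the classifier are assumptions, not a prescribed recipe.
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y_train = rng.choice([0, 1], size=400, p=[0.9, 0.1])  # placeholder, deliberately skewed
y_test = rng.choice([0, 1], size=100, p=[0.9, 0.1])

def report_distribution(name, labels):
    """Print the per-class counts and percentages for one data split."""
    counts = Counter(labels.tolist())
    for cls, n in sorted(counts.items()):
        print(f"{name}  class {cls}: {n} ({100 * n / len(labels):.1f}%)")
    return counts

train_counts = report_distribution("train", y_train)
report_distribution("test", y_test)

# Switch on class weighting when the majority/minority ratio is large.
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())
model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced" if imbalance_ratio > 3 else None,
)
```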

Lastly, it is important to make as much data available to the public as possible12. Open access to the data used for the experiments, including the precise data splits, would ensure better reproducibility of published research and, as a result, improve the overall quality of published ML papers. If datasets are not readily available in public repositories, authors should be encouraged to find the most appropriate vehicle—for example, ELIXIR deposition databases or Zenodo—to guarantee the long-term availability of such data.

Optimization

Optimization, also known as training, refers to the process of adjusting the values that constitute the model (parameters and hyperparameters), including preprocessing steps, so as to maximize the model’s ability to solve a given problem. A poor choice of optimization strategy may lead to issues such as over- or underfitting31. A model that has suffered severe overfitting will show excellent performance on training data while performing poorly on unseen data, rendering it useless for real-life applications. At the other end of the spectrum, underfitting occurs when very simple models, capable of capturing only straightforward dependencies between features, are applied to data of a more complex nature. Algorithms for feature selection32 can be employed to reduce the chances of overfitting. However, feature selection and other preprocessing steps come with their own recommendations. The main one is to abstain from using non-training data for feature selection and preprocessing—an issue that is particularly hard to spot in meta-predictors and that may lead to an overestimation of performance.
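
As an illustration of this principle, the sketch below places scaling and feature selection inside a scikit-learn Pipeline so that, within each cross-validation fold, they are fitted on the training portion only; the data, the selector and the classifier are arbitrary placeholders.

```python
# An illustrative sketch of confining preprocessing and feature selection to the
# training folds by placing them inside a scikit-learn Pipeline; the data, the
# selector and the classifier are arbitrary placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))      # placeholder: many features, few samples
y = rng.integers(0, 2, size=300)

pipeline = Pipeline([
    ("scale", StandardScaler()),                 # fitted on training folds only
    ("select", SelectKBest(f_classif, k=20)),    # feature selection inside the pipeline
    ("clf", SVC(kernel="rbf", C=1.0)),
])

# Each fold refits scaling and feature selection from scratch, so no
# information from the held-out fold leaks into training.
scores = cross_val_score(pipeline, X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```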

Finally, the release of files specifying the exact optimization protocol and the final parameters and hyperparameters is a vital characteristic of the final algorithm. Lack of documentation, including limited access to the relevant records for the parameters, hyperparameters and optimization protocol, may further hamper understanding of the overall model performance.
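
A minimal sketch of such a record is shown below, assuming a scikit-learn grid search; the file name and field names are illustrative, not a prescribed schema.

```python
# A hedged sketch of recording the optimization protocol and the chosen
# hyperparameters in a machine-readable file; the file name and field names
# are illustrative, not a prescribed schema.
import json
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))     # placeholder training data
y_train = rng.integers(0, 2, size=200)

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
search.fit(X_train, y_train)

record = {
    "model_type": "support vector classifier (RBF kernel)",
    "optimization_protocol": "exhaustive grid search with 5-fold cross-validation",
    "search_space": param_grid,
    "best_hyperparameters": search.best_params_,
    "cv_score_of_best_model": float(search.best_score_),
}
with open("optimization_protocol.json", "w") as fh:
    json.dump(record, fh, indent=2)
```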

Model

Equally important aspects of ML models are their interpretability and reproducibility. Interpretable models can infer causal relationships from the data and can output logical reasoning for each of their predictions. They are especially relevant in areas of discovery such as drug design6 and diagnostics33. Conversely, black box models often give accurate predictions but may not provide human-understandable insight into why those predictions were made. Both interpretable and black box models are discussed in more detail elsewhere34. However, developing recommendations on the choice between black box and interpretable models is not straightforward, as both have their merits. The main recommendation is that authors state whether the model is a black box or interpretable (Box 1) and, if it is interpretable, provide clear examples of interpretable output.

Reproducibility is a key component for ensuring that research outcomes can be further used and validated by the wider community. Poor model reproducibility extends beyond the documentation and reporting of the parameters, hyperparameters and optimization protocol involved. Lack of access to the various components of a model (source code, model files, parameter configurations and executables), as well as steep computational requirements for running the trained model on new data, can make reproducing the model limited or practically impossible.
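
One lightweight way to mitigate this, sketched below under the assumption of a scikit-learn model, is to release the serialized model together with a record of the software versions used; the model, the data and the file names are placeholders.

```python
# An illustrative sketch of releasing a trained model together with the software
# versions needed to rerun it; the model and the file names are placeholders.
import json
import platform

import joblib
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))     # placeholder training data
y_train = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
joblib.dump(model, "model.joblib")       # serialized model file for distribution

# Record the exact environment so that others can rerun the model.
with open("environment.json", "w") as fh:
    json.dump({
        "python": platform.python_version(),
        "numpy": np.__version__,
        "scikit-learn": sklearn.__version__,
    }, fh, indent=2)

# A third party can then reload the model and reproduce the predictions.
reloaded = joblib.load("model.joblib")
assert (reloaded.predict(X_train) == model.predict(X_train)).all()
```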

Evaluation

There are two types of evaluation scenarios in biological research. The first is experimental validation of the predictions made by the ML model in the laboratory; this is highly desirable but beyond the scope of many ML studies. The second is a computational assessment of model performance using established metrics, which is the focus of the following discussion and which carries several possible risks.

To start with performance metrics—that is, the quantifiable indicators of a model’s ability to solve the given task—there are dozens of metrics available35 for assessing different ML classification and regression problems. The plethora of options, combined with the domain-specific expertise that may be required to choose among them, can lead to the selection of inadequate performance measures. Often, critical assessment communities advocate certain performance metrics for biological ML models—for example, the Critical Assessment of Protein Function Annotation (CAFA)3 and the Critical Assessment of Genome Interpretation (CAGI)28—and we recommend that new algorithms be evaluated with metrics established in the literature and in such community-driven critical assessments. In the absence of such literature, the metrics shown in Fig. 2 are a reasonable starting point.

Fig. 2: Metrics for ML.

Top and middle: classification metrics. For binary classification, true positives (tp), false positives (fp), false negatives (fn) and true negatives (tn) together form the confusion matrix. As all classification measures can be calculated from combinations of these four basic values, the confusion matrix should be provided as a core metric. Several measures (shown as equations) and plots should be used to evaluate the ML methods. For descriptions of how to adapt these metrics to multi-class problems, see ref. 35. Bottom: regression metrics. ML regression attempts to produce predicted values (p) matching experimental values (y). Metrics (shown as equations) attempt to capture the difference in various ways. Alternatively, a plot can provide a visual way to represent the differences. It is advisable to report all these measures in any ML work. ROC, receiver operating characteristic; AUC, area under the ROC curve; RMSE, root mean squared error; MAE, mean absolute error.
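
For reference, the sketch below computes several of the metrics from Fig. 2 with scikit-learn on toy predictions; the values are placeholders, and the choice of metrics should still follow the domain literature as noted above.

```python
# An illustrative sketch of computing several metrics from Fig. 2 with
# scikit-learn; the predictions and experimental values are toy placeholders.
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, matthews_corrcoef,
                             mean_absolute_error, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)

# Classification: report the confusion matrix plus derived measures.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3])  # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("confusion matrix:", {"tp": tp, "fp": fp, "fn": fn, "tn": tn})
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))

# Regression: compare predicted values (p) with experimental values (y).
y_exp = np.array([1.2, 0.5, 2.3, 1.8])
p = np.array([1.0, 0.7, 2.0, 1.9])
print("RMSE:", np.sqrt(mean_squared_error(y_exp, p)))
print("MAE: ", mean_absolute_error(y_exp, p))
```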

Once performance metrics are chosen, methods published in the same biological domain must be cross-compared using appropriate statistical tests (for example, Student’s t-test) and confidence intervals. Then, to prevent the release of ML methods that appear sophisticated but perform no better than simpler algorithms, simpler baselines should be compared against the ‘sophisticated’ method and shown to be statistically inferior (for example, shallow vs. deep neural networks).
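
A minimal sketch of such a comparison is given below: a more complex model and a simple baseline are evaluated on identical cross-validation folds and their per-fold scores are compared with a paired t-test; the data, the models and the approximate confidence interval are illustrative assumptions.

```python
# A hedged sketch of comparing a more complex model against a simple baseline on
# identical cross-validation folds with a paired t-test; the data, the models and
# the approximate confidence interval are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 30))       # placeholder features
y = rng.integers(0, 2, size=400)     # placeholder labels

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both models
baseline_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
complex_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# Paired test on per-fold scores; report the mean gain with a rough 95% CI.
t_stat, p_value = ttest_rel(complex_scores, baseline_scores)
diff = complex_scores - baseline_scores
ci = 1.96 * diff.std(ddof=1) / np.sqrt(len(diff))        # normal approximation
print(f"mean gain over baseline: {diff.mean():.3f} +/- {ci:.3f} (p = {p_value:.3f})")
```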

Open areas and limitations of the proposed recommendations

The primary goal of this work is to define best practices that can be of use in the writing of ML-related papers while remaining agnostic as to the actual underlying solutions. We also expect that the proposed recommendations will be useful for peer reviewers of biological studies that use ML. Our intent is to trigger a discussion in the wider ML community, leading to future work addressing possible solutions.

Several key issues related to reproducibility (for example, data are not published, data splits are not reported, and the model source code with its final parameters and hyperparameters is not released) can be addressed with workflow systems that automate multistep processes and help to ensure full reproducibility by tracking model parameters and the exact versions of the source code and libraries. Examples of commonly used workflow systems include Galaxy36 and Nextflow37. Another de facto standard practice in software engineering is the use of version control, hosted on platforms such as GitHub, to keep an online copy of the source code, which can also include parameters and documentation. Similar version control systems exist for datasets. Public repositories can store experimental data on demand on a long-term basis, enabling long-term reproducibility of the experiment. In short, existing software engineering tools can be used to address many of the DOME recommendations.

Although having further, more topic-specific recommendations in the future will undoubtedly be useful, in this work we aim to provide a first version that should be of general interest. Adapting the DOME recommendations to address the unique aspects of specific topics and domains would be a task for those particular communities. For example, formulating general guidelines for data independence is tricky because each biological domain has its own conventions. Nonetheless, we believe it is important to at least recommend that authors describe how they achieved data split independence. Discussions on the correct independence strategies are needed across all of biology. Given the constructive consultation process with ML communities and our own experience, we believe that this Comment can serve as a useful first iteration of the recommendations for supervised ML in biology. This has the added benefit of kickstarting community discussion around a coherent, if rough, set of goals, thus facilitating the overall engagement and involvement of key stakeholders. Topics to be addressed by communities include how to adapt DOME to entire pipelines and to unsupervised, semisupervised, reinforcement and other types of ML. For instance, in unsupervised learning, the evaluation metrics shown in Fig. 2 would not apply and a completely new set of definitions would be needed. Another debate, as AI becomes more commonplace in society, concerns the varying ability of ML algorithms to explain learned patterns to humans, who naturally prefer actions or predictions to be made with reasons given. This is the black box vs. interpretability debate, and we point interested readers to excellent reviews in refs. 38,39,40,41 as a starting point for thoughtful discussions.

Finally, we address the governance structure by suggesting a community-managed governance model similar to that of the open-source initiatives42. Community-managed governance has been used in initiatives such as Minimum Information About a Microarray Experiment (MIAME)43 or the Proteomics Standards Initiative (PSI) Molecular Interaction (MI) format44. This sort of structure ensures continuous community consultation and improvement of the recommendations in collaboration with academic (CLAIRE; see https://claire-ai.org/) and industrial (Pistoia Alliance; see https://www.pistoiaalliance.org/) networks. More importantly, this can be applied in particular to ML communities working with specific problems requiring more detailed guidelines—for example, imaging or clinical applications. We have set up a website (https://www.dome-ml.org/) where news and upcoming events will be posted to provide a platform for governance and community involvement around the DOME recommendations. As the recommendations and minimal requirements evolve over time, a version history will be available on the website. A template supplementary checklist in human-readable (spreadsheet) and machine-readable (YAML) format, as well as software for the automatic conversion of a YAML file into a human-readable one, are available from a dedicated GitHub repository (https://github.com/MachineLearning-ELIXIR/dome-ml).

Conclusion

The objective of our recommendations is to increase the reproducibility and clarity of ML methods for the reader, the experimentalist, the reviewer and the wider community. We accept that these recommendations are not complete and should be viewed as a first iteration of a consensus-based community discussion. One of the most pressing issues is to agree on a standardized data structure to describe the most relevant features of the ML methods being presented. As a first step in addressing this issue, we recommend including an ML summary table, derived from Box 1, in manuscripts describing ML-based studies (Supplementary Table 1). We recommend including the following sentence in the Methods section of a manuscript: “To support the reproducibility of the machine learning method of this study, the machine learning summary table (Table N) is included in the supporting information as per DOME recommendations (https://doi.org/10.1038/s41592-021-01205-4).”

We believe that the development of standardized reporting guidelines has the potential to make a major impact by increasing the quality of published ML methods. First, the current disparity among manuscripts in reporting key elements of the ML method can make reviewing and assessing the method challenging. Second, certain performance measures and essential statistics that may affect the validity of the publication’s conclusions are sometimes not mentioned at all. Third, there are unexplored opportunities associated with meta-analysis of ML datasets: access to large sets of data can both enhance the comparison between methods and facilitate the development of better-performing methods while reducing unnecessary repetition of data generation. We believe that our recommendations to include a “machine learning summary table” and to make datasets available will greatly benefit the ML community and improve its standing with the intended users of these methods.