Computer model of the enzyme protein tyrosine kinase, which is involved in cancer-cell formation. More than 20 years ago, the Cancer Research Screensaver harnessed distributed computing power to assess anti-cancer activity in molecules. Credit: James King-Holmes/SPL

Many people are expressing fears that artificial intelligence (AI) has gone too far — or risks doing so. Take Geoffrey Hinton, a prominent figure in AI, who recently resigned from his position at Google, citing the desire to speak out about the technology’s potential risks to society and human well-being.

But against those big-picture concerns, in many areas of science you will hear a different frustration being expressed more quietly: that AI has not yet gone far enough. One of those areas is chemistry, for which machine-learning tools promise a revolution in the way researchers seek and synthesize useful new substances. But a wholesale revolution has yet to happen — because of the lack of data available to feed hungry AI systems.

Any AI system is only as good as the data it is trained on. These systems rely on what are called neural networks, which their developers teach using training data sets that must be large, reliable and free of bias. If chemists want to harness the full potential of generative-AI tools, they need to help to establish such training data sets. More data are needed — both experimental and simulated — including historical data and otherwise obscure knowledge, such as that from unsuccessful experiments. And researchers must ensure that the resulting information is accessible. This task is still very much a work in progress.

Take, for example, AI tools that conduct retrosynthesis. These begin with a chemical structure a chemist wants to make, then work backwards to determine the best starting materials and sequence of reaction steps to make it. AI systems that implement this approach include 3N-MCTS, designed by researchers at the University of Münster in Germany and Shanghai University in China1. This combines a known search algorithm with three neural networks. Such tools have attracted attention, but few chemists have yet adopted them.
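The underlying idea can be sketched as a recursive search over reaction templates. The molecules, rules and 'purchasable' set below are invented placeholders, not chemistry from the 3N-MCTS work, which couples the search to learned neural-network policies:

```python
# Toy retrosynthesis: work backwards from a target molecule to
# purchasable starting materials using a table of product -> precursor
# rules. All names and reactions are hypothetical placeholders.

RETRO_RULES = {
    "amide_AB": [["acid_A", "amine_B"]],   # e.g. an amide coupling
    "acid_A":   [["ester_A"]],             # e.g. an ester hydrolysis
}
PURCHASABLE = {"ester_A", "amine_B"}       # commercially available

def retrosynthesize(target, route=None):
    """Return one route (list of steps) ending in purchasable materials,
    or None if no route exists in the rule table."""
    route = route or []
    if target in PURCHASABLE:
        return route
    for precursors in RETRO_RULES.get(target, []):
        step_route = route + [(target, precursors)]
        sub_steps, ok = [], True
        for p in precursors:
            sub = retrosynthesize(p, [])
            if sub is None:
                ok = False
                break
            sub_steps.extend(sub)
        if ok:
            return step_route + sub_steps
    return None

route = retrosynthesize("amide_AB")
```

A real system replaces the hand-written rule table with reaction templates mined from millions of published reactions, and uses neural networks to decide which templates to try first, which is why the size and coverage of the training data matter so much.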

To make accurate chemical predictions, an AI system needs sufficient knowledge of the specific chemical structures that different reactions work with. Chemists who discover a new reaction usually publish results exploring its scope, but these explorations are often not exhaustive. Unless AI systems have comprehensive knowledge, they might end up suggesting starting materials with structures that would stop reactions working or lead to incorrect products2.

An example of mixed progress comes in what AI researchers call ‘inverse design’. In chemistry, this involves starting with desired physical properties and then identifying substances that have these properties, and that can, ideally, be made cheaply. For example, AI-based inverse design helped scientists to select optimal materials for making blue phosphorescent organic light-emitting diodes3.

Computational approaches to inverse design, which ask a model to suggest structures with the desired characteristics, are already in use in chemistry, and their outputs are routinely scrutinized by researchers. If AI is to outperform pre-existing computational tools in inverse design, it needs enough training data relating chemical structures to properties. But what is meant by ‘enough’ training data in this context depends on the type of AI used.
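In miniature, inverse design inverts the usual question: instead of predicting the property of a given structure, it searches candidate structures for those whose predicted property matches a target. A toy sketch with made-up values:

```python
# Toy inverse design: given a target property value, return candidate
# structures whose (pretend) predicted property falls within tolerance,
# closest first. Names and numbers are illustrative, not real data.

candidates = {
    "molecule_A": 2.9,   # predicted emission energy in eV (made up)
    "molecule_B": 2.6,
    "molecule_C": 3.4,
}

def inverse_design(target_ev, predictions, tolerance=0.3):
    """Rank candidates within `tolerance` of the target, best first."""
    hits = [(abs(value - target_ev), name)
            for name, value in predictions.items()
            if abs(value - target_ev) <= tolerance]
    return [name for _, name in sorted(hits)]

best = inverse_design(2.8, candidates)  # screen for emitters near 2.8 eV
```

In practice the lookup table is replaced by a trained structure-to-property model, and the search runs over a vast space of possible structures; the quality of that model is bounded by the training data relating structures to properties.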

A generalist generative-AI system such as ChatGPT, developed by OpenAI in San Francisco, California, is simply data-hungry. To apply such a generative-AI system to chemistry, hundreds of thousands — or possibly even millions — of data points would be needed.

A more chemistry-focused AI approach trains the system on the structures and properties of molecules. In the language of AI, molecular structures are graphs. In molecules, chemical bonds connect atoms — just as edges connect nodes in graphs. Such AI systems fed with 5,000–10,000 data points can already beat conventional computational approaches to answering chemical questions4. The problem is that, in many cases, even 5,000 data points is far more than are currently available.
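That graph view is simple to write down. As a minimal standard-library illustration, ethanol's heavy atoms and bonds become nodes and weighted edges:

```python
# Ethanol (CH3-CH2-OH) as a graph: atoms are nodes, bonds are edges.
# Heavy atoms only; each edge carries its bond order.

atoms = {0: "C", 1: "C", 2: "O"}     # node index -> element
bonds = [(0, 1, 1), (1, 2, 1)]       # (node, node, bond order)

# Build an adjacency list, the form that graph neural networks
# typically consume as input.
adjacency = {i: [] for i in atoms}
for a, b, order in bonds:
    adjacency[a].append((b, order))
    adjacency[b].append((a, order))
```

A graph neural network then attaches feature vectors to each node and edge (element, charge, bond order and so on) and learns to map the whole graph to a property value.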

The AlphaFold protein-structure-prediction tool5, arguably the most successful chemistry AI application, uses such a graph-representation approach. AlphaFold’s creators trained it on a formidable data set: the information in the Protein Data Bank, which was established in 1971 to collate the growing set of experimentally determined protein structures and currently contains more than 200,000 structures. AlphaFold provides an excellent example of the power AI systems can have when furnished with sufficient high-quality data.

So how can other AI systems create or access more and better chemistry data? One possible solution is to set up systems that pull data out of published research papers and existing databases, such as an algorithm created by researchers at the University of Cambridge, UK, that converts chemical names to structures6. This approach has accelerated progress in the use of AI in organic chemistry.
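A real name-to-structure system parses systematic nomenclature grammatically; as a stand-in, a dictionary lookup over text conveys the idea (the function and dictionary here are illustrative, though the SMILES strings themselves are correct):

```python
import re

# Toy text mining: scan a passage for known chemical names and return
# their SMILES structure strings. A real converter (such as the one in
# ref. 6) parses IUPAC nomenclature; this lookup is a simplification.

NAME_TO_SMILES = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "acetic acid": "CC(=O)O",
}

def extract_structures(text):
    """Return (name, SMILES) pairs for chemical names found in text."""
    found = []
    for name, smiles in NAME_TO_SMILES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            found.append((name, smiles))
    return found

hits = extract_structures("The product was recrystallized from ethanol.")
```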

Another potential way to speed things up is to automate laboratory systems. Existing options include robotic materials-handling systems, which can be set up to make and measure compounds to test AI model outputs7,8. However, at present this capability is limited, because the systems can carry out only a relatively narrow range of chemical reactions compared with a human chemist.

AI developers can train their models using both real and simulated data. Researchers at the Massachusetts Institute of Technology in Cambridge have used this approach to create a graph-based model that can predict the optical properties of molecules, such as their colour9.

There is another, particularly obvious solution: AI tools need open data. How people publish their papers must evolve to make data more accessible. This is one reason why Nature requests that authors deposit their code and data in open repositories. Data accessibility matters for AI training above and beyond its role in addressing crises surrounding the replication of results and high-profile retractions. Chemists are already addressing this issue with facilities such as the Open Reaction Database.

But even this might not be enough to allow AI tools to reach their full potential. The best possible training sets would also include data on negative outcomes, such as reaction conditions that don’t produce desired substances. And data need to be recorded in agreed and consistent formats, which they are not at present.
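What an agreed, machine-readable record might look like, with a negative outcome stated explicitly (the field names are hypothetical illustrations, not the Open Reaction Database schema):

```python
import json

# Illustrative reaction record that captures a failed experiment in a
# consistent, machine-readable form. Field names are hypothetical, not
# the Open Reaction Database schema.

record = {
    "reactants": ["CCO", "CC(=O)O"],            # SMILES strings
    "conditions": {"temperature_c": 25, "solvent": "toluene"},
    "outcome": {
        "success": False,                        # negative result kept!
        "yield_percent": 0.0,
        "note": "no product observed after 24 h",
    },
}

serialized = json.dumps(record, sort_keys=True)
```

Recording failures in the same structured format as successes is what lets a model learn which conditions to avoid, not just which ones to copy.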

To be worth adopting for chemistry applications, computer models must outperform the best human scientists. Only by taking steps to collect and share data will AI be able to meet expectations in chemistry and avoid becoming a case of hype over hope.