The protein structure prediction problem is the question of how a protein’s sequence of amino acids results in its fully folded three-dimensional structure. This has presented a formidable computational challenge for many decades.

At the end of 2020, a major advance was announced by DeepMind, a London-based artificial intelligence (AI) company now part of Google’s parent firm, Alphabet Inc. DeepMind’s AlphaFold 2 program had significantly outperformed other methods in the biennial Critical Assessment of protein Structure Prediction (CASP)1, producing models of a quality approaching that of experimental determination. AlphaFold 2 has since been published2 and, more recently, the source code and over 350,000 protein models from various species, including human, have been made public3. This trove of protein structures has implications for both experimental and computational structural biology, and beyond4,5,6,7, but here we consider its possible bearing on medicine.

AlphaFold 2 uses data gathered by structural biologists and made publicly available by the worldwide Protein Data Bank (wwPDB)8—which currently holds over 180,000 experimentally determined structures. It is commendable that DeepMind has released the code and predictions for everyone to use.

Over 350,000 protein models have been made available on the AlphaFold Protein Structure Database at the European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL-EBI), with tools to view and interrogate the structures3. These proteins come from 21 species, including the most common model organisms and some notable pathogens: Leishmania infantum, Mycobacterium tuberculosis, Plasmodium falciparum and Trypanosoma cruzi. Before the end of 2021, DeepMind expects to release models covering UniRef90, a non-redundant sample of all known protein sequences comprising 130 million proteins.

Although protein structures do not of themselves lead to new medicines, they often provide a better understanding of the molecular mechanisms of a protein and in so doing offer insights into how the protein works and how its modulation might lead to a disease or a therapy. Over the past 50 years, protein structures have been an integral part of drug design efforts, with many large pharma companies establishing their own structural biology teams. Structural data have played a critical role both in determining the druggability of a given protein target9 and then in enabling the design of small-molecule drugs that will bind to it7.

Variable quality

The AlphaFold AI program rapidly generates models of protein structures from their amino acid sequence more accurately than had previously been achieved. The accuracy of the models is variable (both within and between models) depending on the protein, but, importantly, a confidence measure is provided at each residue position by the predicted local distance difference test (pLDDT) score.
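As a minimal sketch of how these per-residue scores can be inspected: AlphaFold model files in PDB format carry the pLDDT value in the B-factor column, so it can be read from the C-alpha record of each residue. The band thresholds used below (90, 70 and 50) follow the AlphaFold database coloring convention and are an assumption, not something stated in this article. The function names and the three-residue demo model are hypothetical.

```python
def plddt_band(score):
    """Map a pLDDT score (0-100) to a confidence band.

    Thresholds are an assumption taken from the AlphaFold database convention.
    """
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


def residue_plddt(pdb_text):
    """Return {residue number: pLDDT} read from C-alpha ATOM records.

    Assumes fixed-column PDB format, with pLDDT stored in the
    B-factor field (columns 61-66).
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def make_ca_line(serial, res_name, res_seq, plddt):
    """Build a column-accurate C-alpha ATOM record (hypothetical demo data only)."""
    return (
        "ATOM  {:>5d}  CA  {:>3s} A{:>4d}    "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
    ).format(serial, res_name, res_seq, 0.0, 0.0, 0.0, 1.0, plddt)


# Hypothetical three-residue model, not taken from any real entry.
demo_model = "\n".join([
    make_ca_line(1, "MET", 1, 45.2),
    make_ca_line(2, "ALA", 2, 72.5),
    make_ca_line(3, "GLY", 3, 95.1),
])

demo_scores = residue_plddt(demo_model)
demo_bands = {res: plddt_band(s) for res, s in demo_scores.items()}
```

Reading only the C-alpha record gives one score per residue, which is the granularity at which the confidence measure is reported.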

The predictions for single-chain, structured proteins are remarkably good—indeed, comparable in quality to those from experimental structure determination. However, the quality of the predictions depends on the length of the protein and its flexibility.

Not all protein structure predictions are of equal value. Figure 1 highlights three example predictions, showing the good, the bad and the ugly. Figure 2a provides an overview of the coverage (experimental and predicted) and quality of structures for the human proteome. Figure 2b illustrates the distribution of quality scores for the human sequences.

Fig. 1: The good, the bad and the ugly.

a, The good. A superposition of the AlphaFold model of human 14-kDa phosphohistidine phosphatase (UniProt accession Q9NRX4) and the solution NMR structure of the same protein (PDB code 2ai6). The PDB structure is colored purple, while the AlphaFold model is colored according to the pLDDT score: dark blue for the most confidently predicted regions, via light blue and yellow to orange for the regions of very low confidence. The superposition is almost perfect except for the more disordered loop regions. b, The bad. Human insulin (UniProt accession P01308) represented by the most complete PDB structure (2kqp) in purple, and the AlphaFold model colored by confidence score (as in a). The AlphaFold model bears no resemblance to the PDB structure, possibly because it has missed the disulfide bonds that hold the protein together. c, The ugly. The AlphaFold model of human E3 ubiquitin-protein ligase PPP1R11 (UniProt accession O60927), an enzyme classed as EC, for which there is no PDB structure, not even of a homolog. One would expect it to be a globular protein, but the AlphaFold model is anything but.

Fig. 2: Confidence scores for AlphaFold models.

a, Distributions of confidence scores for AlphaFold models for four organisms: human, Trypanosoma cruzi, Mycobacterium tuberculosis and Escherichia coli. The scores are classified as very high (dark blue), confident (light blue), low (yellow) and very low (orange). The two bacterial species show over twice as many very highly confident residues as do the other species, possibly because they tend to have shorter proteins that can be more confidently predicted. b, Distribution of average confidence score per AlphaFold model (obtained by averaging the individual residue confidences over the whole model) for human proteins with no close homolog in the PDB (dark blue) and those in which at least part of the sequence can be homology-modeled from a structure in the PDB (orange). The latter distribution is heavily skewed to higher average confidence scores, suggesting models of higher quality. For long proteins, only the model of the first fragment has been included in the data.
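The two summaries in this caption, the proportion of residues in each confidence class (a) and the per-model average (b), amount to simple arithmetic over the per-residue scores. The sketch below is illustrative only: the class boundaries (90, 70, 50) are assumed from the AlphaFold database convention, and the function name and example scores are hypothetical, not the data behind the figure.

```python
from statistics import mean

# (band name, exclusive lower bound); boundaries are an assumption.
BANDS = [("very high", 90.0), ("confident", 70.0), ("low", 50.0)]


def summarize_model(plddt_scores):
    """Return (mean pLDDT, {band: fraction of residues}) for one model."""
    if not plddt_scores:
        raise ValueError("model has no residues")
    counts = {"very high": 0, "confident": 0, "low": 0, "very low": 0}
    for score in plddt_scores:
        for name, lower in BANDS:
            if score > lower:
                counts[name] += 1
                break
        else:
            # No band matched, so the score is 50 or below.
            counts["very low"] += 1
    n = len(plddt_scores)
    fractions = {name: count / n for name, count in counts.items()}
    return mean(plddt_scores), fractions


# Hypothetical four-residue model.
avg, fractions = summarize_model([95.0, 92.0, 80.0, 40.0])
```

Note that a single average can hide a well-predicted domain attached to a long low-confidence tail, which is why the per-class proportions are worth reporting alongside it.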

A new structure prediction pipeline

Despite the varying quality of the new structures, SWISS-MODEL10 has already installed the code from AlphaFold to complement its existing structure prediction pipelines, while other groups have added the models to their databases of protein information, for example UniProt11 and PDBsum12. ColabFold13 provides tools for modeling multi-chain homo- and hetero-complexes using both the AlphaFold and RoseTTAFold models14. Another use of the models is in the interpretation of low-resolution electron microscopy data, especially where the protein shows flexibility between domains.

However, there are major limitations to the relevance of the AlphaFold data to the design of therapeutics. In particular, large multi-domain and flexible proteins still are not modeled very well, and the models lack any ligands (small molecules, DNA, cofactors, metals and other proteins) and therefore do not provide any interaction data, which are especially relevant for elucidating function.

Initially, the AlphaFold models will be used in much the same way as experimental structural data (and indeed will be used to help determine low-resolution experimental structures). We see four areas of immediate potential impact for medicine (see Fig. 3).

Fig. 3: Using AlphaFold for drug design and disease-associated variants.

a, Application of AlphaFold models to drug design. The protein shown in stick representation is the AlphaFold model of 3-oxo-5-alpha-steroid 4-dehydrogenase 2 (UniProt accession P31213). According to DrugBank, the protein is a target for several drugs, including spironolactone and finasteride. The colored mesh represents the surface of the largest cleft, which forms a deep tunnel. The colors correspond to residue conservation scores (from red for most highly conserved to blue for the least). The large red tunnel suggests a highly conserved binding site that could form a basis for further drug design. The AlphaFold model is virtually identical (root mean squared deviation of 0.4 Å on all C-alpha atoms) to a recent PDB structure of the protein (PDB code 7bw1). b, Disease-associated variants. The same protein as in a, here shown as a blue cartoon, representing the main chain. The residues shown as red sticks are some of the disease-associated variants (labeled) responsible for pseudovaginal perineoscrotal hypospadias. They mostly line the deep tunnel shown in a (here seen end-on) and presumably interfere with the binding of the protein’s natural substrate.
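For readers unfamiliar with the root mean squared deviation quoted in a, it is the square root of the mean squared distance between matched atoms after the two structures have been superposed. The sketch below shows only that final arithmetic. It assumes the C-alpha coordinates are already superposed and listed in the same residue order, and it does not reproduce whichever superposition procedure was used for the figure. The function name and coordinates are hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD (in the units of the input, typically Å) between two
    equal-length lists of (x, y, z) C-alpha coordinates.

    Assumes the structures are already superposed and residue-matched.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical three-residue example (not real model coordinates).
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reference = [(0.3, 0.0, 0.0), (3.8, 0.4, 0.0), (7.6, 0.0, 0.5)]
rmsd = ca_rmsd(model, reference)
```

In this toy case the per-atom displacements are 0.3, 0.4 and 0.5 Å, giving an RMSD of roughly 0.41 Å, which is the scale of agreement described for the model and the PDB structure.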

Therapeutic design

Most small-molecule drugs are designed with the benefit of structural insights15. Future design programs (whether for small molecules, biologics, biosimilars or proteolysis targeting chimeras (PROTAC) therapeutics) will use the models from AlphaFold whenever an experimental structure is not available.

For human sequences, the novel coverage is actually rather small (Fig. 2b), especially for those proteins for which drugs have already been developed. It is, of course, invaluable to know the prospective ligand-binding site, preferably with a structure of the complex with a ligand (Fig. 3a). As the predicted models lack all ligands, however, this requires docking approaches, with their varying reliability.

Comparative analyses of the target proteins with AlphaFold models of similar proteins may be used to generate more specific drugs, such as drugs with potentially fewer toxic side effects. In addition, AlphaFold data from different species may be studied to make more informed choices as to the most suitable animal model for testing potential medicines targeted towards humans.

Better drugs and more validated targets are always needed, and although protein structural data may contribute to this, designing small molecules using protein structures at the start of a drug development program is rarely the bottleneck in the time taken to launch a new drug onto the market.

Human pathogenic variants

Structural data help to identify pathogenic variants in humans—that is, those that cause disease16. A current challenge is to identify such pathogenic variants (for example, in developmental diseases or cancer progression) among the many variants observed in an individual’s genome. Almost 50% of known variants are classified as variants of unknown significance (VUSs) in ClinVar17, a database of genomic variation and its relationship to human health.

AlphaFold has limited value for modeling the effects of individual mutations. Reliable models may nevertheless be used to identify likely binding sites, enzyme active sites, interfaces or structural constraints, and so to distinguish variants that are more likely to be pathogenic from those in which the residue can be benignly replaced by another amino acid (Fig. 3b).

Most functions predicted from sequences or structures rely on close or distant evolutionary relationships. Predicted structures potentially allow one to see further back in evolutionary time, to identify the most distant relatives—from which some functional inference may be drawn.

Drug targets in pathogens

Structural coverage of pathogens in the wwPDB is often much less than for model organisms. With the larger release of data promised for later in 2021, however, predicted structures for many new organisms will be made available.

Protein structures from pathogens such as viruses, bacteria and fungi can be used to assess druggability and possible cross-reactions with human proteins and to aid in the design of medications targeted toward multiple pathogens. Identifying drug targets in infectious agents may offer the most accessible low-hanging fruit in the short term, and indeed DeepMind is already collaborating with organizations such as the Drugs for Neglected Diseases initiative.

Enhance vaccine and antibody design

With the COVID-19 pandemic and the development of SARS-CoV-2 vaccines, knowledge of the antigenic spike protein structure has assisted in understanding the surface topology of the virus and its antigenicity.

Amazingly, as of 3 September 2021, there were 1,491 structures of SARS-CoV-2 proteins in the wwPDB18, contributed by laboratories all around the world. The possibility of predicting viral spike proteins accurately will provide very rapid analysis compared to experimental structure determination for emerging viruses in future pandemics.

A data-driven revolution

The impact of the protein structures from AlphaFold in medicine is potentially substantial. However, AlphaFold is most likely to be just the start of a revolution based on data-driven prediction in biology and medicine. Biological processes at all levels (intracellular, intercellular, organoid and organism) involve interactions between molecules.

Although current AlphaFold predictions are limited to single protein chains and do not provide explicit information about interactions with other molecules, new AI-based tools could predict such interactions across the proteome—delving into different complexes in different cell types, which change with the environment and over time. In the longer term, AI methods will be developed and applied to many aspects of protein structures to improve predictability.

Projects such as the Earth BioGenome Project19 and Darwin Tree of Life20 that ultimately seek to sequence all living organisms will generate masses of new protein sequence data. AlphaFold 2 is the first step toward generating whole structural proteomes for all of these different species. The challenge is then to interpret these genomes in terms of each organism’s body shape, development, behavior and natural history, using genotype-to-phenotype studies. Natural products have been the basis for many drugs, so elucidating the genomes of many new species may ultimately lead to novel nature-inspired therapies. No doubt AI methods will be extensively employed in this quest.

From a medical perspective, the opportunity presented by AI is to follow the DeepMind approach and use clinical data to understand diseases, including their diagnosis and prognosis, and to determine which combinations of therapies best suit particular patients in a more holistic way.

Protein structure prediction presented the perfect challenge for AI: the data for all known structures were freely available, well curated and organized in the wwPDB. The challenge was very specific, and the success of the outcome measurable and independently assessed in CASP.

The availability of biological research data from institutes such as the US National Center for Biotechnology Information (NCBI) and EMBL-EBI (with the many different types of data and available data resources) has transformed biological research in the last 20 years. The situation for clinical data is entirely different. Like biological data, clinical data are very heterogeneous, but they are rarely easily available, often not quantitative, difficult to share across borders and described by limited ontologies and metadata. To add more complexity, such data cannot be made publicly available while maintaining personal confidentiality.

Consequently, to take advantage of the new, powerful AI methods, the imperative with clinical data should be to build the national and international infrastructures necessary to allow clinical data to be collected and shared, collated and standardized.

By analogy with AlphaFold’s success in predicting structures, this will accelerate the process of finding therapies that are effective and available to all. In the UK, Health Data Research UK is addressing this challenge by creating Trusted Research Environments for clinical data, and worldwide, the Global Alliance for Genomics and Health is establishing standards and protocols to enable swifter progress. For this to be successful, multi-disciplinary teams will be needed, involving clinicians, domain experts and machine learning experts, to develop the tools to exploit the data.

It has taken many years to establish the biological databases that are so widely used today—and the challenge for clinical data is even larger. This calls for immediate investment in creating a new health data infrastructure so that patients will be proud to contribute their data to improve human health and the world can face new pandemics with confidence.