Different proteins receive very different levels of attention from scientists. The most frequently studied protein in the human proteome is p53. On average, it is the subject of two publications per day1. At the same time, the biological functions of thousands of human proteins remain unexplored2,3,4,5. This bias in the functional characterization of the human proteome is massive: 95% of all life science publications focus on a group of 5,000 particularly well-studied human proteins6. The sequencing of the human genome was expected to be a crucial step toward reducing this bias: by identifying all human genes, researchers would be offered opportunities to study previously unknown genes. However, in 2011, a decade after the publication of the genome sequence, 75% of publications still focused on genes that were already being studied before the genome was mapped7. Annotation inequality has increased since then and has almost doubled since the human genome sequence was released2.

Annotation inequality hinders biomedical progress because mechanistic investigations of gene–disease associations typically focus on proteins that are already well known (Fig. 1), a phenomenon also known as the street-light effect8. Meanwhile, many uncharacterized proteins are not subjected to functional studies despite strong evidence from omics studies for their association with human disease2. For example, the functions of many proteins involved in rare diseases (which are not rare collectively) are poorly understood9. Moreover, common diseases such as neurodevelopmental disorders and cancer are caused by collections of numerous rare genetic variants in different genes10. Remarkably, out of the 1,878 genes that are essential for proliferation in a human cell line, 330 (18%) remained uncharacterized as of 2015 (ref. 11). This bias extends to the ~3,000 proteins currently expected to be druggable: only 5–10% of these potentially druggable proteins are currently targeted by FDA-approved pharmaceuticals5.

Fig. 1: Protein annotation inequality impedes biomedical progress.

The availability of prior publications, data and tools dictates the ease with which research questions involving a protein can be formulated and addressed. This reinforces annotation bias and the persistence of understudied proteins.

Functional proteomics could be instrumental in reducing the annotation gap by systematically associating uncharacterized proteins with proteins of known function and thereby assigning them to cellular processes. An important element of targeting uncharacterized proteins is to broaden the range of investigation beyond typical laboratory conditions and the limited set of genetic backgrounds of laboratory model organisms. With a focus on mass spectrometry (MS)-based methods, here we outline opportunities and challenges for a coordinated functional proteomics initiative that would lay the groundwork for future detailed mechanistic studies.

Origins of protein annotation inequality

The reasons for the protein annotation bias are manifold. Some are of a practical nature, reflecting how easily a protein can be studied with widely available methods. For example, the availability of experimental tools such as antibodies, plasmids or curated reference data is a strong incentive to work on well-studied proteins2,7. The number of publications about a protein is also related to basic biological and biochemical properties, such as protein size, abundance, hydrophobicity and the sensitivity of its gene toward mutations4. The dynamic range of our detection devices does not yet match that of proteins in a cell. In fact, to date, 1,899 (9.6%) of the 19,733 human protein-coding genes lack credible support from any proteomics technology, some of which may constitute genome annotation errors12.

In addition, having a very small size is a strikingly common feature among under-studied proteins: 40% of the least well-annotated proteins in SwissProt are smaller than 15 kDa (ref. 13). This is despite the importance of microproteins, for example, as neuropeptides in brain development14. Moreover, what we currently consider to be the repertoire of understudied small proteins may just be the tip of an iceberg, as we are only beginning to uncover the array of ‘alternative proteins’ coming from genomic regions previously considered to be noncoding15.

Other reasons for protein annotation inequality may reflect conceptual biases in the research system rather than properties of the proteins themselves. For example, it is often assumed that proteins studied by many people are functionally more important7, although this is not supported by evidence such as genome-wide association studies or functional genomic screens2,11,16,17. In addition, scientists often prefer to explore a problem they already work on in more detail, in part because funding and peer-review systems are risk-averse7. Working in a large research field enhances the likelihood of being cited, and, consequently, also increases the possibility for high-impact journal publications, which are required for academic success18. However, large fields also tend to favor existing paradigms over new ideas, thus slowing scientific progress overall4,19,20.

Equally important is the limited set of conditions studied in the laboratory, a situation that might paradoxically be a consequence of the desire to make research more reproducible through standardization of experimental conditions. For example, under standard laboratory growth conditions, the deletion of ~20% of Saccharomyces cerevisiae genes causes a lethal phenotype21. However, when the condition space is expanded, 97% of the genes are essential for optimal growth under at least one condition22. Indeed, the choice of ‘standard’ conditions often reflects historical reasons rather than the desire to capture the entirety of biological complexity. For instance, the most popular synthetic yeast medium in use today emerged from an early-1950s US Department of Agriculture technical bulletin that aimed to help farmers and biotechnologists grow a wide variety of yeasts, for example, to start fermentation processes23. The problem is further compounded for multicellular organisms with specialized cell types; some tissues or cell types are much more studied than others.

Finally, protein annotation bias could reflect the focus on hypothesis-driven rather than question-driven research24,25. It is difficult to formulate hypotheses on the mechanistic molecular function of an uncharacterized protein. Intriguingly, the philosopher Francis Bacon, often credited as the father of the scientific method, argued in the early 1600s that experiments should not be driven by hypotheses for fear of introducing bias in the observer and stifling innovation24,26. In line with this, it has been suggested that strictly data-driven approaches could help to reduce protein annotation inequality2,27.

Accelerating drug discovery for understudied proteins

From a standpoint of drug discovery, fundamental advances toward the characterization of understudied proteins are being made by initiatives that improve our understanding of protein–small molecule interactions, such as the Structural Genomics Consortium28, the Enzyme Function Initiative29, the Illuminating the Druggable Genome program5 and Open Targets30. In this context, ‘functional characterization’ is typically interpreted as revealing molecular properties of a protein that are particularly relevant for drug development; for example, its structure, ligands, inhibition by chemical probes and association with disease. Particular emphasis is placed on pharmacologically tractable protein families, such as ion channels, G-protein-coupled receptors and kinases5,31,32.

From a perspective of understanding protein function, it is equally important to study other levels of protein annotation, such as cellular processes, pathways and subcellular compartments. In addition, many understudied proteins do not belong to a traditional druggable family, although the definition of a druggable protein is evolving over time as new approaches (such as PROTACs33) are developed. One set of methods ideally suited to study the cellular functions of proteins, and to do so on a comprehensive, proteome-wide scale, is functional proteomics.

Tackling annotation inequality with functional proteomics

Two different types of protein annotation efforts may be distinguished: original investigations and ‘guilt-by-association’ approaches. The original investigation of a novel biological function is an essential but time-consuming and costly effort involving many detailed mechanistic studies. For researchers to commit to such an effort, it is necessary for a protein to have a certain basal annotation level. Without this, hypotheses to probe a protein’s function lack foundation. Here, annotation by ‘functional association’ can provide the lacking foundations through knowledge transfer, whereby previously uncharacterized proteins are linked to well-studied factors and their biological functions34,35,36,37,38.

Proteomics approaches are particularly well suited to revealing functional associations on a large scale. Such approaches include techniques that identify protein–protein interactions, such as affinity purification MS39,40,41, crosslinking MS42 and co-fractionation MS43; approaches that identify which proteins are co-regulated44,45,46,47,48,49,50,51; and methods that reveal which proteins share subcellular space52,53,54,55 (Box 1). For example, the majority of centrosomal proteins were considered to have been already identified56, and then hundreds more were identified by antibody-based proteomics57. It is noteworthy that although we focus here on MS and antibody-based proteomics, powerful alternative proteomics approaches also exist that have been reviewed elsewhere58,59. There are also many functional genomics approaches that do not rely on measuring proteins for functional association, including gene expression profiling, whereby functionally related genes are linked on the basis of similar expression patterns60, metabolic profiling61 and genetic interaction screening62. Rapid advances in genome-wide CRISPR–Cas9 screening have accelerated the pace of functional annotation of proteins involved in susceptibility to therapeutic compounds, or those that become essential in a specific genetic context63.

While MS-based proteomics does not yet reach the gene coverage of genomic approaches, observing proteins directly can be especially informative when studying the function of (protein-coding) genes. For example, protein co-expression captures functional relationships considerably better than mRNA co-expression13,64. Protein-based analyses also have the potential to distinguish between proteoforms; that is, the individual molecular forms of expressed proteins65, which, as a result of splicing and post-translational modifications, dramatically increase the functional diversity of the proteome65. Proteoform characterization may require the use of top-down66,67 or middle-down68 proteomics approaches. Proteomics is rapidly increasing in throughput, with methods emerging that allow for hundreds of proteomes to be recorded per day on a single mass spectrometer69,70. A new generation of functional proteomic studies will hence be able to generate a much more comprehensive spectrum of biological functionality.
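The idea of linking an uncharacterized protein to well-studied ones through co-expression can be illustrated with a minimal sketch. It assumes that each protein has an abundance profile measured across the same set of conditions and that similarity is scored by Pearson correlation; the protein names, profiles, annotations and the `min_r` cutoff are hypothetical and chosen purely for illustration, not taken from the studies cited above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def associate(query_profile, annotated, min_r=0.8):
    """Rank annotated proteins by co-expression with a query protein.

    annotated maps protein name -> (abundance profile, known function).
    Returns (name, function, r) tuples with r >= min_r, best first.
    """
    hits = []
    for name, (profile, function) in annotated.items():
        r = pearson(query_profile, profile)
        if r >= min_r:
            hits.append((name, function, r))
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Hypothetical toy data: abundances across five conditions.
annotated = {
    "RIBO_A": ([1.0, 2.0, 3.0, 4.0, 5.0], "ribosome biogenesis"),
    "MITO_B": ([5.0, 4.0, 3.0, 2.0, 1.0], "mitochondrial respiration"),
    "RIBO_C": ([2.0, 4.1, 5.9, 8.2, 10.0], "ribosome biogenesis"),
}
unknown = [0.9, 2.1, 2.9, 4.2, 5.1]

hits = associate(unknown, annotated)
for name, function, r in hits:
    print(f"{name}\t{function}\tr={r:.3f}")
```

In this toy case the uncharacterized profile tracks the two proteins annotated with the same process, which would suggest a candidate function for follow-up. Real analyses typically involve many more conditions, normalization and a statistical assessment of the associations.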

Nevertheless, protein annotation inequality is unlikely to be resolved exclusively by large-scale approaches. The first step in a concerted effort to address protein annotation bias could be to systematically provide the necessary minimal data foundation required for individual researchers conducting targeted experiments. Ongoing examples of this include BioPlex71 and hu.MAP72, which use MS for the large-scale identification of protein–protein interactions and protein complexes; the Human Protein Atlas73,74, which uses antibodies to assign human proteins to different tissues and subcellular locations; and the neXt-CP50 project that aims to characterize 50 understudied proteins by proteomics75.

How to increase the impact of functional proteomics on mechanistic research

Some highly promising proteins remain ignored despite being perfectly amenable to detailed functional investigation4. Making protein–protein associations more accessible and usable for mechanistic follow-up studies will therefore be an important step toward reducing annotation inequality. Biologists can inspect molecular networks through a variety of powerful and user-friendly resources76, including IntAct, BioGRID, NDEx and STRING. The fact that annotation bias is worsening2 despite the wide availability of such resources could be the result of a number of factors. One may be a lack of awareness of such annotation portals among cell biologists. Others may be a lack of trust in the available annotations, a lack of annotations and a lack of integration of different annotation types.

Cell biologists may hesitate to rely on data from large-scale projects because of a perceived lack of accuracy, a perception that could be improved by better communication. Indeed, the ability to treat error statistically is a particular strength of large-scale approaches. While error cannot be avoided, knowing its size is critical for judging how reliable results are. One example of a functional proteomics technique where false discovery rate (FDR) calculation has been established is crosslinking MS77. Similarly, FDR is routinely calculated for all MS protein identifications78,79. In addition, in spatial proteomics, statistical frameworks are being developed to encapsulate confidence of assigning proteins to subcellular niches80,81.
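To make the notion of a statistically controlled error rate concrete, the following is a minimal sketch of a target-decoy FDR estimate, a strategy commonly used for MS identifications. It is an illustrative assumption rather than the specific procedure of the cited studies (crosslinking MS in particular requires a more elaborate treatment); the function names, scores and decoy labels are hypothetical.

```python
def fdr_at_threshold(matches, threshold):
    """Estimate the FDR among target matches scoring >= threshold.

    matches: list of (score, is_decoy) pairs.
    Estimate = decoys passing / targets passing (0.0 if no targets pass).
    """
    targets = sum(1 for s, d in matches if s >= threshold and not d)
    decoys = sum(1 for s, d in matches if s >= threshold and d)
    return decoys / targets if targets else 0.0


def score_cutoff(matches, max_fdr=0.01):
    """Return the lowest score threshold with estimated FDR <= max_fdr.

    Returns None if no threshold satisfies the requested FDR.
    """
    best = None
    for threshold in sorted({s for s, _ in matches}, reverse=True):
        if fdr_at_threshold(matches, threshold) <= max_fdr:
            best = threshold
    return best


# Hypothetical scored matches: (score, is_decoy).
matches = [
    (9.1, False), (8.7, False), (8.2, False), (7.9, False), (7.5, True),
    (7.1, False), (6.8, False), (6.0, True), (5.5, False), (5.1, True),
]

print(score_cutoff(matches, max_fdr=0.01))  # strict cutoff
print(score_cutoff(matches, max_fdr=0.2))   # more permissive cutoff
```

The point of the sketch is that the reported list comes with an explicit, tunable error estimate: relaxing the accepted FDR admits more identifications at the cost of more expected false positives.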

In addition to expanding the amount of available large-scale data, it will undoubtedly be necessary to develop new tools and techniques to provide additional, complementary links and fill systematic gaps left by current approaches. Examples of emerging functional proteomics technologies are crosslinking MS42, coaggregation proteomics82 and methods to study dynamic subcellular niches52,55. The recent success in predicting protein structures83 offers the exciting possibility of improving structure-based function prediction, especially if predicted structures can be experimentally confirmed by, for example, crosslinking MS84. These and other intracellular techniques are particularly attractive, as many proteins require folding assistance, cofactors or post-translational modifications to function correctly and would therefore need to be studied in their native environment. In addition, it is becoming increasingly feasible to study proteomes of single cells, allowing the determination of cell-to-cell heterogeneity85.

Finally, a key remaining challenge is the integration of different types of data across scales (time and space), which would maximize synergies between different types of omics data. An example of this is the integration of the Human Protein Atlas and BioPlex data into a cellular hierarchy, which revealed many novel cellular systems that are undetectable by either dataset in isolation86. Such computational tools could also accelerate science by enabling data-driven hypothesis generation; that is, by giving researchers opportunities to connect their data to big proteomics data.

Even where the function of a protein is well annotated, there is increasing evidence suggesting that a number of proteins have the capacity to carry out alternative, unrelated functions, reported in the literature as ‘moonlighting’87. Historically, as researchers have assumed ‘one-protein one-function’, alternative functions have not been sought for most proteins. An additional benefit of the systems-wide interrogation of the functional proteome will be to provide alternative functional annotations even for well-studied proteins, as well as a better understanding of the extent to which proteins are capable of ‘moonlighting’.

How to quantify progress of functional characterization

To develop, optimize and evaluate strategies to tackle protein annotation inequality, one needs to be able to measure their impact in a robust and informative way. Measuring the degree of functional characterization is far from trivial, not least because the term itself can have different meanings. ‘Protein function’ may refer to the wider biological purpose of a protein, such as the phenotype with which it is associated or the metabolic pathway to which it belongs. It could also refer to structural and mechanistic insights into how a protein fulfils these functions at a molecular level; for example, the enzymatic mechanism.

A number of approaches to determine protein annotation levels have been developed, including a literature score based on text mining6, the UniProt annotation score88, an assessment of Gene Ontology (GO) coverage3 and a system to classify proteins based on their development as drug targets5. Each of these metrics captures or emphasizes slightly different aspects of the available annotations. None of them distinguishes between original characterization and functional association. However, to systematically evaluate the performance of an annotation transfer system, it will be necessary to quantify it adequately. The McNamara fallacy89 illustrates the danger of evaluating progress toward a complex goal on the basis of a single, easy-to-measure target variable without taking into account broader and more difficult to measure aspects of the challenge (McNamara’s over-reliance on a single quantitative metric, the number of enemy combatants killed or wounded, has been linked to the US failure in the Vietnam War).
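The metrics above score individual proteins; how unevenly such scores are distributed across the proteome can additionally be summarized in a single number. One widely used option is the Gini coefficient, sketched below under the assumption that each protein is represented by a non-negative score such as a publication count. The choice of this statistic and the toy counts are illustrative assumptions, and, in line with the caveat above, such a summary should complement rather than replace a broader assessment.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means all values are equal; values approaching 1 mean that the
    total is concentrated in very few items.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted data with 1-based ranks.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical publication counts per protein.
even_attention = [5, 5, 5, 5]
skewed_attention = [0, 0, 0, 10]

print(gini(even_attention))    # all proteins equally studied
print(gini(skewed_attention))  # one protein receives all attention
```

Tracking such a value over time, for several annotation metrics in parallel, would be one way to monitor whether annotation inequality is shrinking.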

How to avoid exchanging one bias for another

We have argued that the proteome is a powerful layer for annotating gene function, but proteomics approaches are also susceptible to biochemical bias; for example, from protein abundance and solubility. Therefore, to achieve a systematic reduction in the genome-wide annotation bias, it may be necessary to optimize multiple individual functional proteomics methods and integrate their results in a concerted effort. One may also integrate proteomics data with data produced by other omic disciplines. Metabolomics, for instance, can capture a complementary functional spectrum61,90. Note that combining proteomics with genetics, functional genetics or metabolomics substantially improves the predictability of phenotypes91,92.

Regardless of the approaches taken, however, the narrow window of standard laboratory conditions should probably be left behind. Recent multi-organism proteomics surveys93,94 suggest that potentially many more proteins could be characterized by comparative proteomics, taking advantage of the broad evolutionary conservation of many proteins’ functions and the differential accessibility of conserved proteins across organisms. The fact that many omic technologies can be directly applied to human cells, combined with the advent of genome editing, has raised concerns that funding for work on non-human organisms might be in decline95,96, although in-depth statistics indicate that these concerns may be, at present, unfounded97. Studying a broad diversity of organisms has not only brought us penicillin, green fluorescent protein and CRISPR–Cas9, but may also help us to capture the functional spectrum of the human proteome.

The Understudied Proteins Initiative

We envisage that the time is right for a coordinated effort to reduce annotation inequality across the human genome and proteome (Fig. 2). Our Understudied Proteins Initiative will include different data generation approaches, develop an integration framework and make the annotations available to researchers via an appropriate platform. The project will aim to address not only the technical but also the biomedical reasons for missing gene functions, such as narrowly defined growth conditions, single time-point studies and the focus on very few laboratory models with low genetic variability. This protein function moonshot may also stimulate methodological developments in functional proteomics and may extend to other species.

Fig. 2: Roadmap of the Understudied Proteins Initiative.

A survey will help define the challenge and goals for the initiative. Then a workshop will bring together experts from the large-scale data community to establish the initiative framework, covering six action areas to be discussed. Finally, a collaborative effort of many labs will experimentally tackle the problem of understudied proteins.

As a first step, the goal must be defined clearly. If the contribution of functional proteomics is to stimulate mechanistic studies of under-characterized proteins, then what is the minimum information that scientists require to start such work? This question can only be answered by those who illuminate the cellular function of individual proteins in molecular and mechanistic detail. Ultimately, it is the sum of their individual subjective decisions as laboratory scientists and reviewers that decides which proteins are studied in detail. We recently launched a survey to capture their views (https://understudiedproteins.org/survey)98.

As a second step, a community of interested scientists must be built. This will be started at an upcoming meeting supported by the Wellcome Trust (https://understudiedproteins.org/conference). The meeting will discuss the outcome of the survey and its implications for the goals of an Understudied Proteins Initiative, and how progress toward these goals could be monitored. This will set the framework for an open discussion on what technologies or developments may be able to systematically unlock the potential of currently uncharacterized proteins in biomedical research, and therefore become part of a larger roadmap.