Standardized annotation of translated open reading frames

To the Editor - Ribosome profiling (Ribo-seq) has extended our understanding of the translational 'vocabulary' of the human genome, uncovering thousands of open reading frames (ORFs) within long noncoding RNAs (lncRNAs) and presumed untranslated regions (UTRs) of protein-coding genes. However, reference gene annotation projects have been circumspect in their incorporation of these ORFs because of uncertainties about their experimental reproducibility and physiological roles. Yet, it is clear that certain 'Ribo-seq ORFs' make stable proteins, others mediate gene regulation, and many have medical implications. Ultimately, the absence of standardized ORF annotation has created a circular problem: while Ribo-seq ORFs remain unrecognized by reference annotation databases, this lack of recognition will thwart studies examining their roles. Here, we outline a community-led effort involving Ensembl/GENCODE, the HUGO Gene Nomenclature Committee (HGNC), UniProtKB, HUPO/HPP and PeptideAtlas to produce a standardized catalog of 7,264 human Ribo-seq ORFs; a path to bring protein-level evidence for Ribo-seq ORFs into reference annotation databases; and a roadmap to facilitate research in the global community.

Ribo-seq 1 provides an RNA-sequencing-based readout of mRNA translation by isolating ribosome-bound RNA fragments of ~30 nucleotides in length. Sequencing of these fragments offers genome-wide footprints of ribosome-RNA interactions, detecting translated ORFs with sub-codon resolution 2-8. Although Ribo-seq circumvents the experimental difficulties of working with protein molecules (for example, using mass spectrometry (MS) analytical tools) and readily finds translations missed by in silico evolutionary methods, it does not demonstrate the actual existence of proteins, and most translations do not show signs of constraint as coding sequences (CDS). A wide range of 'functional' scenarios are therefore plausible for Ribo-seq ORFs (Table 1).
Several public resources already process and/or display Ribo-seq datasets, including sORFs.org 9, GWIPS-viz 10 and Trips-Viz 11, whereas OpenProt 12 and nORFs.org 13 incorporate Ribo-seq into whole-translatome catalogs. Meanwhile, McGillivray et al. have produced a catalog of upstream ORFs (uORFs) with predicted biological activity 14. Such efforts have made important contributions to Ribo-seq ORF interpretation. Nonetheless, the global scientific community is constrained by the absence of 'reference' gene annotation for these ORFs; such annotation supports most large-scale genomics projects and provides the framework for human variant interpretation (Fig. 1a, Supplementary Fig. 1).

The creation of Ribo-seq annotations within existing reference gene and protein databases presents specific challenges that were not faced by previous cataloging efforts 9-13. In particular, it is necessary to consider how these annotations can be integrated into the broad range of user workflows that are already supported by global annotation resources. For such reasons, reference annotation projects are generally conservative when it comes to the incorporation of new data types. Thus, rather than attempt to describe a 'maximal' set of potential Ribo-seq translations from the outset, our strategy is to build up a comprehensive resource in stages that is reciprocally improved by input from the scientific community (Fig. 1b).

Table 1 | Approaches to interpreting Ribo-seq ORFs

Ribo-seq ORFs may be recognized as canonical, in accordance with existing protein annotations, on the basis that the sequence of the proteins they encode shows clear evidence of being maintained by evolutionary selection over a significant period of evolutionary time.

A Ribo-seq ORF encodes a taxonomically restricted protein
Ribo-seq ORFs may encode proteins whose sequences and molecular activities are specific to one species or lineage. Evidence for selection or conservation across distant species or lineages is lacking for these ORFs, either because the protein sequence has diverged beyond recognition from its orthologs, or because the protein evolved recently from previously noncoding material and homologs do not exist in other species or lineages.

A Ribo-seq ORF regulates protein or RNA abundance
Ribosome engagement of regulatory ORFs does not result in a protein product under selection but regulates the abundance of a canonical protein or RNA. This paradigm is well established for uORFs and uoORFs, as noted in Table 2, though it is applicable to other transcript scenarios. Regulatory ORFs may compete for ribosomes with their downstream canonical ORFs or produce nascent peptides that stall ribosomes, leading to the controlled 'dampening' of protein expression. Alternative modes of action, such as the induction of RNA decay pathways, the processing of small RNA precursors or the adjustment of RNA stability, have also been inferred.

A Ribo-seq ORF is the result of random translation
The translation of some Ribo-seq ORFs may simply be 'noise'. Because translation has a high bioenergetic cost, a protein that results from random translation is likely to be translated at lower levels than a canonical CDS and evolve neutrally; it may also be comparatively unstable and could be rapidly degraded. Nonetheless, it is theoretically possible that certain proteins do exist as stable 'junk' proteins, or that random translation events affect the expression of canonical proteins. The detection of random Ribo-seq ORFs is less likely to be reproducible.

A Ribo-seq ORF encodes a disease-specific protein
This protein would not be produced under normal physiological homeostasis but could be of major interest for diagnostics and therapeutics. Insights of this sort are especially emerging in cancer biology, where transcription and translation are known to be dysregulated. This leads to the production of 'aberrant', possibly rapidly degraded proteins that are commonly antigenic and presented on the cell surface by the HLA system, potentially acting as neoantigens. Furthermore, antigens resulting from disease-specific dysregulated ribosome activity, sometimes called defective ribosomal products (DRiPs), have also been explored.

Note: a given ORF may encompass several of these possibilities; for example, a translated ORF could be both regulatory and implicated in disease neoantigen production.
Here, as 'Phase I' of this work, we present a consolidated catalog of Ribo-seq ORFs from seven publications 2-8 annotated onto GENCODE version 35 (Fig. 1c; Supplementary Tables 1-9). A detailed description of the Ribo-seq datasets, our analysis methods and ORF characteristics is available in the Supplementary Methods. We removed ORFs smaller than 16 amino acids (aa) and those translated from non-ATG ('near-cognate') initiation codons, and merged redundant sense overlapping ORFs, resulting in a collated set of 7,264 unique ORFs (Fig. 1c). We classified these ORFs according to their spatial relationship with existing gene annotations (Fig. 1d), as presented in Table 2. We hope community usage of this catalog will help address the key technical and biological questions necessary to move this work into 'Phase II', where we aim to create a more comprehensive resource as outlined below.
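The Phase I filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure documented in the Supplementary Methods: the `Orf` record, its field names and the rule of keeping the longest ORF per overlapping same-strand cluster are all hypothetical simplifications.

```python
from dataclasses import dataclass

MIN_LENGTH_AA = 16  # Phase I size cutoff stated in the text


@dataclass(frozen=True)
class Orf:
    """Hypothetical minimal ORF record; field names are illustrative only."""
    orf_id: str
    chrom: str
    strand: str
    start: int        # 0-based genomic start, inclusive
    end: int          # 0-based genomic end, exclusive
    start_codon: str
    length_aa: int


def passes_phase1_filters(orf):
    """Keep ATG-initiated ORFs of at least 16 amino acids."""
    return orf.start_codon.upper() == "ATG" and orf.length_aa >= MIN_LENGTH_AA


def merge_redundant(orfs):
    """Collapse same-strand overlapping ORFs, keeping the longest per cluster.

    Assumption: overlap is judged on genomic span alone, which simplifies
    whatever redundancy criteria the authors actually applied.
    """
    kept = []
    cluster_end = None
    for orf in sorted(orfs, key=lambda o: (o.chrom, o.strand, o.start, o.end)):
        same_cluster = (
            bool(kept)
            and kept[-1].chrom == orf.chrom
            and kept[-1].strand == orf.strand
            and orf.start < cluster_end
        )
        if same_cluster:
            cluster_end = max(cluster_end, orf.end)
            if orf.length_aa > kept[-1].length_aa:
                kept[-1] = orf
        else:
            kept.append(orf)
            cluster_end = orf.end
    return kept


def build_catalog(orfs):
    """Apply the size and start-codon filters, then merge redundant ORFs."""
    return merge_redundant([o for o in orfs if passes_phase1_filters(o)])
```

In this sketch, ORFs on opposite strands are never merged, reflecting the 'sense overlapping' wording in the text; reading frame is not considered.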
For Phase I, we investigated repeated ORF identifications between studies, observing that 3,085 of 7,264 Ribo-seq ORFs were found by more than one publication (Supplementary Fig. 2; Supplementary Tables 2 and 3). However, although such 'reproducibility' can demonstrate consistency in Ribo-seq signal, it neither provides insights into biological function nor indicates that the 4,179 non-replicated ORFs are 'false'. A major goal of Phase II will be to incorporate a greater diversity of human cell types and tissues for improved estimates of ORF reproducibility, expression patterns and potential cell type specificity, along with further evaluation of criteria to quantify the technical confidence in Ribo-seq ORF calls.
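The replication tally above amounts to counting, for each ORF, the number of distinct publications that report it. The following is a minimal sketch, assuming the calls are available as `(orf_id, study_id)` pairs; that input layout and the function name are hypothetical.

```python
from collections import defaultdict


def replication_summary(calls):
    """Summarize how many ORFs are reported by more than one study.

    `calls` is an iterable of (orf_id, study_id) pairs; repeated reports of
    an ORF by the same study count once.
    """
    studies_by_orf = defaultdict(set)
    for orf_id, study_id in calls:
        studies_by_orf[orf_id].add(study_id)

    total = len(studies_by_orf)
    replicated = sum(1 for studies in studies_by_orf.values() if len(studies) > 1)
    return {
        "total": total,
        "replicated": replicated,
        "single_study": total - replicated,
    }


# Consistency check on the figures quoted in the text.
TOTAL_ORFS = 7264
REPLICATED_ORFS = 3085
NON_REPLICATED_ORFS = TOTAL_ORFS - REPLICATED_ORFS
```

With the reported figures, `NON_REPLICATED_ORFS` evaluates to 4,179, matching the number of non-replicated ORFs given in the text.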
Furthermore, Phase I excluded many translations by restricting the consensus set to ATG-initiated 'cognate' translations of at least 16 aa in length. Although ORFs below this size may provoke skepticism in the absence of additional evidence (the smallest annotated human protein is 24 aa), there may be no lower size limit for a functional ORF 15. For example, the tarsal-less (tal) gene produces a polycistronic transcript translated into proteins as short as 11 aa in several insect species 16. In addition, the inclusion of ORFs initiated with near-cognate start codons can be complicated by ambiguous predictions of initiation site positions 17. Ribo-seq following treatment with lactimidomycin or homoharringtonine, which inhibit translation elongation and result in accumulation of sequencing reads at the putative start sites, can help to identify near-cognate start sites 17,18. Such datasets will be leveraged by our future Phase II efforts. For our current annotation resource, we have separately aggregated the Ribo-seq ORFs with near-cognate start codons or translations shorter than 16 codons (Supplementary Fig. 3a-c and Supplementary Tables 4 and 5), rather than including them in the Phase I catalog.
A core aim of Phase II will be to identify which Ribo-seq ORFs participate in cell physiology and how they do so. One aspect is distinguishing between cellular function mediated by a stable protein and functionality imparted at the level of translation itself. We here use 'protein' as an umbrella term for protein, peptide and polypeptide, although we recognize that the terms polypeptide, micropeptide or microprotein are commonly used for small protein molecules (Table 2). Because of the challenges of protein sequencing, evolutionary analysis has played a major historical role in ORF annotation, an approach based on the assumption that the evolution of translated sequences is driven by selection at the protein level. Within our Phase I dataset, 75 of the replicated Ribo-seq ORFs (2.4%) present evidence of potential protein-level constraint as measured by PhyloCSF 19 (Supplementary Fig. 3d-f); among these, ten have now been classified as protein coding by GENCODE (Supplementary Table 6).
Nonetheless, the evolutionary profile of many Phase I Ribo-seq ORFs remains hard to interpret. In part, this is because distinguishing ORF selection at the protein and DNA levels can be especially difficult for very small regions, and Ribo-seq ORFs are typically much smaller than those of known annotated proteins (Supplementary Fig. 3g-j). A second drawback is that evolutionary analysis cannot infer the protein-coding or regulatory potential of evolutionarily 'young' de novo Ribo-seq ORFs 20 . Reference annotation projects remain skeptical about the existence of proteins that are not deeply conserved, despite the fact that some young proteins clearly do participate in cellular physiology 20,21 . Furthermore, there is a substantial knowledge gap in regard to the mode and tempo of regulatory ORF evolution. Here, genetic variation within human populations may provide insights. For example, Whiffin et al. 22 recently used the gnomAD human variation dataset to identify 3,191 genes in which uORF-perturbing variants are likely to be deleterious, thereby inferring the physiological importance of these translations. Meanwhile, Neville et al. 23 used the same dataset to find aggregate evidence of selective pressure against deleterious variants in their nORFs.org catalog 13 , which is especially pronounced for STOP-gain variants in uORFs. In prostate cancer, a recent analysis of 5′ UTR variants found regulatory roles for several uORFs 23 .
Although Ribo-seq ORFs may have regulatory roles irrespective of an encoded protein, the first step in confirming a protein-level physiological role for such an ORF is to demonstrate the existence of the protein in the cell. MS is a widely accepted approach to catalog the proteome, and its utility will be an important area of investigation for Phase II. At present, 609 of 7,264 Ribo-seq ORFs have been reported to have support in published MS datasets (Supplementary Table 10). However, different groups use distinct methodologies and parameters for MS, and for Phase I these findings are simply reported in Supplementary Tables 2 and 3 without further investigation. Reference annotation projects have historically favored high-stringency MS approaches, and the Human Proteome Organization (HUPO)/Human Proteome Project (HPP), which aims to produce a full annotation of the human proteome, has published guidelines to standardize the nature of MS evidence required to annotate a human protein 24. As one facet of our development of an MS workflow, these Ribo-seq ORFs have been added to the PeptideAtlas analytical pipeline, which is used by HUPO. In Phase II, our projects will jointly examine the question of how best to use MS data to define which Ribo-seq ORFs produce proteins. For reference annotation, we see two aspects to this: first, how to set standards for accepting and reporting potential MS support for a prospective Ribo-seq ORF protein; and second, how to define the point at which the body of evidence supports protein-coding annotation.
These aspects are illustrated by a preliminary analysis, which took advantage of the fact that 333 of our Ribo-seq ORFs are present in sequences previously queried by the PeptideAtlas workflow (Supplementary Methods). We find single-mapping peptide-spectrum matches (PSMs) for 13 Ribo-seq ORFs (Supplementary Table 11); all but one are supported by a single PSM each, and most of the peptides identified are not fully tryptic (two examples are presented in Supplementary Fig. 4). The majority of observed PSMs derive from human leukocyte antigen (HLA) peptidome datasets, which is consistent with prior proteomic analyses demonstrating enrichment for peptides mapping to Ribo-seq ORFs in immunopeptidome data 25-27. We emphasize that this preliminary analysis was not a full remapping of MS data and involved only a fraction of the Ribo-seq ORFs; a larger, focused effort will be forthcoming.
There are multiple causes contributing to the fact that Ribo-seq ORFs and certain classes of canonical proteins are infrequently detected in MS data, which are summarized elsewhere 28 . One consideration for HUPO is that an MS-based 'canonical' protein assignment requires multiple PSMs, ideally based on non-overlapping tryptic peptides. Although we recognize the value of these guidelines, very small proteins may be 'less discoverable' by MS, especially due to a paucity of identifiable tryptic fragments 28 . Notably, nearly 1,500 protein-coding genes annotated by GENCODE, UniProt and HGNC do not presently have MS support recognized by HUPO 24 . Moving forward, we are committed to examining all potential protein-coding Ribo-seq ORF cases with full manual gene annotation processes, and we plan to expand this workflow to include manual analysis of the peptide spectra by PeptideAtlas.
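The point about a paucity of identifiable tryptic fragments can be illustrated with a simple in silico digest. The sketch below is illustrative rather than a description of the PeptideAtlas or HUPO workflow: it assumes the common rule of cleaving after K or R unless the next residue is P, no missed cleavages, and a 7-30 residue window for MS-observable peptides. The window and the example sequence are hypothetical.

```python
import re


def tryptic_peptides(protein, min_len=7, max_len=30):
    """Return fully tryptic peptides within an assumed observable length window.

    Cleavage rule: after K or R, except when followed by P. Missed cleavages
    are not modeled, and the length window is an illustrative assumption.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def has_multiple_tryptic_peptides(protein):
    """True if at least two distinct peptides fall inside the window."""
    return len(set(tryptic_peptides(protein))) >= 2


# Hypothetical 16-aa sequence, the minimum ORF size in the Phase I catalog.
SHORT_EXAMPLE = "MKTLLRPSAGKWDEVR"
```

For the hypothetical 16-aa sequence, only one peptide (`TLLRPSAGK`) falls inside the window, so a requirement for multiple non-overlapping tryptic peptides could not be met even if the protein were abundant.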
Although the value of MS in identifying translated proteins is indisputable, we believe a broader 'gold standard' for evidence should employ additional methodologies, such as epitope tagging combined with western blot imaging or endogenous antibody work; HUPO already incorporates such data in collaboration with the Human Protein Atlas 24 . Consideration also needs to be given to emerging proteomics technologies, such as targeted proteomics workflows and immunopeptidomics, and progress is being made in medium-throughput functional screening assays. For example, recent large-scale studies have translated hundreds of Ribo-seq ORFs in mammalian cells through exogenous expression, finding that nearly 50% may stably produce proteins, despite little evidence of evolutionary constraint 2,6,27 .
In addition to their evaluation as proteins or regulatory units, the reference annotation of Ribo-seq ORFs necessitates the creation of integrated workflows to interpret overlapping variants, and notwithstanding great community interest in this field, standardized approaches are not yet available. We emphasize that variant interpretation pipelines designed to classify CDS mutations may be unsuitable for Ribo-seq ORFs (Table 1), and that a minority of overlapping variants fall within sequences displaying amino-acid-level constraint. Neville et al. 13 found that their nORFs.org catalog contains 48 Human Gene Mutation Database or ClinVar variants that are already considered pathogenic or likely to be pathogenic, even though they do not disrupt annotated CDSs. Although these variants may affect noncanonical ORFs, it will be important to define their mechanisms of action through experimental studies, as alternative explanations for pathogenicity, such as the creation of cryptic splice sites, are supported in certain cases. After exclusion of variants in Ribo-seq ORFs that overlap annotated CDSs, a total of 1,142 single-nucleotide variants present in the ClinVar database 29 were located within our aggregated set of Phase I Ribo-seq ORFs (Supplementary Methods). Fewer than 2% of these variants have been classified as pathogenic or likely to be pathogenic, but this is likely to be an underestimate because the absence of pathogenesis is commonly inferred from the absence of overlap with known coding features, and because ClinVar variant coverage is heavily skewed toward annotated CDSs.
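The variant overlap described above can be sketched as a simple interval comparison. This is a rough illustration only, assuming one reading of the text in which ORFs overlapping annotated CDSs are set aside before variants are counted; the record types, field names and significance labels are hypothetical, strand and exon structure are ignored, and the actual procedure is the one given in the Supplementary Methods.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


class Variant(NamedTuple):
    chrom: str
    pos: int            # 0-based position of a single-nucleotide variant
    significance: str   # hypothetical clinical significance label


def intervals_overlap(a, b):
    """True if two half-open intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def variants_in_non_cds_orfs(variants, orfs, cds):
    """Return variants inside ORFs that do not overlap any annotated CDS."""
    retained = [o for o in orfs if not any(intervals_overlap(o, c) for c in cds)]
    return [
        v for v in variants
        if any(o.chrom == v.chrom and o.start <= v.pos < o.end for o in retained)
    ]


def pathogenic_fraction(variants):
    """Fraction of variants labeled pathogenic or likely pathogenic."""
    flagged = {"Pathogenic", "Likely pathogenic"}
    if not variants:
        return 0.0
    return sum(v.significance in flagged for v in variants) / len(variants)
```

The nested loops are quadratic and are used only for clarity; a genome-scale analysis would normally rely on an interval index or a dedicated interval-arithmetic tool.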
Furthermore, there is major interest in the application of Ribo-seq to study human disease. In particular, it is being widely used to explore the dynamics of translation in cancer cells with aberrant proteins as diagnostic markers or targets for immunotherapy 25,26,30 . At present, reference annotation projects do not attempt to distinguish aberrant translation events from those that contribute to 'normal' physiology. It will be important to deduce the fraction of Ribo-seq ORFs that encode proteins that exist in normal cellular conditions. Conversely, we envisage the value of classifying potentially aberrant translations within Phase II through a distinct annotation framework.
Our intention is for the Ribo-seq Phase I catalog to be seen as a pragmatic interim solution to a long-term problem. We believe that reference annotation databases can advance both scientific and clinical research through the propagation and standardization of Ribo-seq ORF datasets, even, and perhaps especially, while the phenotypic impact of these features remains uncertain. As biological knowledge improves, this will support the development of more accurate annotations and variant interpretations, with the potential to yield substantial insights across all aspects of human biology. In this spirit, we hope the results of Phase I of this project will be useful and beneficial to the community and invite interested labs to join our future Phase II efforts.

PRO-ACTive sharing of clinical data
To the Editor - The open sharing of clinical data for research poses challenges not only in resolving consent, privacy and intellectual property issues associated with trial results 1,2, but also in subsequently facilitating access to and utilization of the data. Controlled-access systems can place onerous restrictions on industry-based researchers, require arduous application processes and involve long review or authorization times for users. Indeed, analyses of major efforts, such as ClinicalStudyDataRequest.com (CSDR) 3,4 and the UK Health Research Authority's (HRA) Assessment Review Portal (HARP) database 5, suggest that fewer than 50% of the available trial datasets in these resources have been accessed and analyzed by researchers after launch. Here, we outline several strategies that ensured that researcher engagement with an open-access ALS clinical trial data resource reached its full potential. We hope that our insights will be instructive for others seeking to galvanize open clinical data sharing efforts within the broader research community.
Prize4Life, a non-profit organization focused on accelerating treatments and a cure for ALS, created the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) database (https://ncri1.partners.org/ProACT/Document/DisplayLatest/9) in collaboration with Massachusetts General Hospital's Neurological Clinical Research Institute and with funding from the ALS Therapy Alliance. PRO-ACT was publicly launched in 2012 (with further data incorporated in 2015), including demographics, family history, medical history (including use of frontline treatment riluzole), vital capacity, adverse events and other data types from both the placebo and active arms of over 20 clinical trials 6,7. The database currently holds >10,000 fully deidentified ALS patient records from 23 phase 2 and 3 clinical trials, representing the largest aggregation of publicly available ALS clinical data.
In contrast to other clinical trial repositories (Table 1), PRO-ACT has been widely accessed by >2,500 users, from >50 countries, including dozens of universities, several governmental agencies and >50 drug development companies. Over 80 publications have used PRO-ACT as a primary data source. Last year, the success of PRO-ACT and its value to the ALS field was acknowledged when the database's creators received the Healey Center Prize for Innovation in ALS 8 . Several factors have contributed to the success of PRO-ACT in engaging the wider research community.