Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow

Wright, James C.; Mudge, Jonathan; Weisser, Hendrik; Barzine, Mitra P.; Gonzalez, Jose M.; Brazma, Alvis; Choudhary, Jyoti S.; Harrow, Jennifer

doi:10.1038/ncomms11778

Download PDF

Article
Open access
Published: 02 June 2016

Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow

James C. Wright ORCID: orcid.org/0000-0001-6950-4328¹^na1,
Jonathan Mudge¹^na1,
Hendrik Weisser¹,
Mitra P. Barzine ORCID: orcid.org/0000-0002-0860-9510²,
Jose M. Gonzalez¹,
Alvis Brazma²,
Jyoti S. Choudhary¹^na1 &
…
Jennifer Harrow¹^na1

Nature Communications volume 7, Article number: 11778 (2016) Cite this article

5563 Accesses
56 Citations
13 Altmetric
Metrics details

Subjects

Abstract

Complete annotation of the human genome is indispensable for medical research. The GENCODE consortium strives to provide this, augmenting computational and experimental evidence with manual annotation. The rapidly developing field of proteogenomics provides evidence for the translation of genes into proteins and can be used to discover and refine gene models. However, for both the proteomics and annotation groups, there is a lack of guidelines for integrating this data. Here we report a stringent workflow for the interpretation of proteogenomic data that could be used by the annotation community to interpret novel proteogenomic evidence. Based on reprocessing of three large-scale publicly available human data sets, we show that a conservative approach, using stringent filtering is required to generate valid identifications. Evidence has been found supporting 16 novel protein-coding genes being added to GENCODE. Despite this many peptide identifications in pseudogenes cannot be annotated due to the absence of orthogonal supporting evidence.

Quality assessment of gene repertoire annotations with OMArk

Article Open access 21 February 2024

The status of the human gene catalogue

Article 04 October 2023

A joint NCBI and EMBL-EBI transcript set for clinical genomics and research

Article Open access 06 April 2022

Introduction

Biological and biomedical research relies on complete accurate and consistent annotation of genes and their products. However, identifying and annotating functional elements in the human genome is a persistent challenge. Despite intensive study, especially in identifying protein-coding genes, our understanding of the genome is far from complete, particularly with regard to non-coding RNAs, alternatively spliced transcripts and regulatory sequences. The GENCODE consortium is carrying out the annotation of all gene features in the reference human genome¹. The annotation combines computational analysis, targeted experimental evidence and manual curation approaches to produce high-accuracy genome annotations for protein-coding genes, long non-coding RNAs (lncRNA) and pseudogenes. The reference gene set produced by GENCODE is the default gene set available in Ensembl.

A key goal of genome annotation is to describe a complete list of protein-coding genes. For many genes, confirmation of coding potential and the definition of the exact gene structure are not trivial². Automated algorithms utilize comparative genomics methods such as phyloCSF³ to score transcript-coding potential while annotation groups incorporate antibody tagging⁴ and mass spectrometry^5,6,7 proteogenomic experiments. GENCODE release (V22) contains only 19,814 protein-coding genes, decreasing from over 20,600 protein-coding loci in release V7. This fundamental refinement can mainly be attributed to improved comparative and proteomics analysis methods that have indicated transcripts with poor coding potential for removal from the reference gene set.⁸

High-quality peptides from large-scale proteomics experiments confirm coding potential, especially for genes and transcripts where there is little other supporting evidence^9,10. Proteomics experiments only identify a fraction of detectable peptides. A peptide may not be detected if it is not amenable to the methods used or if the corresponding protein is not expressed in detectable quantities in the tissue type analysed. This means that while reliably identified peptides verify protein-coding potential, the lack of corresponding peptides does not prove a transcript is not translated. Despite this, the majority of protein-coding genes in the GENCODE annotation of the human genome are routinely identified with confident peptides in large-scale proteomics experiments^{11,12,13,14,15}.

The increased scale of proteogenomics endeavours means that a higher false positive rate is inevitable. Recent reviews of the proteogenomics field have highlighted the challenge posed by false positives^16,17. Target-decoy database searching is the most common approach for determining false positive identification rates, however, recent publications have pointed out some of the limitations of this method^18,19,20,21. To minimize erroneous identifications, accurate estimation of error and conservative significance filtering criteria must be applied. While initial filtering by a false discovery rate (FDR), calculated based on decoy identifications, is a good starting point, novel identifications should be examined independently and assessed by posterior error probability (PEP) and P value. The use of tools such as Percolator can also correct for inaccuracies in FDR estimation²². Moreover, since many peptides are only detected occasionally, combining results from many experiments can improve proteome coverage and confidence of novel assignments. In addition, even with high confidence in peptide identifications alternative explanations such as sequence variants and contaminates must be considered. Proteogenomics experiments have been particularly useful for poorly annotated genomes, providing a powerful tool to validate new gene models^23,24,25.

Proteogenomics analysis can generate large numbers of putative novel protein-coding gene identifications. However, there are no suitable guidelines for how this data should be treated so that it can be successfully incorporated into genome annotation. We present here criteria for the generation and interpretation of proteomics evidence for novel annotations. We describe a proteogenomic workflow encompassing a search database, and stringent filtering criteria to refine peptide assignments (Fig. 1). To assist annotators, we propose a priority annotation score to rank identifications. From the annotator’s perspective, we discuss how proteogenomics evidence can be interpreted and what additional information substantiates protein-coding potential for annotation. In the presented work, we concentrate on the human genome by reanalysing human tissue data from the draft human proteome experiments, but these methodologies are applicable to any sequenced genome.

**Figure 1: Proteogenomic analysis pipeline.**

Results

Reanalysis of the human proteome

We have implemented the workflow shown in Fig. 1, and used it to reprocess over 52 million spectra from the human proteome experiments. This resulted in 17,777,190 significant peptide spectrum matches (PSMs) corresponding to 342,015 distinct peptide sequences longer than 7 amino acids (aa), representing 34% of the query spectra. PSMs were determined as significant using a 1% FDR and a 0.05 PEP cut-off. Significant PSMs were required to identify the same peptide with these criteria via multiple search algorithms. This equated to a final estimated PSM q-value of 0.0007 (0.07% FDR).

The distribution of identified peptides in the search space is shown in Fig. 2, with 99.5% matching known protein-coding sequences (CDS). About 0.6% of these were unique to either GENCODE or UniProt, highlighting the importance of considering multiple sequence annotation sources. These differences are due to independent database curation and annotation efforts. Figure 3 further breaks down the identified peptides by tissue, showing testis and ovary tissues to have the highest total number of peptide identifications. Testis also contained the most tissue-specific peptides, followed by liver, spleen, brain and foetal tissues. This observation suggests that these tissues have a significant number of uniquely expressed genes or alternative transcripts not found in other tissues. Other recent publications focusing on the analysis of Testis tissue have also found this to be true²⁶.

**Figure 3: Tissue peptide specificity histograms.**

Conducting protein inference and clustering to produce the shortest list of proteins that explain all significant peptides, we found evidence for 19,262 human proteins. For known proteins, we mapped these proteins to their Ensembl gene identifiers, finding evidence for 16,271 genes or clusters of genes with distinct peptides. A cluster of genes is defined as a set of genes, usually from the same family, which match one or more unique peptides specific to this set. Genes that only matched a subset of peptides from another gene were not counted in this number. About 13,852 genes were unambiguous identifications evidenced by multiple peptides and at least 1 uniquely identifying peptide. This is in reasonable agreement with the numbers of genes reported in previous studies using similar levels of stringency¹³. Further filtering by number of uniquely identifying peptides produces 11,606 and 10,740 genes with 2 and 3 unique peptides, respectively.

Among these genes, 7,716 transcripts were identified with unique peptides revealing 1,169 genes with evidence of alternative splicing. Further strict filtering increased the PEP threshold to 0.01, removed modified peptides, removed peptides with >30 aa and peptides with >2 missed cleavages. This additional filtering returned 32,497 gene-specific peptides and reduced the resulting genes with multiple different transcripts to 867. About 8,425 of the identified peptides crossed exon boundaries in these genes. About 1,869 transcripts mapped to these genes with 9,114 unique peptide mappings, and 2,956 of these peptides crossed exon boundaries. Testis again showed the highest amount of alternative transcription.

Identifying novelty in the proteome

About 1,406 of the identified peptides did not match previously annotated coding regions within known protein-coding genes in GENCODE or UniProt. While these peptides could represent genuine novel CDS annotations, they could also be due to false positive assignments, or explicable as single nucleotide polymorphisms (SNPs) that lead to single amino-acid variants (SAVs). Genuine novel CDS annotations can take three forms: the addition of completely novel protein-coding genes where no previous annotation existed; the annotation of a CDS within a locus that was previously classified as an lncRNA or pseudogene; or the addition of novel CDS within pre-existing protein-coding genes (for example, the identification of additional coding exons). While the novel data is the most interesting in terms of GENCODE annotation, it is at the highest risk of false positive identifications. We examined these identifications and devised suitable filtering criteria and quality measures to produce a refined set of novel peptides for genome annotation (Fig. 4).

**Figure 4: High-stringency peptide filtering regime.**

To generate higher confidence peptide identification, we increased our acceptance threshold to a minimum PEP of 0.01. Examination of the remaining spectra revealed that certain types of peptide commonly showed poor ion series coverage casting reasonable doubt on their assignments. As a consequence, long peptides >29 aa, peptides with >2 missed cleavages and semi-tryptic peptides were eliminated. In addition, we examined the incidence of post-translationally modified PSMs by comparing the enrichment of modification type in the novel data and the CDS data. These analyses found that deamidation and N-terminal carbamidomethylation were overrepresented among novel identifications (Supplementary Fig. 1). We have therefore disregarded all novel PSMs with these modifications. However, other modifications such as oxidation and acetylation showed similar proportions in both CDS and novel PSMs and have been retained, they only contribute a small portion (6%) of the novel PSMs.

The human genome has been the subject of intensive study and annotation for many years. Hence, the chances of a peptide being completely novel are small and an alternative explanation should be considered. There are several repositories of human genome annotation and these are being constantly updated with new annotations. Each novel peptide was blasted against the latest available GENCODE, RefSeq, NextProt and UniProt sequence database releases to identify peptides found in updated protein-coding annotations; 231 (7%) of the novel peptides were eliminated in this way. Also peptides containing SAVs will not match their corresponding protein sequence and can be misinterpreted as novel. To account for this, all novel peptides were required to be at least two amino acids different to any peptide from a known protein-coding gene.

The implemented peptide filtering regime should eliminate the majority of false positives. However, manual inspection of spectra is recommended for key peptides, especially those determined to provide evidence for new protein-coding regions. The final set of 650 highly confident novel PSMs mapped to 204 unique peptides and inferred 168 possible novel protein-coding annotations, 17 with multiple peptide evidence (Supplementary Dataset 1). In addition, any new gene annotations obtained were further validated by mapping tissue-specific RNAseq transcript data from the Uhlén study²⁷ to the corresponding gene regions (Supplementary Fig. 4).

As part of this analysis, we proposed a priority annotation score for each novel peptide and protein. This score used peptide features suggested by expert annotators based on prior experience, targeting criteria that appeared to distinguish peptides that led to annotation from those that did not. The aim of this score was to rank novel peptides identifications, highlighting the protein and peptide mappings most likely to lead to new annotation. The peptide priority annotation score is based on various peptide features that are not normally considered in PSM scoring including the PEP, the number of PSMs observed, the number of replicate identifications, the delta scores between top and second rank assignments, peptide length and the amino-acid sequence complexity. The full formula used can be found in the online methods. The summed score of all unique peptides identifying a novel protein was then used to rank the putative novel protein identifications. In the context of our human proteogenomic analysis, this score was difficult to further optimize since, in spite of the speculative search space used, only a few novel genes were identified. This likely reflects the high quality of existing annotation for this reference genome. However, all filtered novel identifications were manually examined regardless of ranking, and we were thus able to demonstrate a good correlation between peptide score and the likelihood of annotation taking place (Supplementary Fig. 5). This priority annotation score is potentially more useful for annotators targeting non-model organism genomes, based on the presumption that such gene sets are less complete in terms of protein-coding content. However, we emphasize that the scoring system does not provide actual probability estimations; it should instead be used to rank potential proteins. Furthermore, while a score of ∼100 marks the point at which the annotation of a peptide became probable in this study, this is unlikely to be the same for studies on other genomes given the number of variables involved (such as the quality of genome sequence, and the size and nature of the proteomics data).

Discussion

Manual annotation of the multi-peptide list, containing 17 prospective protein-coding regions, led to the addition of 7 protein-coding genes to GENCODE. In addition an eighth novel protein-coding region was created within the INPP5F protein-coding gene as part of an alternatively spliced transcript. Subsequently, 9 novel protein-coding genes were annotated in the list of 151 single-peptide identifications (Table 1). In total, our survey has led to the addition of 16 protein-coding genes to GENCODE: 8 based on lncRNA or RNAseq models, and 8 on pre-existing pseudogenes. The coding status of four of the pseudogenes—MAGEB6P1, MYO15B and two CENPV-related loci—were highlighted by Pandey et al.¹⁶, although MYO15B was previously regarded as a coding locus by UniProt. However, our MYO15B peptide is novel to this study and has supported the annotation of a CDS that is 1,534 aa longer than the UniProt entry. Two of the remaining 11 cases (HUMG00000150367 and HUMG00000158324) correspond to unreviewed entries in UniProt; the coding potential of the latter, in common with the CENPV-related loci, is also supported by data from PeptideAtlas²⁸. Of the remaining cases, one represents our single agreement with the 404 translated lncRNAs proposed by Wilhelm et al.¹⁵ (ENST00000424358), leaving 9 protein-coding genes that represent novel discoveries.

Table 1 Novel GENCODE protein-coding annotations.

Full size table

These annotations have resulted from a combination of the proteogenomic data presented here and the HAVANA manual annotation process, which incorporates next-generation transcriptomic data sets such as RNAseq²⁹, CAGE³⁰ and polyAseq³¹ that were not available during the genome-wide annotation phase of the GENCODE project. Certain pseudogenes, for example, were previously wrongly annotated since correct transcript structures could not be deduced, a problem confounded by 11 of the 16 novel protein-coding genes having single-exon CDS. Incorrect transcript structures can also prevent the identification of CDS based on conservation and paralogy. This is illustrated in Fig. 5a, which shows a protein-coding gene reconstructed on the site of a shorter pseudogene of TEX13A. RNAseq data is required to support the transcript structure, where testis-specific transcription is observed in agreement with the proteomics results, and the 714 aa CDS shows widespread conservation across mammalian genomes. In total, 15 of the 16 novel protein-coding genes either display mammalian conservation beyond primates or are putative members of known gene families. The outlier is shown in Fig. 5b, an endogenous retroviral element-associated CDS with conservation limited to certain primate genomes.

**Figure 5: Novel protein-coding loci supported by multiple PSMs.**

Despite the conservative and rigorous criteria presented here, 90% of our novel peptide identifications have not led to annotation, for two main reasons: either annotation did not make biological ‘sense’ in a transcriptomic context, or there was an absence of orthogonal supporting evidence for the translation or transcription of the model. In the latter case, we note that numerous identifications were made solely within Cufflinks-derived RNAseq models; none of these led to protein-coding annotation, despite many of these peptides appearing to have good PSMs. In fact, few of these structures were seen to be supported by orthogonal transcript evidence, and so could not be added to GENCODE even as lncRNAs. Since proteogenomic surveys must be wary about alternate peptide explanations and the persistence of false positives, these non-annotations are therefore suggestive in this regard. Due to the difficulty in confirming peptide-to-protein inference, many of these novel peptide assignments may be correct but originating from a protein source not considered in our search space. Although the peptide data was essential in highlighting the putative CDS in each of these cases, it was the orthogonal data sets that ultimately led to annotation, in particular the observation of mammalian conservation. For illustration, Fig. 6 compares an 11 aa lncRNA-associated peptide (Fig. 6a) with a 19-aa peptide identified from an RNAseq model (Fig. 6b). The shorter peptide identifies a novel 63-aa CDS exhibiting strong vertebrate conservation, with striking eye-specific transcriptomic data matching the identified peptide. In contrast, the longer peptide is found 4 aa downstream of a STOP codon within a LINE transposon, in a region that shows no signature of protein-coding evolution. Indeed, we did not annotate any lineage-specific CDS in the single-peptide list, even when they could be linked to canonical initiation codons. We propose that lineage-specific CDS should only be annotated based on the presence of multiple, high-quality and manually validated peptides, unless other sources of experimentally derived supporting evidence are available.

**Figure 6: Identifying novel protein-coding regions based on single peptides.**

Pseudogenes provide a particular interpretative challenge; such loci account for the majority of identified novel peptides. These loci differ from the pseudogenes converted to protein-coding genes previously discussed; they contain sequence disablements such as frameshifts or premature termination codons, whereas the latter are bona fide protein-coding genes previously mis-annotated. One could assume that pseudogene peptides are incorrect, given that these loci are traditionally thought of as being non-functional. Our initial concern was that peptides attributed to pseudogenes may actually derive from paralogous loci outside the search space, or from allelic variant forms of the parent genes. We found little evidence for these scenarios, although it is plausible that some pseudogene peptides could be explained by undocumented parental allelism, the presence of paralogous loci within reference assembly gaps, the existence of unappreciated copy number variation or retro-transposition events that are not fixed in the human lineage. These cases would be difficult to confirm without proteomic and transcriptomic experiments being conducted on paired samples.

It is interesting that many of the processed pseudogenes in our lists belong to families with dozens of copies, for example, the GAPDH family. All except one of these parental genes are described by Protein Atlas³² as producing a protein found at ‘high levels’ in multiple tissues. It could be that highly transcribed genes produce more pseudogenes due to the increased chance of retro-transposition, raising the likelihood that one or more will gain the potential to be translated. Alternatively, it may simply be that higher levels of protein expression inevitably lead to an increase in the capture of dissonant spectra, or perhaps increase the likelihood of cryptic in situ modifications that are subsequently misinterpreted. Ultimately, it is clearly logical to be conservative in the interpretation of pseudogene peptide evidence. As for lineage-specific lncRNA CDS, pseudogene translation should only be considered for annotation where more than one high-quality unique peptide can be linked to an in-frame initiation codon. For such cases, a novel biotype ‘translated pseudogene’ has been created in GENCODE; conversion to full-protein-coding gene status will only occur when the functional basis of the locus is experimentally established.

Proteomics has potential as a powerful tool for genome annotation, allowing validation of existing protein-coding annotations and discovery of novel coding regions. It offers substantial benefits for less refined genomes, although our identification of 17 novel coding regions demonstrates that mass spectrometry has an important role to play in human annotation. It is notable that despite the high levels of stringency and filtering used, the majority of peptides identified outside the known protein-CDS have not led to novel annotations, emphasizing a need for vigilance in interpreting such data sets and the need to handle pseudogenes identified by proteomics with care. Recent years have seen a rapid increase in the number of proteogenomics experiments published, many of which strive to obtain the largest coverage of peptides and novel identifications possible. The size of these data sets places a burden on annotators, as they strive to separate those genuine identifications from the inevitable set of spurious peptides. We believe that proteogenomics should take a ‘quality-over-quantity’ approach, as exemplified by the integrated mass spectrometry and annotation workflow presented here. In addition, we propose a priority annotation score for ranking novel proteomics evidence. Although of limited benefit in this study, the higher scoring peptides did lead more frequently to new annotation, and we believe this would be of great value in non-reference genomes. Above all, proteomics data sets should not be interpreted in isolation, rather they should be considered as a single component of an annotation strategy that also includes high-throughput transcriptomic data coupled with comparative analyses.

Methods

Raw data sources

Three large, publicly available human tissue proteomics data sets were downloaded from the Internet. The first, containing 2,212 raw files, was generated by the Pandey lab at Johns Hopkins University¹⁴, and was downloaded from ProteomeXchange via the PRIDE³³ repository (www.ebi.ac.uk/pride/archive/accessionPXD000561). These were higher-energy collisional dissociation (HCD) raw files, from Thermo Scientific Orbitrap instruments, comprising 85 fractionated experimental samples covering 30 different human adult and foetal tissues. The second data set was the Human BodyMap data generated by the Kuster lab at the Technische Universität München as part of their draft human proteome publication¹⁵, which was downloaded from ProteomicsDB (www.proteomicsdb.orgaccessionPRDB000042). This data set consists of 1,087 HCD and collision-induced dissociation (CID) Thermo Scientific raw files from 48 experiments covering 36 different tissues. The final data set which was also used in the draft human proteome was generated by Paul Cutler at Roche Pharmaceuticals and originally deposited in PeptideAtlas³⁴ in 2011. This data was also downloaded from ProteomicsDB (accession PRDB000012) and contains 1,618 CID Thermo Scientific raw files covering 10 different human tissues.

Spectral processing

Each raw file was converted to the standard mzML format using the ProteoWizard (v3.0.6485) msconvert tool³⁵. Following conversion, the data was processed with TOPP tools from OpenMS (pre-v2.0 development build) (Fig. 1)³⁶. All spectra were centroided using the PeakPickerHiRes tool, and files from fractionated experiments were merged using the FileMerger tool (up to a maximum file size of 2 GB). A small number of the raw files appeared to contain no spectra and were not included in the conversion. The final set of 195 mzML files comprised 52,236,496 spectra.

Sequence database creation and preparation for searching

A comprehensive human sequence database in FASTA format was created by combining six parts. These included, the complete human GRCh38 GENCODE v20 (ref. 37) CDS translated sequences; the UniProt³⁸ human reference proteome from May 2014; common contaminant protein sequences downloaded from Max Planck Institute of Biochemistry (http://maxquant.org/contaminants.zip) and HLA sequences from (http://www.ebi.ac.uk/ipd/imgt/hla/download.html); a selection of non-coding gene sequences from GENCODE v20 including pseudogenes, lncRNA sequences and 5′ untranslated region sequences; novel sequences generated using the AUGUSTUS gene predictor³⁹; an additional set of two-way consensus pseudogene predictions from Pseudogene.org (December 2013); and three-frame translated RNAseq transcript sequences. RNAseq transcript models were imported from three sources: Ensembl RNAseq models assembled using data from the Illumina Human BodyMap 2.0 project, which captures transcription in 16 human tissues (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-513); models generated by the Kellis lab at MIT using also Human BodyMap data and the Scripture software⁴⁰; and models produced by Caltech and CSHL using RNAseq data from different ENCODE cell lines and the Cufflinks software⁴¹. GENCODE non-coding gene sequences, AUGUSTUS predictions, pseudogenes and RNAseq models overlapping GENCODE coding regions were filtered out from the final sequence database. For some of the databases the sequences needed to be translated into amino acids from nucleotide sequences; this was done using the EMBOSS (v6.6) six-pack tool⁴² to generate three-frame translations of the sequences, splitting stop codons into separate ORFs with a minimum length of 10 amino acids. Finally, a set of randomized decoy sequences of equal size to the target database was generated using the Mimic tool (https://github.com/percolator/mimic) and appended to the database. To account for isobaric peptides all isoleucine (I) residues within the database were replaced with leucine (L); after searching, leucine residues were always converted to the ambiguous code J. All protein accessions were formatted to include the source database, a unique identifier and if available a genomic locus. Supplementary Table 1 contains a summary of the sequence database components including the number of proteins, peptides and amino acids. In terms of distinct tryptic peptide sequences, the known coding portion of the database contained 787,587 peptides and the novel sequences provided an additional 4,211,835 peptides.

Spectral identification and database search pipeline

Figure 1 describes the overall identification and analysis pipeline, including the OpenMS/TOPP-based workflow that was used to run the database searches for the three data sets. The TOPP tool MascotAdapterOnline was used to submit mzML files to a Mascot Server v2.4 (Matrix Science) cluster; the in-house developed TOPP tool MSGFPlusAdapter was used to run MS-GF+ v10089 (ref. 43) on the same files across a large computer cluster. Two wrappers were implemented to run MascotPercolator v2.08 (refs 44, 45) and the msgf2pin/Percolator v2.08-1 (ref. 46) tool combination to optimize and rescore the Mascot and MS-GF+ results, respectively. In addition, SEQUEST combined with Percolator⁴⁷ was used to search the data in a Proteome Discoverer v1.4 (Thermo Scientific) workflow. All database searches were performed with a precursor tolerance of 10 p.p.m. and a fragment tolerance of 0.02 Da for HCD spectra and 0.5 Da for CID spectra. Up to three missed cleavages were allowed. The fixed modification carbamidomethyl (+57.0214) was specified for all cysteine residues. In addition, the following variable modifications were used in the searches: N-terminal acetylation (+42.01056), N-terminal carbamidomethyl (+57.0214), deamidation of asparagine and glutamine residues (+0.984), oxidation of methionine (+15.9949), and N-terminal conversion of glutamine and glutamic acid to pyro-glutamine (−17.0265, −18.0106). The multiple search results of this workflow were converted into mzTab formatted files and uploaded along with the mzML spectra and FASTA search database to the PRIDE repository (http://www.ebi.ac.uk/pride/archive/)³³ under accession PXD002967.

Initial results processing and filtering

Custom Perl scripts were used to parse, merge and filter the results of each search engine (Fig. 3). The results from the multiple search engines were first merged and filtered so that every PSM had the same identification in at least two of the three search engines. The highest (least confident) PEP value was retained in each case. The PSMs were then filtered keeping only matches with a q-value of ≤0.01 (1% FDR), a PEP of ≤0.05 and a peptide length of at least seven amino acids. PSMs matching contaminant or decoy sequences were also removed. Proteins were then inferred from the final list of peptides, taking a simple approach to cluster proteins that matched the same set of peptides, so that each protein cluster had at least one unique peptide. To calculate the number of genes identified by these proteins, we mapped GENCODE CDS and UniProt accessions back to Ensembl gene identifiers. Several protein clusters did not contain a GENCODE CDS nor a UniProt accession; these non-CDS protein identifications were separated and further filtered for genome annotation. Supplementary Fig. 4 displays the −10log(PEP) score distribution of CDS and novel PSMs.

Non-CDS peptide analysis

Non-CDS identifications, which we define as peptides uniquely matching a single sequence not found in the GENCODE CDS, UniProt database nor the contaminant sequence databases were further filtered to an increased stringency. The filtering criteria included, a 0.01 PEP threshold, a maximum peptide length of 29 amino acids, only fully tryptic peptides, and a maximum of two missed cleavages. In addition after identifying an enrichment of deamidated and N-terminally carbamidomethylated PSMs in the non-CDS data set, PSMs so modified were also removed. Peptides were also checked to make sure that modifications were identified on the correct N-terminal residues in the case of pyroglutamate conversions. BLASTp⁴⁸ with parameters optimized for short sequences was used to search an up-to-date combined GENCODE V22 (ref. 37), RefSeq V70 (ref. 49), NeXtProt (release 28 April 2015)⁵⁰ and UniProt (release May 2015)³⁸ sequence database. Any matches were removed from the data set. A Perl script was then used to search the GENCODE CDS removing peptides with less than two mismatches to known proteins.

Priority annotation score

A score for each peptide was computed using a set of peptide features. The following equation describes the scoring for each protein based on the summed priority annotation scores of its distinct unambiguous peptides:

Where

P_i represents the best PEP obtained for each peptide, representing the confidence in this peptide.

U_i is the number of significant unmodified PSMs identifying each peptide, this increases the score of more abundant, repeatedly sampled peptides.

M_i is the number of modified PSMs significantly identifying each peptide, this is adjusted by Wm (initially set to 5), again this increases the score of more abundant peptides; however, the adjustment makes modified forms of the peptide less influential than unmodified PSMs.

L_i represents the peptide amino-acid length.

D_i is the delta score difference between 1st and 2nd ranked peptide spectrum assignments, adjusted by Wd (initially set to 10). This feature boosts the score where there is less ambiguity in sequence assignment, as this value can be large so it is divided by 10 to avoid it dominating the score.

R_i is the number of samples or replicates significantly finding each peptide this is adjusted by Wr (initially set to 5). This feature increases confidence in a novel peptide and reduces the chance of it originating from a variant sequence. In this study, there are over 100 samples included, hence we applied a weighting to prevent this feature overly influencing the final score.

E_i is the entropy for the peptide sequence calculated using information theory using this formula:

Where

n_i is the number of occurrences of each amino-acid type in the peptide and N is the length of the peptide. This feature increases the scores of more complex peptides.

The Wm, Wd and Wr weightings were initially set to values that scale the peptide features to the same magnitude and avoid any feature overly biasing the score. For use beyond this study these values can be optimized, as well as additional features added to suit the type of experimental data and annotation evidence required. Supplementary Fig. 5 shows a histogram of the resulting scores for peptides that led to novel annotations and those that were rejected on manual inspection of the genomic loci.

RNAseq analysis and validation of novel genes

After retrieval of the publicly available raw data of the ArrayExpress data set with the accession number E-MTAB-2816 (ref. 27), the RNAseq libraries were analysed with iRAP using TopHat2 (ref. 51) to map reads to the reference genome (GRCh38.p3) and HTSeq-count⁵² to assign reads to the Ensembl release 76 gene annotation⁵³ using default parameters. Reads Per Kilobase of transcript per Million (RPKM) values were then calculated with the internal function provided by iRAP. For each tissue, gene expression levels were calculated by first taking the mean of the technical replicates before calculating the mean of the biological ones. When the VEGA IDs of the genes had corresponding ENSEMBL 76 gene IDs, those IDs were directly used to assess the gene expression. When this was not the case, the coordinates of the genes were retrieved from the latest VEGA release and then the windows of interest were plotted on merged bam files (one bam per tissue) obtained from TopHat2. The Supplementary Dataset 1 includes the number of paired reads that have been counted for each novel gene on the merged bam files normalized by the number of initial samples for each tissue.

Manual genome annotation

Manual annotation was performed according to the HAVANA criteria as defined for the GENCODE project³⁷. A novel ‘logic tree’ was used to allow annotation decisions to be made consistently; see Supplementary Fig. 3. Furthermore, a tblastn query was performed for each peptide on the primary list (see main text) to check whether related sequences found outside the search space could provide alternative explanations for the match; this was also done for those peptides on the secondary list seen to highlight putative coding regions. For pseudogene-associated matches on the primary list, dbSNP⁵⁴ was queried to investigate whether the PSM could also belong to a variant form of the parent locus.

Data availability

All search results along with the spectra and sequence database are available in the PRIDE repository (http://www.ebi.ac.uk/pride/archive/) under accession PXD002967. The authors declare that all other data supporting the findings of this study are available within the article and its Supplementary Information files.

Additional information

How to cite this article: Wright, J. C. et al. Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow. Nat. Commun. 7:11778 doi: 10.1038/ncomms11778 (2016).

References

Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 7, (Suppl 1): S4 1–S4 9 (2006).
Article Google Scholar
Harrow, J. et al. Identifying protein-coding genes in genomic sequences. Genome Biol. 10, 201 (2009).
Article Google Scholar
Lin, M. F., Jungreis, I. & Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 27, i275–i282 (2011).
Article CAS Google Scholar
Stadler, C. et al. Immunofluorescence and fluorescent-protein tagging show high correlation for protein localization in mammalian cells. Nat. Methods 10, 315–323 (2013).
Article CAS Google Scholar
Ahn, J. M. et al. Proteogenomic analysis of human chromosome 9-encoded genes from human samples and lung cancer tissues. J. Proteome Res. 13, 137–146 (2014).
Article CAS Google Scholar
Khatun, J. et al. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics 14, 141 (2013).
Article CAS Google Scholar
Woo, S. et al. Advanced proteogenomic analysis reveals multiple peptide mutations and complex immunoglobulin peptides in colon cancer. J. Proteome Res. 14, 3555–3567 (2015).
Article CAS Google Scholar
Ezkurdia, I. et al. Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Hum. Mol. Genet. 23, 5866–5878 (2014).
Article CAS Google Scholar
Tanner, S. et al. Improving gene annotation using peptide mass spectrometry. Genome Res. 17, 231–239 (2007).
Article CAS Google Scholar
Brosch, M. et al. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and ’resurrected’ pseudogenes in the mouse genome. Genome Res. 21, 756–767 (2011).
Article CAS Google Scholar
Beck, M., Claassen, M. & Aebersold, R. Comprehensive proteomics. Curr. Opin. Biotechnol. 22, 3–8 (2011).
Article CAS Google Scholar
Neuhauser, N. et al. High performance computational analysis of large-scale proteome data sets to assess incremental contribution to coverage of the human genome. J. Proteome Res. 12, 2858–2868 (2013).
Article CAS Google Scholar
Omenn, G. S. et al. Metrics for the human proteome project 2015: progress on the human proteome and guidelines for high-confidence protein identification. J. Proteome Res. 14, 3452–3460 (2015).
Article CAS Google Scholar
Kim, M. S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
Article CAS ADS Google Scholar
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Article CAS ADS Google Scholar
Pandey, A. & Pevzner, P. A. Proteogenomics. Proteomics 14, 2631–2632 (2014).
Article Google Scholar
Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114–1125 (2014).
Article CAS Google Scholar
Blakeley, P., Overton, I. M. & Hubbard, S. J. Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies. J. Proteome Res. 11, 5221–5234 (2012).
Article CAS Google Scholar
Zhang, K. et al. A note on the false discovery rate of novel peptides in proteogenomics. Bioinformatics 31, 3249–3253 (2015).
Article CAS Google Scholar
Savitski, M. M., Wilhelm, M., Hahne, H., Kuster, B. & Bantscheff, M. A scalable approach for protein false discovery rate estimation in large proteomic data sets. Mol. Cell. Proteomics 14, 2394–2404 (2015).
Article CAS Google Scholar
Gupta, N., Bandeira, N., Keich, U. & Pevzner, P. A. Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 22, 1111–1120 (2011).
Article CAS ADS Google Scholar
Kall, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics data sets. Nat. Methods 4, 923–925 (2007).
Article Google Scholar
Armengaud, J. et al. Non-model organisms, a species endangered by proteogenomics. J. Proteomics 105, 5–18 (2014).
Article CAS Google Scholar
Kucharova, V. & Wiker, H. G. Proteogenomics in microbiology: taking the right turn at the junction of genomics and proteomics. Proteomics 14, 2360–2675 (2014).
Article Google Scholar
Branca, R. M. et al. HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nat. Methods 11, 59–62 (2014).
Article CAS Google Scholar
Zhang, Y. et al. Tissue-based proteogenomics reveals that human testis endows plentiful missing proteins. J. Proteome Res. 14, 3583–3594 (2015).
Article CAS Google Scholar
Uhlen, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
Article Google Scholar
Deutsch, E. W. The peptideatlas project. Methods Mol. Biol. 604, 285–296 (2010).
Article CAS Google Scholar
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
Article CAS Google Scholar
Takahashi, H., Kato, S., Murata, M. & Carninci, P. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol. Biol. 786, 181–200 (2012).
Article CAS Google Scholar
Derti, A. et al. A quantitative atlas of polyadenylation in five mammals. Genome Res. 22, 1173–1183 (2012).
Article CAS Google Scholar
Uhlen, M. et al. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 28, 1248–1250 (2010).
Article CAS Google Scholar
Vizcaino, J. A. et al. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res. 41, D1063–D1069 (2013).
Article CAS Google Scholar
Desiere, F. et al. The PeptideAtlas project. Nucleic Acids Res. 34, D655–D658 (2006).
Article CAS Google Scholar
Chambers, M. C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918–920 (2012).
Article CAS Google Scholar
Kohlbacher, O. et al. TOPP-the OpenMS proteomics pipeline. Bioinformatics 23, e191–e197 (2007).
Article CAS Google Scholar
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Article CAS Google Scholar
UniProt, C. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
Article Google Scholar
Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312 (2004).
Article CAS Google Scholar
Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
Article CAS Google Scholar
Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).
Article CAS ADS Google Scholar
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European molecular biology open software suite. Trends Genet. 16, 276–277 (2000).
Article CAS Google Scholar
Kim, S. & Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 5, 5277 (2014).
Article CAS ADS Google Scholar
Brosch, M., Yu, L., Hubbard, T. & Choudhary, J. Accurate and sensitive peptide identification with Mascot Percolator. J. Proteome Res. 8, 3176–3181 (2009).
Article CAS Google Scholar
Wright, J. C. et al. Enhanced peptide identification by electron transfer dissociation using an improved Mascot Percolator. Mol. Cell. Proteomics 11, 478–491 (2012).
Article Google Scholar
Granholm, V. et al. Fast and accurate database searches with MS-GF+Percolator. J. Proteome Res. 13, 890–897 (2014).
Article CAS Google Scholar
Spivak, M., Weston, J., Bottou, L., Kall, L. & Noble, W. S. Improvements to the percolator algorithm for peptide identification from shotgun proteomics data sets. J. Proteome Res. 8, 3737–3745 (2009).
Article CAS Google Scholar
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Article Google Scholar
Pruitt, K. D. et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 42, D756–D763 (2014).
Article CAS Google Scholar
Lane, L. et al. neXtProt: a knowledge platform for human proteins. Nucleic Acids Res. 40, D76–D83 (2012).
Article CAS Google Scholar
Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
Article Google Scholar
Anders, S., Pyl, P. T. & Huber, W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Article CAS Google Scholar
Flicek, P. et al. Ensembl 2014. Nucleic Acids Res. 42, D749–D755 (2014).
Article CAS Google Scholar
NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 43, D6–17 (2015).
FANTOM Consortium. et al. A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).
Fagerberg, L. et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell. Proteomics 13, 397–406 (2014).
Article CAS Google Scholar
Hezroni, H. et al. Principles of long non-coding RNA evolution derived from direct comparison of transcriptomes in 17 species. Cell Rep. 11, 1110–1122 (2015).
Article CAS Google Scholar

Download references

Acknowledgements

This work is funded by the National Institutes of Health grant U41HG007234 to the GENCODE project and Wellcome Trust grant (WT098051) to the Sanger Institute. We would like to thank Toby Hunt and Mike Kay for their contribution to the genome annotation. We would also like to acknowledge Iakes Ezkurdia and Michael Tress for running the SEQUEST searches.

Author information

James C. Wright, Jonathan Mudge, Jyoti S. Choudhary and Jennifer Harrow: These authors contributed equally to this work.

Authors and Affiliations

Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
James C. Wright, Jonathan Mudge, Hendrik Weisser, Jose M. Gonzalez, Jyoti S. Choudhary & Jennifer Harrow
European Bioinformatics Institute, EMBL, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
Mitra P. Barzine & Alvis Brazma

Authors

James C. Wright
View author publications
You can also search for this author in PubMed Google Scholar
Jonathan Mudge
View author publications
You can also search for this author in PubMed Google Scholar
Hendrik Weisser
View author publications
You can also search for this author in PubMed Google Scholar
Mitra P. Barzine
View author publications
You can also search for this author in PubMed Google Scholar
Jose M. Gonzalez
View author publications
You can also search for this author in PubMed Google Scholar
Alvis Brazma
View author publications
You can also search for this author in PubMed Google Scholar
Jyoti S. Choudhary
View author publications
You can also search for this author in PubMed Google Scholar
Jennifer Harrow
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

J.C.W., J.M., J.S.C. and J.H. conceived and designed the study. J.C.W. and H.W. converted, processed and searched the proteomics data. J.M.G. and J.C.W. constructed the sequence database. J.C.W., J.M., J.S.C. and J.H. analysed and interpreted the data. J.M. and J.H. carried out the manual genome annotation. M.P.B. and A.B. analysed the RNAseq data. J.C.W., J.M., J.H., J.S.C., M.P.B., J.M.G., H.W. and A.B. drafted the manuscript. J.C.W., J.M., J.H., J.S.C., M.P.B., J.M.G., H.W. and A.B. critically revised the manuscript. J.C.W., J.M., J.H., J.S.C., M.P.B., J.M.G., H.W. and A.B. approved the final version of the article.

Corresponding authors

Correspondence to Jyoti S. Choudhary or Jennifer Harrow.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Information

Supplementary Figures 1 - 5 and Supplementary Table 1 (PDF 593 kb)

Supplementary Dataset 1

Novel Identifications. This table, made up of four sheets, contains a lists of inferred novel (non-CDS) proteins, peptides, PSMs and the final set of validated novel protein coding genes. Included in the protein tables are the match genomic loci, GENCODE VEGA identifiers, annotation notes, peptides matched, and tissues in which these peptides were seen. Additionally these tables also highlight which of these proteins have any further supporting proteomic evidence in the January 2016 release of PeptideAtlas. The table also contains the results from mapping tissue specific transcriptomic RNAseq data to the same locus. An RPKM above 0.75 is considered to show significant transcript expression at these loci. The peptide table gives more detail on individual peptides, including annotation score, best PEP, number of modified and unmodified PSMs, modification types, number of samples significantly identifying peptide and the list of tissues for these samples. The final sheet contains a list of the significant PSMs identifying the non-CDS peptides. This table shows the identification and scoring for the multiple search engines, the spectrum mzML ID, any modifications found in the spectra, and spectral charge state. (XLSX 297 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Reprints and permissions

About this article

Cite this article

Wright, J., Mudge, J., Weisser, H. et al. Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow. Nat Commun 7, 11778 (2016). https://doi.org/10.1038/ncomms11778

Download citation

Received: 22 December 2015
Accepted: 28 April 2016
Published: 02 June 2016
DOI: https://doi.org/10.1038/ncomms11778

This article is cited by

PepQuery2 democratizes public MS proteomics data for rapid peptide searching
- Bo Wen
- Bing Zhang
Nature Communications (2023)
Improving the genome and proteome annotations of the marine model diatom Thalassiosira pseudonana using a proteogenomics strategy
- Xiao-Huang Chen
- Ming-Kun Yang
- Da-Zhi Wang
Marine Life Science & Technology (2023)
Evidence for a novel overlapping coding sequence in POLG initiated at a CUG start codon
- Yousuf A. Khan
- Irwin Jungreis
- Manolis Kellis
BMC Genetics (2020)
Peptimapper: proteogenomics workflow for the expert annotation of eukaryotic genomes
- Laetitia Guillot
- Ludovic Delage
- Charles Pineau
BMC Genomics (2019)
LINC00309 is associated with short disease-free survival in breast cancer
- Sheng Huang
- Yayun Chi
- Jiong Wu
Cancer Cell International (2019)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.