Linking publication, gene and protein data

Kersey, Paul; Apweiler, Rolf

doi:10.1038/ncb1495

Review Article
Published: 01 November 2006

Nature Cell Biology volume 8, pages 1183–1189 (2006)

Abstract

The computational reconstruction of biological systems, 'systems biology', is necessarily dependent on the existence of well-annotated data sets defining and describing the components of these systems, especially genes and the proteins they encode. Information about these components can be accessed either through structured bioinformatics databases, which store basic chemical and functional information abstracted from (or supplementing) the scientific literature, or through the literature itself, which is richer in content but essentially unstructured.

**Figure 1: Protein analysis using InterProScan.**

**Figure 2: A schematic representation of a typical workflow for bioinformatics analysis.**

**Figure 3: From gene to protein to literature: a sample analysis.**

Author information

Authors and Affiliations

the EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, Cambridge, UK
Paul Kersey & Rolf Apweiler

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary table S1 (PDF 69 kb)

About this article

Cite this article

Kersey, P., Apweiler, R. Linking publication, gene and protein data. Nat Cell Biol 8, 1183–1189 (2006). https://doi.org/10.1038/ncb1495

Issue Date: 01 November 2006
DOI: https://doi.org/10.1038/ncb1495

This article is cited by

Openness and trust in data-intensive science: the case of biocuration
- Ane Møller Gabrielsen
Medicine, Health Care and Philosophy (2020)
Mining locus tags in PubMed Central to improve microbial gene annotation
- Chris J Stubben
- Jean F Challacombe
BMC Bioinformatics (2014)
Bioinformatics and molecular modeling in glycobiology
- Martin Frank
- Siegfried Schloissnig
Cellular and Molecular Life Sciences (2010)
Towards bioinformatics assisted infectious disease control
- Vitali Sintchenko
- Blanca Gallego
- Enrico Coiera
BMC Bioinformatics (2009)
Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data
- Michael J Gilchrist
- Mikkel B Christensen
- Nancy Papalopulu
BMC Bioinformatics (2008)
