  • Review Article

Linking publication, gene and protein data

Abstract

The computational reconstruction of biological systems, 'systems biology', is necessarily dependent on the existence of well-annotated data sets defining and describing the components of these systems, especially genes and the proteins they encode. Information about these components can be accessed either through structured bioinformatics databases, which store basic chemical and functional information abstracted from (or supplementing) the scientific literature, or through the literature itself, which is richer in content but essentially unstructured.

Figure 1: Protein analysis using InterProScan.
Figure 2: A schematic representation of a typical workflow for bioinformatics analysis.
Figure 3: From gene to protein to literature: a sample analysis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary table S1 (PDF 69 kb)

About this article

Cite this article

Kersey, P., Apweiler, R. Linking publication, gene and protein data. Nat Cell Biol 8, 1183–1189 (2006). https://doi.org/10.1038/ncb1495
