e-Science: relieving bottlenecks in large-scale genome analyses

Craddock, Tracy; Harwood, Colin R.; Hallinan, Jennifer; Wipat, Anil

doi:10.1038/nrmicro2031

Opinion
Published: December 2008

Tracy Craddock¹, Colin R. Harwood², Jennifer Hallinan¹ & Anil Wipat¹

Nature Reviews Microbiology volume 6, pages 948–954 (2008)

Abstract

The development of affordable, high-throughput sequencing technology has led to a flood of publicly available bacterial genome-sequence data. The availability of multiple genome sequences presents both an opportunity and a challenge for microbiologists, and new computational approaches are needed to extract the knowledge that is required to address specific biological problems and to analyse genomic data. The field of e-Science is maturing, and Grid-based technologies can help address this challenge.

**Figure 1: Publicly available bacterial genomes.**

**Figure 2: Analysis of the *Bacillus subtilis* secretome as an *in silico* experiment.**

**Figure 3: A heat map that shows the prevalence of secreted protein families across 12 *Bacillus* species.**

Acknowledgements

The authors acknowledge funding from the UK Engineering and Physical Sciences Research Council and Non-linear Dynamics for a CASE (collaborative awards in science and engineering) studentship to T.C., from Research Councils UK for a fellowship to J.H. and from the European Union (Bacell Health; grant number LSH-2002-1.1.0-1).

Author information

Authors and Affiliations

Tracy Craddock, Jennifer Hallinan and Anil Wipat are at the School of Computing Science, Claremont Tower, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK.
Colin R. Harwood is at the Institute for Cell and Molecular Biosciences, Medical School, Newcastle University, Newcastle upon Tyne, NE2 4HH, UK.

Corresponding author

Correspondence to Colin R. Harwood.

About this article

Cite this article

Craddock, T., Harwood, C., Hallinan, J. et al. e-Science: relieving bottlenecks in large-scale genome analyses. Nat Rev Microbiol 6, 948–954 (2008). https://doi.org/10.1038/nrmicro2031

Issue Date: December 2008
DOI: https://doi.org/10.1038/nrmicro2031

This article is cited by

Reducing data transfer in big-data workflows: the computation-flow delegated approach
- Rickey T. P. Nunes
- Santosh L. Deshpande
Journal of Data, Information and Management (2019)
e!DAL - a framework to store, share and publish research data
- Daniel Arend
- Matthias Lange
- Uwe Scholz
BMC Bioinformatics (2014)
Initial steps towards a production platform for DNA sequence analysis on the grid
- Angela CM Luyf
- Barbera DC van Schaik
- Silvia D Olabarriaga
BMC Bioinformatics (2010)
Applications of thiol-disulfide oxidoreductases for optimized in vivo production of functionally active proteins in Bacillus
- Thijs R. H. M. Kouwen
- Jan Maarten van Dijl
Applied Microbiology and Biotechnology (2009)
