From small reads do mighty genomes grow

Croucher, Nicholas J.

doi:10.1038/nrmicro2211

Genome Watch
Published: September 2009

Nature Reviews Microbiology volume 7, page 621 (2009)

Abstract

This month's Genome Watch discusses the use of next-generation sequencing technologies to assemble draft genomes for two pseudomonad species.

Main

The two next-generation sequencing technologies that are most frequently used to study bacterial genomes are the Roche 454 Genome Sequencer and the Illumina (formerly Solexa) Genome Analyzer. The 454 platform typically produces 200–400 nucleotide reads and has been used by itself to assemble draft genomes, but these drafts suffer from small contig sizes and artefactual frameshift mutations resulting from errors in estimating the length of homopolymeric tracts. Illumina technology can generate a much larger amount of data per run, but in the form of short (typically 36 or 54 nucleotide) reads. Although this technology has a relatively high error rate, miscalled bases are more randomly distributed than those from 454 sequencing and can be corrected using the greater depth of coverage. Previously, Illumina technology has been mainly used for the resequencing of populations that contain little variation, for example Salmonella enterica subsp. enterica serovar Typhi¹, but two recent papers describe the use of Illumina and 454 platforms to assemble high-quality draft genomes for the pseudomonad species Pseudomonas syringae and Pseudomonas aeruginosa.
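The idea that randomly distributed miscalls can be outvoted when coverage is deep can be illustrated with a minimal Python sketch. It assumes reads that are already aligned without gaps to a common coordinate system; the function name and the toy reads are hypothetical and are not taken from either paper.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the majority-vote base at every covered position.

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based position of the read on a shared coordinate system. Assumes
    ungapped alignments and contiguous coverage (illustrative only).
    """
    columns = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns.setdefault(start + offset, Counter())[base] += 1
    # Take the most frequent base in each column, in positional order.
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


# Toy data: two reads each carry one random miscall at different positions.
reads = [
    (0, "ACGTAC"),
    (0, "ACGTAC"),
    (0, "ACCTAC"),  # miscall at position 2
    (2, "GTACGG"),
    (2, "GTACGG"),
    (2, "GTAAGG"),  # miscall at position 5
]
print(majority_consensus(reads))
```

Because the two miscalls fall in different columns, each is outnumbered by correct calls. A systematic error, such as a misjudged homopolymer length that recurs in many reads, would not be removed by this kind of vote.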

P. syringae causes disease in a wide range of plants. Reinhardt and colleagues² used 454 and Illumina technologies to sequence the genome of P. syringae pathovar oryzae, which infects rice. The chromosomes of the three previously sequenced representatives of the species exhibit considerable variation, so they cannot be used as reliable guides for assembly. The authors therefore first validated a de novo assembly methodology by resequencing the previously completed genome of the tomato and Arabidopsis thaliana pathogen P. syringae pathovar tomato strain DC3000. Illumina reads of 35 nucleotides were assembled into thousands of small contigs of up to 7 kb in length, which were combined with 454 reads to give a draft sequence that accounted for 99.1% of the complete genome. Aligning the individual Illumina reads to the draft allowed the correction of 626 errors that were introduced by 454 sequencing and assembly.

Applying this method to P. syringae pathovar oryzae produced 130 scaffolds with a total length of 5.6 Mb, plus 2,002 unincorporated contigs. As expected, this sequence proved to be highly divergent from the other representatives of the species: it lacked 161 coding sequences that were previously thought to be part of the P. syringae 'core' genome and possessed 97 coding sequences not found in the other sequenced strains. This pathovar was found to have a unique complement of type III effector proteins, which are secreted into host cells, where they disrupt plant immune responses.
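The comparison of gene content described above amounts to set arithmetic on gene families. The sketch below is a simplified illustration in Python: it assumes that coding sequences have already been grouped into shared family identifiers, and it treats the 'core' as the strict intersection of the reference strains, which may differ from the definition used by the authors. All strain labels and gene identifiers are hypothetical.

```python
def compare_gene_content(reference_strains, new_strain):
    """Compare a new strain's gene families with previously sequenced strains.

    reference_strains maps a strain label to a set of gene-family IDs;
    new_strain is a set of gene-family IDs for the newly sequenced genome.
    """
    gene_sets = list(reference_strains.values())
    core = set.intersection(*gene_sets)   # families present in every reference
    pan = set.union(*gene_sets)           # families present in any reference
    return {
        "missing_core": core - new_strain,  # 'core' families absent from the new strain
        "novel": new_strain - pan,          # families unique to the new strain
    }


# Hypothetical toy input, not real annotation data.
references = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5"},
    "strain_C": {"g1", "g2", "g3", "g6"},
}
draft = {"g1", "g3", "g5", "g7"}
print(compare_gene_content(references, draft))
```

In practice the hard part is deciding which coding sequences count as the same family, and an incomplete draft can make a gene look absent when it simply falls in an assembly gap.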

A different approach was taken to sequence an isolate of P. aeruginosa, the leading cause of Gram-negative nosocomial infections and an important colonizer of the lungs of patients with cystic fibrosis. Salzberg and colleagues³ used a four-step strategy to assemble the genome of the highly virulent strain P. aeruginosa PAb1 using only single-end, 33-nucleotide Illumina reads. First, the reads were mapped against genomes of other members of the species, and contigs were assembled using this alignment. Second, contigs from different comparative assemblies were used to complement one another in order to generate an improved consensus.

The third step involved a 'gene-boosted assembly'. Gene fragments at the ends of contigs were used to predict the protein sequences that were likely to span the gaps in the assembly. Tblastn was then used to identify reads that would map to these breaks, and these reads were assembled with the flanking contigs to close the gaps. Finally, contigs assembled de novo from the Illumina reads were used to improve the sequence. This resulted in a single, large scaffold of 6.3 Mb, with the largest contig being over 500 kb long. The authors estimate the accuracy of the assembled genome to be >99.97% within contig sequences. One finding of this study was that the genome does not encode parts of the cyclic di-GMP-dependent signalling pathway implicated in the repression of swimming activity, which may explain the observed hypermotility of the strain.
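The read-recruitment idea behind the gene-boosted step can be sketched in Python. This is a deliberately crude stand-in for Tblastn, not the authors' implementation: it translates each read in all six frames using the standard genetic code and keeps reads that share an exact peptide of at least `min_peptide` residues with the predicted gap-spanning protein, whereas Tblastn scores inexact alignments. The function names, the threshold and the example sequences are all assumptions made for illustration.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def six_frame_translations(read):
    """Yield the translation of a read in three frames on each strand."""
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            # Codons containing ambiguous bases are translated as 'X'.
            yield "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def recruit_reads(reads, predicted_protein, min_peptide=5):
    """Return reads sharing an exact peptide with the predicted protein."""
    hits = []
    for read in reads:
        for peptide in six_frame_translations(read):
            windows = range(len(peptide) - min_peptide + 1)
            if any(peptide[i:i + min_peptide] in predicted_protein for i in windows):
                hits.append(read)
                break  # one matching frame is enough for this read
    return hits


# Hypothetical example: a protein predicted to span a gap, and three reads.
protein = "MKTAYIAKQR"
forward_read = "ACTGCTTATATTGCAAAACAA"  # encodes TAYIAKQ in frame 0
candidates = [forward_read, reverse_complement(forward_read), "G" * 21]
print(recruit_reads(candidates, protein))
```

The recruited reads would then be assembled together with the flanking contigs. With reads of about 33 nucleotides, each read covers at most 11 codons, so the peptide-length threshold trades sensitivity against spurious matches.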

Genome drafts such as those described here will improve as the quality and quantity of data that are produced by sequencing technologies increase, and this promises to change the way in which bacterial genomes are studied in the near future.

References

1. Holt, K. E. et al. High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nature Genet. 40, 987–993 (2008).
2. Reinhardt, J. A. et al. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res. 19, 294–305 (2009).
3. Salzberg, S. L. et al. Gene-boosted assembly of a novel bacterial genome from very short reads. PLoS Comput. Biol. 4, e1000186 (2008).

Author information

Authors and Affiliations

Nicholas J. Croucher is at the Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. microbes@sanger.ac.uk

About this article

Cite this article

Croucher, N. From small reads do mighty genomes grow. Nat Rev Microbiol 7, 621 (2009). https://doi.org/10.1038/nrmicro2211
