This month's Genome Watch discusses the use of next-generation sequencing technologies to assemble draft genomes for two pseudomonad species.
The two next-generation sequencing technologies that are most frequently used to study bacterial genomes are the Roche 454 Genome Sequencer and the Illumina (formerly Solexa) Genome Analyzer. The 454 platform typically produces 200–400 nucleotide reads and has been used by itself to assemble draft genomes, but these drafts suffer from small contig sizes and artefactual frameshift mutations resulting from errors in estimating the length of homopolymeric tracts. Illumina technology can generate a much larger amount of data per run, but in the form of short (typically 36 or 54 nucleotide) reads. Although this technology has a relatively high error rate, miscalled bases are more randomly distributed than those from 454 sequencing and can be corrected using the greater depth of coverage. Previously, Illumina technology has been mainly used for the resequencing of populations that contain little variation, for example Salmonella enterica subsp. enterica serovar Typhi¹, but two recent papers describe the use of Illumina and 454 platforms to assemble high-quality draft genomes for the pseudomonad species Pseudomonas syringae and Pseudomonas aeruginosa.
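The point that randomly distributed miscalls can be corrected by depth of coverage can be illustrated with a minimal sketch. It assumes ungapped reads that have already been placed at known start positions, and it uses a simple majority vote per position; the function names, the `min_depth` parameter and the toy reads are hypothetical and are not taken from either of the papers discussed here.

```python
from collections import Counter


def pileup(reads, length):
    """Collect the bases observed at each position of a reference of given length.

    reads: iterable of (start, sequence) pairs for ungapped alignments.
    """
    columns = [[] for _ in range(length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < length:
                columns[pos].append(base)
    return columns


def majority_consensus(columns, min_depth=3):
    """Call the most frequent base per column; emit 'N' where depth is too low."""
    consensus = []
    for bases in columns:
        if len(bases) < min_depth:
            consensus.append("N")
        else:
            base, _count = Counter(bases).most_common(1)[0]
            consensus.append(base)
    return "".join(consensus)


# Toy example: one read carries a miscalled base (A instead of T) at position 3.
toy_reads = [
    (0, "ACGTACGT"),
    (0, "ACGAACGT"),
    (0, "ACGTACGT"),
    (2, "GTACGT"),
]
print(majority_consensus(pileup(toy_reads, 8)))
```

Because the single miscall is outvoted by the other reads covering the same position, the consensus recovers the underlying sequence. Real pipelines also weigh base qualities and handle insertions and deletions, which this sketch ignores.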
P. syringae causes disease in a wide range of plants. Reinhardt and colleagues² used 454 and Illumina technologies to sequence the genome of P. syringae pathovar oryzae, which infects rice. The chromosomes of the three previously sequenced representatives of the species exhibit considerable variation and so cannot be used as reliable guides for assembly. The authors therefore first validated a de novo assembly methodology by resequencing the previously completed genome of the tomato and Arabidopsis thaliana pathogen P. syringae pathovar tomato strain DC3000. Illumina reads of 35 nucleotides were assembled into thousands of small contigs of up to 7 kb in length, which were combined with 454 reads to give a draft sequence that accounted for 99.1% of the complete genome. Aligning the individual Illumina reads to the draft allowed the correction of 626 errors that were introduced by 454 sequencing and assembly.
Applying this method to P. syringae pathovar oryzae produced 130 scaffolds with a total length of 5.6 Mb, plus 2,002 unincorporated contigs. As expected, this sequence proved to be highly divergent from the other representatives of the species: it lacked 161 coding sequences that were previously thought to be part of the P. syringae 'core' genome and possessed 97 coding sequences not found in the other sequenced strains. This pathovar was found to have a unique complement of type III effector proteins, which are secreted into host cells, where they disrupt plant immune responses.
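Counts such as 'core' coding sequences that are missing and coding sequences that are unique to the new strain amount to set comparisons once coding sequences have been grouped into shared gene families. The sketch below assumes that such family identifiers are already available; the function name, strain labels and identifiers are hypothetical placeholders, not data from the study.

```python
def compare_gene_content(new_genome, reference_genomes):
    """Compare the gene families of a new genome with those of reference genomes.

    new_genome: set of gene-family identifiers found in the new draft.
    reference_genomes: dict mapping strain name -> set of identifiers.
    """
    reference_sets = list(reference_genomes.values())
    core = set.intersection(*reference_sets)   # present in every reference
    pan = set.union(*reference_sets)           # present in at least one reference
    return {
        "missing_core": core - new_genome,     # 'core' families absent from the draft
        "unique": new_genome - pan,            # families seen only in the draft
    }


# Toy identifiers for illustration only.
references = {
    "strain_A": {"fam1", "fam2", "fam3", "fam4"},
    "strain_B": {"fam1", "fam2", "fam3", "fam5"},
    "strain_C": {"fam1", "fam2", "fam3", "fam6"},
}
draft = {"fam1", "fam3", "fam5", "fam9"}
result = compare_gene_content(draft, references)
print(sorted(result["missing_core"]), sorted(result["unique"]))
```

With a draft assembly, a family that appears to be missing may simply fall in an unassembled gap, so absences found this way are best treated as provisional.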
A different approach was taken to sequence an isolate of P. aeruginosa, the leading cause of Gram-negative nosocomial infections and an important colonizer of the lungs of patients with cystic fibrosis. Salzberg and colleagues³ used a four-step strategy to assemble the genome of the highly virulent strain P. aeruginosa PAb1 using only single-end, 33-nucleotide Illumina reads. First, the reads were mapped against genomes of other members of the species, and contigs were assembled using this alignment. Second, contigs from different comparative assemblies were used to complement one another in order to generate an improved consensus.
The third step involved a 'gene-boosted assembly'. Gene fragments at the ends of contigs were used to predict the protein sequences that were likely to span the gaps in the assembly. Tblastn was then used to identify reads that would map to these breaks, and these reads were assembled with the flanking contigs to close the gaps. Finally, contigs assembled de novo from the Illumina reads were used to improve the sequence. This resulted in a single, large scaffold of 6.3 Mb, with the largest contig being over 500 kb long. The authors estimate the accuracy of the assembled genome to be >99.97% within contig sequences. One finding of this study was that the genome does not encode parts of the cyclic di-GMP-dependent signalling pathway implicated in the repression of swimming activity, which may explain the observed hypermotility of the strain.
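The read-recruitment idea behind the gene-boosted step can be sketched as follows: each short read is translated in all six reading frames, and a read is kept if one of its translations matches the protein predicted to span a gap. This is a deliberately simplified stand-in that uses exact substring matching, whereas tblastn scores inexact alignments; the function names, the `min_len` threshold and the toy protein are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA from its first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(read):
    """Return the three forward and three reverse-complement translations."""
    reverse_complement = read.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[frame:])
        for strand in (read, reverse_complement)
        for frame in range(3)
    ]


def recruit_reads(protein, reads, min_len=8):
    """Keep reads with a translation of at least min_len residues found in protein."""
    hits = []
    for read in reads:
        for peptide in six_frame_translations(read):
            if len(peptide) >= min_len and peptide in protein:
                hits.append(read)
                break
    return hits


# Toy protein and reads for illustration only.
gap_protein = "MKTAYIAKQRQISFVKSHFSRQ"
coding_read = "GCTTATATTGCTAAACAACGTCAGATTTCT"   # encodes AYIAKQRQIS
unrelated_read = "ACGTACGTACGTACGTACGTACGTACGTACGT"
print(recruit_reads(gap_protein, [coding_read, unrelated_read]))
```

Reads recruited in this way would then be assembled together with the flanking contigs; that assembly step is not shown here.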
Genome drafts such as those described here will improve as the quality and quantity of data that are produced by sequencing technologies increase, and this promises to change the way in which bacterial genomes are studied in the near future.
1. Holt, K. E. et al. High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nature Genet. 40, 987–993 (2008).
2. Reinhardt, J. A. et al. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res. 19, 294–305 (2009).
3. Salzberg, S. L. et al. Gene-boosted assembly of a novel bacterial genome from very short reads. PLoS Comput. Biol. 4, e1000186 (2008).