PacBio sequencing output increased through uniform and directional fivefold concatenation

Advances in sequencing technology have allowed researchers to sequence DNA with greater ease and at decreasing cost. Most developments have focused on sequencing either many short sequences or fewer long sequences. Methods for sequencing mid-sized sequences of 600–5,000 bp are currently less efficient. For example, the PacBio Sequel I system yields ~100,000–300,000 reads with a per-base accuracy of 90–99%. We sought to sequence several DNA populations of ~870 bp in length with a sequencing accuracy of 99% and to the greatest depth possible. We optimised a simple, robust method to concatenate genes of ~870 bp five times and then sequenced the resulting DNA of ~5,000 bp by PacBio SMRT long-read sequencing. Our method improved upon previously published concatenation attempts, giving greater sequencing depth and high-quality reads with limited sample preparation and at little expense. We applied this efficient concatenation protocol to sequence nine DNA populations from a protein engineering study. Together with a simple, user-friendly analysis pipeline, DeCatCounter, the improved method allows medium-length sequences to be sequenced efficiently at one-fifth of the cost.
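The throughput argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every read is a complete fivefold concatemer and that all five genes are recovered from it, so it gives an upper bound rather than an expected yield. The function name `variants_per_run` is hypothetical and is not part of DeCatCounter.

```python
def variants_per_run(reads: int, copies_per_read: int = 5) -> int:
    """Upper bound on gene sequences recovered from one run.

    Assumes every read is a full concatemer and every copy passes filtering.
    """
    return reads * copies_per_read


# Read yields quoted for the PacBio Sequel I system.
low_reads, high_reads = 100_000, 300_000

print(variants_per_run(low_reads), variants_per_run(high_reads))

# Gene sequence carried by one fivefold concatemer of ~870 bp genes.
gene_bp_per_concatemer = 5 * 870
print(gene_bp_per_concatemer)

# Relative cost per gene when one read carries five genes instead of one.
relative_cost = 1 / 5
print(relative_cost)
```

The difference between the gene sequence carried per concatemer and the ~5,000 bp product is not broken down in the text, so the sketch does not attempt to account for it.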


Table S2. Oligonucleotides used to attach the sample-specific barcodes. The barcode-encoding regions are shown in bold and colour. The regions overlapping with the termini of the concatenated amplicon are shown in italics. The oligonucleotides attach the barcodes either to the 5'-end of library variant 1 (Fwd primer, orange) or to the 3'-end of library variant 5 (Rev primer, blue).

Column headings: Variant; Scar; Oligonucleotide sequence (5'-3')
n/a; Fwd; See Barcodes
Supplementary Figure S1. Initial concatenation attempts showed different assembly efficiencies for three samples. Improved efficiency correlated with increased PCR yield of the samples. The total DNA after PCR refers to the DNA recovered after PCR amplification of the 5x amplicons 1-5 for each sample, averaged over the two identical repeats. As PCR yield increased, the assembly efficiency increased because the percentage of original template DNA was reduced. A DNA marker was loaded in the leftmost lane (M) as a size reference.
Supplementary Figure S2. Optimizing the efficiency of the assembly process. The efficiency of concatenating the five library variants was compared for assemblies using different commercially available enzymes and following protocols 1-3 (black, green and orange, respectively; see methods section). Performance was judged by the percentage of fully assembled 5x product (between 1% and 40%). The most efficient assembly was achieved with the NEB Golden Gate assembly kit and protocol 1 (increased reaction time and number of cycles). Further details can be found in the methods section. The lane containing a molecular mass ladder is labelled M in each gel.
Supplementary Figure S3. Size distribution of the pooled concatenated amplicons S1-S9, measured on an Agilent Bioanalyzer directly before submission for sequencing.
Supplementary Figure S4. Detailed schematic representation of the data processing workflow used to extract individual library variants (gene sequences) from the PacBio sequencing run. PacBio can read a sequence from either direction and therefore produces two sets of amplicons: one as the sense strand and the other as the reverse complement (Set 1, Set 2). Our sequencing output represents a pool of different samples, each of which contains a unique pair of 16 bp sequencing barcodes flanking the fully assembled 5x amplicon. Amplicons were therefore first demultiplexed by identifying their pair of barcodes and assigning them to sample-specific sub-pools. Each sub-pool was then deconcatenated to extract the five individual library variants contained within each 5x amplicon. Because each amplicon can be read in the forward or reverse direction, sequences were scanned for either of the two constant regions that flank each variant, and the first constant region identified was removed. Each sequence was then trimmed by identifying and removing the opposing flanking region (including the second constant region) and filtered by size using a 5% error margin of the expected size range (707-825 bp). Finally, the library variants from reads in the reverse direction were reverse-complemented and merged with the forward reads. The library variants from each sample were subsequently clustered into families of similar sequences, translated to amino acid sequences, and dereplicated within each family (not shown in figure).
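The demultiplexing, deconcatenation, size-filtering and reverse-complementing steps described in this caption can be illustrated with a minimal sketch. It is not the DeCatCounter implementation: it assumes exact string matching (no tolerance for sequencing errors), barcodes written as they appear on the sense strand, and toy sequences far shorter than real amplicons. The function names, barcodes, constant regions and size limits below are all hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def demultiplex(read, barcode_pairs):
    """Return (sample, strand) for the barcode pair flanking the read, else None.

    barcode_pairs maps sample name -> (5' barcode, 3' barcode), both given
    as they appear on the sense strand (an assumption of this sketch).
    """
    for sample, (bc5, bc3) in barcode_pairs.items():
        if read.startswith(bc5) and read.endswith(bc3):
            return sample, "+"
        # A reverse read shows the reverse complements in swapped positions.
        if (read.startswith(reverse_complement(bc3))
                and read.endswith(reverse_complement(bc5))):
            return sample, "-"
    return None


def deconcatenate(read, strand, left_const, right_const, min_len, max_len):
    """Extract variants flanked by the two constant regions.

    Variants outside [min_len, max_len] are discarded; variants from
    reverse reads are reverse-complemented before being returned.
    """
    if strand == "-":
        left = reverse_complement(right_const)
        right = reverse_complement(left_const)
    else:
        left, right = left_const, right_const
    pattern = re.compile(re.escape(left) + "(.+?)" + re.escape(right))
    variants = []
    for match in pattern.finditer(read):
        variant = match.group(1)
        if min_len <= len(variant) <= max_len:
            if strand == "-":
                variant = reverse_complement(variant)
            variants.append(variant)
    return variants


# Toy data (hypothetical): one sample, five 8 bp "variants".
LEFT, RIGHT = "GATTACA", "CTGCAGT"
pairs = {"S1": ("AAAACCCCAAAACCCC", "TTTTGGGGAAAATTTT")}
variants = ["AACCAACC", "AAGGAAGG", "TTCCTTCC", "TTGGTTGG", "ACACACAC"]
read = pairs["S1"][0] + "".join(LEFT + v + RIGHT for v in variants) + pairs["S1"][1]

for r in (read, reverse_complement(read)):
    hit = demultiplex(r, pairs)
    if hit is not None:
        sample, strand = hit
        print(sample, strand, deconcatenate(r, strand, LEFT, RIGHT, 6, 10))
```

In this sketch a reverse read returns the same variants in the opposite order, since the concatemer is traversed from the other end. For real data the size limits would be set to the range given in the caption (707-825 bp), and exact matching would need to be replaced by error-tolerant matching of barcodes and constant regions.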