Hierarchical molecular tagging to resolve long continuous sequences by massively parallel sequencing

Here we demonstrate the use of short-read massive sequencing systems to in effect achieve longer read lengths through hierarchical molecular tagging. We show how indexed and PCR-amplified targeted libraries are degraded, sub-sampled and arrested at timed intervals to achieve pools of differing average length, each of which is indexed with a new tag. By this process, indices of sample origin, molecular origin, and degree of degradation is incorporated in order to achieve a nested hierarchical structure, later to be utilized in the data processing to order the reads over a longer distance than the sequencing system originally allows. With this protocol we show how continuous regions beyond 3000 bp can be decoded by an Illumina sequencing system, and we illustrate the potential applications by calling variants of the lambda genome, analysing TP53 in cancer cell lines, and targeting a variable canine mitochondrial region.

* to whom correspondence should be addressed.

Chimeras
We detected unexpectedly high levels of chimeras when we examined the TP-tag junctions in the raw data. Our first hypothesis was that they were caused by intermolecular ligation during circularization. We lowered amounts and compared to blunt end ligation and cre recombinase based systems using data from another study (Akan P., et al, manuscript submitted). In this study blunt-end ligation circularization had a chimera frequency of around 2% and cre recombinase < 1%. Further the fragment concentration during circularization did not affect the chimeric frequency found in this protocol. We therefore concluded that the circularization was not the major contributor of this phenomenon. We suspect that the library PCR causes the elevated levels of chimeras, with the hypothesis being that the secondary structure of the circularization hairpin obscures the activity of the polymerase, causing it to stall or dislodge.

Bias of the transposase
Some interesting aspects of the transposase based library preparation (Nextera, Epicentre now Illumina) was discovered in this study. The transposase in the kit is a variant of the Tn5 transposase 1 and has been characterized for high throughput sequencing applications 2 . We noticed a clear preference in binding to the circularization adapter. This is highly undesirable as those fragments will not produce the information incorporated in the library. Roughly, the average length of the dsDNA circles to be prepared for sequencing is 1500 bp for targets of 3000 bp, assuming a uniform distribution. The adapter is 46 bp, and digestion inside the adapter should by chance happen in approximately 2-4% of the sequences. Instead 12% of the sequences begin with the circularization adapter, and the pattern of digestion is clearly non-stochastic ( Figure S4). The reported recognition sequence for Tn5 is A-GNTYWRANC-T 1 and if we align this motif to our adapter, the best match (82%) is indeed the highest noted insertion site (position 22, Figure S4). This is contrasted to a previously noted low bias of transposition. When we compare to a human shotgun sample generally a low bias is seen, as is the case for all the raw data in this study ( Figure S5). Although a general low bias is seen per position on a global scale, particular sequences can be quite strongly biased. Table S1. Tile-seq throughput.

Supplementary tables
The theoretical output is based on a typical amount of reads obtained from one lane of Illumina HiSeq2000 sequencing, one of the 2x100 bp reads is used for classifying the construct (reading indices). The current throughput is approximated based on the mapped non-duplicated reads of the data, average read length and average amount of bases that could be generated in a typical run.  Table S2.

Comparison of Tile-seq and Sanger sequencing
Cost comparison between Tile-seq and Sanger sequencing. This is a simplified example of a sequencing project containing 96, 3kb amplicons, either sequenced by Tile-seq as described in this report, or by 600 bp Sanger reads in both directions. This example is included to provide an overview of the protocol as compared to the established protocol of Sanger sequencing. Polymer consumption of Sanger sequencing is not considered in this example. If the same amplicon is prepared, 107 oligos (96 forward, 1 reverse, 2 universal and 8 adapters) is needed for Tile-seq and 8 for Sanger (each used as sequence primers). This example assumes low complexity samples (e.g. two alleles), known reference sequence within the amplicon, and unproblematic sequence (e.g. no long homopolymer stretches). If these criteria would not be fulfilled, cloning or sequence primer iterations will be required for Sanger but would not impact the Tile-seq protocol. Tile-seq would produce approximately 40x average coverage, and the Sanger protocol is used to read five 600 bp regions in both directions (2x). The manually intensive steps of Tile-seq is the first PCRs, and for Sanger sequencing setting upp the Big Dye reactions and the subsequent ethanol preparations (960 in total). The purifications of Tile-seq, in total 244, are automated and cost approximately $0.1 per reaction.         Figure S1. Unidirectional exonuclease III degradation. 1% agarose gel analysis of exonuclease III degradation of differently prepared constructs. 1. Unidirectional degradation expected by uracil incorporation and subsequent USER treatment. 2. Same construct as 1 but no USER treatment (degradation in both ends expected). 3. Uracil incorporated in both ends and USER treatment (no degradation expected). 4. SacI restriction site incorporated in 1 end and SacI treatment (unidirectional degradation expected). Figure S2. Time-point tag occurrence after mapping to the lambda amplicon references. The mapped position of each respective TP-tag detected of each amplicon is plotted using a Gaussian kernal density approximation (R). Plots indicate amplicon dependent variation but generally a distinguished order. TP8 omitted due to generally low detection levels. Figure S3. Frequency (y-axis) of reads beginning with the adapter in the raw data, counted per starting position (x-axis) in the adapter. The plot shows a transposase induced bias in adapter fragmentation. Dots indicate variable positions due to the TPtag, and the last 11 bp were not included in the query due to the rise of unspecific hits. Figure S4. Weblogo plots on the base distribution of the 12 first bases of raw data from Nextera prepared libraries. Left and middle shows a typical human shotgun sample with expected random distribution, right shows Tile-seq data. Left plot shows probability over base position, and middle and left shows bits (0.0-0.3) over base position. All plots indicate slight bias supporting the recognition sequence of Tn5.