Droplet barcoding for massively parallel single-molecule deep sequencing

The ability to accurately sequence long DNA molecules is important across biology, but existing sequencers are limited in read length and accuracy. Here, we demonstrate a method to leverage short-read sequencing to obtain long and accurate reads. Using droplet microfluidics, we isolate, amplify, fragment and barcode single DNA molecules in aqueous picolitre droplets, allowing the full-length molecules to be sequenced with multi-fold coverage using short-read sequencing. We show that this approach can provide accurate sequences of up to 10 kb, allowing us to identify rare mutations below the detection limit of conventional sequencing and directly link them into haplotypes. This barcoding methodology can be a powerful tool in sequencing heterogeneous populations such as viruses.

The pinched flow fractionation device used to remove large coalesced droplets from an emulsion. a) Emulsion collected at the end of the triple merger device before and after thermal cycling, showing coalesced droplets that are sorted out in the next step. b) Schematic and bright-field image of the pinched flow fractionation device used to remove large droplets. Droplets are injected at 400 µL per hr. HFE7500 oil is injected at 4000 µL per hr. The smaller droplets are collected into a tube, while the outlet for large droplets is attached to a syringe filled with water, withdrawing at 3000 µL per hr. Microscope image of the pinched flow fractionation device separating large and small droplets. Arrows indicate large droplets that flow to the lower outlet, separating them from the smaller droplets.

Supplementary Figure 4.
Schematic and validation of droplet barcode library. a) Chemically synthesized random Nmer barcodes are encapsulated into droplets so that most droplets contain zero or one barcode. Inside droplets, each barcode is clonally amplified by PCR, generating droplets that contain zero or many copies of a unique barcode. SYBR staining inside droplets is used to identify droplets that contain barcodes. b) Plot showing the probability of reusing the same barcode as a function of the total number of barcodes used in an experiment, for barcodes 15 to 20 bp long. Error bars represent the standard error of the mean from 10,000 simulation runs. See supplemental methods for simulation details. c) Empirical data from an SMDB sequencing library. Distribution of the Hamming distance of each barcode to its closest neighbor before and after clustering. Error barcodes are a Hamming distance of 1 from their closest neighbor, while original barcodes are on average a Hamming distance of 3 from theirs. Dashed blue line shows the theoretical distribution of Hamming distances given an equal number of randomly chosen barcodes.

Supplementary Note 1. Droplet stability in thermal cycling
Droplet stability to thermocycling depends on surfactant, aqueous buffer, and droplet size. Using the EA surfactant and our PCR buffer, we found empirically that droplets are most stable to thermal cycling when they are immersed in FC-40 with 5% w/w EA surfactant and when the aqueous phase contains 2% w/v Tween-20 and 2% w/v PEG-6000. Under these conditions, droplets are most stable to thermal cycling if their spherical diameter is less than 55 µm.

Supplementary Note 2. Algorithm to cluster error barcodes to their original sequences
The algorithm we use, called dfsCluster, available at https://github.com/AbateLab/Barcoding, operates under the expectation that each sequence in a barcode cluster is one Hamming distance away from another sequence in that cluster, and at least two Hamming distances away from any sequence not in that cluster. If this is the case, the sequences associated with unique barcode clusters form connected components in Hamming space, which can then be identified using a depth first search (dfs) in time proportional to barcode length times the number of unique sequences amongst the barcodes and their derivatives.
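The connected-component idea can be illustrated with a minimal sketch. This is not the dfsCluster implementation; the function names `hamming_neighbors` and `cluster_barcodes` are hypothetical, and the sketch assumes substitution errors only, with all barcodes of equal length. Each unique sequence is visited once, and its single-substitution neighbors are looked up in a hash set.

```python
from collections import Counter

BASES = "ACGT"


def hamming_neighbors(seq):
    """Yield every sequence exactly one substitution away from seq."""
    for i, base in enumerate(seq):
        for alt in BASES:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def cluster_barcodes(reads):
    """Group observed barcodes into connected components in Hamming space.

    Returns a list of dicts, one per component, mapping each unique
    sequence in the component to its read count.
    """
    counts = Counter(reads)
    unvisited = set(counts)
    clusters = []
    while unvisited:
        start = unvisited.pop()
        stack = [start]                      # explicit stack for the depth-first search
        component = {start: counts[start]}
        while stack:
            current = stack.pop()
            for neighbor in hamming_neighbors(current):
                if neighbor in unvisited:    # observed sequence not yet assigned
                    unvisited.remove(neighbor)
                    component[neighbor] = counts[neighbor]
                    stack.append(neighbor)
        clusters.append(component)
    return clusters
```

For example, `cluster_barcodes(["AAAA", "AAAA", "AAAT", "GGGG"])` places AAAA and AAAT in one component and GGGG in another.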
One scenario where this expectation does not hold is where sequences from one part of a barcode cluster are at least two mutations away from all sequences in another part. Computer simulation using a single template of length 15, a template replication rate of 0.8, and a single-base error rate of 0.0001 shows that although clusters do split, the splits are inconsequential. When we run dfsCluster on these simulated barcode clusters, it consistently groups 99.99% of the simulated cluster's sequences into a single cluster.
The other scenario is collisions, where multiple barcode clusters merge into a single component. Collisions depend heavily on the minimum Hamming distance between the original barcodes. To address them, dfsCluster provides the option to identify and remove components suspected of being collisions from its output. This filter marks a cluster as a collision when the normalized difference between the counts of its most and second most populous sequences is less than 0.7, with a false positive rate of 0.017 and a false negative rate of 0. For more functional details, consult the Abate Lab GitHub.
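A minimal sketch of such a filter follows. The text does not state how the difference is normalized, so the sketch assumes normalization by the count of the most populous sequence; this choice, and the function name `is_suspected_collision`, are assumptions rather than the dfsCluster definition.

```python
def is_suspected_collision(component, threshold=0.7):
    """Flag a cluster whose two most populous sequences have similar counts.

    component maps each unique sequence in the cluster to its read count.
    ASSUMPTION: the difference is normalized by the top count; the
    normalization used by dfsCluster is not given in the text.
    """
    ranked = sorted(component.values(), reverse=True)
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0
    return (top - second) / top < threshold
```

Under this assumption, a cluster with counts of 100 and 5 gives 0.95 and is kept, while a cluster with counts of 100 and 80 gives 0.2 and is marked as a suspected collision.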

Supplementary Note 3. Comparing rate of double encapsulation to theoretical Poisson distribution
If the process of template encapsulation is completely random, then the number of templates per droplet should follow a Poisson distribution, where λ is the average number of templates per droplet and k is the number of templates in a particular droplet (i.e., k = 2 represents droplets with two templates).

$$P(k, \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

Approximating each barcode cluster as a single droplet, the fraction of droplets that contain a single template (k = 1) is:

$$P(1, \lambda) = \lambda e^{-\lambda}$$

The fraction of droplets containing two templates is:

$$P(2, \lambda) = \frac{\lambda^{2} e^{-\lambda}}{2}$$

The ratio of P(1, λ) to P(2, λ) is R = 2/λ. Using the numbers of barcode clusters containing one and two templates, we estimate λ = 0.1, which matches the target encapsulation ratio and is supported by counting fluorescent versus non-fluorescent droplets under SYBR staining after initial template amplification in droplets.
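Because R = P(1, λ)/P(2, λ) = 2/λ, λ can be estimated from the observed cluster counts as λ = 2 N2/N1, where N1 and N2 are the numbers of barcode clusters with one and two templates. The sketch below illustrates this arithmetic; the function names are hypothetical, and the counts in the usage note are illustrative values, not data from this study.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of finding k templates in a droplet with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def estimate_lambda(n_single, n_double):
    """Estimate lam from counts of one- and two-template clusters.

    Uses R = P(1, lam) / P(2, lam) = 2 / lam, so lam = 2 * n_double / n_single.
    """
    if n_single <= 0:
        raise ValueError("need at least one single-template cluster")
    return 2.0 * n_double / n_single
```

With illustrative counts of 2000 single-template and 100 two-template clusters, `estimate_lambda(2000, 100)` gives 0.1, and `poisson_pmf(1, 0.1) / poisson_pmf(2, 0.1)` recovers the ratio of 20.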

Supplementary Note 4. Defining coverage entropy
To visualize the coverage distribution for every barcode cluster, it must be described numerically. We applied the informational entropy from information theory to the distribution of reads to arrive at the coverage entropy S:

$$S = -\sum_{i} P_i \log P_i$$

where P_i is the probability of finding a read that maps into the ith bin along the template in each barcode cluster. Coverage entropy is maximal when reads are equally likely to fall into every bin.
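As an illustration, coverage entropy can be computed from per-bin read counts for one barcode cluster, taking P_i as the fraction of the cluster's reads in bin i. The sketch below assumes the natural logarithm and treats empty bins as contributing zero; the logarithm base, the bin width, and the function name `coverage_entropy` are assumptions.

```python
import math


def coverage_entropy(bin_counts):
    """Shannon entropy of the read distribution across template bins.

    bin_counts[i] is the number of reads mapping into the i-th bin.
    ASSUMPTION: natural logarithm; empty bins contribute zero.
    """
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("cluster has no mapped reads")
    entropy = 0.0
    for count in bin_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy
```

A cluster whose reads are spread evenly over B bins reaches the maximum value log(B), whereas a cluster whose reads all fall into one bin has an entropy of zero.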

Gravity induced droplet size fractionation
In a tumbling emulsion, larger droplets experience a higher buoyant force than smaller droplets. This phenomenon provides a simple way to segregate large and small droplets within a single emulsion.
To use gravitational fractionation, droplets are loaded into a syringe with an equal volume of HFE7500 containing 2% EA surfactant, and the syringe, tilted at 30°, is gently rolled about its long axis at approximately 0.5 Hz for one hour. The rolling fluidizes the emulsion, allowing droplets to shuffle past one another according to their buoyancy. After rolling, the syringe is tilted fully to 90°, facing down. The large droplets are then at the top of the emulsion and the small droplets at the bottom. The half of the emulsion containing the small droplets is collected. The other half contains a mixture of large and small droplets, which are further sorted using a pinched flow fractionation device (Fig. S3).

Simulating probability of repeating barcodes
The probability of resampling the same barcode when drawing randomly from an unlimited pool is akin to the "birthday problem", for which the analytical solution is not computationally tractable for such a large number of possible barcodes. To determine the probabilities, we performed in silico simulations. Barcodes are generated by randomly selecting one of four bases with equal probability until N bases are selected, resulting in a random barcode. A repeat event occurs when a newly generated barcode exactly matches a previously generated barcode. The probability is determined by averaging the results from multiple simulations. The script used for simulation is available at the Abate Lab GitHub: https://github.com/AbateLab/
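A minimal sketch of this kind of simulation is given below. It is not the script from the repository. It assumes that the reported quantity is the fraction of simulation runs in which at least one repeat occurs; the function names, the default number of runs, and the fixed seed are likewise assumptions.

```python
import random


def random_barcode(length, rng):
    """Draw one barcode by picking each base uniformly from A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def has_repeat(n_barcodes, length, rng):
    """Return True if any newly drawn barcode matches an earlier one."""
    seen = set()
    for _ in range(n_barcodes):
        barcode = random_barcode(length, rng)
        if barcode in seen:
            return True
        seen.add(barcode)
    return False


def repeat_probability(n_barcodes, length, n_runs=1000, seed=0):
    """Fraction of simulated experiments with at least one repeated barcode.

    ASSUMPTION: "probability of repeating" is scored per experiment.
    """
    rng = random.Random(seed)
    hits = sum(has_repeat(n_barcodes, length, rng) for _ in range(n_runs))
    return hits / n_runs
```

For example, `repeat_probability(1000, 15)` estimates the chance of a repeat when 1000 barcodes of length 15 are drawn; realistic experiment sizes require many more draws per run and correspondingly longer run times.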

Calculating the limit of detection for rare variants
The probability of observing a rare mutation present at frequency f at least once when sampling the population n times is described by a Poisson distribution:

$$P(k > 0) = 1 - e^{-fn}$$

where k is the number of sampled molecules that carry the mutation. Hence, the frequency of mutants that can be detected with probability P is:

$$f = \frac{\ln\left(1 - P(k > 0)\right)}{-n}$$

Setting P = 0.95, we can calculate the minimum frequency of molecules we expect to detect 95% of the time when we sequence n molecules with SMDB.
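The limit of detection follows from the Poisson probability of observing at least one mutant, 1 − exp(−f n), solved for f. The sketch below evaluates this relationship; the function names are hypothetical and the value of n in the usage note is illustrative, not a figure from this study.

```python
import math


def detection_probability(f, n):
    """Probability of sampling at least one mutant at frequency f in n molecules."""
    return 1.0 - math.exp(-f * n)


def min_detectable_frequency(n, p_detect=0.95):
    """Lowest mutant frequency detected with probability p_detect from n molecules.

    Solves p_detect = 1 - exp(-f * n) for f.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < p_detect < 1.0:
        raise ValueError("p_detect must lie strictly between 0 and 1")
    return -math.log(1.0 - p_detect) / n
```

With P = 0.95 the numerator is −ln(0.05) ≈ 3.0, so sequencing an illustrative n = 10,000 molecules gives a detection limit of roughly 3 × 10⁻⁴.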