Sequencing of a single-cell genome requires DNA amplification, a process prone to introducing bias and errors into the amplified genome. Here we introduce a novel multiple displacement amplification (MDA) method based on the unique DNA primase features of Thermus thermophilus (Tth) PrimPol. TthPrimPol displays a potent primase activity preferring dNTPs as substrates unlike conventional primases. A combination of TthPrimPol’s unique ability to synthesize DNA primers with the highly processive Phi29 DNA polymerase (Φ29DNApol) enables near-complete whole genome amplification from single cells. This novel method demonstrates superior breadth and evenness of genome coverage, high reproducibility, excellent single-nucleotide variant (SNV) detection rates with low allelic dropout (ADO) and low chimera formation as exemplified by sequencing HEK293 cells. Moreover, copy number variant (CNV) calling yields superior results compared with random primer-based MDA methods. The advantages of this method, which we named TruePrime, promise to facilitate and improve single-cell genomic analysis.
It has become apparent in the last 5 years that genomic analysis of single cells provides crucial information that is lost in bulk sequencing of tissue because of averaging effects and limitations of computational methods to deconvolute sequence information from many different clones1,2. For example, the complexity of alterations of the cancer genome has only been grasped recently by assessing single tumour cells in parallel3,4,5. Especially for oncology, single-cell sequencing offers novel insights into the evolution of cancers over time and in reaction to treatment, which will lead to novel strategies for treatment regimens and drug development6. Other application areas for single-cell sequencing are pre-implantation diagnostics7,8,9 and basic biological research, for example, in the neurosciences10.
Amplification of genomic DNA is a necessary first step for the available sequencing technologies. Unfortunately, DNA amplification is a process subject to bias introduction, error and co-amplification of minute levels of contaminating DNA. Several techniques have been developed for whole genome amplification (WGA), broadly dividable into PCR-related protocols and those based on multiple displacement amplification (MDA). PCR-based methods can be classified into degenerate oligonucleotide-primed PCR (DOP-PCR)11, linker-adapter PCR12, primer extension pre-amplification PCR (PEP-PCR-/I-PEP-PCR)13,14 and variations thereof. MDA methods are mainly based on using the highly processive Φ29DNApol15 together with random hexamers16,17,18,19. There is another variant MDA method called pWGA based on the reconstituted T7 replication system20. Recently, another hybrid PCR/MDA method called multiple annealing and looping-based amplification cycles (MALBAC) has been proposed, relying on the Bacillus stearothermophilus polymerase for the MDA process21. Key parameters that determine the quality of the amplification are the absence of contaminations and artefacts in the reaction products, coverage breadth and uniformity, nucleotide error rates and the ability to recover single-nucleotide variants (SNVs), copy number variants (CNVs) and structural variants. In general, PCR-based methods are thought to have advantages in CNV detectability22, whereas Φ29DNApol-based methods have the advantage of extremely low nucleotide error rates due to the high fidelity of the polymerase, produce very long amplification products and cover the genome more completely. Problems that affect all amplification methods to some degree are chimera formation and preferential amplification of one allele (allelic dropout, ADO).
A source for potential amplification bias in the current Φ29DNApol-based MDA methods is the propensity to generate primer-derived artefacts and priming inequality arising from different sequence-dependent hybridization kinetics of the oligonucleotides. Thus, using a dedicated primase may provide an advantage over random oligonucleotides. However, most known primases only accept NTPs and generate RNA primers. These RNA primers are not an ideal substrate for most replicative DNA polymerases and need to be elongated by specialized transition DNA polymerases, such as DNA polymerase-α in human cells.
Primases can be divided into two evolutionarily unrelated families: DnaG-like primases (Bacteria) and archaeal-eukaryotic primase (AEP)-like primases (Archaea and Eukaryotes)23,24. Recently, a novel subfamily of AEPs called primase-polymerase (PrimPol)25,26 has been described, whose first members were originally found in archaeal plasmids27 and in some bacteria28. PrimPols show both DNA polymerase and DNA primase activities, and are often associated with helicases to form a replication initiation complex26,28,29. These features enable a system where the same enzyme performs both the initiation and elongation stages. Perhaps the most significant feature of PrimPols, unlike conventional primases, is their ability to carry out both the initiation and the extension of DNA chains27,30,31,32. More recently, PrimPol was described to exist in human cells (HsPrimPol, UniProtKB Q96LW4), encoded by the PRIMPOL gene (also known as CCDC111)33,34,35,36.
In this case, HsPrimPol is not associated with a helicase, but displays some strand-displacement capacity. Moreover, its DNA polymerase activity is able to efficiently bypass different kinds of DNA lesions, such as 8oxoG and pyrimidine dimers34,35,37,38,39. It is very likely that this translesion synthesis capacity is crucial for the demonstrated role of human PrimPol in mitochondrial DNA maintenance34. Moreover, the most significant feature of human PrimPol (also extending to archaeal and bacterial PrimPols) is the ability to initiate the synthesis of DNA chains from dNTPs (as a DNA primase), unlike conventional primases that need NTPs to make primers26,27,28,34. Such convenient capacity was shown to be required to re-prime arrested replication forks during nuclear DNA replication in human cells35,36,40 and also confirmed in avian cells41.
From a biotechnological perspective, the unique ability to synthesize DNA primers could make PrimPol a useful partner in an MDA-type process. Here we describe the cloning and characterization of the Thermus thermophilus PrimPol enzyme and the creation of a novel primer-free WGA method with specific advantages for single-cell genome amplification, which we termed TruePrime.
TthPrimPol is a DNA primase with wide template specificity
Human PrimPol was initially considered as a good candidate for DNA amplification processes. However, the human protein was promptly discarded mainly due to stability issues, probably related to the presence of a Zn-finger domain at its carboxy terminus and also due to its strong dependence on Mn2+ ions to activate its priming function34. We therefore sought to identify a more stable bacterial orthologue, with optimal enzymatic properties to be exploited for biotechnological applications.
A search of the non-redundant database of protein sequences (National Center for Biotechnology Information, NIH, Bethesda), performed using the BLASTP programme42, revealed that the hypothetical conserved protein AAS81004.1 (291 amino acids) from T. thermophilus strain HB27 contained a PrimPol domain of the type found in bifunctional replicases from archaeal plasmids. Although conventional AEP-like primases, such as human Prim1, have the three conserved motifs (A, B and C) that form the primase active site, the PrimPols already characterized have the same three conserved motifs but also a Zn-finger domain required for the DNA primase activity34,35,36. Strikingly, the putative T. thermophilus PrimPol contains only the three motifs but no Zn-finger domain (Fig. 1a). Instead, TthPrimPol contains an α-helical PriCT-1 domain, found at the C terminus of some AEP primases24. A detailed amino acid sequence alignment (Fig. 2) of TthPrimPol with its closest bacterial relatives and also with the well-characterized pRN1 PrimPol, and some representatives of other AEP-like members found in archaea, bacteria, phages and plasmids, supports the correct identification of PrimPol in T. thermophilus (see legend to Fig. 2 for further details). A more extensive search of the closest TthPrimPol orthologues was carried out, supporting their conservation and monophyletic origin in Bacteria (Supplementary Fig. 1). Moreover, the significant amino acid sequence similarity with pRN1 PrimPol and also with the polymerization domain (PolDom) of Mycobacterium tuberculosis LigD (not shown), whose three-dimensional (3D) structures have been solved26,43,44, was sufficient to generate a 3D model for TthPrimPol in complex with DNA and nucleotide substrates (Fig. 1b; see Methods for details).
TthPrimPol was cloned, expressed and purified in a soluble and active form, as described in Methods. Primase activity was first analysed at 55 °C using a single-stranded template oligonucleotide in which a potential primase recognition sequence 3′-GTCC-5′ is flanked by thymine residues34, according to the preferred template context to initiate primer synthesis by several viral, prokaryotic and eukaryotic RNA primases34,35,36. TthPrimPol displayed a strong primase activity, starting synthesis opposite the ‘TC’ template sequence. The nucleotide acting as ‘nano-primer’ (5′-position)38 can be either a ribonucleotide (ATP) or a deoxynucleotide (dATP) in the presence of manganese, but only a deoxynucleotide (dATP) when magnesium is the metal cofactor (Fig. 3a, left panel). Second and further added nucleotides (3′-position) must be strictly deoxynucleotides (dGTP and dATP), regardless of the metal cofactor present. Modification of the base preceding the directing TC template sequence had a minor effect on the priming activity of TthPrimPol (Fig. 3a, right panel), in contrast to the strong preference for 3′-GTCC-5′ shown by human PrimPol34. TthPrimPol was able to initiate DNA primer synthesis also at 30 °C at multiple sites on a single-stranded circular DNA template (M13mp18) by using both purine and pyrimidine nucleotides to form the initial dimer (Fig. 3b), in agreement with a desirable and wide template specificity, unlike human PrimPol that largely prefers to make dimers with purine nucleotides34. The preference for dNTPs as incoming nucleotides also makes TthPrimPol a competent DNA-directed DNA polymerase, able to extend the initiating dimers into longer DNA primers. By providing the four dNTPs, TthPrimPol synthesized primers up to 20-mer (Fig. 3c, lane 3). Heparin, in an amount that can inhibit TthPrimPol when pre-incubated (Fig. 3c, lane 2), was only able to inhibit the synthesis of primers longer than 10 nucleotides, when added after enzyme/DNA binding (Fig. 3c, lane 4). Thus, the main primer products (7–9 nt) are synthesized processively, but further extension appears to be distributive.
TthPrimPol serves as primase for Φ29DNApol-mediated MDA
Having established that TthPrimPol is indeed a DNA primase, we explored the possibility that these DNA primers could be efficiently elongated by a second polymerase, the high-fidelity Φ29DNApol45. We designed a first experiment in two steps (pulse and chase) to determine the size of the primers made by TthPrimPol that can be efficiently extended by Φ29DNApol. First, during the pulse, TthPrimPol generated labelled primers at different enzyme/DNA ratios, supporting that 7–9 nt primers are the main products, processively synthesized (Fig. 3d, left panel). During the chase at 10 μM dNTPs, Φ29DNApol was able to generate highly elongated products by extending these primers (Fig. 3d, right panel).
Thus, the compatibility of both enzymes allowed their combination to perform rolling circle amplification (RCA)46 on a single-stranded M13mp18 template, as an alternative method (that we termed TruePrime) to the currently used mix of random primers (RPs) and Φ29DNApol (Fig. 4a). As a control, neither enzyme by itself was able to amplify the target. Of note, human PrimPol was also unable to cooperate with Φ29DNApol in RCA, highlighting the unique primase features of TthPrimPol, which is able to make DNA primers in the presence of Mg2+, a metal needed to achieve faithful DNA synthesis by Φ29DNApol. In addition, WGA from human DNA yielded DNA amounts comparable to RP-mediated amplification (Fig. 4b). The size distribution of the resulting DNA showed a broad high-molecular weight pattern as also seen with RPs+Φ29DNApol (Fig. 4c). Sensitivity of TruePrime for low-input amounts of target DNA was excellent, in the range of femtograms, and superior to the RP-mediated amplification (Fig. 4d). The reaction output of the TthPrimPol/Φ29DNApol combination was shown to be target derived: even at 1 fg input >95% of the sequences could be mapped to the human genome (Supplementary Fig. 2).
TruePrime WGA yields high-quality genomic sequences
We next applied this novel WGA method to the amplification of genomic DNA from single human HEK293 cells isolated by serial dilution, manual picking and visual inspection, and subjected them to the TruePrime protocol (see Methods) for 3 or 6 h reaction time. Yields obtained were ∼6 μg for 3 h and ∼10 μg for 6 h as quantified by Picogreen. Limited sequencing of DNA obtained from four cells was carried out in comparison with DNA isolated from the originating bulk cells (non-amplified, NA). In parallel, single cells were amplified using a commercially available MDA kit (REPLI-g Single Cell Kit, Qiagen, Aarhus, Denmark), or by using our TruePrime protocol, but exchanging TthPrimPol for random hexamers (generic RP-MDA). In addition, we amplified a single HEK293 cell by the MALBAC protocol21 as a hybrid MDA method. Fragment sizes for the amplified DNA were ∼9–19 kb for the commercial RP-MDA, 1.5–12 kb for TruePrime and 0.5–1.5 kb for MALBAC (Supplementary Fig. 3). DNA was sequenced using a paired read strategy (Illumina HiSeq, 125 bp read length).
Comparison of mapping characteristics of these samples was done at exactly 12 million randomly selected read pairs for NA DNA, TruePrime-amplified DNA, the two RP-MDA protocols and MALBAC. We calculated the deviation of the actual fraction of the human genome covered from the theoretically possible fraction covered assuming an ideal Poisson distribution for the successfully mapped reads47. Theoretically expected coverage rates varied because of differing success rates in mapping the 12 million read pairs to the genome (NA DNA 92.03%, TruePrime 86.77%, Commercial random primed MDA 91.50%, Generic random primed MDA 59.07%, MALBACs 89.68%). In addition, we adjusted the expected maximal coverage to the duplicate rate, which was 1.79% in the NA sample, 1.23% in the TruePrime sample, 1.33% in the commercial RP-MDA sample, 9.53% in the generic RP-MDA sample and 0.23% for MALBAC. The deviations from the observed to the maximally expected (Poisson) coverage breadth were 9.17% for NA DNA, 13.15% for TruePrime, 34.83% for the commercial RP-MDA, 30.73% for the generic RP-MDA and 46.89% for MALBAC at this read depth. Visual inspection of the coverage pattern across the genome by a Circos plot (Fig. 5a) as well as by a sliding window view on chromosome 4 (Fig. 5b and Supplementary Fig. 4) highlights the evenness of coverage and the high similarity to the NA material in contrast to the two RP-MDA methods and MALBAC. Graphing read depth frequency also shows the similarity of the TruePrime amplified sample to the NA one (Fig. 5c).
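The Poisson expectation used above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the standard Lander-Waterman form, in which the expected fraction covered at least once is 1 − exp(−c) for a mean depth c, and it assumes that the mapping rate and the duplicate rate simply scale the number of usable bases. The function names, the genome size passed in the usage note and the choice of a relative (rather than absolute) deviation are assumptions made for illustration.

```python
import math

def expected_breadth(read_pairs, read_len, genome_size,
                     mapped_frac=1.0, dup_frac=0.0):
    """Expected fraction of the genome covered >=1x under a Poisson model.

    Assumption: mapped and non-duplicate reads land uniformly at random,
    so mean depth c gives an expected breadth of 1 - exp(-c).
    """
    usable_bases = read_pairs * 2 * read_len * mapped_frac * (1.0 - dup_frac)
    mean_depth = usable_bases / genome_size
    return 1.0 - math.exp(-mean_depth)

def deviation_from_expected(observed, expected):
    """Relative shortfall of the observed breadth versus the expectation."""
    return (expected - observed) / expected
```

For example, `expected_breadth(12_000_000, 125, 3.1e9, mapped_frac=0.8677, dup_frac=0.0123)` would give the model expectation for the TruePrime sample under an assumed genome size of 3.1 Gb; the observed breadth from the alignment is then compared against it with `deviation_from_expected`.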
We studied terminal breadth of coverage in one of the amplified genomic samples (1c), the commercial RP-MDA sample, the NA reference DNA and MALBAC at high sequencing depth. The NA sample reached a genome coverage of 19.19-fold with 91.64% of the human genome (hg19) covered; the TruePrime-amplified sample had a genome coverage of 19.65-fold with a fractional coverage of 91.26%, corresponding to an absolute difference in bases with 0 coverage of 11.7 million (Table 1). In comparison, the commercial RP-MDA method reached 85.57% genome coverage breadth at comparable read depth and MALBAC reached 58.57% coverage breadth. Nucleotide error rates in the reads were similar between the four samples (Table 1).
We also looked at coverage breadth saturation with increasing read input at a minimum coverage of 1 × (Fig. 5d, upper panel) and the deviation from the expected coverage using a Poisson distribution model (Fig. 5d, middle panel). The last panel in Fig. 5d shows the saturation of genome coverage breadth at a minimal coverage depth of 10 × . The TruePrime-amplified sample shows the highest similarity in all analyses to the NA material.
Relative coverage per chromosome is in general similar between the NA sample and TruePrime but with visible exception for some chromosomes that show a relatively lower coverage by TruePrime (for example, 19 and 22; Fig. 5e). We investigated whether this was related to the varying GC content between human chromosomes. The notable difference between chromosomal coverage in the NA sample is due to some other basic bias in the library prep or Illumina sequencing protocol, as the effect of GC content on chromosomal coverage does not reach significance (P=0.12; Supplementary Fig. 5a). In the TruePrime-amplified sample there is a significant effect of GC content on chromosomal coverage (R2=0.38; P=0.0017; Supplementary Fig. 5b). Surprisingly, the behaviour of the commercial MDA (REPLI-g; Qiagen) is identical to this (R2=0.44, P=0.0006; Supplementary Fig. 5c,d), implying that the main driver behind this behaviour is Φ29DNApol, not the priming mechanism. A regression model using both chromosomal GC content and the variation of chromosomal coverage already present in the NA sample fully explains the pattern of chromosomal coverage in both amplified samples with an R2 of 0.94 (P<0.0001 for all effects), meaning that there is an unexplained variance of only 6% in this coverage behaviour. It is however important to note that the chromosomal variation inherent in the sequencing process is the most influential factor for the coverage pattern seen in the amplified samples.
In general, read number frequency in dependence of GC content appears similar between the NA and the TruePrime-amplified sample (Fig. 5f), with the exception of a slight preference of the TruePrime amplification reaction for a GC range of 16–24%. MALBACs showed a right shift of the distribution curve.
We assessed coverage characteristics at a sequencing depth of ∼20 × also by examining k-mer frequency distribution48 (Fig. 6). K-mer frequencies were calculated using jellyfish49. K-mer size was set to 19 as suggested by Kelley et al.50. For the frequency plot shown the y axis was cut at 5% to show the higher frequencies at greater detail. The fraction of K-mers at a frequency of one (unique K-mers) was 34.36% for NA, 32.18% for TruePrime, 49.12% for MALBAC and 45.19% for the commercial RP-MDA protocol; the fraction of K-mers with frequencies of one or two was 36.32% for NA, 36.30% for TruePrime, 60.80% for MALBAC and 52.12% for the commercial RP-MDA protocol. Unique K-mer frequency is thought to be mostly due to nucleotide errors, but can also arise from a high fraction of very low coverage regions, which is possibly the explanation for the higher unique K-mer content in the commercial RP-MDA sample. On the other side, the identical unique K-mer content in the NA and TruePrime-amplified sample supports both the low nucleotide error rate of the amplification method and the even coverage obtained. The bimodal distribution of K-mer frequencies in the human genome is completely lost in the commercial RP-MDA and the MALBAC sample, also most likely to be the effect of coverage inequality, whereas the distribution of the higher frequency k-mers is very similar between NA and TruePrime (peak depth 16 for NA, 13 for TruePrime).
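The k-mer statistics above can be illustrated with a toy counter in pure Python. This is a sketch only and does not reproduce jellyfish: it counts k-mers on the forward strand without canonicalization, skips k-mers containing N, and interprets "fraction of K-mers at a frequency of one" as the share of distinct k-mers seen exactly once, which is an assumption. Both function names are hypothetical.

```python
from collections import Counter

def kmer_spectrum(reads, k=19):
    """Count every k-mer (forward strand only) across a collection of reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers spanning ambiguous bases
                counts[kmer] += 1
    return counts

def fraction_with_frequency(counts, max_freq=1):
    """Fraction of distinct k-mers observed at most max_freq times."""
    if not counts:
        return 0.0
    low = sum(1 for c in counts.values() if c <= max_freq)
    return low / len(counts)
```

Calling `fraction_with_frequency(counts, 1)` and `fraction_with_frequency(counts, 2)` corresponds to the "frequency of one" and "frequencies of one or two" fractions discussed in the text.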
Next, we looked at the reproducibility of the amplification results. Figure 7a shows a Circos plot of genome coverage from NA material (grey) and four single HEK293 cells amplified with TruePrime (blue) (input: exactly five million randomly selected read pairs). The fraction of the genome covered at this read number was 28% for the NA sample and between 26 and 28% for all 4 cells amplified. Cross-correlation between read numbers per 100 kb bin (Fig. 7b) and a sliding window view on chromosome 4 (Fig. 7c) highlight the similarity between the four replicates.
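A binned comparison of this kind can be sketched as follows. The sketch assumes that read start positions are 0-based and that similarity between two cells is summarized by the Pearson correlation coefficient of their per-bin read counts; the text does not state which correlation measure was used, and the function names are hypothetical.

```python
import math

def reads_per_bin(positions, chrom_length, bin_size=100_000):
    """Count mapped read start positions (0-based) in fixed-width bins."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in positions:
        counts[pos // bin_size] += 1
    return counts

def pearson(x, y):
    """Pearson correlation coefficient of two equally long count vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)
```

Applying `pearson` to the binned counts of each pair of amplified cells yields a replicate-by-replicate correlation matrix of the type summarized in Fig. 7b.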
CNV detection in single cells is of particular interest in oncology. HEK293 cells have a partial aneuploidic state51 and therefore CNV alterations should be detectable. Currently, a vast variety of bioinformatic tools are available for CNV detection based on different strategies52,53. We used both FREEC54 and the recently published Ginkgo platform specifically optimized for single-cell CNV detection55. Visual comparison of CNV plots shows that TruePrime much better preserves the chromosomal CNV state than both RP-MDA and MALBAC (Supplementary Figs 6 (FreeC) and 7 (Ginkgo)). For Ginkgo, the median absolute deviation (MAD) of all pairwise differences in read counts between neighbouring bins55 calculated for single HEK293 amplified with TruePrime was ∼0.2, a number close to the MAD value derived for published DOP-PCR amplified genomes, and strikingly different from the MADs from published RP-MDA methods that range between 0.35 and 0.8 (ref. 55).
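The dispersion measure quoted above can be sketched in a few lines. This is an illustration of the definition given in the text (median absolute deviation of differences in read counts between neighbouring bins), not the Ginkgo implementation; dividing the bin counts by their mean before taking differences is an assumption, and the function name is hypothetical.

```python
from statistics import median

def neighbour_mad(bin_counts):
    """MAD of differences between neighbouring bins of mean-normalized counts."""
    if len(bin_counts) < 2:
        raise ValueError("need at least two bins")
    mean = sum(bin_counts) / len(bin_counts)
    norm = [c / mean for c in bin_counts]           # assumed normalization
    diffs = [b - a for a, b in zip(norm, norm[1:])]  # neighbouring-bin differences
    centre = median(diffs)
    return median([abs(d - centre) for d in diffs])
```

A perfectly flat read-count profile gives a value of 0, and noisier bin-to-bin coverage gives larger values, which is why a lower MAD indicates more even amplification.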
Chimera formation is thought to be a problem in MDA potentially arising by strand switching during the displacement process56. We estimated the number of chimeras formed during the amplification process as the increase in broken read pairs due to wrong distance or mate inversion in the amplified samples relative to the NA sample. Although the NA sample had 2.5% broken read pairs of this nature, the TruePrime-amplified samples showed between 3.9 and 6.5%, suggesting that there is an increase in chimeras generated by the TruePrime process in the range of 2–3% over all read pairs. The commercial RP-MDA protocol showed a fraction of broken read pairs due to wrong distance or mate inversion (5%) similar to that of the TruePrime samples, implying a similar increase in chimeras and suggesting that the priming process has little influence on the occurrence of chimeras in Φ29DNApol-mediated DNA amplification protocols.
For SNV calling, we used four different SNV callers due to high inter-caller variability57,58 (Supplementary Table 1). With the exception of samtools/bcftools, all callers detected similar numbers of SNVs with a median of 3.0 Mio SNVs for the NA and 2.7 Mio SNVs for the TruePrime-amplified cell. The overlap between SNVs in the two samples was 2.4 Mio, equivalent to 81% of the SNVs found in the NA sample (Supplementary Table 1). In contrast, in the cell amplified by the commercial RP-MDA method, only 1.6 Mio SNVs were detected of which 1.4 Mio overlapped with SNVs detected in the NA sample (45% of all NA SNVs; Supplementary Table 1). This was even lower in MALBAC, where only 30% of the SNVs detected in the NA sample were recovered.
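The overlap figures above amount to a set intersection between two call sets. A minimal sketch, assuming each SNV is represented as a (chromosome, position, reference allele, alternative allele) tuple; this representation and the function name are illustrative and are not taken from the callers' output formats.

```python
def snv_concordance(reference_snvs, amplified_snvs):
    """Return the shared SNV count and the fraction of reference SNVs recovered."""
    ref = set(reference_snvs)
    amp = set(amplified_snvs)
    shared = ref & amp
    recovered = len(shared) / len(ref) if ref else 0.0
    return len(shared), recovered
```

With the NA call set as `reference_snvs` and a single-cell call set as `amplified_snvs`, the second return value corresponds to the recovered fraction reported in the text (for example, 81% for TruePrime).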
A major question with WGA methods is the so-called ADO rate, meaning the fraction of heterozygous SNVs that are lost due to exclusive or predominant amplification of only one allele. A way to estimate this number is to establish the heterozygous SNVs in the NA sample and determine the fraction that is called as homozygous SNVs in the amplified sample. The overall number of SNVs that were detected as heterozygous in the NA sample and homozygous in the TruePrime-amplified sample (1c) ranged from 0.73% (Isaac SNV caller) to 8.42% (Varscan2) with a median of 5.95% across the whole genome (Supplementary Table 1), suggesting an ADO of ∼1.45–15.5% with a median of 11.23% (AB->AA plus the non-observed AB->BB). Interestingly, there was a considerable degree of variation in the apparent conversion rate among different chromosomes (Fig. 8). In contrast, the commercial RP-MDA method showed a very high estimated ADO rate of 45.74% (median of the 4 callers) similar to the MALBAC method (47.22% in the median; Supplementary Table 1).
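The step from the observed conversion rate to the estimated ADO can be made explicit. The relation below is inferred from the numbers in this paragraph and is not a formula stated in the text: if a fraction x of the heterozygous NA SNVs that are still called in the amplified sample appear as homozygous (AB->AA), and an equal number of AB->BB events go unobserved, then the dropout fraction among all heterozygous sites is 2x/(1 + x). The function name is hypothetical.

```python
def estimated_ado(het_to_hom_frac):
    """Estimate ADO from the observed AB->AA fraction x as 2x / (1 + x).

    Assumption: unobserved AB->BB events are as frequent as observed AB->AA
    events, so they are added to both the dropouts and the total site count.
    """
    x = het_to_hom_frac
    return 2 * x / (1 + x)

# Observed conversion fractions quoted in the text
for observed in (0.0073, 0.0595, 0.0842):
    print(f"{observed:.2%} observed -> {estimated_ado(observed):.2%} estimated ADO")
```

Under this assumption the observed values of 0.73%, 5.95% and 8.42% map to about 1.45%, 11.23% and 15.5%, matching the range and median reported above.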
Related to the question of ADOs is the issue of false positives that could be generated during the amplification process. The TruePrime false positive rate (FPR) for the SNVs based on the overlap with the NA sample was around 1% for three of the callers and 3.66% for samtools/bcftools (Supplementary Table 1). The RP-MDA FPR was similar to that (Supplementary Table 1). MALBACs showed the highest FPR with 5.9%. Another possibility to detect generation of false positives is to determine the conversion of homozygote alleles to heterozygotes for a haploid chromosome (#18) in our HEK293 cell line. This ensures that the homozygote calls from the NA DNA are true positives. The rates here were 0.13–1.29% for TruePrime, depending on the caller, and similar in the RP-MDA sample (Supplementary Table 1).
Here we have cloned, characterized and put into technical use a novel PrimPol, TthPrimPol. To our knowledge, this is the first instance that this class of enzymes has been made available for biotechnological applications.
We have exploited several unique features of TthPrimPol for its cooperation with Φ29DNApol, to enable successful WGA. First, the ability to use dNTPs and synthesize DNA primers makes it possible for the enzyme to work without addition of NTPs to the reaction, which would alter DNA polymerase characteristics of Φ29DNApol and would generate RNA/DNA chimeric molecules with possible disadvantageous consequences for downstream enzymatic manipulation for library construction and so on. Second, TthPrimPol works with Mg2+ as the only metal ion and does not need Mn2+, which would interfere with the high-fidelity DNA synthesis by Φ29DNApol. Third, the primase function of TthPrimPol synthesizes DNA primers of 7–9 nucleotides length in a processive mode, but then switches to a distributive mode for its polymerase function, enabling the highly processive Φ29DNApol to take over elongation of those primers. This unique compatibility of the two enzymes thus enables the replacement of the error-prone polymerase function of TthPrimPol with the high-fidelity Φ29DNApol for WGA (Fig. 9).
We find that TruePrime has an exquisite breadth of coverage, which approximates that of NA DNA (91.26% at ∼19 × coverage). Breadth of coverage is a known strength of MDA-based protocols including the hybrid MALBAC method5,21,59 as opposed to purely PCR-based methods (DOP-PCR has only ∼10% coverage breadth3,60) and reaches ranges of over 90% genome coverage61. In our hands, the commercial RP-MDA protocol gave a coverage breadth of ∼86%, whereas MALBAC reached 59%. Together with the inherently high fidelity of Φ29DNApol of about 10⁻⁷ (ref. 18) and the high evenness of coverage, this lays ideal foundations for high-quality SNV calling throughout the genome of single cells. Indeed, we report an 80.6% concordance of SNVs called in NA and amplified samples in the range of expected SNV numbers. Our estimate for the ADO number appears low (estimated at 11.23% in the median of four SNV callers and as low as 1.45% in the Isaac caller). Numbers in the literature for MDA methods for single-cell genome amplification vary widely between 4 and 50% (refs 62, 63, 64), and we find a high estimated ADO for both RP-MDA and MALBAC (>45%). It is important to note that estimation of the ADO using the heterozygote to homozygote conversion is also subject to the variant caller used and the read depth at each SNV locus. We attempted to address this issue by using four different callers and reporting detailed output parameters. The much higher ADO in the RP-MDA and MALBAC sample may be partially due to the lower evenness in coverage, which results in more loci having lower coverage and therefore having a higher chance of missing one allele.
A weakness of the MDA group of methods despite the superior breadth of genome coverage is CNV detection, in particular for methods relying on read-depth counting. This is due to inequality bias in amplifying different genomic regions. TruePrime has considerably less amplification bias than random primer-based protocols and reaches the coverage dispersion characteristics of DOP-PCR amplification experiments from single cells. Consequently, this allows for improved CNV detection accuracy, thus improving one major weakness of Φ29DNApol/MDA-based protocols so far.
Another reported problem with MDA methods concerns chimera formation occurring by strand switching during strand displacement. We find that the excess of broken read pairs probably due to chimeras amounts to 2–3% of all read pairs. In summary, TruePrime presents an important improvement to Φ29DNApol-mediated amplification of single-cell genomes. We believe that this method will contribute greatly to the accessibility of genomic information from single cells.
Computational modelling of TthPrimPol 3D structure
The 3D structure of TthPrimPol was modelled using as template the crystallographic structure of the DNA primase/polymerase domain of ORF904 from the archaeal plasmid pRN1 (PDB ID:3M1M and 1RNI). TthPrimPol amino acid residues 4 to 166 were modelled with the Phyre2 (ref. 59) online server using 3M1M as template; TthPrimPol amino acid residues 167 to 208 were modelled with the DeepView Project Mode of Swiss Model65 online server using 1RNI as template. DNA template and primer strands, metals and incoming nucleotide were modelled using two crystal structures of the polymerase domain (PolDom) of M. tuberculosis ligase D (PDB ID:4MKY for template/primer and PDB ID:3PKY, for metals and incoming nucleotide), which were fitted to the TthPrimPol model by using the three invariant catalytic aspartates (motifs A and C) and the invariant histidine (at motif B) as reference coordinates. The image depicted in Fig. 1b was created with the PyMol Molecular Graphics System (version 1.2r3pre, Schrödinger, LLC), omitting the amino acid sequence information from the LigD PolDom crystals.
Cloning of TthPrimPol
Sequence analysis of the T. thermophilus HB27 genome (DDBJ/EMBL/GenBank AE017221.1; GI:46197919) revealed the ORF TTC0656, encoding a protein that belongs to the AEP superfamily. Using this sequence information, we synthesized two primers (5′-ccggcccatatgaggccgattgagcacgccc-3′ and 5′-gcgcgcgaattctcatacccacctcctcatccggg-3′) for amplification of the TthPrimPol gene by PCR from T. thermophilus genomic DNA. The gene fragment amplified by PCR using Expand High Fidelity polymerase (Roche Diagnostics, Mannheim, Germany) was ligated into the pGEM T-easy vector (Promega, Madison, WI, USA) by TA cloning and confirmed by sequencing. Using the NdeI and EcoRI sites, the fragment bearing the target gene was ligated into pET28 vector (Novagen, Merck-Millipore, Billerica, MA, USA), allowing the expression of TthPrimPol fused with a multifunctional leader peptide containing a hexahistidyl sequence for purification on Ni2+-affinity resins.
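The restriction sites used for subcloning can be located in the two primers with a short check. The sketch relies on the standard recognition sequences of NdeI (CATATG) and EcoRI (GAATTC); the helper name and the dictionary are illustrative.

```python
# Standard recognition sequences of the two enzymes used for subcloning
SITES = {"NdeI": "CATATG", "EcoRI": "GAATTC"}

def find_sites(primer, sites=SITES):
    """Return {enzyme: 0-based index} for each recognition site found in the primer."""
    seq = primer.upper()
    return {name: seq.find(motif) for name, motif in sites.items() if motif in seq}

forward = "ccggcccatatgaggccgattgagcacgccc"
reverse = "gcgcgcgaattctcatacccacctcctcatccggg"

print(find_sites(forward))  # {'NdeI': 6}
print(find_sites(reverse))  # {'EcoRI': 6}
```

In both primers the recognition site follows a six-nucleotide 5′ overhang, with the NdeI site in the forward primer and the EcoRI site in the reverse primer.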
Expression of TthPrimPol was carried out in the E. coli strain BL21-CodonPlus (DE3)-RIL (Stratagene), with extra copies of the argU, ileY and leuW transfer RNA genes. Expression of TthPrimPol was induced by the addition of 1 mM isopropyl-β-D-thiogalactoside to 1.5 l of log phase E. coli cells grown at 30 °C in lysogeny broth (LB) to an Abs600 nm of 0.5. After induction, cells were incubated at 30 °C for 5 h. Subsequently, the cultured cells were harvested and the pelleted cells were weighed and frozen (−20 °C). Just before purification, which was carried out at 4 °C, frozen cells (3.5 g) were thawed and resuspended in 20 ml buffer A (50 mM Tris-HCl pH 7.5, 5% glycerol, 0.5 mM EDTA and 1 mM dithiothreitol (DTT)) supplemented with 1 M NaCl, 0.25% Tween-20 and 30 mM imidazole, and then disrupted by sonication on ice. Cell debris and insoluble material were discarded after a 50 min centrifugation at 40,000 g. The supernatant was loaded into a HisTrap crude FF column (5 ml, GE Healthcare) equilibrated previously in buffer A supplemented with 1 M NaCl, 0.25% Tween-20 and 30 mM imidazole. After exhaustive washing with buffer A supplemented with 1 M NaCl, 0.25% Tween-20 and 30 mM imidazole, proteins were eluted with a linear gradient of 30–250 mM imidazole. The eluate containing TthPrimPol was diluted with buffer A supplemented with 0.25% Tween-20 to a final 0.1 M NaCl concentration and loaded into a HiTrap Heparin HP column (5 ml, GE Healthcare), equilibrated previously in buffer A supplemented with 0.1 M NaCl and 0.25% Tween-20. The column was washed and the protein eluted with buffer A supplemented with 1 M NaCl and 0.25% Tween-20. This fraction contains highly purified (>99%) TthPrimPol. Protein concentration was estimated by densitometry of Coomassie Blue-stained 10% SDS–polyacrylamide gels, using standards of known concentration. The final fraction, adjusted to 50% (v/v) glycerol, was stored at −80 °C.
3′-GTCC-5′ oligonucleotide (1 μM) or its variant XTCC oligonucleotides (1 μM) or M13mp18 single-stranded DNA (ssDNA) (20 ng μl−1) were used as alternative templates to assay primase activity. The reaction mixtures (20 μl) contained 50 mM Tris-HCl pH 7.5, 75 mM NaCl, 5 mM MgCl2 or 1 mM MnCl2, 1 mM DTT, 2.5% glycerol, 0.1 mg ml−1 BSA, [α-32P] dATP (16 nM; 3,000 Ci mmol−1) or [γ-32P] ATP (16 nM; 3,000 Ci mmol−1), the indicated amounts of each dNTP or NTP, in the presence of TthPrimPol (400 nM). After 60 min at either 55 °C or 30 °C, as indicated, reactions were stopped by addition of formamide loading buffer (10 mM EDTA, 95% v/v formamide and 0.3% w/v xylene cyanol). Reactions were loaded in 8 M urea-containing 20% polyacrylamide sequencing gels. After electrophoresis, de novo synthesized polynucleotides (primers) were detected by autoradiography.
To evaluate the processivity of primer synthesis by TthPrimPol, we used heparin as a competitor (Fig. 3c). TthPrimPol (10 nM) was pre-incubated for 5 min on ice in the previously described reaction buffer, in either the absence or presence of heparin (1 ng μl−1). Subsequently, the reaction was supplemented with M13mp18 ssDNA (5 ng μl−1), dATP, dCTP and dTTP (10 μM each), [α-32P] dGTP (16 nM; 3,000 Ci mmol−1) and heparin (1 ng μl−1) when indicated, and the incubation was maintained for 10 min at 30 °C and processed as described.
A ‘Pulse and Chase’ experiment was designed to analyse the extension by Φ29DNApol of the primers synthesized by TthPrimPol in two consecutive stages (pulse and chase), as indicated in Fig. 3d. During the pulse, the reaction mixtures (20 μl; 50 mM Tris-HCl pH 7.5, 75 mM NaCl, 10 mM MgCl2, 1 mM DTT, 2.5% glycerol and 0.1 mg ml−1 BSA) containing decreasing concentrations of TthPrimPol (100, 25 and 6.25 nM), [α-32P] dGTP (16 nM), dATP+dCTP+dTTP (1 μM) and 5 ng μl−1 M13mp18 ssDNA were incubated at 30 °C for 20 min. Half of the reaction was analysed as described for the primase assays. During the chase, the second half of the reaction was supplemented with Φ29DNApol (50 nM) and the four unlabelled dNTPs (10 μM), to allow primer extension for another 20 min at 30 °C. Then, the samples were processed as described.
Rolling circle amplification
M13mp18 circular ssDNA was used as input for the TruePrime RCA kit workflow. Briefly, DNA (2.5 μl; 40 fg μl−1) was first denatured by adding 2.5 μl of alkaline buffer D and incubated 3 min at room temperature. The samples were then neutralized by adding 2.5 μl of buffer N. The amplification mix containing 9.3 μl of H2O, 2.5 μl of reaction buffer, 2.5 μl of dNTPs, 2.5 μl of Enzyme 1 (TthPrimPol) and 0.7 μl of Enzyme 2 (Φ29DNApol) was added to the DNA samples, resulting in a final reaction volume of 25 μl. When indicated, TthPrimPol was replaced by HsPrimPol or random synthetic primers (50 μM). Reaction mixtures were incubated for 3 h at 30 °C and Φ29DNApol was inactivated for 10 min at 65 °C, to avoid degradation of the amplification products. Amplified DNA was quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen, Life Technologies, Carlsbad, CA, USA) following the recommendations of the manufacturer. Briefly, samples were diluted 1:1,000 in 1 × TE and, in parallel, a DNA standard using human genomic DNA (Roche) with 1.6, 0.8, 0.4, 0.2 and 0.1 μg ml−1 was prepared. Twenty microlitres of the sample or DNA standard were transferred into a 96-well plate and 20 μl of PicoGreen working solution (PicoGreen stock solution 1:150 diluted) was added. After gently shaking the 96-well plate, fluorescence was measured in a Fluostar Microplate Reader (BMG Labtech; excitation: 485 nm and emission: 520 nm). For measurements, duplicates for each sample and DNA standard were performed, and DNA concentration was determined from the human genomic DNA standard curve.
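The PicoGreen quantification described above amounts to fitting a standard curve and reading sample concentrations off it. The Python sketch below illustrates one way to do this, assuming a linear fluorescence response over the standard range and that duplicate wells have already been averaged. The fluorescence readings are made-up values, and the function names are hypothetical rather than part of any kit software.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sample_concentration(fluorescence, slope, intercept, dilution_factor=1000):
    """Concentration of the undiluted sample (ug/ml) from one fluorescence reading."""
    return (fluorescence - intercept) / slope * dilution_factor


# Standard concentrations from the text (ug/ml).
standards = [1.6, 0.8, 0.4, 0.2, 0.1]
# Hypothetical mean readings (arbitrary units), NOT measured data.
readings = [16100.0, 8100.0, 4100.0, 2100.0, 1100.0]

slope, intercept = fit_line(standards, readings)
conc = sample_concentration(5100.0, slope, intercept)
print(f"estimated sample concentration: {conc:.1f} ug/ml")
```

The default dilution factor of 1,000 mirrors the 1:1,000 dilution in 1 × TE; samples and standards are both mixed 1:1 with the PicoGreen working solution, so that step cancels out.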
Whole genome amplification
Six picograms (Fig. 4, part b) or different doses ranging from 1 ng to 100 ag (Fig. 4, part d) of human genomic DNA (Promega) were used as input in the reactions. Input DNA was subjected to the TruePrime WGA kit workflow. Briefly, DNA (2.5 μl) was first denatured by adding 2.5 μl of buffer D and incubating 3 min at room temperature. The samples were then neutralized by adding 2.5 μl of buffer N. The amplification mix containing 26.8 μl of H2O, 5 μl of reaction buffer, 5 μl of dNTPs, 5 μl of Enzyme 1 (TthPrimPol) and 0.7 μl of Enzyme 2 (Φ29DNApol) was added to the DNA samples, resulting in a final reaction volume of 50 μl. When indicated, TthPrimPol was replaced by the same concentration of HsPrimPol or random synthetic primers (50 μM). Reaction mixtures were incubated for 3 h at 30 °C and Φ29DNApol was inactivated for 10 min at 65 °C, to avoid degradation of the amplification products. Amplified DNA was quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen, Life Technologies).
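The component volumes listed for the RCA and WGA workflows can be checked against the stated final reaction volumes with a few lines of Python. The volumes are copied from the text; the dictionary and function names are illustrative only.

```python
# Volumes in microlitres, copied from the RCA and WGA protocol descriptions.
RCA_MIX = {
    "DNA": 2.5, "buffer D": 2.5, "buffer N": 2.5, "H2O": 9.3,
    "reaction buffer": 2.5, "dNTPs": 2.5,
    "Enzyme 1 (TthPrimPol)": 2.5, "Enzyme 2 (Phi29DNApol)": 0.7,
}
WGA_MIX = {
    "DNA": 2.5, "buffer D": 2.5, "buffer N": 2.5, "H2O": 26.8,
    "reaction buffer": 5.0, "dNTPs": 5.0,
    "Enzyme 1 (TthPrimPol)": 5.0, "Enzyme 2 (Phi29DNApol)": 0.7,
}


def total_volume(mix):
    """Sum all component volumes, rounded to 0.1 ul to absorb float noise."""
    return round(sum(mix.values()), 1)


print("RCA total (ul):", total_volume(RCA_MIX))
print("WGA total (ul):", total_volume(WGA_MIX))
```

Both sums agree with the final volumes given in the text (25 μl for RCA and 50 μl for WGA).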
We chose HEK293 cells for testing WGA protocols due to their partially aneuploid state51. Remark: HEK293 cells are listed as potentially contaminated with HeLa cells in ICLAC (http://iclac.org/wp-content/uploads/Cross-Contaminations-v7_2.pdf) based on a publication in 1981. The cells analysed by us are clearly HEK293 cells based on their genomic sequence and CNV profile. The cell line was obtained in 2010 from the DSMZ (Leibniz-Institute German Collection of Microorganisms and Cell Cultures; DSMZ number: ACC305). The cell line is regularly checked for mycoplasma contamination by a PCR assay (primer A: 5′-ggc gaa tgg gtg agt aac acg-3′ and primer B: 5′-cgg ata acg ctt gcg acc tat-3′).
HEK293 cells were washed with 1 × PBS, followed by incubation with Trypsin-EDTA solution (Gibco). After resuspending cells with culture medium, they were spun down and washed again with 1 × PBS. After preparing three serial dilutions of cells in 1 × PBS, the cells were counted and diluted to a final concentration of 1 cell per 2.5 μl 1 × PBS. This volume was dispensed into clear-well plates and visually inspected. The TruePrime Single Cell WGA kit workflow was followed to amplify the genomic DNA of each cell. Briefly, 2.5 μl of lysis buffer L2 were added, followed by incubation for 10 min on ice. To neutralize the lysis buffer, 2.5 μl of neutralization buffer N were added. The amplification mix containing 26.8 μl of H2O, 5 μl of reaction buffer, 5 μl of dNTPs, 5 μl of Enzyme 1 (TthPrimPol) and 0.7 μl of Enzyme 2 (Φ29DNApol) was added to the neutralized samples, resulting in a final reaction volume of 50 μl. The reaction mixtures were incubated for 3 or 6 h at 30 °C, followed by inactivation of Φ29DNApol for 10 min at 65 °C. To obtain non-amplified (NA) reference DNA, genomic DNA was extracted from HEK293 cells using QIAamp genomic DNA extraction kits (Qiagen). DNA concentration was determined by the Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen, Life Technologies). For amplification with the REPLI-g single-cell kit (Qiagen; commercial random primed MDA) and with the MALBAC protocol Single Cell WGA kit (Yikon Genomics), instructions of the manufacturers were followed.
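The need for visual inspection of each well follows from the statistics of limiting dilution. As an illustration, and assuming that the number of cells per 2.5 μl aliquot follows a Poisson distribution with a mean of one cell (an idealization not stated in the protocol), the expected well occupancy can be sketched in Python as follows.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k cells when the mean is `mean` cells per well."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


MEAN_CELLS_PER_WELL = 1.0  # 1 cell per 2.5 ul, as in the text

p_empty = poisson_pmf(0, MEAN_CELLS_PER_WELL)
p_single = poisson_pmf(1, MEAN_CELLS_PER_WELL)
p_multiple = 1.0 - p_empty - p_single

print(f"empty wells:       {p_empty:.3f}")
print(f"single-cell wells: {p_single:.3f}")
print(f"multi-cell wells:  {p_multiple:.3f}")
```

Under this assumption only a little over one third of the wells would be expected to contain exactly one cell, which is why wells are inspected before lysis.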
After amplification, DNA was precipitated by ethanol precipitation. DNA fragmentation (Covaris), library preparation using NebNext (NEB) and paired end sequencing (HiSeq 2500, v4 chemistry, HiSeq Control Software, Version 2.2.58, Real-Time Analysis, Version 1.18.64 and Sequence Analysis Viewer, Version 1.8.46, Casava Version 1.8.2) were performed at GATC Biotech, Konstanz, Germany. FastQ files were obtained and further processed.
Bioinformatic and statistical analyses
Quality assessment and mapping
CLC Genomics Workbench Version 8.5 (Qiagen) was used for main analyses of NGS data sets (alignment and mapping parameters). Illumina BaseSpace FASTQC v1.0.0 was used for sequencing quality assessment and GC content dependency calculations. Circos plots were generated using the Circos framework66.
Saturation of coverage breadth
For each 20 × sample, a calculated number of aligned reads was selected from the BAM files with the samtools view function to produce the desired read depth (0.1 ×, 0.25 ×, 0.5 ×, 0.75 ×, 1 ×, 2 ×, 3 ×, 4 ×, 5 ×, 10 × and 15 ×). The fraction of single read coverage and the fraction of tenfold coverage were calculated with the bedtools genomeCoverageBed function.
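The read fraction needed for each target depth is simply the target depth divided by the source depth. The Python sketch below derives that fraction and formats it for the `-s SEED.FRACTION` subsampling option of samtools view. Whether the authors used `-s` or another selection method is not stated, so this is an assumption, as are the nominal source depth of 20 ×, the seed and the file names.

```python
TARGET_DEPTHS = [0.1, 0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 10, 15]


def subsample_arg(target_depth, source_depth=20.0, seed=42):
    """Format the samtools `-s` argument: integer part = seed, decimals = read fraction."""
    fraction = target_depth / source_depth
    if not 0 < fraction < 1:
        raise ValueError("target depth must be below the source depth")
    # "0.0500" -> ".0500", then prefix the seed to obtain e.g. "42.0500"
    return f"{seed}{fraction:.4f}"[:len(str(seed))] + f"{fraction:.4f}"[1:]


for depth in TARGET_DEPTHS:
    # in.bam and the output names are hypothetical placeholders
    print(f"samtools view -b -s {subsample_arg(depth)} in.bam > sub_{depth}x.bam")
```

Because subsampling with `-s` is random, the realized depth only approximates the target; using the measured depth of each sample (about 19–20 ×) instead of the nominal value would tighten the estimate.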
SNV calling and analyses
All SNV callers were applied to the same aligned sequence data set (BAM files aligned to hg19 with an overall coverage (mapped reads) of ∼19–20 × of the human genome).
CLC Genomics Workbench 9.0 low-frequency variant detection caller67 was used with the following stringent settings (required significance (%)=1.0, ignore positions with coverage above=1,000, restrict calling to target regions=not set, ignore broken pairs=yes, ignore nonspecific matches=reads, minimum coverage=10, minimum count=3, minimum frequency (%)=5.0, base quality filter=yes, neighbourhood radius=5, minimum central quality=20, minimum neighbourhood quality=15, read direction filter=yes, direction frequency (%)=10.0, relative read direction filter=yes, significance (%)=1.0, read position filter=yes, significance (%)=1.0, remove pyro-error variants=no).
Samtools 1.3 (ref. 68) mpileup/bcftools with htslib 1.3.1 was used with the -E and -uf settings. Bcftools 1.3.1 was used with -cv, -Ov and --ploidy=GRCh37.
VarScan2 v2.4.1 (ref. 69) was used with standard settings.
Isaac Variant Caller 1.0.7 (ref. 70) was used with the (standard) settings: isSkipDepthFilters=0, maxInputDepth=10,000, depthFilterMultiple=3.0, indelMaxRefRepeat=−1, minMapq=20, minGQX=30, isWriteRealignedBam=0, binSize=25000000CLC. The Isaac variant caller uses the GATK Unified Genotyper followed by filtering with the variant quality score recalibration (VQSR) protocol71.
SNV intersection analyses were done with Illumina BaseSpace VCAT (Variant Calling Assessment Tool v188.8.131.52) and Illumina VariantStudio v2.2.4 (https://basespace.illumina.com/home/index). The ADO rate was estimated over the whole genome using R v3.3.0: we counted the variants in the overlapping set of SNVs that were heterozygous in the NA sample and homozygous in the amplified samples, and assumed that an equal number of heterozygous-to-homozygous conversions was missing from the overlap set because of reversion to the reference allele.
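The ADO estimate described above can be illustrated with a short sketch. The original analysis was done in R, and the text does not spell out the denominator; the Python version below assumes that the doubled count of observed conversions is divided by the number of heterozygous SNVs in the NA sample. That denominator, the function name and the example counts are our assumptions, not the authors' stated procedure.

```python
def estimated_ado_rate(n_het_to_hom, n_het_in_na):
    """Estimate the allelic dropout rate.

    n_het_to_hom: SNVs heterozygous in the NA sample and homozygous in the
                  amplified sample (observed in the overlap set).
    n_het_in_na:  heterozygous SNVs in the NA sample (ASSUMED denominator).
    """
    if n_het_in_na <= 0:
        raise ValueError("need at least one heterozygous reference SNV")
    # Observed conversions are doubled: an equal number is assumed to have
    # reverted to the reference allele and therefore left the overlap set.
    return 2 * n_het_to_hom / n_het_in_na


# Hypothetical counts, for illustration only.
print(estimated_ado_rate(n_het_to_hom=50, n_het_in_na=1000))
```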
SNV false positive rates
To estimate the false positive rate (FPR) for each SNV caller, all SNVs called in the NA sample were assumed to be true SNVs. Thus:
Position based recall and precision (calculated by V-CAT)
Recall and precision were calculated by the VCF Gold Standard Comparison with NIST Genome in a Bottle integrated calls v0.2.
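For reference, recall and precision follow the standard definitions based on true positives (TP), false negatives (FN) and false positives (FP). The Python sketch below shows these definitions with made-up counts; it is not a reimplementation of VCAT, whose internal handling of positions may differ.

```python
def recall(tp, fn):
    """Fraction of gold-standard variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of called variants that are in the gold standard: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts, for illustration only.
tp, fn, fp = 90, 10, 30
print(f"recall:    {recall(tp, fn):.2f}")
print(f"precision: {precision(tp, fp):.2f}")
```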
K-mer frequency analyses
K-mers were calculated with jellyfish49. K-mer size was set to 19 (ref. 50). Genome size $G$ was estimated from the k-mer counts using the bin sizes $B_i$, the corresponding frequencies $F_i$ and the peak depth $D$ estimated from the k-mer distribution: $G = \sum_i B_i F_i / D$. The single-copy region was extracted visually as the first peak after the initial error peak at bin size 1.
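A common way to turn a k-mer histogram into a genome size estimate is to divide the total number of k-mers (the sum of bin size times frequency) by the peak depth of the single-copy region. The Python sketch below assumes this relation and a histogram given as (bin size, frequency) pairs, as produced by typical k-mer counters. The histogram values are made up, the peak depth is passed in by hand because it was read off visually, and the optional `min_bin` cut-off for excluding the error peak is our addition.

```python
def genome_size(histogram, peak_depth, min_bin=1):
    """Estimate genome size as sum(B * F) / peak depth.

    histogram:  iterable of (bin_size, frequency) pairs.
    peak_depth: depth of the single-copy peak (determined visually in the text).
    min_bin:    smallest bin to include (ASSUMPTION; set to 2 to drop the
                error peak at bin size 1).
    """
    if peak_depth <= 0:
        raise ValueError("peak depth must be positive")
    total_kmers = sum(b * f for b, f in histogram if b >= min_bin)
    return total_kmers / peak_depth


# Hypothetical toy histogram: error peak at bin 1, single-copy peak at bin 10.
toy_histogram = [(1, 500), (9, 50), (10, 100), (11, 50), (20, 10)]

print(genome_size(toy_histogram, peak_depth=10))
print(genome_size(toy_histogram, peak_depth=10, min_bin=2))
```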
Additional statistical analyses were done using JMP 12 (SAS Institute, Heidelberg, Germany) or R v3.3.0/RStudio v0.99.902. A P-value <0.05 was considered significant.
The data sets generated and analysed during the current study are available in the GenBank Sequence Read Archive, accession number SRP085855.
How to cite this article: Picher, A. J. et al. TruePrime is a novel method for whole-genome amplification from single cells based on TthPrimPol. Nat. Commun. 7, 13296 doi: 10.1038/ncomms13296 (2016).
We thank Patricia Garrido, Frank Herzog, Gisela Eisenhardt, Patricia Rebollo and Clara López for expert technical help. We thank Margarita Salas for intense discussions. L.B. was funded by the Spanish Ministry of Economy and Competitiveness (BFU2012–37969 and CSD2007–00015), and by Comunidad de Madrid (S2011/BMD-2361). S.G.-G. was the recipient of a fellowship from the Spanish Ministry of Economy and Competitiveness.