Plasmid DNA contaminant in molecular reagents

Background noise in metagenomic studies is often considerable, and its removal requires extensive post-analytic bioinformatic filtering. This is relevant because significant signals may be lost when the signal-to-noise ratio is low. The presence of plasmid residues, which are frequently present in reagents as contaminants, has not been investigated so far but may introduce substantial bias. Here we show that plasmid sequences from different sources are omnipresent in molecular biology reagents. Using a metagenomic approach, we identified the polymerase (pol) gene of equine infectious anemia virus in human samples and traced it back to the expression plasmid used for generation of a commercial reverse transcriptase. We found fragments of multiple other expression plasmids in human samples as well as in commercial polymerase preparations. Sources of plasmid contamination included the production chain of molecular biology reagents as well as contamination of reagents from the environment or from human handling of samples and reagents. Retrospective analyses of published metagenomic studies revealed inaccurate signal-to-noise differentiation. Hence, the plasmid sequences that seem to be omnipresent in molecular biology reagents may misguide conclusions derived from genomic and metagenomic data sets and thus also clinical interpretations. Critical appraisal of metagenomic data sets for the possibility of plasmid background noise is required to identify reliable and significant signals.


Results
Equine infectious anemia virus pol sequences are derived from extrinsic plasmids. In a previous study, we detected contigs containing the polymerase (pol) gene of the retrovirus equine infectious anemia virus (EIAV) in all evaluated human samples from healthy volunteers (n = 4) 48 . EIAV is a retrovirus that infects Equidae; it has not been reported to infect humans or to cause zoonotic disease so far 49 . A phylogenetic analysis relating the detected sequences to the pol genes of other lentiviruses, namely Human Immunodeficiency Virus-1 (HIV-1; NC_001802.1), Feline Immunodeficiency Virus (FIV; NC_001482.1) and Maedi/Visna virus strain kv1772 (NC_001452.1), showed a high similarity of the detected sequences with the pol gene of the EIAV clone CL 22 strain (ID: M87581.1; Fig. 1). Further alignment showed no genetic variation among the pol sequences we found, which is highly unusual for retroviruses given their high mutation rates. Only in comparison with the standard strain EIAV Wyoming was a small number of nucleotide differences identified.
All fragments found corresponded to only a part of the pol gene of EIAV reference strains (1.667 kb). Furthermore, the pol sequences identified were flanked by a CmR sequence (chloramphenicol acetyltransferase; ID: EDS05563.1) and, in the case of the longest contig available, by an additional bla TEM-1 resistance-encoding sequence (ID: WP_000027050.1; Fig. 2A). Further assembly of the sequences flanking EIAV pol revealed additional elements indicative of an expression vector, including a histidine tag, a ribosomal binding site (RBS), a lac operator, a T5 promoter, a lambda t0 terminator and an rrnB T1 terminator (Fig. 2B).
To validate the presence of a vector and to identify the source of contamination, we tested all laboratory consumables and clinical samples used previously by Thannesberger et al. with a PCR assay specific for the EIAV pol sequences found. Surprisingly, all of these samples were negative for EIAV pol sequences (Fig. 3A). To exclude the presence of an RNA template of the EIAV pol sequences, the samples were tested again after reverse transcription with the Omniscript RT Kit (Qiagen, Hilden, Germany). All reverse-transcribed samples then tested positive for EIAV pol sequences, including the non-template control of the reaction mix (Fig. 3B). We therefore suspected that the RT kit used (Omniscript RT Kit) was the source of EIAV pol. To test this hypothesis, we treated all of these samples with a different reverse transcriptase (iScript cDNA Synthesis Kit, Bio-Rad, California, USA) and repeated the same experiment. These experiments yielded uniformly negative test results (data not shown), which further indicates that the Omniscript RT Kit was the source of the EIAV pol sequences.
In order to quantify the overall genomic background noise present during the virome testing procedure, we designed a qPCR specific for the CmR resistance gene found frequently in the EIAV contigs. Three different time steps, reflecting the enzymatic treatments incorporated in the standard workflow of the VIPEP method, were tested and designated T0, T1 and T2. T0 contained the reverse transcription mix (Omniscript RT Kit) without reverse transcription having been performed, T1 was sampled after reverse transcription, and T2 was sampled after a multiple displacement amplification (MDA) of 1 µl T1 with the REPLI-g Mini Kit (Qiagen, Hilden, Germany). The plasmid copy number increased from 39,249 per µl at T0 to 383,045 copies at T1 and 245,444,045 copies at T2.
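As a rough illustration of the amplification reported above, the fold increase between workflow steps can be computed directly from the three copy numbers. The snippet is a minimal sketch: the variable and function names are illustrative only, and it assumes that all three values are expressed in the same unit.

```python
# Plasmid (CmR) copy numbers reported for the three workflow steps.
# Assumption: all three values refer to the same unit of volume.
copies = {"T0": 39_249, "T1": 383_045, "T2": 245_444_045}


def fold_change(before, after):
    """Return how many times the copy number increased between two steps."""
    return copies[after] / copies[before]


for before, after in [("T0", "T1"), ("T1", "T2"), ("T0", "T2")]:
    print(f"{before} -> {after}: {fold_change(before, after):.1f}-fold")
```

Under this assumption, reverse transcription alone raises the background by roughly one order of magnitude, and the subsequent MDA step accounts for most of the total increase.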

Characterization of omnipresent natural and artificial plasmid residues in NGS reagents.
After that, all contigs available from the previous study were re-evaluated in silico for plasmid sequences such as selection markers and origins of replication in order to assess the possible presence of additional artificial expression vectors. We found multiple other sequences exhibiting characteristics of expression vectors (Fig. 4). Of 4956 contigs from twelve samples, 1.61% (n = 80) contained plasmid sequences. These sequences were found in samples as diverse as human urine (n = 4), pharyngeal lavages (n = 4), technical replicate groups (n = 2) and a non-template control (n = 1). The relative abundance of plasmid background ranged from 0.16% in the non-template control (NTC) up to 20.83% in one patient sample.
Fig. 2. Analysis of EIAV plasmid. (A) A BLAST search revealed the presence of a CAT (chloramphenicol acetyltransferase) gene in sequences above 2.5 kb. The longest sequence, UN_TR272_len_4326, carried a second bacterial resistance gene (AmpR) conferring resistance to β-lactam antibiotics such as ampicillin. (B) Plasmid map of the predicted Omniscript RT Kit expression plasmid, which was identified as the source of the EIAV pol. Qiagen confirmed that such a plasmid is used for their Omniscript product. The EIAV pol sequence is in-frame with a histidine tag, flanked by a BamHI and a HindIII restriction site and followed by a lambda t0 terminator. Further downstream lies an inactive CmR resistance gene followed by an rrnB T1 terminator. Further upstream, an AmpR promoter together with a β-lactamase gene can be found. In front of the insert is a ribosomal binding site (RBS) with a T5 promoter to ensure strong transcription. The system is induced via a lac operator. The backbone of the plasmid seems to be pDS56/RBSII, and the origin of replication may therefore be pBR322. The whole plasmid, named p6EIAV-RT, was created by Dr. Stuart J LeGrice in 1991.
Interestingly, the urine samples had a higher plasmid background with a mean of 11.67% (Max: 20.83%; Min: 2.65%; SD: 8.97%) compared to the pharyngeal lavage samples with a mean of 4.67% (Max: 10.47%; Min: 2.65%; SD: 4.42%). The urine technical replicates had higher plasmid residues compared to the pharyngeal lavage technical replicates (6.757% vs. 4.225%) (Fig. 5).
Characterization of plasmid residues. Of the 80 contigs with plasmid signatures, 41% (n = 33) had an origin of replication, 63% (n = 51) a selection marker and 52% (n = 42) an insert. Apart from the EIAV-coding expression vector, three other artificial expression vectors could be identified by their inserts. Of these inserts, 19% were a human-mouse chimeric Bicaudal 1 gene (n = 8), 11% the UL-32 gene of Cytomegalovirus (n = 5) and 5% the leukemia fusion protein AML1-MTG8 (n = 2). All contigs with a specific insert were aligned, and the consensus sequence displayed in SnapGene Viewer gave a predicted plasmid map (Fig. 5). The plasmids coding for the Bicaudal 1 chimera and UL-32 genes were identical to those used for other studies in our laboratory and were therefore identified as laboratory contaminants. A BLAST search with the 2268 bp fragment of "Und_TR29_len2635", found in the Und sample (undetermined contigs), showed 99% query coverage with Homo sapiens mRNA for the AML1-MTG8 fusion protein (GenBank: D13979.1). The source of this plasmid remains unknown.
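The proportions quoted above follow directly from the contig counts. The short sketch below recomputes them; the names are illustrative, and small differences from the reported percentages may simply reflect rounding. Because a contig can carry more than one feature, the feature percentages are not expected to sum to 100%.

```python
# Counts taken from the text.
total_contigs = 4956
plasmid_contigs = 80
feature_counts = {
    "origin of replication": 33,
    "selection marker": 51,
    "insert": 42,
}


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


print(f"contigs with plasmid sequences: {percent(plasmid_contigs, total_contigs):.2f}%")
for feature, count in feature_counts.items():
    print(f"{feature}: {percent(count, plasmid_contigs):.2f}% of plasmid contigs")
```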
Natural plasmid residues are derived from a variety of sources. Besides artificial plasmids, naturally occurring plasmids from different species were found in all twelve samples (n = 12). The most frequent plasmid was from Micrococcus spp. (Table 1). The plasmid sequences we found from Serratia marcescens pUO901 (ID: NG_047232.1) and Enterobacter cloacae pEC005 (ID: NG_050201.1) coded only for antibiotic resistances. The former was identified as an aminoglycoside-(3)-N-acetyltransferase (AAC(3)s), whereas the latter coded for a class A extended-spectrum beta-lactamase, TEM-157 (Table 1). These plasmids are likely from natural sources.
Detection of plasmid residues in commercially available polymerases. To evaluate whether plasmid residues are commonly present in commercially available polymerase preparations, we tested Taq polymerases (n = 4), high-fidelity polymerases (n = 2) and qPCR mastermixes (n = 7) for the presence of an origin of replication (pBM1/pUC19/pBR322/ColE1) and selection markers (bla TEM-1 ; CmR). An origin of replication and an ampicillin resistance gene were found in two polymerase preparations (HotStarTaq, EvaGreen). An origin of replication alone was found in one polymerase preparation (iTaq Universal Probes Supermix). A chloramphenicol resistance gene was not found in any of the polymerase preparations tested. The methodology used did not incorporate a negative control to check whether a positive signal could be obtained. Therefore, laboratory cross-contamination could not be excluded entirely, although it is unlikely because PCR mastermixes were prepared in a CleneCab PCR Workstation (Herolab, Wiesloch, Germany) and highly specific primers were used. To confirm our findings, enzyme preparations that had tested positive for plasmid residues were used as template and amplified with a polymerase preparation that had previously tested negative for plasmids (GoTaq G2 Hot Start Polymerase; Promega). HotStarTaq remained positive for both Ori and ampicillin resistance, whereas the EvaGreen 2X qPCR Express Mix-ROX remained positive only for Ori, indicating the possible presence of artificial expression plasmids. All Taq enzymes from Bio-Rad that had previously tested positive now tested negative and were therefore considered negative for plasmid presence (Table 2).
The complete definition of an artificial fragmented plasmid is as follows: "May contain several artificial sequences similar to a complete vector but is missing one criteria which can be: ori (O), selection marker (SM), promoter region with insert (I) regardless length and is not naturally occurring". Owing to their nature, fragmented plasmids may have either one or two features and are further characterized by them (e.g. ori with selection marker = O + SM). Sequences that contained neither an ori, a selection marker nor an insert but contained any other plasmid feature (e.g. histidine tags) were termed very short fragments (VSF).
Analysis of metagenomics studies. Finally, we analyzed previously published metagenomic data sets of human gut and plasma samples, as well as a data set generated with different whole genome amplification kits [50][51][52], for the presence of plasmid residues. In this retrospective analysis, natural plasmid residues were found in most data sets, most commonly with Acinetobacter sp. and Escherichia sp. as source organisms (Table 1 and Table 2). The highest diversity of plasmids was found in metagenomic data focusing on the fecal microbiome 53 . Metagenomic studies of high-biomass samples, such as microbiome studies, are expected to contain a higher amount and diversity of natural plasmids than studies of low-biomass samples (e.g. plasma). Remarkably, a plasmid highly similar to one from Xuhuaishuia manganoxidans strain DY6-4 was detected in several samples of two unrelated metagenomics studies, although this bacterium has so far been found only in the Pacific Clarion-Clipperton Fracture Zone 51 (Table 3).

Discussion
The presence of bacterial DNA residues in commercially available enzymes, DNA extraction kits and other molecular-grade reagents has been recognized recently 21,26,41,52 . The presence of plasmids in molecular biology reagents, however, has remained unnoticed so far. We found natural and artificial plasmid residues in most tested NGS reagents, particularly in recombinantly generated enzyme preparations. Sources of these plasmids included laboratory contaminants as well as bacteria and expression vectors used for the generation of recombinant proteins. Plasmid sequences have been identified frequently in NGS studies but may have been attributed erroneously to bacteria. Hence, plasmid sequences present in clinical and environmental samples may have far-reaching consequences.
Metagenomic studies are increasingly used in addition to standard PCR assays to address clinical questions, as reviewed by Klymiuk & Steininger 54 . Enzymes used for these assays are generated recombinantly in prokaryotic systems. Plasmid sequences may misguide clinical treatment decisions and adversely affect patient outcome. For example, antimicrobial resistance testing is increasingly complemented by testing bacterial isolates for the presence of genes that confer resistance 55 . In the studies analyzed, common antibiotic resistance gene sequences from Enterobacter cloacae and Serratia marcescens were found. These two pathogens are increasingly resistant to multiple or most antimicrobial drug classes, and the presence of resistance genes in clinical samples would not be surprising or questioned 14,15,17,45 . Consequently, the choice of antimicrobial treatment would be misguided towards reserve antimicrobials that are more toxic than standard ones. At least one patient death has been documented in association with a false-positive test result caused by a contaminated mastermix 56 . Misguidance of clinical decisions may also be associated with false-positive PCR results. We found EIAV sequences in all human samples and could identify the plasmid used for the generation of the reverse transcriptase as the source of these sequences. Identification of a horse retrovirus in human samples was implausible, which guided our investigation in the right direction. In general, the presence of host-specific viral, genomic or plasmid DNA (e.g. from Xuhuaishuia manganoxidans strain DY6-4) in samples derived from other hosts should be questioned for its plausibility. Still, recombinant reverse transcriptase is also used in PCR assays for detection of EIAV in horse samples, and this pol sequence is used as the target in several detection assays 57 . In that setting, a positive test result would be plausible, and negative controls would test negative because they are usually not treated with a reverse transcriptase. In case of a single positive EIAV test result, however, all horses of the stable would be culled.
Elimination of plasmid sequences from molecular biology reagents is difficult and costly. Natural plasmids from bacteria such as Ralstonia sp., Bradyrhizobium sp. and Legionella sp., which are common contaminants of ultrapure water, are difficult to avoid 21 . Contamination of reagents from the human body may remain unnoticed. In one of our recent metagenomic studies, we found plasmid fragments from Ralstonia sp., Burkholderia sp., Enterobacter sp., Acinetobacter sp. and Micrococcus sp. 48 . The first two were likely introduced by water samples, whereas the latter three were likely introduced through human handling, as these microbes are part of the normal human skin flora 58 . Previously, we found Bicaudal-1 and UL32 protein expression plasmids in human samples 48 . These plasmids were very likely contaminants, as our research group used them in another research study. In addition, prokaryotic expression plasmids are commonly used to generate enzymes for molecular biology and are difficult to eliminate. For example, from its backbone we identified the plasmid used for the generation of the EIAV reverse transcriptase as a pDS56/RBSII-based expression vector 59 . Nevertheless, we also found differences in the level of contamination between the enzyme preparations from different manufacturers, which indicates that reducing this background signal is feasible.
A possible, inexpensive and feasible solution to the problem of plasmid residues in metagenomics studies may be to test technical replicates of the samples in parallel with negative controls and, during bioinformatic analysis, to subtract signals that are detectable in both the samples and the controls. Databases that comprehensively annotate the different expression vectors used for recombinant generation of proteins are important in this respect. Furthermore, it would be helpful if the package insert of every molecular biology reagent specified the type and sequences of the expression plasmids used. Nevertheless, most production processes of enzymes are proprietary and, in our experience, companies are very hesitant to provide this information.
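The subtraction strategy proposed above can be sketched with simple set operations. This is a minimal illustration, not the pipeline used in this study: the function name and the plasmid identifiers are hypothetical, and requiring a hit to be present in every technical replicate is one possible interpretation of how replicates could be used.

```python
def subtract_background(replicate_hits, control_hits):
    """Split a sample's plasmid hits into retained signal and background.

    replicate_hits: list of sets, one set of plasmid IDs per technical replicate
    control_hits:   set of plasmid IDs detected in the negative controls
    """
    # Keep only hits that are reproducible across all technical replicates.
    reproducible = set.intersection(*replicate_hits)
    # Hits also present in a negative control are treated as background.
    background = reproducible & control_hits
    signal = reproducible - control_hits
    return sorted(signal), sorted(background)


# Hypothetical identifiers, invented for illustration only.
replicates = [
    {"plasmid_A", "plasmid_B", "plasmid_C"},
    {"plasmid_A", "plasmid_B"},
]
controls = {"plasmid_A"}

signal, background = subtract_background(replicates, controls)
print("retained:", signal)
print("flagged as background:", background)
```

In practice the comparison would more likely be made on read counts or coverage than on simple presence, since a genuine signal and a reagent contaminant can share the same sequence.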
Another solution, presented by de Goffau and colleagues, would be to use different isolation kits during sample preparation to check whether the results are reproducible 60 .
In conclusion, we found that plasmid sequences are frequently present in molecular biology reagents. The sources of this background noise in metagenomic studies are diverse and include contamination of reagents from the environment, cross-contamination in the laboratory from purposely generated plasmids, and plasmids used for the generation of enzymes. The amount and type of plasmids found in metagenomics studies may vary greatly with the pre-treatment of samples (e.g. use of different enzymes). The presence of these plasmids in samples may have far-reaching consequences, including the misguidance of therapeutic decisions in human and veterinary medicine, particularly when unexpected. Our observations open up new avenues for identifying and appropriately addressing these potential issues. For example, background plasmid noise may be separated from true signals by use of appropriate negative controls, manufacturers of enzymes and recombinant proteins may inform customers of the possible presence of plasmid traces, and metagenomic data may be interpreted even more cautiously.

Methods
Urine and pharyngeal lavage samples from healthy human volunteers were collected in a sterile collection cup (Greiner Bio-One GmbH, Kremsmünster, Austria) as described previously 48 . Lavages were collected by asking the volunteer to gargle with 10 ml of sterile, physiologic sodium chloride solution (0.9% NaCl Mini-Plasco isotonic solution, B. Braun-Austria GmbH, Maria Enzersdorf, Austria) for a minimum of one minute and collecting the lavage fluid in a sterile tube. Samples were kept on ice and processed directly. Nucleic acids were enriched with Vivaspin 20 50,000 MWCO PES ultracentrifugation columns (Sartorius, Aubagne, France) at 4000 g and 4 °C. Total DNA and RNA were then purified with the Roche High Pure Viral Nucleic Acid Kit (Roche, Mannheim, Germany) and reverse transcribed with either the iScript cDNA Synthesis Kit (Bio-Rad, Hercules, USA) or the Omniscript RT Kit (Qiagen, Hilden, Germany) according to the manufacturers' instructions. The samples were cryopreserved at −80 °C until testing.
For quantitative analyses of plasmid copies, a qPCR assay amplifying part of the chloramphenicol acetyltransferase (CmR) encoding gene was designed with the online tool GenScript Real-time PCR (TaqMan) Primer Design (https://www.genscript.com/ssl-bin/app/primer). The 20 µl reaction mix contained 9 µl iTaq Universal Probes Supermix (Bio-Rad, Hercules, USA), 300 nM primers (Forward: 5′-GAC-GGT-GAG-CTG-GTG-ATA-TG-3′; Reverse: 5′-TGT-GTA-GAA-ACT-GCC-GGA-AA-3′), 200 nM of the CmR probe (5′-FAM-CGC-TCT-GGA-GTG-AAT-ACC-ACG-ACG-TAMRA-3′) and 5 µl template. The reaction was run in a 96-well optical microtiter plate (Life Technologies, Carlsbad, CA, USA) and amplified in a StepOnePlus Real-Time PCR System (Thermo Fisher Scientific, Waltham, MA, USA). The reaction mix was pipetted into a MicroAmp Fast 96-Well Reaction Plate 0.1 ml (Applied Biosystems, California, USA), and 5 µl of template was added afterwards.
The cycling conditions included an initial denaturation step at 95 °C for 2 minutes, followed by 40 cycles of denaturation for 15 seconds at 95 °C and extension for 20 seconds at 60 °C. Every run of the CmR qPCR included a serial dilution of the plasmid pDONR221 from 3 × 10^1 to 3 × 10^6 copies per well for calculation of a standard curve and quantification of target sequences. Each DNA sample was analyzed in triplicate, and at least 12 negative controls, containing only the reaction mix with 1 µl ddH2O as template, were included in each run.
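Quantification against such a dilution series is commonly done by fitting a straight line to Ct versus log10(copies) and inverting it for unknown samples. The sketch below illustrates that calculation under stated assumptions: the Ct values are invented for illustration and are not measurements from this study, and the function names are hypothetical.

```python
import math

# Standard dilution series from the text: 3 x 10^1 to 3 x 10^6 copies per well.
standard_copies = [3 * 10 ** exponent for exponent in range(1, 7)]
# Hypothetical Ct values, invented purely for illustration.
standard_ct = [33.1, 29.8, 26.4, 23.1, 19.8, 16.4]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


log_copies = [math.log10(c) for c in standard_copies]
slope, intercept = fit_line(log_copies, standard_ct)

# Amplification efficiency derived from the slope (1.0 means perfect doubling).
efficiency = 10 ** (-1 / slope) - 1


def copies_from_ct(ct):
    """Invert the standard curve to estimate copies per well from a Ct value."""
    return 10 ** ((ct - intercept) / slope)


print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"efficiency = {efficiency:.1%}")
print(f"Ct 26.4 corresponds to about {copies_from_ct(26.4):.0f} copies per well")
```

Copies per well can then be converted to copies per µl by dividing by the template volume added to the reaction.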
In order to test commercially available polymerases for the presence of plasmid sequences, a specific pan-Ori primer pair (Forward: 5′-AGT-TCG-GTG-TAG-GTC-GTT-CG-3′; Reverse: 5′-GCC-TAC-ATA-CCT-CGC-TCT-GC-3′) was designed with the online primer design tool Primer3 v.0.4.0 (http://bioinfo.ut.ee/primer3-0.4.0/primer3/). This PCR assay allowed detection of pBM1, pBR322, ColE1 and pUC19 in one reaction. The commonly used penicillin resistance gene bla TEM-1 was detected by a PCR using a primer pair designed by Lee and colleagues (Forward: 5′-CTA-CGA-TAC-GGG-AGG-GCT-TA-3′; Reverse: 5′-ATA-AAT-CTG-GAG-CCG-GTG-AG-3′) 53 . For the detection of chloramphenicol resistance (CmR), the same primer pair was used as for the qPCR described above. Cycling conditions and set-up of reaction mixes followed the enclosed manufacturer's manual, except that no template was added. All PCR reactions consisted of 30 cycles with 30 seconds denaturation at 95 °C, 30 seconds annealing at 60 °C and 25 seconds extension at 72 °C. The time needed for initial denaturation and final extension, as well as the primer, MgCl2 and dNTP concentrations, varied with the polymerase or mastermix used. Cycling steps for high-fidelity polymerases such as Q5 and iProof were shorter (10 seconds denaturation and 20 seconds extension). As positive control for ampicillin resistance and Ori presence, 1 µl of a 1 ng/µl pcDNA3.1(+) dilution was used as template. The (RT)-qPCR mastermixes were pipetted according to each manufacturer's manual. The same cycling conditions were used as for the PCR reaction.
To exclude false-positive results, 0.125 µl to 0.2 µl of pure enzyme was used as template for amplification with the GoTaq G2 DNA Polymerase (Promega, Madison, Wisconsin, USA), which had no detectable plasmid residues. Cycling conditions included a 2 minute initial denaturation step at 95 °C, followed by 30
In order to evaluate contigs for further potential plasmid contaminations, sequences were screened for common plasmid features including origins of replication (F1, pBR322, pUC19, p15a, ColE1, SV40), selection markers (chloramphenicol, ampicillin (bla TEM-1 ), kanamycin (Tn5), streptomycin (aadA), puromycin (pac) and hygromycin (hph)), promoters (T7, T3, Sp6, AmpR, CMV, tet, LacI, polyhedrin, SV40), terminators (rrnB T1-T2, lambda), protein tags (histidine, HA, streptavidin) and primer binding sites (pBluescript SK, pBluescript KS, M13 pUC and other commonly used primer sites). All plasmid sequences were searched from 5′ to 3′ as well as from 3′ to 5′. Sequences with at least one of these characteristics were analyzed further with the SnapGene Viewer software (GSL Biotech LLC, Chicago, USA), which automatically annotates plasmid features. All sequences attributed to plasmids were analyzed via their annotated features and classified into artificial vectors or artificial plasmid fragments (see Fig. 5A).
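The two ideas in this screening step, searching both strand orientations and classifying a contig by which features it carries, can be sketched as follows. This is a simplified illustration and not the procedure used here: exact string matching stands in for the similarity-based annotation performed by SnapGene Viewer, the function names are hypothetical, and treating a sequence with all three features (ori, selection marker, insert) as a complete vector is an assumption inferred from the fragment definition.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def found_on_either_strand(contig, motif):
    """True if the motif occurs in the contig in either orientation."""
    contig, motif = contig.upper(), motif.upper()
    return motif in contig or reverse_complement(motif) in contig


def classify(has_ori, has_marker, has_insert, has_other_feature=False):
    """Label a plasmid-like sequence by its ori (O), marker (SM) and insert (I)."""
    flags = (("O", has_ori), ("SM", has_marker), ("I", has_insert))
    labels = [name for name, present in flags if present]
    if len(labels) == 3:
        return "complete vector"  # assumption: all three features present
    if labels:
        return "fragment (" + " + ".join(labels) + ")"
    if has_other_feature:
        return "VSF"  # very short fragment, e.g. only a histidine tag
    return "no plasmid feature"


# Toy example: the pan-Ori forward primer, placed on the opposite strand.
motif = "AGTTCGGTGTAGGTCGTTCG"
toy_contig = "TTT" + reverse_complement(motif) + "AAA"
print(found_on_either_strand(toy_contig, motif))
print(classify(has_ori=True, has_marker=True, has_insert=False))
```

A real screen would need to tolerate mismatches and partial features, which is why an alignment-based tool is preferable to exact matching.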
As a final step, known plasmid sequences were searched in the short-read metagenome sequence data of all samples described earlier by Thannesberger and colleagues 48 , as well as in published raw data from other metagenomics studies [50][51][52] . We used the previously described bioinformatic pipeline 48 , which estimates the coverage along the plasmids and rejects short regions of unspecific coverage. All plasmid sequences from the NCBI RefSeq database, release 77, were used as reference 54 .
Abbreviated summary. Owing to the increasing sequencing throughput enabled by next-generation sequencing (NGS), the analysis of all microbial genomes present in a single sample has become possible (metagenomics). The indiscriminate sequencing of all nucleic acid sequences present in a sample does, however, pose the risk of attributing biological significance to contaminating sequences and of biasing the biological signal with a technical signal. Research conclusions and clinical decisions may thus be significantly misguided. We found that background plasmid sequences are present in every biological sample and have previously been interpreted erroneously as clinically significant biological differences. By recognizing this significant background in metagenomic studies, we show how to devise effective countermeasures, such as labelling commercial reagents for the presence of plasmids used for generation of recombinant proteins and specifying these plasmids.