High-throughput sequencing for species authentication and contamination detection of 63 cell lines

Cell lines are widely used in research and for diagnostic tests and are often shared between laboratories. Lack of cell line authentication can result in the use of contaminated or misidentified cell lines, potentially affecting the results of research and diagnostic activities. Cell line authentication and contamination detection based on metagenomic high-throughput sequencing (HTS) was tested on DNA and RNA from 63 cell lines available at the Canadian Food Inspection Agency’s National Centre for Foreign Animal Disease. Through sequence comparison of the cytochrome c oxidase subunit 1 (COX1) gene, the species identity of 53 cell lines was confirmed, and eight cell lines were found to show a greater pairwise nucleotide identity to the COX1 sequence of a different species within the expected genus. Two cell lines, LFBK-αvβ6 and SCP-HS, were determined to be composed of cells from a different species and genus. Mycoplasma contamination was not detected in any cell lines. However, several expected and unexpected viral sequences were detected, including part of the classical swine fever virus genome in the IB-RS-2 Clone D10 cell line. Metagenomics-based HTS is a useful laboratory quality assurance tool for cell line authentication and contamination detection that should be conducted regularly.

www.nature.com/scientificreports/
genbank/barcode) 1 . However, STR/SNP and COX1-based methods do not provide information on the presence and type of microbial contamination. Shotgun metagenomic sequencing allows sequencing of a broader spectrum of DNA or RNA in a sample. Thus, it can be used for species identification of cell lines and can potentially detect the presence of bacterial, viral, or fungal contamination. If specific STR and SNP loci are amplified prior to sequencing, STR/SNP profiling may be reliably implemented for the confirmation of the specific cell line. In this study, high-throughput sequencing (HTS) was performed on DNA and cDNA extracts from each of 63 cell lines available at the Canadian Food Inspection Agency (CFIA) National Centre for Foreign Animal Disease (NCFAD) to verify the species of origin and the presence of microbial contamination.

Methods
Cell culture. A total of 63 cell lines available at the NCFAD were seeded from frozen stocks and grown for 48 h at 37 °C and 5% CO2, except for Trichoplusia ni cells, which were grown with shaking at 27 °C for 24 h before collection. Adherent cell lines were dissociated using 0.25% Trypsin-0.1% EDTA, and cells in suspension were spun down at 600 × g for 10 min at 4 °C and re-suspended in culture media. An aliquot of cells was stained with 0.2% trypan blue and counted using a Cellometer Auto T4 counter (Nexcelom Bioscience).
DNA/RNA extraction. The DNeasy Blood and Tissue kit (QIAGEN) was used to isolate DNA and RNA from ~2.5 × 10^6 viable cells using the manufacturer's recommended protocol. The DNA and RNA were eluted into 50-100 μL of AE elution buffer (QIAGEN). Qubit dsDNA Broad Range (BR) and RNA High Sensitivity (HS) kits (Thermo Fisher Scientific) were used to quantify DNA and RNA in the extracts on the DS-11 FX fluorometer (DeNovix).
High-throughput sequencing. Invitrogen ezDNase enzyme (Thermo Fisher Scientific) was used to digest cellular DNA within the extracted nucleic acid prior to cDNA synthesis. The SuperScript IV First-Strand Synthesis module (Thermo Fisher Scientific) was used to synthesize the first strand of cDNA from 300 ng of RNA using a 1:1 ratio of random hexamers and oligo-dTs. The NEBNext Ultra II Second-Strand synthesis module (New England Biolabs) was used according to the manufacturer's protocol to generate the second strand of cDNA. The double-stranded cDNA was purified with the QIAquick PCR purification kit (QIAGEN) and eluted in EB buffer (QIAGEN) according to the manufacturer's recommended protocol. The Qubit dsDNA BR kit (Thermo Fisher Scientific) was then used to quantify the cDNA on the DS-11 FX fluorometer.
Sequencing was performed separately on the DNA and cDNA samples with the cDNA samples separated into two runs including a test run with a smaller number of samples due to the timing of the availability of the samples (Tables 1 and 2). Library preparation for the DNA samples was performed using Riptide High-Throughput Rapid DNA Library prep kit (iGenomX) and the manufacturer's protocol was followed with a 1:1 ratio of the low GC and high GC primers. The samples were pooled and loaded at a final concentration of 18 pM with 1% PhiX, and sequencing was performed on an Illumina MiSeq using a V2 flow cell with a 300-cycle (2 × 150 bp) cartridge.
Library preparation for the cDNA samples was subsequently performed with the Nextera XT Library Prep kit (Illumina) following the manufacturer's protocol, owing to a change of Illumina library preparation methods in the laboratory. In the first run, 26 samples were pooled and loaded at a final concentration of 10 pM with 1% PhiX, and sequencing was performed again on the Illumina MiSeq using a V2 flow cell with a 300-cycle (2 × 150 bp) cartridge. In the second run of cDNA samples, 65 samples were pooled at a final concentration of 18 pM with 1% PhiX, and sequencing was performed on a V3 flow cell with a 600-cycle (2 × 300 bp) cartridge.
Sequence analysis. iGenomX DNA sequencing reads were demultiplexed using the fgbio 7 software (v.0.7.0; command used: fgbio DemuxFastqs -i R1.fastq.gz R2.fastq.gz -r 8B12M+T 8M+T -x metadata.csv).
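The read structures passed to fgbio DemuxFastqs (8B12M+T for read 1 and 8M+T for read 2, as far as they can be read from the command) describe how each read is divided into barcode and template bases. The following is a minimal Python sketch of how such a string can be interpreted, assuming the usual notation in which B is a sample barcode, M a molecular barcode, S skipped bases, T template, and "+" means "all remaining bases". The function names are hypothetical and this is not fgbio's own code.

```python
import re

# Segment kinds in the read-structure notation (assumed meanings).
KINDS = {
    "T": "template",
    "B": "sample_barcode",
    "M": "molecular_barcode",
    "S": "skip",
}


def parse_read_structure(structure):
    """Split a read structure such as '8B12M+T' into (length, kind) pairs.

    A length of None stands for '+', i.e. all remaining bases.
    """
    segments = re.findall(r"(\d+|\+)([TBMS])", structure)
    # Reject strings containing anything the pattern did not consume.
    if "".join(n + k for n, k in segments) != structure:
        raise ValueError(f"unparseable read structure: {structure!r}")
    parsed = [(None if n == "+" else int(n), KINDS[k]) for n, k in segments]
    if any(length is None for length, _ in parsed[:-1]):
        raise ValueError("'+' is only allowed on the last segment")
    return parsed


def split_read(bases, structure):
    """Cut a read into labelled pieces according to the read structure."""
    pieces = []
    pos = 0
    for length, kind in parse_read_structure(structure):
        end = len(bases) if length is None else pos + length
        pieces.append((kind, bases[pos:end]))
        pos = end
    return pieces
```

Under this reading, the first 8 bases of read 1 carry the sample barcode, the next 12 a molecular barcode, and the remainder is template sequence, while read 2 begins with 8 molecular barcode bases.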
To determine the species of the cell line, metagenomic analysis was performed using the nf-villumina 8 (v2.0.0) Nextflow 9 workflow on the concatenated DNA and cDNA sequencing data. As part of the nf-villumina workflow, Illumina PhiX Sequencing Control V3 reads were removed using BBDuk 10 , and poor quality reads and adaptors were removed using fastp 11 . Taxonomic classification of the filtered reads was performed with Kraken 2 12 using an index of NCBI RefSeq sequences for bacteria, archaea, viruses and the GRCh38 human genome (downloaded and built March 22, 2019), and with Centrifuge using an index of NCBI nt sequences (downloaded and built February 4, 2020). Quality filtered reads were assembled into contigs with Megahit 13 , Shovill 14 , and Unicycler 15 . The contigs were queried against the NCBI nt database (downloaded December 4, 2020) using nucleotide BLAST+ 16,17 (v2.11.0) with default parameters except "-evalue 1e-6", restricting the search to eukaryotic NCBI nt database entries (i.e., those belonging to NCBI taxonomic ID (taxid) 2759). The processed reads for each cell line were mapped against the top matching COX1 sequence identified by BLAST analysis using Snippy (v4.6.0) 18 as part of the nf-illmap Nextflow workflow (v1.0.0) 19 . The resulting BAM alignment file was loaded into Geneious v.9.1.8 20 , where variants were called using the Find Variations/SNPs tool with default settings except Minimum Coverage = 3 and Minimum Variant Frequency = 0.75, so that variants were called only at positions with a read depth of at least 3×. MDBK-HS-1 is from the cell lines available at CFIA NCFAD in Winnipeg, Manitoba, Canada, while MDBK-HS-2 came from the CFIA NCAD laboratory in Lethbridge, Alberta, Canada.
For cell lines where the observed species from the top BLAST match was not as expected based on laboratory records, the reads were additionally mapped to the Cytb gene sequence using the same methods as were used for mapping to the COX1 sequences.
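The two variant-calling thresholds described above (a minimum coverage of 3 and a minimum variant frequency of 0.75) can be illustrated with a short Python sketch. It is only an illustration of how the two cut-offs combine and does not reproduce the Geneious Find Variations/SNPs tool. The Site record and the call_variants function are hypothetical names, and the per-site counts in the usage example are invented.

```python
from dataclasses import dataclass

MIN_COVERAGE = 3              # minimum read depth at a site
MIN_VARIANT_FREQUENCY = 0.75  # minimum fraction of reads supporting the variant


@dataclass
class Site:
    position: int   # 1-based position on the reference gene
    depth: int      # number of reads covering the position
    alt_reads: int  # number of those reads that differ from the reference


def call_variants(sites, min_cov=MIN_COVERAGE, min_freq=MIN_VARIANT_FREQUENCY):
    """Return positions that pass both the depth and the frequency threshold."""
    called = []
    for site in sites:
        if site.depth < min_cov:
            continue  # too few reads to make a call at this position
        if site.alt_reads / site.depth >= min_freq:
            called.append(site.position)
    return called


# Invented example sites, for illustration only.
example = [
    Site(position=10, depth=2, alt_reads=2),   # rejected: depth below 3
    Site(position=20, depth=4, alt_reads=3),   # called: 3/4 = 0.75
    Site(position=30, depth=10, alt_reads=5),  # rejected: 5/10 = 0.50
]
print(call_variants(example))
```

A site covered by only two reads is never called, even if both reads disagree with the reference, which is the practical effect of the 3× depth requirement.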
RNA and DNA viruses and bacteria were identified from the cell line DNA and cDNA sequencing data using DAMIAN 21 . As part of DAMIAN analysis, raw sequencing reads were trimmed using Trimmomatic 22 settings and assembled using SPAdes 23 . Contigs were taxonomically classified using nucleotide BLAST+ (v2.11.0) (DAMIAN BLAST+ option "progressive") and the NCBI nt database (downloaded December 4, 2020). Trimmed reads were mapped to the viral genomes identified by DAMIAN and additional nucleotide BLAST analysis using the nf-illmap workflow. Variants were called in Geneious v.9.1.8 20 using the method described above.

Results
Cell line authentication. Table 1 lists the cell lines for which the expected species identity was confirmed by mapping the combined reads from the cDNA and DNA sequences to the reference COX1 gene of the top mitochondrial genome BLAST match for each cell line. All observed species in this list matched the species recorded in the institute's cell line inventory list. This list also includes five archived cell lines that have been documented as "unknown" which did not have a defined species listed. Two cell lines, LFBK-αvβ6 and SCP-HS, were determined to be composed of cells from a different species than expected. According to institute documentation, LFBK-αvβ6 was a continuous bovine kidney cell line that constitutively expresses αvβ6 integrin 24,25 ; however, there were no BLAST results from the LFBK-αvβ6 de novo assembled contigs that corresponded to the Bos taurus genome or mitogenome. All BLAST results matched sequences from the Sus scrofa genome and mitogenome. Figure 1A shows the DNA and cDNA reads mapped to reference B. taurus and S. scrofa COX1 sequences. A total of 880 reads from LFBK-αvβ6 mapped to the S. scrofa COX1 gene and had a breadth of coverage of 100% with 0 total variants (i.e., SNPs, MNPs, and INDELs) between the mapped reads and the reference, while only 77 reads mapped to the B. taurus reference COX1 gene with a breadth of coverage of 44.1% and 98 total variants (see Table 2 for reference accession numbers and results).
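Breadth of coverage, as reported here, is the percentage of reference positions covered by mapped reads. A minimal Python sketch of that calculation follows, assuming a list of per-base depths for the reference gene is already available. The function name and the default depth cut-off of 1 are assumptions, since the text does not state the depth at which a position was counted as covered.

```python
def breadth_of_coverage(depths, min_depth=1):
    """Percentage of reference positions with at least `min_depth` reads.

    `depths` holds one read-depth value per reference position.
    """
    if not depths:
        raise ValueError("depths must contain at least one position")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)
```

For example, a five-base reference with depths [0, 0, 5, 3, 1] has a breadth of 60% when any covered base counts and 40% when at least three reads are required.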
According to documentation, SCP-HS is an ovine brain choroid plexus cell line adapted for growth in horse serum; however, the top BLAST results from the de novo assembled contigs were to B. taurus and not to Ovis aries. Figure 1B shows the coverage of the SCP-HS reads mapped across reference COX1 sequences from O. aries and B. taurus. In the B. taurus assembly, 202 reads mapped with 100% breadth of coverage across the COX1 gene with 0 total variants, while in the O. aries assembly, 85 reads mapped with 100% breadth of coverage across the COX1 gene with 176 total variants (see Table 2 for reference accession numbers and results).
Eight cell lines (CGBQ, BGMK, MA-104, PaLu, PaSPT, Vero, Vero Nectin-4, Vero-76) were found to align better to the COX1 sequence from a different species (within the same genus) than the expected species based on available documentation. For these samples, reads were mapped against the COX1 sequences from both the expected and observed species. This analysis showed that more variants were observed when reads were mapped against a reference sequence representing the expected species than when they were mapped against a reference representing the observed species, suggesting that the cell line is derived from a different species than was expected (Table 2). The COX1 sequences for the references of the observed and expected species do, however, share a high similarity: between 95.6 and 96.9% for the primate sequences, 99.4% for the goose sequences, and 97.2% for the bat sequences. A high similarity between the references increases the difficulty in discerning one species from another; therefore, for those eight cell lines the reads were also mapped to the mitochondrial gene cytochrome b (Cytb) sequence.
While the Cytb sequences between the observed and expected species also share a high similarity (between 94.0 and 95.8% for the primate sequences, 98.3% for the goose sequences, and 96.4% for the bat sequences), Table 2 shows that with the exception of PaLu and PaSPT, the results of the Cytb analysis are consistent with those of the COX1 analysis suggesting with higher confidence that the cell lines are derived from a different species than expected.
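The comparisons in this section rest on mapping the same reads to two candidate references and asking which reference yields fewer variants. A small Python sketch of that comparison is given below, using the COX1 mapping figures reported in the text for LFBK-αvβ6 and SCP-HS. The Mapping record and the tie-breaking order (fewest variants, then greatest breadth, then most reads) are illustrative assumptions and not a rule stated by the study.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    species: str
    reads: int          # reads mapped to the reference COX1 gene
    breadth_pct: float  # breadth of coverage, in percent
    variants: int       # total variants (SNPs, MNPs, INDELs)


def best_reference(mappings):
    """Pick the candidate with the fewest variants.

    Ties are broken by greater breadth of coverage, then by more mapped
    reads (an assumed ordering, for illustration only).
    """
    return min(mappings, key=lambda m: (m.variants, -m.breadth_pct, -m.reads))


# COX1 mapping results as reported in the text.
lfbk = [
    Mapping("Sus scrofa", reads=880, breadth_pct=100.0, variants=0),
    Mapping("Bos taurus", reads=77, breadth_pct=44.1, variants=98),
]
scp_hs = [
    Mapping("Bos taurus", reads=202, breadth_pct=100.0, variants=0),
    Mapping("Ovis aries", reads=85, breadth_pct=100.0, variants=176),
]

print(best_reference(lfbk).species)
print(best_reference(scp_hs).species)
```

For closely related references, such as the species pairs within one genus discussed above, a variant count from a single gene is weaker evidence, which is why a second gene (Cytb) was also examined.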

Detection of bacterial and viral sequences.
Upon identifying the species of the 63 cell lines, a separate workflow was used to identify bacterial and viral DNA and cDNA sequences. Some viral sequences were expected in the cell lines including human adenovirus C used for the transformation of HEK-293, the common FBS contaminant bovine viral diarrhea virus 2 (BVDV2) in CPAE, and the common porcine circovirus 1 (PCV1) in swine-derived PK-15 (PCV+) cells. Sequences matching these viruses were detected as expected, and PCV1 was also found in all four IPAM clones (Table 3). Retroviral sequences, including murine leukemia virus (MuLV), were also found in some of the cell lines (Table 3). Only viruses with a complete or near complete viral genome (> 98% breadth of coverage) are listed, as incomplete cancer-causing retroviral sequences can be expected within the genomes of tumor-derived cell lines 26 . Reads that were classified as classical swine fever virus (CSFV) were also found in the IB-RS-2 Clone D10 cell line with a 39.5% breadth of coverage across the viral genome with seven total variants relative to the reference genome. In the T. ni insect cell line, reads identified as Flock House virus had a breadth of coverage of 85.1% across the reference genome with two total variants (Table 3).
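The reporting rule for retroviral sequences can be sketched in Python as a simple filter. This sketch assumes that the > 98% breadth-of-coverage requirement applies only to retroviral hits, since non-retroviral hits with lower breadth (CSFV, Flock House virus) are reported in the text. The function name, the is_retrovirus flag, and the last example entry are hypothetical; the other breadth values are those given in the text.

```python
RETROVIRUS_MIN_BREADTH_PCT = 98.0  # near-complete genome threshold from the text


def reportable(hits, min_breadth=RETROVIRUS_MIN_BREADTH_PCT):
    """Return the names of viral hits that would be listed.

    Each hit is (name, breadth_pct, is_retrovirus). Retroviral hits are
    kept only when breadth exceeds the threshold (assumed interpretation);
    all other hits are kept regardless of breadth.
    """
    kept = []
    for name, breadth_pct, is_retrovirus in hits:
        if is_retrovirus and breadth_pct <= min_breadth:
            continue  # partial retroviral sequence, expected in host genomes
        kept.append(name)
    return kept


hits = [
    ("murine leukemia virus", 99.3, True),
    ("classical swine fever virus", 39.5, False),
    ("Flock House virus", 85.1, False),
    ("partial retroviral sequence", 42.0, True),  # invented entry
]
print(reportable(hits))
```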

Discussion
The aim of this study was to authenticate the species identity of cell lines available for use at the Canadian Food Inspection Agency's National Centre for Foreign Animal Disease, and to establish methods that can be integrated into the laboratory quality assurance system. Confirming cell line species at our laboratory was previously conducted by comparing the electrophoretic migratory patterns of common intracellular enzymes (isoenzymes). Examining the polymorphic isoenzyme profiles between species for cell line confirmation has limitations including limited species range, low sensitivity of detection, and complex data interpretation.
In this study, 53 of the 63 cell lines had a COX1 sequence that was consistent with the expected species; the reads from each of these cell lines had a breadth of coverage of > 95% across the COX1 gene, and no more than five variants compared to the reference. LFBK-αvβ6 and SCP-HS cells were found to be from a different genus than expected, suggesting that the cell lines had been misidentified, contaminated, or mislabeled. When reads from the LFBK-αvβ6 and SCP-HS cell lines were mapped to the COX1 genes corresponding to the species identified by BLAST analysis, no variants were observed in either sample. The porcine DNA found within the LFBK-αvβ6 cell line is consistent with a published erratum that this cell line is of porcine origin 24,25 . LFBK-αvβ6 isoenzyme patterns are also consistent with cultures of porcine origin (unpublished results).
Assembled sequences from eight of the cell lines showed a higher pairwise nucleotide identity to a different species within the same genus than what was expected (Table 2). Five of the cell lines were of primate origin, two were of bat (flying fox) origin, and one was of goose origin. The number of variants (i.e., SNPs, MNPs, INDELs) between the mapped reads and the COX1 and Cytb genes was used as an indication of how similar the cell line was to a particular species. The difference in the number of variants between the expected and observed species varied for each cell line (between 9 and 64 variants for COX1 and between 1 and 68 for Cytb); however, in each case, the number of variants was higher when aligned to the expected species than to the observed species, except for PaLu and PaSPT, where the reads mapped to the Cytb gene showed more SNPs relative to the observed species than to the expected species. Turner et al. 27 describe the morphological differences between species of the Chlorocebus genus of Old World monkeys, and report that various geographical locations may permit deviation from the predicted morphology of these species. Thus, the species of the individual animal from which each of these cell lines originated was likely misidentified. It was also noticed that the number of variants relative to the observed species was considerably higher in the bat cell lines PaLu and PaSPT (39 and 40 variants, respectively) than in all of the other cell lines (6 or fewer variants). The genus Pteropus is known to be very diverse, with a large number of species 28 ; therefore, additional investigation will be required to determine whether the cell lines are in fact P. ornatus, as identified here, or whether there was a misidentification between closely related species when the cell line was originally created.
The current gold standard for the authentication of human cell lines is STR profiling 29 , while non-human cell lines are best identified using DNA barcoding with the COX1 gene 6 . The International Cell Line Authentication Committee (ICLAC) keeps a Register of all known misidentified or cross-contaminated cell lines. As of this study, the Register was last updated March 25, 2020 and contains a total of 509 cell lines that are misidentified; of these only 38 were nonhuman cell lines 30 . This is likely not because human cell lines are more susceptible to contamination compared to nonhuman cell lines, but rather, because there is more information available for human cell lines in addition to the limitations of STR profiling which is only applicable for single species differentiation 30 . Thus, the method described here is useful since it can identify the species as well as the presence of contaminants such as other cell lines, mycoplasma, or viruses 1 .
Mycoplasma contamination of cell lines can negatively affect experimental results. Depending on the species of mycoplasma, the effects on the cells vary from changes in protein and nucleic acid synthesis levels to a complete loss of the culture 31 . Detection of contamination is difficult, due in part to the small size (0.3-0.8 µm) 32 of the mycoplasma cells, which allows them to pass through filters 32,33 . Additionally, high concentrations of mycoplasma are possible without any obvious visual signs 33 . In this study, mycoplasma was not detected in any of the 63 cell lines tested. This result was expected, as the NCFAD currently has quality control procedures in place to check for mycoplasma contamination in its cultures, and the results here are consistent with the systems in place.
The presence of certain viruses was expected in some of the cell lines. Bovine viral diarrhea virus 2 (BVDV2) is a common contaminant in fetal bovine serum 34 and was present in the CPAE and OA3.Ts cell lines. Human adenovirus C was found in both HEK-293 and A549 cells. PCV1, a ubiquitous virus in pigs, was found as expected in the PK-15 (PCV +) cell line and in all four of the IPAM clones tested. Retroviral sequences are common in the genomes of their hosts due to insertion into the host genome 25 . The near-complete genome (99.3% breadth of coverage with 77 variants) of murine leukemia virus (MuLV) was detected in the P3X63-Ag8 cell line. Partial genomes from retroviruses such as avian leukosis virus (ALV) and porcine endogenous retrovirus (PERV) were detected in some cell lines.
Sequencing reads covering 39.5% of the CSFV genome were found in the cell line IB-RS-2 Clone D10, with seven total variants between the mapped reads and the reference genome. This clone was originally determined to be free of CSFV contamination 28 ; however, testing of this cell line obtained from the American Type Culture Collection (ATCC) by Bolin et al. 35 detected the virus in this clone. The entire CSFV genome was also found in the same cell line used by the Pirbright Institute, UK (Don King, personal communication).

Conclusion
Cell line authentication is important for the reproducibility and accuracy of research and diagnostics involving cell lines, as it can help identify unexpected errors and contamination in archived material and in cell lines obtained from other sources. This study determined the species identity of 63 cell lines that are available at the Canadian Food Inspection Agency's National Centre for Foreign Animal Disease. Of these cell lines, five were previously undefined, eight were determined to be derived from a different species within the same genus than was expected, and two were identified as species from different genera than expected. The methods described in this study, or other comparable methods, can be useful as they provide a single approach for species identification as well as for the detection of contamination (e.g., mycoplasma) or the presence of unexpected viruses.
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.