Introduction

Cell line authentication is an essential part of ensuring the validity of research and diagnostic results. Misidentified or contaminated cell lines can produce irreproducible or inaccurate results that may mislead future research1. In some reported cases, a cell line was misidentified by the source institute, rendering the results of any publication using that cell line questionable. For example, the KB cell line, believed to be oral or squamous cell carcinoma2, and the KU7 cell line, believed to be derived from bladder cancer cells3, were both found to be HeLa cells. Vaughn et al.2 found 631 publications from 2000–2014 that mentioned the use of the KB cell line, of which 574 described it incorrectly. These, and other papers describing misidentified cell lines, have been and may continue to be cited and used in other studies, thus potentially invalidating the research2. As an increasing number of cell lines are reported as being contaminated or misidentified, many scientific journals, including Nature and PLOS ONE, have now put policies in place for the authentication of cell lines used in their publications1,4. Two of the main causes of cell line misidentification are cross-contamination between cell lines and mislabeling of tubes or culture flasks4. Cross-contamination may occur between cell lines of the same species (intra-species cross-contamination) or between different species (inter-species cross-contamination). Cell lines may also be completely overgrown and replaced by a contaminating cell line5. Regularly confirming the identity of cell lines can prevent contamination and mislabeling errors from affecting future research and diagnostic test results.

One of the methods used for cell line identification is short tandem repeat (STR) profiling, which has been widely used for human identification in forensics1. As each human cell line originated from a different individual, STR profiling allows for differentiation between them. For non-human cell lines, DNA barcoding with genes such as the mitochondrial gene cytochrome c oxidase subunit I (COX1) may be used to determine the species of origin. Frequent third-position base substitutions give this gene a high rate of molecular evolution, leading to diversification that can even differentiate between phylogeographic groups of the same species6. The species can then be determined by comparing the DNA barcode profile of a cell line to databases of these sequences (BOLD, http://www.barcodinglife.org, and NCBI, http://www.ncbi.nlm.nih.gov/genbank/barcode)1. However, STR/SNP and COX1-based methods do not provide information on the presence and type of microbial contamination.

Shotgun metagenomic sequencing allows sequencing of a broader spectrum of DNA or RNA in a sample. Thus, it can be used for species identification of cell lines and can potentially detect the presence of bacterial, viral, or fungal contamination. If specific STR and SNP loci are amplified prior to sequencing, STR/SNP profiling may be reliably implemented for the confirmation of the specific cell line. In this study, high-throughput sequencing (HTS) was performed on DNA and cDNA extracts from each of 63 cell lines available at the Canadian Food Inspection Agency (CFIA) National Centre for Foreign Animal Disease (NCFAD) to verify the species of origin and presence of microbial contamination.

Methods

Cell culture

A total of 63 cell lines available at the NCFAD were seeded from frozen stocks and grown for 48 h at 37 °C and 5% CO2, except for Trichoplusia ni cells, which were grown with shaking at 27 °C for 24 h before collection. Adherent cell lines were dissociated using 0.25% Trypsin-0.1% EDTA, and cells in suspension were spun down at 600 × g for 10 min at 4 °C and re-suspended in culture media. An aliquot of cells was stained with 0.2% trypan blue and counted using a Cellometer Auto T4 counter (Nexcelom Bioscience).

DNA/RNA extraction

The DNeasy Blood and Tissue kit (QIAGEN) was used to isolate DNA and RNA from ~2.5 × 10^6 viable cells using the manufacturer’s recommended protocol. The DNA and RNA were eluted into 50–100 μL of AE elution buffer (QIAGEN). Qubit dsDNA Broad Range (BR) and RNA High Sensitivity (HS) kits (Thermo Fisher Scientific) were used to quantify DNA and RNA in the extracts on the DS-11 FX fluorometer (DeNovix).

High-throughput sequencing

Invitrogen ezDNase enzyme (Thermo Fisher Scientific) was used to digest cellular DNA within the extracted nucleic acid prior to cDNA synthesis. The Superscript IV First-Strand Synthesis module (Thermo Fisher Scientific) was used to synthesize the first strand of cDNA from 300 ng of RNA using a 1:1 ratio of random hexamers and oligo-dTs. The NEBNext Ultra II Second-Strand synthesis module (New England Biolabs) was used according to the manufacturer’s protocol to generate the second strand of cDNA. The double-stranded cDNA was purified with the QIAQuick PCR purification kit (QIAGEN) and eluted in EB buffer (QIAGEN) according to the manufacturer’s recommended protocol. The Qubit dsDNA BR kit (Thermo Fisher Scientific) was then used to quantify the cDNA on the DS-11 FX fluorometer.

Sequencing was performed separately on the DNA and cDNA samples with the cDNA samples separated into two runs including a test run with a smaller number of samples due to the timing of the availability of the samples (Tables 1 and 2). Library preparation for the DNA samples was performed using Riptide High-Throughput Rapid DNA Library prep kit (iGenomX) and the manufacturer’s protocol was followed with a 1:1 ratio of the low GC and high GC primers. The samples were pooled and loaded at a final concentration of 18 pM with 1% PhiX, and sequencing was performed on an Illumina MiSeq using a V2 flow cell with a 300-cycle (2 × 150 bp) cartridge.

Table 1 Cell lines with species identity determined by sequencing that matched institute records or were previously unknown.
Table 2 Cell lines in which a different species from institution records was identified.

Library preparation for the cDNA samples was subsequently performed with the Nextera XT Library Prep kit (Illumina) following the manufacturer’s protocol due to a switch over of Illumina library preparation methods in the laboratory. In the first run, 26 samples were pooled and loaded at a final concentration of 10 pM with 1% PhiX, and sequencing was performed again on the Illumina MiSeq using a V2 flow cell with a 300-cycle (2 × 150 bp) cartridge. In the second run of cDNA samples, 65 samples were pooled at a final concentration of 18 pM with 1% PhiX, and sequencing was performed on a V3 flow cell with a 600-cycle (2 × 300 bp) cartridge.

Sequence analysis

iGenomX DNA sequencing reads were demultiplexed using the fgbio7 software (v.0.7.0; command used: fgbio DemuxFastqs -i R1.fastq.gz R2.fastq.gz -r 8B12M+T 8M+T -x metadata.csv). To determine the species of the cell line, metagenomic analysis was performed using the nf-villumina8 (v2.0.0) Nextflow9 workflow on the concatenated DNA and cDNA sequencing data. As part of the nf-villumina workflow, Illumina PhiX Sequencing Control V3 reads were removed using BBDuk10, and poor quality reads and adaptors were removed using fastp11. Taxonomic classification of the filtered reads was performed with Kraken 212 using an index of NCBI RefSeq sequences for bacteria, archaea, viruses and the GRCh38 human genome (downloaded and built March 22, 2019), and with Centrifuge using an index of NCBI nt sequences (downloaded and built February 04, 2020). Quality filtered reads were assembled into contigs with Megahit13, Shovill14, and Unicycler15, which were queried against the NCBI nt database (downloaded December 04, 2020) using nucleotide BLAST+16,17 (v2.11.0) (default parameters except “-evalue 1e-6”), restricting the search to eukaryotic NCBI nt database entries (i.e. belonging to NCBI taxonomic ID (taxid) 2759). The processed reads for each cell line were mapped against the top matching COX1 sequence identified by BLAST analysis using Snippy (v4.6.0)18 as part of the nf-illmap Nextflow workflow (v1.0.0)19. The resulting BAM alignment file was loaded into Geneious v.9.1.820, and variants were called using the Find Variations/SNPs tools with default settings except Minimum Coverage = 3 and Minimum Variant Frequency = 0.75, so that variants were only called at positions with a read depth of at least 3×. MDBK-HS-1 is from the cell lines available at CFIA NCFAD in Winnipeg, Manitoba, Canada, while MDBK-HS-2 came from the CFIA NCAD laboratory in Lethbridge, Alberta, Canada.
For cell lines where the observed species from the top BLAST match was not as expected based on laboratory records, the reads were additionally mapped to the Cytb gene sequence using the same methods as were used for mapping to the COX1 sequences.
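The two variant-calling thresholds described above (Minimum Coverage = 3 and Minimum Variant Frequency = 0.75) can be illustrated with a minimal sketch. This is not the Geneious implementation; the `Site` record, its field names, and the example values are hypothetical and serve only to show how the two thresholds combine.

```python
from dataclasses import dataclass

MIN_COVERAGE = 3              # minimum read depth, as stated in the text
MIN_VARIANT_FREQUENCY = 0.75  # minimum fraction of reads supporting the variant


@dataclass
class Site:
    """Hypothetical per-position summary of a read pileup."""
    position: int   # 1-based position on the reference
    depth: int      # total reads covering the position
    alt_reads: int  # reads supporting the non-reference allele


def passes_thresholds(site: Site) -> bool:
    """Return True if the site meets both the depth and frequency thresholds."""
    if site.depth < MIN_COVERAGE:
        return False
    return site.alt_reads / site.depth >= MIN_VARIANT_FREQUENCY


def call_variants(sites):
    """Return the positions of all sites that pass both thresholds."""
    return [s.position for s in sites if passes_thresholds(s)]


# Made-up example sites, not data from this study.
sites = [
    Site(position=10, depth=2, alt_reads=2),    # rejected: depth below 3
    Site(position=25, depth=4, alt_reads=3),    # accepted: 3/4 = 0.75
    Site(position=40, depth=20, alt_reads=10),  # rejected: 10/20 = 0.50
    Site(position=77, depth=3, alt_reads=3),    # accepted: 3/3 = 1.00
]
print(call_variants(sites))
```

With a frequency threshold of 0.75, minor alleles present in a small share of the reads (for example from a low-level contaminant) would not be reported as variants under this sketch.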

RNA and DNA viruses and bacteria were identified from the cell line DNA and cDNA sequencing data using DAMIAN21. As part of the DAMIAN analysis, raw sequencing reads were trimmed using Trimmomatic22 with default settings and assembled using SPAdes23. Contigs were taxonomically classified using nucleotide BLAST+ (v2.11.0) (DAMIAN BLAST+ option “progressive”) and the NCBI nt database (downloaded December 04, 2020). Trimmed reads were mapped using the nf-illmap workflow to the viral genomes identified by DAMIAN and by additional nucleotide BLAST analysis. Variants were called in Geneious v.9.1.820 using the method described above.

Results

Cell line authentication

Table 1 lists the cell lines for which the expected species identity was confirmed by mapping the combined reads from the cDNA and DNA sequences to the reference COX1 gene of the top mitochondrial genome BLAST match for each cell line. All observed species in this list matched the species recorded in the institute’s cell line inventory list. This list also includes five archived cell lines that have been documented as “unknown” which did not have a defined species listed.

Two cell lines, LFBK-αvβ6 and SCP-HS, were determined to be composed of cells from a different species than expected. According to institute documentation, LFBK-αvβ6 was a continuous bovine kidney cell line that constitutively expresses αvβ6 integrin24,25; however, there were no BLAST results from the LFBK-αvβ6 de novo assembled contigs that corresponded to the Bos taurus genome or mitogenome. All BLAST results matched sequences from the Sus scrofa genome and mitogenome. Figure 1A shows the DNA and cDNA reads mapped to reference B. taurus and S. scrofa COX1 sequences. A total of 880 reads from LFBK-αvβ6 mapped to the S. scrofa COX1 gene and had a breadth of coverage of 100% with 0 total variants (i.e., SNPs, MNPs, and INDELs) between the mapped reads and the reference, while only 77 reads mapped to the B. taurus reference COX1 gene with a breadth of coverage of 44.1% and 98 total variants (see Table 2 for reference accession numbers and results).
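Breadth of coverage, as reported above, is the percentage of reference positions covered by mapped reads. A minimal sketch of the calculation follows; the function name, the `min_depth` parameter, and the example depths are assumptions made for illustration, and the source does not state the depth at which a position was counted as covered.

```python
def breadth_of_coverage(depths, min_depth=1):
    """Percentage of reference positions with depth >= min_depth.

    `depths` holds one read-depth value per reference position.
    `min_depth` is an assumed parameter; the source does not specify it.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


# Made-up per-position depths for a 10 bp reference, not data from this study.
example_depths = [0, 0, 5, 7, 3, 0, 1, 2, 0, 4]
print(breadth_of_coverage(example_depths))               # 6 of 10 positions covered
print(breadth_of_coverage(example_depths, min_depth=3))  # 4 of 10 positions covered
```

Per-position depths of this kind can be exported from a BAM alignment by standard tools; how they are obtained is outside the scope of this sketch.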

Figure 1

Reference assemblies of LFBK-αvβ6 and SCP-HS reads to references of the expected species and top BLASTn-matched mitogenomes. The nf-illmap workflow was used to map reads from the LFBK-αvβ6 and SCP-HS cell lines to reference COX1 sequences from the expected species of each cell line and the species which showed the top BLAST match to the de novo-assembled sequences. (A) LFBK-αvβ6 reads were mapped to B. taurus and S. scrofa. (B) SCP-HS reads were mapped to O. aries and B. taurus. The Y-axis shows the coverage of each genome position. Positions of variants are indicated by the grey lines below the graphs.

According to documentation, SCP-HS is an ovine brain choroid plexus cell line adapted for growth in horse serum; however, the top BLAST results from the de novo assembled contigs were to B. taurus and not to Ovis aries. Figure 1B shows the coverage of the SCP-HS reads mapped across reference COX1 sequences from O. aries and B. taurus. In the B. taurus assembly, 202 reads mapped with 100% breadth of coverage across the COX1 gene with 0 total variants, while in the O. aries assembly, 85 reads mapped with 100% breadth of coverage across the COX1 gene with 176 total variants (see Table 2 for reference accession numbers and results).

Eight cell lines (CGBQ, BGMK, MA-104, PaLu, PaSPT, Vero, Vero Nectin-4, Vero-76) were found to align better to the COX1 sequence from a different species (within the same genus) than the expected species based on available documentation. For these samples, reads were mapped against the COX1 sequences from both the expected and observed species. This analysis showed that, when reads were mapped against a reference sequence representing the expected species, more variants were observed than when they were mapped against a reference representing the observed species, suggesting that the cell line is derived from a different species than was expected (Table 2). The COX1 sequences for the references of the observed and expected species do however share a high similarity; between 95.6 and 96.9% for the primate sequences, 99.4% for the goose sequences, and 97.2% for the bat sequences. A high similarity between the references increases the difficulty in discerning one species from another, therefore for those eight cell lines the reads were also mapped to the mitochondrial gene cytochrome b (Cytb) sequence. While the Cytb sequences between the observed and expected species also share a high similarity (between 94.0 and 95.8% for the primate sequences, 98.3% for the goose sequences, and 96.4% for the bat sequences), Table 2 shows that with the exception of PaLu and PaSPT, the results of the Cytb analysis are consistent with those of the COX1 analysis suggesting with higher confidence that the cell lines are derived from a different species than expected.
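The pairwise similarity between two reference sequences, such as the percentages quoted above, can be computed from a pairwise alignment. The sketch below assumes the two sequences have already been aligned to equal length with "-" as the gap character, and it counts a base opposite a gap as a mismatch; the source does not state how similarity was calculated, so both choices are assumptions, and the example sequences are made up.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences have a gap are ignored; a base opposite
    a gap counts as a mismatch (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # skip columns that are gaps in both sequences
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * matches / compared


# Toy aligned sequences, not real COX1 or Cytb references.
ref_expected = "ATGCTAGCTA"
ref_observed = "ATGTTAGC-A"
print(percent_identity(ref_expected, ref_observed))  # 8 of 10 columns match
```

When two references are about 96 to 99% identical, as here, only a few positions per hundred bases separate the species, which is why the variant counts against each reference were compared across a second gene.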

Detection of bacterial and viral sequences

Upon identifying the species of the 63 cell lines, a separate workflow was used to identify bacterial and viral DNA and cDNA sequences. Some viral sequences were expected in the cell lines including human adenovirus C used for the transformation of HEK-293, the common FBS contaminant bovine viral diarrhea virus 2 (BVDV2) in CPAE, and the common porcine circovirus 1 (PCV1) in swine-derived PK-15 (PCV+) cells. Sequences matching these viruses were detected as expected, and PCV1 was also found in all four IPAM clones (Table 3). Retroviral sequences, including murine leukemia virus (MuLV), were also found in some of the cell lines (Table 3). Only viruses with a complete or near complete viral genome (> 98% breadth of coverage) are listed, as incomplete cancer-causing retroviral sequences can be expected within the genomes of tumor-derived cell lines26. Reads that were classified as classical swine fever virus (CSFV) were also found in the IB-RS-2 Clone D10 cell line with a 39.5% breadth of coverage across the viral genome with seven total variants relative to the reference genome. In the T. ni insect cell line, reads identified as Flock House virus had a breadth of coverage of 85.1% across the reference genome with two total variants (Table 3).

Table 3 List of viral genomes detected in the cell lines.

Discussion

The aim of this study was to authenticate the species identity of cell lines available for use at the Canadian Food Inspection Agency’s National Centre for Foreign Animal Disease, and to establish methods that can be integrated into the laboratory quality assurance system. Confirming cell line species at our laboratory was previously conducted by comparing the electrophoretic migratory patterns of common intracellular enzymes (isoenzymes). Examining the polymorphic isoenzyme profiles between species for cell line confirmation has limitations including limited species range, low sensitivity of detection, and complex data interpretation.

In this study, 53 of the 63 cell lines had a COX1 sequence that was consistent with the expected species; the reads from each of these cell lines had a breadth of coverage of > 95% across the COX1 gene, and no more than five variants compared to the reference. LFBK-αvβ6 and SCP-HS cells were found to be from a different genus than expected, suggesting that the cell lines had been misidentified, contaminated, or mislabeled. When reads from the LFBK-αvβ6 and SCP-HS cell lines were mapped to the COX1 genes corresponding to the species identified by BLAST analysis, no variants were observed in either sample. The porcine DNA found within the LFBK-αvβ6 cell line is consistent with a published erratum that this cell line is of porcine origin24,25. LFBK-αvβ6 isoenzyme patterns are also consistent with cultures of porcine origin (unpublished results).
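The properties reported for the concordant cell lines (breadth of coverage above 95% and no more than five variants) can be written as a simple check. This is only an illustrative sketch: the source reports these values as observations and not as a formal classification rule, and the function name and its default parameters are assumptions.

```python
def consistent_with_reference(breadth_pct: float, n_variants: int,
                              min_breadth: float = 95.0,
                              max_variants: int = 5) -> bool:
    """True if mapped reads look consistent with the reference species.

    Defaults mirror the values reported for the 53 concordant cell lines;
    treating them as a decision rule is an assumption of this sketch.
    """
    return breadth_pct > min_breadth and n_variants <= max_variants


# Values taken from the Results text for the two discordant cell lines.
print(consistent_with_reference(44.1, 98))    # LFBK-αvβ6 vs B. taurus COX1
print(consistent_with_reference(100.0, 0))    # LFBK-αvβ6 vs S. scrofa COX1
print(consistent_with_reference(100.0, 176))  # SCP-HS vs O. aries COX1
print(consistent_with_reference(100.0, 0))    # SCP-HS vs B. taurus COX1
```

Note that breadth alone would not separate the SCP-HS references, since both reached 100%; the variant count carries the distinction in that case.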

Assembled sequences from eight of the cell lines showed a higher pairwise nucleotide identity to a different species within the same genus than what was expected (Table 2). Five of the cell lines were of primate origin, two were of bat (flying fox) origin, and one was of goose origin. The number of variants (i.e., SNPs, MNPs, INDELs) between the mapped reads and the COX1 and Cytb genes was used as an indication of how similar the cell line was to a particular species. The difference in the number of variants between the expected and observed species varied for each cell line (9–64 variants for COX1 and 1–68 for Cytb); however, in each case, the number of variants was higher when aligned to the expected species than to the observed species, except for PaLu and PaSPT, where the reads mapped to the Cytb gene had a higher number of SNPs relative to the observed species than to the expected species. Turner et al.27 describe the morphological differences between species of the Chlorocebus genus of Old World monkeys, and reported that various geographical locations may permit deviation from the predicted morphology of these species. Thus, the species of the individual animal from which each of these cell lines originated was likely misidentified. It was also noticed that the number of variants relative to the observed species was considerably higher in the bat cell lines (PaLu and PaSPT; 39 and 40 variants, respectively) than in all of the other cell lines (6 or fewer variants). The genus Pteropus is known to be very diverse, with a large number of species28; therefore, additional investigation will be required to determine whether the cell lines are in fact P. ornatus, as identified here, or whether there was a misidentification between closely related species when the cell line was originally created.

The current gold standard for the authentication of human cell lines is STR profiling29, while non-human cell lines are best identified using DNA barcoding with the COX1 gene6. The International Cell Line Authentication Committee (ICLAC) keeps a Register of all known misidentified or cross-contaminated cell lines. As of this study, the Register was last updated March 25, 2020 and contains a total of 509 cell lines that are misidentified; of these only 38 were nonhuman cell lines30. This is likely not because human cell lines are more susceptible to contamination compared to nonhuman cell lines, but rather, because there is more information available for human cell lines in addition to the limitations of STR profiling which is only applicable for single species differentiation30. Thus, the method described here is useful since it can identify the species as well as the presence of contaminants such as other cell lines, mycoplasma, or viruses1.

Experimental results can be negatively impacted by mycoplasma contamination of cell lines. Depending on the species of mycoplasma, the effects on the cells vary from changes in protein and nucleic acid synthesis levels to a complete loss of the culture31. Detection of contamination is difficult, due in part to the small size (0.3–0.8 µm)32 of the mycoplasma cells, which allows them to pass through filters32,33. Additionally, high concentrations of mycoplasma are possible without any obvious visual signs33. In this study, mycoplasma was not detected in any of the 63 cell lines tested. This result was expected, as the NCFAD currently has quality control procedures in place to check for mycoplasma contamination in its cultures, and the results here are consistent with the systems in place.

The presence of certain viruses was expected in some of the cell lines. Bovine viral diarrhea virus 2 (BVDV2) is a common contaminant in fetal bovine serum34 and was present in the CPAE and OA3.Ts cell lines. Human adenovirus C was found in both HEK-293 and A549 cells. PCV1, a ubiquitous virus in pigs, was found as expected in the PK-15 (PCV+) cell line and in all four of the IPAM clones tested. Retroviral sequences are common in the genomes of their hosts due to insertion into the host genome25. The near-complete genome (99.3% breadth of coverage with 77 variants) of murine leukemia virus (MuLV) was detected in the P3X63-Ag8 cell line. Partial genomes from retroviruses such as avian leukosis virus (ALV) and porcine endogenous retrovirus (PERV) were detected in some cell lines.

Sequencing reads covering 39.5% of the CSFV genome were found in the cell line IB-RS-2 Clone D10, with seven total variants between the mapped reads and the reference genome. This clone was originally determined to be free of CSFV contamination28; however, testing of this cell line obtained from the American Type Culture Collection (ATCC) by Bolin et al.35 detected the virus in this clone. The entire CSFV genome was also found in the same cell line used by the Pirbright Institute, UK (Don King, personal communication).

Conclusion

Cell line authentication is important for the reproducibility and accuracy of research and diagnostics involving cell lines as it can help identify unexpected errors and contamination in archived material and cell lines obtained from other sources. This study confirmed the species identity of 63 cell lines that are available at the Canadian Food Inspection Agency’s National Centre for Foreign Animal Disease. Of these cell lines, five were previously undefined, eight were determined to be derived from a different species within the same genus than was expected, and two were identified as species from different genera than expected. The methods described in this study or other comparable methods can be useful as they provide a single approach for species identification, as well as for the detection of contamination (e.g., mycoplasma) or the presence of unexpected viruses.