A high-throughput real-time PCR tissue-of-origin test to distinguish blood from lymphoblastoid cell line DNA for (epi)genomic studies

Lymphoblastoid cell lines (LCLs) derive from blood infected in vitro by Epstein–Barr virus (EBV) and were used in several genetic, transcriptomic and epigenomic studies. Although few changes were shown between LCL and blood genotypes (SNPs), validating their use in genetics, more were highlighted for other genomic features and/or in their transcriptome and epigenome. This could render them less appropriate for these studies, notably when blood DNA could still be available. Here we developed a simple, high-throughput and cost-effective real-time PCR approach that distinguishes blood from LCL DNA samples based on the relative EBV load and the presence of rearranged T-cell receptors γ and β. Our approach achieved 98.5% sensitivity and 100% specificity on DNA of known origin (458 blood and 316 LCL DNA). It was further applied to 1957 DNA samples from the CEPH Aging cohort comprising DNA of uncertain origin, identifying 784 blood and 1016 LCL DNA. A subset of these DNA samples was further analyzed with an epigenetic clock, indicating that DNA extracted from blood should be preferred to LCL DNA for DNA methylation-based age prediction analysis. Our approach could thereby be a powerful tool to ascertain the origin of DNA in old collections prior to (epi)genomic studies.

www.nature.com/scientificreports/ studies 16 . In addition to genetic studies, LCLs have also been used as a surrogate biological material that could be representative of blood in other genomic 17,18 , transcriptomic [19][20][21] and epigenomic 22,23 studies. However, several comparative studies highlighted the presence of modifications in the (epi)genome and transcriptome of LCLs compared to blood due to immortalization and in vitro culture, as well as the absence of representativity of all types of blood cells. These modifications included few mutations [24][25][26] , some copy number variations and chromosomal aberrations 1,27,28 , mtDNA mutations and copy number changes [28][29][30] , frequent DNA methylation variations [31][32][33][34] as well as modification of transcriptomes [35][36][37][38] . As a result LCLs may not completely reflect the tissue of origin and most of these studies have recommended their use with caution in genomic and transcriptomic studies and even more in epigenomic studies 28,[31][32][33][34][35]38,39 . Thus, the use of blood should be preferred to LCLs for these types of studies, notably when blood DNA or RNA samples could still be available.
In this context, our study aimed to develop a simple and efficient high-throughput real-time PCR approach allowing the rapid identification of the biological material from which the DNA was extracted (blood or LCL). The method is intended to be used on large-scale DNA collections as a screening and/or quality control test to validate, ascertain or identify their tissue of origin, i.e. blood or LCL, prior to downstream (epi)genomic studies. The approach is based on the detection of different genetic features specific to either LCL or blood DNA, including the relative quantification of the EBV genome, whose copy number is very high in LCLs, and the detection of rearranged TCR β and TCR γ, which are specific to T-cells in blood. It was developed and optimized using 458 blood samples from healthy donors from the SU.VI.MAX cohort 40 and the French blood bank (EFS) as well as 316 LCL reference DNA samples from CEPH families 11 .
We further applied our tissue-of-origin test on 1957 DNA samples from the CEPH Aging cohort, which was recruited during the years 1990 to 2000 and comprises more than 2000 nonagenarians, centenarians and supercentenarians as well as their offspring 41,42 . The collection includes more than 10,000 DNA samples extracted from blood or LCLs, but this information was dated, uncertain or sometimes missing and needed to be verified or determined. Following the identification of their tissue of origin, we performed DNA methylation-based age prediction on a subset of DNA samples from blood and LCLs using an epigenetic clock based on three loci and pyrosequencing 43,44 and compared the age predictions to their chronological ages. The results confirmed that the use of blood DNA should be preferred over LCL DNA for DNA methylation analyses and that the developed tissue of origin test could be a useful tool for the rapid identification, verification or validation of the DNA origin. It could be easily implemented in biobanks and used along with the other quality controls of DNA on several large scale and/or ancient DNA collections prior to (epi)genomic studies.

Materials and methods
Ethics statement. The study was conducted in accordance with current ethical and legal frameworks and approved by an institutional review board (comité consultatif de protection des personnes dans la recherche biomedicale, CCPPRB Paris-Saint-Antoine, approval No. 00479). Informed consent was obtained from all participants.
Reference blood and lymphoblastoid cell line DNA. DNA 43 . After PCR, 10 µL of amplified product was purified and prepared for pyrosequencing using the pyrosequencing primers and assays described in Ref. 43 and according to the detailed protocol described in Refs. 50
Statistical analysis. GAPDH was used as a control single-copy gene in genomic DNA for the normalization of Ct values. Ct GAPDH/Ct target ratios, where the target is the gene or genome of interest, were calculated for EBV and TCR γ and used to classify DNA samples into three groups (blood, LCL and uncertain origin) using two empirically chosen thresholds. For TCR β, the highest melting temperature (Tm) was used to distinguish between blood and LCL DNA with a single threshold. The sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and accuracy of the three real-time PCR tissue-of-origin tests used alone or in combination were calculated. For each calculation, samples identified as blood were considered positive results, while those identified as LCL or of uncertain origin were considered negative.
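As a minimal sketch of the calculations described above, the following Python function derives the five metrics from a 2 × 2 confusion table in which blood is the positive class and LCL or uncertain origin is the negative class. The function name and the counts are illustrative assumptions: the counts are not reported in the text and were chosen only because they are consistent with the values reported for the combined test on 457 blood and 316 LCL reference DNA samples.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return diagnostic metrics with blood as the positive class.

    tp: blood samples called blood
    fn: blood samples called LCL or uncertain
    tn: LCL samples called LCL or uncertain
    fp: LCL samples called blood
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts (not given in the text), chosen to be consistent
# with the percentages reported for the combined test.
metrics = diagnostic_metrics(tp=450, fp=0, tn=316, fn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Treating uncertain calls as negatives makes the metrics conservative for blood identification: an ambiguous blood sample lowers sensitivity but can never inflate the PPV.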

Results
Strategies for distinguishing blood from LCL DNA. Our study aimed to develop a real-time PCR approach to differentiate DNA extracted from blood from DNA extracted from LCLs. We first searched for genetic features specific to LCL or blood DNA. The first genetic feature relied on the detection of EBV genomes in the DNA, whose copy number is high in LCL DNA (2 to 500 copies per diploid genome equivalent) 52 and low to zero in blood DNA of individuals with no ongoing EBV infection or EBV-associated diseases [53][54][55] . We also searched for other genetic features that could be specific to blood DNA and absent in LCL DNA and identified rearranged T-cell receptor (TCR) genes and extra-chromosomal signal joint T-cell receptor excision circles (sjTREC) that are specific to T lymphocytes 56,57 . As sjTRECs drastically decrease in blood with age until being barely detectable around 80 years old 58 , we focused on rearranged TCR genes from T lymphocytes whose number is maintained throughout life 59,60 . We further restricted our choice to TCR β and TCR γ and excluded TCR δ , as it is known to be frequently rearranged in B-lymphocytes 61 , and TCR α due to the high complexity of this gene locus, which presents a large number of V/J segments, and of its rearrangement 48,62 . Thus, to develop our tissue-of-origin test we decided to focus on three genetic features, i.e. EBV DNA relative load and rearranged TCR β and TCR γ .
EBV real-time PCR assay. For the development of our tissue-of-origin PCR test, we first developed, optimized and evaluated the EBV PCR assay using DNA samples of known origin. DNA extracted from blood was obtained from EFS healthy donors (n = 93) and from healthy individuals of the SU.VI.MAX cohort (n = 364) (see "Materials and methods" and Table 1), while DNA extracted from LCLs was from CEPH reference families (n = 316). 10 ng DNA from blood and LCL were used for this assay as well as for all other PCR assays in order to limit the amount of DNA required for each test. We also used a PCR assay targeting the GAPDH single-copy gene as a control to assess the quantification of our DNA samples and to test their amplifiability. The results showed that its Ct values were comparable across all the tested samples (Supplementary Fig. 1), indicating no quantification bias and/or DNA with extreme degradation. Moreover, the GAPDH Ct values were used to normalize the Ct value of the EBV PCR assay for every tested DNA sample. Figure 1A shows the bimodal distribution of blood and LCL DNA samples according to their Ct GAPDH/Ct EBV ratio. We empirically set two cut-offs for the Ct GAPDH/Ct EBV ratio, with a first threshold at 91, below which all samples were considered as blood (Fig. 1A). Conversely, the origin of a DNA sample was considered as LCL when Ct GAPDH/Ct EBV was higher than 110, which was the second threshold set. Samples whose ratio was between 91 and 110 were classified as samples of uncertain origin. With these thresholds for the Ct GAPDH/Ct EBV ratio, our EBV tissue-of-origin test presented a strong sensitivity, specificity and accuracy (98.5%, 100% and 0.99, respectively, Table 2). Because LCL DNA samples misclassified as blood (false positives) could hinder downstream (epi)genomic analyses, we aimed to exclude them; the EBV PCR test reached 100% PPV, indicating a very high confidence for the identification of DNA extracted from blood (Table 2).

Figure 1. Distribution of Ct EBV/Ct GAPDH ratios, Ct TCRγ/Ct GAPDH ratios and mean TCR β Tm from blood DNA from EFS and SU.VI.MAX (n = 457) and LCL DNA from CEPH reference families (n = 316) using real-time PCR assays. (A) Distribution of Ct EBV/Ct GAPDH ratios based on EBV and GAPDH real-time PCR assays. (B) Distribution of Ct TCRγ/Ct GAPDH ratios based on TCR γ and GAPDH real-time PCR assays. (C) Distribution of mean TCR β Tm based on TCR β real-time PCR assay. The chosen thresholds for each test are given above the frameworks.

Table 2. Calculations of sensitivity, specificity, PPV, NPV and accuracy for tissue-of-origin tests. For our calculations, blood + was considered as the positive result, and LCL + and uncertain origin as negative results.
Tissue-of-origin assay Sensitivity (%) Specificity (%) PPV (%) NPV (%) Accuracy
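The two-threshold rule described for the EBV assay can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and the ratio is assumed to be multiplied by 100, since the reported thresholds (91 and 110) are only plausible for a Ct ratio on that scale and the text does not state the scaling. The Ct values in the usage lines are invented.

```python
def classify_ebv(ct_gapdh: float, ct_ebv: float,
                 low: float = 91.0, high: float = 110.0) -> str:
    """Classify one DNA sample from its GAPDH and EBV Ct values."""
    # Assumption: the Ct GAPDH / Ct EBV ratio is expressed as a percentage.
    ratio = 100.0 * ct_gapdh / ct_ebv
    if ratio < low:
        # EBV amplifies much later than GAPDH: little or no EBV DNA.
        return "blood"
    if ratio > high:
        # EBV amplifies earlier than GAPDH: high EBV copy number.
        return "LCL"
    # Ratios between the two thresholds are left unresolved.
    return "uncertain"


# Invented Ct values, for illustration only.
print(classify_ebv(ct_gapdh=25.0, ct_ebv=35.0))
print(classify_ebv(ct_gapdh=25.0, ct_ebv=20.0))
print(classify_ebv(ct_gapdh=25.0, ct_ebv=25.0))
```

Keeping an explicit uncertain band between the two thresholds avoids forcing a call on borderline samples, which is what protects the PPV for blood.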
TCR γ real-time PCR assay. Similarly to the EBV assay, we developed a second tissue-of-origin real-time PCR assay to distinguish blood from LCL DNA samples based on a second genetic feature that is assumed to be absent in LCL DNA and present in blood DNA, i.e. rearranged TCR γ genes. The TCR γ assay used one primer pair corresponding to V II and J I/II segments and amplifying a large proportion of the recombined TCR γ gene repertoire 47 . Indeed, the small number of V and J segments allowed the use of a limited number of consensus primers, leading to amplification of a majority of rearranged TCR γ genes 47,63 . Ct values for the rearranged TCR γ assay were normalized with Ct GAPDH values to obtain a ratio that also presented a bimodal distribution among blood and LCL DNA samples (Fig. 1B). The calculated sensitivity for blood DNA detection was 94.3% and the specificity was 99.4%, for a PPV of 99.5% and an overall accuracy of 0.96 (Table 2). The TCR γ test thereby presented a slightly lower performance than the EBV assay for the identification of the blood origin of DNA (Table 2).
TCR β real-time PCR assay. For our third tissue-of-origin assay, we considered another genetic feature specifically present in blood but not in LCLs, i.e. the rearranged TCR β gene. The TCR β gene contains many V/D/J variable regions, which are rearranged during the maturation of T lymphocytes. Blood thereby contains a huge diversity of recombined TCR β receptors, which required the use of multiplexed primers for the amplification of a portion of this repertoire. Our selected primers allowed the amplification of the D β1 segment rearranged with any of the J β1 -J β6 segments of the TCR β gene 48 . Because the use of several PCR primers in a single multiplexed PCR reaction generated primer dimers as well as non-specific amplifications, Ct values from blood and LCL DNA samples were close and did not allow the use of a Ct GAPDH/Ct TCRβ ratio to distinguish blood from LCL DNA (Supplementary Figs. 1 and 2). Thus, we chose to look at the melting temperature (Tm) values obtained with melting curve analysis after PCR amplification: Tm values for blood DNA samples were over 89.5 °C, with a low proportion of primer dimers with lower Tm (< 89.5 °C), whereas LCL DNA melting curves presented only Tm values under 89.5 °C, corresponding to primer dimers and non-specific amplification products (Supplementary Fig. 2). When we used the highest Tm obtained for TCR β amplicons, we obtained a bimodal distribution of blood and LCL DNA samples that allowed them to be distinguished (Fig. 1C). We used a threshold of 89.5 °C, which identified blood DNA samples with 98.2% sensitivity, 98.7% specificity, 99.1% PPV and 0.98 accuracy (Fig. 1C, Table 2).
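The melting-temperature rule can be illustrated with a short Python sketch. The function name, the input format (a list of Tm peaks per sample) and the handling of a peak exactly at 89.5 °C are assumptions made for the example; the peak values are invented.

```python
def classify_tcr_beta(peak_tms: list[float], threshold: float = 89.5) -> str:
    """Classify a sample from the Tm peaks of its TCR beta melting curve.

    peak_tms is assumed to be a non-empty list of peak temperatures in
    degrees Celsius. Only the highest peak is used, so low-Tm primer-dimer
    peaks in a blood sample do not affect the call.
    """
    highest_tm = max(peak_tms)
    # Assumption: a peak exactly at the threshold is not counted as blood.
    return "blood" if highest_tm > threshold else "LCL"


# Invented melting peaks, for illustration only.
print(classify_tcr_beta([78.2, 90.4]))  # specific product plus primer dimers
print(classify_tcr_beta([78.2, 84.0]))  # only non-specific products
```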
Combination of the three tissue-of-origin PCR tests strongly excluded false positive blood DNA samples. The three tests described above distinguished blood from LCL DNA samples with high accuracy when used independently (Table 2). However, for further (epi)genomic investigations and applications, we wanted to exclude all false positive blood samples (i.e. LCL DNA misclassified as blood DNA) and also to limit the possible technical and/or biological issues that could arise during PCR experiments relying on a single test. We decided to combine our three tests and to consider a DNA sample as blood when at least two out of the three tests were positive for blood ( Fig. 2 and Table 2). The calculated sensitivity (98.5%), specificity (100%), PPV (100%), NPV (97.8%) and accuracy (0.99) were the best compared to the tests used alone, equaling the values of the EBV assay (Table 2). The specificity and PPV calculated using this combination were of particular interest as they indicated that no LCL sample was misclassified as blood. None of the 316 LCL samples was a false positive ( Fig. 2 and Table 2), validating our approach combining the three tests for accurate identification of DNA extracted from blood.
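The two-out-of-three rule can be written as a simple vote over the three individual calls. This is a minimal sketch: the function name and the string labels are hypothetical, and it only decides whether a sample is called blood, because the text does not detail how the remaining samples are split between LCL and uncertain origin.

```python
def is_blood(ebv_call: str, tcr_gamma_call: str, tcr_beta_call: str) -> bool:
    """Return True when at least two of the three tests call blood."""
    calls = (ebv_call, tcr_gamma_call, tcr_beta_call)
    blood_votes = sum(call == "blood" for call in calls)
    return blood_votes >= 2


# Invented combinations of individual calls, for illustration only.
print(is_blood("blood", "blood", "LCL"))
print(is_blood("uncertain", "blood", "LCL"))
```

Requiring agreement between two independent features means that a single failed or atypical PCR, in either direction, cannot on its own decide the final call.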

Application of our tissue-of-origin test to the CEPH Aging cohort. Our tissue-of-origin test was applied to 1957 DNA samples from 1813 individuals, including 1346 DNA samples isolated from nonagenarians and centenarians (NC group) and 457 DNA samples from the NC group's offspring (NCO group) of the CEPH Aging cohort (Table 1). The information about the origin of these DNA samples was dated, incomplete or missing and needed to be validated or identified. The distribution of NC + NCO DNA samples according to Ct GAPDH/Ct EBV ratio, Ct GAPDH/Ct TCRγ ratio and TCR β Tm showed the typical bimodal distribution, indicating the presence of DNA extracted from blood and LCL in this cohort as expected (Fig. 3A). Using the combination of the three tests, we were able to identify 796 and 1148 DNA samples extracted from blood and LCL, respectively ( Fig. 3B and Table 3), while 12 samples remained of uncertain origin despite one blood positive test. When separating NC from NCO DNA samples, our results indicated that the NCO group presented proportionally more DNA samples extracted from blood compared to the NC group (Supplementary Figs. 3 and 4), probably due to the greater use of DNA samples from the NC group in former genetic studies. We further compared our results to the information available in the CEPH Biobase database and found 99.31% concordance for the 1304 DNA samples whose tissue of origin information was available (Table 3). Moreover, our combined approach enabled the identification of the tissue of origin for 98.93% of the 653 DNA samples whose origin was missing or uncertain according to our database (Table 3). Only 13 out of the 1957 tested DNA samples remained of unknown origin (0.66%, Table 3). Among them, 7 were already uncertain before the test. Taken together, our results validated the information present in the CEPH Biobase database. They also showed the strength of our high-throughput real-time PCR tissue-of-origin tests applied to a large cohort of DNA samples.

DNA methylation-based age prediction is altered in lymphoblastoid cell lines. The epigenetic clock is defined as the modifications of the epigenomes during aging that correlate to the chronological age similarly in every individual 64 . Thus, several DNA methylation-based age prediction biomarkers have been used to develop age-prediction models principally using pyrosequencing [43][44][45] or genome-wide epigenotyping arrays [65][66][67] .
To estimate the age of the samples used in our study and measure the differences in age predictions between blood and LCL DNA, we used the age prediction model of Thong 44 , which is based on DNA methylation of the KLF14, TRIM59 and ELOVL2 promoters and was evaluated as being among the best age prediction models in a previous study 43 . We first evaluated the model on a subset of 24 blood DNA (EFS) and 26 LCL DNA (CEPH families) control samples from individuals aged from 19 to 53 years (Fig. 4A). The results showed that the age predictions from control blood samples were accurate (MAD = 4.2) and strongly correlated to chronological age (R = 0.88), while the age predictions performed very poorly for the control LCL samples (MAD = 25.7, R = 0.19, Fig. 4A). Similarly, when the model was applied to 24 blood and 21 LCL DNA samples from nonagenarians and centenarians' offspring of the CEPH Aging cohort aged from 45 to 79 years, the age predictions performed well for blood samples (MAD = 6.8, R = 0.80), with a slight tendency to underestimate the predicted age, and poorly for LCL samples (MAD = 12.0, R = 0.25, Fig. 4B). These results indicated that DNA methylation and the epigenetic clock are impaired in LCL samples and that such analyses should be performed on blood extracted DNA rather than LCL DNA.
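The two statistics used to compare predicted and chronological ages can be computed as in the following Python sketch. MAD is assumed here to be the mean absolute deviation between predicted and chronological age, and R the Pearson correlation coefficient; the text does not define either abbreviation. The function names and the ages are invented for illustration and are not data from the study.

```python
import math


def mean_absolute_deviation(predicted: list[float], chronological: list[float]) -> float:
    """Mean of |predicted age - chronological age| over all samples."""
    diffs = [abs(p - c) for p, c in zip(predicted, chronological)]
    return sum(diffs) / len(diffs)


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented ages in years, for illustration only.
chronological_age = [20.0, 30.0, 40.0, 50.0]
predicted_age = [22.0, 28.0, 43.0, 49.0]

print(mean_absolute_deviation(predicted_age, chronological_age))
print(round(pearson_r(predicted_age, chronological_age), 2))
```

The two statistics are complementary: a model can correlate well with age while being systematically offset (high R, high MAD), so both are needed to judge a prediction.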

Discussion
The rapid increase in number of genetic and genomic studies in the last thirty years became possible with the development of new high-throughput genotyping and sequencing technologies as well as bioinformatics resources, associated to the reduction of their costs. These studies also required the availability of an ever-growing number of DNA samples that were collected and stored in DNA biobanks or biological resource centers, which also allowed their distribution to the scientific and biomedical community worldwide 68,69 . Thus, several large DNA collections were for the majority established from blood or blood-derived LCLs to provide DNA samples for genetic, genomic and epidemiologic studies 70 . Furthermore, several guidelines, considerations and best practices for biobanking have been proposed aiming to standardize and harmonize the policies and procedures within and between biobanks in order to improve the overall quality and reproducibility of downstream experiments 68,[71][72][73] .
Although extensively used in genetic, population genetic and genome-wide association studies, DNA extracted from LCLs should be used with caution in genomic and more particularly epigenomic studies, as several alterations of their (epi)genomes might arise during immortalization and in vitro culture and might not reflect their cells of origin 28,[31][32][33][34][35]38,39 . Thus, the use of genomic DNA extracted from blood should be preferred over LCLs for (epi)genomic studies, despite the development of bioinformatics tools that might allow the filtering of LCL-specific alterations before data interpretation 27,28 . In some genomic studies such as the 1000 Genomes Project, whole genome sequencing experiments were performed on DNA samples extracted either from blood or LCLs, and some annotations about the tissue of origin could be missing or inaccurate, thereby potentially impacting downstream bioinformatic analyses and the interpretation and significance of the data 39,52 .

Table 3. Concordance between the information present in the CEPH Biobase database and results of real-time PCR tissue-of-origin assays.

In this context, we have developed a rapid and simple high-throughput real-time PCR approach that distinguishes blood extracted from LCL extracted DNA, based on the relative detection of EBV genomes and of rearranged TCR β and TCR γ (Fig. 1). This tissue-of-origin test is intended to be used as a quality control to validate, ascertain or identify the tissue of origin of DNA samples from large or ancient DNA collections prior to (epi)genomic studies. It could be used at the same point in the sample processing workflow as other quality control tests currently used in DNA biobanks before genotyping or sequencing experiments, such as microsatellite marker typing for DNA sample authentication 74 or sex typing for the detection of potential DNA sample misassignment or mix-up 75 .
The use of a GAPDH single-copy gene assay was essential to test the amplifiability of DNA and to normalize the EBV and TCR γ assays (Fig. 1 and Supplementary Fig. 1). The three tests could be used independently as they presented good sensitivity and specificity when used alone (Table 2). However, we recommend their use in combination to identify blood DNA samples with a cutoff of two positive tests out of three (Fig. 2, Table 2). Of note, the use of combined tests is considered as an optimal strategy to increase the testing accuracy and reduce the uncertainty compared to single tests 76,77 . Moreover, each individual test could present some drawbacks that should not be shared by the others, thereby justifying the use of three independent tests. For example, high levels of EBV genomes could also be detected in DNA extracted from the blood of individuals with ongoing acute or chronic EBV infection or EBV-associated diseases [53][54][55]78 , but these health conditions should not impact the results of the TCR β and TCR γ assays. Although it presented the best individual performance with the control samples, the GAPDH/EBV assay could also be less sensitive for blood samples from aged individuals with our set cutoff, as EBV viral load is known to be higher in the elderly 79,80 , which could potentially explain the moderate shift to the right of the blood extracted DNA samples in our results on the CEPH Aging cohort. This tendency was visible when separating NC from NCO samples, which supported our hypothesis (Fig. 3A and Supplementary Fig. 3A and 4A). When applied to 1957 DNA samples of the CEPH Aging cohort using the thresholds defined with the blood and LCL reference DNA samples, our tissue-of-origin test allowed the identification of 796 DNA samples extracted from blood and 1148 DNA samples extracted from LCL, while only 0.66% of DNA samples remained of uncertain origin (n = 13, Table 3).
These results were compared to the information that was mostly but partially present in the CEPH Biobase database revealing more than 99% agreement on the origin of DNA samples between experimental results and CEPH Biobase information (Table 3). Our tests also allowed the identification of tissue-of-origin for 98.93% DNA samples with missing or uncertain information, enabling their use in downstream (epi)genomic experiments.

Finally, to measure the impact of the origin of our DNA samples on epigenetic analyses, we ran an age prediction model using DNA methylation of three CpG sites on about a hundred individuals from control groups and the CEPH Aging collection in order to predict their chronological age (Fig. 4). The age predictions performed well for blood DNA (MAD = 4.2-6.8), similarly to those obtained with DNA methylation-based and pyrosequencing-based age prediction models 43 . Although requiring additional validation, the slight underestimation of the chronological age observed for the blood DNA samples of the CEPH Aging cohort could be of biological and clinical significance (Fig. 4), as the offspring of centenarians were shown to be epigenetically younger and to have lower predicted ages 81 . Conversely, age predictions performed very poorly for LCL DNA (MAD = 12-25.7, Fig. 4). This indicated that the epigenetic clock used was strongly impaired in LCLs and that an age prediction model using as few as three CpG sites could reveal this alteration. Of note, several studies have shown that DNA methylation was altered in LCLs and did not represent the methylome of blood or their cells of origin [31][32][33][34][35] . A few other studies also evaluated age prediction models on LCLs using a high number of CpG sites (> 50) and epigenotyping microarray data and found that the epigenetic clock and age prediction were altered in these cell lines 67,82 . The poorer age prediction performance observed on LCL DNA from CEPH families compared to the CEPH Aging cohort might be attributed to the high number of passages for the former, as DNA methylation alterations were described to be stronger in LCLs with high passage numbers 35 . Taken together, our results and the aforementioned studies indicated that, when possible, blood extracted DNA should be preferred to LCL DNA for DNA methylation and age prediction analyses.

Conclusion
Our study presented for the first time an experimental approach for the identification of the tissue of origin of DNA samples, whether extracted from blood or LCLs. It is intended to be used in large and/or ancient DNA collections to validate, ascertain or identify their origin. We proposed this approach as a quality control test that could be implemented in DNA biobanks and used along with other quality control tests prior to (epi)genomic studies. In our experimental conditions, we evaluated the cost per PCR reaction at 1 euro (≈ 1.2 $) for a total of 4 euros (≈ 4.5 $) per DNA sample for the combined approach, which is cost-effective. We also anticipate the development of additional tissue-of-origin tests that could be applied to DNA from other tissue types or from other nucleic acid types, i.e. RNA, which would further improve the practices for biobanks and contribute to the science of biobanking.