DNA methylation profiling in mummified human remains from the eighteenth-century

Reconstruction of ancient epigenomes by DNA methylation (DNAm) can shed light into the composition of cell types, disease states, and age at death. However, such analysis is hampered by impaired DNA quality and little is known how decomposition affects DNAm. In this study, we determined if EPIC Illumina BeadChip technology is applicable for specimens from mummies of the eighteenth century CE. Overall, the signal intensity on the microarray was extremely low, but for one of two samples we were able to detect characteristic DNAm signals in a subset of CG dinucleotides (CpGs), which were selected with a stringent processing pipeline. Using only these CpGs we could train epigenetic signatures with reference DNAm profiles of multiple tissues and our predictions matched the fact that the specimen was lung tissue from a 28-year-old woman. Thus, we provide proof of principle that Illumina BeadChips are applicable for DNAm profiling in ancient samples.

www.nature.com/scientificreports/ isolated DNA revealed that particularly in the sample from A.C.B. the read length was very short, which might indicate higher fragmentation (Supplemental Table S1   www.nature.com/scientificreports/ more than 850,000 CpGs by two different probe designs: type I assays consist of two bead types per CpG locus (one for unmethylated and one for methylated sequences), whereas type II assays have only one bead type that incorporates different fluorescently labeled nucleotides depending on the methylation status 12 . Notably, only approximately 320 ng and 220 ng of total DNA were hybridized, which only corresponds to about 3.1 ng and 1.4 ng of human DNA in the sample of A.C.B and T.H., respectively. This amount is well below the recommended amount of 250 ng and also less than in other studies that successfully tested dilutions with 50 ng 4 . Thus, the initial analysis revealed overall extremely low signal intensities. The signals were slightly higher for the mummy of T.H. than for A.C.B., despite the even lower fraction of human DNA (Supplemental Fig. S2a). Quality control probes for sample preparation steps also revealed that especially the sample of A.C.B. failed most of the quality thresholds (Supplemental Table S2). Furthermore, density plots for DNAm levels (β-values) did not show the typical bimodal distribution of methylated and unmethylated CpGs (Supplemental Fig. S2b). Yet, we observed small peaks at low and high DNAm levels for the T.H. sample, particularly for type I assays, indicating that a subset of CpGs might provide useful methylation signals (Fig. 1f). To filter for such CpGs, we used the SeSAMe package 17 to select 23,875 CpGs of the type I assay with the lowest detection P-values (P < 0.01; Fig. 1g).
When we compared DNAm levels in this subset of CpGs, we observed a clear correlation between DNAm levels of T.H. and normal lung-tissue (Fig. 1h, 2a). To further estimate the cellular composition, we used a previously published reference methylation atlas of 25 human tissues and cell types, which utilizes 7890 CpGs for deconvolution of cell types 4 . However, only 578 of these CpGs were comprised in our subset. Despite this limitation, most of the tissues in our dataset collection could be correctly assigned to the corresponding tissue. However, this did not work reliably for lung tissue samples and for the lung specimen from T.H., possibly due to the small number of remaining CpGs in the signatures and due to the fact that the lung-specific CpGs were only selected with lung epithelial cells and not whole tissue (Supplemental Fig. S3). Alternatively, we trained a new tissue deconvolution model that only utilizes the detected CpGs. To this end, we split the DNAm dataset collection into a training and validation set (Supplemental Table S4). For each of the nine tissues, we selected the top ranked candidate CpGs with tenfold cross-validation as described in our previous work 3 (Supplemental Fig. S4). The mean DNAm levels in the nine tissues of the training dataset were then used as reference matrix for a non-negative least squares (NNLS) deconvolution algorithm. Overall, the predicted tissues corresponded to the real tissues in both the training and the validation set (Supplemental Fig. S5). Using this predictor, the normal lung tissue was merely predicted as lung tissue and very similar results were observed for the tissue sample of T.H., indicating that the DNAm profile still reflects the tissue of origin (Fig. 2b). For sexanalysis with DNAm profiles the conventional methods (R packages minfi, Ewastools, sEst, and watermelon) were not reliably applicable with the reduced set of detected CpGs. However, the percentage of detected CpGs was higher on the X than on the Y chromosome, which is in line with female samples (Supplemental Fig. S6).
Epigenetic age predictions. Subsequently, we wanted to analyze if the DNAm was still indicative for age at death. To this end, we used a multi-tissue age predictor described by Horvath 7 . Only seven CpGs of this 353 CpGs epigenetic clock were within the detected 22,778 CpGs. Therefore, we have retrained an epigenetic age predictor based on these seven CpGs (Supplemental Table S5) and age-predictions with this reduced aging signature correlated with the chronological age for samples of the training (R 2 = 0.71) and validation set (R 2 = 0.43). Notably, the sample of T.H. was predicted to be 33 years old-and thus very close to documented age of 28 years (Fig. 2c).

Discussion
Our exploratory study demonstrates that it is possible to address DNAm in human remains from the eighteenth century with Illumina BeadChip microarrays. At first sight, the samples failed the quality control step and would have normally been excluded from further analysis. It was necessary to adjust processing and the sample-specific analysis pipeline to the very low signal intensities. We chose ssNoob normalization (minfi) 18 together with detection P-value filtering based on the SeSAMe package 17 . This provided a subset of CpGs with similar DNAm levels as observed in freshly isolated tissue. It is still unclear, why this was not possible for the mummy of A.C.B., albeit the detected fraction of human DNA was higher and C to T substitution frequency was lower. This might be partly attributed to higher fragmentation as indicated by the shorter read length in the shotgun sequencing results. The corpse of A.C.B. revealed very high levels of mercury sulfide, possibly for syphilis treatment, and it is conceivable that such conditions impact on DNAm measurements 19 .
In an elegant study, Pedersen et al. utilized the CpG to TpG substitutions at the start of sequencing reads to estimate DNAm levels in a 4000-year-old Paleo-Inuit sample 20 . In comparison to public Illumina 450 k BeadChip profiles, they could demonstrate that the estimated DNAm levels clustered with patterns of hair follicle-corresponding to the tissue used for DNA extraction 20 . Yet, this approach to detect cytosine methylation depends on the extend of post-mortem deamination rates and ultimately on the sequencing depth. In comparison, our www.nature.com/scientificreports/ method is relatively cost effective and facilitates a more direct comparison with other DNAm profiles that were generated on the same Illumina BeadChip platform. On the other hand, the bioinformatics requirement to train predictors specifically for a small subset of CpGs should not be underestimated. The low number of remaining CpGs reduces applicability. Our age-predictions were based on only seven out of 353 CpGs of the Horvath aging clock 7 , which we chose because it was trained for multiple different tissues. However, with a larger reference dataset available it might be advantageous to use all CpGs that passed the filter criteria and to establish an independent multi-tissue age-predictor for a given specimen. Furthermore, the impact of postmortem deamination needs to be considered, albeit the percentage is relatively low and enriched at the 5′ ends. It is yet unclear if the lower signal intensity is only a result of the low amount of DNA or rather to the low fraction of human DNA. It might be possible to use hybrid capture to enrich the entire human genome, therefore reducing the contamination of "foreign DNA". Alternatively hybrid capture could even be used with a relatively small panel of regions that are important for tissue deconvolution and/or epigenetic age estimation 21,22 .
The findings of this study may also be relevant for forensics, since only few studies have analyzed DNAm changes in corpses at different stages of decomposition. We have recently described that age-associated DNAm changes in PDE4C that were analyzed in buccal swabs at a post mortem interval of 1 to 42 days were hardly affected by early decomposition 23 . Here we show that conventional microarray-based methods for DNAm profiling may even be applicable in mummified corpses after centuries. Recently, there is much progress in epigenetic biomarker development for lifestyle habits (e.g. smoking 24 or diseases (e.g. malignant diseases 25 ). Thus, DNAm analysis in ancient corpse might also shed additional insight into habits and prevalence of diseases in former days. However, the broad applicability is limited, since we were only successful for one of two samples. It will be necessary to further validate this approach with larger sample collections of multiple different mummies to www.nature.com/scientificreports/ better understand the relevant parameters and reliability of the results. Furthermore, it will be interesting to determine if this method is even applicable to much older remains.

Methods
Sample collection. We analyzed lung tissue specimens from the mummified human remains of Terézia  26 . A total of 265 well documented mummified individuals were discovered during reconstruction works in the Dominican church of Vác, Hungary, between 1994 and 1995. Constant temperatures (8-11 °C) in the crypts in combination with continuous ventilation led to a natural mummification of the buried human bodies in their coffins. The corpse of Anna Catharina Bischoff was discovered during reconstruction works inside the Barfüsserkirche in Basel, Switzerland, in 1975 in one of the excavated coffins. The mummy was identified in a multidisciplinary research project 15 and during a sampling campaign in 2016 gut tissue material has been taken (EURAC ID: 2132). This study was conducted with approval of the museum collections and based on the International Council of Museums code of ethics (ICOM 2017) with regard to storage, display, and study of human remains. The study team carefully considered ethical issues and the appropriateness of the research involving human mummies, as human remains have to be considered not as 'objects' but as the remains of once-living people 27 . Several tissues of the mummies were initially probed, but for this exemplary analysis we simply used available tissue specimen. Lung tissue of T.H. was extensively probed for the tuberculosis study 26 .
Extraction of DNA, library preparation, and sequencing. Sample preparation and DNA extraction was performed in a dedicated pre-PCR area with protective clothing, UV-light exposure of the equipment, bleach sterilization of surfaces and filtered pipette tips at the ancient DNA laboratory of the EURAC Institute for Mummy Studies in Bolzano, Italy. DNA was extracted from the tissue samples using a chloroform-based DNA extraction method-a method known to remove efficiently inhibitory substances and that has been previously already successfully applied to other mummified specimen 28,29 . Libraries for the sequencing were generated 30,31 and 100-base pair paired-end sequencing was performed on an Illumina HiSeq2500 platform 28 . Paired Illumina reads were quality-checked and processed (adapter removal and read merging) using the SeqPrep tool. Preprocessed reads were mapped to the human genome (build Hg19 32 , default mapping parameters) using bowtie2 and the parameter "end-to-end" 32,33 . To deduplicate the mapped reads, we used the DeDup tool 34 . The minimum mapping and base quality were both 30. The resulting bam files were used to check for characteristic aDNA nucleotide misincorporation frequency patterns using mapDamage2 35 . A general taxonomic profile of the sequencing reads was assessed using DIAMOND blastx search against the NCBI nr database (Release 237, April 2020). The DIAMOND 36 tables were converted to rma6 (blast2rma tool) format (minPercentIdentity 97), imported into MEGAN6 software 37 , and subsequently visualized using the Krona tool 38 .
DNA methylation analysis. Genomic DNA (approximately 320 ng total DNA for the specimen of A.C.B, and 220 ng for T.H.) was bisulfite converted according to the manufacturer's instructions and hybridized with the Illumina EPIC methylation microarray (at Life and Brain GmbH, Bonn, Germany). We did not perform additional quality control before hybridization to reduce further loss of the specimen. For comparison, we compiled 301 Illumina EPIC methylation profiles of nine different tissues from 13 different studies (Supplemental  Tables S3, S4). The IDAT files of the Illumina BeadChip datasets were normalized with ssNoob using the minfi R package. Samples with bad quality and three colon samples suspected to be mislabeled were excluded. To select of probes with reliable signal within the ancient DNA samples, we used the detection P-value with out-of-band (OOB) array hybridization (pOOBAH) approach, in the R package SeSAMe 17 . We selected all probes with a P-value < 0.01 in each mummy sample. Normalization control probe pairs, which are based on sequences of housekeeping genes (without any CpGs) are often used for normalization and processing approaches 39 . They enable to correct the dye bias between the red and green fluorescence channel, which only affects type II assay beads. Since our samples show no or just a very low signal for these probes, we focused our analysis on the type I assay probes. We also removed CpG sites associated with the X and Y chromosomes, as well as SNPs, which resulted in a subset of 22,778 CpGs for the T.H. specimen. MDS analysis was performed with the R package limma.

Tissue deconvolution and sex determination.
To estimate the tissue of origin, we initially used a previously described human cell-type methylation atlas 4 . However, only 578 out of 7890 CpGs of these signatures were detected in the T.H. specimen-therefore, only these CpGs were used for a non-negative least square (NNLS) deconvolution algorithm [40][41][42] of the reference dataset and the T.H. sample. For easier comparison with our own selection, we reduced the number of groups from 25 to 10 to only represent the selected tissues in our datasets. Blood subtype predictions were combined to "Leukocytes" and the remaining groups were combined as "Other".
Alternatively, we selected candidate CpGs for each tissue within the subset of 22,778 detected CpGs based on the difference in mean beta values and variance within the groups in the training set with a tenfold crossvalidation 3 . The top 5 ranked CpGs for each of the nine tissues were then implemented into the NNLS algorithm to predict tissue proportions. The mean beta values for each tissue in the training set were used as the reference matrix.
Training of epigenetic age predictor. For epigenetic age prediction, we used age-associated CpGs from a multi-tissue epigenetic clock 7  www.nature.com/scientificreports/ Therefore, new coefficients for a linear prediction model were calculated by fitting a multivariable linear model with the R stats package based on the training samples (Supplemental Table S5).

Software.
A table with all used software, versions and references can be found in the supplemental information (Supplemental Table S6).

Ethics declarations.
All methods were carried out in accordance with relevant guidelines and regulations as indicated above. The experiments were approved by the institute/museum where the mummies belong. This study was conducted according to the International Council of Museums code of ethics (2017) with regard to storage, display, and study of human remains.

Data availability
Raw data of DNA methylation profiles generated in this study were submitted to Gene Expression Omnibus (GEO): GSE169595.