Whole-exome sequencing of the mummified remains of Cangrande della Scala (1291–1329 CE) indicates the first known case of late-onset Pompe disease

Mummified remains of relevant historical figures are nowadays an important source of information to retrace data concerning their private life and health, especially when historical archives are not available. Next-generation-sequencing was proved to be a valuable tool to unravel the characteristics of these individuals through their genetic heritage. Using the strictest criteria currently available for the validation of ancient DNA sequences, whole-genome and whole-exome sequencing were generated from the mummy remains of an Italian nobleman died almost 700 years ago, Cangrande della Scala. While its genome sequencing could not yield sufficient coverage for in depth investigation, exome sequencing could overcome the limitations of this approach to achieve significantly high coverage on coding regions, thus allowing to perform the first extensive exome analysis of a mummy genome. Similar to a standard “clinical exome analysis” conducted on modern DNA, an in-depth variant annotation, high-quality filtering and interpretation was performed, leading to the identification of a genotype associated with late-onset Pompe disease (glycogen storage disease type II). This genetic diagnosis was concordant with the limited clinical history available for Cangrande della Scala, who likely represents the earliest known case of this autosomal recessive metabolic disorder.


Results
WGS performance and authentication of ancient DNA. DNA extracted from the mummified remains of Cangrande della Scala (right intermediate cuneiform bone and liver) was used for three exploratory WGS experiments. Two WGS datasets were prepared from bone DNA, one including partial uracil-DNA glycosylase treatment and one with no treatment. A third dataset was prepared from liver DNA to assess the degree of DNA preservation. The percentage of human sequence content in all three samples was compared by alignment to the human reference genome (Table 1). This revealed an extremely low percentage of mapped fragments from the liver sample (0.06%) but a much higher percentage from the bone sample (23%).
The fraction of human DNA recovered from the cuneiform bone sample was consistent with values obtained from other small foot bones (phalanxes) in human remains of similar age 7 . We carried out several tests to confirm the ancient nature of the sequenced human DNA and to exclude modern DNA contamination. The tests revealed that the library sequences showed features typical of degraded DNA, such as an average length of ~ 90 bp without uracil-DNA glycosylase treatment and 86 bp with partial uracil-DNA glycosylase treatment. Furthermore, the frequency of 5′ cytosine deamination in the library prepared without uracil-DNA glycosylase treatment was consistent with the age of the sample 8 (Supplementary Table S1, Supplementary Figs. S1 and S2). Complete mitochondrial genomes could be reconstructed from both libraries (mean coverage of 11.42 × and 8.22 × respectively) and both sequences were unambiguously assigned to the same mitochondrial haplogroup (X2b11) with a maximum score (Supplementary Table S2). Finally, present-day human DNA contamination in mitochondrial sequences, estimated using two different methods, did not exceed 1% (Supplementary Table S2). Genetic sex determination revealed Ry values of 0.0885 and 0.0887 for the libraries prepared with and without uracil-DNA glycosylase treatment, respectively, confirming that the bone DNA belonged to a male individual (Supplementary Table S3).
Exome capture and WES performance. Having confirmed satisfactory DNA preservation, we used the bone sample for WES to enrich for human sequences and achieve good coverage for variant calling at a reasonable cost. To limit the effect of DNA damage on variant calling, whole-exome capture was restricted to the library prepared using partial uracil-DNA glycosylase treatment. Quality control of the enriched library revealed a size range of 116-769 bp (average = 277 bp) and a concentration of 4.68 ng/µl (Fig. 1).
The WES library was sequenced twice, yielding 62 million and 77 million fragments, respectively ( Table 2). The mapped coverage on the exome target was on average 29.35 × in the first run, with a large proportion of PCR duplicates (51.72%). This increased slightly to 31.83 × in the second run, but again the low coverage reflected the large proportion of PCR duplicates (55%). When the two runs were combined, the mapped coverage was 37.28 × and the percentage of duplicates increased to 65%, indicating that library saturation had been achieved. Despite the low average coverage, the high uniformity of enrichment (FOLD80 penalty ≥ 1.41) allowed us to cover 99.16% of the exome target with at least 10 reads (Twist design) and to genotype 93.38% of the target bases (Table 2).
Exome data authentication and estimate of nuclear DNA contamination. The misincorporation patterns and fragment sizes of the WES data ( Supplementary Fig. S3) were consistent with the profiles derived from the WGS data ( Supplementary Fig. S2): 2.17% C > T transitions at the first 5' base and an average fragment length of 100 bp, approximately 10 bp longer than the pre-capture size, in agreement with earlier studies [9][10][11] (Supplementary Table S4). The WES data were also used to estimate putative nuclear DNA contamination. Taking advantage of the male genetic sex of the sample, we measured the heterozygosity observed at 1344 polymor- Table 1. Exploratory WGS analysis, indicating the human sequence content in three DNA samples. The table shows the number of sequenced reads, the GC percentage, the number of mapped reads before and after removing PCR duplicates, the percentage of fragments mapped to the human reference genome, and the average coverage of the genome.

Sample ID
No. of merged reads GC content (%)  Tables S5 and S6), those classified in HGMD and/or ClinVar were investigated in detail. This reduced the priority list to 210 ClinVar variants and 179 HGMD variants, 140 of which were common to both databases. Subsequent analysis focused on 113 rare clinical variants associated with more severe diseases. They included two heterozygous missense mutations in two exons of the GAA gene associated with the autosomal recessive phenotype "late-onset Pompe disease". The first variant (c.1465G>A) was classified as "pathogenic/likely pathogenic" in ClinVar and damaging (DM) in HGMD, whereas the second (c.271G>A) was classified as "benign/likely benign" in ClinVar and likely damaging (DM?) in HGMD. Moreover, we identified additional variants in genes correlating with the regulatory activity of lysosomal enzymes 13 : besides 28 variants which were common in the frequency population databases (MAF > 5%), six rare exonic missense variants were present in the ATP6 gene and one rare exonic missense variant was present in the RUNX1 gene.
Phasing the GAA variants. Chromosome-wide phasing was used to determine the cis/trans phase of the two heterozygous genotypes in the GAA gene, considering all the variants on chromosome 17. Only one of the two variants of interest (c.271G>A) could be correctly phased, whereas the other (c.1465G>A) was discarded by  www.nature.com/scientificreports/ the algorithm because it was missing from the reference population. To exclude the possibility that the two variants did not constitute a haplotype, we assessed the longest haplotype in the reference population carrying variant c.271G>A. The aim was to confirm that the haplotype containing c.271G>A spanned over the c.1465G>A variant position, which could be inferred as a reference for this haplotype in the population. This allowed the identification of a set of variants consistently inherited together with c.271G>A in 98% of reference individuals, namely in a linkage disequilibrium block more than 14 kb in length (chr17:80,096,549-80,110,889) and spanning over the c.1465G>A variant position. It is therefore very unlikely that the latter variant was inherited with c.271G>A on the same chromosome. Moreover, the frequency of the two variants differed in the general population frequency databases, confirming the low probability of both being inherited together (c.271G>A allele frequency = 3% in all the population databases, c.1465G>A allele frequency reported in two databases only, with values of 0.0009% and 0.0015%, respectively). These data support the hypothesis of a compound heterozygous genotype (variants affecting the GAA gene in trans) thus confirming the diagnosis of late-onset Pompe disease.

Discussion
We have described the successful clinical analysis of a 700-year-old human mummy by WES and demonstrated that exome enrichment applied to ancient human DNA can lead to a genetic diagnosis that may help to support historical data. By integrating WGS and WES, we confidently assessed the authenticity of the data and excluded significant bias caused by contamination with modern human DNA. Next-generation sequencing for the study of ancient samples has mainly targeted small regions, such as the mitochondrial DNA (mtDNA), the Y-chromosome DNA (Y-DNA) 14 or specific SNPs of interest 15,16 . Only a few "high-quality" ancient human WGS studies have been reported, including a Denisovan 17 (30 × mean coverage) and two Neanderthal individuals (52 × and 27 × mean coverage) 18,19 investigated mainly to discover gene flow events and admixtures of archaic hominins. Typically, low-coverage genomes are used to investigate genetic diversity at the population level and provide phylogenetic information 20,21 . Considering the low quantity of human material present in Cangrande's samples, which would have required an abnormal sequencing effort, we considered WES as a much more affordable technology to eliminate environmental DNA contaminants and enrich for protein-coding regions, which are mainly responsible for the development of Mendelian disorders 22 . WES has already been applied in a limited number of studies, although these did not focus on the functional consequences of genetic variants and their association with clinical phenotypes 23,24 .
WES performed on samples from the mummified remains produced 37 × coverage of the protein-coding regions. The genotypability values and number of identified variants were almost comparable to the WES analysis of modern humans 25 . We identified two clinically relevant compound heterozygous variants in the GAA gene associated with late-onset Pompe disease, which could very well explain Cangrande's disease and death.
Contemporaneous sources on Cangrande's life provide little verifiable information about his health. These documents are also difficult to interpret because the authors were either very close to the court or enemies of the Scaligeri. Most accounts of Cangrande's childhood are laudatory and are based on literary topoi. They describe a child who did not like traditional games or the company of peers, but was predisposed to military life. We know from the pro-Scaliger writer Ferreto Ferreti and from the Paduan enemy Albertino Mussato that Cangrande, even in combat on horseback, preferred to use the bow in the Parthian manner rather than the spear or sword, allowing his arm more freedom of movement 26,27 . Cangrande was reportedly ill for some time at the age of 23 but recovered enough for battle after imbibing a small dose of an antidote (not better defined) and a sip of wine 28 . The enemy chronicler Mussato attributed the discomfort to one foot, which prevented the Scaligero from riding 27 . He was forced to abandon his horse and accept the draft horse of a peasant 28 . At the age of 29, Cangrande was reportedly pierced by an arrow in the thigh 29 , but managed to return to camp, rally his troops and return to fight 30 . The autopsy of Cangrande in 2004 did not reveal any wounded limbs, suggesting that Cangrande instead suffered crippling thigh discomfort, possibly a cramp. In his next battle he once again had to abandon his horse and accept one from a peasant 31 . At the age of 34, Cangrande fell seriously ill for a long time and was given up for dead 32 . Cangrande died at the age of 38, after showing symptoms of malaise for 3 days identified as fever and a generic fluxum, which can be translated as vomiting or perhaps hemorrhage 31 . Some sources report fluxus ventris or intestinal disease with diarrhea, which was disproven in the 2004 autopsy by the discovery and examination of solid fecal matter at the base of the rectum. Cangrande retained his mental clarity at the end of his life, enabling the completion of an important juridical document 32 .
The clinical spectrum (infantile, juvenile and adult-onset) of Pompe disease is a continuum, depending on the residual activity of α-glucosidase. In late-onset forms, the storage of glycogen is confined to the skeletal muscle, heart and liver. The clinical manifestation includes skeletal muscle weakness, respiratory distress due to diaphragm and accessory respiratory muscle weakness, muscle cramps, spontaneous bone fractures and arrhythmogenic cardiopathy (but normal intelligence). This is entirely consistent with the Cangrande's three episodes of severe weakness after exercise as reported in the historical records, and his death after 3 days of sickness, but with no mental impairment. No scar was found on his thighs, supporting the hypothesis that the arrow wound reported in 1320 was in fact a severe muscle cramp. Given the finding of digitalis in his body 4 , it is possible that it was administered to counteract tachycardia, a key symptom of cardio-respiratory insufficiency, and would represent the first known clinical use of this drug.
The first variant (c.1465G>A) found in Cangrande's GAA gene is described in patients with late-onset Pompe disease and fully inhibits α-glucosidase maturation and activity 33 . The second variant (c.271G>A), known as GAA *2 and reported as likely damaging in the human database of HGMD, is frequent in the Caucasian English population (0.03) 34 . The biochemical phenotype associated to this genotype shows reduced activity toward the natural substrate (glycogen) but normal activity toward the artificial substrate 4-methylumbelliferil-α-glucopyranoside. The α-glucosidase encoded by the GAA *2 allele has a K m for glycogen tenfold higher than the wild-type enzyme www.nature.com/scientificreports/ and enzymatic activity towards glycogen that is 1/10 of normal 34 . To explain the absence of this frequent allele in their 15 late-onset cases, Swallow et al. 34 cited a modification of the Michaelis-Menten equation 35 where the increase in K m is offset by an increase in substrate concentration and therefore they did not classify variant GAA *2 as a disease-causing allele, based on the assumption that lysosomal pathology is due to encumbrance. Since in our case such variant was instead present in trans-configuration with a pathogenic allele producing no enzyme activity (c.1465G>A), we concluded that Cangrande had a α-glucosidase with just 10% of normal activity. While abrogation of α-glucosidase enzymatic activity is causative of the classical infantile form, such condition is instead responsible of the late-onset Pompe disease, associated with α-glucosidase activity lower than 20% 36,37 .
Additional genes have been recently demonstrated to play a role as genetic modifiers of lysosomal functions 13,38,39 . Among these we found that Cangrande genome carried six rare missense variants in ATP6 (controlling the pH of the lysosomal compartment) 38,39 , while RUNX1 (involved in autophagy/lysosome metabolism) 13 had one rare missense variant. Beside the GAA alleles causative of the late-onset Pompe disease, these other genetic variants could also contribute to Cangrande's clinical phenotype. To remove potential contaminants, the outer layer of the bone sample was brushed with disposable tools and irradiated with ultraviolet light (254 nm) for 45 min in a Biolink DNA Crosslinker (Biometra). Bone powder was then collected from the densest part of the bone using a low-speed dental micromotor equipped with disposable tungsten carbide ball burrs. DNA was extracted from 50 mg of bone powder using a silica-based protocol that allows DNA molecules to be recovered efficiently even if highly fragmented 9 . In the final step, DNA was eluted twice in 50 µl TET buffer (10 nM Tris, 1 mM EDTA, 0.05% Tween-20). DNA was extracted from 50 mg of mummified liver tissue using the QIAamp DNA mini kit according to the manufacturer's recommendations (Qiagen).

Whole-genome library preparation and sequencing. Sequencing libraries suitable for Illumina plat-
forms were prepared from 20 µl of DNA extracted from bone or liver tissue following a protocol optimized for ancient samples 42 . The resulting data were used to evaluate the deamination rate to confirm the authenticity of the genetic material. A partial uracil-DNA glycosylase treatment 43 was applied to an additional 30-µl aliquot of DNA extracted from bone. This treatment removes internal uracil residues and abasic sites, reducing the probability of errors during variant calling. A unique combination of two indices per library was used for barcoding. Libraries were sequenced in 150-bp paired-end mode on a NovaSeq 6000 instrument (Illumina) to generate an average 1 × coverage of the entire genome.
Bioinformatics analysis of WGS data: preservation and contamination estimates, molecular sex determination, and mitochondrial genome reconstruction. Sequences were demultiplexed and sorted according to the indices, and raw sequence data from all three libraries were analyzed using an established pipeline 44 . Adapters were clipped-off and reads with a minimum overlap of 10 bp were merged in a single sequence using Clip&Merge v1.7.4. Merged reads were then mapped onto GRCh38 using BWA v0.7.17-r1188 45 , setting parameters to improve the accuracy of ancient DNA reads (-l = 16,500, -o = 2 and -n = 0.01) 46 . Only reads with a map and base qualities score ≥ 30 were retained. Reads mapped onto the human genome were authenticated by deamination and fragmentation pattern analysis using mapDamage2.0 47 . Molecular sex determination was applied to the bone sample by comparing the number of alignments to the Y chromosome and the total number of alignments to the X and Y chromosomes in the libraries prepared with and without uracil-DNA glycosylase 48 . The mitochondrial genome was reconstructed from the same libraries to assess overall DNA preservation and contamination. Reads mapping to the mitochondrial genome were extracted from BAM files using SAMtools v1.7 49 . For the library prepared without uracil-DNA glycosylase, the Schmutzi pipeline 50 was used to call the consensus sequence and to evaluate the level of contamination with present-day human DNA 51 . For the library prepared using a partial uracil-DNA glycosylase treatment, the consensus sequence was called using mpileup and vcfutils.pl in the SAMtools package. To estimate the ratio of contaminant/authentic DNA in the mitochondrial sequence data, a likelihood-based method was used as previously described 52 . The mitochondrial haplogroups were assigned according to PhyloTree build 17 53 using Haplogrep2 54 .
Exome enrichment and WES. The DNA library prepared from bone was captured using the Twist Bioscience Human Core Exome Kit + RefSeq v1.3 protocol. Single-plex exome capture was carried out using 120-bp biotinylated probes, with minor modifications to the standard protocol due to the high level of sample degradation. All the available material from the library was used, although the DNA input requirements of the standard protocol were higher than the sample's initial input (300 ng instead of 500 ng). After the washing steps to remove www.nature.com/scientificreports/ nonspecific targets, the remaining material was eluted in 22.5 µl of water, without keeping the backup slurry. Ten cycles of amplification were performed rather than the eight cycles recommended by the protocol. The final PCR cleanup was carried out using a 1.5 × ratio of Twist Bioscience Beads. The enriched library was validated using a Tape Station 4150 High Sensitivity D1000 assay kit (Agilent Technologies) and quantified by RT-PCR using the Lib Quant kit (Roche). WES was then performed on a NovaSeq 6000 instrument in 2 × 100-bp paired-end mode.
Bioinformatics analysis on WES data: read alignment, variant calling, and data authentication. The WES FASTQ files were quality checked using FastQC (http:// www. bioin forma tics. babra ham. ac. uk/ proje cts/ fastqc/). Adapters and low-quality bases were removed, and reads were aligned with the human reference genome (GRCh38/hg38) using the Paleomix bam_pipeline v1.2.13.8 55 with BWA-mem v0.7.17 45 and a disabled "-collapse" parameter to properly calculate the insert size for all the sequenced fragments. Duplicated reads were removed using Picard MarkDuplicates v2.21.1. GATK Base Recalibrator v4.1.8.1 56 and BamUtil clipoverlaps v1.4.14 were then applied to adjust base quality and soft-clip overlapping reads. Coverage and genotypability metrics were calculated using CallableLoci in GATK v3.8. The FOLD80 penalty value was calculated using Picard CollectHSMetrics v2.21.1 (http:// broad insti tute. github. io/ picard/). Variants were identified using GATK HaplotypeCaller v4.1.8.1 (with parameter "-dont-use-soft-clipped-bases" set to "true"), producing a gVCF file. Variants were then recalibrated and filtered by following the GATK Hard Filtering Best Practices. The authenticity of the WES data was estimated as previously described for WGS. Additionally, nuclear contamination was estimated by measuring the heterozygosity of the X chromosome 57 using the ANGSD pipeline 58 . Because males have only one copy of the X chromosome, any heterozygosity on this chromosome in males indicates contamination.

Variant annotation, prioritization and phasing. The gVCF file was annotated using Golden Helix
VarSeq v.2.2.1 (Golden Helix) and the following databases: ClinVar and HGMD Professional v2020.1 59 were used to investigate the clinical significance of identified variants, whereas the population frequency databases of 1000Genomes Project Phase3, gnomAD v2.0.1 and ESP6500 v2 were used to determine the frequency of variants. Similarly, an internal database was used to flag rare Italian variants. Several prediction tools were used to calculate the pathogenicity scores for each genetic variation (FATHMM, GERP, Polyphen, SIFT, PhastCons and PhyloP) and the RefSeq Genes database was integrated to provide the effect of each variant. Variants were then prioritized using two pipelines. The "annotated variants" pipeline retained only those variants classified as "Pathogenic", "Likely Pathogenic", "Conflicting", "Uncertain Significance" or "Other" in ClinVar, or classified as "DM" or "DM?" in HGMD. Variants with an alternative allele frequency below 5% in the population frequency databases were flagged. The "predicted coding variants" pipeline retained variants present in exonic or splicing site regions but without reported clinical significance in ClinVar and HGMD. Variants with an alternative allele frequency below 1% in the population frequency databases were flagged, focusing on those with: (1) a "LOF", "Missense" or "Splice_region_variants" effect in the RefSeq database; (2) a predicted "Damaging" effect by three or more prediction tools applied to the dbNSFP database (SIFT, Polyphen2, MutationTaster, MutationAssessor and FATHMM); and (3) a frequency below 2% in the Functional Genomics Variant database. Phasing of WES reads was accomplished using Eagle v2.4.1, with the provided hg38 genetic map and the reference panel from gnomAD v3.1 "HGDP + 1 KG callset" (https:// gnomad. broad insti tute. org/ downl oads# v3-hgdp-1kg).

Data availability
The WES variants data are available for download from our public repository using the link: http:// ddlab. sci. univr. it/ files/ Cangr ande/ Cangr ande. tar. gz (VCF file with associated BED files of callable regions).