Genomic analysis of the blood attributed to Louis XVI (1754–1793), king of France

A pyrographically decorated gourd, dated to the French Revolution period, has been alleged to contain a handkerchief dipped into the blood of the French king Louis XVI (1754–1793) after his beheading but recent analyses of living males from two Bourbon branches cast doubts on its authenticity. We sequenced the complete genome of the DNA contained in the gourd at low coverage (~2.5×) with coding sequences enriched at a higher ~7.3× coverage. We found that the ancestry of the gourd's genome does not seem compatible with Louis XVI's known ancestry. From a functional perspective, we did not find an excess of alleles contributing to height despite being described as the tallest person in Court. In addition, the eye colour prediction supported brown eyes, while Louis XVI had blue eyes. This is the first draft genome generated from a person who lived in a recent historical period; however, our results suggest that this sample may not correspond to the alleged king.


1-The gourd
The

2-Physical and pathological background of Louis XVI
According to direct testimonies originating in relatives and witnesses -and also some physicians for retrospective diagnoses 1-14 , the French King Louis XVI is described as:

3.2-Genomic DNA
Genomic DNA was quantified using both NanoDropND-1000 (NanoDrop Technologies) and Qubit dsDNA BR (Invitrogen, Life Technologies) and its integrity was assayed with a DNA High Sensitivity Bioanalyzer chip (Agilent Technologies). High discrepancy between Nanodrop and Qubit measurements were observed, and DNA degradation was detected by Bioanalyzer analysis (Supplementary Figure S2A) Figure S2B).

3.4-Genomic sequencing
One out the three library fractions had detectable signal on the Bioanalyzer (Supplementary Figure S2C). These adapter-ligated fragments were then quantified using the KAPA Library  (Supplementary Figure S2D). The library was also sequenced on a MiSeq instrument with a final concentration of 8pM. 6,037,260 reads (51 bp SR) were obtained.

3.5-Exome selection and sequencing
Considering the high fraction of environmental contaminant DNA that is usually present in ancient DNA samples, we performed an exome selection to increase the coverage in the coding regions.

4-Metagenomic analysis
In order to gain insight into species composition of the sequenced sample, a metagenomic analysis was performed. The BLAST results were input to MEGAN v 3.9 19 , a tool for metagenomic analysis which puts the taxonomic classification of analysed sequences into a phylogenetic context. Of 59,040 contigs with blast hits, 57,603 contigs were assigned a taxonomic classification, while 1,437 remained unassigned.
The species tree consisted of two main branches, bacteria and eukaryota. The majority of contigs in the bacterial branch were assigned to Pseudomonas-group genomes. In the eukaryotic branch there were three main sub-branches -plants, human and fungi. The "Plants" sub-branch consisted mainly of Cucurbitaceae (the family which encompasses gourd), while the "Fungi" sub-branch consisted mainly of Aspergillus. Figure 2 illustrates the overall composition of the sequenced sample, as a synthesis of short-read mapping and Megan analysis. In summary, the highest proportion of reads with known origin came from Pseudomonas (46%), followed by human (24%) and Aspergillus (0.6%).
Cucurbitaceae comprised 0.1% of the data. Considering that plant genomes evolve very rapidly and the genome of gourd is not sequenced, it is conceivable that further reads from gourd are contained within the data but could not be identified.
Exome selection using the same library efficiently removed the environmental DNA background in the sample. After this procedure, 84% of sequencing data could be successfully mapped to the exome human genome.

5-Genome mapping and coverage estimates
Our final genomic dataset was a combination of single-end reads and paired-end reads generated with different strategies (exome vs. whole genome) and machines (Illumina GAII, Illumina Hiseq2000, Illumina MySeq). The use of ancient DNA implies a special treatment to deal with the degradation of the template sequences, which very frequently may be too short and thus produce overlapping sequence reads for a given pair and the sequencing of the adapter; importantly, this may cause an overrepresentation of certain bases in the reads and thus bias the SNP calling. Consequently, we used AdapterRemoval 20 in order to both remove the adapter sequence from the reads and build a consensus single-end read from those reads in the same pair that covered the same sequence (Supplementary Tables   Supplementary Table S1). In summary, the BAM files for our SNP calling is composed from

6.1-MtDNA affiliation and contamination estimates
The number of raw reads that aligned to the human mtDNA reference genome (rCRS) was  Table S3). Next, we tried to assign a haplogroup affiliation to the contaminants. We were able to identify 3 different contaminants, which belonged to haplogroups J1c2c2, H1a and K (the last one had been previously identified as contaminant 22 ). We then looked in Phylotree version 15 for mutations that unambiguously distinguish each of the contaminant haplogroups from the other contaminants and the endogenous N1b1a2. For those positions we count the number of reads matching the correspondent contaminant haplogroup in order to estimate the contribution of each contaminant to the overall contamination. On average, J1c2c2, H1a and K contaminants are present in 9.6, 13 and 1.9 % of the reads, respectively.

6.2-X chromosome contamination estimate
We followed a previously developed method 25 that lies on the fact that male individuals carry one X chromosome and thus only one allele at each site. Positions with reads showing more than one allele are a consequence of either errors (sequencing or mapping) or contamination from another individual. Given that contamination will only affect to sites where the sample and the contaminating individual present different alleles, it will lead to a higher mismatch rate in polymorphic sites than in adjacent sites that are less likely to be polymorphic.
Using 1000 Genomes Project Phase 1 data 26 , we identified all the X chromosome SNPs with a frequency of 0.1 or higher in European population. This preliminary set was filtered in order to exclude SNPs less than 10 bases apart.
Using the reads data from the gourd´s genome, we counted the number of major and minor reads at SNP sites and adjacent sites (4 sites on each side of these SNP sites). Major reads are the reads carrying the allele most frequently seen in each site, and the rest are minor reads. We only analyzed sites with a minimum depth of 4 and a maximum depth of 10 in the gourd´s reads. Following  Table S5). This confirms the previous Y-chromosome attribution 22 , as determined with the AmpFlSTR Identifiler PCR amplification Kit.
We noticed that several SNPs show ancestral alleles together with derived alleles. These ancestral alleles are probably due to contamination, so we used them to give an estimate of the contamination in the Y chromosome (Supplementary Table S5). We obtained a contamination of 17 %, which agrees with the estimate obtained for the X chromosome. We also estimated the 95% confidence interval with the following formulae: sqrt[(p*(1-p))/n)] where p is 0.17. The obtained 95CI is 0.063-0277, although due to the small number of observations the best nuclear contamination estimate is obviously that obtained from the X chromosome.

7-DNA fragmentation pattern
Ancient DNA sequences show characteristic damage patterns at the ends of the reads, derived from depurinations (preceding the sequencing reads) that subsequently involve deamination of exposed citosines and fragmentation of the DNA templates 27 . This process results in increased C to T substitutions at the 5' ends of the reads plus increased G to A substitutions at the 3' ends. Additionally, there is a characteristic enrichment of purines at the bases prior to 5' end of the reads and of pyrimidines after the 3' ends of the reads. While this complex pattern of DNA fragmentation is considered a signal of authenticity of ancient sequences, it is not known what should be expected from a sample as recent as the blood of king Louis XVI. The only comparable age in ancient genomic studies is that of the Australian aborigine hair sample 25 , accessioned at the Duckworth Collection in Cambridge in 1923. The analysis of the damage patterns shows a mild increase -as compared to other, older samples such as Neandertals-of T's (2.3% in position 1) and A's (2.1% in position 1) at the 5' and 3' ends of the reads, respectively. Perhaps more significantly, they found a bias towards purines and pyrimidines before and after the ends, respectively.
In order to determine whether the gourd´s DNA showed the ancient DNA pattern, we analysed a random subsample of the aligned sequencing reads for their nucleotide composition and misincorporation patterns. From the SAM file containing the aligned read data from all sequencing approaches (~60 million), we randomly sampled 1% reads obtaining a similar amount of reads from each chromosome (33,000-37,000 reads/each). To simplify the subsequent nucleotide substitution analysis, we removed reads that showed insertions or deletions with respect to the reference sequence.
This final read set contained 886,789 reads with which the damage pattern analysis was performed using the mapDamage software 28 . A clear purine-pyridimine pattern is observed prior and after the gourd's reads, as well as a mild increase of damage at the ends of the reads (Supplementary Figure   S3). Like in the case of the Australian aborigine, the excess of pyrimidines is mostly due to thymines.

8-SNP calling and filtering
The SNP calling pipeline consisted firstly in the removal of PCR duplicates with Picard tools (http://picard.sourceforge.ne), indel realignment and base quality recalibration, both with GATK 29,30 , steps that were applied independently to each of the four files. The resulting files were then merged before applying the global indel realignment step with GATK. Finally, the unified genotyper and variant quality score recalibration steps were performed with GATK in order to obtain the VCF file with the SNP calls ( 17533325 Supplementary Table S6). The following databases were used at different steps of the pipeline: 1000G biallelic indels (indel realignment), dbSNP137 (base quality recalibration) and dbSNP137, 1000G OMNI and HapMap (variant quality score recalibration).
To ensure that the observed contamination would not contribute to any heterozygous position in our working datasets, we applied a stringent allele balance filter. We removed from the datasets all the heterozygous positions with an allele imbalance more skewed than 0.3-0.7 for the two possible alleles (based on the maximum estimated contamination, which is that from the mtDNA). We have modelled the probability that contamination remains after the allele imbalance filtering by using a Bernoulli model. We took as example sites with 12x coverage. In this case, only three allelic combinations will remain after the 30% removal, accounting for 4:8, 5:7 and 6:6 (meaning number of reads with different alleles). Using conditional probabilities and nuclear contamination estimates between 0.18 and 0.20, the remaining contamination would be (for instance, in the case of 0.18): 0.18^4 in the 4:8 combination, 0.18^5 in 5:7 and 0.18^6 in 6:6 (assuming always that the background contamination will emerge as the minoritary allele). Thus in those SNPs of 12x coverage, the expected remaining contamination will be only between 0.25% and 0.39%.
However, this strict filtering would remove not only contaminants but many real heterozygous positions when applied to a low-coverage genome. Assuming a binomial distribution of reads with p=0.5 for the two possible alleles, we estimated the percentage of real heterozygous variants that have been affected by the filtering as a function of coverage (Supplementary Figure S4). For the datasets ≥3x (mean coverage = 5), exome ≥6x (mean coverage = 9) and exome ≥9x (mean coverage = 12), we estimate that around 37.5, 28.91 and 14.60% of real heterozygous positions were lost due to the allele balance filter respectively.

9-Patterns of post-mortem damage
The predominant post-mortem damage patterns seen in ancient DNA sequences are C→T and G→A miscoding lesions primarily derived from the deamination of cytosines to uracils, either in the sequenced strand (C→T) or in the complementary strand, thus resulting in G→A changes 27,31 .
Therefore, the number of C→T and G→A changes clearly exceeds the T→C and A→G substitutions in ancient genomes. This pattern can be observed in the exome, where the coverage is higher (Supplementary Table S7). For instance, taking into account all exomic reads, the frequency of C→T is 17.24% and G→A is 17.39%. In contrast, the nucleotide change frequencies of T→C and A→G are 12.02% and 12.19%, respectively.

10-PCR genotyping
A total of 9 autosomal SNPs were genotyped with a PCR approach using a two-steps protocol 32 .
Sequences of primers used for amplifying each one of the fragments targeted are reported in (Supplementary Table S8). The first PCR step was a multiplex amplification that included all the 9 fragments with the following conditions: 2 units TaqGold

11.1-Principal component analysis
Louis XVI had a complex ancestry, with people coming from different regions of Europe, including present-day Germany, Poland, Austria, Italy and France (   Supplementary Table S9).
We first performed a principal component analysis (PCA) of gourd's SNP data at >9x coverage, using data from European individuals of the 1,000 Genomes Project for comparison. From the 236 European individuals sequenced exclusively with Illumina, only SNPs with a MAF >0.5% were considered and sites were pruned using PLINK software 33  Genotypes were selected from the curated POPRES dataset, comprising 1,387 and x SNPs. Each SNP position in the dataset was assigned to one of the 5 possible source populations, in such a way that the proportion of SNPs assigned to a particular source population is the same as the ancestry proportion of that population. Then, for each SNP a genotype was randomly picked among the samples that belonged to the corresponding source population. This procedure was repeated 30 times in order to have 30 randomly generated composite genomes. Finally, PCA was performed on the merged dataset POPRES+ composites, using EIGENSTRAT version 3.0 36 (Figure 4b).

11.2-Analysis of tracts of identity-by-descent
To obtain a profile of the sharing of long identity-by-descent (IBD) tracts between gourd´s individual and other individuals found in the POPRES dataset, we used the fastIBD supplied by the Beagle software. First, based on the merged dataset of gourd´s genome and POPRES, we computed for each SNP a genetic map position in centiMorgan units based on the recombination map published by deCODE 37 . We used the sex-averaged version of the map at 10kb resolution, removed any SNPs that are not within the range of the deCODE map (such as within 5 Mb of the ends of chromosomes), and linearly interpolated the genetic map position for each marker using its physical position. Additionally, we removed SNPs with a minor allele frequency less than 5% in the POPRES dataset. In total, 72,913 autosomal SNPs were used to infer IBD tracts using Beagle. We followed the recommendation of Beagle's authors 38 to merge inferred IBD tracts among the 10 independent runs, as well as a postprocessing modification described previously 39 , which is aimed to ameliorate Beagle's tendency to spuriously introduce gaps into long blocks of IBD. In more detail, we first removed segments not overlapping another segment seen in at least one other run. We then merged any two segments of the same pairs of individual separated by a gap shorter than at least one of the segments and no more than 5cM long. We then discarded any segments that did not have any subsegment meeting the significance threshold of 1x10 -8 .
To avoid apparent differences in average sharing due to varying sample size, we down-sampled  Table S10).

12-Functional effect characterization analysis
We restrictred the functional effect characterzation analysis to the sites that had a minimum coverage of 6 reads per site and passed the allele balance filter (section SNP calling and filtering). For the non-synonymous changes we followed a series of different steps to ensure that we only retained the "most deleterious" changes (those that according to Polyphen-2 had a higher than 0.95 probability of being damaging and false positive rate of less than 0.05). Then we filtered the sites using a purely chemical classification, the Grantham Scores (GS) 42 , and confined our analysis to sites that had a radical Grantham Score (>= 150). Finally, we retained only positions that had a higher >4 GERP score that is correlated to the evolutionary constrains on that site 43 , and hence it will be expected that changes in this positions will be deleterious (Supplementary Table S14 Supplementary Table S15).

13-Eye color phenotype
We checked in the gourd´s genome the state of the six most informative SNPs for eye color prediction, and used the IrisPlex 44

14.1-Empirical risk assignment
The "GWAS Catalog" is an online catalog of published genome-wide association studies that is maintained by the NHGRI 45 . The Catalog ascertains all studies that test >100,000 SNPs to compile reported SNP-trait associations that achieve P<10 -5 . We used the Catalog to gain insight on the personal genetic risk of gourd´s individual for three complex phenotypes of interest: height, obesity and type 2 diabetes. Specifically, we aimed to combine information from all known associated loci to assess the fitting of the observed genetic risk of gourd´s individual to the expected distribution of risk scores in individuals of European ancestry. In the case of height, we worked with EAF (Effect Allele Frequency), the alleles that contribute to increased height, instead of risk alleles.
We selected all GWAS associations for which the associated allele was available at the GWAS Catalog (n = 7,029; accessed: 2013 March 3 rd ). From these, we ascertained 3,181 associations to different phenotypes for which the corresponding genotype in the gourd´s genome was available. Next, we selected all SNP-trait associations that corresponded to our phenotypes of interest, and recorded the allele frequency of the risk allele (RAF) or EAF allele gathered from HapMap (CEU individuals of European ancestry).
To calculate the expected distribution of RAF or EAF alleles in Europeans for each phenotype, we simulated 100,000 individuals with an assigned genotype at each associated SNP (we considered Hardy-Weinberg equilibrium to draw genotypes from RAF and EAF values). For each simulated genotype, we assigned a risk score of 0 if the genotype appears to be homozygous for the protective allele, 1 if heterozygous and 2 if homozygous for the risk allele. In the case of height, the allelic load assigned was 0, 1 or 2 depending if the individual carries none, one copy or two copies of the EAF allele, respectively. Hence, the unweighted distribution of genetic risk scores in general Europeans depends on the number of associated SNPs and its RAF or EAF allele frequencies. We then used gourd´s genotypes to calculate individual RAF or EAF allele loads for each phenotype. Finally, we also constructed a weighted genetic score by means of a similar analysis, but weighting each SNP proportionally to the reported estimates of RAF or EAF effect size. For each of this ~ 2 million SNPs, we defined a 100 kb region centered in the SNP, with a LD threshold of R² = 1. All the SNPs fulfilling these criteria were considered and checked against the NHGRI GWAS Catalog for the phenotypes considered. Afterwards, using the 1000 Genomes data, pairwise haplotypes were built through PLINK-1.07 33 , using both types of SNPs. Therefore, we could infer if the RAF or EAF allele of the non-genotyped SNP is traveling in LD with any of the LD-retrieved positions. The percentage of SNP recapturing was 35% for height, and 51% for both obesity and type 2 diabetes ( Table 1). The latter phenotypes did not present SNPs after filtering (we considered only those SNPs recaptured by at least three neighbouring SNPs) and thus were excluded from further analyses.

14.3-Quality control
The fact that the same unknown position could be recaptured by several known SNPs, suggests that the congruency in the haplotype combination could serve as a quality control method, provided that the coverage of the genome is low and that there is a background contamination. Obviously, the probability that damage or contamination could alter a whole genomic region is lower than for single SNP positions. If the same allele for the unknown SNP is found in all or in most of the pairwise haplotypes that we retrieve from the known SNPs, we can be more confident that the specified allele is the endogenous one. Obviously, an incongruence does not necessarily mean there is an error in the genotypes, and may simply respond to the fact that the LD patterns in CEU are different from those of 250 years ago. An example of this quality control can be seen at Supplementary Table S17.

14.4-LD patterns
To check whether we were analysing consistent and real haplotypes in the gourd's genome, we checked if haplotypes present in 60 CEU individuals from the 1000 Genomes Project Pilot phase 46 were to be found in the gourd's genome.
First, we randomly selected 244,610 SNPs from the gourd's genome and mapped those positions to the CEU genomes. Whenever mapping was possible, because the variant was called both in the gourd and in CEU, we defined a 1Mb block around that site and obtained all the SNPs in LD in CEU, using R² ≥ 0.7 with PLINK software 33 . Then, those gathered positions in LD were searched, if existing, in the gourd's genome and their genotype was retrieved. We only considered homozygous positions to build the blocks, given that the gourd's genome has not been phased. With these common SNPs we retrieved all extant haplotypes and its frequency in CEU using PLINK option --hap-freq.
Out of these ~250 k randomly selected positions we found 17.898 haplotypes in common between CEU and the gourd's genome that were composed of the same linked variants, ranging in length from 3 to 851 SNPs. In particular, we found 10.257 haplotypes with lengths comprised between 3 and 15 SNPs, 7.262 with lengths between 16 and 100, and 379 longer than 100 SNPs. For each of the haplotypes, we compared the gourd's haplotype with the observed haplotypes in CEU, and obtained a percentage of sequence similarity. On each comparison, we took into account only the similarity value of the most similar CEU haplotype (see example in Supplementary Table S18), and plotted the similarity percentage for the different haplotype length bins (Supplementary Figure S5). For the three categories, the average similarity was > 93%, and a significant positive correlation was found between similarity to the gourd's haplotype and frequency in CEU (Supplementary Table S19).
As a proof of principle to validate this approach, we followed the same previous procedure using Yoruban (YRI) population. As expected, haplotype length for YRI was lower, not exceeding 253 SNPs.
We crossed both sets (CEU x YRI) to find positions where a common haplotype (not necessarily of the same length) could be built around the same SNP (n=4329). According to our expectation, those haplotypes with the highest similarity in CEU and thus more frequent, should be less frequent in YRI population or, in other words, those most frequent in YRI should have less similarity to the observed haplotype in the gourd's genome.