Prion protein gene mutation detection using long-read Nanopore sequencing

Prion diseases are fatal neurodegenerative conditions that affect humans and animals. Rapid and accurate sequencing of the prion gene PRNP is paramount to human prion disease diagnosis and for animal surveillance programmes. Current methods for PRNP genotyping involve sequencing of small fragments within the protein-coding region. The contribution of variants in the non-coding regions of PRNP including large structural changes is poorly understood. Here, we used long-range PCR and Nanopore sequencing to sequence the full length of PRNP, including its regulatory region, in 25 samples from blood and brain of individuals with inherited or sporadic prion diseases. Nanopore sequencing detected the same variants as identified by Sanger sequencing, including repeat expansions/deletions. Nanopore identified additional single-nucleotide variants in the non-coding regions of PRNP, but no novel structural variants were discovered. Finally, we explored somatic mosaicism of PRNP’s octapeptide repeat region, which is a hypothetical cause of sporadic prion disease. While we found changes consistent with somatic mutations, we demonstrate that they may have been generated by the PCR. Our study illustrates the accuracy of Nanopore sequencing for rapid and field prion disease diagnosis and highlights the need for single-molecule sequencing methods for the detection of somatic mutations.

. Nanopore sequencing of PRNP and octapeptide repeat genotyping. (a) The human PRNP gene. Important pathogenic variants, such as deletions/insertions in the octapeptide repeat region (OPR), codon 129 (M129V) or codon 200 (E200K) SNVs are indicated in the protein-coding region in orange. Genetic diagnosis at the MRC Prion Unit at UCL is performed by Sanger sequencing of the protein-coding sequence (1015-bp amplicon) or specifically of the OPR (348-bp amplicon). Exon 1 and the beginning of the intron are particularly GC-rich (see GC % plot). This likely explains why producing a single amplicon spanning the entire genomic region was technically challenging. Instead, we amplified PRNP in two overlapping fragments for Nanopore sequencing (see Nanopore amplicons). (b) Sequence of PRNP's OPR. The OPR is composed of five similar repeats for a total of 123 bp. The first repeat (R1) is 27 bp and encodes a sequence of 9 amino acids (nonapeptide), the subsequent ones (R2, R2, R3, R4) are 24 bp and all encode the same sequence of 8 amino acids (octapeptide). The two R2 repeats are identical. R2 vary with R3 and R4 by two nucleotides at wobble positions.

Results
Genotyping PRNP in prion disease cases using Nanopore sequencing. PRNP genotyping in prion diseases cases is routinely performed using both Sanger sequencing and gel electrophoresis of the protein-coding region 29 . First, we tested if Nanopore sequencing performs as well as Sanger for the amplicon that is routinely used for clinical genotyping. Genomic DNA was extracted from CJD patient blood and the PRNP protein-coding region was amplified and sequenced using both Nanopore and Sanger. We computed a consensus sequence of the Nanopore reads and aligned it to the sequence obtained by Sanger sequencing. Both sequences were 100% identical. The sample was genotyped as codon 129 heterozygous (M129V) by both Sanger and Nanopore sequencing ( Supplementary Fig. S1). This pilot experiment encouraged us to pursue Nanopore sequencing as an alternative to Sanger sequencing for PRNP genotyping. Next, we developed a protocol for amplification and sequencing of the entire PRNP genomic region. We mapped the regulatory region based on genome annotations of epigenetic marks H3K4me1, H3K4me3, H3K27ac, and transcription factor binding sites. These pointed towards an important regulatory region starting around 1.1 kb upstream of exon 1 and extending into the intron, consistent with a prior study 30 . Although Nanopore allows sequencing of fragments longer than 1 Mb 31 , we faced technical difficulties amplifying the 16.7 kb region as a single PCR product. This was likely due to a GC-rich (> 70%) region around exon 1. We therefore opted to sequence two overlapping amplicons (Fig. 1a). The smaller amplicon (2988 bp) started upstream of the regulatory region and ended 1 kb into the intron. The larger amplicon (14,025 bp) included the remaining of the intron and exon 2. The two amplicons overlapped by 316 bp.
Using this protocol, we sequenced the PRNP genomic region in 25 samples from 21 individuals. First, we sequenced one healthy control and seven patients with inherited prion disease-of which six carried an insertion and/or a deletion in the OPR (2 OPRD, 1 OPRD, 1 OPRI, 2 OPRI, 5 OPRI/1 OPRD, and 6 OPRI). Second, we sequenced both blood and brain samples from four inherited prion disease patients with OPR insertions. Third, we sequenced one healthy control and eight patients with sporadic CJD. Of the 21 individuals, nine carried an SNV in the protein-coding region detected by Sanger sequencing: one E200K, one P102L, and seven M129V (Table 1). Using Nanopore, we obtained a range of coverage of 3165-8160 × for the gene body (14,025-bp amplicon) and 133-8966 × for the regulatory region (2988-bp amplicon).  1  52331  Blood  Control  -Not tested -M129V  14   2  1906  Blood  Inherited  2 OPRD  -2 OPRD  -3   3  55826  Blood  Inherited  1 OPRD  -1 OPRD  -9   4  58398  Blood  Inherited  -E200K  -E200K  1   5  58778  Blood  Inherited  1 Control  -M129V  -M129V  16   14  54493  Blood  Sporadic  -M129V  -M129V  16   15  54890  Blood  Sporadic  -----16  54917  Blood  Sporadic  ----5   17  54960  Blood  Sporadic  -M129V  -M129V  15   18  54968  Blood  Sporadic  ----1   19  55048  Blood  Sporadic  ----1   20  55050  Blood  Sporadic  -----21  55052  Blood  Sporadic  -M129V  -M129V  www.nature.com/scientificreports/ First, we called SNVs for all samples. After filtering calls based on allele frequency and strand bias, we detected a total of 46 different SNVs, most of which were present in more than one individual, for a total of 203 calls (Fig. 2a, Supplementary Fig. S2). All 46 SNVs were found in human genetics databases, and their respective Lengths of the OPR reads in four inherited cases for which genomic DNA from blood and brain was sequenced. The read lengths form a distribution around each expected OPR length due to small artefactual insertions/ deletions typical of Nanopore sequencing. See Table 1 for more details about individuals and samples. www.nature.com/scientificreports/ frequency in the general population was in accordance with the number of individuals carrying each SNV in our panel (Fig. 2b). The SNV calling algorithm successfully called the E200K, P102L, and M129V variants previously identified by Sanger sequencing (Table 1). No additional SNVs were detected in the protein-coding sequence, indicating a low frequency of false positive calls. We observed that samples carrying the 129V haplotype tended to carry more SNVs in total, in line with known linkage disequilibrium between M129V and adjacent SNVs. We did not find an excess of SNVs in sporadic CJD cases ( Supplementary Fig. S2). Second, we called SVs for all samples. The variant calling algorithm correctly identified all 15 OPR mutations present in our panel including the precise size of the inserted/deleted sequence ± 2 bp (Fig. 2c, d, Table 1). We did not detect novel SVs in PRNP's regulatory region or in the sporadic CJD samples. These results support the use of Nanopore sequencing for PRNP genotyping in patient samples.
Exploring somatic mutation of the octapeptide repeat region using Nanopore sequencing. Somatic mosaicism has been implicated in many diseases, from cancer to neurodegeneration [32][33][34][35] . In prion diseases, somatic mutation of PRNP's OPR has been hypothesised as a possible cause of sporadic CJD 14,15 . We took advantage of the high sequencing coverage obtained with Nanopore sequencing to search for rare somatic insertions or deletions in PRNP's OPR.
We first trimmed all reads we collected to keep only PRNP's OPR, generating 208,554 OPR reads. Then, we labelled each read with the most likely OPR length based on the total insertion and/or deletion compared to the reference. We designed a consensus sequence for one OPR repeat and built a set of template sequences for a range of possible OPRs. Each read was then aligned to the OPR template sequence matching its OPR label, returning the number of mismatches. We defined a somatic mutation call as any read whose OPR label was not reference or the genotype of its sample and was at least 94% identical to its OPR template sequence ( Supplementary Fig. S3). For example, any OPR read with a 24-bp insertion was labelled 1 OPRI, aligned to the 1 OPRI template, and was selected as a somatic mutation call if it aligned to its template sequence and did not come from a sample with a heterozygous 1 OPRI mutation.
Chimeric reads, which are single reads composed of sequences from one or more amplicons, can occur with Nanopore sequencing 36 . As we sequenced samples with different OPR genotypes on the same flow cell, we wanted to exclude any somatic mutation call which could be a chimeric read from another sample on the flow cell. To this end, we discarded the following reads: those not barcoded at both ends; those containing an adapter or barcode in their middle; those longer than their amplicon 37 . We identified a total of 129 somatic mutation calls, at a frequency per sample of 0-0.28% of reads (Fig. 3a, Supplementary Fig. S5). Reads from the forward and reverse strand were uniformly represented in the somatic mutation calls, excluding the possibility of strand bias. The number of somatic mutation calls from each sample did not correlate with its sequencing coverage, which likely excludes sequencing errors occurring at the pore ( Supplementary Fig. S4) 38 . Based on the above checks, we concluded that the 129 somatic mutation calls were not sequencing artefacts.
Of the 129 somatic mutation calls, 103 (80%) were from inherited CJD samples ( Supplementary Fig. S3). This represented a significant excess of calls from this group compared to sporadic CJD or control samples, even after accounting for the higher number of inherited CJD samples in our panel. Most inherited CJD patients in our panel carried a monoallelic OPR mutation. In these samples, the somatic mutation calls could represent the mutation of the reference allele or the mutation of the allele already carrying an OPR mutation. For example, a 5 OPRI somatic mutation call in a 4 OPRI heterozygous carrier could represent a 5-repeat insertion in the reference allele, or a 1-repeat insertion in the 4 OPRI allele. We used haplotype phasing to discriminate between these two possibilities ( Supplementary Fig. S4). Of the 64 somatic mutation calls from samples with an OPR mutation, 47 (73%) were assigned to the OPR-mutated allele ( Supplementary Fig. S3). Somatic mutation calls ranged from the deletion of 7 OPR repeats to the insertion of 5 additional repeats, but the majority (55/94, 59% of haplotype-assigned calls) represented the deletion of one OPR repeat (Fig. 3b).
Were specific samples more likely to produce somatic mutation calls? Samples from individuals with longer OPRs had more somatic mutation calls, with each additional repeat in the OPR increasing somatic mutation calls by 0.017% (Fig. 3c). In the four individuals who had both blood and brain DNA sequenced, the frequency of somatic mutation calls was consistently, but not significantly, lower in brain than in blood (0.07 ± 0.07% fewer calls in brain vs blood of the same individual, p = 0.068; Fig. 3d).
PCR amplification introduces errors in the octapeptide repeat region that resemble somatic mutations. Li et al. 14 have elegantly demonstrated that PRNP's OPR is prone to contraction or expansion when replicated by PCR or in E. coli cells. Therefore, we tested if the somatic mutation calls could be errors generated by the Taq polymerase. We amplified the protein-coding region (1015-bp amplicon, Fig. 1a) from a known amount of control sample: 1 ng of genomic DNA (#56635), approximately 307 haploid genomes 60 . In this sample, we previously found that 0.0325% of reads were somatic mutation calls. Assuming 0.0325% of genomes in the original genomic DNA carried a somatic mutation of the OPR, it was almost certain that we drew either no (0.9 probability) or only one mutated genome (0.09 probability, cumulatively 0.99 probability). Accordingly, if amplification did not introduce errors in the OPR, only reference OPR reads or a few mutated OPR reads with a unique mutation (e.g. all 2 OPRI) should be detectable in the sequencing reads. However, 0.21% of reads were somatic mutation calls, ranging from 4 OPRD to 4 OPRI (Fig. 3e). The distribution of OPR lengths resembled the distribution of the original 129 somatic mutation calls ( Fig. 3e vs b)

Discussion
Sequencing of PRNP is of crucial importance in informing patients, veterinaries, and clinicians about prion diseases diagnosis and management. Here we established a novel long-read sequencing strategy to sequence the full length of PRNP including its regulatory region (16.7 kb). First, we showed that Nanopore sequencing accurately detects known single-nucleotide and structural variants in PRNP, including the insertion and deletion of repeats in the OPR, which are challenging to sequence accurately with Sanger or short-read technologies. Second, we did not discover novel SVs in non-coding regions of PRNP or in patients with sporadic CJD. Third, we detected rare changes in the OPR consistent with somatic mutations, which have been speculated to be the cause of sporadic CJD. However, we demonstrated that these somatic mutation calls may in fact represent errors introduced during the PCR.
The present strategy has some limitations: it requires amplification of PRNP in two fragments, which renders the protocol relatively laborious and lab-based. Removing the amplification steps could help streamline the process. To do so, targeted sequencing using CRISPR-Cas9 39 or the 'Read Until' mode of Oxford Nanopore devices could be employed 40 . This however will require modifications of the present protocol. By removing the amplification steps, these strategies could enable PRNP sequencing in the field. This could benefit rapid decision making during therapeutic studies and support programmes which survey the zoonotic potential of animal prion diseases, such as during the recent episodes of Chronic Wasting Disease (CWD) in Norway and North America 1,41,42 . This may first require sequencing of PRNP in the affected species as not all have annotated reference genomes. Direct sequencing of genomic DNA would also allow the analysis of epigenetic marks, which may serve as a predictor of disease progression in sporadic CJD 39 .
Somatic mutations in PRNP have long been suspected to be a cause of sporadic CJD. The hypothesis posits that the first prion in a patient with sporadic CJD originated in a cell or group of cells which acquired a somatic mutation in PRNP. A strong candidate for this original mutation is the somatic insertion of additional repeats in the OPR, as OPRIs can cause inherited CJD and somatic instability is commonly seen for repetitive sequences 43 . A possible mechanism for the insertion or deletion of repeats in the OPR is replication slippage 14,44 . During replication, the DNA polymerase may dissociate from the DNA leading to either the template or daughter strand to 'slip' , i.e. re-anneal incorrectly to an earlier repeat ( Supplementary Fig. S6). In the case of template strand slippage, the polymerase then skips one or more repeats when DNA replication resumes, producing an OPRD. In the case of daughter strand slippage, the DNA polymerase replicates the same one or more repeats again, producing an OPRI. In our panel, most somatic mutation calls in inherited CJD samples were assigned to the OPR-mutated allele and OPR length positively correlated with frequency of somatic mutation calls. Both observations support this model: as more repeats are present in the DNA molecule undergoing replication, there are more ways the two DNA strands can mispair if the polymerase dissociates, leading to OPR mutations.
However, our results suggest that the OPR mutations identified here might have been generated by the Taq polymerase. The strong bias towards 1 OPRD in our calls also supports this possibility. Indeed, deletion of repeats are more frequent in bacteria, while eukaryotic repetitive sequences show no bias or a bias towards insertions 45 . Replication slippage is dependent on cell division, which leads to two predictions relevant for future research on sporadic CJD. First, the frequency of somatic mutations of the OPR should be higher in cells undergoing divisions than in post-mitotic cells. In four individuals, we found that the frequency of somatic mutation calls was consistently higher in blood DNA than in brain DNA, which suggests that some calls could have been genuine somatic mutations. Second, the original OPR mutation which causes sporadic CJD may not arise in post-mitotic neurons, but rather during development or in glia. Of note, mechanisms independent of cell division may also be possible, as has been suggested for trinucleotide repeat disorders such as Huntington disease 46,47 . Future research should initially aim at discovering genuine somatic mutations of PRNP's OPR in genomic DNA from donors, for example using single-molecule sequencing technologies 48 .

Methods
Patient recruitment and sample obtention. Healthy control donors were recruited from spouses or relatives of patients by the National Prion Clinic (London, UK), and the UCL Dementia Research Centre (London, UK). All experimental protocols were approved by the London Queen Square Research Ethics Committee (reference 05/Q0505/113). Samples were obtained with written informed consent from all controls, patients, or a For example, a 3 OPRI somatic mutation call on the mutated OPR haplotype of a 4 OPRI sample was counted as a − 1 change. Of the N = 94 somatic mutation calls from haplotype-phased samples, 55 (59%) removed one repeat from the OPR (− 1 changes). (c) Somatic mutation calls, as percentage of reads of each sample, as a function of the OPR length of each sample. The length is given compared to the reference OPR: 0 corresponds to the reference OPR, negatives correspond to OPR deletions (e.g. − 2 is 2 OPRD), and positives correspond to OPR insertions (e.g. 8 is 8 OPRI). The black dashed line is the line of best fit by linear regression: % somatic mutation calls = 0.036 + 0.017 × OPR length. OPR length is a significant predictor of the percentage of somatic mutation calls, R 2 = 0.53, ***p < 0.001 by linear regression. (d) Somatic mutation calls, as percentage of reads of each sample, for four inherited CJD cases which had genomic DNA from both blood and brain sequenced. For each individual, there was a decrease in somatic mutation calls in brain compared to blood. p = 0.068 by onesample one-sided t test (null hypothesis was no significant decrease between blood and brain). (e) Changes in the OPR introduced by PCR amplification of a reference OPR, as in (a).  Sanger sequencing. Sanger sequencing of the protein-coding sequence of PRNP was performed following published protocols for genetic diagnosis 29,50 . In this work, the PCR amplicons were not cloned. Instead, the exact sequences of inserts were inferred from Sanger sequencing electropherograms by subtracting the reference sequence from heterozygous calls. The amplicon subjected to sequencing was the 1015-bp amplicon covering the protein-coding region (Fig. 1a). The shorter amplicon (348 bp in the reference allele) was used to assess OPR deletions/insertions in affected individuals by gel electrophoresis. The Sanger sequencing trace included in Supplementary Fig. S1 was exported from Benchling (benchling. com).
Initial denaturation was 95 °C-5′, followed by 35 cycles of: 95 °C-30″ (denaturation); 58 °C-40″ (annealing); 72 °C-1′ (extension). The PCR product was purified and concentrated using Zymo DNA Clean & Concentrator-5 kit and eluted in 50 µL TE buffer. The PCR product's concentration was quantified with Qubit (dsDNA Broad Range assay) and its length was verified using TapeStation 2200 (D1000 tape). Library preparation was performed according to Nanopore Technologies' 1D amplicon by ligation protocol (version ADE_9003_v108_ revU_18Oct2016). In brief, the PCR products were end-prepped/dA-tailed for adapter attachment and sequencing adapters were attached by ligation. The final products were cleaned using size-selection beads. All reagents were provided by the SQK-LSK108 kit, except: NEBNext Ultra II End Repair/dA-Tailing buffer and enzyme mix (#E7546), NEB Blunt TA/Ligase Master Mix (#M0367), and Agencourt AMPure XP beads (#A63880). The PRNP coding region amplicons were diluted after end-repair/dA-tailing to bring only 0.2 pmol of DNA fragments to the ligation reaction. 23 ng of final library were loaded onto the MinION flow cell. Sequencing was performed for 53 min and followed live on the MinKNOW software.
To call variants, the reads obtained after basecalling were aligned to the human reference genome (hg38) using minimap2 v2.18-r1015 53 . The resulting sam file was converted to bam, then sorted and indexed using samtools v1.9 54 . Nanopolish v.0.13.2 was used for variant calling. Alignments were visualised with the Integrative Genomics Viewer (IGV) v2.8.6. Nanopore sequencing of full-length PRNP and its regulatory region. Amplification prior to Nanopore sequencing was performed in two subsequent PCRs. The first one amplified the genomic region of interest from genomic DNA, the second attached unique barcodes to allow multiplexing of several samples onto the sequencing flow cell. In order to use the barcoding primers provided by Oxford Nanopore Technologies in the PBK-004 barcoding kit, we designed primers that included a complementary sequence to the barcoding primers (see "Primer sequences" section). We used the NEB LongAmp Taq DNA polymerase, which performed better than other long-range polymerases we tried.
The barcoding PCR and library preparation were carried out according to ONT's barcoding kit (SQK-PBK004, version PBK_9073_v1_revB_23May2018) and ligation sequencing kit (SQK-LSK109) protocols. Up to 12 samples were multiplexed on each flow cell and sequencing was performed for 48 hr.
Basecalling and alignment. The raw, multi-line FAST5 files were processed using ONT Guppy v4.2.2 suite. We used guppy_basecaller for basecalling and guppy_barcoder for demultiplexing, i.e. assigning reads to each sample. The reads in fastq format were then aligned to the human reference genome (hg38) using guppy_ aligner. Samtools v1.7.0 was used to sort and index the resulting alignment files (BAM format). IGV v.2.8.6 was used to visualise the alignments. www.nature.com/scientificreports/ at a minimum coverage of 20× and with an allele frequency of minimum 0.2 were called. We filtered the calls further using strand bias as a criterion 39 . MinION sequences at random the forward or reverse strand. Hence, for a true positive variant call, the reads supporting the variant are predicted to be ~ 50% forward and ~ 50% reverse. There is evidence of strand bias if this proportion is strongly imbalanced, i.e. when the forward and the reverse reads do not uniformly call for the same nucleotide. We used the StrandOddsRatio (SOR) metric computed by nanopolish 55 , which increases with evidence of strand bias. To filter SNV calls in non-coding regions of PRNP, we reasoned that the SNVs in the protein-coding region identified by both Sanger and Nanopore sequencing could be used as a set of true positives. For those, the maximum SOR was 0.77 (N = 12). Therefore, we filtered out any SNV call above SOR = 1.0 as showing evidence of strand bias and hence a possible false positive. This decreased the number of SNV calls for all samples (N = 25) from 218 to 203. SNV calls before and after strandbias filtering are included ( Supplementary Information 1).

Single
Haplotype phasing. To assign the gene body reads to haplotypes (haplotagging), we generated a new VCF file for each file containing its filtered SNV calls (see "Single-nucleotide variant calling" section). Five samples (#58648, #54890, #55050, #59060, #53747) had no heterozygous SNV call and therefore their reads could not be haplotagged. We first used whatshap phase to phase the SNV calls in relation to each other 56 . A single heterozygous SNV call is sufficient for haplotagging but could not be phased by whatshap phase. Therefore, the VCF files of samples with a single SNV call (#54917, #54968, #55048, #43706, #42656) were manually edited to match the formatting of a phased variant call. We then used whatshap haplotag to haplotag the reads.
Structural variant calling. Prior to SV calling, alignments were filtered further to keep only high-quality reads. For the regulatory region amplicon, any read shorter than 1 kb, longer than 3.5 kb, or with more than 15% of its length soft-clipped was discarded. For the gene body amplicon, any read shorter than 10 kb, longer than 15 kb or with more than 5% of its length soft-clipped was discarded. The goal of the maximum length criteria is to exclude potentially chimeric reads. Secondary alignments were also discarded. We used sniffles v1.0.12 to call structural variants (SVs) larger than 20 bp and supported by at least 100 reads 57 . Calls were filtered to keep SV calls within the PRNP genomic window and with an allele frequency above 0.1. Fig. 2c, d. The filtered aligned reads (see above) from the gene body were trimmed to keep only the OPR (chr20:4,699,371-4,699,493) using samtools ampliconclip 54 . From the reads we obtained, any read shorter than 21 bp or longer than 702 bp were discarded. 21 bp was chosen as minimum length as it would correspond to a 4 OPRD, minus 6 bp to account for small artefactual deletions. 702 bp was chosen as maximum length as it would correspond to a 24 OPRI, plus 3 bp to account for small artefactual insertions. We then counted the number of reads of each length to produce the histograms in Fig. 2c, d. Search for somatic mutations of the OPR. The haplotagged OPR reads were imported in R from SAM files, for a total of 208,554 OPR reads. We parsed the CIGAR of each read to calculate its total insertion and deletion. For example, 11500H43M1I24M1D29M1I7M3D7M2D7M2306H returned insertion 2 bp (1I + 1I) and deletion 6 bp (1D + 3D + 2D). The hard clips (11500H, 2306H) were created by samtools ampliconclip. Each read was then assigned to a most likely OPR genotype based on its total insertion/deletion. For each OPR genotype, the interval of possible lengths was defined as the target insertion/deletion length minus 6 bp to allow for small artefactual deletions up to the target insertion/deletion length plus 3 bp to allow for small artefactual insertions. For example, the target deletion length for 1 OPRD is − 24 bp. Therefore, any read with a total deletion of − 30 bp up to − 21 bp was labelled as a potential 1 OPRD read. As the target insertion/deletion for the reference OPR is 0 bp, the interval was set from deletions of − 6 bp up to insertions of + 3 bp. Any read with a total insertion/deletion not included in one of the intervals (e.g. a 10-bp insertion) was labelled as unassigned.

Lengths of OPR reads. This refers to
At this stage, we generated Supplementary Fig. S4 by plotting for each sample the proportion of reads labelled as reference or as the expected OPR mutation assigned to each haplotype.
In parallel, we built OPR template sequences using the OPR consensus sequence. Each read, except if labelled as unassigned, was then aligned one-to-one to the OPR template matching its OPR label ( Supplementary Fig. S5), returning the number of mismatches. Alignments were performed with ClustalW 58 , implemented in the msa R package 59 . The mismatch threshold to filter somatic mutation calls was calculated as the mean + standard deviation of the number of mismatches of all 'expected' reads, i.e. reference reads and reads matching the OPR genotype of their sample. Somatic mutations calls were then defined as any read not labelled as reference, not labelled as the OPR genotype of its sample, and having fewer mismatches with its template than 5.8% of its length ( Supplementary Fig. S3). For the compound heterozygous sample #46345, reads labelled as reference and below the mismatch threshold were also defined as somatic mutation calls.
Control PCR to test for PCR-introduced errors. 1 ng of genomic DNA from sample #56635, approximately 307 haploid genomes 60 , was PCR amplified to produce the 1015-bp amplicon covering the protein-coding sequence (see Sanger sequencing). The product was cleaned, eluted in 25 µL of dH 2 O, and its concentration was quantified at 10.36 ng/µL by Qubit (dsDNA Broad Range assay). 1 µL of this product was then barcoded and sequenced as previously (see "Nanopore sequencing of full-length PRNP and its regulatory region" section).
The fast5 files were basecalled to produce reads which were aligned to the human reference genome as previously (see Basecalling and alignment). We then filtered the reads by discarding any read shorter than 900 bp, longer than 1300 bp, or with more than 20% of its length soft-clipped. We trimmed the reads to keep only the OPR (see "Lengths of OPR reads" section), generating 366,243 OPR reads. Calling possible somatic mutations was then performed as previously (see "Search for somatic mutations of the OPR" section). www.nature.com/scientificreports/ The original frequency of somatic mutation calls in sample #56635 was 0.0325% (9 of 26,764 OPR reads were somatic mutation calls). Let us assume that these somatic mutation calls were genuine, and that any genome taken at random from sample #56635 had a probability of carrying an OPR mutation P(mutated) = 0.000325 . Conversely, the probability that any genome taken at random from sample #56635 was OPR reference was P(reference) = 1 − 0.000325 = 0.99967.
The probability to have drawn exactly 0 genome of the 307 can be written as the probability that all 307 genomes were reference, which is P reference 307 = 0.99967 307 = 0.904 , i.e. there was a 90.4% probability to draw 0 mutated genome in the 307. The probability that exactly n genomes out of k genomes drawn (here, 307) were mutated can be calculated as where From Eqs. (1) and (2), the probability of having drawn exactly 1 mutated genome in the 307 was 307 × 0.000325 1 × 0.99967 307−1 = 0.0902 , i.e. there was a 9.02% probability of drawing exactly 1 mutated genome in the 307.
The probability of drawing 0 or 1 mutated genome in the 307 was therefore 0.904 + 0.0902 = 0.9942 , i.e. the probability to draw more than one mutated genome was 0.58%.