Introduction

Prion diseases are a group of fatal mammalian neurodegenerative disorders, the most common of which in human is sporadic Creutzfeldt–Jakob disease (CJD). Prion diseases can be acquired or transmitted between humans and animals, necessitating active surveillance programmes1. The infectious agent of prion disease is composed of assemblies of misfolded forms of cellular prion protein (PrP). Mutations in the gene that encodes PrP (PRNP) are associated with inherited prion diseases, which comprise a heterogeneous range of clinical phenotypes that mimic common dementias2.

Human PRNP is composed of two exons flanking a single large intron for a total of 15 kb located on chromosome 20 (Fig. 1a)3,4. The 762-bp open reading frame is contained entirely within the second exon. Sanger sequencing of the protein-coding region of PRNP is routinely performed as part of the clinical investigation of patients suspected to have any form of prion disease2. Accordingly, the coding region has been extensively studied in prion disease cases with over 70 variants reported in patients, most of which are non-synonymous single-nucleotide variants (SNV)2,5. An important SNV in PRNP is at codon 129 (rs1799990), which encodes either methionine (M) or valine (V). Homozygosity at codon 129 is associated with an elevated risk of developing sporadic CJD and shorter survival relative to heterozygotes6, and the variant is a strong determinant of clinical phenotype7. In sporadic CJD cases homozygous at codon 129, concentration and stability of the distinct prion protein isoforms represent other determinants of disease duration8. Other non-synonymous variants protect against acquired or sporadic prion diseases such as G127V and E219K9,10,11.

Figure 1
figure 1

Nanopore sequencing of PRNP and octapeptide repeat genotyping. (a) The human PRNP gene. Important pathogenic variants, such as deletions/insertions in the octapeptide repeat region (OPR), codon 129 (M129V) or codon 200 (E200K) SNVs are indicated in the protein-coding region in orange. Genetic diagnosis at the MRC Prion Unit at UCL is performed by Sanger sequencing of the protein-coding sequence (1015-bp amplicon) or specifically of the OPR (348-bp amplicon). Exon 1 and the beginning of the intron are particularly GC-rich (see GC % plot). This likely explains why producing a single amplicon spanning the entire genomic region was technically challenging. Instead, we amplified PRNP in two overlapping fragments for Nanopore sequencing (see Nanopore amplicons). (b) Sequence of PRNP’s OPR. The OPR is composed of five similar repeats for a total of 123 bp. The first repeat (R1) is 27 bp and encodes a sequence of 9 amino acids (nonapeptide), the subsequent ones (R2, R2, R3, R4) are 24 bp and all encode the same sequence of 8 amino acids (octapeptide). The two R2 repeats are identical. R2 vary with R3 and R4 by two nucleotides at wobble positions.

The N-terminal region of PrP contains a repetitive sequence of amino acids termed the octapeptide repeat region (OPR). The normal OPR consists of a nonapeptide followed by four octapeptide repeats (Fig. 1b)12. Repeat length is variable in the human population ranging from deletions (OPRD, for octapeptide repeat deletion) of two repeats to insertions (OPRI, for octapeptide repeat insertion) of an extra 12 repeats. Insertion of four or more supplementary repeats is recognised as definite cause of inherited prion disease2. Instability of the OPR, which might manifest as somatic insertion and/or deletion of repeats in brain tissue, is a hypothetical mechanism of sporadic CJD13,14,15.

The expression level of Prnp is a strong determinant of incubation time in mouse models of prion disease16,17. In cattle, a 23-bp deletion in the promoter and a 12-bp deletion in the intron of PRNP increase an animal’s risk of developing bovine spongiform encephalopathy upon consumption of infected feedstuffs, potentially by modifying expression levels18,19,20. In humans, whether variants in regulatory regions of PRNP can increase risk of sporadic CJD by altering expression remains unclear. First, there has been no assessment of the non-coding regions of PRNP in CJD cases for structural variants (SVs) that might alter regulatory sites, hence the need for the present study. Second, in contrast to previous reports21,22 our genome-wide association study (GWAS) of sporadic CJD did not find evidence of an independent signal in the intron or regulatory regions of PRNP23. Regulatory variants may have been missed in our study. For example, an eQTL upstream of PRNP (rs17327121) is associated with lower expression of PRNP in the cerebellum24 but was not genotyped or imputed in the GWAS. Third, differences in risk of prion disease by sex, age, and codon 129 genotype do not appear to be explained by different levels of PRNP expression25. Fourth, duplication of the PRNP locus would be predicted to increase risk of CJD by causing overexpression, but this has only been reported once in a control individual26. Overall, more routine sequencing of the non-coding regions of PRNP in CJD patients may provide insights into regulatory variants and risk of disease.

Here we set up a long-read sequencing protocol for full-length human PRNP, including its regulatory region. We employed this protocol to sequence PRNP in 21 individuals, 19 of which had inherited or sporadic CJD. We used Oxford Nanopore MinION, a portable sequencer which offers a more complete view of genomic variation than short-read technologies as it generates long reads spanning repetitive regions and potential SVs27,28. Our aim was to establish a protocol that could potentially support (1) the rapid field sequencing of PRNP, (2) longer read lengths that are more likely to span an SV and can be more confidently aligned or assembled, and (3) high sequencing coverage to detect hypothetical somatic mutations in the OPR.

Results

Genotyping PRNP in prion disease cases using Nanopore sequencing

PRNP genotyping in prion diseases cases is routinely performed using both Sanger sequencing and gel electrophoresis of the protein-coding region29. First, we tested if Nanopore sequencing performs as well as Sanger for the amplicon that is routinely used for clinical genotyping. Genomic DNA was extracted from CJD patient blood and the PRNP protein-coding region was amplified and sequenced using both Nanopore and Sanger. We computed a consensus sequence of the Nanopore reads and aligned it to the sequence obtained by Sanger sequencing. Both sequences were 100% identical. The sample was genotyped as codon 129 heterozygous (M129V) by both Sanger and Nanopore sequencing (Supplementary Fig. S1). This pilot experiment encouraged us to pursue Nanopore sequencing as an alternative to Sanger sequencing for PRNP genotyping.

Next, we developed a protocol for amplification and sequencing of the entire PRNP genomic region. We mapped the regulatory region based on genome annotations of epigenetic marks H3K4me1, H3K4me3, H3K27ac, and transcription factor binding sites. These pointed towards an important regulatory region starting around 1.1 kb upstream of exon 1 and extending into the intron, consistent with a prior study30. Although Nanopore allows sequencing of fragments longer than 1 Mb31, we faced technical difficulties amplifying the 16.7 kb region as a single PCR product. This was likely due to a GC-rich (> 70%) region around exon 1. We therefore opted to sequence two overlapping amplicons (Fig. 1a). The smaller amplicon (2988 bp) started upstream of the regulatory region and ended 1 kb into the intron. The larger amplicon (14,025 bp) included the remaining of the intron and exon 2. The two amplicons overlapped by 316 bp.

Using this protocol, we sequenced the PRNP genomic region in 25 samples from 21 individuals. First, we sequenced one healthy control and seven patients with inherited prion disease—of which six carried an insertion and/or a deletion in the OPR (2 OPRD, 1 OPRD, 1 OPRI, 2 OPRI, 5 OPRI/1 OPRD, and 6 OPRI). Second, we sequenced both blood and brain samples from four inherited prion disease patients with OPR insertions. Third, we sequenced one healthy control and eight patients with sporadic CJD. Of the 21 individuals, nine carried an SNV in the protein-coding region detected by Sanger sequencing: one E200K, one P102L, and seven M129V (Table 1). Using Nanopore, we obtained a range of coverage of 3165–8160 × for the gene body (14,025-bp amplicon) and 133–8966 × for the regulatory region (2988-bp amplicon).

Table 1 Samples information and genotyping summary. Ind., individual; Sample, sample unique identifier; Tissue, source of the genomic DNA sequenced; Prion disease, healthy control or inherited or sporadic CJD. Sanger sequencing columns list SVs and SNVs detected by Sanger sequencing of PRNP’s protein-coding region. Nanopore MinION columns list SVs and SNVs detected by Nanopore sequencing of the PRNP genomic region.

First, we called SNVs for all samples. After filtering calls based on allele frequency and strand bias, we detected a total of 46 different SNVs, most of which were present in more than one individual, for a total of 203 calls (Fig. 2a, Supplementary Fig. S2). All 46 SNVs were found in human genetics databases, and their respective frequency in the general population was in accordance with the number of individuals carrying each SNV in our panel (Fig. 2b). The SNV calling algorithm successfully called the E200K, P102L, and M129V variants previously identified by Sanger sequencing (Table 1). No additional SNVs were detected in the protein-coding sequence, indicating a low frequency of false positive calls. We observed that samples carrying the 129V haplotype tended to carry more SNVs in total, in line with known linkage disequilibrium between M129V and adjacent SNVs. We did not find an excess of SNVs in sporadic CJD cases (Supplementary Fig. S2).

Figure 2
figure 2

Detection of known variants in PRNP with Nanopore sequencing. (a) Each dot represents an SNV call by Nanopolish. SNV calls are ranked by increasing evidence of strand bias (SOR, strand odds ratio). In orange are the SNVs in the protein-coding region which were also called by Sanger. They are together the set of true positives we used to calibrate our strand bias threshold (SOR = 1.0, dashed black line). (b) Number of individuals from our panel who carried each unique SNV as a function of allele frequency in the general population. r = 0.94 by Pearson’s correlation. (c) Lengths of the OPR reads in seven inherited CJD cases. (d) Lengths of the OPR reads in four inherited cases for which genomic DNA from blood and brain was sequenced. The read lengths form a distribution around each expected OPR length due to small artefactual insertions/deletions typical of Nanopore sequencing. See Table 1 for more details about individuals and samples.

Second, we called SVs for all samples. The variant calling algorithm correctly identified all 15 OPR mutations present in our panel including the precise size of the inserted/deleted sequence ± 2 bp (Fig. 2c, d, Table 1). We did not detect novel SVs in PRNP’s regulatory region or in the sporadic CJD samples. These results support the use of Nanopore sequencing for PRNP genotyping in patient samples.

Exploring somatic mutation of the octapeptide repeat region using Nanopore sequencing

Somatic mosaicism has been implicated in many diseases, from cancer to neurodegeneration32,33,34,35. In prion diseases, somatic mutation of PRNP’s OPR has been hypothesised as a possible cause of sporadic CJD14,15. We took advantage of the high sequencing coverage obtained with Nanopore sequencing to search for rare somatic insertions or deletions in PRNP’s OPR.

We first trimmed all reads we collected to keep only PRNP’s OPR, generating 208,554 OPR reads. Then, we labelled each read with the most likely OPR length based on the total insertion and/or deletion compared to the reference. We designed a consensus sequence for one OPR repeat and built a set of template sequences for a range of possible OPRs. Each read was then aligned to the OPR template sequence matching its OPR label, returning the number of mismatches. We defined a somatic mutation call as any read whose OPR label was not reference or the genotype of its sample and was at least 94% identical to its OPR template sequence (Supplementary Fig. S3). For example, any OPR read with a 24-bp insertion was labelled 1 OPRI, aligned to the 1 OPRI template, and was selected as a somatic mutation call if it aligned to its template sequence and did not come from a sample with a heterozygous 1 OPRI mutation.

Chimeric reads, which are single reads composed of sequences from one or more amplicons, can occur with Nanopore sequencing36. As we sequenced samples with different OPR genotypes on the same flow cell, we wanted to exclude any somatic mutation call which could be a chimeric read from another sample on the flow cell. To this end, we discarded the following reads: those not barcoded at both ends; those containing an adapter or barcode in their middle; those longer than their amplicon37. We identified a total of 129 somatic mutation calls, at a frequency per sample of 0–0.28% of reads (Fig. 3a, Supplementary Fig. S5). Reads from the forward and reverse strand were uniformly represented in the somatic mutation calls, excluding the possibility of strand bias. The number of somatic mutation calls from each sample did not correlate with its sequencing coverage, which likely excludes sequencing errors occurring at the pore (Supplementary Fig. S4)38. Based on the above checks, we concluded that the 129 somatic mutation calls were not sequencing artefacts.

Figure 3
figure 3

Somatic mutation calls in PRNP’s OPR. (a) Somatic mutation calls per sample, as percentage of the number of reads. (b) Number of somatic mutation calls as a function of the change it generated in the OPR. For example, a 3 OPRI somatic mutation call on the mutated OPR haplotype of a 4 OPRI sample was counted as a − 1 change. Of the N = 94 somatic mutation calls from haplotype-phased samples, 55 (59%) removed one repeat from the OPR (− 1 changes). (c) Somatic mutation calls, as percentage of reads of each sample, as a function of the OPR length of each sample. The length is given compared to the reference OPR: 0 corresponds to the reference OPR, negatives correspond to OPR deletions (e.g. − 2 is 2 OPRD), and positives correspond to OPR insertions (e.g. 8 is 8 OPRI). The black dashed line is the line of best fit by linear regression: % somatic mutation calls = 0.036 + 0.017 × OPR length. OPR length is a significant predictor of the percentage of somatic mutation calls, R2 = 0.53, ***p < 0.001 by linear regression. (d) Somatic mutation calls, as percentage of reads of each sample, for four inherited CJD cases which had genomic DNA from both blood and brain sequenced. For each individual, there was a decrease in somatic mutation calls in brain compared to blood. p = 0.068 by one-sample one-sided t test (null hypothesis was no significant decrease between blood and brain). (e) Changes in the OPR introduced by PCR amplification of a reference OPR, as in (a).

Of the 129 somatic mutation calls, 103 (80%) were from inherited CJD samples (Supplementary Fig. S3). This represented a significant excess of calls from this group compared to sporadic CJD or control samples, even after accounting for the higher number of inherited CJD samples in our panel. Most inherited CJD patients in our panel carried a monoallelic OPR mutation. In these samples, the somatic mutation calls could represent the mutation of the reference allele or the mutation of the allele already carrying an OPR mutation. For example, a 5 OPRI somatic mutation call in a 4 OPRI heterozygous carrier could represent a 5-repeat insertion in the reference allele, or a 1-repeat insertion in the 4 OPRI allele. We used haplotype phasing to discriminate between these two possibilities (Supplementary Fig. S4). Of the 64 somatic mutation calls from samples with an OPR mutation, 47 (73%) were assigned to the OPR-mutated allele (Supplementary Fig. S3). Somatic mutation calls ranged from the deletion of 7 OPR repeats to the insertion of 5 additional repeats, but the majority (55/94, 59% of haplotype-assigned calls) represented the deletion of one OPR repeat (Fig. 3b).

Were specific samples more likely to produce somatic mutation calls? Samples from individuals with longer OPRs had more somatic mutation calls, with each additional repeat in the OPR increasing somatic mutation calls by 0.017% (Fig. 3c). In the four individuals who had both blood and brain DNA sequenced, the frequency of somatic mutation calls was consistently, but not significantly, lower in brain than in blood (0.07 ± 0.07% fewer calls in brain vs blood of the same individual, p = 0.068; Fig. 3d).

PCR amplification introduces errors in the octapeptide repeat region that resemble somatic mutations

Li et al.14 have elegantly demonstrated that PRNP’s OPR is prone to contraction or expansion when replicated by PCR or in E. coli cells. Therefore, we tested if the somatic mutation calls could be errors generated by the Taq polymerase. We amplified the protein-coding region (1015-bp amplicon, Fig. 1a) from a known amount of control sample: 1 ng of genomic DNA (#56635), approximately 307 haploid genomes60. In this sample, we previously found that 0.0325% of reads were somatic mutation calls. Assuming 0.0325% of genomes in the original genomic DNA carried a somatic mutation of the OPR, it was almost certain that we drew either no (0.9 probability) or only one mutated genome (0.09 probability, cumulatively 0.99 probability). Accordingly, if amplification did not introduce errors in the OPR, only reference OPR reads or a few mutated OPR reads with a unique mutation (e.g. all 2 OPRI) should be detectable in the sequencing reads. However, 0.21% of reads were somatic mutation calls, ranging from 4 OPRD to 4 OPRI (Fig. 3e). The distribution of OPR lengths resembled the distribution of the original 129 somatic mutation calls (Fig. 3e vs b) with 1 OPRD being the most frequent call (319/761 calls, 42%). Therefore, we cannot confirm that the 129 calls were genuine somatic mutations. Instead, the calls may have been errors introduced by the PCR. Single-molecule sequencing technologies are likely required to confirm the presence of somatic mutations in PRNP’s OPR in human samples.

Discussion

Sequencing of PRNP is of crucial importance in informing patients, veterinaries, and clinicians about prion diseases diagnosis and management. Here we established a novel long-read sequencing strategy to sequence the full length of PRNP including its regulatory region (16.7 kb). First, we showed that Nanopore sequencing accurately detects known single-nucleotide and structural variants in PRNP, including the insertion and deletion of repeats in the OPR, which are challenging to sequence accurately with Sanger or short-read technologies. Second, we did not discover novel SVs in non-coding regions of PRNP or in patients with sporadic CJD. Third, we detected rare changes in the OPR consistent with somatic mutations, which have been speculated to be the cause of sporadic CJD. However, we demonstrated that these somatic mutation calls may in fact represent errors introduced during the PCR.

The present strategy has some limitations: it requires amplification of PRNP in two fragments, which renders the protocol relatively laborious and lab-based. Removing the amplification steps could help streamline the process. To do so, targeted sequencing using CRISPR-Cas939 or the ‘Read Until’ mode of Oxford Nanopore devices could be employed40. This however will require modifications of the present protocol. By removing the amplification steps, these strategies could enable PRNP sequencing in the field. This could benefit rapid decision making during therapeutic studies and support programmes which survey the zoonotic potential of animal prion diseases, such as during the recent episodes of Chronic Wasting Disease (CWD) in Norway and North America1,41,42. This may first require sequencing of PRNP in the affected species as not all have annotated reference genomes. Direct sequencing of genomic DNA would also allow the analysis of epigenetic marks, which may serve as a predictor of disease progression in sporadic CJD39.

Somatic mutations in PRNP have long been suspected to be a cause of sporadic CJD. The hypothesis posits that the first prion in a patient with sporadic CJD originated in a cell or group of cells which acquired a somatic mutation in PRNP. A strong candidate for this original mutation is the somatic insertion of additional repeats in the OPR, as OPRIs can cause inherited CJD and somatic instability is commonly seen for repetitive sequences43. A possible mechanism for the insertion or deletion of repeats in the OPR is replication slippage14,44. During replication, the DNA polymerase may dissociate from the DNA leading to either the template or daughter strand to ‘slip’, i.e. re-anneal incorrectly to an earlier repeat (Supplementary Fig. S6). In the case of template strand slippage, the polymerase then skips one or more repeats when DNA replication resumes, producing an OPRD. In the case of daughter strand slippage, the DNA polymerase replicates the same one or more repeats again, producing an OPRI. In our panel, most somatic mutation calls in inherited CJD samples were assigned to the OPR-mutated allele and OPR length positively correlated with frequency of somatic mutation calls. Both observations support this model: as more repeats are present in the DNA molecule undergoing replication, there are more ways the two DNA strands can mispair if the polymerase dissociates, leading to OPR mutations.

However, our results suggest that the OPR mutations identified here might have been generated by the Taq polymerase. The strong bias towards 1 OPRD in our calls also supports this possibility. Indeed, deletion of repeats are more frequent in bacteria, while eukaryotic repetitive sequences show no bias or a bias towards insertions45. Replication slippage is dependent on cell division, which leads to two predictions relevant for future research on sporadic CJD. First, the frequency of somatic mutations of the OPR should be higher in cells undergoing divisions than in post-mitotic cells. In four individuals, we found that the frequency of somatic mutation calls was consistently higher in blood DNA than in brain DNA, which suggests that some calls could have been genuine somatic mutations. Second, the original OPR mutation which causes sporadic CJD may not arise in post-mitotic neurons, but rather during development or in glia. Of note, mechanisms independent of cell division may also be possible, as has been suggested for trinucleotide repeat disorders such as Huntington disease46,47. Future research should initially aim at discovering genuine somatic mutations of PRNP’s OPR in genomic DNA from donors, for example using single-molecule sequencing technologies48.

Methods

Patient recruitment and sample obtention

Healthy control donors were recruited from spouses or relatives of patients by the National Prion Clinic (London, UK), and the UCL Dementia Research Centre (London, UK). All experimental protocols were approved by the London Queen Square Research Ethics Committee (reference 05/Q0505/113). Samples were obtained with written informed consent from all controls, patients, or a patient's consultee in accordance with applicable UK legislation and Codes of Practice. All methods were carried out in accordance with relevant guidelines and regulations.

DNA extraction

Nucleon BACC3 kit was used for DNA extraction from frozen blood samples according to the manufacturer’s protocol. Final DNA samples were stored at 4 °C until needed. DNA extraction from CJD brain samples was carried out in a Biosafety Level 3 facility as previously published49.

Sanger sequencing

Sanger sequencing of the protein-coding sequence of PRNP was performed following published protocols for genetic diagnosis29,50. In this work, the PCR amplicons were not cloned. Instead, the exact sequences of inserts were inferred from Sanger sequencing electropherograms by subtracting the reference sequence from heterozygous calls. The amplicon subjected to sequencing was the 1015-bp amplicon covering the protein-coding region (Fig. 1a). The shorter amplicon (348 bp in the reference allele) was used to assess OPR deletions/insertions in affected individuals by gel electrophoresis.

The Sanger sequencing trace included in Supplementary Fig. S1 was exported from Benchling (benchling.com).

Nanopore sequencing of PRNP’s protein-coding region

PCR was performed on genomic DNA from blood of a healthy individual to produce the 1015-bp amplicon covering the protein-coding sequence (see above). Forward primer was 5′-CTATGCACTCATTCATTATG-3′, reverse primer was 5′-GTTTTCCAGTGCCCATCAGTG-3′. The PCR well contained 12.5 μL 2 × MegaMix Royal (Microzone), 10.5  μL H2O, 1 μL primers (12.5 μM each), 1 μL (600 ng) genomic DNA.

Initial denaturation was 95 °C—5′, followed by 35 cycles of: 95 °C—30″ (denaturation); 58 °C—40″ (annealing); 72 °C—1′ (extension). The PCR product was purified and concentrated using Zymo DNA Clean & Concentrator-5 kit and eluted in 50 µL TE buffer. The PCR product’s concentration was quantified with Qubit (dsDNA Broad Range assay) and its length was verified using TapeStation 2200 (D1000 tape). Library preparation was performed according to Nanopore Technologies’ 1D amplicon by ligation protocol (version ADE_9003_v108_revU_18Oct2016). In brief, the PCR products were end-prepped/dA-tailed for adapter attachment and sequencing adapters were attached by ligation. The final products were cleaned using size-selection beads. All reagents were provided by the SQK-LSK108 kit, except: NEBNext Ultra II End Repair/dA-Tailing buffer and enzyme mix (#E7546), NEB Blunt TA/Ligase Master Mix (#M0367), and Agencourt AMPure XP beads (#A63880). The PRNP coding region amplicons were diluted after end-repair/dA-tailing to bring only 0.2 pmol of DNA fragments to the ligation reaction. 23 ng of final library were loaded onto the MinION flow cell. Sequencing was performed for 53 min and followed live on the MinKNOW software.

FAST5 files were basecalled using guppy v4.5.3 (Oxford Nanopore Technologies). We then computed a consensus sequence of the Nanopore reads using canu v2.1.151, and “polished” it using nanopolish v0.13.252. The resulting consensus sequence was aligned to the sequence obtained by Sanger sequencing using BLASTn (megablast) (NCBI). The full alignment is included in Supplementary Fig. S1.

To call variants, the reads obtained after basecalling were aligned to the human reference genome (hg38) using minimap2 v2.18-r101553. The resulting sam file was converted to bam, then sorted and indexed using samtools v1.954. Nanopolish v.0.13.2 was used for variant calling. Alignments were visualised with the Integrative Genomics Viewer (IGV) v2.8.6.

Nanopore sequencing of full-length PRNP and its regulatory region

Amplification prior to Nanopore sequencing was performed in two subsequent PCRs. The first one amplified the genomic region of interest from genomic DNA, the second attached unique barcodes to allow multiplexing of several samples onto the sequencing flow cell. In order to use the barcoding primers provided by Oxford Nanopore Technologies in the PBK-004 barcoding kit, we designed primers that included a complementary sequence to the barcoding primers (see “Primer sequences” section). We used the NEB LongAmp Taq DNA polymerase, which performed better than other long-range polymerases we tried.

The first PCR amplified PRNP in two amplicons (Fig. 1a, Nanopore amplicons). For the regulatory region amplicon (chr20:4,685,060–4,688,047), primers were primer 3 (forward) and primer 4 (reverse); annealing was 59 °C—25″; elongation time was 2′40″, for 30 cycles. For the gene body amplicon (chr20:4,687,732–4,701,756), primers were primer 2 (forward) and primer 1 (reverse); annealing was 57 °C—30″; elongation time was 13′00″, for 30 cycles. Amount of genomic DNA used as template was 119–443 ng. The reactions and other program settings were as recommended by the manufacturer’s protocol.

The barcoding PCR and library preparation were carried out according to ONT’s barcoding kit (SQK-PBK004, version PBK_9073_v1_revB_23May2018) and ligation sequencing kit (SQK-LSK109) protocols. Up to 12 samples were multiplexed on each flow cell and sequencing was performed for 48 hr.

Basecalling and alignment

The raw, multi-line FAST5 files were processed using ONT Guppy v4.2.2 suite. We used guppy_basecaller for basecalling and guppy_barcoder for demultiplexing, i.e. assigning reads to each sample. The reads in fastq format were then aligned to the human reference genome (hg38) using guppy_aligner. Samtools v1.7.0 was used to sort and index the resulting alignment files (BAM format). IGV v.2.8.6 was used to visualise the alignments.

Single-nucleotide variant calling

We used nanopolish v0.13.3 to call single-nucleotide variants (SNVs) in the PRNP genomic window. Default nanopolish filtering criteria were used: only SNVs in regions sequenced at a minimum coverage of 20× and with an allele frequency of minimum 0.2 were called. We filtered the calls further using strand bias as a criterion39. MinION sequences at random the forward or reverse strand. Hence, for a true positive variant call, the reads supporting the variant are predicted to be ~ 50% forward and ~ 50% reverse. There is evidence of strand bias if this proportion is strongly imbalanced, i.e. when the forward and the reverse reads do not uniformly call for the same nucleotide. We used the StrandOddsRatio (SOR) metric computed by nanopolish55, which increases with evidence of strand bias. To filter SNV calls in non-coding regions of PRNP, we reasoned that the SNVs in the protein-coding region identified by both Sanger and Nanopore sequencing could be used as a set of true positives. For those, the maximum SOR was 0.77 (N = 12). Therefore, we filtered out any SNV call above SOR = 1.0 as showing evidence of strand bias and hence a possible false positive. This decreased the number of SNV calls for all samples (N = 25) from 218 to 203. SNV calls before and after strand-bias filtering are included (Supplementary Information 1).

Haplotype phasing

To assign the gene body reads to haplotypes (haplotagging), we generated a new VCF file for each file containing its filtered SNV calls (see “Single-nucleotide variant calling” section). Five samples (#58648, #54890, #55050, #59060, #53747) had no heterozygous SNV call and therefore their reads could not be haplotagged. We first used whatshap phase to phase the SNV calls in relation to each other56. A single heterozygous SNV call is sufficient for haplotagging but could not be phased by whatshap phase. Therefore, the VCF files of samples with a single SNV call (#54917, #54968, #55048, #43706, #42656) were manually edited to match the formatting of a phased variant call. We then used whatshap haplotag to haplotag the reads.

Structural variant calling

Prior to SV calling, alignments were filtered further to keep only high-quality reads. For the regulatory region amplicon, any read shorter than 1 kb, longer than 3.5 kb, or with more than 15% of its length soft-clipped was discarded. For the gene body amplicon, any read shorter than 10 kb, longer than 15 kb or with more than 5% of its length soft-clipped was discarded. The goal of the maximum length criteria is to exclude potentially chimeric reads. Secondary alignments were also discarded.

We used sniffles v1.0.12 to call structural variants (SVs) larger than 20 bp and supported by at least 100 reads57. Calls were filtered to keep SV calls within the PRNP genomic window and with an allele frequency above 0.1.

Lengths of OPR reads

This refers to Fig. 2c, d. The filtered aligned reads (see above) from the gene body were trimmed to keep only the OPR (chr20:4,699,371–4,699,493) using samtools ampliconclip54. From the reads we obtained, any read shorter than 21 bp or longer than 702 bp were discarded. 21 bp was chosen as minimum length as it would correspond to a 4 OPRD, minus 6 bp to account for small artefactual deletions. 702 bp was chosen as maximum length as it would correspond to a 24 OPRI, plus 3 bp to account for small artefactual insertions. We then counted the number of reads of each length to produce the histograms in Fig. 2c, d.

Search for somatic mutations of the OPR

The haplotagged OPR reads were imported in R from SAM files, for a total of 208,554 OPR reads. We parsed the CIGAR of each read to calculate its total insertion and deletion. For example, 11500H43M1I24M1D29M1I7M3D7M2D7M2306H returned insertion 2 bp (1I + 1I) and deletion 6 bp (1D + 3D + 2D). The hard clips (11500H, 2306H) were created by samtools ampliconclip. Each read was then assigned to a most likely OPR genotype based on its total insertion/deletion. For each OPR genotype, the interval of possible lengths was defined as the target insertion/deletion length minus 6 bp to allow for small artefactual deletions up to the target insertion/deletion length plus 3 bp to allow for small artefactual insertions. For example, the target deletion length for 1 OPRD is − 24 bp. Therefore, any read with a total deletion of − 30 bp up to − 21 bp was labelled as a potential 1 OPRD read. As the target insertion/deletion for the reference OPR is 0 bp, the interval was set from deletions of − 6 bp up to insertions of + 3 bp. Any read with a total insertion/deletion not included in one of the intervals (e.g. a 10-bp insertion) was labelled as unassigned.

At this stage, we generated Supplementary Fig. S4 by plotting for each sample the proportion of reads labelled as reference or as the expected OPR mutation assigned to each haplotype.

In parallel, we built OPR template sequences using the OPR consensus sequence. Each read, except if labelled as unassigned, was then aligned one-to-one to the OPR template matching its OPR label (Supplementary Fig. S5), returning the number of mismatches. Alignments were performed with ClustalW58, implemented in the msa R package59. The mismatch threshold to filter somatic mutation calls was calculated as the mean + standard deviation of the number of mismatches of all ‘expected’ reads, i.e. reference reads and reads matching the OPR genotype of their sample. Somatic mutations calls were then defined as any read not labelled as reference, not labelled as the OPR genotype of its sample, and having fewer mismatches with its template than 5.8% of its length (Supplementary Fig. S3). For the compound heterozygous sample #46345, reads labelled as reference and below the mismatch threshold were also defined as somatic mutation calls.

Control PCR to test for PCR-introduced errors

1 ng of genomic DNA from sample #56635, approximately 307 haploid genomes60, was PCR amplified to produce the 1015-bp amplicon covering the protein-coding sequence (see Sanger sequencing). The product was cleaned, eluted in 25 µL of dH2O, and its concentration was quantified at 10.36 ng/µL by Qubit (dsDNA Broad Range assay). 1 µL of this product was then barcoded and sequenced as previously (see “Nanopore sequencing of full-length PRNP and its regulatory region” section).

The fast5 files were basecalled to produce reads which were aligned to the human reference genome as previously (see Basecalling and alignment). We then filtered the reads by discarding any read shorter than 900 bp, longer than 1300 bp, or with more than 20% of its length soft-clipped. We trimmed the reads to keep only the OPR (see “Lengths of OPR reads” section), generating 366,243 OPR reads. Calling possible somatic mutations was then performed as previously (see “Search for somatic mutations of the OPR” section).

The original frequency of somatic mutation calls in sample #56635 was 0.0325% (9 of 26,764 OPR reads were somatic mutation calls). Let us assume that these somatic mutation calls were genuine, and that any genome taken at random from sample #56635 had a probability of carrying an OPR mutation \(P(mutated)=0.000325\). Conversely, the probability that any genome taken at random from sample #56635 was OPR reference was \(P(reference)=1- 0.000325=0.99967\).

The probability to have drawn exactly 0 genome of the 307 can be written as the probability that all 307 genomes were reference, which is \({P\left(reference\right)}^{307}={0.99967}^{307}= 0.904\), i.e. there was a 90.4% probability to draw 0 mutated genome in the 307.

The probability that exactly \(n\) genomes out of \(k\) genomes drawn (here, 307) were mutated can be calculated as

$$\left( {\begin{array}{*{20}c} k \\ n \\ \end{array} } \right) \times P\left( {mutated} \right)^{n} \times P\left( {reference} \right)^{k - n}$$
(1)

where

$$\left( {\begin{array}{*{20}c} k \\ n \\ \end{array} } \right) = \frac{k!}{{n! \times \left( {k - n} \right)!}}$$
(2)

From Eqs. (1) and (2), the probability of having drawn exactly 1 mutated genome in the 307 was \(307 \times 0.000325^{1} \times 0.99967^{307 - 1} = 0.0902\), i.e. there was a 9.02% probability of drawing exactly 1 mutated genome in the 307.

The probability of drawing 0 or 1 mutated genome in the 307 was therefore \(0.904+0.0902=0.9942\), i.e. the probability to draw more than one mutated genome was 0.58%.

Primer sequences

The sequence complementary to the ONT barcoding primers (kit PBK-004) is shown in bold.

Primer 1:

Sequence: ACTTGCCTGTCGCTCTATCTTCGTGGGTGACTGGGAAGTGAG.

Direction: Reverse.

Binding site: chr20:4,701,737–4,701,756.

Primer 2:

Sequence: TTTCTGTTGGTGCTGATATTGCTGGCCGCGTTATTTTCTTGT.

Direction: Forward.

Binding site: chr20:4,687,732–4,687,751.

Primer 3:

Sequence: TTTCTGTTGGTGCTGATATTGCACATAAACATGGCCCAGGCA.

Direction: Forward.

Binding site: chr20:4,685,060–4,685,079.

Primer 4:

Sequence: ACTTGCCTGTCGCTCTATCTTCCTCTGCCTGCTTTTCCCCTT.

Direction: Reverse.

Binding site: chr20:4,688,028–4,688,047.

All genomic positions are given for the human reference genome GRCh38 (hg38).