Sequencing of human genomes with nanopore technology

Whole-genome sequencing (WGS) is becoming widely used in clinical medicine in diagnostic contexts and to inform treatment choice. Here we evaluate the potential of the Oxford Nanopore Technologies (ONT) MinION long-read sequencer for routine WGS by sequencing the reference sample NA12878 and the genome of an individual with ataxia-pancytopenia syndrome and severe immune dysregulation. We develop and apply a novel reference panel-free analytical method to infer and then exploit phase information which improves single-nucleotide variant (SNV) calling performance from otherwise modest levels. In the clinical sample, we identify and directly phase two non-synonymous de novo variants in SAMD9L, (OMIM #159550) inferring that they lie on the same paternal haplotype. Whilst consensus SNV-calling error rates from ONT data remain substantially higher than those from short-read methods, we demonstrate the substantial benefits of analytical innovation. Ongoing improvements to base-calling and SNV-calling methodology must continue for nanopore sequencing to establish itself as a primary method for clinical WGS.

1 Supplementary Note

Patient Medical History and Genetic Findings
The female patient, born following IVF to unrelated Caucasian parents, presented at the age of 9 months with recurrent respiratory tract infections, panhypogammaglobulinaemia, thrombocytopaenia and mild anaemia. Replacement immunoglobulin therapy from 17 months of age improved infection frequency and platelet count, and in spite of normal IgA and IgM levels, resulting severe thrombocytopaenia and the onset of other immune conditions have since thwarted the withdrawal of immunoglobulin replacement therapy. Having developed severe enteropathy with granulomatous inflammation, at the age of 12 years she required a pancolectomy due to incontrollable haemorrhage during colonoscopy. Despite this, and treatment with immunomodulating agents including anti-TNF, she remains symptomatic with chronic diarrhoea and fistulation. She developed psoriasis at the age of 20 and a seronegative large-joint oligoarthritis at 24 years requiring management with methrotrexate. After a childhood history of dyspraxia the patient developed progressive neurological symptoms from the age of 20-25 years, including initially poor balance and coordination and difficulty fixating on objects and progressing to a dependence on mobility aids since the age of 28, with additional severe dysarthria and some extra-cerebellar symptoms including difficulty in swallowing and word-finding. Neurological investigations have revealed cerebellar ataxia, gazeevoked nystagmus in all directions, hyperreflexia, cerebellar atrophy and extensive supratentorial grey and white matter signal changes on MRI and a raised cerebrospinal fluid protein level with no evidence of infection. Tests for known genetic causes of antibody deficiency or cerebellar atrophy were negative.

Model formulation and complete data probability
The math in this section can be seen as an extension of the STITCH approach of Davies et al. [1]. Consider a diploid individual with one maternal and one paternal haplotype that intersect a set of T SNPs (indexed t ∈ {1, ..., T }) at physical positions L along some chromosome (i.e. L t ∈ N). Consider a collection of sequencing reads (indexed r ∈ N) that come from those two haplotypes. Let read r come from haplotype H r ∈ {1, 2} for arbitrarily labelled maternal (1) and paternal (2) haplotypes. We assume that, without additional information, each read is equally likely to come from the two haplotypes regardless of any parameters θ, or P (H r = k) = P (H r = k|θ) = 1 2 At a particular SNP, without loss of generality, let 0 be the reference base, 1 the alternate base, and let 2 and 3 the other two bases. Define each observed read R r = {(u r,j , s r,j , b r,j )|j = 1, ..., J r }, where each read intersects J r SNPs, where u r,j is the index of SNP j from read r (i.e. u r,j ∈ {1, ..., T }), s r,j is the observed sequenced base for SNP j in read r (s r,j ∈ {0, 1, 2, 3}), and b r,j is the phred scaled base quality (b r,j ∈ N). Let g r,j (g r,j ∈ {0, 1, 2, 3}) be the real (unobserved) base for read r and SNP j. Define the probability of the underlying sequenced read having the alternate or reference base as being P (G r,j = 1|H r = k, θ) = θ ur,j ,k P (G r,j = 0|H r = k, θ) = (1 − θ ur,j ,k ) or, more generally for convenience P (G r,j = g, |H r = k, θ) = θ g ur,j ,k (2) with 0 ≤ θ t,k ≤ 1, and where superscript g relates to notation and not to exponentiation. Note that we define P (G r,j > 1, |H r = k, θ) = 0 under the model.
Fundementally, at each SNP it intersects, each read has a real underlying genotype g r,j , and we observe a sequenced base s r,j , such that P (S r,j = v|G r,j = g) = 1 − 10 and again for convenience define P (S r,j = s r,j |G r,j = w) = φ w r,j where again superscript w relates to notation and not exponentiation. We can also calculate the probability of observing base v in read r at SNP u r,j given it came from haplotype k as P (S r,j = v|H r = k, θ) = 1 g=0 P (S r,j = v|H r = k, G r,j = g, θ)P (G r,j = g|H r = k, θ) = 1 g=0 P (S r,j = v|G r,j = g)P (G r,j = g|H r = k, θ) (5) = 1 g=0 φ g r,j θ g ur,j ,k = φ 1 r,j θ ur,j ,k + φ 0 r,j (1 − θ ur,j ,k ) We then define the probability of observing a read as the product of the probabilities of observing each sequenced base, or P (R r |θ) = Jr j=1 P (S r,j |θ) Finally, we can calculate the joint probability of the observations and hidden parameters as P (R r , H r = h r , G r = g r |θ) = |O| r=1 Jr j=1 P (S r,j = s r,j , H r = h r , G r,j = g r,j |θ) Jr j=1 P (S r,j = s r,j |H r = h r , G r,j = g r,j , θ)P (H r = h r , G r,j = g r,j |θ) Jr j=1 P (S r,j = s r,j |G r,j = g r,j )P (G r,j = g r,j |H r = h r , θ)P (H r = h r |θ)

Initialization and iteration
We initialized θ t,k ∀t, k by performing random draws from the uniform U (0, 1) distribution. We then performed 300 EM iterations, whereby we first calculated P (H r = k, G r,j = v|O, θ) given θ, and then calculated new θ using Equation 12. We also applied the heuristic (described below) at every 20th iteration between iterations 20 and 200.

Heuristics
Expectation maximization is a technique that is guaranteed to find parameters that locally maximize the likelihood function of a statical model. However, if the likelihood function is not concave, there is no guarantee that a global maxima will be obtained. Here, we describe a heuristic approach that attempts to enable discovery of parameter values that enable higher local maxima to be found.
Recall that the model we've described attempts to model an individuals diploid chromosomes using parameters such that haplotype k emits a read containing a reference base at SNP t with probability θ t,k . Ideally, a one-to-one correspondence can be found in which the maternal haplotype refers along it's entire length of the computational haplotype k = 1 or k = 2. However, since we initialize with random data, then it is likely that local maxima will be found where locally, the maternal haplotype looks like either haplotype k = 1 or k = 2, and then switches to the other haplotype at some further SNP. We implemented a method that attemps to find such switches and to resolve them.
We therefore scan between all pairs of SNPs and see whether a likelihood calculated using a subset of the data is improved if an artificial phase switch is introduced. More formally, between SNPs t and t + 1, we get the subset of reads O t,t+1 = {R r |∃j 1 , j 2 s.t.u r,j1 = t, u r,j2 = t + 1}. We then introduce a phase switch between SNPs t and t + 1 in a temporary set of parameters, i.e. θ t,1 = θ t,1 , θ t,2 = θ t,2 , θ t+1,1 = θ t,2 , θ t+1,2 = θ t,1 . We then perform two iterations using the new parameters and the subset of the dataset to iterate updating with this artificial set of parameters, and calculate a likelihood l(θ |O t,t+1 ).
Once this is done for all pairs of SNPs, we begin a resursive process to select a set of phase switches to apply to the dataset. We consider only phase switches that offer an increase in the loglikelihood in the restricted observation set. We then recursively select a phase switch that offers the greatest increase in log likelihood relative to the original parameters among the set of eligible phase switches, and then remove from consideration phase switches from all pairs of SNPs that intersect the phase set of the current pairs of SNPs, where we consider a phase set to be a partionning of the available SNPs t = 1, ..., T into a minimal number of member sets such that every SNP in every member of a phase set can be connected by tiling overlapping reads. The recursion ends when there are no more eligible phase switches to consider.

Base-calling performance
To evaluate the impact of different basecalling algorithms on read-level base accuracy and consequent variant calling accuracy, we performed a series of benchmarks.
The benchmarks are based on a data set built in the early phase of the project which used Nanonet v2.0.0 for base-calling. From this data set, we selected reads mapping to chromosome 22 and extracted the corresponding MinION reads from the original Fast5 data set. The Fast5 was then re-called using three alternatives to Nanonet v2.0.0: Albacore v2.0.2, Metrichor v2.43.1 and Scrappie v0.2.2.
Albacore v2.0.2 was run using the R9.4 450bps linear configt; Scrappie v0.2.2 was run in default configuration; Metrichor was run using the 1D workflow. After basecalling, the reads in each dataset were re-mapped against the human genome reference (see Methods in the main manuscript).
We noted substantial variation in the resulting read-level substitution, deletion and insertion error rates (Figures 3, 5, 4).
Albacore v2.0.2 achieved the lowest unfiltered substitution error rate (mean 14.3%) and deletion error rate (mean 5.22%). The latter is offset by an elevated insertion error rate (mean 3.3%)

Variant calling performance
We tested the variant calling performance characterstics on each of the four base-called data sets described above. In addition, we employed various post-alignment filters to check the effect of removing low-quality reads before variant calling. The filters were "unfiltered", "fixed error" (removing all reads with an error rate of 20%) and "fixed size" (removing 20% of reads with the highest error rate) (11). Variant calling and evaluation were performed as described in the Methods section of the main manuscript.
In terms of variant calling performance, the -now discontinued -base caller Metrichor performed best, but with Albacore showing almost comparable performance. Scrappie and Nanonet produced worse results (12). At optimum performance, Metrichor and Albacore had similar false negative rates, but Metrichor overall had lower false positive rates (12). Filtering the read set by quality had little effect on the overall result.

Supplementary Tables
Supplementary Table 1: Flow cell statistics Number of reads and bases per flow cell. Numbers are listed separated for reads in the "fail" and "pass" fraction. Additionally, the fraction "trimmed" contains read numbers and lengths after trimming. The trimming tool splits read with internal adapter sequences into multiple "fragments", which are listed separately.  300 1,189,885,461 204,209 1,166,775,612 205,797 24 WTON000202 12,753 20,247,856 296,652 1,714,812,395 296,483 1,680,565,491 298,931 24 WTON000203 9,998 15,576,774 195,529 1,117,477,874 195,429 1,095,294,766 196,880 24 WTON000204 13,008 18,160,021 255,137 1,456,910,268 254,998 1,427,499,519 256,885 24 WTON000205 7,437 10,647,602 139,277 790,212,374 139,194 774,565,197 140,191 24 WTON000206 12,461 25,649,836 268,968 1,537,659,678 268,847 1,506,469 Supplementary Table 2: Base-caller comparison results Error rates in mapped reads basecalled by different base callers. Shown are the substitution, insertion and deletion error rates, the total number of reads in a set, the number of reads mapped and the average read length. The "basecaller" column list the base-caller and the "process" column indicates how the data set was filtered: "fixed error" remove alignments with an error rate above 20%, "fixed size": remove approx. 20% of alignments with highest error rate, "unfiltered": all alignments.        Variant calling results using different base callers and input filtering methods. "filtering": filtering method applied to BAM files ("unfiltered", "fixed error": remove all alignments above error rate of 20%, "fixed size": remove approx. 20% of alignments with highest error rate, ), "RH/RA": contamination level thresholds for freebayes, "threshold": QUAL threshold for optimal F1 performance, "f measure": F1 measure, "unfiltered false negative rate": number of reference variant sites missing irrespective of QUAL threshold and genotype.       in the NA12878 genome (chr22). The figure shows the allele frequency at heterozygous sites ("het"), sites that are homozoygous for the alternative allele ("hom alt") and sites that are homozygous for the reference allele ("hom ref"). Allele frequencies are shown for the reference allele ("freq ref") and the alternate allele ("freq alt"). The modes of the reference allele frequency at homozygous reference positions and the alternate allele frequency at homozygous alternate allele positions are shifted from their expectation at 0 and 1, respectively. Part of the shift is explained by the substitution error rate in the sequence data. However, the larger shift of alternate allele frequencies at sites that are homozygous for the alternate allele ("hom-alt freq alt") compared to reference allele frequencies at sites that are homozygous for the reference allele ("hom-ref freq ref") indicates a reference alignment bias: the alignment algorithm places gaps to minimize the number of substitutions with respect to the reference sequence. Similarly shifted from the expectation at 0.5 are allele frequencies at heterozygous sites with a higher frequency of the reference allele compared to the alternate allele.
Supplementary Figure 9: Kmer frequencies at false positive sites indicate base-calling bias Shown here are frequencies of kmers of size 3. Each dot represent a particular kmer, with the X-axis showing the frequency of this kmer in chr22. The Y-axis shows the frequency of the kmer in the set of false positives. Deviation from the diagonal indicates an enrichment of a kmer in the set of false positives. A kmer is coloured black if it contains a CpG dinucleotide and orange if it does not. Rare kmers are enriched in the set of false positives and all include a CpG dinucleotide. CpG dinucleotides are rare in the genome. As a consequence, base callers trained on genomic data will have less training data available increasing the error rate at such sites.
Supplementary Figure 11: Number of reads in the data sets for base-caller comparison The plot shows the number of reads in the different data sets generated for variant calling performance using different base callers. fixed error: only use alignments with less than 20% error rate, fixed size: remove 20% of alignments with the highest error rate, unfiltered: all alignments Supplementary Figure 12: Variant calling F1 measures for different base-called input The optimimum F1 value is shown based on optimizing across variant caller parameters and using a variant call quality threshold (QUAL) producing the highest F1 score.

Supplementary Figures -LVC Snapshots
This section contains various IGV snapshots. The track layout in all snapshots is the same. The top of the figure shows repeat annotations from the UCSC genome browser. The middle section displays read data from ONT, Illumina and Pacbio. The bottom section contains large variants called by the SVClassify method (SVS), the Pacbio data set (PACBIO) and the large variant calls using the variant caller on ONT data (ONT) described in this manuscript.

Large variant calls in ONT data not present in reference
This section shows large variant calls based on ONT data, which are not present in reference data set and have no or weak evidence in PACBIO data. Based on visual inspection we believe these calls to be true positives.
Supplementary Figure 31: Substitution rates in reads spanning variants of interest in SAMD9L Shown is the subsitution rate per read of reads spanning the two variants c.1076G>A and c.3353A>G. Reads are grouped into reads supporting a cis arrangement of the variants, reads supporting a trans arrangement of the variants, and reads where the base is neither the reference nor the alternate variant at either of the sites.
Supplementary Figure 32: Phasing variants using allele-specific PCR A) Strategy 1 is explained in the schematic diagram and employed a semi-nested PCR format. The first-round PCR used a universal R primer in tandem with one of two different F primers where the 3'-terminal base was either a T or a C, i.e. complementary to either the reference or alternate base at the site of the first de novo mutation (chr7:92,761,932T>C). The second-round of PCR used primers FN and R. PCR amplicons were purified using the exoSAP method and Sanger sequenced using BigDye3.1 chemistry. Where the "932C" primer had been used which is complementary to the mutant allele at the first de novo site, the predominant peak at the site of the second de novo mutation corresponds to the mutant T allele (red arrow). Where the "932T" (reference) primer had been used, it is predominantly the reference C allele seen at the second site. B) Strategy 2 is the same as Strategy 1 but in reverse. Allele-specific primers were designed at the site of the second mutation (chr7:92,764,209C>T). Again, it is predominantly the mutant C allele observed (blue arrow) at the site of the first de novo site where the "209T" (mutant) primer had been used. The genomic coordinates shown are based on the hg19 build.