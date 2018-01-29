Human DNA.

Human genomic DNA from the GM12878 human cell line (CEPH/Utah pedigree) was either purchased from Coriell as DNA (cat. no. NA12878) or extracted from the cultured cell line also purchased from Coriell (cat. no. GM12878). Cell culture was performed using Epstein–Barr virus (EBV)-transformed B lymphocyte culture from the GM12878 cell line in RPMI-1640 media with 2 mM L-glutamine and 15% FBS at 37 °C.

QIAGEN DNA extraction.

DNA was extracted from cells using the QIAamp DNA mini kit (Qiagen). 5 × 10⁶ cells were spun at 300g for 5 min to pellet. The cells were resuspended in 200 μl PBS and DNA was extracted according to the manufacturer's instructions. DNA quality was assessed by running 1 μl on a genomic ScreenTape on the TapeStation 2200 (Agilent) to ensure a DNA Integrity Number (DIN) >7 (value for NA12878 was 9.3). Concentration of DNA was assessed using the dsDNA HS assay on a Qubit fluorometer (Thermo Fisher).

Library preparation (SQK-LSK108 1D ligation genomic DNA).

1.5–2.5 μg human genomic DNA was sheared in a Covaris g-TUBE centrifuged at 5,000–6,000 r.p.m. in an Eppendorf 5424 (or equivalent) centrifuge for 2 × 1 min, inverting the tube between centrifugation steps.

DNA repair (NEBNext FFPE DNA Repair Mix, NEB M6630) was performed on purchased DNA but not on freshly extracted DNA. 8.5 μl nuclease-free water (NFW), 6.5 μl FFPE Repair Buffer and 2 μl FFPE DNA Repair Mix were added to the 46 μl sheared DNA. The mixture was incubated for 15 min at 20 °C, cleaned up using a 0.4× volume of AMPure XP beads (62 μl), incubated at room temperature with gentle mixing for 5 min, washed twice with 200 μl fresh 70% ethanol, pellet allowed to dry for 2 min, and DNA eluted in 46 μl NFW or EB (10 mM Tris pH 8.0). A 1 μl aliquot was quantified by fluorometry (Qubit) to ensure ≥1 μg DNA was retained.

End repair and dA-tailing (NEBNext Ultra II End-Repair/dA-tailing Module) was then performed by adding 7 μl Ultra II End-Prep buffer, 3 μl Ultra II End-Prep enzyme mix, and 5 μl NFW. The mixture was incubated at 20 °C for 10 min and 65 °C for 10 min. A 1× volume (60 μl) AMPure XP clean-up was performed and the DNA was eluted in 31 μl NFW. A 1-μl aliquot was quantified by fluorometry (Qubit) to ensure ≥700 ng DNA was retained.

Ligation was then performed by adding 20 μl Adaptor Mix (SQK-LSK108 Ligation Sequencing Kit 1D, Oxford Nanopore Technologies (ONT)) and 50 μl NEB Blunt/TA Master Mix (NEB, cat. no. M0367) to the 30 μl dA-tailed DNA, mixing gently and incubating at room temperature for 10 min.

The adaptor-ligated DNA was cleaned up by adding a 0.4 × volume (40 μl) of AMPure XP beads, incubating for 5 min at room temperature and resuspending the pellet twice in 140 μl ABB (SQK-LSK108). The purified-ligated DNA was resuspended by adding 25 μl ELB (SQK-LSK108) and resuspending the beads, incubating at room temperature for 10 min, pelleting the beads again, and transferring the supernatant (pre-sequencing mix or PSM) to a new tube. A 1-μl aliquot was quantified by fluorometry (Qubit) to ensure ≥500 ng DNA was retained.

Sambrook and Russell DNA extraction.

This protocol was modified from Chapter 6 protocol 1 of Sambrook and Russell51. 5 × 10⁷ cells were spun at 4,500g for 10 min to pellet. The cells were resuspended by pipette mixing in 100 μl PBS. 10 ml TLB was added (10 mM Tris-Cl pH 8.0, 25 mM EDTA pH 8.0, 0.5% (w/v) SDS, 20 μg/ml Qiagen RNase A), vortexed at full speed for 5 s and incubated at 37 °C for 1 h. 50 μl Proteinase K (Qiagen) was added and mixed by slow inversion ten times followed by 3 h at 50 °C with gentle mixing every 1 h. The lysate was phenol-purified using 10 ml buffer saturated phenol using phase-lock gel falcon tubes, followed by phenol:chloroform (1:1). The DNA was precipitated by the addition of 4 ml 5 M ammonium acetate and 30 ml ice-cold ethanol. DNA was recovered with a glass hook followed by washing twice in 70% ethanol. After spinning down at 10,000g, ethanol was removed followed by 10 min drying at 40 °C. 150 μl EB (Elution Buffer) was added to the DNA and left at 4 °C overnight to resuspend.

Library preparation (SQK-RAD002 genomic DNA).

To obtain ultra-long reads, the standard Rapid Adapters (RAD002) protocol (SQK-RAD002 Rapid Sequencing Kit, ONT) for genomic DNA was modified as follows. 16 μl of DNA from the Sambrook extraction at approximately 1 μg/μl, manipulated with a cut-off P20 pipette tip, was placed in a 0.2 ml PCR tube, with 1 μl removed to confirm quantification value. 5 μl FRM was added and mixed slowly ten times by gentle pipetting with a cut-off pipette tip moving only 12 μl. After mixing, the sample was incubated at 30 °C for 1 min followed by 75 °C for 1 min on a thermocycler. After this, 1 μl RAD and 1 μl Blunt/TA ligase was added with slow mixing by pipetting using a cut-off tip moving only 14 μl ten times. The library was then incubated at room temperature for 30 min to allow ligation of RAD. To load the library, 25.5 μl RBF (Running Buffer with Fuel mix) was mixed with 27.5 μl NFW, and this was added to the library. Using a P100 cut-off tip set to 75 μl, this library was mixed by pipetting slowly five times. This extremely viscous sample was loaded onto the “spot on” port and entered the flow cell by capillary action. The standard loading beads were omitted from this protocol owing to excessive clumping when mixed with the viscous library.

MinION sequencing.

MinION sequencing was performed as per manufacturer's guidelines using R9/R9.4 flow cells (FLO-MIN105/FLO-MIN106, ONT). MinION sequencing was controlled using Oxford Nanopore Technologies MinKNOW software. The specific versions of the software used varied from run to run but can be determined by inspection of fast5 files from the data set. Reads from all sites were copied off to a volume mounted on a CLIMB virtual server (http://www.climb.ac.uk) where metadata was extracted using poredb (https://github.com/nickloman/poredb) and base-calling performed using Metrichor (predominantly workflow ID 1200, although previous versions were used early on in the project) (http://www.metrichor.com). We note that base-calling in Metrichor has now been superseded by Albacore and is no longer available. Scrappie (https://github.com/nanoporetech/scrappie) was used for the chr20 comparisons using reads previously identified as being from this chromosome after mapping the Metrichor reads. Albacore 0.8.4 (available from the Oxford Nanopore Technologies user community) was used for the ultra-long read set, as this software became the recommended base-caller for nanopore reads in March 2017. Given the rapid development of upgrades to base-caller software we expect to periodically re-base-call these data and make the latest results available to the community through the Amazon Open Data site.

Modified MinION running scripts.

In a number of instances, MinION sequencing control was shifted to customized MinKNOW scripts. These scripts provided enhanced pore utilization/data yields during sequencing, and operated by monitoring and adjusting flow cell bias-voltage (–180 mV to –250 mV), and used an event-yield-dependent (70% of initial hour in each segment) initiation of active pore channel assignment via remuxing (reselection of ideal pores for sequencing from each group of four wells available around each channel on the flowcell). More detailed information on these scripts can be found on the Oxford Nanopore Technologies user community. In addition, a patch for all files required to modify MinION running scripts compatible with MinKNOW 1.3.23 only is available (Supplementary Code 1).

Live run monitoring.

To assist in choosing when to switch from a standard run script to a modified run protocol, a subset of runs was monitored with the assistance of the minControl tool, an alpha component of the minoTour suite of MinION run and analysis tools (https://github.com/minoTour/minoTour). minControl collects metrics about a run directly from the grouper software, which runs behind the standard ONT MinKNOW interface. minControl provides a historical log of yield measured in events from a flow cell enabling estimations of yield and the decay rate associated with loss of sequencing pores over time. MinKNOW yield is currently measured in events and is scaled by approximately 1.7 to estimate yield in bases.

Assembly.

All “NG” statistics were computed using a genome size of 3,098,794,149 bp (3.1 Gbp), the size of GRCh38 excluding alt sites.
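As an illustration of how an "NG" statistic differs from the usual "N" statistic, the following minimal sketch computes NG50 against a fixed genome size rather than the assembly size. The function name `ng_stat` and the toy contig lengths are hypothetical; only the genome size comes from the text.

```python
GRCH38_SIZE = 3_098_794_149  # genome size stated in the text (GRCh38 excluding alt sites)


def ng_stat(contig_lengths, genome_size, fraction=0.5):
    """Return the NG statistic for a set of contig lengths.

    Contigs are visited from longest to shortest; the result is the length of
    the contig at which the running total first reaches fraction * genome_size.
    Returns 0 when the assembly never covers that much of the genome.
    """
    target = fraction * genome_size
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


# Toy example with a hypothetical 100-bp genome.
print(ng_stat([40, 30, 20, 10], genome_size=100))
```

Because the denominator is the reference genome size, an assembly that is missing sequence is penalized, which is not the case for N50.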

Canu v1.4 (+11 commits) r8006 (4a7090bd17c914f5c21bacbebf4add163e492d54) was used to assemble the initial 20-fold coverage data set:

These are the suggested low-coverage parameters from the Canu documentation, but with a decreased maximum evidence error rate. This specific parameter was decreased to reduce memory requirements after it was determined that the MinHash overlapping algorithm was underestimating error rates owing to systematic error in the reads. Counterintuitively, this systematic error makes two reads look more similar than they are, because they share more k-mers than expected under a random model. Manually decreasing the maximum overlap error rate threshold adjusted for this bias. The assembly took 40K CPU hours (25K to correct and 15K to assemble). This is about twofold slower than a comparable PacBio data set, mostly because of the higher noise and errors in the nanopore reads.

The same version of Canu was also used to assemble the 30-fold data set:

For this larger data set, overlapping was again tweaked by reducing the number of hashes used and increasing the minimum overlap identity threshold. This has the effect of lowering sensitivity to further compensate for the bias in the input reads. This assembly required 62K CPU hours (29K to correct, 33K to assemble), about fourfold slower than a comparable PacBio data set, and a peak of 120 GB of memory. The assembly ran on a cluster composed of a mix of 48-thread dual-socket Intel E5-2680 v3 @ 2.50GHz CPUs with 128 GB of memory and 8-thread dual-socket Intel CPU E5-2698 v4 @ 2.20GHz CPUs with 1,024 GB of memory.

The combined data set incorporating an additional 5× coverage of ultra-long reads was assembled with an updated version of Canu v1.4 (+125 commits) r8120:

This assembly required 151K CPU hours (15K to correct, 86K to trim, and 50K to assemble) and a peak of 112 GB of memory. These high runtimes are a consequence of the ultra-long reads. In particular, the current Canu trimming algorithm was not designed for reads of this extreme length and high error rate after correction and the algorithms used are not optimal.

Assembly contiguity modeling.

Expected assembly contiguity was modeled on repeat tracks downloaded from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/).

For a given repeat identity (0%, 90%, 95%, 98%, 99%, and 99.5%), all repeats with a lower identity estimate (genomicSuperDups and chainSelf) were filtered and overlapping repeats were merged. Gaps in the reference were also considered as repeats. To compute the maximum repeat length likely to be spanned by a given sequence distribution, the probability of an unspanned repeat of a fixed length was estimated for all lengths between 1 and 100 kbp in steps of 1 kbp using an equation from http://data-science-sequencing.github.io/lectures/lecture7/ (refs. 52–54):

where G is the genome size, L is the read length, a_i is the number of repeats of length 1 ≤ i ≤ L − 2, N is the number of reads ≥ L, and c is the coverage in reads ≥ L. We used the distribution of all repeats for a_i and plotted the shortest repeat length such that P(at least one repeat is unbridged) > 0.05 for real read-length distributions from both nanopore and PacBio sequencing runs. Assemblies of the data were plotted at their predicted spanned read length on the x axis and NG50 on the y axis for comparison with the model. A 30× run of ultra-long coverage was simulated from the 5× dataset by repeating each ultra-long read six times.

Assembly validation and structural variant analysis.

Assemblies were aligned using MUMmer v3.23 with parameters “-l 20 -c 500 -maxmatch” for the raw assemblies and “-l 100 -c 500 -maxmatch” for the polished assemblies. Output was processed with dnadiff to report average 1-to-1 alignment identity. The MUMmer coords file was converted to a tiling using the scripts from Berlin et al.55 with the command:

and drawn using the coloredChromosomes package56. Since the reference is a composite of human genomes and there are true variations between the reference and NA12878, we also computed a reference-free estimate of identity. A 30-fold subset of the Genome In a Bottle Illumina data set for NA12878 (ref. 20) was downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/RMNISTHS_30xdownsample.bam. Samtools fastq was used to extract fastq paired-end data for the full data set and for the reads mapping to chromosome 20. The reads were aligned to the whole genome assembly and chromosome 20 assemblies with BWA-MEM 0.7.12-r1039. BWA-MEM is a component of the BWA package and was chosen because of its speed and ubiquitous use in sequence mapping and analysis pipelines. Aside from the difficulties of mapping the ultra-long reads unique to this work, any other mapper could be used instead. Variants were identified using FreeBayes v1.0.2 (ref. 57), a widely used method originally developed for short-read sequencing but also applicable to long reads, with the command:

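The length of all variants was summed and the total number of bases with at least 3× coverage was summed using samtools depth. QV was computed as −10 × log10(summed variant length / number of bases with ≥3× coverage), and identity was computed as 100 × (1 − summed variant length / number of bases with ≥3× coverage). Dotplots were generated with “mummerplot --fat” using the 1-to-1 filtered matches.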
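The two summed quantities described here (total variant length and number of bases with at least 3× coverage) can be turned into a quality value and an identity as in the sketch below. It assumes the conventional Phred scaling for QV; the function name and the example numbers are hypothetical and are not results from this study.

```python
import math


def qv_and_identity(variant_bases, covered_bases):
    """Return (QV, percent identity) from summed variant length and covered bases.

    Assumes QV = -10 * log10(error rate) and identity = 100 * (1 - error rate),
    where error rate = variant_bases / covered_bases.
    """
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    error_rate = variant_bases / covered_bases
    identity = 100.0 * (1.0 - error_rate)
    # A perfect consensus has no finite Phred score.
    qv = math.inf if error_rate == 0 else -10.0 * math.log10(error_rate)
    return qv, identity


# Hypothetical example: 1,000 variant bases over 1,000,000 covered bases.
qv, identity = qv_and_identity(1_000, 1_000_000)
print(f"QV = {qv:.1f}, identity = {identity:.2f}%")
```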

A previously published GM12878 PacBio assembly5 was aligned as above with MUMmer v3.23. The resulting alignment files were uploaded to Assemblytics58 to identify structural variants and generate summary figures. Versus GRCh38, the PacBio assembly identified 10,747 structural variants affecting 10.84 Mbp, and reported an equal balance of insertions and deletions (2,361 vs. 2,724), with a peak at approximately 300 bp corresponding to Alu repeats (Supplementary Fig. 5a and Supplementary Table 6). The high error rate of the nanopore assembly resulted in a much larger number of identified variants (69,151) affecting 23.45 Mbp, with a strong deletion bias (3,900 insertions vs. 28,791 deletions) (Supplementary Fig. 5b and Supplementary Table 6). The Illumina-polished assembly reduced the total variants (47,073) affecting 16.24 Mbp but the deletion bias persisted (2,840 insertions vs. 20,797 deletions) (Supplementary Fig. 5c and Supplementary Table 6).

Base-call analysis.

Sequences were aligned to the 1000 Genomes GRCh38 reference (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.sa) using BWA-MEM version 0.7.12-r1039 with the “-x ont2d” option59. The BAM alignments were converted to PAF format60 and CIGAR-strings parsed to convert alignments to an identity. Summary statistics for each flow cell were tabulated separately and combined. Alignment length versus identity was plotted using smoothScatter in R. Depth of coverage statistics for each flow cell were obtained from “samtools depth -a” and combined. As for the assembly statistics, a genome size of 3,098,794,149 bp was used to compute bases covered. The mean coverage was 25.63 (63.20 s.d.). The minimum coverage was 0 and the maximum was 44,391. Excluding 0-coverage regions, the mean coverage was 27.41 (64.98 s.d.). The coverage histogram was plotted compared with randomly generated Poisson values generated with R's rpois function with λ = 27.4074.
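The text does not state which identity definition was applied when parsing CIGAR strings. One common choice, assumed here purely for illustration, is 1 − (edit distance / alignment columns), with the edit distance taken from the SAM NM tag and the columns counted from the M, I, D, = and X operations. The function name is hypothetical.

```python
import re

# One CIGAR operation: a run length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_identity(cigar, edit_distance):
    """Return alignment identity as 1 - edit_distance / alignment columns.

    Alignment columns are the summed lengths of M, I, D, = and X operations;
    clipping (S, H), skips (N) and padding (P) are ignored.
    """
    columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MID=X")
    if columns == 0:
        return 0.0
    return 1.0 - edit_distance / columns


# Hypothetical read: 5 clipped bases, 90 aligned, 5 inserted, 5 deleted, NM = 15.
print(cigar_identity("5S90M5I5D", 15))
```

Other definitions (for example, counting each gap once regardless of length) give different values, so the definition should be fixed before comparing flow cells.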

Metrichor reads mapping to human chromosome 20 were additionally base-called with Scrappie v0.2.7. Scrappie reads composed primarily of low-complexity sequence were identified using the sdust program included with Minimap (commit: 17d5bd12290e0e8a48a5df5afaeaef4d171aa133)60 with default parameters (-w 64 -t 20). The low-complexity windows in each sequence were merged, and their total length was divided by the read length to compute the percentage of low-complexity sequence in each read. Any read for which this percentage exceeded 50% was removed from downstream analysis. Without this filtering, BWA-MEM did not complete mapping the sequences after >30 days of runtime on 16 cores. Similar filtering on the Metrichor-based reads had only a limited effect on the data set.
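The filtering step can be sketched as follows, assuming the low-complexity windows for one read are available as half-open (start, end) intervals. This is an illustrative reimplementation with hypothetical function names, not the script used in the study.

```python
def low_complexity_fraction(intervals, read_length):
    """Return the fraction of a read covered by the union of the intervals.

    intervals: iterable of half-open (start, end) windows on one read.
    Overlapping or touching windows are merged before their lengths are summed.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged window and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / read_length


def keep_read(intervals, read_length, max_fraction=0.5):
    """Keep a read unless its low-complexity fraction exceeds max_fraction."""
    return low_complexity_fraction(intervals, read_length) <= max_fraction


print(low_complexity_fraction([(0, 10), (5, 20), (30, 40)], 100))
```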

To measure homopolymer accuracy, we extracted pairwise read-to-reference alignments for reads spanning all homopolymers of length 2 or greater. For efficiency, at most 1,000 randomly selected instances were considered for each homopolymer length. Each homopolymer so identified is enclosed by two non-homopolymer “boundary” bases (e.g., the T and G in TAAAG). The number of match, mismatch, insertion, and deletion alignment operations between the boundary bases was tabulated for each homopolymer, and alignments not anchored at the boundary bases with match/mismatch operations were ignored. Homopolymer call length was reported as the number of inserted bases minus the number of deleted bases in the extracted alignment, quantifying the difference between expected and observed sequence length. All base callers with the exception of Scrappie failed in large homopolymer stretches (e.g., Supplementary Fig. 3), consistently capping homopolymers at 5 bp (the k-mer length of the model). Scrappie shows significant improvement, but tended to slightly overcall short homopolymers and undercall longer ones (Fig. 2b).

To quantify deviations from the expected 50:50 allele ratio at heterozygous sites, 25,541 homozygous and 46,098 heterozygous SNP positions on chromosome 20 were extracted from the Illumina Platinum Genomes project VCF for GM12878, requiring a minimum distance of 10 bp between SNP positions. Scrappie base calls at these positions were extracted using samtools mpileup. Deviation from the expected allelic ratio was defined as d = abs(0.5 – [allele A coverage]/[allele A coverage + allele B coverage]). Averaged over all evaluated heterozygous SNPs, d = 0.13 and 90% of SNPs have d ≤ 0.27 (corresponding to approximately ≥25% coverage on the minor allele). Results were similar when stratified by SNP type.
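The deviation d defined above is a one-line calculation per site. The sketch below uses a hypothetical function name and made-up coverage values to show it.

```python
def allelic_deviation(cov_a, cov_b):
    """Return d = |0.5 - A / (A + B)| for the read coverage of two alleles."""
    total = cov_a + cov_b
    if total == 0:
        raise ValueError("site has no coverage on either allele")
    return abs(0.5 - cov_a / total)


# Hypothetical sites: a balanced site and one with 23% coverage on allele A.
print(allelic_deviation(10, 10))
print(allelic_deviation(23, 77))
```

A value of d = 0 corresponds to a perfectly balanced heterozygous site, and d = 0.5 to a site where only one allele is observed.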

Assembly polishing with nanopolish.

We ran the nanopolish consensus-calling algorithm14 on the chromosome 20 assemblies described above. For each assembly we sampled candidate variants from the base-called reads used to construct the contigs (using the “--alternative-basecalls” option) and input the original fast5 files (generated by the base-caller in the Metrichor computing platform) into a hidden Markov model, as these files contained the annotated events that the HMM relies on. The reads were mapped to the draft assembly using BWA-MEM with the “-x ont2d” option.

Each assembly was polished in 50,000-bp segments, and the individual segments were merged into the final consensus. The nanopolish jobs were run using default parameters except the “--fix-homopolymers” and “--min-candidate-frequency 0.01” options were applied.
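Splitting a contig into 50,000-bp polishing windows can be sketched as below. The sketch assumes non-overlapping windows written as 1-based inclusive "contig:start-end" strings; the helper script shipped with nanopolish may use overlapping windows or a different coordinate convention, and the function name here is hypothetical.

```python
def segment_ranges(contig_name, contig_length, segment_size=50_000):
    """Yield 'contig:start-end' region strings that tile a contig.

    Coordinates are 1-based and inclusive; the final segment is shortened
    to end at the last base of the contig.
    """
    for start in range(0, contig_length, segment_size):
        end = min(start + segment_size, contig_length)
        yield f"{contig_name}:{start + 1}-{end}"


# Hypothetical 120-kbp contig split into three segments.
for region in segment_ranges("tig1", 120_000):
    print(region)
```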

Assembly annotation.

Comparative Annotation Toolkit (CAT) (https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/commit/c9503e7ad7718a935b10a72f75302caa5accb15e) was run on both the polished and unpolished assemblies. CAT uses whole genome alignments to project transcripts from a high-quality reference genome to other genomes in the alignment61. The gene finding tool AUGUSTUS is used to clean up these transcript projections and a combined gene set is generated62.

To guide the annotation process, we obtained human RNA-seq data from SRA for a variety of tissues (Supplementary Table 7) and aligned them to both GRCh38 and the two assembly versions. GENCODE V24 was used as the reference annotation. Two separate progressiveCactus63 alignments were generated for each assembly version with the chimpanzee genome as an outgroup.

The frequency of frameshifting insertions or deletions (indels) in transcripts was evaluated by performing pairwise CDS (coding DNA sequence) sequence alignments using BLAT in a translated protein parameterization. Alignments were performed both on raw transMap output as well as on the final consensus transcripts.

Paralogous alignments of a source transcript were resolved through a heuristic combination of alignment coverage, identity, and synteny. Synteny is measured by counting how many gene projections near the current projection match the reference genome. When multiple isoforms of a gene end up in different loci as a result of this process, a rescue step places all isoforms at the highest-scoring locus so that isoforms do not end up on different contigs. Through this process, a 1-to-1 orthology relationship is defined.

MHC analysis.

The ultra-long assembly contains the MHC region between positions 2–6 Mbp within a single 16-Mbp contig (tig01415017). Heterozygous sites were extracted by mapping Illumina reads to the polished assembly using BWA-MEM with default parameters. Alignments were post-processed according to the GATK 3.7 whole-genome variant calling pipeline, except for the “-T IndelRealigner” step using “--consensusDeterminationModel USE_READS”. The -T HaplotypeCaller parameter was used for variant calling. WhatsHap64 was used to phase the Illumina variants with Nanopore reads reported to be contained in the contig by Canu. WhatsHap was modified to accept CRAM (http://genome.cshlp.org/content/21/5/734.long, https://bitbucket.org/skoren/whatshap) output since BAM files could not represent long CIGAR strings at the time of this analysis (https://github.com/samtools/hts-specs/issues/40). First, WhatsHap was run excluding any ultra-long sequences. This generated 18 phase blocks across the MHC. When ultra-long sequences were included the result was a single phase block comprising the entire MHC, supporting the utility of ultra-long reads in resolving haplotypes across large, complex regions in the genome. Nanopore reads were aligned back to the assembly using NGM-LR (CoNvex Gap-cost alignMents for Long Reads)38 and the combined VCF file used for phasing. Reads with more than one phasing marker were classified as haplotype A or B when >55% of their variants were in agreement (Fig. 5a). A new assembly was generated for haplotypes A and B using only reads assigned to each haplotype as well as reads marked homozygous. The assemblies were polished by Pilon 1.21 (ref. 26) using the SGE pipeline at https://github.com/skoren/PilonGrid. Pilon was given all reads mapping to the MHC.
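The read-assignment rule (more than one phasing marker, and >55% of markers agreeing with one haplotype) can be sketched as follows. The function name, the "unassigned" label, and the per-read marker counts are hypothetical; how reads marked homozygous were identified is not covered by this sketch.

```python
def classify_read(markers_a, markers_b, min_agreement=0.55):
    """Assign a read to haplotype 'A' or 'B', or return 'unassigned'.

    markers_a / markers_b: number of phased variants on the read that match
    haplotype A or haplotype B. A read needs more than one marker in total,
    and strictly more than min_agreement of them must support one haplotype.
    """
    total = markers_a + markers_b
    if total <= 1:
        return "unassigned"
    if markers_a / total > min_agreement:
        return "A"
    if markers_b / total > min_agreement:
        return "B"
    return "unassigned"


print(classify_read(6, 4))  # 60% of markers support haplotype A
print(classify_read(5, 5))  # no haplotype exceeds the 55% threshold
```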

Exon sequences belonging to the six classical HLA genes were extracted from the phased assembly, and HLA types called at G group resolution. These results were compared to GM12878 HLA type reference data. For the class I and II HLA genes, with the exception of one DRB1 haplotype, there was good agreement between the best-matching reference type and the alleles called from the assembly (edit distance 0–1). Detailed examination of HLA-DRB1, however, showed that one exon (exon 2) is different from all reference types in the assembly, a likely error in the assembly sequence.

GM12878 G group HLA types for HLA-A/B/C, HLA-DQA1, HLA-DQB1, and HLA-DRB1 are from ref. 65; the presence of exactly one HLA-DRB3 allele is expected due to linkage with HLA-DRB1 (DRB1*03 is associated with HLA-DRB3, and DRB1*01 has no DRB3/4/5 association).

Genotyping SNPs using nanopolish.

Nanopolish was used for genotyping the subset of reads that mapped to human chromosome 20. The 1000 Genomes phase 3 variant set for GRCh38 was used as a reference and filtered to include only chromosome 20 SNPs that were not singletons (Allele Count ≥ 2). This set of SNPs was input into “nanopolish variants” in genotyping mode (“--genotype”). The genotyping method extends the variant calling framework previously described12 to consider pairs of haplotypes, allowing it to be applied to diploid genomes (option “--ploidy 2”). To evaluate their accuracy, genotype calls were compared to the “platinum calls” generated by Illumina23. When evaluating the correctness of a nanopore call, we required the log-likelihood ratio of a variant call (heterozygous or homozygous non-reference) to be at least 30; otherwise, we considered the site to be homozygous reference.

Estimating SV genotyping sensitivity.

Previously identified high-confidence GM12878 SVs, validated with Moleculo and/or PacBio long reads, were used to determine genotyping sensitivity29. Using LUMPY28, we recalled SVs in the Platinum Genomes NA12878 Illumina data set (paired-end reads; European Nucleotide Archive, Run Accession ERR194147), intersected these calls with the aforementioned high confidence set, and genotyped the resulting calls using SVTyper28 and the same Platinum alignments, generating a set of 2,414 high-confidence duplications and deletions with accompanying genotypes. Nanopore reads from all flow cells were mapped using BWA-MEM (bwa mem -k15 -W30 -r10 -B2 -O2 -L0), and then merged into release-specific BAM files. Merged BAM files were subsampled using Samtools (samtools view -s $COVERAGE_FRACTION) to approximate coverage values as shown in Figure 2a. SVs were then genotyped in each subsampled BAM file using a modified version of SVTyper (http://github.com/tomsasani/svtyper). Generally, long nanopore reads are subject to higher rates of mismatches, insertions, and deletions than short Illumina reads. These features can result in 'bleed-through' alignments, where reads align past the true breakpoint of an SV66. The modifications to SVTyper attempt to correct for the bleed-through phenomenon by allowing reads to align past the breakpoint, yet still support an alternate genotype. All modifications to SVTyper are documented in the source code available at the GitHub repository listed above (commit ID: d70de9c) (Supplementary Code 2). Nanopore- and Illumina-derived genotypes were then compared as a function of subsampled nanopore sequencing coverage.

The false-discovery rate of our SVTyper genotyping strategy was estimated by randomly permuting the genomic locations of the original SVs using BEDTools “shuffle”67. Centromeric, telomeric, and “gap” regions (as defined by the UCSC Genome Browser) were excluded when assigning randomly selected breakpoints to each SV. The randomly shuffled SVs were then genotyped in Illumina and nanopore data in the same manner as before. It is expected that the alignments at shuffled SV intervals would almost always support a homozygous reference genotype. So, all instances in which Illumina data supported a homozygous reference genotype, yet the nanopore data called a non-homozygous reference genotype, were considered false positives. SV coordinates were shuffled and genotyped 1,000 times and the average false-discovery rate over all iterations was 6.4%.

Nanopore and PacBio genotyping sensitivity was compared to a subset of our high-confidence SV set. Because our high-confidence set includes only “DUP” and “DEL” variants, and the Genome in a Bottle (GIAB) PacBio SV VCF (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/NA12878.sorted.vcf.gz) does not report “DUP” variants, we compared genotypes at deletions with genomic coordinates that shared reciprocal overlap of at least 0.5 between the GIAB VCF and our high-confidence SV VCF. We then compared nanopore genotypes (as determined by SVTyper) with the genotypes reported in the GIAB SV VCF. Importantly, the GIAB VCF was derived from a ∼44× coverage data set, whereas our data set (containing data from both releases) represents only about 32× coverage of the genome. Additionally, all nanopore data used in this analysis were aligned using BWA, while GIAB PacBio data were aligned using BLASR68.
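Reciprocal overlap between two deletions on the same chromosome can be computed as in the sketch below, which assumes half-open (start, end) coordinates. The function name and the example intervals are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the reciprocal overlap of two intervals on the same chromosome.

    The overlap length is divided by the length of each interval, and the
    smaller of the two fractions is returned (0.0 if they do not overlap).
    Both intervals must have positive length.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def same_deletion(a, b, min_reciprocal=0.5):
    """Treat two (start, end) deletions as matching at >= min_reciprocal overlap."""
    return reciprocal_overlap(a[0], a[1], b[0], b[1]) >= min_reciprocal


print(reciprocal_overlap(0, 100, 50, 150))
print(same_deletion((0, 100), (0, 1000)))
```

Taking the minimum of the two fractions prevents a small deletion nested inside a much larger one from being counted as the same event.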

Scaling marginAlign and signalAlign data analysis pipelines.

To handle the large data volume, the original marginAlign and signalAlign algorithms were ported to cloud infrastructures using the Toil batch system69. Toil allows for computational resources to be scaled horizontally and vertically as a given experiment requires and enables researchers to perform their own experiments in identical conditions. All of the workflows used and the source code are freely available from https://github.com/ArtRand/toil-signalAlign and https://github.com/ArtRand/toil-marginAlign. Workflow diagrams are shown in Supplementary Figure 10.

Generating a controlled set of methylated control DNA samples.

For signalAlign, DNA methylation control standards were obtained from Zymo Research (cat. no. D5013). The standards contain a whole-genome-amplified (WGA) DNA substrate that lacks methylation and a WGA DNA substrate that has been enzymatically treated so all CpG dinucleotides contain 5-methylcytosines. The two substrates were sequenced independently on two different flow cells using the sequencing protocol described above. Otherwise, training for signalAlign and nanopolish was carried out as previously described35,36.

5-methylcytosine detection with signalAlign.

The signalAlign algorithm uses a variable order hidden Markov model combined with a hierarchical Dirichlet process (HMM-HDP) to infer base modifications in a reference sequence using the ionic current signal produced by nanopore sequencing70. The ionic current signal is simultaneously influenced by multiple nucleotides as the strand passes through the nanopore. Correspondingly, signalAlign models each ionic current state as a nucleotide k-mer. The model allows a base in the reference sequence to have any of multiple methylation states (in this case 5-methylcytosine or canonical cytosine). The model ties the probabilities of consistently methylated k-mers by configuring the HMM in a variable order meta-structure that allows for multiple paths over a reference k-mer depending on the number of methylation possibilities. To learn the ionic current distributions for methylated k-mers, signalAlign estimates the posterior mean density for each k-mer's distribution of ionic currents using a Markov chain Monte Carlo (MCMC) algorithm given a set of k-mer-to-ionic current assignments. Using the full model, the posterior for each methylation status is calculated for all cytosines in CpG dinucleotides.

5-methylcytosine detection with nanopolish.

Previous work describes using nanopolish to call 5-methylcytosine in a CpG context using a hidden Markov model36. The output of the nanopolish calling procedure is a log-likelihood ratio, where a positive log-likelihood ratio indicates evidence for methylation. Nanopolish groups nearby CpG sites together and calls the group jointly, assigning the same methylation status to each site in the group. To allow comparison to the bisulfite data each such group was broken up into its constituent CpG sites, which all have the same methylation frequency. Percent-methylation was calculated by converting the log-likelihood ratio to a binary methylated/unmethylated call for each read, and calculating the fraction of reads classified as methylated. A filtered score was also computed by first filtering reads where the absolute value of the log-likelihood ratio was less than 2.5 to remove ambiguous reads.
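The percent-methylation calculation, with and without the ambiguity filter, can be sketched as below. The function name is hypothetical, the per-read log-likelihood ratios are made-up values, and treating a ratio of exactly zero as unmethylated is an assumption of this sketch.

```python
def methylation_frequency(log_lik_ratios, min_abs_llr=0.0):
    """Return the fraction of reads called methylated at one CpG site.

    log_lik_ratios: one log-likelihood ratio per read covering the site.
    Reads with |LLR| < min_abs_llr are discarded as ambiguous; among the
    remaining reads, LLR > 0 is counted as methylated.
    Returns None when no reads pass the filter.
    """
    kept = [llr for llr in log_lik_ratios if abs(llr) >= min_abs_llr]
    if not kept:
        return None
    methylated = sum(1 for llr in kept if llr > 0)
    return methylated / len(kept)


site_llrs = [3.0, 5.0, -4.0, 1.0, -0.5, -1.0]  # hypothetical per-read values
print(methylation_frequency(site_llrs))                   # all reads
print(methylation_frequency(site_llrs, min_abs_llr=2.5))  # filtered score
```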

Life Sciences Reporting Summary.

Further information on experimental design is available in the Life Sciences Reporting Summary.

Data availability.

Sequence data including raw signal files (FAST5), event-level data (FAST5), base-calls (FASTQ) and alignments (BAM) are available as an Amazon Web Services Open Data set for download from https://github.com/nanopore-wgs-consortium/NA12878. Nanopore raw signal files and the 35× assembly are additionally archived and available from the European Nucleotide Archive under accession PRJEB23027.