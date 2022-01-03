The emergence of nanopore sequencing is reshaping the landscape of genomics. Devices from Oxford Nanopore Technologies (ONT) enable sequencing of native DNA and RNA molecules with no theoretical upper limit on read length1. This supports the accurate assembly and phasing of repetitive genomes and metagenomes2,3,4,5,6; enhanced resolution of structural variation7,8,9,10,11 and spliced RNA transcripts12; and profiling of epigenetic and RNA modifications13,14,15,16,17,18. High-throughput ONT instruments (GridION and PromethION) have recently enabled cost-effective sequencing of large eukaryotic genomes7,8,19. However, large data volumes and computational bottlenecks have become a major impediment.

ONT devices measure the displacement of ionic current as a DNA or RNA strand passes through a biological nanopore, recording time series signal data in FAST5 format (Fig. 1a and Supplementary Note 1). These data are translated, or ‘base-called’, into sequence reads (FASTQ format) before downstream analysis. Many bioinformatics tools also directly access the signal data to improve the accuracy of assembled genomes or detect fine signal perturbations that are indicative of DNA/RNA modifications, genetic variants or other features (Fig. 1a)5,14,16,17,18. However, nanopore signal data are large (~1.3-TB FAST5 files for ~30× human genome; Supplementary Table 1), and both base-calling and downstream analysis steps are computationally expensive.

Fig. 1: SLOW5 format enables efficient parallel analysis of nanopore signal data. a, Schematic diagram illustrating the typical life cycle of nanopore data. Raw current signal data are generated on an ONT sequencing device and written in FAST5 format. Raw data are base-called into sequence reads (FASTQ/FASTA format). Downstream analysis involving both base-called reads and raw signal data is used to identify genetic variants, epigenetic modifications (for example, 5mC) and other features. b, Schematic diagram illustrating the bottleneck in ONT signal data analysis. FAST5 file reading requires the HDF5 software library, which serializes file access requests by multiple CPU threads, preventing efficient parallel analysis. SLOW5 files are not dependent on the HDF5 library and are amenable to efficient parallel analysis. A more detailed mechanistic diagram is provided in Extended Data Fig. 1e. c, Bar chart shows the relative file sizes (bytes per base) of a typical human genome sequencing dataset in ASCII SLOW5 (purple), binary BLOW5 format with no compression (orange), zlib compression (red) and vbz compression (pink), compared to FAST5 format with zlib compression (blue) and vbz compression (teal). d, Dot plots show the rate of file access (reads per second) for the above file types, as a function of CPU threads used on two HPC systems: HPC-HDD (left) or HPC-Lustre (right). e, Dot plots show the rate of execution (reads per second) for DNA methylation calling for the same file types on HPC-HDD (left) and HPC-Lustre (right). For the instance of maximum CPU threads, bar charts show the time consumed by individual workflow components: FAST5/SLOW5 data access (pink), FASTA data access (teal), BAM data access (orange) and data processing (navy). f, Bar charts show the time consumed by data access (pink) and data processing (navy) during DNA methylation calling on a range of different computer systems. Full specifications are provided in Supplementary Table 2. 

Currently, the most popular signal-level analysis is DNA methylation profiling with the software Nanopolish/f5c17,20. We selected this use case as the basis for an evaluation of FAST5 data analysis on high-performance computing (HPC) systems (Supplementary Note 2). FAST5 is a Hierarchical Data Format 5 (HDF5) file with a specific schema defined by ONT. HDF5 is a generic file format for storing large data that can only be read and written using a single software library first developed in 1998. Our analysis showed that: (1) the use of increasing numbers of parallel CPU threads resulted in a relatively small reduction in the overall run time of a typical methylation calling job (Extended Data Fig. 1a); (2) this was due to inefficient data access (file reading) rather than inefficient data processing (Extended Data Fig. 1a–d); and (3) the underlying bottleneck was a limitation in the software library for reading HDF5 files, whereby parallel input/output (I/O) requests from multiple CPU threads are serialized, preventing efficient use of parallel CPU resources (Extended Data Fig. 1e and Supplementary Note 2).

Parallel computing enables scalable analysis of large datasets and is central to modern genomics. Unfortunately, our analysis shows that the FAST5 format suffers from an inherent inefficiency that ensures, even with access to advanced HPC systems, that the analysis of nanopore signal data will be prohibitively slow (Fig. 1b). For example, with the maximum resource allocation available on Australia’s National Computational Infrastructure (among the world’s largest academic supercomputers; see Supplementary Table 2, HPC-Lustre), genome-wide DNA methylation profiling on a ~30× human genome dataset runs for more than 14 days. Moreover, given that the vast majority (>90%) of the overall run time is spent simply reading FAST5 files, the performance benefits of further software optimization would be small compared to the time taken for file reading.
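The last point can be made concrete with Amdahl's law, which is not named in the text but captures the same reasoning. The sketch below assumes, for illustration only, that exactly 90% of the run time is file reading that does not speed up, and that the remaining 10% scales perfectly with the number of workers; the function name is hypothetical.

```python
def amdahl_speedup(serial_fraction: float, workers: float) -> float:
    """Upper bound on overall speedup when only the non-serial part scales."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumption: 90% of run time is file reading that cannot be accelerated.
for threads in (1, 8, 48):
    print(threads, round(amdahl_speedup(0.9, threads), 3))

# Even with unlimited workers the bound is 1 / 0.9, roughly 1.11-fold.
limit = amdahl_speedup(0.9, float("inf"))
```

Under these assumptions, no amount of optimization of the processing step could shorten the job by more than about 10%, which is why the file format itself is the target.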

To overcome the inherent limitations in FAST5 format, we created SLOW5, a file format designed for efficient, scalable analysis of nanopore signal data (Fig. 1b). SLOW5 encodes all information found in FAST5 but is not dependent on the HDF5 library required to read FAST5 files. The human readable version of SLOW5 format is a tab-separated values (TSV) file encoding metadata and time series signal data for one nanopore read per line, with global metadata stored in a file header (Table 1 and Supplementary Note 3). Parallel file access is facilitated by an accompanying binary index file that specifies the position of each read (in bytes) within the main SLOW5 file (Supplementary Note 3). SLOW5 can be encoded in human readable ASCII format or a compact and efficient binary format, BLOW5, which is analogous to the seminal SAM/BAM format for storing sequence alignments21. The binary format optionally supports compression with zlib and ‘vbz’ (Z-standard + StreamVByte) algorithms, thereby minimizing the storage footprint while permitting efficient parallel access (Methods).
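The idea of one read per line plus a byte-offset index can be sketched in a few lines. The example below is a toy layout loosely modelled on the description above, not the SLOW5 specification: the two-column header, the in-memory dictionary standing in for the binary index file, and all function and read names are assumptions made for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy records: (read_id, raw signal samples). Not real data.
TOY_READS = [
    ("read_a", [512, 498, 503]),
    ("read_b", [610, 605]),
    ("read_c", [455, 470, 466, 459]),
]


def write_toy_file(path, reads):
    """Write one read per line; return {read_id: (byte offset, byte length)}."""
    index = {}
    with open(path, "wb") as handle:
        handle.write(b"#read_id\traw_signal\n")  # simplified header line
        for read_id, signal in reads:
            line = f"{read_id}\t{','.join(map(str, signal))}\n".encode("ascii")
            index[read_id] = (handle.tell(), len(line))
            handle.write(line)
    return index


def fetch_read(path, index, read_id):
    """Seek straight to one record using a file handle private to this call."""
    offset, length = index[read_id]
    with open(path, "rb") as handle:
        handle.seek(offset)
        fields = handle.read(length).decode("ascii").rstrip("\n").split("\t")
    return fields[0], [int(value) for value in fields[1].split(",")]


def fetch_many(path, index, read_ids, threads=4):
    """Fetch records concurrently; workers share no file handle or lock."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda rid: fetch_read(path, index, rid), read_ids))


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_reads.tsv")
    toy_index = write_toy_file(toy_path, TOY_READS)
    fetched = fetch_many(toy_path, toy_index, ["read_c", "read_a"])
```

Because each worker opens its own handle and jumps directly to a known offset, requests for different reads never wait on one another, which is the property the index is meant to provide.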

Table 1: Example of a SLOW5 ASCII file with a single read group.

BLOW5 format is smaller than FAST5 format due to simpler space allocation and reduced metadata redundancy. Comparison of equivalent files with matched compression (FAST5-zlib versus BLOW5-zlib or FAST5-vbz versus BLOW5-vbz) revealed space savings that ranged from 18% to 69%, depending on the dataset (Supplementary Table 3). The largest savings were observed for datasets with short read lengths, and this effect was independent of compression type (Extended Data Fig. 2a,b). On a ~30× human genome dataset, BLOW5 was approximately 25% smaller (Fig. 1c), equating to a reduction of ~300 GB.
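Compression and random access need not conflict. The sketch below assumes, purely for illustration, that each record is compressed independently and stored behind a length prefix; the actual BLOW5 layout is described in Methods and is not reproduced here. The function names and the signals are hypothetical, and only zlib from the Python standard library is used.

```python
import struct
import zlib
from array import array


def pack_record(signal):
    """Compress one read's 16-bit samples and prefix the compressed length."""
    raw = array("h", signal).tobytes()
    compressed = zlib.compress(raw)
    return struct.pack("<I", len(compressed)) + compressed


def unpack_record(blob, offset):
    """Decompress only the record that starts at `offset` within `blob`."""
    (size,) = struct.unpack_from("<I", blob, offset)
    start = offset + 4
    samples = array("h")
    samples.frombytes(zlib.decompress(blob[start:start + size]))
    return samples.tolist()


# Hypothetical signals; real reads hold thousands of samples each.
signals = [[500, 502, 499] * 200, [610, 612] * 300]
offsets, blob = [], b""
for signal in signals:
    offsets.append(len(blob))  # byte position where this record begins
    blob += pack_record(signal)

second = unpack_record(blob, offsets[1])
raw_bytes = sum(2 * len(signal) for signal in signals)
```

Here the second record is recovered without decompressing the first, so a reader holding the offsets can still serve requests in any order.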

To determine the performance benefits of SLOW5, we first measured data access using a small human DNA sequencing dataset of ~500,000 reads (Supplementary Table 1) on two different HPC systems (HPC-HDD and HPC-Lustre; Supplementary Table 2). The rate of SLOW5 data access (reads per second) was faster than FAST5 across the board and increased with the use of additional CPU threads, whereas FAST5 access was largely unchanged (Fig. 1d). This trend, which reflects the capacity of SLOW5 to be efficiently accessed by multiple CPU threads in parallel, was observed for SLOW5, BLOW5 and compressed BLOW5 format, with the latter exhibiting the most efficient data access (Fig. 1d). As a result, we observed substantial improvements in data access rates when using many CPUs on both HPC systems. Using 48 CPU threads on the HPC-Lustre system, ~7 h were required to read this small dataset in FAST5 format, compared to just ~13 min in compressed BLOW5 (~32-fold improvement) (Fig. 1d).

This improvement in data access manifested in performance gains during DNA methylation profiling. When using SLOW5 input, the Nanopolish/f5c runtime was reduced in proportion to the number of CPUs available (Fig. 1e). This is indicative of efficient parallel computation and was not observed when using FAST5 (Fig. 1e). As a result, substantial improvements were observed when using many CPUs, with a maximum ~15-fold reduction in runtime with 48 CPUs on the HPC-Lustre system (Fig. 1e). The improvement is the result of efficient data access, with no difference observed in data processing among the different file formats (Extended Data Fig. 3a,b). Whereas data access was the major bottleneck during FAST5 analysis, it constituted a negligible fraction of the total run time during SLOW5 analysis (Extended Data Fig. 3c,d). Put simply, this means that overall performance is dictated by the efficiency of the program rather than the time taken to read the input data, thereby enabling optimization through further engineering. For example, using GPU acceleration available in f5c20 with compressed BLOW5 input, we ran methylation profiling on a 30× human genome in ~10.5 h with 48 threads (>30-fold improvement compared to standard analysis with FAST5) (Supplementary Table 2).
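The fold changes quoted above follow directly from the reported times. As a quick cross-check, the sketch below uses the rounded values from the text (~7 h versus ~13 min for data access; more than 14 days versus ~10.5 h for methylation profiling); the helper name is hypothetical.

```python
def fold_improvement(before_hours: float, after_hours: float) -> float:
    """Ratio of the original run time to the improved run time."""
    return before_hours / after_hours


# Data access on HPC-Lustre with 48 threads: ~7 h (FAST5) vs ~13 min (BLOW5).
access_fold = fold_improvement(7.0, 13.0 / 60.0)

# Methylation profiling on a ~30x genome: >14 days (FAST5) vs ~10.5 h (BLOW5 + GPU).
methylation_fold = fold_improvement(14 * 24.0, 10.5)

print(f"access: {access_fold:.1f}-fold, methylation: {methylation_fold:.1f}-fold")
```

Because the FAST5 run time is a lower bound (more than 14 days), the methylation figure is itself a lower bound, consistent with the >30-fold improvement stated.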

Although the SLOW5 format is designed for scalable analysis on HPC systems, we reasoned that improved data access would be beneficial on almost any computer. To test this, we benchmarked DNA methylation profiling, as above, on a range of architectures (Supplementary Table 2). In all cases, the time consumed by data access was reduced, leading to improvements in overall execution time (Fig. 1f). As expected, improvements were greatest on systems with larger numbers of CPUs, such as a cloud-based virtual machine on Amazon AWS (~7-fold improvement at 32 CPU threads). However, benefits were observed even on miniature devices for portable computing, such as an Nvidia Xavier embedded module (~60% improvement) (Fig. 1f). In summary, SLOW5 delivered performance improvements during methylation profiling on a diverse range of hardware.

To ensure that FAST5 to SLOW5 file conversion is not a barrier to SLOW5 adoption (given that ONT devices currently write data in FAST5 format), we implemented software (slow5tools) for efficient, parallelizable, loss-less conversion from FAST5 to SLOW5 (Methods). File conversion times are proportionally reduced with high CPU availability and are trivial compared to execution times for typical FAST5 analysis (Extended Data Fig. 4a,b). For example, conversion of a ~30× human genome dataset from FAST5 to compressed BLOW5 takes just ~3 h with 48 CPUs. We additionally implemented software for live FAST5 to SLOW5 file conversion during a sequencing run, using the internal computer on an ONT PromethION device (Extended Data Fig. 4c). This means that the user can obtain raw data in compressed BLOW5 format with effectively zero additional workflow hours required for file conversion.

The inefficiency of FAST5 data access creates delays and expenses, limiting the feasibility of ONT sequencing for many applications in research and clinical genomics. Arguably, these frictions also discourage the development of bioinformatics software that directly accesses nanopore signal data. This is in stark contrast to the simple, efficient and open-source SAM/BAM sequence alignment format, developed in 2009 (ref. 21), which was a key catalyst in the growth of genome informatics.

The SLOW5 format provides the framework for efficient, parallelizable analysis of nanopore signal data for any intended application. SLOW5 reading and writing is managed by efficient software application programming interfaces (APIs) for both the C (slow5lib) and Python (pyslow5) languages (Methods). This facilitates integration of SLOW5 into third-party software, including existing packages, in which the SLOW5 API can replace the existing FAST5 API. Notably, just ~70 lines of code were required for adoption of SLOW5 by the third-party software Sigmap22, compared to ~2,600 lines of code for FAST5 access within the same tool. This shows the simplicity of the SLOW5 API, which is fully open source and not dependent on the HDF5 library required to read FAST5. Along with the simple, intuitive structure of SLOW5 format, this will support active and open software development for nanopore data analysis.
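To give a sense of what downstream code might look like once reads are available as simple records, the sketch below converts raw samples to picoamperes and summarizes each read. It assumes each read is a dictionary with the keys read_id, digitisation, offset, range and signal, and that current in pA is (raw + offset) × range / digitisation; both the record shape and the formula are assumptions not stated in this section, and the function names and toy values are hypothetical.

```python
def to_picoamps(read):
    """Convert raw integer samples to picoamperes using per-read calibration."""
    scale = read["range"] / read["digitisation"]
    return [(sample + read["offset"]) * scale for sample in read["signal"]]


def summarize(reads):
    """Yield (read_id, number of samples, mean current in pA) for each read."""
    for read in reads:
        current = to_picoamps(read)
        yield read["read_id"], len(current), sum(current) / len(current)


# Hypothetical in-memory record standing in for a read from a SLOW5 file.
toy_reads = [
    {"read_id": "read_a", "digitisation": 8192.0, "offset": 10.0,
     "range": 1024.0, "signal": [502, 498, 510]},
]
summary = list(summarize(toy_reads))
```

With pyslow5 installed, an iterator over reads opened from a BLOW5 file could presumably be passed to `summarize` in place of `toy_reads`; the exact calls and field names should be checked against the library documentation.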