Genopo: a nanopore sequencing analysis toolkit for portable Android devices

The advent of portable nanopore sequencing devices has enabled DNA and RNA sequencing to be performed in the field or the clinic. However, advances in in situ genomics require parallel development of portable, offline solutions for the computational analysis of sequencing data. Here we introduce Genopo, a mobile toolkit for nanopore sequencing analysis. Genopo compacts popular bioinformatics tools to an Android application, enabling fully portable computation. To demonstrate its utility for in situ genome analysis, we use Genopo to determine the complete genome sequence of the human coronavirus SARS-CoV-2 in nine patient isolates sequenced on a nanopore device, with Genopo executing this workflow in less than 30 min per sample on a range of popular smartphones. We further show how Genopo can be used to profile DNA methylation in a human genome sample, illustrating a flexible, efficient architecture that is suitable to run many popular bioinformatics tools and accommodate small or large genomes. As the first ever smartphone application for nanopore sequencing analysis, Genopo enables the genomics community to harness this cheap, ubiquitous computational resource.


Supplementary Data 1 [provided as separate Excel sheet]
Detailed run-time information for SARS-CoV-2 genome analysis.

Supplementary Data 2 [provided as separate Excel sheet]
Detailed run-time information for NA12878 methylation calling analysis.

ARTIC pipeline for SARS-CoV-2 genome analysis
The ARTIC network for viral surveillance has established a standardised bioinformatics pipeline for SARS-CoV-2 genome analysis with ONT sequencing data (https://github.com/artic-network/artic-ncov2019). This workflow has been integrated into Genopo for the analysis of SARS-CoV-2 patient isolates and is summarised in Fig. S1. While this workflow is specifically tailored for analysis of SARS-CoV-2, Genopo also supports a generic variant calling pipeline (comprising Minimap2 1 , Samtools 2 and Nanopolish 3 ) that can be used to detect variants in any organism for which a reference genome is available.
Base-called reads (.fastq) are first aligned to the SARS-CoV-2 reference genome (MN908947) using Minimap2 then sorted and indexed using Samtools (v1.10). Primer sites are trimmed from alignments and coverage normalisation performed using Artic_c trim (v1.0.0), a C/C++ re-implementation of the original aligntrim.py Python script in the ARTIC repository. Trimmed and normalised alignments are again sorted and indexed using Samtools. ONT raw signal files (.fast5) are indexed using Nanopolish index, then variant calling is performed using Nanopolish variants (v0.11.3). Note that depending on the available RAM on the device, the variant calling step may have to be run multiple times while iterating through smaller genomic windows at a time (-w option). This is a common practice even on high-performance computers when the genome is large. The more RAM a device has, the larger the window size can be.
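The windowed strategy described above can be illustrated with a minimal sketch that tiles a reference contig into consecutive windows, one per variant calling run. This is not taken from Genopo or Nanopolish: the function name genomic_windows, the 10,000-base window size and the 1-based inclusive "contig:start-end" coordinate convention are assumptions made for illustration, and the convention should be checked against the tool before use. The contig length of 29,903 bases is that of the MN908947.3 reference.

```python
def genomic_windows(contig, length, window_size):
    """Yield 'contig:start-end' region strings that tile a contig.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    The last window is truncated at the end of the contig.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    for start in range(1, length + 1, window_size):
        end = min(start + window_size - 1, length)
        yield f"{contig}:{start}-{end}"


# Hypothetical example: split the SARS-CoV-2 reference into 10 kb windows.
# A device with less RAM would use a smaller window_size and more runs.
windows = list(genomic_windows("MN908947.3", 29903, 10000))
for region in windows:
    print(region)
```

Each region string would then be passed to one variant calling run, and the per-window results combined afterwards. A smaller window lowers peak memory at the cost of more runs.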
After obtaining the variant calls in VCF format, the user can optionally proceed to build a consensus genome for their SARS-CoV-2 isolate, using Bcftools (v1.10.2) 2 . VCF files generated by Nanopolish lack the header meta-information required for compatibility with Bcftools. Therefore, the VCF header is first modified using Bcftools reheader. Regions with low coverage are then identified using Samtools depth followed by Artic_c mask, a sub-tool that generates a BED file specifying low-coverage regions of the genome (< 20-fold). Variants with low quality scores (QUAL < 200) are identified using Bcftools query and concatenated with the low-coverage BED file and another BED file containing explicitly specified (pre-defined) ambiguous bases to produce a single BED file for masking. BED file concatenation is performed using our sub-tool Artic_multiinter, which is functionally equivalent to Bedtools multiinter 4 . Finally, the consensus genome is built using Bcftools consensus, where the bases covered by the masking BED file are replaced with 'N'.
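The masking logic above can be sketched in a few lines. The sketch below is a simplified illustration, not the Artic_c or Artic_multiinter implementation: all function names are hypothetical, intervals are 0-based and half-open as in BED, and the three interval sources are combined by a plain union, which is an assumption about how the combined mask is used.

```python
def low_coverage_intervals(depths, min_depth=20):
    """Return 0-based half-open (start, end) runs where depth < min_depth."""
    intervals = []
    run_start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if run_start is None:
                run_start = pos          # a low-coverage run begins here
        elif run_start is not None:
            intervals.append((run_start, pos))
            run_start = None
    if run_start is not None:            # run extends to the end of the genome
        intervals.append((run_start, len(depths)))
    return intervals


def merge_intervals(*interval_lists):
    """Union several interval lists, merging overlapping or adjacent ones."""
    merged = []
    for start, end in sorted(iv for lst in interval_lists for iv in lst):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def apply_mask(sequence, intervals):
    """Replace every base inside the given intervals with 'N'."""
    seq = list(sequence)
    for start, end in intervals:
        seq[start:end] = "N" * (end - start)
    return "".join(seq)


# Toy data (hypothetical): per-base depth, one low-quality variant site
# (QUAL < 200) and one pre-defined ambiguous region.
depths = [5, 5, 30, 30, 30, 10, 30, 30]
low = low_coverage_intervals(depths)     # [(0, 2), (5, 6)]
low_qual_sites = [(3, 4)]
ambiguous = [(5, 7)]
mask = merge_intervals(low, low_qual_sites, ambiguous)
print(apply_mask("ACGTACGT", mask))      # NNGNANNT
```

Merging the intervals before masking keeps the final BED file compact and makes the result independent of the order in which the three sources are supplied.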

Methylation profiling in NA12878
Genopo currently supports complete pipelines for methylation calling and event alignment. The methylation calling pipeline is illustrated in this article and summarised in Fig. S2. The pipeline's inputs are a reference genome (e.g., hg38), ONT raw signal files (.fast5) and corresponding base-called reads (.fastq). Base-called reads are first aligned to the reference genome using Minimap2 (v2.17) 1 . The memory usage for Minimap2 increases with the size of the reference genome, causing failures for large reference genomes when running on mobile devices with limited memory. To overcome this constraint, we used an index partitioning strategy described elsewhere 5 , integrating this algorithm into our pipeline along with Minimap2. Note that the reference partitioning should be done on a computer, before storing the partitioned reference on the smartphone (see Advanced Usage Instructions). The aligned reads are then sorted and indexed using Samtools (v1.10) 2 . Read polishing is then performed using Nanopolish 3 . We adopt a re-engineered version of Nanopolish called f5c 6 , which is both memory and time efficient. F5c first indexes base-called reads and raw signal data. Subsequently, f5c can either perform methylation calling or event alignment, at the user's request. Complete commands are as follows:

Basic usage information
Genopo has four major functionalities, which are listed below and shown in Fig. S4a.
1) Stand-alone mode for configuration and execution of a custom pipeline on the mobile device.
2) Mobile-cluster mode for real time analysis using a cluster of mobile devices, which is currently under development and will not be discussed in this paper.
3) Download data-sets using URLs and extract compressed files.
4) An example demonstration that downloads and extracts a nanopore data-set of E. coli, then executes a complete methylation calling pipeline on the data-set.
Genopo includes a 'help' section to get new users started, which summarises the four major functionalities above (Fig. S4b). When Genopo is launched for the first time, the user is prompted to grant permission to read from and write to the internal storage of the mobile device. On most devices, this permission also suffices for reading from and writing to external storage (SD card). On certain devices, however, the user must set this permission explicitly by navigating to the [help→set SD card permission] section (Fig. S4c).
Genopo's start page lists the four functionalities above (Fig. S4a). A user navigating to stand-alone mode lands on a page for selecting custom pipeline steps, where they can choose Minimap2, Samtools, f5c or any desired combination of those tools (Fig. S4f). Once the steps are selected, the user can choose either GUI mode or terminal mode (Fig. S4g,h) to configure parameters for each tool. Fig. S3 shows the procedure for using stand-alone mode. GUI mode is recommended, because the final commands are always compiled into a set of strings and shown in terminal mode before execution proceeds. In GUI mode, file path arguments are auto-completed once the user sets the correct path to the data-set directory. Genopo provides a directory navigator for this purpose, available in both GUI mode and terminal mode. A user who chose terminal mode at the outset skips GUI mode and lands directly in terminal mode. From terminal mode, the user can proceed to the pipeline execution page (Fig. S4i). Once pipeline execution starts, a timer is displayed.
After the pipeline has executed, the user can write the results to a log file (named f5n.log), located inside the storage/mobile-genomics directory. In the rare event that Genopo crashes, the user can re-run the previous pipeline using the LOAD PREVIOUS CONFIGURATION command. If the app crashes during an execution, the user can identify the error by consulting tmp.log, which is also located inside the storage/mobile-genomics directory. For more information on log files, please refer to the help section on the application's home page.
The functionalities to download a data-set from a URL and to extract a compressed data-set are available on the same page (Fig. S4d). To download a data-set, the user sets the data-set URL and the storage location to which the data-set should be downloaded. To decompress a file, the user sets the path of the compressed file and presses the EXTRACT button. The decompressed files are written to the same location as the compressed file. Because a data-set usually consists of a large number of small fast5 files, transferring them to device storage is slow unless the files are compressed. Hence, Genopo includes file extraction functionality to decompress files as necessary. In mobile-cluster mode, compressed files are transferred over WiFi.
The example demonstration is a three-step setup that helps users get familiar with Genopo. The steps follow the basic procedure for executing a pipeline: 1) download a data-set, 2) extract the data-set and 3) execute the pipeline (Fig. S4e).