Featherweight long read alignment using partitioned reference indexes

The advent of Nanopore sequencing has realised portable genomic research and applications. However, state of the art long read aligners and large reference genomes are not compatible with most mobile computing devices due to their high memory requirements. We show how memory requirements can be reduced through parameter optimisation and reference genome partitioning, but highlight the associated limitations and caveats of these approaches. We then demonstrate how these issues can be overcome through an appropriate merging technique. We incorporated multi-index merging into the Minimap2 aligner and demonstrate that long read alignment to the human genome can be performed on a system with 2 GB RAM with negligible impact on accuracy.


List of Figures
. Memory usage of Minimap2 for default parameters PacBio Oxford nanopore Index construction 8.67 GB 11.32 GB Index residence 6.33 GB 7.71 GB Mapping with base-level alignment (SAM output) 8.56 GB 11.30 GB Mapping without base-level alignment (PAF output) 7.14 GB 8.80 GB Minimap2 was run with default parameters. Pre-set profiles map-pb and map-ont were used for PacBio and Oxford Nanopore, respectively. The peak memory usage for each event is presented. Index construction refers to the building of the index and then serialising the index to a file. Index residence is the memory required only for the index to reside in memory, such as when loading a pre-built index.  System 1 is a laptop with flash memory (Intel i7-8750H processor, 16GB of RAM and Toshiba XG5 series NVMe SSD) while system 2 is a workstation with a mechanical hard disk (Intel i7-6700 processor, 16GB of RAM and Toshiba DT01ACA series HDD). Alignment was performed on the NA12878 Nanopore data with the map-ont pre-set in Minimap2 using 8 threads. Fraction of mapped reads w=10 w=15 w=20 w=25 w=30 w=35 w=40 w=45 w=50 Figure S2. Effect of the window size parameter w on the error rate and sensitivity for simulated reads. Effect of window size on the proportion of mapped reads and the associated error rate (log scale) for 4 million simulated long reads (see Materials and Methods). A single curve contains points for each mapping quality threshold (MAPQ score), one point for each mapping quality threshold from 60 (leftmost) to 0 (rightmost).

4/13
No repeats  . Binary format was preferred as it reduces the file size compared to ASCII. When the last read is mapped to the current partition of the index in memory, the dump will contain the intermediate state of the mappings for all the reads, in the same order as the reads in the input read set. If the partitioned index had n partitions, at the end of the n th partitions we will have n such dumps. The dumped internal state includes; fifteen 32-bit unsigned integers (such as the reference ID, chaining scores, query and reference start and end), two 32-bit signed integers and one floating point value. All these information are inside a single structure in Minimap2 (called mm_reg1_t in minimap.h) which made the dumping convenient. The size required for a single alignment is around 80 bytes.

Serialising (dumping) of the internal state
If the user has requested Minimap2 to generate the base-level alignment, then the internal state for base-level alignment are also dumped. Base-level alignment information include; six 32-bit integers (such as the base level alignment score, number of CIGAR operations and a variable size flexible integer array for storing CIGAR operations. These information are stored inside another structure in Minimap2 (called mm_extra_t), which is only allocated if the base level alignment has been requested. The memory address to this structure is stored as a pointer in the previously mentioned mm_reg1_t structure. When dumping, we flatten the information linearly (eliminate memory pointers) to the file.
In addition to the above, a quantity called replen (sum of lengths of regions in the read that are covered by highly repetitive k-mers) is dumped. This is a per read quantity. We save the replen to the same dump file that we discussed above, just after the information for each mapping. For each read there will be a replen for each part of the index, that is saved in the dump for that particular part of the partitioned index [line 495 of https://github.com/hasindu2008/minimap2-arm/blob/ v0.1-alpha/map.c].

Merging operation
When alignment of all reads to all parts of the index completes, the merging operation is invoked [merge function in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c]. We simultaneous open the read file and the dump files for all parts of the partitioned index. Reads are sequentially loaded while loading all the internal states for the alignments of that read. This includes the internal state for all its alignments (includes the base-level information if it had been requested) as well as the replen from each dump file. The flattened data in the files are restored to their original structures when loading to the memory.
If no base-level alignments had been requested, the alignments are sorted based on the chaining score in descending order [function mm_hit_sort_by_score in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/ merge.c]. If base-level alignment had been requested, they are sorted based on the base-level DP alignment score. Categorisation of primary and secondary chains is performed on the sorted alignments according to the same method done on Minimap2 (using mm_set_parent function). This fixes the issue with the primary vs secondary flag. Then the alignment entries are filtered based on the user requested number of secondary alignments and the priority ratio (using mm_select_sub function). This eliminates the issue of outputting secondary alignments for each part of the index that makes the output size huge. If the output has been requested in form of a SAM file, the best primary alignment is set to the primary flag while all other primary alignments are set to supplementary (using mm_set_sam_pri function).
The mapping quality (MAPQ) estimation depends on the length of the read covered by repeat regions in the genome. To compute a perfect value for this quantity, the whole index needs to be in the memory which is the case for a single reference index. However, we estimate this quantity by taking the maximum out of the replen values that were dumped for the particular read. The Spearman correlation of this estimated value to the perfect replen was 0.9961. As the mapping quality is anyway an estimation, computing the mapping quality based on the estimated replen does not affect the final results significantly.

Emulated single reference index
For memory efficiency, Minimap2 stores meta-data of reference sequences (such as the sequence name and sequence length) only in the reference index (refer to mm_idx_t struct in minimap.h). The order in which the sequences reside in the struct array forms a unique numeric identifier for each reference sequence.
In the internal state for mappings only this numeric identifier is stored. The meta-data for the reference sequence are resolved using these numeric identifiers, only during the output printing. However, during merging we do not have the reference indexes in memory and the numeric identifiers cannot be resolved. Hence, we construct an emulated single reference index. For this, we save the meta-data of the reference sequences when each part of the partitioned index is loaded [line 47-54 in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c]. These metadata go to the beginning of the dump file for the particular part of the index. At the beginning of the merging, the meta-data is loaded back to form an emulated single reference index [line 164-173 in https://github.com/hasindu2008/ minimap2-arm/blob/v0.1-alpha/merge.c]. However, the numeric identifiers in the internal states from the dump files are incorrect (as numeric the identifier is an independent incrementing index for each part of the index). These are corrected to be compatible with the numeric identifiers in the emulated single reference index by adding the correct offset [line 254 in https://github.com/hasindu2008/minimap2-arm/blob/v0.1-alpha/merge.c].
As a side effect of this emulated single reference index, a correct SAM header can be output even in the partitioned mode. Further, the merging process which merges the mappings for a read at a time, outputs the mappings for a particular read ID adjacently. Hence, no additional sorting is required for any downstream analysis tools that require so.

Memory efficiency for references with unbalanced lengths
The existing partitioned index construction method in Minimap2, does not balance the size of index partitions when the reference genome has sequences (chromosomes) with highly varying lengths. This existing index construction method puts the reference sequences to the index in the order they exist in the reference genome. When constructing a partitioned index, it moves to the next part of of the index only when the user specified number of bases per index (by default 4 Gbases) is exceeded. When building a partitioned index for overlap finding, the parts would be approximately equal in size as the length of the longest read would be a few mega bases. However, in case of a reference genomes like the human genome where the chromosomes are of highly variable lengths, the size of the parts are unbalanced. The largest part of the index determines the peak memory. Hence, an unbalance will hinder the maximum efficiency for systems with limited memory. For instance, consider a hypothetical genome (total length 700M) with following chromosomes and lengths in the order chr1 (300M), chr2 (320M), chr3 (60M), chr4 (20M). Providing a value of 350M as the number of bases in a partition (with the intention of splitting into 2 parts), will create an unbalanced index as follows.
• part1 : chr1, chr2 : total length -620M • part2 : chr3, chr4 : total length -80M We follow a simple partitioning approach to balance this out. Instead of the number of bases per partition, the number of partitions is taken as a user input. The reference sequences are first sorted in descending order based on the sequence length (length without the ambiguous N bases). The sum of bases in each partition is initialised to 0. The, the sorted list in traversed in order while assigning the current sequence into the partition with the minimum sum of bases. The sum of bases in that partition is updated accordingly. Using this strategy, we get a distribution as follows.