GPU-acceleration of the distributed-memory database peptide search of mass spectrometry data

Database peptide search is the primary computational technique for identifying peptides from mass spectrometry (MS) data. Graphics Processing Unit (GPU) computing is now ubiquitous in current-generation high-performance computing (HPC) systems, yet its application in the database peptide search domain remains limited. Part of the reason is the use of sub-optimal algorithms in the existing GPU-accelerated methods, resulting in significantly inefficient hardware utilization. In this paper, we design and implement a new-age CPU-GPU HPC framework, called GiCOPS, for efficient and complete GPU acceleration of modern database peptide search algorithms on supercomputers. Our experimentation shows that GiCOPS exhibits a 1.2 to 5× speed improvement over its CPU-only predecessor, HiCOPS, and over 10× improvement over several existing GPU-based database search algorithms for sufficiently large experiment sizes. We further assess and optimize the performance of our framework using the Roofline Model and report near-optimal results for several metrics, including computations per second, occupancy rate, memory workload, branch efficiency, and shared-memory performance. Finally, the CPU-GPU methods and optimizations proposed in our work for complex integer- and memory-bounded algorithmic pipelines can also be extended to accelerate existing and future peptide identification algorithms. GiCOPS is now integrated with our umbrella HPC framework HiCOPS and is available at: https://github.com/pcdslab/gicops.

Supplementary Sections

Supplementary Section 1
Related Work. Most early-age database peptide search algorithms employ complex matrix-vector and vector-vector operations, such as Cross-Correlations (XCorr) [1], [2], Fast Fourier Transforms (FFTs) [3], [4], and Spectral (Vector) Dot Products (SDP) [5], [6], [7], to compute similarity scores between an experimental spectrum (or 2D vector) and a theoretical database spectrum. These computational motifs [8] are efficiently parallelizable using SIMD- and SIMT-based hardware accelerators, including vector engines, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). Therefore, several research efforts have focused on developing hardware-accelerated spectral similarity computing algorithms, including GPU SDP and KSDP [9], FastPaSS [10], Tempest [11], ProteinByGPU [12], GPUScorer [13], and Tide-for-PTM-search [14]. The common parallelization model employed in all of the above algorithms offloads the vector-vector similarity score computations by launching one thread block per computation. Most algorithms also employ optimization techniques, including exploiting vector sparsity, to efficiently use the compute, DRAM memory bandwidth, and shared-memory resources [11], [9]. Several algorithmic designs explore further manual fine-tuning, including loop unrolling and cache and register usage [12], to further improve the achieved throughput. Similarly, MIC-Tandem [15] accelerates the SDP computations using the (now discontinued) Intel Many Integrated Core (MIC) co-processors. More recently, Bruker introduced PaSER, a proprietary GPU-accelerated version of the ProLuCID-4D algorithm. As expected, most hardware-accelerated database peptide search algorithms report several orders-of-magnitude speedups over their respective CPU implementations. For instance, Tempest reports 8 to 13× speedups, Tide-for-PTM-search reports 2.7 to 5.8× speedups, and [9] reports 8 to 100× speedups for the SDP and KSDP kernels, respectively.
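To make the thread-block-per-score model concrete, below is a minimal CUDA sketch in which block b computes the dot product between the experimental spectrum and database candidate b. This is our illustration only, not code from any of the cited engines: the kernel name, BINS, and THREADS are assumptions, and we use dense binned vectors whereas, as noted above, the cited engines additionally exploit vector sparsity.

```cuda
// sdp_sketch.cu -- one thread block per similarity score (illustrative only)
#include <cstdio>

constexpr int BINS = 1024;    // assumed length of a binned spectrum vector
constexpr int THREADS = 256;  // threads per block

__global__ void sdpKernel(const float *expSpec,     // [BINS]
                          const float *candidates,  // [numCandidates * BINS]
                          float *scores) {          // [numCandidates]
    __shared__ float partial[THREADS];
    const float *cand = candidates + (size_t)blockIdx.x * BINS;

    // each thread accumulates a strided slice of the dot product
    float sum = 0.0f;
    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        sum += expSpec[i] * cand[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // shared-memory tree reduction of the partial products
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) scores[blockIdx.x] = partial[0];  // one SDP per block
}

int main() {
    float hExp[BINS], hCand[2 * BINS];
    for (int i = 0; i < BINS; ++i) {
        hExp[i] = 1.0f; hCand[i] = 1.0f; hCand[BINS + i] = 2.0f;
    }
    float *dExp, *dCand, *dScores;
    cudaMalloc(&dExp, sizeof(hExp));
    cudaMalloc(&dCand, sizeof(hCand));
    cudaMalloc(&dScores, 2 * sizeof(float));
    cudaMemcpy(dExp, hExp, sizeof(hExp), cudaMemcpyHostToDevice);
    cudaMemcpy(dCand, hCand, sizeof(hCand), cudaMemcpyHostToDevice);
    sdpKernel<<<2, THREADS>>>(dExp, dCand, dScores);  // one block per candidate
    float hScores[2];
    cudaMemcpy(hScores, dScores, sizeof(hScores), cudaMemcpyDeviceToHost);
    printf("SDP scores: %.1f %.1f\n", hScores[0], hScores[1]);
    cudaFree(dExp); cudaFree(dCand); cudaFree(dScores);
    return 0;
}
```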

Supplementary Section 2
Limitations in Existing GPU Algorithms. The parallelization methods employed in all existing GPU-accelerated database peptide search algorithms are based on the observation that the spectral similarity computations make up more than 80-90% of the total database peptide search computations [11], [9], [15], [10], [12]. While this holds for index-free (closed-search) database peptide search algorithms, modern CPU-based algorithms employ sophisticated data processing, machine learning, and database filtering techniques [16], [17], [18], [19] to significantly reduce the number of required spectral similarity computations and achieve over 100× speedup over the existing GPU-based algorithms [16], [17], [20], especially for open-search.
Consequently, the computational profile has shifted significantly from compute-intensive (spectral similarity computation intensive) to data- and memory-intensive workflows [20]. For instance, MSFragger spends a significant fraction of its execution time in fragment-ion search, which involves memory lookups and updates. Similarly, TagGraph spends a significant fraction of its execution time in graph traversal-like computations to filter the potential database peptide candidates before computing scores. Consequently, the existing GPU-acceleration methods and optimizations based on matrix/vector computations cannot be employed to (optimally) accelerate the complex compute and memory patterns of the modern database peptide search algorithms. Finally, the software/codebases of almost all existing GPU-accelerated algorithms are either unavailable, outdated, or unusable, as explored in Supplementary Section 3.

Supplementary Section 3
Existing GPU Database Peptide Search Software. Tempest was obtained from its GitHub repository https://github.com/markadamo/tempest and built with GCC 9.4, SQLite 3.36, and CUDA 11.6. The experiments ran for days without producing any results. Debugging the code with GDB revealed that the program gets stuck in an infinite loop during OpenCL's GPU setup, regardless of the input data or experimental settings. We tried to diagnose and fix, and even remove, the problematic functions/loops in the code (along with fixing other critical bugs) but were unsuccessful, as the specific loop had multiple nested loops and compound if-else conditions with little documentation. We reported this issue to the Tempest developers at https://github.com/markadamo/tempest/issues/1 but it remains open as of writing this manuscript. ScoreByGPU was obtained from its website http://www.comp.hkbu.edu.hk/~youli/ProteinByGPU.html, but the software package contained only pre-built binaries (no source code) for Windows OS built with CUDA 4.2, and so it could not be run on our experimental setup or any other computing resources available to us. We requested the source code or a Linux-based binary from the authors but did not get a response as of writing this manuscript. Tide-for-PTM-search was obtained from its GitHub repository at https://github.com/Tide-for-PTM-search/Tide-for-PTM-search and the pre-built binaries were run after installing the required libraries, including CUDA 8.0 and MSToolkit. MCTandem was obtained from https://github.com/Logic09/MCtandem but could not be experimented with, as it requires a discontinued and unavailable Intel Xeon Phi Coprocessor 7120P to run. PaSER is a commercial software package based on a proprietary GPU version of the ProLuCID algorithm developed by Bruker. We requested a demo or an evaluation copy of PaSER from Bruker but did not get a response as of writing this manuscript. Similarly, we could not find any binaries or source code for the GPU-based SDP algorithms [9] (i.e., SDP and KSDP), so we reached out to the authors for a copy of the software but did not get a response as of writing this manuscript.

Supplementary Section 4
Eliminating Race Conditions in the GPU Fragment-Ion Search. Our algorithm for alleviating the race conditions in the GPU-accelerated fragment-ion search is best explained using an example. Consider a simplified setting where a GPU thread block of size ψ operates on an input data array A with ψ elements. The desired output is a score array B such that B[i] ← count(A[j]/bin = i), where bin ∈ N − {0}. On completion, with bin = 10, each B[i] will contain the count of elements in A with i as their tens-place digit. However, computing this kernel on the GPU in a straightforward manner, i.e., B[A[tid]/bin] += 1 by all ψ threads, may lead to race conditions. See Supplementary Figure 3 for an illustration of this simplified problem.
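The racy formulation can be written as the following CUDA fragment (our illustration; the kernel name is hypothetical). Two threads whose elements fall into the same bin can read the same old value of B and lose updates; atomics would restore correctness but serialize all colliding threads.

```cuda
// naive_bincount.cu -- the straightforward but racy formulation (illustrative)
__global__ void binCountNaive(const int *A, int *B, int bin) {
    int tid = threadIdx.x;
    // unsynchronized read-modify-write: threads with equal A[tid]/bin collide
    B[A[tid] / bin] += 1;  // race condition!
    // atomicAdd(&B[A[tid] / bin], 1) would be correct, but serializes all
    // colliding threads -- the cost the reduction scheme below avoids
}
```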
To alleviate this, we leverage the fact that A is pre-sorted, like the stable-sorted database index in CFIR-Index [21]. Processing a sorted A ensures that the threads updating the same B[i] location have adjacent thread ids and can be identified by checking the condition (A[tid]/bin = A[tid − 1]/bin). If this condition is true for a thread, it belongs to a group of colliding threads; if it is false, it is the leader of a thread group. Note that threads not colliding with any other thread are also leaders, of groups of size 1. Once the leaders and members are identified, we perform a block-wise conditional reduction using the bin ids (A[tid]/bin) as keys, updating the current bin scores (the sum of each group's size in this example) in each of the log(ψ) iterations. Once reduced, only the leader threads update the scorecard with their reduced values. The reduction algorithm is visually illustrated in Supplementary Figure 3 as well as in Supplementary Algorithm 4 (lines 10 to 15), and sketched in code below.
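The following is a minimal CUDA sketch of this scheme for the simplified single-block example (ψ = n; the kernel name, PSI, and the test data are our assumptions, not GiCOPS' actual code). Leaders are detected on the sorted input, a key-conditional suffix reduction runs for log(ψ) rounds in shared memory, and only the leaders write their group totals to the scorecard.

```cuda
// bincount_reduced.cu -- leader detection + conditional reduction (sketch)
#include <cstdio>

constexpr int PSI = 11;  // thread-block size (= |A| in this simplified example)
constexpr int BIN = 10;  // bin width

__global__ void binCountSorted(const int *A, int *B, int n, int bin) {
    __shared__ int keys[PSI];
    __shared__ int vals[PSI];
    int tid = threadIdx.x;
    bool active = tid < n;

    keys[tid] = active ? A[tid] / bin : -1;  // bin id acts as the reduction key
    vals[tid] = active ? 1 : 0;              // each element contributes count 1
    __syncthreads();

    // leader = first thread of a run of equal keys (valid because A is sorted)
    bool leader = active && (tid == 0 || keys[tid - 1] != keys[tid]);

    // log(psi) rounds of key-conditional (segmented) suffix reduction
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        int add = 0;
        if (tid + offset < blockDim.x && keys[tid + offset] == keys[tid])
            add = vals[tid + offset];
        __syncthreads();
        vals[tid] += add;
        __syncthreads();
    }

    // one write per bin group; a multi-block version would atomicAdd here
    if (leader) B[keys[tid]] = vals[tid];
}

int main() {
    int hA[PSI] = {3, 7, 9, 12, 15, 23, 27, 28, 41, 44, 47};  // pre-sorted
    int *dA, *dB;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, 5 * sizeof(int));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemset(dB, 0, 5 * sizeof(int));
    binCountSorted<<<1, PSI>>>(dA, dB, PSI, BIN);
    int hB[5];
    cudaMemcpy(hB, dB, sizeof(hB), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 5; ++i) printf("B[%d] = %d\n", i, hB[i]);  // 3 2 3 0 3
    cudaFree(dA); cudaFree(dB);
    return 0;
}
```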
Analysis. The best-case scenario for this algorithm is when all ψ elements in A belong to the same group, resulting in ψ collisions; the worst-case scenario is when none of the elements collide. In either case, the proposed algorithm performs Θ(log(ψ)) operations to write ψ updates to the scorecard, which is far more efficient than using atomic intrinsics, where millions of such possible collisions would lead to performance loss.

Supplementary Figures
Supplementary Figure 1
CPU-GPU Pipeline Schematic. The CPU-GPU pipeline in GiCOPS consists of a scheduling thread t_s, a priority queue pq, a global work queue wq, and thread-signaling semantics. The scheduling task repeatedly assigns the work items in wq to the compute units in pq based on their priority and notifies them. The wq may be pre-filled, for instance with database or experimental data file handles, or may be filled in parallel by data-producer sub-tasks. The priorities may be changed depending on the kind of workload: for instance, the GPU or other accelerator units may be assigned higher priority for compute-intensive workloads, whereas the CPU units may be prioritized for memory- or communication-intensive workloads, and so on. Supplementary Algorithm 1 illustrates the scheduling algorithm implemented by t_s; a host-side sketch follows below.
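The following is a minimal host-side C++ sketch of one round of this scheduling loop. The WorkItem and ComputeUnit types and all names here are hypothetical simplifications for illustration, not GiCOPS' actual interfaces.

```cuda
// scheduler_sketch.cu -- simplified t_s scheduling round (host code only)
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct WorkItem { int id; };  // e.g., a database or experimental file handle

struct ComputeUnit {
    int priority;                // e.g., raised for GPUs on compute-bound work
    std::queue<WorkItem> inbox;  // items assigned to this unit
    std::condition_variable cv;  // signaled when new work arrives
};

struct ByPriority {              // max-heap: highest priority on top
    bool operator()(const ComputeUnit *a, const ComputeUnit *b) const {
        return a->priority < b->priority;
    }
};
using UnitPQ =
    std::priority_queue<ComputeUnit *, std::vector<ComputeUnit *>, ByPriority>;

// One round of t_s: drain wq, handing each item to the highest-priority
// compute unit in pq and notifying that unit's waiting worker thread.
void scheduleRound(std::queue<WorkItem> &wq, UnitPQ &pq, std::mutex &m) {
    std::lock_guard<std::mutex> lock(m);
    while (!wq.empty() && !pq.empty()) {
        ComputeUnit *cu = pq.top();
        pq.pop();
        cu->inbox.push(wq.front());  // assign one work item ...
        wq.pop();
        cu->cv.notify_one();         // ... and signal the unit
        pq.push(cu);                 // the unit remains schedulable
    }
}
```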

Supplementary Figure 2
Sorted Tag Approach (STA). Suppose we want to GPU-sort three arrays a_i of sizes 2, 3, and 4, respectively. Using the STA, we first flatten them into a global array A = {a_1, a_2, a_3} and initialize a tag array T = {1, 1, 2, 2, 2, 3, 3, 3, 3}. We then StableSortByKey, first using A as the keys and T as the values, and then again using T as the keys and A as the values. The resultant final array A′ contains the sorted versions a′_i in the correct order, as sketched below.
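A minimal Thrust sketch of the STA (our illustration; the input values follow the example in the figure schematic below):

```cuda
// sta_sketch.cu -- Sorted Tag Approach via two stable sort-by-key passes
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
    // flattened A for a_1 = {3, 2}, a_2 = {7, 5, 6}, a_3 = {8, 6, 3, 0}
    int hA[] = {3, 2, 7, 5, 6, 8, 6, 3, 0};
    int hT[] = {1, 1, 2, 2, 2, 3, 3, 3, 3};  // tag i marks members of a_i
    thrust::device_vector<int> A(hA, hA + 9);
    thrust::device_vector<int> T(hT, hT + 9);

    // pass 1: sort the values globally, carrying the tags along
    thrust::stable_sort_by_key(A.begin(), A.end(), T.begin());
    // pass 2: stably regroup by tag; stability preserves each group's order
    thrust::stable_sort_by_key(T.begin(), T.end(), A.begin());

    // A is now {2, 3 | 5, 6, 7 | 0, 3, 6, 8}: each a_i sorted, in order
    for (int i = 0; i < 9; ++i) printf("%d ", (int)A[i]);
    printf("\n");
    return 0;
}
```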

[Supplementary Figure 2 schematic: the arrays a_i and the tag array T are flattened; stable_sort_by_key(A, T) followed by stable_sort_by_key(T, A) produces A′ = {{2, 3}, {5, 6, 7}, {0, 3, 6, 8}} (flattened: {2, 3, 5, 6, 7, 0, 3, 6, 8}), which is unflattened to the a′_i.]

Supplementary Figure 3
Race Conditions in the GPU Fragment-Ion Search. Consider an input array A of size k, a GPU thread block of size ψ = k = 11, and a bin size of bin = 10. Computing the output array B[idx] = count(A[j]/bin = idx) in parallel by the GPU threads would lead to race conditions. Our algorithm eliminates these by first ensuring that A is pre-sorted and then performing a block-wide reduction before updating B. To do this, the threads first identify whether they are the leaders of a bin group; then, using the bin numbers as keys (key = A[tid]/bin), the values (i.e., counts) are reduced. After the reduction, only the leader threads update B with their final (reduced) values.
The computations follow the dynamic programming equation log(n!) = log(n) + log((n − 1)!), which also avoids 64-bit (double) overflow for n ≥ 21. These values are communicated to the GPU's constant memory to be used during the hyperscore computations.
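A minimal CUDA sketch of this precomputation (the table size MAX_N, the symbol name dLogFact, and the demo kernel are our assumptions): the log-factorial table is filled on the host via the recurrence and copied once to constant memory, where the scoring kernels can read it.

```cuda
// logfact_sketch.cu -- log(n!) table in constant memory (illustrative only)
#include <cmath>
#include <cstdio>

constexpr int MAX_N = 128;            // assumed upper bound on n
__constant__ double dLogFact[MAX_N];  // device constant-memory copy

__global__ void demo(int n) {
    // a hyperscore kernel would read terms like dLogFact[nb] + dLogFact[ny]
    printf("log(%d!) = %f\n", n, dLogFact[n]);
}

int main() {
    double h[MAX_N];
    h[0] = 0.0;                            // log(0!) = log(1) = 0
    for (int n = 1; n < MAX_N; ++n)
        h[n] = log((double)n) + h[n - 1];  // log(n!) = log(n) + log((n-1)!)
    cudaMemcpyToSymbol(dLogFact, h, sizeof(h));
    demo<<<1, 1>>>(21);  // 21! overflows 64-bit integers; its log does not
    cudaDeviceSynchronize();
    return 0;
}
```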