Mass spectra alignment using virtual lock-masses

Mass spectrometry is a valued method to evaluate the metabolomics content of a biological sample. The recent advent of rapid ionization technologies such as Laser Diode Thermal Desorption (LDTD) and Direct Analysis in Real Time (DART) has rendered high-throughput mass spectrometry possible. It is used for large-scale comparative analysis of populations of samples. In practice, many factors resulting from the environment, the protocol, and even the instrument itself, can lead to minor discrepancies between spectra, rendering automated comparative analysis difficult. In this work, a sequence/pipeline of algorithms to correct variations between spectra is proposed. The algorithms correct multiple spectra by identifying peaks that are common to all and, from those, computes a spectrum-specific correction. We show that these algorithms increase comparability within large datasets of spectra, facilitating comparative analysis, such as machine learning.

The second data structure is, what we call, the active sequence A. At any time, A contains a sequence of peaks, listed in increasing order of their m/z values, which is currently being considered to become a VLM sequence. That data structure uses a doubly linked list L and a boolean-valued vector B of dimension m. The linked list L is actually containing the sequence of peaks to be considered for the next VLM and the vector B is such that, at any time, B[σ ] = True if and only if a peak from spectrum S σ is present in L. The active sequence A also maintains the m/z value µ l of the last peak that was removed from L, the average m/z value µ A of the peaks in L, and a copy w A of the window size w chosen by the user. The methods of the active sequence A use the following methods defined for the doubly linked list L.
• L. f ront(), which returns a copy of the peak at the beginning of the list L (the one having the smallest m/z value in L).
• L.back(), which returns a copy the peak at the end of L.
• L.size(), which returns the number of peaks in L.
• L.pop_ f ront(), which removes the peak at the beginning of the list L.
All these methods run in O(1) time for a doubly linked list 2 .
The active sequence A provides four methods to the main algorithm. Given an initialized heap H and a chosen window size w, the A.init(H, w) method initializes L to be the empty list (containing zero peaks) and initializes all m entries of the vector B to False. It also assigns the value w to w A and µ l to a negative m/z value. All these operations in A.init(H, w) are done in O(1) time. Algorithm 1: The A.isValid method for active sequence A.
The A.isValid(H) method, described by Algorithm (1), returns True if and only if the peaks in the active sequence A satisfies all the criteria enumerated in the definition of a VLM. A precondition for the validity of this method is that L contains only peaks that belong to distinct spectra of S . This precondition holds initially for an empty list L and will always be enforced each time a new peak gets inserted in A, as we will see below. Another precondition for the validity of this method is that µ l holds the largest m/z value of the peaks in S that are not in L while satisfying the constraint of being smaller or equal to the m/z value of L. f ront(). This precondition holds trivially initially for an empty list and, as we will see below, will be maintained each time the content of L is modified. The first check of A.isValid(H) returns False immediately as soon as L does not contain exactly |S | peaks as this violates property (1) for a VLM. Then, if H is not empty, the m/z value of the peak at the top of H (which is the peak having the m/z value closest to the largest m/z value in L while not being smaller to it) is checked to see if it lies outside the interval [µ A (1 − w), µ A (1 + w)] permitted for a valid VLM. If it lies inside, we return False as A violates property (4) for a VLM. Note that this check is not performed if H is empty since property (4) is automatically satisfied if there does not exist a peak in S \ L having an m/z value larger or equal to µ(L.back()). The next two checks just verify that all the m/z values of the peaks in L are inside [µ A (1 − w), µ A (1 + w)] and return False immediately if this is not the case. The final check verifies if µ l lies outside that interval and returns False if it is not the case. Since every check is done in The A.advanceLowerBound() method, described by Algorithm (2) gets called by the main algorithm only when it decides to remove L. f ront() from the active sequence A in order to eventually include H.top() in L. A precondition for its validity, which will be enforced by the main algorithm, is that L is not empty prior to its execution. Before removing the peak L. f ront() = (σ , ρ) from L, it first sets B[σ ] to False since that peak will no longer be in L, and assigns µ l to µ(σ , ρ). Since no other method removes peaks from L, µ l is always contains the m/z value of the last peak that moved out from L. Then, after the A.advanceLowerBound(); Output: Removes L. f ront() from L and updates the attributes of A.
Algorithm 2: The A.advanceLowerBound method for active sequence A.
and that all peaks in L belong to a different spectra of S . If L is empty when the method is called, then for sure we can insert H.top() into L and, after setting µ A to µ(H.top()) and the corresponding entry of B to True, we are certain that the desired constraints are satisfied. If L is not empty prior to the insertion of (σ , ρ) = H.top(), then we verify if another peak coming from spectra S σ is already present in L by testing if B[σ ] = True. If this is the case then we do not insert H.top() and return False. If B[σ ] = False then no peak coming from spectra S σ is in L, so we then compute the value µ A that µ A should have after the insertion of (σ , ρ) and if µ(L. f ront()) and µ(σ , ρ) both lie in then we perform the insertion and update H with L.push_back(H.popAndFeed()) and then return True. Note that, when we try to insert H − top(), we do not verify if µ l < µ A (1 − w A ) with the updated L and µ A . Indeed, the only fact that we are certain about at all times is that µ l ≤ µ(L. f ront()). The purpose of the A.insert(H, S ) method is to see if we can insert H.top() into L 3/8 while still having some probability that the content of L will become a valid VLM sequence after zero or more future insertions. If the insertion is not performed, it means that the content of L with H.top() cannot possibly become a valid VLM sequence and, in that case, to find another VLM sequence we need to remove L. f ront() with the A.advanceLowerBound() method and then try again to insert H.top() into L. No that when the insertion is rejected, A.insert(H, S ) is performed in O(1) time since the heap H is not modified. If the insertion is successful, the the heap is reconstructed with H.popAndFeed() in O(log m) time and, in that case, virtualLockMassDetection(S , w); Input: S = {S 1 , S 2 , ..., S m }, a set of mass spectra. Input: w, a window size parameter in relative units. Output: The sequence of all isolated VLM points with respect to (S , w). Data: H, a heap initialized with H.init(S ); thus containing the first peak of each spectra in S . Data: A, an active sequence initialized with A.init(H, w); hence initially empty. Data: U , a sequence of m/z values, initially empty.
Algorithm 4: The Virtual Lock Mass Detection Algorithm.
Having described the data structures used and their methods, we are now in position to present the main algorithm for virtual lock mass detection, which is described by Algorithm (4). Recall that the task of this algorithm is to find all the isolated VLM points w.r.t. (S , w). The strategy to achieve this task is to first find the sequence U = µ 1 , . . . , µ |U | of all possible VLM points w.r.t. (S , w), which may contain several pairs of overlapping VLMs. Once U is found, Algorithm (4) calls the DeleteOverlaps(U , w) function, described by Algorithm (5), which returns to its caller the subset of U of all isolated VLM points. For this last task, DeleteOverlaps(U , w) just exploits the definition of overlapping VLM points and, consequently, uses a Boolean-valued vector A (with all entries initialized to False) and assigns sequentially A[k] and A[k + 1] to True whenever µ k overlaps with µ k+1 w.r.t. w. Then, to finish its task, Algorithm (5) just needs to insert in V only the µ k s for which A[k] is False since it is only these VLMs that do not overlap with another VLM in U . Note that Algorithm (5) runs in O(|U |) time, which is in O(n) time in the worst case, where n is the total number of peaks in S .
Hence, the central part of virtualLockMassDetection(S , w) is to find the sequence U of all (possibly overlapping) VLM points. The strategy to achieve this task is to use A.insert(H, S ) to try to insert in A (consequently in L) the next unprocessed peak of S , which is always located on the top of the heap H. Initially, the first such peak of S , a peak having the smallest m/z value among those in S , gets eventually inserted in an empty A by A.insert(H, S ). Next, after verifying with A.isValid(H) if the content of A satisfies the criteria to be a valid VLM sequence, we try to insert again in A the next available peak. On each insertion failure, we test if, before this insertion, the content of A was a valid VLM sequence. This is done with the usage of the Boolean variable f ound (which is set to True as soon as the content of A is a valid VLM sequence and which is set to false immediately after the average m/z value µ A of A's content is appended to U ). Hence, for each considered peak in L. f ront(), we try to insert one more peak in L and test after the insertion if L's content is a valid VLM sequence. If we cannot insert an extra peak in A with the current peak in L. f ront() this means that there is no possibility of finding one more VLM sequence with the current peak in L. f ront(). In that case we remove that peak from L with A.advanceLowerBound() and, consequently, L. f ront() now becomes the peak that was next to L. f ront() in L. Hence, with this strategy, the algorithm attempts to find the largest consecutive sub-sequence of peaks from S that starts with any given peak in S and that forms a valid VLM sequence 3 . In addition, note that in the else branch of Algorithm (4), we verify if H becomes empty after a successful insertion. In that case, we need to check if we can find a valid VLM sequence by incrementing sequentially the lower bound L. f ront() and then append to U the first VLM found. Then, we can safely exit the while loop since any other possible VLM sequence will be a subset of the one already found. Without this else branch, a VLM sequence that ends with the last peak presented by H would be missed by the algorithm.
For the running time of virtualLockMassDetection(S , w), note that, at each iteration of the outer while loop, the slowest operation is A.insert(H, S ), when H is not empty. If H is empty, the slowest operation is the while loop in the else branch.

An algorithm for Virtual Lock Mass Correction
Given a set S of spectra and a widow size parameter w expressed in relative units, once the sequence V of all isolated VLM points w.r.t. (S , w) has been determined, the individual spectra in S can be corrected in a manner similar as is usually done with traditional external lock masses. Algorithm (6) performs the correction needed for each peak in a spectrum S ∈ S . Recall that all the peaks with an intensity smaller than a chosen threshold t a and greater than a chosen threshold t b have been removed from S and that the m/z values in S are listed in increasing order. First, in the for loop, we identify each peak of S corresponding to a lock mass point v i ∈ V . Since S ∈ S and v i is a VLM point w.r.t. (S , w), we are assured to find exactly one such peak p j ∈ S with an observed m/z value of µ j such that µ j lies in For such µ j , we assign the index j to α i so that α α α = (α 1 , . . . , α n ) is a vector of n indexes, each pointing to the peak in S associated to a VLM point. Note that for µ j ∈ [(1 − w)v i , (1 + w)v i ], its corrected m/z value must be equal to v i . Instead of performing these corrections immediately in the for loop, we delay them to the linear interpolation step where all peaks having a m/z value µ j such that α 1 ≤ j ≤ α n will be corrected. virtualLockMassCorrection(S, V , w); Input: S = (µ 1 , ι 1 ), (µ 2 , ι 2 ), ..., (µ m , ι m ) , a spectrum Input: V = v 1 , v 2 , ..., v n , a sequence of m/z values (VLMs) sorted in increasing order Input: w, a window size parameter in relative units Output: A spectrum S = (µ 1 , ι 1 ), (µ 2 , ι 2 ), ..., (µ m , ι m ) where each µ j is the corrected m/z value for the peak (µ j , ι j ) ∈ S Data: α α α = (α 1 , . . . , α n ), a vector of indexes (natural numbers) //construction of α α α i ← 1; Algorithm 6: Virtual Lock Mass Correction Algorithm

6/8
Next, for each VLM v i , we correct by linear interpolation all the m/z values µ j such that α i ≤ j ≤ α i+1 . To explain precisely this procedure, let µ (µ j ) denote the corrected value of µ j . Linear interpolation consists at looking for a correction of the form where a is called the slope and b is the intercept. By imposing that µ (µ j ) = v i for j = α i and µ (µ j ) = v i+1 for j = α i+1 , we find that The nested while loops of the algorithm performs exactly these linear interpolation corrections for all µ j such that α i ≤ j ≤ α i+1 for i = 1 to n − 1. Once all m/z values µ j such that α 1 ≤ j ≤ α n have been corrected, the algorithm is done. Hence, we have decided not to correct any m/z value of S that is either smaller that v 1 (1 − w) or larger than v n (1 + w) because such a peak has only one adjacent VLM and, consequently, could only be corrected by extrapolation, which is much less reliable than interpolation 4 . Finally, the intensities of the peaks remain unchanged.
For the running time complexity of Algorithm (6), first note that the for loop is executed in O(m) time, where here, m denotes the number of peaks in S. This is because we only need to iterate once over the peaks in S and, for each peak in S, we only perform a constant (independent of m and n) amount of operations. Next, the nested while loops is also performed in O(m) time since the iteration over all peaks in S is also done only once, and only a constant amount of operations is performed for each peak in S. The total running time of the algorithm is therefore in O(m).

From VLM correction to spectra alignment
After running the VLM detection and correction algorithms, all the peaks associated with VLM points will be perfectly aligned in the sense that each peak in different spectra associated to a VLM point v will have exactly the same m/z value v. However, all the other peaks corrected by Algorithm (6) will not be perfectly aligned in the sense that a molecule fragment responsible for a peak in different spectra will not yield exactly the same mass after correction. This is due to possibly many uncontrollable phenomena that vary each time a sample gets processed by a mass spectrometer, and by the fact that the correction of each peak was performed by an approximate numerical interpolation. However, if all the peaks have been corrected by Algorithm (6), we expect that the peaks corresponding to the same molecule fragment f across different spectra will have very similar masses and will all be localized within a very small window of m/z values. Moreover, we also expect that the m/z values of the peaks coming from another molecule fragment g having a different mass will not cross the m/z values coming from molecule fragment f .
More precisely, suppose that we have executed Algorithms (4) and (6) with a window size parameter w (in relative units) on a sequence S of mass spectra. In addition, suppose that a molecule fragment f gives rise to a peak of m/z value µ 1 in spectrum S 1 , and a peak of m/z value µ 2 in spectrum S 2 , and so on for a sub-sequence of spectra in S . Let M f = µ 1 , µ 2 , . . . be the sequence of these m/z values. Moreover, let µ f be the average of the m/z values in M f . Then, we expect that there exists a window size θ in relative units, such that 0 < θ w, and for which we have µ i ∈ [µ f (1 − θ ), µ f (1 + θ )] for all µ i ∈ M f . Moreover, if θ is sufficiently small, we expect that the sequence M g referring to peaks produced by another molecule fragment g having a different mass will be such that each µ j ∈ M g will not be located within Motivated by this hypothesis, let us introduce the following definitions. Given that Algorithms (4) and (6) have been executed on a sequence S of mass spectra with window size parameter w in relative units, and given that we have another window size parameter θ w in relative units, we say that a m/z value µ f is an alignment point w.r.t. (S , θ ) if there exists a sequence M f of peaks from S that satisfies the following properties.
Let m def = |S |. Notice that the only difference between the definition of alignment point (and its associated alignment sequence) and the definition of VLM point (and its associated VLM sequence) is the fact that a VLM sequence must contain exactly m peaks, whereas an alignment sequence can contain any number of peaks between 1 to m (since the peaks in an alignment sequence may originate from a molecule fragment which is not present in all the samples for which we have a spectrum in S ). Hence, if we remove the statement if L.size() = m then return False from the A.isValid(H) method, Algorithm (4) then finds all the maximum-length sub-sequence of peaks that satisfy the 4 criteria for a valid alignment sequence when it reaches the deleteOverlap(U ) method. Consequently, with that very minor change, virtualLockMassDetection(S , θ ) finds all isolated alignment points w.r.t. (S , θ ) in O(n log m) time, where n is the total number of peaks in S . If the window size parameter θ is too large, then many alignment points will overlap and Algorithm (4) will return very few isolated alignment points. If θ is very very small, then, in contrast with the VLM identification case, Algorithm (4) will return a very large number of isolated alignment points associated to aligned sequences that contain only one point. Hence, in contrast with the VLM identification case, the best parameter θ is not the one for which we obtain the largest number of alignment points. What should then be the choice for θ ? To answer this question, we consider each VLM point (and its associated sequence of peaks) found by Algorithm (4). If we leave out one VLM point v i from the correction algorithm (6) and use this algorithm to correct all the m/z values of the peaks associated to this VLM point, the maximum deviation from v i among these m/z values will give us the smallest window size θ i such that each m/z value will be located within Essentially, this window size θ i is the smallest one for which we can still recognize all the peaks associated to the same VLM v i . It would then certainly be a very good choice for θ in that region of m/z values. We can then repeat this procedure for all isolated VLM points (except the VLMs with the smallest and largest m/z values) found by Algorithm (4) to obtain a sequence of θ i values. One interesting possibility for θ is the maximum among the θ i values. However, this is clearly an overestimate of the maximum spreading of peaks associated to the same molecule fragment since all the VLMs will be used for the correction, including the one that was left out. Moreover, as we can see in Figure (1), we can recover a large fraction of the non-overlapping VLMs if we use a significantly smaller window size than the max i θ i . For that reason, we have decided to use, for the window size θ , the smallest value covering 95% of the non-overlapping VLMs, i.e., the 95th percentile. Alternatively, to attempt to maximize the accuracy of a learning algorithm, a percentile z can be selected by cross-validation along with the selection of the hyperparameters of the learning algorithm. If we have r VLM points, each θ i associated to the ith VLM point is found in O(m) time for a sequence S of m spectra; thus implying a running time in O(mr) to find every θ i . Then, the 95th percentile is found by sorting the vector of θ i s in O(r log r) time. Assuming that we always have log(r) < m, the total running time to find θ is in O(mr), and hence in O(n) when S contains a total of n peaks.
Once the window size θ is found, we can then run Algorithm (4) just once on the full sequence S of spectra with that value of θ in O(n log m) time. Consequently, the total running time of the alignment algorithm, which includes the running time to find θ and to find all non overlapping alignment points w.r.t. (S , θ ), is in O(n log m).