Trackable and scalable LC-MS metabolomics data processing using asari

Significant challenges remain in the computational processing of data from liquid chomratography-mass spectrometry (LC-MS)-based metabolomic experiments into metabolite features. In this study, we examine the issues of provenance and reproducibility using the current software tools. Inconsistency among the tools examined is attributed to the deficiencies of mass alignment and controls of feature quality. To address these issues, we develop the open-source software tool asari for LC-MS metabolomics data processing. Asari is designed with a set of specific algorithmic framework and data structures, and all steps are explicitly trackable. Asari compares favorably to other tools in feature detection and quantification. It offers substantial improvement in computational performance over current tools, and it is highly scalable.


Supplementary
. Pairwise, unambiguously matched features between processing tools on the Yeast2021 dataset.

Supplementary Figures
Supplementary Figure 1  Supplementary Figure 9. Example of QC overview with many low-quality peaks.

Supplementary Methods
Software design of asari. Asari is written in Python 3, and can be used as a standalone command line tool or imported as a package. Its library dependency includes numerical computing via numpy and scipy, data wrangling via pandas, and visualization via panel and hvplot. Pymzml is used to parse mzML format. Data structures, annotation, search and chemical calculation make use of our supporting packages metDatamodel, mass2chem and jmsmetabolite-services. Implementation of new and previous algorithms was coded from the ground up, where numerous details contributed to the computing speed, e.g., discrete mathematics is preferred over continuous curves, and intermediary indexing and caches are employed.
Processed mass tracks are cached on disk to reduce memory footprint. Mass tracks are explicitly linked with features and peaks, and the information is exported as JSON in asari output. The quality metric mSelectivity is used internally. Other quality metrics, peak shape, SNR and cSelectivity, are part of exported feature tables. Grouping ions into empirical compounds is performed using khipu (Li and Zheng 1 ). The annotation and search functions are generic to accommodate reference databases, and the default is HMDB 2 . . In practice, we only need to calculate the mSelectivity based on the two neighbors of higher m/z values and two of lower m/z values to obtain an accurate approximation.

Metabolomics datasets
Four new datasets were generated in this study: HZV029 (human plasma samples), MT02  and 60 μL of extraction solvent was added. Extraction blanks were also prepared to remove features of non-biological origins. All samples were vortexed and incubated with shaking at 1000 rpm for 10 min at 4°C followed by centrifugation at 4°C for 15 min at 20,817 x g. The supernatant was transferred into mass spec vials and 2 μL injected into UHPLC-MS.
All samples were maintained at 4 °C in the autosampler, and analyzed using a Thermo Scientific Orbitrap ID-X Tribid Mass Spectrometer coupled to a Thermo Scientific Transcen LX-2

LC-MS metabolomics data processing.
Asari has default parameters with mass accuracy of 5 ppm, minimum peak height 1E5. No parameter was modified unless described specifically. The results in this paper were based on version 1.10.6.
MZmine 2.53 (or version 3.3.0) processing was performed in the following order: 1. Mass detection, using Centroid mass detector.   Step-by-step description of asari default processing workflow.

Mass track construction for each sample; see chromatograms.extract_massTracks_.
a. Get all MS1 spectra from a data file, as a list of [(m/z, scan number, intensity), ...]. Index to a dictionary by int(mz * 1000) for efficient retrieval. This is an mzTree.
b. Create a list of data bins from mzTree. Each bin starts with a value in the mzTree, which has a m/z range around 0.001 and is filtered by minimal required scan number. If two bins are adjacent by 0.001 or within tolerance ppm, they are merged. See chromatograms.get_thousandth_bins.
c. Build mass tracks per data bin. If the m/z range in a data bin is within 2 x tolerance ppm, the bin leads to a single mass track.
Else, a nearest neighbor (NN) clustering is performed to establish the number of mass tracks. The NN clustering assigns each data point to its nearest "peak mz value". The "peak mz values" are determined by finding peaks in the m/z value distribution, with the requirement that the two peaks need to be separated by the mz tolerance minimally.

Retention time alignment
a. The reference sample was established prior to mass alignment. A set of landmark elution peaks are determined in this reference sample by the criteria: mSelectivity > 0.99, min_peak_height is satisfied (default 1e5), prominence > 20% of peak height and the peak is the only peak on its mass track. These are our "selected_reference_landmark_peaks". See set_RT_reference, peaks.quick_detect_unique_elution_peak.
b. For each of the remaining samples, a set of good peaks are selected using the same criteria as above, but limited to from the mass tracks already aligned to the selected_reference_landmark_peaks. They constitute the "good_landmark_peaks" specific to a sample. See CompositeMap.calibrate_sample_RT.
c. Perform a LOWESS (Locally Weighted Scatterplot Smoothing) regression to obtain a function to describe the relationship of the RT values between good_landmark_peaks and selected_reference_landmark_peaks. To prevent spurious results from the regression, we add 10% extension out of both ends as boundaries that the regression must converge to. In practice, the function extrapolates to a dictionary to map all scan numbers between the current sample and the reference sample. The dictionary skips numbers that are identical between the two samples to save memory and computing. The resulting "rt_cal_dict" is a correspondence between sample_rt_numbers and reference_rt_numbers. See chromatograms.rt_lowess_calibration d. Loop through all remaining samples to obtain the RT calibration function (rt_cal_dict) per sample.

Building the composite map
a. Now that we have established both the m/z alignment in MassGrid and RT alignment between samples, they are linked in the CompositeMap class.
b. Each LC-MS feature should be on a specific location on the composite map (with a consensus m/z and RT). A feature can be conceptualized as a pattern that is repeated in multiple samples. Thus, the pattern will persist and be even enhanced when the signals from multiple samples are superimposed. This concept is implemented into "composite mass tracks", where the intensity values are summed on corresponding mass tracks across all samples after RT calibration. The mass tracks are intensity vectors of the same length, based on scan numbers, after retention time alignment. See CompositeMap.build_composite_tracks.

Detection of elution peaks (features)
Instead of detecting the same peaks in every sample, asari detects an elution peak only once on the composite mass tracks. Because a composite mass track represents all samples, an elution peak here is equivalent to a feature in the experiment. The shape and prominence of a peak matter in relation to its surroundings. Some instruments generate higher intensity numbers than others arbitrarily. The absolute intensity number is not the most important; a peak is distinctive as long as its shape is good and its height is well above the noise level. We may state that all good peaks look similar. It is the "scaling" factor that differs between metabolites and between platforms. The peak detection in asari is largely dependent on the noise levels. The default parameters are functional over a large range and updated based on statistical analysis of the mass track. See peaks.audit_mass_track, peaks.stats_detect_elution_peaks.
a. If the max intensity of a mass track is higher that a preset ceiling (1E8), the mass track is rescaled under the preset ceiling for the purpose of peak detection. This serves a normalization purpose to make peak detection parameters robust.
Because the composite mass tracks sum up the cumulative intensity of all samples, the intensity could be exceedingly high in large studies. But the detection of elution peaks on composite mass tracks only needs to determine the apex and peak boundaries. The useful quantitative value is the peak area reported on individual samples, not on a composite mass track. After peak detection, the peak height is scaled back using the same scaling factor.
b. If the median intensity on a mass track is below the preset min_intensity_threshold (default 1e3 for Orbitrap data), this is a low-intensity track. Both baseline level and noise level are set to min_intensity_threshold.
c. Else, if over half the data points are above min_intensity_threshold and the median intensity is higher than 10 times of preset min_peak_height (default 1e5 for Orbitrap data), detrend (scipy.signal.detrend) is performed on the mass track.
Detrend is a regression method to ensure the baseline is not significantly rising or decreasing over the chromatography. It is a computationally expensive operation and only required for high-signal and high-noise data.
d. If a track is not low-intensity (see b), the bottom signals are taken as intensity values below the lower quartile plus min_intensity_threshold. The constant of min_intensity_threshold makes this method stable, even when zeros dominate the track. Here, the baseline level and noise level are assigned as the mean and standard deviation of the bottom signals, respectively. e. Smoothing (chromatograms.smooth_moving_average) is applied when the noise level is higher than 1% of max intensity and max intensity is lower than 10 times of the preset min_peak_height. For low noise or high intensity tracks, smoothing is not needed.
f. The mass track is subtracted by a filter (i.e. baseline + noise level). This creates multiple segments of positive intensity values, because some data points are below the filter. Peak detection is performed on the separate regions, because i) it is faster to skip very low signals, ii) this reduces the chance of dealing with multiple peaks simultaneous, and iii) this allows further determination of proper peak prominence based on local signal levels.
g. Peak detection on a segment of signals. Asari uses a simple local maxima method (scipy.signal.find_peaks), with prominence control that is dynamically determined on each mass track then each segment. Prominence is the vertical distance from a peak top to its lowest contour line. Prominence is calculated in a series of steps: the initial value is 1/3 of min_peak_height; then the greater of prominence and the noise level of a track is used as the new prominence; if the segment is high-intensity and high-noise, the greater of prominence and 5% of max intensity is used as the new prominence. Another parameter, min_peak_height, applies to this method. The parameter min_timepoints is used for controlling both distance between peaks and peak width. The prominence is computed on a sliding window size (default 25 scans

Data export
The output directory by asari bears a time stamp, not to overwrite existing data. The recommended feature table is preferred_Feature_table.tsv. All peaks are kept in export/full_Feature_table.tsv. Annotation is exported into both JSON and tsv formats. See experiment.ext_Experiment.export_peak_annotation. MassGrid is exported as a csv file. The composite map is exported as a pickle file, which is used by the visual dashboard.