Toffee – a highly efficient, lossless file format for DIA-MS

The closed nature of vendor file formats in mass spectrometry is a significant barrier to progress in developing robust bioinformatics software. In response, the community has developed the open mzML format, implemented in XML and based on controlled vocabularies. Widely adopted, mzML is an important step forward; however, it suffers from two challenges that are particularly apparent as the field moves to high-throughput proteomics: a large increase in file size, and a largely sequential I/O access pattern. Described here is ‘toffee’, an open, random I/O format backed by HDF5, with lossless compression that gives file sizes similar to the original vendor format and can be converted back to mzML without penalty. It is shown that mzML and toffee are equivalent when processing data using OpenSWATH algorithms, in addition to novel applications that are enabled by new data access patterns. For instance, a peptide-centric deep-learning pipeline for peptide identification is proposed. Documentation and examples are available at https://toffee.readthedocs.io, and all code is MIT licensed at https://bitbucket.org/cmriprocan/toffee.

experimental data in mzML. Building reliable and robust algorithms requires a strong testing framework of both unit and regression tests and a test harness that encourages developers to use it. The IO-bound nature of mzML files raises artificial barriers to test adoption.
Toffee addresses each of these challenges through a lossless compression mechanism, the open HDF5 7 storage protocol, and a boost::rtree run time implementation. Importantly, it does not aim to be a long-term archival format and takes the view that this is the role of the original vendor file. Instead, toffee fills the role of efficient raw data access through open protocols, and although it does not currently implement the PSI ontology, attributes and metadata can be added trivially should the user wish to do so. While vendor lock-in for archiving is a disadvantage, the highly regulated clinical environment in which vendors operate suggests that there should always be a way to access the data within these files if necessary.

Lossless compression
The following explanation of how toffee compression is achieved reflects its heritage of development on Sciex TOF data. However, the fundamental approach of converting m/z to an integer index space should be applicable to data collected on both time-of-flight and Orbitrap mass analyzers. In making toffee open-source, contributions that extend it, from others more familiar with different instruments, are highly encouraged. In particular, I anticipate it is extremely well suited to being extended to include ion mobility for efficient analysis of timsTOF DIA-MS data.
Toffee's lossless compression is achieved by understanding the physics of a time-of-flight (TOF) mass analyzer. Here, a charged ion is accelerated through an electric potential; by applying Maxwell's equations to the work done on the ion, together with Newton's laws of motion, one can show that the m/z of the ion is related to its time of arrival at the sensor, where the arrival time is an integer multiple of the sensor sampling interval ($\Delta t$) plus an offset ($t_0$). The parameters of this relationship, which include $\gamma$, are the Intrinsic Mass Properties (IMS) of the injection 8 . For a full derivation see the Methods section.
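As a rough sketch of where a relationship of this kind comes from (this is an idealized textbook argument, not the derivation given in the Methods section), consider an ion of mass $m$ and charge $z$ that gains kinetic energy $zU$ in consistent units and then drifts a distance $d$ at constant speed. The labels $\alpha$ and $\beta$ are introduced here purely for illustration, and this idealization says nothing about the role of $\gamma$.

```latex
\begin{align}
zU = \tfrac{1}{2}\, m \left(\frac{d}{t}\right)^{2}
  \quad &\Rightarrow \quad \sqrt{\frac{m}{z}} = \frac{\sqrt{2U}}{d}\, t, \\
t = i\,\Delta t + t_0
  \quad &\Rightarrow \quad \sqrt{\frac{m}{z}} = \alpha\, i + \beta,
  \qquad \alpha = \frac{\sqrt{2U}}{d}\,\Delta t, \quad \beta = \frac{\sqrt{2U}}{d}\, t_0 .
\end{align}
```

Under these assumptions the square root of m/z is linear in the integer sample index $i$, which is what makes an integer index space a natural coordinate for TOF data.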
Pragmatically, given the data of an mzML file, toffee is able to iteratively calculate the IMS properties; ideally, this data would be exposed directly from the vendor format but this cannot always be relied upon. Interestingly, in order to ensure lossless compression, the IMS parameters must be calculated and stored for each scan rather than for a given injection, or even scan type (i.e. MS1 or MS2). The implication here is that the mass spectrometer performs in-line calibration that is additional to the manual calibration completed as part of routine lab operation (see Supplementary Material calculate_ims.ipynb for more information).
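To illustrate what an iterative, per-scan estimate could look like, the following minimal sketch assumes the idealized form sqrt(m/z) = alpha * k + beta with an integer index k measured relative to the first point in the scan. It is not toffee's algorithm: the function name `fit_ims`, the starting guess, and the omission of any third parameter are assumptions made for this example.

```python
import math


def fit_ims(mz_values, max_iter=10):
    """Estimate (alpha, beta, k) so that sqrt(m/z) ~= alpha * k + beta.

    Assumes an idealized linear grid in sqrt(m/z); k is an integer index
    relative to the smallest m/z in the scan. Needs at least two distinct
    m/z values, and at least one pair of points on adjacent grid positions.
    """
    s = [math.sqrt(mz) for mz in sorted(mz_values)]
    # Initial guess: the smallest gap between neighbours is one grid step.
    alpha = min(b - a for a, b in zip(s, s[1:]) if b > a)
    beta = s[0]
    k = None
    for _ in range(max_iter):
        new_k = [round((x - beta) / alpha) for x in s]
        if new_k == k:
            break  # the index assignment has stabilised
        k = new_k
        # Ordinary least squares for the straight line s = alpha * k + beta.
        n = len(s)
        mean_k = sum(k) / n
        mean_s = sum(s) / n
        sxx = sum((ki - mean_k) ** 2 for ki in k)
        sxy = sum((ki - mean_k) * (si - mean_s) for ki, si in zip(k, s))
        alpha = sxy / sxx
        beta = mean_s - alpha * mean_k
    return alpha, beta, k
```

Running such a fit once per scan, rather than once per file, mirrors the observation above that the parameters must be stored for each scan to keep the conversion lossless.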
From this m/z transfer function, it is possible to convert m/z values to integer m/z indices (i), while retention time is calculated from the scan index and the instrument cycle time. Thus, all raw data in the file can be represented as a vector of integer triplets: m/z index, retention time index, and intensity. These triplets fall onto a Cartesian grid, and thus can be stored as a compressed row storage sparse matrix 9 . This sparse matrix can be efficiently saved as three data sets in the toffee file with zlib compression provided natively by HDF5. Further details can be seen in the Methods section.
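The storage idea can be sketched with the Python standard library alone. The example below turns integer triplets into the three compressed-row-storage arrays and compresses each with zlib. It is a stand-in under stated assumptions: toffee itself stores its data sets in HDF5, the choice of retention-time index as the row axis is an assumption, and the helper names (`to_csr`, `from_csr`, `pack`, `unpack`) are hypothetical.

```python
import zlib
from array import array


def to_csr(triplets, n_rows):
    """Build CSR arrays from (row, col, value) non-negative integer triplets.

    Rows are taken to be retention-time indices and columns m/z indices
    (an assumption made for this sketch).
    """
    data = array("I")                        # intensities
    indices = array("I")                     # m/z index of each stored value
    indptr = array("I", [0] * (n_rows + 1))  # start offset of each row
    for row, col, value in sorted(triplets):
        data.append(value)
        indices.append(col)
        indptr[row + 1] += 1
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]           # per-row counts -> offsets
    return data, indices, indptr


def from_csr(data, indices, indptr):
    """Expand CSR arrays back into a sorted list of triplets."""
    out = []
    for row in range(len(indptr) - 1):
        for p in range(indptr[row], indptr[row + 1]):
            out.append((row, indices[p], data[p]))
    return out


def pack(arr):
    """Losslessly compress one integer array with zlib."""
    return zlib.compress(arr.tobytes())


def unpack(blob, typecode="I"):
    """Inverse of pack()."""
    arr = array(typecode)
    arr.frombytes(zlib.decompress(blob))
    return arr
```

Because every stored quantity is an integer, the round trip through `pack` and `unpack` reproduces the triplets exactly, which is the property that makes the scheme lossless.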
The Toffee file format is accompanied by C++ and python libraries for creating and accessing the file format. All code is MIT licensed at https://bitbucket.org/cmriprocan/toffee and documentation and examples are available at https://toffee.readthedocs.io.

Results
File size comparisons. File size is crucial to managing long-term data storage and retrieval costs, in addition to minimizing the hardware that is required during any computational analysis of the raw data. Using three public data sets covering both TripleTOF5600 (Swath Gold Standard 10 and TRIC manual validation set, only the y- and b-ions are included in the analysis 11 ), and TripleTOF6600 (ProCan90, including only the first injection from each mass spectrometer 12 ), the raw vendor files were converted to mzML using 'msconvert' 13,14 in both profile and vendor peak-picking centroid mode, each with and without 'msnumpress' 15 , as well as the 'sciex/wiffconverter' Docker image 16 in both profile and centroid modes. Toffee files were then produced from the msconvert and Sciex Docker profile mzML files, and the toffee files were converted back to mzML. Figure 1 shows that the largest mzML files are produced by 'msconvert' with no 'msnumpress' compression, and the smallest are either the lossy centroided and compressed mzML files, or the lossless profile toffee files, both of which compare in size to the vendor format. Finally, for reference, a small subset of files was converted to mz5.
Raw data access. Enabling random, rather than sequential, data access is beneficial on multiple fronts. In peptide-centric approaches such as OpenSWATH, it is a significant algorithmic advantage to enable accessing data by slicing through the m/z axis. As Table 1 shows, compared with an indexed mzML file, toffee is around 4 times more efficient for spectrum-centric and two orders of magnitude more efficient for peptide-centric data access. This analysis was conducted on a laptop with a single thread and under minimal RAM usage (<5 GB). It is worth noting that these types of timings are somewhat artificial. The most suitable method of loading and caching raw data will be highly dependent on the application, algorithm, and its implementation; performance is always best achieved through profiling the actual code being optimized. Furthermore, there is no consideration given here to threaded access, or high memory environments that would allow all data to be held in RAM. Full details of this comparison can be seen in Supplementary Material random-access-timing.ipynb.
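To make the idea of slicing through the m/z axis concrete, here is a minimal sketch of peptide-centric extraction from compressed-row-storage arrays. It uses a binary search over the sorted column indices of each row, which is a simplification chosen for this example and not the R-tree lookup that toffee uses; the function name `extract_xic` and the row/column layout are assumptions.

```python
from bisect import bisect_left, bisect_right


def extract_xic(data, indices, indptr, mz_lo, mz_hi):
    """Sum intensities with mz_lo <= m/z index <= mz_hi for each scan.

    data, indices, indptr are CSR arrays with one row per retention-time
    index; the column indices within each row must be sorted.
    """
    xic = []
    for row in range(len(indptr) - 1):
        start, stop = indptr[row], indptr[row + 1]
        lo = bisect_left(indices, mz_lo, start, stop)
        hi = bisect_right(indices, mz_hi, start, stop)
        xic.append(sum(data[lo:hi]))
    return xic
```

Each scan costs one pair of binary searches rather than a full pass over its spectrum, which hints at why an indexed layout favours peptide-centric access over a sequential read.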
Replacing mzML for analysis. The Toffee file format is accompanied by a wrapper around OpenSWATH 10 that enables SWATH-MS data to be analyzed with standard algorithms and the scores to be used in False Discovery Rate (FDR) calculations of PyProphet 17 . The MIT-licensed OpenMSToffee 18 serves two purposes: to demonstrate that toffee does not introduce any artifacts to the OpenSWATH pipeline converting raw data to a quantified list of peptides, and as an exemplar of how peptide-centric data extraction can be achieved using the C++ toffee library.
Using OpenMSToffee I have conducted a thorough investigation into OpenSWATH with a variety of mzML conversions, and toffee itself. In order to assess the quality of the data in each file conversion method, they are input into the computational pipeline as described in the Methods section (all analysis code is included in the 'openms-toffee-paper' repository 19 and is executable on MacOS or Linux). One of the reasons for selecting the SGS and TRIC data sets for this analysis is their inclusion of a collection of manually validated peptide query matches (PQMs). Using this information, the results of the file format pipelines are assessed by categorizing those results with a peptide retention time within 15 seconds of the manually validated peak as true positives, those not within this threshold as false positives, and those peaks in the manual validation, but not found using this file format, as false negatives. One can see from Fig. 2 that each of the profile files (mzML and toffee) performs equivalently, while centroided mzML files have a larger number of false positives, particularly with the TRIC data (see Fig. 2C). This is further demonstrated in Fig. 2D by the increase in missing values that are seen with centroided mzML files when analyzing the ProCan90 data set.
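The categorization rule described above can be written down in a few lines. This is a minimal sketch: the function name `classify`, the dictionary layout, and the choice to treat exactly 15 seconds as inside the threshold are assumptions, and detections with no manually validated counterpart are simply ignored here.

```python
def classify(detected_rt, validated_rt, tolerance=15.0):
    """Count true positives, false positives and false negatives.

    detected_rt and validated_rt map a PQM identifier to a retention
    time in seconds. Only PQMs present in validated_rt are scored.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0}
    for pqm, rt_true in validated_rt.items():
        rt = detected_rt.get(pqm)
        if rt is None:
            counts["FN"] += 1      # validated peak was not found at all
        elif abs(rt - rt_true) <= tolerance:
            counts["TP"] += 1      # found close to the validated peak
        else:
            counts["FP"] += 1      # found, but at the wrong retention time
    return counts
```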
These results show there is no meaningful difference to the final quantified peptide results from OpenSWATH regardless of using profile data from 'msconvert' (with or without 'msnumpress'), 'Sciex Docker', or toffee; however, there is a drop in performance when using centroided data.
Deployment at scale. ProCan has used a toffee-based production pipeline since mid-2019. The program runs its analytics pipelines on a hybrid-cloud computational infrastructure with Kubernetes orchestrating between on-premise and Amazon Web Services. By eliminating the need for mzML files, storage costs have decreased by an order of magnitude from a predicted $US 60,000 per month to $US 6,000 per month in year 5 of operation. In production, ProCan converts directly from the vendor file to toffee in a fully Dockerised workflow, and though file conversion is a once-off, it is interesting to note that the time taken to convert scales linearly with the size of the vendor file (see Fig. 3a).

Figure 1.
Comparison of several different file conversion methods, including: profile and centroided mzML created from 'msconvert' and the 'Sciex Docker image', mz5 files created using 'msconvert' and toffee files created using the method described in this work. Clearly, profile mzML files require the most storage although this can be improved based on the method of generation and if 'numpress' compression is used. Centroiding the data results in much smaller files, and toffee files are of equivalent size to the original vendor format. See the Methods section for a full description of all conversion methods.

Table 1.
Accessing data from toffee files is 4 times more efficient for spectrum-centric access, and 100 times more efficient for peptide-centric access. MS2 window "050" was selected as an indicative medium/high load window. The setup for reading mzML requires a loop through the file to extract the index and match it to its relevant MS1 and MS2 windows, while toffee must load its relevant data classes.

(2020) 10:8939 | https://doi.org/10.1038/s41598-020-65015-y | www.nature.com/scientificreports

Although Fig. 3b does not control for the complexity of spectral library or the computational hardware (e.g. Amazon instance type), it is possible to observe processing time for toffee compared with mzML. Due to the architecture of OpenSWATH (in particular, challenges around thread safety), OpenMSToffee is by no means an optimum deployment of the technology and there is significant room for CPU performance improvements once const-correctness is addressed in OpenSWATH. For that reason, no attempt has been made to push toffee or OpenMSToffee into the upstream 'openms' repository.

Novel uses.
Having confirmed the OpenMSToffee pipeline is equivalent to the profile mzML/OpenSWATH pipeline, novel uses of toffee files can be explored.
In-silico dilution series. It is of critical importance to developing robust scientific software that one can isolate and test algorithms with controlled inputs and known expected outputs. By providing an efficient peptide-centric interface into the data, toffee allows algorithm developers to write tests of their implementation down to individual units of work while still retaining the complexity of real world data. This testing approach is fundamental to scientific software development 20 ; for example, in high-energy physics, the US National Laboratories have developed the Tri-Lab test suite that pits algorithms against toy problems with analytic solutions 21 .
In the context of mass spectrometry, curating known input data is much more difficult due to the stochastic nature of the instruments, and of the subject under study. Often, validation experiments are based around injecting samples that contain a controlled dilution of a known group of peptides and technical replicates are performed to normalize the stochastic impact of the instrument. An alternative approach is to create data in-silico through analytic models 22,23 ; however, it is highly improbable that the noise artificially added to these models is an accurate reflection of data in reality.
Toffee offers us a new approach. By treating the data as triplets on a Cartesian grid, it becomes trivial to extract data from one toffee file, the foreground data, and add it to another toffee file, the background data, to create an entirely new toffee file. Referred to as an in-silico dilution series, data for specific peptides are extracted from the foreground file, scaled based on a theoretical dilution, and placed into the background file at a known retention time (see Fig. 4).
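A minimal sketch of this operation on triplet data is shown below. It is not the procedure from the supplementary notebook: the function name `dilute_into`, the dictionary representation, the rounding of scaled intensities to integers, and above all the assumption that the foreground and background files share the same m/z index grid are simplifications made for illustration.

```python
def dilute_into(background, foreground, mz_lo, mz_hi, rt_lo, rt_hi,
                dilution, rt_shift=0):
    """Return a new {(rt_index, mz_index): intensity} mapping.

    Points of `foreground` inside the index window are scaled by
    1 / dilution, shifted by rt_shift scans, and added to a copy of
    `background`. Both inputs are assumed to share one m/z index grid.
    """
    result = dict(background)  # leave the background untouched
    for (rt, mz), intensity in foreground.items():
        if rt_lo <= rt <= rt_hi and mz_lo <= mz <= mz_hi:
            scaled = int(round(intensity / dilution))
            if scaled > 0:  # drop points diluted below one count
                key = (rt + rt_shift, mz)
                result[key] = result.get(key, 0) + scaled
    return result
```

Since the injected signal, its dilution factor, and its retention time are all chosen by the developer, the expected output of any downstream algorithm is known in advance, which is what makes the resulting file usable as a test fixture.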
In this study, two in-silico dilutions are constructed (one with a water background and another with an E. coli background) such that the impact of background noise can be assessed. Details of how this is done can be seen in Supplementary Material in-silico-dilution.ipynb. The files are then analyzed with two spectral libraries: a 'Simple' SRL containing just those peptides that were added in-silico, and a 'Complex' SRL that includes the in-silico peptides, plus an SRL derived from this E. coli sample. Comparing the results from these two SRLs neatly shows the impact of the $\pi_0$ parameter discussed at length in Rosenberger et al. 17 . Figure 5A shows the normalized intensity quantified by the OpenMSToffee pipeline; as expected, the dilution curve of the water background is bound on the upper limit by the theoretical dilution (red horizontal lines) and the data that is below the theoretical limit is due to signal truncation at the lower limit of detection of the mass spectrometer (as set by the background file). In contrast, the E. coli dilution series often exceeds the theoretical limit and shows the OpenSWATH peak integration algorithms are at risk of quantifying noise by treating the extracted ion chromatogram as a one-dimensional signal rather than its two-dimensional reality.

Figure 3.
Timing of production wiff-to-toffee file conversions and OpenMSToffee/OpenSWATH runs for more than 10,000 mass spectrometer injections run at ProCan since mid-2017. As expected, CPU time scales roughly linearly with the size of the wiff file, however, exact timings will be dependent on the computational hardware available, in particular the I/O bandwidth.

Figure 5B-D show the confusion matrix where PQMs detected more than 10 seconds from the theoretical retention time are labelled as false positives, and PQMs correctly identified in the Simple SRL/Water Background for a given dilution and not detected in the file of interest are labelled as false negatives. From Fig. 5B,C the complexity of the SRL seems unimportant at low dilutions and false positives remain around the expected FDR. At dilutions 4 and above, the more complex E. coli background makes it harder for PyProphet to distinguish between target and decoy peptides, leading to a more stringent FDR, and thus, a marked increase in the number of false negatives when compared to the water background (see Fig. 5D).

Re-Quantification of peaks.
Through the many visualisations of toffee data, and appreciation for the TOF detector, one recognises that the data is roughly Gaussian in both retention time and m/z index space. By fitting an analytic model to the data, and using this model for re-quantification of peptide intensities, more accurate data can be obtained. Furthermore, by fixing the retention time of the analytic function for all fragments, it is also possible to deconvolute co-eluting peaks and count only the contribution from the peak of interest.
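As a small illustration of working with such a model, the sketch below evaluates a two-dimensional Gaussian in (retention time, m/z index) and estimates its location and spread from intensity-weighted moments. Weighted moments are swapped in here as a cheap stand-in, for example to seed an optimizer; they are not the numerical optimization used in this work, and the function names and the plain Gaussian shape on both axes are assumptions.

```python
import math


def gaussian_2d(t, m, t0, m0, sigma_t, sigma_m, amplitude):
    """Separable 2D Gaussian in retention time t and m/z index m."""
    return amplitude * math.exp(
        -0.5 * ((t - t0) / sigma_t) ** 2 - 0.5 * ((m - m0) / sigma_m) ** 2
    )


def moment_estimates(points):
    """Estimate (t0, m0, sigma_t, sigma_m) from a list of (t, m, intensity).

    Uses intensity-weighted means and variances; assumes a single peak
    with negligible background inside the window.
    """
    total = sum(w for _, _, w in points)
    t0 = sum(t * w for t, _, w in points) / total
    m0 = sum(m * w for _, m, w in points) / total
    var_t = sum(w * (t - t0) ** 2 for t, _, w in points) / total
    var_m = sum(w * (m - m0) ** 2 for _, m, w in points) / total
    return t0, m0, math.sqrt(var_t), math.sqrt(var_m)
```

Co-eluting peaks or a non-zero noise floor would bias these moment estimates, which is one reason a constrained fit across all fragments, as described here, is the more robust choice.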
In a TOF mass analyzer, one can assume that individual ions obtain a distribution of kinetic energy that leads to an approximately normal distribution of data in m/z index space. Further, the elution profile of ions from the LC column can be approximated as a log-normal distribution that is skewed towards the left. These observations imply that numerical optimization can find the peak location, spread, and amplitude of the Gaussian functions for each fragment, $j$, by fitting an analytic function of $t$ and $m$ to the raw data, where $t$ and $m$ represent the retention time and m/z index space respectively; $t_0$ and $\vec{m}_0$ are the peak locations, assuming the retention time for all fragments must be constant and the mass offset is allowed to be different for each one to account for calibration offsets; $\sigma_t$ and $\sigma_m$ are the spread of each Gaussian; $\vec{a}$ is the amplitude for each fragment; $\vec{c}$ is the amplitude of chemical noise for each MS1 fragment; and $I_j$ is the raw intensity data for a given fragment.
Figure 6 shows the result of applying this analytic model to more than 1,200 technical replicates selected from routine operation within ProCan. Interestingly, re-quantification alone produces little change in the correlation between replicates. However, it is possible to use the fit model parameters to perform an additional round of outlier removal. Specifically, the spread parameters $\sigma_t$ and $\sigma_m$, and the peak location parameters $t_0$ and $\vec{m}_0$ should be largely systematic features of the data. Peaks that were fit with parameters that were significantly different from their cohort are marked as false positives and dropped from the subsequent correlation calculation. In doing so, the correlation between technical replicates improves.

Deep-learning pipeline proof-of-concept.
While machine learning, and in particular deep learning, are permeating many facets of science, computer vision is the area where the technology is most mature.
By treating toffee data as triplets on a Cartesian grid, and accepting a degree of mass approximation (<5 parts per million, see Supplementary Material calculate_ims.ipynb), it becomes trivial to extract data as a two-dimensional slice analogous to an image, and thus amenable to deep-learning based peptide identification strategies. Figures 7 and 8 show raw two-dimensional data for two peptide query parameters extracted directly from a HEK293 file. Here, the red, green, and blue channels are filled with data extracted around the m/z of the precursor and product ions with offsets of 0, 1, and 2 times the isotopic carbon-12/carbon-13 mass difference. Using OpenSWATH results from ProCan90, it is possible to develop a significant labelled training dataset suitable for training an object detection convolutional neural network. For this proof of concept, a single shot detector (SSD) architecture was trained with a ResNet50 backbone on the Amazon SageMaker platform (see Supplementary Material raptor-part0-transfer_from_resnet50.ipynb). Figure 8 shows inference results for a random sample of target peptides from the holdout set where a-d are successfully detected peptides and e-f are false negatives.
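The channel layout described above can be sketched as a small helper that returns the m/z centre for each colour channel. The function name `channel_centres` is hypothetical, and dividing the carbon-13/carbon-12 mass difference by the ion charge to obtain the spacing on the m/z axis is an assumption of this example rather than something stated in the text.

```python
# Mass difference between carbon-13 and carbon-12, in daltons.
C13_C12_MASS_DIFF = 1.0033548


def channel_centres(mz, charge, n_channels=3):
    """Return the m/z centre of each isotope channel.

    Channel k (0 = red, 1 = green, 2 = blue) is centred on the ion's m/z
    plus k isotope steps; one step on the m/z axis is the mass difference
    divided by the charge (an assumption made for this sketch).
    """
    return [mz + k * C13_C12_MASS_DIFF / charge for k in range(n_channels)]
```

Data extracted around each centre, for the precursor and for every product ion, would then be stacked into the rows of one image per peptide query.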
Clearly, this proof of concept is not yet at the performance of state-of-the-art SWATH-DIA analysis tools. However, the results are encouraging and give hope that an analysis pipeline that does not require an experimental spectral library may be possible in the not too distant future.

Discussion
In summary, the challenges of file size and data access with current open formats for data independent acquisition mass spectrometry are acute when a scientific program needs to operate at biobank-scale. These challenges significantly increase the cost and complexity of data management and analysis, and hold back the progress of writing efficient algorithms that are routinely tested on real-world data. Toffee aims to address these issues by taking a first-principles approach to understanding the raw data, and translating those findings into a best-practice software library. Code is released with an MIT license, Python packages should be easily installed for MacOS and Linux using 'conda', both toffee and an exemplar implementation wrapping OpenSWATH are available via version-controlled Docker images, and all analyses performed in this paper are available as Jupyter notebooks.

Methods
Open science. To the extent possible, this work aims to have full and automated reproducibility. The code is released with an MIT license and makes use of the following community tools and technologies: pyteomics 24

Figure 6.
Correlation between peptide intensities of technical replicates before and after re-quantification. In this plot, correlation is calculated between the intensity of 1,251 technical replicates collected during routine production in ProCan. Using parameters found by fitting the analytic model to the data, potential outliers, or false discoveries, are detected and subsequently dropped from the analysis to improve the correlation between replicates. Prior to outlier filtering, there is little difference between the re-quantified data and the original intensities recorded by OpenMSToffee, both validating the analytic model chosen and indicating that there is not a significant burden in the data from co-eluting peptides.

Data sets. Three publicly available data sets are used in the current work: Swath Gold Standard and TRIC data available from PeptideAtlas raw data repository with accession number PASS00289 10 and PASS00788 11 , respectively, and the ProCan90 dataset can be obtained from the PRIDE archive under the identifier PXD011093 12 .

Software and analysis.
• Code repositories
  • https://bitbucket.org/cmriprocan/toffee
  • https://bitbucket.org/cmriprocan/openms-toffee
  • https://bitbucket.org/cmriprocan/dia-test-data (a resource of small DIA-MS files that are used in continuous integration regression tests)
  • https://bitbucket.org/cmriprocan/openms-toffee-paper (all analyses produced for this paper)
  • https://bitbucket.org/cmriprocan/openms (a versioned Docker file for creating the base OpenMS Docker image)
• Documentation
  • https://toffee.readthedocs.io
  • https://openms-toffee.readthedocs.io

Figure 7.
By accepting a mass accuracy loss of less than 5 parts per million, toffee data can be efficiently accessed in a way that is amenable for modern computer vision algorithms. For example, the images here are produced by stacking data from the mono-isotope, and isotopes 1 and 2, into the red, green and blue channels of an image respectively. The left panel shows target and corresponding data for AIELFSVGQGPAK charge 2, and the right panel is for AADAEAEVASLNRR charge 3. The top two-dimensional horizontal slice of each of these images is from the precursor ion (or MS1) data, while the remaining slices are from the six product ion (or MS2) data as specified in the spectral library. Images

DIA pipeline. All analysis for raw data was completed using the Docker image cmriprocan/openms-toffee:0.13.12.dev 38 and a complete python script for marshalling analysis is included in the Supplementary Material pipeline.py. In short, the current best-practice OpenSWATH workflow is used whereby the spectral library is provided in SQLite format, scores are saved to an osw result file, and PyProphet (version 2.0.4; git hash d35a53af86131e7c4eb57bbb09be8935a1f30c70) FDR control is applied at the peak-group, peptide, and protein levels. For the latter two, scores are calculated across the full cohort of the experiment (i.e. the 'global' context is used); in a departure from the defaults, the -parametric model is used as it was found to be less conservative on the small SRLs used in this study. mzML files are analyzed using a modified version of OpenMS v2.4.0 (CMRI-ProCan/OpenMS version CMRI-ProCan-v1.1.2 39 ). This same version is used as the basis for the code in the OpenMSToffee bitbucket repository 18 , a wrapper around OpenMS that enables toffee files to be used in conjunction with the algorithms of OpenSWATH. Furthermore, OpenMSToffee has a series of tests that ensure regressions are not introduced to OpenSWATH scoring algorithms when the version of OpenMS is updated. Finally, OpenMSToffeeWorkflow is used as a drop-in replacement for OpenSwathWorkflow when analyzing toffee files.

Toffee file format. Time of Flight (TOF) m/z transfer function.
The mass analyzer in a TOF mass spectrometer is a digital sensor that samples at a constant frequency. It effectively measures the time taken for a charged ion to move a known distance (d) through an electric field of known strength (U). Thus, measurements are made at integer multiples (i) of a constant time interval (Δt), such that