Baseline proteomics characterisation of the emerging host biomanufacturing organism Halomonas bluephagenesis

Despite its greener credentials, biomanufacturing remains financially uncompetitive compared with the higher carbon emitting, hydrocarbon-based chemical industry. Replacing traditional chassis such as E. coli with novel robust organisms is one route to cost reduction for biomanufacturing. Extremophile bacteria such as the halophilic Halomonas bluephagenesis TD01 exemplify this potential by thriving in environments inherently inimical to other organisms, thereby reducing sterilisation costs. Novel chassis are inevitably less well annotated than established organisms, so rapid characterisation along with community data sharing will facilitate their adoption for biomanufacturing. The data record comprises a newly sequenced genome for the organism, evidence via LC-MS based proteomics for expression of 1160 proteins (30% of the proteome) including baseline quantification of 1063 proteins (27% of the proteome), and a spectral library enabling re-use for targeted LC-MS proteomics assays. Protein data are annotated with KEGG Orthology terms, enabling rapid matching of quantitative data to pathways of interest to biomanufacturing.


Introduction
This document describes the data processing pipeline that generates the data reported in the paper "Baseline Proteomics Characterisation of Biomanufacturing Organism Halomonas Bluephagenesis". The pipeline is scripted in gnu-make and describes every process between the raw data and the package submitted to the PRIDE repository (https://doi.org/10.6019/PXD028156).

Introduction to Pipeline
Data were acquired in MS^E mode, in which the instrument alternates between a low energy scan and a collision energy "ramp". The low energy scan captures precursor ions; the ramp captures product ion spectra. Spectrum generation from data of this kind is not supported by open-source tools such as OpenMS or the Trans-Proteomic Pipeline, so the Waters proprietary tools are used.
The proprietary tools manage the pipeline as follows.
• Mass spectrometry data files are converted to feature lists by apex3d.exe.
• Precursor and product ions are matched by elution profile and combined into conventional spectra peak lists by peptide3d64.exe.
• Spectra are searched against a fasta database by iadb.exe.
It is possible to output additional information at each stage of the process. For example apex3d.exe is able to output a .csv file of all identified features and peptide3d64.exe can output a conventional .mgf file which may be submitted to other search engines. The final output of iadb.exe is a set of three .csv files summarising the outcome at a protein, peptide and fragment level.
After peak picking, the data are searched and processed along two separate tracks. Protein expression is quantified using the "top three" method incorporated into the Waters "iadb" search. Spectra obtained from peptide3d64.exe are separately extracted as .mgf files and processed through the X!Tandem search engine and the Trans-Proteomic Pipeline to produce spectral libraries.

Introduction to Make
The analysis pipeline is executed using gnu-make, also called make. The program make is well documented (https://www.gnu.org/software/make/manual). Make is principally used to compile software but is also useful for documenting bioinformatics pipelines. Make is well suited to scripting "many-to-many" operations, in which multiple mass spectrometry data files are each processed into a series of intermediate files, followed by "many-to-one" operations, in which multiple intermediate files are combined into a summary output. Make allows the command for each operation to be written in a single place and applied to multiple files, which ensures consistency and facilitates modification and experimentation with the pipeline as a whole.
The principle of make is that the program is supplied with a set of "rules" for converting "prerequisites" into "targets" by means of a "recipe". In this case the targets are a set of spectral libraries and the source files are MS datasets, fasta protein sequence libraries etc. Recipes take the general form:

target : prerequisite
	program -in prerequisite -out target

Recipes can be written in any order and make is able to chain them together to convert source files to target files via intermediate files.
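As a minimal illustration of such chaining (the file names and programs here are hypothetical, not part of the pipeline), two rules might look like this:

sample.features : sample.raw
	featurePicker.exe -in sample.raw -out sample.features

sample.spectra : sample.features
	spectrumBuilder.exe -in sample.features -out sample.spectra

Asking make for sample.spectra causes it to build sample.features first, because the chain of prerequisites leads back to the raw data.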
The following make file has been written with recipes in the sequence in which they would be run through the pipeline. When executed, make will parallelise the processing to speed up job completion.
Make is able to substitute text for macros, which are defined as myMacro:=text or myMacroList:=element_01 element_02 and then called as ${myMacro} or ${myMacroList}. It also has a set of built-in functions, called in the general form $(function_name variable,param,param...), which are able to manipulate variables, for example to change a file suffix, or to change or remove a directory path. In this way it is possible to write rules that convert raw data to intermediate files through an analysis pipeline without explicitly writing out all intermediate file names. Each recipe used for this project is briefly described, with links to documentation supplied as they are encountered in the make file.
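For example (the names and paths are illustrative only), a macro list and a few built-in functions can derive all the intermediate file names from a list of samples:

dataDir := D:/MSdata
samples := sample_01 sample_02 sample_03

# addprefix/addsuffix build the raw file paths; patsubst swaps the suffix
rawFiles := $(addprefix ${dataDir}/,$(addsuffix .raw,${samples}))
mgfFiles := $(patsubst %.raw,%.mgf,${rawFiles})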

System Setup
This section describes the programs that must be installed in order to run this pipeline.

PC Setup
This pipeline was developed on a PC with the following specification:
• Processor: Intel i7-7820X 3.6 GHz, 8 cores
• RAM: 128 GB
• Operating System: Windows 10 2004, Build 19041.1165

Windows is required for Waters' proprietary tools. The peptide3d program and process consumes a lot of RAM; the script detailed below prevents more than one instance of peptide3d running simultaneously so that the system does not run out of RAM. The pipeline may nevertheless fail if run on a system with substantially less RAM than was used in development.

Install All Required Programs
(1) If the Waters executables (apex3d, peptide3d, merge and iadb) outlined above are not otherwise available, download and install Progenesis QI for Proteomics, as this software installs the required executables.
Here the executables from Waters' PLGS were used.
(2) Download and install Rtools; this includes the gnu tools sed, awk, grep, etc.
(3) If not present on your system, download gnu-make v4.3. Compiled binaries can be found here: https://github.com/mbuilov/gnumake-windows. Earlier versions of gnu-make will not be compatible, as the script uses features introduced in v4.3 such as grouped targets.
(4) If not present on your system, download and install the Trans-Proteomic Pipeline (TPP); this work was tested with v6.0.0.
(5) If not present on your system, download and install X!Tandem; this work was tested with version ALANINE (2017.02.01). As an alternative, the version of X!Tandem distributed with the Trans-Proteomic Pipeline (TPP) may be used; however, at present this is the older Jackhammer TPP (2013.06.15.1 - LabKey, Insilicos, ISB) version, which was not used here.
(6) If not present on your system, download and install seqkit; this work was tested with v0.12.0.
(7) If not present on your system, download and install OpenMS; this work was tested with v2.4.0.
(8) A python distribution is required to run the msproteomicstools package. This work was tested with miniconda for python 3.7, with msproteomicstools installed as per the instructions given on the website. The minimal python installation enables just the requirements for this process to be installed with the versions required to replicate this work. Miniconda may be installed alongside other instances of python, allowing this work to be repeated without disrupting a pre-existing system. During installation note the target directory and adjust if required. For a single user installation this will look like C:\Users\<user>\AppData\Local\Continuum\miniconda3; for all users it may be C:\ProgramData\Miniconda3. Note that C:\ProgramData\ is a hidden directory on Windows 10; navigate to it directly by entering the path into Windows Explorer, or set Windows Explorer to show hidden files.
(9) Install the required python packages. To access the python command line, open a Windows command window, navigate to the Miniconda3\Scripts directory, and type activate into the command line. The required packages need to be installed from PyPI (pypi.org) through the python command line. The following commands entered into the anaconda prompt should install the required packages in the versions used here. Alternatively the latest versions of these packages may be installed, although that might require adjustments to the processing commands below.

pip install numpy==1.15.3
pip install pymzml==0.7.5
pip install Biopython==1.72
pip install Cython==0.29.2 --install-option="--no-cython-compile"
pip install msproteomicstools==0.8.0

No further interaction with the python environment should be required.

Executing Make
Gnu-make is executed from the command line. It is easiest to start a new cmd window in the project directory, set the PATH variable to just the location of gnu-make, and then execute gnu-make on the script:

PATH="C:\Program Files\gmake" gnumake-4.3-x64.exe -rR -j 8 -l 8 -f ./IntegratedAnalysis.mk --output-sync all

The command line is documented in the gnu-make manual. Briefly: -rR turns off the built-in rules for compiling software; -j indicates the number of concurrent jobs permitted, select a lower number if limited processors or RAM are available; -l checks that the load is below the specified value before starting a new job, which provides some protection against overloading RAM, which causes the Waters programs to crash; -f indicates the file name of the make script to run; --output-sync puts the output of each job on the console after the job finishes, without which the output of concurrently running tasks would be interleaved and consequently unintelligible; all is the name of the "target" required by this build, and alternative targets are available for testing, see below.

Documented Makefile for Pipeline
The following code blocks are combined into a single gnu-make script to run the pipeline.

Set Shell
Each recipe in the make file is run in a "shell", that is, a separate command line environment. On Windows there is a choice of shell. This script uses Windows native executables from Waters to process the data, so the Windows cmd.exe shell was chosen. It is likely that the script could be rewritten to run in the Windows Subsystem for Linux if required; however, in that case all the dos commands such as copy and ren would need to be replaced with unix alternatives and paths changed to unix format.

SHELL=C:/Windows/System32/cmd.exe
Programs in the Trans-Proteomic Pipeline call other programs, particularly perl scripts. In order for these programs to be available they must be on the PATH for the shell when it is called by make. The PATH is set explicitly with export to include the required Trans-Proteomic Pipeline directories.
The locale for the shell is also explicitly set to the default "C" locale. Without setting this here, perl may return a "locale" error, which is harmless in itself but will stop the make script. Explicitly setting the locale by exporting LC_ALL=C to the shell prevents this error.

export PATH=C:\TPP\perl\bin;C:\TPP\bin
export LC_ALL=C

Paths to Executables
Explicit control of software versions is assured by assigning the full path of each executable to a macro for each of the programs required by the pipeline. These must be edited to be correct for the system on which the make script is to be run. Setting them explicitly also supports reproducibility, since the versions of the software used here may be installed alongside subsequent releases being used in future active research. Checking these paths are correct also acts as a checklist to ensure the system is set up correctly.
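The macros take the general form below; the directories shown are placeholders and must be replaced with the actual install locations on the local system:

# example paths only, edit for the local system
apex3d    := C:/PLGS/apex3d.exe
peptide3d := C:/PLGS/peptide3d64.exe
iadb      := C:/PLGS/iadb.exe
xtandem   := C:/xtandem/tandem.exe
seqkit    := C:/seqkit/seqkit.exe
openmsDir := C:/OpenMS/bin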

Macros
In make, macros are rules that generate required instructions prior to execution. They may be used to generate output file names from an input file name, or vice versa, or to generate entire recipe commands.
The following macro tests whether a required target directory exists and creates it if not. It is used at the start of most recipes to ensure the target directory is available.
define makeDir
if not exist $(subst /,\,$(abspath $(dir ${1}))) mkdir $(subst /,\,$(abspath $(dir ${1})))
endef

The following two macros turn relative file paths into absolute file paths; this allows relative paths to be used for portability of the script while absolute paths are supplied to programs that require them. They also convert between the unix-like forward slash "/" and the windows-like back slash "\" for those programs that need it.

define winify
$(subst /,\,$(abspath ${1}))
endef

define unixify
$(subst \,/,$(abspath ${1}))
endef

The following pair of macros between them find the position of a string within a gmake word list. The slightly esoteric formulation is a recursive macro _pos, which nibbles through a word list until it removes the matching element, combined with pos, which counts the remaining elements. This code enables the pep3d processes to be run one file at a time, because multiple pep3d processes run concurrently will consume all the RAM and cause the processes to crash.
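A functionally equivalent sketch of this pair (not necessarily the exact formulation used in the script) is:

# _pos nibbles words off the front of list $2, accumulating one marker per word
# consumed, and stops when the first remaining word matches $1
_pos = $(if $2,$(if $(filter $1,$(firstword $2)),$3,$(call _pos,$1,$(wordlist 2,$(words $2),$2),$3 x)),$3)
# pos counts the markers to give the 1-based position, e.g. $(call pos,b,a b c) is 2
pos = $(words $(call _pos,$1,$2) x)

The position of the current raw file in the file list can then be used, for example with $(word ...), to make each peptide3d target depend on the output for the previous file, forcing those jobs to run one at a time.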
The following macro calls powershell to obtain a UUID. This is used to pass a UUID to the IADB search.
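A sketch of such a macro (the exact formulation in the script may differ):

# ask powershell for a fresh UUID; $(shell ...) runs when the macro is expanded
define getUUID
$(shell powershell -NoProfile -Command "[guid]::NewGuid().ToString()")
endef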

Mass Spectrometry Raw Data Files
The raw data acquired on the Xevo have been stored on a local hard drive.
Waters data are stored as multiple files inside a folder. Make cannot treat a directory as a prerequisite, so each raw data folder is specified via the _HEADER.TXT file held inside it, and this element is removed from the path when it is passed to the apex3d program.
The raw data file paths are concatenated into a single list of files so they can all be separately processed by a single command. This ensures consistency in data processing. Raw data directory paths will need to be adjusted to the location of downloaded data for re-use.
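In outline this looks as follows; the directory names are placeholders for the actual data locations:

# example raw data locations, adjust to wherever the downloaded .raw folders sit
rawDirs := D:/Halomonas/batch1/sample_01.raw D:/Halomonas/batch1/sample_02.raw

# each Waters .raw folder is represented by its _HEADER.TXT file so that make has
# a real file to use as a prerequisite; the folder itself is recovered with $(dir ...)
rawHeaders := $(addsuffix /_HEADER.TXT,${rawDirs})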

Fasta Files
The .fasta protein database files are specified here.
Also specified are a pair of .fasta databases of common contaminants, to draw off likely contaminating spectra, and a file containing the peptides used as retention-time and quantitative standards concatenated into a single pseudo-protein. The specified .fasta files are combined so that:
• The two contaminant databases are combined and a common cont_ prefix is added to the protein names.
• The contaminant and iRT databases are combined with the strain specific database for halomonas C3001.
• All annotation apart from the accession number is stripped from the sequence annotation.
The .fasta files are also converted to a table format for easy import and use in R.
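A sketch of the corresponding rules; the input file names are placeholders and the sed/seqkit calls are illustrative rather than the exact commands used:

# combine the contaminant databases and add the cont_ prefix to each protein name
contaminants.fasta : contaminantsA.fasta contaminantsB.fasta
	sed "s/^>/>cont_/" $^ > $@

# combine with the strain specific and iRT databases, keeping only the accession
searchDB.fasta : halomonas_C3001.fasta iRT_standard.fasta contaminants.fasta
	sed "s/^\(>[^ ]*\).*/\1/" $^ > $@

# tabular form of the database for easy import into R
searchDB.tab : searchDB.fasta
	${seqkit} fx2tab $< > $@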

Specify Upload Packet
This section specifies the files which are to be uploaded to PRIDE and copies them to a specific folder for upload.

Specify Targets
A critical part of a makefile is the specification of the targets. Provided recipes exist to build intermediate and final target files from the input files, all these targets will be produced. The targets specified in all will be built. This is assured by making the special target .PHONY dependent on all, because all prerequisites of .PHONY must always be rebuilt. This rule is the first in the file, so an invocation of make on the makefile will build all if no other target is given on the command line. Other targets may be specified on the gnu-make command line; the buildSet targets, for example buildApex, are provided for troubleshooting, to enable each step of the process to be built in turn. The special target .ONESHELL is also set here to specify that all lines in a recipe are run in the same shell invocation rather than each in a separate invocation. Running in a single shell is important for a few recipes in which variables are set in one line and used in subsequent lines of the recipe.
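In outline (the intermediate target names below are placeholders, apart from all and buildApex which are mentioned above):

.ONESHELL :
.PHONY : all
# all is the default goal, built when make is invoked with no explicit target
all : spectralLibraries prideUploadPacket
# trouble-shooting targets build one stage at a time, e.g.
buildApex : ${apexFeatureFiles}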

Set Apex, Peptide3d and Mayu Variables
It is possible to use makefile variables to set the values of switches passed to the executables. For the full command lines for Apex and Peptide3d see Apex Feature Picking and Peptide3d Spectra Generation below. During pipeline refinement, the following parameters, to which the process is quite sensitive, were moved out to this section for ease of adjustment.
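The pattern is simply a block of variable assignments near the top of the file; the names and values below are placeholders rather than the actual Apex3d/Peptide3d switch names or values:

# parameters the results are sensitive to, gathered here for ease of adjustment
apexLowEnergyThreshold   := 100
apexHighEnergyThreshold  := 20
pep3dMinPeaksPerSpectrum := 5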

Apex Settings
The Apex3d program has many settings which are set below:

Peak Picking
The Apex3d and peptide3d algorithms between them pick peaks from the raw data, de-isotope them, relate precursor and product ions, and produce spectrum peak lists for database searching. I am not entirely clear how the tasks are split: I think Apex does the initial peak picking and feature detection, and peptide3d associates precursor and product ions, determines peptide mass from multiply charged precursors and outputs a de-isotoped peak list. I need to find some documentation to properly understand this.

Peptide3d Spectra Generation
The peptide3d algorithm is run with settings as close as possible to the PLGS defaults, although some changes have been made. There was an issue with peptide3d "hanging" with some files: the process would consume all the RAM and never complete. Documentation on this from Waters suggested altering some settings, as noted in the code below.

TPP Pipeline for Spectral Library
The trans-proteomic pipeline is used to process .mgf files produced by peptide3d above into a spectral library. The results of the TPP are cross referenced with the iadb search to yield a high quality list of quantified proteins.

X!Tandem Search
The first step is to convert the .mgf files into .mzXML files which can be searched by X!Tandem. Each of the resulting .mzXML files is then searched with X!Tandem as below. The process is a little complex: for each .mzXML file a copy is made of the standard X!Tandem parameter file; this copy is then edited with the required spectrum file, target file name and fasta search file. After the search is complete, which outputs a tandem xml file and an mzid file, the parameter file is deleted.
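A sketch of the two recipes, assuming the OpenMS FileConverter tool is used for the conversion and that the template parameter file contains placeholder tokens to be substituted (the actual script may edit the default X!Tandem input file in a different way):

# convert each .mgf produced by peptide3d into .mzXML
%.mzXML : %.mgf
	${openmsDir}/FileConverter -in $< -out $@

# search each .mzXML with X!Tandem via a per-file copy of the parameter file
%.tandem.xml : %.mzXML searchDB.fasta xtandem_template.xml
	sed -e "s|SPECTRUM_FILE|$<|" -e "s|OUTPUT_FILE|$@|" -e "s|FASTA_FILE|$(word 2,$^)|" xtandem_template.xml > $*.input.xml
	${xtandem} $*.input.xml
	del $(subst /,\,$*.input.xml)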

Mayu FDR Estimation
Mayu is a tool that estimates peptide and protein false discovery rates. Mayu is actually run twice here. One time with settings defined in mayu_Rule to find the correct score to set for spectra import into spectrast. The second time with settings defined in mayu_Rule_plot making calculations to higher FDR to produce data for figures presented in the paper. Spectrast and Spectral Library Generation The integrated search results are now converted into a spectral library with spectrast. This takes pepXML files generated above, filters them to control FDR against the mayu result files and applies an iRT index.