Assessing technical and biological variation in SWATH-MS-based proteomic analysis of chronic lymphocytic leukaemia cells

Eagle, Gina L.; Herbert, John M. J.; Zhuang, Jianguo; Oates, Melanie; Khan, Umair T.; Kitteringham, Neil R.; Clarke, Kim; Park, B. Kevin; Pettitt, Andrew R.; Jenkins, Rosalind E.; Falciani, Francesco

doi:10.1038/s41598-021-82609-2

Article
Open access
Published: 03 February 2021

Gina L. Eagle¹^na1,
John M. J. Herbert²^na1,
Jianguo Zhuang¹,
Melanie Oates¹,
Umair T. Khan^1,3,
Neil R. Kitteringham⁴,
Kim Clarke²,
B. Kevin Park⁴,
Andrew R. Pettitt^1,3,
Rosalind E. Jenkins⁴ &
…
Francesco Falciani^2,5

Scientific Reports volume 11, Article number: 2932 (2021)

Abstract

Chronic lymphocytic leukaemia (CLL) exhibits variable clinical course and response to therapy, but the molecular basis of this variability remains incompletely understood. Data independent acquisition (DIA)-MS technologies, such as SWATH (Sequential Windowed Acquisition of all THeoretical fragments), provide an opportunity to study the pathophysiology of CLL at the proteome level. Here, a CLL-specific spectral library (7736 proteins) is described alongside an analysis of sample replication and data handling requirements for quantitative SWATH-MS analysis of clinical samples. The analysis was performed on 6 CLL samples, incorporating biological (IGHV mutational status), sample preparation and MS technical replicates. Quantitative information was obtained for 5169 proteins across 54 SWATH-MS acquisitions: the sources of variation and different computational approaches for batch correction were assessed. Functional enrichment analysis of proteins associated with IGHV mutational status showed significant overlap with previous studies based on gene expression profiling. Finally, an approach to perform statistical power analysis in proteomics studies was implemented. This study provides a valuable resource for researchers working on the proteomics of CLL. It also establishes a sound framework for the design of sufficiently powered clinical proteomics studies. Indeed, this study shows that it is possible to derive biologically plausible hypotheses from a relatively small dataset.

Introduction

Chronic lymphocytic leukaemia (CLL) is the most common leukaemia in adults in Western countries. It is a malignancy of CD5⁺ B lymphocytes that accumulate in the blood, bone marrow and secondary lymphoid tissues such as lymph nodes¹. CLL is a highly heterogeneous disease and is characterised by its clinical variability, particularly in relation to treatment response². This clinical variability is partially reflected by two distinct forms of the disease defined by the somatic mutational status of the immunoglobulin heavy chain variable region (IGHV) gene. Thus, patients whose CLL cells express mutated IGHV genes (M-CLL) are associated with a favourable outcome whereas those with CLL cells expressing unmutated IGHV genes (UM-CLL) are associated with early disease progression and shorter survival^3,4,5. In addition, many other factors are also thought to be associated with this clinical variability; they include distinct patterns of clonal evolution and reciprocal interactions between leukaemic cells and the tissue microenvironment resulting in the activation of pro-survival signalling pathways¹. Indeed, the B-cell receptor (BCR) signalling pathway is critically involved in the survival and proliferation of CLL cells².

Past attempts at understanding the biological basis of the heterogeneity between CLL patients have mainly focussed on genomic alterations and gene expression at the mRNA level⁶. However, despite this, the molecular basis of CLL variability remains incompletely understood. We speculate that the in-depth study of the CLL proteome could thus provide better understanding of CLL heterogeneity and its underlying biological mechanisms. There are a limited number of studies that have applied proteomic approaches to link individual protein expression to the clinical phenotype in CLL^7,8,9,10,11. However, large-scale CLL proteomic studies are still lacking¹². Mass spectrometry (MS) is the standard method of choice for measuring protein expression¹³, with shotgun MS using data dependent acquisition (DDA) being the dominant approach in cancer proteomics research to date¹⁴. However, fast, reproducible and sensitive detection and quantification of proteomes in a large number of patient samples has remained a challenge due to limitations in technology. Recently, data independent acquisition (DIA) technologies have emerged as an alternative to DDA.

SWATH (Sequential Windowed Acquisition of all THeoretical fragments)-MS is a label-free mass spectrometric technique that combines DIA with targeted data extraction on a high-resolution mass spectrometer¹⁵. SWATH-MS generates mass spectral maps of fragment ions from all detectable peptide precursors. The composite MS/MS spectra are then deconvoluted by alignment with a high quality and comprehensive tissue-specific library¹⁶, whereupon patient samples can be stratified based on the quantitative expression profile of thousands of proteins. SWATH-MS has been shown to be a highly reproducible method for large-scale protein quantification¹⁷. However, a comprehensive analysis of the sources of variation associated with large-scale sample preparation and instrument robustness is still lacking. Such analyses are needed for the optimisation of experimental and data workflows for optimal study design and, ultimately, for routine, high-throughput clinical proteomics. In addition, due to the heterogeneous nature of CLL, biological variability between patient samples has to be considered to ensure that sufficient numbers of samples are included in a SWATH-MS study for robust statistical discrimination between clinical subgroups.

In this study, SWATH-MS for the proteome-wide analysis of CLL patient samples was optimised. To achieve this, a comprehensive CLL-specific spectral library was generated. SWATH-MS data were then acquired from cryopreserved CLL samples from 6 patients at various stages of the disease, incorporating triplicate sample preparations and triplicate MS acquisitions for each sample into the experimental design. The relative contribution of the technical variability, naturally associated with sample handling and with the acquisition technology, and biological variability (IGHV mutational status) in the generation of SWATH-MS data was then assessed. A robust statistical approach to correct for technical variations was applied and analysis of the proteins found to be differentially expressed between UM-CLL and M-CLL was performed. Pathway analysis performed on these proteins supported the importance of metabolic remodelling in the biology of CLL and, remarkably, gene set enrichment analysis showed considerable overlap with previous studies based on gene expression profiling. Finally, an error model to estimate the statistical power of a SWATH-MS study was developed. This model determined the numbers of CLL samples required to detect significant changes in protein expression across the whole dynamic range of a SWATH-MS dataset.

Our study highlights the importance of assessing biological and technical variability in SWATH-MS generated protein expression data prior to undertaking large-scale clinical proteomic studies.

Results

Generation of a CLL-specific spectral library for SWATH-MS analysis

A CLL-specific spectral library to support quantitative proteomics of CLL samples by SWATH-MS has been generated. The library contains 1,586,900 spectra (< 1% FDR) and digital information for 157,285 peptides (< 1% FDR) resulting in the identification of 7736 proteins (the full list of proteins is provided in the DDA “Supplementary data”).

The library encompasses 50% of all human UniProtKB/SwissProt entries that have evidence at the protein level (Fig. 1A) and represents a broad range of Gene Ontology (GO) cellular components (PANTHER) (Fig. 1B). The library covers 98% of the CLL proteome previously reported in our iTRAQ-MS study⁷ and expands the coverage by 127% (Fig. 1C). The quality of the library and its potential as a reference for future functional studies are demonstrated by the high representation of GO molecular functions comparable with a Human gene database (21,002 entries—Reference Proteomes project at UniProt) (PANTHER) (Fig. 1D). Furthermore, analysis of the B-cell receptor (BCR) signalling pathway (MetaCore, Clarivate, PA, USA) showed that the library incorporates over 87% of the molecules involved in BCR signalling (Supplementary Fig. S1).

Identification of technical variations in SWATH-MS data

The biological and technical variability in SWATH data were investigated using cryopreserved CLL samples from 6 patients (Table 1 and Fig. 2A). To ensure biological variability in the samples used in the study, samples with IGHV mutations ranging from 0 to 14% were chosen. Quantitative information was obtained for 5179 proteins and 23,879 peptides across all samples and all replicates by SWATH-MS, which was reduced to 5108 proteins after removing redundancy and weak signals (see “Experimental Procedures”). The full list of proteins is provided in the “Supplementary data”.
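The replication scheme described above can be written out as a short sketch. The patient labels, field names and the assumption that preparation days are shared across patients are illustrative choices, not identifiers or details taken from the study.

```python
from itertools import product

# Hypothetical labels: 6 patients, 3 sample preparations per patient
# and 3 MS acquisitions per preparation, as described in the text.
patients = [f"P{i}" for i in range(1, 7)]
prep_days = [1, 2, 3]
ms_runs = [1, 2, 3]

# One record per SWATH-MS acquisition.
design = [
    {"patient": p, "prep_day": d, "ms_run": r}
    for p, d, r in product(patients, prep_days, ms_runs)
]

print(len(design))  # 6 * 3 * 3 acquisitions
```

Keeping the design as one record per acquisition makes it straightforward to group intensities by patient, preparation day or MS run in later steps.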

Table 1 Clinical features of CLL samples analysed by SWATH-MS to identify variation in proteomics data associated with biological and technical factors.

Overall reproducibility of the SWATH-MS data was initially assessed by performing a PCA. Visual inspection of the samples projected onto the first two components revealed that sample preparation was a major source of technical variation, with samples clustering based on preparation day (Fig. 2B and Supplementary Fig. S2A). To identify the number of proteins whose variations were associated with technical factors, the SWATH-MS data were subjected to ANOVA (Fig. 2C). Replicate SWATH-MS acquisitions exhibited very good reproducibility with minimal technical effects on data. Only a single protein was found to be differentially expressed by ANOVA between the replicate MS acquisitions. In contrast, the replicate sample preparations showed considerable technical variation with 593 proteins found by ANOVA to be differentially expressed between sample preparation days. The ANOVA identified 357 proteins which were differentially expressed between UM-CLL and M-CLL samples.
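To illustrate how a per-protein ANOVA separates a preparation-day effect from MS run-to-run noise, here is a minimal sketch of the one-way F statistic. It is a simplification: the study's design has several factors, the numbers are invented, and the function name `one_way_f` is hypothetical.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean([v for g in groups for v in g])
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented log-intensities for one protein in one patient.
# Grouped by preparation day: group means differ, so F is large.
by_prep_day = [[10.0, 10.1, 9.9], [11.0, 11.1, 10.9], [12.0, 12.1, 11.9]]
# The same nine values grouped by MS run: group means are similar, so F is small.
by_ms_run = [[10.0, 11.0, 12.0], [10.1, 11.1, 12.1], [9.9, 10.9, 11.9]]

print(one_way_f(by_prep_day), one_way_f(by_ms_run))
```

A protein would be flagged for a factor when its F statistic corresponds to a p-value below the chosen threshold; the p-value calculation is omitted here, and the function divides by zero if every group has zero internal variance.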

Two additional computational methods for identifying biologically relevant differences in protein expression were then tested (Fig. 2C). The first was limma. This approach, which is similar to ANOVA, identified 319 proteins associated with IGHV mutational status. The second method was partial correlation, a correlation based approach to identify proteins whose expression correlates with the percentage of IGHV mutation as a continuous variable. Results showed that 295 proteins significantly correlated with the percentage of IGHV mutation. Of these proteins, 187 (63%) had been identified by ANOVA to be differentially expressed between the two IGHV groups, M-CLL and UM-CLL (Fig. 2D).
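As an illustration of the partial correlation idea, the sketch below computes a first-order partial correlation between a protein's abundance and the percentage of IGHV mutation while controlling for one covariate. The source does not state which covariates were controlled for, so the covariate used here (a white blood cell count) and all numbers are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Invented values for six samples (assumption, not study data).
abundance = [10.2, 10.8, 11.1, 11.9, 12.4, 12.6]   # log-intensity of one protein
ighv_pct = [0.0, 0.5, 1.2, 6.0, 9.5, 14.0]         # % IGHV mutation
wbc = [120.0, 80.0, 150.0, 60.0, 90.0, 40.0]       # hypothetical covariate

print(partial_corr(abundance, ighv_pct, wbc))
```

Treating the percentage of IGHV mutation as a continuous variable avoids the binary M-CLL/UM-CLL split, which is one reason the two approaches need not select identical protein sets.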

Assessment of methods to remove batch effects

Having ascertained that the main source of variation was associated with the sample preparation batch, methodologies that correct for this bias were explored. The Bayesian method Combat was chosen to correct for the variation associated with different batches of protein preparation. This method can be used in both supervised (Combat S) and unsupervised (Combat U) modes. Combat S operates with the knowledge of both technical (sample preparation day) and biological factors (IGHV mutational status, WBC and gender) whereas Combat U is only aware of the technical sample groups. In addition, the “removeBatchEffect” function from the limma package (limma S) was tested. A limma analysis was also conducted with both machine run and preparation day included in the linear model design (linear M)¹⁸. After processing the data with the different batch correction methods, ANOVA and limma were used to assess the relative efficacy of the methods to remove technical variation while preserving biological information.
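The idea behind removing an additive batch offset can be sketched with per-batch mean-centring. This is a deliberately simplified stand-in and is neither Combat nor limma: it has no empirical Bayes shrinkage and no biological covariates, and it is only reasonable when biological groups are balanced across batches. The function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def center_batches(values, batches):
    """Remove additive batch offsets from one protein's log-intensities.

    Each batch is shifted so that its mean equals the overall mean.
    Simplified illustration only; not Combat or limma.
    """
    overall = mean(values)
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: mean(vs) for b, vs in by_batch.items()}
    return [v - batch_mean[b] + overall for v, b in zip(values, batches)]

# Invented example: two samples prepared on each of three days,
# with a day-to-day drift of +1 log unit.
values = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
batches = ["day1", "day1", "day2", "day2", "day3", "day3"]
corrected = center_batches(values, batches)
print(corrected)
```

After centring, the difference between the two samples within each day is unchanged while the drift between days is gone, which is the behaviour the batch correction methods are assessed on in this section.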

All of the batch correction methods tested effectively removed variation associated with sample preparation day whilst retaining a comparable number of differentially expressed proteins associated with IGHV mutation status (Fig. 3A). The results were consistent with PCA, which showed that the data now clustered based on patient samples and IGHV mutational status (Fig. 3B).

Both supervised and unsupervised Combat batch correction methods resulted in an increase in the number of proteins found to be significantly differentially expressed between UM-CLL and M-CLL samples, with an additional 26 and 38 proteins identified in the Combat U data and the Combat S data, respectively. Limma S correction showed no difference in the number of differentially expressed proteins identified, whilst the linear M method resulted in an additional 36 significant proteins. Crucially, 100%, 99%, 100% and 95% of the proteins found to be significant to IGHV mutational status in the uncorrected data were retained after Combat S, Combat U, limma S and linear M corrections, respectively (Fig. 3C). Two-hundred and forty three proteins significant to IGHV mutational status were common across all four batch corrected datasets (Fig. 3D). Similar results were observed when analysing the number of proteins found to be significantly differentially expressed between low and high WBC subgroups (Supplementary Fig. S2B). On average, 97% of differentially expressed proteins were retained after batch correction (Supplementary Fig. S2C) and 229 were common across all four batch corrected datasets (Supplementary Fig. S2D).
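The retention and overlap figures quoted above are set comparisons between lists of significant proteins. A minimal sketch, using hypothetical protein identifiers and method names rather than study data:

```python
def retention(before, after):
    """Fraction of proteins significant before correction that remain significant after."""
    before, after = set(before), set(after)
    return len(before & after) / len(before)

# Hypothetical identifiers for illustration only.
uncorrected = {"PROT1", "PROT2", "PROT3", "PROT4"}
corrected_sets = {
    "method_a": {"PROT1", "PROT2", "PROT3", "PROT4", "PROT5"},
    "method_b": {"PROT1", "PROT2", "PROT3", "PROT6"},
}

# Per-method retention of the uncorrected signature.
retained = {name: retention(uncorrected, s) for name, s in corrected_sets.items()}
# Proteins significant under every correction method.
common = set.intersection(*corrected_sets.values())

print(retained, sorted(common))
```

A method can both retain the whole uncorrected signature and add proteins, as `method_a` does here, which mirrors the pattern reported for the corrected datasets.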

By far, the largest correction effect was observed on the data analysed by partial correlation, which resulted in an additional 133 proteins significantly associated with the percentage of IGHV mutation after Combat S batch correction (Fig. 3C). Ninety-eight percent of the proteins found to be significant in the uncorrected data were retained after batch correction (Fig. 3C). An overlap of 62% (n = 266) was observed between proteins found to be significant to the percentage of IGHV mutation and proteins found to be differentially expressed between M-CLL and UM-CLL samples in the Combat S corrected data (Fig. 3E).

Analysis of the proteomics IGHV mutational signature identifies functional pathways and upstream regulators in CLL

To determine the biological significance of proteins found to be differentially expressed between UM-CLL and M-CLL after batch correction, 395 proteins identified by ANOVA after Combat S batch correction were subjected to functional enrichment analysis using a combination of the web-based tool DAVID and Ingenuity Pathway Analysis (IPA). First, DAVID was used to determine whether the list of differentially expressed proteins was enriched in biological pathways. Results showed that significantly enriched pathways included several metabolic functions (Glycolysis, Carbon and Pyruvate metabolism, Glutathione metabolism), adhesion (Cell–cell adherence function), splicing and, importantly, B cell receptor and Toll-like receptor signalling (Fig. 4A). The IPA software application was then used to infer which functions may be activated or repressed in UM-CLL compared to M-CLL samples. IPA is able to infer a functional response by comparing the observed change in protein expression with prior knowledge of expected effects between regulatory and effector genes stored in the Ingenuity Knowledge database. This approach was applied to identify which biological functions were likely to be activated or repressed, as well as to highlight proteins not detected by the SWATH-MS analysis, which may be responsible for driving the observed differences in the proteomic profile.

The analysis of biological pathways (Fig. 4B) correctly identified the samples as a haematological malignancy and predicted an increase in proliferation and survival and a decrease in apoptosis in UM-CLL cells, an observation consistent with IGHV mutational status¹⁹. The analysis also predicted an inhibition of phagocytosis which is consistent with a recent observation²⁰. The upstream driver analysis inferred changes in the activity of the transcription factors SQSTM1 and GRHL2 (Fig. 4C), the kinases MAPK3 and PPP1CC (Fig. 4D), and the G protein coupled receptor PROKR2 and the E2F Tfdp1 complex (Fig. 4E). These results provide a few intriguing hypotheses on the biology of UM-CLL cells.

Proteomic and transcriptomic signatures linked to IGHV mutational status significantly overlap

Having shown that proteins linked to IGHV mutational status were present in pathways and biological processes of interest, the correspondence between mRNA expression data and SWATH-MS proteomics data was examined. A transcriptional signature linked to IGHV mutational status was first defined by using one of the largest publicly available datasets of mRNA expression profiles for CLL (GEO database, accession number GSE28654)²¹. In total, 3008 genes were found to be differentially expressed at the mRNA level between UM-CLL and M-CLL subgroups (≤ 10% FDR). The mRNA signature was then compared to proteins found to be significant to IGHV mutation in the batch corrected SWATH-MS data (FDR ≤ 10%) using GSEA.
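A 10% FDR threshold of the kind used above is commonly applied through Benjamini-Hochberg adjusted p-values. The source does not name the FDR procedure, so the following is a sketch under that assumption, with invented p-values and a hypothetical function name.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values for four features.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = bh_adjust(pvals)
# Features passing a 10% FDR threshold.
significant = [q <= 0.10 for q in qvals]
print(qvals, significant)
```

Applying the same threshold to the transcript and protein lists keeps the two signatures comparable before they are passed to the enrichment step.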

Enrichments of all protein sets to the mRNA signature were significant at 0% FDR, with 116, 111, 114 and 118 core genes from the Combat S, Combat U, limma S and linear M corrected SWATH-MS protein data, respectively, overlapping with the mRNA signature (GSEA 0% FDR, Table 2). Interestingly, the protein gene set defined by partial correlation to the percentage of IGHV mutation had the largest core gene overlap with the transcriptional signature, with 139 core genes identified (GSEA 0% FDR, Table 2).

Table 2 Normalised enrichment scores and numbers of core genes for the Gene Set Enrichment Analysis (GSEA) of SWATH-MS proteomics data and mRNA expression data.

Statistical power analysis

A model was built and used to assess the statistical power of CLL SWATH-MS based studies to determine sample sizes suitable for detecting significant changes in protein expression levels between clinical subgroups. Unsurprisingly, the coefficient of variation (%) was dependent upon protein mean abundance, with higher coefficient of variation (%) seen in proteins expressed at low abundances (Fig. 5A).
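The dependence of the coefficient of variation on abundance can be illustrated with a short calculation. The intensities below are invented, and the sketch assumes the CV is computed on the linear intensity scale, which the source does not specify.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation (%) = 100 * sample standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)

# Invented replicate intensities for two proteins (assumption, not study data).
low_abundance = [90, 100, 110]          # noise is large relative to the signal
high_abundance = [9900, 10000, 10100]   # noise is small relative to the signal

print(cv_percent(low_abundance), cv_percent(high_abundance))
```

Because the relative noise is larger at low abundance, the same fold change is harder to detect for low-abundance proteins, which is what motivates an abundance-dependent power model.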

The relationship between protein abundance and statistical power for a given number of patient samples in each clinical subgroup (i.e. IGHV mutational status) was plotted (Fig. 5B). Visual inspection of this plot shows that even with a small sample size, good statistical power can be achieved across a considerable proportion of the signal range, at least with this dataset. As an example, the percentage of proteins that can be analysed at an estimated 90% statistical power as a function of the sample numbers was plotted (Fig. 5C). Table 3 shows the percentage of proteins (out of n = 5108 proteins) which would meet the statistical criteria with a given number of patient samples per clinical subgroup if statistical powers of 95%, 75% or 50% are used.

Table 3 Required number of patient samples per clinical subgroup and the percentage of proteins in the SWATH-MS dataset which would meet the statistical criteria at a statistical power of 95%, 75% or 50%.

Discussion

Studies of the proteome are essential if the complexity of disease heterogeneity is to be fully understood, and predictive biomarkers of disease progression and treatment response are to be established. Recently developed DIA-MS methods such as SWATH-MS provide an opportunity to do this, but these are only valid if the variability of the data is recognised and accounted for. Sources of technical variability are numerous and although some procedures can be automated, it is not always possible to remove all aspects of variability in sample preparation. In addition, heterogeneity in diseases such as CLL may be manifested at the protein level, therefore the sample numbers required for SWATH-MS studies must be determined by statistical means in order to reliably detect genuine differential protein expression between clinical subgroups.

Here, a CLL-specific spectral library from CLL patient samples and normal B-cells has been generated. Including normal B-cells in the library not only allows for comparative studies of malignant and normal B-cells in the future, but also captures any differences in the B-cell proteome from very early stages of the disease. The complex mixture of peptides was subjected to extensive fractionation to build the comprehensive library, which captured 50% of all human proteins which have evidence of expression. This is a significant observation given that only a very small proportion of human tissue, i.e. PBMCs, was included. In total, 7736 proteins are represented in the CLL-specific library. In contrast with the CLL proteomics study using iTRAQ-MS published previously by this group⁷, SWATH-MS is a label-free method that allows hundreds of CLL samples to be screened over the course of months or even years. The library provides a permanent reference source that can be readily used by the CLL research community. It is worth mentioning that different peptides are enriched by different sample preparation methods. Therefore, if libraries for SWATH-MS are to be shared, the same sample preparation method used to generate the library should also be used to generate the SWATH maps. To illustrate this point, we have previously prepared CLL samples using hydrophilic interaction liquid chromatography (HILIC) and compared them to samples prepared with cation exchange chromatography (CEX). SWATH maps from samples prepared by HILIC aligned poorly to the library generated from peptides prepared by CEX. Although 94% of the proteins identified (by DDA) were represented within the CLL spectral library, only 14.5% of peptides were shared between HILIC and CEX prepared samples (data not shown).

SWATH-MS data were acquired from CLL patient samples incorporating triplicate sample preparations from cryopreserved cells and triplicate MS acquisitions. In doing so, effects from cell sample thawing on different dates, batches of chemicals and buffers, changes in analytical columns and maintenance of instruments have been taken into consideration, all of which could contribute to variation in the data. SWATH files were aligned to the CLL spectral library using endogenous CLL peptides in order to calibrate for retention time. Endogenous peptides have been shown to exhibit lower absolute error when compared to spiked-in reference peptides for human lysates²². Using this approach, 5179 proteins were quantified across all 54 SWATH maps. In addition to DIA, DDA was also performed on each of the sample preparation replicates. Overlap of DDA and SWATH-MS data was high, with 92% of proteins identified by DDA (< 1% FDR) also identified by SWATH-MS. However, an additional 3152 proteins (156%) were identified by SWATH-MS. Furthermore, on average, only 56% of proteins identified by DDA were common across all three sample preparation replicates, highlighting the problems faced with traditional DDA methods with regards to incomplete datasets (Supplementary Table S1).

Reproducibility between replicate SWATH-MS runs was extremely good (Supplementary Fig. S2A). Although protein variance is reduced during inference of protein abundance²³, the run-to-run variability in this case is negligible compared to other contributing factors. In contrast, sample preparation replicates showed substantial variability in the data. PCA of uncorrected data showed that samples clustered based on sample preparation day. Four different batch correction methods that are more commonly used for microarray analysis were successfully applied to correct the proteomics data. All methods tested retained the majority of differentially expressed proteins between clinical variables which were identified in the uncorrected data, whilst successfully removing technical variability associated with day of sample preparation. Importantly, PCA of all batch-corrected data showed that data were clustered based on individual patient samples and also according to IGHV mutation status. The number of differentially expressed proteins significant to IGHV mutation increased by 7%, 11%, 11% and 45% after Combat U, Combat S, linear M and Combat S of data analysed by partial correlation, respectively. Comparable percentage increases were also seen for low versus high WBC CLL subgroups, except for linear M analysis in which there was a 2% decrease in the number of differentially expressed proteins identified.

This was a small study designed to assess the best methods to correct for variability in sample processing. Two different approaches were used to assess the success of the approach in terms of revealing expected enrichment of biological functions associated with unmutated IGHV status. Firstly, differentially expressed proteins were functionally annotated and subjected to pathway analysis (IPA). Metabolic functions dominated the results, with 40% of all differentially expressed proteins being associated with metabolic processes and over a third of pathways related to metabolism. These results suggest considerable differences in metabolic activity between UM-CLL and M-CLL cells. Interestingly, proteins found to be differentially expressed between low and high WBC subgroups in the Combat S corrected data showed a similar metabolic functional signature, with 14/38 of the pathways enriched associated with metabolism (Supplementary Fig. S2E). Unlike normal B-cells, CLL cells are known to store lipids and utilise free fatty acids to produce chemical energy^24,25,26. Indeed, increased mitochondrial respiration has been associated with poor prognostic indicators such as UM-CLL and advanced clinical stage (based on higher WBC)²⁷. Furthermore, high and low metabolic states have been shown to be representative of CLL disease stage²⁸. Taken together, this suggests that metabolic adaption is indeed an important factor in the biology and prognosis of CLL. In addition to metabolism, proteins involved in KEGG pathways such as the regulation of actin cytoskeleton and the cell adhesion molecules were also found to be differentially expressed between UM-CLL and M-CLL. These results correlate with our previous CLL iTRAQ-MS study, in which significant differences in cytoskeletal remodelling, cell migration and adhesion pathways were observed between M-CLL and UM-CLL cells⁷.

Pathway analysis provided a few intriguing hypotheses on the biology of UM-CLL cells. It revealed activation of biological functions associated with increased cancer cell survival and repression of those associated with the immune clearance of cancer cells in the UM-CLL samples, in line with previous studies^19,20. IPA also predicted that the transcription factor SQSTM1 (p62) will be activated in UM-CLL samples (Fig. 4C), promoting nuclear accumulation of NFE2L2/NRF2 and subsequent expression of cytoprotective genes^29,30. Highly active p62 cells may therefore be more resistant to ROS inducing therapeutics³¹. In addition, IPA predicted an overactivation of the G protein coupled receptor PROKR2, a receptor for prokineticins. These belong to a family of highly conserved small peptides that control a wide range of physiological and pathological functions and which have been implicated in several forms of cancer³². Also, prokineticins are expressed at high levels in the bone marrow by monocytic/granulocytic lineage cells³³. These findings suggest that prokineticins may be relevant in CLL and possibly linked to mutational status. In summary, the functional analysis of protein differences between M-CLL and UM-CLL following stringent batch correction suggests that genuine differences in biology have been captured. It also shows that, despite the relatively small number of samples examined, SWATH-MS analysis has the potential to provide important biological insights.

In the second approach used to validate the SWATH-MS data, GSEA was used to compare a publicly available CLL mRNA expression dataset with the SWATH-MS protein expression datasets. The analysis revealed significant overlaps between differentially expressed transcripts and proteins. Results were consistent across all of the batch correction methods tested, with over 100 core genes identified in each of the five corrected protein expression datasets.

Statistical power analysis was performed using the SWATH-MS data to determine sample sizes suitable for detecting significant changes between clinical subgroups at the protein level. Statistical power analysis with proteomics data is more complex than with traditional data, since the variability between measurements is a function of signal intensity, and statistical power varies between groups of proteins at different levels of expression. Therefore, a model was built and used to assess the statistical power of a CLL SWATH-MS study. Results showed that the number of patients in a clinical subgroup and the protein abundance can affect statistical power. Therefore, to detect significant differences in those proteins expressed at lower levels, larger numbers of clinical samples are required. For example, 100 samples per clinical subgroup would be required to detect significant changes across all proteins in the dataset (n = 5108/5108) with 95% statistical power, whereas 20 samples per group would detect significant changes across 57% of the proteins in the dataset (n = 2912/5108), which would likely be those proteins expressed at higher levels.
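The link between variability, group size and power can be illustrated with a textbook normal approximation for a two-group comparison. This is not the error model built in the study; the effect size, standard deviation, significance level and function names are all assumptions made for the sketch.

```python
from math import sqrt
from statistics import NormalDist

def two_group_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    delta: true difference in mean (log) abundance between groups.
    sd: per-sample standard deviation, assumed equal in both groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / (sd * sqrt(2 / n_per_group))
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

def min_samples(delta, sd, target_power, alpha=0.05, n_max=1000):
    """Smallest group size reaching the target power, or None if n_max is too small."""
    for n in range(2, n_max + 1):
        if two_group_power(delta, sd, n, alpha) >= target_power:
            return n
    return None

# Hypothetical proteins: same true difference, different variability.
print(min_samples(delta=1.0, sd=1.0, target_power=0.8))   # less variable protein
print(min_samples(delta=1.0, sd=2.0, target_power=0.8))   # more variable protein
```

Since low-abundance proteins show higher variability, feeding an abundance-dependent `sd` into a calculation of this kind reproduces the qualitative pattern described above: more samples per subgroup are needed as abundance falls.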

This study provides an exhaustive library of CLL proteins, a valuable resource for the research community. It also highlights the critical importance of assessing biological and technical variation in MS data prior to undertaking large-scale, long term proteomic studies of clinical samples. In the case of CLL samples, where the cells have been aliquoted and cryopreserved, we would recommend a minimum of two preparations per patient sample for SWATH-MS. Batch correction methods can then be used to remove technical variability in the data. However, a single SWATH-MS data acquisition for each sample replicate is sufficient. Statistical power analysis has shown that the heterogeneous nature of CLL is manifested, at least in part, at the protein level, making the selection of an adequate number of samples to be included in each clinical subgroup vital for the reliable interpretation of disease-relevant proteomics results. Further work is however needed to fully validate the general applicability of our analytical approach.

Experimental procedures

Study design and CLL sample preparation

All samples used for this study were obtained with informed consent and with the approval of the North West 2 Research Ethics Committee–Liverpool Central and stored in the Liverpool Bio-Innovation Hub Biobank (LBIH). All methods were performed in accordance with the relevant guidelines and regulations. Venous blood was drawn from CLL patients into tubes containing sodium heparin at a final concentration of 10 units/1 ml of blood. Mononuclear cells were isolated by centrifugation of blood over Lymphoprep (Axis-Shield PoC AS, Oslo, Norway) within 4 h of sampling and stored at −150 °C within 2 h of separation. Analysis for recurrent chromosomal abnormalities and IGHV gene mutational analysis was performed as described previously^7,34.

Cryopreserved peripheral blood mononuclear cells (PBMCs) were thawed at 37 °C, diluted slowly in RPMI 1640 and rested for one hour at 37 °C with 5% CO₂ to recover after thawing. Cell viability after resting was > 70% for all the cases used in this study, with the exception of three cases which were > 60% (Tables 1 and 4). After washing in ice-cold phosphate-buffered saline (PBS), 2 × 10⁷ cells were lysed by sonication on ice in 50 µL of 7 M urea, 2 M thiourea, 40 mM tris (pH 7.5), 4% CHAPS buffer. Protein concentrations were determined using the 2-D Quant Kit (GE Healthcare, UK). Protein was reduced with 5 mM dithiothreitol (DTT) at 37 °C and alkylated with 0.15 M iodoacetamide (IAA), before diluting with 50 mM ammonium bicarbonate followed by overnight digestion with trypsin (Promega). Peptides were then diluted to 5 mL with 10 mM potassium dihydrogen phosphate/25% acetonitrile (ACN) and acidified to < pH 3 with phosphoric acid prior to cation exchange chromatography.

Table 4. Clinical features of CLL samples analysed by data dependent acquisition (DDA) to generate a CLL-specific spectral library for mapping data acquired by SWATH-MS.

Data dependent acquisition (DDA) for generation of a CLL-specific spectral library

Cryopreserved PBMCs from 14 CLL patients at different stages of the disease were used to generate a CLL-specific spectral library (Table 4). Normal B-cells were also included, after purification by negative selection using a B-cell isolation kit (Miltenyi Biotec, Bisley, UK) from buffy coats obtained from the National Blood Service (Liverpool, UK). Cells were lysed and 100 μg of protein from each sample was used to create a representative pool (total 1500 μg) which was prepared as described above. Peptides were fractionated on a polysulfoethyl A strong cation-exchange column (200 × 4.6 mm, 5 μm, 300 Å; Poly LC, Columbia, MD) at 1 mL/min using a gradient from 10 mM potassium dihydrogen phosphate/25% ACN (w/v) to 0.5 M potassium chloride/10 mM potassium dihydrogen phosphate/25% ACN (w/w/v) in 75 min. Fractions of 2 mL were collected and were dried by centrifugation under vacuum (SpeedVac, Eppendorf UK Ltd, Stevenage, UK). Fractions were reconstituted in 1 mL of 0.1% trifluoroacetic acid and were desalted using an mRP Hi Recovery protein column 4.6 × 50 mm (Agilent, Berkshire UK) on an Agilent 1200 HPLC system (Agilent)⁷.

Forty desalted fractions were each reconstituted in 0.1% formic acid and 0.5–1 μg of sample was loaded on-column. Peptides were separated by in-line reversed phase chromatography using a nanoACQUITY UPLC Symmetry C18 Trap Column and an ACQUITY UPLC Peptide BEH C18 nanoACQUITY Column (Waters, UK). Peptides were eluted using a gradient of 2–50% ACN/0.1% formic acid (v/v) over 120 min at a flow rate of 300 nL/min. DDA was performed on a TripleTOF 6600 (SCIEX) in positive ion mode using 25 MS/MS per cycle (2.8 s cycle time) and 30 MS/MS per cycle (1.8 s cycle time) to maximise both spectral quality and coverage, and the combined data were searched with ProteinPilot 5.0 (SCIEX) using the Paragon algorithm (SCIEX). The data were searched against the SwissProt database (Nov 2015, 20,193 human entries) with carbamidomethyl as a fixed modification of cysteine residues and biological modifications allowed. Mass tolerance for precursor and fragment ions was 10 ppm. In order to reduce false positives, a false discovery rate (FDR) of 1% was applied using the reversed database as decoy. This resulted in 7736 proteins being included in the CLL library (PRIDE identifier PXD011330)³⁵. The corresponding protein, peptide and spectral confidence scores are listed in the DDA “Supplementary data”. In order to align SWATH data with the CLL library, only proteotypic peptides with no modifications were required. To this end, a ‘rapid’ search of the data was performed using ProteinPilot. This resulted in the identification of 7386 proteins at 1% FDR.
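The decoy-based FDR filter described above can be illustrated with a minimal Python sketch. It is not the ProteinPilot or Paragon implementation: the function name, the (score, is_decoy) input format and the simple decoys/targets estimate of FDR are assumptions made for illustration only.

```python
def score_threshold_at_fdr(hits, max_fdr=0.01):
    """Return the lowest target score whose estimated FDR is within max_fdr.

    hits: iterable of (score, is_decoy) pairs; a higher score is a better match.
    The FDR at a cut-off is estimated as (#decoy hits) / (#target hits)
    among all hits scoring at or above that cut-off (an assumed estimator).
    Returns None when no cut-off satisfies the limit.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        # Only target hits can define the accepted cut-off.
        if decoys / targets <= max_fdr:
            best = score
    return best


# Toy data (hypothetical): 100 strong targets, 2 decoys, then 50 weaker targets.
toy_hits = (
    [(1000 - i, False) for i in range(100)]
    + [(900, True), (899, True)]
    + [(898 - i, False) for i in range(50)]
)
threshold = score_threshold_at_fdr(toy_hits, max_fdr=0.01)
```

With the toy data, every target scoring below the two decoys has an estimated FDR above 1%, so the accepted cut-off stays at the last target ranked above them.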

Proteins represented in the library were functionally classified using the PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system (http://pantherdb.org, v12.0)^36,37 and the GeneGo BCR pathway map in the MetaCore database (Version 6.14 build 61,508; Clarivate, PA, USA) was used to assess molecular coverage within this pathway.

Data independent acquisition (DIA) (SWATH-MS)

Cryopreserved PBMCs from 6 CLL patients (not used for generating the CLL-specific spectral library) were thawed, lysed and 200 μg of protein from each sample was prepared as described above. Individual digests from samples were loaded onto a prepacked ion exchange column (Bio-Scale Mini Macro-Prep High S, BIO-RAD, UK) in 10 mM potassium dihydrogen phosphate/25% ACN (w/v) and eluted in 0.15 M potassium chloride/10 mM potassium dihydrogen phosphate/25% ACN (w/w/v). Four fractions were collected and dried by centrifugation under vacuum. Fractions were reconstituted in 1 mL of 0.1% trifluoroacetic acid and desalted using an mRP Hi Recovery protein column 4.6 × 50 mm (Agilent) on a 1260 Infinity LC system (Agilent).

Fractions were each reconstituted in 0.1% formic acid and pooled in a total volume of 20 μL. Samples were diluted 1:10 and 5 μL aliquots were delivered into a TripleTOF 6600 mass spectrometer (SCIEX) as described above. SWATH acquisitions were performed using 100 SWATH windows of variable effective isolation width to cover a mass range of 350–1250 m/z (Supplementary Table S2).

Spectra were aligned using SWATH 2.0 in the PeakView v2.2 software (SCIEX) against the CLL-specific spectral library (generated from the search result allowing no modifications; 7386 protein entries). Thirteen endogenous peptides were used for retention time calibration (Supplementary Table S3). Data were processed in PeakView using an XIC extraction window of 8 min and XIC width of 75 ppm. Peak areas from peptides with > 99% confidence and < 1% global false discovery rate were extracted using MarkerView v1.2.1 (SCIEX).

Experimental design and statistical rationale

This study aimed at assessing the relative contribution of technical and biological factors to the variability observed in a SWATH-MS experiment. The experimental design therefore reflects the need for a suitable compromise between assessing the variability of measurements and constraining the experiment within a reasonable size. Replicate PBMC aliquots from 6 individual CLL patients (Table 1) were thawed, lysed and prepared for SWATH-MS (as described above) on three separate days over a period of 3 months. Patient samples were chosen based on IGHV mutational status, with 3 UM-CLL and 3 M-CLL samples included in the experiment. Triplicate SWATH-MS acquisitions were performed on each replicate sample preparation over a period of one month, incorporating changes in columns and traps and maintenance on the LC and MS systems. In total, 54 SWATH acquisitions were performed.
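The replicate structure described above (6 patients, 3 preparation days, 3 acquisitions per preparation) can be enumerated in a few lines. The following Python sketch is illustrative only; the patient labels are hypothetical placeholders, not the identifiers used in Table 1.

```python
from itertools import product

# Hypothetical labels: three IGHV-unmutated and three IGHV-mutated cases.
patients = ["UM-CLL-1", "UM-CLL-2", "UM-CLL-3", "M-CLL-1", "M-CLL-2", "M-CLL-3"]
prep_days = [1, 2, 3]      # replicate sample preparations
acquisitions = [1, 2, 3]   # SWATH-MS acquisitions per preparation

design = [
    {
        "patient": patient,
        "ighv": patient.rsplit("-", 1)[0],  # "UM-CLL" or "M-CLL"
        "prep_day": day,
        "acquisition": acq,
    }
    for patient, day, acq in product(patients, prep_days, acquisitions)
]

print(len(design))  # 54 acquisitions in total
```

A table of this kind is the natural input for the later steps, because preparation day and acquisition become explicit technical factors alongside the biological ones.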

Assessing technical and biological variability

SWATH-MS protein expression data was normalised using the total area sums (sum of all peak areas used to compute the scaling factor) normalisation strategy in MarkerView and transformed on a log2 scale. A total of 5108 non-redundant SWATH-MS proteins were defined after converting protein accessions to gene symbols and then removing proteins with a low background signal, imputing missing values with random forest³⁸ and collapsing multiple protein accessions to one gene symbol (different translational products reduced to one gene). All 54 samples were treated as biological replicates. An exploratory analysis of the full dataset by principal component analysis (PCA) was then performed using the Partek Genomics Suite (version 7.0).
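The normalisation and transformation step can be sketched as follows. This is a minimal Python illustration, not the MarkerView procedure: the function name is hypothetical, and rescaling every acquisition to the mean total area is an assumption, since the text does not state the common scale used. All peak areas are assumed to be positive.

```python
import math


def total_area_normalise_log2(samples):
    """Scale each sample to a common total peak area, then take log2.

    samples: dict mapping sample name -> dict of protein -> peak area (> 0).
    The common scale is the mean of the per-sample totals (an assumption).
    """
    totals = {name: sum(areas.values()) for name, areas in samples.items()}
    target = sum(totals.values()) / len(totals)
    return {
        name: {
            protein: math.log2(area * target / totals[name])
            for protein, area in areas.items()
        }
        for name, areas in samples.items()
    }


# Toy example: the second acquisition has twice the signal of the first.
toy = {
    "acq1": {"P1": 100.0, "P2": 300.0},
    "acq2": {"P1": 200.0, "P2": 600.0},
}
normalised = total_area_normalise_log2(toy)
```

In the toy example the two acquisitions differ only by a global factor of two, so after scaling they yield identical log2 values for each protein.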

ANOVA (as implemented in the statistical environment R³⁹) was used to assess technical and biological variability. Biological factors included in the model were IGHV mutational status (a cut-off value of 2% was applied to distinguish M-CLL from UM-CLL counterparts^3,4), white blood cell count (WBC) (two patients with WBC > 300 × 10⁹/L were categorised as having high WBC and four patients with < 100 × 10⁹/L as low WBC) and patient gender. Technical factors included in the model were the sample preparation day and SWATH-MS acquisition. The p values generated by ANOVA were corrected using the Benjamini and Hochberg method to control for multiple testing⁴⁰. Proteins with a Benjamini and Hochberg-adjusted ANOVA FDR of ≤ 10% were identified as significantly differentially expressed for each variable. Venn diagrams were created using Venny 2.1⁴¹.
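The Benjamini and Hochberg adjustment applied to the ANOVA p values can be written compactly. The Python sketch below is a generic implementation of the step-up procedure for illustration; the function name and the toy p values are assumptions, and the study itself used R.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never decrease with rank (the step-up monotonicity rule).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for four hypothetical proteins.
toy_p = [0.01, 0.04, 0.03, 0.005]
toy_q = benjamini_hochberg(toy_p)
significant = [q <= 0.10 for q in toy_q]  # 10% FDR threshold, as in the text
```

For the toy input the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02, so all four hypothetical proteins pass the 10% threshold.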

Batch correction was performed using the Bayesian method Combat⁴² and linear models for microarray analysis (limma)⁴³. These methods were used as implemented in the Bioconductor packages sva (v3.24.0) and limma (v3.32.2), respectively. Combat was run in both an unsupervised manner (no knowledge of IGHV mutational status, WBC or gender; Combat U) and a supervised manner (knowledge of IGHV mutational status, WBC and gender; Combat S). The batch correction function “removeBatchEffect” from limma was run in a supervised manner (limma S). In addition, a separate approach was used in which batch information was incorporated into the linear model design (linear M)¹⁸. This approach also included the duplicate correlation function from limma, in which samples were blocked on the technical replicates⁴². Finally, an alternative approach to identify proteomics signatures associated with IGHV mutation was tested. Instead of subdividing patients into two groups (UM-CLL and M-CLL), the percentage of IGHV mutation was used as a continuous variable and partial correlation on Combat S corrected data was used to identify significant proteins, with WBC and gender as confounding variables. This analysis was performed using the R package ppcor (partial and semi-partial (part) correlation; P. corr, v1.1)⁴⁴. PCA was performed using the Partek Genomics Suite v7.0 to assess variance across the batch corrected sample sets. Proteins with an FDR of ≤ 10% were identified as significant.
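The partial correlation step at the end of the paragraph above can be illustrated with the standard recursive formula, which removes one confounder at a time. The Python sketch below is not the ppcor code; the function names and toy vectors are assumptions, and no significance test is included.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, confounders):
    """Correlation of x and y after adjusting for each confounder.

    Uses the recursion
        r_xy.Z = (r_xy.Z' - r_xz.Z' * r_yz.Z') /
                 sqrt((1 - r_xz.Z'^2) * (1 - r_yz.Z'^2))
    where Z' is Z without the confounder z.
    """
    if not confounders:
        return pearson(x, y)
    z, rest = confounders[0], confounders[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy vectors: x and y are related only through the shared confounder z.
z = [1.0, 1.0, -1.0, -1.0]
x = [2.0, 0.0, 0.0, -2.0]   # z plus noise orthogonal to z
y = [2.0, 0.0, -2.0, 0.0]   # z plus different noise orthogonal to z
raw = pearson(x, y)
adjusted = partial_corr(x, y, [z])
```

In the toy example the raw correlation of 0.5 disappears once z is accounted for, which is the behaviour wanted when WBC and gender are treated as confounders of the association between protein abundance and percentage IGHV mutation.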

Functional enrichment analysis

Proteins found to be differentially expressed between UM-CLL and M-CLL in the Combat S corrected data (< 10% FDR by ANOVA) were selected for computational functional analysis. Proteins were functionally classified by Gene Ontology Biological Process (GOBP) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways using the Database for Annotation, Visualization and Integrated Discovery (DAVID) (v6.8)^45,46. In addition, pathway activity prediction and upstream regulator analysis were performed on the same list of genes using the Ingenuity Pathway Analysis (IPA, v8.5) software (Qiagen).

Comparison of differential gene expression in UM-CLL vs. M-CLL subgroups between Protein (SWATH) and mRNA data

A publicly available CLL mRNA dataset acquired from 89 patients with known gender and IGHV mutational status (accession GSE28654)²¹ was used to compare protein and mRNA expression. The Affymetrix data was pre-processed by first selecting probe-sets called present in ≥ 14 samples per IGHV mutational subgroup (MAS5⁴⁷), followed by robust multiarray averaging (RMA) normalisation and finally selecting the most reliable gene probe-sets with the JetSet algorithm⁴⁸, resulting in a final set of 10,953 genes.

Upon PCA of the mRNA data, it was observed that the microarray scan date was a source of technical variation, as clusters of samples based on scan date could be seen (Supplementary Fig. S3A). To remove errors associated with technical variations between scan date batches, data were processed using Combat S batch correction of a 30 sample subset, balanced for IGHV mutational status across scan dates, followed by a limma analysis without scan dates in the model¹⁸. PCA of the processed mRNA expression data showed IGHV mutational status subgroups separated on the first principal component (Supplementary Fig. S3B).
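Combat's empirical Bayes model is too involved to show briefly. As a much simpler stand-in, the Python sketch below removes an additive shift per batch by mean-centring one feature at a time. This is only reasonable when the subgroups are balanced across batches, as in the 30-sample subset described above. It is not Combat, and the function name and toy values are assumptions.

```python
def centre_batches(values, batches):
    """Remove additive batch shifts from the measurements of one feature.

    values: log-scale measurements for one gene or protein.
    batches: batch label (for example scan date) for each measurement.
    Each value has its batch mean subtracted and the grand mean added back.
    """
    grand_mean = sum(values) / len(values)
    sums, counts = {}, {}
    for value, batch in zip(values, batches):
        sums[batch] = sums.get(batch, 0.0) + value
        counts[batch] = counts.get(batch, 0) + 1
    batch_mean = {batch: sums[batch] / counts[batch] for batch in sums}
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


# Toy feature: subgroup difference of 1 unit, batch "B" shifted up by 2 units.
toy_values = [11.0, 10.0, 13.0, 12.0]      # UM, M, UM, M
toy_batches = ["A", "A", "B", "B"]
corrected = centre_batches(toy_values, toy_batches)
```

In the toy example the shift between batches is removed while the one-unit difference between subgroups is retained. With an unbalanced design, simple centring would also remove part of the biological signal, which is one reason model-based methods are used in practice.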

An mRNA ranked t-statistic gene expression signature was defined and gene set enrichment analysis (GSEA)^49,50 was used to compare the mRNA expression signature representing genes expressed at higher or lower levels in UM-CLL (≤ 10% FDR) to the corrected SWATH-MS data.
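The running-sum statistic at the core of GSEA can be sketched in a few lines. The Python code below computes only a weighted enrichment score for one gene set against one ranked list; the permutation-based significance and normalisation steps of GSEA are omitted, and the function name and toy data are assumptions.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted running-sum enrichment score for one gene set.

    ranked: list of (gene, statistic) pairs sorted from the most positive
            to the most negative statistic.
    gene_set: set of gene symbols forming the signature.
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    n_hits = sum(1 for gene, _ in ranked if gene in gene_set)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover part, but not all, of the list")
    hit_total = sum(abs(stat) ** weight for gene, stat in ranked if gene in gene_set)
    if hit_total == 0:
        raise ValueError("all statistics for the gene set are zero")
    running = best = 0.0
    for gene, stat in ranked:
        if gene in gene_set:
            running += abs(stat) ** weight / hit_total   # step up on a hit
        else:
            running -= 1.0 / n_miss                      # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list of six hypothetical genes.
toy_ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
es_top = enrichment_score(toy_ranked, {"A", "B"})
es_bottom = enrichment_score(toy_ranked, {"E", "F"})
```

A signature concentrated at the top of the ranking gives a score near +1 and one concentrated at the bottom a score near −1, which corresponds to genes expressed at higher or lower levels in UM-CLL.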

Statistical power and sample size calculations

One of the objectives of this study was to use the experimental data on the six individual CLL patients, stratified by their IGHV mutational status, to estimate the statistical power associated with a given experimental design. A strategy for estimating statistical power needs to consider that experimental variability is a function of signal intensity and that it is higher for proteins expressed at low levels. Therefore, the coefficient of variation of the available biological replicates was modelled as a function of signal intensity based on the replicate SWATH-MS data. Loess, a non-parametric regression method, was used to fit the coefficient of variation against the mean protein abundance. Using this model, the statistical power as a function of signal intensity was computed for a given effect size and sample size.

Sample size calculations were performed using Combat S processed SWATH proteomics data, balanced for IGHV classes¹⁸, using the Bioconductor ssize package (which facilitates power analysis calculations and visualization of results when large numbers of gene measurements are involved⁵¹) (R package version 3.4.0).
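To show why higher variability at low abundance translates into larger sample sizes, the Python sketch below computes the power of a two-group comparison for a single protein using a normal approximation. It does not reproduce the calculations of the ssize package: the function names are hypothetical, no multiple-testing adjustment is applied, and the standard deviation is supplied directly rather than derived from the Loess fit.

```python
from statistics import NormalDist


def two_group_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison.

    delta: true difference in mean log2 abundance between subgroups.
    sd: within-group standard deviation on the log2 scale.
    Uses a normal approximation to the test statistic (an assumption).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / (sd * (2 / n_per_group) ** 0.5)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def samples_needed(delta, sd, target_power=0.95, alpha=0.05, max_n=10000):
    """Smallest number of samples per group reaching the target power."""
    for n in range(2, max_n + 1):
        if two_group_power(delta, sd, n, alpha) >= target_power:
            return n
    return None


# Hypothetical proteins: same effect size, different variability.
n_low_variance = samples_needed(delta=1.0, sd=1.0)
n_high_variance = samples_needed(delta=1.0, sd=2.0)
```

Doubling the standard deviation roughly quadruples the number of samples required for the same effect size, which is consistent with the observation that proteins expressed at lower levels need larger clinical subgroups.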

Data availability

All raw and processed MS data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD011330.

Abbreviations

ANOVA: Analysis of variance
BCR: B-cell receptor
CEX: Cation exchange chromatography
CLL: Chronic lymphocytic leukemia
Combat S: Combat run in a supervised manner
Combat U: Combat run in an unsupervised manner
DDA: Data dependent acquisition
DIA: Data independent acquisition
FDR: False discovery rate
GO: Gene ontology
GOBP: Gene ontology biological process
GSEA: Gene set enrichment analysis
HILIC: Hydrophilic interaction liquid chromatography
IAA: Iodoacetamide
IGHV: Immunoglobulin heavy chain variable region
iTRAQ: Isobaric tags for relative and absolute quantification
KEGG: Kyoto Encyclopedia of Genes and Genomes pathways
limma: Linear models for microarray analysis
limma S: Limma batch correction function (supervised)
linear M: Limma with batch information incorporated into linear model design
M-CLL: IGHV mutated chronic lymphocytic leukemia
MS/MS: Tandem mass spectrometry
P. corr: Partial and semi-partial (Part) correlation function (R)
PANTHER: Protein analysis through evolutionary relationships
PBMC: Peripheral blood mononuclear cells
PCA: Principal component analysis
SVA: Surrogate variable analysis
SWATH: Sequential windowed acquisition of all theoretical fragments
UM-CLL: IGHV unmutated chronic lymphocytic leukemia
WBC: White blood cell count

References

Fabbri, G. & Dalla-Favera, R. The molecular pathogenesis of chronic lymphocytic leukaemia. Nat. Rev. Cancer 16, 145–162. https://doi.org/10.1038/nrc.2016.8 (2016).
Kipps, T. J. et al. Chronic lymphocytic leukaemia. Nat. Rev. Disease Primers 3, 16096. https://doi.org/10.1038/nrdp.2016.96 (2017).
Damle, R. N. et al. Ig V gene mutation status and CD38 expression as novel prognostic indicators in chronic lymphocytic leukemia. Blood 94, 1840–1847 (1999).
Hamblin, T. J., Davis, Z., Gardiner, A., Oscier, D. G. & Stevenson, F. K. Unmutated Ig V(H) genes are associated with a more aggressive form of chronic lymphocytic leukemia. Blood 94, 1848–1854 (1999).
Cramer, P. & Hallek, M. Prognostic factors in chronic lymphocytic leukemia-what do we need to know?. Nat. Rev. Clin. Oncol. 8, 38–47. https://doi.org/10.1038/nrclinonc.2010.167 (2011).
Guieze, R. & Wu, C. J. Genomic and epigenomic heterogeneity in chronic lymphocytic leukemia. Blood 126, 445–453. https://doi.org/10.1182/blood-2015-02-585042 (2015).
Eagle, G. L. et al. Total proteome analysis identifies migration defects as a major pathogenetic factor in immunoglobulin heavy chain variable region (IGHV)-unmutated chronic lymphocytic leukemia. Mol. Cell. Proteom. MCP 14, 933–945. https://doi.org/10.1074/mcp.M114.044479 (2015).
Huang, P. Y. et al. Protein profiles distinguish stable and progressive chronic lymphocytic leukemia. Leukemia Lymphoma 57, 1033–1043. https://doi.org/10.3109/10428194.2015.1094692 (2016).
Alsagaby, S. A. et al. Proteomics-based strategies to identify proteins relevant to chronic lymphocytic leukemia. J. Proteome Res. 13, 5051–5062. https://doi.org/10.1021/pr5002803 (2014).
Johnston, H. E. et al. Proteomics profiling of CLL versus healthy B-cells Identifies putative therapeutic targets and a subtype-independent signature of spliceosome dysregulation. Mol. Cell. Proteom. MCP 17, 776–791. https://doi.org/10.1074/mcp.RA117.000539 (2018).
Thurgood, L. A., Dwyer, E. S., Lower, K. M., Chataway, T. K. & Kuss, B. J. Altered expression of metabolic pathways in CLL detected by unlabelled quantitative mass spectrometry analysis. Br. J. Haematol. 185, 65–78. https://doi.org/10.1111/bjh.15751 (2019).
Thurgood, L. A., Chataway, T. K., Lower, K. M. & Kuss, B. J. From genome to proteome: Looking beyond DNA and RNA in chronic lymphocytic leukemia. J. Proteom. 155, 73–84. https://doi.org/10.1016/j.jprot.2017.01.001 (2017).
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355. https://doi.org/10.1038/nature19949 (2016).
Chen, E. I. & Yates, J. R. 3rd. Cancer proteomics by quantitative shotgun proteomics. Mol. Oncol. 1, 144–159. https://doi.org/10.1016/j.molonc.2007.05.001 (2007).
Gillet, L. C. et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: A new concept for consistent and accurate proteome analysis. Mol. Cell. Proteom. MCP 11, O111.016717. https://doi.org/10.1074/mcp.O111.016717 (2012).
Schubert, O. T. et al. Building high-quality assay libraries for targeted analysis of SWATH MS data. Nat. Protoc. 10, 426–441. https://doi.org/10.1038/nprot.2015.015 (2015).
Collins, B. C. et al. Multi-laboratory assessment of reproducibility, qualitative and quantitative performance of SWATH-mass spectrometry. Nat. Commun. 8, 291. https://doi.org/10.1038/s41467-017-00249-5 (2017).
Nygaard, V., Rodland, E. A. & Hovig, E. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics (Oxford, England) 17, 29–39. https://doi.org/10.1093/biostatistics/kxv027 (2016).
Coscia, M. et al. IGHV unmutated CLL B cells are more prone to spontaneous apoptosis and subject to environmental prosurvival signals than mutated CLL B cells. Leukemia 25, 828–837. https://doi.org/10.1038/leu.2011.12 (2011).
Manukyan, G. et al. Neutrophils in chronic lymphocytic leukemia are permanently activated and have functional defects. Oncotarget 8, 84889–84901. https://doi.org/10.18632/oncotarget.20031 (2017).
Trojani, A. et al. Gene expression profiling identifies ARSD as a new marker of disease progression and the sphingolipid metabolism as a potential novel metabolism in chronic lymphocytic leukemia. Cancer Biomark. Sect. A Disease Mark. 11, 15–28. https://doi.org/10.3233/CBM-2012-0259 (2011).
Parker, S. J. et al. Identification of a set of conserved eukaryotic internal retention time standards for data-independent acquisition mass spectrometry. Mol. Cell. Proteom. MCP 14, 2800–2813. https://doi.org/10.1074/mcp.O114.042267 (2015).
Limonier, F. et al. Estimating the reliability of low-abundant signals and limited replicate measurements through MS2 peak area in SWATH. Proteomics 18, e1800186. https://doi.org/10.1002/pmic.201800186 (2018).
Bilban, M. et al. Deregulated expression of fat and muscle genes in B-cell chronic lymphocytic leukemia with high lipoprotein lipase expression. Leukemia 20, 1080–1088. https://doi.org/10.1038/sj.leu.2404220 (2006).
Rozovski, U., Hazan-Halevy, I., Barzilai, M., Keating, M. J. & Estrov, Z. Metabolism pathways in chronic lymphocytic leukemia. Leukemia Lymphoma 57, 758–765. https://doi.org/10.3109/10428194.2015.1106533 (2016).
Rozovski, U. et al. Aberrant LPL expression, driven by STAT3, mediates free fatty acid metabolism in CLL cells. Mol. Cancer Res. MCR 13, 944–953. https://doi.org/10.1158/1541-7786.MCR-14-0412 (2015).
Vangapandu, H. V. et al. B-cell receptor signaling regulates metabolism in chronic lymphocytic leukemia. Mol. Cancer Res. MCR 15, 1692–1703. https://doi.org/10.1158/1541-7786.MCR-17-0026 (2017).
Koczula, K. M. et al. Metabolic plasticity in CLL: Adaptation to the hypoxic niche. Leukemia 30, 65–73. https://doi.org/10.1038/leu.2015.187 (2016).
Copple, I. M. et al. Physical and functional interaction of sequestosome 1 with Keap1 regulates the Keap1-Nrf2 cell defense pathway. J. Biol. Chem. 285, 16782–16788. https://doi.org/10.1074/jbc.M109.096545 (2010).
Jain, A. et al. p62/SQSTM1 is a target gene for transcription factor NRF2 and creates a positive feedback loop by inducing antioxidant response element-driven gene transcription. J. Biol. Chem. 285, 22576–22591. https://doi.org/10.1074/jbc.M110.118976 (2010).
Sanchez-Lopez, E. et al. NF-κB-p62-NRF2 survival signaling is associated with high ROR1 expression in chronic lymphocytic leukemia. Cell Death Differ. https://doi.org/10.1038/s41418-020-0496-1 (2020).
Monnier, J. & Samson, M. Prokineticins in angiogenesis and cancer. Cancer Lett. 296, 144–149. https://doi.org/10.1016/j.canlet.2010.06.011 (2010).
LeCouter, J., Zlot, C., Tejada, M., Peale, F. & Ferrara, N. Bv8 and endocrine gland-derived vascular endothelial growth factor stimulate hematopoiesis and hematopoietic cell mobilization. Proc. Natl. Acad. Sci. USA 101, 16813–16818. https://doi.org/10.1073/pnas.0407697101 (2004).
Carter, A. et al. Imperfect correlation between p53 dysfunction and deletion of TP53 and ATM in chronic lymphocytic leukaemia. Leukemia 20, 737–740. https://doi.org/10.1038/sj.leu.2404120 (2006).
Perez-Riverol, Y. et al. The PRIDE database and related tools and resources in 2019: Improving support for quantification data. Nucleic Acids Res. 47, D442–D450. https://doi.org/10.1093/nar/gky1106 (2019).
Mi, H. et al. PANTHER version 11: Expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res. 45, D183–D189. https://doi.org/10.1093/nar/gkw1138 (2017).
Mi, H., Muruganujan, A., Casagrande, J. T. & Thomas, P. D. Large-scale gene function analysis with the PANTHER classification system. Nat. Protoc. 8, 1551–1566. https://doi.org/10.1038/nprot.2013.092 (2013).
Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2, 18–22 (2002).
R Core Team. R: A Language and Environment for Statistical Computing. http://www.R-project.org/. (The R Foundation for Statistical Computing, 2011).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate—A practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x (1995).
Oliveros, J. C. Venny. An interactive tool for comparing lists with Venn's diagrams. v. 2.1. https://bioinfogp.cnb.csic.es/tools/venny/index.html. (2007–2015).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47. https://doi.org/10.1093/nar/gkv007 (2015).
Leek, J. T. et al. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 28(6), 882–883 (2012).
Kim, S. ppcor: An R package for a fast calculation to semi-partial correlation coefficients. Commun. Stat. Appl. Methods 22, 665–674. https://doi.org/10.5351/CSAM.2015.22.6.665 (2015).
da Huang, W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57. https://doi.org/10.1038/nprot.2008.211 (2009).
da Huang, W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13. https://doi.org/10.1093/nar/gkn923 (2009).
McClintick, J. N. & Edenberg, H. J. Effects of filtering by Present call on analysis of microarray experiments. BMC Bioinform. 7, 49. https://doi.org/10.1186/1471-2105-7-49 (2006).
Li, Q., Birkbak, N. J., Gyorffy, B., Szallasi, Z. & Eklund, A. C. Jetset: Selecting the optimal microarray probe set to represent a gene. BMC Bioinform. 12, 474. https://doi.org/10.1186/1471-2105-12-474 (2011).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550. https://doi.org/10.1073/pnas.0506580102 (2005).
Subramanian, A., Kuehn, H., Gould, J., Tamayo, P. & Mesirov, J. P. GSEA-P: A desktop application for Gene Set Enrichment Analysis. Bioinformatics (Oxford, England) 23, 3251–3253. https://doi.org/10.1093/bioinformatics/btm369 (2007).
Warnes, G. R., Liu, P. & Li, F. ssize: Estimate Microarray Sample Size. https://www.bioconductor.org/packages/release/bioc/html/ssize.html (2012).

Acknowledgements

This work was supported by research grants from the North West Cancer Research UK (134623) and Leukaemia Research Fund UK (05013). UTK is an MRC Clinical Training Fellow based at the University of Liverpool supported by the North West England Medical Research Council Fellowship Scheme in Clinical Pharmacology and Therapeutics, which is funded by the Medical Research Council (MR/N025989/1), Roche Pharma, Eli Lilly and Company Limited, UCB Pharma, Novartis, the University of Liverpool and the University of Manchester. We would also like to thank Gregory R. Warnes for his advice in the use of the ssize package.

Author information

These authors contributed equally: Gina L. Eagle and John M. J. Herbert.

Authors and Affiliations

Department of Molecular and Clinical Cancer Medicine, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK
Gina L. Eagle, Jianguo Zhuang, Melanie Oates, Umair T. Khan & Andrew R. Pettitt
Computational Biology Facility, University of Liverpool, Liverpool, UK
John M. J. Herbert, Kim Clarke & Francesco Falciani
Department of Haemato-Oncology, Clatterbridge Cancer Centre NHS Foundation Trust, Liverpool, UK
Umair T. Khan & Andrew R. Pettitt
Department of Pharmacology and Therapeutics, Institute of Systems, Molecular and Integrative Biology, MRC Centre for Drug Safety Science, University of Liverpool, Liverpool, UK
Neil R. Kitteringham, B. Kevin Park & Rosalind E. Jenkins
Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, UK
Francesco Falciani

Contributions

G.L.E. processed samples, generated MS data, analysed data and wrote the manuscript. J.M.J.H. performed bioinformatic analysis of data and wrote the manuscript. J.Z. designed the study and wrote the manuscript. M.O. provided patients’ samples and clinical data for the study. U.T.K. provided clinical data for the study and contributed to the preparation of the manuscript. N.R.K. designed the study and contributed to the preparation of the manuscript. K.C. interpreted data. B.K.P. contributed to the generation of mass spectrometry data. A.R.P. designed the study, provided clinical perspectives and wrote the manuscript. R.E.J. directed the study, analysed MS data, interpreted data and wrote the manuscript. F.F. directed the study, performed bioinformatic analysis of data, interpreted data and wrote the manuscript.

Corresponding authors

Correspondence to Rosalind E. Jenkins or Francesco Falciani.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Eagle, G.L., Herbert, J.M.J., Zhuang, J. et al. Assessing technical and biological variation in SWATH-MS-based proteomic analysis of chronic lymphocytic leukaemia cells. Sci Rep 11, 2932 (2021). https://doi.org/10.1038/s41598-021-82609-2

Received: 24 July 2020
Accepted: 11 January 2021
Published: 03 February 2021
DOI: https://doi.org/10.1038/s41598-021-82609-2
