Introduction

Polycystic ovary syndrome (PCOS) is a complex, heterogeneous and often underdiagnosed endocrine disorder. According to the International PCOS Guideline, the syndrome can be diagnosed if at least two of the following three criteria are fulfilled after exclusion of other etiologies: oligo- or anovulation, clinical and/or biochemical hyperandrogenism and polycystic ovaries1. The estimated prevalence ranges from 8% to 12% depending on the study population and applied criteria2,3,4. As PCOS has a strong metabolic and inflammatory component, it should be considered more than a mere gynaecological problem. Indeed, affected women are at increased risk for metabolic syndrome, type II diabetes and cardiovascular diseases5. Moreover, low-grade chronic inflammation is commonly detected in women with PCOS, and it has been linked to the development of insulin resistance and accelerated atherosclerosis6.

Due to the vast scientific interest in PCOS, it has become evident that affected women also present with high pregnancy-related morbidity and adverse offspring outcomes7,8. Affected women have a 3–4-fold increase in the risk of pregnancy-induced hypertension and pre-eclampsia and a 2-fold higher risk for preterm delivery independent of BMI9. Structural alterations of placentae from women with PCOS have been reported, even in uncomplicated pregnancies, possibly indicating abnormal placentation and defective placental function10,11.

Proteomic technologies have been used to study PCOS-related alterations in protein expression in the plasma, ovarian tissue, follicular fluid, adipose tissue and T cells12,13,14,15,16. A list of proteomic biomarkers for PCOS has also been published, showing an association with networks related to the coagulation system, cell cycle regulation, metabolism, apoptosis, immune system/inflammation, cell signalling, oxidative stress, insulin, adipose tissue regulation, cholesterol and cell structure17. Interestingly, the proteomics biomarkers detected in women with PCOS overlap with those detected in women with pre-eclampsia, a disorder defined by hypertension and proteinuria, often related to placental dysfunction18. It is important to note that all the previous proteomic analyses in women with PCOS have been conducted using samples from non-pregnant women, and to date, no proteomics studies have been published for pregnant women with PCOS. Given that the women with PCOS are at risk for adverse gestational outcomes, a proteomic analysis during pregnancy could clarify the mechanisms leading to these conditions. In general, pregnancy can be considered a stress test as well as a window of opportunity to estimate health risks in women later in life19. For PCOS, it could be possible to identify the affected women, especially those at risk for adverse health outcomes later in life, by discerning the differences in circulating proteins.

By adopting a quantitative label-free proteomics approach, plasma proteomes of samples from uncomplicated term pregnancies of non-obese women with PCOS were compared with those of controls matched for age and BMI. A total of 169 proteins with two or more unique peptides were identified, of which 35 passed the cut-off Mann-Whitney p-value of 0.05. These proteins formed protein-protein interaction networks related to humoral immunity, inflammation and cardiovascular disease. The data were further analysed by principal component analysis (PCA). To determine how well the proteomic analysis could classify the women with PCOS and controls and to identify possible biomarkers for PCOS, three parallel methods were used: orthogonal projections to latent structure-discriminant analysis (OPLS-DA), hierarchical clustering (HCA) and self-organizing maps (SOMs). The detected proteins also correlated with several circulating hormones. It was also of interest to identify novel biomarkers that could be utilized in PCOS diagnostics. A ROC-curve analysis revealed that in the data set, complement factor properdin (properdin) was able to classify cases and controls with very high accuracy. Interestingly, properdin abundances correlated with AMH levels at the end of pregnancy.

Results

Metadata

Label-free quantitative proteomics was performed on seven plasma samples from pregnant women with PCOS at term and on 20 plasma samples from pregnant control women matched for age and BMI. All plasma samples were collected at the delivery ward when the women arrived to give birth. A schematic representation of the entire analysis process is depicted in Fig. 1. Patient demographics are presented in Table 1. Furthermore, plasma Anti-Müllerian hormone (AMH) and steroid hormone analysis results from the authors’ previous study20 were added in the analyses as additional variables.

Figure 1

Quantitative Proteomics Analysis Workflow (see Methods for further information). Lithium-heparin plasma samples were depleted of the 12 most abundant proteins and digested with trypsin. Nanoflow ultra-high-performance liquid chromatography coupled to ultra-definition mass spectrometry (nUPLC-UDMSE) was performed in triplicate. After acquisition, the data analysis was performed with Progenesis QI. Differences in protein abundances between groups were assessed by the Mann-Whitney U-test with a cut-off level set to p < 0.05. A network analysis by IPA was used to build protein interaction networks of proteins that differed between cases and controls and to provide a broader scope for interpretation regarding how the changes may affect the functions of the body. Principal component analysis (PCA) was used to visualise the principal axes of protein abundance variation in cases and controls in order to assess how much variation the sample classes show when compared and whether the cases and controls separate from each other. A clustering analysis is an alternative technique to analyse the differences between groups as well as similarities within a group. A self-organizing map (SOM) is an unsupervised data visualization technique that reduces the dimensions of data through the use of self-organising neural networks. Hierarchical clustering analysis (HCA) calculates the dissimilarity between individuals and builds a hierarchy of clusters. OPLS-DA was used to define differences between the groups and to identify the proteins with the highest discriminative power. These proteins were then used for the ROC-curve analysis to calculate AUC-values.

Table 1 Clinical characteristics of the study subjects. BMI, body-mass index; BP, blood pressure; SD, standard deviation; IQR, inter-quartile range; N, number of participants; AMH, Anti-Müllerian hormone.

Proteomics in cases and controls

A total of 169 proteins with two or more unique peptides were identified, of which 35 passed the cut-off Mann-Whitney p-value of 0.05. The fold-changes of the proteins with a p-value < 0.05 ranged from 13.3 to −3.81. The proteins with two or more unique peptides detected and a Mann-Whitney p-value < 0.05 as well as their mean abundances are listed in Table 2. The exact standardized protein abundances for each individual are listed in Supplementary Table 1.
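The per-protein comparison described above can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the p-value here uses the normal approximation with tie and continuity corrections (the paper does not state which variant was used), and the signed fold-change convention (negative values for proteins lower in cases) is an assumption inferred from the reported range. The function names and example abundances are hypothetical.

```python
import math
from statistics import fmean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(cases, controls):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(cases), len(controls)
    pooled = list(cases) + list(controls)
    ranks = average_ranks(pooled)
    u_cases = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_counts = {}
    for value in pooled:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u_cases, 1.0
    # Continuity correction of 0.5, never letting z drop below zero.
    z = max(abs(u_cases - n1 * n2 / 2) - 0.5, 0.0) / sigma
    return u_cases, math.erfc(z / math.sqrt(2))


def signed_fold_change(cases, controls):
    """Assumed convention: ratio of means, or -1/ratio when cases are lower."""
    ratio = fmean(cases) / fmean(controls)
    return ratio if ratio >= 1 else -1 / ratio


# Hypothetical abundances for one protein (not data from the study).
u, p = mann_whitney_u([8, 9, 10], [1, 2, 3, 4])
print(u, round(p, 3), signed_fold_change([8, 9, 10], [1, 2, 3, 4]))
```

With group sizes as small as those in this study, an exact p-value would generally be preferred over the approximation shown here.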

Table 2 List of proteins that were differentially abundant (p < 0.05) in the plasma of pregnant women at term with PCOS compared with age and BMI matched controls.

Pathway analysis

The Ingenuity pathway analysis (IPA) network module was used to identify the protein interaction networks of proteins differentially expressed in pregnant women with PCOS compared to pregnant control women. Only the proteins passing the cut-off p-value of 0.05 from the Mann-Whitney analysis were used. Network analysis reveals shared functional and biological aspects between the proteins which makes it easier to interpret how the proteomic changes affect the body as a whole. The networks that were identified were related to humoral immune responses, inflammatory responses, cardiovascular disease and cellular growth and proliferation. Full lists of proteins in these two networks can be found in Supplementary Tables 2 and 3.

Principal component analysis (PCA)

A PCA was performed using Progenesis QI for Proteomics to determine the principal axes of protein abundance variation in PCOS cases and controls. PCA reduces the data to its basic components so that patterns and clusters can be detected. The analysis with all the proteins with two or more unique peptides did not show distinct clustering (Fig. 2b). However, when PCA was carried out with only the proteins with two or more unique peptides and a Mann-Whitney p-value < 0.05, the cases and controls mostly separated along the x-axis (Fig. 2a), revealing that the proteomes of pregnant women with PCOS resemble each other.

Figure 2

Principal component analysis. Purple dots represent PCOS cases (circled in red) and blue dots the controls (circled in blue). (a) All differentially expressed proteins with ≥2 unique peptides are presented. (b) Only proteins passing the cut-off of p < 0.05 for the Mann-Whitney test are depicted. The cases and controls mostly cluster separately along the x-axis.

Hierarchical clustering and SOM clustering

Protein abundance data of PCOS cases vs controls (only the 35 proteins passing the Mann-Whitney cut-off) were used for hierarchical clustering (HCA) and self-organizing map (SOM) clustering analyses to determine which samples cluster together. Clustering analyses are unsupervised methods, so the only information provided is the protein abundance data, based on which the process clusters the individuals whose proteomes most resemble each other. In the hierarchical clustering analysis, 6/7 cases of PCOS clustered together (Fig. 3). In SOM clustering, however, all cases clustered next to each other (Fig. 4). This shows that, based on the protein abundances alone, unsupervised methods find enough similarities in the proteomes of pregnant women with PCOS to cluster them together.
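To make the idea of unsupervised grouping concrete, the following is a small, standard-library sketch of agglomerative hierarchical clustering. It assumes Euclidean distance and complete linkage (the default of R's hclust); the paper does not state which distance or linkage was used, and the sample profiles below are invented for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage_clusters(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between clusters is the largest pairwise distance between
    their members (complete linkage). Returns lists of sample indices.
    """
    clusters = [[i] for i in range(len(profiles))]

    def cluster_distance(c1, c2):
        return max(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical two-protein profiles for five samples; no labels are supplied.
samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.2]]
print(complete_linkage_clusters(samples, 2))
```

The class labels (case or control) never enter the computation; they are only compared with the resulting clusters afterwards, which is what makes the method unsupervised.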

Figure 3

Hierarchical clustering analysis (HCA) of the 35 differentially expressed proteins in cases and controls. 6/7 of the cases clustered together.

Figure 4

Self-organizing map (SOM) clustering of the 35 differentially expressed proteins in cases and controls. All cases clustered together.

Orthogonal projections to latent structure-discriminant analysis (OPLS-DA)

OPLS-DA modelling was used to identify proteins that can differentiate the cases from controls. OPLS-DA is a modelling technique that can perform binary comparisons. The modelling provides two values: (p) is the magnitude of change of a given marker, and (p(corr)) depicts the significance of the marker in the binary comparison. Any experiment with a large number of measured variables can be modelled with OPLS-DA to filter out the markers that differ most between the two groups. OPLS-DA can separate predictive and uncorrelated variance in binary comparisons. According to OPLS-DA, two proteins that passed the cut-off value of +0.65 or −0.65 for p(corr) were enriched in the plasma of pregnant women with PCOS: properdin (CFP) (p(corr) = −0.850) and insulin-like growth factor II (IGF-II) (p(corr) = −0.691), indicating them as potential discriminant markers in our dataset.
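The two S-plot values can be sketched as follows, assuming the commonly used S-plot convention in which p is the covariance and p(corr) the correlation between a protein's abundance and the predictive score vector of the fitted model. Fitting the OPLS-DA model itself is not shown; the score vector and abundances below are hypothetical, and the 0.65 cut-off is the one quoted in the text.

```python
import math


def covariance(t, x):
    """Sample covariance between two equally long sequences."""
    n = len(t)
    mean_t, mean_x = sum(t) / n, sum(x) / n
    return sum((a - mean_t) * (b - mean_x) for a, b in zip(t, x)) / (n - 1)


def s_plot_coordinates(scores, abundances):
    """Return (p, p_corr) for one protein against the predictive scores."""
    p = covariance(scores, abundances)
    p_corr = p / math.sqrt(
        covariance(scores, scores) * covariance(abundances, abundances)
    )
    return p, p_corr


def passes_cutoff(p_corr, cutoff=0.65):
    """Keep markers whose |p(corr)| reaches the cut-off in either direction."""
    return abs(p_corr) >= cutoff


# Hypothetical predictive scores (one per sample) and one protein's abundances.
scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
abundances = [10.0, 8.0, 6.0, 4.0, 2.0]
p, p_corr = s_plot_coordinates(scores, abundances)
print(p, p_corr, passes_cutoff(p_corr))
```

The sign of p(corr) depends on which group the model places on the positive side of the score axis, so a negative value by itself does not imply lower abundance in cases.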

ROC curve analysis

To identify the proteins that could be used to distinguish the cases from controls, a ROC curve analysis was performed using the MetaboAnalyst analysis tool21. The 35 differentially abundant plasma proteins and AMH levels were used for calculating ROC curves for individual markers as well as combination ROC curves for sets of biomarkers by a support vector machine algorithm (Fig. 5a and Suppl. Table 4). Properdin alone had the best predictive accuracy (AUC = 1, Fig. 5c), but IGF-2 also performed well (Fig. 5b). Of note, properdin was also the top protein found by OPLS-DA in the dataset. This serves as an orthogonal statistical validation of the OPLS-DA modelling and non-parametric testing.
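For a single marker, the AUC can be computed directly as the proportion of case-control pairs in which the case has the higher value, with ties counted as one half. The sketch below illustrates only this single-marker case; it does not reproduce MetaboAnalyst's support vector machine based combination curves, and the values are hypothetical.

```python
def roc_auc(case_values, control_values):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Assumes higher marker values indicate a case; ties count as 0.5.
    """
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical marker values (not data from the study).
print(roc_auc([3, 5], [1, 4]))      # partial overlap between groups
print(roc_auc([7, 8, 9], [1, 2]))   # complete separation
```

An AUC of 1 therefore means that every case value exceeded every control value in the data set at hand.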

Figure 5

ROC-curve analysis using individual proteins or a combination of proteins. (a) Combination ROC-curves calculated by MetaboAnalyst 4.0. The best AUC-value was reached using all the identified 35 proteins and AMH (yellow, AUC: 0.97). See Suppl. Table 4 for proteins used for creation of the curves. (b) ROC-curve of IGF-2, AUC: 0.937. (c) ROC-curve of properdin alone, which classifies cases and controls perfectly, AUC: 1.

Correlation analysis

In order to see which clinical parameters correlated with protein abundances measured with LC-MS, we used clinical data as well as steroid hormone and AMH values from our previous study20. All significant correlations are listed in Table 3. Interestingly, properdin levels correlated positively with AMH levels (r = 0.417, p = 0.0383). As AMH can be considered a marker for antral follicle count, high properdin levels may be linked with polycystic ovarian morphology. The strongest correlation was detected between inactive dipeptidyl peptidase 10 (DPP10) and testosterone (r = 0.700, p < 0.001).
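A correlation of this kind can be sketched with the standard library alone. The Pearson coefficient below follows the usual definition; the p-value, however, is obtained by a permutation test, which is a substitute chosen to keep the example dependency-free and is not the method in the authors' R script. The function names, the seed and the paired values are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=10000, seed=0):
    """Two-sided p-value: how often shuffled pairings reach the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical paired measurements (e.g. a hormone level and a protein abundance).
hormone = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
protein = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
print(round(pearson_r(hormone, protein), 3), permutation_p_value(hormone, protein))
```

Because many protein-hormone pairs are tested, nominal p-values of this kind are usually interpreted with multiple comparisons in mind.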

Table 3 Correlation analysis results for clinical parameters and protein abundances detected by LC-MS. Presented as correlation coefficient (p-value).

Discussion

PCOS affects around 8–12% of the female population, making it one of the most common endocrinological disorders worldwide2,3,4. Although the affected women suffer from reproductive and metabolic dysfunction, the syndrome often remains undiagnosed22. PCOS is a risk factor for adverse pregnancy outcomes, such as pregnancy induced hypertension, pre-eclampsia, prematurity and gestational diabetes, but the underlying mechanisms remain unclear7,9. During pregnancy major metabolic and inflammatory changes occur in the female body. To determine whether these responses are affected by PCOS, plasma samples from uncomplicated term pregnancies from non-obese women with PCOS and matched controls were compared using label-free quantitative proteomics. To the authors’ knowledge, this is the first study to assess the plasma proteome of pregnant women with PCOS.

The analysis indicated that 35 proteins were significantly differentially expressed between the cases and the controls. Most of the proteins were associated with networks related to inflammation, humoral immunity and cardiovascular disease. Amongst these were some proteins previously detected in non-pregnant women with PCOS, but proteins that have not been associated with PCOS previously were also identified. SOM clustering and hierarchical clustering analyses revealed that the PCOS cases cluster close to each other. The support vector machine based ROC analysis was used to identify individual proteins or combinations of proteins that could best distinguish cases from controls, and it revealed that properdin (CFP) alone provided the best prediction for PCOS diagnosis in the dataset.

Several of the proteins now identified in pregnant women with PCOS have been reported to be differentially expressed in non-pregnant women with PCOS as well, including increased levels of insulin-like growth factor II (IGF2), platelet factor 4 (PF4), serum amyloid A (SAA), fibulin-1 (FBL1), apolipoprotein A4 (APOA4) and alpha-1B-glycoprotein (A1BG) and decreased levels of zinc-alpha-2-glycoprotein (AZGP1) and serum paraoxonase/arylesterase 1 (PON1)23,24,25,26,27. In contrast to the present findings, serum thrombospondin-1 (THBS1) levels have been reported to be lower in non-pregnant women with PCOS28. Some of the proteins that were shown to be increased in the plasma of pregnant women with PCOS in the present study have been detected to be decreased in the follicular fluid of women with PCOS14. Interestingly, all these proteins are involved in thrombosis, inflammation and/or metabolism.

Indeed, the network analysis revealed that a vast majority of the differentially expressed proteins belonged to networks related to humoral responses, inflammatory responses, cardiovascular disease, lipid metabolism and cellular growth and proliferation. Even a normal pregnancy is an acquired hypercoagulable and inflammatory state29,30. The concentrations of coagulation factors increase and fibrinolysis and anticoagulatory factors decrease during gestation29. Furthermore, insulin sensitivity decreases by 50–60%31. All these changes are mandatory for the mother to adapt to pregnancy and on the other hand, a rigorous control of these factors is critical for a healthy pregnancy. Conditions that predispose to thrombosis, inflammation or insulin resistance may affect this fine balance, leading to suboptimal implantation and placentation and possibly complications during pregnancy.

The protein levels of IGF-2 were significantly higher in women with PCOS in our dataset. IGF-2 has been linked to placental function in several studies: Igf2 overexpression in mice leads to overgrowth of both the placenta and the fetus32, and deletion of the placental-specific Igf2 leads to a reduction in placental and fetal weight, decreased transport of nutrients and reduced diffusion capacity33,34. In guinea pigs, administration of IGF-II to the mother in early pregnancy increases placental functional capacity and the weight of the fetus and the placenta35. Women with PCOS have a higher risk for large for gestational age (LGA) infants in general8, which could be partly explained by higher IGF-2 levels in their circulation. In our dataset, IGF-2 protein levels at the end of pregnancy did not correlate with fetal weight; however, we did not have any LGA infants in the PCOS group. Unfortunately, we did not have the information on placental weight to correlate with the IGF-2 levels.

Women with PCOS have a 3–4-fold higher risk for developing pre-eclampsia (PE) during pregnancy9. Pre-eclampsia is characterised by hypertension (≥140/90 mmHg) and proteinuria (>300 mg/day) after the 20th week of gestation. PE is assumed to stem from defective vascular function and placentation leading to hypoxia of the placenta, the release of soluble factors and, over time, to generalised inflammation and progressive endothelial damage36. It has been reported that the proteomic markers of PCOS overlap with those identified in pre-eclamptic patients18; however, none of the PCOS studies utilised in the systematic review included data from pregnant women. Like PCOS, pre-eclampsia is a multi-systemic syndrome, where complex pathophysiological changes, including endothelial dysfunction, inflammation, activation of coagulation and metabolic changes, are prominent features. Taking all this into account, it is not surprising that many of the proteins that were differentially expressed between pregnant PCOS cases and controls in the dataset have also been linked to pre-eclampsia, e.g. properdin, insulin-like growth factor 2 (IGF2), PF4, coagulation factor XII (F12), FBL1, apolipoprotein C3 (APOC3), hemopexin (HPX), apolipoprotein E (APOE), plasminogen (PLG), vitronectin (VTN), ZAG1, prostaglandin-H2 D-isomerase (PTGDS) and the complement component C8 alpha chain (C8A)18. It is interesting to note that these differences in circulating proteins were detected even though the women with PCOS included in our study had no signs of pre-eclampsia and a histological examination of their placentas revealed no abnormalities. Whether the presence of these markers can explain the increased risk of pregnancy complications or can be used to predict cardiometabolic risk later in life requires further research.

Complement factor properdin is a positive regulator of the alternative pathway but it can also act as an independent complement activator37. Properdin also seems to control platelet aggregation by exacerbating thromboinflammation38. In this dataset, properdin was enriched in the plasma of pregnant women with PCOS compared with controls. Disturbances in complement activation have been detected in women with PCOS previously, as complement factor C3 has been reported to correlate with traditional cardiovascular disease risk factors in these women, most importantly with insulin resistance39,40. Excessive complement activation during pregnancy has been linked to many pregnancy complications, such as pre-eclampsia, preterm birth and pregnancy-induced hypertension41. Properdin also had the highest p(corr)-value according to the OPLS-DA analysis, and the ROC-curve analysis showed that properdin alone could classify the cases and controls with high accuracy. Interestingly, properdin levels correlated with AMH levels, indicating a possible link between these two proteins. Unfortunately, the increase in properdin levels could not be confirmed using commercially available ELISAs (data not shown). This is most likely due to the fact that upon a freeze-thawing process properdin forms tight aggregates that resist even highly reducing conditions38. Due to the harsh sample reducing steps of LC/MS, it is well suited for the detection of properdin, unlike ELISA which preserves the structure of the antigen.

Correlation analysis of the identified protein abundances with clinical parameters, steroid hormones or AMH revealed several significant correlations. The most significant correlations were observed with testosterone, such as between testosterone and inactive dipeptidyl peptidase 10 (DPP10; r = 0.700). DPP10 genes have previously been reported to be enriched in patients with autism42, a condition whose incidence has been reported to be higher in women with PCOS and their offspring43. Androstenedione levels correlated negatively with sex-hormone binding globulin, cardiovascular protective factors (e.g. kallistatin44) and lipolytic factors (e.g. zinc-alpha-2-glycoprotein45), and both androstenedione and testosterone correlated with several complement factors, indicating that androgens affect the plasma proteome of pregnant women.

In this study, clinical samples were analysed by LC-MS using data-independent acquisition in the UDMSE mode. This approach was chosen over ELISA-type orthogonal methods partly because of the superior selectivity, reproducibility and multiplexing of mass spectrometric methods. Other reasons include throughput, cost per sample and sensitivity compared with immunoassays. However, due to the very high dynamic range of the serum/plasma proteome, high-abundance proteins must be depleted before samples are processed for MS analysis. This depletion introduces a confounding factor when considering validation by orthogonal methods, because other proteins are also partially and non-selectively depleted; however, current commercially available kits for high-abundance protein depletion are robust enough to maintain high inter-assay reproducibility. The same cannot be said for concordance between immunoassay and LC-MS assays. Regarding the statistical data analysis, reproducibility in selecting biomarkers for differentiating cases from controls is best achieved by a combination of unsupervised and supervised methods. A balanced combination of both of these method types, in the form of PCA, OPLS-DA and ROC curve analysis, has been used. Whether a proteomic signature can separate the clinical groups in question is best determined using unsupervised methods, such as PCA, but to identify important features of interest, supervised methods, such as OPLS-DA, are more suitable. Unsupervised methods serve as a guide regarding whether to use supervised methods or not, because using supervised methods alone carries the risk of over-fitting the model.

In addition to the strengths, the study has some limitations. The pilot nature of the study is acknowledged as only seven women with PCOS were included in the analysis. Analyses of larger sample sets are thus warranted in the future. The cases were identified based on hospital records retrospectively and thus we had no contact with them before pregnancy. This population based approach does, however, also reduce selection bias, as the study population is not selected from patients from fertility clinics. The information collected during pregnancy was limited to the data collected from public health care and thus no data were available on insulin sensitivity of the subjects or weight of the placentas, for example. Placental samples were not available for in vitro studies and thus it was not possible to investigate how factors identified in this study affect placental function. In addition, all samples were from term uncomplicated pregnancies of non-obese women. Whether changes similar to the ones detected can be observed in obese women with PCOS or in samples obtained from early pregnancy requires further investigation.

Finally, it is hypothesized that the increased plasma levels of properdin and other circulating thromboinflammatory factors may indicate defective placental function or by themselves induce pathological changes that lead to increased incidence of pregnancy-related hypertensive disorders in women with PCOS. The finding that healthy pregnant women with PCOS display altered plasma proteome even during uncomplicated pregnancies may also reflect an increased risk for metabolic morbidity later in life. Future studies comparing systemic and placental expression of the biomarkers are essential for confirmation of these hypotheses.

Methods

Patients

The plasma samples were obtained from women who participated in the ‘Biology, Affect, Stress, Imaging, and Cognition in pregnancy and the puerperium’ (BASIC) cohort20,30,46,47. In Uppsala County, Sweden, all pregnant women are invited to participate in the population-based BASIC cohort during their routine ultrasound at gestational week 16–18. The eligibility criteria are: (1) ability to communicate in Swedish, (2) age > 18 years and (3) free from blood-borne diseases. The BASIC cohort covers around 23% of the women in the Uppsala County area. Brief demographic data are collected upon inclusion (e.g. on chronic disorders, ongoing medication, smoking in early pregnancy, height and weight). The samples used in this study were collected during the years 2010–2012. The women with PCOS were identified from the cohort through the hospital register by the ICD-10 diagnosis of polycystic ovary syndrome (E28.2). The women were diagnosed according to the Rotterdam criteria, meaning that at least two of the following criteria were present: (1) polycystic ovaries by ultrasonography, (2) oligo- or amenorrhoea and (3) hyperandrogenism, either biochemical (elevated testosterone, androstenedione or elevated free androgen index) or clinical (hirsutism, Ferriman-Gallwey score > 8). The controls were healthy pregnant women matched for age and pre-pregnancy BMI, and they had no records of PCOS diagnosis, menstrual irregularities or ovulatory infertility. Placental sample slides from the pregnancies were examined by a pathologist and no abnormalities were detected in either group.

The women provided written informed consent for inclusion and the study has been approved by the Regional Ethical Review Board (Uppsala, Sweden) and the Regional Ethical Committee of Northern Ostrobothnia Hospital District (Oulu, Finland). The study is compliant with the Declaration of Helsinki.

Plasma sample collection

Upon admission to the delivery ward at the Uppsala county hospital, the only delivery ward in the area, a venous blood sample was drawn into a Lithium-Heparin tube. The plasma samples were stored at −70 °C.

Plasma sample processing

Plasma samples were processed essentially as described previously48,49. Briefly, the top 12 most abundant proteins of plasma were depleted by a TOP 12 depletion kit (Pierce, Thermo Fisher) from 10 µL plasma according to the manufacturer’s instructions. Depleted plasma was used for estimating the protein amount by a BRADFORD MX reagent (EXPEDION), and an equal amount of protein per sample was dried and resuspended in 50 mM Tris buffer containing 6 M urea (pH 7.8). Dithiothreitol (DTT) was added to a final concentration of 10 mM and the sample tubes were shaken for 1 h at RT. Iodoacetamide (IAA) was then added to a final concentration of 40 mM and the tubes were shaken for 1 h at RT. DTT (40 mM) was then used to quench excess IAA for 1 h at RT with shaking. Trypsin was then added to the protein mixtures at a trypsin:protein ratio of 1:50 and the samples were incubated at 37 °C overnight. The resulting tryptic peptides were cleaned with C18 spin columns according to the manufacturer’s instructions.

UDMSE and data analysis

A total of 500 ng of peptides per sample was injected into the nano Acquity UPLC (Ultra Performance Liquid Chromatography) system (Waters Corporation, MA, USA). TRIZAIC nanoTile 85 μm × 100 mm HSS-T3u wTRAP was used for the on-line liquid chromatographic separation of the peptide mixture before analysis by the mass spectrometer. Samples were loaded, trapped and washed for 2 min at 8.0 μL/min with 1% B. The analytical gradient, run at 450 nL/min, was as follows: 0–1 min 1% B, at 2 min 5% B, at 65 min 30% B, at 78 min 50% B, at 80 min 85% B, at 83 min 85% B, at 84 min 1% B and at 90 min 1% B. Buffer A was 0.1% formic acid in water and Buffer B was 0.1% formic acid in acetonitrile.
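The gradient time points can be written down as a small lookup, which makes it easy to check the solvent composition at any moment of the 90-minute run. The sketch below assumes linear ramps between the stated time points, which the text does not specify; the function name is hypothetical.

```python
# (minute, % buffer B) pairs taken from the gradient description.
GRADIENT = [
    (0, 1), (1, 1), (2, 5), (65, 30), (78, 50),
    (80, 85), (83, 85), (84, 1), (90, 1),
]


def percent_b(minute):
    """Percentage of buffer B at a given minute, assuming linear ramps."""
    if not GRADIENT[0][0] <= minute <= GRADIENT[-1][0]:
        raise ValueError("time is outside the 0-90 min gradient")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= minute <= t1:
            return b0 + (b1 - b0) * (minute - t0) / (t1 - t0)


for minute in (0.5, 33.5, 79, 81, 90):
    print(minute, percent_b(minute))
```

Most of the separation takes place on the shallow 5% to 30% B ramp between 2 and 65 min, followed by a steeper ramp, a high-organic wash and re-equilibration at 1% B.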

Data were acquired in data-independent acquisition mode using UDMSE with a Synapt G2-Si HDMS (Waters Corporation, MA, USA). The data were collected for 100–2000 m/z, with a scan time of one second and an IMS wave velocity of 650 m/s, and the collision energy was ramped in the trap between 20 and 60 V. Calibration was performed using Glu1-Fibrinopeptide B MS2 fragments, and the leucine-enkephalin ion (m/z 556.2771) was used as a lock mass during the runs. The samples were run in triplicate, and further analyses were done using Progenesis QI for Proteomics software (Nonlinear Dynamics, Newcastle, UK).

The data analysis was performed as previously described48,49. Briefly, the raw files were imported to Progenesis QI for Proteomics software (Nonlinear Dynamics, Newcastle, UK) using lock mass correction. Default parameters for peak picking and the alignment algorithm were used. Progenesis software facilitated peptide identification with Protein Lynx Global Server and label-free quantification50. Peptide identification was done against Uniprot human FASTA sequences (UniprotKB Release 2017_03, 20183 sequence entries), with the ClpB protein sequence (CLPB_ECOLI, P63285) inserted for label-free quantification. The modifications used included fixed modification of cysteine (carbamidomethyl) and variable modification of methionine (oxidation). Trypsin was the digesting agent with two missed cleavages allowed. Fragment and peptide error tolerances were set to auto, and the FDR to less than 1%. One or more ion fragments per peptide, three or more fragments per protein and one or more peptides per protein were needed for ion matching.

The identified proteins were grouped based on the parsimony principle, and unique peptides to the protein are reported. The parsimony principle governs the rule that protein hits are reported as the minimum set that explains all observable peptides. Progenesis QI for Proteomics software does not take a strict parsimonious approach due to the over‐stringency as described previously51; however, for the resolution of conflicts, if two proteins contain some common peptides, the protein with fewer peptides is grouped with the protein with a higher number of peptides which are a superset of the subsumed protein’s peptides. The lead protein is the one with the greatest coverage or the highest score when the coverages of two or more proteins are the same. Lead identity peptide data are always used for quantitation, and further details regarding this approach are given on the software website (www.nonlinear.com).
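The grouping rule described above, in which a protein whose peptides are all contained in a larger protein's peptide set is grouped under that protein, can be sketched as follows. This is a simplified illustration and not the Progenesis QI implementation: it ignores coverage and score, and it breaks ties between equally sized peptide sets alphabetically, which is an assumption made only to keep the output deterministic. The protein and peptide names are hypothetical.

```python
def group_subsumed_proteins(protein_peptides):
    """Group proteins whose peptide sets are subsets of a larger protein's set.

    protein_peptides maps a protein name to an iterable of its peptides.
    Returns {lead_protein: [proteins grouped under it]}.
    """
    # Visit proteins from the most to the fewest peptides so that potential
    # lead proteins are seen before the proteins they may subsume.
    ordered = sorted(
        protein_peptides, key=lambda name: (-len(set(protein_peptides[name])), name)
    )
    groups = {}
    for protein in ordered:
        peptides = set(protein_peptides[protein])
        for lead in groups:
            if peptides <= set(protein_peptides[lead]):
                groups[lead].append(protein)
                break
        else:
            groups[protein] = []
    return groups


# Hypothetical identifications: B shares all of its peptides with A.
hits = {
    "A": ["pep1", "pep2", "pep3"],
    "B": ["pep1", "pep2"],
    "C": ["pep4"],
    "D": ["pep3", "pep5"],
}
print(group_subsumed_proteins(hits))
```

Here D is kept as its own lead protein because pep5 is not explained by any other hit, even though it shares pep3 with A.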

Statistics

Hierarchical clustering and self-organising map (SOM) clustering were performed in R as described previously52,53. Briefly, SOM clustering was performed on data consisting of X number of samples using the R package SOM with the parameters xdim = 5, ydim = 6, topol = "hexa" and neigh = "gaussian". Data were centred and scaled before performing the clustering. The hclust function in R was used for generating the heatmaps. MetaboAnalyst 4.0 (https://www.metaboanalyst.ca/) was used for generating individual and combined ROC curves and calculating area under the curve (AUC) values21. A non-parametric Mann-Whitney test was performed on cases vs. controls, and 0.05 was set as the cut-off for the p-value. Principal component analysis was performed using Progenesis QI for Proteomics (v4.0). The S-plot from the orthogonal projections to latent structures-discriminant analysis (OPLS-DA) was generated with the EZInfo 3.0 software using default parameters. Proteomics data were Pareto scaled prior to OPLS-DA modelling. The script that was used to calculate Pearson correlation coefficients and p-values to demonstrate the correlation between hormone values and protein expression can be found in the supplementary information. Calculations were performed in the R programming language.
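The two scaling steps mentioned above differ only in the divisor. Pareto scaling mean-centres each variable and divides by the square root of its standard deviation, whereas centring and scaling to unit variance divides by the standard deviation itself. The following sketch illustrates both on one hypothetical protein; it is written in Python for illustration and is not the authors' R or EZInfo code.

```python
import math
import statistics


def pareto_scale(values):
    """Mean-centre and divide by the square root of the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / math.sqrt(sd) for v in values]


def autoscale(values):
    """Mean-centre and divide by the standard deviation (unit variance)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


# Hypothetical abundances of one protein across five samples.
abundances = [100.0, 200.0, 300.0, 400.0, 500.0]
print([round(v, 2) for v in pareto_scale(abundances)])
print([round(v, 2) for v in autoscale(abundances)])
```

Pareto scaling reduces the dominance of high-abundance proteins while preserving more of the original variance structure than unit-variance scaling does.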