Identification and validation of predictive factors for progression to severe COVID-19 pneumonia by proteomics

Dear Editor, A recent study showed that around 80% of coronavirus disease 2019 (COVID-19) patients are moderate cases who will recover with or without conventional treatment, while the remaining 20% develop severe disease requiring intensive care. Early and accurate screening of new COVID-19 patients to identify those who will develop severe disease will facilitate decision-making on appropriate treatment regimens and reasonable allocation of limited healthcare resources. Therefore, novel predictive factors for disease progression from moderate to severe are urgently needed. As proteomic profiling of patient sera is informative during disease progression, we hypothesized that some proteins would be significantly altered in the sera of patients who will develop severe COVID-19 after admission, and that these proteins could help predict disease progression. Herein, we initiated a study by recruiting two groups of COVID-19 patients: the development group (group I) with 23 patients (15 severe and 8 moderate) (Supplementary Table S1), and the validation group (group II) with 50 patients (29 severe and 21 moderate) (Supplementary Table S2) (Supplementary Fig. 1a). In addition, 10 healthy individuals who tested negative for SARS-CoV-2 nucleic acid were recruited as the control group. To characterize the proteomic signatures of COVID-19 patients, we performed a quantitative liquid chromatography-tandem mass spectrometry (LC-MS/MS) proteomic analysis on the sera of the patients from group I and the control group. Overall, the abundance of 988 proteins across these 33 individuals exhibited high consistency (Supplementary Fig. 1b; Supplementary Data S1). Based on the treatment records, the 15 patients in group I who required a ventilator were classified as severe cases, and the rest were classified as moderate (Supplementary Table S1). 
Principal component analysis (PCA) shows that although most of the moderate patients are similar to healthy controls, the severe patients clearly segregate from the rest (Fig. 1a), suggesting proteomic features unique to the severe cases. Differential expression (DE) analysis between patients and healthy controls identified 43 significantly upregulated and 47 significantly downregulated proteins (Fig. 1b, c, Supplementary Table S3). The upregulated proteins include classical inflammatory response proteins, such as protein S100-A8 (S100A8), protein S100-A9 (S100A9), serum amyloid A-1 protein (SAA1), serum amyloid A-2 protein (SAA2), and alpha-1-antichymotrypsin (SERPINA3). We performed systematic Gene Ontology (GO) enrichment analysis, and the results show that the upregulated proteins are significantly enriched (Bonferroni-corrected P < 0.05) in biological processes related to acute inflammatory response, neutrophil activation, and activation of innate immune response (Fig. 1d; Supplementary Table S4). Our results also indicate that some of the upregulated proteins are significantly involved in response to hypoxia and cytokine secretion (Fig. 1d). The significance of some of these differentially expressed proteins was independently validated by parallel reaction monitoring (PRM) (Supplementary Figs. 2, 3). These results provide an initial proteomic characterization of the 23 confirmed COVID-19 patients. DE analysis between the severe and moderate cases shows that 20 proteins are significantly upregulated in the severe cases (Fig. 1e, f, Supplementary Table S5). Ten of them, including S100A8 and S100A9, were also identified as upregulated in the previous comparison between the patients and the healthy controls. Upregulation of these 10 proteins in both comparisons suggests that the severe cases might have stronger inflammatory responses than the moderate patients. 
Upregulation of the other 10 proteins, namely actin, cytoplasmic 1 (ACTB), calpain small subunit 1 (CAPNS1), collagen alpha-3(VI) chain (COL6A3), coagulation factor VIII (F8), glutathione S-transferase omega-1 (GSTO1), myoglobin (MB), out at first protein homolog (OAF), superoxide dismutase [Mn], mitochondrial (SOD2), thymosin beta-4 (TMSB4X), and tumor necrosis factor receptor superfamily member 17 (TNFRSF17), is found only in severe patients, whereas the expression of these 10 proteins in moderate patients is either not significantly different or downregulated as compared to controls (Supplementary Fig. 4). Functional enrichment analysis of the 20 upregulated proteins in the progression-to-severe group shows that, in addition to immune response related biological processes, some of these proteins (ACTB, GSTO1, S100A9, OAF, and SOD2) are significantly involved in cellular detoxification during disease progression (Fig. 1g; Supplementary Table S6). To consolidate the specificity of the upregulated proteins in the severe patients, we performed additional PRM validation. The validation results indicate that 8 (S100A8, OAF, 40S ribosomal protein S28 (RPS28), SOD2, MB, GSTO1, D-dopachrome decarboxylase (DDT) and CAPNS1) out of the 10 upregulated proteins show significantly higher detection by PRM in the progression-to-severe samples than in the moderate ones (P < 0.05, Wilcoxon rank sum test, Fig. 1h, Supplementary Figs. 5, 6, and 7). Next, we used random forest machine learning to test whether any of these differentially expressed proteins can be used for early detection of potential severe COVID-19 patients. The results show that the combination of these eight proteins can only achieve an area under the curve (AUC) of ~0.80 (Supplementary Fig. 8). 
To further identify severe COVID-19 predictive factor proteins, we designed an iterative model building approach on the proteomics data to search for an optimal combination of proteins for prediction of severe COVID-19 patients (Supplementary


Study Oversight and Participants
Guangzhou Center for Disease Control and Prevention (Guangzhou CDC) has been authorized to perform laboratory testing for SARS-CoV-2 infection since the official announcement of the COVID-19 outbreak, according to the Chinese Law for Infectious Disease and Law for Health and Quarantine. All patients with laboratory-confirmed SARS-CoV-2 infection were referred to the No. 8 People's Hospital of Guangzhou, which specializes in infectious diseases and is one of the major hospitals designated to treat COVID-19 patients in the province.
Between January 27 and March 3, 2020, we recruited 73 adult COVID-19 patients who were laboratory confirmed at the Guangzhou CDC laboratory and then referred to the No. 8 People's Hospital of Guangzhou for treatment. All patients were divided into a development group (Group I: 23 patients recruited between January 27 and February 10) and a validation group (Group II: 50 patients recruited between February 13 and March 3). A total of 10 healthy volunteers were recruited as the control group. The diagnosis of COVID-19 was made following the Protocol for Novel Coronavirus Pneumonia Diagnosis and Treatment issued by the National Health Commission of the People's Republic of China. Written informed consent was obtained from each participant and the present study was approved by the Ethical Committee of Guangzhou CDC (GZCDC-ECHR-2020A0002).

Sample and Clinical Data collection
For all included COVID-19 patients (both Group I and Group II), serum samples and a panel of epidemiological, clinical, radiological and laboratory data were collected within 24 hours after admission. In addition, infection with other respiratory pathogens such as influenza A virus (H1N1, H3N2, H7N9), influenza B virus, respiratory syncytial virus (RSV), parainfluenza virus, adenovirus, SARS coronavirus and Middle East respiratory syndrome (MERS) coronavirus was ruled out by real-time polymerase chain reaction (RT-PCR) assays approved by the China Food and Drug Administration. Blood serum samples from group I and the healthy controls were collected for quantitative data-independent acquisition (DIA) proteomics analysis (Supplementary Fig. 1a). The serum samples were inactivated at 56 °C for 30 minutes before proteomics analysis.
Proteomics data analysis and confirmation
For deeper proteomic analysis, proteins were extracted from 100 µL of serum with 8 M urea (which denatures proteins) and vortexed for 5 min. Then 10 mM dithiothreitol (DTT) was added to the serum-urea mixture at 37 °C for 1 h, followed by 55 mM iodoacetamide (IAM) and incubation for 45 min in a dark room. Finally, the mixture was centrifuged at 25,000 × g for 20 min at 4 °C, and 1 mL of supernatant was passed through a Cleanert PEP 96-well plate fitted with individual 30 mg C18 columns (Agela Technologies, China), which bind the lower-abundance, more hydrophobic serum proteins and let the high-abundance proteins pass in the flow-through. The bound lower-abundance proteins were then eluted with 75% acetonitrile (ACN) (doi: 10.1021/acs.jproteome.9b00353). Samples were dried under vacuum, redissolved in 50 mM ammonium bicarbonate (ABC), digested by FASP (Sartorius, U.K.) at a protein-to-enzyme ratio of 50:1, then eluted with 70% ACN and dried in a vacuum centrifuge. A DIA strategy was adopted to quantify serum protein changes between patients and healthy controls, starting from a data-dependent acquisition (DDA) spectral library built by separating peptide mixtures from the 23 development cohort samples and 10 control samples (10 µg/sample) into 10 fractions on a high-performance liquid chromatography (HPLC) system (Shimadzu, Japan) coupled to a Gemini high-pH C18 column (4.6 × 250 mm, 5 μm) under basic conditions (pH 9.8). The peptides of each fraction, mixed with iRT peptides, were analyzed on an Orbitrap Fusion Lumos mass spectrometer (Thermo Fisher Scientific, CA, USA) and identified with the database search software MaxQuant (version 1.5.3.30) against the human UniProt protein database. 
Data for each individual serum peptide sample, mixed with the same amount of iRT peptides, were acquired in DIA mode as reported, and the data were analyzed with Spectronaut (12.0.20491.14.21367) using the DDA library above as the reference. Proteins with a fold change of at least 1.5 and P-value < 0.05 (Wilcoxon rank sum test) between case and control were defined as differential proteins. To confirm the alterations of these differential proteins, a targeted quantification approach, parallel reaction monitoring (PRM), was adopted for further analysis. For PRM confirmation, the 72 peptide precursors of selected differential proteins and the iRT precursors were scheduled with 16-min retention time windows on a Q-Exactive HF mass spectrometer (Thermo Fisher Scientific, CA, USA). PRM quantification was performed with Skyline software, and the intensities of three iRT peptides (DGLDAASYYAPVR, GAGSSEPVTGLDAK, TPVISGGPYEYR) were used to normalize the sample loading.
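The differential-protein criterion above (fold change of at least 1.5 in either direction and Wilcoxon rank sum P < 0.05) can be illustrated with a minimal sketch. Several details here are assumptions rather than the pipeline used in the study: the fold change is taken as a ratio of group medians, the P-value uses the tie-corrected normal approximation without continuity correction, and the function names `rank_sum_p` and `is_differential` are hypothetical.

```python
import math
from statistics import median


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank sum P-value (normal approximation, tie-corrected)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    # Rank sum of the first group, converted to the Mann-Whitney U statistic.
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = w - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


def is_differential(case, control, min_fold=1.5, alpha=0.05):
    """Apply the two-part criterion to positive intensities of one protein.

    Assumption: fold change is the ratio of group medians.
    """
    fold = median(case) / median(control)
    changed = fold >= min_fold or fold <= 1 / min_fold
    return changed and rank_sum_p(case, control) < alpha
```

For small groups without ties an exact test may be preferable to the normal approximation; the approximation is used here only to keep the sketch dependency-free.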
Validation of predictive factor performance
The accuracy of the predictive factor panel was validated in the sera from group II and healthy controls by ELISA. The ELISA kits were purchased from Beijing Andy Gene Co., Ltd; their catalogue numbers are S100A8 (Cat: AD11500Hu), OAF (Cat: AD12688Hu), RPS28 (Cat: AD12692Hu), SOD2 (Cat: AD12691Hu), MB (Cat: AD12155Hu), GSTO1 (Cat: AD12018Hu), DDT (Cat: AD10014Hu) and CAPNS1 (Cat: AD12690Hu), respectively. The concentrations of these proteins were determined according to the manufacturer's protocols. Absorbance was measured at 450 nm using HydroFlex microplate washer. All samples were analyzed in triplicate, and the average concentration for each patient was calculated.
Statistical analysis and machine-learning
Based on disease progression, all 73 patients were classified into a progression-to-severe group and a moderate group. Categorical variables were summarized as percentages and compared between the two groups using the χ² test or Fisher's exact test, whichever was deemed appropriate. Continuous variables were expressed as median and interquartile range (IQR). All data were analyzed with R (v3.5). Analysis of differentially expressed proteins was performed with the MSstats R package, which includes log2 transformation, normalization and P-value calculation, on the Spectronaut and Skyline quantitative data. PCA was performed with the pcaMethods R package. GO enrichment analysis was performed with the topGO R package with default parameters.
An iterative random forest machine-learning approach with 5-fold cross-validation was designed to identify the optimal panel of clinical and proteomic features for early detection of severe patients. Briefly, selected clinical and proteomic features from the patients were extracted for random forest modeling with 5-fold cross-validation, using disease severity as the response variable. Five-fold cross-validation was employed to limit potential overfitting of the random forest model. To reduce potential variability of the random forest models, each round of random sampling for the training and testing datasets in model building was repeated 100 times, and all reported results were averaged. All possible combinations of the selected clinical and proteomic features were exhaustively tested by this iterative machine-learning process, and the predictive power of each combination was recorded and ranked according to the AUC value from receiver operating characteristic (ROC) analysis. The average sensitivity and specificity scores for each feature combination were also recorded. The combination of features with the highest AUC value was considered the best for distinguishing severe from moderate patients, and individual features can be ranked according to their importance scores in the corresponding random forest model.

a Validation of target peptides of SAA2, CRP, ATP6V1G2 and H2AC4 by PRM. b Cumulative bar plots of peak area of retention curves in a for patients and controls. Color code is same as a. c Boxplots of PRM detection intensity of SAA2, CRP, ATP6V1G2 and H2AC4 in patients and controls. * P-value < 0.05, ** P-value < 0.01, Wilcoxon rank sum test.

Figure S4. Boxplots of PRM intensity for the 10 DE proteins between severe and moderate patients.
Data from the normal controls are also shown. All of these 10 proteins showed significant upregulation in the severe vs. moderate patient comparison. However, 7 of these proteins showed insignificant change between moderate patients and normal controls, and 3 even showed significant up-regulation in normal controls. * P-value < 0.05, ** P-value < 0.01, ns: not significant, Wilcoxon rank sum test.

a Validation of target peptides of CAPNS1, SOD2, GSTO1, RPS28 and S100A8 by PRM. b Cumulative bar plots of peak area of retention curves in a for severe and moderate patients. Color code is same as a. c Boxplots of PRM detection intensity of CAPNS1, SOD2, GSTO1, RPS28 and S100A8 in severe and moderate patients. * P-value < 0.05, ** P-value < 0.01, Wilcoxon rank sum test.

Flowchart of the iterative combination search. For each number of indicators i, n schemes including i indicators are generated and the local combination pointer is initialized to j = 0. While the local combination search is unfinished (j < n), the random forest model is trained and evaluated on the jth combination scheme through 5-fold cross-validation, the combination scheme and its evaluation score are logged, and j is set to j + 1. The chart then checks whether the score of the combination scheme meets the conditions and whether all global combination schemes have been searched (i compared against 8), after which the optimal combination scheme is picked.
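The search loop in the workflow above (enumerate every combination of i indicators, score each by repeated 5-fold cross-validation, log the score, and pick the best) can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the implementation used in the study: `centroid_scores` is a simple stand-in model that replaces the random forest, folds are stratified by class, and the AUC of each repeat is computed on pooled out-of-fold scores. All function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def auc(scores, labels):
    """AUC as the fraction of (severe, moderate) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(train_X, train_y, test_X):
    """Stand-in model (NOT a random forest): scaled offset from the class midpoint."""
    weights, mids = [], []
    for f in range(len(train_X[0])):
        sev = [row[f] for row, y in zip(train_X, train_y) if y == 1]
        mod = [row[f] for row, y in zip(train_X, train_y) if y == 0]
        spread = pstdev([row[f] for row in train_X]) or 1.0
        weights.append((mean(sev) - mean(mod)) / spread ** 2)
        mids.append((mean(sev) + mean(mod)) / 2)
    return [sum(w * (v - m) for w, v, m in zip(weights, row, mids))
            for row in test_X]


def stratified_folds(labels, k, rng):
    """Deal shuffled indices of each class round-robin into k folds (assumption)."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cv_auc(X, y, features, model, k=5, repeats=100, seed=0):
    """Mean AUC over repeated k-fold CV, using pooled out-of-fold scores."""
    rng = random.Random(seed)
    sub = [[row[f] for f in features] for row in X]
    aucs = []
    for _ in range(repeats):
        oof = [0.0] * len(y)
        for fold in stratified_folds(y, k, rng):
            held = set(fold)
            train = [i for i in range(len(y)) if i not in held]
            preds = model([sub[i] for i in train], [y[i] for i in train],
                          [sub[i] for i in fold])
            for i, p in zip(fold, preds):
                oof[i] = p
        aucs.append(auc(oof, y))
    return mean(aucs)


def rank_combinations(X, y, n_features, model=centroid_scores,
                      max_size=None, **cv_options):
    """Score every feature combination and return (AUC, combination), best first."""
    max_size = max_size or n_features
    results = []
    for size in range(1, max_size + 1):          # outer loop over i
        for combo in combinations(range(n_features), size):  # inner loop over j
            results.append((cv_auc(X, y, combo, model, **cv_options), combo))
    return sorted(results, reverse=True)
```

Any function with the same `(train_X, train_y, test_X)` signature, such as a wrapper around a random forest implementation, could be passed as `model`. With eight candidate proteins the exhaustive search covers 2^8 - 1 = 255 combinations, which keeps 100 repeats per combination tractable.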