Introduction

The novel coronavirus (2019-nCoV, or COVID-19) was first identified at the end of 2019 and has rapidly spread worldwide. It causes severe acute respiratory syndrome and can lead to pneumonia1. Detecting and monitoring the disease as early as possible is paramount to preventing progression. COVID-19 shares overlapping signs, symptoms, laboratory findings and imaging features with other respiratory viruses, which might complicate its diagnosis, treatment, and prognosis2. Recently, the under-detection of many infectious diseases has increased, partly because of the prevalence of the novel coronavirus2. Influenza, a contagious viral disease that causes respiratory illness, shares clinical manifestations with COVID-19, including fever, cough, rhinitis, sore throat, headache, shortness of breath, and myalgia3,4. Different subtypes of the influenza A virus, including H1N1 and H3N2, as well as influenza B, are currently circulating among individuals as seasonal influenza viruses5. The co-occurrence of influenza and COVID-19 may increase in the cold months of the year. Both viruses are spread from person to person primarily by airborne droplets6. Failures in the differential detection of COVID-19 may result in higher hospitalization rates, prolonged stays in intensive care units, and an increased chance of death in patients7,8.

Searching for virus-specific genetic material via real-time quantitative polymerase chain reaction (RT-qPCR) is, so far, the most reliable method for detecting the coronavirus. However, RT-qPCR on virus-specific genetic material cannot distinguish between active infection and colonization, whereas host-response biomarkers can7,9,10. Furthermore, RT-qPCR can have a high rate of false-negative results because of incorrect sampling and the low viral load in some individuals, which can also change over time. This makes it essential to use host-specific biomarkers as a complementary tool to ensure accurate diagnosis of the presence or type of infection in at-risk hosts11,12,13.

Significant changes to the host transcriptome following COVID-19 infection have recently been observed in numerous tissues and sample types, including respiratory epithelial cells, nasopharynx, colonocytes, and whole blood or plasma14,15. Therefore, transcriptomics can be used effectively to identify host transcriptional signatures affected by COVID-19, paving the way for the creation of novel diagnostic biomarkers and therapeutic strategies12. To find virus-specific transcriptional signatures, it is also necessary to understand the host response to COVID-19 infection in comparison to other respiratory infections16. Although several candidate gene biomarkers have been proposed so far, none of them has enabled an efficient diagnosis, and particularly a differential diagnosis, of COVID-19 in samples.

In the present study, we hypothesized that novel and potentially more specific blood biomarkers of a disease could be identified by searching for the differentially expressed genes (DEGs) involved in the pathways shared between blood and the major organs affected by the disease. We validated this hypothesis using machine learning methods and found that these potential biomarkers included signatures that could accurately differentiate COVID-19 from influenza blood samples7. To identify potential COVID-19 Specific Blood Differentially expressed genes (SpeBDs), we implemented a strategy based on finding the pathways shared between peripheral blood (PB) and the most involved tissues in COVID-19 patients (lung tissue, nasopharyngeal swab and bronchoalveolar lavage fluid (BALF)), and we filtered the blood DEGs according to whether they play a role in those shared pathways. Furthermore, potential Differential Blood DEGs of COVID-19 versus influenza (DifBDs) were identified by extracting the DEGs involved only in pathways enriched by SpeBDs and not by influenza DEGs. Then, a machine learning method (feature selection) was used to narrow down the number of SpeBDs and DifBDs and find the most predictive combination of DEGs. This step was performed to select potential COVID-19 Specific Blood Biomarker Signatures (SpeBBSs) and COVID-19 versus influenza Differential Blood Biomarker Signatures (DifBBSs), respectively. The models based on the SpeBBSs or DifBBSs and the corresponding algorithms were then validated on an external dataset. Accuracy (ACC), area under the curve (AUC) and Matthews correlation coefficient (MCC) were calculated to measure the power of the machine learning models constructed from SpeBBSs and DifBBSs. The steps of this study are shown in Fig. 1.

Figure 1

A workflow representing the main steps of the present study. Designed using the diagrams.net online tool available at https://app.diagrams.net/.

Materials and methods

Datasets selection

For finding SpeBDs, transcriptomic profiles of COVID-19 infected versus control samples from PB and three sources related to the respiratory system, the most involved tissues in COVID-19, were considered, including Lung Tissue (Lung), Nasopharyngeal Swab (Swab), and Bronchoalveolar Lavage Fluid (BALF). Datasets of PB, Lung, and Swab sources were obtained from GEO database17. Also, the differential expression analysis (DEA) data of the BALF source was obtained from Zhou et al.’s18 and Li et al.’s study19. In addition, datasets of the three types of Influenza (H1N1, H3N2, and B) were used to discover DifBDs. Table 1 provides all the information about dataset IDs, data production platforms, and sample sizes.

Table 1 Publicly available biomarker discovery datasets.

Differential expression analysis

Among RNAseq datasets of COVID-19, GEO raw data of GSE155241 (Table 1) were analyzed by the Galaxy web server (https://usegalaxy.org/)20. Quality control was executed with FastQC (version 0.11.8). The reads were aligned to the human reference genome file (Gencode, release 32, hg38 https://www.gencodegenes.org/human/releases.html) using HISAT2 (version 2.1.0) with default parameters. The reads mapped to the human reference genome were counted using featureCounts Galaxy Version 2.0.1 and default parameters.

The RNAseq count files of this study and all other RNAseq datasets of COVID-19 (Table 1) were analyzed with the following methodology: Bioconductor’s DESeq2 package was used to identify DEGs from the normalized expression dataset, mining statistically significant DEGs based on the difference in their expression values between COVID-19 and control samples. DEGs with |log2FC| ≥ 1 and adjusted p-value ≤ 0.05 were considered to be significantly differentially expressed. Also, the DEA results of COVID-19 and Healthy BALF samples from Zhou et al.’s study and Li et al.’s study were filtered by |log2FC| ≥ 1 and adjusted p-value < 0.05. After obtaining the DEGs of the COVID-19 datasets related to the four sources (Swab, BALF, Lung, and PB), the DEA results of the sources with more than one dataset (BALF and Lung) were integrated using a Venn diagram. For the Swab and PB sources, only one dataset was available each, so plotting a Venn diagram was not required.
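The significance filter described above can be sketched in a few lines. This is a minimal illustration in Python, not the R/DESeq2 code used in the study; the `GeneResult` record, the `select_degs` helper, and the toy gene names are hypothetical stand-ins for a DESeq2 results table.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    gene: str
    log2fc: float  # log2 fold change, COVID-19 versus control
    padj: float    # multiple-testing adjusted p-value


def select_degs(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Keep genes with |log2FC| >= cutoff and adjusted p <= cutoff,
    split into upregulated and downregulated lists."""
    up, down = [], []
    for r in results:
        if r.padj <= padj_cutoff and abs(r.log2fc) >= lfc_cutoff:
            (up if r.log2fc > 0 else down).append(r.gene)
    return up, down


# Hypothetical rows; real input would come from the DEA output.
toy_results = [
    GeneResult("GENE_A", 2.3, 0.001),
    GeneResult("GENE_B", -1.4, 0.020),
    GeneResult("GENE_C", 0.6, 0.001),   # fold change too small
    GeneResult("GENE_D", -3.0, 0.200),  # not significant
]
up, down = select_degs(toy_results)
print(up, down)  # ['GENE_A'] ['GENE_B']
```

The cutoffs are parameters, so the same helper could apply the looser |log2FC| ≥ 0.4 threshold used for the influenza microarray data.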

For the three influenza types, we selected microarray raw data (Table 1). Microarray data were pre-processed, merged, and analyzed independently for each influenza type using the R programming language. The series matrices were downloaded from the GEO database. Quantile normalization and log transformation were performed on the datasets. The aggregate function was used to average multiple expression values assigned to the same gene symbol. Within each influenza type, all datasets were produced on the same GPL platform and all samples came from peripheral blood, so the data were homogeneous and we integrated them for each influenza type independently using the merging method. To remove the batch effect between datasets, we used the ComBat function from the SVA package. Finally, DEA between each of the three types of influenza and healthy samples was conducted independently using the Limma package. DEGs with a false discovery rate adjusted p-value < 0.05 and |log2FC| ≥ 0.4 were considered significant.
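The step that averages multiple expression values mapped to the same gene symbol can be illustrated as follows. This is a small Python sketch of the idea only (the study used R's aggregate function); the `average_by_symbol` helper and the probe values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def average_by_symbol(probe_rows):
    """Collapse probe-level rows into one row per gene symbol by
    averaging each sample column across the probes of that gene."""
    grouped = defaultdict(list)
    for symbol, values in probe_rows:
        grouped[symbol].append(values)
    return {
        symbol: [fmean(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }


# Hypothetical log-scale expression values for three samples.
toy_probes = [
    ("GENE_A", [8.0, 9.0, 7.0]),
    ("GENE_A", [6.0, 7.0, 9.0]),  # second probe for the same gene
    ("GENE_B", [5.0, 5.5, 6.0]),
]
collapsed = average_by_symbol(toy_probes)
print(collapsed["GENE_A"])  # [7.0, 8.0, 8.0]
```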

Biomarker discovery using pathway enrichment analysis

SpeBDs discovery

The pathway enrichment analysis by the Reactome database in Enrichr web-based tool21 was performed for DEGs of each source (Swab, BALF, Lung, and PB) independently. Enriched pathways with adjusted p-value < 0.05 were considered significant. After that, common pathways of each Swab, BALF, and Lung source with PB were found, and the DEGs that had enriched those common pathways in PB were extracted. These DEGs were considered as SpeBDs.
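The extraction logic amounts to set operations on pathway names. The sketch below assumes the enrichment output has already been parsed into a mapping from each significant PB pathway to the PB DEGs that enriched it; the function name, pathway labels, and gene labels are hypothetical.

```python
def extract_spebds(pb_pathways, tissue_pathways):
    """Collect PB DEGs that enriched any pathway shared between PB
    and at least one respiratory source.

    pb_pathways: dict mapping pathway name -> set of PB DEGs
    tissue_pathways: dict mapping source name -> set of pathway names
    """
    spebds = set()
    for source, pathways in tissue_pathways.items():
        common = pb_pathways.keys() & pathways  # pathways shared with PB
        for pathway in common:
            spebds |= pb_pathways[pathway]
    return sorted(spebds)  # duplicates across sources are removed


# Hypothetical enrichment results.
pb = {
    "Pathway 1": {"G1", "G2"},
    "Pathway 2": {"G2", "G3"},
    "Pathway 3": {"G4"},
}
sources = {
    "BALF": {"Pathway 1", "Pathway 9"},
    "Lung": {"Pathway 2"},
    "Swab": {"Pathway 1"},
}
print(extract_spebds(pb, sources))  # ['G1', 'G2', 'G3']
```

Here G4 is dropped because its only pathway is not shared with any respiratory source, which is the filtering effect the strategy relies on.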

DifBDs discovery

In order to find DifBDs, the pathway enrichment analysis for the three types of influenza (H1N1, H3N2, and B) was performed independently by the Reactome database of Enrichr. The pathway enrichment analysis for the SpeBDs was performed as well. A pathway was considered significant if the adjusted p-value was smaller than 0.05. Then, a Venn diagram was constructed including the significant pathways of SpeBDs, H1N1, H3N2, and B. Significant specific pathways of COVID-19 that were not enriched in any of the influenza types were selected. After that, the SpeBDs of COVID-19 that had enriched those pathways were extracted. These DEGs were considered as DifBDs.
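The Venn-diagram step can be expressed as a set difference. The following sketch assumes the significant pathways have been parsed into Python sets; `extract_difbds` and all pathway and gene labels are hypothetical.

```python
def extract_difbds(spebd_pathways, influenza_pathways):
    """Keep pathways enriched by SpeBDs but by none of the influenza
    types, then collect the SpeBDs that enriched them.

    spebd_pathways: dict mapping pathway name -> set of SpeBDs
    influenza_pathways: dict mapping influenza type -> set of pathway names
    """
    influenza_union = set().union(*influenza_pathways.values())
    specific = {p for p in spebd_pathways if p not in influenza_union}
    genes = set()
    for pathway in specific:
        genes |= spebd_pathways[pathway]
    return sorted(specific), sorted(genes)


# Hypothetical enrichment results.
spebd_paths = {"P1": {"G1", "G2"}, "P2": {"G2", "G3"}, "P3": {"G4"}}
flu_paths = {"H1N1": {"P2", "P7"}, "H3N2": {"P8"}, "B": {"P2"}}
specific_pathways, difbds = extract_difbds(spebd_paths, flu_paths)
print(specific_pathways)  # ['P1', 'P3']
print(difbds)             # ['G1', 'G2', 'G4']
```

Note that G2 is retained even though it also appears in the shared pathway P2, because it enriched the COVID-19-specific pathway P1.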

Choosing the best biomarker signatures and validation by machine learning

RapidMiner Studio (version 9.7), a powerful tool for biomarker discovery, was used to extract and validate biomarker signatures from SpeBDs and DifBDs22,23,24,25,26.

In this study, a two-step machine learning approach was implemented. First, we employed four classifiers (k-NN, Random Forest, SVM, Naïve Bayes) to supervise the wrapper feature selection method and extract the best combination of biomarkers from the feature selection dataset (an external dataset different from the discovery datasets but containing SpeBDs or DifBDs). In the next step, the models based on the optimal subset of biomarkers and the corresponding algorithms (the same algorithms that were applied in feature selection to select them) were validated on the validation dataset (another external dataset different from the discovery and feature selection datasets). The logic behind this strategy was that the algorithm used to supervise a wrapper method performs best for the selected subset of features among the other candidate combinations of features. We can therefore use that algorithm to build a model (biomarker panel) based on the corresponding features (SpeBBSs or DifBBSs) and test the model on an external dataset to validate it. The purpose of employing four classifiers was to obtain four subsets of genes and build four models and biomarker panels. In this way, we could consider four biomarker panels with high classification power and introduce the best one as our minimal biomarker panel.

Feature selection

A biomarker panel containing fewer genes would be more practical to test in a clinical assay27,28. So, we decided to choose a small set of the most predictive biomarker signatures from SpeBDs to be introduced as COVID-19 potential Specific Blood Biomarker Signatures (SpeBBSs) and from DifBDs to be introduced as COVID-19 versus influenza Differential Blood Biomarker Signatures (DifBBSs). To do so, we applied a machine learning method (feature selection) using the Optimize Selection (forward selection type) operator implemented in RapidMiner. Forward Selection is a kind of wrapper feature selection approach. Here, we employed four classifiers (k-NN, Random Forest, SVM, Naïve Bayes) to supervise the wrapper method and extract the best combination of biomarkers from the feature selection dataset.

The Forward Selection strategy initially uses only one attribute (in our case, each attribute is a SpeBD or DifBD). Additional attributes are added until there is no more performance gain by adding an attribute.
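A greedy forward selection loop of this kind can be sketched as below. This is an illustrative Python version, not the RapidMiner operator: the scorer is a leave-one-out 1-nearest-neighbour accuracy chosen for brevity (the study used four classifiers with ten-fold cross-validation), and the feature names `g1` to `g3` and all values are hypothetical.

```python
def forward_selection(features, score):
    """Add one feature at a time, keeping the addition that scores best,
    and stop as soon as no addition improves the score."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best:  # no performance gain: stop
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


def loo_1nn_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given features (squared Euclidean distance)."""
    hits = 0
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = min(
            others,
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(rows)


# Hypothetical expression values: g1 and g2 separate the classes only jointly.
rows = [
    {"g1": 0.0, "g2": 2.0, "g3": 0.30},
    {"g1": 1.1, "g2": 3.0, "g3": 0.95},
    {"g1": 2.0, "g2": 4.1, "g3": 0.10},
    {"g1": 2.1, "g2": 0.0, "g3": 0.80},
    {"g1": 3.0, "g2": 1.1, "g3": 0.18},
    {"g1": 4.1, "g2": 2.1, "g3": 0.70},
]
labels = [0, 0, 0, 1, 1, 1]

selected, best = forward_selection(
    ["g1", "g2", "g3"], lambda feats: loo_1nn_accuracy(rows, labels, feats)
)
print(selected, best)  # ['g1', 'g2'] 1.0
```

In this toy case g1 alone classifies four of six samples correctly, adding g2 makes the classification perfect, and g3 is never added because it brings no further gain.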

RapidMiner provides several other methods for feature selection, including Brute Force, Evolutionary algorithm, and Backward Elimination29. The Optimize Selection (Brute Force) operator examines all possible combinations of attributes to select the most relevant attribute set. Because of this exhaustive examination, the method is not applicable to high-dimensional data30. The Optimize Selection (Evolutionary) operator selects the most relevant attributes of the dataset using evolutionary algorithms, e.g. a genetic algorithm (GA). Backward Elimination starts with all features and removes the worst feature in each step30. We tried the Optimize Selection (Evolutionary) and Optimize Selection (Backward Elimination) operators of RapidMiner, but these algorithms showed lower performance with a low number of features compared to the Forward Selection strategy. The purpose of feature selection in this study is to select a small set of biomarker signatures because such a panel would be more clinically applicable. We therefore chose the Optimize Selection (Forward Selection) operator, which performed better in selecting a small set of biomarker signatures.

SpeBBSs discovery and validation

The count values of SpeBDs were extracted from dataset GSE166190 and Bibert et al.’s dataset-A31, which included peripheral blood samples of healthy people and COVID-19 infected patients. Table 2 lists the sample size and platform properties of these datasets.

Table 2 Datasets used for feature selection and validation of blood biomarker signatures by machine learning methods.

The rlog function of the DESeq2 package was used to convert the raw counts to normalized logarithmic counts. The dataset was then transposed (samples in rows and SpeBD genes in columns), and the disease status was converted to a binominal label (Healthy = 0 and COVID-19 = 1) to prepare the input dataset for machine learning. After that, the two-step machine learning procedure was used to narrow down the SpeBDs to obtain SpeBBSs (feature selection phase using an external dataset (GSE166190)) and to validate the SpeBBSs (validation phase using another external dataset (Bibert et al.’s dataset-A)). In each phase, the five indicators (ACC, Spe, Sen, MCC, and AUC) were calculated for the feature selections and models constructed by the four algorithms.
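The transposition and label encoding can be illustrated with a short sketch. It assumes the rlog-transformed values are already available as a gene-by-sample mapping; `build_ml_table`, the gene names, and the sample identifiers are hypothetical, and the rlog transformation itself is not reproduced here.

```python
def build_ml_table(expr_by_gene, samples, status_by_sample, positive="COVID-19"):
    """Transpose a gene-by-sample matrix into one row per sample and
    append a 0/1 disease label (positive class = 1)."""
    table = []
    for col, sample in enumerate(samples):
        row = {gene: values[col] for gene, values in expr_by_gene.items()}
        row["label"] = 1 if status_by_sample[sample] == positive else 0
        table.append(row)
    return table


# Hypothetical normalized values for two genes and three samples.
samples = ["S1", "S2", "S3"]
expr = {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [4.0, 5.0, 6.0]}
status = {"S1": "Healthy", "S2": "COVID-19", "S3": "COVID-19"}

table = build_ml_table(expr, samples, status)
print(table[0])  # {'GENE_A': 1.0, 'GENE_B': 4.0, 'label': 0}
```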

DifBBSs discovery and validation

The count values of DifBDs were extracted from dataset GSE161731-B and Bibert et al.’s dataset B31, which included peripheral blood samples of Influenza and COVID-19 infected patients. The sample size and platform properties of these datasets are listed in Table 2. In order to construct the input for RapidMiner software, the binominal disease status (Influenza = 0 and COVID-19 = 1) was added to rlog transformed, transposed counts files of the two datasets. The same two-step procedure for selecting and validating the SpeBBSs was applied to select DifBBSs among DifBDs (feature selection phase using an external dataset (GSE161731-B)) and validate them (validation phase using another external dataset (Bibert et al.’s dataset-B)). In each phase, the five indicators (ACC, Spe, Sen, MCC, and AUC) were calculated for the feature selections and constructed models by the four algorithms.

Performance evaluation

The ten-fold cross-validation strategy was employed to evaluate the performance of constructed models in this study. In ten-fold cross-validation, the input (samples) is divided into ten equal parts. One of the ten parts is retained as the test data set. The other parts are used as inputs of the training subprocess. Cross-validation is repeated ten times and every time one of the subsets plays the role of the test dataset. The ten results are then averaged to obtain a single result.
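The procedure can be sketched as follows. This is a plain shuffled split written for illustration; whether the RapidMiner runs used stratified or shuffled sampling is not stated here, so that choice, the helper names, and the sample count of 47 are assumptions.

```python
import random


def ten_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal parts
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10):
    """Average the per-fold results returned by evaluate(train, test)."""
    scores = [evaluate(train, test) for train, test in ten_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Hypothetical dataset of 47 samples; the evaluator here only reports fold size.
splits = list(ten_fold_indices(47))
mean_test_size = cross_validate(47, lambda train, test: len(test))
print(len(splits), mean_test_size)
```

In practice `evaluate` would train a classifier on the `train` indices and return its performance on the `test` indices.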

The performance of classification was measured in terms of five common measurements: Accuracy (ACC), Sensitivity (Sen), Specificity (Spe), the Matthews correlation coefficient (MCC), and area under the curve (AUC). The first four were calculated from the true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) counts using their standard formulas, and AUC was calculated by plotting a ROC curve.
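The four count-based measurements follow the standard textbook definitions, which can be written out as a short Python sketch. The function name and the example counts are hypothetical; returning 0 when a denominator is zero is a common convention adopted here as an assumption.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Standard definitions of ACC, Sen, Spe and MCC from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    spe = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sen": sen, "Spe": spe, "MCC": mcc}


# Hypothetical balanced example.
balanced = classification_metrics(tp=45, tn=40, fp=10, fn=5)
print(balanced)  # ACC 0.85, Sen 0.9, Spe 0.8, MCC about 0.70

# Hypothetical imbalanced example: every sample is predicted positive.
imbalanced = classification_metrics(tp=95, tn=0, fp=5, fn=0)
print(imbalanced)  # ACC 0.95 but MCC 0.0
```

The second call shows why accuracy alone can look favourable on imbalanced data while MCC does not.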

When datasets are imbalanced in binary classification problems, MCC gives more information than other measures like accuracy because it considers the balance ratios of the four confusion matrix categories (TP, TN, FP, and FN). The accuracy score can be misleading since it does not fully consider the sizes of these four categories in its final calculation. However, we reported this indicator as the most intuitive evaluation metric. The MCC value lies between −1 and 1: a value of 1 corresponds to a model with the best performance, 0 is equivalent to random prediction, and −1 indicates complete disagreement between reality and prediction25. We also used AUC, which is a standard, threshold-independent measure. AUC is the area under the ROC curve generated by plotting sensitivity (true positive rate) against the false positive rate.
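As an illustration of how AUC is obtained from a ROC curve, the sketch below sweeps a threshold over the predicted scores, records the (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. The function name and the scores are hypothetical, and the input is assumed to contain at least one sample of each class.

```python
def roc_auc(scores, labels):
    """Return (AUC, ROC points) for binary labels (1 = positive)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this threshold before adding a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    auc = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return auc, points


# Hypothetical classifier confidences for six samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc, points = roc_auc(scores, labels)
print(round(auc, 4))  # 0.8889
```

The value 8/9 equals the fraction of positive and negative pairs in which the positive sample receives the higher score, which is the usual interpretation of AUC.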

The following parameters were set for the four classifiers of this study:

k-NN: K: 5; Weighted vote: true; Measure types: MixedMeasures; Mixed measure: MixedEuclideanDistance.

Random Forest: Number of trees: 100; Criterion: gain_ratio; Maximal depth: 10; Voting strategy: confidence vote; Guess subset ratio: true.

SVM: Kernel type: Dot; C: 0.00; Convergence epsilon: 0.001; L pos: 1.0; L neg: 1.0; Epsilon: 0.0; Epsilon plus: 0.0; Epsilon minus: 0.0.

Naïve Bayes: Laplace correction: true.

Results

Differential expression analysis and biomarker discovery using pathway enrichment analysis

SpeBDs discovery

The DEA between COVID-19 and Healthy PB samples of dataset GSE161731-A (Table 1) resulted in 624 DEGs including 271 upregulated and 353 downregulated genes. The pathway enrichment analysis of these up and downregulated DEGs resulted in 113 significant pathways which are listed in Tables S1 and S2.

The DEGs from the differential analysis between COVID-19 and Healthy BALF samples in the Zhou et al. and Li et al. studies were obtained. The Venn diagram plotted for these two groups of DEGs from the BALF source then resulted in 890 DEGs, including 475 upregulated and 415 downregulated genes. The pathway enrichment analysis of these DEGs resulted in 36 significant pathways (Tables S3 and S4). Thirty-one of these significant pathways were shared with the significant pathways of PB, and we extracted 95 DEGs from the PB dataset that had enriched those common pathways (Figs. 2 and 3).

Figure 2

Common Pathways of PB with BALF, Lung, and Swab, their adjusted p-values in pathway enrichment analysis, and the list of extracted SpeBDs from them. The figure is generated using RStudio version 2022.12.0 and Adobe Illustrator version 24.2.1.

Figure 3

Extraction of SpeBDs from PB DEGs of COVID-19 patients with the help of the common pathways between PB and the three sources from the respiratory system of COVID-19 patients (Swab, BALF, and Lung). The complete list of SpeBDs is shown in this figure. Lung, Lung tissue biopsy; Swab, nasopharyngeal swab; BALF, bronchoalveolar lavage fluid; PB, peripheral blood. The figure is created using Cytoscape version 3.8.2 and Adobe Illustrator version 24.2.1.

The DEA was performed between COVID-19 and Healthy Lung samples of datasets GSE147507, GSE150316, GSE155241, and GSE159787. The Venn diagram plotted for the four datasets of the Lung source resulted in 15 upregulated and 42 downregulated common genes, a total of 57 DEGs. The pathway enrichment analysis of these 57 DEGs resulted in 9 significant pathways (Tables S5 and S6). Two of these significant pathways were shared with PB, and we extracted 54 DEGs of the PB dataset that had enriched those common pathways (Figs. 2 and 3).

The DEA between COVID-19 and Healthy Swab samples of dataset GSE156063 resulted in 207 upregulated and 379 downregulated genes, a total of 586 DEGs. Pathway enrichment analysis of these DEGs resulted in 91 significant pathways, which are listed in Tables S7 and S8. Six of these were shared with the PB significant pathways, and 74 DEGs of the PB dataset that had enriched those common pathways were extracted (Figs. 2 and 3).

Finally, from all the DEGs extracted from the PB dataset in this step (from common pathways of PB with BALF: 95 DEGs, with Lung: 54 DEGs, and with Swab: 74 DEGs), duplicated DEGs were removed, and 108 unique SpeBDs were obtained (Fig. 3). The complete list of SpeBDs and their extraction sources is shown in Figs. 2 and 3. Moreover, a pathway enrichment analysis was performed for the SpeBDs, and 152 significant pathways were enriched (Table S9).

DifBDs discovery

In order to obtain DEGs of Influenza H1N1, the four related datasets including GSE111368-A, GSE90732, GSE68310-A, and GSE61821-A were integrated. DEA resulted in 309 upregulated and 208 downregulated, a total of 517 DEGs. These DEGs were enriched in 79 significant pathways (Tables S10 and S11).

To obtain DEGs of Influenza H3N2, three datasets including GSE61754, GSE29385, and GSE61821-B were integrated. The results of DEA were 1139 DEGs including 854 upregulated and 285 downregulated genes. The DEGs were enriched in 11 significant pathways (Tables S12 and S13).

Also, the two datasets of Influenza B (GSE111368-B and GSE68310-B) were integrated, and the DEA resulted in 976 DEGs including 512 upregulated and 464 downregulated genes. The pathway enrichment analysis for these DEGs resulted in 186 significant pathways (Tables S14 and S15).

Finally, a Venn diagram of the significantly enriched pathways of influenza H1N1, H3N2, B, and SpeBDs was plotted (Fig. 4A). Eighty-three pathways were specifically enriched by SpeBDs and not by any of the influenza types. The 87 SpeBDs that enriched those pathways were extracted for further analysis and named DifBDs. A list of these COVID-19-specific pathways and the DifBDs extracted from them is provided in Fig. 4B.

Figure 4

(A) Venn diagram representing the pathways enriched by SpeBDs, Influenza H1N1 PB DEGs, Influenza H3N2 PB DEGs, and Influenza B PB DEGs, constructed using an online tool available at https://bioinformatics.psb.ugent.be/webtools/Venn/. The red circle marks the pathways that were enriched by SpeBDs and not by the three Influenza types; these pathways are listed in part B. (B) The eighty-three pathways that were obtained from pathway enrichment analysis of SpeBDs and differed from the pathways obtained by pathway enrichment analysis of Influenza H1N1, H3N2, and B DEGs; created using RStudio version 2022.12.0 and Adobe Illustrator version 24.2.1.

Choosing the best gene signature and validation by machine learning

SpeBBSs discovery and validation

To select the best subset of SpeBDs to be introduced as SpeBBSs, a feature selection method was applied using an external dataset containing SpeBDs (GSE166190). These biomarker signatures were then validated on another external dataset (Bibert et al.’s dataset-A). All four classifiers used for evaluating the performance of the feature selection method showed high robustness in terms of AUC and ACC (ACC higher than 92.86% and AUC higher than 86.10% on the feature selection dataset). The models based on these algorithms and the SpeBBSs also had ACCs higher than 90.77% and AUCs higher than 96.30% on the validation dataset. Feature selection using Random Forest provided the highest ACCs and AUCs: 95.92% ACC and 93.80% AUC on the feature selection dataset, and 93.09% ACC and 98.00% AUC on the validation dataset for the model based on this classifier and the three selected SpeBBSs (Fig. 5A,B). Feature selection using this classifier chose IGKC, IGLV3-16, and SRP9 as SpeBBSs. The feature selection and the model based on this algorithm had the second-highest MCC on both datasets (83.64% on the feature selection dataset and 78.70% on the validation dataset) (Fig. 5C,D). Furthermore, they showed the highest sensitivity with an acceptable level of specificity.

Figure 5

The ten-fold cross-validation results of the feature selection method in choosing SpeBBSs and the constructed machine learning models; (A) ROC curves representing classification ability of the feature selection method by the four classifiers on GSE166190 dataset (the feature selection dataset); (B) ROC curves representing classification powers of the constructed models based on the selected SpeBBSs and corresponding algorithms (the same algorithms that were applied in feature selection step) on Bibert et al.’s dataset A (the validation dataset). These ROC curves show ROC (red lines) at various threshold settings (blue lines). In the ROC curves, the x-axis shows 1-specificity, and the y-axis shows sensitivity. (C) Four measures indicating the classification power of the feature selection method by the four classifiers on GSE166190 dataset (the feature selection dataset); (D) Four measures indicating the power of constructed models based on the selected SpeBBSs and the corresponding algorithms (the same algorithms that were applied in feature selection step) on Bibert et al.’s dataset A (the validation dataset). FS: feature selection.

DifBBSs discovery and validation

To select the most predictive subset of DifBDs to be introduced as DifBBSs, a feature selection method was applied using an external dataset containing DifBDs (GSE161731-B). These biomarker signatures were then validated on another external dataset (Bibert et al.’s dataset B). The forward selection method using all four classifiers had ACCs higher than 97.87% and AUCs higher than 95.00% on the feature selection dataset. Models built on them and the corresponding DifBBSs showed ACCs higher than 82.4% and AUCs higher than 83.60% on the validation dataset (Fig. 6A,B). Among them, the feature selection using Naive Bayes had a high performance on the feature selection dataset, and the model constructed with this classifier and the corresponding DifBBSs showed the highest performance on the validation dataset in terms of MCC. In addition, the feature selection and the model built on this algorithm showed high levels of sensitivity and specificity in both datasets (Fig. 6C,D). The forward selection method using this classifier chose FMNL2, IGHV3-23, IGLV2-11, and RPL31 as DifBBSs.

Figure 6

The ten-fold cross-validation results of the feature selection method in choosing DifBBSs and the constructed machine learning models; (A) ROC curves representing classification ability of the feature selection method by the four classifiers on GSE161731-B dataset (the feature selection dataset); (B) ROC curves representing classification powers of the constructed models based on the selected DifBBSs and corresponding algorithms (the same algorithms that were applied in feature selection step) on Bibert et al.’s dataset-B (the validation dataset). These ROC curves show ROC (red lines) at various threshold settings (blue lines). In the ROC curves, the x-axis shows 1-specificity, and the y-axis shows sensitivity. (C) Four measures indicating the classification power of the feature selection method by the four classifiers on GSE161731-B dataset (the feature selection dataset); (D) Four measures indicating the power of constructed models based on the selected DifBBSs and the corresponding algorithms (the same algorithms that were applied in feature selection step) on Bibert et al.’s dataset B (the validation dataset). FS: feature selection.

Discussion

Gene expression profiles of the disease-involved cells are not practical for the diagnosis of diseases. Rather, such profiles might be valuable for selecting a limited number of potential protein biomarkers that can be detected via common techniques in biofluid samples. From both basic and clinical perspectives, understanding the associations between blood biomarkers and the pathogenic states and processes in the tissues affected by the disease could greatly help in selecting the right molecule as a potential biomarker. Therefore, in this study, we considered the overlapping pathways between peripheral blood and the main body system involved in COVID-19 in order to identify novel and potentially specific blood biomarkers of the disease7. Although further steps, such as comparing the DEGs of a disease against those of other diseases (e.g. what we did for influenza in this study), are needed to obtain specific biomarkers, this strategy can help to find potential specific blood biomarkers before comparing the DEGs of the disease of interest against the rest of the diseases one by one. SpeBDs were extracted from the overlapping pathways between PB and respiratory system-related samples (Swab, BALF, and Lung) of COVID-19 patients. The extracted 108 SpeBDs enriched 152 significant pathways that, as we expected, include multiple immune system pathways, such as classical antibody-mediated complement activation, FCGR activation, creation of C4 and C2 activators, initial triggering of complement, role of phospholipids in phagocytosis, complement cascade, regulation of actin dynamics for phagocytic cup formation, immune system, immunoregulatory interactions between a lymphoid and a non-lymphoid cell, FCERI mediated NF-kB activation, viral mRNA translation, and FCERI mediated Ca+2 mobilization32,33.

In the next step, a machine learning method (feature selection) was used to narrow down the number of SpeBDs and find their most predictive combination in order to select SpeBBSs. The five indicators (ACC, AUC, MCC, Sen and Spe) were calculated to measure the power of the machine learning models constructed from SpeBBSs. Feature selection using Random Forest selected IGKC, IGLV3-16, and SRP9 as the SpeBBSs with the highest classification power, and the model constructed with this algorithm and these SpeBBSs validated the biomarker panel on an external dataset. Interestingly, the involvement of these biomarker proteins has previously been shown by some studies. Immunoglobulin kappa constant, IGKC, encodes the constant domain of kappa-type light chains of antibodies, and Immunoglobulin lambda variable 3-16, IGLV3-16, encodes the variable domain of lambda-type light chains of antibodies. Immunologically, plasma cells are responsible for synthesizing antibodies and have been identified as possibly producing virus-neutralizing antibodies in COVID-1919,34. Upregulated IGKC and IGLV3-16 expression may be involved in the differentiation of B lymphocytes into immunoglobulin-secreting plasma cells, which could play an important role in the pulmonary immune response35. SRP9 is a component of the signal recognition particle (SRP) complex, which is involved in targeting secretory proteins to the rough endoplasmic reticulum membrane35. The SRP proteins also have a role in virus-host responses. According to one experiment, the 7SL RNA component of the SRP interacts with SARS-CoV-2, and upon binding, the viral proteins disrupt SRP function, thus inhibiting protein trafficking to the cell membrane36. Moreover, it was shown that uncleaved SRP9 could increase translation elongation arrest and allow translocation, including the insertion of transmembrane domains (e.g., Coronavirus envelope protein). This process can finally lead to frameshifts in the translation process37.

In the next part, another pathway-based strategy was applied to obtain DifBDs. A total of 87 DifBDs were extracted from the 83 pathways enriched by SpeBDs but not by Influenza H1N1, H3N2, and B DEGs. The most important of these pathways include classical antibody-mediated complement activation, FCGR activation, activators, initial triggering of complement, FCERI mediated NF-kB activation, binding and Uptake of Ligands by Scavenger, complement cascade, regulation of actin dynamics for phagocytic cup formation, role of phospholipids in phagocytosis and mobilization. It can be seen that a number of non-specific pathways have been removed from the previous 152 pathways.

Then, DifBBSs were selected from the 87 DifBDs using a feature selection approach. The five indicators (ACC, AUC, MCC, Sen and Spe) were calculated to measure the power of the machine learning methods and of the models constructed from DifBBSs. Accordingly, feature selection by the best classifier (Naive Bayes) selected FMNL2, IGHV3-23, IGLV2-11, and RPL31 as DifBBSs. These DifBBSs, together with Naive Bayes, were validated on an external dataset as the biomarker panel with the highest performance. Formin-like protein 2, FMNL2, is a formin-related protein from a family of large multidomain proteins that play an essential role in controlling cytoskeletal organization38. There is a significant interaction between the native β1 integrins expressed on human and mouse pulmonary epithelial cells and the S-protein of SARS-CoV-239,40. The critical role of β1 integrins in mediating cellular adhesive interaction with the SARS-CoV-2 S-protein has recently been shown39. As FMNL2 is involved in the regulation of β1-integrin traffic and function41, it is possible that, as COVID-19 progresses, FMNL2 regulation shifts from cell-to-cell adhesion to cell-to-substrate adhesion.

IGHV3-23 (Immunoglobulin Heavy Variable 3-23) and IGLV2-11 (Immunoglobulin Lambda Variable 2-11) belong to a cluster of genes in the immunoglobulin (Ig) structure. During acute-phase infection in COVID-19, these two variable chains are among the most frequent paired heavy and light chain clonotypes identified in the repertoire of more general clonotypes42,43,44. RPL31 (Ribosomal Protein L31) is a member of the ribosomal proteins (RPs). Direct evidence of ribosomal heterogeneity comes from ribosomopathies, which are caused by defective RPs and/or rRNAs. One study investigated the putative role of ribosomal heterogeneity in COVID-19 susceptibility and severity45. Furthermore, recent studies reported RPL31 as a diagnostic biomarker for this infection8.

A limitation of the present work is that the pathway analyses were conducted on a manually curated aggregate of multiple data sources. On the other hand, the reliability of the findings is supported by their agreement with known mechanisms and by the consistency of the expression profiling data across different datasets.

Conclusion

In summary, to find potential specific biomarkers for the diagnosis of COVID-19, we focused on disease pathways, which include multiple pathways that can vary between different disease-related compartments. Consequently, more work that simultaneously analyzes multiple mechanisms in peripheral blood and inflamed tissues is required. Nevertheless, our findings shed light on some pathways and molecules that can be valuable candidates for further investigation. Moreover, investigating differential biological pathways in similar diseases can help identify differential diagnostic biomarkers. The present study identified several candidate biomarkers for the specific detection of COVID-19 and for its differential diagnosis from influenza strains in blood. Further practical studies are necessary to validate these combinatorial biomarkers.