Principal Components Analysis Based Unsupervised Feature Extraction Applied to Gene Expression Analysis of Blood from Dengue Haemorrhagic Fever Patients

Dengue haemorrhagic fever (DHF) sometimes occurs after recovery from the disease caused by Dengue virus (DENV), and is often fatal. However, the mechanism of DHF has not been determined, possibly because no suitable methodologies are available to analyse this disease. Therefore, more innovative methods are required to analyse the gene expression profiles of DENV-infected patients. Principal components analysis (PCA)-based unsupervised feature extraction (FE) was applied to the gene expression profiles of DENV-infected patients, and an integrated analysis of two independent data sets identified 46 genes as critical for DHF progression. PCA using only these 46 genes rendered the two data sets highly consistent. The application of PCA to the 46 genes of an independent third data set successfully predicted the progression of DHF. A fourth in vitro data set confirmed the identification of the 46 genes. These 46 genes included interferon- and heme-biosynthesis-related genes. The former are enriched in binding sites for STAT1, STAT2, and IRF1, which are associated with DHF-promoting antibody-dependent enhancement, whereas the latter are considered to be related to the dysfunction of spliceosomes, which may mediate haemorrhage. These results are outcomes that other type of bioinformatic analysis could hardly achieve.


Results
shows the overall flow of the analysis.
Application of PCA-based unsupervised FE to synthetic examples. The use of a synthetic data set before the application of a methodology to a real data set is often useful in understanding the advantages and disadvantages of the proposed methodology. First, because we know the true answers for the synthetic data set (in contrast to the real data set), it is relatively easy to evaluate the performance of the methodology. Next, by preparing various data sets, we can intentionally generate a data set that can or cannot be successfully analysed with the proposed methodology, which allows us to understand the situations in which the proposed method is applicable. We can also demonstrate the superiority of the proposed method to conventional methods. To demonstrate how PCA-based unsupervised FE works and that it outperforms other popular FSs that are specifically designed for gene expression analyses, we compared PCA-based unsupervised FE with a significance analysis of microarrays (SAM) 30 and Limma 31 (see S1_Text for more details about how to perform SAM and Limma). The synthetically generated test data sets comprised "gene expression" data drawn from normal distributions, and the two classes to which each sample belonged had distinct or not distinct means (see Methods). Of 1000 genes, only the last 10 genes had expression patterns that differed between the two classes, whereas the expression of the first 990 genes did not, because it is reasonable to assume that in a real situation, the activities of a small number of genes are responsible for the observed phenotype(s). Each of the two classes included 10 samples, so in total, only 20 samples were considered. A small number of samples relative to the number of genes is also common in real experiments. The two classes had means of 0 and s(≥ 0). Therefore, a larger s indicates easier FS. s is also used as an enhancement factor for the expression of the 10 genes associated with the different gene expression patterns in the two samples, whereas the expression of the other genes is not enhanced. This also reflects the real situation: relevant genes should be more strongly expressed, whereas irrelevant genes should not be expressed. Figure 2 shows typical scatter plots of the PC scores attributed to 1000 genes when s = 1,1.5, or 2. Note that this poses a very difficult problem compared with the standard benchmark 32 , where the discrimination of the two classes is easier than in the present case because the difference between the two classes is greater. When s = 1, i.e., in the most difficult set-up, no genes were detected correctly, although no genes were wrongly identified. However, when s increased from 1 to 1.5, the number of correctly identified genes also increased to one of 10, and still no genes were wrongly identified. When s further increased to 2, nine genes were identified and no gene was wrongly identified. We performed averaging using 100 ensembles while changing s between 1 and 2. Figure 3 shows the dependence of true positives (TPs), false positives (FPs), and F-measures upon s. Here, in addition to TPs and FPs, we considered F-measures, which are useful performance measures for unbalanced data sets and are defined as [2(TP)/(FP + TP)⋅ (TP)/(TP + FN)]/[(TP)/(FP + TP) + (TP)/(TP + FN)], where FN represents false negatives. Although neither TP nor F-measure was large for smaller s, when s = 2, TP, FP, and F-measure had reasonable values. To compare the performance of this computation with that of SAM, we repeated the same computation using two SAM set-ups; one correctly assumed two classes, whereas the other wrongly assumed four classes (Fig. 3). Although the FPs obtained with both SAMs were small, the SAM that wrongly assumed four classes and the SAM that correctly assumed two classes were inferior to PCA-based unsupervised FE when s ≥ 1.6. and s ≥ 1.8, respectively. Although it was unsupervised, PCA-based unsupervised FE definitely outperformed SAM. Furthermore, although the performance of SAM decreased when four classes were wrongly assumed, PCA-based unsupervised FE circumvented this problem because it used no sample labelling information. Although we also tried to compare Limma, it did not identify any gene, including FPs, in this specific set-up, possibly because the parameter settings were too severe.
To perform comparisons with more-realistic synthetic data sets, we generated a gene expression data set composed of two classes using one set of the DENV gene expression profiles (data set 5, see Methods) analysed in this study. After the expression of each gene was standardized, the samples were divided into two classes, each containing half the samples. The positive constant s was added to the samples in one of the two classes such that the two classes were distinct. Therefore, a larger s also indicates an easier resolution of the problem. Figure 3 shows the results averaged using 100 ensembles while s was changed from 0.5 to 1. The overall performance achieved was relatively similar to that achieved with the first synthetic data set. PCA-based unsupervised FE again outperformed the other methodologies only for larger s (s ≥ 0.7), although Limma identified non-negative TPs with smaller values in this second synthetic data set than were identified with the other two methods.
Although we can conclude from its application to the two synthetic data sets that PCA-based unsupervised FE outperforms two popular FSs proposed for the analysis of "gene expression" data when s ≥ 1.8 (for the first synthetic data set) or s ≥ 0.7 (for the second synthetic data set), it is unclear whether PCA-based unsupervised FE would outperform these two methods when the set-ups were further modified. In fact, there is no way to check the superiority of PCA-based unsupervised FE to these two methods in all possible situations. Therefore, comparisons made with real examples are required.
The theoretical background and further advantages of this methodology are discussed in S1_Text. Application of PCA-based unsupervised FE to gene expression in DENV-infected patients. To demonstrate the utility of PCA-based unsupervised FE when applied to real data sets and to understand how DF progresses to DHF based upon a gene expression analysis, we used this method to analyse the gene expression patterns of multiple DENV-infected patients. We used multiple gene expression profiles because the comparison and integration of multiple profiles allows us to identify more-robust platform-independent outcomes. The first example (data set 1, GSE51808) was obtained by Kwissa et al. 33 . It includes four categories: DHF patients, DF patients, convalescent patients (CP), and healthy controls (HC). When investigating the PC loadings that differ between the groups, we found that PC2 (with a contribution of only 1.45%) and PC3 (with a contribution of only 0.45%) differed between DHF + DF and CP + HC; the P-values computed with a t test rejected the null hypothesis that the mean v kj within DHF + DF and the mean within CP + HC were identical in favour of the hypothesis that they were not: 1.03 × 10 −21 for PC2 and 4.56 × 10 −3 for PC3. Although the contribution of the first PC was 95.5%, it did not differ significantly among the four classes. Figure S1 in S1_Text shows a biplot of PC1 to PC3, where PC1 clearly displays no sample dependence (the first PC loading attributed to all samples has the same value). However, it is obvious that DHF + DF and CP + HC are well separated in the two-dimensional space spanned by the PC2 and PC3 loadings. This suggests that PCA-based unsupervised FE correctly identifies the space in which DHF + DF and CP + HC are well separated. On this plane, we selected 879 probes as outliers (see sheet 4 in S1_File for the full list of genes associated with the 879 probes). Because 879 probes were too many to be considered critical to DHF and to establish a smaller and more reliable set of genes, we applied PCA-based unsupervised FE to a second data set (data set 2, GSE13052 34 ) to further screen the genes. Figure S1 in S1_Text also shows the biplot of PC1 to PC3, where PC1 again clearly displays no sample dependence (again, the first PC loadings attributed to all samples have the same values). PC2 and PC3 were again selected as the PCs used for FE. The contributions of PC2 and PC3 were only 3.39% and 2.90%, respectively. The P-values computed with a t test that rejected the null hypothesis that the means v kj within the convalescent and acute patients are identical in favour of the hypothesis that where they were not were 1.65 × 10 −5 for PC2 and 7.15 × 10 −3 for PC3. Although the contribution of the first PC was 89.5%, it did not differ significantly among the four classes (see Fig. S1 in S1 Text). However, the convalescent and acute patients were well separated in the two-dimensional space spanned by the PC2 and PC3 loadings. This suggests that PCA-based unsupervised FE correctly identified the space in which the convalescent and acute patients were well separated. On this plane, we selected 275 probes as outliers (see sheet 4 in S1_File for the full list of genes associated with the 275 probes). We identified the 46 common genes that were common to the 879 and 275 genes identified with the first and the second data sets (data sets 1 and 2), respectively ( Table 1). These are expected to be more robust and more reliable than the genes identified in either data set alone because they were detected in two data sets using different platforms.
To confirm that the expression of these 46 genes did actually differ between the healthy controls and patients, we performed a clustering analysis of the samples in data sets 1 and 2 using only these 46 genes. Figure S3 in S1_Text shows the heatmaps produced. It is clear that the symptomatic patients (with fever) are well separated from both the healthy controls and the patients without symptoms. This suggests that PCA-based unsupervised FE successfully identified a limited number of genes that discriminate the two groups well.
The fact that gene expression can distinguish patients with symptoms from healthy controls, but cannot distinguish patients with DF from those with DHF is consistent with the heatmap produced with all genes by Kwissa et al. 33 . Selecting a limited number of genes to reproduce the results using all genes is not straightforward. Genes are usually selected based on their differential expression. However, to do this, we must decide the kind of difference to be considered. For example, if we select genes based upon their differential expression between DF and DHF, the outcome may differ from that when we use all the genes that cannot be used to distinguish DF and DHF. Therefore, reducing the number of genes is highly context dependent. In contrast to this, our unsupervised approach can identify a limited number of genes that produce the outcome produced using all genes, because we need no criterion based upon sample labelling or classification. Despite this, the 46 genes selected with our methodology reproduced the outcome obtained using all genes, which demonstrates the superiority of our methodology.
To confirm that we had successfully selected critical genes representing the relationships between samples, we applied PCA to x ij s using only the probes associated with the 46 selected genes (Therefore, this is not only FS but also FE). That 46 genes alone can represent disease progression suggests the reliability of our methodology and the biological interpretation that can be drawn from the analysis of these 46 genes. Figure 4 shows the results. PC2 and PC3 were again selected to draw the biplot and the PCs were more easily interpreted. PC2 represented the distinction between patients that display symptoms (i.e., fever) and those that do not (i.e., healthy controls and convalescent patients). PC3 represents the distinction between DHF (dengue shock syndrome [DSS]) and DF (uncomplicated). Remarkably, using the 46 identified genes, the scatter plots of the PC loadings (samples) for data sets 1 and 2 became common. The samples were aligned beside PC3 on both sides of the origin. Infected patients were roughly divided into the upper and lower half, which corresponded to DF and DHF, respectively. More interestingly, the scatter plots of the PC scores (genes) correlated significantly between data sets 1 and 2 ( Fig. S4 in S1_Text). These common embedding structures of the samples, as well as those of the genes in data sets 1 and 2 shown in Fig. S4 in S1_Text, demonstrate the robustness of PCA-based unsupervised FE and the applicability of this methodology.
Although PCA-based unsupervised FE identified essential genes and common biological structures in two independent data sets, it is still possible that this was an a coincidence. To test this hypothesis, we applied PCA-based unsupervised FE to data set 3 (GSE25001 35 ). If the 46 genes selected also describe disease progression in another additional data set, our conclusions will be strengthened. Figure S1 in S1_Text shows a scatter plot of the PC scores attributed to the probes. Again PC2 (contribution 5.3%) and PC3 (contribution 1.4%) are shown. In this embedding, the 46 genes selected in both data sets 1 and 2 form a trigram Y shape, meaning that the 46 genes are grouped into three categories, each of which shares the same sample dependence. This Y shape is unlikely to be an accidental coincidence because data set 3 was not used to select the 46 genes. Finally, only 46 genes were embedded by the PCA (Fig. 5). Although data set 3 was not used to identify the 46 genes, the timescale of DF/DHF development is well represented. In the early stage of infection, there were no significant differences between DF and DHF. With increasing time, the distinction between DF and DHF increased and was largest in the follow-up stage (i.e., after recovery). To investigate this quantitatively, we applied a t test to the second and third PC scores between the "DSS" and "uncomplicated" groups ( Table 2). It is obvious that only in the disease (DIS) stage and follow-up (FOLLOWUP) stage do the PC scores differ between the "DSS" and "uncomplicated" groups. This confirms that we have successfully identified, using PCA-based unsupervised FE, the 46 genes in data sets 1 and 2 that represent DHF/DF progression, even in the independent third data set.
Why did we specifically select these three gene expression profiles? This study was fully data driven, and with a data-driven approach, we try to integrate multiple data sets that may generate reliable outcomes, like those we used in the present study. Because our approach was successful, the selection of these data sets was also successful.  LY6E  IFI27  TNFSF10   OAS1  CDC20  GYPC  PI3  FCGR3A   HBA1  HBA2  HBG1  HBG2  IFI44L   IFIT3  CCR1  FPR1  STAT2  ISG15   OASL  CD38  TNFRSF17  CXCR1  ZBP1   HBB  IFI35  MKRN1  APOBEC3A  ALAS2   IL1RN  RSAD2  ASCC2  IFIT2  ADIPOR1   SLC25A37  OAS3  SDF2L1  TMEM140  FKBP11   HERC5  ITM2C  TXNDC5  STRADB  SLC25A39 EPSTI1 Comparison with other supervised and unsupervised methodologies. Although we have demonstrated the usefulness of our unsupervised methodology, we should explain why we did not use other popular supervised methods but intentionally used an unsupervised method, because it is generally supposed that supervised FE outperforms unsupervised FE. To demonstrate the superiority of our unsupervised FE over other frequently used supervised methods, we compared our methodology with SAM and Limma, two major implementations of FE specifically adapted for gene expression analyses. Using SAM and Limma and assuming two or four classes, we identified genes associated with adjusted P-values of less than 0.01 (Table S2 in S1_ Text). However, when the same "adjusted P-values less than 0.01" criterion was applied, too many genes were identified in both sets to identify the intersections between data sets 1 and 2, as was done with PCA-based unsupervised FE. Although it might be possible to filter the genes further using additional criteria, e.g., the fold change, it is obvious that these two methodologies are inferior to PCA-based unsupervised FE, because PCA-based unsupervised FE requires no criteria other than P-values. How well would the other above-mentioned unsupervised FSs 6-9 perform when applied to the present data sets? In the gene expression profiles analysed in this study, the number of genes exceeded a few tens of thousands, and the above-mentioned unsupervised FSs, other than Ding's study 6 , would entail computational complexities proportional to the square of the number of genes. Therefore, it is unrealistic to apply these methods to the present data sets. Consequently, we could not compare our performance with that achieved with the above-mentioned unsupervised FSs. The methodology reported in the study by Ding 6 is very similar to our methodology. Ding 6 ordered genes based upon a two-way ordering system, assuming a gene expression matrix as the bipartite graph, and discarded the middle-ranked genes. However, the genes themselves were not ranked based upon PC scores other than the first one obtained with PCA, and it did not outperform a simple supervised method according to a t test in his trials. Ding also stated clearly that gene expression must be non-negative in this implementation because the gene expression matrix must be treated as weights in the bipartite graphs. Therefore, he could not apply scaling such that to the gene expression as we have done. Ding 6 came close to the idea presented in this study, but missed the central point: not samples but features (genes) should be embedded and PC scores other than the first score should be considered for FSs, even if their contributions are apparently very small.

Discussion
We investigated the robustness and biological significance of the 46 genes identified.
First, because the results obtained may have been accidental, we used additional data sets to determine whether the selection of genes with PCA-based unsupervised FE was robust. We used an in vitro study to enhance the robustness of the results, because the data sets analysed were from an in vivo study. If the genes selected are consistent with those selected in the in vitro study, the outcome is more trustworthy. Interestingly, the 46 genes included many genes previously reported to be associated with DENV in in vitro studies 36 , i.e., CD38, HERC5, IFI44L, IFIT3, LY6E, OA, OASL, RSAD2, TRAIL, (TNFSF10), and the anti-viral activity of TRAIL against DENV has been confirmed experimentally 36 . Furthermore, after applying PCA-based unsupervised FE to the data set analysed by Warke et al. 36 , we found that 59 probes were associated with aberrant gene expression between the control and DENV-infected cell lines (see sheet 4 in S1_File for the full list of genes associated with the 59 probes). Left: data set 1 (GSE51808); right: data set 2 (GSE13052). Open magenta circles are PC scores for the probes associated with the 46 genes (for more details on the biological features of individual genes, see the enrichment analysis available in sheets 1-3 in S1_File). The contributions of PC2 and PC3 increased to 13% and 4% (left: data set 1), respectively, and to 7% and 1.6% (right: data set 2), respectively. Black (red) crosses represent PC loadings, v 2j and v 3j , of the DF (DHF) patients. Blue crosses correspond to healthy controls. Green crosses represent convalescent patients. Broken arrows show the different gene functions (see sheets 1-3 in S1_File). DSS, Dengue shock syndrome.
Scientific REPORTS | 7:44016 | DOI: 10.1038/srep44016 Among the genes associated with these 59 probes, the genes shared with those identified in the present study were identified : APOBEC3A, IFI27, IFI35, IFIT2, IL1RN, ISG15, MX1, and OSA1. Thus, 17 of the 46 genes were also Figure 5. Biplots for data sets 3 to 5. Top: Biplot of data set 3 (GSE25001) using only the 46 genes identified in data sets 1 and 2. Open magenta symbols × , ∇ , ◇ are the PC scores, u 2i and u 3i , for the probes associated with the 46 genes (for more details on the biological features of individual genes, see the enrichment analysis available in sheets 1-3 in S1_File). The contributions of PC2 and PC3 increased to 11% and 6.3%, respectively. Crosses (+ ) represent the PC loadings, v 2j and v 3j , of the patients. Filled circles connected by solid cyan lines represent the centres of mass for each group (see legends for more details). Middle left: Scatter plot of the PC scores for data set 4 (GSE43777-GPL507) using 76 probes attributed to any of the 46 genes. The contributions of PC2 and PC3 increased to 5.3% and 4.0%, respectively. The correspondence between the coloured crosses (+ ) and disease progression are black (stage G1), red (stage G2), green (stage G3), blue (stage G4), cyan (stage G5), magenta (stage G6), and grey (stage G7). Cyan solid and broken lines correspond to DF and DHF, respectively. Middle right: Scatter plot of the PC scores in data set 5 (GSE43777-GPL201) using 28 probes attributed to any of the 46 genes. The contributions of PC2 and PC3 increased to 14% and 10.0%, respectively. The correspondence between the coloured + symbols and disease progressions are the same as for GSE43777-GPL507 and yellow (stage G0). Solid cyan line corresponds to patients with either DF or DHF. Bottom: Scatter plots of PC loading (left to right: GSE43777-GPL507/201). P-values were computed with a t test to determine whether the PC scores had positive (or negative) mean values. detected in an in vitro study. Moreover, 10 of the 46 genes were recognized as anti-viral interferon-stimulated genes (ISG) 37 , whereas APOBEC3, IFI44L, IFIT2/3, ISG15, MX1, OAS1/3/L, and RSAD2. CD38 are reported to be associated with immune thrombocytopenia 38 .
Second, we used two additional in vivo gene expression profiles (data sets 4 and 5, GSE43777, see methods) also associated with disease progression during eight distinctive stages, G0 to G7. Because Sun et al. 39 identified a triangular configuration in their two-dimensional PCA embedding, their data set was suitable for testing the robustness of our 46 genes (for a more detailed comparison with their results, see below). Basically, these genes reproduced the configuration of data set 3, although with some differences (Fig. 5). First, the time progression in data sets 4 and 5 on the plane spanned by PC2 and PC3 was V-shaped, which was also seen in data set 3. PC3 is the vertical axis in data set 3, whereas PC2 is the vertical axis in data sets 4 and 5. However, because the order of the PCs is simply dependent on their contributions, their order is not biologically important if the overall configuration is conserved. Finally, although the Y-shaped configuration of the genes observed in data set 3 is missing from data sets 4 and 5, the overall configuration is similar. Interferon and heme biosynthesis 1 and 2, are located in the right half plane, the second quadrant, and third quadrant of data sets 3, 4, and 5, respectively. This is obvious in data set 3, although a t test was also used to determine whether the PC scores had negative or positive mean values in data sets 4 and 5 in order to determine the quadrant in which the center of the genes was located (see captions). Therefore, the configuration of the PC scores in data set 3 observed in Fig. 5 is expected to be robust. After a t test was applied, the distinction between DHF and DF in the follow-up samples seen in data set 3 was still observed between DHF and HF at stage G7 in data set 4. The 2nd PC loading attributed to DF patients was significantly larger than that of the DHF patients (P = 0.05 and 0.04 with a t test and Wilcoxon signed-rank sum test, respectively). Because data set 5 is predominantly composed of DF only, we did not check this point in data set 5. The relatively weak distinction between DF and DHF in data set 4 may be because these samples were collected at different times (data set 3: FOLLOWUP, ≥ 72h after illness onset; data set 4: G7 (convalescent) samples, around day 28 after the first sampling). In fact, there was a large gap between the G7 samples (grey + ) and the G6 (late acute) samples (magenta+ ) for data sets 4 and 5 (middle panels of Fig. 5). It is possible that the uncollected samples reflect the distinction between DF and DHF. We require more samples to confirm this point.
Because we successfully confirmed the robustness of our results, we next investigated the biological reliability of the 46 selected genes. We uploaded the 46 genes to three enrichment analysis servers, DAVID 40 , g:pfofiler 41 and TargetMine 42 (see sheets 1-3 in S1_File for the full list of enriched biological terms and pathways), to compensate for the bias introduced by each individual enrichment server. g:Profiler reported the enhancement of the IRF/IRF-7 motif, which is known to occur in interferon (IFN)-related transcription factors (TFs) 43 . Both g:Profiler and TargetMine also detected the measles and influenza A pathways (hsa05162 and hsa05164, respectively). Measles and influenza are often listed as diseases that differ negligibly from DF in terms of their diagnosis 44,45 . There are no DENV pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Therefore, it is reasonable that these two viruses were detected instead of DENV. Other than these, multiple enrichments related to either viral infection or haemorrhage were detected. For example, Gene Ontology (GO) biological process (BP) terms GO:0009615 (response to virus), GO:0006955 (immune response), and GO:0015671 (oxygen transport) were identified by all three servers. GO cellular component (CC) term GO:0005833 (haemoglobin complex) was also identified by all three servers. Reactome pathways REAC:913531 (interferon signalling), REAC:909733 (interferon alpha/beta signalling), REAC:168256 (immune system), and REAC:1280215 (cytokine signalling in immune system) were identified by g:Profiler and TargetMine. Further haemorrhage-related or viral-infection-related GO BP terms were detected by g:Profiler and TargetMine. The next step in the biological validation process was to determine the interactions between these genes. If they have tight relationships, the selection of the 46 genes is more reliable because single proteins rarely function without the collaboration with other proteins. To check this, we uploaded the 46 genes to the STRING server 46 , which detected 96 protein-protein interactions among the products of the 46 genes (P = 0 within the numerical accuracy), although the expected number of protein-protein interactions was only eight. Therefore, the 46 identified genes were also enriched for protein-protein interactions, probably because of the functional collaborations between the products of these genes.
These analyses suggest that PCA-based unsupervised FE can successfully identify a biologically feasible set of genes.
To further investigate the biological backgrounds of the 46 selected genes, we uploaded the 46 genes to Enrichr 47 , a multi-functional enrichment analysis server. Among the results given by Enrichr, we noticed the top-ranked three transcription factor (TF) bindings at "ENCODE TF ChIP-seq 2015", STAT1, STAT2, and IRF1 in K562 cells. K562 is a cell line often used in in vitro DENV infection experiments (see references cited below). The 46 genes were also enriched for multiple histone modifications ( Table 3). The genes associated with histone modification largely overlapped the TF target genes (Fig. 6). Ni et al. 48 reported that the biphasic formation of a STAT1/IRF1 complex is accompanied by histone methylation, which is consistent with the observed enrichment

Table 2. P-values that distinct DSS and uncomplicated patients in data set 4. P-values computed with
a two-sample t test applied to the differences in the second and the third PC scores between "DSS" and "uncomplicated" patients. (Fig. 5).
Scientific REPORTS | 7:44016 | DOI: 10.1038/srep44016 of STAT1-and IRF1-binding sites in these 46 genes. More interestingly, Schoggins et al. 49 identified STAT2 and IRF1 as effective inhibitors of DENV infection. Many studies have also reported the cooperation between these TFs. Kumatori et al. 50 reported that STAT1 and IRF1 cooperatively regulate the expression of the GP91 gene. Wang et al. 51 reported that the STAT1/IRF-1 signalling pathway mediates the injurious effects of IFN-gamma on oligodendrocyte progenitor cells. Therefore, identifying the enrichment of these three TFs is unlikely to be accidental. We also found that antibody-dependent enhancement (ADE) is a factor potentially involved in the direct relationship between DF and these TFs. Chareonsirisuthigul et al. 52 showed that the ADE infection pathway suppresses the innate anti-DENV mediator, the nitric oxide (NO) radical, by disrupting the transcription of the inducible nitric oxide synthase (iNOS) gene by TFIRF1, and blocking the activation of STAT1. ADE is believed to be a potential cause of DHF because DENV-ADE infection has a greater effect on viral replication than DENV infection 53,54 . In contrast, Huang et al. 55 reported that neither DENV infection nor ADE-DENV infection upregulates IL10 or IL6 expression, and these proteins were not encoded by any of the 46 genes identified in the present study. As can be seen in Fig. 5, because the PCA of these 46 genes not only described the DF-to-DHF progression, but also distinguished between DF and DHF, these genes must include those responsible for DHF. Therefore, detecting the enrichment of these TF-bound genes may be the key to distinguishing between DHF and DF. This suggests that among our 46 selected genes, the genes targeted by these three TFs are expressed downstream from STAT1, STAT2, and IRF1, and are the ADE targets among the 46 selected genes. To our knowledge, this is the Rank Histone modification P-value adjusted P-values Z-score combines score   Fig. 1. first report to identify the genes downstream from ADE-DENV using a bioinformatic (meta) analysis rather than experiments.
Although we hypothesize that our 46 selected genes include the genes associated with ADE, this only refers to immunology-related genes. As can be seen in Fig. 5, the 46 genes also include many heme biosynthesis genes. To determine the relationship between DHF and these heme biosynthesis genes, we uploaded the genes to another data analysis server, COEXPRESSdb 56 , which infers a possible set of genes within the list of uploaded genes that might be co-expressed. In this way, we found 11 spliceosome (hsa03040)-related genes among the 46 genes, which might be co-expressed (see Figs S8 and S9 in S1_Text and sheet 5 in S1_File). Interestingly, Hess et al. 57 have already reported that U-spliceosomal non-coding RNAs (ncRNAs) are affected during DENV infection. Pre-mRNA splicing not only occurs in the cytoplasm of platelets, but also provides a mechanism for regulating cytokine production after platelet activation 58 , which is related to the innate immune response 59 . It is clear that reduced platelet counts might lead to a severe bleeding diathesis. This suggests that the dysfunction of spliceosomes causes haemorrhage, and is, to the best of our knowledge, the first such proposal. Further and follow-up studies are required.
We also compared our results with those of the studies from which the data sets used in this study were taken. Why didn't we compare the genes we selected with those selected in the original studies? 33,34,35,39 . Neither Kwissa et al. 33 nor Hoang et al. 35 provided a list of the selected genes, possibly because they could not select a sufficiently small number of genes to investigate them individually in detail. Long et al. 34 selected 100 genes for each of the three pairwise comparisons (DHF vs DF, CP vs DF, and CP vs DHF), 300 genes in total, similar to the number we selected (275 genes). Thus, the only comparison possible is between our 275 genes and their 300 genes. When we replaced our 275 genes with the 300 genes selected by Long et al. in data set 2, only 83 genes were selected from both data sets 1 and 2 (see sheet 6 in S1_file). Because 83 genes are more than our 46 genes, Long et al. apparently identified more coincidental results between data sets 1 and 2 than did our analysis. However, when both the 83 genes and the 46 genes were uploaded simultaneously to g:profiler, the impression was reversed. The 83 genes are primarily enriched in genes involved in the cell division cycle, which is unlikely to be related to DF (see the enrichment analysis in S1_Text). This suggests that our methodology identified a more limited but more biologically reliable set of genes than did that of Long et al. Among the five studies from which the gene expression profiles were taken, the study by Sun et al. 39 was most similar to ours. Those researchers even embedded samples in two-dimensional space with PCA, using only the genes that they selected and that formed the triangular configuration seen in our genes in Fig. 5. However, our results still have multiple advantages over theirs. First, they selected 313 genes, which is over six time more than the number we selected (46 genes). Because they selected genes for each of the eight disease stages (G0 to G7) and there were very few overlaps between them, they could not identify a restricted number of genes that represented the overall disease progression. Their PCA embedding of samples was also less consistent with disease progression than ours. Therefore, they had to merge the eight stages into three groups, the early acute, late acute, and convalescent groups, to obtain biologically interpretable results. Furthermore, only some (23) of our 46 genes were included in their 313 genes. Fifteen and 13 of our genes were included in the genes detected in the early and late acute stages, respectively, and only five of our genes were included among the genes detected in both the early and late acute stages. Therefore, our methodology identified a more reliable, smaller, more robust, and distinct set of genes than the analysis of Sun et al.
Throughout the manuscript, we have almost always argued that the identification of too many significant genes is neither trustworthy nor usable. There can be two objections to this opinion: 1. It is always possible to reduce the number of genes by using a smaller number of top-ranked (relatively more significant) genes. Therefore, the identification of too many significant genes is not a problem at all. 2. The identification of too many significant genes may be evidence that the methodology works even when smaller samples are considered (because generally smaller samples induce larger, less significant P-values).
Objection 1 is meaningless from a statistical perspective because adjusted P-values are usually regarded as a portion of FP. This means that requiring very small adjusted P-vales (e.g., less than 1/N) does not make sense. Therefore, selecting the top-ranked genes does not guarantee more-reliable results. To demonstrate this, we used Limma or SAM to analyse the top-ranked 879 or 275 genes, which were the numbers identified with PCA-based unsupervised FE from data sets 1 and 2, respectively, and counted the overlaps between them (there were 46 overlaps when PCA-based unsupervised FE was used). We found that only one and two genes were commonly selected when Limma was used for the two-and four-class classifications, respectively. For SAM, because more than a thousand genes were associated with adjusted P-vales equal to zero in data set 1 for both the two-class and four-class classifications, we could not even select a smaller number of significant genes. This definitely suggests that simply taking the top-ranked genes without statistical reliability is not an appropriate alternative to a method that can select a reliable number of genes based on a statistically meaningful assumption (e.g., adjusted P-values less than 0.01). Thus, objection 1 is refuted, at least for our case. As for objection 2, we tried to select genes using samples containing half the data from data sets 1 and 2, using either SAM, Limma, or PCA-based unsupervised FE (Table S2). The results for SAM and Limma were very disappointing. The number of genes selected often remained unchanged (i.e., too many) or became too small (even less than 1). This suggests that objection 2 holds only for very specific combinations of methods and sample numbers, so it is not worthy of consideration. In contrast, PCA-based unsupervised FE selected almost the same number of genes independently of the sample numbers. This may seem strange. However, PCA-based unsupervised FE selects genes based upon the PC scores attributed to individual genes. PCA is essentially a projection from a high dimension to low (two) dimensions. In this context, halving the sample numbers corresponds to the random elimination of half the high-dimensional space, which is unlikely to change the projection dramatically. If the PC scores are not altered much, the genes Scientific REPORTS | 7:44016 | DOI: 10.1038/srep44016 selected remain unchanged, even after the sample numbers are halved. For these reasons, we conclude that objections 1 and 2 need not be considered, at least in the present study.
Finally, we will briefly discuss the relationship between the robustness of our methodology and the heterogeneity of the data sets. Our results may be considered untrustworthy because we derived most of our conclusions from the study of heterogeneous data sets collected from multiple studies that had distinct experimental plans, used different platforms (microarrays), and had distinct purposes. How can we derive something valid from such diverse data sets? Our methodology, PCA-based unsupervised FE, is known to generate robust results from an integrated analysis of heterogeneous data sets. For example, we previously analysed mouse cardiac maturation based on data sets collected with two distinct microarray platforms 60 . The selected genes were biologically useful when they were considered in human samples. This process is similar to that used in the present study, in which the integration of data sets 1 and 2 identified a gene set that described disease progression in data set 3. Alternatively, in an integrated analysis of mRNA and miRNA, we successfully compared the mRNA and miRNA expression measured in different studies 13 . All these studies suggest that our methodology is sufficiently robust to derive biologically reliable outcomes, even from heterogeneous data sets. Contrary to the impression that the use of heterogeneous data sets is erroneous, it can provide rather robust and reliable results if a suitable methodology is used. It even strengthens the robustness of the outcomes if it is successful, because although the use of heterogeneous data sets has greater potential for failure, the conclusions are less likely to be accidentally reliable while actually being untrue. Today, although RNA-seq technology is superceding microarrays because they are less reliable, if suitable methods such as PCA-based unsupervised FE are used, suitable results can still be obtained. In fact, PCA-based unsupervised FE has also been used in an integrated analysis of microarray and RNA-seq measurements 10,20 .

Methods
Synthetic data set. First synthetic data set. In the first synthetic gene expression data set (data set 1), the gene expression values were uncorrelated random numbers drawn from a Gaussian distribution with common variance. The number of genes was assumed to be much larger than the number of samples. The samples formed two classes, each of which included the same number of samples, but they differed only for very small numbers of genes among all genes. The mean value within each class was taken to be different, so that they were discriminative. The majority of the remaining genes did not form two classes. The mathematical formulation was as follows.
To simulate gene expression with N genes and M samples, x ij s (i = 1, … , N, j = 1, … , M) were drawn from the normal distribution  µ σ ( , ), where μ ≥ 0 is the mean and σ > 0 is the standard deviation, and Second synthetic data set. Because the gene expression data drawn from a Gaussian distribution are unlikely to be similar to real data sets, we compiled a more realistic but synthetic data set from data set 5 (see below). Gene expression was standardized to have a zero mean and a variance of one for each gene. The samples were then shuffled within each sample. The shuffled samples were divided into two classes. Some positive constant was added to only the samples within one of the two classes. The mathematical formulation is as follows. Suppose x ij is the gene expression of the ith gene of the jth sample that belongs to data set 5. To standardize gene expression, x ij was converted to Then x ij was shuffled within each gene as where s j is a random integer drawn from (1, … , M) without replacement. s j is independently drawn for each i. The positive constant s was then added to the first N 0 genes in the first half of the samples, In data set 5, N = 8793, M = 168 and N 0 is taken to be 100.

PCA-based unsupervised FE. PCA.
In contrast to the usual use of PCA, where samples are embedded, the genes were embedded in this implementation.
and X is a matrix whose elements are x ij . The gram matrix G is defined as G ≡ XX T . Eigen vectors u k = (u k1 , … , u kN ) T s (1 ≤ k ≤ min(M, N)) are then obtained as Gu k = λ k u k , where u ki is the kth PC score attributed to gene i and λ k s are the Eigen values ordered as λ k ≥ λ k + 1. The kth PC loadings attributed to the jth sample v kj are defined as v k = X T u k , where v k = (v k1 , … , v kM ) T because v k is the Eigen vector of the covariance matrix X T X, X T Gu k = X T XX T u k = X T Xv k = λ k X T u k = λ k v k . The k′ th PC associated with the smallest P-value was then selected and used for FE, as a smaller P-value corresponds to a more significant difference between {v kj |j = (M)/(2) + 1, … , M} and {v kj |j = 1, … , (M)/(2)}. Assuming that u k′i obeys a normal distribution, P-values were then attributed to the ith gene using the χ squared distribution. The P-values were further adjusted by the Benjamini-Hochberg (BH) criterion 61 , and the genes associated with adjusted P-vales less than 0.01 were selected as the genes associated with the difference between 1 ≤ j ≤ (M)/ (2) and (M)/(2) < j ≤ M.

PCA-based unsupervised FE
PCA-based unsupervised FE applied to gene expression in DENV patients. We identified a set of ks, {k} sig , associated with P-values less than 0.05 that rejected the null hypothesis that the means of the PC loadings (v kj ) were identical over multiple classes, in favour of the alternative hypothesis that the means were not identical over multiple classes. Assuming that u k′i s are normally distributed, the P-values were then attributed to the ith gene using a χ squared distribution; P-values are  , where σ v ik is the standard deviation of {v ik |i = 1, … , N}, and > χ P x ( ) 2 is the probability that the argument is larger than x under the assumption that the arguments obey a χ squared distribution. The P-values were further adjusted by the BH criterion 61 , and those genes associated with adjusted P-values less than 0.01 were selected as the genes associated with the difference between multiple classes. All the genes identified with PCA-based unsupervised FE are shown in S1_File.
Gene expression profiles. Four in vivo gene expression data sets were downloaded from the Gene Expression Omnibus (GEO) using GEO IDs: GSE51808 33 , GSE13052 34 , GSE25001 35 , and GSE43777-GPL570/201 39 . Hereafter, these gene expression data will be denoted data sets 1, 2, 3, 4, and 5, respectively. One in vitro gene expression data set was also downloaded from GEO ID GSE9378 36 . The processed data GSEXXXXX_series_matrix.txt (where GSEXXXXX is GEO ID) for all five sets were downloaded and used for further analysis. Gene expression was scaled for PCA-based unsupervised FE, i.e., ∑ = ∑ = x x N 0, / 1 i ij i ij 2 . For other analyses, gene expression was used as it was, because the data had been processed. For details of the samples included in these gene expression profiles, see Table S1 in S1_Text. 2 are projected and over drawn onto the two-dimensional space spanned by the k 1 th and k 2 th PCs. For visibility (in other words, to avoid any overlap of the genes and samples or to avoid the locations of the genes or samples that are too close to the origin because the distances between the PC scores and PC loadings differ), an arbitrary positive constant scaling factor c is often used to multiply either u i s or v j s. By definition, since = ∑ = v x u kj i N ij ki 1 , the jth samples and ith genes that are oriented in the same direction from the origin are regarded as related on the biplot.