An Epidemiological Human Disease Network Derived from Disease Co-occurrence in Taiwan

In “classic” biomedical research, diseases have usually been studied individually. The pioneering human disease network (HDN) studies jointly consider a large number of diseases, analyse their interconnections, and provide a more comprehensive description of diseases. However, most of the existing HDN studies are based on molecular information and can only partially describe disease interconnections. Building on the unique Taiwan National Health Insurance Research Database (NHIRD), in this study, we construct the epidemiological HDN (eHDN), where two diseases are concluded as interconnected if their observed probability of co-occurrence deviating that expected under independence. Advancing from the existing HDN, the eHDN can also accommodate non-molecular connections and have more important practical implications. Building on the network construction, we examine important network properties such as connectivity, module, hub, and others and describe their temporal patterns. This study is among the first to systematically construct the eHDN and can have important implications for human disease research and health care and management.


Methods
Database. Taiwan launched the single-payer national health insurance (NHI) programme on March 1st, 1995. By the end of 2004, about 99.9% of the Taiwan population were enrolled 11, 12 . With the high cost of treatments that are not insured or by commercial insurance, the dominating majority of hospital/clinic-based disease treatments go through NHI. To get insurance reimbursement, hospitals and clinics are required to provide comprehensive data on each disease treatment episode. Data are then sorted and stored in NHIRD. Compared to other databases, unique advantages of the NHIRD may include unbiasedness (virtually the whole Taiwan population are covered), comprehensiveness (comprehensive information are available on all inpatient and outpatient treatment episodes), and uniformity (all data are collected and stored under the same protocol). NHIRD has served as the basis of a large number of biomedical and public health studies (with already close to 400 publications in PubMed). We refer to Hwang et al. and the NHIRD website for more detailed information on NHIRD 11, 13 .
In this study, we retrieved data collected between 2000 and 2013 from NHIRD. The initial dataset contains records on one million subjects (about 4.26% of Taiwan's population) randomly selected from the 2005 registry for beneficiaries. In NHIRD, each subject has a unique ID, which is used to link different databases. For our analysis, we analysed both outpatient and inpatient treatments, with information in the CD (ambulatory care expenditures by visits) and DD (inpatient expenditures by admissions) files, respectively.
For disease identification, the ICD-9-CM code was used. Prior to 2005, the ICD-9-CM 1992 version was used. For consistency, it was converted to the 2001 version. With more interest on diseases, following the literature, records with the E and V codes (external causes of injury and supplemental classification), 630-679 (Pregnancy, Childbirth and Puerperium Complications), and 760-999 (Symptoms, Signs & Ill-Defined Conditions) of ICD-9-CM were removed from analysis. Limitations of the ICD-9 code have been recognised. For example, it may be biassed by experts' discrimination. In addition, the vocabulary used to describe multiple patient billing codes may actually describe the same clinical disease. To address such problems, following the literature 14 , we adopted the electronic health record (EHR) driven Phenome-wide association studies (PheWAS) codes (PheCode), which group the ICD-9 codes into 1,723 PheWAS Codes (PheCode). To generate more reliable estimates, we focused on common diseases defined as having nonzero occurrence in each of the fourteen calendar years, leading to a total of 1,356 diseases for downstream analysis. More information on data processing is provided in Supplementary Information (SI). On the patient side, records with inconsistency (for example, conflicting sex information) were removed to ensure a high standard of analysis. The final analysed dataset contains records on 986,646 patients with 1,381,749 inpatient and 173,355,725 outpatient episodes in the study period. Among them, there are 486,992 males and 499,654 females. More information is provided in SI. Network-based Analysis. Our network analysis is based on the WGCNA (weighted gene co-expression network analysis) 15 , which was originally designed for the analysis of gene expression data and has demonstrated satisfactory performance in a large number of publications [16][17][18][19] . A closer examination of WGCNA suggests that its applicability is not limited to gene expression data. For the completeness of this article, below the analysis steps are briefly described, and readers are referred to Horvath for more details 20 . It is noted that although with some minor changes, the main advancement of this study is not on the WGCNA technique itself. Rather, this study marks a new and innovative application of the WGCNA technique. A flowchart describing the proposed analysis procedure is shown in Fig. 1. Below we provide more details on each analysis step.
Network Construction. In the eHDN analysis, a node corresponds to a disease, and two diseases are connected with an edge if their probability of co-occurrence deviating from that expected under independence. The edge information is accommodated in the adjacency matrix. Denote n as the number of diseases. For diseases i and j (i, j ∈ 1, …, n), their φ-correlation is computed as: where, for a fixed time period, C ij is the number of patients with both diseases (treated in the same or different, inpatient or outpatient, episodes), N is the total number of patients, and P i and P j are the number of patients with diseases i and j. Denote S = [s ij ] as the n × n similarity matrix with its (i, j)th element being s ij . Then the n × n adjacency matrix A can be defined where (i, j)th element is Here the threshold τ is imposed to remove spurious small correlations, only retain the large ones, and generate a sparse and more interpretable network. Its value is chosen using the scale-free topology criterion 2,3,15,21 , which has been extensively adopted in the literature. In the adjacency matrix, all components take values between 0 and 1 (that is, positive and negative correlations are treated in the same manner). Two diseases are more strongly correlated (positive or negative) if their corresponding value in the adjacency matrix is bigger. Connectivity, Module, and Hub. For node (disease) i, its connectivity is defined as = ∑ ≠ K a i j i ij , which quantifies how strongly it is connected to the other nodes. In the literature, an alternative definition of connectivity has also been considered, where = ∑ ≠ k TOM i j i ij (the definition of TOM is provided below. more information on the two connectivity measures is provided in SI).
An important network concept is module (also referred to as "community" in some studies), which is composed of tightly connected nodes. Consider the topological overlap matrix (TOM), where its (i, j)th element is: u iu uj . Loosely speaking, l ij measures how many neighbour nodes that i and j shared. TOM ij measures the distance between diseases i and j in a network sense 20,22 . Accordingly, define dissTOM ij = 1 − TOM ij , which is non-negative and symmetric and measures the dissimilarity between any two diseases. With matrix dissTOM, whose (i, j)th element is dissTOM ij , modules can be identified by hierarchical clustering with a dynamic tree cutting approach 15,23 .
With each module, connectivity can be re-computed and referred to as intramodular connectivity. Nodes (diseases) with the highest correlation with the eigen-diseases (definition below) are identified as hubs.
Remarks. The network quantities described above have important implications. Adjacency directly describes how strongly two diseases are connected to each other. Of interest are diseases that are tightly interconnected. In health care management and planning, such diseases should be considered together as opposed to individually. In network analysis, it has been suggested that more highly connected nodes play more important roles in a network. It is thus of interest to examine connectivity and identify the highly connected ones. Such nodes (diseases) may have a higher priority in disease control and prevention, as they can potentially have a higher impact on the overall health conditions. In biomedical research, clustering/classifying diseases is an important task, and the module structure provides an alternative way for disease clustering. Diseases within the same modules can potentially Temporal Trends. For all diseases, occurrence rates change over time. In addition, occurrence observed in clinics is also affected by diagnosis and other factors, which are also time-dependent. As such the eHDN and its quantities described above change over time. By conducting analysis year by year and comparing across time, we are able to obtain the temporal trends of the eHDN. This is significantly different from the molecular HDNs, which are static. For scalar quantities, variation over time can be directly assessed. For the module structure, we assess variation using the Jaccard indexes for modules obtained in consecutive years.
Visualisation. The patient-disease relationship and disease correlations can be visualised using heatmaps. The overall network structure can be visualised using the software Gephi 24 . In such a plot, diseases that share edges are connected with lines, the size of a node is proportional to its connectivity, and different modules are represented using different colours. Construction of modules can be visualised using dendrograms and heatmaps. For scalar quantities (prevalence, connectivity, etc.), changes over time can be visualised using scatter plots (possibly with nonparametric fits). Changes of module memberships can be visualised using alluvial diagrams. Changes of mod-ulesâ€ ™ mean connectivity can be visualised using radar charts. The aforementioned visualisation tools provide a more intuitive way of interpreting network structures and properties.

Results
The numbers of patients for each calendar year are shown in Table S1 in SI. In Table S2 in SI, we further present summary statistics on the numbers (proportions) of inpatient and outpatient treatment episodes, stratified by gender, age, and calendar year. Across time, an increasing trend of inpatient treatment is observed. For outpatient treatment, an increasing trend is also observed, although not as prominent as for inpatient. It is noted that the number of observations per year is large enough (larger than many of the peer studies) to make credible inference.
For a more intuitive description of the patient-disease distributions, in Fig. S2 in SI, we present the patient-disease heatmaps, where the x-axis corresponds to patients, the y-axis corresponds to diseases, and a red dot represents one disease occurrence. It is noted that with the huge sample size, plotting all patients generates plots with huge sizes. Thus, in Fig. S2, we presented results for 1% subjects randomly selected from our data in 2012 and 2013. The prevalences of diseases are computed year by year. The top ten are presented in the top panel of Fig. 2. Acute upper respiratory infections (code 465) has the highest prevalence in all years. The high prevalence of acute upper respiratory infections in Taiwan has been noted in multiple publications 25 . Also in the top ten are gingivitis (code 523.1), acute sinusitis (code 464), and atopic/contact dermatitis (code 939) and others, all of which have been extensively examined in the literature 13,[25][26][27] . Network construction and connectivity. The eHDN is constructed using the approach described above.
The threshold value τ is determined as 0.03. More detailed results are provided in SI. In Fig. 3, we provide a "traditional view" of the network structure of the eHDN for the year 2013. Similar constructions/plots have also been done for other years (details omitted and available from the authors). In Fig. 3, diseases that are connected with edges are linked with lines. Extensive "activities" are observed, suggesting a high degree of interconnections among diseases. A certain number of isolated diseases not linking to other diseases are also observed. As shown in Fig. S4 in SI, for the NHIRD data, the weighted connectivity k i and unweighted connectivity K i values are highly correlated. In the bottom panel of Fig. 2, we present the weighted connectivity values for the top ten diseases. More variations are seen in connectivity than in prevalence. For multiple diseases, bell-shaped curves are observed. Such an observation has not been made in the literature. Changing in connectivity can be caused by both intrinsic reasons (such as changing patterns in disease occurrence) as well as reasons such as diagnosis. Disease 464 (acute sinusitis) is observed to have the highest connectivity. It is a common disease and related to a large number of respiratory diseases. Significant increases in connectivity are observed for multiple diseases, especially disease 523.1 (gingivitis) and disease 939 (atopic/contact dermatitis). As a common non-destructive gum disease, gingivitis has been increasingly linked to multiple oral, digestive, and blood diseases. Also, patients with atopic/contact dermatitis and allergic rhinitis have a higher risk of asthma and many autoimmune diseases 13,28 . More information on connectivity is also available in Fig. 3, where the sizes of nodes are proportional to their connectivity. The variations in connectivity across nodes are clearly observed.
In the investigation of disease connectivity, we first identify those with the highest overall connectivity across 2000 and 2013 and present the top ten diseases in the upper panel of Table 1. The list includes multiple heart diseases, type 2 diabetes, and osteoarthrosis, all of which have been suggested as connected to a large number of diseases. It is "reassuring" that our analysis coincides with "traditional wisdom". In the lower panel of Table 1, we present the list of hub diseases for the year 2013. Their module information is also provided in the bracket. The differences between the upper and lower panels are caused by the module structure (hubs are identified within modules separately) as well as variations across years.  Module identification and properties. Construction of the module structure first involves constructing the dendrogram. In the left panel of Fig. S6 in SI, we show the dendrogram for the year 2013 as well as the identified modules. Different colours represent different modules, and the grey colour represents diseases not classified in the identified modules. Different modules are also represented using different colours in Fig. 3. In Fig. 4, we show the heatmap of the diseases and mark different modules using black boundaries. The "clustering structure" along the diagonal is clearly seen, which suggests the distinct differences across modules. For the year 2012, we show the corresponding plots in Fig. S9 in SI. A careful comparison of the plots suggests variation across time (more definitive results below). For the other years, similar plots can be generated (omitted here, available from the authors). Different modules differ in multiple aspects. First, they have different sizes. The nine modules have sizes 22 (pink), 39 (turquoise), 21 (magenta), 24 (red), 32 (blue), 27 (brown), 25 (yellow), 24 (black), and 25 (green), respectively. Also, as can be seen from Fig. 4, the levels of connections within modules also vary. For example, there are tighter connections within the green module than others. Diseases in different modules also have different levels of connectivity. More detailed statistics on connectivity are provided in Fig. S5 in SI. From a biomedical perspective, it is of interest to examine the "meanings" of the modules. In Table 2, we provide the detailed list of diseases in the nine modules. As suggested in the published HDN studies, the modules provide an alternative way of defining disease classifications. More specifically, our classification, as shown in Table 2, is based on whether diseases co-occur on the same patients. An enrichment analysis is conducted to examine the representative diseases of different modules. It is found that the nine modules are enriched with the following diseases: mycoses and bacterial of skin and subcutaneous tissue disorders (pink); gynaecological disorders (turquoise); ophthalmological disorders (magenta); neurotic personality and other nonpsychotic mental disorders (red); circulatory and digestive system disorders (blue); comorbidity of endocrine and cardiovascular disorders (brown); endocrine and cardiovascular disorders (yellow); degenerative disorders (black) and arthropathies and related disorders (green), respectively. It is noted that some disease clustering/classification structures in the literature are based on, for example, biology and are defined for the whole population. In contrast, our network and module structure, based on the NHIRD, are tailored to the Taiwan population. The Taiwan population are dominatingly Asian, which may have disease risk and characteristics different from other populations. In addition, disease occurrence highly depends on environmental, socioeconomic, and other factors, which vary significantly across regions/countries. As such, for the Taiwan population and their health care and management, our constructed module structure/ disease classification can be more sensible.
Modules can describe the interconnections among diseases, with those in the same module more tightly connected. In a further step of analysis, it is of interest to examine the interconnections among modules. The eigen-disease of each module is extracted for this purpose. Eigen-diseases are defined as the first principal components of the modules. Literature suggests that, under certain conditions, they have the highest connectivity and can best represent the corresponding modules. The hierarchical clustering of the nine eigen-diseases is shown in the right panel of Fig. S9 in SI. The brown and yellow eigen-diseases are clustered the first. Figure 4 suggests that    Tables 3 and S4 Modules with "more unique" diseases, for example the modules enriched with gynaecological disorders (which have unique etiological pathways), tend to be more stable throughout the years. As an alternative way of visualising changes of module structures over time, the alluvial diagrams are shown in Figs S10-S22 in SI. It is noted that such plots provide similar information as in Tables S4-S15 in e SI, however, in a more intuitive way and can be preferred by some practitioners.
For the modules, we also summarise their connectivity and examine changes over time. The results are presented in the radar charts in Figs S10-S22. Again, significant across-module differences are observed. For some modules, for example the one enriched with respiratory failure related diseases, significant temporal variations are observed. For disease 509.1 (respiratory failure), we present the temporal trends of connectivity and intramodular connectivity in the upper panel of Fig. 5. For a better visualisation, the nonparametric smooth fits are also added. The observed trends are similar to those reported in the literature 27 . A representative of "the opposite" is disease 411.4 (coronary atherosclerosis), which is shown in the bottom panel of Fig. 5 and has a much more stable connectivity. This observation is similar to that in Tseng et al. 26 .

Discussion
HDN and other pan-disease research has drawn significant attention in recent literature and has brought significant insights beyond single-disease studies. Significantly different from the existing studies that are based on molecular information, in this study, we have taken advantage of the unique NHIRD, constructed the eHDN co-occurrence network, and studied its properties. This study has several contributions. The constructed eHDN provides a way of describing disease interconnections in a "global" manner. The adjacency measure establishes disease connections from an epidemiological perspective. The constructed modules provide an alternative way of disease clustering/classification. A closer examination of the analysis results suggest that the identified highly connected diseases and modules have sound biological interpretations, which provide support to the validity of the proposed analysis. This study also establishes a new way for analysing disease epidemiological data. The adopted technique is heavily based on the WCGNA studies. This study demonstrates the effectiveness of this technique for epidemiological data. In addition, this study also demonstrates various effective way of visualising the analysis results, which provides a more intuitive way of understanding disease epidemiological data. This study also provides an alternative analysis of NHIRD -in the literature, analysis has usually been focused on individual diseases.
Despite significant advancements, this study inevitably has limitations. The Taiwan population is dominatingly Asian. Thus, extending the findings to other races should be done with cautions. In our analysis, to describe the "big picture", we conduct analysis on the whole selected cohort. The occurrence of most diseases depends on age, gender, and other factors. It will be of interest to conduct stratified analysis. Information is only available for the year 2000-2013. Without information on diagnosis prior to 2000, our analysis only captures disease occurrence within this time period. The WGCNA-based technique, although successful and popular, also has limitations. The network generated is undirected and hence cannot reflect the "order" of diseases. In this study, we have only analysed the most important network properties (connectivity, hub, module, etc.). Other, more subtle network properties may also be of interest. In addition, we have focused on the application of the WGCNA technique. Its theoretical validity for the NHIRD data has not been examined. However, the sensible analysis results provide some support to the validity of the analysis technique. There are other statistical techniques for network construction and analysis. It will be of interest, however beyond the scope of this study, to compare different network constructions for the NHIRD data.
The merit of HDN analysis has been well established in the literature. Results obtained in this study can be valuable for basic and clinical science researchers as well as health care providers and policymakers. This study focuses on disease connection from an epidemiological perspective and may well complement the existing HDN studies. Specifically, comparing the eHDN with molecular HDNs may suggest which disease connection are attributable to molecular and non-molecular causes. However, in the literature, there is a lack of molecular HDN specific to the Asian population (it is noted that molecular risk factors of many diseases are race-specific). In addition, the existing molecular HDNs have been constructed based on techniques other than the WGCNA. With these considerations, we postpone the joint analysis of eHDN and molecular HDNs to future studies.