A knowledge graph of clinical trials (CTKG)

Effective and successful clinical trials are essential in developing new drugs and advancing new treatments. However, clinical trials are very expensive and prone to failure. The high cost and low success rate of clinical trials motivate research on inferring knowledge from existing clinical trials in innovative ways for designing future clinical trials. In this manuscript, we present our efforts on constructing the first publicly available Clinical Trials Knowledge Graph, denoted as CTKG. CTKG includes nodes representing medical entities in clinical trials (e.g., studies, drugs and conditions), and edges representing the relations among these entities (e.g., drugs used in studies). Our embedding analysis demonstrates the potential utilities of CTKG in various applications such as drug repurposing and similarity search, among others.

Schema of CTKG. www.nature.com/scientificreports/

and multiple standard-outcome nodes. The connection between the outcome node and the cluster-outcome node indicates that the name of the outcome can be assigned to the cluster represented by the cluster-outcome node, while the connection between the outcome node and the standard-outcome node indicates that the name or the description of the outcome contains the standard outcome measure. Please refer to the "Outcome extraction and outcome clustering" section for more details.
Each group node represents a study arm or a comparison group, that is, a group of participants who receive a specific intervention. There are multiple types of group nodes as follows:
• event-group node. Each event-group node records the number of participants within the group affected by specific types of adverse events. Each event-group node is connected to multiple drug nodes representing the drugs used in the event group, and to adverse-event nodes representing the specific adverse events that occurred in the event group. Each adverse-event node also links to an organ-system node representing the affected organ system.
• baseline-group node. Each baseline-group node represents a group of participants with their demographic attributes (e.g., "Age" and "Ethnicity") or study-specific attributes (e.g., "Baseline Modified Gingival Index"). Each baseline-group node is connected to one or multiple baseline-record nodes.
• drop-group node. Each drop-group node represents a group of participants with their withdrawal information. Each drop-group node is connected to one or multiple period nodes. Each period node represents an interval of the study (e.g., "First Intervention" and "Part 1: Treatment Period 1"), and has attributes describing the number of participants at the beginning and the end of the period. Each period node can link to multiple drop-record nodes. Each drop-record node includes a withdrawal reason and documents the number of participants in the group who withdrew for this reason in a period.
• outcome-group node. An outcome-group node carries the information on the efficacy of the studied interventions on the participants. The efficacy is evaluated by different outcome measures, and the measurements are analyzed by different statistical test methods. For example, an outcome-group node could represent a group of 17 participants receiving Naltrexone as the intervention, and the efficacy of Naltrexone was evaluated using the results from this group of participants; these 17 participants and the efficacy evaluation are included in an outcome-group node.
Other nodes related to the efficacy measures of interventions are as below:
- method node. Each method node represents a statistical hypothesis testing method that is used to make inferences or draw conclusions statistically from the data collected in clinical trials. For example, the variance analysis method "ANOVA" could be a method node that is used to test the superiority of an intervention compared with a control in a clinical trial. Each method node is connected to multiple outcome-analysis nodes, representing that the method is used to conduct the analyses. Please refer to the "Statistical analysis method normalization" section for more details about the normalization of method names.
- outcome-measurement node. Each outcome-measurement node represents the measurement of a specific outcome measure (i.e., outcome node) on the corresponding group of participants. For example, the measurement of the average changes in BOLD (i.e., outcome node) on the 17 participants in the outcome-group node is 1.23 with a standard deviation of 1.07, and is represented as an outcome-measurement node. Each outcome-measurement node links to one outcome node and one outcome-group node.
- outcome-analysis node. Each outcome-analysis node represents a statistical analysis on a specific outcome measure by comparing multiple outcome groups using a statistical testing method. For example, the analysis of the outcome "average changes in BOLD" via the statistical testing method "paired t-test" on two groups of participants receiving the intervention "Naltrexone" and placebo is represented as an outcome-analysis node; the p-value of the analysis is 0.002, indicating that the alternative hypothesis "Naltrexone will block contextual processing" can be accepted at the significance level of 0.005. Each outcome-analysis node links to one outcome node, one method node and multiple outcome-group nodes.

Table 2. Statistics of relation types in CTKG. Columns represent: "Relation type": the type of relation; "Node type 1": the type of head nodes in the relations; "#Node 1": the number of unique head nodes with the relations; "Node type 2": the type of tail nodes in the relations; "#Node 2": the number of unique tail nodes with the relations; "#Relations": the number of relations of a relation type.
Note that the different types of group nodes for a study could represent the same participant group with different information. According to AACT, using a single group to uniquely represent a participant group in the study is impossible due to the complicated designs of clinical trials. Therefore, we followed AACT and used different types of group nodes to represent different types of information of the clinical trials.
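To make the outcome-related part of the schema more concrete, the Naltrexone example above could be written down as a handful of (head, relation, tail) triples plus node attributes. This is only an illustrative sketch: the node identifiers and the relation names (`measures`, `measured-on`, `analyzes`, `uses`, `compares`) are hypothetical and are not the labels used in CTKG.

```python
# Hypothetical identifiers and relation names, chosen for illustration only.
triples = [
    ("outcome-measurement:m1", "measures", "outcome:average-changes-in-BOLD"),
    ("outcome-measurement:m1", "measured-on", "outcome-group:naltrexone"),
    ("outcome-analysis:a1", "analyzes", "outcome:average-changes-in-BOLD"),
    ("outcome-analysis:a1", "uses", "method:paired t-test"),
    ("outcome-analysis:a1", "compares", "outcome-group:naltrexone"),
    ("outcome-analysis:a1", "compares", "outcome-group:placebo"),
]

# Attribute values are the ones quoted in the text.
attributes = {
    "outcome-group:naltrexone": {"participants": 17},
    "outcome-measurement:m1": {"value": 1.23, "standard_deviation": 1.07},
    "outcome-analysis:a1": {"p_value": 0.002},
}


def neighbors(node, relation):
    """Return the tail nodes reachable from `node` through `relation`."""
    return [tail for head, rel, tail in triples if head == node and rel == relation]


# An outcome-analysis node links to one outcome, one method and several groups.
print(neighbors("outcome-analysis:a1", "compares"))
```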

Embedding analysis
We conducted an embedding analysis to evaluate the quality of CTKG and demonstrate its utilities in various important applications. We applied TransE 13, a state-of-the-art graph embedding method, to generate embeddings for nodes in CTKG. These computable embeddings can benefit various downstream tasks. For example, we could establish similarities among nodes using their embeddings. The similarities enable fast retrieval of nodes corresponding to similar medical entities and could facilitate applications such as drug repurposing and similar study search, as will be discussed below. Note that TransE generates node embeddings based only on the topology of the graph (i.e., node attributes are not considered). We used the TransE implementation in DGL-KE 14, which is a high-performance Python library built on top of the Deep Graph Library (DGL 15). DGL is a Python library for deep learning on graphs and enables training models on large-scale graphs in a convenient way. Based on DGL, DGL-KE provides many popular KG embedding algorithms, such as TransE, for users to learn KG embeddings that can be used for many different applications 9. Other KG embedding methods are also applicable to the following analyses. Detailed information on KG embedding methods is available in a survey 16.
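For readers unfamiliar with TransE, the idea is that a relation acts as a translation in the embedding space: a triple (head, relation, tail) is plausible when head + relation lies close to tail. The sketch below illustrates that scoring rule in plain Python with the L2 distance. It is not the DGL-KE API, and the three-dimensional vectors and the `used_in` relation are invented for illustration.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail; higher is more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy embeddings (assumed values, not taken from CTKG).
drug = [0.1, 0.4, -0.2]
used_in = [0.3, -0.1, 0.5]
study_a = [0.4, 0.3, 0.3]   # close to drug + used_in
study_b = [-0.5, 0.9, 0.0]  # far from drug + used_in

print(transe_score(drug, used_in, study_a))
print(transe_score(drug, used_in, study_b))
```

Training adjusts the vectors so that observed triples score higher than corrupted ones; only the graph topology enters this objective, which is why node attributes play no role.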
CTKG for drug repurposing. In this analysis, we evaluated the utilities of CTKG for drug repurposing, a strategy to identify new therapeutic indications for existing drugs 17. In particular, we assessed whether high similarities between condition node embeddings and drug-term node embeddings indicate a high potential of the corresponding drugs in treating the conditions. For the evaluation, we calculated the cosine similarities between all the condition nodes and drug-term nodes, and identified the top-10 most similar pairs. Among these 10 pairs, we found that 5 have evidence in the literature indicating potential repurposability, as presented in Table 3. For example, the condition node "Diabetes Mellitus, Type 2" has a similarity of 0.597 with the drug-term node "Benzoates"; as demonstrated in the literature 18, Alogliptin Benzoate, a member of the Benzoates, is now available for the treatment of Type 2 Diabetes. Similarly, the condition node "Lung Neoplasms" has a similarity of 0.574 with the drug-term node "Triterpenes", and as demonstrated in the literature 19, Triterpenes have anti-cancer properties against Lung Neoplasms. Please note that the average similarity between condition nodes and drug-term nodes is -0.032, and thus the above similarities are far above the average. In addition, the above drugs are not studied for their highly similar conditions in any CTKG studies (i.e., there are no existing edges connecting the condition nodes and drug-term nodes). Thus, the above results demonstrate the utilities of CTKG for drug repurposing.
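The ranking described above can be sketched as follows: compute the cosine similarity for every (condition, drug-term) pair and keep the k highest-scoring pairs. This is a minimal illustration rather than the released code. The function names, the toy two-dimensional vectors and the `known_edges` filter are assumptions; the filter simply drops pairs that are already connected, whereas the text reports checking this for the top pairs afterwards.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_candidate_pairs(conditions, drug_terms, known_edges, k=10):
    """Return the k most similar (score, condition, drug-term) triples not in known_edges."""
    scored = [
        (cosine(c_vec, d_vec), c, d)
        for (c, c_vec), (d, d_vec) in product(conditions.items(), drug_terms.items())
        if (c, d) not in known_edges
    ]
    return sorted(scored, reverse=True)[:k]


# Toy embeddings; the node names come from the text, the vectors are invented.
conditions = {"Lung Neoplasms": [0.9, 0.1], "Diabetes Mellitus, Type 2": [0.1, 0.8]}
drug_terms = {"Triterpenes": [0.8, 0.2], "Benzoates": [0.2, 0.9]}

for score, condition, drug in top_candidate_pairs(conditions, drug_terms, set(), k=2):
    print(f"{condition} / {drug}: {score:.3f}")
```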
Other similar condition and drug-term node pairs, for example, "Squamous Cell Carcinoma of Head and Neck" and "Naloxone" with similarity 0.565, and "Lung Neoplasms" and "Uric Acid" with similarity 0.564, may enable new hypothesis generation for innovative investigation and findings.
CTKG for similar medical entity retrieval. In this analysis, we evaluated whether CTKG enables high-quality node embeddings for similar medical entity retrieval tasks. In particular, we focused on the retrieval of similar studies, and the retrieval of similar conditions, drugs, adverse events and outcomes. All these retrieval tasks are common and useful in designing new clinical trials 20.
Similar study retrieval. CTKG can support the search and retrieval of similar studies. To demonstrate this, we first identified the top-5 most similar pairs of study nodes using cosine similarity over their embeddings. In each identified pair, we randomly selected one study node, and identified its top-5 most similar study nodes. Table 4 presents the selected study nodes and their top-5 most similar study nodes. As presented in Table 4, the identified similar studies all investigated similar drugs or conditions. For example, study "NCT00795769" and its top-5 most similar studies investigated the prevention of the side effects caused by stem cell transplant, or conditions that could be treated by stem cell transplant; study "NCT01431274" and its top-5 most similar studies all investigated therapies for Chronic Obstructive Pulmonary Disease (COPD). These results show the utilities of CTKG for retrieving similar studies, which could facilitate new clinical trial design.
Similar medical concept retrieval. CTKG can also support the retrieval of other similar medical concepts. To demonstrate this, we identified the top-10 most similar pairs of condition nodes, drug-term nodes, adverse-event nodes, and standard-outcome nodes, as in Tables 5, 6, 7, and 8, respectively, using cosine similarities over their embeddings. As presented in Table 5, the identified similar condition nodes all share some commonalities. For example, condition nodes "Nephritis" and "Lupus Nephritis" have a similarity of 0.997 (average condition similarity is 0.331), and Lupus Nephritis is a common sub-type of Nephritis. We also found a similar trend in Tables 6, 7 and 8. For example, drug-term nodes "ABT-267" and "Macrocyclic Compound" have a similarity of 0.997 (average drug-term similarity is 0.254), and both drugs could be used to treat Hepatitis C Virus (HCV) infection 21,22.
In addition, the two drugs are studied together in multiple studies such as NCT01458535, NCT01464827 and NCT01563536. In Table 7 for adverse-event nodes, "Blood Luteinising Hormone" is very similar to "Uterus Myomatosus" (cosine similarity 0.995; average adverse-event similarity is 0.329). The high similarity could be due to the fact that Luteinising Hormone can affect the development and growth of Uterus Myomatosus by stimulating the production of estrogen 23. Note that Luteinising Hormone and Uterus Myomatosus are not present together in any of the CTKG studies; therefore, such similar pairs demonstrate the effectiveness of CTKG on [...] In Table 8 for standard-outcome nodes, "Aspartate Aminotransferase" is very similar to "Alanine Aminotransferase" in their embeddings (cosine similarity 0.986; average standard-outcome similarity is 0.315); they measure the amounts in the blood of two enzymes made by the liver and can be tested to check for liver damage. These results demonstrate that CTKG can facilitate the search and retrieval of medical entities in the context of clinical trials that carry similar or related information.
CTKG for other applications. CTKG could also enable other potential applications such as adverse drug event prediction and outcome prediction. Specifically, for adverse drug event prediction, we could [...] 24 over CTKG, and infer new adverse events of drugs using the existing or predicted paths from drug-term nodes to adverse-event nodes in CTKG. For outcome prediction, we could employ link prediction methods 9,13 to infer new edges between study nodes and outcome nodes based on the existing ones in CTKG. Overall, CTKG could facilitate new knowledge discovery, benefit the design of new clinical trials, and improve the success rate of future clinical trials. We released the code for drug repurpos-

Discussion
In this manuscript, we presented and released a new knowledge graph, CTKG, for clinical trials. We also described our methods for generating CTKG. We demonstrated the potential utilities of CTKG in drug repurposing and similarity search, among others, via embedding analysis over CTKG. Currently, CTKG only includes studies that have both drug interventions and reported outcomes. However, incomplete studies (e.g., studies not started or without reported outcomes), and studies without drug interventions (e.g., studies for medical devices) could also contain valuable knowledge for the design of future clinical trials. Therefore, we will enrich CTKG with more studies in future research. In addition, the current CTKG does not contain all the important information for drug discovery and development. For example, CTKG does not have the interactions between drugs/molecules and proteins/diseases, nor the interactions among proteins. Missing such information may limit the potential of CTKG for a much wider range of applications (e.g., to predict whether a new molecule for a disease can survive clinical trials). In future research, we will align CTKG with other knowledge bases 10,26,27 and integrate more diverse information into CTKG to enable more applications. Moreover, CTKG embeds rich textual information (e.g., title and description) and heterogeneous data types (e.g., numerical, categorical and textual data) as node attributes, which encourages a much broader spectrum of research (e.g., deep graph embedding, link prediction) and the development of more sophisticated methods that leverage such information for better translational clinical trial design. We will also investigate attribute-sensitive KG embedding methods 28 to better leverage CTKG.

Methods
CTKG represents each medical entity (e.g., a clinical trial, also referred to as a study; a drug; an adverse event) as a single node. To develop CTKG, we extracted the medical entities from the Access to Aggregate Content of ClinicalTrials.gov (AACT) database 5. We then normalized multiple expressions of the same medical entity into a single one. Figure 2 presents the overview of the development of CTKG.
Clinical trials data. The clinical trials data in CTKG is collected from the AACT database. AACT is a publicly available relational database, which contains the information of every clinical trial registered in ClinicalTrials.gov, and is updated on a daily basis. In AACT, each clinical trial, also referred to as a study, is associated with a unique National Clinical Trial (NCT) ID, and all the information of a clinical trial is stored in 45 different tables. For example, information representing the medications, procedures and other actions provided or conducted in a clinical trial is stored in two tables: "interventions" and "browse interventions"; information representing the measurements used to evaluate the safety and efficacy of drugs or procedures studied in clinical trials is stored in the table "outcomes". All the tables and their schemas are publicly available 29. Note that CTKG does not include all the tables in AACT. For example, CTKG does not include tables such as "Sponsors", "Overall officials" and "Result contacts" because they are not directly related to the design and results of clinical trials, and including them may not significantly benefit the knowledge graph in analyzing the relations among medical entities. Other AACT tables such as "Provided documents" and "Documents" contain the links to detailed study protocols, informed consent forms and statistical analysis plans. These documents have rich textual information that might be complementary to the structural relations represented by CTKG. However, such information is highly specific to each individual clinical trial, and does not help establish new relations across clinical trials unless natural language processing is applied first, which by itself is highly nontrivial. Therefore, CTKG does not include such tables; instead, CTKG uses AACT's original study IDs so that all such information can still be retrieved from AACT if needed.
CTKG does not include other AACT tables such as "Calculated values", "Design outcomes" and "Design group interventions" because information in such tables is already included in other tables that CTKG includes. Table 9 summarizes the AACT tables that are included and are not included in CTKG.

Adverse event normalization.
In AACT, the adverse events (AE) that occurred among the participants, represented by AE terms, can be found in the "reported events" table. Many AE terms listed in the table could be mapped to the Medical Dictionary for Regulatory Activities (MedDRA 30). MedDRA is the international medical terminology developed under the auspices of the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH). More specifically, we found 28,677 unique AE terms, of which 13,995 could be directly mapped to the MedDRA dictionary. In CTKG, such terms are also referred to as MedDRA terms. We normalized the remaining 14,682 AE terms that are not in MedDRA to MedDRA terms as follows:
• We removed parenthesized contents (e.g., "Altered pitch perception (pitch seemed lower)"). The contents in parentheses are typically explanations or afterthoughts, so removing them would not significantly affect the main meaning.
• We removed words or phrases that specify auxiliary information (e.g., "left", "right", "Baseline Phase") or the time frame (e.g., "for 12 hours") of adverse events. We observed that these words or phrases are study-specific, and not in the MedDRA terms. For example, by removing the phrase "Baseline Phase", the AE term "Throat tightness -Baseline Phase" can be normalized to the MedDRA term "Throat tightness". The phrase "Baseline Phase" specifies the initial phase of assessment involving the collection of initial data in the study, and is thus unrelated to the adverse event itself.
• We removed the stop words and lemmatized AE terms using the NLTK library 31 and the Stanza NLP library 32, respectively.
• We mapped an AE term to its most similar MedDRA term if their edit distance is less than 4. For example, the adverse event term "Cholecyctitis" will be normalized to the MedDRA term "Cholecystitis". This process can correct simple misspellings.
After each step above, if the normalized AE term was a MedDRA term, we stopped the normalization. With the above normalization, we successfully normalized 7,296 AE terms to MedDRA terms. In total, we got 15,976 unique MedDRA terms and had 7,393 AE terms that could not be normalized.
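The early-stopping cascade described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' pipeline: the stop-word removal and lemmatization step (NLTK, Stanza) is omitted, `AUXILIARY_PATTERNS` is a small hypothetical list built from the examples in the text, and `TOY_VOCABULARY` merely stands in for the licensed MedDRA dictionary.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical patterns derived from the examples in the text.
AUXILIARY_PATTERNS = [r"\bleft\b", r"\bright\b", r"\bbaseline phase\b", r"\bfor \d+ hours\b"]


def _tidy(text):
    """Collapse whitespace and strip dangling spaces or hyphens."""
    return re.sub(r"\s+", " ", text).strip(" -")


def normalize_ae(term, vocabulary, max_distance=4):
    """Return the vocabulary term matched by the first successful step, else None."""
    lookup = {v.lower(): v for v in vocabulary}
    text = _tidy(term.lower())
    if text in lookup:
        return lookup[text]
    # Step 1: drop parenthesized explanations.
    text = _tidy(re.sub(r"\([^)]*\)", " ", text))
    if text in lookup:
        return lookup[text]
    # Step 2: drop study-specific auxiliary words and time frames.
    for pattern in AUXILIARY_PATTERNS:
        text = re.sub(pattern, " ", text)
    text = _tidy(text)
    if text in lookup:
        return lookup[text]
    # Final step: closest vocabulary term, accepted only if the distance is below the threshold.
    best = min(lookup, key=lambda v: edit_distance(text, v))
    return lookup[best] if edit_distance(text, best) < max_distance else None


TOY_VOCABULARY = ["Throat tightness", "Cholecystitis"]  # stand-in, not MedDRA
print(normalize_ae("Throat tightness -Baseline Phase", TOY_VOCABULARY))
print(normalize_ae("Cholecyctitis", TOY_VOCABULARY))
```

One caveat of a fixed threshold is that very short terms lie within three edits of many unrelated terms, so a length-aware threshold may be preferable in practice.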
In order to construct a one-to-one mapping between the adverse events and the MedDRA terms, we used the MedDRA dictionary to further group multiple MedDRA terms of the same adverse event into a unique MedDRA term. According to the definition of MedDRA, each MedDRA term is assigned to one of the five hierarchical levels 33. Specifically, the MedDRA terms at the lowest level (i.e., level 1), which are used to communicate the adverse events in practice, could correspond to the same adverse event. For example, "Eye itching" and "Ocular itching" are two MedDRA terms at level 1 and represent the same event. Such MedDRA terms corresponding to the same event have a common parent, which is a MedDRA term at level 2 (e.g., "itchy eyes" in the above example). Therefore, we normalized each MedDRA term at level 1 to its linked MedDRA term at level 2. In total, we converted 15,976 MedDRA terms into 11,153 more abstract MedDRA terms. Each term among these 11,153 MedDRA terms and 7,393 non-MedDRA terms represents an adverse event, which is further represented as an adverse-event node in CTKG. Note that due to the licensing restriction of MedDRA, we did not specify which adverse-event nodes represent MedDRA terms in CTKG and only kept the terms as the attribute of adverse-event nodes.
Drug mentions and normalization. In AACT, the drugs used in studies (i.e., clinical trials) could be found in the intervention table, in which the "name" field stores the information about medicines and administrations used in each intervention. For example, we could find that the drug Naltrexone is used in the study NCT04322526 via its intervention "Naltrexone 50 Mg Oral Tablet". In CTKG, we used Medical Text Indexer (MTI) 34 to automatically extract drug mentions. MTI is developed by the National Library of Medicine (NLM) to recognize medical entities (e.g., anatomy, drugs and conditions) from plain text. We used this tool to extract drug mentions in 2 steps:
• We used MTI to automatically recognize all the medical entities from the interventions.
• We identified drug entities among the medical entities recognized by MTI. Specifically, for each recognized entity, MTI outputs its MeSH code if available. MeSH is a hierarchically organized vocabulary from NLM to index and categorize biomedical and health-related information 35. Given the MeSH codes, we first identified entities with MeSH codes starting with the character "D", which indicates drug entities (e.g., D02.241.223.701.430 for Ibuprofen). After that, we removed the entities not representing specific drugs by excluding those with the MeSH codes D26.310 (drug combination), D26 (pharmaceutical preparations), D23.101 (biomarkers) and D26.255 (dosage forms). We also noticed that a few recognized entities were not associated with MeSH codes. For these entities, we did a manual check and identified the ones representing specific drugs.
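The MeSH-code filter in the second step amounts to a prefix test followed by an exclusion list, as in the hedged sketch below. The function name `is_specific_drug` is hypothetical, and the exclusion is implemented as an exact match on the four listed codes; whether descendants of those codes were also excluded is not stated in the text, so that choice is an assumption.

```python
# Codes listed in the text as not representing specific drugs.
EXCLUDED_CODES = {
    "D26.310",  # drug combination
    "D26",      # pharmaceutical preparations
    "D23.101",  # biomarkers
    "D26.255",  # dosage forms
}


def is_specific_drug(mesh_code):
    """True if the code starts with 'D' and is not one of the excluded codes.

    Assumption: exclusion is by exact match only, not by subtree.
    """
    return mesh_code.startswith("D") and mesh_code not in EXCLUDED_CODES


# The Ibuprofen code is the example from the text; "C04.588" is an
# arbitrary non-"D" code used only to show a rejected entity.
for code in ["D02.241.223.701.430", "D26.310", "C04.588"]:
    print(code, is_specific_drug(code))
```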
After the above 2 steps, there were still 1,775 unique interventions in which MTI did not find any drug mentions. For these interventions, we did a manual search and identified the drugs mentioned. Eventually, from the intervention table, we found 3,487 mentioned drugs in total. Among these drugs, 860 (24.7%) were found manually. Most of the manually found drugs are investigational drugs (e.g., pf-06669571), or drugs mentioned by abbreviations (e.g., tvr and umec). Besides the drugs in interventions, there were also drugs mentioned in the titles or descriptions of the study groups (e.g., event group). For example, from the title "tramadol/diclofenac 25/25", we could find the drugs Tramadol and Diclofenac. We therefore also extracted drugs mentioned in the titles or descriptions of study groups to generate a complete list of drug mentions. Specifically, we first used the above 2 steps to automatically extract the mentioned drugs in titles and descriptions of study groups. For groups in which no drugs were found automatically, we manually searched their titles and descriptions, and identified the mentioned drugs. In the end, we found 4585 drug mentions from the interventions and the study groups.
From the drug mentions, we observed that one drug could be represented by different names. For example, the drug "losartan potassium" could be represented by its brand name "cozaar" or its generic name "losartan". Therefore, we normalized the drug mentions found in texts to normalized terms. Specifically, we first used MTI to map all the 4585 drugs to their MeSH terms. For example, MTI could automatically map the drugs "losartan potassium", "cozaar" and "losartan" to the MeSH term "losartan". For the drugs whose MeSH terms MTI could find, the MeSH terms were used as their normalized terms. For the other drugs, if they were given as abbreviations (e.g., tvr), we first found their full names (e.g., Telaprevir), and used the MeSH terms of their full names for normalization; if they were not abbreviations, we used their generic names for normalization. We noticed that investigational drugs may not have generic names. For these drugs, their identifiers mentioned in studies (e.g., pf-06669571) were used as their normalized terms. After the normalization, the 4585 drug mentions were normalized to 2548 normalized terms. Each of the normalized terms is represented as a drug-term node in CTKG.
Statistical analysis method normalization. We observed that one statistical analysis method could be represented by different names in the table. For example, the method "paired t-test" could be represented as "paired t test", "paired t-tests" and "paited t-test" in the table. Therefore, we normalized the names of the methods using the 3 steps as follows:
• We preprocessed the method names from the table by removing the spaces and punctuation in the text.
• We calculated the edit distance among the preprocessed names, and normalized the preprocessed names with edit distance less than 4 to the same normalized term. We also did a manual check to correct possible misnormalization. For the names normalized to the same term, we used the name with the highest frequency as the normalized term.
• We further refined the normalized terms by merging the terms with the same words. We noticed that after the second step, there were still normalized terms that represent the same method with the same words but in different orders. For example, the normalized terms "pairedttest" and "ttestpaired" represent the same method "paired t-test" with the same words but in different orders. We manually merged such terms into the one with the highest frequency.
After all the steps, we normalized the 1,299 unique method names mentioned in the table to 907 normalized terms. Each of the normalized terms is represented as a method node in CTKG.
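The first two steps can be illustrated with a small greedy grouping procedure. The text does not specify how names within edit distance 4 were grouped, so the strategy below (visit names from most to least frequent and attach each one to the first existing representative closer than the threshold) is an assumption, as are the lowercasing in `preprocess` and the function names. The manual check and the manual word-order merge of the third step are not reproduced.

```python
import re
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def preprocess(name):
    """Lowercase the name and remove spaces and punctuation."""
    return re.sub(r"[\W_]+", "", name.lower())


def normalize_methods(raw_names, max_distance=4):
    """Map each preprocessed name to the most frequent name within max_distance."""
    counts = Counter(preprocess(name) for name in raw_names)
    representatives = []
    mapping = {}
    for name, _ in counts.most_common():  # most frequent names become representatives first
        match = next(
            (rep for rep in representatives if edit_distance(name, rep) < max_distance),
            None,
        )
        if match is None:
            representatives.append(name)
            match = name
        mapping[name] = match
    return mapping


raw = ["paired t-test"] * 3 + ["paired t test", "paired t-tests", "paited t-test", "ANOVA"]
print(normalize_methods(raw))
```

Because short names such as "anova" and "ancova" differ by a single edit, a purely automatic pass would merge distinct methods, which is consistent with the manual check described above.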
Outcome extraction and outcome clustering. In AACT, the outcome measures used to test the effectiveness of the interventions could be found in the "title" or the "description" fields of the outcome table. Most of the titles in the outcome table are long phrases and could involve multiple standard outcome measures (e.g., the title "Change From Baseline in Platelet Count and White Blood Cell Count" involves the standard outcome measures "Platelet Count" and "White Blood Cell Count"). These standard outcome measures are common assessment tools that are used to assess the effectiveness of an intervention. The complex relations between the outcome titles and the standard outcome measures make it difficult to directly represent the outcomes with the extracted standard outcome measures. Therefore, we incorporated the identified standard outcome measures as nodes into CTKG and built connections between the outcome nodes and the involved standard-outcome nodes. Through such connections, we can infer which standard outcome measures are used in each study to assess the efficacy of interventions. We observed that some popular phrases within the titles or the descriptions of outcome records represent standardized assessment tools used to measure the outcome of clinical trials; for example, "Visual Analogue Scale" is a tool widely used as a measure for pain. Incorporating such standard outcome measures into CTKG could enable the comparison of outcome measurements across different studies, and could also provide a reference regarding the choice of standardized assessment tools in the design of clinical trials. Therefore, we extracted the phrases that could represent standard outcome measures as below:
• We found the abbreviations and identified the definitions of abbreviations from the titles or the descriptions of the outcomes using the Schwartz-Hearst algorithm 36. We observed in the titles that many standard outcome measures are associated with their corresponding abbreviations. For example, we could identify the abbreviation "BI" and the corresponding definition "Bleeding Index" from the outcome name "Gingival Health Measured by Bleeding Index (BI)".
• We kept only the definitions containing the following words: scale, index, score, test, questionnaire, value, count, inventory, assessment, level, rate. We observed that most standard outcome measures would contain such words (e.g., "Visual Analogue Scale", "Social Responsiveness Scale").
• We manually normalized different variants of the same standard outcome measures and removed the extracted phrases that are not outcome measures. We also manually added some popular standard outcome measures (e.g., "Overall Survival", "blood pressure", "triglyceride") that do not contain the above words or do not have any abbreviations.
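The first two extraction steps can be approximated with a short sketch. Note that this is a deliberately simplified stand-in for the Schwartz-Hearst algorithm: it only accepts an upper-case abbreviation in parentheses whose letters equal the initials of the immediately preceding words, whereas the real algorithm matches characters more flexibly. The function name `candidate_measures` and the second and third example titles are invented for illustration; the manual curation step is not covered.

```python
import re

# Keyword list taken from the text.
KEYWORDS = {
    "scale", "index", "score", "test", "questionnaire", "value",
    "count", "inventory", "assessment", "level", "rate",
}


def candidate_measures(title):
    """Return (abbreviation, definition) pairs whose definition contains a keyword."""
    results = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", title):
        abbreviation = match.group(1)
        preceding_words = title[: match.start()].split()
        definition = preceding_words[-len(abbreviation):]
        if len(definition) < len(abbreviation):
            continue  # not enough words before the parenthesis
        initials = "".join(word[0].upper() for word in definition)
        if initials != abbreviation:
            continue  # simplified matching rule: initials must spell the abbreviation
        if KEYWORDS & {word.lower() for word in definition}:
            results.append((abbreviation, " ".join(definition)))
    return results


print(candidate_measures("Gingival Health Measured by Bleeding Index (BI)"))
print(candidate_measures("Pain Intensity on the Visual Analogue Scale (VAS)"))
print(candidate_measures("Change in Body Weight (BW)"))
```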
All the extracted phrases are represented as the standard-outcome nodes in CTKG. In the end, we got 492 standard-outcome nodes from 50,342 outcome records (i.e., 56.96% of all the 88,386 outcome records), and connected the standard-outcome nodes with the corresponding outcome nodes. With the extracted standard outcome measures, there were still more than 40% of the outcome nodes not connected to any standard-outcome nodes. Therefore, to aggregate similar outcome nodes, we also grouped all the outcome titles (including those containing the standard outcome measures) into several clusters. Specifically, we represented each outcome title using its term frequency-inverse document frequency (TF-IDF) vector. We then grouped the TF-IDF vectors of outcome titles into 200 clusters using CLUTO 37, a clustering toolkit. Each cluster is represented as a cluster-outcome node and has attributes describing the cluster size, that is, the number of outcomes within the cluster, and the most representative words of these outcomes. Specifically, for each cluster, the representative words of outcomes include 5 descriptive words and 5 discriminating words derived by CLUTO that can best describe or discriminate each cluster. Each word is associated with a percentage computed by CLUTO (details in its manual), which indicates the importance of this word with respect to describing or discriminating the cluster. We converted the descriptive words and the discriminating words into two attributes of each cluster, by combining the words and their corresponding percentages. For example, after clustering, one