Leveraging graph-based hierarchical medical entity embedding for healthcare applications

Automatic representation learning of key entities in electronic health record (EHR) data is a critical step for healthcare data mining that turns heterogeneous medical records into structured and actionable information. Here we propose ME2Vec, an algorithmic framework for learning continuous low-dimensional embedding vectors of the most common entities in EHR: medical services, doctors, and patients. ME2Vec features a hierarchical structure that encapsulates different node embedding schemes to cater to the unique characteristics of each medical entity. To embed medical services, we employ a biased-random-walk-based node embedding that leverages the irregular time intervals of medical services in EHR to embody their relative importance. To embed doctors and patients, we adhere to the principle “it’s what you do that defines you” and derive their embeddings based on their interactions with other types of entities, through a graph neural network and proximity-preserving network embedding, respectively. Using real-world clinical data, we demonstrate the efficacy of ME2Vec over competitive baselines on diagnosis prediction, readmission prediction, and recommending doctors to patients based on their medical conditions. In addition, medical service embeddings pretrained using ME2Vec can substantially improve the performance of sequential models in predicting patients' clinical outcomes. Overall, ME2Vec can serve as a general-purpose representation learning algorithm for EHR data and benefit various downstream tasks in terms of both performance and interpretability.


A.1 Service Embedding
The details of learning medical service embedding vectors are given in Algorithm 1. The combinations function lists all the unique pairs of medical services within the segment J^(i)_seg. To embed medical services, we first obtain the adjacency matrix A_svc from patient journeys and use it to generate biased random walks, then optimize the embeddings of medical services by maximizing the probability of each service "seeing" its neighbors in the walks via stochastic gradient descent (SGD).
Algorithm 1 Medical service embedding
Input: Patient journeys {J^(i)}_{i=1}^P, context window length T, dimension p, walks per node r, walk length l, context size k
Output: Service embedding S ∈ R^{|S|×p}
1: A_svc ← 0 ∈ R^{|S|×|S|}
2: for i = 1 to P do
3:   for s_x, s_y in combinations(J^(i)_seg) do
…
11:   for all nodes s ∈ S do
12:     walk ← BiasedRandomWalk(G_svc, s, l)
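The first half of Algorithm 1 (accumulating the co-occurrence adjacency and generating weight-biased walks) can be sketched as follows. Function names (`build_service_graph`, `biased_random_walk`) and the journey data layout are illustrative assumptions; the skip-gram SGD step that consumes the walks is omitted.

```python
import random
from itertools import combinations
from collections import defaultdict

def build_service_graph(journeys):
    """Accumulate a symmetric co-occurrence adjacency A_svc from patient
    journeys, where each journey is a list of segments (lists of service ids)."""
    A = defaultdict(float)
    for segments in journeys:
        for seg in segments:
            for sx, sy in combinations(sorted(set(seg)), 2):
                A[(sx, sy)] += 1.0
                A[(sy, sx)] += 1.0
    return A

def biased_random_walk(A, start, length, rng):
    """Walk on the weighted service graph, choosing the next node with
    probability proportional to edge weight (a simple O(E)-per-step sketch)."""
    walk = [start]
    for _ in range(length - 1):
        cur = walk[-1]
        nbrs = [(y, w) for (x, y), w in A.items() if x == cur]
        if not nbrs:
            break
        nodes, weights = zip(*nbrs)
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk
```

In practice the walks for all nodes (r walks per node, length l) would be fed to a skip-gram objective with context size k, as in the SGD step of Algorithm 1.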

A.2 Doctor Embedding
In this work, we deployed two layers of graph attention networks to enhance the learning capability. The structure of the two-level doctor embedding model is shown in Figure 1. As the proposed auxiliary task is a supervised classification, we configure the final layer of the model as a softmax. In addition, the output embeddings from each of the K attention heads of the second GraphAttenNet are averaged instead of concatenated, followed by the final softmax:

h'_i = softmax( (1/K) ∑_{k=1}^{K} ∑_{j∈N_i} α_ij^(k) W^(k) h_j ).

We summarize the steps for doctor embedding in Algorithm 2, where GraphAttenNet-2L denotes the operations of two GraphAttenNets stacked together; CrossEnt denotes the cross-entropy loss; L_gt and L_pred represent the ground-truth and the predicted doctor specialties, respectively.
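The head-averaging final layer can be sketched with NumPy as below. Shapes and the name `gat_output_layer` are illustrative assumptions; the attention matrices are taken as already computed and row-normalized, which is where a full GAT layer does most of its work.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gat_output_layer(H, A_norm, Ws):
    """Final GAT layer: average the K head outputs instead of concatenating
    them, then apply a softmax over specialty classes.
    H: (N, d) node features; A_norm: list of K (N, N) attention matrices
    (rows sum to 1); Ws: list of K (d, C) per-head weight matrices."""
    K = len(Ws)
    avg = sum(A_norm[k] @ H @ Ws[k] for k in range(K)) / K
    return softmax(avg)
```

Averaging rather than concatenating keeps the output dimension equal to the number of specialty classes, which is what the supervised softmax classification requires.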

Algorithm 2 Doctor embedding
Input: Patient journeys {J^(i)}_{i=1}^P, service embedding S, number of attention heads K, learning rate η
Output:

A.3 Patient Embedding
We summarize the steps for patient embedding in Algorithm 3.
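A minimal sketch of the proximity-preserving idea behind the patient embedding is given below, assuming a simplified LINE-style first-order objective: embeddings of linked nodes (e.g., a patient and a doctor or service they interact with) are pulled together, while a random negative sample is pushed away. The function name, update rule, and hyperparameters are illustrative, not the paper's exact Algorithm 3.

```python
import numpy as np

def proximity_embedding(edges, n_nodes, dim=8, lr=0.1, epochs=200, seed=0):
    """Toy first-order proximity embedding: SGD on
    log sigma(E_u . E_v) + log sigma(-E_u . E_neg) for each edge (u, v),
    with one uniformly sampled negative node per edge."""
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(n_nodes, dim))
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        for u, v in edges:
            neg = rng.integers(n_nodes)
            g_pos = 1.0 - sig(E[u] @ E[v])   # pull linked pair together
            E[u] += lr * g_pos * E[v]
            E[v] += lr * g_pos * E[u]
            g_neg = sig(E[u] @ E[neg])       # push negative sample away
            E[u] -= lr * g_neg * E[neg]
            E[neg] -= lr * g_neg * E[u]
    return E
```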

B.1 Criteria of Choosing Patients into Negative Cohort
The following criteria are used for generating the negative cohort: CLL risk factor, diagnosis, procedure, and prescription. A patient must meet at least three of the criteria to be included in the negative cohort. The risk factors include anemia, chills, fatigue, fever, night sweats and pain, Sjögren's syndrome, weakness, and weight loss. The related diagnoses include recurrent infection, Epstein-Barr infection, Helicobacter pylori infection, HIV/AIDS, Human T-lymphotrophic virus Type-I, rheumatoid arthritis, hypogammaglobulinemia, psoriasis, Wiskott-Aldrich syndrome, and pneumonia. The related procedures include tissue culture and chromosome analysis, increased frequency of CBC or blood tests, and flow cytometry. The related prescriptions include Dexamethasone, Neupogen, and Prednisolone.
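The "at least three of four categories" inclusion rule can be expressed as a small filter. The function name and the category/code representation are illustrative assumptions, not part of the paper's pipeline.

```python
def in_negative_cohort(patient, criteria):
    """Include a patient only if they meet at least three of the four
    criterion categories (risk factor, diagnosis, procedure, prescription).
    `patient` maps category -> set of codes observed for the patient;
    `criteria` maps category -> set of qualifying codes."""
    met = sum(
        1 for cat, codes in criteria.items()
        if patient.get(cat, set()) & codes  # category met if any code overlaps
    )
    return met >= 3
```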

B.2 Using ME2Vec as Pretrained Input Embeddings for Recurrent Models
As shown in Figure 2, the recurrent model we used for this experiment is a two-layer LSTM with 256-dimensional hidden units and a 128-dimensional input embedding layer. The hidden outputs of the LSTM enter a global max-pooling layer that, for each of the 256 dimensions, keeps the maximum value across all time steps. The outputs of the global max-pooling layer are further processed by a multilayer perceptron (MLP) ending with a sigmoid function to make the final prediction. The recurrent model is trained for 30 epochs using an Adam optimizer with a batch size of 64 and a learning rate of 1e-4.
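A PyTorch sketch of this architecture, with the input embedding layer initialized from pretrained ME2Vec service vectors, might look as follows. The class name, the MLP width, and the exact head layout are illustrative assumptions; only the two-layer 256-unit LSTM, 128-dimensional embedding input, global max-pooling, and sigmoid output follow the description above.

```python
import torch
import torch.nn as nn

class JourneyLSTM(nn.Module):
    """Two-layer LSTM over sequences of service codes; the embedding
    layer is initialized from pretrained ME2Vec service vectors."""
    def __init__(self, pretrained, hidden=256):
        super().__init__()
        # pretrained: (vocab_size, 128) tensor of ME2Vec service embeddings
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.lstm = nn.LSTM(pretrained.size(1), hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, seq_len) int codes
        h, _ = self.lstm(self.embed(x))   # (batch, seq_len, hidden)
        pooled = h.max(dim=1).values      # global max-pool over time steps
        return self.head(pooled).squeeze(-1)
```

Training would then use `torch.optim.Adam(model.parameters(), lr=1e-4)` with batches of 64 sequences for 30 epochs, matching the setup described above.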