Relevant Word Order Vectorization for Improved Natural Language Processing in Electronic Health Records

Electronic health records (EHR) represent a rich resource for conducting observational studies, supporting clinical trials, and more. However, much of the data contains unstructured text, presenting an obstacle to automated extraction. Natural language processing (NLP) can structure and learn from text, but NLP algorithms were not designed for the unique characteristics of EHR. Here, we propose Relevant Word Order Vectorization (RWOV) to aid with structuring. RWOV is based on finding the positional relationship between the words most relevant to predicting the class of a text. This allows machine learning algorithms to use not just keyword interactions but positional dependencies (e.g. a relevant word occurs 5 relevant words before some term of interest). As a proof-of-concept, we attempted to classify the hormone receptor status of breast cancer patients treated at the University of Kansas Medical Center, comparing RWOV to other methods using the F1 score and AUC. RWOV performed as well as, or better than, other methods in all but one case. For F1 score, RWOV had a clear edge on most tasks. AUC tended to be closer, but for HER2, RWOV was significantly better for most comparisons. These results suggest RWOV should be further developed for EHR-related NLP.

www.nature.com/scientificreports/
The idea behind RWOV is quite simple: it predicts the class of a term of interest (TOI) from a block of text. Although EHR data are unstructured, healthcare professionals use a relatively concise vocabulary when describing patient characteristics, so the same terms occur repeatedly across patient medical records. Furthermore, we propose that only a fraction of these terms carry meaning in relation to a particular term of interest, and that the relative order of these "most relevant words" is important to the meaning of the text. RWOV creates a matrix in which each row represents a subject and each column a word. The most relevant words are those that co-occur most frequently with the TOI; we call these the top words. The value in each cell of the matrix is either 0, or the inverse of one plus the number of top words that occur between the top word represented by the column and the TOI. The sign of the value indicates whether the top word occurs before or after the TOI in the text. Therefore, the value in each cell falls away naturally, in a nonlinear fashion, from 1 (as close as possible to the TOI) to 0 (does not occur in the block of text).
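The cell values described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the released RWOV code; the function name `rwov_vector`, the use of the occurrence nearest the TOI, and the assumption that the text has already been tokenized and stemmed are ours.

```python
def rwov_vector(tokens, toi, top_words):
    """One RWOV row: for each top word, the signed inverse of
    (number of intervening top words + 1) relative to the TOI.
    Illustrative sketch; tokens are assumed pre-stemmed, and the
    first occurrence of the TOI is used."""
    vec = {w: 0.0 for w in top_words}
    if toi not in tokens:
        return [vec[w] for w in top_words]
    t = tokens.index(toi)  # position of the term of interest
    for w in top_words:
        positions = [i for i, tok in enumerate(tokens) if tok == w]
        if not positions:
            continue  # word absent from this block of text -> 0
        i = min(positions, key=lambda p: abs(p - t))  # nearest occurrence
        # count top words strictly between this word and the TOI
        lo, hi = sorted((i, t))
        between = sum(1 for j in range(lo + 1, hi) if tokens[j] in top_words)
        value = 1.0 / (between + 1)
        # negative if the top word precedes the TOI, positive if it follows
        vec[w] = value if i > t else -value
    return [vec[w] for w in top_words]
```

With `tokens = ["er", "positive", "pr", "negative"]` and TOI `"er"`, the adjacent word "positive" scores 1.0, "pr" (one top word in between) scores 0.5, and "negative" (two in between) scores 1/3, matching the nonlinear drop-off described above.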

Data.
For this study, we used a straightforward collection of data to evaluate the performance of our NLP approach compared to a few other existing approaches. Three datasets containing tumor pathology reports of women with breast cancer who sought treatment at the University of Kansas Medical Center in two recent years were included, one for each of the terms of interest used in this study. Our goal is to identify the status of three important breast cancer biomarkers from the pathology report free text: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). In order to keep the results more interpretable, we limited the datasets to include only those reports that included a determination of hormone receptor status. The numbers of positive and negative subjects for each hormone receptor are shown in Table 1. The data were labelled by two annotators who received training in reading pathology reports from clinical research coordinators who normally perform this task. Any discrepancies in annotation were then manually resolved (there were no disagreements due to interpretation).
With respect to Algorithm 1, the specific terms of interest (TOI) used in this study were "er", "pr", and "her2". However, we allow the user to set aliases for a TOI. In this case, the aliases were, for "er": "estrogen" and "estrogen receptor"; for "pr": "progesterone receptor"; and for "her2": "her-2" and "her/". Note that in most cases, numerous biomarkers were listed in a single block of text. For example, a report might list the following: "tumor cells are positive for CK7 and MOC31, negative for CK5/6, TTF-1, CK20, ER and PR. Her2 immunostain is positive. Ki-67 is reported 85%". In most cases there were between 3 and 10 biomarkers delineated, adding to the challenge of correct classification.
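The alias handling can be illustrated with a small normalization step that rewrites each alias to its canonical TOI before vectorization. This is a hypothetical sketch, not the released implementation: the `normalize_toi` helper and its simple longest-first substring replacement are our assumptions.

```python
# Hypothetical alias table mirroring the aliases listed above:
# each canonical TOI maps to the alternate spellings seen in reports.
ALIASES = {
    "er": ["estrogen receptor", "estrogen"],
    "pr": ["progesterone receptor"],
    "her2": ["her-2", "her/"],
}

def normalize_toi(text, aliases=ALIASES):
    """Replace every alias with its canonical TOI token.
    Longer aliases are replaced first, so 'estrogen receptor' is
    matched before the bare 'estrogen'. Illustrative only."""
    text = text.lower()
    for toi, alts in aliases.items():
        for alt in sorted(alts, key=len, reverse=True):
            text = text.replace(alt, toi)
    return text
```

For instance, "Estrogen receptor: Positive" normalizes to "er: positive", so the downstream vectorizer only ever sees the canonical TOI spelling.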
Analysis. Class imbalance is likely to be a factor when assessing performance with these types of data.
Hormone receptor status is unbalanced in the population. Furthermore, depending on the study, researchers may be interested in one class or the other. Complicating this is the fact that some performance metrics, such as accuracy, can give the impression of good performance even when a method is unable to accurately predict the class of interest. Therefore, we break down the performance by class (and train separate models to predict each class), and provide F1 and AUC as our major performance metrics, given that accuracy can be misleading in these circumstances. We will not attempt to use sophisticated class balancing approaches in this assessment, so that we can minimize the number of factors being considered. F1 is defined as follows:

F1 = 2TP / (2TP + FP + FN)

where TP, FP, and FN stand for the number of true positives, false positives, and false negatives, respectively. In other words, it is the harmonic mean of the precision and the recall. We will assume that each of our NLP methods can produce a score representing its confidence in the predicted class label of a sample. For this work, we will also assume that a sample can only be one of two classes. Then the AUC is simply the probability that a random sample that is truly of one class is scored higher than a random sample from the other class.
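Both metrics can be computed directly from these definitions. The following is a minimal from-scratch illustration of the two formulas (for the actual analyses, standard implementations were used, e.g. the pROC package for AUC comparisons):

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of
    precision (TP/(TP+FP)) and recall (TP/(TP+FN))."""
    return 2 * tp / (2 * tp + fp + fn)

def auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen sample of one
    class is scored higher than a randomly chosen sample of the
    other class; ties count one half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, 8 true positives with 2 false positives and 2 false negatives give F1 = 16/20 = 0.8, and a classifier that ranks every positive above every negative attains an AUC of 1.0.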
We will compare the performance of our approach to two popular vectorization methods. The first is known as n-grams9,11, and the other is called word2vec14, combined with either of two machine learning algorithms: support vector machines (SVM) and artificial neural networks (NN). For n-grams, we rely on the implementation in the CountVectorizer module of the scikit-learn library for Python. For word2vec, we used the gensim library for Python. There are two important stages to consider, each with its own associated algorithms: (1) data structuring, or vectorization, and (2) learning and prediction. At both of these stages, important choices can be made that affect performance. At the first stage, vectorization, there are hyperparameters for all of these approaches. For our approach, the only hyperparameter is the number of top (or most relevant) words to model. This was decided by training and testing on an independent dataset, and the resulting setting was then used for all the analyses here. This will be the recommended default setting for our method, though performance could likely be improved by using separate training, validation, and test data; however, we wanted to preserve our sample size in this case. For n-grams, we used a number of different settings to determine the effect of considering greater or fewer numbers of words. These were [1,2], [2,2], [1,3], [2,3], and [3,3], where the numbers represent the ranges of n-gram lengths to build. Furthermore, the vectors were transformed by IDF15. The word2vec approach has more hyperparameters, so we searched a number of parameter sets on the independent dataset to determine those with the best performance. We found settings in a relatively common range (size = 100 for the dimensionality of the vectors, window = 6 for the maximum distance between a word and the predicted word, negative = 5 for the number of negative words sampled).
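For illustration, the n-gram counting that CountVectorizer performs for a range such as [1,2] can be sketched as follows. This is a simplified stand-in for the scikit-learn implementation: it does no vocabulary pruning or IDF weighting, and the tokenizer is assumed to have run already.

```python
from collections import Counter

def ngram_counts(tokens, ngram_range=(1, 2)):
    """Count all n-grams with n in the inclusive range, mirroring
    the idea behind CountVectorizer(ngram_range=...). Simplified
    sketch: no vocabulary cutoffs, no IDF transform."""
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

A setting of [1,2] thus produces both unigrams and bigrams, whereas [3,3] would produce trigrams only; this is the knob varied across the five settings above.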
The neural network structure was determined using a grid search to determine the best structure for the n-grams and word2vec methods. Although the differences in performance were minimal overall, the same structure was found for both these methods. Therefore, we also used this structure for our own method, in order to give those methods any possible advantage. This structure contained three hidden layers using rectified linear units (ReLU) for their activations with 50, 50, and then 100 nodes respectively, and a final sigmoid layer for classification. L2 regularization of the network was also used, and the hyperparameter was determined separately for each method using the independent dataset. Again, better performance could be achieved by tailoring this solution. As noted, we have taken measures to ensure the comparison is as fair as possible. In all comparisons, the exact same training and test data were used to compare all models. All results are presented as the average result over a threefold cross-validation.
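The threefold cross-validation protocol can be sketched as follows. This is a generic illustration: the fold assignment here is a random shuffle under an assumed seed, as the paper does not specify its exact split procedure.

```python
import random

def threefold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for threefold cross-validation.
    Illustrative sketch: folds are assigned by a seeded shuffle, which
    may differ from the split used in the original analysis."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::3] for i in range(3)]  # three disjoint folds
    for k in range(3):
        test = folds[k]
        train = [i for j in range(3) if j != k for i in folds[j]]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging a metric over the three folds uses every labelled report exactly once for testing.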

Results
Our results (Table 2) showed that RWOV has consistently high accuracy in detecting ER, PR, and HER2 status compared to other vectorization approaches, using NN as a classifier. Furthermore, this result holds irrespective of which class is being predicted. In terms of class imbalance, the other approaches saw a noticeable decline in performance for the underrepresented class in most cases (i.e. ER−, PR−, and HER2+), particularly in terms of F1 score. This is despite the fact that class weighting was enabled for SVM. In every case, RWOV had the highest F1 score. This is particularly notable for the HER2+ class, which included only 42 subjects that were HER2+.
For both F1 score and AUC, RWOV-NN (i.e., RWOV using neural networks) had the best performance in every task, with the exception of AUC for PR− classification. However, even in that case it was very close (0.95 vs. 0.96 for SVM with n-grams). In some cases, the other methods had equal performance to RWOV at particular settings. Differences in AUC were tested using the pROC package for the R statistical environment, using the method of DeLong et al.16 (Table 3). In most cases, there were no significant differences in terms of AUC. However, for HER2+, and in most comparisons for HER2−, RWOV-NN had a significantly greater AUC than any other approach.
We created 95% bootstrap confidence intervals for all AUCs and F1 scores; these are shown in Figs 1 and 2, respectively. For most tasks, RWOV-NN had not only the highest performance but also consistently narrow confidence intervals. ROC curves are shown in Fig. 3. From Figs 1 and 2, it can be seen that in every case where there is a significantly superior method, it is RWOV that has the best performance. Figure 3 demonstrates that, although there are typically no significant differences in AUC, RWOV-NN typically exhibits the best cut threshold, allowing for a high true positive fraction and a low false positive fraction.
The top 30 words for each TOI are shown in Table 4, as well as the mean frequency of their occurrence per observation. Note that the words have been stemmed to reduce them to their common roots or parts, often by truncation and in some cases by substitution with an identifier. For example, "tumor" becomes "tum", and "comma" and "," become "comm".

Discussion and Conclusions
Electronic healthcare records have enormous promise in facilitating research into improving patient treatment. However, much of the EHR is stored as unstructured data. Therefore, it is time-consuming to extract data from the EHR to pre-screen patients for clinical trials, or to perform feasibility analysis for recruitment, because these records must be manually examined. Additionally, useful observational studies could be performed were it not for this major limitation. Nevertheless, given that the primary purpose of EHR is to support patient care, it would be inappropriate to change their structure purely to facilitate research. Therefore, it is imperative to develop methods for structuring and learning from these data that can facilitate these goals.
In this study, we have demonstrated that our method, Relevant Word Order Vectorization (RWOV), combined with a neural network, shows great promise in tackling this challenge. On a relevant use case, of identifying the hormone receptor status of breast cancer patients, RWOV showed consistently high accuracy across all three classification tasks. In most cases, it had the highest accuracy of any method examined in this study. Of particular importance, RWOV maintained high accuracy in classes with the poorest representation. This is necessary, because for some studies, it will be necessary to include patients based on these poorly represented classes, and poor accuracy might lead to some subjects being unnecessarily excluded (for example HER2+).
The reason for RWOV's performance on these tasks seems clear. It depends on a unique vectorization method that determines the most important words for classifying a particular case, in addition to their relative location with reference to a term of interest. This relevant word ordering is well suited to natural language processing in electronic health records, where the data are semi-structured due to the repetitive way healthcare providers often enter text. Our algorithm is able to take advantage of this semi-structure to be a more powerful learner. The relatively poor performance of word2vec, a well-respected approach, is likely due to the small sample size; it typically depends on larger samples to perform well.
Although RWOV performed at least as well as the other methods for the most common class, it particularly shone in predicting the least common class (ER−, PR−, and HER2+). This was particularly evident in the case of HER2+. For this dataset, there were relatively few positive observations (just 42 of the 229 total observations), and the task has a higher semantic complexity. Although we could not determine a single reason for its performance edge on this problem, in at least some cases it was able to recognize a class with a more complex set of dependencies. For example, RWOV-NN and RWOV-SVM correctly classified an observation with a snippet of text like the following (full text longer): "ER: Positive (85%) PR: Positive (80%) HER2: Equivocal (2+) Ki-67: 23% FISH analysis for HER2/neu was performed and was reportedly negative." The other methods were unable to (with the exception of NN-W2V). For HER2, when the initial test is equivocal, an additional test (FISH analysis) is performed, making the classification more complex, especially given the limited sample size. While many approaches focus on finding terms that differentiate text, RWOV focuses on finding positional interactions of terms, which may give it an edge in these scenarios. This initial approach, although promising, is only the beginning. There are many ways that our method could likely be improved. The learning algorithm was off the shelf, in order to provide a proof of concept, but would likely benefit from more customization, such as tuning the structure of the neural network and applying network optimization methods. Also, we were limited in the amount of data we could provide, due to the laborious hand-labeling process required. Therefore, we think we could improve performance by implementing a semi-supervised approach, in addition to using a larger training dataset.
We expect that the ideas put forward in this article will stimulate research into other approaches that account for the unique characteristics of EHR. At the same time, RWOV will directly benefit research at the University of Kansas Cancer Center by increasing and improving the data available for projecting the feasibility of studies at this institution. To help other institutions benefit from this approach, RWOV is available for Python in a public code repository at https://github.com/jeffreyat/RWOV.git.

Table 4. Top 30 occurring words for each TOI. The mean frequency of occurrence per observation is shown. Words have been stemmed (shortened to common roots/parts).