Clinical Assistant Diagnosis for Electronic Medical Record Based on Convolutional Neural Network

Automatically extracting useful information from electronic medical records (EMRs) and using it to diagnose disease is a promising task for both clinical decision support (CDS) and natural language processing (NLP). Most existing systems rely on manually constructed knowledge bases and perform auxiliary diagnosis by rule matching. In this study, we present a clinical intelligent decision approach based on convolutional neural networks (CNNs), which automatically extracts high-level semantic information from electronic medical records and then performs diagnosis without manually constructed rules or knowledge bases. We use 18,590 real-world clinical electronic medical records to train and test the proposed model. Experimental results show that the proposed model achieves 98.67% accuracy and 96.02% recall, which strongly supports that using a convolutional neural network to automatically learn high-level semantic features of electronic medical records and then conduct assisted diagnosis is feasible and effective.

base for internal medicine 21. It may also lead to a decline in matching accuracy because not all possible conditions are considered. Secondly, different hospitals and departments encounter a wide range of cases that may differ greatly. A knowledge base covering hundreds or thousands of diseases from such varied departments is complicated and costly to manage and maintain, which leads to low efficiency.
High-level semantic understanding of medical record text has always been hard because of its high degree of coding 22,23. In recent years, with the development of natural language processing, a growing number of auxiliary diagnostic methods based on semantic analysis algorithms have appeared 16,22,24. These methods try to reach a high-level semantic understanding of EMRs, drawing mainly on natural language processing technology 22. They aim to help the computer better understand the semantics of electronic medical records and then make a diagnosis accordingly, a goal that is far from easy to achieve. With the extensive adoption of deep neural networks in natural language processing in recent years, applying deep neural networks to the semantic understanding and analysis of text has become a popular research topic 23,25,26.
To pursue this goal, in this study we applied a multi-layer convolutional neural network to the high-level semantic understanding of electronic medical records, which can then be used for disease diagnosis. In the past few years, convolutional neural networks have made notable progress in fields such as computer vision [27][28][29] and natural language processing 23,26. The steady advancement of CNNs is likely to benefit the development of new technologies in other fields. A large body of research and applications has shown that the convolutional neural network has a powerful ability to extract and express features 27,30; it does not require hand-designed features but learns them from large amounts of data. Previous studies have shown that a neural network can learn to represent each word in a text as a dense vector by mapping it into a continuous vector space [31][32][33][34][35]. In this vector space, semantically similar words are distributed in the same region 33. Thus, even if two pieces of text are not identical, as long as they express the same meaning they will have similar mathematical representations and lie very close together in the semantic space 32. This can greatly alleviate the problem of semantic ambiguity and is more efficient than a model based on a knowledge base. We therefore do not need to build a large number of complex rules or a knowledge base to guide the model's decisions; the model itself automatically extracts useful information from electronic medical records by self-learning and then conducts disease diagnosis based on this information. This makes our model lighter and more efficient than knowledge-base models. The overall framework of our model is shown in Fig. 1. The input of our model is an electronic medical record and the output is the predicted probability of each disease.

Results
Data Preparing. To promote the development of the related fields, in this study we collected and released a large real-world electronic medical records dataset (C-EMRs) from Huangshi Central Hospital in China. It contains 18,590 EMRs covering the most common diseases of each department: hypertension, diabetes, chronic obstructive pulmonary disease (COPD), gout, arrhythmia, asthma, gastritis and stomach polyps. After personal information was removed, each electronic medical record includes thirteen items, such as chief complaint, physical examination and history of present illness. Each electronic medical record corresponds to a doctor's diagnosis, which is used as the label of that EMR sample during training. Because a patient may have multiple diseases, two electronic medical records can have the same content but different diagnostic results. In our dataset, 447 patients fall into this situation. The number and proportion of records for each disease are shown in Fig. 2.

Figure 1. The overall framework of the proposed model. We use the convolutional neural network to extract the semantic feature vectors of unstructured electronic medical records and map them to the feature space. A classifier then calculates the probability of each disease, and the disease with the highest probability is selected as the auxiliary diagnosis of our model.

Scientific Reports | (2018) 8:6329 | DOI:10.1038/s41598-018-24389-w
The numbers of electronic medical records for the different diseases in C-EMRs are imbalanced. For diabetes there are 5,642 medical records, but for gout there are only 657. To avoid bias and to ensure enough training data, we chose the diseases that have more than 1,000 records: hypertension, diabetes, COPD, arrhythmia, asthma and gastritis. To keep the training set from being heavily skewed, we randomly selected almost the same number of records for each of these diseases as training and testing data. The final data comprise 7,000 EMRs for training and another 400 EMRs for testing, distributed as shown in Table 1.
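The selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `balanced_split`, the `(text, disease)` record format and the strictly equal per-disease quotas are hypothetical, and the exact per-disease counts used in the study are those of Table 1, not the ones produced here.

```python
import random
from collections import defaultdict

def balanced_split(records, min_count=1000, n_train=7000, n_test=400, seed=0):
    """Split (text, disease) pairs into roughly balanced train/test sets.

    Hypothetical helper: equal quotas per disease are an assumption.
    """
    by_disease = defaultdict(list)
    for text, disease in records:
        by_disease[disease].append((text, disease))
    # Keep only diseases with more than `min_count` records.
    kept = {d: recs for d, recs in by_disease.items() if len(recs) > min_count}
    if not kept:
        raise ValueError("no disease has more than min_count records")
    per_train = n_train // len(kept)
    per_test = n_test // len(kept)
    rng = random.Random(seed)
    train, test = [], []
    for disease in sorted(kept):
        recs = kept[disease][:]
        rng.shuffle(recs)
        if len(recs) < per_train + per_test:
            raise ValueError("not enough records for " + disease)
        # Disjoint slices keep the test records out of the training set.
        train.extend(recs[:per_train])
        test.extend(recs[per_train:per_train + per_test])
    return train, test
```

With six retained diseases, the defaults give 1,166 training and 66 test records per disease, which is close to, but not identical with, the 7,000 and 400 records reported in the text.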
Experiment Results. We use stochastic gradient descent with momentum 0.9 to train the parameters of our network. The model converges quickly during training: after about 20 epochs (one epoch is one pass over all training samples) it reaches a steady state with high accuracy, and the loss curve is very smooth, as can be seen in Fig. 3(a). Fig. 3(b) shows that the prediction time for each electronic medical record is mainly between 10 and 20 milliseconds, which is fast enough for real-time prediction.
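For readers unfamiliar with the optimizer, the classical form of stochastic gradient descent with momentum can be sketched as follows. The momentum coefficient 0.9 comes from the text; the function name `sgd_momentum_step`, the learning rate `lr` and the toy objective are assumptions made for illustration and are not the settings of this study.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One classical momentum update: v <- momentum * v - lr * g, w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy usage: minimise f(w) = w^2 (gradient 2w), starting from w = 1.0.
w, v = [1.0], [0.0]
for _ in range(300):
    w, v = sgd_momentum_step(w, [2.0 * w[0]], v)
```

The velocity term accumulates past gradients, which is one common explanation for smooth loss curves of the kind described above.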
In Table 2, we report the precision, recall, F1-score and accuracy of four machine learning algorithms, namely Support Vector Machine (SVM), Multinomial Naïve Bayes (MultinomialNB), Logistic Regression and k-Nearest Neighbor, as well as of our proposed model. These four algorithms have been applied to the auxiliary diagnosis of electronic medical records in previous related work and have achieved good results [36][37][38]. Table 2 shows that our model achieves the best result on every evaluation metric. On the test set, our model achieves 98.67% accuracy and 96.02% recall, which indicates that the CNN extracts information from these texts more effectively than the other algorithms. Table 3 shows the average prediction time of the different methods for each EMR in the test set. The average diagnostic time of our model for each electronic medical record is only 13.82 milliseconds, which indicates that our model is very efficient in the diagnosis process.
As mentioned above, our model automatically extracts high-level semantic features from electronic medical records and maps them to a high-dimensional feature space (usually hundreds to thousands of dimensions). We use the t-Distributed Stochastic Neighbor Embedding (t-SNE) 39,40 technique to reduce the dimensionality of these feature vectors and visualize them, as shown in Fig. 4. In this feature space, each point represents an electronic medical record and different colors indicate different diseases. At the beginning of training (Epoch = 0), since the model parameters are randomly initialized, the electronic medical records are randomly distributed in the feature space and cannot be separated. After 5 epochs, records of different diseases begin to separate. After 10 epochs, the records of the different diseases are separated except in some areas and at the edges of each category. When training reaches 100 epochs, the samples of each disease are clearly separated and the records of the same disease are gathered together. After training, electronic medical records that belong to the same disease are distributed in the same area. Because some patients suffer from several diseases, individual records are mixed into other categories. Each input electronic medical record is mapped to the feature space, and by analyzing its location there we can calculate the probability that it belongs to each disease.

Discussion
Automatic extraction of useful information from electronic medical records is of great significance and value for the study of clinical treatment and related diseases 1,5-7. Current clinical diagnosis models and systems are mostly based on large-scale, manually constructed medical knowledge bases 18,41,42. They analyze electronic medical records through association extraction and rule matching against the knowledge base, and then provide a clinical auxiliary diagnosis. This kind of method usually involves a heavy workload 21 and its actual effect is not very satisfactory. In this study, we propose a method that uses a convolutional neural network to extract and analyze information from electronic medical records and then conducts clinical auxiliary diagnosis. Compared with other machine learning algorithms, our model performs better on all metrics (Table 2). The high precision (95.94%) achieved by our model means that its probability of misdiagnosis is very low, which is extremely important for practical use. At the same time, the recall (96.02%) of our model is also high, which means that the probability of a missed diagnosis is low. Taken together, these test results suggest that our model has significant practical value.
It is worth noting that our model does not require people to build large-scale knowledge bases or complex rules, since all the model parameters and features are learned automatically from a large number of historical electronic medical records. This makes our model quite lightweight and fairly practical. Our model is also very efficient: in our tests the average prediction time for each electronic medical record is 13.82 milliseconds, which is faster than the other machine learning methods (SVM: 180.5 ms, MultinomialNB: 172.5 ms, LogisticRegression: 167.5 ms, KNeighborsClassifier: 205.0 ms), as shown in Table 3.
These results strongly support that it is feasible and effective to use the convolutional neural network to automatically learn high-level semantic features of electronic medical records and then conduct assisted diagnosis. Based on these advantages, our model can improve the clinical diagnostic efficiency of doctors. At the same time, because our model is trained on a large number of historical diagnostic medical records, it can help reduce the possibility of misdiagnosis.
As the results in Fig. 4 show, our model can effectively extract high-level semantic features of electronic medical records and map them into a high-dimensional feature space. In this feature space, electronic medical records of different diseases have different distributions, and the records of the same disease are gathered together. By analyzing this feature space (which we can also call the "disease space"), we may even be able to help clinicians better understand the relations between various diseases and the tendencies that are likely to occur within the same disease, which would be one of the most promising aspects of the proposed model. We hope that our research will not only help clinicians make better clinical diagnoses but also help them understand clinical diseases from another perspective. Although the results are encouraging, this exploratory research has some limitations. Firstly, the sample types used for model training and testing are limited, as we used only several of the most common and medically distinct diseases. In future research we will therefore include more types of diseases, as well as more similar diseases, such as type I and type II diabetes. Secondly, in this study we consider only three main items of the electronic medical record: chief complaint, history of present illness and physical examination. Although these three items are likely the most crucial ones, the other items of the record are also important. In future research, we will take into account the full content of the electronic medical record, and even other diagnostic information, such as medical images.
In summary, the major contributions of this paper are as follows. Firstly, we designed and implemented an auxiliary diagnosis model for electronic medical records based on a convolutional neural network. We hope this model can both improve doctors' working efficiency and reduce their misdiagnosis rate. Since our model can conduct high-level semantic understanding of electronic medical records, we also hope that it can help doctors better understand the clinical manifestations of various diseases, and even the relations between them. Secondly, to promote the development of the related fields, we collected and released a large real-world electronic medical records dataset (C-EMRs). It contains 18,590 EMRs covering the most common diseases of each department: hypertension, diabetes, chronic obstructive pulmonary disease (COPD), gout, arrhythmia, asthma, gastritis and stomach polyps. Thirdly, we tested and evaluated the proposed CNN-based auxiliary diagnosis model on this dataset. The test results show that the method has high diagnostic efficiency (13.82 milliseconds for each EMR prediction) and diagnostic accuracy (accuracy: 98.67%, recall: 96.02%). Although our model still has room for improvement, it has shown significant practical value for clinical research. We hope that our work will serve as a guide for future related work and help promote the further development of the auxiliary diagnosis of electronic medical records.

Method
Model Structure and Analysis. In this study, we propose a method that uses a convolutional neural network to extract features from electronic medical records and conduct disease prediction. The input of the proposed model is an electronic medical record and the output is the predicted probability of each disease. The final structure of the convolutional neural network used in this study is as follows: an embedding layer, a convolutional layer with convolutional kernels of three different sizes, an average pooling layer and a fully connected layer followed by a softmax classifier. The embedding layer transforms the input EMR text into a two-dimensional matrix that is suitable for convolution. The convolutional layer extracts features from the input matrix, and convolution kernels of different sizes can learn features related to different context widths. The pooling layer down-samples the features, which can enhance the robustness of the model and significantly influences its performance 27,43. The fully connected layer fuses all these features and passes them to the softmax classifier for disease prediction. The softmax classifier, whose parameters are learned during training, calculates the correlation between the input feature vector and the various diseases and outputs the probability of each disease. The parameter settings used in practice are given in the section "Experiment Setting".
For each structured medical record, we first make it unstructured by concatenating its items into a single passage. Each passage S is represented by a matrix $X \in \mathbb{R}^{N \times D}$, as shown in Equation (1), where the i-th row corresponds to the i-th word in passage S and each word is represented as a D-dimensional vector that is randomly initialized, that is

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,D} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \cdots & x_{N,D} \end{bmatrix} \tag{1}$$

Generally, let $X_{i:j}$ refer to the matrix that consists of the word vectors from the i-th word to the j-th word, that is:

$$X_{i:j} = \begin{bmatrix} x_i \\ x_{i+1} \\ \vdots \\ x_j \end{bmatrix} \tag{2}$$

The convolution layer contains convolution kernels of multiple sizes, with multiple kernels of each size. The width of each convolution kernel is the same as the width of the input matrix. Suppose that the height of the k-th convolution kernel is H; the kernel can then be expressed as

$$W^k = \begin{bmatrix} w^k_{1,1} & w^k_{1,2} & \cdots & w^k_{1,D} \\ w^k_{2,1} & w^k_{2,2} & \cdots & w^k_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ w^k_{H,1} & w^k_{H,2} & \cdots & w^k_{H,D} \end{bmatrix} \tag{3}$$

The convolution operation extracts features from the elements in a local region of the input matrix. For example, when $w^k_{1,1}$ and $x_{1,1}$ coincide, the feature $c^k_1$ extracted from $X_{1:H}$ by the convolution kernel is:

$$c^k_1 = f\left(\sum_{i=1}^{H}\sum_{j=1}^{D}\left(w^k_{i,j}\,x_{i,j} + b^k_{i,j}\right)\right) \tag{4}$$

where the weight $w^k_{i,j}$ denotes the importance of the j-th value in the i-th word vector, $b^k_{i,j}$ is the bias term and f is a nonlinear function. Following previous work 27, we use the ReLU function as the nonlinearity, which is defined as

$$f(x) = \max(0, x) \tag{5}$$

In the convolution process, the kernel $W^k$ slides from the top to the bottom of the input matrix X with a step $T_c$ and calculates the feature of each local region. The document feature extracted by convolution kernel $W^k$ is:

$$C^k = \left[c^k_1, c^k_2, \ldots, c^k_{\frac{N-H}{T_c}+1}\right] \tag{6}$$

The pooling layer can reduce the number of neural network parameters while maintaining the overall distribution of the data, which can effectively prevent over-fitting and improve the robustness of the model 27,43.

The pooling operation is very similar to the convolution operation; the only difference is that it only calculates the average or maximum value of the local area. We conduct a max pooling operation on the feature $C^k$ after each convolution operation. Suppose the height of the pooling kernel is $H_p$ and the step size is $T_p$; then the output is:

$$M^k = \left[m^k_1, m^k_2, \ldots, m^k_{\frac{1}{T_p}\left(\frac{N-H}{T_c}+1-H_p\right)+1}\right], \qquad m^k_i = \max\left(c^k_{(i-1)T_p+1}, \ldots, c^k_{(i-1)T_p+H_p}\right) \tag{7}$$

The process described above shows how one convolution kernel $W^k$ produces one feature $M^k$. After all the convolution and pooling operations have been completed, all the extracted features are concatenated end to end to obtain the feature vector of the entire EMR:

$$F = [F_1, F_2, \ldots, F_l] \tag{8}$$

where $F_i = M^i$ and l indicates the number of features. The fully connected layer is used to further blend these features and extract higher-level features. By defining a weight matrix $W_F$, we compute the weighted sum of the feature elements and obtain the final feature representation of the input text S:

$$y = F \cdot W_F + b_f \tag{9}$$

where $W_F$ and $b_f$ are the learned weight matrix and bias, and the values in $W_F$ reflect the importance of each feature. The dimension of the output vector y is L, the number of labels; in our implementation, L is the number of diseases to be predicted. We then pass the vector y through the softmax classifier to get the predicted probability of each disease:

$$P_i = \frac{e^{y_i}}{\sum_{j=1}^{L} e^{y_j}} \tag{10}$$

where $P_i$ indicates the predicted probability of the i-th disease for the input medical record.
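The forward computation described in this section (convolution with ReLU, max pooling, concatenation, fully connected layer and softmax) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation, and it makes three assumptions that are not stated in the text: each kernel uses a single scalar bias, the convolution step is 1, and pooling spans the whole feature map so that every kernel contributes one value (this is consistent with the 3 × 128 = 384 feature dimension reported in the experiment settings). All function names and the toy sizes are hypothetical.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def conv_feature_map(X, W, b, stride=1):
    """Slide an H x D kernel W down the N x D matrix X; one ReLU feature per window."""
    N, H, D = len(X), len(W), len(W[0])
    feats = []
    for top in range(0, N - H + 1, stride):
        s = sum(W[i][j] * X[top + i][j] for i in range(H) for j in range(D))
        feats.append(relu(s + b))
    return feats

def softmax(y):
    m = max(y)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def forward(X, kernels, W_F, b_f):
    """kernels: list of (W, b); W_F: len(kernels) x L weights; b_f: L biases."""
    # Assumption: max pooling over the whole feature map, one value per kernel.
    F = [max(conv_feature_map(X, W, b)) for W, b in kernels]
    L = len(b_f)
    y = [sum(F[i] * W_F[i][c] for i in range(len(F))) + b_f[c] for c in range(L)]
    return softmax(y)

# Toy usage with small, randomly initialised parameters (not the paper's sizes).
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

N, D, L = 10, 4, 3
X = rand_matrix(N, D)
kernels = [(rand_matrix(h, D), 0.0) for h in (2, 3, 4) for _ in range(2)]
W_F = rand_matrix(len(kernels), L)
probs = forward(X, kernels, W_F, [0.0] * L)
```

Here `probs` plays the role of the vector P: one probability per disease, summing to one.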
During training, we update the network parameters with the backpropagation algorithm. The loss function of the whole network consists of two parts, an error term and a regularization term, and can be described as:

$$\mathrm{LOSS} = \frac{1}{num}\sum_{n=1}^{num}\left\lVert P^{(n)} - T^{(n)}\right\rVert_2^2 + \lambda\left\lVert W\right\rVert_2^2 \tag{11}$$

where num is the batch size of EMRs, P is the output of the classifier, each element $P_i$ of which represents the predicted probability of the i-th disease, T is the target value, which corresponds to the doctor's diagnosis, W denotes the network weights and $\lambda$ is the regularization coefficient.
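As a rough illustration of a loss with these two parts, the sketch below computes a batch-averaged squared error between predictions and one-hot targets and adds an L2 penalty on the weights. The exact form of the regularization term, the coefficient `lam` and the function names are assumptions for illustration, not values taken from this study.

```python
def one_hot(t, num_classes):
    """Target vector T: 1 at the index of the diagnosed disease, 0 elsewhere."""
    return [1.0 if i == t else 0.0 for i in range(num_classes)]

def batch_loss(P_batch, labels, weights, lam=1e-4):
    """Squared error averaged over the batch plus an assumed L2 penalty."""
    num = len(P_batch)
    num_classes = len(P_batch[0])
    error = 0.0
    for P, t in zip(P_batch, labels):
        T = one_hot(t, num_classes)
        error += sum((p - q) ** 2 for p, q in zip(P, T))
    error /= num
    penalty = lam * sum(w * w for w in weights)
    return error + penalty
```

A perfect prediction with zero weights gives a loss of 0, and the loss grows as the predicted probabilities move away from the one-hot target.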
For instance, if a medical record corresponds to the t-th disease, the t-th element of the target vector T is 1 and the remaining elements are 0. The error term in the loss function is the mean square error (MSE) between the prediction vector and the actual label. Through the self-learning of the model, we expect the mean square error to become smaller and smaller, that is, the predictions get closer to the real values. To strengthen the regularization and prevent overfitting, we adopted the dropout mechanism and a constraint on the l2-norms of the weight vectors during training. Dropout means that, during training, units of the neural network are temporarily discarded from the network, i.e. set to zero, with a certain probability. This mechanism has been shown to effectively prevent overfitting and significantly improve model performance 27,44,45. We train our model by minimizing the LOSS function over a batch of samples. We use stochastic gradient descent with momentum 0.9 to train the parameters of our network. The update rule for weight w is:

… the words that appear more than five times, and the others are replaced by the token "〈unk〉". We finally obtain 17,274 unique words in our dictionary. Our model requires the input matrix to be of a fixed size, that is, the length of the input text must be constant. We designed multiple sets of comparison experiments to choose the best value of this hyperparameter. According to the experimental results, we fix each input electronic medical record text to 130 words: texts with fewer than 130 words are padded with zeros, and the part of a text beyond 130 words is discarded.
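The vocabulary and fixed-length steps described above can be sketched as follows. The sketch assumes that the text has already been split into word tokens (the segmentation step is not shown), that id 0 is reserved for padding, and that texts longer than the limit are truncated; the function names and id layout are hypothetical.

```python
from collections import Counter

UNK, PAD_ID = "<unk>", 0

def build_vocab(token_lists, threshold=5):
    """Keep words seen more than `threshold` times; all others map to <unk>."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {UNK: 1}  # id 0 is reserved for padding (assumption)
    for word, count in sorted(counts.items()):
        if count > threshold and word not in vocab:
            vocab[word] = len(vocab) + 1
    return vocab

def encode(tokens, vocab, max_len=130):
    """Map tokens to ids, truncate to max_len, and pad short texts with zeros."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))
```

Every encoded record then has exactly `max_len` ids, so the embedding layer always receives an input of the same height.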
For the input of our model, we map each word to a randomly initialized 300-dimensional vector, so the dimension of the input matrix is 130 × 300. The width of each convolutional kernel is the same as that of the input matrix, that is, 300. The kernel height is not fixed: by comparing the results of different kernel sizes, we set the kernel heights to 4, 5 and 6, with 128 convolution kernels for each height. The dimension of the feature extracted for each EMR is 3 × 128 = 384 and the dimension of the output vector is six, corresponding to the six diseases to be diagnosed, so the weight matrix of the fully connected layer is $W_F \in \mathbb{R}^{384 \times 6}$.

Evaluation Method. We use several evaluation indicators commonly used in classification tasks to evaluate the performance of our model: precision, recall, F1-score and accuracy. Their definitions and formulas are as follows:

• Precision measures the proportion of truly positive samples among the samples classified as positive.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{15}$$

• Recall measures the proportion of positive samples that are correctly classified as positive.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{16}$$

• F1-score is a measure of a test's accuracy that considers both the precision and the recall of the test. The F1 score is the harmonic mean of the precision and recall; it reaches its best value at 1 and its worst at 0.

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{17}$$

• Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of cases examined.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{18}$$

where TP (True Positive) is the number of positive samples that are predicted to be positive by the model, FP (False Positive) is the number of negative samples predicted to be positive, FN (False Negative) is the number of positive samples predicted to be negative and TN (True Negative) is the number of negative samples predicted to be negative.
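The four indicators can be computed directly from the TP, FP, FN and TN counts, as in the sketch below. Treating each disease one-vs-rest and macro-averaging the per-disease values is an assumption made here for illustration; the text does not state how the per-disease results were combined. All function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, label):
    """One-vs-rest TP, FP, FN, TN for a single disease label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Return (precision, recall, F1, accuracy); empty denominators give 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

def macro_average(y_true, y_pred):
    """Average the per-disease metrics over all diseases (assumed scheme)."""
    labels = sorted(set(y_true))
    rows = [metrics(*confusion_counts(y_true, y_pred, lab)) for lab in labels]
    return tuple(sum(col) / len(rows) for col in zip(*rows))
```

For example, `metrics(8, 2, 2, 88)` gives a precision, recall and F1-score of 0.8 with an accuracy of 0.96, which shows how accuracy can exceed recall when true negatives dominate.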
Data availability. The dataset analysed during the current study is available in the GitHub repository, https://github.com/YangzlTHU/C-EMRs.