Mastering clinical medical knowledge at certified-doctor level with a deep learning model

Mastering medical knowledge is a lengthy process for humans that typically involves several years of medical school and residency training. Recently, deep learning algorithms have shown potential in solving medical problems. Here we demonstrate the mastering of clinical medical knowledge at certified-doctor level via a deep learning framework, Med3R, which utilizes a human-like learning and reasoning process. Med3R is the first AI system to pass the written test of the National Medical Licensing Examination in China (NMLEC) 2017, scoring 456 points and surpassing 96.3% of human examinees. Med3R is further applied to provide aided clinical diagnosis services based on real electronic medical records. Compared with human experts and competitive baselines, our system provides more accurate and consistent clinical diagnosis results. Med3R offers a potential way to alleviate the severe shortage of qualified doctors in counties and small cities of China by providing computer-aided medical care and health services for patients.


Supplementary Methods
Data set. The large-scale medical corpus consists of text materials on medical subjects from multiple sources. A major portion of the corpus is prepared from various publications. A total of 32 published books are used, including textbooks for medical school students, reference books for medical practitioners, and guidebooks. All text materials are extracted from these books and divided by paragraph. The content structure of each book is parsed, and nested levels of chapter titles are extracted. This title information is appended to the corresponding text paragraphs as metadata.
Each paragraph, along with its metadata, is stored as a "document". A total of 243,712 documents are extracted from these publications. The MedQA dataset is a collection of exam problems with ground-truth answers. Over 270,000 problems were collected from the internet and from published materials such as exercise books. The collected problems are not official test problems from past examinations, but may be related to them. We filtered out incomplete and duplicate problems. Questions in these problems have an average length of 27.4 words, and each problem has exactly 5 candidate answers. A training/valid split is created for supervised training of models. Problems in the valid set are a small subset chosen because their source and context indicate a possible appearance in past exams; the valid set thus approximates performance in real exams. The final training and valid sets contain 222,323 and 6,446 problems, respectively.
Similarity degrees. We calculate the similarity degrees of NMLEC 2017 questions against our training dataset as follows. For each question from NMLEC 2017, we compute its Levenshtein ratio against all questions in the training set of MedQA, then average the top five values to obtain its similarity degree. A lower value indicates greater dissimilarity.
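The per-question computation can be sketched as follows. Note that `difflib.SequenceMatcher.ratio` from the Python standard library is used here as a stand-in for the Levenshtein ratio; the two measures are similar but not identical.

```python
from difflib import SequenceMatcher


def similarity_degree(question, training_questions, top_k=5):
    """Average of the top-k similarity ratios of one exam question
    against every training question. SequenceMatcher.ratio is a
    stdlib stand-in for the Levenshtein ratio used in the text."""
    ratios = sorted(
        (SequenceMatcher(None, question, t).ratio() for t in training_questions),
        reverse=True,
    )
    return sum(ratios[:top_k]) / min(top_k, len(ratios))
```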
Embedding spaces. The word2vec algorithm treats text materials as unstructured and only depicts general neighboring relationships between words; thus, embeddings learned by word2vec mainly describe general language and commonsense knowledge. We found that text materials in the medical domain are not completely unstructured but are often semi-structured. For example, words in medical textbooks are organized into different semantic units, such as chapters, sections, subsections, paragraphs, and sentences. Therefore, we can define rules to organize text from these medical corpora into separate text fragments with relationships based on the semi-structured information. Besides the neighboring relationship (R1, used by word2vec), in this study we also used three more relationships to describe long-range contextual dependency: 1) R2: two words in the same sentence; 2) R3: two words in the same paragraph; 3) R4: two words in the same section. Since diseases play important roles in understanding medical knowledge, we extracted about 1,174 diseases and their content descriptions from medical textbooks covering more than ten subjects, such as infectious diseases & STD, pediatrics, internal medicine, stomatology, respiratory medicine, surgery, obstetrics and gynecology, dermatovenerology, neurology, and otorhinolaryngology head and neck surgery. In textbooks, each disease (usually as a subtitle) is followed by a paragraph (content description) which comprehensively describes the characteristics of the disease. We split this paragraph into several sub-paragraphs which describe the disease from the views of "symptom", "examination", and "differential diagnosis", respectively.
The "symptom" sub-paragraph mainly describes symptoms of the disease; the "examination" sub-paragraph mainly describes which examinations are needed to make a definite diagnosis of the disease; and the "differential diagnosis" sub-paragraph presents how to differentiate the disease from other similar diseases. Based on these, we further defined another three relationships: 1) R5: a disease and its "symptom" description; 2) R6: a disease and its "examination" description; 3) R7: a disease and its "differential diagnosis" description. All relationships used for multi-embedding learning are summarized in Supplementary Table 2.
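Collecting training pairs for the structural relationships R1 to R4 can be sketched as below. This is an illustrative sketch only: it assumes text already tokenized into a section/paragraph/sentence hierarchy, and the function name and window size are ours, not from the original implementation.

```python
from itertools import combinations


def relationship_pairs(section, window=2):
    """Collect word pairs for the four structural relationships.
    `section` is a list of paragraphs; each paragraph is a list of
    sentences; each sentence is a list of tokens."""
    pairs = {"R1": set(), "R2": set(), "R3": set(), "R4": set()}
    section_words = []
    for paragraph in section:
        para_words = []
        for sentence in paragraph:
            # R1: neighbouring words within a small window (word2vec-style)
            for i, w in enumerate(sentence):
                for v in sentence[i + 1 : i + 1 + window]:
                    pairs["R1"].add((w, v))
            # R2: any two words in the same sentence
            pairs["R2"].update(combinations(sentence, 2))
            para_words.extend(sentence)
        # R3: any two words in the same paragraph
        pairs["R3"].update(combinations(para_words, 2))
        section_words.extend(para_words)
    # R4: any two words in the same section
    pairs["R4"].update(combinations(section_words, 2))
    return pairs
```

The disease-centred relationships R5 to R7 would pair each disease title with every word of the corresponding sub-paragraph in the same way.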
Embedding of words and concepts. In general applications, it is enough to build embeddings on words or characters. However, in the medical domain, concepts are so important for representing medical knowledge that we need to build embeddings on concepts. To achieve this goal, we first employed NER technology to annotate medical concepts (such as diseases, symptoms, etc.) in the textbook materials. We build embeddings for the annotated medical concepts, and for other non-concept words we still build embeddings on words. That is, concept embeddings and word embeddings are learned simultaneously according to the relationships defined in Supplementary Table 2.
Embedding learning. According to equation (1) in the Methods section of the main text, given two words/concepts (or one word and one concept) s_a, s_b with <s_a, s_b> ∈ R_i, we obtain their embeddings by maximizing the conditional probability P(s_a|s_b) with a proper function F_i. For simplicity, in this study F_i is defined with a softmax function.
Embedding combining. Multiple embeddings are combined by concatenation when used as input to neural network components in our model. Although more sophisticated ways of combining embeddings exist, we found in our experiments that simple concatenation is robust enough to achieve the desired performance.
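The two operations above can be sketched as follows, assuming the standard word2vec-style softmax over inner products for F_i (the exact form of the published equation is not reproduced in this text) and illustrative embedding tables:

```python
import numpy as np


def softmax_prob(a_idx, b_vec, out_table):
    """P(s_a | s_b) as a softmax over inner products: a minimal sketch
    of the conditional probability maximized for each relationship R_i."""
    logits = out_table @ b_vec
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p[a_idx] / p.sum()


def combine_embeddings(token, tables):
    """Combine a token's vectors from multiple embedding spaces by
    simple concatenation, as described in the text."""
    return np.concatenate([t[token] for t in tables])
```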
NMLEC. A medical licensing examination is a feasible and appropriate means, officially adopted by many countries around the world, to assess a physician's mastery of medical knowledge and the ability to apply that knowledge and its principles to solve medical problems. In China, the NMLEC is organized annually by the NMEC to assess the competence of doctors; every year more than 530,000 candidates take the exam. The General Written Test (GWT) of the NMLEC is produced by board-certified and experienced medical experts employed by the NMEC. The GWT consists of 600 multiple-choice questions in four categories: A1, B1, A2, and A3/A4 (details in Supplementary Figure 2). The questions mainly cover medical knowledge from preclinical medicine, medical humanities, preventive medicine, and clinical medicine (details in Supplementary Figure 1).
Fast reading: context-based vector space text retrieval. The quality and relevance of the retrieved digest is an important factor determining the performance of the reasoning module. The aim of the retrieval system is to extract semantically relevant information from the large medical corpus. Traditional text retrieval methods such as BM25 lack the ability to distinguish relevancy at the semantic level. We developed a vector space retrieval model based on context vectors. Context vectors are representations of words in context, which capture syntactic structure (e.g., phrases) and semantic representation (e.g., a specific word sense). Context vectors c_i are generated from word embeddings w_i using a bidirectional LSTM network, which is trained to model context in the reasoning module. Retrieval of related documents from the text corpus is performed in pure vector space, in a two-step process. In the first step, context representations c are generated for all documents in the text corpus: each document is first embedded into vector space using one set of word embeddings, then processed by a bidirectional LSTM network. The word embedding is learned in the free reading module, unsupervised, on the whole text corpus using skip-gram, and has dimension 200. The LSTM network is extracted from the context reasoning module after supervised training on the QA dataset: the parameters of the 256-dimensional bidirectional LSTM network used in the input layer of the context reasoning module are extracted. We use this LSTM network as a universal context modeling network to process all text materials. The output vector sequence of the LSTM network is associated with the original text document and collected in a vector database V. The total number of vectors in V is 22,740,576, equal to the total number of words in the text corpus. An index based on IVFADC is constructed using k-means to cluster all vectors into 4096 cells.
Product quantization is used to compress each vector to 64 bits. Indexing of vectors and searching are performed with the faiss toolkit. In the second step, given a question q as query, the context vectors c_i of the query are first generated using the same LSTM network as above. For each context vector, the N = 2000 nearest neighboring vectors (by Euclidean distance) are retrieved from V. The retrieved vectors are grouped by the original document they belong to, and the documents are scored by relevancy. The relevancy score of a document d w.r.t. a query q is defined by the equation, where δ_i is the indicator of a match found at position i of the query (L_q being the length of the query in words) and D_i is the Euclidean distance between c_i and the matched context vector v. Parameters α and β are empirically set to 1.0 and 2.5, respectively. The scoring function simply averages the degree of matching of each query word to the document. The distance between two context vectors reflects the semantic similarity of two words and their surrounding context. Measuring context distances instead of word co-occurrences relaxes the conditional-independence assumption of words in traditional retrieval methods and makes semantically related documents rank higher.
The top 5 documents that rank highest from each source of text are returned as a digest of related information in the text corpus, given the question as a query.
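The second retrieval step can be sketched as follows. This is a brute-force, in-memory stand-in for the faiss IVFADC index, and since the published scoring equation is not reproduced in this text, the per-position match weight α·exp(−β·D_i), averaged over the L_q query positions, is an assumption based on the surrounding description.

```python
import numpy as np


def retrieve(query_ctx, db_vecs, db_doc_ids, n_neighbors=3):
    """Brute-force stand-in for the IVFADC/faiss index: for each query
    context vector, find the nearest database vectors by Euclidean
    distance and group the hits by their source document."""
    hits = {}
    for i, c in enumerate(query_ctx):
        dist = np.linalg.norm(db_vecs - c, axis=1)
        for j in np.argsort(dist)[:n_neighbors]:
            hits.setdefault(db_doc_ids[j], []).append((i, dist[j]))
    return hits


def relevancy_score(match_dists, query_len, alpha=1.0, beta=2.5):
    """Assumed scoring sketch: match_dists maps a matched query position
    to the Euclidean distance D_i of its best match; unmatched positions
    contribute 0 (delta_i = 0). The result is averaged over query_len."""
    return sum(alpha * np.exp(-beta * d) for d in match_dists.values()) / query_len
```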
Keypoint reasoning. The keypoint reasoning module (Supplementary Figure 10) is designed to use key points in a question to infer its correct answer. One of the main challenges is to recognize and extract key points from the question. Here, we do not explicitly extract key points but adopt an implicit approach: important words (those playing key roles in inferring the answer) are given high weight values, and less important words are assigned lower weight values. To realize this, we employ an attention learning mechanism, a strategy commonly used in recent deep learning models. First, we assume that every word is useful and establish a reasoning relationship between a question and its answer at the word level with a Reasoning Matrix (RM) M ∈ R^{m×n} (m, n being the numbers of words in the answer and the question), where v_{o_i} ∈ R^{D×1} is the unit embedding (||v_{o_i}||_2 = 1) of the i-th word in the answer, and v_{q_j} ∈ R^{D×1} is the unit embedding (||v_{q_j}||_2 = 1) of the j-th word in the question. That is, the equation calculates the reasoning degree between words from the question and words from the answer using cosine similarity. For simplicity, we denote by M_1, M_2, and M_3 the corresponding RMs from the first, second, and third layers, respectively. We then introduce two kinds of attention-based filter strategies to reduce noisy/undesirable words and pay more attention to key-point words (those important for inferring the correct answer). We notice that there exists a kind of noisy words which disturbs correct reasoning. We call it "synonymy noise": the same words or synonymous words appear both in the question and in the answer. However, we want to establish a reasoning relationship between words from the question and words from the answer, such as words describing symptoms in the question and words describing the corresponding treatment in the answer.
A synonymy relationship between words from the question and words from the answer is not expected. According to equation (5), word pairs having a synonymy relationship (or being the same word) have a large value. To reduce synonymy noise, we introduce an Internal-Attention (I-A) strategy to produce an Attention Matrix (AM) A_1 based on the first-layer reasoning matrix M_1, where k_1 is a scale value and τ_1 ∈ (0, 1) is a threshold parameter determining which degrees in A_1 are to be reduced. Since we want to reduce high degrees (usually generated by synonymy-noise words), τ_1 should be large. In this study, empirically, k_1 = 10 and τ_1 = 0.9. Note that, according to our experiments, k_1 can be any value in the range (5, 30) and τ_1 any value in (0.75, 0.95); the decline in model performance compared to the best setting is negligible. After obtaining the attention matrix A_1, we transfer the "prior knowledge" obtained from the first layer into the second layer by using a simple External-Attention (E-A) strategy. Besides synonymy noise, we also notice that many words in a question have little reasoning relationship to the answer, such as stop words, which only perform a grammatical function in the sentence.
We call this kind of noise "irrelevance noise". We assume that if two words (one from the question, the other from the answer) have little or no reasoning relationship, their reasoning degree (calculated in the second-layer reasoning matrix M_2) is small. Therefore, we reduce irrelevance noise with an I-A strategy in layer two, where k_2 is a scale value and τ_2 is a threshold parameter determining which degrees are to be reduced.
In this study, empirically, k_2 = 20 and τ_2 = 0.05. According to our experiments, k_2 can be any value in the range (10, 30) and τ_2 any value in (0.01, 0.1). We also transfer the "prior knowledge" obtained from both the second layer and the first layer into the third layer by using the E-A strategy. To further enhance the effectiveness of reducing irrelevance noise, we employ the same calculation (8) used in the second layer to obtain the attention matrix A_3. Ultimately, we employ an ensemble manner to calculate the Keypoint Matrix (KM) K. KM represents the reasoning degree at the word level, giving large values to key-point words and small values to noisy words. We then obtain the total reasoning degree r between question and answer from K.
A dual-path attention structure is used to match information between question and documents and to extract relevant information given the matching matrix M. Attention is first performed column-wise, where each word Q(i) in the question gets a summarization read R_{Q_n}(i) of related information in the document D_n. Attention is also performed row-wise to summarize relevant information in the reversed direction, from question to document. The reasoning module is a decision network. First, the most salient evidence is selected using a gating layer. The support vector of the selected evidence is given to a multi-layer feed-forward network, which evaluates the correctness of the candidate answer given support from the salient evidence.
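The reasoning matrix and the I-A filter can be sketched as below. The cosine-similarity matrix follows the text directly; however, since the published gating equations are not reproduced here, the sigmoid gate σ(k·(τ − M)) for suppressing high values (and its mirror for low values) is an assumption consistent with the described roles of k and τ.

```python
import numpy as np


def reasoning_matrix(ans_vecs, ques_vecs):
    """Cosine-similarity Reasoning Matrix M (m x n) between answer-word
    and question-word embeddings, normalized to unit length."""
    a = ans_vecs / np.linalg.norm(ans_vecs, axis=1, keepdims=True)
    q = ques_vecs / np.linalg.norm(ques_vecs, axis=1, keepdims=True)
    return a @ q.T


def internal_attention(M, k, tau, suppress_high=True):
    """Assumed sketch of the I-A filter: a sigmoid gate that down-weights
    entries above tau (synonymy noise, tau ~ 0.9) or, with
    suppress_high=False, entries below tau (irrelevance noise, tau ~ 0.05)."""
    z = k * ((tau - M) if suppress_high else (M - tau))
    return 1.0 / (1.0 + np.exp(-z))
```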
The correctness of the answer is represented by a scalar output from the MLP network. The whole context reasoning module is trained end-to-end on the training set of the medical QA dataset, using the ground-truth answer as supervision. Note that we have no labels for salient evidence; the model's ability to select salient evidence emerges naturally when trained on the end task. To achieve this, we first let the gating layer weight the different documents and produce a weighted mix of evidence for the decision network, train the model to some point, and then replace the weighted mix with the single most-weighted document (the identified salient document). The model is implemented in TensorFlow with the LSTM network implementation from NVIDIA's cuDNN library.
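The soft-to-hard gating schedule can be sketched as below, with a softmax over illustrative gate scores; the function name and interface are ours, not from the original implementation.

```python
import numpy as np


def mix_evidence(gate_scores, doc_vecs, hard=False):
    """Gating over document evidence: a softmax-weighted mix early in
    training, switched to the single most-weighted document later."""
    w = np.exp(gate_scores - gate_scores.max())
    w /= w.sum()
    if hard:
        # late training: keep only the identified salient document
        return doc_vecs[int(np.argmax(w))]
    return w @ doc_vecs
```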
Training uses the Adam optimizer and a dropout rate of 0.2 between layers. Hyperparameters are chosen using the validation set. Training takes roughly 18 hours on a single GTX 1080 GPU.
Global reasoning. A graph-based approach is employed to effectively distill evidence from documents. An explicit graph is constructed from the question, the answer, and all documents in the digest to model the relations between evidence words. Words form the nodes of the graph, and an edge denotes the co-occurrence of a pair of words within a document. Each node is given a weight W_N equal to the inverse document frequency of the word. A weight W_E is also given to each edge: co-occurrences within a shorter window are given a higher weight. For co-occurrences within multiple documents, the combined edge weight is computed from W_{E_i}, the weight of the co-occurrence of the word pair in the i-th evidence, and W_{D_i}, a weight given to the evidence (to penalize longer evidences). The maximum spanning tree (MST) containing all nodes representing words in the question and answer is generated from the graph. The nodes in this tree are words directly or indirectly related to the question, and they form a new piece of evidence that supports the question. The weights in the MST are used to evaluate the supporting strength of the evidence for the question-answer pair. An implicit graph is also generated to calculate relevance in embedding space, which can be regarded as the continuous version of the graph-based approach. Matching matrices are calculated for each question-evidence pair and each evidence-evidence pair, where e^n_{a^n_i} represents the a^n_i-th vector in the n-th evidence, found by maximizing question-evidence relevance (values in M_{Q_n}) and minimizing evidence-evidence overlap (values in M_{D_mn}).
The new evidence is then matched with the question representations to generate a supporting matrix, which integrates support from all evidences. The reasoning step reuses part of the trained network from the context reasoning module. The final output vector represents the overall support of the evidence for the statement in the question.
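The MST extraction over the weighted evidence graph can be sketched with Kruskal's algorithm on negated weights; the edge weights are assumed to be precomputed as described above.

```python
def maximum_spanning_tree(nodes, edges):
    """Kruskal's algorithm run greedily on descending weights; `edges`
    is a list of (u, v, weight) tuples over the evidence-word graph."""
    parent = {n: n for n in nodes}

    def find(x):
        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:          # keep the edge only if it joins two components
            parent[ru] = rv
            mst.append((u, v, w))
    return mst
```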
Baseline methods. We adapted or developed several baseline methods for experiments, including: 1) R-net: A reading comprehension model achieving a previous state-of-the-art single-model result on the SQuAD dataset [1]. Its architecture stacks a question-to-document attention layer and a document-to-document attention layer. Because the model was originally used on SQuAD, which has a single input document, we concatenated the digest documents as input to the model.
The final prediction layer is also replaced with a pooling layer to generate a scalar score.
2) Neural reasoner: A framework for reasoning over natural language sentences, using a deep stacked architecture. It can extract complex relations from multiple facts to answer questions. We supply individual documents in the digest to the model as different evidences.
3) Iterative attention: This model has a universal architecture not limited to a specific task. It uses attention to read from the question and document alternately; the reading is performed iteratively, and the final state is used to make the prediction. The model is directly applicable to our medical QA task.
The above three models are all end-to-end neural network models and are trained using the same Medical QA dataset as our own model. They share the unsupervised word embeddings in "Free Reading" of Med3R framework.
4) A WatsonQA system: The WatsonQA system is a pipelined system based on evidence retrieval and analysis. It has a hand-crafted retrieval system based on Lucene to retrieve potential evidence from the corpus. The evidence analysis is based on a word co-occurrence graph between the question and a document. The graphs are merged into a single graph to represent the relation between the question and multiple evidences. The nodes and edges of the graph carry multiple types of features as weights.
Finally, a set of graph feature extractors is used to summarize the evidence graph into a vector for scoring the candidate answers.
Model training. The unsupervised embeddings in the "Free Reading" module are first trained on the medical corpus. Next, the multi-layer reasoning module is trained on the medical QA dataset in an end-to-end fashion. The three components of the reasoning module are individually trained, using the correct answer as supervision. At this stage, the vector space retrieval model in the guided reading module is bypassed, and a BM25-based text retrieval model is used instead to provide a document digest. For the "Guided Reading" module, the vector space retrieval model is constructed by transferring the parameters of the LSTM network from the trained context reasoning model. The supervised embedding learning in deep reading contains two components: the reasoning embedding is trained together with the keypoint reasoning module to learn embeddings for keywords, and the supervised refining of embeddings is done during training of the context reasoning module. The whole framework is thus largely trained end-to-end, except for the "bootstrapping" of the vector space retrieval.

Below is the feedback from the test:

Procedure of the Tests
From 26

Test Object
Intelligent Doctor Assistant

Testing Tool
National Medical Licensing Examination for clinical physician, 2017

Test Content and Volume
The test consists of four modules: basic medicine, clinical medicine, medical humanities and preventive medicine.
The total number of problems in the test is 600. It is divided into four units. Each unit has 150 problems.

Test Performance
The total score of the Intelligent Doctor Assistant on the NMLE was 456, higher than the average score of all candidates in the country. In 2017, the highest score among candidates in the country was 533.

Initial Impression
(1) The NMLE result of the Intelligent Doctor Assistant ranks among the top of all candidates.
(2) The Intelligent Doctor Assistant performs best in the clinical medicine module.
(3) The Intelligent Doctor Assistant has the highest performance on memorization questions.
(4) On case-type questions, the Intelligent Doctor Assistant's performance is slightly lower than on non-case questions.
Performance is lower on test questions that require applying multiple key points of knowledge.
LSTM networks are used to summarize the question and each piece of evidence into vector representations. The dot products of these vectors give a matching tensor which measures their semantic similarities. Next, the context representations of a group of evidences are pooled together to assemble a distilled evidence, given the matching tensor.
The new evidence contains relevant pieces from all the evidences, and is finally compared with the question to find support for the statement in the question.
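The matching-tensor step can be sketched as below; the shapes are illustrative, and the evidences are assumed padded to a common length so that the per-evidence matrices stack.

```python
import numpy as np


def matching_tensor(q_states, evidence_states):
    """Dot-product matching between question hidden states (L_q x d) and
    each evidence's hidden states (L_e x d), stacked into a tensor of
    shape (n_evidence, L_q, L_e) measuring semantic similarity."""
    return np.stack([q_states @ e.T for e in evidence_states])
```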