Medical device surveillance with electronic health records

Post-market medical device surveillance is a challenge facing manufacturers, regulatory agencies, and health care providers. Electronic health records are valuable sources of real-world evidence for assessing device safety and tracking device-related patient outcomes over time. However, distilling this evidence remains challenging, as information is fractured across clinical notes and structured records. Modern machine learning methods for machine reading promise to unlock increasingly complex information from text, but face barriers due to their reliance on large and expensive hand-labeled training sets. To address these challenges, we developed and validated state-of-the-art deep learning methods that identify patient outcomes from clinical notes without requiring hand-labeled training data. Using hip replacements—one of the most common implantable devices—as a test case, our methods accurately extracted implant details and reports of complications and pain from electronic health records with up to 96.3% precision, 98.5% recall, and 97.4% F1, improved classification performance by 12.8–53.9% over rule-based methods, and detected over six times as many complication events compared to using structured data alone. Using these additional events to assess complication-free survivorship of different implant systems, we found significant variation between implants, including for risk of revision surgery, which could not be detected using coded data alone. Patients with revision surgeries had more hip pain mentions in the post-hip replacement, pre-revision period compared to patients with no evidence of revision surgery (mean hip pain mentions 4.97 vs. 3.23; t = 5.14; p < 0.001). Some implant models were associated with higher or lower rates of hip pain mentions. 
Our methods complement existing surveillance mechanisms by requiring orders of magnitude less hand-labeled training data, offering a scalable solution for national medical device surveillance using electronic health records.


Supplementary Methods
The labeling function development and model evaluation workflow. In (1), domain experts examine unlabeled candidate relationships to gain insights for writing and refining labeling functions. These functions are then empirically evaluated for accuracy, precision, recall, and F1 score on an expert-labeled development set. This process iterates until the desired labeling function performance is achieved on the development set. In (2), the final labeling functions are applied to a large collection of unlabeled data to generate probabilistic labels for training a deep learning model. The resulting trained model is evaluated on an expert-labeled, unseen test set. This approach requires orders of magnitude less hand-labeled data than would be needed to directly train the deep learning model in (2), because hand-labeled data are only used to develop labeling functions and to evaluate final model performance.
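Step (1) of this workflow can be sketched in a few lines. The rules, weights, and example sentences below are hypothetical illustrations; the actual system learned per-function accuracies with a generative model rather than using fixed weights.

```python
# Minimal sketch of the labeling-function workflow in step (1).
# Each labeling function votes positive, negative, or abstains on a
# candidate pain mention; votes are combined into a label estimate.

ABSTAIN, POSITIVE, NEGATIVE = 0, 1, -1

def lf_pain_keyword(sentence):
    """Vote positive if an explicit pain keyword appears."""
    return POSITIVE if "pain" in sentence.lower() else ABSTAIN

def lf_negation(sentence):
    """Vote negative if the mention is negated."""
    s = sentence.lower()
    return NEGATIVE if "denies" in s or "no pain" in s else ABSTAIN

def lf_family_history(sentence):
    """Vote negative for family-history mentions not about the patient."""
    return NEGATIVE if "family history" in sentence.lower() else ABSTAIN

# Fixed weights stand in for learned labeling-function accuracies.
WEIGHTED_LFS = [(lf_pain_keyword, 1.0),
                (lf_negation, 2.0),
                (lf_family_history, 2.0)]

def probabilistic_label(sentence):
    """Combine weighted LF votes into a crude label estimate in [0, 1]."""
    total = sum(w * lf(sentence) for lf, w in WEIGHTED_LFS)
    if total > 0:
        return 1.0
    if total < 0:
        return 0.0
    return 0.5  # every labeling function abstained or the votes cancelled

notes = [
    "Patient reports severe hip pain on ambulation.",
    "Denies any pain at the surgical site.",
    "Family history of chronic joint pain.",
]
labels = [probabilistic_label(n) for n in notes]
```

The probabilistic labels produced this way would then train the discriminative model in step (2), with hand labels reserved for development and final evaluation.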

Concept extraction models
By restricting our implant candidate extraction to the specific operative notes for each patient's THA procedure, we sufficiently disambiguated implant mentions to achieve high performance using dictionary-based string matching. Thus our implant candidates were used directly as our final implant outputs.
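Because implant mentions in the THA operative note are unambiguous, the extraction reduces to dictionary lookup. The following sketch uses a hypothetical implant dictionary; the study's actual dictionary is not reproduced here.

```python
import re

# Hypothetical implant-system dictionary (lowercased surface form ->
# canonical system name). The real dictionary was much larger.
IMPLANT_DICT = {
    "accolade ii": "Accolade II",
    "trident": "Trident",
    "pinnacle": "Pinnacle",
}

def extract_implants(note_text):
    """Dictionary-based string matching over an operative note.

    Longer dictionary entries are tried first so that a multi-word
    system name is not shadowed by a one-word substring entry.
    """
    found, text = [], note_text.lower()
    for key in sorted(IMPLANT_DICT, key=len, reverse=True):
        if re.search(r"\b" + re.escape(key) + r"\b", text):
            found.append(IMPLANT_DICT[key])
            text = text.replace(key, " ")  # avoid double-counting overlaps
    return found

op_note = "Components: Accolade II femoral stem with Trident acetabular shell."
implants = extract_implants(op_note)
```

Restricting the input to the procedure-specific operative note is what makes this simple matcher sufficient; applied to all notes, the same strings would be far more ambiguous.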
For pain extraction, we learned a generative model from labeling functions applied to unlabeled patient notes to create a probabilistically labeled training set. We then used these data to train a state-of-the-art bidirectional Long Short-Term Memory (LSTM) [4] neural network with attention as our end discriminative model. Hyperparameter tuning was done using random search over 10 models, using a parameter grid derived from the literature.
For the final predicted pain events, all anatomical entities were normalized to UMLS concept unique identifiers (CUIs) using rule-based linking to the Foundational Model of Anatomy (FMA). CUIs were linked to the most specific (i.e., longest distance to the root node) concept in the FMA.
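The "most specific concept" rule can be illustrated with a toy ontology. The parent relations below are a made-up fragment standing in for the FMA, which is far larger.

```python
# Toy anatomy ontology as child -> parent links; None marks the root.
# This is an illustrative stand-in for the FMA hierarchy.
PARENT = {
    "hip joint": "joint",
    "joint": "body part",
    "hip region": "body part",
    "body part": None,  # root
}

def depth(concept):
    """Distance from a concept to the ontology root."""
    d = 0
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        d += 1
    return d

def most_specific(candidates):
    """Pick the candidate with the longest path to the root, mirroring
    the 'most specific concept' linking rule described above."""
    return max(candidates, key=depth)
```

Under this rule, an entity that matches both "hip region" and "hip joint" would link to "hip joint", since it sits deeper in the hierarchy.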

Supplementary Results
Modeling pain outcomes as relations enables detection of long-distance mentions
In our gold set, 52.51% of all Pain-Anatomy mentions occurred one or more words apart. At the note level, 39% of positive pain relations occurred only as long-distance mentions. These long-distance relations also contain different information than compound mentions (e.g., "hip pain"): for notes containing both mention types, the anatomical locations mentioned overlap by only 15% on average.

Summary of Cox proportional hazards analysis of the risk of revision for each hip implant system. The table on the left lists the number of patients implanted with each system, the number of revision events observed for each based on structured records only, and the total person-years of data available. The forest plot displays the corresponding hazard ratio, with the hazard ratio (95% confidence interval) and p-value listed in the table to the right. Note that this figure only shows implant systems for which at least one revision event was detected.

Summary of Cox proportional hazards analysis of the risk of mechanical failure for each hip implant system. The table on the left lists the number of patients implanted with each system, the number of mechanical failure events observed for each, and the total person-years of data available. The forest plot displays the corresponding hazard ratio, with the hazard ratio (95% confidence interval) and p-value listed in the table to the right. Note that this figure only shows implant systems for which at least one mechanical failure event was detected.

Post-implant complication-free survival among implant systems
Supplementary Figure 5. Summary of Cox proportional hazards analysis of the risk of particle disease for each hip implant system. The table on the left lists the number of patients implanted with each system, the number of particle disease events observed for each, and the total person-years of data available. The forest plot displays the corresponding hazard ratio, with the hazard ratio (95% confidence interval) and p-value listed in the table to the right.
Note that this figure only shows implant systems for which at least one particle disease event was detected.

Supplementary Figure 6. Summary of Cox proportional hazards analysis of the risk of radiographic abnormality for each hip implant system. The table on the left lists the number of patients implanted with each system, the number of radiographic abnormality events observed for each, and the total person-years of data available. The forest plot displays the corresponding hazard ratio, with the hazard ratio (95% confidence interval) and p-value listed in the table to the right. Note that this figure only shows implant systems for which at least one radiographic abnormality event was detected.
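The complication-free survivorship that these Cox analyses summarize can be sketched with a minimal Kaplan-Meier estimator. The follow-up times and event flags below are made-up illustrations, not study data.

```python
# Minimal Kaplan-Meier estimator for complication-free survival,
# the quantity summarized per implant system in the analyses above.

def kaplan_meier(times, events):
    """Return (event_time, survival_probability) steps.

    times  : follow-up time for each patient (any consistent unit)
    events : 1 if a complication occurred at that time, 0 if censored
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk, surv, curve = len(times), 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = n_at_t = 0
        # Group all patients sharing this follow-up time.
        while i < len(order) and times[order[i]] == t:
            d += events[order[i]]
            n_at_t += 1
            i += 1
        if d:  # survival only drops at event times, not at censorings
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t
    return curve

# Five hypothetical patients: complications at years 2 and 5,
# censoring (no observed complication) at years 1, 3, and 6.
curve = kaplan_meier([2, 1, 5, 3, 6], [1, 0, 1, 0, 0])
```

A Cox proportional hazards model then compares such curves across implant systems, expressing each system's risk as a hazard ratio with a confidence interval, as in the forest plots described above.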