Ontology-based feature engineering in machine learning workflows for heterogeneous epilepsy patient records

Biomedical ontologies are widely used to harmonize heterogeneous data and integrate large volumes of clinical data from multiple sources. This study analyzed the utility of ontologies beyond their traditional roles, that is, in the challenging and currently underserved area of feature engineering in machine learning workflows. Machine learning workflows are being increasingly used to analyze medical records with heterogeneous phenotypic, genotypic, and related medical terms to improve patient care. We performed a retrospective study using neuropathology reports from the German Neuropathology Reference Center for Epilepsy Surgery at Erlangen, Germany. This cohort included 312 patients who underwent epilepsy surgery and were labeled with one or more diagnoses, including dual pathology, hippocampal sclerosis, malformation of cortical development, tumor, encephalitis, and gliosis. We modeled the diagnosis terms together with their microscopy, immunohistochemistry, anatomy, etiology, and imaging findings using the description logic-based Web Ontology Language (OWL) in the Epilepsy and Seizure Ontology (EpSO). Three tree-based machine learning models were used to classify the neuropathology reports into one or more diagnosis classes with and without ontology-based feature engineering. To avoid overfitting, we used five-fold cross validation with a fixed number of repetitions, leaving out one subset of data for testing in each iteration, and we used recall, balanced accuracy, and Hamming loss as performance metrics for the multi-label classification task. The epilepsy ontology-based feature engineering approach improved the performance of all three learning models, with improvements of 35.7%, 54.5%, and 33.3% in the logistic regression, random forest, and gradient tree boosting models, respectively. The run time performance of all three models also improved significantly with ontology-based feature engineering, with the gradient tree boosting model showing a 93.8% reduction in the time required for training and testing. Although all three models showed an overall improved performance across the three performance metrics using ontology-based feature engineering, the rate of improvement was not consistent across all input features. To analyze this variation in performance, we computed feature importance scores and found that microscopy had the highest importance score across the three models, followed by imaging, immunohistochemistry, and anatomy in decreasing order of importance. This study showed that ontologies have an important role in feature engineering, both by making heterogeneous clinical data accessible to machine learning models and by improving the performance of machine learning models in multilabel multiclass classification tasks.


Supplementary Material
Ontology-based feature engineering in machine learning workflows for heterogeneous epilepsy patient records

Supplementary Methods
Here we present additional details of the ontology-based feature engineering method, the implementation of the machine learning workflow, and the evaluation metrics used in the study.

Feature engineering using epilepsy ontology
As part of the three-step feature engineering process used to map values in the patient records to ontology terms, we first used syntactic mappings based on synonyms and related annotation properties (modeled using rdfs:label and rdfs:comment in the ontology) (1-3). The syntactic transformation involves removal of whitespace in a phrase (e.g., "Mesial Temporal Sclerosis") or mapping specific parts of a phrase (e.g., "Atypical Ganglioma WHO grade II" to AtypicalGanglioma), as the associated WHO grading is already modeled in EpSO using a quantifier restriction on the object property hasWorldHealthOrganizationGrading.
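To make this step concrete, the following is a minimal Python sketch of the syntactic transformation; the synonym table and the helper function to_epso_class are illustrative assumptions rather than the actual EpSO tooling, and in the real workflow the synonym entries come from the rdfs:label and rdfs:comment annotations.

```python
import re

# Sketch only: a curated synonym table standing in for the rdfs:label /
# rdfs:comment annotations in EpSO. The entry below is a hypothetical example.
SYNONYM_TABLE = {
    "atypical ganglioma who grade ii": "AtypicalGanglioma",
}

def to_epso_class(phrase: str) -> str:
    """Map a patient record phrase to an EpSO class name."""
    key = phrase.strip().lower()
    if key in SYNONYM_TABLE:
        return SYNONYM_TABLE[key]
    # Fallback syntactic transformation: whitespace removal, e.g.,
    # "Mesial Temporal Sclerosis" -> "MesialTemporalSclerosis".
    return "".join(token.capitalize() for token in re.split(r"\s+", phrase.strip()))

print(to_epso_class("Mesial Temporal Sclerosis"))        # MesialTemporalSclerosis
print(to_epso_class("Atypical Ganglioma WHO grade II"))  # AtypicalGanglioma
```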
In the second step, we used class expressions that combined one or more ontology terms to represent a complex term.
This composition-based use of class expressions to represent medical concepts has also been implemented in the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) (4). SNOMED CT uses a combination of precoordination, where terms are modeled explicitly in the ontology (precoordinated expressions), and postcoordination, where one or more terms are combined using a set of rules (postcoordinated expressions). The use of postcoordinated expressions enables SNOMED CT to model new medical concepts or terms. For example, a SNOMED CT postcoordinated expression using the recommended syntax, |hip joint| : |laterality| = |left|, represents the laterality information about a hip joint (4).
In this study, we used an aggregation of epilepsy ontology terms to map a value in the patient record; for example, "depletion of neuron in CA2" was mapped to NeuronalLoss, CA2Field, PyramidalNeuron, and HippocampalSclerosis. The first three ontology terms together model the term from the patient record, and the fourth term describes important contextual (diagnosis) information, which is valuable in a machine learning workflow. A similar term augmentation occurs when the patient record term "Blurring" is mapped to the ontology term BlurringOfGreyWhiteMatterJunction, which provides additional contextual information.
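A hedged sketch of this aggregation step, using a manually curated mapping table that reproduces the examples above, might look as follows; the dictionary AGGREGATION_MAP and the helper expand_record_term are our assumptions.

```python
# Sketch only: a curated table maps a single patient record term to multiple
# EpSO terms, including contextual diagnosis terms. The entries reproduce the
# examples from the text.
AGGREGATION_MAP = {
    "depletion of neuron in CA2": [
        "NeuronalLoss", "CA2Field", "PyramidalNeuron", "HippocampalSclerosis",
    ],
    "Blurring": ["BlurringOfGreyWhiteMatterJunction"],
}

def expand_record_term(term: str) -> list[str]:
    # Fall back to the raw term when no curated mapping exists.
    return AGGREGATION_MAP.get(term, [term])

print(expand_record_term("depletion of neuron in CA2"))
```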
In the third step of the feature engineering process, we used a semantic transformation approach that uses the semantics of a patient record term to map it to ontology terms. For example, the term "microcolumn" was mapped to AbnormalRadialCorticalLamination, FocalCorticalDysplasiaTypeIA, and OccipitalLobe by interpreting the occurrence of microcolumns as the specific type of cortical dyslamination seen in focal cortical dysplasia type 1A. This mapping took into account that although abnormal radial cortical lamination occurs in both focal cortical dysplasia type 1A and type 1C, type 1C additionally includes the finding of abnormal tangential cortical lamination; therefore, "microcolumn" was not mapped to focal cortical dysplasia type 1C.
The mapping of patient record terms to epilepsy ontology terms required manual review and curation; therefore, disseminating these mappings through a lookup service will enable other users to reuse them in their machine learning workflows. As additional mappings are created for new machine learning workflows, a library of these mappings can become a valuable resource for feature engineering of epilepsy clinical data.
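As one possible starting point for such a lookup service, the curated mappings could be serialized to a shared format; in the sketch below, the file name epso_feature_mappings.json and the JSON schema are assumptions, not an existing resource.

```python
import json

# Hypothetical sketch: persist the curated mappings so that a lookup service
# or other machine learning workflows can reuse them.
mappings = {
    "depletion of neuron in CA2": [
        "NeuronalLoss", "CA2Field", "PyramidalNeuron", "HippocampalSclerosis",
    ],
    "microcolumn": [
        "AbnormalRadialCorticalLamination", "FocalCorticalDysplasiaTypeIA",
        "OccipitalLobe",
    ],
}

with open("epso_feature_mappings.json", "w") as fh:
    json.dump(mappings, fh, indent=2)
```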

Machine learning workflow: parameters, training, and validation
The input features FN from the 312 neuropathology reports NPR result in a feature matrix denoted by FM ∈ ℝ^(NPR × FN), with the diagnosis values used as labels to be assigned to a patient record by each of the three models. Therefore, the machine learning task was implemented as multilabel classification based on the binary relevance (BR) transformation method, where a patient can have one or more neuropathology diagnosis labels (D with |D| = 59) and each label is independent of the others (5,6). Each of the three models was trained for each diagnosis label based on the four input features, that is, microscopy (M), immunohistochemistry (IHC), brain localization (L), and imaging results (I). For example, a neuropathology record with input features M, IHC, L, and I may be assigned the output label Ganglioglioma WHO grade I. The data values in the reports were encoded using the Scikit "OneHotEncoder" library.
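As a concrete illustration, the following sketch sets up the binary relevance transformation with one-hot encoded categorical features; the toy data, variable names, and the use of Scikit's MultiOutputClassifier (one independent classifier per label) are our assumptions, not the study's exact implementation.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real (312 x 4) categorical feature table with columns
# M, IHC, L, and I, and a (312 x 59) binary diagnosis label matrix.
X_raw = np.array([
    ["NeuronalLoss", "GFAPPositive", "CA2Field",      "HippocampalAtrophy"],
    ["Microcolumn",  "NeuNPositive", "OccipitalLobe", "Normal"],
    ["NeuronalLoss", "NeuNPositive", "CA2Field",      "HippocampalAtrophy"],
    ["Microcolumn",  "GFAPPositive", "OccipitalLobe", "Normal"],
])
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # two toy diagnosis labels

X = OneHotEncoder(handle_unknown="ignore").fit_transform(X_raw)

# MultiOutputClassifier fits one independent classifier per label, which
# implements the binary relevance (BR) transformation described above.
br_model = MultiOutputClassifier(
    LogisticRegression(solver="liblinear", penalty="l2", tol=0.01, C=1.0)
).fit(X, Y)
```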
We used the Scikit "liblinear" solver for fitting the logistic regression model with "l2" regularization, a stopping tolerance of 0.01, and the regularization strength parameter C set to 1. The random forest library in Scikit is an implementation of the ensemble machine learning method that combines decision trees built on random feature subsets to improve the performance of the model (7). The Scikit random forest library uses the "n_estimators" parameter to denote the total number of decision trees in the forest, which was assigned a value of 21 during our parameter tuning phase based on the lowest number of incorrect predictions. The Scikit library uses additional parameters to draw samples with replacement from the training data (also called bootstrap samples), which is set to "true" in our implementation, with the generalization accuracy estimated from the left-out samples by setting the "oob_score" parameter to true (6). The third model used in this paper is gradient tree boosting with the learning rate parameter set to 0.1, the parameter for the number of weak learners "n_estimators" set to 31, and the parameter selecting the fraction of samples used for fitting the individual base learners set to 0.95. These parameter values were tuned based on the performance of test evaluations performed over a range of parameter values.
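For reference, a minimal sketch instantiating the three base learners with the hyperparameter values reported above might look as follows; reading the "fraction of samples" parameter as Scikit's subsample argument is our assumption, and every parameter not mentioned in the text is left at its Scikit default.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Logistic regression: liblinear solver, l2 penalty, tol 0.01, C = 1.
log_reg = LogisticRegression(solver="liblinear", penalty="l2", tol=0.01, C=1.0)

# Random forest: 21 trees, bootstrap samples, out-of-bag score enabled.
random_forest = RandomForestClassifier(n_estimators=21, bootstrap=True,
                                       oob_score=True)

# Gradient tree boosting: learning rate 0.1, 31 weak learners, and 0.95 as
# the fraction of samples per base learner (assumed to be "subsample").
gradient_boosting = GradientBoostingClassifier(learning_rate=0.1,
                                               n_estimators=31,
                                               subsample=0.95)
```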
To avoid overfitting, we used 5-fold cross validation, with each iteration leaving out one subset of data for testing. The trained classifier was used to predict whether each diagnosis label could be assigned to a patient record based on its neuropathology features. This leave-one-subset-out approach reflects our assumption that all the patient reports in our dataset are independent of each other and were created by a similar process. We used the aggregate of the iterations to generate the final assignment of a diagnosis label to a patient report.
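A minimal sketch of this cross-validation loop is shown below; the random placeholder arrays stand in for the one-hot encoded features and the diagnosis label matrix, and the model configuration follows the earlier sketches.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the encoded 312-record dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(312, 40))  # one-hot feature stand-in
Y = rng.integers(0, 2, size=(312, 5))   # diagnosis label stand-in

model = MultiOutputClassifier(LogisticRegression(solver="liblinear"))

# Each of the 5 iterations leaves one subset out for testing and trains
# on the remaining four subsets.
fold_predictions = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    model.fit(X[train_idx], Y[train_idx])
    fold_predictions.append((test_idx, model.predict(X[test_idx])))
```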

Evaluation metrics: Hamming loss, balanced accuracy, and recall
If N is the total number of samples, L is the total number of labels, Yi is the set of true class labels for sample i, and Ŷi is the set of labels predicted by a classifier m, then the Hamming loss is defined as HL(m, N) = (1/(N·L)) Σi |Yi Δ Ŷi|, where Δ is the symmetric difference between two sets (5,6). The accuracy measure is defined as A(m, N) = (1/N) Σi (|Yi ∩ Ŷi| / |Yi ∪ Ŷi|) (5,6). The Scikit library includes a specialized function called balanced accuracy that addresses the issue of bias in imbalanced datasets; it is computed by assigning a weight to each sample based on the occurrence of the true positive labels (6). In addition to these two metrics, we used the recall measure to evaluate the performance of the three models, defined as Recall = TP/(TP + FN), where true positives (TP) correspond to the correct diagnosis labels assigned by a model to a patient record and false negatives (FN) correspond to the correct diagnosis labels that were not assigned by a model to a patient record (8). A lower Hamming loss value and higher balanced accuracy and recall values are indicative of improved performance by a machine learning model.
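The sketch below computes the three metrics with Scikit on a toy multilabel prediction; because Scikit's balanced_accuracy_score is defined for single-label targets, averaging it over the per-label binary columns is our assumption about how to apply it in the multilabel setting, and the toy matrices are placeholders.

```python
import numpy as np
from sklearn.metrics import hamming_loss, recall_score, balanced_accuracy_score

# Toy (samples x labels) binary indicator matrices.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print(hamming_loss(Y_true, Y_pred))                   # fraction of wrong labels
print(recall_score(Y_true, Y_pred, average="micro"))  # TP / (TP + FN)

# Assumed adaptation: average the single-label balanced accuracy over the
# per-label binary columns of the multilabel matrices.
print(np.mean([balanced_accuracy_score(Y_true[:, j], Y_pred[:, j])
               for j in range(Y_true.shape[1])]))
```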

Statistical analysis of the results
To validate the significance of our comparison, we conducted a corrected repeated k-fold cross-validation test based on 5 repetitions of 5-fold cross validation (9). A 5-fold cross validation approach was used to calculate the balanced accuracy, Hamming loss, and recall of our baseline and class V ontology mappings for each machine learning algorithm. The accuracy measures for each of the 5 folds were recorded. For each algorithm, this 5-fold approach was repeated 5 times, resulting in 25 measures of each metric for each of the two mappings. These measures were then compared using the corrected t-test statistic

t = ((1/(kr)) Σi Σj xij) / sqrt((1/(kr) + n2/n1) · s²),

where xij = aij − bij, s² is the sample variance of the differences xij, and: (1) r is the number of replications; (2) k is the number of folds for cross validation; (3) aij refers to the accuracy metric (Hamming loss, balanced accuracy, or recall) from fold j of replication i for one of the algorithms (random forest, logistic regression, or gradient boosting) under the baseline mapping; (4) bij refers to the corresponding metric from fold j of replication i under the class V ontology mapping; (5) n1 refers to the number of instances used for training; and (6) n2 refers to the number of instances used for testing.
P-values were calculated from the test statistic according to a t-distribution with 24 degrees of freedom (df = kr-1). All calculations were performed in Python (version 3.10).
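A sketch of this corrected t-test in Python is given below; the per-fold scores are random placeholders rather than study results, and the train/test sizes are approximations for 5-fold cross validation on 312 records.

```python
import numpy as np
from scipy import stats

r, k = 5, 5          # replications and folds
n1, n2 = 250, 62     # approximate train/test sizes for 5-fold CV on 312 records

# Placeholder per-fold scores (r * k = 25 values per mapping), standing in for
# the recorded accuracy measures of the baseline (a) and class V mapping (b).
rng = np.random.default_rng(0)
a = rng.normal(0.70, 0.05, size=r * k)
b = rng.normal(0.75, 0.05, size=r * k)

# Corrected repeated k-fold CV test: variance is inflated by n2/n1 to account
# for the overlap between training sets across folds and repetitions.
x = a - b
t = x.mean() / np.sqrt((1.0 / (k * r) + n2 / n1) * x.var(ddof=1))
p = 2 * stats.t.sf(abs(t), df=k * r - 1)  # df = kr - 1 = 24
print(t, p)
```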
The results (at a significance threshold of p = 0.05) show that the improvement in balanced accuracy is not statistically significant for any of the three machine learning models. Similarly, the changes in the Hamming loss and recall values are not statistically significant across the three learning models.

Figure S1: EpSO models neuropathology findings at a fine level of granularity to support semantic annotation of patient records and applications in feature engineering.