Introduction

Deep learning (DL)-based diagnostic assistant systems have made significant strides in various medical fields, such as radiology, retinal fundus, and dermatology imaging1,2,3,4,5,6,7,8,9,10. While these systems have demonstrated performance comparable to domain specialists, the integration of these technologies into real-world clinical practice remains an underexplored challenge. One pivotal concern surrounding these systems is liability. Hence, assistive cooperation between artificial intelligence (AI) and human intelligence (HI) is imperative10. In this cooperation, AI aids the decision-making process, while human experts retain final decision-making responsibility. A few studies have focused on incorporating DL models to help medical experts in various tasks10,11,12.

The differences in how AI models and human experts are trained, and their distinct strengths and limitations in the diagnostic process, motivate AI-HI cooperation. For instance, AI models have shown statistical bias toward prevalent diseases13,14, yet maintain remarkable consistency15. Conversely, human experts, while not as biased toward prevalent conditions, exhibit significant inter-rater variability16,17. This variability can sometimes prompt patients to seek multiple opinions. These distinctions should be considered in the cooperation between humans and machines for practical use.

In a real-world setting, automated diagnostic systems primarily serve as assistants, offering insights from AI as a supplementary opinion. However, if a diagnostic assistant system makes a suggestion based solely on the outcome of AI classification, users, especially less experienced physicians, could be biased toward or against the suggestion of the AI10. A promising alternative is a cooperative model that treats both humans and AI as independent classifiers and merges their outputs. However, due to the substantial variability in human experts’ accuracies and diagnostic tendencies, such a unified model might exhibit unpredictable performance for experts with different expertise in clinical practice.

Thus, the ideal AI system should be tailored to individual practitioners, accounting for their unique diagnostic strengths and biases. Such adaptability not only harnesses the domain-specific expertise of the human but also aligns with their diagnostic tendencies. This alignment is vital; experts’ skepticism towards AI decisions, especially when they conflict with their own judgments, might lead to the rejection of correct AI suggestions even in cases where the human expert is not proficient. Adapting AI to individual human diagnostic characteristics might mitigate such challenges, ensuring smoother AI-HI collaboration. Therefore, AI should learn not only the domain knowledge but also the tendencies of the human partners it assists.

In this study, we advocate for an AI-HI cooperation strategy tailored to individual diagnostic traits. By first assessing an expert’s proficiency with a small set of labeled data, we formulate a meta-model that maximizes each party’s strengths in integrative diagnosis by adaptively weighting each party’s diagnostic characteristics for the final decision. We compared this AI-HI meta-model with a simple ensemble method that adds each party’s predictions according to a predefined weight. By doing so, we argue that the cooperation of human and machine experts should be individual-specific, considering the pros and cons of each human expert for clinical usage.

Using ear disease classification from otoendoscopic images—which shows heterogeneous diagnostic performance among human experts15—we underscore the practicality of this cooperative strategy. Ear and mastoid diseases are common in, but not limited to, developing countries18. They are covered in core medical licensing examinations, and primary care physicians are expected to treat these disease groups. However, medical students may not receive sufficient training on otoscopy19, and even for experts, performance may not be good enough to ensure consistency20. By testing our meta-model with a diverse group of physicians, including six otolaryngologists or otolaryngology residents and eight non-otolaryngologists, we seek to address these challenges.

Results

This study used convolutional neural network (CNN)-based otoendoscopy classification models (see Cha et al.15) trained with 6900 otoendoscopic images of six classes (Table 1). Table 1 summarizes the two independent data sets used to train and test the cooperation models. The first test set (Test 1) is equally distributed, while the second test set (Test 2) is chosen according to the imbalanced prevalence. Six ENT (Ear, Nose, and Throat) physicians and eight non-ENT physicians (family medicine specialists, emergency medicine specialists, and general practitioners) achieved accuracies (mean ± SD) of 71.17 ± 3.37% and 45.63 ± 7.89%, respectively. Our ensemble AI model, based on pre-trained DL models (ResNet152, DPN92, Inception-V4, and DenseNet201) and modified to handle the class imbalance problem, achieved an accuracy of 80.33% on the balanced test set (Test 1) and 86.67% on the imbalanced test set (Test 2). For detailed results, see Cha et al.15. We compared the proposed method with a simple ensemble method in combining human experts’ and AI’s decisions.

Table 1 Training and test sets for otoendoscopic images.

Synergistic effects of AI and human experts according to human expertise and classes with different prevalences

To evaluate the synergistic effects of human experts and DL models across classes, we tested the performance of a simple ensemble between HI and AI formed by weighting the top four performing DL models’ softmax values and a human rater’s diagnosis. Since otolaryngologists were often better at predicting less common disease entities, appropriately combining decisions from both parties improved overall accuracy (Fig. 1a). This was true even for non-otolaryngologists, although the benefit was minimal since their diagnostic performance was far worse than that of the DL models. We also assessed the per-class accuracy of the six individual classes.

Figure 1

A human weight (α) and the resulting overall accuracy for the analysis of human–machine cooperation. The alpha value (α) indicates the weight of the HI decision. Since the model consists of a human and four different AI models, the alpha value ranges from 0 (no human intelligence) to 4 (no artificial intelligence). (a) Otolaryngologists improved the overall accuracy, while non-otolaryngologists had limited contributions. The dashed grid line indicates the overall accuracy of the ensemble DL model (α = 0), which is 80.33%. Peak accuracy was 82.33% (2.43% increase) in ENT (α = 0.75) and 81.00% (0.09% increase) in non-ENT (α = 0.5). The overall average accuracy of each group (α = 4) is 71.17% and 45.63% in ENT and non-ENT physicians, respectively. The mean recall for each class is displayed for the ENT group (b) and the non-ENT group (c). When appropriately weighted, the physicians’ diagnostic results enhance the diagnosis of rare cases (e.g., Myri and Tumor). No, normal; Tp, tympanic perforation; Ar, attic retraction; Ome, otitis media with effusion; Myri, myringitis (otitis externa); Tumor, middle or external auditory canal tumors, or cerumen impactions. Dotted horizontal lines indicate the reference recall of AI for each class.

Each class had a different optimal human weight value: smaller classes, such as tumors and myringitis, had higher optimal human weights, which implies that the human–computer hybrid classifier performed better than the machines alone in these classes. Since the DL models outperformed both human physician groups for each class, setting the alpha (human weight) value above 3 had an adverse effect on class-specific outcomes in otolaryngologists (Fig. 1b). In non-otolaryngologists, accuracy was usually better when less of the physician’s input was reflected in the final result (Fig. 1c).

This experimental result can be summarized in two points: (1) an appropriate combination of human experts and artificial intelligence improves overall accuracy, and (2) the class-specific tendency of each individual might play a more critical role in practical human–machine cooperation than merely weighting each individual’s general accuracy. We took these two points as the core of the current meta-model.

AI and HI: cooperation meta-models

The experimental AI-HI cooperative classification models were designed to blend the human and DL models’ classification results with appropriate weighting. The proposed cooperative model first incorporated the overall accuracy of individuals, assigning higher weights to experts with higher performance. The cooperative classifier then weighted the human rater’s class-specific diagnostic characteristics together with the DL models’ predictions. A human rater’s diagnostic characteristics are assessed from the expert’s confusion matrix on an evaluation data set, in the form of the conditional probability of each true class given the class diagnosed by the individual.

As described in the Methods section, we sampled each clinician’s answers and their correctness from the balanced test set (Test 1 dataset), using from zero up to 200 samples. We compared the proposed AI-HI cooperation model based on the performance-weighted conditional probability of the individual (PCoptMH) with (1) a simple average of the conditional probability of the human diagnosis and the machine classification probability (CavgMH) and (2) the optimally weighted average of the human delta response (1 for the diagnosed class and 0 for the other classes) of the individual expert and the machine classification probability (PoptMH). PCoptMH considers each human expert’s overall diagnostic performance and class-specific diagnostic tendency. CavgMH considers the class-specific diagnostic tendency but not each individual’s overall performance: it does not differentiate highly accurate human experts from those with low accuracy. In contrast, PoptMH considers each human expert’s overall performance but not class-specific diagnostic characteristics.

Figure 2 shows an example of this cooperative procedure applied to a third-year otolaryngology resident. From the classification results and diagnoses for 100 evaluation samples of the Test 1 dataset, we evaluated the DL models’ and the physician’s overall accuracy and class-specific tendencies (Fig. 2a, b). Then, we built the conditional probability of the clinician (Fig. 2c) based on the clinician’s confusion matrix for the evaluation data set (Fig. 2d, e).

Figure 2

A demonstration of the AI-HI cooperation scheme. Cooperation between AI models and a human expert (HI, a third-year ENT resident) is presented. (a) The DL classifier generates outputs corresponding to the probability of each class from the softmax layer. (b) The human expert chooses one class among the six, represented as a delta-response vector assigning one to the diagnosed class and zero to the other classes. (c) We used the conditional probability of the true label given the human-diagnosed label to capture the human expert’s classification tendency, i.e., accuracy for each class. The conditional probability of the true label is derived from the individual’s confusion matrix on the evaluation data set. (d) Confusion matrix of AI and (e) confusion matrix of the human otolaryngologist (HI) for the evaluation data (100 samples of Test 1), and (f) confusion matrix of AI predictions and (g) confusion matrix of the human expert for the test dataset (500 samples comprising Test 2 and Test 1 except for the 100 evaluation samples) are displayed. (h) Confusion matrix of a cooperation model between AI and HI that optimizes the weight for the AI predictions in (a) and the HI delta-response in (b) according to HI’s overall accuracy (PoptMH), without considering the individual’s class-specific characteristics, for the test dataset, and (i) confusion matrix of the cooperation model optimally adding the AI predictions in (a) and the HI conditional probability in (c) by considering both the overall personal accuracy and the class-specific diagnostic pattern of the human expert (PCoptMH) for the test dataset are displayed.

For comparison purposes, the accuracies of the DL model and the human expert are depicted in the confusion matrix using a new test set (Fig. 2f, g). This test set comprised both Test 1 and Test 2 datasets, excluding the 100 evaluation samples selected from the Test 1 dataset, resulting in 500 samples. This arrangement ensured that all diagnostic labels rated by each individual were included. The same test set was also employed to assess the performance of the cooperation methods, PoptMH and PCoptMH (Fig. 2h, i). In subsequent analyses involving different evaluation samples, the test dataset was assembled following the same method.

Cooperation with PoptMH resulted in only a minimal gain in accuracy (Fig. 2h). Applying PCoptMH yielded a slight decrease in overall accuracy compared to the DL models alone (0.84 vs. 0.83, Fig. 2f, i); however, the decisions were more closely tailored to the physician, respecting the individual’s tendency to diagnose each class.

In summary, when sufficient data were available to evaluate human rater skills, adding the machine classification probability and the conditional probability of each individual’s classification (Fig. 3, CavgMH) increased accuracy across almost all groups. Combining the conditional probability of human decisions for each class with overall personal accuracy (PCoptMH) produced more accurate results than using only each human expert’s overall accuracy (PoptMH).

Figure 3

The performances of human–machine cooperation models according to evaluation sample sizes. (a)–(d) display the performance of human experts (HI), the DL model (AI), CavgMH, PoptMH, and PCoptMH for probing samples of 30, 60, 100, and 150 used to evaluate the human expert’s characteristics. CavgMH is a simple averaging method between the DL classifier probability and the conditional probability of human ratings, without optimizing weights for each individual’s performance. PoptMH indicates AI-HI cooperation with an optimized weight for the person’s contribution to the final decision in averaging the DL classifier probability and the human delta response (1 for the diagnosed class, 0 for others), without using the conditional probability of true labels given human ratings. PCoptMH indicates AI-HI cooperation with an optimized weight for the person in adding the DL classifier probability and the conditional probability of the human decision. The difference Δ (gain) between HI and the other four models is displayed. Blue dots indicate 0 < Δ ≤ 0.05; green triangles, 0.05 < Δ ≤ 0.1; yellow squares, 0.1 < Δ ≤ 0.2; magenta squares, 0.2 < Δ ≤ 0.3; and red dots, 0.3 < Δ. Black dots indicate a decrease in accuracy. Although the DL model generally shows higher performance than human experts, and cooperation may lose some of the performance that the DL model alone has for some experts, this approach takes account of the human’s knowledge and responsibility. Gains in overall accuracy were obtained across almost all individuals, even when small samples were used to evaluate the human rater’s skills. ENTF, otolaryngologists; ENTR, otolaryngology residents (numbers indicate years of training); FM, family medicine specialists; ER, emergency medicine specialists; GP, general practitioners.

From a practical standpoint, estimating the confusion matrix for each person did not require many probing samples. Even five samples per class (30 in total, Fig. 3a) helped capture the tendency of human classification and improve the cooperative classifier.

Discussion

DL models are expected to play a cooperative role in clinics3,21 alongside human intelligence because the two have different diagnostic strategies and may complement each other. In our previous study15, DL models showed consistent performance but a weakness with imbalanced data, as indicated by a higher prediction bias toward prevalent cases. In contrast, human raters showed higher individual variance but little bias toward prevalence15. We speculate this is due to the different learning or training tactics of humans and computers.

Collective intelligence (CI) has been introduced in medical diagnostics22 and decision-making23 and proved beneficial. In a study on skin cancer detection22, diagnostic accuracy needed to be similar across doctors for CI to enhance overall detection accuracy. In another study23, which focused on mammography screening, CI increased true positives, decreased false positives, and exceeded the decision accuracy of a single radiologist.

Our study can be viewed as CI between humans and computers, as our experimental model independently takes input from both parties. A similar approach has been studied in skin cancer classification24 and in predicting the course of multiple sclerosis25. In the skin cancer study by Hekler et al.24, dermoscopic images were classified into five categories, and the model outputs were combined with dermatologists’ answers, which were collected independently; the XGBoost26 algorithm was applied to combine the probability of each class label with the dermatologist’s answer. In the multiple sclerosis study25, medical students were surveyed to predict the duration of the relapsing–remitting phase; random forests27 were used as the machine model, and bootstrapping with the medical students’ predictions was used to enhance classification performance. Of note, they also used a linear combination of the predictions of humans and computers but obtained worse performance.

Weighted averaging between DL models and human physicians improved diagnostic accuracy, especially in the minor classes (Fig. 1). As an exception, otitis media with effusion was hard to diagnose because of many subtle cases, and physicians were better off accepting the DL model’s answers. For the tumor and myringitis classes, however, human physicians showed the potential to compensate for the DL model’s limited data availability. Because non-otolaryngologists’ diagnostic skills were far inferior to the DL models’, they could offer little help in increasing accuracy.

With the conditional probability method, our system analyzes the strengths and weaknesses of the human expert and weights the DL results to make suggestions depending on the situation: it provides strong suggestions where DL is superior and weak suggestions where DL is vulnerable. As shown in Fig. 2, PoptMH uses the delta decision of each individual (1 or 0 for each class) when combining with the machine classification probability, whereas PCoptMH uses conditional probabilities representing an individual’s diagnostic tendency. By introducing the conditional probability of each human expert’s diagnosis, cooperation performance is increased compared to simply weighting by the human expert’s overall accuracy. In this manner, the combined prediction is tailored to the specific physician.

It should be noted that our proposed system does not require an extensive number of samples to evaluate each human expert’s classification performance or bias. As few as 30 or 60 samples are sufficient to learn a human expert’s diagnostic characteristics, which makes the current method suitable for real-world clinics, as shown in Fig. 3.

The relative merits of parallel cooperation between humans and machines and of the sequential approach from machine to human (the DL model provides an opinion to the human expert) have been debated. The sequential approach may be the conventional concept of AI-HI cooperation in medicine. However, in a previous study10, the diagnoses made by human raters were adversely affected when the computer made faulty suggestions, especially for less experienced physicians. The strength of our system is that it takes the predictions of the human and the DL model as independent inputs; users may therefore avoid the psychological effects triggered by the suggestions of prediction systems. Parallel cooperation, however, could be disadvantageous in terms of time, since the meta-model’s result cannot be produced until the human physician has answered. Inspecting eardrums and external auditory canals, however, does not usually require as much time as radiologic image analysis, so in this situation the parallel cooperation used in the current study is a plausible choice.

Although the accuracy of the AI model’s predictions exceeds that of most human raters, relying on automated systems without human guidance is discouraged. As noted in a study on skin cancer28, physicians put clinical information together with imaging information, which may ultimately result in a more accurate diagnosis than considering the image alone, as AI models do. Also, the current DL model treats all diseases equally; the system does not know that misdiagnosing diseases such as external auditory canal cancer is dire.

We can extend the current parallel cooperation to serial cooperation, i.e., the AI decision is suggested to the human expert. Instead of directly providing the classification decision as the final class label or the softmax probability of the classifier, we may adjust the suggested probability by weighting the classification probability of the machine with the inverse of the human conditional probability (for the false classes) to guide the human decision. An experiment with this scheme will be explored in future work.

Diagnosing otologic diseases from a single image poses significant challenges, even for physicians with extensive experience20. Physicians may lack confidence in diagnosing certain diseases, especially when dealing with complex ear pathologies unless they have received specialized training. This is where DL models can be particularly beneficial, aiding physicians by not only enhancing overall accuracy but also instilling confidence in their diagnostic decisions, akin to consulting with a colleague. Nevertheless, the dynamics of an individual’s confidence in the machine, as well as in their own judgments, warrant further exploration in future studies.

The current study has some limitations. Otoendoscopic images often contain more than one finding, which calls for multi-label, multi-class models in future studies for a more realistic automated diagnosis system. In this study, if more than one pathology was present, labeling was performed according to a predefined labeling priority, but at prediction time only one class was chosen as the argmax value, not by labeling priority, due to the AI model’s design. In addition, we presented a cooperation model only for the case of otoendoscopic image classification; the cooperation model should be validated in other domains of cooperation between human experts and machines. Also, even though we obtained parallel input from both parties, the final confirmation should be made by the attending physician for the reasons mentioned earlier, which eventually amounts to counting the human rater twice. Most importantly, the design of the cooperative classifier was post hoc, based on snapshots of physicians’ answers. Physicians’ decision accuracy is likely to increase as they gain more experience, whether with our diagnostic assistance or not. Future studies on the collaborative model should be designed to follow physicians’ improvement of skills so that the model can rely more on the human physicians’ answers. Also, multi-modal DL models that consider clinical information as well as images should be developed to offer a more physician-like model in the future.

In conclusion, we suggest a cooperation method that weighs the strengths and weaknesses of both parties for improved and consistent healthcare services. The system first assesses the diagnostic characteristics of human experts for all classes. Based on this individualized assessment, the proposed model appropriately respects both the user’s diagnostic patterns and the DL models by taking answers from both parties independently. Furthermore, the model minimizes the psychological effects often present in conventional diagnosis assistant systems. We did not use domain-specific knowledge in the AI-HI cooperation meta-model; hence, the strategies we applied are not confined to otoendoscopic image classification and may be generalizable to other decision-making tasks in which individual human knowledge plays an important role.

Materials and methods

AI classifiers

We used CNN-based DL models for otoendoscopy classification, which we previously reported in Cha et al.15. The DL models were trained with 6900 of 7500 otoendoscopic images of patients who visited the outpatient clinic of a tertiary referral center (Department of Otorhinolaryngology, Severance Hospital, Seoul, South Korea). The details of labeling and model training can be found in Cha et al.15.

In brief, otoendoscopic photographs of the tympanic membrane and the external auditory canal (EAC) were labeled into six categories based on the Color Atlas of Endo-Otoscopy29 (Table 1). If more than one etiology was present in an image, it was labeled according to our labeling order, determined by the clarity of the diagnosis and the next required step in real-world clinics. Images showing post-surgery status, similar images (including the same patient’s eardrum captured from multiple angles), blurry images, and otoendoscopic images from the same patient’s follow-up data were excluded. To minimize noisy labeling, three additional steps were taken. First, we checked the medical record for the given image created by the attending physician at the time, who had at least ten years of clinical experience at our center. Second, we checked audiometric and radiologic test results when the otoendoscopic image could not be classified clearly. Lastly, the image was excluded if the last author (D.C.) could not agree on a label even after the aforementioned steps.

ImageNet pre-trained CNN models were used for transfer learning on the otoendoscopic images. After evaluating several models, the four top performers were chosen for further optimization: ResNet15230, InceptionV431, DPN9232, and DenseNet20133. Affine transformations (horizontal flips, rotations, random scaling, lighting levels, and warping) were applied to augment the otoendoscopic images during oversampling. Since the training dataset was imbalanced, we used oversampling, mixup34, and focal loss35 (γ = 1) to mitigate the class imbalance problem. Training, validation, and testing were implemented using PyTorch with the fastai library36.
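For illustration, the focal loss with γ = 1 can be written compactly. The snippet below is a minimal PyTorch sketch of a multi-class focal loss, not the exact implementation used for training (which relied on the fastai library); all names are illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Multi-class focal loss: mean over the batch of (1 - p_t)^gamma * cross-entropy."""
    log_probs = F.log_softmax(logits, dim=1)                # [batch, n_classes]
    ce = F.nll_loss(log_probs, targets, reduction="none")   # per-sample cross-entropy
    p_t = torch.exp(-ce)                                     # softmax probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Toy usage: logits for a batch of two images over the six ear-disease classes.
logits = torch.randn(2, 6)
targets = torch.tensor([0, 3])
loss = focal_loss(logits, targets, gamma=1.0)
```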

Participants and experiments

A computerized online questionnaire consisting of two mutually exclusive sets, each containing 300 anonymized otoendoscopic images (Table 1), was presented to fourteen physicians: six otolaryngology department personnel (two otolaryngologists and four otolaryngology residents) and eight non-otolaryngologists who practice otoscopy in their clinics (two emergency medicine specialists, two family medicine specialists, and four general practitioners). No clinical or demographic information was presented in the survey; the images were the only clues from which to draw a conclusion. We obtained written informed consent from all physician participants.

All participants answered the evaluation images in identical order. The first set was a balanced image set, which does not disadvantage human physicians but may adversely affect the DL model’s performance because of the imbalanced training dataset. The second set was an imbalanced image set, which may favor the DL models, even with the aforementioned class-imbalance mitigation strategies15, but is also a practical measure of real-world clinical performance, since the incidence represents real-world proportions of disease in a tertiary referral hospital.

Similar to the machine models, participants were requested to answer according to the same labeling priority order if more than one pathology was present in the image. Human raters were not aware of whether the test set was balanced or not.

The Severance Hospital Institutional Review Board approved this study (IRB No. 2019-0467-001). All methods were performed in compliance with the Declaration of Helsinki.

AI-HI cooperative classifier

Human physicians and DL models make classifications using different mechanisms; hence, combining both classifications would lead to improvement, similar to creating ensemble classifiers in DL models.

1. Evaluation of synergistic effects according to classes and human expertise.

Before introducing AI-HI cooperation models, we examined the synergistic effects of combining human diagnostic results and predictions by DL models according to human expertise and classes with different prevalences.

For the four best-performing DL classifiers from the previous study15, a basic ensemble of the prediction results was formed by summing each model’s output values after the softmax activation function. In mathematical notation,

$$ c^{*} = \arg\max_{c} \left( \sum_{i=1}^{n} \sigma\left(M_{i}(x)\right) + \alpha\,\delta(x) \right), \quad 0 \le \alpha \le 4, $$

where \(\mathbf{P}_{\mathcal{M}_i}(x) = \sigma(M_i(x))\) generates a probability vector over the classes for the i-th model \(M_i\) (among n = 4 models) and an input image x using the softmax function \(\sigma\). \(\delta(x)\) is a function that takes the input from the human and returns a one-dimensional binarized vector with 1 in the human-predicted class and 0 elsewhere. \(\alpha\) is a personalized human weight that adjusts the influence of the human-predicted class. If \(\alpha\) is set to 0, the human input is not used, whereas if \(\alpha\) is set to 4, the human input always exceeds the sum of the four DL classifiers’ softmax values. \(c^{*}\) is the final class prediction with the maximal vector sum over the human and DL classifiers. Upon inspection, the maximum values of the summed softmax vector in both test sets ranged from 0.9 to 3.65, with an average of around 2.5. This basic ensemble method corresponds to PoptMH, explained below.

We divided all participating human doctors into ENT and non-ENT groups. We evaluated the mean accuracies and the mean per-class recalls in the ENT and non-ENT groups for different weights \(\alpha\). The recall of a class is defined as the number of true-positive samples divided by the number of true samples for that class.

The results are displayed in Fig. 1.
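As a minimal sketch (not the original analysis code, and with hypothetical function names), the α-weighted ensemble and the per-class recall above can be written as follows, assuming the models’ softmax outputs and the human rater’s chosen class indices are available as arrays.

```python
import numpy as np

def simple_ensemble(model_probs, human_labels, alpha, n_classes=6):
    """Summed softmax votes of the DL models plus an alpha-weighted human delta response.

    model_probs  : list of [n_samples, n_classes] softmax arrays, one per DL model.
    human_labels : [n_samples] array of class indices chosen by the human rater.
    alpha        : human weight in [0, 4].
    """
    votes = np.sum(model_probs, axis=0)               # sum of the four models' softmax outputs
    delta = np.eye(n_classes)[human_labels]           # one-hot (delta) human responses
    return np.argmax(votes + alpha * delta, axis=1)   # final class c*

def per_class_recall(pred, true, n_classes=6):
    """Recall per class: true positives divided by the number of true samples of that class
    (assumes every class occurs at least once in `true`)."""
    return np.array([np.mean(pred[true == c] == c) for c in range(n_classes)])
```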

2. Basics of AI-HI cooperation models.

Considering individual differences in classification performance and quality, we proposed a cooperation strategy by appropriately weighing the classification results of both humans and DL models so that the classifier reflects the physician’s personal diagnostic preferences.

First, the system learns the human rater’s diagnostic characteristics by evaluating the class-specific bias and overall accuracy of the human rater from the expert’s diagnoses on an evaluation dataset of at most 300 balanced samples. Using the human expert’s confusion matrix for the evaluation dataset, we derive the conditional probability of the true class given each diagnostic decision (predicted label) of the human expert. Based on this, the optimal weight for the human expert relative to the DL classifiers is estimated. The mathematical formulation is explained below.

For an evaluation data set composed of N sample images \(\boldsymbol{X}^{E}_{n=1,\ldots,N}\), suppose the i-th human rater \(\mathcal{H}_i\) performs the diagnostic decisions.

1. For the N evaluation sample images \(\boldsymbol{X}^{E}_{n=1,\ldots,N}\) and C classes in total, we derive a set of conditional probability vectors \(\{\mathbf{P}_{\mathcal{H}_i}^{n}\}_{n=1,\ldots,N}\), \(\mathbf{P}_{\mathcal{H}_i}^{n} \in \mathbb{R}^{1 \times C}\). \(\mathbf{P}_{\mathcal{H}_i}^{n}\) is composed of the conditional probabilities \(p_i(T_l \mid D_n^i)\) of the true label \(T_l\) given the label \(D_n^i\) diagnosed by the human rater \(\mathcal{H}_i\) for the n-th image. The conditional probability \(p_i(T_l \mid D_n^i)\) is calculated from the individual’s confusion matrix \(\mathbf{C}_i\) (for the evaluation data set) by normalizing each column (predicted class) of the confusion matrix by the sum of that column over the true labels.

    $$ p_{i}\left(T_{l} \mid D_{n}^{i}\right) = \frac{\mathbf{C}_{i}\left(T_{l}, D_{n}^{i}\right)}{\sum_{k=1}^{C} \mathbf{C}_{i}\left(T_{k}, D_{n}^{i}\right)} $$

Thus, a conditional probability vector for a sample n is defined by

\(\mathbf{P}_{\mathcal{H}_i}^{n} = \left[\, p_i(T_1 \mid D_n^i) \;\; p_i(T_2 \mid D_n^i) \;\; \cdots \;\; p_i(T_C \mid D_n^i) \,\right]\).

Each expert has a matrix of conditional probabilities for all evaluation samples.

$$ \mathbf{P}_{\mathcal{H}_i} = \left[ \mathbf{P}_{\mathcal{H}_i}^{1} \ldots \mathbf{P}_{\mathcal{H}_i}^{N} \right]^{T} \in \mathbb{R}^{N \times C} $$
2. For the same \(\boldsymbol{X}^{E}_{n=1,\ldots,N}\), we calculate a classification probability matrix (the output of the softmax layer) for each DL model, e.g., for the j-th model \(\mathcal{M}_j\), \(\mathbf{P}_{\mathcal{M}_j} = \left[ \mathbf{P}_{\mathcal{M}_j}^{1} \ldots \mathbf{P}_{\mathcal{M}_j}^{N} \right]^{T} \in \mathbb{R}^{N \times C}\). We then concatenate all DL models’ and the i-th human rater \(\mathcal{H}_i\)’s probability matrices, i.e., \(\left[ \mathbf{P}_{\mathcal{M}_1} \ldots \mathbf{P}_{\mathcal{M}_M} \; \mathbf{P}_{\mathcal{H}_i} \right] \in \mathbb{R}^{N \times C \times (M+1)}\), to derive the class-wise optimal weights for the human expert.

3. The optimal weights for the M classifiers and the human rater, \(\mathbf{W}^{*} \in \mathbb{R}^{(M+1) \times C}\), are determined to best fit the ground-truth labels \(\mathbf{Y}\) of the evaluation dataset (for each sample n, the n-th row of \(\mathbf{Y}\) is 1 at the sample’s label and 0 elsewhere, so \(\mathbf{Y} \in \mathbb{R}^{N \times C}\)) in terms of the cross-entropy (CE) loss between the weighted sum of diagnoses \(\left(\left[ \mathbf{P}_{\mathcal{M}_1} \ldots \mathbf{P}_{\mathcal{M}_M} \; \mathbf{P}_{\mathcal{H}_i} \right]\mathbf{W} \in \mathbb{R}^{N \times C}\right)\) and \(\mathbf{Y}\).

    $$ \mathbf{W}^{*} = \arg\min_{\mathbf{W}} \; CE\left( \mathbf{Y}, \left[ \mathbf{P}_{\mathcal{M}_1} \ldots \mathbf{P}_{\mathcal{M}_M} \; \mathbf{P}_{\mathcal{H}_i} \right]\mathbf{W} \right) $$
4. Finally, for a new test sample \(\boldsymbol{X}^{t}\), the system chooses the class of maximum probability (argmax) from the weighted probabilities of the DL models and the human rater.

    $$ C_{i}^{*} = \arg\max_{c} \left[ \mathbf{P}_{\mathcal{M}_1}^{t} \ldots \mathbf{P}_{\mathcal{M}_M}^{t} \; \mathbf{P}_{\mathcal{H}_i}^{t} \right] \mathbf{W}^{*} $$

Figure 4 illustrates the current procedure; a minimal code sketch of steps 1–4 is given after the figure caption.

Figure 4

The procedure for the human–machine cooperation model. (a) The evaluation procedure for the human rater’s performance and the real-time application of the cooperation model. (b) The procedure for deriving the conditional probability of each human rater from the confusion matrix of the evaluation data set. The conditional probability matrix derived from each individual’s confusion matrix on the evaluation data set is assigned to each sample according to the human rater’s diagnosis for that sample. (c) The optimal weight is determined by minimizing the cross-entropy (CE) between the labels of the evaluation data set and the weighted sum of the human and machine probability matrices.
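The following is a minimal NumPy sketch of steps 1–4, shown as an illustration under stated assumptions rather than the original analysis code: the cross-entropy minimization in step 3 is carried out here by plain gradient descent on a softmax of the weighted score (the normalization and optimizer are our choices), and all function and variable names are hypothetical.

```python
import numpy as np

def conditional_prob_matrix(confusion, eps=1e-12):
    """Step 1: column-normalize the confusion matrix C_i (rows = true class,
    columns = diagnosed class) to obtain p_i(T_l | D) for each diagnosed class D."""
    return confusion / np.maximum(confusion.sum(axis=0, keepdims=True), eps)

def human_cond_probs(cond_matrix, human_labels):
    """Look up the conditional-probability vector P_Hi^n for each human-diagnosed label."""
    return cond_matrix[:, human_labels].T                       # [N, C]

def fit_weights(prob_stack, y_onehot, lr=0.5, n_iter=2000):
    """Step 3: fit W in R^{(M+1) x C} by gradient descent on the cross-entropy between
    softmax(weighted score) and the one-hot true labels Y (an assumed normalization)."""
    n_samples, n_classes, n_sources = prob_stack.shape          # prob_stack: [N, C, M+1]
    W = np.full((n_sources, n_classes), 1.0 / n_sources)
    for _ in range(n_iter):
        scores = np.einsum("ncm,mc->nc", prob_stack, W)         # S[n,c] = sum_m P_m[n,c] * W[m,c]
        scores = scores - scores.max(axis=1, keepdims=True)
        probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        grad_scores = (probs - y_onehot) / n_samples            # dCE/dS for softmax outputs
        W -= lr * np.einsum("ncm,nc->mc", prob_stack, grad_scores)
    return W

def predict(prob_stack, W):
    """Step 4: final class = argmax of the weighted sum of DL and human probabilities."""
    return np.argmax(np.einsum("ncm,mc->nc", prob_stack, W), axis=1)
```

Here, `prob_stack` stacks the M DL probability matrices and the human conditional-probability matrix along the last axis, corresponding to \(\left[ \mathbf{P}_{\mathcal{M}_1} \ldots \mathbf{P}_{\mathcal{M}_M} \; \mathbf{P}_{\mathcal{H}_i} \right]\).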

3. Evaluations of AI-HI cooperation models in human individuals according to evaluation sample sizes.

To utilize small data samples for the evaluation of each individual and to reduce the number of parameters to estimate, we simply averaged the classification results of the four machine models, as they show a more or less similar pattern of classification. We then combined the classification result of the averaged machine model and one human rater, weighting them with a single scalar ratio variable w rather than a vector.

For a human expert \(\mathcal{H}_i\), the diagnostic decision for a new test sample \(\boldsymbol{X}^{t}\) is denoted as the delta-response vector \(\delta_{\mathcal{H}_i}^{t}\) (the chosen label is assigned one and all other labels zero). The conditional probability \(\mathbf{P}_{\mathcal{H}_i}^{t}\) of the true label given the human-diagnosed label is derived from this delta-response. The weighted sum \(\mathrm{S}_i^t\) of the classification probability of the DL classifiers \(\mathbf{P}_{\mathcal{M}}^{t}\) and the conditional probability of the human decision \(\mathbf{P}_{\mathcal{H}_i}^{t}\) is used to classify new samples. The three cooperation models differ in whether the conditional probability is used and whether the weight between the human’s and the machine’s decisions is optimized based on the individual’s overall accuracy.

$$ \mathrm{S}_{i}^{t} = 0.5\,\mathbf{P}_{\mathcal{M}}^{t} + 0.5\,\mathbf{P}_{\mathcal{H}_i}^{t} \;\;\left(\text{CavgMH}\right) $$
$$ \mathrm{S}_{i}^{t} = \left(1 - w_{i}^{*}\right)\mathbf{P}_{\mathcal{M}}^{t} + w_{i}^{*}\,\delta_{\mathcal{H}_i}^{t} \;\;\left(\text{PoptMH}\right) $$
$$ \mathrm{S}_{i}^{t} = \left(1 - w_{i}^{*}\right)\mathbf{P}_{\mathcal{M}}^{t} + w_{i}^{*}\,\mathbf{P}_{\mathcal{H}_i}^{t} \;\;\left(\text{PCoptMH}\right) $$

From the evaluation data set rated by the human expert \(\mathcal{H}_i\), the optimal human weight \(w_i^{*}\) is estimated for PoptMH and PCoptMH by minimizing the cross-entropy loss between \(\mathrm{S}_i^{E}\), computed on the evaluation data set, and \(\mathbf{Y}^{E}\), the true labels of the evaluation data set.

For all three models, the final label is chosen to maximize \(\mathrm{S}_i^t\) for each individual.

$$ C_{i}^{t*} = \arg\max_{c} \; \mathrm{S}_{i}^{t} $$
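The three combination rules can be sketched as follows. This is a hypothetical NumPy illustration, not the study’s code: `avg_model_probs` is the averaged softmax output of the four DL models, `cond_matrix` is the expert’s column-normalized confusion matrix from the evaluation set, and the scalar weight is found by a simple grid search over the cross-entropy, which is one possible minimization strategy.

```python
import numpy as np

def combine(avg_model_probs, human_labels, cond_matrix=None, w=0.5, n_classes=6):
    """Weighted sum S_i^t of the machine probability and the human response.

    With cond_matrix, the human term is the conditional probability P_Hi (CavgMH / PCoptMH);
    without it, the human term is the delta response (PoptMH)."""
    if cond_matrix is None:
        human_term = np.eye(n_classes)[human_labels]        # delta response
    else:
        human_term = cond_matrix[:, human_labels].T          # conditional probabilities
    return (1.0 - w) * avg_model_probs + w * human_term

def cross_entropy(scores, y_onehot, eps=1e-9):
    p = np.clip(scores, eps, None)
    p = p / p.sum(axis=1, keepdims=True)                     # renormalize the weighted sum
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

def optimal_weight(avg_model_probs, human_labels, y_onehot, cond_matrix=None):
    """Grid-search the scalar human weight w* on the evaluation set (PoptMH / PCoptMH)."""
    grid = np.linspace(0.0, 1.0, 101)
    losses = [cross_entropy(combine(avg_model_probs, human_labels, cond_matrix, w), y_onehot)
              for w in grid]
    return grid[int(np.argmin(losses))]

# CavgMH uses the fixed weight w = 0.5; PoptMH and PCoptMH use w* from optimal_weight().
# final_label = np.argmax(combine(avg_probs_test, human_test, cond_matrix, w_star), axis=1)
```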

To test how many evaluation samples are needed to reliably assess the diagnostic characteristics of each individual, the system evaluates each human expert using a portion of the 300 balanced samples, i.e., 30 (6 samples per class), 60 (12 samples per class), 100 (20 samples per class), or 150 (30 samples per class) evaluation samples. The average accuracies of the five-fold training and tests were used to evaluate the performance of the cooperation models in learning the individual and to choose the minimal number of evaluation samples.
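Schematically, this sample-size experiment can be expressed as below. The sketch uses hypothetical names, draws a stratified probing subset of a given size from the balanced data, builds a cooperation model from it, and averages test accuracy over five repetitions as an approximation of the five-fold scheme; merging the remaining Test 1 images with Test 2 is omitted for brevity.

```python
import numpy as np

def evaluate_sample_sizes(true_labels, human_labels, model_probs, build_model,
                          sizes=(30, 60, 100, 150), n_repeats=5, n_classes=6, seed=0):
    """Average test accuracy of a cooperation model for several probing-set sizes.

    build_model(true, human, probs) must return a function f(human, probs) -> predicted labels."""
    rng = np.random.default_rng(seed)
    results = {}
    for size in sizes:
        accs = []
        for _ in range(n_repeats):
            # stratified draw of roughly size // n_classes probing samples per class
            eval_idx = np.concatenate([
                rng.choice(np.flatnonzero(true_labels == c), size // n_classes, replace=False)
                for c in range(n_classes)])
            test_idx = np.setdiff1d(np.arange(len(true_labels)), eval_idx)
            model = build_model(true_labels[eval_idx], human_labels[eval_idx],
                                model_probs[eval_idx])
            pred = model(human_labels[test_idx], model_probs[test_idx])
            accs.append(np.mean(pred == true_labels[test_idx]))
        results[size] = float(np.mean(accs))
    return results
```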