Introduction

Continuous learning, also known as lifelong learning, is a fundamental idea in machine learning in which a model continuously learns and evolves from an ever-increasing amount of data while retaining previously acquired knowledge1. A model built on this paradigm incrementally learns and autonomously extends its diagnostic capabilities without forgetting the original task. Automated machine learning (AutoML)2 is the latest development in artificial intelligence (AI) and is expected to become the future of AI3. For instance, Google's Cloud AutoML4,5,6 employs this technology, allowing clinicians with limited knowledge of ML to apply such models to their own data sets. Most automated deep learning models developed with Cloud AutoML exhibit diagnostic performance and characteristics comparable to those of the latest deep learning algorithms6. However, the current version allows only a single image to be uploaded for prediction, which limits large-scale external validation and substantially reduces its usability for the systematic evaluation of predictive models7.

Dynamic memory, which retains a small and diverse subset of the data stream in memory, has been used to alleviate catastrophic forgetting in continuous learning for medical imaging8; however, the practical application of this approach is challenging. Class-incremental learning has delivered state-of-the-art performance on challenging continual learning benchmarks such as CIFAR-100 without storing data9; however, this method is still in the exploratory stage. The CLS adopts an automatic model optimization method to monitor the diagnostic performance of the model in real time, effectively evaluate its quality, supervise the continuous learning process, and optimize the model without overfitting on the small data sets available in the initial stage of the CLS. The CLS constructs its data sets from the historical image data of users' organizations; it labels data according to pathological results to ensure the accuracy of training labels, facilitates data verification and quality control, and labels images as benign or malignant as well as by pathological type and disease. The CLS also integrates three models (covering benign and malignant tumors, pathological types, and disease diagnoses) through a model integration method, thereby providing the three kinds of diagnostic results physicians require to derive accurate diagnoses. With the continuous arrival of new data, the CLS learns from more cases and disease types, improving its diagnostic capability and increasing the number of diseases it can diagnose. The CLS can also be applied to AI-based diagnosis of ultrasound images of thyroid masses, the liver, the kidney, and other superficial body parts.

Existing AI trains diagnostic models on large amounts of image data and mainly classifies images into benign and malignant categories10,11,12. Various AI tools classify images into breast imaging reporting and data system (BI-RADS) categories13,14, distinguish phyllodes tumor from fibroadenoma15, and even mastitis from adenosis16. AI diagnoses have reached or exceeded the diagnostic abilities of medical experts17 but are seldom applied in practical work18,19, mainly owing to a lack of trust in the results of AI diagnoses. Many clinical images are enhanced, cropped, transformed, or otherwise modified to obtain ideal experimental results20; thus, the larger the training data set, the more difficult it is to control data quality. Some data are labeled according to the experience of experts21,22; however, even experienced experts cannot achieve complete accuracy relative to pathological diagnoses, which further reduces users' trust in AI. With continued ignorance of these issues, we risk missing perspectives that could shape profound solutions to the challenges of the next decade23.

Developments in big data and AI are transforming medicine; however, public health systems have been slow to fully embrace their potential24. Data availability and quality are limiting factors, and the expansion of digital technologies and data collection also presents a range of ethical and governance concerns24. Continuous learning for medical AI diagnosis is still in its infancy, and models for ultrasound diagnostics using continuous learning have not yet been reported. Nevertheless, it is considered an ideal learning method with great potential in medical practice, as it is akin to the learning method of human clinicians25. A continuous learning model can gradually learn from errors and adjust its performance as data increase; however, many challenges remain in its clinical application, the most crucial being that new data may interfere with the knowledge the model has already attained, resulting in a sudden decline in performance26,27. These models are sensitive to environmental changes and liable to performance decay. Even after successful integration into clinical practice, ML/AI algorithms should be continuously monitored and updated to ensure their long-term safety and effectiveness28. Continuous monitoring algorithms are not "impossible," but they are difficult to construct; the quality and quantity of incoming data are uncertain, altering the diagnostic capabilities of the models obtained after training. Therefore, it is crucial to continuously monitor the diagnostic capabilities of the model and compare them with those of the previous model. These are all factors to consider when developing a continuous learning model. Furthermore, such models must incorporate clinical data from many patients to use AI in assessing health outcomes, which may raise patient privacy issues. Assessing the quality of these models is currently impossible25, and the regulatory challenges and risks of using AI in real-time medicine are substantial25. Currently, no medical device based on AI or ML continuous learning has been approved by the US Food and Drug Administration29; however, such devices are expected to be approved in the near future30. Medical devices can be updated based on new data, including personalization and the elimination of errors; however, data accuracy and optimal device performance must be ensured31.

The CLS was tested with seven independent data sets across three test data sets and compared with 21 physicians. We used nine performance indices and the same criteria to comprehensively evaluate and rank the diagnostic ability of the CLS and the participating physicians. These indices were sensitivity, specificity, area under the curve (AUC), diagnostic accuracy of pathological type (DAPT), accuracy of pathological type identification (APTI), missed diagnosis rate of pathological type (MDRPT), diagnostic accuracy of pathological diseases (DAPD), accuracy of differentiating pathological diseases (ADPD), and missed diagnosis rate of pathological diseases (MDRPD). The CLS adopts an optimal ensemble method to effectively address the problem of supervising a continuous learning model. In this project, we employed the CLS to evaluate ultrasound images of breast masses. We believe this approach will be valuable in gaining physicians' trust in the technology and ensuring accurate tumor diagnoses.

Results

Training and testing

Details of the cases used for training and testing and the pathological distribution of diseases are provided in Table 1. The experimental dataset (EDS) was used to test 13 algorithms and select the five with the highest AUC values (Supplementary Table 1a): ResNet50, DenseNet121, InceptionV3, InceptionResNetV2, and Xception; the AUC did not significantly differ among the five algorithms (p = 0.09–0.88, Supplementary Table 1b). For each algorithm, the AUC values of the three models (benign and malignant tumors, pathological types, and pathological diseases) were summed (Supplementary Table 1c); InceptionResNetV2 had the highest sum (2.161) and was therefore chosen as the algorithm for developing the CLS in this study. The five algorithms were also used to construct and test models with cropped and uncropped image data sets; the AUC of the uncropped image model was higher than that of the cropped image model for four algorithms, three of which had p < 0.05. Therefore, images were not cropped in this study (Supplementary Table 1d).
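
As a minimal sketch of this selection step (using synthetic stand-in arrays rather than the EDS, and training settings chosen only for illustration), the five candidate Keras architectures could be ranked by test AUC as follows:

import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score

CANDIDATES = {
    "ResNet50": tf.keras.applications.ResNet50,
    "DenseNet121": tf.keras.applications.DenseNet121,
    "InceptionV3": tf.keras.applications.InceptionV3,
    "InceptionResNetV2": tf.keras.applications.InceptionResNetV2,
    "Xception": tf.keras.applications.Xception,
}

def build(backbone_fn):
    # Freeze an ImageNet backbone and add a binary (benign/malignant) head.
    base = backbone_fn(weights="imagenet", include_top=False,
                       input_shape=(299, 299, 3))
    base.trainable = False
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Synthetic stand-in arrays; the real EDS images and labels would go here.
x_train = np.random.rand(16, 299, 299, 3).astype("float32")
y_train = np.random.randint(0, 2, 16)
x_test = np.random.rand(8, 299, 299, 3).astype("float32")
y_test = np.array([0, 1] * 4)

aucs = {}
for name, fn in CANDIDATES.items():
    model = build(fn)
    model.fit(x_train, y_train, epochs=1, batch_size=4, verbose=0)
    aucs[name] = roc_auc_score(y_test, model.predict(x_test).ravel())
print(sorted(aucs.items(), key=lambda kv: -kv[1]))  # rank by AUC, highest first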

Table 1 Baseline characteristics of datasets.

The data were divided into six stages (Table 2, Supplementary Table 2) containing 83, 81, 85, 84, 81, and 85 cases, respectively. Because the number of benign cases exceeded that of malignant cases, benign cases were randomly selected to balance the data. The redundant (excluded) benign data in the six stages numbered 39, 35, 45, 30, 28, and 39 cases, and the data actually used for training the model comprised 44, 46, 40, 54, 53, and 46 cases, respectively.
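
A minimal sketch of this balancing step, with toy case identifiers standing in for the real data (the counts mirror the first stage only for illustration):

import random

def balance_stage(benign_ids, malignant_ids, seed=0):
    """Return (benign cases kept, redundant benign cases) for one stage."""
    rng = random.Random(seed)
    k = min(len(benign_ids), len(malignant_ids))
    kept = rng.sample(benign_ids, k)
    kept_set = set(kept)
    redundant = [c for c in benign_ids if c not in kept_set]
    return kept, redundant

# Toy numbers only: 61 benign vs 22 malignant cases in one stage.
benign = [f"B{i:03d}" for i in range(61)]
malignant = [f"M{i:03d}" for i in range(22)]
kept, redundant = balance_stage(benign, malignant)
print(len(kept), len(redundant))  # 22 benign kept for training, 39 redundant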

Table 2 Disease distribution and number included in six training sessions.

The organization internal test dataset (OITDS) was tested with the optimization model (OM) and non-optimal model (NOM) obtained from the six stages of CLS training, and the test results were scored (Table 3). During model training, the saved models were tested: the model with the highest AUC value was taken as the OM, whereas the model at the end of training, whose accuracy no longer increased, was taken as the NOM. The OM score increased from its lowest value in the second stage to its highest (78.79 points) in the sixth stage (Supplementary Table 3b), although the increase was not directly proportional to the increase in training data and images. The OM score in the second stage was 70.87 points, marginally lower than that in the first stage (71.1 points), even though the second stage used 90 training cases and 506 images compared with only 44 cases and 245 images in the first stage (Supplementary Table 2). The OITDS test results indicated that the average score of the OM was higher than that of the NOM across the nine performance indices, and the scores over the six stages steadily improved.

Table 3 The OM obtained from six stages of training was tested and evaluated on three datasets.

The OM and NOM were also used to test the external test dataset (ETDS), and the results were scored (Table 3, Supplementary Table 3c). The average score of the OM was slightly higher than that of the NOM, and the scores over the six stages steadily improved. As no additional data were available after the sixth training stage, an add-test dataset (ATDS) was tested on the OM and NOM obtained from the first five training stages; the ATDS comprised the data of the second to sixth stages (81, 85, 84, 81, and 85 cases, respectively), and the test results were scored (Table 3, Supplementary Table 3d). The average score of the OM was slightly higher than that of the NOM, and the scores of the third stage were the highest. The CLS therefore exhibited stable diagnostic performance.

Evaluation and comparison with physicians

Twenty-one physicians participated in the test (details of their experience and comprehensive evaluation results are in Supplementary Table 4a, b). The correlation coefficient between working years and total score was −0.33, the correlation coefficients between the nine indices and working years ranged from −0.49 to −0.1, and the comprehensive diagnostic scores of primary physicians were slightly higher than those of intermediate and senior physicians. The CLS was evaluated using the OITDS test results shown in Supplementary Table 3a. The comprehensive evaluation of the six CLS training stages showed a low correlation of sensitivity and specificity with the training stage, whereas the correlation coefficients of the other seven indices with the training stage ranged from 0.64 to 0.86 (Supplementary Table 3e). The CLS diagnostic score therefore correlated well with the training stage, except that specificity decreased by 11.86%. The mean values of the other eight CLS indices were higher than those of the physicians.
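
For reference, the correlation statistic used here can be computed as below; the numbers are illustrative placeholders, not the study data:

from scipy.stats import pearsonr

working_years = [2, 3, 5, 8, 12, 18, 25]      # hypothetical physicians
total_scores = [78, 76, 74, 71, 69, 67, 65]   # hypothetical total scores
r, p = pearsonr(working_years, total_scores)
print(f"r = {r:.2f}, p = {p:.3f}")  # a negative r echoes the reported -0.33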

CLS and physician diagnosis total score ranking

We compared the total score ranking of the participating physicians and the CLS (Table 4). The physicians were tested with the OITDS, and the CLS was evaluated with the OITDS results of the OM. Compared with the 21 physicians across the six stages, the CLS had its lowest score in stage 2, ranking tenth; however, this score still exceeded those of 17 (81%) physicians. The CLS ranked ninth in stage 1, sixth in stage 3, and second in stage 6, outscoring 20 (95%) physicians. As the learning stages progressed, the CLS improved from tenth to second place and ranked in the top five in stages 4–6 (Supplementary Fig. 3b). We utilized three test data sets (OITDS, ETDS, and ATDS); the ATDS comprised five data sets, so seven independent data sets were used for testing in total. The CLS attained high and stable diagnostic scores despite training on a small amount of data (Supplementary Fig. 1a, b). The OM score was higher than the NOM score, showing that the model optimization method achieved supervision of CLS diagnoses. Through the model integration method, the CLS could output three results simultaneously: benign and malignant tumors, pathological types, and pathological disease diagnoses (identifying the disease to which the mass belongs, matching the result of the pathological diagnosis). Furthermore, the nine indices allowed the diagnostic abilities of the physicians and the CLS to be evaluated effectively and transparently.

Table 4 The total score ranking of participating physicians and CLS comparison evaluation.

Discussion

An open and transparent standard for comparing diagnostic ability is essential for generating clinicians' trust in AI diagnosis. In this comparative study, we used nine performance indices for a comprehensive evaluation; the diagnostic ability score of the CLS exceeded that of 17 of the 21 participating physicians in the first stage and 20 in the sixth stage. Because the evaluation criteria are the same and the evaluation process is open and transparent, physicians' diagnostic levels can continuously improve, and physicians can test and verify their diagnostic ability at any time. We found no positive relationship between clinicians' diagnostic ability scores and their working years (Supplementary Table 4b): the correlation coefficient between working years and total score was −0.33 (Supplementary Fig. 3a), and the correlation coefficients between working years and the nine indices were all below 0, unlike the findings of Yang et al.32. Younger physicians obtained higher diagnostic scores, which may be related to our evaluation indices, including the pathological diagnosis of the mass. In the study data, breast adenosis accounted for 41% of benign masses, breast fibroadenoma for 40%, and breast infiltrating ductal carcinoma for 82% of malignant masses. These three types offer primary, intermediate, and senior clinicians more opportunities to learn, and tracking pathological results is essential to gaining practical diagnostic experience. This work was mainly completed by primary physicians, who may thus have had the opportunity to gain enhanced diagnostic experience. Similarly, AI may have advantages in the pathological diagnosis of tumors owing to its ability to learn continuously. The CLS constantly learns and improves, and if its ability surpasses that of physicians, it could increase physicians' confidence in AI.

We found that different algorithms had different diagnostic capabilities given the same data and training methods. Among the 13 candidate algorithms tested with the same data, InceptionResNetV2 had the highest AUC value and MobileNetV2 the lowest. He et al.33 also used the InceptionResNetV2 algorithm to achieve good results in the auxiliary diagnosis of breast cancer, finding it superior to the ResNeXt-101 and SENet-101 algorithms. We plan to compare the diagnostic performance of the five algorithms under different amounts of data and to test new algorithms in the future; if a better algorithm is identified, it could be adopted for CLS diagnoses.

Image processing may improve the diagnostic ability of AI under experimental conditions; however, the process is variable and leads to differing results. The findings of this study suggested that diagnoses were better without cropping images, similar to the results obtained by Golse34 using whole images. Processing images with CAD tools35 did not improve the diagnostic performance of radiologists. The experimental results of the three data sets in the present study were similar, indicating that the CLS achieves stable diagnostic performance without image processing. In theory, the peripheral part of the image should not affect the diagnosis, as AI cannot extract the characteristics of the mass from the periphery; however, cropping or otherwise processing the image may change the characteristics of the mass, and these changes may differ across processing methods if AI learns from processed images. Therefore, images provided for AI diagnosis must all be processed in the same way.

Overfitting is an issue that must be addressed during model training, especially with small data sets. Data augmentation36 is usually used to generate more training data to reduce overfitting. In this study, an ImageNet pre-trained model and transfer learning were used, given the small amount of data in the initial stage of the CLS. To judge whether the final model, whose accuracy no longer improved, was overfitted, the OM was selected as the model with the highest AUC value after training. The OM was not the model with the largest number of training rounds, and its diagnostic scores were higher than those of the NOM in the first four stages. The NOM's diagnostic performance was better in the fifth and sixth stages, where the larger amounts of data and images made overfitting less likely; the ETDS and ATDS tests yielded similar results.
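
A minimal sketch of this setup, assuming the tf.keras API, a frozen ImageNet backbone, and illustrative augmentation ranges (the real training configuration is not specified at this level of detail in the text):

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

base = tf.keras.applications.InceptionResNetV2(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False  # reuse ImageNet features while data are scarce

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
x = tf.keras.layers.Dropout(0.5)(x)  # extra guard against overfitting
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(base.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Augmentation synthesizes varied views of the few available images.
augmenter = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                               height_shift_range=0.1, zoom_range=0.1,
                               horizontal_flip=True)
x_train = np.random.rand(16, 299, 299, 3).astype("float32")  # stand-in data
y_train = np.random.randint(0, 2, 16)
model.fit(augmenter.flow(x_train, y_train, batch_size=4), epochs=1, verbose=0)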

Furthermore, across the six stages and three data sets, the average OM evaluation score was higher than that of the NOM. It is therefore necessary to adopt the optimization method to supervise the CLS, which can be further improved through automatic comparison and selection of models and comparison with the previous stage's model. If diagnostic performance improves, the new model can be adopted, guaranteeing that the diagnostic performance of the CLS remains stable or improves but does not decline. A study by Zhou et al.37 suggests an imperative need for research on the safety of medical AI models; the use of the optimization method in this study thus ensured the safe operation of the model.

This project found that the amount of training data is not proportional to the diagnostic ability of the model. The CLS used a small amount of data to train the model and produced good diagnostic results; only 44 cases were used in the first stage, including 22 benign and 22 malignant cases with 120 and 125 training images, respectively. Verification of the model with three data sets showed that a model trained with limited data can still produce good diagnostic results. Faes et al.7 believed such a small-dataset method could be tailored to specific patient groups (e.g., based on geography) and could be valuable once automated deep learning finds its place in the medical field. In this study, the CLS started with a small dataset; with continuous learning, the amount of data will continue to increase, making it possible to study the diagnostic ability of the CLS under different amounts of data.

Furthermore, through multi-center research and the merging of multi-center data, we could compare the impact of different amounts of data on the diagnostic performance of the CLS, especially with large amounts of data. This approach is similar to Swarm learning38 and comparable to Nightingale Open Science's approach to solving medicine's data bottleneck39. The continuous learning process in the CLS can also be a continuous research process, because the CLS automatically outputs experimental results and test data sets, significantly reducing the workload of researchers. Henry40 reported that rather than viewing a system as a surrogate for their clinical judgment, clinicians perceived themselves as partnering with the technology; clinicians can thus learn to trust an ML system through experience, expert endorsement and validation, and systems designed to accommodate their autonomy and support their entire workflow. Lehne41 argues that interoperability is a prerequisite for the digital innovations envisioned for future medicine, and multi-agency data sharing and exchange enables data interoperability.

Model integration is an important method for realizing the pathological diagnosis of a mass; it fuses the models of different diagnostic tasks to complete the overall diagnostic task, much like the automatic breast segmentation diagnosis technology that uses dual deep learning. Through model integration, the CLS can output multiple results, including whether the tumor is benign or malignant and its pathological type, ranked by probability. The diagnoses of malignancy, pathological type, and pathological disease do not depend on one another; this can lead to inconsistent results requiring manual judgment by physicians, which is in line with physicians' typical diagnostic approach. In addition to the primary diagnosis, physicians need to consider other potential diagnoses; when many diseases must be considered, the CLS outputs the most likely diagnoses for the physicians' reference, which is expected to play a vital role in helping physicians identify diseases. The pathological diagnosis of a mass is the precise diagnosis that can help the patient choose the best treatment plan. Ultimately, most experts believe artificial and human intelligence will work synergistically42; the CLS exemplifies this collaboration.
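
A minimal sketch of how such an integration wrapper might look; the model paths, class lists, and interface are assumptions for illustration, not the CLS implementation:

import numpy as np
import tensorflow as tf

class IntegratedDiagnosis:
    """Wraps the three task models so one call yields all three diagnoses."""

    def __init__(self, bm_path, pt_path, pd_path, pt_classes, pd_classes):
        self.bm = tf.keras.models.load_model(bm_path)   # benign/malignant
        self.pt = tf.keras.models.load_model(pt_path)   # pathological type
        self.pd = tf.keras.models.load_model(pd_path)   # pathological disease
        self.pt_classes, self.pd_classes = pt_classes, pd_classes

    def diagnose(self, image, top_k=3):
        batch = image[np.newaxis, ...]
        result = {"malignant_probability": float(self.bm.predict(batch)[0, 0])}
        for key, model, names in (
                ("pathological_types", self.pt, self.pt_classes),
                ("pathological_diseases", self.pd, self.pd_classes)):
            probs = model.predict(batch)[0]
            ranked = sorted(zip(names, probs.tolist()), key=lambda kv: -kv[1])
            result[key] = ranked[:top_k]  # most likely diagnoses first
        return result

Because the three heads are queried independently, their outputs can disagree, which is exactly the situation the text describes as requiring a physician's judgment.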

In the initial stage, the CLS covers relatively few types of diseases, and the data and images for each disease are also few, making these diseases difficult to diagnose. In the first stage of this project, there were six pathological types and four diseases. The inconsistency between the two numbers arises because, when the CLS constructs a dataset, the cumulative number of images for a type or disease must reach ten or more to be included; the included images are then randomly split into training and verification sets at a ratio of 8:2 (as sketched below), and the number of cases and diseases learned by the CLS increases with continuous learning. From the second to the sixth stage, the number of pathological types rose to seven, whereas the number of pathological diseases rose to nine in the second stage and 15 in the sixth stage (Supplementary Table 2). The CLS currently diagnoses 41 diseases (Supplementary Table 6), but the number of diseases is not limited; a new disease can be added through the settings of the CLS. Oren et al.43 believe that the evaluation of results in existing AI imaging studies is usually carried out by lesion detection, ignoring the type and biological aggressiveness of the lesions; using clinically meaningful outcomes such as survival rate, symptoms, and treatment is essential to improve AI imaging studies and apply them effectively in clinical practice. In this project, benign and malignant tumor, pathological type, and disease diagnoses were obtained by the multi-model integration method, which is expected to allow evaluation of patients' survival rates according to the pathological types and disease results of tumors, thereby providing the best plan for clinical treatment.
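
A minimal sketch of this inclusion rule and split, with hypothetical class and file names:

import random

MIN_IMAGES = 10  # a type/disease joins the dataset at ten or more images

def build_class_splits(images_by_class, seed=0):
    rng = random.Random(seed)
    splits = {}
    for label, images in images_by_class.items():
        if len(images) < MIN_IMAGES:
            continue  # too few images; wait for more data to accumulate
        shuffled = images[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * 0.8)  # 8:2 training/verification split
        splits[label] = {"train": shuffled[:cut], "val": shuffled[cut:]}
    return splits

demo = {"fibroadenoma": [f"img{i}.jpg" for i in range(25)],
        "rare_lesion": ["a.jpg", "b.jpg"]}
print(list(build_class_splits(demo)))  # only 'fibroadenoma' qualifies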

The CLS is also integrated with the ultrasound picture archiving and communication system (US_PACS). This integration has many advantages: US_PACS automatically provides content to the CLS through the parametric design of the report content, and the CLS can perform BI-RADS classification of the mass. During report writing, the physician only needs to select the left or right side of the currently displayed image and then select AI-assisted diagnosis (Supplementary Fig. 5). The user can select any number of images, which are automatically provided to the CLS; the CLS returns the diagnostic results to US_PACS, avoiding disclosure of patient information and ruling out the influence of sex, age, race, equipment, or physicians' image acquisition habits on the data. Celi et al.44 believe that adopting AI can enable intelligent integration of AI design and clinical workflow by providing seamless, effective, and unbiased assistance to patients and physicians; this process requires medical expertise as well as time-consuming input from experts and researchers in the medical field, and in this way AI can also work under the supervision of clinicians. According to Young et al.45, patients and the public express positive attitudes toward AI but prefer human supervision. Through integration with US_PACS, real data can be provided to the CLS; with continuous learning, the diagnostic performance of the CLS is expected to continue improving, thereby increasing the accuracy of tumor diagnoses.

This study had some limitations. Regarding data quantity, only 629 breast masses were used to construct the dataset, and the CLS was trained over only six stages. Physicians used only one dataset for testing; thus, more data and prospective studies are required to verify the current results. Furthermore, the diagnostic performance with uncropped images was better than with cropped images, but this requires multi-center verification. In addition, data and image balancing was performed only for benign and malignant cases in this project. Finally, the interpretability of AI46 could increase clinicians' understanding of the results and reduce the risks of using AI, which requires further study.

This project took the study of ultrasound breast masses as an example and is a critical step toward the clinical application of AI; however, this is only the beginning of obtaining precise diagnoses through AI, and further research beyond the diagnosis of breast masses remains to be completed. Through optimization, integration, and constant improvement in diagnostic accuracy, this method could be applied to masses in other organs, such as the liver, kidney, thyroid, and other superficial body parts. The application of this method has potential value in improving precision medicine.

Methods

Data collection and dataset construction

We obtained 561 cases of breast masses with pathological results from 1 January 2015 to 31 December 2020 (Table 1, Fig. 1). Because 68 cases had bilateral breast masses, there were 629 breast masses and 2235 images, from which 130 breast masses and 686 images were randomly selected as the OITDS; the remaining data and images were used for training. Data from 3098 cases were collected from other institutions, from which 180 cases and 793 images were randomly selected as the ETDS from two tertiary hospitals in Hunan Province, namely Liuyang People's Hospital (2591 cases; 151 randomly selected) and Huaihua First Hospital (507 cases; 29 randomly selected). The Cancer Hospital Affiliated with Xiangya Medical College of Central South University, the Second Xiangya Hospital of Central South University, and the First Affiliated Hospital of Hunan University of Traditional Chinese Medicine participated in the project and in multi-center testing.

Fig. 1: Data construction flowchart.
figure 1

OITDS organization internal test dataset; ETDS external test dataset; ATDS add test dataset.

The images in this project were obtained in JPEG format from the video output port of the ultrasound instrument through the US_PACS video capture card; if an image was output in digital imaging and communications in medicine (DICOM) format, it was converted to JPEG. We collected 965 benign and malignant tumor images from the data in our institution; of these, 800 were randomly selected for training and 165 for testing to construct the EDS. To construct a benign and malignant diagnostic test dataset (BMTDS), 200 images each of benign and malignant masses were selected from 130 cases. To construct a pathological type diagnostic test dataset (PTTDS), 200 images of infiltrative non-specific cancer were selected as the positive class according to the pathological type diagnosis, and 200 images were randomly selected from other pathological types as the negative class. To construct the pathological disease diagnostic test dataset (PDTDS), 200 images of breast infiltrating ductal carcinoma were selected as the positive class according to the pathological disease diagnosis, and 200 images of other pathological diseases were randomly selected as the negative class.

The 499 breast masses and 1549 images were divided into six training data sets (Table 2) in the order of patient examination and were used to train the model in stages; the data of all previous stages were accumulated in each later stage, and cases without pathological results or ultrasound images were excluded. Based on the pathological results, if only one side had a mass, whether single or multiple, it was considered a breast mass; if one side had both benign and malignant masses, only the malignant mass was selected; if one side had multiple types of malignant mass and the degree of malignancy could be judged from the pathological diagnosis, the mass with the highest degree of malignancy was selected; if one side had multiple types of benign mass, the largest mass was selected. Data from 180 cases outside the institution were collected; if there was a mass on one side of the breast, that side was chosen, and if there were masses on both sides, one side was chosen randomly. This study was approved by the Ethics Committee of The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China (approval number: R201949); informed consent was waived. Statistical analyses used MedCalc Statistical Software version 20.014 (MedCalc Software Ltd., Ostend, Belgium; https://www.medcalc.org; 2021): its ROC curve analysis was used to calculate 95% confidence intervals and significance levels (p), and its comparison-of-two-rates test was used to calculate p-values for rate ratios. The CLS development technology for this project was provided by Guangzhou Yirui Zhiying Technology Co. Ltd. (Guangzhou, China).

Image clipping experiment

The EDS and BMTDS were each constructed with and without image cropping (cutting off the text around the ultrasound image). Five algorithms were used to develop models, which were tested on the BMTDS, and the AUC values obtained with the two approaches were compared (Supplementary Table 1d). Based on these experimental results, we decided whether to crop images during model training.

Experimental conditions for CLS development

A basic PC with the following specifications was used: CPU, Intel(R) Core(TM) i7-6700 3.40 GHz (Intel, Santa Clara, CA, USA); memory, 8 GB; system type, 64-bit operating system with an x64-based processor, running Windows 10 Professional Edition (Microsoft, Redmond, WA, USA). Other components included a GPU (NVIDIA Quadro P4000, 8 GB video memory; NVIDIA, Santa Clara, CA, USA) and software comprising Python 3.7.6, tensorflow-gpu 1.13.1 (Google, Mountain View, CA, USA), scikit-learn 0.21.0, and Keras 2.2.4.

CLS simulation prospective study

We divided the 499 training cases and 1549 images into six stages of data while maintaining a relative balance between benign and malignant data (Supplementary Table 2). Each stage was based on the malignant mass images, with benign cases randomly extracted to approximately match the malignant cases in number of cases and images. Python was used to develop the CLS and integrate it with US_PACS (Fig. 2). The CLS comprises three parts: CLS_A, CLS_B, and CLS_C. A clinician provides data and images to CLS_A through US_PACS; when CLS_A receives images with pathological results, the images are automatically classified and datasets are constructed, namely the benign and malignant set (BMS), pathological type set (PTS), and pathological disease set (PDS). The classification is based on the pathological type and disease classifications (Supplementary Table 6).
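
A minimal sketch of this routing step; the record fields and set layout are assumptions for illustration, not the CLS code:

def route_to_datasets(record, bms, pts, pds):
    """record: dict with 'image', 'malignant' (bool), 'path_type', 'disease'."""
    label = "malignant" if record["malignant"] else "benign"
    bms.setdefault(label, []).append(record["image"])                 # BMS
    pts.setdefault(record["path_type"], []).append(record["image"])   # PTS
    pds.setdefault(record["disease"], []).append(record["image"])     # PDS

bms, pts, pds = {}, {}, {}
route_to_datasets({"image": "case001_1.jpg", "malignant": True,
                   "path_type": "infiltrative non-specific cancer",
                   "disease": "breast infiltrating ductal carcinoma"},
                  bms, pts, pds)
print(sorted(bms), sorted(pts), sorted(pds))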

Fig. 2: CLS training and diagnosis flowchart.
figure 2

CLS continuous learning system, US_PACS ultrasound picture archiving and communication system, BMS benign and malignant set, PTS pathological type set, PDS pathological disease set, BMTDS benign and malignant diagnostic test dataset, PTTDS pathological type diagnostic test dataset, PDTDS pathological disease diagnosis test dataset, BM_OM benign and malignant diagnostic optimization model, BM_NOM benign and malignant diagnostic non-optimal model, PT_OM pathological type diagnostic optimization model, PT_NOM pathological type diagnostic non-optimal model, PD_OM pathological disease diagnosis optimization model, PD_NOM pathological disease diagnosis non-optimal model.

An ImageNet pre-trained model and transfer learning were used in each training session. When the number of benign and malignant tumor images reached the preset number (initially 125 images), CLS_B automatically started training the model. After training of the benign and malignant diagnosis models, the testing module started automatically, and the last eight models (the maximum number permitted by the computer hardware) were selected from the models stored during training and tested on the BMTDS. The model with the highest AUC value was automatically selected as the benign and malignant diagnostic optimization model (BM_OM), output, and saved as an original record for use by the experimenter. The model at the end of training, whose accuracy no longer increased, was taken as the benign and malignant diagnostic non-optimal model (BM_NOM). This can be seen in the receiver operating characteristic (ROC) curves of the sixth stage (Supplementary Fig. 4d): the highest AUC was 0.845 (BM_OM) at round 34, versus 0.826 (BM_NOM) in the last round (round 43).
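
A minimal sketch of this checkpoint selection, assuming Keras .h5 checkpoints saved in a directory; the paths and test arrays are hypothetical:

import glob
import tensorflow as tf
from sklearn.metrics import roc_auc_score

def select_om_nom(checkpoint_dir, x_test, y_test, keep_last=8):
    """Return (OM path, its AUC, NOM path) from the last saved checkpoints."""
    paths = sorted(glob.glob(f"{checkpoint_dir}/*.h5"))[-keep_last:]
    best_auc, om_path = -1.0, None
    for path in paths:
        model = tf.keras.models.load_model(path)
        auc = roc_auc_score(y_test, model.predict(x_test).ravel())
        if auc > best_auc:
            best_auc, om_path = auc, path
    return om_path, best_auc, paths[-1]  # NOM: the end-of-training model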

After CLS_B finished training the pathological type models, the test module started automatically; the last eight models stored during training were tested on the PTTDS. The model with the highest AUC value was automatically selected as the pathological type diagnostic optimization model (PT_OM), and the model at the end of training, whose accuracy no longer increased, was used as the pathological type diagnostic non-optimal model (PT_NOM). In the ROC curves of the sixth stage (Supplementary Fig. 4e), the AUC was highest at training round 83, reaching 0.869 (PT_OM), and was 0.868 at round 84 (PT_NOM).

After CLS_B finished training the pathological disease model, the test module started automatically. The last eight models stored during training were tested on the PDTDS, and the model with the highest AUC value was automatically selected as the pathological disease diagnosis optimization model (PD_OM); the model at the end of training without further increase in accuracy was taken as the pathological disease diagnosis non-optimal model (PD_NOM). In the ROC curves of the sixth stage (Supplementary Fig. 4f), the AUC was highest at training round 71 (0.825; PD_OM) and was 0.80 at round 86 (PD_NOM).

After training at each stage with CLS_B, the test module started automatically. BM_OM, PT_OM, and PD_OM were applied to the OITDS, and the results were output (Supplementary Fig. 4a). BM_NOM, PT_NOM, and PD_NOM were manually selected in each stage to test the OITDS; the results were output, and the results of the two methods were compared (Supplementary Table 5a, Supplementary Fig. 2).

When CLS_A received images without pathological results, the images were directly transmitted to CLS_C. BM_OM was selected to perform benign and malignant diagnosis of the mass, and the results were returned to US_PACS. The image with the highest malignant probability was selected from the provided images for pathological type and disease diagnosis: PT_OM was first applied for pathological type diagnosis, then PD_OM for pathological disease diagnosis, and the results were returned to US_PACS once the diagnosis was complete. Since diagnosis and learning were performed by different modules (on the same or different servers), CLS_B did not affect the diagnoses of CLS_C while learning.
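
A minimal sketch of this diagnosis flow, assuming the three OMs behave as Keras models with a predict method; the function name and output layout are illustrative:

import numpy as np

def diagnose_exam(images, bm_om, pt_om, pd_om):
    """Score all images with BM_OM, then run type/disease on the top image."""
    batch = np.stack(images).astype("float32")
    malignant_probs = bm_om.predict(batch).ravel()
    top = int(np.argmax(malignant_probs))  # image with highest malignancy
    top_batch = batch[top:top + 1]
    return {
        "per_image_malignant_prob": malignant_probs.tolist(),
        "pathological_type_probs": pt_om.predict(top_batch)[0].tolist(),
        "pathological_disease_probs": pd_om.predict(top_batch)[0].tolist(),
    }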

CLS diagnostics performance test

In addition to the automatic testing of the OITDS after CLS training (Supplementary Table 5a, Supplementary Fig. 4a), this project tested the ETDS on OM and NOM obtained in six stages of training and compared the output results (Supplementary Table 5b, Supplementary Fig. 4b). This project tested an ATDS on the OM and NOM obtained from the first five training stages, and the output results were compared (Supplementary Table 5c, Supplementary Fig. 4c).

Comparison of the diagnostic performance of the CLS with test physicians

The same evaluation standard (Supplementary Table 7) was used to compare the scores of the CLS and the diagnostic ability of the 21 physicians who participated in breast ultrasound diagnosis (including two community hospital physicians who studied in our institution). The maximum total score across the nine indices was 100: the AUC was worth 20 points and each of the other indices 10 points, and sensitivity and specificity were selected according to the Youden index of the ROC curve. Three diagnoses could be proposed for each pathological type and were evaluated with three indices; APTI indicated a correct diagnosis among the three, scored according to its rank: first rank, 3 points; second rank, 2 points; third rank, 1 point; absent from the three diagnoses, 0 points. The calculation of the various indices is explained in Supplementary Table 7. The OITDS, ETDS, and ATDS diagnostic results (Supplementary Table 3b–d) obtained with the OM and NOM were evaluated according to these criteria; the physicians' test results were also evaluated (Supplementary Fig. 3b), and the CLS and physician diagnostic scores were ranked (Table 4).
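
A minimal sketch of this scoring scheme; how raw index values map to points is our assumption (Supplementary Table 7 defines the actual rules), and missed-diagnosis rates are assumed to score higher when lower:

def total_score(indices):
    """indices: the nine index values, each scaled to [0, 1] (assumption)."""
    score = indices["AUC"] * 20  # AUC is worth up to 20 points
    for name in ("sensitivity", "specificity", "DAPT", "APTI", "DAPD", "ADPD"):
        score += indices[name] * 10  # each worth up to 10 points
    for name in ("MDRPT", "MDRPD"):
        score += (1 - indices[name]) * 10  # lower missed-diagnosis rate scores higher
    return score

def apti_points(rank):
    """Points for a correct diagnosis by its rank among the three guesses."""
    return {1: 3, 2: 2, 3: 1}.get(rank, 0)

print(total_score({"AUC": 0.85, "sensitivity": 0.8, "specificity": 0.75,
                   "DAPT": 0.7, "APTI": 0.6, "MDRPT": 0.2,
                   "DAPD": 0.65, "ADPD": 0.6, "MDRPD": 0.25}))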

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.