Machine intelligence in non-invasive endocrine cancer diagnostics

Artificial intelligence (AI) has illuminated a clear path towards an evolving health-care system replete with enhanced precision and computing capabilities. Medical imaging analysis can be strengthened by machine learning as the multidimensional data generated by imaging naturally lends itself to hierarchical classification. In this Review, we describe the role of machine intelligence in image-based endocrine cancer diagnostics. We first provide a brief overview of AI and consider its intuitive incorporation into the clinical workflow. We then discuss how AI can be applied for the characterization of adrenal, pancreatic, pituitary and thyroid masses in order to support clinicians in their diagnostic interpretations. This Review also puts forth a number of key evaluation criteria for machine learning in medicine that physicians can use in their appraisals of these algorithms. We identify mitigation strategies to address ongoing challenges around data availability and model interpretability in the context of endocrine cancer diagnosis. Finally, we delve into frontiers in systems integration for AI, discussing automated pipelines and evolving computing platforms that leverage distributed, decentralized and quantum techniques.

Artificial intelligence (AI) refers to any non-living entity that executes tasks typically requiring human intelligence 1 . Endocrinology stands to benefit greatly from the rise of AI, particularly in the realm of cancer diagnostics, where AI has the potential to facilitate enhanced diagnostic precision and improved workflows. Medical images are a mainstay of tumour diagnostics and they also serve as a reservoir of mineable pixel data that naturally lends itself to machine-based classification 2,3 . Computer vision (Box 1) applications are already leveraging this property to power robust diagnostic interpretations of endocrine neoplasms 4-7 . Although tissue pathology remains the gold standard in the diagnosis of many endocrine tumours, the macroscopic characterization of tissue in imaging studies can augment histological findings. Indeed, a biopsy sample might not always reflect the intratumoural heterogeneity across genomic subclones 8,9 . Additionally, a biopsy is invasive and is subject to sampling error that can render it inconclusive. Computer vision can be leveraged in support of histological findings by inferring diagnosis from the structural heterogeneity observed within tumours on medical imaging 3 . Furthermore, given the high frequency with which medical imaging is performed in cancer management, archives of longitudinal medical imaging data can be used by computer vision applications to better characterize disease, predict progression at the time of diagnosis and monitor response to treatment 2 .
The correlation of AI-driven image analytics with other omics data and clinical expertise can also be used to enable integrative approaches to care 10,11 (Fig. 1). Indeed, studies demonstrate that mapping genomics or pathomics data (image features extracted from pathology studies) to radiomics data (image features extracted from radiology studies) can illuminate conserved trends at different levels of human physiology, with implications for diagnosis and prognosis 12-14 . Furthermore, advances in machine intelligence have the potential to enable non-invasive endocrine cancer diagnostics that could preclude or limit the use of invasive biopsy 15 . Looking ahead, it will be important for both endocrinologists and radiologists to cultivate a working understanding of the utility and limitations of AI if the benefits of these technologies are to be realized.
In this Review, we highlight the ever-growing contributions of machine intelligence to the field of endocrine imaging diagnostics for tumours of the adrenal, pancreatic, pituitary and thyroid glands.

Understanding AI
The definition of AI is broad and encompasses a variety of approaches that bridge the natural, applied and social sciences. Examples of tasks in medicine that can leverage AI include image interpretation 16,17 , disease forecasting 18 , genomics 19,20 , natural language processing 21,22 and therapeutic discovery 23 , among others. In this section, we review key concepts in machine learning and deep learning, both of which fall under the larger umbrella of AI.
Supervised. Supervised learning (Box 1) uses labelled inputs and asks an algorithm to identify how the relevant features from a dataset map to each respective label 26 . For example, let us say we are trying to differentiate between benign and malignant thyroid nodules using quantitative pixel features extracted from an ultrasonography study that represent nodule texture. Our labels here are 'benign' and 'malignant' and our inputs are texture representations, such as pixel correlation or entropy. During model training, the machine learning algorithm studies the texture features (Box 1) of benign and malignant images to develop and refine its decision-making process. Conceptually, the goal of supervised techniques is to correctly classify unlabelled data into the pre-defined categories used during model training. In this hypothetical example, we feed the algorithm feature data from unlabelled scans and we want it to tell us if the imaging findings are benign or malignant. The model is supervised in the sense that its programmer shows the algorithm correct examples to guide the learning process.
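The supervised thyroid nodule example can be sketched in a few lines of Python. This is a minimal illustration rather than a method from any cited study: the feature values are invented, and a nearest-centroid rule stands in for whichever classifier an investigator might actually choose. The names `TRAINING_DATA`, `fit_centroids` and `predict` are hypothetical.

```python
from statistics import mean

# Hypothetical (entropy, correlation) texture features with known labels.
# All numbers are invented for illustration only.
TRAINING_DATA = [
    ((2.1, 0.80), "benign"),
    ((2.4, 0.75), "benign"),
    ((2.0, 0.85), "benign"),
    ((3.6, 0.40), "malignant"),
    ((3.9, 0.35), "malignant"),
    ((3.4, 0.45), "malignant"),
]


def fit_centroids(samples):
    """Training step: learn one mean feature vector (centroid) per label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=squared_distance)


centroids = fit_centroids(TRAINING_DATA)
# Classify feature vectors from two unlabelled (hypothetical) scans.
print(predict(centroids, (2.2, 0.78)))
print(predict(centroids, (3.7, 0.38)))
```

The labels supplied during training are what make this supervised: the model can only ever answer with one of the categories it was shown.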
Unsupervised. In contrast to supervised learning, unsupervised techniques (Box 1) use unlabelled inputs and let the model adjudicate the data into groups. Revisiting our previous example of the thyroid nodule, we could build a model where pixel data of texture features from scans of patients with unconfirmed or borderline diagnoses are used as the unlabelled inputs. In what is an oversimplification, we could imagine the model 'plotting' the imaging data based on common features. Doing so enables the algorithm to identify clusters in the data, which might or might not translate to a substantive interpretation. Critically, the algorithm decides what is important when plotting the data. In the supervised learning example, we were looking to classify thyroid nodules as either benign or malignant. In this unsupervised scenario, the data could cluster any which way. For example, the data might triangulate to 'coordinates' or groups for different types of nodules as intended or it could group by background noise. In this way, the unsupervised learning model can potentially elucidate trends that the investigator had not originally set out to find, arguably the greatest strength and weakness of this technique. Unsupervised techniques can also be leveraged for augmenting imaging workflows in the annotation and pre-processing of unlabelled data 27,28 . Again, a critical conceptual distinction between supervised and unsupervised learning is that the output for the former will typically be a defined label or value, whereas the latter will be a cluster or association.
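As a rough sketch of the 'plotting' and clustering idea, the snippet below runs a plain k-means procedure on unlabelled feature vectors. The data are invented, k-means is only one of many possible clustering algorithms, and the deterministic initialisation is an assumption made so that the example is reproducible.

```python
# Hypothetical unlabelled (entropy, correlation) texture features.
POINTS = [
    (2.1, 0.80), (2.4, 0.75), (2.0, 0.85),
    (3.6, 0.40), (3.9, 0.35), (3.4, 0.45),
]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Plain k-means: alternate between assigning points and updating centres."""
    # Deterministic initialisation (an assumption for reproducibility):
    # take k points spread evenly through the list.
    step = max(1, len(points) // k)
    centres = [points[i * step] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centres


assignments, centres = kmeans(POINTS, k=2)
print(assignments)
```

The output is a list of cluster indices, not diagnoses. Whether a cluster corresponds to a nodule type or to background noise is left for the investigator to interpret.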
Key points
• Developments in machine intelligence have been made possible by the increase in data ubiquity and computing power and have the potential to enhance image segmentation, analysis and workflow in non-invasive endocrine cancer diagnostics.
• Improved adherence to consensus reporting standards and evaluation criteria in artificial intelligence (AI) for medical image analysis is urgently needed in the field of endocrine cancer diagnostics as this will enable meaningful cross-study comparison.
• A centralized inventory to track diagnostic algorithms in oncologic endocrinology that are in active clinical use would improve performance auditing and algorithm stewardship.
• The looming risk of excessive intervention in endocrine cancers can be addressed with the improved detection facilitated by AI, possibly via correlation with prognostic data for improved risk stratification.
• Poor data availability continues to stymie the development of robust machine learning applications, particularly in rare endocrine cancers; solutions to this problem might include database curation, pre-training techniques and workflow automation.
• Other breakthroughs in machine intelligence will come with the exploration of alternative computing frameworks, such as decentralized, distributed and quantum networks, that might enhance model training and efficiency.

Reinforcement. Reinforcement learning (Box 1) is a framework where the model interacts with its environment through actions that are each tied to a value reward 29 . In keeping with our thyroid nodule example, we could build a model that is fed pixel data from unlabelled scans of patients. The model is tasked with the identification of the malignant and benign target patterns. The model takes an action based on the data it encounters and then uses the reward information from its environment to find the path that maximizes the reward over time. We can think of this type of technique as learning by trial and error.
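Learning by trial and error can be illustrated with a deliberately simplified sketch. It assumes a toy environment with two hypothetical image patterns, a one-step (bandit-style) interaction and an epsilon-greedy action rule; none of these choices comes from the cited work, and real reinforcement learning problems usually involve sequences of actions.

```python
import random

ACTIONS = ("benign", "malignant")
# Hypothetical environment: each image pattern has a hidden correct label.
# The agent never sees this table; it only receives rewards.
TRUE_LABEL = {"pattern_a": "benign", "pattern_b": "malignant"}


def train(episodes=500, epsilon=0.2, seed=0):
    """Estimate the average reward of each action for each pattern."""
    rng = random.Random(seed)
    value = {s: {a: 0.0 for a in ACTIONS} for s in TRUE_LABEL}
    count = {s: {a: 0 for a in ACTIONS} for s in TRUE_LABEL}
    for _ in range(episodes):
        state = rng.choice(sorted(TRUE_LABEL))
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore: try a random action
        else:
            action = max(ACTIONS, key=lambda a: value[state][a])  # exploit
        reward = 1.0 if action == TRUE_LABEL[state] else 0.0
        # Incremental running average of the rewards seen for this pair.
        count[state][action] += 1
        value[state][action] += (reward - value[state][action]) / count[state][action]
    return value


def greedy_policy(value):
    """Pick the highest-valued action for every pattern."""
    return {s: max(ACTIONS, key=lambda a: value[s][a]) for s in value}


policy = greedy_policy(train())
print(policy)
```

The reward is the only feedback the agent receives, so exploration (the `epsilon` parameter) is what allows it to discover actions it would otherwise never try.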

Deep learning
Deep learning (Box 1) is a subset of machine learning that uses algorithm architectures inspired by neural processing in humans to make classifications or predictions using layers of abstract data representations 30 . Deep learning models typically perform sequential operations that transform the data in each successive layer, and this series of transformations enables the model to progressively deduce information relevant to the assigned task. Revisiting our hypothetical thyroid nodule example, the first layer of our deep learning model might assess groups of image pixels at different orientations to discern edge information 31 . The second layer might then compile the edges from the first layer to detect patterns of edges 31 . The next layer might assemble different edge motifs to detect hyperechoic or hypoechoic regions of the scan. Finally, subsequent layers might transform inputs from the previous layer to recognize complex image traits such as microcalcifications, cysts and necrosis. Importantly, deep neural networks can be differentiated from shallow neural networks by their multiple (>1) 'hidden' layers, which contain complex, non-linear connections that can be difficult for humans to interpret (Fig. 3). Although these hidden layers are striking in their ability to enhance the complexity of features discernible by the model, deep learning algorithms require large amounts of data to avoid picking up noise specific to the training dataset (see 'overfitting', Box 1). A key strength of deep learning is that the technique is less reliant on feature engineering (Box 1) when compared with classic machine learning models 32 .
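To make the idea of a first layer that responds to edges at different orientations more concrete, the sketch below applies two fixed 3×3 filters to a toy image. The Sobel kernels are a hand-picked assumption used for illustration; in a trained convolutional network the filter weights are learned from data rather than specified in advance.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Non-linearity applied after the filter: negative responses become zero."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hand-picked filters (an assumption; learned in practice).
SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to vertical edges
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to horizontal edges

# Toy 5x5 image: dark on the left, bright on the right (one vertical edge).
IMAGE = [[0, 0, 9, 9, 9] for _ in range(5)]

vertical_response = relu(convolve2d(IMAGE, SOBEL_VERTICAL))
horizontal_response = relu(convolve2d(IMAGE, SOBEL_HORIZONTAL))
print(vertical_response)
print(horizontal_response)
```

Only the vertical-edge filter responds to this image. A second layer would take such feature maps as its input and combine them into patterns of edges.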
Deep learning models can also make use of the aforementioned supervised, unsupervised and reinforcement learning techniques. Deep learning models (Table 1) can be used for specific tasks within the radiomics workflow, such as in segmentation or feature extraction, often with improved performance compared with traditional machine learning methods like single-layer 'shallow' neural networks. Mixed techniques are often employed in the feature extraction process, whereby 'deep features' mined using deep learning algorithms are fed into a second classifier algorithm, either in isolation or in some combination with other manually extracted or statistically derived features. However, deep learning can also be used in end-to-end processing, effectively obviating the need for human involvement in the segmentation, feature extraction and feature selection (Box 1) steps of the radiomics workflow 33,34 (Fig. 2).

Diagnostics
In this section, we review AI applications in endocrine cancer diagnostics by organ system, with an emphasis on clinical utility, technical limitations and areas for future research.

Adrenal gland
Approximately 5% of the general population have adrenal lesions that are found incidentally on abdominal imaging as asymptomatic tumours (incidentalomas) 35,36 . Clinical work-up for adrenal masses starts with assessing them for potential malignancy and functionality 37 . Early radiomics efforts to discriminate adrenal lesions on CT and MRI used mean frequency attenuation mapping with histogram analysis 38 . However, the replication of findings has been a challenge, possibly owing to variation in techniques to define the region of interest (ROI) 39-41 . Importantly, histogram analysis has paved the way for automated radiomics-based machine learning techniques with texture analysis, which can assess both low-order and higher-order features. Texture analysis explores hierarchical spatial relationships between pixels. First-order features describe distributions in grey-level pixel intensity, second-order features assess relationships between pixel pairs and higher-order features explore distributions in pixel neighbourhoods 42-44 .

Box 1 | Glossary of key terms
• Machine learning: a branch of artificial intelligence where algorithms can learn a task without explicit pre-programming.
• Computer vision: the use of artificial intelligence for image or other digital media analysis.
• Supervised learning: a training technique that uses labelled inputs and asks the algorithm to identify how the relevant features from that data map to each respective label.
• Unsupervised learning: a training technique that uses unlabelled inputs and lets the model adjudicate the data into clusters or associations.
• Reinforcement learning: a training technique where the model interacts with its environment through actions that are each tied to a value reward.
• Deep learning: a subset of machine learning algorithms that process data in networks of abstracted layers to learn, usually via sequential transformations of the data.
• Continuous learning: an open training state whereby models can modify their architectures in real time.
• Data augmentation: a process to generate synthetic data that involves slight transformations in the training images.
• Transfer learning: a technique that uses large and diverse datasets to prime models prior to training with the limited target dataset.
• Segmentation: the process of making images machine-readable through annotation of regions of interest.
• Region of interest: demarcation of areas relevant to the classification decision-making process.
• Texture features: quantifiable patterns of pixels in the medical images, many of which are not visible to the naked eye.
• Feature engineering: extraction of features from the data space is guided by domain knowledge, a process that deep learning can bypass through automated feature probing.
• Feature selection: an analytic process in which a subset of the total pool of extracted features is selected for incorporation to the model.
• Overfitting: a phenomenon that occurs when the model too closely maps to features in the training data resulting in poor generalizability.
• Generalizability: model performance on real-world patient populations outside the study data used to develop the model.
• Backpropagation: a training paradigm often used to develop neural networks where the weights of neurons are repeatedly tuned based on the error rate in the previous cycle through the training dataset.
• Picture Archiving and Communications Systems: a system for multi-modal (such as MRI, CT, X-ray and ultrasound) medical image storage and transfer using a universal Digital Imaging and Communications in Medicine (DICOM) file format.
• Cloud computing: a network architecture that performs data operations using a remote, centralized server.
• Decentralized and distributed computing: network architectures that perform data operations using multiple local or non-centralized servers.
• Federated learning: an ensemble training strategy where gradient information from models trained locally in parallel is loaded on a central server to develop a single consensus model; it does not require the transfer of patient data.
• Quantum computing: a network architecture that leverages the properties of atomic and subatomic particles to improve the computational efficiency of conventional algorithms or to develop new learning paradigms.

CT is the dominant imaging modality for evaluating the adrenal gland and can be performed with or without contrast enhancement for the visualization of adrenal tumours. In the evaluation of malignancy, a size of >4 cm is a concerning feature, often prompting resection 45 . However, this risk factor should not be taken in isolation as ~70% of these large adrenal tumours have been shown to be non-malignant lesions 46,47 . Machine learning has been used to differentiate large (>4 cm) adrenocortical carcinomas from other large cortical lesions on contrast-enhanced CT 48 . The radiomics signature obtained by machine learning had a diagnostic accuracy for malignant disease exceeding that of radiologists, although there was inter-observer variability on the radiologist evaluation (P <0.0001) 48 . The performance of this machine learning-based texture analysis model further improved with the inclusion of pre-contrast mean attenuation, which is a parameter that is also used in established adrenal radiological criteria 49 .
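The distinction between first-order and second-order texture features drawn in this section can be illustrated with a small sketch. It assumes two invented 4×4 'images' with identical grey-level histograms, uses histogram entropy as the first-order feature and uses contrast from a grey-level co-occurrence matrix (GLCM) as the second-order feature. The function names are hypothetical, and radiomics toolkits compute many more features than the two shown here.

```python
from collections import Counter
from math import log2


def first_order_entropy(image):
    """First-order feature: entropy of the grey-level histogram."""
    pixels = [p for row in image for p in row]
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def glcm(image, dr=0, dc=1):
    """Normalised co-occurrence of grey-level pairs at pixel offset (dr, dc)."""
    pairs = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                pairs[(image[r][c], image[r2][c2])] += 1
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}


def glcm_contrast(matrix):
    """Second-order feature: intensity difference between pixel pairs."""
    return sum(p * (i - j) ** 2 for (i, j), p in matrix.items())


# Two toy images with the same histogram (eight 0s, eight 1s).
SMOOTH = [[0, 0, 1, 1] for _ in range(4)]                       # two flat halves
CHECKER = [[(r + c) % 2 for c in range(4)] for r in range(4)]   # alternating pixels

for name, img in (("smooth", SMOOTH), ("checker", CHECKER)):
    print(name, first_order_entropy(img), glcm_contrast(glcm(img)))
```

Both images have the same first-order entropy because their histograms match, whereas the GLCM contrast separates them because it depends on how neighbouring pixels are arranged.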
In terms of functional adrenal lesions, machine intelligence has also been used to differentiate between lipid-poor adenoma and subclinical phaeochromocytoma (which might secrete catecholamines), where attenuation thresholds and washout characteristics might not always be reliable 50,51 . As subclinical phaeochromocytomas can sustain secretory function, biopsy or surgery could precipitate haemodynamic instability if a functional tumour goes undetected. Studies have yielded radiomics signatures for subclinical phaeochromocytoma via machine learning-driven texture analysis on non-contrast CT imaging with performance accuracy ranging from 85% to 89% 52,53 . However, the potential benefits of this approach over existing clinical criteria are hard to discern due to considerable differences in baseline tumour characteristics, such as attenuation and size, and the lack of comparison between machine learning-driven analysis and expert radiologist evaluation 52,53 . Still, we can envision a future role for the enhanced detection of subclinical phaeochromocytomas with AI techniques to confidently and quickly prompt confirmatory biochemical testing.
Other groups have also leveraged the improved resolution of adrenal imaging on MRI to train their models. Indeed, one group developed a machine learning-based radiomics signature to differentiate adrenal adenomas from non-adenomatous lesions on MRI, with non-inferior performance in comparison with expert radiologists 54 . Other studies have explored neural networks for the differentiation of tumour subtypes on MRI (accuracy, 93%) and CT (accuracy, 82%), including adrenal adenomas, cysts, lipomas, lymphangiomas and metastases. However, these neural networks were trained with radiologist evaluation as the ground truth condition rather than with the gold standard of biopsy pathology 55,56 .
Looking ahead, we anticipate that the field of AI-powered adrenal tumour diagnostics will move towards robust automated detection and preoperative classification of incidentalomas. Future work is needed in the differentiation of small adrenal masses <4 cm, particularly in the case of malignancy, where early detection is linked with better outcomes 57 . The field will be improved with more robust clinical evaluation and workarounds for small cohort size, possibly through increased data-sharing and/or pre-processing techniques to reduce overfitting.

Pancreas
The aberrant proliferation of endocrine islet cells leads to the development of pancreatic neuroendocrine tumours (NETs) and prognosis is overall favourable with complete resection 58-60 . A minority of these neoplasms retain the functional status of their original islet cell lineage, which can induce a clinical syndrome due to hormone production, often facilitating their detection 61,62 . Absent such biochemical indicators, the clinical management of pancreatic NETs is primarily stage-guided by Ki67 index and mitotic count observed in tissue samples obtained by biopsy; however, imaging characteristics, such as tumour size, depth of invasion and presence of metastases, are also considered 63-66 . Pancreatic NETs classically present on CT imaging as contrast-enhancing masses that are best visualized on the arterial phase,

Fig. 1 | Endocrinologists communicate with patients and radiologists to gain a clinical overview of their patient. Four arms give a holistic overview of disease: radiomics (for example, CT or MRI), pathomics (for example, histology of tissue samples), genomics and phenomics (for example, digital health mobile phone applications and wearable trackers). An artificial intelligence algorithm (such as a deep neural network, seen in the centre) synthesizes all the information and provides a diagnostic classification.
Preoperatively, biopsy samples are typically obtained via fine-needle aspiration on endoscopic ultrasound, although the localization and yield can be complicated by lesion size and spatial orientation 69 . In light of these uncertainties, there is interest in developing a system for preoperative risk stratification of pancreatic NETs, which will help guide therapeutic directions in support of endocrine oncologists and surgeons 70,71 . Studies have utilized both conventional machine learning and deep learning on preoperative CT and MRI to classify pancreatic NET grade with robust accuracy in pathology-confirmed tumours 4,72-75 . Importantly, the development of classification boundaries for future studies requires consensus in the partitioning of tumour grades. For example, some studies differentiate grade G1 and G2 from G3 neoplasms, whereas others differentiate grade G1 from G2 and G3 neoplasms 4,74,76 . Given that pancreatic NETs are so rare, a deep learning study using MRI has used data augmentation (Box 1) with a generative adversarial network on 96 patients with confirmed disease to enable their convolutional neural network to have improved generalizability on unseen data 75 . Beyond risk stratification, computer-aided diagnosis could potentially be extended to pancreatic NETs if AI efforts expand to functional imaging techniques that use tracers, such as the octreotide scan 77-79 .
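Data augmentation in its simplest form, as defined in Box 1, applies slight transformations to existing training images. The sketch below shows that basic idea with flips and a rotation on a toy pixel grid. It is not the generative adversarial network approach used in the cited MRI study, which synthesizes new images with a trained model, and the function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the original image plus three transformed copies."""
    return [image, flip_horizontal(image), flip_vertical(image), rotate_90(image)]


# Toy 2x2 'scan'; each number stands for a pixel intensity.
scan = [[1, 2],
        [3, 4]]

augmented = augment(scan)
for variant in augmented:
    print(variant)
```

Each transformed copy inherits the label of the original scan, so one labelled image yields several training examples. Whether a given transformation is anatomically plausible for a particular modality is a judgement the investigator has to make.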
We also envision a role for machine intelligence to support radiologists in the differentiation of atypical pancreatic NETs from adenocarcinoma. Pancreatic adenocarcinoma is an exocrine malignancy of the epithelioid ductal cells that often confers a poor prognosis due to delays in diagnosis 80,81 . Although pancreatic NETs are usually distinguishable from adenocarcinomas on CT by their vascularity pattern and absence of ductal dilation, a hypovascular enhancement pattern is not infrequent in atypical variants 63,67,68 . To date, statistical approaches utilizing histogram analysis on CT images have yielded conflicting findings in terms of the robustness of features used for differentiation, including entropy, kurtosis and skew 82,83 . Future AI studies could focus on combining imaging information with clinical data (such as laboratory tests) for increased accuracy.
Broadly, studies in the field of pancreatic imaging have utilized deep neural networks to improve workflow by carrying out automatic segmentation of pancreatic lesions, a process ordinarily complicated by the irregular contours and difficult anatomy of the pancreas 84-88 . In addition, several studies used advanced learning techniques for classification in exocrine pancreatic cancer and precursor lesions, with encouraging findings 89,90 . For example, one exploratory study with a small dataset used a mix of supervised and unsupervised learning techniques for the classification of pancreatic cystic neoplasms on MRI. We highlight the paper's use of unsupervised methods, in which a k-means algorithm is trained to cluster pancreatic precursor lesions on unlabelled MRI scans. Following this step, the machine-annotated scans are fed into a novel proportioning type support vector machine for final label adjudication 91 (Table 1). Potential also exists here to eventually adapt such unsupervised models for the automatic labelling of unstructured medical image data in order to reduce the pre-processing workload. This work is still exploratory, with only a modest (6-10%) improvement in diagnostic accuracy over prior unsupervised machine learning approaches; however, it nevertheless highlights the opportunity to improve on prior learning techniques in the field of pancreatic imaging to develop models that can be used clinically.

Pituitary gland
Pituitary adenomas occur in approximately 10% of the population, although they are typically small and subclinical lesions that do not require treatment 92,93 . Clinical syndromes such as acromegaly or bitemporal hemianopsia, for example, can result from tumour hormonal hypersecretion or tumour mass effect on surrounding structures 94-98 . In combination with clinical data, neuroimaging plays a vital role in informing pituitary tumour diagnosis, surgical planning and longitudinal monitoring 99,100 . MRI is generally the preferred imaging modality for the sellar region as it can provide exquisite detail of the neuroanatomy. A wide diversity of pathologies localize to the sellar region, including those of primary pituitary, local or distant origin.
Fig. 2 | A radiomics signature is the final output. One can also use either machine learning or deep learning for feature extraction and engineering, including the identification of pixel intensity, lesion shape, texture feature matrices and wavelets. Conventional machine learning algorithms must respect this pathway of acquisition, segmentation, feature extraction and feature selection. By contrast, deep learning can circumvent this process altogether with end-to-end processing from inputs to outputs.

Machine intelligence has been leveraged for a variety of diagnostic tasks that reflect the diversity of sellar lesions and hold implications in terms of treatment. For example, an early study utilized a three-layer feedforward artificial neural network (Table 1) with backpropagation (Box 1) for the differentiation of large suprasellar masses such as pituitary adenomas, craniopharyngiomas and Rathke cleft cysts 101 . Their learning model used patient age together with MRI features to achieve excellent accuracy, which improved on the performance of both neuroradiologists and general radiologists 101 . Interestingly, upon assessment of expert confidence and misclassifications, the authors found that the AI model was most beneficial when used to identify cases where cystic degeneration occurred in pituitary adenomas 101 . Other models have been used for the differentiation of null cell adenomas from other non-functioning pituitary adenomas via machine learning-based radiomics signatures, albeit lacking expert radiologist comparison 102 . Accurate diagnosis of null cell adenomas is critical, as adjuvant radiotherapy has shown some benefit in this adenoma subtype but not in others due to an overriding risk of hypopituitarism. Deep learning is also gaining traction, with one study utilizing convolutional neural networks (Table 1) on multisequence MRI to differentiate pituitary adenomas from other sellar pathologies and healthy controls, with a performance accuracy of 97.0%, although this protocol is still in need of radiologist comparison 5 .
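A three-layer feedforward network trained with backpropagation can be sketched as follows. This is not the architecture or data from the cited study: the two input features (a scaled age and a scaled imaging score), the labels, the three hidden units and the learning rate are all invented for illustration, and the names `forward`, `train` and `mean_loss` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(params, x):
    """Input layer -> hidden layer (sigmoid) -> single output unit (sigmoid)."""
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output


def train(data, hidden_units=3, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden_units)]
    b1 = [0.0] * hidden_units
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_units)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in data:
            hidden, output = forward((w1, b1, w2, b2), x)
            # Backward pass. With a sigmoid output and cross-entropy loss,
            # the error signal at the output is (prediction - target).
            delta_out = output - y
            # Propagate the error back through the hidden sigmoid units.
            delta_hidden = [delta_out * w2[j] * hidden[j] * (1 - hidden[j])
                            for j in range(hidden_units)]
            for j in range(hidden_units):
                w2[j] -= lr * delta_out * hidden[j]
                for i in range(n_in):
                    w1[j][i] -= lr * delta_hidden[j] * x[i]
                b1[j] -= lr * delta_hidden[j]
            b2 -= lr * delta_out
    return w1, b1, w2, b2


def mean_loss(params, data):
    """Mean cross-entropy, clipped to avoid log(0)."""
    total = 0.0
    for x, y in data:
        p = min(max(forward(params, x)[1], 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


# Hypothetical (scaled age, scaled imaging score) inputs with binary labels.
DATA = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
        ((0.8, 0.9), 1), ((0.9, 0.7), 1), ((0.7, 0.8), 1)]

params = train(DATA)
print(mean_loss(params, DATA))
print([round(forward(params, x)[1]) for x, _ in DATA])
```

Each pass through the data nudges the weights in the direction that reduces the prediction error, which is the repeated tuning that the glossary entry for backpropagation describes.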
Robust pituitary tumour characterization at the time of diagnosis can also inform subsequent surgical planning. A variety of conventional machine learning and deep learning techniques have been used to evaluate macroadenoma consistency, with many models achieving good diagnostic performance on par with that of radiologists 103-105 . This preoperative finding can have surgical implications as soft adenomas are generally amenable to suction curettage upon a transsphenoidal approach, whereas the firm subtype is more difficult to resect and requires ultrasonic aspiration and often a staged transsphenoidal approach 106,107 . Other deep learning models have been used to preoperatively predict tumour invasion or cerebrospinal fluid leak, to inform surgical planning 108,109 .
Future machine learning directions should strive to enable the early detection of small pituitary lesions, possibly via automated lesion detection or improved diagnostic performance, as early clinical intervention can prevent the sequelae of worsening mass effect or protracted hormone hypersecretion. In terms of disease forecasting, we also see potential value in tools for the determination of appropriate patient follow-up periods for tumour surveillance to reduce unnecessary scanning and promote efficient health-care utilization. To this aim, studies could use longitudinal patient data gathered by automated segmentation and measurement of

Thyroid gland
Thyroid cancer is the most common malignancy of the endocrine system, with an estimated 5-year prevalence of 4.6% 110 (International Agency for Research on Cancer). Ultrasonography is the mainstay imaging modality in diagnosis, providing excellent visualization of nodules and guiding potential biopsy acquisition. Many robust AI applications have emerged to characterize thyroid nodules owing, in part, to the ubiquity of data as ultrasound scans are non-ionizing, fairly low-cost and increasingly portable 110,111 . Studies to date primarily explore the automatic segmentation and classification aspects of thyroid nodule diagnosis 112-117 . The primary utility of these models lies in their potential to inform decisions around whether to proceed with surveillance or fine-needle aspiration biopsy 15 . To date, many of the radiomics signatures for thyroid cancer developed by conventional machine learning approaches map to the five domains in the Thyroid Imaging, Reporting and Data System (TI-RADS, used by radiologists) of echogenicity, echogenic foci, composition, shape and margin criteria 118-121 . These models support the robustness of these TI-RADS clinical imaging criteria; however, they also highlight a potential role for automated techniques in reducing inter-observer variability. An abundance of deep learning models has also been developed to inform clinical decisions in patients with thyroid nodules, although a 2020 meta-analysis did not find a clear superiority over classic machine learning techniques or radiologists in terms of diagnostic accuracy 122 . Of course, interpretation of these pooled data is difficult as the deep learning models, sample sizes and clinical evaluation criteria vary substantially across studies.
For example, one 2019 study with high volume data trained a convolutional neural network (Table 1) with images drawn from 312,399 ultrasound scans from 42,952 patients across multiple institutions; this model was found to outperform skilled radiologists (>6 years of experience) on external validation 6 . Although not all institutions will have access to high volume thyroid ultrasound scans, they can still implement a number of strategies to increase data availability. Here, we want to highlight one emerging strategy: the use of model pre-training with synthetic data creation via generative adversarial networks (Table 1). In fact, in the past year, the endocrine literature began to explore innovative 'knowledge-guided' approaches to data synthesis using deep learning-extracted features from TI-RADS to assist the generative adversarial network in its generation of thyroid nodule images 123 .
It is not clearly established how the benefits of machine intelligence systems for improving diagnostic accuracy will ultimately translate to the clinical setting. Overall, the literature suggests that these systems can achieve non-inferior performance to that of experienced radiologists (experience varies, typically 5-20 years) 122,124 . These algorithms do tend to outperform less-experienced radiologists and might therefore play a valuable supporting role, particularly in low-resource settings, where access to experts could be constrained [125][126][127] . Compared with a small cohort of models that are actively being utilized in the clinical setting, radiologists seemingly have a slight edge in varying indicators of performance on individual studies, although pooled overall metrics are comparable 122,128 . A centralized inventory to actively track these diagnostic algorithms in clinical use would improve performance auditing and algorithm stewardship. Looking ahead, we see that the field is already heading towards 3D detection and reconstruction in thyroid ultrasonography that might power more robust analytics 117 . Another challenge moving forward will be in mitigating the risks of excessive intervention in thyroid cancers with improved detection as many slow-progressing or early-stage cancers will remain subclinical. Possible solutions here include linking imaging algorithms to pathology reference standards as well as with longitudinal outcomes data for improved risk stratification 129 .

Facial recognition
Interestingly, a number of computer vision applications of facial recognition software have been developed to identify stereotyped facial features induced in hormonal excess 130 . A positive identification of characteristic facial features could indicate a number of pathologies, including an underlying endocrine tumour. The process is similar to the radiomics workflow, except that facial landmark tagging occurs in lieu of segmentation during image pre-processing.
Acromegaly can result in facial manifestations such as frontal bossing, sunken nasolabial folds, prominent zygomatic arch and enlarged jaw often due to an underlying pituitary somatotrophic macroadenoma.

Fig. 3 | A convolutional neural network. The input is a medical image to which an overlaying grid and a kernel matrix (for example, 3 × 3) are applied. The matrix feature maps to a smaller area on a stacked convolution layer. Another smaller kernel matrix (for example, 2 × 2) is pulled from a different area on that convolutional layer to a pooling layer. This pipeline then coalesces into a classification region with the 'fully connected' layers, which will yield an output. Panel labels: convolution, pooling, fully connected layers, output.
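The pipeline described in Fig. 3 can be sketched in a few lines of Python with NumPy. This is a minimal, untrained illustration and not a model from any of the studies cited here; the ReLU activation, the softmax output, the 8 × 8 input and the random weights are all assumptions added for the example.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over a single-channel image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(image, kernel, weights, bias):
    features = np.maximum(conv2d(image, kernel), 0.0)  # convolution + ReLU (assumed)
    pooled = max_pool(features, size=2)                # 2 x 2 pooling
    logits = weights @ pooled.ravel() + bias           # fully connected layer
    return softmax(logits)                             # class probabilities

rng = np.random.default_rng(0)
image = rng.random((8, 8))             # stand-in for a grayscale image patch
kernel = rng.standard_normal((3, 3))   # untrained 3 x 3 kernel
weights = rng.standard_normal((2, 9))  # 8x8 -> 6x6 (conv) -> 3x3 (pool) -> 9 inputs
bias = np.zeros(2)
probs = forward(image, kernel, weights, bias)
```

In a trained network the kernel and the fully connected weights would be learned from labelled images, and many kernels and layers would be stacked; the sketch only shows how the shapes flow from image to output.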
Both machine learning and deep learning approaches have been used to craft models to identify stereotyped facial features, with a performance comparable to that of acromegaly specialists and exceeding that of general internists 131,132 . Stereotyped features can also occur in Cushing syndrome, such as facial plethora, hirsutism, acne and cervical fat pad, owing to increased cortisol. Initial pilot studies using machine learning are limited by small cohorts and demonstrate variable performance on retrospective validation (accuracy range 62.8-85%) 133,134 .
Limitations in models to date include poor visualization of facial features and potential entrenched bias due to racial and gender homogeneity in training data [131][132][133][134] . Diversifying the training data, measuring bias and documenting bias assessments are all critically important in these facial recognition software applications to avoid replicating current racial and gender disparities in the care of Cushing syndrome and acromegaly, which manifest as poor outcomes and delays in diagnosis, respectively 135,136 .
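One simple way to obtain a metric of bias is to report diagnostic performance separately for each demographic subgroup. The sketch below assumes binary labels (1 = disease present) and uses invented toy data; the function names and the choice of a sensitivity gap as the summary statistic are illustrative assumptions, not a method taken from the cited studies.

```python
from collections import defaultdict

def subgroup_rates(y_true, y_pred, groups):
    """Return sensitivity and specificity for each subgroup label."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "tn" if pred == 0 else "fp"
        counts[group][key] += 1
    report = {}
    for group, c in counts.items():
        pos = c["tp"] + c["fn"]
        neg = c["tn"] + c["fp"]
        report[group] = {
            "n": pos + neg,
            "sensitivity": c["tp"] / pos if pos else None,
            "specificity": c["tn"] / neg if neg else None,
        }
    return report

def sensitivity_gap(report):
    """Largest difference in sensitivity between any two subgroups."""
    values = [r["sensitivity"] for r in report.values() if r["sensitivity"] is not None]
    return max(values) - min(values) if values else None

# Toy data: the hypothetical model misses more true cases in group B.
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5
report = subgroup_rates(y_true, y_pred, groups)
gap = sensitivity_gap(report)
```

A large gap on a representative test set would flag the model for further review before deployment; small subgroups give unstable estimates, so the per-group sample size is reported alongside each rate.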

Clinical evaluation
The metrics used for clinical assessment in AI currently lack standardization, which undermines the smooth integration of AI into the health-care system (Tables 2,3). Many computer vision studies in endocrine cancer imaging lack robust validation, which poses inherent limitations in terms of reproducibility. First, the lack of consistent reference standards (including biopsy, stable imaging and clinical criteria) for common clinical questions in machine intelligence for tumour imaging diagnostics can undermine the ability to establish a ground truth for comparison across studies. Furthermore, no consensus exists in definitions of high versus low experience levels in radiologists; however, the endocrine cancer computer vision literature generally trends towards more than 5 years of clinical practice at a minimum as indicative of a high level of expertise. Next, separation of the training and testing datasets is critically important and cross-validation alone is not adequate in evaluating clinical performance. At a minimum, studies should be validated on external datasets, ideally with prospective studies, which are less prone to selection bias.

Data management
Pre-processing
Acquisition: Studies should disclose protocols used to obtain medical images as these can be different across institutions; variations in imaging machines, positioning, image capture and slicing, and data formats can limit generalizability; augmentation of acquisition protocols through automation can improve standardization
Segmentation: Refers to the process of making images machine-readable through annotation of ROIs, which can be performed manually or automatically; protocols can be subject to inter-observer and inter-study variability (such as whole tumour versus axial ROIs)
Heterogeneity: Refers to the sample data mix; ideally, data would include a multi-institutional and representative set of experimental and control images with both typical and atypical cases; publishing the data distributions for pathologies or demographics included in model training can help to mitigate these concerns
Size: With increasing dimensionality, models need more data for generalization; researchers can use model-specific or post hoc thresholds in performance cut-offs on validation to 'power' their studies but these processes are variable in practice; sample size determination practices should be reported in research studies; future work should assess for possible best practices in post hoc techniques for sample size determination

Training
Reference standard: A degree of uncertainty exists in the ground truth condition in clinical diagnosis; although sample biopsy tends to be the gold standard in cancer diagnostics, diagnosis by a specific biomarker, imaging finding or clinical criteria might be more appropriate to the clinical question and/or institutional resources; the establishment of uniform reference standards for endocrine neoplasms is needed in cases where biopsy is not routinely obtained such as in small adrenal or pituitary masses
Data separation: Failure to separate training and validation sets is discouraged as it limits the generalizability of findings

Testing and/or validation
Performance: efficacy (diagnostic performance); safety (potential untoward effects on overall patient health or well being); fairness (equitable algorithm performance across populations). An expert radiologist comparison can be used to infer the clinical relevance of algorithm performance; retrospective and prospective experimental designs are typically used, with prospective studies less prone to memory bias (internal test sets) and selection bias; in algorithms intended for autonomous use in diagnosis or other high-risk applications, randomized clinical trials might be warranted to assess for efficacy, safety and fairness

Implementation and quality control
Generalizability: Institutions should assess how algorithms perform in their respective clinical populations; ideally, all studies would be tested on a distinct, external dataset prior to implementation to infer generalizability; baseline variation in radiologist skill level across institutions can muddy comparisons; drawing from experts across different institutions as well as including a consensus agreement on 'highly experienced' expert level in existing reporting guidelines could help in assessments of model generalizability
Longevity: Model performance has the potential to degrade over time due to changing health infrastructure, cyber sabotage or shifts in population characteristics; continued performance auditing across the algorithm's life cycle is indicated
Utility: The number of algorithms being developed to assist clinical diagnostics is exploding to the point where it can constrain bandwidth, clutter interfaces and overwhelm providers; moving forward, there will be a need for inventories of models that can guide clinical stewardship efforts to curtail their excessive use

ROIs, regions of interest.
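One way to enforce the separation of training and testing data discussed above is to split at the level of the patient rather than the image, so that scans from the same person never appear on both sides. The sketch below is a minimal illustration using only the Python standard library; the record format, identifiers and the 20% test fraction are assumptions for the example, and an internal split of this kind does not replace validation on an external dataset.

```python
import random

def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_id) records so no patient spans both sets."""
    patient_ids = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:n_test])
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test

# Hypothetical cohort: 10 patients with 3 scans each.
records = [(f"patient_{p}", f"scan_{p}_{s}") for p in range(10) for s in range(3)]
train, test = split_by_patient(records)

train_patients = {pid for pid, _ in train}
test_patients = {pid for pid, _ in test}
```

Splitting by image instead would let near-identical scans of one patient leak into the test set and inflate the reported performance.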
To improve the quality of research, a number of guidelines for reporting in computer vision studies in medicine have been developed [137][138][139] . Moving forward, the development of performance profiles for any high-fidelity model classes or software packages for standard benchmarking might also be helpful, while at the same time acknowledging that, often, many ways exist to accomplish the same task with machine learning. Importantly, for those algorithms intended for autonomous clinical use, multicentre randomized trials might also be indicated to qualify their performance in integrated settings. Long-term monitoring of efficacy and bias across the algorithm life cycle is also indicated, particularly in cases of continuous learning (Box 1) where algorithms continually update to reflect new data.

Interpretability
Decoding AI for physicians can mitigate uncertainty that could undermine trust in machine intelligence [140][141][142] . Broadly speaking, interpretability strategies are either specific or agnostic to a given model class and assess function at either a global or a local level 143 . Global interpretations seek to offer holistic depictions of model behaviour and focus on illuminating the trends in the data that are most important to classification. Local interpretations focus on explaining individual model predictions. The intent of these strategies is to reassure endocrinologists and radiologists that the model is basing its decisions on the features it should be looking at, often by way of visualizations or text. These explanations are generated using several techniques, including feature importance to highlight salient features, counterfactual examples of model predictions for a given input or decision rules that describe the logical flow of the model 143,144 .
Feature attribute strategies are quite popular and include colour mapping 145 , an interpretability technique that highlights regions of the medical image that influence the model decision. Other feature attribute methods include surrogate strategies 146 , which use simpler models to explain the behaviour of more complex models. In the oncologic endocrinology literature, one form of colour mapping, known as saliency mapping, has been demonstrated in thyroid nodule classification to illustrate model behaviour 147 . Other studies have utilized both gradient mapping and surrogate modelling techniques to highlight feature importance in the segmentation of brain tumours on MRI and abdominal CT, with the potential for future use in sellar, pancreatic and adrenal diagnostics [148][149][150] . Finally, image similarity feature attribute strategies have also been applied to computer vision models for thyroid cancer. This technique displays a similar image linked to a classification as an explanation for the user, often with a superimposed gradient mapping to illuminate any respective discrepancies in regions of importance 151 . Of note, textual explanations are less common; however, they have been utilized in breast MRI and pelvic x-ray imaging to generate descriptive semantic outputs 152,153 . Interestingly, a combined approach with both saliency maps and textual explanations was shown to be better received by a small group of physicians 153 . Future efforts should strive to develop standardized metrics for evaluating the performance of interpretability models to ensure their effective and reproducible knowledge translation to the clinical setting.
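To make the idea of highlighting influential image regions concrete, the sketch below computes an occlusion sensitivity map, a model-agnostic relative of the gradient-based saliency maps used in the studies cited above (it is a swapped-in technique chosen because it needs no deep learning framework). Each patch of the image is masked in turn and the drop in the model score is recorded. The toy model, patch size and image are assumptions for illustration only.

```python
import numpy as np

def occlusion_map(model, image, patch=4, fill=0.0):
    """Score drop caused by masking each patch; larger values = more influential."""
    baseline = model(image)
    heat = np.zeros_like(image, dtype=float)
    for top in range(0, image.shape[0], patch):
        for left in range(0, image.shape[1], patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            heat[top:top + patch, left:left + patch] = baseline - model(occluded)
    return heat

def toy_model(image):
    """Hypothetical classifier score: mean intensity of the central 4 x 4 region."""
    return float(image[4:8, 4:8].mean())

image = np.ones((12, 12))
heat = occlusion_map(toy_model, image, patch=4)
```

Here the map is non-zero only over the central region, which is the only area the toy model uses. Overlaying such a map on the original image as a colour scale gives the kind of visual explanation described in the text.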

Data availability
Abundant medical imaging data is needed to develop clinically meaningful deep learning models for non-invasive endocrine cancer diagnostics, capable of generalizing to a variety of clinical settings. In this section, we discuss techniques to increase the availability of data to prevent overfitting in AI models.

Open data curation
A lack of high volume, quality data impedes the development of robust AI in endocrine cancer diagnosis. One strategy to improve data for use by machine learning models is through improved sharing of existing data via the creation of open databases. The ongoing coronavirus  Archive and the UK Biobank, the latter of which expanded its archive in 2020 to include an imaging database with pan-MRI and DXA scans on >5,000 patients.

Automated workflow
AI integration can be targeted through automated pipelines (including XNAT or DICOM Image Analytics and Archive) that can reduce latency in data retrieval through improved integration with existing health-care infrastructure 157 (Fig. 4). These tools can uncouple imaging data in Picture Archiving and Communications Systems from protected health information following image acquisition for use by AI models for near-real-time processing. From there, these imaging findings can be conveyed using automatic workflow interfaces connected with the electronic medical record as a central hub for coordination among endocrinologists, radiologists and other care team members. Looking ahead, we envision the deployment of these automated workflow pipelines to facilitate real-time analytics that endocrinologists can access rapidly at the bedside via smartphone-based imaging viewing platforms or portable imaging devices.
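The step of uncoupling imaging data from protected health information can be illustrated with a deliberately simplified sketch. It treats image metadata as a plain Python dictionary, removes a short list of identifying fields and attaches a salted hash as a pseudonym so that scans from the same patient can still be linked. The field list, the `PseudonymID` key and the salt handling are assumptions for this example; it is not the mechanism used by XNAT or DIANA, and real de-identification must cover many more attributes as well as text burned into the pixels.

```python
import hashlib

# Illustrative subset of identifying fields; a real profile is far longer.
PHI_FIELDS = {"PatientName", "PatientID", "PatientBirthDate", "PatientAddress"}

def deidentify(metadata, salt):
    """Drop identifying fields and add a salted-hash pseudonym for linkage."""
    clean = {key: value for key, value in metadata.items() if key not in PHI_FIELDS}
    token = hashlib.sha256((salt + str(metadata["PatientID"])).encode("utf-8")).hexdigest()
    clean["PseudonymID"] = token[:16]  # hypothetical key, not a standard attribute
    return clean

# Hypothetical header for a thyroid ultrasound study.
header = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "PatientBirthDate": "19700101",
    "Modality": "US",
    "BodyPartExamined": "THYROID",
}
safe = deidentify(header, salt="site-secret")
```

The salt would be held by the originating institution so that pseudonyms cannot be regenerated from a known patient identifier elsewhere.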

Data augmentation and transfer learning
Data pre-processing and model pre-training techniques can also be used to engineer workaround solutions to limited imaging data in order to improve AI model generalizability 32 . Data augmentation is a process that distorts the training images via oversampling to generate synthetic data 158,159 . Another popular option in computer vision for treating small sample sizes involves pre-training of the model with a large and diverse image set to transfer preliminary weights to nodes in the network, after which fine-tuning of the model is performed using the target data 160,161 . Although these augmentation and transfer learning (Box 1) methods are now becoming staples in medical image informatics research, they were not used in a number of the endocrine cancer studies that we reviewed. Looking ahead, we anticipate that improved uptake of these methods will promote deep learning breakthroughs, particularly in the cases of rare neoplasms with limited availability of imaging data such as those of the adrenal gland and endocrine pancreas.
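As a minimal sketch of data augmentation, the code below produces several distorted copies of each training image using random flips, 90° rotations and small intensity changes. The specific transforms, their ranges and the assumption of square images scaled to [0, 1] are illustrative choices; in practice, transforms should be restricted to those that preserve the clinical meaning of the image for the modality in question.

```python
import numpy as np

def augment(image, rng):
    """Return one randomly distorted copy of a square image scaled to [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                        # random horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0, 90, 180 or 270 degrees
    scale = rng.uniform(0.9, 1.1)                   # mild brightness change
    return np.clip(out * scale, 0.0, 1.0)

def oversample(images, copies, seed=0):
    """Generate `copies` augmented variants of every image."""
    rng = np.random.default_rng(seed)
    return [augment(img, rng) for img in images for _ in range(copies)]

rng = np.random.default_rng(1)
originals = [rng.random((8, 8)) for _ in range(2)]  # stand-ins for two scans
augmented = oversample(originals, copies=5)
```

Augmented copies should be generated only from the training set, after the train/test split, so that variants of a test image never reach the model during training.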

Alternative computing platforms
The organization of network servers used to access, store and transfer data can influence AI model training and development (Fig. 5). In this section, we draw attention to how exploratory computing frameworks might be leveraged to improve the quality of AI applications for endocrine cancer diagnostics.

Decentralized or distributed
Information technology infrastructures are trending towards Cloud computing (Box 1) solutions that consolidate data operations within a central server. However, computing platforms with diffuse servers are now being explored to circumvent data-sharing issues associated with centralized data that pose barriers to the multi-institutional, collaborative training of AI models. Distributed networks process data diffusely across local nodes, whereas decentralized platforms operate as collectives of nodal clusters (Fig. 5).

Fig. 4 | ... group that provides a programming interface with the hospital picture archiving and communications systems (PACS) to streamline clinical artificial intelligence (AI) research 176 . DIANA has facilitated near-real-time monitoring of acquired images, large data queries and post-processing analyses. More importantly, DIANA is integrated with the machine learning algorithms developed for various applications. The future goal is to integrate AI endocrine cancer diagnostics (such as adrenal adenoma and pituitary adenoma) in this or other systems. HTTP, hypertext transfer protocol; PHI, protected health information. Figure 4 is adapted from ref. 176 , Springer Nature Limited.
We highlight the potential of an emerging decentralized training paradigm known as federated learning (Box 1), which is already being utilized to enable deep AI models to be developed for diabetic retinopathy and breast cancer diagnosis 162 . Federated learning uses distributed servers across multiple institutions for parallel model training and model updates are subsequently loaded onto a central server to develop an ensemble model. Distributed learning techniques, such as cyclic weight transfer, can conduct this process across local servers in series, using one model passed from institution to institution over the course of training 163 . Importantly, these techniques do not require inter-institutional patient data transfer or co-location. We can similarly envision a role for decentralized and distributed techniques in bypassing current barriers to data sharing and availability to enable deep learning applications in oncologic endocrinology, particularly in rare cancers. However, a notable limitation in current federated learning techniques is that the diversity of data is only as robust as that of the collaborating institutions. Still, past efforts have yielded deep learning models with impressive performance on par with those from shared multi-institutional datasets [162][163][164] .
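The two training schemes described above can be contrasted in a small NumPy sketch. Three simulated institutions each hold private data for a simple linear model; in the federated scheme every site trains in parallel and a central server averages the returned weights, whereas in cyclic weight transfer one model is passed from site to site in series. The linear model, the size-weighted averaging rule, the learning rate and the synthetic data are all assumptions for illustration; the cited studies used deep networks and may aggregate updates differently.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=20):
    """Gradient descent on mean squared error using one site's private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, sites):
    """Sites train in parallel; the server averages weights by sample count."""
    updates = [local_update(global_w, X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites], dtype=float)
    return np.average(updates, axis=0, weights=sizes)

def cyclic_weight_transfer(w, sites, cycles=1):
    """One model visits each site in turn; only weights leave a site."""
    for _ in range(cycles):
        for X, y in sites:
            w = local_update(w, X, y)
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
sites = []
for n in (40, 60, 100):  # three institutions with different cohort sizes
    X = rng.standard_normal((n, 2))
    y = X @ true_w + 0.01 * rng.standard_normal(n)
    sites.append((X, y))

w_fed = np.zeros(2)
for _ in range(10):
    w_fed = federated_round(w_fed, sites)
w_cyc = cyclic_weight_transfer(np.zeros(2), sites, cycles=10)
```

In both schemes only model weights cross institutional boundaries; the arrays `X` and `y` never leave the site that owns them.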

Quantum
Other breakthroughs in machine intelligence in medicine will come with shifts in computing frameworks that can enhance model training and efficiency. Quantum computing (Box 1) represents one emerging prospect that would leverage the physical properties of atomic and subatomic particles to enhance processing power, algorithm performance and data transfer 165 . Quantum computers can theoretically support the simultaneous, parallel-path processing of data to create shortcuts that might outperform conventional computing 165 (Fig. 5). Encouraging scientific breakthroughs over the past 5 years demonstrated 'quantum supremacy' in terms of problem-solving capabilities over conventional computing, although these findings are still very much exploratory 166 . However, some experts anticipate that usable quantum computing could arrive as early as within the next few decades 167 .

Conclusions
Machine intelligence continues to gain traction in oncologic endocrinology for its potential to enable robust non-invasive diagnostics. However, for these technologies to take hold, adherence to both consensus reporting standards and evaluation criteria in AI image interpretation is required, which will enable meaningful cross-study comparisons. Although several such AI guidelines have been established [137][138][139] , a lack of harmonization impedes their widespread uptake. Another challenge will be facilitating the smooth movement of these technologies into the clinical setting so that physicians embrace their use. Clarity at the federal and institutional levels is urgently needed in terms of developing longitudinal performance auditing, medicolegal liability frameworks and guidance on reimbursements for clinical AI developers and medical institutions utilizing these technologies.
Another theme is how poor data availability continues to stymie the development of robust machine learning applications, particularly in rare endocrine cancers. Although access to medical imaging is improving through open data-sharing initiatives, we still note a relative paucity of endocrine cancer scans within these larger imaging databases. We encourage the creation of domain-specific imaging databases that can better enable AI for oncologic endocrinology purposes.
Collaborative learning strategies might also take centre stage given their potential to circumvent data access issues without transferring personal health information. Future work on distributed computing paradigms also needs to consider how to best manage data and potential cyber risks, as the surface area vulnerable to cyberattack grows with the number of participants. Digital health could also enable future breakthroughs such as via correlation of radiomics findings with wearables or digital health application data 168 . Finally, the advent of smartphone imaging viewing platforms and automated workflows will bring the field closer to smooth, real-time analytics that can enable robust partnerships among endocrinologists, radiologists and AI.
Published online 9 November 2021

Fig. 5 | Centralized, distributed, decentralized and quantum computing frameworks are shown. The centralized network panel has a node with spokes spreading outward that represents a single, consolidated platform such as a local (on-site data centre) or remote (Cloud) server. The distributed network panel shows a net-like pattern of equally spaced nodes and such a platform with multiple local servers or devices can be used for collaborative model training techniques like cyclic weight transfer. The decentralized network panel has multiple centralized nodes connected in a net-like pattern and federated learning is one training paradigm that uses this platform. Previous studies 177,178 depicted quantum networks as two nodes with the cutout region between nodes illustrating the induction of dependent quantum states among two particles (A and B, where S refers to a shared source of squeezed light) and this particle 'entanglement' is at the crux of quantum communications. Adapted from ref. 177 , Springer Nature Limited. The quantum network is reprinted from ref. 178 , CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

www.nature.com/nrendo