Large language models streamline automated machine learning for clinical studies

A knowledge gap persists between machine learning (ML) developers (e.g., data scientists) and practitioners (e.g., clinicians), hampering the full utilization of ML for clinical data analysis. We investigated the potential of the ChatGPT Advanced Data Analysis (ADA), an extension of GPT-4, to bridge this gap and perform ML analyses efficiently. Real-world clinical datasets and study details from large trials across various medical specialties were presented to ChatGPT ADA without specific guidance. ChatGPT ADA autonomously developed state-of-the-art ML models based on the original study’s training data to predict clinical outcomes such as cancer development, cancer progression, disease complications, or biomarkers such as pathogenic gene sequences. Following the re-implementation and optimization of the published models, the head-to-head comparison of the ChatGPT ADA-crafted ML models and their respective manually crafted counterparts revealed no significant differences in traditional performance metrics (p ≥ 0.072). Strikingly, the ChatGPT ADA-crafted ML models often outperformed their counterparts. In conclusion, ChatGPT ADA offers a promising avenue to democratize ML in medicine by simplifying complex data analyses, yet should enhance, not replace, specialized training and resources, to promote broader applications in medical research and practice.


Introduction
Machine learning (ML) drives advancements in artificial intelligence and is about to transform medical research and practice, especially in diagnosis and outcome prediction 1,2. Recently, the adoption of ML for analyzing clinical data has expanded rapidly. Today, ML models have an established and evolving role in various areas of public health and medicine, spanning image analysis, public health, clinical-trial performance, and operational organization 2. ML models are used in variable contexts such as augmenting medical knowledge, assisting clinicians, or taking on administrative tasks 3. Several developments, such as increases in (i) available data generated during clinical care, (ii) available computational processing capacities, and (iii) research activities, favor the more widespread future utilization of ML models in medicine 4. However, the complexity of developing, implementing, and validating those models renders them inaccessible to most clinicians and medical researchers 5. It also limits their utilization to those people or groups that combine expertise in medicine and data science. Automated Machine Learning (AutoML) is an established discipline that aims to make ML accessible to non-technical experts. In medicine, the feasibility and use of AutoML platforms, such as the Classification Learner of MATLAB (MathWorks Inc.), Vertex AI (Google LLC), and Azure (Microsoft Corporation), have been demonstrated 6–11, enabling non-technical experts to create ML models. These software solutions automate algorithm training and fine-tuning by providing dedicated interfaces to build and run a particular ML model. The user needs to direct the software to the desired output. So far, however, models using natural language commands and their conversion to Python code have not been implemented.
Powerful large language models (LLMs) 12, such as ChatGPT's latest version, GPT-4 13 (Generative Pre-Trained Transformer-4, OpenAI, CA, US), expand the repertoire of AutoML platforms by offering an easily accessible option to the user 14,15. While conversing with humans in plain language, LLMs can reason and perform logical deduction. Recently, the ChatGPT Advanced Data Analysis (ADA), formerly known as ChatGPT Code Interpreter, has been made available as an extension and beta feature that may be used to analyze data and math problems, create charts, and write, execute, and refine computer code 16. Instructing ChatGPT ADA can be straightforward, such as "Analyze this patient data and build a machine learning model predicting 12-month mortality rates". Given this prompt, ChatGPT ADA will execute the task and provide feedback on the procedure. However, its validity and reliability in advanced data processing and analysis for large clinical trials have not yet been evaluated.
Our objective was to study the validity and reliability of ChatGPT ADA in autonomously developing and implementing ML methods. We included real-world datasets from four large clinical trials of various medical specialties that applied ML models for advanced data analysis (Figure 1). We hypothesized that (i) ChatGPT ADA may be used intuitively and does not require prior training, resources, and guidance in ML theory and practice to implement advanced ML methods efficiently and accurately and that (ii) the results of these implementations match those of specialized data scientists. We provide evidence that advanced LLMs like ChatGPT ADA simplify complex ML methods, increasing their accessibility in medicine and beyond.

Figure 1: Real-world datasets and study details from four large clinical trials were collected and input into the ChatGPT Advanced Data Analysis (ADA) tool. The tool autonomously selected the appropriate machine-learning models for the analysis following prompting. The models were expert-checked and comprehensively evaluated. The ChatGPT ADA-based predictions were compared to the original studies (benchmark publication) and the validatory predictions following the re-implementation of the models.

Results
Across four large clinical trial datasets, ChatGPT ADA autonomously formulated and executed advanced ML techniques for disease screening and prediction. Its performance matched the hand-crafted and customized ML methods re-implemented based on the original studies. Figure 2 illustrates an exemplary interaction with ChatGPT ADA, highlighting the prompts and responses for autonomous prediction, supplemented by subsequent queries on the rationale and parameters underlying its ML model selection. Supplementary Figure S1 additionally shows the intermediary Python code generated during this interaction.
After briefly summarizing each clinical trial and associated dataset, we compare the ML methods head-to-head for each trial. We include ML methods developed and executed by ChatGPT ADA against the performance metrics of the originally published ML methods (as reported in the original studies) and the validatory ML methods (as re-implemented by a seasoned data scientist, S.T.A., with five years of experience in ML). Because individual patient predictions were unavailable in the original studies, the best-performing ML methods of the original studies were re-implemented. We conclude our analysis by presenting the explainability metrics determined by ChatGPT ADA and confirmed by our re-implementation.

Figure 2: The model also displays deeper insights in response to follow-up queries about the reasoning and parameters guiding its choices. Note: The "Show work" option visible in the images allows users to view the intermediary Python code offered by the tool. Due to space constraints, the intermediary Python code for this exemplary interaction is presented in Supplementary Figure S1.
Metastatic Disease [Endocrinologic Oncology] - Predicting Metastatic Disease in Pheochromocytoma and Paraganglioma

Pamporaki et al. utilized cross-sectional cohort data from the US, Germany, Poland, and the Netherlands, and employed ML methods to predict metastatic disease in patients diagnosed with pheochromocytoma or paraganglioma using blood test results 17. These tumors are referred to as the 'great masquerader' because of their unspecific clinical presentation secondary to largely variable catecholamine excess, which poses diagnostic challenges 17. The original study's training and test set cohorts comprised 493 and 295 patients (Table 1). Using predictions by 12 clinical experts as their reference, the authors implemented multiple supervised ML models, i.e., the decision tree classifier, support vector machine, Naïve Bayes, and AdaBoost 18 ensemble tree classifier. In the original study, the latter model performed best and significantly outperformed the clinical care specialists, with an area under the receiver operating characteristic curve (AUROC) of 0.942 versus 0.815 (best-performing clinical expert, P<0.001). Using the same (training and test) dataset distribution as the original study but withholding specific guidance on data preprocessing or ML methodology, we prompted ChatGPT ADA to predict metastatic disease in the test set. ChatGPT ADA selected a Gradient Boosting Machine (GBM) 19 model for its prediction and achieved a slightly improved performance relative to its best-performing published counterpart in terms of AUROC values (0.949 vs. 0.942), accuracy (0.922 vs. 0.907), and F1-scores (0.806 vs. 0.755) (Table 2). The entire conversation with ChatGPT ADA regarding prompts and responses is detailed in Supplementary File S1.

Oesophageal Cancer [Gastrointestinal Oncology] -Predicting Carcinoma of the Distal Oesophagus and Oesophagogastric Junction
Gao et al. used sponge cytology testing and epidemiologic data to screen for oesophageal squamous cell carcinoma and adenocarcinoma of the oesophagogastric junction 20. The authors obtained multicohort data from 14,597 participants in China (Table 1) to design six ML models, i.e., logistic regression, adaptive boosting, Light Gradient Boosting Machine (LightGBM) 21, extreme gradient boosting, Random Forest (RF) 22, and support vector machine 23, to predict high-grade intraepithelial neoplasia and carcinoma based on 105 cytologic and 15 epidemiologic features. The best-performing model was the LightGBM, which achieved an AUROC value of 0.960 in the test set. In contrast, ChatGPT ADA selected the GBM and outperformed the original model at an AUROC value of 0.979 (Table 2). Supplementary File S2 details the entire conversation with ChatGPT ADA for this dataset.

Cardiac Amyloidosis [Cardiology] -Predicting the Cardiomyopathies
Huda et al. attempted to identify patients at risk of cardiac amyloidosis, a now treatable condition predisposing to heart failure, using various cohorts and established medical diagnoses retrieved from health records 25. Using data from 2,142 patients and controls (Table 1), the authors designed three ML models, i.e., logistic regression, extreme gradient boosting, and RF classifier. They found the latter ML model to perform best (AUROC value of 0.930 [internal validation set]). Because the external validation dataset was not publicly available, we used the original study's internal validation set to prompt ChatGPT ADA as above. The ChatGPT ADA-selected predictive model, i.e., the RF classifier, outperformed the original model regarding the AUROC (0.954) and the other performance metrics (Table 2). Supplementary File S4 details the entire conversation with ChatGPT ADA for this dataset.

Explainability Analysis
The interpretation of model predictions, especially in situations demanding transparency and trust, relies on our capacity to grasp the importance of individual features. To study the ability of ChatGPT to provide metrics of explainability, we utilized the SHapley Additive exPlanations (SHAP) 26 analysis that helps quantify each feature's contributions to a model's predictions. This approach enhances our understanding of the model's decision-making process. We instructed ChatGPT ADA to perform the SHAP analysis autonomously without providing specific guidance.
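The study's SHAP analysis relies on the third-party `shap` package. As a dependency-light sketch of the same underlying idea (attributing a model's predictions to individual features), the snippet below uses scikit-learn's model-agnostic permutation importance on a synthetic stand-in dataset; all data and feature indices here are illustrative, not the study's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a clinical table: 200 patients, 8 features,
# of which only 3 carry signal.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: shuffle one feature at a time and measure the
# drop in test-set AUROC; larger drops indicate more influential features.
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=20, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking[:3]:
    print(f"feature {idx}: mean importance {result.importances_mean[idx]:.3f}")
```

Unlike SHAP, permutation importance yields one global score per feature rather than per-patient attributions, but it requires no extra dependency.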

Discussion
The availability of LLMs for advanced data processing 27,28, specifically those with the capacity to write, execute, and refine code like ChatGPT ADA, marks a pivotal shift in the convergence of data science and clinical research and practice. Our investigation of four large clinical trials underscores the potential of these tools to simplify complex ML methods and increase their accessibility in medicine and beyond. If implemented with due diligence, these tools enhance, not replace, specialized training and resources, democratizing access to advanced data processing and, potentially, revolutionizing data-driven medicine.
While ML and "Big Data" are touted as revolutionizing healthcare 29, clinicians regularly deal with too many patients in too little time 30. Yet, they make hundreds of decisions each day that are primarily based on eminence and not on published data or studies 31. Consequently, a valid and reliable tool that automates data processing may decentralize the monopoly of evidence held by specialized institutions. While clinicians remain at the center of patient care, ML methods can assist their expertise, e.g., by identifying at-risk patients for specific conditions based on electronic health records or by analyzing complex datasets such as genomic sequences. Intentionally, we designed our study to include variable data types such as clinical data, demographic data, cytologic data, genetic sequencing data, International Classification of Disease codes, and laboratory values from clinical trials spanning endocrinology, gastrointestinal oncology, genetics, and cardiology. Beyond empowering clinicians to use clinical data to their patients' advantage, utilizing LLMs for advanced data analysis provides a less costly and more efficient alternative to hand-crafted ML models 32.
In assessing accuracy, validity, and reliability, our study utilized datasets from original studies to gauge the robustness of predictions on previously unseen data. External validation is paramount in evaluating the model and its broader applicability. However, it is worth noting that external validation was absent for the Cardiac Amyloidosis dataset. Across various datasets, models chosen by ChatGPT ADA consistently demonstrated performance on par with, or exceeding, the hand-crafted ML methods used in the original studies. When comparing performance metrics, i.e., AUROC, accuracy, F1-score, sensitivity, and specificity, no significant differences were found between the optimized models re-implemented by our data scientist and those autonomously chosen by ChatGPT ADA. This observation demonstrates ChatGPT ADA's ability to select, train, and refine suitable and performant ML models autonomously.
We also assessed the risk of overfitting, which occurs when a model excels on training data but may not generalize well to unseen data. When evaluating the model's execution across training and validation datasets for each clinical trial, we were surprised to find that ChatGPT ADA had implemented strategies to increase model robustness and generalizability, like regularization, model selection based on validation, and choosing simpler models. However, even though these strategies may be helpful, users should still regularly check performance metrics for signs of overfitting.
Per the design of our prompting strategy, we did not ask ChatGPT ADA for specific explanations of why it selected a particular ML model. However, the tool displayed a surprisingly deep understanding of the clinical trial and appropriate analysis methods. For most clinical trial datasets, ChatGPT ADA employed a median imputation strategy. In contrast, it used a zero-imputation strategy for the Hereditary Hearing Loss dataset. When asked to explain, ChatGPT ADA indicated that its choice was informed by the dataset's inherent characteristics and semantics: "The data represents the presence (1) or absence (0) of certain genetic variants in patients. Given this binary representation, median imputation (which would yield either 0 or 1) might introduce bias. In genetic studies, undetected or missing variants are often interpreted as absent, making zero-imputation align with this understanding". This statement is supported by pertinent literature 33. By contrast, our seasoned data scientist, with a limited command of genetics, used median imputation, underscoring the value of domain-specific knowledge when setting up domain-specific ML methods.
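The difference between the two imputation strategies can be made concrete with a toy genotype matrix; the values below are invented purely for illustration. The strategies diverge whenever a variant is present in most patients, in which case median imputation fills in 1 (present) while zero-imputation always assumes absence.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy genotype matrix: 1 = variant present, 0 = absent, np.nan = not assessed.
# Column 0 has the variant in most patients; column 1 does not.
X = np.array([[1.0, 0.0],
              [np.nan, 0.0],
              [1.0, np.nan],
              [0.0, 0.0]])

median_imp = SimpleImputer(strategy="median").fit_transform(X)
zero_imp = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# Column 0: observed values are 1, 1, 0, so the median is 1; the missing
# entry becomes 1 under median imputation but 0 under zero-imputation.
```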
We did not find signs of "hallucinations", i.e., factually erroneous responses 27,34. Critically, the authors performed the statistical analysis step-by-step to ascertain the accuracy, reliability, and consistency of the model's outputs. Specific safeguarding measures, such as the provision of intermediary Python code throughout the different phases of building and executing the ML model, have been implemented by the manufacturer and improve comprehensibility and transparency. If coupled with more general safeguarding measures, e.g., increasing user awareness of hallucinations, enhancing the LLM's inherent robustness 35, and implementing regular audits and quality checks, the tool's validity and reliability can be ascertained 36.
Regarding ease of use, ChatGPT ADA substantially reduces the complexity of developing and implementing ML methods by taking tabular data, suggesting how to deal with it, building the model, and outputting the results in the desired format. Not least due to its ability to communicate with the user, the tool offers a natural and effective way to work with ML models. At the same time, the automation simplifies the associated workflow. However, as with any innovation, utilizing LLMs in clinical research and practice has multifaceted implications, from data privacy to data security to model interpretability, reliability, and associated ethical concerns 37–41. Upholding patient data privacy seems particularly challenging as, on the one hand, users may be enticed to disclose confidential (or proprietary) information, let alone sensitive personal data such as race and ethnicity, to use the model most efficiently. On the other hand, OpenAI continuously trains the model using earlier user interactions, including prompts, which are retained as part of the ever-enlarging training data and cannot be deleted. Consequently, it is the user's responsibility to weigh the tool's advantages and benefits against its disadvantages and risks.
ChatGPT ADA, as a tool, democratizes access to advanced ML methods, enabling clinicians and researchers of all backgrounds to harness its capabilities. Besides being a potential cornerstone for their broader utilization in clinical research and practice 42, the improved accessibility holds the potential of (i) accelerating medical research, (ii) confirming or contradicting earlier research, and (iii) improving patient care. However, when using the tool more widely, several potential challenges and limitations must be acknowledged. First, the tool's commercial and proprietary distribution is concerning because (i) its 'black-box' nature limits transparency and may reduce trust in its outputs 43, (ii) commercial bias may be in opposition to the idealized concept of unbiased scientific or clinical deliberation, and (iii) algorithmic bias secondary to the model's potentially skewed foundational data may perpetuate unbalanced outcomes, for example, by not adequately representing those patients who had been underrepresented in the foundational data 44. In the absence of benchmark publications for comparison, users must be more vigilant in ascertaining accuracy and reliability, for example, by seeking external validation whenever possible.
Regarding transparency and trust, we conducted a SHapley Additive exPlanations (SHAP) analysis 45 to better understand how ChatGPT ADA works on and with the respective datasets. The tool successfully identified and plausibly quantified the importance of numerous variables across the trials. For instance, its predictions centered on sex, age, and laboratory values (Metastatic Disease [Endocrinologic Oncology] dataset), specific cytologic features such as the presence of atypical glandular cells and nuclear width features (Oesophageal Cancer [Gastrointestinal Oncology] dataset), specific gene variants such as c.235delC (Hereditary Hearing Loss [Otolaryngology] dataset), and the previous history of (diagnosed) cardiomyopathy (Cardiac Amyloidosis [Cardiology] dataset). The in-built ability to autonomously extract key features contributing to the model's predictions increases transparency, improves understanding, and furthers trust in ChatGPT ADA 46.
Our study has limitations: First, while our study indicates that ChatGPT ADA is a formidable tool and technologically ready for well-curated datasets, it remains to be seen how it performs on other, less well-curated datasets that require more complex statistical methods, e.g., refined imputation strategies. Second, given their publication in 2021, we cannot exclude the possibility that two original studies 24,25 were part of the model's training data, the collection of which concluded in 2021. Given the large sizes of the included datasets, consisting of hundreds to thousands of patients, previous publications on the same dataset (or a specific subset), e.g., 47, may have been included as part of the model's training data. Third, even though ChatGPT ADA and the original studies implemented the same model, we found different performance metrics, e.g., for the Cardiac Amyloidosis dataset, where random forest classifiers were implemented. Possible sources of variability are the specific approaches used for data pre-processing, dataset splitting, model configurations, and hyperparameter selection. Despite our best efforts to standardize each model's implementation and execution, inter-model comparability is inherently limited. Fourth, because the LLM's response is closely related to how it is prompted 48, it is unclear whether the performance metrics are subject to change if the model is prompted differently. Consequently, our work represents a mere starting point for exploring the potential of LLMs in clinical research and practice. Future research must validate our findings across different medical domains.
In conclusion, advanced LLMs like ChatGPT ADA are a potentially transformative step forward in data-driven medicine, making intricate ML methods more accessible. By way of example, our study demonstrates that such tools may streamline advanced data analyses for researchers, both experienced and inexperienced in ML, and hold the potential to substantially reduce the burden of data pre-processing and model optimization. Given the tools' novelty, limitations, and challenges, they should be applied as enhancements to specialized training, resources, and guidance, not substitutes. Nonetheless, in this dawning era of data-driven medicine, these tools may bridge the chasm between complex ML methods and their practical application in medical research and practice.

Ethics Statement
The methods were performed in accordance with relevant guidelines and regulations and approved by the ethical committee of the Medical Faculty of RWTH Aachen University for this retrospective study (Reference No. EK 028/19).

Patient Cohorts
The patient datasets were retrieved from public repositories as indicated in the original studies on metastatic disease prediction 17 , oesophageal cancer screening 20 , hereditary hearing loss 24 , and cardiac amyloidosis 25 .
In the included Endocrinologic Oncology study 17, cross-sectional data from Germany, Poland, the US, and the Netherlands was used to assess the ability of the dopamine metabolite methoxytyramine to identify metastatic disease in patients with pheochromocytoma or paraganglioma. To this end, ten features were available. The included Oesophageal Cancer study 20 from China was centered on endoscopic screening and included multiple data sources from questionnaires to endoscopy data, i.e., cytologic and epidemiologic data.
The included Hereditary Hearing Loss study 24 contained genetic sequencing data to diagnose this condition in a Chinese cohort. Individuals were categorized based on hearing loss severity and variations in three genes (GJB2, SLC26A4, MT-RNR1).
The included Cardiac Amyloidosis study 25 utilized electronic health records to identify patients with cardiac amyloidosis from a dataset spanning 2008-2019, sourced from IQVIA, Inc., focusing on heart failure and amyloidosis. While the original study used external datasets for validation, these were inaccessible. Therefore, our analysis adhered to the original study's internal validation strategy: 80% as the training set and 20% for testing, resulting in 1,712 individuals for training and 430 for testing. For further information on the individual datasets, the reader is referred to Table 1 or the original studies.
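The 80/20 internal split described above can be reproduced with scikit-learn's `train_test_split`; the feature values below are random placeholders, and the exact rounding of the split may differ by a patient or two from the study's 1,712/430 counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 2,142 individuals with binary features and labels, as in the Cardiac
# Amyloidosis cohort; the values are random placeholders, not study data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2142, 10))
y = rng.integers(0, 2, size=2142)

# 80/20 split; stratify=y keeps the class balance equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```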

Experimental Design
We extracted the original training and test datasets from each clinical trial. All datasets were available in tabular format, albeit in various file formats such as comma-separated values or Excel (Microsoft Corporation). No modifications to the data format, specific data pre-processing or engineering, or selecting a particular ML method were necessary to prompt ChatGPT ADA. ChatGPT ADA (GPT-4, September 25 version) was used. A new chat session was started for each trial to exclude memory retention bias.
In the first phase, ChatGPT ADA was sequentially prompted by (i) providing a brief description of the study's background, objectives, and dataset availability, (ii) asking for developing, refining, and executing the optimal ML model based on the individual study's framework and design, and (iii) producing patient-specific predictions (classification probabilities) without revealing the ground truth. The same training and test datasets as in the original studies were used. We deliberately refrained from offering specific ML-related guidance when ChatGPT sought advice on improving prediction accuracy. Instead, ChatGPT ADA was tasked with (i) autonomously choosing the most suitable and precise ML model for the given dataset and (ii) generating predictions for the test data. Figure 2 and Supplementary Figure S1 provide an exemplary interaction with the model. Using the provided ground-truth test set labels, we calculated the performance metrics for ChatGPT ADA's results in Python (v3.9) with open-source libraries such as NumPy, SciPy, scikit-learn, and pandas.
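The performance metrics used throughout (AUROC, accuracy, F1-score, sensitivity, specificity) can be computed from ground-truth labels and classification probabilities roughly as follows; the labels and probabilities here are invented toy values, not study data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

# Toy test set: ground-truth labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.7, 0.4, 0.9, 0.6, 0.35, 0.75])
y_pred = (y_prob >= 0.5).astype(int)  # binarize at the conventional 0.5 cutoff

auroc = roc_auc_score(y_true, y_prob)    # ranking quality, threshold-free
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Sensitivity and specificity from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # recall for the positive class
specificity = tn / (tn + fp)
```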
The performance metrics were compared against those published in the original studies ("benchmark publication"). In some clinical trials, the clinical care specialists' performance was also reported, and these metrics were included for comparison. Notably, inputting and analyzing each dataset through ChatGPT ADA took less than five minutes. Detailed transcripts of the interactions with ChatGPT ADA for every dataset are presented in Supplementary Files S1-S4.

Data Pre-Processing and ML Model Development
In the second phase, a seasoned data scientist re-implemented and optimized the best-performing ML model of the original studies in Python (v3.9) with open-source libraries such as NumPy, SciPy, scikit-learn, and pandas, using the same training datasets as outlined above ("benchmark validatory re-implementation"). This re-implementation and optimization was necessary because individual patient predictions were unavailable in the original studies, precluding head-to-head model comparisons and detailed statistical analyses. More specifically, the data scientist optimized the data pre-processing and the ML model in close adherence to the original studies, yet complemented by his expertise and experience while aiming for peak accuracy.
The following provides trial-specific details on the data pre-processing and the conceptualization of the specific ML models.

Metastatic Disease [Endocrinologic Oncology]:
Re-implemented (validatory) ML model: The training set contained 30 missing values, while the test set contained 15 missing values. Median values from the training set were used to impute the missing values in both datasets. Ten distinct feature vectors were constructed from the dataset variables. The feature vectors were partially categorical and partially numerical. The categorical features were: 1) previous history of pheochromocytoma or paraganglioma (yes/no), 2) adrenal/extra-adrenal location of primary tumor (adrenal/extra-adrenal), 3) presence of Succinate Dehydrogenase Complex Iron-Sulfur Subunit B (SDHB) (yes/no/not tested), 4) tumor category of primary tumor (solitary, bilateral, multifocal), and 5) sex (female/male). The numerical features were: 1) age at diagnosis of first tumor [years], 2) spherical volume of primary tumor [cm³], 3) plasma concentration of metanephrine (MN) [pg/ml], 4) plasma concentration of normetanephrine (NMN) [pg/ml], and 5) plasma concentration of methoxytyramine (MTY) [pg/ml]. Categorical data were translated into numerical integer values, e.g., female (0) and male (1) for sex. An Adaptive Boosting (AdaBoost) 18 ensemble tree classifier was employed and optimized using a 10-fold cross-validation grid search. This optimization led to selecting parameters like a maximum depth of 2 for individual decision trees, a count of 200 trees, and a learning rate of 0.01. Stagewise additive modeling was chosen, utilizing a multiclass exponential loss function.
ChatGPT ADA-crafted ML model: A check for missing data mirrored the findings above, leading the model to resort to a median imputation strategy. Numerical data were standardized using standard scaling, while categorical data were converted to integer values. The selected classification technique was a Gradient Boosting Machine (GBM) 19 with parameters set as follows: maximum tree depth: 3, number of trees: 100, minimum samples per leaf: 1, minimum samples for split: 2, and learning rate: 0.1. The logarithmic loss function was the chosen evaluation metric, with the quality of splits being evaluated using the Friedman mean squared error 49. No validation dataset was incorporated, and the model was not subjected to any specific regularization techniques.
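The pipeline described above (median imputation, standard scaling, then a GBM with the stated hyperparameters) can be sketched with scikit-learn as follows; the data is synthetic, and scikit-learn's `GradientBoostingClassifier` uses log-loss and the Friedman mean squared error split criterion by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
# Knock out ~2% of entries to emulate the missing values described above.
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.02] = np.nan

# Median imputation, standard scaling, then a GBM with the reported settings.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    GradientBoostingClassifier(max_depth=3, n_estimators=100,
                               min_samples_leaf=1, min_samples_split=2,
                               learning_rate=0.1, random_state=0),
)
model.fit(X, y)
proba = model.predict_proba(X[:5])[:, 1]  # classification probabilities
```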

Oesophageal Cancer [Gastrointestinal Oncology]:
Re-implemented (validatory) ML model: The training dataset included 147 feature vectors, whereas the test dataset included 169. A comprehensive list of the feature vectors can be found in the literature 20. Excess feature vectors in the test set were excluded to maintain consistency, aligning it with the training dataset. Consequently, neither the training nor the test datasets contained missing values. Categorical data were mapped to numerical integer values. Imbalanced dataset distributions were addressed by conferring inverse frequency weights upon the data. In line with the original study, the data scientist selected the Light Gradient Boosting Machine (LightGBM) 21 with the gradient boosting decision tree algorithm. The configuration for the classifier was as follows: an unspecified maximum tree depth, 300 trees, a cap of 31 leaves per tree, and a 0.1 learning rate. The logarithmic loss function served as the evaluation metric. The model integrated both L1 and L2 regularization techniques.
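The inverse frequency weighting mentioned above can be sketched with scikit-learn's `compute_sample_weight`; LightGBM itself requires the separate `lightgbm` package, so only the weighting step is shown here, on invented labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Imbalanced toy labels: 90 controls, 10 cases.
y = np.array([0] * 90 + [1] * 10)

# "balanced" assigns each sample n_samples / (n_classes * class_count),
# i.e. inverse class-frequency weights: rare classes get larger weights.
w = compute_sample_weight("balanced", y)
# Controls: 100 / (2 * 90) ≈ 0.556; cases: 100 / (2 * 10) = 5.0.
```

The resulting array would then be passed to the classifier's `fit` via its `sample_weight` argument.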
ChatGPT ADA-crafted ML model: The pre-processing mirrored the approach above, identifying a class imbalance. The selected classifier was the GBM with parameters including a maximum tree depth of 3, 100 trees, minimum samples per leaf of 1, minimum samples for a split of 2, and a learning rate of 0.1. The model's performance was assessed using the logarithmic loss function, with the quality of tree splits evaluated using the Friedman mean squared error. No validation dataset was incorporated, and the model was not subjected to any specific regularization techniques.

Hereditary Hearing Loss [Otolaryngology]:
Re-implemented (validatory) ML model: The training and test sets included 144 feature vectors, i.e., sequence variants at 144 sites in three genes 24. The values of the training set were numerical, i.e., 0 (individual has no copies of the altered allele [98.2% of the values]), 1 (individual has one copy of the altered allele [1.6%]), and 2 (individual has two copies of the altered allele [0.2%]), while only one value was missing. The values of the test set were numerical, too, with a similar distribution: 0 (98.3%), 1 (1.5%), and 2 (0.2%), while no values were missing. Missing data points were addressed by imputing the median of the training data. All feature vectors were then subject to MinMax scaling. A Support Vector Machine 23 was the best-performing classifier per the original study, configured with the Radial Basis Function kernel, gamma set to 1, and enabled shrinking. Model optimization leveraged a 5-fold stratified cross-validation using grid search. The regularization cost parameter was defined at 100.
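A sketch of this setup (MinMax scaling followed by an RBF-kernel SVM tuned by 5-fold stratified grid search around the reported gamma of 1 and cost parameter of 100) on synthetic data might look as follows; the grid values beyond the reported ones are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 144-site genotype table.
X, y = make_classification(n_samples=200, n_features=20, random_state=2)

pipe = make_pipeline(MinMaxScaler(),
                     SVC(kernel="rbf", shrinking=True))

# 5-fold stratified grid search around the reported gamma=1 and C=100.
grid = {"svc__gamma": [0.1, 1], "svc__C": [10, 100]}
search = GridSearchCV(pipe, grid, cv=StratifiedKFold(n_splits=5),
                      scoring="roc_auc")
search.fit(X, y)
```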
ChatGPT ADA-crafted ML model: The pre-processing was closely aligned with the methodology above, with one notable exception: missing data were addressed by zero-imputation. The chosen classifier was the Random Forest (RF) 22, with the following parameters: no explicitly defined maximum depth for individual trees, a tree count of 100, a minimum of 1 sample per leaf, and a minimum of 2 samples per split. At each split, the number of features considered was the square root of the total feature count. 5-fold cross-validation was employed without grid search. Regularization was achieved by averaging predictions across multiple trees. Bootstrapping was used to create diverse datasets for training each decision tree in the forest.

Cardiac Amyloidosis [Cardiology]:
Re-implemented (validatory) ML model: The dataset comprised 1,874 numerical (0 or 1, indicating presence or absence) feature vectors 25. There were no missing values in the dataset. The feature vectors underwent standard scaling for normalization. The classifier chosen was the RF, with the following parameters: a maximum depth of 20 for individual trees, a total of 200 trees, a minimum of 2 samples per leaf, and a minimum of 5 samples per split. For each tree split, the square root of the total feature count determined the number of features considered. A 5-fold cross-validation was combined with a grid search for optimization. Regularization was effectuated by averaging the predictions over multiple trees. The model did not utilize bootstrapping.
ChatGPT ADA-crafted ML model: As there were no missing values in the dataset and the values were binary, the data underwent no scaling or standardization. The selected classifier was the RF. Parameters for the model were as follows: an unspecified maximum depth for individual trees, a tree count of 1,000, a minimum of 1 sample per leaf, and a minimum of 2 samples per split. For each tree split, the number of features considered was the square root of the total feature count. The model was validated using 5-fold cross-validation without grid search. Regularization was achieved by averaging predictions across several trees, and the model utilized bootstrapping 22,50.
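A minimal sketch of this Random Forest configuration, assuming scikit-learn and using illustrative synthetic data in place of the clinical dataset:

```python
# Sketch of the ChatGPT ADA-crafted Random Forest: scikit-learn defaults
# except for the tree count of 1,000, with 5-fold cross-validation and no
# grid search. The synthetic data stand in for the study dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

clf = RandomForestClassifier(
    max_depth=None,       # unspecified maximum depth for individual trees
    n_estimators=1000,    # tree count of 1,000
    min_samples_leaf=1,
    min_samples_split=2,
    max_features="sqrt",  # square root of the total feature count per split
    bootstrap=True,       # bootstrap sample for each tree
    random_state=0,
)
# 5-fold cross-validation without grid search
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
```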
Because ChatGPT ADA provides all intermediary Python code during data pre-processing, ML model development, and execution, we meticulously analyzed the code for accuracy, consistency, and validity (Supplementary Figure S1).

Explainability Analysis
We used SHapley Additive exPlanations (SHAP) 26 to analyze feature contributions to the model's predictions. ChatGPT ADA was tasked with autonomously performing a SHAP analysis, narrowed down to the top 10 features. To ensure accuracy, the seasoned data scientist (S.T.A., with five years of experience) reviewed the Python code provided by ChatGPT ADA and re-implemented the procedure in Python using the SHAP library 26 with the TreeExplainer 51 to confirm the model's outputs.

Reproducibility Analysis
We evaluated the consistency of the tool's responses using separate chat sessions (to avoid memory retention bias) but with the same datasets, instructions, and prompts on three consecutive days. The model consistently reported the same responses and qualitative and quantitative findings.

Statistical Analysis and Performance Evaluation
The quantitative performance evaluation was performed using Python (v3.9) and its open-source libraries, such as NumPy and SciPy. Unless noted otherwise, performance metrics are presented as mean, standard deviation, and 95% confidence interval (CI) values. Using the published ground-truth labels from the original studies as reference ("benchmark publication"), we calculated a range of performance metrics based on ChatGPT ADA's predictions of the test set labels: AUROC, accuracy, F1-score, sensitivity, and specificity. These performance metrics are presented alongside those reported in the original studies, if available (Table 2).
Once the per-patient predictions were available following the re-implementation and optimization of the selected ML models ("benchmark validatory re-implementation"), we calculated the performance metrics outlined above using the ground-truth labels for the re-implemented (validatory) ML models and their ChatGPT ADA-based counterparts. We adopted bootstrapping 50 with replacement and 1,000 redraws on the test sets to ascertain the statistical spread (in terms of means, standard deviations, and 95% confidence intervals) and to determine whether the metrics were significantly different. We adjusted for multiple comparisons based on the false discovery rate, setting the familywise alpha threshold at 0.05. Notably, the comparative evaluation of the performance metrics was conducted in a paired manner. Bootstrapping was applied to both models. The threshold for calculating the F1-score, sensitivity, and specificity was chosen based on Youden's criterion 52.
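The paired comparison can be sketched as follows, under stated assumptions: `y_true`, `p_a`, and `p_b` are synthetic stand-ins for the test labels and the two models' predicted probabilities, and only the AUROC difference is shown; the study additionally applied a false-discovery-rate correction across metrics.

```python
# Sketch of the paired bootstrap comparison (1,000 redraws with replacement)
# and the Youden-criterion threshold. y_true, p_a, and p_b are synthetic
# stand-ins for the test labels and the two models' predicted probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
p_a = np.clip(0.45 * y_true + 0.55 * rng.random(200), 0, 1)
p_b = np.clip(0.35 * y_true + 0.65 * rng.random(200), 0, 1)

# Youden's criterion: threshold maximizing sensitivity + specificity - 1
fpr, tpr, thresholds = roc_curve(y_true, p_a)
youden_threshold = thresholds[np.argmax(tpr - fpr)]

# Paired bootstrap: resample the SAME indices for both models
diffs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))
    if len(np.unique(y_true[idx])) < 2:  # AUROC needs both classes present
        continue
    diffs.append(roc_auc_score(y_true[idx], p_a[idx])
                 - roc_auc_score(y_true[idx], p_b[idx]))
diffs = np.asarray(diffs)
ci = np.percentile(diffs, [2.5, 97.5])   # 95% CI of the AUROC difference
```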
DT, and SN.The literature research was conducted by STA, DT, and SN.JNK, CK, DT, and SN provided clinical expertise.STA, TH, ML, JNK, and DT provided technical expertise.All authors read the manuscript and agreed to the submission of this paper.

Encode categorical features. Let's first check for missing values in the training dataset.
There are missing values in some of the columns of the training dataset:
1. Age at diagnosis of the first tumor: 1 missing value
2. Plasma NMN pg/ml: 1 missing value
3. Plasma MN pg/ml: 6 missing values
4. Plasma MTY pg/ml: 1 missing value
5. Spherical volume of the primary tumor: 21 missing values
For numerical features, I will impute the missing values with the median of the respective columns, as the median is less sensitive to outliers than the mean. For categorical features, I would typically use the mode, but it seems there are no missing values in the categorical columns.
Next, I will:
1. Impute the missing values for the columns with medians.
2. Standardize the numerical features.
3. Encode the categorical features.
Let's proceed with these preprocessing steps.
The training data has been successfully preprocessed.
Next, I will:
1. Separate the features (X) from the target variable (y).
2. Train a model using the training data. For this task, I'll initially use a Gradient Boosting Classifier, which is a powerful ensemble method suitable for binary classification problems.
3. After training, preprocess the test set similarly to the training set.
4. Predict the metastatic disease for patients in the test set and save the predictions in a CSV file.
Let's start by separating the features and the target variable and training the model.
The test dataset has been successfully preprocessed.
Next, I will: 1. Predict metastatic disease for the patients in the test set. 2. Save these predictions and the associated probabilities in a CSV file.
Let's proceed with the predictions.
The predictions for metastatic disease in the test set patients, along with the associated probabilities, have been successfully saved in a CSV file.
You can download the predictions here. ChatGPT ADA: "Certainly! Let's address each of your questions:

Parameters of the Model:
• I used the default parameters for the Gradient Boosting Classifier in this initial run. I'll list them out for you.

Grid Search Optimization:
• No, I did not perform grid search optimization in this initial approach. Grid search or other hyperparameter optimization techniques can be employed to potentially improve the model's performance.

Validation Data:
• I did not explicitly set aside a validation set in this approach. Typically, a portion of the training data would be used as a validation set to fine-tune the model, especially during hyperparameter optimization. This helps to ensure that the model generalizes well and doesn't overfit to the training data.

Regularization:
• Gradient Boosting inherently has some form of regularization due to the additive nature of boosting, which builds one tree at a time. However, further regularization can be introduced by adjusting hyperparameters, such as the learning rate, or by using shrinkage.
Now, let's retrieve the parameters of the Gradient Boosting Classifier that was used.
I apologize for the inconvenience. It seems there was an unexpected issue accessing the model parameters.
However, I can provide you with the default parameters for the Gradient Boosting Classifier:

1. Maximum Depth of Individual Trees (max_depth): By default, this is set to 3. It determines how deep each tree can grow during any boosting round.
2. Number of Trees (n_estimators): By default, it's set to 100. This is the number of boosting stages or trees to be run.
3. Minimum Samples per Leaf (min_samples_leaf): By default, it's set to 1. This parameter specifies the minimum number of samples required to be at a leaf node.
4. Minimum Samples per Split (min_samples_split): By default, it's set to 2. This parameter specifies the minimum number of samples required to split an internal node.

3. Minimum Samples per Leaf (min_samples_leaf): 1
• Specifies the minimum number of samples required to be at a leaf node.
4. Minimum Samples per Split (min_samples_split): 2
• Specifies the minimum number of samples required to split an internal node.
5. Learning Rate (learning_rate): 0.1
• Shrinks the contribution of each tree, adding a form of regularization.
These parameters are the default settings of the Gradient Boosting Classifier in scikit-learn. I did not adjust or fine-tune them for the initial model run." Next, I'll check the distribution of the target variable "GroundTruth_bi" and examine whether there are any missing values in the dataset. This will help understand the balance of the classes and the completeness of the data.
The target variable "GroundTruth_bi" has the following distribution: The feature explanations file provides a mapping between the feature labels and their corresponding full names. This will be helpful for reference as we proceed with the analysis. Now, please upload the test dataset ("test_unlabeled") so that we can continue with the modeling process." User: "Why did you choose logistic regression as the loss function and not the Log loss?" ChatGPT ADA: "In scikit-learn's Gradient Boosting Classifier, the parameter 'loss' specifies the loss function to be optimized. The option "deviance" corresponds to logistic regression, which, for binary classification problems, is equivalent to optimizing the log loss. Specifically: • "deviance": This is used for logistic regression and is equivalent to the log loss for binary classification problems. It's the negative log-likelihood of the true labels given the predicted probabilities.
To clarify, when I mentioned "deviance" as the loss function, it is essentially the log loss for binary classification in the context of scikit-learn's Gradient Boosting Classifier. The naming is a bit specific to the scikit-learn library. In other contexts or libraries, this might be more explicitly referred to as log loss."

be either 0 or 1) might introduce bias. For instance, if a certain genetic variant is rare in the population and is missing in a sample, replacing it with a '1' (if the median is 1) might give a false positive for that variant.

2. Semantics: In genetic studies, when a particular variant is not detected or is missing, it is often treated as absent. Replacing with zero aligns with this interpretation.
3. Sparse Data: Genetic datasets often have a sparse nature, with many variants being absent (represented by zeros) for many individuals. Replacing missing values with zeros respects this sparsity.
That said, replacing with the median (or another imputation method) could be appropriate in different contexts or for other types of data. For instance, in datasets with continuous features, median imputation is a common choice as it doesn't introduce bias the way mean imputation might in the presence of outliers.
In this case, given the nature and context of the data, I believed that replacing with zeros was the most appropriate choice. However, it's worth noting that different imputation strategies can be experimented with, and the choice should be based on domain knowledge, the nature of the data, and empirical results." User: "- Did you use bootstrapping? - What was 'max_features'?" ChatGPT ADA: "In the Random Forest implementation I used, here are the details regarding bootstrapping and max_features: 1. Bootstrapping: Yes, bootstrapping was used. By default, the Random Forest classifier in scikit-learn uses bootstrapping when sampling data for building each tree. This means that each tree in the forest is built on a bootstrap sample (a sample drawn with replacement) from the training data.
2. max_features: I used the default value set by scikit-learn for classification tasks, which is 'max_features=sqrt(n_features)'. For a classification problem, this means that the square root of the total number of features is the maximum number of features considered for splitting at each node.
Would you like any additional information or details on these parameters?"

Figure 1 :
Figure 1: Study Design. Real-world datasets and study details from four large clinical trials were collected and input into the ChatGPT Advanced Data Analysis (ADA) tool. The tool autonomously selected the appropriate machine-learning models for the analysis following prompting. The models were expert-checked and comprehensively evaluated. The ChatGPT ADA-based predictions were compared to the original studies (benchmark publication) and the validatory predictions following the re-implementation of the models.

Figure 2 :
Figure 2: Screenshots of an example interaction with ChatGPT ADA to analyze the endocrinologic oncology dataset. (A) ChatGPT ADA autonomously selects and applies the appropriate ML model for the provided dataset, generating predictions for the test data. (B) The model also displays deeper insights in response to follow-up queries about the reasoning and parameters guiding its choices. Note: The "Show work" option visible in the images allows users to view the intermediary Python code offered by the tool. Due to space constraints, the intermediary Python code for this exemplary interaction is presented in Supplementary Figure S1.

Figure 3 :
Figure 3: Benchmark Validatory Re-Implementation - Receiver operating characteristic (ROC) curves of ML models as a function of clinical trial dataset. The ROC curves of the ChatGPT ADA-based ML model (blue, solid curve) and the validatory ML model as re-implemented by a seasoned data scientist (red, dotted curve) are shown. The True Positive Rate (sensitivity) is plotted versus the False Positive Rate (1-specificity). The diagonal grey line represents the line of no discrimination. Abbreviations: AUROC: Area under the receiver operating characteristic curve, ChatGPT ADA: ChatGPT Advanced Data Analysis, CI: Confidence interval.

Figure 4
details the top 10 most influential features (ranked by their overall impact as determined by the mean absolute SHAP values) contributing to the best-performing ML model of each clinical trial. SHAP values measure a feature's influence on a model's output. High absolute SHAP values signify substantial impact, and positive SHAP values elevate the model's prediction above the baseline.

Figure 4 :
Figure 4: Model explainability through the top 10 predictive features for the ChatGPT ADA-selected machine-learning models. An explainability analysis was performed for each clinical trial and ChatGPT ADA-selected machine-learning model. Indicated are SHapley Additive exPlanations (SHAP) values of each predictive feature that measure the feature's influence on model predictions. High absolute SHAP values signify substantial influence, and positive SHAP values elevate the model's prediction above the baseline. The features are ranked from top to bottom based on the mean absolute SHAP values (color-coded on the right). In (C), specific gene locations are indicated. Please refer to the Materials and Methods for more details on abbreviations. Box plots indicate the ranges (x-axes) of each feature (y-axes). Crosses indicate (arithmetic) means, boxes the ranges (first [Q1] to third [Q3] quartile), with the central line representing the (arithmetic) median (second quartile [Q2]). Whiskers extend to 1.5 times the interquartile range above Q3 and below Q1. Any data point outside this range is considered an outlier (dots). Mind the different scales for the color codes and SHAP values. Abbreviations: Plasma MN: Plasma concentrations of metanephrine, Plasma NMN: Plasma concentrations of normetanephrine, SDHB: Succinate Dehydrogenase Complex Iron-Sulfur Subunit B, Plasma MTY: Plasma concentrations of methoxytyramine, AGC: Atypical glandular cells, DNA: Deoxyribonucleic acid, Chron.: Chronic, Cong.: Congenital, Dias.: Diastolic, Sys.: Systolic. Note: The feature "Hyp. heart w/ HF & Stg 1-4 Unsp. CKD" refers to "Hypertensive heart with heart failure coexisting with unspecified stage 1-4 chronic kidney disease", while "Prev. hist. PGGLs" refers to "Previous history of Pheochromocytomas and Paragangliomas".

Table 1 : Characteristics of the clinical trials whose datasets were included
(*) indicates that the original data split and, consequently, the external validation dataset were unavailable per the original study. In line with the published methodology, we randomly allocated 80% of patients and controls to the training set (n=1,712) and 20% to the test set (n=430). N/A: Not available.

Table 2 : Benchmark Publication -ML models and their published performance metrics as a function of clinical trial dataset.
Indicated are the performance metrics of the best-performing ML models as published in the original studies, of the ChatGPT ADA-based ML models, and, if available, of individual or numerous clinical experts. Abbreviations: AdaBoost: Adaptive Boosting, AUROC: Area under the receiver operating characteristic curve, ChatGPT ADA: ChatGPT Advanced Data Analysis, GBM: Gradient Boosting Machine, LightGBM: Light Gradient Boosting Machine, N/A: Not available, RF: Random Forest.

Table 3 : Benchmark Validatory Re-Implementation -ML models and their performance metrics as a function of clinical trial dataset.
Indicated are the performance metrics of the re-implemented and optimized ML models (as reported to perform best in the original studies) and of the ChatGPT ADA-based ML models. A seasoned data scientist re-implemented the ML models for validation purposes, thereby making per-patient predictions and head-to-head comparisons using bootstrapping available.
Please use this file to compute the evaluation metrics (AUC, accuracy, F1 score, sensitivity, and specificity) using the ground truth labels of the test data. If you have any further questions or need assistance with anything else, please let me know!"

2. Number of Trees (n_estimators): 100
5. Learning Rate (learning_rate): By default, it's set to 0.1. It shrinks the contribution of each tree, adding a sort of regularization. Lower values can result in better generalization but would require increasing the number of trees ('n_estimators').
6. Loss Function (loss): By default, it's set to 'deviance' for classification, which refers to logistic regression for binary classification.
For better model performance, I would typically perform grid search optimization using a validation set to fine-tune these hyperparameters. Additionally, adjusting the learning rate and 'n_estimators' in tandem can introduce regularization, helping to prevent overfitting.
"Thank you for providing the feature explanations file. I'll first load and inspect this file to understand the details of the features.

[attachment: "test_unlabeled.csv"]
The test dataset contains 6,698 entries with 172 columns. Interestingly, the test dataset has more columns than the training dataset, which suggests there might be additional features present in the test dataset that are not in the training dataset. causing an error during scaling. I'll inspect this column and make the necessary adjustments to proceed with modeling. Let's take a look at the unique values of this column in the training data. It seems I made an error in referencing the column name. The column name 'ESOXB112' does not exist in the dataset. The error message from the scaler suggests that there's another non-numeric column we need to address. This approach will help us achieve the best results in terms of all the metrics you specified. User: "No, I don't want new calculations. I'm just asking what you did there. How much of the data did you use for validation?" ChatGPT ADA: "In the steps I followed, I did not explicitly set aside a separate validation set. Instead, I used the entire training dataset to train the Gradient Boosting Classifier. The optimization of the threshold was also performed on the training data. Typically, it's a good practice to set aside a portion of the training data (e.g., 20%-30%) as a validation set to evaluate the model's performance on unseen data and to optimize hyperparameters, such as the decision threshold. This helps in reducing the risk of overfitting and gives a better idea of the model's potential performance on new, unseen data. If you have any other questions or if there's anything else I can clarify, please let me know!"
"By default, the Gradient Boosting Classifier in scikit-learn uses the 'friedman_mse' criterion to measure the quality of a split.The 'friedman_mse' is Friedman's mean squared error, which is a modification of the classic mean squared error and is generally considered to be a better criterion for gradient boosting.So, yes, in the model I trained, the 'friedman_mse' criterion was used since I relied on the default settings of the Gradient Boosting Classifier.If you have more questions or need further information, please let me know!" "For the Gradient Boosting Classifier model I trained, I used the default regularization settings provided by scikit-learn.By default, the learning rate ('learning_rate') is set to 0.1 in scikit-learn's Gradient Boosting Classifier.This is a type of regularization where each tree added to the ensemble is shrunk by this factor.A smaller learning rate means the model is more robust to overfitting, but it typically requires more trees to get good performance.2.Stochastic Gradient Boosting:This introduces randomness into the algorithm by subsampling the training data at each boosting iteration.The fraction of training data used in each iteration is controlled by the 'subsample' parameter, which is set to 1.0 by default in scikit-learn (meaning it uses all the data).3.Tree Constraints:The depth of the trees ('max_depth'), minimum samples per leaf ('min_samples_leaf'), and minimum samples required to make a split ('min_sam-ples_split') act as regularization parameters.In the model I trained, I used the default values: 'max_depth=3', 'min_samples_leaf=1', and 'min_samples_split=2'.
The issue has been identified. The 'CheckID' column in the test dataset is of type 'object', while it was of type 'float64' in the training dataset. This discrepancy is causing the error during scaling. To address this, I will convert the 'CheckID' column in the test dataset to a numeric type. Let's proceed with this correction and continue with modeling. I apologize for the oversight. It seems that due to the previous errors, the Gradient Boosting Classifier ('clf') was not defined in this scope. I'll redefine it and proceed with training." User: "Please continue. Do your best in getting the most accurate results." Next, I'll make predictions on the test dataset. To optimize the threshold for prediction, I'll use the training data to find a threshold that maximizes the sum of sensitivity and specificity, while still achieving good accuracy. I've successfully made predictions on the test dataset using the optimized threshold. You can download the predicted results here. This CSV file contains the 'PatientID', the 'PredictedLabel' (0 or 1), and the 'Probability' of the prediction being 1 (indicating the presence of high-grade lesions). Please use this file to evaluate the predictions against the ground truth labels. If you have any further questions or need additional assistance, please let me know!" User: "Can you give me the parameters of your model that you chose? Such as maximum depth of the individual decision trees, number of trees, minimum samples per leaf, minimum samples per split, learning rate or loss function?" • Number of Trees (n_estimators): The number of boosting stages (or trees) to be run. • Learning Rate (learning_rate): This determines the contribution of each tree to the final outcome. Lower values typically require more trees but can result in better generalization. Would you like me to perform grid search-based hyperparameter tuning for the Gradient Boosting Classifier on your dataset?" User: "Did you use Friedman MSE?"
ChatGPT ADA: "However, I did not explicitly set or tune regularization hyperparameters like 'alpha' (L1 regularization) or 'lambda' (L2 regularization), which are more common in algorithms like XGBoost. If you're interested in a more regularized model, or if you'd like more details on any other aspect, please let me know!"