A bio-inspired convolution neural network architecture for automatic breast cancer detection and classification using RNA-Seq gene expression data

Breast cancer is considered one of the significant health challenges and ranks among the most prevalent and dangerous cancer types affecting women globally. Early breast cancer detection and diagnosis are crucial for effective treatment and personalized therapy. Early detection and diagnosis can help patients and physicians discover new treatment options, provide a more suitable quality of life, and ensure increased survival rates. Breast cancer detection using gene expression involves many complexities, such as the issue of dimensionality and the complicatedness of the gene expression data. This paper proposes a bio-inspired CNN model for breast cancer detection using gene expression data downloaded from the cancer genome atlas (TCGA). The data contains 1208 clinical samples of 19,948 genes with 113 normal and 1095 cancerous samples. In the proposed model, Array-Array Intensity Correlation (AAIC) is used at the pre-processing stage for outlier removal, followed by a normalization process to avoid biases in the expression measures. Filtration is used for gene reduction using a threshold value of 0.25. Thereafter the pre-processed gene expression dataset was converted into images which were later converted to grayscale to meet the requirements of the model. The model also uses a hybrid model of CNN architecture with a metaheuristic algorithm, namely the Ebola Optimization Search Algorithm (EOSA), to enhance the detection of breast cancer. The traditional CNN and five hybrid algorithms were compared with the classification result of the proposed model. The competing hybrid algorithms include the Whale Optimization Algorithm (WOA-CNN), the Genetic Algorithm (GA-CNN), the Satin Bowerbird Optimization (SBO-CNN), the Life Choice-Based Optimization (LCBO-CNN), and the Multi-Verse Optimizer (MVO-CNN). 
The results show that the proposed model determined the classes with high-performance measurements with an accuracy of 98.3%, a precision of 99%, a recall of 99%, an f1-score of 99%, a kappa of 90.3%, a specificity of 92.8%, and a sensitivity of 98.9% for the cancerous class. The results suggest that the proposed method has the potential to be a reliable and precise approach to breast cancer detection, which is crucial for early diagnosis and personalized therapy.

The limitations of morphological characteristics in detecting and diagnosing breast cancer can lead to bias and difficulty in identification by physicians [10]. Advancements in microarray technology and the more recent Next Generation Sequencing (NGS) have made gene expression profiling of patients widely available, resulting in the collection of gene expression datasets corresponding to various diseases. This shift has marked a significant transformation in personalized medicine, departing from traditional descriptive "morphological" classification approaches towards a more comprehensive strategy that considers clinical characteristics and immunohistochemical biomarkers. Today, gene expression profiling has become well-integrated into routine clinical practice [11,12]. Breast cancer researchers have examined gene expression profiling in depth, and clinical oncologists are starting to use the findings of these studies in their daily practices. Also, the early detection and treatment of different cancer types have benefited from mining gene expression level data [13]. Many methods are designed to accurately predict breast cancer based on gene expression data [14][15][16]. Computational techniques are becoming increasingly crucial in detecting breast cancer due to the rapid growth of computer technology. However, the use of computational techniques is affected by gene expression dataset characteristics such as small dataset sizes, excessive dimensionality, and unbalanced data [17]. Several machine learning, deep learning, and metaheuristic techniques have been created and applied to detect and classify cancer using gene expression data.
Khalsan et al. [18] presented an extensive overview of recent cancer research works that utilize gene expression data from various types of cancer, including kidney, breast, ovarian, lung, liver, gallbladder and central nervous system. The review encompasses several facets of machine learning in cancer research, including cancer classification, cancer prediction, identification of biomarker genes, and using microarray and RNA-Seq data. Yuan et al. [19] applied different methods of machine learning for the detection of lung cancer through the use of gene expression data. A novel computational method for detecting breast cancer was proposed by Wang et al. [20] based on incorporating random forest (RF), Monte Carlo feature selection (MCFS), rough set-based rule learning, SVM, and dagging. A deep learning method that uses a Stacked Denoising Autoencoder (SDAE) to identify genes that can effectively differentiate between tumor and healthy cases of breast cancer was proposed by Danaee et al. [21]. BRCA gene expression data from TCGA and the Gene Expression Omnibus (GEO) was analyzed by Jia et al. [22]. They used differentially expressed genes (DEG) and weighted gene co-expression network analysis (WGCNA) to select the most significant genes. A deep learning model combined with an artificial intelligence-based feature selection method (AIFSDL-PCD) using gene expression data was proposed by Alshareef et al. [23] for detecting prostate cancer.
The field of cancer prediction using machine and deep learning methods based on gene expression data has seen significant progress in recent years. Despite this progress, the existing models have some issues affecting their performance. These issues include choosing the feature representation, the optimal architecture (including the number of layers and nodes), suitable model parameters, and the best values for weights and biases, all of which are critical steps in improving performance [24][25][26]. Moreover, selecting the most suitable learning rates and regularization parameters can affect the model's ability to generalize to unseen data. Therefore, this paper aims to resolve these issues by finding a precise prediction model and advancing the state-of-the-art use of CNN to classify gene expression data using metaheuristic methods to optimize the CNN model.
Metaheuristic algorithms are optimization algorithms that search for solutions by exploring a large search space and iteratively improving candidate solutions. They can handle NP-hard problems, for which exact methods are computationally impractical, by providing near-optimal solutions within a reasonable amount of time [27][28][29]. Metaheuristic optimization algorithms have been identified as an effective tool for solving large-scale optimization problems in bioinformatics. Many of these problems can be classified as NP-hard; thus, researchers have relied heavily on metaheuristic methods to address them. The metaheuristic methods allow for the efficient solution of large-scale samples while minimizing the use of computational resources. Despite the availability of various optimization methods, metaheuristic optimization algorithms are instrumental in solving optimization problems due to their flexibility in providing high-quality optimization solutions in a relatively short amount of computing time [30]. The use of metaheuristic models assists in solving the problems of high dimensionality, the complexity of variable relationships and the noisy data peculiar to gene expression data. In addition, metaheuristic models can handle noisy and non-linear data by incorporating techniques such as randomization and simulated annealing to escape from local optima [31]. Chakraborty et al. [32] presented a metaheuristic method for skin disease classification based on an artificial neural network. In MotieGhader et al. [33], metaheuristic methods, including GA, WCC, PSO, CUK, ICA, LA, HTS, ACO, FOA, DSOS, and LCA, with an SVM classifier were used for the detection of breast cancer based on mRNA and micro-RNA expression data. This paper proposes using the metaheuristic model EOSA-CNN for breast cancer detection using gene expression data [34]. EOSA is a new optimization algorithm with excellent performance track records in different application domains [35][36][37][38][39]. It is population-based and bio-inspired, developed by taking cues from the Ebola virus's effective propagation. The algorithm's framework was designed based on the spread of Ebola virus disease (EVD) [34,40]. This research makes significant contributions by introducing a bio-inspired CNN model for detecting breast cancer using gene expression data from the TCGA repository. The AAIC method is used in pre-processing to remove outlier samples; thereafter, normalization and filtration are applied. Furthermore, we converted the pre-processed data into 2D images that can be utilized in the CNN architecture. The study also proposes a hybrid of the proposed CNN architecture that employs the EOSA to enhance the classification performance. The proposed model showed its ability to classify the tumor and normal samples with high accuracy and reliability. In our proposed model, the best combination of weights required for the feature extraction is obtained using the EOSA algorithm to handle the classification problem. Therefore, this study presents a hybrid model that combines the proposed CNN and EOSA for the process of classification based on BRCA gene expression data. Consequently, in this study, the main contributions are as follows:
• Applying various pre-processing techniques (such as removing outliers, normalizing, and filtering) to prepare the gene expression data.
• Transforming the gene expression data into two-dimensional images.
• Proposal of a novel bio-inspired CNN architecture for the detection of breast cancer.
• Introducing a hybrid model that combines the proposed CNN and EOSA for the classification process.
• Assessing and comparing the proposed model with other metaheuristic algorithms combined with the proposed CNN.
The rest of the paper is structured as follows: a detailed account of the related work is given in Section "Related work", while Section "Model methodology" describes the model methodology, discussing the CNN architecture and the Ebola Optimization Search Algorithm CNN model (EOSA-CNN) along with the associated algorithms. Section "Experimentation, results and discussion" presents the experimental results with a discussion of the results. A comparison with results from the literature and the strengths and limitations of the model are also enumerated. Finally, the conclusion and the recommendations for future work are presented in Section "Conclusion and future work".

Related work
As earlier noted, several machine learning, deep learning, and metaheuristic techniques have been created and applied to detect and classify cancer using gene expression data. Yuan et al. [19] applied different machine-learning methods for detecting lung cancer through gene expression data. The Monte Carlo and incremental feature selection methods were used to identify the most important genes. Then, SVM and random forest (RF) were implemented, and their performances were compared. The results indicated that SVM achieved an accuracy, sensitivity, specificity, precision, and F1-measure of 100%, 93.2%, 96.7%, 93.9%, and 96.9%, respectively. These results are higher than those obtained using RF. Wang et al. [20] proposed a novel computational method called Patient-derived tumor xenograft (PDX) for breast cancer detection by incorporating Monte Carlo feature selection, RF, rough set-based rule learning, SVM, and dagging. Danaee et al. [21] proposed a deep learning approach that uses a Stacked Denoising Autoencoder (SDAE) to identify genes that can effectively differentiate between tumor and healthy cases of breast cancer. They tested the efficacy of the extracted features using an artificial neural network (ANN), SVM, and SVM-RBF. The results showed that using the SDAE method with SVM-RBF achieved the highest accuracy of 98.26%.
Jia et al. [22] analyzed BRCA gene expression data from TCGA and GEO using differentially expressed genes (DEG) and weighted gene co-expression network analysis (WGCNA) to select the most significant genes. Twenty-three hub genes were then identified using a protein-protein interaction (PPI) network. They applied SVM, decision tree (DT), Bayesian network (BN), ANN, and convolutional neural networks (CNN-LeNet and CNN-AlexNet), and the results showed that ANN has the best performance with an average accuracy of 97.36%. Elbashir et al. [41] developed a lightweight CNN model for detecting breast cancer using RNA-Seq gene expression data. They first pre-processed the data by removing outliers, normalization and filtration. Then they converted the gene expression profiles into 2-D images. Thereafter, they applied a lightweight CNN model for the classification.
Their model achieved an accuracy of 98.76%. Alshareef et al. [23] proposed a deep learning model with an artificial intelligence-based feature selection method for prostate cancer detection (AIFSDL-PCD) using gene expression data. The novelty of their approach lies in a feature selection (FS) method based on chaotic invasive weed optimization (CIWO) to select the optimal genes. Their results showed sensitivity, specificity, precision, F1-measure, and accuracy of 97.25%, 97.25%, 0.967%, 97.14%, 97.28%, and 97.19%, respectively. Chakraborty et al. [32] presented a metaheuristic method for skin disease classification based on an artificial neural network. Their proposed method, a non-dominated sorting genetic algorithm-II (NNNSGAII), was used to train an ANN. The proposed method obtained 87.92% accuracy, 94.2% precision, 87.5% recall, and 90.73% F-measure.
MotieGhader et al. [33] used metaheuristic methods, including world competitive contest (WCC), league championship algorithm (LCA), GA, particle swarm optimization (PSO), ant colony optimization (ACO), imperialist competitive algorithm (ICA), learning automata (LA), heat transfer optimization algorithm (HTS), forest optimization algorithm (FOA), discrete symbiotic organisms search (DSOS), and cuckoo optimization (CUK), with an SVM classifier for breast cancer detection using mRNA and micro-RNA expression data. The proposed algorithm selected 186 mRNAs out of 9,692 and 116 miRNAs out of 489 and obtained an accuracy above 90% for the miRNA dataset and 100% for the mRNA dataset. Wei et al. [42] proposed a generative adversarial model (GANs) based on cancer genetic data. They used 12 different gene expression datasets from the TCGA, including lung, breast, prostate, colon, gastric, liver, rectal, esophageal, thyroid, clear cell renal cell carcinoma (CCRCC), uterine, and head and neck squamous cell carcinomas (HNSCC). They further used a reconstruction loss to enhance stability during model training. Their proposed model achieved an accuracy of 92.6%. Deng et al. [43] proposed a two-stage gene selection model for cancer classification in microarray datasets. Their approach combined a multi-objective optimization genetic algorithm (XGBoost-MOGA) with gradient boosting (XGBoost). During the first stage, XGBoost-based feature selection ranks the genes to effectively eliminate irrelevant genes, thereby leaving a group of genes that are most relevant to the class. In the second stage, a subset of optimal genes from the group of the most relevant genes is identified using XGBoost-MOGA through multi-objective optimization. The proposed method was compared with other state-of-the-art feature selection methods using two widely used learning classifiers on 14 publicly available microarray datasets. The results demonstrated that XGBoost-MOGA outperformed previous methods in terms of accuracy, F-score, precision, and recall.
In Houssein et al. [44], genes that contribute to the prediction of cancer with the highest accuracy were selected from microarray gene expression datasets by combining a Barnacles Mating Optimizer (BMO) algorithm with SVM (BMO-SVM). They evaluated the proposed model using four benchmark microarray datasets, including leukemia1, lymphoma, a small-round-blue-cell tumor (SRBCT), and leukemia2. From their results, the proposed BMO-SVM approach performed better than other well-known methods, such as Particle Swarm Optimization (PSO), the Tunicate Swarm Algorithm (TSA), Artificial Bee Colony (ABC), and Genetic Algorithm (GA). Devi et al. [45] proposed an Improved Whale Optimization Algorithm (IWOA) for gene selection. The proposed solution used a multi-objective fitness function that balances error rate minimization and feature selection. The results show that the proposed IWOA obtained a minimal subset of genes used for the BRCA classification using a Gradient Boost Classifier (GBC) and achieved an accuracy of 97.7%. The related studies are summarised and presented in Table 1.
From the existing literature, various shortcomings were discovered regarding utilizing deep learning models for the given task. Deep learning models necessitate substantial data, and acquiring sizable, high-quality datasets for analyzing breast cancer gene expression can be challenging. Consequently, this can cause overfitting of the model to the training data, thereby resulting in inadequate performance on fresh, unobserved data. The computational complexity and time required for developing and training deep learning models can pose a significant hurdle to their widespread implementation in clinical practice. The complexity of breast cancer, which entails numerous biological processes such as cell proliferation, invasion, and angiogenesis, may not be captured entirely by deep learning models, thereby restricting their capacity to forecast outcomes or recognize potential therapeutic targets precisely. To resolve this challenge, optimizing the CNN model becomes necessary using suitable approximate optimization methods. Metaheuristic optimization algorithms have been applied to solve these problems. Nevertheless, the critical challenge of using deep learning models for effectively and efficiently classifying breast cancer remains unresolved. Therefore, this paper aims to enhance the efficacy of DL models on breast cancer detection and classification using gene expression data by leveraging a new optimization algorithm inspired by the biological mechanism of the Ebola disease.

Model methodology
Dataset and pre-processing. Using the R software, we obtained the BRCA gene expression data from the Cancer Genome Atlas (TCGA) repository. The GDCquery function from the TCGAbiolinks library was used in developing the query [41,46]. The BRCA dataset contains 1208 clinical samples and 14,895 genes or features. Moreover, there are 113 and 1095 normal and tumor samples, respectively. The data were identified to be noisy with many features. Therefore, different pre-processing steps were implemented to get clean data with genes positively contributing to BRCA detection. To identify the outlier samples, the array-array intensity correlation (AAIC), which defines a symmetric matrix of Spearman correlation between samples, was calculated [47]. A cut-off value of 0.6 was used to define the outlier samples to be removed. Normalization was applied to the gene expression data to ensure the validity of the expression levels and avoid biases in the analysis [48]. The TCGAanalyze_Normalization function from the TCGAbiolinks library was used to perform the normalization. Then filtration was performed using a cut-off value of 0.25 to reduce the number of genes by selecting genes whose mean expression values are higher than the cut-off value [41,49]. Consequently, the pre-processing obtained a dataset that contains 1208 clinical samples with 14,895 genes.
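As a rough illustration of the outlier-removal and filtration steps described above, the following Python sketch computes Spearman correlations between samples and applies the two cut-offs. It is not the TCGAbiolinks implementation: the rule that a sample is kept when its mean correlation with all other samples reaches the cut-off is an assumption, and the function names (`remove_outlier_samples`, `filter_genes`) are hypothetical. Only the standard library is used so the sketch stays dependency-free.

```python
from statistics import mean

def ranks(values):
    """Return 1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation; raises ZeroDivisionError for a constant vector."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def remove_outlier_samples(samples, cutoff=0.6):
    """Indices of samples whose mean Spearman correlation with the other
    samples is at least `cutoff` (assumed outlier rule)."""
    kept = []
    for i, sample in enumerate(samples):
        others = [spearman(sample, t) for j, t in enumerate(samples) if j != i]
        if mean(others) >= cutoff:
            kept.append(i)
    return kept

def filter_genes(samples, cutoff=0.25):
    """Indices of genes whose mean expression across samples exceeds `cutoff`."""
    n_genes = len(samples[0])
    return [g for g in range(n_genes) if mean(s[g] for s in samples) > cutoff]
```

Each element of `samples` is one sample's expression vector. In a real pipeline the two functions would be applied in sequence, with normalization in between.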
The gene expression data was reshaped from 1D vectors to 2D images with a dimension of 122 × 123 to be appropriate for our metaheuristic models. The BRCA gene expression data contains a number of columns that could not be reshaped directly into the desired dimension. Therefore, 112 columns of zeros were appended at the end to adjust the image size [41,50]. Moreover, we transformed the images into grayscale using the cvtColor() function from the OpenCV library in Python. This was done to ensure that the images met the requirements of the classification model and to improve image quality. Once the images were converted, they were prepared as input for the hybrid model. Figure 1 shows the proposed methodology.
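The zero-padding and reshaping step can be sketched as follows. This is a minimal illustration in plain Python, assuming a row-major layout; the function name `to_image` is hypothetical, and the number of padding zeros is computed from the vector length instead of being hard-coded.

```python
def to_image(vector, rows=122, cols=123):
    """Pad a 1D expression vector with trailing zeros and reshape it
    into a rows x cols grid (row-major order, assumed layout)."""
    size = rows * cols
    if len(vector) > size:
        raise ValueError("vector does not fit in the requested image")
    padded = list(vector) + [0.0] * (size - len(vector))
    return [padded[r * cols:(r + 1) * cols] for r in range(rows)]
```

A 122 × 123 grid holds 15,006 values, so a vector of 14,894 values would receive exactly 112 trailing zeros.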
The CNN architecture. After the pre-processing step, the resulting images were used as input to the model.
A specially designed CNN was used for the optimization model.The architecture of the proposed CNN model is a deep neural network designed to analyze and classify gene expression images with dimensions of 150 × 150 pixels and a single colour channel (grayscale).The model consists of multiple convolutional layers with increasing filter sizes, followed by max pooling layers to reduce the spatial dimensions of the feature maps.The architecture is designed to extract and learn high-level features from the input images, gradually increasing the number of filters to capture more complex patterns.The final output of the convolutional layers is flattened and passed through a Dropout layer, which randomly drops out some of the neurons to prevent overfitting.The final output layer is a Dense layer with ReLU activation that is fully connected.The CNN model architecture designed in this study is shown in Fig. 2. The proposed CNN model for breast cancer detection has a specific architecture that utilizes filters (denoted by "F"), kernels (denoted by "K"), and strides (denoted by "S").

Ebola optimization search algorithm CNN model (EOSA-CNN).
Ebola is a viral hemorrhagic fever that affects humans and primates, also called Ebola hemorrhagic fever or Ebola virus disease. The Ebola viruses cause this disease, which can cause individuals to transition between susceptible, quarantined, infected, recovered, hospitalized, and deceased subpopulations in a seemingly random manner. Drawing inspiration from the Ebola virus's ability to spread effectively, a novel optimization algorithm that is both bio-inspired and population-based was developed. The method of the propagation of Ebola virus disease (EVD) [34] was adopted in the design of the algorithm. To update the propagation, the EOSA model used a dynamic mechanism for propagation via susceptible, infection, quarantine, recovered, and hospitalized operations to gain a better fit. It helped to find the best or worst solution and provided an intuitive outcome. In this paper, the EOSA metaheuristic algorithm was hybridized with CNN to improve the performance of the CNN model. This was accomplished in all the iterations when the metaheuristic algorithm was trained to achieve the solution vector and update the CNN model. The weights and biases for the CNN were updated, and the loss function was subsequently calculated. Thereafter, the results obtained were compared with different hybrid models. The following steps describe the EOSA-CNN model:
1. Set up the initial scalar and vector quantities for parameters and individuals, respectively. Assign initial values to individuals categorized as Susceptible (S), Infected (I), Recovered (R), Dead (D), Vaccinated (V), Hospitalized (H), and Quarantine (Q).
2. Randomly select an individual from the susceptible individuals as the index case ($I_1$).
3. Designate the index case as the global and current best, then compute its fitness value.
4. While there is at least one infected individual and the number of iterations is not complete,
a. Update the position of each susceptible individual based on their displacement, and generate newly infected individuals (nI) accordingly. Note that the greater the displacement of an infected case, the [...]

The pseudocode in Algorithm 1 presents the algorithm that uses mathematical models to optimize a CNN model. The algorithm uses evolutionary optimization techniques. The algorithm starts by initializing variables such as the CNN model's objective function, lower and upper bounds, batch size, number of epochs, population size, and the incubation period. It also creates empty sets for groups of individuals (Quarantine (Q), Susceptible (S), Exposed (E), Recovered (R), Hospitalized (H), Vaccinated (V), Infected (I)) and solutions. The set of susceptible individuals is then generated, and the algorithm starts with a time equal to 0 and an index case is randomly generated. The current best and global best solutions are set to the index case. The positions of the exposed individuals are updated by the algorithm using the mathematical model in Eq. (1):

$$mI_i^{t+1} = mI_i^{t} + \rho M(I) \qquad (1)$$

The displacement scale factor of individuals is represented by $\rho$, while $mI_i^{t+1}$ and $mI_i^{t}$ indicate the updated and original positions at times $t+1$ and $t$, respectively. The movement rate of each individual, represented as $M(I)$, is calculated using Eqs. (2) and (3).
The exploration stage of the EOSA involves the infected individual moving beyond the normal neighbourhood range, lrate. In contrast, during the algorithm's exploitation phase, it is assumed that the infected individual either remains within a distance of zero (0) or is displaced within a limit of srate relative to its previous position.
To determine the current best (cBest), the individuals infected at time t are evaluated, and the global best (gBest) is calculated using Eq. (5). At time t, the terms cBest, bestS, and gBest represent the current best solution, best solution, and global best solution, respectively. The objective function used for the problem is denoted by the term fitness.
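To make the update rule and the cBest/gBest bookkeeping more concrete, the following Python sketch runs a heavily simplified EOSA-style search on a toy objective. It is not the authors' Algorithm 1: the compartments (quarantine, hospitalization, vaccination, and so on) are omitted, the form of the movement rate M, the 50/50 switch between lrate and srate, and the greedy acceptance rule are all assumptions, and the function name `eosa_sketch` is hypothetical. In the hybrid model, the position vector would hold CNN weights and biases and the fitness would be the CNN loss.

```python
import random

def eosa_sketch(fitness, dim, lower, upper, pop_size=20, iters=50,
                srate=0.1, lrate=1.0, seed=0):
    """Simplified EOSA-style minimization; returns (gBest, fitness, history)."""
    rng = random.Random(seed)
    # Susceptible population: random positions inside the search bounds.
    pop = [[rng.uniform(lower, upper) for _ in range(dim)]
           for _ in range(pop_size)]
    fits = [fitness(p) for p in pop]
    # Index case: one random individual, used as first current/global best.
    idx = rng.randrange(pop_size)
    g_best, g_fit = list(pop[idx]), fits[idx]
    history = [g_fit]
    for _ in range(iters):
        for i in range(pop_size):
            # Assumed switch: exploration uses lrate, exploitation uses srate.
            rate = lrate if rng.random() < 0.5 else srate
            rho = rng.random()  # displacement scale factor
            cand = []
            for x, g in zip(pop[i], g_best):
                # Assumed movement rate M: random step plus pull toward gBest.
                m = rate * rng.uniform(-1.0, 1.0) + (g - x)
                # Position update mI(t+1) = mI(t) + rho * M, clipped to bounds.
                cand.append(min(upper, max(lower, x + rho * m)))
            f = fitness(cand)
            if f < fits[i]:  # keep the move only if it improves the individual
                pop[i], fits[i] = cand, f
        # cBest is the best individual now; gBest is the best seen so far.
        c = min(range(pop_size), key=fits.__getitem__)
        if fits[c] < g_fit:
            g_best, g_fit = list(pop[c]), fits[c]
        history.append(g_fit)
    return g_best, g_fit, history
```

For example, `eosa_sketch(lambda v: sum(x * x for x in v), dim=3, lower=-5.0, upper=5.0)` searches for the minimum of a sphere function; the returned `history` records the global best fitness after each iteration and never increases.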

Experimentation, results and discussion
System configuration and algorithms parameters setting. The experiments were conducted using a Dell Optiplex 5050 computer with the following configuration: an Intel Core i5 7th generation processor, a 500 GB hard disk, and 16 GB of memory. All the models were developed using Python. The EOSA-CNN model's performance was compared to that of a standalone CNN and five other metaheuristic algorithms, namely MVO-CNN (Physics-based), GA-CNN (Evolutionary-based), LCBO-CNN (Human-based), WOA-CNN, and SBO-CNN.
True positive (TP) denotes the number of accurately classified cancerous images. False positive (FP) is the number of non-cancerous images that were misclassified as cancerous. False negative (FN) represents the number of cancerous images that were misclassified as non-cancerous. True negative (TN) is the number of accurately classified non-cancerous images. The performance metrics are calculated using the formulas involving TP, FP, FN, and TN presented in Eqs. (13) to (20).
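The metrics built from TP, FP, FN, and TN can be sketched in a few lines of Python. The formulas below are the standard definitions, assumed to match Eqs. (13) to (20), and the function name `classification_metrics` is hypothetical. The example counts are taken from the EOSA-CNN confusion matrix reported in the results (270 of 273 tumor samples and 26 of 28 normal samples correctly identified), treating tumor as the positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    # Chance agreement: (actual neg * predicted neg + actual pos * predicted pos) / total^2
    random_accuracy = ((fp + tn) * (fn + tn) + (tp + fn) * (tp + fp)) / (total * total)
    kappa = (accuracy - random_accuracy) / (1 - random_accuracy)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
    }

# EOSA-CNN counts with tumor as the positive class.
metrics = classification_metrics(tp=270, fp=2, fn=3, tn=26)
```

With these counts the sketch gives an accuracy of about 0.983, a sensitivity of about 0.989, a specificity of about 0.929, a balanced accuracy of about 0.959, and a kappa of about 0.903, which agrees, up to rounding, with the values reported for EOSA-CNN.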

Results and discussions
Table 4 presents the overall performance of the competing algorithms. It shows that the hybrid algorithms performed better than the traditional CNN, and the proposed EOSA-CNN model recorded a better performance than the other hybrid algorithms. We calculate the balanced accuracy, accuracy, precision, recall, f1-score, Cohen's kappa, sensitivity, and specificity. In terms of balanced accuracy, WOA-CNN, GA-CNN, MVO-CNN, SBO-CNN, CNN, and LCBO-CNN achieved 0.956, 0.942, 0.923, 0.942, 0.924, and 0.940, respectively, whereas EOSA-CNN achieved 0.958, which is the best performance. With reference to accuracy, GA-CNN, SBO-CNN, and EOSA-CNN achieved the same result of 0.983. For recall, EOSA-CNN and WOA-CNN attained 0.928. In terms of the f1-score, EOSA-CNN achieved 0.912. The comparative study of the proposed method with five metaheuristic algorithms and CNN is reported in Fig. 3. The proposed model performs better than the other models with respect to the validation accuracy in 100 epochs.
Figure 4 presents the precision, f1-score and recall of all models for the normal class. It shows that GA-CNN and SBO-CNN have the same precision of 0.93, while CNN has a precision of 0.92. Furthermore, because the gene expression dataset was imbalanced, additional metrics such as F1-score, balanced accuracy, and recall were calculated for further confirmation. EOSA-CNN achieves a high F1-score of 0.91 for the normal class. Also, GA-CNN and SBO-CNN have identical results. EOSA-CNN also performs well compared with the other methods in terms of recall (0.93). All the methods correctly identified the tumor class with a high performance of 99% in terms of recall, precision, and F1-score. Overall, the experiments indicated that the hybrid models benefited from pre-processing the gene expression data and had almost equivalent performance in detecting BRCA.
Figure 5 shows the confusion matrix for CNN and the hybrid algorithms, considering all the dataset's class labels. Each plot of the confusion matrix shows the classification accuracy for all classes, providing an accurate performance report for each one. Taking EOSA-CNN (top left of Fig. 5), for instance, the hybrid algorithm proposed in this study correctly identified 26 of 28 samples as the normal class and 270 of 273 samples as tumor. Also, CNN correctly identified the tumor class but misclassified 3 of 29 samples for the normal class. This result highlights the significance of the proposed hybrid algorithm in this study as it successfully enhanced the classification accuracy.
Comparison with related studies. Table 5 shows the comparison between our proposed model's performance and different studies. The proposed model in this study achieved higher classification accuracy than the results observed in previous works reported by Danaee et al. [21], Jia et al. [22], and MotieGhader et al. [33]. While Elbashir et al. [41] achieved higher classification accuracy than our study using a CNN model, our approach showed a sensitivity of 0.989 and an f1-score of 0.99 for both tumor and normal class. Moreover, the EOSA-CNN model achieved a sensitivity of 0.989, which means the model missed only a few of the positive cases. Sensitivity is a crucial metric as it assesses the model's ability to detect positive cases correctly. Our models must identify all positive cases to ensure accurate predictions. Thus, this study highlights the significance of employing a metaheuristic algorithm to optimize CNN model hyperparameters, which is crucial in selecting the optimal combination of biases and weights required to train a CNN model effectively. Furthermore, the proposed method showed that integrating these methods can significantly enhance the overall performance and classification accuracy on gene expression data.
Strength and limitations of the EOSA-CNN model. In this section, the limitations of the study are discussed in more detail, including the small sample size of gene expression data compared to the very high number of genes. Moreover, the absence of addressing the problem of imbalanced data using approaches such as random over- and under-sampling and cluster-based over-sampling is considered a serious challenge. The sample size used for the study may not be sufficient to capture the full complexity of the gene expression data, leading to potential biases and limitations in the analysis. Additionally, the issue of imbalanced data can significantly impact the model's performance, as the algorithm may be biased towards the majority class and struggle to predict the minority class accurately.
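For illustration only, random over-sampling, one of the balancing approaches mentioned above but not applied in this study, can be sketched as follows. The function name `random_oversample` is hypothetical, and the sketch assumes it would be applied to the training split only, so that duplicated samples do not leak into validation data.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

With the class sizes of this dataset (113 normal and 1095 tumor samples), the normal class would be resampled with replacement up to 1095 samples.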

Conclusion and future work
Breast cancer is the most common medical diagnosis in women.The study, understanding and research of breast cancer have aided the diagnosis and development of new treatments for breast cancer.Gene expression profiling is helping researchers and doctors to comprehend the heterogeneous nature of breast cancer on a genomic level.
In this study, we developed a hybrid model that combines the Ebola optimization search algorithm (EOSA) with a CNN architecture for the detection and diagnosis of breast cancer using gene expression data. We prepared the data using different pre-processing methods, including removing the outliers using Array-Array Intensity Correlation (AAIC). To avoid biases in the expression measures, we utilized the normalization method. The final step in pre-processing was filtration. After that, we converted the gene expression data into two-dimensional images, which were converted into grayscale images. For the classification, we used the EOSA-CNN model. The findings of this study demonstrate that the proposed model achieved high-performance measurements with exceptional accuracy (98.3%), precision (99%), recall (99%), f1-score (99%), kappa (90.3%), specificity (92.8%), and sensitivity (98.9%) for the cancerous class. These results suggest that the model has the potential to be an effective and reliable method for breast cancer detection using gene expression data. For future extensions, we plan to solve the problem of imbalanced data and hybridize the model with various state-of-the-art optimization algorithms.
https://doi.org/10.1038/s41598-023-41731-z

$$F1\text{-}score = 2 \times \frac{Recall \times Precision}{Recall + Precision}$$

$$Random\ Accuracy = \frac{ActNegative \times PredNegative + PredPositive \times ActPositive}{Total \times Total} = \frac{(FP + TN) \times (FN + TN) + (TP + FN) \times (TP + FP)}{(TP + TN + FP + FN) \times (TP + TN + FP + FN)}$$

Figures 6, 7, 8, 9, 10 and 11 display the training and validation accuracy for all hybrid algorithms in each epoch. In all the hybrid models, the validation accuracy is higher than the training accuracy at the beginning of training. That indicates the models possess good generalization ability to new, unseen data, which is a positive indication. During training, the model's training accuracy improves, while the validation accuracy improves more slowly. Both training and validation accuracies stabilize at a level higher than 97%. In Fig. 12, CNN's performance in training and validation is depicted. Although the training accuracy improves and reaches 100%, the validation accuracy remains lower. This implies that the model is overfitting to the training data, effectively memorizing it but lacking the ability to perform well on new and unseen data. As a result, it may lack generalization ability.

Figure 3. Comparative performance of the proposed EOSA-CNN model against other models.

Figure 4. Comparative results of precision, f1-score, and recall for EOSA-CNN model and other models for normal class.

Figure 7. Training and validation accuracy curve for GA-CNN.

Figure 8. Training and validation accuracy curve for LCBO-CNN.

Figure 9. Training and validation accuracy curve for MVO-CNN.

Figure 10. Training and validation accuracy curve for SBO-CNN.

Figure 11. Training and validation accuracy curve for WOA-CNN.

Figure 12. Training and validation accuracy curve for CNN.

Table 1. Comparative summary of related existing studies.

Table 4. The overall performance of the algorithms.

Table 5. A comparison of our model performance with several models used for gene expression data classification.