Introduction

Cutaneous melanoma is a skin cancer that begins in the pigment-producing cells of the skin1,2. The 5-year survival rate for all people with melanoma of the skin, measured from the time of initial diagnosis, is 93%. The thickness of the primary melanoma, whether the lymph nodes are affected, and whether the cancer has migrated to other locations are all factors that affect overall survival at 5 years3. Melanoma can spread rapidly and become life-threatening in as little as six weeks3, and if untreated it continues to spread and advance in stage. Superficial spreading melanoma (SSM), nodular melanoma, lentigo maligna, acral-lentiginous melanoma, ulcerated and/or regressive melanomas, and melanomas in special anatomical regions are a few types of cutaneous melanoma1. The major factor causing cutaneous melanoma is exposure to ultraviolet (UV) radiation from the sun4,5. The sequence of genes defines the structure and function of the formed organs; any disruption or error in this specific sequence results in a mutation which, in turn, causes many abnormalities in the body. In skin cells, UV radiation can damage DNA. Sometimes the damage affects genes that control cell division and growth, and the affected cells may transform into cancer cells if these genes cease functioning properly6. However, melanoma can be cured by wide local excision (WLE) if it is diagnosed and detected at an early stage. Even for professional dermatologists, identifying melanoma from skin lesions using techniques such as visual inspection, clinical screening, dermoscopic analysis, biopsy, and histological investigation can be difficult and imprecise7,8,9. Our genes, which control how our cells function, are made up of DNA, a molecule that is present in all of our cells. We resemble our parents because they are the source of our genes, but DNA affects more than just how we look. Certain genes continue to function as our cells divide, develop, and die. Oncogenes are genes that help cells grow, proliferate, and survive10. Tumor suppressor genes are those that regulate cell development, fix DNA errors, or induce cells to die at the appropriate moment. Cancer is produced by DNA mutations (or other sorts of alterations) that keep oncogenes active or switch tumor suppressor genes off; these sorts of gene alterations can cause cells to proliferate uncontrollably11. Typically, changes in multiple distinct genes are required for a cell to become a cancer cell. Most gene changes that cause melanoma arise during a person's lifetime and are not passed on to their children; people rarely inherit DNA changes from their parents that clearly predispose them to melanoma.

Skin lesions exhibit a diverse range of sizes and shapes, making their identification and classification complex. The borders of these lesions often lack distinct clarity, making it difficult to precisely delineate their edges, and the lesions may not contrast starkly with the surrounding skin. Additionally, various noise elements, such as skin hair, lubricants, air, and bubbles, can further complicate the visual interpretation of skin lesions. These factors collectively contribute to the complexity of analyzing and diagnosing skin lesions, requiring advanced imaging techniques and algorithms to mitigate these challenges effectively. It is therefore important to develop a reliable way to diagnose melanoma. The approach suggested in this study will make it easier to diagnose melanoma early, which can simplify treatment and lower the death rate from the disease. Melanoma has the greatest fatality rate of any skin cancer, with around 10,000 cases and 12,390 fatalities recorded in the United States of America each year. New cases of melanoma of the skin reported in 2020 are shown in Fig. 1 below12. The most important predictor of survival in melanoma and other skin tumors is early identification13. Accurate skin cancer diagnosis necessitates sufficient experience and appropriate equipment. “The accuracy of melanoma diagnosis for skilled dermatologists with naked-eye inspection is less than the dermatoscopy, which is a magnifying lens with either liquid emulsion or cross-polarization filters to remove the surface reflection of skin14,15”. However, there is a global scarcity of skilled dermatologists16. Diagnosis of any skin-related disease is normally based on visual inspection, which is why different machine learning algorithms are used to develop models for detecting melanoma of the skin17.

Figure 1

A visual example of the number of new cancer cases reported for the year 202012.

Literature review

Large-scale DNA sequencing studies have identified driver mutations that cause cancer, which could be treated at early stages. The investigation of these mutations has also yielded important insights into the biology of cutaneous melanoma of the skin, which will drive future research and new treatment targets. The sequence of DNA is very important; working with it involves reading and extracting its strands. In one study, Rishitha18 compared different deep learning and machine learning algorithms for DNA sequencing. A dataset of sequences was used, with an 80%/20% training/testing split for the machine learning algorithms and a 75%/25% split for the deep learning algorithms. Five algorithms from deep learning and machine learning, namely CNN, Naïve Bayes, decision tree, random forest, and transfer learning, showed accuracies of 91.45%, 98%, 82.60%, 91.80%, and 94.57%, respectively18. Liu19 developed a prediction system called CanSavPre that distinguishes whether a single amino acid variation is cancerous or not. Two prediction systems, CanSavPrew and CanSavPrewm, were developed using machine learning, with features extracted by a genetic algorithm. CanSavPrew achieved an accuracy of 79.83% with an MCC of 0.45 and an F1 score of 0.54, whereas CanSavPrewm gave an accuracy of 89.73%, an MCC of 0.74, and an F1 score of 0.81. In another study20, a genome deep learning (GDL) method based on deep neural networks was developed, with specific, mixture, and total-specific models achieving accuracies of 97.47%, 70.08%, and 94.70%, respectively, for the identification of cancer. In that study, 14 models were developed. The mixture and specific models used an 80%/20% training/testing ratio, whereas the total-specific model randomly chose 80% of the cancer and healthy samples. The accuracy, sensitivity, and specificity of the GDL DNN model were 97%, 98%, and 97%, respectively. Identifying the mutations that cause cancer is a very difficult task, and mutation recurrence is a reliable indicator of their significance; certain mutations are more likely to occur than others. Cancer driver genes often exhibit mutual exclusivity of mutations, and they operate within very complex networks. One study21 proposed a machine learning method that investigates the functionality of mutually exclusive genes in networks derived from mutation associations, gene-to-gene interactions, and clustering. The method also identifies variant frequencies of driver genes and cancer-related pathways that have been studied less because the genes involved have low frequencies, giving insights into cancer-related mutations and their pathways to improve the understanding of the disease. In that study, driver genes identified by the LSFS algorithm were compared against the ShiBench, Rule, CTAT, HCD, CGC, and CGCpointMut benchmarks. The precision, recall, and F-measure were 0.385, 0.383, and 0.384 for ShiBench-LSFS; 0.215, 0.383, and 0.275 for Rule-LSFS; 0.29, 0.373, and 0.329 for CTAT-LSFS; 0.315, 0.425, and 0.362 for HCD-LSFS; 0.89, 0.50, and 0.640 for CGC-LSFS; and 0.21, 0.375, and 0.269 for CGCpointMut-LSFS, respectively.

Mutation detection is not accurate if some sequences are lost by traditional alignment algorithms. Another study22 proposed a feedback fast learning neural network position index algorithm for mutation detection, based on the ACGT position index relationships in DNA sequences; SNP and InDel mutations were studied. The feedback fast learning neural network algorithm was used to analyze the linear relationship between two or more positions. The position index approach responded better than GATK, VarScan2, FreeBayes, and BCFtools for mutation detection, achieving a matching rate of 84% in exon regions, higher than the other algorithms, and the location-based index method showed good performance in detecting SNPs and InDels. Quang23 proposed DanQ, a hybrid convolutional and bi-directional long short-term memory recurrent deep neural network model for predicting the function of non-coding sequences de novo. This model learns convolution kernels and then turns them into motifs. A comparison was made between DanQ and a pure CNN model, DeepSEA, in which DanQ showed the best performance: the ROC AUC of the DanQ model was 94.1%, and its PR AUC was 97.6%. Another study addressed virus mutation prediction in RNA sequences, in which the authors used rough set gene evolution (RSGE) and neural network (NN) techniques. The proposed techniques were trained on datasets from two different countries (Korea and China). The technique predicts nucleotides in the next generation with 75% accuracy, whereas the neural network prediction was not as good as the proposed technique. The study also analyzed the relation between nucleotides in RNA and their effect on changing the genotype of other nucleotides24. In25, the researchers developed a new DL technique, DeepBind, which predicted the sequence specificities of DNA- and RNA-binding proteins. DeepBind uses a set of sequences to compute a binding score for each sequence and outperformed the other methods; it was trained on in vitro data and tested on in vivo data. The study also discussed the application of DeepBind to microarray and sequencing data, as well as the ability of the model to learn from millions of sequences through parallel implementation on a graphics processing unit. In another work26, a hybrid approach was used for detecting melanoma of the skin: a CNN and two other approaches (KNN and SVM) were used first individually and then combined, with the combined method showing the highest accuracy of 88.4% among them.

In27, the authors compared different ML algorithms for skin cancer detection. The study showed that Inception-ResNet v2 gives the highest accuracy of 89.55% with 91.67% recall, whereas the accuracy and recall are 86.36% and 95% for random forest, 86.21% and 87% for VGG-16, 79.55% and 81.67% for ResNet-50, and 83% and 82% for Inception v3, respectively.

Different studies and the accuracies of their proposed systems are shown in the following Table 1.

Table 1 Comparison of the state-of-the-art works.

Most existing research on detecting melanoma of the skin with ML or DL relies on images or histopathology. Many factors can affect such results, so the novelty of the proposed study is that it can detect cutaneous melanoma of the skin earlier and more efficiently, with more accurate results. Previous work has various limitations: there is no generic, clear benchmark dataset of mutations and specific sequences for cutaneous melanoma of the skin; evaluation approaches are not rigorous or compelling enough; and the models' accuracy has considerable potential for improvement. Keeping these constraints in mind, this study compiled the most recent and generalized datasets, as described in the data collection section. Furthermore, numerous deep learning algorithms are used to attain the highest level of accuracy, and different testing techniques are used to measure it.

Methodology

This section describes the step-by-step process of obtaining the materials (datasets) and processing them with different deep learning algorithms to identify melanoma. The testing methods for the algorithms are explained thoroughly in the following sections.

Benchmark dataset collection

The datasets are the most critical part of the research; they are used for training the model, testing, and validating the results. The dataset is composed in such a way that it contains mutated sequences as well as normal sequences. In the proposed model, normal gene sequences are taken from the asia.ensembl.org website28, and mutated gene information is obtained from intogen.org29; both are extracted through a web scraping application (WSA) written in Python. The intogen.org database only holds information about mutated genes, so an application, the Mutated Sequence Generator (MSG), was developed in Python to incorporate this information into the normal gene sequences obtained from asia.ensembl.org and thereby produce mutated gene sequences. These mutated gene sequences are known as driver mutations30, i.e., mutations that cause cancer. The dataset used in this study was collected from 2608 cancer patients, covering 75 types of genes as outlined in Table 2. The dataset includes a total of 6778 mutated gene sequences and 6450 normal gene sequences. A set of 522 features has been extracted from each gene sequence. The problem is formulated as a binary classification task, with the target column indicating "Yes" for mutated gene sequences and "No" for normal gene sequences.
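To make the MSG step concrete, the following is a minimal sketch of the kind of logic it applies, assuming single-nucleotide substitutions described by a position, a reference base, and an alternate base; the function name, record format, and 0-based indexing are our illustrative assumptions, not the exact implementation used in this study.

```python
# Hypothetical sketch of the Mutated Sequence Generator (MSG):
# apply single-nucleotide substitutions to a normal gene sequence
# to produce a mutated (driver) sequence.

def apply_substitutions(normal_seq: str, mutations: list[tuple[int, str, str]]) -> str:
    seq = list(normal_seq)
    for pos, ref, alt in mutations:  # pos is 0-based in this sketch
        if seq[pos] != ref:
            raise ValueError(f"Reference mismatch at position {pos}")
        seq[pos] = alt               # substitute the alternate base
    return "".join(seq)

# Example: introduce a T>A substitution at position 5
mutated = apply_substitutions("ACGTGTACGT", [(5, "T", "A")])
print(mutated)  # ACGTGAACGT
```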

Table 2 Symbol of genes, number of mutations and number of samples.

It is especially important to balance both datasets in the proposed framework. Under-sampling and oversampling techniques are used for this purpose: the number of samples in the majority class is reduced in under-sampling, whereas the number of samples in the minority class is increased in oversampling31. Here, the Synthetic Minority Oversampling Technique (SMOTE) is used for balancing. The process of benchmark dataset collection is also explained with the help of Fig. 2.
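As an illustrative sketch, SMOTE balancing can be performed with the imbalanced-learn library; the variables X and y below stand for the extracted feature matrix and the "Yes"/"No" labels and are our assumptions, not the study's exact variable names.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# X: feature matrix extracted from gene sequences; y: "Yes"/"No" labels (assumed)
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)
print(Counter(y), "->", Counter(y_balanced))  # the minority class is oversampled
```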

Figure 2

A visual example of the data acquisition framework.

To obtain the most accurate results, a balanced dataset is used in the proposed study. A is the benchmark dataset of the proposed study, obtained after preprocessing with CD-HIT, an ultrafast protein sequence clustering program. From each homologous cluster returned by CD-HIT, one representative was chosen. A is defined by the following Eq. (1).

$$A = A^{+} \cup A^{-}$$
(1)

In the equation above, \(\cup \) denotes the union of both sets: \({A}^{+}\) contains all mutated gene sequences that cause cancer, \({A}^{-}\) contains the normal gene sequences, and \(A\) represents the total dataset. The dataset was used in the SCT (self-consistency test), IST (independent set test), and 10-FCVT (tenfold cross-validation test) for training and testing purposes.

Feature extraction

Feature extraction is used to improve the performance of the model: it reduces redundant data, so redundancy and irrelevancy are removed. The feature extraction technique derives useful features from the available data32, as illustrated in Fig. 3.

Figure 3

A visual example of the feature extraction steps of the proposed method.

Two kinds of construction models are used to present gene samples: sequential and discrete. Both are commonly used to represent genes in vector formulations. The sequential model represents the gene sequence as its nucleotide sequence, as shown in the following Eq. (2).

$$J = N_{1}N_{2}N_{3}N_{4}N_{5}N_{6}\ldots N_{n}$$
(2)

In Eq. (2) above, \({N}_{1}\) represents the first nucleotide in gene \(J\) and \({N}_{n}\) the last nucleotide; the total length of the sequence is \(n\). In the discrete model, a gene sample is represented by its nucleotide composition. The representation of gene J by the discrete model is shown in the following Eq. (3).

$$J = \left[ S_{1}\; S_{2}\; S_{3}\; S_{4} \ldots S_{20} \right]^{T}$$
(3)

where \({S}_{n}\left(n=1,2,3,4,\dots ,20\right)\) represents the helpful component features obtained by the extraction methods using the relevant nucleotides in gene \(J\). The sequential and discrete models explained in Eqs. (2) and (3) are further used in the statistical moments.

Statistical moments are used for the quantitative analysis of an acquired dataset. They are applied to transform the genomic data to a fixed size. Each statistical moment describes unique information representing the type and behavior of the data. Moments of size and scale variance serve as a tool for differentiating between functionally distinct sequences, and other moments that characterize data asymmetry and mean contribute to the development of a classifier from the labeled dataset. Scientists have discovered that the characteristics of genomic and proteomic sequences are affected by the content and relative location of their bases33. Therefore, only mathematical and computational models that are sensitive to the relative placement of component nucleotide bases within genomic sequences are used to generate the feature vector.

The gene sequences, comprising both normal and mutated variants, are acquired from a FASTA file; the original dataset is structured in this gene sequence format. To enhance the dataset, we employed eight distinct feature extraction techniques: statistical moments (raw, Hahn, central), PRIM (Position Relative Incidence Matrix), RPRIM (Reverse Position Relative Incidence Matrix), AAPIV (Accumulative Absolute Position Incidence Vector), RAAPIV (Reverse Accumulative Absolute Position Incidence Vector), and FV (Frequency Vector).

To implement these techniques, we formulated and constructed the necessary algorithms. Subsequently, we developed Python code to execute these algorithms, resulting in the extraction of a comprehensive set of 522 features from each gene sequence. This systematic approach aims to capture diverse aspects of the genetic information and enhance the richness of the dataset for subsequent analyses.

Statistical parameters are computed using the Hahn moments, a key idea in pattern recognition, together with the arithmetic mean and variance. The arithmetic mean is calculated by summing all the values in a dataset and dividing by the number of values, while the variance measures the spread of data points around the mean34. Hahn moments require two-dimensional data, so each genomic sequence is converted into a 2D matrix \({W}^{\prime}\) of size \(X \times X\), as in the following Eq. (4). It is a sequential representation of the nucleotide residues of the gene contained in \(W\).

$$W^{\prime} = \left[\begin{array}{cccc} U_{11} & U_{12} & \cdots & U_{1x} \\ U_{21} & U_{22} & \cdots & U_{2x} \\ \vdots & \vdots & \ddots & \vdots \\ U_{x1} & U_{x2} & \cdots & U_{xx} \end{array}\right]$$
(4)

In Eq. (4), \(U\) denotes a gene sequence element. The values of \({W}^{\prime}\) are used for computing the Hahn moments. The Hahn polynomial for the proposed study dataset is computed35 by the following Eq. (5), which is for a 1D matrix of size Z.

$$g_{x}^{u,v}\left(Y,Z\right)=\left(Z+v-1\right)_{x}\left(Z-1\right)_{x}\sum_{n=0}^{x}\left(-1\right)^{n}\frac{\left(-x\right)_{n}\left(-Y\right)_{n}\left(2Z+u+v-x-1\right)_{n}}{\left(Z+v-1\right)_{n}\left(Z-1\right)_{n}}\frac{1}{n!}$$
(5)

Here u and v are positive integers and predefined constants, x is the order of the moment, and Z is the size of the data. The Hahn moments calculated up to 3rd order for discrete 2D data are given in the following Eq. (6):

$$Q_{gh}=\sum_{s=0}^{Z-1}\sum_{t=0}^{Z-1}\delta_{st}\, f_{g}^{l,m}\left(s,Z\right) f_{h}^{l,m}\left(t,Z\right),\quad s,t=0,1,2,\dots ,Z-1$$
(6)

Here \(g+h\) is the order of the moment, \(l,m\) are predefined constants, and \({\delta }_{st}\) is an arbitrary element of matrix \({W}^{\prime}\). Equations (5) and (6) are employed to determine the normalized Hahn moment of any order efficiently. The unique features of the Hahn moments up to \({3}^{rd}\) order are \({Q}_{00},{Q}_{01},{Q}_{10},{Q}_{11},{Q}_{02},{Q}_{20},{Q}_{12},{Q}_{21},{Q}_{03}\) and \({Q}_{30}\)36. In total, 10 Hahn moments up to \({3}^{rd}\) order are calculated for every gene sequence.
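Before any moments can be computed, each sequence must be arranged as the square matrix of Eq. (4). A minimal sketch of that step follows; the integer encoding of the four bases and the zero-padding to the nearest square are our assumptions for illustration.

```python
import numpy as np

def sequence_to_matrix(seq: str) -> np.ndarray:
    """Map a gene sequence to the square matrix W' of Eq. (4)."""
    codes = {"A": 1, "C": 2, "G": 3, "T": 4}     # assumed integer encoding
    values = [codes[n] for n in seq]
    x = int(np.ceil(np.sqrt(len(values))))        # side length X
    values += [0] * (x * x - len(values))         # zero-pad to X*X entries
    return np.array(values, dtype=float).reshape(x, x)

W = sequence_to_matrix("ACGTACGTACGT")  # 12 values -> 4x4 matrix
```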

Raw moments are employed when calculating the mean, variance, and asymmetries of a probability distribution; neither scale invariance nor location invariance applies to them. Raw moments are also utilized for statistical imputation, the process of replacing missing data values in a dataset with the best substitute values to preserve information37,38.

Raw moments are computed from the values of \({W}^{\prime}\). The raw moments \({Z}_{jk}\) of order \(j+k\) are computed by the following Eq. (7).

$$Z_{jk}=\sum_{m}\sum_{n} m^{j}\, n^{k}\, W^{\prime}\left(m,n\right)$$
(7)

The origin of the data is used as the reference point from which raw moments are computed, with the distances of the components measured from the origin. The above equation computes raw moments up to \({3}^{rd}\) order, giving the features \({Z}_{00}\), \({Z}_{01}\), \({Z}_{10}\), \({Z}_{11}\), \({Z}_{02}\), \({Z}_{20}\), \({Z}_{12}\), \({Z}_{21}\), \({Z}_{03}\) and \({Z}_{30}\). For every gene sequence, 10 raw moments are calculated up to \({3}^{rd}\) order.
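A direct NumPy sketch of Eq. (7), computing the ten raw moments up to 3rd order from a matrix such as the W' built above:

```python
import numpy as np

def raw_moments(W: np.ndarray, max_order: int = 3) -> dict:
    """Z_jk = sum_m sum_n m^j * n^k * W'(m, n), for all j + k <= max_order."""
    m = np.arange(W.shape[0], dtype=float)
    n = np.arange(W.shape[1], dtype=float)
    return {
        (j, k): float((m[:, None] ** j * n[None, :] ** k * W).sum())
        for j in range(max_order + 1)
        for k in range(max_order + 1)
        if j + k <= max_order          # yields the 10 moments Z_00 ... Z_30
    }
```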

The centroid, also known as the geometric center or center of gravity, holds significance in both geometry and data analysis. In geometric figures, it represents the average position of all points on the shape's surface and serves as the intersection point of its medians. This point is often equated to the center of mass due to its influence on the figure's balance. In data analysis, the centroid extends its meaning to the average location of data points in a multi-dimensional space, with each dimension corresponding to a variable. Overall, the centroid stands as a pivotal notion, bridging the realms of geometry and statistics, providing insights into central tendencies and structural characteristics of shapes and datasets. It is the data point from which all the data is dispersed equally in all directions, these directions being weighted average relationships39,40,41.

The unique features of the central moments are calculated up to 3rd order by the following Eq. (8), with the centroid of the data as the reference point.

$${M}_{jk}= {\sum }_{m}{\sum }_{n}{\left(m-\overline{m }\right)}^{j}{\left(n-\overline{n }\right)}^{k}{ W}^{{\prime} }\left(m,n\right)$$
(8)

The unique features of the central moments up to \({3}^{rd}\) order are \({M}_{00},{M}_{01},{M}_{10},{M}_{11},{M}_{02},{M}_{20},{M}_{12},{M}_{21},{M}_{03}\) and \({M}_{30}\). The centroids \(\overline{m }\) and \(\overline{n }\) are computed as in the following Eqs. (9) and (10):

$$\overline{m }=\frac{{Z}_{10}}{{Z}_{00}} ,$$
(9)
$$\overline{n }=\frac{{Z}_{01}}{{Z}_{00}} $$
(10)

Ten central moments are likewise calculated for every gene sequence up to 3rd order. The 10 unique Hahn moment features, 10 unique raw moment features, and 10 unique central moment features obtained in this way are then unified into an SFV (Super Feature Vector).
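The central moments of Eq. (8) follow directly from the raw moments via the centroids of Eqs. (9) and (10); a sketch building on the raw_moments helper above:

```python
import numpy as np

def central_moments(W: np.ndarray, raw: dict) -> dict:
    """M_jk = sum_m sum_n (m - m_bar)^j (n - n_bar)^k W'(m, n)  (Eq. 8)."""
    m_bar = raw[(1, 0)] / raw[(0, 0)]   # Eq. (9)
    n_bar = raw[(0, 1)] / raw[(0, 0)]   # Eq. (10)
    m = np.arange(W.shape[0], dtype=float) - m_bar
    n = np.arange(W.shape[1], dtype=float) - n_bar
    return {
        (j, k): float((m[:, None] ** j * n[None, :] ** k * W).sum())
        for j in range(4) for k in range(4) if j + k <= 3
    }
```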

Identifying the genetic characteristics and the ordered locations of the nucleotides in gene sequences is very important41,42. The relative position of nucleotides in any gene sequence is seen as a fundamental pattern that makes use of the physical properties of the gene sequence43,44. PRIM represents the gene sequence as a matrix of order (20 × 20). The relative positions of all nucleotides in a given gene sequence are extracted by the following matrix in Eq. (11):

$$O_{PRIM}=\left[\begin{array}{cccccc} O_{1\to 1} & O_{1\to 2} & \cdots & O_{1\to n} & \cdots & O_{1\to C} \\ O_{2\to 1} & O_{2\to 2} & \cdots & O_{2\to n} & \cdots & O_{2\to C} \\ \vdots & \vdots & & \vdots & & \vdots \\ O_{m\to 1} & O_{m\to 2} & \cdots & O_{m\to n} & \cdots & O_{m\to C} \\ \vdots & \vdots & & \vdots & & \vdots \\ O_{C\to 1} & O_{C\to 2} & \cdots & O_{C\to n} & \cdots & O_{C\to C} \end{array}\right]$$
(11)

In the above equation, \({O}_{m\to n}\) denotes the accumulation of the relative positions of the \(n\)th base with respect to the initial occurrence of the \(m\)th base. The Hahn, raw, and central moments are then calculated from this 2D \({O}_{PRIM}\) matrix.

The R-PRIM calculation follows the same process as PRIM, except that R-PRIM works on the reverse ordering of the gene sequence35. Computing R-PRIM reveals underlying patterns, allowing discrepancies between homologous sequences to be resolved. R-PRIM is likewise constructed as a 2D matrix of order (20 × 20) with 400 coefficients, represented by the following Eq. (12):

$$O_{RPRIM}=\left[\begin{array}{cccccc} O_{1\to 1} & O_{1\to 2} & \cdots & O_{1\to n} & \cdots & O_{1\to C} \\ O_{2\to 1} & O_{2\to 2} & \cdots & O_{2\to n} & \cdots & O_{2\to C} \\ \vdots & \vdots & & \vdots & & \vdots \\ O_{m\to 1} & O_{m\to 2} & \cdots & O_{m\to n} & \cdots & O_{m\to C} \\ \vdots & \vdots & & \vdots & & \vdots \\ O_{C\to 1} & O_{C\to 2} & \cdots & O_{C\to n} & \cdots & O_{C\to C} \end{array}\right]$$
(12)

Just like PRIM, R-PRIM is used for calculating the Hahn, raw, and central moments.
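The exact accumulation rule of PRIM is not fully specified above; one plausible reading, sketched below, accumulates the positions of each base n measured from the first occurrence of each base m. A 4-letter nucleotide alphabet is used for illustration, whereas the study's matrices are (20 × 20); both the rule and the alphabet here are our assumptions.

```python
import numpy as np

BASES = "ACGT"  # illustrative alphabet; the study uses a 20x20 matrix

def prim(seq: str) -> np.ndarray:
    """One plausible PRIM reading: O[m][n] accumulates positions of base n
    measured from the first occurrence of base m (Eq. 11)."""
    size = len(BASES)
    O = np.zeros((size, size))
    first = {b: seq.find(b) for b in BASES}     # first occurrence of each base
    for pos, base in enumerate(seq):
        n = BASES.index(base)
        for m, b in enumerate(BASES):
            if first[b] != -1 and pos >= first[b]:
                O[m, n] += pos - first[b]
    return O

def rprim(seq: str) -> np.ndarray:
    """R-PRIM applies the same computation to the reversed sequence (Eq. 12)."""
    return prim(seq[::-1])
```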

The frequency vector is easily calculated by counting the number of times each nucleotide residue appears in the main sequence. Its elements reflect the frequency of occurrence of the relevant nucleotide residue within the supplied sequence; as a result, the frequency vector has 20 coefficients. The following Eq. (13) represents the frequency vector:

$$\sigma =\left\{{\rho }_{1},{\rho }_{2},{\rho }_{3},\dots ..,{\rho }_{n}\right\}$$
(13)

In the above equation, \(\rho \) is the frequency of each nucleotide in the gene sequence. These measurements capture the composition of the sequence irrespective of the positions of the nucleotides. The 20 FV (frequency vector) features are also integrated into the SFV (Super Feature Vector).
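A sketch of the frequency vector of Eq. (13) using a simple count; the 4-letter nucleotide alphabet is shown for illustration, whereas the study's FV carries 20 coefficients.

```python
from collections import Counter

def frequency_vector(seq: str, alphabet: str = "ACGT") -> list[int]:
    """rho_i = number of occurrences of the i-th residue (Eq. 13)."""
    counts = Counter(seq)
    return [counts[b] for b in alphabet]

print(frequency_vector("ACGTACGGA"))  # [3, 2, 3, 1]
```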

Feature extraction is a successful method for uncovering hidden patterns in gene sequences. The accumulative absolute position incidence vector (AAPIV) provides accumulative information regarding the positions at which each nucleotide base occurs in a gene sequence38. The placement of gene sequences of cutaneous melanoma is shown in Eq. (14):

$$P=\left\{{\xi }_{1},{\xi }_{2},{\xi }_{3},{\xi }_{4},\dots {\xi }_{n}\right\}$$
(14)

Here \(P\) is the AAPIV of a gene sequence with \(n\) total nucleotides; each \(i\)th component \({\xi }_{i}\) can be computed using the following Eq. (15).

$$\xi_{i}= \sum\limits_{k=0}^{n} J_{k}$$
(15)

In the above equation, \({\xi }_{i}\) is computed from the gene sequence \(J\), whose n nucleotides are \({J}_{k}\). AAPIV accommodates the relative positioning information of the 20 native nucleotides in a gene sequence, yielding 20 important features. These 20 AAPIV elements are likewise integrated into the SFV (Super Feature Vector).
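A sketch of the AAPIV computation of Eqs. (14) and (15): each element accumulates the absolute positions at which a given residue occurs. The 1-based position convention and the 4-letter alphabet are our assumptions for illustration.

```python
def aapiv(seq: str, alphabet: str = "ACGT") -> list[int]:
    """xi_i = sum of (1-based) positions where the i-th residue occurs (Eq. 15)."""
    return [
        sum(pos + 1 for pos, base in enumerate(seq) if base == b)
        for b in alphabet
    ]

print(aapiv("ACGTACGGA"))  # A at 1,5,9 -> 15; C at 2,6 -> 8; G at 3,7,8 -> 18; T at 4
```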

Reverse sequencing provides a more in-depth view of the hidden patterns in a gene sequence. RAAPIV refers to the computation of the AAPIV on the reversed gene sequence34. It is written as:

$$R=\left\{{\xi }_{1},{\xi }_{2},{\xi }_{3},{\xi }_{4},\dots {\xi }_{20}\right\}$$
(16)

The formation of the RAAPIV is given by the above Eq. (16), from which 20 unique features are made; these 20 unique features are then added to the SFV. Each element is computed as:

$$\xi_{i}= \sum_{k=0}^{n} \mathrm{Reverse}\left(J\right)_{k}$$
(17)

In the above equation, \({\xi }_{i}\) is an element of the R-AAPIV computed from the reversed gene sequence of \(J\), which has n nucleotides, as calculated by the above Eq. (17). Unique features are extracted by all the above-mentioned methods, and an SFV (Super Feature Vector) with 150 dimensions is created. This SFV is further used in the prediction algorithms described below.
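RAAPIV of Eqs. (16) and (17) reuses the AAPIV computation on the reversed sequence, and all extracted features are then concatenated into the SFV. A sketch follows, assuming the helper functions defined earlier in this section; the Hahn, PRIM, and RPRIM moment features are omitted here for brevity, so this assembly is illustrative rather than the full 150-dimensional vector.

```python
import numpy as np

def raapiv(seq: str, alphabet: str = "ACGT") -> list[int]:
    """RAAPIV = AAPIV computed on the reversed gene sequence (Eq. 17)."""
    return aapiv(seq[::-1], alphabet)

def super_feature_vector(seq: str) -> np.ndarray:
    """Concatenate moment, FV, AAPIV, and RAAPIV features into one SFV (sketch)."""
    W = sequence_to_matrix(seq)
    raw = raw_moments(W)
    central = central_moments(W, raw)
    parts = (list(raw.values()) + list(central.values())
             + frequency_vector(seq) + aapiv(seq) + raapiv(seq))
    return np.asarray(parts, dtype=float)
```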

Prediction algorithms

In this proposed study, a deep neural network with multiple layers is employed to identify cutaneous melanoma of the skin. Deep learning plays a significant role in recognition, prognosis, diagnosis, forecasting, and detection systems related to cutaneous melanoma. The deep neural network model comprises various layers, including an input layer, an output layer, a pooling layer, a dense layer, and a dropout layer, with fully connected layers stacked on top45. Each layer accepts input from the preceding layer and processes the input features. These layers incorporate learning characteristics that self-educate using various learning techniques46.

Within this work, three types of deep learning recurrent neural network (RNN) algorithms are utilized: Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Bidirectional LSTM47. These algorithms employ three assessment methods for the detection of cutaneous melanoma: a self-consistency test, an independent set test, and a tenfold cross-validation test. The first deep learning algorithm employed in this procedure is LSTM, chosen for its ability to address the vanishing gradient problem encountered in neural networks. The vanishing gradient problem occurs when gradients shrink toward zero during backpropagation, making neural networks challenging to train. LSTM is specifically utilized in recurrent neural networks to mitigate short-term memory and vanishing gradient issues by extending the network's memory capacity. It operates through a gated process involving three types of gates: input gates, forget gates, and output gates48. Each gate has a distinct role in regulating the flow of information from one stage to another. Consequently, unique activation functions are applied to each gate. Additionally, the suggested LSTM architecture includes an embedding layer. Figure 4 illustrates the LSTM architecture used in this proposed study.

Figure 4

An LSTM architecture.

The gates in LSTM oversee regulating the information flow from one cell to another, and each gate uses a different activation function49,50. The following Eq. (18) shows the gate computation of the LSTM.

$$g_{t}=\sigma\left(z_{t}X^{g}+j_{t-1}Y^{g}\right)$$
(18)

Data enters the LSTM layer from the embedding layer and is passed through the LSTM gates described in the above equation. In our proposed model, the embedding layer serves to convert input data into a fixed-length vector of a specified size: the vocabulary size is set at 1000, and the word vectors have a length of 64. Following the embedding layer, we incorporate an LSTM layer as the second layer, which features an output layer housing 128 neurons. Additionally, two dropout layers are introduced, with 10% of neurons deactivated, effectively addressing the issue of overfitting. A dense layer with 10 neurons is included in order to add depth to the network. Stochastic Gradient Descent (SGD) is employed as the optimizer within the LSTM layer, and the sigmoid function serves as the activation function. To minimize the loss, the Sparse Categorical Cross Entropy (SCCE) function is utilized.
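A minimal Keras sketch of the described LSTM stack follows, assuming TensorFlow/Keras; the two-unit output head for the binary mutated/normal target and the exact ordering of the dropout layers are our assumptions, not the study's verified layout.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

lstm_model = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=64),  # vocabulary 1000, vectors of 64
    layers.LSTM(128),                                 # LSTM layer with 128 neurons
    layers.Dropout(0.10),                             # 10% of neurons deactivated
    layers.Dense(10, activation="sigmoid"),           # dense layer with 10 neurons
    layers.Dropout(0.10),
    layers.Dense(2, activation="sigmoid"),            # assumed binary output head
])
lstm_model.compile(optimizer=tf.keras.optimizers.SGD(),
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```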

In the context of this study, the Gated Recurrent Unit (GRU) approach is the second deep learning method implemented. GRU exhibits fewer gates compared to LSTM but performs analogous functions. Notably, due to its reduced number of gates and parameters, GRU tends to yield superior results compared to LSTM. In the cell, GRU relies on just two gates: the reset gate and the update gate. The reset gate dictates the extent to which prior information is disregarded, while the update gate influences the extent to which past information is incorporated48. GRU also boasts faster computational speed when compared to LSTM51. Figure 5 provides a visual representation of the GRU architecture.

Figure 5

A GRU architecture.

In our proposed model, a single embedding layer is utilized to transform input data into vectors with a fixed word length of 64. Subsequently, the second layer consists of a GRU layer housing 256 neurons, accompanied by a basic RNN layer featuring 128 neurons. To avoid overfitting, two dropout layers are introduced, each deactivating 30% of neurons. Towards the end, a dense layer of 10 neurons is added. In the GRU layer, Stochastic Gradient Descent (SGD) is implemented as the optimizer, and the sigmoid function is applied as the activation function. To lessen the loss suffered when training the suggested model, Sparse Categorical Cross Entropy (SCCE) is adopted. The following Eqs. (19), (20), (21), and (23) show the working of the GRU.

$$s_{t}= \sigma\left(g_{t}K^{s}+j_{t-1}Z^{s}\right)$$
(19)
$$n_{t}= \sigma\left(g_{t}K^{n}+j_{t-1}Z^{n}\right)$$
(20)
$$j_{t}^{\prime}=\tanh\left(s_{t}*j_{t-1}K+ g_{t}Z \right)$$
(21)

Here \({s}_{t}\) represents the reset gate and \({n}_{t}\) the update gate.

$$p_{t}=\left(1-n_{t}\right)*j_{t}^{\prime}+n_{t}*j_{t-1}$$
(23)
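A comparable Keras sketch of the described GRU stack (GRU with 256 neurons, a basic RNN layer with 128, 30% dropout, and a dense layer of 10); the layer ordering and the binary output head are our assumptions.

```python
from tensorflow.keras import layers, models, optimizers

gru_model = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=64),
    layers.GRU(256, return_sequences=True),   # GRU layer with 256 neurons
    layers.Dropout(0.30),                     # 30% of neurons deactivated
    layers.SimpleRNN(128),                    # basic RNN layer with 128 neurons
    layers.Dropout(0.30),
    layers.Dense(10, activation="sigmoid"),   # dense layer with 10 neurons
    layers.Dense(2, activation="sigmoid"),    # assumed binary output head
])
gru_model.compile(optimizer=optimizers.SGD(),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```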

In the final stage of our proposed study, we utilize a bi-directional LSTM as our chosen deep learning approach. A bi-directional LSTM connects two LSTM cells, one operating in the forward direction and the other in the backward direction, ultimately yielding a single output52. In our suggested model, an embedding layer is employed to transform input data into fixed-length vectors, each consisting of 64 words. We incorporate two bi-directional layers, each featuring 128 neurons in the forward direction and 64 neurons in the backward direction. To prevent overfitting, three dropout layers are introduced, each deactivating 30% of neurons. One dense layer with 64 neurons is employed, and one dense layer with 10 neurons is added at the end. In the BLSTM layers, Stochastic Gradient Descent (SGD) is utilized as the optimizer, and the sigmoid function is used as the activation function, as shown in Fig. 6. To minimize the loss in training the suggested model, Sparse Categorical Cross Entropy (SCCE) is utilized.

Figure 6

A proposed BLSTM architecture.
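A Keras sketch of the described BLSTM stack follows; the asymmetric forward/backward widths use Keras's explicit backward_layer argument, and the final binary output head is our assumption.

```python
from tensorflow.keras import layers, models, optimizers

def bi_lstm(units_fwd: int, units_bwd: int, return_sequences: bool):
    # Keras allows asymmetric directions via an explicit backward layer
    return layers.Bidirectional(
        layers.LSTM(units_fwd, return_sequences=return_sequences),
        backward_layer=layers.LSTM(units_bwd, return_sequences=return_sequences,
                                   go_backwards=True))

blstm_model = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=64),
    bi_lstm(128, 64, return_sequences=True),   # 128 forward / 64 backward neurons
    layers.Dropout(0.30),
    bi_lstm(128, 64, return_sequences=False),
    layers.Dropout(0.30),
    layers.Dense(64, activation="sigmoid"),    # dense layer with 64 neurons
    layers.Dropout(0.30),
    layers.Dense(10, activation="sigmoid"),    # dense layer with 10 neurons
    layers.Dense(2, activation="sigmoid"),     # assumed binary output head
])
blstm_model.compile(optimizer=optimizers.SGD(),
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
```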

The goal of all these models is to achieve high accuracy. Unlike LSTM and GRU, the bi-directional LSTM does not require any prior information for prediction; it learns on its own by passing forward and backward through the sequence, which is why the outcome of the bi-directional LSTM is superior to LSTM and GRU53.

The BEDLM employs a divide-and-conquer strategy: it is used to increase the accuracy of individual base learners before compiling the entire model. To produce the best outcomes, many base learners are blended. Each base learner extracts various characteristics from data chunks received via the bootstrap process, produces an outcome, and the outcomes are combined; the data pieces are then fed back into the model. In this manner, the model learns the patterns hidden in the datasets. BEDLM is a flexible technique that outperforms simple machine learning algorithms in terms of accuracy, because the bootstrap approach allows for feature and row replacement strategies and the model learns from all conceivable data combinations; this also resolves overfitting difficulties. Bagging54, boosting55, blending, and stacking56 are four prominent ensemble deep learning model types, and the goal of all of them is high accuracy. Blending is the EDLM type used in the proposed study57. BEDLM improved the performance of all the above-described deep learning models, namely LSTM, GRU, and bi-directional LSTM. As shown in the following Fig. 7, the dataset processed above is divided into three groups: a training dataset, a validation dataset (denoted as V), and a testing dataset (denoted as T). The training dataset is used as input to all the DL models (LSTM, GRU, and bi-directional LSTM), for which trained model no. 1, trained model no. 2, and trained model no. 3 are developed. Each trained model is then tested on the validation and testing datasets. The results obtained from this model are shown in Table 3.

Figure 7

A proposed BEDLM architecture.

Table 3 Results of SCT by using LSTM, GRU, BLSTM and BEDLM.
$$p_{t,j}=\sum\limits_{m=1}^{M} w_{m} f_{m,j}$$
(24)

To build the BEDLM, all DL models are assigned weights as described in the above equation, where \({w}_{m}\left(m=1,2,3,4,\dots ,M\right)\) is the weight and \({f}_{m,j}\) is the prediction of the \(m\)th DL model for the \(j\)th observation. These testing strategies are run over 10 spans for each DL methodology, i.e., 10 feed-forward and feed-backward paths. In each testing iteration, the model determines its ROC, specificity, sensitivity, Matthews correlation coefficient, and accuracy.
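A sketch of the weighted blending of Eq. (24) in NumPy; the equal weights in the usage comment are an illustrative assumption, not the study's tuned values.

```python
import numpy as np

def blend(probas, weights):
    """Eq. (24): p_{t,j} = sum_m w_m * f_{m,j}, blending M base-learner outputs."""
    probas = np.asarray(probas)              # shape: (M models, N samples, C classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize the model weights
    return np.tensordot(w, probas, axes=1)   # shape: (N samples, C classes)

# e.g. blend LSTM, GRU and BLSTM predictions with assumed equal weights:
# final = blend([p_lstm, p_gru, p_blstm], [1, 1, 1]).argmax(axis=1)
```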

Results

The cutaneous melanoma of skin dataset is first preprocessed and then processed to obtain the essential aspects of the balanced data. The retrieved data is subsequently subjected to the deep learning (DL) algorithms. The independent set test, the self-consistency test, and the tenfold cross-validation test are used to validate the performance of the DL algorithms. This section explains the findings of these validation approaches.

In the proposed deep learning model, Fig. 8 illustrates the accuracy and loss values during the training and testing phases. The experiment is conducted using a tenfold cross-validation technique. Each fold comprises measurements such as training accuracy (Acc), testing accuracy (Val Acc), training loss (Loss), and testing loss (Val Loss) for both training and testing samples.

Figure 8

(a) Training and validation accuracy, and (b) training and validation loss graphs of the proposed BEDLM architecture.

The mathematical formulas of sensitivity, specificity, accuracy, and Matthews correlation coefficient are utilized to calculate the outcomes of the algorithms58. Sensitivity, also called recall or the true positive rate, is the probability of a positive result given the target condition. Specificity, also called the true negative rate, is the probability of a negative result in the absence of the target condition. Accuracy is the proportion of properly identified samples to the total number of samples. The MCC (Matthews correlation coefficient) measures the agreement between expected and actual values.

True positives (TP) are positive examples that the system correctly identified, while false negatives (FN) are positive examples that the system mistakenly classified as negative. False positives (FP) are negative examples wrongly classified as positive, while true negatives (TN) are negative examples correctly classified as negative. Here, the capacity to predict the count that properly identifies melanoma of the skin is referred to as sensitivity, and the capacity to predict the count that accurately identifies the absence of melanoma of the skin is referred to as specificity59. All subjects with the specified condition are represented by TP + FN, and TN + FP are the subjects who do not have the stated condition. The total number of subjects with positive findings is TP + FP, while the total number of subjects with negative results is TN + FN57.
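These definitions translate directly into code; a minimal sketch computing all four reported metrics from confusion-matrix counts:

```python
import math

def evaluation_metrics(tp: int, tn: int, fp: int, fn: int):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                  # true positive rate (recall)
    specificity = tn / (tn + fp)                  # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sensitivity, specificity, accuracy, mcc
```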

Self-consistency test (SCT)

The DL algorithms are tested using the SCT approach, in which the entire dataset (100% of the data) is utilized for both training and testing. The loss in the bidirectional LSTM is quite low, and LSTM, GRU, and bidirectional LSTM all achieved very high accuracy in the SCT, as demonstrated in Table 3. The decision boundary of the SCT is shown in Fig. 9 below, and the ROC curve (receiver operating characteristic curve) of the SCT is shown in Fig. 10 below.

Figure 9

Decision Boundary of SCT in BEDLM.

Figure 10

A visual example of receiver-operating (ROC) curve in SCT of BEDLM.

Independent set test (IST)

The IST is the second testing technique utilized for the suggested BEDLM strategy and its primary performance measurement approach. The values are retrieved from the confusion matrix, which is used to calculate the model's accuracy. 80% of the values in the dataset are used to train the algorithm, while 20% are utilized for testing. The decision boundary of the IST is illustrated in Fig. 11 below, and the ROC curve (receiver operating characteristic curve) of the IST is illustrated in Fig. 12 below. The independent set test results are shown in Table 4.
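The 80/20 hold-out split can be sketched with scikit-learn; X_balanced and y_balanced refer to the balanced arrays assumed in the earlier SMOTE sketch.

```python
from sklearn.model_selection import train_test_split

# 80% of the dataset trains the model; 20% is held out for the independent set test
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.20, stratify=y_balanced, random_state=42)
```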

Figure 11

Decision Boundary of IST in BEDLM.

Figure 12

A visual example of the receiver-operating (ROC) curve in IST of BEDLM.

Table 4 Results of IST by using LSTM, GRU, BLSTM and BEDLM.

Tenfold cross validation test (10-FCVT)

In the tenfold cross-validation (10-FCVT) approach, the data is evenly subsampled into ten groups. Each of the ten partitions is in turn treated as a separate validation set while the model is trained on the remaining nine, and generalization performance is averaged over the tenfolds to determine hyper-parameter and architectural decisions12. The decision boundary of the 10-FCVT is illustrated in the following Fig. 13, and the ROC curve (receiver operating characteristic curve) of BEDLM in the 10-FCVT for all the DL algorithms (LSTM, GRU, BLSTM) is illustrated in the following Fig. 14. The 10-FCVT results are shown in Table 5.
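A sketch of the tenfold loop with scikit-learn; build_model is a hypothetical factory for any of the DL models above, and X_balanced and y_balanced are assumed to be NumPy arrays so that index-based slicing works.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Ten stratified folds: train on nine, validate on one, average the scores
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in skf.split(X_balanced, y_balanced):
    model = build_model()                 # hypothetical model factory (assumption)
    model.fit(X_balanced[train_idx], y_balanced[train_idx])
    fold_scores.append(model.score(X_balanced[val_idx], y_balanced[val_idx]))
print(f"Mean 10-FCVT accuracy: {np.mean(fold_scores):.3f}")
```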

Figure 13

Decision Boundary of 10-FCVT in BEDLM.

Figure 14

A visual example of the receiver-operating (ROC) curve of BEDLM in the tenfold cross validation test (10-FCVT).

Table 5 Results of tenfold cross validation test by using LSTM, GRU, BLSTM and BEDLM.

Discussion

The significance of an automated classification system for skin lesions extends beyond its potential to significantly reduce the workload of dermatologists. By minimizing subjectivity and human error in the classification process, such a system can enhance diagnostic accuracy. The consequences of inaccurate or delayed diagnoses are evident in instances of inappropriate therapy, occasionally necessitating more extensive surgical intervention and prolonged hospital stays. Dermatologists, even with substantial experience, exhibit varying recall rates in skin cancer screening, emphasizing the need for robust diagnostic tools.

In regions where skilled dermatologists are scarce, particularly in developing nations, the proposed automated method can be indispensable. The BEDLM model showcased its effectiveness in automating skin lesion classification, demonstrating the ability to assign class labels to previously unseen lesions. While these findings are promising, further validation and progress hinge on acquiring more clinical data, including factors such as age, gender, race, and family history. Such data are critical before deep learning models can be considered for practical use in clinical settings.

The focus on melanoma detection, a rapidly progressive and extremely dangerous form of skin cancer, underscores the collaborative efforts between medical and computational research. The proposed BEDLM, incorporating approaches such as LSTM, GRU, and bi-directional LSTM, leverages datasets from recent retrospective cohort studies to identify melanoma. This approach aims at enabling early diagnosis, even before visible symptoms manifest.

Utilizing the most recent dataset for normal and mutant gene sequences of cutaneous melanocyte cancer, this study employs three distinct testing methodologies: SCT, 10-FCVT, and IST. The results indicate high accuracy levels, with 94%, 93%, and 97% accuracy in SCT, 10-FCVT, and IST, respectively. Notably, the BLSTM model achieves an impressive 99% accuracy in skin melanoma prediction, showcasing its suitability for high-accuracy applications.

Table 3 provides a comprehensive overview of outcomes for LSTM, GRU, BLSTM, and BEDLM across SCT, IST, and 10-FCVT. The BEDLM, trained on nine folds and tested on one fold, undergoes repeated iterations, using the complete dataset for both testing and training. The incorporation of randomized data in each iteration enhances learning, with the average accuracy calculated at the conclusion of the process.

Looking ahead, genomic technologies hold promise in predicting subsets of melanoma patients more accurately, both biologically and clinically. These advancements pave the way for personalized medicine, allowing precise identification of molecular alterations in tumor cell populations as the disease progresses.

Conclusion

Cutaneous melanoma of the skin is a very severe type of cancer owing to its fast progression. As a result, for early diagnosis, a BEDLM framework is devised, as shown in Fig. 7, comprising three deep learning algorithms: GRU (illustrated in Fig. 5), LSTM (illustrated in Fig. 4), and BLSTM (illustrated in Fig. 6). Normal gene sequences are downloaded from asia.ensembl.org, while mutated gene information is acquired from IntOGen.org with the help of web scraping code. By integrating the mutation information into normal gene sequences, mutated sequences are obtained, as shown in Fig. 2 (data acquisition framework). The proposed study dataset contained 2608 human samples and 6778 mutations in total, along with 75 different gene types, as shown in Table 2. Multiple feature extraction techniques are used in this study, as shown in Fig. 3 (feature extraction), to obtain features from normal and mutated gene sequences and to convert the data into numeric format for training and testing. The BEDLM proposed in this study gives 97% accuracy, as shown in Table 3 (results of SCT, IST, and 10-FCVT for LSTM, GRU, BLSTM, and BEDLM). To evaluate the performance of the proposed model, different testing techniques such as the 10-FCVT, SCT, and IST are applied. The decision boundary of the SCT is illustrated in Fig. 9 and its ROC curve in Fig. 10; the ROC curve of the IST is illustrated in Fig. 12 and its decision boundary in Fig. 11; the decision boundary of the 10-FCVT is shown in Fig. 13 and its ROC curve in Fig. 14. The proposed BEDLM shows superior performance in the IST. A comparison of all algorithms (LSTM, GRU, BLSTM, and BEDLM) in terms of accuracy, sensitivity, specificity, and Matthews correlation coefficient is shown in Table 3. The findings demonstrate that the suggested BEDLM can be used effectively for the early diagnosis of cutaneous melanoma of the skin.

In the future, this technique can be applied to identifying other life-threatening diseases, and other deep learning models can be explored to obtain more accuracy and efficiency.