Blood cancer prediction using leukemia microarray gene data and hybrid logistic vector trees model

Rupapara, Vaibhav; Rustam, Furqan; Aljedaani, Wajdi; Shahzad, Hina Fatima; Lee, Ernesto; Ashraf, Imran

doi:10.1038/s41598-022-04835-6

Published: 19 January 2022

Scientific Reports volume 12, Article number: 1000 (2022)

Abstract

Blood cancer has been a growing concern during the last decade and requires early diagnosis so that proper treatment can begin. The diagnosis process is costly and time-consuming, involving medical experts and several tests. Thus, an automatic diagnosis system for accurate prediction is of significant importance. Diagnosis of blood cancer using leukemia microarray gene data and machine learning has become an important area of medical research. Despite research efforts, the desired accuracy and efficiency have not yet been reached. This study proposes an approach for blood cancer prediction using supervised machine learning. The leukemia microarray gene dataset, containing 22,283 genes, is used. ADASYN resampling and Chi-squared (Chi2) feature selection are used to resolve the problems of class imbalance and high dimensionality. ADASYN generates synthetic samples to balance the dataset across target classes, and Chi2 selects the best features out of 22,283 to train learning models. For classification, a hybrid logistic vector trees classifier (LVTrees) is proposed which combines logistic regression, support vector classifier, and extra tree classifier. Besides extensive experiments on the datasets, the performance is compared with state-of-the-art methods to determine the significance of the proposed approach. LVTrees outperforms all other models with ADASYN and Chi2 techniques, reaching 100% accuracy. Further, a statistical T-test is performed to show the efficacy of the proposed approach. Results using k-fold cross-validation confirm the superior performance of the proposed model.

Introduction

Cancer is the uncontrolled growth of abnormal cells that may spread to different parts of the human body¹. Currently, it is one of the leading causes of death in the world. Study² shows that approximately 10 million cancer deaths and 19.3 million new cases occurred in 2020 alone. Mortality varies with the type of cancer. For example, in 2020, lung cancer accounted for 18% of cancer deaths and colorectal cancer for 9.4%, while liver, stomach, and breast cancer accounted for 8.3%, 7.7%, and 6.9%, respectively. Blood cancer constitutes nearly 10% of all newly diagnosed cancer cases¹. Early diagnosis and prediction are considered prudent ways to reduce cancer deaths worldwide.

In this regard, this study focuses on the prediction of blood cancer. As noted by the Leukemia and Lymphoma Society³, in the United States (US) alone, 1,290,773 people have blood cancer. The common types of blood cancer include myeloma, leukemia, lymphoma, and myelodysplastic syndromes, among others. Specifically, blood cancers affect the blood cells, bone marrow, lymph nodes, and other parts of the lymphatic system. Research has led to the development of therapies that strengthen the immune system of affected individuals so that it can deal with cancer cells.

Previous studies on blood cancer prediction have utilized different models and algorithms, which yielded varying levels of accuracy and precision. For example, Goutam et al.⁴ utilized support vector machines (SVM) to achieve a precision of 85.74%, specificity of 80%, and sensitivity of 100%. Study⁵ used H2O deep learning and obtained an accuracy of 79.45%. Additionally, Vijayarani and Sudha⁶ applied K-means, fuzzy C-means, and weighted K-means, which achieved accuracies of 78%, 75%, and 85%, respectively. Similarly, Xiao et al.⁷ used k-nearest neighbor (KNN), SVM, decision trees (DT), random forest (RF), and gradient boosting decision trees to achieve accuracies of 99.20%, 98.78%, and 98.41%. On the other hand, Subhan et al.⁸ leveraged KNN and the Hough transform to obtain an accuracy of 93%. Gal et al.⁹ used KNN, SVM, and RF classifiers to achieve accuracy scores of 84%, 74%, and 81%, respectively. Despite such efforts to improve the performance of machine and deep learning classifiers, the desired accuracy has not been reached for blood cancer prediction.

The chief objective of the current study is to propose an approach that can perform blood cancer prediction with high accuracy using microarray gene data. Two important challenges associated with this task are data imbalance and the high dimensionality of the data. To overcome these issues, the current study uses adaptive synthetic (ADASYN) oversampling and Chi-square (Chi2) feature selection. In summary, this study makes the following contributions:

The performance of well-known machine learning algorithms is analyzed on microarray gene data. These algorithms include RF, logistic regression (LR), support vector classifier (SVC), KNN, Naive Bayes (NB), extra tree classifier (ETC), DT, and Adaboost classifier (ADA).
A hybrid model called LVTrees is proposed which combines LR, SVC, and ETC through majority voting. For data balancing, the influence of ADASYN is investigated, while Chi2 is used to select the optimal set of features for classification.
Extensive experiments are conducted to evaluate the efficacy of the proposed approach. In addition, several state-of-the-art methods are compared with the proposed approach. The statistical significance test is also performed to analyze the validity of the proposed approach. Results are further validated using k-fold cross-validation.
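
The hybrid model named in the contributions above can be illustrated with a minimal sketch, assuming scikit-learn is available. It chains Chi2 feature selection with a hard (majority) voting ensemble of LR, SVC, and ETC. The function name `build_lvtrees_sketch`, the hyperparameters, and the synthetic data are illustrative assumptions, not the authors' implementation or settings. The ADASYN step is omitted here; in practice it could be applied to the training data beforehand, for example with the `ADASYN` class from the imbalanced-learn package.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_lvtrees_sketch(n_features=50, seed=0):
    """Chi2 feature selection followed by majority voting over LR, SVC and ETC.

    All hyperparameters are assumed for illustration only.
    """
    voter = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("svc", SVC(kernel="linear")),
            ("etc", ExtraTreesClassifier(n_estimators=100, random_state=seed)),
        ],
        voting="hard",  # each model casts one vote; the most frequent label wins
    )
    # chi2 scoring requires non-negative feature values
    return make_pipeline(SelectKBest(chi2, k=n_features), voter)


# Synthetic stand-in for expression data: 60 samples, 200 non-negative features
rng = np.random.default_rng(0)
X = rng.random((60, 200))
y = (X[:, 0] > 0.5).astype(int)  # toy binary label tied to the first feature

model = build_lvtrees_sketch(n_features=10)
model.fit(X, y)
predictions = model.predict(X)
print(predictions.shape)
```

Placing the selector inside the pipeline means the Chi2 scores are computed only from the data passed to `fit`, which keeps feature selection from seeing held-out samples when the pipeline is used with a train/test split or cross-validation.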

The rest of the paper is organized as follows. The following section discusses the research papers related to the current study. The proposed methodology is described in the section “Materials and methods” while the section “Results and discussions” contains the analysis and discussion of results. In the end, the “Conclusion” section concludes the paper and highlights the direction for future work.

Related work

Owing to the importance of the healthcare domain, several research works in the literature focus on cancer prediction using machine and deep learning approaches. For example, studies¹⁰,¹¹ perform cancer prediction using image-based approaches. Similarly, Goutam et al.⁴ developed an automated system for the diagnosis of leukemia. The framework supports a variety of strategies such as K-means clustering. Data obtained from hospitals are used to examine the performance of the proposed method as a binary classifier. Results show that it obtains a 98% accuracy for cancer prediction. Vijayarani and Sudha⁶ focused on the prediction of disease using hemogram blood test data. A new algorithm called weight-based K-means is proposed to diagnose various diseases, e.g., human immunodeficiency virus (HIV) and viral infection. Tests are performed on data from 524 patients, and results show that the proposed algorithm achieves significantly higher accuracy than the fuzzy C-means and K-means clustering algorithms.

In the same way, a multi-model ensemble is presented in⁷ for predicting cancer. The authors analyzed gene data gathered from stomach, breast, and lung tissues. The DESeq approach is used to avoid overfitting in classification, which helped identify genes expressed differently between normal and tumor phenotypes. Moreover, it controlled the dimensionality of the data and improved prediction accuracy while significantly reducing computational time. Study¹² developed an automated method of detecting and classifying acute lymphoblastic leukemia based on a deep convolutional neural network (CNN). To test the performance, comparisons are made with different color models. The results show that the proposed method achieved high accuracy without requiring microscopic image segmentation. The authors of¹³ presented a diagnostic method to predict the primary stage of cancer. The model combines hybrid feature selection with preprocessing phases. From a subset of 25 features, the proposed model showed the highest accuracy with 14 optimal features. A four-phase process is employed to train on the subset of optimal features. Results show that classification accuracy can be greatly improved by applying preprocessing and feature selection before classification.

Study¹⁴ proposed two classification models to distinguish blood microscopic images of patients affected by leukemia from those free of leukemia. In the first model, a pre-trained CNN named AlexNet is used to extract the features and various other classifiers are used for classification. Tests show that SVM obtained better results than the other classifiers. In the second model, both extraction and classification are done using AlexNet only, and results show its superiority over other models on different performance metrics.

Table 1 Summary of the systematic analysis studies in related work.
