Most biomedical datasets, including those of ‘omics, population studies, and surveys, are rectangular in shape and have few missing data. Recently, their sample sizes have grown significantly. Rigorous analyses on these large datasets demand considerably more efficient and more accurate algorithms. Machine learning (ML) algorithms have been used to classify outcomes in biomedical datasets, including random forests (RF), decision tree (DT), artificial neural networks (ANN), and support vector machine (SVM). However, their performance and efficiency in classifying multi-category outcomes of rectangular data are poorly understood. Therefore, we compared these metrics among the 4 ML algorithms. As an example, we created a large rectangular dataset using the female breast cancers in the surveillance, epidemiology, and end results-18 database, which were diagnosed in 2004 and followed up until December 2016. The outcome was the five-category cause of death, namely alive, non-breast cancer, breast cancer, cardiovascular disease, and other cause. We analyzed the 54 dichotomized features from ~45,000 patients using MatLab (version 2018a) and the tenfold cross-validation approach. The accuracy in classifying five-category cause of death with DT, RF, ANN, and SVM was 69.21%, 70.23%, 70.16%, and 69.06%, respectively, which was higher than the accuracy of 68.12% with multinomial logistic regression. Based on the features' information entropy, we optimized dimension reduction (i.e., reduce the number of features in models). We found 32 or more features were required to maintain similar accuracy, while the running time decreased from 55.57 s for 54 features to 25.99 s for 32 features in RF, from 12.92 s to 10.48 s in ANN, and from 175.50 s to 67.81 s in SVM. In summary, we here show that RF, DT, ANN, and SVM had similar accuracy for classifying multi-category outcomes in this large rectangular dataset. Dimension reduction based on information gain will increase the model’s efficiency while maintaining classification accuracy.
Subscribe to Journal
Get full journal access for 1 year
only $41.58 per issue
All prices are NET prices.
VAT will be added later in the checkout.
Rent or Buy article
Get time limited or full article access on ReadCube.
All prices are NET prices.
Liu DD, Zhang L. Trends in the characteristics of human functional genomic data on the gene expression omnibus, 2001–2017. Lab Investig. 2019;99:118–27.
Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inf. 2007;2:59–77.
Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:8–17.
Bishop CM. Pattern recognition and machine learning. New York, NY, USA: Springer; 2006.
Chow ZL, Thike AA, Li HH, Nasir NDM, Yeong JPS, Tan PH. Counting mitoses with digital pathology in breast phyllodes tumors. Arch Pathol Lab Med. 2020;144:1397–400.
Koo J, Zhang J, Chaterji S. Tiresias: context-sensitive approach to decipher the presence and strength of MicroRNA regulatory interactions. Theranostics. 2018;8:277–91.
Wang J, Deng F, Zeng F, Shanahan AJ, Li WV, Zhang L. Predicting long-term multicategory cause of death in patients with prostate cancer: random forest versus multinomial model. Am J Cancer Res. 2020;10:1344–55.
Maniruzzaman M, Jahanur Rahman M, Ahammed B, Abedin MM, Suri HS, Biswas M, et al. Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms. Comput Methods Programs Biomed. 2019;176:173–93.
Pirooznia M, Yang JY, Yang MQ, Deng Y. A comparative study of different machine learning methods on microarray gene expression data. BMC Genom. 2008;9:S13.
Haibe-Kains B, Desmedt C, Sotiriou C, Bontempi G. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics. 2008;24:2200–8.
Siegel RL, Miller KD, Jemal A. Cancer statistics, 2020. CA Cancer J Clin. 2020;70:7–30.
Goetz MP, Gradishar WJ, Anderson BO, Abraham J, Aft R, Allison KH, et al. NCCN guidelines insights: breast cancer, version 3.2018. J Natl Compr Canc Netw. 2019;17:118–26.
Chavali LB, Llanos AAM, Yun JP, Hill SM, Tan XL, Zhang L. Radiotherapy for patients with resected tumor deposit-positive colorectal cancer: a surveillance, epidemiology, and end results-based population study. Arch Pathol Lab Med. 2018;142:721–9.
Yang M, Bao W, Zhang X, Kang Y, Haffty B, Zhang L. Short-term and long-term clinical outcomes of uncommon types of invasive breast cancer. Histopathology. 2017;71:874–86.
Mayo E, Llanos AA, Yi X, Duan SZ, Zhang L. Prognostic value of tumour deposit and perineural invasion status in colorectal cancer patients: a SEER-based population study. Histopathology. 2016;69:230–8.
Bevers TB, Helvie M, Bonaccio E, Calhoun KE, Daly MB, Farrar WB, et al. Breast cancer screening and diagnosis, version 3.2018. J Natl Compr Cancer Netw. 2018;16:1362–89.
Afifi AM, Saad AM, Al-Husseini MJ, Elmehrath AO, Northfelt DW, Sonbol MB. Causes of death after breast cancer diagnosis: a US population-based analysis. Cancer. 2020;126:1559–67.
Clough-Gorr KM, Thwin SS, Stuck AE, Silliman RA. Examining five- and ten-year survival in older women with breast cancer using cancer-specific geriatric assessment. Eur J Cancer. 2012;48:805–12.
Amrane M, Oukid S, Gagaoua I, Ensari T. Breast cancer classification using machine learning. stanbul: Electric Electronics, Computer Science, Biomedical Engineeringsʼ Meeting (EBBT); 2018. p. 1–4. https://doi.org/10.1109/EBBT.2018.8391453.
Mao Y, Fu Z, Dong L, Zheng Y, Dong J, Li X. Identification of a 26-lncRNAs risk model for predicting overall survival of cervical squamous cell carcinoma based on integrated bioinformatics analysis. DNA Cell Biol. 2019;38:322–32.
Dong RZ, Yang X, Zhang XY, Gao PT, Ke AW, Sun HC, et al. Predicting overall survival of patients with hepatocellular carcinoma using a three-category method based on DNA methylation and machine learning. J Cell Mol Med. 2019;23:3369–74.
Grzesiak W, Zaborski D. Examples of the use of data mining methods in animal breeding. Adem Karahoca, editor. Data mining applications in engineering and medicine. London, UK: IntechOpen Limited; 2012; 303–24.
Wang XC, Shi F, Yu L, Li Y. Cases analysis of MATLAB neural network. Beijing: Beijing University of Aeronautics and Astronautics. 2009. p. 59–62.
Jaiantilal A. Classification and regression by randomforest-matlab. (2009, 2012). https://code.google.com/archive/p/randomforest-matlab/ Accessed 22 July 2020.
Gonçalves CB, Leles ACQ, Oliveira LE, Guimaraes G, Cunha JR, Fernandes H. Machine learning and infrared thermography for breast cancer detection. Multidiscipl Digit Publish Inst Proc. 2019;27:45.
Sokolova M, Japkowicz N, Szpakowicz S. Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. Australasian joint conference on artificial intelligence. 2006; 1015–21.
Aruna S, Rajagopalan SP, Nandakishore LV. Knowledge based analysis of various statistical tools in detecting breast cancer. Comput Sci Inf Technol. 2011;2:37–45.
Youssef AM, Pradhan B, Jebur MN, El-Harbi HM. Landslide susceptibility mapping using ensemble bivariate and multivariate statistical models in Fayfa area, Saudi Arabia. Environ Earth Sci. 2015;73:3745–61.
Costache R, Hong H, Wang Y. Identification of torrential valleys using GIS and a novel hybrid integration of artificial intelligence, machine learning and bivariate statistics. Catena. 2019;183:104179.
Hong H, Liu J, Bui DT, Pradhan B, Acharya TD, Pham BT, et al. Landslide susceptibility mapping using J48 decision tree with AdaBoost, bagging and rotation forest ensembles in the Guangchang area (China). Catena. 2018;163:399–413.
Anyanwu MN, Shiva SG. Comparative analysis of serial decision tree classification algorithms. Int J Comput Sci Secur. 2009;3:230–40.
Chung D, Keles S. Sparse partial least squares classification for high dimensional data. Stat Appl Genet Mol Biol. 2010;9: Article17. https://www.degruyter.com/document/doi/10.2202/1544-6115.1492/html.
Safavian SR, Landgrebe D. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cybern. 1991;21:660–74.
Lan T, Hu H, Jiang C, Yang G, Zhao Z. A comparative study of decision tree, random forest, and convolutional neural network for spread-F identification. Adv Space Res. 2020;65:2052–61.
Garcia Leiv R, Fernandez AnA, Mancus V, Casari P. A novel hyperparameter-free approach to decision tree construction that avoids overfitting by design. IEEE Access. 2019;7:99978–87.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
Nguyen C, Wang Y, Nguyen HN. Random forest classifier combined with feature selection for breast cancer diagnosis and prognostic. J Biomed Sci Eng. 2013;6:551–60.
Jain AK, Jianchang M, Mohiuddin KM. Artificial neural networks: a tutorial. Computer. 1996;29:31–44.
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.
Fradkin D, Schneider D, Muchnik I. Machine learning methods in the analysis of lung cancer survival data. DIMACS technical report 2005–35. 2006.
We thank Lingling Han at Shenzhen Horb Technology Corporate, Ltd. for invaluable discussions and comments.
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Deng, F., Huang, J., Yuan, X. et al. Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data. Lab Invest (2021). https://doi.org/10.1038/s41374-020-00525-x