A database for using machine learning and data mining techniques for coronary artery disease diagnosis

Alizadehsani, R.; Roshanzamir, M.; Abdar, M.; Beykikhoshk, A.; Khosravi, A.; Panahiazar, M.; Koohestani, A.; Khozeimeh, F.; Nahavandi, S.; Sarrafzadegan, N.

doi:10.1038/s41597-019-0206-3


Data Descriptor
Open access
Published: 23 October 2019


R. Alizadehsani¹, M. Roshanzamir², M. Abdar³, A. Beykikhoshk⁴, A. Khosravi¹, M. Panahiazar⁵, A. Koohestani¹, F. Khozeimeh⁶, S. Nahavandi¹ & N. Sarrafzadegan^{7,8}

Scientific Data volume 6, Article number: 227 (2019)

Abstract

We present the coronary artery disease (CAD) database, a comprehensive resource comprising 126 papers and 68 datasets relevant to CAD diagnosis, extracted from the scientific literature published between 1992 and 2018. These data were collected to help advance research on CAD-related machine learning and data mining algorithms and, ultimately, to advance clinical diagnosis and early treatment. To aid users, we have also built a web application that presents the database through various reports.

Measurement(s)	coronary artery disease
Technology Type(s)	digital curation
Factor Type(s)	year • disease
Sample Characteristic - Organism	Homo sapiens

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.9825680

Background & Summary

According to the World Health Organization (WHO; http://www.who.int/news-room/fact), cardiovascular diseases (CVDs) are the leading cause of death worldwide. CVDs comprise diseases of the heart and blood vessels, such as coronary heart disease (CHD), cerebrovascular disease, and rheumatic heart disease (RHD), among others. According to the latest WHO report, available at http://www.who.int/news-room/fact and http://www.who.int/cardiovascular_diseases/en/, more than 17.7 million people are estimated to have died from CVDs in 2015, accounting for 31% of all deaths globally. Of these, approximately 7.4 million deaths were attributed to CHD, which is also called coronary artery disease (CAD)^{1,2,3}. In other words, CVDs, and CAD in particular, are among the deadliest diseases in both developed and developing countries, and paying attention to them is vital.

Although the CAD mortality rate is high, the chance of survival is higher if the diagnosis is made early enough. Therefore, scientists have devised predictive models to identify high-risk patients. Recently, machine learning (ML) and data mining (DM) approaches have become increasingly popular for constructing models not only for the early diagnosis of CAD^{4,5,6,7,8,9,10,11} but also for other fatal diseases^{12,13,14,15,16,17} such as cancer^{18,19,20,21,22,23,24}. These techniques reveal hidden structures in large volumes of medical data that help to achieve a quicker diagnosis²⁵. In effect, this is a semi-automated approach for finding patterns in data²⁶.

Although there are some datasets for various diseases^{27,28,29,30,31,32,33,34,35,36,37,38,39,40}, there are no comprehensive, publicly available benchmarks that summarize the research and conclusions on CAD diagnosis. As a result, the studies in this field are not well organized. One solution to this problem is to create a database of all studies that collects their related information. Using this database, researchers can explore the latest work in the field and stay informed about the new methods proposed and the results achieved. Therefore, this research attempts to provide a comprehensive dataset of the related works at the intersection of ML/DM and heart disease detection as a bridge for further research in the future. The impact of CAD on our daily lives and the popularity of ML/DM motivated us to create such a database. To the best of our knowledge, this is the first database that covers most relevant datasets as well as the related outcomes obtained by ML algorithms. It can also help to identify modifications of ML/DM techniques that are relevant to CAD progression and development.

This database provides comprehensive and fundamental information on the early detection of CAD in order to illuminate the patterns and processes that are used in ML/DM approaches. For instance, Alizadehsani et al.³ used an SVM to distinguish patients with CAD from healthy individuals. Their model had an accuracy of 95% and revealed that, apart from typical chest pain, regional wall motion abnormality and ejection fraction (echocardiographic features), age, and hypertension have the highest importance in CAD diagnosis. Therefore, ML/DM techniques will be useful to biologists, computer scientists, healthcare researchers, and physicians who are experts in the CAD area.

The advantages of using ML/DM methods for CAD diagnosis can be summarized as follows³:

They may result in early detection, which leads to a decrease in the mortality rate.
ML/DM can provide an a priori probability of disease, and this probability can be used to selectively target patients for angiography. This saves cost and time for the remaining patients, who also avoid the side effects of angiography.
ML/DM can extract hidden patterns in the collected data. This may lead to new methods for the early detection of many diseases, including CAD.
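
The second advantage above, using a predicted probability to decide who is referred for angiography, can be sketched as a simple thresholding step. This is a minimal illustration only: the function name, the patient identifiers, the probabilities, and the threshold of 0.5 are all hypothetical and are not taken from any of the surveyed studies.

```python
from typing import Dict, List


def select_for_angiography(probabilities: Dict[str, float], threshold: float) -> List[str]:
    """Return the IDs of patients whose predicted CAD probability meets the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    # Patients below the threshold are not referred, which spares them the procedure.
    return sorted(pid for pid, p in probabilities.items() if p >= threshold)


# Hypothetical model outputs, invented purely for illustration.
predicted = {"patient_a": 0.91, "patient_b": 0.12, "patient_c": 0.57}
referred = select_for_angiography(predicted, threshold=0.5)
print(referred)
```

In practice the threshold would have to be chosen clinically, since lowering it refers more patients and raising it risks missing cases.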

Although ML/DM techniques have many advantages, they are not perfect. The following factors limit their abilities.

According to the no-free-lunch theorem⁴¹, different ML/DM algorithms are suited to different problems. An algorithm may work well on one dataset yet perform poorly on others. Selecting a suitable algorithm for a specific dataset is therefore a big challenge in bioinformatics, and so is selecting good feature selection or classification algorithms.
ML/DM algorithms commonly need massive datasets for training. These datasets must be inclusive, unbiased, and of high quality. Datasets also take time to collect⁴².
ML/DM algorithms need enough time for training and testing to be able to generate results with high confidence. These algorithms also need a lot of resources and equipment⁴³.
ML/DM algorithms face the verification problem. It is difficult to prove that the predictions they make are correct in all scenarios⁴³.
The correct interpretation of the results generated by ML/DM algorithms is another challenge⁴².
Another disadvantage of ML/DM algorithms is their high susceptibility to error. If they are trained with biased or incorrect data, they produce imprecise outputs. This may lead to a chain of errors that misleads treatment. When these errors are noticed, it takes some time to diagnose their source and even more time to correct them⁴⁴.

The benefits of using our collected dataset can be listed as follows:

Researchers can quickly and easily access the results of much state-of-the-art research in this field. Examples include the important features for CAD in each country, performance comparisons across studies, the features used in each study, and much other useful information that can be extracted from this dataset. Consequently, researchers can identify areas in which little work has been done. It also helps researchers avoid duplicating existing work.
This dataset facilitates the literature review step. Researchers can conduct their research more quickly and with higher quality. Referees can also use it to check the novelty of newly proposed methods and to quickly access the important properties of the articles. Using this dataset, top-ranked researchers, better algorithms, and the journals with the most published work in this field can be found easily. It can also be extended to the diagnosis of other diseases, especially more common ones such as diabetes, cancers, and hepatitis.

However, this dataset has some disadvantages. Currently, it is updated manually: finding new papers, extracting their properties, and adding them to the dataset are all done by hand.

Meanwhile, one of the most important weaknesses of the research we investigated is the size of the datasets used. Unfortunately, almost none of the researchers use big datasets, because collecting many records takes a lot of time and money. If we want results with extremely high confidence, we need more than one million records. Projects like Electronic Health Records (EHR)⁴⁵ can help to achieve this goal. In an EHR project, information about patients is saved electronically in a digital format. It can be shared across different health care centers to ease the treatment process. An EHR includes almost all necessary records of patients, such as their medical history, the drugs they use, their procedures, vitals, allergies, and laboratory test results. This mechanism has improved the quality of care. As the number of samples in EHRs increases, the quality of treatment methods is expected to improve. EHRs can also reduce the risk of data replication, as there is only one medical file for each patient, which is usually kept fully up to date. As all the information about the patients is saved in a digital searchable file, EHRs are more effective for extracting information for treatment methods. Population-based methods can also be applied more easily with the widespread adoption of EHRs.

State-of-the-art methods like deep learning can also benefit from EHR, because deep learning needs large amounts of data for learning and these data are collected in EHR. Deep learning⁴⁶ is a family of machine learning algorithms based on artificial neural networks. This method now shows significant ability in solving machine learning problems. It is inspired by the distributed processing of biological systems. Because of its learning ability, it is used in various fields such as machine vision, image processing, and bioinformatics.

Deep Survival Analysis⁴⁷ is a hierarchical generative approach that uses EHR data for survival analysis. It handles the characteristics of EHR data and enables accurate risk scores for an event of interest. Traditional survival analysis^{48,49} suffers from some weaknesses. For example, the high dimensionality and extreme sparsity of EHR data make traditional models difficult to use. Deep Survival Analysis differs from traditional survival methods in that all observations are modeled jointly and conditioned on a rich latent structure.

As a clinical implication of this research, physicians can use this dataset to select more effective features for CAD diagnosis according to the region in which they live. This can increase the accuracy of their diagnosis and help the early treatment of patients. As mentioned above, it can also reduce the use of angiography for patients with suspected CAD and thereby avoid the side effects of an unnecessary procedure. More importantly, our system will work as a recommendation engine that helps clinicians decide on a specific treatment for a specific group of patients with similar characteristics, moving toward personalized medicine.

Methods

For the first time, we designed and implemented a complete dataset about the research in the CAD diagnosis field. It is an important field in which many researchers work, so access to a complete resource of the research in this field can help researchers improve their work more quickly and precisely. This dataset also includes some useful utilities for extracting information from the data saved in it. These utilities are accessible at www.cadataset.com. Using this dataset, new information can be extracted. For example, a physician can find which features are more important in CAD diagnosis in different regions.

This study concentrates on papers published from 1992 to 2018 that are related to CAD diagnosis using ML/DM techniques. For the sake of completeness, we used Google Scholar to find the most relevant articles. The database includes information such as authors, publisher, title of the paper, country (of publication or where the research was conducted), methods that are applied, evaluation metrics, type of diseases, features that are used, journal/conference, and the most important features used in their analysis (e.g., Alizadehsani et al., Elsevier, a data mining approach for diagnosis of coronary artery disease, Iran, [SVM, Naïve Bayes], [Accuracy, Sensitivity, Specificity], CAD, 55 features, computer methods and programs in biomedicine, 36 features).
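
To make the shape of one entry concrete, the example record above can be written as a small typed structure. This is only a sketch: the class name and field names are hypothetical labels chosen for illustration and do not reflect the actual column names of the database; the values are those of the example in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PaperRecord:
    """One paper entry; field names are illustrative, not the database's own."""
    authors: str
    publisher: str
    title: str
    country: str
    methods: List[str]
    evaluation_metrics: List[str]
    disease: str
    num_features_used: int
    venue: str
    num_important_features: int


# Values taken from the example record given in the text.
example = PaperRecord(
    authors="Alizadehsani et al.",
    publisher="Elsevier",
    title="A data mining approach for diagnosis of coronary artery disease",
    country="Iran",
    methods=["SVM", "Naïve Bayes"],
    evaluation_metrics=["Accuracy", "Sensitivity", "Specificity"],
    disease="CAD",
    num_features_used=55,
    venue="Computer Methods and Programs in Biomedicine",
    num_important_features=36,
)
print(example.title, example.methods)
```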

As many CAD-related articles are published every year, we built the dataset such that it can be easily updated. Using this database, for each paper published in the field, one can determine in which countries the data were collected and which features have been reported to be of importance. In addition, features not considered in each study are also determined, allowing researchers to examine those features in the future. It also reveals which authors have more influence in the field, so others can use their experience. The journals that have published the most articles in this field are determined to allow researchers to decide where to publish their new articles. All of the algorithms that have been used to date and the accuracy that they have achieved are identified so that researchers can choose algorithms that have not yet been used and compare them with previous results. The articles that have the most citations have been identified so that researchers can use their ideas. The publishing houses that have published the most articles in this field are identified. The datasets and the feature categories that achieved the highest accuracy are determined to help researchers in feature selection. The feature selection algorithms that researchers have used have been identified to help new researchers choose the best method. The articles that have achieved the highest accuracy are specified to help researchers decide which features and methods give better results.
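
The kinds of summaries described above (most active venues, best accuracy per algorithm, algorithms not yet tried) amount to simple aggregations over the records. The following sketch shows one way they could be computed; the rows, venue names, accuracies, and function names are invented placeholders, not data or code from the database.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Placeholder rows of (venue, method, accuracy); values are invented for illustration.
Row = Tuple[str, str, float]
results: List[Row] = [
    ("Journal A", "SVM", 0.90),
    ("Journal A", "Naive Bayes", 0.85),
    ("Journal B", "SVM", 0.93),
    ("Journal A", "Decision tree", 0.88),
]


def venues_by_paper_count(rows: Iterable[Row]) -> List[Tuple[str, int]]:
    """Rank venues by how many result rows they contributed."""
    return Counter(venue for venue, _, _ in rows).most_common()


def best_accuracy_per_method(rows: Iterable[Row]) -> Dict[str, float]:
    """Keep the highest reported accuracy for each method."""
    best: Dict[str, float] = {}
    for _, method, accuracy in rows:
        best[method] = max(accuracy, best.get(method, 0.0))
    return best


def unused_methods(candidates: Iterable[str], rows: Iterable[Row]) -> List[str]:
    """List candidate methods that do not appear in any row yet."""
    used = {method for _, method, _ in rows}
    return sorted(set(candidates) - used)


print(venues_by_paper_count(results))
print(best_accuracy_per_method(results))
print(unused_methods(["SVM", "Random forest"], results))
```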

As future work, several aspects of the mechanisms used to collect and manage our dataset can be improved. They are summarized as follows:

There are no published data for most countries in Europe, Africa, Australia, and South America. This lack of information is important as regional and racial differences may affect the way CAD is detected and treated. Thus, we recommend collecting CAD data and constructing databases from various continents and countries.
Most of the investigated datasets have a limited number of features. This severely limits the final results since the number of both samples and features can affect the performance of ML techniques. Hence, we will construct CAD databases with more features.
The median sample size for the CAD datasets that were investigated in research in this field is less than 500. The larger the number of samples is, the more significant are the statistical results. To ensure reliability and trustworthiness, a model should be developed and tested using at least one million samples⁵⁰.
Another problem with previous studies is the way the data were collected. Since the datasets differ in terms of the number of samples and features, it is not easy to compare ML techniques in terms of performance. In other words, the results obtained in various studies are comparable only if the data are the same. This dataset can help researchers to select features that make their research comparable with others.
As mentioned above, our dataset is currently updated manually. The tool that is now used to manipulate the dataset needs to be improved so that it can update the dataset automatically.
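
The effect of sample size noted in the list above can be illustrated with the usual normal-approximation confidence interval for a reported accuracy, whose half-width is z * sqrt(p(1 - p) / n). This is a rough sketch under stated assumptions: the accuracy of 0.9 is a hypothetical value, z = 1.96 corresponds to a 95% interval, and the approximation ignores issues such as class imbalance and correlated samples. The two sample sizes are the ones mentioned in the text.

```python
import math


def accuracy_ci_half_width(accuracy: float, n_samples: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for an accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Standard error of a proportion, scaled by the z-value of the interval.
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Compare a dataset of 500 samples with one of one million samples.
for n in (500, 1_000_000):
    print(n, accuracy_ci_half_width(0.9, n))
```

Because the half-width shrinks with the square root of n, going from 500 to one million samples narrows the interval by a factor of about 45.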

Database structure

The database designed in this research includes 15 tables shown in Tables 1–15. These tables include the following information:

the field name,
whether it is a primary key (P.K.) or a part of it,
whether it is a foreign key (F.K.) and, if so, to which table it refers,
and a brief description of that field.

Table 1 lists the journals and conferences in which the investigated papers were published. In Table 2, the authors of the investigated papers are listed. Commonly in each paper, there are one or more datasets to which the proposed algorithms were applied. These datasets are listed in Table 3. Currently, we investigated only four heart diseases that are listed in Table 4. They are CAD and stenosis of LAD, LCX, and RCA. This list, however, is extendable to other diseases in the future. The features investigated in the articles are listed in Table 5, and the list of methods used for diagnosis is shown in Table 6. In most of the papers, the researchers selected a subset of features in the investigated datasets. The feature selection algorithms are listed in Table 7. Table 8 is dedicated to the characteristics of research papers but not the review papers in the field. The characteristics of review papers are shown in Table 9. Table 10 shows which method was applied on a specific disease in a dataset in a specific paper. The results of applying this method are also reported. Table 11 shows the features of each dataset. Table 12 indicates the authors of each research paper, while Table 13 indicates the authors of review papers. Since we separated the tables of research papers and review papers, we did the same for their authors as well. The research papers and review papers have different fields to report on. Therefore, we used different tables to save their details. To specify the feature selection algorithm that was used in each paper, Table 14 is designed. Finally, in Table 15, the rank that was assigned to each selected feature was reported.
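
The primary-key and foreign-key relationships described above can be sketched with a small relational schema. The example below covers only four of the fifteen tables (journals, authors, research papers, and the paper-author link), and every table and column name in it is a hypothetical stand-in, since the actual field names are given in Tables 1–15. It uses an in-memory SQLite database through Python's standard sqlite3 module; the inserted row reuses the example paper mentioned in the Methods section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE journal (
    journal_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE research_paper (
    paper_id   INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    journal_id INTEGER NOT NULL REFERENCES journal(journal_id)
);
-- Link table: the primary key is composed of two foreign keys.
CREATE TABLE research_paper_author (
    paper_id  INTEGER NOT NULL REFERENCES research_paper(paper_id),
    author_id INTEGER NOT NULL REFERENCES author(author_id),
    PRIMARY KEY (paper_id, author_id)
);
""")

conn.execute("INSERT INTO journal VALUES (1, 'Computer Methods and Programs in Biomedicine')")
conn.execute("INSERT INTO author VALUES (1, 'R. Alizadehsani')")
conn.execute(
    "INSERT INTO research_paper VALUES "
    "(1, 'A data mining approach for diagnosis of coronary artery disease', 1)"
)
conn.execute("INSERT INTO research_paper_author VALUES (1, 1)")

# Follow the foreign keys to list each author with the paper and its venue.
rows = conn.execute("""
SELECT a.name, p.title, j.name
FROM research_paper_author AS pa
JOIN author AS a ON a.author_id = pa.author_id
JOIN research_paper AS p ON p.paper_id = pa.paper_id
JOIN journal AS j ON j.journal_id = p.journal_id
""").fetchall()
print(rows)

# A paper that points at a non-existent journal is rejected.
try:
    conn.execute("INSERT INTO research_paper VALUES (2, 'Orphan paper', 99)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```

Keeping authors in a separate link table, as the text describes for research and review papers, allows one paper to have many authors and one author to have many papers.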

Table 1 The fields of “Journals/Conferences” table, their properties and descriptions.
