Skip to main content

Thank you for visiting You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Machine learning prediction for mortality of patients diagnosed with COVID-19: a nationwide Korean cohort study


The rapid spread of COVID-19 has resulted in the shortage of medical resources, which necessitates accurate prognosis prediction to triage patients effectively. This study used the nationwide cohort of South Korea to develop a machine learning model to predict prognosis based on sociodemographic and medical information. Of 10,237 COVID-19 patients, 228 (2.2%) died, 7772 (75.9%) recovered, and 2237 (21.9%) were still in isolation or being treated at the last follow-up (April 16, 2020). The Cox proportional hazards regression analysis revealed that age > 70, male sex, moderate or severe disability, the presence of symptoms, nursing home residence, and comorbidities of diabetes mellitus (DM), chronic lung disease, or asthma were significantly associated with increased risk of mortality (p ≤ 0.047). For machine learning, the least absolute shrinkage and selection operator (LASSO), linear support vector machine (SVM), SVM with radial basis function kernel, random forest (RF), and k-nearest neighbors were tested. In prediction of mortality, LASSO and linear SVM demonstrated high sensitivities (90.7% [95% confidence interval: 83.3, 97.3] and 92.0% [85.9, 98.1], respectively) and specificities (91.4% [90.3, 92.5] and 91.8%, [90.7, 92.9], respectively) while maintaining high specificities > 90%, as well as high area under the receiver operating characteristics curves (0.963 [0.946, 0.979] and 0.962 [0.945, 0.979], respectively). The most significant predictors for LASSO included old age and preexisting DM or cancer; for RF they were old age, infection route (cluster infection or infection from personal contact), and underlying hypertension. The proposed prediction model may be helpful for the quick triage of patients without having to wait for the results of additional tests such as laboratory or radiologic studies, during a pandemic when limited medical resources must be wisely allocated without hesitation.


A pneumonia of unknown cause detected in Wuhan, China was first reported to the World Health Organization (WHO) on December 31, 2019. A few weeks later, it was found to be caused by a novel coronavirus1. On February 11, 2020, the WHO formally named the causative coronavirus the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the disease caused by the virus, the coronavirus disease 2019 (COVID-19). COVID-19 is the third known zoonotic coronavirus disease after SARS and the Middle East Respiratory Syndrome (MERS)2. After the onset of its rapid spread worldwide, the WHO declared the COVID-19 outbreak a global pandemic on March 11, 2020.

COVID-19 has a higher mortality rate (3.8% according to the WHO as of August 2020) than influenza and spreads more rapidly over much wider areas than prior coronavirus diseases. COVID-19 has already claimed far more lives than its predecessors (813 deaths for SARS and 858 deaths for MERS)3,4. As of October 05, 2020, COVID-19 had infected over 35 million people with the worldwide death toll exceeding 1,040,0005.

Because of the rapid spread of the virus, there has been a sharp increase in the demand for medical resources required to support infected people. Despite the desperate efforts to contain the disease and slow down its spread, many countries have been suffering from the shortage of hospital beds and critical care equipment for the timely treatment of ill patients6,7,8. Therefore, in addition to efficient diagnosis and treatment, accurate prognosis prediction is necessary to reduce the strain on healthcare systems and provide the best possible care for patients. When allocating limited medical resources, prediction models that estimate the risk of a poor outcome in an infected individual based on pre-diagnosis information could help to effectively triage patients.

All Koreans are mandated to enroll in national health insurance, except for the population in the lowest income bracket, who is funded by taxes and covered by Medicaid. Consequently, information regarding the sociodemographic characteristics and history of medical service use of virtually all Koreans is available in the database where currently information regarding COVID-19 patients is also periodically updated.

The purpose of this study was to develop and validate machine learning models that predict the prognosis of COVID-19 patients based on sociodemographic information, infection route, and medical status and history, for the nationwide cohort of South Korea.


Baseline characteristics

The patient characteristics are summarized in Table 1. The mean (± standard deviation [SD]) age was 44.97 (± 19.79) years; patients who died of COVID-19 were significantly older than those who recovered, with the mean (± SD) age being 78.17 (± 10.96) years and 43.06 (± 18.32) years, respectively. Women comprised 60.1% of the study population. Approximately 38.5% of the patients were symptomatic at the time of diagnosis. A majority of the infections (58.1%) were cluster infection, and 11.3% of the patients were nursing home residents. Of the 10,237 patients, 3147 (30.7%) had one or more underlying medical conditions; the four most common conditions were hypertension (18.2%), hyperlipidemia (18.0%), chronic lung disease or asthma (10.5%), and diabetes mellitus (DM) (10.0%).

Table 1 Baseline characteristics.

Mortality from COVID-19

A total of 10,237 patients were diagnosed with COVID-19 between January 23, 2020 and April 2, 2020 and were followed up until April 16, 2020. The case fatality ratio was 2.85% (228 deaths and 7772 recoveries). The median interval between diagnosis and mortality was 9 days (range 0–49 days), and 22 days (range 0–75 days) between diagnosis and recovery (Fig. 1). No significant difference in the interval between diagnosis and mortality or recovery was found among different age groups (p > 0.231) (Fig. 2).

Figure 1
figure 1

Histogram illustrating the distribution of the time interval between diagnosis and recovery (A) or mortality (B).

Figure 2
figure 2

Box plot illustrating the time interval between diagnosis and recovery or mortality according to the age group.

Factors associated with mortality

Cox proportional hazards model

In the multivariable analysis with medication excluded (Table 2), age > 70, male sex, moderate or severe disability, the presence of symptoms, infection at a nursing home, DM, and chronic lung disease or asthma were significantly associated with increased risk of mortality (p ≤ 0.047), whereas age < 40 and cluster infection or infection from personal contact or visit were associated with decreased risk (p ≤ 0.007). In the multivariable analysis with underlying disease excluded (Table 3), the use of loop diuretics or acarbose was significantly associated with increased risk of mortality (p ≤ 0.018).

Table 2 Results of Cox proportional hazards regression without medication.
Table 3 Results of Cox proportional hazards regression for medication.

Variable importance from machine learning

In predicting the final outcome (i.e., mortality vs. recovery), the five most significant predictors for the least absolute shrinkage and selection operator (LASSO) were age > 80, taking of acarbose, age > 70, taking of metformin, and underlying cancer in order of significance; for random forest (RF) they were cluster infection, infection from personal contact or visit, underlying hypertension, and age > 80 (Fig. 3). In predicting early mortality, similar patterns were observed (Supplementary Figs. 1 and 2). Overall, in addition to old age, LASSO focused on DM medication and cancer whereas RF relied on the infection route and hypertension.

Figure 3
figure 3

Variable importance in prediction of mortality from COVID-19 by LASSO (A) and Random Forest (B).

Performance of machine learning

The performances of the machine learning models in the final testing are summarized in Table 4. The optimized hyperparameters can be found in Supplementary Table 1. LASSO and linear support vector machine (SVM) showed high areas under the receiver operating characteristic curves (AUCs) (> 0.9), high balanced accuracies (> 91% for final mortality and > 86% for 14- or 30-day mortality), and sensitivities (> 90% for final mortality and > 83% for 14- or 30-day mortality) without compromising specificities (approximately 90%). However, the sensitivities of RF and radial basis function (RBF)-SVM were < 50%, ranging between 10.2% and 42.7% despite their high AUCs and specificities. K-nearest neighbor (KNN) showed the lowest AUCs and intermediate sensitivities and balanced accuracies. Regardless of the machine learning model, negative predictive values (NPVs) were high (> 97%), and positive predictive values (PPVs) were low (13.0–44.4%). All the models displayed have higher performances in long-term prediction than in short-term prediction.

Table 4 Final performance of machine learning models in prediction of mortality from COVID-19 in the test set.


Our results demonstrate that machine learning models utilizing sociodemographic characteristics and medical history can accurately predict the prognosis of COVID-19 patients after diagnosis; the models predict not only the final outcome (i.e., mortality vs. recovery) but also early mortality (i.e., 14- or 30-day mortality). The proposed prediction model aims at the quick triage of patients without having to wait for the results of additional tests such as laboratory or radiologic studies, during a pandemic when limited medical resources must be wisely allocated without hesitation.

Machine learning is focused on achieving high predictive accuracy without much focus on explaining how the accuracy is achieved. We presented the result of the Cox proportional hazard regression and variable importance as complements, showing how importantly input variables were used by LASSO and RF. In line with previous studies9,10,11,12,13,14,15,16,17,18,19,20, old age, male gender, and the presence of symptoms or underlying medical conditions were significantly associated with worse prognosis. We additionally found that moderate or severe disability and infection route were independently associated with prognosis.

Almost all previous studies reported that old age was a strong prognostic factor, and this was also confirmed by our results. However, the time to recovery or mortality did not differ by age in our study population.

Previously reported underlying medical conditions associated with poor prognosis include hypertension10,12,13,15,17,18, DM10,12,14,15, lung disease including chronic obstructive lung disease and asthma11,12,13, cardiovascular disease10,11,13,14,15, cancer21,22, and chronic renal disease12,18. In our study, DM (from Cox and LASSO), chronic lung disease or asthma (from Cox regression), cancer (from LASSO), and hypertension (from RF) were significant predictors. LASSO paid more attention to DM medication than the disease itself, which may have been because patients taking medication were more likely to have had DM longer than those not taking it. The insignificance of the other medical conditions, such as cardiovascular disease or chronic renal disease in our study, may be due to a different study population, the broadness of our operational definition of the diseases, and correlation with other strong predictive factors.

We also performed multivariable Cox regression to identify drugs associated with increased or decreased risk of mortality and found that the use of loop diuretics or acarbose was an independent risk factor. However, these results need to be interpreted cautiously. There may have been other confounding factors among the patients who consume this medication that we were unable to sufficiently adjust for. Loop diuretics are recommended to be considered in patients with congestive heart failure or advanced chronic renal disease23. Alpha-glucosidase inhibitors including acarbose are often used with a basal insulin regimen when basal insulin treatment alone did not result in glycemic control, especially postprandial glucose level in Asians24. Thus, the poor outcome in the patients taking the medications may have been due to the fact that they had have a longer duration of more comorbidities, not due to the direct drug effects.

Our main interest for the medication analysis was the angiotensin converting enzyme (ACE) inhibitors and angiotensin receptor (AR) blockers. There have been concerns regarding a potential harmful effect caused by ACE inhibitors and AR blockers in COVID-19 patients25. In our study, however, the use of ACE inhibitors or AR blockers was not significantly associated with mortality from COVID-19, in agreement with a recent large-scale study11.

Recent discussions pointed out the potential beneficial or harmful effects of other commonly prescribed drugs including antidiabetic drugs, statin, and aspirin in patients with COVID-1926,27,28,29,30. Some evidence suggested that the use of DDP-4, metformin, and statin may be associated with better prognosis in COVID-19 patients27,30,31. However, none of these effects has been validated yet. Due to the data structure and study design, we were also unable to effectively investigate the independent effects of such drugs. A further investigation is warranted, and several clinical studies including NCT04467931, NCT04510194, NCT04365309, and NCT04407273, are being planned or conducted to examine the potential association of the drugs with prognosis in COVID-19 patients32.

Our data had information regarding infection route, which showed that patients at nursing homes had worse prognosis whereas those who contracted the disease from large clusters had better outcomes. This may be attributed to the age distribution and the status of underlying diseases in these groups; most nursing home residents are elderly people with underlying diseases, whereas the infection clusters in Korea during the current outbreak were mostly churches and service call centers where a majority of attendees were young.

We tested several machine learning algorithms because the most appropriate algorithm may differ depending on data structure and a given task. LASSO and linear SVM demonstrated high sensitivities (> 90%) and almost perfect NPV (99.7%) in predicting mortality, which is clinically important because identifying and detecting at-risk patients is more significant than reducing false positive prediction. However, the other models showed low sensitivities despite our efforts to compensate for the class imbalance by up-sampling rare mortality cases and adding class weights when training them. Although we were not able to fully understand and explain this failure to overcome the class imbalance in these models, the difference in variable importance by LASSO and RF may be helpful in explaining the results. The two most important predictors for RF were cluster infection and personal contact where a very small proportion of the patients (0.6%, 42/7256) died, implying that RF chose to focus on detecting negative cases to achieve high AUC. In contrast, LASSO appears to have focused on the predictors associated with increased risk of mortality including old age and DM.

The current study has limitations. First, the data used in this study did not have information regarding laboratory or radiologic results which may also be important prognostic factors10,17,18,19,33. However, our prediction model aims at early prognosis prediction at the time of diagnosis before further diagnostic studies or treatment. Second, the treatments of patients can have a large impact on prognosis; however, we assumed that our patients all had standard therapy. In Korea, all patients are sent to designated hospitals with medical staff and equipment required to provide prompt standard therapy, and this is attributed to aggressive diagnosis and early intervention.

In conclusion, we have developed and validated robust machine learning models, which could be used to predict the prognosis of COVID-19 patients. LASSO and linear SVM demonstrated high sensitivities and specificities for identifying at-risk patients. Age, sex, moderate or severe disability, the presence of symptoms, and comorbidities including hypertension, DM, chronic lung disease or asthma, and cancer were significant risk factors.

Materials and methods

The Institutional Review Board of National Health Insurance Service Ilsan Hospital (NHIMC 2020-04-026) approved this retrospective Health Insurance Portability and Accountability Act-compliant cohort study and waived the informed consent from the participants, because this study was expected to present no or minimal risk of harm to the participants, and all the data used were anonymized. All methods were performed in accordance with relevant guidelines and regulations.

Data source

This study used a merged dataset combining relevant information from two data sources provided by the Korean National Health Insurance Service (KNHIS): the database of beneficiaries of national health insurance and the newly added database of patients with confirmed diagnosis of COVID-19. The KNHIS database provides all information regarding reimbursement for outpatient visits and hospital admissions, including sociodemographic information, medical diagnoses, procedures, prescriptions, and national health check-up results. Because virtually all Koreans are covered by national health insurance or Medicaid, the abovementioned information was available for all of our study participants. Detailed information on the KNHIS database including data configuration has been outlined elsewhere34,35. Data cannot be shared publicly because of the provisions of the KNHIS which prohibit authors from making the data publicly available and provides a limited portion of anonymized data to researchers for the purpose of a public interest.


The study included 10,237 Korean patients who had tested positive for COVID-19 from Jan 23, 2020 to Apr 16, 2020 (Fig. 4). Of these patients, 228 (2.2%) had died, 7772 (75.9%) had recovered, and 2237 (21.9%) were still in isolation or being treated. Sixty-seven patients lacked information on their income statuses and were excluded for cox proportional hazard regression and estimation of variable importance by machine learning.

Figure 4
figure 4

Flow diagram for study participants.

The sociodemographic and medical information included as potential predictive factors were age, sex, income level, place of residence, household type, disability, respiratory symptoms, infection route, underlying medical conditions, and medication (Table 1). The income level was divided into five categories; people in the lowest level were covered by Medicaid, and the four upper levels included quartiles of the insured patients based on the amount of their monthly contributions. Household type had two categories: seniors (> 65 year) living alone and other house types. There were five categories of infection route: personal contact with an infected person or visit to an affected area, cluster infection (e.g., from a religious gathering or packed workplace), infection at a nursing home, infection from abroad, and unclassified. In the KNHIS database, diagnoses were coded according to the Korean Standard Classification of Diseases (KCD) version 6, which is based on the International Classification of Diseases 10th revision (ICD-10). The operational definitions of the medical conditions used in this study are presented in Supplementary Table 2. Based on the prescription information provided by the KNHIS, patients were considered to be on medication if they had received a prescription that lasted more than 180 days for the last year. Medications reported to be potentially associated with COVID-19 prognosis were examined in this study11,26,30,36,37: various types of antihypertensive drugs (ACE inhibitors, AR blockers, beta-blockers, calcium channel blockers and loop diuretics), antidiabetic drugs (acarbose, sufonylurea, metformin, and DDP-4), and lipid-lowering drugs (statin and fenofibrate), non-steroidal anti-inflammatory drug, and aspirin.

Statistical analysis

The Kruskal–Wallis test was performed to compare the age groups. The Cox proportional hazards model was used to assess the hazard ratios of mortality from COVID-19. Following univariable analyses, multivariable analyses were performed to identify independent predictive factors, first with medication usage excluded, and then with underlying disease excluded. This is because the existence of underlying diseases and medication usage are expected to have strong correlations with mortality from COVID-19. Recovered or undetermined cases were censored at the date of recovery or the date of last follow-up (April 16, 2020), respectively. Two-sided probability values of < 0.05 were considered statistically significant. The statistical analyses were performed using the SAS software (version; SAS Institute, Cary NC, USA).

Machine learning

The dataset was randomly split into training and test sets in a ratio of 7:3 while preserving the same proportion of mortality in both sets. Because only a small proportion (2.2%) of the study population died of COVID-19, using accuracy as an evaluation metric is inappropriate. In our study population, an accuracy of 97.8% could be achieved simply by predicting no death every time, which would be a useless classifier as it would result in zero sensitivity. To mitigate the issue of imbalanced classes, we oversampled rare cases, performed class weighting, and used an evaluation metric of AUC that is more appropriate for data with imbalanced classes than accuracy, when fitting our model in the training set.

The machine learning algorithms used in this study were LASSO, linear SVM, RBF-SVM, RF, and KNN. Including irrelevant features in a machine learning model likely results in overfitting and can undermines the generalizability of a prediction model38. Thus, LASSO and SVM were regularized using L1-norm which automatically selects important features39. For RF, k most important features were selected based on the variable importance, with k being a hyperparameter that is optimized through cross validation. For KNN, only the features that had been independently associated with mortality in the multivariable Cox regression were used as input data. After the feature selection, the algorithms were trained and tested for classifying mortality vs. recovery after excluding 2237 undetermined cases (i.e., patients who were not cured nor died, but were still in isolation or being treated at the last follow-up). Subsequently, we developed models to predict 14- or 30-day mortality, for which the study population was divided into two groups: patients who died of COVID-19 vs. those who did not within 14 days or 30 days after diagnosis, respectively. Hyperparameter optimization was performed through tenfold cross validation using grid search in the training set. After each optimal model was found, it was fit in the entire training set and tested in the test set.

Using LASSO and RF, the importance of all variables were evaluated and scaled to have a maximum value of 100. Information regarding variable importance was not obtainable from SVM or KNN. Machine learning was performed using the R software (version 3.4.4, R Foundation for Statistical Computing, Vienna, Austria) with the caret package (version 6.0-86).


  1. World Health Organization. Coronavirus disease (COVID-19) pandemic. (2020)

  2. Sun, P., Lu, X., Xu, C., Sun, W. & Pan, B. Understanding of COVID-19 based on current evidence. J. Med. Virol. 92, 548–551 (2020).

    CAS  Article  PubMed  Google Scholar 

  3. World Health Organization. Middle East respiratory syndrome coronavirus (MERS-CoV). (2020).

  4. World Health Organization. Cumulative Number of Reported Probable Cases of SARS. (2020).

  5. Worldometer. COVID-19 Coronavirus Pandemic (2020).

  6. Ranney, M. L., Griffeth, V. & Jha, A. K. Critical supply shortages—The need for ventilators and personal protective equipment during the Covid-19 pandemic. N. Engl. J. Med. 382, e41 (2020).

    CAS  Article  PubMed  Google Scholar 

  7. Gondi, S. et al. Personal protective equipment needs in the USA during the COVID-19 pandemic. Lancet 395, e90–e91 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  8. Smereka, J. & Szarpak, L. The use of personal protective equipment in the COVID-19 pandemic era. Am. J. Emerg. Med. 38, 1529–1530 (2020).

    Article  PubMed Central  PubMed  Google Scholar 

  9. Gong, J. et al. A tool to early predict severe corona virus disease 2019 (COVID-19): A multicenter study using the risk nomogram in Wuhan and Guangdong, China. Clin. Infect. Dis. (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  10. Yuan, M., Yin, W., Tao, Z., Tan, W. & Hu, Y. Association of radiologic findings with mortality of patients infected with 2019 novel coronavirus in Wuhan, China. PLoS ONE 15, e0230548 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  11. Mehra, M. R., Desai, S. S., Kuy, S., Henry, T. D. & Patel, A. N. Cardiovascular disease, drug therapy, and mortality in Covid-19. N. Engl. J. Med. (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  12. Zhou, F. et al. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: A retrospective cohort study. Lancet 395, 1054–1062 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  13. Wang, L. et al. Coronavirus Disease 2019 in elderly patients: Characteristics and prognostic factors based on 4-week follow-up. J. Infect. 80, 639–645 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  14. Guo, W. et al. Diabetes is a risk factor for the progression and prognosis of COVID-19. Diabet. Metab. Res. Rev. (2020).

    Article  Google Scholar 

  15. Shi, Y. et al. Host susceptibility to severe COVID-19 and establishment of a host risk score: Findings of 487 cases outside Wuhan. Crit. Care Lond. Engl. 24, 108 (2020).

    Article  Google Scholar 

  16. Meng, Y. et al. Sex-specific clinical characteristics and prognosis of coronavirus disease-19 infection in Wuhan, China: A retrospective study of 168 severe patients. Plos Pathog. 16, e1008520 (2020).

    Article  PubMed Central  PubMed  Google Scholar 

  17. Feng, Z. et al. Early prediction of disease progression in COVID-19 pneumonia patients with chest CT and clinical characteristics. Nat. Commun. 11, 4968 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  18. Hajifathalian, K. et al. Development and external validation of a prediction risk model for short-term mortality among hospitalized U.S. COVID-19 patients: A proposal for the COVID-AID risk tool. PLoS ONE 15, e0239536 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  19. Yan, L. et al. An interpretable mortality prediction model for COVID-19 patients. Nat. Mach. Intell. (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  20. Shi, S. et al. Association of cardiac injury with mortality in hospitalized patients with COVID-19 in Wuhan, China. Jama Cardiol. 5, 2 (2020).

    Google Scholar 

  21. Lee, L. Y. W. et al. COVID-19 prevalence and mortality in patients with cancer and the effect of primary tumour subtype and patient demographics: a prospective cohort study. Lancet Oncol. 21, 1309–1316 (2020).

    Article  PubMed Central  PubMed  Google Scholar 

  22. Albiges, L. et al. Determinants of the outcomes of patients with cancer infected with SARS-CoV-2: Results from the Gustave Roussy cohort. Nat. Cancer (2020).

    Article  PubMed  Google Scholar 

  23. Lee, H.-Y. et al. 2018 Korean Society of Hypertension Guidelines for the management of hypertension: Part II-diagnosis and treatment of hypertension. Clin. Hypertens. 25, 20 (2019).

    Article  PubMed Central  PubMed  Google Scholar 

  24. Lee, M. Y. et al. Comparison of acarbose and voglibose in diabetes patients who are inadequately controlled with basal insulin treatment: Randomized, parallel, open-label, active-controlled study. J. Korean Med. Sci. 29, 90–97 (2013).

    Article  PubMed Central  PubMed  Google Scholar 

  25. Bavishi, C., Maddox, T. M. & Messerli, F. H. Coronavirus disease 2019 (COVID-19) infection and renin angiotensin system blockers. Jama Cardiol. 5, 2 (2020).

    Article  Google Scholar 

  26. Bianconi, V. et al. Is acetylsalicylic acid a safe and potentially useful choice for adult patients with COVID-19 ?. Drugs 80, 1383–1396 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  27. Hariyanto, T. I. & Kurniawan, A. Metformin use is associated with reduced mortality rate from coronavirus disease 2019 (COVID-19) infection. Obes Med. 19, 100290 (2020).

    Article  PubMed Central  PubMed  Google Scholar 

  28. Subir, R. Pros and cons for use of statins in people with coronavirus disease-19 (COVID-19). Diabet. Metab. Syndr. Clin. Res. Rev. 14, 1225–1229 (2020).

    Article  Google Scholar 

  29. Bifulco, M. & Gazzerro, P. Statins in coronavirus outbreak: It’s time for experimental and clinical studies. Pharmacol. Res. 156, 104803 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  30. Mirabelli, M., Chiefari, E., Puccio, L., Foti, D. P. & Brunetti, A. Potential benefits and harms of novel antidiabetic drugs during COVID-19 crisis. Int. J. Environ. Res. Pu. 17, 3664 (2020).

    CAS  Article  Google Scholar 

  31. Rodriguez-Nava, G. et al. Atorvastatin associated with decreased hazard for death in COVID-19 patients admitted to an ICU: A retrospective cohort study. Crit. Care 24, 429 (2020).

    Article  PubMed Central  PubMed  Google Scholar 

  32. U.S. National Library of Medicine. (2020).

  33. Yue, H. et al. Machine learning-based CT radiomics method for predicting hospital stay in patients with pneumonia associated with SARS-CoV-2 infection: A multicenter study. Ann. Transl. Med. 8, 859 (2020).

    Article  PubMed Central  PubMed  Google Scholar 

  34. Song, S. O. et al. Background and data configuration process of a nationwide population-based study using the korean national health insurance system. Diabet. Metab. J. 38, 395–403 (2014).

    Article  Google Scholar 

  35. Seong, S. C. et al. Data resource profile: The national health information database of the national health insurance service in South Korea. Int. J. Epidemiol. (2016).

    Article  Google Scholar 

  36. Cheng, H., Wang, Y. & Wang, G.-Q. Organ-protective effect of angiotensin-converting enzyme 2 and its effect on the prognosis of COVID-19. J. Med. Virol. (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  37. Fajgenbaum, D. C. & Rader, D. J. Teaching old drugs new tricks: Statins for COVID-19?. Cell Metab. 32, 145–147 (2020).

    CAS  Article  PubMed Central  PubMed  Google Scholar 

  38. Remeseiro, B. & Bolon-Canedo, V. A review of feature selection methods in medical applications. Comput. Biol. Med. 112, 103375 (2019).

    CAS  Article  PubMed  Google Scholar 

  39. Maldonado, S., Weber, R. & Basak, J. Simultaneous feature selection and classification using kernel-penalized support vector machines. Inform. Sci. 181, 115–128 (2011).

    Article  Google Scholar 

Download references


The Korea Center for Disease Control and Prevention (KCDC) and National Health Insurance Service (NHIS) kindly provided data used for this study. The authors would like to thank the KCDC and NHIS for their support and cooperation. This research was conducted as part of the Ilsan Machine Intelligence with National health insurance big Data (I-MIND) project.

Author information

Authors and Affiliations



C.A., D.W.K., J.H.C., and Y.J.C. conceived the study. C.A., Y.J.C., and S.W.K. reviewed the literature and provided the idea of study. H.L. and D.W.K. participated in collecting and processing data. C.A. conducted machine learning development. C.A. and H.L. performed statistical analysis. C.A. wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yoon Jung Choi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

An, C., Lim, H., Kim, DW. et al. Machine learning prediction for mortality of patients diagnosed with COVID-19: a nationwide Korean cohort study. Sci Rep 10, 18716 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:

Further reading


By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.


Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing