A Poisson binomial-based statistical testing framework for comorbidity discovery across electronic health record datasets

Lemmon, Gordon; Wesolowski, Sergiusz; Henrie, Alex; Tristani-Firouzi, Martin; Yandell, Mark

doi:10.1038/s43588-021-00141-9

Published: 21 October 2021

Nature Computational Science volume 1, pages 694–702 (2021)

Abstract

Discovering the concomitant occurrence of distinct medical conditions in a patient, also known as comorbidities, is a prerequisite for creating patient outcome prediction tools. Current comorbidity discovery applications are designed for small datasets and use stratification to control for confounding variables such as age, sex or ancestry. Stratification lowers false positive rates, but reduces power, as the size of the study cohort is decreased. Here we describe a Poisson binomial-based approach to comorbidity discovery (PBC) designed for big-data applications that circumvents the need for stratification. PBC adjusts for confounding demographic variables on a per-patient basis and models temporal relationships. We benchmark PBC using two datasets to compute comorbidity statistics on 4,623,841 pairs of potentially comorbid medical terms. The results of this computation are provided as a searchable web resource. Compared with current methods, the PBC approach reduces false positive associations while retaining statistical power to discover true comorbidities.
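The core idea of a Poisson binomial test can be sketched in a few lines. The sketch below assumes that each patient i has their own probability p_i of carrying both terms by chance (in PBC these would come from a model adjusted for demographic confounders; here they are simply given as input), and it asks how surprising the observed number of co-occurrences is. The function names and the dynamic-programming recursion are illustrative choices for this sketch, not the authors' implementation, which may use a different algorithm for the distribution function.

```python
from typing import List, Sequence


def poisson_binomial_pmf(probs: Sequence[float]) -> List[float]:
    """PMF of the number of successes among independent Bernoulli trials
    whose success probabilities may all differ (Poisson binomial)."""
    pmf = [1.0]  # with zero trials, P(X = 0) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this trial fails: count stays at k
            nxt[k + 1] += mass * p      # this trial succeeds: count becomes k + 1
        pmf = nxt
    return pmf


def upper_tail_p_value(probs: Sequence[float], observed: int) -> float:
    """P(X >= observed) under the null of independent per-patient probabilities."""
    if observed < 0:
        raise ValueError("observed count must be non-negative")
    pmf = poisson_binomial_pmf(probs)
    return min(1.0, sum(pmf[observed:]))


if __name__ == "__main__":
    # Hypothetical per-patient co-occurrence probabilities (illustrative only).
    per_patient = [0.1, 0.2, 0.3]
    print(upper_tail_p_value(per_patient, observed=2))  # approximately 0.098
```

Because every patient contributes an individual probability, no cohort has to be split into strata, which is the property the abstract highlights. The recursion costs O(n²) for n patients, so a large cohort would need a faster method in practice.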

**Fig. 1: Feature selection by L1 regularization.**

**Fig. 2: Modeling the effects of confounding variables.**

Data availability

In this paper we calculate comorbidity statistics for all pairs of medical billing codes, including diagnoses, procedures and medications. All of these P-values are available to query and download from the following link: https://pbc.genetics.utah.edu/lemmon2021. Furthermore, a 2.3 GB file containing comorbidity statistics for all 4,623,841 pairs of medical terms can be downloaded from the Open Science Framework⁵⁶. Source Data for Figs. 1 and 2 and Extended Data Figs. 1–4 are available with this manuscript. The original input data comprise detailed medical records from University of Utah Health. Because these data include protected health information (PHI; patient demographics, birth dates, and dated medical diagnosis, procedure and medication codes), we cannot make them available with this publication.

Code availability

We provide a CodeOcean capsule⁵⁷ including code and sample input data.

References

Valderas, J. M., Starfield, B., Sibbald, B., Salisbury, C. & Roland, M. Defining comorbidity: implications for understanding health and health services. Ann. Fam. Med. 7, 357–363 (2009).
Lone, N. I. et al. Predicting risk of unplanned hospital readmission in survivors of critical illness: a population-level cohort study. Thorax 74, 1046–1054 (2019).
Wang, H. et al. Predicting hospital readmission via cost-sensitive deep learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 15, 1968–1978 (2018).
Facchinetti, G. et al. Continuity of care interventions for preventing hospital readmission of older people with chronic diseases: a meta-analysis. Int. J. Nurs. Stud. 101, 103396 (2020).
Atashi, A., Sarbaz, M., Marashi, S., Hajialiasgari, F. & Eslami, S. Intensive care decision making: using prognostic models for resource allocation. Stud. Health Technol. Inform. 251, 145–148 (2018).
Yurkovich, M., Avina-Zubieta, J. A., Thomas, J., Gorenchtein, M. & Lacaille, D. A systematic review identifies valid comorbidity indices derived from administrative health data. J. Clin. Epidemiol. 68, 3–14 (2015).
Charlson, M. E., Pompei, P., Ales, K. L. & MacKenzie, C. R. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J. Chronic Dis. 40, 373–383 (1987).
Elixhauser, A., Steiner, C., Harris, D. R. & Coffey, R. M. Comorbidity measures for use with administrative data. Med. Care 36, 8–27 (1998).
Roque, F. S. et al. Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput. Biol. 7, e1002141 (2011).
Gutiérrez-Sacristán, A. et al. comoRbidity: an R package for the systematic analysis of disease comorbidities. Bioinformatics 34, 3228–3230 (2018).
Moni, M. A., Xu, H. & Liò, P. CytoCom: a Cytoscape app to visualize, query and analyse disease comorbidity networks. Bioinformatics 31, 969–971 (2015).
Moni, M. A. & Liò, P. comoR: a software for disease comorbidity risk assessment. J. Clin. Bioinforma. 4, 8 (2014).
Ronzano, F., Gutiérrez-Sacristán, A. & Furlong, L. I. Comorbidity4j: a tool for interactive analysis of disease comorbidities over large patient datasets. Bioinformatics 35, 3530–3532 (2019).
Siggaard, T. et al. Disease trajectory browser for exploring temporal, population-wide disease progression patterns in 7.2 million Danish patients. Nat. Commun. 11, 4952 (2020).
Winter, A. C., Rist, P. M., Buring, J. E. & Kurth, T. Prospective comorbidity-matched study of Parkinson’s disease and risk of mortality among women. BMJ Open 6, e011888 (2016).
Johnson, A. et al. MIMIC-IV (Version 1.0) (PhysioNet, 2021); https://doi.org/10.13026/S6N6-XD98
ICD-9-CM—International Classification of Diseases, Ninth Revision, Clinical Modification (CDC, 2019); https://www.cdc.gov/nchs/icd/icd9cm.htm
ICD-10-CM—International Classification of Diseases, Tenth Revision, Clinical Modification (CDC, 2020); https://www.cdc.gov/nchs/icd/icd10cm.htm
Clinical Classifications Software Refined (CCSR) (AHRQ, 2021); https://www.hcup-us.ahrq.gov/toolssoftware/ccsr/ccs_refined.jsp
CPT Codes (AAPC, 2021); https://www.aapc.com/resources/medical-coding/cpt.aspx
Liu, S. et al. RxNorm: prescription for electronic drug information exchange. IT Prof. 7, 17–23 (2005).
Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).
Seifert, T. The relationship of migraine and other headache disorders to concussion. Handb. Clin. Neurol. 158, 119–126 (2018).
Shimanovsky, A. et al. Autoimmune manifestations in patients with multiple myeloma and monoclonal gammopathy of undetermined significance. BBA Clin. 6, 12–18 (2016).
Saif, M. W., Kaley, K. & Lamb, L. Pancreatic adenocarcinoma complicated by sinistral portal hypertension. Cureus 8, e689 (2016).
Han, H. et al. Hypertension and breast cancer risk: a systematic review and meta-analysis. Sci. Rep. 7, 44877 (2017).
Li, X. et al. Comorbidities among patients with cancer who do and do not develop febrile neutropenia during the first chemotherapy cycle. J. Oncol. Pharm. Pract. 22, 679–689 (2016).
Chia, V. M. et al. Chronic comorbid conditions associated with risk of febrile neutropenia in breast cancer patients treated with chemotherapy. Breast Cancer Res. Treat. 138, 621–631 (2013).
Toma-Dasu, I., Wojcik, A. & Kjellsson Lindblom, E. Risk of second cancer following radiotherapy. Phys. Med. 42, 211–212 (2017).
Donin, N. et al. Risk of second primary malignancies among cancer survivors in the United States, 1992 through 2008. Cancer 122, 3075–3086 (2016).
Grantzau, T. & Overgaard, J. Risk of second non-breast cancer among patients treated with and without postoperative radiotherapy for primary breast cancer: a systematic review and meta-analysis of population-based studies including 522,739 patients. Radiother. Oncol. 121, 402–413 (2016).
Rissanen, J. Modeling by shortest data description. Automatica 14, 465–471 (1978).
Hassell, K. L. Population estimates of sickle cell disease in the U.S. Am. J. Prev. Med. 38, S512–S521 (2010).
Ahmadi, M., Poormansouri, S., Beiranvand, S. & Sedighie, L. Predictors and correlates of fatigue in sickle cell disease patients. Int. J. Hematol.-Oncol. Stem Cell Res. 12, 69–76 (2018).
Herson, J., Sharma, S., Crocker, C. L. & Jones, D. Physical complaints of patients with sickle cell trait. J. Reprod. Med. 14, 129–132 (1975).
Aich, A., Jones, M. K. & Gupta, K. Pain and sickle cell disease. Curr. Opin. Hematol. 26, 131–138 (2019).
Tariq, S. & Aronow, W. S. Use of inotropic agents in treatment of systolic heart failure. Int. J. Mol. Sci. 16, 29060–29068 (2015).
Anders, H.-J., Huber, T. B., Isermann, B. & Schiffer, M. CKD in diabetes: diabetic kidney disease versus nondiabetic kidney disease. Nat. Rev. Nephrol. 14, 361–377 (2018).
Koye, D. N., Magliano, D. J., Nelson, R. G. & Pavkov, M. E. The global epidemiology of diabetes and kidney disease. Adv. Chronic Kidney Dis. 25, 121–132 (2018).
El Fane, M. et al. Pneumocystosis during HIV infection. Rev. Pneumol. Clin. 72, 248–254 (2016).
Seravalle, G. & Grassi, G. Obesity and hypertension. Pharmacol. Res. 122, 1–7 (2017).
Hassaine, A., Salimi-Khorshidi, G., Canoy, D. & Rahimi, K. Untangling the complexity of multimorbidity with machine learning. Mech. Ageing Dev. 190, 111325 (2020).
Hassaine, A. et al. Learning multimorbidity patterns from electronic health records using non-negative matrix factorisation. J. Biomed. Inform. 112, 103606 (2020).
Chandrasekaran, B. & Jain, A. K. Quantization complexity and independent measurements. IEEE Trans. Comput. C-23, 102–106 (1974).
Trunk, G. V. A problem of dimensionality: a simple example. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1, 306–307 (1979).
Capobianco, E. & Lio’, P. Comorbidity: a multidimensional approach. Trends Mol. Med. 19, 515–521 (2013).
Linear Models Section 1.1, Scikit-learn 0.24.1 Documentation (ScikitLearn, 2021); https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Hong, Y. On computing the distribution function for the Poisson binomial distribution. Comput. Stat. Data Anal. 59, 41–51 (2013).
Hart, J. F. Computer Approximations (Wiley, 1968).
GNU Scientific Library: Reference Manual (Network Theory, 2009).
Behnel, S. et al. Cython: the best of both worlds. Comput. Sci. Eng. 13, 31–39 (2011).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Grinberg, M. Flask Web Development: Developing Web Applications with Python (O’Reilly, 2018).
Bostock, M., Ogievetsky, V. & Heer, J. D³: data-driven documents. IEEE Trans. Vis. Comput. Graph. 17, 2301–2309 (2011).
Lemmon, G., Wesolowski, S., Henrie, A., Tristani-Firouzi, M. & Yandell, M. PBC Comorbidities (OSF, 2021); https://doi.org/10.17605/OSF.IO/TH239
Lemmon, G., Wesolowski, S., Henrie, A., Tristani-Firouzi, M. & Yandell, M. A Poisson Binomial Based Statistical Testing Framework for Comprehensive Comorbidity Discovery Across Massive Electronic Health Record Datasets (CodeOcean, 2021); https://doi.org/10.24433/CO.2251918.v1

Acknowledgements

The following collaborators provided valuable discussion, feedback and insight that guided the development of PBC: B. Bray, V. Deshmukh, K. Eilbeck, E. J. Hernandez and R. Shah. We thank members of the University of Utah EDW for facilitating access to medical records. The computational resources used were partially funded by the NIH Shared Instrumentation Grant 1S10OD021644-01A1. This research was supported by the AHA Children’s Strategically Focused Research Network grant (17SFRN33630041) and the Nora Eccles Treadwell Foundation. G. Lemmon was supported by NRSA training grant T32H757632. S. Wesolowski was supported by NRSA training grant T32DK110966-04 and the AHA Children’s Strategically Focused Research Network Fellowship award (17SFRN33630041).

Author information

Authors and Affiliations

Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
Gordon Lemmon, Sergiusz Wesolowski, Alex Henrie & Mark Yandell
Utah Center for Genetic Discovery and Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
Gordon Lemmon, Sergiusz Wesolowski, Alex Henrie & Mark Yandell
Division of Pediatric Cardiology, University of Utah School of Medicine, Salt Lake City, UT, USA
Martin Tristani-Firouzi
Nora Eccles Harrison CVRTI, University of Utah School of Medicine, Salt Lake City, UT, USA
Martin Tristani-Firouzi

Authors

Gordon Lemmon
Sergiusz Wesolowski
Alex Henrie
Martin Tristani-Firouzi
Mark Yandell

Contributions

G.L. was the senior research associate leading PBC development and validation. S.W. is an applied mathematician who helped formalize our approach to statistical testing. A.H. was a software engineer on the project. M.T.-F. and M.Y. conceived the project, secured research funding and played a key role in scientific discussions regarding the development of PBC. All authors edited the manuscript.

Corresponding authors

Correspondence to Martin Tristani-Firouzi or Mark Yandell.

Ethics declarations

Competing interests

G.L. and M.Y. own shares in Backdrop Health, a University of Utah effort to commercialize Bayesian inference on health records; however, there are no financial ties regarding this research. The remaining authors declare no competing interests.

Additional information

Peer review information Nature Computational Science thanks Jeffrey P. Rewley and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 University of Utah medical records binned by age-decade.

Boxplots show the median (black line), 25th and 75th percentiles (box ends), 5th and 95th percentiles (whisker caps) and outliers. Number of terms (bottom panel) is a count of distinct diagnoses, procedures and medications found in each patient’s medical history.

Extended Data Fig. 2 Comparison of score functions for logistic regression C-value optimization.

For each score function, we evaluated C-values ranging from 10⁻¹⁴ to 10¹⁴. (a) For each of 3,041 diagnosis (DX), procedure (PX) and medication (RX) terms, we used cross-validation to select the C-value that achieves the best score. Each boxplot contains these 3,041 best scores as evaluated with a different score function. (b) Distribution of C-values for three score functions with high entropy. J_cutoff was chosen for downstream analysis because it has high entropy and a smooth C-value distribution without the large outlier at log₁₀(C) = −14.
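As a rough illustration of a cutoff-based score of this kind, the sketch below computes Youden's J statistic (sensitivity + specificity − 1, ref. 22 in the reference list) at every candidate probability cutoff and returns the best one. This preview does not define J_cutoff, so treating it as the maximum Youden's J over cutoffs is an assumption, and the function name `best_youden_j` is hypothetical.

```python
from typing import Sequence, Tuple


def best_youden_j(labels: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Return (maximum Youden's J, cutoff achieving it).

    A sample is called positive when its score is >= cutoff.
    J = sensitivity + specificity - 1 = TPR - FPR.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best_j, best_cutoff = -1.0, float("nan")
    for cutoff in sorted(set(scores)):  # each distinct score is a candidate cutoff
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
        j = tp / positives - fp / negatives
        if j > best_j:  # ties keep the lowest cutoff
            best_j, best_cutoff = j, cutoff
    return best_j, best_cutoff


if __name__ == "__main__":
    # Hypothetical held-out labels and predicted probabilities.
    print(best_youden_j([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In a C-value search, a score like this would be computed on held-out folds for each candidate C, and the C with the best average score would be kept.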

Extended Data Fig. 3 Minimum description length of the comorbidity network discovered by PBC for diagnoses in the University of Utah EDW.

Examples of significantly associated medical conditions within each cluster are displayed. Citations supporting these associations are listed in Supplementary Table 6.

Extended Data Fig. 4 Deployment of PBC on MIMIC-IV EHR data.

See Fig. 1 legend for a description of (a) and Fig. 2 legend for a description of (b). In (b), the x-axis ticks correspond to the addition, from left to right, of regression features (PBC) or stratification criteria: 0, no features and no stratification; 1, gender (female); 2, ancestry (African American); 3, length of medical history (at least 2 years); 4, number of visits (at least 3 visits). The MIMIC-IV results are very similar to the University of Utah results, reinforcing a key message of this paper: PBC retains the power to identify comorbid relationships that are lost by stratification.

Supplementary information

Supplementary Tables 1–6, Figs. 1 and 2, and Methods (a step-by-step explanation of the mathematics used to calculate pairwise comorbidity P-values).

Source data

Source Data are provided for Figs. 1 and 2 and Extended Data Figs. 1, 2 and 4.

Cite this article

Lemmon, G., Wesolowski, S., Henrie, A. et al. A Poisson binomial-based statistical testing framework for comorbidity discovery across electronic health record datasets. Nat Comput Sci 1, 694–702 (2021). https://doi.org/10.1038/s43588-021-00141-9

Received: 13 December 2020
Accepted: 16 September 2021
Published: 21 October 2021
Issue Date: October 2021
DOI: https://doi.org/10.1038/s43588-021-00141-9