  • Review Article

An introduction to machine learning and analysis of its use in rheumatic diseases

Abstract

Machine learning (ML) is a computerized analytical technique that is being increasingly employed in biomedicine. ML often provides an advantage over explicitly programmed strategies in the analysis of multidimensional information by recognizing relationships in the data that were not previously appreciated. As such, the use of ML in rheumatology is increasing, and numerous studies have employed ML to classify patients with rheumatic autoimmune inflammatory diseases (RAIDs) from medical records and imaging, biometric or gene expression data. However, these studies are limited by sample size, the accuracy of sample labelling, and absence of datasets for external validation. In addition, there is potential for ML models to overfit or underfit the data and, thereby, these models might produce results that cannot be replicated in an unrelated dataset. In this Review, we introduce the basic principles of ML and discuss its current strengths and weaknesses in the classification of patients with RAIDs. Moreover, we highlight the successful analysis of the same type of input data (for example, medical records) with different algorithms, illustrating the potential plasticity of this analytical approach. Altogether, a better understanding of ML and the future application of advanced analytical techniques based on this approach, coupled with the increasing availability of biomedical data, may facilitate the development of meaningful precision medicine for patients with RAIDs.

Key points

  • Appropriate application of machine learning (ML) algorithms and construction of models, including those built on data from patients with rheumatic autoimmune inflammatory diseases (RAIDs), involves preprocessing, feature selection, comparison of multiple models to determine which is most appropriate for the data, and proper validation.

  • ML has been applied to various types of data from patients with RAIDs, including medical records and imaging data to classify patients, sequencing data to predict genetic risk loci, biometric data to identify disease activity, transcriptomic data to classify or cluster patient subtypes, and demographic, genetic and genomic data to predict treatment response.

  • Most published studies that describe the employment of ML in RAIDs, however, only serve as proof-of-principle studies as they lack adequate sample sizes or external test datasets; consequently, clinical translation of ML in rheumatology is in a nascent stage.

  • Current ML studies provide hypotheses that can be validated in large retrospective datasets or used to design prospective trials characterized by correct data collection and sample sizes that are suitable for the application of ML.
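The workflow named in the first key point (preprocessing, feature selection, comparison of multiple models, validation) can be illustrated with a minimal sketch. It is not a method taken from the Review: the synthetic data, the effect size, the two toy models and every function name below are assumptions chosen for illustration, and only the Python standard library is used. Scaling and feature selection are fitted on the training folds only, so that the held-out fold does not leak into model construction.

```python
import random
import statistics

def make_data(n=80, n_features=5, seed=0):
    """Synthetic stand-in for patient data (assumption): only feature 0 carries signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 4.0 * label  # assumed effect size for the informative feature
        X.append(row)
        y.append(label)
    return X, y

def fit_scaler(X):
    """Preprocessing: per-feature mean and standard deviation from training rows only."""
    return [(statistics.fmean(c), statistics.pstdev(c) or 1.0) for c in zip(*X)]

def scale(X, params):
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in X]

def class_means(X, y, label):
    rows = [r for r, t in zip(X, y) if t == label]
    return [statistics.fmean(c) for c in zip(*rows)]

def select_features(X, y, k):
    """Filter-style selection: keep the k features whose class means differ most."""
    m0, m1 = class_means(X, y, 0), class_means(X, y, 1)
    ranked = sorted(range(len(m0)), key=lambda j: abs(m1[j] - m0[j]), reverse=True)
    return sorted(ranked[:k])

def nearest_centroid(X_train, y_train, X_test):
    """Toy model 1: assign each test row to the closer class centroid."""
    c0, c1 = class_means(X_train, y_train, 0), class_means(X_train, y_train, 1)
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [0 if dist(r, c0) <= dist(r, c1) else 1 for r in X_test]

def majority_class(X_train, y_train, X_test):
    """Toy model 2 (baseline): always predict the most common training label."""
    guess = 1 if sum(y_train) * 2 > len(y_train) else 0
    return [guess for _ in X_test]

def cross_validate(model, X, y, folds=5, k_features=2):
    """k-fold cross-validation; every preprocessing step is refitted per fold."""
    correct = 0
    for f in range(folds):
        test_idx = [i for i in range(len(X)) if i % folds == f]
        train_idx = [i for i in range(len(X)) if i % folds != f]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        params = fit_scaler(X_tr)
        X_tr_s, X_te_s = scale(X_tr, params), scale(X_te, params)
        keep = select_features(X_tr_s, y_tr, k_features)
        X_tr_k = [[r[j] for j in keep] for r in X_tr_s]
        X_te_k = [[r[j] for j in keep] for r in X_te_s]
        preds = model(X_tr_k, y_tr, X_te_k)
        correct += sum(p == t for p, t in zip(preds, y_te))
    return correct / len(X)

X, y = make_data()
models = [("nearest_centroid", nearest_centroid), ("majority_class", majority_class)]
scores = {name: cross_validate(fn, X, y) for name, fn in models}
for name, acc in scores.items():
    print(f"{name}: cross-validated accuracy = {acc:.2f}")
```

Cross-validation of this kind only estimates performance on data drawn from the same source. It does not replace the external test datasets that the third key point describes as missing from most published studies.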


Fig. 1: The output of a machine learning model is classification, regression or clustering.
Fig. 2: Machine learning model workflow.
Fig. 3: Guidelines for selecting the most appropriate machine learning algorithm.
Fig. 4: Receiver operating characteristic curves are used to assess binary classification performance.
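As a rough illustration of the receiver operating characteristic (ROC) assessment named in the caption of Fig. 4, the sketch below sweeps a decision threshold over predicted scores, records the false-positive and true-positive rates, and integrates the resulting curve with the trapezoidal rule. The labels and scores are invented for the example and the function names are hypothetical. This is one common way to compute the area under the curve, not necessarily the procedure used in the Review.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs, one per distinct threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC analysis needs at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples that share a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal integration of the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented example: 1 = patient has the condition, 0 = does not.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(example_labels, example_scores)
auc_value = area_under_curve(curve)
print(curve)
print(f"AUC = {auc_value:.2f}")
```

An area of 0.5 corresponds to a classifier that ranks positives and negatives no better than chance, and an area of 1.0 to one that ranks every positive above every negative.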



    Article  PubMed  PubMed Central  Google Scholar 

  154. Orange, D. E. et al. Identification of three rheumatoid arthritis disease subtypes by machine learning integration of synovial histologic features and RNA sequencing data. Arthritis Rheumatol. 70, 690–701 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  155. Ghosh, J. & Acharya, A. Cluster ensembles. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 1, 305–315 (2011).

    Article  Google Scholar 

  156. Lu, R. et al. Immunologic findings precede rapid lupus flare after transient steroid therapy. Sci. Rep. 9, 8590 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  157. Bentham, J. et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nat. Genet. 47, 1457–1464 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  158. Morris, D. L. et al. Genome-wide association meta-analysis in Chinese and European individuals identifies ten new loci associated with systemic lupus erythematosus. Nat. Genet. 48, 940–946 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  159. Stahl, E. A. et al. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat. Genet. 42, 508–514 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  160. International Genetics of Ankylosing Spondylitis Consortium (IGAS). et al. Identification of multiple risk variants for ankylosing spondylitis through high-density genotyping of immune-related loci. Nat. Genet. 45, 730–8 (2013).

    Article  Google Scholar 

  161. Bowes, J. et al. Dense genotyping of immune-related susceptibility loci reveals new insights into the genetics of psoriatic arthritis. Nat. Commun. 6, 6046 (2015).

    Article  CAS  PubMed  Google Scholar 

  162. Li, Y. et al. A genome-wide association study in Han Chinese identifies a susceptibility locus for primary Sjögren’s syndrome at 7q11.23. Nat. Genet. 45, 1361–1365 (2013).

    Article  CAS  PubMed  Google Scholar 

  163. Almlöf, J. C. et al. Novel risk genes for systemic lupus erythematosus predicted by random forest classification. Sci. Rep. 7, 6236 (2017).

    Article  PubMed  PubMed Central  Google Scholar 

  164. Briggs, F. B. S. et al. Supervised machine learning and logistic regression identifies novel epistatic risk factors with PTPN22 for rheumatoid arthritis. Genes. Immun. 11, 199–208 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  165. Glaser, B. et al. Analyses of single marker and pairwise effects of candidate loci for rheumatoid arthritis using logistic regression and random forests. BMC Proc. 1, S54 (2007).

    Article  PubMed  PubMed Central  Google Scholar 

  166. Croiseau, P. & Cordell, H. J. Analysis of North American Rheumatoid Arthritis Consortium data using a penalized logistic regression approach. BMC Proc. 3, S61 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  167. Vignal, C. M., Bansal, A. T. & Balding, D. J. Using penalised logistic regression to fine map HLA variants for rheumatoid arthritis. Ann. Hum. Genet. 75, 655–664 (2011).

    Article  PubMed  Google Scholar 

  168. Bartoloni, E. et al. Application of artificial neural network analysis in the evaluation of cardiovascular risk in primary Sjögren’s syndrome: a novel pathogenetic scenario? Clin. Exp. Rheumatol. 37, S133–S139 (2019).

    Google Scholar 

  169. Navarini, L. et al. A machine-learning approach to cardiovascular risk prediction in psoriatic arthritis. Rheumatology 59, 1767–1769 (2020).

    Article  PubMed  Google Scholar 

  170. Navarini, L. et al. Cardiovascular risk prediction in ankylosing spondylitis: from traditional scores to machine learning assessment. Rheumatol. Ther. 7, 867–882 (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  171. Ravenell, R. L. et al. Premature atherosclerosis is associated with hypovitaminosis D and angiotensin-converting enzyme inhibitor non-use in lupus patients. Am. J. Med. Sci. 344, 268–273 (2012).

    Article  PubMed  PubMed Central  Google Scholar 

  172. Reddy, B. K. & Delen, D. Predicting hospital readmission for lupus patients: an RNN-LSTM-based deep-learning methodology. Comput. Biol. Med. 101, 199–209 (2018).

    Article  PubMed  Google Scholar 

  173. Hong, S. et al. Longitudinal profiling of human blood transcriptome in healthy and lupus pregnancy. J. Exp. Med. 216, 1154–1169 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  174. Chen, Y. et al. Machine learning for prediction and risk stratification of lupus nephritis renal flare. Am. J. Nephrol. 52, 152–160 (2021).

    Article  CAS  PubMed  Google Scholar 

  175. Babajide Mustapha, I. & Saeed, F. Bioactive molecule prediction using extreme gradient boosting. Molecules 21, 983 (2016).

    Article  PubMed Central  Google Scholar 

  176. Nair, N. & Wilson, A. G. Can machine learning predict responses to TNF inhibitors? Nat. Rev. Rheumatol. 15, 702–704 (2019).

    Article  PubMed  Google Scholar 

  177. Plenge, R. M. et al. Crowdsourcing genetic prediction of clinical utility in the rheumatoid arthritis responder challenge. Nat. Genet. 45, 468–469 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  178. Tao, W. et al. Multiomics and machine learning accurately predict clinical response to adalimumab and etanercept therapy in patients with rheumatoid arthritis. Arthritis Rheumatol. 73, 212–222 (2021).

    Article  CAS  PubMed  Google Scholar 

  179. Plant, D. & Barton, A. Machine learning in precision medicine: lessons to learn. Nat. Rev. Rheumatol. 17, 5–6 (2021).

    Article  PubMed  Google Scholar 

  180. Van Looy, D. et al. Comparing statistics with machine learning models to predict dose increase of infliximab for rheumatoid arthritis patients. in Proc. 9th IASTED Int. Conf. Artif. Intell. Soft Computing, ASC 195–200 (ACTA Press, 2005).

  181. Lee, S. et al. Machine learning to predict early TNF inhibitor users in patients with ankylosing spondylitis. Sci. Rep. 10, 20299 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  182. Seridi, L. et al. OP0161 association of baseline cytotoxic gene expression with ustekinumab response in systemic lupus erythematosus. Ann. Rheum. Dis. 79, 101–102 (2020).

    Article  Google Scholar 

  183. Gottlieb, A. B. et al. Secukinumab efficacy in psoriatic arthritis. JCR 27, 239–247 (2021).

    PubMed  Google Scholar 

  184. Wolf, B. J. et al. Development of biomarker models to predict outcomes in lupus nephritis. Arthritis Rheumatol. 68, 1955–1963 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  185. Vodencarevic, A. et al. Advanced machine learning for predicting individual risk of flares in rheumatoid arthritis patients tapering biologic drugs. Arthritis Res. Ther. 23, 67 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  186. Patrick, M. T. et al. Drug repurposing prediction for immune-mediated cutaneous diseases using a word-embedding-based machine learning approach. J. Invest. Dermatol. 139, 683–691 (2019).

    Article  CAS  PubMed  Google Scholar 

  187. Ekins, S. et al. Exploiting machine learning for end-to-end drug discovery and development. Nat. Mater. 18, 435–441 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  188. Lavecchia, A. Deep learning in drug discovery: opportunities, challenges and future prospects. Drug. Discov. Today 24, 2017–2032 (2019).

    Article  PubMed  Google Scholar 

  189. Kuang, Z. et al. A machine-learning-based drug repurposing approach using baseline regularization. Methods Mol. Biol. 1903, 255–267 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  190. Zeng, X. et al. DeepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics 35, 5191–5198 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  191. Xu, R. & Wang, Q. Q. Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature. J. Biomed. Inform. 51, 191–199 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  192. Bresso, E. et al. Integrative relational machine-learning for understanding drug side-effect profiles. BMC Bioinforma. 14, 207 (2013).

    Article  Google Scholar 

  193. Aliper, A. et al. Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Mol. Pharm. 13, 2524–2530 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  194. Grammer, A. C. & Lipsky, P. E. Drug repositioning strategies for the identification of novel therapies for rheumatic autoimmune inflammatory diseases. Rheum. Dis. Clin. North. Am. 43, 467–480 (2017).

    Article  PubMed  Google Scholar 

  195. Figgett, W. A. et al. Machine learning applied to whole-blood RNA-sequencing data uncovers distinct subsets of patients with systemic lupus erythematosus. Clin. Transl. Immunol. 8, e01093 (2019).

    Article  Google Scholar 

  196. Catalina, M. D., Owen, K. A., Labonte, A. C., Grammer, A. C. & Lipsky, P. E. The pathogenesis of systemic lupus erythematosus: harnessing big data to understand the molecular basis of lupus. J. Autoimmun. 110, 102359 (2020).

    Article  CAS  PubMed  Google Scholar 

  197. Guthridge, J. M. et al. Adults with systemic lupus exhibit distinct molecular phenotypes in a cross-sectional study. EClinicalMedicine 20, 100291 (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  198. Lu, Z., Li, W., Tang, Y., Da, Z. & Li, X. Lymphocyte subset clustering analysis in treatment-naive patients with systemic lupus erythematosus. Clin. Rheumatol. 40, 1835–1842 (2021).

    Article  PubMed  Google Scholar 

  199. Spielmann, L. et al. Anti-Ku syndrome with elevated CK and anti-Ku syndrome with anti-dsDNA are two distinct entities with different outcomes. Ann. Rheum. Dis. 78, 1101–1106 (2019).

    Article  CAS  PubMed  Google Scholar 

  200. Pinal-Fernandez, I. & Mammen, A. L. On using machine learning algorithms to define clinically meaningful patient subgroups. Ann. Rheum. Dis. 79, e128 (2020).

    Article  PubMed  Google Scholar 

  201. Baldini, C., Ferro, F., Luciano, N., Bombardieri, S. & Grossi, E. Artificial neural networks help to identify disease subsets and to predict lymphoma in primary Sjögren’s syndrome. Clin. Exp. Rheumatol. 36, S137–S144 (2018).

    Google Scholar 

  202. Delgadillo, J. Machine learning: a primer for psychotherapy researchers. Psychother. Res. 31, 1–4 (2021).

    Article  PubMed  Google Scholar 

  203. Breck, E., Polyzotis, N., Roy, S., Whang, S. E. & Zinkevich, M. Data Validation for Machine Learning. in Proceedings of the 2nd SysML Conference (Palo Alto Networks, 2019).

  204. Kubat, M. A Simple Machine-Learning Task. in An Introduction to Machine Learning (Springer International Publishing, 2017).

  205. Van Der Aalst, W. M. P. et al. Process mining: a two-step approach to balance between underfitting and overfitting. Softw. Syst. Model. 9, 87–111 (2010).

    Article  Google Scholar 

  206. Schaffer, C. Overfitting avoidance as bias. Mach. Learn. 10, 153–178 (1993).

    Article  Google Scholar 

  207. Adadi, A. & Berrada, M. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access. 6, 52138–52160 (2018).

    Article  Google Scholar 

  208. Tjoa, E. & Guan, C. A Survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans. Neural Netw. Learn Syst. https://doi.org/10.1109/TNNLS.2020.3027314 (2020).

  209. Kingsford, C. & Salzberg, S. L. What are decision trees? Nat. Biotechnol. 26, 1011–1012 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  210. Doran, D., Schulz, S. & Besold, T. R. What does explainable AI really mean? A new conceptualization of perspectives. in CEUR Workshop Proceedings Vol. 2071 (CEUR-WS, 2018).

  211. Cruz Rivera, S. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Dig. Health 2, e549–e560 (2020).

    Article  Google Scholar 

  212. Burmester, G. R. Rheumatology 4.0: big data, wearables and diagnosis by computer. Ann. Rheum. Dis. 77, 963–965 (2018).

    Article  PubMed  Google Scholar 

  213. Pandit, A. & Radstake, T. R. D. J. Machine learning in rheumatology approaches the clinic. Nat. Rev. Rheumatol. 16, 69–70 (2020).

    Article  PubMed  Google Scholar 

  214. Yang, S. & Berdine, G. The receiver operating characteristic (ROC) curve. Southwest. Respir. Crit. Care Chron. 5, 34 (2017).

    Article  Google Scholar 

Download references

Acknowledgements

The authors thank P. Bachali, S. Shrotri, K. Bell, and J. Kain for helpful discussion about machine learning concepts. The authors thank Dr. C. Nantasenamat for allowing us to modify his figure about the workflow of ML. This work was supported by funding from the RILITE Foundation.

Author information

Contributions

K. M. K. and C. E. P. researched data for the article. K. M. K., C. E. P., A. C. G. and P. E. L. contributed substantially to discussion of the content. K. M. K., C. E. P. and P. E. L. wrote the article. K. M. K., C. E. P. and P. E. L. reviewed and/or edited the manuscript before submission.

Corresponding author

Correspondence to Kathryn M. Kingsmore.

Ethics declarations

Competing interests

K.M.K. and C.E.P. were employed by AMPEL BioSolutions, LLC, during the preparation of this work. K.M.K. was additionally employed by the RILITE Research Institute during the preparation of this work. A.C.G. and P.E.L. are the founders of AMPEL BioSolutions, LLC. The authors declare that the content of this manuscript is not related to AMPEL BioSolutions, LLC’s commercial activities. AMPEL uses machine learning as one technique in its analysis pipelines, but does not have a proprietary interest in machine learning as a technology or a commercial interest in a specific classifier, regressor or clustering approach. All of the material described in the manuscript is freely available in the public domain.

Additional information

Peer review information

Nature Reviews Rheumatology thanks M. Krusche and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Machine learning

(ML). A subset of artificial intelligence that uses software to predict outcomes and recognize relationships in data without being explicitly programmed for each step.

Algorithms

Mathematical or computational methods that can be applied to data to form a model.

Model

A framework built upon input data that can classify, regress or cluster.

Statistical modelling

A model that relies on explicitly programmed mathematical functions to explain relationships in data.

Classification

Prediction of a categorical outcome.

Regression

Prediction of a quantitative outcome.

Clustering

Grouping of data points with similar characteristics.

Labelled

Data for which the class or outcome value is known.

Class

A group with a label that is produced from classification.

Clusters

Groups without a label that are produced from clustering.

Supervised models

Models trained on labelled data that are used to predict classes or quantitative values.

Unsupervised models

Models trained on unlabelled data that are used to find associations and patterns that result in groups of similar samples.

Imputation

A method of replacing missing values with substitute values estimated from the available data.

Data scaling

The process of transforming data into a format that a computer algorithm can use, which can also involve normalization.
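The two preprocessing steps defined above, imputation and the normalization that data scaling can involve, can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended pipeline: mean imputation and z-score standardization are only one choice among many, and the function names and example values are hypothetical.

```python
import math


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


# Hypothetical laboratory measurements with one missing value
raw = [2.0, None, 4.0, 6.0]
imputed = impute_mean(raw)      # the missing entry becomes the observed mean, 4.0
scaled = standardize(imputed)   # values now have mean 0 and standard deviation 1
```

In a real analysis, the imputation mean and the scaling constants would be estimated on the training data only and then reused on the validation and testing data, so that no information leaks between them.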

Feature selection

The process of selecting the best set of variables to be used as input for the model.

Natural language processing

(NLP). A data scaling process that is also a branch of ML, which allows computers to interpret human language.

Dimensionality reduction

The process of reducing the number of input variables (features).

Variance

Error as a result of the fluctuations in the observations, or how much the observations differ from the average value.

Biased

A biased model is one that fails to capture underlying patterns in data and thus there is a difference between the true values and the values predicted by the model.

Decision trees

Supervised method that asks a series of ‘yes or no’ questions with labelled data to classify or regress.

Clustering algorithms

Unsupervised methods that assign observations to subsets using mathematically calculated distances.

Neural networks

Supervised or unsupervised methods that build a series of networks to predict or classify. They are so named because the structure of the model is intended to mimic the way in which a human brain operates.

Ensemble algorithms

Supervised methods that aggregate several predictors from multiple machine learning models (for example, random forest).

Bagging

Algorithm that generates training sets by sampling of the training data with replacement to generate individual models that are characteristic of the sample, which are then aggregated to build a final model.
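The bagging procedure described above can be sketched as follows. This is a toy illustration under stated assumptions: the base model is a hypothetical one-feature threshold classifier (a "stump") that assumes the positive class has larger feature values, the aggregation is a majority vote, and the data and all function names are invented for the example.

```python
import random


def fit_stump(sample):
    """Fit a threshold classifier: the midpoint between the two class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        return None  # this bootstrap sample contained only one class
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def bagged_stumps(data, n_models, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < n_models:
        bootstrap = [rng.choice(data) for _ in data]  # sampling with replacement
        threshold = fit_stump(bootstrap)
        if threshold is not None:
            thresholds.append(threshold)
    return thresholds


def predict(thresholds, x):
    """Aggregate the individual models by majority vote."""
    votes = sum(1 for t in thresholds if x >= t)
    return 1 if votes > len(thresholds) / 2 else 0


# Hypothetical training data: (feature value, class label)
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
models = bagged_stumps(data, n_models=25)
low, high = predict(models, 3.0), predict(models, 5.5)
```

Each bootstrap sample yields a slightly different threshold; the vote over all of them is what makes the aggregated model more stable than any single one.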

Boosting

Algorithm that adds an additional simpler model to minimize the existing error during each iteration of a supervised model.

Bayesian algorithms

Supervised methods that solve classification problems by predicting the most probable hypothesis, given the input data (for example, naive Bayes).

Instance-based

Supervised methods that memorize instances seen in training to make predictions (for example, support vector machines and k-nearest neighbours).

Regression algorithms

Supervised methods that use linear or polynomial functions as the basis, or as a fundamental part, of prediction (for example, linear regression and logistic regression).

Regularization algorithms

A type of supervised regression method that shrinks coefficient estimates towards zero to avoid overfitting (for example, least absolute shrinkage and selection operator and ridge regression).
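The shrinkage effect of regularization can be seen in the simplest possible case, a single feature with no intercept, where both estimators have a closed form. The sketch below assumes the penalized objectives sum((y - b*x)^2) + lam*b^2 for ridge regression and sum((y - b*x)^2) + lam*|b| for the least absolute shrinkage and selection operator (lasso); the function names, the penalty symbol lam and the data are hypothetical.

```python
def ridge_coefficient(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for one feature, no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_coefficient(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| (soft thresholding)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    magnitude = max(abs(sxy) - lam / 2.0, 0.0) / sxx
    return magnitude if sxy >= 0 else -magnitude


# Hypothetical data that follow y = 2x exactly
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

ridge = [ridge_coefficient(x, y, lam) for lam in (0.0, 14.0, 126.0)]  # 2.0, 1.0, 0.2
lasso = [lasso_coefficient(x, y, lam) for lam in (0.0, 28.0, 56.0)]   # 2.0, 1.0, 0.0
```

With no penalty both estimators recover the unpenalized coefficient of 2. As the penalty grows, the ridge coefficient approaches zero without reaching it, whereas the lasso coefficient becomes exactly zero once the penalty is large enough, which is why lasso can also act as a feature selector.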

Hyperparameters

Variables that must be set prior to model construction by the user or by software default and can then be tuned during model construction to maximize accuracy.

Parameters

Variables that are ‘learned’ during model construction. Parameters differ between algorithms based on algorithm architecture.

Training dataset

The dataset used by supervised models to ‘learn’ to predict an outcome by viewing both the input and output variables in the data.

Validation dataset

A portion of the training dataset that is withheld to give an estimate of fit while tuning model hyperparameters, or a separate dataset used to estimate model fit and tune hyperparameters.

Holdout

The process of reserving some samples for training and some for validation from a single dataset.

k-fold cross-validation

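An extension of model validation that partitions the data into k complementary subsets (folds); the model is trained k times, each time holding out a different fold for validation and training on the remaining k − 1 folds, and the results are then combined.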
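The partitioning step of k-fold cross-validation can be sketched as below. The function name is hypothetical, and the sketch only produces index sets: fitting and scoring a model on each split is left out, and in practice the samples would usually be shuffled (or stratified by class) before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for each of k complementary folds."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten samples split into three folds of sizes 4, 3 and 3
folds = list(k_fold_indices(n_samples=10, k=3))
```

Every sample appears in exactly one validation fold and in the training set of the other k − 1 splits, so each observation is used for both training and validation without ever being used for both in the same split.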

Sensitivity

The proportion of the actual positives that are correctly identified. Also known as the true positive rate.

Specificity

The proportion of the actual negatives that are correctly identified. Also known as the true negative rate.

Receiver operating characteristic curves

(ROC curves). Plots of sensitivity against 1 − specificity that are used to assess the performance of a binary classifier.

Area under the curve

(AUC). Generally refers to the area under the ROC curve, so it can also be referred to as the area under the ROC (AUROC).
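As a worked illustration of the four preceding entries (sensitivity, specificity, ROC curves and AUC), the following Python sketch computes each quantity from a small set of labels and classifier scores. The data, the function names and the threshold of 0.5 are hypothetical and chosen only for illustration; the area is computed with the trapezoidal rule.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(labels, scores):
    """Return ROC points (1 - specificity, sensitivity), one per distinct threshold."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        points.append((1.0 - spec, sens))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical example: 1 = patient with disease, 0 = control
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)  # 0.75, 0.75
roc_auc = auc(roc_points(labels, scores))                            # 0.8125
```

The AUC of 0.8125 equals the fraction of patient and control pairs (13 of 16) in which the patient receives the higher score, which is the usual probabilistic reading of the AUC.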

Testing dataset

An independent dataset that is used to provide an unbiased evaluation of the final model fit.

Cite this article

Kingsmore, K.M., Puglisi, C.E., Grammer, A.C. et al. An introduction to machine learning and analysis of its use in rheumatic diseases. Nat Rev Rheumatol 17, 710–730 (2021). https://doi.org/10.1038/s41584-021-00708-w

  • DOI: https://doi.org/10.1038/s41584-021-00708-w
