Abstract
Adaptive immune receptor repertoires (AIRR) are key targets for biomedical research as they record past and ongoing adaptive immune responses. The capacity of machine learning (ML) to identify complex discriminative sequence patterns renders it an ideal approach for AIRR-based diagnostic and therapeutic discovery. So far, widespread adoption of AIRR ML has been inhibited by a lack of reproducibility, transparency and interoperability. immuneML (immuneml.uio.no) addresses these concerns by implementing each step of the AIRR ML process in an extensible, open-source software ecosystem that is based on fully specified and shareable workflows. To facilitate widespread user adoption, immuneML is available as a command-line tool and through an intuitive Galaxy web interface, and extensive documentation of workflows is provided. We demonstrate the broad applicability of immuneML by (1) reproducing a large-scale study on immune state prediction, (2) developing, integrating and applying a novel deep learning method for antigen specificity prediction and (3) showcasing streamlined interpretability-focused benchmarking of AIRR ML.
Data availability
All data for the analyses presented in the manuscript are openly available. The detailed result files for use cases 1–3 are available as zip files at https://doi.org/10.11582/2021.00008 (ref. 78; use case 1), https://doi.org/10.11582/2021.00009 (ref. 81; use case 2) and https://doi.org/10.11582/2021.00005 (ref. 82; use case 3). Input data for use case 1 was downloaded from https://doi.org/10.21417/B7001Z.
Code availability
The immuneML source code is openly available on GitHub (github.com/uio-bmi/immuneML) under a free software license (AGPL-3.0). immuneML version 2.0.2 has been deposited on Zenodo at https://doi.org/10.5281/zenodo.5118741 (ref. 75). The immuneML Python package can be downloaded from pypi.org/project/immuneML.
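As an illustration of how a reader might check that a local environment matches the archived release, the following minimal sketch uses only the Python standard library. It assumes the package is registered under the distribution name `immuneML` (as the PyPI URL above suggests) and that version 2.0.2 is published on PyPI; the helper names `installed_version` and `matches_pinned` are hypothetical and are not part of immuneML.

```python
from importlib import metadata

# Version cited in the text as deposited on Zenodo (ref. 75).
PINNED_VERSION = "2.0.2"


def installed_version(package: str = "immuneML"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pinned(package: str = "immuneML", pinned: str = PINNED_VERSION) -> bool:
    """Return True only when `package` is installed at exactly the pinned version."""
    return installed_version(package) == pinned


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        # Assumption: the pinned release is available from PyPI under this name.
        print("immuneML is not installed; try: pip install immuneML==2.0.2")
    else:
        print(f"immuneML {found} installed; matches pinned: {matches_pinned()}")
```

Recording the exact version alongside analysis results is one simple way to support the reproducibility goals described in the abstract; later releases may behave differently from the archived one.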
References
Brown, A. J. et al. Augmenting adaptive immunity: progress and challenges in the quantitative engineering and analysis of adaptive immune receptor repertoires. Mol. Syst. Des. Eng. 4, 701–736 (2019).
Georgiou, G. et al. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat. Biotechnol. 32, 158–168 (2014).
Yaari, G. & Kleinstein, S. H. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 7, 121 (2015).
Csepregi, L., Ehling, R. A., Wagner, B. & Reddy, S. T. Immune literacy: reading, writing, and editing adaptive immunity. iScience 23, 101519 (2020).
DeWitt, W. S. III et al. Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity. eLife 7, e38358 (2018).
Emerson, R. O. et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat. Genet. 49, 659–665 (2017).
Krishna, C., Chowell, D., Gönen, M., Elhanati, Y. & Chan, T. A. Genetic and environmental determinants of human TCR repertoire diversity. Immun. Ageing 17, 26 (2020).
Britanova, O. V. et al. Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J. Immunol. 192, 2689–2698 (2014).
Schneider-Hohendorf, T. et al. Sex bias in MHC I-associated shaping of the adaptive immune system. Proc. Natl Acad. Sci. USA 115, 2168–2173 (2018).
Shemesh, O., Polak, P., Lundin, K. E. A., Sollid, L. M. & Yaari, G. Machine learning analysis of naïve B-cell receptor repertoires stratifies celiac disease patients and controls. Front. Immunol. 12, https://doi.org/10.3389/fimmu.2021.627813 (2021).
Ostmeyer, J., Christley, S., Toby, I. T. & Cowell, L. G. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. https://doi.org/10.1158/0008-5472.CAN-18-2292 (2019).
Beshnova, D. et al. De novo prediction of cancer-associated T cell receptors for noninvasive cancer detection. Sci. Transl. Med. 12, eaaz3738 (2020).
Liu, X. et al. T cell receptor β repertoires as novel diagnostic markers for systemic lupus erythematosus and rheumatoid arthritis. Ann. Rheum. Dis. 78, 1070–1078 (2019).
Arnaout, R. A. et al. The future of blood testing is the immunome. Front. Immunol. 12, 626793 (2021).
Greiff, V., Yaari, G. & Cowell, L. Mining adaptive immune receptor repertoires for biological and clinical information using machine learning. Curr. Opin. Syst. Biol. https://doi.org/10.1016/j.coisb.2020.10.010 (2020).
Akbar, R. et al. A compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding. Cell Rep. 34, 108856 (2021).
Dash, P. et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547, 89–93 (2017).
Glanville, J. et al. Identifying specificity groups in the T cell receptor repertoire. Nature 547, 94–98 (2017).
Springer, I., Besser, H., Tickotsky-Moskovitz, N., Dvorkin, S. & Louzoun, Y. Prediction of specific TCR-peptide binding from large dictionaries of TCR-peptide pairs. Front. Immunol. 11, 1803 (2020).
Friedensohn, S. et al. Convergent selection in antibody repertoires is revealed by deep learning. Preprint at bioRxiv https://doi.org/10.1101/2020.02.25.965673 (2020).
Mason, D. M. et al. Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat. Biomed. Eng. 5, 600–612 (2021).
Moris, P. et al. Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification. Brief. Bioinform. https://doi.org/10.1093/bib/bbaa318 (2020).
Graves, J. et al. A review of deep learning methods for antibodies. Antibodies 9, 12 (2020).
Narayanan, H. et al. Machine learning for biologics: opportunities for protein engineering, developability, and formulation. Trends Pharmacol. Sci. 42, 151–165 (2021).
Fischer, D. S., Wu, Y., Schubert, B. & Theis, F. J. Predicting antigen specificity of single T cells based on TCR CDR3 regions. Mol. Syst. Biol. 16, e9416 (2020).
Laustsen, A. H., Greiff, V., Karatt-Vellatt, A., Muyldermans, S. & Jenkins, T. P. Animal immunization, in vitro display technologies, and machine learning for antibody discovery. Trends Biotechnol. https://doi.org/10.1016/j.tibtech.2021.03.003 (2021).
Jokinen, E., Huuhtanen, J., Mustjoki, S., Heinonen, M. & Lähdesmäki, H. Predicting recognition between T cell receptors and epitopes with TCRGP. PLoS Comput. Biol. 17, e1008814 (2021).
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. https://doi.org/10.1038/s41573-019-0024-5 (2019).
Wainberg, M., Merico, D., Delong, A. & Frey, B. J. Deep learning in biomedicine. Nat. Biotechnol. 36, 829–838 (2018).
Lythe, G., Callard, R. E., Hoare, R. L. & Molina-París, C. How many TCR clonotypes does a body maintain? J. Theor. Biol. 389, 214–224 (2016).
Mora, T. & Walczak, A. M. How many different clonotypes do immune repertoires contain? Curr. Opin. Syst. Biol. 18, 104–110 (2019).
Briney, B., Inderbitzin, A., Joyce, C. & Burton, D. R. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 566, 393–397 (2019).
Greiff, V. et al. Learning the high-dimensional immunogenomic features that predict public and private antibody repertoires. J. Immunol. https://doi.org/10.4049/jimmunol.1700594 (2017).
Parameswaran, P. et al. Convergent antibody signatures in human dengue. Cell Host Microbe 13, 691–700 (2013).
Thomas, N. et al. Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinformatics 30, 3181–3188 (2014).
Christophersen, A. et al. Tetramer-visualized gluten-specific CD4+ T cells in blood as a potential diagnostic marker for coeliac disease without oral gluten challenge. United Eur. Gastroenterol. J. 2, 268–278 (2014).
Widrich, M. et al. Modern Hopfield networks and attention for immune repertoire classification. Adv. Neural Inf. Process. Syst. 33, 18832–18845 (2020).
Sidhom, J.-W., Larman, H. B., Pardoll, D. M. & Baras, A. S. DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires. Nat. Commun. 12, 1605 (2021).
Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. Selene: a PyTorch-based deep learning library for sequence data. Nat. Methods 16, 315–318 (2019).
Kopp, W., Monti, R., Tamburrini, A., Ohler, U. & Akalin, A. Deep learning for genomics using Janggu. Nat. Commun. 11, 3488 (2020).
Feng, J. et al. Firmiana: towards a one-stop proteomic cloud platform for data processing and analysis. Nat. Biotechnol. 35, 409–412 (2017).
Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
Tomic, A. et al. SIMON: Open-source knowledge discovery platform. Patterns 2, 100178 (2021).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Paszke, A. et al. in Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8026–8037 (Curran Associates, Inc., 2019).
Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018).
Rubelt, F. et al. Adaptive immune receptor repertoire community recommendations for sharing immune-repertoire sequencing data. Nat. Immunol. 18, 1274–1278 (2017).
Vander Heiden, J. A. et al. AIRR community standardized representations for annotated immune repertoires. Front. Immunol. 9, 2206 (2018).
Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).
Gupta, N. T. et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31, 3356–3358 (2015).
Vander Heiden, J. A. et al. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30, 1930–1932 (2014).
Nazarov, V., immunarch.bot & Rumynskiy, E. immunomind/immunarch: 0.6.5: basic single-cell support. Zenodo https://doi.org/10.5281/zenodo.3893991 (2020).
Christley, S. et al. The ADC API: a web API for the programmatic query of the AIRR data commons. Front. Big Data 3, 22 (2020).
Corrie, B. D. et al. iReceptor: a platform for querying and analyzing antibody/B-cell and T-cell receptor repertoire data across federated repositories. Immunol. Rev. 284, 24–41 (2018).
Bagaev, D. V. et al. VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium. Nucleic Acids Res. 48, D1057–D1062 (2020).
Huang, H., Wang, C., Rubelt, F., Scriba, T. J. & Davis, M. M. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0505-4 (2020).
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Nolan, S. et al. A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-51964/v1 (2020).
Weber, C. R. et al. immuneSIM: tunable multi-feature simulation of B- and T-cell receptor repertoires for immunoinformatics benchmarking. Bioinformatics 36, 3594–3596 (2020).
Marcou, Q., Mora, T. & Walczak, A. M. High-throughput immune repertoire analysis with IGoR. Nat. Commun. 9, 561 (2018).
Sethna, Z., Elhanati, Y., Callan, C. G., Walczak, A. M. & Mora, T. OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs. Bioinformatics 35, 2974–2981 (2019).
FAIR principles for data stewardship. Nat. Genet. 48, 343–343 (2016).
Scott, J. K. & Breden, F. The adaptive immune receptor repertoire community as a model for FAIR stewardship of big immunology data. Curr. Opin. Syst. Biol. 24, 71–77 (2020).
Breden, F. et al. Reproducibility and reuse of adaptive immune receptor repertoire data. Front. Immunol. 8, 1418 (2017).
Software with impact. Nat. Methods 11, 211 (2014).
Goodman, S. N., Fanelli, D. & Ioannidis, J. P. A. What does research reproducibility mean? Sci. Transl. Med. 8, 341ps12 (2016).
Mayer-Blackwell, K. et al. TCR meta-clonotypes for biomarker discovery with tcrdist3: quantification of public, HLA-restricted TCR biomarkers of SARS-CoV-2 infection. Preprint at bioRxiv https://doi.org/10.1101/2020.12.24.424260 (2020).
Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In Proc. 12th USENIX Conference on Operating Systems Design and Implementation 265–283 (USENIX Association, 2016).
Vujovic, M. et al. T cell receptor sequence clustering and antigen specificity. Comput. Struct. Biotechnol. J. 18, 2166–2173 (2020).
Davidsen, K. et al. Deep generative models for T cell receptor protein sequences. eLife 8, e46935 (2019).
Bareinboim, E. & Pearl, J. Causal inference and the data-fusion problem. Proc. Natl Acad. Sci. USA 113, 7345–7352 (2016).
Pavlović, M. et al. immuneML: v2.0.2. Zenodo https://doi.org/10.5281/zenodo.5118741 (2021).
Fowler, M. Domain-Specific Languages (Addison-Wesley Professional, 2010).
Zenger, M. Programming Language Abstractions for Extensible Software Components Ch. 1.3 (Swiss Federal Institute of Technology, 2004).
Pavlović, M. immuneML use case 1: replication of a published study inside immuneML. NIRD Research Data Archive https://doi.org/10.11582/2021.00008 (2021).
Ploenzke, M. S. & Irizarry, R. A. Interpretable convolution methods for learning genomic sequence motifs. Preprint at bioRxiv https://doi.org/10.1101/411934 (2018).
Heikkilä, N. et al. Human thymic T cell repertoire is imprinted with strong convergence to shared sequences. Mol. Immunol. 127, 112–123 (2020).
Pavlović, M. immuneML use case 2: extending immuneML with a deep learning component for predicting antigen specificity of paired receptor data. NIRD Research Data Archive https://doi.org/10.11582/2021.00009 (2021).
Scheffer, L. immuneML use case 3: benchmarking ML methods for AIRR classification on ground-truth synthetic data. NIRD Research Data Archive https://doi.org/10.11582/2021.00005 (2021).
Acknowledgements
We acknowledge generous support by The Leona M. and Harry B. Helmsley Charitable Trust (grant number 2019PG-T1D011, to V.G. and T.M.B.), the UiO World-Leading Research Community (to V.G. and L.M.S.), the UiO:LifeScience Convergence Environment Immunolingo (to V.G. and G.K.S.), EU Horizon 2020 iReceptorplus (grant number 825821, to V.G. and L.M.S.), a Research Council of Norway FRIPRO project (grant number 300740, to V.G.), a Research Council of Norway IKTPLUSS project (grant number 311341, to V.G. and G.K.S.), the National Institutes of Health (grant numbers P01 AI042288 and HIRN UG3 DK122638 to T.M.B.) and Stiftelsen Kristian Gerhard Jebsen (K.G. Jebsen Coeliac Disease Research Centre, to L.M.S. and G.K.S.). We acknowledge support from ELIXIR Norway in recognizing immuneML as a national node service.
Author information
Contributions
M.P., V.G. and G.K.S. conceived the study. M.P. and G.K.S. designed the overall software architecture. M.P., L.S. and K.M. developed the main platform code. M.P. and L.S. performed all analyses. M.P., L.S., C.K., F.L.M.B., R.A., G.S.A.H., G.B., M.C., R.F., I.G., S.G., P.-H.H., K.R., E.R., P.A.R., A.S., D.T., C.R.W. and M.W. created software or documentation content. R.K., N.V., K.W., L.S., M.P., A.A.C. and B.C. designed and developed the Galaxy tools. C.K., R.A., T.M.B., M.C., S.C., L.G.C., I.H.H., E.H., G.K., M.L.K., C.L.-A., A.M., T.M., J.P., K.R., P.A.R., A.R., I.S., L.M.S. and G.Y. provided critical feedback. M.P., L.S., V.G. and G.K.S. drafted the manuscript. V.G. and G.K.S. supervised the project. All authors read and approved the final manuscript and are personally accountable for its content.
Ethics declarations
Competing interests
V.G. declares advisory board positions in aiNET GmbH and Enpicom B.V., and is a consultant for Roche/Genentech.
Additional information
Peer review information Nature Machine Intelligence thanks Pieter Meysman, Ryan Emerson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–6 and Tables 1–4.
About this article
Cite this article
Pavlović, M., Scheffer, L., Motwani, K. et al. The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires. Nat Mach Intell 3, 936–944 (2021). https://doi.org/10.1038/s42256-021-00413-z
DOI: https://doi.org/10.1038/s42256-021-00413-z