The pitfalls of negative data bias for the T-cell epitope specificity challenge

Dens, Ceder; Laukens, Kris; Bittremieux, Wout; Meysman, Pieter

doi:10.1038/s42256-023-00727-0

Matters Arising
Published: 05 October 2023

Nature Machine Intelligence volume 5, pages 1060–1062 (2023)

Matters Arising to this article was published on 05 October 2023

The Original Article was published on 06 March 2023

arising from Y. Gao et al. Nature Machine Intelligence https://doi.org/10.1038/s42256-023-00619-3 (2023)

Recently, Gao et al.¹ combined meta-learning with a neural Turing machine to tackle an important but as yet unsolved problem in immunology: predicting T-cell receptor (TCR)–epitope binding for novel epitopes. Even high-performing machine learning models can fail when deployed in a real-world setting if the data used to train and test them contain biases. Here, we describe how the technique used to create negative data for the TCR–epitope interaction prediction task can introduce a strong bias, and we show that performance drops to random when the model is tested in a more realistic scenario.
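
To make the notion of "creating negative data" concrete, the sketch below illustrates two strategies that are commonly used in this field: shuffling known positive pairs, and pairing epitopes with TCRs from a separate background repertoire. It is a minimal illustration under stated assumptions, not the authors' scripts and not PanPep's pipeline; the function names `shuffled_negatives` and `background_negatives` and all identifiers in the toy data are hypothetical.

```python
import random


def shuffled_negatives(positive_pairs, n_per_positive=1, seed=0):
    """Pair each TCR with epitopes it is NOT known to bind, using only
    TCRs and epitopes that already occur in the positive set."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    epitopes = sorted({epitope for _, epitope in positive_pairs})
    negatives = []
    for tcr, _ in positive_pairs:
        # Exclude every epitope this TCR is recorded as binding.
        candidates = [ep for ep in epitopes if (tcr, ep) not in known]
        k = min(n_per_positive, len(candidates))
        negatives.extend((tcr, ep) for ep in rng.sample(candidates, k))
    return negatives


def background_negatives(positive_pairs, background_tcrs, n_per_positive=1, seed=0):
    """Pair each epitope with TCRs drawn from a separate background
    repertoire (assumed here not to bind the epitope)."""
    rng = random.Random(seed)
    positive_tcrs = {tcr for tcr, _ in positive_pairs}
    # Drop background TCRs that also appear among the positives.
    pool = [tcr for tcr in background_tcrs if tcr not in positive_tcrs]
    negatives = []
    for _, epitope in positive_pairs:
        k = min(n_per_positive, len(pool))
        negatives.extend((tcr, epitope) for tcr in rng.sample(pool, k))
    return negatives


# Toy identifiers only; these are placeholders, not real sequences.
positives = [
    ("tcr_a", "epitope_1"),
    ("tcr_b", "epitope_1"),
    ("tcr_c", "epitope_2"),
    ("tcr_d", "epitope_3"),
]
background = ["bg_1", "bg_2", "bg_3", "tcr_a"]

shuffled = shuffled_negatives(positives)
from_background = background_negatives(positives, background)
```

In the shuffled set, every negative TCR also appears among the positives, whereas in the background set the negative TCRs come from a different source. A model trained on the latter could, in principle, learn to separate the two sources rather than to recognise binding, which is one way the kind of bias discussed here can arise.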

**Fig. 1: Schematic overview of the two approaches commonly used for generating negative TCR–epitope data.**

**Fig. 2: ROC curves of PanPep tested on shuffled negative data.**

Data availability

The data used to obtain the results are available on GitHub at https://github.com/PigeonMark/PanPep-Shuffled-Negatives and on Zenodo at https://doi.org/10.5281/zenodo.7798691.

Code availability

All scripts used to obtain the results are available on GitHub at https://github.com/PigeonMark/PanPep-Shuffled-Negatives and on Zenodo at https://doi.org/10.5281/zenodo.7798691.

References

1. Gao, Y. et al. Pan-peptide meta learning for T-cell receptor–antigen binding recognition. Nat. Mach. Intell. 5, 236–249 (2023).
2. Narla, A., Kuprel, B., Sarin, K., Novoa, R. & Ko, J. Automated classification of skin lesions: from pixels to practice. J. Invest. Dermatol. 138, 2108–2110 (2018).
3. Seyyed-Kalantari, L., Zhang, H., McDermott, M. B. A., Chen, I. Y. & Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 27, 2176–2182 (2021).
4. Castro, D. C., Walker, I. & Glocker, B. Causality matters in medical imaging. Nat. Commun. 11, 3673 (2020).
5. Pavlović, M. et al. Improving generalization of machine learning-identified biomarkers with causal modeling: an investigation into immune receptor diagnostics. Preprint at https://doi.org/10.48550/arXiv.2204.09291 (2023).
6. Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).
7. Hudson, D., Fernandes, R. A., Basham, M., Ogg, G. & Koohy, H. Can we predict T cell specificity with digital biology and machine learning? Nat. Rev. Immunol. 23, 511–521 (2023).
8. Krogsgaard, M. & Davis, M. M. How T cells ‘see’ antigen. Nat. Immunol. 6, 239–245 (2005).
9. Meysman, P. et al. Benchmarking solutions to the T-cell receptor epitope prediction problem: IMMREP22 workshop report. ImmunoInformatics 9, 100024 (2023).
10. Zhang, W. et al. A framework for highly multiplexed dextramer mapping and prediction of T cell receptor sequences to antigen specificity. Sci. Adv. 7, eabf5835 (2021).
11. Bekker, J. & Davis, J. Learning from positive and unlabeled data: a survey. Mach. Learn. 109, 719–760 (2020).
12. Moris, P. et al. Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification. Brief. Bioinform. 22, bbaa318 (2021).
13. Grazioli, F. et al. On TCR binding predictors failing to generalize to unseen peptides. Front. Immunol. 13, 1014256 (2022).
14. Chandola, V., Banerjee, A. & Kumar, V. Anomaly detection: a survey. ACM Comput. Surv. 41, 1–58 (2009).
15. Chen, L. et al. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS ONE 14, e0220113 (2019).

Author information

These authors contributed equally: Wout Bittremieux, Pieter Meysman.

Authors and Affiliations

Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
Ceder Dens, Kris Laukens, Wout Bittremieux & Pieter Meysman
AUDACIS consortium, University of Antwerp, Antwerp, Belgium
Ceder Dens, Kris Laukens & Pieter Meysman

Contributions

C.D. performed the study. C.D. and P.M. wrote the manuscript. W.B., K.L. and P.M. conceived and supervised the study. W.B., P.M. and K.L. revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Pieter Meysman.

Ethics declarations

Competing interests

K.L. and P.M. hold shares in ImmuneWatch, an immunoinformatics company.

Peer review

Peer review information

Nature Machine Intelligence thanks Geir Kjetil Sandve for their contribution to the peer review of this work. Primary Handling Editor: Dr Liesbeth Venema, in collaboration with the Nature Machine Intelligence Editorial Team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Methods and data.

About this article

Cite this article

Dens, C., Laukens, K., Bittremieux, W. et al. The pitfalls of negative data bias for the T-cell epitope specificity challenge. Nat Mach Intell 5, 1060–1062 (2023). https://doi.org/10.1038/s42256-023-00727-0

Received: 05 April 2023
Accepted: 04 September 2023
Published: 05 October 2023
Issue Date: October 2023
DOI: https://doi.org/10.1038/s42256-023-00727-0

This article is cited by

Adaptive immune receptor repertoire analysis
- Vanessa Mhanna
- Habib Bashour
- Encarnita Mariotti-Ferrandiz
Nature Reviews Methods Primers (2024)

Reply to: The pitfalls of negative data bias for the T-cell epitope specificity challenge
- Yicheng Gao
- Yuli Gao
- Qi Liu
Nature Machine Intelligence (2023)