Training data are crucial for advances in artificial intelligence, but many questions remain about the provenance of training datasets, license enforcement and creator consent. Mahari et al. provide a set of tools for tracing, documenting and sharing AI training data, and highlight how important it is for developers to engage with dataset metadata.
References
Zhang, D. et al. Preprint at https://doi.org/10.48550/arXiv.2205.03468 (2022).
Maslej, N. et al. Preprint at https://doi.org/10.48550/arXiv.2310.03715 (2023).
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. In Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency https://doi.org/10.1145/3442188.3445922 (ACM, 2021).
Paullada, A., Raji, I. D., Bender, E. M., Denton, E. & Hanna, A. Patterns 2, 100336 (2021).
Desai, M. A., Pasquetto, I. V., Jacobs, A. Z. & Card, D. Patterns 5, 100966 (2024).
Bandy, J. & Vincent, N. Preprint at https://doi.org/10.48550/arXiv.2105.05241 (2021).
Elazar, Y. et al. Preprint at https://doi.org/10.48550/arXiv.2310.20707 (2023).
Mahari, R. et al. Nat. Mach. Intell. https://doi.org/10.1038/s42256-024-00878-8 (2024).
Chen, D., Fisch, A., Weston, J. & Bordes, A. In Proc. 55th Annual Meeting of the ACL https://doi.org/10.18653/v1/P17-1171 (ACL, 2017).
Kittur, A. & Kraut, R. E. In Proc. 2008 ACM Conference on Computer Supported Cooperative Work https://doi.org/10.1145/1460563.1460572 (ACM, 2008).
Hwang, S. & Shaw, A. In Proc. International AAAI Conference on Web and Social Media 16 https://doi.org/10.1609/icwsm.v16i1.19297 (AAAI, 2022).
Deng, J. et al. In 2009 IEEE Conference on Computer Vision and Pattern Recognition https://doi.org/10.1109/CVPR.2009.5206848 (IEEE, 2009).
Birhane, A., Han, S., Boddeti, V. & Luccioni, S. In Proc. 37th International Conference on Neural Information Processing Systems 930 (Curran Associates, 2024).
Precel, H., McDonald, A., Hecht, B. & Vincent, N. In Proc. CHI Conference on Human Factors in Computing Systems https://doi.org/10.1145/3613904.3642749 (ACM, 2024).
Jiang, H. H. et al. In Proc. 2023 AAAI/ACM Conference on AI, Ethics, and Society https://doi.org/10.1145/3600211.3604681 (ACM, 2023).
Ethics declarations
Competing interests
N.V. has contributed to separate streams of research for companies that produce LLMs and LLM-based applications, including Microsoft and OpenAI. This article was not supported by any institution other than the author’s home university, Simon Fraser University.
About this article
Cite this article
Vincent, N. A step forward in tracing and documenting dataset provenance. Nat Mach Intell 6, 848–849 (2024). https://doi.org/10.1038/s42256-024-00884-w
DOI: https://doi.org/10.1038/s42256-024-00884-w