  • News & Views

AI training data

A step forward in tracing and documenting dataset provenance

Training data are crucial for advances in artificial intelligence, but many questions remain about the provenance of training datasets, license enforcement and creator consent. Mahari et al. provide a set of tools for tracing, documenting and sharing AI training data, and highlight why developers need to engage with dataset metadata.

References

  1. Zhang, D. et al. Preprint at https://doi.org/10.48550/arXiv.2205.03468 (2022).

  2. Maslej, N. et al. Preprint at https://doi.org/10.48550/arXiv.2310.03715 (2023).

  3. Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. In Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency https://doi.org/10.1145/3442188.3445922 (ACM, 2021).

  4. Paullada, A., Raji, I. D., Bender, E. M., Denton, E. & Hanna, A. Patterns 2, 100336 (2021).

  5. Desai, M. A., Pasquetto, I. V., Jacobs, A. Z. & Card, D. Patterns 5, 100966 (2024).

  6. Bandy, J. & Vincent, N. Preprint at https://doi.org/10.48550/arXiv.2105.05241 (2021).

  7. Elazar, Y. et al. Preprint at https://doi.org/10.48550/arXiv.2310.20707 (2023).

  8. Mahari, R. et al. Nat. Mach. Intell. https://doi.org/10.1038/s42256-024-00878-8 (2024).

  9. Chen, D., Fisch, A., Weston, J. & Bordes, A. In Proc. 55th Annual Meeting of the ACL https://doi.org/10.18653/v1/P17-1171 (ACL, 2017).

  10. Kittur, A. & Kraut, R. E. In Proc. 2008 ACM Conference on Computer Supported Cooperative Work https://doi.org/10.1145/1460563.1460572 (ACM, 2008).

  11. Hwang, S. & Shaw, A. In Proc. International AAAI Conference on Web and Social Media 16 https://doi.org/10.1609/icwsm.v16i1.19297 (AAAI, 2022).

  12. Deng, J. et al. In 2009 IEEE Conference on Computer Vision and Pattern Recognition https://doi.org/10.1109/CVPR.2009.5206848 (IEEE, 2009).

  13. Birhane, A., Han, S., Boddeti, V. & Luccioni, S. In Proc. 37th International Conference on Neural Information Processing Systems 930 (Curran Associates, 2024).

  14. Precel, H., McDonald, A., Hecht, B. & Vincent, N. In Proc. CHI Conference on Human Factors in Computing Systems https://doi.org/10.1145/3613904.3642749 (ACM, 2024).

  15. Jiang, H. H. et al. In Proc. 2023 AAAI/ACM Conference on AI, Ethics, and Society https://doi.org/10.1145/3600211.3604681 (ACM, 2023).

Author information

Corresponding author

Correspondence to Nicholas Vincent.

Ethics declarations

Competing interests

N.V. has contributed to separate streams of research for companies that produce LLMs and LLM-based applications, including Microsoft and OpenAI. This article was not supported by any institution other than the author’s home university, Simon Fraser University.

About this article

Cite this article

Vincent, N. A step forward in tracing and documenting dataset provenance. Nat Mach Intell 6, 848–849 (2024). https://doi.org/10.1038/s42256-024-00884-w

  • DOI: https://doi.org/10.1038/s42256-024-00884-w
