Main

The latest wave of language models, both public1,2,3,4,5 and proprietary6,7,8,9, attribute their powerful abilities in large part to the diversity and richness of ever larger training datasets, including pretraining corpora and finetuning datasets compiled by academics10,11,12, synthetically generated by models2,5 or aggregated by platforms such as Hugging Face13. Recent trends see practitioners combining and repackaging thousands of datasets and web sources14,15,16,17, but despite some notable documentation efforts18,19, there are diminishing efforts to attribute, document or understand the raw ingredients that go into new models20,21,22.

A crisis in data transparency and its consequences

Increasingly, widely used dataset collections are being treated as monoliths, rather than as a lineage of data sources that are crawled (or model generated), curated and annotated, often with multiple rounds of repackaging (and relicensing) by successive practitioners. The disincentives to acknowledge this lineage stem both from the scale of modern data collection (and the effort required to properly attribute it) and from increased copyright scrutiny23. Together, these factors have resulted in fewer datasheets24, non-disclosure of training sources6,7,25 and ultimately a decline in understanding training data26,27.

This lack of understanding can lead to data leakages between training and test data28,29, expose personally identifiable information (PII)30, present unintended biases or behaviours31,32,33 and generally result in lower quality models than anticipated. Beyond these practical challenges, information gaps and documentation debt incur substantial ethical and legal risks. For instance, model releases appear to contradict data terms of use (for example, WizardCoder34 is licenced for commercial use despite training on OpenAI data whose terms prohibit commercial use), licences have been revised after public release (as with MPT-StoryTeller35) and copyright lawsuits have been filed (for example, Andersen v. Stability AI36 and Tremblay v. OpenAI23). As training models on data is both expensive and largely irreversible, these risks and challenges are not easily remedied. In this work, we term the combination of these indicators, including a dataset’s sourcing, creation and licensing heritage, as well as its characteristics, the ‘data provenance’.

Unreliable data provenance and licensing

Our work motivates the urgent need for tooling that facilitates informed and responsible use of data in both pretraining and finetuning. To empower practitioners to attribute data provenance, we develop a set of tools and standards to trace the data lineage of 1,858 finetuning datasets from 44 of the most widely used and adopted text data collections. We compile and expand relevant metadata with a much richer taxonomy than Hugging Face, Papers with Code or other aggregators (see the ‘DPExplorer’ section). With legal experts, we design a pipeline for tracing dataset provenance, including the original source of each dataset, the associated licences, the creators and subsequent use.

As a byproduct of our work establishing the data provenance of widely used datasets, we characterize the artificial intelligence (AI) data ecosystem and supply chain37,38, and the state of the field, for policymakers, researchers and legal experts. Our work highlights a crisis in licence laundering and in the informed usage of popular datasets, with systemic problems in sparse, ambiguous or incorrect licence documentation. Notably, we find that more than 70% of licences for popular datasets on GitHub and Hugging Face are ‘unspecified’, leaving a substantial information gap that is difficult to navigate in terms of legal responsibility. The licences attached to datasets uploaded to dataset-sharing platforms are often inconsistent with the licence ascribed by the original author of the dataset: our rigorous re-annotation of licences finds that 66% of analysed Hugging Face licences were in a different use category, often labelled as more permissive than the author’s original licence. As a result, many of these datasets are risky to use (or harmfully misleading) for practitioners who want to respect authors’ intentions. Our initiative reduces unspecified licences from more than 72% to 30% and attaches licence URLs, allowing model developers to more confidently select appropriate data for their needs. To this end, the data provenance initiative supports attribution and responsible AI with the following contributions:

(1) The most extensive known public audit of AI data provenance, tracing the lineage of more than 1,800 text datasets (the ‘DPCollection’), their licences, conditions and sources. We document changes in the dataset licensing landscape and synthesize observations into legal guidance for developers (see the ‘Legal discussion’ section).

(2) The Data Provenance Explorer (DPExplorer) (www.dataprovenance.org), an open-source repository for downloading, filtering and exploring data provenance and characteristics. Our tools auto-generate data provenance cards for scalable symbolic attribution and future documentation best practices.

(3) We find a sharp and widening divide between commercially open and closed data, with the latter monopolizing more diverse and creative sources. We suggest a data collection focus to narrow this gap.

The initiative to audit data provenance

The data provenance initiative’s goal is to audit popular and widely used datasets with large-scale legal and AI expert-guided annotation. We propose a base set of indicators necessary for tracing dataset lineage and understanding dataset risks (described in the ‘DPExplorer’ section). As a first contribution of the initiative, we audit 44 instruction or ‘alignment’ finetuning data collections composed of 1,858 individual datasets, selected by experts for their widespread adoption and use in the community. The selected collections and their variants see hundreds to more than 10 million monthly downloads on Hugging Face, with the datasets within these collections tallying to many more (Table 1). While these metrics have limitations, especially for application-specific use cases, we hope that our reproducible pipeline will be extended to other datasets.

Table 1 Alignment tuning collections and their characteristics

Our initiative’s initial focus on alignment finetuning datasets reflects the community’s growing emphasis on them for improving helpfulness, reducing harmfulness and orienting models to human values39. Some collections have overlapping datasets and examples, but we choose not to deduplicate, to preserve the original design choices, which may include different templates, formatting and filtering.

DPExplorer

Our information audit spans (1) identifier information, bridging metadata from several aggregators, including Hugging Face, GitHub, Papers with Code, Semantic Scholar and ArXiv, (2) detailed dataset characteristics for a richer understanding of training set composition and (3) dataset provenance for licensing and attribution. We expand our provenance metadata beyond just licences, because conversations with practitioners revealed that they rely not only on data licences, but on a specific legal and ethical risk tolerance, parameterized by (i) the lineage of licences, (ii) the data source, (iii) the creator’s identity and (iv) the precedent of adoption by other developers.

We release our extensive audit as two tools: (1) a data explorer interface, the DPExplorer, for widespread use and (2) an accompanying repository for practitioners to download the data filtered for licence conditions. Practitioners are also able to generate a human-readable markdown summary, or data provenance card, of the datasets used and their compositional properties for languages, tasks and licences (see the ‘Data provenance card as a data bibliography’ section). Modern researchers training on hundreds of datasets often find it onerous to manually curate extensive data cards for these compilations24,40. We hope this tool will aid in writing the data attribution and composition sections of these documentation efforts, by providing auto-generated, copy-and-pastable dataframe summaries. Details on the collected data are provided in the ‘Metadata details’ section.
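To make this concrete, the following is a minimal sketch, under assumed column names (‘license_use_class’, ‘languages’, ‘task_categories’), of the kind of licence filtering and auto-generated markdown summary the repository supports; it is an illustration rather than the repository’s actual code.

```python
import pandas as pd

# Hypothetical flat metadata table; the column names used here are assumptions
# for illustration, not the repository's actual schema.
def filter_by_license(df: pd.DataFrame, allowed=("commercial", "unspecified")) -> pd.DataFrame:
    """Keep only datasets whose annotated licence use category is acceptable."""
    return df[df["license_use_class"].isin(allowed)]

def markdown_summary(df: pd.DataFrame) -> str:
    """Render a copy-and-pastable composition summary (languages, tasks, licences)."""
    lines = ["# Data provenance summary ({} datasets)".format(len(df))]
    for column in ("languages", "task_categories", "license_use_class"):
        counts = df[column].explode().value_counts()
        lines.append("\n## " + column)
        lines.extend("- {}: {}".format(name, n) for name, n in counts.items())
    return "\n".join(lines)

# Example usage:
# df = pd.read_json("dp_metadata.json")
# print(markdown_summary(filter_by_license(df)))
```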

Licences in the wild

Based on our extensive study of empirical licence use for natural language processing (NLP) datasets, we identify a number of insights with relevance to practitioners and the wider community (see Extended Data Table 1 for a detailed breakdown). We note that this section treats datasets generated via OpenAI’s services as subject to a ‘non-commercial’ use restriction, reflecting OpenAI’s Terms of Use. However, these terms constitute a contractual agreement, not a copyright licence, potentially making them unenforceable against third parties who did not create the data using OpenAI (see the ‘Legal discussion’ section for a detailed discussion).

Frequency of licence types

Figure 1 shows the distribution of licences. The most common licences are CC-BY-SA 4.0 (15.7%), the OpenAI Terms of Use (12.3%) and CC-BY 4.0 (11.6%). We identify a long tail of licence variants with unique terms, and a large set of custom licences accounting for 9.6% of all recorded licences on their own. This wide licence diversity illustrates the challenge to startups and less resourced organizations attempting to navigate responsible training data collection, its legality and ethics.

Fig. 1: The distributions of licences used in the DPCollection, a popular sample of the major supervised NLP datasets.

We find a long tail of custom licences, adopted from software for data: 73% of all licences require attribution and 33% share-alike, but the most popular are usually commercially permissive.

Distribution of restrictive licences

In total, 85% of dataset licences request attribution, and 30% include a share-alike clause (‘share alike’ is a copyright term meaning adaptations or copies of a work must be released under the same licence as the original). Datasets that request attribution pose challenges for practitioners who commonly train on hundreds of datasets and either do not cite them at all6,7,25 or simply cite an aggregation of data, which often falls short of the licence’s attribution requirements. Furthermore, share-alike clauses pose challenges for practitioners repackaging data collections, usually when multiple conflicting share-alike licences are involved, as there is no clear way to resolve them (such as Longpre et al.17, Wang et al.41 and others in the DPCollection). Frequently, practitioners will overwrite share-alike licences with more restrictive or even less restrictive conditions.

Missing or unspecified licences

Investigating missing or unspecified licences involves comparing our manually reviewed licensing terms to the licences for the same datasets as documented by the aggregators GitHub, Hugging Face and Papers with Code. Table 2 shows that these crowdsourced aggregators have an extremely high proportion of missing (unspecified) licences, ranging from 69 to 72%, compared to our protocol, which yields only 30% unspecified. An unspecified licence leaves it unclear whether a licence was omitted by mistake or the creators intentionally released the data into the public domain. Consequently, risk-averse developers are forced to avoid many valuable datasets that they would use if they were certain there was no licence. As part of DPCollection, we manually reassign 46–65% of dataset licences (depending on the platform), resulting in much higher coverage and thus giving risk-averse developers more confidence and breadth in their dataset use.

Table 2 The distribution of licence use categories shows our licences have far fewer unspecified omissions than GitHub (GH, 72%), Hugging Face (HF, 69%) and Papers with Code (PWC, 70%), categorizing licences more confidently into commercial or non-commercial categories

Incorrectly specified licences

Table 2 shows that correct licences are frequently more restrictive than the ones assigned by aggregators. GitHub, Hugging Face and Papers with Code each label licence use cases too permissively in 29%, 27% and 16% of cases, respectively. Our inspection suggests this is due to contributors on these platforms often mistaking licences attached to code in GitHub repositories for licences attached to data.

How does data availability differ by licence use category?

While non-commercial and academic-only licences play important roles in protecting data use, their presence can also exclude communities from participating (or competing) in the development of these technologies. In this section, we break down datasets according to their licence restrictions and see how they differ. Specifically, we ask: does complying with licences dictate systematic differences in resources for commercially permissive (‘open’) and non-commercial (‘closed’) development? And what particular features of data are particularly constrained by non-commercial prohibitions?

We compare datasets by categories of permitted use, according to their licences: (1) commercially viable, (2) non-commercial/academic-only (NC/A-O) or (3) unspecified licence. We group together non-commercial and academic-only conditions as the distinction plays a minor role in practice. We argue in the ‘Legal discussion’ section that datasets without any licence (unspecified) do not impose any conditions and may be treated as commercially viable, although this assessment depends on a developer’s risk tolerance and jurisdiction.

Non-commercial and academic-only licensed datasets have greater diversity in tasks, topics, sources and target text lengths

For each of these features, Table 3 reports the mean number per dataset, broken down by licence category, as well as the entropy of each feature’s distribution as a measure of its randomness and thus diversity. NC/A-O datasets see greater diversity of tasks, topics and sources represented in the text than commercial datasets. Extended Data Fig. 2 shows where this diversity comes from. The task categories most often restricted to NC/A-O use include brainstorming, explanation, logic and maths, and creativity and creative writing. In comparison, the most commercially viable task categories are short text generation, translation and classification. Similarly, among source domains, governments and search queries are largely viable for commercial (and unspecified) purposes, whereas general web, exams and model-generated sources are among the most restrictive.

Table 3 The mean number of features (for example, tasks or languages) per dataset, and the mean entropy of the distribution, representing the diversity of categories
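As a concrete illustration of the diversity measure reported in Table 3, below is a minimal sketch of Shannon entropy over a categorical feature’s empirical distribution (for example, task categories pooled across the datasets in one licence group); the input format is an illustrative assumption.

```python
from collections import Counter
import math
from typing import List

def shannon_entropy(categories: List[str]) -> float:
    """Entropy (in bits) of the empirical distribution over category labels."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Higher entropy means the feature is spread more evenly across categories,
# that is, the group of datasets is more diverse along this feature.
print(shannon_entropy(["translation", "qa", "qa", "creative writing"]))  # 1.5 bits
```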

Target text lengths are notably longer for NC/A-O datasets

Not only do NC/A-O datasets appear more textually and functionally diverse, their length characteristics differ substantially. While Table 3 shows the input text lengths across licence categories are similar on average, the target text lengths are higher for NC/A-O datasets (103 versus 677). This breakdown is further illustrated in Fig. 2, where we see greater representation of both NC/A-O and synthetic datasets above the 100 target token threshold (y axis).

Fig. 2: Across finetuning datasets, we visualize their mean input and target text lengths, measured in log-scaled number of characters.

The colours indicate either their licence use category (left) or whether they were machine generated or human collected (right). Long target texts are represented in large part by non-commercial and synthetic datasets that are often generated by commercial APIs. a, Licence use categories versus text lengths (log-scaled character length). b, Synthetic and/or regular datasets versus text lengths (log-scaled character length).

The rise of synthetic datasets generated using APIs with non-commercial terms of use may explain the differences in text diversity and length. Table 3 also shows a full 45% of NC/A-O datasets are synthetic, compared to <14% in more permissive licence categories. Taori et al.2, Wang et al.5, Touvron et al.4, Xu et al.42 and their variants, all generated in part using commercial APIs, exhibit stronger task and topic diversity than traditional academic datasets, as they cater, by design, to longer-form generations. This is evident from the concentration of creative, brainstorming and reasoning tasks baked into them, compared to the more topic-focused question answering, classification and short text generation found in non-synthetic datasets. These datasets are usually created using larger proprietary models, mostly through OpenAI APIs (see the ‘Legal discussion’ section).

In 2023 there was a spike in NC/A-O dataset licences

For each dataset in the large collection we trace, we record its release date by cross-referencing the associated GitHub, ArXiv and Hugging Face dates. We find a striking change in the pattern of licensing restrictions. As shown in Extended Data Fig. 1, before 2023, no year saw more than one-third of the datasets released as NC/A-O. However, in 2023, when many of the most popular and diverse datasets were published, the NC/A-O rate was 61%. Furthermore, most datasets were unaccompanied by a licence before 2022 (~50–80%), compared to only 12% in 2023. The shift to more licence use, and to more restrictive licences, may foreshadow future challenges to open data.

Commercial datasets have greater language variety, but low-resource language datasets see the least commercial coverage. Table 3 shows that commercial datasets have greater diversity of languages than NC/A-O datasets. However, when broken down by language family, as in Extended Data Fig. 1, we see stark differences in permitted use by group. Code language datasets are nearly all commercially viable (78%), because dataset creators can easily filter GitHub for permissively licenced repositories. English, Atlantic-Congo and Afroasiatic languages also see large permissive representation. However, more than 35% of Turkic, Sino-Tibetan, Japonic and Indo-European language datasets are non-commercial. Note that while the Indo-European language family contains many high-resource European languages, there is a long tail of lower-resource ones. These NC/A-O language families provide directions for open data practitioners to focus their future efforts.

Broader characteristics of the data

In addition to understanding systematic differences in the data by licence, there are research questions regarding the overall composition and characteristics of these widely used and adopted datasets. Our compilation of metadata through the DPCollection allows us to map the landscape of data characteristics and inspect particular features. Note that all these details are also available with interactive visualizations at www.dataprovenance.org, for further research and examination.

Language representation is heavily skewed to English and western European languages

Following Talat et al.’s43 recommendations on data transparency and documentation in demographic analysis, and corroborating Kreutzer et al.’s44 similar analysis of pretraining corpora, we find a stark Western-centric skew in representation. Figure 3 illustrates the coverage per country according to the spoken languages and their representation in DPCollection (see Methods for details); Asian, African and South American nations are sparsely covered, if at all. Even when nations from the Global South appear to have linguistic representation, the text source and dialect of the language contained in these datasets almost always originates from North American or European creators and web sources (although this is difficult to measure precisely). These observations corroborate similar findings on the geo-diversity of image data in the vision domain45,46,47. Models trained on these datasets are likely to have inherent bias, underperforming in critical ways for users of models outside the West48.

Fig. 3: A global heatmap of language representation scores measuring how well each country’s spoken languages are represented by the composition of natural language datasets in DPCollection, as calculated in the ‘Computing language representation’ section.

English-speaking and western European nations are best represented, while the Global South sees limited coverage.

The primary drivers of dataset curation are academic organizations, industry labs, and research institutions

Extended Data Table 1a demonstrates that the single largest dataset contributors are AI2 (12.3%), the University of Washington (8.9%) and Facebook AI Research (8.4%). These metrics describe the scale of dataset curation contributions, but not the influence each dataset has had on the community. It is important to note that these contributors often only download and compile text from the Internet that was originally written by other people. Most dataset creators are located in the United States and China, raising additional concerns about potential biases contained in lower-resource language datasets.

Text datasets focus on language topics, general knowledge, logic and lifestyle

Previous data collection work focuses predominantly on describing datasets by their task compositions5,11,17, but rarely by their actual topics (except ref. 14 in their appendix). Extended Data Table 1b shows the most popular topics, clustered by category, with their representation across datasets. As with most NLP tasks, much of this text data focuses on communication and language-understanding topics, followed closely by general knowledge, routine, sports and education.

Text datasets are sourced primarily from online encyclopaedias, social media, and the web

While practitioners document their individual dataset sources in their published papers, this information is unstructured and can be hard to find. Collections of widely used datasets commonly cite only the dataset papers rather than their original sources, and data sources are often lost during data compilation and repackaging. By manually scanning approximately 500 academic papers, we annotate the original text sources and compile them into domain clusters to permit attribution and analysis, as summarized in Extended Data Table 1c. Among the most widely used sources are wikipedia.org (14.9%), undisclosed webpage crawls (7.0%), Reddit (6.2%) and Twitter (4.0%). The least represented domains include commerce, reviews, legal, academic papers and search queries.

Legal discussion

Our empirical analysis highlights that we are in the midst of a crisis in dataset provenance and practitioners are forced to make decisions based on limited information and opaque legal frameworks. While we believe our tooling will enable better transparency about where licences are in tension, major legal ambiguities remain in data licensing.

Open legal question regarding copyright and model training

Apart from the jurisdictional and interpretive ambiguities discussed in the Supplementary Information Legal Discussion, the process of training a model raises specific copyright questions49, and infringement may occur in several ways even before any outputs are generated. First, the act of creating a training dataset by crawling existing works involves making a digital copy of the underlying data. As the name implies, copyright gives the author of a protected work the exclusive right to make copies of that work (17 US Code § 106). If the crawled data are protected by copyright, then creating training data corpora may raise copyright issues50. Second, copyright holders generally have an exclusive right to create derivative works (for example, translations of a work). Should a trained machine learning model be considered a derivative of the training data51? If so, then training a model would be more likely to violate the rights of the training data’s copyright holders52.

In the United States, the fair use exception may allow models to be trained on protected works (17 US Code § 107)53,54,55,56. As explained by previous work, the training of machine learning models on copyrighted content may be permissible if the underlying works are sufficiently ‘transformed’ into model weights, only a small amount of each work in the training data is included in the trained model, model training is designed to glean only generalizable insights from the training data, and the trained model does not have a strong effect on the economic success of the works in the training data. It is important to underscore that, while training a machine learning model itself may be protected by fair use, this does not mean that model outputs will not infringe on the copyright of previous works. As the authors above highlight, the application of fair use in this context is still evolving and several of these issues are currently being litigated (for example, Andersen v. Stability36, Doe v. GitHub57 and Tremblay v. OpenAI23).

Fair use for data created for machine learning

Fair use is less likely to apply when works are created for the sole purpose of training machine learning models as in the case of supervised datasets with copyrightable compositions or annotations. Most literature on fair use and machine learning focuses on copyrighted art or text that was crawled to train a model. These crawled works were not created for the purpose of training machine learning models. By contrast, in this paper, we focus on supervised datasets that were created for the sole purpose of training machine learning models. As underscored by refs. 53 and 55, the fair use analysis depends in part on whether a trained model copies the ‘expressive purpose’ of the original work (Bill Graham Archives v. Dorling Kindersley58). While the expressive purpose of a piece of text or art is not to train machine learning models, the purpose of a training dataset is to do just that. As a result, we expect that it is less likely that fair use would apply to the use of curated data. Instead, the creators of these datasets hold a copyright in the dataset and the terms of the dataset licence agreement govern the subsequent use of these data. However, it is rare in practice for a large language model (LLM) to use a single supervised dataset and often multiple datasets are compiled into collections. This further complicates the legal analysis because we find that the licence terms of many popular dataset collections are conflicting.

Legal implications of LLM-generated annotations

We find that approximately 12% of the datasets we audit were annotated using OpenAI. The OpenAI Terms of Use state that outputs from the OpenAI service may not be used ‘to develop models that compete with OpenAI’ (https://openai.com/policies/terms-of-use). These terms seem to preclude a developer from using OpenAI to generate training data to train a competing LLM. However, it is not clear whether they would also limit the ability of a developer to use OpenAI to create and publish an annotated dataset. While publishing such a dataset does not directly compete with OpenAI, it seems foreseeable that such a dataset could enable third parties (who did not themselves use OpenAI) to create competing LLMs. In the United States, there are several doctrines of secondary or indirect copyright liability aimed at enforcing copyright in cases where there is no direct infringement51,59. The application of these doctrines depends on many factors, most importantly on whether OpenAI has a copyright interest in its outputs. If these copyright doctrines do not apply, then it is still possible that publishing the dataset constitutes a breach of contract by the dataset developers. While it would be more challenging for OpenAI to pursue a case against third parties, there are myriad other business torts, from unfair competition to misappropriation, that may be relevant to this situation and which go beyond the scope of this paper60. Time will tell whether OpenAI and other LLM providers can enforce their terms against third parties. However, a prominent researcher at Google has already resigned citing concerns that OpenAI outputs were used to train Bard61. In light of these ambiguities, our tool gives developers the ability to exclude OpenAI-generated datasets.

Data provenance enables informed decision-making

Despite these pervasive legal uncertainties, practitioners can still make some informed decisions to minimize risk if they have reliable data provenance information. With access to this information, practitioners can decide to err on the side of caution and to use only data licenced for commercial use, contact dataset creators of restrictively licenced data to negotiate a usage agreement or decide that their specific context and risk tolerance allows them to use datasets licenced for non-commercial use. Through our audit and tooling, we seek to provide the information needed to make informed decisions in an otherwise ambiguous landscape. Model providers may also consider strategies for partially mitigating uncertainties for downstream users, for example, by indemnifying users, as done by Google Cloud62. Of course, this does not solve the issues faced by model developers or dataset curators. We urge practitioners to take dataset licences seriously, as they may have real impacts on how their models may be used in practice.

In creating a repository of data licensing information, we hope to also encourage dataset creators to be more thoughtful about the licences that they select. Dataset creators are well-positioned to understand the appropriate uses of the datasets they publish and licences can be a tool to communicate these restrictions and to encourage responsible AI development.

Finally, this discussion highlights an important opportunity for regulators to reduce legal ambiguity by clarifying the enforceability of dataset licences both to help catalyse innovation and as a way to promote more responsible, inclusive and transparent machine learning practices63,64.

Methods

Details on collecting data provenance

These data were collected with a mix of manual and automated techniques, leveraging dataset aggregators such as GitHub, Hugging Face and Semantic Scholar (Extended Data Fig. 3). Annotating and verifying licence information, in particular, required a carefully guided manual workflow, designed with legal practitioners (see the ‘Licence annotation process’ section). Once these information aggregators were connected, it was possible to synthesize or crawl additional metadata, such as dataset languages, task categories and time of collection. For richer details on each dataset, such as text topics and sources, we used carefully tuned language model prompts to inspect each dataset.

Automated annotation methods

Based on the manually retrieved pages, we automatically extract licences from Hugging Face configurations and GitHub pages. We leverage the Semantic Scholar public API65 to retrieve the release date and current citation counts associated with academic publications. Additionally, we compute a series of other helpful, but often overlooked, data properties, such as text metrics (the minimum, mean and maximum for input and target lengths) and dialogue turns. We elected to measure sequence length in characters rather than word tokens for fairer treatment across languages and scripts, given well-known differences in tokenizer performance across languages66.
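As an illustration of these character-based metrics, the following minimal sketch computes the minimum, mean and maximum lengths for the input and target sides of a dataset; the record format (‘inputs’/‘targets’ keys) is an assumption, not the collection’s schema.

```python
from statistics import mean
from typing import Dict, Iterable, List

def length_metrics(texts: Iterable[str]) -> Dict[str, float]:
    """Minimum, mean and maximum character length over a set of texts."""
    lengths = [len(t) for t in texts]
    return {"min": min(lengths), "mean": mean(lengths), "max": max(lengths)}

def dataset_text_metrics(records: List[dict]) -> Dict[str, Dict[str, float]]:
    """Character-length statistics for the input and target sides of a dataset."""
    return {
        "input": length_metrics(r["inputs"] for r in records),
        "target": length_metrics(r["targets"] for r in records),
    }

# Example usage:
# dataset_text_metrics([{"inputs": "Translate: hola", "targets": "hello"}])
```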

API annotation methods

While task categories have become the established measurement of data diversity in recent instruction tuning work5,11, there are many other rich features describing data diversity and representation. To augment this, we use OpenAI’s GPT-4 API to help annotate text topics. We randomly sample 100 examples per dataset and carefully prompt GPT-4 to suggest up to ten topics discussed in the text.
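A minimal sketch of this annotation step is shown below, using the openai v1 client listed in the Software section; the exact prompt wording and model identifier are illustrative assumptions rather than the prompts used in the audit.

```python
import random
from typing import List

from openai import OpenAI  # openai>=1.0 client, as listed in the Software section

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_topics(examples: List[str], n_samples: int = 100) -> str:
    """Ask GPT-4 for up to ten topics covered by a random sample of dataset examples."""
    sample = random.sample(examples, min(n_samples, len(examples)))
    prompt = (
        "Below are examples from a single NLP dataset. "
        "List up to ten topics discussed in the text, one per line.\n\n"
        + "\n---\n".join(sample)
    )
    response = client.chat.completions.create(
        model="gpt-4",  # model name is an assumption for illustration
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```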

To annotate the original data sources, AI experts (PhD students and postdocs) reviewed the papers and recorded the original text sources, whether machines or templates were used for synthetic generation, and whether human annotators were used. GPT-4 was used as an in-context retriever on the dataset’s ArXiv paper to extract snippets that the experts may have missed. We split the ArXiv paper into 4,000-character chunks and prompt the API to return a JSON list of any mentions of the dataset source, for example from crawling, synthetic or manual generation.
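The chunked retrieval step could look roughly like the following sketch, which splits the paper into 4,000-character chunks and parses the JSON list returned for each chunk; again, the prompt wording and model identifier are assumptions.

```python
import json
from typing import List

from openai import OpenAI

client = OpenAI()

def find_source_mentions(paper_text: str, chunk_size: int = 4000) -> List[str]:
    """Scan a paper in 4,000-character chunks for sentences describing the data source."""
    mentions = []
    for start in range(0, len(paper_text), chunk_size):
        chunk = paper_text[start:start + chunk_size]
        instruction = (
            "Return a JSON list of any sentences in the following text that describe "
            "where the dataset's text came from (crawled, synthetic or manually written). "
            "Return [] if there are none.\n\n"
        )
        response = client.chat.completions.create(
            model="gpt-4",  # model name is an assumption
            messages=[{"role": "user", "content": instruction + chunk}],
        )
        try:
            mentions.extend(json.loads(response.choices[0].message.content))
        except json.JSONDecodeError:
            continue  # skip chunks where the reply is not valid JSON
    return mentions
```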

Licence annotation process

One of our central contributions is to validate the licences associated with widely used and adopted datasets. This process provides a current snapshot of the data provenance landscape for finetuning data, but the methods and code we develop and share here are aimed to facilitate future audits, including those that extend beyond finetuning and text data. This followed a time-intensive human annotation protocol to collect dataset authors’ self-reported licences and categorize them according to stated conditions. Note that this protocol reflects best efforts to verify self-reported licences and does not constitute legal advice. Additionally, it is important to note that the enforceability of these licences depends on several factors discussed in the ‘Legal discussion’ section. One especially important assumption in cases where datasets are based on data obtained from other sources is that dataset creators actually have a copyright interest in their dataset. This depends on the data source and how creators modify or augment these data, and requires a case-by-case analysis. However, it appears that most developers operate under the general assumption that they alone own their datasets. Our licence annotation workflow follows these steps:

(1) Compile all self-reported licence information. We aggregate all licensing information reported on GitHub, ArXiv, Hugging Face, Papers with Code and the collection itself (for example, Super-Natural Instructions)41.

(2) Search for explicit data licences. The annotator searches for a licence specifically given to the dataset (not the accompanying code) by the authors. A licence is found if (i) the GitHub repository mentions or links a licence in reference to the data, (ii) the Hugging Face licence label was uploaded by the dataset creators themselves or (iii) the paper, Hugging Face or Papers with Code provide a dataset-specific licence link, attributable to the data authors.

(3) Identify a licence type. A licence may fall into a set of common types (for example, MIT, Apache 2.0, CC BY-SA and so on), be a ‘Custom’ licence, a permission request form or, if none was found for the data, unspecified. If a dataset has multiple licences, the annotator lists each of them according to their types.

(4) Categorize licences. From the perspective of a machine learning practitioner, licensing is typically viewed through the lens of how it affects the model lifecycle: does it impede or allow training on the data, downstream use, attribution, modification or redistribution? On the basis of discussions with industry experts, we categorize licences based on three important questions that affect the model lifecycle: is data usage limited to academic or non-commercial purposes (permitted use), does the data source need to be attributed (attribution) and do derivatives of the data need to be licenced under the same terms as the original (share-alike)? If there are multiple licences for a dataset, its categorization for each feature is chosen as the strictest across licences (see the sketch after this list).

(5) Sources. For each dataset, we review the documentation available in the academic paper, GitHub, website or Hugging Face to determine the original sources of the text as precisely as possible. The original sources are where the text was taken from before it was used in datasets. Sometimes, a dataset (introduced in a specific paper) might be based on another dataset; for example, it may extend another dataset, or reformat and/or modify one to be usable for another learning task. In these cases, we find the ‘root’ dataset (that is, the original one that is extended or modified) and determine the source for that dataset. We also include new text sources that have been leveraged at each stage of dataset derivation and development. We provide a list of sources, grouped by domain, at https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection/blob/main/constants/domain_groups.json.

(6) Additional provenance. In practice, legal teams may wish to balance their risk tolerance with more nuanced criteria. For instance, they may be satisfied with using (more permissive) GitHub licences, even when it is ambiguous whether these apply to the code or the data. They may also wish to include or exclude datasets on the basis of whether they are already widely used in practice, where the original data were sourced from and whether the creator is a competitor. To supplement the above licence categories, we also collect all this metadata for fine-grained selection and filtering.
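The sketch below illustrates the strictest-across-licences rule from step (4) with a small, hypothetical licence-to-category mapping; both the mapping and the placement of ‘unspecified’ in the ordering are illustrative assumptions, not the collection’s full annotation.

```python
from typing import List

# Ordered from strictest to most permissive; treating 'unspecified' as stricter than
# 'commercial' here is an illustrative assumption (see the 'Legal discussion' section).
STRICTNESS = ["non-commercial/academic-only", "unspecified", "commercial"]

# Small illustrative subset of a licence-to-use-class mapping, not the full annotation.
LICENSE_TO_USE_CLASS = {
    "CC BY-NC 4.0": "non-commercial/academic-only",
    "OpenAI Terms of Use": "non-commercial/academic-only",
    "CC BY-SA 4.0": "commercial",
    "Apache 2.0": "commercial",
    "MIT": "commercial",
}

def dataset_use_class(licenses: List[str]) -> str:
    """Return the strictest permitted-use class across all licences on a dataset."""
    classes = [LICENSE_TO_USE_CLASS.get(name, "unspecified") for name in licenses]
    return min(classes, key=STRICTNESS.index)

# A dataset carrying both Apache 2.0 (for code) and CC BY-NC 4.0 (for data) is
# categorized as non-commercial/academic-only.
print(dataset_use_class(["Apache 2.0", "CC BY-NC 4.0"]))
```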

Data provenance card as a data bibliography

Previous work has stressed the importance of data documentation and attribution22,67. In particular, Gebru et al.’s24 datasheets break down documentation into motivation, composition, collection process, processing, uses, maintenance and distribution. Similarly, Bender and Friedman67 ask for curation rationale, language variety, speaker demographic, annotator demographic, speech situation and text characteristics, among others. However, when models train on many sources of data, even if they are each rigorously documented for each of these fields (rarely the case), it is challenging to cleanly synthesize comprehensive and navigable documentation for the resulting bundle.

To make this process tractable with scale, we propose leveraging symbolic attribution, where our tools auto-generate a structured store of the provenance and attribution metadata, similar to a bibliography for data (these are auto-generated at https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection). Our collected schema allows this store to succinctly capture the attribution (links to repositories, aggregator copies, papers, creators), provenance (text/machine sources, licences) and compositional properties of the data (languages, tasks, text metrics, format and time). This file of references and metadata, known as a data provenance card, enables the comprehensive documentation proposed by previous work while providing some advantages from its structure. First, the data provenance card can be easily searched, sorted, filtered and analysed, whereas datasheets or statements, designed for individual datasets, are meant to be manually read. Second, developers can efficiently assemble relevant information without losing any detail by symbolically linking to the original datasets and their documentation. Third, as datasets are continually repackaged and absorbed into newer and bigger collections, data provenance cards are easily adaptable by simply appending or concatenating them. Altogether, we hope this tooling enables and promotes the thorough documentation proposed in previous work24,40,67,68.
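As a rough illustration of this structure, the following sketch defines an appendable provenance card with a small, assumed subset of the schema fields and a markdown renderer; it is not the exact auto-generated format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProvenanceEntry:
    name: str
    paper_url: str
    licenses: List[str]
    text_sources: List[str]
    languages: List[str]
    tasks: List[str]

@dataclass
class DataProvenanceCard:
    entries: List[ProvenanceEntry] = field(default_factory=list)

    def extend(self, other: "DataProvenanceCard") -> None:
        """Repackaged collections can concatenate the cards of their constituents."""
        self.entries.extend(other.entries)

    def to_markdown(self) -> str:
        """Render a human-readable, bibliography-style summary."""
        lines = ["# Data provenance card"]
        for e in self.entries:
            lines.append(
                "- **{}** ({}): licences {}; sources {}; languages {}; tasks {}".format(
                    e.name, e.paper_url, ", ".join(e.licenses), ", ".join(e.text_sources),
                    ", ".join(e.languages), ", ".join(e.tasks)
                )
            )
        return "\n".join(lines)
```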

Metadata details

Collecting comprehensive metadata for each dataset required leveraging several sources: linking to resources already on the web (W), human annotation by legal experts (E) and GPT-4-assisted human annotation (G). The collected metadata cover many aspects of these datasets, spanning identifiers, dataset characteristics and provenance information. These features were selected on the basis of input from the machine learning experts who contributed to this paper and who identified the information that would be most useful to practitioners.

Identifier information

Identifier information discloses links and connects aggregator identifiers.

(1) Dataset identifiers (E): the dataset’s name, associated paper title and description of the dataset.

(2) Dataset aggregator links (E): a link to each major aggregator, including GitHub, Hugging Face, Papers with Code, Semantic Scholar and ArXiv, allows us to incorporate and compare their crowdsourced metadata.

(3) Collection (E): the name and URL of the data collection of which this dataset is a part.

Dataset characteristics

Dataset characteristics are detailed information relevant to understanding data representation and/or composition, and curating a training set.

(1) Languages (E): each of the languages represented in the dataset, so developers can easily follow the ‘bender rule’69.

(2) Task categories (E, G): the 20+ task categories represented in the instructions, such as question answering, translation, programme synthesis, toxicity identification, creative writing and roleplaying.

(3) Text topics (G): an automated annotation of the topics discussed in the datasets, with GPT-4 labelling a sample of 100 examples for up to ten covered topics.

(4) Text length metrics: the minimum, maximum and mean number of characters per user instruction and assistant response, as well as dialogue turns per conversation. Lengths are measured in characters to remain agnostic to tokenization and non-whitespace languages, as tokenizers introduce biases66.

(5) Format (E): the format and intended use of the data. The options are zero-shot prompts, few-shot prompts, chain-of-thought prompts, multi-turn dialogue and response ranking.

(6) Time of collection (W): the time when the work was published, which acts as an upper-bound estimate of the age of the text.

Dataset provenance

(1) Licences (W, E): the licence names and URLs associated with the data, using the process described in the ‘Licence annotation process’ section. We also enable filtering by licence use classes categorized by legal professionals.

(2) Text source (E, G): the original sources of the text, often Wikipedia, Reddit or other crawled online or offline sources.

(3) Creators (E): the institutions of the dataset authors, including universities, corporations and other organizations.

(4) Attribution (W): the attribution information for the authors of the paper associated with the dataset.

(5) Citation and download counts (W): the citation and Hugging Face download counts for the paper and dataset, dated September 2023. These act as an estimate of community use and are commonly used as a precedent when deciding on the risk level of using these datasets.

Developing the DPExplorer

The DPExplorer displays the collected data in a format accessible to developers by applying different aggregation, specialized filtering and tallying steps to obtain data summary statistics and overviews. All plots are built in JavaScript using the observablehq, P5 and D3 libraries, which support dynamic, interactive visualizations. Many of our plots visualize languages and creators across geographies. To situate these, we use lookup tables, such as ISO 639 language codes to group language families, and we use TopoJSON to visualize the world map. We also map these to country and language codes to interface with the map. As done in this paper, we map all tasks, topics and licences into clustered categories (Extended Data Table 2) to allow us to plot their distributions. We manually predefine clusters based on discussion among the authors and taxonomies frequently used in the field, coupled with manual observation and iteration to keep the clusters tractable.

Computing language representation

We compute a language representation score ${S}_{k}$ for each country $k$, parameterized by ${p}_{kl}$, the percentage of people in country $k$ who speak language $l$, and ${w}_{li}$, a binary indicator equal to 1 if dataset $i\in D$ contains language $l$ and 0 otherwise.

$${S}_{k}=\sum _{l\in L}\left({p}_{kl}\times \sum _{i\in D}{w}_{li}\right)$$
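The score can be computed directly from this definition, as in the following minimal sketch; the speaker shares and dataset language lists are hypothetical inputs, not the data behind Fig. 3.

```python
from typing import Dict, List

def representation_score(speaker_share: Dict[str, float],
                         dataset_languages: List[List[str]]) -> float:
    """S_k = sum over languages l of p_kl times the number of datasets containing l."""
    return sum(
        share * sum(1 for langs in dataset_languages if lang in langs)
        for lang, share in speaker_share.items()
    )

# Hypothetical country where 60% speak Swahili and 40% speak English, with three datasets:
datasets = [["English"], ["English", "French"], ["Swahili"]]
print(representation_score({"Swahili": 0.6, "English": 0.4}, datasets))  # 0.6*1 + 0.4*2 = 1.4
```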

Software

We use the following Python (v.3.8.9) packages: aiohttp (v.3.9.5), aiosignal (v.1.3.1), annotated-types (v.0.7.0), anyio (v.4.4.0), async-timeout (v.4.0.3), attrs (v.23.2.0), certifi (v.2023.7.22), chardet (v.5.2.0), charset-normalizer (v.3.3.2), ConfigArgParse (v.1.7), datasets (v.2.19.2), dill (v.0.3.8), distlib (v.0.3.6), distro (v.1.9.0), exceptiongroup (v.1.2.1), filelock (v.3.11.0), frozenlist (v.1.4.1), fsspec (v.2024.3.1), h11 (v.0.14.0), httpcore (v.1.0.5), httpx (v.0.27.0), huggingface-hub (v.0.23.3), idna (v.3.4), jsonlines (v.4.0.0), multidict (v.6.0.5), multiprocess (v.0.70.16), numpy (v.1.24.4), openai (v.1.33.0), packaging (v.24.1), pandas (v.2.0.3), platformdirs (v.3.2.0), pyarrow (v.16.1.0), pyarrow-hotfix (v.0.6), pydantic (v.2.7.3), pydantic_core (v.2.18.4), python-dateutil (v.2.9.0.post0), python-dotenv (v.1.0.1), pytz (v.2024.1), PyYAML (v.6.0.1), requests (v.2.32.3), semanticscholar (v.0.5.0), sniffio (v.1.3.1), tabulate (v.0.9.0), tenacity (v.8.2.3), tqdm (v.4.66.4), typing_extensions (v.4.12.2), tzdata (v.2024.1), urllib3 (v.2.1.0), virtualenv (v.20.21.0), xxhash (v.3.4.1) and yarl (v.1.9.4).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.