ThoughtSource: A central hub for large language model reasoning data

Large language models (LLMs) such as GPT-4 have recently demonstrated impressive results across a wide range of tasks. LLMs are still limited, however, in that they frequently fail at complex reasoning, their reasoning processes are opaque, they are prone to ‘hallucinate’ facts, and there are concerns about their underlying biases. Letting models verbalize reasoning steps as natural language, a technique known as chain-of-thought prompting, has recently been proposed as a way to address some of these issues. Here we present ThoughtSource, a meta-dataset and software library for chain-of-thought (CoT) reasoning. The goal of ThoughtSource is to improve future artificial intelligence systems by facilitating qualitative understanding of CoTs, enabling empirical evaluations, and providing training data. This first release of ThoughtSource integrates seven scientific/medical, three general-domain and five math word question answering datasets.


Here we present ThoughtSource, a meta-dataset and software library for chain-of-thought reasoning in LLMs (https://github.com/OpenBioLink/ThoughtSource). The goals of ThoughtSource are to:
- Facilitate qualitative understanding of CoTs generated by LLMs under various conditions (e.g., across tasks, models and prompts).
- Enable empirical and quantitative evaluation.
- Provide a library of diverse CoT training data for improving performance, robustness, explainability and value-alignment of future LLM-based AI systems.

Methods
We selected NLP benchmarks for question answering and natural language inference for which pre-existing data for constructing CoTs was available. For some of the datasets, one or multiple additional datasets were used as sources for additional CoTs, allowing for the comparison of different CoT generation methodologies. We created data loader scripts compatible with the Hugging Face datasets library 20 for all datasets. Additionally, we collected metadata such as descriptions, websites and licenses. We contacted dataset providers and encouraged them to choose an open source/open data license if licensing information was unavailable or unclear.
We implemented two kinds of schemas: 1) source dataset schemas, which are unique to each dataset and provide data close to their original format; and 2) a standardized ThoughtSource schema, which maps all datasets into a common format. The ThoughtSource schema was created by extending the question answering schema of the BigBIO project 16.
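For illustration, a single record in the ThoughtSource schema might look as follows (a hypothetical, abbreviated sketch in Python; field names are illustrative, and the authoritative definitions are given in Tables 3-6):

# A hypothetical, abbreviated record in the ThoughtSource schema.
# Field names are illustrative; see Tables 3-6 for the actual schema.
example = {
    "id": "worldtree_train_0042",  # unique example identifier
    "question": "Which property of a mineral can be determined just by looking at it?",
    "choices": ["luster", "mass", "weight", "hardness"],
    "answer": ["luster"],  # gold-standard answer
    "cot": [  # reference chain-of-thought steps
        "Luster is a visual property of minerals.",
        "Minerals can be identified by their luster just by looking at them.",
    ],
    "generated_cot": [],  # AI-generated CoTs are appended here
}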
We implemented tailored algorithms for converting each dataset because the collected datasets provide explanations in different ways, such as math expressions or structured graph-based explanations. Furthermore, we performed preprocessing such as capitalization and punctuation correction. To recover standard formatted text from pre-tokenized datasets, we reversed the tokenization. This preprocessing was performed only on data in the ThoughtSource schema, while data in the source schemas was left in its original formatting. All code for running these conversions is available in our GitHub repository.
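As a minimal sketch of this kind of detokenization (illustrative rules only, not the exact rules used in our conversion scripts):

import re

def detokenize(text: str) -> str:
    """Reverse common whitespace artifacts of pre-tokenized text."""
    # Remove spaces before closing punctuation
    text = re.sub(r"\s+([.,;:!?%)])", r"\1", text)
    # Remove spaces after opening brackets
    text = re.sub(r"([(\[])\s+", r"\1", text)
    # Reattach clitics and contractions
    text = re.sub(r"\s+('s|n't|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text

print(detokenize("the mineral 's luster , for example , is visible ."))
# -> "the mineral's luster, for example, is visible."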
We developed a suite of Python libraries and tools for generating novel CoTs and answers by calling LLM APIs, as well as tools for evaluating, comparing and annotating datasets. We built upon the LangChain library (https://github.com/hwchase17/langchain/) for interfacing with a wide variety of external LLM APIs.
This first release of ThoughtSource integrates seven scientific/medical, three general-domain and five math word question answering datasets (Table 1). For every dataset except PubMedQA and MedQA we provide 'reference CoTs'. We created these reference CoTs by converting rationales provided by the original datasets into reasoning chains. These rationales, depending on the dataset, were created by human experts or obtained from crowdsourcing.
Furthermore, we added CoTs generated by state-of-the-art LLMs by importing them from previous work, as well as generating them de-novo for this work (details below).

Scientific/medical question answering datasets
WorldTree V2 21 is one of the most detailed multi-hop science question answering datasets available. Finding the right multiple-choice answers requires multi-hop inference combining between 1 and 16 facts (average: 6). It contains explanations created by experts in the form of multiple facts. We concatenated these facts and applied a set of rules to improve style and grammaticality, yielding reference CoTs that are close to natural language.
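A minimal sketch of this fact-concatenation step (the rule set in the repository is more extensive; the helper below is illustrative):

def facts_to_cot(facts: list[str]) -> str:
    """Concatenate atomic facts into a readable reasoning chain."""
    steps = []
    for fact in facts:
        fact = fact.strip()
        fact = fact[0].upper() + fact[1:]  # capitalize the first letter
        if not fact.endswith("."):
            fact += "."  # ensure terminal punctuation
        steps.append(fact)
    return " ".join(steps)

facts = [
    "melting means changing from a solid into a liquid by adding heat energy",
    "an ice cube is a kind of solid",
]
print(facts_to_cot(facts))
# -> "Melting means changing from a solid into a liquid by adding heat energy. An ice cube is a kind of solid."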

EntailmentBank 22 contains open-domain science exam questions and answers, along with systematic explanations that show how the correct answer is reached through a series of steps. These steps are organized into a tree structure, known as an entailment tree, which starts with known facts and progresses through intermediate conclusions until the final answer is reached. These entailment trees are also serialized into text-based proofs by traversing the trees. We applied a set of rules to improve style and grammaticality in these proofs to yield reference CoTs that are close to natural language.
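The serialization step can be sketched as a post-order traversal that states premises before the conclusions they entail (the node structure below is an assumption for illustration, not the EntailmentBank file format):

from dataclasses import dataclass, field

@dataclass
class EntailmentNode:
    statement: str
    children: list["EntailmentNode"] = field(default_factory=list)

def serialize_proof(node: EntailmentNode) -> list[str]:
    """Post-order traversal: premises are stated before the conclusions they entail."""
    steps: list[str] = []
    for child in node.children:
        steps.extend(serialize_proof(child))
    if node.children:
        premises = " and ".join(c.statement for c in node.children)
        steps.append(f"Since {premises}, it follows that {node.statement}.")
    return steps

tree = EntailmentNode(
    "an ice cube will melt in sunlight",
    children=[
        EntailmentNode("sunlight produces heat"),
        EntailmentNode("heat melts ice"),
    ],
)
print(" ".join(serialize_proof(tree)))
# -> "Since sunlight produces heat and heat melts ice, it follows that an ice cube will melt in sunlight."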
OpenBookQA 23 contains questions modeled after open-book exams of elementary-level science. They require multi-step reasoning, commonsense knowledge, and a diverse application of core science facts to find the correct answer. The dataset provides over 1,300 core science facts and a mapping to all of the questions. By design, questions in OpenBookQA are answered incorrectly by both retrieval-based and word co-occurrence algorithms. The dataset contains a single-fact explanation of the correct answer for each question, which we adopted to create reference CoTs.

General-domain question answering datasets
CommonsenseQA 30 is a collection of multiple-choice questions that test a wide range of general knowledge. The dataset was created through a crowdsourcing process designed to gather creative and diverse questions. We created reference CoTs for the train and validation splits derived from the crowd-sourced ECQA dataset. We also added AI-generated reasoning chains created with few-shot 5 and zero-shot 6 prompting, which are available for the validation split.

StrategyQA 31 is a question answering dataset that tests the ability to reason through open-domain questions and provide Yes/No answers. Each example includes a question, a decomposition of the question into reasoning steps, and evidence paragraphs from Wikipedia. Human-generated free-text reasoning chains are part of the train split of the original dataset and were used as reference CoTs; the Wikipedia evidence paragraphs were not included in our CoTs. We extended StrategyQA with AI-generated CoTs created through few-shot 5 and zero-shot 6 prompting, which are available for the train split.

QED 32 is a collection of expert-annotated structured explanations for answers to questions, built upon a subset of the Google Natural Questions dataset. Given a question and a passage from Wikipedia, QED uses linguistic information to represent explanations as a series of interpretable steps, such as referential equality, sentencehood, and entailment. Structured reasoning chains by experts are provided for all examples. To create reference CoTs, we extracted the sentence that entails the answer; statements about referential equality in QED were converted to natural language and added as additional steps in the CoTs (e.g. "The noun phrase […] in the sentence and the noun phrase […] in the question refer to the same thing.").
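The QED conversion can be sketched as follows (the annotation structure shown is simplified for illustration and does not mirror the exact QED file format):

def qed_to_cot(selected_sentence: str, referential_equalities: list[dict]) -> list[str]:
    """Build reference CoT steps from a simplified QED-style annotation."""
    steps = []
    for eq in referential_equalities:
        # Verbalize each referential equality as a natural-language step
        steps.append(
            f'The noun phrase "{eq["sentence_reference"]}" in the sentence and the noun phrase '
            f'"{eq["question_reference"]}" in the question refer to the same thing.'
        )
    # The sentence that entails the answer becomes the final step
    steps.append(selected_sentence)
    return steps

steps = qed_to_cot(
    "The Eiffel Tower was completed in 1889.",
    [{"sentence_reference": "The Eiffel Tower", "question_reference": "the tower"}],
)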

Math word problem datasets
Algebra Question Answering with Rationales (AQUA-RAT) 33 is a large-scale multiple-choice dataset containing algebraic word problems. Each problem consists of a question with five possible answers and a rationale, a step-by-step natural language explanation of the solution. We used these natural language explanations as reference CoTs.
Academia Sinica Diverse (ASDiv) math word problem (MWP) dataset 34 aims to provide more diverse language patterns and problem types than previous datasets. It covers most of the math topics taught in elementary school. Each MWP is labeled with its grade level (indicating difficulty) and the required math operation (e.g. division), and includes a short explanation of the solution. ASDiv provides explanations of answers in the form of nested math expressions using common operators such as addition, subtraction, division and multiplication. We generated reference CoTs by converting these math expressions into natural language explanation chains using a rule-based method.
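As an illustration, a rule-based conversion of a nested expression such as (3 + 5) / 2 might proceed as in the following sketch (simplified; the actual rules cover more operators and phrasings):

OPERATOR_TEMPLATES = {
    "+": "Adding {a} and {b} gives {r}.",
    "-": "Subtracting {b} from {a} gives {r}.",
    "*": "Multiplying {a} by {b} gives {r}.",
    "/": "Dividing {a} by {b} gives {r}.",
}

def expression_to_steps(expr) -> tuple[float, list[str]]:
    """Recursively evaluate a nested (op, left, right) tuple and verbalize each step."""
    if not isinstance(expr, tuple):
        return expr, []  # a bare number needs no explanation
    op, left, right = expr
    a, steps_a = expression_to_steps(left)
    b, steps_b = expression_to_steps(right)
    r = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
    step = OPERATOR_TEMPLATES[op].format(a=a, b=b, r=r)
    return r, steps_a + steps_b + [step]

result, steps = expression_to_steps(("/", ("+", 3, 5), 2))
print(" ".join(steps))
# -> "Adding 3 and 5 gives 8. Dividing 8 by 2 gives 4.0."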
Grade School Math 8K (GSM8K) 35 contains grade school math word problems. Despite their conceptual simplicity, these problems are more challenging to process than those of earlier datasets due to their linguistic diversity. The creators of GSM8K instructed crowd workers to write solutions to problems in free-text format, which we used as reference CoTs in ThoughtSource, omitting any additional arithmetic specifications.
Math Word Problems (MAWPS) 36 is an online platform that provides a collection of math word problems. The problems have simple one- or two-line explanations of their solutions. MAWPS includes datasets from various sources and offers tools for automatically creating datasets with specific characteristics, as well as the possibility to tune lexical and template overlap. We converted explanatory math expressions to reference CoTs with an approach similar to the one used for ASDiv.
Simple Variations on Arithmetic Math Word Problems (SVAMP) 37 was created by applying carefully chosen variations to examples from existing datasets, such as ASDiv and MAWPS. These variations make it difficult for language models to solve the problems using simple heuristics, and instead require deeper understanding and reasoning ability. We converted math expressions to reference CoTs with an approach similar to the one used for ASDiv.

Dataset schema
Tables 3-6 provide descriptions and datatypes of the various fields in the ThoughtSource schema.
Running a task on a sample produces a generated CoT and an answer to the question. Annotations can be added programmatically or through an annotator tool.

Technical validation
The datasets were reviewed by three team members and issues were tracked on the issue tracker of the associated GitHub repository.
To characterize potential overlaps and relations between datasets, we calculated mutual n-gram overlap using n=3 (Fig. 2). To quantify the overlap between two sets of n-grams we use the Szymkiewicz-Simpson coefficient (overlap coefficient), which can be interpreted as the proportion of n-grams of the smaller dataset that are contained in the bigger dataset:

overlap(X, Y) = |X ∩ Y| / min(|X|, |Y|)
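A minimal sketch of this computation (illustrative, not the exact implementation used for Fig. 2):

def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_coefficient(texts_a: list[str], texts_b: list[str], n: int = 3) -> float:
    """Szymkiewicz-Simpson coefficient between the n-gram sets of two datasets."""
    set_a = set().union(*(ngrams(t, n) for t in texts_a))
    set_b = set().union(*(ngrams(t, n) for t in texts_b))
    return len(set_a & set_b) / min(len(set_a), len(set_b))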

There is an overlap of 1.0 between the sets of questions in WorldTree V2 and EntailmentBank. The QA pairs in EntailmentBank were taken from the WorldTree V2 dataset 22, so the questions in EntailmentBank are a subset of WorldTree V2.
Furthermore, there is significant overlap between the questions contained in ASDiv and SVAMP, and between those in ASDiv and MAWPS. ASDiv and SVAMP share questions because a subset of examples from ASDiv was used as seed examples for the creation of SVAMP. Questions for both MAWPS and ASDiv were crawled from web resources, and their overlap could be due to examples being crawled from the same sources.
Besides overlaps in questions, we also identified overlaps in reference CoTs. WorldTree V2 provided an initial pool of atomic facts that the annotators could use to construct an explanation tree in EntailmentBank (in addition to creating their own facts). This explains the high n-gram overlap between the CoTs of WorldTree V2 and EntailmentBank. Similarly, a subset of WorldTree V2 facts was used for the creation of explanations in OpenBookQA.
We plan to add more datasets and generated CoTs to the ThoughtSource repository, and we welcome outside contributions. Novel CoTs for existing core datasets can be generated and shared through the ThoughtSource APIs and JSON files. Completely new datasets can also be added, as described in the GitHub repository's contribution guide. Our code and data are licensed under an MIT license, while data adapted from existing datasets are available under the licenses of their respective sources.

We analyzed the distribution of question and reference CoT field lengths (Fig. 1). MedQA has the longest median question length, while PubMedQA has the longest median CoT length. Several datasets contain outlier CoTs with extremely long text lengths. Context fields were only filled for the PubMedQA and QED datasets, with mean context lengths of 116 and 56 tokens, respectively.

Figure 1: Distribution of question (a) and reference CoT (b) field lengths.

Figure 2: n-gram overlap in questions and reference CoTs. Overlap is measured by mutual n-gram overlap using n=3; values <0.01 are omitted.

Fig. 3 demonstrates how to load a dataset, randomly sample from the pre-populated data in the dataset, call an external LLM API to generate novel CoTs and answers, automatically evaluate the accuracy of generated answers, and finally save all generated data to a JSON file. Fig. 4 depicts an excerpt of the resulting JSON file.

from cot import Collection

# Load a dataset
collection_worldtree = Collection(["worldtree"])

# Randomly sample 10 rows of train split
collection_worldtree_10 = collection_worldtree.select(split="train", number_samples=10)
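The remaining steps of the workflow might look as follows (a hypothetical continuation; exact method names and arguments may differ, see the GitHub repository for the authoritative API):

# Hypothetical continuation (method names and arguments are assumptions):

# Generate novel CoTs and answers by calling an external LLM API
collection_worldtree_10.generate(config={"api_service": "openai", "engine": "text-davinci-002"})

# Automatically evaluate the accuracy of the generated answers
print(collection_worldtree_10.evaluate())

# Save all generated data to a JSON file
collection_worldtree_10.dump("worldtree_10.json")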

Figure 3: Demonstration of the ThoughtSource API. Basic functionalities of the data loader, generator and evaluator modules are demonstrated.

Figure 4: An excerpt of data generated by running the example code. Data for a single question from WorldTree V2 are shown, including the human-authored reference CoT, the gold-standard answer, an AI-generated CoT and extracted answer, as well as evaluation results. Some fields were omitted for legibility.

Figure 5: An excerpt of the collection of prompt fragments. These fragments can be used to build prompts for interacting with LLMs, allowing for empirical testing of how different prompts affect model performance.

We provide two web-based interfaces for exploring and annotating ThoughtSource data, the Dataset Viewer and the Annotator. The Dataset Viewer is a simple interface for exploring dataset contents. The Annotator (Fig. 6) allows users to upload specific subsets of a dataset, provides convenience functions for highlighting similarities between different generated CoTs and the correctness of generated answers, and allows individual CoTs to be annotated interactively. The Annotator facilitates identifying strengths and weaknesses of different CoTs. Annotations can be used for downstream model evaluation and for further improving the capabilities of AI models through fine-tuning or reinforcement learning.

Figure 6: The ThoughtSource Annotator. The web-based interface allows for convenient inspection and annotation of reasoning chains and answers. Text that is similar between CoTs can be automatically highlighted based on an easily adjustable similarity threshold, facilitating a better understanding of the similarities and differences between reasoning chains.

Table 1: Integrated datasets. For some core datasets, additional datasets were used as sources for additional CoTs.

Data records
The suggested method for accessing the datasets is programmatic access through our data loader libraries. A comprehensive guide on how to achieve this is provided in the project's GitHub repository. Table 3 shows the example counts, CoT counts and answer types of each dataset. The majority of datasets in the current collection are of the multiple-choice answer type. The medical dataset MedMCQA is the largest among all datasets.
Liévin et al.: CoTs were generated for MedQA, MedMCQA and PubMedQA with the AI systems text-davinci-002 3 and code-davinci-002 38 (described in detail by co-authors Liévin et al. in a separate manuscript 25).
Wei et al. and Kojima et al.: CoTs for CommonsenseQA and StrategyQA were integrated from Cohere command-xlarge-nightly (https://docs.cohere.ai/).
Since current LLMs are still prone to errors, it should be noted that AI-generated CoTs may contain faulty reasoning.

Table 3: Statistics and answer types for all datasets. Note that generated CoTs are not available for all examples, and multiple CoTs may have been generated for any given example.

Table 6: Fields of the 'annotation' object. Fields include key (string), which specifies the label of the annotation, and value (string), which specifies the value of the annotation.
Example prompt fragment: "If you do not have a good answer, write 'I do not have a good answer'."