Background & Summary

The most recent generation of large language models (LLMs) has produced impressive results across a wide range of tasks. Examples of such models include T0 (ref. 1), GPT-3 (ref. 2), InstructGPT (ref. 3) and GPT-4 (ref. 4). These models have demonstrated a remarkable ability to generate text that is both realistic and coherent, as well as good performance on a broad spectrum of tasks, despite not being explicitly trained on them3.

However, despite this ability, LLMs are also limited in several ways. They often fail to produce accurate predictions because they struggle with complex reasoning, such as solving mathematical problems or answering questions that require multi-hop reasoning. Furthermore, they tend to be black boxes, making it difficult to understand how and why predictions are generated. These limitations severely restrict the application domains of LLMs and have the potential to cause harm, as a lack of explainability and robustness can lead to critical failures and biases when these models are deployed in practice.

One recently proposed method for enabling complex reasoning and generating explanations with LLMs is to force models to explicitly verbalize reasoning steps as natural language, a technique known as chain-of-thought prompting5,6. This method improved performance on a variety of tasks and sparked the active development of further refinements7, such as decomposing problems and structuring reasoning (e.g., least-to-most prompting8, ReAct9, self-ask10, maieutic prompting11, successive prompting12) and/or extending LLM capabilities by leveraging external services for tasks like information retrieval (e.g., self-ask10, IRCoT13, DSP14). The terminology surrounding these rapidly evolving techniques is not settled, hence in this document, we refer to all approaches that result in a linear sequence of reasoning steps as ‘chain-of-thought’ (CoT).

Meta-datasets (datasets of datasets) that are easily accessible and standardized have proven useful for training and evaluating versatile LLMs. Examples include SuperGLUE15 for general-domain language understanding tasks, BigBIO16 and BLURB17 for biomedical tasks, and Pile18 and ROOTS19 as text corpora for LLM pre-training. Such datasets can be complemented by tools such as PromptSource, which was used to convert a large number of datasets into prompts suitable for training and interrogating LLMs and which facilitated training the highly performant T0 model (ref. 1).

Here we present ThoughtSource, a meta-dataset and software library for chain-of-thought reasoning in LLMs (https://github.com/OpenBioLink/ThoughtSource). The goals of ThoughtSource are to:

  • Facilitate qualitative understanding of CoTs generated by LLMs under various conditions (e.g., across tasks, models and prompts).

  • Enable empirical and quantitative evaluation.

  • Provide a library of diverse CoT training data for improving performance, robustness, explainability and value-alignment of future LLM-based AI systems.

Methods

We selected NLP benchmarks for question answering and natural language inference for which pre-existing data for constructing CoTs was available. For some of the datasets, one or multiple additional datasets were used as sources for additional CoTs, allowing for the comparison of different CoT generation methodologies. We created data loader scripts compatible with the Hugging Face datasets library20 for all datasets. Additionally, we collected metadata of attributes such as descriptions, websites and licenses. We contacted dataset providers and encouraged them to choose an open source/open data license if licensing information was unavailable or unclear.

We implemented two kinds of schemas: (1) source dataset schemas, which are unique to each dataset and provide data close to their original format; and (2) a standardized ThoughtSource schema, which maps all datasets into a common format. The ThoughtSource schema was created by extending the question answering schema of the BigBIO project16.
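As an illustration of the two kinds of schemas, the sketch below loads one dataset twice through the Hugging Face datasets library. The dataloader path and the config names are assumptions for illustration and may differ from the exact identifiers used in the repository.

```python
# Minimal sketch of loading one dataset in both schemas via the Hugging Face
# datasets library; the script path and config names are illustrative.
from datasets import load_dataset

# Source schema: fields stay close to the original dataset format
source = load_dataset("dataloaders/worldtree/worldtree.py", name="worldtree_source")

# ThoughtSource schema: the common format shared by all integrated datasets
standardized = load_dataset("dataloaders/worldtree/worldtree.py", name="worldtree_thoughtsource")

print(standardized["train"][0]["question"])
```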

We implemented tailored algorithms for converting each dataset because the collected datasets provide explanations in different ways, such as math expressions or structured graph-based explanations. Furthermore, we performed preprocessing such as capitalization and punctuation correction, and we reversed tokenization to recover standard formatted text from pre-tokenized datasets. This preprocessing was applied only to data in the ThoughtSource schema, while data in the source schemas were left in their original format. All code for running these conversions is available in our GitHub repository.
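The following minimal sketch illustrates the kind of light preprocessing described above (reversing whitespace tokenization, fixing punctuation spacing and capitalization); the actual conversion code in the repository is dataset-specific and more extensive.

```python
import re

# Illustrative sketch only; the repository implements dataset-specific rules.
def detokenize(text: str) -> str:
    """Reverse simple whitespace tokenization and fix basic style issues."""
    # Remove spaces before punctuation introduced by tokenization ("5 ." -> "5.")
    text = re.sub(r"\s+([.,;:?!%)])", r"\1", text)
    # Remove spaces after opening brackets
    text = re.sub(r"([(])\s+", r"\1", text)
    # Capitalize the first character and ensure a terminating period
    text = text.strip()
    if text:
        text = text[0].upper() + text[1:]
    if not text.endswith((".", "?", "!")):
        text += "."
    return text

print(detokenize("the moon orbits the earth every 27 days ."))
# -> "The moon orbits the earth every 27 days."
```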

We developed a suite of Python libraries and tools for generating novel CoTs and answers by calling LLM APIs, as well as tools for evaluating, comparing and annotating datasets. We built upon the LangChain library (https://github.com/hwchase17/langchain/) for interfacing with a wide variety of external LLM APIs.
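As a minimal sketch of this pattern, the snippet below uses the LangChain interface of that era to request a zero-shot CoT and an answer from an OpenAI completion model. The prompt wording and model choice are illustrative, and this is not the ThoughtSource generator module itself.

```python
# Illustrative sketch of calling an external LLM API through LangChain
# (langchain 0.0.x style); requires OPENAI_API_KEY in the environment.
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-002", temperature=0, max_tokens=256)

question = "Which form of energy is needed to melt ice?"
cot_prompt = f"Question: {question}\nAnswer: Let's think step by step."

reasoning = llm(cot_prompt)                                   # free-text chain of thought
answer = llm(cot_prompt + reasoning + "\nTherefore, the answer is")

print(reasoning.strip())
print(answer.strip())
```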

This first release of ThoughtSource integrates seven scientific/medical, three general-domain and five math word question answering datasets (Table 1). For every dataset except for PubmedQA and MedQA we provide ‘reference CoTs’. We created these reference CoTs by converting rationales provided by original datasets into reasoning chains. These rationales, depending on the dataset, were created by human experts or obtained from crowdsourcing. Furthermore, we added CoTs generated by state-of-the-art LLMs by importing them from previous work, as well as generating them de novo for this work (details below).

Table 1 Integrated datasets.

Scientific/medical question answering datasets

WorldTree V221 is one of the most detailed multi-hop science question answering datasets available. Finding the correct multiple-choice answer requires multi-hop inference combining between 1 and 16 facts (average: 6). It contains explanations created by experts in the form of multiple facts. We concatenated these facts and applied a set of rules to improve style and grammaticality, yielding reference CoTs that are close to natural language.
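A minimal sketch of this conversion step is shown below; the rule set used for the actual reference CoTs is more elaborate than the two rules illustrated here.

```python
# Minimal sketch of turning a list of expert-authored facts into a reference CoT.
def facts_to_cot(facts: list[str]) -> str:
    steps = []
    for fact in facts:
        fact = fact.strip()
        fact = fact[0].upper() + fact[1:]   # capitalize the first letter
        if not fact.endswith("."):
            fact += "."                     # ensure a terminating period
        steps.append(fact)
    return " ".join(steps)

facts = [
    "melting means changing from a solid into a liquid by adding heat energy",
    "ice is a kind of solid",
]
print(facts_to_cot(facts))
# Melting means changing from a solid into a liquid by adding heat energy. Ice is a kind of solid.
```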

EntailmentBank22 contains open-domain science exam questions and answers, along with systematic explanations that show how the correct answer is reached through a series of steps. These steps are organized into a tree structure, known as an entailment tree, which starts with known facts and progresses through intermediate conclusions until the final answer is reached. These entailment trees are also serialized into text-based proofs by traversing the trees. We applied a set of rules to improve style and grammaticality in these proofs to yield reference CoTs that are close to natural language.

OpenBookQA23 contains questions modeled after open-book exams of elementary-level science. They require multi-step reasoning, commonsense knowledge, and a diverse application of core science facts to find the correct answer. The dataset provides over 1,300 core science facts and a mapping to all of the questions. By design, questions in OpenBookQA are answered incorrectly by both retrieval-based and word co-occurrence algorithms. The dataset contains a single-fact explanation of the correct answer for each question, which we adopted to create reference CoTs.

MedQA24 is a free-form multiple-choice OpenQA dataset containing questions from medical board exams in the US (USMLE), Mainland China and Taiwan. We imported the English-language USMLE subset. We have also introduced a version of the dataset wherein the multiple-choice questions have been converted into open-ended questions25. Reference CoTs are not provided.

MedMCQA26 is a multiple-choice question answering dataset containing real-world medical entrance exam questions from the All India Institute of Medical Sciences (AIIMS PG) and National Eligibility cum Entrance Test (NEET PG). Answer rationales authored by human experts were integrated as reference CoTs.

PubmedQA27 is a question answering dataset containing biomedical questions extracted from PubMed abstracts that can be answered with yes/no/maybe answers. In addition to the short answer, each question comes with a longer answer, which can be used as reference CoT.

MMLU28 (Massive Multitask Language Understanding) is a compendium of 57 distinct question-and-answer tasks encompassing a wide range of topics. We have selected six subjects particularly related to medical science: anatomy, clinical knowledge, college biology, college medicine, medical genetics, and professional medicine. Reference CoTs are not provided.

General-domain question answering datasets

CommonsenseQA29 is a collection of multiple-choice questions that test a wide range of general knowledge. We created reference CoTs for the train and validation splits, derived from the crowd-sourced ECQA dataset³. We also added AI-generated reasoning chains created with few-shot5 and zero-shot6 prompting, which are available for the validation split.

StrategyQA30 is a question answering dataset that tests the ability to reason through open-domain questions and provide Yes/No answers. Each example includes a question, a decomposition of the question into reasoning steps, and evidence paragraphs from Wikipedia. The dataset was created through a crowdsourcing process designed to gather creative and diverse questions. Human-generated free-text reasoning chains are part of the train split of the original dataset and were used as reference CoTs; the accompanying Wikipedia paragraphs were not incorporated into our CoTs. We extended the StrategyQA dataset with AI-generated CoTs created through few-shot5 and zero-shot6 prompting, which are available for the train split.

QED31 is a collection of expert-annotated structured explanations for answers to questions, built upon a subset of the Google Natural Questions dataset. Given a question and a passage from Wikipedia, QED uses linguistic information to represent explanations as a series of interpretable steps, such as referential equality, sentencehood, and entailment. Structured reasoning chains by experts are provided for all examples. To create reference CoTs, we extracted the sentence that entails the answer; statements about referential equality in QED were converted to natural language and added as additional steps in the CoTs (e.g. “The noun phrase […] in the sentence and the noun phrase […] in the question refer to the same thing.”).

Math word problem datasets

Algebra Question Answering with Rationales (AQUA-RAT)32 is a large-scale multiple-choice dataset containing algebraic word problems. Each problem consists of a question with five possible answers and a rationale, a step-by-step natural language explanation of the solution. We used natural language explanations as reference CoTs.

Academia Sinica Diverse (ASDiv) math word problem (MWP) dataset33 aims to provide more diverse language patterns and problem types than previous datasets. It covers most of the math topics taught in elementary school. Each MWP is labeled with its grade level (for indicating difficulty), the needed math operation (e.g. division) and includes a short explanation of the solution. ASDiv contains explanations of answers in the form of nested math expressions using common operators such as addition, subtraction, division and multiplication. We generated reference CoTs by converting these math expressions into natural language explanation chains using a rule-based method.
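The sketch below illustrates the general idea of such a rule-based conversion, assuming expressions are represented as nested (operator, left, right) tuples; the actual ASDiv conversion handles more operators and formatting details.

```python
# Sketch of verbalizing a nested math expression as a chain of steps.
TEMPLATES = {
    "+": "Adding {l:g} and {r:g} gives {v:g}.",
    "-": "Subtracting {r:g} from {l:g} gives {v:g}.",
    "*": "Multiplying {l:g} by {r:g} gives {v:g}.",
    "/": "Dividing {l:g} by {r:g} gives {v:g}.",
}
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def expression_to_steps(expr):
    """Return (value, steps) for a nested (op, left, right) expression."""
    if not isinstance(expr, tuple):                 # leaf: a plain number
        return float(expr), []
    op, left, right = expr
    lval, lsteps = expression_to_steps(left)
    rval, rsteps = expression_to_steps(right)
    value = OPS[op](lval, rval)
    step = TEMPLATES[op].format(l=lval, r=rval, v=value)
    return value, lsteps + rsteps + [step]

value, steps = expression_to_steps(("*", ("-", 5, 2), 4))
print(" ".join(steps) + f" The answer is {value:g}.")
# Subtracting 2 from 5 gives 3. Multiplying 3 by 4 gives 12. The answer is 12.
```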

Grade School Math 8K (GSM8K)34 contains grade school math word problems. Despite their conceptual simplicity, these problems are more challenging than those in earlier datasets due to their linguistic diversity. The creators of GSM8K instructed crowd workers to write solutions to problems in free text format, which we used as reference CoTs in ThoughtSource, omitting any additional arithmetic specifications.

Math Word Problems (MAWPS)35 is an online platform that provides a collection of math word problems. The problems have simple one- or two-line explanations for their solutions. MAWPS includes datasets from various sources and offers tools for automatically creating datasets with specific characteristics, as well as the ability to tune lexical and template overlap. We converted explanatory math expressions to reference CoTs with an approach similar to the one used for ASDiv.

Simple Variations on Arithmetic Math Word Problems (SVAMP)36 was created by applying carefully chosen variations to examples from existing datasets, such as ASDiv and MAWPS. These variations make it difficult for language models to solve the problems using simple heuristics, and instead require a deeper understanding and reasoning ability. We converted math expressions to reference CoTs with an approach similar to the one used for ASDiv.

AI-generated CoTs

Liévin et al. CoTs were generated for MedQA, MedMCQA and PubmedQA with the AI systems text-davinci-002 (ref. 3) and code-davinci-002 (ref. 37), as described in detail by co-authors Liévin et al. in a separate manuscript38.

Wei et al. and Kojima et al. CoTs for CommonsenseQA and StrategyQA were integrated from previous external studies on few-shot5 and zero-shot6 prompting.

ThoughtSource-33 refers to a collection of 198 items, comprising 33 randomly selected items from each of six datasets: CommonsenseQA, MedQA (USMLE), MedMCQA, OpenBookQA, StrategyQA and WorldTree V2. For every item of this collection, we created 60 unique zero-shot CoTs by executing ten different prompting strategies39 with six models: OpenAI text-davinci-002 (ref. 3), OpenAI text-davinci-003 (ref. 3), OpenAI GPT-3.5-turbo, OpenAI GPT-4 (ref. 4), Flan-T5-XXL40 and Cohere command-xlarge-nightly (https://docs.cohere.ai/). Since current LLMs are still prone to errors, it should be noted that AI-generated CoTs may contain faulty reasoning.
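Schematically, these CoTs can be produced by crossing the prompting strategies with the models, as in the sketch below. The model identifiers are taken from the list above, the prompt collection shown is truncated, and generate_cot() is a hypothetical stand-in for an external LLM API call.

```python
import itertools

PROMPTS = {
    "kojima-01": "Answer: Let's think step by step.",
    # ... nine further prompting strategies in the curated collection
}
MODELS = ["text-davinci-002", "text-davinci-003", "gpt-3.5-turbo",
          "gpt-4", "flan-t5-xxl", "command-xlarge-nightly"]

def generate_cot(model: str, question: str, trigger: str) -> str:
    # Hypothetical stand-in for a call to an external LLM API (see the
    # LangChain sketch above); not part of the ThoughtSource library itself.
    raise NotImplementedError

def generate_all(sample: dict) -> list[dict]:
    cots = []
    for (prompt_name, trigger), model in itertools.product(PROMPTS.items(), MODELS):
        cots.append({
            "model": model,
            "cot_trigger": prompt_name,
            "cot": generate_cot(model, sample["question"], trigger),
        })
    return cots  # 10 prompting strategies x 6 models = 60 CoTs per item
```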

Data Records

The suggested way to access the datasets is programmatically through our dataloader libraries. A comprehensive guide on how to achieve this is provided in the project’s GitHub repository (https://github.com/OpenBioLink/ThoughtSource), and a snapshot of the code is available on Zenodo41. Additionally, a snapshot of the data available under an open license is provided on Zenodo42.

Table 2 shows the example counts, CoT counts and answer types of each dataset. The majority of datasets in the current collection are of the multiple choice answer type. The medical dataset MedMCQA is the largest among all datasets.

Table 2 Statistics and answer types for all datasets.

Dataset schema

Tables 3–6 provide descriptions and datatypes of the fields in the ThoughtSource schema. Running a task on a sample yields a generated CoT together with an answer to the question. Annotations can be added programmatically or through an annotator tool. A schematic example of the resulting nested structure is sketched below the table list.

Table 3 Fields of the ‘sample’ object.
Table 4 Fields of the ‘generated_cot’ object.
Table 5 Fields of the ‘answer’ object.
Table 6 Fields of the ‘annotation’ object.
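The following schematic example illustrates how these objects nest inside a single sample (cf. Fig. 4); the field names and values shown are illustrative rather than an exhaustive or exact listing of the schema.

```python
# Schematic, simplified example of the nested ThoughtSource schema.
sample = {
    "id": "worldtree_train_42",
    "question": "Which form of energy is needed to melt ice?",
    "choices": ["mechanical", "chemical", "heat", "electrical"],
    "context": "",
    "cot": ["Melting means changing from a solid into a liquid by adding heat energy."],
    "answer": ["heat"],                      # gold-standard answer
    "generated_cot": [{
        "model": "text-davinci-002",
        "cot": "Ice is a solid. Adding heat energy melts a solid into a liquid. ...",
        "answers": [{"answer": "heat", "correct": True}],
        "annotations": [{"author": "reviewer_1", "comment": "reasoning is sound"}],
    }],
}
```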

We analyzed the distribution of question and reference CoT field lengths (Fig. 1). MedQA has the longest median question length, while PubmedQA has the longest median CoT length. Several datasets contain outlier CoTs with extremely long texts. Context fields were only filled for the PubmedQA and QED datasets, with mean context lengths of 116 and 56 tokens, respectively.

Fig. 1 Distribution of question (a) and reference CoT (b) field lengths.

Technical Validation

The datasets were reviewed by three team members and issues were tracked on the issue tracker of the associated GitHub repository.

To characterize potential overlaps and relations between datasets, we calculated mutual n-gram overlap using n = 3 (Fig. 2). To quantify the overlap between two sets of n-grams, we use the Szymkiewicz–Simpson coefficient (overlap coefficient), which can be interpreted as the proportion of n-grams of the smaller dataset that are contained in the larger dataset:

$${\rm{overlap}}\left(X,Y\right)=\frac{\left|X\cap Y\right|}{{\rm{\min }}\left(\left|X\right|,\left|Y\right|\right)}$$
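A direct implementation of this measure is straightforward; the sketch below builds 3-gram sets from question texts and computes the overlap coefficient defined above.

```python
# Sketch of the overlap computation: 3-gram sets and the Szymkiewicz-Simpson coefficient.
def ngrams(texts: list[str], n: int = 3) -> set[tuple[str, ...]]:
    grams = set()
    for text in texts:
        tokens = text.lower().split()
        grams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

def overlap_coefficient(x: set, y: set) -> float:
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))

questions_a = ["Which form of energy is needed to melt ice?"]
questions_b = ["What form of energy is needed to melt ice into water?"]
print(round(overlap_coefficient(ngrams(questions_a), ngrams(questions_b)), 2))  # -> 0.71
```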
Fig. 2 n-gram overlap in questions and reference CoTs. Overlap is measured by mutual n-gram overlap using n = 3; values < 0.01 are omitted.

There is an overlap of 1.0 between the sets of questions in WorldTree v2 and EntailmentBank. The QA pairs in EntailmentBank were taken from the WorldTree v2 dataset22, so the questions in EntailmentBank form a subset of those in WorldTree v2.

Furthermore, there is significant overlap between the questions in ASDiv and SVAMP, and between those in ASDiv and MAWPS. ASDiv and SVAMP share questions because a subset of examples from ASDiv was used as seed examples for the creation of SVAMP. Questions in MAWPS and ASDiv were crawled from web resources, and their overlap may be due to examples being crawled from the same sources.

Besides overlaps in questions, we also identified overlaps in reference CoTs. WorldTree v2 provided an initial pool of atomic facts that the annotators could use to construct an explanation tree in EntailmentBank (in addition to creating their own facts). This explains the high overlap of n-grams of CoTs in WorldTree v2 and EntailmentBank. Similarly, a subset of WorldTree v2 facts was used for the creation of explanations in OpenBookQA.

Usage Notes

Python libraries for accessing and working with data can be downloaded from the Github repository and installed with the pip tool. Figure 3 demonstrates how to load a dataset, randomly sample from the pre-populated data in the dataset, call an external LLM API to generate novel CoTs and answers, automatically evaluate the accuracy of generated answers, and finally save all generated data to a JSON file. Figure 4 depicts an excerpt of the resulting JSON file.
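The sketch below outlines the same workflow in the spirit of Fig. 3 (load, sample, generate, evaluate, dump). The class and method names and configuration keys are illustrative of this pattern and may not match the exact signatures in the repository, which should be consulted for working code.

```python
# Illustrative sketch of the workflow shown in Fig. 3; names are approximations.
from cot import Collection

collection = Collection(["worldtree"])                                  # load one dataset
collection = collection.select(split="validation", number_samples=10)  # random subset

collection.generate(                          # call an external LLM API to create CoTs
    config={
        "api_service": "openai",
        "engine": "text-davinci-002",
        "cot_trigger_keys": ["kojima-01"],    # "Answer: Let's think step by step."
    }
)
print(collection.evaluate())                  # accuracy of the extracted answers
collection.dump("worldtree_10_generated.json")  # save all generated data to JSON
```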

Fig. 3 Demonstration of the ThoughtSource API. Basic functionalities of the data loader, generator and evaluator modules are demonstrated.

Fig. 4 An excerpt of data generated by running the example code. Data for a single question from WorldTree V2 are shown, including the human-authored reference CoT, the gold-standard answer, an AI-generated CoT and extracted answer, as well as evaluation results. Some fields were omitted for legibility.

In a zero-shot setup, specific text fragments can be used to prompt question answering and CoT reasoning in LLMs. ThoughtSource includes a curated list of text fragments that can be used to generate novel CoTs (Fig. 5). Where possible, we also mapped individual CoTs in pre-existing CoT datasets to the text fragments that were used in their creation.

Fig. 5 An excerpt of the collection of prompt fragments. These fragments can be used to build prompts for interacting with LLMs, allowing for empirical testing of how different prompts affect model performance.
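As an illustration, a zero-shot prompt can be assembled from such fragments as sketched below; the fragment texts and keys shown here are illustrative examples of the kind collected (the Kojima et al.6 trigger being the best-known one).

```python
# Illustrative prompt fragments and a simple prompt builder; not the curated collection itself.
INSTRUCTIONS = {
    "qa-01": "Answer the following question through step-by-step reasoning.",
}
COT_TRIGGERS = {
    "kojima-01": "Answer: Let's think step by step.",
}
ANSWER_EXTRACTIONS = {
    # used in a second call, after the CoT has been generated
    "kojima-A-C": "Therefore, among A through C, the answer is",
}

def build_prompt(question: str, choices: list[str]) -> str:
    options = " ".join(f"{chr(65 + i)}) {choice}" for i, choice in enumerate(choices))
    return (f"{INSTRUCTIONS['qa-01']}\n\n"
            f"{question} {options}\n"
            f"{COT_TRIGGERS['kojima-01']}")

print(build_prompt("Which form of energy is needed to melt ice?",
                   ["mechanical", "heat", "electrical"]))
```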

We provide two web-based interfaces for exploring and annotating ThoughtSource data, the Dataset Viewer and the Annotator. The Dataset Viewer is a simple interface for exploring dataset contents. The Annotator (Fig. 6) allows users to upload specific subsets of a dataset, provides convenience functions for highlighting similarities between different generated CoTs and the correctness of generated answers, and supports interactive annotation of individual CoTs. The Annotator facilitates identifying strengths and weaknesses of different CoTs. Annotations can be used for downstream model evaluation and for further improving the capabilities of AI models through fine-tuning or reinforcement learning.

Fig. 6 The ThoughtSource Annotator. The web-based interface allows for convenient inspection and annotation of reasoning chains and answers. Text that is similar between CoTs can be automatically highlighted based on an easily adjustable similarity threshold, facilitating a better understanding of similarities and differences between reasoning chains.
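The idea behind the similarity highlighting can be illustrated with a few lines of Python; the Annotator's actual implementation lives in the web interface, and difflib is used here purely as an assumption for the sketch.

```python
# Illustration only: flag sentence pairs from two CoTs whose similarity
# exceeds a threshold, mimicking the Annotator's highlighting behaviour.
from difflib import SequenceMatcher

def similar_sentence_pairs(cot_a: str, cot_b: str, threshold: float = 0.8):
    sentences_a = [s.strip() for s in cot_a.split(".") if s.strip()]
    sentences_b = [s.strip() for s in cot_b.split(".") if s.strip()]
    pairs = []
    for sa in sentences_a:
        for sb in sentences_b:
            ratio = SequenceMatcher(None, sa.lower(), sb.lower()).ratio()
            if ratio >= threshold:
                pairs.append((sa, sb, round(ratio, 2)))
    return pairs

print(similar_sentence_pairs(
    "Ice is a solid. Heat energy melts solids.",
    "Ice is a kind of solid. Electrical energy is not needed.",
    threshold=0.7,
))
```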

All tools and libraries, as well as more detailed demonstration notebooks, can be found on the project Github page.

We plan to add more datasets and generated CoTs to the ThoughtSource repository, and we welcome outside contributions. Novel CoTs for existing core datasets can be generated and shared through the ThoughtSource APIs and JSON files. Completely new datasets can also be added, as described in the Github repository’s contribution guide.