Introduction

Much of the success of applications of artificial intelligence (AI) to a broad range of scientific problems1,2 has been due to the availability of well-documented, high-quality datasets3; open source, state-of-the-art neural network models4,5; highly efficient and parallelizable numerical optimization methods6; and the advent of innovative hardware architectures7.

Across science and engineering disciplines, the rate of adoption of AI and modern computing methods has varied2. Throughout the process of harnessing AI and advanced computing, researchers have realized that the lack of an agreed-upon set of best practices to produce, collect, and curate datasets has limited the ability to combine disparate datasets that, when analyzed with AI, may reveal new correlations or patterns8,9.

From 2014 to 2016, a set of data principles, or best practices, based on findability, accessibility, interoperability, and reusability (FAIR) was defined so that scientific datasets could be readily reused by both humans and machines. The FAIR principles can be applied to address these limitations and increase the potential of AI for discovery in science and engineering. Using high energy physics (HEP) as an example, this article provides a domain-agnostic, step-by-step set of checks to guide researchers through the process of making a dataset FAIR (“FAIRification”).

In HEP, there is a long history of the application of machine learning (ML) techniques to find small signals in the presence of large backgrounds. The observation of the Higgs boson at the CERN Large Hadron Collider (LHC) in 201210,11 was the result of the extensive use of ML algorithms based on boosted decision trees. Since then, as ML techniques have developed, their use in HEP has become ubiquitous. However, these developments have been largely the result of physicists adopting AI tools developed outside of their field of research.

The authors of this paper are members of the FAIR4HEP collaboration, which has representation from the AI community and two of the large LHC collaborations, ATLAS and CMS. We are collaborating to prepare datasets from HEP experiments that meet the FAIR data principles12. There are several major impediments to this strategy, including the lack of jargon-free documentation, the difficulty of accessing and the poor structure of the datasets, and the lack of clear metrics with which to benchmark and compare AI models. A consequence of the FAIR data principles is that they promote the use of open datasets, which in turn supports collaboration between practitioners of different disciplines (in this case, high-energy physicists and AI researchers) who have overlapping interests in particular datasets. We note that these impediments are common to many disciplines.

The FAIR guiding principles12, published in 2016, provide guidelines to improve the “FAIRness” of digital assets, such as datasets, code, and research objects. While the principles are valuable for science and engineering, they do not include exemplar metrics13 or define ways to measure how well a digital asset meets them.

In HEP, there are currently several services for creating, indexing, and sharing open, public datasets. The CERN Open Data Portal provides access to data and resources from the four major LHC collaborations, ALICE, ATLAS, CMS, and LHCb, for both education and research. Previous data releases have already yielded publications using LHC data authored by external researchers unaffiliated with an LHC collaboration14,15,16,17,18,19. However, despite being a repository of LHC open data, it does not allow general members of the HEP research community to upload their own datasets. Zenodo, another platform, was launched in May 2013 as part of the OpenAIRE project, in partnership with CERN, and serves as a widely used, catch-all repository for European Commission-funded research. It allows the community at large to upload data, software, and other artefacts in support of publications, as well as material associated with conferences, projects, or institutions. Citation information is also passed to DataCite and other scholarly aggregators. Zenodo has been used to host several high-profile public HEP datasets, including the top quark tagging reference dataset20, the LHC Olympics 2020 Anomaly Detection Challenge dataset21, the hls4ml jet substructure dataset, and the Anomaly Detection Data Challenge 2021 dataset22. Other services, like Kaggle and CodaLab, have been used to host HEP challenges such as the TrackML accuracy phase23 and TrackML throughput phase24. The Durham High-Energy Physics Database (HEPData)25 is an open-access repository established for sharing scattering data from experimental particle physics. It mainly comprises the data points from plots and tables related to several thousand publications, including those from the LHC.

Despite the widespread availability of public datasets in HEP, these services, and the datasets they host, do not fully follow the FAIR principles. In particular, how to interpret the FAIR principles in the context of the large datasets and the specific computing infrastructure needs of HEP is not clear. For instance, the CERN Open Data Portal hosts datasets with sizes approaching 100 TB26, requiring special versions of software (provided through a virtual machine image) to read and analyze the data. Given these stringent computational, storage, and domain-knowledge requirements, such datasets are not readily accessible to non-experts or to researchers lacking the necessary resources. To explore how to address these difficulties, we present an analysis of the FAIRness of one of these datasets, the CMS \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) dataset.

This simulated collider dataset contains a selection of proton-proton interactions (events) in which a Higgs boson is produced and decays to two bottom quarks \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) (signal events), as well as background events composed solely of “jets” of particles produced through the strong interaction, referred to as quantum chromodynamics (QCD) multijet events. This dataset was released in the CERN Open Data Portal. By providing the details of how we evaluated the FAIRness of this dataset and the steps taken to meet the FAIR data principles, we aim to help researchers in other fields create FAIR datasets in a similar manner. To ensure the reliability of our results, we have conducted a similar study using the ARDC FAIR self-assessment tool. We have found that the steps we have followed and the ARDC assessment tool provide consistent results.

In the following sections we describe how the FAIRness of this dataset is evaluated and present the results as a set of checks. We also describe methods to improve FAIRness, provide a detailed data description, and discuss how we interpret FAIR principles for HEP.

Results

We have assessed the FAIRness of the target dataset, described in the previous section and in more detail below, using the related FAIR metrics13. The results of this analysis are summarized in Tables 1 and 2. The following subsections summarize the results for each principle and discuss steps that were, or will be, taken to increase the FAIRness of the dataset. We also highlight the difficulties inherent in interpreting and applying these principles to HEP datasets, due to their unique properties of size, complexity, data format, and required domain knowledge.

Table 1 Findable and Accessible principle assessment checks for the CMS H(\(b\bar{b}\)) Open Dataset.
Table 2 Interoperable and Reusable principle assessment checks for the CMS H(\(b\bar{b}\)) Open Dataset.

Findable

The findable principle requires that metadata and data be easy to find for both humans and machines. For this specific dataset: 1) both data and metadata are registered with globally unique and persistent identifiers; 2) the association between the metadata and the dataset itself is explicitly described in the metadata; and 3) the dataset is registered as a searchable resource and can be found with a commonly used search engine. However, the metadata fields are fairly sparse and lack information. Enriching the metadata with additional fields, such as references that cite this dataset or links to related or derived datasets, would make the data more readily discoverable.
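As a purely illustrative sketch of the kind of enrichment suggested above, the record below expresses a few metadata fields as a Python dictionary; the field names are loosely modeled on common scholarly metadata schemas, and the identifiers are placeholders, not the actual CERN Open Data Portal schema or real DOIs.

```python
# Hypothetical, minimal metadata record illustrating the enriched fields
# suggested above. Field names are loosely modeled on common scholarly
# metadata schemas; identifiers are placeholders, not real DOIs.
metadata = {
    "identifier": {"type": "DOI", "value": "10.0000/placeholder-dataset-doi"},
    "title": "CMS H(bb) Open Dataset",
    "publisher": "CERN Open Data Portal",
    "subjects": ["high energy physics", "Higgs boson", "jet tagging"],
    # Fields that would improve findability if added:
    "relatedIdentifiers": [
        {"relationType": "IsDerivedFrom", "identifier": "10.0000/placeholder-parent-doi"},
        {"relationType": "IsCitedBy", "identifier": "10.0000/placeholder-paper-doi"},
    ],
}
print(metadata["relatedIdentifiers"])
```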

Accessible

To meet the FAIR accessible principle, data must be kept in a storage facility where they can be easily accessed or downloaded, with well-defined license and access conditions; these should be open access whenever possible, either at the level of the metadata or at the level of the actual data content.

The CMS \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) dataset is retrievable using standard HTTP communication protocols, is open access, and is released under the Creative Commons public domain dedication license. Since the DOI record includes formal metadata, the dataset satisfies the metadata longevity requirement.
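In practice, “retrievable using standard HTTP communication protocols” means a file can be fetched with nothing more than a plain HTTP(S) client. The minimal sketch below uses Python’s standard library, with a placeholder URL that would need to be replaced by an actual file link from the dataset’s record.

```python
import urllib.request

# Placeholder URL; replace with an actual file link from the dataset record.
url = "https://example.org/path/to/ntuple_file.root"
local_path = "ntuple_file.root"

# A plain HTTP(S) GET is sufficient for open-access records; no special
# client or authentication is required.
urllib.request.urlretrieve(url, local_path)
print(f"downloaded {local_path}")
```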

Interoperable

The interoperable principle requires that the data can be readily combined with other datasets by humans as well as by computer systems. For this dataset, (meta)data are represented using a formal and broadly applicable representation language. To improve the interoperability, the data descriptions were rewritten to be human-readable, removing jargon to make them accessible not only to domain experts, but also to non-HEP researchers. The (meta)data use FAIR vocabularies defined both for general purposes and for the HEP domain. Although not all terms are findable in FAIR vocabularies, those that are not are well-defined and referenced. Lastly, the description of this dataset provides references to other datasets from which it is derived. However, a more extensive set of references elaborating on the paper that describes this dataset, together with more information about the methods used to derive it, could be added to aid comprehension of the problem the dataset is intended to address.

Reusable

Reusability requires the data to be readily usable for future research and amenable to further processing with different computational methods. We found that the metadata and data of this dataset are well-described with accurate and relevant attributes. Thus, we anticipate that the dataset will be reusable and can be integrated with additional data in future studies.

Methods

In this section we describe our approach to evaluating the FAIRness of the CMS \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) dataset, and provide a human-readable description of the HEP dataset contents and its overall structure. These two complementary aspects of the dataset are critical elements in any pursuit of data FAIRification.

Dataset FAIRification

We have created a set of ready-to-use, domain-agnostic checks to facilitate the evaluation of how well a dataset meets the FAIR guiding principles12, and applied them to the \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) dataset. These checks provide researchers with a tool that can be used to assess the FAIRness of scientific datasets, and thus will streamline the use of such datasets for AI-driven analyses.

We have used the ARDC FAIR self-assessment tool, developed by other researchers in the FAIR community, to validate our findings. We have also incorporated human-in-the-loop expertise in this process in the form of feedback from FAIR experts, who independently validated our results.

Dataset description

The CMS \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) Open Dataset consists of two data samples that have been a critical part of the understanding of physical phenomena associated with the Higgs boson. The Higgs boson, first observed at the LHC in 201210,11, is an elementary particle related to the Higgs mechanism for electroweak symmetry breaking, which is responsible for generating the masses of the elementary particles.

One consequence of the Higgs mechanism is that the Higgs boson, which has a lifetime of only ≈10⁻²² seconds, couples to other particles in proportion to their mass and therefore decays preferentially to elementary particles with comparatively higher masses. The \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) decay process is particularly important because the b quark is the most massive quark to which the Higgs boson can decay. By measuring precisely the rate of this decay process, the physics of the coupling between the Higgs boson and ordinary matter can be tested. Any significant deviations from the predicted values would be an indication of physics beyond the standard model of particle physics.

When a Higgs boson decays to b quarks, the quarks, which cannot exist freely in nature, are detected as clusters of particles moving away from the interaction vertex (jets) and are recognized by a secondary decay vertex, from a particle containing a b quark, a short distance from the interaction point. Collisions, or interactions between protons in the two circulating beams (events), occur at a rate of about 1 GHz, while the rate of production of Higgs bosons is only about 0.001 Hz, roughly one every 1000 seconds. The challenge of identifying Higgs bosons decaying to \({\rm{b}}\bar{{\rm{b}}}\) is to find them amid the much larger number of collisions (background) in which a Higgs boson is not produced. In these background events, typically referred to as quantum chromodynamics (QCD) multijet events, a large number of particles are produced, which may include jets from b quarks, and can combine to resemble \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) events, which are the “signal” events.

To identify Higgs boson decays and separate them from the much larger QCD background, we use several key reconstructed components of proton-proton collisions. In particular, we reconstruct jets and analyze their characteristics, which include tracks, secondary vertices (SVs), and substructure features. We also employ a particle-flow (PF) algorithm27 to provide a comprehensive list of final-state particles that are identified and reconstructed via a combination of information from multiple detector subsystems.

The following defines these elements:

  • Jets are sprays of elementary particles in a cone-shaped pattern that radiate out from the collision vertex. They may be characterized by their substructure, including features like the jet mass, charge, and shape28. In total, the dataset contains 64 reconstructed jet features. These features are not necessarily independent from one another, and they may be derived from lower-level features related to the tracks, PF candidates, and secondary vertices.

  • Tracks are the reconstructed helical paths of charged particles as they move away from the collision vertex in the magnetic field of the detector. Each charged particle leaves a characteristic set of hits in the CMS tracking detector, which are used to reconstruct the track. In total, there are 45 track features.

  • Secondary vertices are collections of tracks that originate from a particle decay away from the collision vertex. Secondary vertices are an interesting set of candidate features for discriminating between different classes of jets, because they are dominant signatures of bottom-quark decays. In total, there are 18 SV features.

  • Particle-flow candidates are formed by combining tracks and clusters from other detectors outside of the tracking detector. The PF algorithm27 is used to provide a complete event description through the generation of a comprehensive list of the particles produced in the collision. For each PF candidate, there are 24 features.

This dataset consists of particle jets extracted from simulated proton-proton collision events generated with a center-of-mass energy of 13 TeV. Each element of the dataset corresponds to a single jet and contains information ranging from jet-level to track-level features (full details on the dataset are given below).

The open simulation29 provides the output of the default CMS reconstruction workflow. In particular, particle candidates are reconstructed using the particle-flow (PF) algorithm27. Particles from additional, nearly simultaneous interactions (pileup), which leave extra hits in the detector, are removed with a dedicated algorithm30. Jets are clustered from the remaining reconstructed particles31,32 with a jet-size parameter R = 0.8 (AK8 jets). The standard CMS jet energy corrections are applied to the jets. In order to remove soft, wide-angle radiation from the jet, the soft-drop (SD) algorithm33,34 is applied, with angular exponent β = 0, soft cutoff threshold \({z}_{{\rm{cut}}} < 0.1\), and characteristic radius \({R}_{0}=0.8\)35. The SD mass (\({m}_{{\rm{SD}}}\)) is then computed from the four-momenta of the remaining constituents.
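For reference, the grooming condition in which these parameters appear (following the original soft-drop formulation33,34) requires the softer of two subjets to carry a sufficient momentum fraction, \(\min ({p}_{{\rm{T}},1},{p}_{{\rm{T}},2})/({p}_{{\rm{T}},1}+{p}_{{\rm{T}},2}) > {z}_{{\rm{cut}}}{(\Delta {R}_{12}/{R}_{0})}^{\beta }\), where \({p}_{{\rm{T}},1}\) and \({p}_{{\rm{T}},2}\) are the transverse momenta of the two subjets under consideration and \(\Delta {R}_{12}\) is their angular separation; with β = 0, the right-hand side reduces to the constant threshold \({z}_{{\rm{cut}}}\).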

The dataset is reduced by requiring the AK8 jets to have \(300 < {p}_{{\rm{T}}} < 2400\) GeV, |η| < 2.4, and \(40 < {m}_{{\rm{SD}}} < 200\) GeV. After this reduction, the dataset consists of 3.9 million \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) jets and 1.9 million QCD jets. Charged particles are required to have \({p}_{{\rm{T}}} > 0.95\) GeV, and reconstructed secondary vertices (SVs) are associated with the AK8 jet using \(\Delta R=\sqrt{\Delta {\phi }^{2}+\Delta {\eta }^{2}} < 0.8\). The dataset is divided into blocks of features referring to different objects: tracks, secondary vertices, and particle candidates. See the CERN Open Data Portal for a complete list of features.
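As a rough illustration of how such a selection could be applied, the sketch below reproduces the jet-level requirements and the ΔR association with NumPy; the array names (fj_pt, fj_eta, fj_sdmass) are assumptions for illustration and may not match the exact branch names in the released files.

```python
import numpy as np

# Assumed, illustrative arrays of per-jet quantities; the actual branch
# names in the released files may differ.
fj_pt = np.array([350.0, 1200.0, 2500.0])   # jet transverse momentum [GeV]
fj_eta = np.array([0.5, -2.1, 1.0])         # jet pseudorapidity
fj_sdmass = np.array([125.0, 30.0, 90.0])   # soft-drop mass [GeV]

# Jet-level selection quoted in the text:
# 300 < pT < 2400 GeV, |eta| < 2.4, 40 < mSD < 200 GeV
mask = (
    (fj_pt > 300.0) & (fj_pt < 2400.0)
    & (np.abs(fj_eta) < 2.4)
    & (fj_sdmass > 40.0) & (fj_sdmass < 200.0)
)
print(mask)  # [ True False False]


def delta_r(phi1, eta1, phi2, eta2):
    """Angular distance used to associate SVs with the jet (Delta R < 0.8)."""
    dphi = np.mod(phi1 - phi2 + np.pi, 2 * np.pi) - np.pi  # wrap into [-pi, pi)
    deta = eta1 - eta2
    return np.sqrt(dphi**2 + deta**2)
```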

A typical use case for this dataset is the development of a machine-learning classifier to distinguish the \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) signal from the QCD background jets, which, in ML terms, is a binary classification task. However, it is often useful to further classify the QCD jets, so that the task becomes a multi-class classification with the following six jet classes: H_bb, QCD_bb, QCD_cc, QCD_b, QCD_c, and QCD_others. The labeling is performed sequentially. If a ‘generator-level’ Higgs boson is geometrically matched to the AK8 jet (ΔR < 0.8) and the two bottom quark decay products are also matched to the jet, then it is labeled as H_bb. If instead only two bottom (charm) quarks are found, the jet is labeled as QCD_bb (QCD_cc). If only a single bottom (charm) quark is found, it is labeled as QCD_b (QCD_c). Finally, if none of the above conditions are met, it is labeled as QCD_others. The distribution of labels is shown in Fig. 1. The large class imbalance is a common feature of classification problems in high energy physics: background jets occur at much larger rates than signal jets.
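A minimal sketch of this sequential labeling logic is shown below; the inputs (whether a generator-level Higgs boson is matched, and how many bottom or charm quarks are matched to the jet within ΔR < 0.8) are assumed to have been computed beforehand, and the function is illustrative rather than the exact code used to produce the dataset.

```python
def jet_label(higgs_matched, n_b_matched, n_c_matched):
    """Illustrative sequential labeling for the six jet classes described above.

    higgs_matched: True if a generator-level Higgs boson and both of its
        b-quark decay products are matched to the AK8 jet within Delta R < 0.8.
    n_b_matched, n_c_matched: number of generator-level bottom / charm quarks
        matched to the jet.
    """
    if higgs_matched:
        return "H_bb"
    if n_b_matched >= 2:
        return "QCD_bb"
    if n_c_matched >= 2:
        return "QCD_cc"
    if n_b_matched == 1:
        return "QCD_b"
    if n_c_matched == 1:
        return "QCD_c"
    return "QCD_others"


print(jet_label(False, 1, 0))  # QCD_b
```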

Fig. 1 The distribution of labels is shown for a representative file in the training dataset.

Specific signatures of b quark decays can be used in an ML algorithm to differentiate between \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) and QCD jets. For instance, one of the distinct signatures of b quarks is their long lifetime, which in a high energy collision translates to a particle that decays with a displacement with respect to the collision vertex. The model can learn this information to improve the accuracy of the inference. An illustration of some of the key features that can be used for \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) jet tagging is shown in Fig. 2. The distributions of some salient jet features are shown in Fig. 3.

Fig. 2 Illustration of a \({\rm{H}}({\rm{b}}\bar{{\rm{b}}})\) jet with two secondary vertices (SVs) from the decay of two b hadrons resulting in charged-particle tracks (including a low-energy, or soft, lepton) that are displaced with respect to the primary collision vertex (PV), and hence with a large impact parameter (IP) value.

Fig. 3 The distributions of some salient jet features: (a) the soft-drop jet mass; (b) number of particle candidates; (c) number of secondary vertices; and (d) number of tracks, are shown for one file in the training dataset.

Many different deep learning architectures have been developed and studied for the task of jet classification, such as interaction networks (INs)36, dynamic graph convolutional neural networks37, and Lorentz-group equivariant networks38. The first was applied to this dataset39 and compared with another ML model, the deep double-b (DDB) tagger created by the CMS Collaboration40, which uses a smaller subset of the input features. In addition to jet classification40,41,42,43, a further challenge within this dataset is a regression task, whereby one attempts to reconstruct the true energy of the Higgs boson. To perform this task, a regression loss needs to be constructed targeting the true Higgs boson energy. This motivates the exploration of physics-motivated loss functions, such as the earth (or energy) mover’s distance (EMD)18.
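As a purely illustrative starting point for such studies, the sketch below defines a small fully connected PyTorch classifier over a fixed-length vector of jet-level features with the six classes listed earlier; it is not the interaction network or DDB tagger referenced above, and the input dimension of 64 simply mirrors the number of jet-level features quoted in the dataset description. The class weights are placeholders, shown only as one simple way to address the class imbalance noted above.

```python
import torch
import torch.nn as nn

NUM_JET_FEATURES = 64   # jet-level features quoted in the dataset description
NUM_CLASSES = 6         # H_bb, QCD_bb, QCD_cc, QCD_b, QCD_c, QCD_others

# Minimal fully connected classifier; illustrative only, not the IN or DDB
# models referenced in the text.
model = nn.Sequential(
    nn.Linear(NUM_JET_FEATURES, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_CLASSES),
)

# Placeholder class weights as one simple way to mitigate class imbalance.
class_weights = torch.tensor([10.0, 5.0, 5.0, 2.0, 2.0, 1.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# One dummy training step on random data, to show the expected shapes.
features = torch.randn(32, NUM_JET_FEATURES)   # batch of 32 jets
labels = torch.randint(0, NUM_CLASSES, (32,))  # integer class labels
loss = loss_fn(model(features), labels)
loss.backward()
print(float(loss))
```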

Dataset structure

Particle physics uses a variety of data formats (and analysis ecosystems), including the ROOT library44. ROOT is a framework for data processing, created at CERN, that is widely used by the high-energy physics community. A ROOT file is a compressed binary file in which objects of any type can be saved. ROOT includes built-in Python bindings, known as PyROOT. Recently, an additional library called uproot45 has been developed that allows Python users to perform ROOT I/O directly. Unlike the standard C++ ROOT implementation, uproot is only an I/O library, primarily intended to stream data into machine learning libraries in Python. It can also produce jagged (awkward) arrays46.

Trees are a ROOT data structure that stores tables of information. Trees are composed of branches, which are the columns of the table. In this dataset, each row represents a jet. Some branches contain only a single floating-point number per entry (jet), while others contain a vector of floating-point numbers whose length varies from entry to entry; that is, the former store one value per jet (or event) and the latter a variable number of values per jet.
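The sketch below shows how a flat branch and a jagged branch could be read with uproot; the file name, tree path, and branch names are illustrative assumptions and should be checked against the actual files and their documentation.

```python
import uproot
import awkward as ak

# File, tree, and branch names are illustrative assumptions; check the
# dataset documentation for the real names.
with uproot.open("ntuple_file.root") as f:
    tree = f["deepntuplizer/tree"]        # assumed tree path
    print(tree.num_entries)               # number of jets (rows)

    # A flat branch: one floating-point number per jet.
    fj_pt = tree["fj_pt"].array(library="np")

    # A jagged branch: a variable-length list of numbers per jet,
    # returned as an awkward array.
    track_pt = tree["track_pt"].array(library="ak")
    print(ak.num(track_pt))               # number of tracks per jet
```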

In addition to the ROOT format, this dataset is also released in HDF5 format. The HDF5 files contain a separate array for each output variable, with information for up to 100 particle candidates, 60 tracks, and 5 secondary vertices per jet stored in zero-padded arrays.
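The sketch below illustrates how such zero-padded arrays could be read with h5py; the file name and dataset keys are assumptions, and the shapes in the comments simply follow the padding limits quoted above.

```python
import h5py

# File name and dataset keys are illustrative assumptions; the padding limits
# (100 particle candidates, 60 tracks, 5 secondary vertices per jet) follow
# the description above.
with h5py.File("HBB_features.h5", "r") as f:
    print(list(f.keys()))        # inspect the available arrays

    # Hypothetical zero-padded array, one row per jet:
    #   pfcand_pt: shape (n_jets, 100)
    #   track_pt:  shape (n_jets, 60)
    #   sv_pt:     shape (n_jets, 5)
    pfcand_pt = f["pfcand_pt"][:]
    print(pfcand_pt.shape)
```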

Discussion

The motivation of our work to adopt FAIR principles for the production, collection, and curation of scientific datasets is to streamline and facilitate their use in the design, training, validation, and testing of AI models. This approach is particularly relevant for ongoing efforts that aim to automate AI-driven inference on massive scientific datasets through the convergence of AI and modern computing environments47,48. AI models are often trained with (abundant and easy to produce) synthetic data, large-scale simulations, and first-principles mathematical models, although these may only provide an incomplete description of complex and highly nonlinear real-world phenomena. Thus, when AI models are used to extract new knowledge from realistic, experimental datasets, their predictions are commonly off-target. However, once AI models are calibrated against experimental data, their predictions become increasingly accurate49. Given that this trend has been reported across many disciplines, it is useful to streamline the development of AI models with real, experimental datasets. This can be accomplished if synthetic and experimental datasets are produced, collected, and curated following a common set of standards or, in this case, the FAIR guiding principles.

Another motivation to understand and adopt FAIR principles to create AI-ready datasets is that some disciplines are subject to restrictive regulations that prevent data fusion and centralized analyses. This is a common issue for multi-modal biomedical datasets that are governed by federal regulations, consortium-specific data usage agreements, and institutional review boards. These restrictions have catalyzed the development of federated learning approaches and privacy-preserving methods, as well as the use of secure data enclaves. It is clear that developing AI models by harnessing disparate data enclaves will only be feasible if datasets adhere to a common set of rules, or FAIR principles.

In this study we have shown that open source datasets may not be FAIR or AI-ready. The domain-agnostic checks that we provide in this article give researchers a starting point and guidance to FAIRify their datasets. The FAIR principles are comprehensive and can be used by AI and domain experts to enable the reusability of massive scientific datasets, which in turn supports the creation of next-generation AI models and a digitally accurate, interpretable, and reproducible description of natural phenomena.

Researchers who create FAIR datasets should keep in mind that this work aims to automate end-to-end AI studies, from data collection to inference. This will only be accomplished if datasets contain all the information needed to interpret, verify, and reproduce new findings. To accomplish this goal, we recommend that datasets be stored using formats that are widely available in modern computing environments, such as HDF5 or ROOT. Using such data formats simplifies the handling of large datasets, allows experimental and synthetic datasets to be treated on the same footing, and makes more datasets accessible to widely used APIs for AI research, e.g., TensorFlow or PyTorch. In future work, we plan to introduce tools to automate the evaluation of FAIR metrics for datasets and to gain a better understanding of the relationship between data and AI models.