JARVIS-Leaderboard: A Large Scale Benchmark of Materials Design Methods

Lack of rigorous reproducibility and validation is a major hurdle for scientific development across many fields. Materials science in particular encompasses a variety of experimental and theoretical approaches that require careful benchmarking. Leaderboard efforts have been developed previously to mitigate these issues. However, a comprehensive comparison and benchmarking on an integrated platform spanning multiple data modalities, with both perfect and defective materials data, is still lacking. This work introduces JARVIS-Leaderboard, an open-source and community-driven platform that facilitates benchmarking and enhances reproducibility. The platform allows users to set up benchmarks with custom tasks and enables contributions in the form of dataset, code, and metadata submissions. We cover the following materials design categories: Artificial Intelligence (AI), Electronic Structure (ES), Force-fields (FF), Quantum Computation (QC), and Experiments (EXP). For AI, we cover several types of input data, including atomic structures, atomistic images, spectra, and text. For ES, we consider multiple ES approaches, software packages, pseudopotentials, materials, and properties, comparing results to experiment. For FF, we compare multiple approaches for material property predictions. For QC, we benchmark Hamiltonian simulations using various quantum algorithms and circuits. Finally, for experiments, we use an inter-laboratory approach to establish benchmarks. There are 1281 contributions to 274 benchmarks using 152 methods with more than 8 million datapoints, and the leaderboard is continuously expanding. The JARVIS-Leaderboard is available at: https://pages.nist.gov/jarvis_leaderboard


I. INTRODUCTION
The accelerated design and characterization of materials of technological interest has been a rapidly evolving area of research in the last few decades [1]. Materials design requires approaches spanning a variety of length and time scales [2]. For atomistic design, the methods employed may include computational approaches such as density functional theory, tight-binding, force-fields, and highly accurate approaches such as quantum Monte Carlo or quantum computations. A wide range of approaches are employed above the purely atomistic level, such as mesoscale and finite-element methods. Similarly, experimental characterization approaches include X-ray diffraction, vibroscopy, manometry, scanning electron microscopy, and magnetic susceptibility measurements.
Developing such metrology is a highly challenging task, even for one of these methods, let alone the entire galaxy of available methods. Projects and approaches such as the materials genome and FAIR initiatives [1,22] have resulted in several well-curated datasets and benchmarks. These, in turn, have led to several materials informatics applications [23][24][25][26]. Although electronic structure approaches such as density functional theory (DFT) tend to be more reproducible than other categories [16,27], a systematic effort must be made to validate these methods and estimate the error in predictions. Hence, it is highly desirable to have a large-scale benchmarking platform in the materials science field for reproducibility and method validation.
Massive progress in fields such as image recognition/classification (ImageNet [28]), protein structure prediction (AlphaFold [29]), and large language modeling (generative pre-trained transformers, GPT [30]) has been possible primarily because of well-defined benchmarks in the respective fields. With regard to AI methods for structure-to-property predictions [31], benchmarking efforts have enabled drastic improvements in the accuracy of predicted properties (i.e., moving away from descriptor-based predictions and including graph neural networks in the model architectures to improve accuracy).
For deterministic electronic structure methods such as DFT, extensive benchmarking of software and different DFT approximations (functionals, pseudopotentials, etc.) has led to increased reproducibility and precision in individual results and workflows [27,32]. Such benchmarks allow a wide community to solve problems collectively and systematically. In addition, since there already exist highly accurate models for specific tasks (i.e., energy prediction), more comprehensive evaluations of the models are required so that the performance ranking is not overfitted to one biased data source. We believe that such a universal and large-scale set of benchmarks for materials science will significantly benefit the scientific community.
The goal of this project is to provide a more comprehensive framework for materials benchmarking than previous works. In particular, most existing efforts: 1) lack the flexibility to readily incorporate new tasks or benchmarks, which is a limitation given the continuous discovery of new materials and quantities in science, 2) are specialized towards a single modality, such as electronic structure, rather than providing a comprehensive framework that can accommodate multiple modalities, 3) offer only a limited set of tasks or properties, 4) are primarily focused on computational methods, overlooking the importance of experimental benchmarking, and 5) make adding contributions to existing platforms rather complex, creating a barrier to entry. In general, there is a need to simplify the process of user contributions to leaderboards to foster broader community engagement.
In this work, we present a user-friendly, comprehensive approach to integrate the benchmarking of computational, experimental, and data-analytics methods. The JARVIS-Leaderboard framework (https://pages.nist.gov/jarvis_leaderboard/) covers a variety of categories: Artificial Intelligence (AI), Electronic Structure (ES), Force-field (FF), Quantum Computation (QC), and Experiments (EXP). It also covers various data types, including atomic structures, spectra, images, and text. This project can be used to: (1) check the state-of-the-art methods in respective fields, (2) add a contribution model on an existing benchmark, (3) add a new benchmark, and (4) compare new ideas and approaches to well-known approaches. To enhance reproducibility, we encourage each contribution to (1) come from a peer-reviewed article with an associated DOI for all contributions, models, and tools, (2) include a run script to exactly reproduce the results (especially for computational tools), and (3) include a metadata file with details such as team name, contact information, computational timing, and software (with version)/hardware used, in order to enhance transparency.
It is important to note the differences between a typical data repository and a benchmarking platform. Some of the key distinguishing factors between a usual large data repository (such as JARVIS-DFT) and the present leaderboard effort are: 1) the leaderboard contains well-characterized/well-known samples/tasks (i.e., with digital object identifier/peer-reviewed article links), with all the scripts/metadata easily available to reproduce the results, rather than just being a look-up table to find data, and 2) large data repositories usually contain more variation in materials chemistry/structure and less variation of methods, while the leaderboard focuses on a larger number of method comparisons.
For example, JARVIS-DFT contains DFT data for more than 80,000 materials and millions of material properties computed with a few specific ES methods; hence there are only a few entries for, say, the electronic bandgap of silicon from different methods, while the leaderboard contains electronic bandgaps for silicon using more than 17 ES methods from various contributors. Similarly, the JARVIS-ALIGNN project contains AI models for more than 80 properties/tasks of materials, i.e., just one model for a well-known property such as formation energy, while there are more than 12 methods for the formation energy task in the leaderboard (as discussed later).
Furthermore, the JARVIS-Leaderboard attempts to bridge together multiple categories of methods (AI, ES, FF, QC, EXP) and types of data (single properties, structures, spectra, text, etc.) with the goal of broadening benchmarking efforts across several fields of study. What differentiates the JARVIS-Leaderboard from platforms such as MatBench [33] is that MatBench provides a handful of tasks to evaluate ML methods on larger datasets (i.e., 10^4 entries, most of which are from the Materials Project [65]). A potential drawback of this approach is that the resulting performance rankings could be biased towards the data distribution of a single source. In contrast, the JARVIS-Leaderboard covers a broader range of datasets and properties and provides a better overview of model performance.
Recently, in the field of machine learning in materials science, there has been a fixation on performance metrics for newly developed models. This raises the question of whether benchmarking can be destructive to the development of new methods if those methods cannot immediately outperform the previous state-of-the-art approaches. It also raises the question of whether benchmarking can lead to overfitting or poor generalization [66,67].
Therefore, we outline how the leaderboard can also be used to identify and focus on some of the major challenges in different fields, such as: (1) how to evaluate extrapolation capability [68]? (2) why is it difficult to develop a reasonably good AI model with accuracy similar to electronic structure methods? (3) how can we reduce the computational cost of higher-accuracy electronic structure predictions (such as bandgaps and band offsets)? (4) how do we identify examples of materials that require high-fidelity methods (beyond DFT accuracy)? (5) how can we identify regions of material space where methodological improvements need to be targeted? (6) how can we establish figures of merit for mesoscale models such as phase field? (7) how can we make atomistic image analysis quantitative rather than qualitative? and (8) how do we develop and benchmark multi-modal models (using text, images, video, atomic structures, etc.) [69]?
The JARVIS-Leaderboard is seamlessly integrated into the existing and well-established NIST-JARVIS infrastructure [70,71], which hosts several datasets, tools, applications, and tutorials for materials design, motivated by the materials genome initiative [1]. The framework is open access to the entire materials science community for progressing the field collectively and systematically. JARVIS (Joint Automated Repository for Various Integrated Simulations) [70,71] is a repository designed to automate materials discovery and optimization using classical force-field, density functional theory, and machine learning calculations, as well as experiments. Nevertheless, the leaderboard is not limited to the NIST-JARVIS infrastructure and can be linked with other external projects as well.
Since its creation in 2017, JARVIS has had over 50,000 users worldwide, over 45 JARVIS-associated articles have been published, and over 80,000 materials currently reside in the database. As these numbers continue to multiply, significant effort on external outreach to the materials science community has been an additional goal of JARVIS, with several events (https://jarvis.nist.gov/events/) such as the Artificial Intelligence for Materials Science (AIMS) and Quantum Matters in Materials Science (QMMS) workshops and hands-on JARVIS-Schools, which have had hundreds of participants over the last few years. Based on the level of success and support from the community with regard to the existing JARVIS infrastructure, we believe that the JARVIS-Leaderboard will see a similar level of engagement and success, with a growing number of contributors from all over the world (in government, academia, and industry) and in different sub-fields of materials science.

A. Leaderboard overview
On the homepage, information regarding the number of methods, benchmarks, contributions, and datapoints is provided. A snapshot of the homepage with various categories is shown in Fig. 1a. Clicking on one of the entries (or searching in the 'Search' box), such as "formation_energy_peratom", opens a new tab with the available contributions. This new tab consists of 1) a description of the benchmark, 2) a plot of the various available contributions (as shown in Fig. 1b), and 3) an explicit table for the plot (as shown in Fig. 1c). For each contribution, links are provided to the submitted data (in .csv.zip format), the reference benchmark data (in a JSON file), a shell script to reproduce the contribution (run.sh file), and a metadata file (metadata.json). The metadata file contains details about the team name, email address of the contributor(s), DOI number, software (with version), hardware, instrument, computational timing, and other relevant details of a benchmark.
There are several categories for the benchmarks, including AI, ES, QC, FF, and EXP, and their combinations. Some example contributions and a summary table are also provided on the webpage to help a user navigate through the project. The summary table breaks down the available information into categories and sub-categories of different methodologies.
JARVIS-Leaderboard is an evolving project, so additions to the project are anticipated, welcome, and easy to make. We show a general flowchart for adding a new benchmark to the leaderboard in Fig. 2. The user can populate the reference dataset (with well-defined data splits) used for a specific benchmark (e.g., for 2D exfoliation energies in the JARVIS-DFT dataset using an AI method: "AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test"). AI benchmarks have pre-defined training/validation/test identifiers and target data in a corresponding json.zip file, while other methods have only a reference test set for evaluation, because they do not require model training as an AI method does. For most benchmarks in the leaderboard, experimental data is used as the reference data.
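As an illustration of such pre-defined splits, a fixed-seed 80:10:10 train/validation/test partition (the ratio used for the AI benchmarks discussed later) might be generated as follows; the function name and the use of Python's random module are illustrative assumptions, since the actual benchmark files are produced by the leaderboard tooling:

```python
import random

def split_ids(ids, seed=123, train_frac=0.8, val_frac=0.1):
    """Deterministically split entry IDs into train/validation/test sets.

    A fixed seed keeps the split reproducible, mirroring the fixed
    random number generator used for the pre-defined benchmark splits.
    """
    ids = sorted(ids)            # canonical order before shuffling
    rng = random.Random(seed)    # seeded generator for reproducibility
    rng.shuffle(ids)
    n_train = int(train_frac * len(ids))
    n_val = int(val_frac * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

# hypothetical identifiers in the JVASP-<number> style
train, val, test = split_ids([f"JVASP-{i}" for i in range(100)])
```

Because the shuffle is seeded, every contributor who evaluates on this benchmark sees exactly the same test identifiers.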
There is a helper script, jarvis_populate_data.py, to generate a benchmark dataset. A user can apply their method, train models, or run experiments on that dataset and prepare a csv.zip file, a metadata.json file, and, if possible, a conda environment.yaml/Nix/Dockerfile and a run.sh file. This step helps to reproduce the benchmark. These files are kept in a folder named after the team and can be uploaded to a user's GitHub account by the automated jarvis_upload.py script. This script automatically forks the parent usnistgov/jarvis_leaderboard repo for the user, adds the team-name folder with its files to that forked repo, runs a few minimal sanity checks on the new contribution, and then makes a pull request to the parent repo. The contribution addition and automated testing are carried out using GitHub Actions. The administrators of the JARVIS-Leaderboard at NIST verify the contributions, after which they become part of the leaderboard website.
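As a minimal sketch of the contribution file itself, the following writes a csv.zip with identifier/prediction pairs; the column names and the archive-member name are assumptions for illustration, so the leaderboard documentation should be consulted for the exact expected format:

```python
import csv
import io
import zipfile

def write_contribution(predictions, out_path):
    """Write {identifier: predicted value} pairs to a csv.zip file.

    The header and member name below are illustrative assumptions,
    not the leaderboard's authoritative format.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "prediction"])  # assumed header
    for entry_id, value in sorted(predictions.items()):
        writer.writerow([entry_id, value])
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("contribution.csv", buf.getvalue())

# a toy prediction for a single identifier
write_contribution({"JVASP-1408": 42.0}, "contribution.csv.zip")
```

The identifiers in the first column must match the test-set IDs of the benchmark for the metric computation to work, as described below.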
This project is available on GitHub at: https://github.com/usnistgov/jarvis_leaderboard. The administrators of the JARVIS-Leaderboard at NIST fully oversee the upload of contributions and benchmarks. A tree structure of the repo is shown in Fig. 3. There are two main directories in the repo: (1) benchmarks (reference data) and (2) contributions (for various leaderboard entries), as shown by the green highlighted boxes in Fig. 3.
The "benchmarks" directory has folders for the AI, ES, QC, FF, and EXP categories. Within them, there are sub-folders for specific sub-categories such as (1) SinglePropertyPrediction (where the output of a model/experiment is a single number for an entry), (2) SinglePropertyClass (where the output is class IDs, i.e., 0, 1, ... instead of floating-point values), (3) ImageClass (for multi-class image classification), (4) TextClass (for multi-label text classification), (5) MLFF (machine learning force-field), (6) Spectra (for multi-value data), and (7) EigenSolver (for Hamiltonian simulation). In each of these sub-folders, there are .json.zip files with well-defined reference datasets and available properties, which are also available in the JARVIS-Tools package (https://jarvis-tools.readthedocs.io/en/master/databases.html). To avoid storing large files in the GitHub repo, the actual datasets are part of JARVIS-Tools and are stored in the Figshare repository with specific DOIs and version numbers.
Next, in the "contributions" directory, there is a collection of folders that consist of .csv.zip and metadata.json files, and optionally a Dockerfile and a run.sh file. The csv.zip file contains identifier (id) entries and the corresponding prediction values obtained by the corresponding model/method. These test identifiers (such as JVASP-1408 in Fig. 3) must match the test set IDs in the json.zip file in the benchmarks folder for the metric measurements to work. Each of the csv.zip files must contain six components in the filename to place the contribution on the appropriate webpage. The components are the category (such as AI), sub-category (such as ImageClass), property (such as bravais_lattice), dataset name (such as stem_2d_image, as available on the JARVIS-Tools database page), data split (such as test), and metric (such as mae). For entries in the AI category, the data is in train-validation-test splits (using a fixed random number generator). For the current leaderboard format, we report the performance accuracy on the test set only. These files can be easily edited with common text editors. Each contribution folder (e.g., alignn-model) consists of one or several csv.zip files corresponding to each benchmark (such as for formation energies, bandgap, etc.).
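Assuming the six components are hyphen-separated in the file name (as in the examples in the text), a small parser for contribution file names could look like this; the function itself is a hypothetical helper, not part of the leaderboard tooling:

```python
def parse_contribution_filename(fname):
    """Split a contribution file name into its six components:
    category, sub-category, property, dataset, data split, and metric.
    Hyphen-separated naming is assumed from the examples in the text.
    """
    stem = fname
    if stem.endswith(".csv.zip"):
        stem = stem[: -len(".csv.zip")]
    parts = stem.split("-")
    if len(parts) != 6:
        raise ValueError(f"expected 6 components, got {len(parts)}: {fname}")
    keys = ("category", "subcategory", "property", "dataset", "split", "metric")
    return dict(zip(keys, parts))

info = parse_contribution_filename(
    "AI-SinglePropertyPrediction-formation_energy_peratom-dft_3d-test-mae.csv.zip")
```

Note that property and dataset names use underscores internally (e.g., formation_energy_peratom, dft_3d), so splitting on hyphens keeps them intact.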
Model-specific details are kept in the metadata.json file, with required keys such as model_name, project_url, team_name, and an email address. Users can keep other data, such as the uncertainty, time taken, and instrument/software/hardware used, in the metadata file as well. For computational models, the run.sh script can be used to completely reproduce the contributions as a single command-line script or job submission script. If a method requires additional steps or details beyond a simple command-line script, a user can upload a README file containing the additional details. For enhanced reproducibility, we also optionally allow users to include a Dockerfile and an IPython/Google Colab notebook for each benchmark. These notebooks can be used to run the contributions in the Google cloud without downloading anything locally. In addition, there is a "docs" directory in the JARVIS-Leaderboard. The docs folder consists of a directory structure similar to the benchmarks folder, with category names (AI, ES, etc.) and sub-categories (such as SinglePropertyPrediction, ImageClass, etc.) containing markdown (.md) files that are converted automatically into the corresponding HTML pages for the website. For each benchmark (i.e., json.zip file), a corresponding docs entry (i.e., md file) should be present. A new benchmark must be associated with a peer-reviewed article and a DOI, in order to have trust in the reference benchmark data. A new benchmark must also be verified by the JARVIS-Leaderboard administrators.
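For illustration, a metadata.json file with the required keys named above might be written as follows; all values, and the optional keys beyond the required four, are hypothetical examples:

```python
import json

# Illustrative metadata.json contents. model_name, project_url,
# team_name, and email are the required keys named in the text;
# the remaining keys are optional hypothetical extras.
metadata = {
    "model_name": "my_model",                 # hypothetical model name
    "project_url": "https://example.org/my-model",
    "team_name": "my_team",
    "email": "contact@example.org",
    "software_used": "my_code v1.0",          # software with version
    "hardware_used": "64-core CPU node",
    "time_taken_seconds": 3600,
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Keeping timing, software versions, and hardware in the metadata makes the contribution easier to reproduce and audit.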
As mentioned above, several other materials-science-specific benchmarks already exist. We compare some of these benchmarks in Table 1 based on the categories that are included. We find that there is no single, large-scale benchmark encompassing the various fields as in the JARVIS-Leaderboard. Also, the data format, metadata, and website for these different leaderboards vary significantly. Hence, having a uniform way to compare different methods would greatly help the materials community.
Each entry in the benchmark dataset consists of a unique identifier. Most of these datasets are already integrated into the JARVIS-Tools databases page (but not limited to it), with an associated JARVIS ID number (JID), and are backed up in Figshare, Google Drive, and NIST-internal storage systems. The number of entries can vary from a few (which is especially applicable to experimental and high-accuracy computational methods, where generating a very large dataset is not feasible in terms of time and resources) to hundreds of thousands of entries in a dataset.

FIG. 5. Periodic table element distribution for entries in all the datasets, calculated by taking the element-specific entries normalized by the total number of entries, i.e., percentage probabilities.
An overview of the datasets can be found in Fig. 4. Considering all possible entries in the datasets, we have close to 7 million datapoints. For example, an atomic structure can have multiple properties calculated, such as bandgaps and formation energies, among others. We find that the JARVIS-DFT-3D dataset has the largest number of entries. Considering unique systems, we find the distribution in Fig. 4b. In this case, qe-tb (the fitting dataset for ThreeBodyTB.jl [94]) is one of the largest datasets available in the leaderboard. Note that these datasets contain all varieties of data modalities, such as atomic structures, images, spectra, and text. In Fig. 5, we show the fractional distribution of periodic table elements in the entire dataset. We find that the most common elements are C, N, O, and Cu, which is similar to the natural abundance of these elements.
Experimental results are uploaded as benchmarks (i.e., what is regarded as the reference). In the absence of experimental data, high-fidelity computational methods can be used as a reference. If there are multiple experimental measurements available in the literature, each can be individually added as a separate benchmark (i.e., different json.zip files to distinguish one benchmark from another), and users can submit contributions for each of them. As time and the materials science field progress, certain experimental data may need to be revisited (e.g., more accurate measurements become available, or results are reported that contradict previous experimental data). In response, separate reference (experimental) benchmarks can be added, and users will be able to plot and compare the evolution of these benchmarks over time.
In addition, leaderboard users can raise an issue on GitHub pertaining to reference benchmarks. The administrators will also upload a README file containing additional information about the experiments conducted, including the associated DOI and experimental conditions, and providing details if additional experiments on the same material/property exist in the literature. The experimental conditions described in the README file can be important when comparing the reference benchmark to calculated results, which may correspond to different conditions than the experiment (e.g., the bandgap of a material is never measured at 0 K, as DFT predicts it).
Contributions to the leaderboard in the form of user-submitted experimental data can be compared with previous experiments, electronic structure methods, or other numerical results. ES-based contributions are benchmarked against experimental results and can be compared with other ES methods. QC data can be compared with classical computation data or exact analytical results. For FF, contributions can be compared to DFT (or other ES data) or high-level interatomic potential benchmark suites (specifically for MLFFs) [198]. For AI, a test dataset is used. Unlike other methods, AI methods can have both "train" and "test" datasets, while the others have only "test" sets in the corresponding dataset. For AI methods, if the "train" dataset is not provided and only "test" is given, the benchmark can be used for checking extrapolation behavior, as with the vacancy formation energy benchmarks.

C. Analysis of Benchmarks
Presently, the leaderboard has 5 categories, 10 sub-categories, 152 methods, 274 benchmarks, 1281 contributions, and 8,714,228 datapoints. In this section, we show a few of the hundreds of example analyses that can be carried out using the available benchmarks and contributions. In Fig. 6, we show the MAE of the AI-computed formation energy and the ES-computed bandgap for Si for a variety of contributions in the leaderboard. In Fig. 6(a) we compare 12 AI models (each AI model had a well-defined 80:10:10 split for training, validation, and testing, respectively, from the JARVIS-3D database) and find that kgcnn_coGN [140] has the highest accuracy/lowest error, followed by the Potnet [141], Matformer [142], and ALIGNN [130,131] models. This can be attributed to the fact that as we include more structural information and use deep-learning methods rather than descriptor methods, accuracy improves.
In Fig. 6(c) we compare how several classical FFs compute the Voigt bulk modulus of Si. In Fig. 6(d) we compare several MLFF models for the forces of Si. We compare various pre-trained MLFFs and other MLFFs that we specifically trained on the MLEARN [38] dataset (PBE-based DFT data). We see that ALIGNN-FF [170] and MatGL [139] perform similarly for the prediction of forces. Fig. 6(d) provides a comprehensive comparison of MLFFs that are trained and tested on the same dataset with pre-trained models that were trained elsewhere. The comparisons are presented in tabular form for all the benchmarks on the leaderboard website. We have provided tools and notebooks in the leaderboard GitHub repository that can be used to make such plots for all the available benchmarks and contributions. A collection of such figures for method comparison is available in the supplementary information (Supplementary Figures 1-298). We have also added interactive plots for such comparisons on the website. These tools can aid in identifying examples of materials that require high-fidelity methods beyond the accuracy of DFT in order to understand their underlying properties. In addition, these tools can be used to validate electronic structure methods and provide insight for error estimation.
The leaderboard has a large number of benchmarks and can enable a more comprehensive comparison of different methods, better revealing their respective advantages and limitations. For instance, neural networks outperform descriptor-based models by a large degree in all of the 10 regression tasks in the latest MatBench [33] leaderboard. To check whether this is also the case for the 44 regression benchmarks in the current JARVIS-Leaderboard, we compare the performance of the best descriptor-based model to that of the best neural network. As shown in Fig. 7, the best neural network outperforms the best descriptor-based model in 34 tasks, but only 14 out of 44 (32%) tasks see a performance difference of more than 20%. This indicates that descriptor-based models are still competitive with neural networks, especially considering their better interpretability and orders-of-magnitude lower training cost [66,67]. Notably, the best descriptor-based model is found to outperform the best neural network in 10 tasks, including some with 10^4 to 10^5 training data points, opening up interesting questions and potential directions for further model improvement. For instance, the inferior performance of neural networks in the regression tasks for the heat capacity and hMOF data may be related to the recently revealed inability of graph neural networks to capture periodicity [202].
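This comparison can be sketched as a small tally over per-task best MAEs; the exact definition of the relative performance difference used in the figure is an assumption here, and the MAE values below are toy numbers, not leaderboard results:

```python
def compare_models(best_nn_mae, best_descriptor_mae, threshold=0.20):
    """Tally tasks where the best neural network beats the best
    descriptor-based model, and tasks where the relative MAE gap
    (here assumed to be |difference| / larger MAE) exceeds `threshold`."""
    nn_wins, large_gaps = 0, 0
    for task in best_nn_mae:
        nn, desc = best_nn_mae[task], best_descriptor_mae[task]
        if nn < desc:
            nn_wins += 1
        if abs(nn - desc) / max(nn, desc) > threshold:
            large_gaps += 1
    return nn_wins, large_gaps

# toy MAE values for three hypothetical tasks
nn_mae = {"task_a": 0.05, "task_b": 0.30, "task_c": 0.10}
desc_mae = {"task_a": 0.10, "task_b": 0.28, "task_c": 0.11}
wins, gaps = compare_models(nn_mae, desc_mae)
```

Run over all 44 regression benchmarks, this kind of tally yields the win counts and large-gap counts reported above.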

D. Analysis of Error Metrics
Although a metric such as the MAE can be useful to compare methods for a specific benchmark, it is difficult to compare across different benchmarks, since MAE values can differ substantially. Hence, we use the ratio of the mean absolute deviation (MAD, computed with respect to the average value of the training data as a baseline/random-guess model) to the MAE for both the AI and ES single-property-prediction categories. Mean absolute deviation values act as a baseline/random-guessing model for the benchmark, and contributed models should have MAE performance better than the MAD values. We show the MAD/MAE ratios for AI and ES benchmarks in Fig. 8. We find that the MAD/MAE values range from 2 to 50. MAD/MAE values close to 1 suggest low predictive power. We observe that quantum properties such as the bandgap have lower MAD/MAE than classical quantities (quantities that do not require quantum mechanical simulations) such as the total energy or bulk modulus. Interestingly, such trends for classical vs. quantum quantities are observed for both the AI and ES approaches.
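A minimal sketch of the MAD/MAE ratio, under the reading that MAD is the mean absolute deviation of the test targets from the training-set mean (the baseline "predict the mean" model; other readings of the definition are possible):

```python
def mad_over_mae(train_targets, test_targets, predictions):
    """MAD/MAE ratio. MAD is taken here as the mean absolute deviation
    of the test targets from the training-set mean (a baseline model
    that always predicts the mean); MAE is the model's error on the
    same test set. Values near 1 imply low predictive power."""
    baseline = sum(train_targets) / len(train_targets)
    mad = sum(abs(t - baseline) for t in test_targets) / len(test_targets)
    mae = sum(abs(t - p) for t, p in zip(test_targets, predictions)) / len(test_targets)
    return mad / mae

# toy example: baseline mean is 2.0, model halves the baseline error
ratio = mad_over_mae([1.0, 2.0, 3.0], [2.0, 4.0], [2.5, 3.5])
```

Because MAD normalizes away the natural spread of each property, the resulting ratio is comparable across benchmarks with very different units and scales.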

E. Interactive View of Benchmarks and Contributions
In addition to the bar plots shown in Fig. 6 and Fig. 8, the raw data available in benchmarks and contributions can be presented in various other forms, such as scatter plots, band structures, adsorption spectra, and diffraction spectra. In Fig. 9, we show example comparisons of different methods for the AI, ES, QC, and EXP categories, including (a) a formation-energy-per-atom model using AI, (b) bulk modulus predictions using ES, (c) the electronic band structure of Al using QC with different quantum circuits [186], and (d) CO2 capture for a zeolite at several labs in round-robin fashion [157]. In Fig. 9a, we find that formation energy is one of the easiest quantities for which to train AI models, and even simple chemistry-only models can perform reasonably well (i.e., cfid_chem). Including more structural features (such as bond angles and dihedral angles) and using deep learning models (such as graph neural networks vs. descriptor-based models) further helps improve accuracy. Similarly, for the ES example of predicting the bulk modulus, we find that, irrespective of the DFT-based method used, the predictions are in relatively close agreement with experimental bulk modulus data, as shown in Fig. 9b. In Fig. 9c, we find that the selection of a quantum circuit is critically important for predicting electronic band structures well. Here, we used 6 different quantum circuits [186] and found the SU(2) [152] circuit to compare well with classical computer-based electronic band structures. This can be attributed to the various entanglements captured in the SU(2) [152] circuit that may be missing in other circuits. Finally, for the experimental inter-laboratory/round-robin measurements of the zeolite CO2 isotherm, we find excellent agreement across different labs [157].

III. METHODS
The JARVIS-Leaderboard aims to provide a comprehensive framework covering a variety of length- and time-scale approaches [2] to enable realistic materials design. In this section, we provide a brief overview of the methods that are currently available in the leaderboard. In this work, we often use the terms categories, sub-categories, methods, benchmarks, and contributions, so we define them as follows.
Currently, there are five main "categories" in the leaderboard: Artificial Intelligence (AI), Electronic Structure (ES), Force-field (FF), Quantum Computation (QC), and Experiments (EXP). Each category is divided into "sub-categories", a list of which is provided on the website. These sub-categories include single-property prediction, single-property classification, atomic force prediction, text classification, text-token classification, text generation, image classification, image segmentation, image generation, spectra prediction, and eigensolver. These sub-categories are highly flexible, and new categories can be easily added. "Benchmarks" are the reference data (in the form of a json.zip file, discussed later) used to calculate performance metrics for each specific contribution. "Methods" are a set of precise specifications for evaluation against a benchmark. For example, within the ES category, density functional theory (DFT) performed with the specifications of the Vienna Ab initio Simulation Package (VASP) [87,88], the Perdew-Burke-Ernzerhof (PBE) [89] functional, and PAW [87,88] pseudopotentials (VASP-PBE-PAW) is a method. Similarly, within the AI category, descriptor/feature-based models with the specifications of MatMiner [90] chemical features and the LightGBM [91] software constitute a method. "Contributions" are individual data (in the form of csv.zip files) for each benchmark computed with a specific method. Each contribution's file name consists of six components: category (e.g., AI), sub-category (e.g., SinglePropertyPrediction), property (e.g., formation energy), dataset (e.g., dft_3d), data split (e.g., test), and metric (e.g., mae).

A. Electronic structure
Electronic structure approaches cover short length and time scales with high fidelity. There are a variety of ES methodologies, such as tight-binding [92][93][94], density functional theory (DFT) [95], quantum Monte Carlo [96], dynamical mean field theory [97], and many-body perturbation theory (Green's function with screened Coulomb potential, GW methods) [98]. For each methodology, a number of specifications are needed to completely describe a method, including the exact software, exchange-correlation functional, pseudopotential, and other relevant parameters. Example methods used in this work are given in Table 2.

B. Force-field
Force fields can be used in molecular dynamics and Monte Carlo simulations for studying larger time and length scales than electronic structure methods can reach. Traditional force fields are developed for specific chemical systems and applications and may not be transferable to other uses. It is important to check the validity of an FF before using it in a particular application. Moreover, the development of FFs is a cumbersome task. Examples of typical FFs include embedded-atom method (EAM) potentials [147] (e.g., Al099.eam.alloy for aluminum systems [161]), Lennard-Jones (LJ) potentials [145] for 2D liquids, reactive empirical bond order (REBO) potentials [148] for Si, and classical atomistic force fields for biomolecular systems [162,163]. Recently, machine learning force fields (MLFFs) [164][165][166][167][168][169] have become popular because of their higher accuracy and ease of development (such as SNAP [129] FFs). Nevertheless, early generations of MLFFs were also developed for specific types of chemistry and applications. Very recently, several MLFFs have been developed that can be used to simulate any combination of periodic table elements. Some of these FFs include M3GNET [139], ALIGNN-FF [170], and CHGNet [138]. In the leaderboard, we include benchmarks for energies, forces, and stress tensors for both specific systems and universal datasets.
Traditional FFs are available in LAMMPS [146], while MLFFs are integrated into the Atomic Simulation Environment (ASE) [171] package. Some of these MLFFs are now also available in LAMMPS and other large-scale MD codes. In addition to static quantities, FFs can be used for Monte Carlo simulations, such as CO2 adsorption in metal-organic frameworks (MOFs) [172] using the RASPA [173] code. In addition to energy, force, and stress, we also have FF benchmarks for classical properties such as the bulk modulus. For biomolecular systems, GROMACS [174] is commonly used, and we present here free energy differences and conformational state population benchmarks for three model peptides [175][176][177].
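The Lennard-Jones potential mentioned above is the simplest FF a benchmark can exercise. A minimal sketch in reduced units (epsilon = sigma = 1; our own illustration, not a leaderboard benchmark) of the pair energy and force that an energy/force benchmark would compare against reference data:

```python
def lj_energy_force(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy and scalar force at separation r.

    U(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6); F(r) = -dU/dr.
    """
    sr6 = (sigma / r) ** 6
    energy = 4.0 * epsilon * (sr6**2 - sr6)
    # -dU/dr = 24*eps*(2*(sigma/r)**12 - (sigma/r)**6)/r
    force = 24.0 * epsilon * (2.0 * sr6**2 - sr6) / r
    return energy, force

# At the minimum r = 2**(1/6)*sigma the force vanishes and U = -epsilon.
r_min = 2.0 ** (1.0 / 6.0)
u, f = lj_energy_force(r_min)
```

A leaderboard FF contribution reports exactly such per-configuration energies and forces, computed by LAMMPS, ASE, or an MLFF, rather than by this toy closed form.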

C. Artificial intelligence
Recently, artificial intelligence methods have become popular for materials prediction across all length and time scales. We currently have benchmarks for four types of input data for the AI models: (1) atomic structure, (2) spectra, (3) images, and (4) text. AI techniques can be used for both forward prediction and inverse design. For atomic structure datasets, we use DFT datasets such as JARVIS-DFT [70,71], the Materials Project (MP) [65], the tight-binding three-body dataset (TB3) [94], and Quantum-Machine 9 (QM9) [178,179]. For spectral data, we use either DFT-based spectra (for example, electron or phonon density of states (DOS) and Eliashberg functions) or numerical XRD spectra. For images, we have simulated and experimental scanning transmission electron microscope (STEM) and scanning tunneling microscopy (STM) images for 2D materials. For text data, we have used the publicly available arXiv dataset.
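For atomic structure inputs, descriptor-based AI models first map a structure or composition to a fixed-length feature vector. A deliberately toy sketch of that idea (an element-fraction vector over a small, hypothetical element vocabulary; real pipelines such as MatMiner/Magpie add electronegativity, radius, and valence statistics, among many others):

```python
from collections import Counter

# Toy element vocabulary chosen for illustration only.
ELEMENTS = ["H", "C", "N", "O", "Si", "Al"]

def composition_vector(atoms):
    """Fractional-composition feature vector from a list of element symbols.

    A minimal stand-in for chemistry-aware descriptors: each entry is the
    fraction of sites occupied by the corresponding element in ELEMENTS.
    """
    counts = Counter(atoms)
    total = sum(counts.values())
    return [counts.get(el, 0) / total for el in ELEMENTS]

# e.g. an SiO2-like six-site cell: 2 Si + 4 O
vec = composition_vector(["Si", "Si", "O", "O", "O", "O"])
```

Such vectors are what a tree-based model (e.g., LightGBM in the descriptor-based method above) consumes, whereas graph neural networks operate on the atomic structure directly.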

D. Quantum computation
Quantum chemistry is one of the most promising applications of quantum computation [185]. Quantum computers with relatively few logical qubits can potentially exceed the performance of much larger classical computers because the size of the Hilbert space increases exponentially with the number of electrons in the system. Predicting the energy levels of a Hamiltonian is a typical and fundamentally important problem in quantum chemistry. We perform Hamiltonian simulations with quantum algorithms and compare the results with classical solvers. Determining an appropriate quantum circuit for a specific QC problem is a challenging task. For example, we use the tight-binding Hamiltonians for electrons and phonons in JARVIS-DFT and evaluate the electron bandstructures using quantum algorithms (such as the variational quantum eigensolver (VQE) [153] and variational quantum deflation (VQD) [154]) with different quantum circuits (such as the PauliTwoDesign [152] and SU(2) [152] circuits). We primarily use the Qiskit [152] software in this work through the JARVIS-Tools/AtomQC [186] interface, but other packages such as Tequila [187], Cirq [188], and PennyLane [155,156] can also be easily integrated. In addition to studying algorithm and circuit-architecture dependence, the leaderboard can be used to study noise levels in quantum circuits across different quantum computers, a key issue hindering quantum computer commercialization. Currently, we only use the statevector simulators for the quantum algorithms available in the Qiskit [152] library.
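The classical reference against which a VQE/VQD run is scored is simply the exact spectrum of the (small) tight-binding Hamiltonian. A minimal sketch for a two-site model, where the eigenvalues are available in closed form (the function name and parameters are our own illustration):

```python
def tb_dimer_eigenvalues(onsite=0.0, hopping=-1.0):
    """Exact eigenvalues of a 2-site tight-binding Hamiltonian.

    H = [[onsite, hopping],
         [hopping, onsite]]
    has eigenvalues onsite -/+ |hopping| (bonding/antibonding levels).
    A VQE result for the ground state, or a VQD result for excited
    states, is benchmarked against these classical values.
    """
    return onsite - abs(hopping), onsite + abs(hopping)

ground, excited = tb_dimer_eigenvalues(onsite=0.0, hopping=-1.0)
# ground -> -1.0, excited -> 1.0 (in units of |hopping|)
```

For larger Hamiltonians the closed form is replaced by numerical diagonalization, but the benchmarking logic is the same: the metric (e.g., multi-mae over eigenvalues) measures the deviation of the quantum-algorithm estimate from this exact spectrum.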

E. Experiments
Although experimental results for material properties and spectra are referenced in comparison to computational methods (within the JARVIS-Leaderboard and other leaderboards such as MatBench [33]), we dedicated a portion of the JARVIS-Leaderboard to experimental benchmarking. Benchmarking experiments essentially boils down to the comparison of different experiments for the same desired result(s). A systematic way to perform this benchmarking is through round-robin testing [189], an inter-laboratory test performed independently several times, which can involve multiple scientists and a variety of methods and equipment. This approach has been applied successfully to a range of materials science applications [157,190-193], but many more such experiments are still needed. Specifically, in the JARVIS-Leaderboard we include experimental round-robin results for manometric measurements of CO2 adsorption [157]. It is important to note that the experimental results included in the leaderboard are for well-characterized materials with well-defined properties and phenomena that can be easily reproduced (in contrast to replication attempts of variable experiments, such as the recent attempts to synthesize room-temperature superconductors [194-197]). Some of the experimental techniques we used for benchmarking purposes are XRD, magnetometry, vibroscopy, scanning electron microscopy (SEM), and transmission electron microscopy (TEM). We purchased the samples from industrial vendors with available identifiers such as CAS numbers. We also carried out XRD for MgB2 (a superconducting material) to verify its crystal structure before carrying out magnetometry measurements to determine the transition temperature; this measurement was compared with numerical XRD data. Magnetometry measurements for superconductors were also conducted to compare their superconducting transition temperatures with predicted or experimentally available values [158]. Stress-strain measurements were performed on Kevlar for failure analysis [158]. Several instruments, such as a Bruker D8, a Titan, a Quantum Design PPMS, and a FAVIMAT, are currently represented in the leaderboard.

F. Metrics used
We use several metrics in the leaderboard depending on the "sub-categories" mentioned above: mean absolute error (MAE) for singlepropertyprediction, accuracy (acc) for singlepropertyclassification, multi-mae (the L1 norm of multi-dimensional data) for spectra/eigensolver/atomic forces, and recall-oriented understudy for gisting evaluation (ROUGE) for the textGen/textsummary sub-categories. As users contribute their data to compare against the reference data (benchmarks), other complementary metrics (such as those available in the sklearn.metrics library) can be easily calculated, because the raw contribution data is also made available through the website. For the sake of readability and ease of use, we primarily employ the metrics mentioned above. For single property prediction, there is only one scalar value per row in the csv.zip file, with the id and prediction separated by a comma. For spectra, force-prediction, and other multi-value quantities (i.e., with multiple prediction values per id), we concatenate the array and separate the values by semicolons (to avoid the comma convention in csv files). The benchmark data is stored in a similar format. We provide tools to convert these csv.zip files into json or other file formats if needed. We also provide Jupyter/Colab notebooks to visualize the data. In addition, we plan to eventually add metrics for timing, uncertainty, development cost, and other details.
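The MAE computation described above amounts to joining contribution rows to benchmark rows on the id and averaging the absolute differences. A minimal sketch assuming the "id,value" row format with semicolon-separated multi-value predictions (helper names are ours; the real pipeline works on the csv.zip/json.zip files directly):

```python
def parse_rows(lines):
    """Parse 'id,value' rows; multi-value predictions use ';' separators."""
    data = {}
    for line in lines:
        id_, raw = line.strip().split(",")
        data[id_] = [float(x) for x in raw.split(";")]
    return data

def mae(reference, contribution):
    """Mean absolute error over all ids in the reference (benchmark) data.

    Multi-value entries (spectra, forces) contribute one error term per
    component, which reproduces the multi-mae behavior described above.
    """
    errors = []
    for id_, ref_vals in reference.items():
        pred_vals = contribution[id_]
        errors += [abs(r - p) for r, p in zip(ref_vals, pred_vals)]
    return sum(errors) / len(errors)

ref = parse_rows(["JVASP-1002,1.0;2.0", "JVASP-1408,3.0"])
pred = parse_rows(["JVASP-1002,1.5;2.0", "JVASP-1408,2.0"])
err = mae(ref, pred)  # errors 0.5, 0.0, 1.0 -> 0.5
```

Keying the join on the id is what enforces the requirement, noted in the figure legends, that contribution ids match the test-set ids in the benchmark json.zip file.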
IX. FIGURE LEGENDS
2. A flow-chart showing the processes involved in uploading a new contribution to the leaderboard. The jarvis_populate_data.py script generates a benchmark dataset. A user can apply their method, train models, or run experiments on that dataset and prepare a csv.zip file, a metadata.json file, and other files in a new folder in the contributions directory. The contribution can be checked locally using the jarvis_server.py script and then uploaded to the user's GitHub account with the automated jarvis_upload.py script. After verification by the JARVIS-Leaderboard administrators at NIST, it becomes part of the leaderboard website.
3. A tree diagram of the directory and file structure in the leaderboard. There are two main directories in the repo: (1) benchmarks (reference data) and (2) contributions (for various leaderboard entries). In the "benchmarks" directory, there are folders for the AI, ES, QC, FF, and EXP categories, with sub-folders for specific sub-categories. The "contributions" directory is a collection of folders, each containing a csv.zip file, a metadata.json file, and optionally a Dockerfile and a run.sh file for the contributions from each method. The csv.zip file contains entries of identifier (id) and the corresponding prediction values obtained by the model/method. These test identifiers (such as JVASP-1408) must match the test-set ids in the json.zip file in the benchmarks folder for the metric measurements to work.
4. Distribution of data in each dataset: (a) all entries in the leaderboard, (b) entries with unique identifiers. Note that one identifier (such as JVASP-1002 for silicon) can have multiple properties (such as bandgap, bulk modulus, etc.). A script to generate this figure is also provided on the leaderboard website, as the leaderboard is continuously evolving.
5. Periodic-table element distribution for entries in all the datasets, calculated as the element-specific entries normalized by the total number of entries, i.e., percentage probabilities.
6. Example mean absolute errors for benchmarks, including (a) artificial intelligence (AI) formation energy for a test set of 5572 materials in the JARVIS-DFT 3D dataset, (b) electronic structure (ES) Si (JARVIS-DFT ID: JVASP-1002) bandgap, (c) classical force-field (FF) based Voigt bulk modulus of Si, and (d) machine learning force-field (MLFF) based forces for Si. We provide Jupyter/Google Colab notebooks to easily plot such comparisons for all available benchmarks. Similar analysis figures for all the available benchmarks are provided in the supplementary information (Supplementary Figures 1-298). As a note, these plots are a current snapshot of the leaderboard, and it is possible that new and more accurate models will be developed and added in the future.
7. Relative performance computed as the ratio of the MAE difference between the best descriptor-based model and the best neural network to the MAE of the best neural network in the AI regression benchmarks. The benchmark name and the corresponding best-performing neural network are indicated on the left and right y axes, respectively. For all the considered AI benchmarks, the best descriptor-based model is the tree-based model using Magpie [117] and Voronoi-tessellation [203] features. As a disclaimer, these plots are a current snapshot of the leaderboard, and it is possible that new and more accurate models will be developed in the future.
8. Mean absolute deviation (MAD) to mean absolute error (MAE) ratio for (a) AI and (b) electronic structure methods. MAD:MAE serves as a uniform criterion for comparing the performance of models.
9. Example results for AI, ES, QC, and EXP: (a) formation-energy-per-atom model using AI for the JARVIS-DFT 3D dataset with 5572 materials in the test set, (b) bulk modulus predictions using ES methods for 21 materials, (c) electronic bandstructure of aluminum using QC methods with different quantum circuits on a coarse k-point mesh, (d) CO2 capture for zeolite (ZSM-5) at several labs in an inter-laboratory/round-robin fashion.

X. TABLE LEGENDS
1. Comparison of benchmark infrastructure available for materials design methods for several categories.
2. Summary of current benchmark categories and methods available in the JARVIS-Leaderboard at the time of writing. More details can be found in the individual metadata.json files. Note that the number of methods is continuously growing.
FIG. 1. Leaderboard snapshot with an example output for an AI-based formation energy per atom model on the JARVIS-DFT (dft_3d) dataset. a) Homepage snapshot showing the list of categories and the number of available contributions at the time of writing. b) An example AI regression model benchmark for formation energy with several contributions; the methods are sorted based on the mean absolute error (MAE) values, where lower MAE values indicate higher accuracy. c) Explicit table for the plot in panel b. Links to the individual csv.zip (AI-SinglePropertyPrediction-formation_energy_peratom-dft_3d-test-mae.csv.zip), json.zip (dft_3d_formation_energy_peratom.json.zip), shell script (run.sh), and detailed info (metadata.json) files are provided to help enhance reproducibility. Such result plots and tables are available for each benchmark in the leaderboard.

FIG. 2 .
FIG. 2. A flow-chart showing the processes involved in uploading a new contribution to the leaderboard. The jarvis_populate_data.py script generates a benchmark dataset. A user can apply their method, train models, or run experiments on that dataset and prepare a csv.zip file, a metadata.json file, and other files in a new folder in the contributions directory. The contributions can be checked locally by the user with the jarvis_server.py script. Then the folder can be uploaded to the user's GitHub account by the automated jarvis_upload.py script, which involves several GitHub uploading steps. The administrators of the JARVIS-Leaderboard at NIST will verify the contributions, and then finally they will become part of the leaderboard website.
FIG. 3. A tree diagram of the directory and file structure in the leaderboard. There are two main directories in the repo: (1) benchmarks (reference data) and (2) contributions (for various leaderboard entries). In the "benchmarks" directory, there are folders for the AI, ES, QC, FF, and EXP categories, with sub-folders for specific sub-categories. The "contributions" directory is a collection of folders, each containing a csv.zip file, a metadata.json file, and optionally a Dockerfile and a run.sh file for the contributions from each method. The csv.zip file contains entries of identifier (id) and the corresponding prediction values obtained by the model/method. These test identifiers (such as JVASP-1408) must match the test-set ids in the json.zip file in the benchmarks folder for the metric measurements to work.

FIG. 4 .
FIG. 4. Distribution of data in each dataset: (a) all entries in the leaderboard, (b) entries with unique identifiers. Note that one identifier (such as JVASP-1002 for silicon) can have multiple properties (such as bandgap, bulk modulus, etc.). A script to generate this figure is also provided on the leaderboard website, as the leaderboard is continuously evolving.

FIG. 6 .
FIG. 6. Example mean absolute errors for benchmarks, including (a) artificial intelligence (AI) formation energy for a test set of 5572 materials in the JARVIS-DFT 3D dataset, (b) electronic structure (ES) Si (JARVIS-DFT ID: JVASP-1002) bandgap, (c) classical force-field (FF) based Voigt bulk modulus of Si, and (d) machine learning force-field (MLFF) based forces for Si. We provide Jupyter/Google Colab notebooks to easily plot such comparisons for all available benchmarks. Similar analysis figures for all the available benchmarks are provided in the supplementary information (Supplementary Figures 1-298). As a note, these plots are a current snapshot of the leaderboard, and it is possible that new and more accurate models will be developed and added in the future.

FIG. 8 .
FIG. 8. Mean absolute deviation (MAD) to mean absolute error (MAE) ratio for (a) AI and (b) electronic structure methods. MAD:MAE serves as a uniform criterion for comparing the performance of models.

FIG. 9 .
FIG. 9. Example results for AI, ES, QC, and EXP: (a) formation-energy-per-atom model using AI for the JARVIS-DFT 3D dataset with 5572 materials in the test set, (b) bulk modulus predictions using ES methods for 21 materials, (c) electronic bandstructure of aluminum using QC methods with different quantum circuits on a coarse k-point mesh, (d) CO2 capture for zeolite (ZSM-5) at several labs in an inter-laboratory/round-robin fashion.

TABLE 1 .
Comparison of benchmark infrastructure available for materials design methods for several categories.

TABLE 2 .
Summary of current benchmark categories and methods available in the JARVIS-Leaderboard at the time of writing. More details can be found in the individual metadata.json files. Note that the number of methods is continuously growing.