Abstract
The Joint Automated Repository for Various Integrated Simulations (JARVIS) is an integrated infrastructure to accelerate materials discovery and design using density functional theory (DFT), classical force-fields (FF), and machine learning (ML) techniques. JARVIS is motivated by the Materials Genome Initiative (MGI) principles of developing open-access databases and tools to reduce the cost and development time of materials discovery, optimization, and deployment. The major features of JARVIS are: JARVIS-DFT, JARVIS-FF, JARVIS-ML, and JARVIS-tools. To date, JARVIS consists of ≈40,000 materials and ≈1 million calculated properties in JARVIS-DFT, ≈500 materials and ≈110 force-fields in JARVIS-FF, and ≈25 ML models for material-property predictions in JARVIS-ML, all of which are continuously expanding. JARVIS-tools provides scripts and workflows for running and analyzing various simulations. We compare our computational data to experiments or high-fidelity computational methods wherever applicable to evaluate error/uncertainty in predictions. In addition to the existing workflows, the infrastructure can support a wide variety of other technologically important applications as part of the data-driven materials design paradigm. The JARVIS datasets and tools are publicly available at the website: https://jarvis.nist.gov.
Introduction
The Materials Genome Initiative (MGI) (https://mgi.gov/, which provides information about activities and events under the MGI; https://www.nist.gov/mgi, which provides information about projects under the MGI chapter of the National Institute of Standards and Technology (NIST)) was introduced in 2011 to accelerate materials discovery using computational^{1,2,3,4,5,6,7}, experimental^{8,9,10,11} and data analytics^{12,13,14} approaches. The MGI has revolutionized several fields of materials applications, such as batteries^{15}, thermoelectrics^{16}, and alloy design^{17}, through open-access public database and tool development^{18}. The MGI encourages systematic Process-Structure-Property-Performance (PSPP)^{19}-based efficient design approaches rather than Edisonian trial-and-error methods^{20}.
Especially in the field of computational materials design, quantum-mechanics-based density functional theory (DFT)^{21} has proven to be an immensely successful technique, and several databases of automated DFT calculations are widely used in materials design applications. Despite their successes, existing DFT databases face limitations due to issues intrinsic to conventional DFT approaches, e.g., the generalized gradient approximation of Perdew-Burke-Ernzerhof (GGA-PBE)^{21,22}. Drawbacks of the existing DFT databases include non-inclusion of van der Waals (vdW) interactions^{6}, bandgap underestimation^{23}, non-inclusion of spin-orbit coupling^{5}, oversimplified magnetic ordering^{24}, neglect of defects^{25} (point, line, surface and volume), unconverged computational parameters such as k-points^{26}, neglect of temperature effects^{27} (generally DFT calculations are performed at 0 K), lack of layer/thickness-dependent properties of low-dimensional materials^{28}, and lack of interfaces/heterostructures of materials^{29}, all of which can be critical for realistic material applications. In addition, there are several other computational approaches, such as classical force-field (FF)^{30}, computational microscopy, phase-field (PF), CALculation of PHAse Diagrams (CALPHAD)^{31}, and Orientation Distribution Functions (ODF)^{32}, which lack the integrated tools and databases that have been developed for DFT-based computational approaches. Finally, the integration of computational approaches with experiments, the application of statistical uncertainty analysis, and the implementation of data analytics and artificial intelligence (AI) techniques require significant developments to meet the goals set forth by the MGI.
Some of the notable materials databases are: Automatic-FLOW for Materials Discovery (AFLOW)^{1}, Materials Project^{2}, Khazana^{15}, Open Quantum Materials Database (OQMD)^{3}, Novel Materials Discovery (NOMAD)^{7}, Computational Materials Repository (CMR)^{33}, NIMS-MatNavi (https://mits.nims.go.jp/, which hosts information about several material classes and their properties), NREL-MatDB^{34}, Inorganic Crystal Structure Database (ICSD)^{35}, MaterialsCloud^{36}, Citrine (Citrine Informatics, https://citrine.io, which hosts several tools for accelerated materials design), OpenKIM^{37}, Predictive Integrated Structural Materials Science (PRISMS)^{38}, and Phase-Field hub (PFhub)^{39}. Some of the commonly used computational tools are Python Materials Genomics (PYMATGEN)^{40}, Atomic Simulation Environment (ASE)^{41}, Automated Interactive Infrastructure and Database (AIIDA)^{4} and MPinterfaces^{42}. The data most commonly included in these databases consists of crystal structures, formation energies, bandgaps, elastic constants, Poisson ratios, piezoelectric constants, and dielectric constants. These material properties can be used directly to screen for potentially interesting materials for a given application as candidates for experimental synthesis and characterization, as well as part of a PSPP design approach to better understand the factors driving material performance. Beyond the directly calculated material properties mentioned above, several selection metrics are also being developed to aid materials design, such as scintillation attenuation length^{43}, thermoelectric complexity factor^{44}, spectroscopy limited maximum efficiency^{45,46}, exfoliation energy^{6}, and spin-orbit spillage^{5,24,47}.
Akin to DFT-like standard computational approaches that are used as screening tools for experiments, machine learning (ML)^{12,13,14,48} models for materials design are being developed as pre-screening tools for other conventional computational methods such as DFT. In addition, ML tools are proposed to accelerate experimental methods directly based on computational data^{49}. All of the above developments show immense promise for accelerating materials design.
The principles mentioned above constitute the foundations of the Joint Automated Repository for Various Integrated Simulations (JARVIS) (https://jarvis.nist.gov) infrastructure, a set of databases and tools to meet some of the current material-design challenges. The main components of JARVIS are: JARVIS-DFT, JARVIS-FF, JARVIS-ML, and JARVIS-tools. JARVIS is developed and hosted at the National Institute of Standards and Technology (NIST) (Please note that commercial software is identified to specify procedures. Such identification does not imply recommendation by the National Institute of Standards and Technology) as part of the MGI. A detailed documentation webpage for the database is available at: https://jarvismaterialsdesign.github.io/dbdocs/.
Started in 2017, JARVIS-DFT^{5,6,23,24,25,28,29,45,49,50} is a repository based on DFT calculations that mainly uses the vdW-DF-OptB88 van der Waals functional^{51}. The database also uses beyond-GGA approaches for a subset of materials, including the Tran-Blaha modified Becke-Johnson (TBmBJ) meta-GGA^{52}, the hybrid functional PBE0, the hybrid range-separated functional Heyd-Scuseria-Ernzerhof (HSE06), Dynamical Mean Field Theory (DMFT), and G_{0}W_{0}. In addition to hosting conventional properties such as formation energies, bandgaps, elastic constants, piezoelectric constants, dielectric constants, and magnetic moments, it also contains previously unavailable datasets, such as exfoliation energies for van der Waals bonded materials, the spin-orbit coupling (SOC) spillage, improved meta-GGA bandgaps, frequency-dependent dielectric functions, the spectroscopy limited maximum efficiency (SLME), infrared (IR) intensities, electric field gradients (EFG), heterojunction classifications, and Wannier tight-binding Hamiltonians. These datasets are compared to experimental results wherever possible to evaluate their accuracy as predictive tools. JARVIS-DFT also introduced protocols such as automatic k-point convergence, which can be critical for obtaining precise and accurate results. JARVIS-DFT is distributed through the website: https://jarvis.nist.gov/jarvisdft/.
The JARVIS-FF^{25,53} database, also started in 2017, is a repository of classical force-field/potential computational data intended to help a user select the most appropriate force-field for a specific application. Many classical force-fields are developed for a particular set of properties (such as energies), and may not have been tested for properties not included in training (such as elastic constants or defect formation energies). JARVIS-FF provides an automatic framework to consistently calculate and compare basic properties, such as the bulk modulus, defect formation energies, phonons, etc., that may be critical for specific molecular-dynamics simulations. JARVIS-FF relies on DFT and experimental data to evaluate accuracy. JARVIS-FF is distributed through the website: https://jarvis.nist.gov/jarvisff/.
JARVIS-ML^{45,49,50,54,55} is a repository of machine learning (ML) model parameters, descriptors, and ML-related input and target data. JARVIS-ML introduced Classical Force-field Inspired Descriptors (CFID) in 2018 as a universal framework to represent a material’s chemistry-structure-charge related data. With the help of CFID and JARVIS-DFT data, several high-accuracy classification and regression ML models were developed, with applications to fast materials screening and energy-landscape mapping. Some of the trained property models include formation energies, exfoliation energies, bandgaps, magnetic moments, refractive indices, dielectric constants, thermoelectric performance, and maximum piezoelectric and infrared modes. Also, several ML interpretability analyses have provided physical insights beyond intuitive materials-science knowledge^{54}. These models, the workflow, the datasets, etc. are disseminated to enhance the transparency of the work. Recently, JARVIS-ML was expanded to include ML models to analyze STM images in order to directly accelerate the interpretation of experimental images. Graph convolution neural network models are currently being developed for automated handling of images and crystal-structure analysis in materials science. JARVIS-ML is distributed through the website: https://jarvis.nist.gov/jarvisml/.
JARVIS-tools is the underlying computational framework used for automation, data generation, data handling, analysis and dissemination of all the above repositories. JARVIS-tools uses cloud-based continuous integration, low software dependency, auto-documentation, Jupyter and Google Colab notebook integration, pip installation and related strategies to make the software robust and easy to use. JARVIS-tools also hosts several examples to enable a user to reproduce the data in the above repositories or to apply the tools for their own applications. JARVIS-tools is provided through the GitHub page: https://github.com/usnistgov/jarvis.
While JARVIS has some features in common with existing DFT-based computational databases, we note that there are several features currently unique to the JARVIS framework. First, JARVIS has a tight integration between FF and DFT techniques. Second, JARVIS includes CFID descriptors for ML and several ML models based on those descriptors, including models for solar-cell efficiency, thermoelectrics, exfoliation energies, infrared active modes, and refractive index. Finally, JARVIS-DFT itself features heavy use of a van der Waals functional, a 2D materials database, an STM image database, spin-orbit calculations, spin-orbit spillage, solar-cell efficiency, meta-GGA functional calculations, other post-GGA electronic structure calculations, a 2D heterostructure design app and a Wannier function database. We also provide a REST API framework for users to download and upload materials data using JARVIS-API.
This paper is organized as follows: (1) we introduce the main computational techniques, organized by time and length scales, (2) we illustrate JARVIS-tools and its functionalities, (3) we discuss the contents of the major JARVIS databases, (4) we demonstrate some of the derived applications, and (5) we discuss outstanding challenges and future work.
Results and discussion
Overview of computational techniques
There are many computational tools for simulating realistic materials depending on the time and length scales of interest^{56}. Before we discuss the details of JARVIS, we will provide a brief list of these techniques and highlight their range of applicability, as summarized in Fig. 1. Relevant techniques include quantum mechanical computations, classical/molecular mechanics, mesoscale modeling, finite element analysis, and engineering design. Each of these methodologies has its own ontology and semantics for describing itself and the PSPP relationship. For example, ‘structure’ may imply electronic configurations in the quantum regime, atomic arrangement in molecular mechanics, microstructure segments in phase-field-based mesoscale modeling, and mesh structure in finite element analysis. Material properties are calculated using corresponding physical laws such as the Schrödinger equation in the quantum regime, or Newton’s laws of motion for classical regimes. For realistic material design, it is important to integrate these methods. A major challenge for multiscale modeling is propagating the results of one simulation into another while capturing the relevant physics. Artificial Intelligence (AI) techniques have been applied in each of these domains and can be used to integrate the methods to a certain extent^{12}. In JARVIS, we primarily focus on atomistic-based classical and quantum simulations and machine learning, but we also attempt to integrate other simulation methods with our atomistic data for a few specific applications, such as using DFT-based elastic constants in orientation-distribution-function-based finite element simulations.
Software and databases
The JARVIS infrastructure (Fig. 2) is a combination of databases and tools for running and integrating some of the computational methods mentioned above. The general procedure for adding a dataset to JARVIS is as follows. We start with the goal of finding or designing a material to display or optimize a given property. Then, we decide on an appropriate computational method, as well as a computationally efficient way to screen for the best candidate materials. The screening process can proceed in several steps, with computationally inexpensive methods applied first, followed by more computationally intensive methods on the remaining materials. Whenever possible, the data is compared with available experiments to evaluate the accuracy and quality of the database. Once a large enough dataset is generated, machine learning techniques can be utilized to accelerate the traditional computational approaches.
As an example of making use of multiple computational tools within the same framework, we consider finding materials to maximize solar-cell efficiency. We develop a screening criterion (Spectroscopic Limited Maximum Efficiency, SLME, a part of JARVIS-tools) and calculate the necessary properties (dielectric function and band gap, a part of JARVIS-DFT). We test the method by comparing known materials to experiment (precision and accuracy assessment), and we perform more accurate meta-GGA and GW calculations (JARVIS-BeyondDFT) as additional screening and validation steps. Finally, we develop a machine learning model (JARVIS-ML) to accelerate future materials design. Details of this example can be found in refs. ^{45,46}. Similar case studies for thermoelectrics, dielectrics, and infrared phonon modes are available in ref. ^{50} and ref. ^{55}.
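The staged screening described above (inexpensive criteria first, costlier methods only on the survivors) can be sketched in a few lines. This is a minimal illustration, not the JARVIS workflow: the `Candidate` class, the score fields, the threshold of 0.2, and all numbers are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    name: str
    cheap_score: float      # hypothetical estimate from an inexpensive method
    expensive_score: float  # hypothetical estimate from a costlier method


def screen(
    candidates: Iterable[Candidate],
    stages: List[Callable[[Candidate], bool]],
) -> List[Candidate]:
    """Apply filters in order; later stages only see the survivors."""
    survivors = list(candidates)
    for keep in stages:
        survivors = [c for c in survivors if keep(c)]
    return survivors


# Placeholder data, not taken from any database.
pool = [
    Candidate("A", cheap_score=0.30, expensive_score=0.28),
    Candidate("B", cheap_score=0.05, expensive_score=0.02),
    Candidate("C", cheap_score=0.25, expensive_score=0.10),
]
stages = [
    lambda c: c.cheap_score >= 0.2,      # stage 1: inexpensive criterion
    lambda c: c.expensive_score >= 0.2,  # stage 2: costlier validation
]
selected = screen(pool, stages)
print([c.name for c in selected])
```

Ordering the stages by cost means the expensive criterion is evaluated only for candidates that already passed the cheap one.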
The database component of JARVIS consists of JARVIS-DFT for DFT calculations and JARVIS-FF for molecular dynamics simulations. JARVIS-ML hosts several machine learning models based on our datasets. JARVIS-tools contains tools for automating, post-processing and disseminating generated data, as well as several derived applications such as JARVIS-Heterostructure. We also include precision and accuracy analyses of the generated data, which consist of comparing DFT data with experiments, comparing FF data with DFT, comparing ML models with DFT, etc. As a lower-level technique (see Fig. 1), JARVIS-DFT data can be fed into JARVIS-FF and JARVIS-ML models, but not vice versa. We use JARVIS-ML to accelerate both JARVIS-DFT and JARVIS-FF. In this way, the JARVIS infrastructure establishes a joint integration for automation and generation of repositories. We provide several social-media platforms to build a community of interest. Some of the key resources for the JARVIS infrastructure are shown in Table 1.
JARVIS-tools
JARVIS-tools is a Python-based software package with ≈20,000 lines of code, consisting of several Python classes and functions. JARVIS-tools can be used for (a) the automation of simulations and data generation, (b) post-processing and analysis of generated data, and (c) the dissemination of data and methods, as shown in Fig. 3. It uses cloud-based continuous integration checks, including GitHub Actions, CircleCI, TravisCI, CodeCov, and a PEP8 linter, to maintain consistency in the code and its functionalities. JARVIS-tools is distributed through an open GitHub repository: https://github.com/usnistgov/jarvis.
An example Python class in JARVIS-tools is ‘Atoms’. It uses atomic coordinates, element types and lattice vectors to build an ‘Atoms’ object from which several properties, such as density and chemical formula, can be calculated. This ‘Atoms’ class, along with several other modules (discussed later), can be used for setting up calculations with external software packages. An example of the ‘Atoms’ class is shown in Fig. 4.
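To make the idea concrete, the following is a minimal, self-contained sketch of how a structure object can derive density and a reduced formula from lattice vectors, element types and coordinates. It is not the JARVIS-tools ‘Atoms’ class: the name `SimpleAtoms`, its attributes, and the one-entry mass table are hypothetical, and the silicon cell (lattice constant of about 5.43 Å) is used only as a familiar test case.

```python
from collections import Counter
from dataclasses import dataclass
from functools import reduce
from math import gcd
from typing import List, Sequence

# Illustrative mass table (atomic mass units); extend as needed.
ATOMIC_MASS_AMU = {"Si": 28.0855}
# Converts amu per cubic angstrom to grams per cubic centimeter.
AMU_PER_A3_TO_G_PER_CM3 = 1.66053906660


@dataclass
class SimpleAtoms:
    """Hypothetical stand-in for a structure object (not the JARVIS-tools class)."""

    lattice: Sequence[Sequence[float]]  # three lattice vectors in angstrom
    elements: List[str]                 # one chemical symbol per atom
    coords: Sequence[Sequence[float]]   # fractional coordinates, one per atom

    @property
    def volume(self) -> float:
        """Cell volume |a . (b x c)| in cubic angstrom."""
        a, b, c = self.lattice
        cross = (
            b[1] * c[2] - b[2] * c[1],
            b[2] * c[0] - b[0] * c[2],
            b[0] * c[1] - b[1] * c[0],
        )
        return abs(sum(x * y for x, y in zip(a, cross)))

    @property
    def density(self) -> float:
        """Mass density in g/cm^3."""
        mass = sum(ATOMIC_MASS_AMU[e] for e in self.elements)
        return mass * AMU_PER_A3_TO_G_PER_CM3 / self.volume

    @property
    def reduced_formula(self) -> str:
        """Formula with element counts divided by their common divisor."""
        counts = Counter(self.elements)
        divisor = reduce(gcd, counts.values())
        parts = []
        for element, n in sorted(counts.items()):
            reduced = n // divisor
            parts.append(element + (str(reduced) if reduced > 1 else ""))
        return "".join(parts)


h = 5.43 / 2  # half of the approximate cubic lattice constant of silicon
si = SimpleAtoms(
    lattice=[[0, h, h], [h, 0, h], [h, h, 0]],
    elements=["Si", "Si"],
    coords=[[0, 0, 0], [0.25, 0.25, 0.25]],
)
print(si.reduced_formula, round(si.density, 2))
```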
The ‘Atoms’ class, along with many other modules in JARVIS-tools, is used to generate input files for automating software codes. Currently, JARVIS-tools can be used to automate DFT calculations with packages such as the Vienna Ab initio Simulation Package (VASP)^{57,58} and Quantum Espresso (QE)^{59}; MD with the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)^{60}; ML with Scikit-learn^{61}, Keras^{62}, and LightGBM^{63}; and Wannier calculations with Wannier90^{64} and WannierTools^{65}. A number of predefined workflows are available in JARVIS-tools that are continuously being used to calculate properties of uncharacterized or existing materials in the database. Three workflows are shown in Fig. 5. For DFT calculations, an input ‘Atoms’ object is used to generate input files for VASP (Fig. 5a) with the ‘VaspJob’ class in order to calculate the desired properties, such as the energy. We automatically perform calculations to converge numerical parameters like the k-points and plane-wave cutoff for individual materials. Geometry optimization is then carried out with energy, force, and stress relaxation. We have chosen a particular set of pseudopotentials or PAWs as tested and recommended by the software developers of various codes. Subsequent properties, such as band structure, dielectric function, elastic constants, piezoelectric constants or spin-orbit spillage, are computed on the relaxed structure. Later, custom jobs can also be run on the optimized structure using ‘VaspJob’, such as Wannier90 calculations using the ‘Wannier90Win’ class, which generates the input files for an ‘Atoms’ object and a chosen set of pseudopotentials, disentanglement window and other controlling parameters. All of these steps produce a JavaScript Object Notation (JSON) file once the calculations are done, as a signature of their completion. The workflows can be restarted from intermediate computations, making the calculations robust to interruptions due to computer failure, etc.
We also add several error handlers in the workflows to automatically resubmit a calculation if a typical error is encountered.
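The restart and resubmission behavior described above can be illustrated with a small sketch in which a JSON file acts as the completion marker for each step. This is an assumption-laden toy, not the ‘VaspJob’ implementation: the function `run_step`, the `max_attempts` parameter, the choice of `RuntimeError` as the retried error, and the returned energy are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_step(workdir, name, compute, max_attempts=2):
    """Run `compute` unless `<name>.json` already marks the step as done.

    Returns (result, ran), where `ran` is False if the step was skipped.
    """
    marker = Path(workdir) / f"{name}.json"
    if marker.exists():
        # A previous run finished this step: reuse its stored result.
        return json.loads(marker.read_text()), False
    for attempt in range(1, max_attempts + 1):
        try:
            result = compute()
            break
        except RuntimeError:
            # Simple error handler: retry until attempts are exhausted.
            if attempt == max_attempts:
                raise
    marker.write_text(json.dumps(result))
    return result, True


calls = {"n": 0}


def flaky_relaxation():
    """Placeholder job that fails once, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated transient failure")
    return {"energy_ev": -10.5}  # placeholder value


with tempfile.TemporaryDirectory() as tmp:
    first, ran_first = run_step(tmp, "relax", flaky_relaxation)
    second, ran_second = run_step(tmp, "relax", flaky_relaxation)

print(ran_first, ran_second, calls["n"])
```

On the second call the marker file already exists, so the job is not executed again and the stored result is returned.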
A similar workflow for FF calculations based on LAMMPS is shown in Fig. 5b. Here, for a particular force-field such as Ni-Al^{53}, all the structures related to Ni, Al, and Ni-Al are obtained from the DFT database and converted into a LAMMPS input format using ‘Atoms’, ‘LammpsData’ and ‘LammpsJob’ objects. Then a series of geometry optimization, vacancy formation energy, surface energy, and phonon-related calculations are run, based on the symmetry of the structure. All of these steps use a set of “.mod” module files with input parameters that control the respective LAMMPS calculations. The obtained results are compared with corresponding DFT data to evaluate the quality of an FF for a particular system or simulation.
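One simple way to quantify such an FF-versus-DFT comparison is a mean absolute error over the structures both datasets share. The sketch below assumes this metric for illustration; the source does not prescribe it, and the bulk-modulus values are made-up placeholders rather than JARVIS data.

```python
def mean_absolute_error(reference, predicted):
    """Mean absolute difference over keys present in both mappings."""
    shared = sorted(set(reference) & set(predicted))
    if not shared:
        raise ValueError("no structures in common")
    return sum(abs(reference[k] - predicted[k]) for k in shared) / len(shared)


# Placeholder bulk moduli in GPa (illustrative only).
dft_bulk_modulus = {"Ni": 190.0, "Al": 70.0, "NiAl": 160.0}
ff_bulk_modulus = {"Ni": 180.0, "Al": 80.0, "NiAl": 165.0}

mae = mean_absolute_error(dft_bulk_modulus, ff_bulk_modulus)
print(f"MAE = {mae:.2f} GPa")
```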
In machine learning calculations, the input materials data is transformed into several machine-readable descriptors^{66} such as the CFID dataset or STM image ‘numpy’ arrays. Because we do not generate a separate set of data for testing ML models, we split the dataset into training and testing sets in a 90:10 or similar split. Using k-fold cross-validation, we obtain hyperparameters for the chosen algorithm, for example, the number of trees, learning rate, etc. in the case of Gradient Boosting Decision Tree (GBDT). We choose the optimized parameters, train on the 90% training data and test on the 10% test data to evaluate the true predictive performance on unseen data. We also carry out k-fold cross-validation using the finalized model to get model uncertainty. Later, we can analyze interpretability with techniques such as feature importance in tree-based algorithms or filters in neural networks. These models are saved using the Pickle, cPickle and Joblib modules for model persistence. We also carry out uncertainty analysis using methods such as prediction intervals and Monte-Carlo dropout^{67}. A few examples and Jupyter notebooks are provided on the GitHub page to illustrate the above-mentioned methods. More details about the individual Python modules mentioned above can be found in the JARVIS-tools documentation (https://jarvistools.readthedocs.io/en/latest/). Documentation on integrating JARVIS-tools with the database is available at https://jarvismaterialsdesign.github.io/dbdocs/.
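The split-then-cross-validate procedure above can be sketched with the standard library alone. To keep the example dependency-free, a k-nearest-neighbor regressor stands in for GBDT and the number of neighbors stands in for the tree hyperparameters; the synthetic data, the 5 folds, and the candidate values (1, 3, 5) are assumptions made for illustration.

```python
import random
from statistics import mean


def train_test_split(items, test_fraction=0.1, seed=0):
    """Shuffle reproducibly and hold out a fraction as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(items, k):
    """Yield (training, validation) pairs; each item validates exactly once."""
    for i in range(k):
        validation = items[i::k]
        training = [item for j, item in enumerate(items) if j % k != i]
        yield training, validation


def knn_predict(training, x, n_neighbors):
    """Average the targets of the n nearest training points (1D inputs)."""
    nearest = sorted(training, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return mean(y for _, y in nearest)


def mae(training, evaluation, n_neighbors):
    """Mean absolute error of the kNN stand-in model on `evaluation`."""
    return mean(
        abs(knn_predict(training, x, n_neighbors) - y) for x, y in evaluation
    )


# Synthetic (x, y) pairs; real use would take descriptors and DFT targets.
data = [(i / 10, (i / 10) ** 2) for i in range(100)]
train, test = train_test_split(data, test_fraction=0.1)

# Hyperparameter search with 5-fold cross-validation on the training set only.
cv_scores = {
    n: mean(mae(tr, va, n) for tr, va in k_fold(train, 5)) for n in (1, 3, 5)
}
best_n = min(cv_scores, key=cv_scores.get)

# Final evaluation on the held-out 10%, which was never used for tuning.
test_error = mae(train, test, best_n)
print(best_n, round(test_error, 3))
```

Keeping the test set out of the hyperparameter search is what makes the final error an estimate of performance on unseen data.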
After running the automated calculations, the data is post-processed to predict various material properties (such as bandgap, formation energy, spin-orbit spillage, SLME, density of states, phonons, dielectric function, or STM image). Many of the Python classes use ‘ToDict’ and ‘FromDict’ methods that help store the metadata. These metadata are then used with HTML^{68}, JavaScript, Flask^{69} and other related software to make webpages and web apps. The metadata is also shared in public repositories such as Figshare (https://figshare.com/authors/Kamal_Choudhary/4445539) and the JARVIS Representational State Transfer (REST) API, based on the MGI philosophy of creating and using interoperable datasets. Note that through the JARVIS REST API, a user can download JARVIS data and can also upload/store their own data. If the stored data follows the schema (in XSD format), then the API automatically generates HTML pages for the user’s data. The data generated in JARVIS is mainly stored in Extensible Markup Language (XML), JavaScript Object Notation (JSON), Comma-Separated Values (CSV) or American Standard Code for Information Interchange (ASCII) format and, again, JARVIS-tools can be used to analyze the pre-calculated data for materials design. A wrapper code for the REST API upload and download is available at (https://github.com/usnistgov/jarvis/blob/master/jarvis/db/restapi.py). An example of downloading a pre-calculated dataset with JARVIS-tools is shown in Fig. 4. JARVIS-tools, along with the various software shown in Fig. 3, has led to several databases shown in Fig. 6.
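The dictionary round trip that makes such metadata easy to store as JSON can be sketched as follows. The class `PropertyRecord`, its fields, the lowercase method names `to_dict` and `from_dict`, and the stored values are hypothetical and chosen only to illustrate the pattern; they are not the JARVIS-tools classes or schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PropertyRecord:
    """Hypothetical metadata record (not a JARVIS-tools class)."""

    identifier: str
    formula: str
    bandgap_ev: float

    def to_dict(self):
        """Convert the record to plain built-in types for serialization."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        """Rebuild a record from a dictionary produced by `to_dict`."""
        return cls(**data)


# Placeholder values for illustration.
record = PropertyRecord(identifier="example-1", formula="Si", bandgap_ev=1.1)

# Serialize to JSON text and restore an equivalent object from it.
text = json.dumps(record.to_dict(), sort_keys=True)
restored = PropertyRecord.from_dict(json.loads(text))
print(text)
```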
JARVIS-DFT
Density functional theory is one of the most commonly used techniques in condensed-matter physics to solve real-world materials problems. In DFT, instead of solving the fully interacting Schrödinger equation, we solve the Kohn-Sham equations, which describe an effective non-interacting problem, greatly improving computational efficiency. Although exact in principle, DFT requires several approximations in practice. In particular, various levels of approximation to the exchange-correlation functional are possible, which require different computational effort. Most existing DFT databases use the common GGA-PBE throughout all the material classes. JARVIS-DFT can be viewed as an attempt to build a repository beyond existing DFT databases. JARVIS-DFT^{5,6,23,24,25,28,29,45,49,50} was started in 2017 and contains data for ≈40,000 materials, with ≈1 million calculated properties, mainly based on the VASP package. Although there are several DFT functionals adopted in JARVIS-DFT, we use vdW-DF-OptB88 consistently for all the 3D, 2D, 1D, and 0D materials. This functional has been shown to provide accurate predictions for lattice parameters and energetics for both vdW and non-vdW bonded materials^{28}. In addition to hosting 3D bulk materials, the database consists of 2D monolayer, 1D-nanowire, and 0D-molecular materials (as shown in Table 2). However, to date, 3D and 2D materials have primarily been distributed publicly. Moreover, other exchange-correlation functionals are considered (as shown in Table 3), which can help estimate the prediction uncertainty. While vdW-DF-OptB88 can predict accurate lattice parameters and formation energies, bandgaps are still underestimated. Calculations with hybrid functionals (such as range-separated HSE06 and PBE0) and many-body approaches (such as G_{0}W_{0}) remain too computationally expensive^{21} to use in a high-throughput methodology for thousands of materials.
Hence, a meta-GGA Tran-Blaha-modified Becke-Johnson (TBmBJ) potential is used to provide a good balance between computational expense and accuracy. The TBmBJ accuracy has been shown to be close to that of high-level methods such as HSE06 at up to ten times lower computational expense^{52}. Accurate prediction of optical gaps by calculation of the frequency-dependent dielectric function is important for several applications, for example, solar-cell efficiency calculations. Accurate bandgaps also help in obtaining accurate frequency-dependent dielectric functions; however, TBmBJ cannot describe the excitonic nature of electron-hole pairs in low-dimensional materials. In addition to TBmBJ, we are generating HSE06, PBE0, G_{0}W_{0}, and DMFT datasets, which can be considered beyond-DFT methods and are discussed in the next section. Next, SOC is varied to analyze the differences introduced by this coupling. These differences are used to discover 3D and 2D topological materials. In addition, several DFT databases are developed including properties such as the frequency-dependent dielectric function and electric field gradient. A few important protocols such as automatic k-point convergence are also introduced. A snapshot of the JARVIS-DFT website along with a list of properties that are available is shown in Fig. 7. JARVIS-DFT has several filtering options on the website to screen candidate materials. We provide the input files as downloadable .zip files, especially for users who do not have much expertise in using Python-based codes. Raw input and output files (on the order of 1 terabyte) will soon be made publicly available through the Figshare repository, the NIST Materials Data Repository, and the Materials Data Facility (MDF). Summaries of the number of entries available with vdW-DF-OptB88 and other methods are shown in Tables 2 through 4.
Table 2, Table 3, and Table 4 provide a summary of the available materials classes, the DFT functionals used, and the materials properties available in the JARVIS-DFT database, respectively.
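The automatic k-point convergence protocol mentioned above can be pictured as a loop that increases a numerical parameter until the energy changes by less than a tolerance between successive steps. The sketch below is a generic illustration under stated assumptions: the function `converge`, the starting value, the step, the tolerance of 1e-3, and the exponentially decaying `toy_energy` model are hypothetical and do not reproduce the JARVIS-DFT settings or any DFT code.

```python
import math


def converge(energy_of, start, step, tolerance, max_iterations=50):
    """Increase a parameter until successive energies differ by < tolerance."""
    value = start
    previous = energy_of(value)
    for _ in range(max_iterations):
        value += step
        current = energy_of(value)
        if abs(current - previous) < tolerance:
            return value, current
        previous = current
    raise RuntimeError("parameter did not converge within max_iterations")


def toy_energy(kpoint_length):
    """Toy model: the energy approaches -5.0 as sampling gets denser."""
    return -5.0 + math.exp(-kpoint_length / 10.0)


kpoint_length, energy = converge(toy_energy, start=10, step=5, tolerance=1e-3)
print(kpoint_length, round(energy, 4))
```

The same loop applies to a plane-wave cutoff by swapping the parameter and the function that evaluates the energy.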
JARVIS-BeyondDFT
While quantum mechanical methods in single-particle theories such as DFT or DFT+U methods (mainly GGA) are fast and can predict accurate results for most structural parameters, even when relatively strong electron correlations are present, qualitative predictions of excited state properties may require beyond-DFT methods^{70}. Beyond-DFT calculations have been applied to many materials systems, including cuprates and Fe-based high-temperature superconductors, Mott insulators, heavy-fermion systems, semiconductors, photovoltaics, and topological Mott insulators^{70}. In the last few decades, both perturbative and stochastic approaches have been developed to understand these strongly correlated materials. These approaches, including Dynamical Mean Field Theory (DMFT)^{71}, the GW approximation, or hybrid exchange-correlation functionals, are often called beyond-DFT methods since they go beyond the limit of semi-local DFT. The materials design community often requires benchmarking for particular cases, where it is necessary to use beyond-DFT methods, in order to assess the accuracy of the results. In the JARVIS-BeyondDFT database we are building a database of spectral functions and related quantities as computed using meta-GGA, GW, hybrid functionals, and LDA+DMFT for head-to-head comparison on 100+ materials.
In the JARVIS-BeyondDFT^{70} database we try to answer a few key questions regarding discoveries through a materials database for quantum materials. First, where is it necessary to use a beyond-DFT method, and which method should be used? Second, how do different “beyond-DFT” methods compare with experiments? Target materials include but are not limited to various transition metal oxides, perovskites and mixed perovskites, nickelates, transition metal dichalcogenides, a wide range of metals from alkali metals to transition metals, and various iron-based superconductors. JARVIS-BeyondDFT will be distributed through the website: https://jarvis.nist.gov/jarvisbdft/.
JARVIS-FF
Classical force-field/interatomic-potential-based simulations are the workhorse technique for large-scale atomistic simulations. They are especially suited for temperature-dependent and defect-related phenomena. Several varieties of FFs differ based on the materials system and the underlying phenomena under investigation, e.g., whether they include bond-angle information and fixed or dynamic charges. Also, they are generally designed for particular applications and phases, making it difficult to ascertain whether they will perform well in simulations for which they were not explicitly trained. JARVIS-FF^{25,53} is a collection of LAMMPS calculation-based data consisting of crystal structures, formation energies, phonon densities of states, band structures, surface energies and defect formation energies. There are ≈110 FFs in the database, for which the corresponding crystal structures are obtained from JARVIS-DFT, converted to LAMMPS format inputs, and used in a series of LAMMPS calculations to produce the aforementioned properties. These properties, when compared with corresponding DFT data, can help a user analyze the quality of a force-field for a particular application. Examples include the comparison of the DFT convex hull with FF, elastic modulus, surface energy and vacancy formation energy data. Some types of FFs included are EAM, MEAM, bond-order and Tersoff, COMB, and ReaxFF, as shown in Table 5. Furthermore, we plan to include several recently developed machine learning force-fields in JARVIS-FF. A snapshot of the JARVIS-FF website is also shown in Fig. 8.
JARVIS-ML
Machine learning has several applications in materials science and engineering^{12,72,73}, such as automating experimental data analysis, discovering functional materials, optimizing known ones by accelerating conventional methods such as DFT, automating literature searches, discovering physical equations, and efficient clustering of materials and their properties. There are several data types that can be used in ML such as scalar data (e.g., formation energies, bandgaps), vector/spectra data (e.g., density of states, dielectric function, charge density, X-ray diffraction patterns, etc.), image-based data (such as scanning tunneling microscopy and transmission electron microscopy images), and natural language processing-based data (such as scientific papers). In addition, ML can be applied to a variety of materials classes such as bulk crystals, molecules, proteins and free surfaces.
Currently, there are two types of data that are machine-learned in JARVIS-ML^{45,49,50,54,55}: discrete and image-based. The discrete target is obtained from the JARVIS-DFT database for 3D and 2D materials. Several descriptors have been developed in attempts to capture the complex chemical-structural information of a material^{66}. We compute CFID descriptors for most crystal structures in various databases (as shown in Table 6). Many of these structures are non-unique but can still be used for pre-screening applications^{45}. The CFID can also be applied to other materials classes such as molecules, proteins, point defects, free surfaces, and heterostructures, which are currently ongoing projects. These descriptor datasets, along with JARVIS-DFT and other databases, act as inputs and outputs for machine learning algorithms. The CFID consists of 1557 descriptors for each material: 438 average chemical, 4 simulation-box-size, 378 radial charge-distribution, 100 radial distribution, 179 angle-distribution up to the first neighbor, another 179 for the second neighbor, 179 dihedral angle up to the first neighbor, and 100 nearest-neighbor descriptors. More details can be found in ref. ^{54}. Currently, we provide CFID descriptors only, but other descriptors such as the Coulomb matrix and sine matrix will be provided soon. With CFID descriptors, we trained several classification and regression models. Once these models are trained, their parameters are stored so that the properties of an arbitrary compound can be predicted quickly. We developed a web-based application to host the trained models, as shown in Fig. 9, and a list of the trained properties is displayed there as well. We note that classical quantities such as bulk modulus, maximum infrared (IR) active mode, and formation energies can be accurately trained, especially with regression models.
For other properties such as bandgaps, magnetic moments, piezoelectric coefficients, and thermoelectric coefficients, high-accuracy models are obtained for classification tasks only. In addition to the descriptor-based data, we develop Scanning Tunneling Microscopy (STM)^{49} image classification models that can be used to accelerate the analysis of STM data. The images are converted into black/white images to identify spots with/without atoms. The model’s accuracy is evaluated with respect to DFT data or experiments wherever applicable.
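As a quick consistency check on the CFID breakdown quoted above, the group sizes can be summed and turned into index ranges of a single descriptor vector. This is only a sketch: the group labels and the ordering of the groups inside the vector are assumptions made for illustration, and the actual layout is described in ref. ^{54}.

```python
# Descriptor counts per group, copied from the CFID breakdown in the text.
# The labels and the ordering are illustrative assumptions, not the
# layout used by JARVIS-tools.
cfid_groups = [
    ("average chemical", 438),
    ("simulation-box-size", 4),
    ("radial charge-distribution", 378),
    ("radial distribution", 100),
    ("angle distribution, first neighbor", 179),
    ("angle distribution, second neighbor", 179),
    ("dihedral angle, first neighbor", 179),
    ("nearest neighbor", 100),
]

# Total length of the descriptor vector.
total = sum(count for _, count in cfid_groups)

# Half-open index range [start, stop) of each group in a flat vector.
slices = {}
start = 0
for name, count in cfid_groups:
    slices[name] = (start, start + count)
    start += count

print(total)
for name, (lo, hi) in slices.items():
    print(f"{name}: [{lo}, {hi})")
```

The group sizes add up to the 1557 descriptors per material stated in the text.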
Derived apps
The knowledge developed through the above-mentioned databases and tools can serve as static content, and can also be accessed through dynamic user-defined inputs. Derived applications (apps) are designed to help a user analyze the combinatorics in the data. Based on the databases and tools discussed above, several apps are derived from JARVIS such as JARVIS-Heterostructure^{29}, JARVIS-WannierTB, and JARVIS-ODF. JARVIS-Heterostructure (as shown in Fig. 10a) can be used to characterize the heterojunction type and to model interfaces for exfoliable 2D materials. We classify these heterostructures into type-I, II, and III systems according to Anderson’s rule, which is based on the band-alignment with respect to the vacuum potential of non-interacting monolayers, obtained from JARVIS-DFT. The app also generates crystallographic positions for the heterostructure that could be used as input for subsequent calculations. JARVIS-WannierTB (as shown in Fig. 10b) can be used to solve Wannier tight-binding Hamiltonians on arbitrary k-points for 3D and 2D materials. Properties such as the band structure and the density of states can be predicted on the fly from this app. In addition, many other apps are being developed, which are primarily based on the Flask Python package^{69}.
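The type-I/II/III classification mentioned above can be illustrated with a short sketch. It assumes that the valence-band maximum (VBM) and conduction-band minimum (CBM) of each non-interacting monolayer are already known relative to the vacuum level; the function name, the tie-breaking at equal band edges, and the numerical values are hypothetical and are not taken from the JARVIS-Heterostructure code.

```python
def classify_heterojunction(vbm_a, cbm_a, vbm_b, cbm_b):
    """Classify a heterojunction by the alignment of two band gaps.

    All inputs are band-edge energies in eV relative to the vacuum
    level (typically negative numbers). Returns "I", "II" or "III".
    """
    if cbm_a < vbm_a or cbm_b < vbm_b:
        raise ValueError("CBM must not lie below VBM")

    # Type III (broken gap): the two gaps do not overlap at all.
    if cbm_a <= vbm_b or cbm_b <= vbm_a:
        return "III"

    # Type I (straddling): one gap lies entirely inside the other.
    a_contains_b = vbm_a <= vbm_b and cbm_b <= cbm_a
    b_contains_a = vbm_b <= vbm_a and cbm_a <= cbm_b
    if a_contains_b or b_contains_a:
        return "I"

    # Type II (staggered): the gaps overlap only partially.
    return "II"


# Hypothetical band edges (eV, relative to vacuum) for illustration.
print(classify_heterojunction(-6.0, -4.0, -5.5, -4.5))
print(classify_heterojunction(-6.0, -4.0, -5.0, -3.5))
print(classify_heterojunction(-6.0, -5.0, -4.5, -3.5))
```

Because Anderson’s rule uses the band edges of isolated monolayers, a sketch of this kind ignores interlayer charge transfer and interface dipoles.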
The JARVIS-ODF (Orientation Distribution Function) library, which is under development, aims to calculate volume-averaged (meso-level) material properties, including the elastoplastic deformation behavior, using the property data available for single crystals in the JARVIS database. Once generated, the JARVIS-ODF library will be capable of obtaining such material properties for all crystalline structures.
Accuracy and precision analysis
In simulations, accuracy refers to the degree of closeness between a calculated value and a reference value, which can be from an experiment or a high-fidelity theory. Precision refers to the degree of closeness between numerical approaches to solving a certain model, including the effect of convergence and other simulation parameters.
In JARVIS-DFT, the accuracy of the DFT data is obtained by comparing it to available experimental results (see Supplementary Tables 1–9). The accuracy of JARVIS-FF and JARVIS-ML, instead, is given with respect to DFT results. Note that the number of high-quality experimental measurements or high-fidelity calculations for a given property is often low. Therefore, the accuracy metrics we derive in our works are obtained only for the few cases we can directly compare, not for the entire dataset. In Table 7, we provide accuracy metrics for some material properties in JARVIS-DFT with respect to experiments. In addition to the scalar data, vector/continuous data, such as the frequency-dependent dielectric function and Scanning Tunneling Microscopy (STM) images, are compared to a handful of experimental data points as well. Details of individual properties can be found in refs. ^{6,28,45,48,49,50,54,55}.
JARVIS-FF data accuracy is calculated with respect to the DFT data, for properties such as the convex hull, bulk modulus, phonon frequencies, vacancy formation energies and surface energies. In refs. ^{25,53}, we showed this through several examples, including the comparison of the convex hulls of the Ni–Al and Cu–O–H systems to DFT data. We also showed examples of comparing defect formation energies, surface energies and their effects on the Wulff shape. Although these accuracy analyses are based on 0 K DFT data, they are useful in predicting temperature-dependent and dynamical behavior because we consider several crystal prototypes of a system.
JARVIS-ML model accuracy is evaluated on the test set (usually 10% of the data), which represents previously unseen DFT data, for both regression and classification models. The accuracy of regression and classification models is reported in terms of the mean absolute error (MAE) and the Receiver Operating Characteristic (ROC) Area Under Curve (AUC) metric, respectively. A brief summary of regression and classification model accuracy results is given in Tables 8 and 9. Details of the accuracy analyses are provided in refs. ^{45,49,50,54,55}.
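The two metrics named above can be written out in a few lines. The following is a minimal, dependency-free sketch with made-up numbers; it is not the evaluation code used for Tables 8 and 9, and in practice a library such as Scikit-learn would normally be used instead. The AUC is computed here through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """ROC AUC for binary labels (0/1) and real-valued scores.

    Counts, over all positive/negative pairs, how often the positive
    sample has the higher score; ties count as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up test-set values for illustration only.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(mae, auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, whereas the MAE carries the units of the predicted property.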
Precision analysis can refer to a wide variety of optional selections of simulation setups. Examples of precision analysis in JARVIS-DFT are our convergence protocols for k-points and plane-wave cutoff, and the convergence of Wannier tight-binding Hamiltonians. Using a converged k-point mesh and plane-wave cutoff^{26} for each individual material is necessary to obtain high-quality data. Note that these DFT convergences are carried out for the energies of the system only, and not for other properties. However, we impose tight convergence parameters for both k-points and energy cutoff (0.001 eV/cell), which typically results in other physical quantities being converged as well. In JARVIS-FF, comparisons across structure-minimization methods for calculating surface and vacancy formation energy values are examples of precision analysis^{25}. We find that the FF simulation setups (‘refine’ and ‘box’ methods) have minimal effect on the FF-based predictions. For classification ML models, precision is the ratio \(\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}\), where TP is the number of true positives and FP the number of false positives, both of which can be derived from the confusion matrix. Precision analysis for the classification ML model for STM Bravais lattices is available in ref. ^{49}. We find high precision (more than 0.87) for all of the 2D Bravais lattices. Precision analysis for regression tasks is still ongoing and will be available soon.
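To make the classification precision TP/(TP + FP) concrete, the sketch below derives it per class from a confusion matrix. It assumes the common convention that rows are true classes and columns are predicted classes; the matrix entries are invented for illustration and are not the STM results of ref. ^{49}.

```python
def per_class_precision(confusion):
    """Return TP / (TP + FP) for each class of a square confusion matrix.

    Assumed convention: confusion[i][j] is the number of samples whose
    true class is i and whose predicted class is j.
    """
    n = len(confusion)
    precisions = []
    for k in range(n):
        tp = confusion[k][k]
        # Everything predicted as class k, i.e. the sum of column k.
        predicted_k = sum(confusion[i][k] for i in range(n))
        fp = predicted_k - tp
        # Precision is undefined if the class was never predicted.
        precisions.append(tp / (tp + fp) if predicted_k else float("nan"))
    return precisions


# Invented two-class example: 8 + 2 samples of class 0, 1 + 9 of class 1.
example = [[8, 2],
           [1, 9]]
precisions = per_class_precision(example)
print(precisions)
```

For a multi-class problem such as the 2D Bravais lattices, the same function returns one precision value per lattice type.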
Future work
Given that the number of all possible materials^{74} could be of the order of 10^{100}, and furthermore that existing materials properties can be computed at increasing levels of accuracy/cost, the JARVIS databases will always be incomplete. This represents an opportunity for JARVIS to be drastically expanded in the future. Future work will be aimed at addressing some of the limitations of the existing databases, and may include additions such as defect/disorder properties, magnetic ordering, non-linear optoelectronics, more beyond-DFT calculations, temperature-dependent properties, integration with experiments, and more detailed uncertainty analysis. Moreover, several ML models and methods for data prediction and uncertainty quantification will be developed for ‘explainable AI’ (XAI) and transfer-learning (TL)-based research. Other derived apps such as JARVIS-ODF, JARVIS-BeyondDFT, JARVIS-GraphConv, and JARVIS-STM are also being developed. In addition to the technical aspects, the broader impact of the infrastructure will be to provide a research platform that will allow maximum participation of worldwide researchers. NIST-JARVIS currently hosts pre-computed data and would also host on-the-fly calculation resources. To make the data processing user-friendly, we have a few filtering options on the JARVIS-DFT website. Furthermore, advanced filtering tools will be available through the ElasticSearch package soon. ElasticSearch integration will allow cross-filtering among several databases. We are also working on integrating several visualization tools based on Plotly, JavaScript and XSLT, which will be available on the web soon.
In summary, we described the Joint Automated Repository for Various Integrated Simulations (JARVIS) platform, which consists of several databases and computational tools to help accelerate materials design and enhance industrial growth. JARVIS includes three major databases: JARVIS-DFT for density functional theory calculations, JARVIS-FF for classical force-field calculations, and JARVIS-ML for ML predictions. In addition, we provide JARVIS-tools, which is used to generate the databases. The generated data is provided publicly with several example notebooks, documentation and calculation examples to illustrate different components of the infrastructure. We believe the publicly available data and resources provided here will significantly accelerate future materials design in various areas of science and technology.
Methods
The entire study was managed, monitored, and analyzed using the modular workflow, which we have made available on our JARVIS-tools GitHub page (https://github.com/usnistgov/jarvis). Please note that commercial software is identified to specify procedures. Such identification does not imply recommendation by the National Institute of Standards and Technology.
Density functional theory calculations
The DFT calculations are mainly carried out using the Vienna Ab initio Simulation Package (VASP)^{57,58}. We use the projector augmented wave method and the OptB88vdW functional^{51}, which gives accurate lattice parameters for both van der Waals (vdW) and non-vdW solids^{28}. Both the internal atomic positions and the lattice constants are allowed to relax in spin-unrestricted calculations until the maximal residual Hellmann–Feynman forces on atoms are smaller than 0.001 eV Å^{−1} and an energy tolerance of 10^{−7} eV is reached. We do not yet consider magnetic orderings other than ferromagnetic because of the high computational cost. We note that nuclear spins are not explicitly considered during the DFT calculations. The list of pseudopotentials used in this work is given on the GitHub page. The k-point mesh and plane-wave cutoff were converged for each material using the automated procedure described in ref. ^{26}. The elastic constants are calculated using the finite difference method with six finite symmetrically distinct distortions. The thermoelectric coefficients such as the power factor and Seebeck coefficients are obtained with the BoltzTrap code with the Constant Relaxation Time approximation (CRTA)^{75}. Optoelectronic properties such as the dielectric function and solar-cell efficiency are calculated using linear-optics methods, mainly with OptB88vdW and TB-mBJ. We also compared such data with HSE06 and G_{0}W_{0}. The piezoelectric, dielectric and phonon modes at the Γ-point are calculated using Density Functional Perturbation Theory (DFPT). Topological spillage for identifying topologically non-trivial materials is calculated by comparing DFT wave functions with/without SOC^{5,24}. 2D exfoliation energies are calculated by comparing the bulk and 2D monolayer energies per atom. The 2D heterostructure^{29} behavior is predicted using the Zur and Anderson methods. Wannier tight-binding Hamiltonians are generated using the Wannier90 code^{64}. 2D STM images are predicted using the Tersoff–Hamann method^{49}.
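The exfoliation-energy comparison mentioned above amounts to a difference of per-atom energies. A minimal sketch follows, assuming the sign convention E_exf = E_2D/N_2D − E_bulk/N_bulk, so that a positive value means the monolayer is less stable than the bulk; the function name and the total energies are hypothetical and serve only to show the arithmetic.

```python
def exfoliation_energy_mev_per_atom(e_mono, n_mono, e_bulk, n_bulk):
    """Exfoliation energy in meV/atom from total energies in eV.

    Assumed convention: monolayer energy per atom minus bulk energy
    per atom, so a positive value is the energy cost of exfoliation.
    """
    if n_mono <= 0 or n_bulk <= 0:
        raise ValueError("atom counts must be positive")
    per_atom_mono = e_mono / n_mono
    per_atom_bulk = e_bulk / n_bulk
    return 1000.0 * (per_atom_mono - per_atom_bulk)


# Hypothetical total energies (eV) for a 3-atom monolayer cell
# and an 8-atom bulk cell.
e_exf = exfoliation_energy_mev_per_atom(-14.7, 3, -40.0, 8)
print(round(e_exf, 3))
```

Normalizing per atom allows the bulk and monolayer cells to contain different numbers of atoms.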
Force-field calculations
Classical force-field calculations are carried out with the LAMMPS software package^{60}. In our structure minimization calculations, we used 10^{−10} eV Å^{−1} for force convergence and 10,000 maximum iterations. The geometric structure is minimized by expanding and contracting the simulation box with the ‘fix box/relax’ command and adjusting atoms until they reach the force convergence criterion. These are commonly used computational setup parameters. After structure optimization, point vacancy defects are created using Wyckoff-position data. Free surfaces for maximum Miller indices up to 3 are generated. The defect structures were required to be at least 1.5 nm long in the x, y, and z directions to avoid spurious self-interactions with the periodic images of the simulation cell. We enforce the surfaces to be at least 2.5 nm thick, with 2.5 nm of vacuum in the simulation box. The 2.5 nm vacuum is used to ensure no self-interaction between slabs, and the slab thickness is used to mimic an experimental surface of a bulk crystal. Using the energies of perfect bulk and surface structures, surface energies for a specific plane are calculated. We should point out that only unreconstructed surfaces without any surface-segregation effects are computed, as our high-throughput approach does not yet allow for taking into account specific, element-dependent reconstructions. Phonon structures are generated mainly using the Phonopy package interface^{76}.
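The surface-energy step can be sketched with the standard slab expression γ = (E_slab − N_slab · E_bulk/N_bulk) / (2A), where the factor of 2 accounts for the two free surfaces of a slab in a periodic cell. This is a generic textbook formula given under that assumption, not code from JARVIS-tools, and all numerical inputs are hypothetical.

```python
# 1 eV/Å^2 expressed in J/m^2 (1 eV = 1.602176634e-19 J, 1 Å^2 = 1e-20 m^2).
EV_PER_A2_TO_J_PER_M2 = 16.02176634


def surface_energy_j_per_m2(e_slab, n_slab, e_bulk, n_bulk, area):
    """Surface energy in J/m^2 of a symmetric, unreconstructed slab.

    e_slab, e_bulk: total energies in eV of the slab and bulk cells.
    n_slab, n_bulk: number of atoms in each cell.
    area: surface area of one face of the slab in Å^2.
    """
    if area <= 0 or n_slab <= 0 or n_bulk <= 0:
        raise ValueError("area and atom counts must be positive")
    e_bulk_per_atom = e_bulk / n_bulk
    # Excess energy of the slab over the same number of bulk atoms,
    # shared between its two surfaces.
    gamma_ev_per_a2 = (e_slab - n_slab * e_bulk_per_atom) / (2.0 * area)
    return gamma_ev_per_a2 * EV_PER_A2_TO_J_PER_M2


# Hypothetical energies: 20-atom slab, 4-atom bulk cell, 25 Å^2 face.
gamma = surface_energy_j_per_m2(-95.0, 20, -20.0, 4, 25.0)
print(round(gamma, 6))
```

The expression assumes a single-element or stoichiometric slab with two equivalent faces; it does not apply as written to non-stoichiometric or asymmetric slabs.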
Machine learning training
Machine learning models are mainly trained using the Scikit-learn^{61}, Keras^{62} (TensorFlow backend), and LightGBM^{63} software. For DFT-generated scalar data such as formation energies, bandgaps, exfoliation energies, etc., the crystal structures are converted into a Classical Force-field Inspired Descriptors (CFID) input array and the DFT data is used as the target data, which is then train-test split in a ratio of 90:10. Preprocessing steps such as ‘VarianceThreshold’ and ‘StandardScaler’ are used before ML training. The performance of regression models is generally reported in terms of the Mean Absolute Error (MAE) or r^{2}, while that of classification models is reported using the Receiver Operating Characteristic (ROC) Area Under Curve (AUC) value, which lies between 0.5 and 1.0. Several other analyses such as feature importance, k-fold cross-validation and learning curves are carried out after the model training. The trained model is saved in pickle and joblib formats for model persistence. All the web apps are developed using JavaScript, Flask, and Django packages^{69}.
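The 90:10 split and the two preprocessing steps can be illustrated without external dependencies. The sketch below is a plain-Python stand-in for what Scikit-learn's ‘VarianceThreshold’ and ‘StandardScaler’ do, written under the assumption that both are fitted on the training split only; the helper names and the toy data are hypothetical and do not reproduce the JARVIS-ML training scripts.

```python
import random
from statistics import fmean, pstdev


def split_90_10(rows, targets, seed=0):
    """Shuffle the sample indices and hold out 10% as a test set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(0.1 * len(idx)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(ids):
        return [rows[i] for i in ids], [targets[i] for i in ids]

    return pick(train_idx), pick(test_idx)


def fit_preprocessor(train_rows, var_threshold=0.0):
    """Keep columns whose training variance exceeds a non-negative
    threshold and record their mean and standard deviation."""
    stats = []
    for j, col in enumerate(zip(*train_rows)):
        mu, sigma = fmean(col), pstdev(col)
        if sigma ** 2 > var_threshold:
            stats.append((j, mu, sigma))
    return stats


def transform(rows, stats):
    """Drop the low-variance columns and standardize the remaining ones."""
    return [[(row[j] - mu) / sigma for j, mu, sigma in stats] for row in rows]


# Toy descriptor table: 20 samples, 3 columns, the middle one constant.
rows = [[float(i), 1.0, float(i % 4)] for i in range(20)]
targets = [2.0 * i for i in range(20)]

(train_X, train_y), (test_X, test_y) = split_90_10(rows, targets)
stats = fit_preprocessor(train_X)
train_Z = transform(train_X, stats)
test_Z = transform(test_X, stats)
print(len(train_Z), len(test_Z), len(train_Z[0]))
```

Fitting the column statistics on the training split alone keeps information from the held-out 10% out of the model inputs.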
Data availability
JARVIS-related data is available at the JARVIS-API (http://jarvis.nist.gov), JARVIS-DFT (https://jarvis.nist.gov/jarvisdft/), JARVIS-FF (https://jarvis.nist.gov/jarvisff/), and JARVIS-ML (https://jarvis.nist.gov/jarvisml/) websites. The metadata is also available at the Figshare repository, see https://figshare.com/authors/Kamal_Choudhary/4445539.
Code availability
Python-based codes with examples are available at the JARVIS-tools page: https://github.com/usnistgov/jarvis.
References
Curtarolo, S. et al. AFLOWLIB. ORG: a distributed materials properties repository from highthroughput ab initio calculations. Comput. Mater. Sci. 58, 227–235 (2012).
Jain, A. et al. Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
Kirklin, S. et al. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. npj Comput. Mater 1, 15010 (2015).
Pizzi, G., Cepellotti, A., Sabatini, R., Marzari, N. & Kozinsky, B. AiiDA: automated interactive infrastructure and database for computational science. Comput. Mater Sci. 111, 218–230 (2016).
Choudhary, K., Garrity, K. F. & Tavazza, F. Highthroughput discovery of topologically nontrivial materials using spinorbit spillage. Sci. Rep. 9, 1–8 (2019).
Choudhary, K., Kalish, I., Beams, R. & Tavazza, F. High-throughput identification and characterization of two-dimensional materials using density functional theory. Sci. Rep. 7, 5179 (2017).
Draxl, C. & Scheffler, M. The NOMAD laboratory: from data sharing to artificial intelligence. J. Phys. Mater. 2, 036001 (2019).
Chung, Y. G. et al. Computationready, experimental metal–organic frameworks: a tool to enable highthroughput screening of nanoporous crystals. Chem. Mater. 26, 6185–6192 (2014).
Green, M. L. et al. Fulfilling the promise of the materials genome initiative with high-throughput experimental methodologies. Appl. Phys. Rev. 4, 011105 (2017).
HattrickSimpers, J. R., Gregoire, J. M. & Kusne, A. G. Perspective: composition–structure–property mapping in highthroughput experiments: turning data into knowledge. APL Mater. 4, 053211 (2016).
Zakutayev, A. et al. An open experimental database for exploring inorganic materials. Sci. Data. 5, 180053 (2018).
Vasudevan, R. K. et al. Materials science in the artificial intelligence age: highthroughput library generation, machine learning, and a pathway from correlations to the underpinning physics. MRS Commun. 9, 821–838 (2019).
Agrawal, A. & Choudhary, A. Perspective: materials informatics and big data: realization of the “fourth paradigm” of science in materials science. APL Mater. 4, 053208 (2016).
Schleder, G. R., Padilha, A. C., Acosta, C. M., Costa, M. & Fazzio, A. J. From DFT to machine learning: recent approaches to materials science–a review. J. Phys. Mater. 2, 032001 (2019).
Ceder, G. J. Opportunities and challenges for firstprinciples materials design and applications to Li battery materials. MRS Bull. 35, 693–701 (2010).
Xi, L. et al. Discovery of highperformance thermoelectric chalcogenides through reliable highthroughput material screening. J. Am. Chem. Soc. 140, 10785–10793 (2018).
Olson, G. B. & Kuehmann, C. Materials genomics: from CALPHAD to flight. Scr. Mater. 70, 25–30 (2014).
Aykol, M. et al. The materials research platform: defining the requirements from user stories. Matter 1, 1433–1438 (2019).
Callister, W. D. & Rethwisch, D. G. Materials Science and Engineering. Vol. 5 (John Wiley & Sons, NY, 2011).
de Pablo, J. J. et al. The materials genome initiative, the interplay of experiment, theory and computation. Curr. Opin. Solid State Mater. Sci. 18, 99–117 (2014).
Sholl, D. & Steckel, J. A. Density Functional Theory: A Practical Introduction. (John Wiley & Sons, 2011).
Perdew, J. P., Burke, K. & Ernzerhof, M. J. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865 (1996).
Choudhary, K. et al. Computational screening of highperformance optoelectronic materials using OptB88vdW and TBmBJ formalisms. Sci. Data 5, 180082 (2018).
Choudhary, K., Garrity, K. F., Jiang, J., Pachter, R. & Tavazza, F. Computational search for magnetic and nonmagnetic 2D topological materials using unified spin–orbit spillage screening. npj Comput. Mater 6, 1–8 (2020).
Choudhary, K. et al. High-throughput assessment of vacancy formation and surface energies of materials using classical force-fields. J. Phys.: Condens. Matter 30, 395901 (2018).
Choudhary, K. & Tavazza, F. Convergence and machine learning predictions of MonkhorstPack kpoints and planewave cutoff in highthroughput DFT calculations. Comput. Mater. Sci. 161, 300–308 (2019).
Cooper, M. et al. Development of Xe and Kr empirical potentials for CeO_{2}, ThO_{2}, UO_{2} and PuO_{2}, combining DFT with high temperature MD. J. Phys. 28, 405401 (2016).
Choudhary, K., Cheon, G., Reed, E. & Tavazza, F. Elastic properties of bulk and lowdimensional materials using van der Waals density functional. Phys. Rev. B 98, 014107 (2018).
Choudhary, K., Garrity, K. F., Pilania, G. & Tavazza, F. Efficient computational design of 2D van der Waals Heterostructures: bandalignment, latticemismatch, webapp generation and machinelearning. arXiv 2004, 03025 (2020).
Allen, M. P. & Tildesley, D. J. Computer Simulation of Liquids. (Oxford university press, 2017).
Kattner, U. R. Phase diagrams for leadfree solder alloys. JOM 54, 45–51 (2002).
Acar, P., Ramazani, A. & Sundararaghavan, V. Crystal plasticity modeling and experimental validation with an orientation distribution function for ti7al alloy. Metals 7, 459 (2017).
Castelli, I. E. et al. New light‐harvesting materials using accurate and efficient bandgap calculations. Adv. En. Mater. 5, 1400915 (2015).
Stevanović, V., Lany, S., Zhang, X. & Zunger, A. Correcting density functional theory for accurate predictions of compound enthalpies of formation: Fitted elementalphase reference energies. Phys. Rev. B 85, 115104 (2012).
Belsky, A., Hellenbrandt, M., Karen, V. L. & Luksch, P. New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design. Acta Cryst. Sect. B 58, 364–369 (2002).
Talirz, L. et al. Materials Cloud, a platform for open computational science. arXiv 2003, 12510 (2020).
Tadmor, E. B., Elliott, R. S., Sethna, J. P., Miller, R. E. & Becker, C. A. J. The potential of atomistic simulations and the knowledgebase of interatomic models. JOM 63, 17 (2011).
Aagesen, L. et al. Prisms: an integrated, opensource framework for accelerating predictive structural materials science. JOM 70, 2298–2314 (2018).
Wheeler, D. et al. PFHub: the phasefield community hub. J. Open Res. Softw. 7, 29–36 (2019).
Ong, S. P. et al. Python Materials Genomics (pymatgen): a robust, opensource python library for materials analysis. Comput. Mater Sci. 68, 314–319 (2013).
Larsen, A. H. et al. The atomic simulation environment—a Python library for working with atoms. J. Phys. Cond. Mat. 29, 273002 (2017).
Mathew, K. et al. MPInterfaces: a materials project based Python tool for highthroughput computational screening of interfacial systems. Comput. Mater Sci. 122, 183–190 (2016).
Setyawan, W., Gaume, R. M., Lam, S., Feigelson, R. S. & Curtarolo, S. Highthroughput combinatorial database of electronic band structures for inorganic scintillator materials. ACS Comb. Sci. 13, 382–390 (2011).
Gibbs, Z. M. et al. Effective mass and Fermi surface complexity factor from ab initio band structure calculation. npj Comput. Mater 3, 1–7 (2017).
Choudhary, K. et al. Accelerated discovery of efficient solarcell materials using quantum and machinelearning methods. Chem. Mater. 31, 5900 (2019).
Yu, L. & Zunger, A. Identification of potential photovoltaic absorbers based on firstprinciples spectroscopic screening of materials. Phys. Rev. Lett. 108, 068701 (2012).
Liu, J. & Vanderbilt, D. Spinorbit spillage as a measure of band inversion in insulators. Phys. Rev. B 90, 125133 (2014).
Jha, D. et al. Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning. Nat. Commun. 10, 1–12 (2019).
Choudhary, K. et al. Density functional theory and deeplearning to accelerate data analytics in scanning tunneling microscopy. arXiv 1912, 09027 (2019).
Choudhary, K., Garrity, K. & Tavazza, F. Datadriven discovery of 3D and 2D thermoelectric materials. J. Phys. 32, 47 (2019).
Klimeš, J., Bowler, D. R. & Michaelides, A. Chemical accuracy for the van der Waals density functional. J. Phys. Cond. Matt. 22, 022201 (2009).
Tran, F. & Blaha, P. Accurate band gaps of semiconductors and insulators with a semilocal exchangecorrelation potential. Phys. Rev. Lett. 102, 226401 (2009).
Choudhary, K. et al. Evaluation and comparison of classical interatomic potentials through a userfriendly interactive webinterface. Sci. Data 4, 1–12 (2017).
Choudhary, K., DeCost, B. & Tavazza, F. Machine learning with forcefieldinspired descriptors for materials: Fast screening and mapping energy landscape. Phys. Rev. Mater. 2, 083801 (2018).
Choudhary, K. et al. Highthroughput density functional perturbation theory and machine learning predictions of infrared, piezoelectric, and dielectric responses. npj Comput. Mater 6, 64 (2020).
Saito, T. Computational Materials Design. Vol. 34 (Springer Science & Business Media, 2013).
Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio totalenergy calculations using a planewave basis set. Phys. Rev. B 54, 11169 (1996).
Kresse, G. & Furthmüller, J. Efficiency of abinitio total energy calculations for metals and semiconductors using a planewave basis set. Comp. Mat. Sci. 6, 15–50 (1996).
Giannozzi, P. et al. QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials. J. Phys.: Condens. Matter 21, 395502 (2009).
Plimpton, S. Fast Parallel Algorithms for Shortrange Molecular Dynamics. (Sandia National Labs., Albuquerque, NM, 1993).
Pedregosa, F. et al. Scikitlearn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Gulli, A. & Pal, S. Deep learning with Keras. (Packt Publishing Ltd, 2017).
Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. Vol. 30, 3146–3154 (2017).
Mostofi, A. A. et al. wannier90: a tool for obtaining maximallylocalised Wannier functions. Comp. Phys. Comm. 178, 685–699 (2008).
Wu, Q., Zhang, S., Song, H.F., Troyer, M. & Soluyanov, A. A. WannierTools: an opensource software package for novel topological materials. Comput. Phys. Comm. 224, 405–416 (2018).
Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater Sci. 152, 60–69 (2018).
Hammer, B. & Villmann, T. ESANN. 79–90 (Citeseer).
Musciano, C. & Kennedy, B. HTML, the Definitive Guide. (O’Reilly & Associates, 1996).
Grinberg, M. Flask Web Development: Developing Web Applications with Python. (O’Reilly Media, Inc., 2018).
Mandal, S., Haule, K., Rabe, K. M. & Vanderbilt, D. Systematic beyond-DFT study of binary transition metal oxides. npj Comput. Mater 5, 1–8 (2019).
Kotliar, G. et al. Electronic structure calculations with dynamical meanfield theory. Rev. Mod. Phys. 78, 865 (2006).
Schmidt, J. et al. Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater 5, 1–36 (2019).
Ramprasad, R., Batra, R., Pilania, G., MannodiKanakkithodi, A. & Kim, C. Machine learning in materials informatics: recent applications and prospects. npj Comput. Mater 3, 1–13 (2017).
Walsh, A. The quest for new functionality. Nat. Chem. 7, 274–275 (2015).
Madsen, G. K. & Singh, D. BoltzTraP. A code for calculating bandstructure dependent quantities. Comp. Phys. Comm. 175, 67–71 (2006).
Togo, A. & Tanaka, I. First principles phonon calculations in materials science. Scr. Mater. 108, 1–5 (2015).
Acknowledgements
K.C., K.F.G., and F.T. thank the National Institute of Standards and Technology for funding, computational, and datamanagement resources. K.C. thanks the computational support from XSEDE computational resources under allocation number TGDMR 190095. Contributions from K.C. were supported by the financial assistance award 70NANB19H117 from the U.S. Department of Commerce, National Institute of Standards and Technology. Contributions by S.M., K.H., K.R., and D.V. were supported by NSF DMREF Grant No. DMR1629059 and No. DMR1629346. X.Q. was supported by NSF Grant No. OAC1835690. B.G.S. and S.V.K. acknowledge work performed at the Center for Nanophase Materials Sciences, a US Department of Energy Office of Science User Facility. A.A. acknowledges partial support by CHiMaD (NIST award # 70NANB19H005). G.P. was supported by the Los Alamos National Laboratory’s Laboratory Directed Research and Development (LDRD) program’s Directed Research (DR) project #20200104DR. K.C. thanks for helpful discussion with several researchers including Faical Y. Congo, Daniel Wheeler, James Warren, Carelyn Campbell, Chandler Becker, Marcus Newrock, Ursula Kattner, Kevin Brady, Lucas Hale, Eric Cockayne, Philippe Dessauw from National Institute of Standards and Technology; Karen Sauer, Igor Mazin, Nirmal Ghimire, Patrick Vora from George Mason University; Rama Vasudevan, Maxim Ziatdinov from Oak Ridge National Lab, Deyu Lu, and Matthew Carbone from Brookhaven National Lab; Marnik Bercx, Dirk Lamoen from University of Antwerp; Yifei Mo from University of Maryland; Anubhav Jain and Sinead Griffin from Lawrence Berkeley National Laboratory; Surya Kalidindi from Georgia Tech.; Tyrel McQueen and David Elbert from Johns Hopkins University; Richard Hennig from University of Florida; Giulia Galli and Ben Blaiszik from University of Chicago; Qiang Zhu from University of NevadaLas Vegas; Dilpuneet Aidhy from University of Wyoming; Susan B. Sinnott, Tao Liang from Pennsylvania State University.
Author information
Contributions
K.C. designed the JARVIS workflows, carried out high-throughput calculations and analysis, and developed the websites. F.T. contributed to the development of the k-point and other convergence protocols, BeyondDFT development and several other analyses. K.G. contributed to the development of the topological materials discovery and Wannier tight-binding Hamiltonian projects. A.C.E.R. assisted in the deployment of the web apps. B.D.C., A.A., and A.G.K. contributed to the machine-learning tasks. A.J.B., A.H.R., A.C., V.S., and A.D. contributed to the phonon data analysis. Z.T. contributed to the development of the JARVIS-API website. J.H.S. contributed to the experimental validation of some of the screened materials. J.J. and R.P. contributed to the solar-cell and topological materials discovery tasks. G.C., E.R., X.Q., H.Z., S.V.K., B.S., and G.P. contributed to the discovery and characterization of low-dimensional materials. P.A. contributed to the elastic constant analysis task. S.M., K.R., D.V., and K.H. contributed to the BeyondDFT project. All authors contributed to writing the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Choudhary, K., Garrity, K.F., Reid, A.C.E. et al. The joint automated repository for various integrated simulations (JARVIS) for data-driven materials design. npj Comput Mater 6, 173 (2020). https://doi.org/10.1038/s41524-020-00440-1