Introduction

The ability to design and refine materials has long underpinned the development of technology and infrastructure. Dating back to the introduction of stone (an era that ended not for lack of stone but because of a better material, bronze), bronze and iron (interestingly, iron was introduced because of its availability, not because it was a better material) are historical milestones that shaped the development of cultures and the rise and fall of civilisations. More recent materials milestones include Murano glass and Meissen porcelain, which enabled the rise of medieval trading, technological and economic powerhouses, and the quality of steel that generated the difference between Japanese and Chinese fencing styles and found deep reflections in culture. In all these cases, the natural availability of a particular raw material, and often serendipitous or brute-force know-how, made a significant imprint on cultures and determined the rate of progress and the destinies of peoples and countries. Materials shape and define the societies of their time: much as it is impossible to imagine a Samurai without a sword, it is now difficult to imagine a person without a cell phone.

The ever-increasing spectrum of functionalities required for developing and optimising materials fundamental to our modern needs1–3 (e.g., modern smart phones contain up to 75 different elements, compared with ~30 in their twentieth-century predecessors) requires efficient paradigms for materials discovery and design that go beyond serendipitous discovery and the classical synthesis–characterisation–theory approach.4 This is because the structures and compositions underpinning next-generation materials are not well understood, much less the pathways to synthesise them.5–11

As an example of the current problem, MgB2 was known to exist for decades, but only recently was it discovered to be a superconducting material. Similarly, although a huge family of layered materials existed, only after the discovery of how to exfoliate graphite did it become possible to even consider monolayer materials, leading to novel discoveries such as monolayer Fe-chalcogenides (the highest Tc among all Fe-based superconductors), functionalised monolayer h-BN (piezoelectric), monolayer MoS2 (tunable band gap with high conductivity) and so on. Another critical example comes from complex oxides, the building blocks of many real-world devices. Bulk oxides with couplings between many degrees of freedom have been studied for many years in the hope of finding a true multiferroic device12–14—one in which an electric field can manipulate magnetic properties, or a magnetic field can switch the internal electric dipole or polarisation. Such materials would offer numerous benefits, such as the possibility of next-generation, low-power-consumption storage devices.15–18 However, only one material, BiFeO3, exhibits robust multiferroic properties at room temperature (and it was discovered serendipitously).19–22 Recent efforts have been geared towards harnessing emergent phenomena at oxide heterostructure interfaces to build transistors and next-generation storage devices.23–26 Thin films and heterostructure interfaces offer a wide variety of controllable variables that (in principle) permit design—choice of ‘parent’ material (or multiple materials for heterostructures), strain in the thin-film form, composition changes at surfaces and interfaces (including defects such as oxygen vacancies) and dopant control of carrier concentrations.

Addressing these complex issues will require integrated and direct feedback from multi-scale functional measurements to theory, and must allow real-time and archival experimental data to be incorporated effectively. Although computational approaches have recently allowed screening of the bulk properties of materials with structures in existing inorganic (or organic) crystal structure databases, these efforts often lack a concerted effort to understand how a particular functionality arises in a compound (or a set of compounds), to use this knowledge to discover new materials, and to provide a means of accessing synthesis pathways. At least three major gaps need to be addressed to enable more rapid progress in designing materials with desired properties.

First, there is the need for enhanced reliability of computational techniques, such that they can accurately (and rapidly) address the above complex functionalities, provide the precision necessary for discriminating between closely competing behaviours, and achieve the length scales necessary to bridge across features such as domain walls, grain boundaries and gradients in composition. By and large, currently used theory-based calculations lack reliable accuracy and cannot treat the multiple length and time scales required. Real materials are far more complicated than the simple structures often studied in small periodic unit cells by electronic structure calculations. The large length and time scales, as well as finite temperatures, make even density functional theory (DFT) calculations for investigating materials under device-relevant conditions prohibitively expensive. Grain boundaries, extended defects and complex heterostructures further complicate this issue.

Second, there is a need to take full advantage of all of the information contained in experimental data to provide input into computational methods for predicting and understanding new materials. This includes efficiently integrating data from different characterisation techniques to provide a more complete perspective on materials structure and function. For example, scanning probe microscopy reveals position-dependent functionality, whereas transmission electron microscopy provides position-dependent electronic structure information—including behaviour near the above-mentioned defects that directly affect materials functionality. Techniques such as inelastic neutron scattering allow direct measurements of space- and time-dependent response functions, which in principle can be compared directly with theoretical calculations. However, many of these techniques come with extreme requirements for the full analysis and utilisation of instrumental data. For example, modern time-of-flight spectroscopy produces huge data sets and, as in microscopy, generates enormous amounts of potentially insightful experimental data that remain largely unreported and unused.

Third, pathways for making materials need to be established. Although synthesis pathways are in general the least amenable to theoretical exploration and rely primarily on the expertise of individual researchers, big-data analytics applied to the existing body of knowledge on synthesis can suggest general correlations between materials properties and synthetic routes, thereby suggesting specific research directions.

In this perspective, we outline a strategy for bridging these critical gaps through the utilisation of petascale quantum simulation, data assimilation and data analysis tools for functional materials design, in an approach that includes uncertainty quantification and experimental validation (Figure 1). Changing the past synthesis→characterisation→theory paradigm requires addressing these gaps as a scientific community, rather than relying upon tools used within a specific group, and further necessitates the development of efficient community-wide tools and data, including scientometric and text analytics.27–30 Ideally, these tools need to be easy to use and, to the greatest extent possible, freely available to the scientific community, within a framework that allows the community to contribute. We argue that transformative breakthroughs in materials design will depend on the availability of high-end simulation software and data analytics approaches that can fully utilise unparalleled resources in computing power and availability of data.

Figure 1

Integrating community-based software with big-deep data approaches to bridge state-of-the-art experiments for materials by design.

Development and integration of theoretical tools

During the past couple of decades, theory and simulation have been reasonably successful in predicting single materials functionalities. However, the Materials Genome Initiative31 calls for understanding, designing and predicting materials with ‘competing functionalities’ that often give rise to inhomogeneous ground states and chemical disorder. From occupancy of equivalent or weakly nonequivalent positions to chemical phase separation in physically inhomogeneous systems (manganites, high-temperature superconductors (HTSCs)), we must be able to treat these mixed degrees of freedom. This could provide a route for rapid first-principles screening of realistic materials structures incorporating the complexities of real materials, such as extended defects and impurities, as well as the complexities arising from electron correlation and proximity to phase transitions. Achieving this goal will require foundational quantum approaches capable of performing at peta- or even exascale levels, a framework for high-throughput calculations of multiple configurations and structures, and extensive data assimilation, validation and uncertainty quantification capabilities.

First-principles computational approaches

Starting with DFT, current methods generally fall into two groups: those using pseudopotentials and all-electron methods. Pseudopotential methods include plane-wave-based codes (QUANTUM ESPRESSO32, VASP21, Qbox33 and ABINIT34), where the wave functions are expanded in a linear combination of plane waves and the Kohn–Sham equation is solved in Fourier space. Alternatively, the wave functions can be expanded in a linear combination of localised orbitals (e.g., atomic orbitals as the basis set in the SIESTA package35); Gaussian-based packages, mainly used in the chemistry community, employ Gaussians or other localised functions as the basis set (e.g., GAMESS36, NWChem37 and GAUSSIAN38). For all-electron methods, a direct analogy is the methods based on augmented plane waves,39,40 versus methods based on radial basis sets.39,41–43 In addition, multiple-scattering theory allows the formulation of all-electron Korringa–Kohn–Rostoker methods44 that do not rely on the expansion of the orbitals into basis functions, but directly calculate the Green’s function of the electrons in the system. In another group, the wave functions are represented directly on real-space grids. This approach is used in the real-space multigrid45,46, PARSEC47 and GPAW48 codes. In general, real-space codes can be scaled more easily to a large number of message passing interface (MPI) processes using domain decomposition, because the required MPI communications are mostly among neighbouring nodes. In plane-wave codes, the required fast Fourier transforms involve all-to-all communications, which are hard to scale to a large number of nodes. At present, all major DFT packages have been adapted to work on parallel architectures at some level, but generally not at the petascale level or on hybrid architectures.
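As a toy illustration of the real-space-grid representation mentioned above (and not of any of the production codes listed), the following sketch solves a single-particle Schrödinger equation on a one-dimensional grid with a local finite-difference kinetic operator; it is this locality that lets real-space codes rely mostly on nearest-neighbour communication under domain decomposition. The potential and grid are arbitrary choices made only for the example.

```python
# Toy real-space-grid solve (not a DFT code): the kinetic operator is a local
# 3-point stencil, so each grid point couples only to its neighbours.
import numpy as np
from scipy.linalg import eigh_tridiagonal

n, box = 400, 20.0                    # grid points, box length (atomic units)
x = np.linspace(-box / 2, box / 2, n)
h = x[1] - x[0]

v = 0.5 * x**2                        # harmonic test potential
diag = 1.0 / h**2 + v                 # -(1/2) d^2/dx^2 via finite differences
off = -0.5 / h**2 * np.ones(n - 1)

energies, orbitals = eigh_tridiagonal(diag, off)
print(energies[:3])                   # ~0.5, 1.5, 2.5 for the harmonic well
```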

To make the transformative advances needed to achieve materials by design, another level of computational capability and validated accuracy would be valuable. However, one could argue that it is not as simple as enabling larger-scale calculations: it may require a ‘disruptive’ advance in theory, such as the one brought about by DFT itself. One possibility might be a reformulation of ab initio computational methods to use many-electron basis states instead of one-electron basis states; many-body theories and computational methods that overcome the limitations of current one-electron basis approaches could be revolutionary. Further work along these lines is clearly needed.

Turning back to current forms of DFT, one way forward is to build on real-space codes that have demonstrated petascale performance.45–50 However, it will be important that these efforts include a robust common user interface for community use, as well as the new capabilities necessary to enable treatment of important materials functionalities. The bridge between such atomistic computations and timely predictions might be realised by training complex models on the copious amounts of atomistic computations, linked with the big data available from imaging, spectroscopy and bulk synchrotron X-ray and neutron measurements. Although such machine learning-based models are slowly beginning to make their way into materials science, either in the form of machine-learned exchange-correlation functionals or of supervised learning to predict the electronic properties of molecular systems, more work is required to combine these techniques with experimental data to improve their predictive capability.51–59
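As a hedged illustration of the kind of machine-learning surrogate described above, the sketch below trains a kernel ridge regression model on synthetic (descriptor → property) pairs standing in for atomistic calculations; the features, target and hyperparameters are placeholders, not a reproduction of any of the cited studies.

```python
# Surrogate-model sketch: kernel ridge regression on synthetic descriptor data.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 4))          # e.g., composition/structure descriptors
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)   # stand-in 'band gap'

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=2.0).fit(X_tr, y_tr)
print("test R^2:", round(model.score(X_te, y_te), 2))   # fast predictions once trained
```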

Accurate models of microstructure will be required in order to predict across the scales relevant to materials under device-relevant conditions. Treating the longer length scales can often be achieved with tight-binding simulations,60–65 effective Hamiltonians66–70 or suitably tailored reactive and/or coarse-grained models. The value of tight binding and effective Hamiltonians is that they are easily parameterised from first-principles calculations and preserve the ability to resolve electronic structure and chemical processes. For example, these approaches can address whether predicted zero-Kelvin structures and magnetic moments/ferroelectric polarisations persist to higher temperatures and, if not, whether the materials remain useful.66,69,71,72 One can also study domain wall dynamics, including domain wall motion and energies as a function of simulation temperature, as well as oxygen vacancies and their segregation to domain walls, as required for enabling device applications and for understanding switching and fatigue.68,73–79 To enhance the computational abilities of effective Hamiltonian methods, implementation of the Wang–Landau (WL) Monte Carlo80 and Hybrid Monte Carlo81 algorithms shows good promise. Combined with the use of massively parallel supercomputers, this can enable calculations of finite-temperature properties using supercells containing up to 10⁸–10⁹ atoms. Such capabilities allow the study of complex effects, with the potential to discover revolutionary complex (e.g., disordered) materials for which THz electro-optic coefficients (related to the change of the refractive index in response to an applied dc electric field) and/or THz dielectric tunability could be optimised. Large electro-optic coefficients and/or large dielectric tunability could improve the design of, e.g., efficient communication devices such as phase arrays, phase shifters and external modulators operating in the THz range.
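To make the Wang–Landau idea concrete, here is a minimal sketch on a toy two-dimensional Ising lattice rather than the effective Hamiltonians cited above; the flat-histogram random walk estimates the density of states g(E), from which finite-temperature averages follow without further sampling. Lattice size, flatness criterion and stopping threshold are illustrative choices.

```python
# Minimal Wang-Landau sketch on a tiny 2D Ising lattice (toy model only).
import numpy as np

rng = np.random.default_rng(0)
L = 4                                                   # tiny lattice for illustration
spins = rng.choice([-1, 1], size=(L, L))

def energy(s):
    """Nearest-neighbour ferromagnetic Ising energy with periodic boundaries."""
    return int(-np.sum(s * (np.roll(s, 1, 0) + np.roll(s, 1, 1))))

E_levels = np.arange(-2 * L * L, 2 * L * L + 1, 4)      # allowed energy bins
index = {int(E): i for i, E in enumerate(E_levels)}
log_g = np.zeros(len(E_levels))                         # running estimate of ln g(E)
hist = np.zeros(len(E_levels))
f = 1.0                                                 # ln of the modification factor
E = energy(spins)

while f > 1e-4:
    for _ in range(20000):
        i, j = rng.integers(L, size=2)                  # propose a single spin flip
        dE = 2 * int(spins[i, j]) * int(spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                                        + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        if np.log(rng.random()) < log_g[index[E]] - log_g[index[E + dE]]:
            spins[i, j] *= -1                           # accept with prob g(E_old)/g(E_new)
            E += dE
        log_g[index[E]] += f
        hist[index[E]] += 1
    if hist[hist > 0].min() > 0.8 * hist[hist > 0].mean():   # crude flatness test
        hist[:] = 0
        f /= 2                                          # refine the estimate of g(E)

# With ln g(E) in hand, finite-temperature averages follow without re-simulation.
visited = log_g > 0
T = 2.5                                                 # example temperature (J = kB = 1)
lw = log_g[visited] - E_levels[visited] / T
lw -= lw.max()
E_avg = float(np.sum(E_levels[visited] * np.exp(lw)) / np.sum(np.exp(lw)))
print("estimated <E> per site at T = 2.5:", round(E_avg / (L * L), 3))
```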

Integration with big-data approaches to bridge with experiments

To move beyond these theory-based avenues for improved capabilities in accuracy and scale, we must begin to fully utilise the spectacular progress over the past 10 years in imaging and in X-ray and neutron scattering, which together provide quantitative structural and functional information from the atomic scale to the relevant mesoscales where real-world functionality often emerges. However, work is required to harness this progress, as it frequently involves complex multidimensional data that require the statistical, decorrelation, clustering and visualisation techniques generally referred to as big-data approaches. A workflow for data and imaging analysis must be established that provides the relevant atomic and magnetic configurations, as well as the response behaviours of materials, for direct input into first-principles simulations, with subsequent refinement of theoretical parameters via iterative feedback. This approach will also need to be extended to interactive knowledge discovery that ultimately can deliver direct manipulation and design of materials. We suggest that data from state-of-the-art imaging, spectroscopy and scattering approaches, integrated with theory, can significantly improve the quality and rate of theoretical predictions, thereby accelerating the design and discovery of functional materials.82

In terms of real-space imaging as a quantitative structural and functional tool, recent progress in high-resolution techniques such as scanning transmission electron microscopy (STEM),83–85 scanning tunnelling microscopy (STM)86,87 and non-contact atomic force microscopy (AFM) has allowed direct imaging of atomic columns (with STEM) and surface atomic structures (with STM/AFM). Going beyond the detection of atomic positions, high-resolution STEM, STM and non-contact AFM provide the real-space electronic structure of materials and surfaces, visualising molecular vibrations and phonons, complex electronic phenomena in graphene and high-temperature superconductors88,89 and even chemical bonds—thus relating structural parameters to local functionality. Just as theoretical predictions are moving beyond the description of a material as a repeated unit cell in a perfect crystal, experimental techniques are capable of detecting local disorder. Ptychography, whether X-ray- or STEM-based, provides a wealth of information in k-space, often with ~10 nm (X-ray) or atomic (STEM) spatial resolution. Scanning, electron, X-ray and neutron probes enable a broad range of spectroscopies and provide information on local electronic, dielectric and chemical properties (e.g., electron energy loss spectroscopy in STEM, EXAFS or dichroism in X-rays, phonons in inelastic neutron scattering); the resulting data sets are multidimensional. Using this structural and functional information offers a unique opportunity to explore structure–property relationships at the level of a single atom and chemical bond, thereby linking local electronic, magnetic and superconducting functionalities to local bond lengths and angles.

Finally, microscopy tools can be used to arrange atoms in desired configurations, controlling matter via current- and force-based scanning probes90 and electron beams,91 opening a pathway to explore non-equilibrium and high-energy states generally unavailable in macroscopic form but often responsible for materials functionalities.92 This can complete a full loop in the materials discovery and design cycle, from exquisite observation to theory-based prediction and finally to experimental control of materials. Combining atomic-scale information from imaging with mesoscale and dynamical information from neutrons, and chemical information from time-of-flight secondary ion mass spectrometry (such as AFM-TOF-SIMS)—which probes deeper into the bulk of the material and can access buried interfaces—via modelling provides a comprehensive approach to understanding the complex behaviour of materials.

A more enabling aspect of theory–experiment matching is to improve a theoretical model given experimental observations. On the qualitative level, this is what imaging provides—namely, direct observation of atomic configurations that gives more information on local structures, defects, interfaces and so on (Figure 2). On a semi-quantitative level, the numerical values of observed atomic spacings can indicate the incompleteness of the model—e.g., the presence of (invisible) light atoms93 or vacancies.94 However, the potential of quantitative studies—i.e., improving the parameters of the mesoscopic or quantum theory based on such high-quality experimental observations of multiple spatially distributed degrees of freedom—remains virtually unexplored.

Figure 2

Quantitative imaging allows access to a broad range of local structure–property relationships within each sample, effectively probing a combinatorial library of defects and functionalities. Whereas macroscopic properties yield information on single points in chemical space, local measurements can provide similar information for multiple points in chemical space. This information, when suitably processed, can be directly linked to theoretical simulations to enable effective exploration of material behaviours and properties. Furthermore, knowledge of the extant defect configurations in a solid can significantly narrow the spectrum of atomic configurations to be probed theoretically, precluding the exponential growth of the number of possible configurations with system size.

In reciprocal space (k-space), the integration of big-data analytics with the unique capabilities of neutron- and X-ray-based imaging and spectroscopy of the magnetic, inelastic and vibrational properties of materials offers equally interesting and enabling capabilities. For example, existing instrument suites for neutron scattering enable the dynamical response of materials to be mapped comprehensively across wide temporal and spatial scales. Data sets that are quantitative and complete in terms of coverage of the full frequency scale of the system and of wavevectors spanning the entire Brillouin zone provide a broad resource for experimentally validating computation. Indeed, the dynamics of well-defined and highly characterised systems provide very stringent tests of approximations, as the dispersion, intensity and line shapes are sensitive to the orbital overlaps, electron correlations and level of itinerancy. Further, a quantitative description of the dynamical response is directly related to the transport and functional properties of the materials. First-principles calculations of the inelastic neutron scattering cross-section, e.g., the spin response S(q,ω), can be directly compared with measurement. By providing a comprehensive set of tools at different levels of approximation, there is potential to considerably extend the scope of analysis beyond the more localised magnetic systems commonly studied to the itinerant behaviours for which such data can offer new and essential insight.
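As a small, hedged example of such a theory–experiment comparison, the sketch below fits the exchange constant of a one-dimensional Heisenberg ferromagnet (H = −J Σ S_i·S_{i+1}, for which linear spin-wave theory gives ħω(q) = 2JS(1 − cos qa)) to invented magnon peak positions of the kind extracted from an inelastic neutron spectrum; the 'measured' values below are placeholders, not real data.

```python
# Fit an exchange constant J to hypothetical magnon peak positions.
import numpy as np
from scipy.optimize import curve_fit

S = 0.5                                        # assumed spin per site

def magnon_energy(q, J):
    """Linear spin-wave dispersion (meV); q is in units of 1/a."""
    return 2.0 * J * S * (1.0 - np.cos(q))

q_meas = np.array([0.2, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])        # hypothetical
E_meas = np.array([0.4, 2.4, 9.0, 18.5, 27.8, 35.5, 39.2])    # hypothetical peak centres (meV)
E_err = 0.5 * np.ones_like(E_meas)

(J_fit,), cov = curve_fit(magnon_energy, q_meas, E_meas, sigma=E_err, p0=[10.0])
print(f"J = {J_fit:.1f} +/- {np.sqrt(cov[0, 0]):.1f} meV")
```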

Completing the materials by design loop

Both real-space and k-space imaging can provide information that significantly improves material predictions. Indeed, real- and k-space observations can reveal the predominant types of defects and atomic configurations in a material, which can be used to narrow the range of theoretically explored atomic configurations. Similarly, local structure–property measurements can be used to verify and improve theoretical models, as discussed elsewhere.82 However, an even bigger challenge is offered by comparing theory with experiment at the level of microscopic degrees of freedom.

An efficient computational interface between big data, experiment and theory that can serve as an effective tool for advancing materials by design can benefit from a descriptor-based approach (Figure 3). Descriptors are functions of calculable microscopic quantities (e.g., formation energies, band structure, density of states or magnetic moments) that connect experimentally measurable quantities to local or macroscopic properties. Descriptors can represent macroscopic properties of materials such as mobility, susceptibility or critical temperature. Therefore, a microscopic understanding of the underlying physical processes and mechanisms is a prerequisite for defining a physically and structurally meaningful descriptor. A descriptor of a given materials property can be found either by physical intuition or by data mining, such as machine learning and clustering analysis. Unfortunately, the concept of descriptors has so far been well established for only a handful of problems, such as predicting the crystal structure of an alloy, searching for topological insulators, estimating a crystal’s melting point and finding the optimal composition of heterogeneous catalysts.51,95 Successful examples of descriptors include the formation enthalpy as a descriptor for determining the thermodynamic stabilities of binary and ternary compounds,52 the spectroscopic limited maximum efficiency as a descriptor for evaluating the performance of solar materials,96 and the figure of merit, ZT, as a descriptor for evaluating the performance of thermoelectric materials. We therefore suggest defining a descriptor based on the experimental measurables and big data as an initial reference point, and then utilising the descriptor to guide the required computational accuracies and to quantify uncertainties for a given experimental condition. Figure 4 illustrates this scheme. Each microscopic quantity (Mi) can be calculated using ab initio quantum mechanical simulation approaches, ranging from DFT and time-dependent DFT to beyond-DFT approaches. Each approach has many parameters (Ci); these Cis determine the exchange-correlation functional, pseudopotentials, basis sets, energy cutoff and so on. Beyond-DFT approaches require particular attention to their performance as a function of these Cis because of their perturbative nature. Each microscopic quantity Mi can be presented in the configuration space of computational approaches,97 where each Ai is composed of a set of Cis. These calculated values11 are connected to a materials property (Pi) through a descriptor (D(M)). Thus, one can quantify the uncertainties of the microscopic quantities as a function of the computational parameters, and establish a property map with expected deviations.
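A minimal sketch of this descriptor workflow, with every number a placeholder: several computational settings Ai yield microscopic quantities Mi, a toy descriptor D(M) maps them to a property estimate Pi, and the spread over settings serves as a crude uncertainty.

```python
# Descriptor-based property map sketch; all quantities are hypothetical.
import numpy as np

# Microscopic quantities (e.g., band gap in eV, formation energy in eV/atom)
# 'computed' with different functionals/cutoffs -- the settings A_i.
settings = {
    "A1 (LDA, 400 eV)": {"gap": 0.9, "e_form": -1.10},
    "A2 (PBE, 500 eV)": {"gap": 1.1, "e_form": -1.05},
    "A3 (HSE, 500 eV)": {"gap": 1.6, "e_form": -1.00},
}

def descriptor(m):
    """Toy D(M): favour a ~1.4 eV gap, weighted by thermodynamic stability."""
    return np.exp(-(m["gap"] - 1.4) ** 2) * max(0.0, -m["e_form"])

values = np.array([descriptor(m) for m in settings.values()])
print(f"P = {values.mean():.2f} +/- {values.std(ddof=1):.2f} (spread over settings)")
```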

Figure 3

Schematic for the workflow for a ‘materials by design’ approach. The description of the content in the dotted line is given in Figure 2.

Figure 4

Schematic diagram describing a procedure for selecting an optimised computational method based on calibration to many-body theory or experimental data.

For each microscopic parameter, one can perform benchmark calculations—using, e.g., quantum Monte Carlo for total energies and GW (an approximation for calculating the self-energy of a many-body system of electrons) for the energy spectrum—to establish a database. That database enables quantification of the accuracy of the property map as a function of the computational parameters. Using machine learning, specifically a scalable Bayesian approach, we can further utilise the benchmark database to optimise the ab initio parameters that maximise the trustworthiness of theoretical predictions, and thereby establish an optimised ab initio approach for a given system. The optimisation of an ab initio approach can be material specific or property specific. Once the uncertainty of each method is identified, trends can be categorised and used for the development and optimisation of the theoretical approach.
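The following sketch (an assumption-laden illustration, not the authors' implementation) shows one way such a calibration could look: a Gaussian-process model of the error of a cheap method against benchmark reference values, as a function of a single computational parameter, from which an optimal setting and its predicted uncertainty are read off. The benchmark numbers are invented.

```python
# Gaussian-process calibration of one computational parameter against a
# hypothetical benchmark database of |cheap - reference| errors (eV).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

U = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])   # parameter values
err = np.array([0.60, 0.35, 0.15, 0.10, 0.22, 0.45])       # benchmark errors (invented)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-3),
                              normalize_y=True).fit(U, err)

U_grid = np.linspace(0.0, 5.0, 101).reshape(-1, 1)
mean, std = gp.predict(U_grid, return_std=True)
best = U_grid[np.argmin(mean)][0]
print(f"suggested parameter ~ {best:.2f} "
      f"(predicted error {mean.min():.2f} +/- {std[np.argmin(mean)]:.2f} eV)")
```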

One promising direction is the construction of an accurate exchange-correlation functional using machine learning.98 To date, this approach has been limited to one-dimensional, non-interacting electrons and is only applicable to systems similar to the training (reference) data. The physics-inspired categorisation of big data described above can provide excellent training data for this method, for which leadership computing is indispensable for the extension to the three-dimensional, real-material case. Overall, this approach can establish a tight feedback loop between first-principles calculations and experimental big-data analysis of microscopic and mesoscopic quantities, enabling both the verification of theoretical models and their improvement. It also allows validation of first-principles approaches and the establishment of optimised theoretical approaches to enable predictive capabilities for materials by design.

Making the materials: enter the big data

The success of theoretical prediction of materials functionality has given rise to the synthesis–measurement–computation paradigm in materials discovery and design. Here, the rise of high-throughput computational capabilities99–103 has enabled massively parallel computations of materials functionality, embodied in the Materials Genome Initiative, akin to the transition from individual artisanal fabrication to the production line at the beginning of the twentieth century. However, the exponential growth of predictive computational capabilities is belied by the linear development of materials synthesis pathways, which still generally rely on the expertise and skill of the individual researcher, and on serendipity, more than on theory-driven prediction. In this regard, we note that broad incorporation of real-time process monitoring and of data analytics and data mining tools can establish correlations between preparation pathways and materials properties, potentially linking them to theoretical prediction and, in conjunction with artificial intelligence methods, developing viable synthetic pathways. Although it remains an open question whether this approach can create new knowledge, it will at least establish relevant correlations and consolidate disjointed elements of phenomenological knowledge, making them amenable to data mining and reuse.

From predicting to making

This approach can be illustrated by organic molecules, which have a long history of property prediction and successful synthesis, driven, e.g., by the need for new drugs. Indeed, computer programs to aid in the development of synthesis routes for relatively complex organic molecules go back more than four decades, although the rational design of organic molecules certainly pre-dates this by several decades more. The sheer number of candidate molecules that can be synthesised increasingly necessitates such computer-aided design. Early work focused on determining synthesis pathways given a bank of readily available compounds, together with information on the types of reactions available and their yields. These advances occurred at the same time as the birth of rational drug design,104 with numerous notable successes since.105–107 More recently, a strategy to account for synthesis complexity, i.e., ranking molecules by the difficulty of their synthesis, has been proposed by Li and Eastgate.108,109 They propose a score, termed the ‘current complexity’, that takes the available chemical synthesis methodology into account and uses a computational method to determine current complexity scores for different molecules. The score includes an intrinsic component based on the complexity of the structure of the organic molecule, as well as an extrinsic component based on the reactions and the synthetic scheme. As an example, the current complexity of the strychnine molecule was determined and showed clear reductions over several decades, reflecting advances in synthesis techniques during the studied timeframe. Once a list of potential candidates is developed, ranking by such metrics will enable only the most viable molecules to be synthesised, greatly enhancing throughput. It should be noted that in the organic-molecule case synthesis is generally carried out at constant temperature and constant pressure and is spatially uniform, in marked contrast to most inorganic materials with coupled functionalities that are of significant interest at present. These differences are illustrated in Figure 5 and highlight the progress that needs to be made.
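Purely as a toy illustration of ranking candidates by synthesisability, and not a reproduction of Li and Eastgate's published metric, one could imagine a combined intrinsic/extrinsic score along the following lines; all names and numbers are hypothetical.

```python
# Toy ranking by a hypothetical combined complexity score (lower = easier).
candidates = {              # molecule: (intrinsic_complexity, extrinsic_complexity)
    "candidate_A": (0.62, 0.30),   # all values invented for illustration
    "candidate_B": (0.45, 0.55),
    "candidate_C": (0.80, 0.20),
}

def complexity_score(intrinsic, extrinsic, w=0.5):
    """Hypothetical weighted combination of structural and route complexity."""
    return w * intrinsic + (1 - w) * extrinsic

ranked = sorted(candidates, key=lambda m: complexity_score(*candidates[m]))
print(ranked)               # most synthesisable candidates first
```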

Figure 5

(a) Organic molecules have one structure and are spatially uniform, to a first approximation. (b) For inorganic materials, this is not the case, and, therefore, structure–property relations are more difficult to determine, as are synthetic pathways. The ball-and-stick model in a shows the molecular structure of cimetidine, from ref. 129.

Real-time analytics

Incorporating this approach for bulk inorganic materials is significantly more complex, as the process is now controlled by a large number of spatially distributed parameters and can be strongly non-equilibrium. As an example, we consider pulsed laser deposition for the controlled growth of oxide nanostructures.110,111 Typically, the growth requires careful control over parameters including laser fluence, substrate temperature, background gas pressure and target composition in order to achieve the desired stoichiometry of the grown nanostructures (Figure 6). Other growth methods (e.g., molecular beam epitaxy or even single-crystal growth) require different control parameters, but real-time monitoring of the process is always important for control over the desired compounds. An obvious drawback in pulsed laser deposition is that real-time feedback on chemical composition is not available, although new techniques involving in situ X-ray detectors112 and Auger electron spectroscopy113 are just beginning to emerge to fill this void. However, information pertaining to the dynamics of the growth through reflection high-energy electron diffraction is typically available, and is currently used to provide details on surface reconstructions, film morphology, film thickness and growth modes.114

Figure 6

For pulsed laser deposition (image above, left), the typical methodology is to tune control parameters by optimising certain functional properties of the resulting sample, with little real-time feedback on surface morphology or chemical composition. This is largely due to the difficulty of quantifying the real-time data and the inability to model the surface diffraction data during the dynamic growth process. Machine learning can bridge the existing gap between control parameters, real-time monitoring and the functional properties of the grown material, as shown in the schema above, right.

Progress in this area requires bridging the gap between the growth parameters (which can be tuned), the in situ monitoring of surface morphology and chemical composition, and the resulting functional properties of the grown material. By integrating this knowledge with real-time analytics of the diffraction or electron spectroscopy data, control over the resulting structures in terms of morphology115 and/or chemical composition should become possible. The big-data aspects are critical to this endeavour, given the multiple modalities of the captured data (and their large size), as well as the lack of an appropriate quantitative theory to describe surface diffraction in high-energy electron diffraction geometries (or electron spectroscopies) during the dynamic growth of oxide nanostructures. Overall, full acquisition of data during growth, correlated with the functional parameters of the resulting materials, will allow data mining for properties that can be linked with the control variables, providing much greater flexibility, control and understanding of the dynamics of the growth process.
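A hedged sketch of the machine-learning bridge described above: a regression model linking growth control parameters and a simple RHEED-derived feature to a measured functional property. The feature set, target and data are synthetic placeholders chosen only to make the example runnable.

```python
# Regression sketch linking growth parameters + a RHEED feature to a property.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    rng.uniform(0.5, 2.5, n),     # laser fluence (J/cm^2)
    rng.uniform(550, 750, n),     # substrate temperature (C)
    rng.uniform(1e-3, 1e-1, n),   # O2 pressure (Torr)
    rng.uniform(0.2, 1.0, n),     # RHEED oscillation damping (arb. units)
])
# Synthetic 'polarisation' target, used only to make the sketch executable.
y = 60 - 20 * X[:, 3] + 0.02 * (X[:, 1] - 650) + rng.normal(0, 2, n)

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean().round(2))
print("feature importances:", model.fit(X, y).feature_importances_.round(2))
```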

We further note that much of the experimental data, i.e., the link between the control parameters and the functional properties, already exists for, e.g., complex oxides,116,117 which have been studied for more than two decades. The challenge then is to mine the existing literature to find the growth conditions and correlate them with the functional properties. For ferroic thin films, this would typically involve relating film growth conditions (substrate temperature, laser fluence, background O2 pressure and so on) to the spontaneous polarisation PS, or to the Curie temperature (or, alternatively, the Néel temperature), through a careful text-based analysis of pulsed laser deposition papers on ferroic oxides. This will require developing appropriate algorithms for text-based analysis of the several thousand existing papers to look for specific keyword–unit combinations (e.g., ‘fluence’ and ‘J/cm2’), and compiling the results into an appropriate database that can subsequently be mined. Although each pulsed laser deposition chamber is different, given enough data, trends can presumably be identified for specific material compounds that can then be fed back into the synthesis. Future efforts must focus on developing an appropriate community-wide file format, with fields for the functional properties of the oxide nanostructure, the particular growth conditions and metrics from real-time acquired spectroscopy/diffraction data, thereby negating the need for the text-mining approach. The general schema is shown in Figure 7, with the aim of building a hierarchical knowledge cluster from individual communities.
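The keyword–unit extraction step could start from something as simple as the regular expressions below; the example sentence is invented, not a quotation from the literature.

```python
# Minimal keyword-unit extraction: pull a fluence in J/cm2 and a substrate
# temperature in C out of free text. The sentence is an invented example.
import re

text = ("Films were grown by PLD at a laser fluence of 1.5 J/cm2 and a "
        "substrate temperature of 700 C in 100 mTorr of oxygen.")

fluence = re.search(r"(\d+(?:\.\d+)?)\s*J\s*/?\s*cm\s*2", text, re.IGNORECASE)
temperature = re.search(r"(\d+(?:\.\d+)?)\s*°?\s*C\b", text)

record = {
    "fluence_J_per_cm2": float(fluence.group(1)) if fluence else None,
    "substrate_T_C": float(temperature.group(1)) if temperature else None,
}
print(record)   # {'fluence_J_per_cm2': 1.5, 'substrate_T_C': 700.0}
```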

Figure 7

(a) Schema for developing a library of properties for epitaxial oxide thin films for finding correlations between growth parameters and final film properties. The schema, if successful, can then be attempted for other materials systems, as shown in b.

Although in practice these will differ in their details, the overarching framework and best practices developed for the specific case in the section ‘From predicting to making’ can be adopted.

Exploration of community-wide knowledge base

The immense investment in government-sponsored research in the United States has laid the foundation for national scientific, economic and military security in the twenty-first century. However, the doubling of scientific publications every 9 years118 jeopardises this foundation because it is no longer humanly possible to track all relevant research. For scientists to maintain awareness of relevant scientific work and to continue advancing fundamental and applied research requires new computational methods that effectively utilise emerging computational and machine learning capabilities to accelerate true scientific progress. Indeed, scientific discovery is generally driven by the synergy between the talent, skills and inspiration of individual researchers. The character of research over the past century has steadily shifted from individual effort (Tesla, Edison) at the beginning of the twentieth century, to small teams (Bardeen, Shockley and Schrieffer), to large multi-PI teams running complex and expensive instrumentation. Correspondingly, the success of research is increasingly determined by access to the corresponding equipment and collaborative infrastructure. The effectiveness of scientific research is often determined by the knowledge and experience of the researcher, i.e., their knowledge base, social network of collaborators and capability to trace disjointed factors, find original references and meld these together. This will require the development of novel computational analytics capable of unlocking the human knowledge documented and archived in the unstructured text of hundreds of millions of scientific publications, in order to extend scientific discovery beyond innate human capacity.

More specifically, there are two main challenges in using data analytics to aid scientific discovery: first, how to discover material within your own field that may be overlooked owing to the volume of information produced and, second, how to find significant information within fields that you do not follow.

One approach to address these problems could be the use of citation networks, pioneered by Chen27 through the Citespace program. This tool allows the importation of a library of references on a particular topic (with the full citations therein) and analyses the citation networks, which can then be clustered by keywords. Text-based analysis of the abstracts can reveal sudden changes in a field (through an entropy metric), whereas the centrality of nodes quantifies the importance of a particular paper to a field in terms of its connectedness to other clusters. As an example of the use of Citespace, the timeline in Figure 8 on the topic of BiFeO3 shows that the field experienced a resurgence in 2003 after the publication of the famous paper in Science by Wang et al.119 on multiferroic BiFeO3 thin-film heterostructures. The same figure also reveals a considerable drop in interest over the past several years, presumably as attention has shifted to two-dimensional non-graphene materials or non-oxide ferroic compounds.
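As a toy counterpart to such citation-network analysis (not Citespace itself), the sketch below computes betweenness centrality, a proxy for how strongly a paper bridges clusters, and detects communities on a small invented citation graph.

```python
# Toy citation-network analysis on an invented graph of papers P1..P7.
import networkx as nx
from networkx.algorithms import community

edges = [  # (citing paper, cited paper) -- entirely hypothetical
    ("P2", "P1"), ("P3", "P1"), ("P4", "P2"), ("P4", "P3"),
    ("P6", "P5"), ("P7", "P5"), ("P7", "P6"), ("P4", "P5"),
]
g = nx.DiGraph(edges)

centrality = nx.betweenness_centrality(g.to_undirected())      # bridging papers
clusters = community.greedy_modularity_communities(g.to_undirected())

print(sorted(centrality.items(), key=lambda kv: -kv[1])[:3])   # most central papers
print([sorted(c) for c in clusters])                           # detected clusters
```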

Figure 8

Citespace analysis of papers on BiFeO3. The size of the circle indicates the importance of the paper to the field, and the links between papers (through citations) are indicated by the arrows. Papers are plotted as circles in the year of publication. Papers are further organised into clusters, determined automatically by an algorithm based on keywords present. Figure made with Citespace.27

This concept can be progressed further via the development of advanced semantic analysis tools. For example, a set of an author's published papers can be used as seed documents to recommend documents of interest across various publication collections. This enables the discovery of new sources that may be of interest, and the refinement of the information within a source to only the most relevant items. In order to process and analyse large publication collections, each individual document is converted into a collection of terms and associated weights using the vector space model. The vector space model is a widely recognised approach to document content representation120 in which the text in a document is characterised as a collection (vector) of unique terms/phrases and their corresponding normalised significance.

The weight associated with each unique term/phrase is the degree of significance that the term or phrase has relative to the other terms/phrases. For example, if the term ‘metal’ is common across all or most documents, it will have a low significance, or weight value. Conversely, if ‘piezoelectric’ is a fairly unique term across the set of documents, it will have a higher weight value. The vector space model for any document is typically a combination of the unique terms/phrases and their associated weights as defined by a term weighting scheme. Over the past three decades, numerous term weighting schemes have been proposed and compared,121–125 one of the most popular being term frequency–inverse corpus frequency (TF-ICF), developed in ref. 124. The primary advantage of TF-ICF is the ability to process documents in O(N) time, rather than the O(N²) time of many term weighting schemes, while also maintaining a high level of accuracy.

For convenience, the TF-ICF equation is provided here:

(1) wij = log(fij) × log(N/nj)

In this equation, fij represents the frequency of occurrence of term j in document i. The variable N represents the total number of documents in the static corpus, and nj represents the number of documents in that static corpus in which term j occurs. For a given frequency fij, the weight wij increases as nj decreases, and vice versa. Terms with a very high weight will therefore have a high frequency fij and a low value of nj. This approach, implemented in the Piranha system (Figure 9), enables the discovery of publications outside a scientist’s field of interest, and the refinement of information within a scientist’s field.
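A direct transcription of equation (1) into code, using toy corpus counts, makes the down-weighting of ubiquitous terms explicit.

```python
# TF-ICF weights for one document from equation (1); all counts are toy values.
import math

N = 1_000_000                          # documents in the static reference corpus
n_j = {"metal": 400_000,               # corpus document frequencies (hypothetical)
       "piezoelectric": 2_000,
       "fluence": 5_000}

f_ij = {"metal": 3, "piezoelectric": 5, "fluence": 2}     # term counts in document i

weights = {term: math.log(f) * math.log(N / n_j[term]) for term, f in f_ij.items()}
print({t: round(w, 2) for t, w in weights.items()})
# the rare, topical 'piezoelectric' dominates; the ubiquitous 'metal' is down-weighted
```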

Figure 9

Piranha cluster diagram using unsupervised learning to organise a set of documents.

Finally, even with targeted information that is relevant to the researcher, there remains the need to process, distil and summarise the available information quickly and, preferably, automatically. Algorithms for summarising news articles online already exist.126 It is not inconceivable that, in the future, the writing of introduction sections (and other sections) of papers will be largely computer aided. Indeed, papers written by computer algorithms127 are now believable enough that some have made it through to conference proceedings, despite consisting of gibberish.128 The key fact, however, is that they are grammatically correct, and, given access to content aggregators, it is quite possible that much paper writing, e.g., of reviews, will become highly automated. These advances promise to increase the pace of materials design, while simultaneously allowing researchers to keep abreast of increasingly complex, competitive and specialised topics.

Summary

In this perspective article, we have proposed bridging first-principles theoretical predictions and high-veracity imaging and scattering data on materials' microscopic degrees of freedom through big-deep data, with the aim of guiding inorganic materials synthesis. On the theoretical side, the overall goal is to provide a scalable, open-source and validated software framework, encompassing methods applicable to complex and correlated materials, that also incorporates error/uncertainty quantification for functional materials discovery and design. We argue that this can potentially be achieved by combining big-data approaches in imaging and scattering with scalable first-principles software to enable large gains in the rate and quality of materials discoveries. Establishing a tight feedback loop between first-principles calculations and experimental big-data analysis of microscopic and mesoscopic quantities will enable both the verification of theoretical models and their improvement. It also allows validation of first-principles approaches and the establishment of optimised theoretical approaches.

With high-throughput computations in conjunction with high-veracity imaging and scattering data, the use of big-deep data will accelerate rational materials design and synthesis pathways. A schema for the method, as applied to oxide films, is illustrated, incorporating real-time diffraction and chemical data from film growth linked with control parameters, synthesised and understood through deep learning. Additional challenges and opportunities in the realm of literature searches, citation analysis and advanced semantic analysis promise to speed up workflows for researchers, assist with paper production, and distil and categorise the existing scientific literature to enable targeted research and better networking, and to help researchers deal with the ever-increasing volume of data. Big data are making a quick and decisive entrance into the materials science community, and harnessing their potential will be critical for accelerated materials design and discovery to meet our current and future materials needs.