Introduction

Data-driven science is becoming increasingly important amidst the worldwide deluge of data1,2. Recent developments in deep learning have provided a way to extract important features from big data automatically and to understand new phenomena3. Integration of data by machine learning is also important in materials science. New devices, such as next-generation batteries and photovoltaics, could be developed more efficiently by automatically exploring materials with superior properties, chemical structures, and processes4,5.

Despite the high expectations around materials informatics, even cutting-edge prediction models are not yet able to integrate big data from materials science, due to the lack of general knowledge in this field6,7,8,9,10,11,12,13. A number of models predict a variety of material parameters, including physical and chemical properties, structures, and spectroscopic responses6,7,8,9,10,11,12,13,14,15,16,17,18. The recent development of data mining techniques from scientific literature is also helpful to increase the number of databases and to enhance prediction accuracy14,15,18. However, a critical drawback has been that the previous models could not predict more than two parameters (Table 1 and Fig. 1a, b) and contained as many individual prediction algorithms and models as predicting parameters6,7,8,9,10,11,12,13. Therefore, the models could not perform essential tasks that are easy for humans, such as learning, considering, and predicting multiple real-world phenomena with a single intelligence. This limitation arises from the use of traditional, inflexible table databases. To integrate knowledge, varied information must be inputted and outputted (i.e., multimodal learning)19,20.

Table 1 Comparison of the present work with previous prediction models.
Fig. 1: Concept for the present work.
figure 1

a A single neural network model is trained to learn and predict diverse materials informatics projects. b Traditionally, a model can process only one database and predict one parameter. The Wikipedia logo is reprinted with the CC-BY-SA 3.0 license.

In this study, we introduced graphs with a common format to integrate diverse materials science projects (Fig. 2a, Supplementary Figs. 1 and 2). The format can express almost all experimental materials science information, such as structure, properties, processes, text, images, and even sounds. All related information from more than ten projects was inputted into a single neural network to predict more than 40 parameters simultaneously, including numeric properties, chemical structures, and text (Fig. 1). Graph approaches have been employed to analyze the relationships of atom-connections, chemical features, and reactions6,16,17. In this study, we extended the approach to train a neural network with the general phenomena of science, which are expressed by graphs. The multitask training of versatile information was essential to acquire broad knowledge about materials science. Our graph approach will be the key to developing general-purpose artificial intelligence for materials science, including inverse problem solving.

Fig. 2: About graph approach.
figure 2

a Concept of using a graph database for materials informatics. Traditionally, the experimental, real-world information (Z) is converted to table databases manually, which require unique algorithms to be formatted as numeric arrays for each case (right part of the figure). Machine learning models, trained with individual format databases, cannot interpret the other databases. On the other hand, common-formatted graphs were made to express versatile experimental information in this study (left part). A single machine learning model interprets all inputted experimental information. b Overview of processing graph structures for machine learning (see Supplementary Fig. 4 for further information).

Results

Process informatics for electron-conducting polymers

As a model case to demonstrate the effect of the graph format, we examined the process informatics of poly(3,4-ethylenedioxythiophene) doped with poly(4-styrenesulfonate) (PEDOT-PSS; Fig. 3). The polymer is known for its high electron conductivity and can be used in transparent flexible conductive films, capacitors, solar cells, thermoelectrics, and other energy-related devices21,22. The conductivity reaches over 3000 S cm−1 after the careful chemical treatment of the polymer film22,23. To achieve higher conductivity, a number of new annealing methods, including repeated chemical treatments with strong acids, bases, and solvents, have been reported23. Informatic approaches have been partially introduced to predict the properties of PEDOT-PSS24,25. However, the post-treatment methods have become too long and complex to be analyzed by conventional machine learning approaches or to be understood except by a few specialists (example scheme is shown in Fig. 3a).

Fig. 3: Prediction of electric conductivity.
figure 3

a Predicting the electrical conductivity of PEDOT-PSS from its post-treatment method. b Structure of PEDOT-PSS. The mixture gives higher conductivity with appropriate post-treatment, which induces the optimum connection of conductive moieties. c Experimental and predicted conductivity of PEDOT-PSS. Seventy percent of data are selected randomly as the training dataset. The rest is used only for prediction. R2 scores for training and test datasets are 0.87 and 0.70, respectively.

Process informatics aims to optimize procedures by using statistical tools9,26,27. The supposedly important factors for the target performance are extracted manually, and recorded in table databases (e.g., heating temperature, mixing speed, and duration; Supplementary Fig. 1). The table format is normally used because most machine learning models can only accept numeric arrays3.

The intrinsic problems with the traditional approach are the inflexible format of the table and ignorance of the experimental context. The database format must be changed whenever a new experimental step (e.g., additional mixing) is considered, although additional steps are often examined to optimize the procedure. It may not be possible to describe complex experimental information fully in a numeric table alone. Even if the table is constructed, reusing it in other projects may be difficult because it does not contain the context for the values (Supplementary Fig. 1).

In the present study, text-based experimental procedures were automatically converted to graphs while maintaining the text context and inputted directly to a machine learning model (Fig. 3a and Supplementary Fig. 3). Graph data describe the relationships among things (expressed as nodes) by connecting them with edges; in contrast to table databases, this flexibility enables the expression of diverse information easily, such as text structures, social networks, and molecular structures28. A recent development in deep learning has enabled the automatic recognition of the graphs and calculation of their characteristic features28. Still, individual prediction models had to be prepared for each genre. Here, we demonstrate that even a single graph and a prediction model can process multidisciplinary information, including chemicals, text, and numbers (Fig. 1).

A simple yet powerful approach to describe versatile information in graphs was to record both genre and content information in each node (Fig. 2b, Supplementary Figs. 4 and 5a). The original text of the post treatments was converted to dependency trees as undirected graphs by a natural language parser29. The nodes in graphs were classified as words, chemical structures, and values. The node information was then converted to numeric vectors by three algorithms (see “Methods” and Supplementary Fig. 4). To process words, a state-of-the-art language-understanding deep-learning model called BERT30 was used. Molecular information was converted to vectors by using molecular fingerprints31. Values in numeric nodes were kept unchanged. To distinguish the genres, three numeric arrays were added to the headers. Apart from the three classes, any information can be embedded in graphs if it can be converted to numeric arrays, thereby paving the way to learning general information about materials science (e.g., inorganic structures, images, and sounds)3.

More than 350 types of graphs related to the post-treatment of PEDOT-PSS (from over 20 papers) were prepared and inputted to a graph neural network6,32. The model was trained to predict the electrical conductivity from the post-treatment methods of the polymer films. In the original database, the procedures were written as text (Fig. 3a). After automatically converting the text to graphs, only the nodes containing electrical conductivity were replaced with the keyword “unknown”. Here, no significant information is basically lost during the graph conversion because the quasi-reversibility of text parsing29. The graphs were used as the inputs (questions) to the model. The model was trained to predict the conductivity from the graph-shaped questions. As the overall user interface, the model can answer the performances of the polymers only from text. This style is more convenient and reliable for most users; special effort and knowledge are needed to prepare traditional table databases, which require the careful, manual selection of important features for the target phenomena and formatting into numeric arrays for machine learning.

The prediction accuracy of the conductivity by the neural network was high. To check the accuracy, the database was split into training (70%) and test (30%) datasets randomly. Although the model was trained only with the training dataset, the R2 score of the prediction and experimental conductivity was greater than 0.7 (Fig. 3c). The score was comparable or slightly higher than the control experiment, where conductivity was predicted directly from texts using a conventional natural language model (R2 = 0.66, Supplementary Fig. 5b, c). The high accuracy supported the validity of the graph approach. Except for a few specialists, such accurate predictions are difficult to make due to the excessively complex preparation procedure (see the long preparation method shown in Fig. 3a). Because the neural network can find essential features from graphs automatically, manual parameter selection to prepare the database is not necessary. Automatic text parsing29 and the general graph approach enable automatic data collection from materials science big data, where recognition of unstructured data has been a bottleneck5.

Multitask learning in different projects

A key advantage of using general graphs is their high capability for describing diverse experimental information. Because the text context is maintained in the graphs, users can easily change the target parameters of the prediction by replacing the target node with the keyword “unknown”. In contrast to normal table databases, the graph questions themselves contain the information about what is to be predicted. This enables one model to learn and predict multiple databases and parameters easily (multitask and multimodal training). For instance, we prepared a graph database containing more than 1000 chemical compounds from Wikipedia (Supplementary Fig. 6). The relationships among chemical structures and their physical properties were recorded as graphs. Similarly, a lithium-ion conducting polymer database, which we constructed previously6, was converted into graphs. In the previous study, a long script was needed to process the complex conductor information so it could be interpreted by a machine learning model (i.e., into numeric arrays)6. However, in the present study, no additional script was necessary because the conductor information could be expressed in the general graph format (Supplementary Fig. 6).

Chemical graph databases were easily converted to question graphs by replacing the property nodes with the keyword “unknown”. More than ten properties, including ionic conductivity, melting point, pKa, viscosity, and vapor pressure, were set as questions (Supplementary Fig. 6). A machine learning model was trained with the PEDOT-PSS and Wikipedia databases to predict the recorded properties (Fig. 4a and Supplementary Fig. 7). Multitask and multimodal training is not feasible with the traditional table databases, due to their inflexible format; the process information about PEDOT-PSS and the chemical properties in Wikipedia cannot be described fully in an integrated table.

Fig. 4: Multitask learning.
figure 4

a Preparation of datasets for multitask learning. If the split ratio is 0.6, 60% of the randomly selected PEDOT-PSS data and all of the other database are used to train the model. The other 40% is selected as the test dataset. b R2 scores for the test dataset after multitask learning with Wikipedia or the lithium-ion conducting polymer database. Only the PEDOT-PSS database is trained as a control. The values show the averages of five experiments with different random shuffling (n = 5 independent experiments). Error bars show the standard errors. See Supplementary Fig. 7 for the results of the training dataset. c Representative heatmap of the latent vectors for the PEDOT-PSS and Wikipedia databases, outputted by a model trained only with the PEDOT-PSS database. See Supplementary Fig. 9 for further information. d Two-dimensional UMAP projection of the latent vectors shown in Supplementary Figs. 9 and 10.

Although there was no obvious relationship between the post-treatment of PEDOT-PSS and compounds in Wikipedia, the prediction accuracy of the electrical conductivity of PEDOT-PSS was improved by multitask training. The PEDOT-PSS database was split into training and test datasets randomly with different split ratios (0–0.9). All data from Wikipedia were combined with the training dataset (Fig. 4a). As expected, the R2 score for the test dataset increased as the split ratio increased (Fig. 4b). Most importantly, the scores were always higher when Wikipedia was learned simultaneously. A similar improvement was observed for the multitask learning of the lithium-ion conducting polymer database. The score was more than three times higher with the multitask learning than with only learning PEDOT-PSS, with a split ratio of 0.3 (corresponding to learning ca. 100 cases of PEDOT-PSS). To our knowledge, this is the first report of multitask learning of different databases and improvement of prediction accuracy in materials science.

To reveal the detailed process of multitask learning, we analyzed the intermediate calculation steps in the model, by visualizing the outputs of a hidden layer in the neural network (Fig. 4c, Supplementary Figs. 811). The hidden layer converted the inputted graphs to 32-dimensional numeric arrays as the vector representation (Supplementary Fig. 5a). The vectors contain essential information about the input and output, termed ‘latent space’3,13. We compressed the 32-dimensional vectors into two-dimensional arrays33 for easier understanding (Fig. 4d). When only the Wikipedia database was used for training (split ratio of 0), the plots from PEDOT-PSS and Wikipedia databases were separate, indicating that the model interpreted them as different species. In contrast, the plots were combined after multitask learning because the model found hidden mutuality among the data and partially shared calculation algorithms for predicting the different parameters. Further mechanism analysis of the multitask is not accessible due to the “black box” problem of deep learning3. However, the recent idea of machine learning, represented by influence functions34, may help researchers reveal the internal processes (e.g., clarify the relationships among specific databases and parameters).

A similar idea to multitask learning, called ‘transfer learning’, has also been proposed to improve prediction6,8, in which different prediction models partially shared the calculation steps to recognize important data features efficiently. However, the final calculations were done by individual algorithms (for details see Supplementary Fig. 12). The individuality limits the advantages of the synergistic effects of learning multiple databases and acquiring broad knowledge of the field. In contrast, in the graph approach, a single intelligence interpreted multiple databases and properties. This finding is essential for exploring the materials informatics of experimental projects, most of which have limited database capacity owing to the high experimental cost6.

General materials informatics prediction model

A more general materials informatics prediction model was pursued by increasing the number of learning databases. From public data, we collected 14 experimental materials science databases, containing over 40 properties (see Supplementary Information). The main compounds in the databases were monomeric molecules, organic polymers, and their mixtures. In addition to their basic physical properties, advanced features, such as redox potentials were included (Fig. 1). Prediction of redox potential is necessary to develop energy-related devices but was not fully successful, mainly because the potentials are changed by the effects of solvents and salts and complex systems are difficult to handle in the table format and simulations35. In the graph databases, the redox potentials were easily recorded as a function of redox molecules, solvents, and electrolyte salts.

For machine learning, some databases were selected randomly and learned with a single prediction model (Supplementary Fig. 13). By increasing the number of training databases, the number of predictable parameters increased because the model could understand the larger amount of information inputted. When the model was trained with all 14 databases, it could predict over 40 properties with high accuracy (Fig. 5, Supplementary Fig 14a, Supplementary Table 1, and Supplementary Data 1). The prediction accuracy was not high enough (R2 < 0) with the parameters with insufficient amounts of training data (typically less than 100 cases, Supplementary Fig. 14b). We emphasize that the prediction errors by the multitask training were basically smaller than the control experiments, introducing different random forest regressors to predict each parameter from corresponding chemical fingerprints (i.e., standard single-task training, Supplementary Fig. 14c). For higher accuracy, we are integrating other public experimental databases and even computational results. Together with revealing the synergetic effects of multitask learning, even one- or zero-shot learning36 may be achievable with the model, which can benefit from both human-like context understanding capability and hugecomputational power to process big data. General prediction models will be beneficial to the wider research community because of their broad knowledge and ability to answer unknown questions; internet search engines can only answer questions about known issues and human professional resources are limited.

Fig. 5: Prediction of versatile parameters.
figure 5

Experimental (x-axis) and predicted (y-axis) values of each parameter after multitask learning of the 14 databases (90% of data was used for training). Blue and red plots correspond to the training and test datasets, respectively. The values are shown as z-scores. R2 scores for test datasets are displayed. If test cases were not enough for evaluation, train scores are shown with brackets instead. Statistical results are also summarized in Supplementary Table 1. Larger graphs are shown in Supplementary Fig. 14a.

Inverse problem solving by graph approach

One of the ultimate goals of materials informatics is fully solving inverse problems. Instead of predicting the results of conditions carefully specified by humans, machine learning models are expected to answer much more ambitious questions, such as “Which post-treatment protocol for PEDOT-PSS will yield a conductivity of 104 S cm−1?” or “What is the organic polymer structure that gives a melting point of 500 °C?”. Although there are no available procedures or structures that give high performances, integration of big scientific data may find the answers. In contrast, many informatics challenges must be overcome. The main difficulties are related to uniqueness of mapping, common sense, and generation of complex answers (Supplementary Fig. 15). For example, there would be multiple polymer structures exhibiting a melting point of the desired value. Therefore, an inverse function of melting point to structure cannot be determined uniquely (uniqueness problem). Furthermore, most candidate structures must be excluded automatically based on common sense from the field, namely by filtering out inappropriate compounds, for example, synthetically challenging or unstable compounds. Finally, generating complex information, represented by chemical structures, is still an open question in deep learning13.

A general prediction model with the graph approach may be the key to solving inverse problems in materials science. Here, a graph neural network was trained with all the information from the 14 databases. In the previous section, nodes of numeric values (material properties) were set as the targets for prediction, whereas all types of nodes (number, word, and compound) in each graph were selected for prediction here (Supplementary Fig. 16). Final answers were constructed by finding the most similar vectors in the text and compound databases with the predicted values (see “Methods”). In the future, it may be possible to generate completely new answers by integrating an autoencoder13. After training, the model could predict the original text in the graphs with a high accuracy of 96% (Supplementary Table 2). Even the 4% failed predictions were close to the answers (Supplementary Table 3). Typical mistakes were predicting “electrical conductivity” instead of “ionic conductivity” (answer) and “melting point” instead of “boiling point” (answer). This was because the model could understand the similar meanings of natural language by neural networks30, resulting in near-miss answers.

The prediction accuracy of compounds was lower (36%), mainly because of the uniqueness problem (Supplementary Fig. 16). For instance, a chemical structure with a specific density (1.58 g cm−3) and a melting point (203 °C) was questioned. The answer was “trehalose”; however, there should be other structures satisfying these two conditions. This problem with uniqueness lowered the apparent accuracy of compound prediction. In practical use, users can add optional conditions freely for desirable compounds. The structure will be determined uniquely if other properties such as heat capacity, hydrophilicity, and chemical stability are specified. Because machine learning models have no actual experimental experience, including this type of tacit knowledge that researchers have will be the key to ensuring the quality of predictions.

Discussion

We used a graph format to express diverse materials informatics information. This common format allows databases from different projects to be combined easily. A single neural network interpreted miscellaneous information, including chemical structures, more than 40 material properties, and text. Multitask and multimodal learning was essential not only for increasing prediction accuracy, but also for developing general-purpose prediction models for materials science. Integrating big data and improving the inverse problem-solving methods will allow this method to be used as an artificial materials science expert, which will change the traditional research and development cycle.

Methods

General information

Databases were constructed or collected from public data. All experimental data collected from the literature were converted into undirected graphs using original Python 3 scripts. Graph nodes were classified into three types: values, text, and chemical compounds, which were converted into numeric arrays automatically by different algorithms. All graph edges were treated equally. To train a neural network, values of target nodes (=y) in graphs were replaced with a specific keyword “__unknown__”. The generated graphs were inputted as x to the model.

Databases

The PEDOT-PSS database was constructed in this work. The lithium-ion-conducting polymer database was constructed in our previous work6. The experimental properties of various compounds were collected from Wikipedia (https://ja.wikipedia.org/), Wikidata (https://www.wikidata.org/), Computational Chemistry Comparison and Benchmark DataBase (CCCBDB, https://cccbdb.nist.gov/), chemical suppliers, and the literature. All data were collected and rechecked by the authors.

Computer

Data processing and machine learning were conducted on a desktop computer (Intel Core i9-9900K CPU @ 3.60 GHz, 32 GB memory, GeForce RTX 2080 graphical processing unit, and Ubuntu 16.04 operating system).

Graph preparation from text

Text information about post-treatment of PEDOT-PSS was collected and converted to graphs by the following procedures (Fig. 3). We extracted the related experimental procedures from original articles (mainly experimental sections) and summarized them as text. The text was written in a set format for ease of machine learning (e.g., avoiding inconsistent spelling). In the future, automatic text collection (probably by machine learning) will be examined, to reduce the cost and to eliminate the human nature of data preparation. On the other hand, we note that the system can be robust against orthographical variances and different text expressions owing to the language-understanding deep-learning model30 (as long as sufficient data are given).

In the text, compounds were expressed as their IDs, such as C0001 and C0123, and their structure information was recorded in a compound database. In the database, the compound ID and simplified molecular input line entry system (SMILES) expressions were associated (about 50 types of chemicals). Numeric values were standardized as z-scores by each unit (e.g., siemens per centimeter and degrees Celsius, Supplementary Fig. 3). The text was parsed automatically by an open module (StanfordNLP 0.2.0)29 to construct dependency trees of words. Nodes of less important words and symbols (e.g., “at”, “were”, “and”, “was”, “by”, and “to”) were removed after parsing. Nodes of conductivity values (y) in graphs were replaced with “__unknown__” to prepare the inputs (x) for machine learning. An open-source library (NetworkX 2.4, https://networkx.github.io/) was used to generate graphs.

Graph databases for multitask learning

For multitask learning, all databases, typically written as tables, were converted to graphs by Python scripts if necessary. In the graphs, the relationships among the factors were connected by edges (Supplementary Fig. 1d). As a common rule, a numeric parameter was connected in the order “parameter name”—“value”—“[unit]” to describe a property of a target node. Numeric values were standardized as the z-score by each unit (e.g., degrees Celsius and siemens per centimeter). For multitask learning, a graph database of PEDOT-PSS was reconstructed manually (apart from the text database described in the previous section) because the automatic parser made graphs according to a different rule (Supplementary Fig. 3). A free graph editor (yEd 3.19) was used to draw graphs in the “graphml” format, which was compatible with the NetworkX library.

An integrated compound database was made by combining the compound information in each database and a chemical supplier’s catalog (Tokyo Chemical Industry Co.). A total of over 29,000 chemical structures were recorded. The integrated database was used for the multitask learning experiments (i.e., except for the automatic text parsing experiment in Fig. 3 and Supplementary Fig. 8). The prediction score can decrease slightly when a larger compound database is used (e.g., compare R2 scores in Figs. 3c and 4b) because of the larger loss of compound information after converting into 64-dimensional numeric vectors with larger data. Therefore, the compressing algorithms should be improved in future research.

Converting graphs to numeric matrices and vectors

For machine learning, graphs were converted into adjacency matrices and numeric vectors (Supplementary Fig. 4a). Adjacency matrices, expressing the node connections, were simply calculated by a function of NetworkX. A matrix, D, was prepared according to the following steps: (1) assign unique IDs to each node; and (2) determine Dij, which is 1 if nodes i and j were connected, and otherwise is 0.

Each node content was converted to 64-dimensional numeric arrays (Supplementary Fig. 4b). Other information, such as images and sounds, can be also treated by implementing additional processing scripts (see below). The first four-dimensional arrays of the vectors were headers to distinguish the node types. Three types of different random numeric arrays were assigned using an embedding function of Chainer 7.2.0, an open library for deep learning. The remaining 60-dimensional arrays were prepared by three different algorithms according to node types.

Numeric nodes: When the node represents a number, the corresponding 60-dimensional numeric array will be a repeat of the value. For instance, an array of (0.5, 0.5,..., 0.5) is set for the value node of 0.5.

Text nodes: To process text nodes, a natural language recognition model (BERT30, pretrained model of “uncased_L-24_H-1024_A-16”, accessible at https://github.com/tensorflow/models/tree/master/official/nlp/bert) was employed. The model could calculate 768-dimensional numeric vectors of the corresponding words, phrases, and text. Similar text expressions or meanings were converted to similar vectors by BERT30. The 768-dimensional vectors were compressed to 60-dimensional numeric arrays using a principal component analysis (PCA) algorithm, implemented with an open-source library (scikit-learn 0.22.2).

Compound nodes: Organic compounds were recorded as unique IDs (e.g., C0023) in graph nodes. By referencing a compound database, their structure information, expressed by 60-dimensional numeric arrays, was loaded. In the compound database, molecular structures were recorded as SMILES expressed by character strings. Their chemical features were calculated using an open module of chemistry (RDKit 2019.03.2, https://www.rdkit.org/), to obtain extended-connectivity fingerprints of the 2048-bit data. The binary data were split every 4 bits (e.g., (0011), (0000), (1111), …) and converted to corresponding integers (e.g., 512-dimensional array of (3, 0, 15, …)). Finally, a 60-dimensional array was prepared followed by PCA compression and standardization.

Other nodes: In this study, only numeric, text, and molecular structure nodes were implemented in the graphs. In the future, other information, such as inorganic crystal structures, images, sounds, and spectra, will be processed by adding appropriate scripts to express them as vectors.

Dataset preparation

Target values (y) on the nodes in graphs were replaced with a keyword “__unknown__” to prepare problems (x) automatically (underlines are added to distinguish from the word “unknown”). If one graph has more than two target values (e.g., melting point and density), the replacement and problem generation were done individually to generate multiple problems; there were not multiple “__unknown__” nodes in one graph. The numeric nodes of the material properties were set as the target values (y) for prediction. For inverse problem solving in the last section of the main manuscript, all nodes (numbers, chemicals, and text) in all graphs were set as problems.

Unless noted otherwise, the graph data were split into train and test datasets randomly (splitting ratio of 0 to 0.9). The train dataset was used to train a graph neural network and the test dataset was used only for prediction.

Machine learning

The prepared datasets were trained with a neural network (Supplementary Fig. 5). The Chainer library was used to script the model, which had a graph neural network layer to recognize graphs and three dense layers to calculate the final outputs. The graph neural network was prepared based on an open-source library in Chainer-chemistry 0.7.0. The Implemented function of the gated graph neural network was used. Only the input part of the function was modified to input the adjacency matrices and node vectors described above, whereas the original version was customized to input only the connection of atoms in molecules.

The neural network was trained to reduce the mean square errors between the predicted and actual values. Minibatch sizes of 32 and 128 were selected for only PEDOT-PSS learning and multitask learning, respectively. Training was repeated with 100 epochs with the Adam optimizer37. The dimension of output values was 1 for the normal prediction mode of numeric nodes.

The model was constructed with a 64-dimensional output to solve inverse problems in the last section of the main manuscript. The last 60-dimensional vectors were used for prediction (i.e., the first four-dimensional arrays were used only to distinguish node types). The predicted vectors were compared with the word (or compound) list in the integrated databases. Ones giving the highest cosine similarity with the predicted vectors were extracted as the prediction result. In the future, direct outputs of words and compounds may be achieved using autoencoders or similar techniques13. Numeric nodes were predicted by averaging the predicted 60-dimensional vectors. All hyperparameters were optimized manually. Automatic parameter tuning will be tested in future research with higher computing power (e.g., multiple GPUs).

Prediction by conventional models as the control experiments

Language model to predict conductivity directly from text (related to Fig. 3 and Supplementary Fig. 5): In Fig. 3, conductivity was predicted via graph structures, which were converted from texts. On the other hand, conventional recurrent neural networks, such as long short-term memory (LSTM)38, can treat text information directly. As a control experiment, conductivity was predicted from the texts. The conductivity values in the texts were replaced with “__unknown__” to make problems (Supplementary Fig. 5b). After converting words into embedding vectors, the word inputted to a LSTM layer (which outputs 16-dimensional latent vectors, implemented by Keras 2.3.1). Conductivity was calculated via a dense layer without activation functions.

Random forest regressors to predict chemical properties (related to Fig. 5and Supplementary Fig. 14): As the control to the multitask training, machine learning was conducted in a conventional way with a Wikipedia database. Random forest was selected as a conventional yet robust prediction algorithm4. First, compound information was converted to 60-dimensional arrays through the same process as the graph approach. Then, individual random forest regressors (by scikit-learn) were introduced and trained to predict each chemical property recorded in the database (absolute standard enthalpy of formation, boiling temperature, decomposition temperature, density, flash temperature, ionization energy, melting enthalpy, melting temperature, refractive index, vapor pressure, and pKa) from the 60-dimensional arrays. Train and test datasets were prepared randomly with a splitting ratio of 9/1.