Extended Data Fig. 1: Chemistry is captured by word embeddings.

From: Unsupervised word embeddings capture latent knowledge from materials science literature

a, Two-dimensional t-distributed stochastic neighbour embedding (t-SNE) projection of the word embeddings of 100 chemical element names (for example, ‘hydrogen’), labelled with the corresponding element symbols and grouped according to their classification. Chemically similar elements cluster together, and the overall distribution exhibits a topology reminiscent of the periodic table itself (compare with b). The alkali metals, alkaline earth metals, transition metals and noble gases are arranged from the top left to the bottom right, while the trend from the top right to the bottom left generally follows increasing atomic number (see Supplementary Information section S4 for a more detailed discussion). b, The periodic table coloured according to the classification shown in a. c, Predicted versus actual (DFT) formation energies of approximately 10,000 ABC2D6 elpasolite compounds40, obtained using a simple neural network model with the word embeddings of the constituent elements as features (see Supplementary Information section S6 for details of the model). The data points in the plot were obtained using fivefold cross-validation. d, Error distribution for the 10% test set of elpasolite formation energies. Without extensive optimization, the word embeddings achieve a mean absolute error (MAE) of 0.056 eV per atom, substantially smaller than the 0.1 eV per atom reported for the same task in the original study using hand-crafted features40 and the 0.15 eV per atom achieved in a recent study using element features automatically learned from the crystal structures of more than 60,000 compounds41.
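As a rough illustration of how a projection like the one in a can be produced, the sketch below loads a gensim word2vec model and projects the element-name embeddings with scikit-learn's t-SNE. The model file name "word2vec.model" and the abbreviated name-to-symbol map are assumptions for illustration, not taken from the paper's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Map element names (as they appear in the text corpus) to symbols;
# extend to all ~100 elements to reproduce the full plot.
ELEMENTS = {"hydrogen": "H", "helium": "He", "lithium": "Li",
            "sodium": "Na", "potassium": "K", "iron": "Fe",
            "copper": "Cu", "oxygen": "O"}

model = Word2Vec.load("word2vec.model")  # hypothetical model file
names = [n for n in ELEMENTS if n in model.wv]
vectors = np.array([model.wv[n] for n in names])

# Project the high-dimensional embeddings to 2D with t-SNE.
# perplexity must stay below the number of points.
coords = TSNE(n_components=2, perplexity=5,
              random_state=0).fit_transform(vectors)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, names):
    ax.annotate(ELEMENTS[name], (x, y))  # label each point with its symbol
plt.show()
```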
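Panels c and d amount to a regression with element embeddings as input features: each ABC2D6 compound is represented by the concatenated embeddings of its four elements, and a small network is trained against the DFT formation energies. A minimal sketch under stated assumptions follows; the loader load_elpasolites and the embedding lookup emb are hypothetical stand-ins, and scikit-learn's MLPRegressor stands in for the paper's simple neural network.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error

# `emb` maps element symbols to embedding vectors, and `load_elpasolites`
# returns (A, B, C, D) element tuples with DFT formation energies in
# eV/atom -- both are hypothetical stand-ins for the actual data pipeline.
compositions, e_form = load_elpasolites()
X = np.array([np.concatenate([emb[el] for el in abcd])
              for abcd in compositions])  # one feature row per compound
y = np.array(e_form)

# A simple feed-forward network, evaluated with fivefold cross-validation
# as in panel c.
model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=5)

print(f"MAE: {mean_absolute_error(y, y_pred):.3f} eV/atom")
```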