# Unsupervised word embeddings capture latent knowledge from materials science literature

## Abstract

The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases [1,2], which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing [3–10], which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings [11–13] (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.

## Data availability

The scientific abstracts used in this study are available via Elsevier’s Scopus and ScienceDirect APIs (https://dev.elsevier.com/) and the Springer Nature API (https://dev.springernature.com/). The list of DOIs used in this study, the pre-trained word embeddings and the analogies used for validation of the embeddings are available at https://github.com/materialsintelligence/mat2vec. All other data generated and analysed during the current study are available from the corresponding authors on reasonable request.
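Analogy-based validation of embeddings is commonly done with vector arithmetic: for "a is to b as c is to ?", the answer is taken to be the word whose vector lies closest to b − a + c. The sketch below illustrates that idea in plain Python. The two-dimensional vectors and the `solve_analogy` helper are hypothetical and are not taken from the released embeddings or the mat2vec code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the nearest neighbour of b - a + c.

    The three input words are excluded from the candidates, as is usual
    in analogy benchmarks.
    """
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Hypothetical toy embeddings (illustration only): the offset from an
# element symbol to its name is the same for both elements.
toy_vectors = {
    "Fe": [1.0, 0.0],
    "iron": [1.0, 1.0],
    "Cu": [2.0, 0.0],
    "copper": [2.0, 1.0],
    "zinc": [3.0, 1.0],
}

answer = solve_analogy("Fe", "iron", "Cu", toy_vectors)
```

An analogy counts as correct when the returned word matches the expected one; the fraction of correct answers over a list of analogies gives a single validation score.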

## Code availability

The code used for text preprocessing and Word2vec training is available at https://github.com/materialsintelligence/mat2vec.

## References

1. Hill, J. et al. Materials science with large-scale data and informatics: unlocking new opportunities. MRS Bull. 41, 399–409 (2016).

2. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science. Nature 559, 547–555 (2018).

3. Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17, S74–S82 (2001).

4. Müller, H. M., Kenny, E. E. & Sternberg, P. W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2, e309 (2004).

5. Swain, M. C. & Cole, J. M. ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56, 1894–1904 (2016).

6. Eltyeb, S. & Salim, N. Chemical named entities recognition: a review on approaches and applications. J. Cheminform. 6, 17 (2014).

7. Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29, 9436–9444 (2017).

8. Leaman, R., Wei, C. H. & Lu, Z. TmChem: a high performance approach for chemical named entity recognition and normalization. J. Cheminform. 7, S3 (2015).

9. Krallinger, M., Rabal, O., Lourenço, A., Oyarzabal, J. & Valencia, A. Information retrieval and text mining technologies for chemistry. Chem. Rev. 117, 7673–7761 (2017).

10. Spangler, S. et al. Automated hypothesis generation based on mining scientific literature. In Proc. 20th ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining 1877–1886 (ACM, 2014).

11. Mikolov, T., Corrado, G., Chen, K. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781 (2013).

12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Preprint at https://arxiv.org/abs/1310.4546 (2013).

13. Pennington, J., Socher, R. & Manning, C. GloVe: global vectors for word representation. Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, 2014).

14. Liu, W. et al. New trends, strategies and opportunities in thermoelectric materials: a perspective. Materials Today Physics 1, 50–60 (2017).

15. He, J. & Tritt, T. M. Advances in thermoelectric materials research: looking back and moving forward. Science 357, eaak9997 (2017).

16. Ricci, F. et al. An ab initio electronic transport database for inorganic materials. Sci. Data 4, 170085 (2017).

17. Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Phys. Rev. 136, B864–B871 (1964).

18. Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133–A1138 (1965).

19. Gaultois, M. W. et al. Data-driven review of thermoelectric materials: performance and resource considerations. Chem. Mater. 25, 2911–2920 (2013).

20. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).

21. Plirdpring, T. et al. Chalcopyrite CuGaTe2: a high-efficiency bulk thermoelectric material. Adv. Mater. 24, 3622–3626 (2012).

22. Tian, H. et al. Low-symmetry two-dimensional materials for electronic and photonic applications. Nano Today 11, 763–777 (2016).

23. Pandey, C., Sharma, R. & Sharma, Y. Thermoelectric properties of defect chalcopyrites. AIP Conf. Proc. 1832, 110009 (2017).

24. Zhao, L.-D. et al. Ultralow thermal conductivity and high thermoelectric figure of merit in SnSe crystals. Nature 508, 373–377 (2014).

25. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).

26. Peters, M. E. et al. Deep contextualized word representations. Preprint at https://arxiv.org/abs/1802.05365 (2018).

27. Jain, A. et al. The materials project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).

28. Ong, S. P. et al. Python Materials Genomics (pymatgen): a robust, open-source Python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).

29. Kresse, G. & Joubert, D. From ultrasoft pseudopotentials to the projector augmented-wave method. Phys. Rev. B Condens. Matter Mater. Phys. 59, 1758–1775 (1999).

30. Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).

31. Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B Condens. Matter 54, 11169–11186 (1996).

32. Kresse, G. & Furthmüller, J. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Comput. Mater. Sci. 6, 15–50 (1996).

33. Madsen, G. K. & Singh, D. J. Boltztrap. A code for calculating band-structure dependent quantities. Comput. Phys. Commun. 175, 67–71 (2006).

34. Mathew, K. et al. Atomate: a high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).

35. Jain, A. et al. Fireworks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. 27, 5037–5059 (2013).

36. Yang, X., Dai, Z., Zhao, Y., Liu, J. & Meng, S. Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi. J. Phys. Condens. Matter 30, 425401 (2018).

37. Wang, Y., Gao, Z. & Zhou, J. Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2. Physica E 108, 53–59 (2019).

38. Mukherjee, M., Yumnam, G. & Singh, A. K. High thermoelectric figure of merit via tunable valley convergence coupled low thermal conductivity in $\mathrm{A}^{\mathrm{II}}\mathrm{B}^{\mathrm{IV}}\mathrm{C}_{2}^{\mathrm{V}}$ chalcopyrites. J. Phys. Chem. C 122, 29150–29157 (2018).

39. Kim, E. et al. Machine-learned and codified synthesis parameters of oxide materials. Sci. Data 4, 170127 (2017).

40. Faber, F. A., Lindmaa, A., Von Lilienfeld, O. A. & Armiento, R. Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett. 117, 135502 (2016).

41. Zhou, Q. et al. Learning atoms for materials discovery. Proc. Natl Acad. Sci. USA 115, E6411–E6417 (2018).

## Acknowledgements

This work was supported by Toyota Research Institute through the Accelerated Materials Design and Discovery program. We thank T. Botari, M. Horton, D. Mrdjenovich, N. Mingione and A. Faghaninia for discussions.

## Author information

### Contributions

All authors contributed to the conception and design of the study, as well as writing of the manuscript. V.T. developed the data processing pipeline, trained and optimized the Word2vec embeddings, trained the machine learning models for property predictions and generated the thermoelectric predictions. V.T., J.D. and L.W. analysed the results and developed the software infrastructure for the project. J.D. trained and optimized the GloVe embeddings and developed the data acquisition infrastructure. L.W. performed the abstract classification. A.D. performed the DFT calculation of thermoelectric power factors. Z.R. contributed to data acquisition. O.K. developed the code for normalization of material formulae. A.D., Z.R. and O.K. contributed to the analysis of the results. K.A.P., G.C. and A.J. supervised the work.

### Corresponding authors

Correspondence to Vahe Tshitoyan or Gerbrand Ceder or Anubhav Jain.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data figures and tables

### Extended Data Fig. 1 Chemistry is captured by word embeddings.

a, Two-dimensional t-distributed stochastic neighbour embedding (t-SNE) projection of the word embeddings of 100 chemical element names (for example, ‘hydrogen’) labelled with the corresponding element symbols and grouped according to their classification. Chemically similar elements are seen to cluster together and the overall distribution exhibits a topology reminiscent of the periodic table itself (compare to b). Arranged from top left to bottom right are the alkali metals, alkaline earth metals, transition metals, and noble gases while the trend from top right to bottom left generally follows increasing atomic number (see Supplementary Information section S4 for a more detailed discussion). b, The periodic table coloured according to the classification shown in a. c, Predicted versus actual (DFT) values of formation energies of approximately 10,000 ABC2D6 elpasolite compounds [40] using a simple neural network model with word embeddings of elements as features (see Supplementary Information section S6 for the details of the model). The data points in the plot use fivefold cross-validation. d, Error distribution for the 10% test set of elpasolite formation energies. With no extensive optimization, the word embeddings achieve a mean absolute error (MAE) of 0.056 eV per atom, which is substantially smaller than the 0.1 eV per atom error reported for the same task in the original study using hand-crafted features [40] and the 0.15 eV per atom achieved in a recent study using element features automatically learned from crystal structures of more than 60,000 compounds [41].
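The two evaluation ingredients named in this caption, the mean absolute error and fivefold cross-validation, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the energy values are made up, the function names are hypothetical, and the folds are contiguous and unshuffled, whereas real studies typically shuffle the data before splitting.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between reference and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def kfold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, test) index lists.

    Every sample appears in exactly one test fold, so each prediction
    is made by a model that did not see that sample during training.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


# Hypothetical formation energies in eV per atom (illustration only)
dft_values = [-1.20, -0.85, -2.10, -1.55]
model_values = [-1.25, -0.80, -2.00, -1.60]

mae = mean_absolute_error(dft_values, model_values)
folds = kfold_indices(10, k=5)
```

In a cross-validated plot such as panel c, the predicted value for each compound would come from the model trained on the four folds that exclude it.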

### Extended Data Fig. 2 Historical validations of functional material predictions.

a–c, Ferroelectric (a), photovoltaic (b) and topological insulator predictions (c) using word embeddings obtained from various historical datasets, similar to Fig. 3a. For ferroelectrics and photovoltaics, the range of prediction years is 2001–2018. The phrase ‘topological insulator’ obtained its own embedding in our corpus only in 2011 (owing to count and vocabulary size limits), so it is possible to analyse the results only over a shorter time period (2011–2018). Each grey line uses only abstracts published before a certain year to make predictions. The lines show the cumulative percentage of predicted materials studied in the years following their predictions; earlier predictions can be analysed over longer test periods. The results are averaged in red and compared to baseline percentages from all materials. d, The target word or phrase used to rank materials for each application (based on cosine similarity), and the corresponding words used as indicators for a potentially existing study.
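Ranking materials by cosine similarity to a target word, as described in panel d, might look like the following minimal sketch. The three-dimensional vectors, the placeholder material names and the `rank_materials` helper are hypothetical; the actual embeddings have many more dimensions and are distributed with the mat2vec repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_materials(target_vec, material_vecs, top_n=None):
    """Return (material, similarity) pairs, most similar first.

    With top_n=None the full ranking is returned.
    """
    scored = [
        (name, cosine_similarity(target_vec, vec))
        for name, vec in material_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical embeddings (illustration only). The target vector stands
# in for an application word such as 'ferroelectric'.
target = [1.0, 0.0, 0.0]
materials = {
    "material_A": [0.9, 0.1, 0.0],
    "material_B": [0.0, 1.0, 0.0],
    "material_C": [0.5, 0.5, 0.0],
}

ranking = rank_materials(target, materials)
```

For a historical validation of the kind shown in panels a–c, the same ranking would be computed from embeddings trained only on abstracts published before a cutoff year, and the top-ranked materials would then be checked against later literature.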

## Supplementary information

### Supplementary Information

Supplementary Information, including Supplementary Figures 1–8, Supplementary Tables 1–3 and additional references.

## Cite this article

Tshitoyan, V., Dagdelen, J., Weston, L. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019). https://doi.org/10.1038/s41586-019-1335-8
