Unsupervised word embeddings capture latent knowledge from materials science literature


The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases1,2, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing3,4,5,6,7,8,9,10, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings11,12,13 (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
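The analogy behaviour that such embeddings exhibit (for example, 'Zr is to ZrO2 as Ti is to ?') can be illustrated with a minimal, self-contained sketch. The toy 4-dimensional vectors below are hypothetical values chosen for illustration only; the real mat2vec embeddings are 200-dimensional and learned from millions of abstracts.

```python
import numpy as np

# Hypothetical toy embeddings (illustration only; not real mat2vec vectors).
emb = {
    "Zr":   np.array([1.0, 0.0, 0.2, 0.1]),
    "ZrO2": np.array([1.0, 1.0, 0.2, 0.1]),
    "Ti":   np.array([0.0, 0.0, 0.9, 0.3]),
    "TiO2": np.array([0.0, 1.0, 0.9, 0.3]),
    "NiO":  np.array([0.5, 1.0, 0.1, 0.8]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vocab):
    # Solve "a is to b as c is to ?" by finding the nearest neighbour
    # (under cosine similarity) of the vector b - a + c.
    query = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

print(analogy("Zr", "ZrO2", "Ti", emb))  # -> TiO2
```

The same vector-offset operation, applied to embeddings trained on the materials literature, recovers relationships such as oxide formation and crystal-structure membership.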


Fig. 1: Word2vec skip-gram and analogies.
Fig. 2: Prediction of new thermoelectric materials.
Fig. 3: Validation of the predictions.

Data availability

The scientific abstracts used in this study are available via Elsevier’s Scopus and ScienceDirect APIs (https://dev.elsevier.com/) and the Springer Nature API (https://dev.springernature.com/). The list of DOIs used in this study, the pre-trained word embeddings and the analogies used for validation of the embeddings are available at https://github.com/materialsintelligence/mat2vec. All other data generated and analysed during the current study are available from the corresponding authors on reasonable request.

Code availability

The code used for text preprocessing and Word2vec training is available at https://github.com/materialsintelligence/mat2vec.
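As a rough illustration of the kind of text preprocessing such a pipeline performs, the sketch below keeps chemical formulae case-sensitive while lowercasing ordinary words. This is a simplification of our own devising: the regular expressions, the heuristic and the (truncated) element list here are illustrative assumptions, not the repository's actual rules.

```python
import re

# Toy subset of element symbols, for illustration only.
ELEMENTS = {"H", "Li", "O", "Ti", "Zr", "Sn", "Se", "Te", "Cu", "Ga"}

# A token that parses entirely as element symbols with optional counts.
FORMULA_RE = re.compile(r"^(?:[A-Z][a-z]?\d*\.?\d*)+$")

def looks_like_formula(token):
    """Heuristic: multi-element token built only from known element symbols."""
    if not FORMULA_RE.match(token):
        return False
    parts = re.findall(r"[A-Z][a-z]?", token)
    return len(parts) > 1 and all(p in ELEMENTS for p in parts)

def tokenize(sentence):
    # Lowercase ordinary words but preserve the case of chemical formulae,
    # loosely mirroring the corpus normalization described in the paper.
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9.]*", sentence)
    return [t if looks_like_formula(t) else t.lower() for t in tokens]

print(tokenize("SnSe Crystals show Ultralow thermal conductivity"))
# -> ['SnSe', 'crystals', 'show', 'ultralow', 'thermal', 'conductivity']
```

Preserving formula casing matters because 'SnSe' and a lowercased 'snse' would otherwise receive distinct (or fragmented) embeddings.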


  1. Hill, J. et al. Materials science with large-scale data and informatics: unlocking new opportunities. MRS Bull. 41, 399–409 (2016).
  2. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science. Nature 559, 547–555 (2018).
  3. Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17, S74–S82 (2001).
  4. Müller, H. M., Kenny, E. E. & Sternberg, P. W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2, e309 (2004).
  5. Swain, M. C. & Cole, J. M. ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56, 1894–1904 (2016).
  6. Eltyeb, S. & Salim, N. Chemical named entities recognition: a review on approaches and applications. J. Cheminform. 6, 17 (2014).
  7. Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29, 9436–9444 (2017).
  8. Leaman, R., Wei, C. H. & Lu, Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J. Cheminform. 7, S3 (2015).
  9. Krallinger, M., Rabal, O., Lourenço, A., Oyarzabal, J. & Valencia, A. Information retrieval and text mining technologies for chemistry. Chem. Rev. 117, 7673–7761 (2017).
  10. Spangler, S. et al. Automated hypothesis generation based on mining scientific literature. In Proc. 20th ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining 1877–1886 (ACM, 2014).
  11. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781 (2013).
  12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Preprint at https://arxiv.org/abs/1310.4546 (2013).
  13. Pennington, J., Socher, R. & Manning, C. GloVe: global vectors for word representation. In Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, 2014).
  14. Liu, W. et al. New trends, strategies and opportunities in thermoelectric materials: a perspective. Mater. Today Phys. 1, 50–60 (2017).
  15. He, J. & Tritt, T. M. Advances in thermoelectric materials research: looking back and moving forward. Science 357, eaak9997 (2017).
  16. Ricci, F. et al. An ab initio electronic transport database for inorganic materials. Sci. Data 4, 170085 (2017).
  17. Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Phys. Rev. 136, B864–B871 (1964).
  18. Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133–A1138 (1965).
  19. Gaultois, M. W. et al. Data-driven review of thermoelectric materials: performance and resource considerations. Chem. Mater. 25, 2911–2920 (2013).
  20. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).
  21. Plirdpring, T. et al. Chalcopyrite CuGaTe2: a high-efficiency bulk thermoelectric material. Adv. Mater. 24, 3622–3626 (2012).
  22. Tian, H. et al. Low-symmetry two-dimensional materials for electronic and photonic applications. Nano Today 11, 763–777 (2016).
  23. Pandey, C., Sharma, R. & Sharma, Y. Thermoelectric properties of defect chalcopyrites. AIP Conf. Proc. 1832, 110009 (2017).
  24. Zhao, L.-D. et al. Ultralow thermal conductivity and high thermoelectric figure of merit in SnSe crystals. Nature 508, 373–377 (2014).
  25. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
  26. Peters, M. E. et al. Deep contextualized word representations. Preprint at https://arxiv.org/abs/1802.05365 (2018).
  27. Jain, A. et al. The Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
  28. Ong, S. P. et al. Python Materials Genomics (pymatgen): a robust, open-source Python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
  29. Kresse, G. & Joubert, D. From ultrasoft pseudopotentials to the projector augmented-wave method. Phys. Rev. B 59, 1758–1775 (1999).
  30. Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).
  31. Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B 54, 11169–11186 (1996).
  32. Kresse, G. & Furthmüller, J. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Comput. Mater. Sci. 6, 15–50 (1996).
  33. Madsen, G. K. & Singh, D. J. BoltzTraP. A code for calculating band-structure dependent quantities. Comput. Phys. Commun. 175, 67–71 (2006).
  34. Mathew, K. et al. Atomate: a high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).
  35. Jain, A. et al. FireWorks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. 27, 5037–5059 (2015).
  36. Yang, X., Dai, Z., Zhao, Y., Liu, J. & Meng, S. Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi. J. Phys. Condens. Matter 30, 425401 (2018).
  37. Wang, Y., Gao, Z. & Zhou, J. Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2. Physica E 108, 53–59 (2019).
  38. Mukherjee, M., Yumnam, G. & Singh, A. K. High thermoelectric figure of merit via tunable valley convergence coupled low thermal conductivity in A(II)B(IV)C(V)2 chalcopyrites. J. Phys. Chem. C 122, 29150–29157 (2018).
  39. Kim, E. et al. Machine-learned and codified synthesis parameters of oxide materials. Sci. Data 4, 170127 (2017).
  40. Faber, F. A., Lindmaa, A., von Lilienfeld, O. A. & Armiento, R. Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett. 117, 135502 (2016).
  41. Zhou, Q. et al. Learning atoms for materials discovery. Proc. Natl Acad. Sci. USA 115, E6411–E6417 (2018).



Acknowledgements

This work was supported by Toyota Research Institute through the Accelerated Materials Design and Discovery program. We thank T. Botari, M. Horton, D. Mrdjenovich, N. Mingione and A. Faghaninia for discussions.

Author information




Contributions

All authors contributed to the conception and design of the study, as well as writing of the manuscript. V.T. developed the data processing pipeline, trained and optimized the Word2vec embeddings, trained the machine learning models for property predictions and generated the thermoelectric predictions. V.T., J.D. and L.W. analysed the results and developed the software infrastructure for the project. J.D. trained and optimized the GloVe embeddings and developed the data acquisition infrastructure. L.W. performed the abstract classification. A.D. performed the DFT calculation of thermoelectric power factors. Z.R. contributed to data acquisition. O.K. developed the code for normalization of material formulae. A.D., Z.R. and O.K. contributed to the analysis of the results. K.A.P., G.C. and A.J. supervised the work.

Corresponding authors

Correspondence to Vahe Tshitoyan or Gerbrand Ceder or Anubhav Jain.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Chemistry is captured by word embeddings.

a, Two-dimensional t-distributed stochastic neighbour embedding (t-SNE) projection of the word embeddings of 100 chemical element names (for example, ‘hydrogen’), labelled with the corresponding element symbols and grouped according to their classification. Chemically similar elements cluster together, and the overall distribution exhibits a topology reminiscent of the periodic table itself (compare with b). Arranged from top left to bottom right are the alkali metals, alkaline earth metals, transition metals and noble gases, while the trend from top right to bottom left generally follows increasing atomic number (see Supplementary Information section S4 for a more detailed discussion). b, The periodic table coloured according to the classification shown in a. c, Predicted versus actual (DFT) values of the formation energies of approximately 10,000 ABC2D6 elpasolite compounds40, using a simple neural network model with the word embeddings of the elements as features (see Supplementary Information section S6 for details of the model). The data points in the plot were obtained using fivefold cross-validation. d, Error distribution for the 10% test set of elpasolite formation energies. With no extensive optimization, the word embeddings achieve a mean absolute error (MAE) of 0.056 eV per atom, substantially smaller than both the 0.1 eV per atom reported for the same task in the original study using hand-crafted features40 and the 0.15 eV per atom achieved in a recent study using element features learned automatically from the crystal structures of more than 60,000 compounds41.
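The modelling idea of panel c — representing a compound by the embeddings of its constituent elements — can be sketched as follows. Everything in this snippet is synthetic: the 8-dimensional "embeddings", the target energies and the linear (ridge) model standing in for the study's small neural network are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 8-d element embeddings (the real mat2vec vectors are 200-d).
symbols = ["Cs", "K", "Rb", "Na", "Al", "Ga", "In", "F", "Cl", "Br"]
emb = {s: rng.standard_normal(8) for s in symbols}

def featurize(quad):
    """Concatenate the four element embeddings of an ABC2D6 composition."""
    return np.concatenate([emb[s] for s in quad])

# Synthetic training set: random quaternaries with synthetic "energies"
# generated from a hidden linear map plus a little noise.
quads = [rng.choice(symbols, size=4, replace=False) for _ in range(200)]
X = np.stack([featurize(q) for q in quads])
w_true = rng.standard_normal(X.shape[1])
y = X @ w_true + 0.01 * rng.standard_normal(len(X))

# Ridge regression via the normal equations (in place of a neural network).
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
mae = np.abs(X @ w - y).mean()
print(f"train MAE: {mae:.4f}")
```

The point of the sketch is the featurization step: once each compound is a fixed-length vector built from element embeddings, any standard regressor can map it to a property value.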

Extended Data Fig. 2 Historical validations of functional material predictions.

a–c, Ferroelectric (a), photovoltaic (b) and topological insulator (c) predictions using word embeddings obtained from various historical datasets, similar to Fig. 3a. For ferroelectrics and photovoltaics, the range of prediction years is 2001–2018. The phrase ‘topological insulator’ obtained its own embedding in our corpus only in 2011 (owing to count and vocabulary size limits), so those results can be analysed only over a shorter period (2011–2018). Each grey line uses only abstracts published before a certain year to make predictions. The lines show the cumulative percentage of predicted materials studied in the years following their prediction; earlier predictions can be analysed over longer test periods. The averaged results (red) are compared to baseline percentages computed over all materials. d, The target word or phrase used to rank materials for each application (on the basis of cosine similarity), and the corresponding words used as indicators of a potentially existing study.
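The ranking step behind these historical validations reduces to ordering candidate materials by the cosine similarity of their embedding to the embedding of an application keyword. A minimal sketch, with toy 3-dimensional vectors of our own invention (the study uses 200-dimensional embeddings trained on abstracts published before a chosen cutoff year):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(target_vec, candidates):
    # Rank candidate materials by cosine similarity of their embedding
    # to the embedding of the application keyword.
    scores = {name: cosine(vec, target_vec) for name, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy vectors (hypothetical values for illustration only).
thermoelectric = np.array([1.0, 0.2, 0.0])
candidates = {
    "SnSe":    np.array([0.9, 0.3, 0.1]),
    "CuGaTe2": np.array([0.8, 0.1, 0.2]),
    "NaCl":    np.array([0.0, 0.1, 1.0]),
}
print(rank_candidates(thermoelectric, candidates))
# -> ['SnSe', 'CuGaTe2', 'NaCl']
```

Repeating this ranking with embeddings trained on progressively truncated corpora is what produces the grey prediction lines in panels a–c.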

Extended Data Table 1 Materials science analogies
Extended Data Table 2 Top 50 thermoelectric predictions
Extended Data Table 3 Top five functional material predictions and context words
Extended Data Table 4 Importance of the text corpus

Supplementary information

Supplementary Information

Supplementary Information, including Supplementary Figures 1–8, Supplementary Tables 1–3 and additional references.


About this article


Cite this article

Tshitoyan, V., Dagdelen, J., Weston, L. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019). https://doi.org/10.1038/s41586-019-1335-8
