
Unsupervised word embeddings capture latent knowledge from materials science literature


The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases [1,2], which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing [3–10], which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings [11–13] (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
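The analogy behaviour of such embeddings can be illustrated with plain vector arithmetic. The sketch below is a minimal illustration, not the paper's trained model: the vocabulary and the low-dimensional vectors are hypothetical stand-ins for the 200-dimensional Word2vec embeddings trained on millions of abstracts.

```python
import numpy as np

# Hypothetical toy embeddings, for illustration only; the paper's embeddings
# are high-dimensional vectors learned from materials science abstracts.
emb = {
    "Na":   np.array([1.0, 0.0, 0.2]),
    "K":    np.array([0.9, 0.1, 0.3]),
    "NaCl": np.array([0.5, 1.0, 0.2]),
    "KCl":  np.array([0.4, 1.1, 0.3]),
    "Fe":   np.array([-1.0, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, emb):
    """Solve a : b :: c : ? by finding the vocabulary word whose vector
    is most similar (by cosine) to emb[b] - emb[a] + emb[c]."""
    query = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

print(analogy("Na", "NaCl", "K", emb))  # resolves to "KCl" in this toy space
```

With trained embeddings, the same arithmetic recovers chemistry-aware analogies such as 'NaCl is to Na as KCl is to K' without any explicit chemical supervision.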


Fig. 1: Word2vec skip-gram and analogies.
Fig. 2: Prediction of new thermoelectric materials.
Fig. 3: Validation of the predictions.

Data availability

The scientific abstracts used in this study are available via Elsevier’s Scopus and ScienceDirect APIs and the Springer Nature API. The list of DOIs used in this study, the pre-trained word embeddings and the analogies used for validation of the embeddings are publicly available. All other data generated and analysed during the current study are available from the corresponding authors on reasonable request.

Code availability

The code used for text preprocessing and Word2vec training is publicly available.


  1. Hill, J. et al. Materials science with large-scale data and informatics: unlocking new opportunities. MRS Bull. 41, 399–409 (2016).

  2. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science. Nature 559, 547–555 (2018).

  3. Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17, S74–S82 (2001).

  4. Müller, H. M., Kenny, E. E. & Sternberg, P. W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2, e309 (2004).

  5. Swain, M. C. & Cole, J. M. ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56, 1894–1904 (2016).

  6. Eltyeb, S. & Salim, N. Chemical named entities recognition: a review on approaches and applications. J. Cheminform. 6, 17 (2014).

  7. Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29, 9436–9444 (2017).

  8. Leaman, R., Wei, C. H. & Lu, Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J. Cheminform. 7, S3 (2015).

  9. Krallinger, M., Rabal, O., Lourenço, A., Oyarzabal, J. & Valencia, A. Information retrieval and text mining technologies for chemistry. Chem. Rev. 117, 7673–7761 (2017).

  10. Spangler, S. et al. Automated hypothesis generation based on mining scientific literature. In Proc. 20th ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining 1877–1886 (ACM, 2014).

  11. Mikolov, T., Corrado, G., Chen, K. & Dean, J. Efficient estimation of word representations in vector space. Preprint (2013).

  12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Preprint (2013).

  13. Pennington, J., Socher, R. & Manning, C. GloVe: global vectors for word representation. In Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, 2014).

  14. Liu, W. et al. New trends, strategies and opportunities in thermoelectric materials: a perspective. Mater. Today Phys. 1, 50–60 (2017).

  15. He, J. & Tritt, T. M. Advances in thermoelectric materials research: looking back and moving forward. Science 357, eaak9997 (2017).

  16. Ricci, F. et al. An ab initio electronic transport database for inorganic materials. Sci. Data 4, 170085 (2017).

  17. Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Phys. Rev. 136, B864–B871 (1964).

  18. Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133–A1138 (1965).

  19. Gaultois, M. W. et al. Data-driven review of thermoelectric materials: performance and resource considerations. Chem. Mater. 25, 2911–2920 (2013).

  20. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).

  21. Plirdpring, T. et al. Chalcopyrite CuGaTe2: a high-efficiency bulk thermoelectric material. Adv. Mater. 24, 3622–3626 (2012).

  22. Tian, H. et al. Low-symmetry two-dimensional materials for electronic and photonic applications. Nano Today 11, 763–777 (2016).

  23. Pandey, C., Sharma, R. & Sharma, Y. Thermoelectric properties of defect chalcopyrites. AIP Conf. Proc. 1832, 110009 (2017).

  24. Zhao, L.-D. et al. Ultralow thermal conductivity and high thermoelectric figure of merit in SnSe crystals. Nature 508, 373–377 (2014).

  25. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint (2018).

  26. Peters, M. E. et al. Deep contextualized word representations. Preprint (2018).

  27. Jain, A. et al. The Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).

  28. Ong, S. P. et al. Python Materials Genomics (pymatgen): a robust, open-source Python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).

  29. Kresse, G. & Joubert, D. From ultrasoft pseudopotentials to the projector augmented-wave method. Phys. Rev. B Condens. Matter Mater. Phys. 59, 1758–1775 (1999).

  30. Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).

  31. Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B Condens. Matter 54, 11169–11186 (1996).

  32. Kresse, G. & Furthmüller, J. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Comput. Mater. Sci. 6, 15–50 (1996).

  33. Madsen, G. K. & Singh, D. J. BoltzTraP. A code for calculating band-structure dependent quantities. Comput. Phys. Commun. 175, 67–71 (2006).

  34. Mathew, K. et al. Atomate: a high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).

  35. Jain, A. et al. FireWorks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. 27, 5037–5059 (2013).

  36. Yang, X., Dai, Z., Zhao, Y., Liu, J. & Meng, S. Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi. J. Phys. Condens. Matter 30, 425401 (2018).

  37. Wang, Y., Gao, Z. & Zhou, J. Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2. Physica E 108, 53–59 (2019).

  38. Mukherjee, M., Yumnam, G. & Singh, A. K. High thermoelectric figure of merit via tunable valley convergence coupled low thermal conductivity in \(A^{II}B^{IV}C_2^{V}\) chalcopyrites. J. Phys. Chem. C 122, 29150–29157 (2018).

  39. Kim, E. et al. Machine-learned and codified synthesis parameters of oxide materials. Sci. Data 4, 170127 (2017).

  40. Faber, F. A., Lindmaa, A., Von Lilienfeld, O. A. & Armiento, R. Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett. 117, 135502 (2016).

  41. Zhou, Q. et al. Learning atoms for materials discovery. Proc. Natl Acad. Sci. USA 115, E6411–E6417 (2018).



Acknowledgements

This work was supported by Toyota Research Institute through the Accelerated Materials Design and Discovery program. We thank T. Botari, M. Horton, D. Mrdjenovich, N. Mingione and A. Faghaninia for discussions.

Author information

Contributions



All authors contributed to the conception and design of the study, as well as writing of the manuscript. V.T. developed the data processing pipeline, trained and optimized the Word2vec embeddings, trained the machine learning models for property predictions and generated the thermoelectric predictions. V.T., J.D. and L.W. analysed the results and developed the software infrastructure for the project. J.D. trained and optimized the GloVe embeddings and developed the data acquisition infrastructure. L.W. performed the abstract classification. A.D. performed the DFT calculation of thermoelectric power factors. Z.R. contributed to data acquisition. O.K. developed the code for normalization of material formulae. A.D., Z.R. and O.K. contributed to the analysis of the results. K.A.P., G.C. and A.J. supervised the work.

Corresponding authors

Correspondence to Vahe Tshitoyan, Gerbrand Ceder or Anubhav Jain.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Chemistry is captured by word embeddings.

a, Two-dimensional t-distributed stochastic neighbour embedding (t-SNE) projection of the word embeddings of 100 chemical element names (for example, ‘hydrogen’), labelled with the corresponding element symbols and grouped according to their classification. Chemically similar elements cluster together, and the overall distribution exhibits a topology reminiscent of the periodic table itself (compare to b). Arranged from top left to bottom right are the alkali metals, alkaline earth metals, transition metals and noble gases, while the trend from top right to bottom left generally follows increasing atomic number (see Supplementary Information section S4 for a more detailed discussion). b, The periodic table coloured according to the classification shown in a. c, Predicted versus actual (DFT) values of the formation energies of approximately 10,000 ABC2D6 elpasolite compounds [40], using a simple neural network model with the word embeddings of elements as features (see Supplementary Information section S6 for the details of the model). The plotted values were obtained using fivefold cross-validation. d, Error distribution for the 10% test set of elpasolite formation energies. With no extensive optimization, the word embeddings achieve a mean absolute error (MAE) of 0.056 eV per atom, substantially smaller than the 0.1 eV per atom reported for the same task in the original study using hand-crafted features [40] and the 0.15 eV per atom achieved in a recent study using element features automatically learned from the crystal structures of more than 60,000 compounds [41].
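The featurization step behind panel c can be sketched as follows. This is a toy illustration under stated assumptions: hypothetical 3-dimensional element vectors stand in for the trained 200-dimensional embeddings, and the elements of an ABC2D6 compound are simply concatenated in fixed A, B, C, D site order as one plausible design choice (the actual model details are in Supplementary Information section S6).

```python
import numpy as np

# Hypothetical 3-d element embeddings standing in for the trained 200-d vectors.
element_emb = {
    "Cs": np.array([0.1, 0.9, 0.0]),
    "Na": np.array([0.2, 0.8, 0.1]),
    "Li": np.array([0.3, 0.7, 0.1]),
    "F":  np.array([0.9, 0.0, 0.5]),
}

def elpasolite_features(a, b, c, d, emb):
    """Concatenate the element vectors of an ABC2D6 compound into a single
    feature vector, in fixed site order A, B, C, D. The result can be fed to
    a regression model (here, a neural network predicting formation energy)."""
    return np.concatenate([emb[a], emb[b], emb[c], emb[d]])

x = elpasolite_features("Cs", "Na", "Li", "F", element_emb)
print(x.shape)  # (12,) with these 3-d toy vectors
```

Because the element vectors were learned purely from text, a model trained on such features carries literature-derived chemical similarity into the property-prediction task.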

Extended Data Fig. 2 Historical validations of functional material predictions.

a–c, Ferroelectric (a), photovoltaic (b) and topological insulator (c) predictions using word embeddings obtained from various historical datasets, similar to Fig. 3a. For ferroelectrics and photovoltaics, the range of prediction years is 2001–2018. The phrase ‘topological insulator’ obtained its own embedding in our corpus only in 2011 (owing to count and vocabulary size limits), so the results can be analysed only over a shorter time period (2011–2018). Each grey line uses only abstracts published before a certain year to make predictions. The lines show the cumulative percentage of predicted materials studied in the years following their prediction; earlier predictions can be analysed over longer test periods. The results are averaged in red and compared to baseline percentages from all materials. d, The target word or phrase used to rank materials for each application (based on cosine similarity), and the corresponding words used as indicators of a potentially existing study.
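The ranking described in d, which orders candidate materials by the cosine similarity of their embedding to the embedding of a target word such as ‘thermoelectric’, can be sketched as follows. The embeddings here are hypothetical toy vectors for illustration only, not the trained model.

```python
import numpy as np

def rank_by_similarity(target, candidates, emb):
    """Rank candidate material names by cosine similarity between their
    embedding and the embedding of the target word/phrase, highest first."""
    t = emb[target]
    t = t / np.linalg.norm(t)
    scores = {name: float(t @ emb[name] / np.linalg.norm(emb[name]))
              for name in candidates}
    return sorted(candidates, key=scores.get, reverse=True)

# Hypothetical embeddings for illustration.
emb = {
    "thermoelectric": np.array([1.0, 0.2, 0.0]),
    "Bi2Te3": np.array([0.9, 0.3, 0.1]),
    "NaCl":   np.array([0.0, 0.1, 1.0]),
}
ranking = rank_by_similarity("thermoelectric", ["NaCl", "Bi2Te3"], emb)
print(ranking)  # Bi2Te3 outranks NaCl in this toy space
```

In the historical validations, the top-ranked candidates that had never co-occurred with the target phrase before the cutoff year are the ones counted as predictions.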

Extended Data Table 1 Materials science analogies
Extended Data Table 2 Top 50 thermoelectric predictions
Extended Data Table 3 Top five functional material predictions and context words
Extended Data Table 4 Importance of the text corpus

Supplementary information

Supplementary Information

Supplementary Information, including Supplementary Figures 1–8, Supplementary Tables 1–3 and additional references.


About this article


Cite this article

Tshitoyan, V., Dagdelen, J., Weston, L. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
