PubMed is a widely used search engine for biomedical literature. It is developed and maintained by the US National Library of Medicine/National Center for Biotechnology Information and is visited daily by millions of users around the world. For decades, PubMed has used advanced artificial intelligence technologies, such as machine learning and natural language processing, that extract patterns of collective user activity to inform the algorithmic changes that ultimately improve a user's search experience. Although these efforts have led to objective improvements in search quality, the technical underpinnings go largely unnoticed by most users. Here we describe how these 'under-the-hood' techniques work within PubMed and report how their effectiveness and usage are assessed in real-world scenarios. In doing so, we hope to increase the transparency of the PubMed system and enable users to make more effective use of the search engine. We also identify open challenges and new opportunities for computational researchers to explore potential future improvements.
The authors would like to thank K. Canese, R. Ismagilov, G. Starchenko, E. Kireev, J. Wilbur, D. Comeau, S. Kim, W. Kim, L. Yeganova, V. Miller, M. Osipov, R. Bryzgunov, I. Radetska, A. Gindulyte, M. Latterner, the NLM/NCBI leadership, and the many NCBI and NLM Library Operations staff working on and contributing to PubMed. This research was supported by the NIH Intramural Research Program, National Library of Medicine.
The authors declare no competing financial interests.
Cite this article
Fiorini, N., Leaman, R., Lipman, D. et al. How user intelligence is improving PubMed. Nat Biotechnol 36, 937–945 (2018). https://doi.org/10.1038/nbt.4267