Main

Science is often seen to provide substantial impacts beyond the community of scientists themselves—for technological progress, government function, basic human curiosity and more1,2,3,4,5,6,7,8,9. Given the potential benefits, many nations have built institutional architectures to support science through public investment, following the logic of public goods10,11,12. Like a public park, which is funded by the government and can be visited for free, scientific research is substantially funded by governments, with its results placed in the public domain. This institutional design seeks to enable broad use of scientific ideas and avoid under-investment by private actors (for further background on the nature of public goods, see Supplementary Note 3). Yet, in turning to public funding, this approach in part relies on the idea that public investment in science can match the public interest in science.

Although public investment in science is a central feature of the scientific ecosystem11,12,13, empirically examining the varied public uses of science and testing whether there is alignment between public funding and public use has remained elusive, mainly owing to the difficulty in collecting systematic data. Moreover, the lack of measurement has invited substantial scepticism. Indeed, many observers view scientific research as a cloistered or ‘ivory tower’ activity that rarely corresponds to the public interest14,15,16,17,18. For example, the ‘two communities’ and ‘two cultures’ theories highlight substantial knowledge and interest gaps between scientists and policymakers, disconnecting scientific research from policy insights19,20,21,22 and suggesting little relationship between the quality of research and its public usage20,23,24. Meanwhile, scientists may have peculiar interests, with little exposure to real-world problems or incentives to tackle them7,25. These potential gaps further animate root concerns over the public funding of science and its proper allocation26,27,28,29. For example, policymakers have long criticized the National Science Foundation for funding frivolous research and have called for greater transparency around the relevance of science26,27. Some prominent academics and commentators, including Nobel-Prize winner Milton Friedman, have taken the position that the government should not fund science, favouring purely private sector research instead28,29.

In this Article, we advance a measurement framework to study public uses of science, the public funding of science and how public use and public funding relate. Building on prior research that considers the use of science within a given public domain30,31,32,33,34,35, here we integrate five large-scale datasets that link scientific publications from all scientific fields to their upstream funding support and downstream public uses across three public domains. Our first dataset (D1) is scientific publications, using Microsoft Academic Graph (MAG)36, which is one of the largest bibliometric databases of scientific research in the world (Methods and Supplementary Note 1.1). Our second dataset (D2) leverages the Microsoft Bing search engine to collect about 6 million government documents available online across all branches of the US government37. Using a machine reading technology, we systematically identify academic publications that are referenced in these government documents and match these references to MAG. This pipeline allows us to collect a high-scale dataset on how government documents consume scientific knowledge (Methods and Supplementary Note 1.2). In total, we identify 389,896 unique academic publications cited by 43,014 government documents. We further leverage a secondary policy documents database, Overton, to help validate results obtained from D2 (Supplementary Note 2.1). Our third dataset (D3) uses the Altmetric data31,32 to track academic publications covered by mainstream media reports. Matching these publications to the MAG data yields 724,849 unique papers covered by 2,701 media outlets (Methods and Supplementary Note 1.3). Building on prior work33,34,35,38, our fourth dataset (D4) links all patents granted by the US Patent and Trademark Office (USPTO) to the academic papers they reference, yielding 4,276,940 papers cited by 1,932,642 patents (Methods and Supplementary Note 1.4). 
Our main results focus on papers published between 2005 and 2014, a common period covered by all three datasets, resulting in 128,465, 275,536 and 1,296,922 papers cited in government, news and patent documents, respectively. Finally, we integrate funding records, using the Dimensions39 dataset (D5), which includes 5 million projects funded by over 400 funding agencies worldwide and links each funded project with its resulting publications (Methods and Supplementary Note 1.5). The Methods section and Supplementary Notes 1 and 2 further detail the construction of each dataset and additional validations.

Results

Diversity in public use

Our first analyses measure the usage of scientific research in the three public domains. To conduct this analysis, we first leverage the MAG’s classification of papers across 19 top-level fields. To account for cross-field differences in publication volume, we define a relative consumption index (RCI). For a given public domain (d) and scientific field (f), RCI measures the fraction of papers in the field consumed by that public domain, normalized by the same fraction calculated on all fields for that domain. That is,

$$\mathrm{RCI}_d^f = \frac{\#\,\mathrm{papers\ in\ field}\ f\ \mathrm{consumed\ by\ domain}\ d \,/\, \#\,\mathrm{papers\ in}\ f}{\mathrm{Total}\ \#\,\mathrm{papers\ consumed\ by\ domain}\ d \,/\, \mathrm{Total}\ \#\,\mathrm{papers}}.$$
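As an illustration only, the RCI definition can be sketched in a few lines of Python. The function name, the data layout (one field label per paper) and the toy data are assumptions made for this sketch; they are not taken from the actual MAG pipeline, in which a paper may carry several field labels.

```python
from collections import Counter


def relative_consumption_index(paper_fields, consumed_ids):
    """Return {field: RCI} for a single public domain.

    paper_fields: dict mapping paper id -> field label (all papers).
    consumed_ids: set of paper ids cited by the public domain.
    Assumption for this sketch: each paper has exactly one field label.
    """
    total_papers = len(paper_fields)
    total_consumed = sum(1 for pid in paper_fields if pid in consumed_ids)
    if total_papers == 0 or total_consumed == 0:
        raise ValueError("need at least one paper and one consumed paper")

    # Denominator of the RCI: share of all papers consumed by the domain.
    overall_rate = total_consumed / total_papers

    papers_per_field = Counter(paper_fields.values())
    consumed_per_field = Counter(
        field for pid, field in paper_fields.items() if pid in consumed_ids
    )

    # Numerator: share of the field's papers consumed by the domain.
    return {
        field: (consumed_per_field[field] / n) / overall_rate
        for field, n in papers_per_field.items()
    }


# Hypothetical toy data: 4 biology papers and 6 mathematics papers.
papers = {f"bio{i}": "biology" for i in range(4)}
papers.update({f"math{i}": "mathematics" for i in range(6)})
cited_by_news = {"bio0", "bio1", "math0"}

rci = relative_consumption_index(papers, cited_by_news)
```

In this toy example, biology has an RCI above 1 (over-represented relative to the null model in which every paper is equally likely to be used) and mathematics has an RCI below 1.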

We find that the public uses of science are diverse, with many fields showing substantially specialized usage in public domains (Fig. 1). Computer science, materials science, mathematics and engineering (Fig. 1d,i–j) present substantially larger RCI values for patents than for government or news. By contrast, environmental science and geology (Fig. 1f,h) contribute relatively strongly in government and media documents compared with patents. Finally, physics, chemistry, medicine and biology present a broader range of use (Fig. 1b,c,k,l). Among all fields, biology is the only one over-represented across all three channels, demonstrating a uniquely general relevance to these broad domains beyond science.

Fig. 1: Diversity in public use.

a–t, Different scientific fields experience distinct and typically specialized public uses. The RCI usage metric for the three public domains is presented for each field (b–t). The dashed triangles represent a null model in which each paper has the same chance of being used (a). The colour scheme highlights four high-level areas of research (the physical sciences, life sciences, social sciences, and ecology and earth sciences), following the four major clusters of science detected by ref. 62 and suggesting commonalities in patterns of public use within these four areas.

Social sciences, by contrast, exhibit a visibly different pattern of public use. The social sciences are strongly consumed in government and media domains while showing systematically low usage in patents (Fig. 1m–q). Economics sees especially strong government use, while psychology, sociology and political science see relatively strong media use. Arts and humanities (philosophy, art and history; Fig. 1r–t) are relatively under-represented in all three domains.

Specialization in public use further appears at subdomain levels (Supplementary Fig. 6). For government, different agencies consume very different scientific research. For example, the US Department of Treasury draws especially on economics and business research, the US Department of Energy draws especially on geology and engineering and the US Department of Defense draws unusually on history. Different patenting fields further exhibit highly specialized relationships with specific scientific fields. By contrast, in media, while The Washington Post draws unusually heavily on political science research, mainstream media sources in general are more consistent in the fields they report, with especially strong and widespread interest in medicine and psychology.

The specialization in public use is further accompanied by substantial differences in time lags in the use of science by the different public domains. Whereas the news media places a particular focus on very recent work, the government and inventive domains have wider reach into prior discovery (Supplementary Note 4.5). For example, in the news media, 63% of citations to scientific articles cover research papers published within the year. By contrast, government documents and patent inventions draw more widely over past work, with a median citation lag of 10 years between scientific publication and use (Supplementary Fig. 7). Importantly, while the public domains differ considerably in time lags, we find that the RCI comparisons are extremely similar when considering either the recent decade of scientific publications (Fig. 1) or the stock of scientific publications over a substantially longer history (Supplementary Fig. 8), indicating that the results in Fig. 1 are robust controlling for time lags.

Overall, these results highlight a large set of specialized relationships between specific domains of public use and specific fields of scientific research. From a public goods perspective, if we think of scientific fields as akin to a series of national parks, we see that each park is embedded in particular communities of public use. Collectively, these parks spread across diverse regions of knowledge and are accessed by diverse segments of the public. A few fields, especially biology, receive visitors at relatively intense rates from a broad range of public domains—a ‘Yellowstone Park’ of science.

Scientific impact and public use

Our second set of results examines whether the public domains tend to consume ideas that scientists themselves consider impactful. Longstanding arguments suggest that the public is not well equipped to evaluate science and may draw on poorly established scientific ideas, which would undermine the public good benefits of science20,23,24. Continuing the national parks metaphor, scientists may be primarily focused on a hard-to-reach backcountry, whereas the typical visitor may neither have the tools to access this terrain nor gravitate to the areas the scientists themselves consider attractive. To further examine public use, we therefore consider, at the article level, the alignment between public use and scientific use. Specifically, we calculate the probability of being a hit paper within science, defined as those papers in the top 1% of citations within the same field and year, and examine the relationship to usage in the public domains (Methods and Fig. 2b). We find that papers referenced in public domains have a remarkably high likelihood of being hit papers within science. Papers cited by government documents, news or patents exhibit hit rates of 14.1%, 18.0% and 9.1%, respectively, all large multiples of the baseline rate of 1%. Further, papers referenced in the intersection of different domains tend to be exceptionally impactful in science. For papers referenced in two public domains, approximately half are hit papers. Papers referenced by both government documents and news media have a hit rate of 45.1%. The results are broadly similar if we examine the intersection between government documents and patents (38.7%) or news and patents (46.1%). A paper consumed in all three domains is a hit paper in science at a staggering 72.8 times the baseline rate. 
Reversing the exercise, we also see that, as the citation percentile of a paper rises, the probability for public use increases steeply, with extremely sharp increases at the very top of the citation distribution (Supplementary Figs. 9 and 10).

Fig. 2: Public use and scientific use.

The public tends to consume exceptionally high-impact science from all fields and in all three public domains, indicating alignment between public use and scientific use. a, Usage by domain for papers published from 2005 to 2014. The area of each subset is proportional to the square root of the paper count in the corresponding public domain. b, Hit rates for papers cited in at least one, two or three public domains. Hit papers are defined as those receiving citation counts, within science, in the top 1% within the field and year. c–e, Hit rates for each of the 19 fields consumed by government documents (c), news media (d) and patents (e). In all fields, and in all three domains, the consumed papers tend to have hit rates within science many times larger than the baseline rate of 1% (dashed line).

The use of high-impact papers is not only common across different public domains, it also appears universal across research areas. Papers covered by public domains tend to be highly cited in all scientific fields (Fig. 2c–e). These findings remain similar when varying the threshold for hit papers to the top 5% or 10% citations (Supplementary Note 4.4 and Supplementary Figs. 11 and 12). We also repeat our analyses for papers produced by United States-based researchers, arriving at the same conclusions (Supplementary Fig. 13). While government, media and patenting documents may cite science for a variety of reasons and our reference-based measures are proxies for uses of science22,40,41, we see that the science referenced in public domains is not in conflict with what scientists themselves consider important; rather, impactful papers defined by these communities show substantial overlap. This finding stands in contrast to concerns over knowledge gaps, where the government and media in particular may be poorly positioned to assess high-impact scientific work or distinguish it from low-impact scientific work20,23,24,31,42. Considering the findings, one may note that, in each of these public domains, the initial step beyond science involves an intermediary—via the journalist in media, the inventor or other domain expert in patenting, the potential policy expert in government—all of whom may bring specialized capacities to bear in selecting what science they bring forth into their domain. The broader public use—among those who read a news article, use an invented product or experience a policy—will then depend upon these intermediaries, who may help bridge the knowledge gap. Overall, the public use of science, while marked by substantial specialization in use across research areas, presents a striking universality, where diverse public domains all draw on the highest-impact scientific papers within each field.

We further fine-grain the 19 broad research fields of papers into 294 subfields as indexed by MAG, and calculate the RCI score for each subfield in a given public domain. We visualize each field’s RCI values, locating each field within a common triangle to compare each field’s tendency toward usage in specific public domains (Fig. 3a). Fields in social science as well as arts and humanities are mostly used in media and government, whereas fields in science and engineering spread out widely within the triangle, again highlighting the field-level specialization yet collective diversity in the public uses of science.

Fig. 3: Public use and public funding.

Amidst enormous diversity in public use across fields and domains, scientific funding for a given field is closely aligned with the totality of its public use. a, Ternary plot of RCI for 294 level-1 fields together, with the location of each field indicating its relative usage among the public domains. Circles are colour-coded according to their parent field in Fig. 1, and circle sizes reflect overall usage. b–d, Average funding (US$) per paper across fields is positively correlated with a field’s RCI index in government (b), news (c) and patenting (d). The relationship remains statistically significant when combined with control variables (P < 0.001 in ordinary least squares regressions controlling for the number of papers and parent field fixed effects; see Supplementary Table 2 for details). e, Collectively, public uses beyond science strongly predict field-level funding per paper.

Public use and public funding

Together, these results raise a central question: To what degree does the funding input for science relate to the field’s public use? The majority of scientific research is supported by public investment, which aims to advance not only science itself but also broader public interest41. The National Science Foundation, for example, formally introduced broader impacts as a key criterion for evaluating grant proposals in 1997. Here we focus on US-funded projects and use D5 to calculate the average funding per paper in a given subfield as a proxy for public investment costs per unit of output.

We find that the public investment per paper differs dramatically across fields, spanning over five orders of magnitude. Yet comparing average funding per paper with RCI in each domain reveals substantial correlations between funding and the use of science across all three public domains, with R² = 0.159 for government, 0.272 for news and 0.376 for patents (Fig. 3b–d, Methods and Supplementary Table 1). To further test whether the uncovered correlation is due to heterogeneity in field size or parent field, we add the number of papers in the subfield as well as parent field fixed effects (for the 19 higher-level fields) to the regression, finding that the strong correlation with RCI persists (P < 0.001 in all three cases). Notably, across the three domains, the representation of subfields in government documents has the lowest predictive power for funding, suggesting that public investments in science better reflect the overall public interest captured by media or patents. We further include funding from non-governmental sources or focus on papers by US researchers only, finding that our conclusions remain the same (Supplementary Notes 4.1 and 4.2, Supplementary Figs. 19 and 20 and Supplementary Tables 4–7).

Most strikingly, a simple linear regression model combining the three RCI values together yields a surprisingly high degree of agreement with funding, with an R² of 0.647 (Fig. 3e, Methods and Supplementary Table 2), providing at minimum a 72% increase in predictive power compared with using any of the three public domains alone. These results suggest that each public domain provides independent predictive power for understanding the allocation of public investment in science. The uncovered high predictive power of this analysis is especially striking given many complex factors and processes at work in appropriations, budget setting and grant review43,44,45,46,47,48. Although each research field differs substantially in its relative role and contribution in science and beyond, the combination of their impacts beyond science powerfully predicts funding, suggesting that, ultimately, what the public uses, what scientists use and what is funded are remarkably consistent.
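The 72% figure can be checked directly from the reported values, assuming (as the phrase 'at minimum' suggests) that the comparison is against the best-performing single domain, patents. The subscripts below are notation introduced for this check only:

$$\frac{R^2_{\mathrm{joint}}}{\max_d R^2_d} = \frac{0.647}{0.376} \approx 1.72,$$

that is, an increase of roughly 72% over patents alone. The same ratio is larger for news (0.647/0.272 ≈ 2.38) and for government (0.647/0.159 ≈ 4.07).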

Discussion

One source of this alignment could be that science follows the public interest. For example, scientists may prioritize or innately share areas of interest, such as coronavirus disease 2019 (COVID-19), where there is enormous public demand for solutions and where scientific attention has surged30,49,50. Another source could be that some scientists or science institutions are especially good at promoting their interests to the public, influencing what the public sees and funds. For example, one may wonder if high-prestige journals, eminent authors or funding for a paper drive attention to specific research. To test this, we further consider fine-grained, paper-level regressions that include journal fixed effects, author fixed effects and paper-level funding indicators. We find that the results are very similar, regardless of these controls (Supplementary Note 4.6, Supplementary Table 8 and Supplementary Figs. 21–23). Indeed, the relative attention to different fields (Fig. 1), the alignment between public use and high-impact science (Fig. 2) and the alignment with public funding (Fig. 3) all appear robust after accounting for journal placement, the scientists who produced the work or the funding status of the specific paper. Thus, while some scientists, journals or funders may have advantages in reaching the public, the forms of alignment we see appear primarily as features of a research area, rather than the specific promotion opportunities from a journal, scientist or funding. More generally, numerous mechanisms, institutional factors and policies may be at work in producing, increasing or reducing use and alignment, and unpacking these mechanisms is an exciting area for future work.

Altogether, the analysis probes quantitatively key features of the public use and funding of science. Measuring the usage of scientific research outside science itself, we uncover enormous diversity and specialization in how different fields of scientific enquiry are linked to different public domains. Yet, despite these differences, the different public domains (and subdomains) universally draw on highly cited papers within science, indicating that public use is strongly aligned with what scientists themselves consider impactful. And, critically, the public usage of scientific fields across the diverse domains provides simple yet powerful predictors for the level of public investment in each field.

Note that, although the three domains each represent an important dimension of the public space, they do not cover all domains that science may impact. Even within each of the three domains we studied, there may be consumption of science through channels that go beyond our datasets. For example, scientists and their ideas can appear through television, in congressional testimony and in private sector consulting. Scientific ideas may also enter industry and government through social networks, through the hiring of scientists, and through influencing managerial practices (Supplementary Fig. 24), which may augment and alter perspectives on the public use of specific research fields. While there is much still to explore, this paper introduces a quantitative framework to examine public uses of science at the individual paper level, both across all scientific fields and diverse public domains, revealing individually specialized and collectively diverse uses, universality in impact and a remarkable alignment between the funding of science and its public use.

As society’s support of science depends on a public goods model11,13, and as legislators have called for more transparency in the usage and value of scientific funding51, the framework developed in this paper provides an empirical tool, offering quantitative evidence to inform discussions around public interest features of science. The allocation of science funding involves chains of decisions by individuals and groups with different perspectives and priorities. These considerations range from legislative committees and the goals of individual political representatives, to funding agency leaders, to within-agency mechanisms that often incorporate insights from scientists, interacting in a complex process that must bridge across distinct communities. As such, one might expect a substantial disconnect between what is eventually funded and forms of public interest; metaphorically, funding of public parks in ways weakly related to public use. Yet, despite the massive diversity in the public uses of science and a complex funding process, there is remarkable alignment in the end result. What the public uses and what scientists themselves use are closely consistent. And the funding of science closely tracks quantifiable public use. These results suggest the connections between the ivory tower and the real world appear more aligned than is commonly imagined.

Methods

Microsoft Academic Graph (D1)

The publication and citation data are primarily obtained from Microsoft Academic Graph (MAG)36,52. MAG is among the largest open-source citation databases thus far and contains records of 209 million documents. We inter-linked different data tables to obtain the author, affiliation, year, publication venue and field information for each paper. Data pre-processing and summary statistics are further documented in Supplementary Note 1.1.

US government documents (D2)

To quantify references to scientific articles in the government domain, one needs to construct a large-scale dataset of government documents that can be linked to the scientific papers. The task has been difficult in part because government documents are spread across many sources. Furthermore, although a substantial fraction of such documents may cite scientific literature, such citations do not follow a common structure.

Our data collection starts with a list of 6 million URLs under the .gov domain, which is the domain name for government agencies and contains the vast majority of US government entities. We downloaded these pages using an automatic crawler and focused on all PDF files in this set, extracting the references cited in these files using Science Parse53, an open-source tool for reference string extraction. We then matched this list to the MAG with a search engine-like system using title, journal, author and publication year information. Supplementary Note 1.2 documents technical details of this data pipeline. We also perform additional validation analysis using Overton, an independent dataset of policy documents (Supplementary Note 2.1).

Altmetric dataset (D3)

To study references to scientific publications in the news media, we use a dataset offered by Altmetric31,32,54. This dataset records approximately 26.2 million papers with at least one news media or social media mention. We then merge paper information with MAG. A vast majority (22.1 million) of such publications in the Altmetric database have unique digital object identifiers (DOI). We find that 17.2 million (78%) of the DOIs can be matched to records in MAG.

USPTO patent database (D4)

To study references to scientific publications in patents, we build on prior work and use a high-scale mapping from USPTO patents to MAG papers, which includes approximately 31.7 million citation pairs between patents and papers35,38, from both the front page and full text of the patents. To classify patents into technology classes, we use the Cooperative Patent Classification system, drawn from PatentsView, a data platform based on USPTO bulk data55 (Supplementary Note 1.4). Combining the two files provides technology class information for 97.5% of patents that reference scientific articles. The small share of missing technology class cases corresponds to patents recently granted, which have not been updated in our data.

Dimensions scientific funding data (D5)

To understand how research funding from various sources is allocated into different scientific fields, we leverage research funding data from Dimensions39,56, which includes approximately 5 million research projects supported by over 400 funding agencies worldwide. To be consistent with the rest of our analysis, we focus on projects funded during the same 10-year period (2005–2014). A unique opportunity provided by Dimensions is a linkage table between supporting grants and resulting publications, which allows us to categorize the field of each grant according to its resulting publications. Together we link 292,875 funded projects with at least one publication (for detailed descriptions of our linkage procedure, see Supplementary Note 1.5).

Citation percentiles and hit papers

While citations are widely used as a proxy for scientific impact2,9,57,58,59, direct comparison of citation counts received by papers across time and field can be problematic without normalization60. We therefore calculate citation percentiles for papers within the same publication year and field. Here, following prior studies33,47,61, we define ‘hit papers’ (also known as ‘home runs’) as papers ranking in the top 1% of citations received. We further test robustness of these results by tuning the threshold from 1% to 5% or 10% (Supplementary Note 5.3 and Supplementary Figs. 11 and 12).
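A minimal sketch of the hit-paper definition, and of the hit rate among papers used by a public domain, is given below. The tuple layout, the function names, the tie-breaking rule and the rounding down of the top-1% cutoff are assumptions made for illustration; the text does not specify how ties or small field-year groups are handled.

```python
from collections import defaultdict


def flag_hit_papers(papers, top_percent=1):
    """Return the ids of papers in the top `top_percent`% by citations
    within their own (field, year) group.

    papers: iterable of (paper_id, field, year, citations) tuples.
    Assumptions for this sketch: the cutoff count is rounded down, and
    ties are broken by paper id (arbitrary but deterministic).
    """
    groups = defaultdict(list)
    for paper_id, field, year, citations in papers:
        groups[(field, year)].append((citations, paper_id))

    hits = set()
    for members in groups.values():
        members.sort(reverse=True)  # most cited first
        n_hits = len(members) * top_percent // 100
        hits.update(paper_id for _, paper_id in members[:n_hits])
    return hits


def hit_rate(cited_ids, hit_ids):
    """Share of the papers cited by a public domain that are hit papers."""
    cited = set(cited_ids)
    return len(cited & hit_ids) / len(cited) if cited else 0.0


# Hypothetical toy data: 200 biology papers from 2010; paper p<i> has i citations.
toy_papers = [(f"p{i}", "biology", 2010, i) for i in range(200)]
hits = flag_hit_papers(toy_papers, top_percent=1)

# Four papers cited by a hypothetical public domain, one of which is a hit.
domain_hit_rate = hit_rate({"p199", "p50", "p10", "p3"}, hits)
```

Raising `top_percent` to 5 or 10 corresponds to the robustness checks with looser hit thresholds described above.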

Regression models

To understand the association between public use and funding for different scientific fields, we use linear regression models (ordinary least squares). We first note that all three RCI measures are highly skewed (Supplementary Fig. 5a–c), prompting us to take the natural logarithm, ln RCI, in our linear regressions (Supplementary Fig. 5d–f). The same transformation is applied to the average funding per paper. The variables are defined as follows:

Dependent variable

\({{{\mathrm{ln}}}}\,Y_i\), defined as the natural logarithm of average funding per paper for the level-1 field i.

Predictors of interest

We examine the extent to which different impact measures can predict funding, including \(\ln {\mathrm{RCI}}_{ji}\) for the three public domains. To include all data points in the regression, for the rare cases when an impact measure is 0, we add 1 to avoid zeros in the logarithm. We further include the natural logarithm of the number of papers published in the 10-year period, \(\ln p_i\), as a control variable.

Fixed effects

To control for the possibility that fields under different broad categories may have specific funding and public use norms, we introduce \(F_{fi}\), fixed effect terms for each level-0 field. Specifically, \(F_{fi} = 1\) if the level-1 field i is a child field of the level-0 field f according to MAG’s classification structure. Note that some level-1 fields belong to two level-0 fields simultaneously (for example, mathematical physics is the child field of both mathematics and physics).
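The fixed effect terms can be pictured as a row of indicator variables per level-1 field, in which more than one entry may equal 1. The short sketch below illustrates this encoding; the function name and the three-field list are hypothetical and stand in for MAG's 19 level-0 fields.

```python
def fixed_effect_row(parent_fields, level0_fields):
    """Return the indicators F_fi for one level-1 field i.

    parent_fields: level-0 parents of the level-1 field (one or more).
    level0_fields: ordered list of all level-0 fields f.
    """
    parents = set(parent_fields)
    return [1 if f in parents else 0 for f in level0_fields]


# Hypothetical shortened list of level-0 fields.
level0 = ["mathematics", "physics", "biology"]

# Mathematical physics is a child of both mathematics and physics.
mathematical_physics_row = fixed_effect_row(["mathematics", "physics"], level0)
```

Unlike a standard one-hot encoding, a row here can contain two ones, which reflects the multi-parent fields noted above.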

We start with bivariate regressions examining the relationship between each RCI (that is, for government, media or patenting) and average funding (Fig. 3b–d, Supplementary Table 1 and Models 1–3). That is,

$$\ln Y_i = \beta_j \ln \mathrm{RCI}_{ji} + \varepsilon_i.$$
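As a rough illustration of the bivariate specification above, the following sketch fits an ordinary least squares line to log-transformed toy data and reports R². It is not the analysis code: the helper names and the data are hypothetical, the sketch includes an intercept term (an assumption, since the equation as written shows none), and the zero handling follows one reading of the rule described under 'Predictors of interest'.

```python
import math


def safe_ln(value):
    """Natural log; a zero measure is replaced by 1 before taking the log
    (one reading of the zero-handling rule, assumed for this sketch)."""
    return math.log(1.0 if value == 0 else value)


def ols_bivariate(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical toy data: funding per paper grows with the square of RCI.
rci_values = [0.5, 1.0, 2.0, 4.0]
funding_per_paper = [10_000 * r ** 2 for r in rci_values]

ln_rci = [safe_ln(r) for r in rci_values]
ln_funding = [safe_ln(f) for f in funding_per_paper]
slope, intercept, r_squared = ols_bivariate(ln_rci, ln_funding)
```

Because both variables are log-transformed, the slope is an elasticity: in this toy data it is 2, meaning a 1% higher RCI goes with roughly 2% more funding per paper.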

In multivariate regressions, we further include controls for heterogeneity in field size or parent field fixed effects (Supplementary Table 2 and Models 4–6).

We further investigate the joint predictive power of the three RCIs (Fig. 3e, Supplementary Table 2 and Model 7):

$$\ln Y_i = \sum_j \beta_j \ln \mathrm{RCI}_{ji} + \varepsilon_i$$

which shows that each measure contributes independently and substantially to explaining the variation in funding.

Finally, we add further control variables (Supplementary Table 2 and Model 8):

$$\ln Y_i = \sum_j \beta_j \ln \mathrm{RCI}_{ji} + \beta_p \ln p_i + \sum_f \beta_f F_{fi} + \varepsilon_i$$

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.