Analysing the scientific literature in its online context

The free versus fee debate over access to the scientific literature is a lively one, but it is also important to keep the bigger picture in mind: the Internet is bringing about a much broader evolution in the way scientists work and communicate. Information and value increasingly lie not just in the published article but in the relationships between articles, in the links among authors and papers, and in less formal communication among users and communities through Weblogs (or 'blogs'), listservs, home pages and other sources on the Web.

Understanding this wider context requires not only thinking on substantially larger scales, but also handling very different levels of heterogeneity. Going beyond journals and articles, making use of the nodes, links and complex interconnections in this scientific information space, in turn requires new Web-scale tools and algorithms.

The scientific literature is a mechanism for the dissemination and archiving of research, but it has also been an object of study in itself: techniques for analysing the scientific literature have a long history1 and remain the subject of active research2, drawing on methods from citation analysis, information retrieval, machine learning and data mining. These techniques in turn have led to a number of tools that help scientists track topics and new developments in their own research areas.

As the scientific literature moves increasingly into online and Web-accessible forms, we have the opportunity to carry out this kind of analysis at a much broader scope, addressing the ways in which scientific research advances and the broader consequences of this research. Many scientific issues move beyond the bounds of the research community, informing policy discussions, commercial activity, education and public views of science. Consider, for example, the debates over genetically modified foods, global warming, stem cell research, electronic privacy and cryptographic technology.

To take the first of these examples: if we want to analyse how scientific results inform worldwide regulatory policy on the subject of genetically modified foods, we have access not just to the relevant scientific papers, but also to Web-accessible text of official regulations, to reports commissioned by governmental agencies in the process of formulating these regulations and to transcripts of hearings on the topic.

If we want to see how the results of this research are being used by industry and academia, we have access to the sites of corporate labs containing research reports and white papers, to the home pages of university courses worldwide, and to datasets arising from both industrial and publicly-funded university research. Finally, if we want to understand public reactions to these active and often controversial scientific issues - both the hopes and fears that they inspire - we can look at their projections onto Internet news sources and emerging, discussion-oriented media such as blogs.

The point is that we need not consider the scientific literature as existing separately from its surrounding context; the Web provides a view of the literature and the context in a common space, allowing us to ask questions such as those above at a level of detail that was previously not possible. Working at this broader scope offers new ways to assess the impact of a piece of scientific research, rather than just looking at relationships among the papers themselves. We may find, for example, that one paper is highly influential because it significantly influenced the formulation of regulatory policy, while another's impact resides in the fact that it appears in the curricula and reading lists of graduate seminars at a number of leading universities.

When a scientific paper can play so many different roles in this online information space, how can we analyse the relationships in a unified way? To begin with, the Web, due to its hypertextual structure, has a crucial property in common with the scientific literature: its content is interwoven with explicit links.

Thus, just as citation analysis can expose underlying structure in a collection of research papers, current techniques for Web searching make extensive use of hyperlink analysis. Given this relationship, one can consider a simultaneous link analysis of scientific information as it is embedded in a broader Web context, including pages that represent research groups, governmental agencies, companies, university courses and other relevant parties. In this setting, a highly-cited government report that in turn extensively references a paper in the research literature contributes greatly to that paper's impact; links from university course home pages to a scientific paper contribute similarly to its impact.

One area where this type of broader approach has been successfully carried out is in the simultaneous analysis of papers and patents3: here, references from patents to papers reveal a type of impact, and this impact can be used to identify successful companies or those with high technological potential, so as to guide investment decisions. All these cross-references to the scientific literature, from and between governmental, educational and corporate resources, can be treated via the common language of link analysis.

The problem of how to carry out this link analysis raises some interesting design choices, since the basic methods for Web hyperlink analysis treat the heterogeneity of online sources in different ways. The PageRank algorithm4 is based on the following recursive notion: an influential Web resource is one that receives many links from other influential resources.

The framework of hubs and authorities5, on the other hand, posits a separation of Web pages or sites into two categories, again by recursion: a hub is a resource with links to many good authorities, while an authority is a resource that receives links from many good hubs. In this way, one can distinguish between producers and consumers of scientific information, or between primary and secondary sources; much of the Web context surrounding the scientific literature can act as hubs, linking to published papers that serve as authorities.

Research on whole-Web analysis (see ref. 6 and the more recent IBM WebFountain project7) is relevant to these analyses as well; this approach attempts to synthesize activity across the entire network to answer broad queries such as: 'Which emerging research questions are currently viewed as most important by the biotechnology industry?' or 'How does the sentiment of the news media toward research into genetically modified foods vary in different countries?'. In addressing such questions, a useful part of the analysis is to look for a growing density of links to resources on a particular topic, serving as an indication of emerging interest.
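
A very simple version of this signal can be computed from timestamped link data: count how many new inbound links the resources on a topic acquire in each time window and look for sustained growth. The records below are entirely hypothetical and stand in for crawl data.

# Rough sketch of 'growing density of links' as a signal of emerging interest.
from collections import Counter

# (time window in which the link was created, topic resource it points to)
link_events = [
    ("2003-01", "gm_report"), ("2003-01", "gm_paper"),
    ("2003-06", "gm_paper"),  ("2003-06", "gm_blog"), ("2003-06", "gm_report"),
    ("2004-01", "gm_paper"),  ("2004-01", "gm_report"),
    ("2004-01", "gm_blog"),   ("2004-01", "gm_course"),
]

per_window = Counter(window for window, _target in link_events)
for window in sorted(per_window):
    print(window, per_window[window], "new inbound links")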

Drawing meaningful comparisons between patterns of research in different fields is a challenge; like recent studies attempting to scale up impact analysis from specific disciplines to all of science, such comparisons run into the difficulty that citation conventions and overall publishing volumes vary enormously from one area to another.

Indeed, a promising application of link analysis at this scale would be to refine our notions of what constitutes a coherent scientific field; one can seek natural groupings within the scientific community based on similarities at the level of the actual research rather than according to traditional disciplinary boundaries. Community-finding algorithms in networks (see e.g. refs 8, 9, 10) can produce such groupings, by clustering resources that exhibit similarities at the level of both content and citations.
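
As a rough sketch of the idea, and not of the specific algorithms in refs 8, 9 and 10, the short label-propagation routine below groups the nodes of a small, invented co-citation graph by repeatedly letting each node adopt the label most common among its neighbours.

# Minimal community-finding sketch via label propagation on a hypothetical graph.
import random

edges = [
    ("phys1", "phys2"), ("phys2", "phys3"), ("phys1", "phys3"),
    ("cs1", "cs2"), ("cs2", "cs3"), ("cs1", "cs3"),
    ("phys3", "qc1"), ("cs3", "qc1"),   # a paper bridging the two clusters
]

neighbours = {}
for a, b in edges:
    neighbours.setdefault(a, set()).add(b)
    neighbours.setdefault(b, set()).add(a)

random.seed(0)
label = {n: n for n in neighbours}      # start with one label per node

for _ in range(20):
    order = list(neighbours)
    random.shuffle(order)
    for node in order:
        # Adopt the label most common among the node's neighbours.
        counts = {}
        for nb in neighbours[node]:
            counts[label[nb]] = counts.get(label[nb], 0) + 1
        best = max(counts.values())
        label[node] = random.choice([l for l, c in counts.items() if c == best])

communities = {}
for node, lab in label.items():
    communities.setdefault(lab, []).append(node)
print(list(communities.values()))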

Given such techniques, we might hope to identify emerging interdisciplinary areas as clusters of related papers growing rapidly at the boundaries of existing areas. An area like quantum computing, for example, can be seen as growing from a 'seed set' of foundational papers in both physics and computer science, and can be defined as a cluster with this seed set as its core. In this way, it overlaps with clusters corresponding to traditional subfields of both physics and computer science, drawing on both these areas but wholly contained in neither.
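
One simple way to make the 'seed set' idea operational, sketched below on an invented citation list, is to start from the foundational papers and repeatedly add any paper that cites, or is cited by, at least two members of the current cluster.

# Rough seed-set expansion sketch over hypothetical citation data.
citations = {
    "physics_seed":  [],
    "cs_seed":       [],
    "qc_paper_1":    ["physics_seed", "cs_seed"],
    "qc_paper_2":    ["physics_seed", "qc_paper_1"],
    "outside_paper": ["cs_seed"],
}

def neighbours(paper):
    # Papers this one cites, plus papers that cite it.
    cited = set(citations.get(paper, []))
    citing = {p for p, refs in citations.items() if paper in refs}
    return cited | citing

cluster = {"physics_seed", "cs_seed"}   # the seed set
grew = True
while grew:
    grew = False
    for paper in citations:
        if paper not in cluster and len(neighbours(paper) & cluster) >= 2:
            cluster.add(paper)
            grew = True

print(sorted(cluster))

The resulting cluster overlaps groupings on both sides of the boundary without being contained in either, which is the behaviour described above.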

Automated analysis of clusters in the link structure, taking into account the full scientific literature, offers the opportunity to identify such areas much earlier in their evolution than was previously possible, exposing their intellectual antecedents in more well-established areas, and enabling researchers in related fields to track developments that could be relevant to their own disciplines. Again the point is that the very notion of a coherent field of scientific research, like the notion of impact, is something that link analysis can begin to quantify; while we inevitably lose something of the intuitive nature of these concepts in the process, making them precise is a crucial step towards the kinds of analysis we have been discussing.

The availability of usage data is a further valuable consequence of online access, and it provides additional opportunities for analysis. Traditional studies of scientific publications and their impact have necessarily treated the actual readers of papers as essentially invisible, because readers who do not go on to generate papers, patents or other tangible products mostly leave no traces that can be usefully incorporated into the analysis.

Notwithstanding this, sites that host electronic copies of scientific papers, as well as repositories for patents, datasets and computer software, can collect information about the rates at which these objects are downloaded, thus offering fundamentally new ways to assess the influence of a piece of work. There are a number of central questions about usage-based measures of impact that remain to be answered:

First, how closely related are they to traditional citation-based measures? We expect that there will be a rough correspondence between citation and usage in an aggregate sense - in other words, that a widely-read paper will show up in the bibliographies of many subsequent papers. However, there will clearly be deviations from this general principle.

A foundational paper written long ago may still be cited but rarely read, its contents having been absorbed into more accessible books and reviews; on the other hand, a highly instructive online survey or tutorial may be widely downloaded and read, and thus have a large influence on its field - even if its readers do not usually cite it when they go on to write research papers. Such illustrations are suggestive of the different roles that papers may play when we consider both usage and citation patterns; a valuable application of usage analysis would be to put our understanding of these roles and relationships on a more thorough, quantitative footing.
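
One simple quantitative starting point, sketched below with invented download and citation counts, is to rank the same set of papers both ways and compute a rank correlation; papers like the hypothetical 'classic' and 'survey' then stand out as the ones whose usage and citation ranks disagree most.

# Sketch comparing usage-based and citation-based rankings (Spearman correlation).
downloads = {"survey": 5200, "classic": 300, "recent_A": 1500, "recent_B": 900}
citations = {"survey": 40,   "classic": 800, "recent_A": 60,   "recent_B": 35}

def ranks(scores):
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {paper: i + 1 for i, paper in enumerate(ordered)}

r_dl, r_cit = ranks(downloads), ranks(citations)
papers = list(downloads)
n = len(papers)
# Spearman rank correlation (no ties in this toy data).
d2 = sum((r_dl[p] - r_cit[p]) ** 2 for p in papers)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print("Spearman rank correlation:", round(rho, 2))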

Once usage data has been added to the overall context of online scientific information, we are dealing with a system that exhibits not just complex interconnections, but also powerful fluctuations over time scales that span up to five or six orders of magnitude. For example, a paper posted at the e-print arXiv site can result in a posting on the popular Slashdot technology news community, which can in turn lead thousands of visitors to download the paper over the course of a single hour.

The online discussion generated by this process may then move into the mainstream news media, the network of blogs and the more technically focused mailing lists; scanning this content a week later can yield a bewildering array of pages and postings involving both experts and novices. In the same way that link analysis offers the ability to identify high-quality, high-impact research, one would like to be able to analyse such rapidly evolving online discussions and separate the high-quality, well-informed commentary from the superficial and the inaccurate.

Such events can represent very short-term 'spikes' of broader visibility for a research field that, in turn, may be experiencing a gradual growth or decline in professional interest. Both types of phenomena, playing out over days in one case and over years in the other, are visible in the shared online space of scientific research, publishing and discussion. Managing the complexity of all these effects so as to produce a global view of scientific activity is a challenging prospect, but one that promises deeper insights into the structure of scientific fields, the growth and decline of these fields over time, and the broader impact of scientific research on policy, education and business.
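
Detecting the short-term spikes is itself amenable to simple automation. The sketch below, over an invented series of hourly download counts, flags any hour whose count rises several standard deviations above a recent moving baseline; separating such spikes from the slow underlying trend is one small step towards the global view described here.

# Rough spike-detection sketch over hypothetical hourly download counts.
import statistics

hourly_downloads = [12, 9, 14, 11, 10, 13, 250, 180, 40, 15, 12, 11]
WINDOW = 6       # hours used to estimate the recent baseline
THRESHOLD = 3.0  # flag counts more than 3 standard deviations above it

for hour in range(WINDOW, len(hourly_downloads)):
    baseline = hourly_downloads[hour - WINDOW:hour]
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline) or 1.0
    count = hourly_downloads[hour]
    if count > mean + THRESHOLD * stdev:
        print("hour", hour, ":", count, "downloads looks like a spike",
              "(baseline", round(mean, 1), "+/-", round(stdev, 1), ")")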

Jon Kleinberg

Department of Computer Science, Cornell University, Ithaca, New York, USA http://www.cs.cornell.edu/home/kleinber/


  1. Egghe, L. & Rousseau, R. Introduction to Informetrics (Elsevier, 1990).

  2. Shiffrin, R. & Börner, K. Mapping knowledge domains. Proc. Natl Acad. Sci. USA 101 (Suppl. 1), 5183-5185 (2004).

  3. Narin, F. Patents and publicly funded research. In Assessing the Value of Research in the Chemical Sciences, Chemical Sciences Roundtable, Board on Chemical Sciences and Technology, National Research Council, pp. 59-72 (National Academy Press, 1998).

  4. Brin, S. & Page, L. Anatomy of a large-scale hypertextual Web search engine. Proceedings of the 7th International World Wide Web Conference (1998).

  5. Kleinberg, J. Authoritative sources in a hyperlinked environment. Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms (1998).

  6. Kumar, R., Raghavan, P., Rajagopalan, S. & Tomkins, A. Trawling the Web for emerging cyber-communities. Proceedings of the 8th International World Wide Web Conference (1999).

  7. http://www.almaden.ibm.com/webfountain/

  8. Flake, G., Lawrence, S., Giles, C. L. & Coetzee, F. Self-organization and identification of Web communities. IEEE Computer 35(3) (2002).

  9. Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. Proc. Natl Acad. Sci. USA 99, 8271-8276 (2002).

  10. Hopcroft, J., Khan, O., Kulis, B. & Selman, B. Tracking evolving communities in large linked networks. Proc. Natl Acad. Sci. USA 101 (Suppl. 1), 5249-5253 (2004).

© 2004 Nature Publishing Group