A team of academics spent months collecting data on 2.3 million papers from the academic search engine Google Scholar — to find out how often the popular service points readers to versions of research papers that are free to read.
The team’s study, which was posted last month on the SocArXiv preprint server¹, reaches a conclusion similar to that of an earlier analysis, which found that around half of the scholarly literature is freely available. But this latest work stands out because it overcomes a crucial hurdle: Google Scholar usually prevents automated software from pulling down bulk information. The site disrupts automated searches by throwing in ‘CAPTCHAs’, tests that typically take the form of a distorted image and are designed to be solved only by humans. Nor does Google Scholar offer an application programming interface, or API, through which particular pieces of data could be retrieved in a simple way.
Nature spoke to Alberto Martín-Martín, a PhD student who studies bibliometrics and science communication at the University of Granada in Spain, about how he scraped Google Scholar for its information — and why he is asking the site to make its data more easily available.
Why have you done this work?
Google Scholar is one of the most used academic search engines in the world and our research group has been interested in it for many years. It has a lot of data about free-to-read publications, and we know that Google Scholar links to sources for articles that are not covered by other databases, such as the academic social networking site ResearchGate.
How did you get around the fact that Google Scholar has no API?
We spent three months scraping data from the website. I created a script to do so, but I had to be there to keep manually solving the CAPTCHAs that appeared regularly. It was a boring summer! We used several computers to distribute the enquiries because Google Scholar asks you to solve a CAPTCHA if one computer is making too many requests. Sometimes the CAPTCHAs appear so frequently that it is not practical to get the data this way. We don’t think it is a reliable method of getting the data.
How many CAPTCHAs did you solve over the course of the experiment?
I can’t tell you the exact number, but many hundreds!
How long would it have taken to extract the data if an API was available from Google Scholar?
One or two days.
What did you find out?
In general, across all areas of research, for papers published in 2009 and 2014, we found that 55% of all documents we examined could be accessed in some way for free through Google Scholar. Only 34% were freely available from publisher websites or repositories, and the rest were available from other sources, mostly ResearchGate. Most of the documents that were freely available from publishers didn’t have a clear open access licence, so we don’t know if those documents will continue to be free in the future.
What do you think about using Google Scholar for this type of work?
Google Scholar contains valuable information that is not available from any other database, but it is impractical to rely on it for large-scale analyses. With this study, we would like to propose that Google Scholar reconsiders making its data more open. The open-research-data landscape has changed in recent years, and many publishers now make data about database listings available.