Thursday, August 03, 2017

The State of Open Access: Some New Data

A preprint posted on PeerJ yesterday offers some new insight into the number of articles now available on an open-access basis. 

The new study is different to previous ones in a number of ways, not least because it includes data from users of Unpaywall, a browser plug-in that identifies papers that researchers are looking for, and then checks to see whether the papers are available for free anywhere on the Web. 

Unpaywall is based on oaDOI, a tool that scours the web for open-access full-text versions of journal articles.

Both tools were developed by Impactstory, a non-profit focused on open-access issues in science. Two of the authors of the PeerJ preprint – Heather Piwowar and Jason Priem – founded Impactstory. They also wrote the Unpaywall and oaDOI software.

The paper – which is called The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles – reports that 28% of the scholarly literature (19 million articles) is now OA, and growing, and that for recent articles the percentage available as OA rises to 45%.

The study authors say they also found that OA articles receive 18% more citations than average. 

In addition, the authors report on what they describe as a previously under-discussed phenomenon of open access – Bronze OA. This refers to articles that are made free-to-read on the publisher’s website without an explicit open licence. 

Below I publish a Q&A with Heather Piwowar about the study. 

Note: my questions were based on an earlier version of the article I saw, and a couple of the quotes I cite were changed in the final version of the paper. Nevertheless, all the questions and the answers remain relevant and useful so I have not changed any of the questions.

The interview


RP: What is new and different about your study? Do you feel it is more accurate than previous studies that have sought to estimate how much of the literature is OA, or is it just another shot at trying to do that?

HP: Our study has a few important differences:

·       We look at a broader range of the literature than previous studies and go further back (to pre-1950 articles); we look at more articles (all of Crossref, not just all of Scopus or Web of Science – Crossref has twice the number of articles that Scopus has); and we take a larger sample than most other studies. That’s because we classify OA status algorithmically, rather than relying on manual classification. This allowed us to sample 300k articles, rather than a few hundred as many OA studies have done. So, our sample is more accurate than most, and more generalizable as well.

·       We undertook a more detailed categorization of OA. We looked not just at Green and Gold OA, but also Hybrid, and a new category we call Bronze OA. Many other studies (including the most comparable to ours, the European Commission report you mention below) do not bring out all these categories specifically. (I will say more on that below). Furthermore, we didn’t include Academic Social Networks. Mixing those with publisher-hosted free-to-read content makes the results less useful to policy makers.

·       Our data and our methods are open, for anyone to use and build upon. Again, this is a big difference from the Archambault et al. study (that is, the one commissioned by the European Commission) and we think it is an important difference.

·       We include data from Unpaywall users, which allows us to get a sense of how much of the literature is OA from the perspective of actual readers. Readers massively favour newer articles, for instance, which is good news because such articles are more likely to be OA. By sampling actual reader data, from people using an OA tool that anyone can install, we can report OA percentages that are more realistic and useful for many real-world policy issues.

RP: You estimate that at least 28% of the scholarly literature is open access today. OA advocates tend nowadays to cite the earlier European Commission report which, the EU claims, indicates that back in 2011 nearly 50% of papers were OA. Was the EU study an overestimate in your view, or has there been a step backwards?

HP: Their 50% estimate was of recent papers, and included papers posted to ResearchGate (RG) and Academia.edu as open access. Our 28% estimate is for all journal articles, going back to 1900 – everything with a DOI. We found 45% OA for recent articles, and that’s excluding RG and Academia. So, they are pretty similar estimates.

RP: In fact, you came up with a number of different percentages. Can you explain the differences between these figures, why it is important to make these distinctions, and what the implications of the different figures are?

HP: There are two summary percentages: 28% OA for all journal articles, and 47% OA for journal articles that people read. As I noted, people read more recent articles, and more recent articles are more likely to be OA, so it turns out that almost half of the papers people are interested in reading right now are actually OA. Which is really cool!

Actually, when you consider that we used automated methods that missed a bit of OA, it is more than half, so the 47% is a lower bound.

RP: You coin a new definition of open access in your paper, what you call Bronze OA. Can you say something about Bronze OA and its implications? It seems to me, for instance, that a lot of papers (over half?) currently available as open access are vulnerable to losing their OA status. Is that right? If so, what can be done to mitigate the problem?

HP: Yes, we did think we were coining a new term. But this morning I learned we weren’t the first to use the term Bronze OA – that honour goes to Ged Ridgway, who used it in a tweet in 2014.


I guess it’s a case of Great Minds Think Alike!

Our definition of Bronze OA is the same as Ged’s: articles made free-to-read on the publisher’s website, without an explicit open license. This includes Delayed OA and promotional material like newsworthy articles that the publishers have chosen to make free but not open.

It also includes a surprising number of articles (perhaps as much as half of the Bronze total, based on a very preliminary sample) from entirely free-to-read journals that are not listed in DOAJ and do not publish content under an open license. Opinions will differ on whether these are properly called “Gold OA” journals/articles; in the paper, we suggest they might be called “Dark Gold” (because they are hard to find in OA indexes) or “Hidden Gold.” We are keen to see more research on this. 

More research is also needed to understand the other characteristics of Bronze OA. Is it disproportionately non-peer-reviewed content (e.g. front-matter), as seems likely? How much of Bronze OA is also Delayed OA? How much Bronze is Promotional, and how transient is the free-to-read status of this content? How many Bronze articles are published in “hidden gold” journals that are not listed in the DOAJ? Why are these journals not defining an explicit license for their content, and are there effective ways to encourage them to do so?

This kind of follow-up research is needed before we can understand the risks associated with Bronze and what kind of mitigation would be helpful.

RP: You say in your paper, “About 7% of the literature (and 17% of the OA literature) is Green, and this number does not seem to be growing at the rate of Gold and Hybrid OA.” You also suspect that much of this green OA is “backfilling” repositories with older articles, which are generally viewed as being of less value. What happened to the OA dream articulated by Stevan Harnad in 1994, and what future do you predict for green OA going forward?

HP: First, I should clarify: our definition of Green OA for the purposes of the study is that a paper is in a repository and is not available for free on the publisher site. This is so we don’t double count articles as both Green and Gold (or Hybrid or Bronze) for our analysis.

We gave publisher-hosted locations the priority in our classifications because we suspect most people would rather read papers there. So, in our article when we say green OA isn’t growing, what we mean is that more recent papers that are only available in repositories are available as Green OA at roughly the same rate as older papers.

It is worth future study to understand this better. I have a suspicion: perhaps much of what would have been Green OA became Bronze and what we call “shadowed green” – where there is a copy in a repository and a freely available copy on the publisher’s site as well. I suspect publishers responded to funder mandates that require self-archiving by making the paper free on the publisher sites as well, in synchronized timing.

Specifically, Biomed doesn’t look like it has as much Green as I’d expect, given the success of the NIH mandate and the number of articles in PMC. We do know many biomed journals have Delayed OA policies, which we categorized as Bronze in our analysis. Did they implement these Delayed OA policies in response to the PMC mandates? Perhaps others already know this to be true... I haven’t had a chance to look it up. Anyway. I think the interplay between Green and Bronze is especially worth more exploration.

We do also report on all the articles that are deposited in repositories, Green plus shadowed green, in the article’s Appendices. We found the proportion of the literature that is deposited in repositories to be higher for recent publication years.

One final note: We actually changed the sentence that you quoted in the final version of our paper, because we were wrong to talk about “growing” as we did. Our study didn’t measure when articles were deposited in repositories, but just looked at their publication year. Other studies have demonstrated that people often upload papers from earlier years, a practice called backfilling.

I suppose in some ways these have less value, because they are read less often. That said, anyone who really needs a particular paper and doesn’t otherwise have access to it is surely happy to find it.

RP: You also looked at the so-called citation advantage and estimate that an OA article is likely to attract 18% more citations than average. The citation advantage is a controversial topic. I don’t want to appear too cynical, but is not the idea of trying to demonstrate a citation advantage more an advocacy tool than a meaningful notion? I note, for instance, that Academia.edu has claimed that posting papers to its network provides a 73% citation advantage. Surely the real point here is that if all papers were open access there would be no advantage to open access from a citation point of view?

HP: That’s true! And that’s the world I’d love to see – one where the citation playing field is flat, because everyone can read everything.

RP: What would you say were the implications of your study for the research community, for librarians, for publishers and for open access policies?

HP: For the research community: Install Unpaywall! You’ll be able to read half the literature for free. Self-archive your papers, or publish OA.

For OA/bibliometrics researchers: Build on our open data and code, let’s learn more about OA and where it’s going.

For librarians: Use this data to negotiate with publishers: Half the literature is free. Don’t pay full price for it.

For publishers: Half the literature is now free to read. That percentage is growing. You don’t need a weathervane to know which way the wind blows: long term, there’s no money in selling things that people can get for free. Flip your journals. Sell services to authors, not access to content – it’s an increasingly smart business decision, as well as the Right Thing To Do.

For open access policy makers: We need to understand more about Bronze. Bronze OA doesn’t safeguard a paper’s free-to-read status, and it isn’t licensed for reuse. This isn’t good enough for the noble and useful content that is Scholarly Research. Also: let’s accelerate the growth.

You didn’t ask about tool developers. An increasing number of people are making tools that they can integrate OA into. They should use the oaDOI service. Now that such a large chunk of the literature is free, there are a lot of really transformative things we can build and do – in terms of knowledge extraction, indexing, search, recommendation, machine learning etc.

RP: OA was at the beginning as much (in fact more) about affordability as about access (certainly from the perspective of librarians). I note the recently published analysis of the RCUK open access policy reports that the average APC paid by RCUK rose by 14% between 2014 and 2016, and that the increase was greater for those publishers below the top 10 (who are presumably focused on catching up with their larger competitors). Likewise, the various flipping deals we are seeing emerge are focused on no more than transferring costs from subscriptions to APCs, with no realistic expectation of prices falling in the future. If the research community could not afford the subscription system (which OA advocates have always maintained) how can it afford open access in the long-term?

HP: If the rising APCs are because small publishers are catching up with the leaders by raising prices, that won’t continue forever – they’ll catch up. Then it’ll work like other competitive marketplaces.

The main issue is freeing up the money that is currently spent on subscriptions. We think studies like this, and tools like Unpaywall, can be helpful in lowering subscription rates, and foregoing Big Deals, as libraries are increasingly doing.

RP: As you say, in your study you ignored social networking sites like Academia.edu and ResearchGate “in accordance with an emerging consensus from the OA community, and based largely on concerns about long-term persistence and copyright compliance.” And you also say, “The growing proportion of OA, along with its increased availability using tools like oaDOI and Unpaywall, may make toll-access publishing increasingly unprofitable, and encourage publishers to flip to Gold OA models.” I am wondering, however, if it is not more likely that sites like Academia.edu (which researchers much prefer to use than paying to publish or depositing in their repository) and Sci-Hub (which is said to contain most of the scientific literature now) will be the trigger that will finally force legacy publishers to flip their journals to open access, whatever one’s views on the copyright issues. Would you agree?

HP: It won’t be any one trigger, but rather an increasingly inhospitable environment. Sci-Hub is a huge contributor to that, and Academic Social Networks are too. Unpaywall opens up another front: a best-practice, legal approach to bypassing paywalls that librarians and others can unabashedly recommend. It all combines to make it easier and more profitable for publishers to flip, and for the future to be OA.

RP: Thank you for answering my questions.