Impact of Open Data Movement on Data Management and Publishing

“If I have seen a little further it is by standing on the shoulders of Giants.”

When Sir Isaac Newton wrote this in a letter to his rival, he had actually borrowed the phrase itself. Whether the quote predates John of Salisbury in the 12th century is not known, but the most commonly cited version is: “We are like dwarfs sitting on the shoulders of giants. We see more, and things that are distant, than they did, not because our sight is superior or because we are taller than they, but because they raise us up and by their great stature add to ours.”

The open access movement predates the Internet (reaching back to about the 1950s), and various models have been proposed to increase access to academic research. Self-archiving (depositing a free copy of an electronic document on the Internet in order to provide open access to it) has been common among computer scientists since at least the 1980s. In physics it has become the norm, and some sub-areas like high-energy physics have a 100% self-archiving rate. Interestingly, the two major physics publishers, the American Physical Society and Institute of Physics Publishing, have reported that the free archive has had no effect on journal subscriptions in physics, even though articles are freely available, usually before publication.

In 1997 when the US National Library of Medicine made Medline available freely in the form of PubMed, the usage of this database increased ten-fold.

In 2001, scholars around the world signed “an open letter to scientific publishers” calling for “the establishment of an online public library that would provide the full contents of the published record of research and scholarly discourse in medicine and the life sciences in a freely accessible, fully searchable, interlinked form.” The advocacy organization the Public Library of Science was established as a result. However, most scientists continued to publish in and review for non-open-access journals.

Economic considerations

Momentum has picked up in recent years. Today, all seven of Britain’s research councils require that the results of work they fund be made open-access in some form. By 2016, all public money given to universities in Britain will carry the same open-access requirement. On the other side of the pond, in 2013 the White House required that federal agencies spending more than $100M a year on research publish results where they can be read for free.

While proponents of the movement argue that open access benefits science, there have been some unintended consequences. In many open-access models, the cost of publishing shifted to the generator of the research, a burden on the already meager funds allocated to researchers. Others argue that open access negatively affects the quality of published research by incentivizing publishers to push through more articles in order to generate higher revenues (as opposed to the subscription model, where quality drives subscriptions and thus rejecting a submission does not hurt the bottom line).

Nature published an informative article on the pros and cons of the open data movement and concluded that any successful emerging model must be economically sound.

Nice to share

Stepping away from economic concerns, there are clear benefits to publicly sharing data. The NIH website’s FAQs on its mandatory data-sharing policy for grants over $500,000 contain seemingly straightforward but nonetheless interesting entries. For example, the answer to “Why should I share my final research data?” is easy enough: promoting new research, testing new or alternative hypotheses and methods of analysis, facilitating the education of new researchers, enabling exploration of topics not envisioned by the initial investigators, and allowing the creation of new datasets by combining data from multiple sources.

The particularly interesting questions, however, relate to the types of data to be shared. Some types of basic research, such as genome sequences and maps or protein and nucleotide databases, lend themselves very well to compilation in centralized locations. Other types of data may be less easily compiled, especially as science grows and evolves and it becomes increasingly necessary to combine datasets and data types across disciplines.

Looking Ahead

The move to digitize scientific data has been championed by the pharmaceutical industry as evidenced by their choice to make multi-million dollar investments in collaboration-focused software to capture and analyze data.

In the same vein as the current discourse regarding open access, major discussions about switching from pen and paper to electronic formats were in full swing in the 1990s. And following a course similar to the one we are seeing with the open access movement, government mandates (e.g., the Electronic Signatures in Global and National Commerce Act) mean that instead of searching through notebooks and piles of documents, printing materials, and submitting thousands of pages (e.g., for an FDA audit), electronic lab notebook users can simply collate and submit electronic records, saving valuable time and money.

In essence, electronic lab notebooks are continuing to facilitate a movement enabled by government action. However, the world of sharing and collaboration was quite different 14 years ago. We’ve gone from dial-up internet and shared, wired desktop computers to a world where a single scientist may own a combination of laptops, tablets, and smartphones, and the speed of, and access to, information is incredible by comparison.

Acceptance of the open access movement is progressing along with technologies that better enable knowledge sharing. It’s fair to expect that those trends will continue.

What's your experience with open data? What do you think it'll look like in 5 years? Start the discussion in the comments below!