Royal Society Of Chemistry's new Repository: My initial thoughts on Open Data

The Royal Society of Chemistry has announced a new repository for the chemical sciences. I have been asked to comment (by an American Chemical Society organ) and outline my thoughts below. This is important both for chemistry and, more widely, for scientific data. First, the announcement.

Today http://www.rsc.org/AboutUs/News/PressReleases/2013/RSC-ann
The Royal Society of Chemistry today announces a new subject-based repository that will make it easier for researchers to find and share relevant journal articles and data from a single point of access.
David James, the Royal Society of Chemistry’s Executive Director of Strategic Innovation, said: “The Chemical Sciences Repository will offer free-to-access chemistry publications and integrated data in a single place. 
“This repository extends the services the Royal Society of Chemistry already offers researchers. With this new service we are improving our ability to ensure that the outputs from research activity are made as widely available as possible – to meet the needs of the scientific community, funders and others interested in accessing our content in a more comprehensive, streamlined way.”
The initial release will provide an article repository as a central point through which users can access the Royal Society of Chemistry’s open access articles, whether they are funded immediate open access articles, or articles that must be made open access after an embargo period, such as those funded by RCUK, the Wellcome Trust or NIH. This article repository will be available at the end of October 2013.
The repository will point to the Article of Record as the primary source. It will make open access versions of the article available when any embargo period expires. 
David James continued: “We plan to grow the Chemical Sciences Repository, with the addition of open access papers from institutional repositories, other publishers, and individuals – as well as theses, data and models.  
“The repository will make it easy for researchers to deposit their articles and data, and scientists will also find it easy to find and reuse compatible datasets. 
“As a community service the repository will catalyse further collaboration and open innovation between chemical scientists all over the world.”
The Royal Society of Chemistry will announce additional elements to the data repository in the coming months. Work is already underway with major UK universities around data extraction and upload, Electronic Lab Notebook (ELN) integration, and micropublishing. Offering functionality with chemical scientists specifically in mind, the repository will support the building of validation and prediction models to maximise the value and quality of the data collections. 
Head of Chemistry at the University of Southampton, Professor Philip Gale, said: “My colleagues and I welcome this initiative: a collection of chemistry data curated by the Royal Society of Chemistry will be of significant value to the worldwide chemistry community.
“We are now working with the Royal Society of Chemistry to enable best practice, to expose laboratory data in an intelligent and usable manner.”

This ought to be a wonderful thing. Whether it is will depend on what the RSC says and does.
There is a desperate need for a new way to manage scientific data in an Open manner for the benefit of the world. Some fields already do very well – astronomy, geo-imaging, proteins and genes. Others, such as chemistry and materials, are virtually closed. Some, like crystallography, are midway. Anyone who has not browsed the bioscience databases should do so – it’s truly eye-opening and a complete example of Linked Open Data (TimBL’s 5 stars: make data open, comprehensible to humans and machines, and link it both ways).
There’s an absolutely pressing need for places for scientists to deposit data. The process must be trivially easy, valuable to the depositor and valuable to the community. And the community must insist that data gets deposited. It’s almost universal for readers of scientific articles to find that there is no data, or that the data is only available as a PDF or a photocopied fax. (We were funded for several years by JISC to develop semantic tools and to provide advocacy but, ultimately, we simply couldn’t get chemists interested – this is common in other disciplines such as biodiversity.)
The norm should be that a chemist deposits their data as they create it – often on an hourly basis. That’s how we develop software. It doesn’t have to be visible immediately (though massive credit to Mat Todd and similar who do this). But at the very least the data associated with a thesis, a paper, or a presentation should be available. Depositing data this way leads to greatly improved science, and that ought to be an overwhelming argument.
And if we do this we shall create better data. Because the data will have to be semantic. Bioscience has put huge efforts into semantics – chemistry effectively nothing and materials science is worse.  We cannot run an information-rich world without machines consuming and transforming information automatically. And that is where Open is critical. The Semantic Web depends on Open.
And scientific societies and international unions are, IMO, the best places for managing scientific data. They have the resources to manage communities and, like the IUCr (crystallography), to develop ontologies. When it works well, it’s a great example of global collaboration. When it works badly (or negatively) it can destroy science and scientific creativity.
My judgment will depend on whether the RSC’s repository is truly Open. The least open end of the spectrum is shown by the American Chemical Society (Chemical Abstracts – CAS) and Elsevier (Reaxys, once Beilstein) with their closed, walled gardens of chemical information. Originally these resources (which have tens of millions of compounds and reactions) were shining examples of innovation. In the 1970s the brilliant work in chemistry, often supported by ACS, inspired me to go into chemical informatics. Unfortunately nothing has changed in 30+ years, and the ACS and Elsevier are now holding chemistry back from the forefront of science; it’s getting worse rather than better. A typical example is that when Wikipedia started to use the CAS identifier system – the natural thing to do – CAS threw the lawyers at them with a cease-and-desist. I commented on this (see Peter Suber’s summary) and wrote (in 2008):
[ACS] have done the following:

  • re-asserted their position that they care for revenue more than supporting the wider chemical community
  • re-advertised themselves as one of the least progressive learned societies
  • alienated a growing number of young scientists who look to the Web as a critical part of the future of chemistry…

It seems inevitable that community-based resources grow old and closed (examples being CAS, OCLC, and Cambridge Crystallographic Database – about which more in later blog posts). Without checks the RSC resource will go in this direction, and if they feel their role is to compete in the market with ACS it will happen more quickly. Money distorts judgment. CAS officers can earn over a million dollars a year and they think like officers of Fortune 500 companies rather than stewards of a resource for the community.
So I have some questions and suggestions for the RSC:

  • The resources must be completely Open. That means conformant to the OKF’s Open Definition (Free to use, re-use and redistribute without restriction other than possibly attribution.).
  • There must be NO CC-NC licences. (We had a collaboration with the RSC on text-mining. I pleaded for the output of our research – an annotated corpus of RSC papers – to be licensable as CC-BY, but the RSC insisted on CC-NC. So the valuable corpus is not available to the community.)
  • NO API-only access. The whole contents must be downloadable. This is a challenge, but it’s essential. A database where I don’t know the scope of the contents is of little use – I and my machines must be able to browse it.
  • Community involvement. The contents of the repository are created by the community. It’s appropriate that they are involved in their use and development. If not, the repository will inevitably drift to something developed and maintained by staff rather than depositors and users.
  • The RSC must see themselves as a facilitator of science, not an owner. They must not think in terms of “our repository”. They should encourage knowledge (software, data, etc.) from any source and facilitate its use. The bioscientists do. Europe PubMed Central does.
  • The RSC must not use it to give preference to their publications. Of course by default they can only put in their own and CC-BY resources. But they must not use the repository to create a quasi-monopoly for their publications.

They should burn “Open” into their organizational DNA. This won’t be easy, but it’s the way that scientific societies must go. I’d recommend a steering committee for the repo which included strong representation from outside their immediate community – such as CODATA, IUCr, and students.
If they can do all this – and I believe the requests above are reasonable – then I think it will be wonderful. But the RSC has not had a proactive approach to the modern information world and it will have to be a serious change of direction.
I am developing software which will extract chemistry from a wide range of documents in automatic mode. I intend to deploy it very shortly. I would like to deploy it on the RSC repository and extract data for my own purposes and the world’s purposes.
 


2 Responses to Royal Society Of Chemistry's new Repository: My initial thoughts on Open Data

  1. Pingback: Internet Archaeology: Blasts from the past. « Henry Rzepa

  2. Richard Kidd says:

    Peter – thanks for your thoughtful post
    I agree with pretty much all the preamble – we’re doing this because we feel data needs to be preserved and published more openly, for it to be structured and semantic, and because as a Society we think it’s our responsibility.
    On the specific questions
    • The resources must be completely Open. That means conformant to the OKF’s Open Definition (Free to use, re-use and redistribute without restriction other than possibly attribution.).
    See below – as we won’t be in control of the upstream licensing for all forms of content. Should be ok for data.
    • There must be NO CC-NC licences.
    Yes to data being CC0 or CC-BY. For articles – we offer -NC licences on our own OA articles, as well as -BY, to offer our authors the choice they have requested. As our aim is to work with other publishers, institute and individual authors, their licences will be taken into account – but all content will be clearly tagged with the licence.
    • NO API-only access. The whole contents must be downloadable. This is a challenge, but it’s essential. A database where I don’t know the scope of the contents is of little use – I and my machines must be able to browse it.
    We haven’t put much thought into that yet, but request noted. We will concentrate the majority of our efforts on proving to the community the value of data deposition – as you know well, unless the community recognises the value of the deposition, what we decide to do downstream will become quickly irrelevant.
    • Community involvement. The contents of the repository are created by the community. It’s appropriate that they are involved in their use and development. If not, the repository will inevitably drift to something developed and maintained by staff rather than depositors and users.
    We’ll fail if it’s not embedded in the community – to make deposition as easy as possible, and to enable reuse and credit for the data submitted. Much of the initial data repository will be provided as part of the National Chemical Database Service, and we have an excellent Advisory Board chaired by Peter Scott from Warwick. (http://cds.rsc.org/about.html)
    • The RSC must see themselves as a facilitator of science, not an owner. They must not think in terms of “our repository”. They should encourage knowledge (software, data, etc.) from any source and facilitate its use. The bioscientists do. Europe PubMed Central does.
    That’s pretty much how we see it ourselves – as a facilitator.
    • The RSC must not use it to give preference to their publications. Of course by default they can only put in their own and CC-BY resources. But they must not use the repository to create a quasi-monopoly for their publications.
    Again, we don’t see the value if it was just our own publications. It’ll be a level playing field.
    Since the data repository element of the CDS was announced earlier this year we’ve talked to a lot of the stakeholders in the community, covering the domain but also funders, repository owners and data specialists. We think what we’re doing will play nice with the ecosystem, but most of all we hope it will be an exemplar of what a community repository can be.
    Richard
