Monthly Archives: August 2011

CCDC: Reasons why sourceCIF data must be Open

This is the text of a letter I have sent to Dr Colin Groom, Director of the Cambridge Crystallographic Data Centre. For context, read my blog posts over the last 4 days. "sourceCIFs" are raw data created as part of a crystallographic experiment by scientists (not in the CCDC) and required by community norms as part of the scholarly publication process. Some are published Openly, but others are sent by the author or publisher to CCDC in an exclusive process. CCDC then control the further distribution of these data, which are made available either in trivial amounts (less than 0.1% of the CCDC's holding of sourceCIFs) or through a significant financial subscription (which many institutions cannot afford). I re-emphasize that I simply wish to make Open the collection of the authors' original data [i.e. not the file with a CCDC header and CCDC accession code].

The arguments in this letter may apply to other disciplines, wherever Open data are managed by an exclusive gatekeeper.


At my presentation in MS89 at the IUCr meeting I presented my request to you for the raw verbatim sourceCIFs supplied to CCDC as part of the publication of scholarly articles (i.e. without any CCDC-added value). I was unaware of anyone from the CCDC being present, so you will have to accept my account. I stepped through the arguments on my blog, arguing that these data were part of the essential scholarly record, and that some publishers made their sourceCIFs available Openly. It is the sourceCIFs from publishers such as Wiley, Elsevier and Springer that are in question and for which I asked. There was no adverse comment on what I presented.

I showed your reply in the meeting and also posted it on my blog. I summarised this as containing two main reasons why CCDC would not release the sourceCIFs:

  • "these arrangements were put in place to satisfy the demands of publishers". I have asked the University of Cambridge for details of these arrangements, through a Freedom of Information request. The advantage of specifying FoI is that it contains explicit guidelines on the public release of contracts and this gives you the power (and the duty) to make these contracts Open except in very special circumstances. I have also asked for the number of sourceCIFs involved. When I have a better idea of the facts from this request we will be able to judge whether any of the current publishers is acting as a block to making the sourceCIFs Open. Note that the FoI legislation requires a reply by Sept 26 at the latest and I am likely to make further comment at that stage.
  • "because the CDCC continues to rely on subscriptions to the CSD to fund its ongoing developments." We discussed this in a conversation, and I think I can summarise your argument as: "if the sourceCIFs were open, the CCDC would lose a significant number of subscriptions" [in part because other resources based on the sourceCIFs such as Crystallography Open Database and Crystaleye could provide competition]. You argued that the CCDC was beneficial to the community (which I agree with) and that it could only continue to exist if it had a monopoly right to control the distribution of sourceCIFs (with which I profoundly disagree and now explain why).

There are ethical, moral, and political/legal reasons why basic published scientific data should be Open.

  • Moral, in that the authors of the data believe that by providing their experimental data they are providing it to the world community, whether scientific or not. If CCDC closes the sourceCIFs data, authors are deprived of their moral rights.
  • Ethical. The data in sourceCIFs are of value to the world community (for example many subscribers to CCDC are involved in medical research and need the data to help develop new drugs).
  • Political and legal. Many governments and funders are requiring that the fruits of their funding are made completely Open. If, for example, a scientist is funded by Wellcome Trust or NIH their research is expected to be Open. They publish their papers Openly according to guidelines. But for many of these papers the text is Open while the sourceCIFs are closed [they can only be obtained by request, only in small numbers, and cannot be redistributed].

More generally public opinion is strongly in favour of reform towards Openness. I give some examples, some of which will have quasi-legal compulsion.

  • On Monday 29th Aug George Monbiot published an article in the Guardian which very strongly criticized the current system of academic publishing. This has had widespread impact and very general support. I believe that the current CCDC practice of a monopoly control of academic research would fall under the same criticism. While it may not be on the same scale, it is the same principle. CCDC lay themselves open to being judged in the court of public opinion and it will be difficult to show why they should not release sourceCIFs.
  • RCUK have now universally pushed for raw data to be openly available. In talking with NERC, I understand that their philosophy is that raw data should be Open and that value-adders should build on this and can create a competitive market based on the value-add, not a monopoly.
  • The value of text and data in bulk (e.g. for mining) has been highlighted by the Hargreaves report. Effectively Hargreaves is saying that copyright and other contractual restrictions are seriously harming science and that the UK should remove them. I wish the sourceCIFs to be used for data-mining in an open fashion, whereas at present the only data-mining is what CCDC permits and which has to be paid for. I have commented on this in a blog post. Note that the post contains suggestions from a third party as to how we should approach sourceCIFs, and I have done what I can to avoid confrontation. But the issue is public and I expect the community to make reference to Hargreaves.
  • The Information Commissioner's Office (ICO) has taken strong action against scientists who refuse to share data (see Queen's University Belfast and tree-ring data, which describes in detail how QUB fought and lost the right to keep data closed). The chairman of the UK parliament's Science and Technology Committee stated that "data has to be made publicly available" and that "Any university or scientist that hasn't got that message needs a total rethink of the way they do research". I hope that CCDC do not take the same route as QUB as it is messy, ultimately pointless, and reduces standing in the community.

These are some of the recent examples of how public opinion, including government, is solidly behind releasing data. Some of these involved conflict and I am keen to avoid this. My hope is that, by the time I get a reply to my FoI request, CCDC decides that it can, after all, release the sourceCIFs as Open. If there are contractual problems with the publishers I am happy to help take them to higher authorities (as in the previous paragraph). While the publishers may not need to comply legally, I think moral pressure from government offices is likely to be effective.

I do not accept that the CCDC will suffer serious business loss. Encyclopedia Britannica has not been destroyed by Wikipedia but it is redesigning itself. Ordnance Survey has not been put out of business by OpenStreetMap. CCDC should not feel seriously challenged by Crystallography Open Database and Crystaleye. Indeed if it aligns itself with Open Crystallography it can benefit.

I am happy to advise CCDC on how to make sourceCIFs Open, e.g. by defining what is meant by Open and what needs to be done to make sure sourceCIFs are Open and can be re-distributed and re-used. There is also the issue of future sourceCIFs and I am happy to suggest processes for the future ingest of Open sourceCIFs. I stress that the Openness requirement is not negotiable – effectively it means there can be no imposition of conditions other than an Open licence (PDDL, CC0) [see the Panton Principles]. The decision must be rapid (i.e. within the 20 working days of the FoI request). A promise to change in the future is, unfortunately, not acceptable.

I hope that CCDC will agree this is the right way to go, quickly.

Open Crystallography: The Hargreaves report can help make CCDC data Open

I have had a really useful suggestion about how to make data deposited with the Cambridge Crystallographic Data Centre (CCDC) Open. The Hargreaves report has recommended that text- and data-mining of the scientific literature should be allowed and the government agrees [see below]. It is therefore likely that the data in CCDC fall under data-mining. Since a major user of the CCDC data is the pharmaceutical industry, it clearly falls under "medical". I give Pete Carroll's suggestion in full, and add my comments [emphasis is mine].

Pete Carroll says:

August 29, 2011 at 11:26 pm

I wonder if the Government response to the Hargreaves Review regarding data/text mining for research might be relevant?

"Nor does the Government regard it as appropriate for certain activities of public benefit
such as medical research obtained through text mining to be in effect subject to veto by the owners
of copyrights in the reports of such research, where access to the reports was obtained lawfully. We
recognise that some publishers view licensing of text mining as a legitimate commercial opportunity;
however we are not persuaded that restricting this transformative use of copyright material is
necessary or in the UK's overall economic interest…

the Government agrees with the Review's central thesis that the widest possible
exceptions to copyright within the existing EU framework are likely to be beneficial to the UK

The UK government can therefore be argued to be in favour of our obtaining Open data from the CCDC

…subject to three important factors:

• That the amount of harm to rights holders that would result in "fair compensation"
under EU law is minimal, and hence the amount of fair compensation provided
would be zero. This avoids market distortion and the need for a copyright levy
system, which the Government opposes on the basis that it is likely to have adverse
impacts on growth and inconsistent with its wider policy on tax.

The CCDC advanced two main arguments for non-release. One was economic (it would hurt their business), the other was that third parties had rights over the data. On the first I believe that harm is minimal as the raw data has not had value added by CCDC and that their income comes from added value and independent products.

• Adherence with EU law and international treaties.

• That unnecessary restrictions removed by copyright exceptions are not re-imposed by
other means, such as contractual terms, in such a way as to undermine the benefits of
the exception.

The Government will therefore bring forward proposals in autumn 2011 for a substantial
opening up of the UK's copyright exceptions regime on this basis. This will include
proposals for a limited private copying exception; to widen the exception for noncommercial
research, which should also cover both text- and data-mining to the extent permissible under EU law."


The parliamentary select committee for the dept of Business Innovation & Skills is holding an inquiry on the Hargreaves Review and the Government's response to the review. Closing date for submissions 5th September. See:

I know time is short but it could be worth yourself or someone from the research community bringing this problem of "closed CIFs" to their attention as exemplary evidence of problems with access to data.

PS good luck with your FOI request. You might find […] useful to you if they try a S41 or S43 exemption over release of the contracts.

Thanks. I am hoping that they will not "try any exemptions". They are part of our wider community and it is my hope that they will see the positive value of opening their data and that this will raise their public esteem. I do not want a battle – I would much rather see genuine reorientation of approach. But the solutions must be Open and they must be rapid. If there are problems I shall certainly approach the ICO (Information Commissioner's Office) who have been very unsympathetic to scientists hanging on to data which should be in the public domain. I think that in practice any prolonged refusal to provide Open data will be tried in the court of public opinion both within the scientific community and beyond. But I hope there will not be a "Crystalgate".

If they take heed of this and wish to make their data Open, then the only barriers will be from contracts imposed by third parties. Until they provide this information we do not know whether it is a problem. If it is, it may be that Hargreaves is a useful weapon.


#IUCR2011; FOI request for details of CIFs deposited by publishers with CCDC

In a previous post I showed a letter from Colin Groom of the CCDC indicating reasons why the CCDC could not make its deposited crystallographic data (source CIFs) Open.

This is disappointing because otherwise we would be able to declare that almost all public crystallographic data was Open. As it is, a substantial number (many thousands) of data files are closed.

One of the reasons given by Colin was that the publishers had added conditions and restraints. I do not know the details of these so I have asked formally for details of numbers of source CIFs and the contracts which might restrain their distribution. I have used the UK Freedom of Information Act since CCDC is part of the University of Cambridge. (In fact all requests for information are ipso facto under FOI, so the additional formality helps to make sure the request is processed appropriately.)

I have used the excellent […] to send the request. This helps make sure the request reaches the right place, sets the clock ticking (organizations are allowed 20 working days to reply) and provides a permanent Open record. My request is at […] and reads as below:

Dear University of Cambridge,

I am writing to request information from the Cambridge
Crystallographic Data Centre (CCDC), one of the listed departments
of the University.
The CCDC maintains a database of crystallography, primarily created
from factual crystallographic data supplied as "supplemental
information" or "supporting data" accompanying scholarly
publications. These data (referred to hereafter as "source CIFs")
are created by the authors, not the CCDC, and represent part of the
primary scientific record supporting the scientific paper. Some
source CIFs are published in the open literature accompanying
publications (whether Open Access or Closed access). Other source
CIFs are not published Openly and are sent exclusively by various
publishers to the CCDC for deposition (closed source CIFs). [see
public reply from Dr. Groom, CCDC]. I wish to know the number of these closed
source CIFs and the contracts between the CCDC and each publisher.
1. Please list all the publishers with whom CCDC has an arrangement
for receiving source CIFs.
2. Please provide a copy of the current contract with EACH publisher.
3. Please indicate whether EACH publisher puts any restrictions on
the re-use and redissemination of these source CIFs.
4. Please indicate whether EACH publisher claims any intellectual
property rights over these source CIFs
5. Please indicate whether the CCDC claims any intellectual
property rights over these source CIFs
6. Please indicate for EACH publisher how many source CIFs are held
by the CCDC
7. Please give any information on whether the Advisory Board or
other governing body has discussed the question of making closed
source CIFs Openly available.

This is a valid FOI request (the CCDC hold the information and it
will not cost an undue amount of time or money to provide it).

Yours faithfully,

Peter Murray-Rust

I will keep readers of this blog updated on progress



#IUCR2011: reply from CCDC on restrictions on redistributing CIFs

I wrote to Colin Groom of the CCDC requesting the release of authors' raw CIFs (supporting information) into the public domain. I have now had a reply which I publish below. This will alter what I say at my presentation tomorrow (2011-08-29, Monday).


Apologies for the delay, I left the IUCr early on Saturday […].

The CCDC has arrangements with a number of publishers, whereby we are able to process CIFs into CSD entries and supply the source CIFs to those who request them. Supply of the source CIF requires the requestor to identify the CIF, either by reference number or by providing the reference to the article in which they are described. It is my understanding that these arrangements were put in place to satisfy the demands of publishers – they indicate that requestors have access to the journal article in which the CIFs were published. The CCDC makes no charge for this activity.

The entire, curated, value-added CIF collection does, of course, form the CSD. This is provided at below-cost to academic scientists. Where academic scientists have a genuine lack of funds, access to the CSD is subsidised by the CCDC charity. Of course you have access to the CSD.

No restrictions are made on the research use to which the CSD is put; however redistribution of the CSD is not permitted. Licensees are also required to seek permission from the CCDC prior to releasing derivative works and related services. These restrictions were put in place to satisfy the demands of publishers and, because the CDCC continues to rely on subscriptions to the CSD to fund its ongoing developments, to secure the future of the resource.

The distributed CIFs, and CIFs derived from the CSD, contain statements such as they "…may contain copyright material of the CCDC or of third parties" These were drafted several years ago and were put in place to deal with copyright claims of various publishers at that time. I recognise that there are changing views regarding the copyright of data. I also recognise that technological developments continually present new research opportunities and demands on data. We have therefore, been reviewing the services that the CCDC provides and the terms under which they are provided. Ian Bruno is leading this review. His primary consideration is how the CCDC can ensure that we maximise the accessibility and benefit of structural information both now and into the future. Unfortunately, this review is not yet complete, however, we will consult widely and welcome your views on these issues.



First thanks to Colin for his reply. Now my comments:

P1. This indicates that it is impossible to discover the CIF unless one has access to the journal article.

P2. I am not asking for access to the CSD, only to the raw CIFs which were contributed as part of the publication process.

P3. The restrictions on re-use are twofold: (a) from publishers, and (b) to create a monopoly for the CCDC to secure its income.

P4. I understood from yourselves that this review would not be complete before the end of the calendar year 2011.


#IUCR2011: Open and Closed publication and Gatekeepers

I shall be explaining (very rapidly) the role of Open, Closed, text and data in my talk on Open Crystallography. I shall use a series of diagrams with consistent semiotics.

Before I start I want to make it clear that scientific publishing is not a zero cost activity. I am not waving hands here, I am basing this on models that work. There may be major difficulties to change en masse but that is not the same as being woolly-headed and naïve. And I know it takes time and I am impatient. However academic scientific research is a 100 billion to 1000 billion USD industry and the money is in there. It is a question of re-orienting it.

Here is the traditional publication model, without data. A piece of fulltext is published and the reader (usually through their library) pays. Individual subscribers can pay 40 USD per day for 1 article.

This discussion is orthogonal to peer-review (please don't complicate it by involving PR – ALL the diagrams can have PR or NO-PR). It is also slanted towards crystallography


The components:

  • The funders fund the authors to do the research and publish fulltexts and data
  • The publisher decides whether and how to publish the paper (Gatekeeper)
  • The publisher uses a paywall to restrict access
  • The reader inputs money, reads fulltexts and data, but has no other say
  • Open means OKDefinition compliant. Anything else is closed


Most aspects of publishing are infinitely discussed, but the role of gatekeeper is critical here. The gatekeeper has complete and arbitrary say over what happens. The GK can restrict and filter input and can bar access. In the modern world GKs are frequently:

  • In conflict with emerging ideas of web-democracy
  • Conservative
  • Self-serving; i.e. the operation is run for the benefit of gatekeepers and not the authors or the readers

In general most gatekeepers outlive their usefulness



Here the publisher arbitrarily refuses to publish data (no examples from crystallography)

Here the publisher accepts data and publishes it on their website, generally without any restrictions (effectively Open)

The publisher publishes the data to a closed repository run by another Gatekeeper. There may or may not be an agreement between the gatekeepers. The reader has paywalls and permission walls for both components. Nothing is Openly reusable.

The author pays an Open Access charge (normally "GoldOA") and the publisher makes the text Openly accessible. Note that the publisher is still a gatekeeper in all other respects.

The OA model with data on the publisher's site

The authors or publishers put the data in a domain-specific repository (e.g. DRYAD, Tranche, PDB). Anyone can re-use the data. Funding models are variable (e.g. pay for deposit or grant)

The funder pays for the publication (e.g. Wellcome/HHMI/MPI journal). The funder may have input into gatekeeping

And finally the models emerging from the web culture

Here the readership, authorship and publishing mechanism merge into a meritocracy. There is still a need for gatekeeping, but decided by the community.

Tomorrow I will post a synopsis of my talk and give links

#IUCR2011: Open Crystallography

I am speaking on 2011-08-29 on new methods of publishing crystallography including data. I shall prepare my talk as a series of blog posts, not necessarily in the order that they are presented at the meeting.

I am arguing that there should be a concept of Open Crystallography to which crystallographers and other communities (not restricted to scientists) can subscribe. The idea is that published crystallographic information should be Open to everyone. Open as in the Open Knowledge Foundation's Definition:

"A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike."

This is a crystal-clear operational definition – something either conforms completely or it does not conform. There are no intermediate positions.

Crystallography is among the most Open of disciplines. It has conducted its affairs for the benefit of the community and has pioneered the concept of publishing data, especially in the long-tail of science (where millions of data are published independently). It has an Open Access journal (Acta Crystallographica E) with a very modest fee and a very large authorship (1000s of articles per year).

In the days of print journals it was extremely pro-active in requiring the publication of crystallographic supporting data/information alongside text. Indeed if the IUCr had not argued for and won this process there would be many fewer examples of supplementary data published today.

There are several subdisciplines of crystallography. Macromolecular crystallography (proteins, nucleic acids, etc.) is supported by the PDB (Protein Data Bank), which is effectively Open. People can copy entries, create derivatives, create mashups, reformat, etc. without permission.

My focus here is on chemical crystallography (small molecules). Although all supporting information must be submitted to journals, only some of them publish it visibly and Openly. High-volume publishers with Open supplemental information include:

  • American Chemical Society
  • Int. Union of Crystallography (IUCr)
  • Royal Society of Chemistry
  • Nature

And by default all Gold Open Access publishers (e.g. BMC)

In contrast a number of publishers (Elsevier, Springer (ex BMC), Wiley/Blackwells) do not publish the supplemental information (or hide it behind paywalls). They send it (or get the author to send it) to the Cambridge Crystallographic Data Centre (CCDC). This information is hidden behind paywalls and permissionwalls. It is not Open.

There is now a growing groundswell for making small-molecule crystallography completely Open. Some of us have built tools to collect Open CIFs into our own repositories or to accept donations. [see […] for historical position]. The largest of these are the Crystallography Open Database (COD) and Crystaleye.

Saulius and I met at this meeting and we have completely aligned objectives. We have agreed to use "Open Crystallography" as an umbrella for our efforts, and to exchange data and tools (I will explain this at the meeting). We can immediately donate our data to COD, and while most are duplicates there are clearly a number which are not.

Open Crystallography can follow the Panton Principles. [I have substituted 'science' by 'crystallography']

By open data in crystallography we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published crystallography should be explicitly placed in the public domain.

Formally, we recommend adopting and acting on the following principles:

  1. Where data or collections of data are published it is critical that they be published with a clear and explicit statement of the wishes and expectations of the publishers with respect to re-use and re-purposing of individual data elements, the whole data collection, and subsets of the collection. This statement should be precise, irrevocable, and based on an appropriate and recognized legal statement in the form of a waiver or license.

    When publishing data make an explicit and robust statement of your wishes.

  2. […] Creative Commons licenses (apart from CCZero), GFDL, GPL, BSD, etc are NOT appropriate for data and their use is STRONGLY discouraged.

    Use a recognized waiver or license that is appropriate for data.

  3.  The use of licenses which limit commercial re-use or limit the production of derivative works by excluding use for particular purposes or by specific persons or organizations is STRONGLY discouraged. These licenses make it impossible to effectively integrate and re-purpose datasets and prevent commercial activities that could be used to support data preservation.

    If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.

  4. Furthermore, in science it is STRONGLY recommended that data, especially where publicly funded, be explicitly placed in the public domain via the use of the Public Domain Dedication and Licence or Creative Commons Zero Waiver. This is in keeping with the public funding of much scientific research and the general ethos of sharing and re-use within the scientific community.

    Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

There are probably about 300,000 datasets for small-molecules (CIFs) now openly available through COD or Crystaleye and smaller collections. But there are probably about 50,000 – 150,000 CIFs published electronically but closed behind the CCDC walls. If we can Open these (and I am awaiting CCDC's reply) then all small-molecule crystallography becomes Open.



Talk at Int. Union of Crystallography: I ask for the availability of scientific data

I am giving an invited talk on Monday 2011-08-29 at the IUCr about our Crystaleye system and more generally about new approaches to publishing science. I'll be blogging a lot over the next 2 days and get all my ideas into posts.


Authors' raw data is part of the necessary scientific record (i.e. the material required to support the authors' claims) and there is a large groundswell that it should be universally published. Some disciplines already do this and in the Long-tail of science crystallography leads the field. Every paper MUST be accompanied by the crystallographic data (CIF) and all publishers require authors to make this available. (More later).


Crystaleye (written by Nick Day) is a system that reads the CIFs from publishers' webpages and aggregates them into a browsable and searchable knowledgebase. So far it has got over 200,000 different CIFs and covers both organic and inorganic data. It does this for journals published by IUCr, Am. Chem. Soc. (ACS), Royal Soc. Chem. because they publish the CIFs on their websites. Other publishers, however, do not. They send the data to the Cambridge Crystallographic Data Centre (CCDC) [Note I have no formal connection with the CCDC].


The CCDC is the only place in the world allowed to hold these CIFs and they are not on public view. It holds many tens of thousands of these CIFs but normally restricts gratis access to very small numbers, typically 1 at a time, by email request. To get access to the CIFs you have to pay an annual subscription. This subscription also provides added downstream value, which I do not need.


I would like the raw data from major publishers such as Wiley, Elsevier and Springer, which at present are only held to my knowledge at CCDC. I believe that access to them is essential to doing modern science (I will explain why later), so I have asked the director of CCDC (Dr Colin Groom) for these CIFs. Note that these are the CIFs deposited by the authors, not material enhanced by the CCDC (and already routinely published by ACS, RSC, IUCr, etc.)


Colin and I met yesterday and he agreed to reply to my request. We understand each other's positions but it will be useful for Monday's talk to have a formal record and this is the mail I sent to Colin yesterday.


Several journals (mainly Wiley, Elsevier, Springer, Science) do not publish the authors' electronic CIFs but instead deposit them in the CCDC. These CIFs are a major part of the primary scientific record and can be used for validation, detection of fraud and error, systematic studies, mashups, etc.

I am asking whether the CCDC is prepared to make all these available as an Open collection (e.g. under CC0) so that the community can have bulk access to these without requiring further permission.

I asked the ICSD earlier this week and they were prepared to do this for the electronic CIFs they have received as part of the deposition process. If CCDC can do the same then the complete record of published electronic data will be available.

I am talking about new approaches to publication on Monday and would like to be able to present the CCDC's formal position and any future plans on this issue.

I am around most of the time and happy to see if we can meet
Many thanks,



I will publish his reply in full and unabridged on this blog and discuss it in my talk on Monday

Data sharing and Quixote meeting (Zaragoza)

I am talking in a few minutes to a group of chemists, other scientists, computational scientists, informatics specialists, IR managers, etc. in ZCAM (computational chemistry) in Zaragoza. This is a very exciting project and we hope to not only talk, but actually do things today.

Rather than use Powerpoint I blog my materials. A lot is present in previous blog posts, but this adds an overview of what I might say, and some of the materials I might use. What I actually say depends as always on what has already been said, and not said and the interests of the people present.

My motivation

[With ChEBI, Christoph Steinbeck] Compute properties (spectra, conformations, reactivity) of compounds in the human metabolome.


  • Open to all - no central ownership (cf. Wikipedia). Not my project, but OUR
  • Very cost-effective with a high potential for success
  • A long-tail discipline, with discrete data.

Data Sharing

  • Must be driven by scientists (researchers, editors)
  • Should be domain-specific

Why share data?

  • To promote MY work and receive credit (data citation)
  • To save MY work
  • To share MY datasets with ME (i.e. look for patterns, correlation)
  • To share MY datasets with MY colleagues
  • To share MY datasets with the world
  • To improve methodology
  • To validate science

What are the problems?

  • People want to use their results as intellectual capital
  • People can sell their data for money
  • It takes effort and money
  • It challenges established interests (priesthood, market)
  • Chemists are more conservative than many disciplines

Why/how will it happen?

  • Because individuals (e.g. grad students) find it useful
  • Because groups find it useful
  • Because journals find it useful enough to mandate
  • Because funders require it
  • Because developers (e.g. of programs) find it useful

What should we do today?

  • Make a wish list for compchem data sharing
  • What is possible right now?

Resources related to Data Sharing

Recent blogs by PMR

SPARQL query for Crystaleye2

[This will be used interactively with crystaleye2. Try it under SPARQL. It's very new. If it works, congratulate Sam.

If it fails maybe the server is down, or blame me.]


Report only structures with R values less than 0.02:


PREFIX cif: <>

SELECT ?uri ?rfactor {
  ?uri cif:refine_ls_r_factor_gt ?rfactor
  FILTER (?rfactor < 0.02)
}
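
For readers who do not know SPARQL, here is a minimal Python sketch of what the query's FILTER does: keep only the entries whose R factor is strictly below 0.02. The `Structure` type, the `low_r_structures` function and the example.org URIs are hypothetical names invented for this illustration, and the values are made up; none of this is Crystaleye code or data.

```python
from typing import List, NamedTuple, Tuple


class Structure(NamedTuple):
    """One (uri, rfactor) pair, mirroring the two variables the query selects."""
    uri: str
    rfactor: float


def low_r_structures(structures: List[Structure],
                     threshold: float = 0.02) -> List[Tuple[str, float]]:
    """Return (uri, rfactor) pairs whose R factor is strictly below threshold."""
    return [(s.uri, s.rfactor) for s in structures if s.rfactor < threshold]


# Made-up records for illustration only; not real crystallographic data.
examples = [
    Structure("http://example.org/structure/a", 0.015),
    Structure("http://example.org/structure/b", 0.020),
    Structure("http://example.org/structure/c", 0.048),
]

selected = low_r_structures(examples)
for uri, rfactor in selected:
    print(uri, rfactor)
```

Note that the comparison is strict, as in the query: a structure with an R factor of exactly 0.02 is not reported.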


Criteria for DataSharers

I am defining DataSharers as an ecology of places to make digital stuff available for sharing. I am still searching for a good term. Digisharer? Digital Sharer? I originally explored this list for repositories so the criteria are slightly different. Interestingly some of the criteria here are now almost trivially obvious, reflecting the innovative and community-based approach of science rather than the unclear world of repositories. So maybe some of the criteria below will dissolve away.

  • A dataSharer should have a clear purpose.
  • People should want to make stuff available through dataSharers.
  • People should want to get stuff via dataSharers.
  • The Community running any dataSharer should be clear and their motivation transparent.
  • Most successful dataSharers are started by one or more identifiable people.
  • The DataSharer should generate a wider community that has a sense of ownership. Encourage it to innovate in searching and displaying the contents of the DataSharer.
  • The DataSharers and their ecology should be a dynamic organism. Build for evolution, not stasis.
  • Make the DataSharer successful rapidly. Plan for the "now", not for future generations.
  • Make everything associated with the DataSharer OPEN and make this explicit.
  • Build DataSharers for machines and humans.
  • Make it clear what the extent of the DataSharer is, and make it iterable.
  • A DataSharer should be cloneable/forkable.
  • Do not rely on traditional metadata strategy.
  • Give depositors massive feedback.
  • Give everything unique IDs.

I'm promoting the idea of domain-specific Sharers, because this is much more likely to create a community. It's also interesting to see them coupled to journals and traditional publishers. However this requires very careful scrutiny. There are excellent publishers (IUCr, BMC, EGU…) and there are others. Some publishers, both commercial and non-commercial want to own our data. A DataSharer run by a traditional "owner and controller and reseller" of academic content is likely to need very careful scrutiny. Is the licence really Open? Is the data completely available? Is there a guarantee that it won't get closed sometime in the future? I find it impossible to see an Open DataSharer coupled to a closed access journal.


DataSharer principles: tested against GigaScience; partially Open, but not enough

GigaScience tweeted that they were studying my suggested principles for data repositories – which I shall now amend to DataSharers. I'd heard vaguely about GigaScience on the blogosphere but had not paid much attention, as their datasets are large and I am more interested in the long tail. But as they are at least interested in me I will have a look at them. In what follows I am probably simply ignorant, so corrections are welcome.

I am taking my information from: which seems to have some relation with Biomed Central, though nowhere is this very explicit. The tweetfeed comes from Shenzhen, China and the editorial board is from the BGI ( ). After some time I have now found a press release: which is clearer:

BioMed Central and BGI launch a new integrated database and journal, to meet the needs of a new generation of biological and biomedical research as it enters the era of "big-data."

GigaScience, an innovative new journal and integrated database to be launched by BioMed Central in November 2011, has released their first datasets to be given a Digital Object Identifier (DOI). This enables a long-needed way to properly recognize the data producers who have provided an untold number of essential resources to the entire research community. This not only promotes very rapid data release, but also provides easy access, reuse, tracking, and most importantly permanency for such datasets. The journal is being launched by a collaboration between BGI, the world's largest genomics institute, and open access publisher BioMed Central, a leader in scientific data sharing and open data.

Media Contact
Matt McKay
Head of Public Relations, BioMed Central

PMR: Recommendation. If you are launching a new journal make it clear on the web page what the publishing organization is. It took me 15+ minutes of trawling the web to get these facts. The BGI seems to have commercial interests so I would want to know what the absolute policy on the journal is – who runs it, who runs the data?

So let's see how the DataSharer principles match up. I'm still refining them. I am taking the view that DataSharers must be completely Open (OKD-compliant, libre). BMC and I share the same operational views on Openness. The final pointer to the BMC licence makes the general principles reasonably clear for articles (but not for data).

Authors of articles published in GigaScience retain the copyright of their articles and are free to reproduce and disseminate their work (for further details, see the BioMed Central copyright and license agreement)

An online open-access open-data journal, we publish 'big-data' studies from the entire spectrum of life and biomedical sciences. To achieve our goals, the journal has a novel publication format: one that links standard manuscript publication with an extensive database that hosts all associated data and provides data analysis tools and cloud-computing resources.

PMR: I will be interested to see the links.

GigaScience aims to increase transparency and reproducibility of research, emphasizing data quality and utility over subjective assessments of immediate impact. To enable future access and analyses, we require that all supporting data and source code be publically available and we provide an extensive database and cloud repository that can host associated data, supplementary information and tools.

PMR: This will be interesting. I question "publically available" as I'm not clear what this means in practice (note that it often isn't very easy to make all code available if parts of it have been licensed, e.g. from database vendors).

A unique feature of our database is that important associated datasets can be given DOIs, providing both permanency and an additional citation. Thus GigaScience provides easier access to associated data as well as recognition for data producers.

PMR: Very important, but surely not "unique" – isn't this what Datacite does?

Open access

All articles published by GigaScience are made freely and permanently accessible online immediately upon publication, without subscription charges or registration barriers. Further information about open access can be found here.

PMR: There is nothing about data – data are different from articles, so this should be addressed specifically.

Indexing services

Following publication in GigaScience, the full-text of each article is deposited immediately and permanently in repositories in e-Depot, the National Library of the Netherlands' digital archive of electronic publications. GigaScience is included in all major bibliographic databases. A complete list of indexing web services that include BioMed Central's journals can be found here.

BioMed Central is working closely with Thomson Reuters (ISI) to ensure that citation analysis of articles published in GigaScience will be available.

PMR: It is critical that this indexing metadata is made specifically Open, identified as such and made available to the community. Otherwise BMC is granting third parties ownership over citation data that they can control and resell to the scientific community (as happens with other citations). Make data set citation data OPEN.

Publication and peer-review process

Suitability of research for publication in GigaScience is dependent primarily on the data quality and utility, rather than a subjective assessment of immediate impact. To encourage transparent reporting of scientific research as well as enable future access and analyses, it is a requirement of submission that all supporting data and source code be made available.

PMR: Excellent requirement – it won't be easy.


Data and materials release

Submission of a manuscript to GigaScience implies that readily reproducible materials described in the manuscript, including all relevant raw data, will be freely available to any scientist wishing to use them for non-commercial purposes [PMR emphasis]. Nucleic acid sequences, protein sequences, and atomic coordinates should be deposited in an appropriate database in time for the accession number to be included in the published article. In computational studies where the sequence information is unacceptable for inclusion in databases because of lack of experimental validation, the sequences must be published as an additional file with the article.

PMR: Whyever has the NC been included? It's inconsistent with everything that has been said before. It's unenforceable. It goes against all current BMC policies AFAIK. Please, please remove it asap. I cannot regard GigaScience as Open while it remains. See .


Crystal structures of organic compounds can be deposited with the Cambridge Crystallographic Data Centre.

PMR: Structures in the CCDC are not Open. Their distribution is controlled by the CCDC and there is no right of re-use. Put them anywhere Open.

PMR: In general I get good vibes about GigaScience. I think they satisfy most, but not all, of my initial principles. However I would like to see Data addressed specifically and consideration given to the Panton Principles for Open Data in Science, including clear labelling.

PMR: If you reply in comments these will be visible to everyone. I will treat them constructively.

UPDATE: Comment from GigaScience to this blog crossed this post.