CrystalEye and Open Data and Open Notebook Science

There has been more interesting discussion on the contents of CrystalEye, derived data, and the concept of OpenData. I shall address some of the issues and welcome more discussion. Since I have been critical of others I am quite prepared to take criticism myself. Please remember that what you are looking at is the work of a graduate student, Nick Day, who is now writing up his thesis and has effectively finished work on CrystalEye software, other than fixing bugs which would affect his science. Note, however, that the compilation of the database continues automatically every time a new issue of a journal is published. So criticism of me is fair game, criticism of Nick and his work isn’t (and there hasn’t been any).
It is clear that the concepts of Open Data and Open Notebook Science are now of great interest and presumably of great value. I believe that CrystalEye fulfils essentially all the characteristics of ONS – it’s Open and it’s immediate. There is nothing hidden – everyone has access to the same material. Anyone can do research with it. So what are the issues?
People want to re-use the material for their own work and include it in their own products. They are allowed to do so. We currently allow spidering of the site so they can ease the process of getting the data. We provide a set of RSS feeds which allows systematic access to all of the entries (we can explain RSS in another post if required). Here’s a snapshot showing how to get them.
We accept that using RSS to distribute data is new, but we think it’s worth pursuing and it’s a valid area of research (which is our primary remit).
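As a rough illustration of what "systematic access" through a feed can look like, here is a minimal sketch in Python. It assumes a plain RSS 2.0 layout (`item` elements with `title` and `link` children); the sample feed, its titles and its example.org URLs are invented for illustration and are not real CrystalEye entries, and the actual feeds may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: plain RSS 2.0 structure, with invented titles
# and URLs. A real feed would be downloaded rather than embedded like this.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example journal feed</title>
    <item>
      <title>Entry A</title>
      <link>http://example.org/entries/a.cml</link>
    </item>
    <item>
      <title>Entry B</title>
      <link>http://example.org/entries/b.cml</link>
    </item>
  </channel>
</rss>"""


def list_entries(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        entries.append((title, link))
    return entries


for title, link in list_entries(SAMPLE_FEED):
    print(title, link)
```

A consumer would typically poll the feed, remember which links it has already seen, and fetch only the new ones, which is gentler on the server than spidering the whole site.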
There is a strong expectation that we should put in effort to customize CrystalEye for subsections of the community. Before answering the questions in detail it may be useful to use an analogy with Open Source software. In OSS the primary concerns are about rights of access, ownership, redistribution. This is often referred to as “the right to fork”. The act of labelling software “Open Source” and making it available to the public is sufficient. Nothing more is required.
Much Open Source software is difficult to use. For example, until about 3 years ago the Open Source chemistry program Open Babel required compiling. This meant that Windows users who did not have a C++ compiler couldn’t use it. This persisted for some time – Henry Rzepa and I felt sufficiently disadvantaged that we compiled OpenBabel (it wasn’t easy) for a series of platforms and mounted them for the community. But there was not – and still isn’t – any expectation that the project has a moral responsibility to provide an executable version for every platform. In practice, because it has generated a community, there is now voluntary effort to provide it. But it is not a condition or expectation for Open Source. Nor is there any expectation that a project should provide personalized installation on demand, personalized training, advice, bugfixes, documentation, etc. Nor does “right to fork” mean that the community will not disapprove of forks, and it is generally accepted that careful discussion is required before forking – it should not be seen as a confrontational act. I note, of course, that providing good support for an Open Source project is likely to enhance its adoption. But most OS projects in chemistry have gone through a period of wilderness where the documentation, installation, support, etc. were missing. There were certainly times when Jmol and OpenBabel had the feel of neglected gardens. [The same is true for CML documentation. We are addressing these concerns.]
I believe these are useful starting points for discussing Open Data.
Now, to address the comments. Remember that CrystalEye is a crystallographic knowledgebase. Extracting chemistry from it is a research activity, not an automatic one.
[AD = Andrew Dalke, CSM = ChemSpiderman]
AD: So 90% of the data is easily exportable to an SD file? Wouldn’t this subset be useful, and easily generated?
CS: You’ve said that converting from CML to an SD file causes “corruption”. Is Open Babel good enough? If so, why not use it yourself?
PMR: From the discussion it appears that a subset would be useful. It is not clear what that subset is (organic? entries? moieties?). We are not using SD files ourselves because they are not sufficiently powerful for the research we are doing. It is not easy to determine which components of CrystalEye can be automatically converted to SD and which cannot. For example, which organometallics convert well to SD? Can they be roundtripped? To create an SD file we would need to (a) install an SD generator in our workflow and (b) either create a snapshot of CrystalEye (which we hadn’t planned on doing but which seems to have moved up the priorities) or regularly write SD updates (perhaps for each entry). It will not be easy to get a publication out of this, I suspect. However
AD: InChIs handle moieties? Ahh, “moiety” here appears to be CIF derived terminology to mean “discrete bonded residue or ion”, and does not include fragment.
PMR: Yes. Again, please remember that CrystalEye is a crystallographic knowledgebase. See the definition on the IUCr site:

_name                      '_chemical_formula_moiety'
_category                    chemical_formula
_type                        char
loop_ _example              'C7 H4 Cl Hg N O3 S'
'C12 H17 N4 O S 1+, C6 H2 N3 O7 1-'
'C12 H16 N2 O6, 5(H2 O1)'
"(Cd 2+)3, (C6 N6 Cr 3-)2, 2(H2 O)"
;              Formula with each discrete bonded residue or ion shown as a
separate moiety. See the _chemical_formula_[] category
description for rules for writing chemical formulae. In addition
to the general formulae requirements, the following rules apply:
(1) Moieties are separated by commas ','.
(2) The order of elements within a moiety follows general rule
(5) in the _chemical_formula_[] category description.
(3) Parentheses are not used within moieties but may surround
a moiety. Parentheses may not be nested.
(4) Charges should be placed at the end of the moiety. The
charge '+' or '-' may be preceded by a numerical multiplier
and should be separated from the last (element symbol +
count) by a space. Pre- or post-multipliers may be used for
individual moieties.
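
To make the rules above concrete, here is a minimal sketch in Python of how a `_chemical_formula_moiety` value might be split into moieties. It is an illustration under stated assumptions, not the parser CrystalEye uses: the function names are hypothetical, element counts are assumed to be whole numbers, and a moiety is assumed to carry either a pre-multiplier or a post-multiplier but not both.

```python
import re

# Optional pre-multiplier, a parenthesised body, optional post-multiplier.
# Parentheses may not be nested, so the body excludes '(' and ')'.
WRAPPED_RE = re.compile(r"^(\d+)?\s*\(([^()]*)\)\s*(\d+)?$")
# A charge token: optional numerical multiplier followed by '+' or '-'.
CHARGE_RE = re.compile(r"^(\d*)([+-])$")
# An element token: symbol plus optional whole-number count (assumption).
ELEMENT_RE = re.compile(r"^([A-Z][a-z]?)(\d*)$")


def parse_moiety(text):
    """Parse one moiety into its multiplier, element counts and charge."""
    text = text.strip()
    count = 1
    wrapped = WRAPPED_RE.match(text)
    if wrapped:
        pre, body, post = wrapped.groups()
        count = int(pre or post or 1)
    else:
        body = text
    elements = {}
    charge = 0
    for token in body.split():
        charge_match = CHARGE_RE.match(token)
        if charge_match:
            magnitude = int(charge_match.group(1) or 1)
            charge = magnitude if charge_match.group(2) == "+" else -magnitude
            continue
        element_match = ELEMENT_RE.match(token)
        if not element_match:
            raise ValueError("unrecognised token: %r" % token)
        symbol, number = element_match.groups()
        elements[symbol] = elements.get(symbol, 0) + int(number or 1)
    return {"count": count, "elements": elements, "charge": charge}


def parse_formula_moiety(value):
    """Split on commas (rule 1) and parse each moiety in turn."""
    return [parse_moiety(part) for part in value.split(",")]


for moiety in parse_formula_moiety("(Cd 2+)3, (C6 N6 Cr 3-)2, 2(H2 O)"):
    print(moiety)
```

Note that this only recovers the formula of each discrete residue or ion. It says nothing about connectivity, which is why extracting chemistry from the crystallography remains a research activity.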

AD: Given that you have five high-value views, which are your priorities?
PMR: That depends to some extent on the value of them to the community and how it fits into our research program. (There may be more than five priorities). Given that we are (not sure if it’s public so no details) a partner in a crystallographic knowledge program, crystallographic aspects probably head the list. This includes (a) helping crystallographic service providers to appreciate the value of their data (b) getting it into a repository (c) adding crystallographic tools such as searching for formulae and cell dimensions. We also expect to be a partner in other chemistry repository programs and in eTheses so these also have high priority – research we are funded to do and from which papers will emerge. That is not to say that there is not great value in making the chemistry available and that’s why we make the RDF and CML available.
AD: Generating the InChIs will have effectively identical problems as generating an SD file, as will generating fragments.
PMR: No. The InChIs are already generated (you can see them on nearly every page except the inorganics) as are the CML files. Adding SD generation into the workflow will take a non-zero amount of effort.
CSM: [Variation in formula for NO]. Following your comments and with a couple of minutes of curation by yours truly you will now see this result only: That’s the power of feedback and creating a community of curators.
PMR: Certainly. In the same way we are happy to accept comments on chemical errors in what CrystalEye has produced.
CSM: Back to the issue of trying to link back to CrystalEye from ChemSpider. I get that you don’t want to share the structure files by SDF.
PMR: That’s probably fair, unless someone does work on round-tripping them. If that is done then we might be more interested. But it’s really a question of resource. We haven’t got resource to create SD files at present and it doesn’t bring us publications. It doesn’t give us much obvious added value.
CSM: So, send us the CML files and IDs instead. Of course then I think we will go back to the forking off of the data. Considering that CrystalEye is Open Data it’s incredibly difficult to get access to.
PMR: Open Data does not require that the data is easy to access, just that it’s accessible. In fact CrystalEye is incredibly easy to get access to. It’s on the web, and you can get any entry you want through the table of contents of journals in just 3 clicks. That’s a lot quicker than many other bibliographic services. Similarly you can do a SMILES/SMARTS search for any substructure and return any number of hits.
What you actually mean is that CrystalEye is not in the form that is most convenient for your business purposes. There is an implication here that we have a moral duty to customize the data for you (and by implication for anyone else). I think it would be a pity to fork CrystalEye aggressively. Note, in any case, that you will need to honour the OKFN conditions in the definition:

5. Attribution

The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work. If this condition is imposed it must not be onerous. For example if attribution is required a list of those requiring attribution should accompany the work.

6. Integrity

The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.

PMR: We had not expected a fork at this stage of the project, but we shall work on making the conditions clear. As a minimum it will require the inclusion of the CrystalEye FAQ with the distribution. We would also require a specified amount of metadata to accompany each entry.
