Open Data in Science

I have been invited to write an article for Elsevier’s Serials Review and mentioned it in an earlier post (Open Data: Datument submitted to Elsevier’s Serials Review). I had hoped to post the manuscript immediately afterward but (a) our DSpace crashed and (b) Nature Precedings doesn’t accept HTML. DSpace is now up again and you can see the article. This post is about the content, not the technology.
[NOTE: The document was created as a fully hyperlinked datument, but DSpace cannot handle hyperlinks and it numbers each of the components as a completely separate object with an unpredictable address. So none of the images show up. It’s probably not a complete disaster, but you lose any force of the datument concept (available here as zip), which contains an interactive molecule (Jmol).]
The abstract:

Open Data (OD) is an emerging term in the process of defining how scientific data may be published and re-used without price or permission barriers. Scientists generally see published data as belonging to the scientific community, but many publishers claim copyright over data and will not allow its re-use without permission. This is a major impediment to the progress of scholarship in the digital age. This article reviews the need for Open Data, shows examples of why Open Data are valuable and summarises some early initiatives in formalising the right of access to and re-use of scientific data.

PMR: The article tries not to be too polemical and to review the area of Open Data (in scientific scholarship) objectively, in the style that I have used for Wikipedia. The next section shows Open Data in action, both on individual articles and when aggregating large numbers (> 100,000) of articles. Although the illustrations are from chemistry and crystallography the message should transcend the details. Finally I try to review the various initiatives that have happened very recently and I would welcome comments and corrections. I think I understand the issues raised in the last month but they will take time to sink in.
So, for example, in the last section I describe and pay tribute to the Open Knowledge Foundation, Talis and colleagues, and Science/Creative Commons. I will blog this later but there is now a formal apparatus for managing Open Data (unlike Open Access where the lack of this causes serious problems for science data). In summary, we now have:

  • Community Norms (“this is how the community expects A and B and C to behave – the norms have no legal force but if you don’t work with them you might be ostracized, get no grants, etc.”).
  • Protocols. These are high-level declarations which allow licences to be constructed. Both Science Commons and The Open Knowledge Foundation have such instruments. They describe the principles which conformant licences must honour. I use the term meta-licence (analogous to XML, a meta-markup language for creating markup languages).
  • Licences. These include PDDL and CC0 which conform to the protocol.

Throughout the article I stress the need for licences, and draw much analogy from the Open/Free Source communities which have meta-licences and then lists of conformant licences. I think the licence approach will be successful and will be rapidly adopted.
The relationship between Open Access and Open Data will require detailed work – they are distinct and can exist together or independently.  In conclusion I write:

Open Data in science is now recognised as a critically important area which needs much careful and coordinated work if it is to develop successfully. Much of this requires advocacy and it is likely that when scientists are made aware of the value of labeling their work the movement will grow rapidly. Besides the licences and buttons there are other tools which can make it easier to create Open Data (for example modifying software so that it can mark the work and also to add hash codes to protect the digital integrity).
Creative Commons is well known outside Open Access and has a large following. Outside of software, it is seen by many as the default way of protecting their work while making it available in the way they wish. CC has the resources, the community respect and the commitment to continue to develop appropriate tools and strategies.
But there is much more that needs to be done. Full Open Access is the simplest solution but if we have to coexist with closed full-text the problem of embedded data must be addressed, by recognising the right to extract and index data. And in any case conventional publication discourages the full publication of the scientific record. The adoption of Open Notebook Science in parallel with the formal publications of the work can do much to liberate the data. Although data quality and formats are not strictly part of Open Data, their adoption will have marked improvements. The general realisation of the value of reuse will create strong pressure for more and better data. If publishers do not gladly accept this challenge, then scientists will rapidly find other ways of publishing data, probably through institutional, departmental, national or international subject repositories. In any case the community will rapidly move to Open Data and publishers resisting this will be seen as a problem to be circumvented.
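
The conclusion mentions software that can mark a work and add hash codes to protect its digital integrity. As a rough illustration only, the following Python sketch attaches a licence label and a SHA-256 digest to a block of data and later checks that the data is unchanged. The function names `label_dataset` and `verify_dataset` and the JSON label layout are assumptions made for this example; they are not taken from the article or from any existing Open Data tool.

```python
import hashlib
import json


def label_dataset(data: bytes, licence: str) -> str:
    """Return a JSON label recording the licence and a SHA-256 digest of the data.

    The label layout is hypothetical, chosen only for this illustration.
    """
    record = {
        "licence": licence,                         # e.g. "PDDL" or "CC0"
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the exact bytes
        "bytes": len(data),
    }
    return json.dumps(record, sort_keys=True)


def verify_dataset(data: bytes, label: str) -> bool:
    """Check that the data still matches the digest stored in its label."""
    record = json.loads(label)
    return hashlib.sha256(data).hexdigest() == record["sha256"]


if __name__ == "__main__":
    peaks = b"138.4 136.7 136.1 128.3 127.6 127.5"
    label = label_dataset(peaks, "PDDL")
    print(label)
    print(verify_dataset(peaks, label))         # unchanged data
    print(verify_dataset(peaks + b" ", label))  # altered data
```

A digest of this kind only detects accidental or deliberate changes to the bytes; it says nothing about who applied the label, which would need a signature on top.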

13 Responses to Open Data in Science

  1. In your paper, the topics ‘NMR-data’ and ‘OSCAR-3’ are mentioned again. Therefore my question again (and please be so kind as to answer EXACTLY THIS question !), which can be answered by ticking the corresponding boxes below:
    Question #1: How many fully assigned C-NMR spectra have been AUTOMATICALLY extracted from the chemical literature using OSCAR-3 during the last 24 months ?
    O: Zero
    O: 1-10
    O: 11-100
    O: More than 100
    Please tick the appropriate box – keep in mind I am asking for fully assigned spectra, because the assignment is necessary for using the data for subsequent predictions ( this is what you have called ‘robot referee’ in one of your posts ! )
    If there is at least ONE, SINGLE FULLY ASSIGNED C-NMR automatically generated by OSCAR-3, please let us all know the URL, where to download it.
    Question #2:
    How many fully assigned C-NMR spectra have been integrated into NMRShiftDB using OSCAR-3 during the past 6 months ?
    O: Zero
    O: At least 1
    Please keep in mind that NMRShiftDB increased by only 8 structures since Nov 18th, 2007. These 8 entries have been entered by ‘hko’ MANUALLY, to the best of my knowledge.

  2. Wolfgang, I have refrained from doing a one-a-day extracting (see [1] for the methodology), because of the legal uncertainty. Until publishers can tell me it is OK, I won’t risk court orders. Though I think I submitted one structure in this way, just for testing purposes.
    Wolfgang, how do you address this problem?
    [1] http://chem-bla-ics.blogspot.com/2006/09/chemical-archeology-oscar3-to.html

  3. ad (2):
    I was asking for A FULLY ASSIGNED dataset – as we use it in systems like CSEARCH and NMRShiftDB. Such a dataset consists at minimum of a structure having N carbons, a peaklist having N shift values and their assignment to the carbons ( including exchangeable assigned lines, which are usually marked by letters or other symbols ). Just to highlight a line between the strings ‘C13’ and the last ‘.’ is not a fully assigned dataset for further use in prediction systems. As it is clearly stated here in this blog (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=835), OSCAR-3 can
    –snip / example / see link above —
    13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), 128.3, 127.6, 127.5 (Ar‑ortho-C, Ar-meta-C, Ar-para-C), 87.2 (C-3), 80.1 (C-4), 72.1 (OCH2Ph), 69.7 (CH2OBn), 58.0 (C-5), 26.7 (C-6), 20.9 ((CH3)AC-6), 17.9 ((CH3)BC-6), 11.3 (CH3C‑2), 0.5 (Si(CH3)3).
    (the “d” is a delta but I think everything has been faithfully copied from the Word document). Note that OSCAR can:
    * understand that this is a 13C spectrum
    * extract the frequency
    * identify the peak values (shifts) and identify the comments
    —snip / end—–
    understand that this is a C-NMR, extract the frequency, identify the shifts and the comments – the overall functionality described here corresponds to a combination of ‘grep’ and ‘cut’ commands, but is far away from an ASSIGNED C-NMR DATASET. I agree, it’s helpful, but it’s less than 5% of the overall data-generation process ……
    —-snip / from above —-
    Though I think I submitted one structure in this way, just for testing purposes
    —-snip/end——
    1) Do you think you submitted a structure this way *OR* are you absolutely sure you submitted a structure this way ?
    2) What is your understanding of the term ‘structure’ in this particular context ? Just parsing the text, finding e.g. the name ‘androstane’ and looking it up in a collection of structures and putting the picture/connection table on the screen ? That’s very helpful, but only a part of the story.
    3) Are you sure THE ASSIGNMENT has also been automatically included in this dataset you submitted ?
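
To make the ‘grep’ and ‘cut’ level of extraction discussed in this comment concrete, here is a minimal Python sketch that reads a peak list in the style of the quoted example and pulls out the nucleus, the frequency, the shifts and the bracketed comments. It is not OSCAR-3’s code and makes no claim about how OSCAR-3 works; the function name `parse_peak_list` and the `SAMPLE` constant are hypothetical, and the parser assumes this one layout (header, then comma-separated shifts with optional comments in parentheses).

```python
import re

# Peak list in the style quoted above (plain ASCII hyphens used here).
SAMPLE = (
    "13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), "
    "128.3, 127.6, 127.5 (Ar-ortho-C, Ar-meta-C, Ar-para-C), 87.2 (C-3), "
    "80.1 (C-4), 72.1 (OCH2Ph), 69.7 (CH2OBn), 58.0 (C-5), 26.7 (C-6), "
    "20.9 ((CH3)AC-6), 17.9 ((CH3)BC-6), 11.3 (CH3C-2), 0.5 (Si(CH3)3)."
)

HEADER_RE = re.compile(
    r"\s*(\d+)([A-Z][a-z]?)\s*\((\d+(?:\.\d+)?)\s*MHz\)\s*(?:d|δ)?\s*"
)
SHIFT_RE = re.compile(r"\s*,?\s*(-?\d+(?:\.\d+)?)\s*")


def parse_peak_list(text):
    """Return (nucleus, frequency_mhz, peaks); peaks is a list of (shift, comment)."""
    header = HEADER_RE.match(text)
    if header is None:
        raise ValueError("no nucleus/frequency header found")
    nucleus = header.group(1) + header.group(2)
    frequency = float(header.group(3))

    body = text[header.end():].strip()
    if body.endswith("."):
        body = body[:-1].rstrip()

    peaks = []
    pos = 0
    while pos < len(body):
        match = SHIFT_RE.match(body, pos)
        if match is None:
            raise ValueError(f"expected a shift value at position {pos}")
        shift = float(match.group(1))
        pos = match.end()
        comment = None
        if pos < len(body) and body[pos] == "(":
            # Comments may contain nested brackets, e.g. "(Si(CH3)3)",
            # so count depth instead of relying on a regular expression.
            start, depth = pos, 0
            while pos < len(body):
                if body[pos] == "(":
                    depth += 1
                elif body[pos] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                pos += 1
            if depth != 0:
                raise ValueError("unbalanced brackets in comment")
            comment = body[start + 1:pos]
            pos += 1  # step over the closing bracket
        peaks.append((shift, comment))
    return nucleus, frequency, peaks


if __name__ == "__main__":
    nucleus, frequency, peaks = parse_peak_list(SAMPLE)
    print(nucleus, frequency, len(peaks))
    for shift, comment in peaks:
        print(shift, comment)
```

Note what the sketch does not do: the comments come back as uninterpreted strings, three of the shifts carry no comment of their own, and nothing links any shift to an atom in a connection table. That gap is the difference between an extracted peak list and a fully assigned dataset.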

  4. hko says:

    ad (#2) and (#3)
    What I expect from a utility program like OSCAR-3 to help me check or do NMR assignments:
    – extract structure from paper
    – extract atom numbers for structure
    – extract shifts from shiftlist
    – combine shifts and corresponding atom numbers
    – check additional information in shiftlist, like carbon multiplicity and/or number of carbons
    – draw structure with shifts attached to carbon
    – compare shifts for structure after performing shift calculation (using NN, HOSE or increments)
    If these points are completely fulfilled then OSCAR will be a useful tool for NMR people. Otherwise …
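
The “combine shifts and corresponding atom numbers” step, together with the earlier definition of a fully assigned dataset (N carbons, N shifts, and an assignment between them), can be expressed as a simple completeness check. The sketch below is an assumption-laden illustration: the function name `assignment_problems`, the use of text labels such as "C-1" for atoms, and the rule of exactly one shift per carbon are choices made for this example, not the data model of CSEARCH, NMRShiftDB or OSCAR-3.

```python
def assignment_problems(carbon_labels, peaks):
    """List the reasons a dataset is not fully assigned (empty list means complete).

    carbon_labels: labels of the carbons in the structure, e.g. ["C-1", "C-2"].
    peaks: list of (shift_ppm, atom_label) pairs; atom_label is None when the
           shift was extracted without an assignment.
    """
    problems = []
    known = set(carbon_labels)
    assigned = {}
    for shift, label in peaks:
        if label is None:
            problems.append(f"shift {shift} has no atom assignment")
        elif label not in known:
            problems.append(f"shift {shift} refers to unknown atom {label!r}")
        elif label in assigned:
            problems.append(
                f"atom {label!r} is assigned twice ({assigned[label]} and {shift})"
            )
        else:
            assigned[label] = shift
    for label in carbon_labels:
        if label not in assigned:
            problems.append(f"atom {label!r} has no shift")
    return problems


if __name__ == "__main__":
    carbons = ["C-1", "C-2", "C-3"]
    complete = [(136.1, "C-1"), (136.7, "C-2"), (87.2, "C-3")]
    partial = [(136.1, "C-1"), (128.3, None)]
    print(assignment_problems(carbons, complete))
    for problem in assignment_problems(carbons, partial):
        print(problem)
```

A real check would also need to handle equivalent carbons that share one line and exchangeable assignments, which this sketch deliberately ignores.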

  5. pm286 says:

    (1-4) I am delighted to see the interest in extracting data from chemical fulltext. The methods are steadily developing and there has been important progress during 2007 and we expect more in 2008. At present the work is research rather than production, but we have never suggested otherwise. At present we do not have metrics for success as we do not have an Open Gold Standard against which we can develop metrics. We intend to continue to build these up – progress in information extraction tends to be steady rather than dramatic.
    The major barrier to metrics is having an Open corpus of text to work with. Since we have not yet had offers from the mainstream publishing community to donate fulltext for indexing we are working with the growing body of Open Access, Open Supplemental info (difficult, but tractable in some cases) and theses (which by community norms are fully Open).
    For the details we always have a balance between precision and recall. In response to hko…
    – extract structure from paper
    This depends very much on how the structure was reported. If the names can be interpreted or looked up then OSCAR does a good job. We are developing image-recognition software which is improving though far from perfect.
    – extract atom numbers for structure
    This ranges from possible to impossible. It depends on a number of heuristics, ranging from numbering algorithms, extraction from images and retrofitting from partial spectral assignment. Remember that in many papers it isn’t possible for humans to work out all the numbering anyway.
    – extract shifts from shiftlist
    If the list is reasonably prepared recall and precision can be very high
    – combine shifts and corresponding atom numbers
    Again, this will depend on heuristics – a combination of textual annotation of shifts and partial machine assignment.
    – check additional information in shiftlist
    This will depend on building a vocabulary of syntax and terminology. There may be some quick wins.

  6. This is an interesting discussion…two specific questions of interest to me..
    1) You comment “We are developing image-recognition software which is improving though far from perfect”. OSRA as you know is Open Source (http://www.chemspider.com/blog/converting-images-of-chemical-structures-to-real-structures.html). Are you modifying that or creating your own code from scratch? You seemed interested when it was released so I assume you are using the code and not reinventing the wheel? (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=457)
    2) I’m in the middle of curating all chemical structures on Wikipedia. I spent a couple of hours discussing it with Martin Walker last night. The process involves a lot of manual work…I’m at over 150 hours right now. There are issues with chemical names not matching the structure diagrams (people can use nomenclature very poorly!) so this will be an ongoing issue for ANYBODY using name to structure conversion. However, there are many names agreeing with the chemical structure. Have you thought about applying OSCAR to Wikipedia to generate a real structure file? You can then add that into the WWMM and hook up to Wikipedia. If you wait a while I’ll have one done and will hopefully be able to get Wikipedia to accept InChIKeys on the structures directly and therefore make Wikipedia searchable by InChIKey. I’ll blog about this soon but have other deadlines in the way at present. I have just co-authored a book chapter on name to structure conversion and talked about OSCAR-3 but couldn’t comment too much on capabilities. I can add it in proofing. Here are 10 names of structures on Wikipedia…they are correct for the structures. You commented “If the names can be interpreted or looked up then OSCAR does a good job.” How well does OSCAR do on this set of 10? If you want to post the InChI strings I’ll check the structures and let you know…
    (-)-2-Carbomethoxy-3-(4-fluorophenyl)tropane
    20-hydroxyecdysone
    3-Quinuclidinyl benzilate
    4-Methylbenzylidene camphor
    Adenosine thiamine triphosphate
    3-(4-aminophenyl)-3-ethylpiperidine-2,6-dione
    2-[(4-{[(2,4-diaminopteridin-6-yl)methyl]amino}benzoyl)amino]pentanedioic acid
    (2-butyl-1-benzofuran-3-yl){4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl}methanone
    3-ethyl 5-methyl 2-[(2-aminoethoxy)methyl]-4-(2-chlorophenyl)-6-methyl-1,4-dihydropyridine-3,5-dicarboxylate

  7. Ad 4/5:
    Thanks for the answers ! My conclusion now is – please correct me, if I am wrong:
    a) The project is on its way
    b) Structure recognition from an image is under development, trivial names like ‘Fomittellic acid’ are not very helpful, because usually new compounds are investigated
    c) The numbering problem is extremely complex
    d) The assignment story is a tedious task, because of the large variety of formats and because of the numbering mentioned under c)
    e) There is not ONE, SINGLE FULLY ASSIGNED C-NMR dataset available, which has been generated automatically
    Egon: Congratulations that you already have a FULLY ASSIGNED CNMR dataset available, which has been extracted automatically by OSCAR-3. When the legal situation has been settled, I would like to invite you to Vienna to show us how to extract thousands of datasets per day in a fully automatic manner – any OSCAR-3 version released before January 7th, 2008 is welcome !
    HKO: I fully agree with your list of specifications, and I want to add:
    a) Solvent
    b) Measuring frequency
    c) Experimental technique(s) used for signal assignment
    d) Couplings (e.g. CF,CP,etc.)

  8. pm286 says:

    (1) Please do not post interactive forms in comments. WordPress is fragile and such forms are likely to cause problems in some browsers. In general if anyone wishes to carry out surveys it would be more appropriate to create the survey page elsewhere and post a link in the comments.
    Other than technical reasons (such as above) and spam I do not moderate comments.

  9. Pingback: Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Community involvement in information capture and extraction

  10. Wolfgang,
    OSCAR3, as in the Bioclipse plugin, does *not* extract the assignment. However, it does allow an automatic assignment using predicted shifts from the NMRShiftDB. I’m sure you are not happy about this, and neither would I use that option myself.
    I must have misunderstood what you were getting at in detail, but the plugin does extract the structure *and* the spectrum, and is able to assign them to belong together.
    But peak assignments in the literature are so free-format, at least between journals, that extracting that information with a tool like OSCAR3 is highly unlikely. (For certain natural product journals which use table formats, this might be feasible, though…)
    I am sure I used that tool to create a structure/C-NMR data set; I am not sure the set I submitted to the NMRShiftDB ended up in the data set that is online right now.
    (I hope nothing accidentally ends up in bold…)
    BTW, what is wrong with releases after January 7th?
    Anyway, I’m delighted to inform you that the OSCAR3 version I used does not jeopardize any proprietary C-NMR data set.
    But I do hope these tools might help speed up the *manual* process of extraction of NMR spectra from literature. That must be worth something.

  11. –snip–
    I must have misunderstood what you were getting at in detail, but the plugin does extract the structure *and* the spectrum, and is able to assign them to belong together.
    –snip–
    In order to avoid any misunderstanding:
    YOU said: …. the plugin does extract structure *and* spectrum …
    MY QUESTION: The structure is a ‘picture’ (as usual in publications) and NOT already a MOL/SD/CML – correct ? If it is already e.g. a MOLfile you don’t need OSCAR, you can use nearly every structure editor and attach the spectral information (in a more or less cumbersome way)
    The spectrum is a list of shifts (and their assignments) as usual in publications – as long as certain format-criteria are fulfilled, a combination of ‘cut’ and ‘grep’ is able to extract this information.
    –snip–
    But peak assignments in the literature are so free-format, at least between journals, that extracting that information with a tool like OSCAR3 is highly unlikely. (For certain natural product journals which use table formats, this might be feasible, though…)
    –snip–
    Then the literature assignment is missing ! That’s exactly the point ! Using a prediction program to get an assignment is possible ( for the algorithm see e.g. W. Robien, Chemical Monthly 1983 ! ), and I have furthermore investigated the error rate depending on the size of the database, assuming the literature assignment is always correct. The computer-assisted assignment can be fairly good and can be used as a starting point for *MANUAL* checking against the assignment given in the article. This can save approx. 30-40% of the total work when entering a dataset – BUT THAT IS DEFINITELY NOT THE FULLY AUTOMATIC PROCESS I was asking for.
    With NMRPredict you can also read MOLfiles (SDfiles) and different types of peak tables; the first step is again an automated assignment, which can be interactively corrected – which sounds quite similar to your approach, but I would never dare to claim that this is an automatic data-generation process. The user who wants to create a proprietary database is simply well supported by an efficient toolkit.
    Furthermore, the danger with automatic assignment is when coming across a badly represented class of compounds – doing the assignment, entering this dataset and doing the next assignment. If the first one is wrong, the second is wrong too, but showing better statistical parameters – I usually call this behaviour ‘applied error-propagation’ ( see e.g. the assignment of the Tosylgroup in the literature )
    From all these pieces of information I have collected from your posts, from PMR’s post and the publications, I have the impression that OSCAR is a valuable tool during the data-extraction process, which does a scan over every article and “highlights” certain pieces of information and is able to extract a few trivial things. It’s simply a valuable ‘eye-supporter’ when working on an article. The decisive step of the assignment itself can be either done manually or again supported by prediction programs (with all the concerns I have). We are far away from completely automatic generation of (C)-NMR datasets; the reasons for that are very complex, as we know.
    –snip–
    BTW, what is wrong with releases after January 7th?
    –snip–
    Hopefully nothing ! I just want to set a time mark in order to make things clear – maybe (hopefully) OSCAR-XX can perform the full functionality which has already been claimed.

  12. pm286 says:

    (1-11)
    A general point. Although I don’t moderate this blog, it would be useful to keep discussion related to the topic of each post. The main thrust of this article is about publication of data, licences, publishing practice and not how to assign NMR spectra. There are many other posts which relate to NMR and it would be more useful to have the discussion there because then the search engines can find it. As it is most of the people reading about Open Data will not be interested in NMR assignment.
    Best
    P.

  13. (12) OK
    You are invited to put NMR-specific issues on my blog http://csearch-nmr-data.blogspot.com/
    /WR
