petermr's blog

Chemical Informatics and China; and challenges of language

Posted on September 14, 2011 by pm286

I was honoured last week to be invited by Professor Xiaoxia Li to speak at the 14th Asian Chemical Congress in Bangkok and then to visit her group in Beijing. Unfortunately I couldn’t spend longer but the visit impressed me very much, both with the focus of the group and their high morale. The group is in the Institute of Process Engineering, Chinese Academy of Sciences in the Haidian part of Beijing (where there are many universities and scientific institutes). Here are Xiaoxia (left) and some of her group [photos on my phone, so variable quality].

Xiaoxia’s group has a lot in common with us. Several of the group are involved in indexing and information retrieval from the Deep Chemical Web (http://chemport.ipe.ac.cn/IPE-ChINGroup/group-publications-en.html ). Most of the web is actually inaccessible to search engines because the information is exposed only through a query interface: “Enter your search term:”. Often you have no idea what to enter, and so Bingle passes it by. Her group is developing heuristics and templates for exploring what is in databases and what information can be extracted. It’s very challenging.
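To make the idea concrete, here is a minimal sketch of one probing heuristic: start from a seed term, submit it, and harvest new query terms from whatever comes back. It is not the IPE group's method; the in-memory RECORDS list and the query_form function are hypothetical stand-ins for a real database hidden behind a search form.

```python
from collections import deque

# Hypothetical stand-in for a database hidden behind "Enter your search term:".
RECORDS = [
    "benzene solubility in water",
    "toluene boiling point",
    "benzene boiling point",
    "ethanol density",
]


def query_form(term):
    """Simulate submitting one search term; return the matching records."""
    return [r for r in RECORDS if term in r.split()]


def probe(seed_terms, max_queries=20):
    """Breadth-first probing: every result supplies new terms to try."""
    queue = deque(seed_terms)
    tried, found = set(), set()
    while queue and len(tried) < max_queries:
        term = queue.popleft()
        if term in tried:
            continue
        tried.add(term)
        for record in query_form(term):
            if record not in found:
                found.add(record)
                # Words in a newly seen record become candidate queries.
                queue.extend(w for w in record.split() if w not in tried)
    return found
```

Seeding with "benzene" reaches the toluene record through the shared words "boiling" and "point", but never finds the ethanol record, because nothing links to it. That is the hard part of the Deep Web: you only discover what your seed terms happen to connect to.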

We also talked about software for processing chemical names and natural language wholly or partially in Chinese. We tried an experiment with OPSIN (name to structure). Daniel Lowe has explored how Chinese (and other-language) representations of IUPAC names might be processed by OPSIN. He is moderately confident that the core of OPSIN is suitable, and that it is a question of preprocessing and vocabulary. Here is an example taken from http://zh.wikipedia.org/wiki/IUPAC%E6%9C%89%E6%9C%BA%E7%89%A9%E5%91%BD%E5%90%8D%E6%B3%95_%28A%E9%83%A8%29 – a translation of the IUPAC nomenclature rules into Chinese.

2,7,8-三甲基癸烷

When Daniel reads this from a file into OPSIN it interprets it as 2,7,8-trimethyldecane. We tried to reproduce this on a Chinese command line, but ran into encoding problems (encoding is one of the commonest problems). However, I am sure it is soluble.
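As a rough illustration of what “preprocessing and vocabulary” could mean here, the sketch below decodes the name explicitly as UTF-8 (rather than trusting the console's default encoding, which is where we came unstuck) and then swaps Chinese tokens for their English equivalents. This is not OPSIN's code: the three-entry vocabulary ZH_TO_EN and the function names are invented for this example, and a real system would need a far larger vocabulary and proper grammar handling.

```python
# Hypothetical mini-vocabulary for illustration; not OPSIN's resource files.
ZH_TO_EN = {
    "三": "tri",
    "甲基": "methyl",
    "癸烷": "decane",
}


def decode_name(raw_bytes):
    """Decode bytes explicitly as UTF-8 instead of relying on the locale default."""
    return raw_bytes.decode("utf-8")


def translate_name(zh_name):
    """Greedy longest-match replacement of known Chinese tokens.

    Characters that are not in the vocabulary (locants, commas, hyphens)
    are passed through unchanged.
    """
    keys = sorted(ZH_TO_EN, key=len, reverse=True)
    out = []
    i = 0
    while i < len(zh_name):
        for key in keys:
            if zh_name.startswith(key, i):
                out.append(ZH_TO_EN[key])
                i += len(key)
                break
        else:
            out.append(zh_name[i])
            i += 1
    return "".join(out)
```

For the Wikipedia example, translate_name("2,7,8-三甲基癸烷") gives "2,7,8-trimethyldecane", which is the English name one would then hand to a name-to-structure tool.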

I was also given a tour of the work in the Institute. There is a lot of exciting work on High-performance computing (using GPGPUs http://en.wikipedia.org/wiki/GPGPU ) and the institute has, I think, the 33rd most powerful machine in the world. Certainly the scale and ambition of investment in science was clear. Among the demonstrations I saw were the simulation of a fluidised bed reactor (flow, temperature) and also the molecular dynamics of a complete influenza H1N1 virus (neuraminidase, haemagglutinin, capsids, RNA – everything). We have come a long way since I worked in influenza 20 years ago.

I was also very very well entertained – driven everywhere – and shown many of the sights of Beijing. Here are two with colleagues from Xiaoxia’s group:

And (in the Forbidden City)

I was very well exercised by the end!

It was great to talk with a group with mutual interests – the discovery and re-use of information on the web. I gave them an overview of our work, our recent manuscripts, and left a fairly complete copy of OSCAR/OPSIN and related software. Some – at least – should be fairly easily adaptable.

And by happy chance “YY” (Prof Yong Zhang, left) was free and invited me to dinner with his group at Tsinghua University. YY spent 3 years in our group at Cambridge and was responsible for the early development of the World Wide Molecular Matrix (publication in press http://www.dspace.cam.ac.uk/handle/1810/238387 ).

Again I was very well looked after.

It is always great to find other groups who interact synergistically. Chemical informatics is not always glamorous, but it’s important and will increase in value as the barriers to information discovery start to disappear. Many many thanks.

Posted in Uncategorized | Leave a comment

14th Asian Chemical Congress: Why we need Open Data/Source/Standards

Posted on September 6, 2011 by pm286

I am talking tomorrow as an invited lecturer to the 14th Asian Chemical Congress in the Cheminformatics section (http://www.14acc.org/speakers.htm#s8). My message is that cheminformatics needs Open resources (as in http://www.opendefinition.org/ ). I am not arguing that everything should be Open, but that everything critical should be. To summarise:

Data should be open. Unless data are Open they cannot be:

Independently validated
Republished
Re-used for derivative works; this is where the innovation comes from
Used as reference sources

Source code should be open to the extent that:

It should be possible to recalculate a model, a set of properties, an analysis independently of closed systems
The algorithm used should be inspectable.

This does not prevent proprietary codes being used for speed, convenience, etc., but they should not be the only way of verifying the work.

Standards, including dictionaries. Where files are used to communicate data, the syntax must be agreed (e.g. OpenSMILES, http://www.opensmiles.org/ ) and the documentation openly visible. Where terms/metadata are used they must be defined and agreed by the community (e.g. http://www.xml-cml.org/dictionary/ ). Modern dictionaries should be semantic (i.e. understandable by machine).

Chemistry, and cheminformatics even more, has very little in any of these areas. InChI is one of the few exceptions. Openness is being driven by funders, regulators, some government agencies and (from the bottom-up) the Blue Obelisk (http://sourceforge.net/apps/mediawiki/blueobelisk/index.php?title=Main_Page ).

Without Open Data/Source/Standards, computational/data-driven science is not reproducible.

Many areas in science, especially bioscience, are driven by the vision of the Semantic Web and Linked Open Data (http://en.wikipedia.org/wiki/Linked_Data ) and its graph (http://en.wikipedia.org/wiki/File:Lod-datasets_2010-09-22_colored.png ). There is very little chemistry here, because very little is Open. Even KEGG will disappear because it’s becoming closed.

I am working with the European Bioinformatics Institute on ChEBI (http://en.wikipedia.org/wiki/ChEBI ) and hopefully also on ChEMBL and related data. The bioinformatics community need Open chemical data and they are prepared to work to make it happen. Maybe at some stage the chemical community will see the value of Open knowledgebases. Until then we will continue to generate collections of computational chemistry, crystallography, spectra, and other properties by using machines to extract or generate them.

Here’s some material I presented earlier to the ChEBI group (2011-06-01) …

Web-based science relies on Linked Open Data.

Topics

Vision: machines could publish and “read” current chemical data
Almost no chemical data is effectively published
There are technical and cultural problems
Current publishing models are asymmetric; the author and reader have few rights or influence
“Almost Open”, “Freely Accessible” is not good enough
Individuals and small groups can change the world
- Wikipedia
- OpenStreetMap
- Quixote – reclaiming computational chemistry (http://quixote.wikispot.org/Front_Page and http://quixote.ch.cam.ac.uk/content/ )
- Software as an agent of political change
- Bottom-up Web 2.0 (The Blue Obelisk (http://www.blueobelisk.org ) and Quixote)
- Text and data mining
- Automated computation and aggregation of data
- Near-zero cost of robots – crystalEye
- eTheses
- Panton Principles
- Open bibliography
Resources
“Open Data” on Wikipedia (http://en.wikipedia.org/wiki/Open_data )
“Open Data in Science” (Murray-Rust on Nature Precedings, http://precedings.nature.com/ )
Science Commons (http://www.sciencecommons.org )
Open Knowledge Foundation (http://www.okfn.org )

Recent Blogs
/pmr/2011/03/28/open-data-what-i-shall-say-at-acs
/pmr/2011/03/28/draft-panton-paper-on-textmining/
/pmr/2011/03/28/biomedcentral-use-open-data-buttons-in-their-publications

Some fallacies:
“You can have SOME of the data” (ACS make 8000 CAS numbers freely available to Wikipedia)
“The data are free for NON-COMMERCIAL use” (see my /pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/ )
“You can always ask permission and we’ll grant it”; PMR: doesn’t scale, doesn’t persist, can’t re-use

The key question: is the price of closed data worth it? Do the benefits outweigh the disadvantages? To help you:

issue	closed data	open data
sustainability	supported by income	few proven models
creation of business model	easyish	hard
added human value	often common	anything possible
support	usually good	depends on community
domain acceptability	well proven	often suspicious
cost	high; increasing?	marginal
innovation	central authority	fully open
reuse	normally NO	fully OPEN
speed from source	often slow	immediate
mashupability/LODD	very rare	almost universal
reaction to new tech.	often slow	very fast
comprehensiveness	very good to patchy	potentially v. high
global availability	often very poor	universal
acceptable to funders	variable; decreasing	very high

In the current talk I shall stress tools for data extraction and creation. In particular:

Open databases (Crystaleye1 (http://wwmm.ch.cam.ac.uk/crystaleye ), Crystaleye2 (http://crystaleye.ch.cam.ac.uk ) and the Crystallography Open Database). See also http://opencryst.wordpress.com
OSCAR and Chemical Tagger (http://chemicaltagger.ch.cam.ac.uk/ ) for text-mining (hopefully the Hargreaves report will stimulate text mining)

I hope to show demos of some/all of:

OPSIN (http://opsin.ch.cam.ac.uk )
Crystaleye1
Crystaleye2, Quixote (http://quixote.ch.cam.ac.uk/ )
Chemical Tagger (http://chemicaltagger.ch.cam.ac.uk/ )
OSCAR Data
Avogadro

And, if you are excited about creating Open Chemistry, here are some tools to help (/pmr/2011/09/04/open-crystallography-how-to-start-it-and-where-should-we-base-it/ ).


Open Crystallography: How to start it and where should we base it?

Posted on September 4, 2011 by pm286

#opencryst

At #iucr2011 Saulius Grazulis and I agreed to set up Open Crystallography. We’ve both been working in this area (he with the Crystallography Open Database community (http://www.crystallography.net/ ) and our group with Crystaleye (http://wwmm.ch.cam.ac.uk/crystaleye/ ), Chempound for crystals (http://crystaleye.ch.cam.ac.uk/ ), etc.). These complement each other very nicely and we are aiming to

make all published crystallography Open.

That’s very similar to the Blue Obelisk, with Open Data and Open Source. COD has a lot of Perl and we’ve a lot of Java. They’ve got 150K structures, we’ve got about 250K. Many overlap, of course. We campaign for more crystallographers to make their structures Open, and this will come. We have to create the social and technical framework where it’s easy and attractive.

Anyone can join us. Crystallography is fun and appeals to mathematicians, high-school students, artists, etc.

So we are starting with creating Open communications where we can share what we have got. That’s something we are well used to in the Open Knowledge Foundation. So here I am outlining what we plan to do and asking readers for help and guidance.

Blog. A blog is a really good way of getting messages out and indexed: reporting news, presenting interim results. And we are developing our own crystallography data blog for structures. What blog? Current suggestion: create a blog on wordpress.com. Comments, please!
Wiki. We need a wiki to host all the links, overviews, and pointers to our own material. Where? Current suggestion: use http://wikispot.org/ ; any other suggestions?
Mailing list. This is for day-to-day communication of technical issues, policy, debates, etc. Suggestion: use googlegroups. Any other ideas?
Realtime communication. This is for virtual meetings. For this we’ll use Skype and Etherpads.
Sourcecode. We use Bitbucket.org. Not sure what COD use.
Hosting data bases, repositories, etc. This is hard work and a major effort of both projects
Hashtag. I am suggesting #opencryst, which seems to be free.

We’ve used this mixture very successfully in several virtual projects. Please suggest other things we’ll need to do.

This is a model for Open Foo (so far Foo includes ScienceData, Bibliography, Government Data, and lots more). So if you are interested in developing Open anything, this project may be a useful learning ground.


#solo11; post Science Online thoughts

Posted on September 4, 2011 by pm286

I’ve been to all the SOLO meetings (Aug/Sept) and really enjoyed them. The earlier ones concentrated on blogging and it was a great way to meet other bloggers. They have gradually moved towards adding other types of online activity, with a strong sense of doing stuff. Last year I was pre-occupied as we had organized a month-long hackathon to search patents for chemistry using text-mining. (We are forbidden to do this for scientific papers). As with many of my barmy ideas we just managed it with minutes to go…

This time (http://www.scienceonlinelondon.org/ ) was more relaxed for me – I wasn’t presenting anything. There were thousands of tweets – #solo11 which will give you a good idea. So here are my highlights:

Michael Nielsen – theoretical physicist turned into Open Science evangelist and doing lots of lectures in Europe (http://michaelnielsen.org/blog/visiting-europe/ ). Try to go if you can. We had a long series of talks earlier this year in Utah. Michael had looked at the principles of collective action (http://en.wikipedia.org/wiki/The_Logic_of_Collective_Action ), ranging from how trades unions started (in small groups) to how Valencia managed its upstream river basin to avoid the “tragedy of the commons”. Michael’s takehome was that it is very difficult to change from a local optimum (I agree in general) and that it is slow. Anyone altering the status quo lays themselves open to freeloaders. This applies to reforming scholarly communications and creating Open science. I felt it was slightly pessimistic, and feel that funders and governments may be able to exert influence. But I have taken to heart that one should start smallish, aim for the achievable, do it well, and allow for time. (Which is, I hope, what my various schemes conform to).
Ivan Oransky (Retraction Watch – http://retractionwatch.wordpress.com/ ). RW is a great site. I’d dropped in before, but now I see it as mainstream. How can we make sure that the blogosphere’s work in challenging bad published science is respected and recorded? RW is a great start. What we need – and I have ideas – is a tool where referees can be alerted as to whether one or more authors “have previous”. I have just seen an example of a re-publication of bad science, where I ask “how can the referees possibly have let that through?” It’s not just wrong, but the blogosphere has trashed it – and the referees are oblivious. And many journals don’t seem to care.
MaryAnn Martone, Neuroscience Information Framework (NIF), Spinal Muscular Atrophy Foundation (SMA). SMA is a genetic disease of children – fatal over unpredictable timescales. The SMA is committed to funding research that is targeted to deliver real value to patients in an aggressive timescale. They are not interested in impact factors, but research that makes a difference now. She highlighted that at the moment scientists compete and occasionally collaborate. But we should be looking for cooperation and coordination. That’s common in Open Source software which is why it has such a strong resonance with Open Science. Effective achievement of measurable goals is more important than individual brilliance. And I think that charities are perhaps the major force that could make this happen. Less glory, more progress.
Kristi Holmes – VIVO. An interdisciplinary national network enabling collaboration and discovery among scientists across all disciplines.

And our own session on Open Research Reports (run by David Shotton, supported by Tanya Gray). This will create Open material for disease science and we’ll be doing this at a SWAT4LS hackathon in December in London. But that deserves its own post.

A fantastic meeting and a very broad and valuable delegate list. Thanks to everyone.

AJCann http://scienceoftheinvisible.blogspot.com/2011/09/solo11-day-1-rough-thoughts.html

AJCann http://scienceoftheinvisible.blogspot.com/2011/09/solo11-day2.html?utm_source=twitterfeed&utm_medium=twitter

Bjoern Brembs: http://bjoern.brembs.net/news.php?item.776.5&utm_source=twitterfeed&utm_medium=twitter


Reader Pays (a lot) to read “Sodium Hydride as Oxidant paper” (you don’t need to be a chemist)

Posted on September 3, 2011 by pm286

YOU DON’T HAVE TO KNOW ANY CHEMISTRY – THE POINT IS DIFFERENT. But you should have 70 USD ready.

I came across the following paper today:

Reductive and Transition-Metal-Free: Oxidation of Secondary Alcohols by Sodium Hydride

Xinbo Wang, Bo Zhang, and David Zhigang Wang

J. Am. Chem. Soc., 2011, 133 (13), p 5160

Publication Date (Web): July 21, 2009 (Addition/Correction)

DOI: 10.1021/ja904224y

Hmm… This is the same title as a paper published 2 years ago, which I thought had been withdrawn…

So some more searches:

Reductive and Transition-Metal-Free: Oxidation of Secondary Alcohols by Sodium Hydride

Xinbo Wang, Bo Zhang and David Zhigang Wang

J. Am. Chem. Soc., 2010, 132 (2), p 890

Publication Date (Web): December 23, 2009 (Addition/Correction)

DOI: 10.1021/ja910615z

But I thought it was 2009?

Ah yes…

Xinbo Wang, Bo Zhang and David Zhigang Wang

School of Chemical Biology and Biotechnology, Shenzhen Graduate School of Peking University, Shenzhen, China 518055

J. Am. Chem. Soc., Article ASAP

DOI: 10.1021/ja904224y

Publication Date (Web): July 21, 2009

Now the 2009 paper was severely criticized by the blogosphere, including repeated work by TotallySynthetic ( http://totallysynthetic.com/blog/?p=1903 ) showing that if you kept oxygen out the reaction didn’t happen. The critics were moderate in their use of science – given that the ACS promoted the use of the term “junk science” I suggest it applied here. (The reaction was flawed, the explanation was flawed). But that’s not the point here.

The point is “what is in the 2011 paper?”. Now, when I am away from the office I deliberately do not sign into Cambridge University. So if I want to read the 2011 paper I find:

Purchase This Content

Choose from the following options:

$35.00 for 48 hours of access

Well, it had better be good…

I got a friend to access it for me. I won’t say who in case they get done for stealing content. They let me read the whole content of the paper. I’m going to reproduce it here.

Hang on, that’s violating copyright. You CAN’T reproduce a whole paper.

OK, I’ll cut out the vowels. …

It reads:

“Th*s p*p*r h*s b**n w*thdr*wn f*r sc**nt*f*c r**s*ns”

I hope I haven’t broken fair use.

That’s the WHOLE paper. 35 USD for 2 days.

I haven’t read the 2010 paper. Perhaps someone can post it (but not, of course, the fulltext). Because I am going to spend my dollars in the pub.


Science Online: We can make blogs first-class citizens in scholarly publishing

Posted on September 3, 2011 by pm286

#solo11 (Science Online) is fantastic as ever and very different from last year. Masses of great people to meet.

Today (through Martin Fenner) we’re going to look at how to use blogs for science. I’ll probably be blogging quite a lot. Feed at: http://blogs.scienceonlinelondon.org/blog/2011

Here’s my first http://blogs.scienceonlinelondon.org/blog/2011/09/03/the-value-of-blogs/

Many thanks to Martin for setting up this blogging session at #solo11. I have been blogging for 6 years and have found it a very useful way of communicating ideas and getting feedback. I am now strongly convinced that blogs are a better technical platform for formal and informal communication of science. There is, for example, no technical reason why scholarly publications should not be on WordPress rather than some proprietary backroom system.

This post is short, to test Martin’s blog. Here are some advantages of blogs:

The authoring interface is natural (I use Word, others type directly, some use LaTeX)
There are many natural tools that come as standard (index, search, archive, chronology)
Blog feeds can be filtered, combined, repurposed. (Ever tried repurposing a PDF hamburger?)
You can subscribe to feeds and get immediate notification
There is a really easy way to comment. Some blogs get 100 comments in a day. It’s rapid. Feedback is great
Blogs can be hyperlinked so a subject can be discussed in many places
Specialist plugins can be built (e.g. for chemistry, scholarly publication)
… and I can keep going
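To show how little it takes to filter and combine feeds, here is a minimal sketch using only the Python standard library. The two sample feeds, their example.org links and the function names are made up for illustration; a real script would fetch the RSS from the blogs' feed URLs and would probably cope with Atom as well.

```python
import xml.etree.ElementTree as ET

# Two tiny, invented RSS 2.0 feeds standing in for real blog feeds.
SAMPLE_A = """<rss version="2.0"><channel><title>Blog A</title>
<item><title>Open Data in crystallography</title><link>http://example.org/a/1</link></item>
<item><title>Conference travel notes</title><link>http://example.org/a/2</link></item>
</channel></rss>"""

SAMPLE_B = """<rss version="2.0"><channel><title>Blog B</title>
<item><title>Open Data in crystallography</title><link>http://example.org/a/1</link></item>
<item><title>Text mining and open data</title><link>http://example.org/b/1</link></item>
</channel></rss>"""


def items(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


def combine_and_filter(feeds, keyword):
    """Merge several feeds, keep titles containing keyword, drop repeated links."""
    wanted = keyword.lower()
    seen = set()
    merged = []
    for feed_xml in feeds:
        for title, link in items(feed_xml):
            if wanted in title.lower() and link not in seen:
                seen.add(link)
                merged.append((title, link))
    return merged
```

Calling combine_and_filter([SAMPLE_A, SAMPLE_B], "open data") keeps the two posts about Open Data and drops the cross-posted duplicate. Try doing that with a folder of PDFs.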

and some disadvantages

you don’t get citations
you don’t get citations
you don’t get citations

That’s a human problem, not a technical one.


CCDC: Reasons why sourceCIF data must be Open

Posted on August 31, 2011 by pm286

This is the text of a letter I have sent to Dr Colin Groom, Director of the Cambridge Crystallographic Data Centre. For context, read my blog posts (/pmr/ ) over the last 4 days. “sourceCIFs” are raw data created as part of a crystallographic experiment by scientists (not in the CCDC) and required by community norms as part of the scholarly publication process. Some are published Openly, but others are sent by the author or publisher to CCDC in an exclusive process. CCDC then control the further distribution of these data, which are made available either in trivial amounts (less than 0.1% of the CCDC’s holding of sourceCIFs) or for a significant financial subscription (which many institutions cannot afford). I re-emphasize that I simply wish to make Open the collection of the author’s original data [i.e. not the file with a CCDC header and CCDC accession code].

The arguments in this letter may apply to other disciplines, wherever Open data are managed by an exclusive gatekeeper.

Colin,

At my presentation in MS89 at the IUCr meeting I presented my request to you for the raw verbatim sourceCIFs supplied to CCDC as part of the publication of scholarly articles (i.e. without any CCDC-added value). I was unaware of anyone from the CCDC being present, so you will have to accept my account. I stepped through the arguments on my blog, arguing that these data were part of the essential scholarly record, and that some publishers made their sourceCIFs available Openly. It is the sourceCIFs from publishers such as Wiley, Elsevier and Springer that are in question and for which I asked. There was no adverse comment on what I presented.

I showed your reply in the meeting and also posted it on my blog (/pmr/2011/08/28/iucr2011-reply-from-ccdc-on-restrictions-on-redistributing-cifs/ ). I summarised this as containing two main reasons why CCDC would not release the sourceCIFs:

“these arrangements were put in place to satisfy the demands of publishers“. I have asked the University of Cambridge for details of these arrangements, through a Freedom of Information request. The advantage of specifying FoI is that it contains explicit guidelines on the public release of contracts (http://www.ico.gov.uk/ ) and this gives you the power (and the duty) to make these contracts Open except in very special circumstances. I have also asked for the number of sourceCIFs involved. When I have a better idea of the facts from this request we will be able to judge whether any of the current publishers is acting as a block to making the sourceCIFs Open. Note that the FoI legislation requires a reply by Sept 26 latest and I am likely to make further comment at that stage.
“because the CDCC continues to rely on subscriptions to the CSD to fund its ongoing developments.” We discussed this in a conversation, and I think I can summarise your argument as: “if the sourceCIFs were open, the CCDC would lose a significant number of subscriptions” [in part because other resources based on the sourceCIFs such as Crystallography Open Database and Crystaleye could provide competition]. You argued that the CCDC was beneficial to the community (which I agree with) and that it could only continue to exist if it had a monopoly right to control the distribution of sourceCIFs (with which I profoundly disagree and now explain why).

There are ethical, moral, and political/legal reasons why basic published scientific data should be Open.

Moral, in that the authors of the data believe that by providing their experimental data they are providing it to the world community, whether scientific or not. If CCDC closes the sourceCIFs data, authors are deprived of their moral rights.
Ethical. The data in sourceCIFs are of value to the world community (for example many subscribers to CCDC are involved in medical research and need the data to help develop new drugs).
Political and legal. Many governments and funders are requiring that the fruits of their funding are made completely Open. If, for example, a scientist is funded by Wellcome Trust or NIH their research is expected to be Open. They publish their papers Openly according to guidelines. But for many of these papers the text is Open while the sourceCIFs are closed [they can only be obtained by request, only in small numbers, and cannot be redistributed].

More generally public opinion is strongly in favour of reform towards Openness. I give some examples, some of which will have quasi-legal compulsion.

On Monday 29th Aug George Monbiot published an article in the Guardian which very strongly criticized the current system of academic publishing (http://www.guardian.co.uk/commentisfree/2011/aug/29/academic-publishers-murdoch-socialist ). This has had widespread impact and very general support. I believe that the current CCDC practice of a monopoly control of academic research would fall under the same criticism. While it may not be the same scale, it is the same principle. CCDC lay themselves open to being judged in the court of public opinion and it will be difficult to show why they should not release sourceCIFs.
RCUK have now universally pushed for raw data to be openly available. In talking with NERC, I understand that their philosophy is that raw data should be Open and that value-adders should build on this and can create a competitive market based on the value-add, not a monopoly.
The value of text and data in bulk (e.g. for mining) has been highlighted by the Hargreaves report. Effectively Hargreaves is saying that copyright and other contractual restrictions are seriously harming science and that the UK should remove them. I wish the sourceCIFs to be used for data-mining in an open fashion, whereas at present the only data-mining is what CCDC permits and which has to be paid for. I have commented in /pmr/2011/08/30/open-crystallography-the-hargreaves-report-can-help-make-ccdc-data-open/ . Note that this contains suggestions from a third party as to how we should approach sourceCIFs, and I have done what I can to avoid confrontation. But the issue is public and I expect the community to make reference to Hargreaves.
The Information Commissioner’s Office (ICO) has taken strong action on scientists who refuse to share data (see Queen’s University Belfast and tree-ring data http://www.informath.org/apprise/a3900.htm which describes in detail how QUB fought and lost the right to keep data closed). The chairman of the UK parliament’s Science and Technology Committee stated that “data has to be made publicly available” and that “Any university or scientist that hasn’t got that message needs a total rethink of the way they do research”. I hope that CCDC do not take the same route as QUB as it is messy, ultimately pointless, and reduces standing in the community.

These are some of the recent examples of how public opinion, including government, is solidly behind releasing data. Some of these involved conflict and I am keen to avoid this. My hope is that, by the time I get a reply to my FoI request, CCDC decides that it can, after all, release the sourceCIFs as Open. If there are contractual problems with the publishers I am happy to help take them to higher authorities (as in the previous paragraph). While the publishers may not need to comply legally, I think moral pressure from government offices is likely to be effective.

I do not accept that the CCDC will suffer serious business loss. Encyclopedia Britannica has not been destroyed by Wikipedia but it is redesigning itself. Ordnance Survey has not been put out of business by OpenStreetMap. CCDC should not feel seriously challenged by Crystallography Open Database and Crystaleye. Indeed if it aligns itself with Open Crystallography it can benefit.

I am happy to advise CCDC on how to make sourceCIFs Open, e.g. by defining what is meant by Open and what needs to be done to make sure sourceCIFs are Open and can be re-distributed and re-used. There is also the issue of future sourceCIFs and I am happy to suggest processes for the future ingest of Open sourceCIFs. I stress that the Openness requirement is not negotiable – effectively it means there can be no imposition of conditions other than an Open licence (PDDL, CC0) [see the Panton Principles]. The decision must be rapid (i.e. within the 20 days of the FoI request). A promise to change in the future is, unfortunately, not acceptable.

I hope that CCDC will agree this is the right way to go, quickly.


Open Crystallography: The Hargreaves report can help make CCDC data Open

Posted on August 30, 2011 by pm286

I have had a really useful suggestion about how to make data deposited with the Cambridge Crystallographic Data Centre (CCDC) Open. The Hargreaves report has recommended that text- and data-mining of the scientific literature should be allowed and the government agrees [see below]. It is therefore likely that the data in CCDC fall under data-mining. Since a major user of the CCDC data is the pharmaceutical industry, it clearly falls under “medical”. I give Pete Carroll’s suggestion in full, and add my comments [emphasis is mine].

Pete Carroll says:

August 29, 2011 at 11:26 pm

I wonder if the Government response to the Hargreaves Review regarding data/text mining for research might be relevant?

“Nor does the Government regard it as appropriate for certain activities of public benefit
such as medical research obtained through text mining to be in effect subject to veto by the owners
of copyrights in the reports of such research, where access to the reports was obtained lawfully. We
recognise that some publishers view licensing of text mining as a legitimate commercial opportunity;
however we are not persuaded that restricting this transformative use of copyright material is
necessary or in the UK’s overall economic interest…

the Government agrees with the Review’s central thesis that the widest possible
exceptions to copyright within the existing EU framework are likely to be beneficial to the UK,…

PMR: The UK government can therefore be argued to be in favour of our obtaining Open data from the CCDC.

…subject to three important factors:

That the amount of harm to rights holders that would result in “fair compensation”
under EU law is minimal, and hence the amount of fair compensation provided
would be zero. This avoids market distortion and the need for a copyright levy
system, which the Government opposes on the basis that it is likely to have adverse
impacts on growth and inconsistent with its wider policy on tax.

PMR: The CCDC advanced two main arguments for non-release. One was economic (it would hurt their business), the other was that third parties had rights over the data. On the first I believe that harm is minimal, as the raw data have not had value added by CCDC and their income comes from added value and independent products.

• Adherence with EU law and international treaties.

• That unnecessary restrictions removed by copyright exceptions are not re-imposed by
other means, such as contractual terms, in such a way as to undermine the benefits of
the exception.

The Government will therefore bring forward proposals in autumn 2011 for a substantial
opening up of the UK’s copyright exceptions regime on this basis. This will include
proposals for a limited private copying exception; to widen the exception for noncommercial
research, which should also cover both text- and data-mining to the extent permissible under EU law..”

Source: http://www.ipo.gov.uk/ipresponse-full.pdf

The parliamentary select committee for the dept of Business Innovation & Skills is holding an inquiry on the Hargreaves Review and the Government’s response to the review. Closing date for submissions 5th September. See:
http://www.parliament.uk/business/committees/committees-a-z/commons-select/business-innovation-and-skills/inquiries/hargreaves-review-of-intellectual-property/

I know time is short but it could be worth yourself or someone from the research community bringing this problem of “closed CIFs” to their attention as exemplary evidence of problems with access to data.

PS good luck with your FOI request. You might find

http://www.ico.gov.uk/for_organisations/guidance_index/freedom_of_information_and_environmental_information.aspx

useful to you if they try a S41 or S43 exemption over release of the contracts.

Thanks. I am hoping that they will not “try any exemptions”. They are part of our wider community and it is my hope that they will see the positive value of opening their data and that this will raise their public esteem. I do not want a battle – I would much rather see genuine reorientation of approach. But the solutions must be Open and they must be rapid. If there are problems I shall certainly approach the ICO (Information Commissioner’s Office), who have been very unsympathetic to scientists hanging on to data which should be in the public domain. I think that in practice any prolonged refusal to provide Open data will be tried in the court of public opinion both within the scientific community and beyond. But I hope there will not be a “Crystalgate”.

If they take heed of this and wish to make their data Open, then the only barriers will be from contracts imposed by third parties. Until they provide this information we do not know whether it is a problem. If it is, it may be that Hargreaves is a useful weapon.

#IUCR2011; FOI request for details of CIFs deposited by publishers with CCDC

Posted on August 29, 2011 by pm286

In a previous post I showed a letter from Colin Groom of the CCDC indicating reasons why the CCDC could not make its deposited crystallographic data (source CIFs) Open:

/pmr/2011/08/28/iucr2011-reply-from-ccdc-on-restrictions-on-redistributing-cifs/

This is disappointing because otherwise we would be able to declare that almost all public crystallographic data was Open – as it is, a substantial number (many thousands) of data files are closed.

One of the reasons given by Colin was that the publishers had added conditions and restraints. I do not know the details of these, so I have asked formally for details of the numbers of source CIFs and of the contracts which might restrain their distribution. I have used the UK Freedom of Information Act since CCDC is part of the University of Cambridge. (In fact all requests for information are ipso facto under FOI, so the additional formality helps to make sure the request is processed appropriately).

I have used the excellent http://www.whatdotheyknow.com to send the request. This helps make sure the request reaches the right place, sets the clock ticking (organizations are allowed 20 working days to reply) and provides a permanent Open record. My request is at http://www.whatdotheyknow.com/request/contracts_for_and_number_of_sour and reads as below:

Dear University of Cambridge,

I am writing to request information from the Cambridge
Crystallographic Data Centre (CCDC), one of the listed departments
of the University.
The CCDC maintains a database of crystallography, primarily created
from factual crystallographic data supplied as “supplemental
information” or “supporting data” accompanying scholarly
publications. These data (referred to hereafter as “source CIFs”)
are created by the authors, not the CCDC, and represent part of the
primary scientific record supporting the scientific paper. Some
source CIFs are published in the open literature accompanying
publications (whether Open Access or Closed access). Other source
CIFs are not published Openly and are sent exclusively by various
publishers to the CCDC for deposition (closed source CIFs). [see
public reply (/pmr/2011/08/28…) from Dr. Groom, CCDC]. I wish to
know the number of these closed source CIFs and the contracts
between the CCDC and each publisher.
1. Please list all the publishers with whom CCDC has an arrangement
for receiving source CIFs.
2. Please provide a copy of the current contract with EACH
publisher.
3. Please indicate whether EACH publisher puts any restrictions on
the re-use and redissemination of these source CIFs.
4. Please indicate whether EACH publisher claims any intellectual
property rights over these source CIFs
5. Please indicate whether the CCDC claims any intellectual
property rights over these source CIFs
6. Please indicate for EACH publisher how many source CIFs are held
by the CCDC
7. Please give any information on whether the Advisory Board or
other governing body has discussed the question of making closed
source CIFs Openly available.

This is a valid FOI request (the CCDC hold the information and it
will not cost an undue amount of time or money to provide it).

Yours faithfully,

Peter Murray-Rust

I will keep readers of this blog updated on progress

#IUCR2011: reply from CCDC on restrictions on redistributing CIFs

Posted on August 28, 2011 by pm286

I wrote to Colin Groom of the CCDC requesting the release of authors’ raw CIFs (supporting information) into the public domain. I have now had a reply which I publish below. This will alter what I say at my presentation tomorrow (2011-08-29, Monday).

Peter,

Apologies for the delay, I left the IUCr early on Saturday […].

The CCDC has arrangements with a number of publishers, whereby we are able to process CIFs into CSD entries and supply the source CIFs to those who request them. Supply of the source CIF requires the requestor to identify the CIF, either by reference number or by providing the reference to the article in which they are described. It is my understanding that these arrangements were put in place to satisfy the demands of publishers – they indicate that requestors have access to the journal article in which the CIFs were published. The CCDC makes no charge for this activity.

The entire, curated, value-added CIF collection does, of course, form the CSD. This is provided at below-cost to academic scientists. Where academic scientists have a genuine lack of funds, access to the CSD is subsidised by the CCDC charity. Of course you have access to the CSD.

No restrictions are made on the research use to which the CSD is put; however redistribution of the CSD is not permitted. Licensees are also required to seek permission from the CCDC prior to releasing derivative works and related services. These restrictions were put in place to satisfy the demands of publishers and, because the CCDC continues to rely on subscriptions to the CSD to fund its ongoing developments, to secure the future of the resource.

The distributed CIFs, and CIFs derived from the CSD, contain statements such as they “…may contain copyright material of the CCDC or of third parties” These were drafted several years ago and were put in place to deal with copyright claims of various publishers at that time. I recognise that there are changing views regarding the copyright of data. I also recognise that technological developments continually present new research opportunities and demands on data. We have therefore, been reviewing the services that the CCDC provides and the terms under which they are provided. Ian Bruno is leading this review. His primary consideration is how the CCDC can ensure that we maximise the accessibility and benefit of structural information both now and into the future. Unfortunately, this review is not yet complete, however, we will consult widely and welcome your views on these issues.

Colin

First, thanks to Colin for his reply. Now my comments:

p.1 This indicates that it is impossible to discover the CIF unless one has access to the journal article.

p.2 I am not asking for access to the CSD, only to the raw CIFs which were contributed as part of the publication process.

p.3 The restrictions on re-use are twofold – (a) from publishers (b) to create a monopoly for the CCDC to secure its income.

p.4 I understood from yourselves that this review would not be complete before the end of the calendar year 2011.
