Monthly Archives: May 2007

Data validation in publications

Tony Williams' comment on my post (Data validation and protocol validation, May 31st, 2007) raises several valuable themes which I expand on here and in later posts. Tony and I are in agreement and are working towards something that we can promote separately and jointly. To summarise:

  • The published literature (even after peer review and technical editing) contains many factual errors.
  • Machines can help to eliminate many of these before and during the publication process. In severe cases they can prevent bad science.
  • Techniques include the direct computation of observed data and the comparison of data between datasets.
  • This is only reasonably affordable if the data are originally in machine-understandable form. PDF is not good enough.
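The "direct computation of observed data" idea is easy to sketch. As a hypothetical illustration (the function names and tolerance are mine, not any publisher's pipeline), a submission system could recompute the molecular weight implied by a stated formula and flag disagreement with the reported value - exactly the kind of transcription error that hand-publishing to PDF lets through:

```python
import re

# Average atomic weights (IUPAC, rounded) for a few common elements.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

def formula_weight(formula):
    """Compute the molecular weight of a simple formula like 'C6H12O6'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element:
            total += ATOMIC_WEIGHT[element] * int(count or 1)
    return total

def check_reported_weight(formula, reported, tolerance=0.05):
    """Return True if the reported MW agrees with the computed one."""
    return abs(formula_weight(formula) - reported) <= tolerance

# Glucose, C6H12O6, MW ~180.16:
print(check_reported_weight("C6H12O6", 180.16))  # True  (consistent)
print(check_reported_weight("C6H12O6", 108.16))  # False (digit-swap error)
```

A real validator would of course use full structure files rather than formula strings, but even this trivial cross-check catches a common class of transcription slip.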
  1. Antony Williams Says:
    May 31st, 2007 at 6:12 pm: I have blogged on your comments on the ChemSpider blog with a trackback, and we are in general agreement re the intent and value of the NMRShiftDB.

    I wanted to comment separately on “The role of the primary publisher is critical.” I agree that they can make it a lot easier to extract information; let’s discuss NMR data for now, since this is the focus of this discussion. Validation engines will be required to confirm literature NMR data since, year on year, we have identified 8% errors in the peer-reviewed literature. Your comment re 1% is one concern…8% is at a whole different level.

PMR: It's important to define what an error is (much of the debate about NMRShiftDB is about what counts as an error). Because of the byzantine method of hand-publishing to PDF, many transcription errors get included. Our rough estimate is that many, if not most, published papers in chemistry contain at least one transcription error (maybe only punctuation, but it fouls machine reading).

A second procedural error is associating the wrong molecule with data. We don't know how common this is, but we have certainly seen it. I suspect that these two errors run at the 1% level.
These are separate from "scientific errors", where the "wrong" structure is proposed (see below) or where the assignment (i.e. annotation) is "wrong". I have no comment on these yet - maybe this is the 8%.

  1. Improved automated checking of data is possible. It is one of our primary missions to perform structure verification by NMR as well as auto-assignment and computer-assisted structure elucidation. These technologies are not in their infancy…they are on the maturity curve now. The adoption of such tools by publishers, whether commercial or open source, will be essential if the generation of Open Access QUALITY databases is to proceed. I think I’m speaking to the converted of course….

PMR: completely agreed. We have done the same thing with crystallography and discovered a number of experimental errors and artifacts. Routine calculation of molecular geometry and NMR spectra should now be a prerequisite for these types of studies.

  1. As an example of how computer algorithms for validation of NMR assignments can outperform even skilled spectroscopists, I highlight the debacle around hexacyclinol. A search on this term tells an interesting story, described as turning “into the biggest stink-bomb in organic synthesis in many years”; The Chemical Blog declares “La Clair to get ass handed to him on hexacyclinol”. The story regarding NMR validation algorithms comes AFTER the material was synthesized, AFTER a crystal structure proved the structure, and AFTER full 1H and 13C assignments were made of the material. The algorithm went on to show that the assignments were incorrect, allowing 7-bond couplings. We have worked with the authors to reassign the molecule, and a publication is in preparation to report on the FINAL assignments…and potentially the end of this story.

PMR: fully agreed. The blogosphere had a field day with this and helped to raise the issue of quality in publishing.

There is a general concern among many principals that scientific fraud or sloppiness is serious enough that data must be deposited in repositories, so that questions such as this can be at least partially addressed by referral back to the raw data.

So publishers can help by:

  • insisting machine-readable raw data is available
  • using computer-validation where possible.

Note, of course, that the crystallographers already do this - I shall blog on this again very shortly.

Data validation and protocol validation

This post replies to an ongoing debate about the quality of data and Open vs Closed data and systems. It's specifically about NMR (spectroscopy) but my points are general. Since I have been publicly critical of some systems I must be prepared to take criticism myself (as happens here).

In Update: Robien on NMRShiftDB, Ryan Sasaki (Technical Marketing Specialist for ACD/Labs) writes [PMR's comments interspersed]:

If you have read my earlier post, you will be aware of Wolfgang Robien's critique of the NMRShiftDB. Following this critique, Tony Williams from the ChemSpider Blog  and Peter Murray-Rust from the Unilever Cambridge Centre for Molecular Informatics replied to Wolfgang's comments.

Well now, it appears that Wolfgang has responded to Tony's comments. You can find his response here.

It appears that Wolfgang remains firm in his stance that the NMRShiftDB is not a good resource for scientists, as it contains too many errors. He continues with the comment: "But: Enjoy - it's free!"

PMR: My case has been that science is impoverished by lack of access to data and information. Neither is free, but there are new methods which lower the costs dramatically and also redistribute them. "Free" may mean underfunded and therefore lower quality, or it may mean "open" and capable of dramatic improvement by the community. In the case of NMRShiftDB I am firmly of the opinion that it leads the way in opening access to scientific information. If the community wishes, it can use it as a growing point to develop more and better data. If they don't, they will continue to use existing non-open systems (or in most cases not use anything at all).

I also state publicly that I support the activity of NMRShiftDB for several reasons. Firstly, data (which for me are the central point of NMRShiftDB in this discussion):

  • It allows and promotes the public aggregation of data.
  • It contains mechanisms for assessing data quality automatically. For example, software can be run that will indicate whether values are seriously in error.
  • It allows the public to identify errors and report them. It also allows the creation of a developer/committer community that spreads the load of this process.
  • It allows mashups against other data resources (computation, crystal structures, etc.)
  • It acts as a model system that can be adapted for laboratories that wish to develop their own Open data aggregation systems. We had a DAAD-British Council grant to collaborate with Koeln directly on this process, and we see NMRShiftDB as a potential model for the extension of our SPECTRA program to capture data into institutional repositories.
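The automatic quality assessment in the second bullet can be sketched in a few lines. This is an illustration of the idea only - the threshold and data are invented, and real checkers such as those around NMRShiftDB compare against predicted spectra in more sophisticated ways - but the principle is simply to flag any assignment whose observed shift strays too far from a predicted value:

```python
def flag_suspect_shifts(assignments, threshold=5.0):
    """Flag 13C assignments whose observed shift deviates from a
    predicted value by more than `threshold` ppm.
    `assignments` is a list of (atom_label, observed_ppm, predicted_ppm)."""
    return [(atom, obs, pred) for atom, obs, pred in assignments
            if abs(obs - pred) > threshold]

# Hypothetical entry: C3's observed value looks like a transcription slip.
entry = [("C1", 170.2, 171.0), ("C2", 128.5, 129.3), ("C3", 35.7, 53.7)]
print(flag_suspect_shifts(entry))  # [('C3', 35.7, 53.7)]
```

Anything flagged would go to a human (or a report queue) for inspection rather than being silently corrected.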

Now, software. I have no comment on the relative merits of the NMRShiftDB software against commercial systems. However, the history of Open Source in chemistry has shown that within a few years software can be communally developed to become a leader in the field. Seven years ago relatively few people had heard of Jmol; now it is one of the leading display packages, widely used by publishers, pharma companies, etc. Similarly, OpenBabel was a mess 5 years ago, and the community has now put in so much work that they have made it the leading format-conversion tool. It is therefore quite possible that the NMRShiftDB software can do the same for the NMR community. Certainly, if anyone is intending to build an NMR repository I would urge them to look closely at NMRShiftDB.

So I have a couple of responses regarding Wolfgang's comments in his follow-up:

"When doing this job in a more systematic way not using specific examples as given here, the total number of incorrect assignments exceeds the above mentioned limit of 250 significantly. The intermediate number is at the moment around 300, but about ca. 1,000 pages of printouts are waiting for visual inspection."

Is 300 vs. 250 errors in a dataset of over 200,000 chemical shifts SIGNIFICANT? Is a difference of 50 errors in this dataset statistically significant? That's 0.025%. I await Wolfgang's final results, and then we can judge whether it is significant. Meanwhile, he should also read the document we produced comparing the prediction accuracy of ACD/CNMR Predictor and Modgraph's NMRPredict if he wants to challenge our findings. I think it is a good place to pick up our conversation.
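PMR: the arithmetic here is worth making explicit; a two-line sketch using the counts quoted above (the dataset size is the round figure from the discussion):

```python
shifts = 200_000                    # approximate size of the dataset
errors_before, errors_now = 250, 300

for errors in (errors_before, errors_now):
    print(f"{errors} errors = {100 * errors / shifts:.3f}% of shifts")

# The disputed difference of 50 errors:
print(f"difference = {100 * (errors_now - errors_before) / shifts:.3f}%")  # 0.025%
```

Either way the overall error rate is well under a fifth of a percent; the argument is about whether that tail matters, not about the bulk of the data.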
"I definitely do not claim, that collections like CSEARCH, NMRPredict and SPECINFO are free of errors - the desired level of errors is always 0.0%; a value which can't be reached - the acceptable limit is clearly below 0.1%, maybe 0.05% is good compromise between dream and reality."

I agree, as I mentioned in my last post, that while the desired level of error is 0.0%, this is a value that cannot be reached. I certainly would not claim that our prediction databases are free of error. Further, our work reveals about 8% errors (mis-assignments, transcription errors, and incorrect structures) within the peer-reviewed literature we comb. Error is human nature.

PMR: One of the core skills for all 21st-century humans is to make judgements about the usability of any information. Without NMRShiftDB my current access to spectra is minimal. If I have 20,000 entries with 1% error, that is an enormous advance. Biologists work every day with the knowledge that their gene identifications, sequences, annotations, etc. have serious errors, and they try to measure that error rate.

Let me say, I am very confused by the positioning of this question to Christoph Steinbeck:

"Why do you "reinvent" existing systems - there are a lot of systems (with much better performance !) already around  (a few in alphabetical order: ACD, CSEARCH, KnowItAll, NMRPredict, SDBS, SPECINFO)"

Why reinvent existing systems? To improve! To provide better resources for NMR spectroscopists and scientists around the world! While there are certainly better-performing systems to date, there is no reason to believe that these existing systems cannot be surpassed. Further, NMRShiftDB offers an alternative to those institutions that do not have access to commercial products.

I think that Wolfgang is misunderstanding something here. From his writing, it seems that he feels threatened by the NMRShiftDB and is trying too hard to discredit the hard work and ideas behind this open-source collection. What NMRShiftDB provides is something very different from anything the commercial products he names are offering. It is a truly open-access and open-source offering, where scientists and spectroscopists can freely share their data and build an NMR database that is freely available to the scientific community.

It's FREE! It's not a commercial product like the ones he compares it to!

PMR: I would re-iterate this and add: It's OPEN.

Christoph's group is handling this very well and he mentions himself,

"validations like Robien's and the ones performed by us help make a strong case for open access and open source policy."

PMR: certainly. We are in the process of going live with CrystalEye, a near zero cost crystallographic knowledge base. We have made efforts to identify the error rate (which is lower than NMR but non-zero). Our value will be judged on the validity of the protocol, not the validity of individual entries (though we shall be adding automatic checks to them).
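As a sketch of what such an automatic check might look like (the limits below are illustrative only, not those used in CrystalEye), one can flag any C-C bond whose refined length falls outside a plausible range:

```python
import math

def bond_length(a, b):
    """Euclidean distance between two atoms given as (x, y, z) in Angstroms."""
    return math.dist(a, b)

def check_cc_bond(a, b, lo=1.20, hi=1.65):
    """Flag a C-C bond whose length falls outside a loose plausible
    range (illustrative limits spanning triple to single bonds).
    Returns (is_plausible, length)."""
    d = bond_length(a, b)
    return lo <= d <= hi, round(d, 3)

# A typical single C-C bond (~1.54 A) and a clearly wrong geometry:
print(check_cc_bond((0.0, 0.0, 0.0), (1.54, 0.0, 0.0)))  # (True, 1.54)
print(check_cc_bond((0.0, 0.0, 0.0), (2.80, 0.0, 0.0)))  # (False, 2.8)
```

Real checks would of course use element-pair-specific reference distributions, but even crude range checks catch transposed digits and mis-typed cell parameters.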

Finally, as I mentioned above, I can only assume that Wolfgang has not seen my blog posting comparing the results of his algorithm with ACD/Labs'. It should make for an interesting discussion.

PMR: In the future the means of publishing pre-validated data will continue to increase, so large amounts of Open, high-quality data will become available. The current method of human curation of data will only be useful where the values of the data are so important that life or law depends on them.

The role of the primary publisher is critical. If they want they can help speed up this process; if they want to possess and constrict (cf. Wiley copyrighting data) they will slow it down, but ultimately lose both the battle and their credibility.

I shall write more on our strategy in coming weeks.

Open Data in biomedical science

Heather (Research Remix) has a most important post on data sharing: she has analysed the data-deposition policies of some of the major journals/publishers. Note that this is orthogonal to Open Access - not all these publishers are OA, but many are aggressive about requiring data deposition, and that's good. My comments during and after her post:

Diverse journal requirements for data sharing

Filed under: publishing, data — Heather Piwowar @ 9:04 am
Many academic journals make sharing research data a requirement for publication, but their policies vary widely. I’ve been wanting to understand this better: below is a summary of my Tuesday Morning Delve into the world of “Information for Authors”. I selected 10 journals, two from each of the following ad hoc categories: general science (Nature and Science), medicine (JAMA and NEJM), oncology (JCO and Cancer), genetics (Human Molecular Genetics and PLoS Computational Biology), and bioinformatics (Bioinformatics and BMC Bioinformatics). The results are obviously just the tip of the iceberg, but I found them enlightening.
PMR note: although Science and Nature are general journals almost all the emphasis is on biomedical in this discussion. I would not be surprised to find that the requirements were very different in - say - chemistry or materials science.
Nature has the most stringent requirements, followed closely by Science. These journals required data sharing for the most diverse types of data, specified acceptable databases and escrow requirements, and actually had “teeth” clauses… they specify a statement of consequences for times when you ask for data and the authors don’t provide it.

The medical journals do have requirements for clinical-trials registries, and sometimes suggestions for data inclusion based on clinical-trial design, though they have no mention of requirements or encouragement for sharing (obviously de-identified) research data, except that NEJM requires sharing microarray data.

I’m out of time this morning to highlight the other findings, but you can have a look for yourself below. These rough conclusions of mine are consistent with Table 2-1, “Policies on Sharing Materials and Data of 56 Most Frequently Cited Journals”, in [Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences (2003), National Research Council of the National Academies]:
Their more exhaustive (though dated) analysis also suggests that few clinical-medicine journals have a policy, or if they do it rarely mentions depositing data. About half the life-science journals have some kind of a policy about depositing data. Almost no journals have a statement of consequences.

Conclusions: kudos to Nature and Science. I’m surprised that the policies of other journals are so lax.

PMR: I am afraid I am not surprised. I don't know about medical science, but the major commercial journals have no incentive for data deposition - indeed, a senior representative from Wiley told me they copyright data so they can sell it back to us.
Not sure this analysis is worth digging into more deeply. It isn’t quite where my research is headed, though I do believe the trends would be informative. If anyone else wants to use this as a starting point, have at it!
PMR: this is too important to leave at this stage. It's something that the blogosphere - with a good wiki - could manage easily. Apart from the spammability of wikis, I'd suggest it very strongly.
{Tried to post table here, but can’t get it to display nicely}
PMR: know how it feels - blogging software is strictly text and image only.
PMR: see below:

Well done Heather. It's hard work and often depressing. I once tried to read some publishers' contracts on what authors and readers could do and found them incomprehensible. I thought at the time it was incompetence and out-of-date pages - now I think much of the license area contains deliberate FUD.
I'm interested in the strength of Nature:

Indexed, publicly accessible database, or, where one does not exist, to readers promptly on request. Any supporting data sets for which there is no public repository must be made available [..] any interested reader [..] from the authors directly, [..]. Such material must be hosted on an accredited independent site [..] or sent to the Nature journal at submission [..]. Such material cannot solely be hosted on an author's personal or institutional web site.

Note that institutional web sites (does that mean repositories?) are not good enough for Nature! If you read this as a non-biologist, there are precious few sites where you could deposit data - maybe somewhere like BMC, which offers repository services. I think there is a great opportunity here for the new semantic web.

... and Science:

Large data sets must be deposited in an approved database and an accession number provided for inclusion in the published paper. Large data sets with no appropriate approved repository must be housed as supporting online material at Science, or when this is not possible, on the author's web site, provided a copy of the data is held in escrow at Science

What Heather does not mention is what public access to the data is. Most of the databases are biological and therefore Open.

Approved databases: Worldwide Protein Data Bank; Research Collaboratory for Structural Bioinformatics, Macromolecular Structure Database (MSD EMBL-EBI), or Protein Data Bank Japan; BioMagResBank and Electron Microscopy Data Bank (MSD-EBI); Cambridge Crystallographic Data Centre (CLOSED); GenBank or other members of the International Nucleotide Sequence Database Collaboration (EMBL or DDBJ) and SWISS-PROT; Gene Expression Omnibus; ArrayExpress.

The Cambridge Crystallographic Data Centre (no direct connection with PMR) has 350,000 entries and, last time I enquired, allows only 25 to be downloaded free (0.01%). I shall return to this later.

Science is as good as its word - there are many articles with exposed supporting info - here's a chemistry one, and it looks of high technical quality (I haven't read the science):

It doesn't say anything about copyright, and I hope that Science can confirm that they do not assert copyright. It would be extremely useful if they suggested (or required) that authors add a Science Commons license to the data. This would act as a high-profile encouragement to the others.

Nature is similar - the supplemental data here has been formatted by Nature, but no copyright has been added - and again I hope that they can take the same approach I have suggested.

Whatever your views on Open Access, these two journals have made a good start on Open Data. There is a long way to go, as the data are in the dreaded hamburger PDF (molecules are destroyed by PDFisation), but plaudits to Nature Chemical Biology, which sends molecules to PubChem. We need more semantic data here, please.

Also up in PMR's good books are the Royal Soc. Chemistry and the Int. Union of Crystallography, which expose all their supplemental data openly and, although muddled, effectively free of copyright.

The ACS is halfway. It does expose supplemental info, but it copyrights it (and I know from first-hand discussions that this is deliberate).

The less satisfactory publishers are harder to be precise about as they hide their information.

Wiley - hides data behind a subscription barrier and copyrights it aggressively. I suspect that some data are not even required.

Springer - does not seem to manage data itself and hides what it does get. I have written to my Springer contact asking for clarification but have not yet heard back.

Elsevier - I suspect they require little data, but I have no hard evidence.

It would be EXTREMELY useful for the blogosphere to collect information on these practices. If we all do a little we could cover the whole field. And shame those who need to be shamed.

I shall write more later on supplemental data.

Ola Spjuth of Bioclipse

From time to time people are presented with Blue Obelisks, and the latest recipient is Ola Spjuth. Presentations - and the preparations for them - are rarely simple; they have included:

  • obelisk falling off the table on night of travel
  • being sent a green obelisk (although to be fair this was an addition by the supplier)
  • tramping the length and breadth of San Francisco only to discover there seemed to be no blue obelisks in the whole city. The recipients at that stage got pseudo-blue pseudo-obelisks

and most recently the supply was late, so I resorted to a crystal healing shop in Uppsala - which also had no obelisks but did have a magic blue rock.

So Ola was presented with a blue "foo", for which he was promised a real obelisk next time we met. We hopefully have a photograph of the presentation, in which case it should be possible to do some Photoshopping or similar on the photo.

Fortunately I shall be in Uppsala again in June and Ola and I will be able to meet for a physical handover.

For the uninitiated: there is no formula for the award of the Blue Obelisk - it happens when it happens, but there has to be a physical meeting; they are not delivered by post. There is no citation - if there were an inscription it would read:

"Si monumentum requiris, circumspice"

Ranking chemistry and blogosphere metrics

I've been pointed to ChemRank - a system that allows you to comment on and rank the chemical literature. I hadn't seen this before and haven't looked in depth, so I am only commenting on the idea and technology. (As always I also would like to know what the Openness of the supporting organisation was). I copy the post in full and make some comments.

Dissociating the good literature from the bad is an endeavour we all undertake individually (but not for long: ChemRank). If only there were some website where we could tell whether that syn. prep. is accurate or that physical model is valid. The current methods to determine the validity of the literature are: perform the same experiment; try to determine this from how many people cite the article; ask around the department for someone who did something similar; try to infer the quality of the paper from the h-index of the author.

But, what if you still want more? What if you want something more than a numerical qualifier of a paper's worthiness, or of an author's scientific quality? What if you want to have a discussion beyond the scope of your labmates? What if you want to have an intellectual discussion about the current chemical literature with every other chemist on the planet? How do you do this and where do you begin?

The chemical blogosphere has done a rather good job of keeping the community of internet-savvy chemists abreast of some of the latest and coolest research. But, unless you have a huge audience and build a faithful readership, no one will ever know your views, and it can be rather difficult to have an intellectual conversation with just yourself.

With all this in mind, I set out to create a website that would overcome these historic limitations to academic communication. I created a new website called ChemRank: Screen Shot shown [deleted in this post - PMR]

At ChemRank you can add a paper to the database and then vote on whether you find it a good or bad paper. You can also leave comments critiquing the yield or congratulating the authors on a job well done. Most importantly, there is a public record of your views! Your pain does not need to be repeated if you point out the problematic reaction or the incorrect equation in the comments section.

Another cool feature of ChemRank is the building of a database that knows the good chemical literature from the bad. Currently, there is no database to my knowledge, except your own brain or your PI's brain, that tracks this information - although Noel's experiments with Connotea are a step in the right direction.

There is even something for the already-established pioneers in online chemical-literature discussions (aka chemical bloggers) too. For each paper added to the database, a script behind the scenes will generate a little code snippet to allow voting on the literature you're discussing on your blog. For example, if you are discussing the recent paper "Atomic Structure of Graphene on SiO2", you can generate the following by copy/pasting a code snippet, assuming you've already added the paper to the database and clicked the link "add to your blog":

Atomic Structure of Graphene on SiO2

code snippet:

Code: (html)

This is by no means a completed project; there is still much more to do.

  • Make links that show the papers posted in the last 7 days, 1 month, 1 year, and all time
  • Make users register before posting.
  • Add the ability to tag articles with relevant keywords
  • Make a Pipes RSS feed of each author's most recent papers, generated on the fly behind the scenes, display in the description as well.
  • Make an API so others can access the database and use it for cool new mashups.
  • Add a search feature, though that is kind of pointless while the database has fewer than 10 papers in it.

All of the above can be done in time, but it depends on feedback from the community and the popularity of the website. As with all projects, I don't necessarily expect it to catch on right away, if at all. A lot of the time I code simply to show how to do it, and how to do it well. Allowing public comments on the current literature is something the chemical publishers should be doing anyway; now we no longer have to wait for them...



This and similar approaches are very important. We don't know where metrics like this will take us, but it can't be worse than the current citation index, which is managed by ... by whom???

Before anyone puts too much effort into their own system, they should check out carefully what the "Web 2.0" community does for review. As I blogged earlier, Review Anything has tools for managing and collecting reviews - there is no technical reason why they shouldn't include chemistry, AFAIK. I would strongly encourage using RDF for reviews - this will happen anyway, and it makes your material more easily discoverable.
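As a sketch of what an RDF-encoded review might look like - treat the rev: vocabulary URI and property names here as illustrative assumptions, and the paper URI as a placeholder - a review could be serialised to Turtle in a few lines:

```python
def review_as_turtle(paper_uri, rating, comment):
    """Render one literature review as a Turtle snippet.
    The rev: vocabulary URI is an assumption for illustration."""
    return (
        '@prefix rev: <http://purl.org/stuff/rev#> .\n\n'
        f'<{paper_uri}> rev:hasReview [\n'
        '    a rev:Review ;\n'
        f'    rev:rating {rating} ;\n'
        f'    rev:text "{comment}"\n'
        '] .'
    )

print(review_as_turtle("http://example.org/papers/graphene-sio2", 4,
                       "Clean synthesis; yields reproduce."))
```

The point of the common format is aggregation: snippets like this from many blogs could be merged into one graph and queried together, which plain HTML comments never allow.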

There are already reviews in the chemical blogosphere - one blog runs a poll for the best synthesis of the month/year, etc. I'm not sure whether there is any technology that publishes this in RDF form, but it would be relatively easy. Commentaries in the blogosphere also have a Greasemonkey viewer from Chemical blogspace
and other sites, so that when you read a table of contents the monkey alerts you to any commentary. It would be technically easy to extend the monkey so it counted eyeballs on papers - how often was this paper READ (not cited, not downloaded, but at least displayed in the browser, presumably with a human somewhere near) - and to add tools so that the monkey could ask for opinions from the viewer.

I'm not suggesting there should be one coordinated effort - any more than we need just one movie-review database. But I do think that reviewers should try to use the same technology so that their results can be aggregated - and the reviewers can themselves be reviewed!

The soul of the Blue Obelisk

There has been a discussion on the Blue Obelisk mailing list about whether we should move the mailing list to Google to attract a less geeky community. I drafted a contribution to this discussion and then felt it would be worth airing more generally, as it touches on some principles of the Open movement, the gift economy, etc.

On 5/29/07, Tobias Kind <> wrote:

However to get more spin and attract more people
the whole setup must be easy, otherwise it will be just an
exclusive geek club :-) And some users just want to lurk
around instead of subscribing to the mailing-list.

I have been thinking about the suggestion to relocate the mailing list and offer the following.

The BO was indeed founded essentially as a geek club - authors of software and data resources who realised they needed to work together to make sure they were interoperable and that the resources they shared (data, algorithms, etc.) were normalised. It acts as a convenient first place to post mail of general interest to this community: new major releases, common technical problems, values of data, etc. (Yes, we all subscribe to most of each other's mailing lists, but not all.) That aspect is still a core aspect of the BO - looking through past mails there is a vibrant collection of mails on the details of programs, data, etc.

My description of the BO is that it is a meritocratic gift economy (see Eric Raymond), where people are valued for their contributions. These can be manifold, ranging from programs, data, and draft protocols to managing wikis and lists. This type of approach is very common in Open Source projects.

More recently the BO has started to promote the BO brand. Not yet very consistently - but the idea is that software, data, protocols and other resources state more publicly that they are "members" (whatever that means) of the BO. In that way people can get a feel for the quality (or at least the intent) of interoperability and Open Data (the Open Source part is a given that most people understand). There is also the feeling that when you are using resource A and want to integrate it with resource B, there are people in the community who would at least try to help in some way - this is not true of all Open Source, where very often integration is solely up to the users.

There are also common difficult problems we all face. Very recently, for example, I have had a request to integrate CML into geochemistry and isotopes are a requirement. I know that isotopes are more difficult than they look and would certainly ask the community for suggestions when or if I ran into trouble.

I think the calls for different or added lists spring from the emerging "political" dimension in the BO world. The motivations seem to be:
* we need more people to discover BO through a more exposed mailing list
* people are debarred from contributing because of the geekiness of SF mail lists.

I am not too worried about the second. BO is primarily about contributing resources. This requires SVN, and there is no reason to move that from SF. So it makes sense for the mailing list to be on the same site.

The political aspect ("spin") is worth discussing. Personally, I see the most powerful tools for advocacy as useful running code, interoperable code, valid respected data, coherent algorithms, etc. When I moderated the XML-DEV list, a major theme was to develop code that worked and standards that could be implemented (by the "desperate Perl hacker"). For me the BO has many of the same features - implementation was the key thing. Now it is clear that the BO has political dimensions - it is a destabilising technology, and I spoke on this theme at the Bioclipse meeting. Essentially, if we get Bioclipse onto every scientific desktop and get BO software running for all major chemical tasks, we shall destabilise the current chemical software economy. And for me that would be a good thing - it would allow innovations that we are lacking.

But the reality of the current situation is that we primarily need more developers - especially for testing, documentation, integration and dissemination. You do not need to be a geek to do some of these (especially docs, tutorials, etc.). But you do have to be able to use SVN at SF.

I am fairly confident that we haven't overlooked large pockets of current Open Source chemical developers. Not all Open Source chemistry projects would describe themselves as BO members, but they almost certainly know of our existence and we work with them from time to time. We also try to make sure that we interoperate. So if we want to reach out to a wider community *of contributors* where will they come from? I can think of at least the following:

  • WP-like contributors. We have very good contact with WP-chem and are making sure that our products are interoperable (e.g. through RDF). It may well be that much of the BO data in the future comes through WP. This is a great way of bringing in school students, etc.
  • non-chemical software developers. This is a very important resource - for example Miguel Howard - a major guru of Jmol - knew no chemistry but has done a world-class job of implementing the graphics. So yes, there is an exciting and really valuable role for non-chemical programmers.
  • educators. The BO resources are now of sufficient quality and breadth to make excellent teaching resources. A major offshoot of this would be tutorials and use cases.
  • industrial programmers. I still have a dream that the pharmaceutical industry will do something in the Open Source area. I talk with representatives on a regular basis. They all say how important open source is and they all want to move into it. I know they *use* our software in considerable amounts. But they don't contribute. We are very receptive to offers.
  • political evangelists (Open Data, Open Science). This is also an important and growing area. We certainly have some exposure here - e.g. at least 4 of us have been invited to Google/Nature FooCamp and we are active in Open forums, etc. Here again the strength of our case is the deeds more than the talk.

Should there be a wider dimension? I am receptive to new ideas. And I want more people to use our code. Is there a "non-geek" dimension that we should expand? I am not sure I can see it immediately, but maybe others can. What is essential, however, is that anyone proposing new ideas should be prepared to put in the work that helps them succeed - ideas by themselves usually don't flourish - the BO is hard work.

Remixing Open Data and the cost of not doing so

Welcome to a new blog (Research Remix)  from Heather Piwowar, currently doing her PhD in Biomedical Informatics at the University of Pittsburgh. Heather is encountering first-hand the difficulty of doing her research because of the problem of getting access to data. So she's taking a very systematic approach to analysing the problem. Here's a typical post

Open Literature Review on Open Data

Don’t you love to experiment? Me too.

This blog is an experiment. I’m starting my PhD literature review on the topic of biomedical data sharing and reuse, and thought it would be appropriate to do it out in the open.

Not quite sure how it will work: I’m new to this blogging thing. Please send me suggestions, questions, and especially links to related work.

Thanks, and happy experimenting… with your own data or that of others :)


One of the key tools we must have in fighting for Open Data is agreed metrics. That is hard work. It includes much disappointment - in other posts Heather mentions that many researchers don't reply to requests for data, and many of those that do cannot (or will not) supply it. (To be fair it's often because it is a lot more work than it might seem - among the first customers for Repositories we often find scientists who have lost their own data!).

It's also important to realise that these data have cost money. There seems to be an assumption that once the "science" has been published the data are then worthless. That's usually not true, but even if it were I think it's useful to enumerate the actual cost of collecting the data. A useful metric is to work out what they would cost at commercial rates - if a chemistry department generates (say) 500 crystal structures at a commercial cost of (say) USD 3000 each (and that's probably an underestimate), that's 1.5 million dollars. Does it become worthless after publication?
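The back-of-envelope metric above can be captured in a couple of lines; the figures (500 structures at USD 3000 each) are the post's illustrative estimates, not real departmental numbers:

```python
# Replacement-cost metric for departmental data, using the post's
# illustrative figures (500 crystal structures at ~USD 3000 each).
n_structures = 500
cost_per_structure_usd = 3000  # a conservative commercial rate
total_usd = n_structures * cost_per_structure_usd
print(f"Replacement cost: USD {total_usd:,}")  # Replacement cost: USD 1,500,000
```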

So we need metrics. It's not exciting, but it's necessary. I would like to know how many chemistry papers are available under "Open Access/Choice" or whatever name - where the author is invited to pay the publisher so that people can read the article Openly. And I am interested in the publishers' policies on Open Data - is supplemental data Openly available? This is a sizeable task. But with modern Web 2.0 tools it should be easier to aggregate the responses (or non-responses) from the publishers. Suggestions and offers welcome.

Bioclipse - Rich Client

I'm at the Bioclipse workshop in Uppsala - excellently run by Ola Spjuth and colleagues. Rich clients - where the client has significant functionality beyond the basic browser - are critical for the interchange of scientific information. A typical example is Maestro (NASA image viewer) where the typical browser does not - and cannot - have the local power and functionality required. So NASA wrote their own and you can download it and run it locally.

It's easy to confuse Rich Clients with AJAX services. For, say, Google Maps all the functionality is on the server - cut off the web and you cannot use Google Maps. The maps are downloaded during use (you can often see the tiles coming down and covering the area). Nothing wrong with this, but you have to do it the way that Google has designed it - you cannot re-use the maps in different ways (there are doubtless complex copyright issues anyway).
But isn't everything moving towards a "cloud" model where all our data, functionality, etc. is remote - on Google, Yahoo, or whatever? Yes, for a lot of our everyday lives. But I think science is different. (Maybe I'm stuck in 20th C thinking... but I'll try anyway). A scientist - or their lab - has data which is mentally on their laptop, or in the instruments, or in the fume hood or wherever. Most labs are probably not yet prepared to store this in Google. And they have local programs and protocols which only run on their local machines.

Moreover relying on GYM (Google, Yahoo, Microsoft) to provide all your environment means a loss of control. Scientists have already lost much of their independence to publishers (they control what you can publish - I just heard today of a publisher who would only accept graphics as EPS - why? - because it makes it easier for their technical department.)

There are obvious challenges to using a Rich Client. If every discipline has their own then the user gets confused. The developer has to manage every platform, every OS (at least in the future). Doesn't this get unmanageable?

Yes, if everyone does things differently. But if everyone uses the same framework it becomes much easier. And that's what we have with Eclipse. It's the world's leading code development environment and there are thousands of commercial plugins. It was originally developed by IBM and is Open Source, Free (obviously) and designed for extensibility. So it will prosper.

If the scientific community converges on Eclipse as a Rich Client then we have enormous economies of scale. That's what Bioclipse is doing in the chemistry and bio-sciences. In fact, however, much of the work is generic - data manipulation, display, RDF, etc. so other sciences can build on that and contribute their own expertise.

There are downsides. Everybody is familiar with browsers - very few scientists yet know Eclipse. But that can change. Eclipse has many tools for easy installation, tutorials, guided learning, updates, etc. All for free. So we expect to go through a period of "early adoption".

In my talk yesterday I described Bioclipse as Disruptive technology (WP) - technology which destabilises current practice and leads to improvements in quality and cost. Even more importantly it returns power to the scientist - they are in control of their data and how to repurpose it. We hope to develop Bioclipse as a browser-publisher so that the scientist works in an environment where they decide how to emit data, not subservient to the technical editing departments of publishers or the proprietary formats so common in chemical information.

Sued for 10 Data Points

Peter Suber has blogged about an important discussion of Wiley's threat of legal action over the reproduction of a data graph from a publication. (There's quite a bit to read if you follow the links but it's worth it.) Also read the followups, where several Open luminaries comment in a more equable manner than I feel capable of at the moment.

PS:  The Batts/Wiley story broke in late April when I was traveling.  If I'd been at my desk, I'd have covered it or at least I'd have tried.  But because the comments proliferated explosively, I wasn't at my desk, and I had a full load of other work, I decided that I had to let it go.  I'm glad to catch up a bit with this post.  I'm also glad to have the chance to recommend comments by Mark Chu-Carroll, Cory Doctorow, Matt Hodgkinson, Bill Hooker, Rob Knop, Brock Read, Kaitlin Thaney, Bryan Vickery, and Alan Wexelblat.  Finally, Katherine Sharpe at ScienceBlogs, where the controversy began, solicited comments from five "experts and stakeholders" (Jan Velterop of Springer, John Wilbanks of Science Commons, Mark Patterson of PLoS, Matt Cockerill of BMC, and me [PeterS].)

The graph had 10 points. This, gentle readers, is Data. Numbers. Facts. Facts are Non-copyrightable. End of story. The author got round it by re-entering the data - well done - and absolutely correct - you cannot copyright numbers.

I have not seen the original graph but I cannot assume that the technical authors at Wiley had created a "creative work" of immense added artistic and cultural merit. There is a limit to what one can do with 10 data points. Perhaps they were going to hang it in Tate Modern. (Most publishers actually create "destructive works" on data - omissions, hamburgers, etc.)

We have to redeem our data - and quickly. There are several legal ways.

  • create supplementary data which we post on our web sites, in institutional repositories
  • just do it - as in this story. You have right on your side. Get your institution to back you. Make a fuss. Tell the world that the publishers are making it harder to save the planet. They are. We need data to save the planet. What if this were a graph (from a rival publisher) of the prediction of sea-level rise at Chichester (it's on the sea - that's where Wiley lives). Wouldn't Wiley wish to know when they would be flooded?
  • Extract data from the publication in numeric form and post it. It will be increasingly possible to do this at zero cost. We'll start explaining how in later blogs. And it will be legal.
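As a hedged illustration of the last point, re-extracting numeric data from a published graph mostly amounts to calibrating pixel coordinates against two known reference points on each axis; the pixel coordinates below are invented for the sketch:

```python
def calibrate(p0, p1, d0, d1):
    """Return a function mapping a pixel coordinate to a data value,
    given two axis calibration points as (pixel, data) pairs."""
    scale = (d1 - d0) / (p1 - p0)
    return lambda p: d0 + (p - p0) * scale

# Calibrate each axis from two reference ticks read off the image.
x_map = calibrate(50, 450, 0.0, 10.0)    # x axis: pixel 50 -> 0.0, pixel 450 -> 10.0
y_map = calibrate(400, 40, 0.0, 100.0)   # y axis: pixel 400 -> 0.0 (screen y is inverted)

# Pixel positions of the plotted points (read manually or by image analysis).
pixel_points = [(90, 364), (250, 220), (410, 76)]
data_points = [(round(x_map(px), 2), round(y_map(py), 2)) for px, py in pixel_points]
print(data_points)
```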

Linked Open Data

This is one of the key issues for me at present. Paul Miller (Talis) - who with his colleagues is constantly working towards a community license - writes (Linked Data: the real Semantic Web?):

It has been interesting to follow the rise of the 'Linked Data' meme in the Semantic Web community recently, and to track it alongside longer term (but quieter) mutterings around 'Open Data' from the likes of Tim O'Reilly and XTech programme committees past and present.

The recent push is due in no small part, I believe, to the sterling efforts of the Linking Open Data community, and to the support they've been receiving from W3C's Semantic Web Education & Outreach (SWEO) group, of which I'm a rather quiet member.

Listening to Tim Berners-Lee's keynote in Banff a week or so back, there was a strong steer toward 'Linked Data', and the opportunities presented by the relationships between resources and the aggregate of those resources. This thread came up again and again, most notably in the Linked/Open Data sessions. Thinking about it again, the whole Linked Data thrust actually comes across as a far more compelling way to describe the value of the Semantic Web to the non-geek audience. Are we seeing some formal shift in W3C's language as we and they grapple to clearly express the value of these misunderstood 'new' approaches? Let's hope so, as these Data Web/ Web of Data stories get far less bogged down in the horrors of 'triples', 'ontologies' and other concepts designed to send most audiences into an irretrievable tailspin...

If the Web of Data is the target, of course, the thorny issue of to whom the data belong, and the ways in which the data may be used, come to the fore once more. This is an area we've been tackling with contributions such as the Talis Community License, and it came up in Rob's contribution in Banff [Rob's audio here, PDF of everyone's slides here], as well as papers from both of us at XTech last week. We've seen a lot of interest in some of the issues we've been stressing around the need to apply some licence to data, and the importance of understanding the rights that do - and don't - apply to data as opposed to creative works, and look forward to finishing the work we started with the TCL and getting the whole thing onto some more formal footing.

One conversation from last week that has carried over onto email this week was with Rufus Pollock of the Open Knowledge Foundation. They don't have a license, but they do usefully define a set of principles to underpin the notion of 'open knowledge', and they explicitly include the separate notion of data;

“The Open Knowledge Definition (OKD) sets out principles to define the 'open' in open knowledge. The term knowledge is used broadly and it includes all forms of data, content such as music, films or books as well any other type of information.

In the simplest form the definition can be summed up in the statement that 'A piece of knowledge is open if you are free to use, reuse, and redistribute it'.”

We're seeing movement as a growing body of implementors, commentators and analysts recognise the potential of linking disparate data resources together, leveraging some of the more basic capabilities of RDF and other Semantic Web enabling technologies. We're also seeing a matching awareness of the need to protect use of those data sets (and not merely to safeguard the interests of data owners, but also - and far more tellingly - to give confidence to data aggregators and users), and a refreshing willingness to engage openly and cooperatively in reaching a pragmatic solution. It's a great time to be involved in this space, and Talis looks forward to playing our full part across the piece.

Update: Rufus Pollock has begun a Guide to Open Data Licensing on their wiki...

One of the drivers is that systems such as Freebase (from Metaweb) claim to be able to manage huge amounts of linked Open Data. I'm hoping so, as it will revolutionise the closed minds in chemical information. I'll be trying out some of these ideas at the Bioclipse meeting tomorrow.
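A minimal sketch of what "linked" means here, using plain (subject, predicate, object) triples rather than any real RDF library; the identifiers, predicates and values are invented for illustration:

```python
# Plain-Python illustration of linked-data triples (subject, predicate, object).
# The identifiers, predicates and values here are invented for illustration.
dataset_a = [
    ("compound:caffeine", "hasName", "caffeine"),
    ("compound:caffeine", "hasFormula", "C8H10N4O2"),
]
dataset_b = [
    ("compound:caffeine", "meltingPointC", 235),
]

# Because both datasets use the same identifier for the subject,
# simply concatenating them links the facts together.
merged = dataset_a + dataset_b
facts = {pred: obj for subj, pred, obj in merged if subj == "compound:caffeine"}
print(facts["hasFormula"], facts["meltingPointC"])
```

The whole value of the linked approach rests on shared identifiers: two datasets that describe the same thing under different names cannot be merged this cheaply.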