Open Data: what I shall say at ACS

#pantonprinciples #opendata #acsanaheim

I am speaking on Open Data and the Panton Principles at ACS. It’s in a session solely on Open Data. But I suspect I am the only one who is promoting Free-as-in-speech. I get to kick off, so here’s an idea of what I am going to say… I have promised (gently) not to rant. But I cannot help myself if the arguments for Openness are overwhelming.

Note for non-chemists. Chemistry is a very non-Open subject. 95% of the journals are closed and of the remaining 5% it’s difficult to get people other than zealots to publish in them. There is almost no Open Data. Publishers defend “their content” avidly and there have been lawsuits. About 0.01% of practising chemists know about or practise Openness (a guess, but probably a reasonable one).

The principles below hold for any science but the examples are chemistry.

Why is Open Data Important?

Data-rich sciences rely on data to:

  • Confirm or disprove experiments. When the data are not published the science cannot be verified or validated
  • Act as yardsticks for other methods (e.g. computational chemistry strives to replicate experiment, and experiment is in turn moderated by theory)
  • Be re-used (mashed up, linked) in novel ways. There are over a thousand papers describing chemistry derived from the reported crystal structure literature

Traditionally only a small amount of data was published. Now, with printing costs irrelevant, it’s technically possible to publish the whole experiment.

Moreover, much science reported in text is now processible by machines. I argue (/pmr/2011/03/28/draft-panton-paper-on-textmining/) that textmining can bring huge amounts of value to science.

What is the problem?

Most data is never published at all. That is partly laziness, partly selfishness, partly lack of technology, and partly the lack of culture in favour of publishing.

The data that is published is often published in “textual” form. By default we are forbidden – not for scientific reasons but for legal and commercial ones – to use modern methods of textmining to extract this data. This means that the legal restrictions are holding back science, perhaps by 10 years. The effect of this is:

  • We have much less data and much less variety of content
  • The current quality of data could be much higher (machines do a good job of validation)
  • The efficiency of data creation could be much higher
  • We could detect more fraud or other questionable data
  • Data would be more immediately available (humans have a slower clock cycle)

The downside – which I am sure most closed access publishers must concur with – is that some publishers cannot support their business model. So the equation is simple:

Support closed access publishers at the cost of fewer, poorer, later data.

I don’t think anyone can disagree with this conclusion. I make no public judgment here – the choice is yours.

How to proceed?

If all publishers adopt a business model of Open Data content (and this is compatible with closed access publishing) then we have a step forward. So I and others are asking all publishers to declare that the data in their publications is Open.

What is Open?

Open is free-as-in-speech (libre) not free-as-in-beer (gratis) [Richard Stallman]. Gratis means you can use something but you have no rights. Most of the free services in chemistry are gratis. They can be switched off tomorrow (and frequently have been – I can name many services which were free-to-use and now are not). With gratis material you cannot as of right:

  • Create a derivative work – this curtails innovation
  • Rely on the material being persistent. This curtails mashups and linking
  • Publish aggregations, compute derivatives…
  • Create new ways of presenting and using the content

Libre allows all this. Free-as-in-speech is exemplified for knowledge by the Open Knowledge Foundation’s Open Definition: http://www.opendefinition.org/

The Open Knowledge Definition (OKD) sets out principles to define ‘openness’ in knowledge – that’s any kind of content or data ‘from sonnets to statistics, genes to geodata’. The definition can be summed up in the statement that “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.

It’s very simple. Just remember “free to use, free to re-use, free to redistribute”. It is clinically effective at deciding whether something is Open.

So what’s the problem?

The problem is that almost no data is marked as Open. So by default its state is unknown. And in that case the default action has to be to say it’s not Open. You cannot guess, or use algorithms to determine whether something is Open. The only foolproof way is to let someone sue you and lose. That’s because it’s a legal thing, not a moral one. And algorithms and reasonableness and morals don’t work in law.

So the way round this is for content providers (I avoid “owners”) to declare that data are Open. And that is what we are asking YOU to do.

So Why the Panton Principles?

The problem is that it’s not legally trivial to declare something Open. It’s taken the OKF and Creative/Science Commons two years to work out the best way. And the best way is to dedicate the data to the public domain. That needs a licence, and we suggest either PDDL or CC0 (not any old CC licence, but CC0). So we met over some years and finally came up with the Panton Principles: http://pantonprinciples.org/

There is no reason why all authors, publishers and funders should not endorse these principles, and some have.

So what’s the problem?

The problem is that many content providers don’t realise that this is a problem. So we’ve built a site to ask them about the Openness of their data: http://www.isitopendata.org/ . Here we can ask questions of content providers and record their answers – in public. This means that we can save them time and hassle by only asking the question once.

The answer is for content providers who wish to make it clear that their data is Open to add a licence to that data. It’s also useful to add a visual indicator such as the OKF’s Open Data button.
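As a minimal sketch of what marking data as Open could look like in machine-readable form, the Python below attaches a licence to a metadata record and treats anything unmarked as not Open. The field name `license`, the SPDX-style identifiers `PDDL-1.0` and `CC0-1.0`, and the function names are all illustrative assumptions; none of them comes from the Panton Principles or from any OKF tool.

```python
# Licences this post suggests for dedicating data to the public domain.
# The identifier strings are an assumption (SPDX-style short names).
PUBLIC_DOMAIN_LICENCES = {"PDDL-1.0", "CC0-1.0"}


def mark_as_open(record, licence="CC0-1.0"):
    """Return a copy of a metadata record with an explicit licence attached."""
    if licence not in PUBLIC_DOMAIN_LICENCES:
        raise ValueError(f"{licence!r} is not one of the suggested licences")
    marked = dict(record)
    marked["license"] = licence  # hypothetical field name
    return marked


def is_marked_open(record):
    """Unmarked data has unknown status, so it is treated as not Open."""
    return record.get("license") in PUBLIC_DOMAIN_LICENCES


dataset = {"title": "Example crystal structure data"}
print(is_marked_open(dataset))                # False
print(is_marked_open(mark_as_open(dataset)))  # True
```

The check deliberately refuses to guess: a record with no licence, or with some other licence, is reported as not Open, which mirrors the default described above.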

And where is this going?

The steps are:

  • To get content providers to consider the importance of Openness
  • To get them to make a considered decision (hopefully Open)
  • To get them to mark the content as Open
  • To get them to spread the idea.

The framework is compelling. The Panton Principles have been successfully applied to bibliography – an important part of science. And they have activated a set of Panton deliberations – discussions (audio/visual) and papers.

We need Open Data for better, quicker, more complete science. That may mean changing the business models. If so we need to think soon…
