BioIT in Boston: What is Open?

My talk is Open Semantic Data in Science. I’ll probably write 3-4 blog posts on the various aspects of this, and at present I’m thinking of:

  • What is Open? (this post)
  • What is semantic? And what do we require for it?
  • What is data?
  • What are we able to offer (with some modest emphasis on our own endeavours).

I am starting with the assumption that for science now and in the future Open Data will be essential. The culture, especially among young people, is that the answer is out there and is retrievable within seconds or less. There’s also a realisation that increasingly we don’t know in detail what we are looking for when we start a study. We read bits of papers, skim around till we get a feel for the subject, ask our colleagues, post questions on blogs, etc.

We’re also using machines much more to help us with the data, both its volume and its diversity. This is a central theme at BioIT. So the fundamental postulate of Openness is:
  • ANY barrier to access and re-use, however small and seemingly trivial, COMPLETELY destroys public semantic data.

(Note that I accept that there are closed worlds, such as companies and healthcare, which require access controls, but their technology can feed off what we are trying to create in public view.)

Why am I so insistent on this? I’ll leave the moral and ethical arguments aside here and concentrate on the technical aspects. The Open Knowledge Foundation has addressed this point in its definition, and I’ll quote from that, highlighting particular points (and abbreviating occasionally):

A work is open if its manner of distribution satisfies the following conditions:

  • 1. Access


The work shall be available as a whole …, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.
Comment: This can be summarized as ‘social’ openness – not only are you allowed to get the work but you can get it. ‘As a whole’ prevents the limitation of access by indirect means, for example by only allowing access to a few items of a database at a time.

  • 2. Redistribution

The license shall not restrict any party from selling or giving away the work either on its own or as part of a package made from works from many different sources. The license shall not require a royalty or other fee for such sale or distribution.

  • 3. Reuse


The license must allow for modifications and derivative works and must allow them to be distributed under the terms of the original work. The license may impose some form of attribution and integrity requirements: see principle 5 (Attribution) and principle 6 (Integrity) below.
Comment: Note that this clause does not prevent the use of ‘viral’ or share-alike licenses that require redistribution of modifications under the same terms as the original.

  • 4. Absence of Technological Restriction


The work must be provided in such a form that there are no technological obstacles to the performance of the above activities. This can be achieved by the provision of the work in an open data format …

  • 5. Attribution


The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work.

  • 6. Integrity

The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.

  • 7. No Discrimination Against Persons or Groups
  • 8. No Discrimination Against Fields of Endeavor

The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for military research.
Comment: The major intention of this clause is to prohibit license traps that prevent open source from being used commercially. We want commercial users to join our community, not feel excluded from it.

  • 9. Distribution of License

The rights attached to the work must apply to all to whom the work is redistributed without the need for execution of an additional license by those parties.

  • 10. License Must Not Be Specific to a Package

  • 11. License Must Not Restrict the Distribution of Other Works

And now the absolute requirement for Openness:

NONE OF THE ABOVE CONDITIONS ARE OPTIONAL

This is the crux. There are many data resources which are described as Open but which fail in one or more aspects. The commonest failures are:

  • To expose only part of the data. A database system with a query interface is normally not Open Data, even if individual items can be downloaded without barrier. It is generally impossible to extract the whole work, as its boundaries are concealed by the search interface.
  • To limit the amount downloaded. This is very frequent (“you may use a maximum of 100 entries”).
  • To forbid re-use (“This data is copyright X and may not be re-used without permission”).
  • To require access through specific technology. A search form limits the access.
  • To require any form of sign-in, even if free. Robots are illiterate in this respect.
  • To restrict the purpose of re-use. Thus CC-NC (no commercial reuse) is NOT OKF-compliant.
  • To fail to provide a clear statement that the data are open and comply with the Open Knowledge definition. It’s almost universal that data are NOT labelled as Open. This is easy to fix: just add the OKF’s tags.

So the message is simple, though it will take time to spread:

  • Use the OKF definition for all your data and tag it as such.
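
Because none of the conditions is optional, the common failures listed above amount to a checklist in which a single failed item is enough to make a resource not Open. Here is a minimal sketch of that idea in Python. The class, its field names and the messages are all hypothetical illustrations and not part of the OKF definition or of any existing tool; it assumes that someone has already worked out, by hand, how a given resource behaves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetDescription:
    """Hypothetical description of a data resource (field names are illustrative)."""
    whole_work_downloadable: bool       # the complete work can be fetched, not only query results
    download_limit: Optional[int]       # maximum number of entries allowed, None if unlimited
    reuse_permitted: bool               # modification and redistribution are allowed
    requires_specific_technology: bool  # e.g. access only through a search form
    requires_signin: bool               # any sign-in at all, even a free one
    restricts_purpose: bool             # e.g. a non-commercial (NC) clause
    labelled_open: bool                 # carries a clear statement that it is Open


def openness_failures(d: DatasetDescription) -> List[str]:
    """Return one message per common failure; an empty list means none was found."""
    failures = []
    if not d.whole_work_downloadable:
        failures.append("only part of the data is exposed")
    if d.download_limit is not None:
        failures.append(f"download limited to {d.download_limit} entries")
    if not d.reuse_permitted:
        failures.append("re-use is forbidden")
    if d.requires_specific_technology:
        failures.append("access requires specific technology")
    if d.requires_signin:
        failures.append("sign-in is required")
    if d.restricts_purpose:
        failures.append("purpose of re-use is restricted")
    if not d.labelled_open:
        failures.append("no clear statement that the data are Open")
    return failures


def is_open(d: DatasetDescription) -> bool:
    """Any single failure is enough: none of the conditions is optional."""
    return not openness_failures(d)


# A made-up resource: search form only, 100-entry limit, NC licence, no Open label.
example = DatasetDescription(
    whole_work_downloadable=False,
    download_limit=100,
    reuse_permitted=True,
    requires_specific_technology=True,
    requires_signin=False,
    restricts_purpose=True,
    labelled_open=False,
)

for message in openness_failures(example):
    print("NOT OPEN:", message)
```

The point of the design is that `is_open` has no notion of "mostly open": it is an all-or-nothing test, which is exactly what a robot harvesting data has to apply. Note that the sketch covers only the common failures listed in this post, not every clause of the definition.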

This blog authored with ICE + Open Office; thanks to PeterSefton and USQ


6 Responses to BioIT in Boston: What is Open?

  1. Physchim62 says:

    I think you go too far in your arguments: you seem to suggest that web interfaces are inherently unopen, yet they allow normal human beings (not bots) to access information. Open means open for everyone, not just data-miners.
    In my experience, there are two forms of false openness. The first, we can call the “NIST type”, where copyright is claimed over a selection of non-copyrightable data. This is perfectly legal and perfectly despicable. The community should come up with norms to allow credit to be given to those who spend their time collating such data while still allowing free access to anyone who is interested.
    The second type of false openness is what one could call “selective openness”. You cannot force people to publish their results, especially not their negative ones and especially not in a commercial context. On the other hand, those results which are published openly are very much more available. I have seen at Wikipedia that this can create a bias towards results published in the internal reports of US federal agencies, for example, as these are free of copyright (if not always easily available).
    Commercial journal publishers have earned enormous sums of money at the expense of universities since I entered academia twenty years ago. I agree that we need to go back to a system where publishing a paper was almost as simple as standing up at a meeting of a learned society and reading it out: recent advances in technology make this possible despite the huge increase in the interested population. We, as scientists, also need to reassess how we rate each other’s work and each other as prospective collaborators. That will be the hardest part, I think, but also maybe the most fruitful.

  2. enondengod says:

    I completely agree with your post. Open is open, no matter what tech you use to access data, eyes and hands or bot datasifters.
    /E

