My talk is “Open Semantic Data in Science”. I’ll probably write 3-4 blog posts on the various aspects of this, and at present I’m thinking of:
- What is Open? (this post)
- What is semantic? And what do we require for it?
- What is data?
- What are we able to offer? (with some modest emphasis on our own endeavours)
We’re also relying much more on machines to help us handle the data, in both its volume and its diversity. This is a central theme at BioIT. So the fundamental postulate of Openness is:
- ANY barrier to access and re-use, however small and seemingly trivial, COMPLETELY destroys public semantic data.
Why am I so insistent on this? I’ll leave the moral and ethical arguments aside here and concentrate on the technical aspects. The Open Knowledge Foundation has addressed this point in its definition, and I’ll quote from it, highlighting particular points (and abbreviating occasionally):
A work is open if its manner of distribution satisfies the following conditions:
- 1. Access
The work shall be available as a whole …, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form. Comment: This can be summarized as ‘social’ openness – not only are you allowed to get the work but you can get it. ‘As a whole’ prevents the limitation of access by indirect means, for example by only allowing access to a few items of a database at a time.
- 2. Redistribution
The license shall not restrict any party from selling or giving away the work either on its own or as part of a package made from works from many different sources. The license shall not require a royalty or other fee for such sale or distribution.
- 3. Reuse
The license must allow for modifications and derivative works and must allow them to be distributed under the terms of the original work. The license may impose some form of attribution and integrity requirements: see principle 5 (Attribution) and principle 6 (Integrity) below. Comment: Note that this clause does not prevent the use of ‘viral’ or share-alike licenses that require redistribution of modifications under the same terms as the original.
- 4. Absence of Technological Restriction
The work must be provided in such a form that there are no technological obstacles to the performance of the above activities. This can be achieved by the provision of the work in an open data format, …
- 5. Attribution
The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work. …
- 6. Integrity
The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.
- 8. No Discrimination Against Fields of Endeavor
… Comment: The major intention of this clause is to prohibit license traps that prevent open source from being used commercially. We want commercial users to join our community, not feel excluded from it.
- 9. Distribution of License
The rights attached to the work must apply to all to whom the work is redistributed without the need for execution of an additional license by those parties.
- 10. License Must Not Be Specific to a Package
…
- 11. License Must Not Restrict the Distribution of Other Works
…
And now the absolute requirement for Openness: NONE OF THE ABOVE CONDITIONS ARE OPTIONAL.
This is the crux. Many data resources are described as “Open” but fail on one or more of these conditions. The commonest failures are:
- To expose only part of the data. A database system with a query interface is normally not Open Data even if individual items can be downloaded without barrier. It is generally impossible to extract the whole work, as its boundaries are concealed by the search interface.
- To limit the amount downloaded. This is very frequent (“you may use a maximum of 100 entries”).
- To forbid re-use (“This data is copyright X and may not be re-used without permission”).
- To require access through specific technology. A search form limits the access.
- To require any form of sign-in, even if free. Robots are illiterate in this respect.
- To restrict the purpose of re-use. Thus CC-NC (“no commercial reuse”) is NOT OKF-compliant.
- To fail to provide a clear statement that the data are open and comply with the Open Knowledge definition. It’s almost universal that data are NOT labelled as Open. This is easy to fix – just add the OKF’s tags.
So the message is simple, though it will take time to spread:
- Use the OKF definition for all your data and tag it as such.
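As a rough illustration of why none of the conditions can be optional, the common failures listed above can be written down as a checklist that a machine could apply. This is only a sketch: the `DatasetPolicy` fields and the function names are hypothetical, invented for this example, and are not part of the OKF definition or of any existing library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetPolicy:
    """Hypothetical summary of how a data resource is offered (illustrative fields only)."""
    bulk_download: bool                 # the whole work can be fetched, not just query results
    download_limit: Optional[int]       # maximum entries a user may take; None means no limit
    reuse_permitted: bool               # modification and redistribution are allowed
    requires_specific_technology: bool  # e.g. access only through a search form
    requires_signin: bool               # any login at all, even a free one
    noncommercial_only: bool            # e.g. a CC-NC style restriction
    labelled_open: bool                 # carries a clear statement of openness


def openness_failures(policy: DatasetPolicy) -> List[str]:
    """Return every common failure the policy exhibits, in the order listed in the post."""
    failures = []
    if not policy.bulk_download:
        failures.append("only part of the data is exposed")
    if policy.download_limit is not None:
        failures.append("the amount downloaded is limited")
    if not policy.reuse_permitted:
        failures.append("re-use is forbidden")
    if policy.requires_specific_technology:
        failures.append("access requires specific technology")
    if policy.requires_signin:
        failures.append("access requires sign-in")
    if policy.noncommercial_only:
        failures.append("purpose of re-use is restricted")
    if not policy.labelled_open:
        failures.append("no clear statement of openness")
    return failures


def is_open(policy: DatasetPolicy) -> bool:
    """A resource counts as open only if it exhibits no failure at all."""
    return not openness_failures(policy)


# A resource that is free and re-usable but asks for a login and caps downloads.
example = DatasetPolicy(
    bulk_download=True,
    download_limit=100,
    reuse_permitted=True,
    requires_specific_technology=False,
    requires_signin=True,
    noncommercial_only=False,
    labelled_open=True,
)
for failure in openness_failures(example):
    print(failure)
print("open:", is_open(example))
```

The point of the design is that `is_open` is an all-or-nothing test: a single failure, however small it looks to a human, is enough for a robot to be unable to treat the resource as Open.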