Dictated into Arcturus
This post is a first outline – not even a draft – of a proposed Panton Paper on “Semantic Open Scientific Data”.
The vision of the Semantic Web 2.0 (if I’ve not lost count) includes Linked Open Data. We’ve dealt a lot with Open and somewhat with Data, but not with links. The rules of Linked Open Data (http://en.wikipedia.org/wiki/Linked_Data):
- Use URIs to identify things.
- Use HTTP URIs so that these things can be referred to and looked up (“dereferenced”) by people and user agents.
- Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
- Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
The technologies for doing Semantic Web are therefore HTTP/URI, HTML, various XMLs, and RDF. The linking can be done through HTML link, href, or XLink (or others).
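As a rough illustration of the four rules, here is a minimal Python sketch that describes a thesis as RDF triples and writes them out as N-Triples (a simpler RDF serialization than RDF/XML). The example.org URIs, the thesis title and the helper names are all invented for illustration; only the two Dublin Core term URIs are real vocabulary.

```python
# Hypothetical identifiers: the example.org URIs are placeholders, not real records.
THESIS = "http://example.org/thesis/42"
AUTHOR = "http://example.org/person/jane-doe"

# Dublin Core terms, a widely used metadata vocabulary.
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"


def uri(value):
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(text):
    """Format a plain string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = [
    # Rules 1 and 2: the thesis is identified by an HTTP URI.
    # Rule 3: useful information (here, a title) is attached to it.
    (uri(THESIS), uri(DC_TITLE), literal("A study of crystal structures")),
    # Rule 4: a link to another URI, so a reader or machine can follow it.
    (uri(THESIS), uri(DC_CREATOR), uri(AUTHOR)),
]

print(to_ntriples(triples))
```

A server that returned this text when the thesis URI was dereferenced would satisfy the rules in spirit; a real deployment would more likely use an RDF library and content negotiation rather than hand-built strings.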
LOD does not say much about documents, yet most scientific data is published in document format. Most documents such as theses and scientific papers contain headers, abstracts, sections, paragraphs, embedded tables, images, attached data, etc. Common formats are Word, LaTeX and HTML. These provide more or less semantics according to the authoring tool, the diligence of the author, etc. So, for example, several publishers have very well marked-up HTML.
PDF, PNG and PowerPoint as commonly used have effectively no semantics. (There are areas in some of these allowing the inclusion of semantics, but these are not used and are so variable between releases that they are effectively useless.) In the average PDF it is impossible for a machine to tell where one sentence or paragraph ends and another begins. Superscripts and styling are also incredibly difficult to interpret.
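By contrast, in marked-up HTML the paragraph boundaries are explicit, so a machine can recover them with very little effort. The following is a minimal sketch using Python's standard html.parser module; the sample fragment and the names ParagraphExtractor and extract_paragraphs are invented for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        # Text inside nested tags such as <sub> is kept; the tags are dropped.
        if self._buffer is not None:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Invented sample fragment, standing in for a publisher's marked-up HTML.
sample = (
    "<h1>Results</h1>"
    "<p>The yield was 42%.</p>"
    "<p>H<sub>2</sub>O was used as solvent.</p>"
)

for paragraph in extract_paragraphs(sample):
    print(paragraph)
```

The same markup also records the subscript in H2O explicitly through the sub element, which is exactly the kind of information that has to be guessed from glyph positions in a typical PDF.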
Most authors write in a (semi)semantic form. Most publishers will accept this, then print it out and scan it, or de-semantify it as PDF. Many have the manuscript retyped.
Many graduate theses are required as PDF even though the authoring is in Word or LaTeX.
So here are my recommendations:
- Authors should be provided with incentives and tools to create documents with as much semantics as possible.
- Publishers must become aware of the value of semantics and retain it during their processing.
- Theses should always preserve the original born-digital document and data. These should always be available alongside any PDF.
- Repository owners should present their content as Linked Open Data (RDF) wherever possible. This may require managing identifier systems and ontologies.
- Readers should have access to semantic readers supported by repositories and publishers.
- There should be converters from common semi-semantic forms to fully semantic where possible (e.g. as in Chemistry), supported by repositories and publishers.
- Tools should be available for human and machine semantic annotation. This may not always be completely accurate, but it will be useful.
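To give a flavour of machine annotation, here is a deliberately naive Python sketch that links known terms in running text to identifiers. The vocabulary, its example.org URIs and the annotate function are all hypothetical; a real tool would use a curated ontology and much better term recognition.

```python
import re
from html import escape

# Hypothetical vocabulary mapping terms to identifiers; the URIs are placeholders.
VOCABULARY = {
    "benzene": "http://example.org/compound/benzene",
    "ethanol": "http://example.org/compound/ethanol",
}


def annotate(text, vocabulary=VOCABULARY):
    """Return HTML in which each known whole-word term links to its identifier."""
    safe_text = escape(text, quote=False)
    if not vocabulary:
        return safe_text

    # Longest terms first, so a longer term wins over a shorter one it contains.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )

    def link(match):
        term = match.group(0)
        href = vocabulary[term.lower()]
        return f'<a href="{escape(href)}">{term}</a>'

    return pattern.sub(link, safe_text)


print(annotate("Benzene dissolves in ethanol."))
```

Whole-word matching means that a name such as methylbenzene is left alone rather than wrongly linked to benzene, but it also means unknown or misspelled terms are simply missed. That is the sense in which such annotation is not completely accurate yet still useful.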