Beyond the PDF: when should we add semantics

There is a lively debate on BTPDF about adding semantics to scientific publications, and this is a snapshot of some of my own contributions. The site (and the meeting) are – I think – open to anyone. The idea is that people will offer ideas and materials to support the meeting.

One discipline I think we should adopt before BTPDF is that we should all read a variety of papers in different fields. I suspect the majority of attendees will be bioscientists, because they have an excellent record of knowing they need semantics and developing it. I don’t think we can become polymaths, and if we try to solve all disciplines simultaneously we will get nowhere. The problems of a clinical trial are completely different from those of string theory.

Here is my zeroth law of semantic enhancement (not well phrased, but it was late at night). Readers of this blog will know how easy it is to corrupt information:

All discipline-independent syntactic problems are soluble and must be solved

By this I mean that characters must be expressed as characters with encoding (and not pixel images). All images with text should have the text machine processable. All graphs should be accompanied by the raw data that created them (e.g. CSV). All numeric quantities should have units. All maths should use MathML. All chemistry should be in CML. Geolocations should use KML; maps should use polygons. All line graphics should contain scalable vectors. None of this is rocket science – it’s purely a question of will. The temperature is not 278, it is 278 K.
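Two of these rules (units on every number, raw data behind every graph) are easy to show in a few lines. This is only a rough sketch under my own assumptions: the `Quantity` class, the `parse_quantity` and `graph_data_to_csv` helpers, the "name [unit]" header convention and the data points are all invented for illustration, not an established format.

```python
import csv
import io
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number that always carries its unit (hypothetical helper class)."""
    value: float
    unit: str


# A number, some whitespace, then a non-empty unit string (illustrative pattern only).
_QUANTITY = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s+(\S+)\s*$")


def parse_quantity(text: str) -> Quantity:
    """Parse '278 K'; refuse a bare number such as '278'."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"no unit given in {text!r}")
    return Quantity(float(match.group(1)), match.group(2))


def graph_data_to_csv(rows, x_name, x_unit, y_name, y_unit) -> str:
    """Write the raw (x, y) points behind a graph, with units in the header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([f"{x_name} [{x_unit}]", f"{y_name} [{y_unit}]"])
    writer.writerows(rows)
    return buffer.getvalue()


temperature = parse_quantity("278 K")
print(temperature)

# Made-up data points, for illustration only.
print(graph_data_to_csv([(0, 278.0), (60, 279.5)], "time", "s", "temperature", "K"))
```

Calling `parse_quantity("278")` raises a `ValueError`, which is the point: a number without a unit is rejected rather than guessed at later.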

If we do not solve the zeroth law there is little point in aiming for anything higher, because failure to obey the law corrupts the information irretrievably.

And some types of enhancement.


There are three main places that semantics can be created:

(a) by machines at the time of machine authoring. Where this can be done it is undoubtedly the highest quality, as the machine defines the semantics and the consistency. An increasing amount of authoring is now done by simulations or instruments and there is absolutely no reason why the semantics should not be preserved. It is simply laziness to discard machine-produced annotations such as units, errors, etc.
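To make the point about machine authoring concrete, here is a minimal sketch of an instrument writing out a reading with its unit and error still attached. The `Measurement` record, its field names and the JSON layout are my own assumptions for illustration, not any instrument's real output format, and the values are invented.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Measurement:
    """One instrument reading with the annotations the machine already knows."""
    quantity: str       # what was measured, e.g. "temperature"
    value: float
    unit: str           # e.g. "K"
    uncertainty: float  # standard uncertainty, in the same unit as value


def to_record(measurement: Measurement) -> str:
    """Serialise the reading without discarding the unit or the error."""
    return json.dumps(asdict(measurement), sort_keys=True)


def from_record(record: str) -> Measurement:
    """Rebuild the reading; a record that lacks a field raises TypeError."""
    return Measurement(**json.loads(record))


reading = Measurement("temperature", 278.0, "K", 0.5)  # invented values
record = to_record(reading)
print(record)

# The round trip loses nothing that the machine knew at authoring time.
assert from_record(record) == reading
```

Nothing here costs the machine any effort; the unit and the error were known when the number was produced, so the only choice is whether to keep them.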

(b) by humans. The earlier that this is done the better. The person best placed to annotate information is the person who created it, although this must be modulated with experience. Semantic information added later may involve guessing (units, errors, conditions, etc.). Annotation at the time of conventional publication is almost certain to involve uncertainty. One consequence of this is that we should develop semantic notebooks before we put effort into late authoring tools. Even more of a problem is annotation introduced by technical editors in publishing houses who were not involved in the science; this is likely to involve errors and misconceptions.

(c) by text/data/image mining. Of course this is never perfect. Its advantage is that it can be done at any stage (unless prevented by lawyers). The disadvantages are many. A simplistic lexical approach will get many false positives. A human reader who is aware of this probably won’t worry, but readers who are not are likely to dismiss any document with even one false positive (false negatives are not a problem). Hopefully more publishers will allow text/data/image mining, in which case we can build up a corpus of usage. For example, 100 examples of “off the south coast of Iceland” may be correlated with lat/long, maps in the papers, etc., and so may give us a textual density function. In this way we increase the semantic precision of the complete discipline. We can even hope to extract equations and other semantic objects.
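The false-positive problem with a simplistic lexical approach is easy to demonstrate. The sketch below is a toy of my own, not a real text-mining tool: the two-word dictionary and the sentence are invented, and `naive_hits` and `word_hits` are hypothetical helper names.

```python
import re

TERMS = ["lead", "tin"]  # hypothetical mini-dictionary of element names


def naive_hits(text: str) -> list[str]:
    """Substring search: finds terms even inside unrelated words."""
    lowered = text.lower()
    return [term for term in TERMS if term in lowered]


def word_hits(text: str) -> list[str]:
    """Whole-word search: fewer false positives, but still blind to meaning."""
    lowered = text.lower()
    return [t for t in TERMS if re.search(rf"\b{re.escape(t)}\b", lowered)]


# A sentence with no chemistry in it at all.
sentence = "Printing errors can lead to misleading results."

print(naive_hits(sentence))  # ['lead', 'tin']: "tin" was found inside "Printing"
print(word_hits(sentence))   # ['lead']: still wrong, "lead" is a verb here
```

Matching on whole words removes the hit inside "Printing", but the verb "lead" is still tagged as the metal. Getting rid of that one needs context, which is where a corpus of real usage comes in.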
