Typed and scraped into Arcturus
A very important comment from Henry Rzepa [http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2488&cpage=1#comment-471129 ] on how publishers destroy semantics.
[HSR] This Peter does not mention above, but you will find a fully explained example of how to create a domain specific (chemistry) repository at 10.1021/ci7004737.
Other comments: We have used this repository to provide complete supporting data in readily accessible form for more than 20 primary scientific publications with publishers such as Nature publishing group, VCH (Wiley), RSC, ACS, Science. Each of these publishers had to be persuaded to a greater or lesser degree to integrate this data into the primary article (rather than more obscurely in supporting information), a location we insisted upon in order to give that data prominent visibility to the readers. If you are interested in seeing the effects, I have blogged on the topic.
I recently acted as a referee for a well known publisher. The article was the analysis of quite large molecules, and I was quite keen to explore the proposed structures. Data had been provided, in the form of a double column, page broken, Acrobat file. I faced not a little work in converting this format to something I could use for the purpose. Since I knew the authors, I contacted them after my review process was complete (yes, thus breaking my anonymity) asking why they had provided the data in such an arcane and relatively unusable format. They were following the publisher guidelines. They did suggest that it should be the publisher themselves who should offer a domain specific repository for authors to use, since it is non-trivial for an author to establish a domain specific repository themselves (and even within domains, there are many diverse requirements). I have my doubts however that such a model could be effectively deployed by the multiple publishers in chemistry any time soon. Meanwhile, for the vast majority of articles which have associated data submitted with them, the Internet revolution has yet to make much of an impact!
Henry is, of course, a pioneer in this area, and jointly and separately we keep running into this problem.
This example has come at an excellent time as I shall be posting another Panton paper on mining and semantics. I except a few publishers [mainly Open Access in practice and spirit] from the remarks below.
Publishers do not understand semantics and therefore destroy them.
Publishers are generally not interested in innovation unless they drive it, and most of them drive it to enhance their monetary returns rather than for the benefit of the community they purport to serve.
The simplest way to create "publications" is PDF. PDF was not invented/chosen for the benefit of the community but for the benefit of the publishers. PDF destroys semantics and can be used to prevent innovative uses of publications such as data- and text-mining.
Many librarians and even Open Access enthusiasts have been sucked into the "PDF is wonderful" syndrome. Repositories, theses and journals use PDF and our semantic scholarship languishes.
And supplemental data, as Henry points out, is even worse. Turning semantic images, tables, molecules, graphs, etc. into mindless PDF is the academic equivalent of opencast coal-mining or logging the rainforests. We destroy our information richness for the benefit of monetary gain or simple laziness.
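To make "destroying semantics" concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the coordinates are made-up values for a water-like molecule, and `two_column_layout` is a hypothetical helper that imitates the kind of double-column text Henry describes. It is not any publisher's actual format, nor the repository format Henry's group uses.

```python
import json

# Hypothetical coordinates, for illustration only (not taken from any paper).
atoms = [
    {"element": "O", "x": 0.000, "y": 0.000, "z": 0.117},
    {"element": "H", "x": 0.000, "y": 0.757, "z": -0.469},
    {"element": "H", "x": 0.000, "y": -0.757, "z": -0.469},
]

# Semantic form: a round trip through JSON keeps every value labelled,
# so a reader's program gets back exactly what the author wrote.
recovered = json.loads(json.dumps(atoms))


def two_column_layout(rows):
    """Lay the rows out in two columns, as a typeset page might (hypothetical)."""
    half = (len(rows) + 1) // 2
    left, right = rows[:half], rows[half:]

    def cell(row):
        return f"{row['element']} {row['x']:.3f} {row['y']:.3f} {row['z']:.3f}"

    lines = []
    for i, row in enumerate(left):
        cells = [cell(row)]
        if i < len(right):
            cells.append(cell(right[i]))
        lines.append("   ".join(cells))
    return "\n".join(lines)


# Page form: the same numbers, but the field names are gone and two
# atoms now share one line of text.
flattened = two_column_layout(atoms)
first_line_tokens = flattened.splitlines()[0].split()

print(recovered == atoms)       # the semantic form survives the round trip
print(flattened)
print(len(first_line_tokens))   # tokens on the first line of the page form
```

The semantic form comes back unchanged with one library call. The page form still contains the numbers, but a reader has to guess the column layout and the reading order before any of them can be used, and that guess has to be remade for every article.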