Typed and scraped into Arcturus
A very important comment from Henry Rzepa [http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2488&cpage=1#comment-471129] on how publishers destroy semantics.
[HSR] Peter does not mention this above, but you will find a fully explained example of how to create a domain-specific (chemistry) repository at 10.1021/ci7004737.
Other comments: We have used this repository to provide complete supporting data, in readily accessible form, for more than 20 primary scientific publications with publishers such as Nature Publishing Group, VCH (Wiley), the RSC, the ACS, and Science. Each of these publishers had to be persuaded, to a greater or lesser degree, to integrate this data into the primary article (rather than more obscurely into the supporting information), a location we insisted upon in order to give the data prominent visibility to readers. If you are interested in seeing the effects, I have blogged on the topic.
I recently acted as a referee for a well-known publisher. The article was an analysis of quite large molecules, and I was quite keen to explore the proposed structures. Data had been provided, in the form of a double-column, page-broken Acrobat file. I faced not a little work in converting this format into something I could use for the purpose. Since I knew the authors, I contacted them after my review was complete (yes, thus breaking my anonymity), asking why they had provided the data in such an arcane and relatively unusable format. They were following the publisher's guidelines. They did suggest that it should be the publishers themselves who offer a domain-specific repository for authors to use, since it is non-trivial for an author to establish a domain-specific repository themselves (and even within domains, there are many diverse requirements). I have my doubts, however, that such a model could be effectively deployed by the multiple publishers in chemistry any time soon. Meanwhile, for the vast majority of articles which have associated data submitted with them, the Internet revolution has yet to make much of an impact!
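(To make the referee's problem concrete, here is a minimal sketch of even the "easy" direction, assuming the supporting data PDF contains machine-readable text rather than scanned images. The pdfminer.six library and the file name are illustrative assumptions, not anything from the episode described above.)

```python
# A minimal sketch: extract the raw text from a double-column,
# page-broken supporting-data PDF. pdfminer.six is an illustrative
# choice, not anything the authors or referee actually used.
from pdfminer.high_level import extract_text

text = extract_text("supporting_data.pdf")  # hypothetical file name
print(text[:1000])
# What comes back is one undifferentiated stream of characters:
# the two columns can interleave, tables flatten into word soup,
# and page headers and footers land in the middle of the data.
# Turning this back into usable records is exactly the
# "not a little work" described above.
```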
Henry is, of course, a pioneer in this area and jointly and separately we keep running into this.
This example has come at an excellent time, as I shall be posting another Panton paper on mining and semantics. I exempt a few publishers [mainly Open Access in practice and spirit] from the remarks below.
Publishers do not understand semantics and therefore destroy them.
Publishers are generally not interested in innovation unless they drive it, and most of them drive it to enhance their monetary returns rather than for the benefit of the community they purport to serve.
The simplest way to create “publications” is PDF. PDF was not invented/chosen for the benefit of the community but for the benefit of the publishers. PDF destroys semantics and can be used to prevent innovative uses of publications such as data- and text-mining.
Many librarians and even Open Access enthusiasts have been sucked into the “PDF is wonderful” syndrome. Repositories, theses and journals use PDF and our semantic scholarship languishes.
And supplemental data, as Henry points out, is even worse. Turning semantic images, tables, molecules, graphs, etc. into mindless PDF is the academic equivalent of opencast coal-mining or logging the rainforests. We destroy our information richness for monetary gain or out of simple laziness.
Peter, I know various people at publishers who are very keen on introducing semantic technologies… instead, I think it is the journal editors who do not understand what is going on. The publishers provide the platform, but if no journal wants to move in a semantic direction, nothing happens. Therefore, I do not feel the publisher is the sole source of trouble here. E.g. see this post:
http://chem-bla-ics.blogspot.com/2009/03/nature-chemistry-improves-publishing.html
Instead, what we should really have is an ‘Is it a Semantic Publication?’ initiative, where we can send emails to journal editors and put pressure on them… e.g. see this post:
http://chem-bla-ics.blogspot.com/2009/03/journal-of-cheminformatics-i-hope.html
Therefore, I think more accurate statements are:
* Journal editors do not understand semantics and therefore destroy them.
* Publishers are not generally interested in semantic innovation unless the journal editors want to join in.
Might I raise an interesting issue recently discussed by the Jmol community. Jmol, if you do not know it, is quite the most amazing open-source community, and over about 14 years they have developed a remarkable tool for visualising an ever-increasing variety of molecular data. The topic of recent discussion was export to PDF (sic). This is done via two intermediate formats, and the result is that rotatable 3D models are embedded in the Acrobat file. The models can have navigation viewpoints (bookmarks) and other visually impressive features, and are a great improvement (for molecular science) over static diagrams. I suspect the publishers will love them! They are, of course, largely data-free! Thus I know of no easy way of taking an Acrobat file with these embedded U3D models and recovering even just the basic molecular coordinates (if it IS possible, please someone correct this statement!). Just like Acrobat itself, it's a one-way journey, a journey that was never designed to be reversible.

Peter and I coined the term “round-tripping for data” a few years back, the idea being that whatever journey the data had taken, and whatever styles and other wrappers had been applied during that journey, it should be possible, despite the history of that trip, to recover the underlying data at all stages. That test, I think, is not passed by Acrobat/U3D.
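(To make the one-way journey concrete, here is a speculative sketch of the recovery attempt. It assumes pikepdf and the standard PDF layout for 3D annotations: an /Annots entry of /Subtype /3D whose /3DD key holds the 3D artwork stream. Neither the tool nor the recipe comes from the Jmol discussion; both are assumptions for illustration.)

```python
# A speculative sketch: dig the embedded 3D artwork out of a PDF.
# Assumes pikepdf and the standard 3D-annotation layout; not a
# tested recipe from the Jmol discussion above.
import pikepdf

with pikepdf.open("article_with_model.pdf") as pdf:  # hypothetical file
    for i, page in enumerate(pdf.pages):
        for annot in page.obj.get("/Annots", []):
            if annot.get("/Subtype") == "/3D":
                # Best case: we recover the raw stream bytes. That is
                # still a compressed binary U3D scene graph, not atomic
                # coordinates; a full U3D parser would be needed to get
                # even basic x,y,z back out. The round trip fails here.
                data = annot["/3DD"].read_bytes()
                with open(f"model_p{i}.u3d", "wb") as out:
                    out.write(data)
```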
Whilst I do welcome anything that engages the reader in the science, and hence I think that Acrobat/U3D does have a role to play, adopting such a format should not risk throwing the baby out with the bathwater (i.e. making the data inaccessible again). Believe me, grappling with a text editor to recover the data in a double-column, page-broken PDF file would be nothing compared to contemplating how to recover coordinates from a PDF/U3D file.
PDF is a complex file format with uncertain ownership (open but not open source) and a constantly changing standard. It's arguably also a digital preservation nightmare, and another reason not to rely on it as the sole means of redistributing any information. As such, I'm partly surprised by the comment that librarians and the repository community are so enamoured of it, although as repositories and e-journal sites are dominated by PDFs, I can see the logic.
I suspect librarians lean towards it because it is so useful as a textual format. PDF thrives on its ubiquity and is easy to create, view, print, read and redistribute. It also has mechanisms for watermarking and can easily be tied into DRM with minimal effort, which presumably makes it more attractive to publishers.
If consumers of online resources demanded more flexibility in format, I would hope that we as librarians would communicate that demand and that publishers would listen. I believe that the NESLi2 license model that is increasingly being used in the UK has provision for text mining, which should help to get the point across.
PDF certainly can be tied into DRM. The British Library in the UK uses this to restrict the scientific journal articles it distributes upon electronic inter-library-loan request: they have a finite lifetime (30 days) and permit only a single print operation during that period, after which one is left only with paper. The British Library tells me that this is because the publishers insisted upon it. So much for digital preservation! And let's not even mention how data might be handled under those terms and conditions.
Is this progress?
So, Peter, what’s your solution? And would you want to add data to the intellectual property that publishers now control?
Thanks all…
Brian – not sure I understand. I don't want publishers to control *anything*. To help disseminate, evaluate, etc. But not *control*.