I've been working with Nico Adams at CSIRO (Melbourne/Clayton, AU) for nearly 3 months, supported by a Fellowship. CSIRO (http://www.csiro.au/) is a government institution similar in many ways to a National Laboratory. It does research (public and private) and publishes it. But it is also a publisher in its own right – everything from Chemistry, to Gliding mammals, to how to build your dream home. Nico and I have struck up a rapport with people in CSIRO Publishing (http://www.publish.csiro.au/) and today – my last full day in AU – we are going to visit and present some of what we have done and, more generally, have a discussion where we learn about what CSIRO Publishing does.
CSIRO publishes a range of journals and we'll be concentrating on that, though we'll also be interested in reports, books, etc. We've had the opportunity to work with public and non-public content and to use that as a guide to our technology development (all the software I write is, of course, Open Source). Among the questions I'll want to raise (not specifically CSIROPub) are:
- Is the conventional journal type-setting process still needed? I will argue NO – that it costs money and makes information worse. ArXiV has totally acceptable typography in Word or LaTeX and this is better than most journals for content-mining, etc.
- How should data be published? I shall take small-molecule crystal structures as an example. At present CSIRO sends crystal structures to CCDC where they are not openly accessible. I'll argue they should be part of the primary scientific record.
Nico will be talking about semantics – what it is and how it can be used. I think he'll hope to show the machine extraction of content from Aust. J. Chem.
I'll probably play down the political aspect in my formal presentation. The main issue now is how we recreate a market where scientific communication (currently broken) can be separated from the awarding of scientific glory (reputation). I'll concentrate on the communication.
I have a simple, practical, understandable IMMEDIATE proposal addressing the document side of STM (this doesn't of course address the issues of data semantics or whatever).
- The current primary documentary version of the scientific record should not be PDF but a Word or LaTeX or HTML or XML (e.g. NLM-DTD) document.
- All documents should use UTF-8 and Unicode.
There are zillions of Open tools that adhere to UTF-8 and Unicode.
Where PDFs are used they should adhere to current information standards, specifically:
- PDF documents (http://en.wikipedia.org/wiki/Portable_Document_Format) should be ISO 32000-1:2008 compliant
- Publishers should EXCLUSIVELY use UTF-8 and Unicode.
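To make the UTF-8 and Unicode point concrete, here is a minimal sketch in Python of the kind of check any publisher could run on a text document. It is an illustration only: the function name `check_utf8` is hypothetical, and treating private-use and unassigned code points as "problems" is my assumption about what counts as non-compliant, not part of any standard cited above.

```python
import unicodedata


def check_utf8(data: bytes):
    """Return (ok, problems) for a byte string expected to be UTF-8 text.

    Hypothetical helper: the rule that private-use and unassigned code
    points are problems is an assumption made for this illustration.
    """
    try:
        # Strict decoding fails on any byte sequence that is not valid UTF-8.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return False, [f"invalid UTF-8 at byte {exc.start}"]

    problems = []
    for offset, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category == "Co":
            # Private-use code points have no agreed meaning outside one font.
            problems.append(f"private-use U+{ord(ch):04X} at char {offset}")
        elif category == "Cn":
            # Unassigned in the Unicode database shipped with this Python.
            problems.append(f"unassigned U+{ord(ch):04X} at char {offset}")
    return not problems, problems
```

Calling `check_utf8(open(path, "rb").read())` on a manuscript gives a quick yes/no answer plus a list of offending characters. Which code points count as unassigned depends on the Unicode version bundled with the Python interpreter.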
A graduate thesis is a BETTER document than the output of almost any publisher I have surveyed. STM publishing destroys information quality. All documents I have looked at on ArXiV are BETTER than the output of STM publishing.
So I shall make the following proposals:
- CSIRO publishing should publish in a standards-compliant manner.
- CSIRO should make supplemental data Openly available (we'll take crystallography as the touchstone).
The average cost to the public for the publication of a scientific paper is around 3000 USD. The information quality is a disgrace. Some of that money can be saved by doing it better. It's similar to recycling. It makes sense to re-use your plastic bags, toilet paper, etc. (Yes, Healesville animal sanctuary promotes green bum-wiping to save the environment (technically recycled paper)).
Let's have a sticker:
"This journal promotes recycled scientific information"
I'll be presenting the work that Murray Jensen and I have been doing on AMI2. MANY thanks to Murray – he has been given an "AMI" in small acknowledgement.
Murray's AMI in typical Melbourne bush.
AMI progresses steadily. It's taken much longer than I thought primarily because STM publication quality is AWFUL. It's now at a stage where we can almost certainly make an STM publication considerably better. However Murray and I have hacked the worst. AMI2-PDF2SVG turns PDF and AWFUL-PDF into good Unicode-compliant SVG. I'm concentrating on AMI2-SVGPLUS which turns SVG into meaningful documents. Nearly there. Again the absurd process of creating double column justified PDF (that no scientist would willingly pay for) destroys information seriously and SVGPLUS has to recover it. Then the final exciting part will create science from the document.
I'll hope to present some today.
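To give a flavour of the SVG-to-document problem described above, here is a small sketch in Python. It is NOT AMI2 code: the function `svg_text_lines`, its `y_tolerance` parameter and the grouping rule are hypothetical, and it assumes each text fragment is an SVG `text` element with plain numeric `x` and `y` attributes. It simply gathers fragments that share (roughly) the same baseline and reads them left to right.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SVG_NS = "{http://www.w3.org/2000/svg}"


def svg_text_lines(svg_source: str, y_tolerance: float = 1.0):
    """Group SVG <text> fragments into lines of text (illustrative only).

    Assumes every <text> element carries single numeric x and y attributes;
    real converter output may use transforms or per-glyph coordinate lists.
    """
    root = ET.fromstring(svg_source)
    rows = defaultdict(list)
    for node in root.iter(SVG_NS + "text"):
        x = float(node.get("x", "0"))
        y = float(node.get("y", "0"))
        # Bucket by baseline: fragments whose y rounds to the same bucket
        # are treated as one line. Fragments near a bucket edge may split.
        key = round(y / y_tolerance)
        rows[key].append((x, "".join(node.itertext())))

    lines = []
    for key in sorted(rows):  # top of page first (SVG y grows downwards)
        fragments = sorted(rows[key], key=lambda f: f[0])  # left to right
        lines.append("".join(text for _, text in fragments))
    return lines
```

Even this toy shows why double-column justified layout hurts: fragments from both columns share the same baseline, so a naive grouping like this one would merge them into one line, and the column structure has to be recovered separately.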