I have been silent on this blog for too long (over 1 month) because I have been obsessively concentrating on two major software projects. This post is to keep you up to date and reassure those of like mind that I continue to be very active in trying to liberate knowledge.
Travels. I am off back to CSIRO Melbourne for a month where I'm helping Nico Adams with his Materials Summer School and Workshop. I think the semantic tools that we have all been developing are going to be valuable in creating better informatics and computational approaches to materials. I'm particularly interested in crystalline materials and computational processes.
I'm taking a week off to fly to Auckland, NZ for the tail of the Open Research meeting (Fabiana Kubke) and then Kiwi Foo. Very excited. I'm sorry I can't spend more time with Fabiana but the Summer School overlaps.
In late Feb/March I have been invited to speak at the Columbia Research Data Symposium (http://conferences.cdrs.columbia.edu/rds/index.php/rds/rds). This will be a very exciting meeting and I'm very grateful to Columbia. Originally I declined because I would have been sponsored by Elsevier and I have publicly stated that I am boycotting all Elsevier activities. Columbia's sponsorship means I do not have to take an Elsevier-friendly line. I will blog this meeting before I go and outline some of the issues that the world has to decide on. In simple terms our academic digital freedom is at stake. Data presents a huge opportunity and doubtless large additional income. Academia and governments should act wisely and not outsource their decisions and ethics.
Then I'm off to Kitware, a scientific/consultancy company that makes money out of Open Source, including VTK and Avogadro. I am really excited as I hope to bounce the Declaratron design off them. As always my software is not only Open but non-competitive. Anyone can join in the meritocracy.
AMI2 (http://bitbucket.org/petermr/pdf2svg) is a project to turn PDFs into fully semantic, computable, searchable, executable documents with human intervention. There are >2 million STM PDFs published each year in EuropePMC alone (more on that later). More and more are Open Access of some kind. We have developed a relatively comprehensive and high-accuracy converter and tested it on some thousands of PDFs from several hundred publishers. (Don't rush for your lawyers, publishers; I'm not going to publish your holy PDFs.) The results of this are:
- The technical standard of publishers' PDFs is AWFUL. I don't think I have found one that conforms to the PDF standard.
- We have learnt how to turn them into Unicode.
- The result is technically better than what the publishers produce.
- The next stage, turning SVG into semantic form, is doing well. I am particularly keen on extracting maths equations in semantic MathML form. Equations aren't copyright, are they? Perhaps they are – Pythagoras only died 2500 years ago, so maybe he is still in copyright somewhere. J. S. Bach still is.
I'd love to hear from anyone interested in developing content mining.
The Declaratron. This is a new declarative approach to reproducible semantic computing and directly addresses things like:
- Can scientific computation be reproduced? The current answer is generally – only partially. To do so completely requires the complete semantic unification of all components – data, specification, computational engine and visualisation/publication.
- Have we eliminated all syntactic error and as much semantic error as possible? For example are our units consistent? Are data linked to computable ontologies?
- Can the algorithm be transported to a different environment without writing code?
- Can we follow the progress of the computation?
- Can we modify the algorithm, even in mid computation?
- Can the machine document the complete course of the calculation at whatever granularity we desire?
- Can the results be re-used in another context without human intervention?
… and a great deal more. I think the answer to all of these is yes and I'll be showing how the Declaratron works.
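To make the "are our units consistent?" question concrete, here is a minimal Python sketch of the sort of check a machine can do automatically. It is not the Declaratron and says nothing about how the Declaratron is designed; the `Quantity` class, the three-dimension scheme (length, mass, time) and the numbers are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical scheme: exponents of (length, mass, time).
Dim = Tuple[int, int, int]
LENGTH: Dim = (1, 0, 0)
MASS: Dim = (0, 1, 0)


@dataclass(frozen=True)
class Quantity:
    """A number that carries its physical dimension with it."""
    value: float
    dim: Dim

    def __add__(self, other: "Quantity") -> "Quantity":
        # Adding quantities of different dimension is a semantic error.
        if self.dim != other.dim:
            raise ValueError(f"inconsistent units: {self.dim} vs {other.dim}")
        return Quantity(self.value + other.value, self.dim)

    def __mul__(self, other: "Quantity") -> "Quantity":
        # Multiplication adds the dimension exponents.
        dim = tuple(a + b for a, b in zip(self.dim, other.dim))
        return Quantity(self.value * other.value, dim)

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Division subtracts the dimension exponents.
        dim = tuple(a - b for a, b in zip(self.dim, other.dim))
        return Quantity(self.value / other.value, dim)


# Illustrative numbers only: a cubic cell edge and a mass.
cell_edge = Quantity(4.0, LENGTH)
mass = Quantity(128.0, MASS)
volume = cell_edge * cell_edge * cell_edge
density = mass / volume  # dimension is mass per length cubed
```

With this in place, `mass + cell_edge` raises a `ValueError` instead of silently producing a meaningless number, which is the kind of semantic error the list above asks about. A real system would presumably link the dimensions to a units ontology rather than a hard-coded tuple.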
Open Access/Knowledge. I shall try and blog something on Aaron Swartz. I didn't know him, but I know people who did, and the wealth of tributes has been impressive in itself and also given me more insight into his passion for liberation. The smell of injustice is pervasive.
Content Mining. Hargreaves is going to turn its recommendations into law. No arguments. So in October 2013 I can legally mine anything I have access to and publish as CC-NC. Publishers will whinge, scream, lobby, etc. But that will be UK law (it doesn't require re-legislation and is done through statutory instruments). There's a lot of to-and-fro-ing. Neelie Kroes and colleagues are running something in Brussels in 2 weeks' time – Ross is representing OKF. The publishers are running semi-closed lobbying shops. We all have to remain very vigilant, as publishers have people who are paid to stop progress and we have to rely on volunteers, spare time, etc. That is why I am grateful to Wellcome and the RCUK for their very clear impetus and drive. They have shown passion where the Universities have been spineless or ultra-timid. I'll write more on this before Columbia.
Chuff will be going to AU and NZ… and I'll be meeting with OKF people there. Tweet or mail if you're around Auckland/Warkworth 2013-02-07/12.