Monthly Archives: October 2012

#ami2 #opencontentmining An intelligent reader of the PDF STM literature. We achieve the first phase: (alpha) PDF2SVG

In a previous post I outlined the architecture for building a (weakly) intelligent scientific amanuensis, AMI2 (http://blogs.ch.cam.ac.uk/pmr/2012/10/20/opencontentmining-the-ami2-project-for-understanding-the-scientific-literature-and-why-i-love-pdf/ ). We have made a lot of progress since then, mainly in formalizing, refactoring, documenting and clarifying our thoughts. (Refactoring is a computing chore, rather like cleaning the cooker or digging in manure or setting pest traps. There's nothing new to offer people, but you are in a much better position to cook, grow, build, etc. Things will work.) So we are now able to say more clearly what AMI2 (currently) comprises.

[You don't have to be a compsci to understand this post.]

I’ll show our picture again, if only because of the animal (you know what it is, I hope):

And label the components:

  • PDF2SVG (creating clear syntax)
  • SVGPlus (creating clear structure)
  • STMXML (creating science)

These names may change. Constant change (“Refactor mercilessly”) is a necessary feature of good software (the reverse may also be true, I hope!). These are becoming clearly defined modules.

At the end of the last post I asked some questions. I hoped people would give answers so that I could learn whether my ideas made sense. (Feedback is very valuable; silence rarely helps.) Here they are again:

And some more open-ended questions (there are many possible ways of answering). How would you describe:

  • The top right object in the diagram?

There are no right answers. It depends on who or what you are. I'll postulate three types of intelligent being:

  • A polymer chemist
  • A non-scientific hacker
  • PDF2SVG

The chemist would answer something like:

  • The initiation process of the polymerization
  • A forward-proceeding chemical reaction
  • A reaction scheme
  • A free radical (the “dot”)

The hacker might answer:

  • The word “initiation” at a given coordinate and in a given font/weight/style
  • A right-pointing arrow
  • A string of two words (“Scheme” and “1.”)
  • A superscript (“degree”) symbol

The PDF2SVG part of AMI2 sees this in a more primitive light. She sees:

  • 10 characters in a normal sans-serif font, with coordinates, sizes and font information
  • A horizontal line and *independently* a closed curve of two lines and a cubic Bezier curve
  • 8 characters in a bold serif font.
  • Two cubic Bezier curves.

In PDF there are NO WORDS, NO CIRCLES, NO ARROWS. There are only the primitives:

  • Path – a curved/straight line which may or may not be filled
  • Text – usually single characters, with coordinates
  • Images (like the animal)

So we have to translate the PDF to SVG, add structure, and then interpret as science.

This is hard and ambitious, but if humans can do it, so can machines. One of the many tricks is separating the operations (there is a rough code sketch after the list below). In this case we would have to:

  • Translate all the PDF primitives to SVG (I’ll explain the value of this below)
  • Build higher-level generic objects (words, paragraphs, circles, arrows, rectangles, etc.) from the SVG primitives
  • Interpret these as science.
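
To make that division of labour concrete, here is a minimal sketch of how the three modules might chain together. This is not the real AMI2 API – the class and method names are hypothetical placeholders for PDF2SVG, SVGPlus and STMXML – but it shows the intended flow of data.

// A minimal sketch of the three-stage pipeline. The method names are
// hypothetical placeholders for the PDF2SVG, SVGPlus and STMXML modules;
// the real AMI2 interfaces will differ.
import java.io.File;

public class Ami2PipelineSketch {

    public static void main(String[] args) {
        File pdf = new File("article.pdf");        // a PDF article
        String rawSvg = pdf2svg(pdf);              // 1. clear syntax: PDF primitives -> SVG
        String structuredSvg = svgPlus(rawSvg);    // 2. clear structure: words, arrows, rectangles...
        String scienceXml = stmXml(structuredSvg); // 3. science: domain interpretation
        System.out.println(scienceXml);
    }

    // Placeholder implementations so the sketch runs end to end.
    static String pdf2svg(File pdf)    { return "<svg><!-- characters and paths --></svg>"; }
    static String svgPlus(String svg)  { return "<svg><!-- grouped words and shapes --></svg>"; }
    static String stmXml(String svg)   { return "<science><!-- e.g. a reaction scheme --></science>"; }
}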

Hasn’t all this been done before?

Not at all. Our unique approach is that this is an OPEN project. If you are interested in, say, interpreting flow diagrams from the literature and you enjoy hacky puzzles then this is a tremendous platform for you to build on. You never need to worry about the PDF bit – or the rectangle bit – we have done it for you. Almost all PDF converters neglect the graphical side – that’s why we are doing it. And AMI2 is the only one that’s OPEN. So a number of modest contributions can make a huge difference.

You don't have to be a scientist to get involved.

Anyway, why PDF2SVG?

PDF is a very complex beast. It was developed in commercial secrecy and some of the bits are still not really published. It’s a mixture of a lot of things:

  • An executable language (Postscript)
  • A dictionary manager (computer objects, not words)
  • A font manager
  • A stream of objects
  • Metadata (XMP)
  • Encryption, and probably DRM

And a lot more. BTW I know RELATIVELY LITTLE about PDF and I am happy to be corrected. But I think I and colleagues know enough. It is NOT easy to find out how to build PDF2SVG and we’ve been down several ratholes. I’ve asked on Stackoverflow, on the PDFBox list and elsewhere and basically the answer is “read the spec and hack it yourself”.

PDF is a page-oriented and printer-oriented language. That makes things easy and hard. It means you can work on one page at a time, but it also means that there is no sense of context anywhere. Characters and paths can come in any order – the only thing that matters is their coordinates. We're using a subset of PDF features that map onto a static page, and SVG is ideal for that:

  • It’s much simpler than PDF
  • It’s as powerful (for what we want) so there is no loss in converting PDF to SVG
  • It was designed as an OPEN standard from the start and it’s a very good design
  • It's based on XML, which makes it easy to model and work with.
  • It interoperates seamlessly with XHTML and CSS and other Markup Languages so it’s ideal for modern browsers.
  • YOU can understand it. I’ll show you how.

PDF is oriented towards visual appearance and has a great deal of emphasis on Fonts. This is a complex area and we shall show you how we tackle this. But first we must pay tribute to the volunteers who have created PDFBOX. It’s an Open Source Apache project (http://pdfbox.apache.org ) and it’s got everything we need (though it’s hard to find sometimes). So:

HUGE THANKS TO BEN LITCHFIELD AND OTHERS FOR PDFBOX

I first started this project about 5-6 years ago and had to use PDFBox at a level where I was interpreting PostScript. That's no fun and error-prone, but it worked well enough to show the way. We've gone back and found PDFBox has moved on. One strategy would be to intercept the COSStream and interpret the objects as they come through. (I don't know what COS means either!) But we had another suggestion from the PDFBox list – draw a Java Graphics object and capture the result. And that's what I did. I installed Batik (a venerable Open Java SVG project), created a Graphics2D object, and saved it to XML. And it worked. And that's where we were a week ago. It was slow, clunky and had serious problems with encodings.

So we have been refactoring without Batik. We create a dummy graphics object with about 60-80 callbacks and trap those which PDFBox calls. It's quite a small number. We then convert those to text or paths, extract the attributes from the graphics object, and that's basically it. It runs at least 20 times faster – it will parse 5+ pages a second on my laptop, and I am sure that can be improved. That's 1-2 seconds for the average PDF article.
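
To give a flavour of what "trapping the callbacks" means, here is a heavily simplified sketch. It is not the real AMI2 converter (which sits behind the 60-80 Graphics2D callbacks that PDFBox drives, omitted here); it just shows the essential move of turning each trapped character into an SVG <text> element like the ones shown later in this post.

// A minimal sketch, not the real converter: each trapped character callback
// becomes one SVG <text> element. In AMI2 the call would come from a dummy
// Graphics2D object driven by PDFBox; here trapCharacter() is invoked by hand.
import java.util.Locale;

public class SvgTextEmitterSketch {

    private final StringBuilder svg = new StringBuilder();

    // Called once per character that the PDF renderer wants to draw.
    public void trapCharacter(String ch, double x, double y,
                              String fontFamily, double fontSize) {
        svg.append(String.format(Locale.US,
            "<text stroke=\"#000000\" font-family=\"%s\" x=\"%.3f\" y=\"%.3f\" font-size=\"%.3f\">%s</text>%n",
            fontFamily, x, y, fontSize, ch));
    }

    public String getSvg() {
        return svg.toString();
    }

    public static void main(String[] args) {
        SvgTextEmitterSketch emitter = new SvgTextEmitterSketch();
        emitter.trapCharacter("3", 40.979, 17.388, "TimesNewRomanPS", 8.468);
        emitter.trapCharacter("8", 45.213, 17.388, "TimesNewRomanPS", 8.468);
        emitter.trapCharacter("0", 49.447, 17.388, "TimesNewRomanPS", 8.468);
        System.out.print(emitter.getSvg());
    }
}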

The main problem comes with characters and fonts. If you don't understand the terms byte stream, encoding, codepoint, character, glyph, read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (http://www.joelonsoftware.com/articles/Unicode.html ). If you think you know about these issues, reread the article (I am just going to). A considerable and avoidable amount of the garbage on the web is due to developers who did not understand them. Cut-and-paste is almost a certain recipe for corruption. And since science relies on a modest number of high codepoints it is critical to get them right. A missing minus sign can cause planes to crash. A 1 instead of an L. An m instead of a mu (milligrams vs micrograms) in a dose could kill someone.

In simple terms, PDF can manage an infinite number of fonts. This is fine for display, but unless we know what the characters *are* it’s useless for AMI2. We can treat fonts in 2 main ways:

  • Send a graphical object representing what the character looks like on screen or printer. These are usually infinitely scalable fonts and don't degrade when magnified. But it is very hard for AMI2 to work out what a given set of curves means as a character. (How would you describe a "Q" over the telephone and get a precise copy at the other end?)
  • Send the number of the character (technically the code point in Unicode) and a pointer to the font used to draw it. That's by far the easiest for AMI2. She generally doesn't care whether an "A" is serif or not (there are some disciplines where fonts matter, but not many). If she gets (char)65 that is good enough for her (with one reservation – she needs to know how wide it is to work out when words end).

Anyway, almost all of the characters have Unicode points. In our current test document of 130 pages we've only found about 10 characters that didn't have Unicode points. These are represented by pointers into glyph maps. (If this sounds fearsome, the good news is that we have largely hacked the infrastructure for it and you don't need to worry.) As an example, in one place the document uses the "MathematicalPi-One" font for things like "+". By default AMI2 gets sent a glyph that she can't understand. By some deep hacking we can identify the index numbers in MathematicalPi-One – e.g. H11001. We have to convert that to Unicode.

** DOES ANYONE HAVE A TABLE OF MathematicalPi-One TO UNICODE CONVERSIONS? **

If so, many thanks. If not, this is a good exercise for simple crowdsourcing – many people will benefit from it.
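
If such a table exists (or gets crowdsourced) the code side is trivial: a lookup from the font's glyph names to Unicode code points, with an explicit "unknown" fallback so failures are visible rather than silent. Here is a minimal sketch; the two entries are assumptions based on the example above (H11001 appears to be a plus sign) and are exactly what the table would need to confirm.

// A minimal sketch of a glyph-name -> Unicode lookup for MathematicalPi-One.
// The two entries are assumptions for illustration; the full, checked table
// is what the question above asks for.
import java.util.HashMap;
import java.util.Map;

public class MathPiOneLookupSketch {

    private static final Map<String, Character> GLYPH_TO_UNICODE = new HashMap<String, Character>();
    static {
        GLYPH_TO_UNICODE.put("H11001", '\u002B'); // PLUS SIGN (assumed from the example above)
        GLYPH_TO_UNICODE.put("H11002", '\u2212'); // MINUS SIGN (assumed)
    }

    /** Return the Unicode character for a glyph name, or U+FFFD (replacement char) if unknown. */
    public static char toUnicode(String glyphName) {
        Character c = GLYPH_TO_UNICODE.get(glyphName);
        return (c == null) ? '\uFFFD' : c;
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("H11001")); // +
        System.out.println(toUnicode("H99999")); // unknown -> U+FFFD
    }
}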

I’ve gone on a lot, but partly to stress how confident we are that we have essentially solved a standalone module. This module isn’t just for science – it’s for anyone who wants to make sense of PDF. So anyone in banking, government, architecture, whatever is welcome to join in.

For those of you who are impatient, here’s what a page number looks like in SVG:

<text stroke="#000000" font-family="TimesNewRomanPS" svgx:width="500.0" x="40.979" y="17.388" font-size="8.468">3</text>

<text stroke="#000000" font-family="TimesNewRomanPS" svgx:width="500.0" x="45.213" y="17.388" font-size="8.468">8</text>

<text stroke="#000000" font-family="TimesNewRomanPS" svgx:width="500.0" x="49.447" y="17.388" font-size="8.468">0</text>

That's not so fearsome. "stroke" means the colour of the text, and 00/00/00 is r/g/b – 0 red, 0 green, 0 blue, which is black! Times is a serif font. Width is the character width (in some as yet unknown units, but we'll hack that). x and y are screen coordinates (in SVG y goes DOWN the screen). Font-size is self-explanatory. And the characters are 3, 8, 0. The only reason we can interpret this as "380" is the x coordinate, the font-size and the width. If you shuffled the lines it would still be 380.
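
That last point is exactly what the word-building step has to do. Here is a minimal sketch – not AMI2's real word-builder, which also uses the font size and character width rather than a fixed gap – of rebuilding "380" by sorting the characters on x and joining them while the gaps stay small.

// A minimal sketch of rebuilding "380" from shuffled SVG <text> characters
// purely by sorting on x. AMI2's real word-builder also uses font size and
// character width; the fixed maxGap here is just for illustration.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WordFromSvgSketch {

    // A trivial stand-in for one SVG <text> element: a single character at an x coordinate.
    static class SvgChar {
        final String ch;
        final double x;
        SvgChar(String ch, double x) { this.ch = ch; this.x = x; }
    }

    /** Join characters into a word by x order, stopping if a gap is too wide. */
    static String buildWord(List<SvgChar> chars, double maxGap) {
        List<SvgChar> sorted = new ArrayList<SvgChar>(chars);
        sorted.sort(Comparator.comparingDouble((SvgChar c) -> c.x));
        StringBuilder word = new StringBuilder(sorted.get(0).ch);
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).x - sorted.get(i - 1).x > maxGap) {
                break; // too far apart: a new word would start here
            }
            word.append(sorted.get(i).ch);
        }
        return word.toString();
    }

    public static void main(String[] args) {
        List<SvgChar> chars = new ArrayList<SvgChar>();
        // The three characters from the SVG above, deliberately shuffled.
        chars.add(new SvgChar("0", 49.447));
        chars.add(new SvgChar("3", 40.979));
        chars.add(new SvgChar("8", 45.213));
        System.out.println(buildWord(chars, 6.0)); // prints 380
    }
}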

In the next post I'll explain how early-adopters can test this out. It's alpha (which means we are happy for friends to try it out). It *will* occasionally fail (because this is a complex problem) and we want to know where. But you need to know about installing and running Java programs in the first instance. And we need to build more communal resources for collaboration.

Open Access: What is it and what does "Open" mean?

This is the start of "Open Access Week" (www.openaccessweek.org/) and I am urged (including by myself) to write something for it. The OKF is contributing something, and I hope this blog post will be a suitable contribution.

I'm going to ask questions. They are questions I don't know the answer to – maybe I am ignorant, in which case please comment with information – or maybe the "Open Access Community" doesn't know the answer. Warning: I shall probably be criticized by some of the mainstream "OA Community". Please try to read beyond any rhetoric.

As background I am well versed in Openness. I have taken a leading role in creating and launching many Open efforts – SAX (http://www.saxproject.org/sax1-history.html ), Chemical MIME, Chemical Markup Language, The Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk ), Panton Principles, Open Bibliography, Open Content Mining – and helped to write a significant number of large software frameworks (OSCAR, JUMBO, OPSIN, AMI2). I'm on the advisory board of the Open Knowledge Foundation and can shortly reveal another affiliation. I have contributed to or worked with Wikipedia, Open Streetmap, Stackoverflow, Open Science Summit, Mat Todd (Open Source Drug Discovery) and been to many hackathons. So I am very familiar with the modern ideology and practice of "Open". Is "Open Access" the same sort of beast?

The features of “Open” that I value are:

  • A meritocracy. That doesn’t mean that decisions are made by hand counting, but it means that people’s views are listened to, and they enter the process when it seems right to the community. That’s happened with SAX, very much with the Blue Obelisk, and the Open Knowledge Foundation.
  • Universality of participation, particularly from citizens without formal membership or qualifications. A feeling of community.
  • A willingness to listen to other views and find means of changing strategy where necessary
  • Openness of process. It is clear what is happening, even if you are not in command.
  • Openness of results. This is universally fundamental. Although there have been major differences of opinion in Free/Open Source Software (F/OSS), everyone is agreed that the final result is free to use, modify and redistribute without permission and for any purpose (http://en.wikipedia.org/wiki/Four_Freedoms_%28Free_software%29#definition ). "Free software is a matter of liberty, not price. To understand the concept, you should think of 'free' as in 'free speech', not as in 'free beer'." See Gratis versus libre.
  • A mechanism to change current practice. The key thing about Wikipedia is that it dramatically enhances the way we use knowledge. Many activities in the OKF (and other Open Organisations) are helping to change practice in government, development agencies, companies. It’s not about price restrictions, it’s about giving back control to the citizens of the world. Open Streetmap produces BETTER and more innovative maps that people can use to change the lives of people living right now – e.g. the Haitian earthquake.

How does Open Access measure up against these? I have difficulty saying that OA as currently practiced meets any of these to my satisfaction. That doesn't mean it isn't valuable, but it means that it doesn't have obvious values I can align with. I have followed OA for most of the last 10 years and tried to contribute, but without success. I have practiced it by publishing all my own single-author papers over the last 5 years in Gold CC-BY journals (but without much feeling of involvement – certainly not the involvement that I get from SAX or the Blue Obelisk).

That’s a harsh statement and I will elaborate:

Open Access is not universal – it looks inward to Universities (and Research Institutions). In OA week the categories for membership are:

“click here if you’re a: RESEARCH FUNDER | RESEARCHER/FACULTY MEMBER | ADMINISTRATOR | PUBLISHER | STUDENT | LIBRARIAN”

There is no space for “citizen” in OA. Indeed some in the OA movement emphasize this. Stevan Harnad has said that the purpose of OA is for “researchers to publish to researchers” and that ordinary people won’t understand scholarly papers. I take a strong and public stance against this – the success of Galaxy Zoo has shown how citizens can become as expert as many practitioners. In my new area of phylogenetic trees I would feel confident that anyone with a University education (and many without) would have little difficulty understanding much of the literature and many could become involved in the calculations. For me, Open Access has little point unless it reaches out to the citizenry and I see very little evidence of this (please correct me).

There is, in fact, very little role for the individual. Most of the infrastructure has been built by university libraries without involving anyone outside (I regret this, because University repositories are poor compared to other tools in the Open movements). There is little sense of community. The main events are organised round library practice and funders – which doesn’t map onto other Opens. Researchers have little involvement in the process – the mainstream vision is that their university will mandate them to do certain things and they will comply or be sacked. This might be effective (although no signs yet) but it is not an “Open” attitude.

Decisions are made in the following ways:

  • An oligarchy, represented in the BOAI processes and Enabling Open Scholarship (EOS). EOS is a closed society that releases briefing papers; membership costs 50 EUR per year and members have to be formally approved by the committee. (I have represented to several members of EOS that I don't find this inclusive and I can't see any value in my joining – it's primarily for university administrators and librarians.)
  • Library organizations (e.g. SPARC)
  • Organizations of OA publishers (e.g. OASPA)

Now there are many successful and valuable organizations that operate on these principles, but they don’t use the word “Open”.

So is discussion "Open"? Unfortunately not very. There is no mailing list with both a large volume of contributions and effective freedom to present a range of views. Probably the highest volume list for citizens (as opposed to librarians) is GOAL (http://mailman.ecs.soton.ac.uk/pipermail/goal/ ) and here differences of opinion are unwelcome. Again that's a hard statement, but the reality is that if you post anything that does not support Green Open Access, Stevan Harnad and the Harnadites will publicly shout you down. I have been denigrated on more than one occasion by members of the OA oligarchy (look at the archive if you need proof). It's probably fair to say that this attitude has effectively killed Open discussion in OA. Jan Velterop and I are probably the only people prepared to challenge opinions – most others walk away.

Because of this lack of discussion it isn’t clear to me what the goals and philosophy of OA are. I suspect that different practitioners have many different views, including:

  • A means to reach out to citizenry beyond academia, especially for publicly funded research. This should be the top reason IMO but there is little effective practice.
  • A means to reduce journal prices. This is (one of) Harnad's arguments. We concentrate on making everything Green, and when we have achieved this the publishers will have to reduce their prices. This seems most unlikely to me – any publisher losing revenue will fight this (Elsevier already bans Green OA if it is mandated).
  • A way of reusing scholarly output. This is ONLY possible if the output is labelled as CC-BY. There’s about 5-10 percent of this. Again this is high on my list and the only reason Ross Mounce and I can do research into phylogenetic trees.
  • A way of changing scholarship. I see no evidence at all for this in the OA community. In fact OA is holding back innovation in new methods of scholarship, as it emphasizes the conventional role of the "final manuscript" and the "publisher". In fact Green OA relies (in practice) on having publishers and so legitimizes them.

And finally is the product “Open”? The BOAI declaration is (in Cameron Neylon’s words http://cameronneylon.net/blog/on-the-10th-anniversary-of-the-budapest-declaration/ ) “clear, direct, and precise:” To remind you:

“By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

This is in the traditions of Stallman's software freedoms, the Open Knowledge Definition and all the other examples I have quoted. Free to use, re-use and redistribute for any lawful purpose. For manuscripts it is cleanly achieved by adding a visible CC-BY licence. But unfortunately many people, including the mainstream OA community and many publishers, use "(fully) Open Access" to mean just about anything. No-one other than a few of us challenges this. So the result is that much current "OA" is so badly defined that it adds little value. There have been attempts to formalize this but they have all ended in messy (and to me unacceptable) compromise. In all other Open communities "libre" has a clear meaning – freedom as in speech. In OA it means almost nothing ("removal of some permission barriers"), which could be satisfied by permission to post a copy on a personal website while restricting copying and further re-use. Unfortunately anyone trying to get tighter approaches is shouted down. For that reason we have set up our own Open-access list in OKF (http://blog.okfn.org/category/open-access/ and http://lists.okfn.org/pipermail/open-access/ ). So, and this is probably the greatest tragedy, Open Access does not by default produce Open products. See http://blog.okfn.org/2012/10/22/the-great-open-access-swindle/ for similar views.

*If* we can have a truly Open discussion we might make progress on some of these issues.

#opencontentmining The #ami2 project for understanding the scientific literature and why I “love” PDF

Earlier this week I blogged about our new Open project #AMI2 http://blogs.ch.cam.ac.uk/pmr/2012/10/15/opencontentmining-starting-a-community-project-and-introducing-ami2/ . This is an Open community project to create a complete infrastructure for machine understandability of the scientific literature. That’s a bold statement, especially for something with no formal funding, but in the present era that’s not necessarily a show-stopper. Because when we grow the community we can do almost anything. Of course AMI2 can’t understand everything, but she will shortly be able to outstrip several current human activities.

And we’ve had several offers of help. Welcome to all those who have mailed. We have critical mass and Ross and I will work out how to create a community platform. Because #ami2 is Y/OUR project.

The scientific literature represents 300 Billion USD of publicly funded endeavour and costs 15 Billion USD to publish. Many of the results are inaccessible due to old-fashioned means of publication and the difficulty of understanding them. So if people want to help build AMI2 it will have a very important impact.

The current approach to this is to get authors to produce semantic documents. I agree that this is the best way. But authors aren’t interested and publishers are incredibly conservative. So we have to start the other way. By creating #ami2 to understand the current literature in the way that humans do. And much of this can be done by gluing together the technology that already exists. In these blogs we are going to show that, in specific domains, it can be done. NOW! And we make the assumption that it’s not too difficult to build similar technology in parallel in other domains.

So we are going to start with the following disciplines:

  • Phylogenetic trees
  • X-Y plots
  • Chemical diagrams

Of these only the chemistry is at all hard to understand if you are 12 years old. Everyone can understand trees and everyone can understand graphs (because in the information age every citizen should be able to understand a graph). So anyone should be able to follow.

Here’s the AMI2 process:

[I authored the diagram in SVG but cannot use Word+Wordpress to publish it. Anyone know how? So apologies for the PNG – it goes against the philosophy of the project.]

There are three components to this document.

  • Text. There are many uses for text (discussion, tables, references, metadata) but they all use characters and the general technology is the same. We’ll see the distinctions later
  • Diagrams. I use this to mean objects which are created from lines, circles, text, etc. and where there are a number of well-defined objects
  • Images. This covers bitmaps, where there is no formal substructure to the object. Photographs are a common type of image.

A document like the above can be represented with different technologies. I distinguish:

  • Bitmaps. Here only the pixels in the (printable) page are transmitted. By default there is no understandable document content. A typical touchstone is that you cannot cut-and-paste anything useful other than subimages. A test for a bitmap is to scale it. As it gets larger it gets fuzzier. Pixels may appear which get larger with the magnification. Some bitmaps preserve all the pixels (e.g. TIFF, BMP). Some compress the file. PNG compresses without loss (a PNG can be reconverted to the corresponding BMP). JPEG is a lossy format – you cannot recreate the uncompressed bitmap. It was designed for photographs, where it is excellent. The use of JPEG compression for scientific diagrams is completely unnecessary and an act of information destruction. No publisher should ever use JPEG except – possibly – for photographs.
  • Text. Most scientific documents have semi-structured text. Subsections can be cut-and-pasted (perhaps with loss of fonts, etc.). If science were only communicated with text (e.g. like most literature) we wouldn’t have a major problem. But the text is only PART of a scientific document. Unfortunately terms such as “fulltext”, “textmining” suggest that the only valuable stuff is full text.
  • Vector graphics. Most diagrams are authored as vector graphics with tools such as Inkscape, Powerpoint, etc. There is a menu of objects (lines, rectangles, circles, text, etc.). It is generally easy to create scientific diagrams of medium quality using these. (It is not easy to create graphic arts.) Typical media of transmission are SVG and EPS (Postscript). Many machines (e.g. spectrometers) create vector graphics. Vector graphics are scalable. If you magnify the display even to 4000% all the lines will be sharp and the text will have clean edges. This is an almost infallible test for VG. Almost all scientific diagrams start life as vector graphics but many get converted into bitmaps. The use of bitmaps for scientific diagrams is completely unnecessary and an act of information destruction. No publisher should ever use PNG where they start with a vector graphics object.

The major technologies for scientific publishing are:

  • TeX/LaTeX. This is a semi-structured, semi-semantic language of great vision and great value to science. A large amount of science can be reconstructed from it through content-mining. No publisher should ever destroy LaTeX, and where possible they should publish it – it is far more valuable than PDF. LaTeX often uses EPS as its vector graphics. I would generally be happy to get a paper in *.tex for input to #AMI2.
  • Word. Early versions of Word are proprietary and have a hideous internal structure which is almost impossible to manage without MS tooling. Modern Word uses XML (OOXML). Leaving aside the politics of the OOXML standard process, I will say that it's a reasonably well-structured, if bloated, technology. Word can contain other XML technologies such as Chemical Markup Language (CML). If OOXML were openly and usefully available on non-MS frameworks I would make stronger recommendations for it. However, OOXML is tractable and I would be happy to get a scientific document in *.docx.
  • XHTML. Most publishers provide XHTML as a display format. This is a good thing. The downside is that it isn’t easy to store and distribute XHTML. The images and often other components are separate, fragmented. It is a major failing of the W3C effort that there isn’t a platform independent specification for packaging compound documents.
  • PDF. If you think PDF is an adequate format for conveying modern science to humans then you are probably sighted and probably don't use machines to help augment your brain. PDF is good for some things and terrible at others. Since >99% of the scientific literature is distributed as PDF (despite its origins) I have very reluctantly come to accept that I have to work with it. Like Winston Smith in 1984 I have realised that I "love" PDF.

A few words, then, about PDF. There will be many more later.

  • PDF is a page oriented spec. The popularity of this is driven by people who sell pages – publishers. We still have books with pages, but we have many other media – including XHTML/CSS/SVG/RDF which are much more popular with modern media such as the BBC. Pages are an anachronism. AMI2 will remove the pages (among many other things).
  • PDF is designed for printing. PDF encapsulates Postscript, developed for PRINTERS. Everything in PDF internals screams Printed Page at you.
  • PDF is designed for sighted humans. It is the ink on the screen, not the semantics, that conveys information. That's why it's a hard job training AMI2. But it can be done.
  • PDF has many proprietary features. That doesn’t mean that we cannot ultimately understand them and it’s more Open than it was, but there isn’t a bottom-up community as for HTML and XML.
  • PDF is a container format. You can add a number of other things (mainly images and vector graphics) and they don’t get lost. That’s a good thing. There are very few around (G/ZIP is the most commonly used). Powerpoint and Word are also container formats. We desperately need an Open container.
  • PDF is largely immutable. If you get one it is generally read-only. Yes there are editors, but they are generally commercial and interoperability outside of major companies is poor. There are also mechanisms for encryption and DRM and other modern instruments of control. This can make it difficult to extract information.

So here is our overall plan.

  • Convert PDF to SVG. This is because SVG is a much more semantic format than PDF and much flatter. There is almost no loss on the conversion. The main problems come with font information (we’ll see that later). If you don’t mind about the font – and fonts are irrelevant to science – then all we need to do is extract the character information. This process is almost complete. Murray Jensen and I have been working with PDFBox and we have a wrapper which can convert clean PDF to SVG with almost no loss at a page/sec or better on my laptop. The main problem is strange fonts.
  • Create semantic science from the SVG. This is hard and relies on a lot of heuristics. But it’s not as hard as you might think and with a community it’s very tractable. And then we shall be able to ask AMI2 “What’s this paper about and can I have the data in it?”

Please let us have your feedback, and let us know if you'd like to help. Meanwhile, before the next post, here is an example of what we can do already. The first image is a snapshot of a PDF. The second is a snapshot of the SVG we produce. There are very small differences that don't affect the science at all. Can you spot any? And can you suggest why they happened?

And some more open-ended questions (there are many possible ways of answering). How would you describe:

  • The top right object in the diagram?

Because those are the sort of questions that we have to build into AMI2.

#opencontentmining Starting a community project and introducing #AMI2

This is the first post in (hopefully) a regular series on the development of Open Content Mining in scholarly articles (mainly STM = Science Technical Medical). It’s also a call for anyone interested to join up as a community. This post describes the background – later ones will cover the technology, the philosophy and the contractual and legal issues. I shall use #opencontentmining as a running hashtag.

I’m using the term “content mining” as it’s broader than “text-mining”. It starts from a number of premises:

  • The STM literature is expanding so quickly that no one human can keep up, even in their own field. There are perhaps 2 million articles per year – more than 5,000 every day. You could just about skim the titles, but no more. Many of them might be formally "outside" your speciality but actually contain valuable information.
  • A large part of scientific publication and communication is data. Everyone is becoming more aware of how important data is. It is essential to validate the science, and it can be combined with other data to create new discoveries. Yet most data is never published, and of the rest much ends up in the "fulltext" of the articles. (Note that "fulltext" is a poor term as there are lots of pictures and other non-text content. "Full content" would be more logical, although misleading in that papers only report a small percentage of the work done.)
  • The technology is now able to do some exciting and powerful things. Content-mining is made up of a large number of discrete processes, and as each one is solved (even partially) we get more value. This is combined with the increasing technical quality of articles (e.g. native PDF rather than camera-ready photographed text).

I used to regard PDF as an abomination. See my post 6 years ago: http://blogs.ch.cam.ac.uk/pmr/2006/09/10/hamburgers-and-cows-the-cognitive-style-of-pdf/. I quoted the maxim "turning a PDF into XML is like turning a hamburger into a cow" (not mine, but I am sometimes credited with it). XML is structured semantic text. PDF is a (random) collection of "inkmarks on paper". The conversion destroys huge amounts of information.

I still regard PDF as an abomination. I used to think that force of argument would persuade authors and publishers to change to semantic authoring. I still think that has to happen before we have modern scientific communication through articles.

But in the interim I and others have developed hamburger2cow technology. It’s based on the idea that if a human can understand a “printed page” then a machine might be able to. It’s really a question of encoding a large number of rules. The good thing is that machines don’t forget rules and they have no limit to the size of their memory for them. So I have come to regard PDF as a fact of life and a technical problem to be tackled. I’ve spent the last 5 months hacking at it (hence few blog posts) and I think it’s reached an alpha stage.

And also it is parallelisable at the human level. I and others have developed technology for understanding chemical diagrams in PDF. You can use that technology. If you create a tool that recognizes sequence alignments, then I can use it. (Of course I am talking Open Source – we share our code rather than restricting its use). I have created a tool that interprets phylogenetic trees – you don’t have to. Maybe you are interested in hacking dose-response curves?

So little by little we build a system that is smarter than any individual scientist. We can ask the machine “what sort of science is in this paper?” and the machine will apply all the zillions of rules to every bit of information in an article. And the machine will be able to answer: “it’s got 3 phylogenetic trees, 1 sequence alignment, 2 maps of the North Atlantic, and the species are all sea-birds”.

The machine is called AMI2. Some time ago we had a JISC project to create a virtual research environment and we called the software "AMI". That was short for "the scientists' amanuensis". An amanuensis is a scholarly companion; Eric Fenby assisted the blind composer Frederick Delius in writing down the notes that Delius dictated. So AMI2 is the next step – a scientifically artificially intelligent program. (That's not as scary as it sounds – we are surrounded by weak AI everywhere, and it's mainly a question of glueing together a lot of mature technologies.)

AMI2 starts with two main technologies – text-mining and diagram-mining. Textmining is very mature and could be deployed on the scientific literature tomorrow.

Except that the subscription-based publishers will send lawyers after us if we do. And that is 99% of the problem. They aren’t doing text-mining themselves but they won’t let subscribers do it either.

But there is 5% of the literature that can be text-mined – that with a CC-BY licence. The best examples are BioMedCentral and PLoS. Will 5% be useful? No-one knows, but I believe it will. And in any case it will get the technology developed. And there is a lot of interest from funders – they want their outputs to be mined.

So this post launches a community approach to content-mining. Anyone can take part as long as they make content and code Open (CC-BY-equiv and OSI F/OSS). Of course the technology can be deployed on closed content and we are delighted for that, but the examples we use in this project must be Open.

Open communities are springing up everywhere. I have helped to launch one – the Blue Obelisk – in chemistry. It’s got 20 different groups creating interoperable code. It’s run for 6 years and its code is gradually replacing closed code. A project in content-mining will be even more dynamic as it addresses unmet needs.

So here are some starting points. Like all bottom-up projects expect them to change:

  • We shall identify some key problems and people keen to solve them
  • We’ll use existing Open technology where possible
  • We’ll educate ourselves as we go
  • We’ll develop new technologies where they are needed
  • Everything will be made Open on the web as soon as it is produced. Blogs, wikis, repositories – whatever works

If you are interested in contributing to #opencontentmining in STM please let us know. We are at alpha stage (i.e. you need to be prepared to work with and test development systems – there are no gold-plated packages, and there probably never will be). There's lots of scope for biologists, chemists, material scientists, hackers, machine-learning experts, document-hackers (esp. PDF), legal, publishers, etc.

Content-mining is set to take off. You will need to know about it. So if you are interested in (soluble) technical challenges and contributing to an Open community let’s start.

[More details in next post – maybe about phylogenetic trees.]

Update: I am off to CSIRO(AU), eResearch2012, Open Content Mining, AMI2, PDF hacking etc.

I haven't blogged for some time because I have been busy elsewhere – going to #okfest, #odlc (Open Data La conference in Paris) and preparing for a significant stay (~3 months) with CSIRO in Clayton (Melbourne, AU).

I’m in AU at the invitation of Nico Adams and CSIRO as a visiting researcher. When we were daily colleagues Nico pioneered the use of the semantic web for chemistry and materials. He is ahead of the game, but chemistry is slowly waking up to the need for semantics. We’ll be working on themes such as:

  • Formal semantics and ontologies for materials science
  • Open Content Mining for chemical/materials data (AMI2)

As part of this I intend to create materials for learning and using CML (Chemical Markup Language), in weekly chunks. If anyone is interested I’m offering to run a weeklyish series of low key workshops on Semantic Chemistry and more generally Semantic Physical Science (Nico and I ran a day on SPS last year at eResearch Australia 2011). Maybe there will be enough material for a book, and if you know me it won’t be a conventional book. It could be a truly open-authored book if there is interest. Almost certainly Open Content. I’ve registered for eResearch 2012 in Sydney 28-1 Oct so if anyone is going we shall meet. Not doing any workshops this time round.

I'm working hard on Open Content Mining. I've developed a generic tool for extracting semantic information from PDFs (yes) called AMI2. It results from many months of fairly solid hacking and several previous years of exploration. In the initial cases I have been able to get 100% accuracy from some subsets of PDFs and I'll be taking you through this in blogs. Ross and I are applying it to phylogenetics and we expect to be able to extract a lot of trees from the literature.

We'll be confining ourselves to BMC and PLoS material (with BMC being technically easier). I've downloaded 1000 potential papers and Ross Mounce will be annotating 80 of them as to whether they contain phylogenetics, where it is, etc. Content mining requires hard, boring graft to create a trustable system but the effort is worth it.

We can’t use it on Molecular Phylogenetics and Evolution although it has a lot of trees. Why Not? [Regular readers will know the answer].

And some recent experiences with Open. #okfest was incredible – a real feel that the world was changing and we and others were changing it. It’s the real sense of “Open”. Open isn’t just a licence or a process – it’s a community and a state of mind. It’s joyful, risk taking, collaborative, global.

And Open Scholarship? Well, mostly it doesn't exist, and I'm having difficulty seeing where it's going to come from. Open Scholarship consists of at least Open Access, Authoring, Bibliography, Citations, Data, Science, etc. Of these only Open Science has a true Open agenda, community and practice (inspired by Joseph Jackson, Mat Todd and others who want to change the world). Open Access is not Open in the modern sense of the word. The initial idealism in 2002 was great, but since then it's become factional, cliquey and authoritarian in large part. Open Access is complex and needs serious public discussion, but this is frequently shouted down. [I sat through an hour's plenary lecture at Digital Research from Stevan Harnad on "why the RCUK is wrong and must change its policy" with the subtitle "What Peter Murray-Rust thinks and why he is wrong". The views attributed to me were not mine and his conclusions erroneous, but he doesn't listen to me and many others. He has now mounted a public attack on RCUK. This will help no-one other than reactionary money-oriented publishers.]

I have been meaning to blog about Open Access for some time, but each time wonder whether I would do more harm than good. However I think it is now important to have proper public discussion about the serious issues, and Open Access Week may be an opportunity. As an example of the problem I find it very hard to find any centre to “Open Access” – who runs Open Access? What’s its purpose? Is there a consensus? Where can I expect to have a proper discussion without being insulted? Because if questions like this are not answered the movement (in so far as it *is* a movement) will surely fracture. And unless new coherent visions emerge then the losers will be academia but even more the SCHOLARLY POOR.