#opencontentmining The #ami2 project for understanding the scientific literature and why I “love” PDF

Earlier this week I blogged about our new Open project #AMI2 http://blogs.ch.cam.ac.uk/pmr/2012/10/15/opencontentmining-starting-a-community-project-and-introducing-ami2/ . This is an Open community project to create a complete infrastructure for machine understandability of the scientific literature. That’s a bold statement, especially for something with no formal funding, but in the present era that’s not necessarily a show-stopper. Because when we grow the community we can do almost anything. Of course AMI2 can’t understand everything, but she will shortly be able to outstrip several current human activities.

And we’ve had several offers of help. Welcome to all those who have mailed. We have critical mass and Ross and I will work out how to create a community platform. Because #ami2 is Y/OUR project.

The scientific literature covers 300 Billion USD of publicly funded endeavour and costs 15 Billion USD to publish. Many of the results are inaccessible due to old-fashioned means of publication and the difficulty of understanding them. So if people want to help build AMI2 it will have a very important impact.

The current approach to this is to get authors to produce semantic documents. I agree that this is the best way. But authors aren’t interested and publishers are incredibly conservative. So we have to start the other way. By creating #ami2 to understand the current literature in the way that humans do. And much of this can be done by gluing together the technology that already exists. In these blogs we are going to show that, in specific domains, it can be done. NOW! And we make the assumption that it’s not too difficult to build similar technology in parallel in other domains.

So we are going to start with the following disciplines:

  • Phylogenetic trees
  • X-Y plots
  • Chemical diagrams

Of these only the chemistry is at all hard to understand if you are 12 years old. Everyone can understand trees and everyone can understand graphs (because in the information age every citizen should be able to understand a graph). So anyone should be able to follow.

Here’s the AMI2 process:

[I authored the diagram in SVG but cannot use Word+Wordpress to publish it. Anyone know how? So apologies for the PNG – it goes against the philosophy of the project.]

There are three components to this document.

  • Text. There are many uses for text (discussion, tables, references, metadata) but they all use characters and the general technology is the same. We’ll see the distinctions later.
  • Diagrams. I use this to mean objects which are created from lines, circles, text, etc. and where there are a number of well-defined objects.
  • Images. This covers bitmaps, where there is no formal substructure to the object. Photographs are a common type of image.

A document like the above can be represented with different technologies. I distinguish:

  • Bitmaps. Here only the pixels in the (printable) page are transmitted. By default there is no understandable document content. A typical touchstone is that you cannot cut-and-paste anything useful other than subimages. A test for a bitmap is to scale it. As it gets larger it gets fuzzier, and individual pixels may become visible and grow with the magnification. Some bitmaps preserve all the pixels (e.g. TIFF, BMP). Some compress the file. PNG compresses without loss (a PNG can be reconverted to the corresponding BMP). JPEG is a lossy format: you cannot recreate the uncompressed bitmap. It was designed for photographs, where it is excellent. The use of JPEG compression for scientific diagrams is completely unnecessary and an act of information destruction. No publisher should ever use JPEG except, possibly, for photographs.
  • Text. Most scientific documents have semi-structured text. Subsections can be cut-and-pasted (perhaps with loss of fonts, etc.). If science were only communicated with text (e.g. like most literature) we wouldn’t have a major problem. But the text is only PART of a scientific document. Unfortunately terms such as “fulltext”, “textmining” suggest that the only valuable stuff is full text.
  • Vector graphics. Most diagrams are authored as vector graphics with tools such as Inkscape, PowerPoint, etc. There is a menu of objects (lines, rectangles, circles, text, etc.). It is generally easy to create scientific diagrams of medium quality using these. (It is not easy to create graphic arts). Typical media of transmission are SVG and EPS (PostScript). Many machines (e.g. spectrometers) create vector graphics. Vector graphics are scalable. If you magnify the display even to 4000% all the lines will be sharp and the text will have clean edges. This is an almost infallible test for VG. Almost all scientific diagrams start life as vector graphics but many get converted into bitmaps. The use of bitmaps for scientific diagrams is completely unnecessary and an act of information destruction. No publisher should ever use PNG where they start with a vector graphics object.
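The scaling test described above can be tried on a toy example. What follows is only a minimal sketch in Python, not anything from AMI2: the 4x4 bitmap, the `scale_bitmap` and `scale_line` helpers and the nearest-neighbour scaling are all assumptions made for illustration. It shows the same diagonal stroke held once as pixels and once as a vector object.

```python
# A 4x4 one-bit bitmap of a diagonal stroke: 1 = ink, 0 = background.
bitmap = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]


def scale_bitmap(pixels, factor):
    """Nearest-neighbour upscaling: each pixel becomes a factor x factor block."""
    scaled = []
    for row in pixels:
        wide_row = [p for p in row for _ in range(factor)]
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled


# The same stroke as a vector object: just its two end points.
line = {"x1": 0.0, "y1": 0.0, "x2": 4.0, "y2": 4.0}


def scale_line(obj, factor):
    """Scaling a vector object only multiplies its coordinates."""
    return {key: value * factor for key, value in obj.items()}


big_bitmap = scale_bitmap(bitmap, 3)
big_line = scale_line(line, 3)

# The enlarged bitmap is a staircase of 3x3 blocks, not a smooth line.
for row in big_bitmap:
    print("".join("#" if p else "." for p in row))

# The enlarged vector line is still one exact line.
print(big_line)
```

The enlarged bitmap has no more information than the original 16 pixels, so the stroke turns into visible blocks. The vector line is still described by two exact end points, and a renderer can draw it sharply at any magnification.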

The major technologies for scientific publishing are:

  • TeX/LaTeX. This is a semi-structured, semi-semantic language of great vision and great value to science. A large amount of science can be reconstructed from it through content-mining. No publisher should ever destroy LaTeX, and where possible they should publish it, since it is far more valuable than PDF. LaTeX often uses EPS as its vector graphics. I would generally be happy to get a paper in *.tex for input to #AMI2.
  • Word. Early versions of Word are proprietary and have a hideous internal structure which is almost impossible to manage without MS tooling. Modern Word uses XML (OOXML). Leaving aside the politics of the OOXML Standard process, I will say that it’s a reasonably well structured, if bloated, technology. Word can contain other XML technologies such as Chemical Markup Language (CML). If OOXML were openly and usefully available on non-MS frameworks I would make stronger recommendations for it. However the OOXML is tractable and I would be happy to get a scientific document in *.docx.
  • XHTML. Most publishers provide XHTML as a display format. This is a good thing. The downside is that it isn’t easy to store and distribute XHTML. The images and often other components are separate, fragmented. It is a major failing of the W3C effort that there isn’t a platform independent specification for packaging compound documents.
  • PDF. If you think PDF is an adequate format for conveying modern science to humans then you are probably sighted and probably don’t use machines to help augment your brain. PDF is good for some things and terrible at others. Since >99% of the scientific literature is distributed as PDF (despite its origins) I have very reluctantly come to accept that I have to work with it. Like Winston Smith in 1984 I have realised that I “love” PDF.

A few words, then, about PDF. There will be many more later.

  • PDF is a page oriented spec. The popularity of this is driven by people who sell pages – publishers. We still have books with pages, but we have many other media – including XHTML/CSS/SVG/RDF which are much more popular with modern media such as the BBC. Pages are an anachronism. AMI2 will remove the pages (among many other things).
  • PDF is designed for printing. PDF encapsulates PostScript, developed for PRINTERS. Everything in PDF internals screams Printed Page at you.
  • PDF is designed for sighted humans. It is the ink on the screen, not the semantics, that conveys information. That’s why it’s a hard job training AMI2. But it can be done.
  • PDF has many proprietary features. That doesn’t mean that we cannot ultimately understand them and it’s more Open than it was, but there isn’t a bottom-up community as for HTML and XML.
  • PDF is a container format. You can add a number of other things (mainly images and vector graphics) and they don’t get lost. That’s a good thing. There are very few container formats around (G/ZIP is the most commonly used). PowerPoint and Word are also container formats. We desperately need an Open container.
  • PDF is largely immutable. If you get one it is generally read-only. Yes there are editors, but they are generally commercial and interoperability outside of major companies is poor. There are also mechanisms for encryption and DRM and other modern instruments of control. This can make it difficult to extract information.
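To make the container idea concrete, here is a minimal sketch in Python using ZIP, the container mentioned above. It is not a proposal for a standard: the part names (`paper.xhtml`, `figures/fig1.svg`) and their contents are made up for illustration. It only shows that text and a vector diagram can travel together in one file and come out unchanged.

```python
import io
import zipfile

# Hypothetical parts of one compound document (names and content are made up).
parts = {
    "paper.xhtml": (
        '<html xmlns="http://www.w3.org/1999/xhtml">'
        "<body><p>See figure 1.</p></body></html>"
    ),
    "figures/fig1.svg": (
        '<svg xmlns="http://www.w3.org/2000/svg">'
        '<line x1="0" y1="0" x2="10" y2="10"/></svg>'
    ),
}

# Pack every part into a single in-memory ZIP container.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as container:
    for name, content in parts.items():
        container.writestr(name, content)

# Unpack it again and check that nothing was lost on the way.
with zipfile.ZipFile(buffer) as container:
    names = container.namelist()
    recovered = {name: container.read(name).decode("utf-8") for name in names}

print(names)
print(recovered == parts)
```

Unlike a typical PDF, every part here stays in its own open format, so the SVG figure can be pulled out and read as SVG.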

So here is our overall plan.

  • Convert PDF to SVG. This is because SVG is a much more semantic format than PDF and much flatter. There is almost no loss on the conversion. The main problems come with font information (we’ll see that later). If you don’t mind about the font – and fonts are irrelevant to science – then all we need to do is extract the character information. This process is almost complete. Murray Jensen and I have been working with PDFBox and we have a wrapper which can convert clean PDF to SVG with almost no loss at a page/sec or better on my laptop. The main problem is strange fonts.
  • Create semantic science from the SVG. This is hard and relies on a lot of heuristics. But it’s not as hard as you might think and with a community it’s very tractable. And then we shall be able to ask AMI2 “What’s this paper about and can I have the data in it?”
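To give a flavour of the second step, here is a minimal sketch in Python of one heuristic for an X-Y plot. It is an illustration only and not the AMI2 code: the `SVG_PAGE` content is a made-up stand-in for converted output, and the `find_axes` function and its rule (the longest horizontal and longest vertical lines are the axes) are assumptions. Real plots will need many more rules than this.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical stand-in for a converted page: flat lines and text, no semantics.
SVG_PAGE = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="150">
  <line x1="20" y1="130" x2="180" y2="130"/>
  <line x1="20" y1="130" x2="20" y2="10"/>
  <line x1="20" y1="90" x2="15" y2="90"/>
  <line x1="60" y1="130" x2="60" y2="135"/>
  <text x="90" y="148">time / s</text>
  <text x="2" y="70">yield</text>
</svg>"""


def find_axes(svg_text):
    """Guess the plot axes: the longest horizontal and longest vertical line."""
    root = ET.fromstring(svg_text)
    horizontal, vertical = [], []
    for line in root.iter(SVG_NS + "line"):
        x1, y1, x2, y2 = (float(line.get(k)) for k in ("x1", "y1", "x2", "y2"))
        if y1 == y2:
            horizontal.append((abs(x2 - x1), (x1, y1, x2, y2)))
        elif x1 == x2:
            vertical.append((abs(y2 - y1), (x1, y1, x2, y2)))
    return {
        # Short lines (tick marks) lose to the long axis lines.
        "x_axis": max(horizontal, default=(0.0, None))[1],
        "y_axis": max(vertical, default=(0.0, None))[1],
        # Every text element is kept as a candidate axis label.
        "labels": [t.text for t in root.iter(SVG_NS + "text")],
    }


plot = find_axes(SVG_PAGE)
print(plot["x_axis"])
print(plot["y_axis"])
print(plot["labels"])
```

Because the SVG is flat, the whole page can be read with a standard XML parser and the heuristics become simple geometry on a list of lines and text.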

Please let us have your feedback and tell us if you’d like to help. Meanwhile, before the next post, here is an example of what we can do already. The first image is a snapshot of a PDF. The second is a snapshot of the SVG we produce. There are very small differences that don’t affect the science at all. Can you spot any? And can you suggest why they happened?

And some more open-ended questions (there are many possible ways of answering). How would you describe:

  • The top right object in


Because those are the sort of questions that we have to build into AMI2.


