Monthly Archives: December 2012

#opencontentmining MASSIVE step forward. Come and join us in the UK!

The UK government has now given the go-ahead to the major reforms proposed by the Hargreaves committee. The message is now very simple:



Business Secretary, Vince Cable said:

"Making the intellectual property framework fit for the 21st century is not only common sense but good business sense. Bringing the law into line with ordinary people's reasonable expectations will boost respect for copyright, on which our creative industries rely.

"We feel we have struck the right balance between improving the way consumers benefit from copyright works they have legitimately paid for, boosting business opportunities and protecting the rights of creators."

In his review of intellectual property and growth, Professor Hargreaves made the case for the UK making greater use of these exceptions, which are allowed under EU law. In response to a consultation earlier this year, the Government will make changes to:


  • Data analytics for non-commercial research - to allow non-commercial researchers to use computers to study published research results and other data without copyright law interfering;


These changes could contribute at least £500m to the UK economy over 10 years, and perhaps much more from reduced costs, increased competition and by making copyright works more valuable.

In addition the Government will introduce a new, non-statutory system for clarifying areas where there is confusion or misunderstanding on the scope and application of copyright law. Copyright notices will be issued by the Intellectual Property Office. These notices are intended to clarify, but not make new law.

It makes it clear that publishers cannot set licence terms that override this.

New measures include provisions to allow copying of works for individuals' own personal use, parody and for the purposes of quotation. They allow people to use copyright works for a variety of valuable purposes without permission from the copyright owners. They will also bring up to date existing exceptions for education, research and the preservation of materials.

We can start content-mining today (and we shall).

Copyright is complex and some of the questions are not easy to answer. So there is a provision for copyright-holders to appeal to the Secretary of State if they don't like it. If publishers can convince Vince Cable that my activities are a threat to the health of the UK economy I'll stop.

So everyone should adopt the principle:

If you have a right to the content you have a right to mine it

You DON'T have to ask permission.

What about non-UK people? Just come and visit us here! You will then be governed by the law of the UK. And we'd love to see you.



AMI2 Content mining using PDF and SVG: progress

I'm now returning to UK for a few weeks before coming back to AU to continue. This is a longish post but important for anyone wanting to know the details of how we build an intelligent PDF reader and what it will be able to do. Although the examples are chemistry-flavoured the approach applies to a wide range of science.

To recall…

AMI2 is a project to build an intelligent reader of the STM literature. The base is PDF documents (though Word, HTML and LaTeX will also be possible and much easier and of higher quality). There are three phases at present (though this and the names may change):

  • PDF2SVG. This converts good PDF losslessly into SVG characters, paths and images. It works well for (say) student theses and ArXiV submissions but fails for most STM publisher PDFs because the quality of the "typesetting" is non-conformant and we have to use clunky, fragile heuristics. More in later blogs and below.
  • SVGPLUS. This turns low-level SVG primitives (characters and paths) into higher-level a-scientific objects such as paragraphs, sections, words, subscripts, rectangles, polylines, circles, etc. In addition it analyses components that are found universally in science (figures, tables, maths equations) and scientific document structure. It also identifies graphs, plots, etc. (but not chemistry, sequences, trees…)
  • SVG2XML. This interprets SVGPLUS output as science. At present we have prototyped chemistry, phylogenetics, spectroscopy and have a plugin architecture that others can build on. The use of SVG primitives makes this architecture much simpler.

We've written a report and here are salient bits. It's longish so mainly for those interested in the details. But it has a few pictures…

PDFs and their interpretation by PDF2SVG


Science is universally published as PDF documents, usually created by machine and human transformation of Word or LaTeX documents. Almost all major publishers regard "the PDF" as the primary product (version of record) and most scientists read and copy PDFs directly from the publishers' web sites; the technology is independent of whether this is Open or closed access. Most scientists read, print and store large numbers of PDFs locally to support their research.

PDF was designed for humans to read and print, not for semantic use. It is primarily "electronic paper" – all that can be guaranteed is coloured marks on "e-paper". It was originally proprietary and has only fairly recently become an ISO standard. Much of the existing technology is proprietary and undocumented. By default, therefore, a PDF only conveys information to a sighted human who understands the human semantics of the marks-on-paper.

Over 2 million scholarly publications are published each year, most only easily available in PDF. The scientific information in them is largely lost without an expert human reader, who often has to transcribe the information manually (taking huge time and effort). Some examples:

In a PDF these are essentially black dots on paper. We must develop methods to:

  • PDF2SVG: Identify the primitives (in this case characters, and symbols). This should be fairly easy but because the technical standard of STM publishing is universally very non-conformant to standards (i.e. "poor") we have had to create a large number of arbitrary rules. This non-conformity is a major technical problem and would be largely removed by the use of UTF-8 and Unicode standards.
  • SVGPLUS (and below): Understand the words (e.g. that "F"-"i"-"g" and "E"-"x"-"c"-"e"-"s"-"s" are words). PDF has no concept of "word", "sentence", "paragraph", etc.
  • Detect that this is a Figure (e.g. by interpreting "Fig. ")
  • Separate the caption from the plot
  • Determine the axial information (labels, numbers and ticks) and interpret (or here guess) units
  • Extract the coordinates of points (black circles)
  • Extract the coordinates of the line

If the PDF is standards-compliant it is straightforward to create the SVG. We use the Open Source PDFBox from Apache to "draw" to a virtual graphics device. We intercept these graphics calls and extract information on:

  • Position and orientation. PDF objects have x,y coordinates and can be indefinitely grouped (including scaling). PDF resolves all of this into a document on a virtual A4 page (or whatever else is used). The objects also have style attributes (stroke and fill colours, stroke-widths, etc.). Most scientific authors use simple colours and clean lines which makes the analysis easier.
  • Text (characters). Almost all text is individual characters which can be in any order ("The" might be rendered in the order "e"-"h"-"T"). Words are created knowing the screen positions of their characters. In principle all scientific text (mathematical equations, chemical symbols, etc.) can be provided in the Unicode toolset (e.g. the reversible chemical reaction symbol ⇌ is the Unicode point U+21CC or HTML entity &#x21cc; and will render as such in all modern browsers).

  • Images. These are bitmaps (normally rectangular arrays of pixels) and can be transported as PNG, GIF, JPEG, TIFF, etc. There are cases (e.g. photographs of people or scientific objects) where bitmaps are unavoidable. However some publishers and authors encode semantic information as bitmaps, thereby destroying it. Here is an example:

    Notice how the lines are fuzzy (although the author drew them cleanly). It is MUCH harder to interpret such a diagram than if it had been encoded as characters and lines. Interpretation of bitmaps is highly domain-dependent and usually very difficult or impossible. Here is another (JPEG)

    Note the fuzziness which is solely created by the JPEG (lossy) compression. Many OCR tools will fail on such poor quality material

  • Path (graphics primitives). These are used for objects such as
    • graphical plots (x-y, scatterplots, bar charts)
    • chemical structures

      This scheme, if drawn with clean lines, is completely interpretable by our software as chemical objects

    • diagrams of apparatus
    • flowcharts and other diagrams expressing relationships

    Paths define only Move, Line, Curve. To detect a rectangle SVGPLUS has to interpret these commands (e.g. MLLLL).
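
To give a flavour of this kind of interpretation, here is a minimal Python sketch that decides whether a path of one Move and four Line commands (MLLLL) is an axis-aligned rectangle. It is not the SVGPLUS code: the tuple representation of a path, the function name and the tolerance are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a path is a list of (command, x, y) tuples,
# where command is "M" (move) or "L" (line).
Path = List[Tuple[str, float, float]]


def as_rectangle(path: Path, eps: float = 0.01) -> Optional[Tuple[float, float, float, float]]:
    """Return (x, y, width, height) if the path is an axis-aligned rectangle, else None."""
    commands = "".join(cmd for cmd, _, _ in path)
    if commands != "MLLLL":
        return None
    points = [(x, y) for _, x, y in path]
    # The last line must return to the starting point to close the shape.
    if abs(points[0][0] - points[4][0]) > eps or abs(points[0][1] - points[4][1]) > eps:
        return None
    # Each edge must be horizontal or vertical, and the two kinds must alternate.
    kinds = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        horizontal = abs(y1 - y0) <= eps and abs(x1 - x0) > eps
        vertical = abs(x1 - x0) <= eps and abs(y1 - y0) > eps
        if horizontal == vertical:  # both False: sloped or zero-length edge
            return None
        kinds.append("H" if horizontal else "V")
    if "".join(kinds) not in ("HVHV", "VHVH"):
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
```

A real detector would also have to cope with rectangles drawn as four separate paths, as MLLL plus an implicit close, or with rounded corners, which is where the heuristics multiply.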

There is, unfortunately, a large number of errors and uncertainties. The most common is the use of non-standard, non-documented encodings for characters. These come from proprietary tools (such as Font providers of TeX, etc.) and from contracted typesetters. In these cases we have to cascade down:

  • Guess the encoding (often Unicode-like)
  • Create a per-font mapping of names to Unicode. Thus "MathematicalPi-One" is a commonly used tool for math symbols: its "H11001" is drawn as a PLUS and we translate to Unicode U+002B, but there is no public (or private) translation table (we've asked widely). So we have to do this manually by comparing glyphs (the printed symbol) to tables of Unicode glyphs. There are about 20 different "de facto" fonts and symbol sets in wide scientific use and we have to map them manually (maybe while watching boring cricket on TV). We have probably done about 60% of what is required.
  • Deconstruct the glyphs. Ultimately the PDF provides the graphical representation of a glyph on the screen, either as vectors or as a bitmap. We recently discovered a service (shapecatcher) which interprets up to 11,000 Unicode glyphs and is a great help. Murray Jensen has also written a glyph browser which cuts down the human time very considerably.
  • Apply heuristics. Sometimes authors or typesetters use the wrong glyph or kludge it visually. Here's an example:

    Most readers would read as "ten-to-the-minus-seven" but the characters are actually "1", "0", EM-DASH, "7". EM-DASH – which is used to separate clauses like this – is not a mathematical sign so it's seriously WRONG to use it. We have to add heuristics (a la UNIX lint) to detect and possibly correct. Here's worse. There's a perfectly good Unicode symbol for NOT-EQUALS (U+2260)

    Unfortunately some typesetters will superimpose an EQUALS SIGN (=) with a SLASH (/). This is barbaric and hard and tedious to detect and resolve. The continued development of PDF2SVG and SVGPLUS will probably be largely hacks of this sort.
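
The first two steps of the cascade and one lint-style heuristic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PDF2SVG implementation: the function names are hypothetical, the table holds only the single MathematicalPi-One entry mentioned above, and the "Unicode-like" guess assumes glyph names of the form "uni" followed by four hexadecimal digits.

```python
import re
from typing import Dict, Optional

# Hand-built per-font tables mapping font-specific glyph names to Unicode.
# Only this one entry comes from the text; real tables would be far larger.
FONT_TABLES: Dict[str, Dict[str, str]] = {
    "MathematicalPi-One": {"H11001": "\u002B"},  # drawn as PLUS
}

# Assumed naming convention for "Unicode-like" glyph names, e.g. "uni2260".
UNI_NAME = re.compile(r"uni([0-9A-Fa-f]{4})")

EM_DASH = "\u2014"
MINUS_SIGN = "\u2212"
DASH_BETWEEN_DIGITS = re.compile("(?<=[0-9])" + EM_DASH + "(?=[0-9])")


def to_unicode(font_name: str, glyph_name: str) -> Optional[str]:
    """Guess a Unicode-like name first, then fall back to the per-font table."""
    match = UNI_NAME.fullmatch(glyph_name)
    if match:
        return chr(int(match.group(1), 16))
    # None means unresolved: a human has to compare the glyph by eye.
    return FONT_TABLES.get(font_name, {}).get(glyph_name)


def lint_dashes(text: str) -> str:
    """Crude heuristic: an EM DASH squeezed between two digits is probably a minus."""
    return DASH_BETWEEN_DIGITS.sub(MINUS_SIGN, text)
```

The dash rule is deliberately naive. It would also rewrite a page range typed with an EM DASH, so a usable version would need to check that the dash and the following digit are superscripted.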

SVG and reconstruction to semantic documents SVGPLUS


SVGPLUS assumes a correct SVG input of Unicode characters, SVG Paths, and SVGImages (the latter it renders faithfully and leaves alone). The task is driven by a control file in a declarative command language expressed in XML. We have found this to be the best method of representing the control, while preserving flexibility. It has the advantage of being easily customisable by users and, because it is semantic, can be searched or manipulated. A simple example:

<semanticDocument xmlns="">
  <documentIterator filename="org/xmlcml/svgplus/action/ ">
    <variable name="p.root" value="${d.outputDir}/whitespace_${}" type="file"/>
    <whitespaceChunker depth="3" />
    <boxDrawer xpath="//svg:g[@LEAF='3']" stroke="red" strokeWidth="1" fill="#yellow" opacity="0.2" />
    <pageWriter filename="${p.root}_end.svg" />
  </documentIterator>
</semanticDocument>

This document identifies the directory to use for the PDFs ("action"), iterates over each PDF it finds, creates (SVG) pages for each, processes each of those with a whitespaceChunker (v.i.) and draws boxes round the result and writes each page to file. (There are many more components in SVGPLUS for analysing figures, etc). A typical example is:


SVGPLUS has detected the whitespace-separated chunks and drawn boxes round the "chunks". This is the start of the semantic document analysis. This follows a scheme:

  • Detect text chunks and detect the font sizes.
  • Sort into lines by Y coordinate and sort within lines by X coordinate. The following has 5 / 6 lines:



    Normal, superscript, normal, subscript (subscript), normal

  • Find the spaces. PDF often has no explicit space characters – the spaces have to be calculated by intercharacter distance. This is not standard and is affected by justification and kerning.
  • Interpret variable font-size as sub- and super-scripts.
  • Manage super-characters such as the SIGMA.
  • Join lines. In general one line can be joined to the next by adding a space. Hyphens are left as their interpretation depends on humans and culture. The output would thus be something like:

    the synthesis of blocks, stars, or other polymers of com~plex architecture. New materials that have the potential of revolutionizing a large part …

    This is the first place at which words appear.

  • Create paragraphs. This is through indentation heuristics and trailing characters (e.g. FULL STOP).
  • Create sections and subsections. This is normally through bold headings and additional whitespace. Example:

    Here the semantics are a section (History of RAFT) containing two paragraphs
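
The first steps of this scheme (sorting characters into lines and inferring the spaces) can be sketched in Python as follows. This is a simplified illustration, not the SVGPLUS algorithm: the Char record, the tolerances and the function names are assumptions, and Y is assumed to increase down the page as in SVG.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Char:
    """Hypothetical character record: text, left x, baseline y, width, font size."""
    text: str
    x: float
    y: float
    width: float
    size: float


def group_into_lines(chars: List[Char], y_tol: float = 1.0) -> List[List[Char]]:
    """Sort by Y, start a new line when Y jumps by more than y_tol, then sort each line by X."""
    lines: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: c.y):
        if lines and abs(ch.y - lines[-1][-1].y) <= y_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return [sorted(line, key=lambda c: c.x) for line in lines]


def line_to_text(line: List[Char], gap_factor: float = 0.25) -> str:
    """Insert a space wherever the gap to the previous character exceeds gap_factor * font size."""
    out: List[str] = []
    prev: Optional[Char] = None
    for ch in line:
        if prev is not None and ch.x - (prev.x + prev.width) > gap_factor * prev.size:
            out.append(" ")
        out.append(ch.text)
        prev = ch
    return "".join(out)
```

With a tight Y tolerance, sub- and superscripts come out as separate short "lines", which matches the 5 / 6 lines described above; a later pass would use the smaller font size to attach them to their parent line. The fixed gap_factor is the weakest part, since justification and kerning change the inter-character distances from line to line.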


The PATH interpretation is equally complex and heuristic. In the example below:

The reversible reaction is made up of two ML paths ("lines") and two filled curves ("arrowheads"). All this has to be heuristically determined. The arcs are simple CURVE-paths. (Note the blank squares are non-Unicode points)


In the axes of the plot

All the tick-marks are independent paths – SVGPLUS has to infer heuristically that it is an axis.
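
One possible heuristic of this kind, sketched in Python: treat a set of short segments as the ticks of a horizontal axis if they are all vertical, equally long, start from a common baseline and are evenly spaced. The segment representation, thresholds and function name are assumptions for illustration and are not taken from SVGPLUS.

```python
from typing import List, Tuple

# Hypothetical representation: each candidate tick is a line segment
# ((x0, y0), (x1, y1)) already extracted from an independent path.
Segment = Tuple[Tuple[float, float], Tuple[float, float]]


def looks_like_x_axis_ticks(segments: List[Segment], eps: float = 0.5, min_ticks: int = 3) -> bool:
    """True if the segments are vertical, equally long, share a baseline and are evenly spaced."""
    if len(segments) < min_ticks:
        return False
    xs, lengths, bases = [], [], []
    for (x0, y0), (x1, y1) in segments:
        if abs(x1 - x0) > eps:  # not vertical
            return False
        xs.append(x0)
        lengths.append(abs(y1 - y0))
        bases.append(max(y0, y1))
    if max(lengths) - min(lengths) > eps or max(bases) - min(bases) > eps:
        return False
    xs.sort()
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return min(gaps) > eps and max(gaps) - min(gaps) <= eps
```

The even-spacing test would reject a logarithmic axis and axes with major and minor ticks of different lengths, so this covers only the simplest linear case.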

In some diagrams there is significant text:

Here text and graphical primitives are mixed and have to be separated and analysed.


In summary SVGPLUS consists of a large number of heuristics which will reconstruct a large proportion (but not all) of scientific articles into semantic documents. The semantics do and will include:

  • Overall sectioning (bibliographic metadata, introduction, discussion, experimental, references/citations)
  • Identification and extraction of discrete Tables, Figures, Schemes
  • Inline bibliographic references (e.g. superscripted)
  • Reconstruction of tables into column-based objects (where possible)
  • Reconstruction of figures into caption and graphics
  • Possible interpretation of certain common abstract scientific graphical objects (graphs, bar charts)
  • Identification of chemical formulae and equations
  • Identification of mathematical equations

There will be no scientific interpretation of these objects


Domain specific scientific interpretation of semantic documents


This is being developed as a plugin-architecture for SVGPLUS. The intention is that a community develops pragmatics and heuristics for interpreting specific chunks of the document in a domain specific manner. We and our collaborators will develop plugins for translating documents into CML/RDF:

  • Chemical formulae and reaction schemes
  • Chemical synthetic procedures
  • Spectra (especially NMR and IR)
  • Crystallography
  • Graphical plots of properties (e.g. variation with temperature, pressure, field, molar mass, etc.)

More generally we expect our collaborators (e.g. Ross Mounce, Panton Fellow, paleophylogenetics at University of Bath UK) to develop:

  • Mathematical equations (into MathML).
  • Phylogenetic trees (into NEXML)
  • NA and protein sequences into standard formats
  • Dose-response curves
  • Box-plots


Fidelity of SVG rendering in PDF2SVG. This includes one of the very rare bugs we cannot solve:



Note that the equations are identical apart from the braces, which are mispositioned and too small. There is no indication in TextPosition as to where this scaling comes from.

In PDFReader the equation is correctly displayed (the text is very small so the screenshot is blurry; nonetheless it's possible to see that the brackets are correct).



Temporary Farewell to AU and thanks to some of its mammals

I go back to UK today and am finishing up in the Prahran (Melbourne) apartment where I've been for 2.5 months. Prahran is a great place to be – easy tram ride to the CBD (centre) of Melbourne and less than 1 hour commute to CSIRO (including walking) after I had figured the 5 and 64 and the (somewhat random) semi-express nature of the Cranbourne trains. Trams are generally great (except for the one that broke down in Swanston Street, which gums up everything, and the rather unpredictable nature of late trains). Excellent shuttle from Huntingdale station to Monash Univ.

Too many human mammals to thank but they include:

  • Nico Adams – unlimited praise and appreciation for him fixing this up. We are planning that I will be back next year, probably late Jan.
  • Murray Jensen (CSIRO) for his collaboration on AMI2 – Murray has a huge range of expertise and his knowledge of fonts was both unexpected and absolutely critical.
  • Everyone involved at CSIRO.
  • Dave Flanders and the Flanders-irregulars – a mixture of incipient OKF, hackers, meeting in Melbourne cafes where the wifi and coffee are great. (This is a fantastic aspect of the Melbourne scene: you can get café and free wifi at State Library of Vic, National Gallery, Fed Square, RMIT in Swanston (where Nico and I worked on reports), and next year.)
  • Connie and colleagues for the great time in Perth.
  • Mat Todd and colleagues for Sydney
  • The Astor cinema in Prahran/Chapel Street. It's a 1930s art deco cinema showing a mixture of classic films (Bogart, Bergman, Crawford, Stewart…) and new releases. TWO films per sitting and ice creams out of this world.
  • Prahran and its cafes. I am off to have brunch shortly. Wifi and great atmosphere.
  • The people we met on our travels down the Great Ocean Road and elsewhere – Wombat Cottage (with Wombats), Birdwatchers, Reserves (e.g. Tower Hill)…
  • And others that I've failed to add – sorry.

Lots of animals and birds – we've probably ticked 50+ AU birds. We're told Werribee sewage works is the place we must visit next time. The most interesting mammal was Thylarctos plummetus. This can be dangerous to humans (what isn't dangerous in AU?) but there are no recorded fatalities. Here's the best picture we could get:

Our guide wouldn't let us get any closer because of the potential danger. It's clearly not a Koala and it looks ready to fall out of the tree.

The animals are sad and excited to be going to UK. Here's AMI and AMI with some classic Australian tucker which I've had to leave behind:

We didn't manage to make any #animalgarden photocomics – too much to do hacking grotty PDFs :-(

See you soon…



My/Our talk to CSIRO Publishing. How should we communicate science? “This article uses recyclable scientific information”.

I've been working with Nico Adams at CSIRO (Melbourne/Clayton, AU) for nearly 3 months, supported by a Fellowship. CSIRO is a government institution similar in many ways to a National Laboratory. It does research (public and private) and publishes it. But it is also a publisher in its own right – everything from Chemistry, to Gliding mammals, to how to build your dream home. Nico and I have struck up a rapport with people in CSIRO publishing and today – my last full day in AU – we are going to visit and present some of what we have done and more generally have a discussion where we learn about what CSIRO Publishing does.

CSIRO publishes a range of journals and we'll be concentrating on that, though we'll also be interested in reports, books, etc. We've had the opportunity to work with public and non-public content and to use that as a guide to our technology development (all the software I write is, of course, Open Source). Among the questions I'll want to raise (not specifically CSIROPub) are:

  • Is the conventional journal type-setting process still needed? I will argue NO – that it costs money and makes information worse. ArXiV has totally acceptable typography in Word or LaTeX and this is better than most journals for content-mining, etc.
  • How should data be published? I shall take small-molecule crystal structures as an example. At present CSIRO sends crystal structures to CCDC where they are not openly accessible. I'll argue they should be part of the primary scientific record.

Nico will be talking about semantics – what it is and how it can be used. I think he'll hope to show the machine extraction of content from Aust. J Chem.

I'll probably play down the political aspect in my formal presentation. The main issue now is how we recreate a market where scientific communication (currently broken) can be separated from the awarding of scientific glory (reputation). I'll concentrate on the communication.

I have a simple, practical, understandable IMMEDIATE proposal addressing the document side of STM (this doesn't of course address the issues of data semantics or whatever).

  • The current primary documentary version of the scientific record should not be PDF but be a Word or LaTeX or HTML or XML (e.g. NLM-DTD) document.
  • All documents should use UTF-8 and Unicode.

There are zillions of Open tools that adhere to UTF-8 and Unicode.

Where PDFs are used they should adhere to current information standards, specifically:

A graduate thesis is a BETTER document than the output of almost any publisher I have surveyed. STM publishing destroys information quality. All documents I have looked at on ArXiV are BETTER than the output of STM publishing.

So I shall make the following proposals:

  • CSIRO publishing should publish in a standards-compliant manner.
  • CSIRO should make supplemental data Openly available (we'll take crystallography as the touchstone).

The average cost to the public for the publication of a scientific paper is around 3000 USD. The information quality is a disgrace. Some of that money can be saved by doing it better. It's similar to recycling. It makes sense to re-use your plastic bags, toilet paper, etc. (Yes, Healesville animal sanctuary promotes green bum-wiping to save the environment (technically recycled paper)).

Let's have a sticker:

"This journal promotes recycled scientific information"

I'll be presenting the work that Murray Jensen and I have been doing on AMI2. MANY thanks to Murray – he has been given an "AMI" in small acknowledgement.

Murray's AMI in typical Melbourne bush.

AMI progresses steadily. It's taken much longer than I thought primarily because STM publication quality is AWFUL. It's now at a stage where we can almost certainly make an STM publication considerably better. However Murray and I have hacked the worst. AMI2-PDF2SVG turns PDF and AWFUL-PDF into good Unicode-compliant SVG. I'm concentrating on AMI2-SVGPLUS which turns SVG into meaningful documents. Nearly there. Again the absurd process of creating double column justified PDF (that no scientist would willingly pay for) destroys information seriously and SVGPLUS has to recover it. Then the final exciting part will create science from the document.

I'll hope to present some today.