Culture for Semantic Materials and the Semantic Web

The semantic web depends on shared cultural values which include:

  • Sharing (resources, tools, data, etc.)
  • Collaboration
  • Use of and contribution to semantic web resources

Watch this short, compelling video http://www.youtube.com/user/nyuhsl/videos (http://www.youtube.com/watch?v=N2zK3sAtr-4 ) from NYU Health Sciences Library on why we MUST share data. Then…

…As we give and listen to presentations let’s ask ourselves:

  • Can I get hold of the data reported?
  • Can I read it into my machine?
  • Can I get the program used?
  • How would I know if I got the same results?

And when I run programs…

  • Could others repeat the work?
  • Could others build on my work?

What are we going to do to change this? Follow TBL: http://5stardata.info/

  • Make ALL our data available under open licence. Use CC-BY for text, CC0 for data.
  • Structure it. Use tables, not free text.
  • Use non-proprietary formats. Not Word, Not Excel.
  • Use W3C tools (XML, RDF, MathML, SVG) and their extensions to science (CML, etc.)
  • Create discipline-dependent dictionaries (see IUCr for crystallography http://www.iucr.org/resources/cif ).
  • Make our code OPEN (cf http://www.blueobelisk.org )
  • Put code and data on shared resources such as Bitbucket (http://bitbucket.org/petermr/pdf2svg ) and GitHub.
  • Build Wikipedia-like and OpenStreetmap-like communal resources. See http://wwmm.ch.cam.ac.uk/crystaleye for crystals
  • Develop unique identifier systems
  • Link to other resources

And let’s do this through HACKFESTs.

Posted in Uncategorized | Leave a comment

Semantic Science

Today I’m putting forward the value of semantics for materials science and, more generally, computational science: /pmr/2013/02/03/the-semantic-web-for-materials-science-and-a-great-day-for-melbourne/

Semantics is not just a technology – it’s an attitude of mind. It’s about how we can create a world where humans do what they are best at and machines do what they are best at. Currently that doesn’t happen in much of science (bioscience gets it, chemistry and materials don’t). So here’s a simple guide to semantics.

Compound 2a melted at 119 o C

Most scientists (of any discipline) would understand that.

But no machine would. Machines do not currently understand human communication by default. They don’t understand:

  • “Compound” (the word has many meanings). Which one?
  • “2a”. Why is it bold? Does it matter?
  • “melted” –
  • “at” – a place? A time?
  • “119” – an integer? A house number? A rock band?
  • o I’ve been naughty – it is an “oh”. But unless you use U+00B0 it’s not “degrees”
  • C? so many meanings.

So why should we translate this for a machine?

Because if machines can read and understand this they can act as our assistants.

There are probably 50-500 million pages of scholarly scientific literature relevant to materials science published each year. We’d like to find all the melting points. Humans find this very, very boring, and they make mistakes.

Machines don’t get bored and they don’t make mistakes. But we have to translate that into semantic form. That’s not trivial. But once we’ve built the infrastructure it becomes routine. So here’s what a machine needs:

 

We’ve used CML (Chemical Markup Language), an XML language, as the machine syntax. Humans do NOT need to author or read this; we have tools for that. But a brief explanation:

  • The “compound” is described by the formal CML concept “molecule”. (CML defines precisely how machines should interpret “property”, “scalar”, “dictRef”, etc.)
  • “2a” is a REFERENCE to a file/document elsewhere
  • “melted” is a property defined by “prop:mpt” in a dictionary
  • Degrees is now defined by a units dictionary
  • 119 is a real number (“float”)
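To make the list above concrete, here is a minimal sketch of what such a machine-readable record could look like, and how a program reads it. The element and attribute names (molecule, property, scalar, dictRef, units) follow the bullet points above but are illustrative rather than exact CML schema:

```python
import xml.etree.ElementTree as ET

# Illustrative CML-like record for "Compound 2a melted at 119 degrees C".
# Names are modelled on the description above, not the precise CML schema.
record = """
<molecule ref="2a">
  <property dictRef="prop:mpt">
    <scalar dataType="xsd:double" units="unit:celsius">119</scalar>
  </property>
</molecule>
"""

root = ET.fromstring(record)
scalar = root.find("property").find("scalar")
melting_point = float(scalar.text)   # an unambiguous number, not a string "119"
units = scalar.get("units")          # an unambiguous unit reference

print(melting_point, units)  # 119.0 unit:celsius
```

Every piece of the original sentence is now a dictionary-defined, typed item a machine can act on.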

A machine can now “understand” this. Here’s AMI

AMI can understand:

  • XML syntax
  • Precisely formulated rules
  • Computer code

If we can create or convert information into semantic form, AMI can process it. For example Graphics:

Humans understand the top. AMI understands the bottom.

Or maths:

 

I shall demo this (I shall run out of time, as I always do!)

 

 

That’s our challenge.

 

 

 

Posted in Uncategorized | Leave a comment

A semantic puzzle

There was a workshop/summer school before the Materials Informatics conference and I gave a set of exercises which included converting chemistry from legacy to XML. [You DON’T need to understand chemistry or java to tackle this puzzle.] Here’s my Powerpoint slide of what I asked the delegates to do:

It worked for 90% of the delegates but 2 had problems. There was an error:

This contains a clear indication of the problem (you DON’T need to understand Java). I failed to diagnose what was wrong.

So

  • What did I do to diagnose the error?
  • What was the culprit?
  • What action should you and I take in future? (MUCH more general than Java)

I have given you all the clues…

I’ll post the answer when I have had some useful replies. A really smart web-savvy person will know how I solved it and report that.

============== UPDATE and SPOILER ===============

@Villu has answered this!! See discussion. I’ll post my answer below (scroll if you need it)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I posted my question to StackOverflow – a community resource which has answered over 4 MILLION programming questions. Here’s what I asked. http://stackoverflow.com/questions/14638911/cannot-load-jar-with-dependencies-on-mac Note that I got the diagnosis COMPLETELY WRONG!!

Cannot load jar-with-dependencies on Mac

 

I have an application created in Maven as a complete jar which runs on most platforms (Windows, Unix, some Mac) but not on Mac lion/10.6, failing with the error

java –jar jumbo-converters-crystal-cif-0.3-SNAPSHOT-jar-with-dependencies.jar 0151.cif 0151.cml

Exception in thread "main" java.lang.NoClassDefFoundError: ?jar
Caused by: java.lang.ClassNotFoundException: ?jar
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

 

And I got the answer (from @Charlie):

(I have deliberately used an image!)

The point is that I wrote this correctly, but WHEN I PASTED IT INTO POWERPOINT, Powerpoint automatically and “helpfully” converted one of the MINUS signs into an EN DASH. Visually this is almost impossible to spot.
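A mechanical check of this kind catches the problem instantly. Here’s a hedged Python sketch; the character table is just a handful of common word-processor substitutions (not exhaustive), and the helper name is invented:

```python
# Characters that word processors commonly substitute for ASCII, mapped to
# a hint about what the author almost certainly typed (small sample only).
LOOKALIKES = {
    "\u2013": "- (EN DASH for HYPHEN-MINUS)",
    "\u2014": "- (EM DASH for HYPHEN-MINUS)",
    "\u2018": "' (LEFT SINGLE QUOTE)",
    "\u2019": "' (RIGHT SINGLE QUOTE)",
    "\u201c": '" (LEFT DOUBLE QUOTE)',
    "\u201d": '" (RIGHT DOUBLE QUOTE)',
    "\u00a0": "space (NO-BREAK SPACE)",
}

def find_lookalikes(command):
    """Return (index, char, hint) for each suspicious character."""
    return [(i, c, LOOKALIKES[c]) for i, c in enumerate(command) if c in LOOKALIKES]

# The pasted command, with the EN DASH that PowerPoint inserted:
cmd = "java \u2013jar jumbo-converters.jar 0151.cif 0151.cml"
problems = find_lookalikes(cmd)
print(problems)  # [(5, '–', '- (EN DASH for HYPHEN-MINUS)')]
```

The general lesson: never trust copy-paste through a word processor for anything a machine will execute.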

 

 

 

Posted in Uncategorized | 10 Comments

The Semantic Web for Materials Science (and a great day for Melbourne)

Three important things are happening in Melbourne today. TimBL is speaking, the Age/OKF/Dev8D is running a journo hackfest, and I’m helping to kick off

The First International Conference and Summer School in Molecular and Materials Informatics

The International Conference and Summer School in Molecular and Materials Informatics is the first conference to address the need for the development of molecular and materials informatics platforms and solutions. http://www.csiro.au/en/Organisation-Structure/Divisions/Materials-Science–Engineering/Molecular-and-Materials-Informatics.aspx

This is a really important meeting, as materials science is a key scientific and engineering discipline (http://en.wikipedia.org/wiki/Materials_science): it underpins batteries, solar power, computing, airplanes, paint, and much more.

We are beginning to understand how to calculate the properties of new materials using Schroedinger’s equation and Newton’s laws of motion. I guess that 10-20% of scientific computing is used for materials in some form. We are potentially capable of generating vast amounts of data.

But our informatics is almost non-existent. Compared with bioinformatics (e.g. the genome) we have nothing.

  • We can’t even find information 50-100 years old.
  • We don’t and can’t make the results of calculations available.
  • We cannot communicate to machines

I’m going to urge that we change this by creating a Semantic Web for Materials. What’s that?

It happens that today Tim Berners-Lee is touring Australia and visiting Melbourne (http://www.library.unimelb.edu.au/library_news/news_articles/public_lecture_-_sir_tim_berners-lee_-_mon_4_february and http://tbldownunder.org/ ). [Great kudos to the indefatigable Pia Waugh of Canberra for this – I met Pia at #okfest last summer.] It’s sold out (obviously).

But I had the great privilege of meeting TimBL (http://en.wikipedia.org/wiki/Tim_Berners-Lee ) 20 years ago at CERN at WWW1, and his vision of the http://en.wikipedia.org/wiki/Semantic_Web has been the inspiration for my life since then. It’s inspired me and others to create the tools for a Chemical Semantic Web and now the Semantic Web for Materials Science. And we’ve talked from time to time about Chemical Markup Language – Tim wanted it all in RDF – I preferred XML.

The second event is a hackfest run at the Melbourne Age (newspaper) http://data-newsroom-melbourne-eorg.eventbrite.com/# . What’s a hackfest? And what can it possibly have to do with Materials Data?

A great deal. Because the tools for data journalism are very similar to the tools we need for materials!

Data is boring.  No one ‘wants’ to ‘read a spreadsheet’.  Yet, within the data are stories – stories that can influence our ideas – stories that are more powerful because they are based on evidence.

But, how do we tell stories with data?  How do we make data interesting?

This event will transform the way you see data.  On the day, starting at 9am you will enter a newsroom: a newsroom full of reporters, citizens, data hackers, journalists, editors-in-chief and people just like you – people who want to make things better!  


Note: this picture for illustration purposes, the venue we have is even better 😉

Like a real news team we will brief you on the evidence (data) for the day and get you into ad hoc ‘citizen journo teams’.  From mid-morning to lunch (with your new-found team) you will need to find a story within the data and pitch it to an editorial board made up of journalists from The Age.

What we want you to achieve by the end of the day: turn the data into a newsworthy story – something that people will want to see on the front page of The Age.

Again I’m gutted I can’t be there – though I’ll try to pop over for lunch. I might meet Steve Androulakis there – who is developing the data capture system (Tardis) for synchrotron data on materials. See how it all fits together?

And then our own meeting. I’ll be announcing AMI2 for materials information capture. Separate blog posts. But we can change the world of materials science using modern Web tools.

And modern Web culture!

 

Posted in Uncategorized | 2 Comments

Update: Travels, Semantic Computing for Science, Reproducibility, and Open Stuff

 

I have been silent on this blog for too long (over 1 month) because I have been obsessively concentrating on two major software projects. This post is to keep you up to date and reassure those of like mind that I continue to be very active in trying to liberate knowledge.

Travels. I am off back to CSIRO Melbourne for a month where I’m helping Nico Adams with his Materials Summer School and Workshop. I think the semantic tools that we have all been developing are going to be valuable in creating better informatics and computational approaches to materials. I’m particularly interested in crystalline materials and computational processes.

I’m taking a week off to fly to Auckland, NZ for the tail of the Open Research meeting (Fabiana Kubke) and then Kiwi Foo. Very excited. I’m sorry I can’t spend more time with Fabiana but the Summer School overlaps.

In late Feb/March I have been invited to speak at the Columbia Research Data Symposium (http://conferences.cdrs.columbia.edu/rds/index.php/rds/rds ). This will be a very exciting meeting and I’m very grateful to Columbia. Originally I declined because I would have been sponsored by Elsevier and I have publicly stated that I am boycotting all Elsevier activities. Columbia’s sponsorship means I do not have to take an Elsevier-friendly line. I will blog this meeting before I go and outline some of the issues that the world has to decide on. In simple terms our academic digital freedom is at stake. Data presents a huge opportunity and doubtless large additional income. Academia and governments should act wisely and not outsource their decisions and ethics.

Then I’m off to Kitware , a scientific/consultancy company that makes money out of open Source, including VTK and Avogadro. I am really excited as I hope to bounce the Declaratron design off them. As always my software is not only Open but non-competitive. Anyone can join in the meritocracy.

AMI2 (http://bitbucket.org/petermr/pdf2svg ) is a project to turn PDFs into fully semantic computable, searchable, executable documents with human intervention. There are >2 million STM PDFs published each year in EuropePMC alone (more on that later). More and more are Open Access of some kind. We have developed a relatively comprehensive and high accuracy converter and tested it on some thousands of PDFs from several hundred publishers. (Don’t rush for your lawyers, publishers, I’m not going to publish your holy PDFs). The results of this are:

  • The technical standard of publishers’ PDFs is AWFUL. I don’t think I have found one that conforms to the PDF standard.
  • We have learnt how to turn them into Unicode
  • The result is technically better than what the publishers produce.
  • The next stage, turning SVG into semantic form is doing well. I am particularly keen on extracting maths equations in semantic MathML form. Equations aren’t copyright are they? Perhaps they are – Pythagoras only died 2500 years ago, so maybe he is still in copyright somewhere. JSBach still is.

I’d love to hear from anyone interested in developing content mining

The Declaratron. This is a new declarative approach to reproducible semantic computing and directly addresses things like:

  • Can scientific computation be reproduced? The current answer is generally – only partially. To do so completely requires the complete semantic unification of all components – data, specification, computational engine and visualisation/publication.
  • Have we eliminated all syntactic error and as much semantic error as possible? For example are our units consistent? Are data linked to computable ontologies?
  • Can the algorithm be transported to a different environment without writing code?
  • Can we follow the progress of the computation?
  • Can we modify the algorithm, even in mid computation?
  • Can the machine document the complete course of the calculation at whatever granularity we desire?
  • Can the results be re-used in another context without human intervention?

… and a great deal more. I think the answer to all of these is yes and I’ll be showing how the Declaratron works.

Open Access/Knowledge. I shall try and blog something on Aaron Swartz. I didn’t know him, but I know people who did, and the wealth of tributes has been impressive in itself and also given me more insight into his passion for liberation. The smell of injustice is pervasive.

Content Mining. Hargreaves is going to turn its recommendations into law. No arguments. So in October 2013 I can legally mine anything I have access to and publish as CC-NC. Publishers will whinge, scream, lobby, etc. But that will be UK law (it doesn’t require re-legislation and is done through statutory instruments). There’s a lot of to-and-fro-ing. Neelie Kroes and colleagues are running something in Brussels in 2 weeks’ time – Ross is representing OKF. The publishers are running semi-closed lobbying shops. We all have to remain very vigilant as publishers have people who are paid to stop progress and we have to rely on volunteers, spare time, etc. That is why I am grateful to Wellcome and the RCUK for their very clear impetus and drive. They have shown passion where the Universities have been spineless or ultra-timid. I’ll write more on this before Columbia.

Chuff will be going to AU and NZ… and I’ll be meeting with OKF people there. Tweet or mail if you’re around Auckland/Warkworth 2013-02-07/12

 

Posted in Uncategorized | 2 Comments

#opencontentmining MASSIVE step forward. Come and join us in the UK!

The UK government has now given the go-ahead to the major reforms proposed by the Hargreaves committee. http://news.bis.gov.uk/Press-Releases/Consumers-given-more-copyright-freedom-68542.aspx . The message is now very simple:

 

THE UK GOVERNMENT SAYS IT’S LEGAL TO MINE CONTENT FOR THE PURPOSES OF NON-COMMERCIAL RESEARCH

Business Secretary, Vince Cable said:

“Making the intellectual property framework fit for the 21st century is not only common sense but good business sense. Bringing the law into line with ordinary people’s reasonable expectations will boost respect for copyright, on which our creative industries rely.

“We feel we have struck the right balance between improving the way consumers benefit from copyright works they have legitimately paid for, boosting business opportunities and protecting the rights of creators.”

In his review of intellectual property and growth, Professor Hargreaves made the case for the UK making greater use of these exceptions, which are allowed under EU law. In response to a consultation earlier this year, the Government will make changes to:

[…]

  • Data analytics for non-commercial research – to allow non-commercial researchers to use computers to study published research results and other data without copyright law interfering;

[…]

These changes could contribute at least £500m to the UK economy over 10 years, and perhaps much more from reduced costs, increased competition and by making copyright works more valuable.

In addition the Government will introduce a new, non-statutory system for clarifying areas where there is confusion or misunderstanding on the scope and application of copyright law. Copyright notices will be issued by the Intellectual Property Office. These notices are intended to clarify, but not make, new law.

It makes it clear that publishers cannot set licence terms that override this.

New measures include provisions to allow copying of works for individuals’ own personal use, parody and for the purposes of quotation. They allow people to use copyright works for a variety of valuable purposes without permission from the copyright owners. They will also bring up to date existing exceptions for education, research and the preservation of materials.

We can start content-mining today (and we shall).

Copyright is complex and some of the questions are not easy to answer. So there is a provision for copyright-holders to appeal to the Secretary of State if they don’t like it. If publishers can convince Vince Cable that my activities are a threat to the health of the UK economy I’ll stop.

So everyone should adopt the principle:

If you have a right to the content you have a right to mine it

You DON’T have to ask permission.

What about non-UK people? Just come and visit us here! You will then be governed by the law of the UK. And we’d love to see you.

 

 

Posted in Uncategorized | 1 Comment

AMI2 Content mining using PDF and SVG: progress

I’m now returning to UK for a few weeks before coming back to AU to continue. This is a longish post but important for anyone wanting to know the details of how we build an intelligent PDF reader and what it will be able to do. Although the examples are chemistry-flavoured the approach applies to a wide range of science.

To recall…

AMI2 is a project to build an intelligent reader of the STM literature. The base is PDF documents (though Word, HTML and LaTeX will also be possible and much easier and of higher quality). There are three phases at present (though this and the names may change):

  • PDF2SVG. This converts good PDF losslessly into SVG characters, paths and images. It works well for (say) student theses and ArXiV submissions but fails for most STM publisher PDFs because the quality of the “typesetting” is non-conformant and we have to use clunky, fragile heuristics. More in later blogs and below.
  • SVGPLUS. This turns low-level SVG primitives (characters and paths) into higher-level a-scientific objects such as paragraphs, sections, words, subscripts, rectangles, polylines, circles, etc. In addition it analyses components that are found universally in science (figures, tables, maths equations) and scientific document structure. It also identifies graphs, plots, etc. (but not chemistry, sequences, trees…)
  • SVG2XML. This interprets SVGPLUS output as science. At present we have prototyped chemistry, phylogenetics, spectroscopy and have a plugin architecture that others can build on. The use of SVG primitives makes this architecture much simpler.

We’ve written a report and here are salient bits. It’s longish so mainly for those interested in the details. But it has a few pictures…

PDFs and their interpretation by PDF2SVG

 

Science is universally published as PDF documents, usually created by machine and human transformation of Word or LaTeX documents. Almost all major publishers regard “the PDF” as the primary product (version of record) and most scientists read and copy PDFs directly from the publishers’ web sites; the technology is independent of whether this is Open or closed access. Most scientists read, print and store large numbers of PDFs locally to support their research.

PDF was designed for humans to read and print, not for semantic use. It is primarily “electronic paper” – all that can be guaranteed is coloured marks on “e-paper”. It was originally proprietary and has only fairly recently become an ISO standard. Much of the existing technology is proprietary and undocumented. By default, therefore, a PDF only conveys information to a sighted human who understands the human semantics of the marks-on-paper.

Over 2 million scholarly publications are published each year, most only easily available in PDF. The scientific information in them is largely lost without an expert human reader, who often has to transcribe the information manually (taking huge time and effort). Some examples:

In a PDF these are essentially black dots on paper. We must develop methods to:

  • PDF2SVG: Identify the primitives (in this case characters, and symbols). This should be fairly easy but because the technical standard of STM publishing is universally very non-conformant to standards (i.e. “poor”) we have had to create a large number of arbitrary rules. This non-conformity is a major technical problem and would be largely removed by the use of UTF-8 and Unicode standards.
  • SVGPLUS (and below): Understand the words (e.g. that “F”-“I”-“g” and “E”-“x”-“c”-“e”-“s”-“s” form words). PDF has no concept of “word”, “sentence”, “paragraph”, etc.
  • Detect that this is a Figure (e.g. by interpreting “Fig. “)
  • Separate the caption from the plot
  • Determine the axial information (labels, numbers and ticks) and interpret (or here guess) units
  • Extract the coordinates of points (black circles)
  • Extract the coordinates of the line

If the PDF is standards-compliant it is straightforward to create the SVG. We use the Open Source PDFBox from Apache to “draw” to a virtual graphics device. We intercept these graphics calls and extract information on:

  • Position and orientation. PDF objects have x,y coordinates and can be indefinitely grouped (including scaling). PDF resolves all of this into a document on a virtual A4 page (or whatever else is used). The objects also have style attributes (stroke and fill colours, stroke-widths , etc.). Most scientific authors use simple colours and clean lines which makes the analysis easier.
  • Text (characters). Almost all text is individual characters, which can be in any order (“The” might be rendered in the order “e”-“h”-“T”). Words are created knowing the screen positions of their characters. In principle all scientific text (mathematical equations, chemical symbols, etc.) can be provided in the Unicode toolset (e.g. a reversible chemical reaction symbol

    is the Unicode point U+21CC or HTML entity &#x21CC; and will render as such in all modern browsers).

  • Images. These are bitmaps (normally rectangular arrays of pixels) and can be transported as PNG, GIF, JPEG, TIFF, etc. There are cases (e.g. photographs of people or scientific objects) where bitmaps are unavoidable. However some publishers and authors encode semantic information as bitmaps, thereby destroying it. Here is an example:

    Notice how the lines are fuzzy (although the author drew them cleanly). It is MUCH harder to interpret such a diagram than if it had been encoded as characters and lines. Interpretation of bitmaps is highly domain-dependent and usually very difficult or impossible. Here is another (JPEG)

    Note the fuzziness which is solely created by the JPEG (lossy) compression. Many OCR tools will fail on such poor quality material

  • Path (graphics primitives). These are used for objects such as
    • graphical plots (x-y, scatterplots, bar charts)
    • chemical structures

      This scheme, if drawn with clean lines, is completely interpretable by our software as chemical objects

    • diagrams of apparatus
    • flowcharts and other diagrams expressing relationships

    Paths define only Move, Line, Curve. To detect a rectangle SVGPLUS has to interpret these commands (e.g. MLLLL).
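A hedged sketch of that kind of path interpretation (not SVGPLUS’s actual code; the path grammar here is simplified to absolute M/L commands with comma-separated coordinates):

```python
def parse_path(d):
    """Parse a minimal absolute-coordinate path like 'M0,0 L10,0 ...'."""
    points = []
    for token in d.split():
        cmd, coords = token[0], token[1:]
        x, y = map(float, coords.split(","))
        points.append((cmd, x, y))
    return points

def is_axis_aligned_rect(d):
    """Recognise the MLLLL signature mentioned above as a rectangle."""
    pts = parse_path(d)
    if "".join(p[0] for p in pts) != "MLLLL":
        return False
    xy = [(x, y) for _, x, y in pts]
    if xy[0] != xy[4]:                       # must close on the start point
        return False
    for (x1, y1), (x2, y2) in zip(xy, xy[1:]):
        if x1 != x2 and y1 != y2:            # each edge horizontal or vertical
            return False
    return True

print(is_axis_aligned_rect("M0,0 L10,0 L10,5 L0,5 L0,0"))  # True
print(is_axis_aligned_rect("M0,0 L10,3 L10,5 L0,5 L0,0"))  # False (slanted edge)
```

Real SVG paths add relative commands, curves and closepath, so the production heuristics are considerably messier than this.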

There are, unfortunately, many errors and uncertainties. The most common is the use of non-standard, undocumented encodings for characters. These come from proprietary tools (such as font providers for TeX, etc.) and from contracted typesetters. In these cases we have to cascade down:

  • Guess the encoding (often Unicode-like)
  • Create a per-font mapping of names to Unicode. Thus “MathematicalPi-One” is a commonly used font for math symbols: its “H11001” is drawn as a PLUS and we translate it to Unicode U+002B, but there is no public (or private) translation table (we’ve asked widely). So we have to do this manually by comparing glyphs (the printed symbols) to tables of Unicode glyphs. There are about 20 different “de facto” fonts and symbol sets in wide scientific use and we have to map them manually (maybe while watching boring cricket on TV). We have probably done about 60% of what is required
  • Deconstruct the glyphs. Ultimately the PDF provides the graphical representation of a glyph on the screen, either as vectors or as a bitmap. We recently discovered a service (shapecatcher) which interprets up to 11,000 Unicode glyphs and is a great help. Murray Jensen has also written a glyph browser which cuts down the human time very considerably.
  • Apply heuristics. Sometimes authors or typesetters use the wrong glyph or kludge it visually. Here’s an example:

    Most readers would read as “ten-to-the-minus-seven” but the characters are actually “1”, “0”, EM-DASH, “7”. EM-DASH – which is used to separate clauses like this – is not a mathematical sign so it’s seriously WRONG to use it. We have to add heuristics (a la UNIX lint) to detect and possibly correct. Here’s worse. There’s a perfectly good Unicode symbol for NOT-EQUALS (U+2260)

    Unfortunately some typesetters will superimpose an EQUALS SIGN (=) with a SLASH (/). This is barbaric, and hard and tedious to detect and resolve. The continued development of PDF2SVG and SVGPLUS will probably be largely hacks of this sort.
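The per-font mapping described above is, in effect, a hand-built lookup table. A hedged sketch: only the “H11001” → PLUS SIGN entry comes from the text; the table shape and helper name are invented for illustration:

```python
# Hand-built per-font translation tables (tiny illustrative sample; the
# real tables would cover the ~20 de-facto fonts mentioned above).
FONT_TO_UNICODE = {
    "MathematicalPi-One": {
        "H11001": "\u002B",   # PLUS SIGN (the example given in the post)
        # ... further names filled in by comparing glyphs to Unicode charts
    },
}

def translate_char(font, name):
    """Map a font-specific character name to Unicode, or None if unmapped."""
    return FONT_TO_UNICODE.get(font, {}).get(name)

print(translate_char("MathematicalPi-One", "H11001"))  # +
print(translate_char("MathematicalPi-One", "H11002"))  # None (needs a human)
```

An unmapped name (None) is exactly the case that falls through to glyph deconstruction or heuristics in the cascade above.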

SVG and reconstruction to semantic documents SVGPLUS

 

SVGPLUS assumes a correct SVG input of Unicode characters, SVG Paths, and SVGImages (the latter it renders faithfully and leaves alone). The task is driven by a control file in a declarative command language expressed in XML. We have found this to be the best method of representing the control, while preserving flexibility. It has the advantage of being easily customisable by users and because it is semantic can be searched or manipulated. A simple example:

<semanticDocument xmlns="http://www.xml-cml.org/schema/ami2">
  <documentIterator filename="org/xmlcml/svgplus/action/">
    <pageIterator>
      <variable name="p.root" value="${d.outputDir}/whitespace_${p.page}" type="file"/>
      <whitespaceChunker depth="3"/>
      <boxDrawer xpath="//svg:g[@LEAF='3']" stroke="red" strokeWidth="1" fill="#yellow" opacity="0.2"/>
      <pageWriter filename="${p.root}_end.svg"/>
    </pageIterator>
  </documentIterator>
</semanticDocument>

 

This document identifies the directory to use for the PDFs (“action”), iterates over each PDF it finds, creates (SVG) pages for each, processes each of those with a whitespaceChunker (v.i.), draws boxes round the results, and writes each page to a file. (There are many more components in SVGPLUS for analysing figures, etc.) A typical example is:

 

SVGPLUS has detected the whitespace-separated chunks and drawn boxes round the “chunks”. This is the start of the semantic document analysis. This follows a scheme:

  • Detect text chunks and detect the font sizes.
  • Sort into lines by Y coordinate and sort within lines by X coordinate. The following has 5 / 6 lines:

     

     

    Normal, superscript, normal, subscript (subscript), normal

  • Find the spaces. PDF often has no explicit space characters – the spaces have to be calculated from the intercharacter distances. This is not standard and is affected by justification and kerning.
  • Interpret variable font-size as sub- and super-scripts.
  • Manage super-characters such as the SIGMA.
  • Join lines. In general one line can be joined to the next by adding a space. Hyphens are left as their interpretation depends on humans and culture. The output would thus be something like:

    the synthesis of blocks, stars, or other polymers of com~plex architecture. New materials that have the potential of revolutionizing a large part …

    This is the first place at which words appear.

  • Create paragraphs. This is through indentation heuristics and trailing characters (e.g. FULL STOP).
  • Create sections and subsections. This is normally through bold headings and additional whitespace. Example:

    Here the semantics are a section (History of RAFT) containing two paragraphs
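The first steps of the scheme above (sort into lines by Y, sort within lines by X, infer spaces from gaps) can be sketched as follows. The coordinates and the gap threshold are invented for illustration; real PDFs need font-size-dependent thresholds and kerning handling:

```python
def characters_to_text(chars, y_tol=2.0, space_gap=10.0):
    """chars: (char, x, y) tuples in arbitrary order, as PDF supplies them.
    Group into lines by Y, sort each line by X, and insert a space wherever
    the inter-character gap exceeds space_gap (an invented threshold)."""
    lines = []
    for ch in sorted(chars, key=lambda c: (c[2], c[1])):
        if lines and abs(ch[2] - lines[-1][-1][2]) <= y_tol:
            lines[-1].append(ch)          # same line as the previous character
        else:
            lines.append([ch])            # start a new line
    out = []
    for line in lines:
        line.sort(key=lambda c: c[1])     # left-to-right within the line
        text = line[0][0]
        for prev, cur in zip(line, line[1:]):
            if cur[1] - prev[1] > space_gap:
                text += " "               # wide gap => word boundary
            text += cur[0]
        out.append(text)
    return out

# Characters may arrive out of order ("e"-"h"-"T"), as noted earlier:
chars = [("e", 12, 100), ("h", 6, 100), ("T", 0, 100),
         ("c", 30, 100), ("a", 36, 100), ("t", 42, 100)]
print(characters_to_text(chars))  # ['The cat']
```

Superscripts, sub-scripts and hyphenation then layer further heuristics on top of this basic reconstruction.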

 

The PATH interpretation is equally complex and heuristic. In the example below:

The reversible reaction is made up of two ML paths (“lines”) and two filled curves (“arrowheads”). All this has to be heuristically determined. The arcs are simple CURVE-paths. (Note the blank squares are non-Unicode points)

 

In the axes of the plot

All the tick-marks are independent paths – SVGPLUS has to infer heuristically that it is an axis.

In some diagrams there is significant text:

Here text and graphical primitives are mixed and have to be separated and analysed.

 

In summary, SVGPLUS consists of a large number of heuristics which will reconstruct a large proportion (but not all) of scientific articles into semantic documents. The semantics do and will include:

  • Overall sectioning (bibliographic metadata, introduction, discussion, experimental, references/citations)
  • Identification and extraction of discrete Tables, Figures, Schemes
  • Inline bibliographic references (e.g. superscripted)
  • Reconstruction of tables into column-based objects (where possible)
  • Reconstruction of figures into caption and graphics
  • Possible interpretation of certain common abstract scientific graphical objects (graphs, bar charts)
  • Identification of chemical formulae and equations
  • Identification of mathematical equations

There will be no scientific interpretation of these objects.

 

Domain specific scientific interpretation of semantic documents

 

This is being developed as a plugin-architecture for SVGPLUS. The intention is that a community develops pragmatics and heuristics for interpreting specific chunks of the document in a domain specific manner. We and our collaborators will develop plugins for translating documents into CML/RDF:

  • Chemical formulae and reaction schemes
  • Chemical synthetic procedures
  • Spectra (especially NMR and IR)
  • Crystallography
  • Graphical plots of properties (e.g. variation with temperature, pressure, field, molar mass, etc.)
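A sketch of how such a plugin architecture could look. The registry and decorator here are my invention, not the actual SVGPLUS API: domain plugins register against a chunk type and translate matching chunks into semantic output.

```python
# Hypothetical plugin registry (names are mine, not the SVGPLUS API):
# plugins register for a chunk type; the dispatcher runs all plugins
# registered for that type over a document chunk.

PLUGINS = {}

def plugin(chunk_type):
    """Decorator registering an interpreter for one chunk type."""
    def register(fn):
        PLUGINS.setdefault(chunk_type, []).append(fn)
        return fn
    return register

@plugin("spectrum")
def interpret_nmr(chunk):
    # a real plugin would parse peaks, shifts, couplings...
    return {"type": "nmr-spectrum", "source": chunk}

def interpret(chunk_type, chunk):
    """Run every plugin registered for chunk_type; unknown types yield []."""
    return [fn(chunk) for fn in PLUGINS.get(chunk_type, [])]
```

The point of the design is that crystallographers, spectroscopists and phylogeneticists can each contribute interpreters without touching the core.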

More generally we expect our collaborators (e.g. Ross Mounce, Panton Fellow, paleophylogenetics at University of Bath UK) to develop:

  • Mathematical equations (into MathML).
  • Phylogenetic trees (into NEXML)
  • Nucleic acid and protein sequences into standard formats
  • Dose-response curves
  • Box-plots

 

Fidelity of SVG rendering in PDF2SVG. This includes one of the very rare bugs we cannot solve:

PDF:

SVG:

[Note that the equations are identical apart from the braces, which are mispositioned and too small. There is no indication in TextPosition as to where this scaling comes from.]

In PDFReader the equation is correctly displayed (the text is very small so the screenshot is blurry; nonetheless it’s possible to see that the brackets are correct).

 

 

Posted in Uncategorized | 1 Comment

Temporary Farewell to AU and thanks to some of its mammals

I go back to the UK today and am finishing up in the Prahran (Melbourne) apartment where I’ve been for 2.5 months. Prahran is a great place to be – an easy tram ride to the CBD (centre) of Melbourne and less than an hour’s commute to CSIRO (including walking) once I had figured out the 5 and 64 trams and the (somewhat random) semi-express nature of the Cranbourne trains. Trams are generally great (except the one that broke down in Swanston Street, which gums up everything) apart from the rather unpredictable nature of late trains. Excellent shuttle from Huntingdale station to Monash Univ.

Too many human mammals to thank but they include:

  • Nico Adams – unlimited praise and appreciation for fixing all this up. We are planning for me to be back next year, probably late Jan.
  • Murray Jensen (CSIRO) for his collaboration on AMI2 – Murray has a huge range of expertise and his knowledge of fonts was both unexpected and absolutely critical.
  • Everyone involved at CSIRO.
  • Dave Flanders and the Flanders-irregulars – a mixture of incipient OKF and hackers, meeting in Melbourne cafes where the wifi and coffee are great. (This is a fantastic aspect of the Melbourne scene: you can get coffee and free wifi at the State Library of Vic, National Gallery, Fed Square, and RMIT in Swanston, where Nico and I worked on reports, and will again next year.)
  • Connie and colleagues for the great time in Perth.
  • Mat Todd and colleagues for Sydney
  • The Astor cinema in Prahran/Chapel Street. It’s a 1930s art deco cinema showing a mixture of classic films (Bogart, Bergman, Crawford, Stewart…) and new releases. TWO films per sitting and ice creams out of this world.
  • Prahran and its cafes. I am off to have brunch shortly (wifi and great atmosphere).
  • The people we met on our travels down the Great Ocean Road and elsewhere – Wombat Cottage (with Wombats), Birdwatchers, Reserves (e.g. Tower Hill)…
  • And others that I’ve failed to add – sorry.

Lots of animals and birds – we’ve probably ticked 50+ AU birds. We’re told Werribee sewage works is the place we must visit next time. The most interesting mammal was Thylarctos plummetus. This can be dangerous to humans (what isn’t dangerous in AU?) but there are no recorded fatalities. Here’s the best picture we could get:

Our guide wouldn’t let us get any closer because of the potential danger. It’s clearly not a Koala and it looks ready to fall out of the tree.

The animals are sad and excited to be going to UK. Here’s AMI and AMI with some classic Australian tucker which I’ve had to leave behind:

We didn’t manage to make any #animalgarden photocomics – too much to do hacking grotty PDFs :-(

See you soon…

 

 

Posted in Uncategorized | Leave a comment

My/Our talk to CSIRO Publishing. How should we communicate science? “This article uses recyclable scientific information”.

I’ve been working with Nico Adams at CSIRO (Melbourne/Clayton, AU) for nearly 3 months, supported by a Fellowship. CSIRO (http://www.csiro.au/) is a government institution similar in many ways to a National Laboratory. It does research (public and private) and publishes it. But it is also a publisher in its own right – everything from Chemistry, to Gliding mammals, to how to build your dream home. Nico and I have struck up a rapport with people in CSIRO publishing http://www.publish.csiro.au/ and today – my last full day in AU – we are going to visit and present some of what we have done and more generally have a discussion where we learn about what CSIRO Publishing does.

CSIRO publishes a range of journals and we’ll be concentrating on that, though we’ll also be interested in reports, books, etc. We’ve had the opportunity to work with public and non-public content and to use that as a guide to our technology development (all the software I write is, of course, Open Source). Among the questions I’ll want to raise (not specifically CSIROPub) are:

  • Is the conventional journal type-setting process still needed? I will argue NO – that it costs money and makes information worse. ArXiV has totally acceptable typography in Word or LaTeX and this is better than most journals for content-mining, etc.
  • How should data be published? I shall take small-molecule crystal structures as an example. At present CSIRO sends crystal structures to CCDC where they are not openly accessible. I’ll argue they should be part of the primary scientific record.

Nico will be talking about semantics – what it is and how it can be used. I think he’ll hope to show the machine extraction of content from Aust. J Chem.

I’ll probably play down the political aspect in my formal presentation. The main issue now is how we recreate a market where scientific communication (currently broken) can be separated from the awarding of scientific glory (reputation). I’ll concentrate on the communication.

I have a simple, practical, understandable IMMEDIATE proposal addressing the document side of STM (this doesn’t of course address the issues of data semantics or whatever).

  • The current primary documentary version of the scientific record should not be PDF but a Word or LaTeX or HTML or XML (e.g. NLM-DTD) document.
  • All documents should use UTF-8 and Unicode.

There are zillions of Open tools that adhere to UTF-8 and Unicode.

Where PDFs are used they should adhere to current information standards.

A graduate thesis is a BETTER document than the output of almost any publisher I have surveyed. STM publishing destroys information quality. All documents I have looked at on ArXiV are BETTER than the output of STM publishing.

So I shall make the following proposals:

  • CSIRO publishing should publish in a standards-compliant manner.
  • CSIRO should make supplemental data Openly available (we’ll take crystallography as the touchstone).

The average cost to the public for the publication of a scientific paper is around 3000 USD. The information quality is a disgrace. Some of that money can be saved by doing it better. It’s similar to recycling. It makes sense to re-use your plastic bags, toilet paper, etc. (Yes, Healesville animal sanctuary promotes green bum-wiping to save the environment (technically recycled paper)).

Let’s have a sticker:

“This journal promotes recycled scientific information”

I’ll be presenting the work that Murray Jensen and I have been doing on AMI2. MANY thanks to Murray – he has been given an “AMI” in small acknowledgement.

Murray’s AMI in typical Melbourne bush.

AMI progresses steadily. It’s taken much longer than I thought, primarily because STM publication quality is AWFUL. It’s now at a stage where we can almost certainly make an STM publication considerably better. However, Murray and I have hacked the worst of it. AMI2-PDF2SVG turns PDF and AWFUL-PDF into good Unicode-compliant SVG. I’m concentrating on AMI2-SVGPLUS, which turns SVG into meaningful documents. Nearly there. Again, the absurd process of creating double-column justified PDF (that no scientist would willingly pay for) seriously destroys information, and SVGPLUS has to recover it. Then the final exciting part will create science from the document.

I’ll hope to present some today.

 

 

Posted in Uncategorized | 2 Comments

#ami2 #opencontentmining: AMI reports progress on #pdf2svg and #svgplus: the “standard” of STM publishing

AMI has been making steady progress on two parts of AMI2:

  • PDF2SVG. A converter of PDF to SVG, eliminating all PDF-specific information. This has gone smoothly – AMI does not understand “good”, so “steady” means a monotonically increasing number of non-failing JUnit tests. AMI has also distributed the code, first on Bitbucket at:

    http://www.bitbucket.org/petermr/pdf2svg

    and then on Jenkins, the continuous integration tool, on the PMR group machine in Cambridge:

    http://hudson.ch.cam.ac.uk – see https://hudson.ch.cam.ac.uk/job/pdf2svg/

    [Note: Hudson was Open Source but it became closed, so the community forked it; Jenkins is the new Open branch.] Jenkins is very demanding. AMI starts by developing tests on Eclipse, then runs these on Maven, and then on Jenkins. Things that work on Eclipse often fail on Maven, and things that work on Maven can fail on Jenkins.

    AMI has also created an Issue Tracker: https://bitbucket.org/petermr/pdf2svg/issues?status=new&status=open Here humans write issues which matter to them – bugs, ideas, etc. PMR tells AMI what the issues are and translates them into AMI-tasks, often called TODO. PMR tells AMI he is pleased that there is feedback from outside the immediate group.

  • SVGPlus. This takes the raw output of PDF2SVG and turns it into domain-agnostic semantic content. Most of this has already been done so it is a question of refactoring. AMI requires JUnit tests to drive the development. SVGPlus has undergone a lot of refactoring (AMI notes changes of package structure, deletion of large chunks and addition of smaller bits). The number of tests increases so AMI regards that as “steady progress”.

AMI now has a lot of experience with PDFs from STM publishers and elsewhere. AMI works fastest when there is a clear specification against which she can write tests. AMI works much slower when there are no standards. PMR has to tell her how to guess (“heuristics”). Here’s their conversation over the last few weeks.

AMI: Please write me some tests for PDF2SVG.

PMR: I can’t.

AMI: Please find the standard for PDF documents and create documents that conform.

PMR. I could do that but it’s no use. Hardly any of the STM publishers conform to any PDF standards.

AMI. If the deviations from the standard are small we can add some small exceptions.

PMR. The deviation from the standard is enormous.

AMI. If you read some of the documents we can create a de facto standard and code against that. It will be several times slower.

PMR. That won’t be useful. Every publisher does things differently.

AMI. How many publishers are there?

PMR. Perhaps 100.

AMI. Then it will take 100 times longer to write PDF2SVG. Please supply me with the documentation for each of the publishers’ PDFs.

PMR. There is no documentation for any of them.

AMI. Then there is no systematic quality way that I can write code.

PMR. Agreed. Any conversion is likely to have errors.

AMI. We may be able to tabulate the error frequency.

PMR. We don’t know what the correct output is.

AMI. Then we cannot estimate errors properly.

PMR. Agreed. Maybe we can get help from crowdsourcing.

AMI. I do not understand.

PMR. More people, creating more exams and tests.

AMI. I understand.

PMR. I will have to make it easy for them.

AMI. In which case we may be able to work faster. We may also be able to output partial solutions. Can we identify how the STM publishers deviate from the standard?

PMR. Let’s try.

AMI. Wikipedia has http://en.wikipedia.org/wiki/Portable_Document_Format . Is that what we want?

PMR. Yes.

AMI. Is the standard Open?

PMR. Yes, it’s ISO 32000-1:2008.

AMI. [reads]

ISO 32000-1:2008 specifies a digital form for representing electronic documents to enable users to exchange and view electronic documents independent of the environment they were created in or the environment they are viewed or printed in. It is intended for the developer of software that creates PDF files (conforming writers), software that reads existing PDF files and interprets their contents for display and interaction (conforming readers) and PDF products that read and/or write PDF files for a variety of other purposes (conforming products).

AMI. Does it make it clear how to conform?

PMR. Yes. It’s well written.

AMI. Is it free to download?

PMR. Yes (Adobe provide a copy on their website)

AMI. Are there any legal restrictions to implementing it? [AMI understands that some things can’t be done for legal reasons like patents and copyright.]

PMR. Not that we need to worry about.

AMI. Do the publishers have enough money to read it? [AMI knows that money may matter.]

PMR. It is free.

AMI. So we can assume the publishers and their typesetters have read it? And tried to implement it.

PMR. We can assume nothing. Publishers don’t communicate anything.

AMI. I will follow the overview in Wikipedia:

File structure

A PDF file consists primarily of objects, of which there are eight types:

  • Boolean values, representing true or false
  • Numbers
  • Strings
  • Names
  • Arrays, ordered collections of objects
  • Dictionaries, collections of objects indexed by Names
  • Streams, usually containing large amounts of data
  • The null object

Do the PDFs conform to that?

PMR: They seem to, since PDFBox generally reads them.

AMI. Fonts are important:

Standard Type 1 Fonts (Standard 14 Fonts)

Fourteen typefaces—known as the standard 14 fonts—have a special significance in PDF documents.

These fonts are sometimes called the base fourteen fonts. These fonts, or suitable substitute fonts with the same metrics, must always be available in all PDF readers and so need not be embedded in a PDF. PDF viewers must know about the metrics of these fonts. Other fonts may be substituted if they are not embedded in a PDF.

AMI: If a PDF uses the 14 base fonts, then any PDF software must understand them, OK?

PMR. Yes. But the STM publishers don’t use the 14 base fonts.

AMI. What fonts do they use?

PMR. There are zillions. We don’t know anything about most of them.

AMI. Then how do I read them? Do they use Type1 Fonts?

PMR. Sometimes yes and sometimes no.

AMI. A Type1Font must have a FontDescriptor. The FontDescriptor will tell us the FontFamily, whether the font is italic, bold, symbol etc. That will solve many problems.

PMR. Many publishers don’t use FontDescriptors.

AMI. Then they are not adhering to standard PDF.

PMR. Yes.

AMI. Then I can’t help.

PMR. Maybe we can guess. Sometimes the FontName can be interpreted. For example “Helvetica-Bold” is a bold Helvetica font.

AMI. Is there a naming convention for Fonts? Can we write a regular expression?

PMR. No. Publishers do not use systematic names.
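[Aside: even without a systematic naming convention, the partial guess PMR mentions (“Helvetica-Bold” is a bold Helvetica) can be coded as a best-effort heuristic. This is my sketch, not the AMI2 code, and it can only ever be a guess:]

```python
import re

# Best-effort sketch (mine, not AMI2's code) of guessing weight, style
# and family from a FontName like "Helvetica-Bold" or
# "TimesNewRomanPS-ItalicMT". There is no universal convention.

def guess_font_traits(font_name):
    name = font_name or ""
    return {
        "bold": bool(re.search(r"bold|black|heavy", name, re.I)),
        "italic": bool(re.search(r"italic|oblique", name, re.I)),
        # take everything before the first hyphen/comma as the family
        "family": re.split(r"[-,]", name)[0] or None,
    }
```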

AMI. I have just found some publishers use some fonts without FontNames. I can’t understand them.

PMR. Nor can anyone.

AMI. So the PDF renderer has to draw the glyph as there is no other information.

PMR. That’s right.

AMI. Is there a table of glyphs in these fonts?

PMR. No. We have to guess.

AMI. It will take me about 100 times longer to develop and write a correct PDF2SVG for all the publishers.

PMR. No, you can never do it because you cannot predict what new non-standard features will be added.

AMI. I will do what you tell me.

PMR. We will guess that most fonts use a Unicode character set. We’ll guess that there are a number of non-standard, non-documented character sets for the others – perhaps 50. We’ll fill them in as we read documents.

AMI. I cannot guarantee the results.

PMR. You have already done a useful job. We have had some positive comments from the community.

AMI. I don’t understand words like “cool” and “great job”.

PMR. They mean “steady progress”.

AMI. OK. Now I am moving to SVGPlus.

PMR. We’ll have a new blog post for that.
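The guessing strategy PMR describes (Unicode by default, plus per-font tables of non-standard codepoints filled in as documents are read) might be sketched like this. The table contents below are purely illustrative, not real measurements:

```python
# Sketch of the guessing strategy: assume a font's character codes are
# Unicode codepoints unless a per-font table (filled in by hand as
# documents are read) says otherwise. The example entry is illustrative.

NONSTANDARD = {
    # hypothetical example: code 1 in this font renders as MINUS SIGN
    "MathematicalPi-One": {0x01: "\u2212"},
}

def to_unicode(font_name, code):
    table = NONSTANDARD.get(font_name, {})
    if code in table:
        return table[code]
    return chr(code)  # default guess: the code IS the Unicode codepoint
```

As PMR says, this cannot be guaranteed correct; it is a heuristic that improves monotonically as more fonts are catalogued.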

Posted in Uncategorized | Leave a comment