#ami2 #opencontentmining: AMI releases AMI2-PDF2SVG. Please have a go.

AMI and Friends (Murray Jensen, PMR) have made good progress with the first part of their 3-part Quest. PDF2SVG is now at continual-beta. The framework is in place, moderately exercised and “tested”. (It’s actually quite difficult to build tests for systems that have no fixed spec, but at least there are some sort of regression tests). It’s at http://www.bitbucket.org/petermr/pdf2svg

Running PDF2SVG should be easy. You need to have Java 1.6 installed. There’s a giant JAR file (pdf2svg-4-jar-with-dependencies.jar or later) which can be run on the commandline by:

$ java -jar pdf2svg-4-jar-with-dependencies.jar [myfile.pdf]

[If you don’t know what this is about and still want to run AMI, find the nearest person with a penguin on their T-shirt and ask very nicely. If you don’t know what a commandline is, suggest you leave it…]. [Jar will be uploaded today]. This takes the PDF file, reads about 5 pages per second (on my laptop) and creates a *.svg file for each page. Thus if the PDF had 10 pages you should have something like:

myfile1.svg, myfile2.svg … myfile10.svg

This is already somewhat magical. AMI has converted the impenetrable PDF file into much more penetrable SVG files. [Hang on – it’s worth it.]. SVG files can be read by geeks and are an infinitely better starting point for phases 2 and 3 of AMI’s quest. Indeed if you like things like Crosswords and Sudoku you’ll find it’s quite simple. If AMI has done a complete job then there should be only 3 sorts of things in these SVG files:

  • Characters (letters, numbers) positioned on the page
  • Paths (lines, curves, squiggles) positioned on the page
  • Images (at present AMI leaves these out, but they can go back at any time)

There are no words, paragraphs, squares, graphs, chemical formulae at this stage because we haven’t taught AMI about them. (But we will and she can learn them quickly). The magic is that every object is precisely defined – e.g. what the character is, etc.
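To give a flavour of the output, here is a hand-written sketch of the first two kinds of element; the attribute values are invented for illustration, and real examples taken from actual papers appear further down this page:

<!-- a single character, positioned on the page (values invented) -->
<text font-family="TimesNewRomanPS" font-size="9.0" x="72.0" y="95.4">T</text>

<!-- a path: moveto/lineto (and curve) instructions, here a short horizontal line -->
<path stroke="#000000" d="M 72.0 100.0 L 200.0 100.0"/>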

Now the slightly bad news. It’s possible to create excellent PDF files which translate trivially into SVG. AMI does this translation perfectly. The trouble is that STM publishers create PDF files of “varying” quality. The top quality is “barely satisfactory” and most are harder to work with.

There is an Open PDF specification which allows all characters to be created as Unicode and all graphics to be captured as Vector Graphics. But many publishers will transform the vector graphics (from which AMI can get a lot) into JPEGs (from which AMI can recover almost nothing). And many of the characters are transformed into Glyphs.

What’s a Glyph? It’s a pictorial diagram which represents a character (sometimes more than one character – a “ligature”). Publishers assume that all “readers” will be sighted humans and that a collection of dumb glyphs is satisfactory.

It isn’t.

An unsighted human cannot understand a glyph and nor (easily) can AMI. What’s this character: “I”? Is it a CAPITAL_LATIN_I or a SMALL_LATIN_L? You *cannot* tell by looking at it. What’s this: “III”? ILL, or Roman-THREE? If the values (codepoints) of the characters are given then we know absolutely. Without those codepoints we have to guess from the Glyph. (This is one of the problems of Optical Character Recognition – sometimes you cannot know for sure). What’s “-“? MINUS, HYPHEN-MINUS, EM_DASH, EN_DASH, OVERBAR, UNDERBAR, etc.

If all publishers used Unicode and standard PDF fonts we would be fine. But they don’t. They choose unknown, undocumented Fonts and also throw away significant amounts of information. In other cases the information is visibly inconsistent. We find a character whose name is “bullet” and whose glyph is “-“. What is it? We have to guess.

AMI is being taught to guess. Not by tossing a die but by following a set of heuristic rules. And, where necessary, adding probabilities. So here is one of the most problematic examples she has encountered. [Remember AMI does not rant; she has no emotions. Things can be difficult or easy; valid or invalid; probably correct or probably incorrect.]

So first to show you what AMI can do with good material. This is from the Australian Journal Of Chemistry, from CSIRO Publishing. In general it’s manageable quality. Here’s the PDF:

And here’s AMI’s transformation into SVG:

Can you spot any differences (other than the boxes and red highlighting AMI has drawn in, for reasons we'll see in later posts)? PMR can't. All the characters are there and in the right places. The italics are detected and so are the bolds. There's only one thing we had to teach AMI to get this right (see later).

But here’s an equation AMI thinks is probably wrong:

OK – most trained sighted scientists will probably spot something strange immediately. But AMI isn’t sighted and isn’t a scientist and doesn’t understand equations (not yet; but she will!). So why does she think it’s probably wrong?

Because of the characters. Nobody uses the single character ¼ (a quarter). It is (sometimes) used for farthings, imperial lengths, stock prices etc. But never in scientific documents. And then there are “eth” and “thorn”. How many scientific articles are written in Icelandic? And even those wouldn’t make linguistic sense. So there is a VERY high probability that these characters have been misinterpreted.
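To make "probably misinterpreted" concrete, here is a minimal Java sketch (not AMI's actual code) of the kind of check involved: some codepoints are perfectly legal Unicode but so improbable in an English-language scientific article that their presence signals a font problem. The example string is a made-up stand-in for the garbled equation.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: flag codepoints that are legal Unicode but wildly improbable
// in an English-language scientific article, so they are probably the result
// of a misinterpreted font.
public class SuspiciousCharacters {

    // U+00BC VULGAR FRACTION ONE QUARTER, U+00F0 eth, U+00FE thorn, U+00DE capital Thorn
    private static final Set<Integer> SUSPICIOUS = new HashSet<Integer>(
            Arrays.asList(0x00BC, 0x00F0, 0x00FE, 0x00DE));

    public static boolean isSuspicious(int codePoint) {
        return SUSPICIOUS.contains(codePoint);
    }

    public static void main(String[] args) {
        // an invented stand-in for what the PDF appears to claim the equation says
        String equation = "G \u00BC k\u00F0C\u00DE";
        for (int i = 0; i < equation.length(); ) {
            int cp = equation.codePointAt(i);
            if (isSuspicious(cp)) {
                System.out.printf("probably misinterpreted: U+%04X%n", cp);
            }
            i += Character.charCount(cp);
        }
    }
}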

This is actually from a paper where most of the information is presented in glyphs, where no fonts are standard and where character names have been discarded. It’s almost certainly unintelligible to a non-sighted human. So what did the original equation look like?

You'll see that the "G" should be a capital italic Gamma, the ¼ should be an "=", and there are two types of thorn (lowercase and uppercase). One of the thorns represents a "+", while the eth and the other thorn stand for a pair of matching rounded parentheses.

It’s not AMI’s fault. It’s because the publisher gave incomplete information. Let’s see what is actually underneath. Today we gave AMI a tooltip so sighted humans can browse each character and debug AMI. That’s very useful:

So the tooltip says that the ¼ is codepoint 188 in Font AdvP4C4E74. The problem is that AMI has never heard of this font. Nor has PMR, and I suspect nor have most other people. It's not documented. So how do we progress? We could write to the publisher, who probably won't reply, or who will tell us they don't know, or that it's confidential, or … So we'll have to develop some heuristics. It may be that we simply have to say "This paper is written in an uninterpretable manner for non-sighted humans" but we hope we can do more.
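One plausible heuristic – sketched below in Java purely as an illustration, not a commitment to how AMI will actually work – is a per-font override table built from human observations, such as the observation above that codepoint 188 in AdvP4C4E74 is drawn where an "=" belongs.

import java.util.HashMap;
import java.util.Map;

// Sketch of a per-font override table for undocumented fonts.
// The entries are human-supplied guesses, not facts from any specification.
public class FontOverrides {

    private static final Map<String, Map<Integer, Character>> OVERRIDES =
            new HashMap<String, Map<Integer, Character>>();

    static {
        Map<Integer, Character> advP4C4E74 = new HashMap<Integer, Character>();
        advP4C4E74.put(188, '=');   // observed: the "1/4" glyph actually stands for EQUALS SIGN
        OVERRIDES.put("AdvP4C4E74", advP4C4E74);
    }

    /** Returns the corrected character, or null if we have no rule for this font/codepoint. */
    public static Character lookup(String fontName, int codePoint) {
        Map<Integer, Character> table = OVERRIDES.get(fontName);
        return (table == null) ? null : table.get(codePoint);
    }
}

A table like this could grow by crowdsourcing, one confirmed (font, codepoint) pair at a time.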

Note that there are actually several fonts:

  • The “G” is “AdvPi2”
  • The "¼", the eth and the thorns are "AdvP4C4E74"
  • The “KfC” is “AdvTimes-i” (italic)
  • The “1” is AdvTimes
  • The (…) pair are “AdvP4C4E46”

[That’s 5 fonts used in 4 cm of text. If the publisher had used Unicode it could have been done with one (with appropriate font-sizes and italics). ]

This equation is probably beyond even AMI's powers at present, but who knows? More experience and crowdsourcing will help greatly. We will have rules like "You can/can't rely on publisher X for consistency or validity" (and we'll start to publish examples of quality).

So here’s what PDF2SVG produces for the AJC paper (just one page for brevity, though AMI does 130 pages in ca 20 secs).

name=TimesNewRomanPS fam=TimesNewRoman type=Type1 bold=false it=false face=PS sym=false enc=WinAnsiEncoding

Excellent start by AJC. TimesNewRoman is a standard Type1Font in PDF and should map onto Unicode. It uses WinAnsi encoding which is standard and AMI has been taught about.

name=KAIHCK+Helvetica fam=Helvetica type=Type1 bold=false it=false face=null sym=false enc=WinAnsiEncoding

Again Helvetica is a standard Type1Font.

name=KAIKCD+Helvetica-Oblique fam=Helvetica type=Type1 bold=false it=true face=null sym=false enc=WinAnsiEncoding

name=TimesNewRomanPS-Bold fam=TimesNewRoman type=Type1 bold=true it=false face=null sym=false enc=WinAnsiEncoding

And these are fine as well (the publisher has recorded that the first is italic and the second is bold). This is encoded in the PDF (we should not have to guess it from the font names). Again, excellent so far.

name=KAILAO+MathematicalPi-One fam=MathematicalPi-One type=Type1 bold=false it=false face=null sym=true enc=DictionaryEncoding

This is a problem. MathematicalPi (there are at least -One, -Two, -Three, -Four) is a non-standard font. It’s not documented anywhere PMR can find. (PMR has even offered a bounty of 200 points on StackOverflow for a conversion to Unicode).

MathematicalPi does not have numeric codepoints. It has charnames. The one we need is “H11001”. How do we know what that is? (or even that it’s consistent – PMR believes it is). There are two main ways:

  • Ask humans what they think the glyph means (now go back to the first diagram and find it).

    Most sighted humans would be happy to say this is a “PLUS SIGN”

  • Interpret the glyph without human eyes. So here's the information (M = moveto, L = lineto, Z = close). If you plot this out on graph paper you'll see a plus sign.

    H11001: created M 0.390625 -0.03125 L 0.390625 -0.296875 L 0.125 -0.296875 L 0.125 -0.34375 L 0.390625 -0.34375 L 0.390625 -0.625 L 0.4375 -0.625 L 0.4375 -0.34375 L 0.71875 -0.34375 L 0.71875 -0.296875 L 0.4375 -0.296875 L 0.4375 -0.03125 Z


But AMI has to do this without eyes. It *is* possible, but we’d also be very interested in offers of help or code.
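To show that it is possible, here is a toy Java sketch (emphatically not AMI's code) that "looks at" an outline of the form "M x y L x y … Z" using only the numbers: a plus sign has 12 corners, no curves, and is symmetric about the centre of its bounding box. The H11001 outline above passes this crude test; a serious classifier would need templates for every glyph we care about.

import java.util.ArrayList;
import java.util.List;

// Toy sketch: decide whether a glyph outline of the form "M x y L x y ... Z"
// looks like a PLUS SIGN, using only the numbers (no eyes, no rendering).
public class PlusGlyphGuesser {

    public static boolean looksLikePlus(String d) {
        List<double[]> pts = new ArrayList<double[]>();
        String[] tokens = d.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            if (t.equals("M") || t.equals("L")) {
                double x = Double.parseDouble(tokens[++i]);
                double y = Double.parseDouble(tokens[++i]);
                pts.add(new double[] { x, y });
            } else if (t.equals("C")) {
                return false;   // any curve segment means it is not a simple plus
            }
        }
        if (pts.size() != 12) return false;   // a plus outline has 12 corners
        // a plus is (roughly) centrosymmetric about the centre of its bounding box
        double minX = Double.MAX_VALUE, maxX = -Double.MAX_VALUE;
        double minY = Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (double[] p : pts) {
            minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
            minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
        }
        double cx = (minX + maxX) / 2, cy = (minY + maxY) / 2, tol = 0.05;
        for (double[] p : pts) {
            if (!containsPoint(pts, 2 * cx - p[0], 2 * cy - p[1], tol)) return false;
        }
        return true;
    }

    private static boolean containsPoint(List<double[]> pts, double x, double y, double tol) {
        for (double[] p : pts) {
            if (Math.abs(p[0] - x) < tol && Math.abs(p[1] - y) < tol) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String h11001 = "M 0.390625 -0.03125 L 0.390625 -0.296875 L 0.125 -0.296875 "
                + "L 0.125 -0.34375 L 0.390625 -0.34375 L 0.390625 -0.625 L 0.4375 -0.625 "
                + "L 0.4375 -0.34375 L 0.71875 -0.34375 L 0.71875 -0.296875 "
                + "L 0.4375 -0.296875 L 0.4375 -0.03125 Z";
        System.out.println(looksLikePlus(h11001));   // true
    }
}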

So this is a problem. A non-standard (probably) proprietary font makes it much more difficult. It gets more problematic still:

name=MTSYN fam=MTSYN type=Type1 bold=false it=false face=null sym=true enc=DictionaryEncoding

null: created M 0.703125 -0.21875 L 0.078125 -0.21875 L 0.078125 -0.28125 L 0.703125 -0.28125 Z

This is harder as we have a non-standard font and no charname. We have a glyph and a codepoint. What does it look like?

So it's a MINUS SIGN (U+2212). The characters before and after are Times, which has a perfectly good HYPHEN-MINUS, but for some reason the publisher has chosen a non-standard font with no public documentation. At least it appears to be Unicode.

name=MTMI fam=MTMI type=Type1 bold=false it=true face=null sym=true enc=DictionaryEncoding

slash: created M 0.421875 -0.671875 L 0.09375 0.203125 L 0.046875 0.203125 L 0.375 -0.671875 Z

Yet another non-standard font (for encoding a standard Unicode character SOLIDUS).

And still another:

name=MTEX fam=MTEX type=Type1 bold=false it=false face=null sym=true enc=DictionaryEncoding

null: created M 0.515625 0.46875 L 0.171875 0.921875 L 0.703125 0.921875 C 0.859375 0.921875 0.921875 0.859375 0.953125 0.78125 L 0.984375 0.78125 L 0.90625 1.0 L 0.0625 1.0 L 0.421875 0.515625 L 0.0625 0.0 L 0.90625 0.0 L 0.984375 0.21875 L 0.953125 0.21875 C 0.90625 0.0625 0.796875 0.046875 0.703125 0.046875 L 0.21875 0.046875 Z

I’ll leave it to you to find out what this one is!

So this AJC page started out conformant and then got more and more non-standard and so more difficult and error-prone. Many other publishers are more “variable”. Remember AMI2 has no emotions and no idea of the size of the task ahead. She just solves one problem at a time.

So find a PDF you are interested in and see how well it works for you. You'll almost certainly find strange characters and we'll be hoping to collect communal wisdom. What's more important at this stage is reports of crashes or hanging. Let us know.


Royal Society of Chemistry will charge students for re-using “Gold Open Access” articles

I have been trying to find “Open Access” articles published in Royal Society of Chemistry journals. It’s very difficult – Google doesn’t help – and I’ve scanned about 200 abstracts without finding one. Then I happened on http://blogs.rsc.org/cc/2012/10/08/chemcomm-celebrates-its-first-gold-for-gold-communication/ (I’ll reproduce it in full without permission but I have removed an image of a gold medal):

ChemComm celebrates its first Gold for Gold communication

08 Oct 2012

By Joanne Thomson.

A groundbreaking £1 million initiative to support British researchers

Eugen Stulz (University of Southampton) and colleagues are the first ChemComm authors to publish a communication as part of our Gold for Gold initiative.

Their communication, entitled 'A DNA based five-state switch with programmed reversibility', is now free to access for all.

‘I’m delighted that Eugen’s communication is the first open access communication to be published in ChemComm using the RSC’s Gold for Gold programme,’  says Phil Gale, Head of Chemistry at the University of Southampton. ‘This open access programme will allow us to showcase our research to a much wider audience.’

Gold for Gold is an innovative initiative rewarding UK RSC Gold customers with credits to publish a select number of papers in RSC journals via Open Science, the RSC's Gold Open Access option.

"Gold for Gold" is an RSC scheme where they will match funding for UK authors: http://www.rsc.org/AboutUs/News/PressReleases/2012/gold-for-gold-rsc-open-access.asp . Excerpts include:

UK institutes who are RSC Gold customers will shortly receive credit equal to the subscription paid, enabling their researchers, who are being asked to publish Open Access but often do not yet have funding to pay for it directly, to make their paper available via Open Science, the RSC’s Gold OA option. 

The Research Councils UK (RCUK) also published their revised policy on Open Access, requiring researchers to publish in OA compliant journals.
‘Gold for Gold’ seeks to support researchers until the block grants from RCUK are distributed next April, which, once established are intended to fund Gold OA. [PMR emphasis]

And Univ of Cambridge and JISC seem to think it a good idea:

Lesley Gray, Journals Co-ordinator Scheme Manager from the University of Cambridge, said: “This initiative by the RSC is welcomed, and will serve to promote Open Access publishing to researchers.”

Lorraine Estelle, Chief Executive of JISC congratulated the RSC on launching ‘Gold for Gold’ which “demonstrates the Society’s engagement with the chemical science community and recent Open Access developments”. 

So what are Eugen Stulz and readers (and perhaps RCUK) getting for their money? Here's the cover page of the article:

What rights do readers have? Is this compliant with the RCUK definition of Gold, which requires CC-BY at least? The phrase "RSC Open Science free article" is not clickable (unlike the "Open Access" buttons in BMC and PLoS). I cannot find any more information by Googling. So let's look at the article – it should have some indication of authorship and copyright.

"This journal is © The Royal Society of Chemistry 2012"

This is a very strange phrase which I have found consistently in RSC material. I don't know what it means. There is no indication in the article that it is Open Access. I suspect that almost anyone would assume this was an article in which the RSC claimed complete rights. So let's go to "Request Permissions". I'll simulate a student asking for permission to re-use the three diagrams in her thesis. That's a reasonable scientific thing to do. Indeed it could be scientifically irresponsible NOT to show other scientists' data.

So here’s my request for a student to re-use the diagrams:

So even a student has to pay 240 USD for re-use of scientific data from this “Gold Open Access” article.

This is completely at odds with the RCUK policy of CC-BY for paid Open Access. RCUK read my blog and I hope they will make it quite clear to RSC that this is not in the letter or the spirit of paid Open Access.

And if as a lecturer I wanted to give every student in a class of 50 a copy of this 3-page “Open Access” article:

[No rant]

#ami2 #opencontentmining; Introducing AMI, and introducing AMI to publishers' PDFs

I’ve been to central Melbourne (Central Business District, CBD) for the last two days. To hack. But the first visit was to http://en.wikipedia.org/wiki/Queen_Victoria_Market to find AMI. (AMI2 is the scientifically intelligent program amanuensis we are building and AMI will give us inspiration and symbolism – these things matter.) Queen Vic has rows of tourist traps including soft Australian toys.

So I had no doubt that AMI would find me. I looked long and hard, thinking platypus, opossum, echidna, koala, but I had no choice and here is AMI:

She – and AMI has always been a she – has not named her joey yet. In case you need to know more about http://en.wikipedia.org/wiki/Kangaroo Wikipedia says they are “shy and retiring by nature” so an excellent companion, who also “release virtually [no methane]. The hydrogen byproduct of fermentation is instead converted into acetate”.

So yesterday and today I sat and hacked in the superb La Trobe reading room in the http://en.wikipedia.org/wiki/State_Library_of_Victoria :

Free wifi, free power and that’s the view where I was sitting. Perfect silence. The occasional visitor coming to see Ned Kelly’s home-made suit of armour.

So I have spent two days teaching AMI about publishers’ PDFs. Remember AMI has no emotions, doesn’t get angry, doesn’t rant. So her main impressions are:

  • Highly variable
  • Quite a lot of work
  • A challenge but manageable
  • Non-standard

She doesn't use words like "good/bad", "awful/beautiful" or "value for money", but "tractable/intractable", "standard/nonstandard", "deterministic/guesswork".

She’s been following the discussion on the last post (/pmr/2012/11/04/ami2-opencontentmining-pdf2svg-ami-comments-on-her-experience-of-the-digital-printing-industry-and-stm-publishers/#comment-117090 ), where there are some very useful comments.

 

Villu says:

November 4, 2012 at 9:24 am

Having implemented some font support for PDFBox, it is my understanding that fontNames shouldn't be used to judge what's "inside" them.

The fontName=MGCOJK+AdvTT7c108665 probably corresponds to some synthetic font object. The PDF specification makes it clear that when PDF documents are exported with "the smallest PDF file size possible" objective, then it is OK to perform the compacting of embedded font objects by leaving out unnecessary glyphs, by remapping character codes etc. For example, it is not too uncommon to encounter embedded font objects that contain only one or two glyphs.

The extent of the compacting of font objects depends on the scientific publisher. Some do it, some don’t.

OK. So AMI2 will have to deal with synthetic font objects. If they are Unicode it’s probably OK. If they aren’t we’ll need some per-publisher hacks, we suspect.

 

Steve Pettifer says:

November 5, 2012 at 8:27 am

Villu is correct — you cannot ‘trust’ the names of fonts to mean anything. If you’re lucky they will have some words like ‘bold’ or ‘italic’ or end with ‘-b’ or ‘-i’. But they are just opaque identifiers, planted in there by whatever software created the document (typically Unix like stuff plants human-readable names, Microsoft stuff plants duff names, but you shouldn’t rely on them even if they look as though they mean something).

So AMI2 knows not to trust fontNames. She will try to trust the content. But she’s used to heuristics and hacks.

 

Steve Pettifer says:

November 5, 2012 at 10:17 am

The point I’m trying to make is that the mess we see in PDFs (and HTML) representation is caused by a combination of a lack of sensible authoring tools, and the process of ‘publishing’; sometimes authors do things right, and publishers mangle them; sometimes the other way round, and the mistakes appear in all representations (even in the XML versions of things).

There is one place where PDF is considerably worse than HTML, and that’s in the very naughty use publishers sometimes make of combining glyphs to make characters ‘look right’. I’ve found instances where the Å (Angstrom) symbol is created in the PDF by drawing a capital A, and then a lower-case ‘o’, with instructions to shrink the o, move it back one character, and place it above the A. In the PDF this looks OK to a human, but comes out as guff to a machine (again, a heuristic needed to spot it). In this particular instance the HTML representation was also broken, coming out as two sequential characters ‘Ao’. And all this in spite of the fact that there’s a perfectly good unicode Å character.

The ß versus β problem is surprisingly common (though as you say, it's relatively rare that ß would appear in STM articles to mean the German character) — we've found several instances of it in modern articles — again unless you're analysing these things with a machine, they look plausible in both PDF and HTML, and it's only an eagle-eyed reader that would spot them.

Perhaps it would be useful for us to jointly create a list of dodgy characters / naughty encodings and heuristics for spotting them?

AMI doesn't understand words like "mess" and "mangle" and "naughty". She does understand "right". She now knows that she can expect a little "o" shifted back and up above an "A" for character "Aring". That's no problem. She also has to learn that "H" and "e" in that order without a space spells "He". And she will soon be taught that "He" can mean a male human or Helium (and probably lots of other things). By separating the problem into bits (first get the characters right, then see how they join, then interpret them as science) AMI2 is quite confident we shall get there. RSN.

And when we do she will be able to act as amanuensis to an awful lot of humans.

Meanwhile she has to meet Chuff…

And tomorrow she is off to the seaside with PMR to do some more hacking. Will we find wifi? Who knows – but it's not needed for the final hack on PDF2SVG, which we might even post tomorrow. Who knows?

#ami2 #opencontentmining. PDF2SVG: AMI comments on her experience of the digital printing industry and STM Publishers

AMI2 is our new intelligent amanuensis for reading and understanding the Scientific Technical Medical literature. AMI is OURS – not mine – she is completely Open and you can take part.

YOU DON’T HAVE TO KNOW COMPUTING TO HELP.

AMI's first phase (PDF2SVG) is ready for work. It's alpha. That means that no-one has tested it yet. It also means you shouldn't rely on the results. Indeed you would be very foolish to – we have detected an instance today where a string printed on the page as "µg"

is interpreted by AMI as "mg".

That's very very bad. The first is a microgram (1 millionth of a gram) and the second is a milligram (1 thousandth of a gram). If you have a dose of (say) 1 microgram and it is converted to 1 milligram you get 1000 times too much. That could easily kill you.

So what’s the problem? After all AMI is smart and once taught something she never forgets.

The main problem is Fonts: badly constructed, and badly documented or undocumented. An indication of the enormous diversity and often poor practices in the production of STM PDFs. AMI has to cope with all of this, as competently as possible. This is AMI's blog post so there are no [PMR] rants. AMI has no emotions (that's an exciting part of machine intelligence but far beyond PMR and AMI). If AMI is given a job, she carries it out deterministically. Like a wage slave, but she doesn't get paid, never gets hungry or tired. So here is how AMI has been instructed in what Fonts are and how to understand them.

Note that the terminology, like so much else in PDF production is misused. Some of the concepts we label in PDFspeak have better names. And please correct me when I am wrong! For example “Helvetica” formally refers to a typeface but the word “Font” is misused instead. Similarly we should use FontFoundry in some instances. “PDF Font” would probably be a precise name, and in the software we are using (PDFBox) there is a class PDFont to represent this.

The terminology and concepts date from about 150 years ago, when the purpose of fonts was to create physical metal objects ("type") which were used to print ink onto paper. Never forget that the printing of ink onto paper perfuses the language and thinking of digital typography. We have to change that. Here's http://en.wikipedia.org/wiki/Font :

In typography, a font is traditionally defined as a quantity of sorts composing a complete character set of a single size and style of a particular typeface. For example, the complete set of all the characters for "9-point Bulmer" is called a font, and the "10-point Bulmer" would be another separate font, but part of the same font family, whereas "9-point Bulmer boldface" would be another font in a different font family of the same typeface. One individual font character might be referred to as a "sort," "piece of font," or "piece of type".

Font nowadays is frequently used synonymously with the term typeface, although they had clearly understood different meanings before the advent of digital typography and desktop publishing.

Beginning in the 1980s, with the introduction of computer fonts, a broader definition for the term “font” evolved. Different sizes of a single style—separate fonts in metal type—are now generated from a single computer font, because vector shapes can be scaled freely. “Bulmer”, the typeface, may include the fonts “Bulmer roman”, “Bulmer italic”, “Bulmer bold” and “Bulmer extended”, but there is no separate font for “9-point Bulmer italic” as opposed to “10-point Bulmer italic”.

So computer font technology has evolved (slightly) to make it possible to normalize some of the process of putting digital ink on digital display (either “screen” or “printer”). Never forget that it is assumed that a sighted human is “reading” the output. The output is a two dimensional object representing paper with marks on it.

Without a human sighted reader almost all the value of PDFs is lost. That's what AMI is tackling. We are building http://en.wikipedia.org/wiki/Computer_font into her brain. If you and AMI are to understand each other you must understand this:

A computer font (or font) is an electronic data file containing a set of glyphs, characters, or symbols such as dingbats. Although the term font first referred to a set of metal type sorts in one style and size, since the 1990s it is generally used to refer to a scalable set of digital shapes that may be printed at many different sizes.

There are three basic kinds of computer font file data formats:

  • Bitmap fonts consist of a matrix of dots or pixels representing the image of each glyph in each face and size.
  • Outline fonts (also called vector fonts) use Bézier curves, drawing instructions and mathematical formulae to describe each glyph, which make the character outlines scalable to any size.
  • Stroke fonts use a series of specified lines and additional information to define the profile, or size and shape of the line in a specific face, which together describe the appearance of the glyph.

Bitmap fonts are faster and easier to use in computer code, but non-scalable, requiring a separate font for each size. Outline and stroke fonts can be resized using a single font and substituting different measurements for components of each glyph, but are somewhat more complicated to render on screen than bitmap fonts, as they require additional computer code to render the outline to a bitmap for display on screen or in print. Although all types are still in use, most fonts seen and used on computers are outline fonts.

A raster image can be displayed in a different size only with some distortion, but renders quickly; outline or stroke image formats are resizable but take more time to render as pixels must be drawn from scratch each time they are displayed.

Fonts are designed and created using font editors. Fonts specifically designed for the computer screen and not printing are known as screenfonts.

Fonts can be monospaced (i.e., every character is plotted a constant distance from the previous character that it is next to, while drawing) or proportional (each character has its own width). However, the particular font-handling application can affect the spacing, particularly when doing justification.

You really have to understand this if you are going to help build and make the best use of AMI2.PDF2SVG. On the other hand if you just want to use the output of PDF2SVG you should trust us to get it (mainly) right, and then it becomes much simpler.

Some of the problem arises from the technical constraints of – say – the 1980s. Screens (e.g. Tektronix) were often not interoperable and graphics vector drawers (e.g. Calcomp) were expensive and had arcane drivers. Speed and bandwidth were also critical and we saw protocols such as http://en.wikipedia.org/wiki/Ascii85 which added slight compression at the cost of almost complete uninterpretability. Unfortunately some of this culture remains in vestigial mode.

Much of the problem arises from the closed and fragmented nature of the digital printing industry. Companies built systems where there was a complete proprietary toolchain from a single supplier. As long as you used this supplier throughout there was a reasonable chance the operation would produce paper with ink meaningful to humans. Major manufacturers (such as Adobe, Apple and Microsoft) created non-interoperable systems. Each would create non-interoperable codepoint maps (we'll explain codepoint later) and additional resources such as symbol tables. So the only thing you could do was buy a compatible printer and print the document.

The de facto standard for printed documents in STM is PDF. [AMI: PMR considers this a disaster, but I am being built to cope with it.]. If you want to get an idea of the complexity look at http://en.wikipedia.org/wiki/Portable_Document_Format and http://en.wikipedia.org/wiki/PostScript_fonts: Here are some snippets:

Portable Document Format (PDF) is a file format used to represent documents in a manner independent of application software, hardware, and operating systems.[1] Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it.

And

The PDF combines three technologies:

  • A subset of the PostScript page description programming language, for generating the layout and graphics.
  • A font-embedding/replacement system to allow fonts to travel with the documents.
  • A structured storage system to bundle these elements and any associated content into a single file, with data compression where appropriate.

Before looking at PDF in detail, recall that in 1995 there were two emerging technologies which make much of the whole problem much simpler and which, if adopted more widely, would mean better quality of transmission of scientific information today: HTML and Java. Neither was perfect but they have evolved, in an Open manner, to allow almost perfect transmission of scientific information if care is taken. In particular they removed (1) the need for Postscript and (2), completely, the need to transmit fonts. Unfortunately they failed to address (3), the bundling of everything into a single file. In both HTML and Java there is the concept of a world-wide universal encoding for text and symbols, http://en.wikipedia.org/wiki/Unicode

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. Developed in conjunction with the Universal Character Set standard and published in book form as The Unicode Standard, the latest version of Unicode consists of a repertoire of more than 110,000 characters covering 100 scripts, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).[1]

Unicode standardizes the representation of characters. Characters are abstractions of letters, numbers and symbols and are independent of how they will be visualized. Thus the text string C-2 consists of 3 characters: 'C', 'hyphen-minus' and 'two'. Here is typical language from http://en.wikipedia.org/wiki/Hyphen-minus

The hyphen-minus (-) is a character used in digital documents and computing to represent a hyphen (‐) or a minus sign (−).[1] It is present in Unicode as code point U+002D hyphen-minus; it is also in ASCII with the same value.

Almost all characters are representable by a glyph; here’s a typical hyphen-minus:

There is no standard glyph for a character and different computer fonts can have radically different glyphs (Douglas Hofstadter wrote considerably on the characters ‘A’ and ‘a’ and what was the essence of ‘a-ness’) http://en.wikipedia.org/wiki/A (Different glyphs of the lower case letter A.)

THE CONSUMER OF PDF2SVG SHOULD NOT HAVE TO CARE ABOUT GLYPHS

Java and HTML solved the problem by using Unicode for the character encoding and UTF-8 for the bytestream (http://en.wikipedia.org/wiki/UTF-8 ):

UTF-8 (UCS Transformation Format—8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set.

UTF-8 takes care of all the aspects of how a character is held in a machine or how many bytes are used when, where and in which order. In a properly constructed modern HTML/Java system the Unicode codepoint is all you need to know about. Both are based on Unicode. Both can and should be used with UTF-8.

The beauty of HTML and Java is that they removed the need to transmit fonts and glyphs. They did this by requiring the implementer to provide a basic set of fonts. In Java these are http://docs.oracle.com/javase/tutorial/2d/text/fonts.html

There are two types of fonts: physical fonts and logical fonts. Physical fonts are the actual font libraries consisting of, for example, TrueType or PostScript Type 1 fonts. The physical fonts may be Time, Helvetica, Courier, or any number of other fonts, including international fonts. Logical fonts are the following five font families: Serif, SansSerif, Monospaced, Dialog, and DialogInput. These logical fonts are not actual font libraries. Instead, the logical font names are mapped to physical fonts by the Java runtime environment.

And

Physical fonts are the actual font libraries containing glyph data and tables to map from character sequences to glyph sequences, using a font technology such as TrueType or PostScript Type 1. Note: Applications should not assume that any particular physical font is present. However, the logical fonts are a safe choice because they are always present. See Logical Fonts for more information.

So Java has separated the problem of glyphs and fonts from the transmission of characters. You can get a feel with the applet in http://docs.oracle.com/javase/tutorial/2d/text/fonts.html#logical-fonts

Here’s a standard representation:

While we can change to, say, Helvetica and make it bold and italic.

The point is that the CHARACTERS ARE THE SAME. See http://www.fileformat.info/info/unicode/char/54/index.htm for a comprehensive compilation of characters. Formally the sentence reads (name, Unicode point)

LATIN CAPITAL LETTER T (U+0054), LATIN SMALL LETTER H (U+0068), LATIN SMALL LETTER E (U+0065)

If either of these representations is used there is no possibility of mistakes.

Java, therefore, allowed the programmer to forget about implementing Fonts and simply to concentrate on the characters. A letter ‘e’ is represented by U+0065 universally. (Although this is 0x0065 as a Java number this does not mean it has to be held in a single byte (and it probably isn’t) – early versions of Java did not support Unicode properly and many early String methods are deprecated. Also do not be confused by Java’s (char) and Character – think of this as a Unicode representation.)
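A tiny Java illustration of the point (the class name is arbitrary):

public class UnicodeDemo {
    public static void main(String[] args) {
        char e = 'e';
        System.out.println((int) e == 0x0065);     // true: the character is identified by its codepoint
        System.out.println("\u0054\u0068\u0065");  // prints "The"
    }
}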

The next bit of good news is that in 1998 W3C developed Scalable Vector Graphics (SVG). This is a well-designed, relatively compact language that can do everything we need in STM publishing. It uses Unicode and can be transferred as UTF-8.

Between HTML(5) and SVG (and Java when required) we have a completely consistent, non-redundant way of representing any STM document.

That’s what AMI-PDF2SVG aims at. A final representation which is modern, interoperable, Open and runnable on every platform.

The problem is that STM publishers don’t use it as the primary output and use PDF. [AMI: no rants please].

Can PDF use Unicode? Yes, but most STM publishers choose not to use it. [… NO rants …]

This means that interpreting STM PDFs is a hundred times more complex than it might be. AMI has to be built to cope with the myriad of different Font Types, character sets and glyphs that are unstandardized. Here’s a typical problem:

Character in Unicode, charname="two", charCode=50; result: char(50) or '2' in Java, normally interpreted as the digit 2

[glyph image: a multiplication sign]

Character in MathematicalPi-One, charname="H11011", charCode=1; result: Unknown

The problem is that MathematicalPi is not an Open standard. I can find nothing useful about it on the web: http://store1.adobe.com/cfusion/store/html/index.cfm?store=OLS-US&event=displayFont&code=MATH10005001 suggests that it is proprietary to Adobe and has to be bought. The next entry in a Google search is the Journal of Bacteriology: http://jb.asm.org/site/misc/journal-ita_ill.xhtml which says:

Fonts. To avoid font problems, set all type in one of the following fonts: Arial, Helvetica, Times Roman, European PI, Mathematical PI, or Symbol. Courier may be used but should be limited to nucleotide or amino acid sequences, where a nonproportional (monospace) font is required. All fonts other than these must be converted to paths (or outlines) in the application with which they were created.

So here is a learned society which could recommend that authors use Unicode, but instead promotes the use of closed proprietary fonts and also turns characters (useful) into paths/glyphs (almost no use for non-humans). I suspect most other publishers do the same. So we have a clear indication of where the STM industry could standardize on Unicode and doesn't.

So it's this sort of varied, proprietary, closed mess that AMI has to tackle. The main problems are:

  • Use of non-Unicode encodings for characters
  • Use of glyphs instead of characters
  • Imprecision in naming fonts, choice of font types, etc.

There is no technical excuse for not using standard fonts and Unicode. PDF supports it.

This is the core of what AMI has to deal with. If she had to tackle all of it, it would be horrendous. Fortunately the Apache community has created PDFBox which takes away some of the low-level aspects. By using PDFBox we do not have to deal with:

  • Byte stream encoding
  • Interpretation of Postscript
  • Denormalization of graphics transformations

AMI therefore gets PDFBox output as:

  • Characters, with a Font, a charname, and a charcode
  • OR characters as glyphs
  • AND a stream of paths for the graphics

Here’s a typical (good) character:

<text font-weight="bold" stroke="#000000" font-family="TimesNRMT" svgx:fontName="HGNJID+TimesNRMT-Bold" svgx:width="604.0" x="56.51" y="27.994" font-size="14.0">S</text>

The font announces that it is bold and that its family is TimesNRMT (TimesNewRoman) and so presumably Unicode. We know the size (14.0), the position on the page (x,y) and that it is character U+0053 (LATIN CAPITAL LETTER S). If all STM publishing were this good, AMI could go straight on to the next part.

But it isn’t

What's fontName="MGCOJK+AdvTT7c108665"? PMR has never heard of it. Is it Unicode? Or if not, is there a mapping of the codepoints to Unicode? Because unless we know, we have to guess. And if we guess we shall make mistakes.

*** The Adv fonts seem to be common in STM publishing. Can anyone enlighten us where they come from and whether they have open specifications?***


#ami2 #opencontentmining An intelligent reader of the PDF STM literature. We achieve the first phase: (alpha) PDF2SVG

In a previous post I outlined the architecture for building a (weakly) intelligent scientific amanuensis, AMI2. (/pmr/2012/10/20/opencontentmining-the-ami2-project-for-understanding-the-scientific-literature-and-why-i-love-pdf/ ) We have made a lot of progress since then, mainly in formalizing, refactoring, documenting, clearing our thoughts. (Refactoring is a computing chore, rather like cleaning the cooker or digging in manure or setting pest traps. There’s nothing new to offer people, but you are in a much better position to cook, grow, build, etc. Things will work). So we are now able to say more clearly what AMI2 is (currently) comprised of.

[You don’t have to be a compsci to understand this post.]

I’ll show our picture again, if only because of the animal (you know what it is, I hope):

And label them

  • PDF2SVG (creating clear syntax)
  • SVGPlus (creating clear structure)
  • STMXML (creating science)

These names may change. Constant change ("Refactor mercilessly") is a necessary feature of good software (the reverse may also be true, I hope!). These are becoming clearly defined modules.

At the end of the last post I asked some questions. I hoped people would give answers so that I could learn whether my ideas made sense. (Feedback is very valuable; silence rarely helps). Here they are again:

And some more open-ended questions (there are many possible ways of answering). In

How would you describe

  • The top right object in

There are no right answers. It depends who or what you are. I'll postulate three types of intelligent being:

  • A polymer chemist
  • A non-scientific hacker
  • PDF2SVG

The chemist would answer something like:

  • The initiation process of the polymerization
  • A forward-proceeding chemical reaction
  • A reaction scheme
  • A free radical (the "dot")

The hacker might answer:

  • The word “initiation” at a given coordinate and in a given font/weight/style
  • A right-pointing arrow
  • A string of two words (“Scheme” and “1.”)
  • A superscript (“degree”) symbol

The PDF2SVG part of AMI2 sees this in a more primitive light. She sees:

  • 10 characters in a normal sans-serif font with coordinates, sizes and fonts
  • A horizontal line and *independently* a closed curve of two lines and a cubic Bezier curve
  • 8 characters in a bold serif font.
  • Two Cubic Bezier curves.

In PDF there are NO WORDS, NO CIRCLES, NO ARROWS. There are only the primitives:

  • Path – a curved/straight line which may or may not be filled
  • Text – usually single characters, with coordinates
  • Images (like the animal)

So we have to translate the PDF to SVG, add structure, and then interpret as science.

This is hard and ambitious, but if humans can do it, so can machines. One of the many tricks is separating the operations. In this case we would have to:

  • Translate all the PDF primitives to SVG (I’ll explain the value of this below)
  • Build higher-level generic objects (words, paragraphs, circles, arrows, rectangles, etc.) from the SVG primitives
  • Interpret these as science.

Hasn’t all this been done before?

Not at all. Our unique approach is that this is an OPEN project. If you are interested in, say, interpreting flow diagrams from the literature and you enjoy hacky puzzles then this is a tremendous platform for you to build on. You never need to worry about the PDF bit – or the rectangle bit – we have done it for you. Almost all PDF converters neglect the graphical side – that’s why we are doing it. And AMI2 is the only one that’s OPEN. So a number of modest contributions can make a huge difference.

You don't have to be a scientist to get involved.

Anyway, why PDF2SVG?

PDF is a very complex beast. It was developed in commercial secrecy and some of the bits are still not really published. It’s a mixture of a lot of things:

  • An executable language (Postscript)
  • A dictionary manager (computer objects, not words)
  • A font manager
  • A stream of objects
  • Metadata (XMP)
  • Encryption, and probably DRM

And a lot more. BTW I know RELATIVELY LITTLE about PDF and I am happy to be corrected. But I think I and colleagues know enough. It is NOT easy to find out how to build PDF2SVG and we’ve been down several ratholes. I’ve asked on Stackoverflow, on the PDFBox list and elsewhere and basically the answer is “read the spec and hack it yourself”.

PDF is a page-oriented and printer-oriented language. That makes things easy and hard. It means you can work on one page at a time, but it also means that there is no sense of context anywhere. Characters and paths can come in any order – the only thing that matters is their coordinates. We're using a subset of PDF features that map onto a static page, and SVG is ideal for that:

  • It’s much simpler than PDF
  • It’s as powerful (for what we want) so there is no loss in converting PDF to SVG
  • It was designed as an OPEN standard from the start and it’s a very good design
  • It’s based on XML which means it’s easy to model with it.
  • It interoperates seamlessly with XHTML and CSS and other Markup Languages so it’s ideal for modern browsers.
  • YOU can understand it. I’ll show you how.

PDF is oriented towards visual appearance and has a great deal of emphasis on Fonts. This is a complex area and we shall show you how we tackle this. But first we must pay tribute to the volunteers who have created PDFBOX. It’s an Open Source Apache project (http://pdfbox.apache.org ) and it’s got everything we need (though it’s hard to find sometimes). So:

HUGE THANKS TO BEN LITCHFIELD AND OTHERS FOR PDFBOX

I first started this project about 5-6 years ago and had to use PDFBox at a level where I was interpreting PostScript. That's no fun and error-prone, but it worked enough to show the way. We've gone back and found PDFBox has moved on. One strategy would be to intercept the COSStream and interpret the objects as they come through. (I don't know what COS means either!) But we had another suggestion from the PDFBox list – draw a Java Graphics object and capture the result. And that's what I did. I installed Batik (a venerable Open Java SVG project), created a Graphics2D object, and saved it to XML. And it worked. And that's where we were a week ago. It was slow, clunky and had serious problems with encodings.

So we have been refactoring without Batik. We create a dummy graphics with about 60-80 callbacks and trap those which PDFBox calls. It's quite a small number. We then convert those to text or paths, extract the attributes from the GraphicsObject and that's basically it. It runs at least 20 times faster – it will parse 5+ pages a second on my laptop and I am sure that can be improved. That's 1-2 seconds for the average PDF article.
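For programmers, here is a rough sketch of the "convert those to text or paths" half of the idea. It is not the real PDF2SVG code: the dummy Graphics2D subclass itself (roughly 80 mostly-empty overrides) is omitted, and only the two interesting callbacks are shown being forwarded to a capture object that emits SVG. The class and method names are invented for illustration.

import java.awt.Shape;
import java.awt.geom.Line2D;
import java.awt.geom.PathIterator;

// Sketch of the "capture" half: PDFBox drives a dummy Graphics2D, and every
// draw(Shape) / drawString(...) callback it makes is forwarded to methods like
// these, which emit SVG instead of pixels.
public class SvgCapture {

    private final StringBuilder svg = new StringBuilder();

    // called from the dummy Graphics2D's drawString(...) override
    public void addText(String text, float x, float y, String fontFamily, float fontSize) {
        svg.append(String.format(
            "<text font-family=\"%s\" font-size=\"%s\" x=\"%s\" y=\"%s\">%s</text>%n",
            fontFamily, fontSize, x, y, text));
    }

    // called from the dummy Graphics2D's draw(Shape) / fill(Shape) overrides
    public void addPath(Shape shape) {
        StringBuilder d = new StringBuilder();
        double[] c = new double[6];
        for (PathIterator it = shape.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(c)) {
                case PathIterator.SEG_MOVETO:  d.append("M ").append(c[0]).append(' ').append(c[1]).append(' '); break;
                case PathIterator.SEG_LINETO:  d.append("L ").append(c[0]).append(' ').append(c[1]).append(' '); break;
                case PathIterator.SEG_QUADTO:  d.append("Q ").append(c[0]).append(' ').append(c[1]).append(' ')
                                                .append(c[2]).append(' ').append(c[3]).append(' '); break;
                case PathIterator.SEG_CUBICTO: d.append("C ").append(c[0]).append(' ').append(c[1]).append(' ')
                                                .append(c[2]).append(' ').append(c[3]).append(' ')
                                                .append(c[4]).append(' ').append(c[5]).append(' '); break;
                case PathIterator.SEG_CLOSE:   d.append("Z "); break;
            }
        }
        svg.append("<path d=\"").append(d.toString().trim()).append("\"/>\n");
    }

    public String getSvg() {
        return svg.toString();
    }

    public static void main(String[] args) {
        SvgCapture capture = new SvgCapture();
        capture.addPath(new Line2D.Double(72, 100, 200, 100));
        capture.addText("S", 56.51f, 27.994f, "TimesNRMT", 14.0f);
        System.out.println(capture.getSvg());
    }
}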

The main problem comes with characters and fonts. If you don't understand the terms byte stream, encoding, codepoint, character, glyph, read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (http://www.joelonsoftware.com/articles/Unicode.html ). If you think you know about these issues, reread the article (I am just going to). A considerable and avoidable amount of the garbage on the web is due to developers who did not understand. Cut-and-paste is almost a certain recipe for corruption. And since science relies on a modest number of high codepoints it is critical to get them right. A missing minus sign can cause planes to crash. A 1 instead of an L. An m instead of a mu (milligrams vs micrograms) in a dose could kill someone.

In simple terms, PDF can manage an infinite number of fonts. This is fine for display, but unless we know what the characters *are* it’s useless for AMI2. We can treat fonts in 2 main ways:

  • Send a graphical object representing what the character looks like on screen or printer. These are usually infinitely scalable fonts and don't degrade when magnified. But it is very hard for AMI2 to work out what a given set of curves means as a character. (How would you describe a "Q" over the telephone and get a precise copy at the other end?)
  • Send the number of the character (technically the code point in Unicode) and a pointer to the font used to draw it. That's by far the easiest for AMI2. She generally doesn't care whether an "A" is serif or not (there are some disciplines where fonts matter but not many). If she gets (char)65 that is good enough for her (with one reservation – she needs to know how wide it is to work out when words end).

Anyway almost all of the characters have Unicode points. In our current test document of 130 pages we've only found about 10 characters that didn't have Unicode points. These are represented by pointers into glyph maps. (If this sounds fearsome the good news is that we have largely hacked the infrastructure for it and you don't need to worry). As an example, in one place the document uses the "MathematicalPi-One" font for things like "+". By default AMI2 gets sent a glyph that she can't understand. By some deep hacking we can identify the index numbers in MathematicalPi-One – e.g. H11001. We have to convert that to Unicode.
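Here is the shape of the lookup we need, sketched in Java. Only the H11001 entry is grounded (the paragraph above identifies it as the "+"); everything else would have to be filled in from confirmed observations.

import java.util.HashMap;
import java.util.Map;

// Sketch of the charname-to-Unicode table needed for MathematicalPi-One.
// Only H11001 has been confirmed by inspecting its glyph; other entries are unknown.
public class MathematicalPiOne {

    private static final Map<String, String> TO_UNICODE = new HashMap<String, String>();

    static {
        TO_UNICODE.put("H11001", "\u002B");   // PLUS SIGN
        // TO_UNICODE.put("H11002", ...);     // unknown until someone confirms the glyph
    }

    /** Returns the Unicode string for a MathematicalPi-One charname, or null if unknown. */
    public static String toUnicode(String charname) {
        return TO_UNICODE.get(charname);
    }
}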

** DOES ANYONE HAVE A TABLE OF MathematicalPi-One TO UNICODE CONVERSIONS? **

If so, many thanks. If not this is a good exercise for simple crowdsourcing – many people will benefit from this.

I’ve gone on a lot, but partly to stress how confident we are that we have essentially solved a standalone module. This module isn’t just for science – it’s for anyone who wants to make sense of PDF. So anyone in banking, government, architecture, whatever is welcome to join in.

For those of you who are impatient, here’s what a page number looks like in SVG:

<text stroke="#000000" font-family="TimesNewRomanPS" svgx:width="500.0" x="40.979" y="17.388" font-size="8.468">3</text>

<text stroke="#000000" font-family="TimesNewRomanPS" svgx:width="500.0" x="45.213" y="17.388" font-size="8.468">8</text>

<text stroke="#000000" font-family="TimesNewRomanPS" svgx:width="500.0" x="49.447" y="17.388" font-size="8.468">0</text>

That's not so fearsome. "stroke" means the colour of the text and 00/00/00 is r/g/b, so 0 red, 0 green, 0 blue, which is black! Times is a serif font. Width is the character width (in some as-yet-unknown units, but we'll hack that). x and y are screen coordinates (in SVG y goes DOWN the screen). Font-size is self-explanatory. And the characters are 3, 8, 0. The only reason we can interpret this as "380" is the x coordinate, the font-size and the width. If you shuffled the lines it would still be 380.
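As a taster for the next phase, here is a toy Java sketch of how those three characters could be joined into "380" using only x, width and font-size. One assumption to flag: it treats svgx:width as thousandths of the font-size (the numbers above are consistent with that, but the units have not been confirmed).

// Toy sketch: join "3", "8", "0" into the word "380" using only geometry.
// Assumption (not verified): svgx:width is in 1/1000 of the font-size.
public class WordJoiner {

    static class PlacedChar {
        final String text; final double x, y, fontSize, width;
        PlacedChar(String text, double x, double y, double fontSize, double width) {
            this.text = text; this.x = x; this.y = y; this.fontSize = fontSize; this.width = width;
        }
    }

    /** Joins characters (already sorted by x, on the same line) into one string,
        inserting a space wherever the gap after a character is suspiciously large. */
    public static String join(PlacedChar[] chars) {
        StringBuilder sb = new StringBuilder(chars[0].text);
        for (int i = 1; i < chars.length; i++) {
            PlacedChar prev = chars[i - 1];
            double advance = prev.width / 1000.0 * prev.fontSize;   // assumed units
            double gap = chars[i].x - (prev.x + advance);
            if (gap > 0.3 * prev.fontSize) {   // arbitrary threshold for "new word"
                sb.append(' ');
            }
            sb.append(chars[i].text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PlacedChar[] page = {
            new PlacedChar("3", 40.979, 17.388, 8.468, 500.0),
            new PlacedChar("8", 45.213, 17.388, 8.468, 500.0),
            new PlacedChar("0", 49.447, 17.388, 8.468, 500.0),
        };
        System.out.println(join(page));   // prints 380
    }
}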

In the next post I'll explain how early-adopters can test this out. It's alpha (which means we are happy for friends to try it out). It *will* occasionally fail (because this is a complex problem) and we want to know where. But you need to know about installing and running Java programs in the first instance. And we need to build more communal resources for collaboration.

Open Access: What is it and what does “Open” mean

This is the start of “Open Access Week” www.openaccessweek.org/ and I am urged (including by myself) to write something for it. The OKF is contributing something and I hope that in writing this blog there is something suitable.

I'm going to ask questions. They are questions I don't know the answer to – maybe I am ignorant, in which case please comment with information – or maybe the "Open Access Community" doesn't know the answer. Warning: I shall probably be criticized by some of the mainstream "OA Community". Please try to read beyond any rhetoric.

As background I am well versed in Openness. I have taken a leading role in creating and launching many Open efforts – SAX (http://www.saxproject.org/sax1-history.html ), Chemical MIME, Chemical Markup Language, The Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk ), Panton Principles, Open Bibliography, Open Content Mining – and helped to write a significant number of large software frameworks (OSCAR, JUMBO, OPSIN, AMI2). I'm on the advisory board of the Open Knowledge Foundation and can shortly reveal another affiliation. I have contributed to or worked with Wikipedia, Open Streetmap, Stackoverflow, Open Science Summit, Mat Todd (Open Source Drug Discovery) and been to many hackathons. So I am very familiar with the modern ideology and practice of "Open". Is "Open Access" the same sort of beast?

The features of “Open” that I value are:

  • A meritocracy. That doesn’t mean that decisions are made by hand counting, but it means that people’s views are listened to, and they enter the process when it seems right to the community. That’s happened with SAX, very much with the Blue Obelisk, and the Open Knowledge Foundation.
  • Universality of participation, particularly from citizens without formal membership or qualifications. A feeling of community.
  • A willingness to listen to other views and find means of changing strategy where necessary
  • Openness of process. It is clear what is happening, even if you are not in command.
  • Openness of results. This is universally fundamental. Although there have been major differences of opinion in Free/Open Source Software (F/OSS), everyone is agreed that the final result is free to use, modify and redistribute without permission and for any purpose (http://en.wikipedia.org/wiki/Four_Freedoms_%28Free_software%29#definition ). "Free software is a matter of liberty, not price. To understand the concept, you should think of 'free' as in 'free speech', not as in 'free beer'."[13] See Gratis versus libre.
  • A mechanism to change current practice. The key thing about Wikipedia is that it dramatically enhances the way we use knowledge. Many activities in the OKF (and other Open Organisations) are helping to change practice in government, development agencies, companies. It’s not about price restrictions, it’s about giving back control to the citizens of the world. Open Streetmap produces BETTER and more innovative maps that people can use to change the lives of people living right now – e.g. the Haitian earthquake.

How does Open Access measure up against these? I have difficulty saying that OA as currently practiced meets any of these to my own satisfaction. That doesn't mean it isn't valuable, but it means that it doesn't have obvious values I can align with. I have followed OA for most of the last 10 years and tried to contribute, but without success. I have practiced it by publishing all my own single-author papers over the last 5 years in Gold CC-BY journals (but without much feeling of involvement – certainly not the involvement that I get from SAX or the Blue Obelisk).

That’s a harsh statement and I will elaborate:

Open Access is not universal – it looks inward to Universities (and Research Institutions). In OA week the categories for membership are:

“click here if you’re a: RESEARCH FUNDER | RESEARCHER/FACULTY MEMBER | ADMINISTRATOR | PUBLISHER | STUDENT | LIBRARIAN

There is no space for “citizen” in OA. Indeed some in the OA movement emphasize this. Stevan Harnad has said that the purpose of OA is for “researchers to publish to researchers” and that ordinary people won’t understand scholarly papers. I take a strong and public stance against this – the success of Galaxy Zoo has shown how citizens can become as expert as many practitioners. In my new area of phylogenetic trees I would feel confident that anyone with a University education (and many without) would have little difficulty understanding much of the literature and many could become involved in the calculations. For me, Open Access has little point unless it reaches out to the citizenry and I see very little evidence of this (please correct me).

There is, in fact, very little role for the individual. Most of the infrastructure has been built by university libraries without involving anyone outside (I regret this, because University repositories are poor compared to other tools in the Open movements). There is little sense of community. The main events are organised round library practice and funders – which doesn’t map onto other Opens. Researchers have little involvement in the process – the mainstream vision is that their university will mandate them to do certain things and they will comply or be sacked. This might be effective (although no signs yet) but it is not an “Open” attitude.

Decisions are made in the following ways:

  • An oligarchy, represented in the BOAI processes and Enabling Open Scholarship (EOS). EOS is a closed society that releases briefing papers; membership costs 50 EUR per year and members have to be formally approved by the committee. (I have represented to several members of EOS that I don't find this inclusive and I can't see any value in my joining – it's primarily for university administrators and librarians.)
  • Library organizations (e.g. SPARC)
  • Organizations of OA publishers (e.g. OASPA)

Now there are many successful and valuable organizations that operate on these principles, but they don’t use the word “Open”.

So is discussion "Open"? Unfortunately, not very. There is no mailing list with both a large volume of contributions and effective freedom to present a range of views. Probably the highest-volume list for citizens (as opposed to librarians) is GOAL http://mailman.ecs.soton.ac.uk/pipermail/goal/ and there differences of opinion are unwelcome. Again that's a hard statement, but the reality is that if you post anything that does not support Green Open Access, Stevan Harnad and the Harnadites will publicly shout you down. I have been denigrated on more than one occasion by members of the OA oligarchy (look at the archive if you need proof). It's probably fair to say that this attitude has effectively killed Open discussion in OA. Jan Velterop and I are probably the only people prepared to challenge opinions – most others walk away.

Because of this lack of discussion it isn’t clear to me what the goals and philosophy of OA are. I suspect that different practitioners have many different views, including:

  • A means to reach out to citizenry beyond academia, especially for publicly funded research. This should be the top reason IMO but there is little effective practice.
  • A means to reduce journal prices. This is (one of) Harnad's arguments: we concentrate on making everything Green, and when we have achieved this the publishers will have to reduce their prices. This seems most unlikely to me – any publisher losing revenue will fight it (Elsevier already bans Green OA where it is mandated).
  • A way of reusing scholarly output. This is ONLY possible if the output is labelled as CC-BY, and only about 5–10 percent of it is. Again this is high on my list and it is the only reason Ross Mounce and I can do research into phylogenetic trees.
  • A way of changing scholarship. I see no evidence at all for this in the OA community. In fact OA is holding back innovation in new methods of scholarship as it emphasizes the conventional role of the "final manuscript" and the "publisher". Indeed Green OA relies (in practice) on having publishers and so legitimizes them.

And finally, is the product "Open"? The BOAI declaration is (in Cameron Neylon's words http://cameronneylon.net/blog/on-the-10th-anniversary-of-the-budapest-declaration/ ) "clear, direct, and precise". To remind you:

“By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

This is in the tradition of Stallman's software freedoms, the Open Knowledge Definition and all the other examples I have quoted. Free to use, re-use and redistribute for any lawful purpose. For manuscripts it is cleanly achieved by adding a visible CC-BY licence. But unfortunately many people, including the mainstream OA community and many publishers, use "(fully) Open Access" to mean just about anything. No-one other than a few of us challenges this. So the result is that much current "OA" is so badly defined that it adds little value. There have been attempts to formalize this but they have all ended in messy (and to me unacceptable) compromise. In all other Open communities "libre" has a clear meaning – freedom as in speech. In OA it means almost nothing ("removal of some permission barriers" – which could be satisfied by permission to post a copy on a personal website while restricting copying and further re-use). Unfortunately anyone trying to get tighter definitions is shouted down. For that reason we have set up our own open-access list in OKF: http://blog.okfn.org/category/open-access/ and http://lists.okfn.org/pipermail/open-access/. So, and this is probably the greatest tragedy, Open Access does not by default produce Open products. See http://blog.okfn.org/2012/10/22/the-great-open-access-swindle/ for similar views.

*If* we can have a truly Open discussion we might make progress on some of these issues.


#opencontentmining The #ami2 project for understanding the scientific literature and why I “love” PDF

Earlier this week I blogged about our new Open project #AMI2 /pmr/2012/10/15/opencontentmining-starting-a-community-project-and-introducing-ami2/ . This is an Open community project to create a complete infrastructure for machine understandability of the scientific literature. That’s a bold statement, especially for something with no formal funding, but in the present era that’s not necessarily a show-stopper. Because when we grow the community we can do almost anything. Of course AMI2 can’t understand everything, but she will shortly be able to outstrip several current human activities.

And we’ve had several offers of help. Welcome to all those who have mailed. We have critical mass and Ross and I will work out how to create a community platform. Because #ami2 is Y/OUR project.

The scientific literature represents some 300 billion USD of publicly funded endeavour and costs some 15 billion USD to publish. Many of the results are inaccessible because of old-fashioned means of publication and the difficulty of understanding them. So if people want to help build AMI2 it will have a very important impact.

The current approach to this is to get authors to produce semantic documents. I agree that this is the best way. But authors aren’t interested and publishers are incredibly conservative. So we have to start the other way. By creating #ami2 to understand the current literature in the way that humans do. And much of this can be done by gluing together the technology that already exists. In these blogs we are going to show that, in specific domains, it can be done. NOW! And we make the assumption that it’s not too difficult to build similar technology in parallel in other domains.

So we are going to start with the following disciplines:

  • Phylogenetic trees
  • X-Y plots
  • Chemical diagrams

Of these only the chemistry is at all hard to understand if you are 12 years old. Everyone can understand trees and everyone can understand graphs (because in the information age every citizen should be able to understand a graph). So anyone should be able to follow.

Here’s the AMI2 process:

[I authored the diagram in SVG but cannot use Word+Wordpress to publish it. Anyone know how? So apologies for the PNG – it goes against the philosophy of the project.]

There are three components to this document.

  • Text. There are many uses for text (discussion, tables, references, metadata) but they all use characters and the general technology is the same. We'll see the distinctions later.
  • Diagrams. I use this to mean objects which are created from lines, circles, text, etc., and where there are a number of well-defined objects.
  • Images. This covers bitmaps, where there is no formal substructure to the object. Photographs are a common type of image.

A document like the above can be represented with different technologies. I distinguish:

  • Bitmaps. Here only the pixels in the (printable) page are transmitted. By default there is no understandable document content. A typical touchstone is that you cannot cut-and-paste anything useful other than subimages. A test for a bitmap is to scale it: as it gets larger it gets fuzzier, and pixels may appear which grow with the magnification. Some bitmap formats preserve all the pixels (e.g. TIFF, BMP); some compress the file. PNG compresses without loss (a PNG can be reconverted to the corresponding BMP). JPEG is a lossy format – you cannot recreate the uncompressed bitmap. It was designed for photographs, where it is excellent. The use of JPEG compression for scientific diagrams is completely unnecessary and an act of information destruction (a small Java sketch of the lossless/lossy difference follows this list). No publisher should ever use JPEG except – possibly – for photographs.
  • Text. Most scientific documents have semi-structured text. Subsections can be cut-and-pasted (perhaps with loss of fonts, etc.). If science were only communicated with text (e.g. like most literature) we wouldn’t have a major problem. But the text is only PART of a scientific document. Unfortunately terms such as “fulltext”, “textmining” suggest that the only valuable stuff is full text.
  • Vector graphics. Most diagrams are authored as vector graphics with tools such as Inkscape, Powerpoint, etc. There is a menu of objects (lines, rectangles, circles, text, etc.) and it is generally easy to create scientific diagrams of medium quality using these. (It is not easy to create graphic art.) Typical media of transmission are SVG and EPS (Postscript). Many machines (e.g. spectrometers) create vector graphics. Vector graphics are scalable: if you magnify the display even to 4000% all the lines will be sharp and the text will have clean edges. This is an almost infallible test for VG. Almost all scientific diagrams start life as vector graphics but many get converted into bitmaps. The use of bitmaps for scientific diagrams is completely unnecessary and an act of information destruction. No publisher should ever use a bitmap (even PNG) where they started with a vector-graphics object.
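To make the lossless/lossy distinction concrete, here is the minimal Java sketch promised above (the file name diagram.png is hypothetical, and this is an illustration, not project code). It re-encodes the same bitmap as PNG and as JPEG and counts how many pixels change; for a typical line diagram the PNG round-trip should report zero and the JPEG round-trip almost never will.

import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class LossyCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: a scientific line diagram saved as a bitmap
        BufferedImage input = ImageIO.read(new File("diagram.png"));

        // Flatten to plain RGB so the JPEG writer has no alpha channel to deal with
        BufferedImage rgb = new BufferedImage(
                input.getWidth(), input.getHeight(), BufferedImage.TYPE_INT_RGB);
        rgb.getGraphics().drawImage(input, 0, 0, Color.WHITE, null);

        // Round-trip through a lossless (PNG) and a lossy (JPEG) encoder
        ImageIO.write(rgb, "png", new File("roundtrip.png"));
        ImageIO.write(rgb, "jpg", new File("roundtrip.jpg"));

        System.out.println("pixels changed by PNG round-trip:  "
                + countChangedPixels(rgb, ImageIO.read(new File("roundtrip.png"))));
        System.out.println("pixels changed by JPEG round-trip: "
                + countChangedPixels(rgb, ImageIO.read(new File("roundtrip.jpg"))));
    }

    // Count pixels whose RGB value differs between two images of the same size
    private static long countChangedPixels(BufferedImage a, BufferedImage b) {
        long changed = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if ((a.getRGB(x, y) & 0xFFFFFF) != (b.getRGB(x, y) & 0xFFFFFF)) {
                    changed++;
                }
            }
        }
        return changed;
    }
}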

The major technologies for scientific publishing are:

  • TeX/LaTeX. This is a semi-structured, semi-semantic language of great vision and great value to science. A large amount of science can be reconstructed from it through content-mining. No publisher should ever destroy LaTeX, and where possible they should publish it – it is far more valuable than PDF. LaTeX often uses EPS for its vector graphics. I would generally be happy to get a paper in *.tex as input to #AMI2.
  • Word. Early versions of Word are proprietary and have a hideous internal structure which is almost impossible to manage without MS tooling. Modern Word uses XML (OOXML). Leaving aside the politics of the OOXML standard process, I will say that it's a reasonably well-structured, if bloated, technology. Word can contain other XML technologies such as Chemical Markup Language (CML). If OOXML were openly and usefully available on non-MS frameworks I would make stronger recommendations for it. However, OOXML is tractable and I would be happy to get a scientific document in *.docx.
  • XHTML. Most publishers provide XHTML as a display format. This is a good thing. The downside is that it isn't easy to store and distribute XHTML: the images and often other components are separate and fragmented. It is a major failing of the W3C effort that there isn't a platform-independent specification for packaging compound documents.
  • PDF. If you think PDF is an adequate format for conveying modern science to humans then you are probably sighted and probably don't use machines to help augment your brain. PDF is good for some things and terrible at others. Since >99% of the scientific literature is distributed as PDF (despite its origins) I have very reluctantly come to accept that I have to work with it. Like Winston Smith in 1984 I have realised that I "love" PDF.

A few words, then, about PDF. There will be many more later.

  • PDF is a page oriented spec. The popularity of this is driven by people who sell pages – publishers. We still have books with pages, but we have many other media – including XHTML/CSS/SVG/RDF which are much more popular with modern media such as the BBC. Pages are an anachronism. AMI2 will remove the pages (among many other things).
  • PDF is designed for printing. PDF encapsulates PostScript, developed for PRINTERS. Everything in PDF internals screams Printed Page at you.
  • PDF is designed for sighted humans. It is the ink on the screen, not the semantics, that conveys information. That's why it's a hard job training AMI2. But it can be done.
  • PDF has many proprietary features. That doesn’t mean that we cannot ultimately understand them and it’s more Open than it was, but there isn’t a bottom-up community as for HTML and XML.
  • PDF is a container format. You can add a number of other things (mainly images and vector graphics) and they don’t get lost. That’s a good thing. There are very few around (G/ZIP is the most commonly used). Powerpoint and Word are also container formats. We desperately need an Open container.
  • PDF is largely immutable. If you get one it is generally read-only. Yes there are editors, but they are generally commercial and interoperability outside of major companies is poor. There are also mechanisms for encryption and DRM and other modern instruments of control. This can make it difficult to extract information.

So here is our overall plan.

  • Convert PDF to SVG. This is because SVG is a much more semantic format than PDF and much flatter. There is almost no loss in the conversion. The main problems come from font information (we'll see that later); if you don't care about fonts – and fonts are irrelevant to science – then all we need to do is extract the character information. This process is almost complete. Murray Jensen and I have been working with PDFBox and we have a wrapper which can convert clean PDF to SVG with almost no loss, at a page per second or better on my laptop (a minimal character-extraction sketch follows this list). The main remaining problem is strange fonts.
  • Create semantic science from the SVG. This is hard and relies on a lot of heuristics. But it’s not as hard as you might think and with a community it’s very tractable. And then we shall be able to ask AMI2 “What’s this paper about and can I have the data in it?”
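To give a flavour of the first step, here is a minimal sketch against the PDFBox 1.x API (an illustration, not the PDF2SVG wrapper itself; the input file name is hypothetical). It walks a PDF and prints every character with its page coordinates, font name and size – essentially the information that ends up as SVG text elements:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Prints each character in a PDF together with its position, font name and size.
public class CharacterDumper extends PDFTextStripper {

    public CharacterDumper() throws Exception {
        super();
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        System.out.println("char='" + text.getCharacter()
                + "' x=" + text.getXDirAdj()
                + " y=" + text.getYDirAdj()
                + " font=" + text.getFont().getBaseFont()
                + " size=" + text.getFontSizeInPt());
    }

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("myfile.pdf")); // hypothetical input
        try {
            new CharacterDumper().getText(document); // drives processTextPosition() for each character
        } finally {
            document.close();
        }
    }
}

Turning each of those records into an SVG text element with x, y and font attributes is, in essence, what the wrapper does; the hard cases are the strange, non-standard fonts.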

Please let us have your feedback, and let us know if you'd like to help. Meanwhile, before the next post, here is an example of what we can do already. The first image is a snapshot of a PDF; the second is a snapshot of the SVG we produce. There are very small differences that don't affect the science at all. Can you spot any? And can you suggest why they happened?

And some more open-ended questions (there are many possible ways of answering). How would you describe:

  • The top right object in

     

Because those are the sort of questions that we have to build into AMI2.

 

 


#opencontentmining Starting a community project and introducing #AMI2

This is the first post in (hopefully) a regular series on the development of Open Content Mining in scholarly articles (mainly STM = Scientific, Technical, Medical). It's also a call for anyone interested to join up as a community. This post describes the background – later ones will cover the technology, the philosophy and the contractual and legal issues. I shall use #opencontentmining as a running hashtag.

I’m using the term “content mining” as it’s broader than “text-mining”. It starts from a number of premises:

  • The STM literature is expanding so quickly that no one human can keep up, even in their own field. There are perhaps 2 million articles a year – roughly 5,500 per day. Even reading only the titles, at one per second, would take well over an hour every day. Many of them might be formally "outside" your speciality but actually contain valuable information.
  • A large part of scientific publication and communication is data. Everyone is becoming more aware of how important data is: it is essential for validating the science, and it can be combined with other data to create new discoveries. Yet most data is never published, and of the rest much ends up in the "fulltext" of the articles. (Note that "fulltext" is a poor term, as there are lots of pictures and other non-text content. "Full content" would be more logical, although still misleading in that papers only report a small percentage of the work done.)
  • The technology is now able to do some exciting and powerful things. Content-mining is made up of a large number of discrete processes, and as each one is solved (even partially) we get more value. This is combined with the increasing technical quality of articles (e.g. native PDF rather than camera-ready photographed text).

I used to regard PDF as an abomination. See my post 6 years ago: /pmr/2006/09/10/hamburgers-and-cows-the-cognitive-style-of-pdf/. I quoted the maxim "turning a PDF into XML is like turning a hamburger into a cow" (not mine, but I am sometimes credited with it). XML is structured semantic text. PDF is a (random) collection of "inkmarks on paper". The conversion destroys huge amounts of information.

I still regard PDF as an abomination. I used to think that force of argument would persuade authors and publishers to change to semantic authoring. I still think that has to happen before we have modern scientific communication through articles.

But in the interim I and others have developed hamburger2cow technology. It’s based on the idea that if a human can understand a “printed page” then a machine might be able to. It’s really a question of encoding a large number of rules. The good thing is that machines don’t forget rules and they have no limit to the size of their memory for them. So I have come to regard PDF as a fact of life and a technical problem to be tackled. I’ve spent the last 5 months hacking at it (hence few blog posts) and I think it’s reached an alpha stage.
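As one concrete (and deliberately simplified) example of such a rule – an illustrative sketch, not code from the project – here is the kind of heuristic that turns isolated, positioned characters back into words: characters are concatenated until the horizontal gap to the next one is large relative to the font size.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of one "rule": rebuild words from positioned characters.
public class WordJoiner {

    // A character as extracted from a page: its text, x origin, width and font size.
    static class Glyph {
        final String text; final double x; final double width; final double fontSize;
        Glyph(String text, double x, double width, double fontSize) {
            this.text = text; this.x = x; this.width = width; this.fontSize = fontSize;
        }
    }

    // Heuristic: a horizontal gap wider than ~30% of the font size separates two words.
    static List<String> joinIntoWords(List<Glyph> glyphs) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null && g.x - (previous.x + previous.width) > 0.3 * g.fontSize) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("c", 10, 5, 10), new Glyph("a", 15, 5, 10), new Glyph("t", 20, 5, 10),
                new Glyph("s", 30, 5, 10), new Glyph("a", 35, 5, 10), new Glyph("t", 40, 5, 10));
        System.out.println(joinIntoWords(line)); // prints [cat, sat]
    }
}

Real pages need many more such rules (superscripts, subscripts, ligatures, hyphenation), but each one is small and separately testable.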

And also it is parallelisable at the human level. I and others have developed technology for understanding chemical diagrams in PDF. You can use that technology. If you create a tool that recognizes sequence alignments, then I can use it. (Of course I am talking Open Source – we share our code rather than restricting its use). I have created a tool that interprets phylogenetic trees – you don’t have to. Maybe you are interested in hacking dose-response curves?

So little by little we build a system that is smarter than any individual scientist. We can ask the machine “what sort of science is in this paper?” and the machine will apply all the zillions of rules to every bit of information in an article. And the machine will be able to answer: “it’s got 3 phylogenetic trees, 1 sequence alignment, 2 maps of the North Atlantic, and the species are all sea-birds”.

The machine is called AMI2. Some time ago we had a JISC project to create a virtual research environment, and we called the software "AMI". That was short for "the scientists' amanuensis". An amanuensis is a scholarly companion; Eric Fenby assisted the blind composer Frederick Delius in writing down the notes that Delius dictated. So AMI2 is the next step – a scientific, artificially intelligent program. (That's not as scary as it sounds – we are surrounded by weak AI everywhere, and it's mainly a question of glueing together a lot of mature technologies.)

AMI2 starts with two main technologies – text-mining and diagram-mining. Textmining is very mature and could be deployed on the scientific literature tomorrow.

Except that the subscription-based publishers will send lawyers after us if we do. And that is 99% of the problem. They aren’t doing text-mining themselves but they won’t let subscribers do it either.

But there is 5% of the literature that can be text-mined – that with a CC-BY licence. The best examples are BioMedCentral and PLoS. Will 5% be useful? No-one knows but I believe it will. And in any case it will get the technology developed. And there is a lot of interest in funders – they want their outputs to be mined.

So this post launches a community approach to content-mining. Anyone can take part as long as they make content and code Open (CC-BY-equiv and OSI F/OSS). Of course the technology can be deployed on closed content and we are delighted for that, but the examples we use in this project must be Open.

Open communities are springing up everywhere. I have helped to launch one – the Blue Obelisk – in chemistry. It’s got 20 different groups creating interoperable code. It’s run for 6 years and its code is gradually replacing closed code. A project in content-mining will be even more dynamic as it addresses unmet needs.

So here are some starting points. Like all bottom-up projects expect them to change:

  • We shall identify some key problems and people keen to solve them
  • We’ll use existing Open technology where possible
  • We’ll educate ourselves as we go
  • We’ll develop new technologies where they are needed
  • Everything will be made Open on the web as soon as it is produced. Blogs, wikis, repositories – whatever works

If you are interested in contributing to #opencontentmining in STM please let us know. We are at alpha stage (i.e. you need to be prepared to work with and test development systems – there are no gold-plated packages, and there probably never will be). There's lots of scope for biologists, chemists, materials scientists, hackers, machine-learning experts, document-hackers (esp. PDF), legal experts, publishers, etc.

Content-mining is set to take off. You will need to know about it. So if you are interested in (soluble) technical challenges and contributing to an Open community let’s start.

[More details in next post – maybe about phylogenetic trees.]


Update: I am off to CSIRO(AU), eResearch2012, Open Content Mining, AMI2, PDF hacking etc.

I haven't blogged for some time because I have been busy elsewhere – going to #okfest and #odlc (the Open Data conference in Paris) and preparing for a significant stay (~3 months) with CSIRO in Clayton (Melbourne, AU).

I’m in AU at the invitation of Nico Adams and CSIRO as a visiting researcher. When we were daily colleagues Nico pioneered the use of the semantic web for chemistry and materials. He is ahead of the game, but chemistry is slowly waking up to the need for semantics. We’ll be working on themes such as:

  • Formal semantics and ontologies for materials science
  • Open Content Mining for chemical/materials data (AMI2)

As part of this I intend to create materials for learning and using CML (Chemical Markup Language), in weekly chunks. If anyone is interested I’m offering to run a weeklyish series of low key workshops on Semantic Chemistry and more generally Semantic Physical Science (Nico and I ran a day on SPS last year at eResearch Australia 2011). Maybe there will be enough material for a book, and if you know me it won’t be a conventional book. It could be a truly open-authored book if there is interest. Almost certainly Open Content. I’ve registered for eResearch 2012 in Sydney 28-1 Oct so if anyone is going we shall meet. Not doing any workshops this time round.

I'm working hard on Open Content Mining. I've developed a generic tool for extracting semantic information from PDFs (yes) called AMI2. It results from many months of fairly solid hacking and several previous years of exploration. In the initial cases I have been able to get 100% accuracy on some subsets of PDFs, and I'll be taking you through this in blog posts. Ross and I are applying it to phylogenetics and we expect to be able to extract a lot of trees from the literature.

We'll be confining ourselves to BMC and PLoS material (with BMC being technically easier). I've downloaded 1000 potential papers and Ross Mounce will be annotating 80 of them as to whether they contain phylogenetics, where it is, etc. Content mining requires hard, boring graft to create a trustworthy system, but the effort is worth it.

We can’t use it on Molecular Phylogenetics and Evolution although it has a lot of trees. Why Not? [Regular readers will know the answer].

And some recent experiences with Open. #okfest was incredible – a real feel that the world was changing and we and others were changing it. It’s the real sense of “Open”. Open isn’t just a licence or a process – it’s a community and a state of mind. It’s joyful, risk taking, collaborative, global.

And Open Scholarship? Well, mostly it doesn't exist and I have difficulty seeing where it will come from. Open Scholarship consists of at least Open Access, Authoring, Bibliography, Citations, Data, Science, etc. Of these only Open Science has a true Open agenda, community and practice (inspired by Joseph Jackson, Mat Todd and others who want to change the world). Open Access is not Open in the modern sense of the word. The initial idealism in 2002 was great, but since then it has become factional, cliquey and in large part authoritarian. Open Access is complex and needs serious public discussion, but this is frequently shouted down. [I sat through an hour's plenary lecture at Digital Research from Stevan Harnad on "why the RCUK is wrong and must change its policy", with the subtitle "What Peter Murray-Rust thinks and why he is wrong". The views attributed to me were not mine and his conclusions were erroneous, but he doesn't listen to me and many others. He has now mounted a public attack on RCUK. This will help no-one other than reactionary money-oriented publishers.]

I have been meaning to blog about Open Access for some time, but each time I wonder whether I would do more harm than good. However, I think it is now important to have a proper public discussion about the serious issues, and Open Access Week may be an opportunity. As an example of the problem, I find it very hard to find any centre to "Open Access" – who runs Open Access? What's its purpose? Is there a consensus? Where can I expect to have a proper discussion without being insulted? Because if questions like this are not answered, the movement (in so far as it *is* a movement) will surely fracture. And unless new coherent visions emerge, the losers will be academia but even more the SCHOLARLY POOR.


#okfest #openscience being streamed today. Updates.

Update:

OKFest is being streamed today: http://okfestival.org/streams/aalto-pro-lecture-room/ (6 viewers so far) Jenny Molloy is introducing. Now Puneet Kishor from CC-science

See http://science.okfn.org/blog for details. Yesterday we hacked PyBossa for crowdsourcing of spintronics.

I will update this blog over the next hour or so (I hope).

Now Joss Winn (JISC Orbital project)

 
