Jailbreaking the PDF – 3; Styles and fonts and the problems from Publishers

Many scientific publications use specific styling to add semantics. In converting to XML it’s critical that we don’t throw these away at an early stage, yet many common tools discard such styles. #AMI2 does its best to preserve all of these and I think does fairly well. There are different reasons for using styles, and I give examples from OA publishers…

  • Bold – used extensively for headings and inline structuring. Note (a) the bold for the heading and (b) the bold at the start of lines.


  • Italic. Species names are almost always rendered this way.


  • Monospaced.
    Most computer code is represented in this (abstract) font.


This should have convinced you that fonts and styles matter and should be retained. But many PDF2xxx systems discard them, especially for scholarly publications. There’s a clear standard in PDF for indicating bold and italic, and PDFBox gives a clear API for it. But many scholarly PDFs are awful (did I mention this before?). The BMC fonts don’t declare that they are bold even when they are. Or italic. So we have to use heuristics. If a BMC font has “+20” after its name it’s probably bold, and “+3” means italic.
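To make the heuristic concrete, here is a minimal sketch in Java (the language of PDFBox and AMI2) of this kind of font-name guessing. It is not AMI2’s actual code: the “+20”/“+3” suffixes are the BMC observations above, the substring checks are generic assumptions, and the example font names are invented.

public class FontStyleHeuristic {

    // The "+20" (bold) and "+3" (italic) suffixes are the BMC heuristics described
    // in this post; the substring checks are generic guesses for better-behaved fonts.
    public static boolean looksBold(String fontName) {
        String n = fontName.toLowerCase();
        return n.contains("bold") || fontName.endsWith("+20");
    }

    public static boolean looksItalic(String fontName) {
        String n = fontName.toLowerCase();
        return n.contains("italic") || n.contains("oblique") || fontName.endsWith("+3");
    }

    public static void main(String[] args) {
        System.out.println(looksBold("SomeBMCFont+20"));      // true (hypothetical name)
        System.out.println(looksItalic("SomeBMCFont+3"));     // true (hypothetical name)
        System.out.println(looksBold("MinionExp-Regular"));   // false
    }
}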

Isn’t this a fun puzzle?

No. It’s holding science back. Science should be about effective communication. If we are going to use styles rather than proper markup, let’s do it properly. Let’s tell the world it’s bold. Let’s use 65 to mean A.

There are a few cases where an “A” is not an “A”, as in http://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols

Most of these have specific mathematical meanings and uses, and most have their own Unicode points. They are not letters in the normal sense of the word – they are symbols. And if they are well created and standard then they are manageable.
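As a small illustration of “their own Unicode points”: these symbols live above U+FFFF, so code that assumes one Java char per character will mishandle them. A minimal demonstration (nothing AMI2-specific; MATHEMATICAL BOLD CAPITAL A is U+1D400):

public class MathAlphanumDemo {
    public static void main(String[] args) {
        String plainA = "A";
        String boldA = new String(Character.toChars(0x1D400)); // MATHEMATICAL BOLD CAPITAL A

        System.out.println(plainA.codePointAt(0));                    // 65
        System.out.println(boldA.codePointAt(0));                     // 119808 (0x1D400)
        System.out.println(boldA.length());                           // 2 chars (a surrogate pair)
        System.out.println(boldA.codePointCount(0, boldA.length()));  // but only 1 code point
    }
}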

But now an unnecessary nuisance from PeerJ (and I’m only using Open Access publishers so I don’t get sued):

What are the blue things? They look like normal characters, but they aren’t:

<text fill="#00a6fc" svgx:fontName="MinionExp-Regular" svgx:width="299.0"
x="284.784" y="162.408" font-weight="normal"></text>

<text fill="#00a6fc" svgx:fontName="MinionExp-Regular" svgx:width="299.0"
x="281.486" y="162.408" font-weight="normal"></text>

They are weird codepoints, outside the Unicode range:


These two seem to be small-capital “1” and “0”. They aren’t even valid Unicode characters. Some of our browsers won’t display them:

(Note the missing characters).

Now the DOI is for many people a critically important part of the paper! It’s critical that it is correct and re-usable. But PeerJ (which is a modern publisher and tells us how it has used modern methods to do publishing better and cheaper) seems to have deliberately used totally non-standard characters for DOIs, to the extent that my browser can’t even display them. I’m open to correction – but this is barmy. (The raw PDF paper displays in Firefox, but that’s because the font is represented by glyphs rather than codepoints.) No doubt I’ll be told that it’s more important to have beautiful fonts to reduce eyestrain for humans and that corruption doesn’t matter. Most readers don’t even read the references – they simply cut and paste them.

So let’s look at the references:

Here the various components are represented in different fonts and styles. (Of course it would be better to use approaches such as BibJSON or even BibTeX, but that would make it too easy to get it right.) So here we have to use fonts and styles to guess what the various bits mean. Bold for the authors, followed by a period. A bold number for the year. A title in normal font. A journal name in italics. More bold for the volume number. Normal for the pages. Light blue is the DOI.
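Here is a toy sketch of how those style cues might drive a guess. It is not AMI2’s reference parser: the role names are mine, the light-blue fill value is the one shown earlier for the DOIs, and bold is deliberately left ambiguous because position within the reference is needed to separate authors, year and volume.

public class RefChunkGuesser {

    enum Role { AUTHORS_YEAR_OR_VOLUME, JOURNAL, DOI, TITLE_OR_PAGES }

    // Guess the role of a styled text chunk in a reference from its style attributes.
    static Role guessRole(String fontWeight, String fontStyle, String fill) {
        if ("#00a6fc".equalsIgnoreCase(fill)) {
            return Role.DOI;                     // the light blue used for DOIs above
        }
        if ("italic".equalsIgnoreCase(fontStyle)) {
            return Role.JOURNAL;
        }
        if ("bold".equalsIgnoreCase(fontWeight)) {
            return Role.AUTHORS_YEAR_OR_VOLUME;  // disambiguated later by position
        }
        return Role.TITLE_OR_PAGES;
    }

    public static void main(String[] args) {
        System.out.println(guessRole("bold", "normal", "#000000"));    // AUTHORS_YEAR_OR_VOLUME
        System.out.println(guessRole("normal", "italic", "#000000"));  // JOURNAL
        System.out.println(guessRole("normal", "normal", "#00a6fc"));  // DOI
    }
}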

But at least if we keep the styles then #AMI2 can hack it. Throwing away the styles makes it much harder and much more error prone.

So to summarise #AMI2=PDF2SVG does things that most other systems don’t do:

  • Manages non-standard fonts (but with human labour)
  • Manages styles
  • Converts to Unicode

AMI2 can’t yet manage raw glyphs, but she will in due time. (Unless YOU wish to volunteer – it actually is a fun machine-learning project.)

NOTE: If you are a large commercial publisher then your fonts are just as bad.


Jailbreaking the PDF – 2; Technical aspects (Glyph processing)

A lot of our discussion in Jailbreaking related to technical issues, and this is a – hopefully readable – overview.

PDF is a page description format (does anyone use pages any more, other than publishers and letter writers?) which is designed for sighted humans. At its most basic it transmits a purely visual image of information, which may simply be a bitmap (e.g. a scanned document). That’s currently beyond our ability to automate (but we shall ultimately crack it). More usually it consists of glyphs (http://en.wikipedia.org/wiki/Glyph – the visual representation of a character). All the following are glyphs for the character “a”.

The minimum that a PDF has to do is to transmit one of these 9 chunks. It can do that by painting black dots (pixels) onto the screen. Humans can make sense of this (they get taught to read) but machines can’t. So it really helps when the publisher adds the codepoint for a character. There’s a standard for this – it’s called Unicode and everyone uses it. Correction: MOST people, but NOT scholarly publishers. Many publishers don’t include codepoints at all but transmit the image of the glyph (sometimes a bitmap, sometimes a set of strokes – a vector/outline font). Here’s a bitmap representation of the first “a”.

You can see it’s made of a few hundred pixels (squares). The computer ONLY knows these are squares. It doesn’t know they are an “a”. We shall crack this in the next few months – it’s called Optical Character Recognition (OCR) and is usually done by machine learning – we’ll pool our resources on this. Most characters in figures are probably bitmapped glyphs, but some are vectors.
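To underline the point that the computer only sees squares, here is a toy Java fragment that does the only thing possible without OCR: count the dark pixels. The file name is hypothetical, and nothing in the code knows that the image is an “a”.

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class GlyphPixels {
    public static void main(String[] args) throws Exception {
        BufferedImage img = ImageIO.read(new File("glyph-a.png")); // hypothetical scanned glyph
        int dark = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                if ((r + g + b) / 3 < 128) {
                    dark++;              // a "black-ish" square - that is all we know
                }
            }
        }
        System.out.println(dark + " dark pixels out of " + (img.getWidth() * img.getHeight()));
    }
}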

In the main text characters SHOULD be represented by a codepoint – “a” is Unicode codepoint 97. (Note that “A” is different, codepoint 65 – I’ll use decimal values.) So every publisher represents “a” by 97?

Of course not. Publishers’ PDFs are awful and don’t adhere to standards. That’s a really awful problem. Moreover some publishers use 97 to mean http://en.wikipedia.org/wiki/Alpha . Why? Because in some systems there is a symbol font which only has Greek characters, and they use the same numbers.

So why don’t publishers fix this? It’s because (a) they don’t care and (b) they can extract more money from academia for fixing it. They probably have the correct codepoint in their XML but they don’t let us have this as they want to charge us extra to read it. (That’s another blog post). Because most publishers use the same typesetters these problems are endemic in the industry. Here’s an example. I’m using BioMedCentral examples because they are Open. I have high praise for BMC but not for their technical processing. (BTW I couldn’t show any of this from Closed publishers as I’d probably be sued).

How many characters are there in this? Unless you read the PDF you don’t know. The “BMC Microbiology” LOGO is actually a set of graphics strokes and there is no indication it is actually meaningful text. But I want to concentrate on the “lambda” in the title. Here is AMI2’s extracted SVG/XML (I have included the preceding “e” of “bacteriophage”)

<text stroke="none" fill="#000000" svgx:fontName="AdvOT46dcae81"
svgx:width="500.0" x="182.691" y="165.703" font-size="23.305"
font-weight="normal">e</text>

<text stroke="none" fill="#000000" svgx:fontName="AdvTT3f84ef53"
svgx:width="0.0" x="201.703" y="165.703" font-size="23.305"
font-weight="normal">l</text>

Note there is NO explicit space. We have to work it out from the coordinates (182.7 + 0.5*23 << 201.7). But character 108 is an “l” (ell), and so an automatic conversion system creates “bacteriophage l” rather than “bacteriophage λ”.

This is wrong and unacceptable and potentially highly dangerous – a MU would be converted to an EM, so micrograms could be converted to milligrams.

All the systems we looked at yesterday made this mistake except #AMI2. So almost all scientific content mining systems will extract incorrect information unless they can correct for this. And there are three ways of doing this:

  • Insisting publishers use Unicode. No hope in hell of that. Publishers (BMC and other OA publishers excluded) in general want to make it as hard as possible to interpret PDFs. So nonstandard PDFs are a sort of DRM. (BTW it would cost a few cents per paper to convert to Unicode – that could be afforded out of the 5500 USD they charge us).
  • Translating the glyphs into Unicode. We are going to have to do this anyway, but it will take a little while.
  • Create lookups for each font. So I have had to create a translation table for the non-standard font AdvTT3f84ef53, which AFAIK no one other than BMC uses and which isn’t documented anywhere. But I will be partially automating this soon and it’s a finite if soul-destroying task (a minimal sketch follows this list).
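For concreteness, here is a hedged sketch of the lookup approach (and of the coordinate-based space test above). The single table entry – AdvTT3f84ef53, character 108 → U+03BB – is the one documented in this post; everything else (class names, the 0.3 space threshold) is my own assumption, not AMI2’s actual code.

import java.util.HashMap;
import java.util.Map;

public class FontTranslator {

    // fontName -> (raw codepoint -> Unicode codepoint)
    private static final Map<String, Map<Integer, Integer>> TABLES = new HashMap<>();
    static {
        Map<Integer, Integer> advTT3f84ef53 = new HashMap<>();
        advTT3f84ef53.put(108, 0x03BB);   // 'l' in this font actually draws a Greek lambda (955)
        TABLES.put("AdvTT3f84ef53", advTT3f84ef53);
    }

    static int translate(String fontName, int rawCodePoint) {
        Map<Integer, Integer> table = TABLES.get(fontName);
        if (table != null && table.containsKey(rawCodePoint)) {
            return table.get(rawCodePoint);
        }
        return rawCodePoint;   // assume a standard encoding if there is no table entry
    }

    // Insert a space when the gap between the end of the previous glyph and the start
    // of the next one is large relative to the font size (cf. 182.7 + 0.5*23 << 201.7).
    static boolean needsSpace(double prevX, double prevWidth1000, double fontSize, double nextX) {
        double prevEnd = prevX + (prevWidth1000 / 1000.0) * fontSize;
        return (nextX - prevEnd) > 0.3 * fontSize;
    }

    public static void main(String[] args) {
        System.out.println((char) translate("AdvTT3f84ef53", 108));       // prints the lambda
        System.out.println(needsSpace(182.691, 500.0, 23.305, 201.703));  // true: insert a space
    }
}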

So AMI2 is able to get:

With the underlying representation of lambda as Unicode 955:

So AMI2 is happy to contribute her translation tables to the Open Jailbreaking community. She’d also like people to contribute, maybe through some #crowdcrafting. It’s pointless for anyone else to do this unless they want to build a standalone competitive system. Because it’s Open they can take AMI2 as long as they acknowledge it in their software. Any system that hopes to do maths is almost certainly going to have to use a translator or OCR.

So glyph processing is the first and essential part of Jailbreaking the PDF.

 


Jailbreaking the PDF; a wonderful hackathon and a community leap forward for freedom – 1

Yesterday we had a truly marvellous hackathon http://scholrev.org/hackathon/ in Montpellier, in between workshops and main Eur Semantic Web Conference. The purpose was to bring together a number of groups who value semantic scholarship and free information from the traditional forms of publication. I’ll be blogging later about the legal constraints imposed by the publishing industry, but Jailbreaking is about the technical constraints of publishing information as PDF.

The idea Jailbreaking was to bring together people who have developed systems, tools, protocols, communities for turning PDF into semantic form. Simply, raw PDF is almost uninterpretable, a bit like binary programs. For about 15 years the spec was not Open and it was basically a proprietary format from Adobe. The normal way of starting to make any sense of PDF content is to buy tools from companies such as Adobe, and there has been quite a lot of recent advocacy from Adobe staff to consider using PDF as a universal data format. This would be appalling – we must use structured documents for data and text and mixtures. Fortunately there are now a good number of F/OSS tools, my choice being http://pdfbox.apache.org/ and these volunteers have laboured long and hard in this primitive technology to create interpreters and libraries. PDF can be produced well, but most scholarly publishers’ PDFs are awful.

It’s a big effort to create a PDF2XML system (the end goal). I am credited with the phrase “turning a hamburger into a cow” but it’s someone else’s. If we sat down to plan PDF2XML, we’d conclude it was very daunting. But we have the modern advantage of distributed enthusiasts. Hacking PDF systems by oneself at 0200 in the morning is painful. Hacking PDFs in the company of similar people is wonderful. The first thing is that it lifts the overall burden from you. You don’t have to boil the ocean by yourself. You find that others are working on the same challenge and that’s enormously liberating. They face the same problems and often solve them in different ways or have different priorities. And that’s the first positive takeaway – I am vastly happier and more relaxed. I have friends and the many “I“s are now we. It’s the same liberating feeling as 7 years ago when we created the http://en.wikipedia.org/wiki/Blue_Obelisk community for chemistry. Jailbreaking has many of the shared values, though coming from different places.

Until recently most of the tools were closed source, usually for-money though occasionally free-as-in-beer for some uses or communities. I have learnt from bitter experience that you can never build an ongoing system on closed source components. At some stage they will either be withdrawn, or there will be critical things you want to change or add and that’s simply not possible. And licensing closed source in an open project is a nightmare. It’s an anticommons. So, regretfully, I shall not include Utopia/pdfx from Manchester in my further discussion because I can’t make any use of it. Some people use its output, and that’s fine – but I would/might want to use some of its libraries.

There was a wonderful coming-together of people with open systems. None of us had the whole picture, but together we covered all of it. Not “my program is better than your program”, but “our tools are better than my system”. So here is a brief overview of the open players who came together (I may miss some individuals, please comment if I have done you an injustice). I’ll explain the technical bits in a later post – here I am discussing the social aspects.

  • LA-PDFText (http://code.google.com/p/lapdftext/ – Gully Burns). Gully was in Los Angeles – in the middle of the night – and showed great stamina. In true hacking spirit I used the time to find out about Gully’s system. I downloaded it and couldn’t get it to install (needed java-6). So Gully repackaged it, and within two iterations (an hour) I had it working. That would have taken days conventionally. LA-PDFText is particularly good at discovering blocks (more sophisticated than #AMI2) so maybe I can use it in my work rather than competing.
  • CERMINE (http://sciencesoft.web.cern.ch/node/120). I’ve already blogged this, but here we had the lead Dominika Tkaczyk live from Poland. I take comfort from her presence and vice versa. CERMINE integrates text better than #AMI at present and has a nice web service.
  • Florida State University. Alexander Garcia, Casey McLaughlin, Leyla Jael Garcia Castro, Biotea (http://biotea.idiginfo.org/ ), Greg Riccardi and colleagues. They are working on suicide in the context of Veterans’ admin documents and provided us with an Open corpus of many hundred PDFs. (Some were good, some were really awful.) Alex and Casey ran the workshop with great energy, preparation, food, beer, etc., and arranged the great support from the ABES site.
  • #crowdcrafting. It will become clear that human involvement is necessary in parts of the PDF2XML process – validating our processes, and possibly tweaking final outputs. We connected to Daniel Lombraña González of http://crowdcrafting.org/ who took us through the process of building a distributed volunteer community. There was a lot of interest and we shall be designing clear crowdcrafting-friendly tasks (e.g. “draw a rectangle round the title”, “highlight corrupted characters”, “how many references are there”, etc.)

  • CITALO (http://wit.istc.cnr.it:8080/tools/citalo). This system deduces the type of the citation (reference) from textual analysis. This is a very good example of a downstream application which depends on the XML but is largely independent of how it is created.
  • #AMI2. Our AMI2 system is complementary to many of the others – I am very happy for others to do citation typing, or match keywords. AMI2 has several unique features (I’ll explain later), including character identification, graphics (graphics are not images) extraction, image extraction, sub and superscripts, bold and italic. (Most of the other systems ignore graphics completely and many also ignore bold/italic)

So we have a wonderful synthesis of people and projects and tools. We all want to collaborate and are all happy to put community success as the goal, not individual competition. (And the exciting thing is that it’s publishable and will be heavily cited. We have shown this in the Blue Obelisk publications, where the first has 300 citations, and I’d predict that a coherent Jailbreaking publication would be of great interest.)

So yesterday was a turning point. We have clear trajectories. We have to work to make sure we develop rapidly and efficiently. But we can do this initially in a loose collaboration, planning meetings and bringing in other collaborators and funding.

So if you are interested in an Open approach to making PDFs Open and semantic, let us know in the comments.

 


SePublica : Overview of my Polemics presentation #scholrev

This is a list of the points I want to cover when introducing the session on Polemics. A list looks a bit dry but I promise to be polemical. And try to show some demos at the end. The polemics are constructive in that I shall suggest how we can change the #scholpub world by building a better one than the current one.

NOTE: Do not be overwhelmed by the scale of this. Together we can do it.

It is critical we act now

  • Semantics/Mining is now seen as an opportunity by some publishers to “add value” by building walled gardens.
  • Increasing attempts to convince authors to use CC-NC.
  • We must develop semantic resources ahead of this and push the edges

One person can change the world

We must create a coherent community

  • Examples:
    • OpenStreetMap
    • Wikipedia
    • Galaxy Zoo
    • OKFN Crowdcrafting
    • Blue Obelisk (Chemistry – PMR)
    • ?#scholrev

Visions

  • Give power to authors
  • Discover, aggregate and search (“Google for science”)
  • Make the literature computable
  • Enhance readers with semantic aids
  • Smart “invisible” capture of information

Practice before Politics

  • Create compelling examples
  • Add Value
  • Make authors’ lives easier
  • Mine and semanticize current scholarship.

Text Tables Diagrams Data

  • Text (chemistry, species)
  • Tables (Jailbreak corpus)
  • Diagrams chemical spectra, phylogenetic trees
  • Data (output). Quixote

Material to start with

  • Open information (EuropePMC, theses)
  • “data not copyrightable”. Supp data, tables, data-rich diagrams
  • Push the limits of what’s allowed (forgiveness not permission)

Disciplines/artefacts with good effort/return ratio

  • Phylogenetic trees (Ross Mounce + PMR)
  • Nucleic acid sequences
  • Chemical formulae and reactions
  • Regressions and models
  • Clinical/human studies (tables)
  • Dose-response curves

Tools, services, resources

    We need a single-stop location for tools

  • Research-enhancing tools (science equiv of Git/Mercurial). Capture and validate work continuously
  • Common approach to authoring
  • Crawling tools for articles, theses.
  • PDF and Word converters to “XML”
  • Classifiers
  • NLP tools and examples
  • Table hackers
  • Diagram hackers
  • Logfile hackers
  • Semantic repositories
  • Abbreviations and glossaries
  • Dictionaries and dictionary builders

     

Advocacy, helpers, allies

  • Bodies who may be interested (speculative, I haven’t asked them):
    • Funders of science
    • major Open publishers
    • Funders of social change (Mellon, Sloan, OSF…)
    • SPARC, DOAJ, etc.
    • (Europe)PMC
  • Crowdcrafting (OKF, am involved with this)
  • Wikipedia



SePublica: Making the scholarly literature semantic and reusable

Scholarly literature has been virtually untouched by the digital revolution in this century. The primary communication is by digital copies of paper (PDFs) and there is little sign that the revolution has brought any change in social structures, either in Universities/Research_Establishments or in the publishing industry. The bulk of this industry comprises two sectors, commercial publishing and learned societies. The innovations have been largely restricted to Open Access publishing (pioneered by BMC and then by PLoS) and the megajournal (PLoS ONE).

I shall generalise, and exempt a few players from criticism: The Open Access publishers above with smaller ones such as eLife, PeerJ, MDPI, Ubiquity, etc. And a few learned societies (the International Union of Crystallography and the European Geosciences Union, and please let me have more). But in general the traditional publishers (all those not exempted) are a serious part of the problem and cannot now be part of the solution.

That’s a strong statement. But over the last ten years it has been clear that publishing should change, and it hasn’t. The mainstream publishers have put energy into stopping information being disseminated and creating restrictions on how it can be used. Elsevier (documented on this list) has prevented me extracting semantic information from “their” content.

The market is broken because the primary impetus to publish is increasingly driven by academic recognition rather than a desire to communicate. And this makes it impossible for publishers to act as partners in the process of creating semantics. I hear that one large publisher has now built a walled garden for content mining – you have to pay to access it and undoubtedly there are stringent conditions on its re-use. This isn’t semantic progress, it’s digital neo-colonialism.

I believe that semantics arises out of community practice of the discipline. On Saturday the OKFN is having an economics hackathon (Metametrik) in London where we are taking five papers and aiming to build a semantic model. It might be in RDF, it might be in XML; the overriding principle is that it must be Open, developed in a community process.

And in most disciplines this is actively resisted by the publishing community. When Wikipedia started to use Chemical Abstracts (ACS) identifiers, the ACS threatened Wikipedia with legal action. They backed down under community pressure. But this is no way to develop semantics. It can only lead to centralised control of information. Sometimes top-down semantic development is valuable (probably essential in heavily regulated fields) but it is slow, often arbitrary and often badly engineered.

We need the freedom to use the current literature and current data as our guide to creating semantics. What authors write is, in part, what they want to communicate (although the restriction to “10 pages” is often absurd and destroys clarity and innovation). The human language contains implicit semantics, which are often much more complex than that. So Metametrik will formalize the semantics of (a subset of) economic models, many of which are based on OLS (ordinary least squares). Here’s part of a typical table reporting results. It’s data, so I am not asking permission to reproduce it. [It’s an appalling reflection on the publication process that I should even have to, though many people are more frightened of copyright than of doing incomplete science.]

 

And the legend:

How do we represent this table semantically? We have to identify its structure, and the individual components. The components are, for the most part, well annotated in a large metadata table. (And BTW metadata is essential for reporting facts, so I hope no one argues that it’s copyrightable. If they do, then scientific data in C21 is effectively paralysed.)

That’s good metadata for 2001, when the paper was published. Today, however, we immediately feel the frustration of not linking *instantly* to Gallup and Sachs, or La Porta. And we seethe with rage if we find that they are paywalled – this is scholarly vandalism, preventing the proper interpretation of scholarship.

We then need a framework for representing the data items: real (FP) numbers, with errors and units. There doesn’t seem to be a clear ontology/markup for this, so we may have to reuse from elsewhere. We have done this in Chemical Markup Language (its STMML subset), which is fully capable of holding everything in the table. But there may be other solutions – please tell us.

But the key point is that the “Table” is not a table. It’s a list of regression results where the list runs across the page. Effectively it’s regression 1, … regression 11. So a List is probably more suitable than a table. I shall have a hack at making this fully semantic and recomputable.

And at the same time I’ll see if AMI2 can actually read the table from the PDF.
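As a sketch of that “list of regressions” idea – the type names and numbers below are entirely my own illustrative inventions, not taken from the paper or from any existing Metametrik schema – each reported value carries its uncertainty rather than being a bare table cell:

import java.util.List;

public class RegressionList {

    record Value(double mean, double error) { }                 // e.g. a coefficient and its s.e.

    record Regression(String id, String dependentVariable,
                      List<String> regressors, List<Value> coefficients,
                      Value rSquared, int n) { }

    public static void main(String[] args) {
        // Illustrative numbers only, not data from the paper discussed above.
        Regression r1 = new Regression(
                "regression1", "growth",
                List.of("initial GDP", "openness"),
                List.of(new Value(-0.37, 0.08), new Value(1.92, 0.55)),
                new Value(0.64, 0.0), 78);
        System.out.println(r1.id() + ": n=" + r1.n() + ", R2=" + r1.rSquared().mean());
    }
}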

I think this is a major way of kickstarting semantic scholarship – reading the existing literature and building re-usables from it. Let’s call it “Reusable scholarship”.

 

 


SePublica: What we must do to promote Semantics #scholrev #btpdf2

In the previous post (/pmr/2013/05/23/sepublica-how-semantics-can-empower-us-scholrev-scholpub-btpdf2/) I outlined some of the reasons why semantics are so important. Here I want to show what we have to do (and again stick with me – although you might disagree with my stance).

The absolute essentials are:

  • We have to be a community.
  • We have to identify things that can be described and on which we are prepared to agree.
  • We have to describe them
  • We have to name them
  • We have to be able to find them (addressing)

Here Lewis Carroll, a master of semantics, shows the basics:

And she went on planning to herself how she would manage it. `They must go by the carrier,’ she thought; `and how funny it’ll seem, sending presents to one’s own feet! And how odd the directions will look!

ALICE’S RIGHT FOOT, ESQ.

HEARTHRUG,

NEAR THE FENDER,

(WITH ALICE’S LOVE).

 

Oh dear, what nonsense I’m talking!’

Alice identifies her foot as a foot, and gives it a unique identifier: RIGHT FOOT. The address consists of another unique identifier (HEARTHRUG) and annotates it (NEAR THE FENDER). There’s something fundamental about this. (How many children have annotated their books with “Jane Doe, 123 Some Road, This Town, That City, Country, Continent, Earth, Solar System, Universe”?) Hierarchies seem fundamental to humans. Anything else is much more difficult. (Peter Buneman and I have been bouncing this idea about.) I am sure we have to use hierarchies to promote these ideas to newcomers.

Things get unique identifiers. They can be at different levels. Single instances such as Alice’s left foot.

But there are also whole classes – the class of left feet. I have a left foot. It’s distinct from Alice’s. And we need unique names for these classes, such as “left foot“. Generally all humans have one (but see http://en.wikipedia.org/wiki/The_Man_with_Two_Left_Feet ). And we can start making rules, see http://human-phenotype-ontology.org/contao/index.php/hpo_docu.html.

At the moment, all relationships in the Human Phenotype Ontology are is_a relationships, i.e. simple class-subclass relationships. For instance, Abnormality of the feet is_a Abnormality of the lower limbs. The relationships are transitive, meaning that they are inherited up all paths to the root. For instance, Abnormality of the lower limbs is_a Abnormality of the extremities, and thus Abnormality of the feet also is an Abnormality of the extremities.
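A tiny sketch of what that transitivity means in code – plain strings rather than real HPO identifiers, and nothing to do with the HPO’s own tooling:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IsATransitivity {

    // term -> its direct is_a parents
    static final Map<String, List<String>> IS_A = Map.of(
            "Abnormality of the feet", List.of("Abnormality of the lower limbs"),
            "Abnormality of the lower limbs", List.of("Abnormality of the extremities"),
            "Abnormality of the extremities", List.of()
    );

    // Collect every ancestor of a term, up all paths to the root.
    static Set<String> ancestors(String term) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> toVisit = new ArrayDeque<>(IS_A.getOrDefault(term, List.of()));
        while (!toVisit.isEmpty()) {
            String parent = toVisit.pop();
            if (result.add(parent)) {
                toVisit.addAll(IS_A.getOrDefault(parent, List.of()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "Abnormality of the feet" is also an abnormality of the extremities:
        System.out.println(ancestors("Abnormality of the feet"));
    }
}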

We see a terminology appearing. Some would call this an ontology, others would dispute that. I tend to use the concept of “dictionary”, fuzzed across language and computability.

This is where the difficulties start. On the one hand this is very valuable – if a disease affects the extremities, then it might affect the left foot. But it’s also where people’s eyes glaze over. Ontology language is formal and does not come naturally to many of us. And when it’s applied like a syllogism:

  • All men are mortal
  • Socrates is a man
  • Therefore Socrates is mortal

Many people think – so what? – we knew that already. On the other hand it’s quite difficult to translate this into machine language (even after realising that “men” is the plural of “man”). The symbology is frightening (with upside-down A’s and backwards E’s). Here are the fundamental concepts in a type system: http://stackoverflow.com/questions/12532552/what-part-of-milner-hindley-do-you-not-understand
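(To make the “upside-down A” concrete for a moment: in standard first-order notation, written here as LaTeX, the syllogism above is nothing more than

\forall x\,\big(\mathrm{Man}(x) \rightarrow \mathrm{Mortal}(x)\big),\quad \mathrm{Man}(\mathrm{Socrates}) \;\vdash\; \mathrm{Mortal}(\mathrm{Socrates})

and the \forall is the frightening upside-down A.)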

The discussion on Stack Overflow includes:

  • “Actually, HM is surprisingly simple–far simpler than I thought it would be. That’s one of the reasons it’s so magical”
  • “The 6 rules are very easy. Var rule is rather trivial rule – it says that if type for identifier is already present in your type environment, then to infer the type you just take it from the environment as is.” PMR is still struggling with the explanation.
  • This syntax, while it may look complicated, is actually fairly simple. The basic idea comes from logic: the whole expression is an implication with the top half being the assumptions and the bottom half being the result. That is, if you know that the top expressions are true, you can conclude that the bottom expressions are true as well.

The problem is language and symbology. If you haven’t been trained in language it’s often impenetrable. For example music. If you haven’t been trained in it, it makes little sense and takes us a considerable time to learn:

So if we want to get a lot of people involved we have to be very careful about exposing newcomers to formal semantics. I avoid words like ontology, quantifier, predicate, disjunction, because people already have to be convinced they are worth learning.

Humans want to learn music not because they’ve seen written music but because they’ve heard music. Similarly we have to sell semantics by what it does, rather than what it is. And we cannot show what it does without building systems, any more than we are motivated to learn about pianos until we have seen and heard one.

The problem is that it’s a lot of effort to build a semantic system and that there is not necessarily a clear reward. The initial work, as always, was in computer science, which showed – on paper – what could be possible but didn’t leave anything that ordinary people can pick up on. This is very common – before the WWW there was a whole decade or more of publications in “hypermedia”, but much of this was only read by people working in the field. And often the major reason for working in a new field is to get academic publications, not to create something useful to the world. There often seems to be a lag of twenty years, and indeed that’s happening in semantics.

So it’s very difficult to get public funding to build something that’s useful and works. One effect is that the systems are built by companies. That’s not necessarily a bad thing – railways and telephones came from private enterprise. But there are problems with the digital age and we see this with modern phones – they can become monopolies which constrain our freedom. We buy them to communicate but we didn’t buy them to report our location to unknown vested interests.

And semantics have the same problem. The people who control our semantics will control our lives. Because semantics constrain the formal language we use and that may constrain the natural language. We humans may not yet be in danger of Orwell’s Newspeak but our machines will be. And therefore we have to assert rights to have say over our machines’ semantics.

That raises the other problem – semantic Babel. If everyone creates their own semantics no-one can talk (we already see this with phone apps). I live in the semantic Babel of machine-chemistry – every company creates a different approach. Result – chemistry is 20 years behind bioscience where there is a communal vision of interoperable semantics.

So I think the major task for SePublica is to devise a strategy for bottom-up Open semantics. That’s what Gene Ontology did for bioscience. We need to identify the common tools and the common sources of semantic material. And it will be slow – it took crystallography 15 years to create their dictionaries and system and although we are speeding up we’ll need several years even when the community is well inclined. (That’s what we are starting to do in computational chemistry – the easiest semantic area of any discipline). It has to be Open, and we have to convince important players (stakeholders) that it matters to them. Each area will be different. But here are some components that are likely to be common to almost all fields:

  • Tools for creating and maintaining dictionaries
  • Ways to extract information from raw sources (articles, papers, etc.) – that’s why we are Jailbreaking the PDF.
  • Getting authorities involved (but this is increasingly hard as the learned societies are often our problem, not the solution)
  • Tools to build and encourage communities
  • Demonstrators and evangelists
  • Stores for our semantic resources
  • Working with funders

We won’t get all of that done at SePublica. But we can make a lot of progress.



SePublica: How semantics can empower us; #scholrev #scholpub #btpdf2

I’m writing blog posts to collect my thoughts for the wonderful workshop at SePublica http://sepublica.mywikipaper.org/drupal/ where I am leading off the day. [This also acts as a permanent record instead of slides. Indeed I may not provide slides as such as I often create the talk as I present it.] My working title is

Why and how can we make Scholarship Semantic?

[If you switch off at “Semantics” trust me and keep reading… There’s a lot here about changing the world.]

Why should we strive to create a semantic web/world? I “got it” when I heard TimBL in 1994. Many people have “got it”. There are startups based on creating and deploying semantic technology. My colleague Nico Adams (who understands much more about the practice of semantics than me) has a vision of creating a reasoning engine for science (he’s applied this to polymers, biotechnology, chemistry). I completely buy his vision.

But it’s hard to sell this to people who don’t understand. Any more than TimBL could sell SGML in 1990. (Yes there were whole industries who bought into SGML, but most didn’t). So what TimBL did was to build a system that worked (The WWW). And this often seems to be the requirement for Semantic Web projects. Build it and show it working.

SePublica will probably be attended by the converted. I don’t think I have to convince them of the value of semantics. But I do have to catalyse:

  • The creation of convincing demonstrators (examples that work)
  • Arguments for why we need semantics and what it can do.

So why are semantics important for scholarly publishing? The following arguments will hopefully convince some people:

  • They unlock the value of the stuff already being published. There is a great deal in a single PDF (article or thesis) that is useful. Diagrams and tables are raw exciting resources. Mathematical equations. Chemical structures. Even using what we have today converted into semantic form would add billions.
  • They make information and knowledge available to a wider range of people. If I read a paper with a term I don’t know then semantic annotation may make it immediately understandable. What’s rhinovirus? It’s not a virus of rhinoceroses – it’s the common cold. That makes it accessible to many more people (if the publishers allow it).
  • They highlight errors and inconsistencies. Ranging from spelling errors to bad or missing units to incorrect values to stuff which doesn’t agree with previous knowledge. And machines can do much of this. We cannot have reproducible science until we have semantics.
  • They allow the literature to be computed. Many of the semantics define objects (such as molecules or phylogenetic trees) which are recomputable. Does the use of newer methods give the same answer?
  • They allow the literature to be aggregated. This is one of the most obvious benefits. If I want all phylogenetic trees, I need semantics – I don’t want shoe-trees or B-trees or beech trees. And many of these concepts are not in Google’s public face (I am sure they have huge semantics internally)
  • They allow the material to be searched. How many chemists use halogenated solvents? (The word halogen will not occur in the paper.) With semantics this is a relatively easy thing to do. Can you find second-order differential equations? Or Fourier series? Or triclinic crystals? (The words won’t help.) AMI2 will be able to (a toy sketch follows this list).
  • They allow the material to be linked into more complex concepts. By creating a database of species, a database of geolocations and links between them, we start to generate an index of biodiversity. What species have been reported when and where? This can be used for longitudinal analyses – is X increasing/decreasing with time? Where is Y now being reported for the first time?
  • They allow humans to link up. If A is working on Puffinus puffinus (no, it’s not a puffin, that’s Fratercula arctica) in the northern hemisphere and B is working on Puffinus tenuirostris in Port Fairy, Victoria, AU then a shared knowledgebase will help to bring the humans together. And that happens between subjects – microscopy can link with molecular biology with climate with chemistry.
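Here is a toy, dictionary-based sketch of the “halogenated solvents” point from the list above: the class name never appears in the text, so a lookup has to supply it. The three dictionary entries are illustrative (a real dictionary would have thousands) and nothing here is AMI2’s actual code.

import java.util.List;
import java.util.Map;
import java.util.Set;

public class SolventSearch {

    static final Map<String, Set<String>> SOLVENT_CLASSES = Map.of(
            "dichloromethane", Set.of("halogenated solvent", "chlorinated solvent"),
            "chloroform", Set.of("halogenated solvent", "chlorinated solvent"),
            "ethanol", Set.of("alcohol")
    );

    // Does any word in the paper denote a member of the wanted class?
    static boolean mentionsClass(List<String> wordsInPaper, String wantedClass) {
        return wordsInPaper.stream()
                .map(w -> SOLVENT_CLASSES.getOrDefault(w.toLowerCase(), Set.of()))
                .anyMatch(classes -> classes.contains(wantedClass));
    }

    public static void main(String[] args) {
        List<String> words = List.of("dissolved", "in", "dichloromethane");
        // true, even though the word "halogen" never occurs in the text:
        System.out.println(mentionsClass(words, "halogenated solvent"));
    }
}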

In simple terms semantics allow smart humans to develop communal resources to develop new ideas faster, smarter and better.

Please add other ideas! I am sure I have missed some.

 


#scholrev #ami2 #btpdf2 Jailbreaking content (including tables) from PDFs

We’ve got a splendid collection of about 600 Open PDFs for our jailbreak hackathon. They seem to have a medical focus. They are of very variable type and quality. Some are reports or guidelines, some academic papers. Some are born digital, but at least one is scanned with OCR where the image and the text are superposed. (BTW I am taking it on trust that the papers are Open – some are from closed access publishers and carry their copyright. It’s time we started marking papers as Open ON THE PAPER.)

I have given these to #AMI2 – she processes a paper in about 10 secs on my laptop so it’s just over an hour for the whole lot. That gives me a chance to blog some more. In rev63 AMI was able to do tables, so here, without any real selection, I’m giving some examples. (Note that some tables are not recognised as such – especially when the authors don’t use the word “table”. But we shall hack those in time…) Also, as HTML doesn’t seem to have a tableFooter that manages the footnotes, I have temporarily added these to the caption as a separate paragraph.

From Croat Med J. 2007;48:133-9:

The table in the PDF

 

AMI’s translation to HTML:

Table 1. Scores achieved by 151 Croatian war veterans diagnosed with posttraumatic stress disorder on the Questionnaire on Traumatic Combat and War Experiences (USTBI-M), Mississippi Scale for Combat-Related Post-Traumatic Stress Disorder (M-PTSD), and Minnesota Multiphasic Personality Inventory (MMPI)-201 (presented as T values)

*Abbreviations: L – rigidity in respondents’ approach to the test material; F – lack of understanding of the material; K – tendency to provide socially acceptable answers.

 

| Questionnaire | Score (mean ± standard deviation) | Cut-off score |
| USTBI-M | 77.8 ± 14.3 | Maximum: 120 |
| M-PTSD | 122.1 ± 22.9 | 107 |
| MMPI-201 scales* | | |
| L | 51.1 ± 2.0 | 70 |
| F | 73.2 ± 6.3 | 70 |
| K | 42.4 ± 3.2 | 70 |
| | 87.6 ± 5.1 | 70 |
| | 96.7 ± 6.6 | 70 |
| | 88.2 ± 4.7 | 70 |
| | 67.3 ± 4.8 | 70 |
| | 79.3 ± 5.8 | 70 |
| Pt (psychastenia) | 75.4 ± 5.7 | 70 |
| | 72.1 ± 7.4 | 70 |
| | 52.3 ± 2.6 | 70 |

COMMENT: Some of the row labels/headings are omitted, but I think that can be solved. (Remember this is AMI’s first attempt, so we call it alpha.)

Here’s another:



And what AMI translates it to

Table 2 The comparison of quality of life among study groups using analysis of variance and post-hoc tests

*Group-by-group comparisons that were significant at the level of P < 0.001 performed using LSD (homogenous variance; used for physical and overall quality of life) or Dunnet T3 (unhomogenous variance; all other questions). The significance was set at P < 0.001 in post-hoc test in order to reduce the increased chances of false positive results.

| QOL dimension/status | Groups | N | Mean ± SD | F; P | Post-hoc differences* |
| Physical | PTSD + LBP (I) | 79 | 75.44 ± 11.33 | | |
| | PTSD (II) | 56 | 78.43 ± 11.54 | 49.18; | I-III, I-IV, II-III, |
| | LBP (III) | 84 | 87.43 ± 13.84 | < 0.001 | II-IV, III-IV |
| | Controls (IV) | 134 | 94.42 ± 11.65 | | |
| | Total | 353 | 85.97 ± 14.40 | | |
| Psychological | PTSD + LBP (I) | 76 | 63.74 ± 14.60 | | |
| | PTSD (II) | 58 | 67.45 ± 15.92 | 79.05; | I-III, I-IV, II-III, |
| | LBP (III) | 90 | 80.27 ± 14.59 | < 0.001 | II-IV, III-IV |
| | Controls (IV) | 132 | 90.67 ± 10.76 | | |
| | Total | 356 | 78.51 ± 17.44 | | |
| Social | PTSD + LBP (I) | 80 | 33.40 ± 8.89 | | |
| | PTSD (II) | 58 | 35.93 ± 9.98 | 70.19; | I-III, I-IV, II-III, |
| | LBP (III) | 91 | 41.58 ± 8.78 | < 0.001 | II-IV, III-IV |
| | Controls (IV) | 134 | 49.22 ± 7.13 | | |
| | Total | 363 | 41.70 ± 10.6 | | |
| Enviromental | PTSD + LBP (I) | 79 | 92.81 ± 20.78 | | |
| | PTSD (II) | 58 | 100.76 ± 19.79 | 66.27; | I-III, I-IV, II-IV, |
| | LBP (III) | 88 | 108.36 ± 17.71 | < 0.001 | III-IV |
| | Controls (IV) | 130 | 126.06 ± 14.27 | | |
| | Total | 355 | 110.14 ± 22.02 | | |
| Satisfaction with personal health status | PTSD + LBP (I) | 80 | 1.84 ± 0.74 | | |
| | PTSD (II) | 59 | 2.36 ± 0.85 | 127.48; | I-II, I-III, I-IV, II-IV, |
| | LBP (III) | 95 | 2.70 ± 0.98 | < 0.001 | III-IV |
| | Controls (IV) | 135 | 4.03 ± 0.85 | | |
| | Total | 369 | 2.94 ± 1.23 | | |
| Overall self-reported quality of life | PTSD + LBP (I) | 73 | 2.82 ± 1.14 | | |
| | PTSD (II) | 49 | 3.29 ± 1.28 | 24.04; | I-II, I-III, I-IV, II-III, |
| | LBP (III) | 75 | 4.04 ± 1.25 | < 0.001 | II-IV |
| | Controls (IV) | 42 | 4.48 ± 0.80 | | |
| | Total | 239 | 3.59 ± 1.31 | | |

 

I think she’s got it completely right (the typos “Enviromental” and “Unhomogenous” are visible in the PDF).

AFAIK there is no automatic Open extractor of tables so we are very happy to contribute this to the public pool.


Academia should not wait for publishers to make the rules; WE should make them.

[I shall be arguing at SePublica http://sepublica.mywikipaper.org/drupal/ that WE have to take control of OUR scholarship, and that semantics are one of our tools to help us.]

A recent post (http://aoasg.org.au/2013/05/23/walking-in-quicksand-keeping-up-with-copyright-agreements/) from the The Australian Open Access Support Group (AOASG) laments the complexity of managing publisher “agreements”. I’ll quote first and then argue that we are capitulating to publishers rather than asserting OUR rights.

Walking in quicksand – keeping up with copyright agreements

 

As any repository manager will tell you, one of the biggest headaches for providing open access to research materials is complying with publisher agreements.

Most publishers will allow some form of an article published in their journals to be made open access.

One problem repository managers face is that publishers sometimes change their position on open access. Often there is no public announcement from the publisher; especially when the change imposes more restrictions on ‘green’ open access. This is where the blogosphere and discussion lists (such as the CAIRSS List in Australia) are invaluable in keeping practitioners on top of new issues in the area.

Some recent cases where publishers set more restrictions on ‘green’ open access include Springer and IEEE.

Then on 1 January 2011 IEEE changed the rules and said people could no longer put up the Published Version. They were still allowed to put up the Submitted Version (preprint) or the Accepted Version (postprint). The policy is on the IEEE website here. While this still allows IEEE works to be made available in compliance with the recent Australian mandates, a recent blog  argues that the re-use restrictions on the Accepted Version of IEEE publications imposed by IEEE means that the works are not open access in compliance with many overseas mandate requirements.


According to [Springer’s] Self-Archiving Policy: “Authors may self-archive the author’s accepted manuscript of their articles on their own websites. Authors may also deposit this version of the article in any repository, provided it is only made publicly available 12 months after official publication or later. …”

So now there is a 12 month embargo on making the Accepted Version available.


These changing copyright arrangements mean that the process of making research openly accessible through a repository is becoming less and less able to be undertaken by individuals. By necessity, repository deposit is becoming solely the responsibility of the institution.

Dr Danny Kingsley
Executive Officer
Australian Open Access Support Group

PMR: I’ve chosen this because it was blogged yesterday, but we are seeing this everywhere. Universities (I use this instead of the L-word) do not make “agreements” with publishers – they try to operate the “rules” that the publishers make up whenever they feel like it. If a publisher (such as Elsevier) “offers” an “agreement” forbidding content mining, the University does not challenge it – they sign it and work to make sure it’s enforced. The Universities are effectively acting as enforcers for the publishers rather than asserting their own rights.

When was the last time a University challenged a publisher agreement in public? And won any concession. If you can show me 10 separate examples I shall be less critical of the L-Universities.

The management of this and the politics of this are also absurd. WE create the content. We review it. WE pay the publishers 10,000,000,000 USD every year. And then we wait for the publishers to create restrictions on what WE can do with OUR content.

It’s incredibly inefficient. Take the 100 major publishers that a library has to “negotiate” an “agreement” with (i.e. sign the publisher’s diktat) and 1000 universities: that’s 100,000 “agreements” signed per year (with no public scrutiny and certainly not with MY knowledge). That’s hugely inefficient. And I have no doubt that many publishers are deliberately obfuscating.

Turn it round. Suppose we had an academic re-use licence and WE asked the publishers to agree to it. We would work out what WE (the purchasers) were prepared to accept for OUR reprocessed content. And if the publishers didn’t like it they would lose a customer. That’s how the rest of the world works. (Horror! Academics have the absolute right to use public money to publish where they want and buy whatever journals they want and pay publishers whatever they demand. Sorry, I forgot that).

We have seen an unedifying public fight between the green evangelists and others in the Open Access community. There is now no agreement on what “green” is. Sometimes it means hidden in a dark archive for years where no-one can read it (see above). If a publisher calls it “green”, who challenges them?

Very simply. The Open Access movement needs a formal body that can make the rules and challenge publishers. Why should it be left to a few activists (including me) to fight for content-mining? The funders are doing a good job here, but the Universities are making it increasingly difficult by giving in to the publishers. If we spent 0.1% of library subscriptions on fighting for our rights it would not only be the right thing to do but actually save money.

Oh, I had a private communication yesterday saying that one major publisher was now going to charge for text-mining its content.

Will the universities challenge that? If they did they might save the money. Or will they simply pay up yet again?


Jailbreaking the PDF, a collaborative #scholrev project, WE not I

I am really excited about the #scholrev hackathon program put together as “Jailbreaking the PDF”: some additional information at http://duraspace.org/jailbreaking-pdf-hackathon

From Alexander Garcia and Alex Garcia-Castro

Montpellier, France – The upcoming “Jailbreaking the PDF” hackathon (http://scholrev.org/hackathon) will be held Monday, May 27 in Montpellier, France at the Agence Bibliographique de l’enseignement Superieur (ABES):

http://www.abes.fr/Connaitre-l-ABES/Presentation-de-l-ABES.

Currently, the bulk of peer-reviewed scientific knowledge is locked up in PDF documents, from which it is difficult to get information.

We want to change that.

If you’re interested in hacking on PDFs and exploring ways to access scholarly data in modern ways, this hackathon is for you. There is no registration fee – the event is free. Bring yourself and your favorite laptop, and we’ll supply the food, drinks, wifi, repository, and everything else necessary to hack away.

Future announcements will be posted at http://scholrev.org/hackathon.

As with all hackathons we’ll work it out on the day (and possibly some on the night before). There are some suggested projects at http://wehack.it/hackathons/47-jailbreaking-the-pdf (I have put #ami2 in), but the important thing is to come up with things we can do on the day that will make a real impact. It’s a great chance to show that there is a critical mass of people in #scholrev and that we can achieve things.

The key thing is that we all want to change the world – in this case by repurposing PDFs to liberate information, and by doing that, working out how we change our ways of communicating (“publishing”) to humans and machines. What makes Jailbreak different is the Open approach – our tools are Open, our data and results are Open.

And it is more important that WE succeed rather than I succeed.

There are several reasons for developing technology and they include:

  • Creating a business and a market
  • Being the first to create something and gain (academic) recognition
  • Changing the world

So I and colleagues have been developing #AMI2, a toolset for turning PDF content into semantic form. I’m not interested in creating a business (at present) and I have the luxury of not needing academic glory. I shall be happy to submit a paper in due course as there are novel aspects but citations aren’t the primary driver.

No, this is my contribution to a toolkit to change the world. Because scholarly publishing critically needs a revolution and it’s not coming from conventional sources. Hacking PDFs can be a major part of the game-changer. And if the software is Open, then we can grow it.

I’m delighted to see that there are other people hacking PDFs and I shall meet some at this workshop. What will I feel if someone else has developed a tool that does things that AMI2 can’t do or does them better? It may surprise you, but I shall feel pleased. And I hope others would feel the same way.

Because it advances us all and makes the overall task easier and quicker. We’ve found this in chemistry software with the Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk ) where over 20 groups write F/OSS software. Each is independent – we don’t try to aggregate this into one monster toolkit (it wouldn’t work). But each looks to see what the others are doing, keeps in gentle touch and avoids needless duplication. I expect this spirit to develop in Jailbreak.

In any case hacking PDFs requires a large amount of heuristics. Examples are:

  • Translating undocumented fonts to Unicode
  • Dealing with graphics
  • Interpreting figures
  • Publisher – and journal-specific annotations
  • Recognising and processing tables
  • Hacking references and metadata

Many of these are never-ending jobs. Many are also boring. But many are ideal for a shared approach. I am very interested to see what the CERMINE: Content ExtRactor and MINEr does:

CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts: the document’s metadata, including title, authors, affiliations, abstract, keywords, journal name, volume and issue; parsed bibliographic references; and the structure of the document’s sections, section titles and paragraphs. CERMINE is based on a modular workflow, whose architecture ensures that individual workflow steps can be maintained separately. As a result it is easy to perform evaluation and training, and to improve or replace one step’s implementation without changing other parts of the workflow. Most steps’ implementations utilize supervised and unsupervised machine-learning techniques, which increases the maintainability of the system, as well as its ability to adapt to new document layouts. CERMINE is a Java library and a web service for extracting metadata and content from scientific articles in born-digital form. Limitations: this is an experimental service, and results may not be accurate. Uploaded files will be used only for metadata extraction; we do not store uploaded files. Accepted file format – *.pdf, maximum file size is 5 MB. License: CERMINE is licensed under GNU Affero General Public License version 3.

I’ve run CERMINE on a few files and it looks very useful. It certainly does things AMI2 doesn’t and vice versa. CERMINE is machine-learning based whereas AMI2 is heuristic. Both have to be adjusted when they get a new document type. AMI2 doesn’t do good metadata (still working out some general heuristics) but it addresses italics, bold, strange characters, sub/superscripts, compound document objects (e.g. captioned figures), tables, document sections, etc. There’s undoubtedly a role for both.

And the opportunity to create shared resources (e.g. fonts, journal templates, common terminology and nomenclature, etc.)

Content-mining and re-use needs a community focus and this workshop looks exactly that.
