Unilever Centre for Molecular Informatics
 

petermr's blog

A Scientist and the Web

 


Jailbreaking the PDF – 4; Making text from characters

Thursday, May 30th, 2013

In previous posts I have shown how we can, in most cases, create a set of Unicode characters from a PDF. If the original authors (e.g. many Government documents) were standards-compliant this is almost trivial. For scholarly publications, where the taxpayer/student pays 5000 USD per paper, the publishers refuse to use standards. So we have to use heuristics on this awful mess. (I have not yet found a scholarly publisher which is compliant and makes a syntactically manageable PDF – we pay them and they corrupt the information). But we have enough experience that for a given publisher we are correct 99% to 99.999% of the time (depending on the discipline – maths is harder than narrative text).

So now we have pages and on each page we have an UNORDERED list of characters. (We cannot rely on the order in which characters are transmitted – I spent two “wasted” months trying to use sequences and character groupings). We have to reconstruct text from the following STANDARD information for each character:

  • Its XY coordinates (raw PDF uses complex coordinates, PDFBox normalises to the page (0-600, 0-800))
  • Its FontFamily (e.g. Helvetica). This is because semantics are often conveyed by Fonts – monospace implies code or data. (I shall upset typographical purists as I should use “typeface” (http://en.wikipedia.org/wiki/Typeface ) and not “font” or “font family”. But “FontFamily” is universal in PDF and computer terminology.)
  • Its colour. This can be moderately complex – a character has an outline (stroke) and body (fill) and there are alpha overlays, transparency, etc. But most of the time it’s black.
  • Its font Weight. Normal or Bold. It’s complicated when publishers use fonts like MediumBold (greyish)
  • Its Size. The size is the actual font-size in pixels and not necessarily the points as in http://en.wikipedia.org/wiki/Point_%28typography%29 .

    Characters in the same font have different extents because of ascenders and descenders:

  • Its width. Monospaced fonts (http://en.wikipedia.org/wiki/Monospaced_font ) have equal width for all characters:

    Note that “I” and “m” have the same width. Any deliberate spaces also have the same width. That makes it easy to create words. The example above would have words “Aa”, “Ee”, “Qd”. (A word here is better described as a space-separated token, but “word” is simpler. It doesn’t mean it makes linguistic or numeric sense.)

    If the font is not monospaced then we need to know the width. Here’s a proportional font (http://en.wikipedia.org/wiki/Typeface#Proportion ):

    See how the “P” is twice as wide as the “I” or “l” in the proportional font. We MUST know the width to work out whether there is a space after it. Because there are NO SPACES in PDFs.

  • Its style. Conflated with “slope”. Most scientists simply think “italic” (as in Java). But we find “oblique” and “underline” and many others. We need to project these to “italic” and “underline” as these have semantics.

Note that Normal|Bold, Normal|Italic, Normal|Underline can be multiplied to give 8 variants. Conformant PDF makes this easy – PDFBox has an API which includes:

  • public float getItalicAngle()
  • public float getUnderlineThickness();
  • public static boolean isBold(Font font)

 

If we have all this information then it isn’t too difficult to reconstruct:

  • words
  • Weight of words (bold)
  • Style of word (italic or underline)

Which already takes us a long way.
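The word reconstruction described above can be sketched roughly as follows. This is only an illustration, not the #AMI2 code: the `Char` record, the `build_words` name and the 0.3 × font-size gap threshold are all assumptions made for the example, and it assumes characters on one line share (almost) the same baseline y, with y increasing down the page as in SVG.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """One extracted character (hypothetical record, not a PDFBox type)."""
    text: str
    x: float       # left edge in page units
    y: float       # baseline in page units (assumed to grow down the page)
    width: float   # advance width in page units
    size: float    # font size in page units
    bold: bool = False
    italic: bool = False


def build_words(chars, gap_factor=0.3, line_tol=1.0):
    """Group an UNORDERED list of characters into words.

    A new word starts when the baseline changes by more than line_tol or
    when the horizontal gap to the previous character exceeds
    gap_factor * font size (an assumed threshold).
    """
    ordered = sorted(chars, key=lambda c: (c.y, c.x))
    groups, current = [], []
    for c in ordered:
        if current:
            prev = current[-1]
            same_line = abs(c.y - prev.y) <= line_tol
            gap = c.x - (prev.x + prev.width)
            if not same_line or gap > gap_factor * prev.size:
                groups.append(current)
                current = []
        current.append(c)
    if current:
        groups.append(current)
    # A word is bold/italic only if every character in it is.
    return [
        {
            "text": "".join(c.text for c in g),
            "bold": all(c.bold for c in g),
            "italic": all(c.italic for c in g),
        }
        for g in groups
    ]


if __name__ == "__main__":
    # Monospaced example: every character (and the implied space) is 6 units wide.
    shuffled = [
        Char("e", 24, 100, 6, 10),
        Char("A", 0, 100, 6, 10),
        Char("E", 18, 100, 6, 10),
        Char("a", 6, 100, 6, 10),
    ]
    print(build_words(shuffled))
```

Sorting on the raw baseline is the weak point of this sketch: subscripts, superscripts and slightly jittered baselines would need a proper line-clustering step first.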

Do scholarly publishers use this standard?

NO

(You probably guessed this.) For example I cannot get the character width out of eLife, the new Wellcome/MPI/HHMI journal. This seems to be because eLife hasn’t implemented the standard. They launched in 2012. There is no excuse for a modern publisher not being standards-compliant.

So the last posts have shown non-compliance in eLife, PeerJ, BMC. Oh, and PLoSOne also uses opaque fontFamilies (e.g. AdvP49811). So the Open Access publishers all use non-standard fonts.

Do you assume that because closed access publishers charge more, they do better?

I can’t answer that because they have more money to pay lawyers.

I’ll let you guess. Since #AMI2 is Open Source you can do it yourself.

“Licences4Europe” has not accepted “The Right to Read is the Right to Mine”

Wednesday, May 29th, 2013

One sentence summary (this link has all the documentation)

Stakeholders representing the research sector, SMEs and open access publishers withdraw from Licences for Europe

 

I have formally been a member of EC-L4E-WG4, a working group of the European Commission concentrating on Text and Data Mining (TDM, though I prefer “Content Mining”). I haven’t attended meetings (due to date clashes) but Ross Mounce has stood in for me and given brilliant presentations. The initial idea of the WG was to facilitate TDM as an added value to conventional publications and other sources. (The current problem is that copyright can be interpreted as forbidding TDM). When I and others joined this effort it was on the assumption that we would be looking for positive ways forward to encourage TDM.

When I buy a book I can do what I like with it. I can write on it (see http://en.wikipedia.org/wiki/Marginalia ). I can cut it up into bits. I can give/sell the book to someone else. I can give/sell the cut-out bits to someone else. I can stick the cut-out bits into a new book. I can transcribe the factual content. I can do almost anything other than copy non-facts.

With scholarly articles I can’t do any of this. I cannot own an article, I can only rent it. (Appalling concession #1 by Universities went completely unnoticed – I shall blog more). I cannot extract facts from it. (Even more Appalling concession #2 by Universities went completely unnoticed – I shall blog more). So the publishers have dictated to Universities that we cannot do anything with the 10,000,000,000 USD we give to the publishers each year.

The publishers are now proposing that if we want to use any of OUR content (which we have already paid for) we should pay the publishers MORE. That TDM is an “added service” provided by publishers. It’s not. I can TDM without any help from the publishers. The only thing the publishers are doing is holding us to ransom.

If you don’t feel this is unjust and counterproductive stop reading. Back to “Licences for Europe”…

The L4E group has had no chance to set the group assumptions. From the outset the chair has insisted that this group is “L4E”, licences for Europe. The default premise is that document producers can and should add additional restrictions through licences. In short – we have fought this publicly and the chair has failed to listen to us, let alone consider our arguments. Who are we?

  • The Association of European Research Libraries (LIBER)
  • The Coalition for a Digital Economy
  • European Bureau of Library Information and Documentation Associations (EBLIDA)
  • The Open Knowledge Foundation
  • Communia
  • Ubiquity Press Ltd.
  • Trans‐Atlantic Consumer Dialogue
  • National Centre for Text Mining, University of Manchester
  • European Network for Copyright in support of Education and Science (ENCES)
  • Jisc

Not a lightweight list. Here’s the formal history:

We welcomed the orientation debate by the Commission in December 2012 and the subsequent commitment to adapt the copyright framework to the digital age. We believe that any meaningful engagement on the legal framework within which data driven innovation exists must, as a point of centrality, address the issue of limitations and exceptions. Having placed licensing as the central pillar of the discussion, the “Licences for Europe” Working Group has not made this focused evaluation possible. Instead, the dialogue on limitations and exceptions is only taking place through the refracted lens of licensing. This incorrectly presupposes that additional relicensing of already licensed content (i.e. double licensing) – and by implication also licensing of the open internet– is the solution to the rapid adoption of TDM technology.

We wrote expressing our concerns (March 14) – some sentences (highlighting is mine):

10. Data driven innovation requires the lowest barriers possible to reusing content. Requiring the relicensing of copyright works one already has lawful access to for a non-competing use is entirely disproportionate, and raises strong ethical questions as it will affect what computer based medical and scientific research can and cannot be undertaken in the EU.

11. A situation where each proposed TDM based research or use of content, to which one already has lawful access, has to be submitted for approval is unscalable*, and will raise barriers to research and reduce online innovation. It will slow medical discoveries and data driven innovation inexorably, and will only serve to drive jobs, research, health and wealth-creation elsewhere.

12. For the full potential of data driven innovation to become a reality, a limitation and exception that allows text and data mining for any purposes, which cannot be over-ridden by private contracts is required in EU law.

13. Subject to point 3, we must be able to share the results of text and data mining with no hindrances irrespective of copyright laws or licensing terms to the contrary.

14. In the European information society, the right to read must be the right to mine.

(I am particularly pleased that my phrase “the right to read must be the right to mine” expresses our message succinctly.)

Unfortunately the response (http://www.libereurope.eu/sites/default/files/130316-researchers-reply-signed.pdf ) was anodyne and platitudinous (“win-win solutions for all stakeholders”). It became clear that this group could not make any useful progress and at worst would legitimize the interests of the “content owners”.

So we have withdrawn.

Having placed licensing as the central pillar of the discussion, the “Licences for Europe” Working Group has not made this focused evaluation possible. Instead, the dialogue on limitations and exceptions is only taking place through the refracted lens of licensing. This incorrectly presupposes that additional relicensing of already licensed content (i.e. double licensing) – and by implication also licensing of the open internet– is the solution to the rapid adoption of TDM technology.

Therefore, we can no longer participate in the “Licences for Europe” process. We maintain that a vibrant internet and a healthy scholarly publishing community need not be at odds with a modern copyright framework that also allows for the barrier-free extraction of facts and data. We have already expressed this view sufficiently well within the Working Group.

And we have concerns about transparency.

We would like to reiterate our request for transparency around the “Licences for Europe” dialogue and kindly request that the following actions be taken:

  • That the list of organisations participating in all of the “Licenses for Europe” Working Groups be made publicly available on the “Licences for Europe” website;
  • That the date of withdrawal for organisations leaving the process is also recorded on this list;
  • That it is made clear on any final documents that the outputs from the working group on TDM are not endorsed by our organisations and communities.

 

If you feel that we have a right to mine our information, then help us fight for it. Because inaction simply hands our rights to vested interests.

Jailbreaking the PDF – 3; Styles and fonts and the problems from Publishers.

Tuesday, May 28th, 2013

Many scientific publications use specific styling to add semantics. In converting to XML it’s critical we don’t throw these away at an early stage, yet many common tools discard such styles. #AMI2 does its best to preserve all these and I think is fairly good. There are different reasons for using styles and I give examples from OA publishers…

  • Bold – used extensively for headings and inline structuring. Note (a) the bold for the heading and (b) the start-of-line

  • Italic. Species are almost always rendered this way.

  • Monospaced. Most computer code is represented in this (abstract) font.

This should have convinced you that fonts and styles matter and should be retained. But many PDF2xxx systems discard them, especially for scholarly publications. There’s a clear standard in PDF for indicating bold and italic, and PDFBox gives a clear API for this. But many scholarly PDFs are awful (did I mention this before?). The BMC fonts don’t declare they are bold even though they are. Or italic. So we have to use heuristics. If a BMC font has “+20” after its name it’s probably bold. And “+3” means italics.
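A minimal sketch of that kind of font-name heuristic, assuming the suffix is literally appended to the font name (e.g. a made-up name such as “AdvOT46dcae81+20”). The function name `guess_style` and the exact name format are assumptions for illustration; only the “+20 is probably bold, +3 is italic” rule comes from the observation above.

```python
import re

# Suffix -> style, as observed for BMC fonts in this post. Anything else is unknown.
SUFFIX_STYLES = {
    "20": "bold",
    "3": "italic",
}


def guess_style(font_name):
    """Guess bold/italic from a trailing '+<number>' on an opaque font name.

    Returns a dict with 'bold' and 'italic' flags; both are False when the
    name carries no recognised suffix. This is a guess, not a declaration
    from the PDF, so callers should treat it as 'probably'.
    """
    match = re.search(r"\+(\d+)$", font_name)
    style = SUFFIX_STYLES.get(match.group(1)) if match else None
    return {"bold": style == "bold", "italic": style == "italic"}


if __name__ == "__main__":
    for name in ("AdvOT46dcae81+20", "AdvOT46dcae81+3", "AdvOT46dcae81"):
        print(name, guess_style(name))
```

A table like `SUFFIX_STYLES` has to be maintained per publisher (or per typesetter), which is exactly the sort of human labour the heuristics require.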

Isn’t this a fun puzzle?

No. It’s holding science back. Science should be about effective communication. If we are going to use styles rather than proper markup, let’s do it properly. Let’s tell the world it’s bold. Let’s use 65 to mean A.

There are a few cases where an “A” is not an “A”. As in http://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols

Most of these have specific mathematical meanings and uses and most have their own Unicode points. They are not letters in the normal sense of the word – they are symbols. And if they are well created and standard then they are manageable.

But now an unnecessary nuisance from PeerJ (and I’m only using Open Access publishers so I don’t get sued):

What are the blue things? They look like normal characters, but they aren’t:

<text fill="#00a6fc" svgx:fontName="MinionExp-Regular" svgx:width="299.0" x="284.784" y="162.408" font-weight="normal"></text>

<text fill="#00a6fc" svgx:fontName="MinionExp-Regular" svgx:width="299.0" x="281.486" y="162.408" font-weight="normal"></text>

They are weird codepoints, outside the Unicode range:

These two seem to be small-capital “1” and “0”. They aren’t even valid Unicode characters. Some of our browsers won’t display them:

(Note the missing characters).
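One way to flag characters like these automatically is to ask the Unicode database for each character’s general category. The sketch below is an assumption about how such a check could look, not what #AMI2 does; the actual PeerJ codepoints are not reproduced here, so the example uses U+E000, an arbitrary private-use codepoint chosen purely for illustration.

```python
import unicodedata

# General categories that mean "no agreed meaning":
# Co = private use, Cn = unassigned, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cs"}


def suspicious_chars(text):
    """Return (index, codepoint, category) for characters with no standard meaning."""
    found = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in SUSPICIOUS_CATEGORIES:
            found.append((index, ord(ch), category))
    return found


if __name__ == "__main__":
    # "ab" followed by an arbitrary private-use character.
    print(suspicious_chars("ab\ue000"))
```

A string that passes this check can still be wrong (an “l” standing in for a lambda is perfectly valid Unicode), so this only catches the crudest corruption.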

Now the DOI is for many people a critically important part of the paper! It’s critical that it is correct and re-usable. But PeerJ (which is a modern publisher and tells us how it has used modern methods to do publishing better and cheaper) seems to have deliberately used totally non-standard characters for DOIs to the extent that my browser can’t even display them. I’m open to correction – but this is barmy. (The raw PDF paper displays in Firefox, but that’s because the font is represented by glyphs rather than codepoints.) No doubt I’ll be told that it’s more important to have beautiful fonts to reduce eyestrain for humans and that corruption doesn’t matter. Most readers don’t even read the references – they simply cut and paste them.

So let’s look at the references:

Here the various components are represented in different fonts and styles. (Of course it would be better to use approaches such as BibJSON and even BibTeX, but that would make it too easy to get it right). So here we have to use fonts and styles to guess what the various bits mean. The authors are in bold, followed by a period, and a bold number gives the year. The title is in normal font and the journal in italics. More bold marks the volume number, normal the pages, and light blue the DOI.

But at least if we keep the styles then #AMI2 can hack it. Throwing away the styles makes it much harder and much more error prone.
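As a rough illustration of why the styles are worth keeping, here is a sketch that labels the parts of one reference purely from bold/italic/colour, following the layout described above. Everything in it is an assumption for the example: the run dictionaries, the `label_reference` name, the idea that each component arrives as a separate styled run, and the invented reference in the demo. Only the fill colour `#00a6fc` is taken from the extracted SVG shown earlier.

```python
DOI_FILL = "#00a6fc"  # light blue fill seen on the DOI characters


def label_reference(runs):
    """Label styled text runs of a single reference, in reading order.

    Each run is a dict with 'text' and optional 'bold', 'italic', 'fill'.
    Rules (assumed from the layout described in the post):
      blue -> doi; italic -> journal; bold number -> year, then volume;
      other bold -> authors; plain -> title before the journal, pages after.
    """
    fields = {}
    for run in runs:
        text = run["text"].strip()
        if run.get("fill", "#000000").lower() == DOI_FILL:
            key = "doi"
        elif run.get("italic"):
            key = "journal"
        elif run.get("bold"):
            if text.rstrip(".").isdigit():
                key = "year" if "year" not in fields else "volume"
            else:
                key = "authors"
        else:
            key = "title" if "journal" not in fields else "pages"
        # Consecutive runs with the same label are joined with a space.
        fields[key] = f"{fields[key]} {text}" if key in fields else text
    return fields


if __name__ == "__main__":
    invented = [
        {"text": "Smith J, Jones K.", "bold": True},
        {"text": "2012.", "bold": True},
        {"text": "An invented title."},
        {"text": "Journal of Examples", "italic": True},
        {"text": "7", "bold": True},
        {"text": ":1-10."},
        {"text": "10.0000/example", "fill": "#00A6FC"},
    ]
    print(label_reference(invented))
```

Strip the bold, italic and fill attributes from the runs and every branch collapses into “title”, which is the point being made about tools that discard styles.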

So to summarise #AMI2=PDF2SVG does things that most other systems don’t do:

  • Manages non-standard fonts (but with human labour)
  • Manages styles
  • Converts to Unicode

AMI2 can’t yet manage raw glyphs, but she will in due time. (Unless YOU wish to volunteer – it actually is a fun machine-learning project).

NOTE: If you are a large commercial publisher then your fonts are just as bad.

Jailbreaking the PDF – 2; Technical aspects (Glyph processing)

Tuesday, May 28th, 2013

A lot of our discussion in Jailbreaking related to technical issues, and this is a – hopefully readable – overview.

PDF is a page description format (does anyone use pages any more? other than publishers and letter writers?) which is designed for sighted humans. At its most basic it transmits a purely visual image of information, which may simply be a bitmap (e.g. a scanned document). That’s currently beyond our ability to automate (but we shall ultimately crack it). More usually it consists of glyphs (http://en.wikipedia.org/wiki/Glyph , the visual representation of a character). All the following are glyphs for the character “a”.

The minimum that a PDF has to do is to transmit one of these 9 chunks. It can do that by painting black dots (pixels) onto the screen. Humans can make sense of this (they get taught to read) but machines can’t. So it really helps when the publisher adds the codepoint for a character. There’s a standard for this – it’s called Unicode and everyone uses it. Correction: MOST people, but NOT scholarly publishers. Many publishers don’t include codepoints at all but transmit the image of the glyph (this is sometimes a bitmap, sometimes a set of strokes (vector/outline fonts)). Here’s a bitmap representation of the first “a”.

You can see it’s made of a few hundred pixels (squares). The computer ONLY knows these are squares. It doesn’t know they are an “a”. We shall crack this in the next few months – it’s called Optical Character Recognition (OCR) and is usually done by machine learning – we’ll pool our resources on this. Most characters in figures are probably bitmapped glyphs, but some are vectors.

In the main text characters SHOULD be represented by a codepoint – “a” is Unicode codepoint 97. (Note that “A” is different and codepoint 65 – I’ll use decimal values). So does every publisher represent “a” by 97?

Of course not. Publishers’ PDFs are awful and don’t adhere to standards. That’s a really awful problem. Moreover some publishers use 97 to mean http://en.wikipedia.org/wiki/Alpha . Why? Because in some systems there is a symbol font and it only has Greek characters and they use the same numbers.

So why don’t publishers fix this? It’s because (a) they don’t care and (b) they can extract more money from academia for fixing it. They probably have the correct codepoint in their XML but they don’t let us have this as they want to charge us extra to read it. (That’s another blog post). Because most publishers use the same typesetters these problems are endemic in the industry. Here’s an example. I’m using BioMedCentral examples because they are Open. I have high praise for BMC but not for their technical processing. (BTW I couldn’t show any of this from Closed publishers as I’d probably be sued).

How many characters are there in this? Unless you read the PDF you don’t know. The “BMC Microbiology” LOGO is actually a set of graphics strokes and there is no indication it is actually meaningful text. But I want to concentrate on the “lambda” in the title. Here is AMI2’s extracted SVG/XML (I have included the preceding “e” of “bacteriophage”)

<text stroke="none" fill="#000000" svgx:fontName="AdvOT46dcae81" svgx:width="500.0" x="182.691" y="165.703" font-size="23.305" font-weight="normal">e</text>

<text stroke="none" fill="#000000" svgx:fontName="AdvTT3f84ef53" svgx:width="0.0" x="201.703" y="165.703" font-size="23.305" font-weight="normal"> l</text>

Note there is NO explicit space. We have to work it out from the coordinates (182.7 + 0.5*23 << 201.7). But the character 108 is a “l” (ell) and so an automatic conversion system creates

This is wrong and unacceptable and potentially highly dangerous – a MU would be converted to an “EM”, so Micrograms could be converted to Milligrams.
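The space test used above (182.7 + 0.5*23 << 201.7) can be written down as a small function. This is a sketch under stated assumptions: `svgx:width` is read as thousandths of the font size (which is what the 0.5*23 arithmetic implies for a width of 500.0), and the 0.2 × font-size threshold for “a real gap” is an invented value, not one taken from AMI2.

```python
def has_space_between(prev, nxt, min_gap_em=0.2):
    """Decide whether an implicit space separates two characters on one line.

    prev and nxt are dicts with 'x', 'width' (thousandths of the font size)
    and 'font_size', mirroring the attributes of the extracted <text> elements.
    min_gap_em is an assumed threshold, as a fraction of the font size.
    """
    prev_end = prev["x"] + prev["width"] / 1000.0 * prev["font_size"]
    gap = nxt["x"] - prev_end
    return gap > min_gap_em * prev["font_size"]


if __name__ == "__main__":
    # Values copied from the two <text> elements above.
    e = {"x": 182.691, "width": 500.0, "font_size": 23.305}
    ell = {"x": 201.703, "width": 0.0, "font_size": 23.305}
    # "e" ends at about 194.34, so the gap to x=201.703 is about 7.36 units.
    print(has_space_between(e, ell))
```

Note that the second character reports a width of 0.0, so the same test could not be applied to whatever follows it; a missing width has to be estimated some other way.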

All the systems we looked at yesterday made this mistake except #AMI2. So almost all scientific content mining systems will extract incorrect information unless they can correct for this. And there are three ways of doing this:

  • Insisting publishers use Unicode. No hope in hell of that. Publishers (BMC and other OA publishers excluded) in general want to make it as hard as possible to interpret PDFs. So nonstandard PDFs are a sort of DRM. (BTW it would cost a few cents per paper to convert to Unicode – that could be afforded out of the 5500 USD they charge us).
  • Translating the glyphs into Unicode. We are going to have to do this anyway, but it will take a little while.
  • Create lookups for each font. So I have had to create a translation table for the non-standard font AdvTT3f84ef53 which AFAIK no one other than BMC uses and isn’t documented anywhere. But I will be partially automating this soon and it’s a finite if soul-destroying task.
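The third option, a per-font lookup, might look something like this. It is a sketch only: the table layout and the `to_unicode` name are assumptions, and the single entry (character 108 in AdvTT3f84ef53 standing for lambda, Unicode 955) is the one case described in this post. The real translation tables are larger and live in AMI2.

```python
# font name -> {code in the PDF: real Unicode codepoint}
# Only the mapping discussed in this post is filled in.
FONT_TABLES = {
    "AdvTT3f84ef53": {108: 955},  # "l" (ell) is really a lambda
}


def to_unicode(font_name, code):
    """Translate a character code using the font's lookup table.

    Codes from unknown fonts, or codes missing from a known font's table,
    are passed through unchanged (assumed to be honest Unicode already).
    """
    table = FONT_TABLES.get(font_name, {})
    return chr(table.get(code, code))


if __name__ == "__main__":
    print(to_unicode("AdvOT46dcae81", 101))  # ordinary text font, unchanged
    print(to_unicode("AdvTT3f84ef53", 108))  # symbol font, translated
```

Passing unknown codes through unchanged is a design choice with a cost: an untranslated symbol font silently produces plausible-looking but wrong letters, so a stricter version might flag any font whose name is not in the table.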

So AMI2 is able to get:

With the underlying representation of lambda as Unicode 955:

So AMI2 is happy to contribute her translation tables to the Open Jailbreaking community. She’d also like people to contribute, maybe through some #crowdcrafting. It’s pointless for anyone else to do this unless they want to build a standalone competitive system. Because it’s Open they can take AMI2 as long as they acknowledge it in their software. Any system that hopes to do maths is almost certainly going to have to use a translator or OCR.

So glyph processing is the first and essential part of Jailbreaking the PDF.

 

Jailbreaking the PDF; a wonderful hackathon and a community leap forward for freedom – 1

Tuesday, May 28th, 2013

Yesterday we had a truly marvellous hackathon http://scholrev.org/hackathon/ in Montpellier, in between the workshops and the main Eur Semantic Web Conference. The purpose was to bring together a number of groups who value semantic scholarship and free information from the traditional forms of publication. I’ll be blogging later about the legal constraints imposed by the publishing industry, but Jailbreaking is about the technical constraints of publishing information as PDF.

The idea of Jailbreaking was to bring together people who have developed systems, tools, protocols, communities for turning PDF into semantic form. Simply, raw PDF is almost uninterpretable, a bit like binary programs. For about 15 years the spec was not Open and it was basically a proprietary format from Adobe. The normal way of starting to make any sense of PDF content is to buy tools from companies such as Adobe, and there has been quite a lot of recent advocacy from Adobe staff to consider using PDF as a universal data format. This would be appalling – we must use structured documents for data and text and mixtures. Fortunately there are now a good number of F/OSS tools, my choice being http://pdfbox.apache.org/ and these volunteers have laboured long and hard in this primitive technology to create interpreters and libraries. PDF can be produced well, but most scholarly publishers’ PDFs are awful.

It’s a big effort to create a PDF2XML system (the end goal). I am credited with the phrase “turning a hamburger into a cow” but it’s someone else’s. If we sat down to plan PDF2XML, we’d conclude it was very daunting. But we have the modern advantage of distributed enthusiasts. Hacking PDF systems by oneself at 0200 in the morning is painful. Hacking PDFs in the company of similar people is wonderful. The first thing is that it lifts the overall burden from you. You don’t have to boil the ocean by yourself. You find that others are working on the same challenge and that’s enormously liberating. They face the same problems and often solve them in different ways or have different priorities. And that’s the first positive takeaway – I am vastly happier and more relaxed. I have friends and the many “I”s are now we. It’s the same liberating feeling as 7 years ago when we created the http://en.wikipedia.org/wiki/Blue_Obelisk community for chemistry. Jailbreaking has many of the shared values, though coming from different places.

Until recently most of the tools were closed source, usually for-money though occasionally free-as-in-beer for some uses or communities. I have learnt from bitter experience that you can never build an ongoing system on closed source components. At some stage they will either be withdrawn or there will be critical things you want to change or add and that’s simply not possible. And licensing closed source in an open project is a nightmare. It’s an anticommons. So, regretfully, I shall not include Utopia/pdfx from Manchester in my further discussion because I can’t make any use of it. Some people use its output, and that’s fine – but I would/might want to use some of its libraries.

There was a wonderful coming-together of people with open systems. None of us had the whole picture, but together we covered all of it. Not “my program is better than your program”, but “our tools are better than my system”. So here is a brief overview of the open players who came together (I may miss some individuals, please comment if I have done you an injustice). I’ll explain the technical bits in a later post – here I am discussing the social aspects.

  • LA-PDFText (http://code.google.com/p/lapdftext/ Gully Burns). Gully was in Los Angeles – in the middle of the night and showed great stamina :-) In true hacking spirit I used the time to find out about Gully’s system. I downloaded it and couldn’t get it to install (needed java-6). So Gully repackaged it, and within two iterations (an hour) I had it working. That would have taken days conventionally. LA-PDFText is particularly good at discovering blocks (more sophisticated than #AMI2) so maybe I can use it in my work rather than competing.
  • CERMINE http://sciencesoft.web.cern.ch/node/120 . I’ve already blogged this but here we had the lead Dominika Tkaczyk live from Poland. I take comfort from her presence and vice versa. CERMINE integrates text better than #AMI2 at present and has a nice web service.
  • Florida State University. Alexander Garcia, Casey McLaughlin, Leyla Jael Garcia Castro, Biotea (http://biotea.idiginfo.org/ ) Greg Riccardi and colleagues. They are working on suicide in the context of Veterans’ admin documents and provided us with an Open corpus of many hundred PDFs. (Some were good, some were really awful). Alex and Casey ran the workshop with great energy, preparation, food, beer, etc. and arranging the great support from the ABES site.
  • #crowdcrafting. It will become clear that human involvement is necessary in parts of the PDF2XML process: validating our processes, and also possibly tweaking final outputs. We connected to Daniel Lombraña González of http://crowdcrafting.org/ who took us through the process of building a distributed volunteer community. There was a lot of interest and we shall be designing clear crowdcrafting-friendly tasks (e.g. “draw a rectangle round the title”, “highlight corrupted characters”, “how many references are there”, etc.)

  • CITALO http://wit.istc.cnr.it:8080/tools/citalo. This system deduces the type of the citation (reference) from textual analysis. This is a very good example of a downstream application which depends on the XML but is largely independent of how it is created.
  • #AMI2. Our AMI2 system is complementary to many of the others – I am very happy for others to do citation typing, or match keywords. AMI2 has several unique features (I’ll explain later), including character identification, graphics (graphics are not images) extraction, image extraction, sub and superscripts, bold and italic. (Most of the other systems ignore graphics completely and many also ignore bold/italic)

So we have a wonderful synthesis of people and projects and tools. We all want to collaborate and are all happy to put community success as the goal, not individual competition. (And the exciting thing is that it’s publishable and will be heavily cited. We have shown this in the Blue Obelisk publications where the first has 300 citations and I’d predict that a coherent Jailbreaking publication would be of great interest.)

So yesterday was a turning point. We have clear trajectories. We have to work to make sure we develop rapidly and efficiently. But we can do this initially in a loose collaboration, and planning meetings and bringing in other collaborators and funding.

So if you are interested in an Open approach to making PDFs Open and semantic, let us know in the comments.

 

SePublica : Overview of my Polemics presentation #scholrev

Saturday, May 25th, 2013

This is a list of the points I want to cover when introducing the session on Polemics. A list looks a bit dry but I promise to be polemical. And try to show some demos at the end. The polemics are constructive in that I shall suggest how we can change the #scholpub world by building a better one than the current one.

NOTE: Do not be overwhelmed by the scale of this. Together we can do it.

It is critical we act now

  • Semantics/Mining is now seen as an opportunity by some publishers to “add value” by building walled gardens.
  • Increasing attempts to convince authors to use CC-NC.
  • We must develop semantic resources ahead of this and push the edges

One person can change the world

We must create a coherent community

  • Examples:
    • Open Streetmap,
    • Wikipedia
    • Galaxyzoo
    • OKFN Crowdcrafting,
    • Blue Obelisk (Chemistry – PMR),
    • ?#scholrev

Visions

  • Give power to authors
  • Discover, aggregate and search (“Google for science”)
  • Make the literature computable
  • Enhance readers with semantic aids
  • Smart “invisible” capture of information

Practice before Politics

  • Create compelling examples
  • Add Value
  • Make authors’ lives easier
  • Mine and semanticize current scholarship.

Text Tables Diagrams Data

  • Text (chemistry, species)
  • Tables (Jailbreak corpus)
  • Diagrams (chemical spectra, phylogenetic trees)
  • Data (output). Quixote

Material to start with

  • Open information (EuropePMC, theses)
  • “data not copyrightable”. Supp data, tables, data-rich diagrams
  • Push the limits of what’s allowed (forgiveness not permission)

Disciplines/artefacts with good effort/return ratio

  • Phylogenetic trees (Ross Mounce + PMR)
  • Nucleic acid sequences
  • Chemical formulae and reactions
  • Regressions and models
  • Clinical/human studies (tables)
  • Dose-response curves

Tools, services, resources

    We need a single-stop location for tools

  • Research-enhancing tools (science equiv of Git/Mercurial). Capture and validate work continuously
  • Common approach to authoring
  • Crawling tools for articles, theses.
  • PDF and Word converters to “XML”
  • Classifiers
  • NLP tools and examples
  • Table hackers
  • Diagram hackers
  • Logfile hackers
  • Semantic repositories
  • Abbreviations and glossaries
  • Dictionaries and dictionary builders

     

Advocacy, helpers, allies

  • Bodies who may be interested (speculative, I haven’t asked them):
    • Funders of science
    • major Open publishers
    • Funders of social change (Mellon, Sloan, OSF…)
    • SPARC, DOAJ, etc.
    • (Europe)PMC
  • Crowdcrafting (OKF, am involved with this)
  • Wikipedia

 

 

 

 

SePublica: Making the scholarly literature semantic and reusable

Friday, May 24th, 2013

Scholarly literature has been virtually untouched by the digital revolution in this century. The primary communication is by digital copies of paper (PDFs) and there is little sign that it has brought any change in social structures either in Universities/Research_Establishments or in the publishing industry. The bulk of this industry comprises two sectors, commercial publishing and learned societies. The innovations have been largely restricted to Open Access publishing (pioneered by BMC and then by PLoS) and the megajournal (PLoSOne).

I shall generalise, and exempt a few players from criticism: The Open Access publishers above with smaller ones such as eLife, PeerJ, MDPI, Ubiquity, etc. And a few learned societies (the International Union of Crystallography and the European Geosciences Union, and please let me have more). But in general the traditional publishers (all those not exempted) are a serious part of the problem and cannot now be part of the solution.

That’s a strong statement. But over the last ten years it has been clear that publishing should change, and it hasn’t. The mainstream publishers have put energy into stopping information being disseminated and creating restrictions on how it can be used. Elsevier (documented on this list) has prevented me extracting semantic information from “their” content.

The market is broken because the primary impetus to publish is increasingly driven by academic recognition rather than a desire to communicate. And this makes it impossible for publishers to act as partners in the process of creating semantics. I hear that one large publisher has now built a walled garden for content mining – you have to pay to access it and undoubtedly there are stringent conditions on its re-use. This isn’t semantic progress, it’s digital neo-colonialism.

I believe that semantics arises out of community practice of the discipline. On Saturday the OKFN is having an economics hackathon (Metametrik) in London where we are taking five papers and aiming to build a semantic model. It might be in RDF, it might be in XML; the overriding principle is that it must be Open, developed in a community process.

And in most disciplines this is actively resisted by the publishing community. When Wikipedia started to use Chemical Abstracts (ACS) identifiers the ACS threatened Wikipedia with legal action. They backed down under community pressure. But this is no way to do semantic development. It can only lead to centralised control of information. Sometimes top-down semantic development is valuable (probably essential in heavily regulated fields) but it is slow, often arbitrary and often badly engineered.

We need the freedom to use the current literature and current data as our guide to creating semantics. What authors write is, in part, what they want to communicate (although the restriction to “10 pages” is often absurd and destroys clarity and innovation). The human language contains implicit semantics, which are often much more complex than that. So Metametrik will formalize the semantics of (a subset of) economic models, many of which are based on OLS (ordinary least squares). Here’s part of a typical table reporting results. It’s data so I am not asking permission to reproduce it. [It's an appalling reflection on the publication process that I should even have to, though many people are more frightened of copyright than of doing incomplete science.]

 

And the legend:

How do we represent this table semantically? We have to identify its structure, and the individual components. The components are, for the most part, well annotated in a large metadata table. (And BTW metadata is essential for reporting facts so I hope no one argues that it’s copyrightable. If they do, then scientific data in C21 is effectively paralysed.)

That’s good metadata for 2001 when the paper was published. Today, however, we immediately feel the frustration of not linking *instantly* to Gallup and Sachs, or La Porta. And we seethe with rage if we find that they are paywalled. That is scholarly vandalism – preventing the proper interpretation of scholarship.

We then need a framework for representing the data items. Real (FP) numbers, with errors and units. There doesn’t seem to be a clear ontology/markup for this, so we may have to reuse from elsewhere. We have done this in Chemical Markup Language (its STMML subset) which is fully capable of holding everything in the table. But there may be other solutions – please tell us.

But the key point is that the “Table” is not a table. It’s a list of regression results where the list runs across the page. Effectively it’s regression 1, … regression 11. So a List is probably more suitable than a table. I shall have a hack at making this fully semantic and recomputable.
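A minimal sketch of the list-of-regressions idea, written in Python rather than CML/STMML. Everything here is hypothetical: the class names (`Quantity`, `Regression`), the variable names (`y`, `x1`, `x2`) and the numbers are invented to show the shape of the data and are not taken from the paper's table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Quantity:
    """A real number with an optional standard error and optional units."""
    value: float
    error: Optional[float] = None
    units: Optional[str] = None


@dataclass
class Regression:
    """One column of the printed 'table': a single regression result."""
    label: str
    dependent: str
    coefficients: Dict[str, Quantity] = field(default_factory=dict)
    n_obs: Optional[int] = None


def variables_used(regressions: List[Regression]) -> List[str]:
    """Return every explanatory variable that appears in any regression."""
    names = set()
    for reg in regressions:
        names.update(reg.coefficients)
    return sorted(names)


# Invented values, purely illustrative (not from the paper).
results = [
    Regression("regression 1", "y", {"x1": Quantity(1.5, 0.4)}, n_obs=60),
    Regression("regression 2", "y",
               {"x1": Quantity(1.1, 0.5), "x2": Quantity(-0.9, 0.3)}, n_obs=58),
]
```

Holding the results as a list means each regression carries its own variables, so a coefficient missing from one regression is simply absent rather than an empty cell.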

And at the same time seeing if AMI2 can actually read the table from the PDF.

I think this is a major way of kickstarting semantic scholarship – reading the existing literature and building re-usables from it. Let’s call it “Reusable scholarship”.

 

 

SePublica: What we must do to promote Semantics #scholrev #btpdf2

Thursday, May 23rd, 2013

In the previous post (http://blogs.ch.cam.ac.uk/pmr/2013/05/23/sepublica-how-semantics-can-empower-us-scholrev-scholpub-btpdf2/) I outlined some of the reasons why semantics are so important. Here I want to show what we have to do (and again stick with me – although you might disagree with my stance).

The absolute essentials are:

  • We have to be a community.
  • We have to identify things that can be described and on which we are prepared to agree.
  • We have to describe them
  • We have to name them
  • We have to be able to find them (addressing)

Here Lewis Carroll, a master of semantics, shows the basics:

And she went on planning to herself how she would manage it. `They must go by the carrier,’ she thought; `and how funny it’ll seem, sending presents to one’s own feet! And how odd the directions will look!

ALICE’S RIGHT FOOT, ESQ.

HEARTHRUG,

NEAR THE FENDER,

(WITH ALICE’S LOVE).

 

Oh dear, what nonsense I’m talking!’

Alice identifies her foot as a foot, and gives it a unique identifier RIGHT FOOT. The address consists of another unique identifier (HEARTHRUG) and annotates it (NEAR THE FENDER). There’s something fundamental about this. (How many children have annotated their books with “Jane Doe, 123 Some Road, This Town, That City, Country, Continent, Earth, Solar System, Universe”?) Hierarchies seem fundamental to humans. Anything else is much more difficult. (Peter Buneman and I have been bouncing this idea about). I am sure we have to use hierarchies to promote these ideas to newcomers.

Things get unique identifiers. They can be at different levels. Single instances such as Alice’s left foot.

But there are also whole classes – the class of left feet. I have a left foot. It’s distinct from Alice’s. And we need unique names for these classes, such as “left foot”. Generally all humans have one (but see http://en.wikipedia.org/wiki/The_Man_with_Two_Left_Feet ). And we can start making rules, see http://human-phenotype-ontology.org/contao/index.php/hpo_docu.html.

At the moment, all relationships in the Human Phenotype Ontology are is_a relationships,  i.e.  a simple class-subclass relationships. For instance, Abnormality of the feet is_a Abnormality of the lower limbs. The relationships are transitive, meaning that they are inherited up all paths to the root. For instance,
Abnormality of the lower limbs is_a Abnormality of the extremities, and thus Abnormality of the feet also is Abnormality of the extremities.
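The transitivity described in that passage can be sketched in a few lines of Python. This is an illustration only, not how the Human Phenotype Ontology is actually stored or queried; the dictionary holds just the two is_a links quoted above, and the function names are my own.

```python
from typing import Dict, List, Set

# Direct is_a links (child -> parents), from the two examples quoted above.
IS_A: Dict[str, List[str]] = {
    "Abnormality of the feet": ["Abnormality of the lower limbs"],
    "Abnormality of the lower limbs": ["Abnormality of the extremities"],
}


def ancestors(term: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Collect every class reachable from `term` by following is_a links."""
    found: Set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, []))
    return found


def is_a_transitive(child: str, parent: str, is_a: Dict[str, List[str]]) -> bool:
    """True if `child` is_a `parent` directly or through intermediate classes."""
    return parent in ancestors(child, is_a)
```

Asking `is_a_transitive("Abnormality of the feet", "Abnormality of the extremities", IS_A)` follows both links and succeeds, while the reverse question fails, because is_a only runs from subclass to superclass.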

We see a terminology appearing. Some would call this an ontology, others would dispute this. I tend to use the concept of “dictionary” fuzzed across language and computability.

This is where the difficulties start. On the one hand this is very valuable – if a disease affects the extremities, then it might affect the left foot. But it’s also where people’s eyes glaze over. Ontology language is formal and does not come naturally to many of us. And when it’s applied like a syllogism:

  • All men are mortal
  • Socrates is a man
  • Therefore Socrates is mortal

Many people think – so what? – we knew that already. On the other hand it’s quite difficult to translate this into machine language (even after realising that “men” is “mans”, the plural). The symbology is frightening (with upside down A’s and backwards E’s). Here are fundamental concepts in a type system: http://stackoverflow.com/questions/12532552/what-part-of-milner-hindley-do-you-not-understand :

The discussion on Stack Overflow includes:

  • “Actually, HM is surprisingly simple–far simpler than I thought it would be. That’s one of the reasons it’s so magical”
  • “The 6 rules are very easy. Var rule is rather trivial rule – it says that if type for identifier is already present in your type environment, then to infer the type you just take it from the environment as is.” (PMR is still struggling with the explanation.)
  • “This syntax, while it may look complicated, is actually fairly simple. The basic idea comes from logic: the whole expression is an implication with the top half being the assumptions and the bottom half being the result. That is, if you know that the top expressions are true, you can conclude that the bottom expressions are true as well.”
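As an illustration of that top-over-bottom notation, the Var rule mentioned in the quotes is usually written as below. This is the standard textbook form rather than a copy from the Stack Overflow thread; I am assuming the usual symbols, with Γ for the type environment, x for an identifier and σ for its type.

```latex
\frac{x : \sigma \in \Gamma}{\Gamma \vdash x : \sigma}\;[\mathrm{Var}]
```

Read from top to bottom: if the environment Γ already records that x has type σ, then we may conclude that, under Γ, x has type σ.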

The problem is language and symbology. If you haven’t been trained in the language it’s often impenetrable. Take music, for example. If you haven’t been trained in it, it makes little sense, and it takes considerable time to learn:

So if we want to get a lot of people involved we have to be very careful about exposing newcomers to formal semantics. I avoid words like ontology, quantifier, predicate, disjunction, because people already have to be convinced they are worth learning.

Humans want to learn music not because they’ve seen written music but because they’ve heard music. Similarly we have to sell semantics by what it does, rather than what it is. And we cannot show what it does without building systems, any more than we are motivated to learn about pianos until we have seen and heard one.

The problem is that it’s a lot of effort to build a semantic system and that there is not necessarily a clear reward. The initial work, as always, was in computer science which showed – on paper – what could be possible but didn’t leave anything that ordinary people can pick up on. This is very common – before the WWW there was a whole decade or more of publications in “hypermedia” but much of this was only read by people working in the field. And often the major reason for working in a new field is to get academic publications, not to create something useful to the world. There often seems to be a lag of twenty years and indeed that’s happening in semantics.

So it’s very difficult to get public funding to build something that’s useful and works. One effect is that the systems are built by companies. That’s not necessarily a bad thing – railways and telephones came from private enterprise. But there are problems with the digital age and we see this with modern phones – they can become monopolies which constrain our freedom. We buy them to communicate but we didn’t buy them to report our location to unknown vested interests.

And semantics have the same problem. The people who control our semantics will control our lives. Because semantics constrain the formal language we use and that may constrain the natural language. We humans may not yet be in danger of Orwell’s Newspeak but our machines will be. And therefore we have to assert rights to have a say over our machines’ semantics.

That raises the other problem – semantic Babel. If everyone creates their own semantics no-one can talk (we already see this with phone apps). I live in the semantic Babel of machine-chemistry – every company creates a different approach. Result – chemistry is 20 years behind bioscience where there is a communal vision of interoperable semantics.

So I think the major task for SePublica is to devise a strategy for bottom-up Open semantics. That’s what Gene Ontology did for bioscience. We need to identify the common tools and the common sources of semantic material. And it will be slow – it took crystallography 15 years to create their dictionaries and system and although we are speeding up we’ll need several years even when the community is well inclined. (That’s what we are starting to do in computational chemistry – the easiest semantic area of any discipline). It has to be Open, and we have to convince important players (stakeholders) that it matters to them. Each area will be different. But here are some components that are likely to be common to almost all fields:

  • Tools for creating and maintaining dictionaries
  • Ways to extract information from raw sources (articles, papers, etc.) – that’s why we are Jailbreaking the PDF.
  • Getting authorities involved (but this is increasingly hard as the learned societies are often our problem, not the solution)
  • Tools to build and encourage communities
  • Demonstrators and evangelists
  • Stores for our semantic resources
  • Working with funders

We won’t get all of that done at SePublica. But we can make a lot of progress.

 

 

 

 

 

SePublica: How semantics can empower us; #scholrev #scholpub #btpdf2

Thursday, May 23rd, 2013

I’m writing blog posts to collect my thoughts for the wonderful workshop at SePublica http://sepublica.mywikipaper.org/drupal/ where I am leading off the day. [This also acts as a permanent record instead of slides. Indeed I may not provide slides as such as I often create the talk as I present it.] My working title is

Why and how can we make Scholarship Semantic?

[If you switch off at "Semantics" trust me and keep reading… There's a lot here about changing the world.]

Why should we strive to create a semantic web/world? I “got it” when I heard TimBL in 1994. Many people have “got it”. There are startups based on creating and deploying semantic technology. My colleague Nico Adams (who understands much more about the practice of semantics than me) has a vision of creating a reasoning engine for science (he’s applied this to polymers, biotechnology, chemistry). I completely buy his vision.

But it’s hard to sell this to people who don’t understand. Any more than TimBL could sell SGML in 1990. (Yes there were whole industries who bought into SGML, but most didn’t). So what TimBL did was to build a system that worked (The WWW). And this often seems to be the requirement for Semantic Web projects. Build it and show it working.

SePublica will probably be attended by the converted. I don’t think I have to convince them of the value of semantics. But I do have to catalyse:

  • The creation of convincing demonstrators (examples that work)
  • Arguments for why we need semantics and what it can do.

So why are semantics important for scholarly publishing? The following arguments will hopefully convince some people:

  • They unlock the value of the stuff already being published. There is a great deal in a single PDF (article or thesis) that is useful. Diagrams and tables are raw exciting resources. Mathematical equations. Chemical structures. Even using what we have today converted into semantic form would add billions.
  • They make information and knowledge available to a wider range of people. If I read a paper with a term I don’t know then semantic annotation may make it immediately understandable. What’s rhinovirus? It’s not a virus of rhinoceroses – it’s the common cold. That makes it accessible to many more people (if the publishers allow it).
  • They highlight errors and inconsistencies. Ranging from spelling errors to bad or missing units to incorrect values to stuff which doesn’t agree with previous knowledge. And machines can do much of this. We cannot have reproducible science until we have semantics.
  • They allow the literature to be computed. Many of the semantics define objects (such as molecules or phylogenetic trees) which are recomputable. Does the use of newer methods give the same answer?
  • They allow the literature to be aggregated. This is one of the most obvious benefits. If I want all phylogenetic trees, I need semantics – I don’t want shoe-trees or B-trees or beech trees. And many of these concepts are not in Google’s public face (I am sure they have huge semantics internally)
  • They allow the material to be searched. How many chemists use halogenated solvents? (The word halogen will not occur in the paper.) With semantics this is a relatively easy thing to do. Can you find second-order differential equations? Or Fourier series? Or triclinic crystals? (The words won’t help.) AMI2 will be able to.
  • They allow the material to be linked into more complex concepts. By creating a database of species, a database of geolocations and links between them we start to generate an index of biodiversity. What species have been reported when and where? This can be used for longitudinal analyses – is X increasing/decreasing with time? Where is Y now being reported for the first time?
  • They allow humans to link up. If A is working on Puffinus puffinus (no, it’s not a Puffin, that’s Fratercula arctica) in the northern hemisphere and B is working on Puffinus tenuirostris in Port Fairy Victoria AU then a shared knowledgebase will help to bring the humans together. And that happens between subjects – microscopy can link with molecular biology with climate with chemistry.
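To make the halogenated-solvent example concrete, here is a toy sketch of dictionary-based search in Python. It is not how AMI2 works; the mini-dictionary `SOLVENT_CLASSES`, its class labels and the test sentence are all invented for illustration.

```python
import re
from typing import Dict, List, Set

# Hypothetical mini-dictionary: solvent name -> semantic classes.
SOLVENT_CLASSES: Dict[str, Set[str]] = {
    "dichloromethane": {"solvent", "halogenated"},
    "chloroform": {"solvent", "halogenated"},
    "ethanol": {"solvent", "alcohol"},
    "toluene": {"solvent", "aromatic"},
}


def find_terms(text: str, wanted_class: str) -> List[str]:
    """Return dictionary terms of the wanted class that occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        term for term, classes in SOLVENT_CLASSES.items()
        if wanted_class in classes and term in words
    )


# Invented sentence: note that the word "halogen" never appears in it.
sentence = "The residue was dissolved in dichloromethane and washed with ethanol."
hits = find_terms(sentence, "halogenated")
```

The search succeeds because the class membership lives in the dictionary, not in the article text, which is why a plain keyword search for "halogen" would miss it.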

In simple terms semantics allow smart humans to develop communal resources to develop new ideas faster, smarter and better.

Please add other ideas! I am sure I have missed some.

 

#scholrev #ami2 #btpdf2 Jailbreaking content (including tables) from PDFs

Thursday, May 23rd, 2013

We’ve got a splendid collection of about 600 Open PDFs for our jailbreak hackathon. They seem to have a medical focus. They are of very variable type and quality. Some are reports, guidelines, some academic papers. Some are born digital but at least one is scanned OCR where the image and the text are superposed. (BTW I am taking it on trust that the papers are Open – some are from closed access publishers and carry their copyright. It’s time we started marking papers as Open ON THE PAPER.)

I have given these to #AMI2 – she processes a paper in about 10 secs on my laptop so it’s just over an hour for the whole lot. That gives me a chance to blog some more. In rev63 AMI was able to do tables so here, without any real selection, I’m giving some examples. (Note that some tables are not recognised as such – especially when the authors don’t use the word “table”. But we shall hack those in time…). Also, as HTML doesn’t seem to have a tableFooter that manages the footnotes, I have temporarily added this to the caption as a separate paragraph.
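The caption-plus-footnote arrangement can be sketched like this. It is a hypothetical Python illustration of the output shape only, not AMI2's code; the function name `table_to_html` and its parameters are my own, and the sample row is taken from the first example table in this post.

```python
from html import escape
from typing import List


def table_to_html(caption: str, footnote: str,
                  header: List[str], rows: List[List[str]]) -> str:
    """Build a plain HTML table; the footnote rides in the caption as its own paragraph."""
    parts = ["<table>", "<caption>", f"<p>{escape(caption)}</p>"]
    if footnote:
        parts.append(f"<p>{escape(footnote)}</p>")
    parts.append("</caption>")
    parts.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>")
    for row in rows:
        parts.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


page = table_to_html(
    "Table 1. Scores achieved by 151 Croatian war veterans",
    "*Abbreviations: L – rigidity in respondents’ approach to the test material",
    ["Questionnaire", "Score (mean ± standard deviation)", "Cut-off score"],
    [["M-PTSD", "122.1 ± 22.9", "107"]],
)
```

Escaping every cell matters for tables like the second example, where footnotes contain text such as "P < 0.001".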

From Croat Med J. 2007;48:133-9:

The table in the PDF

 

AMI’s translation to HTML:

Table 1. Scores achieved by 151 Croatian war veterans diagnosed with posttraumatic stress disorder on the Questionnaire on Traumatic Combat and War Experiences (USTBI-M), Mississippi Scale for Combat-Related Post-Traumatic Stress Disorder (M-PTSD), and Minnesota Multiphasic Personality Inventory (MMPI)-201 (presented as T values)

*Abbreviations: L – rigidity in respondents’ approach to the test material; F – lack of understanding of the material; K – tendency to provide socially acceptable answers.

 

Questionnaire       | Score (mean ± standard deviation) | Cut-off score
--------------------|-----------------------------------|--------------
USTBI-M             | 77.8 ± 14.3                       | Maximum: 120
M-PTSD              | 122.1 ± 22.9                      | 107
MMPI-201 scales*    |                                   |
L                   | 51.1 ± 2.0                        | 70
F                   | 73.2 ± 6.3                        | 70
K                   | 42.4 ± 3.2                        | 70
                    | 87.6 ± 5.1                        | 70
                    | 96.7 ± 6.6                        | 70
                    | 88.2 ± 4.7                        | 70
                    | 67.3 ± 4.8                        | 70
                    | 79.3 ± 5.8                        | 70
Pt ( psychastenia ) | 75.4 ± 5.7                        | 70
                    | 72.1 ± 7.4                        | 70
                    | 52.3 ± 2.6                        | 70

 

COMMENT: Some of the row labels/headings are omitted, but I think that can be solved. (Remember this is AMI’s first attempt so we call it alpha.)

Here’s another:

And what AMI translates it to

Table 2 The comparison of quality of life among study groups using analysis of variance and post-hoc tests

*Group-by-group comparisons that were significant at the level of P < 0.001 performed using LSD (homogenous variance; used for physical and overall quality of life) or Dunnet T3 (unhomogenous variance; all other questions). The significance was set at P < 0.001 in post-hoc test in order to reduce the increased chances of false positive results.

QOL dimension/status                     | Groups         | N   | Mean ± SD      | F; P    | Post-hoc differences*
-----------------------------------------|----------------|-----|----------------|---------|--------------------------
Physical                                 | PTSD + LBP (I) | 79  | 75.44 ± 11.33  |         |
                                         | PTSD (II)      | 56  | 78.43 ± 11.54  | 49.18;  | I-III, I-IV, II-III,
                                         | LBP (III)      | 84  | 87.43 ± 13.84  | < 0.001 | II-IV, III-IV
                                         | Controls (IV)  | 134 | 94.42 ± 11.65  |         |
                                         | Total          | 353 | 85.97 ± 14.40  |         |
Psychological                            | PTSD + LBP (I) | 76  | 63.74 ± 14.60  |         |
                                         | PTSD (II)      | 58  | 67.45 ± 15.92  | 79.05;  | I-III, I-IV, II-III,
                                         | LBP (III)      | 90  | 80.27 ± 14.59  | < 0.001 | II-IV, III-IV
                                         | Controls (IV)  | 132 | 90.67 ± 10.76  |         |
                                         | Total          | 356 | 78.51 ± 17.44  |         |
Social                                   | PTSD + LBP (I) | 80  | 33.40 ± 8.89   |         |
                                         | PTSD (II)      | 58  | 35.93 ± 9.98   | 70.19;  | I-III, I-IV, II-III,
                                         | LBP (III)      | 91  | 41.58 ± 8.78   | < 0.001 | II-IV, III-IV
                                         | Controls (IV)  | 134 | 49.22 ± 7.13   |         |
                                         | Total          | 363 | 41.70 ± 10.6   |         |
Enviromental                             | PTSD + LBP (I) | 79  | 92.81 ± 20.78  |         |
                                         | PTSD (II)      | 58  | 100.76 ± 19.79 | 66.27;  | I-III, I-IV, II-IV,
                                         | LBP (III)      | 88  | 108.36 ± 17.71 | < 0.001 | III-IV
                                         | Controls (IV)  | 130 | 126.06 ± 14.27 |         |
                                         | Total          | 355 | 110.14 ± 22.02 |         |
Satisfaction with personal health status | PTSD + LBP (I) | 80  | 1.84 ± 0.74    |         |
                                         | PTSD (II)      | 59  | 2.36 ± 0.85    | 127.48; | I-II, I-III, I-IV, II-IV,
                                         | LBP (III)      | 95  | 2.70 ± 0.98    | < 0.001 | III-IV
                                         | Controls (IV)  | 135 | 4.03 ± 0.85    |         |
                                         | Total          | 369 | 2.94 ± 1.23    |         |
Overall self-reported quality of life    | PTSD + LBP (I) | 73  | 2.82 ± 1.14    |         |
                                         | PTSD (II)      | 49  | 3.29 ± 1.28    | 24.04;  | I-II, I-III, I-IV, II-III,
                                         | LBP (III)      | 75  | 4.04 ± 1.25    | < 0.001 | II-IV
                                         | Controls (IV)  | 42  | 4.48 ± 0.80    |         |
                                         | Total          | 239 | 3.59 ± 1.31    |         |

  

 

I think she’s got it completely right (the typos “Enviromental” and “Unhomogenous” are visible in the PDF).
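One cheap machine check on an extracted table like this is whether the group counts add up to the reported totals. The sketch below is my own illustration in Python, not part of AMI2; the N values are copied from Table 2 above and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Group Ns for (I, II, III, IV) and the reported Total N, copied from Table 2.
TABLE2_N: Dict[str, Tuple[List[int], int]] = {
    "Physical": ([79, 56, 84, 134], 353),
    "Psychological": ([76, 58, 90, 132], 356),
    "Social": ([80, 58, 91, 134], 363),
    "Enviromental": ([79, 58, 88, 130], 355),
    "Satisfaction with personal health status": ([80, 59, 95, 135], 369),
    "Overall self-reported quality of life": ([73, 49, 75, 42], 239),
}


def inconsistent_rows(table: Dict[str, Tuple[List[int], int]]) -> List[str]:
    """Return the dimensions whose group Ns do not add up to the reported Total."""
    return [name for name, (groups, total) in table.items() if sum(groups) != total]
```

For this table `inconsistent_rows(TABLE2_N)` comes back empty, so at least the N column was read consistently. A misread digit in any count would show up as a named row.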

AFAIK there is no automatic Open extractor of tables so we are very happy to contribute this to the public pool.