Unilever Centre for Molecular Informatics
 

petermr's blog

A Scientist and the Web

 

#ignorantchemist Typographical amusement #ami2

April 30th, 2013

We are doing well at reconstructing semantic material from PDFs (#AMI2) but the challenges we are thrown are considerable. Here’s today’s amusement:

#AMI2 can reconstruct most of this perfectly, but she doesn’t know what to do with a hyphenated-subscript. Nor do I, but I’m just an ignorant chemist. The publishing industry tells us that they need our money to produce beautiful easily readable typeset documents. So here’s an example of human readability from the same paper:

#AMI2 can read this, but can you? Wouldn’t it be easier to typeset it as equations? But that would take up an awful lot of space, and as we know journals have to reduce the space (I never understand why).

I have a plane journey so AMI and I can do some real hacking. We hope to release an alpha version RSN.

#openaccess: American Chemical Society charge additional 1000 USD for Creative Commons Licences

April 25th, 2013

From the start of this month all RCUK-funded researchers will have to publish “Open Access”. Exactly what this means has been the subject of a messy set of polemics. But on the assumption that authors wish to publish under a CC-BY licence (effectively the only one compliant with the BOAI declaration – free to copy, use, re-use and redistribute), then are they able to?

I’ve taken a prominent journal – Journal of the American Chemical Society – in which I have previously published. Can I publish “Open Access” and comply with the RCUK requirements?

There’s a useful tool http://www.sherpa.ac.uk/fact/

Many publishers have been extremely poor at providing simple information for readers and authors. Often you have to chase round the buttons on the site (avoiding the (self-)advertising). Sometimes I get the impression that the publishers aren’t really trying to be helpful. Ross Mounce has done a great job on trying to winkle out licence and prices info and SHERPA have now done much of the grunt work in providing the right button to click and systematizing this as well. So I can go straight to the key info:

What’s “Author Choice”? It’s ACS-specific and it’s some form of “Open Access” (according to the ACS). Many of these publisher-specific labels ((Author|Reader|Free|Open)(Access|Choice|Article)) have fuzzy words and fuzzy conditions.

But we have Creative Commons (and without CC we would be in an awful mess). CC provide a range of licences. ONLY CC-BY (CC0, and possibly CC-BY-SA) fit the BOAI definition of open access. Only CC-BY allows copying, re-use and redistribution.

Which, simply, is what Science is about.

Any restriction of access or re-use is anti-scientific.

It may be good business, but it harms science.

So it is possible to use a CC-BY licence when publishing with the ACS. But ONLY by paying an extra 1000 USD.

Does it COST this much to add a CC-BY licence?

Of course not. It shouldn’t cost anything (it’s a standard 50 characters on a page and a hyperlink).

It’s effectively a ransom from the publisher to raise extra revenue. The publishers can make up any set of charges they like. And the authors will either pay it or hide their publication behind an embargo-wall (say for 1-2 years).

Is this good for science? Of course not. It makes it harder to detect bad science. Humans and machines can validate or invalidate science if they are allowed to read the full text.

Very few publishers have earned respect during the evolution of Open Access. Most have been seen to value commerce above other considerations. There is no price pressure on OA.

And many “open access advocates” have actually welcomed non-CC-BY and embargoed green OA – which has led us to these huge APCs for BOAI Open Access.

To fight this we need strength from the funders and unanimity of purpose.

And we have this and it’s the primary redeeming feature in Open Access.

We need tools for uniform practice – what does a publisher offer? And we are getting them (kudos in UK to JISC, SHERPA, and Ross) and they are cutting through the fuzz.

We need tools for measuring author compliance. Because many authors simply don’t care about the funders’ requirements and will still publish in a completely closed manner so as to advance their careers and funding prospects. And we are getting them.

The organizations that have let us down are the Universities and their libraries. They don’t really care. They could have fought this battle 10 years ago instead of waiting for the funders to do it. They accept whatever prices the publishers charge for OA APCs and route tax-payer money or student fees to the publishers…

But that’s another blog post. Soon…

Update: The struggle continues… #ami2 would like alpha testers

April 25th, 2013

A quick update. I’ve been spending most of my time on #ami2 which is now at raw alpha (see below). Other items of note include:

  • Mendeley is now owned by Elsevier. I shall blog this. If you care about Open scholarship you have to be seriously concerned.
  • Open Data Workshop (http://blog.okfn.org/2013/02/27/open-data-on-the-web-workshop-april-2013/, http://www.w3.org/2013/04/odw/ ). Really exciting to see the concentration of interest. There was a pre-workshop evening run by OKFN – lightning talks (I gave a short one (3-4 mins) on #ami2 and the problems of scientific data). Many international visitors came.
  • Ross and Avril got married (@rmounce) – their 2nd of 3 weddings. Great occasion – thanks all.
  • Went to talk by Glyn Moody on Copyright.
  • Meeting by JISC/Cameron on tools to determine openness of licences in scholpubs.
  • Opening of Materials centre at QMU (Martin Dove). CML continues to be valuable.
  • Good progress on CML dictionaries for compchem.
  • We keep fighting for “the right to read is the right to mine” at Brussels (Licences for Europe). Do university libraries care?? They’d rather buy things than fight.

Overall I worry seriously about Open Scholarship. The universities and their libraries don’t care and are giving it away and then buying it back. It’s getting worse not better. We should be fighting for our rights.

#ami2 is at raw alpha. That means that it can do useful stuff if you know what you are doing and know the limitations. We are not appealing for volunteers yet but if you want to be involved please let me know. You will need to be able to:

  • Run Maven and Java.
  • Use Bitbucket.
  • Get excited about really boring stuff (like errors in fonts, pagination etc.)
  • Sort problems yourself/communally.
  • Want to liberate information from PDFs.
  • Have a few minable papers (“Open” in some sense).
  • Be patient.
  • Respect copyright.

Currently there are no proper metrics but:

  • Ca. 1 sec per page
  • Useful compression for text-only (images can’t compress, of course).

Mail me or leave a message here or simply use Bitbucket (http://www.bitbucket.org/petermr/svg2xml-dev ) and give feedback.

#animalgarden Bottom-up Ontologies in Physical Science

April 14th, 2013

On Thursday (2013-04-11) I was invited by Fiona McNeill to give a 5-minute talk on ontologies at Edinburgh (http://dream.inf.ed.ac.uk/events/ukont-13/2013_workshop_program.html ). The workshop aims included:

Amongst other areas of interest, there will be a particular focus on creating and using open data. The program and audience is intentionally very diverse; the aim is to cover areas from many disciplines. We are particularly interested in bringing together those creating and developing the technology with those using the technology in industry, government and public organisations.

A short talk requires special preparation. No point in trying to prove theorems in first-order logic. In fact I argue that this is far too complicated and unnecessary for physical science. So #animalgarden offered to make a presentation. (They didn’t have time to have a proper shoot so they have re-used old slides and there’s no music yet). The slides are at http://www.slideshare.net/petermurrayrust/ontologies-in-physical-science – there are a few snapshots here. (Conventional chemists can read the words – which are deadly serious – and ignore the animals :-( )

The problem is that much of physical science doesn’t even use common identifiers or vocabularies. So the problems are people-problems, not technical ones.

There are a very few chemical ontologies but few people use them and this is even more problematic in materials science. This domain is probably the easiest of all sciences to create ontologies for but paradoxically it hasn’t happened. Crystallography (www.iucr.org/cif) is a shining exception but computational chemistry has nothing.

So a number of us are joining together to create “bottom-up ontologies”. Firstly a small coherent group systematizes the description of what they do in semantic form. Computational chemistry is particularly well suited to this – the programs (codes) have implicit semantics (because the code works and gives the right answers)! Then the community looks at the resultant collection of ontologies and systematizes them where they have the same concepts. In these cases there is a common entry in a communal ontology.

When this isn’t possible the ontologies create machine-readable conventions.

But few computational codes have explicit ontologies. Some define a few of the terms in their manuals, but they aren’t linked to the programs. We’ve developed Chemical Markup Language (CML), which does exactly this. Each code (NWChem, Hyperchem, DLPOLY…) creates its own ontology using a common syntax (CML) but its own identifiers.

There are immediate benefits – the program output becomes semantic and can be re-used for analysis, aggregation, etc. If two groups have ontologies they compare notes and create a toplevel dictionary. As more groups join, the top-level dictionary gains more knowledge and acceptance from the community. And everyone has a feeling of ownership.

We are delighted that Hyperchem http://www.hyper.com/ have recently offered to join in the communal effort. See http://blogs.ch.cam.ac.uk/pmr/2011/11/02/searchable-semantic-compchem-data-quixote-chempound-fox-and-jumbo/ for an overview of the collaboration with PNNL. And http://blogs.ch.cam.ac.uk/pmr/2013/02/03/topics-and-links-for-my-talk-on-semantic-web-for-materials/ for work with CSIRO. And some idea of the great contribution from Kitware http://blogs.ch.cam.ac.uk/pmr/2013/03/01/liberation-software/

The slides are CC-BY. I need to add this.

#ami2 #ukont2013 15-min demonstration of AMI2 (and maybe OPSIN and ChemicalTagger)

April 11th, 2013

I’m demoing after lunch to the 2nd UK Ontology Network Workshop in Edinburgh and it’s billed as AMI2 (our content-mining software for #scholpub and related documents). Why content-mining at an ontology meeting? Because many ontologies are created “bottom-up” from the language we use. This post is just to announce what I am going to show (hopefully) and also to give URLs.

  • AMI2 will read PDFs and convert them to XHTML (prior to creating domain-specific XML). AMI2 is at: https://bitbucket.org/petermr/pdf2svg (for converting PDF to SVG) and https://bitbucket.org/petermr/svg2xml (for converting SVG2XML). Use https://bitbucket.org/petermr/pdf2svg-dev and https://bitbucket.org/petermr/svg2xml-dev for the code for the bleeding edge versions (I’ll be demoing the latter, using Maven from the commandline). We’re beginning to get collaborators – recently AMI2 started working with Renaud Richardet in EPFL Lausanne , for example.

    For newcomers, AMI2 reads a PDF using PDFBox, and uses PDF2SVG to interpret STM publisher characters (which usually are not Unicode). That creates a raw SVG made up of single characters and discrete paths and images. Then she uses SVG2XML to create running text and separate figures and tables. We’ll show how species can be extracted. That’s where today stops. (In the final phase, AMI2-Aaron (in memory of Aaron Swartz), we shall support domain-specific plugins).

  • Then we’ll show OPSIN to show an example of a domain-specific plugin that translates chemical names to Chemical Markup Language.
  • Lastly we’ll show Chemical Tagger (http://chemicaltagger.ch.cam.ac.uk/ ) which uses Natural Language Processing to create semantic chemistry (using CML/XML ontology).

PARTICIPANTS: PLEASE LET AMI2 HAVE SOME PDFs TO EAT!

#openaccess Who owns the Law? Who owns scholarship? You must listen to Ed Walters

April 6th, 2013

IF YOU HAVE ANY INTEREST IN OPENACCESS spend 15 Minutes on http://vimeo.com/63123518 “Ed Walters – Who Owns The Law?” It’s worth the time.

 

In a chillingly precise, researched piece Ed shows how US states have handed over the ownership of their Law to commercial publishing companies: Elsevier and Thomson Reuters.

Heard of them? Yes, the same companies that publish Scopus and WebOfScience .

I don’t want to take away the chilling effect of Ed’s presentation – so listen. And be outraged.

And then realise that the same thing is happening in Science and that naïve Open Access is making it worse. Assuming that other people will look after our rights, and meanwhile handing over our freedom. It’s happening right now.

And unless we wake up and challenge, it will be too late.

I’ll blog in more detail after you’ve watched Ed’s video.

 

Teaching #ami2 to recognize biological names (binomial)

April 4th, 2013

 

Erithacus rubecula (Wikimedia Commons) “the Robin”

 

#ami2 can now read the text of scientific articles as HTML (she has a little trouble with bold letters and strange fonts but we’ll teach her how to manage). Here is how she finds organisms in text. Having created the HTML (which is also XML) she can search it with XPath. XPath is one of the simplest and most powerful search tools for moderate chunks of information. Here she searches a page for italic phrases with at least one space, e.g.:

I heard an Erithacus rubecula today. (I first wrote “Erithacus Rubecula”; @rmounce points out the capitalization!)

AMI has extracted the HTML (<i>…</i> means italics)

<p>I heard an <i>Erithacus rubecula</i> today.</p>

Now she creates an xpath :

“.//html:i[contains(.,' ')]”

This means:

  • .// anywhere in the document (we can increase the precision later)
  • html:i a chunk of italics
  • contains(.,' ') which (.) contains a space (' ')
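A minimal sketch of this search, in Python rather than AMI's own Java and XOM code. Python's standard-library ElementTree does not implement the XPath contains() function, so the sketch uses the path only to find the html:i elements and applies the space test in ordinary Python. The sample page and the function name italic_phrases are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up sample page; real input would be the XHTML that AMI produces.
PAGE = f"""<html xmlns="{XHTML_NS}"><body>
<p>I heard an <i>Erithacus rubecula</i> today.</p>
<p>It was reported in <i>Nature</i> and in <i>Current Biology</i>.</p>
</body></html>"""


def italic_phrases(xhtml):
    """Return the text of every html:i element that contains a space."""
    root = ET.fromstring(xhtml)
    phrases = []
    # './/html:i' finds italics anywhere below the root element.
    for element in root.iterfind(".//html:i", {"html": XHTML_NS}):
        text = "".join(element.itertext())
        # Stand-in for the XPath predicate [contains(., ' ')].
        if " " in text:
            phrases.append(text)
    return phrases


print(italic_phrases(PAGE))
```

The single-word italic "Nature" is dropped, while "Current Biology" is kept even though it is a journal title, which is the same kind of false positive that shows up in the real output.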

It’s not flowing prose but it’s trivial for AMI. And the result (using Jaxen query() in XOM) is:

  • & Evolution
  • 16S, COI
  • 16S, COI, COII
  • 16S, P
  • Achillea macrophylla, Adenostyles alliarae
  • Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
  • Advances in Chrysomelidae Biology 1.
  • Ae. triuncialis
  • Aegilops geniculata
  • Annals of the Entomological Society of
  • Annals of the Entomological Society of America
  • Annual Review of Ecology and
  • Applied Statistics
  • BMC Bioinformatics
  • BMC Evolutionary Biology
  • Bioinformatics 2005, 21(24):4423-4424. 69. Sikes DS, Lewis PO: PAUPRat: PAUP implementation of the parsimony ratchet.
  • Biological Journal
  • Biology and Evolution
  • Boston University, Boston,
  • COI (13 PPIc among 16 polymorphic sites) and
  • COII, P
  • Cladistics-the International Journal of the Willi Hennig Society
  • Current Biology
  • Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs
  • Diabrotica virgifera
  • Die Käfer Mitteleuropas.
  • Doronicum clusii
  • Doronicum grandiflorum

Clearly not all italics are organisms. Many are bibliographic indicators. There are two simple ways to improve the precision:

  • Remove false positives. We can probably remove most of the bibliography by context (they occur on title pages and in references)
  • Include only known species. This is probably the best way forward and we have an excellent Open Source tool (Linnaeus) from Casey Bergman and colleagues at Manchester with > 10000 commonest species.

There are other ways:

  • Morphology and lexical analysis of digraphs (the letter frequency in organisms is very different from English prose – higher vowel frequency for example).
  • Local context (include Hearst patterns … but hey, I have to go…)

So we easily get:

  • Achillea macrophylla, Adenostyles alliarae
  • Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
  • Ae. triuncialis
  • Aegilops geniculata
  • Diabrotica virgifera
  • Doronicum clusii
  • Doronicum grandiflorum
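One way to remove many of the false positives is a purely shape-based test. The sketch below is an illustration only, not AMI's actual filter and not the Linnaeus dictionary lookup recommended above: it assumes that each comma-separated item should look like "Genus", "Genus species" or "G. species". The pattern and the function name looks_like_organism_names are hypothetical.

```python
import re

# Each comma-separated item must look like "Genus", "Genus species" or "G. species".
NAME_SHAPE = re.compile(r"[A-Z][a-z]+\.?(?: [a-z]+)?")


def looks_like_organism_names(phrase):
    """True if every comma-separated item in the phrase has a binomial-like shape."""
    items = [item.strip() for item in phrase.split(",")]
    return all(NAME_SHAPE.fullmatch(item) for item in items)


# A few of the italic phrases from the raw output above.
candidates = [
    "16S, COI",
    "Achillea macrophylla, Adenostyles alliarae",
    "Ae. triuncialis",
    "Annals of the Entomological Society of America",
    "BMC Bioinformatics",
    "Boston University, Boston,",
    "Current Biology",
    "Diabrotica virgifera",
    "Die Käfer Mitteleuropas.",
]

kept = [c for c in candidates if looks_like_organism_names(c)]
print(kept)
```

A shape test like this still accepts ordinary italic phrases such as "In vitro", so a list of known species remains the more reliable filter.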

So I hope you are now clear about how powerful content-mining is, how it will revolutionise science and how it is a crime against human knowledge to restrict its deployment.

#ami2 Can only academics understand scientific papers? Or can the #scholarlypoor be scientists as well? We need us

April 4th, 2013

 

A FORB (Wikipedia)

One of the arguments about scholarly publishing is that it is for “academics to publish to academics”. Even Open Access advocates such as Stevan Harnad have stated this publicly. I find this arrogant and unacceptable – I think with modern resources such as Wikipedia and Internet search engines much of science is accessible to a huge number of the #scholarlypoor (people outside rich universities with no access to closed publications).

I am trained as a chemist, crystallographer, self-taught computer-scientist and I have no formal biology training. But Ross Mounce and I are working on liberating the world’s phylogenetic trees. DON’T switch off at “phylogenetic” – like many scientific terms you know much about http://en.wikipedia.org/wiki/Phylogenetic_tree already. Can you understand:

A phylogenetic tree or evolutionary tree is a branching diagram or “tree” showing the inferred evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical and/or genetic characteristics. The taxa joined together in the tree are implied to have descended from a common ancestor.

I think anyone with high school education will (or should!) be familiar with everything here. The only difficult words are “entities” (posh word for “thing”) and “taxa” which is either fairly obvious or you can look up. Again from Wikipedia:

A taxon (plural: taxa) is a group of one (or more) populations of organism(s), which a taxonomist adjudges to be a unit. Usually a taxon is given a name and a rank, although neither is a requirement. Defining what belongs or does not belong to such a taxonomic group is done by a taxonomist with the science of taxonomy. It is not uncommon for one taxonomist to disagree with another on what exactly belongs to a taxon, or on what exact criteria should be used for inclusion.

And here is the tree. You may not understand all the names (*I* don’t!) but you can see “Bacteria”, “Animals”, “fungi”, “plants”, etc. I don’t need to understand everything – because I have colleagues such as Ross Mounce and Matthew Wills at Bath I am working with.

So here is a page of BMC Evolutionary Biology that #AMI2 has turned into HTML. Can you understand it? (It’s a LOT easier than understanding domestic energy tariffs in UK):

From this interaction follows that divergent selection between ecological niches is a major driving force differentiating lineages until reproductive isolation occurs [17]. Ecologically divergent pairs of populations will show higher levels of reproductive incompatibility and lower levels of gene flow than ecologically more similar population pairs [29]. A resulting corollary is that ecological speciation is more likely to arise in regions with patchworks of contrasting habitats and/or distinct environmental gradients.

PMR. Some of the long words are precise terms but I think this could be written in simpler language.

The number of taxa within the insect order Coleoptera exceeds that of any known plant or animal group [30]. More than half of the beetles are phytophagous, including the species rich superfamilies Curculionoidea and Chrysomeloidea, of which a majority feeds on angiosperms [31]. The increase in phytophagous beetle diversity was facilitated by the rise of flowering plants [31]. The family Chrysomelidae currently consists of more than thirty-five thousand recognized species including economically important pest species such as the Colorado potato beetle ( Leptinotarsa decemlineata), the Northern corn rootworm ( Diabrotica virgifera), the Cereal leaf beetle ( Oulema melanopus), and the Striped turnip flea beetle ( Phyllotreta nemorum). The biological and economic importance of the superfamily Chrysomeloidea make it vital to understand the factors that drive diversification in this group.

Here, we present a case of ecological niche differentiation in the alpine leaf beetle Oreina speciosissima that may represent the early stages of ecological speciation. The genus Oreina currently includes twenty-eight species, of which only seven early-diverging taxa do not exclusively occur in high forbs (i.e. five develop in stone run vegetation and two can be found in both high forbs and stone runs) [32]. According to current knowledge [34], the most parsimonious explanation is that high forbs vegetation is the ancestral niche for the remaining twenty-one Oreina lineages, among which only our focal taxon Oreina speciosissima shows a partial reversal, since it is found both in high forbs and stone run vegetation.

Oreina speciosissima is distributed across nearly the entire range of the genus Oreina (from the Pyrenees in the west to the Carpathian Mountains in the east) through a wide altitudinal gradient (ranging from 800 to 2700 m above sea level). At lower elevations it generally colonizes the very abundant high forbs vegetation whereas at higher elevations it is found in stone run habitats across a small portion of its distribution range [unpublished observations MB, TVN][32]. Kippenberg [32] and personal observations suggest that Oreina speciosissima feeds exclusively on Asteraceae ( Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio and Tussilago) and colonizes four distinct habitats

Did you understand it’s about how beetles in European mountains evolve? You may very probably know about European biology (when at school I used to travel to the Alps and identify and photograph alpine plants and to ring (band) birds). That was before I became an “academic”. But I knew all the binomials of European birds and plants I had seen. If you are similar you are entitled to be part of open scholarship. There are words I don’t know: “forbs”

[“A forb (sometimes spelled phorb) is a herbaceous flowering plant that is not a graminoid (grasses, sedges and rushes). The term is used in biology and in vegetation ecology, especially in relation to grasslands[1] and understory.” From Wikipedia]

And I didn’t know “stone run” either:

A stone run (called also stone river, stone stream or stone sea[1]) is a conspicuous rock landform, result of the erosion of particular rock varieties caused by myriad freezing-thawing cycles taking place in periglacial conditions during the last Ice Age.[2]

But I am sure you understood it!

We have the equipment to open scholarship to the world. Let’s embrace and use it.

 

#ami2 and @tabula : collaboration vs competition; #scholrev

April 4th, 2013

Oreina gloriosa from Wikipedia (you’ll see why)

I’ve read 25+ academic papers about extraction of information from PDFs and only 1 of those makes any mention of availability of code. These papers are published to announce a new (usually incremental or even repeated) advance and the main driving force is academic glory and reward. I don’t blame the authors in most cases – that’s how the academic system works (and that’s what I blame). But it means that when I, for example, wanted to create #ami2 I had to start from scratch.

No, that’s completely unfair to PDFBox and Apache on which #ami2 is based. But in terms of analysing scientific PDFs I had to start from PDFBox. No existing code to help with tables, graphs, trees, text, etc. And although I have heard many presentations by academics there is very little re-usable code – so I had to write my own.

Not MY own. OUR own. Because everything I do is for US. That’s what works in the Blue Obelisk. (I was delighted to hear yesterday that Jmol now has a completely JavaScript version. That means I don’t have to write a JavaScript viewer for 3D chemistry.) And it is what will work in #scholrev. A community approach to building the tools for open scholarship.

Today I got a tweet that a group (@Tabula) was working on extraction of tables from PDFs – the area I am spending a lot of my time in. A typical academic reaction might be “Blast. We’ve been scooped”. Because that means we couldn’t publish anything on extraction of tables. (That’s not true, of course; duplicate work often gets published – just not in the glamour mags. And duplication – within reason – is good because it cross-fertilizes and acts as a check).

So MY/OUR reaction was Great! I don’t have to do tables. I can use @Tabula instead. Now let’s see what it does. I haven’t yet corresponded with @tabula folks but it’s related to Mozilla and anyway it’s under an Open licence and invites collaboration. So I know I can use it – only question is what sort of technology – static/dynamic link, web service, or even translating code. (Of course this would be done with agreement and acknowledgement.)

Let’s have a look: http://source.mozillaopennews.org/en-US/articles/introducing-tabula/

A table with ruling lines

A fully lined table.

Tables without row or column graphic separators are also common. For these type of tables, we cluster together the words that vertically overlap each other. The row boundaries are the bounding boxes of each detected cluster of words.

A table without graphic separators

Detected row boundaries in a table without graphic separators.

An analogous procedure is then carried out for detecting column boundaries. Tabula clusters together words that overlap horizontally. The bounding boxes of those clusters are the column boundaries.
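The clustering described in these quotations can be sketched as a merge of overlapping intervals. This is only an illustration under stated assumptions, not Tabula's actual code: the word boxes, their coordinates and the function name cluster_by_overlap are all hypothetical, and spans that merely touch are treated as overlapping.

```python
def cluster_by_overlap(spans):
    """Merge (start, end) spans that overlap into clusters.

    Returns the (start, end) extent of each cluster, in increasing order.
    """
    clusters = []
    for start, end in sorted(spans):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the current cluster: extend it if needed.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            # No overlap: this span starts a new cluster.
            clusters.append([start, end])
    return [tuple(c) for c in clusters]


# Hypothetical word boxes: (text, x0, y0, x1, y1), with y increasing down the page.
words = [
    ("Code", 10, 100, 40, 110), ("Population", 60, 100, 130, 110),
    ("A1", 10, 120, 25, 130), ("Alps", 60, 121, 90, 131),
    ("B2", 10, 140, 25, 150), ("Pyrenees", 60, 139, 120, 149),
]

# Row boundaries come from the vertical extents, column boundaries from the horizontal ones.
rows = cluster_by_overlap([(y0, y1) for _, _, y0, _, y1 in words])
columns = cluster_by_overlap([(x0, x1) for _, x0, _, x1, _ in words])
print(rows)
print(columns)
```

With these made-up boxes the sketch finds three rows and two columns, even though the words in a row do not share exactly the same baseline.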

Wow. This looks exactly complementary to #ami2-svg2xml. Here’s where we have got to with #AMI2 – chopping up the page. A table from BMC Evolutionary Biology. (BMC is a commercial Open Access CC-BY publisher who WANT you to re-use material, unlike most mainstream “closed publishers” who make it extremely difficult).

 

#AMI has chopped the page into bits (this is not all of it) and has identified the Table because it says T-a-b-l-e. (We have to teach AMI every word). The “Table” consists of a box with (a) caption and (b) table body (c) a footer. The table body has column headers (e.g. Code, Population). AMI2 does not yet understand what these actually mean – but we shall teach her.

I haven’t yet tried out @Tabula but I am very hopeful it will manage the body of the table.

When it does we then have to find out what the columns mean. I expect that words like “Coordinates” and “Year” will be very common and we can develop heuristics or machine learning. The format of the columns also contains vital information. Note that the altitudes are all > 1000 m so we have an alpine context.

What’s it about? “Sampled population of …” suggests population studies. And we can look in the text:

“Oreina speciosissima” occurs in italics. This is suggestive of a binomial organism name. Here’s NL Wikipedia http://nl.wikipedia.org/wiki/Oreina_speciosissima . A web search gives us http://www.biol.uni.wroc.pl/cassidae/European%20Chrysomelidae/oreina%20speciosissima.htm where we have pictures (they are copyright but very beautiful). I’ll give you http://en.wikipedia.org/wiki/Oreina_speciosa instead.

I hope you can see how all this links together. Beetles, places, mountains, dates, etc. A new type of science.

And why I am so ANGRY about mainstream publishers preventing us doing this.

The Lancet’s new #openaccess policy. Do they/Elsevier take me for an (April) Fool?

April 2nd, 2013