I ask the University of Glasgow to reconsider its FOI answer that they don't hold information about Licences signed with Publishers. Help?

The University of Glasgow has summarily dismissed my request for information on publisher licences (/pmr/2014/03/04/my-foi-request-to-the-university-of-glasgow-has-left-me-speechless-they-do-not-know-anything-about-publishers-or-the-licences-they-sign/). Response from the Twittersphere has been incredulous. Prof. Charles Oppenheim – an expert on UK libraries – opined "makes no sense" and advised "you have to appeal".
So I'm appealing. I've not done this before. The WhatDoTheyKnow team is tremendous – they advise that this may take some time and that if I don't get a reply I can go to the Scottish Information Commissioner. So here's my appeal (the hyperlink within gives all the correspondence).
Everything is public. So the University of Glasgow may be interested in how the citizenry, whose taxes fund the University, views this inability to manage core information on an expensive and legally critical business.
 

From: Peter Murray-Rust
4 March 2014

Dear University of Glasgow,
Please pass this on to the person who conducts Freedom of
Information reviews.
I am writing to request an internal review of University of
Glasgow’s handling of my FOI request ‘Licences with subscription
publishers forbidding content mining’.
I asked for information about the University’s licences with
academic publishers and was rejected on the grounds that the
University does not hold the information. This is completely
unacceptable since:
* all UK universities sign multiple licences with multiple
publishers every year. (Worldwide this is a 10 billion GBP
business)
* the University has a legal duty of care to maintain this
information
* the terms of many licences legally require the University and its
staff to conform to certain conditions. In a well-run university it
is unthinkable that the University does not hold these conditions
in a state where it can refer to them, possibly on a daily basis.
* most universities have sporadic cases where publishers claim that
the terms and conditions have not been observed. If these are
upheld it is probable that the University will lose the service it
has bought or that legal action will be taken against it (these
have happened elsewhere). Again it is inconceivable that a well-run
University does not hold a detailed record of such events.
I cannot interpret the answer I received in a positive light. I
hope the University understood the question as it is simple and
aligned to mainstream business activities. In which case it is
difficult to avoid the interpretation that the University does not
take FOI requests seriously or that it has something to hide.
WhatDoTheyKnow provides a public service and all my correspondence
is made public. I generally make a point of sharing this widely.
This has already resulted in confirmatory feedback that the
University has not answered satisfactorily.
A full history of my FOI request and all correspondence is
available on the Internet at this address:
https://www.whatdotheyknow.com/request/l…
Yours faithfully,
Peter Murray-Rust
 

My FOI request to the University of Glasgow has left me speechless; they do not know anything about publishers or the licences they sign

I have sent out ca. 10 FOI requests to UK Russell Group Universities (see this blog). IMO these are reasonable requests, given the imminent change in the statutory instrument on Copyright. My questions are appended at the bottom – I hope you agree they are polite and clear and deserving of answers. Basically I have asked them what their practice and policy is on the restrictive licences about Text and Data Mining which they sign with publishers.
The University of Glasgow is the first to respond (other than acknowledgments). Their response is
“we don’t hold the information you have requested”.
I am speechless. I expected a variety of responses but not this.
Very, very, simply. The University of Glasgow has legally signed contracts with publishers. They tell me that they do not have the contracts they have signed.
What?
If you sign a contract you should keep it. I’d assume they have to keep copies of contracts for seven years. But apparently Glasgow throw them away after signing.
That's the most charitable explanation I can put on this.
To summarize, their response taken at face value says:
  • we don’t have a clue what goes on in our university and we don’t care
  • we don’t keep records of what the library does
  • we don't care when the library falls foul of publishers

Yes – I was expecting "it's too much work" – "it's secret because the publisher won't let us" – "we can't issue personal data about employees or users".
But not:

  • We don’t take your request seriously

I’d like confirmation from readers that I’m not overreacting. And if so, what should I do? Write to the University Rector? Or the local MP?
Because we live in a democracy where part of the process is to treat people courteously even if they ask uncomfortable questions. Because uncomfortable questions often lead to better ways of doing things. I’ll sleep on it and read your responses.
But if this is common across UK Universities – that they don't care about new legislation and don't care about answering questions – we start to have problems with Universities.
================================
Peter Murray-Rust request-198359-0a4bd05a@whatdotheyknow.com
4 March 2014
Our Ref: FOI 2014/51 – F0360445
Dear Mr Murray-Rust,
Re: Freedom of Information (Scotland) Act 2002 – Request for Information
Thank you for your email which was received by the University on 19 February 2014 timed 05:23 hours, requesting the following information:
University’s Response
The University of Glasgow does not hold the information that you have requested and is not aware of any other public authority that could respond to your request. Section 17 of FOISA states that where public authorities receive requests for information that they do not hold, they must issue a notice advising that they do not hold the requested information.
The supply of documents under the terms of the Freedom of Information (Scotland) Act 2002 does not give the applicant or whoever receives the information any right to re-use it in such a way that might infringe the Copyright, Designs and Patents Act 1988 (for example, by making multiple copies, publishing or otherwise distributing the information to other individuals and the public). The Freedom of Information (Scotland) Act 2002 (Consequential Modifications) Order 2004 ensured that Section 50 of the Copyright, Designs and Patents Act 1988 (“CDPA”) applies to the Freedom of Information (Scotland) Act 2002 (“FOISA”).
Breach of copyright law is an actionable offence and the University expressly reserves its rights and remedies available to it pursuant to the CDPA and common law. Further information on copyright is available at the following website:


See attached extracted request


http://www.ipo.gov.uk/copy.htm


DATA PROTECTION AND FREEDOM OF INFORMATION OFFICE
Main Building, University of Glasgow, Glasgow G12 8QQ
Data Protection: Telephone: 0141-330-3111 E-Mail: dp@gla.ac.uk
Freedom of Information: Telephone: 0141-330-2523 E-Mail: foi@gla.ac.uk
The University of Glasgow, charity number SC004401
Your right to seek a review
Should you be dissatisfied with the way in which the University has dealt with your request, you have the right to require us to review our actions and decisions. If you wish to request a review, please contact the University Secretary, University Court Office, Gilbert Scott Building, University of Glasgow, Glasgow, Scotland G12 8QQ or e-mail: foi@gla.ac.uk within 40 working days. Your request must be in a recordable format (letter, email, audio tape, etc). You will receive a full response to your request for review within 20 working days of its receipt.
If you are dissatisfied with the way in which we have handled your request for review you may ask the Scottish Information Commissioner to review our decision. You must submit your complaint in writing to the Commissioner within 6 months of receiving the response to review letter. The Commissioner may be contacted as follows:
The Scottish Information Commissioner
Kinburn Castle
Doubledykes Road
St Andrews
Fife
KY16 9DS
Telephone: 01334 464610
Fax: 01334 464611
Website: www.itspublicknowledge.info
E-mail: enquiries@itspublicknowledge.info
An appeal, on a point of law, to the Court of Session may be made against a decision by the Commissioner.
For further information on the review procedure please refer to http://www.gla.ac.uk/services/dpfoioffice/policiesandprocedures/foisa-complaintsandreview/. All complaints regarding requests for information will be handled in accordance with this procedure.
Yours sincerely,
Data Protection and Freedom of Information Office


======== PMR’s questions =========
Extracted request for call F0360445
Dear University of Glasgow,
Background and terminology:
This request relates to content mining (aka Text and Data Mining (TDM), or data analytics) of scholarly articles provided by publishers under a subscription model. Mining is the use of machines (software) to systematically traverse (crawl, spider) subscribed content, index it and extract parts of the content, especially facts. This process (abstracting) has been carried out by scholars ("researchers") for many decades without controversy; what is new is the use of machines to add speed and quality.
Most subscribers (universities, libraries) sign contracts provided by the publishers. Many of these contain clauses specifically restricting or forbidding mining ("restrictive contracts"). Recently the UK government (through the Intellectual Property Office and Professor Hargreaves) recommended reform of Copyright to allow mining; a statutory instrument is expected in 2014-04. Many subscription publishers (e.g. Elsevier) have challenged this (e.g. in the Licences for Europe discussions) and intend to offer bespoke licences to individual researchers ("click-through licences").
In many universities contracts are negotiated by the University Library ("library"), who agree the terms and conditions (T&C) of the contract. At the request of the publishers some or all of the contract is kept secret.
Oversight of library activities in universities usually involves a "library committee" with a significant number of academics or other non-library members.
————-
Questions (please give documentary evidence such as library committee minutes or correspondence with publishers):
* How many subscription publishers have requested the university to sign a restrictive contract (if over 20, write "> 20")?
* When was the first year that the University signed such a contract?
* How often has the university challenged a restrictive contract?
* How many challenges have resulted in removal of ALL restrictions on mining?
* Has the university ever raised restrictions on mining with a library committee or other committee?
* How many researchers have approached the university to request mining? How many were rejected?
* How often has the university negotiated with a publisher for a specific research project requiring mining? Has the publisher imposed any conditions on the type or extent of the research? Has the publisher imposed conditions on how the research can be published?
* How often has a researcher carried out mining and caused an unfavourable response from a publisher (such as removal of service or a legal letter)?
* How often has the university advised a researcher that they should desist from mining? Have any researchers been disciplined for mining or had subscription access removed?
* Does the university have a policy on researchers signing “click through licences”?
* Does the university have a policy for facilitating researchers to carry out mining after the UK statutory instrument is confirmed?


* Does the university intend to refuse to sign restrictive contracts after the statutory instrument comes into force?

I teach AMI 255 shades of gray for our revolution. Any dedicated Java Image hackers want to help?

In my last technical post I mentioned that we were trying to recognize the character “A”. Not too difficult for a sighted European human. Hard for Eeyore. Hard for AMI our document reading program.
It’s taken a bit longer than I thought. Here’s our “A”
A
 
Simple, isn’t it? It’s one colour – black. What could be simpler?
Well actually it isn't one colour. It's 255 shades of gray (I have given up and use US spelling for compsci things). Or, more correctly, it's a gradation from 0 (black, no light at all) to 255 (as white as you can get). Look closely and you will see "jaggies" which aren't completely black. They are there as "antialiasing" – a method to make it look nice for humans (and it works). Remember we aren't allowed to draw straight lines – we have to use pixels. Many years ago – 1970 – we used pen plotters (Calcomp) or moving spots of light (Tektronix) to draw straight lines. The great Evans and Sutherland computers – for which modern structural biologists should be grateful – produced many of the classic protein structures with straight lines (vectors) during the 1980s.
But then pixels came back – Silicon Graphics and desktops – and are almost universal. So we have to draw lines and circles with pixels. There are clever algorithms (I still marvel at Bresenham’s circle) but the output is for humans, not machines. A machine simply sees an array (about 30*36 = 1080) of white, grey and black pixels.
No lines.
No A.
We’ve got to reconstruct the A. (AMI asks “Couldn’t the publishers publish proper A’s?”. “No AMI, they can’t, because they want to do the same thing they were doing 20 years ago – simulate paper”). Here’s a bit of it:
[image: grid of gray pixel values (0 = black, 255 = white) from the SW corner of the A]
 
Remember “0” is black and “255” is white. You can see the SW corner of the A. I have to teach AMI how to recognise it.
AMI: Are all A’s the same size?
P. No. They can be as small as 7 pixels high (anything smaller is unreadable for humans).
AMI. Are they all the same aspect ratio?
P. No. Some are thin and some are fat.
AMI. Are all the lines the same width?
P. No. There’s Helvetica light (thin), Helvetica (medium) and HelveticaBold (thick)
AMI: Is the A always symmetrical?
P: No, it can be slanted (oblique, italic)
AMI: Are there other fonts besides Helvetica?
P. Zillions. At least ten thousand.
AMI: So you will have to teach me a hundred thousand fonts.
P. I can't. Some of them are copyrighted.
AMI: That means you go to jail if you use them and redistribute them?
P. More or less.
AMI. I will not be able to recognise all As.
P. Many of them are VERY similar. I will teach you how to recognise similar characters. First we have to convert them to gray.
AMI. I have some convertToGray() modules.
P. Good. Then we have to clip them.
AMI. That’s what you are writing now? Trimming off the white pixel edges.
P. That’s right.
AMI: But some are "nearly white" (240) – is that white?
P: we shall set a heuristic cutoff – maybe 240, maybe higher.
AMI: How do you tell?
P. Trial and error. It’s very boring. What is worse is that I am doing some of the things for the first time.
AMI: But you shouldn't make mistakes. That's why we use JUnit and Maven and Jenkins.
P. But I don’t know what methods work best. Maybe Otsu? Maybe Hough? There is no single solution.
AMI. Perhaps you can get some hackers to help you. They might know more about Java Images.
P. Good idea. Java’s ImageIO is not very cuddly – for example a missing file does not throw FileNotFound, but NullPointer.
AMI. ARE THERE ANY JAVA IMAGE HACKERS WHO CAN HELP PM-R? You don’t need to know any chemistry!
P. Thanks AMI. Hackers, please leave a comment on the blog. And there’s a very exciting way of meeting them that I am not yet allowed to blog – a few days.
AMI. Communal knowledge makes projects go faster. I like having several people writing my code. We have good tools for keeping in synch.
P. I took days – an experienced image guru would have done this in a morning.
AMI. So we have clipped the images. Now we have to make them the same size?
P. Yes. And I found a very useful library Imgscalr.
AMI. Yes, you installed and tested it on a few characters. Did it work?
P. Seems to. Now we have to compute correlation… and then see how unique the results are.
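For would-be Java hackers, here's roughly what AMI and I just agreed, as a single compilable sketch. Everything in it is illustrative – the class and method names, the 240 cutoff and the 20×24 target size are my choices of the moment, not settled AMI code – and the resize uses the imgscalr library mentioned above.

import java.awt.image.BufferedImage;

import org.imgscalr.Scalr;

public class GlyphPipeline {

    // Convert a packed RGB pixel to gray by averaging the channels
    // (one simple convention among several).
    static int gray(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (r + g + b) / 3;
    }

    // Clip off border rows/columns whose pixels are all "nearly white" (>= cutoff).
    static BufferedImage clipWhiteBorders(BufferedImage img, int cutoff) {
        int minX = img.getWidth(), minY = img.getHeight(), maxX = -1, maxY = -1;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                if (gray(img.getRGB(x, y)) < cutoff) { // a non-white pixel
                    minX = Math.min(minX, x); maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y); maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) return img; // blank image, nothing to clip
        return img.getSubimage(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    // Pearson correlation of the gray values of two same-sized images.
    // (Degenerate if either image is a uniform colour - this is a sketch, not production code.)
    static double correlate(BufferedImage a, BufferedImage b) {
        int n = a.getWidth() * a.getHeight();
        double sA = 0, sB = 0, sAA = 0, sBB = 0, sAB = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                double ga = gray(a.getRGB(x, y)), gb = gray(b.getRGB(x, y));
                sA += ga; sB += gb; sAA += ga * ga; sBB += gb * gb; sAB += ga * gb;
            }
        }
        return (sAB - sA * sB / n)
                / Math.sqrt((sAA - sA * sA / n) * (sBB - sB * sB / n));
    }

    // Clip both glyphs, scale them to a common size, then correlate.
    static double compare(BufferedImage unknown, BufferedImage reference) {
        BufferedImage u = Scalr.resize(clipWhiteBorders(unknown, 240),
                Scalr.Method.QUALITY, Scalr.Mode.FIT_EXACT, 20, 24);
        BufferedImage r = Scalr.resize(clipWhiteBorders(reference, 240),
                Scalr.Method.QUALITY, Scalr.Mode.FIT_EXACT, 20, 24);
        return correlate(u, r);
    }
}

The unknown character is then assigned to whichever reference glyph gives the highest correlation – and "see how unique the results are" means checking that the best score beats the runner-up by a safe margin.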
 
 


Doh! Processing pixel images. I spend a week instead of reusing one line of code. I refactor my code by Jumping on it

We are developing software to read scientific documents automatically. One problem is that many of the characters are in pixel form and we have to recognise them. This is called Optical Character Recognition or OCR.
As with all problems I look to see whether there is a solution already. I need a FLOSS (Open) pure Java solution. I have found three possibilities:

  • Tesseract. Not pure Java
  • Java OCR (was simple but now so complicated we don’t even know how to run it! Help would be useful!)
  • Lookup (a method for correlating character images – but where to start?).

So we have decided (regretfully) that we have to write our own. If anyone has a pure Java OCR solution that doesn't require training, please put us out of our agony. Ultimately it should do a subset of Unicode including Greek, maths, symbols, etc. I've hacked several ideas including thinning and topology (which is still on the drawing board). But recently I came to the conclusion that to make a start we should correlate known characters with unknown ones. Since most scientific images use Helvetica (or similar sans-serif) or Times New Roman (or similar serif) or Courier (or similar monospace) we can start with those to compare against.
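Before we can correlate anything we need the reference glyphs. Here's a minimal sketch of how one might render them with plain AWT – the logical font names ("SansSerif", "Serif", "Monospaced") stand in for the commercial families, and the sizes and class name are my own illustrative choices:

import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ReferenceGlyphs {

    // Render a single character in a given font into a small grayscale image.
    static BufferedImage render(char c, String fontName, int size) {
        BufferedImage img = new BufferedImage(size * 2, size * 2, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, img.getWidth(), img.getHeight()); // white background
        g.setRenderingHint(RenderingHints.KEY_TEXT_ANTIALIASING,
                RenderingHints.VALUE_TEXT_ANTIALIAS_ON);   // the jaggies we must later undo
        g.setColor(Color.BLACK);
        g.setFont(new Font(fontName, Font.PLAIN, size));
        g.drawString(String.valueOf(c), size / 2, (3 * size) / 2); // x, baseline-y
        g.dispose();
        return img;
    }

    public static void main(String[] args) {
        // One sans-serif, one serif, one monospace - the three families above.
        for (String family : new String[] {"SansSerif", "Serif", "Monospaced"}) {
            BufferedImage a = render('A', family, 24);
            System.out.println(family + ": " + a.getWidth() + "x" + a.getHeight());
        }
    }
}

Each rendered glyph would then be clipped and scaled exactly like the unknown characters before correlating.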
What’s this?
A
From the “House at Pooh Corner” quoted without permission but with Love (A A Milne was my aunts’ uncle).

Eeyore had three sticks on the ground, and was looking at them. Two of the sticks were touching at one end, but not at the other, and the third stick was laid across them. Piglet thought that perhaps it was a Trap of some kind.
“Oh, Eeyore,” he began again, “I just–“
“Is that little Piglet?” said Eeyore, still looking hard at his sticks.
“Yes, Eeyore, and I–“
“Do you know what this is?”
“No,” said Piglet.
“It’s an A.”

How do we know it's an "A"? (Douglas Hofstadter wrote deeply on this.) My simple solution is "does it have a good pixel-wise correlation coefficient with any of the As in three font families after scaling and translation?".
I've spent a week writing code (very badly) to do the scaling and translation. Part of the problem is that I was in the Australian desert with very limited wifi, no geek companions and broken chunks of time. I'd write a bit each day, spend my time remembering what I had done, then hack away at tests and then start from scratch again.
I’ve now come back and found a simple online solution:
// bimage was loaded earlier (e.g. with ImageIO.read), using org.imgscalr.Scalr
bimage = Scalr.resize(bimage, Method.QUALITY, Mode.FIT_EXACT, width,
height);

from the imgscalr library (see this post http://e-blog-java.blogspot.com.au/2012/02/cropping-rotate-and-resizing-images.html )
Doh! I could have got all this working a week ago.
Yes, if I had known there *was* a solution and where to look. Sometimes it takes a week to find where to start. You look on Stackoverflow, Google a bit, tweet a bit and sometimes it takes minutes, but sometimes takes a week. It really helps to have a geek cloud round you. I do reasonably well on these things but when you convolute the different languages (C, Java, Python, Perl, R, JS, etc.) with Open and Easy then the message becomes more diffuse. (There are some packages – e.g. OpenCV – that can take a week to know where to start!).

Rabbit came up importantly, nodded to Piglet, and said, “Ah, Eeyore,” in the voice of one who would be saying “Good-bye ” in about two more minutes. 
      “There’s just one thing I wanted to ask you, Eeyore. What happens to Christopher Robin in the mornings nowadays?” 
      “What’s this that I’m looking at?” said Eeyore, still looking at it. 
      “Three sticks,” said Rabbit promptly. 
      “You see?” said Eeyore to Piglet. He turned to Rabbit. “I will now answer your question,” he said solemnly. 
      “Thank you,” said Rabbit. 
      “What does Christopher Robin do in the mornings? He learns. He becomes Educated. He instigorates–I think that is the word he mentioned, but I may be referring to something else–he instigorates Knowledge. In my small way I also, if I have the word right, am–am doing what he does. That, for instance, is?” 
      “An A,” said Rabbit, “but not a very good one. Well, I must get back and tell the others.”

So Eeyore has to refactor his “A”:

“What did Rabbit say it was?” he asked. 
      “An A,” said Piglet. 
      “Did you tell him?” 
      “No, Eeyore, I didn’t. I expect he just knew.” 
      “He knew? You mean this A thing is a thing Rabbit knew?” 
      “Yes, Eeyore. He’s clever, Rabbit is.” 
      “Clever!” said Eeyore scornfully, putting a foot heavily on his three sticks. “Education!” said Eeyore bitterly, jumping on his six sticks. “What is Learning?” asked Eeyore as he kicked his twelve sticks into the air. “A thing Rabbit knows! Ha!”

So I have refactored my code by jumping on it and kicking it into the air.
But I am not bitter – I am happy!
 


OKFest 2014 will be sensational; it's inclusive and empowering

The OKFest in Berlin has just been announced http://2014.okfestival.org/blog/. It will be fantastic.
How do I know? Because it’s being created by fantastic people and because OKFest in Helsinki was fantastic.
And I love the theme. It echoes my own thoughts. So here…

A Theory of Change

The programme for the 2014 edition of the festival is fuelled by a theory of change. Using this theory as our outline, the event provides an ideal opportunity for the open movement to come together to co-create the roadmap that will guide its next steps.
Our Theory: we believe that Knowledge, Tools and Society are the levers of Change. As such, Knowledge, Tools and Society will be the three streams that form the architecture of the programme.
Knowledge: Knowledge informs change. At OKFestival we’re keen to discuss ways of unlocking, expanding and sharing knowledge through open access, open research data, open educational resources, open science, data journalism and campaigning, data visualisation and literacy.
Tools: Tools enable change. We’ll be discussing facilitating the flow of knowledge through non profit technology, open source software, open hardware, design, architecture and urban planning.
Society: The group(s) who effect change.  Topics may include designing institutions, building communities and protecting environments. Additionally, powering economies through open government, transparency, open tech businesses, open development, open education, open culture, open sustainability and open economy will feature on the programme. Security and privacy-related topics also fit into this stream.
The festival will weave together these three streams of open knowledge innovation and impact. Each session will go deep within its realm as well as identifying interdisciplinary features that span different domains or disciplines.
 

This is exactly what the innovation in this century is about. It’s inclusive.

  • Everyone has a right to all public knowledge at zero access cost and without barriers. Anyone can use it for any purpose
  • Tools are becoming universal. We are creating the gift economy where everyone wishes to make tools available to everyone for free. And where the cost of tool production is dropping rapidly, whether it be software or material manufacturing.
  • We are creating communities with an ease that has never been possible before. We can find our partners in an email or social-network interchange anywhere – and anywhen – in the world. We move in and between communities.

And traditional organisations like cities and funders and voluntary groups and governments are understanding the power of the community. You’ll meet them all at OKFest.
It is a great time to be alive and to be part of it.
 


Content Mining; Extracting Facts from Plots – 2; we find errors in the paper

In the previous post I introduced the need to extract data from plots – I continue with the details of how to do it. (Again, please stay with this even if you aren’t a scientist or geek – the principles are general. And I have a surprise, which surprised even me!).
Looking at PedroS's problem, we've narrowed it down to one image and one caption. Note that this single paper could give 20 times this amount of information, and there are millions of papers. But we'll "zoom in" to this single task.
[image: the figure and its caption, clipped from the paper]
Here’s the image. If you zoom in you’ll see it’s made of pixels
[image: the plot itself; zoom in and you can see the pixels]
 
There are no explicit characters or lines – only groups of pixels that represent them. (This is actually very high quality compared with many pictures.) Interpreting them is tricky, but we've done most of the hard work (hackers welcome to help!). What we have to do is find:

  • the data points (there are 9, represented by diamonds)
  • the straight line relationship
  • the x-axis
  • the x-axis scale (tick marks and numbers 0.0031 to 0.0035)
  • the x-axis quantity (T)
  • the x-axis units (1/K)
  • the y-axis
  • the y-axis scale
  • the y-axis quantity (ln k) – there are no units as logarithms are dimensionless

Knowledgeable scientists will already have spotted the error in the plot. An Arrhenius plot should have 1/T, not T, as the x-axis. It's almost certainly a labelling error, not a data error, but even so it's an error.
This is the holy "version of record" and it's wrong!
I have asserted that on average every paper contains errors. Not necessarily serious ones, but errors all the same. A graduate student would have to recreate this plot to pass their exam.
Could our AMI program detect this automatically? In principle, yes – quite easily. We’d need a template for “Arrhenius plot” and the dimensions of the x-axis.
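Here's a toy version of that template check. The class is hypothetical – my illustration, not part of AMI – but it shows how little is needed once the axis label has been extracted:

import java.util.regex.Pattern;

// Toy template for an Arrhenius plot: the x-axis must be reciprocal temperature.
public class ArrheniusTemplate {

    // Accept labels like "1/T", "1/T (1/K)" or "T^-1"; reject a bare "T".
    private static final Pattern RECIPROCAL_T =
            Pattern.compile("1\\s*/\\s*T|T\\s*\\^?\\s*-1");

    static boolean looksCorrect(String xAxisLabel) {
        return RECIPROCAL_T.matcher(xAxisLabel).find();
    }

    public static void main(String[] args) {
        System.out.println(looksCorrect("1/T (1/K)")); // true - as it should be
        System.out.println(looksCorrect("T (1/K)"));   // false - the error in this figure
    }
}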
So that's our contention: every paper must be examined by machines for errors. Any publisher who prevents us doing so is trying to support its monopoly at the cost of allowing BAD SCIENCE. That's why we have a moral right, and an imperative, to use content mining for the whole literature.
To check for errors (and worse).
A publisher who defends their right to condone bad science will lose in the court of public opinion.
(Now, I didn’t know when I started this post that there was an error in it!  I’ll continue with what I wanted to say…)
We now have to interpret what the semantics of the plot are. Let's look at the caption for (C) and annotate it using the Internet and especially Wikipedia.
Wikipedia??? Isn't that unreliable and rubbish??
NO. No more so than anything else (remember the current paper has at least one error, which would never persist in Wikipedia). And in science the quality of Wikipedia is extremely high (NB I wrote some of it). So let's see how much of the caption we can interpret (this has links to Wikipedia, etc.):

determination of the activation energy [http://en.wikipedia.org/wiki/Activation_energy ]  of DPOR catalysis. In an Arrhenius plot [http://en.wikipedia.org/wiki/Arrhenius_plot] the logarithm of activity [http://en.wiktionary.org/wiki/catalytic_activity] (ln k, where k is the initial rate [http://chemwiki.ucdavis.edu/Physical_Chemistry/Kinetics/Virtual%3A_Kinetics/Method_of_Initial_Rates] of Chlide formation in the standard DPOR assay) is plotted versus the reciprocal of the absolute temperature [http://simple.wikipedia.org/wiki/Absolute_temperature] in K.

Over half the semantics are precisely described in Wikipedia, in sufficient detail that our program AMI could be taught to understand them. The only things missing are "DPOR" and "Chlide".
*I* don’t know what they are. There is no shame in ignorance. Can we find out?
They should be mentioned earlier in the paper – let’s scroll back to the abstract…

During chlorophyll and bacteriochlorophyll biosynthesis in gymnosperms, algae, and photosynthetic bacteria, dark-operative protochlorophyllide oxidoreductase (DPOR) reduces ring D of aromatic protochlorophyllide stereospecifically to produce chlorophyllide.

What’s “protochlorophyllide oxidoreductase”? AMI can look in: http://en.wikipedia.org/wiki/Protochlorophyllide_reductase
And Chlide…? Explained in the introduction:

Protochlorophyllide (Pchlide)2 is a central metabolite for the biosynthesis of chlorophylls (Chl) and bacteriochlorophylls (bChl). In photosynthetic organisms two distinct enzymes catalyze the stereospecific reduction of ring D of the aromatic Pchlide to form chlorophyllide (Chlide) (13) (Fig. 1). The first enzyme is the light-dependent Pchlide oxidoreductase (LPOR; NADPH Pchlide oxidoreductase, EC 1.3.1.33).

And this is really valuable. It has an Identifier. An EC number. This highlights the massive work that the biology community has done in providing a semantic infrastructure for public science. Have a look at http://enzyme.expasy.org/EC/1.3.1.33 and marvel at the linked information.
To summarise so far:
Many articles in science are full of semantically supported Facts. There are dictionaries, ontologies and identifier systems. Putting them all together makes a vast semantic resource – the Semantic Web. And bioscientists can make massive use of computers – it's so exciting.
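To make that concrete, here's a toy dictionary-based annotator in the spirit of what AMI could do. The entries are lifted from the caption above; the class itself is my illustration, not AMI code:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal caption annotator: link known terms to their public definitions.
public class CaptionAnnotator {

    static final Map<String, String> DICTIONARY = new LinkedHashMap<>();
    static {
        DICTIONARY.put("activation energy", "http://en.wikipedia.org/wiki/Activation_energy");
        DICTIONARY.put("Arrhenius plot", "http://en.wikipedia.org/wiki/Arrhenius_plot");
        DICTIONARY.put("absolute temperature", "http://simple.wikipedia.org/wiki/Absolute_temperature");
    }

    static void annotate(String caption) {
        String lower = caption.toLowerCase();
        for (Map.Entry<String, String> e : DICTIONARY.entrySet()) {
            if (lower.contains(e.getKey().toLowerCase())) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
        // Terms in no dictionary ("DPOR", "Chlide") get flagged for a human - or for the paper itself.
    }

    public static void main(String[] args) {
        annotate("In an Arrhenius plot the logarithm of activity is plotted versus "
                + "the reciprocal of the absolute temperature in K.");
    }
}

Scale the Map up to a real dictionary or ontology (EC numbers included) and you have the beginnings of semantic markup.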
And chemistry? Well the American Chemical Society tried to close Pubchem and threw the lawyers at Wikipedia to preserve its monopoly of information. Which is now about 20 years behind the cutting edge of scientific information.
So please, ACS, have a change of heart and embrace the public value of semantic chemistry. Because it’s going to happen anyway.
In the next post(s) I'll explain how we get real numbers and labels out of the plot.

Content Mining; Extracting Facts from Plots and how we can save billions – 1

NOTE. This post contains Science. But I hope that everyone can understand the message – you don't have to be a molecular biologist. So please keep reading – it's important to you…
People frequently ask me what is the use of isolated Facts (I'll use this typography, as FACT has an unpleasant association). I'll be giving examples of this and I hope they will come from YOU! But here is one universally valid type of Fact – the X-Y Plot. It's common to call these "graphs" and that's fine, but graph has other meanings, as we'll see in later blogs. Since there are many types of plot (and I'll be showing how we can hack all of them) I'll call this an XYPlot.
PedroS has commented:

February 27, 2014 at 4:15 pm
Would it be very difficult to automaticaly detect graphs in papers and extract the values of the data points? I am now extracting (by hand) those data from an Arhenius plot, which I intend to use for an Eyring plot ;-)
>See graph 3.c from dx.doi.org/10.1074/jbc.M708010200

So here we have the crux. Numeric data are often plotted as an XYPlot. This is a good way of communicating for most humans. They can tell what’s related to what and how well. But then the plot is published as an IMAGE. The humans can still read it but the machines can’t. So the data are lost.

“I am now extracting (by hand)” … the tragedy of our failure of vision in this century of electronic enlightenment.

There are probably 1 million graphs in the literature that people might want to get data from. Let's cost a scientist's time at 500 a.c.u. per day (arbitrary currency units). Let's say it takes PedroS half a day: that's 250 a.c.u. per graph, or 250 million a.c.u. of wasted scientist time across the million graphs. And that excludes the opportunity cost. And when we come to other diagrams (phylogenetic trees, chemistry, bar charts, etc.) it's easily up to billions…
Here’s the paper. I know one or two people connected with J. Biol. Chemistry and some of my close associates have published in it. I think its standards are as high as almost any journal. It’s not primarily Open Access but it makes papers freely readable after a fairly short period. (Please update…).
The paper is Copyright © 2014 by American Society for Biochemistry and Molecular Biology. I am going to copy the plot without their permission and show how to extract the data. I think I can defend it legally but anyway I don’t think they will mind – and I don’t think they will send lawyers. And, in any case, in 1 month it will be legal…
Here's the link to the image, and here's what's in it:
[image: F3.large – Figure 3 of the paper]
(BTW I understand this paper well enough to comment authoritatively on some of it.) This diagram contains four sub-diagrams. This is not done for scientific reasons; it's probably because the authors were charged per diagram. The literature is full of this unnecessary jigsaw of information. So diagrams A and C have no relation – they are bundled to save money and/or pixels.
What does diagram C mean (to a human)? We have to look at a caption. And here it is:

FIGURE 3.

Purification of recombinant C. tepidum DPOR subunits, catalytic activity, activation energy, and CN inhibition assays. A, SDS-PAGE analyses of purified, recombinant BchNB complex and BchL. Lane 1, molecular size marker, masses as indicated (×1000); lane 2, purified GST-BchN complexed with BchB, cell extracts from E. coli BL21(DE3) Codon Plus RIL containing pGEX-bchNBL* after isopropyl β-D-thiogalactopyranoside induction, affinity chromatography on glutathione-Sepharose, extensive washing, and glutathione elution; lane 3, BchN and BchB were recovered from glutathione-Sepharose after proteolytic cleavage; lane 4, E. coli extracts from cells containing pGEX-bchL after isopropyl β-D-thiogalactopyranoside induction, affinity chromatography on glutathione-Sepharose, extensive washing, and glutathione elution. B, absorption spectra of standard DPOR assays using E. coli cell extracts or reconstitution assays after 20 min at 35 °C and acetone extraction. Trace a, standard DPOR assay containing 30 μl of E. coli extract; trace b, assay mixture without dithionite; trace c, assay mixture without ATP; trace d, control reaction without cell extract; trace e, reconstitution assay using 20 μg of purified (BchNB)2 and 20 μg of purified BchL2; trace f, control reaction using 20 μg of (BchNB)2 but no BchL2; trace g, control reaction using 20 μg of BchL2 but no (BchNB)2. C, determination of the activation energy of DPOR catalysis. In an Arrhenius plot the logarithm of activity (ln k, where k is the initial rate of Chlide formation in the standard DPOR assay) is plotted versus the reciprocal of the absolute temperature in K. D, cyanide inhibition of DPOR. The activity of standard DPOR assays (15 min at 35 °C) in the presence of 0–60 NaCN is plotted against the concentration of NaCN. 50% inhibition of DPOR is achieved at 36 mM NaCN.

It's cognitively appalling (in part because of the unnatural format into 4 sub-diagrams). So let's separate out our graph (C). This is trivial for a human – a bit harder for AMI, but possible.

C, determination of the activation energy of DPOR catalysis. In an Arrhenius plot the logarithm of activity (ln k, where k is the initial rate of Chlide formation in the standard DPOR assay) is plotted versus the reciprocal of the absolute temperature in K.

These are really valuable Facts and metadata (metaFacts). I understand half of it. I don't understand "DPOR" or "Chlide". But the rest is standard physical chemistry. (BTW Arrhenius http://en.wikipedia.org/wiki/Svante_Arrhenius was a genius – in his thesis he developed the ionic theory and almost failed because the assessors said it was rubbish. In today's bean-counting universities he would have been thrown out. Fortunately he continued, winning a Nobel prize and predicting the greenhouse effect of Carbon Dioxide.)
The figures and their captions are often the most important part of an article. Looking at this one diagram tells you at least as much about the content as the abstract. Can AMI understand it?
Probably. The first sentence reads:

determination of the activation energy of FoobarEnzyme catalysis

(Foobar is a general placeholder (metasyntactic variable) for some entity). This phrase probably occurs 10,000 times per year in the scientific literature. So we (I mean YOU as well as me) can train our AMI program to recognise it (more in later blogs). With more work we can train AMI to understand the other 3 diagrams.
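A minimal sketch of such a recogniser – just a regular expression with a capture group in the Foobar slot. The class name is mine and the pattern is deliberately naive; a real grammar would be more forgiving:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Recognise "determination of the activation energy of <X> catalysis" and capture X.
public class ActivationEnergyPhrase {

    private static final Pattern PHRASE = Pattern.compile(
            "determination of the activation energy of\\s+(.+?)\\s+catalysis",
            Pattern.CASE_INSENSITIVE);

    static String enzymeOf(String caption) {
        Matcher m = PHRASE.matcher(caption);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(enzymeOf(
                "C, determination of the activation energy of DPOR catalysis.")); // DPOR
    }
}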
So can we reconstruct the data from the plot? Two years ago I thought NO. Now I think absolutely YES. And in the next blogs I'll show you how. We've written much of the software (hackers very welcome!) and then you can use it!

I have been awarded a Shuttleworth Fellowship to change the world; my first reactions

The Shuttleworth Foundation has done me the honour of appointing me as a Fellow, starting today. The remit (http://www.shuttleworthfoundation.org/fellowship/ ) is:

The holy grail of every funder is sustainability, an idea and approach living long after the money has run out. That is why we fund people not projects. The only true way to sustainability is not a business plan but a champion, someone who will drive an idea through an ever changing landscape, to make a real difference in the world.
We are looking for social innovators who are helping to change the world for the better and are seeking support through an innovative social investment model.
 

My new entry is here: http://www.shuttleworthfoundation.org/fellows/current/peter-murray-rust/
This is incredible. I’ve had a week or two to adjust but I’m still finding new ideas, visions, people on a daily basis.  So this is a first reaction.
I am going to change the world for the better. Yes. Over the last few years, when people have asked me what I want to do, I reply "change the world". It's what we should all aspire to. And this is the most concentrated time of innovation in the history of the planet, and it's much easier. In the past heroes such as Diderot had to rely on print to reach people – I can reach millions of people with a few keystrokes.
It’s this ability to create communities that makes us different from our predecessors. As exemplars I look to my own immediate circle of electronic communities: Wikipedia, Mozilla, Open Knowledge Foundation, Creative Commons, Open Rights Group, Blue Obelisk …
All started by people – often just one. And all self-sufficient without their founders. That's my immediate model for sustainability. I don't know exactly *how* it will happen, but I am certain it will. (Certainty is an essential ingredient of success.) So the goal is to build a community of vision and practice.
This year I have undertaken to liberate 100,000,000 FACTs from the scientific/technical/medical literature. FACTs belong to the world, not individuals and not corporations. I use uppercase to stress that they are not protectable as Intellectual Property (IP). FACTs save lives (think helicobacter and ulcers). FACTs help to create new materials. FACTs lead to better decision making (e.g. climate change). FACTs generate new information-based industries which generate new wealth (the 4 billion USD invested in the human genome generated 700 billion of downstream wealth). I've blogged a lot about the Content Mine and I'll be blogging a lot more, of course.
Because it is freely available to everyone on the planet who can connect to the Internet.
But most of all I must thank the Shuttleworth Foundation. They have a wonderful vision and wonderful people. There's a lot I am discovering.
But, simply, they put in the effort to make sure people succeed.
They have a wonderful infrastructure that I suspect few other funding bodies can emulate. I have a very real relationship with Karien Bezuidenhout and Helen Turvey, who run the Fellowship program. We've spent a lot of time bouncing ideas around and I shall be meeting Helen in a few days in London. Karien and I will have virtual meetings twice a month! This can make all the difference to being focussed and setting achievable objectives.
And I know several of the Fellows already. Rufus Pollock (OKFN), Daniel Lombraña González (Crowdcrafting)  …
… and François Grey, who will be my buddy / mentor. This is a wonderful idea. I'm hoping I can visit New York and run a workshop there with his Citizen Science community.
And then there is the community of the Fellowship – again this is a wonderful resource. Fellows come from all disciplines and experience and the cross-fertilisation will be massive. We meet virtually every week and we have 2 physical meetings a year. I’ll be doing a lot of listening.
It's a huge responsibility, but that's absolutely how it should be. I shall give it my best. I cannot know how it will work out in detail. I've a loose group of current collaborators and I'll be talking with Helen about the best way of involving them. We've already plotted some activities.
Massive thanks to those who have helped with my application, acted as sounding boards and acted as referees.
Shuttleworth is the difference between *hoping* your ideas will take root and *knowing* they will.

101 uses for Content Mining

It's often said by detractors and obfuscators that "there is no demand for content mining". It's difficult to show demand for something that isn't widely available and which people have been scared to use publicly. So this is an occasional post to show the very varied things that content mining can do.
It wouldn’t be difficult to make a list of 101 things that a book can be used for. Or television. Or a computer (remember when IBM told the world that it only needed 10 computers?) Content mining of the public Internet is no different.
I'm listing them in the order they come into my head, and varying them. The primary target will be scientific publications (open or closed – FACTs cannot be copyrighted) but the technology can be applied to government documents, catalogues, newspapers, etc. Since most people probably limit "content" to words in the text (e.g. in a search engine) I'll try to enlarge the vision. I'll put in brackets the scale of the problem.

  1. Which universities in SE Asia do scientists from Cambridge work with? (We get asked this sort of thing regularly by ViceChancellors). By examining the list of authors of papers from Cambridge and the affiliations of their co-authors we can get a very good approximation. (Feasible now).
  2. Which papers contain grayscale images which could be interpreted as Gels? A http://en.wikipedia.org/wiki/Polyacrylamide_gel is a universal method of identifying proteins and other biomolecules. A typical gel looks like this: [image: an SDS-PAGE gel, Wikipedia CC-BY-SA]. Literally millions of such gels are published each year and they are highly diagnostic for molecular biology. They are always grayscale and have vertical tracks, so very characteristic. (Feasibility – good summer student project in simple computer vision using histograms.)
  3. Find me papers in subjects which are (not) editorials, news, corrections, retractions, reviews, etc. Slightly journal/publisher-dependent but otherwise very simple.
  4. Find papers about chemistry in the German language. Highly tractable. A typical approach would be to find the 50 commonest words (e.g. "ein", "das", …) in a paper and show the frequency is very different from English ("one", "the", …) – see the sketch after this list.
  5. Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial to extract references and authors. More difficult, of course to disambiguate.
  6. Find uses of the term “Open Data” before 2006. Remarkably the term was almost unknown before 2006 when I started a Wikipedia article on it.
  7. Find papers where authors come from chemistry department(s) and a linguistics department.  Easyish (assuming the departments have reasonable names and you have some aliases (“Molecular Sciences”, “Biochemistry”)…)
  8. Find papers acknowledging support from the Wellcome Trust. (So we can check for OA compliance…).
  9. Find papers with supplemental data files. Journal-specific but easily scalable.
  10. Find papers with embedded mathematics. Lots of possible approaches. Equations are often whitespaced, and the text contains non-ASCII characters (e.g. Greeks, scripts, aleph, etc.) and heavy use of sub- and superscripts. A fun project for an enthusiast.
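Here's the sketch promised in item 4 – a crude stopword-frequency language guesser. The six-word lists are illustrative; a real version would use the 50 commonest words of each language:

import java.util.Arrays;
import java.util.List;

// Crude language guesser: count common German vs English function words.
public class LanguageGuess {

    static final List<String> GERMAN  = Arrays.asList("der", "die", "das", "ein", "und", "nicht");
    static final List<String> ENGLISH = Arrays.asList("the", "a", "and", "of", "in", "not");

    static long hits(String[] words, List<String> stopwords) {
        return Arrays.stream(words).filter(stopwords::contains).count();
    }

    static String guess(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        return hits(words, GERMAN) > hits(words, ENGLISH) ? "German" : "English";
    }

    public static void main(String[] args) {
        System.out.println(guess("Die Struktur und die Synthese der Verbindung …")); // German
    }
}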

So that’s just a start. I can probably get to 50 fairly easily but I’d love to have ideas from…
…YOU
 
[The title may or may not allude to http://en.wikipedia.org/wiki/101_Uses_for_a_Dead_Cat ]

Content Mining Myths 1: "It's too hard for me to do"; no it's easy

One of the many myths about content mining is that it’s difficult and only experts can do it.
Quite the opposite – with the right tools anyone can do it. And in fact most of you do content-mining every day…

  • When you type a phrase into a search engine (Google, Bing)  you are using the mined content of the web. You phrase your question to try to get the most precise, most relevant answers. Agreed, it’s not easy to WRITE a search engine, but it is easy to use one. If we know what questions you want to ask the scientific literature then we can work out how to build the engine.
  • When you use software to examine photographs it can pick out faces. Again it’s not easy to write such software but it’s easy to use it. And that’s what we are doing for chemistry – recognising compounds and reactions in pictures. We’ll present this at the upcoming American Chemical Society meeting in Dallas next month so if you are there you’ll get an idea. It’s only 3 months old but we’ve come a long way.
  • When you search your mail for a name you are mining the content. Again it’s easy to do.

Because content-mining in science has been held back by restrictive practices there are lots of valuable tools waiting to be applied. That’s what we are doing. We expect progress to be rapid. Obviously we’ll appreciate direct help, but we’ll also appreciate general interest.
What do you want to be able to do? What FACTs do you want to extract (or for us to extract and publish)? It won't all be possible, but a huge amount will be.
And when we have tens of thousands of scientists mining the literature and making the results public there will be a huge acceleration.
 
