Content Mining; Extracting Facts from Plots and how we can save billions - 1

NOTE. This plot contains Science. But I hope that everyone can understand the message - you don't have to be a molecular biologist. So please keep reading - it's important to you...

People frequently ask me what is the use of isolated Facts (I'll use this typography as FACT has an unpleasant association). I'll be giving examples of this and I hope they will come from YOU! But here is one of the universally valid types of Fact - the X-Y Plot. It's common to call these "graphs" and that's fine, but graph has other meanings as we'll see in later blogs. Since there are many types of plot (and I'll be showing how we can hack all of them) I'll call this an XYPlot.

PedroS commented (February 27, 2014 at 4:15 pm):

Would it be very difficult to automatically detect graphs in papers and extract the values of the data points? I am now extracting (by hand) those data from an Arrhenius plot, which I intend to use for an Eyring plot ;-)
>See graph 3.c from dx.doi.org/10.1074/jbc.M708010200

So here we have the crux. Numeric data are often plotted as an XYPlot. This is a good way of communicating for most humans. They can tell what's related to what and how well. But then the plot is published as an IMAGE. The humans can still read it but the machines can't. So the data are lost.

"I am now extracting (by hand)" ... the tragedy of our failure of vision in this century of electronic enlightenment.

There are probably 1 million graphs in the literature that people might want to get data from. Let's cost a scientist's time at 500 a.c.u. per day (arbitrary currency units). Let's say each graph takes half a day, as it might for PedroS. That's 250 a.c.u. per graph, or 250 million a.c.u. of wasted scientist time in total. And that excludes the opportunity cost. And when we come to other diagrams (phylogenetic trees, chemistry, bar charts, etc.) it's easily up to billions...
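The back-of-envelope estimate can be written out as a minimal sketch. The inputs are just the guesses from the paragraph above (1 million graphs, 500 a.c.u. per day, half a day per graph); the function name is made up for illustration.

```python
def wasted_cost(n_plots: int, day_rate: float, days_per_plot: float) -> float:
    """Total cost (in a.c.u.) of extracting data by hand from n_plots plots."""
    return n_plots * day_rate * days_per_plot


# The guesses used in the text: 1 million graphs, 500 a.c.u./day, half a day each.
total = wasted_cost(n_plots=1_000_000, day_rate=500, days_per_plot=0.5)
print(f"{total:,.0f} a.c.u.")  # 250,000,000 a.c.u.
```

Change any of the three guesses and the total scales linearly, which is why adding other diagram types pushes the figure up so quickly.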

Here's the paper. I know one or two people connected with J. Biol. Chemistry and some of my close associates have published in it. I think its standards are as high as almost any journal. It's not primarily Open Access but it makes papers freely readable after a fairly short period. (Please update...).

The paper is Copyright © 2014 by American Society for Biochemistry and Molecular Biology. I am going to copy the plot without their permission and show how to extract the data. I think I can defend it legally but anyway I don't think they will mind - and I don't think they will send lawyers. And, in any case, in 1 month it will be legal...

Here's the image, and here's what's in it:

[Image: F3.large]

(BTW I understand this paper enough to comment authoritatively on some of it.) This diagram contains four sub-diagrams. This is not done for scientific reasons; it's probably because the authors were charged per diagram. The literature is full of this unnecessary jigsaw of information. So diagrams A and C have no relation - they are bundled to save money and/or pixels.

What does diagram C mean (to a human)? We have to look at a caption. And here it is:

FIGURE 3.

Purification of recombinant C. tepidum DPOR subunits, catalytic activity, activation energy, and CN- inhibition assays. A, SDS-PAGE analyses of purified, recombinant BchNB complex and BchL. Lane 1, molecular size marker, masses as indicated (×1000); lane 2, purified GST-BchN complexed with BchB, cell extracts from E. coli BL21(DE3) Codon Plus RIL containing pGEX-bchNBL* after isopropyl β-D-thiogalactopyranoside induction, affinity chromatography on glutathione-Sepharose, extensive washing, and glutathione elution; lane 3, BchN and BchB were recovered from glutathione-Sepharose after proteolytic cleavage; lane 4, E. coli extracts from cells containing pGEX-bchL after isopropyl β-D-thiogalactopyranoside induction, affinity chromatography on glutathione-Sepharose, extensive washing, and glutathione elution. B, absorption spectra of standard DPOR assays using E. coli cell extracts or reconstitution assays after 20 min at 35 °C and acetone extraction. Trace a, standard DPOR assay containing 30 μl of E. coli extract; trace b, assay mixture without dithionite; trace c, assay mixture without ATP; trace d, control reaction without cell extract; trace e, reconstitution assay using 20 μg of purified (BchNB)2 and 20 μg of purified BchL2; trace f, control reaction using 20 μg of (BchNB)2 but no BchL2; trace g, control reaction using 20 μg of BchL2 but no (BchNB)2. C, determination of the activation energy of DPOR catalysis. In an Arrhenius plot the logarithm of activity (ln k, where k is the initial rate of Chlide formation in the standard DPOR assay) is plotted versus the reciprocal of the absolute temperature in K. D, cyanide inhibition of DPOR. The activity of standard DPOR assays (15 min at 35 °C) in the presence of 0–60 NaCN is plotted against the concentration of NaCN. 50% inhibition of DPOR is achieved at 36 mM NaCN.

It's cognitively appalling (in part because of the unnatural formatting into 4 sub-diagrams). So let's separate out our graph (C). This is trivial for a human - a bit harder for AMI but it's possible.

C, determination of the activation energy of DPOR catalysis. In an Arrhenius plot the logarithm of activity (ln k, where k is the initial rate of Chlide formation in the standard DPOR assay) is plotted versus the reciprocal of the absolute temperature in K.
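To give a feel for how a machine might pull panel C out of a bundled caption, here is a minimal sketch. It is not how AMI does it; it simply assumes that each panel starts with a single capital letter followed by a comma, at the start of the caption or right after a full stop. The caption below is a shortened version of the real one, and the names `PANEL_LABEL` and `split_panels` are made up for illustration.

```python
import re

# Assumed convention: a panel label is one capital letter followed by ", ",
# at the very start of the caption or directly after a sentence-ending full stop.
PANEL_LABEL = re.compile(r"(?:^|(?<=\.)\s+)([A-Z]), ")


def split_panels(caption: str) -> dict[str, str]:
    """Map each panel label (A, B, ...) to the text of its sub-caption."""
    matches = list(PANEL_LABEL.finditer(caption))
    panels = {}
    for i, m in enumerate(matches):
        # A panel's text runs up to the start of the next label (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(caption)
        panels[m.group(1)] = caption[m.end():end].strip()
    return panels


# Shortened caption for illustration only.
caption = (
    "Purification of recombinant C. tepidum DPOR subunits. "
    "A, SDS-PAGE analyses of purified, recombinant BchNB complex and BchL. "
    "B, absorption spectra of standard DPOR assays. "
    "C, determination of the activation energy of DPOR catalysis. "
    "D, cyanide inhibition of DPOR."
)
panels = split_panels(caption)
print(panels["C"])
```

Note that "C. tepidum" is not mistaken for a panel label, because the letter is followed by a full stop rather than a comma. Real captions are messier, so a rule like this would need checking against many journals.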

These are really valuable Facts and metadata (metaFacts). I understand half of it. I don't understand "DPOR" or "Chlide". But the rest is standard physical chemistry. (BTW Arrhenius (http://en.wikipedia.org/wiki/Svante_Arrhenius) was a genius - in his thesis he developed the ionic theory and almost failed because the assessors said it was rubbish. In today's bean-counting universities he would have been thrown out. Fortunately he continued, winning a Nobel prize and predicting the greenhouse effect of carbon dioxide.)
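For readers who want the physical chemistry spelled out: the Arrhenius equation is k = A exp(-Ea/(R T)), so ln k = ln A - (Ea/R)(1/T). Plotting ln k against 1/T gives a straight line whose slope is -Ea/R, and that is why the (1/T, ln k) points are worth having as numbers. Here is a minimal sketch of the fit. The data points are synthetic (generated from an assumed Ea of 50 kJ/mol), NOT values from the paper, and the function names are made up for illustration.

```python
R = 8.314  # gas constant, J mol^-1 K^-1


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def activation_energy(inv_T, ln_k):
    """Return (Ea in J/mol, ln A) from Arrhenius-plot points (1/T, ln k)."""
    slope, intercept = fit_line(inv_T, ln_k)
    return -slope * R, intercept  # slope of the plot is -Ea/R


# Synthetic data, generated from an assumed Ea and ln A (not from the paper).
assumed_Ea = 50_000.0  # J/mol
assumed_ln_A = 20.0
temps = [283.15, 293.15, 303.15, 313.15]  # K
inv_T = [1 / T for T in temps]
ln_k = [assumed_ln_A - assumed_Ea / (R * T) for T in temps]

Ea, ln_A = activation_energy(inv_T, ln_k)
print(f"Ea = {Ea / 1000:.1f} kJ/mol")
```

With the points in hand this takes milliseconds; the half day goes into reading the numbers off the image.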

The figures and their captions are often the most important part of an article. Looking at this one diagram tells you at least as much about the content as the abstract. Can AMI understand it?

Probably. The first sentence reads:

determination of the activation energy of FoobarEnzyme catalysis

(Foobar is a general placeholder (metasyntactic variable) for some entity). This phrase probably occurs 10,000 times per year in the scientific literature. So we (I mean YOU as well as me) can train our AMI program to recognise it (more in later blogs). With more work we can train AMI to understand the other 3 diagrams.
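As a very crude stand-in for that recognition step, here is a minimal sketch using a plain regular expression. This is an assumption for illustration, not how AMI is trained; the names `PHRASE` and `find_entity` are made up.

```python
import re

# The stock phrase, with the Foobar slot captured as a named group.
PHRASE = re.compile(
    r"determination of the activation energy of (?P<entity>.+?) catalysis",
    re.IGNORECASE,
)


def find_entity(sentence: str):
    """Return whatever fills the Foobar slot, or None if the phrase is absent."""
    m = PHRASE.search(sentence)
    return m.group("entity") if m else None


print(find_entity("C, determination of the activation energy of DPOR catalysis."))  # DPOR
```

A fixed pattern like this only catches the exact wording; variants ("activation energies", "measurement of ...") are where training on many examples would come in.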

So can we reconstruct the data from the plot? Two years ago I thought NO. Now I think absolutely YES. And in the next blogs I'll show you how. We've written much of the software (hackers very welcome!) and then you can use it!
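Until then, here is one small piece of the general idea, as a minimal sketch and NOT a description of our software: once two tick marks on each axis have been located in the image, a linear map turns pixel positions into data values. All pixel positions and tick values below are invented for illustration, and the map only applies to linear axes.

```python
def axis_map(pix_a, val_a, pix_b, val_b):
    """Linear map from a pixel coordinate to a data value, fixed by two ticks."""
    scale = (val_b - val_a) / (pix_b - pix_a)
    return lambda pix: val_a + (pix - pix_a) * scale


# Invented calibration: two ticks per axis (pixel position, axis value).
x_of = axis_map(100, 3.2e-3, 500, 3.6e-3)  # 1/T axis, in K^-1
y_of = axis_map(400, -2.0, 50, 1.5)        # ln k axis; pixel y grows downward

# A data marker whose centre was found at pixel (300, 225).
point = (x_of(300), y_of(225))
print(point)  # approximately (0.0034, -0.25)
```

Finding the ticks and the marker centres in the first place is the hard part.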

 

 

2 thoughts on “Content Mining; Extracting Facts from Plots and how we can save billions - 1”

    1. pm286 Post author

      I think we are saying the same thing. If it is *used* it's worth billions. If it's not allowed to be used it's a waste.
