Robert Terry leaving Wellcome

It was announced that Robert Terry will be leaving Wellcome, and I quote from a private letter (I am sure that’s OK).

I will be leaving the Trust on 21st September to take up a position as
Project Director to develop the health research strategy for the WHO.
Rest assured the Trust will continue with its partners to further all
areas of opening up science and the policy contact from this point
forward will be Nicola Perrin (c.c. above) and Robert Kiley and Chris
Bird remain actively involved in all things open access.

I’d like to emphasize the enormous role that Robert and Wellcome have played in pushing forward Open Access. I am absolutely sure that Wellcome has critical mass and fixity of purpose here, so this should be seen as splitting a mature plant and spreading the meme.

Posted in berlin5, open issues | Leave a comment

Berlin 5: Open access – both easy and difficult

I’ve arrived at Berlin 5 after the welcoming. The Opening Plenary is by Sijbold Noorda
(on behalf of the European University Association) with the theme: “Open access – both easy and difficult”. (PMR: No comments from me).
It’s a simple concept with complicated realities
OA is not new. It’s the analogue of public libraries in the C19: “do in the digital age what publishers and public libraries did in the old days; make accessible to the public what should be public knowledge”.
His themes:

  • digital access
  • business
  • quality
  • archives
  • e-science
  • variety

– Digital access:
Basic rule of all researchers: “make your work digitally traceable, searchable, harvestable”. Otherwise you do not exist.
Everyone can self-archive [some form of their work].
“all you need is a well-connected and well-arranged repository”.
“why do only a few of us practice what we preach?”
– Business:
Subscriptions replaced by advance payments. Only one tiny possibility. Wouldn’t change publishers – the advantage would be that the community would be broader and Open. Cost carried by those who produce knowledge.
Broadcasting mode. It becomes interesting when the business model changes. E.g. SCOAP – the CERN publishing model. Selling rights to publishers. CERN will save ca. 40% on publication costs.
Completely new power structure. Hybrid models are a complication and bring little useful, especially for libraries.
– Clients: OA does not solve pricing issues. The public at large is a client –
physicians in small hospitals and insurers cannot read the literature.
Sending the bills to researchers changes the model: they don’t see the bills under the current model, but they would if they had to pay. Cooperation overcomes fragmentation.
(NB: changing rules is hard for young researchers.)
– Quality: peer hierarchy and review are critical. Reliability matters.
Labelling must be done, and if publishers don’t do it, someone else must. Cannot leave it to the blogosphere.
– Archiving: the faster the innovations, the greater the problems – e.g. reconstructing old IBM and CDC systems from the late 1970s (PMR: I remember them…)
It can and should be handled by public library consortia.
– E-science: data sharing, virtual labs, collaboratories, wikis, multi-media e-learning. Need projects and experiments. Forerunners like the university presses consortium. Barend Mons has a closed virtual community – entry is when you have published 3 peer-reviewed papers anywhere – then you can enter and comment on anything.
What about the book? We will never be able to do without it.
European cooperation may make the difference and has a key role: standards, connections, making academics more independent of publishers.
[Fred Friend: over 30% of UK-funded research is not available to most other UK universities.]

Posted in berlin5, open issues | 1 Comment

Name that graph

I took a snap poll of my colleagues. Two knew it, one worked it out gradually, and two did not know it. I think every human in education above the age of 10 should know it.
I am writing an article for Nature on “Open Chemistry”. Here is the first paragraph of the first draft:

I am writing this article having just come back from the sixth annual UK All Hands meeting on [E-Science]. Several hundred delegates met to discuss how the Internet (and extensions such as the [Grid]) could support and change the way we do science. The message of one plenary lecture, “Digital Earth: The New Digital Commons” [1] by Timothy Foresman, was starkly simple. Unless we act, the higher organisms on the planet will die. This is not a conjecture; our data are sufficiently comprehensive and our simulation tools sufficiently good and cheap to make it a certain scientific fact. On the positive side, if we make every scrap of scientific research and knowledge fully public and develop and use collaborative tools across the planet, we have a chance. The Internet may have come just in time.

Science is multidisciplinary. The [Keeling_curve] – the most beautiful and terrible graph of the twentieth century (Fig 1) links chemical bonding, spectroscopy, quantum mechanics, thermodynamics, fluid dynamics, meteorology, geology, astronomy, reactions, biochemistry, biology, oceanography, etc. directly to transport, economics, finance, politics, psychology and much more. It epitomises [eScience] which seeks to develop the tools, the content and the social science to support multidisciplinary collaborative science. How can we gather the data, formalize its representation, build the computational support, grow the community and share the results in forms that are appropriate to each reader? And who are the readers? Not just professional scientists, but children, senior citizens, lawmakers and funders in every country. And not only humans. To gather and manage this data and simulation we must make it accessible to machines since we cannot cope with the scale and complexity unaided.

mauna_loa_carbon_dioxide.png

Figure 1. http://en.wikipedia.org/wiki/Keeling_curve

I shall talk about some of this at Berlin 5.  Simply, Open Data might help save the planet. Any person or organization who fails to make data – of any sort – Open will have to answer to whatever comes after us.

Posted in berlin5 | Leave a comment

US citizens: act!

Essential action from Peter Suber’s blog:

The American Library Association has created an action alert to simplify the process of asking your Senators to support the strengthening of the NIH policy.  However, it doesn’t contain a default message and requires users to compose their own or paste one in.
Charles Bailey has solved this problem with a strong, ready-to-paste message based on public texts (in particular, this one and this one) by the ALA, ARL, ATA, and SPARC.
No more excuses.  If you’re a US citizen, please contact your Senators before September 28, and please spread the word to others.  (Thanks, Charles!)

Posted in open issues | Leave a comment

Name that graph (acknowledgements to Rich)

Rich Apodaca has an excellent series of graphs (e.g. Name That Graph) where he has removed key annotations (titles, units, axes, etc.). I’m not going to steal his theme, but there is one graph that I hope my readership is familiar with. I’ve been using it in an article – more later – and will also blog before the article appears. So, with apologies to Rich, what’s this?
mauna1.PNG
A clue. It’s about chemistry. But you don’t need to be a chemist… and since most of you should know the answer, please don’t post it.

Posted in ahm2007, data | 3 Comments

Peter Murray-Rust: Prospect and Nessie and OSCAR

I am delighted to congratulate the Royal Society of Chemistry on their award for Project Prospect. Prospect is one of the first examples of true semantic publishing. We’re pleased to have been closely involved – 5 years ago David James and Alan McNaught of RSC funded two summer students, Fraser Norton and Joe Townsend, to look at creating a tool to check data in RSC publications.

It was a boring job – they had to read over a hundred papers and tabulate things like how the authors recorded the Molecular Weight, analytical data, etc. The RSC gave guidelines to which authors would conform, which of course they did (about 5% of the time!). “MP”, “MPt.”, “Melting Point”, “M.P.” show the sort of variation. So OSCAR-1 had complex regular expressions (tools to search for variable lexical forms) to check the chemistry.
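To give a flavour of what “variable lexical forms” means in practice, here is a minimal sketch in Python of the kind of pattern involved. It is an illustration only, not the actual OSCAR-1 code, and the pattern and sample sentence are invented:

```python
import re

# Illustrative pattern only (not OSCAR-1): tolerate the common ways authors
# write "melting point" ("MP", "M.P.", "MPt.", "Melting Point") followed by
# a value or range, with or without units.
MP_PATTERN = re.compile(
    r"\b(?:m\.?\s*p\.?|mpt\.?|melting\s+point)\s*[:=]?\s*"   # the label
    r"(?P<low>\d{1,3}(?:\.\d+)?)"                            # first value
    r"(?:\s*[-–]\s*(?P<high>\d{1,3}(?:\.\d+)?))?"            # optional range end
    r"\s*(?:°\s*C|deg\.?\s*C)?",                             # optional units
    re.IGNORECASE,
)

def find_melting_points(text):
    """Return (low, high) tuples for every melting-point mention found."""
    hits = []
    for m in MP_PATTERN.finditer(text):
        low = float(m.group("low"))
        high = float(m.group("high")) if m.group("high") else low
        hits.append((low, high))
    return hits

# Invented example sentence:
print(find_melting_points("White solid, M.P. 123-125 °C; lit. MPt. 124 deg C."))
# -> [(123.0, 125.0), (124.0, 124.0)]
```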
The next year Joe was joined by Chris Waudby [*** AND SAM ADAMS ***] [1] and together they built a large and very impressive Java framework for reading documents and extracting the data. This tool, running perfectly after 5 years with zero maintenance, is mounted on the RSC site and has been used by hundreds or thousands of authors and reviewers.
In the same summer, Timo Hannay from Nature Publishing Group funded Vanessa deSousa, who created Nessie. Nessie used OSCAR tools to find likely chemical names in text and then search for them in a local lexicon downloaded from the NIH/NCI database. Where this failed, Nessie would search chemical suppliers’ sites through the ChemExper website, often reaching recall rates of ca. 60%. This was an impressive case of gluing together several components.
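For readers who like to see the shape of such a pipeline, here is a minimal sketch of a Nessie-style lookup: local lexicon first, supplier search as a fallback. This is not the original Nessie code; the lexicon file format and the fallback stub are assumptions made for illustration:

```python
def load_lexicon(path):
    """Load a local name -> identifier lexicon, one tab-separated pair per line
    (hypothetical format; Nessie's lexicon came from the NIH/NCI database)."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            name, identifier = line.rstrip("\n").split("\t", 1)
            lexicon[name.lower()] = identifier
    return lexicon

def lookup_supplier(name):
    """Placeholder for the fallback search against a supplier aggregator
    (Nessie used ChemExper); here it simply reports failure."""
    return None

def resolve_names(candidate_names, lexicon):
    """Resolve candidate chemical names: lexicon first, web fallback second."""
    resolved, unresolved = {}, []
    for name in candidate_names:
        identifier = lexicon.get(name.lower()) or lookup_supplier(name)
        if identifier:
            resolved[name] = identifier
        else:
            unresolved.append(name)
    return resolved, unresolved
```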
These successes became widely known and as a result we were able to collaborate with Ann Copestake and Simone Teufel in the Computer Laboratory and get a grant for SciBorg, the chemist’s amanuensis. The idea is – inter alia – to have a chemically intelligent desktop which “understands” chemical natural language sufficiently to add meaning for readers of documents. The RSC continued to sponsor us in partnership with Nature and the International Union of Crystallography.
Peter Corbett joined us on the projects and has taken the previous OSCARs and created OSCAR3, which can now recognise chemical terms in over 80% of cases. (I must be more precise; given a corpus, OSCAR3 will recognise which words or phrases conform to a 30-page definition of chemical language with a combined precision-recall of that magnitude. I do not believe there is a superior tool when measured like this. It will be able to provide a connection table with rather lower frequency – depending partly on the specificity or generality of the language.)
Last summer Richard Marsh and Justin Davies – sponsored by RSC and Unilever – returned to the data checking aspect of OSCAR and created the prototype of a semantic chemical authoring tool. This can read a paper and flag up many types of error, and also help authors create data where certain types of error are impossible. The next phase of this is evolving and we’ll tell you more shortly.
We have frequent contact with the RSC, particularly with Richard Kidd and Colin Batchelor, who visit often. They have done much of the design and implementation of Prospect, and it has required a significant effort and investment.
But it’s worth it. The data quality is higher, and the ease of understanding and searching is enhanced. It’s easier to turn into semantic content. There’s a way to go yet – we need to address reactions, and data with scientific units of measurement – but it starts to look easier.
After all there is probably another prize to be won next year…

[Note: I later overwrote this by mistake after publishing but found it on Planet Scifoo. Saved from having to rekey for the scholarly record!]
[1] SAM, I AM VERY SORRY FOR MISSING YOU OUT IN THE FIRST DRAFT. p

Posted in semanticWeb | 1 Comment

change because old scientists die

Tobias Kind has asked (Comment to Nature Protocols: How much can we re-use?) why we shouldn’t require chemists to submit data…

Hi Peter,
making chemistry data machine-readable is not the business of the publisher! It’s the business of the chemists themselves and it should be a requirement from editorial boards and reviewers. If chemists have to submit molecular structures and chemical property data before publication (a common fact for modern life sciences – compared to old-style chemistry) there would be no need to run any hamburger to cow algorithm like OSRA, Kekule, CLiDE, ChemOCR or Oscar. Beware(!), these are all sophisticated algorithms but their use could be avoided for new publications if raw data + metadata is directly submitted to a not yet functioning international open data chemistry repository.

You can check out GenBank: “Many journals require submission of sequence information to a database prior to publication so that an accession number may appear in the paper.” http://www.ncbi.nlm.nih.gov/Genbank/

As long we keep reviewing journals without requesting that molecular structures and metadata and spectra and molecular property data are made publicly available and as long we serve in editorial boards of journals which don’t require submission of original molecule data and other molecular property data in machine readable format its our own fault. All this will be a painful process but it will come; it’s also a process of teaching the young chemists.

The upcoming ticket system for chemistry publications requiring a accession number for each publication will be nice topic for the BlueObelisk; Don’t you think so, or is that too radical for you ;-)
http://blueobelisk.sourceforge.net/wiki/index.php/Open_Data_in_Chemistry

Kind regards
Tobias Kind
fiehnlab.ucdavis.edu

No, not too radical at all. Max Planck wrote: Scientific theories don’t change because old scientists change their minds; they change because old scientists die. And that’s the case here.
But I am not radical enough to hasten the death literally, so how do we ensure that old chemists die metaphorically? Note that I don’t want chemistry to die, and it will survive elsewhere – in biology, materials science, neuroscience and even computing. But many chemists are doing a good job of destroying the current edifice with their lack of interest in the current century’s informatics, their closed access publishing, their corralled information which is sold rather than freed, their lack of multidisciplinary collaborations. Of course there are exceptions, but if you look at eScience, publishing and informatics, where is mainstream chemistry?
So we’ll devise methods and protocols elsewhere. The biologists have taken over the chemical ontosphere – we use PubChem and ChEBI, created by and for the biosciences. They’ll create the bits of the chemical semantic web they need – and I’ll be happy for that.
No, it’s not the role of the publishers to convert data, and I didn’t mean to suggest it. The publishers should be the servants of the community, and the community should ask them to police deposition of semantic documents. There is, I think, an interim period when we need messy measures such as converting legacy content to show the way forward. OSRA, reverse-hamburgers and the rest are simply there to prove the semantics and show the value. We know that all we have to do is invest in authoring tools and require them. Very shortly we’ll let you know what we are doing in that area.

Posted in data, open issues, semanticWeb, XML | Leave a comment

Nature Protocols: How much can we re-use?

In my last post (Nature: How much content can our robots access?) I asked general questions about which data in a scientific article, if any, publishers would not allow humans and robots to use without permission. So, as an example, I’m asking Nature what parts of a data-rich article can be re-used automatically.
The model I have is that our robots can access electronic publications, extract data, and re-use it. Ideally we would like XML, but text and PDF can give good (though not perfect) results. Nature has recently brought out “Protocols”, which publishes recipes. I think this is a great idea – not enough credit is given to the actual laboratory work done in many subjects. So I give some snapshots (claiming fair use, as the paper is freely readable and I haven’t taken the whole) to show what our robots can do, and ask what I am permitted to do:
nature0.PNG
That’s the abstract. I expect it’s on PubMed, so perhaps my robots can mine it for chemicals? (A toy sketch of what such mining might look like follows below.)
YES/NO?
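To make the question concrete, here is a toy sketch of the kind of robot access I mean: fetch an abstract from PubMed via the NCBI E-utilities and scan it for chemical-looking words. The PMID is a placeholder (not the Protocols paper) and the crude suffix pattern is a stand-in for OSCAR, not our actual pipeline:

```python
import re
import urllib.request

# NCBI E-utilities "efetch" endpoint, plain-text abstract output.
EFETCH = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
          "?db=pubmed&rettype=abstract&retmode=text&id={pmid}")

def fetch_abstract(pmid):
    """Download the plain-text abstract record for a PubMed ID."""
    with urllib.request.urlopen(EFETCH.format(pmid=pmid)) as response:
        return response.read().decode("utf-8")

# Toy heuristic: words ending in common chemical suffixes. A real robot would
# hand the text to OSCAR instead.
CHEMICAL_LIKE = re.compile(r"\b[A-Za-z][a-z]{3,}(?:ane|ene|yne|ol|one|ine|ate|ide)\b")

if __name__ == "__main__":
    text = fetch_abstract("12345678")   # placeholder PMID
    print(sorted(set(CHEMICAL_LIKE.findall(text))))
```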
nature1.PNG
This is one of the molecules mentioned in the paper (there are 10-20). A diagram like this is the normal way of communicating the structure of a molecule, but technically it’s an image. My robots can download it and interpret the structure. Are the robots allowed to download and extract the molecular data without permission:
YES/NO
So then the protocol shows what actually happens in the reaction, and notes that there is a critical patch of colour:
nature2.PNG
Here the only way of communicating the FACT of the reaction colour is to show an image. This isn’t a creative work of art (though there are good photos and bad ones) but is an ESSENTIAL part of the record. There are now systems which could be trained to browse the Internet and return all flasks which contain a greenish liquid. Can we download and analyse images of experimental results:
YES/NO:
And here is the spectral data.
nature4.PNG
Can my robots download this and turn it into semantic data?
And this is the sort of thing that OSCAR can produce: it has taken part of the text and identified all the compounds (and worked out the atomic connectivity – see “furan” at the right).
nature3.PNG
All of this is enormously valuable. If we are allowed to do it we can answer questions like “what reactions involve furans or pyrroles, use sodium/ammonia, and turn a light green colour”.
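As a minimal sketch of what answering such a question could look like once the chemistry has been extracted into semantic form – the record structure and the example records below are invented for illustration, not OSCAR output:

```python
# Hypothetical extracted reaction records (invented for illustration).
reactions = [
    {"id": "rxn-1", "substructures": {"furan"},   "reagents": {"sodium/ammonia"},
     "observed_colour": "light green"},
    {"id": "rxn-2", "substructures": {"pyrrole"}, "reagents": {"palladium"},
     "observed_colour": "colourless"},
]

def matching_reactions(records, substructures, reagent, colour):
    """Return ids of records mentioning any of the substructures,
    the given reagent, and the given observed colour."""
    return [r["id"] for r in records
            if r["substructures"] & substructures
            and reagent in r["reagents"]
            and r["observed_colour"] == colour]

print(matching_reactions(reactions, {"furan", "pyrrole"},
                         "sodium/ammonia", "light green"))   # -> ['rxn-1']
```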
But it will only happen if publishers let us have access to their data without any barrier – no written permissions, no hangup over scientific photographs and images.
I’ve put Timo on the spot, but I trust him to be clear. That will make it easier when asking the other publishers later.

Posted in open issues | Leave a comment

Nature: How much content can our robots access?

In this blog (Copyrighted Data: replies [1]; Wiley and eMolecules: unacceptable; an explanation would be welcome [2]) and elsewhere we have been discussing the “copyright” of factual information, or “data”. In [2] I ask a major publisher whether copyright applies to some or all of the factual scientific record they publish. So far I have had no reply. Here I ask another, Nature, who – at least through Timo Hannay – have been very helpful in discussing aspects of publication (most other publishers have been silent).
The issue arises in “supplemental data” or “supporting information” which is the factual record of the experiment – increasingly required as proof of correctness. Some major publishers (Royal Soc Chemistry, Int. Union of Crystallography, Nature) do not claim copyright over this; others such as American Chemical Society and Angewandte Chemie (Wiley) appear to do so, though I haven’t had a definitive public statement from either.
Our vision for the future is that a large part of published scientific data could be made directly machine-understandable, if the publishers collaborate in this. This would mean that we would have semantic knowledge accessible through engines such as Google, Metaweb, OpenLink, etc. We have built technology (OSCAR) that can extract 80% of the chemistry from scientific publications, and it could index the whole literature in a few days using systems such as Condor. I heard yesterday of an image-recognition system which can scan Flickr for photos of “family of four”, “red flower”, “cat and dog”, for example (feature recognition, not tagging). It looks straightforward to ask for papers which contain images of “gel”, “protein surface (GRASP)”, “aligned sequences”, “dose response curve”, “chemical formulae” etc. These are sufficiently stereotyped that I am sure this is possible – and even if it isn’t we should try.
But many publishers will simply forbid us to do this. Wiley claims graphs are their copyright. I expect they do this for gels, protein images, etc. If I ask I expect the average publisher will say you have to apply separately for permission for every image. This, of course, is impossible for a robot – we want to index a million images in a day.
The information itself is not copyright. If I sit down with a keyboard I can retype all the factual information into abstracts, collections, etc. But only the stuff that can be entered on a keyboard. Not images. Not graphs. Those I have to disassemble into words or numbers, which – in the C21 – is grotesque.
So I am going to ask Nature what I can do and what I can’t. What my robots can do and what they can’t. If the answer is not “YES” to a question it is “NO” – there can be no “middle ground” for robots. If you don’t know then the answer is NO. If I have to ask for permission the answer is NO.
As background I want to praise Timo and colleagues for their support for us over the years – they have funded a summer student, and also given us an XML corpus for our SciBorg project (on which they also advise). They understand the vision.
This is already a long post, so the details and questions will be in the next one.
I’ll be using the new – and I think exciting and valuable – Nature Protocols as an example – in particular http://www.nature.com/nprot/journal/v2/n8/full/nprot.2007.245.html
which I think is currently Freely Accessible though not (in my language, and in BBB terms) full Open Access.

Posted in open issues | 1 Comment

PRISM: Cambridge UP distances itself

As readers will know, I have written Open letters about PRISM to publishers with whom I have a connection. I am pleased to report that I have a clear response from CUP, the University’s press. Interestingly – and I shall comment on this in later posts – my normal email didn’t get through (I think this may not be uncommon). Stephen Bourne of CUP writes:

I heard about PRISM launch for the first time last week, from which you will understand that Cambridge University Press has in no way been involved in,  or consulted on, the PRISM initiative.
PRISM’s message is over-simplistic and ill-judged, with the unwelcome consequence of creating tension between the publishing community and the proponents of Open Access. While there will inevitably be some differences in philosophy between those two communities, we believe that there will always be a place for alternative forms of publishing, and that we can and should co-exist. With that objective in mind, Cambridge University Press has long been constructively engaged in Open Access issues, the Cambridge Open Option (http://www.journals.cambridge.org/action/forAuthors?page=open# ) being an example of one of the forms of publishing we offer. We will continue to be open to discussions on this matter.
Yours sincerely,
Stephen Bourne

PMR: Thank you, Stephen, for a very clear statement.

Posted in open issues | Leave a comment