petermr's blog

A Scientist and the Web


Archive for August, 2010

#solo10: GreenChainReaction – volunteers start to process patents

Tuesday, August 31st, 2010

Scraped/typed into Arcturus

We now have our “beta”-production software for the Green Chain Reaction. We’ve put about 10,000 solvents on the web page. But – if more volunteers take part in the next 2-3 days – that could be multiplied by 20 at least.

The GCR is much more than answering a chemical question (albeit an important one). It’s about Open data on the web. If you get involved, then you’ll start to understand.

As an example, what we are doing – in a sort of human mockup – is what Google and other search engines do. It’s called Map-Reduce. If you take part in GCR you’ll understand what Map-Reduce is – painlessly and, hopefully, entertainingly.
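The idea can be sketched in miniature. Here is a toy Python illustration of Map-Reduce (the function names and solvent list are mine, not the project’s): each volunteer “maps” a batch of patent texts to solvent counts, and the server “reduces” by merging everyone’s counts into one total.

```python
from collections import Counter
from functools import reduce

# Each volunteer "maps" one weekly batch of patent texts to solvent counts.
def map_batch(texts, solvents):
    counts = Counter()
    for text in texts:
        for solvent in solvents:
            counts[solvent] += text.lower().count(solvent)
    return counts

# The server "reduces" by merging every volunteer's counts into one total.
def reduce_counts(all_counts):
    return reduce(lambda a, b: a + b, all_counts, Counter())

solvents = ["ethanol", "toluene", "water"]
week1 = map_batch(["dissolved in ethanol and water"], solvents)
week2 = map_batch(["washed with toluene, then ethanol"], solvents)
total = reduce_counts([week1, week2])
# total["ethanol"] == 2; total["water"] == 1; total["toluene"] == 1
```

The point is only the shape of the computation: the map step is embarrassingly parallel (each volunteer works independently on their own week), and the reduce step is a simple merge that can happen in any order.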

I must thank our several volunteers. They have all played an important part and will continue to do so. They’ve written code, given advice, given moral support (very important at 0200), and shown that – if we want – we can change the way that scientific information is collected, analysed and processed. I’ll explain more later – but I’ve still got code to write.

The steps are simple:

  • Download the software and ancillary files
  • Download some patent indexes
  • Run the software
  • Possibly eyeball the results
  • Run software to upload the results

That’s it. It probably takes 2 hours to get familiar with what has to be done, and make the first-time mistakes. Then there is an elapsed time of ca. 1 hour running the job. We are getting most of the bugs out of the system so most things now work out of the box – as long as you use the current box!

I now have to create the tools to display the results. Challenging, but we have some good infrastructure.

Does “Open API” mean anything?

Sunday, August 29th, 2010

Roderic Page has discussed Mendeley’s “Open API” and decided (as do I [Mendeley, Scopus, Talis – will you be making your data Open?]) that whether the code is Open is less important than whether the data is Open.

This is a good opportunity to address the almost meaningless phrase “Open API”. “Open” is often used as a marketing phrase, similar to “healthy”, “Green”, “safe” or many others. To show the problem, here’s a typical experience. I was at a LIS conference and a company which provides library tools and services was trumpeting its “Open API”. So I asked:

  • “Was the API publicly available?” … No
  • “Could they let me have details?” … No
  • “Why not?” … Because their competitors might see it

As far as I could gather (and this is not an isolated case – I’ve seen it in chemical software companies) – “Open” means something like:

“if you purchase our software we will document our trade secrets for you if you are prepared to contract that you will not divulge any details”

My complaint is not with this per se (although it generally restricts innovation and constructive feedback) but with the use of “Open API” to describe it. “Open” surely means more than “we have documented our software architecture and you can buy the documentation under a secrecy agreement”.

It will be no surprise that I use “Open” carefully, consistent with the OKF’s Open Definition:

The Open Knowledge Definition (OKD) sets out principles to define ‘openness’ in knowledge – that’s any kind of content or data ‘from sonnets to statistics, genes to geodata’. The definition can be summed up in the statement that “A piece of knowledge is open if you are free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.

The Open Software Service Definition (OSSD) defines ‘openness’ in relation to online (software) services. It can be summed up in the statement that “A service is open if its source code is Free/Open Source Software and non-personal data is open as in the Open Knowledge Definition (OKD).”.

The second is important in this case as well as the first.

I have no particular quarrel with Mendeley – they are innovating by putting power in the hands of clients and that’s genuinely useful. But unless they actually are Open according to the OKD then they aren’t giving us much (and this applies to many companies – and almost all in chemistry). At present “Open API” seems to mean:

“We will show you how you can query the data in our system without fee”. This is free beer, not Libre. It’s not an Open service, it’s a free service. Here are some common features of Free but not Open services:

  • They can be withdrawn at any time. This is extremely common in Free services – it’s happened innumerable times in chemical information systems. A company will offer free access and then later offer it only to subscribers, or offer only a very limited service (a handful of queries)
  • There are no explicit rights for re-using the results. Here’s a typical phrase from SureChem (a Macmillan subsidiary):

    Unless you have SureChem’s prior written permission you are not permitted to copy, broadcast, make available to the public, show or play in public, adapt or change in any way the material (or any part of it) contained on this Web Site for any purpose whatsoever. Unless you are SureChem Pro or PDF subscriber, you may not download, store (in any medium) or transmit the material contained on this Web Site. Access of the SureChem database in an automated manner is not permitted, and may result in banning from the Web Site, as well as legal action. Data from this Web Site may not be resold or redistributed. SureChem, Inc. reserves the right to ban users from this site, whether using automated means or not, if, in our sole opinion, they are abusing our data or database.


    There’s nothing wrong with this – it’s similar in their companion company “Free Patents Online”. They don’t claim the word “Open”. Just make sure you understand the difference between Free-as-in-beer and fully Open.

  • You don’t have access to the full data. You never know whether you are seeing the full index or part of it. There is no algorithm to expose the whole data. By contrast you can, in principle, download the whole of an Open system.
  • The data are not normally semantic and the metadata of the system is often inadequately documented. The results are often presented as ASCII strings rather than XML or RDF, and the user often has to guess at their meaning. There is often no identifier system. (admittedly Open systems can suffer from this as well, but at least the community can correct or enhance it).

These features make it impossible to use the data as true Linked Open Data.

I have no major complaint with a company which collects its own data and offers it as Free – Google does this and many more.

But don’t call it Open.

So commercial companies which promote mashups, linked open data, the semantic web, etc. but do not make their data Open are using these concepts as marketing tools rather than providing routes for community innovation.

I don’t know where Mendeley are at. When I asked some of their staff about whether the data were Open, they said yes. I then wrote to Ian Mulvany – I’m still waiting to hear back – maybe at his session at Science Online. I hope their data really are Open because that will change the world. Free will simply change the balance between monopolists.


Mendeley, Scopus, Talis – will you be making your data Open?

Friday, August 27th, 2010

Scraped/typed into Arcturus

Ian Mulvany (VP New Product Development, Mendeley) has blogged about the excitement of connecting scientific data:

I’m going to be hosting a session at science online London next weekend, I’m excited. I’ve been interested in the issues of connecting scientific data for a long time. In the last six months I’ve become particularly excited about the potential of web based tool like Yahoo Query Language. I was hoping to talk a little about that, but I’ve been lucky to get some amazing people to come and share their experiences about linking data, so I’m going to cede the floor to them. I might be able to get some YQL hackery into one of the unconference slots that will be knocking around. Science online is shaping up to be a pretty awesome event, and you can check out the conference program to see what you will be missing out on!

Here is the spiel and speaker bios for the section that I’m going to be running:

Connecting Scientific Resources

Do you have data? Have you decided that you want to publish that data in a friendly way? Then this session is for you. Allowing your data to be linked to other data sets is an obvious way to make your data more useful, and to contribute back to the data community that you are a part of, but the mechanics of how you do that is not always so clear cut. This session will discuss just that. With experts from the publishing world, the linked data community, and scientific data services, this is a unique opportunity to get an insight into how to create linked scientific data, and what you can do with it once you have created it.

The other speakers are:

Michael Habib, Product Manager, Scopus UX + Workflow

Richard Wallis, Technology Evangelist Talis

Chris Taylor, Senior Software Engineer for Proteomics Service, EBI

This looks like a very exciting session, and I’ll be going. Linking scientific data will transform the way we do science – Tony Hey and colleagues have published “The Fourth Paradigm” about data-driven science (some people call it discovery science), where the analysis of data, especially from many fields, comes up with radical new insights.

There’s only one requirement.

The data must be Open (ideally it should be semantic as well). Open as in libre. Free as in speech, not as in beer. Compliant with the Open Knowledge Definition. And if the data are provided through a service, that must also be Open.

This is not a religious stance, it’s that pragmatically Open is the only way that linked science data will work. It must be possible to access complete datasets, not just a query interface. We must know that – in principle – we have the absolute right to download and re-purpose the data. “Freely available” is not good enough (people can be sued for re-using free services).

There are problems. Data management costs money even if the content is free. Traditionally this has been a user-pays model. But this cannot work for Linked Open Science. The data have to be as free as the text and images in CC-BY articles. Freely re-usable. Freely resellable.

And it’s true that companies are often better at managing data than academic researchers. They can invest in people and processes that are not dedicated to the holy grail of getting citations.

But how can a company create an income stream from Open Scientific content?

That’s the question for me for this decade. If we can solve it we can transform the world. If however the linked Open data are all going to be behind paywalls, portals and query engines then we regress into the feudal information possession of the past. I hope the companies present in this session can help solve this. It won’t be easy but it has to be done.

So I now ask Mendeley, Elsevier/Scopus, Talis:

Are your data Openly available for re-use?

I asked this of Elsevier about a year ago, when they promoted “mash up everything” in a public session on text-mining for science. I asked if we could mine their journal content for DATA and make it Openly available. They informally said yes. Informal isn’t good enough in a world where lawyers abound, and I’m still following this up. That’s why we’ve set up the OKF’s isItOpen service.

I hope enough publishers of information can see the scope for Open Knowledge.

Then the world really will change.

#solo10 GreenChainReaction: it’s starting to work!

Friday, August 27th, 2010

Dictated into Arcturus

The GreenChainReaction is already a success! It’s a success because we’ve had a wonderful set of volunteers who are committed to making it work. The atmosphere in the Etherpad yesterday was wonderful. The group has worked hard to extract the scratchpad-like notes and put them into the wiki pages on Science Online.

As I expected there have been bugs. Some of these had been fixed by the volunteers, and that’s a common and wonderful aspect of Open projects. There are two or three bugs that need to be fixed:

  • Spurious directories are created in the results (probably bad calls to mkdirs());
  • There are intermittent problems in accessing our server.

But other than that we’ve shown that the approach works. For example one volunteer, Richard West, has run about 15 jobs, year-by-year. That has shown that the early patents, e.g. before 1995, really don’t have any useful semantic information. We didn’t know that when we started, so that’s already a useful result. We are already finding that some patents give a large list of solvents and as a result I shall be making a minor change to the software today. One of the features of web 2.0 is that the systems are always in continual Beta.

We still welcome newcomers because there is now a critical mass of people who can help. It’s easy to run a dozen jobs while watching the Test Match, or while you’re asleep. This is not mindless, and we’re going to ask people to look at the aggregated results before uploading them. They’ll be able to get an idea of the accuracy with which text-mining works, and feed back to us the problems we need to fix. So unlike programmes like SETI-at-home, where all feedback is automatic, here we expect valuable contributions from the participants.

I think today should be the last day when people are likely to encounter generic problems. Many thanks to everyone and we should have a wonderful set of material to present next Saturday.

#solo10 GreenChainReaction : an Open Index of patents

Thursday, August 26th, 2010

Scraped/typed into Arcturus

We are on the last lap of preparation for the Green Chain Reaction. We have now built a server to receive the results – the address is in the Etherpad.

We are uploading the weekly indexes of all European Patents to this site. Each index contains several thousand patents and is about 1 MByte. There will be about 1500 such weekly indexes.

How does it work?

  • Each volunteer downloads the patentAnalysis code and verifies they can run it
  • They then take a weekly index and run the code against it. This code
    • Selects the chemical patents (by EPO code) – 10-100 per week
    • Extracts the text for the experimental sections
    • Parses the solvents from them
    • Aggregates the result for each patent (dissolveTotal.html)
    • Uploads this to the GreenChain Server (code being written)
  • Repeat. It takes about 30-60 mins for 1 week. You could get through 10 a day just watching the Test Match (or the rain)
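To give a feel for the parsing step, here is a much-simplified Python stand-in. The real patentAnalysis code is far more sophisticated; the phrase pattern, the regular expression and the function names here are my own invention, purely for illustration:

```python
import re

# Simplified stand-in for the parsing step: find "dissolved in <solvent>"
# phrases in experimental text, with an optional amount and unit.
SOLVENT_PATTERN = re.compile(
    r"dissolved in\s+(?:(\d+(?:\.\d+)?)\s*(ml|l)\s+of\s+)?([a-z]+)",
    re.IGNORECASE,
)

def extract_solvents(experimental_text):
    """Return (amount, unit, solvent) tuples; amount/unit may be None."""
    results = []
    for amount, unit, solvent in SOLVENT_PATTERN.findall(experimental_text):
        results.append((amount or None, unit or None, solvent.lower()))
    return results

text = "The residue was dissolved in 50 ml of toluene and stirred."
# extract_solvents(text) -> [("50", "ml", "toluene")]
```

A pattern this naive would miss most real chemistry (“taken up in…”, “a solution of… in…”), which is exactly why the parser needs tuning against real patents as the project progresses.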

We shall then trawl the results from the server and present them at the meeting. Since the data are all Open this is truly Open Notebook work. Anyone with a different approach to analysis is welcome to use the data.

Indexes for 1980-1990 are now loaded, and the rest should be done in an hour.

Many thanks to our volunteers and to Sam Adams for creating the Green Chain Server system and code to access it.




#solo10 GreenChainReaction: Almost ready to go, so please volunteer

Wednesday, August 25th, 2010

Scraped/typed into Arcturus

We have now scoped out the project for the Green Chain Reaction and I am almost certain it can work. How well it works depends largely on the number of volunteers and the time they can give.

If you are interested in taking part let us know now … it should be fun and it doesn’t matter if any particular person or machine doesn’t succeed. BE BRAVE!

The code now works at the following levels:

  • It takes a weekly patent index and downloads all the chemical patents.
  • It trawls through these to see which contain experimental sections
  • It analyses the text in these to extract mentions of solvents, including chemical formula and amount (where given)
  • It aggregates all the solvent data from a single patent into a summary file (dissolveTotal.html)

I am hoping that we can add company and country information to dissolveTotal.html but this is not critical – just fun.

Volunteers should email me (pm286 a t cam dot ac dot uk). I will put them on a communal mailing group so they can mail for help. Each volunteer will select a patent index (there are about 1500 – 30 years at 50 weeks). It takes about 30-60 minutes to download, unzip and analyse a week’s material. It generates about 200 MB and 12,000 files. Only about 50 of these will need to be (automatically) uploaded.

So for the whole group in total about 1000 hours’ work and 12 million files. It’s this job we want YOU to help with.

The dissolveTotal.html files are quite small – a few KB – and there are perhaps 20 per patent index; it’s these you will upload to our server. When we get a significant number of them we can start using our new software to analyse and display the results.

We’ve currently got about 8 people who can run these jobs. That’s quite a lot of effort per person – 200 jobs. So if we get more volunteers it will make it more fun. Of course we don’t have to do the whole lot, but it’s a fun challenge.

We’ll probably make two runs at the data, as the parser needs tuning in the light of what we find.

Please join in…



PantonDiscussion#1: Richard Poynder

Tuesday, August 24th, 2010

Dictated into Arcturus

We’ve just had the first of our Panton Discussions with Richard Poynder as the visiting guest. This went extremely well and, as you can see from the photographs, was held in the small room in the Panton Arms. We had this all to ourselves and the session lasted about 2 hours.

It turned out that the session was mainly driven by Richard asking a number of questions to which Jordan Hatcher and I replied as appropriate, although there were times at which free discussion took place. This has been very useful for us because it allowed us to lay out several of the primary ideas underlying the Panton Principles, and because Jordan and I were able to balance each other from different viewpoints. It means that the session will serve as a good reference point.

Brian Brooks recorded the whole session on audio (ca 80 mins) and some of it on video. We hope that there will be snippets of video which will stand alone and can be posted on YouTube. The audio should be available more or less as is after minor edits, but we’d also like a written transcript. We’d like suggestions here as to the best way of doing this. Brian is going to try to read this into Dragon, and then we will need to correct the mistranscribed words.

Alternatively we may need to type it up from the audio in which case any volunteers would be extremely valuable!

UPDATE: we’ve had two volunteers for transcription and we’d welcome more, as that will help both them and us.


Panton Discussions#1: Richard Poynder 2010-08-24

Monday, August 23rd, 2010

Typed into Arcturus


Following on from our creation of the Panton Principles we are now starting a series of “Panton Discussions”. These are intended to explore in public, with a well-known guest, aspects of Open Data. They have a fluid format, which may well change as we progress.

We are delighted to be joined in the Panton Arms tomorrow (Tuesday 24th Aug) at 1200 by Richard Poynder. Anyone is welcome – we’ve booked a room. Richard describes himself as:

Richard Poynder writes about information technology, telecommunications, and intellectual property. In particular, he specialises in online services; electronic information systems; the Internet; Open Access; e-Science and e-Research; cyberinfrastructure; digital rights management; Creative Commons; Open Source Software; Free Software; copyright; patents, and patent information.

Richard has contributed to a wide range of specialist, national and international publications, and edited and co-authored two books: Hidden Value and Caught in a Web, Intellectual Property in Cyberspace. He has also contributed to radio programmes.

Richard has interviewed a number of leaders in the Open Access and other movements and these interviews are highly valued.

We don’t yet know the precise form of tomorrow’s Discussion. We’ve each put forward some topics/questions and I’ve listed these below. I doubt we’ll get through everything. We’ll probably start by sorting/triaging them.

The meeting will hopefully be recorded by us and we’d like to keep editing to a minimum. There won’t, by default, be a transcript unless someone volunteers. If there is a video it would be Openly published.


  • Open Data is a relatively new phrase but is now being heard everywhere. What do you think has led to this?
  • What can the Open Data movement learn from other Open movements (especially Access, but also Open Source)?
  • Where should the open data movement put its initial energies? Should it try to tackle some well-defined issues or must it draw a wide plan of what needs to be addressed?
  • Who are likely to be, or become, enthusiastic champions of Open Data?
  • How can one build a successful business model (i.e. generate revenue) from Open knowledge? It’s worked in the limited area of ICT but there are few examples outside that.


  • What is open data, and why is it important? What’s the problem it needs to fix?
  • As [we] understand it there is both the Public Domain Dedication and Licence and the Creative Commons Zero. What are these, how do they work, and how do they differ?
  • We also have the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition. What are these? How do they differ? And why again do we need two initiatives?
  • And more recently we have the Panton Principles. What do these provide that was not available before?
  • Meanwhile Open Data has acquired a much wider definition, covering for instance government and other public data. Is there some confusion around?
  • How many people have signed up to the Panton Principles?
  • Why have they not been endorsed by Science Commons?
  • There was some disagreement, I think, in developing the Panton Principles, and so they represent a compromise. Who had to compromise over what, and why?
  • Essentially we are talking about never using Non-Commercial licences?
  • What about Share Alike? I guess it does not have to be Share Alike. So would it perhaps not fit Richard Stallman’s concept of freedom?
  • Bollier in his book Viral Spiral says that Open Data does not rely on copyright, and does not use a licence, but rather uses a set of protocols. Can you expand on this? It’s basically a public commitment, is it?
  • The open data philosophy also talks of using norms. Can you expand on this?
  • In the Panton Principles it says, “You may wish to take the ‘ask for forgiveness, not permission’ approach”. Isn’t that somewhat dangerous advice?
  • Why not cover the social sciences and humanities too?
  • Where does Open Data fit with Open Access?
  • You will note that the OA movement has now differentiated between OA gratis and OA libre. Does that help the cause of Open Data (i.e. on re-usability)?
  • Where does Open Science fit in here?
  • What about Open Notebook Science?
  • Where does Open Data live: in institutional repositories, or in central archives like arXiv and PubMed Central?
  • Earlier this year you got a SPARC Innovators Award. Beyond that, do you have any figures to show the success you are having: the number of data sets with the Open Data logo on, for instance? Or the number of records now freely available?
  • What advice would you give to any scientist seeking to make their data available today? What pitfalls do they need to avoid?
  • Where do scholarly publishers fit in here? Are they problematic?
  • Have all OA publishers now adopted Open Data too?


Data Validation and publication – some early ideas

Monday, August 23rd, 2010

Dictated into Arcturus

On Friday David Shotton from Oxford and I visited Iain Hrynaszkiewicz and colleagues at BioMed Central to discuss collaboration on several grants funded by JISC. In our case they are Open Bibliography (#jiscopenbib) and #jiscxyz (on data); in David’s they are Open Citations (#jiscopencite) and Dryad-UK. These involve several other partners (whom I shall mention later, but I highlight here the International Union of Crystallography). Our meeting with Iain was about access to bibliographic material and also their requirements for data publication. I’ll be blogging a lot about this, but one thing was very clear:

Data should be validated as early as possible, for example before the manuscript is sent to a publisher for consideration.

This has a major influence on how and where data are stored, published and archived, and influences whether we should have Domain Specific Repositories and where they should be. Our group on Friday was unanimous that repositories should be domain-specific. (I know this is controversial and I’ll be happy to comment on alternative models – I shall certainly blog this topic.)

I am also keen that wherever possible data validation should be carried out by agreed algorithmic procedures. Humans do not review data well, and there are recurring examples of science based on unjustifiable data, where pre-publication data review would have helped the human reviewers to take a more critical view. I shall publish a blog post on one such example.

Here are some things that a machine is capable of doing before a manuscript is submitted. I’ll analyse them in detail later. They will vary from discipline to discipline.

  • Labelling. Have the data components been clearly and unambiguously labelled? (For example we have had *.gif files (images) wrongly labelled as *.cif (crystallography).)
  • Formats and content. Is the content of the files described by an Open specification? Does the content conform to those rules?
  • Units. Do all quantities have scientific units of measurement?
  • Uncertainty. Is an estimate given of the possible measurement and other variation?
  • Completeness checklist. Are all the required components present? If, say, there is a spectrum of a molecule, is the molecular formula also available?
  • “Expected values”. In most disciplines data fall within a prescribed range. For example human age is normally between zero and 120. A negative value would cause suspicion, as would an age of 999.
  • Interrelations between components. Very often components are linked in such a way that the relationship can be tested by a program.
  • Algorithmic validation of content. In many disciplines it’s possible to compute, either from first principles or heuristically, what the expected value is. For example the geometry and energy of a molecule can be predicted from Quantum Mechanics.
  • Presentation to humans. The robot reviewer should compile a report that a human can easily understand – again I shall show a paper where, if this had been done, the paper would have been seriously criticized.
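To make the robot-reviewer idea concrete, here is a minimal Python sketch of such pre-submission checks. The field names, required set and ranges are invented for illustration; a real validator would be driven by discipline-specific Open specifications:

```python
# Sketch of a "robot reviewer" running pre-submission checks: a
# completeness checklist and an expected-values check, as above.
EXPECTED_RANGES = {"age": (0, 120), "temperature_k": (0, 6000)}
REQUIRED_FIELDS = {"value", "units", "uncertainty"}

def validate_record(record):
    """Return a human-readable list of problems (empty list = passes)."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:  # completeness checklist
        problems.append("missing fields: %s" % ", ".join(sorted(missing)))
    name = record.get("name")
    if name in EXPECTED_RANGES and "value" in record:
        lo, hi = EXPECTED_RANGES[name]
        if not lo <= record["value"] <= hi:  # expected values
            problems.append("%s=%s outside [%s, %s]" % (name, record["value"], lo, hi))
    return problems

report = validate_record({"name": "age", "value": 999, "units": "years"})
# report -> ["missing fields: uncertainty", "age=999 outside [0, 120]"]
```

The output is deliberately a plain list of sentences rather than a pass/fail flag: the last item in the checklist above is presentation to humans, and a report the author can read before submission is the whole point.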



Blue Obelisks: Openness is taking off

Sunday, August 22nd, 2010

Scraped/typed into Arcturus

The Blue Obelisk is a relatively unstructured group of people interested in software and information in chemistry. The mantra is:

“Open Data, Open Standards, Open Source”

This guides the BlueOb in its thinking. It has web pages and a mailing list to which anyone can sign up.

It is an unorganization:

  • It unmeets at dinners when it does
  • It has an unmembership
  • It has an unagenda and unminutes

There may or may not be a BlueObelisk dinner at the ACS meeting. It’s been planned. The unattendees are those who express interest and can interpret the uninstructions on how to get to a restaurant at a given time.

Sometimes people get BlueObelisks and sometimes they don’t.

Sometimes the blue obelisks are blue, and sometimes they aren’t. Sometimes they are obelisks and sometimes they aren’t. But most of the time they are.

I have spent today travelling to a secret source of Blue Obelisks. Here are some:


I shan’t physically be at the Blue Obelisk dinner at the ACS but I’ll be there in spirit.

The Blue Obelisk now provides good, Open solutions in most of chemoinformatics. Consider joining the Blue Obelisk. You may wish to contribute – there are many ways other than writing code. Being an enthusiastic user is just as useful. The Open movement is taking off everywhere – it’s worth being part of it.