#solo10 GreenChainReaction how we take this forward

Scraped/typed into Arcturus

Reactions to the Reaction.

 

Responses to “#solo10 Immediate reactions and thanks”

  1. September 6, 2010 at 9:48 am

    Peter, what is next? Are you going to push this project further, or does it stop here? Despite all the interesting spin-offs, are you going to work out this analysis further and write up a review paper with all those who (significantly) contributed?

    Or, is this project now going to end, with #solo10 finished?

  2. rpg says:

    September 6, 2010 at 11:09 am

    Hi Peter

    It was a fascinating session, thanks. I really hope you don’t end it there.

  3. Mat Todd says:

    September 6, 2010 at 1:55 pm

    I think it’s important that this exercise is written up and published, to try to disseminate to a wider audience who perhaps aren’t familiar with the area. From the initial analysis I saw of the graphs that were produced it looked like there were some changes in solvents used … for the worse. But let’s get the full digestion of data and conclude something certain. It was great to be part of it.

I certainly do NOT intend to finish here and I’d like all of you to help as authors. I’d like to try to put together an accurate snapshot of where we have arrived and what conclusions we can draw. It was ironic that the server was off air, so I know some people didn’t manage to upload – nonetheless you have all contributed morally.

 

There is a problem we have been alerted to today – apparently the EPO may not have allowed us to do what we have been doing. I’d be grateful for accurate information on this. My moral and political view is that as a European taxpayer I should be entitled to use this information, and that if there is restricted access this is harming the practice of European and World innovation. Bad patent information helps no-one other than some applicants and some traditional downstream commercial processors.

 

Without this restriction this was the plan:

  • Clean the current site and re-run the aggregation with the later versions of the software. People could then analyse perhaps 10-20 weeks of patents per day, which works out at around 5 days per patent-year, so as a community we could do the job properly in a week or two. This depends on having somewhere to host the data – and we have an exciting offer – but I don’t want to contaminate them with tainted information.
  • Re-analyse the information on the aggregation site. This involves normalising names and chemical structures (e.g. the SMILES O and [H]O[H] both represent water).
  • Output the information in a form where it can be analysed by spreadsheets and other techniques and where we can get a useful answer. This may include analysis of volumes as well as the actual substances (see the sketch after this list).
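As a rough idea of what those normalisation and output steps could look like, here is a minimal sketch (the synonym table, file name and data layout are illustrative assumptions, not the actual GCR code):

```python
import csv
from collections import defaultdict

# Illustrative synonym table (assumed, not the real GCR dictionary):
# map raw extracted strings to a single canonical solvent name.
CANONICAL = {
    "O": "water", "[H]O[H]": "water", "water": "water",
    "DMF": "N,N-dimethylformamide", "dimethylformamide": "N,N-dimethylformamide",
}

def aggregate(mentions):
    """mentions: iterable of (year, raw_name) pairs from the text-mining step."""
    counts = defaultdict(lambda: defaultdict(int))   # solvent -> year -> count
    for year, raw in mentions:
        counts[CANONICAL.get(raw, raw)][year] += 1   # fall back to the raw string
    return counts

def write_csv(counts, years, path="solventFrequency.csv"):
    """One row per solvent, one column per year - easy to open in a spreadsheet."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["solvent"] + list(years))
        for solvent in sorted(counts):
            writer.writerow([solvent] + [counts[solvent].get(y, 0) for y in years])

# Hypothetical demo data: two mentions in 2000, two in 2009.
write_csv(aggregate([(2000, "DMF"), (2000, "O"), (2009, "dimethylformamide"), (2009, "water")]),
          years=range(2000, 2011))
```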

Ideally we should do a human analysis of precision and recall. I think recall is very difficult to measure because it depends on the false negatives, and that means reading a subset of the experimental paragraphs. I think it’s unlikely we will have missed serious classes of solvent, because the linguistic context – X (dissolved) IN Y (amount) – is a very strong phrase, and it’s unlikely that anyone would change their language for a particular solvent.
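For reference, the quantities being estimated here, in terms of true positives (TP), false positives (FP) and false negatives (FN) found by reading a sample of paragraphs, are:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
\]

Precision can be checked by reading a sample of the extracted solvent mentions; recall requires reading whole experimental paragraphs to find the mentions we missed, which is why it is the harder of the two.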

 

The good thing about this study is that it need not be comprehensive – we can turn down the recall and increase the precision. Thus we could remove all solvents with fewer than 5 mentions per year, and that removes a lot of rubbish such as “in vitro”, “in deg Celsius”, etc. (Note that these are normally parsed correctly – it’s only a few misparses that get through.)
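A minimal sketch of that frequency filter, assuming the solvent → year → count structure sketched earlier (the threshold of 5 comes from the text; everything else is illustrative):

```python
def filter_rare(counts, min_mentions=5):
    """Keep only entries with at least min_mentions in a given year.

    counts: dict of solvent -> {year: count}; low-frequency entries
    (mostly misparses such as "vitro" or "deg Celsius") are dropped.
    """
    filtered = {}
    for solvent, by_year in counts.items():
        kept = {year: n for year, n in by_year.items() if n >= min_mentions}
        if kept:
            filtered[solvent] = kept
    return filtered
```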

 

I’ll try to process the data tonight and re-present it tomorrow. This is true Open Science in that anyone may well find a pattern before I do – I’ll be too busy hacking the code.

 

So watch http://greenchain.ch.cam.ac.uk and this blog.


 

Posted in Uncategorized | Leave a comment

#solo10 Immediate reactions and thanks

Scraped/typed into Arcturus

I AM SO GRATEFUL TO EVERYONE WHO HAS HELPED WITH THE GREEN CHAIN REACTION. THE SESSION WAS A GREAT SUCCESS. WE GOT A PRELIMINARY RESULT ABOUT 5 MINS BEFORE WE WENT LIVE:

There is no immediately obvious difference in the solvents reported in 2000 and in 2009.

The chemistry department electricity went off over the weekend, so we had no server. We managed to scrape enough before that happened to show the basis of the experiment, but most of the session was about how we collaborated on this experiment. I’ll write more later.

 

There were many people who helped before the event – with code, running jobs, documentation and general support. And many who helped during the crisis, and others who helped with recording the event, including the livestream. I’ll try to give full credits later.

 

This experiment was created specifically for Science Online London and had a deadline. It’s raised lots of issues, and we think there is an exciting way forward which I’ll discuss later.

 

BUT ONCE AGAIN ENORMOUS THANKS

Posted in Uncategorized | 3 Comments

#solo10 GreenChainReaction first results and comments

Scraped/typed into Arcturus

The first results are Live!

For 11 years (2000-2010) we have a list of the aggregated frequency of solvents. YOU can start to decide on the greenness – no code needs to be run.

 

NOTES:

  • There are some years for which there is not much data. Volunteers should try to run the code over these years to increase the coverage.
  • Mixtures (“ethanolwater”) are not properly parsed. We will fix this (but not before tomorrow) – see the splitting sketch after this list.
  • Sometimes a solvent occurs twice (“DMF” and “dimethylformamide”). I shall try to normalize this today.
  • I shall try to normalize all of these against Wikipedia during the day and come up with images of the structures.
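As a rough idea of how the run-together mixture names might be split, here is a minimal sketch (the solvent list is an illustrative assumption; the real fix will go into the extraction code):

```python
# Illustrative list of known solvent names (assumed, not the GCR dictionary).
KNOWN_SOLVENTS = ["dichloromethane", "methanol", "ethanol", "acetone", "water"]

def split_mixture(token):
    """Greedily split a run-together mixture such as 'ethanolwater' into
    known solvent names; return [token] unchanged if it doesn't split cleanly."""
    parts, rest = [], token.lower()
    while rest:
        match = next((s for s in KNOWN_SOLVENTS if rest.startswith(s)), None)
        if match is None:
            return [token]          # not a clean concatenation of known names
        parts.append(match)
        rest = rest[len(match):]
    return parts

print(split_mixture("ethanolwater"))   # ['ethanol', 'water']
```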

 

But all in all I am very pleased and very grateful to everyone who has helped. More to come.

 

http://greenchain.ch.cam.ac.uk/patents/results/2000/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2001/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2002/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2003/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2004/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2005/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2006/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2007/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2008/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2009/solventFrequency.htm

http://greenchain.ch.cam.ac.uk/patents/results/2010/solventFrequency.htm


 

Posted in Uncategorized | 2 Comments

#solo10: Green Chain Reaction here are the first results! Make your own judgment on greenness

Scraped/typed into Arcturus

The first results are Live!

For 11 years (2000-2010) we have a list of the aggregated frequency of solvents. YOU can start to decide on the greenness – no code needs to be run.

See http://greenchain.ch.cam.ac.uk/patents/results/2008/solventFrequency.htm

 

NOTES:

  • There are some years for which there is not much data. Volunteers should try to run the code over these years to increase the coverage.
  • Mixtures (“ethanolwater”) are not properly parsed. We will fix this (but not before tomorrow).
  • Sometimes a solvent occurs twice (“DMF” and “dimethylformamide”). I shall try to normalize this today.
  • I shall try to normalize all of these against Wikipedia during the day and come up with images of the structures.

 

But all in all I am very pleased and very grateful to everyone who has helped. More to come.


 

Posted in Uncategorized | 1 Comment

The ABCD of Open Scholarship

We had a wonderful meeting yesterday with Dave Flanders (JISC), David Shotton’s (Oxford) group (#jiscopencite) and our #jiscopenbib group (Cambridge/OKF) – more details later. We really believe these projects can make a major change to Open Scholarship. We came up – almost by chance – with the ABCD of Open Scholarship:

  • Open Access
  • Open Bibliography
  • Open Citations
  • Open Data

We are going to create OKF buttons for the last three of these (there are already OA buttons) that publishers can embed in their publications. Then we’ll be absolutely clear what the Open material is!

Here’s a flowerpoint including some of our collaborators and sponsors:

It’s “upside down” but bits of it have to be…

Posted in Uncategorized | 1 Comment

#solo10: Green Chain Reaction starts to deliver Open Notebook Science and Open Patents

Scraped/typed into Arcturus

The Green Chain Reaction is now delivering results! Thanks to the enormous contribution of volunteers we have now distributed the work over many machines, extracted large amounts of data and successfully text-mined it. On my laptop alone I have nearly 500,000 files. It got quite hot last night.

I’m guessing that we’ve done about 20% of the possible work – perhaps 100+ weekly indexes out of 600. We’ve got enough to answer our original question and we’ll start to do this. The software gets better each day and there’s still opportunity for anyone to take part before Saturday. Please mail me or leave your name on http://okfnpad.org/openPatents.

We’ve aggregated yearly indexes, and you can browse these. Here’s a typical one.

http://greenchain.ch.cam.ac.uk/patents/results/2006/yearTotal.htm

(You could download this into a spreadsheet if you want – we hope to show something exciting on Saturday.) You can even try to answer our question – are solvents “greener” in 2010 than in 2000?
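If you do want it in a spreadsheet, something like the following should work, assuming the page is still reachable and contains a plain HTML table (the column names and table layout are assumptions here):

```python
import pandas as pd   # needs lxml or html5lib installed for read_html

URL = "http://greenchain.ch.cam.ac.uk/patents/results/2006/yearTotal.htm"

# read_html returns one DataFrame per <table> element on the page.
tables = pd.read_html(URL)
year_total = tables[0]

year_total.to_csv("yearTotal_2006.csv", index=False)   # open in Excel/LibreOffice
print(year_total.head(20))                             # quick look at the top entries
```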

I’ll update the instructions. Some of you have been very patient as the software showed unexpected bugs – well, we expected bugs but we didn’t know which! It runs for most people now most of the time, which is great if we are simply collecting “enough” data.

The GCR is of course much more than a simple experiment. It’s the first distributed Open (OKD) chemical information experiment we know of. All the data, code and results are completely Open. It’s been done in real-time so it’s Open Notebook Science. It’s a model for Open computational science.

It creates a completely Open resource (we still have to add the OKF Open data buttons). There’s no reason why this shouldn’t be the start of an Open Patent resource.

I am sure you can think of other things the GCR explores. Must rush…

Posted in Uncategorized | 1 Comment

#solo10: GreenChainReaction – volunteers start to process patents

Scraped/typed into Arcturus

We now have our “beta”-production software for the Green Chain Reaction. We’ve put about 10,000 solvents on the web page. But – if more volunteers take part in the next 2-3 days – that could be multiplied by 20 at least.

The GCR is much more than answering a chemical question (albeit an important one). It’s about Open data on the web. If you get involved, then you’ll start to understand.

As an example, what we are doing – in a sort of human mock-up – is what Google and other search engines do. It’s called Map-Reduce. If you take part in the GCR you’ll understand what Map-Reduce is – painlessly and, hopefully, entertainingly.
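For the curious, here is a toy sketch of the Map-Reduce idea in GCR terms – each volunteer’s machine “maps” its batch of patents to (solvent, 1) pairs, and the aggregation step “reduces” them into totals (the patent texts and solvent list are invented for illustration):

```python
from collections import Counter
from itertools import chain

def map_patent(patent_text, solvents=("ethanol", "toluene", "water")):
    """Map step (run by each volunteer's machine): emit (solvent, 1) pairs."""
    text = patent_text.lower()
    return [(s, 1) for s in solvents if s in text]

def reduce_counts(mapped_batches):
    """Reduce step (run once, centrally): sum all the pairs into totals."""
    totals = Counter()
    for solvent, n in chain.from_iterable(mapped_batches):
        totals[solvent] += n
    return totals

patents = ["dissolved in ethanol and water", "recrystallised from hot toluene"]
batches = [map_patent(p) for p in patents]   # in GCR this work is distributed
print(reduce_counts(batches))                # Counter({'ethanol': 1, 'water': 1, 'toluene': 1})
```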

I must thank our several volunteers. They have all played an important part and will continue to do so. They’ve written code, given advice, given moral support (very important at 0200), and shown that – if we want – we can change the way that scientific information is collected, analysed and processed. I’ll explain more later – but I’ve still got code to write.

The steps are simple:

  • Download the software and ancillary files
  • Download some patent indexes
  • Run the software
  • Possibly eyeball the results
  • Run software to upload the results

That’s it. It probably takes 2 hours to get familiar with what has to be done, and make the first-time mistakes. Then there is an elapsed time of ca. 1 hour running the job. We are getting most of the bugs out of the system so most things now work out of the box – as long as you use the current box!

I now have to create the tools to display the results. Challenging, but we have some good infrastructure.

Posted in Uncategorized | 1 Comment

Does “Open API” mean anything?

Roderic Page (http://iphylo.blogspot.com/2010/08/on-being-open-mendeley-and-open-data.html) has discussed Mendeley’s “Open API” and decided (as do I – see “Mendeley, Scopus, Talis – will you be making your data Open?”) that whether the code is Open is less important than whether the data are Open.

This is a good opportunity to address the almost meaningless phrase “Open API”. “Open” is often used as a marketing phrase, similar to “healthy”, “Green”, “safe” or many others. To show the problem, here’s a typical experience. I was at a LIS conference and a company which provides library tools and services was trumpeting its “Open API”. So I asked:

  • “Was the API publicly available?” … No.
  • “Could they let me have details?” … No.
  • “Why not?” … Because their competitors might see it.

As far as I could gather (and this is not an isolated case – I’ve seen it in chemical software companies) – “Open” means something like:

“if you purchase our software we will document our trade secrets for you if you are prepared to contract that you will not divulge any details”

My complaint is not with this per se (although it generally restricts innovation and constructive feedback), but with the use of “Open API” to describe it. “Open” surely means more than “we have documented our software architecture and you can buy the documentation under a secrecy agreement”.

It will be no surprise that I use “Open” carefully, consistent with the OKF’s Open Definition:

The Open Knowledge Definition (OKD) sets out principles to define ‘openness’ in knowledge – that’s any kind of content or data ‘from sonnets to statistics, genes to geodata’. The definition can be summed up in the statement that “A piece of knowledge is open if you are free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.

The Open Software Service Definition (OSSD) defines ‘openness’ in relation to online (software) services. It can be summed up in the statement that “A service is open if its source code is Free/Open Source Software and non-personal data is open as in the Open Knowledge Definition (OKD).”.

The second is important in this case as well as the first.

I have no particular quarrel with Mendeley – they are innovating by putting power in the hands of clients and that’s gently useful. But unless they actually are Open according to the OKD then they aren’t giving us much (and this applies to many companies – and almost all in chemistry). At present “Open API” seems to mean

“We will show you how you can query the data in our system without fee”. This is free beer, not Libre. It’s not an Open service, it’s a free service. Here are some common features of Free but not Open services:

  • They can be withdrawn at any time. This is extremely common in Free services – it’s happened innumerable times in chemical information systems. A company will offer free access and then later offer it only to subscribers, or offer only a very limited service (a handful of queries).
  • There are no explicit rights for re-using the results. Here’s a typical phrase from Surechem (a Macmillan subsidiary http://www.surechem.org/ ):

    Unless you have SureChem’s prior written permission you are not permitted to copy, broadcast, make available to the public, show or play in public, adapt or change in any way the material (or any part of it) contained on this Web Site for any purpose whatsoever. Unless you are SureChem Pro or PDF subscriber, you may not download, store (in any medium) or transmit the material contained on this Web Site. Access of the SureChem database in an automated manner is not permitted, and may result in banning from the Web Site, as well as legal action. Data from this Web Site may not be resold or redistributed. SureChem, Inc. reserves the right to ban users from this site, whether using automated means or not, if, in our sole opinion, they are abusing our data or database.

     

    There’s nothing wrong with this – it’s similar in their companion company “Free Patents online” (http://www.freepatentsonline.com/privacy.html ). They don’t claim the word “Open”. Just make sure you understand the difference between Free-as-in-beer and fully Open.

  • You don’t have access to the full data. You never know whether you are seeing the full index or part of it. There is no algorithm to expose the whole data. By contrast you can, in principle, download the whole of an Open system.
  • The data are not normally semantic and the metadata of the system are often inadequately documented. The results are often presented as ASCII strings rather than XML or RDF, and the user often has to guess at their meaning. There is often no identifier system. (Admittedly Open systems can suffer from this as well, but at least the community can correct or enhance them.)

These features make it impossible to use the data as true Linked Open Data.

I have no major complaint with a company which collects its own data and offers it as Free – Google does this and many more.

But don’t call it Open.

So commercial companies which promote mashups, linked open data, the semantic web, etc. but do not make their data Open are using these concepts as marketing tools rather than providing the roots of community innovation.

I don’t know where Mendeley are at. When I asked some of their staff about whether the data were Open, they said yes. I then wrote to Ian Mulvany – I’m still waiting to hear back – maybe at his session at Science Online. I hope their data really are Open because that will change the world. Free will simply change the balance between monopolists.


 

Posted in Uncategorized | 8 Comments

Mendeley, Scopus, Talis – will you be making your data Open?

Scraped/typed into Arcturus

Ian Mulvany (VP New Product Development, Mendeley) has blogged about the excitement of connecting scientific data (http://directedgraph.net/2010/08/27/connecting-scientific-data/):


I’m going to be hosting a session at science online London next weekend, I’m excited. I’ve been interested in the issues of connecting scientific data for a long time. In the last six months I’ve become particularly excited about the potential of web based tool like Yahoo Query Language. I was hoping to talk a little about that, but I’ve been lucky to get some amazing people to come and share their experiences about linking data, so I’m going to cede the floor to them. I might be able to get some YQL hackery into one of the unconference slots that will be knocking around. Science online is shaping up to be a pretty awesome event, and you can check out the conference program to see what you will be missing out on!

Here is the spiel and speaker bios for the section that I’m going to be running:

Connecting Scientific Resources

Do you have data? Have you decided that you want to publish that data in a friendly way? Then this session is for you. Allowing your data to be linked to other data sets is an obvious way to make your data more useful, and to contribute back to the data community that you are a part of, but the mechanics of how you do that is not always so clear cut. This session will discuss just that. With experts from the publishing world, the linked data community, and scientific data services, this is a unique opportunity to get an insight into how to create linked scientific data, and what you can do with it once you have created it.

The other speakers are:

Michael Habib, Product Manager, Scopus UX + Workflow

Richard Wallis, Technology Evangelist, Talis

Chris Taylor, Senior Software Engineer for Proteomics Service, EBI

This looks like a very exciting session, and I’ll be going. Linking scientific data will transform the way we do science – Tony Hey and colleagues have published “The Fourth Paradigm” about data-driven science (some people call it discovery science), where the analysis of data, especially from many fields, comes up with radical new insights.

There’s only one requirement.

The data must be Open (ideally they should be semantic as well). Open as in libre – free as in speech, not as in beer. Compliant with the Open Knowledge Definition (http://www.opendefinition.org/). And if the data are provided through a service, that service must also be Open (http://www.opendefinition.org/ossd/).

This is not a religious stance, it’s that pragmatically Open is the only way that linked science data will work. It must be possible to access complete datasets, not just a query interface. We must know that – in principle – we have the absolute right to download and re-purpose the data. “Freely available” is not good enough (people can be sued for re-using free services).

There are problems. Data management costs money even if the content is free. Traditionally this has been a user-pays model, but that cannot work for Linked Open Science. The data have to be as free as the text and images in CC-BY articles. Freely re-usable. Freely resellable.

And it’s true that companies are often better at managing data than academic researchers. They can invest in people and processes that are not dedicated to the holy grail of getting citations.

But how can a company create an income stream from Open Scientific content?

That’s the question for me for this decade. If we can solve it we can transform the world. If, however, the linked Open data are all going to be behind paywalls, portals and query engines, then we regress into the feudal information possession of the past. I hope the companies present in this session can help solve this. It won’t be easy but it has to be done.

So I now ask Mendeley, Elsevier/Scopus, Talis:

Are your data Openly available for re-use?

I asked this of Elsevier about a year ago, when they promoted “mash up everything” in a public session on text-mining for science. I asked if we could mine their journal content for DATA and make it Openly available. They informally said yes. Informal isn’t good enough in a world where lawyers abound, and I’m still following this up. That’s why we’ve set up the OKF’s isItOpen service.

I hope enough publishers of information can see the scope for Open Knowledge.

Then the world really will change.

Posted in Uncategorized | Leave a comment

#solo10 GreenChainReaction: it’s starting to work!

Dictated into Arcturus

The GreenChainReaction is already a success! It’s a success because we’ve had a wonderful set of volunteers who are committed to making it work. The atmosphere in the Etherpad yesterday was wonderful. The group has worked hard to extract the scratchpad-like notes and put them into the wiki pages on Science Online (http://scienceonlinelondon.wikidot.com/topics:green-chain-reaction).

As I expected there have been bugs. Some of these have been fixed by the volunteers, and that’s a common and wonderful aspect of Open projects. There are two or three bugs that need to be fixed:

  • Spurious directories are created in the results (probably bad calls to mkdirs());
  • There are intermittent problems in accessing our server.

But other than that we’ve shown that the approach works. For example, one volunteer, Richard West, has run about 15 jobs, year-by-year. That has shown that the early patents, e.g. before 1995, really don’t have any useful semantic information. We didn’t know that when we started, so that’s already a useful result. We are already finding that some patents give a large list of solvents, and as a result I shall be making a minor change to the software today. One of the features of web 2.0 is that the systems are always in continual Beta.

We still welcome newcomers because there is now a critical mass of people who can help. It’s easy to run a dozen jobs while watching the Test Match, or while you’re asleep. This is not mindless – we’re going to ask people to look at the aggregated results before uploading them. They’ll be able to get an idea of the accuracy with which text-mining works, and feed back to us problems we need to fix. So unlike programmes such as SETI-at-home, where all feedback is automatic, here we expect valuable contributions from the participants.

I think today should be the last day when people are likely to encounter generic problems. Many thanks to everyone and we should have a wonderful set of material to present next Saturday.

Posted in Uncategorized | Leave a comment