Scraped/typed into Arcturus
Reactions to the Reaction.
Responses to “#solo10 Immediate reactions and thanks”
Egon Willighagen says:
Peter, what is next? Are you going to push this project further, or does it stop here? Despite all the interesting spin-offs, are you going to work out this analysis further and write up a review paper with all those who (significantly) contributed?
Or, is this project now going to end, with #solo10 finished?
It was a fascinating session, thanks. I really hope you don’t end it there.
Mat Todd says:
I think it’s important that this exercise is written up and published, to try to disseminate it to a wider audience who perhaps aren’t familiar with the area. From the initial analysis I saw of the graphs that were produced, it looked like there were some changes in the solvents used … for the worse. But let’s complete the full digestion of the data before concluding anything with certainty. It was great to be part of it.
I certainly do NOT intend to finish here, and I’d like all of you to help as authors. I’d like to try to put together an accurate snapshot of where we have arrived and what conclusions we can draw. It was ironic that the server was off the air, so I know some people didn’t manage to upload – nonetheless you have all contributed morally.
There is a problem we have been alerted to today – apparently the EPO may not have allowed us to do what we have been doing. I’d be grateful for accurate information on this. My moral and political view is that as a European taxpayer I should be entitled to use this information, and that if there is restricted access this is harming the practice of European and World innovation. Bad patent information helps no-one other than some applicants and some traditional downstream commercial processors.
Without this restriction this was the plan:
- Clean the current site and re-run the aggregation with the later versions of the software. People could then analyse perhaps 10–20 weeks per day, which works out at about 5 days per patent-year (52 weeks at 10 weeks per day). As a community we could therefore do the job properly in a week or two. This depends on having somewhere to host the data – and we have an exciting offer – but I don’t want to contaminate them with tainted information.
- Re-analyse the information on the aggregation site. This involves normalising names and chemical structures (e.g. O and [H]O[H] are both water).
- Output the information in a form where it can be analysed with spreadsheets and other techniques, and where we can get a useful answer. This may include analysis of volumes as well as of the actual substances.
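The normalisation step above can be sketched in code. This is a minimal illustration, not the project’s actual pipeline: it assumes a hand-built lookup table mapping surface forms (names, formulae, SMILES strings such as O and [H]O[H]) onto one canonical solvent name. A real implementation would more likely use a cheminformatics toolkit to canonicalise structures.

```python
# Hypothetical sketch: map the many surface forms of a solvent found in
# patent text onto a single canonical label, so counts can be aggregated.
# The table below is illustrative only.
CANONICAL = {
    "water": "water",
    "h2o": "water",
    "o": "water",          # SMILES for water
    "[h]o[h]": "water",    # equivalent SMILES with explicit hydrogens
    "ethanol": "ethanol",
    "ethyl alcohol": "ethanol",
    "etoh": "ethanol",
}

def normalise(mention: str) -> str:
    """Return the canonical solvent name, or the cleaned input if unknown."""
    key = mention.strip().lower()
    return CANONICAL.get(key, key)

print(normalise("[H]O[H]"))  # -> water
print(normalise("EtOH"))     # -> ethanol
```

Unknown mentions fall through unchanged, so they can still be counted and inspected later rather than silently dropped.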
Ideally we should do a human analysis of precision and recall. I think recall is very difficult to measure, as it depends on the false negatives, and that means reading a subset of the experimental paragraphs by hand. I think it’s unlikely we will have missed serious classes of solvent, as the linguistic context is a very strong signal: X (dissolved) IN Y (amount) is a very distinctive phrase, and it’s unlikely that anyone would change their language for a particular solvent.
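For concreteness, the standard definitions behind that human check are precision = TP/(TP+FP) and recall = TP/(TP+FN), where the counts come from manually verifying a sample of extracted mentions (for precision) and reading a sample of paragraphs for missed mentions (for recall). The numbers below are purely illustrative, not results from this project.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up numbers: a manual check of 100 extracted solvent mentions finds
# 90 correct (TP=90, FP=10), and reading sample paragraphs suggests 5
# genuine mentions were missed (FN=5).
p, r = precision_recall(90, 10, 5)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.95
```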
The good thing about this study is that it need not be comprehensive – we can turn down the recall and increase the precision. Thus we could remove all solvents with fewer than 5 mentions per year, which removes a lot of rubbish: “in vitro”, “in deg Celsius”, etc. (Note that these are normally parsed correctly – it’s only a few misparses that get through.)
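The frequency cut-off described above is simple to express. This sketch (with toy counts, not real data) drops any extracted “solvent” seen fewer than 5 times in a year, which removes one-off misparses while keeping genuine solvents – trading a little recall for higher precision.

```python
from collections import Counter

# Toy yearly mention list: real solvents appear often, misparses rarely.
mentions = (["water"] * 120 + ["ethanol"] * 40 + ["THF"] * 12
            + ["vitro"] * 2 + ["deg Celsius"] * 1)

counts = Counter(mentions)
MIN_MENTIONS = 5  # threshold from the text: fewer than 5 mentions per year
kept = {name: n for name, n in counts.items() if n >= MIN_MENTIONS}
print(kept)  # the rare misparses are filtered out
```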
I’ll try to process the data tonight and re-present it tomorrow. This is true Open Science in that anyone may well spot a pattern before I do – I’ll be too busy hacking the code.
So watch http://greenchain.ch.cam.ac.uk and this blog.