Today 2014-06-01 is a very important date. The UK government has pushed for reform of copyright and - despite significant opposition and lobbying from mainstream publishers - the proposals are now law. Today.
Laws are complicated and the language can be hard to understand, but for our purposes (scientific articles which we have the right to read):
- If you have the right to read something in the UK then you have the right to extract and publish facts from it for non-commercial use.
- This right overrides any restrictions in the contract signed between the publisher and the buyer/renter.
Of course we are still bound by copyright law in general, defamation, passing off and many other laws. But our machines can now download subscribed articles without legal hindrance and as long as we don't publish large non-factual chunks we can go ahead.
Without asking permission.
That's the key point. If we had to ask permission or were bound by contracts that forbid us then the law would be useless. But it isn't.
I'm mentally starting today, but since I'm not in the UK I'll wait for a few days. I've got several non-commercial projects I want to work on - one today about pheromones - I need to scan a lot of papers for chemical structures and species.
It also wouldn't be much use without the technology. There are 1000-5000 articles per day - no-one really knows. That's roughly 1-3 a minute to crawl and scrape. We believe that a lot of the crawled metadata is freely available so we are concentrating on scraping.
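For anyone who wants to check the per-minute rate quoted above, here is a back-of-envelope sketch. The 1000 and 5000 figures are the guesses from this post, not measured values, and the function names are made up for illustration.

```python
# Back-of-envelope throughput check. The daily volumes are the rough
# guesses from the post; nobody really knows the true number.
MINUTES_PER_DAY = 24 * 60        # 1440
SECONDS_PER_DAY = 24 * 60 * 60   # 86400


def articles_per_minute(articles_per_day: int) -> float:
    """Average number of articles to process each minute, around the clock."""
    return articles_per_day / MINUTES_PER_DAY


def seconds_per_article(articles_per_day: int) -> float:
    """Average time budget per article if we must keep up with the flow."""
    return SECONDS_PER_DAY / articles_per_day


if __name__ == "__main__":
    for daily in (1000, 5000):
        print(
            f"{daily} articles/day = {articles_per_minute(daily):.2f} per minute, "
            f"one every {seconds_per_article(daily):.0f} s"
        )
```

So a single machine working around the clock has somewhere between about 17 and 86 seconds per article, assuming the load is spread evenly through the day, which it probably isn't.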
We'll launch the technology on Wednesday at http://www.fwf.ac.at/de/aktuelles_detail.asp?N_ID=600 . If you are in the Vienna area you might want to come - I think there may be a place or two but can't guarantee it. We'll post the details and probably open an Etherpad if any brave people want to try remotely.
All the http://contentmine.org people have worked very hard but top kudos to Richard Smith-Unna (@blahah404) for building the scraper. It's a scary ghost ride with a "headless browser", "PhantomJS", "SpookyJS", "CasperJS" but we'll be doing this in daylight so it should be safe.
The workshop is truly interactive - we want to hear what the participants want, why it does or doesn't work for them, and to build collaborative projects. Ideally we'd like a self-reproducing community developing applications and running workshops.
A small amount of the workshop - e.g. Computer Vision for Science - will be "bleeding edge". It should be fun.