TL;DR We had a great session at FORCE2015 yesterday in Oxford - people liked it, understood it, and want to join us.
We ran a pre-conference workshop for 3 hours followed by an extra hack session. It was open to all, and all sorts of people came, including:
- scholarly poor
So we deliberately didn't have a set program but we promised that anyone could learn about many of the things that ContentMine does and get their hands dirty. Our team presented the current state of play and then we broke into subgroups looking at legal/policy, science, and techie.
ContentMining is at a very early stage and the community, including ContentMine, is still developing tools and protocols. There's a lot to know and a certain amount of misunderstanding and disinformation. So very simply:
- facts are uncopyrightable
- large chunks of scientific publications are facts
- in the UK we have the legal right to mine these documents for facts for non-commercial activity / research
- the ContentMine welcomes collaborators who want to carry out this activity - it's inclusive - YOU are part of US. ContentMine is not built centrally but by volunteers.
- Our technology is part alpha, part beta. "alpha" means that it works for us, and so yesterday was about the community finding out whether it worked for them.
And it did. The two aspects yesterday were (a) scraping and (b) regexes in AMI. The point is that YOU can learn how to do these in about 30 mins. That means that YOU can build your bit of the Macroscope ("information telescope") that is ContentMine. Rory's interested in farms, so he, not us, is building regexes for agriculture. (A week ago he didn't know what a regex was.) Yesterday the community built a scraper for PeerJ - so if you want anything from that, it's now added to the repertoire (and available to anyone). We've identified clinical trials as one of the areas that we can mine - and we'd love volunteers here.
What can we mine? Anything factual from anywhere. What are facts (asked by one publisher yesterday)? There's the legal answer ("what the UK judge decides when the publisher takes a miner to court") and I hope we can move beyond that - that publishers will recognize the value of mining and want to promote a community approach. Operationally it's anything which can be reliably parsed by machine into a formal language and regenerated without loss. So here are some facts: "DOI 123456 contains..."
- this molecule
- this species
- this star, galaxy
- this elementary particle.
and relationships ("triples" in RDF-speak)
- [salicylic acid] [was dissolved in] [methanol]
- [fairy penguins] [breed] [in St Kilda, VA]
Everything in [...] is precisely definable in ontologies and can be precisely annotated by current ContentMine technologies.
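To make "parsed by machine into a formal language and regenerated without loss" concrete, here is a minimal sketch in plain Python. It is not ContentMine or AMI code: the `Triple` type and the `parse_triple` / `render_triple` helpers are hypothetical names invented for this illustration, and the bracket notation is simply the one used in the bullets above.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject / predicate / object statement (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


# Exactly three bracketed fields separated by whitespace.
BRACKETED = re.compile(
    r"^\s*\[([^\[\]]+)\]\s+\[([^\[\]]+)\]\s+\[([^\[\]]+)\]\s*$"
)


def parse_triple(text: str) -> Triple:
    """Parse '[s] [p] [o]' into a Triple, or raise ValueError."""
    match = BRACKETED.match(text)
    if match is None:
        raise ValueError(f"not a bracketed triple: {text!r}")
    return Triple(*match.groups())


def render_triple(triple: Triple) -> str:
    """Regenerate the bracketed form from a Triple."""
    return f"[{triple.subject}] [{triple.predicate}] [{triple.obj}]"


line = "[salicylic acid] [was dissolved in] [methanol]"
triple = parse_triple(line)
# Round trip: what we regenerate is exactly what we parsed.
assert render_triple(triple) == line
```

The round-trip check at the end is the point: if a statement survives parsing and regeneration unchanged, it behaves like a fact in the operational sense described above.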
We can do chemistry (in depth), phylogenetics, agriculture, etc., but what about clinical trials? For those we need to build:
- a series of scrapers for appropriate journals
- a series of regexes for terms in clinical trials, e.g. "23 adult females between the ages of ...".
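To give a feel for what such a regex might look like, here is a rough sketch using Python's standard `re` module. It is not an AMI regex file and not something we built at the workshop: the pattern, the `find_participants` helper and the sample sentence are all made up for illustration, and a real clinical-trials set would need far more terms.

```python
import re

# Hypothetical pattern: a participant count, an optional age group,
# and a sex term. Real trial reports vary far more than this.
PARTICIPANTS = re.compile(
    r"\b(?P<count>\d+)\s+"
    r"(?:(?P<age_group>adult|elderly|adolescent)\s+)?"
    r"(?P<sex>females?|males?|women|men)\b",
    re.IGNORECASE,
)


def find_participants(text):
    """Return one dict per participant-group mention found in text."""
    return [
        {
            "count": int(m.group("count")),
            "age_group": m.group("age_group"),  # None if not stated
            "sex": m.group("sex"),
        }
        for m in PARTICIPANTS.finditer(text)
    ]


# Invented sample sentence, not taken from any real paper.
sample = "We enrolled 23 adult females and 19 males in the study."
for hit in find_participants(sample):
    print(hit)
```

Each hit comes back as a small structured record (count, age group, sex) rather than a highlighted string, which is the step that turns text into something a machine can aggregate across many papers.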
For the really committed we will also be able to analyze tables, figures and phrases in text using Natural Language Processing. If that is you, it will be very exciting.