In my presentation to the @UKSG tomorrow I shall argue that the scholarly literature is a vast untapped source of high quality data and can supplant traditional expensive, human abstracting services. The organizers asked me not to present vapourware (which is easy because I never do!). The Content Mine is not vapourware, but it’s only just started the first lode.
FACTS are uncopyrightable. A purchaser of a book can extract and republish facts without permission. A subscriber to a journal can extract and republish facts – how else have the abstracting services worked? So my machines can extract and analyse and check and republish and enhance scientific data.
We have been doing this for 6 years in crystallography. But we weren’t the first. The Crystallography Open Database (COD) celebrates 10 years of machine abstraction from the literature. Started in France by Armel Le Bail, it has moved to Vilnius in Lithuania, where Saulius Gražulis runs it. They’ve got 240,000 structures. Our own Crystaleye (CY) was developed in 1 year by Nick Day, who needed structures for computational chemistry and decided to abstract the whole visible, Open literature for crystal structures; he’s got about 250,000. Obviously there’s a lot of overlap, but it’s probably fair to say that between us there’s a total of 300,000 high quality, useful crystal structures.
And it’s all Openly available. It can be used for drug discovery, new electronic materials, capturing carbon dioxide, and education.
How does this happen? The crystallographic community requires all published structures to have data available (Supplementary data; Supporting Information) on the websites alongside the “fulltext”. Our machines can hoover these up and, importantly, associate them with the fulltext. Most publishers, RSC, ACS, Nature… regard SI as public data even for closed fulltext. (However Wiley, Springer, Elsevier put it behind the CCDC access wall where structures are only available by email, and then with a limit of < 100).
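To make the "hoover up and associate" step concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the actual COD or Crystaleye code: it assumes the supplementary data is a CIF file, that we only want simple single-line `_tag value` items, and that the article's DOI is already known. The function names, the sample CIF and the DOI are all made up for the example.

```python
import re

def parse_cif_items(cif_text):
    """Collect simple single-line `_tag value` pairs from CIF text.

    Hypothetical helper: loop_ blocks and multi-line values are skipped.
    """
    items = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # tag whose value lives elsewhere (loop_ or multi-line)
        tag, value = parts
        items[tag] = value.strip().strip("'\"")
    return items

def strip_uncertainty(value):
    """Turn a CIF number with a standard uncertainty, '7.440(6)', into 7.44."""
    return float(re.sub(r"\(\d+\)$", "", value))

def link_structure_to_article(cif_text, article_doi):
    """Build one record tying the extracted facts to the article (hypothetical schema)."""
    items = parse_cif_items(cif_text)
    cell_a = items.get("_cell_length_a")
    return {
        "doi": article_doi,
        "formula": items.get("_chemical_formula_sum"),
        "cell_length_a": strip_uncertainty(cell_a) if cell_a else None,
    }

# Invented sample data, for illustration only.
SAMPLE_CIF = """\
data_example
_chemical_formula_sum 'C6 H6'
_cell_length_a 7.440(6)
_cell_length_b 9.550(7)
"""

record = link_structure_to_article(SAMPLE_CIF, "10.1234/hypothetical.doi")
print(record)
```

A real harvester would use a full CIF parser and would have to cope with each publisher's website layout; the point here is only that the facts in the supplementary data are machine-readable and can be tied to the paper they came from.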
You might think that COD and CY are rivals – that’s the traditional academic view. Not at all – we are friends and we are starting serious collaboration! We have jointly obtained a grant from the Lithuanian Academy of Sciences for me to travel to Lithuania (Vilnius) and work together to integrate our data collections.
And we will integrate our tools too, which are Open and powerful. For example, I’ll be integrating the Content Mine tools, which will allow a complete indexing of the crystallographic literature and data from inside the fulltext! That means that our database will contain material not available anywhere else, open or closed.
We shall simply have the best Open small-molecule database on the planet.
And we invite volunteers to add material and make the software and services better!