Content Mining Starts Today! And we have the technology

Today 2014-06-01 is a very important date. The UK government has pushed for reform of copyright and – despite significant opposition and lobbying from mainstream publishers – the proposals are now law. Today.
Laws are complicated and the language can be hard to understand, but for our purposes (scientific articles which we have the right to read):

  • If you have the right to read something in the UK then you have the right to extract and publish facts from it for non-commercial use.
  • This right overrides any restrictions in the contract signed between the publisher and the buyer/renter.

Of course we are still bound by copyright law in general, defamation, passing off and many other laws. But our machines can now download subscribed articles without legal hindrance and, as long as we don’t publish large non-factual chunks, we can go ahead.
Without asking permission.
That’s the key point. If we had to ask permission or were bound by contracts that forbid us then the law would be useless. But it isn’t.
I’m mentally starting today, but since I’m not in the UK I’ll wait for a few days. I’ve got several non-commercial projects I want to work on – one, starting today, on pheromones – for which I need to scan a lot of papers for chemical structures and species.
It also wouldn’t be much use without the technology. There are 1000-5000 new articles per day – no-one really knows exactly. That’s 1-2 a minute to crawl and scrape. We believe that a lot of the crawled metadata is freely available, so we are concentrating on scraping.
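To make that concrete, here’s a minimal sketch of the kind of polite, rate-limited fetch-and-retry loop we have in mind – the URLs, delay and retry count are illustrative assumptions, not our production settings:

```python
# A sketch of polite crawling: a fixed delay between requests and a few
# retries with back-off on failure. All values and URLs are illustrative.
import time
import requests

ARTICLE_URLS = ["http://example.org/article/1", "http://example.org/article/2"]
DELAY_SECONDS = 30   # ~2 fetches/minute keeps the load on any one server tiny
MAX_RETRIES = 3

def fetch(url):
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off, then retry
    return None  # give up politely rather than hammering the server

for url in ARTICLE_URLS:
    html = fetch(url)
    # ... scrape facts from html here ...
    time.sleep(DELAY_SECONDS)
```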
We’ll launch the technology on Wednesday at http://www.fwf.ac.at/de/aktuelles_detail.asp?N_ID=600 . If you are in the Vienna area you might want to come – I think there may be a place or two left, but I can’t guarantee it. We’ll post the details and probably open an Etherpad if any brave people want to try remotely.
All the http://contentmine.org people have worked very hard, but top kudos to Richard Smith-Unna (@blahah404) for building the scraper. It’s a scary ghost ride through a “headless browser” (PhantomJS, with SpookyJS and CasperJS on top), but we’ll be doing this in daylight so it should be safe.
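For the curious, here’s the “headless browser” idea in miniature – a Python wrapper that hands a small script to PhantomJS, which loads the page, runs its JavaScript and reports back. This illustrates the concept only; Richard’s scraper is a far more capable JavaScript stack.

```python
# Illustration of the "headless browser" idea: PhantomJS loads the page and
# executes its JavaScript, so we can read content a plain HTTP GET would miss.
# Requires the phantomjs binary on the PATH; the URL is hypothetical.
import os
import subprocess
import tempfile
import textwrap

JS = textwrap.dedent("""
    var page = require('webpage').create();
    page.open('http://example.org/article/123', function (status) {
        // the DOM is now fully rendered, JavaScript included
        console.log(page.evaluate(function () { return document.title; }));
        phantom.exit(status === 'success' ? 0 : 1);
    });
""")

with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
    f.write(JS)
    script = f.name

print(subprocess.check_output(["phantomjs", script]).decode())
os.remove(script)
```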
The workshop is truly interactive – we want to hear what the participants want, why it does or doesn’t work for them, and to build collaborative projects. Ideally we’d like a self-reproducing community developing applications and running workshops.
A small amount of the workshop – e.g. Computer Vision for Science – will be “bleeding edge”. It should be fun.

Shuttleworth Gathering Budapest, Content Mine Dogfood

Twice a year the Shuttleworth Fellowship meets in a Gathering – it could be anywhere in the world (subject to a minimum-travel-costs algorithm). This is my first and we are in Budapest – one of Europe’s loveliest cities. (I’ve been here before, luckily, as our programme has been very full and we only got out once formally, for a river cruise.)
It’s under the Chatham House Rule so no details, but see our web page for the 13 fellows. This is one of the most coherent, inspiring groups I have ever been in. So much is common ground – we agree on doing Open; the questions are why, what and how, and we’ve explored those. I’ve found so much in common – we are in the area of liberating knowledge and inspiring innovation, mixed with democracy and justice. I’m finding out about building communities, annotation and education, while being able to help with computer vision, information extraction, metadata, etc.
We each ran a 75-minute slot on “eating our own dogfood”. NOT a lecture. We had to bring the practice of our project and ask the others – everyone – to grok it and hack it. Often this was in small groups, and so for mine we had 5 groups of 5. Here’s my rough summary with comments:

  • Why are we doing ContentMining? Economics, openness/democracy, innovation, disruption. The Hargreaves review.

Very useful discussion (as would be expected)

  • Manual markup (highlighters) of two articles

Worked very well. Lots of questions about “should we mark this?”. 

  • Demo (PMR) of semantic content (chemistry)

  • Crawling exercise (manual)

Good involvement. “Why doesn’t publisher X have an RSS feed?”, etc.

  • Scraping exercise (manual and software)

Again worked very well

  • Extraction (software and manual design)

Mainly concentrated on manual markup but showed ChemicalTagger, etc.

  • Where are we going?

 
I deliberately put far too much in – so people could test that the software worked, etc. But the main idea was to see how non-biologists managed. I chose a paper on the evolutionary biology of lions in Africa and everyone got the point. In fact it reinforced how needlessly exclusive scientific language is. The first part of the introduction could be rewritten without loss to read something like:

“African Lions are dying out because of hunting and environment change. DNA analyses show that lions in different parts of Africa have evolved in different ways. By studying the DNA and historical specimens we can understand the evolution and perhaps use this for conservation.”

There wasn’t enough time for everyone to run the software – deliberately – but we got very useful feedback.  I shall be tweaking it over the weekend to make sure it’s working for our Vienna workshop.

Content Mining will be legal in UK; I inform Cambridge Library and the world of my plans

Early last week the UK House of Lords passed the final stages of a Statutory Instrument with exceptions to copyright. For me the most important was that those with legitimate access to electronic content can now use mining technology to extract data without permission from the owners. The actual passage took less than a minute, but the process has been desperately fought by the traditional publishers, who have attempted to require subscribers to get permission from them.
IN THE UK THEY HAVE FAILED IN THIS BATTLE
That means that I, who have legitimate access to the content of Cambridge University Library and their electronic subscriptions, can now use machines to read any or all of this without breaking copyright law. Moreover the publishers cannot override this with additional restrictive clauses in their contracts.
The new law restricts the use to “non-commercial” but this will not affect what I intend to do. To avoid any confusion I am publicly setting out my intentions; because I shall be using subscription content I am advising Cambridge University Library. I am not asking anyone’s permission because I don’t have to.
Yesterday I wrote to Yvonne Nobis, Head of Science Information in CUL.

I am informing you of my content mining research using subscription content in CUL. Please forward this to anyone else in CUL who may need to know. Also if there is any time this week I would be very happy to meet (or failing that Skype) – even for a short time.
As you know the UK government has passed a Statutory Instrument based on the Hargreaves review of copyright exempting certain activities from copyright, especially “data analytics” which covers content mining for facts. This comes into force on 2014-06-01.
I intend to use this to start non-commercial research and to publish the results in an OpenNotebookScience (https://en.wikipedia.org/wiki/Open_notebook_science) philosophy (i.e. publicly and immediately on the web as the work is done, not retrospectively). This involves personal research in several scientific fields and also collaborations in 3-4 funded projects:
  • PLUTo (BBSRC, Univ Bath) – Ross Mounce
  • Metabolism mining (Andy Howlett, Unilever funded PhD and also with Christoph Steinbeck EBI, Hinxton, UK)
  • Chemical mining (TSB grant) Mark Williamson.
We are also collaborators in the final application stage for an NSF grant collaboration for chemical biodiversity in Lamiaceae (mints, etc.). This is very exciting and mining may throw light on chemicals as signals of climate change.
I intend to mine responsibly and within UK law. I expect to mine about 1000-2000 papers per day – many will be subscription-based through CUL. I have access to these as I have an Emeritus position, and as I am not paid by CU this cannot be construed as commercial activity. Typically my software will ingest a paper, mine it for facts, and discard the paper – the process takes a few seconds.
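In outline the cycle is tiny. A sketch (the extractor below is a trivial stand-in for the real mining modules, and the URL is hypothetical):

```python
# Sketch of the ingest-mine-discard cycle described above; the extractor is
# a naive stand-in for the real mining modules.
import re
import requests

def extract_funders(text):
    # naive: capture phrases following "funded by" (illustrative only)
    return re.findall(r"[Ff]unded by ([^.;,]+)", text)

def mine(url):
    text = requests.get(url, timeout=30).text      # ingest (a few seconds)
    facts = {"funders": extract_funders(text)}     # mine for facts
    return facts                                   # the full text is discarded
```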
As a responsible scientist I am required by scientific ethics and reproducibility/verifiability to make my results Open and this includes the following Facts:
  • bibliographic metadata of the article (but not the abstract)
  • citations (bibliographic references) within the article
  • factual lists of tables, figures and supplemental data
  • sources of funding (to evaluate the motivations of researchers)
  • licences
  • scientific facts (below)
I shall not reproduce the whole content but shall reproduce necessary textual metadata without which the facts cannot be verified. These include:
  • figure and table captions (i.e. metadata)
  • experimental methodology (e.g. procedures carried out)
I shall not reproduce tables and figures. However my software is capable, for many papers, of interpreting tables and diagrams and extracting Factual information (e.g. in CSV files). [My output will be more flexible and re-usable than traditional pixel-based graphs.]
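As an indication of what “interpreting tables” means, here is a toy sketch (not my actual code – real tables in papers are far messier, and the data here are invented) that turns a simple HTML table into CSV:

```python
# Sketch: turning a simple, well-formed HTML table into CSV facts.
import csv, io
from html.parser import HTMLParser

class TableToRows(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell, self.in_cell = [], [], "", False
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell, self.cell = True, ""
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
            self.row.append(self.cell.strip())
        elif tag == "tr":
            self.rows.append(self.row)
            self.row = []
    def handle_data(self, data):
        if self.in_cell:
            self.cell += data

parser = TableToRows()
parser.feed("<table><tr><th>Species</th><th>n</th></tr>"
            "<tr><td>Panthera leo</td><td>357</td></tr></table>")
out = io.StringIO()
csv.writer(out).writerows(parser.rows)
print(out.getvalue())  # Species,n / Panthera leo,357
```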
I expect to extract and interpret the following types of Facts:
  • biological species
  • place names and geo-locations (e.g. lat/long)
  • protein and nucleic acid sequences
  • chemical names and structure diagrams
  • phylogenetic (e.g. evolutionary) trees
  • scatterplots, bar graphs, pie charts, etc.
and several others as the technology progresses.
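To give a flavour – and these are deliberately naive illustrations on invented text, not the real extractors, which use curated dictionaries, grammars and image analysis:

```python
# Naive sketches of two of the fact types above; the text is invented.
import re

TEXT = ("Panthera leo populations were sampled at 1.28 S, 36.82 E "
        "in Kenya; 2,4-dinitrotoluene was used as a control.")

# Latin binomials: a capitalised genus followed by a lower-case epithet
species = re.findall(r"\b[A-Z][a-z]+ [a-z]{3,}\b", TEXT)

# decimal latitude/longitude pairs such as "1.28 S, 36.82 E"
coords = re.findall(r"\b\d{1,3}\.\d+ [NS], \d{1,3}\.\d+ [EW]\b", TEXT)

print(species)  # ['Panthera leo']
print(coords)   # ['1.28 S, 36.82 E']
```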
The load on publishers’ servers is negligible (this has been analysed by Cameron Neylon of PLoS).
I stress that the output is qualitatively no different from centuries of extraction from the literature – it is the automation of the procedure that is new. Facts are not copyrightable, and nor will my output be.
I shall publish the results on my personal open web pages and repositories such as Github, and offer them to EuropePMC for incorporation if they wish. Everything I publish will be licensed under CC0 (effectively public domain). I would also like to explore exposing the results through the CUL. I have already pioneered dspace@cam for large volumes of facts, but found that the search and indexing wasn’t appropriate at the time. If you have suggestions as to how the UL might help, it could be a valuable example for other scholars.
I am not expecting any push-back or take-downs from publishers as this activity is now wholly legal. The Statutory Instrument overrides any restrictive clauses from suppliers, including robots.txt. I therefore do not need or intend to ask anyone for permission. This will be a very public process – I have nothing to hide. However I wish to behave responsibly, the most likely problem being load on publishers’ servers. Richard S-U (Plant Sciences, Cambridge, copied) and I are developing crawling and scraping protocols which are publisher-friendly (e.g. delays and retries) – we have also discussed this with PLoS (Cameron).
In the unlikely event of any problems from publishers I expect that CUL, as licensee/renter of content, would be the first point of contact. I will be happy to be available if CUL needs me. If publishers contact me directly I shall immediately refer them to CUL as CUL is the licensee.
I have written this in the first person (“I”) since the legislation emphasises personal use and because organised consortia may be seen as “commercial”. The law applies to the UK. Fortunately my mining is wholly compatible:
  • I am a UK citizen from birth
  • I live in the UK
  • I have a pension from the UK government (non-commercial activity)
  • My affiliation is with a UK university
  • The projects I outline are funded by UK organisations.
  • My collaborators are all UK-based.

I play a public domain version of “Rule Britannia!” incessantly and have a Union Jack teddy bear. I shall, however, vote for Britain to continue as a member of the EU and also urge my representatives (MEPs) to continue to press for similar legislation in Europe. I personally thank Julian Huppert and David Willetts for their energy and consistency in pushing for this reform, which highlights the potential value of parliaments in a democracy.
I also thank my collaborators in the ContentMine (http://contentmine.org) where I shall be demonstrating and discussing our technology, which is the best that I know of outside companies like G**gle. As an academic I welcome offers of collaboration, but stress that we cannot run a mining service for you (though we can show you how to run our toolkit).  If the projects are interesting enough to excite me as a scientist I may be very happy to work with you as a co-investigator, though I cannot be paid for mining services.
Sadly, very few publishers come out of this with anything positive. Naturally the Open Access publishers (PLOS, BMC, eLife, MDPI, PeerJ, Ubiquity and others) have no problems as they can be and want to be mined. We have already had long discussions with them. The Royal Society (sic, not the RSC) has positively said that their content can be mined. All the rest, and especially the larger ones, have actively lobbied and FUDded to stop content mining. When you know that organisations are spending millions of dollars to stop you doing science it can be depressing, but we’ve had the faith to continue. I’m particularly proud of Jenny Molloy, Ross Mounce and others for their public energy in maintaining
“The Right To Read is the Right To Mine”
Now that the political battle (which has taken up 5 years of my life) is largely over, I’m devoting my energies to building the ContentMine into a universal resource and building the next generation of intelligent scientific software.
And you can be an equal part of it, if you wish.

Jean-Claude Bradley: Hero of Open Notebook Science; it must become the central way of doing science

It is with great sadness that we report the death of Jean-Claude Bradley, who invented the concept of “Open Notebook Science” (https://en.wikipedia.org/wiki/Open_notebook_science).
[Blue Obelisk presented to J-C (left) by Egon Willighagen (right), 2007. Photo Credit CC BY Christoph Steinbeck]
I learnt of this last Wednesday, while preparing a keynote talk on “Open Data” at the European Bioinformatics Institute at Hinxton. I dropped half of what I was intending to present, to provide a fitting tribute to J-C. On the Blue Obelisk mailing list I wrote:

Jean-Claude was years ahead of his time. He did what he considered right, not what was expedient or what the world expected.
He and I discussed Open Data and Open Notebook Science. We found that they were different things and that each was a critically important subject. J-C set up a webpage on Wikipedia to describe ONS and its practice.
ONS is truly innovative. The research must be available to everyone – regardless of who they are or what they have studied. And it must be fair – “no insider knowledge”.
Several groups in chemistry are following J-C’s lead – and we honour him in that.
I have been invited to present a keynote on “Open Data” at Hinxton Genome Campus tomorrow and shall make J-C’s work the focus and inspiration.
I am truly glad we awarded him a Blue Obelisk. As a community we should think how to take the message further.

I stayed up late into the night finding material to include. J-C has left a clear legacy and it has been possible to find clear, simple, precise indications of his thinking. See slides 4-20 in http://www.slideshare.net/petermurrayrust/ebi-34715150. There is an excellent video interview from last year (links at the end of my presentation).
As I found more material I suddenly got the revelation:
“This is the only proper way to do science in the Century of the Digital Enlightenment”
I perhaps knew this theoretically, but now it hit me emotionally. Jean-Claude’s vision was absolute, simple, and feasible. In fact ONS is a simpler way of doing science than we have at present. It’s vastly better and immediately provides a total record of what everyone has done. It’s literally edited by the minute. Everyone gets fair credit for what they have done, there is a massive reduction in wasted effort, and no opportunity for fraud.
ONS also solves the “Open Data” and “Open Access” problems at a stroke. It is impossible not to publish Open data, and impossible for publishers to try to steal it from the public. Open Access becomes virtually irrelevant – it’s an integral part of the system.
I’ll have a lot more to write. In preparing my talk I asked Mat Todd, Univ of Sydney, to comment. Mat has been another pioneer in OpenNotebookScience, using chemistry not for conventional academic glory (though he has plenty of that) but to cure human disease, particularly Neglected Tropical Diseases. Mat wrote:

JC was a pioneer in open science, and uncompromising about its importance. We had so many productive interactions over the years, starting from the end of January 2006, when we started our open chemistry project on The Synaptic Leap (JC was the first to comment!) and JC posted his very first experiment online at Usefulchem. I remember starting to think about how to do completely open projects, looking around the web in 2005 to see if anything open was going on in chemistry, and coming across JC’s lone voice, and I thought “Wow, who is this guy?” He had dedication and integrity – we’ll all miss him.


TheContentMine: Progress and our Philosophy

TheContentMine is a project to extract all facts from the scientific literature. It has now been going for about 6 weeks – this is a soft launch. We continue to develop it and record our progress publicly. It’s a community project and we are starting to get offers of help right now. We welcome these, but we shan’t be able to get everything going immediately.
We want people to know what they are committing to and what they can expect in return. So yesterday I drafted an initial Philosophy – we welcome comments.

Our philosophy is to create an Open resource for everyone, created by everyone. Ownership and control of knowledge by unaccountable organisations is a major current threat; our strategy is to liberate and protect content.
The Content Mine is a community and we want you to know that your contribution will remain Open. We will build safeguards into The Content Mine to protect against acquisition.
We are a meritocracy. We are inspired by Open communities such as the Open Knowledge Foundation, Mozilla, Wikipedia and OpenStreetMap, all of which have huge communities and have developed trustable governance models.

We are going ahead on several fronts – “breadth-first”, although some areas have considerable depth. Just like Wikipedia or OSM you’ll come across stubs and broken links – it’s the sign of an Open growing organisation.
There’s so much to do, so we are meeting today to draft maps, guidelines, architecture. We’re gathering the community tools – wikis, mail lists, blogs, Github, etc. As the community grows we can scale in several directions:

  • primary source. Contributors can choose particular journals or institutions/theses to mine from.
  • subject/discipline. You may be interested in Chemistry or Phylogenetic Trees, Sequences or Species.
  • technology. Concentrate on OCR, Natural Language Processing or Crawling, or develop your own extraction techniques
  • advocacy and publicity. A major aim is to influence scientists and policy makers to make content Open
  • community – its growth and practice.

We are developing a number of subprojects which will demonstrate our technology and how the site will work. Hope to report more tomorrow.

Is Elsevier going to take control of us and our data? The Vice-Chancellor of Cambridge thinks so and I'm terrified

I am gutted that I missed the Q+A session with Professor Sir Leszek Borysiewicz, the Vice-Chancellor of Cambridge University. It doesn’t seem to have been advertised widely – only 17 people went – and it deserves to be repeated.
The indefatigable Richard Taylor – who reports everything in Cambridge – has reported it in detail. It was a really important meeting. I’ll highlight one statement, which chills me to the bone (note that this is RT’s transcript):

“the publishers are faster off the mark than governments are. Elsevier is already looking at ways in which it can control open data as a private company rather than the public bodies concerned.”

Now I know this already – I’ve spent 4 years finding out in detail about Elsevier’s publishing practices. It’s good that the VC realises it as well. Open Access is a mess – the universities have given part of their priceless wealth to the publishers and are desperately scrabbling to get some of it back. The very lack of will and success makes me despondent – LB says:

“And I know disadvantaging the individual academic by not having publication in what is deemed to be the top publications available? So it’s a balance in the argument that we have.”

In other words, we have to concede control to the publishers to get the “value” of academics publishing where they want.
Scholarly publishing costs about 15,000,000,000 USD per year. Scholarly knowledge/data is worth at least ten times that (upwards of 150,000,000,000 USD/year). [I’ll justify the figure later]. And we are likely to hand it all over to Elsevier (or Macmillan Digital Science).
I’ve done what I can to highlight the concern. This was the reason for my promoting the phrase “Open Data” in 2006 – and in helping create the Panton Principles for Open Data in Science in 2008. The idea is to make everyone aware that Open Data is valuable and needs protecting.
Because if we don’t, Elsevier and Figshare and the others will possess and control all our data. And then they will control us.
Isn’t this overly dramatic?
No. Elsevier has bought Mendeley – a social network for managing academic bibliography.  Scientists put their current reading into Mendeley and use it to look up others. Mendeley is a social network which knows who you are, and who you are working with.
Do you trust Mendeley? Do you trust Elsevier? Do you trust any large organisation without independent control (GCHQ, NSA, Google, Facebook)? If you do, stop reading and don’t worry.
In Mendeley, Elsevier has a window onto nearly everything that a scientist is interested in. Every time you read a new paper, Mendeley knows what you are interested in. Mendeley knows your working habits – how much time are you spending on your research?
And this isn’t just passive information. Elsevier has Scopus – a database of citations. How does a paper get into it? Scopus decides, not the scientific world. Scopus can decide what to highlight and what to hold back. Do you know how Journal Impact Factors are calculated? I don’t, because it’s a trade secret. Does Scopus’ Advisory Board guarantee transparency of practice? Not that I can see. Since JIFs now control much academic thinking and planning, those who control them are in a position to influence academic practice.
Does Mendeley have an advisory board? I couldn’t find one. And when I say “advisory board”, I mean a board which can uncover unacceptable practices. I have no evidence that anything wrong is being done, but I have no evidence that there are any checks against it. Elsevier has already created fake journals for Merck, so how can I be sure it will resist the pressure to use Mendeley for inappropriate purposes? Is Mendeley any different from Facebook as far as transparency is concerned?  Is there any guarantee that it is not snooping on academics and manipulating and selling opinion? “Dear VC – this is the latest Hot Topics from Mendeley; make your next round of hirings in these fields”.
I’m also concerned that Figshare will go the same way. I have huge respect for Mark Hahnel, who founded it. But Figshare also doesn’t appear to have an advisory board. Do I trust Macmillan? “we may anonymize your Personal Information so that you cannot be individually identified, and provide that information to our partners, investors, Content providers or other third parties.” Since information can be anonymised or useful, but not both, are you happy with that?
There aren’t any easy solutions. If we do nothing, we are trusting our academic future to commercial publishers who control the information and knowledge flow. We have to take back our own property – the knowledge that *we* produce. Publishers should be the servants of knowledge – at present they are becoming the tyrants.

Shuttleworth Fellowship: Month 2; synergy with the Digital Enlightenment can change the world

I’m now finishing the second month of my Shuttleworth Fellowship – the most important thing in my whole career. My project The Content Mine aims to liberate all the facts in the scientific literature.
That’s incredibly ambitious and I don’t know in detail how it’s going to happen – but I am confident it will.
This week we posted our website – and showed how we create content. What’s modern is that this is a community website – we’re inspired by Wikipedia and OpenStreetMap, where volunteers can find their own area of interest and contribute. Since there is no other Open resource for content-mining we shall provide that – we have 100 pages and intend to go beyond 1000. Obviously you can help with that. And of course Wikipedia’s information is invaluable.
We have an incredible team:

  • Michelle Brook. Michelle is Manager and making a massive impression with her work on Open Access.
  • Jenny Molloy. Jenny has co-authored the foundations of Open Content Mining and ran the first workshop last year.
  • Ross Mounce. Ross has championed Open Content Mining in Brussels and is developing software for mining phylogenetics.
  • Mark MacGillivray. Co-authored Open Bibliography and founded CottageLabs who are supporting our web presence and IT infrastructure.
  • Richard Smith-Unna. Founder of the volunteer scientist-developer community solvers.io, where he is pitching ContentMine tasks to support Crawling.

But we also have masses of informal links and collaborations. Because we are Open, people want to find out what we are doing and offer help. It’s possible that much of our requirements for crawling may be provided by the community – and that’s been happening over the last week. We’ve had an important contribution to our approach to Optical Character Recognition. Today I was Skyped with suggestions about Chemistry in the ContentMine.
This all happens because of the Digital Enlightenment. People round the world are seeing the possibilities of zero-cost software, efficient voluntary Open communities and the value of liberated Knowledge. There are many projects wanting to liberate bibliography, reform authoring, re-use bioscience, etc. Occasionally we wake up and think “wow! problem solved!”. If you think “we”, not “me”, the world changes.
The Fellows and Foundation are fantastic. I have an hour’s Skype every week with Karien, and another hour with the whole Fellowship. These are incredibly valuable. With such a huge ambition we need focus.
There’s huge synergy with several formal and many informal projects. Once you decide that your software and output are Open, you can move several times faster. No tedious agreements to sign. No worries about secrecy, so no delays in making knowledge open. Of the formal projects:

  • Andy Howlett is doing the 3rd year of his PhD in the Unilever Centre here on metabolism. He can use the 10 years’ worth of Open Source software we have developed, and because his contributions are also Open we’ll benefit in return.
  • Mark Williamson is using our software in similar fashion.
  • Ross Mounce and Matt Wills at Bath are running the PLUTo project. Because it’s completely Open they can use our software and we can re-use their results.
  • we are starting work with Chris Steinbeck at EBI on automated extraction of metabolites and phytochemistry from the literature.

Informally we are working with Volker Sorge (Birmingham) and Noureddin Sadawi (Brunel) on scientific computer vision and re-use of information for Blind and Visually Impaired people. With Egon Willighagen and John May on the (Open) Chemistry Development Kit. With the Crystallography Open Database…
How can it possibly work?
In the same way that Steve Coast “single-handedly” and with zero cash built up OpenStreetMap:

  • promoting the concept. We are already well known in the community and people are watching and starting to participate.
  • by building horizontal scalability. By dividing the problem into separate journals, we can build per-journal solutions. By identifying independent disciplines (chemistry, species, phylogenetics…) we can develop independently.
  • an Open modular software and information architecture. We build libraries and tools, not applications, so it’s easy to reconfigure – if people want a command-line approach we can offer that (see the sketch after this list).
  • By re-using what’s already Open. We need a chemical database? don’t build it ourselves – work with EBI and Pubchem. An Open bibliography? work with Europe PubMedCentral.
  • by attracting and honouring volunteers. RichardSU has discovered the key point is to offer evening-sized problems. Developers don’t want to tackle a complex infrastructure – they want something where the task is clear and they can complete before they go to bed. And we have to make sure that they are promoted as first-class citizens.
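As promised above, a minimal sketch (all names hypothetical) of what “libraries, not applications” looks like in practice – each journal gets a small scraper plugin behind one shared interface, so the system scales horizontally, journal by journal:

```python
# Sketch of a per-journal plugin registry: adding support for a new journal
# means writing one small function, not touching the core. Names hypothetical.
SCRAPERS = {}

def scraper(journal_id):
    """Decorator that registers a per-journal scraper function."""
    def register(fn):
        SCRAPERS[journal_id] = fn
        return fn
    return register

@scraper("example-journal")
def scrape_example(html):
    # each plugin only needs to understand its own journal's page layout
    return {"title": html.split("<h1>")[1].split("</h1>")[0]}

def scrape(journal_id, html):
    return SCRAPERS[journal_id](html)

print(scrape("example-journal", "<h1>A paper</h1>"))  # {'title': 'A paper'}
```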

Much of what we do will depend on what happens every week. A month ago I hadn’t planned for solvers.io; or Longan Java OCR; or Peer Library; or JournalToCs; or BoofCV; or …
… YOU!
PS: You might wonder what a 72-year-old is doing running a complex knowledge project. RichardSU asked that on Hacker News and I’m pleased that others value my response. If Neelie Kroes can change the world at 72, so can I – and so can YOU.
If you are retired you’re exactly the sort of person who can make massive contributions to the Content Mine. And it’s fun.

The Cost of Knowledge; Tim Gowers' amazing analysis of Elsevier's income

An amazing post came out yesterday from an amazing person. Tim Gowers is a Fields Medallist (the mathematics equivalent of the Nobel Prize). But Tim is also a star in the world of Open. 5 years ago he launched the Polymath project – a completely Open, meritocratic, collaborative project in citizen mathematics. They solved a complex and difficult mathematical problem in an astonishingly short time. It’s rightly regarded as an exemplar of what the future can be in the century of the Digital Enlightenment.
But Tim has also fought the political battle for Openness and freedom of access to scholarship. Two years ago Tim was incensed by the outrageous cost of Elsevier journals and called for a boycott (“The Cost of Knowledge”). This was instantly successful, getting thousands of signatures in weeks. (I have signed it).
Now he’s taken this further in a large project and huge blog post. The prices of scholarly journals are closely guarded secrets. Universities use public money to buy subscriptions, and Elsevier requires the prices to be confidential. They even require the confidentiality clause to be confidential. (PMR: Why do universities meekly sign this?) But there is a way forward. Universities are public institutions and as such are bound by the Freedom of Information Acts.
So Tim has made requests to all Russell Group Universities asking for details of the contract and prices with Elsevier.
I know how much effort this is because I’ve done a similar thing (asking for restrictive clauses in publisher contracts). Some universities give positive helpful replies (Cambridge was one – https://www.whatdotheyknow.com/request/licences_with_subscription_publi#outgoing-341924 – this shows the process). But some Universities try to avoid giving useful answers. In that case we may have to go back and re-ask the question differently or even write to the Information Commissioner. It’s a LOT of work.
So Tim has published a huge amount of information and comment.

  • read it
  • read Michelle Brook’s great summary first.
  • give it to your students to read. Students, give it to your lecturers and professors to read.
  • write to your MP. (I have)

Here’s part of Michelle’s summary:

Cambridge spent £1,161,571 in 2012.

Scale that up and you find that the UK is paying over 150 MILLION pounds to Elsevier every year.

And, although they are smaller, there are hundreds of other publishers out there.

The world pays 15 BILLION dollars to scholarly publishers each year. And a significant amount of that is used to stop YOU reading it (http://scholarlykitchen.sspnet.org/2014/04/24/rearguard-and-vanguard/).


Elsevier doesn't publish Junk Science. Does it?

Some years ago Elsevier funded the PRISM initiative to discredit Open Access with the slogan “Open Access is Junk Science”. The implication, of course, was that Elsevier didn’t publish junk.
The chemical blogosphere was alerted by Egon Willighagen to a paper in “Drug Discovery Today”
[Image: screenshot of the “DrugPrinter” paper]
Chemjobber – a regular and respected chemical blogger – writes (http://chemjobber.blogspot.co.uk/2014/04/i-cannot-believe-this-was-published.html):

I CANNOT believe this was published

Via Egon Willighagen, a truly bizarre article in Drug Discovery Today that appears to have been accepted for publication:

In drug discovery, de novo potent leads need to be synthesized for bioassay experiments in a very short time. Here, a protocol using DrugPrinter to print out any compound in just one step is proposed. The de novo compound could be designed by cloud computing big data. The computing systems could then search the optimal synthesis condition for each bond–bond interaction from databases. The compound would then be fabricated by many tiny reactors in one step. This type of fast, precise, without byproduct, reagent-sparing, environmentally friendly, small-volume, large-variety, nanofabrication technique will totally subvert the current view on the manufactured object and lead to a huge revolution in pharmaceutical companies in the very near future.

Believe it or not, the author proposes the use of optical tweezers to synthesize drugs atom-by-atom (among other nanofabrication techniques.)
I am holding out hope that this paper is some sort of Sokal Affair-type hoax, or perhaps an incredibly convincing piece of elaborate performance art.

See also Derek Lowe at Corante, who writes widely on drugs: http://pipeline.corante.com/archives/2014/04/21/molecular_printing_of_drug_molecules_say_what.php
PMR: It is incredible that this paper was published. It’s pure fantasy – “if we create robots at a molecular level we can assemble molecules”. It’s in the realms of matter transformation, hyperspace jumps, etc. Great for Amazing Science Fiction.
But it has no place in a reputable scientific journal. There are no experiments, no discussion of scale – no anything. It’s total fantasy. Yes it might happen in 100 years; so might worm-hole travel; and moral robots.
Oh, and it’s APC-paid Open Access. The author has probably paid ca. 2500 GBP to allow other people to read this.
And who is Peter Murray-Rust to criticise a paper published in an Elsevier journal? It’s peer-reviewed so it must be correct. And, as we know, Elsevier doesn’t publish junk science.

The Content Mine website – how we create it. And the community can edit and contribute.

We are now about 6 weeks into The Content Mine project and have released our website (http://www.contentmine.org). In the spirit of living a web-friendly life, this is a living object which is planned to be:

  • easy to update and maintain
  • re-usable
  • communal and collaborative.
  • scalable

[Image: Garzweiler mine panorama]
© Raimond Spekking / CC BY-SA-3.0 (via Wikimedia Commons)
To do that we have taken a novel approach to creating the site. We want the material to be easy to edit and create, with potentially lots of contributors. That’s not always easy if you have to have login access to the website.
The best software is often on collaborative FLOSS software sites. That’s because it’s had hundreds of person-years of knowledgeable users and developers. So I turned to Github and its wiki. A wiki is an excellent tool for developing one’s thoughts, as the structure evolves with our insight. So I started off with a list of the most important things that I thought we would need and put them on the first page of the wiki (https://github.com/petermr/contentMine/wiki/ContentMining), which looks/ed like:
[Screenshot: the ContentMine wiki page]
 
This is how you see it after an initial edit. It’s very functional, with lots of editing icons, etc. The blue phrases are links to other pages or external pages. I created about 100 pages on Sunday – some are stubs but most have text and links to other pages. And the value is that we are building up a structured resource. It’s a set of pages that can be re-used for tutorials, reference and, we hope, additions by volunteers.
However, to make it more like a normal web page, Mark MacGillivray and his Cottage Labs colleagues have created software for transferring Github content to a standard website. It can be automated so that, for example, we can update the website from the wiki every midnight. Here’s the same page:
[Screenshot: the same page rendered on the contentmine.org website]
 
(The picture is RNA from some of Ross Mounce’s Openly extracted Phytotaxa scraping.) Mark’s done a great job in almost no time. That’s partly because CL are very smart and partly because CL build re-usable code. And it’s easy to change the look-and-feel.
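For the curious, the nightly update can be very small indeed. A sketch (paths hypothetical, and much cruder than Cottage Labs’ real software): every GitHub wiki is itself a git repository, so a midnight job can pull it and render the Markdown pages to HTML:

```python
# Sketch of a nightly wiki-to-website sync: pull the wiki's git repository
# and render each Markdown page to HTML with pandoc. Paths are hypothetical.
import pathlib
import subprocess

WIKI_REPO = "https://github.com/petermr/contentMine.wiki.git"
CHECKOUT = pathlib.Path("/srv/contentmine/wiki")
SITE = pathlib.Path("/srv/www/contentmine")

if CHECKOUT.exists():
    subprocess.check_call(["git", "-C", str(CHECKOUT), "pull"])
else:
    subprocess.check_call(["git", "clone", WIKI_REPO, str(CHECKOUT)])

SITE.mkdir(parents=True, exist_ok=True)
for page in CHECKOUT.glob("*.md"):
    html = SITE / (page.stem + ".html")
    subprocess.check_call(["pandoc", str(page), "-o", str(html)])
```

A cron line such as `0 0 * * * python sync_wiki.py` would then refresh the site every midnight.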
Most people hate keeping websites up to date, but I like wikis. So I’ll be adding more pages which will help to explain content mining, and create a re-usable resource.