Content Mining; thoughts from den Haag – can we aspire to universal knowledge?

I’m in den Haag (The Hague) for a meeting run by LIBER – the association of European Research Libraries – about Content Mining. Content Mining is often called TDM – Text and Data Mining – but it also applies to images and other media which contain uncopyrightable facts. This meeting is not primarily technical – it’s about the socio-politico-legal issues in doing mining.
Mining can create Universal knowledge. I’ve just read a wonderful post from Pierre Estienne: we share the same vision and the same deep-seated concern that publishers are destroying this vision. Here are just a few quotes – you should read the whole post.

A Universal Library is a representation of science. Gathering all human knowledge in one place creates a monolithic artefact I call the Universal Library. It contains all of what Popper called the third world or world three: all of humankind’s literature.
As Popper said, “instead of growing better memories and brains, we grow paper, pens, pencils, typewriters, dictaphones, the printing press, and libraries.”, yet today brain-enhancing tools like libraries are scattered around the globe, and are (academic libraries especially) inaccessible for most of us. The Universal Library is the ultimate tool we can create in order to store and retrieve all of our knowledge easily.


“The internet is Gutenberg on steroids, a printing press without ink, overhead or delivery costs”. [Michael Scherer]… [PE continues]  Yet the internet isn’t seen this way by publishers. They still behave like books are a “scarce” commodity, while the internet allows unlimited distribution of books for free. If the publishers really embraced the internet, they would publish their books/journals for free, instead of charging exorbitant amounts of money for pdfs.


Google is a great tool, but it doesn’t have access to everything – scholarly publications especially are locked inside publishers’ databases and are behind paywalls – if you want to really get a good look at most of the literature, you have to switch between multiple tools: Google, Elsevier, Wiley, Springer’s databases, etc… It’s a very time consuming process the Universal Library should make fast and simple.


So, publishers. The “big three” (Wiley, Springer, Elsevier) and a few others retain a monopoly on scientific publications, and behave like a cartel, making deals to not compete with one another (just look at their prices, which are kept very high and are the same for all the different publishers). As they refuse to compete, they are very unlikely to change their business model. I’m surprised they haven’t been under investigation for antitrust… As they have the copyrights of most of the scientific publications in circulation, they can charge sky-high prices for simple pdfs, and they are quick to call “pirate” anyone who tries to make these papers more available.

[and lots more wonderfully clear, historically grounded stuff…]
Picking up immediate threads…
I am challenging Nature / Macmillan over their new “experiment” in releasing dumbed-down (read-only) versions of the scholarly literature that “they own”. They think it’s a step forward. I think it’s an assertion that they believe they control the scholarly literature. I’ll blog more, but here’s something to think about:
It costs Nature 30-45 THOUSAND dollars to process a single accepted article (usually 2-6 pages).
That’s Nature’s figures, this week. I’ll rephrase that:
Nature take 30-45,000 USD out of the community (taxpayers and students-paying-fees) to create a single published article which, in most cases, they control (“own”).
There is no way this is moving towards a Universal Library. There are many dystopias which we can imagine – 1984, Fahrenheit 451, The Lives Of Others, and the life and death of Aaron Swartz.
So in contentmine.org we are developing a part of the universal library. We are starting with Open Access articles and then moving to legally-minable-facts-in-UK. This is not Universal. It’s severely restricted by the publishing industry. If I step outside the lines they have drawn, we shall be challenged. Diderot was challenged by the establishment – they, including the printers, destroyed his work.
But his soul shines brightly today and Pierre and I and many others honour him.
And ContentMine is starting to catch on. We’ve had a great contribution from Magnus Manske + Wikidata and it deserves a complete post on its own.
Off to LIBER, via Bezuidenhout…


3 Responses to Content Mining; thoughts from den Haag – can we aspire to universal knowledge?

  1. I read recently that the Nature $25k-30k figure was derived (some time ago) by dividing income by number of articles. So that is the price tag people would have to pay to maintain NPG’s current income level, rather than the actual cost to them per article. That it costs them ~$10k per page to produce an article I find hard to swallow. Is that just for Nature? Can’t they defray the costs (workflow, subject experts etc) over the rest of their journals?

    • pm286 says:

      Yes – it seems to be true. It’s probably cheaper to create a journal made of pure gold. Another calculation is that a single person in Nature can probably process about 3 journal articles a year (taking salary + overheads at 100 K GBP).

