Content Mining hackday in Edinburgh; we solve Scraping


[P Murray-Rust, CC 0]
We had our hack day in Edinburgh yesterday on content mining.
First, massive thanks to:
  • Mark MacGillivray for organising the event in Informatics Forum
  • Informatics Forum for being organised
  • Claire and Ianthe from Edinburgh library for sparkle and massive contributions to content mining
  • PT (Sefton) for organising material for the publishing, and forbearance when it got squeezed in the programme
  • Richard Smith-Unna who took time off holiday to develop his quickscrape code.
  • CottageLabs in person and remotely
  • Cameron Neylon and PLoS for Grub/Tucker etc.
  • and everyone who attended
Several participants tweeted that they enjoyed it:
Claire Knowles @cgknowles Thanks to @ptsefton for inviting us and @petermurrayrust for a fun day hacking #dinosaur data with @kimshepherd @ianthe88 & @cottagelabs
So now it’s official – content mining is fun! You’ll remember we were going to:
  • SCRAPE material from PLOS (and other Open) articles. And some of these are FUN! They’re about DINOSAURS!!
  • EXTRACT the information. Which papers talk about DINOSAURS? Do they have pictures?
  • REPUBLISH as a book. Make your OWN E-BOOK with Pictures of DINOSAURS with their FULL LATIN NAMES!!

About 15 people passed through, and Richard Smith-Unna and Ross Mounce were online. Like all hackdays it had its own dynamics, and I was really excited by the end. We had lots of discussion, several small groups crystallised, and we also covered molecular dynamics. We probably didn’t do full justice to PT’s republishing technology, but that’s how it goes. And we came up with graphic art for DINOSAUR games!
We made huge progress on the overall architecture (see image) and particularly on SCRAPING. Ross had provided us with 15 sets of URLs from different publishers, all relating to Open DINOSAURS.

  • APP-dinosaur-DOIs.txt: APP CC-BY articles; there are more that are free access but I have on…
  • BioMedCentral-dinosaur-articlelinks.txt: BMC article links (NOT DOIs); filtered out ‘free’ but not CC BY articles
  • Dinosauria_valid_genera.csv: list of valid genera in Dinosauria downloaded from PaleoDB. It includ…
  • Elsevier-CCBY-dinosaur-DOIs.txt: 3 Elsevier CC BY articles
  • FrontiersIn-dinosaur-35articlelinks.txt: FrontiersIn
  • Hindawi-dinosaur-DOIs.txt: Pensoft & Hindawi
  • JournalofGeographyandGeology_DOI.txt
  • Koedoe-DOI.txt: PDF scan, but CC BY, from 1986
  • MDPI-dinosaur-DOI.txt: one MDPI article
  • RoyalSocietyOA-dinosaur-DOIs.txt: just one
  • SAJournalofScience-DOI.txt: 1 CC BY article on African dinosaurs
  • SATNT-DOI.txt: 1 CC-BY article in Afrikaans
  • Wiley-CCBY-dinosaurs.txt: one Evolution (Wiley) article
  • peerj-dinosaur-DOIs.txt: 8 PeerJ article DOIs
  • pensoft-dinosaur-DOIs.txt: Pensoft & Hindawi
  • plos-biology-dinosaurs-DOIs.txt: 20 PLOS Biology
  • plos-one-dinosaur-DOIs.txt
Hard work, and we hope to automate it through CRAWLING, but that’s another day. So could we scrape files from these? Remember they are all Open, so we don’t even have to invoke the mighty power of Hargreaves yet. However, the technology is the same whether it’s Open or paywalled-and-readable-because-Cambridge-pays-lots-of-money.
We need a different scraper for each publisher (although sometimes a generic one works). Richard Smith-Unna has created the quickscrape platform. In this you have to create a *.json file for each publisher (or even journal).
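To give a flavour of what such a definition looks like, here is a sketch for a hypothetical publisher. It follows the scraperJSON style that quickscrape uses (a URL regex plus named elements, each with an XPath selector); the exact field names here are illustrative, so check the quickscrape repository for the current schema:

```json
{
  "url": "examplepublisher\\.org",
  "elements": {
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    },
    "licence": {
      "selector": "//div[@class='licence']"
    }
  }
}
```

The idea is that all the publisher-specific knowledge lives in this one declarative file, so non-programmers can contribute scrapers without touching the quickscrape code itself.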
The first thing is to install quickscrape. Node.js, like Java, aims to be WORA (write-once-run-anywhere, parodied as WODE: write-once-debug-everywhere). RSU has put a huge amount of effort into this, so most people installed it OK, but a few had problems. This isn’t RSU’s fault; it’s a feature of dependencies in any modern language: versions and platforms and libraries. Thanks to all yesterday’s hackers for being patient, and to RSU for breaking his holiday to support them. (Note: we haven’t quite cracked Windows yet, but we will.) For non-hacker workshops, i.e. where we don’t expect so many technical computer experts, we have a generic approach to distributions.
Then you have to decide WHAT can be scraped. This varies from whole articles (e.g. HTML) to images (PNG) to snippets of text (e.g. licences). What really excited and delighted me was how quickly the group understood what to do and then went about it without any problems. The first task was to list all the scrapable material, and we used a Google Spreadsheet for this. It’s not secret (quite the reverse), but I’m just checking permissions and other technicalities before we release the URL with world access.
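If you want to see what the extraction step actually does, here is a toy Python sketch: given an article’s HTML, pull out the title, image URLs, and licence with XPath. The page below is entirely made up, and for brevity it uses the standard-library ElementTree on well-formed markup; real publisher pages are rarely valid XML, so in practice you’d use a forgiving HTML parser (which is part of what quickscrape handles for you).

```python
# Toy illustration of scraping: extract scrapable elements from article HTML.
# The page below is a made-up stand-in for a publisher page.
import xml.etree.ElementTree as ET

PAGE = """
<html><head><title>A New Dinosaur</title></head>
<body>
  <h1 class="article-title">A New Dinosaur from the Isle of Skye</h1>
  <img src="/figures/fig1.png"/>
  <img src="/figures/fig2.png"/>
  <div class="licence">CC BY 4.0</div>
</body></html>
"""

doc = ET.fromstring(PAGE)

# Each of these XPath expressions plays the role of one "element"
# in a scraper definition: title, images, licence.
title = doc.find(".//h1[@class='article-title']").text
images = [img.get("src") for img in doc.findall(".//img")]
licence = doc.find(".//div[@class='licence']").text

print(title)    # A New Dinosaur from the Isle of Skye
print(images)   # ['/figures/fig1.png', '/figures/fig2.png']
print(licence)  # CC BY 4.0
```

This is exactly the kind of XPath-per-element knowledge that the hackers were recording in the spreadsheet, one column per attribute.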
You’ll see (just) that we have 15 publishers and about 20 attributes. Who did it? Which scraper? (Note with pleasure that RSU’s generic scraper was pretty good!) Did it work? If not, this means customising the scraper. 9.5/15 is wonderful at this stage.
The great thing is that we have built the development architecture. If I have the Journal of Irreproducible Dinosaurs then I can write a scraper. And if I can’t, it will get mailed out to the Content Mine community and they/we’ll solve it. So fairly shortly we’ll have a spreadsheet showing how we can scrape all the journals we want. In many instances (e.g. BioMedCentral) all the journals (ca 250) use the same technology, so one-scraper-fits-all.
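The one-scraper-per-publisher idea, with a generic fallback, is simple enough to sketch in a few lines of Python. All the names here are hypothetical (quickscrape does its own scraper selection internally); this just shows the shape of the dispatch:

```python
# Sketch of scraper dispatch: pick a scraper definition by the URL's
# domain, falling back to a generic scraper when no publisher-specific
# one exists. Domain and file names are hypothetical.
from urllib.parse import urlparse

SCRAPERS = {
    "journals.plos.org": "plos.json",
    "www.biomedcentral.com": "bmc.json",  # one scraper covers ~250 BMC journals
    "peerj.com": "peerj.json",
}

def pick_scraper(url: str) -> str:
    """Return the scraper definition file for a given article URL."""
    domain = urlparse(url).netloc
    return SCRAPERS.get(domain, "generic.json")

print(pick_scraper("https://peerj.com/articles/36/"))      # peerj.json
print(pick_scraper("https://example.org/some-article"))    # generic.json
```

Because scrapers are keyed by domain, adding a new publisher to the system is just one new entry plus one new JSON file, which is why a community spreadsheet of "who has scraped what" works so well.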
If YOU have a favourite journal and can hack a bit of XPath/HTML, then we’ll be telling you how you can tackle it and add to the spreadsheet. For the moment, just leave a comment on this blog.