petermr's blog

A Scientist and the Web

 

The Content Mine website – how we create it, and how the community can edit and contribute.

April 22nd, 2014

We are about 6 weeks into The Content Mine project and have now released our website (http://www.contentmine.org). In the spirit of living a web-friendly life, this is a living object which is planned to be:

  • easy to update and maintain
  • re-usable
  • communal and collaborative
  • scalable

Garzweiler_Panorama_2013_-_1252-1266

© Raimond Spekking / CC BY-SA-3.0 (via Wikimedia Commons)

To do that we have taken a novel approach to creating the site. We want the material to be easy to edit and create, with potentially lots of contributors. That’s not always easy if contributors must have login access to the website.

The best software is often found on collaborative FLOSS sites. That’s because it has had hundreds of person-years of knowledgeable users and developers. So I turned to Github and its wiki. A wiki is an excellent tool for developing one’s thoughts, because the structure can evolve as our insight develops. So I started off with a list of the most important things that I thought we would need and put them on the first page of the wiki (https://github.com/petermr/contentMine/wiki/ContentMining), which looked like:

website

 

This is how you see it after an initial edit. It’s very functional, with lots of editing icons, etc. The blue phrases are links to other pages or external pages. I created about 100 pages on Sunday – some are stubs but most have text and links to other pages. And the value is that we are building up a structured resource. It’s a set of pages that can be re-used for tutorials, reference and, we hope, additions by volunteers.

However, to make it more like a normal web page, Mark MacGillivray and his Cottage Labs colleagues have created software for transferring Github content to a standard website. It can be automated so that, for example, we can update the website from the wiki every midnight. Here’s the same page:

website1

 

(The picture is RNA from some of Ross Mounce’s openly extracted Phytotaxa scraping.) Mark’s done a great job in almost no time. That’s partly because CL are very smart and partly because CL build re-usable code. And it’s easy to change the look-and-feel.

Most people hate keeping websites up to date, but I like wikis. So I’ll be adding more pages which will help to explain content mining and create a re-usable resource.

 

Jenny Molloy Awarded an AMI

April 22nd, 2014

Jenny Molloy is a central figure in the Open community and has been particularly active in campaigning for Content Mining. We are delighted that she is part of our core team on the ContentMine project (http://contentmine.org).  AMI the kangaroo is the mascot of our content-mining software and when wonderful people do wonderful things they are awarded an AMI:

jennyAmi

Previous AMI awardees are:

  • Ross Mounce
  • Michelle Brook
  • Helen Turvey (Shuttleworth)
  • Karien Bezuidenhout (Shuttleworth – presentation next month)

Jenny runs the Open Knowledge Science Working Group and has co-authored our principles and practice of Open Content Mining.  She advocates Open Data (slides).

 

Glitches on blog

April 19th, 2014

The blog had some glitches recently (technically “bumps on the road of our journey”). These were due to upgrades. I couldn’t post and people couldn’t comment. I can now post, but comments don’t show up. If you have comments, mail me (peter.murray.rust <largesearchengine>mail.com) or @petermurrayrust on Twitter. Hope to be back after Monday.

OCR in Java (2); Zarkonnen’s Longan is the best yet

April 17th, 2014
The web is wonderful! The best way to write code is not to. I posted this morning about the problems I had in using Java for Optical Character Recognition. And within an hour I had this great response from David Stark (@Zarkonnen_com):
For starters, here’s what I tried to post on your blog about Longan earlier:
So a few years back I found myself in the same situation as you – wanting to do Java OCR, and the only real solution on the block is tesseract/ocropus, which is a nightmare to install/distribute. I eventually started work on a pure-Java OCR system called “Longan”. The project is currently on an extended hiatus, but if there is interest / potential collaborators / potential users, I’d be very interested in reviving it.

PMR: that’s wonderful. Not only for this project, but for the idea that people can carry a project through a number of years. You don’t have to be in a University. There’s a huge need…

Longan’s features:
* Pure Java with zero dependencies. It doesn’t even reference any external jars.
* Based on convolutional neural networks, which are a pretty modern and robust approach to OCR.
* Usable as a library or command-line program.
* Reasonably modular system composed of stages.
* Takes care of eliminating images and speckles from the input, adjusting input rotation, and detecting multi-column layouts.
* Recognition system is pretty much all data-driven. You can plug in a different set of neural network weights to get a different alphabet or specialise the system for a particular font or group of fonts.
* Free and libre, licensed under Apache 2.0.

PMR: very much the features I try to use myself. (I do, however, use communal libraries such as Apache, and with Maven that’s easy once you have learnt how. So David doesn’t need to include the code of apache.commons.cli anymore – we simply have a Maven dependency in the pom.xml.) Fully agreed about library and CLI and the modularity.
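For example, rather than bundling the sources of apache.commons.cli, a Maven project just declares the dependency in its pom.xml (the version shown is illustrative):

```xml
<!-- Declared once in pom.xml; Maven fetches the jar, no vendored code needed -->
<dependency>
  <groupId>commons-cli</groupId>
  <artifactId>commons-cli</artifactId>
  <version>1.2</version>
</dependency>
```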

There’s a workflow aspect to image processing. Typically we may have to crop, denoise, deskew, equalise, binarise, thin, recognise, etc. I normally start with born-digital images, so I only need the last one or two steps. It’s therefore important to modularise and pipeline. And it’s very important to be able to experiment – hence data-driven and parameterisation.
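Such a stage-based pipeline might be sketched like this in Java (the stage names, the placeholder image type and the whole class are illustrative, not code from any of the projects discussed):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Minimal sketch of a modular image-processing pipeline. Stages can be
 *  added, dropped or reordered, which is the point of modularising. */
public class OcrPipeline {
    // Each stage transforms an image; int[][] stands in for a real raster type.
    private final List<UnaryOperator<int[][]>> stages = new ArrayList<>();

    public OcrPipeline add(UnaryOperator<int[][]> stage) {
        stages.add(stage);
        return this;
    }

    /** Run every stage in order over the input image. */
    public int[][] run(int[][] image) {
        for (UnaryOperator<int[][]> stage : stages) {
            image = stage.apply(image);
        }
        return image;
    }

    public static void main(String[] args) {
        // Born-digital input: only the last steps are needed; a scan would
        // prepend deskew/denoise stages. Identity lambdas stand in for real ones.
        OcrPipeline p = new OcrPipeline()
            .add(img -> img)   // binarise (placeholder)
            .add(img -> img);  // thin (placeholder)
        int[][] out = p.run(new int[][]{{0, 255}, {255, 0}});
        System.out.println(out.length); // 2
    }
}
```

Because each stage has the same shape, experimenting (swap a thinning algorithm, insert an equalise step) means editing the pipeline construction, not the stages themselves.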

DS-Z: Anyway, I hiatus-ed the project a few years ago because I hit a point where I realised I had to start going about fine-tuning the neural networks in a more methodical way, and I didn’t have the time to do so. I had basically reached the point where I’d try it out on some example input, fix things up to make it work better, only to realise I’d made it perform much worse on other input!

PMR: Very common. When it’s a difficult problem (like OCR) the parameters are often finely balanced.

In terms of trying it out, try running it with com.zarkonnen.longan.Main as the entry point and the path to the attached file as the single command-line argument. Note that the file is just a random fragment I grabbed from a scan, so it’s not particularly optimised for Longan, or Longan optimised for it. Then try commit e2f819f5f865ae6e9211a435f098883979fdb1ed which is actually much better, as it’s the version before some major re-engineering efforts which had the aforementioned effect of making it work better for some inputs and a lot worse for others.
I can also have a spelunk around and check for any secondary projects/data that I used during the project.
- David

Marvellous! It actually took me five minutes to work out how to run Longan, whereas I have struggled for days with javaocr2012.

But I shan’t throw that away. I want to see how neural nets compare with Moments+Mahalanobis. With NN you have no insight into the model, and that’s a problem when you need to refine things. I shall use both – neither can be 100% perfect. And in any case how do you tell a zero (0) from an oh (o)? I’m also going to include a third way – topology of skeletons – not sure whether it’s been used before. And we’ve also got information from the environment – io is more likely to be one-zero (== 10) than one-oh, although if you’re a planetologist it might be Io – the Jovian moon. In chemistry IO could be hypoiodite. And so on.
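A toy sketch of that environmental disambiguation – digits nearby favour the digit reading. The rule and all names are illustrative; none of the libraries discussed work this way:

```java
/** Toy contextual disambiguator for look-alike glyphs such as 0/O.
 *  Purely illustrative: real disambiguation would weigh much more context. */
public class GlyphContext {
    /** Pick '0' or 'O' for an ambiguous glyph based on its neighbours. */
    public static char resolveZeroOrOh(char prev, char next) {
        // Digits on either side suggest a number, so read the glyph as zero.
        if (Character.isDigit(prev) || Character.isDigit(next)) {
            return '0';
        }
        // Otherwise assume a word and read it as the letter 'O'.
        return 'O';
    }

    public static void main(String[] args) {
        System.out.println(resolveZeroOrOh('1', ' ')); // "1?" → 0
        System.out.println(resolveZeroOrOh('w', 'r')); // "w?r" → O
    }
}
```

A domain dictionary (planetology’s “Io”, chemistry’s “IO”) would override such a rule, which is exactly why the environment matters.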

 

So I’d love David to find spare time to hack on this…

Optical Character Recognition (OCR) in Java; my current summary of situation – please comment

April 17th, 2014

In The Content Mine and PLUTo projects we need OCR to interpret diagrams with letters and numbers. OCR is a well-tested, well-developed and widely used technology. Unfortunately it’s not trivial to find an Open (F/LOSS) Java solution (please correct me if I’m wrong – I would be delighted). Hopefully this blog will help others. If my comments here are inaccurate please correct them – unfortunately most of the codes I discuss are poorly documented and have few or no examples.

OCR has many features but here we restrict this to

  • high-quality born-digital typefaces (i.e. no handwriting, no scanning, no photos – we shall move to those later)
  • a small number of font-families (Helvetica, Times, Courier or their near relations). No ComicSans.
  • understanding what is happening as we shall want to extend this – e.g. to Unicode symbols for graphs and maths and Greek.
  • modular. It needs to be easily integrated, so no GUIs, etc.

If you are desperate and don’t mind paying there are commercial solutions. If you don’t mind using C(++) there’s OpenCV. But I need to integrate with Java as this has to be redistributable everywhere. I can’t run a server as people may have to post copyright material.

If you search for Java and OCR you will variously find:

  • Tesseract (http://tess4j.sourceforge.net/). This is a de facto standard, BUT it’s C(++) wrapped in Java. That will be a nightmare to redistribute.
  • Lookup (https://github.com/axet/lookup). I haven’t got to the bottom of this, but I think it’s a fast pixel comparator. IOW if you have a small image it will see where it can be found in a larger image. For that you need the precise image – it is unlikely to recognize a Times “B” using a Helvetica “B” for searching.
  • JavaOCR. This is where the answer may lie but the confusion starts:
  1. javaocr "20100605" (http://sourceforge.net/projects/javaocr/). From Ron Cemer. This seems to be an initial effort which uses simple features such as aspect ratio or very simple moments. There’s a zip file to download with the name javaocr20100605. The project stopped at about 2010 but Ron still appears to comment on the later version below. I think we owe Ron thanks, but 20100605 is not the route we shall follow.
  2. javaocr "2012-10-27" (http://sourceforge.net/p/javaocr/source). This appears to be a fork of javaocr20100605. There’s a download but it’s basically 20100605. To get the more recent version you must clone the repository:

git clone git://git.code.sf.net/p/javaocr/source javaocr-source

The code and useful comments appear to have stopped at 2012-10-xx and the authors admit they haven’t documented stuff. However, the system loads into Eclipse and runs its tests. After some unrewarding detective work I’ve come to the following conclusions:

  • the project distrib is overly complex and poorly documented. It seems to be oriented towards Android and this may be difficult to demodularise. The distrib is a set of maven modules – I think some can be disregarded.
  • the recognition has advanced and uses HuMoments and Mahalanobis distances. (I had come to the independent conclusion that these are what we needed).
  • I have tracked down a demo at:

net.sourceforge.javaocr.ocrPlugins.OCRDemo.OCRScannerDemo
which has a main() entry point. It’s still very messy – hardcoded filenames and no explanation. It looks like a town hit by a pyroclastic flow – abandoned in an instant. There’s a training file

javaocr/legacy/ocrTests/trainingImages/ascii.png

with the characters 33-112 in order as monospace glyphs. This is hardcoded to map the codepoints onto the glyphs (either by x,y coordinate or whitespace – I don’t know which yet). There’s another image

javaocr/legacy/ocrTests/asciiSentence.png

which is a test. I run the OCRScannerDemo and it prints out the characters on the console output.
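The monospace training strip suggests a simple indexing scheme. A sketch of how codepoints 33–112 might map onto glyph cells – the fixed glyph width and x-coordinate indexing are my assumptions; as noted above, javaocr’s actual mechanism is undetermined:

```java
/** Sketch: map codepoints 33..112 onto cells of a monospace training strip.
 *  Assumes fixed-width glyphs indexed by x-coordinate (an assumption, not
 *  javaocr's confirmed scheme). */
public class TrainingStrip {
    static final int FIRST = 33;  // '!' – first glyph in the strip
    static final int LAST = 112;  // 'p' – last glyph in the strip

    /** x-pixel offset of a codepoint's glyph cell, given the glyph width. */
    public static int cellX(int codepoint, int glyphWidth) {
        if (codepoint < FIRST || codepoint > LAST) {
            throw new IllegalArgumentException("codepoint outside training range");
        }
        return (codepoint - FIRST) * glyphWidth;
    }

    /** Inverse: which codepoint's glyph starts at (or covers) pixel x. */
    public static int codepointAt(int x, int glyphWidth) {
        return FIRST + x / glyphWidth;
    }

    public static void main(String[] args) {
        System.out.println(cellX('A', 12));              // ('A' = 65) → (65-33)*12 = 384
        System.out.println((char) codepointAt(384, 12)); // A
    }
}
```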

So the system is able to recognize characters exactly if you have the exact or a very similar font. This is good news. What I don’t know is how much we can vary the test fonts. I’m hoping that our own work on thinning will make skeletons which are less dependent on the font families. Or it may be that simply training with three families will be OK for most science.
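The HuMoments + Mahalanobis recognition mentioned in the conclusions can be sketched as follows. This is an illustration, not javaocr’s code, and it simplifies to a diagonal covariance (per-feature variances):

```java
/** Sketch: classify a glyph by Mahalanobis distance of its feature vector
 *  (e.g. Hu moments) to per-character training means. Diagonal covariance
 *  for simplicity; illustrative only. */
public class MahalanobisMatch {
    /** Squared Mahalanobis distance with a diagonal covariance. */
    public static double distanceSq(double[] features, double[] mean, double[] variance) {
        double d = 0;
        for (int i = 0; i < features.length; i++) {
            double diff = features[i] - mean[i];
            // Each feature's deviation is scaled by that feature's variance,
            // so noisy features count for less than stable ones.
            d += diff * diff / variance[i];
        }
        return d;
    }

    public static void main(String[] args) {
        double[] glyph = {0.9, 0.1};
        double[] meanOfA = {1.0, 0.0};
        double[] varOfA = {0.25, 0.25};
        System.out.println(distanceSq(glyph, meanOfA, varOfA));
    }
}
```

The glyph is assigned to whichever character’s training mean gives the smallest distance; unlike a neural net, each term of the sum can be inspected when a match goes wrong.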

So this is a first step for those – like me – who are finding it difficult to navigate javaocr. If we don’t hear anything we may create a forked project with clearer documentation and examples.

 

Open Access Button; Thursday 2014-04-10:1300 London; This is where scholarly publishing gets changed

April 9th, 2014

Tomorrow is a very important day for OPEN – the Open Access Button initiative (https://www.openaccessbutton.org ) is holding an afternoon get-together in London.

The OAButton is driven by undergraduates – initially in Medicine – who are frustrated and now ANGRY about publishers’ paywalls. It’s immoral that medical information should only be available to <0.1% of the world. And – to be quite clear – there are only two things driving paywalls:

  • publishers’ greed
  • academics’ lock-in to the quest for personal glory

OAButton is initially raw anger – this is unacceptable and must be changed. Simply tell the world. It’s a Digital Century demonstration – a freedom march. My generation marched to Aldermaston, Greenham Common and Molesworth to protest against the injustice of nuclear weapons; OAButton is similarly digitally marching against publishers’ paywalls.

Protests often start off slow and are ridiculed. You may be tempted to write off the OAButton as a few undergraduates making a protest that no-one will take seriously. You would be very foolish. Protest can grow rapidly into mass action. The driving force is injustice, because:

CLOSED ACCESS MEANS PEOPLE DIE

And the publishing industry now has very few friends except their shareholders and those entertained by their lobbyists. They aren’t selling any useful service now – the academics do the writing and the review. The publishers’ technical ability is AWFUL – they make things worse.

So the publishers are selling two things:

  • branded glory for academics and universities
  • fear: through their lawyers

Anything else can be created and delivered without publishers. So publishing is broken and could collapse at any time. It relies totally on academic glory. It points backwards.

And the OAButton points forward. The future belongs to undergraduates, and I’m backing them. I don’t know where they are going, but I hope they throw off compromise, fudge, bureaucracy, mumble.

I’m going to London tomorrow. I shall listen. It’s possible that our own effort to create a bibliography of scientific data may be useful. If you’re young at heart, idealistic, motivated and courageous get involved!

 

 

We should have Collaboration as well as competition in research. Citizens please join in!

April 9th, 2014

Daniel Mietchen [one of the central figures in Open Science / Wikimedia] has just posted to the OKF Open-Science list

 

as briefly mentioned before, we are working on a public proposal to make research proposals increasingly open:
https://www.newschallenge.org/challenge/2014/submissions/opening-up-research-proposals

The drafting period as part of that News Challenge ends on April 17,
and feedback of any kind is most welcome, especially before then,
though we certainly envisage to develop the project further over time.

To discuss the topic of open research proposals with a broader group,
there will be a public hangout on Friday (April 11) at 7pm UTC:
https://plus.google.com/u/0/events/ce1t7snttl29082nmpa44joqsdc .
It would be nice to have some of you with us then!

 

Read the proposal and come to the hangout. Daniel’s proposal is excellently argued and makes the case that collaborative research can be more effective in many ways than competition. Tim Gowers ran a completely Open, collaborative maths research project (Polymath) which solved an important problem in a staggeringly short time. He said it was “… to normal research as driving is to pushing a car.”

Of course scientists are competitive. Only a very few turn down Nobel Prizes or Fields Medals. And on the small number of occasions when others have taken my work and claimed it as their own I have been very angry. But secret grant applications have their downside. Most grant applications fail and so it’s a lot of wasted work if no-one builds on the ideas. It can lead to a metric-oriented outcome where delivering lots of small pieces is more important than building for the future. (A similar problem: I find some computer science research very frustrating – they announce they have solved a problem without giving details or code or a solution the community can use. This makes it harder to justify repeating the work – and impossible to get any funding.)

We are now in the science data century. I guess that 50% of science effort is managing data. Data management in competitive projects is awful. It is short-term, and fouls up the future. I have been very fortunate to get EPSRC funding (thank you) for making our software distributable. Just as Wellcome justify publication as a central point of science, dissemination of code should also be high on the list.

There are many radical new things in the Digital Century. Just as Wikipedia has shown it is becoming the central scientific reference, we’re also seeing that collaborative science involving citizens can be highly productive. I’ve helped to set up the Blue Obelisk collaborative software project in chemistry, which is the de facto standard for most Open Source in the field and has also resulted in citation-rich publications (for those who care). The explicit total annual cost: about 20 USD – I buy 2 obelisks as prizes. Everything else is volunteer effort and marginal costs (bandwidth, servers, etc.).

Much research is parallelisable. In our own PLUTo project we need to build tools for machines to read the literature. We’ve got hundreds of journals to read and each requires a custom-written scraper. It takes an evening to write one. The conventional way is to hire someone and make them write hundreds of scrapers. The modern way is to appeal to the community. Maybe some people have already written some? Maybe there are citizens who love hacking? Either way everyone benefits – more people get involved, there’s a community. It’s science in the Digital Century. Join us!
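As a sketch of how community-contributed scrapers might plug together – all names are hypothetical, this is not the actual PLUTo code:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a per-journal scraper registry. Each contributor writes one
 *  small scraper; the registry dispatches on journal id. Illustrative only. */
public class ScraperRegistry {
    /** One contributed scraper: given page HTML, extract article metadata. */
    public interface Scraper {
        Map<String, String> scrape(String html);
    }

    private final Map<String, Scraper> byJournal = new HashMap<>();

    /** Community members register a scraper keyed by a journal id. */
    public void register(String journalId, Scraper scraper) {
        byJournal.put(journalId, scraper);
    }

    public Map<String, String> scrape(String journalId, String html) {
        Scraper s = byJournal.get(journalId);
        if (s == null) throw new IllegalStateException("no scraper yet for " + journalId);
        return s.scrape(html);
    }

    public static void main(String[] args) {
        ScraperRegistry registry = new ScraperRegistry();
        // A trivial contributed scraper: pull the <title> element.
        registry.register("phytotaxa", html -> {
            Map<String, String> out = new HashMap<>();
            int a = html.indexOf("<title>") + 7, b = html.indexOf("</title>");
            out.put("title", html.substring(a, b));
            return out;
        });
        System.out.println(registry.scrape("phytotaxa", "<title>New taxa</title>").get("title")); // New taxa
    }
}
```

The point of the shape: hundreds of journals means hundreds of small, independent plug-ins, which is exactly the kind of work a community can parallelise.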

Some snippets from Daniel

Many ideas are lost in the current closed system, and so are opportunities to collaborate and improve those few that are actually being worked on. We propose to elaborate mechanisms that would allow a transition from the current secretive model to one in which sharing research ideas is the default and seen as an invitation for collaboration, for accelerating and improving research rather than as a breach of private property.

Back in 1959, psychologist Myron Brender wrote
“I propose the creation [..] of a newsletter or journal to be devoted exclusively to the publication of unexecuted research proposals.”

The main challenge of implementing this idea, however, is not technical but cultural: researchers currently have no incentive to share research proposals, and research funders have no habit of making their funding decisions public, nor who has applied for what.

Public research proposals would also open the door for science journalism to go new ways: instead of headlines of the “researchers found” kind once a research project has long finished, they could cover research projects from early on and highlight the process behind it.

Many have experienced this in the current system, but effective scooping is actually relatively simple now and much harder if your ideas are out there in the open: if everyone knows you were the first to propose (and actually pursue) that idea, anyone who tries to sell it as their own will risk losing reputation, so they may actually prefer to work with rather than against you. More on that in the Harvard video embedded above.

Computer scientist Ehud Shapiro recently wrote:
“Genuine interdisciplinary research is nothing like a competitive race. It is much more like a solitary exploratory hike through an uncharted landscape. [...] There are no peers to compete with”. But there may be collaborators if they ask nicely.
“Genuine interdisciplinary research is nothing like a competitive race. It is much more like a solitary exploratory hike through an uncharted landscape. [...] There are no peers to compete with”. But there may be collaborators if they ask nicely.

[PMR Yes - I know this solitary hike...]

I’ll be at the hangout THIS FRIDAY! See you there.

 

ACSGate: has Atypon fallen into its own Publisher Spider Trap? and the ACS reply

April 4th, 2014

<s>No word yet from ACS so some of this is hypothetical – but they are communal hypotheses.</s>

It seems the spider trap is part of Atypon software (http://www.atypon.com). From their site:

Atypon delivers innovative solutions that revolutionize the way publishers and media organizations do business. Literatum, Atypon’s flagship ePublishing platform, provides all of the functionality that publishers need to compete in the digital world, including advanced search and information discovery, access control, e-commerce, marketing and business intelligence. Literatum hosts more than 17 million journal articles …

It’s run by “Georgios Papadopoulos, Founder and Chief Executive Officer”

Its clients include ACS, Elsevier, Informa, NewEngJMed, OUP, Taylor and Francis and 20 others. Interestingly, some of these also appear to have the spider trap. From the info above it appears that the articles – including the OA ones – are hosted on Atypon. It’s therefore believable that the spider trap link was added by Atypon – whether ACS knew about this we don’t know, and we wait for Darla.

The following – incredible – comment from Georgios Papadopoulos of atypon.com appeared on my blog. I don’t believe it’s a spoof.

This is really funny. Tom Demeranville described the trap very accurately.

These LINKS (they are not DOIs!) are not visible or clickable. Only a (dumb) spider follows them.
You created such a dumb spider and you were scraping the content. You were not reading it or clicking on anything.

You were caught, but perhaps the funniest part of that was that then you also came up and exposed yourself. We usually never identify the writers of such crawlers.

If genuine, this is one of the most breathtakingly self-destructive statements from a CEO since Gerald Ratner described his products as “crap”. GP boasts of his cleverness but has utterly missed the point and revealed himself as completely out of touch.

So what about the spider trap that he and his company built? Well, this afternoon the story of the Spider Trap hit Hacker News. I promise I didn’t send it and I didn’t urge others to. Hacker News (“hacker” is a positive term, based on MIT usage) has news about all things geeky. They know about the web. What did they think of the Spider Trap? See https://news.ycombinator.com/item?id=7530712 – here are some:

  •  It’s so technologically simple as to be useless against anyone who could deploy a web scraper in the first place.
  •  This has massive potential for abuse.
  • That’s some level of incompetence – the trappers I mean. A half arsed solution because they couldn’t think of a better one.
  •  I am furious because the malice was implemented in the stupidest, most useless, laziest manner possible. It’s like keeping the neighborhood kids off your lawn by burying a pressure plate switch out there for the armed nuclear bomb in your garage. And then not telling anyone about it. And then inviting all the neighbors over for a croquet tournament.

The overwhelming consensus is that the spider trap was totally incompetent and highly dangerous. The URL could easily have been transformed and redistributed by software which simply edited HTML files. Hidden in mails. Even the existence of a simple URL that disables a whole university (yes – there are universities with only one IP) is unbelievable.

Here’s Tom Demeranville again – who applauds what Ross did as the quickest and most effective way of lancing this ugly boil… (my emphases)

I’ve just identified another serious spider trap that would cut universities not just from one publisher but whole swathes of them. As much as I’d like to share the link around for the LOLz, I think I better contact the owner first :D Damn.

I don’t think you can shoot the messenger here. Ross made everyone well aware of the dangers these traps pose. If it wasn’t for him I’d have not found what I’ve found. Speaking from experience I’ll also add that polite emails to publishers regarding bugs in their websites take approximately eleventy billion years to be actioned. This way of maximum publicity is the quickest way to get it fixed.

I’ve shown what the world thinks of Atypon’s spider trap. You can decide what you think of Atypon and its CEO. It’s caused a massive public furore – most of it about ACS, Ross and me (probably in that order). It’s caused a huge waste of time and effort – I hear that JISC, Nature Publishing Group and CrossRef were all cut off from ACS. Indeed, if Pandora, Ross and I had not exposed it we might have had regular ACS outages indefinitely.

ACS have to stand up and tell us what’s going on.

If they don’t, this sort of thing will continue. If a foolish implementation is allowed to persist who knows what may happen?

2014-04-05:07-12 UTC: ACS have now posted a reply. This is competent, if relatively uninformative about details. Selected quotes:

ACS worked diligently to resolve the issue, and as of 4 PM EDT April 3, service was restored for all subscribers affected by this incident. Simultaneously, steps were taken to address the specific protocol that triggered this outage.

Employing the use of these types of tools is imperative to providing users with continued access to that trusted research. We will therefore continue to refine our security procedures to support evolving publishing access models while protecting both users and content from malicious activities.

The rest is either history I have provided or general stuff about ACS serving the community.

Are there more spider traps out there? Almost certainly. TomD has identified some. Will I go looking for them? Not as an activity in itself. Will others go looking for them? Judging by the tenor of Hacker News, almost certainly. Will we hit further traps as we launch The Content Mine? Hopefully not, especially if they are labelled “Bomb”. Contrary to Georgios Papadopoulos’ simplistic view, we don’t just throw wget at the Web – we work out the semantics.
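“Working out the semantics” includes, at minimum, not following links a human could never see – exactly the kind of link the spider trap relies on. A toy sketch (the class, the regex HTML handling and the style checks are all illustrative; a real crawler should use a proper HTML parser):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration: a crawler that refuses to follow anchors styled to be
 *  invisible, the kind a "dumb" spider blindly follows. Illustrative only. */
public class PoliteLinkFilter {
    // Matches <a ... href="..." ...> and captures the attributes and the URL.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s+([^>]*)href=\"([^\"]+)\"([^>]*)>", Pattern.CASE_INSENSITIVE);

    /** Return only the links a human reader could actually see and click. */
    public static List<String> visibleLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String attrs = m.group(1) + m.group(3);
            // Skip anchors styled to be invisible: a human can never click these.
            if (attrs.contains("display:none") || attrs.contains("visibility:hidden")) continue;
            links.add(m.group(2));
        }
        return links;
    }
}
```

Calling `visibleLinks` on a page containing one ordinary link and one hidden trap link returns only the ordinary one – which is why the Hacker News commenters judged the trap useless against any competent scraper.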

What is clear is that machine reading of the literature is now a legitimate mainstream activity. We can and will do it without “publishers’ API”. We shall continue to expose incompetent and dangerous publishing.

Open Access Questions for Universities at Leicester 2014-04-04

April 4th, 2014

In my event at Leicester today (http://blogs.ch.cam.ac.uk/pmr/2014/04/03/my-talk-on-openaccess-at-university-of-leicester-2014-04-041300-utc/) I shall emphasise the wider picture. Since the audience has many from the library I’ll probably concentrate on that. So here are some questions – if anyone reads this beforehand, think of some answers to bring along.

  • Why do we want Open Access? One sentence only – just one clear major reason.
  • Do you think Open Access is Open? Is it democratic or bureaucratic? Authoritarian or libertarian?
  • Where do you think Open Access WILL BE in 5 years’ time? Is that where you would want it to be?
  • Where would you LIKE Open Access to be in 5 years’ time?
  • Are you happy with the role of the current major publishers? Is the spend of 15 million USD / year reasonable?
  • If it isn’t reasonable what could we design which would be a better way to spend the money?
  • Should libraries implement the rules that publishers create (whether open or closed)? Or should universities tell publishers what they want and require it as a service?
  • Where is the innovation in scholarly communication? Should it involve universities? If so, what should they do?
  • What is the role of authors? Do they deserve anything different from now? Can Open Access provide it? Can Universities?
  • What is the role of readers? Who ARE the readers? Who SHOULD be the readers?
  • Is our investment in Open Access providing value?
  • What is the cost of a scholarly publication? What SHOULD be the cost? Think imaginatively.
  • What is the VALUE of a scholarly publication? Are we getting value?
  • What is the total cost of STEM research per year? What is its value? Are we getting value?
  • If we continue in the same way, what events in the world will overtake us?

 

ACSGate: what the Twitterati think of American Chemical Society’s Spider Trap

April 4th, 2014

I don’t normally scrape Twitter, but there’s been a lot of useful comment on the ACS spider trap. The consensus (among the people I follow) is that it’s

  • irresponsible
  • inappropriate
  • seriously obsolete

Here are two authorities you can trust:

@CrossRefNews. CrossRef is the central gateway for most DOI resolution and they are a solid part of scholarly publishing:

  • @rmounce DOI is registered trademark of International DOI Foundation, using “doi” in this way clearly wrong. #ACSgate
  •  @rmounce We assume it’s a prank–Prefix is Wiley, landing page is Informa. Message re ACS. Suffix is odd–999999? #ACSgate
  • @rmounce Pubs sometimes put DOI strings in proprietary urls. A true DOI link goes through the DOI resolver dx.doi.org. #ACSgate
  • @rmounce Rarely, sketchy pubs publish strings they call DOIs but are not deposited with any registration agency. #thatsnotcool #ACSgate.
  • @rmounce But not legitimate publishers like Wiley, ACS, and Informa #ACSgate

And Cameron Neylon – a highly respected scientist working with PLoS

So the consensus is that no responsible publisher would do this sort of thing. Perhaps it got in without ACS knowing? Was it a hack? Or malware in a service they used? I’m waiting for ACS’s answer.

I’ll note that none of this happens with full Gold #openaccess. PLoS has no need to stop people reading articles. J. Machine Learning Research wants everyone in the world to read articles.

Paywalls and the associated technical and legal security are a massive cost imposed by the closed access model. If we spend 10 billion USD on closed access per year, and if only 2-3% of that goes on paywall and legal technology, then that’s already hundreds of millions of dollars of university subscription money.

Libraries, are you happy that your subscriptions are being used to support this? I’ll ask this question at Leicester today.