The blog had some glitches recently (technically “bumps on the road of our journey”). These were due to upgrades. I couldn’t post and people couldn’t comment. I can now post, but comments don’t show up. If you have comments mail me (peter.murray.rust <largesearchengine>mail.com) or @petermurrayrust on twitter. Hope to be back after Monday
For starters, here’s what I tried to post on your blog about Longan earlier:
So a few years back I found myself in the same situation as you – wanting to do Java OCR, and the only real solution on the block is tesseract/ocropus, which is a nightmare to install/distribute. I eventually started work on a pure-Java OCR system called “Longan”. The project is currently on an extended hiatus, but if there is interest / potential collaborators / potential users, I’d be very interested in reviving it.
PMR: that’s wonderful. Not only for this project, but the idea that people can carry a project throughout a number of years. You don’t have to be in a University. There’s a huge need…
GitHub link: https://github.com/Zarkonnen/Longan
Longan’s features:
- Pure Java with zero dependencies. It doesn’t even reference any external jars.
- Based on convolutional neural networks, which are a pretty modern and robust approach to OCR.
- Usable as a library or command-line program.
- Reasonably modular system composed of stages.
- Takes care of eliminating images and speckles from the input, adjusting input rotation, and detecting multi-column layouts.
- Recognition system is pretty much all data-driven. You can plug in a different set of neural network weights to get a different alphabet or specialise the system for a particular font or group of fonts.
- Free and libre, licensed under Apache 2.0.
PMR: very much the features I try to use myself. (I do, however, use communal libraries such as Apache, and with Maven that’s easy once you have learnt how. So David doesn’t need to include the code of apache.commons.cli anymore – we simply have a Maven dependency in the pom.xml.) Fully agreed about library and CLI and the modularity.
There’s a workflow aspect to image processing. Typically we may have to crop, denoise, deskew, equalise, binarise, thin, recognise, etc. I’m normally starting with born-digital so I only need the last one or two. It’s therefore important to modularise and pipeline. And it’s very important to be able to experiment – hence data-driven and parameterisation.
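As a rough illustration of the "modularise and pipeline" idea, here is a minimal Java sketch in which each stage is just a function from image to image, so stages can be added, dropped or re-parameterised freely. The class name ImagePipeline and the two example stages (crop and a fixed-threshold binarise) are hypothetical and are not taken from Longan or javaocr; real denoising, deskewing and thinning stages would be considerably more involved.

```java
import java.awt.image.BufferedImage;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a pipeline is an ordered list of image-to-image stages.
class ImagePipeline {
    private final List<UnaryOperator<BufferedImage>> stages;

    ImagePipeline(List<UnaryOperator<BufferedImage>> stages) {
        this.stages = stages;
    }

    // Apply every stage in order, feeding each output into the next stage.
    BufferedImage run(BufferedImage input) {
        BufferedImage current = input;
        for (UnaryOperator<BufferedImage> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Example stage: crop to a rectangle (parameters chosen by the caller).
    static UnaryOperator<BufferedImage> crop(int x, int y, int w, int h) {
        return img -> img.getSubimage(x, y, w, h);
    }

    // Example stage: binarise with a fixed grey-level threshold (0-255).
    static UnaryOperator<BufferedImage> binarise(int threshold) {
        return img -> {
            BufferedImage out = new BufferedImage(
                    img.getWidth(), img.getHeight(), BufferedImage.TYPE_INT_RGB);
            for (int y = 0; y < img.getHeight(); y++) {
                for (int x = 0; x < img.getWidth(); x++) {
                    int rgb = img.getRGB(x, y);
                    int r = (rgb >> 16) & 0xFF;
                    int g = (rgb >> 8) & 0xFF;
                    int b = rgb & 0xFF;
                    int grey = (r + g + b) / 3;
                    // Dark pixels become black, everything else white.
                    out.setRGB(x, y, grey < threshold ? 0x000000 : 0xFFFFFF);
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        ImagePipeline pipeline = new ImagePipeline(
                List.of(crop(1, 1, 2, 2), binarise(128)));
        BufferedImage result = pipeline.run(img);
        System.out.println(result.getWidth() + "x" + result.getHeight());
    }
}
```

With born-digital input the list of stages can simply be shorter, and experimenting with a different threshold or stage order only means building a different list.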
—DS-Z: Anyway, I hiatus-ed the project a few years ago because I hit a point where I realised I had to start going about fine-tuning the neural networks in a more methodical way, and I didn’t have the time to do so. I had basically reached the point where I’d try it out on some example input, fix things up to make it work better, only to realise I’d made it perform much worse on other input!
PMR: Very common. When it’s a difficult problem (like OCR) the parameters are often finely balanced.
In terms of trying it out, try running it with com.zarkonnen.longan.Main as the entry point and the path to the attached file as the single command-line argument. Note that the file is just a random fragment I grabbed from a scan, so it’s not particularly optimised for Longan, or Longan optimised for it. Then try commit e2f819f5f865ae6e9211a435f098883979fdb1ed which is actually much better, as it’s the version before some major re-engineering efforts which had the aforementioned effect of making it work better for some inputs and a lot worse for others.
I can also have a spelunk around and check for any secondary projects/data that I used during the project.
- David
Marvellous! It actually took me five minutes to work out how to run Longan, whereas I have struggled for days with javaocr2012.
But I shan’t throw that away. I want to see how neural nets compare with Moments+Mahalanobis. With NN you have no insight into the model and that’s a problem when you need to refine things. I shall use both – neither can be 100% perfect. And in any case how do you tell a zero (0) from an oh (o)? I’m also going to include a third way – topology of skeletons – not sure whether it’s been used before. And we’ve also got information from the environment – io is more likely to be one-zero (== 10) than one-oh, although if you’re a planetologist it might be Io – the Jovian moon. In chemistry IO could be hypoiodite, and so on.
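The "information from the environment" point can be sketched as a small post-processing step. The following is only an illustrative Java sketch under stated assumptions: the class ContextDisambiguator, the look-alike table and the numericContext flag are all hypothetical, and the caller is assumed to know from elsewhere (for example, that the token sits on a graph axis) whether a numeric reading is expected.

```java
import java.util.Map;

// Hypothetical sketch: use context to choose between look-alike glyphs.
class ContextDisambiguator {
    // Letters that are easily confused with digits (illustrative, not exhaustive).
    private static final Map<Character, Character> DIGIT_LOOKALIKES = Map.of(
            'o', '0', 'O', '0', 'i', '1', 'I', '1', 'l', '1');

    // If the surrounding context is numeric and every character can be read
    // as a digit, return the numeric reading; otherwise leave the token alone.
    static String resolve(String token, boolean numericContext) {
        if (!numericContext) {
            return token;
        }
        StringBuilder digits = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (DIGIT_LOOKALIKES.containsKey(c)) {
                digits.append(DIGIT_LOOKALIKES.get(c));
            } else {
                return token; // an unambiguous letter: not a number after all
            }
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("io", true));   // numeric reading
        System.out.println(resolve("Io", false));  // left as the Jovian moon
    }
}
```

A domain dictionary (chemistry, planetology) would be a natural further source of context, but that is beyond this sketch.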
So I’d love David to find spare time to hack on this…
In The Content Mine and PLUTo projects we need OCR to interpret diagrams with letters and numbers. OCR is a well tested and developed technology and widely used. Unfortunately it’s not trivial to find an Open (F/LOSS) Java solution (please correct me if I’m wrong – would be delighted). Hopefully this blog will help others. If my comments here are inaccurate please correct them – unfortunately most of the codes I discuss are poorly documented and have few or no examples.
OCR has many features but here we restrict this to
- high-quality born-digital typefaces (i.e. no handwriting, no scanning, no photos – we shall move to those later)
- a small number of font-families (Helvetica, Times, Courier or their near relations). No ComicSans.
- understanding what is happening as we shall want to extend this – e.g. to Unicode symbols for graphs and maths and Greek.
- modular. It needs to be easily integrated, so no GUIs, etc.
If you are desperate and don’t mind paying there are commercial solutions. If you don’t mind using C(++) there’s OpenCV. But I need to integrate with Java as this has to redistribute everywhere. I can’t run a server as people may have to post copyright material.
If you search for Java and OCR you will variously find:
- Tesseract (http://tess4j.sourceforge.net/). This is a de facto standard, BUT it’s C(++) wrapped in Java. That will be a nightmare to redistribute.
- Lookup (https://github.com/axet/lookup). I haven’t got to the bottom of this, but I think it’s a fast pixel comparator. IOW if you have a small image it will see where it can be found in a larger image. For that you need the precise image – it is unlikely to recognize a Times “B” using a Helvetica “B” for searching.
- JavaOCR. This is where the answer may lie but the confusion starts:
- javaocr “20100605” (http://sourceforge.net/projects/javaocr/). From Ron Cemer. This seems to be an initial effort which uses simple features such as aspect ratio or very simple moments. There’s a zip file to download with the name javaocr20100605. The project stopped at about 2010 but Ron still appears to comment on the later version below. I think we owe Ron thanks, but 20100605 is not the route we shall follow.
- javaocr “2012-10-27” (http://sourceforge.net/p/javaocr/source). This appears to be a fork of javaocr20100605. There’s a download but it’s basically 20100605. To get the more recent version you must clone the repository:
The code and useful comments appear to have stopped at 2012-10-xx and the authors admit they haven’t documented stuff. However the system loads into Eclipse and the tests run. After some unrewarding detective work I’ve come to these conclusions:
- the project distrib is overly complex and poorly documented. It seems to be oriented towards Android and this may be difficult to demodularise. The distrib is a set of maven modules – I think some can be disregarded.
- the recognition has advanced and uses HuMoments and Mahalanobis distances. (I had come to the independent conclusion that these are what we needed).
- I have tracked down a demo at:
which has a main() entry point. It’s still very messy – hardcoded filenames and no explanation. It looks like a town hit by a pyroclastic flow – abandoned in an instant. There’s a training file
with the characters 33-112 in order as monospace glyphs. This is hardcoded to map the codepoints onto the glyphs (either by x,y coordinate or whitespace – don’t know which yet). There’s another image
which is a test. I run the OCRScannerDemo and it prints out the characters on the console output.
So the system is able to recognize characters exactly if you have the exact or very similar font. This is good news. What I don’t know is how much we can vary the test fonts. I’m hoping that our own work on thinning will make skeletons which are less dependent on the font families. Or it may be that simply training with three families will be ok for most science.
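To make the moments-plus-Mahalanobis idea more concrete, here is a minimal Java sketch of nearest-class matching on feature vectors. It is not javaocr’s code, and I have not checked how javaocr actually does it. The class GlyphClassifier is hypothetical, the feature vectors (for example Hu moments) are assumed to be computed elsewhere, and the covariance is assumed to be diagonal (uncorrelated features), which is a simplification of the full Mahalanobis distance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not javaocr's code: nearest-class matching with a
// diagonal-covariance Mahalanobis distance over precomputed feature vectors.
class GlyphClassifier {
    // Per-character statistics estimated from training samples.
    private static final class Model {
        final double[] mean;
        final double[] variance;

        Model(double[] mean, double[] variance) {
            this.mean = mean;
            this.variance = variance;
        }
    }

    private static final double VARIANCE_FLOOR = 1e-9; // avoids division by zero
    private final Map<Character, Model> models = new LinkedHashMap<>();

    // Estimate the per-feature mean and variance for one character.
    void train(char label, double[][] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        int d = samples[0].length;
        double[] mean = new double[d];
        double[] variance = new double[d];
        for (double[] s : samples) {
            for (int i = 0; i < d; i++) {
                mean[i] += s[i] / samples.length;
            }
        }
        for (double[] s : samples) {
            for (int i = 0; i < d; i++) {
                double diff = s[i] - mean[i];
                variance[i] += diff * diff / samples.length;
            }
        }
        for (int i = 0; i < d; i++) {
            variance[i] = Math.max(variance[i], VARIANCE_FLOOR);
        }
        models.put(label, new Model(mean, variance));
    }

    // Squared Mahalanobis distance, assuming uncorrelated features.
    double squaredDistance(char label, double[] features) {
        Model m = models.get(label);
        if (m == null) {
            throw new IllegalArgumentException("unknown label: " + label);
        }
        double sum = 0.0;
        for (int i = 0; i < features.length; i++) {
            double diff = features[i] - m.mean[i];
            sum += diff * diff / m.variance[i];
        }
        return sum;
    }

    // Return the trained character whose model is closest to the features.
    char classify(double[] features) {
        if (models.isEmpty()) {
            throw new IllegalStateException("classifier has not been trained");
        }
        char best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (char label : models.keySet()) {
            double distance = squaredDistance(label, features);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = label;
            }
        }
        return best;
    }
}
```

Training each character with samples from several font families would widen the per-feature variances, which is one way the "three families" idea could be tested. Unlike a neural net, the means and variances can be inspected directly when a character is misrecognised.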
So this is a first step for those – like me – who are finding it difficult to navigate javaocr. If we don’t hear anything we may create a forked project with clearer documentation and examples.
Open Access Button; Thursday 2014-04-10:1300 London; This is where scholarly publishing gets changed
April 9th, 2014
Tomorrow is a very important day for OPEN – the Open Access Button initiative (https://www.openaccessbutton.org ) is holding an afternoon get-together in London.
The OAButton is driven by undergraduates – initially in Medicine – who are frustrated and now ANGRY about publishers’ paywalls. It’s immoral that medical information should only be available to <0.1% of the world. And – to be quite clear – there are only two things driving paywalls:
- publishers’ greed
- academics’ lock-in to the quest for personal glory
OAButton is initially raw anger – this is unacceptable and must be changed. Simply tell the world. It’s a Digital Century demonstration – a freedom march. My generation marched to Aldermaston, Greenham Common and Molesworth to protest against the injustice of nuclear weapons; OAButton is similarly digitally marching against publishers’ paywalls.
Protests often start off slow and are ridiculed. You may be tempted to write off the OAButton as a few undergraduates making a protest that no-one will take seriously. You would be very foolish. Protest can grow rapidly into mass action. The driving force is injustice, because:
CLOSED ACCESS MEANS PEOPLE DIE
And the publishing industry has now very few friends except their shareholders and those entertained by their lobbyists. They aren’t now selling any useful service – the academics do the writing and the review. The publishers’ technical ability is AWFUL – they make things worse.
So the publishers are selling two things:
- branded glory for academics and universities
- fear: through their lawyers
Anything else can be created and delivered without publishers. So publishing is broken and could collapse at any time. It relies totally on academic glory. It points backwards.
And the OAButton points forward. The future belongs to undergraduates, and I’m backing them. I don’t know where they are going, but I hope they throw off compromise, fudge, bureaucracy, mumble.
I’m going to London tomorrow. I shall listen. It’s possible that our own effort to create a bibliography of scientific data may be useful. If you’re young at heart, idealistic, motivated and courageous get involved!
Daniel Mietchen [one of the central figures in Open Science / Wikimedia] has just posted to the OKF Open-Science list
as briefly mentioned before, we are working on a public proposal to
make research proposals increasingly open:
The drafting period as part of that News Challenge ends on April 17,
and feedback of any kind is most welcome, especially before then,
though we certainly envisage to develop the project further over time.
To discuss the topic of open research proposals with a broader group,
there will be a public hangout on Friday (April 11) at 7pm UTC:
It would be nice to have some of you with us then!
Read the proposal and come to the hangout. Daniel’s proposal is excellently argued and makes the case that collaborative research can be more effective in many ways than competition. Tim Gowers ran a completely Open, collaborative maths research project (Polymath) which solved an important problem in a staggeringly short time. He said it was: “… to normal research as driving is to pushing a car.”
Of course scientists are competitive. Only a very few turn down Nobel Prizes or Fields Medals. And on the small number of occasions when others have taken my work and claimed it as their own I have been very angry. But secret grant applications have their downside. Most grant applications fail and so it’s a lot of wasted work if no-one builds on the ideas. It can lead to a metric oriented outcome where delivering lots of small pieces is more important than building for the future. (A similar problem: I find some computer science research very frustrating – they announce they have solved a problem without giving details or code or a solution the community can use. This makes it harder to justify repeating the work – and impossible to get any funding. )
We are now in the science data century. I guess that 50% of science effort is managing data. Data management in competitive projects is awful. It is short-term, and fouls up the future. I have been very fortunate to get EPSRC funding (thank you) for making our software distributable. Just as Wellcome justify publication as a central point of science, dissemination of code should also be high on the list.
There are many radical new things in the Digital Century. Just as Wikipedia has shown it is becoming the central scientific reference, we’re also seeing that collaborative science involving citizens can be highly productive. I’ve helped to set up the Blue Obelisk collaborative software project in chemistry, which is the de facto standard for most Open Source and has also resulted in citation-rich publications (for those who care). The explicit total annual cost: about 20 USD – I buy 2 obelisks as prizes. Everything else is volunteer and marginal costs (bandwidth, servers, etc.).
Much research is parallelisable. In our own PLUTo project we need to build tools for machines to read the literature. We’ve got hundreds of journals to read and each requires a custom-written scraper. It takes an evening to write one. The conventional way is to hire someone and make them write hundreds of scrapers. The modern way is to appeal to the community. Maybe some people have already written some? Maybe there are citizens who love hacking? Either way everyone benefits – more people get involved, there’s a community. It’s science in the Digital Century. Join us…
Some snippets from Daniel
Many ideas are lost in the current closed system, and so are opportunities to collaborate and improve those few that are actually being worked on. We propose to elaborate mechanisms that would allow a transition from the current secretive model to one in which sharing research ideas is the default and seen as an invitation for collaboration, for accelerating and improving research rather than as a breach of private property.
Back in 1959, psychologist Myron Brender wrote
“I propose the creation [..] of a newsletter or journal to be devoted exclusively to the publication of unexecuted research proposals.”
The main challenge of implementing this idea, however, is not technical but cultural: researchers currently have no incentive to share research proposals, and research funders have no habit of making their funding decisions public, nor who has applied for what.
Public research proposals would also open the door for science journalism to go new ways: instead of headlines of the “researchers found” kind once a research project has long finished, they could cover research projects from early on and highlight the process behind it.
Many have experienced this in the current system, but effective scooping is actually relatively simple now and much harder if your ideas are out there in the open: if everyone knows you were the first to propose (and actually pursue) that idea, anyone who tries to sell it as their own will risk losing reputation, so they may actually prefer to work with rather than against you. More on that in the Harvard video embedded above.
Computer scientist Ehud Shapiro recently wrote:
“Genuine interdisciplinary research is nothing like a competitive race. It is much more like a solitary exploratory hike through an uncharted landscape. [...] There are no peers to compete with”. But there may be collaborators if they ask nicely.
[PMR Yes - I know this solitary hike...]
I’ll be at the hangout THIS FRIDAY! see you there
<s>No word yet from ACS so some of this is hypothetical – but they are communal hypotheses.</s>
It seems the spider trap is part of Atypon software (http://www.atypon.com). From their site:
Atypon delivers innovative solutions that revolutionize the way publishers and media organizations do business. Literatum, Atypon’s flagship ePublishing platform, provides all of the functionality that publishers need to compete in the digital world, including advanced search and information discovery, access control, e-commerce, marketing and business intelligence. Literatum hosts more than 17 million journal articles …
It’s run by “Georgios Papadopoulos, Founder and Chief Executive Officer”
Its clients include ACS, Elsevier, Informa, NewEngJMed, OUP, Taylor and Francis and 20 others. Interestingly some of these also appear to have the spider trap. From the info above it appears that the articles – including the OA ones – are hosted on Atypon. It’s therefore believable that the spider trap link was added by Atypon – whether the ACS knew about this we don’t know and wait for Darla.
The following – incredible – comment from Georgios Papadopoulos atypon.com appeared on my blog. I don’t believe it’s a spoof.
This is really funny. Tom Demeranville described the trap very accurately.
These LINKS (they are not DOIs!) are not visible or clickable. Only a (dumb) spider follows them.
You created such a dumb spider and you were scraping the content. You were not reading it or clicking on anything.
You were caught, but perhaps the funniest part of that was that then you also came up and exposed yourself. We usually never identify the writers of such crawlers.
If genuine, this is one of the most breathtakingly self-destructive statements from a CEO since Gerald Ratner described his products as “crap”. GP boasts of his cleverness but has utterly missed the point and revealed himself as completely out of touch.
So what about the spider trap that he and his company built? Well, this afternoon the story of the Spider Trap hit Hacker News. I promise I didn’t send it and I didn’t urge others to. Hacker News (Hacker is a positive term based on MIT usage) has news about all things geeky. They know about the web. What did they think of the Spider Trap? See https://news.ycombinator.com/item?id=7530712. Here are some comments…
- It’s so technologically simple as to be useless against anyone who could deploy a web scraper in the first place.
- This has massive potential for abuse.
- That’s some level of incompetence – the trappers I mean. A half arsed solution because they couldn’t think of a better one.
- I am furious because the malice was implemented in the stupidest, most useless, laziest manner possible. It’s like keeping the neighborhood kids off your lawn by burying a pressure plate switch out there for the armed nuclear bomb in your garage. And then not telling anyone about it. And then inviting all the neighbors over for a croquet tournament.
The overwhelming consensus is that the spider trap was totally incompetent and highly dangerous. The URL could easily have been transformed and redistributed by software which simply edited HTML files. Hidden in mails. Even the existence of a simple URL that disables a whole university (yes – there are universities with only one IP) is unbelievable.
Here’s Tom Demeranville again – who applauds what Ross did as the quickest and most effective way of lancing this ugly boil… (my emphases)
I’ve just identified another serious spider trap that would cut universities not just from one publisher but whole swathes of them. As much as I’d like to share the link around for the LOLz, I think I better contact the owner first. Damn.
I don’t think you can shoot the messenger here. Ross made everyone well aware of the dangers these traps pose. If it wasn’t for him I’d have not found what I’ve found. Speaking from experience I’ll also add that polite emails to publishers regarding bugs in their websites take approximately eleventy billion years to be actioned. This way of maximum publicity is the quickest way to get it fixed.
I’ve shown what the world thinks of Atypon’s spider trap. You can decide what you think of Atypon and its CEO. It’s caused massive public furore – most of it against ACS, Ross and me (probably in that order). It’s caused huge waste of time and effort – I hear that JISC, Nature Publishing Group, CrossRef were all cut off from ACS. Indeed if Pandora, I and Ross had not exposed it we might have had regular ACS outages indefinitely.
ACS have to stand up and tell us what’s going on. If they don’t, this sort of thing will continue. If a foolish implementation is allowed to persist who knows what may happen?
2014-04-05:07-12 UTC: ACS have now posted a reply. This is competent, if relatively uninformative of details. Selected quotes:
ACS worked diligently to resolve the issue, and as of 4 PM EDT April 3, service was restored for all subscribers affected by this incident. Simultaneously, steps were taken to address the specific protocol that triggered this outage.
Employing the use of these types of tools is imperative to providing users with continued access to that trusted research. We will therefore continue to refine our security procedures to support evolving publishing access models while protecting both users and content from malicious activities.
The rest is either history I have provided or general stuff about ACS serving the community.
Are there more spider traps out there? Almost certainly. TomD has identified some. Will I go looking for them? Not as an activity in itself. Will others go looking for them? Judging by the tenor of Hacker News almost certainly. Will we hit further traps as we launch The Content Mine? Hopefully not, especially if they are labelled “Bomb”. Unlike Georgios Papadopoulos’ simplistic view we don’t just throw wget at the Web – we work out the semantics.
What is clear is that machine reading of the literature is now a legitimate mainstream activity. We can and will do it without “publishers’ API”. We shall continue to expose incompetent and dangerous publishing.
In my event at Leicester today (http://blogs.ch.cam.ac.uk/pmr/2014/04/03/my-talk-on-openaccess-at-university-of-leicester-2014-04-041300-utc/ ) I shall emphasise the wider picture. Since the audience has many from the library I’ll probably concentrate on that. So here are some questions – if anyone reads beforehand think of some answers to bring this
- Why do we want Open Access? One sentence only – just one clear major reason.
- Do you think Open Access is Open? Is it democratic or bureaucratic? authoritarian or libertarian?
- Where do you think Open Access WILL BE in 5 years time? Is that where you would want to be?
- Where would you LIKE Open Access to be in 5 years’ time?
- Are you happy with the role of the current major publishers? Is the spend of 15 million USD / year reasonable?
- If it isn’t reasonable what could we design which would be a better way to spend the money?
- Should libraries implement the rules that publishers create (whether open or closed?) Or should universities tell publishers what they want and require it as service?
- Where is the innovation in scholarly communication? should it involve universities? if so, what should they do?
- What is the role of authors? Do they deserve anything different from now? can Open Access provide it? Can Universities?
- What is the role of readers? Who ARE the readers? Who SHOULD be the readers?
- Is our investment in Open Access providing value?
- What is the cost of a scholarly publication? What SHOULD be the cost? Think imaginatively.
- What is the VALUE of a scholarly publication? Are we getting value?
- What is the total cost of STEM research per year? What is its value? Are we getting value?
- If we continue in the same way, what events in the world will overtake us?
I don’t normally scrape Twitter, but there’s been a lot of useful comment on the ACS spider trap. The consensus (among the people I follow) is that it’s
- seriously obsolete
Here are two authorities you can trust:
@crossRefNews. CrossRef is the central gateway for most DOI resolution and they are a solid part of scholarly publishing:
- @rmounce DOI is registered trademark of International DOI Foundation, using “doi” in this way clearly wrong. #ACSgate
- @rmounce We assume it’s a prank–Prefix is Wiley, landing page is Informa. Message re ACS. Suffix is odd–999999? #ACSgate
- @rmounce Pubs sometimes put DOI strings in proprietary urls. A true DOI link goes through the DOI resolver dx.doi.org. #ACSgate
- @rmounce Rarely, sketchy pubs publish strings they call DOIs but are not deposited with any registration agency. #thatsnotcool #ACSgate.
- @rmounce But not legitimate publishers like Wiley, ACS, and Informa #ACSgate
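CrossRef’s point that a true DOI link goes through the resolver can be turned into a simple check. This is only a rough Java sketch: the class DoiLinkCheck is hypothetical, and accepting doi.org alongside the dx.doi.org host named in the tweet is my own addition. It only tests the shape of the URL (resolver host, path starting with the "10." that begins every DOI) and says nothing about whether the DOI is actually registered with an agency.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: does a link go through the DOI resolver, or does it
// merely carry a DOI-like string inside a publisher's own URL?
class DoiLinkCheck {
    // dx.doi.org is named in the tweet; doi.org is added here as an assumption.
    private static final Set<String> RESOLVER_HOSTS = Set.of("dx.doi.org", "doi.org");

    static boolean goesThroughResolver(String url) {
        try {
            URI uri = new URI(url);
            String host = uri.getHost();
            String path = uri.getPath();
            return host != null
                    && RESOLVER_HOSTS.contains(host.toLowerCase(Locale.ROOT))
                    && path != null
                    && path.startsWith("/10."); // DOI names start with "10."
        } catch (URISyntaxException e) {
            return false; // unparseable links are treated as non-resolver links
        }
    }

    public static void main(String[] args) {
        System.out.println(goesThroughResolver("http://dx.doi.org/10.1000/182"));
        System.out.println(goesThroughResolver("http://pubs.example.org/doi/pdf/10.1000/182"));
    }
}
```

A link of the second kind, on a publisher’s own host, is exactly the "DOI string in a proprietary URL" case the tweets describe; it may be harmless or, as here, something else entirely.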
And Cameron Neylon – a highly respected scientist working with PLoS
- @rmounce @Suelibrarian @petermurrayrust it’s not a registered DOI with crossref but it is misleading to use that url structure
- @invisiblecomma @rmounce @petermurrayrust do other publishers do this? Seems both crude and dangerous?
So the consensus is that no responsible publisher would do this sort of thing. Perhaps it got in without ACS knowing? Was it a hack? or malware in a service they used? I’m waiting for ACS’s answer.
I’ll note that none of this happens with full Gold #openaccess. PLoS has no need to stop people reading articles. J. Machine Learning Research wants everyone in the world to read articles.
Paywalls and the associated technical and legal security are a massive cost imposed by the closed access model. If we spend 10 Billion USD on closed access per year, and if only 2-3% of that goes on paywall and legal technology, then that’s already 200-300 million USD of university subscription money every year.
Libraries, are you happy that your subscriptions are being used to support this? I’ll ask this question at Leicester today.
It appears that the Spider trap (whatever it is) has affected many people. I have no full understanding of what has happened but here is my best analysis:
- People have really been affected (it’s not just a rumour). They have received a message – it’s not clear where from.
- The ACS is aware of the problem and has posted to this blog:
Thank you for alerting us to the finding shared by your reader. We are exploring and are committed to providing text and data mining solutions for readers of our open access content. In the meantime, for those who have unfortunately clicked on the link referenced and received the spider message, please email firstname.lastname@example.org your institution name and we will work to reinstate access at your institution as quickly as possible.
Darla Henderson, Ph.D.
Asst. Director, Open Access Programs
American Chemical Society
Thank you Darla. I would recommend those affected to contact Darla.
It is still unclear where the problem came from. The URL is not unique to ACS – by searching Google I have found it in 3 other publishers – Blackwell, Informa and Copeia. The DOI prefix seems to be Wiley, but CrossRef has said it’s not a valid DOI. Anyone finding a similar link should report it rather than following it.
I have personal evidence that the ACS shuts down whole universities instantly if it thinks somebody is doing something wrong. I have made it clear to them that this is not acceptable practice – it’s brutal and unselective. I do not know whether they still practice it, but my suspicion is that they do and that something triggered the ACS servers to shut down subscribers.
It is not clear whether the link was created by the ACS publication system (I hope not) or was malware introduced into the HTML. This would not be easy as it is on the publishers’ sites and could suggest they had been hacked.
It is clearly unsatisfactory and I posted that it had to be mended right away. Ross Mounce (whom I work with and will fully support in his action) wished to explore further and he asked whether others had this problem. I haven’t talked to him, but my guess is that he didn’t expect the ACS system to react so catastrophically to whatever is the problem.
Some people have blamed Ross. IMO this is unfair – it is the ACS system which is at fault. Many commenters have expressed the view that this is an archaic and unacceptable way of running a website.
Until I have more information I can’t judge…
Wellcome Trust and Michelle Brook say thank you to all who are helping hack the WT APC data and highlight major issues
April 3rd, 2014
A mail from Michelle on the OKFN open-access list:
Hey all,
The Wellcome Trust has just said thank you to the community for the work we’ve been doing on the author processing charge dataset they released: https://twitter.com/wellcometrust/status/451659293801340928: where WT said:
- Ernesto Priego I think you are setting up an example of best practice. May other funders follow your lead.
That’s so awesome!
The work we’ve done led to an incredible statement from the Wellcome Trust late last week, that you should certainly read if you’ve not seen it yet (found here). It includes some awesome quotes, such as:
[WT] “We expect every publisher who levies an open access fee to provide a first class service to our researchers and their institutions.”
[WT] “The bigger issue concerns the high cost of hybrid open access publishing, which we have found to be nearly twice that of born-digital fully open access journals. We need to find ways of balancing this by working with others to encourage the development of a transparent, competitive and reasonably priced APC market.”
While debates may rage about whether journal led or repository led open access is the best way forwards, I think we can all agree that high APCs charged for papers published in hybrid journals (meaning these journals are also supported by library subscriptions) is not what we really want to see.
They couldn’t have made this statement without the effort of many people on this list – so many thanks to all of you who spent time working on it. I actually had a couple of nightmares about the spreadsheet… which probably says something about me. I’ve written a quick post thanking people publicly, because the effort we’ve all put in is certainly worth recognising. (Let me know if I’ve missed your name off, as there were many people editing the document anonymously – I’ve already had one case flagged to me – so please don’t feel bad about reminding me.)
Work is still ongoing with the dataset… but I still can’t believe how much we’ve done in such a short time! And still amazed that we’ve enabled the Wellcome Trust to make the statement they did…
Michelle
- publication is a critical part of scientific research.
- publication costs money and funders (and universities) should be prepared to pay reasonable costs
- publication is more than just digital paper
and it has launched and supported EuropePMC, which I am proud to be on.
The WT APC data is much more than a list of charges.
It’s part of the bibliographic map of science.
Now we need the rest – so other funders please follow.