ACSGate: The American Chemical Society spider trap; reactions and warning

DO NOT FOLLOW ANY LINKS IN THIS POST – THEY ARE HIGHLY DANGEROUS
The Sydney Funnel-Web spider (Thx: http://commons.wikimedia.org/wiki/Category:Atrax_robustus) is among the deadliest in the world. The American Chemical Society’s spider trap is also deadly: it can cause whole universities to be cut off within milliseconds. Here I’m trying to gather information and informed reaction.
The problem comes from following this link:

<a href="/doi/pdf/10.1046/9999-9999.99999">...</a>

If you prepend “http://pubs.acs.org” to it (as a browser would do) you get the deadly link. I will not print it here in case someone clicks it.
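For the curious, this is just standard relative-URL resolution: the leading slash means the href is resolved against the site origin, not the current page’s path. A minimal sketch with Python’s standard library (the domain here is a deliberate placeholder, not the real one):

```python
from urllib.parse import urljoin

# Hypothetical page and trap href: a root-relative link (leading slash)
# resolves against the site origin, wherever on the site the page lives.
page_url = "http://publisher.example/journal/toc/current"
trap_href = "/doi/pdf/10.1046/9999-9999.99999"

resolved = urljoin(page_url, trap_href)
print(resolved)  # -> http://publisher.example/doi/pdf/10.1046/9999-9999.99999
```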
DO NOT ATTEMPT TO FOLLOW IT. IT WILL CUT OFF YOUR UNIVERSITY
I don’t know what it does. I can’t find out without shutting down the university’s access to ACS.
But it’s not just ACS – it’s other publishers. A quick search finds

href="http://www.blackwell-synergy.com/doi/pdf/10.1046/9999-9999.99999"

which suggests that Blackwell do it as well.
The ACS example suggests that it relies on a base URL (note the leading slash). I have no idea what Blackwell does when it’s triggered. I don’t know whether this is different for each publisher or whether there is a central, publisher-independent spider trapper. Maybe

10.1046/9999-9999.99999

is just a lorem ipsum which spider-trapping publishers use.
NOTE THAT YOU DO NOT NEED TO BE RUNNING A SPIDER TO TRIGGER THIS. A SINGLE HUMAN CLICK (LIKE A SINGLE BITE FROM ATRAX) CAN KILL.
Reactions on Twitter:

  • CameronNeylon @CameronNeylon [PLoS] do other publishers do this? Seems both crude and dangerous? // it’s not a registered DOI with crossref but it is misleading to use that url structure
  • Sue Cook @Suelibrarian ACS access rapidly dropping as random people click on that link 🙂 phew I wasn’t on work network.
  • Ross Mounce @rmounce But seriously, academics trust DOI’s. This is an abuse of trust by ACS, in my opinion

I’ll note that the affected weren’t trying to download the whole of ACS and sell it, but were indulging in natural and acceptable curiosity.
Questions:

  • Does anyone know how this works?
  • Which publishers implement it? So far I’ve got ACS, Blackwell, and Informa Healthcare (T+F).
  • Is there a better way of doing it?
  • How many institutions have been unnecessarily cut off by publisher controls? (Note that my own experience was not a spider trap but simply (humanly) reading too many papers too rapidly – publications are not meant to be read rapidly, are they?)

PLoS manage to publish without spider traps – maybe they can tell ACS what modern publishing should be.
 
UPDATE – see the comment below from Tom Demeranville:

Well, in the olden days when servers were slower and bandwidth thinner, a random crawl by a search engine could bring your service to its knees. The spider would visit every page, then every link on that page, indexing as it went. This would result in an inadvertent denial-of-service attack.
The old way of stopping this was to put invisible links on a few pages. People wouldn’t see them, but the spiders would. The spider would visit the link and BAM – you block the spider’s IP address. I imagine it’s still fairly common, but the reasons are different: it’s to stop scraping rather than indexing.
Is there a better way of doing it?
It works because many publishers are so technologically backwards they still use IP authentication to manage access to their resources. The better way would be to use proper authentication – federated or otherwise. SAML, OpenID and OAuth have been around for years. Most Eurozone universities have an academic SAML federation and manage access using that but the publishers are reluctant to make the full switch because it relies on the subscriber changing the way they work as well as the publisher.
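The invisible-link trick Tom describes can be sketched in a few lines. All the paths, names, and responses below are hypothetical – I don’t know what ACS actually runs:

```python
# Sketch of an invisible-link spider trap. Everything here is invented
# for illustration; it is not ACS's actual implementation.

TRAP_PATH = "/doi/pdf/10.1046/9999-9999.99999"

# The trap link sits in the page markup but is hidden from human readers:
HIDDEN_LINK = f'<a href="{TRAP_PATH}" style="display:none">.</a>'

blocked_ips = set()

def handle_request(path, client_ip):
    """Serve a request; any client that ever fetches the trap path is
    blocked - and with IP authentication, so is everyone behind that IP."""
    if path == TRAP_PATH:
        blocked_ips.add(client_ip)  # one visit is enough: BAM
    if client_ip in blocked_ips:
        return "403 Forbidden: access suspended for this IP"
    return "200 OK: article content"

print(handle_request("/doi/abs/10.1046/real.article", "10.0.0.1"))  # -> 200 OK: article content
print(handle_request(TRAP_PATH, "10.0.0.2"))                        # -> 403 Forbidden: access suspended for this IP
print(handle_request("/doi/abs/10.1046/real.article", "10.0.0.2"))  # -> 403 Forbidden: access suspended for this IP
```

Note the last line: once the trap fires, every subsequent request from that IP fails – which, behind a university proxy, is the whole institution.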

 


18 Responses to ACSGate: The American Chemical Society spider trap; reactions and warning

  1. David Jones says:

    I clicked on the Blackwell version of the cursed link and it seems innocuous enough. I have a more amusing hypothesis about how Blackwell end up with that link on their page:
    They added it by scraping other publishers, but they just list it as a DOI; it’s not connected to a magic trap.

  2. Looks like eduroam is blocked, at least here in Manchester, and the same happens if I try to view any of the journal pages via an O2 mobile connection.
    I’m sure ACS will immediately be issuing a partial refund of subscriptions to all those affected. Hopefully no one was planning on doing any chemistry today.

  3. How does it work?
    Well, in the olden days when servers were slower and bandwidth thinner, a random crawl by a search engine could bring your service to its knees. The spider would visit every page, then every link on that page, indexing as it went. This would result in an inadvertent denial-of-service attack.
    The old way of stopping this was to put invisible links on a few pages. People wouldn’t see them, but the spiders would. The spider would visit the link and BAM – you block the spider’s IP address. I imagine it’s still fairly common, but the reasons are different: it’s to stop scraping rather than indexing.
    Is there a better way of doing it?
    It works because many publishers are so technologically backwards they still use IP authentication to manage access to their resources. The better way would be to use proper authentication – federated or otherwise. SAML, OpenID and OAuth have been around for years. Most Eurozone universities have an academic SAML federation and manage access using that but the publishers are reluctant to make the full switch because it relies on the subscriber changing the way they work as well as the publisher.

  4. Hi. As Cameron pointed out, though this looks like a DOI, it isn’t actually registered, so it isn’t a DOI and doesn’t show up in our systems. Nonetheless, we have checked our database to see if we can find any evidence of others implementing this trick and accidentally registering them too. Thankfully, we can’t find any.
    Needless to say, we at CrossRef think this is very bad practice.
    First, it is likely to undermine trust in DOIs.
    Second, within the great panoply of misguided and ineffective DRM measures, this is a particularly dumb way of trapping bots/spiders. The risk of false positives is huge and the ways it can be bypassed are countless. It can even be used as the basis for a denial-of-service attack. Really not a good idea.

  5. Suppose we all started clicking that link and got all universities and other institutions blacklisted. Imagine the nightmare for the ACS support team, having to deal with thousands of customers having been blacklisted for systematic downloading that did not take place. I know, I know … it’s probably not a wise way to deal with it.

  6. And the 10.1046 prefix does not seem special, because this gives the same warning:
    http://pubs.acs.org/doi/pdf/10.1021/9999-9999.99999
    But a proper test would have run this one before the 10.1046 one… should I try to reproduce it once I am unblocked?

  7. Henry Rzepa says:

    An enormous benefit of the Web operating on human-readable HTML source is that one can find such strings. I frequently inspect HTML source.
    Can anyone remind me of an initiative a year or so back to replace HTML source with binary files that could not be so inspected? Has that gone away?

    • Ryan B. says:

      Can you elaborate on the “strings” you are finding in the HTML source code? I am honestly puzzled as to why researchers would be poking around in the HTML source code. To my ear, the strangest thing about this whole episode is that researchers are digging around in publishers’ HTML source code, and that this seems to be considered normal and “above board”.
      (I’m not trying to be argumentative, I’m genuinely curious)

      • pm286 says:

        The point is that by crawling publisher websites it’s possible to collect huge amounts of valuable data.
        For example we have downloaded perhaps 50,000 data files from the ACS site. They know we are doing it and they permit it. We’ve built them into a crystallographic knowledge base of enormous value.
        We intend to do the same with Open Access papers, which we have a right to read with our eyes and – in a month – will have a right in the UK to read with machines. We can extract chemical compounds, phylogenetic trees, sequences, etc., which are already visible to humans. Machines are quicker than humans and make fewer mistakes.
        See wwmm.ch.cam.ac.uk/crystaleye for the scraped data and also http://www.crystallography.net/. All scraped from the open HTML pages on the web.
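For completeness, the conventional (non-trap) way for sites and crawlers to coexist is robots.txt plus crawler-side politeness. A sketch of what a well-behaved scraper honours – the bot name and rules here are invented, not what we or any publisher actually use:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: PDFs off-limits, 5 seconds between requests.
robots_txt = """\
User-agent: *
Disallow: /doi/pdf/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def may_fetch(url):
    """A polite crawler checks robots.txt before every request."""
    return rp.can_fetch("crystaleye-bot", url)

print(may_fetch("http://publisher.example/doi/abs/10.1046/real.article"))    # True
print(may_fetch("http://publisher.example/doi/pdf/10.1046/9999-9999.99999")) # False

# ...and it sleeps between requests (Crawl-delay support varies by parser):
delay = rp.crawl_delay("crystaleye-bot") or 1
```

A crawler built this way reads pages much as a fast human reader would, without hammering the server – and would also never follow a link that robots.txt or common courtesy rules out.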

  8. Georgios Papadopoulos says:

    > Note that my own experience was not a spider trap but simply (humanly) reading too many papers too rapidly – publications are not meant to be read rapidly, are they?
    This is really funny. Tom Demeranville described the trap very accurately.
    These LINKS (they are not DOIs!) are not visible or clickable. Only a (dumb) spider follows them.
    You created such a dumb spider and you were scraping the content. You were not reading it or clicking on anything.
    You were caught, but perhaps the funniest part is that you then came forward and exposed yourself. We usually never identify the writers of such crawlers.

    • pm286 says:

      I repeat – my own experience was not a spider trap – it was a trigger for a rapid reader monitor which failed to realise this was a normal human activity.
