berlin5: The laws of robotics; request for drafting

I have been asked about what we need for robotic access to publishers’ sites. Several publishers are starting to allow robotic access to their Open material. (Of course the full BBB declarations logically require this, but in practice many publishers haven’t made the connection.) So let’s assume a publisher who espouses Open Access and allows robotic access to their site. Is, say, a CC licence enough?
There are no moral problems with CC, but the use of robots raises additional technical problems, even when everyone agrees they want it to happen. There’s a voluntary convention, robots.txt, which suggests how robots should behave on a website. It’s been around since the web started, and there is no software enforcement. In essence it says:

  • I welcome X, Y, Z
  • I don’t welcome A, B, C
  • Feel free to visit pages under /foo, /bar
  • Please don’t visit /plugh, /xyzzy

From the Wikipedia article on robots.txt:

This example allows all robots to visit all files because the wildcard “*” specifies all robots:

User-agent: *
Disallow:

This example keeps all robots out:

User-agent: *
Disallow: /

The next is an example that tells all crawlers not to enter four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

Example that tells a specific crawler not to enter one specific directory:

User-agent: BadBot
Disallow: /private/

Example that tells all crawlers not to enter one specific file:

User-agent: *
Disallow: /directory/file.html
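
A robot can check these rules for itself before fetching anything. Here’s a minimal Python sketch using the standard library’s urllib.robotparser; the publisher URL, page and bot name are placeholders, not a real site:

from urllib.robotparser import RobotFileParser

# Hypothetical publisher site, purely for illustration
rp = RobotFileParser("http://publisher.example.org/robots.txt")
rp.read()                                 # fetch and parse robots.txt

# Ask whether our bot may visit a page before requesting it
page = "http://publisher.example.org/articles/b123456a.html"
if rp.can_fetch("MyMiningBot", page):
    print("allowed: fetch", page)
else:
    print("disallowed: skip", page)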

There’s another dimension. Even if the robots go where they are allowed, they mustn’t slaughter the server. 100 hits per second isn’t welcome. So some extensions:

Nonstandard extensions

Several crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server:

User-agent: *
Crawl-delay: 10

Extended Standard

An Extended Standard for Robot Exclusion has been proposed, which adds several new directives, such as Visit-time and Request-rate. For example:

User-agent: *
Disallow: /downloads/
Request-rate: 1/5         # maximum rate is one page every 5 seconds
Visit-time: 0600-0845     # only visit between 6:00 AM and 8:45 AM UT (GMT)
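
Recent versions of Python’s urllib.robotparser can report Crawl-delay and Request-rate (though not Visit-time), so a polite bot can pace itself from the same file. A rough sketch; again the site, bot name and page list are placeholders:

import time
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

SITE = "http://publisher.example.org"     # hypothetical publisher
BOT = "MyMiningBot"                       # hypothetical user-agent

rp = RobotFileParser(SITE + "/robots.txt")
rp.read()

# Work out how long to pause between successive requests
delay = rp.crawl_delay(BOT)               # Crawl-delay in seconds, or None
rate = rp.request_rate(BOT)               # Request-rate as (requests, seconds), or None
if delay is None and rate is not None:
    delay = rate.seconds / rate.requests
pause = delay or 10                       # fall back to something conservative

for page in [SITE + "/articles/b1.html", SITE + "/articles/b2.html"]:
    if rp.can_fetch(BOT, page):
        urlopen(page)                     # fetch (response handling omitted)
    time.sleep(pause)                     # never hammer the server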

I can see roughly two types of robotic behaviour:

  1. Systematic download for mining or indexing. CrystalEye is in this category: it visits publishers’ sites every day and attempts to be comprehensive (it doesn’t index Wiley or Elsevier because they don’t expose any crystallography). It would be highly desirable to minimise repetitious indexing, and an enthusiastic publisher could put their XML material in a proper repository framework with a RESTful API (rather than requiring HTML screen-scraping or PDF-hack-and-swear); a minimal sketch of that idea follows this list. In return there could be a list of acknowledged robots so that these could act as “proxies” or caches.
  2. Random access from links in abstracts or citations. This is likely to happen when the bot is in PMC/UKPMC, or CrystalEye, and discovers an interesting abstract and goes to the full text on a publisher’s site. The bot may have been created by an individual researcher for a single one-time purpose.
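
To make the first case concrete: suppose a publisher exposed an Atom feed of newly published Open items; a bot could poll that instead of screen-scraping. The feed URL and layout here are entirely hypothetical, a sketch of the idea rather than any publisher’s actual interface:

from urllib.request import urlopen
from xml.etree import ElementTree as ET

# Hypothetical feed of newly published Open Access items
FEED_URL = "http://publisher.example.org/oa/recent.atom"
ATOM = "{http://www.w3.org/2005/Atom}"

with urlopen(FEED_URL) as response:
    feed = ET.parse(response)

# Each Atom entry is expected to carry a link to the full text or data file
for entry in feed.findall(ATOM + "entry"):
    title = entry.findtext(ATOM + "title")
    link = entry.find(ATOM + "link").get("href")
    print(title, "->", link)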

So I’d like to come up with (three?) laws of mining robotics. Here’s a first shot:

  • A publisher should display clear protocols for robots, with explanations of any restrictions and lists of any regular mining bots.
  • A data-miner should use software that is capable of honouring machine-understandable guidance from servers. The robots should be prepared to use secondary sites.
  • Mining software should be Open Source and should honour a common set of public protocols.

But I would like suggestions from people who have been through this…
