Why we need unique addresses and identifiers

#jiscopenbib #quixotechem

This post is about identifier and indexing systems – we shall need these for Lensfield and Quixote.

The hierarchical system seems to come naturally to most humans. I’m not a physiologist, neuroscientist or psychologist but it seems natural for schoolchildren to write something like:

Jane Doe

Bedroom 3

First floor

12 Occupation Road

Abbey Ward

Cambridge

Cambridgeshire

England

Great Britain

United Kingdom

Europe

The World

The Solar System

The Galaxy

The Universe

And it’s the genius and apparent simplicity of so many naming schemes that makes it possible to manage modern information. Many are a contract between a global provider and a local system. Thus when we write:

http://wwmm.ch.cam.ac.uk/blogs/murrayrust

we are actually building a hierarchy based on several providers:

The internet naming authorities (IANA/ICANN, historically InterNIC) create the top-level domains such as “uk”.

The authorities in each country create the next level (here “ac”). Within that each university has its own domain such as “cam”. The university then decides on the subcomponents (“ch” is chemistry).

Our group then gets its own sub…subdomain (“wwmm”). Under this our sysadmin allocates levels such as “blogs”, “svn”, etc. I have the sublevel for “murrayrust”. Now I can do more or less what I like in naming the resources.
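
The layering can be read straight off the URL. Here is a minimal Python sketch using only the standard library; the function name `hierarchy` is made up for illustration, and it simply treats DNS labels (read right to left) and path segments (read left to right) as successive levels:

```python
from urllib.parse import urlsplit


def hierarchy(url: str) -> list:
    """Return the naming levels of a URL, most global first."""
    parts = urlsplit(url)
    # DNS labels read right to left: "uk" is the most global level.
    host_levels = list(reversed(parts.hostname.split(".")))
    # Path segments read left to right, below the host.
    path_levels = [seg for seg in parts.path.split("/") if seg]
    return host_levels + path_levels


levels = hierarchy("http://wwmm.ch.cam.ac.uk/blogs/murrayrust")
print(" > ".join(levels))
# uk > ac > cam > ch > wwmm > blogs > murrayrust
```

Each level in the output is allocated by whoever owns the level to its left, which is the contract between global provider and local system.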

Some systems, such as the WordPress software we use, allocate semantically void identifiers such as “1234”. WordPress’s only concern is to make sure that no two blog posts ever have the same ID. In this case it seems to work by using serial numbers, and this is an excellent approach. The number is meaningless except that smaller ones are earlier. The numbers act as an indexing system for the blog. It means the system has to keep track of everything that has been done. If, for example, I closed the system down and then restarted blogging, it might well start at “1” again, and that would foul up anyone who had bookmarks to the earlier posts.
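
To make the bookkeeping point concrete, here is a small sketch of a serial-number allocator in Python. It is not WordPress’s actual mechanism; the class `SerialAllocator` and its JSON state file are assumptions for illustration, and it is not safe for several processes at once. It shows that the serial scheme is only as good as its memory of the last number issued:

```python
import json
import tempfile
from pathlib import Path


class SerialAllocator:
    """Hand out 1, 2, 3, ... and remember the last value in a file."""

    def __init__(self, state_file: Path):
        self.state_file = state_file

    def next_id(self) -> int:
        last = 0
        if self.state_file.exists():
            last = json.loads(self.state_file.read_text())["last"]
        new_id = last + 1
        # Persist before returning, so a restart continues from here.
        self.state_file.write_text(json.dumps({"last": new_id}))
        return new_id


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "counter.json"
    first = SerialAllocator(state)
    ids = [first.next_id(), first.next_id()]       # 1, 2
    restarted = SerialAllocator(state)             # same state file
    ids.append(restarted.next_id())                # 3: no clash
    state.unlink()                                 # the record is lost...
    ids.append(SerialAllocator(state).next_id())   # ...and 1 is reused
    print(ids)  # [1, 2, 3, 1]
```

The final `1` is the fouled-up bookmark: once the record of past allocations is gone, an old identifier gets handed out again.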

Making sure an identifier is unique isn’t easy and usually requires mapping onto some human activity. A useful way is to include the date, time and location in some way. Another is to generate numbers so large that they are “almost certainly” unique – this is what the UUID approach does. It works for most of us.
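
Both ideas are easy to try in Python. This is a rough sketch, assuming the standard `uuid` module for the “so large it is almost certainly unique” approach; `stamped_id` and the location label `cam-ch` are invented here to illustrate the date/time/location approach:

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def random_id() -> str:
    """A version 4 UUID (122 random bits) as 32 hex characters."""
    return uuid.uuid4().hex


def stamped_id(location: str, when: Optional[datetime] = None) -> str:
    """Combine a location label with a UTC timestamp to the microsecond."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"{location}/{when:%Y%m%dT%H%M%S.%f}Z"


print(random_id())  # different on every call
print(stamped_id("cam-ch", datetime(2010, 10, 1, 12, 0, tzinfo=timezone.utc)))
# cam-ch/20101001T120000.000000Z
```

The stamped form is readable but only unique if nothing else at the same location is stamped in the same microsecond; the random form makes no such assumption but tells a human nothing.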

Another way is to use a completely semantic system where every step of the hierarchy makes sense. This is what Nick Day does in Crystaleye. He creates a large URL rather like:

http://wwmm.ch.cam.ac.uk/crystaleye/<publisher>/<journal>/<year>/<issue>/<article>/<datafile1>

and a real example:

http://wwmm.ch.cam.ac.uk/crystaleye/summary/acta/e/2007/01-00/data/bh2062/bh2062sup1_I/bh2062sup1_I.cif.summary.html

This works because the whole publishing system is based on unique publishers which manage their information in a professional manner. Almost all follow this sort of strategy.

So how do we extend this to Quixote, where we are calculating chemistry? Here the generation of identifiers is critical. For example, suppose we have 5 people in the group who submit jobs; we might have:

murrayrustgroup/2010/10/01/job23

This requires a group tool that allocates unique job numbers. Individuals MUST use this tool. If they make up their job numbers then they will certainly clash, probably within days of starting. We could, of course, use a UUID system and have something like:

murrayrustgroup/2010/10/01/a379dc23aaef3490ffeacb23aac
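
One way such a group tool could work is to let the shared file system arbitrate. The sketch below is an assumption, not an existing Lensfield or Quixote tool: `allocate_job` is a hypothetical name, and it relies on directory creation failing when the name already exists (dependable on a local file system; a network filestore would need checking).

```python
import tempfile
from datetime import date
from pathlib import Path


def allocate_job(root: Path, group: str, day: date) -> Path:
    """Create and return the next free <group>/<yyyy>/<mm>/<dd>/jobN directory."""
    day_dir = root / group / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    day_dir.mkdir(parents=True, exist_ok=True)
    n = 1
    while True:
        job_dir = day_dir / f"job{n}"
        try:
            job_dir.mkdir()       # raises if someone already holds this number
            return job_dir
        except FileExistsError:
            n += 1                # taken: try the next number


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    a = allocate_job(root, "murrayrustgroup", date(2010, 10, 1))
    b = allocate_job(root, "murrayrustgroup", date(2010, 10, 1))
    job_ids = [p.relative_to(root).as_posix() for p in (a, b)]
    print(job_ids)
    # ['murrayrustgroup/2010/10/01/job1', 'murrayrustgroup/2010/10/01/job2']
```

Nobody chooses a number; whoever creates the directory first owns it, so two people cannot end up with the same job identifier.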

This is safe but psychologically depressing for many. People like to remember their information by handy identifiers. I still remember the chemical compounds I worked on in the pharma industry, such as AH19065, GR123976 and so on. That’s about the mental limit. It rests on yet another agreed scheme – a prefix for each company. “AH” is Allen and Hanburys, “GR” is Glaxo Group, and so on. Within each company the compounds had to be unique. For chemistry that was technically and semantically difficult, and it still is (we may revisit this later when we name chemicals).

So it’s critical in Quixote-Lensfield that we have good identifier schemes. They don’t have to be top-down. But they have to be capable of being integrated with top-down schemes. And they must create unique identifiers. And as we’ve seen before there are only a limited number of ways of doing this EASILY – i.e. where we can rely on everyone doing it.

And the simplest of these is the hierarchical filing system. It’s impossible to create two objects with the same filename in the same directory, so every full path on a given system is unique. (It’s possible to foul up when you clone an old machine onto a new machine, as I have just done, and then continue using both, as I hope I haven’t done.)

The problem is that most projects don’t fit naturally into any scheme. They nearly do, but there are normally problems. However I’m going to assume that we can have a system rather like:

external-id/organization/person/project

Here external-id is something like a domain name, a DOI, or another guaranteed-unique root; organization is something like a company or university; person is a unique individual within the organization; and the person manages their own projects on their own filestore. (Yes, it breaks down when people move organizations, or when they work with 2 organizations, or when several people work on the same project, or… But there are NO easy answers here.) What I am describing is common enough in many sciences – an individual has the freedom and the responsibility to manage their own information. Unfortunately they rarely have any guidance!
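
As a rough illustration of composing such an identifier, here is a Python sketch. The function `project_id` and every example value are hypothetical. The one design decision it makes is an assumption of mine: each component is percent-encoded, so that a DOI, which itself contains a “/”, still occupies exactly one level of the hierarchy.

```python
from urllib.parse import quote


def project_id(external_id: str, organization: str, person: str, project: str) -> str:
    """Join four components into external-id/organization/person/project."""
    parts = []
    for name in (external_id, organization, person, project):
        if name in ("", ".", ".."):
            raise ValueError(f"unusable component: {name!r}")
        # Percent-encode everything outside letters, digits and "_.-~",
        # including "/", so each component stays a single level.
        parts.append(quote(name, safe=""))
    return "/".join(parts)


print(project_id("cam.ac.uk", "chemistry", "jdoe", "quixote"))
# cam.ac.uk/chemistry/jdoe/quixote
print(project_id("10.1000/182", "example-org", "jdoe", "benzene"))
# 10.1000%2F182/example-org/jdoe/benzene
```

This does nothing about the hard cases (people who move, shared projects); it only guarantees that an identifier always splits back into exactly four levels.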

My examples will then be based on Lensfield acting within a hierarchical filing system. The key thing is not to be afraid of creating directories (or folders or whatever term is used). We’re recommending one folder for one unit of work: one folder per crystal structure; one folder for calculations on one compound. And create subfolders if you have variant calculations or experiments on the same system.
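
A minimal sketch of the one-folder-per-unit rule in Python follows. This is not Lensfield itself; `new_unit`, the compound name and the variant labels are made up for illustration. The point is that the folder is created with `exist_ok=False`, so an existing unit of work is never silently reused:

```python
import tempfile
from pathlib import Path


def new_unit(project_dir: Path, unit: str, variants=()) -> Path:
    """Create one folder for a unit of work, plus one subfolder per variant."""
    unit_dir = project_dir / unit
    unit_dir.mkdir(parents=True, exist_ok=False)   # refuse to reuse a name
    for variant in variants:
        (unit_dir / variant).mkdir()
    return unit_dir


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "quixote"
    new_unit(project, "benzene", variants=["opt-b3lyp", "opt-mp2"])
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print(created)
    # ['benzene', 'benzene/opt-b3lyp', 'benzene/opt-mp2']
    try:
        new_unit(project, "benzene")   # same unit again
        clash = False
    except FileExistsError:
        clash = True
    print(clash)  # True
```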

[… I’ll stop here because this is already 3 days in writing and I’ll continue on the train back from Chester …]

3 Responses to Why we need unique addresses and identifiers

  1. The semantically meaningful way of naming datafiles is great, but in compchem there are too many things that can distinguish one calculation from another, like
    http://neptuno.unizar.es/data/unizar/pabloechenique/peptidedb/HCO-L-Ala-NH2/charge0/geometryoptimization/NewtonRaphson/conv10e-5/MP2/FC/aug-cc-pVTZ/modfiedsoandso/SCFconv10e-4/DIIS/
    And if I drop only one of them, then the naming convention might become ambiguous.
    I would say UUIDs are the right choice.

  2. JURN says:

    How about, for journals…
    http://www.technology-history.org/ejournal_issue_004/free_full_text/2009_adams_preindustrial_water_mills.html
    Where preindustrial_water_mills are the first three words of the article title.
    Without even accessing the document, a human can now glance at the URL in search results and read off:
    Journal name (Technology History)
    Issue number (Number 4)
    It’s from an ejournal
    It’s free full-text
    The year published (2009)
    The author surname (Adams)
    The first three words of the article title (“preindustrial water mills”)

  3. Anonymous says:

    We organize into broken hierarchies. Where stands the Channel Islands in your ordering, or the British Antarctic Territory? The Commonwealth realms and nations? The EU and Schengen countries? Your URL example shows more broken hierarchies: “cif.summary.html” is information along another dimension with “summary” mentioned twice. And “Almost all” must certainly include those journals with more than one year in the publication year. You even write that hierarchies break down and cannot capture everything. So why are hierarchies natural? … Are they natural? Most people don’t organize their files and data into folders upon folders. Perhaps Jef Raskin is correct when he says we programmers are corrupted by our exposure to pure but artificial hierarchical structures like binary trees, folder structures, and 2.5D window management. Search, after all, wins handily over structure.
