# Truth or beauty, continued

Continuing our discussion on whether a chemical structure diagram is copyrightable.

1. Steven Bachrach Says:
September 24th, 2007 at 9:27 pm

Peter,
I have to take exception to some of your claims. The chemical formula drawing is not the only way of communicating the compound. In fact there are really much better ways of doing this, though not necessarily the best for a human to readily read. The InChI or smiles or 3-D coordinates really capture more information in a more reusable and less likely to be error-ridden way (especially the 3-D coordinates). The chemical formula drawing is not even unique, as we have seen in your examples.

Furthermore, Totally Synthetic had many arbitrary decisions to make in how to represent the structure. I have modified this structure in a few simple ways to make this point:

Note that I have changed the orientation of the terminal isopropyl/OH groups and the way the amide connects to ring A. With regard to ring B, the wedges here are actually NOT how it has to be. Note that the carbons of ring B are not stereocenters. The structure is drawn to try to indicate that ring B sort of π-stacks above ring E. This may or may not be true. Furthermore, the oxygen of the ring could in fact be pointing backwards. In my representation, I decided not to indicate any of this 3-D relationship.

Now I am not claiming that my structure is better than the original. My claim, however, is that Totally Synthetic made some creative decisions in making this presentation, and thus it should be protected.

PMR: "With regard to ring B, the wedges here are actually NOT how it has to be. Note that the carbons of ring B are not stereocenters. " I don't know whether this is true or not. Looking at the 3D structure it seems to me that there are two isomers (not conformers) where TS's wedges show one. It may, however, be that the ring is sufficiently flexible that they interconvert rapidly enough not to be isomers.

More generally, however, there are many reasons why structural diagrams are essential. The diagram above is numbered. The numbers are essential to understand much of the data (spectral assignments, reactivity, etc.). They cannot be held in SMILES, or InChI or, indeed, in anything in common use other than CML (which has support for many sorts of annotation). Neither InChIs nor SMILES are any use for most organometallic compounds, polymers, intermolecular compounds, supermolecules, nanotubes, polymer beads, etc. Many of these things don't have useful 3D coordinates - we are working on polymers and have developed Polymer Markup Language - and we can generate 3D coordinates very nicely, but not many others can.

So I contend that for much of chemistry diagrams are the only method of primary communication. It's actually part of the problem of involving machines: how do we get these things into a formal system without losing information?

That's hard enough without the publishers' lawyers suing us.

# Semantic web : the scream!

I have just blogged Paul Miller's Talis Community Licence and realised that - I think - I used to get a feed from his/Talis blog. So I put it in the Feedreader and found a whole lot of posts on the semantic web (or Semantic Web). Now I had been battling with SPARQL for a day or two trying to make a query with real numbers, e.g.

FILTER (?foo < "1.23"^^xsd:float)

I could NOT get it to work. Finally my colleague Diana Stewart tracked it down to the fact that in some places in RDF you are allowed to use prefixes and in some places you aren't. It's almost completely arbitrary. It's not in the simple tutorials. RDF is a place where if you make an inspired guess you will be wrong. The syntax (wrong, the 3-4 syntaxes) is all over the shop: XML, N3, Turtle, ??? It makes me scream.
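To make the prefix trap concrete, here is a minimal Python sketch. It may not be the exact case that was tracked down; it assumes one well-known instance, where SPARQL accepts `xsd:float` once the prefix is declared at the top of the query, while the `rdf:datatype` attribute in RDF/XML takes the full URI. The names `PREFIXES`, `expand` and `build_query` are invented for the illustration, and the triple pattern `?s ?p ?foo` is a placeholder.

```python
# Namespace table. The xsd IRI is the standard XML Schema namespace.
PREFIXES = {"xsd": "http://www.w3.org/2001/XMLSchema#"}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'xsd:float' into a full IRI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"undeclared prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


def build_query(limit, datatype="xsd:float"):
    """Build a SPARQL query whose typed literal uses a declared prefix."""
    prefix = datatype.partition(":")[0]
    return (
        f"PREFIX {prefix}: <{PREFIXES[prefix]}>\n"
        "SELECT ?s WHERE {\n"
        "  ?s ?p ?foo .\n"
        f'  FILTER (?foo < "{limit}"^^{datatype})\n'
        "}"
    )


# SPARQL: the prefixed form is fine once the PREFIX line is present.
print(build_query("1.23"))

# RDF/XML: the datatype attribute needs the expanded IRI instead.
print(f'rdf:datatype="{expand("xsd:float")}"')
```

Keeping the expansion in one helper means the same prefix table can feed both syntaxes, which at least puts the inconsistency in one place.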

So I was pleased to see that Jeni Tennison (one of the early evangelists of XSLT, coming up with some magic tricks - and goodness you need all the tricks you can get with XSLT) had the same reaction:

# Diazonamide : The Blue CrystalEye Greasemonkey lends a hand

There is some doubt about what the structure of diazonamide A is, because there is no absolute way of assigning names to structures. We only agree what aspirin is because everyone has been assigning the same structure to it for 100+ years. Many people are careless with names and even more are careless with structure diagrams. Indeed there seems to be a minor industry in drawing some structures wrongly. A year or two back, when Nick Day was pioneering the use of InChI, he used "staurosporine" as an example. He found lots of structure diagrams and I think there were 19 (sic) different diagrams. Some were frankly "wrong". Others missed out the stereochemistry, others had other problems. And some of these were from suppliers' sites (i.e. "labels on bottles").

So how can we be sure? It needs an authority - but which one? Staurosporine is a (potential?) drug, so... WHO drugs? British National Formulary? US National Pharmacopeia? Chemical Abstracts? Beilstein? All of these are pay-to-view. So I cannot look them up (remember I am at home, simulating an interested person, such as a patient). Ah! Pubchem... with 16 entries, and several variations of stereochemistry. Wikipedia has a nice picture ... but this is about diazonamide...

On TotSynth's post there's a link to the latest paper (DOI: 10.1021/ja0744448). And following this I find:

PMR: The Blue Eye is NOT part of the abstract - it shows that the Blue Obelisk Greasemonkey has found a crystalEye entry which looks like this:

and here you can see the actual stereochemistry of the diazonamide nucleus (it's not exactly the title compound) so there is virtually no doubt. The diagram on the right is calculated from the 3D coordinates and the layout is through CDK - note the stereo wedges and hatches.

So now I know what some of the stereo is. And because PNAS have made the text Open I can read how it relates to TS's structure. The CDK may not be 100% beautiful, but it should be true (Cue some reader finding it's wrong and a bug in JUMBO, but that's what Open science is about). And you can always pay Chemical Abstracts 6.20 USD to check whether you have got it "right".

So install the Blue Obelisk Grease Monkey (blog post) in your Firefox browser and Open your Eyes to a whole new world of truth and beauty.

# Beauty is truth, truth beauty - and copyrightable?

In (Finding chemical structures - InChIs et al., an amusement) I explored the varied approaches to drawing structures and the problems of representing them. I commented that Totally Synthetic's diagrams were not only the most unambiguous but also the most beautiful. I now regret having done that, as Steve Bachrach has argued that this makes them copyrightable.

1. Steven Bachrach Says:
September 24th, 2007 at 5:23 pm

This discussion demonstrates why I believe that “structure drawing” falls within the domain of materials that can be copyrighted. What is the difference in all of these representations (let’s agree not to worry about what might be a different stereochemistry at one center)? In the eye of the viewer some of these are “ugly” and some are more aesthetically pleasing. I might argue that Totally Synthetic’s representation is not only clear, it might even be called “pleasing”.

Since it is aesthetics (beauty, clarity, etc) that differentiates these drawings, the creator’s choices in how to display this molecule were critical creative acts. It seems to me that this defines work that should fall under copyright protection. The fact that all of these representations refer to the same underlying chemistry does not diminish in the least the creativity involved to create the last structure, and perhaps a total lack of creativity in producing the PubChem drawing.

So it seems to me that one should be careful in the re-use of structure drawings. To me these “drawings” are not data, while the connection table, for example, is data.

Steven

PMR: Steve is a strong evangelist for OA and we are on the same side, but I don't think arguing for copyrighting chemical structure diagrams is helpful. Let's take the analogy of mathematical equations. I could argue that

$e^x = \sum_{n=0}^\infty {x^n \over n!} = \lim_{n \to \infty}\left(\frac{1}{0!} + \frac{x}{1!} + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!}\right).$
(taken from Wikipedia on Euler) was beautiful, while
e^x = Σ(n=0..∞) x^n/n! was ugly.

Let's assume that Euler had access to TeX and had published his formula in a journal belonging to the strong-copyright school of thought and that a student has cut and pasted the formula to explain the summation. The publisher could then claim "the formula is beautifully typeset so you must take it down, or retype it". Poor Euler is dead so hasn't any say.

In the same way the chemical formula diagram is the ONLY means of communicating the structure. In TotSynth's case the wedge bonds in ring B are not to make it pretty but to emphasize what the compound actually is

while the PNAS structure tries to do the same but in a much uglier and gritty fashion. There is a real likelihood of confusion as to what the structure actually is:

1. DrZZ Says:
September 24th, 2007 at 12:26 pm

Interesting stuff. Let me add some additional points. One of the structures in PubChem comes from us (DTP/NCI). If you look at the compound record you get the mess you included above. If you navigate to the substance record (click on the CID, on that page look for the Substance: 1 link, when that hit comes up click on the SID, and on that page change the drop down choice for Compound Displayed from PubChem to deposited) you see a much more sensible 2D drawing. In a quick look, I think the difference between the two structures in PubChem is that one of the stereocenters in the NCI deposited structure is unspecified in the other structure. The NCI structure was submitted in 1997 by one of the authors of the original isolation paper. As the structure correction was published in 2001, it is almost certain that the NCI structure contains the original error. I say almost because we have no audit trail in our internal database for structures (at least not easily visible to me). A NSC has a structure, period. There is some possibility that the structure was fixed, but that just overwrites the previous structure. It just reinforces my view that it is extremely important to treat the structure of a substance as one more data point, subject to varied and possibly conflicting values.

It's very easy to get it wrong.

Now if I cut and paste the diagrams and say "this one shows clearly that the ring is sticking up" I might help avoid the wrong compound being given to the wrong patient somewhere down the line. (This is not hypothetical - these are possible drugs). Steve, can you justify a publisher saying - "we'll send the lawyers after you for posting copyright chemical structures"?

Because if so, the C21 will be enormously impoverished. So yes, it's beautiful, but NO, it mustn't be copyrighted by publishers.

# Talis licence for Open Data

I used to think Open Data was simple - "facts are not copyrightable" and everything follows. Now I am wiser and realise that data are complex and need a lot of attention - fast. So it's very valuable to see groups who are addressing the problem. Here is Paul Miller of Talis (who convened a WWW2007 session on Open Data):

18:11 24/09/2007, Nodalities
In the world of creative works, notions espoused by Lawrence Lessig and others over a number of years are becoming increasingly well understood. A Creative Commons license, for example, is recognised as giving the holder of rights an ability to prospectively grant certain permissions rather than limit use of their work by expecting all comers to request these permissions, again and again. Those rights are not cast aside, removing all opportunities to protect your work, your name, or your potential revenue stream. Rather, you are provided with a means to explicitly declare that your work may be used and reused by others in certain ways without their needing to request permission. Any other use is not forbidden; those uses must simply be negotiated in the 'normal' way... a normal way that also applied to those uses covered by Creative Commons licenses before the advent of those licenses.

Creative Commons licenses are an extension of copyright law, as enshrined in the legal frameworks of various jurisdictions internationally. As such, it doesn't really work terribly well for a lot of (scientific, business, whatever) data... but the absence of anything better has led people to try slapping Creative Commons licenses of various types on data that they wish to share. It will be interesting to see what happens, the first time one of those licenses needs to be upheld via a court!

At Talis, we have an interest in seeing large bodies of structured data available for use. Through the Talis Platform, we offer one means whereby such data may be stored, used, aggregated and mined, although we clearly recognise that similar data may very well also be required in similar contexts.

Recognising that contributors of such data need to be reassured as to the uses to which we - and others - may put their hard work, we spent some time a couple of years ago drafting something then called the Talis Community Licence. This draft licence is based upon protections enshrined in European Law, and has been used 'in anger' for a while to cover contributions of millions of records to one particular application on the Talis Platform.

There has been plenty of talk around 'open data' here on Nodalities, and on our sister blog Panlibus. See, for example, this recent post from Rob Styles. There were also fascinating discussions at the WWW2007 conference earlier this year.

Despite interest in open (or 'linked') data, licenses to provide protection (and, of course, to explicitly encourage reuse) are few and far between. Amongst zealous early adopters, there does seem to be a tendency to either (mis)use a Creative Commons license, to say nothing whatsoever, or to cast their data into the public domain. None of these strategies are fit for application to business-critical data.

Building upon our original work on the TCL, we recently provided funding to lawyers Jordan Hatcher and Charlotte Waelde. They were tasked with validating the principles behind the license, developing an effective expression of those principles that could be applied beyond the database-aware shores of Europe, and working with us to identify a suitable home in which this new licence could be hosted, nurtured, and carried forward for the benefit of stakeholders far outside Talis.

Today, Jordan posted the latest draft of this license (now going by the name 'Open Data Commons'), some rationale, and pointers to various ways in which he - and we - are seeking input and further validation.

As my colleague Rob (again!) has argued, curators of data need an option on the permissions continuum between free-for-all and locked down. The Open Data Commons, née Talis Community Licence, offers that option.

Take a look. Think about how you would use it. Consider what sort of administrative framework you would want behind such a license. Join the conversation.

PMR: First of all, many thanks for funding legal work on Open Data. Whatever else, we have to remain within the legal framework or we court disaster at a later stage.

There will not be a single approach to this, any more than there is a single Open Source licence. Motivations vary and, even more importantly, data is more varied than software. I know of two other efforts: Science Commons (in Cambridge US), springing from CC, and the Open Knowledge Foundation, set up by the tireless Rufus Pollock (in Cambridge UK), who invited me to be on the board. We honour this by using the OKFN "Open Data" on our own CrystalEye. I expect that people will choose different licences to emphasize different policies. (For example I currently use Artistic as my software licence as I don't want the name JUMBO to be misused for derivative works which are not compliant. I might well use BSD elsewhere, and so on.)

# CDK's Diazonamide and general thoughts on Openness

Noel O'Blog has suggested that I should use Rajarshi Guha's CDK service to lay out the Diazonamide structure (see my post Finding chemical structures - InChIs et al., an amusement).

1. baoilleach Says:
September 24th, 2007 at 7:59 am

For the record, you can compare with CDK’s SMILES to 2D at:
http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html#sdg

PMR: so here it is:

PMR: I think it's correct. Interpretable. I'd put it on the same level as the Daylight one. One message is that it is difficult for software to lay out structures with a 10-ring nucleus.

The point is that CDK is Open Source and can therefore be enhanced by the community. Daylight and the software that Pubchem uses (?Cactus?, ?Openeye?) aren't. CDK is joint leader, and we can improve it.

A complementary approach is to start making collections of human-drawn images. The intelligible Chemspider image was hand-drawn by the PNAS authors - I don't know how it got to Chemspider. (Personally I think it's pretty awful - I do not like stereo bonds which are rectangular rather than wedges. Why do people use them? You only have to scale the image to corrupt this info.) So we need an Open collection of chemical structures.

This is not technically difficult but is lathered with copyright madness. Can I reproduce a chemical structure from Nature without permission? I've asked but they haven't got back to me. Can I reproduce a chemical structure diagram from Wiley? I've asked but... ... they haven't got back to me.

It has to be fully Open. Every structure diagram has to be copyright-free and accompanied by metadata that gives provenance and alternative descriptions (names, InChIs, etc.). Is there anywhere that has chemical images that I can download that fulfils all these permissions?
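As a rough sketch of what "accompanied by metadata" could mean in practice, the record below bundles a diagram with its licence, provenance and alternative descriptions, and reports what is missing. Everything in it is an assumption made for illustration: the class name `StructureDiagram`, the field names, the set of licence labels counted as open, and the placeholder file name and provenance text.

```python
from dataclasses import dataclass, field

# Hypothetical set of licence labels treated as "copyright-free".
OPEN_LICENCES = {"public-domain", "CC0"}


@dataclass
class StructureDiagram:
    image_file: str                             # path or URL of the drawing
    licence: str                                # licence label for the image
    provenance: str                             # who drew it and where it came from
    names: list = field(default_factory=list)   # trivial and systematic names
    inchi: str = ""                             # machine-readable identifier, if known

    def problems(self):
        """Return the reasons this record fails the 'fully Open' test."""
        found = []
        if self.licence not in OPEN_LICENCES:
            found.append("licence is not open")
        if not self.provenance:
            found.append("no provenance")
        if not (self.names or self.inchi):
            found.append("no alternative description")
        return found


record = StructureDiagram(
    image_file="paclitaxel.png",                # placeholder file name
    licence="public-domain",
    provenance="placeholder: redrawn by hand",
    names=["Paclitaxel", "taxol"],
)
print(record.problems())   # an empty list means the record passes
```

Returning a list of problems rather than a bare yes/no makes it easier to see why a candidate image cannot go into the collection.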

I've found one (sorry for the layout). Here's taxol:

Paclitaxel

β-(benzoylamino)-α-hydroxy-,6,12b-bis
(acetyloxy)-12-(benzoyloxy)-2a,3,4,4a,
5,6,9,10,11,12,12a,12b-dodecahydro-4,11-
dihydroxy-4a,8,13,13-tetramethyl-5-oxo-
7,11-methano-1H-cyclodeca(3,4)benz(1,2-b)
oxet-9-ylester,(2aR-(2a-α,4-β,4a-β,6-β,9-α
(α-R*,β-S*),11-α,12-α,12a-α,2b-α))-
benzenepropanoic acid

And there's lots of data with it that looks like this:

I'll leave you to guess where this is. Clues: It's Open, re-usable, very highly curated, and the first place that students look. That - or a derivative - is where the world's chemistry should reside.

# Finding chemical structures - InChIs et al., an amusement

Totally Synthetic, Chemspider and I have been discussing the value of InChIs in blogs. TS's blog is, of course, Openly available under CC licence, and he is widely revered in the community for the beauty and accuracy of his structural diagrams. This post is a slightly light-hearted voyage through what can be discovered with Toll-Access barriers in place. I leave readers to judge whether TS and Pubmed are up to the ease and value of the information from commercial providers.

I'm reading this from outside the University and I do not have a VPN. This is useful as it shows me what it's like to be an information-impoverished reader. TS blogged today about Diazonamide A, a natural product which was billed as the next big breakthrough in cancer some years ago. (It has 4 reports in Pubmed about its biology, and 26 about the chemical synthesis. Taxol has 30,000.) Anyway TS has taken the advice of the Blue Obelisk list and managed to put InChIs into his blog.

I'll show his beautiful-as-always structure at the end, but meanwhile I wanted to see how easy it was to find the structure from freely accessible sites. This includes most abstracts (in science it seems to be almost universal to post abstracts in clear, so be grateful).

Wikipedia does not list it, but has the (intriguing and misleading) entry under "Trivial_name":

For example, the most important structural feature of Diazonamide is that it's a nonribosomal peptide, which is denoted by the suffix "amide".

PMR: it might have started as a peptide but I don't think many people would now call it that. (Unless there is another Diazonamide that I don't know of).

So on to the latest synthesis (Magnus, Cheung, Goldberg, Russell, Turnbull and Lynch. JACS, 2007, ASAP. DOI: 10.1021/ja0744448.), remembering I can't read the full text. The abstract is a superb illustration of hanging links (NullPointerExceptions in Java):

Abstract:

During the course of studies on the synthesis of diazonamide A 1, an unusual O-aryl into C-aryl rearrangement was discovered that allows partial control of the absolute stereochemistry of the C-10 quaternary stereogenic center. Treatment of 30 with TBAF/THF gave the O-tyrosine ethers 31 and 32 (1:1), which on heating each separately in chloroform at reflux rearranged to 33 and 34 in ratios of 84:16 and 56:44, respectively. This corresponds to a 70% yield of the correct C-10 stereoisomer 33 and a 30% yield of the wrong C-10 stereoisomer 34. Attempts to convert 34 into 33 by ipso-protonation and equilibration were unsuccessful. Confirmation of the stereochemical outcome of the rearrangement was obtained by converting 33 into 37, an advanced intermediate in the first synthesis of diazonamide A by Nicolaou et al. It was also found that the success of the above rearrangement is sensitive to the protecting group on both the tryptophan nitrogen atom and the tyrosine nitrogen atom.

PMR: What a splendid piece of non-communication! [My comments could apply to many publishers, not just ACS]. Without the full text (which, after considerable perusal will tell us what 1, 30, 31, 32, 33, 34 and 37 are) it's almost meaningless. I am reminded of Alice's comment on Jabberwocky:

"Somehow it seems to fill my head with ideas – only I don't exactly know what they are! However, SOMEBODY killed SOMETHING: that's clear, at any rate -- '"

PMR: and the authors made something from something else...

So off to Pubchem. Many compounds made by synthetic chemists are not in Pubchem because they are of no interest, but Diazonamide is. It has a structural diagram [1]

PMR: Lovely. I think it's correct, but it's not exactly beautiful. Like mathematical equations, chemical structures can be pretty or semantic. This is semantically correct and it's probably pretty to jellyfish (this was a marine compound) but not to humans.

So on to the InChI. Pubchem tells me that the compound has InChI:

InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6
-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24
-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-1
6,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39-,4
0u/m0/s1/f/h44-45H

The problem is that this is not pretty for blogs as it runs over the line ends and spaces are a problem. So IUPAC are working out new approaches and some of these are discussed by the Blue Obelisk.
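Until something better arrives, the damage from wrapping can at least be undone mechanically, since an InChI contains no whitespace of its own. The Python sketch below rejoins the wrapped string shown above and splits it into its slash-separated layers. The function names `unwrap` and `split_layers` are invented for the illustration, and this is simple string handling, not a validating InChI parser.

```python
# The InChI exactly as it appears when wrapped above.
WRAPPED = """
InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6
-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24
-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-1
6,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39-,4
0u/m0/s1/f/h44-45H
"""


def unwrap(text):
    """Delete every whitespace character; an InChI never contains any."""
    return "".join(text.split())


def split_layers(inchi):
    """Return (version, formula, [(prefix, content), ...]) for an InChI."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    # A list, not a dict: prefixes can repeat (here 'h' appears twice).
    return version, formula, [(seg[:1], seg[1:]) for seg in rest]


version, formula, layers = split_layers(unwrap(WRAPPED))
print(version, formula)
for prefix, content in layers:
    print(prefix, content)
```

This only helps a reader who copies the whole block; a single character dropped in transit still yields a different, and possibly still plausible-looking, identifier.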

There is also a SMILES:

CC(C)C1C2=NC3=C(O2)C45C(NC6=C(C=CC=C64)C7=C8C(=CC=C7)NC(=C8C9=C(N=C3O9)C
l)Cl)OC2=C5C=C(CC(C(=O)N1)NC(=O)C(C(C)C)O)C=C2

which is a linear way of encoding the structure. Let's go to the Daylight site (they invented SMILES) to see what it looks like:

I think it's correct, and it's certainly a lot better than the Pubchem offering but it's not beauty - except for Shrek.
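"I think it's correct" can be given one cheap machine check: count the heavy atoms in the SMILES and compare them with the formula layer of the InChI. The Python sketch below does that with regular expressions. It is a sanity check only, under the assumption that the SMILES uses just unbracketed, non-aromatic atom symbols (true of the string above, not of SMILES in general); it says nothing about connectivity or stereochemistry. The function names are invented for the illustration.

```python
import re
from collections import Counter

# The SMILES as wrapped above, rejoined by string concatenation.
SMILES = (
    "CC(C)C1C2=NC3=C(O2)C45C(NC6=C(C=CC=C64)C7=C8C(=CC=C7)NC(=C8C9=C(N=C3O9)C"
    "l)Cl)OC2=C5C=C(CC(C(=O)N1)NC(=O)C(C(C)C)O)C=C2"
)
FORMULA = "C40H34Cl2N6O6"   # formula layer of the InChI above


def heavy_atoms_in_smiles(smiles):
    """Count atoms in a SMILES that uses only unbracketed, non-aromatic atoms."""
    # Two-letter symbols come first so 'Cl' is not read as carbon.
    return Counter(re.findall(r"Cl|Br|[BCNOPSFI]", smiles))


def heavy_atoms_in_formula(formula):
    """Count non-hydrogen atoms in a simple formula such as 'C40H34Cl2N6O6'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol != "H":
            counts[symbol] += int(number or 1)
    return counts


print(heavy_atoms_in_smiles(SMILES))
print(heavy_atoms_in_smiles(SMILES) == heavy_atoms_in_formula(FORMULA))
```

Hydrogens are left out because they are implicit in this SMILES; checking them would need valence rules, which is a job for a real toolkit such as CDK.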

Let's try Chemical Abstracts. It's got every compound ever made. Maybe they will let me have a free go... (STNEasy) I find:

A free demo! Just what I wanted...

PMR: This is fine, and it points to the same abstract, but I can't get at the structure. Let's try CAS-Number lookup - it will tell me the number and the structure... and there is a free demo as well:

Oh dear... Yes, a free demo, but only if you are looking for caffeine. I can get all I want about caffeine from Wikipedia without paying 6.20 USD. Ah well.

So, off to chemspider which is free. The search for diazonamide A reveals:

10472888 is shown at full size. (There are two more structures but both are equally unreadable). Note that the atom counts of the structures are inconsistent - the actual composition - I think - is that of 4591072. I try to zoom the formula and get a featureless gray square on both IE and Firefox. So I try Jmol (shown right). Now the molecules are three-dimensional but the coordinates in chemspider are those of the 2-D diagram. Personally I regard this as extremely misleading and would NEVER use Jmol for 2D diagrams, but I shan't pursue this here.
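Software can at least warn when "3D" coordinates are really a flat drawing. A minimal Python sketch, assuming the coordinates are available as (x, y, z) tuples: it flags a structure whose z values are all the same. The function name `is_flat`, the tolerance and the sample coordinates are made up for illustration; a genuinely planar molecule would also be flagged, so this is a hint for a human, not a verdict.

```python
def is_flat(coords, tolerance=1e-6):
    """Return True if all atoms share (almost) the same z coordinate."""
    if not coords:
        raise ValueError("no coordinates given")
    zs = [z for _x, _y, z in coords]
    return max(zs) - min(zs) <= tolerance


# Made-up coordinates, for illustration only.
drawing = [(0.0, 0.0, 0.0), (1.2, 0.7, 0.0), (2.4, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.2, 0.7, 0.3), (2.4, 0.0, -0.5)]
print(is_flat(drawing), is_flat(model))
```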

So I still don't know what the molecule is. Where else? Perhaps I can use some more abstracts...

And the fourth one on Pubmed hits gold. It's from PNAS:

and it's FREE!!!!!

so we find the structure:

Truth at last. (For non-chemists the exact width of the lines matters, and the pixellation makes it very difficult to be sure. But I'm sure it's correct.)

And now what you have been waiting for - Totally Synthetic's structure:

I think you'll agree that the blogosphere is starting to emerge as a serious place to look for chemistry.

[1] pasted directly from the Pubchem site, suggesting we can create an image library for chemical structures

# open access : Thank you American Chemical Society

In my reviews of the practice of Open Access (Author Choice in Chemistry at ACS - and elsewhere?) I pointed out that there were deficiencies in access and labelling on Open offerings. I've now had a reply from Dave Martinsen:

Peter,

Thanks for pointing out the problem in accessing ACS AuthorChoice articles. This was a technical glitch which is in the process of being fixed. Please be assured that it is our intention that AuthorChoice material is available without charge from the time it is posted on the web. We believe the solutions we’re putting into place will prevent this access problem from happening again.

Dave

*********************************
David Martinsen
American Chemical Society

1155 16th St. NW
Washington, DC 20036
d_martinsen AT work-it-out

PMR: Thank you Dave (Dave - as I have already mentioned - has been very supportive of new approaches to chemical informatics).

AuthorChoice is a "hybrid Open Access" product produced by the ACS. "Hybrid" only applies to publishers (and sometimes specific journals) that are primarily closed (Toll Access, pay-to-read) but where authors may purchase "Open Access" for their specific article. (Many OA publishers require all authors to pay to publish). Every publisher has a different name for their hybrid products and almost all of them offer different rights and restrictions.

As I have said before, the quality of delivery of hybrid Open Access (and related products) is often poor. They are not well labelled, the navigation is poor, and the rights - if any - are often vague and contradictory. Hybrid offerings (as with the ACS) often still require the author to transfer copyright and do not allow full re-use of the article.

I am not (here) criticizing hybrid OA per se (though personally I think it is a distraction and is likely to be ineffective in every way). Nor am I concerned (here) with the price level, though I personally would not believe that I get good value from many publishers (as I require full permissions, including author retention of copyright). What concerned me here was that the reader (and thereby the author) was not getting what they were entitled to.

It is very clear that the OA community MUST insist on clear labelling and must police the practice. Many "OA" publishers are creating unacceptable offerings - either deliberately or probably through laziness and lack of commitment (I call this systemic failure of the industry). I had not intended to embark on any campaign and I am glad to see that others at Berlin5 are interested in putting in place more formal mechanisms. For example we need a system of labels - but that's not my story to tell.

I don't actually like attacking people (institutions are slightly different). Sometimes my role appears to be that of a gadfly. I didn't know why people use this particular analogy so looked it up in WP and found Gadfly:

"Gadfly" is a term for people who upset the status quo by posing upsetting or novel questions, or attempt to stimulate innovation by proving an irritant.

The term "gadfly" was used by Plato to describe Socrates' relationship of uncomfortable goad to the Athenian political scene, which he compared to a slow and dimwitted horse. It was used earlier by the prophet Jeremiah in chapter 46 of his book. The term has been used to describe many politicians and social commentators.

During his defense when on trial for his life, Socrates, according to Plato's writings, pointed out that dissent, like the tiny (relative to the size of a horse) gadfly, was easy to swat, but the cost to society of silencing individuals who were irritating could be very high. "If you kill a man like me, you will injure yourselves more than you will injure me," because his role was that of a gadfly, "to sting people and whip them into a fury, all in the service of truth."

PMR: I'm delighted to know the etymology (or rather the usage). And perhaps that is sometimes why I like the Socratic approach - posing questions which require definite answers rather than generalities. But, ahem, although it grows here I really don't like hemlock.

# How blogging makes contacts and seeds communities

I mailed yesterday about how blogging links to other blogs and generates new contacts. Here is a direct example:

Jakob Says:

... and ...

From the librarian’s point of view I can tell you that archiving data is probably even more complex then it seems to be. From the computer scientist’s point of view I can tell you that Semantic Web will enlight us easily. From the Open Content movement’s point of view I can tell you that you should just license the data and make it available and usable for anyone - like you said: first make sure THAT the data CAN be used.

PMR: Thanks Jakob. There is a growing number of people like you - we need to link them to generate critical mass. In chemistry we have created the Blue Obelisk community and we have pooled our resources and efforts. This could be done for content systems - informally as well as through institutions - an example is our collaboration with Peter Sefton on authoring tools.

# Does linking to technorati tags generate spam?

In a recent post (blogs, folksonomies and tagging - get going!) I encouraged the Open Access community to start using blogs and tagging. I specifically pointed to Technorati to illustrate the value and showed that some conferences had huge amounts of traffic and others almost none. I gave several examples and gave links to the technorati summary of the posts under given tags. This was based on a particular URL structure.

On revisiting these sites I find that the lists at Technorati have been drastically altered. The berlin5 one has 11 porno spam links. The method is a fairly recent one - take the content of a genuine post and do some very crude lexical munging of the words and phrases (I get zillions of these each day submitted to the blog comments). Somehow they actually linked to sex sites in Cambridge, so maybe they interpret domain names. So it seems the spammers have found my post yesterday and somewhere generated spam content that is either injected into Technorati or has already been linked. AFAICS the genuine links are still there.

Then I looked at www2007 worrying that I would see the same. But whereas there were 300+ links yesterday to www2007, now there are only 6, all half a year old. Was technorati spammed and tried to clean it?

If by linking to Technorati I have unwittingly generated spam I apologize, but this can be done in other ways.

I don't take Technorati counts very seriously - about as seriously as I take ISI citation counts - but it's a useful way of finding people. But maybe we have to be careful about the exact way we use it. I welcome enlightenment.