Conspiracy and chemistry and an invitation to lunch

Antony Williams (Chemspider) and Stuart Cantrill (Nature) have recently blogged about what the blogosphere is seeing as censorship on the Web by the American Chemical Society. This is a bold and serious claim and needs some background. The facts, from Stuart:

One story that caught our eye was from Outsell, which, in their words, ‘is the only research and advisory firm focused on the publishing, information, and education industries’. The article was entitled ‘Chemical Bonding InChI by InChI’ and it offered an analysis of how certain publishers were making use of InChIs – those of you unfamiliar with InChIs can go here for a primer.

Daniel Pollock at Outsell had published an article on March 30th 2009 entitled “Chemical Bonding InChI by InChI”. He discussed the InChI Resolver and the efforts to raise enthusiasm for the InChI. He also discussed the efforts of both Nature Publishing Group and the Royal Society of Chemistry to proliferate the use of InChIs.
[…]

The article then moves on to consider whether CAS (Chemical Abstracts Service), which is owned by the ACS, will also embrace InChIs. The conclusion was that we may have to wait a while for that to happen.

So why do you need to know this? Well, the story from Outsell has been withdrawn (on April 8th) – and more than that, in fact, it has been removed from their archives (although the original story is cached on Google and you can find it here).

Whether it is right to completely remove every trace of a story that you withdraw is a discussion for another day – but now all that remains is a brief notice indicating that the original story did not hold up to Outsell’s internal standards.

Outsell now say that the original article wasn’t balanced and that the ‘tone of the piece could be taken to single out CAS as being late in responding to the trends’. Surely readers could make that judgement for themselves?

The great shame is that the whole article has simply been removed and an analysis of how cross-publisher development on an important topic such as the InChI – which may have a significant role to play in chemistry publishing – has been lost.

Antony uses stronger language and speculation (Conspiracy Theories and InChIs – Why was the Article Removed? – there’s much more that is worth reading):

Conspiracy theories are already moving around the community. The majority of people I have discussed this with believe that the retraction was likely forced by CAS

PMR: In short the best guess is that CAS see InChIs as a threat (I’ll discuss the foolishness of this below) and that they put pressure on Outsell to retract. I don’t know under what auspices Daniel writes for Outsell – employee, invited contributor, etc. – but Outsell have the right to moderate what is published on their site. They may feel that Daniel’s article detracted from their brand; I take the opposite view – that the article was well written and that the retraction has done Outsell damage. (Contrast a forward-looking company like Talis whose Panlibus blog written by employees is a major enhancement). It emphasizes the problem of employees publishing under their company name, and I have empathy for Daniel.

The retraction seems to be typical of the knee-jerk reaction that CAS applies to anything that could conceivably be seen as challenging their monopoly. For example last year Wikipedia volunteers started checking CAS numbers for correctness and the first reaction was to tell them that they were in breach of contract. Not “we are glad to see quality applied to chemistry”; “glad to see responsible use of unique identifiers”. After the natural blogosphere outrage (including this blog) CAS relented. I doubt they will relent this time.

It’s difficult to know what the reality is – but there are too many stories about clandestine and lobbying practices at ACS to ignore them. PRISM, the ACS mole, the constant lobbying, etc. ACS frequently resort to legal action (e.g. against Google) and I suspect there was a phone call from lawyers. We’ll probably never know.

Does this protect CAS’s monopoly? No, quite the reverse. It makes them look foolish, out of touch, and ultra-monopolistic. They have a huge turnover, and a monopoly of complete chemical information so they are immune from competition, right? Wrong.

Here is the UK Guardian newspaper recounting the demise of Encyclopedia Britannica (which I estimate has similar turnover to CAS):

By 1990 sales revenues had reached $650m.
Yet within five years, EB underwent a near-death experience. What almost killed it was a product that most of its executives regarded as a joke, an encyclopedia on CD-Rom launched by Microsoft and called Encarta. The original content was licensed from an outfit with the Dickensian name of Funk and Wagnalls, and some of it gave trash a bad name. So Microsoft spruced it up, added multimedia content and made it easy to use. To the astonishment of EB’s board, this meretricious object triggered a precipitous decline in sales of their gold-standard product.

Faced with catastrophe, the Benton foundation put EB up for sale. It took 18 months to find a buyer, a Swiss billionaire named Jacob Safra who bought the company for half its book value. The story of Britannica is now a business-school case study in how rapidly competitors can emerge – apparently from nowhere – in a digital world. The First Rule of Business nowadays is that somewhere out there someone (and not just Google) is incubating a business plan that is based on eating your lunch

So where are the lunch-eaters coming from? Surely we cannot recreate a database of 30 million compounds? No, we can’t – we can create something much better – Linked Open Chemistry. It won’t come from a single source but from all the Open chemistry efforts that have grown over the last 1-3 years. They include Chemspider, Pubchem, ChEBI, Wikipedia, The Blue Obelisk, CrystalEye, Open Notebook Science and a number of others. None have found the ACS as a body receptive to the new wave of chemistry. They are bringing to the lunch table:

  • Openness
  • Re-use and sharing
  • Immediacy
  • Innovation
  • Linkedness
  • Semantics
  • Quality Control
  • Community

Not all of these are fully developed but they are part of the Linked Data of the future and they can grow quickly. CAS’s actions and perceived stance are uniting them in a common effort to make chemical data free. Antony and I are meeting at the end of this month and we’ll be seeing how our offerings fit together. Yes, we’ve had differences, but these have helped to re-orient both of us and we now have at least a common goal of liberating chemistry.

There are some simple approaches which can revolutionize the way chemistry is captured and aggregated. Our own approach, semantic publishing (e.g. Chem4Word), means that the tacky business of text- and image-mining could disappear. Yes, it needs a culture change in chemistry, but that is looking likelier all the time. Meanwhile, high prices and restrictive practices will only serve to drive more people (including those outside chemistry) to create Linked Open Chemical Data.

After all it’s OUR DATA.

Posted in "virtual communities", Uncategorized | 1 Comment

Blogging for an eScientist – more thoughts

It is often exciting and rewarding when someone from outside the regular contributors comments, so I’ll reply to Jean Calvin (blog: Post Tenebras Lux), submitted on 2009/04/10 at 9:44am:

Good timing Peter! I started off a new blog recently and have been bogged down in the detritus of academic life. I knew blogging took time but it suffered when I hit a busy period just after my teaching finished (never understood how that worked).

Regarding the choice of theme, how important is this? I’d like a blend of themes but suspect that this might end up a little too entropic for the general reader…

Here are a few more thoughts. I had a look at your site, Jean, and I have added it to my blogroll and my feedreader, so you gain a little blog karma. You already seem to have picked a theme – you want to communicate science and I’d urge you to stick with it. It can be difficult at the start because you may not know anyone is reading your blog, but even the act of writing is therapeutic. Be patient – it can take weeks or more before you get feedback, but you may also find that you have discovered your theme or role. Be honest, and say as much or as little as you feel you can. You are in public, yes, but unless you say something really outrageous or illegal the sky won’t fall in.

Read other blogs and consider joining an aggregator such as http://scienceblogs.com/ or http://blogs.nature.com/blogs and http://blogs.nature.com/blogs/suggest (where you can submit your blog for wider visibility). [I can’t remember which of these are invitation and which act as a service.] Perhaps early days, but maybe some of my readers will give you hints.

I know it’s hard work, and I know I don’t do it as much as I should, but consider adding links to other blogs and explanatory links (for example link to Higgs Boson in Wikipedia).

You may find it useful to think of a particular readership and tune your posts to them, or you may wish to write more broadly and see what gets picked.

But above all do it because you want to, or feel you are driven to – not because you feel it’s your duty.


Open Notebook Ontology development

One of the challenging aspects of ontology development is that in the early stages there are likely to be major collisions of orientation, and chemistry is no exception. So we need to prepare for some hard work addressing these. Traditionally this has been done by getting together physically, identifying the issues and working through them on the hope that a useful amount of agreement could be reached. It’s still extremely useful to meet and discuss.

However the web provides a different process where we can be much more inclusive and share our world views at an early stage. The plus side is that everything is visible; the downside is that without some process to orient thought it can become chaotic. To that end Nico Adams has been logging his current ontologies and thoughts:

ChemAxiom: An Ontology for Chemistry – 1. The Motivation

ChemAxiom: An Ontology for Chemistry 2. The Set-Up

ChemAxiom: An Ontology for Chemistry 3 – Choosing an Upper Ontology

I am not even going to start to summarise as Nico has been and will continue to do this. But I’ll add some comments. In posting his ideas to the web Nico is effectively doing Open Ontology development. I am delighted that Michel Dumontier – who also has a chemical ontology – has picked this up – within hours – and has offered to collaborate – both on Nico’s blog and on Twitter. Because we realise that ontologies are hard, we know that by combining ideas and infrastructure we are more likely to create something that reaches a wide community, is useful and can be explained.

We all have a major challenge in chemical ontologies. Most chemists have never heard the term and – realistically – most will only be mildly interested in the idea (many will see it as irrelevant). So we need to work out cases where this can be seen to be useful and more than an abstract philosophical exercise. There is less problem with bioscience – there the concept of ontology is commoner and also seen to be necessary in bringing together different disciplines.

By carrying out this design in public Nico gives us the chance to get widespread input, but this carries the certainty of colliding world views. What are these likely to be? I see at least:

  • Whether or not to use an upper ontology and if so which. Here we are guided by bioscience which has already – in many cases – chosen to do so and to use the Basic Formal Ontology, and Nico has done likewise – I trust his judgment.
  • Differences between bioscience and physical science. This will be one of the first full ontologies developed from a physical science point of view. There will certainly be differences in approach – chemists tend to have a more Platonic view of the world and their pragmatism leads to a fairly concrete and often object-oriented approach – nouns rather than verbs. Moreover much physical science is heavily based on consistent mathematical models and there is an underlying idea of a consistent set of axioms. By contrast bioscience is more phenomenological and categorical (I hope that’s right) where we try to impose some structure on a world that may have little other than our own.
  • Words. Words help to bring us together but also separate us if their basis varies between communities. What is a “measurement”? Is it a process, the result of a process, or a mathematical “point”? If we know how something is carried out in the lab that may affect our understanding of a word – is “sequence” a verb or noun? In general I find nouns much easier than verbs and CML concentrates on nouns (resulting in an object-oriented program system) and leaves relationships to other techniques such as RDF and ontologies.

So we have to look for consensus but also recognise where there are fundamental irreconcilable differences. It will be hard, but it should be rewarding.


CML – "can your system encode these semantics"

Rich Apodaca (Bluer Obelisk) frequently asks “Can your chemoinformatics tool do this?” and has asked how CML represents various systems:

Submitted on 2009/04/05 at 12:04am

Peter, I would gladly drop FlexMol and enthusiastically support any robust system that enabled me to faithfully represent, store, and transmit representations of molecules containing axial chirality (biaryls, allenes), planar chirality (Fu’s chiral DMAPs), organometallics (metallocenes, piano stool complexes, pi-allyls), square planar stereoisomerism (cis/transplatin) aromatic radical cation/anions and other multi-centered bonding species, and other problem motifs without using templates, non-standard extensions, abusing the “wedge bond”, or other hacks.

I’m glad to see that you believe CML can do this. However, nothing in the documentation I’ve found leads me to believe this is the case, and I’ve seen not one example to show it.

I’m not trying to make a pest of myself (may be too late ;-)), but I really would like an example to back up your claim. I’ve provided a link to a specific example for both cyclopentadienyl anion and ferrocene using FlexMol in my previous comment.

What is the best practice representation (actual, valid XML) for ferrocene in either C4W or CML?

This is an excellent question, and it takes some time to answer. At the heart is the difference between linguistico-graphical representation of chemistry and semantic chemistry. Chemists often communicate using words and diagrams and this is very powerful especially if they have a common education. For example Rich and I both know what a “piano stool” representation is, but if we didn’t we wouldn’t be able to communicate.

The problem comes when we try to communicate this to machines. They don’t understand words and they don’t understand pictures. So we have to try to create a system which doesn’t rely on a shared understanding. What does “aromatic” mean? I can guarantee that if you gave 20 chemists a wide range of moderately common molecules including heterocycles and organometallics and asked them to say which was, and which was not aromatic, there would be considerably less than 100% agreement – could well be less than 80%. You can only get higher agreement by very carefully defining the rules. And though chemists love rules there are still many systems where the rules are very fuzzy.

It is very important to stress that

CML is not a file format, it’s a semantic language

CML endeavours to capture those semantics which are universally agreed and to provide means for representing those which are not universally agreed. So I showed how CML could represent different approaches to ferrocene. All are valid – in the sense that they have a convention which they apply correctly, and all have different conventions.

CML can support most common conventions but does not choose between them

Suppose we take the words and the paper away from chemists – how do they then communicate? Because this is what we have to do for machine representations. I often give students a test – can you communicate this molecule to another student simply by talking (as it were over the phone)? Or assume you were unsighted and could not touch the other student – how would you do it then?

Chemistry is described in several worlds:

  • words. very important, but only useful to a machine if we have ontologies.
  • connection tables (topology). Very useful but breaks down when molecules are dynamic or geometry matters. This is the strength and weakness of InChI
  • Local geometry. In many molecules geometrical features of part of the molecule play a major role in the chemistry. Thus the orientation round single and double bonds and round atoms can be critical (topo and stereochemistry). Chemists normally do this by sketching that part of the molecule using 3D clues. It’s effective for sighted humans, useless for machines. So some, but relatively few of these, have been encoded in pictureless chemical conventions. Those that have include cis/trans, E/Z, atom chirality but little more. Most of the molecules that Rich showed (and I’ll return to them later) rely on a picture – not just 2D coordinates but also visual clues.
  • Global geometry. These are aspects such as clusters, and infinite solids. They require 3D coordinates and associated geometrical concepts.

CML can support the geometric and topological aspects of most of the common pictureless conventions for representing chemical structure

    If we need a picture to describe a molecule, then CML can support 2D coordinates, bonds between atoms including multicenter bonds, bonds between atoms and bonds and it can annotate them. In principle these annotations can be displayed as glyphs representing many of the visual cues in Rich’s examples. But there are some aspects (e.g. hidden line removal) that CML does not do. Nor can a machine understand these. CML deals strictly with semantics that a machine can, in principle, understand. With ChemSS (chemical stylesheets) which Joe Townsend is working on we shall be able to depict them visually in many different ways. But the semantics are unchanged.
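To make the idea of atoms, 2D coordinates and bonds as explicit data a little more concrete, here is a minimal sketch in Python. It is my illustration rather than a best-practice CML document: the helper `build_molecule` is hypothetical, the coordinates are invented for the picture, and the element and attribute names (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `elementType`, `x2`, `y2`, `atomRefs2`, `order`) are written as I understand the CML schema and should be checked against it before being relied on.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # namespace URI used by CML documents


def build_molecule(mol_id, atoms, bonds):
    """Build a CML-style <molecule> element (hypothetical helper).

    atoms: list of (atom_id, element_symbol, x2, y2) tuples
    bonds: list of (atom_id_1, atom_id_2, order) tuples
    """
    def tag(name):
        # ElementTree spells namespaced tags as {uri}localname
        return f"{{{CML_NS}}}{name}"

    molecule = ET.Element(tag("molecule"), {"id": mol_id})
    atom_array = ET.SubElement(molecule, tag("atomArray"))
    for atom_id, element, x2, y2 in atoms:
        ET.SubElement(atom_array, tag("atom"), {
            "id": atom_id,
            "elementType": element,
            "x2": str(x2),  # 2D display coordinates, not 3D geometry
            "y2": str(y2),
        })
    bond_array = ET.SubElement(molecule, tag("bondArray"))
    for first, second, order in bonds:
        ET.SubElement(bond_array, tag("bond"), {
            "atomRefs2": f"{first} {second}",  # the two atoms the bond joins
            "order": order,
        })
    return molecule


# Serialize with CML as the default namespace instead of an ns0: prefix.
ET.register_namespace("", CML_NS)

# Formaldehyde with explicit hydrogens; the coordinates are illustrative only.
formaldehyde = build_molecule(
    "formaldehyde",
    atoms=[("a1", "C", 0.0, 0.0), ("a2", "O", 0.0, 1.2),
           ("a3", "H", -0.9, -0.5), ("a4", "H", 0.9, -0.5)],
    bonds=[("a1", "a2", "2"), ("a1", "a3", "1"), ("a1", "a4", "1")],
)
cml_text = ET.tostring(formaldehyde, encoding="unicode")
print(cml_text)
```

Nothing here says how the molecule should be drawn: the coordinates and bond orders are data, and any glyphs or styling would be layered on separately.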

    CML adopted the separation of semantic and style that is key to modern XML languages

    Rich mentions “tool” frequently and I assume this to mean a program. The problem with almost all chemical representation systems is that they rely on software to provide the implicit semantics. A good example is Daylight’s SMILES system. They say, quite accurately – the definition of an aromatic bond is what our program decides. The test of whether semantics are explicit is whether you can write out a file and import it into another program without losing information. That’s hard because of the many conventions but the vision is that a chemical information system should consist of semantically aware programs and exposed data with no implicit semantics. That’s hard, because the functionality of the program may have to be complex to convert between different representations (even the explicit ones). But CML can, at least, encode the semantics and there are an increasing number of programs that are CML-aware.
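The write-out-and-read-back test can be sketched in a few lines of Python. This is a toy under stated assumptions, not a CML toolkit: the dictionary layout and the functions `to_xml`, `from_xml` and `survives_round_trip` are hypothetical names of mine, and the XML is deliberately un-namespaced to keep the example short. The point is only that a writer which leaves the bond order implicit (here, by omitting it) fails the test, because the reader cannot recover what the writing program "knew".

```python
import xml.etree.ElementTree as ET


def to_xml(molecule, write_order=True):
    """Serialize {'atoms': {id: element}, 'bonds': [(id1, id2, order)]} to XML.

    With write_order=False the bond order is left implicit, i.e. dropped.
    """
    root = ET.Element("molecule")
    atom_array = ET.SubElement(root, "atomArray")
    for atom_id, element in molecule["atoms"].items():
        ET.SubElement(atom_array, "atom", {"id": atom_id, "elementType": element})
    bond_array = ET.SubElement(root, "bondArray")
    for first, second, order in molecule["bonds"]:
        attributes = {"atomRefs2": f"{first} {second}"}
        if write_order:
            attributes["order"] = order
        ET.SubElement(bond_array, "bond", attributes)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Read the XML back using only what is explicitly present in it."""
    root = ET.fromstring(text)
    atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}
    bonds = []
    for bond in root.iter("bond"):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, bond.get("order")))  # None if absent
    return {"atoms": atoms, "bonds": bonds}


def survives_round_trip(molecule, write_order=True):
    """True if writing and re-reading loses no information."""
    return from_xml(to_xml(molecule, write_order)) == molecule


# Heavy atoms of formaldehyde: one C=O double bond.
carbonyl = {"atoms": {"a1": "C", "a2": "O"}, "bonds": [("a1", "a2", "2")]}

print(survives_round_trip(carbonyl))                     # explicit bond order
print(survives_round_trip(carbonyl, write_order=False))  # implicit bond order
```

The first call reports that nothing was lost; the second reports a loss, since the double bond existed only in the writer's head.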

    So the question really should be:

    Does your system provide all semantics explicitly, and are there programs that can process them?

    CML and CML-aware programs can do a lot towards this goal. In a later post I’ll show approaches towards restricted rotation and metallic stereochemistry using CML. It’s all in the language and there are systems to process some, but not all semantics.


    Ontological Warfare

    I am really excited about the state of current chemical ontological development – there are now 3-4 groups including ours and I’ll expand later. But first I want to set the scene – looking at my ontological roots.

    The first conscious involvement with ontology was with crystallography – the CIF dictionary project. This was – and is – a splendid example of a community building its ontological infrastructure. It gave me – naively – the view that it should be possible to define everything in the world and link it together. A colleague at Glaxo, Lesley West, came to hear Henry Rzepa speak and – I can’t quite remember how – we found we had a lot in common about hypermedia and annotation. Lesley had been in regulatory affairs in Glaxo and she introduced me to the world of disease dictionaries – such as ICD-10 (more later) a compendium of 10,000 terms of “morbidity and mortality”. I know it by heart. Together we had the vision – in ca. 1994 – of a world of namespaced (we didn’t know the term) hyperlinked (we knew that) ontologies (we didn’t know that). We came up with the concept of “The Virtual Hyperglossary” (VHG) and had all sorts of experiences including setting up a company under its name.

    For a brief while the VHG caught a few people’s attention and I was invited to meetings of InterCOCTA – a UNESCO project to try to harmonize the world through defining terminology. Simply – if we don’t understand each other we are likely to come to blows or worse. The doyen of the group was Fred Riggs – a distinguished professor of political science from Hawaii. He showed me the relationship between terms and concepts (I was very stupid and he was very patient). For example there are many meanings of the word “bureaucracy” which varies from pejorative to neutral to positive in different cultures. Fred also created a terminology of conflict – “Turmoil among Nations” where he analysed or foresaw the asymmetric struggles of today. And that is one of the many reasons why an understanding of language is essential for peace in the world. (A practical example – one of the delegates was from Iceland and she was compiling a glossary of fish. Since fish are (or were) important to Iceland’s economy, the precise meaning of a term is important. A sardine in the UK is different from a sardine in the Mediterranean.)

    Over these years I realised that ontologies touch our deepest being. Everyone has a world view built up over the years. When we try to communicate in words we always fail to agree on the entirety. Everyone is different and has a different view. If uncontrolled this leads to conflict and I have seen these and been part of them. So whenever anyone talks about creating ontologies I warn them that they will – inevitably – come into conflict and must be prepared to work out how to resolve this – it might be a compromise – it might be an agreement to differ.

    This is reflected in the Upper ontology page which specifically warns

    Upper ontologies are also commercially valuable, creating competition to define them. Peter Murray-Rust has claimed that this leads to “semantic and ontological warfare due to competing standards” [1], and accordingly any standard foundation ontology is likely to be contested among commercial or political parties, each with their own idea of ‘what exists’.

    No one upper ontology has yet gained widespread acceptance as a de facto standard. Different organizations are attempting to define standards for specific domains. The ‘Process Specification Language’ (PSL) created by the National Institute for Standards and Technology (NIST) is one example.

    There is debate over whether the concept of using a single, shared upper ontology is even feasible or practical at all. There is further debate over whether the debates are valid – often leading to outright censorship and boosterism of particular approaches in supposedly neutral sources including this one. Some of these arguments are outlined below, with no attempt to be comprehensive. Please do not censor them because you promote some ontology. [WP italics]

    My quote was actually made when XML was at the peak of the Gartner curve and I could see companies wishing to stake out territory in an ontological gold rush. Ontologies will cause enough conflict without being protected walled gardens. The world needs ontologies for its survival – we need to understand ourselves and we need to understand the laws of nature.


    library of the future – videos

    This is the last post on LOTF09 – just to say that JISC have done an excellent job of capturing the video:
    http://www.jisc.ac.uk/whatwedo/campaigns/librariesofthefuture/debate.aspx
    This is particularly valuable for me as I do not follow a set pattern of presentation – no linear sequence of Powerpoint – no predetermined start and end. I have 10,000 slides and choose from second-to-second which I show and what I say to them. The slides, therefore, are an inadequate record and my ideas are much better captured in the video and audio. So, on the fairly infrequent times they get captured and published I am really grateful.


    Crystal26

    I’m in Australia because I’ve been invited to talk to Crystal26 – the 26th Biennial Conference of the Society of Crystallographers in Australia and New Zealand. Crystallographers from ANZ have made enormous contributions and when I studied in Oxford the lab was full of them – Guy and Eleanor Dodson, Ted Maslen, Ken Watson, Clive Nockolds, Ted Baker, and many others. I owe a personal debt to their influence.

    The best known crystallographer of all – W.H.Bragg – was from Australia; WP gives:

    In 1885 Bragg was appointed “Elder Professor of Pure and Applied Mathematics, who shall also give instruction in Physics”[2] at the University of Adelaide in Australia and started work there early in 1886. At that time he had limited knowledge of physics, most of which was in the form of applied mathematics he had learnt at Trinity. There were only about a hundred students doing full courses at Adelaide of whom scarcely more than a handful belonged to the science school. As a result Bragg was able to develop his knowledge of physics in his early years, spending a year at Cavendish Laboratory taking the equivalent of an undergraduate course. It was not until he was past 40 that he began to do research work of note. At the meeting of the Australasian Association for the Advancement of Science, held at Dunedin in 1904, Bragg, as president of his section, delivered an address on “Some Recent Advances in the Theory of the Ionization of Gases”

    This success late in life and isolated from the mainstream should be an encouragement to us all.

    Crystallography has a family feel unlike many other disciplines – it’s possible to go into a lab and start talking with a common knowledge of the physics, the equipment, the history, the people. The oral tradition is strong and the sense of commitment to the community. I am proud and humbled to be part of this.

    I have chosen as my topic “The Crystallographic Semantic Web”. This is a new title, but not a new topic – 10 years ago I talked on “The globalization of crystallographic knowledge”. But the technology has come a long way since then and we are now really approaching the time when knowledge will be spread across the world without technical and political barriers.

    As always I do not know what I shall say in detail until I get there and talk with people. One thing I shall certainly stress is the vision of the creators of CIF – the Crystallographic Information File – which even when conceived nearly 20 years ago was essentially an ontology – though the word was not used then. We now have the technology and in the last few weeks Nico Adams and I have been able to turn the CIF dictionaries into an OWL-based ontology. It’s an excellent microcosm for displaying the ideas.

    I will probably blog my talk before I give it. I don’t know if there is a conference tag but I shall assume CRYSTAL26 here. I wonder how strong the community is on Twitter?


    CKAN – an idea whose time has now come

    CKAN – The Comprehensive Knowledge Archive Network is the brainchild of Rufus Pollock (a young and incredibly energetic economist) at Cambridge. It’s part of Rufus’ vision of a world of distributed semantic Open knowledge. I think CKAN is an idea whose time has now come. It is impossible to make accurate predictions as to exactly which new web resource will blossom, but here’s the case for CKAN.

    We’ve seen how successful Wikipedia is. But it wasn’t the first encyclopedia on the web (I’d flag the WWW Virtual Library as that) and started fairly slowly (as far as I recall). And the quality and meta-quality (the tools and protocols to create quality) were fairly primitive compared with what they are today.

    Wikipedia is wonderful for many things, and I am really pleased that they created the infobox. This is an approach that encourages an “annealing towards consistency of representation”. The infobox is a collection of key data or metadata about the subject of a page. These are not developed top-down but tend to arise from a subcommunity which wants to systematise their field – everything from steam engines to battles to chemical compounds. The volunteers in the subject decide on a representation, and some metadata and the community fills this in. What’s impressive is that even without clear direction they converge on a useful mean, rather like the synchronicity of fireflies. That seems to me, at least, to suggest that if a good, but not perfect, framework is presented to voluntary contributors they will not only add content, but also work to improve the framework.

    So after that rather lengthy introduction (I’ve just landed in Melbourne and am readjusting) what’s CKAN?

    CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own.

    There are currently 368 active packages on the system.

    Thanks to its underlying versioned domain model CKAN has a wiki-like interface that lets anyone add and correct packages. Examples of existing entries include a set of Shakespeare’s works, a global population density database, the voting records of UK MPs, or 30 years of US patents.

    CKAN is not a data repository – it’s a meta-repository in that it points to the resources. But please don’t think of it as just another metadata repository. It’s Open (and that’s the intention). It’s multidisciplinary. And it’s now got a growing network of volunteers.

    The vision is that anyone with a piece of knowledge (I’ll concentrate on data sets here) that they think might be useful to the world should deposit it in CKAN under an OKFN Open Data protocol. In some cases – like Shakespeare – it can be used on its own. But increasingly data is valuable when combined – mashed up – with other data. For that we need common pieces of information – ideally identifiers, but often simple text fields or even running prose.
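A minimal sketch of why shared identifiers matter for a mash-up, in Python. The two "data sets" below are hypothetical stand-ins that I have typed in by hand (they are not CKAN packages), keyed by InChI strings; `mash_up` is a name of mine, and the numbers are ordinary textbook values for molar mass (g/mol) and boiling point (°C at atmospheric pressure).

```python
# Two independently compiled tables that happen to share an identifier scheme.
molar_mass = {
    "InChI=1S/H2O/h1H2": 18.015,   # water
    "InChI=1S/CH4/h1H4": 16.043,   # methane
}
boiling_point_c = {
    "InChI=1S/H2O/h1H2": 100.0,                  # water
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": 78.4,   # ethanol
}


def mash_up(left, right):
    """Combine two keyed tables, keeping only identifiers present in both."""
    shared = left.keys() & right.keys()
    return {key: (left[key], right[key]) for key in sorted(shared)}


combined = mash_up(molar_mass, boiling_point_c)
print(combined)
```

Only water appears in the result, because it is the only compound both tables identify in the same way. With free-text names instead of identifiers, the same join would need fuzzy matching and would be correspondingly less trustworthy.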

    CKAN is not yet large and not yet systematised. But that doesn’t mean it’s not valuable – as I said earlier WP started out small. The important thing is that CKAN is a community project for communal contribution and exploitation. It’s got an emphasis on liberation (or perhaps enlightenment) – it need not be comprehensive, but it should be useful.

    You might reasonably suggest that Wikipedia is already systematising data. And to the extent that infoboxes represent data that’s true. But WP is avowedly and rightly an encyclopedia – you could not devote a page to a data set (unless it was as important as the Keeling curve). So where are you going to get those crucial bits of information that you need for research, teaching, learning?

    CKAN can be the answer. We are in the usual cyclic argument at present – “it’s not got enough in it, so I won’t use it” – as opposed to “It could do with some more knowledge – so here is some”. It’s also not organised in a fully Linked Data way – but then what data is? CKAN is a great place to start experimenting with Linking Data.

    The exciting thing is that not only public data storage but even public triple storage is starting to become massively freely available. As soon as I knew that Talis was offering their platform to host Open Data (PDDL, to which Talis made a critical and significant contribution) I started to think how we could get CKAN into it. Not everything will fit. But we can get enough overlap of concepts that we can start to unite the entries using SPARQL.

    And move towards a genuine Open knowledge resource.


    Blogging encouragement to an eScientist

    I recently talked at some length with a young eScientist (whose blog I might highlight later). He’s a computer scientist, professionally interested in multidisciplinary work (music, environment, agents, etc.) and although very clued up on modern informatics didn’t have a blog. I encouraged him to start one – which he has done – and suggested some themes that have been successful in the blogosphere.

    Last year Nature ran a competition to encourage senior scientists to blog (Russ B Altman wins Science Blogging Challenge… and a trip to SciFoo) and did me the honour of asking me to be on the judging panel. Like many new ideas (and Nature has had many of those) it has yet to get wide momentum but the winners were worthy of their awards (a prestigious and luxurious trip to Sci-Foo) – the Oscar of blogging. I must, of course, highlight the great contribution to blogging made by Timo Hannay of Nature and his colleagues.

    I have no doubt that blogging has made a great deal of positive difference to what I have been able to do over the last 2-3 years. I make many new contacts and have been able to create considerable advocacy for Open Data – whose time seems now to have arrived. I’ve enjoyed reading many other blogs – especially after going to SciFoo (it’s great) and getting on the foo-campers’ digest. Science is a mixture of altruism and competition and blogging allows both – the ability to share with others in the field while advancing one’s own position (visibility, collaborations, strength of message).

    So my advice – over an hour on Skype – was roughly:

    • Try to have a consistent theme (yes I know this blog oscillates between OpenFoo and Chemistry, but I trust my readers can distinguish).
    • Think about how to communicate to people you don’t know are there. You will get the most surprising replies. But don’t be disappointed if no-one replies – readership/writership is often > 100.
    • Unless you have an unwavering, uncomfortable message, be collaborative and welcoming.

    Themes which are common and appreciated:

    • “My own work”. This depends very much on what is already out there and the community. Open Scientists like Jean-Claude Bradley and Cameron Neylon are changing our attitudes to how we publish science. But be sensible – some journals won’t allow papers whose data has already appeared. Maybe they should be encouraged to think otherwise.
    • “Meetings I go to”. Blogging and tweeting are transforming how we report meetings. Someone who acts as the blogscribe of a meeting is often welcomed by the community – it’s a good way of getting known. The tweeting at #LOTF09 was a revelation to me – the hundreds of 140char messages act as a dynamic, exciting record of the meeting (“Does PMR know his mike is still broadcasting during his tea-time conversations?”). I see from Tweetdeck that the eScientist has already taken my advice and tweeted his meeting – he was the only one, so maybe a report to the community will be valued.
    • “what’s happening in my field”. Some bloggers act as rapporteurs for new and exciting work (or awful work) in their discipline. There’s a strong tradition in chemistry blogging with comments on synthetic chemistry, new drugs, new software, etc.
    • recipes. These are often highly valued – preparations of materials, tuning apparatus, software bugs, etc. Again chemistry has a good collection of these.
    • advocacy. If you have a passionate cause – mine is Openness – then blogging is a perfect medium. It’s the quickest way of attracting like-minded people. You will make enemies as well, so make sure that your facts and arguments are correct and coherent.
    • Aggregation. Some blogs aggregate from other places. That can be extremely useful if the sources are from outside the domain, but make sure you are not simply recycling what is already widely known.
    • Humour, fun, and media. If well done this is very much appreciated. Cartoons, photographs of interesting things, audio, etc.

    On the technical side

    • Blogging can take longer than you think.
    • Consider using a blogging service – including Nature’s.
    • Make sure your blog can be easily added to people’s feeds.
    • Be prepared for linkspam – it’s horrible. A service provider may help here.
    • Most authoring tools are primitive – so don’t be too adventurous at the start. (We have dropped to a completely basic skin at UCC because everything else was too hairy in WordPress).
    • Allow comments, but protect with Captcha. Even then you get linkspam.

    And don’t lose faith if you get no immediate reaction. It takes time to get known and many blogs get very little immediate response anyway. Comment on other blogs – that gives them a pingback to your blog. Put them in your blogroll – gradually your readership will grow. Leave comments on other blogs, with pointers back to what you have written – but don’t make it look like marketing. Ultimately it is the quality and interest of what you blog that will create your readership.

    I have omitted a lot of things – please let me know what.


    Wikipedia has won – how can we convince you?

    The UK’s Sunday Observer newspaper yesterday had an article (Face facts: where Britannica ruled, Wikipedia has conquered) where John Naughton writes:

    Unwillingness to entertain the notion that Wikipedia might fly is a symptom of what the legal scholar James Boyle calls “cultural agoraphobia” – our prevailing fear of openness. Like all phobias it’s irrational, so is immune to evidence. I’m tired of listening to brain-dead dinner-party complaints about how “inaccurate” Wikipedia is. I’m bored to death by endless accounts of slurs or libels suffered by a few famous individuals at the hands of Wikipedia vandals. And if anyone ever claims again that all the entries in Wikipedia are written by clueless amateurs, I will hit them over the head with a list of experts who curate material in their specialisms. And remind them of Professor Peter Murray-Rust’s comment to a conference in Oxford: “The bit of Wikipedia that I wrote is correct.”

    And I am proud to stand by it. Anyone can write anything anywhere in WP but it’s not uncommon for one person to start a page and write enough to create critical mass, and others will then repeatedly tweak it – formatting, tidying, categorising, references, etc. So, for example, I wrote a page on Molecular Graphics over a few days. I wanted to capture some of the growth of the subject – which as a founder member of the Molecular Graphics Society I loved. Other Wpedians have tweaked bits since. I was delighted to see that it’s been awarded a “B” grade by the chemists – roughly in the top 10 percentile.

    In bioscience, physical science and mathematics mature pages are usually excellent. If I want to know how to integrate a differential equation, know the structure of a protein, or find the melting point of a common compound I’ll go straight to Wikipedia. If the page is very new I’ll be cautious, but if it’s been around for a year or more, if it’s got > 200 versions, if it’s got an infobox that points at online sources then it’s almost certainly “nearly correct”.
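
    My rule of thumb above could be written down roughly as follows. This is only a sketch of my own heuristic: the function name and its three inputs are hypothetical, and how you would obtain a page’s age, version count or infobox status is left open.

```python
def looks_nearly_correct(age_days, version_count, infobox_points_at_online_sources):
    """Rough maturity check for a page, following the rule of thumb in the text.

    age_days: how long the page has existed, in days
    version_count: number of versions in the page history
    infobox_points_at_online_sources: True if the infobox links to online sources
    """
    around_for_a_year = age_days >= 365
    heavily_revised = version_count > 200
    return around_for_a_year and heavily_revised and infobox_points_at_online_sources
```

    A page that fails the check is not necessarily wrong; it just deserves the extra caution I would give any very new page.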

    Some diehard detractors point to vandalism as a major drawback of WP – but any practiced reader will easily spot it. But could we improve the believability of WP by stamping it as “fit for purpose” at any given stage? WP is carefully versioned, so if it’s vandalised then it should be possible to stamp versions as “non-vandalised” – it might require some MD5 magic or diffing to be absolutely sure but I’d be surprised if this wasn’t straightforward.

    In which case if we can create high-quality pages (B and greater) and stamp them, then even the naysayers must come round to agreeing that WP is fit for study.
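
    The “MD5 magic” I have in mind might look something like the sketch below. It assumes you already hold the text of a reviewed version; the function names and sample sentences are hypothetical, and nothing here talks to Wikipedia itself.

```python
import hashlib


def stamp(version_text):
    """Return the MD5 hex digest of one exact version of a page's text."""
    return hashlib.md5(version_text.encode("utf-8")).hexdigest()


def is_stamped_version(version_text, approved_stamps):
    """True only if this text is byte-for-byte one of the approved versions."""
    return stamp(version_text) in approved_stamps


# A reviewer approves one version and records its stamp.
reviewed = "Placeholder text of a page version that a reviewer has checked."
approved = {stamp(reviewed)}

# Any later change, vandalism or otherwise, produces a different stamp.
altered = "Placeholder text of a page version that a vandal has changed."
```

    Checking a stamp only tells you whether the text is unchanged since review; finding out what changed is where the diffing comes in. MD5 is not safe against someone deliberately engineering a collision, so hashlib.sha256 would be the sturdier choice with the same pattern.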

    I should, of course, make it clear that there is nothing special about my contributions. The chemistry pages are created by a really dedicated group of volunteers, which started slowly but over the last 2 years or so has really taken off. I am absolutely clear that WP can soon become the primary reference handbook for undergraduates and for many of the rest of us as well.
