petermr's blog

A Scientist and the Web


Archive for the ‘XML’ Category

funding models for software, OSCAR meets OMII

Saturday, May 16th, 2009

In a previous post I introduced our chemical natural language tools OSCAR and OPSIN. They are widely used, but in academia there is a general problem – there isn’t a simple way to finance the continued development and maintenance of software. Some disciplines (bioscience, big science) recognize the value of funding software but chemistry doesn’t. I can count the following other approaches (there may be combinations):

  • Institutional funding. That’s the model that ICE: The Integrated Content Environment uses. The major reason is that the University has a major need for the tool and it’s cost-effective to do this as it allows important new features to be added.
  • Consortium funding. Often a natural progression from the latter. Thus all the major repository software (DSPACE, ePrints, Fedora) and content/courseware (Moodle, Sakai) have a large formal member base of institutions with subventions. These consortia may also be able to raise grants.
  • Marginal costs. Some individuals or groups are sufficiently committed that they devote a significant amount of their marginal time to creating software. An excellent example of this is George Sheldrick’s SHELX where he single-handedly developed the major community tool for crystallographic analysis. I remember the first distributions – in ca 1974 – when it was sent as a compressed deck of FORTRAN cards (think about that). For aficionados there was a single variable A(32768) in which different locations had defined meanings only in George’s head. Add EQUIVALENCE, blank COMMON and any alteration to the code except by George led to immediate disaster. A good strategy to avoid forks. My own JUMBO largely falls into this category (but with some OS contribs).
  • Commercial release. Many groups have developed methods for generating a commercial income stream. Many of the computational chemistry codes (e.g. Gaussian) go down this route – an academic group either licenses the software to a commercial company, or sets up a company itself, or recovers costs from users. The model varies. In some cases charges are only made to non-academics, and in some cases there is an active academic developer community who contribute to the main branch, such as for CASTEP.
  • Open Source and Crowdsourcing. This is very common in ICT areas (e.g. Linux) but does not come naturally to chemistry. We have created the BlueObelisk as a loose umbrella organisation for Open Data, Open Standards and Open Source in chemistry. I believe it’s now having an important impact on chemical informatics – it encourages innovation and public control of quality. Most of the components are created on marginal costs. It’s why we have taken the view that – at the start – all our software is Open. I’ll deal with the pros and cons later but note that not all OS projects are suited for crowdsourcing on day one – a reliable infrastructure needs to be created.
  • 800-pound gorilla. When a large player comes into an industry sector they can change the business models. We are delighted to be working with Microsoft Research – gorillas can be friendly – who see the whole chemical informatics arena as being based on outdated technology and stovepipe practices. We’ve been working together on Chem4Word which will transform the role of the semantic document in chemistry. After a successful showing at BioIT we are discussing with Lee Dirks, Alex Wade and Tony Hey the future of C4W.
  • Public targeted productisation. In this there is specific public funding to take an academic piece of software to a properly engineered system. A special organisation, OMII, has been set up in the UK to do this…

So what and why and who and where are OMII? :

OMII-UK is an open-source organisation that empowers the UK research community by providing software for use in all disciplines of research. Our mission is to cultivate and sustain community software important to research. All of OMII-UK’s software is free, open source and fully supported.

OMII was set up to exploit and support the fruits of the UK eScience program. It concentrated on middleware, especially griddy stuff, and this is of little use to chemistry which needs Open chemistryware first. However last year I bumped into Dave DeRoure and Carole Goble and they told me of an initiative – ENGAGE – sponsored by JISC – whose role is to help eResearchers directly:

The widespread adoption of e-Research technologies will revolutionise the way that research is conducted. The ENGAGE project plans to accelerate this revolution by meeting with researchers and developing software to fulfil their needs. If you would like to benefit from the project, please contact ENGAGE or visit their website.

ENGAGE combines the expertise of OMII-UK and the NGS – the UK’s foremost providers of e-Research software and e-Infrastructure. The first phase, which began in September, is currently identifying and interviewing researchers that could benefit from e-Research but are relatively new to the field. “The response from researchers has been very positive” says Chris Brown, project leader of the interview phase, “we are learning a lot about their perceptions of e-Research and the problems they have faced”. Eleven groups, with research interests that include Oceanography, Biology and Chemistry, have already been interviewed.

The results of the interviews will be reviewed during ENGAGE’s second phase. This phase will identify and publicise the ‘big issues’ that are hindering e-Research adoption, and the ‘big wins’ that could help it. Solutions to some of the big issues will be developed and made freely available so that the entire research community will benefit. The solutions may involve the development of new software, which will make use of OMII-UK’s expertise, or may simply require the provision of more information and training. Any software that is developed will be deployed and evaluated by the community on the NGS. “It’s very early in the interview phase, but we’re already learning that researchers want to be better informed of new developments and are keen for more training and support.” says Chris Brown.

ENGAGE is a JISC-funded project that will collaborate with two other JISC projects – e-IUS and e-Uptake – to further e-Research community engagement within the UK. “To improve the uptake of e-Research, we need to make sure that researchers understand what e-Research is and how it can benefit them” says Neil Chue Hong, OMII-UK’s director, “We need to hear from as many researchers and as many fields of research as possible, and to do this, we need researchers to contact ENGAGE.”

Dave and Carole indicated that OSCAR could be a candidate for an ENGAGE project and so we’ve been working with OMII. We had our first f2f meeting on Thursday where Neil, and two colleagues, Steve and Steve came up from Southampton (that’s where OMII is centered although they have projects and colleagues elsewhere). We had a very useful session where OMII have taken the ownership of the process of refactoring OSCAR and also evangelising it. They’ve gone into OSCAR’s architecture in depth and commented favourably on it. They are picking PeterC’s brains so that they are able to navigate through OSCAR. The sorts of things that they will address are:

  • Singletons and startup resources
  • configuration (different options at startup, vocabularies, etc.)
  • documentation, examples and tutorials
  • regression testing
  • modularisation (e.g. OPSIN and pre- and post-processing)

And then there is the evangelism. Part of OMII-ENGAGE’s remit is to evangelise, through brochures and meetings. So we are tentatively planning an Open OSCAR-ENGAGE meeting in Cambridge in June. Anyone interested at this early stage should mail me and I’ll pass it onto the OMII folks.

… and now OPSIN…

Chem4Word – why semantics are necessary

Monday, April 6th, 2009

I was asked to explain how Chem4Word and CML could encode ferrocene. I’ll start by using Wikipedia to give a clear and accurate picture. Sorry for the cut-and-paste mess.

WP: Ferrocene is the organometallic compound with the formula Fe(C5H5)2. It is the prototypical metallocene, a type of organometallic chemical compound consisting of two cyclopentadienyl rings bound on opposite sides of a central metal atom.

Other names dicyclopentadienyl iron
CAS number 102-54-5
PubChem 11985121
ChEBI 30672
IUPAC name

Very clear and tidy. By contrast the entries in Pubchem are a mess. That’s NOT Pubchem’s fault – it’s the non-semantic stuff that is sent by depositors. Again I shan’t bash the depositors too hard as they have voluntarily deposited their material – it’s the awful non-semantic authoring tools they use and the absence of agreed conventions.

Chem4Word aims to raise the standard. You’ll note from the entries below that the formulae for some of these structures are grotesque (10 negative charges). C4W will give authors a clear indication of the molecular formulae and charges and encourage semantic validation.

Anyway here goes. These are all the different compound IDs associated with ferrocene. I assume that all these compounds are meant to be ferrocene but their formulae are garbled by the tools – note the absurd charges. CML prevents such garbling.

Ferrotsen; Catane; FERROCENE …
Compound ID: 7611
Source: LeadScope (LS-357)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 186.031400 g/mol | MF: C10H10Fe

FERROCENE; Bis(.eta.-cyclopentadienyl) iron
Compound ID: 11985121
Source: NIST Chemistry WebBook (3993653726)
IUPAC: cyclopenta-1,3-diene; cyclopentane; iron
MW: 186.031400 g/mol | MF: C10H10Fe-6

FERROCENE; Di(cyclopentadienyl)iron; Bis(cyclopentadienyl)iron …
Compound ID: 10219726
Source: Sigma-Aldrich (F408_ALDRICH)
IUPAC: cyclopentane; iron
MW: 186.031400 g/mol | MF: C10H10Fe

Compound ID: 504306
Source: NIST Chemistry WebBook (1113374621)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 186.031400 g/mol | MF: C10H10Fe

Ferrotsen; FERROCENE; Dicyclopentadienyl iron …
Compound ID: 24196050
Source: DTP/NCI (209798)
IUPAC: cyclopenta-1,3-diene; iron
MW: 177.967880 g/mol | MF: C10H2Fe-10
Tested in BioAssays: All: 3, Active: 0; BioActivity Analysis

Ferrotsen; FERROCENE; Dicyclopentadienyl iron …
Compound ID: 5150118
Source: DTP/NCI (44012)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 177.967880 g/mol | MF: C10H2Fe-8
Tested in BioAssays: All: 1, Active: 0; BioActivity Analysis

“should theses be Open?”

Sunday, March 29th, 2009

Until now most theses reside in a dusty basement or on a supervisor’s shelf, but we are in transition to a world where all theses are – potentially – Openly visible to anyone. Surely this is a good idea.

In principle, of course, anyone can see my thesis. It’s badly written and the examiners rightly gave me a terrible time, but all the work in it was eventually published in peer-reviewed journals. In those days corrections meant ripping the thesis apart and Tippexing or rebinding. At the distance of some decades I’m now very happy if Oxford wishes to digitize it and put it on the web. Linguists can use it as a useful source of typos.

Do all academics feel that Open theses are a good idea?

Two recent anecdotes – paraphrased and anonymized:

Academic 1: “I wouldn’t want people to see our theses – many of them are of terrible quality”.

Academic 2: “I wouldn’t want anyone to see our theses – there are so many good ideas that we don’t want our competitors to see.”

This makes the case strongly that Open theses will improve quality and the dissemination of science.

The library of the future – Guardian of Scholarship?

Thursday, March 19th, 2009

I am still working out my message for JISC on April 2nd on “The library of the future”. I’ve had suggestions that I should re-ask this as “What are librarians for?” (Dorothea Salo) and “what can a library do?” (Chris). Thanks, and please keep the comments coming, but I am currently thinking more radically.

When I started in academia I got the impression (I don’t know why) that Libraries (capital-L = formal role in organization) had a central role in guiding scholarship. That they were part of the governance of the university (and indeed some universities have Librarian positions which have the rank of senior professor – e.g. Deans of faculties). I have held onto this idea until it has become clear that it no longer holds. Libraries (and/or Librarians) no longer play this central role. That’s very serious and seriously bad for academia as it has left a vacuum which few are trying to address and which is a contributor to the current problems.

I currently see very few – if any – Librarians who are major figures in current academia. Maybe there never was a golden age, but without such people the current trajectory of the Library is inexorably downward. I trace this decline to two major missed opportunities where, if we had had real guardians of scholarship, we would not be in the current mess – running scared of publishers and lawyers.

The first occasion was about 1972 (I’d be grateful for exact dates and publishers). I remember the first time I was asked to sign over copyright (either to the International Union of Crystallography or the Chemical Society (now RSC)). It looked fishy, but none of my colleagues spoke out. (Ok, no blogosphere, but there were still ways of communicating rapidly – telephones). The community – led by the Librarians – should (a) have identified the threats (b) mobilised the faculty. Both would have been easy. No publisher would have resisted – they were all primarily learned societies then – no PRISM. If the Universities had said (“this is a bad idea, don’t sign”) we would never have had Maxwell, never had ownership of scholarship by for-profit organizations. Simple. But no-one saw it (at least enough to have impacted a simple Lecturer).

The second occasion was the early 1990s – let’s say 1993 when Mosaic trumpeted the Web. It was obvious to anyone who thought about the future that electronic publication was coming. The publishers were scared – they could see their business disappearing. Their only weapon was the transfer of copyright. The ghastly, stultifying, Impact Factor had not been invented. People actually read papers to find out the worth of someone’s research rather than getting machines to count ultravariable citations.

At that stage the Universities should have re-invented scholarly publishing. The Libraries and Librarians should have led the charge. I’m not saying they would have succeeded, but they should have tried. It was a time of optimism on the Web – the dotcom boom was starting. The goodwill was there, the major universities had publishing houses. But they did nothing – and many contracted their University Presses.

There is still potential for revolution. But at every missed opportunity it’s harder. All too many Librarians have to spend their time negotiating with publishers, making sure that students don’t take too many photocopies, etc. If Institutional Repositories are an instrument of revolution (as they should have been) they haven’t succeeded.

So, simply, the librarian of the future must be a revolutionary. They may or may not be Librarians. If Librarians are not revolutionaries they have little future.

In tomorrow’s post I shall list about 10 people who I think are currently librarians of the future.

Closed Data at Chemical Abstracts leads to Bad Science

Tuesday, March 17th, 2009

I had decided to take a mellow tone on re-starting this blog and I was feeling full of the joys of spring when I read a paper I simply have to criticize. The issues go beyond chemistry and non-chemists can understand everything necessary. The work has been reviewed in Wired so achieved high prominence (CAS display this on their splashpage). There are so many unsatisfactory things I don’t know where to begin…

I was alerted by Rich Apodaca  who blogged…

A recent issue of Wired is running a story about a Chemical Abstracts Service (CAS) study on the distribution of scaffold frequencies in the CAS Registry database.

Cheminformatics doesn’t often make it into the popular press (or any other kind of press for that matter), so the Wired article is remarkable for that aspect alone.
From the original work (free PDF here):

It seems plausible to expect that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. If many compounds derived from a framework have already been synthesized, these derivatives can serve as a pool of potential starting materials for further syntheses. The availability of published schemes for making these derivatives, or the existence of these derivatives as commercial chemicals, would then facilitate the construction of more compounds based on the same framework. Of course, not all frameworks are equally likely to become the focus of a high degree of synthetic activity. Some frameworks are intrinsically more interesting than others due to their functional importance (e.g., as building blocks in drug design), and this interest will stimulate the synthesis of derivatives. Once this synthetic activity is initiated, it may be amplified over time by a rich-get-richer process. [PMR - rich-get-richer does not apply to pharma or publishing industries but to an unusual exponent in the power law].

With the appearance of dozens of chemical databases and services on the Web in the last couple of years, the opportunities for analyses like this (and many others) can only increase. Who knows what we’ll find.

Thanks Rich. Now the paper has just appeared in a journal published by ACS (American Chemical Society, of which Chemical Abstracts (CAS) is a division). (There is no criticism of the ACS as publisher in my post, other than that I think the paper is completely flawed).  Because ACS is a Closed publisher the paper is not normally Openly readable, but papers often get the full text exposed early on and then may become closed. I’ve managed to read it from home, so if you don’t subscribe to ACS/JOC I suggest you read it quick.

I dare not reproduce any of the graphs from the paper as I am sure they are copyright ACS so you will have to read the paper quickly before it disappears.

Now I have accepted a position on the board of the new (Open) Journal Of Chemoinformatics. I dithered, because I feel that chemoinformatics is close to pseudo-science along the lines of others reported by Ben Goldacre (Bad Science). But I thought on balance that I’d do what I could to help clean up chemoinformatics and therefore take a critical role of papers which I feel are non-novel, badly designed, irreproducible, and badly written. This paper ticks all boxes.

[If I am factually wrong on any point of Chemical Abstracts, Amer. Chemical Soc. policies etc. I'd welcome correction and I'll respond in a neutral spirit.]

So to summarize the paper:

The authors selected 24 million compounds (substances?) from the CAS database and analysed their chemical formulae. They found that the frequency of frameworks (e.g. benzene, penicillin) fitted a power law. (PLs are ubiquitous – in typesetting, web caches, size of research laboratories, etc. There is nothing unusual in finding one). The authors speculate that this distribution is due to functional importance stimulating synthetic activity.
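To make concrete what "fitted a power law" means, here is a minimal sketch in Python. The counts are synthetic and generated from an exact power law; they are not data from the paper, which cannot be obtained. The function name `fit_power_law` and the exponent 1.5 are assumptions chosen for illustration, and a least-squares fit on log-log axes is only a crude estimator.

```python
import math


def fit_power_law(frequencies):
    """Estimate (exponent, scale) for frequency ~ scale * rank**(-exponent).

    Uses an ordinary least-squares line through log(frequency) against
    log(rank). Frequencies must be positive and there must be at least two.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic "framework counts": the 200 most common frameworks, made up
# so that the k-th most common one occurs 1000 * k**-1.5 times.
synthetic_counts = [1000.0 * rank ** -1.5 for rank in range(1, 201)]
exponent, scale = fit_power_law(synthetic_counts)
```

Almost any ranked list of counts gives a roughly straight line on such a plot, which is why finding one is unremarkable by itself; the interesting question is whether anyone else can rerun the fit on the same data.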

I shall post later about why most chemoinformatics is flawed and criticize other papers. In general chemoinformatics consists of:

  1. selection of data sets
  2. annotating these data sets with chemical “descriptors”
  3. [optionally] using machine learning algorithms to analyse or predict
  4. analysis and presentation of the findings

My basic contention is that unless these steps are (a) based on non-negotiable communally accepted procedures (b) reproducible in whole – chemoinformatics is close to pseudoscience.

This paper involved steps 1, 2 and 4. (1) is by far the most serious for Open Data advocates, so I’ll come to it last and deal with the others first.
(2) There was no description of how connection tables (molecular graphs) were created. These molecules apparently included inorganic compounds and the creation of CTs for these molecules is wildly variable or often non-attempted. This immediately means that millions of data in the sample are meaningless. The authors also describe an “algorithm” for finding frameworks which is woolly and badly reported. Such algorithms are common – and many are Open as in CDK and JUMBO. The results of the study will depend on the algorithm and the textual description is completely inadequate to recode it. Example – is B2H6 a framework? I would have no idea.
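The B2H6 question can be made concrete with a small sketch. This is not the algorithm from the paper (which cannot be recoded from its description) and not the CDK or JUMBO code; it is a generic pruning approach, assumed here for illustration, that repeatedly strips atoms with at most one neighbour and keeps whatever rings and linkers remain. The atom labels and the two connection tables for diborane are hypothetical inputs showing how the answer depends on how the CT was drawn.

```python
def prune_to_framework(bonds):
    """Return the atoms left after repeatedly removing atoms with <= 1 neighbour.

    `bonds` is an iterable of (atom, atom) pairs from a connection table.
    """
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    changed = True
    while changed:
        changed = False
        terminal = [atom for atom, ns in neighbours.items() if len(ns) <= 1]
        for atom in terminal:
            # Detach the atom from any remaining neighbour, then drop it.
            for other in neighbours.pop(atom):
                neighbours[other].discard(atom)
            changed = True
    return set(neighbours)


# Diborane drawn with two bridging hydrogens bonded to both borons.
bridged = [
    ("B1", "Hb1"), ("Hb1", "B2"), ("B2", "Hb2"), ("Hb2", "B1"),
    ("B1", "H1"), ("B1", "H2"), ("B2", "H3"), ("B2", "H4"),
]

# Diborane drawn ethane-style: a B-B bond and six terminal hydrogens.
ethane_like = [
    ("B1", "B2"),
    ("B1", "H1"), ("B1", "H2"), ("B1", "H3"),
    ("B2", "H4"), ("B2", "H5"), ("B2", "H6"),
]

bridged_framework = prune_to_framework(bridged)
ethane_like_framework = prune_to_framework(ethane_like)
```

With the bridged connection table a four-membered B-H-B-H ring survives; with the ethane-style table nothing survives at all. The same substance yields a framework or no framework depending on an undocumented choice.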

(4) There are no useful results. No supplemental data is published (JOC normally requires supplemental data but this is an exception – I have no idea why not). The data have been destroyed into PDF graphs (yes – this is why PDF corrupts – if the graphs had been SVG I could have extracted the data). Moreover the authors give no justification for their conclusion that frequency of occurrence is due to synthetic activity or interesting systems. What about natural products? What about silicates?

But by far the most serious concern is (1). How were the data selected?

The data come – according to the authors – from a snapshot of the CAS registry in 2007. I believe the following to be facts, and offer to stand corrected by CAS:

  • The data in CAS is based almost completely on data published in the public domain. I agree there is considerable “sweat of brow” in collating it, but it’s “our data”.
  • CAS sells a licence to academia (Scifinder) to query their database. This does not allow re-use of the query results. Many institutions cannot afford the price.
  • There are strict conditions of use. I do not know what they are in detail but I am 100% certain that I cannot download and use a significant part of the database for research, and publish the results. Therefore I cannot – under any circumstances – attempt to replicate the work. If I attempted I would expect to receive legal threats or worse. Certainly the University would be debarred from using CAS.

The results of the paper – such as they are – depend completely on selection of the data. There are a huge number of biological molecules (DNA, proteins) in CAS and I would have expected these to bias the analysis (with 6, 5, and 6-5 rings being present in enormous numbers). The authors may say – if they reply – that it’s “obvious” that “substance” (with < 253 atoms) excluded these – but that is a consequence of  bad writing, poor methodology and the knowledge that whatever they put in the paper cannot be verified or challenged by anyone else on the planet.

There are many data sources which are unique – satellite, climate, astronomical, etc. The curators of those work very hard to provide universal access. Here, by contrast, we have a situation where the only people who can work with a dataset are the people we pay to give us driblets of the data at extremely high prices.

This post is not primarily a criticism of CAS per se (though from time to time I will publish concerns about their apparent – but loosening – stranglehold on chemical data). If they wish to collect our data and sell it back to us it’s a tenable business model, and I shall continue to fight it.

But to use a monopoly to do unrefereeable bad science is not worthy of a learned society.

Working with the NCI

Monday, February 4th, 2008

I was intending to blog about our collaboration with Dan Zaharevitz and colleagues at the National Cancer Institute in the DTP (Developmental Therapeutics Program). Dan beat me to it, in a CMLBlog comment (February 4th, 2008 at 5:02 pm) to CML – what and why. In the comment he explains why the NCI has chosen to work with us on CML.

Dan and I first made contact ca. 5+ years ago. I think he had noticed my posting or contributing to CDK (Chemistry Development Kit) and had asked about what CML could do.  We got into correspondence and as a result he supported Henry and me  in the development of JUMBO – probably JUMBO 4.6.
It is refreshing to work with the NCI. Their agenda is ultimately simple – methods of combatting cancer. And they are very clear that the way to do this is through Openness – Open Data, Open Source, Open Standards. So it is wonderful to have a sponsor who says “we will help you to develop this code” and you can make it Open – indeed this is  virtually a requirement.

NCI is well known for pioneering the release of their data in Open form. For many years the NCI database – with about 250,000 compounds and associated biological data – was the only data that could be used for free in chemistry. This database was the logical predecessor of Pubchem which now has over 18 million compounds. (An important difference is that the NCI database relates to physical samples while many entries in Pubchem do not).

Dan’s support has been invaluable. Firstly it’s supported us to do the work. Secondly it gives much moral support to continue. And thirdly it has given us important feedback. Since CML has many uses (publishing, computation, crystallography) it’s very useful to have an organisation who wants to manage data. NCI is not only interested in chemical structure but also associated data, including analytical.

So it was great to sit in Dan’s splendid basement and review how he was using CML and how we jointly felt it might develop. CML details will follow on the CMLBlog.

XML, Fortran and Mr Fox at NESC

Friday, January 18th, 2008

Toby White (“Fantastic Mr Fox”) has developed a superb system for enabling FORTRAN programs to emit XML in general and CML specifically. He and colleagues are presenting this at Edinburgh as part of the NESC programme:

Integrating Fortran and XML
28 January, 08 01:00 PM – 30 January, 08 01:00 PM
e-Science Institute, 15 South College Street, Edinburgh
Organiser: Toby White

eScience technologies offer great hope for massive improvements in the quality and quantity of science that we are able to do, particularly in the domains of data management and information delivery. Many of our escience tools rely on XML and related technologies. However, an enormous number of scientific codes are written in Fortran, and many scientists do much of their work using Fortran. Unfortunately, Fortran knows very little about XML, and vice versa; thus many useful scientific codes are de facto excluded from the world of escience. However, there is an increasing number of tools being made available to integrate Fortran into an XML-aware world, and there is a large body of knowledge and lessons learned on sensible strategies for making use of existing scientific codebases in escientific ways. This workshop will aim to instruct participants in the uses of several Fortran-XML tools, and to transfer practical experience about successes in this area. It is expected that participants will then be able to extend their existing Fortran codes to both write and read XML files, and otherwise manipulate XML data.

Target Audience

Scientific programmers from eScience projects who need to integrate existing or legacy Fortran codes into modern data handling infrastructures.

Delegates should have some experience in programming Fortran and of working in a Linux/Unix environment. No prior knowledge of XML is required.


This event is provisionally scheduled to start at 13:00 Monday 28 January 2008 and close at 13:00 on Wednesday 30 January 2008.

A programme is available at: XML timetable.pdf

Toby White, Cambridge
Andrew Walker, Cambridge
Dan Wilson, JWG-University, Frankfurt, Germany

CML Blog and Update

Sunday, January 6th, 2008

Henry [Rzepa] and I are planning a major facelift for the public face of CML this year.  CML is about 13 years old and has gone through several revisions and relocations, so that information is somewhat scattered. CML is now a large system (ca 100 elements and 200 attributes) and we now have good proof-of-concept in all the key areas:

  • molecular structures (atoms, bonds, etc.)
  • reactions
  • substances, mixtures, macroscopic amounts
  • properties of molecules, reactions and substances
  • crystallography and solid state
  • computational chemistry
  • analytical data and spectroscopy
  • procedures, actions and objects in physical science

In addition CML can support

  • interoperation with other markup languages, especially XHTML, MathML and SVG
  • dictionaries and ontologies
  • representation in RDF(S)

CML can also support a number of language features

  • data interchange
  • ontology development
  • workflow and computation
  • computational grammar (e.g. combinatorial chemistry, fuzzy structures, variability)

CML has been publicly available for many years, and over the last two years has stabilised in design and software. We do not expect major changes in the next year and so are rationalising access to the components and information. Recently we have had bad attacks from spammers on the Wikis so will be discontinuing this as an interactive feature and will use it as a read-only resource. Since the blog system here has worked well I shall use the CMLBlog as a means of developing public resources for CML.

Since the CMLBlog has been dormant for the last year I shall post messages on this blog and clone them to the CMLBlog so that those who only want to follow CML can transfer there. I hope to post about 1 topic/day which should get me through the schema by the end of the year. Each post will cover a clear topic and allow for feedback. And there will be regular requests for new topics.

BTW – if anyone knows a good forum software this might be an alternative to a blog.

Why publishers’ technology is obsolete – I

Sunday, January 6th, 2008

I have just finished writing an article for a journal – and I suspect the comments apply to all publishers. To create the Citations (or “references”) they require:

CITATIONS Citations should be double-spaced at the end of the text, with the notes numbered sequentially without superscript format. Authors are responsible for accuracy of references in all aspects. Please verify quotations and page numbers before submitting.

Superscript numerals should be placed at the end of the quotation or of the materials in which the source is mentioned. The numeral should be placed after all punctuation. SR follows the latest edition of the Chicago Manual of Style, published by the University of Chicago Press. Examples of the correct format for most often used references are the following:

Article from a journal: Paul Metz, “Thirteen Steps to Avoiding Bad Luck in a Serials Cancellation Project,” Journal of Academic Librarianship 18 (May 1992): 76-82.

[Note: when each issue is paged separately, include the issue number after the volume number: 18, no. 3 (May 1992): 76-82. Do not abbreviate months. When citing page numbers, omit the digits that remain the same in both the beginning and ending numbers, e.g., 111-13.]

PMR: It’s the author who has to do all this. In a different journal it would be a different style – maybe Harvard, or Oxford or goodness knows. Each with their own bizarre, pointless micro syntax.

As we know there is a simple effective way of identifying a citation in a journal – the Digital object identifier (Wikipedia). It’s a unique identifier, managed by each publisher and there is a resolution service. OK not all back journals are in the system, and OK it doesn’t do non journal articles but why not use it for the citations it can support? In many science disciplines almost all modern citations would have DOIs.

Not only would it speed up the process but it would save errors. Authors tend to write abbreviations (J. Acad. Lib), mangle the volumes and pages, get the fields in the wrong areas. They hate it, and I suspect so do the technical editors when they have to correct the errors. I can’t actually believe the authors save the technical editors any time – I suspect it costs time.

You may argue that the publisher still has to type out the citation from the DOI. Not at all. This is all in standard form. Completely automatic.
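As a rough illustration of "completely automatic", here is a minimal Python sketch that turns a structured metadata record into the house style quoted above. The record is typed by hand from the journal's own example; in practice it would come back from a DOI resolution service. The field names and both function names are assumptions, and the page-range rule is a simplified reading of the journal's "111-13" instruction, not the full Chicago rule.

```python
def abbreviate_page_range(first, last):
    """Drop leading digits shared by both page numbers, keeping at least two.

    Simplified: 111 and 113 give "111-13"; 76 and 82 give "76-82".
    """
    a, b = str(first), str(last)
    if len(a) != len(b):
        return f"{a}-{b}"
    shared = 0
    while shared < len(a) and a[shared] == b[shared]:
        shared += 1
    keep = min(max(len(b) - shared, 2), len(b))
    return f"{a}-{b[-keep:]}"


def format_journal_citation(rec):
    """Render one journal article in the style the journal asks for."""
    volume = str(rec["volume"])
    if rec.get("issue") is not None:
        # Only needed when each issue is paged separately.
        volume += f", no. {rec['issue']}"
    pages = abbreviate_page_range(rec["first_page"], rec["last_page"])
    return (
        f"{rec['author']}, “{rec['title']},” {rec['journal']} "
        f"{volume} ({rec['month']} {rec['year']}): {pages}."
    )


metz = {
    "author": "Paul Metz",
    "title": "Thirteen Steps to Avoiding Bad Luck in a Serials Cancellation Project",
    "journal": "Journal of Academic Librarianship",
    "volume": 18,
    "month": "May",
    "year": 1992,
    "first_page": 76,
    "last_page": 82,
}

citation = format_journal_citation(metz)
```

A different journal style is then a different formatting function over the same record, written once by the publisher instead of retyped by every author.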

Why also cannot publishers emit their bibliographic metadata in standard XML on their web pages? It’s a solved problem. It would mean that anyone getting a citation would get it right. (I assume that garbled citations don’t get counted in the holy numbers game, so it pays to have your metadata scraped correctly.) And XML is the simple, correct way to do that.
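A minimal sketch of what emitting such metadata could look like, using Python's standard `xml.etree.ElementTree` and Dublin Core element names. Dublin Core is an assumption chosen here because it is small and widely understood; it is not what any particular publisher uses. The `record` wrapper element and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
ET.register_namespace("dc", DC)


def citation_to_xml(record):
    """Serialise a flat metadata dict as Dublin Core elements."""
    root = ET.Element("record")  # hypothetical wrapper element
    for field in ("creator", "title", "source", "date"):
        if field in record:
            child = ET.SubElement(root, f"{{{DC}}}{field}")
            child.text = record[field]
    return ET.tostring(root, encoding="unicode")


xml_text = citation_to_xml({
    "creator": "Paul Metz",
    "title": "Thirteen Steps to Avoiding Bad Luck in a Serials Cancellation Project",
    "source": "Journal of Academic Librarianship 18",
    "date": "1992-05",
})

# Anyone scraping the page can read the fields back without guessing.
parsed = ET.fromstring(xml_text)
title = parsed.find(f"{{{DC}}}title").text
```

A DOI, where one exists, would fit naturally in a `dc:identifier` element alongside these.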

It’s not as if the publishers don’t have an XML Schema (or rather DTD). They do.

It’s called PRISM. Honest. Publishing Requirements for Industry Standard Metadata – a worthy, if probably overengineered, approach. But maybe the name has got confused.

Of course the NIH/Pubmed has got this problem solved. Because they are the scientific information providers of the future.

Why not borrow their solution?

Exploring RDF and CML

Sunday, December 30th, 2007

I’ve taken the chance of a few days without commitments to investigate how we shall be using RDF. We’ve got several projects where we are starting to use it – CrystalEye – WWMM, eChemistry, SPECTRa : JISC and other ORE-based projects. I’ve been convinced for a few years that CML+RDF has to be the way forward for representing chemistry – the only question was when. CML gives the precision that is required for defining the local structure of objects (such as molecules) and RDF gives the flexibility for supporting a very diverse community who have different approaches and needs. It’s a balance between these two.

RDF represents information by triples – classically

subject – predicate – object

Here’s an example from WP:



        <rdf:Description rdf:about="">
                <dc:title>Tony Benn</dc:title>
                <dc:publisher>Wikipedia</dc:publisher>
        </rdf:Description>

To an English-speaking person, the same information could be represented simply as:

The title of this resource, which is published by Wikipedia, is ‘Tony Benn’

[Tony Benn is a well-known socialist UK politician much respected by people of all parties and none.]

This can be represented by a graph (from the W3C validator service) :


This is a very simple graph. The strength of RDF is that you can add a new triple anywhere and keep on doing it. The weakness of RDF is that you can add a new triple anywhere and keep on doing it. You end up with graphs of arbitrary structure. The challenge of ORE is to make sense of these.
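The "add a triple anywhere" property is easy to see if the graph is held as a plain set of (subject, predicate, object) tuples. The sketch below is a toy in standard Python, not Jena or any RDF library. The Dublin Core predicate URIs are real; the subject URI is a placeholder standing in for the Wikipedia page address, and the function names are hypothetical.

```python
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DC_PUBLISHER = "http://purl.org/dc/elements/1.1/publisher"
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"

# Placeholder URI for the resource being described.
RESOURCE = "http://example.org/resource/Tony_Benn"


def add_triple(graph, subject, predicate, obj):
    """Add one statement; no schema decides where it may attach."""
    graph.add((subject, predicate, obj))


def objects(graph, subject, predicate):
    """All objects for a given subject and predicate, sorted for stability."""
    return sorted(o for (s, p, o) in graph if s == subject and p == predicate)


graph = set()
add_triple(graph, RESOURCE, DC_TITLE, "Tony Benn")
add_triple(graph, RESOURCE, DC_PUBLISHER, "Wikipedia")

# Later, anyone can hang more statements on the same node.
add_triple(graph, RESOURCE, DC_SUBJECT, "UK politics")

titles = objects(graph, RESOURCE, DC_TITLE)
```

Nothing stops the third statement from being added, which is both the strength and the weakness described above.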

Molecules have a variable RDF structure. We have to cater for molecules with no names, a hundred names, many properties, parameter constraints, etc. And the data are changing constantly and can come from many places. So there needs to be a versioning system and RDF is almost certainly the best way to tackle this. So here is a typical molecule:


The quality is bad because the graph is much larger and had to be scaled down (you can click it). But it shows the general structure – a “molecule” node, with about 10 “properties” (in the RDF sense) and 3-4 layers.

The learning curve for RDF is steep. The nomenclature is abstract and takes some time to become familiar with. Irritatingly there are at least 4 different syntaxes and some parts of them are very similar. Several query languages as well. However having spent a day with Jena, I can now create RDF from CML and it makes a lot of sense. (Note that it’s relatively easy to create RDF from XML, but no guarantee that arbitrary RDF can be transformed to XML).

The key thing that you have to learn is that almost everything is a Uniform Resource Identifier (URI) or a literal. So up to now we have things in CML such as dictRef, convention, units. In RDF all these have to be described by URIs. This is hard work but very good discipline and helps to firm up CML vocabulary and dictionaries.
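A minimal sketch of that discipline, in Python with the standard library; it is not the Jena code I actually used. It takes a CML-style property whose dictRef and units are prefixed names and expands them into full URIs before making a triple. The prefix-to-URI table, the molecule URI, the `cml:mpt` and `units:celsius` names and the value are all hypothetical placeholders, and the fragment omits the CML namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from dictionary prefixes to base URIs.
PREFIXES = {
    "cml": "http://example.org/dict/cml#",
    "units": "http://example.org/units#",
}

CML_FRAGMENT = """
<molecule id="m1">
  <property dictRef="cml:mpt">
    <scalar units="units:celsius">172.5</scalar>
  </property>
</molecule>
"""


def expand(qname):
    """Turn 'prefix:local' into a full URI, refusing unknown prefixes."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"no URI known for {qname!r}")
    return PREFIXES[prefix] + local


def property_triples(molecule_uri, cml_text):
    """One (subject, predicate, literal) triple per property element."""
    triples = []
    for prop in ET.fromstring(cml_text).iter("property"):
        scalar = prop.find("scalar")
        triples.append((molecule_uri, expand(prop.get("dictRef")), scalar.text.strip()))
    return triples


triples = property_triples("http://example.org/molecule/m1", CML_FRAGMENT)
units_uri = expand("units:celsius")
```

The failure case is the useful part: a dictRef whose prefix has no agreed URI cannot be turned into a triple at all, which is what forces the vocabulary to be firmed up.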

So we now have over 100,000 chemical triples and should be able to do useful things very soon.