petermr's blog

A Scientist and the Web


Archive for June, 2007

Collaborative Organic Synthesis (a subversive proposal)

Saturday, June 30th, 2007

Every month we get several new chemistry blogs – I don’t have time to do more than glance at them but I was struck by a newcomer, TotallyRetrosynthetic. (TotallyFoo is a metasyntactic linguistic style sparked off by TotallySynthetic.) Retrosynthesis is the process of working out how to make a chemical compound by starting at the target and working backwards – if you want to climb a mountain, start at the top, descend, and then retrace your steps. Of course for chemistry this is done on paper. (It ought to be done in silico as well, but organic chemists fear they will lose virility by using computers to help them.)

TR suggests a subversive proposal – that chemists should collaborate and distribute the work:

Join the project to help the cause

I would like to extend the suggestion made to European Chemist to become a member of this project (in the comments section of Daphnicyclidins post here) to others as well who are willing to contribute to the cause of this project. It need not be Daphnicyclidins. It could be your dream proposal that you had come up at some point of your career which never saw the light. All you have to do is join this project as a member, create a new page where you explain briefly about the importance of your “myresearchproposal” and post it there. I will make a post on it on this blog. Peers would review it, comments would flow, and the idea would refine and evolve further.

Then person ‘A’ from Australia could try the step 1 of your proposal since he has the required expertise and the materials, and post his experimental results here. Then ‘B’ from Brazil can pitch in and evaluate the feasibility of key transform since he has the closely related model compounds in his lab, and post his results here. Then person ‘C’ from China could carry some computational studies for the observations of the ‘B’ or the proposed out come of step 10 and list his results and inferences. Biochemist ‘D’ can collaborate for molecules that are made and screen them against his targets. Some funding agency ‘X’ could sponsor projects that are worth of pursuing. It could be any proposal, as long as it is yours):

The flow of the materials, ideas, expertise and resources in the fashion described above would render your dream project become reality that you never thought of, be it because of your limited resources or lack of opportunities. I agree that it might take time and efforts, but isn’t it what anyhow some project would take even in a traditional setup; and never see the light at all, at times, because it met a dead end at some final stages since it was done in the closed doors, and never tapped the expertise of other scientists out there and thus wasted the taxpayers’ money. Imagine, the same project being carried out else where in the world or in the immediate next door with the taxpayers’ money again, because it was not done in an open fashion. I could be exaggerating a bit, but I want you to give a serious thought to it.

I am sure that the regular members of Chemical Blogosphere know the ‘potential’ that is referred to. Let that proposal of yours, and your scientific talent be not wasted!

Alternatively, some PI could come forward to try your proposal with his resources in a traditional fashion. He is welcome to do so upon mutual agreements.

The objectives and advantages could go on and on ……….

So, I welcome you to join me if you are willing to become the part of this project and take it forward. You never know your ideas might add something that I did not think of as far as the project is concerned. So pull out that proposal you drafted that has been sitting under the heap of papers, and refine it a bit with your added expertise, and post it to MyResearchProposals. I would also suggest those of you who wants to be A, B, C, D, and X also join the team, and you are important here. If you just want to be a mere knowledgeable peer you are most welcome to be a member so that you can review the things and leave your impact.

Click here to see the file to check what is it that we have been talking about.

I would suggest trusting in the scientific attitude of the scientists . After all, we are talking about the progress of science. As you all know, this project is still in its incipient stage – things will be defined, and actions will be planned as we progress.

(Caution: You be the judge for your proposal or idea, and decide if you want to be part of the things here, and act accordingly.)



This is a wonderful vision. It is, of course, what we try to arrange in Open Source where different software modules are offered and different people agree to accept them. It’s hard, often fails, but works very well in many cases.

Could it work in chemistry? Yes, if chemistry is seen as a collaborative science where there is a common goal for the benefit of humankind. Unfortunately we have a little way to go. Currently synthetic organic chemistry is often a competitive sport rather than a distributed science. The goal is to make things that are more impressive than your competitors’, rather than to make things that are useful in themselves. It’s rather like the plumage wars that male birds engage in. And graduate students are often seen as wage slaves or cannon fodder. A regular reading of the chemical blogosphere reveals that the non-Open process results in over-hyped yields (i.e. the reported success of experiments), badly presented supporting data, etc. While these are relatively infrequent (I hope), those in the blogosphere who actually do the work are sufficiently concerned that it is a common topic. By way of contrast, who ever heard of an Open Source programmer who manufactured code?

I’m not suggesting that chemistry should go Open and start collaborating. It won’t happen. But why don’t we pursue the idea of Open Chemical Synthesis directed against real targets of benefit to humankind? The idea of international collaboration should be feasible – many years ago the EC (I think) funded the sequencing of the yeast genome by farming out the chromosomes among 17 nations. This could be done in Open Medicinal Chemistry. So I hope TR gets some critical mass of interest and maybe finds a funder who wants to do science rather than sport.

Stochastic hyperslide at MKM2007

Friday, June 29th, 2007

I have just given my presentation at Mathematical Knowledge Management 2007, for which I wrote an abstract about 2-3 months ago: Mathematics and scientific markup. I knew that in the intervening time I would find something new to get excited about – and this has happened – I have added the excitement of the lc-semanticweb. Of course the technology and community have developed since then.

As many of you know I rail against Powerpoint as a prime destroyer of semantic content. Powerpoint also constrains the presenter to a linear mode – yes you can skip a few slides and maybe even hide them, but it’s not easy to flip about. And it’s a poor launch platform for interactive demos.

I’ve done my slides in XHTML+SVG, believing this is the right way to remain true to my campaign for XML. (I’ll do Powerpoint when it’s necessary for business purposes – e.g. to integrate with colleagues, but that’s about it). This worked for a bit but soon hit problems of scale. I started addressing that with XSLT to add menus to the presentation. In fact I started with the wrong technology (for some bizarre reason I chose it to be Windows specific) and have now simply changed to XHTML.

I have over 12000 XHTML slides. (Before you get the wrong idea, many of these are scraped – so 3000+ from one example of OSCAR3.) But nonetheless there are very many. I want to be able to reassemble them for each talk, and I want the technology to be as simple as possible – ideally none. (The efforts I have used in the past have all been broken by browser “upgrades” – a synonym for disasters.)

Some ideas are:

  • use a database and craft metadata for each slide
  • use something like Spotlight or local Google

but these don’t assemble the talk. So at present I have about 100 directories (maybe with trivial subdirectories) and 5-20 slides per directory. I make the talk by selecting directories which may have some general bearing on the talk – perhaps 20-30. Admittedly it takes memory to work out what is likely to be in each folder but I have to work hard at a talk and the time is well spent. I then asterisk those directories which I HAVE to present (i.e. if I get to 5 mins before the end and haven’t mentioned them, I break off and visit them). I prepare demos (such as BIOCLIPSE, OSCAR1, GoogleInChI, Blue Obelisk GreaseMonkey) and visits to the WWW (when the organizers have provided it – e.g. the ACS hardly ever does even when I ask in advance – it makes little sense to have sessions about the Web when you can’t get there).
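The directory-based assembly described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (`assemble_talk`, `render_index`), the file extensions and the directory names in the usage note are all hypothetical, and nothing here is the tooling actually used for the talk.

```python
from pathlib import Path

SLIDE_SUFFIXES = {".html", ".xhtml"}  # assumed extensions for XHTML slides


def assemble_talk(root, selected, must_present=()):
    """Return [(directory name, starred?, sorted slide paths)] in the chosen order.

    `selected` is the list of directories picked for this talk and
    `must_present` the subset that would be "asterisked".
    """
    root = Path(root)
    starred = set(must_present)
    unknown = starred - set(selected)
    if unknown:
        raise ValueError(f"starred but not selected: {sorted(unknown)}")
    talk = []
    for name in selected:
        # rglob also picks up slides in trivial subdirectories
        slides = sorted(
            p for p in (root / name).rglob("*")
            if p.is_file() and p.suffix in SLIDE_SUFFIXES
        )
        talk.append((name, name in starred, slides))
    return talk


def render_index(talk):
    """Render a plain-text running order; starred directories get an asterisk."""
    lines = []
    for name, is_starred, slides in talk:
        marker = "*" if is_starred else " "
        lines.append(f"{marker} {name} ({len(slides)} slides)")
        lines.extend(f"    {slide.name}" for slide in slides)
    return "\n".join(lines)
```

A call such as `render_index(assemble_talk("slides", ["cml", "oscar"], must_present=["oscar"]))` would give a simple running order to glance at during the talk. It does not record which slides were actually visited.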

So I prepared this for today’s talk to the MKM. A very nice audience to present to as they understand all about semantic content, namespaces, XML, dictionaries – so none of that has to be explained. I said my hyperslide would be stochastic – I didn’t know what slides I would present and in what order. The demos might break.

They did. BIOCLIPSE hung on Jmol rotation (although I got to demo Jmol later). However I am sure the audience appreciated the value – we’d seen Eclipse being used for theorem proving, etc. GreaseMonkey worked yesterday, but failed today. Now I have reinstalled it and it works great. GoogleInChI failed (is the Google API finally broken?). But OSCAR1 and OSCAR3 worked – and the links out to Pubchem and the chemical blogosphere. And the polymer builder, although I didn’t have time to explain exactly how it was a symbol manipulator. And I certainly covered less than half of what I might have said. But at least the hyperslide approach means I never overrun – as you can stop when you need to.

There are downsides. It’s difficult to keep a record (that’s why videos are useful). And Powerpoint does have the merit of acting as a document container. I’ve tried both S3 and Slidy but neither helps you to assemble talks.

The only complete way to make slides available is to put them under SVN on WWMM. I can copy the directories to a pen drive. But none of this is a record of which slides were visited in which order and what was said.

I’d be interested in whether anyone else is mad enough to create new ways of managing their slides? And whether they have any ideas. At present I’m almost motivated to try Javascript, but the last time I did that – 5 years ago – everything broke within a year.

A snapshot of the chemical blogosphere

Thursday, June 28th, 2007

I want to show the mathematicians the vibrancy and value of the chemical blogosphere so – at random – I picked today’s TotallySynthetic. By chance it’s very fitting as it is a review of a paper by one of the blogospheric heroes – Tenderbutton (== Dylan Stiles). Dylan wrote a fascinating and idiosyncratic blog for over a year until his supervisor and pressure of work combined to stop him – fittingly he has a regular blogging column for a chemical publisher. Here’s a flavour of the current post which I’ll show tomorrow:

== Excerpt (with some deletions) from TotallySynthetic Blog ==

Spirotryprostatin B

27 June 2007

Trost, and Stiles. Org. Lett., 2007, ASAP. DOI: 10.1021/ol070971k.

He’s done it :) . Those of us who read Tenderbutton from the start will have known of Dylan’s work on this tasty little number, and he’s done it proud. Eight steps to the natural product; we’ll start with step one (or rather, the first non-literature step):


We’ve looked at Otera’s catalyst before, but as a quick reminder, it’s a funky trans-esterification catalyst. It’s been known for a little while, and the mechanism is in this JOC article. Nice to see it being used; what was it like to handle, Stiles?


Good job, old chap!

10 Responses to “Spirotryprostatin B”

  1. milkshake Says:
    June 28th, 2007 at 1:56 A lovely detail is that a simple oxindole was used as a starting material…
  2. the dude Says:
    June 28th, 2007 at 2:32 Could someone (maybe even Dylan) comment … Why can’t you use any of the plethora of other esterification catalysts? Thanks.
  3. provocateur Says:
    June 28th, 2007 at 2:32 just a doubt…how is otera’s catalyst superior to the usual transesterification catalyst , titanium isopropoxide?
  4. Spiro Says:
    June 28th, 2007 at 2:37 #1Absolutely. This guy, Trost, is plagiarizing Baran!
  5. Spiro Says:
    June 28th, 2007 at 3:31 #2This is the one-million-dollar question!Provocateur’s comment about the “usual transesterification catalyst” makes me laugh. There are probably as many usual transesterification catalysts as there are Synth. Commun., Tet. Lett., Chem. Commun. or Org. Lett. articles titled “XYZ, a superactive catalyst for (trans)esterification of QWERTY-acids with YTREWQ-alcohols under mild conditions”.

    PPS: I have read Otero’s book on esterification :
    This is the poorest book I read from Wiley. After reading this book and many more articles on the topic I came to the conclusion that this field is a hoax.

  6. kiwi Says:
    June 28th, 2007 at 10:11 #2, #3 - Oteras catalyst is a beaut, nice sparkling white crystals, stable as a rock. a shaved lab rat can make it on a scale of tens of grams…

=========== end of excerpt ==========

Synergy between “MKM”-math and “CML”

Thursday, June 28th, 2007

In this post I am using the context of the Mathematical Knowledge Management 2007 conference to try to construct similarities in the MathML and CML communities and their thought processes. I’ll show this list in my presentation tomorrow – I may not cover all the points. Similarities:

  • both are evolving languages – MathML is not “finished” and won’t be for some time.
  • smallish part of the main community, often struggling to get the message across.
  • absolute belief that computer mechanization is essential and beneficial.
  • belief this will happen, but timescales are unclear
  • need for a formal, somewhat arbitrary, selection of core components. e.g. geometry has 42 formal concepts; CML has 100 elements. Often pragmatic.
  • extension is through dictionaries (OM has ContentDictionaries, CDs; CML has convention-based dictionaries (CML/Golem))
  • systems are still evolving and will continue indefinitely. Need versioning and flags.
  • Correctness checking (e.g. of publications) is important, but undervalued by the mainstream communities.
  • Fragmented support for development. Have to concentrate limited resources to aim in communal direction.
  • Would revolutionise publication process but difficult to get mainstream involvement including learned Socs.
  • Problems with browsers – how do we transmit rich content? And with every browser release things break.
  • Legacy – can we extract useful info from Bitmaps? PDF? LaTeX/ChemDraw?
  • Commercial organizations are often apathetic or even antagonistic
  • Virtual communities can flourish. Openness is critical
  • Both have a strong bottom-up approach. MathML has enough pragmatists to make it work in practice.
  • both are needed by other disciplines – sometimes the support comes from them rather than the mainstream.

Some differences:

  • Maths is relatively poor – although there are commercial companies providing tools and services.
  • Maths sometimes produces proof of concept without later sustainability.
  • Maths believes in standards. Chemistry doesn’t (yet).

Possible joint activities:

  • workshop on combined infrastructure of MLs, tools, examples
  • joint projects (inc. students presented with “toy” systems)
  • synergy in developing tools (esp. client side)
  • blogs etc. aimed at joint activities.
  • minimal level of infrastructure for OM CML interoperability. Converters? libraries? stylesheets?
  • rich client

I’ve gone over these with Michael Kohlhase and we roughly agree on them.

The Handbook of Integer Sequences

Thursday, June 28th, 2007

I am at the Mathematical Knowledge Management 2007 and having a wonderful time. At present Neil Sloane is talking about his marvellous On-Line Encyclopedia of Integer Sequences, a collection of every known (and voluntarily communicated) sequence, e.g. what is the next term in:



and over 100,000 more

  • 1964 started
  • 1973 2500 seqs
  • 1995 1 m**3 of mail
  • 1995 5500 seqs
  • 1996 10000
  • 2007 131000

A large volunteer community – with a virtual party for the 100,000th sequence. One has set the sequences to music (try “Listen” – this is sometimes great fun).

2000000 lines of flat file, 120M edited with emacs – total 450 Mbytes including all info. 10K seqs/year, 30 comments/day, 600 emails/day

shortest seq 76337 (1 term)

longest a27 500,000 terms (natural numbers) – this raised a laugh. The point of including it is that you can plot one sequence against another.

Used to find out everything about a sequence. Difficulty distinguishing between conjectures and proofs.

Sequence fans mailing list. But the database is (rightly) restricted to a set of trusted editors, and ultimately almost all is done by Neil.
200 examples of sequence pairs which have identical content but are different sequences.

I find it wonderful that one can search Google with a sequence and it will hit Neil’s site (if it’s in the database). It’s one of the few digital objects where the content acts as an index in Google. InChI is another.
It’s fascinating. Every sequence that comes in is transformed by over 100 transforms to see what its structure might be. It’s a real living ecology of digital objects. Neil has shown us how datamining the database – comparing sequences with each other – has resulted in new mathematical theorems. Wouldn’t it be wonderful if chemical information was available for datamining and not owned by commercial interests?
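The transform-and-compare idea can be illustrated with a toy sketch. The three transforms and the four-entry lookup table below are assumptions chosen for illustration; they are not the encyclopedia’s actual transform set or data, and `find_relations` is a hypothetical name.

```python
from itertools import accumulate


def differences(seq):
    """First differences: a(n+1) - a(n)."""
    return [b - a for a, b in zip(seq, seq[1:])]


def partial_sums(seq):
    """Running totals: a(1), a(1)+a(2), ..."""
    return list(accumulate(seq))


# A tiny, illustrative set of transforms (the talk mentioned over 100).
TRANSFORMS = {
    "identity": lambda seq: list(seq),
    "differences": differences,
    "partial sums": partial_sums,
}

# A hypothetical miniature "database" of known sequences.
KNOWN = {
    "natural numbers": [1, 2, 3, 4, 5, 6, 7, 8],
    "odd numbers": [1, 3, 5, 7, 9, 11, 13, 15],
    "squares": [1, 4, 9, 16, 25, 36, 49, 64],
    "triangular numbers": [1, 3, 6, 10, 15, 21, 28, 36],
}


def find_relations(seq, min_len=4):
    """Apply each transform and report which known sequences the result matches.

    Matching is by common prefix, so a hit is a conjecture, not a proof.
    """
    hits = []
    for transform_name, transform in TRANSFORMS.items():
        out = transform(seq)
        if len(out) < min_len:
            continue
        for known_name, known in KNOWN.items():
            n = min(len(out), len(known))
            if out[:n] == known[:n]:
                hits.append((transform_name, known_name))
    return hits
```

For example, `find_relations([1, 3, 5, 7, 9, 11])` reports that the input is the odd numbers and that its partial sums agree with the squares on the terms available. Agreement on a prefix only suggests a relation, which echoes the point about the difficulty between conjectures and proofs.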

OpenMath/MathML, CML and communities of practice

Thursday, June 28th, 2007

James Davenport – one of the originators of OpenMath – is presenting the current status. (OpenMath and MathML are converging and although they are distinct I shall conflate them here.)

What is the formal semantics? OM has 4 flags:

  • official
  • experimental
  • private
  • obsolete

Normally only the official and obsolete are visible on the official site.

What if we have conflict? An example is systems (like Euclid) which start from 1 whereas OpenMath may start from 0. (I’m on weak ground here…). MathML has ContentDictionaries (CD) for representing each approach – but we need a mechanism for conversion.

This is very similar to the CML idea of convention attribute.

For example the British and the French have different symbolisms for integrals. They aren’t just linguistic. In CD sqrt(-1) is i, in engineering it is j.

Do we skate over inconsistencies or refuse to emit any results where there are conflicts?

“A theorem prover which emits wrong results is probably worse than useless.” [A chemistry system which emits wrong compounds? Of course. (But most of the current ones do.)]

Is notation independent of semantics? Apparently not. And it’s a harder problem than it looks to convert. How do we speak:

T(n) = O(n²)

Apparently no mathematician would speak “equals” – they would say “is in”. JD suggested a “notation” approach (I see this as similar to CML’s convention). So our symbology is much deeper than we usually realise.
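One conventional way to make the “is in” reading explicit is to treat O(g) as a set of functions. The definition below is the usual textbook one and is offered as an assumption about what was meant; it is not taken from the talk.

```latex
O(g) = \{\, f : \exists\, c > 0,\ \exists\, n_0,\ \forall n \ge n_0,\ |f(n)| \le c\,|g(n)| \,\}
\qquad\text{so that}\qquad
T \in O(n^2)
```

Read this way, the “=” in the written form is notation for set membership rather than equality, which is exactly the gap between notation and semantics being discussed.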

So OM defines a smallish set of symbols which are fundamental. These are hardcoded. (In chemistry these might be element symbols, coordinates, etc.) Then there are those that OM uses widely but are not fundamental, and then those that the community develops – they have to be supported by OM, but are not part of it.

I’ll stop here – I hope you get the idea that OM is struggling to formalize those things which are absolute, those which are universal but not fundamental, and those which are  widely used. If those words (which are mine not James’) conflict, that’s probably accurate.

He finishes:

“OM is useful, but being semantic, let’s use it carefully”. That goes for CML as well.
My message is that MLs are community processes, depend heavily on interpreting implicit semantics, and in doing so will help to formalize our understanding.

Maths has a community process. CML is much less advanced, but uses the Blue Obelisk as a clearing house.  But I take from OM and communities like FOAF that CML should have flags on its components such as

  • accepted
  • experimental
  • convention (e.g. JSpecView)
  • obsolete
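As a rough sketch of how such flags might be carried in software, the following attaches a status to each component and filters on it. The registry entries and their statuses are invented for illustration (only “molecule”, “atom” and “spectrum” are meant as familiar CML element names), and `Component` and `select` are hypothetical names, not part of any CML library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class Status(Enum):
    ACCEPTED = "accepted"
    EXPERIMENTAL = "experimental"
    CONVENTION = "convention"
    OBSOLETE = "obsolete"


@dataclass(frozen=True)
class Component:
    name: str
    status: Status
    convention: Optional[str] = None  # which convention governs it, if any


def select(components: Iterable[Component],
           allowed: Iterable[Status] = (Status.ACCEPTED,)) -> List[str]:
    """Return the names of components whose status is in `allowed`."""
    allowed_set = set(allowed)
    return [c.name for c in components if c.status in allowed_set]


# Hypothetical registry: the statuses here are assumptions for the example only.
REGISTRY = [
    Component("molecule", Status.ACCEPTED),
    Component("atom", Status.ACCEPTED),
    Component("exampleDraftElement", Status.EXPERIMENTAL),
    Component("spectrum", Status.CONVENTION, convention="JSpecView"),
    Component("exampleRetiredElement", Status.OBSOLETE),
]
```

A strict consumer might call `select(REGISTRY)` and see only accepted components, while a tool that understands a particular convention could pass `allowed=(Status.ACCEPTED, Status.CONVENTION)`.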

Towards Mechanized Mathematical Assistants (and the Scientist’s Amanuensis)

Thursday, June 28th, 2007

I’m now at MKM2007 whose book[1] has the splendid title:

Towards Mechanized Mathematical Assistants

the vision that computers can work together with humans to enhance the understanding of both parties. This vision has much in common with our own SciBorg project where we use the idea of a “scientist’s amanuensis” for a computer system that can help humans understand chemistry in machine-accessible form
(see also An Architecture for Language Processing for Scientific Texts)

As with many of my talks I’m blogging some of the talks – and more importantly the encounters – during this meeting. I haven’t decided what I’m going to say tomorrow and these notes may form part of it.

I’m first of all searching for similarities in the thought processes between the MKM community and our own chemical knowledge community (i.e. those people in chemistry who have realised we are in the Century of information and knowledge – a small percentage). That’s also true in mathematics – the MKM community – which overlaps with symbolic computation and computer algebra – is not mainstream but, unlike chemistry, has achieved critical mass (i.e. there are regular meetings, serial publications, etc. – that’s a pointer for ourselves – we need to do these things as well). So here are some top-level isomorphisms:

  • need to check the correctness of submitted manuscripts. Maths has many places where the author states “it’s obvious…”, or lemma 12.23.34 appears without an actual derivation. In chemistry and other sciences we have suspect or missing data
  • it’s not “proper maths/chemistry”. Markup is trivial/non-essential/notRealResearch, etc. You have to prove theorems or make compounds to be a proper researcher.
  • The Open Source systems are no good because the commercial systems (Mathematica, Maple, ChemDraw, OpenEye) are better. If it doesn’t cost money it can’t be any good
  • Machines can never have the insight that humans have.
  • The publishers/senior editors know what is good for the community. We’ve done it on printed paper for 200+ years. It works, why change it? (It’s also a lot of trouble.)

The first lecture (Paule) showed that machines can now prove some quite complex theorems. Certainly miles beyond me. Things like Wallis’ formula for pi and Ramanujan’s first identity. And, if we are going to use symbolic algebra systems, we need other systems to prove that the CAS are correct.
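For reference, Wallis’ formula in its usual product form is given below, assuming the standard statement is the one meant in the lecture:

```latex
\frac{\pi}{2}
  = \prod_{n=1}^{\infty} \frac{(2n)^2}{(2n-1)(2n+1)}
  = \frac{2}{1}\cdot\frac{2}{3}\cdot\frac{4}{3}\cdot\frac{4}{5}\cdot\frac{6}{5}\cdot\frac{6}{7}\cdots
```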

[1] Springer, LNAI 4573 eds: Kauer, Kerber, Miner, Windsteiger. Not, of course, Open Access – you have to buy a physical book.

HHMI – green or gold? And the data?

Wednesday, June 27th, 2007

Peter Suber has highlighted a new policy by HHMI and given a careful critique of what “Open” may or may not mean. It’s a good illustration of the fuzzy language that is often used to describe “Open”.  See: HHMI mandates OA but pays publishers to allow it

HHMI Announces New Policy for Publication of Research Articles, a press release from the Howard Hughes Medical Institute (HHMI), June 26, 2007.  Excerpt:

The Howard Hughes Medical Institute today announced that it will require its scientists to publish their original research articles in scientific journals that allow the articles and supplementary materials to be made freely accessible in a public repository within six months of publication.

[... snip ...]
HHMI also announced today that it has signed an agreement with John Wiley & Sons. Beginning with manuscripts submitted October 1, Wiley will arrange for the upload of author manuscripts of original research articles, along with supplemental data, on which any HHMI scientist is an author to PMC. The author manuscript has been through the peer review process and accepted for publication, but has not undergone editing and formatting. HHMI will pay Wiley a fee for each uploaded article.

In addition, the American Society of Hematology, which publishes the journal Blood, has extended its open access option to HHMI authors effective October 1. Cech said that discussions with other publishers are ongoing.

The policy and supporting resources have been posted on the Institute web site and may be found [here].

To supplement this press release see

  1. The policy itself, dated June 11, 2007, to take effect January 1, 2008
  2. The Institute’s new page on HHMI & Public Access Publishing

Comments (by Peter Suber – absolutely to the point as always).

  • HHMI is finally mandating that its grantees provide OA to their published articles based on HHMI-funded research within six months of publication.  We knew last October that it was planning to adopt a mandate, but now it’s a reality.  Moreover, HHMI is taking the same hard line that the Wellcome Trust has taken:  if a grantee’s intended publisher will not allow OA on the funder’s terms, then the grantee must look for another publisher.  This is all to the good.  Funders should mandate OA to the research they fund, and they should take advantage of the fact that they are upstream from publishers.  They should require grantee compliance, not depend on publisher permission.
  • But unfortunately, HHMI is continuing its practice of paying publishers for green OA.  I criticized this practice in SOAN for April 2007 and I stand by that criticism.  HHMI should not have struck a pay-for-green deal with Elsevier and should not be striking a similar deal with Wiley.  HHMI hasn’t announced how much it’s paying Wiley, and it’s possible that the Wiley fees are lower than the Elsevier fees.  But it’s possible that they’re just as high:  $1,000 – $1,500.  We do know that its Wiley fees will not buy OA to the published edition, but only OA to the unedited version of the author’s peer-reviewed manuscript.  HHMI hasn’t said whether its Wiley fees will buy unembargoed OA or OA with a CC license.  The Wellcome Trust’s fees to Elsevier buy three things of value –immediate OA, OA to the published edition, and OA with a CC license– while HHMI’s fees to Elsevier buy none of these things.  If HHMI gets all three of these valuable things for its Wiley fees, then it’s basically paying for gold OA and no one can object to fees that are high enough to cover the publisher’s expenses.  But paying for green OA, when the publisher’s expenses are covered by subscription revenue, is wrong and unnecessary even if the fees are low.   For details, see my April article.


Note that “Green” OA is very unlikely to make the Data Open. By default the publisher may restrict text-mining, and may have copyrighted the data (Wiley certainly have done and do this). So unless there is a CC license – which makes it effectively “gold” (in this very unsatisfactory terminology) – it’s almost useless to data-driven science.

What we do at UCC – job opportunity in Polymer Informatics

Tuesday, June 26th, 2007

I don’t normally say very much in this blog about what our day jobs are; now is a useful time to do so. The Centre is sponsored by Unilever PLC – the multinational company with many brands in foods and HomeAndPersonalCare (HPC). It came about through some far-sighted collaboration between Unilever and Cambridge to create a Centre where cutting edge research would be done in areas which didn’t just address present needs but also looked to the future.

This is typified by Polymer Informatics – where we have an exciting vacancy. Many of Unilever’s products contain polymers – you can think of them as long wriggly molecules. They can be very hard – as in polythene – or flexible, as in silicones or additives in viscous liquids. Next time you put something on your hair, teeth, face, toilet bowl or laundry, etc. there’s a good chance it will have a polymer ingredient of some sort.

Work in my group looks forward to where the world will be in 5 or even 10 years’ time. Here’s a list of some of the technologies in the current position:

OSCAR3, natural language processing, text-mining, Atom, Eclipse/Bioclipse, SPARQL, RDF/OWL, XPath, XSLT, etc.

What’s that got to do with wriggly molecules? Everything. Science is becoming increasingly data- and knowledge-driven. In many cases the “answer is out there” if only we knew where to look – publications, patents, theses, blogs, catalogs, etc. We may not need to go back to the lab but can use reasoning techniques to extract information from the increasingly public world of information. And, as we liberate the major sources – scholarly publications, theses, patents – from their current closed practices we shall start to discover science from the relations we find. Open scientific information has to be part of the future.
The phrase Pasteur’s Quadrant is sometimes used to describe research which is both commercially exploitable and also cutting edge scholarship.  That’s a useful vision. I have certainly found in my time in the Centre that industrial problems are often very good at stimulating fundamental work. So polymer informatics has taken me to new fields in knowledge representation. Polymers, unlike crystals, are not well-defined but fuzzy – they can have variable lengths, branching, chemical groups, etc. They flop about and get tangled. This requires a new type of molecular informatics and we have had to explore adding a sort of functional programming to CML to manage it. We now have a markup language which supports polymers and several features are novel.

And, as the world develops, the information component in products continues to increase.  So we know we are going in an exciting direction.

Top-down or bottom-up ontologies?

Sunday, June 24th, 2007

I am working out some of the ideas I want to talk about at Mathematical Knowledge Management 2007 – and in this post I explore how a knowledge framework might be constructed, and also how it can be represented in machine-understandable form. This, I think, will be seen as one of the central challenges of the current era.

I have worked on bodies responsible for the formalisation of information and have also for 15 years been co-developing Chemical Markup Language (with Henry Rzepa). This journey has not yet ended and I’m changing my viewpoint occasionally.

My first view – perhaps stemming from a background in physical science – was that it should not be too difficult to create machine-processable systems. We are used to manipulating algorithms and transforming numeric quantities between different representations. This process seemed to be universal and independent of culture. This was particularly influenced by being part of the Int. Union of Crystallography’s development of the Crystallographic Information Framework dictionary system.

This is a carefully constructed, self-consistent, system of concepts which are implemented in a simple formal language. Physical quantities of interest to crystallographic experiments can be captured precisely and transformed according to the relations described, but not encoded, in the dictionaries. It is now the standard method of communicating the results of studies on small molecules and is the reason that Nick Day and I could create CrystalEye. Using XML and RDF technology we have added a certain amount of machine processability.
Perhaps encouraged by that, Lesley West and I came up with the idea of a Virtual Hyperglossary (original site defunct, but see VIRTUAL HYPERGLOSSARY DEVELOPMENTS ON THE NET) which would be a machine-processable terminology covering many major fields of endeavour. Some of this was very naive, some (e.g. the use of namespaces) was ahead of the technology. One by-product was an invitation to INTERCOCTA (Committee on Conceptual and Terminological Analysis) – a UNESCO project on terminology. There I met a wonderful person, Fred W. Riggs, who very gently and tirelessly showed me the complexity and the boundaries of the terminological approach. Here (Terminology Collection: Terminology of Terminology) is an example of the clarity and carefulness of his writing. One of Fred’s interests was conflict research, including his analysis of the changing nature of “Turmoil among nations”. I am sure he found my ideas naive.
So is there any point in trying to create a formalization of everything – sometimes referred to as an Upper Ontology? From WP:

In information science, an upper ontology (top-level ontology, or foundation ontology) is an attempt to create an ontology which describes very general concepts that are the same across all domains. The aim is to have a large number of ontologies accessible under this upper ontology. It is usually a hierarchy of entities and associated rules (both theorems and regulations) that attempts to describe those general entities that do not belong to a specific problem domain.

The article lists several attempts to create such ontologies, one of the most useful for those in Natural Language Processing being

WordNet, a freely available database originally designed as a semantic network based on psycholinguistic principles, was expanded by addition of definitions and is now also viewed as a dictionary. It qualifies as an upper ontology by including the most general concepts as well as more specialized concepts, related to each other not only by the subsumption relations, but by other semantic relations as well, such as part-of and cause. However, unlike Cyc, it has not been formally axiomatized so as to make the logical relations between the concepts precise. It has been widely used in Natural Language Processing research.

(and so is extremely valuable for our own NLP work in chemistry).

But my own experience has shown that the creation of ontologies – or any classification – can be an emotive area and lead to serious disagreements. It’s easy for any individual to imagine that their view of a problem is complete and internally consistent and must therefore be identical to others in the same domain. And so the concept of a localised “upper ontology” creeps in – it works for a subset of human knowledge. And the closer to physical science the easier to take this view. But it doesn’t work like that in practice. And there is another problem. Whether or not upper ontologies are possible it is often impossible to get enough minds together with a broad enough view to make progress.

So my pragmatic approach in chemistry – and it is a pragmatic science – is that no overarching ontology is worth pursuing. Even if we get one, people won’t use it. The International Union of Pure and Applied Chemistry has created hundreds of rules on how to name chemical compounds and relatively few chemists use them unless they are forced to. We have found considerable variance in the way authors report experiments and often the “correct” form is hardly used. In many cases it is “look at the current usage of other authors and do something similar”.

And there is a greater spread of concepts than people sometimes realise. What is a molecule? What is a bond? Both are strongly human concepts and so difficult to formalize for a machine. But a program has to understand exactly what a “molecule” is. So a chemical ontology has to accept variability in personal views. One ontology per person is impossible, but is there scope for variability? And if so, how is this to be managed?

So far CML has evolved through a series of levels and it’s not yet finished. It started as a hardcoded XML DTD – indeed that was the only thing possible at that stage. (In passing, it’s interesting to see how the developing range of technology has broadened our views on representability.) Then we moved to XML Schema – still with a fairly constrained ontology but greater flexibility. At the same stage we introduced a “convention” attribute on elements. A bond was still a “bond” but the author could state what ontology could be attached to it. There was no constraint on the number of conventions, but the implied rule is that if you create one you have to provide the formalism and also the code.
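
As a rough illustration of the “convention” idea, here is a minimal Python sketch. It is not real CML tooling: the fragment is simplified and un-namespaced, the convention labels “example:valenceBond” and “example:hbond” are invented for the purpose, and the registry is just one way to picture “if you create a convention you provide the code”.

```python
import xml.etree.ElementTree as ET

# Simplified, un-namespaced fragment. The convention labels are invented
# for illustration and are not taken from any published CML convention.
FRAGMENT = """
<molecule>
  <bond id="b1" atomRefs2="a1 a2" order="1" convention="example:valenceBond"/>
  <bond id="b2" atomRefs2="a2 a3" convention="example:hbond"/>
  <bond id="b3" atomRefs2="a3 a4" order="2"/>
</molecule>
"""

# Hypothetical registry: whoever creates a convention also supplies
# the code that says what a "bond" means under it.
INTERPRETERS = {
    "example:valenceBond": lambda bond: f"covalent bond of order {bond.get('order')}",
    "example:hbond": lambda bond: "hydrogen bond",
}


def describe_bonds(xml_text):
    """Map each bond id to the meaning its declared convention gives it."""
    result = {}
    for bond in ET.fromstring(xml_text).iter("bond"):
        interpret = INTERPRETERS.get(bond.get("convention"))
        if interpret is None:
            # No (known) convention: it is still a bond, with no extra semantics.
            result[bond.get("id")] = "bond (no convention-specific meaning)"
        else:
            result[bond.get("id")] = interpret(bond)
    return result


for bond_id, meaning in describe_bonds(FRAGMENT).items():
    print(bond_id, "->", meaning)
```

The element is a “bond” in all three cases; only the attached convention changes what a program is entitled to assume about it.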

An example is “must a spectrum contain data?”. We spent time discussing this and we have decided that with the JSpecView convention it must, but that with others it need not. This type of variable constraint is potentially enforceable by Schematron, RDF or perhaps languages from the mathematics community. We have a system where there is “bottom-up” creation of ontologies, but where those involved agree that they need a central mechanism for formalizing them – a metaontology. The various flavours of OWL will help but we’ll need some additional support for transformation and validation, especially where numbers and other scientific concepts are involved.
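
To make the “must a spectrum contain data?” constraint concrete, here is a small Python sketch of convention-dependent validation. It stands in for what Schematron or similar might do and is not how CML is actually validated. Several things are assumptions: the attribute value “JSpecView”, the label “example:metadataOnly”, and the simplification that “contains data” means “has at least one child element”.

```python
import xml.etree.ElementTree as ET


def contains_data(spectrum):
    """Assumed stand-in for 'has data': the element has at least one child."""
    return len(spectrum) > 0


# Convention label -> list of (message, check) pairs. Conventions that are
# not listed impose no extra constraints. The label strings are assumptions.
CONSTRAINTS = {
    "JSpecView": [("spectrum must contain data", contains_data)],
}


def validate(xml_text):
    """Return (spectrum id, message) for every violated constraint."""
    errors = []
    for spectrum in ET.fromstring(xml_text).iter("spectrum"):
        rules = CONSTRAINTS.get(spectrum.get("convention"), [])
        for message, check in rules:
            if not check(spectrum):
                errors.append((spectrum.get("id"), message))
    return errors


# Simplified, un-namespaced document used only for this illustration.
DOC = """
<cml>
  <spectrum id="s1" convention="JSpecView"><peakList/></spectrum>
  <spectrum id="s2" convention="JSpecView"/>
  <spectrum id="s3" convention="example:metadataOnly"/>
</cml>
"""

print(validate(DOC))
```

Only the empty spectrum that declares the JSpecView convention is reported; the empty one under the other convention passes, which is the kind of variable constraint described above.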