Monthly Archives: October 2006

Creating "Open Data" on Wikipedia

In an earlier post I mentioned that I was going to start an article on "Open Data" on Wikipedia. This is a blow-by-blow account (a few technical details are omitted).

  1. Do not be afraid. (I used to be afraid, but there is no need.) Nobody has ever made me feel unwanted (there have been robust discussions, but none angry). So I go to WP and search for "Open Data". (The capitalisation may be important - always start with a capital if you can.) There is no such page, but it offers one a start:

No page with that title exists.

You can create this page or request it.

So by clicking on the first link you get a fresh page headed "Open Data".

2. You don't have to write the whole page at once. Even the title and a few words are enough. This is a "stub" and others can add to it or modify it. Let's see how far we get before nightfall.

3. Read some WP pages to get a sense of style. Good ones have an introductory paragraph and then the body, made of several labelled sections. Then there are usually links to other WP pages, and finally references. WP is very insistent on good references. Of course you don't have to have them all at once.

4. You can always backtrack if you get something wrong. WP saves everything. No-one minds if there are 100 versions of something. Indeed if you are on a flaky connection you may wish to make a number of small changes and save after each.

5. You don't have to let people know who you are. I do, but I don't expect it from others. You can use an IP or (if registered) an alias, or a real name. I use "petermr" which is not difficult to crack (use Google). If you are registered then add a signature (~~~~ is the magic) which links to your own WP Talk page.

Since all the edits are recorded, I can simply link to them!

  • Page creation. Note that WP has formatted the page nicely (I simply have to remember that there is no need for a title, that each section is marked ==foo==, and to add {{stub}} at the bottom). The links go in [[..]], e.g. [[Open Access]]. I've saved it after a paragraph as it's dinner time. Any visitor will see that it's a new page (stub) - they may or may not feel they want to add something. It's not my page, it's our page. However it's probably good manners to wait a day or so before editing a new page.
  • Here is the history of the edits. Notice that a Wikipedian has already spotted the article and removed the "stub" before me!
  • Here is the latest version. It's not "mine" - it's ours. Anyone can edit it - I am sure at least there are typos and many other links and references can be added.
  • The real final version tonight.
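To make the markup concrete, here is the sort of wikitext a minimal stub might contain - the ==section== syntax, [[..]] links and {{stub}} tag are as described above, but the wording is illustrative, not the actual article text:

```
'''Open Data''' is a philosophy and practice requiring that certain data
be freely available to everyone, without restriction.

==See also==
*[[Open Access]]

{{stub}}
```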

Enjoyable version control with the Tortoise

Update, merge, test, add, update, commit... that's what I do when working in a communal software project. Update, merge, test, add, update, commit...

So I've written some additional tests for JUMBO. Now I have to save them and share them with my collaborators (anyone who wants to collaborate). When I look at my directory on Windows it looks like this:

[screenshot omitted]

As you can see, some of the files are green and some are orange. The orange ones are those I have changed since "last time". Last time of what? The last time I used the Tortoise:

[screenshot omitted]



This cheerful fellow (developed initially by Francis - whom I first met at the Cambridge geek punt convention - maybe more later) sits on top of CVS. Now CVS is a wonderful tool - about 15 years old, I think - which manages the versions of your documents. You (or someone else) sets up a CVS server and every so often you check your current files against the server. It tells you when they are out of sync and whether you need to upload a file TO the server or download someone else's update FROM the server. (Remember we work as collaborators!)

CVS is a great concept but pretty hairy for normal mortals, with some fairly esoteric runes to remember. It's sufficiently forbidding almost to scare you away. Tortoise makes it trivially easy by adding itself to your normal file browser. All you do is right-click and follow simple commands.

The first is Update. Has anyone changed the files on the server since I last used CVS? If so it will download them to my directory and merge them. (Don't worry about being overwritten, CVS remembers all versions). Update also tells me which files I have edited.


[screenshot omitted]

(The pink ones are new.) Since there were one or two server-side changes I now have to re-test the system (the changes might break my code). In this case the test passes. Sometimes it requires a lot of work to merge the changes - that's normally a good thing because someone else is working with you. Now let's re-update in case the server has changed. No? Did I create any new files? I check with "Add contents" - yes, I did. (This is where I most frequently goof up - unless I send these files to the server my collaborators won't see them. And their systems will then fail.) So I add the files, and then "Commit".

Now everyone in the project can update from my latest changes...

Some of you will have thought "what happens if two people make changes to the same file at the same time?" CVS will try to merge the changes. If they are in different parts of the file it's probably OK. If not there will be a conflict. This is a bit of a pain, and normally involves reverting to the older version and then agreeing between the two people what changes each wanted to make and why. In many projects - such as CDK - the active developers keep a chat room open so they can send messages like "I need to change X - does anyone mind?" Much better than technical mechanisms.
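When a conflict does occur, CVS writes both versions into the file between marker lines and leaves you to edit the result by hand. Schematically (the file name and revision number here are invented for the example):

```
<<<<<<< Molecule.java
my edit to the disputed line
=======
my collaborator's edit to the same line
>>>>>>> 1.7
```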

Here's the final commit:

[screenshot omitted]


CVS is now being superseded by Subversion (SVN), which is even easier (and also has a Tortoise overlay). We use SVN locally, but I haven't got around to changing it at SourceForge for JUMBO...

CVS or SVN is also very good for dealing with other documents (if they are in ASCII). Take a look. You'll find you are working in a community, not alone...

(Oh - and why am I using Windows? Don't ask)

Unit tests or get a life

My unit tests have taken over my life this weekend. This post is a brief respite...

Unit tests are one of the great successes of modern approaches to programming. I've written briefly about them before and Egon has an introduction to how they are used in Eclipse. Unit tests are often seen as part of "eXtreme Programming" though it's possible to use them more or less independently. Anyway, I was hoping to add some molecule-building extensions to JUMBO, and started on the CMLMolecule class. Even if you have never used unit tests the mental attitudes may be useful.
I then discovered that only half the unit tests for this class were written. Purists will already have judged me on this sentence - the philosophy is to write the tests before the code. For the CMLMolecule class this makes excellent sense. Suppose you need a routine:

Point3 molecule.getCentroid3()

then you first write a test. You might create a test molecule (formally a "fixture"), something like:

CMLMolecule testMolecule = makeTestMolecule();
Point3 centroid = testMolecule.getCentroid3();
assertEquals(new Point3(0.1, 0.2, 0.3), centroid, EPSILON);

I haven't written the actual body of getCentroid3(). I now run the test and it will fail the assertion (because it hasn't done anything). The point is I have a known molecule, and a known answer. (I've also had to create a set of tests for Point3 so I can compare the results).
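For readers who have never seen the cycle in action, here is a compilable toy version of the idea - the Point3 and centroid code below are simplified stand-ins written for this sketch, not the real JUMBO classes:

```java
// A self-contained toy version of the test-first cycle described above.
// Point3 and centroid() are simplified stand-ins, not the real JUMBO classes.
public class CentroidDemo {

    // A bare-bones 3D point with an epsilon comparison.
    static class Point3 {
        final double x, y, z;
        Point3(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
        boolean equalsWithin(Point3 p, double eps) {
            return Math.abs(x - p.x) < eps && Math.abs(y - p.y) < eps
                && Math.abs(z - p.z) < eps;
        }
    }

    // The routine under test: the centroid is the mean of the coordinates.
    static Point3 centroid(Point3[] atoms) {
        double sx = 0, sy = 0, sz = 0;
        for (Point3 a : atoms) { sx += a.x; sy += a.y; sz += a.z; }
        return new Point3(sx / atoms.length, sy / atoms.length, sz / atoms.length);
    }

    public static void main(String[] args) {
        // The "fixture": a known molecule with a known answer.
        Point3[] fixture = { new Point3(0, 0, 0), new Point3(0.2, 0.4, 0.6) };
        Point3 expected = new Point3(0.1, 0.2, 0.3);
        if (!centroid(fixture).equalsWithin(expected, 1e-10)) {
            throw new AssertionError("centroid test failed");
        }
        System.out.println("centroid test passed");
    }
}
```

Delete the body of centroid() and the check fails; implement it correctly and it passes - the whole test-first rhythm in miniature.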

You won't believe it till you start doing it but it saves a lot of time and raises the quality. No question. It's simple, relatively fun to do, and you know when you have finished. It's tempting to "leave the test writing till later" - but you mustn't.

Now why has it taken over my weekend? Because I'm still playing catch-up. At the start of the year JUMBO was ca 100,000 lines of code and there were no tests. I have had to retro-fit them. It hasn't been fun, but there is a certain satisfaction. Some of it is mindless, but that has the advantage that you can at least watch the football (== soccer) or cricket. (Cricket is particularly good because the cycle between actions (ca 1 minute) often coincides with the natural edit/compile/test cycle.)

It's easy to generate tests - Eclipse does it automatically and has made 2135 at present. If I don't add the test code I will get 2135 failures. Now JUnit has a green bar ("keep the bar green to keep the code clean") which tells you when the tests all work. Even 1 failure gives a brown bar. The green bar is very compelling. It gives great positive feedback. It's probably the same limbic pathways as Sudoku. But I can't fix 2135 failures at one sitting. So I can bypass them with an @Ignore. This also keeps the bar green, but it's a fudge. I know those tests will have to be done some day.
So 2 days ago JUMBO had about 240 @Ignores. Unfortunately many were in the CMLMolecule and MoleculeTool classes. And the more tests I created, the more I unearthed other @Ignores elsewhere.

So I'm down to less than 200 @Ignores now. I've found some nasty bugs. A typical one is writing something like:

if (a == null)

when I mean

if (a != null)

This is exactly the sort of thing that Unit tests are very good at catching.
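A toy reproduction of the inverted null check, to show how a one-line test catches it at once - the method names here are invented for the sketch:

```java
// A toy reproduction of the inverted null-check bug described above.
// The names are invented for this sketch.
public class NullGuardDemo {

    // Buggy version: == where != was meant, so real input is never measured.
    static int lengthBuggy(String s) {
        if (s == null) {        // should be: if (s != null)
            return s.length();  // would throw NullPointerException anyway
        }
        return 0;
    }

    // Corrected version.
    static int lengthFixed(String s) {
        if (s != null) {
            return s.length();
        }
        return 0;
    }

    public static void main(String[] args) {
        // A trivial unit test exposes the inversion immediately:
        // the buggy version returns 0 for "abc" instead of 3.
        System.out.println("buggy: " + lengthBuggy("abc"));   // prints buggy: 0
        System.out.println("fixed: " + lengthFixed("abc"));   // prints fixed: 3
    }
}
```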

So here's my latest test:

[screenshot omitted]

Oh dear! It's brown. I'll just @Ignore that one test and...

[screenshot omitted]

and it's GREEN!

(But I haven't fooled anyone. The @Ignore is just lying in wait to bite me in the future...)

Do we really need discovery metadata?

Many of the projects we are involved in and interact with are about systematising metadata for scientific and other scholarly applications. There are several sorts of metadata; I include at least rights, provenance, semantics/format, and discovery. I'll go along with the first three - I need to know what I can do with something, where it came from and how to read and interpret it. But do we really need discovery metadata?

Until recently this has been assumed as almost an axiom - if we annotate a digital object with domain-specific knowledge then it should be easier to find and there should be fewer false positives. If I need a thesis on the synthesis of terpenes then surely it should help if it is labelled "chemistry", "synthesis" and "terpenes". And it does.

But there are several downsides:

  • It's very difficult to agree on how to structure metadata. This is mainly because everyone has a (valid) opinion and no-one can quite agree with anyone else. So the mechanism involves either interminable committees in "smoke-filled rooms" (except without the smoke) or self-appointed experts who make it up themselves. The first is valuable if we need precise definitions and possibly controlled vocabularies but is not normally designed for discovery. The second - as is happening all over the world - leads to collisions and even conflict.
  • Authors won't comply. They either leave the metadata fields blank, make something up to get it over with, or simply abandon the operation.
  • It's extremely expensive. If a domain expert is required to reposit a document it doesn't scale.

So is it really necessary? If I have a thesis I can tell without metadata just by looking whether it's about chemistry (whatever language it's in), whether it describes synthesis and whether it contains terpenes. And so can Google. I just type "terpene synthesis" and all the first-page results are about terpene synthesis.

The point is that indexing full text (or the full datument :-) ) is normally sufficient for most of our content discovery. Peter Corbett has implemented Lucene - a free-text indexer - and done some clever things with chemistry and chemical compounds. That means his engine is now geared up to discover chemistry on the Web from its content. I'll speculate that it's more powerful than the existing chemical metadata...
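The principle needs nothing more than an inverted index over the words of each document. This toy sketch (plain Java, standing in for what Lucene does at scale) shows full text alone answering a terpene query with no hand-made discovery metadata:

```java
import java.util.*;

// A toy full-text index: each word maps to the set of documents containing it.
// Lucene does this at scale (plus ranking, stemming, etc.); this sketch just
// shows that the content alone answers the query.
public class ToyIndex {
    final Map<String, Set<Integer>> index = new HashMap<>();
    final List<String> docs = new ArrayList<>();

    void add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String word : text.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
        }
    }

    // AND query: the documents containing every query word.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String word : query.toLowerCase().split("\\W+")) {
            Set<Integer> hits = index.getOrDefault(word, Collections.<Integer>emptySet());
            if (result == null) result = new TreeSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }

    public static void main(String[] args) {
        ToyIndex ix = new ToyIndex();
        ix.add("A thesis on the synthesis of terpenes"); // doc 0
        ix.add("Baked beans: an economic survey");       // doc 1
        System.out.println(ix.search("terpenes synthesis")); // prints [0]
    }
}
```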

So it was great to see on Peter Suber's blog (I can't stop blogging it!):

Full-text cross-archive search from OpenDOAR

OpenDOAR has created a Google Custom Search engine for the 800+ open-access repositories in its directory. From today's announcement:

OpenDOAR - the Directory of Open Access Repositories - is pleased to announce the release of a trial search service for academics and researchers around the world....

OpenDOAR already provides a global Directory of freely available open access repositories that hold research material: now it also offers a full-text search service from this list of quality-controlled repositories. This trial service has been made possible through the recent launch by Google of its innovative and exciting Custom Search Engine, which allows OpenDOAR to define a search service based on the Directory holdings.

It is well known that a simple full-text search of the whole web will turn up thousands upon thousands of junk results, with the valuable nuggets of information often being lost in the sheer number of results. Users of the OpenDOAR service can search through the world's open access repositories of freely available research information, with the assurance that each of these repositories has been assessed by OpenDOAR staff as being of academic value. This quality controlled approach will help to minimise spurious or junk results and lead more directly to useful and relevant information. The repositories listed by OpenDOAR have been assessed for their full-text holdings, helping to ensure that results have come from academic repositories with open access to their contents.

This service does not use the OAI-PMH protocol to underpin the search, or use the metadata held within repositories. Instead, it relies on Google's indexes, which in turn rely on repositories being suitably structured and configured for the Googlebot web crawler. Part of OpenDOAR's work is to help repository administrators improve access to and use of their repositories' holdings: advice about making a repository suitable for crawling by Google is given on the site. This service is designed as a simple and basic full-text search and is intended to complement and not compete with the many value-added search services currently under development.

A key feature of OpenDOAR is that all of the repositories we list have been visited by project staff, tested and assessed by hand. We currently decline about a quarter of candidate sites as being broken, empty, out of scope, etc. This gives a far higher quality assurance to the listings we hold than results gathered by just automatic harvesting. OpenDOAR has now surveyed over 1,100 repositories, producing a classified Directory of over 800 freely available archives of academic information.

Comment. This is a brilliant use of the new Google technology. When searching for research on deposit in OA repositories, it's better than straight Google, by eliminating false positives --though straight Google is better if you want to find OA content outside repositories at publisher or personal websites. It's potentially better than OAIster and other OAI-based search engines, by going beyond metadata to full-text --though not all OA repositories are configured to facilitate Google crawling. If Google isn't crawling your repository, consult OpenDOAR or try these suggestions.

I agree! Let's simply archive our full texts and full datuments. We'll never be able to add metadata by human means - let the machines do it. And this has enormous benefits for subjects like chemistry - Peter's OSCAR3 can add chemical metadata automatically in great detail.

So now all we want is chemistry theses... and chemistry papers ... in those repositories. You know what you have to do...

Build your own Institutional Repository

I have alluded to Institutional Repositories (IRs) before. Although I am an enthusiast and early adopter (having reposited 250,000 digital objects) a year ago I would have said they were still a minority activity. Not now. Universities and related HEIs are all implementing IRs even though I suspect the majority of staff in some have never heard of them (we'll reveal details for chemistry later). As well as the (IMO seriously overhyped) commercial tools there are several Open Source tools (EPrints, DSpace, Fedora (not the Linux one)) and hosted services (e.g. from BioMed Central). There are reasonable paths for institutions of different sizes to take the plunge. So it was nice to see on Peter Suber's blog:

Implementing an institutional repository

Meredith Farkas has blogged some notes on Roy Tennant's talk on institutional repositories at Internet Librarian 2006 (Monterey, October 23-25, 2006). Excerpt:

I knew that Roy would be likely to give a very practical nuts-and-bolts introduction to developing institutional repositories and I was certainly not disappointed.

Why do it?

  • Allows you to capture the intellectual output of an institution and provide it freely to others (pre-prints, post-prints, things that folks have the rights to archive). Many publishers allow authors to publish their work in archives either as a pre-print or after the fact.
  • To increase exposure and use of an institution’s intellectual capital. It can increase their impact on a field. More citations from open access and archived materials.
  • To increase the reputation of your institution.

How do you do it? ...

Software options....

Key decisions

  • What types of content do you want to accept (just documents? PPT files, lesson plans, etc?)
  • How will you handle copyright?
  • Will you charge for service? Or for specific value-added services?
  • What will the division of responsibilities be?
  • What implementation model will you adopt?
  • You will need to develop a policy document that covers these issues and more.

Implementation models

  • Self archiving – ceaselessly championed by Stevan Harnad. Authors upload their own work into institutional repositories. Most faculty don’t want to do this.
  • Overlay – new system (IR) overlays the way people normally do things. Typically faculty give their work to an administrative assistant to put it on the Web. Now, the repository folks train the admin assistant to upload to the repository instead. Content is more likely to be deposited than if faculty have to do it....
  • Service provider – not a model for a large institution. Library will upload papers for faculty. The positive is that works are much more likely to be deposited. The negative is that it’s a lot of work and won’t scale....

Discovery options: Most traffic comes from Google searches, but only for repositories that are easily crawlable and have a unique URL for each document. OAI aggregators have millions and millions of records. They harvest metadata from many repositories. Some may come direct to the repository, but most people will not come there looking for something specific. Citations will drive traffic back to the repository.

Barriers to success:

  • Lack of institutional commitment
  • Faculty apathy (lack of adoption and use)
  • If it is difficult to upload content, people won’t use it.
  • If you don’t implement it completely or follow through it will fail.

Strategies for Success

  • Start with early adopters and work outward.
  • Market all the time. Make presentations at division meetings and stuff
  • Seek institutional mandates
  • Provide methods to bulk upload from things already living in other databases
  • Make it easy for people to participate. Reduce barriers and technical/policy issues.
  • Build technological enhancements to make it ridiculously easy for people to upload their content....

This is a good summary. I'd add that much of the early work in IRs has come from subjects where the "fulltext" is seen as the important repositable [almost a neologism!]. We're concerned with data and my repositions have been computations on molecules. I also admit that even as an early adopter I don't self-archive much. This is mainly because the publishers don't allow me to. In some cases I cannot even read my own output on the publisher's website as Cambridge doesn't subscribe to the journal online.

I have just realised what I have written! The publisher does not allow me to read my own work! We accept this?

The message from Open Scholarship was that voluntary repositing doesn't work. There has to be explicit carrot and/or stick.

So while you are implementing your own IR make sure that you can reposit data as well as fulltext. This will be a constant theme of this blog!

Open Map Data?

From Peter Suber's Blog:

Mike Cross, Ordnance Survey in the dock again, The Guardian, October 26, 2006. Excerpt:

On one side of an electoral boundary, people might buy sun-blushed tomatoes; on the other, economy baked beans. Retailers like to know such things, so data from the 2001 census is of great commercial interest - and also the subject of the latest controversy in the Free Our Data debate.

Last week, the Association of Census Distributors filed a complaint against a state-owned entity, Ordnance Survey, over the conditions placed on the re-use of intellectual property in census data. It is the second time this year that the national mapping agency has been the subject of a complaint to the government's Office of Public Sector Information.....

Technology Guardian's Free Our Data campaign proposes that the best way to avoid such disputes is for basic data sets collected at taxpayers' expense to be made freely available for any purpose (subject to privacy and national security constraints). While this would involve more direct funding for agencies such as Ordnance Survey, the economy as a whole would gain. At the moment, says [Peter Sleight of Target Marketing Consultancy], the national good is compromised because of a single trading fund's commercial needs.

For non-UK readers: the Guardian is a liberal national newspaper (sharing with this blog a reputation for typos). The Ordnance Survey is the Government organization responsible for UK maps.

In our monthly CB2 meetings, where we work out how to put the world to rights, freedom of map data is a frequent topic. I regard maps as "open data" - they are part of the public infrastructure of existence. It was interesting that at the UK eScience meeting (== Grid == cyberinfrastructure) a lot of the applications were map-based and almost all used the Google Maps API. So it makes absolute sense for maps to be part of the Open Data definition.

It's also good to see a newspaper championing freedom - we can almost prove it makes economic sense to remove this part of the anticommons.

Commons in the pharma industry?

I was excited to see the following in Peter Suber's Open Access Blog:


17:54 24/10/2006, Peter Suber, Open Access News
Pfizer is exploring data sharing with Science Commons. There are no details in this interview with David de Graaf of Pfizer’s Research Technology Center, but it's a promising prospect to watch. Here's the key passage:

When you encounter a knotty problem or roadblock in terms of your work in systems biology, who do you call among your peers in the industry?

...Everybody keeps running into the same toxicity and we can’t solve it. Actually putting our heads together and, more importantly, putting our data together may be something that’s worthwhile, and we’re exploring that together with the folks at Science Commons right now, as well as the folks at Teranode.

I have argued elsewhere that the current model of pharmaceutical sponsorship depletes the scientific Commons - this is a wonderful opportunity to change the model and enhance it. (I haven't read the interview - the link seems broken.) I used to work in the pharma industry - it is well known that it is very difficult to discover safe and useful drugs. Each company tackles very similar problems to the others and each runs into the same problems. Most drug projects fail. If one company has slightly fewer failures they might count that as a competitive advantage, but if we look at it from the global view of the commons (post) - or the anticommons (post) - of the industry it is still a tragedy.

In some industries (e.g. luxury goods) a failure only costs the shareholders; in pharma it often results in poor or unsafe drugs. The drug companies have to collect safety information (I have worked with WHO on this issue) but much of this is secret. Since we are the test vehicles for new compounds, is there not an overwhelming case for making toxicity information public and seeing this as a pre-competitive activity?

Rich Apodaca: Closed Chemical Publishing and Disruptive Technology

Rich Apodaca, a founder member of the Blue Obelisk, has a thoughtful blog, Depth-First. Besides the interesting stuff on programming - especially Ruby - there are useful injections from outside chemistry and IT. Here's one:

The Directory of Open Access Journals (DOAJ) currently lists 2420 Open Access scholarly journals. Of these, 52 currently fall under the category of chemistry. Although the organic chemistry subcategory only currently lists three journals, the general chemistry category actually contains several journals containing organic chemistry content, such as the Bulletin of the Korean Chemical Society, Chemical and Pharmaceutical Bulletin, and Molbank.

Clearly, the chemistry journals included in DOAJ's listings would not be considered to be in "the mainstream" by experts in the field. And that's exactly the point. Innovation always happens at the margins. As Clayton Christensen puts it in his landmark book, The Innovator's Dilemma:

As we shall see, the list of leading companies that failed when confronted with disruptive changes in technology and market structure is a long one. ... One theme common to all of these failures, however, is that the decisions that led to failure were made when the leaders in question were widely regarded as among the best companies in the world.

Replacing the word "company" with "scientific journal" leads to an important hypothesis about the future of scientific publishing.

And on the subject of disruptive innovation itself, Christensen writes:

Occasionally, however, disruptive technologies emerge: innovations that result in worse product performance, at least in the near-term. Ironically, in each of the instances studied in this book, it was disruptive technologies that precipitated the leading firms' failure.

It seems very unlikely that scientific publishing operates according to a different set of rules than any other technology-driven business. The coming wave of disruptive innovation will be dramatic, and the outcome completely predictable.

PMR: and elsewhere he points to a possible disruptive technology...

Like everything else in information technology, the costs of setting up and maintaining a scientific journal are rapidly approaching zero. A growing assortment of Open Source journal management systems is available today. Recently, I was introduced to one of these packages by Egon Willighagen as part of my involvement with CDK News.

Open Journal Systems

Open Journal Systems (OJS) automates the process of manuscript submission, peer review, editorial review, article release, and article indexing. All of these elements are, of course, cited as major costs by established publishers intent on maintaining their current business models.

OJS appears to work in much the same way as automated systems being run by major publishers. In fact, OJS is already in use by more than 800 journals written in ten languages worldwide.

Did I mention that OJS is free software - as in speech? The developers of OJS have licensed their work under the GPL, giving publishers the ability to control every aspect of how their journal management system operates. Standing out from the crowd will no doubt be an essential component of staying competitive in a world in which almost anyone can start their own journal.


And there's even better news: OJS has competition. Publishers can select from no fewer than seven open source journal management systems: DPubs; OpenACS; GAP; HyperJournal; SciX; Living Reviews ePubTk; and TOPAZ.

The Last Word

Open Source tools like Open Journal Systems have the potential to radically change the rules of the scientific publication game. By slashing the costs of both success and failure in scientific publication to almost zero, these systems are set to unleash an unprecedented wave of disruptive innovation - and not a moment too soon. What are the true costs of producing a quality Open Access scientific publication - and who pays? Will the idea of starting your own Open Access journal to address deficiencies with existing offerings catch on, especially in chemistry, chemical informatics, and computational chemistry? Before long, we will have answers to these questions.

PMR: Yes - these ideas are looking increasingly relevant and believable. In the same vein Steve Heller has wittily and irreverently shown the immense power of disruptive technology. Two years ago, when the Blue Obelisk was founded, it probably looked like the margins. Does it still? Many will think yes - I don't :-)

Silicos contributes Commercial Open Source - thank you

It is very uncommon for commercial organizations in chemoinformatics to make any of their material Open Source (unlike the contributions of many IT companies - e.g. Eclipse, NetBeans, etc.). So I was very pleased to see an announcement of Open Source [BSD] chemoinformatics software on the Blue Obelisk list:

SiMath is Silicos' open source library for the manipulation of data matrices and the subsequent mathematical and statistical modeling. SiMath tries to provide a simple C++ API that is data-matrix centered and that covers the model building procedure from data preprocessing to training and evaluation of the model. The goal is to provide a library that can be easily integrated into standalone applications.

The rationale of SiMath is not to invent the wheel again but to integrate available open source packages and also newly implemented algorithms into one comprehensive library. Several well established libraries exist nowadays, but they all have a different interface and work with their own matrix representation. These tools are incorporated into SiMath and adapted such that their interface is consistent over all tools. For instance, all clustering algorithms are initiated by defining a set of parameters and the actual clustering is done by calling the cluster method with the data matrix. Currently, SiMath contains modules for PCA (or SVD), matrix discretisation, SVM training and evaluation, several clustering algorithms, self-organising maps and several general mathematical functions. More information about SiMath and how to download the source code can be found on the Silicos website.

Silicos is a chemoinformatics-based biotechnology company empowering virtual screening technologies for the discovery of novel compounds in a variety of disease areas.

This makes sense. The technology here is common to many applications and, as Hans De Winter says, it is foolish to reinvent the wheel. These are exactly the sort of components we need in the discipline. Because they are in C++ and many of us use Java it may make sense to deploy these as Web services (REST), as the message overhead is likely to be smaller than the computational cost.

The Blue Obelisk mantra - Open Data, Open Source, Open Standards welcomes contributions in any of these areas.

Chemistry Theses: How do you write them?

As I have shown it is hard and lossy to recover information from theses (or anything else!) written in PDF. In unfavourable cases it fails completely. I have a vision which I'll reveal in future posts, but here I'd like to know how you wrote, write (or intend to write) your theses. This is addressed to synthetic chemists, but other comments would be useful. I have a real application and potential sponsorship in mind.

Firstly I guess that most of you write using Word. Some chemists use LaTeX (Joe, who is just writing up, told us that the most important thing he would do differently if he started his PhD again would be to use LaTeX). I would generally agree with this, although I am keen to see - in the future - what can be done with Open Office and Open Document tools which will use XML as the basis. The unpredictable thing is how quickly OO arrives and what authoring support it has.

A main reason for using Word is that it supports third-party tools whose results can be embedded in a Word document. The most important of these are molecular editors (such as ChemDraw (TM) and ISIS/Draw (TM)). These are commercial products and are closed source. They also generally use binary formats which are difficult to untangle. (When these formats are embedded in Word they are impossible to decode - the Word binary format is not documented and efforts to decipher it are incomplete.) In some cases I could extract many (but not all) of the ChemDraw files in a document. There are also MS tools such as Excel.

I'd be interested to know if OO and/or the release of MS's XML format has changed things and what timescales we can reasonably expect for machine-processable compound documents. But for the rest of the discussion I'll assume that the current practice is Word + commercial tools. (In later posts I shall try to evangelise a brighter future...)

The typical synthetic chemistry thesis contains inter alia:

  • discursive free text describing what was done and why
  • enumerated list of compounds (often 200+) with full synthetic details and analytical data.

The free text looks like:

[screenshot omitted]

============ OR ===========

[screenshot omitted]

===== the compound information looks like ========

[screenshot omitted]
Note that compounds are identified by a bold identifier (e.g. 38) which normally increases in serial order throughout the text. This is fragile, in that the insertion of a new number requires manual renumbering throughout the text (this is confirmed by various chemical gurus). Compounds are drawn in the middle of free text sections, and again in the compound information. There are no tools to enforce consistency between the numbering and the diagrams. Moreover information such as reagents, yields, physical and analytical data is repeated in several places. These have to be manually transcribed and (unless you tell me differently) this is a tedious, frustrating and error-prone process.

Moreover at this stage of writing the thesis the student has to assemble all the data for the 200 compounds. Are they all there? Could any of the spectra be muddled? Is that figure in the lab book a 2 or a 7? Heaven help you if a spectrum is missing and the compound has now decomposed into a brown oil or got lost in the great laboratory flood. Of course none of this ever happens...

So are you all happy with how you authored or will author your thesis? I haven't even touched on how peaks are transcribed from spectra and how the rigmarole of spectra peaks has to be authored and formatted. If so, I'll shut up. Else I will make some serious and positive suggestions in a later blog.