Chemistry, Chess and Computers

Sometime in the 1970s the Amer. Chem. Soc. published a review of Computers in Chemistry (I cannot remember the date or title and I’ve lost my copy), and it has remained an inspiration ever since. It summarised the work of the Stanford (DENDRAL, CONGEN) and Harvard (LHASA) groups on the application of artificial intelligence to chemistry (structure elucidation and organic synthesis). Both have heavy elements of problem-solving coupled with pattern recognition. The systems effectively contained:
* a knowledge base of chemistry
* a set of heuristics (rules)
* formal deterministic procedures (e.g. tree searches).
The accomplishment was remarkable. The systems worked. They weren’t as good as a professional synthetic chemist, but in small areas they were better than me. It seemed obvious to me that with sufficient work on all components, but especially the knowledge base, these systems would be able to do organic chemistry at the level of all except the best in the field. Certainly I expected that with the passage of 30 years the chemist/machine combination would be common. (Admittedly I sometimes believed too much hype about AI – now that I work in one branch (language processing) I know how difficult it is.)
At the same time very similar work started to be done on chess. Again, when the first programs came out I could easily beat them (and I am a weak player). But gradually they improved and now they can beat essentially all humans.
It seemed to me that chemistry and chess would be quite similar. Both are formal systems, too complex for brute force, where a knowledge base is essential. In chess all significant games have been captured in a database, and a large number of endgames have been exhaustively worked out. What is interesting is that the chess grandmasters have formed a symbiosis with computer programmers and machines and are still exploring what aspects machines can and cannot do. (I’m not an expert here and comments would be welcome.)
By contrast there has been no significant work on chemistry and AI in, perhaps, 15 years. When I was in the pharma industry my boss used to speak of “another outbreak of Lhasa fever” (sic) – meaning that someone had suggested that machine synthesis should be explored. The Lhasa organisation has effectively stopped supplying synthesis methodology and turned to toxicology prediction (albeit highly valuable).
So I feel considerable sadness. I am sure that if synthetic chemists had embraced computers in the same way as chess players we would be significantly better off. This is, of course, an act of faith, but it’s borne out by the knowledge revolution taking place in many disciplines. The bioscientists are eagerly exploring the S/semantic W/web with formal ontologies and reasoning – another approach to “AI”.
I’ve just been at the UK eScience (cyberinfrastructure) meeting for 3 days. (I’ll probably hark back to it in future posts.) One keynote was given by Stephen Emmott (Director, European Science Programme, Microsoft Research, Cambridge). Stephen talked about 2020 and gave a vision in which computing could be based on biology – where molecular computers have already been injected into cells. Microsoft is hiring bioscientists who are also computer-able (i.e. they can make their ideas happen through code, rather than requiring comput/er/ational scientists to write the code for them). He stressed that he did not want a mixture of computer scientists and biologists; he wanted scientists with a mixture of computing and biology. Since his future involves molecules, maybe he’s also hiring chemist/computerScientists…
But we are actively discouraging the sort of work envisioned by Lederberg and Corey 30 years ago. There are exceptions – I spent 3 hours with my colleague Steve Ley discussing how we can bring modern informatics into synthetic chemistry. I am sure that our biggest problem is the lack of an immediate, Open, global knowledge base in chemistry. It’s all there on paper, but getting it into a machine is a mighty task. It will need new methods of computing – including social computing – and I’ll explore these ideas systematically in this blog. We might even achieve something with your help.
So I am pleased to see the quality of the chemical blogs, even if Tenderbutton is retiring. With lightweight mashup-like approaches we may be able to use the new approaches to informatics being developed in social computing. Biology has control of its knowledgebase – it had to fight to keep it in the genome information wars – but it’s vibrant and innovative. Chemistry has surrendered its knowledgebase to commercial and quasi-commercial interests who point in the direction of pharma rather than the information revolution. I will show in a week or two how we might start regaining some of it.
P.

Posted in chemistry, open issues, programming for scientists | 8 Comments

The cost of decaying scientific data

My colleague John Davies, who provides a crystallographic service for the department, has estimated that the data for 80% of crystal structures (in any chemistry department) never leave the laboratory. They are locally archived, perhaps on CDROM, perhaps on a local or departmental machine. With the passage of time – changes in staff, organisation, machines – information decays and it is likely that crystallographic data will be systematically lost.

Recently a number of UK groups have been funded by JISC – the Joint Information Systems Committee – to research the development of digital repositories. Three groups have been collaborating in chemistry, with a strong emphasis on crystallography and spectroscopy. This involves all aspects – building software, designing metadata specs, and understanding the way chemists work and think. We have found that the social aspects are at least as important as the technical – I won’t elaborate here yet as these will be reported at:

An eBank / R4L / SPECTRa Joint Consultation Workshop.
Digital repositories supporting eResearch: exploring the eCrystals
Federation Model

Why is it important to archive the data? Isn’t normal academic publication (including theses) sufficient? Isn’t it very costly and a waste of money that could be spent on proper research?
Well, the crystallographic community has archived its data for many years, and research on this data alone has given rise to hundreds or even thousands of papers datamining this resource. Without this, chemistry would be very much poorer as we would have little in the way of molecular or crystal-structure systematics.
So what is the cost of the unpublished data? To determine the structures at commercial rates would cost about USD 1500–5000 each for the size of structures currently published. If a laboratory does 500 structures a year, and full economic costs are half the commercial rate (this is just a guess), we are looking at roughly half a million dollars per year to do crystal structures in a chemistry department. (I suspect the numbers are on the low side – I’d be interested in comments.)
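The arithmetic above can be sketched in a few lines; the rates, throughput and cost fraction are the guesses from the text, not measured figures:

```python
# Back-of-envelope estimate of the annual cost of crystal structures in
# one chemistry department, using the guessed figures from the text.
rate_low, rate_high = 1500, 5000   # USD per structure at commercial rates
structures_per_year = 500          # assumed departmental throughput
economic_fraction = 0.5            # guess: full economic cost = half commercial

low = structures_per_year * rate_low * economic_fraction
high = structures_per_year * rate_high * economic_fraction
print(f"${low:,.0f} to ${high:,.0f} per year")  # $375,000 to $1,250,000 per year
```

The half-million-dollar figure quoted in the text sits at the low end of this range.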
Allowing that some of the material has been published as comments in chemical papers, I suspect that the information from quite a high proportion of the structures is never published in any form. How easy is it to find information in current theses, especially if you don’t know it’s there?
I think I would be safe in saying that worldwide hundreds of millions of dollars’ worth of crystallographic data is lost each year. For spectra and synthetic chemistry it will be at least 10 times greater. Many synthetic chemists say they are interested in failed reactions – and these are almost never published!
If funders are aware of this they should be concerned about the loss. Funders are increasingly proactive in requiring funded research to be Openly accessible. The Wellcome Trust is among the strongest proponents:

Robert Terry on Open Access

and a quote

The Trust provides additional funding to cover the costs relating to article-processing charges levied by publishers who support this model.
• Approximately 1% of the research grant budget would cover costs of open access publishing.

Posted in data, open issues | 3 Comments

Moderatorial

A recent anonymous comment on this blog read

In that case, perhaps you should have parted with the observation “ACS is a problem”.
:-), but partly serious.

I think the tone of this is out of keeping with this blog and I am therefore writing a “moderatorial”. This was a term I used (I doubt it was a neologism) when Henry and I ran the XML-DEV list. A moderatorial (example) was to guide the list, but not constrain it. Although this is not a list, anyone can post a comment and I will automatically post it whether or not I agree with the sentiment.
However I wish to avoid flame wars and ad hominem remarks and outline my own philosophy on this blog.
I try to post statements which are accurate and not unnecessarily emotive. I do not keep to a strict Wikipedian-like Neutral Point of View (NPOV) in my posts, and I use the blog for advocacy. However I do not wish the comments to be one-sided and invite a range of views – the result might indeed be neutral. I take as an example the excellent blog from Peter Suber – he is analytical and incisive. A typical example read:

(From ACS press release)

In October, American Chemical Society journal authors will have the option of paying to immediately provide free online access to their articles on the society’s website. Authors will also be able to post electronic copies of their sponsored articles on personal websites and institutional repositories. Fees for the program will range from $1,000 to $3,000 per paper, depending on whether the author is an ACS member or is affiliated with an institution that subscribes to ACS journals.

Comments (from PeterS).
(2) See my (PeterS) nine questions for hybrid journal programs, just published on Sunday. Of the nine, the ACS announcements give good and welcome answers to two: it will let authors deposit articles in repositories independent of ACS and it will not retreat on its green self-archiving policy. It gives unwelcome answers to two more: it will not let participating authors retain copyright and it does not promise to reduce its subscription prices in proportion to author uptake. (Hence, it plans to use the “double charge” business model.) It leaves us uncertain on the remainder: Will it let participating authors use OA-friendly licenses? Will it waive fees in cases of economic hardship? Will it force authors to pay the fee if they want to comply with a prior funding contract mandating deposit in an OA repository? Will it lay page charges on top of the new AuthorChoice fee?
(3) The ACS has been a bitter opponent of OA through PubChem and FRPAA. But I don’t believe it ever opposed the very idea of charging author-side fees to support the costs of a peer-reviewed journal, as some other hybrid journal publishers did before adopting the hybrid model.

Permanent link to this post

This is a style I strive to emulate. PeterS has a position of advocacy (Open Access through various models) but reports accurately and without ad hominem arguments.
In the present case it is clear that the devil is in the details. Whether I welcome or criticize the ACS hybrid policy depends on whether it enhances the free use of data. It sounds dubious from PeterS’s report, but hopefully there will be more clarity from all parties.
In the case of control of published data – my fundamental position is that scientific data belongs to the commons and that there is good legal and moral precedent for this. The stronger this basis, the stronger the case. Open Access is complex and, I believe, changing so that entrenched positions are not always helpful. Although I wish for total Open Access I am prepared to work with publishers operating different models. My engagement is dedicated to trying to make scientific data Open.
I have frequently been asked to speak at the ACS meetings and have accepted. My advocacy for Open Data is robust but hopefully not personal. People and organisations are flexible. Thus, for example, I gave a talk at ACS last year in the Open Access session. There were presentations for and against Open Access and (in my opinion) the Open ones were better presented and more compelling. But I still listened carefully to all arguments. My own presentation was a demonstration of the power of data and the value of Opening it. As a result Pieter Borman invited me to talk at the annual meeting of the STM publishers in Frankfurt. I went with some doubt as to whether my arguments would be taken on board – but I had a good audience – and I heard (though I can’t find details) that STM publishers have recommended that scientific data should be copyright free (confirmation is welcomed).
So I don’t take entrenched positions about people and organisations, but about issues. The Firefox/downloading episode is a problem – I have highlighted it – and hope that the factual analysis makes a useful contribution. It might not change policy directly but it should help to avoid misunderstandings.
Finally therefore I shall directly accept all non-spam comments, but reserve the right to issue moderatorials if I feel the comments might ignite flames.
P.

Posted in open issues | Leave a comment

OSCAR reviews a journal

In the last post I described OSCAR, which can review and extract chemical data from published articles. Here is how I used it to review the Beilstein Journal of Organic Chemistry.
The BJOC, unlike most other chemistry journals, encourages readers’ comments, so I thought OSCAR would like to add some. Since I did this on a Saturday none of the comments have been moderated (or at least none have appeared). I first added comments to the journal announcement about what I intended to do, and gave links to the OSCAR home page. I then started at the first paper and found the “Additional File 1” which contains a pointer to the chemical data. (The process seems overly convoluted, and I have commented on this.) I first downloaded OSCAR (the adventurous among you can try this and the following), started it (click the jar file), opened the BJOC (Word) file with the data, selected all of it and pasted it into OSCAR.
This is a very well presented file (and worthy of the authors’ organisations – GSK and Leeds) – not all chemical manuscripts are as well prepared. OSCAR reveals only two errors, which are missing commas. (These are more important than they sound as we rely on them for parsing.) Typical results can be seen in the previous post. I therefore added this to the comments section for the paper. I assume the comments will appear in a day or two. I don’t know whether the authors will be automatically informed – I expect so – or whether the deposited data can then be corrected either by authors or editorial staff. If so, this is a real mechanism for cleaning up the current literature. Of course if the authors use OSCAR in future they will get a clean sheet!
I then applied OSCAR to all the papers in the Journal that contained chemical synthetic data – about 27. There is no standard place for the data – sometimes they occur in free text and sometimes in “Additional File n” (this name is not very helpful and I have suggested it should be changed to something with chemical semantics). I commented on the variability in navigation which made it difficult for me (and very difficult for OSCAR if it wished to review the journal systematically). OSCAR discovered several important errors – for example a chemical formula was wrong (this matters) and many suggestions about style improvements. (I did not comment on these as OSCAR’s rules don’t yet include BJOC policy). I also noted that some papers didn’t include data. I did not comment on the chemistry at all – its merit or its correctness – as I am not a specialist except on data. But perhaps this will stimulate expert readers to do so in future.
OSCAR raised concerns in almost all papers – ranging from punctuation to incorrect formulae. I stress that this is common in ALL chemistry papers – and should not be used to measure BJOC against others. They all need cleaning up.
I made additional comments on the accessibility of crystallographic data – these were not added as supplemental data and I argue strongly that they should be. I’ll write later about this.
I am hoping this will be seen as positive critiquing – it would be in compsci or crystallography. Certainly the adoption of data standards will make an enormous impact on the standard and re-usability of chemistry.
(Note: Our two summer students this year – Richard Moore and Justin Davies – again financed by the RSC, have been refactoring OSCAR; we call this OSCAR-Data. OSCAR-Data uses OPSIN (OSCAR3) and allows for several inputs – SciXML, HTML – converting them into CML and then applying a set of custom rules (which could be publisher-specific).)

Posted in chemistry, open issues | 1 Comment

OSCAR, the chemical data checker

I spent yesterday reviewing the data in BJOC (the Beilstein Journal of Organic Chemistry) (articles). This is a new (ca. 1 year) and important journal as it is the first free-to-author and free-to-read journal in chemistry, supported by the Beilstein Institut.
BJOC advocates Openness and has a facility for adding comments on each article. Up till now there has been little or no use of this facility (it is always daunting to be the first commentator) so I thought I would add some to catalyse the process of communal knowledge generation. I’m not an expert in most of the actual science, but I am familiar with the type of data in organic papers, which has a large emphasis on describing chemical reactions and the products.
There is a fairly formal manner of reporting the chemical data:
bjoc.GIF
Not bedtime reading (except for chemists) but it is a formal statement of the experiment. (The tragedy is that it used to be a machine-readable file with 16,000 points and that all chemical publishers require it to be abstracted into this text – but that is a different rant.) The chemist has to extract this data from the spectrum, retype it into this form (taking ca. 45 minutes – my informal survey) and then send it to the editor. There can easily be 50 chunks of this stuff in one manuscript.
Because of this (absurd) retyping (instead of depositing the raw data with the publisher), errors creep in. The reviewer now has to check all this stuff by hand (again taking 30 minutes). And they really do – I have seen them. It could easily take a day to wade through the experimental section in a paper, because this is the ultimate touchstone as to whether the correct compound has been made. (There have been frequent cases where claims of a synthesis have been later disproved by using this information – there was a recent example at the ACS meeting – entertainingly reviewed by Tenderbutton (search for “La Clair” in the text) – where computer calculations on the proposed chemical structure did not agree with the published values.) So they matter.
But what if they are mistyped? That’s where OSCAR comes in. Five years ago the Royal Society of Chemistry supported two undergraduate summer students (Joe Townsend and Fraser Norton) to look into creating an “Experimental Data Checker” for this sort of material. They did brilliantly and were followed the next year by Sam Adams and Chris Waudby (also RSC-funded). The result was OSCAR – the Open Source Chemical Analysis and Retrieval system, written in Java. Even if you are not a chemist, you may enjoy trying it out. You can take a raw manuscript (DOC) or a published article (HTML) – PDF is a hamburger so it probably won’t work – and drop it into OSCAR. OSCAR parses this text (using regular expressions) and produces:
oscar1.PNG
OSCAR has recognised the different sorts of spectra (coloured), the melting point, etc. It struggles with the appalling diversity of character sets used by Word (black squares) but makes sense of almost anything. These were very conscientious authors (they are from GSK) and the syntax is very correct. (This is impressive as there is no formal definition of the syntax, and OSCAR guesses from a large number of dialects.)
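For the curious, here is a toy sketch of the regular-expression approach, in Python rather than OSCAR’s Java; the patterns are my own illustration and only hint at OSCAR’s real grammar of chemical dialects:

```python
import re

# Toy patterns for two common fragments of experimental text: a melting
# point and the proton counts in a 1H NMR peak list. (Illustrative only -
# OSCAR's actual rules are far more extensive.)
text = "mp 123-125 degC; 1H NMR (400 MHz, CDCl3) 7.26 (s, 1H), 2.31 (s, 3H)."

mp = re.search(r"mp\s+(\d+)-(\d+)\s*degC", text)
protons = [int(n) for n in re.findall(r"\(\w+,\s*(\d+)H\)", text)]

print(mp.groups())   # ('123', '125')
print(sum(protons))  # 4 hydrogens accounted for by the peak list
```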
OSCAR can now extract the data from the publication. Not, unfortunately, the raw data as the publishers currently don’t accept this for publication. But the numbers in the document. OSCAR can even guess what the spectrum might have looked like before it was lost in publication.
oscar2.PNG
It’s only a guess, but hopefully it brings home what is lost in publication.
Now, although we don’t know whether any of the data are correct (only the author does), OSCAR has rules that point out when they might be incorrect. For example, if a melting point of -300 deg C is reported it is obviously wrong (maybe a minus sign crept in). Or if the number of hydrogens calculated from the spectrum above doesn’t agree with the formula, again something may be wrong. OSCAR applies these rules to each report of a chemical in the paper and lists all the warnings. Like CheckCIF (previous post), some of these are potentially serious, while others are matters of style.
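A minimal sketch of such rules (my own illustration, not OSCAR’s code) might look like this:

```python
# Two illustrative sanity rules of the kind described above. Real OSCAR
# applies many such rules to every compound report in a paper.

ABSOLUTE_ZERO_C = -273.15

def check_melting_point(mp_celsius):
    # Nothing melts below absolute zero; a stray minus sign is the
    # likely culprit for a value like -300 degC.
    if mp_celsius < ABSOLUTE_ZERO_C:
        return "ERROR: melting point below absolute zero"
    return None

def check_hydrogens(formula_h, nmr_h):
    # The hydrogens integrated in the 1H NMR should match the formula.
    if formula_h != nmr_h:
        return f"WARNING: formula has {formula_h} H, NMR accounts for {nmr_h}"
    return None

print(check_melting_point(-300))   # flags the impossible value
print(check_hydrogens(12, 11))     # flags the mismatch
```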
And, finally, OSCAR can extract all this data. I contend that the text above is factual material and cannot be copyrighted, so OSCAR can extract data from the world’s chemical literature, even if it is closed and copyrighted. (However some publishers will probably claim that their articles are “entries in a database” and hence copyrighted. We have to fight this. It’s OSCAR that has done the work, not the publishers.) The good news is (a) most data is now in open, non-copyright “Supplemental Material”, “Supporting Information”, or similar and (b) there are (a few) open access articles in chemistry. The next post shows how OSCAR can review the Open chemical literature…
(NB. This is OSCAR-1. The latest version, OSCAR3, has a much more sophisticated parser and can also work out chemical structures from their names. It really is our vision of a “journal eating robot”. It’s on SourceForge but is strictly for early adopters. More about that later.)

Posted in chemistry, open issues | 6 Comments

Linus' Law and community peer-review

Linus Torvalds of Linux fame is credited with the law
“given enough eyeballs, all bugs are shallow”
In a communal Open Source project every developer and every tester (or user when the code is released) can report bugs to a buglist. There is both the incentive to post bugs and the technology to manage them. (How many of you send off bug reports after a Blue Screen Of Death on Windows?) The bugs are found, listed, prioritised and – as developers are available – fixed. Large projects such as Apache have huge lists – many thousands; the Blue Obelisk projects have fewer, but it is still the way we try to work. The key thing is that bugs are welcomed – of course we hate hearing about a new bug at 1 a.m. – but we’d rather know now than six months down the line.
Can this be extended to peer-review? We can hardly extend Linus’ Law to chemistry (we have an even more famous Linus) but something like:
“With many readers, all data can be cleaned”
– not very punchy, but it gives the idea.
Can we have communal peer-review? Is peer-review not something that has to be done by the great and the good? No – just as all bugs are not equal, so peer-review can be extended over the community. This is being explored by Nature – typical examples are:
Scientific publishers should let their online readers become reviewers.
and
Peer review would be improved by discussions across the lab.
Here I want to explore a special case of peer review – data review. In many sciences the data are of prime importance – they almost are the publication. Where this happens some sciences implement impressive systems for data review – a good example is crystallography, where all papers are reviewed by machines as well as humans. Here’s a paper that had no adverse comments from the CheckCIF robot and here is one with quite a lot of potential problems:

Alert level B
PLAT222_ALERT_3_B Large Non-Solvent    H     Ueq(max)/Ueq(min) ...       4.38 Ratio
PLAT413_ALERT_2_B Short Inter XH3 .. XHn     H16A   ..  H18A    ..       2.06 Ang.

Alert level C
PLAT062_ALERT_4_C Rescale T(min) & T(max) by ..................... 0.95
PLAT220_ALERT_2_C Large Non-Solvent C Ueq(max)/Ueq(min) ... 3.44 Ratio
PLAT230_ALERT_2_C Hirshfeld Test Diff for O1A - C15A .. 5.01 su
PLAT318_ALERT_2_C Check Hybridisation of N1B in Main Residue . ?
PLAT720_ALERT_4_C Number of Unusual/Non-Standard Label(s) ........ 24

The robot knows about several hundred problems. The process is so well established that authors submit their manuscripts to CheckCIF before they send them off to the journal. For really serious problems (Alert level A) the authors either have to fix them or send a justification as to why the work is fit for publication.
How common is this sort of data checking in science? It happens in bioscience – authors have to prepare carefully checked protein structures and sequences. I think it happens in parts of astronomy (though I can’t remember where). Until recently there was nothing like this in chemistry but now we have two approaches, OSCAR (described in next post) and ThermoML. ThermoML is an impressive collaboration between NIST, IUPAC and at least 4 publishers, whereby all data in relevant journals is checked and archived in a public database.
Crystallography and thermochemistry are technically set up for semantic data checking, and authors in those subjects are well aligned towards validated authoring of data. But can it work retrospectively? Can the community look at what has already been published and “clean it up”? In the next post I’ll show an experiment for synthetic organic chemistry and how, with the aid of OSCAR, we can clean up published data. And, since readers are now both human and robotic:
“With many readers, all data can be cleaned”

Posted in "virtual communities", open issues | Leave a comment

GIFs and other horrors

The GIF (and its extended family of PNG, JPEG, TIFF, BMP, etc.) are major destroyers of scientific data. This post shows why they should be avoided for much scientific data. (The GIF has additional infamy through the patent fiasco.) In this post I’ll use “GIF” to refer to all bitmapped formats (as opposed to vector formats such as SVG).
All bitmapped images contain data captured as individual pixels. The resolution of the data cannot, therefore, be better than the separation between pixels. The problem occurs when a high-resolution object (such as a spectrum) is captured as a GIF. A spectrum in chemistry typically has a resolution along the x-axis of ca. 16,000 points, while a GIF may have 1,000. Therefore 94% of the data are lost in converting from spectrum to GIF. Sometimes the conversion involves dithering pixels so that the final image looks somewhat more beautiful, but this adds no information and usually destroys some. Anyway, here goes:
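(The 94% figure is simple arithmetic, assuming the nominal 16,000-point spectrum and a 1,000-pixel GIF:

```python
# Fraction of spectral data points lost when a 16,000-point spectrum
# is squeezed into a GIF about 1,000 pixels wide.
spectrum_points = 16_000
gif_pixels = 1_000

fraction_lost = 1 - gif_pixels / spectrum_points
print(f"{fraction_lost:.0%} of the points are lost")  # 94% of the points are lost
```

And that is before any dithering or rescaling does further damage.)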
Text and chemicals:
bjoc-react.gif
– a bit difficult to read, so let’s magnify it:
bjoc-react2.GIF
Bigger but not much better…
bjoc-react4a.GIF
… now we see the full horror – the dithering hasn’t added information – it just hid the problem.
It isn’t just that we have jaggies, but we can actually lose information in a seriously misleading manner. Here’s a chemical reaction:
betalactam2.GIF
This looks very pretty. But suppose we have to shrink it just a little bit (say 10%). Now we get:
betalactam2a.GIF
What’s happened? The lines used to be one pixel wide. When the picture was shrunk the converter had to decide whether the line fell on a vertical line of pixels. It just missed, so it’s not been drawn. This corrupts, rather than destroys, the chemistry – it could be mistaken for a different molecule!
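The effect is easy to reproduce. Here is a sketch of nearest-neighbour downsampling on a single row of pixels; whether a one-pixel line survives depends purely on where the sample points happen to fall:

```python
# Nearest-neighbour shrink of one row of pixels (0 = white, 1 = black).
def shrink_row(row, new_width):
    old_width = len(row)
    # Each target pixel copies its nearest source pixel; any source pixel
    # that falls between the sample points simply disappears.
    return [row[int(i * old_width / new_width)] for i in range(new_width)]

row = [0] * 10
row[4] = 1                    # a 1-px black line at x = 4
print(shrink_row(row, 9))     # [0, 0, 0, 0, 1, 0, 0, 0, 0] - line survives

row2 = [0] * 10
row2[9] = 1                   # the same line at x = 9
print(shrink_row(row2, 9))    # [0, 0, 0, 0, 0, 0, 0, 0, 0] - line vanishes
```

Real image libraries are cleverer than this, but with hairline strokes the same lottery applies.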
In practice the greatest destruction is probably in the spectra. Remember they have a resolution far greater than the screen. But here are some pixel-based spectra from supplemental data. You can find these in all publishers’ repositories…
acs.gif
The resolution of the spectrometer is probably 0.001 on the vertical axis – the GIF can only manage about 0.025.
bjoc-spect.GIF
Here the spectrum has been dithered but that can’t save it. Again the actual data resolution is probably 50 times what you can see.
rsc-suppdata.jpg
And this is one of the best. It was a proper digital spectrum. It’s been printed out (losing some of the metadata such as frequency) and annotated with the compound on a Post-it (though we cannot make sense of what is attached – it seems to be related to a different spectrum). Then it has been photocopied – losing resolution again. We don’t know how it got to the publisher, but here is their record of the scientific experiment.

Posted in chemistry, general | 5 Comments

Useful Chemistry: Publish and be…?

It was great to meet Jean-Claude Bradley, the guru of the Useful Chemistry blog, at the Am. Chem. Soc. meeting. The Useful Chemistry blog has a remarkable and valuable feature – J-C publishes chemistry as it is being done. For an example see:
[the Ugi reaction in water]
You don’t have to know any chemistry to see the freshness of style and the excitement of research. For details there are links to his Wiki giving details of the experiments, spectra, etc.
This raises a fundamental problem in publishing – is it “science”? To me it is obviously science – a formal description of the hypothesis and its testing by experiment, the careful measurement of the results and the critical analysis. Did it work? J-C and collaborators are prepared to admit “failure” – although failure should be a positive idea in science. By publishing he establishes the date of the experiment (and therefore priority) and invites critiques from the rest of the community. Since he is working in antimalarials he also gives the world community a chance to pick up potentially exciting compounds.
But it isn’t part of the mainstream of scientific publishing. By putting his work on the web he has automatically forfeited the opportunity to submit the work to a mainstream journal in chemistry. Many mainstream chemistry journals require that the work has not been previously published, and that includes putting it on the Web. Henry and I have had this experience – we mounted one of our CML schemas on a web page for people to comment on and were told that unless we took it down immediately we wouldn’t be allowed to publish. We could send paper (sic) copies to a small number of close collaborators as preprints. So, in this way, the scientific publishing process can actually inhibit useful critiquing before publication. (Many other disciplines – such as physics and computer science – encourage the posting of preprints for community critique – and it’s sad we can’t do this for mainstream chemistry.)
Why do we publish? Unfortunately the single most important reason for many authors is “to be cited” in a high-impact journal. (Hilaire Belloc opined ‘When I am dead, I hope it may be said: “His sins were scarlet, but his books were read.”’ Scientists may change that to “his papers were cited”.) I’ll post more later on the citation economy… Jean-Claude is (possibly) forfeiting the opportunity to be cited in a high-impact journal.
But most other reasons for publishing are fulfilled by the blog:

  • priority
  • communication of the work
  • ability to be critiqued and to gain feedback.
  • record of the work, re-usable by others

We can usefully argue whether these are done better or worse by blog than traditional methods. IMO the blog has many advantages and I’ll be developing the following themes in later posts:

  • the blog can experiment with semantic publishing. (So can publishers, but the investment is larger). J-C and I can start adding active CML to his blogs almost immediately. This means the blogs and wikis can act as active semantic documents (cows) not dead paper (hamburgers)
  • the community can review the blog. This is anathema to traditionalists – unless a paper has been formally peer-reviewed it’s worthless. In some disciplines (e.g. clinical trials) I would agree. But in chemistry is the formal peer-reviewing process so wonderful? I and OSCAR (the robot) have found technical errors in almost every paper on synthesis I have looked at. Reactions that don’t “balance”, formulae that don’t square with the compound being discussed, mistyped chemical names and compound references, etc. I am sympathetic to the reviewers – an in-depth peer-review of a chemical synthesis can easily take a day. I found one where the supporting information (more later) ran to 200 pages – most of it PDF hamburger. I am not advocating the abandonment of peer-review but in some cases there are complementary approaches
  • the blog is immediate, formal publication can take months
  • the blog can link to other resources and, unlike formal publications, can be updated while preserving its revision history (in a Wikipedia-like manner)
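The kind of “balance” check mentioned above can be sketched in a few lines: count the atoms of each element on both sides of a reaction and compare. This is a toy illustration, not OSCAR’s actual code – the formula parser below handles only simple formulae like C2H6O (no brackets, charges or isotopes):

```python
import re
from collections import Counter

def atom_counts(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no brackets or charges)."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts

def balances(reactants, products):
    """True if the total atom counts match on both sides of the reaction."""
    total = lambda side: sum((atom_counts(f) for f in side), Counter())
    return total(reactants) == total(products)

# Ethanol combustion: C2H6O + 3 O2 -> 2 CO2 + 3 H2O
print(balances(["C2H6O", "O2", "O2", "O2"],
               ["CO2", "CO2", "H2O", "H2O", "H2O"]))  # True
```

A reaction that fails this trivial test cannot be right as printed – which is exactly the sort of error that turns up in published syntheses.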
Posted in "virtual communities", chemistry, open issues | Leave a comment

The Blue Obelisk: A volunteer!

I flew back from SF to Heathrow yesterday and, as usual, was hacking on the laptop (with difficulty as Virgin doesn’t seem to give enough space to open a laptop). After a while my neighbour (S) asked:
S: “do you mind saying what you do?”
P: “not at all – I’m a chemist at Cambridge, working in informatics”
S: “I saw you using Eclipse. I’m a software developer for [a company that tests devices] and we use it a lot”.
P: “Ah, we not only use it for testing, we have a rich client (Bioclipse) – it’s great – do you want to see it working?”
P shows S Bioclipse and Jmol, and he is impressed…
S: “I’d like to help with that – I enjoy software challenges and would like to do some of that in my spare time”
S is a computer scientist and I told him that Miguel Howard – a compsci – created the stunning graphics in Jmol without knowing any significant amount of chemistry. And we started to discuss the sorts of contribution that can be made towards the Blue Obelisk movement. S would be happy to look at developing machine learning methods, distributed threading, image processing, etc., where the project doesn’t require an in-depth knowledge of chemistry and is reasonably self-contained.
So – by chance – we have a volunteer who can bring a lot to the project. I shall mail the BO suggesting that we promote a wish-list on the site that such volunteers can choose from. There must be a lot of potential volunteers out there in all sorts of disciplines who would enjoy – and that is the key word – contributing to the BO. Because it’s FUN – not a bar on a Gantt chart. I understand S – often I tackle software problems for relaxation; it’s creative and fun (if there isn’t a deadline). The satisfaction of creating a bit of working software can be enormous (although the pain of debugging it is the opposite!).
So if you think you have something to contribute to BO, have a look at the site, download and try the software and see if it excites you.
I have worked with virtual volunteers for many years. It’s a different ethos from a managed project – you have no right to expect anything from anyone. The demands of real life (RL) must take precedence. Sometimes there are problems with employers or supervisors. And many volunteers offer something and then are never heard from again. But we now have a substantial software and information architecture and there are opportunities in many aspects besides software: BO data curation, tutorials, advocacy, documentation (not as boring as it sounds if you don’t do it all at once, or if you have a new slant on it), examples, integration, mashups, applications, etc.
http://www.blueobelisk.org

Posted in "virtual communities", programming for scientists | 2 Comments

Blue Obelisk Award – Geoff Hutchison of OpenBabel

A major problem in chemistry is that there is a plethora of file formats and it continues to get worse. Each manufacturer thinks they are the centre of the world and that everyone else will use their approach. So they make up some ad hoc format and the number of different file types multiplies. Syntactic, semantic and ontological incompatibility are rife. One speaker from the pharma industry at ACS opined that this was a fact of life we had to put up with.
We don’t – and that is what the BO is about. In some sunny future we shall use XML/CML-based files in which modern tools can store and convert ontological information. But for now we have to convert between different types. This process is necessarily lossy and would normally require n(n-1)/2 programs for n file types.
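The arithmetic behind that n(n-1)/2 figure is worth making concrete: linking every pair of formats directly needs one (bidirectional) converter per pair, whereas converting everything via a single common internal model – the hub approach that a tool like Babel takes – needs only one converter per format. A quick sketch (pure counting, no chemistry assumed):

```python
def pairwise_converters(n):
    """Bidirectional converters needed if every pair of formats is linked directly."""
    return n * (n - 1) // 2

def hub_converters(n):
    """Converters needed if every format converts to/from one common hub model."""
    return n  # one converter per format

for n in (5, 10, 50):
    print(n, pairwise_converters(n), hub_converters(n))
# At 50 formats: 1225 direct converters versus 50 via a hub
```

So at the 50-format scale the hub architecture is the only one that is maintainable – which is precisely why Babel’s design matters.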
Babel (sic) provided an early solution – it would read molecules in format A and spit out B or C or whatever was requested. But it was limited. Under Geoff’s leadership it has been transformed into the de facto standard (OpenBabel), which can process at least 50 file types.
The architecture has been totally overhauled – it has a core off which modules can be hung for each file type. In this way it can be much more easily extended, and there are volunteers who do some of this. Readers who have not written such tools will not appreciate the dedication required. And this is done in Geoff’s marginal time – his day job is materials research.
As before the object is neither blue nor an obelisk. Again we hope Raja will post a picture to me. It could be Gimped…

Posted in "virtual communities", chemistry | 2 Comments