petermr's blog

A Scientist and the Web

 

Archive for the ‘XML’ Category

Egon on SMILES InChI CML and RSS

Sunday, December 10th, 2006
I agree with everything Egon says and add comments.
(Incidentally WordPress and Planet remove the microformats so please read his original
for the correct syntax)
On the blogs ChemBark and KinasePro there have been some discussions on the use of SMILES, CML and InChI in Chemical Blogspace (now with 70 chemistry blogs!). Chemists seem to prefer SMILES over InChI, while there is interest in moving towards CML too. Peter commented.

PMR: 70 blogs is great. Go back a year and we’d have had ca. 10, I suspect. As I say, I’m only looking for the 5-10% who are happy to be early adopters.

Any incorporation of content other than images and free text requires some HTML knowledge, but this can be rather limited. It is up to us chemoinformaticians to write good documentation on how to do things; so here is a first go.

PMR: Yes, documentation is key as we are always being reminded! But we are also still fighting the browser technology. One of the great problems is that browsers have been a moving target for 12 years – it was almost easier to create a “plugin” in 1994 than now. How many of you can run Chime under Firefox?

Including CML in blogs and other RSS feeds

I blogged about including CML in blogs last February, and can generally refer to this article published last year: Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators (PMID:15032525, DOI:10.1021/ci034244p). Basically, it just comes down to putting the CML code into the HTML version of your blog content, though I appreciate the need for plugins.

PMR: you should always try to create XHTML (HTML with balanced tags). Unfortunately (and most regrettably) some tools, including WordPress, can often remove end tags.

Including SMILES, CAS and InChI in blogs

Including SMILES is much easier as it is plain text, and has the advantage over InChI that it is much more readable. Chris wondered in the KinasePro blog how to tag SMILES, while Paul did the same on ChemBark about CAS numbers.
PMR: SMILES shouldn’t need to be “readable” and some of it isn’t (e.g. if you have a completely disconnected structure). It is because people have got used to seeing it for many years that they don’t feel frightened. There is no way to create canonical SMILES by hand, so you have to have a tool. InChI seems more forbidding because (a) it’s new, (b) it can never be hand-authored, (c) it’s about 50% more verbose, and (d) it has layers. But each of those has a positive side.
Now, users of PostGenomic.com know how to add markup to their blogs to get PostGenomic to index discussed literature, websites and conferences. Something similar is easily done for chemistry things too, as I showed in Hacking InChI support into postgenomic.com (which was put on lower priority because of finishing my PhD). PostGenomic.com basically uses microformats, which I blogged about just a few days ago in Chemo::Blogs #2, where I suggested the use of aspirin.

And this is the way SMILES, CAS numbers and InChIs can be tagged on blogs. The <span> element is HTML code to indicate a bit of inline content, and can, among many other things, be formatted differently from other text. However, it can also be used to add semantics in a relatively cheap, but accepted, way. Microformats are formalized just by use, so whatever we, as chemistry bloggers, use will become the de facto standard. Here are my suggestions:
[snipped see Egon's blog]
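Since Egon’s actual class-name suggestions are snipped here, the names below are purely hypothetical; but as a minimal sketch of the microformat mechanism, a SMILES wrapped in a classed <span> can be pulled out of a page by any harvester. In (modern) Python:

from lxml import etree

# Hypothetical markup: the class name "chem-smiles" is illustrative only,
# not Egon's actual proposal (see his blog for the real suggestions).
html = etree.HTML(
    '<p>Aspirin: <span class="chem-smiles">CC(=O)Oc1ccccc1C(=O)O</span></p>'
)
for span in html.iterfind('.//span[@class="chem-smiles"]'):
    print(span.text)  # an aggregator could index this SMILES string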
The RDFa alternative

The future, however, might use RDFa over microformats, so here are the RDFa equivalents:
[snipped see Egon's blog]
which requires you to register the namespace xmlns:chem=”http://www.blueobelisk.org/chemistryblogs/” somewhere, though. The URN for this namespace still needs to be formalized; Peter, would the Blue Obelisk be the platform to do this? BTW, this is more advanced, and currently does not have practical advantages over the use of microformats.
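Again purely as a hypothetical sketch (the real property names are snipped above): the RDFa version hangs the chemistry off a property attribute in the proposed namespace, and a harvester can match it just as easily:

from lxml import etree

# Hypothetical RDFa, assuming the blueobelisk namespace proposed above
# and an illustrative "chem:inchi" property name.
html = etree.HTML(
    '<p xmlns:chem="http://www.blueobelisk.org/chemistryblogs/">'
    'Ethanol: <span property="chem:inchi">'
    'InChI=1/C2H6O/c1-2-3/h3H,1-2H3</span></p>'
)
span = html.find('.//span[@property]')
print(span.get("property"), span.text)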
Egon is right: there is currently no clear indication of which approach will come out as the “winner”, although there is lots of Web discourse. For our part, I suspect we would adopt both if lots of people were using them, and see which approach won.
Yes, of course we should use blueobelisk for the RDF! This has the real chance of succeeding.
Again the message is that the rest of the world is going down this route and at some stage chemistry will follow. RDF looks just as impenetrable as InChI, DOI, and all the rest…

Why bother with new technology?

Sunday, December 10th, 2006

Kinasepro has blogged about discussions of new chemoinformatics technology (specifically CML (Chemical Markup Language) and InChI (chemical identifier)). Here’s the post and some correspondence. It’s basically about the introduction of new technology. Obviously I’m not neutral but I will try to discuss it in a neutral manner. For that reason I have copied it more or less in full.

There’s been a fair amount of talk [ChemBark] over the last little while on the topic of chemoinformatics and chemblogs. Here’s my two cents.
smiles > inchi : Aldrich
smiles > inchi : ChemExper
smiles > inchi : The PDB
smiles > inchi : Chemdraw (until v10)
smiles > inchi : The entire pharmaceutical industry.
smiles < inchi : Peter Murray Rust
smiles < inchi : IUPAC
So somehow a couple librarians have convinced Google that inchi > smiles. Result? Google may well do InChI, but no one but the librarians is currently using it, and meanwhile google doesn’t index smiles very well. I’m reminded of a day when it was thought to be a good idea to put the CAS#s of new entities at the bottom of ACS journal articles. Don’t worry, we survived those librarians too.
PMR: I’m not sure who the librarians are. I’d label all of us as chemical informaticians. The institutions include NIST, RSC, and the University of Cambridge. I don’t think Google has been convinced of anything – chemistry is simply too small for Google to worry about. But yes, we have visited and had very useful and forward-looking conversations. Watch out for Googlebase…
Lookit, we don’t need a string of XML code that you need an advanced degree to use. We don’t need people telling us to tag our blog posts, we need an integrated solution. We need something that can draw structures and present them attractively in an index friendly HTML format. Near term: Get google to index picture descriptions, and code a firefox plugin that can insert smiles into said descriptions.
PMR: I am not quite sure what “index picture descriptions” means. Google indexes the fact that there is a picture but not the content. There are major efforts in image recognition, but I am not aware that any of this is being done in chemistry. I think that indexing chemistry in published GIFs is extremely difficult. I’ve looked at this over the years and conclude that it would be much easier if authors simply made their molecular files available.
Till google has a smiles substructure search, I’m not going to bother.
PMR: This is a perfectly valid response from an individual in the system. It’s rather less encouraging if it reflects the whole of chemistry (which currently it does). If the chemical informatics community says “at some stage Google will solve all our chemical problems, until then we’ll do nothing” that’s regrettable. All other major scientific disciplines (physics, astronomy, bioscience, geosciences, etc.) are making major efforts to develop informatics infrastructure. Some of us are, in fact, thinking about how to do this. The problem is that there has to be some software somewhere. It can be in the following places:

  • client (i.e. your browser)
  • Google (we have discussed this with Google and it’s not impossible)
  • third party (who may or may not charge for it).

Given that Openbabel can search millions of structures quite rapidly there are some encouraging opportunities.
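To make the Openbabel point concrete, here is a minimal sketch using Open Babel’s Python bindings (the import path varies between Open Babel versions; the molecules are my own examples):

from openbabel import pybel  # Open Babel 3.x; older versions: "import pybel"

# A SMARTS pattern is the substructure query language that SMILES users know
query = pybel.Smarts("c1ccccc1C(=O)O")                  # benzoic acid core
mol = pybel.readstring("smi", "CC(=O)Oc1ccccc1C(=O)O")  # aspirin

matches = query.findall(mol)  # one tuple of atom indices per hit
print("substructure found:", bool(matches))

Run over a file of a million SMILES, a loop like this is exactly the sort of server-side service the third parties above could offer.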

  1. totallymedicinal Dec 5th, 2006 at 3:09 pm
    Couldn’t agree more with the sentiment – not only does my ancient version of ChemDraw not support this exotic format, but I have enuff hassle in my life without learning some obscure new coding system.

PMR: Again this is a perfectly valid response. Any approach to chemoinformatics requires tools. And I suspect you or your institution would have to pay for an upgrade to Chemdraw. Obviously there are Open Source free tools, but they are not yet widely deployed and are effectively for early adopters.

  3. Paul Dec 7th, 2006 at 4:06 am
    I could not agree more about the need for an integrated solution! I got a really thoughtful response from Peter Murray-Rust and friends, and I feel kind of bad about not acting on it, but putting random InChI designations at the bottom of all our blog posts doesn’t seem worth it to me. I think that CML is indeed the future, and I look forward to the day of being able to download a CML plugin for WordPress that will take care of everything for us lazy bloggers.
PMR: There is no doubt there are technical problems and they will require some early adopters to solve. I have tried to hide the InChI – it is an effort and is fragile. Given that I have problems with simple computer code in WordPress I expect the same with chemistry. However we have some new ideas of how to take this away from the WordPress process.
  4. Chris Dec 9th, 2006 at 12:23 pm
    The argument against SMILES seems to be that it is not an Open Format and that it is possible to represent a single molecule with multiple SMILES strings. For my part I can read and write SMILES (and SMARTS and SMIRKS). I find InChI impenetrable, and I don’t think there is syntax for substructure or similarity queries; in addition I don’t think there is a system for describing reactions. I’ve started to add SMILES to my web pages in the hope that someone will build an index at some point; I guess it would help if there was a SMILES tag.
PMR: SMILES was a groundbreaking language when it came out. In general I have no problem with non-Open formats if there are free tools to manage them. There is a canonicalization algorithm for SMILES but it is closed and proprietary. I have regularly discussed the value of making it openly available with Daylight management but they are not prepared to do this. This is a legitimate business approach – control the market through trade secrets. In the current case, however, it has the practical downside that several groups have created incompatible “canonical” SMILES.
The main virtue of InChI is that it is a public Open Canonicalization algorithm. It’s perfectly possible to convert InChI to SMILES if you want. It would not be “canonical SMILES” in the strict sense, but it would be canonical. That may, in fact, be a useful approach for certain types of compound. As InChI has a richer set of concepts than SMILES there may be some information loss.
In summary, if Daylight had made the SMILES algorithm public and it had been used responsibly I doubt very much whether we would have InChI. It has been driven by the lack of interoperability in chemistry – coming in some part from government agencies and the publishing community.
InChI is by definition impenetrable. It’s an identifier. Do you find DOI, ISBN, security certificates impenetrable? I hope so :-)
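The canonicalization point is easy to demonstrate with today’s Open toolkits. A minimal sketch using RDKit (which postdates this post; any toolkit with InChI support behaves the same way):

from rdkit import Chem

# Two different but equally valid SMILES for the same molecule (ethanol)
for smi in ("OCC", "CCO"):
    mol = Chem.MolFromSmiles(smi)
    print(Chem.MolToSmiles(mol), Chem.MolToInchi(mol))

# Both iterations print identical output: the toolkit's canonical SMILES
# and the InChI - whose canonicalization algorithm is public - agree
# whatever form the input took.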

  5. kinasepro Dec 9th, 2006 at 7:03 pm
    InChI and CML may well be the future, and no-one will embrace it more than me, but SMILES is the present. For people working in the field not to understand that boggles the mind!

PMR: I’m not sure who “people working in the field” are. If it includes me, then I fully understand it. I am simply trying to bring the future to the present a bit quicker and a bit more predictably. :-)

  I’ve experimented on this site a little with smiles. For instance a google search of the following string brings you here:

  O=C(C2=CN=C(NC3=NC(C)=NC(N4CCN(CCO)CC4)=C3)S2)NC1=C(C)C=CC=C1Cl

  Of note, I’m not the only one with that string on the web! Maybe that’s an important compound? Sadly google indexed that page under my SRC tag rather than as a standalone page. Put that together with the fact that smiles strings are not substructure searchable via google and it’s clear to me that google is not ready to be a chemistry informatics platform. It’s sad really, because it doesn’t seem to me that it would be that difficult for them to make SMILES strings substructure searchable via the same algorithm the PDB, relibase, aldrich and everybody else is using.

PMR: This is a very important point and at the heart of the problem. Google works by indexing text. It’s good at it and can distinguish different roles for text and can look for substrings. This is a simple, powerful model. But at present it doesn’t index other objects (faces, maps, etc.) These are both harder and require specialist software. By contrast PDB, Relibase, Aldrich do index chemical structures. That means that they have to have specialist software running on their servers. Which means a business model. And that someone has to pay somewhere. PDB gets a grant, Relibase is commercial, Aldrich will see this as the basis for selling more compounds. All completely valid. But there is no business reason for Google to invest in chemistry-specific software – as I said chemistry is too small for Google to bother with. It’s not helped by the fact that all the information is proprietary and that one of the major chemical information suppliers (CAS/ACS) sued Google. So unless you convince them differently – and I have gently tried – it won’t happen.

So this is all about the introduction of new technology. The primary messages from the chemistry community are something like:

  • We’re happy with what we’ve got – it’s worked for the last 20 years and will go on doing so. Yes, for a little while.
  • When it’s necessary, CambridgeSoft, Chemical Abstracts or Elsevier will develop a new technology and we’ll pay them to use it. Unfortunately I don’t see any movement from any of these to embrace the new Web metaphors. Biology, geoscience, etc. are working hard to develop the semantic web in their subjects – apart from a few of us, no one in chemistry is.
  • Well, it’s a bit of a mess, but it’s not at the top of my priorities. I’ll come back in a few years.
There are movements in chemistry, particularly in three areas:
  • computational chemistry. We are hosting a visit from COST D37 (an EU action) to Cambridge tomorrow to create an interoperable infrastructure for computational chemistry. It will be based on communal agreements and use XML/CML as the infrastructure.
  • chemoinformatics. The Open Source community (e.g. Blue Obelisk) supports both the current (legacy) formats (SMILES, MOL, etc.) and CML/InChI. This can provide a smooth path towards the wider adoption of these newer approaches, including toolkits. The toolkits are free, which some see as a disadvantage, in which case you will have to convince the commercial suppliers to create them.
  • publishing. Commercial publishing is universally based on XML (and variants) so it is easy for them to include CML and related systems. I won’t give details but I’d be surprised if there weren’t major changes in the next 2-3 years here, which I hope will answer some of the objections raised here.

There are also general major drivers elsewhere for the abandonment of legacy formats. They include the semantic web, RSS, institutional repositories, archival, etc. These efforts require interoperability and freely available tools – you can’t archive, say, a binary chemistry file and expect it to be readable in 5 years’ time. There are a lot of people to whom that matters.

So I’m not telling anyone to do anything – I’m putting ideas, protocols and tools where they may wish to pick them up. If 5% of a community is enthusiastic that’s a good beginning. It worries me that the pharma industry has no concept of interoperability. But I’ve said that already.

RELAX wins

Wednesday, November 29th, 2006
There’s been a buzz today about the changing scene in XML. On the XML-DEV list Michael Champion wrote:

I see that Elliotte Harold has declared the schema wars over, and Tim Bray, Don Park, and others have piled on. That would be great news, except for the little detail that the non-cognoscenti don’t seem to know or care.

He’s referring to Tim Bray who writes:

Choose RELAX Now · Elliotte Rusty Harold’s RELAX Wins may be a milestone in the life of XML. Everybody who actually touches the technology has known the truth for years, and it’s time to stop sweeping it under the rug. W3C XML Schemas (XSD) suck. They are hard to read, hard to write, hard to understand, have interoperability problems, and are unable to describe lots of things you want to do all the time in XML. Schemas based on Relax NG, also known as ISO Standard 19757, are easy to write, easy to read, are backed by a rigorous formalism for interoperability, and can describe immensely more different XML constructs. To Elliotte’s list of important XML applications that are RELAX-based, I’d add the Atom Syndication Format and, pretty soon now, the Atom Publishing Protocol. It’s a pity; when XSD came out people thought that since it came from the W3C, same as XML, it must be the way to go, and it got baked into a bunch of other technology before anyone really had a chance to think it over. So now lots of people say “Well, yeah, it sucks, but we’re stuck with it.” Wrong! The time has come to declare it a worthy but failed experiment, tear down the shaky towers with XSD in their foundation, and start using RELAX for all significant XML work. [Update: Piling-on are Don Park, Gabe Wachob, Mike Hostetler and some commenters. There’s thoughtful input from Dare Obasanjo, and now the comments have some push-back too.] [4 comments]
Some of the XML-DEVers are less than convinced – their argument is that paying customers in large companies don’t care how awful the technology is as long as it’s seen to be standard. W3C XSD will survive and prosper simply because it’s there, tooled up, and supported by the 800lb gorillas.
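For anyone who hasn’t met RELAX NG, validation is a few lines in a modern Open toolchain. A minimal sketch with lxml (the toy schema is mine, not anything normative):

from lxml import etree

# A tiny RELAX NG schema (XML syntax) for a <citation> element
schema = etree.RelaxNG(etree.XML(
    b'<element name="citation" '
    b'xmlns="http://relaxng.org/ns/structure/1.0">'
    b'<oneOrMore><element name="author"><text/></element></oneOrMore>'
    b'<element name="year"><text/></element>'
    b'</element>'))

doc = etree.XML(b"<citation><author>Tim Bray</author>"
                b"<year>2006</year></citation>")
print(schema.validate(doc))  # True; drop <year> and it becomes False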
I’

The War on Error

Saturday, November 18th, 2006

There’s been a lot of excitement over Pete Lacey’s The S stands for Simple. This Socratic dialogue, which I blogged yesterday, has shown the futility of the overengineered madness from the W3C committees. There are other similar postings, summarised in Bill de hÓra‘s blog:

The War On Error

Last March: REST wins, noone goes home. Well, it looks like we’re done. Which is worse, that everyone gets it now and we’ll have REST startups in Q207, or that it took half a decade?

It’s tempting to be scathing. But nevermind, The Grid’s next.

So in our Centre we shall be going 100% for RESTful chemistry – it’s just a pity we have wasted so much time. I am interested to see that the Grid is next! Certainly my own approach is that where we can use HTTP rather than Globus we should – at least in the initial stages. That’s not to say Globus is without its uses – just that we haven’t needed it yet.

Organic Theses: Hamburger or Cow?

Wednesday, October 25th, 2006

This is my first attempt to see if a chemistry thesis in PDF can yield any useful machine-processable information. I thank Natasha Schumann from Frankfurt for the thesis (see below for credits).

A typical chemical synthesis looks like this (screenshot of PDF thesis).

[screenshot from the PDF thesis: a typical synthesis write-up (diss38b.PNG)]

For non-chemists this consists of the name and number of the compound, a recipe for the synthesis, a structural diagram (picture), and the number and analytical data (Mass Spec, Infrared and Ultraviolet). This is a very standard format and style.

The ugly passive (“To a solution of X was added Y”) is unfortunately universal (cf. “To my dog was donated a bone by me”). The image is not easily deconstructed (note the badly placed label “1” and “+” making machine interpretation almost impossible – that is where we need InChIs).
I then ran PDFBox on the manuscript. This does as good a job as can be expected and produces the ASCII representation.
[screenshot: plain-text extraction of the same page by PDFBox (diss38c.PNG)]

This is not at all bad: obviously the diagram is useless and the Greek characters are trashed, but the rest is fairly good. I fed this to OSCAR1; it took about 10 seconds to process the whole thesis. You can try this as well!
[screenshot: OSCAR1’s markup of the extracted text (diss38.PNG)]

OSCAR has parsed most of the text (obviously it can’t manage the diagram labels, but the rest is sensible). It has extracted much of the name (fooled a bit by the Greek characters) and pulled out everything in the text it was trained to find (nature, yield, melting point). It cannot manage the analytical data because the early OSCAR only read RSC journals, but OSCAR3 will do better and can be relatively easily trained to manage this format.

So first shots are better than I have got in the past. OSCAR found data for 40 compounds – ca. 4 per second. Assuming that there are many similar theses there is quite a lot it can do. But not all have PDF that behaves this well…
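For anyone who wants to try the first step themselves: PDFBox is a Java library, but the same extract-then-scan pass can be sketched in Python with the pypdf package (the file name is hypothetical, and the regular expression is only a crude stand-in for what OSCAR really does):

import re
from pypdf import PdfReader

reader = PdfReader("thesis.pdf")  # hypothetical input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Crude stand-in for OSCAR: pull out yields and melting points
for hit in re.finditer(r"\d+\s*%\s*yield|m\.?p\.?\s*[\d.\u2013-]+", text, re.I):
    print(hit.group(0))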
===================

Acknowledgement (from PDF)
Chiral Retinoid Derivatives:
Synthesis and Structural Elucidation of a New Vitamin A Metabolite

Dissertation approved by the Faculty of Life Sciences of the Technische Universität Carolo-Wilhelmina zu Braunschweig for the award of the degree of Doktorin der Naturwissenschaften (Dr. rer. nat.), by Madalina Andreea Stefan of Ploiesti (Romania)

What are the advantages of XML and why should I care? (text)

Tuesday, October 17th, 2006

This is an attempt to explain why XML is important in a scientific context. I shall try to assemble as many reasons as possible, but there are also many other tutorials and overviews.

I believe that XML is a fundamental advance in our representation of knowledge. It’s not the first time this has been attempted – for example you can do anything in LISP that you can do in XML and a good deal more. But XML has caught on and is now found in every modern machine on the planet.

Let’s start with a typical piece of information:

Pavel Fiedler, Stanislav Böhm, Jiří Kulhánek and Otto Exner, Org. Biomol. Chem., 2006, 4, 2003

How do we interpret what this means? We guess that there are 4 authors (although it is not unknown for people to have “and” in their names), that the italic string is the abbreviation of a journal, and that 4 is a volume number. But what are “2006” and “2003”? Unless you know that the first number is the year and the third the starting page (see RSC site) you have to guess. And many of you would guess wrong.

If, however, this is created as:

<author>Pavel Fiedler</author>

<author>Stanislav Böhm</author>

<author>Jiří Kulhánek</author>

<author>Otto Exner</author>

<journal>Org. Biomol. Chem.</journal>

<year>2006</year>

<volume>4</volume>

<page>2003</page>

you can see that each piece of information is clearly defined. There is no reliance on position, formatting or other elements of style to denote what something means.

But isn’t this harder to create and read? If everything is done by a human, perhaps. But almost all XML documents are authored by machines, either from editors or the result of a program. And the good news is that the style – the italics, etc. – can be automatically added. XSLT allows very flexible and precise addition of style information through stylesheets.

So it won’t surprise you that publishers actually create their content in XML. When you submit a Word or LaTeX document it gets converted into XML – either by software (which isn’t always perfect) or by retyping :-( . The final formatting – either as PDF or HTML – can be done automatically by applying different stylesheets. So the document process is:

XML + PDF stylesheet -> PDF

XML + HTML stylesheet -> HTML

The stylesheets don’t depend on the actual document being processed and work for any instance. Of course it takes some work and care to create them, but most of you don’t need to worry.
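A minimal sketch of that separation in practice, using lxml’s XSLT engine (the stylesheet is a toy of mine, not a publisher’s production sheet):

from lxml import etree

xml = etree.XML(b"<citation><author>Pavel Fiedler</author>"
                b"<journal>Org. Biomol. Chem.</journal>"
                b"<year>2006</year><volume>4</volume>"
                b"<page>2003</page></citation>")

transform = etree.XSLT(etree.XML(b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/citation">
    <p><xsl:value-of select="author[1]"/> et al.,
      <i><xsl:value-of select="journal"/></i>,
      <xsl:value-of select="year"/>, <xsl:value-of select="volume"/>,
      <xsl:value-of select="page"/>.</p>
  </xsl:template>
</xsl:stylesheet>"""))

# Prints a styled rendering of the citation; swap the stylesheet and the
# same XML becomes PDF-oriented markup instead.
print(str(transform(xml)))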

So for anyone working with documents, XML allows the content to be stored independently of the style. That’s a great advantage also when it comes to preservation and archival. Because XML is standard, Open, ASCII, etc. it doesn’t suffer from information loss when it is moved from one machine to another (how many of you have lost newline characters when going from Windows to Mac to Zip to Word, etc.?) It’s possible to calculate a digital checksum for a canonical XML document so any corruption can be immediately spotted.
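The checksum idea in one short sketch (canonical XML via lxml, hashing via Python’s standard library):

import hashlib
from lxml import etree

doc = etree.XML(b"<citation><year>2006</year><page>2003</page></citation>")
canonical = etree.tostring(doc, method="c14n")  # W3C Canonical XML
print(hashlib.sha256(canonical).hexdigest())
# Any corruption of the document changes this digest and is spotted at once.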

There are a number of other aspects. Notice that the second and third authors have diacritic marks in their names. XML supports a very wide range of encodings and character sets so is an international specification.
In later posts I’ll show the power of XML for validation, how software can be applied and how data can be structured. Please feel free to add comments or questions.

What are the advantages of XML and why should I care? (0)

Tuesday, October 17th, 2006

As I have blogged before we are looking at ways of improving the information infrastructure in our Centre. We’re all very conscious of how little we know – I know I know very little and I’m quite prepared to admit it in public. Ignorance per se is not a crime – only wilful ignorance. As part of the process we created some self-help groups and the first feedback is that they would like a set of FAQs for a wide variety of questions. Remembering that this is a group of 40+ molecular informatics scientists I’ll post some of the questions on an occasional basis. Because others can contribute to this blog maybe we’ll build some communal FAQs…

So I cannot resist “What are the advantages of XML and why should I care?”. I’ve invested several years of my life in developing XML, and layering Chemical Markup Language (CML) on top of it. So it’s very dear to my heart. This post won’t answer the general question directly so there will be more.
I got introduced to Markup Languages after WWW2. At WWW1 (1994) it was clear that HTML had succeeded very well with text and graphics but that more formality was required for other science disciplines. Recall that the early web was about science, not commerce, and although TimBL saw the commercial potential, it was low key at that time. So Henry Rzepa went off to WWW2 and came back saying that people were talking about “something called SGML“. It was also clear that CERN (where TBL developed HTML) was strong on SGML and that it could support complex documents. I had been struggling for several years with the need to formalize chemistry into a component-based system and with SGML this seemed possible. You could create your own tags for whatever you liked as long as you defined them formally (with a DTD).

So I created some sample CML documents with my own tags. That was the easy bit. The DTD (which defined the language) was harder but possible. The real difficulty was actually doing anything useful with SGML. You could read it and… agree it was correct… and send it to other people… but it didn’t do anything. Why would chemists use it? At that stage they wouldn’t….

The users of SGML were somewhat esoteric groups. Typical examples were the Text Encoding Initiative (a project to encode the world’s literature in SGML). At the other end were creators of aircraft maintenance manuals. (Although there were hints that SGML could be used for anything it was primarily used for text). The good news was that almost all major publishers of scientific articles used SGML in the production process.
I soon realised that to do anything useful – especially for chemistry – required procedural code. And there was very little. Some of it was extremely expensive – one company wanted $250K (sic) for a site license. The main clients were technical publishers – e.g. in aerospace. So I started to write my own system without any idea what I had got into. I found myself having to refer to “parents and children” of parts of documents – this seemed very strange to me at the time. I was extremely grateful to Joe English who developed a system called CoST and gave me huge amounts of virtual help. Joe, you were very patient – hope all is well! There were a few pioneers of Open Source like Joe and IMO they saved the day for SGML and paved the way for XML. Top of the list is James Clark – whom I’ve never physically met – whose code and ideas have underpinned much of XML. His nsgmls system was the only code that had the power I required and which could transform the (potentially incredibly complex) SGML documents into something tractable.

So by 1995 I had a system which could represent chemistry in SGML and process it with a mixture of tcl/CoST and nsgmls. It had fairly advanced graphics (in tk) and could even do document analysis of sorts. At that stage (another story) I was converted to Java and effectively wrote a complete system for CML/SGML in Java. This had a simple DOM, a menuing system and a tree widget (in AWT!) and could hold a complete chemical document.

Then, in 1996, Henry pointed me at a small activity on the W3C pages called XML. (Actually Henry and I had already used “XML” as part of CML, but we surrendered the term). I got myself onto the working group and was therefore one of about 100 people who contributed to the development of XML.

When XML was first started it was “SGML on the Web”. It wasn’t expected to be important and it wasn’t even on the front page of the W3C. As SGML was seen as complex and limited, XML wasn’t really expected to flourish.

XML’s success was due to the foresight and energy of a number of people, but especially Jon Bosak – the “father of XML”. Jon worked on technical document management in Sun (I hope that’s right) and saw very clearly that XML was part of the future of the Web. He coordinated the effort, got funding and political support, and I remember his pride in showing the back cover of the first draft of XML – ” sponsored by Sun and Microsoft”. This was a great technical and political achievement.

Tim Bray – another champion and parent of XML – writes:

“It is to Jon Bosak’s immense credit that he (like many of us) not only saw the need for simplification [of SGML] but (unlike anyone else) went and hounded the W3C until it became less trouble for them to give him his committee than to keep on saying SGML was irrelevant.”

It was supported by one of the largest and most active virtual communities. Henry and I offered to run the mailing list, XML-DEV, on which much of the planning and software development took place. By insisting on Open software as a primary means of verification the spec was kept to a manageable, implementable size. This meant that, unlike SGML, XML could be implemented by the DPH (“Desperate Perl Hacker”). And it was….

… the rest is history. XML has become universal. Jon (I think) described it as “the digital dial tone” – i.e. wherever information is being passed on the web it will increasingly be in XML.

So that explains why I care :-) . Next post I’ll explain why you should also care.

Hamburgers and Cows; The Cognitive Style of PDF

Sunday, September 10th, 2006

PDF is one of the greatest disasters in scientific publishing – why?
I normally give my slides in XHTML rather than Powerpoint and prefix them with the quote which I made up:
“Power corrupts; Powerpoint corrupts absolutely”
I then searched the web and found that Edward Tufte had already thought of it in
The Cognitive Style of PowerPoint.
Tufte contends that PP had an important role in the Space Shuttle disaster(s). Tufte’s premise is that PP requires authors to omit critical data and dumb down thought. I had never thought of PP as actually perverting the way we think, but he is absolutely right.
My attack on PP is complementary – technical rather than political. PP corrupts any semantics in the document completely. Just try to read the saved HTML from a PP (in, say, Google) and you will be lucky to get anything. PP is probably the most effective destroyer of semantic information yet devised. Tufte urges that authors use Word instead. I will interpret this to mean “any tool that displays conventional compound documents at the required level and without loss”. I therefore choose XHTML (because Word is a pretty good semantic destroyer as well).

So why not just use PDF? It’s universal, it’s beautiful to look at, it’s used for scientific publishing…

NO! PDF is the biggest destroyer of scientific information currently in use.

PDF concentrates on only one thing: reproducing the process of adding printers’ ink to paper. The PDF that scientists use for publications was not promoted by them, but by the scientific publishers. How many scientists wrote to the publishers saying “we would like double column text in PDF”?

The “e-publishing revolution” has had the major and very sad effects of:

  • transferring the printing bill from the publisher to the reader (almost all scientists seem to print out the papers and annotate them with markers)
  • transferring political power to the publishers. It allows the publishers to claim (as the ACS does) that

What is important to realize is that a subscription to an STM journal is no longer [...] a subscription; in fact, it is an access fee to a database maintained by the publisher.

[...] one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish. Maintaining an archive, however, costs money.

From “Socialized Science” (ACS[*] commentary on NIH)
RUDY M. BAUM, Editor-in-Chief, C&E News,
September 20 2004 Volume 82, Number 38 p. 7

How many scientists asked the publishers to convert journals into databases? How many asked the publishers to become the guardians of the archive? And to have them switch off access at a moment’s notice (as they did to Cambridge last week)?

There are some minor benefits from ePublishing – Crossref, more rapid access – but it’s a Faustian bargain and we are suffering. PDF has been the devil’s agent in this. It has insidiously transferred control to publishers with the unintended but equally horrific downside of semantic destruction.

Apart from the politics, why is PDF so bad? A question on XML-DEV about how to convert PDF to XML brought the lovely comment from Mike Kay (author of the (OpenSource) Saxon XSLT tool):

>
> Could you please tell me, How we can convert the PDF data
> into Xml file using java? I found a library PDFBox.
>

Converting PDF to XML is a bit like converting hamburgers into cows. You may
be best off printing it and then scanning the result through a decent OCR
package.

Michael Kay

http://www.saxonica.com/

http://lists.xml.org/archives/xml-dev/200607/msg00509.html

So I use XHTML and preserve my semantics. It’s a labour – but it has to be the way forward. I’ll write more on this later and why the browser manufacturers have destroyed semantics as well.

(Judith M-R tells me there were too many typos in last post, so I shall edit offline, spellcheck and paste. I am still losing edits in WordPress and then finding later they have been saved after I have rewritten them.)

ACS Presentation – Part II

Friday, September 8th, 2006

The first part of my presentation dealt with the technical issues surrounding semantic chemistry. This page contains predictions – they are general enough that you don’t have to be a chemist to appreciate them. I’ll probably try to cover some in the talk but if not, at least Wendy can write them up.

They are not aggressive polemics, but statements of what I see as the current and the inevitable.

Chemical informatics and information is broken. It’s expensive, lossy, out of date and restrictive. There is virtually no innovation and no obvious understanding of how the web is changing. I don’t think the future Web (“Web 2.0″ or whatever the current acronym is) can co-exist with the closed, inward-looking chemoinformatics community which supports the closed world of pharmaceutical research.

Unless current providers of information and software, and purchasers of these services (pharma) change rapidly there will be a split. The new informatics will be characterised by:

  • biosciences and some sciences adjacent to chemistry (perhaps geosciences)
  • funders who aggressively promote Open Access and require their grantees to make their output universally available
  • data providers who wish to build mashups – especially multidisciplinary, combined services, and autonomous processes.
  • the young-at-heart generation who espouse Wikipedia, folksonomies, and social computing. Expect to see a lot of semi-formal semi-voluntary reviewing of information resources such as PubChem and Wikipedia
  • a growing Open Source community based on the Blue Obelisk mantra of Open Source, Open Data and Open Standards
  • publishers with the foresight to see the new opportunities and the value of new products and services

Five years ago I made several predictions about the Semantic Chemical Web. Many have come true in technology (but not always in human uptake). Here are some of the next five:

  • Wikipedia Chemistry will be more accessed than the Merck Handbook or general chemical textbooks
  • Students will bring PDAs into lectures (if they even bother to go) and point out when the lecturer makes mistakes
  • machines will be able to answer some first year chemistry exam questions
  • machines will roam the Open chemical semantic web mashing data against bio- and geo-sciences.
  • PubChem will be more accessed than Chemical Abstracts. Universities will cancel their subscriptions to the latter, which will be increasingly oriented to serve the pharma industry
  • chemical linguistic robots will read Open Chemical papers on behalf of the community and extract data, give guidance on what papers are worth reading, build personal chemical memexes, etc.
  • mashups of Open crystallographic data will become universal and, except for historical data searches, replace the crystallographic databases.

There are some unpredictable aspects:

  • will the pharma industry continue in its closed approach to information? If it is to be information-driven it has to develop an open supply chain for multidisciplinary information and services
  • will the major publishers react positively?
  • will Google enter chemistry? I’ve been invited to Mountain View next week – very exciting. I expect to get a very different type of audience from the ACS – probably no chemists but many excited young web hackers. Google and the new technology could dramatically change chemical informatics.

ACS presentation Part I

Friday, September 8th, 2006

Edward Tufte said in his recent book that one shouldn’t use Powerpoint to present information, but Word. Although I am not a fan of Word (see later posts) I agree with the message. So this is the first part of my talk to the American Chemical Society. Don’t worry – there’s not a lot of chemistry.

The title is something like eChemistry (which would be nice if it actually existed – we are trying to create it). The abstract is irrelevant as it was written 3 months ago and the world has changed so much the abstract is either out of date or so general it doesn’t matter.

First, thanks. When you are likely to run into time problems, thank people at the start. Here are some (please let me know if I have missed anyone – it’s easy to do). Almost everyone on this list has hacked something.
Cyberheroes (Mainly Blue Obelisk)

Bob Hanson (Jmol)
Christoph Steinbeck (Cologne)
Egon Willighagen (Cologne)
Tobias Helmut (Cologne)
Stefan Kuhn (Cologne)
Ola Spjuth (Uppsala)
Martin Eklund (Uppsala)
Miguel Howard (Jmol)
Joerg Wegner (Tuebingen/ALTOVA)
Rich Apodaca (Stanford)
Rajarshi Guha
Geoffrey Hutchison (Cornell)

Indiana:
Gary Wiggins
David Wild
Geoff Fox
Marlon Pierce

Symbiote: Henry Rzepa

Cambridge:
Ann Copestake and colleagues
Peter Corbett
Nick Day
Jim Downing
Justin Davies
Richard Moore
Joe Townsend
Alan Tonge
Andrew Walkingshaw
Andrew Walker
Toby White

Sponsors:
DTI, Accelrys, IBM
Unilever
Royal Soc Chemistry, Int Union of Crystallography, Nature Publishing Group
JISC
EPSRC
BBSRC

The current workflow in chemical informatics is broken. A typical scenario is:

[diagrams: the current non-XML workflow chain (nonxmlchain1.png, nonxmlchain.png)]
Here we see legacy programs, human activities and legacy data. At each stage a human has to cut and paste stuff, edit it, etc. This causes loss of time, loss of quality and loss of temper. Wouldn’t it be easier if everything was in a consistent interoperable format like this?

[diagram: the same workflow with everything in XML (xmlchain.png)]

Here all the data is in XML with semantic markup. Human input goes seamlessly into programs, databases, etc. The outputs pass between programs, display, etc. with no semantic loss and no friction. XML ontologies add meaning to all information components. The basic components now exist in enough cases that we can build mashed-up systems.
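As a flavour of what semantic markup means at the molecule level, here is a minimal sketch that reads a hand-made CML fragment (the CML namespace is real; the two-atom “molecule” is just my own illustration):

from lxml import etree

CML = "http://www.xml-cml.org/schema"
mol = etree.XML(
    '<molecule xmlns="http://www.xml-cml.org/schema" id="m1">'
    '<atomArray>'
    '<atom id="a1" elementType="C"/>'
    '<atom id="a2" elementType="O"/>'
    '</atomArray>'
    '</molecule>')

# Every datum is addressable by name, not by its position on a page
for atom in mol.iterfind(f".//{{{CML}}}atom"):
    print(atom.get("id"), atom.get("elementType"))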

I’m going to demonstrate some mashups. Some demos use the Internet, some have been locally crafted. Obviously we can’t demonstrate the 6 month project where we ran 1 million jobs with all interfaces in XML. Here are some of the components:

programs: MOPAC, GAMESS-US, DL_POLY, GULP, SIESTA, CASTEP, METADISE, etc.

editors: Jchempaint, etc.

renderers: Jchempaint, Jmol, JSpecView

Rich Client: Bioclipse

Services: CMLRSS, InChI, OpenBabel

Repository: SPECTRa/DSPACE (CMLCryst, CMLSpect, CMLComp)

Toolkits: CDK, JUMBO, JOELib, CIF2CML, OSCAR3, OSCARDATA

Demonstrations:

Semantic Markup (MACiE)
Simple Mashup: Placeopedia
Chemical Mashup: GoogleInchi. InChI API + Google search API
Semantic data and linking (clickable graphs and tables in CML). Jmol display
Journal-eating robots: OSCAR-DATA (chemical data)
OSCAR3 (chemical text and names) – mashup with PubChem
Reposition of data (SPECTRa) in institutional repositories
CMLRSS: molecular feeds (on Acta Crystallographica)
Rich client: Bioclipse

I shall try to get through all of these in 21.5 minutes – if the connections are slow I may have to omit some. At the end it should be clear that there is enough technology from the Open Source community to take chemistry into the 21st Century.

The next post will cover descriptions and predictions…

——————————————-

Some of these have static URLs and can be viewed relatively easily and robustly

http://www.placeopedia.com

http://wwmm.ch.cam.ac.uk/cryst/summary/acta/e/2006/07-00/ (static repository of Acta Crystallographica CIFs)
http://www-mitchell.ch.cam.ac.uk/macie/ (MACiE) (100 Entries | M0001 | animate reaction – needs IE)
http://wwmm-svc.ch.cam.ac.uk/wwmm/html/googleinchiserver.html GoogleInChI