Searching Word and PDF for chemistry

ChemConnector (from the Chemspider stable – anonymous, but possibly Antony Williams) has posted on searching for chemical structures in Word and PDF files. S/he rightly concludes that (a) it’s possible and (b) not done – at least in public. I’ll reproduce the whole post and then comment.

Hamburger PDFs and Making Them Structure Searchable

There have been numerous conversations about “Hamburger PDFs” over the months and the most recent exchange is that between Chris Rusbridge and Peter Murray-Rust. Another conversation that I have seen go on has been about making Word documents structure searchable (cannot track down the appropriate blog-postings at present). This is just an fyi comment for the community really since this is a general assumption that Word Documents and PDFs cannot be made structure-searchable. The truth is that both can be made structure searchable. How? Well, you need to write the correct information into the file to enable it but it’s possible. There are a number of solutions out there allowing structure-based searching of Word document files. I believe the first one was originally from Oxford Molecular before being acquired by Accelrys. I think there are now multiple including, I believe, Cambridgesoft, ACD/Labs and probably others. The only PDF structure searching capability I am aware of is that created by ACD/Labs a few years ago. Their website states “Our Search for Structure system allows you to seek out chemical structures in various file formats throughout your computer’s file systems. These formats include: SK2, MOL, SDF, SKC, CHM, CDX, RXN, and PDF (Adobe Acrobat); DOC (Microsoft Word), XLS (Microsoft Excel), and PPT (Microsoft PowerPoint), and ACD/Labs databases: CUD, HUD, CFD, NDB, ND5, and INT.” For PDF it was required that structure files were “tagged” appropriately when written to PDF by an embedded PDF generation capability. Since the PDF format can be extended ACD/Labs did so. If we wanted to make the majority of PDF files structure searchable then it seems as if the appropriate thing to do would be to extend the general PDF format for Life Sciences, talk to Adobe about including the capabilities into their tools and get the publishers to support it. Ok, there’s details… but why isn’t anyone talking about extending PDF to support structures in this way? It’s already proven, years ago. Next thing will be that structures will be getting embedded into Word documents and made searchable as if it is something novel. It’s been done many times already. The ACD/Labs website states “Microsoft Word documents with structures created in ChemDraw or MDL ISIS can also be retrieved. Not only can you perform exact structure searches, but you can also search by substructure. Added options allow you to preview search results, open search result documents in ChemSketch as well as in other applications, and store search results for later access.” There are other products doing this too. Strangely people don’t seem to know about these capabilities. They will… as we move forward to index the web for structures we hope to build the capabilities to search structures inside Word documents directly.

PMR: This is – rightly – not absolute news. It’s been possible for some time. For all I know many of the pharma companies are doing it behind closed doors and simply not telling us. But it’s not done in public. Why? I looked into this about 5 years ago with the intention of extracting ChemDraw files from Word. There was no zero-cost solution.
I’ll come to the mechanics of embedding later, but essentially the chemistry is often (not always) present as embedded (OLE) objects, usually ChemDraw but sometimes ISISDraw. Often it isn’t present except as an image. I tried to read the Word file (probably Word97) and extract the chemistry. CambridgeSoft – who I congratulate publicly – published their specification on the web, including the binary format. So, in principle, it should be possible to find the ChemDraw sections (which start with a special flag “VjCD0100”). Yes, I could find these, but the main problem was that Word garbled these into an unpredictable order of blocks. So about 70% of these were readable, while 30% were unreadable. Of course it’s technically possible to buy CambridgeSoft software to read CambridgeSoft files in Word, ACDLabs software to read their offerings, etc. But it costs money (probably lots) and more importantly lots of time in negotiation. We didn’t have a business case at the time, and in any case there wasn’t anything to read. The main problem was that there were no Word documents anyway. We tried some experiments with extracting stuff from native PDF and it clearly wasn’t going anywhere.
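The signature scan described above can be sketched in a few lines of Python. This is only an illustration, not the code used at the time: the language and the function name find_cdx_offsets are assumptions, and the only detail taken from the text is the “VjCD0100” flag. It reports where candidate ChemDraw blocks begin; it does nothing about the reordering of blocks inside the Word file, which is the part that made roughly 30% of them unreadable.

```python
# Hypothetical helper: locate candidate ChemDraw (CDX) blocks in raw bytes.
CDX_SIGNATURE = b"VjCD0100"  # flag that starts a ChemDraw binary section


def find_cdx_offsets(data: bytes) -> list[int]:
    """Return every byte offset at which the ChemDraw signature occurs."""
    offsets = []
    start = data.find(CDX_SIGNATURE)
    while start != -1:
        offsets.append(start)
        # Continue searching just past the current hit.
        start = data.find(CDX_SIGNATURE, start + 1)
    return offsets
```

To try it on a real document, read the file in binary mode and pass the bytes in. An offset only marks where a block starts; whether the bytes that follow are contiguous and parseable is a separate question.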
However I did write a ChemDraw to CML converter. It’s on SourceForge under cml.sf.net. It follows the public spec faithfully (although a few of the concepts are unclear such as “Generic Nickname”). It works. We’ve tested it on thousands of files. Of course it has the disadvantage that it’s free, Open and non-proprietary so no-one takes it seriously.
ChemConnector is right in that you *can* embed things in PDF and Word. But people don’t. There is a fundamental difference between the two formats… PDF is seen by most people as electronic paper. It’s much more than that – it’s a compound document format. It’s a container and you can put anything you like in it. It accepts images, XML, movies – you name it. It does metadata (I think MPEG/DIDL). But this has to be a conscious act. Most people don’t know this and don’t care. Moreover the format has changed many times. The versions may or may not be standards but they are highly confusing and AFAIK no-one in academia or scientific publishing does this.
Henry Rzepa is an advocate and explorer of PDF as a container for exciting objects and we looked at this in the SPECTRaT project but didn’t come away with much enthusiasm. If you put enough effort into developing a compound document format for a community you can make it work. There are many who use Docbook – the SGML/XML standard for technical manuals and beyond. But it’s too heavyweight for academia – either theses or publications. So you would get very short change from suggesting to an academic that they take on a new set of tools, have to be trained, etc. “just” to make their data more accessible or permanent. I know it shouldn’t be, but it is that way.
Word(TM) is somewhat different. Word accepts embedded binary objects (OLE). The normal way this is done is by buying tools from manufacturers (ChemDraw(TM), MDL(TM)/Symyx(TM), ACDLabs(TM), etc.). These companies have worked with the guts of Office/Word/Excel and so their objects can be inserted. The rest of this paragraph is speculation but roughly true… When you add a ChemDraw object to Word it continues to know it’s a ChemDraw object (OLE). When you click on it, Word looks to see if you have ChemDraw installed and if you do it brings up a ChemDraw window. You can edit the material and resave it in Word. Presumably the same thing happens for ISIS(TM)/MDL with MDL software. If you don’t have the software installed (like me) then it tells you that you don’t have the software installed. But it still displays it – through something called a Windows Metafile (WMF(TM)). Suppose I find a chemistry thesis with chemical structure diagrams in it. What can I do? I can’t tell just by looking at it. So I click the picture. If it says it wants to run ChemDraw, it’s a ChemDraw. If it wants to run ISIS it’s an ISIS file. Otherwise it might be a Windows Metafile. Or worse it may be a bitmap.
With Word2007 things change. I owe a lot of this to Joe Townsend who discovered all this in the SPECTRaT project. Word2007 has a native XML format, OOXML. We’ll leave for the moment whether that’s a good format or a spawn of Satan. Both views exist. But it does expose all the components. If the file contains ChemDraw files they are all separated into a directory called embeddings/. If they are WMFs they go into media/. So they are all accessible in some form.
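Because a Word2007 file is an ordinary ZIP package, the split into embeddings/ and media/ can be inspected with nothing more than a standard library. The sketch below is an illustration in Python and is not the SPECTRaT software; the function name list_embedded_parts is made up for the example. In a typical .docx these directories sit under word/ (word/embeddings/, word/media/), so the code matches the folder name anywhere in the path instead of assuming a fixed prefix.

```python
import zipfile


def list_embedded_parts(docx_file):
    """Group the parts of an OOXML package by the folder they live in.

    docx_file may be a path or a binary file-like object.
    """
    groups = {"embeddings": [], "media": []}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            folders = name.split("/")[:-1]  # directory components only
            for folder in groups:
                if folder in folders:
                    groups[folder].append(name)
    return groups
```

The entries under embeddings are the embedded objects (ChemDraw, ISIS and so on) and those under media are the pictures Word shows in their place. Each can then be read with archive.read(name) and handed to whatever parser fits, for example a scan for the ChemDraw signature.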
The important thing is that they are already in the Word document. Nobody told the students to put them in so they can be archived. They did it without thinking.
Yes, they can be searched. With our own software. I’ll be demoing it on Thursday at the Royal Society of Chemistry.
But the trouble is that it’s Open Source and Open Data. So perhaps I should decide now to make it commercial. What do you think?


3 Responses to Searching Word and PDF for chemistry

  1. Peter Sefton says:

    Don’t forget that OpenOffice.org and ODF have a similar way of packaging XML and binary objects so you can take the Word documents with OLE embedding and process them in ODF as well.

  2. pm286 says:

    (1) Thanks – and Jim and I will be pursuing this with you. But can you use Open Office to embed them in the first place or do you always have to start with Word?

  3. James Jack says:

    Peter – I have this working in a prototype for files of type doc, docx, ppt, pptx, xls, xlsx, pub and pubx. I can give you a demo via webex. The functionality is likely to be unveiled at the Symyx UGM in Philadelphia. I’m working on making PDF searchable. Both SSS and ExactMatch searches are working and relatively fast.
    Best Regards,
    James
