Green Publishers in Chemistry

My blog comments seem to be giving people a lot of problems. Stevan Harnad comments:

Peter Murray-Rust writes: “Most chemistry publishing is closed access, not even allowing Green self-archiving (unless paid for). There is no sign that any of the major closed publishers (ACS, RSC, Wiley, Springer, Elsevier, Nature) are likely to change in the immediate future.”


Peter is right that ACS is likely to be the last of all publishers to go Green on OA self-archiving (http://openaccess.eprints.org/index.php?/archives/213-guid.html), but he is mistaken about most of the others on his list (see http://romeo.eprints.org/ ):


ACS: gray
RCS: GREEN
Wiley: GREEN
Springer: GREEN
Elsevier: GREEN
Nature: pale-green


Stevan Harnad
http://www.eprints.org/openaccess/

PMR: Thanks Stevan. I am happy to be corrected on Wiley, Springer and Elsevier. I seem to remember that the RSC (sic) announced it was Green ca. 2 years ago. That got into Romeo but then they said they weren’t green after all and it was a mistake.

I am also not really clear any more what pale-green is. I assume “gray” means neither gold nor green.

Posted in Uncategorized | 1 Comment

Open Access in Chemistry – thoughts for Thursday

The RSC is hosting a meeting on Open Access on Thursday:

Event Details

Open Access Publishing in the Chemical Sciences
The meeting will revisit the area of open access publishing including Bioinformatics and data repositories.
Date: 22 May 2008 09:30 – 17:00

and I’m kicking off so I will spend a little time setting the scene. Over the next 3 days I’ll use this blog to explore some ideas that might be relevant. This has the advantage that I can get feedback to see if I’m on the right track. It also acts as a set of notes in case (a) the projection system breaks down and (b) people want something afterwards. It may even persuade people to come to the meeting (no idea whether there is spare space).
So the issues…
Chemistry is an established subject, over 200 years old and data-rich. It has well-established learned organisations (IUPAC, RSC, ACS and other national societies). Chemistry itself is less in the forefront of radical scientific change in physical science  (the ideas of the C19 chemists are still fundamental today) although it continues to shed insight on certain aspects such as fast processes and quantum mechanics.
It’s increasingly an applied science – chemistry is used to explain processes in other sciences such as bioscience, climate, etc. It has a very large design/engineering motive – can we make something with the following properties or to do the following job? There has been a large and successful chemical industry for over 100 years. Chemistry will still be around as a discipline in 50 years’ time, though it may have changed its name and organisational location. All of which makes it hard to come up with a single approach to “Open Access Publishing”.
Chemistry is a conservative discipline, in contrast to much bioscience and much physics and the whole of informatics. This is reflected in its buildings, organisations, and publishing. There is nothing necessarily morally wrong in being slow to change, but it’s likely to cause tensions with fast moving areas such as the Web and bioscience. And it does.
Most chemistry publishing is closed access, not even allowing Green self-archiving (unless paid for). There is no sign that any of the major closed publishers (ACS, RSC, Wiley, Springer, Elsevier, Nature) are likely to change in the immediate future. There are Open Access publishers, the most prominent in the PLoS or BMC camps. They have little market share at present and they will have to work very hard to change this. (This is not true in bioscience where the Open Access publishers are making major advances).
They have been challenged by the various Open Access (or Free Access) archiving mandates, most notably from the NIH. Some of them, particularly the ACS, see this as unacceptable government action leading to the destruction of scholarly publishing. It is likely that chemical publishers have been lobbying to continue closed access to defend peer-review.
I hope the meeting does not discuss these played out issues or it will be mainly a non-productive talking shop. Where we differ we are unlikely to agree as a result of a few hours presentations.
What I hope we can do is to look to what is technically possible and desirable in the future of chemical publishing. Chemistry has enormous potential – it could be one of the most exciting data-driven sciences.
But it isn’t. That’s because of the conservatism of outlook and the fact that whatever the current business model we are prevented from innovation (even if we paid zillions for it).
There are hundreds of thousands of papers reporting chemistry each year and many of them – probably most – are loaded with facts. These facts are high quality. Not all, but most. Most chemistry is reproducible (much bioscience is not). Chemists care about reproducibility.
But we cannot get the facts. I’ll use “data” from now on, but they are facts. The melting point of X is Y (temperature) at Z (pressure) is a fact. I hope at least we can agree on that, and that it isn’t a “creative work”.
There are large compilations of chemical facts, but they are a small percentage of the published literature. Partly because some need reviewing/curation, but mainly because there are just too many. The aggregators of facts can’t keep up. Even if they could we can’t reuse the facts in their current form. Each supplier of aggregated facts has their own idiosyncratic data format, each will only let them out under licence, and these licences form an anticommons which effectively prevents their re-use.
I shall present a scenario where these facts are gathered automatically and made instantly available. It’s not fantasy – we’re actually doing it. And it’s not even expensive – it can be done on marginal costs.
But – it seriously threatens a conservative publishing industry. Not because I want to, but because it is in the nature of CE 2008 and beyond. We cannot prevent it. It’s happening.
So chemical data publishing will make a decision on Thursday. There is no escape from making a decision. I hope it’s a positive one, but  not facing up to the changes of the Web is also a decision. It says “we will continue incrementally in the same way as we have been doing”. It’s an incremental decision.
If the chemical data publishing industry continues in its current form, does it have a future?  Yes, and I’ll try to suggest it for Chemical Abstracts and for the Cambridge Crystallographic Data Centre (I have no inside knowledge on either).
Can it and should it make major changes? It should, because the future will be brighter. Can it? I am not a business expert and have no useful opinion. But I hope so, for chemistry’s sake.


Keep Open Access Licences and Contracts Simple

I missed this post (The Scientist) more than a week ago. It generated a fair amount of comment. The point I’m going to make is that Open Access can range from very simple to very complicated…

new paradigm

Date: Tuesday, 06 May 2008
In a move that is bound to put Felix among the Columbidae the Journal of Cell Biology has come up with an interesting way of licensing its content.
Emma Hill and Mike Rossner take us on a journey that begins in 1787 and ends with the rather extraordinary statement (from a major scientific journal, at least) that they have now decided to return copyright to our authors, in return for the authors (that is, those of us who manage to publish in JCB) making the work available to the public.
In other words, article authors — scientists like you and me — grant JCB a licence to publish their work for six months.
There is a lot to think about in here, and I encourage you to read the entire statement. There is the thesis that six months is the monetary lifetime of a paper. There is the possibility that real data-mining will be made possible, a move that should make the likes of Peter Murray-Rust very happy (although the matter of format now becomes more pressing).
There is the intriguing thought that what appears to be a reclamation of the original premise of copyright might jump across to patents. Now that would be exciting, and good for science.

PMR: Here is part of the statement:

You wrote it; you own it!
Emma Hill1 and Mike Rossner2
1 Executive Editor, The Journal of Cell Biology
2 Executive Director, The Rockefeller University Press

[…]
With the growing demand for public access to published data, we recently started depositing all of our content in PubMed Central. In a further step to enhance the utility of scientific content, we have now decided to return copyright to our authors. In return, however, we require authors to make their work available for reuse by the public. Instead of relinquishing copyright, our authors will now provide us with a license to publish their work. This license, however, places no restrictions on how authors can reuse their own work; we only require them to attribute the work to its original publication. Six months after publication, third parties (that is, anyone who is not an author) can use the material we publish under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License (http://creativecommons.org/licenses/by-nc-sa/3.0).
What does this Creative Commons License mean? It means that our published content will be open for reuse, distribution, data mining, etc., by anyone, as long as attribution is made to the original work. Share-alike means that any subsequent distribution must follow the rules set out in this license. Non-commercial means that published work can be reused without permission, as long as it is for noncommercial purposes.
We still believe in protecting our authors from commercial exploitation of their work. Commercial reuse of material published in one of our journals will still require permission from the journal. Authors, on the other hand, can reuse their own material for any purpose, including commercial profit, as long as proper attribution is made.
Within the first six months after publication, the same terms of this license apply, with one exception: the creation of mirror sites containing all of, or a subset of, our content is prohibited during that period. The Rockefeller University Press still derives essential revenue from journal subscriptions to content within the first six months, and thus we cannot risk the creation of a free mirror site during this time.
The Creative Commons License will apply retroactively to all work published by The Rockefeller University Press before November 1, 2007; the license restricting the creation of mirror sites will apply to all work published within the last six months. Authors who previously assigned their copyright to the Press are now granted the right to use their own work in any way they like, as long as they acknowledge the original publication.
We are pleased to finally comply with the original spirit of copyright in our continuing effort to promote public access to the published biomedical literature.

PMR: There were a dozen or so comments, several of which pointed out that this is nothing particularly new. Not all commenters agreed on the interpretation. First, I think that RUP are trying to aim in a reasonable direction. However the actual approach is complicated and I’m not sure I can understand it. I think it says something like:

“You the authors own the content, and you the authors retain the copyright. However for the first six months you the authors agree not to post it on your own web sites nor to help anyone else to do this on their web sites. After six months the material will become available under CC-NC-SA. [PMR: This licence is viral; it requires anyone who uses it to make the whole a derivative work available under the same licence – I don’t think it’s a good choice, especially after our discussions of last week. But that’s not the issue here.]
You the subscribing reader can read the material as soon as it is published, but you can’t do anything with it. [How this is communicated to the reader is a separate problem. A CC licence will suggest that the reader *can* re-use it; so there has to be a separate statement. Perhaps something like: ‘This article carries a licence which you should ignore until six months after publication’. or ‘after 6 months a licence will start to appear on articles on this web site’.]
You the non-subscribing reader won’t be able to do anything for six months. When that time has elapsed you can do anything so long as it is non-commercial (whatever that means) and as long as you apply share-alike (if you can understand the complexities).”

PMR: My point is that whatever this is and however well-intentioned, it’s complicated. I suspect that no other journal operates this exact licence/contract/copyright. It would take someone expert 15 minutes minimum to work out what is going on. And a non-expert (in licences) might well misunderstand it.
There are tens of thousands of journals. If each has a different set of conditions it makes it impossible for everyone – authors, readers, repositarians, etc. It already is impossible.
Open Source now gets by with a workable set of licences. For most people there are only half a dozen – GPL, LGPL, BSD, Apache, Mozilla, Artistic, etc. It takes me a second to understand the implications of each. OK, it would take me an hour or two if I came in fresh – but there are lots of web sites to help. But even after 2-3 years I still have little idea as a reader/re-user how to interpret licence/contract/copyright on publishers’ sites. Even those who try to be helpful.
Unless, of course, they add a CC-BY licence. Whatever you like or dislike about CC-BY, it’s SIMPLE. Everyone can understand it in a sentence:
“You can do whatever you like with this as long as you attribute the authors”.
Consider carefully whether simplicity isn’t worth something.


Generic chemical names

In a recent post ( Chemical names – the challenge) I addressed the question of how to identify chemical names in text and referred to the problem of generic rather than specific names. Here’s an example showing how we cannot just lift names out of text and look them up or convert them to connection tables.
I happened to notice that the American Chemical Society had posted a list of its most highly accessed articles. [Which – like all publishers – probably bears little relation to their most highly cited articles]. So I thought I’d use the first one as an example of the use of chemical names. I reproduce the abstract but to avoid charges of violating copyright and remaining within “fair use” have removed all the verbs, replacing them by ellipses.
==============================================================
J. Am. Chem. Soc., 129 (5), 1465 -1469, 2007. 10.1021/ja068047t S0002-7863(06)08047-4
Web Release Date: January 11, 2007 Copyright © 2007 American Chemical Society
Total Synthesis of (+)-Nakadomarin A
Ian S. Young and Michael A. Kerr*
[…]

[PMR: chemical structure diagram deleted as I do not want to violate copyright and I haven’t got time to redraw it. It might, of course, make the abstract easier to understand].

Abstract:
The total synthesis of (+)-nakadomarin A is […]ed. A three-component cycloaddition of a hydroxylamine, aldehyde, and cyclopropane to form a highly functionalized tetrahydro-1,2-oxazine […]s as the foundation for this synthesis. The resulting oxazine is […]ed as a single diastereomer with the absolute configuration being […]ed by the chirality of the cyclopropane. Other key steps […] desymmetrization of a malonate by reduction, Heck cyclization and pyrrolidine formation, and ring-closing metathesis to […] both cycloalkenes. Overall, the synthesis […]ed 23 linear steps from the cyclopropane, which in turn […] available (six steps) in optically pure form from commercially available D-mannitol.
So there are 11 different chemical names (one is repeated) of which I judge only 2 represent well defined single identifiable compounds while the other 9 are generic or may represent Synecdoche.
The generic nature of these compounds is not trivial. “aldehyde” could be represented by “R-C(=O)H” but “cyclopropane” would have to be “R1C(R2)1C(R3)(R4)C(R5)(R6)1”, unless further indication is given in the text.
UPDATE…
[FWIW: (+)-nakadomarin A is not given in Pubchem so is probably of no interest except to synthetic chemists. I can’t reproduce the structure as the ACS claim copyright over the graphical abstract. Presumably the high access is mainly from synthetic chemists.
Pubchem does, however, tell us what “D-mannitol” is.
galactitol; D-mannitol; mannitol …
It is listed as having lots of synonyms:


Depositor-Supplied Synonyms: (Total: 120)



galactitol
D-mannitol
mannitol
dulcitol
sorbitol
dulcose
glucitol
D-Glucitol
D-Sorbitol
dulcite
Are these all the same thing? I somewhat doubt it… And some of them are certainly generic. Is “mannitol” the same as “D-mannitol”? Without knowing whether they are used synonymously I can’t say.


Chemical names – the challenge

Antony Williams on Chemspider posts a serious question and I give a serious answer. I wonder if it’s what he was expecting… I’ll state it, and then comment on NLP before giving an answer:
06:05 17/05/2008, Antony Williams,
[…]
How many chemicals are mentioned in this paragraph?
“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
The question is basically to identify chemical names in free human text. This is different from working out what those names mean. Peter Corbett, Colin Batchelor (Royal Soc Chemistry), Ann Copestake and others in our Sciborg project have been working hard on this for many months. PeterC- please forgive me if I get details wrong. PeterC is presenting this shortly at a Natural Language Processing conference and it was also highlighted in Ann’s presentation to the British Computer Society last Tuesday (which I still have to blog). Peter gave a dry run to the Unilever Centre two weeks ago and I’ll try to do this justice.
Firstly there are several types of usage which involve chemical names. Peter uses “pyridine” as an example.
  • the reactants were dissolved in pyridine
  • nicotinic acid and other pyridine derivatives
  • the signal from the protons in the pyridine ring were shifted upfield
In only the first does “pyridine” refer to a single, identifiable, compound, but all three sentences contain one or more chemical names. But is “reactant” a chemical? “Nicotinic acid” is. Are “protons” chemicals?
Language is flexible and ambiguous. Ann argued strongly, and I support her, that language is ipso facto ambiguous and its power comes from its ambiguity. We humans are very good at resolving this. I went into a room in the chemistry department recently and there was a large cabinet with O2 on it. Was this a chemical [see below]?
Almost all sentences are ambiguous. The work in Sciborg involves exhaustive multiple parses for any sentence. Here’s a simple example:
“There was nothing on Tuesday”
The meaning of this seems obvious, but a machine would probably come up with at least 10. For longer sentences it is not impossible to have thousands of parses and even run out of stack space on older machines. Interpretations could be:
“Nothing significant happened on Tuesday”
“No mail arrived on Tuesday”
“There was nothing significant on the TV, Tuesday” (US usage often omits “on” before a date)
“The day on which something was expected to happen was not Tuesday”
and more far-fetched:
“The universe disappeared into a void on Tuesday”
“Tuesday [Weld] had no clothes on”
“The mob had no hold on [Ruby] Tuesday”
The machine would almost certainly come up with most of these. The only manageable way is to give them all probabilities. A typical parse could hold millions of possibilities for a document, and the challenge of NLP is to try to disambiguate them. For example,
“She waited for his letter. There was nothing on Tuesday”
is still very hard to parse, but some of the ambiguities can pragmatically be removed.
So when OSCAR3 parses text Peter adds a probability to each putative named entity. Here’s a simple example:
ne id="o71" surface="imidazole" type="CM" confidence="0.9257968817491067" SMILES="c1c[nH]cn1" InChI="InChI=1/C3H4N2/c1-2-5-3-4-1/h1-3H,(H,4,5)/f/h4H" cmlRef="cml9" ontIDs="CHEBI:16069"
OSCAR has identified “imidazole” as a putative chemical compound with a confidence of 93%. For
ne id="o101" surface="2H" type="CM" confidence="0.3341514144473448" rightPunct=","
it gives only 33% for the string “2H”, which could be deuterium, or two Hydrogen atoms, or – as in this case – an annotation of a hydrogen atom in a spectrum.
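As a rough illustration of how such annotations might be consumed downstream, here is a minimal Python sketch. It assumes the attributes arrive as plain name="value" pairs with ordinary double quotes (the two strings are the ones quoted above); the function names are mine and are not part of OSCAR3.

```python
import re

# The two annotations quoted above, written with plain double quotes.
RAW = [
    'ne id="o71" surface="imidazole" type="CM" confidence="0.9257968817491067" '
    'SMILES="c1c[nH]cn1" InChI="InChI=1/C3H4N2/c1-2-5-3-4-1/h1-3H,(H,4,5)/f/h4H" '
    'cmlRef="cml9" ontIDs="CHEBI:16069"',
    'ne id="o101" surface="2H" type="CM" confidence="0.3341514144473448" rightPunct=","',
]

# One attribute: a name, an equals sign, and a double-quoted value.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_ne(text):
    """Return the attributes of one annotation as a dict (hypothetical helper)."""
    return dict(ATTR.findall(text))


def confident_entities(annotations, threshold):
    """Keep the surface strings whose confidence exceeds the threshold."""
    kept = []
    for text in annotations:
        attrs = parse_ne(text)
        if float(attrs["confidence"]) > threshold:
            kept.append(attrs["surface"])
    return kept


print(confident_entities(RAW, 0.9))  # ['imidazole']
print(confident_entities(RAW, 0.3))  # ['imidazole', '2H']
```

Where to set the threshold is a judgment about how many false positives one can tolerate; nothing here says what value is right.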
And surely we can aim for 100%?
No. The responsible way to do research in NLP is to mark up a corpus first. So three annotators (Peter, Colin and David Jessop) spent some weeks marking up papers kindly provided by Royal Society of Chemistry. (One of the major problems in text-mining is that you can’t usually get a decent corpus because many publishers won’t let you). The average agreement was about 90%.
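I have not said which agreement measure was used, so the following Python is only one plausible way to compute such a figure: a pairwise F-score over the sets of spans each annotator marked, averaged over all pairs. The annotator labels and the character offsets are invented for illustration and do not come from the real corpus.

```python
from itertools import combinations


def pairwise_f1(a, b):
    """F1 between two sets of annotated spans, treating `a` as the reference."""
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


def mean_agreement(annotators):
    """Average the pairwise F1 over every pair of annotators."""
    pairs = list(combinations(annotators.values(), 2))
    return sum(pairwise_f1(a, b) for a, b in pairs) / len(pairs)


# Hypothetical (start, end) character offsets for one short passage.
spans = {
    "annotator_1": {(0, 8), (20, 29), (40, 46), (60, 62)},
    "annotator_2": {(0, 8), (20, 29), (40, 46)},
    "annotator_3": {(0, 8), (20, 29), (40, 46), (60, 62)},
}

print(round(mean_agreement(spans), 3))  # 0.905
```

In this toy case the disagreement is over a single short span, which is exactly the kind of string ("2H", "He") where experts differ.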
So there cannot be a single answer to Antony’s question. A meaningful question would be:
“Using a given corpus, previously annotated by experts, and with agreed guidelines for marking up chemicals, what compounds occur in the following paragraph with a probability of greater than x (e.g. 0.9)”
OSCAR3 has been developed in this manner. Currently it achieves about 80% rather than the 90% that expert humans do. Among the strategies that OSCAR uses or might use are:
  • comparison with English-language lexicon. If a word is also an English language word it is less likely to be a chemical.
  • comparison with chemical lexicon, e.g. ChEBI. If it’s in there, its probability is increased
  • part of speech. If it’s a noun it’s increased, if a verb it’s decreased
  • lexical form. footyloxybarate is not a known chemical, but its lexical form makes it highly probable it is a fictitious one and not, say, a film star or pop group.
  • Hearst patterns. “bioactive compounds such as aspirin or spat”. Even if not in a lexicon “spat” is probably a chemical rather than the past tense of spit.
  • And usage (probabilistic). “take an aspirin” is a common phrase. “take a benzene” is of very low probability. So although “Dagger” (capitalised) is a trade name in Pubchem, I doubt there are any extant uses of “a dagger” as opposed to “some dagger”.
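To make the flavour of these strategies concrete, here is a toy Python sketch that combines four of them (English lexicon, chemical lexicon, lexical form, one Hearst pattern). It is emphatically not how OSCAR3 works: the word lists, the suffix pattern and the weights are all invented for illustration, and part of speech and usage statistics are left out entirely.

```python
import re

# Tiny illustrative word lists; a real English dictionary or ChEBI is far larger.
ENGLISH = {"spat", "dagger", "drive", "derive", "tartan", "he", "in", "as", "be"}
CHEMICAL = {"aspirin", "pyridine", "imidazole", "benzene"}

# Crude "looks like a chemical name" test based on common endings (an assumption).
CHEMICAL_SUFFIX = re.compile(r"(ate|ine|ole|ene|ide|ol)$")

# One Hearst pattern: "compounds such as X or Y".
HEARST = re.compile(r"compounds such as (\w+)(?: or (\w+))?")


def score(word, sentence):
    """Return a rough score between 0 and 1; the weights are arbitrary."""
    w = word.lower()
    points = 5                       # start undecided, in tenths
    if w in ENGLISH:
        points -= 3                  # ordinary English word: less likely
    if w in CHEMICAL:
        points += 4                  # known chemical: more likely
    if CHEMICAL_SUFFIX.search(w):
        points += 2                  # chemical-looking lexical form
    match = HEARST.search(sentence.lower())
    if match and w in match.groups():
        points += 3                  # named as an example of "compounds"
    return max(0, min(10, points)) / 10


print(score("aspirin", "He went home and took an aspirin after the beating."))  # 0.9
print(score("spat", "she spat in his face"))                                    # 0.2
print(score("spat", "bioactive compounds such as aspirin or spat"))             # 0.5
print(score("footyloxybarate", "a dose of footyloxybarate"))                    # 0.7
```

Even this crude version shows the point: the same string “spat” gets a different score depending on its surroundings.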

Peter has other clever tricks (and I suspect that there are some that are unique to our project).

So my answer to Antony’s question is that although there are several lexical forms which occur in certain lexicons (primarily Pubchem) their local syntactic and semantic occurrence makes it extremely improbable that any of them would be meaningful compounds. OSCAR will find only two possible compounds:
  • He. Unfortunately short strings (He, As, In, Be, etc.) and many abbreviations are difficult. OSCAR weights these down and the probability is low.
  • aspirin – by lookup.
If you think it’s easy to identify chemicals, try this phrase:
“She used her platinum card to buy a gold necklace, then crossed the iron bridge across the water as gold flecks decorated the sunset. Salt spray blew as she walked across the sand… “
I doubt that Peter and Colin would achieve 100% on that. But it’s an unfair test as the chemical guidelines were based on 14 papers from the RSC, not bodice-ripper novellas.
PMR: [answer to O2: No, there is a telecom supplier in the UK with the trade name O2 and it was full of telecoms gear. No oxygen except what comes from the air.]

Klaus Graf on OA and thoughts for Thursday

In an email from Klaus Graf, one of the most consistent supporters of BBB-OA (and CC-BY):

http://archiv.twoday.net/stories/4931334/
In German. [PMR: It translates quite readably with Babelfish]. Reasons for re-use OA and my mantra "Make all research results CC-BY":
(1) Data mining, see already http://archiv.twoday.net/stories/4851871/ (in English)
(2) Educational use, see GSU case and "Open Education" movement
(3) Open Content projects (e.g. Wikimedia projects)
(4) Re-Use of heritage items and scholarly photographs/illustrations: not possible if CC-NC and publication in commercial journal
(5) More chance for impact if commercial use
(6) Translations
(7) Wiki-re-use of materials
(8) Orphans
(9) Mirroring in repositories - LOCKSS principle
(10) We need more remix experiments
Summary: CC-BY as default license.
PMR: Clarification: I think "Orphan" means abandoned by the author(s) often through death. This overcomes the ludicrous response I sometimes get from Libraries (and by indirection the British Library - "this work is over 100 years old, but we can't let you have it without permission from the author").
PMR: I'd like to see this list agreed and widely posted for reference. I'm not sure whether it's comprehensive. I have a particular requirement for "images" in scientific publications. Things like the diagram of the apparatus; chemical formulae; the machine-produced graphs of the data and many more. I'd see these as different from 4. The knee-jerk response of publishers is "it's an image - it's OURS! If you want it, you'll have to pay!"
I'm thinking of what I say on Thursday. There are several publishers there. Publishers have three policies on Open Access:
  • clear and positive (PLoS, BMC and occasional Open Access journals from others).
  • It’s all ours and we make this clear
  • Mumble

Unfortunately most publishers adopt the mumble policy. This is simple to describe and simple to operate. If asked a direct question, fail to give the courtesy of an answer. This perpetuates FUD or at least UD. Typical mumble licensing policy is shown by the American Chemical Society which was asked 6 months ago by Chemspider whether CS could use our CrystalEye data derived from the supplemental data in ACS articles. This data is outside the firewall but indirectly stamped as copyright. Is re-use of this data allowed?

Mumble.

So I think there may be representatives of ACS divisions present at the meeting.

Is it rude to ask them what their policy is?

They have 5 days to think of an answer. Surely that’s long enough?

Oh, and while I’m at it, and if there is a Wiley representative, maybe I’ll ask them whether I am allowed to reproduce graphs from their papers without permission.

After all it’s our data, not theirs.


Avoid the pain and embarrassment – make all the raw data available

An excellent account from Cameron Neylon of the frustration (and possibly ruined young careers) when experiments are not repeatable because they haven’t been done properly. While there is no algorithm that guards against this, the deposition of raw data is a powerful mechanism. Whether or not the data are made open on day one, the sure knowledge that they will be Open focuses the researcher. From our own work on NMR we saw that the discipline really made one think about what was reported and how.
10:07 16/05/2008, Cameron Neylon,

Searching Word and PDF for chemistry

ChemConnector (from the Chemspider stable – anonymous, but possibly Antony Williams) has posted on searching for chemical structures in Word and PDF files. S/he rightly concludes that (a) it’s possible and (b) not done – at least in public. I’ll reproduce the whole post and then comment.

Hamburger PDFs and Making Them Structure Searchable

There have been numerous conversations about “Hamburger PDFs” over the months and the most recent exchange is that between Chris Rusbridge and Peter Murray-Rust. Another conversation that I have seen go on has been about making Word documents structure searchable (cannot track down the appropriate blog-postings at present). This is just an fyi comment for the community really since there is a general assumption that Word Documents and PDFs cannot be made structure-searchable. The truth is that both can be made structure searchable. How? Well, you need to write the correct information into the file to enable it but it’s possible. There are a number of solutions out there allowing structure-based searching of Word document files. I believe the first one was originally from Oxford Molecular before being acquired by Accelrys. I think there are now multiple including, I believe, Cambridgesoft, ACD/Labs and probably others. The only PDF structure searching capability I am aware of is that created by ACD/Labs a few years ago. Their website states “Our Search for Structure system allows you to seek out chemical structures in various file formats throughout your computer’s file systems. These formats include: SK2, MOL, SDF, SKC, CHM, CDX, RXN, and PDF (Adobe Acrobat); DOC (Microsoft Word), XLS (Microsoft Excel), and PPT (Microsoft PowerPoint), and ACD/Labs databases: CUD, HUD, CFD, NDB, ND5, and INT.” For PDF it was required that structure files were “tagged” appropriately when written to PDF by an embedded PDF generation capability. Since the PDF format can be extended ACD/Labs did so. If we wanted to make the majority of PDF files structure searchable then it seems as if the appropriate thing to do would be to extend the general PDF format for Life Sciences, talk to Adobe about including the capabilities into their tools and get the publishers to support it. Ok, there’s details… but why isn’t anyone talking about extending PDF to support structures in this way?
It’s already proven, years ago. Next thing will be that structures will be getting embedded into Word documents and made searchable as if it is something novel. It’s been done many times already. The ACD/Labs website states “Microsoft Word documents with structures created in ChemDraw or MDL ISIS can also be retrieved. Not only can you perform exact structure searches, but you can also search by substructure. Added options allow you to preview search results, open search result documents in ChemSketch as well as in other applications, and store search results for later access.” There are other products doing this too. Strangely people don’t seem to know about these capabilities. They will… as we move forward to index the web for structures we hope to build the capabilities to search structures inside Word documents directly.

PMR: This is – rightly – not absolute news. It’s been possible for some time. For all I know many of the pharma companies are doing it behind closed doors and simply not telling us. But it’s not done in public. Why? I looked into this about 5 years ago with the intention of extracting ChemDraw files from Word. There was no zero-cost solution.
I’ll come to the mechanics of embedding later, but essentially the chemistry is often (not always) present as embedded (OLE) objects, usually ChemDraw but sometimes ISISDraw. Often it isn’t present except as an image. I tried to read the Word file (probably Word97) and extract the chemistry. CambridgeSoft – who I congratulate publicly – published their specification on the web, including the binary format. So, in principle, it should be possible to find the ChemDraw sections (which start with a special flag “VjCD0100”). Yes, I could find these, but the main problem was that Word garbled these into an unpredictable order of blocks. So about 70% of these were readable, while 30% were unreadable. Of course it’s technically possible to buy CambridgeSoft software to read CambridgeSoft files in Word, ACDLabs software to read their offerings, etc. But it costs money (probably lots) and more importantly lots of time in negotiation. We didn’t have a business case at the time, and in any case there wasn’t anything to read. The main problem was that there were no Word documents anyway. We tried some experiments with extracting stuff from native PDF and it clearly wasn’t going anywhere.
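For what it’s worth, locating the flag is the easy part. A minimal Python sketch, assuming only that the file can be read as raw bytes and that the flag appears literally as the ASCII string “VjCD0100” (the function name and the synthetic test bytes are mine):

```python
SIGNATURE = b"VjCD0100"  # the ChemDraw flag mentioned above


def find_signatures(data, signature=SIGNATURE):
    """Return every byte offset at which the signature occurs."""
    offsets = []
    pos = data.find(signature)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(signature, pos + 1)
    return offsets


# Synthetic bytes standing in for a Word file; a real file would be read
# with open(path, "rb").read(), where path is whatever document you have.
blob = b"\x00" * 16 + SIGNATURE + b"\x01\x02" * 8 + SIGNATURE + b"\xff" * 4

print(find_signatures(blob))  # [16, 40]
```

This only finds where each ChemDraw section starts. It does nothing about the real difficulty, which is that the blocks following the flag may have been stored out of order.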
However I did write a ChemDraw to CML converter. It’s on SourceForge under cml.sf.net. It follows the public spec faithfully (although a few of the concepts are unclear such as “Generic Nickname”). It works. We’ve tested it on thousands of files. Of course it has the disadvantage that it’s free, Open and non-proprietary so no-one takes it seriously.
ChemConnector is right in that you *can* embed things in PDF and Word. But people don’t. There is a fundamental difference between the two formats… PDF is seen by most people as electronic paper. It’s much more than that – it’s a compound document format. It’s a container and you can put anything you like in it. It accepts images, XML, movies – you name it. It does metadata (I think MPEG/DIDL). But this has to be a conscious act. Most people don’t know this and don’t care. Moreover the format has changed many times. The versions may or may not be standards but they are highly confusing and AFAIK no-one in academia or scientific publishing does this.
Henry Rzepa is an advocate and explorer of PDF as a container for exciting objects and we looked at this in the SPECTRaT project but didn’t come away with much enthusiasm. If you put enough effort into developing a compound document format for a community you can make it work. There are many who use Docbook – the SGML/XML standard for technical manuals and beyond. But it’s too heavyweight for academia – either theses or publications. So you would get very short change from suggesting to an academic that they take on a new set of tools, have to be trained, etc. “just” to make their data more accessible or permanent. I know it shouldn’t be, but it is that way.
Word(TM) is somewhat different. Word accepts embedded binary objects (OLE). The normal way this is done is by buying tools from manufacturers (ChemDraw(TM), MDL(TM)/Symyx(TM), ACDLabs(TM), etc.). These companies have worked with the guts of Office/Word/Excel and so their objects can be inserted. The rest of this paragraph is speculation but roughly true… When you add a ChemDraw object to Word it continues to know it’s a ChemDraw object (OLE). When you click on it, Word looks to see if you have ChemDraw installed and if you do it brings up a ChemDraw window. You can edit the material and resave it in Word. Presumably the same thing happens for ISIS(TM)/MDL with MDL software. If you don’t have the software installed (like me) then it tells me I don’t have the software installed. But it still displays it – through something called a WindowsMetaFile (WMF(TM)). Suppose I find a chemistry thesis with chemical structure diagrams in it. What can I do? I can’t tell just by looking at it. So I click the picture. If it says it wants to run ChemDraw, it’s a ChemDraw. If it wants to run ISIS it’s an ISIS file. Otherwise it might be a Windows Metafile. Or worse it may be a bitmap.
With Word2007 things change. I owe a lot of this to Joe Townsend who discovered all this in the SPECTRaT project. Word2007 has a native XML format, OOXML. We’ll leave for the moment whether that’s a good format or a spawn of Satan. Both views exist. But it does expose all the components. If the file contains ChemDraw files they are all separated into a directory called embeddings/. If they are WMFs they go into media/. So they are all accessible in some form.
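As a rough illustration of how exposed those components are: a Word2007 file is a zip package, so the standard library is enough to list what is in it. This sketch is not our SPECTRaT software. It assumes the two directories sit under word/ inside the package (word/embeddings/ and word/media/), and it builds a tiny fake package in memory with invented part names so that it runs without a real thesis to hand.

```python
import io
import zipfile


def list_embedded_parts(docx_file):
    """Return (embeddings, media) part names found in a .docx package."""
    with zipfile.ZipFile(docx_file) as pkg:
        names = pkg.namelist()
    embeddings = sorted(
        n for n in names
        if n.startswith("word/embeddings/") and not n.endswith("/")
    )
    media = sorted(
        n for n in names
        if n.startswith("word/media/") and not n.endswith("/")
    )
    return embeddings, media


def make_demo_docx():
    """Build a minimal stand-in package; the part names are hypothetical."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("word/document.xml", "<document/>")
        pkg.writestr("word/embeddings/oleObject1.bin", b"placeholder bytes")
        pkg.writestr("word/media/image1.wmf", b"placeholder bytes")
    buf.seek(0)
    return buf


embeddings, media = list_embedded_parts(make_demo_docx())
print("embedded objects:", embeddings)
print("preview images:  ", media)
```

On a real document you would pass the path to the .docx file instead of the in-memory demo. Getting the chemistry out of each embedded object is a further step that this sketch does not attempt.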
The important thing is that they are already in the Word document. Nobody told the students to put them in so they can be archived. They did it without thinking.
Yes, they can be searched. With our own software. I’ll be demoing it on Thursday at the Royal Society of Chemistry.
But the trouble is that it’s Open Source and Open Data. So perhaps I should decide now to make it commercial. What do you think?

Posted in Uncategorized | 3 Comments

Open Access at the Royal Society of Chemistry

The preposition is “at”. Next Thursday the RSC is organising a one-day meeting in Burlington House, London (next to the Royal Academy if you want a change afterwards). Here’s the programme:

Open Access Publishing in the Chemical Sciences

Final programme for the one-day meeting at Burlington House on May 22nd, 2008.

Time Topic and Speaker

0930 Registration and coffee

1000 Welcome and Introduction to the meeting

Barry Dunne, CICAG Chairman

1010 Open Data – why it must become universal

Peter Murray-Rust, University of Cambridge

1045 Coffee

1100 Open Access – the publishers’ perspective

Ian Russell, Association of Learned and Professional Society Publishers

1135 Open Access and the Wellcome Trust

Robert Kiley, The Wellcome Trust

1210 Some new chemistry in European Bioinformatics

Christoph Steinbeck, European Bioinformatics Institute

1315 Lunch

1415 Designing data repositories to support preservation & publication for the chemistry community

Simon Coles, University of Southampton

1445 Title to be announced

Diana Leitch, University of Manchester

1520 Growth of support for Open Access, and its spread from biomedicine

Bryan Vickery, ChemistryCentral

1555 Tea

1610 How will NPG provide better access and impact with chemistry content?

David Hoole, Nature Publishing Group

1645 General Discussion All

1700 Close of Meeting N/A

====================================================================

My spies tell me there are already attendees from “publishers, a wide variety of universities (both library/information and technical staff), industry, and some from organisations such as the British Library, CAS, STN, MIMAS and the RSC.” So drop what you were planning to do on Thursday and come and see the future of chemical publishing.

The RSC operates a hybrid journal policy (RSC Open Science) which is an author-pays free-to-read policy. (Every major publisher shuffles the words “Open, Free, Author” and “Choice, Science, Access” to describe its own particular hybrid OA. So that’s a good trivia question for Mastermind or Millionaire – which major publisher operates “Open Science”?) They say:

“You may deposit the accepted version of the submitted article in other repository(ies) as required, with no embargo period, except that you are not permitted to deposit your work in any commercial service.”

Since this is the removal of a permission barrier, this is now “strongOA” – although the name will be changed soonish.

The Google snippet for this page announced

“Yes, but with the caveat that, along with many other publishers, RSC considers the author-pays open access model to be an experiment rather than a proven …”

I can’t find this phrase any more, so I’m assuming it’s a year old. So probably by now the RSC has had a chance to see that OA is no longer an experiment but a shining success. And this meeting will confirm it…

What will I say? I have no idea in detail as there has been so much going on. It will be on data.

  • I will show our latest demo which should convince you that by publishing semantic data chemistry has the opportunity to become the pre-eminent data-driven physical science.
  • I will urge everyone to take data seriously. It’s tragic – and I think an abrogation of the duty of a learned society – that the ACS has declined to give a clear steer to Chemspider on whether our (Cambridge) CrystalEye abstraction of their openly-visible-but-inappropriately-copyrighted crystal data (==facts) can be re-used. It was a simple, polite, public question.
  • The last few days have convinced me that data MUST be treated as a first-class-citizen, and not tagging along under whatever Open Access licence or mumble is attached to the full-text. Data are free (as in air). Publishers should rejoice in this.

Because, publishers, the quality of your science depends in large part on the data related to those publications. That’s why many of you publish supplemental information. It says:
“We, the authors, did this experiment. Here’s the data to support our claims. We made these compounds. They are what we say they are. If you don’t believe us, here are the spectra. And the melting points. And the crystal structures. And we are reinforced in our claim because the journal reviewed this data and agreed it stands up. We challenge you to download it and show any different. Recalculate the NMR spectrum. We are brave enough to claim it’s right. Re-analyse the crystal structure. You’ll find that atom really is an oxygen and not a nitrogen. And the chirality really is R- not S-“.
But to make these claims the data must be free. Not licensed. Not mumble. Free as in air. The human race has an automatic right to data.

Posted in Uncategorized | Leave a comment

Chemistry and Wikipedia

PMR: Recent comments on this blog about derivative works of Wikipedia

  1. Physchim62 Says:
    May 15th, 2008 at 6:38 pm: I would be very interested in having a chat with you about this project, either here or by email. Although I am somewhat more pessimistic than you about the possibilities of avoiding human input, I might be able to help you to avoid rediscovering some of the problems we have had on Wikipedia!

PMR: I am certainly not against human input – it’s essential. But I think machines can help in semantic integrity.

  1. Martin Walker Says:
    May 15th, 2008 at 6:47 pm: As you know, we (the Wikipedia chemists) are organising our lists as I write this, so things are in a state of flux. There are two lists currently available. One list of 4743 contains nearly all the organics plus a few inorganics, and then we created another smaller list. There is some duplication, which will disappear once they are merged. Missing from the list are a few such as skatole (unknown reasons!), plus other articles like 2-methylpyridine that are new. If you would like this information in database format, we are now in a position to provide it, though it will lack the validated CAS numbers at this time. As you know, we welcome mashups and other collaborations. Martin A. Walker (walkerma on Wikipedia)

PMR: Many thanks. I’d be grateful for it in database format. (Some of you may not realise how much effort Martin, Physchim and others put into Wikipedia). More comments below.

  1. Physchim62 Says:
    May 15th, 2008 at 5:55 pm: At Wikipedia, we are currently involved in a VERY similar process to the one you’re describing here, except we are less optimistic as to the possibilities of automation. Given our somewhat eclectic range of compounds, we are more than used to the fact that many fundamental data are simply not known. To take one (extreme) example, have a look at , where we give virtually all of the publicly available information on this compound. While I would not wish to discourage your group, I must say that, at Wikipedia, we have found that the most valuable “semantic chemical authoring tool” is a human chemist: personally, I charge less for consultancy than CAS charges for access to its databases (but maybe that’s my mistake!) Much chemical information, on the web and on paper, is false, and most of it lacks the necessary metadata to be able to judge its veracity. THAT, I feel, is the real problem!

PMR: I fully agree – and although I haven’t contributed compounds I have created chemical entries in WP. But as the numbers scale it will benefit from semantic tools.
I’m very keen to use DBPedia, but at the moment it suffers from inconsistency in the semantics. Here are a few examples of <http://dbpedia.org/property/molarmass>:

:%2528Benzylideneacetone%2529iron_tricarbonyl/section2/Chembox_Properties [http]    286.06
:1%252C1%2527-Bi-2-naphthol/section2/Chembox_Properties [http]    “286.32 g/mol”@en
:1-Aminocyclopropane-1-carboxylic_acid/section2/Chembox_Properties [http]    “101.1 {{Ref_N”@en
:2%252C2%252C2-Trichloroethanol/section2/Chembox_Properties [http]    “149.40 g mol<sup>-1</sup>””@en
:3%252C3%2527-Diaminobenzidine_tetrahydrochloride/section2/Chembox_Properties [http]    “360.11 g/mol<BR> 396.14 g/mol (dihydrate)””@en
:Alginic_acid/section2/Chembox_Properties [http]    “10000 – 600000″@en
:Aluminium_chloride/section2/Chembox_Properties [http]    “133.34 g mol<sup>-1</sup> (anhydrous) 241.432 g mol<sup>-1</sup> (hexahydrate)””@en
:Aluminium_sulfate/section2/Chembox_Properties [http]    “342.15 g/mol as anhydrous salt”@en
:CS_gas_%2528data_page%2529/section2/Chembox_Properties [http]    “188.6 g/mol Documents the chemistry of CS and its effects on the body.”@en
:Calcium_chloride/section2/Chembox_Properties [http]    “110.99 g/mol, anhydrous 147.02 g/mol, dihydrate 182.04 g/mol, tetrahydrate 219.08 g/mol, hexahydrate””@en

PMR: Here we can see the variability in the reporting of the physical quantity and the units associated with it. When this is normalised then Wikipedia/DBPedia will become a stunning chemical resource. Our offer is to try to see if we can normalise this outside WP so it can be re-inserted.
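To give a feel for what normalising might involve, here is a small sketch that pulls the numeric values out of strings like those above. It is an illustration only, not our actual tooling: the function name is invented, it assumes every number left after stripping markup is a molar mass in g/mol, and it throws away the labels (anhydrous, dihydrate and so on) that a real normaliser would want to keep.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_molar_mass(raw):
    """Return the numeric values found in a raw Chembox molar-mass string."""
    # Drop superscripts first, so the -1 in 'g mol<sup>-1</sup>' is not a value.
    text = re.sub(r"<sup>.*?</sup>", "", raw)
    # Drop unterminated template debris such as '{{Ref_N'.
    text = re.sub(r"\{\{.*", "", text)
    # Replace any remaining tags such as <BR> with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    return [float(m) for m in NUMBER.findall(text)]


examples = [
    "286.06",
    "286.32 g/mol",
    "101.1 {{Ref_N",
    "149.40 g mol<sup>-1</sup>",
    "360.11 g/mol<BR> 396.14 g/mol (dihydrate)",
    "10000 – 600000",
]
for raw in examples:
    print(repr(raw), "->", parse_molar_mass(raw))
```

Even this toy version shows where a human chemist is still needed: it cannot tell a range from a pair of hydrate values, and any digit appearing in free text would be picked up as a mass.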

Posted in Uncategorized | Leave a comment