Monthly Archives: December 2006

Ownership/Copyright of Comments and Email

  1. David Goodman raises a question in response to this post:
    December 18th, 2006 at 11:33 pm
    excerpted and adapted from my Dec 18 posting on the SPARC-OA list: As an example, relevant to this very posting, I quote from this blog:
    “You own the copyright in your comments: but you also agree to license your comments under the Creative Commons Attribution-NonCommercial license.”
    When you act as a publisher, you do not license commercial use.

    David Goodman,

When we set this series of blogs up we thought it was important to consider the ownership of comments. I've been in positions before when I have been unable to distribute a collaborative work because we didn't address this issue. So we took a stab and picked a license for the comments. I'm quite happy to be convinced that "commercial use" would be better, but we omitted it because we thought it might put off some readers (we thought that even requiring a license might cause problems).

So, readers (and commenters), what is your view? Would you be happy with a license that allows commercial exploitation? Are any of you put off by the current license? (If you are, and don't want to use the comments, please mail me.)

David - I do not understand "you" in the last sentence. Does it mean:

"When PMR acts as a publisher, PMR does not license commercial use" (i.e. a statement of fact) or

"When anyone acts as a publisher, that person does not license commercial use" (i.e. a metasyntactic variable applying to all publishers) (and perhaps "does not" implies "should not" or "cannot").
or something else...

David's other comment refers to the SPARC Open Data list which has had a discussion about copyright on email. David writes (Mailing List Message #85):

If you mean quoting, that is protected fair use. When it is for the
purpose of criticism, and especially scholarly criticism,
as it usually is on these lists, this can be very encompassing.

If you mean reposting without permission on a different list, I
think you will find that most of the posters report
that they have permission--the usual statement is
"reposted with permission from the X list";
I know that I have always asked. I think that Heather
has always asked.

Once or twice the original poster did not want a larger or different
audience, and in those cases I did not repost.

And, as Heather says, posting or quoting from a private letter is
totally unethical without explicit permission, print or email. I have
not seen anyone I know do it.

So this quote (almost all David's mail except for the address) is quoted here under fair use.

The penultimate point is important. I have been embarrassed in the past because someone took one of my posts to a fairly small list (A) and reposted it in a much larger list (B) with the wrapper "from Peter Murray-Rust". This post caused harm, because it appeared that I had deliberately posted it in list B, where its content and attribution were seen as aggressive, while on list A it was normal discourse in the community. Unfortunately the person (an open source evangelist) made a habit of this and I took to including something like "please do not repost without an explicit statement that this was not done by the original author". I don't know the answer to this.

A typical scenario might be:

I post here something that says "I think it would be a good idea if all publishers added InChIs to their chemistry". This is fine on this blog. If, however, the post was copied to the Computational Chemistry List (where the person copied my material) and attributed to me, then those members might see me as spamming, uncritical, intrusive, etc. If he had written "I found this quote on PeterMR's blog ..." I couldn't complain.
So metadata and transclusion matter. But we don't want any more hard stuff before breakfast...

Sometimes when someone writes to me I extract part and comment:

"A correspondent asked me:

"When should I use cml:property and when should I use cml:Parameter? Here is some code to highlight the problem... "

[code included in quote]

and I'm replying in public because I think many people will be interested."

I hope that this does not break confidentiality and fair use. Otherwise I will have to write a lot more emails.

Open Data: more on ALPSP and STM statement

The next few posts will contain a substantial amount of comment (hopefully from you, dear readers and commentators) on the June 2006 ALPSP/STM statement on data in scholarly publications. I'm kicking myself for missing it. I plead that I have many other things to do (chemical research, writing programs), and Open Data is not my top priority, so I wasn't reading Peter Suber's Open Access News blog on a daily basis. I now have a better feedreader and am reading it daily. Here's his June comment:

ALPSP and STM support OA data

The ALPSP and STM have issued a joint statement in support of free online access to scientific data (June 2006). Excerpt:

Publishers recognise that in many disciplines data itself, in various forms, is now a key output of research.

[snip... you can read the full statement here]

this is fundamental in copyright law. In the EU, the use of ‘insubstantial’ parts of a database, provided it is not systematic and repeated, does not infringe the database maker’s rights.

PS: Comments. Three quick responses.

  1. First, I commend ALPSP and STM for the primary recommendation in this significant statement. Open access to data is important for all the reasons they cite.
  2. Their call for intellectual property protection for databases is a separable and regrettable part of their statement. The EU has such protection but the US does not. Any argument that such protection is needed to stimulate the production, collection, or dissemination of data is refuted by the vigor of science in the US.
  3. The ALPSP and STM both lobby against policies that would provide OA to research literature, like FRPAA and the draft RCUK policy. I acknowledge that there are many differences between OA to data and OA to peer-reviewed articles interpreting or analyzing data. But ALPSP and STM should acknowledge that there are many similarities, and that most of their arguments for OA data (enhancing research productivity, avoiding costly repetition of research, supporting the creative integration and reworking of research) also apply to OA literature.

PMR comments:

(1) Like PS I commend the ALPSP and STM for this. (It is very important that we do not see people and organisations as "good" and "bad" - this sometimes happens in web-based and other discussions. They have good and bad policies.) So this is a very valuable advance.

(2) I support Peter's position as well. I think the blanket claim for IPR through this method is both crude and generally undesirable. Anything that can be done to make it clear that data in publications lie outside this directive is important. A journal is not, IMO, a database (though the ACS now explicitly claims that a journal is no longer a journal, but a database to which readers have access. There are other issues here, but I stick with the data). So I will help to promote the swell of challenge to database rights, but I'm not leading the charge.

(3) I am, of course, a strong supporter of Open Access, and the logical premise that OA implies Open Data. That is the area where I'm trying to lead a charge.

So, in principle, all members of ALPSP and STM should support this, shouldn't they? Or has there been dissension? If not, we'll assume it's got universal support.

So what possible legitimacy do we have for journals that copyright supplemental info?

(There are many other questions - what about data in full text? what about spidering websites? It should be exciting to examine these. Let's see if we can use the blogosphere to do some useful research into this.)

And - finally - why did no-one on the SPARC Open Data list flag this?

Open Data : help from ALPSP and STM

I'm over the moon! Richard (from the Royal Society of Chemistry) has commented on my post about the ownership and licensing of data:

  1. Richard Says:
    December 13th, 2006 at 3:27 pm
    And of course, publishers (well, ALPSP and STM members) have this statement as guidance.

Now I knew that STM publishers were thinking of coming out with something like this and I kept searching from time to time, but I missed it. (It's dated June 2006.) It really should have got spotted by the SPARC Open Data membership, shouldn't it? Anyway here we go (it doesn't have a copyright statement so I claim fair use, and it's in hamburger format so I hope I have got most of what matters):

Databases, data sets, and data accessibility – views and practices
of scholarly publishers

A statement by the Association of Learned and Professional
Society Publishers (ALPSP) and the International Association of
Scientific, Technical and Medical Publishers (STM)

Publishers recognise that in many disciplines data itself, in various forms,
is now a key output of research. Data searching and mining tools permit
increasingly sophisticated use of raw data. Of course, journal articles
provide one ‘view’ of the significance and interpretation of that data – and
conference presentations and informal exchanges may provide other
‘views’ – but data itself is an increasingly important community resource.
Science is best advanced by allowing as many scientists as possible to
have access to as much prior data as possible; this avoids costly
repetition of work, and allows creative new integration and reworking of
existing data.
There is considerable controversy in the scholarly community about
‘ownership’ of and access to data, some of which arises because of the
difficulty in distinguishing between information products created for the
specific display and retrieval of data (‘databases’) and sets or collections
of raw relevant data captured in the course of research or other efforts
(‘data sets’). Another point of difficulty is that in many cases data sets or
even smaller sub-sets of data are also provided as an electronic adjunct to
a paper submitted to a scholarly journal, either for online publication or
simply to allow the referees to verify the conclusions.
We believe that, as a general principle, data sets, the raw data outputs of
research, and sets or sub-sets of that data which are submitted with a
paper to a journal, should wherever possible be made freely accessible to
other scholars. We believe that the best practice for scholarly journal
publishers is to separate supporting data from the article itself, and not to
require any transfer of or ownership in such data or data sets as a
condition of publication of the article in question. Further, we believe that
when articles are published that have associated data files, it would be
highly desirable, whenever feasible, to provide free access to that data,
immediately or shortly after publication, whether the data is hosted on the
publisher’s own site or elsewhere (even when the article itself is published
under a business model which does not make it immediately free to all).
We recognise, however, that hosting, maintaining and preserving raw
data or data sets, and continuing to make such data available over the
long term, has a cost which, in certain circumstances, the host site may
need to recover. We also recognize that on occasion the generation of
data has been privately funded, and the funding entity may have a
particular reason for restricting access to the data (either temporarily or
even permanently), but we believe these should be limited exceptions,
and that journal publishers themselves should claim no ownership interest
in such data. The academic and publishing communities should discuss
further (in the context of the debate on the public funding of research)
whether more reliable and more permanent sites should be established to
host research data.
None of this means, however, that databases themselves – collections of
data specifically organised and presented, often at considerable cost, for
the ease of viewing, retrieval and analysis – do not merit intellectual
property protection, under copyright or database protection principles.
Such databases are often characterized by the sophistication of their data
field structuring, searchability tools, and the like, and scholarly publishers
are often involved in producing and marketing databases that contain
valuable and useful information for scholarly research. The research
interest and value of raw research data sets and individual data points is
entirely different, and serves different purposes, from that of specific
databases that have been organised and compiled for particular research purposes.
There is sometimes confusion about whether the use of individual ‘facts’
and data points extracted from a database is permitted under law. Facts
themselves are not copyrightable, but only the way in which information is
expressed – this is fundamental in copyright law. In the EU, the use of
‘insubstantial’ parts of a database, provided it is not systematic and
repeated, does not infringe the database maker’s rights.
Articles published in scholarly journals often include tables and charts in
which certain data points are included or expressed. Journal publishers
often do seek the transfer of or ownership of the publishing rights in such
illustrations (as they might do with respect to an author’s photograph),
but this does not amount to a claim to the underlying data itself.
We hope that this statement is helpful in clarifying the views of publishers
concerning raw data, data sets and databases, and that the statement will
serve as useful guidance for publishers in their policies concerning data
sets submitted with papers. Scholarly and scientific publishers share the
view that research data should be as widely available as possible.
June 2006

PMR: This looks very useful indeed. Thanks Richard. I don't know how I missed it. I will comment later.

Open Data: help from Microsoft

In reply to my last post (about the idea of adding Creative Commons licenses to scientific data)...

  1. Robin Rice Says:
    December 12th, 2006 at 7:04 pm
    There was an article in the October issue of Ariadne, Creative Commons Licences in Higher and Further Education: Do We Care? which points out some of the questions around widespread use of Creative Commons licenses. “Naomi Korn and Charles Oppenheim discuss the history and merits of using Creative Commons licences whilst questioning whether these licences are indeed a panacea.”

Robin - that's wonderful - I'd missed this but I'll quote extensively (the article is Openly available and, while it doesn't fulfil all the criteria of the BOAI, it allows extensive quoting, but you can't resell it :-) ) ...

Creative Commons Licences in Higher and Further Education: Do We Care?
Naomi Korn and Charles Oppenheim discuss the history and merits of using Creative Commons licences while questioning whether these licences are indeed a panacea.

The recent incorporation of Creative Commons licences within Microsoft Office Word, Excel and PowerPoint applications via a downloadable plug-in [7] now provides an integrated method for the creation and licensing of content. It is a brilliant way of encouraging consideration about who is allowed access to digital content and under what conditions, at the time that the content is generated. This development has enormous potential for nurturing educated and mature approaches to copyright and access, but at the same time has also necessitated the need to re-examine the validity of using Creative Commons within teaching, learning and research activities. It has precipitated a critical assessment of their use and the need to set clear parameters about when they can and cannot be used.

[Reference 7 is the Microsoft Office plug-in.]

PMR: This is really exciting. I don't mind other people being visited by a meme ahead of me - it strengthens my resolve. From the Microsoft (yes, Microsoft) page:


This add-in enables you to embed a Creative Commons license into a document that you create using the popular applications: Microsoft Office Word, Microsoft Office PowerPoint, or Microsoft Office Excel. With a Creative Commons license, authors can express their intentions regarding how their works may be used by others.

The add-in downloads the Creative Commons license you designate from the Creative Commons Web site and inserts it directly into your creative work. Creative Commons supports a number of languages.

To learn more about Creative Commons, please visit its web site, To learn more about the choices among the Creative Commons licenses, see

Microsoft Office productivity applications are the most widely used personal productivity applications in the world, and Microsoft’s goal is to enhance the user’s experience with those applications. Empowering Microsoft Office users to express their intentions through Creative Commons licenses is another way Microsoft enables users around the world to exercise their creative freedom while being clear about the rights granted to users of a creative work. In the past, it has not always been easy or obvious to understand the intentions of some authors or artists regarding distribution or use of their intellectual creations.

This add-in is made available through a partnership among Creative Commons, Microsoft, and 3Sharp, LLC, an independent solution provider.

The Creative Commons Add-in is an unsupported technology preview, however we welcome your feedback and input. Please send it to us at

PMR: Whatever else one thinks of Microsoft, this looks like a genuine and important offering.

My Korn-Oppenheim approximation (couldn't resist the pun)... I quote snippets...

Creative Commons actively encourages the sharing of educational material, ... This 'standard licence' is not specific for educational purposes and users can only choose, as in the case of the majority of Creative Commons licences, to make the content available either for commercial or non-commercial purposes. Creative Commons licences do not specifically cater for educational purposes.

It is also worth noting that there is a significant difference between the information that is provided within the full-length licence in comparison to that which appears within the shortened code of Creative Commons licences. [...]

  • If the work includes third party rights for which the Higher or Further Education institution has not secured permission for them to then be disseminated under a CC licence. This might include the use of photographs, text, images, etc., generated by third parties, or indeed images of third parties for which permission would need to be sought.
  • As an employee, unless there is an agreement with the employer to the contrary, the employer is likely to own the rights in the work created (as is the case with course material, research outputs, etc). In this case, as the employee does not own the rights in the work that is produced, he or she will need to check with the institution that it is happy for it to be made available under a Creative Commons licence. Reasons for refusal might include those that are ethical, political, financial or legal. We are, of course, aware that in many institutions, custom and practice leaves such decisions to the employee. The recent HEFCE Good Practice Guide [17], however, strongly encourages such institutions to assert ownership of copyright in e-learning materials, and such advice may well occasion changes in policy in the future.
  • The terms of the Creative Commons licence may cut across some of the activities for which an institution or department might normally charge, the business activities of the department or institution, or undermine existing licensing arrangements with third parties.
  • Creative Commons licences are global licences without providing any means to restrict the countries in which material may be used. This is important if contractual agreements, relationships, political or ethical reasons preclude the release of certain types of learning material or research outputs in particular countries.
  • The department or institution may want more control over the context of use of the work and want to prevent any implied or direct endorsement. This is not currently one of the areas which is covered by the licence.
  • It may be important to know who accesses and uses the material that is generated, for evaluation, marketing and other in-house purposes. The Creative Commons licences have no provision for user accountability, or tracing of usage.
  • There are also many instances where you cannot use TPMS (Technical Protection Measures) in conjunction with the use of Creative Commons licences, if TPMS undermine any of the provisions of the licences. Thus, for example, restricting access to students within the institution is incompatible with a Creative Commons licence.
  • The Creative Commons licence does not cover database rights, yet much output from FE and HE institutions is protected by such rights. Thus, even with a Creative Commons licence, users may not infringe any database rights.
  • It is, in any case, unclear if Creative Commons licences are valid in UK law, as they do not provide any 'consideration' or payment, and there is no 'I agree' button to accept.

Concluding Remarks

The well-known JORUM service [18] decided not to use Creative Commons licences in the past because of some of the points made within this article. So, are Creative Commons licences a panacea? No. Should we worry? Probably not, but we need to be aware of the implications and limitations of using Creative Commons licences and remember that their use is only as good as staff awareness about copyright issues, rights management procedures and robust policies underpinning the operations of educational institutions. HE and FE institutions need to be clear about their policies towards access and broader strategic and commercial goals, before committing themselves to the irrevocable terms of Creative Commons licences.

It might be advisable instead for institutions to explore the use of Creative Archive licences [5], which are a set of more restrictive licences, based upon the same premise as Creative Commons but with limits upon the use of content for educational and non-commercial purposes and restrictions relating to the territories in which they may be used.

All these are valid points. However I am only concerned here with scientific facts supporting research output. I hope that HE institutions do not regard themselves as the owners of scientific facts in which case there is no problem. I agree that CC is not designed for science - hence the Science Commons effort.

I haven't downloaded and installed the plugin, but my main concern is that someone might leave the CC license switched on at an early stage. If this circulated in private it might vitiate a patent although I don't know the law. When the data has appeared the license has no effect on a patent which has been successfully filed. You cannot patent data.

So let's encourage everyone to download this plugin. Wow!

Open Data - what can I do? Simple, legal, viral suggestion

Following discussion on the SPARC Open Data list I got a mail:

I'd like to hear more discussion on open data, too. In particular, what are the practical approaches that will help adoption of open data by researchers themselves? We all know technology is not the big issue here. The biggest challenge is how to get researchers to share their raw experiment data on the web. Because this is quite different from traditional publication, a big uphill battle is expected.
Since I'm a software developer (with research background), my thinking is always centered around creating free new tools and services for the end users (i.e. researchers). And the key is to come up with a set of tools and services that can benefit users immediately as they open up their data bit by bit on the web. Of course, they have to be relatively easy to implement (ideally, no funding is required).Fortunately, open source software and communities have made this possible. In addition, the emerging semantic web technologies seem to be right for this task. So, I'm working with W3C semantic web group to openly develop ontologies for representing research data (at high level). I'm also developing necessary web publishing tool and R&D community search engine through open source project. My hope is that researchers will open more data when they actually see these open data bring more visibility and recognition to their work through a community search engine.


AJ Chen, Ph.D.
Palo Alto, CA. USA
W3C Scientific Publishing task force
"Open data on semantic web"

PMR: The simplest thing that researchers can do is to add a Creative Commons license to their data. It costs nothing, is a simple cut-and-paste, and could be trivially made a template in any data production tool. (For example, if you publish spreadsheets in Excel, add a Creative Commons license as your first or last line. Every time you open a blank spreadsheet from that template it would have this.)

Similarly scientific software developers could output the additional line:

"The data output by this program are offered under a Creative Commons Attribution Share-alike license."

(or better in a machine-readable XML/RDF format like the creative commons already do).
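As a rough illustration of both ideas, here is a minimal Python sketch of an output routine that prepends the human-readable notice and can also produce a small RDF/XML statement. The function names, the "#" comment-line convention, the example licence URL and the use of the `http://creativecommons.org/ns#` vocabulary are all assumptions made for this sketch; they are not an agreed form of words or a format any particular program uses.

```python
import xml.etree.ElementTree as ET

# Assumed values for illustration only: a project would choose its own
# licence URL and wording.
LICENSE_URL = "http://creativecommons.org/licenses/by-sa/2.5/"
NOTICE = ("The data output by this program are offered under a "
          "Creative Commons Attribution Share-alike license.")

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://creativecommons.org/ns#"


def license_rdf(data_uri=""):
    """Return an RDF/XML fragment stating the licence of data_uri."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("cc", CC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    work = ET.SubElement(root, f"{{{CC_NS}}}Work",
                         {f"{{{RDF_NS}}}about": data_uri})
    ET.SubElement(work, f"{{{CC_NS}}}license",
                  {f"{{{RDF_NS}}}resource": LICENSE_URL})
    return ET.tostring(root, encoding="unicode")


def write_with_license(rows):
    """Format rows as CSV text whose first line is the licence notice."""
    lines = ["# " + NOTICE]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"
```

A program would call `write_with_license(rows)` wherever it currently writes its results, and could place the output of `license_rdf(...)` in whatever metadata slot its file format offers.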

We might add the following:

"The authors regard these as not copyrightable Open Data. Some publishers wish to claim that if published alongside a journal article they own the copyright on them. This license effectively forbids them to claim copyright without the authors' permission and acquiescence."

I think the effect of this would be dramatic. Scientists would start to see these messages and think: "Why should I give these data to the publisher?" And if the publisher simply adds a copyright notice saying "all these data are copyright the publisher - you cannot use them for X, Y, Z without permission" this would be in violation of the authors' license. The author would have to deliberately remove this statement to hand over the IPR to the publisher.

It is never easy to design a viral campaign, but this has all the prerequisites of a meme:

  • it infects a significant number of the potential population
  • they wish to reproduce and spread the meme
  • the costs of replicating the meme are effectively zero

The first two are unknowns. The critical things are to get the form of words right and not to foul up technically. Not all data sets can carry text (although an increasing number will be accompanied by metadata which is ideal for this.)

If the scientific programmers buy into this it is unstoppable.

Egon on SMILES InChI CML and RSS

I agree with everything Egon says and add comments.
(Incidentally WordPress and Planet remove the microformats so please read his original
for the correct syntax)
In the blogs ChemBark and KinasePro there have been some discussions on the use of SMILES, CML and InChI in Chemical Blogspace (with 70 chemistry blogs now!). Chemists seem to prefer SMILES over InChI, while there is interest in moving towards CML too. Peter commented.

PMR: 70 blogs is great. Go back a year and we'd have ca 10 I suspect. As I say, I'm only looking for the 5-10% who are happy to be early adopters.

Any incorporation of content other than images and free text requires some HTML knowledge, but this can be rather limited. It is up to us chemoinformaticians to write good documentation on how to do things; so here is a first go.

PMR: Yes, documentation is key as we are always being reminded! But we are also still fighting the browser technology. One of the great problems is that browsers have been a moving target for 12 years - it was almost easier to create a "plugin" in 1994 than now. How many of you can run Chime under Firefox?

Including CML in blogs and other RSS feeds

I blogged about including CML in blogs last February, and can generally refer to this article published last year: Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators (PMID:15032525, DOI:10.1021/ci034244p). Basically, it just comes down to putting the CML code into the HTML version of your blog content, though I appreciate the need for plugins.

PMR: you should always try to create XHTML (HTML with balanced tags). Unfortunately (and most regrettably) some tools, including WordPress, can often remove end tags.
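To show why balanced tags matter once CML sits inside blog HTML, here is a small Python sketch. The XHTML fragment and the methanol molecule (hydrogens omitted) are hand-written for illustration, and the helper names are made up for this example; only the CML namespace `http://www.xml-cml.org/schema` and the standard-library XML parser are taken as given.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Illustrative fragment: a blog paragraph followed by a CML molecule.
GOOD = f"""<div xmlns="http://www.w3.org/1999/xhtml">
  <p>Methanol:</p>
  <molecule xmlns="{CML_NS}" id="methanol">
    <atomArray>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="O"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1"/>
    </bondArray>
  </molecule>
</div>"""

# The same fragment after a tool has dropped one closing tag.
BAD = GOOD.replace("</atomArray>", "")


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, i.e. all tags balance."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


def atom_elements(fragment):
    """List the elementType of every CML atom in a well-formed fragment."""
    root = ET.fromstring(fragment)
    return [atom.get("elementType")
            for atom in root.iter(f"{{{CML_NS}}}atom")]
```

With the balanced version an aggregator can pull the atoms straight out of the post; with one end tag removed the whole fragment fails to parse and the chemistry is lost to any XML tool.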

Including SMILES, CAS and InChI in blogs

Including SMILES is much easier as it is plain text, and has the advantage over InChI that it is much more readable. Chris wondered in the KinasePro blog on how to tag SMILES, while Paul did the same on ChemBark about CAS numbers.
PMR: SMILES shouldn't need to be "readable" and some of it isn't (e.g. if you have a complete disconnected structure). It is because people have got used to seeing it for many years that they don't feel frightened. There is no way to create canonical SMILES by hand, so you have to have a tool. InChI seems more forbidding because (a) it's new (b) It can never be hand authored (c) it's about 50% more verbose (d) it has layers. But each of those has a positive side.
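To make the point about layers concrete, here is a minimal Python sketch that splits an InChI string on its "/" separators. The function name is invented for this example, and the dictionary is a simplification: it assumes a formula layer is present and that each layer prefix occurs only once, which does not hold for every InChI.

```python
def inchi_layers(inchi):
    """Split an InChI string into version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError("expected at least a version and a formula layer")
    version, formula, rest = parts[0], parts[1], parts[2:]
    # Each remaining layer starts with a one-letter prefix, for example
    # 'c' for connectivity and 'h' for hydrogens.
    layers = {layer[0]: layer[1:] for layer in rest if layer}
    return version, formula, layers


# Ethanol as an example input.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
```

For the ethanol string the formula layer is `C2H6O`, the connectivity layer is `1-2-3` and the hydrogen layer is `3H,2H2,1H3`, so a tool can compare or index molecules layer by layer.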
Now, users of know how to add markup to their blogs to get PostGenomic index discussed literature, website and conferences. Something similar is easily done for chemistry things too, as I showed in Hacking InChI support into (which was put on lower priority because of finishing my PhD). basically uses microformats, which I blogged about just a few days ago in Chemo::Blogs #2, where I suggested the use of asperin. And this is the way SMILES, CAS and InChI's can be tagged on blogs. The element is HTML code to indicate a bit of similar content in HTML, and can, among many other things, be formatted differently than other text. However, this can also be used to add semantics in a relatively cheap, but accepted, way. Microformats are formalized just by use, so whatever we, as chemistry bloggers, use will become the de facto standard. Here are my suggestions:
[snipped see Egon's blog]
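Since the actual suggestions are snipped here, the following Python sketch only illustrates the general idea of class-tagged HTML elements that an indexer can pick out. The class names `smiles`, `inchi` and `casnumber` are hypothetical placeholders chosen for this example and are not Egon's proposed names; see his blog for the real syntax. Nested tagged elements are not handled.

```python
from html.parser import HTMLParser

# Hypothetical class names, chosen for this sketch only.
CHEM_CLASSES = {"smiles", "inchi", "casnumber"}


class ChemSpanParser(HTMLParser):
    """Collect (kind, text) pairs from <span class="..."> elements."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._kind = None

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = set((dict(attrs).get("class") or "").split())
            match = classes & CHEM_CLASSES
            if match:
                self._kind = sorted(match)[0]

    def handle_data(self, data):
        if self._kind is not None:
            self.found.append((self._kind, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._kind = None


def extract_chemistry(html):
    """Return the tagged chemical identifiers found in an HTML string."""
    parser = ChemSpanParser()
    parser.feed(html)
    parser.close()
    return parser.found


POST = ('<p>Ethanol is <span class="smiles">CCO</span>, '
        'CAS <span class="casnumber">64-17-5</span>.</p>')
```

A blogger only has to wrap the identifier in one element; everything else (indexing, styling, linking) can be done by whoever reads the feed.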
The RDFa alternative

The future, however, might use RDFa over microformats, so here are the RDFa equivalents:
[snipped see Egon's blog]
which requires you to register the namespace xmlns:chem="" somewhere though. Formally, the URN for this namespace needs to be formalized; Peter, would the Blue Obelisk be the platform to do this? BTW, this is more advanced, and currently does not have practical advantages over the use of microformats.
PMR: Egon is right: there is currently no clear indication of which approach will come out as the "winner" although there is lots of Web discourse. However for us I suspect we would adopt both if lots of people were using them, and see which approach won.
Yes, of course we should use blueobelisk for the RDF! This has the real chance of succeeding.
Again the message is that the rest of the world is going down this route and at some stage chemistry will follow. RDF looks just as impenetrable as InChI, DOI, and all the rest...

Why bother with new technology?

Kinasepro has blogged about discussions of new chemoinformatics technology (specifically CML (Chemical Markup Language) and InChI (chemical identifier)). Here's the post and some correspondence. It's basically about the introduction of new technology. Obviously I'm not neutral but I will try to discuss it in a neutral manner. For that reason I have copied it more or less in full.

There’s been a fair amount of talk [ChemBark] over the last little while on the topic of chemoinformatics and chemblogs. Here’s my two cents.
smiles inchi Aldrich
smiles inchi ChemExper
smiles inchi The PDB
smiles inchi Chemdraw (until v10)
smiles inchi The entire pharmaceutical industry.
smiles inchi Peter Murray Rust
smiles inchi IUPAC
So somehow a couple librarians have convinced Google that inchi > smiles. Result? Google may well do Inchi, but noone but the librarians are currently using it, and meanwhile google doesn’t index smiles very well. I’m reminded of a day when it was thought to be a good idea to put the CAS#s of new entities at the bottom of ACS journal articles. Don’t worry, we survived those librarians too.
PMR: I'm not sure who the librarians are. I'd label all of us as chemical informatics people. The institutions include NIST, RSC, and University of Cambridge. I don't think Google has been convinced of anything - chemistry is, relatively speaking, too small for Google to worry about. But yes, we have visited and had very useful and forward-looking conversations. Watch out for Googlebase...
Lookit, we don’t need a string of XML code that you need an advanced degree to use. We don’t need people telling us to tag our blog posts, we need an integrated solution. We need something that can draw structures and present them attractively in an index friendly HTML format. Near term: Get google to index picture descriptions, and code a firefox plugin that can insert smiles into said descriptions.
PMR: I am not quite sure what "index picture descriptions" means. Google indexes the fact that there is a picture but not the content. There are major efforts in image recognition, but I am not aware that any of this is being done in chemistry. I think that indexing chemistry in published GIFs is extremely difficult. I've looked at this over the years and conclude that it would be much easier if authors simply made their molecular files available.
Till google has a smiles substructure search, I’m not going to bother.
PMR: This is a perfectly valid response from an individual in the system. It's rather less encouraging if it reflects the whole of chemistry (which currently it does). If the chemical informatics community says "at some stage Google will solve all our chemical problems, until then we'll do nothing" that's regrettable. (All other major scientific disciplines - physics, astronomy, bioscience, geosciences, etc. - are making major efforts to develop informatics infrastructure.) Some of us are, in fact, thinking about how to do this. The problem is that there has to be some software somewhere. It can be in the following places:

  • client (i.e. your browser)
  • Google (we have discussed this with Google and it's not impossible)
  • third party (who may or may not charge for it).

Given that Openbabel can search millions of structures quite rapidly there are some encouraging opportunities.

  1. 1 totallymedicinal Dec 5th, 2006 at 3:09 pm
    Couldn’t agree more with the sentiment - not only does my ancient version of ChemDraw not support this exotic format, but I have enuff hassle in my life without learning some obscure new coding system.

PMR: Again this is a perfectly valid response. Any approach to chemoinformatics requires tools. And I suspect you or your institution would have to pay for an upgrade to Chemdraw. Obviously there is the opportunity of some Open Source free tools but they are not yet widely deployed and are effectively for early adopters.

  1. 3 Paul Dec 7th, 2006 at 4:06 am
    I could not agree more about the need for an integrated solution! I got a really thoughtful response from Peter Murray-Rust and friends, and I feel kind of bad about not acting on it, but putting random InChI designations at the bottom of all our blog posts doesn’t seem worth it to me. I think that CML is indeed the future, and I look forward to the day of being able to download a CML plugin for WordPress that will take care of everything for us lazy bloggers.
PMR: There is no doubt there are technical problems and they will require some early adopters to solve. I have tried to hide the InChI - it is an effort and is fragile. Given that I have problems with simple computer code in WordPress I expect the same with chemistry. However we have some new ideas of how to take this away from the WordPress process.
  1. 4 Chris Dec 9th, 2006 at 12:23 pm
    The argument against SMILES seems to be they are not an Open Format and it is possible to represent a single molecule with multiple SMILES strings. For my part I can read and write SMILES, (and SMARTS and SMIRKS). I find InChi impenetrable and I don’t think there is syntax for substructure or similarity queries, in addition I don’t think there is a system for describing reactions. I’ve started to add SMILES to my web pages in the hope that someone will build an index at some point, I guess it would help if there was a SMILES tag.
PMR: SMILES was a groundbreaking language when it came out. In general I have no problem with non-Open formats if there are free tools to manage them. There is a canonicalization algorithm for SMILES but it is closed and proprietary. I have regularly discussed the value of making it openly available with Daylight management but they are not prepared to do this. This is a legitimate business approach - control the market through trade secrets. In the current case, however, it has the practical downside that several groups have created incompatible "canonical" SMILES.
The main virtue of InChI is that it is a public Open Canonicalization algorithm. It's perfectly possible to convert InChI to SMILES if you want. It would not be "canonical SMILES" in the strict sense, but it would be canonical. That may, in fact, be a useful approach for certain types of compound. As InChI has a richer set of concepts than SMILES there may be some information loss.
In summary, if Daylight had made the SMILES algorithm public and it had been used responsibly I doubt very much whether we would have InChI. It has been driven by the lack of interoperability in chemistry - coming in some part from government agencies and the publishing community.
InChI is by definition impenetrable. It's an identifier. Do you find DOI, ISBN, security certificates impenetrable? I hope so :-)
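To make "canonicalization" concrete, here is a toy Python sketch of my own. It is emphatically not the InChI algorithm or Daylight's; it simply tries every renumbering of a very small molecule and keeps the smallest description, so that two different atom orderings of the same molecule end up with the same key. The function name `canonical_key` and the hand-written atom/bond lists are assumptions made for illustration.

```python
from itertools import permutations


def canonical_key(atoms, bonds):
    """Return an ordering-independent key for a small molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs into atoms (single bonds only)

    Brute force: every renumbering is tried and the smallest description
    is kept. This is only feasible for a handful of atoms.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] is the new index given to that atom
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in bonds)
        candidate = (tuple(labels), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


# Ethanol numbered two ways, as the strings "CCO" and "OCC" would number it.
ethanol_a = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = (["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether ("COC"): same atoms, different connectivity.
ether = (["C", "O", "C"], [(0, 1), (1, 2)])

same = canonical_key(*ethanol_a) == canonical_key(*ethanol_b)
different = canonical_key(*ethanol_a) != canonical_key(*ether)
print(same, different)  # True True
```

The point is only that a canonical form depends on an agreed procedure. If two groups pick different procedures they get different "canonical" strings for the same molecule, which is the incompatibility described above.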

  1. 5 kinasepro Dec 9th, 2006 at 7:03 pm
    InchI and CML may well be the future, and no-one will embrace it more then me, but SMILES is the present. For people working in the field not to understand that boggles the mind!

PMR: I'm not sure who "people working in the field" are. If it includes me, then I fully understand it. I am simply trying to bring the future to the present a bit quicker and a bit more predictably. :-)

  1. I’ve experimented on this site a little with smiles. For instance a google search of the following string brings you here:
    O=C(C2=CN=C(NC3=NC(C)=NC(N4CCN(CCO)CC4)=C3)S2)NC1=C(C)C=CC=C1Cl
    Of note I’m not the only one with that string on the web! Maybe thats an important compound? Sadly google indexed that page under my SRC tag rather then as a standalone page. Put that together with the fact that smiles strings are not substructure searchable via google and its clear to me that google is not ready to be a chemistry informatics platform. It’s sad really, because it doesn’t seem to me that it would be that difficult for them to make SMILES strings substructure searchable via the same algorithm the PDB, relibase, aldrich and everybody else is using.

PMR: This is a very important point and at the heart of the problem. Google works by indexing text. It's good at it and can distinguish different roles for text and can look for substrings. This is a simple, powerful model. But at present it doesn't index other objects (faces, maps, etc.). These are both harder and require specialist software. By contrast PDB, Relibase, Aldrich do index chemical structures. That means that they have to have specialist software running on their servers. Which means a business model. And that someone has to pay somewhere. PDB gets a grant, Relibase is commercial, Aldrich will see this as the basis for selling more compounds. All completely valid. But there is no business reason for Google to invest in chemistry-specific software - as I said, chemistry is too small for Google to bother with. It's not helped by the fact that all the information is proprietary and that one of the major chemical information suppliers (CAS/ACS) sued Google. So unless you convince them differently - and I have gently tried - it won't happen.
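The difference between substring search and substructure search can be shown with a small Python sketch. This is my own toy, under heavy assumptions: `parse_chain` only understands unbranched chains of single-letter atoms (no rings, branches, charges, aromatic lowercase or bond symbols), and `has_substructure` is a brute-force matcher, not the algorithm used by PDB, Relibase or Aldrich. Both names are hypothetical.

```python
from itertools import permutations


def parse_chain(smiles):
    """Parse an unbranched chain such as "OCCN" into atoms and bonds.

    Toy parser: each character is one atom, bonded to its neighbours.
    """
    atoms = list(smiles)
    bonds = {frozenset((i, i + 1)) for i in range(len(atoms) - 1)}
    return atoms, bonds


def has_substructure(molecule, pattern):
    """Return True if the pattern's graph occurs inside the molecule's graph."""
    m_atoms, m_bonds = parse_chain(molecule)
    p_atoms, p_bonds = parse_chain(pattern)
    # Try every way of placing the pattern atoms onto distinct molecule atoms.
    for mapping in permutations(range(len(m_atoms)), len(p_atoms)):
        if any(m_atoms[mapping[i]] != p_atoms[i] for i in range(len(p_atoms))):
            continue  # element symbols do not match
        if all(frozenset(mapping[k] for k in bond) in m_bonds for bond in p_bonds):
            return True  # every pattern bond is present in the molecule
    return False


page_text = "OCCN"   # 2-aminoethanol, written from the oxygen end
query = "NCC"        # the same N-C-C fragment, written from the nitrogen end

print(query in page_text)                   # False: plain text search misses it
print(has_substructure(page_text, query))   # True: graph search finds it
print(has_substructure(page_text, "NCO"))   # False: no carbon carries both N and O
```

A text index can only do the first kind of test. The second kind needs chemistry-aware software running somewhere, which is exactly the point about servers and business models.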

So this is all about the introduction of new technology. The primary messages from the chemistry community are something like:

  • We're happy with what we've got - it's worked for the last 20 years and will go on doing so. Yes, for a little while.
  • When it's necessary CambridgeSoft, Chemical Abstracts, Elsevier will develop a new technology and we'll pay them to use it. Unfortunately I don't see any movement from any of these to embrace the new Web metaphors. Biology, geoscience, etc. are working hard to develop the semantic web in their subjects - apart from a few of us no one in chemistry is.
  • Well, it's a bit of a mess, but it's not at the top of my priorities. I'll come back in a few years.
There are movements in chemistry, particularly in three areas:
  • computational chemistry. We are having a visit of COST D37 (EU) to Cambridge tomorrow to create an interoperable infrastructure for computational chemistry. It will be based on communal agreements and use XML/CML as the infrastructure.
  • chemoinformatics. The Open Source community (e.g. Blue Obelisk) supports both current (legacy) formats (SMILES, Mol, etc.) and also CML/InChI. This can provide a smooth path towards the wider adoption of these newer approaches, including toolkits. The toolkits are free, which some see as a disadvantage, in which case you will have to convince the commercial suppliers to create them.
  • publishing. Commercial publishing is universally based on XML (and variants) so it is easy for them to include CML and related systems. I won't give details but I'd be surprised if there weren't major changes in the next 2-3 years here which I hope will answer some of the objections raised here.

There are also general major drivers elsewhere for the abandonment of legacy formats. They include the semantic web, RSS, institutional repositories, archival, etc. These efforts require interoperability and freely available tools - you can't archive - say - a binary chemistry file and expect it to be readable in 5 years' time. There are a lot of people to whom that matters.

So I'm not telling anyone to do anything - I'm putting ideas, protocols and tools where they may wish to pick them up. If 5% of a community is enthusiastic that's a good beginning. It worries me that the pharma industry has no concept of interoperability. But I've said that already.

What is a chemical compound? and what's a label

Steve Bachrach poses an interesting question on the CHMINF-L list. I have omitted the citations and some other material - you can read the archive if necessary.

I have run into an interesting chemical problem that has led to both theoretical and applied database questions. I am hoping that some of the experts on the list can shed some light.
I have been looking into the recent controversy concerning the structure of (+)-hexacyclinol. This compound was first isolated in 2002 by Graefe et al who proposed a structure for it.... La Clair recently synthesized this structure, or at least reportedly so. [SB: By the way - do a google search on hexacyclinol to see how the blogosphere responded to this problem.] Then Rychnovsky ...proposed an alternative structure for (+)-hexacyclinol, which was subsequently synthesized and confirmed to be identical to the original natural product.
(PMR: yes, the blogosphere is worth reading, e.g. Totally Synthetic (blogroll) and Tenderbutton (but this is now password-only).)
So here is first my theoretical question: How do you index such a situation? The original structure of the molecule (+)-hexacyclinol is wrong, and a subsequent one is right. So, when you query a database, which structure matches up with the name "(+)-hexacyclinol"? My guess is that it should be the correct one - but then what do you do with the old structure? Obviously, this is not the first, nor will it be the last, compound whose structure is contested.
Now here is the more applied aspect. A search in SciFinder for (+)-hexacyclinol gives CA 484674-97-7, which is the original (and, we now know, wrong!) structure. Querying for the papers that have this structure returns the Graefe, La Clair and Rychnovsky papers, but not the Porco paper. But entering the "true" hexacyclinol structure and then doing a search locates 2 structures CA 903574-41-4 and CA 903574-42-5, which look to me to be identical. Furthermore, the only paper that is linked to these "two" structures is the Rychnovsky paper. In other words, the Porco paper that reports the actual synthesis and x-ray structure of hexacyclinol does not have any hexacyclinol structure(s) (correct or not) attached to it!
PMR: Note - SciFinder is a Closed access tool to the Closed Chemical Abstracts database of chemical information. I cannot therefore comment on Steve's IDs.
(By the way, a PubChem search for hexacyclinol comes up dry, but all of the above papers are indexed in PubMed.) Any explanations?

(PMR: Yes, Hexacyclinol is not very interesting except to chemists so no-one has deposited a data collection containing it to PubChem. If synthetic chemists contribute collections of targets to PubChem I am sure PubChem will be delighted to accept them. However many chemists are still unaware that PubChem exists.)

PMR: There is nothing strange in this - it's common in almost all disciplines. As a science progresses the interpretation of objects changes. Genes, organisms, galaxies are all frequently reclassified. It's actually a strange feature of structural chemistry that there are so many cases that aren't fluid and where a structure and a substance can be associated and where this association can persist for a long time.
The language of "right" and "wrong" is what is causing the problem. These statements should be recast in terms of annotations or assertions, labelled with the authority that makes them. (Incidentally this is what is at the basis of the RDF-based Semantic web). The above could be written:
2002: Graefe asserts that C1 is the structure associated with a given substance (S1). Graefe gives S1 the label "hexacyclinol". Graefe also asserts that certain reported physical data (D1) belong to S1.
2005? La Clair makes S2 and asserts that it is the same substance as S1 and re-asserts that S1 has the structure C1.
2006. Rychnovsky makes S3 and asserts it is not identical with S1 or S2 but should be associated with the structure C1.
By the laws of chemistry (which say that a given substance S should only be associated with one structure C) we have a contradiction. So...
2006. Many chemists, including the blogosphere, assert that La Clair's statement is false.
But that may not be the end of it.
That is a simplified picture. Henry Rzepa has written about "What is mauveine?" - it is by no means clear what this industrially spectacular purple pigment was or is. He has devised an RDF scheme for presenting, and possibly resolving, a number of assertions about the structure of a compound.
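To show what "assertions labelled with the authority that makes them" might look like in practice, here is a minimal Python sketch. It is not RDF and it is not Henry's scheme. The predicate names (`hasStructure`, `hasLabel`, `hasData`, `sameSubstanceAs`) are invented for illustration, and the label C2 for Rychnovsky's alternative structure is my assumption, following Steve's account quoted above.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    authority: str   # who makes the statement
    subject: str     # substance label, e.g. "S1"
    predicate: str   # hypothetical predicate name
    obj: str         # structure, label or data set


assertions = [
    Assertion("Graefe", "S1", "hasStructure", "C1"),
    Assertion("Graefe", "S1", "hasLabel", "hexacyclinol"),
    Assertion("Graefe", "S1", "hasData", "D1"),
    Assertion("La Clair", "S2", "sameSubstanceAs", "S1"),
    Assertion("La Clair", "S1", "hasStructure", "C1"),
    # C2 is an assumed label for the alternative structure.
    Assertion("Rychnovsky", "S1", "hasStructure", "C2"),
]


def structure_conflicts(assertions):
    """Find substances that have been given more than one structure.

    Returns {substance: {structure: [authorities asserting it]}}.
    Nothing is judged "right" or "wrong"; the conflict is only reported.
    """
    claims = defaultdict(lambda: defaultdict(list))
    for a in assertions:
        if a.predicate == "hasStructure":
            claims[a.subject][a.obj].append(a.authority)
    return {
        substance: dict(by_structure)
        for substance, by_structure in claims.items()
        if len(by_structure) > 1
    }


print(structure_conflicts(assertions))
# {'S1': {'C1': ['Graefe', 'La Clair'], 'C2': ['Rychnovsky']}}
```

Nothing is deleted when a structure is revised. The old assertion stays, with its authority attached, and a later assertion can contradict it. A database built this way can answer both "what did Graefe say?" and "what is currently disputed?".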
Not all chemistry has the luxury of being able to associate a precise formula with a given substance in a jar. Here are some simple examples:
  • What is aluminium chloride?
  • What is glutamate?
  • What is glucose?

These are legitimate scientific questions whose answers require several assertions, linked in a mini-semantic web. That is why we need to move from a twentieth-century way of describing chemistry (as exemplified by the CA numbers) to a semantic one. There is lots of room for volunteers.

I'm reminded of being shown round the British Museum of Natural History - and a room full of fish specimens in ethanol in labelled glass containers. The biologist said that some countries had asked for their specimens - their property - back - the BM had resisted. But if it ever came to that the BM would keep the labels - their metadata.

What is a citation?

I've been trying to find out what a "citation" is. At least the sort of citation that governs my future, and the funding of my department and institution. Just to reintroduce this subject, here's Bill Hooker replying to my post Impact Factors! Hirsch, Erdős and Pauling

According to Google Scholar, this is me [Bill]: [18,16,16,12,10,9,8,5,5,2,1,1,0,0], which yields an h-index of 7 if I understand the definition. According to the Wikipedia article, a “modestly productive” biomed researcher should have an h-index greater than their “years of service”. Even if those years start when I first published (1995), I’m not doing very well. But I didn’t need a fancy index to tell me that.

I think Bill's maths is correct. But where do his figures come from? Google Scholar, which I also use because it's Open, and easy and I don't like using products from closed monopolistic commercial information providers. But is a Google citation count the same as a Web Of Knowledge citation? How do we know?
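Bill's arithmetic is easy to check. The sketch below uses the usual definition of the h-index (the largest h such that h papers have at least h citations each). The function name `h_index` is mine, and the counts are the ones Bill quotes from Google Scholar.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the paper at this rank still has enough citations
        else:
            break      # counts only fall from here on, so stop
    return h


bill = [18, 16, 16, 12, 10, 9, 8, 5, 5, 2, 1, 1, 0, 0]
print(h_index(bill))  # 7
```

The seventh paper has 8 citations and the eighth has only 5, so h is 7. Note that the answer depends entirely on which citation counts are fed in, which is the question that follows.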
Recently I had to fill in my publications for the current UK Research Assessment Exercise. In this we were asked to give 4 + 2 research publications over the last 5 years. I selected the ones that I was proudest of - not necessarily the ones with the highest Google citations. I think that in this RAE there is still a lot of human assessment so I should give them something interesting to read. In the next one it will be done by robots, so we need to know what robots like.
So research is not now about chasing the puzzles that "nature" sets us, but about guessing what the next metric is going to be. I suspect it's rather like pop music - for many years the New Musical Express produced hit charts - the lists of how many people bought which records each week. The numbers were collected by the industry - there were presumably no audit processes - and showed which were the most popular records. Not the best, just the most popular. Presumably this is a complex function of quality and marketing. However the numbers had positive feedback - if something sold well it was likely to be played more often and people felt they needed to buy it. But, retrospectively, I doubt many musicologists would claim the numbers were perfect or even good measures of quality. The same is true of films - box office and expert judgement from 20 years on probably have a poorish correlation.

So will the research metrics be different? The music industry had two indicators at least (sheet music and record sales). Perhaps this is analogous to citations and downloads of research articles. So let me take one of my papers that I feel represents a part of my informatics research and scholarship:

Murray-Rust, Peter; Mitchell, John; Rzepa, Henry (2005) "Chemistry in Bioinformatics" BMC Bioinformatics 6 141

It's Open Access, so you can read it. BMC Bioinformatics publishes access figures (see, e.g., this month's). A month after it was published, BMC sent me a mail saying this was one of the highly accessed articles:

From: BioMed Central Editorial
To: Peter Murray-Rust

Subject: Download statistics for your Open Access article
X-OriginalArrivalTime: 08 Sep 2005 06:30:15.0878 (UTC) FILETIME=[C65B6E60:01C5B43E]
Date: 8 Sep 2005 07:30:15 +0100

Title : Chemistry in Bioinformatics
Authors : Peter Murray-Rust, John B Mitchell and Henry S Rzepa
Journal : BMC Bioinformatics
Citation : 6:141
Dear Dr Murray-rust,
We thought you might be interested to know how many people have read your article since it was published:
Total accesses to this article: 1143
Access figures include full text, abstract and PDF downloads from the BMC Bioinformatics website.
These figures only reflect the accesses recorded on the journal's website and the BioMed Central website and do not include those from PubMed Central or other sites that archive articles published by BioMed Central. The overall access statistics for your article are therefore likely to be significantly higher.

(I can't find the current access count as BMC only seems to keep the last year in its RSS).

But the paper only gets 4 citations in Google Scholar (probably at least two are self-citations), and presumably fewer in ISI (which I cannot access as it is Closed).

So there is clearly a wide variation between reading and citing. Citations have the advantage that they are in principle measurable (albeit I suspect with considerable imprecision, particularly in a changing world). Access cannot be easily audited.
So my questions are, please, (and I genuinely don't know the answers)

  • How are citations counted?
  • Are different methods in widespread use?
  • If so are there agreed algorithms for converting between different metrics?
  • ... or is a single authority accepted?
  • if there is a single authority, what auditing of the counting is available? Does the authority set the metrics themselves? Or is there a community process?

Great scientists will generally rise to the top (though I suspect metrics may make the path different from before). I am not a great scientist - in fact I am primarily a technologist at present. Egon reports that on ISI I get an h-score of 9 - fair enough (although it seems to have missed a lot of my papers - maybe there is a time cutoff).

If we are going to be based on metrics then it is a waste of time writing papers for humans to read. The Bioinformatics article above counts for nothing.

Hilaire Belloc (1870-1953) wrote:

When I am dead, I hope it may be said:

"His sins were scarlet, but his books were read."

I am not a poet, but feel something like:

"My paper's published (and it was invited)

Don't bother reading it, but please let it be cited"

"English sentences without overt grammatical subject"

We had a PhD viva today and a small party afterwards. We got onto interesting scholarly publications, which reminded me of a paper which came out in my early career. It was by Quang Phuc Dong (South Hanoi Institute of Technology) and I remember it in a French journal of linguistics (Langue) though most modern citations are to an English language journal (Language). Since we are now involved in chemical linguistics I mentioned it to Peter Corbett as something that would help us understand the deep structure of sentences:
"English sentences without overt grammatical subject,"

It's a classic not only in linguistics but in scholarship in general. I remember it as being in a physical journal article in the library. It's not easy to find an Open Access online copy (though I suspect it was published before publishers started appropriating authors' work). It was a work of merit, being cited within a year (e.g. this Closed Access paper) and although Google Scholar only finds 3 citations (and does not give journal or other details) there are certainly more and this must be due to the difficulty of finding an indexable online version.
I obviously can't reproduce it here as I would be breaking copyright, and it would be inappropriate to reproduce parts as that would not respect the full moral rights of the author. I have found what appears to be a pirate copy here (which it pains me to reference as almost all the work in this field - even when 40 years old - is quite rightly Closed Access and available at very reasonable prices).

For further information about the author, try Wikipedia.