Hamburgers – theses in PDF

Having blogged about the excitement of automatic reading and semantic enhancement of chemical theses, I come to the stark reality of PDF.
“Turning PDF into XML is like turning a hamburger back into a cow” (anon).
So I searched for Openly exposed electronic chemical theses on the web. Yes, they are out there, but – in PDF. So here’s a typical tale of the sort of waste of time that I and my colleagues have to go through.
I find a major academic institution (Foo) with a repository of theses. There are two versions – one that the world can read, and one private to Foo faculty. They seem to be the same, except that the Open one is “not printable”. I load it into my browser – it displays. I try to save as text. I try to select text and paste into a text editor (this usually works). It doesn’t. So presumably there is some sort of deliberate gremlin in the document which prevents Adobe tools from saving the text. (I expect Adobe developed this specifically anyway.)
PDF has already destroyed the structure of the document. But perhaps I can at least save the words. OSCAR3 is very good at reconstructing chemistry from words. I save the PDF locally (that seems to be technically allowed) and then I open it in a text editor. Gibberish – but I expected that.
So I download PDFBox from Sourceforge. A typical example of noble-spirited Open Source development – trying to make life better than the hamburger culture. It has an executable called ExtractText. I run it. “Null Pointer Exception” (this means the program has failed to trap an error – but I forgive them, since by definition a hamburger is an error). I then notice another executable (SplitText). Expecting it to fail, I run it. Surprisingly it works. It produces 200 little PDF files (one for each page in the thesis). Not the ideal thing to work with, but serious progress.
Then I notice an option (-split). This says “only start splitting after n pages”. So I use -split 200. This creates one large PDF page (the same as the original document). This doesn’t seem like progress, but it is – the new file behaves perfectly with ExtractText. I can now convert the PDF to text without problems, and feed it into OSCAR3. More of this later.
Of course the resultant text is awful, but at least it contains all the right words and in the right order. It cannot manage subscripts (for example H2SO4 – the chemical formula for sulfuric acid – comes out as:
H
2
SO
4
).
That’s because PDF has no semantics. The ‘2’ and ‘4’ are just characters with X,Y coordinates – not associated with anything.
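By contrast, a semantic format carries that association explicitly. As a purely illustrative sketch (not the output of any particular tool – the element and attribute names are standard CML, but check them against the schema), sulfuric acid might be written as:

<molecule xmlns="http://www.xml-cml.org/schema" id="sulfuric-acid">
  <!-- the subscripts are data, not typography: each element symbol carries its count -->
  <formula concise="H 2 S 1 O 4"/>
</molecule>

Here the ‘2’ and ‘4’ are counts attached to H and O – exactly the information the hamburger throws away.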
So the message is clear.
Do not author documents in PDF alone.
If you use another format (Word, HTML, TeX and perhaps even, some day, XML), preserve that version. If you are required to grind the semantics into a hamburger, insist that the rich version is preserved – in your institutional repository.
Does this sorry story suggest that really we should be using XML for science, not PDF?

Posted in Uncategorized | 1 Comment

SPECTRa and SPECTRa-T

Every chemistry department spends millions each year on determining crystal structures, calculating properties of molecules and measuring spectra for newly synthesised compounds. The data are potentially extremely valuable, but almost all of them are lost. Of those that are saved and published, many are given to the publishers or aggregators who then sell them back to the community.
We have just completed an 18-month project, SPECTRa (a collaboration between chemists and libraries at Cambridge and Imperial College – see home page), to see what could be done. We verified that there was a massive problem of data loss and that most of the factors were people-related, not technical. The final report will be out soon (and I and/or Jim Downing will probably blog it then), but the additional point is that we are looking at how to sustain the momentum (sustainability is a major problem with many short-term IT-based projects) – particularly without specific funding. Our approach is to develop informal partnerships with other institutions (libraries and chemists) specifically to use the Open Source tools we have developed. For example we have collaborated with the University of Southampton (who run the national Crystallographic Service). We also have 2-3 early adopters outside the UK who are working out how they can adapt the process to capture their data (crystallography in the first instance).
How can we base a sustainable business model on zero funding? We can’t, of course, and it will only work where the institution cares about preserving data which it spent millions to create. Our tools are Open and therefore free, but they require local business processes to be developed and installed. Departments and institutions have to realise that preservation of data is a necessary investment. It needn’t be large and it can be aggregated with the current costs of (say) promoting research output, preparing publications, thesis preparation, etc. Chemists may also find that their LIS (Library Information Systems) group is interested in this type of project – it’s an ideal place to start for data capture and preservation and the likelihood of technical success is high. In general any departmental service should include information systems as part of its operating costs.
The great thing about the Web, of course, is that it allows you to find the best and most enthusiastic collaborators. So if anyone out there is interested in saving their chemical data…
SPECTRa itself has finished but JISC has funded a new project (SPECTRa-T). It’s not a continuation of SPECTRa, but it uses the ideas gathered there and applies them to chemical theses. Probably at least half the chemical data generated in departments is only ever published in theses (the supervisor moves, the student’s data cannot be found, etc.). So SPECTRa-T is looking at how robots can manage the semantic preservation of this automatically. Our OSCAR3 tool (Peter Corbett) can read a thesis in a few seconds and extract thousands of instances of chemical metadata along with structured objects (crystal structures, spectra, etc.).
So we are interested in institutions which have Openly available chemical theses… If these are available in *.doc format, that is much better than the hamburger PDF that most institutions require for deposition. But we may even be able to do something with that.

Posted in open issues, Uncategorized | Leave a comment

Open * [Chris Swain]

Until the blog cruft disappears I’ll try to repost comments. Here’s one from Chris Swain:
From: Chris Swain
Subject: Blog
Date: Thu, 12 Apr 2007 09:22:22 +0100

Hi,
Great to see you back!
I tried to post this comment but got lots of errors
Peter,
This issue was one of the reasons I agreed to become a Section Editor
for the Open Access journal Chemistry Central, and if I can be
allowed the indulgence of quoting from my own commentary (http://www.journal.chemistrycentral.com/content/1/1/20)
“Additional information such as experimental or spectroscopic data
assisting the reader can, and will, be provided liberally and could
of course be linked electronically to structures within the document.
The ready availability of supporting data should make this an
invaluable data store for structure activity information for future
work.”
All the best,
Chris
Posted in Uncategorized | Leave a comment

Ownership/Copyright of Comments and Email

  1. David Goodman raises a question in response to this post:
    December 18th, 2006 at 11:33 pm – excerpted and adapted from my Dec 18 posting on the SPARC-OA list: As an example, relevant to this very posting, I quote from this blog:
    “You own the copyright in your comments: but you also agree to license your comments under the Creative Commons Attribution-NonCommercial license.”
    When you act as a publisher, you do not license commercial use.
    David Goodman,

When we set this series of blogs up we thought it was important to consider the ownership of comments. I’ve been in positions before where I have been unable to distribute a collaborative work because we didn’t address this issue. So we took a stab and picked a license for the comments. I’m quite happy to be convinced that allowing “commercial use” would be better, but we omitted it because we thought it might put off some readers (we thought that even requiring a license might cause problems).
So, readers (and commenters), what is your view? Would you be happy with a license that allows commercial exploitation? Are any of you put off by the current license? (If you are, and don’t want to use the comments, please mail me.)
David – I do not understand “you” in the last sentence. Does it mean:
“When PMR acts as a publisher, PMR does not license commercial use” (i.e. a statement of fact) or
“When anyone acts as a publisher, that person does not license commercial use” (i.e. a metasyntactic variable applying to all publishers) (and perhaps “does not” implies “should not” or “cannot”).
or something else…
David’s other comment refers to the SPARC Open Data list which has had a discussion about copyright on email. David writes (Mailing List SPARC-OpenData@arl.org Message #85):

If you mean quoting, that is protected fair use. When it is for the
purpose of criticism, and especially scholarly criticism,
as it usually is on these lists, this can be very encompassing.
If you mean reposting without permission on a different list, I
think you will find that most of the posters report
that they have permission--the usual statement is
"reposted with permission from the X list";
I know that I have always asked. I think that Heather
has always asked.
Once or twice the original poster did not want a larger or different
audience, and in those cases I did not repost.
And, as Heather says, posting or quoting from a private letter is
totally unethical without explicit permission, print or email. I have
not seen anyone I know do it.

So this quote (almost all David's mail except for the address) is quoted here under fair use.
The penultimate point is important. I have been embarrassed in the past because someone took one of my posts to a fairly small list (A) and reposted it in a much larger list (B) with the wrapper “from Peter Murray-Rust”. This post caused harm, because it appeared that I had deliberately posted it in list B, where its content and attribution were seen as aggressive, while on list A it was normal discourse in the community. Unfortunately the person (an open source evangelist) made a habit of this and I took to including something like “please do not repost without an explicit statement that the reposting was not done by the original author”. I don’t know the answer to this.
A typical scenario might be:
I post here something that says “I think it would be a good idea if all publishers added InChIs to their chemistry”. This is fine on this blog. If, however, the post was copied to the Computational Chemistry List (where the person copied my material) and attributed to me, then those members might see me as spamming, uncritical, intrusive, etc. If he had written “I found this quote on PeterMR’s blog …” I couldn’t complain.
So metadata and transclusion matters. But we don’t want any more hard stuff before breakfast…
Sometimes when someone writes to me I extract part and comment:

“A correspondent asked me:

“When should I use cml:property and when should I use cml:Parameter? Here is some code to highlight the problem… ”
[code included in quote]

and I’m replying in public because I think many people will be interested.”

I hope that this does not break confidentiality and fair use. Otherwise I will have to write a lot more emails.

Posted in general, Uncategorized | 3 Comments

Open Data: more on ALPSP and STM statement

The next few posts will contain a substantial amount of comment (hopefully from you, dear readers and commentators) on the June 2006 ALPSP/STM statement on data in scholarly publications. I’m kicking myself for missing it. I plead that I have many other things to do (chemical research, writing programs), and Open Data is not my top priority, so I wasn’t reading Peter Suber’s Open Access News blog on a daily basis. I have a better feedreader and am reading it now. Here’s his June comment:

ALPSP and STM support OA data. The ALPSP and STM have issued a joint statement in support of free online access to scientific data (June 2006). Excerpt:

Publishers recognise that in many disciplines data itself, in various forms, is now a key output of research.

[snip… you can read the full statement here]

…this is fundamental in copyright law. In the EU, the use of ‘insubstantial’ parts of a database, provided it is not systematic and repeated, does not infringe the database maker’s rights.

PS: Comments. Three quick responses.

  1. First, I commend ALPSP and STM for the primary recommendation in this significant statement. Open access to data is important for all the reasons they cite.
  2. Their call for intellectual property protection for databases is a separable and regrettable part of their statement. The EU has such protection but the US does not. Any argument that such protection is needed to stimulate the production, collection, or dissemination of data is refuted by the vigor of science in the US.
  3. The ALPSP and STM both lobby against policies that would provide OA to research literature, like FRPAA and the draft RCUK policy. I acknowledge that there are many differences between OA to data and OA to peer-reviewed articles interpreting or analyzing data. But ALPSP and STM should acknowledge that there are many similarities, and that most of their arguments for OA data (enhancing research productivity, avoiding costly repetition of research, supporting the creative integration and reworking of research) also apply to OA literature.

PMR comments:
(1) Like PS I commend the ALPSP and STM for this. (It is very important that we do not see people and organisations as “good” and “bad” – this sometimes happens in web-based and other discussions. They have good and bad policies.) So this is a very valuable advance.
(2) I support Peter’s position as well. I think the blanket claim for IPR through this method is both crude and generally undesirable. Anything that can be done to make it clear that data in publications lie outside this directive is important. A journal is not, IMO, a database (though the ACS now explicitly claims that a journal is no longer a journal, but a database to which readers have access. There are other issues here, but I stick with the data). So I will help to promote the swell of challenge to database rights, but I’m not leading the charge.
(3) I am, of course, a strong supporter of Open Access, and the logical premise that OA implies Open Data. That is the area where I’m trying to lead a charge.
So, in principle, all members of ALPSP and STM should support this, shouldn’t they? Or has there been dissension? If not, we’ll assume it has universal support.
So what possible legitimacy do we have for journals that copyright supplemental info?
(There are many other questions – what about data in full text? what about spidering websites? It should be exciting to examine these. Let’s see if we can use the blogosphere to do some useful research into this.)
And – finally – why did no-one on the SPARC Open Data list flag this?
I missed this in June: I’ve reproduced Peter Suber’s comments above.

Posted in data, open issues | 2 Comments

Open Data: help from ALPSP and STM

I’m over the moon! Richard (from the Royal Society of Chemistry) has commented on my post about the ownership and licensing of data:

  1. Richard Says:
    December 13th, 2006 at 3:27 pm: And of course, publishers (well, ALPSP and STM members) have this statement as guidance.

Now I knew that STM publishers were thinking of coming out with something like this and I kept searching from time to time, but I missed it. (It’s dated June 2006.) It really should have been spotted by the SPARC Open Data membership, shouldn’t it? Anyway here we go (it doesn’t have a copyright statement so I claim fair use, and it’s in hamburger format so I hope I have got most of what matters):

Databases, data sets, and data accessibility – views and practices of scholarly publishers

A statement by the Association of Learned and Professional Society Publishers (ALPSP) and the International Association of Scientific, Technical and Medical Publishers (STM)

Publishers recognise that in many disciplines data itself, in various forms, is now a key output of research. Data searching and mining tools permit increasingly sophisticated use of raw data. Of course, journal articles provide one ‘view’ of the significance and interpretation of that data – and conference presentations and informal exchanges may provide other ‘views’ – but data itself is an increasingly important community resource. Science is best advanced by allowing as many scientists as possible to have access to as much prior data as possible; this avoids costly repetition of work, and allows creative new integration and reworking of existing data.

There is considerable controversy in the scholarly community about ‘ownership’ of and access to data, some of which arises because of the difficulty in distinguishing between information products created for the specific display and retrieval of data (‘databases’) and sets or collections of raw relevant data captured in the course of research or other efforts (‘data sets’). Another point of difficulty is that in many cases data sets or even smaller sub-sets of data are also provided as an electronic adjunct to a paper submitted to a scholarly journal, either for online publication or simply to allow the referees to verify the conclusions.

We believe that, as a general principle, data sets, the raw data outputs of research, and sets or sub-sets of that data which are submitted with a paper to a journal, should wherever possible be made freely accessible to other scholars. We believe that the best practice for scholarly journal publishers is to separate supporting data from the article itself, and not to require any transfer of or ownership in such data or data sets as a condition of publication of the article in question. Further, we believe that when articles are published that have associated data files, it would be highly desirable, whenever feasible, to provide free access to that data, immediately or shortly after publication, whether the data is hosted on the publisher’s own site or elsewhere (even when the article itself is published under a business model which does not make it immediately free to all).

We recognise, however, that hosting, maintaining and preserving raw data or data sets, and continuing to make such data available over the long term, has a cost which, in certain circumstances, the host site may need to recover. We also recognize that on occasion the generation of data has been privately funded, and the funding entity may have a particular reason for restricting access to the data (either temporarily or even permanently), but we believe these should be limited exceptions, and that journal publishers themselves should claim no ownership interest in such data. The academic and publishing communities should discuss further (in the context of the debate on the public funding of research) whether more reliable and more permanent sites should be established to host research data.

None of this means, however, that databases themselves – collections of data specifically organised and presented, often at considerable cost, for the ease of viewing, retrieval and analysis – do not merit intellectual property protection, under copyright or database protection principles. Such databases are often characterized by the sophistication of their data field structuring, searchability tools, and the like, and scholarly publishers are often involved in producing and marketing databases that contain valuable and useful information for scholarly research. The research interest and value of raw research data sets and individual data points is entirely different, and serves different purposes, from that of specific databases that have been organised and compiled for particular research needs.

There is sometimes confusion about whether the use of individual ‘facts’ and data points extracted from a database is permitted under law. Facts themselves are not copyrightable, but only the way in which information is expressed – this is fundamental in copyright law. In the EU, the use of ‘insubstantial’ parts of a database, provided it is not systematic and repeated, does not infringe the database maker’s rights.

Articles published in scholarly journals often include tables and charts in which certain data points are included or expressed. Journal publishers often do seek the transfer of or ownership of the publishing rights in such illustrations (as they might do with respect to an author’s photograph), but this does not amount to a claim to the underlying data itself.

We hope that this statement is helpful in clarifying the views of publishers concerning raw data, data sets and databases, and that the statement will serve as useful guidance for publishers in their policies concerning data sets submitted with papers. Scholarly and scientific publishers share the view that research data should be as widely available as possible.
June 2006

PMR: This looks very useful indeed. Thanks Richard. I don’t know how I missed it. I will comment later.

Posted in data, open issues | 1 Comment

Open Data: help from Microsoft

In reply to my last post (about the idea of adding Creative Commons licenses to scientific data)…

  1. Robin Rice Says:
    December 12th, 2006 at 7:04 pm: There was an article in the October issue of Ariadne, Creative Commons Licences in Higher and Further Education: Do We Care? which points out some of the questions around widespread use of Creative Commons licenses. “Naomi Korn and Charles Oppenheim discuss the history and merits of using Creative Commons licences whilst questioning whether these licences are indeed a panacea.”

Robin – that’s wonderful – I’d missed this but I’ll quote extensively: the article is Openly available and, while it doesn’t fulfil all the criteria of the BOAI, it allows extensive quoting. But you can’t resell it :-) …

Creative Commons Licences in Higher and Further Education: Do We Care?
Naomi Korn and Charles Oppenheim discuss the history and merits of using Creative Commons licences while questioning whether these licences are indeed a panacea.

The recent incorporation of Creative Commons licences within Microsoft Office Word, Excel and PowerPoint applications via a downloadable plug-in [7] now provides an integrated method for the creation and licensing of content. It is a brilliant way of encouraging consideration about who is allowed access to digital content and under what conditions, at the time that the content is generated. This development has enormous potential for nurturing educated and mature approaches to copyright and access, but at the same time had also necessitated the need to re-examine the validity of using Creative Commons within teaching, learning and research activities. It has precipitated a critical assessment of their use and the need to set clear parameters about when they can and cannot be used.

[7 is Microsoft Office plug-in http://www.microsoft.com/downloads/details.aspx?FamilyId=113B53DD-1CC0-4FBE-9E1D-B91D07C76504&displaylang=en/ ]

PMR: This is really exciting. I don’t mind other people being visited by a meme ahead of me – it strengthens my resolve. From the Microsoft (yes, Microsoft) page:

Overview

This add-in enables you to embed a Creative Commons license into a document that you create using the popular applications: Microsoft Office Word, Microsoft Office PowerPoint, or Microsoft Office Excel. With a Creative Commons license, authors can express their intentions regarding how their works may be used by others.
The add-in downloads the Creative Commons license you designate from the Creative Commons Web site and inserts it directly into your creative work. Creative Commons supports a number of languages.
To learn more about Creative Commons, please visit its web site, www.creativecommons.org. To learn more about the choices among the Creative Commons licenses, see http://creativecommons.org/about/licenses/meet-the-licenses.
Microsoft Office productivity applications are the most widely used personal productivity applications in the world, and Microsoft’s goal is to enhance the user’s experience with those applications. Empowering Microsoft Office users to express their intentions through Creative Commons licenses is another way Microsoft enables users around the world to exercise their creative freedom while being clear about the rights granted to users of a creative work. In the past, it has not always been easy or obvious to understand the intentions of some authors or artists regarding distribution or use of their intellectual creations.
This add-in is made available through a partnership among Creative Commons, Microsoft, and 3Sharp, LLC, an independent solution provider.
The Creative Commons Add-in is an unsupported technology preview, however we welcome your feedback and input. Please send it to us at ccfeedbk@microsoft.com.

PMR: Whatever else one thinks of Microsoft, this looks like a genuine and important offering.
My Korn-Oppenheim approximation (couldn’t resist the pun)… I quote snippets…

Creative Commons actively encourages the sharing of educational material, … This ‘standard licence’ is not specific for educational purposes and users can only choose, as in the case of the majority of Creative Commons licences, to make the content available either for commercial or non-commercial purposes. Creative Commons licences do not specifically cater for educational purposes.
It is also worth noting that there is a significant difference between the information that is provided within the full-length licence in comparison to that which appears within the shortened code of Creative Commons licences. […]

  • If the work includes third party rights for which the Higher or Further Education institution has not secured permission for them to then be disseminated under a CC licence. This might include the use of photographs, text, images, etc., generated by third parties, or indeed images of third parties for which permission would need to be sought.
  • As an employee, unless there is an agreement with the employer to the contrary, the employer is likely to own the rights in the work created (as is the case with course material, research outputs, etc). In this case, as the employee does not own the rights in the work that is produced, he or she will need to check with the institution that it is happy for it to be made available under a Creative Commons licence. Reasons for refusal might include those that are ethical, political, financial or legal. We are, of course, aware that in many institutions, custom and practice leaves such decisions to the employee. The recent HEFCE Good Practice Guide [17], however, strongly encourages such institutions to assert ownership of copyright in e-learning materials, and such advice may well occasion changes in policy in the future.
  • The terms of the Creative Commons licence may cut across some of the activities for which an institution or department might normally charge, the business activities of the department or institution, or undermine existing licensing arrangements with third parties.
  • Creative Commons licences are global licences without providing any means to restrict the countries in which material may be used. This is important if contractual agreements, relationships, political or ethical reasons preclude the release of certain types of learning material or research outputs in particular countries.
  • The department or institution may want more control over the context of use of the work and want to prevent any implied or direct endorsement. This is not currently one of the areas which is covered by the licence.
  • It may be important to know who accesses and uses the material that is generated, for evaluation, marketing and other in-house purposes. The Creative Commons licences have no provision for user accountability, or tracing of usage.
  • There are also many instances where you cannot use TPMS (Technical Protection Measures) in conjunction with the use of Creative Commons licences, if TPMS undermine any of the provisions of the licences. Thus, for example, restricting access to students within the institution is incompatible with a Creative Commons licence.
  • The Creative Commons licence does not cover database rights, yet much output from FE and HE institutions is protected by such rights. Thus, even with a Creative Commons licence, users may not infringe any database rights.
  • It is, in any case, unclear if Creative Commons licences are valid in UK law, as they do not provide any ‘consideration’ or payment, and there is no ‘I agree’ button to accept.

Concluding Remarks

The well-known JORUM service [18] decided not to use Creative Commons licences in the past because of some of the points made within this article. So, are Creative Commons licences a panacea? No. Should we worry? Probably not, but we need to be aware of the implications and limitations of using Creative Commons licences and remember that their use is only as good as staff awareness about copyright issues, rights management procedures and robust policies underpinning the operations of educational institutions. HE and FE institutions need to be clear about their policies towards access and broader strategic and commercial goals, before committing themselves to the irrevocable terms of Creative Commons licences.
It might be advisable instead for institutions to explore the use of Creative Archive licences [5], which are a set of more restrictive licences, based upon the same premise as Creative Commons but with limits upon the use of content for educational and non-commercial purposes and restrictions relating to the territories in which they may be used.

All these are valid points. However I am only concerned here with scientific facts supporting research output. I hope that HE institutions do not regard themselves as the owners of scientific facts, in which case there is no problem. I agree that CC is not designed for science – hence the Science Commons effort.
I haven’t downloaded and installed the plugin, but my main concern is that someone might leave the CC license switched on at an early stage. If such a document circulated in private it might vitiate a patent, although I don’t know the law. Once the data has appeared, the license has no effect on a patent which has been successfully filed. You cannot patent data.
So let’s encourage everyone to download this plugin. Wow!

Posted in data, open issues | Leave a comment

Open Data – what can I do? Simple, legal, viral suggestion

Following discussion on the SPARC Open Data list I got a mail:

I’d like to hear more discussion on open data, too. In particular, what are the practical approaches that will help adoption of open data by researchers themselves? We all know technology is not the big issue here. The biggest challenge is how to get researchers to share their raw experiment data on the web. Because this is quite different from traditional publication, a big uphill battle is expected.
Since I’m a software developer (with research background), my thinking is always centered around creating free new tools and services for the end users (i.e. researchers). And the key is to come up with a set of tools and services that can benefit users immediately as they open up their data bit by bit on the web. Of course, they have to be relatively easy to implement (ideally, no funding is required). Fortunately, open source software and communities have made this possible. In addition, the emerging semantic web technologies seem to be right for this task. So, I’m working with W3C semantic web group to openly develop ontologies for representing research data (at high level). I’m also developing necessary web publishing tool and R&D community search engine through open source project. My hope is that researchers will open more data when they actually see these open data bring more visibility and recognition to their work through a community search engine.
Cheers,
AJ
AJ Chen, Ph.D.
Palo Alto, CA. USA
1-650-283-4091
web2express.org
W3C Scientific Publishing task force
“Open data on semantic web”

PMR: The simplest thing that researchers can do is to add a Creative Commons license to their data. It costs nothing, is a simple cut-and-paste, and could trivially be made a template in any data production tool. (For example, if you publish spreadsheets in Excel, add a Creative Commons license as your last/first line. Every time you open a blank spreadsheet it would have this.)
Similarly scientific software developers could output the additional line:
“The data output by this program are offered under a Creative Commons Attribution Share-alike license.”
(or better, in a machine-readable XML/RDF format, as Creative Commons already does).
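As a purely illustrative sketch – the exact namespaces and element names should be checked against the Creative Commons documentation, and the title is invented – the machine-readable form emitted alongside a data file might look something like:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cc="http://web.resource.org/cc/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <cc:Work rdf:about="">
    <dc:title>Raw spectra for compound 23 (illustrative title)</dc:title>
    <!-- the resource URL identifies the chosen Creative Commons license -->
    <cc:license rdf:resource="http://creativecommons.org/licenses/by-sa/2.5/"/>
  </cc:Work>
</rdf:RDF>

Because this is plain text it can be pasted, templated or generated by software at no extra cost.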
We might add the following:
“The authors regard these as non-copyrightable Open Data. Some publishers wish to claim that if published alongside a journal article they own the copyright on them. This license effectively forbids them from claiming copyright without the authors’ permission and acquiescence.”
I think the effect of this would be dramatic. Scientists would start to see these messages and think: “Why should I give these data to the publisher?” And if the publisher simply adds a copyright notice saying “all these data are copyright the publisher – you cannot use them for X, Y, Z without permission” this would be in violation of the authors’ license. The author would have to deliberately remove this statement to hand over the IPR to the publisher.
It is never easy to design a viral campaign, but this has all the prerequisites of a meme:

  • it infects a significant number of the potential population
  • they wish to reproduce and spread the meme
  • the costs of replicating the meme are effectively zero

The first two are unknowns. The critical things are to get the form of words right and not to foul up technically. Not all data sets can carry text (although an increasing number will be accompanied by metadata, which is ideal for this).
If the scientific programmers buy into this it is unstoppable.

Posted in "virtual communities", open issues | 14 Comments

Egon on SMILES InChI CML and RSS

I agree with everything Egon says and add comments.
(Incidentally WordPress and Planet remove the microformats, so please read his original for the correct syntax.)
On the blogs ChemBark and KinasePro there have been some discussions on the use of SMILES, CML and InChI in Chemical Blogspace (with 70 chemistry blogs now!). Chemists seem to prefer SMILES over InChI, while there is interest in moving towards CML too. Peter commented.

PMR: 70 blogs is great. Go back a year and we’d have had ca. 10, I suspect. As I say, I’m only looking for the 5-10% who are happy to be early adopters.

Any incorporation of content other than images and free text requires some HTML knowledge, but this can be rather limited. It is up to us chemoinformaticians to write good documentation on how to do things; so here is a first go.

PMR: Yes, documentation is key as we are always being reminded! But we are also still fighting the browser technology. One of the great problems is that browsers have been a moving target for 12 years – it was almost easier to create a “plugin” in 1994 than now. How many of you can run Chime under Firefox?

Including CML in blogs and other RSS feeds: I blogged about including CML in blogs last February, and can generally refer to this article published last year: Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators (PMID:15032525, DOI:10.1021/ci034244p). Basically, it just comes down to putting the CML code into the HTML version of your blog content, though I appreciate the need for plugins.

PMR: you should always try to create XHTML (HTML with balanced tags). Unfortunately (and most regrettably) some tools, including WordPress, can often remove end tags.
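To make that concrete, here is a minimal, purely illustrative fragment of blog XHTML with a CML molecule embedded (the element names are standard CML; the surrounding prose and identifiers are my own example):

<p>The compound is methanol:
  <molecule xmlns="http://www.xml-cml.org/schema" id="methanol">
    <formula concise="C 1 H 4 O 1"/>
  </molecule>
</p>

Note that every element is closed – if the blogging software strips the end tags the fragment is no longer well-formed XML and an aggregator cannot parse it.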

Including SMILES, CAS and InChI in blogs: Including SMILES is much easier as it is plain text, and has the advantage over InChI that it is much more readable. Chris wondered in the KinasePro blog how to tag SMILES, while Paul did the same on ChemBark about CAS numbers.
PMR: SMILES shouldn’t need to be “readable” and some of it isn’t (e.g. if you have a completely disconnected structure). It is because people have got used to seeing it for many years that they don’t feel frightened. There is no way to create canonical SMILES by hand, so you have to have a tool. InChI seems more forbidding because (a) it’s new, (b) it can never be hand authored, (c) it’s about 50% more verbose, and (d) it has layers. But each of those has a positive side.
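To make the layers concrete, here is ethanol in both notations (quoted from memory as an illustration – please check against a current InChI tool before relying on it):

SMILES: CCO
InChI:  InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

The three InChI segments are the formula layer, the connection (‘c’) layer and the hydrogen (‘h’) layer; further layers (charge, stereochemistry, isotopes) appear only when needed.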
Now, users of PostGenomic.com know how to add markup to their blogs to get PostGenomic to index discussed literature, websites and conferences. Something similar is easily done for chemistry things too, as I showed in Hacking InChI support into postgenomic.com (which was put on lower priority because of finishing my PhD). PostGenomic.com basically uses microformats, which I blogged about just a few days ago in Chemo::Blogs #2, where I suggested the use of aspirin. And this is the way SMILES, CAS and InChIs can be tagged on blogs. The element is HTML code to indicate a bit of similar content in HTML, and can, among many other things, be formatted differently than other text. However, this can also be used to add semantics in a relatively cheap, but accepted, way. Microformats are formalized just by use, so whatever we, as chemistry bloggers, use will become the de facto standard. Here are my suggestions:
[snipped see Egon’s blog]
The RDFa alternative: The future, however, might use RDFa over microformats, so here are the RDFa equivalents:
[snipped see Egon’s blog]
which requires you to register the namespace xmlns:chem=”http://www.blueobelisk.org/chemistryblogs/” somewhere though. Formally, the URN for this namespace needs to be formalized; Peter, would the Blue Obelisk be the platform to do this? BTW, this is more advanced, and currently does not have practical advantages over the use of microformats.
Egon is right: there is currently no clear indication of which approach will come out as the “winner” although there is lots of Web discourse. However for us I suspect we would adopt both if lots of people were using them, and see which approach won.
Yes, of course we should use blueobelisk for the RDF! This has the real chance of succeeding.
Again the message is that the rest of the world is going down this route and at some stage chemistry will follow. RDF looks just as impenetrable as InChI, DOI, and all the rest…
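For readers who want a concrete picture (Egon’s actual suggestions are in his post – the class and property names below are my own illustration, not his), tagging an InChI might look something like this in the microformat style and in the RDFa style using the namespace he quotes:

<!-- microformat style: a hypothetical class name marks the identifier -->
<span class="chem-inchi">InChI=1/CH4/h1H4</span>

<!-- RDFa style, with the Blue Obelisk namespace mentioned above -->
<span xmlns:chem="http://www.blueobelisk.org/chemistryblogs/"
      property="chem:inchi">InChI=1/CH4/h1H4</span>

Either way the identifier remains ordinary text in the page that an aggregator can pick out reliably.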
Posted in open issues, XML | 1 Comment

Why bother with new technology?

Kinasepro has blogged about discussions of new chemoinformatics technology (specifically CML (Chemical Markup Language) and InChI (chemical identifier)). Here’s the post and some correspondence. It’s basically about the introduction of new technology. Obviously I’m not neutral but I will try to discuss it in a neutral manner. For that reason I have copied it more or less in full.

There’s been a fair amount of talk [ChemBark] over the last little while on the topic of chemoinformatics and chemblogs. Here’s my two cents.
smiles inchi Aldrich
smiles inchi ChemExper
smiles inchi The PDB
smiles inchi Chemdraw (until v10)
smiles inchi The entire pharmaceutical industry.
smiles inchi Peter Murray Rust
smiles inchi IUPAC
So somehow a couple librarians have convinced Google that inchi > smiles. Result? Google may well do Inchi, but noone but the librarians are currently using it, and meanwhile google doesn’t index smiles very well. I’m reminded of a day when it was thought to be a good idea to put the CAS#s of new entities at the bottom of ACS journal articles. Don’t worry, we survived those librarians too.
PMR: I’m not sure who the librarians are. I’d label all of (us) as chemical informatics. The institutions include NIST, RSC, and University of Cambridge. I don’t think Google has been convinced of anything – chemistry is relatively too small for Google to worry about. But yes, we have visited and had very useful and forward-looking conversations. Watch out for Googlebase…
Lookit, we don’t need a string of XML code that you need an advanced degree to use. We don’t need people telling us to tag our blog posts, we need an integrated solution. We need something that can draw structures and present them attractively in an index friendly HTML format. Near term: Get google to index picture descriptions, and code a firefox plugin that can insert smiles into said descriptions.
PMR: I am not quite sure what “index picture descriptions” means. Google indexes the fact that there is a picture but not the content. There are major efforts in image recognition, but I am not aware that any of this is being done in chemistry. I think that indexing chemistry in published GIFs is extremely difficult. I’ve looked at this over the years and conclude that it would be much easier if authors simply made their molecular files available.
Till google has a smiles substructure search, I’m not going to bother.
PMR: This is a perfectly valid response from an individual in the system. It’s rather less encouraging if it reflects the whole of chemistry (which currently it does). If the chemical informatics community says “at some stage Google will solve all our chemical problems, until then we’ll do nothing” that’s regrettable. (All other major scientific disciplines – physics, astronomy, bioscience, geosciences, etc. – are making major efforts to develop informatics infrastructure.) Some of us are, in fact, thinking about how to do this. The problem is that there has to be some software somewhere. It can be in the following places:
  • client (i.e. your browser)
  • Google (we have discussed this with Google and it’s not impossible)
  • third party (who may or may not charge for it).

Given that Open Babel can search millions of structures quite rapidly, there are some encouraging opportunities.

  1. totallymedicinal Dec 5th, 2006 at 3:09 pm
    Couldn’t agree more with the sentiment – not only does my ancient version of ChemDraw not support this exotic format, but I have enuff hassle in my life without learning some obscure new coding system.

PMR: Again this is a perfectly valid response. Any approach to chemoinformatics requires tools. And I suspect you or your institution would have to pay for an upgrade to ChemDraw. Obviously there is the opportunity of some Open Source free tools but they are not yet widely deployed and are effectively for early adopters.

  3. Paul Dec 7th, 2006 at 4:06 am
    I could not agree more about the need for an integrated solution! I got a really thoughtful response from Peter Murray-Rust and friends, and I feel kind of bad about not acting on it, but putting random InChI designations at the bottom of all our blog posts doesn’t seem worth it to me. I think that CML is indeed the future, and I look forward to the day of being able to download a CML plugin for WordPress that will take care of everything for us lazy bloggers.
PMR: There is no doubt there are technical problems and they will require some early adopters to solve. I have tried to hide the InChI – it is an effort and is fragile. Given that I have problems with simple computer code in WordPress I expect the same with chemistry. However we have some new ideas of how to take this away from the WordPress process.
  4. Chris Dec 9th, 2006 at 12:23 pm
    The argument against SMILES seems to be they are not an Open Format and it is possible to represent a single molecule with multiple SMILES strings. For my part I can read and write SMILES (and SMARTS and SMIRKS). I find InChI impenetrable and I don’t think there is syntax for substructure or similarity queries; in addition I don’t think there is a system for describing reactions. I’ve started to add SMILES to my web pages in the hope that someone will build an index at some point, I guess it would help if there was a SMILES tag.
PMR: SMILES was a groundbreaking language when it came out. In general I have no problem with non-Open formats if there are free tools to manage them. There is a canonicalization algorithm for SMILES but it is closed and proprietary. I have regularly discussed the value of making it openly available with Daylight management but they are not prepared to do this. This is a legitimate business approach – control the market through trade secrets. In the current case, however, it has the practical downside that several groups have created incompatible “canonical” SMILES.
The main virtue of InChI is that it is a public Open Canonicalization algorithm. It’s perfectly possible to convert InChI to SMILES if you want. It would not be “canonical SMILES” in the strict sense, but it would be canonical. That may, in fact, be a useful approach for certain types of compound. As InChI has a richer set of concepts than SMILES there may be some information loss.
In summary, if Daylight had made the SMILES algorithm public and it had been used responsibly I doubt very much whether we would have InChI. It has been driven by the lack of interoperability in chemistry – coming in some part from government agencies and the publishing community.
InChI is by definition impenetrable. It’s an identifier. Do you find DOI, ISBN, security certificates impenetrable? I hope so 🙂
  5. kinasepro Dec 9th, 2006 at 7:03 pm
    InChI and CML may well be the future, and no-one will embrace it more than me, but SMILES is the present. For people working in the field not to understand that boggles the mind!

PMR: I’m not sure who “people working in the field” are. If it includes me, then I fully understand it. I am simply trying to bring the future to the present a bit quicker and a bit more predictably. 🙂

  1. I’ve experimented on this site a little with smiles. For instance a google search of the following string brings you here: O=C(C2=CN=C(NC3=NC(C)=NC(N4CCN(CCO)CC4)=C3)S2)NC1=C(C)C=CC=C1Cl Of note, I’m not the only one with that string on the web! Maybe that’s an important compound? Sadly google indexed that page under my SRC tag rather than as a standalone page. Put that together with the fact that smiles strings are not substructure searchable via google and it’s clear to me that google is not ready to be a chemistry informatics platform. It’s sad really, because it doesn’t seem to me that it would be that difficult for them to make SMILES strings substructure searchable via the same algorithm the PDB, relibase, aldrich and everybody else is using.

PMR: This is a very important point and at the heart of the problem. Google works by indexing text. It’s good at it, can distinguish different roles for text, and can look for substrings. This is a simple, powerful model. But at present it doesn’t index other objects (faces, maps, etc.). These are both harder and require specialist software. By contrast PDB, Relibase, Aldrich do index chemical structures. That means that they have to have specialist software running on their servers. Which means a business model. And that someone has to pay somewhere. PDB gets a grant, Relibase is commercial, Aldrich will see this as the basis for selling more compounds. All completely valid. But there is no business reason for Google to invest in chemistry-specific software – as I said, chemistry is too small for Google to bother with. It’s not helped by the fact that all the information is proprietary and that one of the major chemical information suppliers (CAS/ACS) sued Google. So unless you convince them differently – and I have gently tried – it won’t happen.
So this is all about the introduction of new technology. The primary messages from the chemistry community are something like:

  • We’re happy with what we’ve got – it’s worked for the last 20 years and will go on doing so. Yes, for a little while.
  • When it’s necessary CambridgeSoft, Chemical Abstracts, Elsevier will develop a new technology and we’ll pay them to use it. Unfortunately I don’t see any movement from any of these to embrace the new Web metaphors. Biology, geoscience, etc. are working hard to develop the semantic web in their subjects – apart from a few of us, no-one in chemistry is.
  • Well, it’s a bit of a mess, but it’s not at the top of my priorities. I’ll come back in a few years.
There are movements in chemistry, particularly in three areas:
  • computational chemistry. We are having a visit from COST D37 (EU) to Cambridge tomorrow to create an interoperable infrastructure for computational chemistry. It will be based on communal agreements and use XML/CML as the infrastructure.
  • chemoinformatics. The Open Source community (e.g. Blue Obelisk) supports both current (legacy) formats (SMILES, Mol) etc. and also CML/InChI. This can provide a smooth path towards the wider adoption of these newer approaches, including toolkits. The toolkits are free, which some see as a disadvantage, in which case you will have to convince the commercial suppliers to create them.
  • publishing. Commercial publishing is universally based on XML (and variants) so it is easy for them to include CML and related systems. I won’t give details, but I’d be surprised if there weren’t major changes in the next 2-3 years which I hope will answer some of the objections raised here.

There are also general major drivers elsewhere for the abandonment of legacy formats. They include the semantic web, RSS, institutional repositories, archival, etc. These efforts require interoperability and freely available tools – you can’t archive, say, a binary chemistry file and expect it to be readable in 5 years’ time. There are a lot of people to whom that matters.
So I’m not telling anyone to do anything – I’m putting ideas, protocols and tools where they may wish to pick them up. If 5% of a community is enthusiastic that’s a good beginning. It worries me that the pharma industry has no concept of interoperability. But I’ve said that already.

Posted in chemistry, open issues, XML | 5 Comments