US Voters: time to lobby Congress on OA again

From Peter Suber: Bush vetoes LHHS appropriations bill. The headline says it all, but here’s some detail from Jennifer Loven for the Associated Press:

President Bush, escalating his budget battle with Congress, on Tuesday vetoed a spending measure for health and education programs prized by congressional Democrats….
[…]

PS: Comments

  • First, don’t panic.  This has been expected for months and the fight is not over.  Here’s a reminder from my November newsletter:  “There are two reasons not to despair if President Bush vetoes the LHHS appropriations bill later this month.  If Congress overrides the veto, then the OA mandate language will become law.  Just like that.  If Congress fails to override the veto, and modifies the LHHS appropriation instead, then the OA mandate is likely to survive intact.”  (See the rest of the newsletter for details on both possibilities.)
  • Also expected:  Bush vetoed the bill for spending more than he wants to spend, not for its OA provision.
  • Second, it’s time for US citizens to contact their Congressional delegations again.  This time around, contact your Representative in the House as well as your two Senators.  The message is:  vote yes on an override of the President’s veto of the LHHS appropriations bill.  (Note that the LHHS appropriations bill contains much more than the provision mandating OA at the NIH.)
  • The override votes –one in each chamber– haven’t yet been scheduled.  They may come this week or they may be delayed until after Thanksgiving.  But they will come and it’s not too early to contact your Congressional delegation.  For the contact info for your representatives (phone, email, fax, local offices), see CongressMerge.
  • Please spread the word!
Posted in open issues | Leave a comment

Repository depositions – what scales? A simple idea

One of the problems of repositories at present is that everything is new. And much of it is complex. And some changes rapidly. So here is a simple idea, motivated by Dorothea’s reply to a post of mine…

Dorothea Salo Says:
November 12th, 2007 at 2:51 pm

[… why repositories need investment …]

And some of the work is automatable over time. Once you know a particular journal’s inclinations, pushing anything from that journal out of the bucket becomes a two-second decision instead of a ten-minute slog through SHERPA and publisher websites.

PMR: Now this is an area that is a vast time-sink. Suppose I (as a simple scientific author) want to know if I can archive my Springer article (read also Wiley, Elsevier, ACS, RSC…). What do I have to do? When?
I imagine that hundreds of people struggle through this every year. Frantically hacking through awful, yes awful, pages from publishers. Many of these are not aimed at helping authors self-archive but at suggesting how they can pay money to the publisher for re-use of articles. (I could easily rack up a bill of 1000 USD for re-using my own article if I wanted to include it in a book, use it for distance education, use it for training etc.). It is not easy to find out how to self-archive – I wonder why?
So I thought I would try to do this responsibly and find out what Springer actually allows. I have a paper in a Springer journal – what am I allowed to do and when? The following journey may be inexact, and I’d appreciate correction, but it’s the one that a fairly intelligent, fairly knowledgeable, scientist who knows something about Open Access followed.
I went to the home page of J. Molecular Modeling and looked in “For authors and editors”. Nothing about self-archiving. A fair amount about Open Choice (the option where the author pays Springer 3000 USD to have their full-text article visible (like all other current non-Open Choice articles in J. Mol. Mod.), archived in Pubmed, and re-useable for teaching without payment, but with the copyright retained by Springer). I went to Google and typed “Springer self-archiving”. I won’t list all in detail but the results, in order, were:
A report by Peter Suber (2005), Journal of Gambling Studies, a critique by Stevan Harnad, a Springer PPT presentation (2004) on Open Choice, which stated:

Springer supports Self-archiving: Authors are allowed to post their own version on their personal website or in their institution’s online repository, with a link to the publisher’s version. (PMR: this is the ONLY page I found from Springer.)

… an attack by Richard Zach on the 3000 USD for Open Choice, an attack by Stevan Harnad on why Jan Velterop opposes Green self-archiving, a page from Sherpa-Romeo which gives the conditions:
http://www.sherpa.ac.uk/romeoupdate.php?id=74
and is the most helpful of all.
I immediately take away the fact that Springer is making no effort to help authors find the conditions for self-archiving. I have no idea where they are. I’d hate to do anything that violated the conditions.
So, to follow up Dorothea’s post. A LOT of useful human effort is wasted because the publishers make it so difficult to find out how to self-archive. I’d like a fair bargain. If the publisher has agreed that you can self-archive, tell us how. Or we start to see publishers as a difficulty to be overcome.
So a suggestion. Suppose each institutional repositarian spent 1 day a year posting how to self-archive articles from the Journal of Irreproducible Results. (Don’t be fooled – it takes a day to get a clear answer from most publishers or journals). And each one took a different journal. And posted it on a communal Wiki. Then we would have a clear up-to-date indication of what was allowed and what wasn’t. Including things like “I asked if I could retain copyright and they said yes”. Really vital info.
It’s not a lot of work per person. It would pay back within a year. Someone has to set up the Wiki. And keep it free of spam. But that’s not enormous. But sorry – I’m not volunteering. I’m in a discipline where there is very little chance of self-archiving legally. I’ve spent enough time trying.
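A minimal sketch of the kind of record such a Wiki might hold, one entry per journal (the field names and lookup helper are hypothetical; the Springer conditions echo the PPT quoted above):

```python
# One communal Wiki entry per journal; field names are hypothetical.
self_archiving_policies = {
    "J. Molecular Modeling (Springer)": {
        "preprint": "own version on personal website or institutional repository",
        "postprint": "allowed, with a link to the publisher's version",
        "publisher_pdf": "not allowed",
        "source": "Springer Open Choice PPT (2004); SHERPA-RoMEO id 74",
        "last_checked": "2007-11",
    },
}

def can_deposit(journal, version="postprint"):
    """Look up the recorded condition for depositing a given version of an article."""
    policy = self_archiving_policies.get(journal)
    if policy is None:
        return "unknown - budget a day to find out"
    return policy.get(version, "unknown")

print(can_deposit("J. Molecular Modeling (Springer)"))
```

Once a journal’s entry exists, Dorothea’s ten-minute slog really does become a two-second lookup.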

Posted in open issues, repositories | 3 Comments

Can I reposit my article?

Having re-explored the access to articles in the Journal of Molecular Modeling I thought I would see if I am allowed to reposit my article in the Cambridge DSpace. So while the sun is shining here’s a small pictorial journey…
I can read my article without paying (I’m not at work and have no special access AFAIK). So, I assume, can everyone else:
springer3a.PNG
I click on Permissions & Reprints…
springer3.PNG
I assume I have got the right options here. I thought I did quite well to find the “Institutional Repository” option. I have no idea what “Prepress article” means but since it’s the only option I don’t need to think. So how much if anything do I have to pay…
springer4.PNG
… and thank you, Rightslink, for making it very clear that I cannot put my paper in an IR. As an exercise, see how long it takes you to find the relevant section in the SSBM “clear guidelines”. First you have to find it. Then find “repository”. The best I could come up with after 5 minutes was:

“Details of Use:

Details of use define where or how you intend to reuse the content. Details of use vary by the type of use.

Some of the details of use include: Advertising, Banner, Brochure or flyer, Catalogue, CME, Web site, Repository, Slides/Slide Set, Staff training, or Workshop.

Some details of use are geographic: National (in the country associated with your account set up), Internal (within your organization), or Global (worldwide).”

Well, yes. But it doesn’t answer my question about why and when and what and how I can put my paper in my Institutional Repository.
But since the answer is actually NO to everything, shouldn’t I just accept that?

Posted in open issues, repositories | 2 Comments

Green Gold Hybrid

Peter Suber reports Jan Velterop’s comments on the Green version of Open Access: More on JAM about the NIH policy (Jan Velterop, JAM tomorrow) The Parachute, November 9, 2007:

JV: Applied to OA, ‘green’ and ‘gold’ are qualifiers of a different order. ‘Gold’ is straightforward: you pay for the service of being published in a peer-reviewed journal and your article is unambiguously Open Access. ‘Green’, however, is little more than an indulgence allowed by the publisher. This, for most publishers at least, is fine, as long as it doesn’t undermine their capability to make money with the work they do. But a ‘green’ policy is reversible.

PS: Comments

  • On the other hand, Jan may be right that there is “something…in the distinction between ‘green’ and ‘gold’ that wrong-foots otherwise intelligent people”. I’ve tried for years to understand why otherwise intelligent people so frequently get it wrong. Last month I put it this way (p. 46): “The fact is that green OA has always had to fight for recognition. Its novelty makes it invisible. People understand OA journals, more or less, because they understand journals. But there’s no obvious counterpart to OA archiving in the traditional landscape of scholarly communication. It’s as if people can only understand new things that they can assimilate to old things.”

PMR: My point here is not to re-open the Green-Gold debate but to declare that I find the whole thing very complex and am frequently wrong-footed. For me the world needs to be simple. Gold is simple to understand. Closed access and legal threats are simple to understand. The rest is messy and complicated. I do not wish to re-open the discussions we have had here about whether Green OA makes permission barriers irrelevant – my personal view is that Green makes permission more complicated; Stevan Harnad thinks it makes it simple; we agree to differ.
To me OA ==> CC-BY (freedom to do almost anything with the work). And if the whole world took that view it would be simple. There must be literally 500 different positions taken up by various publishers – free redistribution except for photocopies – no commercial use except for textbooks – re-use by academics but not the pharma industry (see OUP for this amazing restriction). It’s impossible.
So about 6 months ago I tried to understand the Open Access (hybrid) policy of a journal I was an editor on (J. Molecular Modeling, Springer). Authors paid 3000 USD for an article labelled “Open Choice” but still copyrighted by the Journal and still with restrictive permissions. I was upset and said so, sufficiently that I resigned. In retrospect Springer and Jan Velterop bore the brunt of this because theirs was the first hybrid Open Access publisher I encountered. The others are just as bad – if not worse. But all of them are making hybrid Open Access so (unnecessarily) complicated that I suspect no-one in the world understands all the details. Anyway Jan promised some changes, so I have revisited the site. Before that, here’s HHMI on Springer’s Open Choice:

The Howard Hughes Medical Institute (HHMI) has expressed support for Springer’s Open Choice program whereby articles are — if accepted for publication after a process of rigorous peer-review — immediately published with full open access and deposited in repositories such as PubMed Central, at a flat-rate fee per article of $3,000. Springer’s Open Choice programme applies to all its journals.
… and …
With Springer Open Choice the authors decide how their articles are published in Springer journals. As with all other articles, Springer Open Choice articles are peer-reviewed, professionally produced, and available in both print and electronic versions on SpringerLink. In addition, every article will be registered in CrossRef and included in the appropriate Abstracting and Indexing services. In Springer Open Choice, authors are not required to transfer their copyright to Springer; instead, these articles are published under a Creative Commons License.

PMR: So I go back to the current issue of Journal of Molecular Modeling:
jmolmod1.png
Jimmy Stewart has an Open Choice article. But the copyright is still Springer’s despite what HHMI thinks. Abbasoglu does not have Open Choice. BUT it’s Free Access. You can read it on Springer’s site without a subscription. ALL the papers are Free Access. At least for the last 4 years (I haven’t looked further). Whatever is going on? I have no idea. The Springer system knows that Jimmy’s article is Open Choice because the permissions robot says it’s free, whereas it says I would have to pay 150+ USD for 100 copies of Abbasoglu’s article if I wished to use it for distance learning. So I checked in Pubmed. Yes, Jimmy’s article is there – in full. And Abbasoglu’s has only the abstract. So this seems to follow logically.
Jimmy has paid 3000 USD for his article (this is a lot of money and he was not motivated to do it last time we spoke). Abbasoglu has paid nothing. The world can read both of their articles on Springer’s site. The world cannot read Abbasoglu’s in Pubmed nor can they use it for distance learning without paying. But is that worth 3000 USD??
The point of this is that the whole situation is so complex that it is beyond anyone’s comprehension. The publishers aren’t trying to make hybrid work and I suspect many are trying to make it die. The varieties of OA (that are not CC-BY) are so enormously complex that again no-one can remember the whole lot, and the publishers make it impossible to find out easily (or, more generously, fail to make it easy). There is a HUGE amount of wasted human effort in managing this charade – subscription sales, copyright violation police, librarian paralysis, wasted author time, etc. None of this helps science.
In short, OA with CC-BY is simple. You can learn it in five minutes. If we all did it then we could use the rest of our time to do something useful like discovering new biologically active compounds. And, since it is a near zero-sum game, everyone would still have a living if they performed a useful role.
I’m now off to hack some Java. It’s relaxing…

Posted in open issues | 2 Comments

Using our own repository

Elin Stangeland from our Institutional Repository will be talking to us (my Unilever Centre colleagues) tomorrow on how to use it. Jim and I have seen her draft talk, but I’ll keep it a surprise till afterwards.
I still think there is a barrier to using IRs and I’ll explain why.
We spent some of our group meeting on Friday discussing what papers we were writing and how. As part of that we looked at how to deposit them in the IR. It’s not easy in chemistry as most publishers don’t allow simple deposition of the “publishers’ PDF”. So here’s the sort of problems we face and how to tackle them.
Firstly every publisher has different rules. It’s appalling. I don’t actually know for certain what ACS, RSC, Springer, Wiley allow me to do. Elin has a list which suggests that I might be able to archive some of my older ACS papers, etc. This is an area where I’m meant to know things, and I don’t. (I’ve just been looking through the Springer hybrid system and I do not understand it. I literally do not know why all the articles are publicly visible, but some are Open Choice, yet Springer copyright. I would have no idea which of these can be put in an IR. Or when. Or what the re-use conditions are. I may write more about this later.)
Here are some basic problems about repositing:

  • the process from starting a manuscript to final publication can take months or years
  • there are likely to be multiple authors
  • authors will appear and disappear during the process
  • manuscripts may fission or fuse.
  • authors may come from different institutions

A typical example is the manuscript we are writing on the FOO project. The project has finished. The paper has 6 authors. I do not know where one of them is. There are 2 institutions and 4 departments involved. One person has been entrusted with the management of authoring. They are unlikely to be physically here when the final paper is published. The intended publisher does not support Open Access and may or may not allow self-archiving.
We have to consider at least the following versions of the article:

  1. The manuscript submitted to the publisher (normally DOC or TeX). Note that this may not be a single version as the publisher may (a) refuse it as out of scope (b) require reformatting, etc. even before review. Moreover if after a refusal the material is submitted to a subsequent journal we must remember which manuscript is which.
  2. The publisher sends the article for review and returns reviewers’ comments. We incorporate these into a post-review manuscript. This process may be iterative – the journal may send the revision for further review. Eventually we get a manuscript that the journal accepts.
  3. We get a “galley proof” of the article which we need to correct. This may be substantially different from (2). Some of the alterations are useful, some are counterproductive (one publisher insists on setting computer code in Times Roman). There are no page numbers. We make corrections and send this back.
  4. At some stage the paper appears. We are not automatically notified of when – some publishers tell us, some don’t. We may not even be able to read it – this has happened.

By this stage the original person managing the authoring has left us, and so has one of the co-authors. Maybe at this stage we are allowed to reposit something. Possibly (1). The original manuscript. But the author has left – where did they keep the document? It’s lost.
This is not an uncommon scenario – I think at DCC 2005 we were informed that 30+% of authors couldn’t locate their manuscripts. Yes, I am disorganized, but so are a lot of others. It’s a complex process and I need help. There are two sorts – human amanuenses and robot amanuenses. I love the former. Elin has suggested how she can help me with some of my back papers. Dorothea Salo wants to have a big bucket that everyone dumps their papers in and then she sorts it out afterwards (if I have got this right). But they don’t scale. So how can robots help?
Well, we are starting to write our papers using our own repository. Not an IR, but an SVN repository. So Nick, Joe and I will share versions of our manuscripts in the WWMM SVN repository. Joe wrote his thesis with SVN/TeX and I think Nick’s doing the same. Joe thought it was a great way to do things.
The advantage of SVN is that you have a complete version history. The disadvantage is only that it’s not easy to run between institutions. I am not a supporter of certificates. And remember that not all our authors are part of the higher education system.  In fact Google documents starts to look attractive (though the versioning is not as nice as SVN.)
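For the record, the workflow is roughly the following sketch (the repository URL is hypothetical; the svn commands themselves are standard):

```python
import subprocess

REPO = "https://wwmm.example.org/svn/manuscripts/foo-paper"  # hypothetical URL

def svn(*args):
    """Run an svn command; each commit preserves one step of the manuscript's history."""
    subprocess.run(["svn", *args], check=True)

svn("checkout", REPO, "foo-paper")    # every author gets a working copy
svn("add", "foo-paper/section2.tex")  # new material goes under version control
svn("commit", "foo-paper", "-m", "incorporate reviewer 2 comments")
svn("update", "foo-paper")            # pick up co-authors' changes
svn("log", "foo-paper")               # the full history survives when an author leaves
```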
Will it work? I don’t know. Probably not 100% – we often get bitten by access permissions, forgetting where things are, etc. But it’s worth a try.
And if I were funding repositories I would certainly put resource into communal authoring environments. If you do that, then it really is a one-click reposition instead of the half-day mess of trying to find the lost documents.

Posted in repositories | 1 Comment

Open NMR: update

I am very grateful to hko and Wolfgang Robien for their continued analysis of the results of Nick Day’s automated calculation of NMR chemical shifts, using the GIAO approach (parameterized by Henry Rzepa). The discussion has shown that some structures are “wrong” and rather more are misassigned.
Wolfgang Robien Says:
November 11th, 2007 at 10:01 am

we need ‘CORRECT’ data – many assignments of the early 70’s are absolutely correct and useful for comparison […]
As a consequence of your QM-calculations 10 assignment corrections and 1 structure revision within a few hundred compounds have been performed by ‘hko’ (see postings above) – this corresponds to an error rate of approx. 5%! [PMR: In the data set we extracted from NMRShiftDB]. [… discussion of how such errors are detected snipped…]

PMR: Part of the exercise that Nick Day has undertaken was to give an objective analysis of the errors in the GIAO method. The intention was to select a data set objectively. It is extremely difficult to select a representative data set by any means – every collection is made with some purpose in mind. We assumed that NMRShiftDB was “roughly representative” of 13C NMR (and so far this hasn’t been an issue). It could be argued that it may not have many organometallics, minerals, proteins, etc. and I suspect that our discourse is mainly about “small organic molecules”. But I don’t know. It may certainly not be representative of the scope of GIAO or HOSE codes. Again I don’t know.
Having made the choice of data set, the algorithm for selecting the test data was objective and Nick has stated it (< 20 heavy atoms, <= Cl except Br, no adjacent acyclic bonds). There may have been odd errors in implementing this (we got 2-3 compounds with adjacent acyclic bonds) but it was largely correct. And it could be re-run to remove these. We stress again that we did not know how many structures we would get and whether they would behave well in the GIAO method. In fact over 25% failed to complete the calculation. (We are continuing to find this – the atom count is not a perfect indication of how long a calculation will take, which can vary by nearly a factor of 10.)
We would not claim that the remaining ca. 250 compounds were “representative”. There are no organometallics, no electron-deficient compounds, no overcrowded compounds, no major ring currents, etc. (all of which are areas where we might expect GIAO to do better than some empirical methods). In fact the compounds are generally ones that we would expect connection-table-based methods to score well on, as there are few unusual groups (so well trained) and no examples where the connection table cannot describe the molecule well (e.g. Li4Me4, Fe(Cp)2, etc.).
Our current conclusion is that the variance in the experimental data is sufficiently large (even after removal of misassignments) to hide errors in the GIAO method. This appears to give good agreement, with an RMS of ca. 2 ppm (but again we stress that the data set is not necessarily representative; a sketch of the RMS comparison appears after the list below). If the Br/Cl correction had not been anticipated it would have been clearly visible and the exercise would have revealed it as a new effect. It is certainly possible that there are other undetected effects (especially for unusual chemistry). But, for common compounds, I think we can claim that the GIAO method is a useful prediction tool. It should be particularly useful where connection tables break down, and here are some systems I’d like to see it exposed to:

  • Li4Me4
  • Fe(Cp)2 – although Fe is difficult to calculate well.
  • p-cyclophane (C1c(cc2)ccc2CCc(cc3)ccc3C1)
  • 18-annulene
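
Here is the sketch of the RMS comparison promised above (Python, with invented illustrative numbers – NOT our data):

```python
import math

# Illustrative 13C shifts in ppm - invented, not taken from the NMRShiftDB subset.
calc = [128.4, 77.1, 30.2, 171.9]  # GIAO-calculated
expt = [128.9, 76.5, 29.8, 170.8]  # experimentally assigned

rms = math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, expt)) / len(calc))
print(f"RMS deviation: {rms:.2f} ppm")  # the real data set gives ca. 2 ppm

# A misassignment (e.g. two swapped atoms) produces outlying per-atom residuals,
# which is how the corrections reported by hko show up.
residuals = [abs(c - e) for c, e in zip(calc, expt)]
outliers = [i for i, r in enumerate(residuals) if r > 3 * rms]
```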
PMR: So what I would like is a representative test data set that could be used for the GIAO method. The necessary criteria are:

  • It is agreed what the chemical scope is. I think we would all exclude minerals, probably all solid state, proteins, macromolecules (there are other communities which do that). But I think we should include a wide chemical range if possible.
  • The data set is prepared by one or more NMR-expert groups that have no particular interest in promoting one method over another. That rules out Henry, Wolfgang, ACDLabs, and probably NMRShiftDB.
  • The data set should provide experimental chemical shifts, and the experts should have agreed the assignments by whatever methods are currently appropriate – these could include a group opinion. The assignments should NOT have been based on any of the potential competitive methodologies.

For a competition there would be stronger requirements – it is essential that it is seen to be fair, as reputation and commercial success might hang on the result.
So I make my request again. Please can anyone give me some data that we can use in an Open experiment to test (and if necessary validate/invalidate) the GIAO method? At this stage we’d be happy to take material from anyone’s collections, but it would have to be Open so that other groups have the chance to comment.
I hope someone can volunteer. If not we may have to resort to (machine) extraction of data from the current literature. Our experience with crystallography suggests that the reporting and quality of analytical data in general have improved over the last 10 years.
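To make the request concrete, here is a sketch of what one entry in such an Open test set might carry (field names hypothetical, values placeholders):

```python
test_entry = {
    "structure": "C1c(cc2)ccc2CCc(cc3)ccc3C1",  # p-cyclophane SMILES from the list above
    "solvent": "CDCl3",                          # placeholder
    "assigned_shifts_ppm": {"C1": 35.2},         # placeholder value; expert-agreed in reality
    "assignment_method": "2D experiments plus group opinion",
    "provenance": "independent NMR-expert group",
    "licence": "Open Data",
}
```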

Posted in data, nmr | Leave a comment

Open science: competitions increase the quality of scientific prediction

In previous posts and comments we have been discussing the value of certain predictive methods for NMR chemical shifts. In the next post I am going to make a proposal for an objective process which I hope will help take us forward. Chemistry (chemoinformatics) is often not good at providing objective reports of predictive quality – the data, algorithms, statistics and analysis are often not formally redistributable and so cannot be easily checked.
In preparation for the suggestion, here are some examples of how competitions enhance the quality of prediction:

CASP

Every 2 years (CASP1 (1994) | CASP2 (1996) | CASP3 (1998) | CASP4 (2000) | CASP5 (2002) | CASP6 (2004) | CASP7 (2006)) the Protein Structure Prediction Centre runs a competition:
“Our goal is to help advance the methods of identifying protein structure from sequence. The Center has been organized to provide the means of objective testing of these methods via the process of blind prediction. In addition to support of the CASP meetings our goal is to promote an evaluation of prediction methods on a continuing basis.”
There are independent CASP assessors who give their time on an impartial basis to oversee the procedure and judge the results of the predictions. Some more details:

“For the experiment to succeed, it is essential that we obtain the help of the experimental community. As in previous CASPs, we will invite protein crystallographers and NMR spectroscopists to provide details of structures they expect to have made public before September 1, 2006. A target submission form will be available at this web site in mid-April. Prediction targets will be made available through this web site. All targets will be assigned an expiry date, and predictions must be received and accepted before that expiration date.
As in previous CASPs, independent assessors will evaluate the predictions. Assessors will be provided with the results of numerical evaluation of the predictions, and will judge the results primarily on that basis. They will be asked to focus particularly on the effectiveness of different methods. Numerical evaluation criteria will as far as possible be similar to those used in previous CASPs, although the assessors may be permitted to introduce some additional ones.”

There are four assessors, representing expertise in template-based modeling, template-free modeling, high accuracy modeling and function prediction. In accordance with CASP policy, assessors are not directly involved in the organization of the experiment, nor can they take part in the experiment as predictors. Predictors must not contact assessors directly with queries, but rather these should be sent to the casp@predictioncenter.org email address.

and they follow up with a meeting.

Text Retrieval Conference

The TREC conference series has produced a series of test collections. Each of these collections consists of a set of documents, a set of topics (questions), and a corresponding set of relevance judgments (right answers). Different parts of the collections are available from different places as described on the data page (http://trec.nist.gov/data.html). In brief, the topics and relevance judgements are available there, and the documents are available from either the LDC (Tipster Disks 1–3) or NIST (TREC Disks 4–5); information on collections other than English can also be found on the data page.

A Third Blind Test of Crystal Structure Prediction

In May 2004 the CCDC hosted a meeting to discuss the results of the third blind test of Crystal Structure Prediction (CSP). The challenge of the competition was to predict the experimentally observed crystal structure of the 4 small organic molecules shown in figure 1, given information only on the molecular diagram, the crystallisation conditions and the fact that Z’ would be no greater than 2. The results of the competition are presented, including an analysis of each participant’s extended list of candidate structures. A computer program COMPACK has been developed to identify crystal structure similarity. This program is used to identify at what positions the observed structures appear in the extended lists. Also, predicted structures obtained from the various participants are compared to determine whether the different approaches and methodologies attempted produce similar lists of structures. The hydrogen bond motifs predicted for molecule I are also analysed and an assessment made as to the most commonly predicted motifs, and a comparison made to common motifs observed for similar molecules found in the Cambridge Structural Database.

PMR: These have a range of objective (measured) and subjective (expert opinion) criteria for the “right” answer. The key components are:

  • the mechanism and evaluation must be independent of the competitors
  • all competitors must have an equal chance
  • the answers must be carefully created and hidden before the prediction
  • there is a closing date

It is essential that the data are Open and seen to be a reasonable challenge, and that the analysis process is transparent. It is not essential that competitors’ software is Open.
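A sketch of the mechanics these components imply (the date and answer are invented for illustration): the answers are fixed before the closing date and published only as a hash, so everyone can verify afterwards that they were never changed.

```python
import datetime
import hashlib

CLOSING_DATE = datetime.date(2008, 6, 1)  # hypothetical deadline

answer = "128.9,76.5,29.8"  # the hidden right answer (placeholder)
commitment = hashlib.sha256(answer.encode()).hexdigest()  # published up front

def accept(prediction, submitted_on):
    """Every competitor faces the same closing date; late entries are rejected."""
    if submitted_on > CLOSING_DATE:
        raise ValueError("prediction received after the closing date")
    return prediction

# After the deadline the answer itself is released; anyone can check it matches:
assert hashlib.sha256(answer.encode()).hexdigest() == commitment
```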

Posted in data | Leave a comment

Open Data for common molecules?

Yesterday I needed the measured (i.e. not predicted) mass density for 2-bromo-propanoyl-bromide (CH3-CH(Br)C(=O)Br). This is a moderately common reagent and so I went to look for it on the Web – ultimately finding it on several sites. The value is ca. 2.061 g.cm-3 (many sites omit the units – argh!!). The temperature should also be reported – but isn’t. I need the measured density because many chemical recipes give the volume of reagents and I want to work out the molar ratios in the reactions. I may also be interested in other measured properties such as boiling point.
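The arithmetic is trivial once the density is known – which is exactly the point (a worked sketch; the recipe volume is invented):

```python
volume_ml = 5.0      # the kind of figure a recipe gives (invented)
density = 2.061      # g/cm3, measured - not predicted
molar_mass = 215.88  # g/mol for CH3-CH(Br)C(=O)Br

moles = volume_ml * density / molar_mass
print(f"{moles:.4f} mol")  # ca. 0.0477 mol; now molar ratios can be compared
```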
The problem is that it’s difficult to scrape these sites. They give little indication of copyright, are arcanely structured and often have poor semantics (e.g. units). The best known is the NIST Webbook, part of which reads:

  • Thermophysical property data for 74 fluids:
    • Density, specific volume
    • Heat capacity at constant pressure (Cp)
    • Heat capacity at constant volume (Cv)
    • Enthalpy
    • Internal energy
    • Entropy
    • Viscosity
    • Thermal conductivity
    • Joule-Thomson coefficient
    • Surface tension (saturation curve only)
    • Sound speed

You can search for data on specific compounds in the Chemistry WebBook based on name, chemical formula, CAS registry number, molecular weight, chemical structure, or selected ion energetics and spectral properties.


NIST reserves the right to charge for access to this database in the future. The National Institute of Standards and Technology (NIST) uses its best efforts to deliver a high quality copy of the Database and to verify that the data contained therein have been selected on the basis of sound scientific judgment. However, NIST makes no warranties to that effect, and NIST shall not be liable for any damage that may result from errors or omissions in the Database.


© 1991, 1994, 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2005 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved.

It’s clear that this is not an Open site – works of the US Government are generally required to be freely available, but NIST has an exemption for its databases so that it can raise money.
Many suppliers list property information, but scattered throughout somewhat uncoordinated pages. Moreover the copyright and crawling position is often not clear.
My requirement is likely to be via robot – i.e. an asynchronous request for a property I don’t have, with the ability to re-use it without explicit permission. I am therefore wondering whether there are Open sites for chemical data that can be accessed without explicit permission. I am not interested in collections of millions of compounds, but rather ca. 10,000 of the most commonly used.
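By “via robot” I mean something like the following sketch (the site URL is hypothetical, and only an explicit Open licence would make the re-use step legitimate):

```python
import urllib.robotparser
import urllib.request

# Check robots.txt first, then fetch - a polite asynchronous request.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://chemdata.example.org/robots.txt")  # hypothetical Open site
rp.read()

url = "https://chemdata.example.org/compound/2-bromo-propanoyl-bromide"
if rp.can_fetch("property-bot", url):
    with urllib.request.urlopen(url) as response:
        page = response.read()
    # ...then the hard part: extracting values and units from arbitrary HTML
```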
A good source of data is MSDS (Material Safety Data Sheets), and here is part of a typical one hosted by a group at Oxford University:

General

  Synonyms: nitrilo-2,2′,2″-triethanol, tris(2-hydroxyethyl)amine, 2,2′,2″-trihydroxy-triethylamine, trolamine, TEA, tri(hydroxyethyl)amine, 2,2′,2″-nitrilotrisethanol, alkanolamine 244, daltogen, sterolamide, various further trade names
  Molecular formula: C6H15NO3
  CAS No: 102-71-6
  EC No: 203-049-8

Physical data

  Appearance: viscous colourless or light yellow liquid or white solid
  Melting point: 18 - 21 C
  Boiling point: 190 - 193 C at 5 mm Hg, ca. 335 C at 760 mm Hg (decomposes)
  Vapour density: 5.14
  Vapour pressure: 0.01 mm Hg at 20 C
  Specific gravity: 1.124
  Flash point: 185 C
  Explosion limits: 1.3 % - 8.5 %
  Autoignition temperature: 315 C

Stability

Stable. Incompatible with oxidizing agents and acids. Light and air sensitive.
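The “Physical data” block is nearly key-value already; a sketch of scraping it into something semantic (the units still have to be guessed, which is exactly the weakness complained about above):

```python
import re

physical_data = """\
Melting point: 18 - 21 C
Boiling point: 190 - 193 C at 5 mm Hg, ca. 335 C at 760 mm Hg (decomposes)
Specific gravity: 1.124
Flash point: 185 C"""

record = {}
for line in physical_data.splitlines():
    key, _, value = line.partition(":")
    record[key.strip()] = value.strip()

# Crude unit recovery for the simple cases only
match = re.match(r"([\d.]+)\s*C\b", record["Flash point"])
print(record["Specific gravity"], match.group(1))  # 1.124 185
```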

It looks as if there are in the range of 5,000 to 100,000 compounds on the site – I haven’t counted – and if so this is close to what I am looking for. It looks as if the creators are happy for people to download it – their concern is that it shouldn’t be seen as authoritative about safety (a perfectly reasonable position). If so, an Open Data sticker would be extremely useful and would solve the problem. (There is the minor problem that there are no connection tables, but links to Pubchem should solve that.)
There has been talk of a Wikichemicals – and this is the sort of form it might take. It shouldn’t be too difficult to create, and the factual data on the pages doesn’t belong to anyone. So I’d like to know whether anyone has been doing this (measured, not predicted data) and whether their resource is Open.

Posted in data, open issues | Leave a comment

Using Connotea as a community annotator for CrystalEye

Quite by chance I met up in the bar yesterday evening with Ian Mulvany (see Nature Network entry) from Nature Publishing Group. Our group had been talking about how we could annotate structures in CrystalEye, the crystallographic knowledgebase that Nick Day has built. The “natural” way to do it would be to build a wiki, set up a registration system, clean the spam daily, etc. A lot of work. And unless people already knew about CrystalEye they wouldn’t use it.
That’s not Web 2.0. The Web 2.0 way is to relax and see who can do the work instead of you. The obvious answer was Connotea: free online reference management for clinicians and scientists. Connotea is one of several exciting new ideas (Urchin, Nature Network, etc.) to have come out of NPG, and particularly the New Technology Group (if that’s what it’s still called).
About 5 years ago I met up with Timo Hannay from NPG, who has been the driving force behind much of this. Timo sponsored a summer student (Vanessa de Sousa) who built the “Nessie” system for annotating chemistry in published papers – a logical antecedent of current adventures in semantic chemical publishing. (Nessie as a tool has evolved into OSCAR, but the work needs to be remembered.) As a result of that I met colleagues from the NTG including Ben Lund and Tony Hammond, and now I’ve met Ian.
Back to Connotea. A simple idea – Nature provides a site which allows anyone to register and tag publications they are interested in. Any publication – not just Nature’s. If all papers were tagged then it could be the first place to look for blogosphere comment. So maybe we could tag the papers from which CrystalEye draws its structures.
It’s easy to do it by hand (you have to register first). Here’s an example. Let’s say I’m browsing CrystalEye for the latest articles from ChemComm (a rapid publication journal of high-interest chemistry from RSC). I find

Covalent Palladium-Zinc Bonds and Their Reactivity

Is this an interesting article? I don’t know. Maybe someone has annotated it in Connotea.
[… I then spent 20 minutes playing with Connotea and I’ll show you how I tagged it and how I can use it in a future post …]
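Under the hood the idea is just a shared record of (URI, user, tags); a sketch of the data model (names illustrative – this is not Connotea’s actual API):

```python
# One bookmark: any registered user can attach free-form tags to any URI.
bookmark = {
    "uri": "http://dx.doi.org/...",  # the ChemComm article's DOI (elided here)
    "user": "petermr",
    "tags": ["crystaleye", "chemcomm", "Pd-Zn"],
}

def comments_for(uri, bookmarks):
    """What CrystalEye could ask Connotea: who has tagged this article, and how?"""
    return [b for b in bookmarks if b["uri"] == uri]

print(comments_for(bookmark["uri"], [bookmark]))
```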

Posted in "virtual communities", Uncategorized | Leave a comment

I have to eat Peter Sefton's dogfood

I have moaned publicly about how difficult I find it to author technical chunks of material in my blog (maths, computer code, chemistry). Yesterday I responded to Peter Sefton’s post about his editor ICE by saying it was a Good Thing but also suggesting that there wasn’t much I could do to promote it. Now he has truly caught me. In his post: ICE as a blog editor he suggests that I can use ICE as an auxiliary editing tool:

Also yesterday I wrote about how we are breaking ICE up into more digestible pieces, one of which is the ability to post to a weblog using Atompub. Daniel de Byl has just posted a demo using OpenOffice.org Writer to publish a nicely formatted blog post to WordPress.
And today a supportive reply from PMR to my post with a poem in it! Things do indeed take time; I’ve been at this since 1996. I think we’re getting there now, though.
I thought I’d try out the new ICE services using one of Peter’s posts and see what happens. I think that the ICE toolbar in Writer could help transcend the formatting problems with WordPress and we could look at doing interesting stuff like CML integration.
Here’s his post (embedded in mine as a blockquote):
[snipped]

PMR: It certainly looks OK (well, mine wasn’t properly formatted, and it has captured that exactly).

Easy enough to do in ICE apart from the slightly clunky way quoting works. We really need the ability to import HTML properly formatted as a blockquote. This would be very important for PMR, as he likes to quote big chunks.

PMR: Gulp. Everyone tells me I quote too much. They must be right.

[snip] You can see a draft version of this here post on my test blog.
If you want to try this out and blog to WordPress from OpenOffice.org then see the instructions that Daniel has put up on the ICE site, and remember this is bleeding-edge alpha-quality Windows-only software at this stage. Remember also to actually read the instructions. The URL to use for WordPress is really important, for example.

PMR: Bleeding edge. My favourite sort of system. But since I can’t do XML at all it may liberate me to write some CML.
Even if I didn’t want to, I am going to have to start eating one’s own dog food – or at least Peter’s. I am sure it is Good For Me. Let’s hope it also tastes nice!
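For the curious, what ICE’s Atompub publishing amounts to on the wire (RFC 5023) is roughly this sketch – the blog URL and endpoint path are placeholders, so check Daniel’s instructions for the real ones:

```python
import urllib.request

# An Atom entry, POSTed to the blog's collection URI per the Atom Publishing Protocol.
entry = """<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Posted from ICE</title>
  <content type="html">&lt;p&gt;A nicely formatted body.&lt;/p&gt;</content>
</entry>"""

req = urllib.request.Request(
    "https://blog.example.org/wp-app.php/posts",  # placeholder endpoint
    data=entry.encode(),
    headers={"Content-Type": "application/atom+xml;type=entry"},
    method="POST",
)
# urllib.request.urlopen(req) would also need authentication credentials
```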

Posted in XML | Leave a comment