OSCAR's wishList for Nature

Neil Withers of Nature Chemistry has asked the community what they would like to see. Before answering, let me applaud Nature on asking, and I see that the blogosphere has already started. Although I campaign for Open Access I’m more interested in Open Data, so I’m going to add those thoughts at the end. I write from the point of view of OSCAR the chemistry text-mining journal-eating robot…


Hi everyone,
This week the Nature Chemistry team have been thinking about how we display our wonderful papers (when we finally open the doors and eventually publish a paper, anyway).
We’d really like to see what everyone else thinks about some of the things we discussed after looking at what other journals have to offer.
So, the things we’re interested in:
(1) HTML vs PDF: does anyone read the HTML articles? Do you read the PDF on-screen or print it out?

OSCAR: I hate PDF. It tastes awful – like eating paper. I like HTML. It tastes of ASCII

(2) Big vs little graphics: what does everyone else think about the tiny size of the graphics in ACS html articles?

OSCAR: I don’t like graphics – they hurt my brain. I’d much rather have nice crunchy CML behind them

(3) Tagging/’semantic web’: what do you think about the toys on the RSC’s Project Prospect? What kind of things would you like to see tagged/linked to other content in Nature Chemistry? For instance, Steve would love to do something with named reactions.

OSCAR: I helped the RSC with Project Prospect so I feel quite proud. It’s got a way to go, though, before it’s a full meal.
If you are going to do reactions, please use CMLReact. It’s scrumptious.

(4) 3D molecular structures: do these help your understanding of a paper?

OSCAR: Oh yes. I can eat hundreds before breakfast

(5) How useful to you are InChIs and SMILES?

OSCAR: They’re both useful. I don’t like InChIKey – it’s like chewing beanstalks. InChI’s OK. SMILES is fine but it tastes too organic for my liking.

(6) Forward linking: the RSC and Elsevier/Science Direct offer this – do you use it? Would you use an RSS feed that alerted you to new citations of a particular paper?

OSCAR: If I can read a paper I can store it and I can do my own linking. But I’m not allowed to. Everyone keeps saying I should wait till I grow up before they’ll let me have real papers. I get some scraps called “Fare use”, but the fare is pretty awful.

(7) Would you actually comment on papers if there was a comments box at the end?

OSCAR: Absolutely. I can comment on hundreds of papers a day. Or thousands. Unfortunately I’m trained to find the bad bits in papers and spit them out. Like missing decimal points or silly formulae. Not everyone likes being told their food is bad.

(8) We really like the Biochemical Society’s HTML article style (sample one here) – do you?

OSCAR: ARGGGGGGGGGGH! Fancy trendy eye candy. Looks nice, tastes horrible. I can’t even read a page without falling over. Just plain XHTML, with RSS sauce and CML, please. It tastes good and it does you good.

If we could get a deluge of posts about this one, we’d be overjoyed! And this is your chance to voice your opinion on what a Nature Chemistry paper should look like.

OSCAR: Oh, and I’d like a free lunch of OPEN DATA, please. Structures are good, but how about some spectra as well? Mind you, if I eat too many spectra I get a hangover.

Neil Withers (Associate Editor, Nature Chemistry)


Noel reviews chemical depiction and SDG

A really useful post from Noel O’Blog about chemical depiction and structure diagram generation (SDG). The chemical structure of compounds in “2D diagrams” is often the most important way of communicating chemical information. There is a gradually growing realisation that diagrams need to be clear and use consistent conventions (I have been involved with IUPAC in this activity).
There are two aspects as Noel shows clearly:

  • if you know what atoms are connected, and you know where to put them on the page, then how do you draw the best/most useful diagram? Among the things you are allowed to alter are:
  1. the font-size and color
  2. the width of lines
  3. the color of bonds
  4. whether hydrogen atoms are shown or not
  5. how aromatic rings are drawn (double bonds or circles)
  6. where charges should be located
  7. how close bond lines should approach atoms
  8. what happens when lines cross
  9. how to depict stereochemistry
  10. where exactly to position double bonds (inside rings, inside and outside, mitred, etc.)
  • what you must not do is alter the position of atoms.

It must be clear what the compound is – correctness is more important than beauty. A major problem arises when atoms are very close – it can be difficult to distinguish them, and there are often spurious “rings”. There is no single correct answer, but it’s worth looking at some of Noel’s collection of molecules drawn by different programs. The molecules are randomly taken from PubChem (so probably don’t exercise the inorganic features). Here’s the post:


Now for some pretty pictures as well as some not so pretty. Yes, it’s the turn of the structure diagram generators (SDGs) to strut their stuff and throw some shapes. How do they perform for 100 random compounds from PubChem?
Here are my [NO’B] results for depiction and structure diagram generation […]
Some comments:
(0) Rich Apodaca has written an overview of Open Source SDGs.
(1) 2D coordinate generation is independent of depiction. A SDG typically has both parts but coordinates could be generated with one toolkit and depicted with another.
(2) Looking good is not the same as chemical accuracy. But looking good is important too! 🙂
[…]
(5) The PubChem images appear to be generated by an OpenEye product (for sure, the coordinates are). I don’t know what version.
[…]
(7) It is important to consider how to handle hydrogens. With OASA, I just drew all the hydrogens. This is probably not a good idea.
[…]
(10) PubChem entries with more than 1 connected component were not included in this test. (As a result, the number of molecules shown is actually less than 100.)
PMR: Make your own choice as to what looks nice, but some are dead wrong in the stereochemistry. Personally I deprecate any depiction of atom-centred stereochemistry where the bond is not wedge-shaped with the pointed end on the atom. (Thick lines are often ambiguous, and perspective diagrams easily lead to errors.) Here’s part of a typical line (7250053) – all the compounds are meant to be the same, but they cannot be. At least two are in error.

So it’s not impossible, but neither is it completely trivial, to depict structures. Structure Diagram Generation (where the coordinates are not given) is much harder, and there is often an impossible tension between accuracy, arbitrary convention, and aesthetics. Sometimes only a human can do it.


Chemspider in Nature

Nature has just published an account of Chemspider after interviewing a number of people. (The Nature reporter, Geoff Brumfiel, and I spent considerable time on the phone, but it was too late to include my comments in the article, so I’ll outline what I think I may have said at the bottom. I have tried to be objective and am reviewing CS as it is now. Many of my comments relate to chemical aggregators in general.) I include the whole of Antony’s post as, presumably, I will get into trouble if I copy very much from Nature.

22:32 07/05/2008, Antony Williams,
Last week I had a pleasant chat with a reporter from Nature magazine, a Mr Geoff Brumfiel. Geoff was interested in ChemSpider…what it was, how it ran, who used it, who supported it, who liked it, who curated it, who didn’t like it and so on.
The results of that discussion, and others he spoke to about ChemSpider, are here in his article.
Chemists spin a web of data p139
Chemspider website provides free information on millions of molecules.
Geoff Brumfiel
doi:10.1038/453139a
It is a rule at Nature, at least for this type of article, that I [AJW] could not see the article before it went to press and therefore I didn’t get the chance to proofread and comment. Geoff has accurately captured the spirit of our discussions but a few detailed clarifications are needed too. I have pasted in black the article content and in italics the clarification.
providing the community with an open-access source of chemical information
I giggled and commented: please don’t say it’s Open Access. Say it’s Free Access. Say there are Open Data. And now we have Creative Commons licenses. But don’t say it’s Open Access – not strong, not weak, not gold, not green. Just Free Access. No price barriers to usage.
Chemist Antony Williams is hoping to change this in a move likely to ruffle the feathers of the American Chemical Society.
I commented that we are not purposely in competition with anyone. It’s not what drives us to do this. Whether others see us as competitive is for them, not us. We don’t intentionally try to ruffle feathers. That doesn’t mean that what we are doing won’t ruffle feathers, of course, whether the ACS’s or others’. It’s not the goal…it might be an outcome.
The modest project has made chemists interested in open access take notice — last week, the number of daily users of the site surpassed 5,000.
We have crossed 5500 users for the past two nights. The trend is positive.
“Other potential sources of information, such as Wikipedia, lack the algorithms needed to search chemicals according to their structure. “
Structure searching is “feasible” of course with InChI strings. But substructure searching isn’t, and Wikipedia is treated as a text-based search by almost all of its users.
“The site is maintained with modest profits from advertising and the work of about 30 active volunteers who double-check the data pulled in from outside.
The original investment in hardware and software costs has finally been recouped. Modest profits? No one gets paid for the work we do. There is a phenomenal sweat-equity investment in the platform, numbering many thousands of hours to get here. We are indebted to the many software collaborators, providers of tools and the people curating and depositing to the system. There have BEEN about 30 active volunteers. Right now I would say the number of active depositors and curators is around 10. But it is growing. I hadn’t checked the number of REGISTERED users for a long time. We have over 1150 registered users…those who CAN log in and curate data, deposit data, see new features etc. People do NOT have to register to use the site…but >1150 did. Wow. I didn’t know it was that many until I just checked (BIG SMILE)
““There’s an awful lot of chemical information, but there’s an awful lot of rubbish as well,” says Barrie Walker, a retired industrial chemist in Yorkshire, UK, who helps maintain the site.”
Don’t know whether Barrie said this or not. He IS an honest guy and he is our QUALITY GURU and we are proud that he is willing to give us his fine eyes. There IS garbage on the site still. But, after a year online and active curating, it has been much reduced. About 200 edits a day are made to the site: names changed/deleted/added, spectra/structures/URLs/publications added etc. It’s quite the pace. We have cleaned up hundreds of thousands of incorrect associations from the external data sources. It’s been, and will remain, an enormous task with an enormous payback for the community.
Williams adds that the site still has problems with certain searches. For example, it struggles to distinguish between isomers: molecules with the same chemical formula arranged in different structures.
We can distinguish isomers no problem. The PROBLEM is that there is a mixture of isomeric species submitted from multiple data sources, and data are mixed and intermingled in a way that means the user cannot get to the correct structure. Search taxol or Ginkgolide on the ChemSpider blog and read the multiple blog posts about this. We can of course search all isomers for a particular chemical formula…
“But Williams nevertheless believes that the service may be able to compete with for-profit services. “What I’m doing is highly disruptive,” he says. “I think it can be done and it needs to be done.”
I think what WE are doing…it’s not me…it’s we…is disruptive. In a good way. Many chemists will benefit. Will it have an impact on for-profit services? Yes, maybe. As an outcome but not as the target. Our team of people, both internal to ChemSpider’s development and Advisory Group, and the people we don’t even know who are cleaning and depositing into the system for their colleagues in the community, are creating a powerful resource for chemists. The FOCUS of this effort is to Build a Structure Centric Community for Chemists. We will change that soon…the focus will shift from Structure-Centric to cover chemistry in general and to Build a Community for Chemists.
We are well on our way, and thanks to Nature, and Geoff in particular, for the exposure. My comments above are not meant to detract from Geoff’s reporting abilities, but it was a long discussion and some clarifying statements are of value, I believe.
PMR: Firstly I should say that I commented to Geoff before Chemspider’s announcement that it was adopting CC-SA licences. This is a major advance and has enhanced the importance of Chemspider. For non-chemists: the lack of data in chemistry is a desperate and desperately serious problem. Almost all publicly visible data is first published in peer-reviewed journals. (There are exceptions, where data is collected for hire, and I have no problem with people charging for that – it is the charging for data that belongs to the community that concerns me.) So, in challenging the status quo, Chemspider is pointing in the right direction.
It’s (now) based on Web 2.0 principles in that it uses social computing for some of its content and can and has reacted to external changes. It’s also perpetual beta. It’s not, however, based on semantic web technology such as RDF and XML and this may be a future limitation in managing some of the more complex content. Although I’m not party to the internal design I’d guess it has a relational database, most of whose primary keys are the identifiers for chemical compounds. These identifiers map onto canonicalised chemical structures (one serialization of which is the InChI) and this is the primary mechanism for indexing compounds. Chemistry is fortunate in that it is easier to index compounds automatically than, say, stellar objects, organisms, genomes, etc.
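As a minimal sketch of that guessed design – and it is only a guess, with an illustrative schema and made-up data, not ChemSpider’s actual internals – a relational table keyed by a canonical identifier such as the InChI makes exact lookup and deduplication fall out of the index itself:

```python
import sqlite3

# Minimal sketch of the guessed design: a relational table whose primary
# key is a canonical structure identifier (here an InChI string).
# Schema and data are illustrative, not ChemSpider's actual internals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (inchi TEXT PRIMARY KEY, name TEXT)")
conn.execute(
    "INSERT INTO compound VALUES (?, ?)",
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "ethanol"),
)

# Deduplication falls out of canonicalisation: a second deposition of the
# same structure violates the primary-key constraint.
try:
    conn.execute(
        "INSERT INTO compound VALUES (?, ?)",
        ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "ethyl alcohol"),
    )
except sqlite3.IntegrityError:
    pass  # duplicate structure detected by the index itself

row = conn.execute("SELECT name FROM compound").fetchone()
```

The point is that once structures are canonicalised, "is this compound already in the database?" becomes a constraint check rather than a chemistry problem – which is exactly why chemistry is easier to index automatically than stellar objects or organisms.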
The information management is hybrid. At one level there is robotic ingestion and curation and at the other human annotation (curation). CS has ca 20 million compounds and the only way to manage these is robotically. This brings several problems, which bedevil any large chemical aggregator:
  • the data either have to come from somewhere else or be computer-generated. CS does both – it ingests from PubChem, and it computes molecular properties. PubChem (which I’ll tackle in a later post) consists mainly of data contributed by a number of parties and is of highly variable quality. It is extremely difficult to evaluate this sort of data robotically, as there are few objective constraints and few other independent data sources. (We are trying to do this for a much smaller data set, ~5000 compounds, and we find unexpected and serious garbage, which I’ll blog later.)
  • Similarly there is no guarantee that the computation of properties is free from error – indeed it cannot be. Many physical properties depend on the physical form of the compound and this is often not recorded. I suspect most of the properties are computed by heuristic means (“QSPR”) rather than QM calculations. And many of them fail to take things like chemical stability and reactivity into account. (Examples are boiling points for compounds that decompose, and flashpoints for things that could never burn.) But how do you tell this robotically? I don’t have a good suggestion. One can guarantee, though, that in 20 million calculations some will be meaningless.
  • Chemistry is not regular, and in millions of compounds there will be many that simply don’t behave as expected from their formula. Or, alternatively, some can have many formulae. There is no simple robotic way of determining which these are and correcting them. So the compiler and the user of such systems have to be clear that error is part of the nature of the system.
Chemspider is using social computing (crowdsourcing) to clean up (curate) the information in the database. This works in Wikipedia, although the number of chemicals there is in the thousands, not the millions, and there are still many data and chemical problems. Moreover WP shows that there are compounds – e.g. aluminium chloride – where there is no single structure. It’s a matter of opinion whether the various states are manifestations of a single compound or several separate ones – certainly they have different connection tables. The problem with crowdsourcing is the numbers – chemistry is conservative – chemical WP lags behind other sciences, despite enormous efforts from a small number of individuals including Antony. What is certain is that crowdsourcing can only address a very small amount of Chemspider content – even with 10,000 volunteers it would take 2,000 curations each to address the whole.
Chemspider has also started to act as a repository for scientific data, especially Jean-Claude Bradley’s Open Notebook Science. In doing that it runs up against the same problems as university Institutional Repositories – heterogeneous data sets, versioning, metadata, compound documents, etc. Its advantage is that it will probably be restricted to a fairly narrow range of content types (chemistry) and it is also able to provide chemical substructure search (a major problem across the web).
What is Chemspider now, and where may it be going? It’s difficult to predict anything on the web, but it’s also clear that chemists are among the most conservative of disciplines. Why use a free service when you can get your library to pay (a lot of money) for ACS or Beilstein services? So I wouldn’t predict explosive growth like Flickr or Google.
I think quality is a major problem in this area. Chemspider is correcting the structures of compounds as they come across errors. In some cases this is possible as “all chemists know the correct structure of X”. But in many cases they won’t agree, or the chemistry simply doesn’t admit of an answer. There is no correct structure for glucose. And then there is the problem of the long tail. There’s a huge amount of chemistry out there and a lot is wrong. When InChI started, Nick Day surveyed the web for chemical structures that InChI might help to disambiguate. We started with staurosporine – it’s an anticancer drug that one of my close associates was interested in and it wasn’t clear what the structure was. Nick found 26 sites displaying staurosporine and there were 19 different structures given. Some were incomplete and several were just crazily wrong. Clearly many chemical suppliers, journal editors, etc. do not care about chemical structures. So there is a huge amount of rubbish out there.
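The staurosporine survey shows why a canonical identifier helps: once each site’s structure is reduced to a canonical string such as an InChI, “how many distinct structures are out there?” becomes a matter of counting distinct strings. A toy sketch (the identifiers below are made-up placeholders, not real InChIs, and the sites are hypothetical):

```python
from collections import Counter

# Hypothetical miniature of the survey: each site reports a structure for
# the "same" compound; canonicalising to an identifier string lets a
# machine count how many distinct structures are actually being claimed.
site_structures = {
    "site-A": "InChI=placeholder-1",
    "site-B": "InChI=placeholder-1",  # agrees with site-A
    "site-C": "InChI=placeholder-2",  # disagrees
    "site-D": "InChI=placeholder-3",  # disagrees differently
}

distinct = Counter(site_structures.values())
n_distinct = len(distinct)  # 3 distinct structures claimed by 4 sites
```

In the real survey the equivalent tally was 19 distinct structures across 26 sites – the machine can detect the disagreement, but only a chemist can say which (if any) structure is right.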
However as Nature says:
Chemical data have long been available, but at a hefty price. The largest supplier of such information is the American Chemical Society’s Chemical Abstracts Service. The service, which is more than a century old, includes data on roughly 35 million molecules. But university and industry chemists must pay thousands of dollars to use the database. The society will not reveal numbers, but fees for using the database are thought to make up a substantial portion of its US$311-million annual income from ‘electronic services’. Some have been highly critical of the society’s grip on chemicals.
PMR: At some stage, therefore, the community will react against this centralisation of information, but it could be a long time. I don’t think anyone should set out to duplicate what ACS does – I think we should use modern thinking to do things quicker, smarter, cheaper and in tune with the modern Web. Chemspider may have to make some choices soon – is it a company or a voluntary activity? Does it concentrate on high volume and variable quality, or low volume and high quality – it cannot do both? What is the particular USP of its repository service? There may well be a role for a specialist chemical repository service, but when? Is it different from PubChem, and how…?

Chemistry and CrystalEye – 1

We are discussing with Chemspider how some of the content in CrystalEye can be transferred to Chemspider. What Antony Williams wants is primarily the chemical identity of the substance, expressed as a connection table which can be ingested into CS, checked for uniqueness and linked back to CrystalEye (and thereby the original literature). This is straightforward for some of the entries. It’s hard for others and impossible for some. I’ll explain some of the problems here but they include:

  • we don’t know what some of the compounds are
  • we cannot give a connection table for some of the compounds
  • the algorithmic generation of the connection table may not be “chemically useful” for some compounds.

To start with let’s observe that the chemical identification of the compound is not normally (indeed almost never) recorded in the CIF file. Yes, the CIF dictionary has a template for connection tables, but no-one uses it. (If they did, life would be easier.)
Since Acta Crystallographica Section E covers all of chemistry, and since it’s now Open Access, it’s a good place to start by looking at some examples. (Note also that if you want an overview of the range of CrystalEye, just subscribe to its RSS feed and you’ll get 10–50 structures per day with structure diagrams when possible.) Here we’ll browse the 2008-05 issue: we get a table of contents with the first structure displayed in Jmol (3D coordinates) and CDK (2D coordinates, with bond types and connections from JUMBO/CIF2CML). The 2D structure looks like this:

(I’m temporarily unable to post images in WordPress so cannot snapshot the Jmol. But in any case Jmol is so lovely you should really try it yourself).
Now if we click on the first entry (grey bar) in the “article” column (view) we get linked through to the Acta Cryst article. And, since it’s Open Access, I can reproduce the whole page without having to ask permission. That gives me a warm feeling, and I am sure it does the same for the team at IUCr in Chester and also the authors. Open Access is so liberating. Anyway here’s the page:
======================================================

Acta Crystallographica Section E

Structure Reports Online

Volume 64, Part 5 (May 2008)


organic compounds


bg2176 scheme

Acta Cryst. (2008). E64, o899    [ doi:10.1107/S1600536808010696 ]

8-Hydroxy-5,6,7-trimethoxy-2-phenyl-4H-chromen-4-one

J. E. Theodoro, D. Santos, H. Pérez, M. F. das G. F. da Silva and J. Ellena

Abstract: In the title compound, C18H16O6, the benzopyran group is essentially planar, with the O atoms of the substituent groups lying close to its mean plane. The molecular conformation is governed by intramolecular interactions. The crystal packing is mainly determined by one classical intermolecular hydrogen bond which gives rise to the formation of an infinite chain along the a axis.
======================================================
So everything fits. Our proposed structure (from the CIF) corresponds exactly with the structure drawn by the authors (for non-chemists the position of the double bonds in the LH benzene ring is arbitrary). And the name can be translated by human or machine to give the same structure.
So in this case we’d be 100% happy to submit the connection table to aggregators such as PubChem or Chemspider. There’s a simple map between the connection table and the crystal structure.
But it’s not always that simple…
… and later posts will tell you why.


Clarification on CrystalEye

A long post on CrystalEye by Antony Williams. I comment on points that still need answering:

06:19 06/05/2008, Antony Williams,
In a recent post about ChemSpider we’ve been accused of wanting a Free Lunch. I copy a segment of the post and comment with insertions.
“Data are normally produced for a particular purpose, and to reuse them for another costs money. I’ll exemplify this by taking CrystalEye data – about 120,000 crystal structures and 1 million molecular fragments – which were aggregated, transformed and validated by Nick Day as part of his thesis. (BTW Nick is writing up – it’s a tribute to his work that CrystalEye runs without attention for months on end).
AJW> It is true…it is a tribute to Nick that CrystalEye can run for months without attention. Kudos. I am interested in how much pressure the site is under. How many searches/users in a day etc.? We find that our struggles in uptime (and these are negligible) are primarily based on stress on the servers. For nighttime users tonight things will have been slow…we deposited over 100,000 new molecules from 5 new data sources. That does create some slowness. We will hit about 40,000 transactions today. Our problems are ISP issues and powercuts. But we are also not in a University using thick pipes etc.

PMR: We do not monitor the number of searches. The maintenance is almost completely about journal TOC pages. The only reference to CIFs is usually on these HTML TOCs, which are designed to look good for humans (i.e. with lots of pictures advertising the publisher) but are awful for machines. Every so often a journal changes its TOCs and the CrystalEye robot breaks. It would be nice if publishers thought about the machine age and the Semantic Web sometimes.

One comment…it was 130,000 structures according to a previous blog and has been expanding since then from daily depositions. Right now I would expect it to be 140,000 rather than 120,000. When we did try scraping the data our best estimate was about 90,000. We might have missed something in our scraping and it’s why we asked for a dump of the data.

PMR: We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) do a good job of disambiguating by cell dimensions, but this is not foolproof and indeed no method is. We hope to develop better methods over the summer. The main duplication comes from the Crystallography Open Database, which has about 45,000 structures and is released periodically. Quite a number of the structures have syntactic problems and we do our best to fix them. So we really don’t know how many unique structures there are. Note also that the TOC in CrystalEye does not point to COD as it doesn’t have the appropriate structure.
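A crude sketch of what disambiguation by cell dimensions might look like – the tolerances here are illustrative guesses, not the values CrystalEye actually uses:

```python
def cells_match(cell1, cell2, rel_tol=0.01, angle_tol=0.5):
    """Crude duplicate screen: compare unit-cell parameters.

    cell = (a, b, c, alpha, beta, gamma); lengths in angstroms,
    angles in degrees. Tolerances are illustrative guesses.
    """
    for x, y in zip(cell1[:3], cell2[:3]):
        if abs(x - y) > rel_tol * max(x, y):  # lengths: relative tolerance
            return False
    for x, y in zip(cell1[3:], cell2[3:]):
        if abs(x - y) > angle_tol:            # angles: absolute tolerance
            return False
    return True

# Why this is not foolproof: different compounds can crystallise with
# near-identical cells, and the same structure can be reported in a
# different but equivalent cell setting, which this naive comparison
# would miss entirely.
```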

The primary purpose of CrystalEye was to allow Nick to test the validity of QM calculations in high-throughput mode. It turned out that the collection might be useful, so we have posted it as Open Data. To add to its value we have made it browsable by journal and article, searchable by cell dimensions, searchable by chemical substructure and searchable by bond length. This is a fair range of what the casual visitor might wish to have available. Andrew Walkingshaw has transformed it into RDF and built a SPARQL endpoint with the help of Talis. It has a Jmol applet and 2D diagrams, and links back to the papers. So there is a lot of functionality associated with it.
AJW> The team has done a good job in putting the site together. The JMol applet is an excellent utility for us all to use and thanks to that team for sure! Egon has been challenging us to RDF the site and it’s on our list, but keeps getting pushed down based on other requests. Since he’s the only voice asking it will keep getting pushed down unfortunately.

PMR: Andrew Walkingshaw has converted a subset of the data (in all the entries) to RDF and he is demonstrating it at XTech (Dublin) this week. I blogged earlier about his mashup with Google Earth.

Antony and I have had several discussions about CrystalEye – basically he would like to import it into his database (which is completely acceptable) but it’s not in the format he wants (multi-entry files in MDL’s SDF format, whereas CrystalEye is in CML and RDF).
AJW> To clarify, again. I DON’T want to import CrystalEye into ChemSpider. I DON’T! All I want is the set of structures and unique associated URLs so that users of ChemSpider can find that there is crystal structure information over on CrystalEye and can click the link and be on CrystalEye and get the benefit of Nick, Andrew and Peter’s work. I don’t want to reproduce their effort. I want to integrate to it. I’ve said it many times on Peter’s blog and on this one.
This type of problem arises everywhere in the data world. For example the problem of converting between map coordinates (especially in 3D) can be enormous. As Rich says, it costs money. There is generally no escape from the cost, but certain approaches, such as using standards like XML and RDF, can dramatically lower it. Nevertheless there is a cost. Jim Downing made this investment by creating an Atom feed mechanism so that CrystalEye could be systematically downloaded, but I don’t think Chemspider has used this.
AJW> If Jim can contact me by email and provide me with detailed instructions to download the entire file of structures ONLY and their associated URLs that would be excellent. I’ll send the request to him tonight.

PMR: There IS no entire file of structures. It has never been created and won’t be. That’s not because we want to make life difficult. It would take a month and we haven’t got a month. We believe we have a better way which Jim created for you in November (CrystalEye and repositories: Jim explains the why and how of Atom) – we did it for you… and we’d appreciate feedback.
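For anyone wanting to harvest systematically, the shape of the Atom approach can be sketched like this – the feed XML below is an illustrative mock-up, not the real CrystalEye feed, whose element layout may differ:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Illustrative mock-up of one page of an Atom feed: entries carry links to
# the data files, and a rel="next" link points to the following page.
sample_page = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>CrystalEye entries (sample)</title>
  <link rel="next" href="http://example.org/feed?page=2"/>
  <entry>
    <title>Entry 1</title>
    <link rel="enclosure" href="http://example.org/entry1.cml"/>
  </entry>
</feed>"""

def parse_page(xml_text):
    """Return (list of enclosure URLs, URL of the next page or None)."""
    root = ET.fromstring(xml_text)
    enclosures = [
        link.get("href")
        for link in root.iter(ATOM + "link")
        if link.get("rel") == "enclosure"
    ]
    next_links = [
        link.get("href")
        for link in root.findall(ATOM + "link")  # feed-level links only
        if link.get("rel") == "next"
    ]
    return enclosures, (next_links[0] if next_links else None)

urls, next_page = parse_page(sample_page)
```

A harvester would fetch each page (e.g. with urllib), collect the enclosure URLs, and follow rel="next" until it is absent – downloading the whole archive incrementally without anyone having to build a single "entire file".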
[…]

[…] it costs money. It’s unrealistic to expect that we should carry out the conversion for a commercial company for free. We’d be happy to consider a mutually acceptable business proposition, and it could probably be done by hiring a summer student.
AJW> I am interested in what commercial benefit integrating to CrystalEye can have. It’s work on our side. I’m not sure what a mutually acceptable business proposition would look like. It can’t be that much work to send us a set of InChI strings and URLs for the CrystalEye dataset…they already exist on CrystalEye. So, I’ll assume that this is a last comment on “No thanks to CrystalEye data in ChemSpider”. I have to ask: why not put them in PubChem? Since PubChem is held up as the standard of Open Data, why not put CrystalEye there?

PMR: The only thing stopping us putting them in PubChem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon.
When it’s finished it will be in RDF/CML.

FWIW we are continuing to explore the ways in which CrystalEye is made available. We’re being funded by Microsoft as part of the OREChem project, and the results of this could represent some of the ways in which Web technology is influencing scientific disciplines. We’d recommend that those interested in mashups and re-use in chemistry take a close look at RDF/SPARQL/CML/ORE, as those are going to be standard in other fields.
AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.

PMR: Chicken and egg… 🙂 You won’t adopt it until other people adopt it, and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents), as in semantic publishing and the results of text-mining.
And we shall have one or two announcements soon…
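As a toy illustration of the RDF approach mentioned above, a crystal-structure record can be expressed as (subject, predicate, object) triples and queried over. Everything here is invented for illustration — the URIs, the predicate name and the example InChI are not the actual CrystalEye schema, and a real implementation would use an RDF library and SPARQL rather than plain tuples:

```python
# Sketch: representing crystal-structure records as RDF-style
# (subject, predicate, object) triples and querying them.
# All URIs, predicates and values below are hypothetical.
ENTRY = "http://example.org/crystaleye/entry/12345"   # hypothetical entry URI
INCHI = "http://example.org/vocab#inchi"              # hypothetical predicate

triples = [
    (ENTRY, INCHI, "InChI=1S/CH4/h1H4"),              # methane, as a toy datum
    (ENTRY, "http://example.org/vocab#source", "example CIF"),
]

def query(triples, predicate):
    """Return (subject, object) pairs for a given predicate -
    a toy stand-in for a SPARQL SELECT."""
    return [(s, o) for s, p, o in triples if p == predicate]

print(query(triples, INCHI))
```

The point of the triple form is that data from different sources, described with shared vocabularies, can be merged and queried without bespoke converters.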

Posted in Uncategorized | 2 Comments

Chemspider adopts CC-SA licence

I am DELIGHTED to report that Chemspider has adopted a CC-SA licence for its data. I comment below:

05:23 06/05/2008, Antony Williams,
Over the past year ChemSpider has been challenged over the nature of our offering in terms of Open Data etc. A small number of people focused a lot of time talking about this while we remained focused on improving the website and having it available for people to use as a Free Access website. I spoke to Peter Suber about Open Access and then John Wilbanks about Creative Commons.
Since ChemSpider is the aggregate of a number of people’s work (including provision of software by collaborators) I had to get into conversation to see what licenses would be acceptable to those groups.
With the redesign of the website we have structured ourselves in a way to add licenses as we see appropriate now. So, as of today we have added the Creative Commons Attribution Share Alike 3.0 United States License and the appropriate logo is on all sections of a Record View except for the predicted properties. Once we get approval from our collaborators for this same license (and discussions are underway) then the whole record view will be Licensed.
At that point, you are free:
  • to Remix — to make derivative works

Under the following conditions:

  • Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
  • Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
  • PMR: This is wonderful. As far as I know Chemspider is the only commercial chemical information company offering data under this licence, which is completely compatible with the Open Knowledge Definition. (It is also BBB-compliant, though data and publications are different animals).
    For those not familiar with Chemspider (Chemspider official site) it is a commercial company which inter alia aggregates data from public sources, and offers to host data from individuals and organisations, combining them through a common linkbase, primarily of chemical structure identifiers.
    Chemspider has moved a long way since its beginnings which did not originally include an Open model for the content, and they deserve great kudos for having made the transition.
    Note that this sort of data sharing is common in biosciences, astronomy and a number of other sciences, though generally the support comes either from public funding, industrial subventions or marginal resources from the community. In those areas licences are often not used, as there is an ethos of Community Norms where researchers are expected to make their data available, whereas in chemistry the ethos is to control access and charge for it.
    Chemspider has similarities to the NIH’s PubChem, which also aggregates chemical data, often through donations from agencies, suppliers and research collections. As CS notes, the data were originally produced by the community, and so the community should have access to them.
    The real tragedy of chemical information is that the data produced in the primary literature cannot go directly into Chemspider, PubChem or any other database, because the publishers would send lawyers. Chemspider have tried to get the ACS to allow them to use the CIF data which we collect in CrystalEye. I assert that these data, which the ACS copyrights, are facts and so can be re-used as facts without breaking copyright. Antony Williams has asked the ACS on more than one occasion whether Chemspider can use these data, and the ACS have been unable to say yes or no, but mainly mumble. It’s this sort of lack of help from learned societies which makes it such a hard struggle. Chemspider and ourselves now have a lot in common and I’ll address some of this when I reply to their CrystalEye post. (Makes a change from trying to work out what Open Access is.)
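The “common linkbase” idea described above can be sketched in a few lines: records from different sources are merged on a shared structure identifier, here the InChI. The records and property names below are invented for illustration, not Chemspider’s actual schema:

```python
# Sketch of aggregation on a common chemical identifier (InChI).
# The source records below are invented for illustration.
source_a = [{"inchi": "InChI=1S/CH4/h1H4", "name": "methane"}]
source_b = [{"inchi": "InChI=1S/CH4/h1H4", "bp_celsius": -161.5}]

linkbase = {}
for record in source_a + source_b:
    # Merge every record into one entry per InChI.
    linkbase.setdefault(record["inchi"], {}).update(record)

print(linkbase["InChI=1S/CH4/h1H4"])
```

The design choice that matters is the shared key: because both sources identify the compound the same way, independent data deposits end up linked without any coordination between depositors.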

    Posted in Uncategorized | 1 Comment

    Some test cases for strongOA/weakOA

    I am still – honestly – completely unable to work out where the proposed boundary between strongOA and weakOA lies (see copious recent blog posts). I therefore put forward some test cases and ask those who understand it – especially the proponents – which category each falls into:

    • CC-BY
    • CC-NC
    • Springer Open Choice; from the site: Springer operates a program called Springer Open Choice. It offers authors the option to have their journal articles made available with full open access in exchange for payment of a basic fee (‘article processing charge’).
      With Springer Open Choice the authors decide how their articles are published in the leading and well respected journals that Springer publishes. Springer continues to offer the traditional publishing model, but for the growing number of researchers who want open access, Springer journals offer the option to have articles made available with open access, free to anyone, any time, and anywhere in the world. If authors choose open access in the Springer Open Choice program, they will not be required to transfer their copyright to Springer, either.
    • ACS Author Choice ; from the site: ACS AuthorChoice facilitates unrestricted web access to your published ACS article—at the time of publication—for a one–time fixed payment, provided by you or your funding agency. Contributing authors who are ACS members and/or are affiliated with an ACS subscribing institution receive significant discounts. This policy also allows you to post copies of published articles on your personal website and institutional repositories for non–commercial scholarly purposes.
    • The NIH Public Access Policy [which] ensures that the public has access to the published results of NIH funded research. It requires scientists to submit journal articles that arise from NIH funds to the digital archive PubMed Central (http://www.pubmedcentral.nih.gov/). The Policy requires that these articles be accessible to the public on PubMed Central to help advance science and improve human health.
    • Green self-archiving of a final author manuscript in a toll-access journal, with copyright transferred to the publisher
    • articles describing work currently funded by the Wellcome Trust

    The following terms have been proposed as synonymous with strongOA:
    transparent and self-explanatory (Harnad):
    RE-USE OA
    READ-WRITE OA
    PERMISSION OA
    generic
    EXTENDED OA
    EXTENSIBLE OA
    FULL OA
    PMR: I am particularly worried about the use of “FULL-OA” for anything that isn’t.
    To recap – I now have no idea where the boundary line is. I hope some test cases will make it clearer. Everyone except me seems to agree the line is clear, but I suspect they will all put it in different places.

    Posted in Uncategorized | 2 Comments

    Word, OOXML, ODT

    Glyn Moody takes me to task – with some justification – for suggesting that Word/OOXML is a useful format for archival.

    I should make it clear that I am not religiously opposed to PDF, just to the present incarnation of PDF and the mindset that it engenders in publishers, repositarians, and readers. (Authors generally do not use PDF).

    He then discusses in detail what the problems are and what solutions might be. Then he drops this clanger:

    [PMR] I’m not asking for XML. I’m asking for either XHTML or Word (or OOXML)

    Word? OOXML??? Come on, Peter, you want open formats and you’re willing to accept one of the most botched “standards” around, knocked up for purely political reasons, that includes gobs of proprietary elements and is probably impossible for anyone other than Microsoft to implement? *That’s* open? I don’t think so….
    XHTML by all means, and if you want a document format the clear choice is ODF – a tight and widely-implemented standard. Anything but OOXML.

    PMR: I don’t think I would disagree with your analysis of OOXML. My point is that – at present – we have few alternatives. Authors use Word or LaTeX. We can try to change them – and Peter Sefton (and we) are trying to do this with the ICE system. But realistically we aren’t going to change them any time soon.
    My point was that if authors deposit Word we can do something with it, whereas we can do nothing with PDF. It may be horrible, but it’s less horrible than PDF. And it exists.
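To make the Word-vs-PDF point concrete: a .docx file is just a Zip archive of XML (OOXML), so its text can be pulled out with standard tools; nothing comparable works reliably for PDF. A minimal sketch, with the document content constructed here purely for illustration:

```python
# A .docx file is a Zip archive containing XML (OOXML). This sketch builds a
# minimal docx-like archive in memory and then extracts its text with
# stdlib tools alone.
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Build a toy "document" (a real .docx contains more parts than this).
document_xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:r><w:t>Hello, semantic world</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", document_xml)

# Extract the text back out: unzip, parse, collect all <w:t> runs.
with zipfile.ZipFile(buf) as zf:
    root = ET.fromstring(zf.read("word/document.xml"))
text = "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))
print(text)  # -> Hello, semantic world
```

Horrible or not, the structure is there to be mined; a PDF of the same paragraph would hand back positioned glyphs, not marked-up text.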
    I may be optimistic, but it can also be converted to ODT. See the WP entry:
    Microsoft Office does not natively support OpenDocument. Microsoft has created the Open XML translator[19] project to allow the conversion of documents between Office Open XML and OpenDocument. As a result of this project, Microsoft finances the ODF add-in for Word project on SourceForge. This project is an effort by several of Microsoft’s partners to create a plugin for Microsoft Office that will be freely available under a BSD license. The project released version 1.0 of this software for Microsoft Word in January 2007, and for Microsoft Excel and Microsoft PowerPoint in December of the same year. Sun Microsystems has created the competing OpenDocument plugin for Microsoft Office 2000, XP, and 2003 that supports Word, Excel, and PowerPoint documents.[20]
    It is our intention that anything we do in this space will be open and as far as we understand it compatible with both OOXML and ODT.
    That probably hasn’t made it better but hopefully it’s clearer.
    P.

    Posted in Uncategorized | 3 Comments

    Open metabolomics data

    An email which I’m delighted to blog. [from WP: Metabolomics is the “systematic study of the unique chemical fingerprints that specific cellular processes leave behind” – specifically, the study of their small-molecule metabolite profiles.[1] The metabolome represents the collection of all metabolites in a biological organism, which are the end products of its gene expression.]

    my postdoctoral scientist Dr Tobias Kind has alerted me that you have been giving a lecture during the Open Repositories 2008 conference, and that you had asked the larger community for examples.
    We just wanted to let you know (i.e. for mere information to you, no action requested): we are a rather small lab but have big ambitions to make (‘all’) our data publicly available, at least those that were or are about to be published and/or for which our collaborators have granted permission to publish the data (and metadata).
    If interested, please find below a link to our public studies that are composed of experimental design metadata as well as processed metabolite data and the underlying unprocessed data files. So, in principle, anyone could download our data and prove us wrong (or, hopefully, concur with our findings), and we hope that our repositories (SetupX and BinBase) will become a helpful tool for researchers in the area.
    Our publications on the databases and on structure elucidation are found on http://fiehnlab.ucdavis.edu/publications/
    An example of a study that is compliant with the ‘Metabolomics Standards Initiative’ is published in Fiehn O, Wohlgemuth G, Scholz M, Kind T, Lee DY, Lu Y, Moon S, Nikolau BJ (2008) Quality control for plant metabolomics: Reporting MSI-compliant studies. Plant Journal 53, 691-704
    <http://fiehnlab.ucdavis.edu/publications/Fiehn%20et%20al%202008%20Plant%20Journal_Reporting%20MSI%20compliant%20plant%20metabolomics%20studies.pdf>
    And the data for this study are obviously public at
    http://fiehnlab.ucdavis.edu:8080/m1/main_public.jsp
    We will publish our source codes, documentation and help files for installation of the databases within the next 4-8 weeks. Snippets of these are found under the pages of my staff, i.e.
    Gert Wohlgemuth, http://fiehnlab.ucdavis.edu/staff/wohlgemuth/binbase/
    Martin Scholz http://fiehnlab.ucdavis.edu/staff/scholz/dev/
    Tobias Kind http://fiehnlab.ucdavis.edu/staff/kind/Metabolomics/ (e.g.
    the Seven Golden Rules)
    Best regards
    Oliver Fiehn
    Oliver Fiehn, Assoc. Prof. MCB
    – Metabolomics –
    UC Davis Genome Center
    GBSF Building room 1315
    451 East Health Sciences Drive
    Davis (CA) 95616-8816
    ofiehn AT ucdavis DOT edu
    URL http://fiehnlab.ucdavis.edu/
    tel +1-530-754-8258
    fax +1-530-754-9658

    PMR: This is great. The data are both numerous and complex so making them available will be very valuable to other labs doing the same sort of work. And they shouldn’t worry about being a small lab – most labs belong to the “long tail” of science.
    FWIW Scientific American carried an article this month about Science 2.0, with contributions from OpenWetWare, Jean-Claude Bradley, Cameron Neylon, etc. I thought it was a balanced coverage of the pluses and minuses of Open Notebook Science and related efforts.

    Posted in Uncategorized | 2 Comments

    More clarification from Stevan Harnad

    I was just about to start some hacking, but I’ve just received a comment from Stevan that needs a reply. [This blog is not the best medium for this discussion, but it seems to be providing something useful.] Parts of the comment are excised in this post.

    Stevan Harnad Says:
    [full comment]
    […]
    Price-Barrier-Free OA (free online access, better name to come) is one form of OA, Permission-Barrier-Free OA (better name to come) is another.

    PMR: I think a major confusion has come from the term “Permission-Barrier-Free”. I read this to mean “free of all permission-barriers”, whereas the Suber-Harnad terminology means “free from at least one permission-barrier”.

    And the logical algorithm continues to be that Price-Barrier-Free OA is a necessary condition for Permission-Barrier-Free OA and Permission-Barrier-Free OA is a sufficient condition for Price-Barrier-Free OA, which technically and logically makes the one “Weak OA” and the other “Strong OA”.
    However, two problems remain: “Weak” has unintended pejorative connotations, so it cannot be used as the generic name for Price-Barrier-Free OA.

    PMR: As I understand it, weakOA is a uniform object which is price-free but has no other advantages. Words such as “minimal” or “basic” would be descriptive.

    And Permission-Barrier-Free OA is a matter of degree (whereas Price-Barrier-Free OA is all-or-none).
    So in order to avoid vagueness, a further criterion is needed in order to define Permission-Barrier-Free OA precisely: A minimum or lower bound has to be specified (in the hierarchy of possible CC licenses) for Permission-Barrier-Free OA.

    PMR: Much OA does not use licences at all. If all OA could be specified by licences, that would be a vast improvement.

    (In addition, and optionally, an optimum CC license can be designated, either in general, or for certain fields or uses.)
    This has nothing whatsoever to do with either “grand visions” or rhetoric. It is all about functionality, logic and practicality.
    If you don’t mind my saying so, Peter, you have a specific need for a specific kind of Permission-Barrier-Free OA. You seem to want to define OA, or Strong OA, or Permission-Barrier-Free OA as what meets that specific need.

    PMR: No. I read “Permission-Barrier-Free OA” as meaning “free of all permission-barriers”. If it doesn’t it is highly misleading.

    In the wider context of OA, your specific need falls within a spectrum of needs, all of which are supported by the architects and advocates of OA. But your specific needs cannot be made the basis of the definition of OA, and not even of the definition of Permission-Barrier-Free OA. There is no point calling this simple logical, functional and practical fact a preference for grand visions of rhetoric, because it is not.

    PMR: My specific need is for clarity, not for a special type of OA.

    In addition, it may well be that your own specific needs have no use for the Green/Gold distinction — which is not about how OA is defined, but about how OA is delivered (via OA self-archiving of articles in non-OA journals or via publishing in OA journals). But the reason you keep finding the color distinction confusing (despite having it repeatedly explained, and despite the fact that it was formulated to resolve confusion) may again be that you are focussed only on your own specific OA needs and not on the OA needs of others, and on the confusion that needs to be resolved in order to meet them.
    OA is being defined and provided in order to fulfill a broad spectrum of needs, primary among them being free online access to articles that would otherwise be inaccessible to users. In addition, there is a broad spectrum of permissions and corresponding licenses that can remove a broad spectrum of permission barriers to a broad spectrum of possible usage and re-usage needs. Apart from a specific CC license, there is no natural kind in all of this that corresponds only to the kinds of usage needs you have in mind.
    This is not a logical, practical or functional defect in the concept, the nature or the definition of OA.

    PMR: My perspective is as a user/reader, agreed. So how the document got to be OA is less important to me personally than what the final situation is. “Green” and “gold” represent processes, and the processes per se do not define the final outcome. So “self-archiving” does not indicate what the user may or may not do. That matters to me.
    Note also that the world in general will take its definition of OA either from BBB or from Wikipedia, which states:

    Open access (OA) is free, immediate, permanent, full-text, online access, for any user, web-wide, to digital scientific and scholarly material,[1] primarily research articles published in peer-reviewed journals. OA means that any user, anywhere, who has access to the Internet, may link, read, download, store, print-off, use, and data-mine the digital content of that article. An OA article usually has limited copyright and licensing restrictions.

    The spectrum covered by weakOA, and by much OA, does not accord with BBB or WP. You are therefore redefining the use of the term itself. This may be a good thing, but it is certain to cause much confusion, and I am acting as a touchstone for that.

    Posted in Uncategorized | 6 Comments