Green OA and Open Data

Peter Suber has picked up a point that I made at the RSC Open Access Meeting and I’m happy to address it:

15:58 22/05/2008, Peter Suber,
Peter Murray-Rust, RSC Open Access – what I think I’m going to say, A Scientist and the Web, May 22, 2008. Peter is referring to his talk today at Open Access Publishing in the Chemical Sciences (London, May 22, 2008). Excerpt:

[…]
The theme is “Open Data”. I’ve recently written a review of this in Elsevier’s Serials Review and it’s coming out RSN in a special issue on Open Access. It’s already on Nature Precedings. So if you want detailed aspects – a few months out of date, they are there.
Some bullet points:

  • Data are different from text. Open Access generally does not support data well (I make exceptions for ultra-strong-OA such as CC-BY and BBB-compliant licences, of the sort that PLoS and BMC provide). Green Open Access is irrelevant to Open Data (I think it makes it harder, others disagree).

[… most points snipped…]

Peter S Comment. I follow and agree with all of this, with one exception: “Green Open Access is irrelevant to Open Data (I think it makes it harder, others disagree).” I don’t understand the claim or the argument, but I imagine we’ll hear more in time. Good luck today, Peter!

PMR: [I’m happy to be corrected in anything that follows… if your comment doesn’t get through please mail pm286]
Green Open Access describes a process – primarily an author self-archiving their “paper” in an institutional repository or on their own web page. There are mechanisms for indexing repositories (e.g. Google Scholar). I’ve been through the process and here is a typical result:

[There is an anomaly here, in that the RSC does not actually allow self-archiving in this way, but at the time they had publicly announced (or it had been announced) that they did. And the hassle of taking it out is even worse than the hassle of getting it in. So we agreed to let it rest, and there was a statement from the RSC (Org Biomol Chem. 2005 May 21;3(10):2037.) clarifying it.]
Green Open Access results in the full-text (versions may vary) of a paper being publicly visible, indefinitely, without price barriers. There are no default permissions – Green does not per se remove any permission barriers. In particular GOA does not actively support the extraction of data (of course an author may be permitted by some publishers to allow data extraction).
GreenOA is designed to be simple. Stevan Harnad argues that it can be accomplished with “one-click”. I haven’t found this to be true for me in Cambridge/DSpace but it’s a useful mantra. The “one-click” is to upload some version of the paper (varying between pre-/post- refereeing and author/publisher version).
GreenOA does not, in general, say anything about copyright or licences. The paper may carry a publisher’s copyright, an author’s copyright or (frequently) none. There is almost never a formal licence. There is almost always no formal statement of policy for re-use. Cambridge DSpace states by default “Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.” It takes a lot more than one click to override this default.
There is no explicit mention in the GreenOA upload model for items other than the “full-text”. The repositories may provide such support but – at least in the early days – the focus was completely on full-text only.
We need to remove GoldOA from the discussion. GoldOA by default may also not remove permission barriers. However with GoldOA there is a single copy of the material – the items on the publisher’s website – and these are freely accessible to human eyes. Indefinitely. So IF the author has submitted supplemental data, that will be openly visible on the publisher’s site.
I hope we can all agree on these and I’ll start making my argument here…
======================
So by default GreenOA items are designed to be human-visible but without any support for Data, in any of upload, legal access or technical access. The primary goal of Stevan Harnad – expressed frequently to me and others – is that we should strive for 100% GOA compliance, and that discussions on Open Data, licences and other matters are a distraction and are harmful to the GOA process. I suspect that many others do not take such a strong position. However if Open Data is irrelevant or inimical to GOA then it is hard to see GOA as supportive of Open Data.
However my main argument is that the lack of support for Open Data in GOA is potentially harmful to the Open Data movement. Let’s assume that Stevan’s approach succeeds and we get 100% of papers in repositories through University mandates, funders et al. (I’ll exclude chemistry from the argument). GOA will encourage the deposition of full-text only.
So a GreenOA paper may often be a cut-down, impoverished version of what is available – for a price – on the publisher’s website. It may, and usually will, lack the supporting information (supplemental data). It will probably not reproduce any permissions that the publisher actually allows. So – if we concern ourselves with matters other than human eyeballs and fulltext – it is almost certainly a poorer resource than the one on the publisher’s site.
I’m aware that I’m speculating without data. If anyone can provide figures for the provision of (a) supporting info and (b) licences/permissions in IRs it would be extremely useful. However it is a lot of extra hassle – and why bother anyway? The robots can’t search the data (technically), so why not simply point readers to the publisher’s website? It is possible that the reverse occurs – that some authors archive more data than the publisher allows. But I doubt it’s common.

So my major concern is that GreenOA will lead to substandard processes for publishing scientific data. I’d be happy to find Repositories that insist on data upload. I doubt they are common.
So here is a challenge to the community: how many instances are there of crystallographic data (CIF) self-archived with GreenOA papers? Archiving the data is allowed, and there are enough publishers (Wiley, Elsevier, Springer) who allow GreenOA. If no-one can find examples then I would again feel justified in using “irrelevant”.
Now the more tenuous arguments.
Even if the IRs contained all the data appropriate to the publications, how do we discover it? This is in any case very difficult, and CrystalEye succeeds mainly because of the insistence of the Int. Union of Crystallography on the need to publish all supporting experimental information. By contrast many publishers do not do this for chemistry. If I want to find data then, unless there is a known data repository, I will go to the publisher’s website, not the IRs. The Transylvanian Journal of Haematology is a better place to find data on nocturnal anticoagulants than searching IRs. Firstly I don’t know that the papers are in the IRs, and they probably aren’t anyway. Secondly I don’t know that even those are indexed – maybe most are, but the doubt remains. But mainly I don’t think I’ll find any data. And how, anyway, do I search thousands of repositories, when searching a small number of journals gives a much higher concentration of productive results?
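For concreteness, here is a sketch of what harvesting even one repository programmatically involves, using the standard OAI-PMH protocol that most IRs expose. The endpoint URL is invented and the response is a canned fragment standing in for a live reply – and this whole dance has to be repeated and merged across thousands of IRs:

```python
# Minimal sketch of OAI-PMH harvesting: build a ListRecords request URL
# for one repository and pull Dublin Core titles out of the response.
# The base URL is hypothetical; real IRs expose OAI-PMH at varying paths.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def build_listrecords_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for one repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

def titles_from_response(xml_text):
    """Extract Dublin Core titles from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter(DC_NS + "title")]

# A canned response fragment, standing in for a live repository reply.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Synthesis of a novel anticoagulant</dc:title>
    </oai_dc:dc>
  </metadata></record></ListRecords>
</OAI-PMH>"""

url = build_listrecords_url("https://repo.example.ac.uk/oai")
print(url)
print(titles_from_response(sample))
```

Note that even when this works, what comes back is bibliographic metadata (titles, authors), not the supplemental data files themselves.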
And there is a deeper worry about the role of data in the mandates. GoldOA does not by default remove any permission barriers. Yet the fees for achieving GoldOA can be very high. So if a publisher agrees with a funder that the grants can pay 3000 Draculas for a publication, this need not include any data. Funders may get the idea that data-impoverished publication is the most that can or should be achieved. The current position with NIH/PMC is an example of extremely impoverished archiving. Read-but-don’t-use.
Many funders (Wellcome and, as we heard from Robert Kiley, 8 other major UK medical funders) require ultra-strong-OA for their archives. Because they care about data. And several publishers (PLoS, BMC) also insist on CC-BY. This is, of course, great for scientific data.
But it’s a long way from GreenOA.
Posted in Uncategorized | 4 Comments

PUG SOAP for PubChem

I missed this in the chemical blogosphere and was alerted by Peter Suber: Facilitating the exchange of chemical data

18:14 21/05/2008, Peter Suber,

PubChem has released the beta of its PUG SOAP (Power User Gateway Simple Object Access Protocol). From the site:
PUG SOAP is a web services access layer to PubChem functionality. It is based on a WSDL [Web Service Definition Language]….
PubChem’s PUG (Power User Gateway), documented elsewhere, is an XML-based interface suitable for low-level programmatic access to PubChem services, wherein data is exchanged through a relatively complex XML schema that is powerful but requires some expertise to use. PUG SOAP contains much of the same functionality, but broken down into simpler functions defined in a WSDL, using the SOAP protocol for information exchange….

PMR: This is excellent news. The chemical information web is riddled with human-oriented GUIs, closed interfaces and hidden data, with little exposure of content or architecture. This in itself prevents re-use, although most of the sites are in any case not Open.
The bioinformatics community, by contrast, thrives on Open web services. These have Open Data (re-usable) and Open Architecture. It’s epitomised by the national and international Centres such as NCBI, PDB and EBI and many others. There are literally thousands of Web Services (WS).
This is exemplified in the Open Source Taverna Workflow tool from myGrid (UK, eScience). Taverna has a huge list of bio-Webservices, including access to SOAPLab. The architecture is based on XML. Unfortunately we couldn’t use this approach in chemistry because there are almost no Web Services. (We and Indiana have provided a few, but nothing compared with bio-.)
Although I’m not personally a fan of SOAP (we develop everything using a RESTful approach) it’s an acceptable architecture. Along with Web Services go RDF and RSS – these are so much more useful than what most web sites provide.
But I’m not surprised, and I’m very pleased. I’ve been saying for some time that chemical informatics is stalled – and the RSC meeting did nothing to change my view. Chemists aren’t interested in information. Chemical informaticists aren’t interested in C21.
So the biosciences – as I’ve predicted – are tooling up to do chemistry properly. Perhaps only those bits of bio-interest, perhaps everything. Who’s doing chemical ontologies? The bioscientists. Who’s doing chemical Web Services? The bioscientists. Who’s doing chemical text-mining? The bioscientists. Who’s doing chemical datamining? Let’s see.
And what are the chemists doing?
We’ll certainly be using PUG SOAP, and perhaps we can work towards a PUG-REST?
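To make the SOAP-vs-REST contrast concrete, here is a minimal sketch of what the two request styles look like, built with nothing but the standard library. The operation name “GetCompound”, the parameter “cid” and both URLs are invented for illustration – they are not the actual names in PubChem’s WSDL:

```python
# A SOAP call wraps its arguments in an XML envelope posted to one
# endpoint; a REST call puts them in an addressable URL.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def soap_envelope(operation, params, service_ns="http://pubchem.example/pug"):
    """Build a minimal SOAP 1.1 request envelope as a string."""
    ET.register_namespace("soap", SOAP_NS)
    env = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(env, "{%s}Body" % SOAP_NS)
    op = ET.SubElement(body, "{%s}%s" % (service_ns, operation))
    for name, value in params.items():
        ET.SubElement(op, "{%s}%s" % (service_ns, name)).text = str(value)
    return ET.tostring(env, encoding="unicode")

envelope = soap_envelope("GetCompound", {"cid": 2244})
# The REST equivalent is just a URL - no envelope, no WSDL needed:
rest_url = "https://pubchem.example/pug/compound/2244"

print(envelope)
print(rest_url)
```

The REST URL is cacheable, bookmarkable and trivially scriptable, which is why we prefer it; the SOAP envelope buys you a machine-readable contract (the WSDL) at the cost of ceremony.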
Posted in Uncategorized | 4 Comments

Free NMR prediction

Chemspider describes the NMR prediction service provided by the Ecole Polytechnique Fédérale de Lausanne. NMR Prediction Now Available Via ChemSpider

00:56 22/05/2008, Antony Williams,
[…]. The NMR prediction service is provided by Luc Patiny’s group out of Ecole Polytechnique Fédérale de Lausanne at the Institute of Chemical Sciences and Engineering. Their nmrdb.org webpage offers a series of services, not just NMR prediction and I offer the details below from their website.
NMR Predictor – This page allows to predict the spectrum from the chemical structure
NMR Assigner – Upload and assign NMR spectra on-line. The assignment of NMR spectra may be decomposed in 4 steps:
  1. identification of the signals
  2. integration and multiplicity determination
  3. assignment of each signal to the corresponding atom in the molecule
  4. exportation of the data for publication and/or for database storage

NMR Resurrector – A great amount of NMR information is currently available in the form of scientific publications. However, this information is not readily accessible in the format required for complex searches. The Resurrector enables the user to easily import these in-line spectral descriptions and creates an assigned visual representation that can be seamlessly integrated in the attribution process.

PMR: Firstly, I think it’s very important to provide free online predictions of NMR spectra and Lausanne are to be congratulated. The technology is very valuable in the age of machine-readable chemical information. I’m not sure quite what the primary site at Lausanne provides – it states that the prediction is provided by a (presumably commercial) engine from Molecular Networks. I don’t know whether there are restrictions on the volume or re-use of the calculations done at Lausanne. If there aren’t any, this is a useful advance (though I am still keen to see the algorithms of prediction tools exposed).
Again I may have missed it, but the primary means of calculating NMR spectra at Lausanne seems to be through an applet. I’d like to see Web Services that can be used by machines, rather than needing to enter structures by hand (I’m pleased to see Pubchem is taking the WS approach and I’ll blog this later). Computing is becoming indefinitely cheap and it makes sense to compute the spectrum of every new compound published.
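The “Resurrector” idea above – turning the in-line spectral descriptions in papers back into data – can be sketched in a few lines. This toy parser handles only the simplest “shift (multiplicity, nH)” pattern; real published descriptions are far messier, which is exactly why a proper service is valuable:

```python
# Toy "Resurrector": parse an in-line 1H NMR description into structured
# peak records (shift, multiplicity, integration).
import re

PEAK_RE = re.compile(r"(\d+\.\d+)\s*\((\w+),\s*(\d+)\s*H\)")

def parse_nmr(text):
    """Turn 'δ 7.26 (s, 1H), ...' into a list of peak dicts."""
    return [
        {"shift": float(shift), "multiplicity": mult, "integration": int(nh)}
        for shift, mult, nh in PEAK_RE.findall(text)
    ]

peaks = parse_nmr("1H NMR (CDCl3): δ 7.26 (s, 1H), 2.17 (q, 2H), 1.05 (t, 3H)")
print(peaks)
```

Once the description is in this structured form it can be searched, compared with a predicted spectrum, or attached to the corresponding atoms in a molecule.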

Posted in Uncategorized | Leave a comment

RSC: What I and others said; and let's unleash the robots

Christoph Steinbeck has given a huge account of the RSC meeting so I don’t have to say anything. Here’s what I seem to have said (as I have no PowerPoint or linear trajectory, my talks do not follow a set course and are determined in part by audience reaction).

Open Access Publishing in the Chemical Sciences

May 23rd, 2008 · No Comments

I was invited to give my views on some new chemistry in European Bioinformatics at a Meeting held by the CICAG group of the Royal Society, held at Burlington House, London.
Peter Murray-Rust set the scene by emphasising the importance of Open Data. He showed some fantastic work on data extraction by OSCAR from theses, where his group had parsed a synthetic chemistry thesis into an interactive graph of a reaction network. He also showed an SVG animation of this graph as a reaction sequence, all automatically generated from an OSCAR run. Peter pointed out in the subsequent discussion that data cannot be copyrighted, which was acknowledged by all publishers in the audience. The reality is different, however, because publishers’ licenses often prevent downloading of more than a few articles in a row. Detection of a robotic download for text mining comes with the danger of the whole university being disconnected. It is unclear to me how robotically parsing papers and extracting data would damage the business model of publishers. It could, of course, lower the number of subscriptions from […]

PMR: The main thing we took away was the importance of factual data. No-one disputed that facts cannot be copyrighted (though not all realised that copyright is only one of the methods used by publishers to control access and re-use – server-side beheading is completely effective). I asked the audience – more than 30 people: publishers, librarians, software companies, etc., but no actual chemists of course – whether anyone would object to our robots reading the literature and extracting the data from the papers, whether as text, images or tables. Half the audience thought I should; the rest didn’t vote against.
So, publishers, I’m going to start mining data from your sites. I hope you welcome this as a way forward to a new, exciting era of data-rich science publishing. I hope that if you don’t agree you’ll let me know. I wouldn’t like to start and then have the lawyers sent in. So please comment – it’s very important. I shan’t attack anyone who sends a reply. And you can send it by confidential email if you like.
There are a million new compounds each year in the scholarly literature. Our robots can produce huge amounts of good information from it. In some cases we get over 90% recall and precision – it depends on the type. This must be good for science. So please, publishers, let us know we can do it and we’ll publicly thank you. And if you don’t like the idea, please let us know why.
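For readers wondering what “robots extracting data” means in practice, here is a deliberately tiny sketch – nothing like OSCAR, which uses much richer chemistry-aware parsing. The pattern, the compound names and the sentence are all invented for illustration:

```python
# A toy data-mining robot: lift a melting point attached to a compound
# name out of running text. Real extraction needs chemistry-aware NLP;
# this regex is only a caricature of the idea.
import re

MP_RE = re.compile(
    r"(?P<compound>[A-Za-z][\w\-]*(?:\s[\w\-]+)?)\s+"
    r"(?:melts at|m\.p\.)\s+(?P<temp>\d+(?:\.\d+)?)\s*°?C"
)

def extract_melting_points(text):
    """Return {compound phrase: melting point in °C} found in the text."""
    return {m.group("compound"): float(m.group("temp"))
            for m in MP_RE.finditer(text)}

sentence = "Recrystallised benzamide melts at 128 C; the crude aspirin m.p. 134 C."
result = extract_melting_points(sentence)
print(result)
```

Multiply this by every property type and a million papers and the recall/precision figures quoted above start to matter.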
I’m in Barcelona at COST D37 helping to develop computational tools using CML. It is really changing the way things are done.
More on both fronts later.

Posted in Uncategorized | Leave a comment

RSC Open Access – what I think I'm going to say

I normally try to blog some of my presentations before the event so that at least there is some sort of record. It also allows for feedback from readers. So I’m talking on Open Data in Chemistry. I’ve been working very hard to create a new demo of the future that chemistry could have if it wished, and I think it’s working. I believe – not surprisingly – that every publisher of chemistry should look at it carefully. Because it helps to change the shape of technical chemical publishing.
The theme is “Open Data”. I’ve recently written a review of this in Elsevier’s Serials Review and it’s coming out RSN in a special issue on Open Access. It’s already on Nature Precedings. So if you want detailed aspects – a few months out of date, they are there.
Some bullet points:

  • Data are different from text. Open Access generally does not support data well (I make exceptions for ultra-strong-OA such as CC-BY and BBB-compliant licences, of the sort that PLoS and BMC provide). Green Open Access is irrelevant to Open Data (I think it makes it harder, others disagree).
  • Data matter. Chemistry is a data-rich science. We throw away over 90% of our data. We are all part of the problem, but publishers are one of the worst places for data loss.
  • Data must be made available by the authors. It’s now simple to do this. There is no technical excuse for not publishing chemistry data.
  • Data must be Open. It’s that simple. It can be done independently of Open Access
  • Data should be semantic. That’s harder but it’s happening. In our group we are producing the next generation of semantic tools for chemistry. I had hoped to announce a very important new project but the legal details haven’t been completely signed off. You’ll see it first on this blog.
  • Graduate students can – if they wish – provide semantic theses that can be checked and enhanced by machines. Free of many common errors.

What of the future? I’m not going to talk about business models or the rights and wrongs of Junk Science through government mandates. But I should make it clear that:

  • Closed access is harmful to chemical data. That’s a fact, not a political stance. We are 10+ years behind other data-rich sciences because we protect data in archaic silos.
  • Publishers have to choose, one way or the other. “Mumble” is no good. Either you are an enthusiastic publisher of Open Data or you are a closed publisher. Your choice.
  • The formal aggregators (Chemical Abstracts, Inorganic Crystal Structure Database, Cambridge Crystallographic Data Base) will see their market and importance steadily decline. I predict that in 5 years’ time there will be no role for ICSD in its current form. The CSD may follow. Chem Abs will survive, but in a form marginalised from the main web. Unfortunately at the moment several publishers (Wiley, Elsevier, Springer) do not expose crystallographic data but send it to the data centres, where we have to pay to get it out. This type of restrictive practice harms chemistry – I shall show how – and will be increasingly difficult to defend. Unfortunately when I write to these publishers they simply don’t reply.

So that’s what I intend to say. Roughly. And show some demos of the publication of the future. The datument.
Enthusiastic publishers can make it happen and chemistry is the best subject. Or you can delay it. You’ll succeed in delaying it, but eventually it will happen.

Posted in Uncategorized | 1 Comment

RSC Meeting: Open Access to Crystallographic Data

Yesterday I received a request from a publisher. I won’t name them but I don’t think the material is sensitive, and I need your help anyway. It’s very simple

We're in the process of producing new [High School] Chemistry
teaching and learning materials
We are looking at producing a number of rotatable
molecular models as part of our digital publishing offer.
[We would like someone] who can write CML files for us.

The molecules are apparently simple:

  • neopentane and 4 other simple organic molecules
  • buckminsterfullerene
  • diamond
  • graphite
  • sodium
  • sodium chloride

The organic molecules are easy (the publisher wasn’t concerned about what conformation was provided). I actually used ChemAxon’s Marvin – which emits CML – and it took about 5 minutes. So thank you ChemAxon, who make the software available for free, though it’s not Open.
In the near future it will be different. We shall have extracted all the molecules from CrystalEye and put them in an Open Repository. We shall have added the 250,000 molecules from the NCI which we computed with MOPAC and so it’s certain we would find the molecules we wanted in there. Moreover I would expect that the Blue Obelisk will soon have a complete workflow for drawing molecules and creating 3D structures.
But where can I get a 3D structure of sodium chloride? No, it’s not in Wikipedia (which is an encyclopedia, not a data- or knowledge-base). How long would it take you? And, before you think it’s simple remember it’s for commercial use. You will have to negotiate with the supplier of the information to determine whether you are allowed to redistribute the derivative work. Yes, it’s only data and data shouldn’t be copyright, should it?
So my question to the blogosphere is:

“Where can I get redistributable coordinates for the last 5 substances? How long did it take you to get them? And can you assure me that they can be re-used for commercial purposes?”

I expect buckminsterfullerene to be fairly easy. I think I have an answer for the solids. It took me about 10 minutes of half-remembered browsing by a strange route. I’ll accept coordinates in CIF, PDB or CML (no other format I know of supports crystallography, and we need the space group and cell dimensions). As a second best I’d accept a filled-out unit cell with Cartesian coordinates. But the coordinates aren’t the problem. Finding the structures is. Please reply and tell me how long it took. Remember I’m not interested in pictures, only coordinates. And I didn’t get any joy with the ICSD database of inorganic crystal structures (http://icsd.ccp14.ac.uk/icsd/icsd_help.html). I may not have navigated the complex interface correctly, but I only got:

Access forbidden!

You don’t have permission to access the requested object. It is either read-protected or not readable by the server.

The sodium chloride structure was determined about 100 years ago. It’s in the public domain. But where can I get it?
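To show how little is actually needed, here is a minimal CIF for rock-salt NaCl written from textbook values (space group Fm-3m, a ≈ 5.64 Å, Na at the origin, Cl at the cell centre; the structure goes back to Bragg, 1913) – typed from those values, not extracted from any database, so treat the precision as illustrative:

```python
# A minimal CIF for NaCl: space group, cubic cell, and the two
# symmetry-independent atoms. Everything a machine needs to rebuild
# the full structure fits in a dozen lines.
NACL_CIF = """\
data_NaCl
_chemical_formula_sum            'Na Cl'
_symmetry_space_group_name_H-M   'F m -3 m'
_symmetry_Int_Tables_number      225
_cell_length_a                   5.640
_cell_length_b                   5.640
_cell_length_c                   5.640
_cell_angle_alpha                90
_cell_angle_beta                 90
_cell_angle_gamma                90
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
Na1 0.0 0.0 0.0
Cl1 0.5 0.5 0.5
"""

def cell_volume(cif_text):
    """Cubic-cell volume in cubic Angstroms, read back from the CIF text."""
    for line in cif_text.splitlines():
        if line.startswith("_cell_length_a"):
            a = float(line.split()[-1])
            return a ** 3
    raise ValueError("no _cell_length_a in CIF")

print(cell_volume(NACL_CIF))  # about 179.4
```

The point is not the twelve lines – it’s that nowhere obvious serves them up with a licence saying you may redistribute them commercially.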
The publisher has offered me a fee. I don’t know how much, but I will suggest they donate it to support education in Africa unless anyone has a better idea.


Posted in Uncategorized | 2 Comments

Open Knowledge Foundation Open Visualisation Workshop

Open Visualization Workshop this Saturday in London (UK). Not sure what time, but it will be clarified, I am sure.


Just to confirm that the most popular date for the Open Visualisation
Workshop voted on at our Doodle page [1] is this coming Saturday 24th
May. The event will be an informal, hands-on workshop focusing on open
source visualisation technologies. More details are at:
http://okfn.org/wiki/OpenVisualisation/Workshop
Trampoline Systems have kindly offered to host the first workshop:
Trampoline Systems
8-15 Dereham Place
London
EC2A 3HJ
United Kingdom

Open Visualisation Workshop

Visual representation is a time-tested way of making large, complex bodies of information manageable – whether in the form of maps, timelines, graphs, or charts. Emerging digital technologies have revolutionised what is possible in this domain.

This is an informal, hands-on workshop for those who work with, or are interested in, open-source visualisation technologies, whether based on the web or on the desktop, whether for data or other kinds of knowledge.

Participating

If you are interested in participating, please:

  1. add your name to the list below;
  2. specify which dates in May you are free on the event’s doodle page;

  3. join the ok-london-announce and the open-visualisation list for further details.

List of participants

Activities

Feel free to add to the following list of things you would like to see on the day:

  • Examples and use cases of visualisation in different fields
  • Demos, walkthroughs, testing and comparing of different open-source visualisation software packages (see list on Open Visualisation page)

  • Client-side visualisation (SVG?)

Publicity

Please post here if you publicise the event to avoid doubling up!

Posted in Uncategorized | Leave a comment

New JChempaint and the Blue Obelisk

I am delighted to report Egon’s announcement of the new work on JChemPaint – one of the Open Source chemical drawing tools from the Blue Obelisk community:
22:24 19/05/2008, Egon Willighagen,
A quick screenshot, after some work on the JChemPaint code based on CDK trunk/. Nothing much to see, but a rather small code base, which is good. Today, I have set up cdk/cdk/trunk/ and cdk/jchempaint/trunk as Eclipse plugins, allowing the second to depend on the first. So, no more use of svn:externals. This is what it now looks like, and basically formalizes the end result of Niels’ work of last year:
A possible spin-off is that Bioclipse2 can use these plugins too, instead of defining plugins itself.
To reproduce the above screenshot, just import cdk/cdk/trunk and cdk/jchempaint/trunk into Eclipse, and run the TestEditor from the JChemPaint plugin.
PMR: There is a LOT going on quietly and relatively unannounced in the Blue Obelisk world. Bioclipse continues apace. CDK is being seriously refactored. So is JUMBO. Many of the incompatibilities are being removed. We’re introducing mavenization to most of the Java programs so that they can be combined without breaking the libraries.
This has several benefits. It allows people to combine functionality without worrying about implementation. It helps to identify and define the functionality. To provide use cases and tutorials. To get rid of methods which have been superseded by better ones.
None of this is dramatic. But the overall effects will be dramatic.
You will wake up one morning – whether in academia, government, or industry and say “The Blue Obelisk gives us more or less everything we need. It works. Why aren’t we using it?”
Chemoinformatics is now not rocket science. It’s clear what the mainstream activities are. And the Blue Obelisk is catching up rapidly on most of the current functionalities and is ahead in some:
  • high-quality semantic representation of chemistry
  • dictionaries and ontologies
  • transformation of molecules
  • descriptors
  • sub-structure search
  • collections of data
  • topological analysis
  • 3D structure generation
  • 2D diagram generation
These can be combined with other Open Source offerings in text-mining, machine-learning, etc.
What is there new in chemoinformatics in the last 10 years that the BO has to catch up with? I can’t think of very much. There are a few things we don’t have:
  • top-rank name-to-structure conversion (but there’s OPSIN from Peter Corbett and PMR, and it can be extended)
  • image to structure (but there are growing points, OSRA)
  • structure elucidation (SENECA…)
And we’ll address these. There is enough critical mass of collaborators, and they are working together. Different projects, but a shared vision.
And there is more in the pipeline.
So, those of you who spend lots of money on commercial tools – maybe it’s worth thinking about Blue Obelisk. But unfortunately we don’t wear suits, so you will have to work to take us seriously.
PMR: A minor quibble. Please don’t reproduce the above screenshot – which makes horrible use of stereochemical indicators. Throw it in the bin. Find something beautiful instead.
Posted in Uncategorized | 3 Comments

(Open) Data in crystallography

I’ll try to post at least twice on what I shall say at RSC on Thursday. (FWIW at least 2 readers have recently applied to go to the meeting – I should have started blogging earlier…)
The posts look to a positive future based on the ready technological availability of high-quality chemical data as a result of the information and instrumentation revolution. For those who espouse this view there’s a great future. For those who traditionally have seen data as something hard-won and chargeable there are turbulent times. The reality is that they will have to change. Not because I tell them so, but because the world does. Not explicitly, but in the relentless change in the way we do things – credit cards vs banknotes, mobile phones over landlines (they are knocking down the iconic red telephone boxes in the UK) – you know the list as well as I do.
So the model of “you publish your data in fragmented form; we type it up and sell it back to the community” is now no longer necessary or viable. We’ve seen gigabillion companies flourish in the last 10 years based on openly available data. (Yes, we have concerns about some of the Openness, but that’s another post.) There isn’t a market for micropayments for scientific information. (I exclude 40 USD to read a paper, as it is hardly “micro” – it’s quite a good meal in some places.)
When I started scientific research data were hard to come by. I was a crystallographer and I’m going to use that discipline as I understand it and also as it contrasts different approaches well. And it has exceptionally high-quality data.
To solve a crystal structure I had to record the diffraction pattern. This was done using methods developed by Karl Weissenberg (I was appalled that I couldn’t find a Wikipedia article). The diffraction pattern looks like the examples here (or search Google for “Weissenberg photograph”). I solved 6 crystal structures for my doctorate and each might produce several thousand “spots”. The spots were of different intensity, and this could be used to determine the positions of the atoms in the structure – incredible, even now. The intensities were measured by eye (usually compared to a calibrated scale). So I measured some tens of thousands of spots.
This was the raw data. It mattered. It was hard-won, every spot hand-recorded in the lab book. Then each was typed (on a teletype) onto punched tape and fed into a Mercury computer (later the KDF9). When the structure was published every spot was published in the journal. It took pages, but the journals and editors required them.
Why? Because they were the primary record of the experiment. They were the proof that you had made the right deductions about the atomic positions. Over-ambitious or sloppy claims were regularly demolished by self-appointed critics such as Jerry Donohue and Richard Marsh (and several others). If they saw a structure they didn’t feel was correct they would re-type the data and re-analyse the structure. No-one likes being corrected in public print but we all accepted it was completely appropriate. It’s that public criticism which has helped to keep crystallography at the top of data quality.
There’s a complementary aspect. As science evolves it’s often possible to re-analyse existing data. So, for example, many Weissenberg photographs recorded so-called anomalous dispersion, which can be used to determine the chirality of molecules. In many cases the effect was clear enough to observe in retrospect, but the authors weren’t aware of the phenomenon. It would be possible to revisit the data and re-analyse. Similarly the advance of theoretical methods and programs in crystallography means that more effects can be corrected or analysed in the computation. If I revisited my DPhil data I could use anisotropic refinement to get improved coordinates. It’s probably not worth it – cheaper to re-synthesize and re-collect the data. But it’s possible.
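That sort of re-analysis can be sketched in a few lines: given the atomic positions, recompute the structure factors and compare them with the measured intensities. This toy uses constant scattering factors, which real refinement never does; for rock-salt NaCl the mixed-parity reflections vanish and the others alternate between 4(fNa+fCl) and 4(fNa-fCl):

```python
# Recompute structure factors F(hkl) for rock-salt NaCl:
# fcc Na at (0,0,0), fcc Cl offset by (1/2,1/2,1/2).
import cmath

FCC = [(0, 0, 0), (0.5, 0.5, 0), (0.5, 0, 0.5), (0, 0.5, 0.5)]

def structure_factor(hkl, atoms):
    """F(hkl) = sum_j f_j * exp(2*pi*i*(h*x_j + k*y_j + l*z_j))."""
    h, k, l = hkl
    return sum(
        f * cmath.exp(2j * cmath.pi * (h * x + k * y + l * z))
        for f, (x, y, z) in atoms
    )

f_na, f_cl = 10.0, 17.0  # toy constant scattering factors
atoms = [(f_na, pos) for pos in FCC] + \
        [(f_cl, (x + 0.5, y + 0.5, z + 0.5)) for x, y, z in FCC]

# (1,0,0) is a mixed-parity reflection and vanishes; (1,1,1) and (2,0,0)
# give the alternating strong/weak pattern characteristic of rock salt.
for hkl in [(1, 0, 0), (1, 1, 1), (2, 0, 0)]:
    print(hkl, round(abs(structure_factor(hkl, atoms)), 3))
```

This is exactly why the published intensities mattered: anyone could recompute |F| from a proposed model and see whether it fitted the spots.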
Then the journals started to worry about page space. These intensities took up a lot of space. So they were no longer published (although they might be microfiched). And then even the final coordinates were no longer published – the only way you could get them was by requesting microfiches from the publishers.
And this culture has often stayed with us. There is now no technical reason why all the scientific data shouldn’t be part of the publication. But it usually isn’t. And that leads to several consequences:

  • Loss of data
  • A market in data

I’ll address these later.

Posted in Uncategorized | Leave a comment

Blog comments are sometimes failing to get through

2-3 people have mailed me to say that their comments don’t get through. This may be due to “upgrades” in WordPress which we made a week or two back.
If your comment doesn’t appear, please mail me. I never fail to post comments unless they are obvious spam.
P.

Posted in Uncategorized | 2 Comments