Open Access and Open Data: licences, policies and other constraints

I am trying to weigh the value and deficiencies of attaching a formal Open Data licence to a chunk of information. I had written a good deal when I also saw Peter Suber’s blog post on whether or not to allow derivative works (Thomas Lemberger, Open Access: Derivs or No Derivs? It’s your call! The Seven Stones, November 1, 2007; excerpt follows). [This is therefore a long post, as I list the arguments – which are instructive – and make additional comments. It is followed by a second post on the relationship to Open Source.]

TL: I am pleased to announce that Molecular Systems Biology has changed its license to publish for all articles accepted after October 1st, 2007 (see updated instruction to authors). The new license allows our authors to choose between two Creative Commons licenses: one that allows the work to be adapted by users (by-nc-sa), the other that does not allow the work to be modified (by-nc-nd)….
Our content is therefore not only freely available to all but our authors can now also decide to make their research fully open for reuse and adaptations.
The current explosive development of data and text mining, semantic-web and information aggregation technologies is profoundly changing the publishing landscape (eg Tim O’Reilly visits Nature). When we were contacted a few months ago by the OpenWetWare community who envisaged the “wikification” of one of our Reviews (see post), we decided that Molecular Systems Biology should strongly support such initiatives by providing our content in as open a form as possible. Our Senior Editors fully supported this transition to a more open license but also encouraged us to allow authors to have some influence on the decision.
Providing authors the possibility to choose their license has some decisive advantages: first, by enforcing a conscious choice by authors it will inevitably raise awareness on the implications of the various publication licenses; second we would like to see the question of “what should be open access” being addressed in a more democratic way by the community itself rather than through incantations of what the ideal solution should be. My guess – and my personal hope – is that most of the authors will indeed choose the most open version of the license, but I think that it is important to respect the opinions of those who think differently and who would feel uncomfortable with the idea that their article can be remixed or adapted without them being aware of it.
Our attitude is motivated by the fact that, at Molecular Systems Biology, we see the role of a scientific journal more as a catalyst facilitating and accelerating scientific discovery rather than a policy-making instrument. What is Systems Biology? Rather than providing a rigid definition of a rapidly evolving field, we prefer to let the community define the scope of this field and we adapt to it. What is open access? Rather than relying on a dogmatic position in a still fluid situation, we prefer to let scientists define their priorities.

PeterS:
Molecular Systems Biology is an OA journal jointly published by Nature and EMBO. Lemberger is the editor. His message above is a revised version of a letter to the editor published yesterday in PLoS Biology. Here’s a reply to Lemberger by Mark Patterson (PLoS Director of Publishing) and Catriona MacCallum (author of the article to which Lemberger was responding):

MP: We are grateful to Thomas Lemberger for his response to the recent PLoS Biology editorial concerning the confusion about open versus free access. We thank him as well for pointing out that authors at Molecular Systems Biology are now given a choice between two Creative Commons licenses when they publish their work. The announcement of the new alternative “Share Alike” licence option for their authors was not available as the PLoS Biology editorial went to press. We certainly agree with him that open access offers tremendous potential for researchers and scientific publishing. However, in our view, no matter how well-intentioned this new policy might be, it will only lead to further confusion.
As noted in our editorial, all the research articles published in Molecular Systems Biology still end with the statement that the article is published under a Creative Commons Attribution License – see for example this article. This remains misleading, because the Creative Commons Attribution License allows any kind of derivative reuse subject only to appropriate attribution of the authors. If you follow the license link at the bottom of the article cited above you find that the license is quoted as an “attribution, non-commercial, no derivative works” license – one of the most restrictive of the Creative Commons licenses (see this summary of the licenses). The Creative Commons web site explains the meaning of “no derivative works” as follows: “You may not alter, transform, or build upon this work”. This is not open access.
The new “share alike” choice now offered to Molecular Systems Biology authors is closer to the accepted definition of open access, but includes the “non-commercial” and “share alike” restrictions, which means that any derivatives that are created have to be distributed under the same license terms. While we agree with the sentiments underlying this licence (in that it potentially promotes open access) – it is still restrictive, which is why open access publishers such as PLoS, BioMedCentral and Hindawi have chosen to use the Creative Commons Attribution License.
In effect, Molecular Systems Biology offers authors the choice between free access to their work and open access (with some restrictions). This means that the content of the journal is not all available open access. It is therefore not correct to say the “Molecular Systems Biology is an open access journal” as it does at the bottom of the research articles.
It is unfortunate that the PDFs of the articles published in Molecular Systems Biology lead to further confusion. The PDF of the article available at this link http://www.nature.com/msb/journal/v3/n1/pdf/msb4100156.pdf has a copyright line at the top indicating that the copyright belongs to EMBO and the Nature Publishing Group and that all rights are reserved….
It seems to us that we share many of the same goals as the editors of Molecular Systems Biology, and so we urge them to work with their publisher to rationalize and simplify the license policies of their fine journal.

PeterS: Comment. A few background thoughts:

  • It’s healthy and useful to debate which licenses best promote research. Moreover, it might even be productive. The best way to debate subtly different shades of openness is to debate explicit, well-crafted licenses, and all the CC licenses are explicit and well-crafted.
  • However, the public definitions of OA (from Budapest, Bethesda, and Berlin) do not have the same sharp edges that explicit licenses do. Moreover, they differ on some fine points, including the one under discussion here. For example, the Bethesda and Berlin definitions allow derivative works, but the Budapest definition allows authors to disallow derivative works that would interfere with “the integrity of their work”.
  • Because the BBB definitions don’t settle the question, I think it’s more productive to debate policy than labels – what promotes research rather than what deserves the name of “open access”. If everything that satisfies at least one of the BBB definitions is OA, then both sides are talking about OA here.
  • I’m not saying that clear labels aren’t useful, or that the label “OA” isn’t usefully clear. I’m saying that when the label covers both policies under discussion, then we don’t gain by debating the label and should focus instead on specific advantages and disadvantages of the two policies.

PeterS: One response to this situation might be to revisit and revise the public definitions. There might be some gain in that. But even if we did, I’d want any newly revised definition to include some latitude for variation and flexibility – within limits, of course.

  • My own preference is for the straight, unadorned attribution license (same as PLoS Biology), essentially permitting every use except plagiarism. I wish all OA journals would use it. But several other decisions, including the decision to disallow derivative works, fall within the boundaries of OA.

PMR: I think this is a very timely analysis, with Peter’s usual clarity. For ca. 2 years I have regarded BBB as almost scriptural – an algorithm for OA. I am coming round to the view that an “OA sticker” on a document or chunk of data is a useful indication that the author has thought about the issue and intends to make something freely available to the community. A CC-BY licence is clear and a great step forward – like Peter and Mark I urge everyone to use it, since all the other CC-* licences cause problems.
However CC-* is a coarse-grained instrument. No doubt it will be tested in law for academic publications or scientific data ( … tell me if it already has) but things shouldn’t normally get that far. There should – in addition – be a clear statement of the policy of the authors, the publishers, the funders, or all of them. This policy might not have legal standing but should command moral respect in the community.
Since my colleagues and I publish Open Data, we have encountered some of the problems (but by no means all). Here are some examples of issues not easily covered by a licence (it could be argued that CC-ND covers these, but it is too blunt IMO):

  • What components of a data set (including metadata) must be retained so as not to break its integrity?
  • Should reference data and formal specifications be regarded as sacrosanct? I don’t really want people editing the CMLSchema and republishing it – all the software will break.
  • What happens when a re-user edits mistakes into a data set which still retains the address and authorship of the primary author?
  • Can an author require that institutional branding (e.g. logos) is retained on recirculated works? Can other institutional logos be added?
  • Can these works be used for marketing third-party products, with implied endorsements?

Would a non-legal policy help in these cases?
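A non-legal policy could at least be backed by simple tooling. As a hedged sketch of the idea (the field names and rules below are hypothetical, not any real CrystalEye or CML convention), a re-distributor could check that the components the originators declare integrity-critical survive unchanged:

```python
# Hypothetical integrity check: does a redistributed copy of a data-set
# entry retain the components the originators declare immutable?
# The field names here are illustrative assumptions, not a real schema.

IMMUTABLE_FIELDS = {"primary_author", "source_address", "reference_data"}

def check_integrity(original: dict, copy: dict) -> list[str]:
    """Return a list of policy violations found in `copy`."""
    violations = []
    for field in IMMUTABLE_FIELDS:
        if field not in copy:
            violations.append(f"missing required field: {field}")
        elif copy[field] != original.get(field):
            violations.append(f"altered immutable field: {field}")
    return violations

original = {"primary_author": "A. Author",
            "source_address": "Unilever Centre, Cambridge",
            "reference_data": [1.54, 1.09],
            "comment": "free-text, may be edited"}

# A copy in which a re-user has silently edited the reference data
edited = dict(original, reference_data=[1.55, 1.09])

print(check_integrity(original, edited))
# → ['altered immutable field: reference_data']
```

Such a check has no legal force, but it makes the policy concrete enough for the community to apply it.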
… next I compare the Open Source approach.

Posted in open issues | 1 Comment

Open Access and Tropical Diseases

This should be of interest to those following cyberscience and collaborative projects: WHO launches OA portal on tropical diseases (Peter Suber, Open Access News): “TropIKA is a new OA portal for research and information on tropical diseases. (Thanks to SciDev.net.) From its about page:”

The [World Health Organization’s] Special Programme for Research and Training in Tropical Diseases (TDR) has established TropIKA.net as a global knowledge management electronic portal to share essential information and to facilitate identification of priority needs and major research gaps in the field of infectious diseases of poverty….
TropIKA.net (Tropical Disease Research to foster Innovation and Knowledge Application) is a web-based platform for the acquisition, review and sharing of current information and knowledge on:
  • Public health research needs and scientific opportunities
  • Research-based evidence in support of control and policy
  • High profile research activities and control projects
  • International research funding and support opportunities
  • Potential innovations for interventions and control of infectious diseases of poverty.

Rationale for TropIKA.net
…Informed participation of disease endemic countries in the global research agenda setting is often prevented by limited access to scientific information and essential knowledge. Rapid advances in the field of information technology have made it possible to share and deliver information at higher speeds and lower costs and several initiatives aiming at enabling access to high quality, scientific information, via Internet, are now in place. However researchers and policy makers face the other problem of haphazard flow of scientific information for which they lack time to screen, awareness of what is relevant and essential for their domain of activities and skills for interpretation and application in health interventions….
TropIKA is designed to enhance access and to share essential knowledge with health researchers and policy makers dedicated to improving control of infectious diseases of poverty….
Partners participating in the TropIKA.net initiative to date

  • TDR (UNICEF/UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases)
  • BIREME/PAHO/WHO, in Brazil: hosts and manages the portal
  • HINARI: provides access to full text journals in specific countries
  • The Global Health Library (GHL) and the Virtual Health Library (VHL)
  • Public Library of Science Neglected Tropical Diseases (and PLoS in general) for sharing “open access” scientific content and technology
  • SciELO journals and other open access journals….
Posted in open issues | Leave a comment

Motivation for Wikipedian contributors

Why do we contribute to Wikipedia? Here’s a report of a survey. I add my additional motivations below:

20:19 30/10/2007, Andy Oram,

A recent article titled “What Motivates Wikipedians?” (written by Oded Nov in Communications of the ACM, November 2007, Vol. 50, No. 11, pp. 60-64) attracted my eye because I’ve been doing similar research on why people write computer documentation. Back in June I published the results of a survey that over 350 people filled out. I wanted to see what more rigorous research would turn up.
[…]The article is not available to the general public online [PMR: Closed access I assume], but I can summarize the motivations Nov tested in his survey:

  • Altruism and humanitarian concerns
  • Responding to requests by friends or attempting to engage in an activity viewed favorably by important others
  • Chances to learn new things
  • Preparing for a new career or signaling knowledge to potential employers
  • Addressing personal problems, such as guilt at being more fortunate than others
  • Ego needs and public exhibition of knowledge
  • Ideological concerns, such as belief that information should be free
  • Fun

This is a pretty comprehensive list of personal reasons for doing something that doesn’t offer any immediate personal payback. But what’s missing from this list? The motivation that means the most to me personally, and probably to many people who write: they actually have something to say!
[…]
But there are lots of urgent social issues where someone feels he or she benefits from other people having the correct facts: political controversies, public health problems whose cures depend on widespread compliance, and so on. In my own field, people are writing computer documentation because they want to promote the software they’re writing or using. The intrinsic motivations in the list may add a bit of extra incentive, but the main goal is to get one’s point of view heard. And the Wikipedia’s fame, along with the high rankings it receives in web searches, ensures that lots of casual web users will read its entries. If you care about people hearing your point of view, you’d better damn well write for Wikipedia.
[…]

PMR: What articles have I started, and why? Here’s my list (I got this from a list of my contributions – is there a list of the pages I have started?):

  • Penicillin binding proteins – showing MSc students the value of Wikipedia (Birkbeck MSc in Bioinformatics)
  • Carotenoid oxygenase – same reason
  • Interleukin 7 – same reason
  • Molecular graphics – as a founder member of the Molecular Graphics Society (now MGMS) I felt the history and ethos should be recorded
  • Molecular model – because I have spent many days making molecular models and I love them, and again wish to recapture the history
  • Pseudorotation – because I love this concept and have written papers on it
  • Round-trip format conversion – because I really care about preventing semantic loss during data transformations
  • Open Data – this is perhaps the most important. When I started with the idea of “Open Data” I didn’t know whether the term was in frequent use, and starting a Wikipedia page was the most effective way of finding out. If it already existed under another name, someone would tell me. As it happens it didn’t, and the entry was evidently needed, because many others have since contributed.

So – in the most general terms – a fair overlap with the list above. I hope these aren’t ego trips. I would add the educational aspect – I think communal preparation of a Wikipedia entry is an amazing class activity. And I’d probably add “attempts to create a community in common agreement on the meaning of a concept”.

Posted in open issues | Leave a comment

Joe Townsend's PhD and data-driven science

Joe was examined yesterday by Martin Dove (Cambridge) and Henry Rzepa – of course nothing is official, but he was given the indication “minor corrections”. So I will congratulate Joe on having only minor corrections to make. I won’t put words into Joe’s mouth, but we have discussed the likelihood of his thesis being Openly available (Cambridge does not – yet – have a mandatory requirement).
Joe has pioneered much of the work we have done here – OSCAR, OSCAR-DATA, early natural language processing, eScience, data-driven science. I and others owe a great deal to that. It hasn’t been smooth – in several cases the vision was ahead of the technology or the data. For example it wasn’t then possible to extract chemical reactions from published papers by robotic means. (We’re making progress in this area.) But it was possible to extract crystallography, and that’s the basis of CrystalEye. Nick built the extraction technology and Joe has pioneered the data-driven science – is it possible to validate thousands of data items automatically by comparing experiment and theory? The answer – for crystallography – is definitely yes. And, somewhat as a surprise to us, he found that the major cause of variance was experiment, not theory. But in one or two cases it has revealed effects due to the method of calculation and that is “new science”. Not world-shaking and probably of the “oh well, we all know that already” sort, but still science hidden in the data.
That’s the same philosophy Nick has been investigating with the NMR – can we compare high-level calculations with experiment and use them to analyse the variance? And if we find systematic effects, can these point to science in the experiment or in the theory? More on that later…

Posted in crystaleye, data, nmr | Leave a comment

Dissemination of CrystalEye

There has been considerable interest in having access to the bulk knowledgebase of CrystalEye – WWMM, which contains primary data for over 100,000 crystal structures and probably over 1 million fragments derived from them. We are obviously excited to see the interest and will be talking this morning, and possibly later today at the SPECTRa-T meeting, about the problems of disseminating repositories and knowledgebases, where we have some experts in the field.
Firstly I reiterate that CrystalEye is Open Data according to the Open Knowledge Foundation definition. It does not actually carry a licence but uses this as a meta-licence. So anyone is legally allowed to take copies of the contents and re-use them, including for commercial purposes. We shall not waver in that.
There have been recent suggestions that to save bandwidth people should make copies of the data and redistribute them on DVD. We would ask you to refrain from doing this for the immediate future for several reasons:

  • The architecture of CrystalEye and its dissemination through AtomPP is new. Jim Downing hasn’t yet had the chance to explain his vision for the dissemination. Please give Jim that chance.
  • It is not trivial to take a physical snapshot of dynamic hypermedia. CrystalEye is updated daily, has over 1 million hyperlinks, and contains several distinct meta-views of the knowledge. This cannot be captured in a single process. Therefore any physical copy will involve significant loss of metadata. This loss could be so significant that the copy was effectively broken.
  • It seems clear from the 2-3 days discussion that different communities want different views of CrystalEye. Some want links to the entries as arranged by the literature, others want it organised by fragments. These are almost completely orthogonal.
  • Copying the data and redisseminating it without reference to the originators is, in effect, forking (see below).
  • We are critically concerned about versioning and annotations. CrystalEye has effectively nightly versions and it is important that when people use it for data-driven science it is clear that they are referring to PRECISELY the same collection.
  • We have thought carefully about sustainability of CrystalEye and have had discussion with appropriate bodies. These would maintain the Openness, but would look to sustainable processes. I cannot give more details in public.
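On the versioning point above: one lightweight way to make “PRECISELY the same collection” citable is a deterministic digest over a snapshot’s contents. This is only an illustrative sketch of the idea, not CrystalEye’s actual mechanism; the entry ids and content hashes are invented:

```python
import hashlib

def snapshot_digest(entries: dict[str, str]) -> str:
    """Deterministic digest over {entry-id: content-hash} pairs.
    Sorting the ids makes the digest independent of insertion order,
    so two people holding the same collection compute the same id."""
    h = hashlib.sha256()
    for entry_id in sorted(entries):
        h.update(entry_id.encode())
        h.update(entries[entry_id].encode())
    return h.hexdigest()[:16]

# Two made-up nightly snapshots: the second adds one entry
night1 = {"msb4100156": "a1b2", "acta2007x": "c3d4"}
night2 = dict(night1, newentry="e5f6")

print(snapshot_digest(night1) == snapshot_digest(dict(night1)))  # True – same collection
print(snapshot_digest(night1) == snapshot_digest(night2))        # False – collections differ
```

A data-driven-science paper could then cite the digest alongside the date, and any physical copy that fails to reproduce it is demonstrably not the same collection.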

Please note that many Open resources ask or require that their database is not distributed in toto without their involvement. I think this is true of PubChem – anyone can download individual entries and re-use them, but it is required (and common courtesy) to ask before downloading the whole lot. We have done this, for example, for the names in PubChem which are now part of OSCAR.
Then there are the more intangible aspects. It is appropriate that this is seen as a creation of the authors, their collaborators, the Unilever Centre for Molecular Science and Informatics, and the University of Cambridge.  It would be appropriate that these are the first entities that the world should look to if there is to be a physical distribution of some of the resource. At present we see a physical resource as potentially creating as many problems as it solves – whether done by us or not. CrystalEye is much more than the data contained in it – a physical snapshot gives as much indication of this as a series of photographs does of television.
So before distributing the data without our involvement, please let’s discuss the aspects – and at present this blog is the appropriate place. I reiterate that no comments are moderated out.
========================
From WP: Fork: (this relates to software, rather than data but the principles overlap)

In free software, forks often result from a schism over different goals or personality clashes. In a fork, both parties assume nearly identical code bases but typically only the larger group, or that containing the original architect, will retain the full original name and its associated user community. Thus there is a reputation penalty associated with forking. The relationship between the different teams can be cordial (e.g., Ubuntu and Debian), very bitter (X.Org Server and XFree86, or cdrtools and cdrkit) or none to speak of (most branching Linux distributions).
Forks are considered an expression of the freedom made available by free software, but a weakness since they duplicate development efforts and can confuse users over which forked package to use. Developers have the option to collaborate and pool resources with free software, but it is not ensured by free software licenses, only by a commitment to cooperation. That said, many developers will make the effort to put changes into all relevant forks, e.g., amongst the BSDs.[citation needed]
The Cathedral and the Bazaar stated in 1997 [1] that “The most important characteristic of a fork is that it spawns competing projects that cannot later exchange code, splitting the potential developer community.” However, this is not common present usage.
In some cases, a fork can merge back into the original project or replace it. EGCS (the Experimental/Enhanced GNU Compiler System) was a fork from GCC which proved more vital than the original project and was eventually “blessed” as the official GCC project. Some have attempted to invoke this effect deliberately, e.g., Mozilla Firefox was an unofficial project within Mozilla that soon replaced the Mozilla Suite as the focus of development.
On the matter of forking, the Jargon File says:

“Forking is considered a Bad Thing—not merely because it implies a lot of wasted effort in the future, but because forks tend to be accompanied by a great deal of strife and acrimony between the successor groups over issues of legitimacy, succession, and design direction. There is serious social pressure against forking. As a result, major forks (such as the Gnu-Emacs/XEmacs split, the fissioning of the 386BSD group into three daughter projects, and the short-lived GCC/EGCS split) are rare enough that they are remembered individually in hacker folklore.”

It is easy to declare a fork, but it can require considerable effort to continue independent development and support. As such, forks without adequate resources can soon become inactive, e.g., GoneME, a fork of GNOME by a former developer, which was soon discontinued despite attracting some publicity. Some well-known forks have enjoyed great success, however, such as the X.Org X11 server, a fork from XFree86 which gained widespread support from developers and users and notably sped up X development.

Posted in crystaleye, data | 2 Comments

We agree the structure is wrong!

Nick Day’s procedure has generated the agreement – and disagreement – between observed and calculated NMR shifts. In my post Open Notebook NMR – the good and the ugly I highlighted one of the worst disagreements. I hesitated to say “the structure is wrong” because I am not an expert in either NMR or this group of natural products, but I would have bet on it.
Now there is general agreement that the structure is wrong:

Wolfgang Robien Says (in comment on post above)
[… details snipped…]
(5) Application of ANY PREDICTION SOFTWARE PACKAGE should put a lot of RED and/or YELLOW markers on this entry ….. CSEARCH does this AUTOMATICALLY during display – a NN-spectrum is calculated on-the-fly whenever you display an entry ! (ACD does it too according to the examples I have seen)
Conclusion: Guys, its all there, we only need a few drops of glue putting the pieces together
PMR: I think we have good agreement here. The glue that is needed is between the NMR community and the publication process, ultimately to generate semantic publishing. My involvement is to try to “sell” the idea of semantic publishing and data validation to the chemical authors and to the publishers. If an author is spending 3000 USD to publish a paper, then it should not be impossible to find part of that to validate the data.
Hopefully this acts as a signal to reduce the number of wrong structures in future.
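Wolfgang’s RED/YELLOW markers suggest a simple, automatable check: flag each shift by its deviation from the predicted value. Here is a minimal sketch of the idea; the thresholds and the shift values are illustrative assumptions, not CSEARCH’s or ACD’s actual parameters:

```python
def flag_shifts(observed, calculated, yellow=5.0, red=10.0):
    """Mark each atom by |observed - calculated| shift deviation in ppm.
    Thresholds are illustrative, not any real package's values."""
    flags = []
    for obs, calc in zip(observed, calculated):
        dev = abs(obs - calc)
        if dev >= red:
            flags.append("RED")
        elif dev >= yellow:
            flags.append("YELLOW")
        else:
            flags.append("ok")
    return flags

obs  = [128.4, 77.2, 35.1, 210.0]   # made-up observed 13C shifts (ppm)
calc = [127.9, 76.0, 47.3, 198.2]   # made-up calculated predictions

print(flag_shifts(obs, calc))  # → ['ok', 'ok', 'RED', 'RED']
```

An entry littered with RED markers is exactly the kind that should be caught before publication rather than after.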

Posted in nmr, open issues | 3 Comments

ATOMic crystals

How do we disseminate our CrystalEye data? If we use one large file, even zipped, it will run into gigabytes. Also it can’t easily be updated.  Jim Downing has started to set up AtomPP feeds for disseminating  it. Geoff Hutchison asks:

  1. Geoff Hutchison Says:
    October 29th, 2007 at 10:03 pm
    Peter, as I mentioned to you earlier, I think many of us are looking for the open data in Crystal Eye, particularly fragments. Surely there’s an easier and more efficient way to get the data than AtomPP feeds. Will you have periodic dumps — say I get this quarter’s crystal structures and then can use the AtomPP feeds to just pull new entries?

I think this depends on where someone starts. If they are a regular user of CrystalEye then AtomPP would seem to be the best approach – it means you don’t have to remember when to download and what size the chunks are. Is it a simple method to get the historical material when starting out? Jim, perhaps you can help here.
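For readers wondering what the incremental pull might look like in code, here is a minimal sketch. The feed below is invented, not the real CrystalEye feed; it merely assumes Atom-style entries with `id` and `updated` elements. A client keeps the timestamp of its last sync and takes only newer entries, while the bulk history would come from a one-off dump:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A minimal, made-up Atom feed standing in for a CrystalEye-style feed
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>urn:entry:1</id><updated>2007-10-28T00:00:00Z</updated></entry>
  <entry><id>urn:entry:2</id><updated>2007-10-30T00:00:00Z</updated></entry>
</feed>"""

def new_entries(feed_xml: str, since: str) -> list[str]:
    """Return ids of entries updated after `since`.
    ISO-8601 UTC timestamps compare correctly as plain strings."""
    root = ET.fromstring(feed_xml)
    return [e.findtext(ATOM + "id")
            for e in root.findall(ATOM + "entry")
            if e.findtext(ATOM + "updated") > since]

print(new_entries(FEED, "2007-10-29T00:00:00Z"))  # → ['urn:entry:2']
```

The design choice is the one Geoff hints at: the feed is cheap for keeping up to date, but it is the wrong tool for fetching years of history, which is why a periodic dump plus feed-based catch-up is attractive.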

Posted in crystaleye, data, semanticWeb | 7 Comments

Open NMR publication: possible ways forward

Wolfgang Robien has posted some valuable comments and I think this gives us a positive way forward. I won’t comment line by line but refer you to the links. For background, Wolfgang suggests that I have a religious take on this and am trying to impose it on the NMR community, which already has adequate and self-sufficient processes. [In all this we differentiate macromolecular/bioscience from small-molecule/chemistry, which have completely different ethics and practice. Here we refer only to small molecules.] I am not religious about NMR.
I’ll start by saying that I think Wolfgang and I may have very significant common ground, and this is an attempt to address it. I also think that our differences are confined to different fields of endeavour. In summary I believe that:

  • NMR data are published in non-semantic ways (PDF, etc.) and that this destroys much useful machine-interpretable information. By contrast crystallography is semantic and the quality at time of publication is very much higher.
  • A significant number of papers contain NMR data which do not correspond exactly to the structures – often referred to as “wrong”. By contrast this hardly ever happens in crystallography.
  • Crystallographic data is subject to intense validation before publication and the algorithms and code are freely available. This has raised the quality of crystallography over the last 15 years and the data in crystalEye show this clearly. With the advent of computational methods in NMR (whether HOSE or GIAO) it should be possible to carry out similar validation before publication.
  • The crystallographic data as published constitute a global knowledgebase which can be re-used in many ways in a semantically valid framework. This is currently not possible for NMR but it could be if the community wished it.

Wolfgang mentions religiosity – I try not to be but the publishing community is rapidly fracturing over the Open-Closed line and I personally see this as having little middle ground. Others disagree. I am insistent that the words “Open Access” be used in a manner which is consistent with the Open Access definitions, just as for Open Source. There is a tendency for people to describe resources as Open when they do not conform to the definition. I hold the same view for Open Data.
Where I think we have common ground is that we both agree that:

  • there are too many publications where the NMR-structure is simply wrong
  • it would be possible to validate many of these using software
  • that it would be useful to publish the spectra in semantic form rather than text and PDFs. (Wolfgang may disagree here and see value in having the data retyped by humans, and if so I’d like to see the case. In practice we have shown that the data can go straight from the instrument to the repository without semantic loss, but that the business processes are not yet clear).

In principle I would be very happy to collaborate on developing an NMR protocol which would validate data in publications. I think we would need a variety of methods and data resources. We can’t do this in Nick Day’s project and I can’t speak for Henry, but it sounds promising. Methods like this exist for crystallography and thermochemistry (ThermoML). Spectroscopy and computational chemistry are the most tractable and valuable next steps.
One reason we used NMRShiftDB was that we knew that the data were heterogeneous and possibly contained errors. This simulated what we might find in publications. We can use our OSCAR and other software to extract spectra and structures from the literature, though the assignments are harder without explicit numbering schemes in connection tables. Clearly the requirements on analysing questionable data and creating a validation procedure are more difficult in this case, but we are prepared to defend it.
Ultimately my vision is that all NMR in journals would be validated and in semantic form (e.g. CMLSpect) before being published. Other disciplines have already achieved it, so it’s a matter of communal will rather than absence of technology. I think we have a mutual way forward, though not in the timescale of Nick Day’s thesis.
Wolfgang Robien Says:
[links to comments broken in WordPress]
WR:
OK, you are not an NMR spectroscopist, but you want to liberate NMR data from the pages of the journals:
PMR: This is exactly right. It is virtually the sole motivation for this work. Anything else (NMRShiftDB/WR, GIAO/HOSE-NN) is secondary. It is also coupled to the capture of data from eTheses (the SPECTRa and SPECTRa-T projects) where we have shown that most data rapidly gets lost. It is about validation, semantic quality, dissemination, preservation, and closely tied to the capture of academic output in institutional and other repositories.
WR: There are so many people around working in this field, who are doing excellent science
PMR: I am unaware of major scientific laboratories that are making major efforts to change the way NMR spectra are published in journals or theses, or captured in repositories. I do claim to be aware of semantic scientific publication and repositories and am regularly invited by both the Open and Closed publishers to talk about this. If there is major work ongoing in pre-publication validation and semantic output of NMR, I haven’t heard of it.

Posted in data, nmr, open issues | 3 Comments

WWMM: The World Wide Molecular Matrix

Since I have been asked to talk about the WWMM here’s a bit of background… When the UK e-Science project started (2001) we put in a proposal for a new vision of shared chemistry – the World Wide Molecular Matrix. The term “Matrix” comes from the futuristic computer network and virtual world in William Gibson‘s novels where humans and machines are coupled in cyberspace. Our proposal was for distributed chemistry based on a Napster-like model where chemical objects could be shared in server-browsers just as for music.
It seemed easy. If it worked for music it should be possible for chemistry. Admittedly the content was more variable and the metadata more complex. But nothing that shouldn’t be possible with XML. And when we built the better mousetrap, the chemists would come. Others liked the idea, and there is an article in Wikipedia (Worldwide molecular matrix).
But it’s taken at least 5 years. The idea seems simple, but there are lots of details. The eScience program helped – we had two postdocs through the Cambridge eScience Centre and the DTI (Molecular Informatics “Molecular Standards for the Grid”). As well as CML we listed 10 technologies (Java, Jumbo, Apache HTTP Server, Apache Tomcat, Xindice, CDK – Chemistry Development Kit, JMol, JChemPaint, Condor and Condor-G, PHP). We’re not using much PHP, no Xindice, and prefer Jetty to Tomcat, but the rest remain core components. We’ve added a lot more – RDF, RSS, Atom, InChI, blogs, wikis, SVN, Eclipse, JUnit, and a good deal more. It’s always more and more technology… OpenBabel, JSpecView, Bioclipse, OSCAR and OSCAR3…
But we needed it. The original vision was correct but impossible in 2002. Now the technology has risen to meet the expectations. CrystalEye, along with SPECTRa, is the first example of a fully functioning WWMM. It’s free, virtually maintenance-free, and very high quality. We have developed it so it’s portable and we’ll be making the contents and software available wherever they are wanted.
But it also requires content. That’s why we are developing ways of authoring chemical documents and why we are creating mechanisms for sharing. Sharing only comes about when there is mutual benefit, and until the blogosphere arrived there was little public appreciation. We now see the value of trading goods and services and the power of the gift economy. In our case we are adding things like quality and discoverability as added value. We’ve seen the first request for a mashup today.
WWMM requires Open Data, and probably we had to create the definition and management of Openness before we knew how to do it. We’ll start to see more truly Open Data as publishers realise the value and encourage their authors to create Open content as part of the submission process. And funders will encourage the creation and deposition of data as part of the required Open publication process. Then scientists will see the value of authoring semantic data rather than paying post-publication aggregators to type it up again. At that stage the WWMM will truly have arrived.

Posted in "virtual communities", cyberscience, data | Leave a comment

COST D37 Meeting in Rome

Tomorrow Andrew Walkingshaw and I will be off to Rome for the COST D37 Working Group. From the site:

What is COST?

COST is one of the longest-running instruments supporting co-operation among scientists and researchers across Europe. COST now has 35 member countries and enables scientists to collaborate in a wide spectrum of activities in research and technology. […]

PMR: I’m always proud to be involved in European collaborations. When I was born Europe was tearing itself apart. Whatever we may think of the bureaucracy involved it’s worth it. Science and scientists have always been a major force in international collaboration, and the prevention of conflict.
The meeting itself (COST D37) is aimed at interoperability on chemical computation:

Objective

Realistic modelling in chemistry often requires the orchestration of a variety of application programs into complex workflows (multi-scale modelling, hybrid methods). The main objective of this working group (WG) is the implementation, evaluation and scientific validation of workflow environments for selected illustrative scenarios.

Goals

In the CCWF group, the focus is on the implementation and evaluation of quantum chemical (QC) workflows in distributed (Grid) environments. This is accomplished by:

  • The implementation of workflow environments for QC by adapting standard Grid technologies.
  • Fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format, to ensure application program interoperability and support efficient access to chemical information based on a Computational Chemistry ontology.
  • The implementation of computational chemistry illustrative scenarios from areas of heterogeneous catalysis, QSAR/QSPR, and rational materials design to demonstrate the applicability of our approach.

PMR: So I’ll be talking about the World Wide Molecular Matrix (WWMM) and Andrew will talk on Golem – which will transduce the output of computational programs into ontologically supported components that can be fed into other programs without loss of information. I shall try to present as much as possible from the WWW, linking into CrystalEye and OpenNMR.
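As a toy illustration of the transduction idea: lift a line of plain-text quantum-chemistry output into a structured record keyed by a dictionary (ontology) term, so that a downstream program can consume it without re-parsing free text. The log-line format, regular expression and term name here are hypothetical, not Golem’s actual rules.

```python
import re

# Matches lines such as "SCF Done:  E(RHF) =  -76.0107465" (format assumed).
SCF_PATTERN = re.compile(r"SCF Done:\s+E\S*\s*=\s*(-?\d+\.\d+)")

def transduce(log_line):
    """Return an ontology-keyed record for a recognised line, else None."""
    match = SCF_PATTERN.search(log_line)
    if match:
        return {"dictRef": "cc:scfEnergy",       # hypothetical dictionary term
                "value": float(match.group(1)),
                "units": "units:hartree"}
    return None

print(transduce("SCF Done:  E(RHF) =  -76.0107465"))
```

The essential point is that once the value carries a dictionary reference and units, any other program in the workflow can look it up by meaning rather than by scraping a particular program’s output format.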

Posted in chemistry, data, XML | 2 Comments