Digital Curation day 2 – Carole on workflows

There is definitely an air of optimism at the conference – we know the tasks are hard and very, very diverse, but it’s clear that many of them are understood. The morning plenary was Carole Goble, Manchester – who has been a central pillar of the eScience community and one of the relatively few computer scientists who have really immersed themselves in the needs and the culture of the scientists, rather than the abstractions of CS. (Of course she does the cutting-edge stuff there as well, but hides it from us). She’d just come from Amsterdam (“where people smoke in bars”) and gave her natural bouncy presentation – today on workflows.
The great thing about Carole is she’s honest. Workflows are HARD. They are expensive. There are lots of them. None of them does exactly what you want. And so on. [We did a lot of work – by our standards – on Taverna but found it wasn’t cost-effective at that stage. Currently we script things and use Java. Someday we shall return.]
So random jottings – mainly about stuff which was new to me:
myExperiment.org. A collaborative site for workflows. You can go there and find what you want (maybe) and find people to talk to. “A bazaar for workflows; encapsulated objects (EMO) – single WFs or collections; chemistry data with blogged log book; encapsulated experimental objects; Open Linked Data initiative…”
facebook-like but based round the object, not person
from me-science to we-science. Crossing tribal boundaries
new project WS4LS (web services for life sciences – complete catalogue.)
Scientists do not collaborate – scientists would rather share a toothbrush than share gene names (Mike Ashburner)
who gets the credit? – who is allowed to update? Changing metadata rather than data. Versioning. Have to get credit and reputation managed. Scientists are driven by money, fame, reputation, fear of being left behind
Web 2.0, perpetual beta, users add value. blogs, viral marketing
Needs to be an ontology dictatorship.
Flexible model:
RDF, OWL, SKOS, ORE, Open Linked data
Annotations are first class citizens
Nice Interfaces with simple functionality matter more than sophisticated reasoning engines
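One way to read “annotations are first class citizens”: an annotation gets its own URI and its own triples, rather than living inside the thing it annotates. A minimal sketch in N-Triples form (all URIs and property names here are invented for illustration, not from any of the vocabularies named above):

```python
def ntriple(s, p, o):
    """Format one N-Triples statement (object may be a URI or a plain literal)."""
    obj = o if o.startswith("<") else '"%s"' % o
    return "%s %s %s ." % (s, p, obj)

# The annotation is a resource in its own right, with its own identifier:
ann = "<http://example.org/annotation/42>"
triples = [
    ntriple(ann, "<http://example.org/terms/target>", "<http://example.org/workflow/7>"),
    ntriple(ann, "<http://example.org/terms/author>", "<http://example.org/people/alice>"),
    ntriple(ann, "<http://example.org/terms/body>", "This workflow fails on empty input sets"),
]
for t in triples:
    print(t)
```

Because the annotation has its own URI, other people can annotate the annotation, version it, or credit its author – which is exactly the credit/reputation point above.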
Two design patterns for repositories, in response to Liz Lyon’s request:
Long Tail, Data is the next Intel Inside, Users add value, network effects, Some rights reserved, Perpetual beta, cooperate don’t control
and… (SMARTER)
Selective
Mass community annotation,
Automate,
React to changes
Timely (just enough)
Expedient
Review (rather than control)

Posted in Uncategorized | Tagged | Leave a comment

SPECTRa @ UIUC

Last spring I visited Illinois (UIUC) and presented the SPECTRa tools. Scott Wilson who runs the crystallographic facility and many of the LIS community were keen to see how it could be used for capturing their crystallography. Yesterday I met Sarah Shreeve at the DCC conference and she told me that they had now budgeted to install a SPECTRa system. This is great – Jim Downing and I will be discussing the technical details – but we’ll be hoping to have some more news RSN.
If anyone else at DCC is interested in SPECTRa  for ingesting crystallography, spectroscopy or compchem, catch me at coffee – I’m around till Saturday.

Posted in data, open issues | Tagged | Leave a comment

Microsoft eChemistry Project and molecular repositories

Some of you may have picked up from – e.g. the Open Grid Forum – that Microsoft (Tony Hey, Lee Dirks, Savas Parastatidis) have been collaborating with Carl Lagoze (Cornell) and Herbert van de Sompel (LANL) on bringing together Chemistry and OAI-ORE – the next generation of interoperable repository software. We are delighted that Microsoft has now agreed to fund this project and when Carl, Lee, Simon Coles (Soton) and I had lunch yesterday Lee said I could publicly blog this. (There are contractual details to be settled on various sites).
In brief – Tony Hey was the architect of the UK eScience program and then moved to Microsoft Redmond where he has been developing approaches to Open Science (not sure if this is the correct term but it gives the idea) – for example it includes Open Access and permits/encourages Open Source in the project. Carl and Herbert developed the OAI-PMH protocol for repositories which allows exposure of metadata for harvesters. They have now developed ORE – Object Re-use and Exchange – which sees the future as composed of a large number of interoperating repositories rather than monolithic databases (I am on the advisory board of ORE).
There are 7-8 partners in the program – MS, PubChem, Cornell, LANL, Lee Giles (PSU), Soton, Indiana and Cambridge. This is a really exciting development as we shall be able to create a number of well-populated molecular repositories with heterogeneous content (everything from crystallography to Wikipedia chemicals, for example). One that we are currently developing is an RDF/CML-based repository of common chemicals – perhaps 5000 – which could serve as an amanuensis for the bench chemist or undergraduate needing reference material. CrystalEye will be in there as well and we shall also be “scraping” (ugly word) any material we can legally access. In this way we can hope to see the concept of the World Wide Molecular Matrix start to emerge. Chemistry eTheses can also be reposited – we are starting to hear of universities which have mandated open theses.
Chemical substructure searching across repositories will be an exciting challenge but we have a number of ideas.
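To make the shape of the challenge concrete: substructure search is essentially labelled subgraph matching. The following is a deliberately naive pure-Python sketch (toy molecules, hydrogens and bond orders omitted) – an illustration of the problem, not any of the ideas alluded to above:

```python
# Molecules as toy graphs: {atom_id: (element, set_of_neighbour_ids)}
def has_substructure(mol, pattern):
    """Return True if `pattern` maps into `mol`, preserving elements and bonds."""
    p_nodes = list(pattern)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            return True
        p = p_nodes[len(mapping)]
        p_elem, p_nbrs = pattern[p]
        for m, (m_elem, m_nbrs) in mol.items():
            if m in mapping.values() or m_elem != p_elem:
                continue
            # every already-mapped pattern neighbour must be a molecule neighbour
            if all(mapping[q] in m_nbrs for q in p_nbrs if q in mapping):
                mapping[p] = m
                if extend(mapping):
                    return True
                del mapping[p]
        return False

    return extend({})

# Ethanol: C-C-O (hydrogens omitted for brevity)
ethanol = {0: ("C", {1}), 1: ("C", {0, 2}), 2: ("O", {1})}
# Pattern: a C-O fragment
c_o = {0: ("C", {1}), 1: ("O", {0})}

print(has_substructure(ethanol, c_o))   # expect True
```

The brute-force search above is exponential in the worst case; doing this efficiently across millions of entries in distributed repositories is precisely why it is an exciting challenge.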
We shall have openings here so if you are interested let us know.
More later, but to reiterate our thanks to Tony and colleagues.

Posted in crystaleye, open issues | 5 Comments

Digital Curation Conference Day 1

I can’t actually connect to the Internet during the sessions so have made some notes and edited them later. I suspect this is rather patchy.
Overall impressions – optimistic spirit with some speakers being very adventurous about what we can and should do. I’m not reporting everyone.
…Opening Plenary: John Wood, Imperial, gave an overview of large-scale projects, all of which were heavily data-rich. One message was that data curation should be intimately planned before the project starts. I had the strong feeling that at least half the problems were solved if you were a large funded project.
…Astrid Wissenburg (RCUK) – Engaging overview of research council policies. On the one hand ESRC does not pay unless researchers actually deposit. On the other hand EPSRC has a very lax policy (her words). I couldn’t catch all of their statement but I asked her afterwards and basically EPSRC says you can archive or not. Their words imply that publication is the end of the process – after that they don’t care what happens to any data (and possibly not before that either).
…Chris Greer (cyberinfrastructure): over 10^18 bytes were created in 2006.
…Rhys Francis (Australia). One of the most engaging presentations. Why support ICT? It will change the world. Systems can make decisions, electronic supply chains. Humans cannot keep pace. We do not need people to process information.
Who owns the data, and the copyright? Physics says: who cares? If it’s good, copy it. Else discard it. Storage is free. [PMR: let’s try this in chemistry…]
What to do about data is a harder question than how to build experiments
FOUR components to infrastructure. I’ve seen this from Oz before… data, compute, interoperation, access. Must do all 4.
And he also highlighted the importance of domain knowledge in preference to institutional repositories (I go along with that).
…Tim Hubbard:
Particular snippet: Rfam (RNA family) is now in Wikipedia as the primary resource, with some enhancement by community annotation. So we are seeing Wikipedia as the first choice of deposition for areas of scholarship.

Posted in idcc3 | Leave a comment

Digital Curation Conference (DCC) Washington

I’m in Washington for a JISC NSF meeting on Friday. Originally I thought I would have to miss the Digital Curation Conference but due to a change of plans am now able to attend (Wed and Thursday). Since I have no commitments (other than dinner, and meeting collaborators) I may be able to blog some of the meeting. I don’t know how many bloggers there are – I have found it varies wildly between meetings – about a hundred for www2007 and only me for berlin5.
[NOTE Added later: Chris Rusbridge has asked for the tag IDCC3 to be used . Thanks Chris.]

Posted in idcc3 | Tagged | 1 Comment

Scope for SCOAP

From Peter Suber: SCOAP3 FAQ for US libraries : CERN‘s SCOAP3 project has created an FAQ for U.S. Libraries. Excerpt:


What is SCOAP3 and what does it have to do with me?
SCOAP3 is the Sponsoring Consortium for Open Access Publishing in Particle Physics (see [this] for more info). It is a mechanism for a field of science (in this case Particle Physics) to pay for its own publishing costs, rather than make the readers of its journals pay via subscriptions. In the SCOAP3 model, everyone involved in producing the literature of particle physics (universities, labs, and funding agencies) pays into a consortium (SCOAP3) which then pays publishers so that all articles in the field are Open Access.
No particle physics journal will have a subscription cost, and everyone can read any article published.
You can redirect the money that you save on subscriptions to SCOAP3 to pay for Open Access for the entire literature of Particle Physics.
As a physics/science library you will be realizing the savings from the lack of subscription costs for the Particle Physics journals, so it is only natural that you would be a contributor to SCOAP3. Clearly the cost of Open Access will be similar to the cost of subscriptions, because there won’t be any new money in the system. Without your redirected money, it won’t work….

PMR:This is a great model and should work well for any large, coherent, well-managed and funded community. In reality there are probably only a few fields where it works – they need to be collaborative, global and probably specialist.


“Clearly the cost of Open Access will be similar to the cost of subscriptions, because there won’t be any new money in the system. Without your redirected money, it won’t work….”


PMR: I don’t agree. We don’t know what the cost of publishing actually is, but it’s clear that it varies widely and there is much misinformation. The fact that many Open Access society journals are author-doesn’t-pay shows that in certain cases the costs can be accommodated in “marginal costs” or other subsidies. It can be argued that commercial publishers are more cost-efficient than non-profits because they are commercial. But they have many other costs – copyright police, marketing, and perhaps production to layout standards which the community does not require. And there is the shareholder profit.

A major part of the current pricing problem is that price and cost are not seen to be related.


So would the following be a more accurate statement?


“In the case of SCOAP the cost of Open Access is not zero, but we shall be open about the expenditure. We expect to avoid some of the costs and profits of commercial and society publishers and would hope to be able to lower costs. Since price (of author submission) is now directly related to costs we must recover them from the funders because there won’t be any new money in the system. Without your redirected money, it won’t work….”


Posted in open issues | Leave a comment

What is data deposition?

 Chemspider raises an important and valuable issue. How is data reposited?

  1. ChemSpiderMan Says:
    Peter, as you saw from my other posts at http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=854 I have followed the SPECTRa project with interest and read the “final report” recently. Based on your comment “We have benefitted from this in the SPECTRa project which allows groups in academia and elsewhere to reposit their spectral and crystallographic information.” the system is up and running somewhere. Can you point me to the system to test as I am interested to see it in action.
    As you likely know we have implemented spectral uploading onto ChemSpider already (http://www.chemspider.com/docs/Uploading_Spectra_onto_ChemSpider.htm) and use JSpecView for the purpose of viewing. It took about 8 hours of work to deliver the minimalistic implementation and there is definitely work to be done to improve it. I know that a lot of time was spent interviewing users and I am therefore interested in seeing the interface you developed for people. (I have not interviewed lots of people re ChemSpider uploading and am going on the feedback of people like Bob Lancashire himself and JC Bradley, both on the ChemSpider advisory group (http://www.chemspider.com/Advisory.aspx).)

PMR: First, we – or rather Jim – have released the SPECTRa tools (SPECTRa tools released). There are systems in place in at least Cambridge and Imperial. However that’s only part of the story – the easy bit. The main problem is to create a system and business model where there is a natural incentive to deposit data. We have found very considerable resistance and apathy – it’s easy to get excited by the ODOSOS community, but in practice most chemists don’t care.
One difficult problem is “when”? When the data are originally collected the chemist would never make them public. Although we can dream, I doubt that chemists will rush to Open Notebooks. When the paper is published it would be appropriate to reposit them. But publication occurs many months after the submission of the manuscript. So you need an escrow repository – that’s why we had to spend so much time on that in SPECTRa. This requires a mechanism of trust, and I suspect it can come from the following sources:

  • the departmental analytical infrastructure.
  • the institutional infrastructure (especially for theses)
  • a respected publisher (perhaps an obvious role for BMC).

So although I applaud the Chemspider offer to archive data I think it will need a large number of different business models to make it work. Each university department is different and each publisher is different. I wish it wasn’t so. We have to change the culture of data – which is one reason why I shall be attending the Digital Curation Centre meeting next week in Washington.
Finally, what is data? Data without metadata can be almost valueless. Many of the “SD” files on the web have no metadata and you have to guess the tags. Spectra are easier which is why we have started with them, crystallography and compchem. Moreover the metadata are often available in the file – not always but enough that it’s valuable.
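To make the tag-guessing problem concrete: the data fields in an SD file are just `> <TAG>` headers followed by free-text values, so a few lines of code recover them – but nothing tells you what a tag means. A minimal sketch (the tags and values below are invented for illustration):

```python
def sdf_fields(text):
    """Return one {tag: value} dict per record in an SDF string."""
    records = []
    for block in text.split("$$$$"):            # $$$$ terminates each SDF record
        fields = {}
        tag = None
        for line in block.splitlines():
            if line.startswith(">") and "<" in line:   # data header, e.g. > <MELTING.POINT>
                tag = line[line.find("<") + 1:line.find(">", line.find("<"))]
                fields[tag] = ""
            elif tag is not None:
                if line.strip() == "":                  # blank line ends the field
                    tag = None
                else:
                    fields[tag] = (fields[tag] + " " + line.strip()).strip()
        if fields:
            records.append(fields)
    return records

sample = """\
  (molfile lines would be here)
M  END
> <MELTING.POINT>
78.5

> <SOURCE>
unknown

$$$$
"""
print(sdf_fields(sample))
```

The parse is trivial; the curation problem is that `MELTING.POINT` here is my guess at a meaning – another depositor’s file might call the same quantity `MP`, `mpt` or nothing at all.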
But also, what and where is the extent of the data? Pubchem, for example, is a linkbase but not a database. It does not, for example, carry melting points. (This is a simplification but it’s generally true). So “putting data in Pubchem” is essentially adding links to Pubchem through the connection table (or possibly the name in a few cases – I don’t know). And unless there are synchronised ontologies then it’s often unclear what quantities are equivalent and what aren’t. There is a lot of difference between a human reading a few entries and a machine reading many thousands and interpreting the results.

Posted in Uncategorized | Leave a comment

Capturing SPECTRa

Jean-Claude Bradley has blogged  JSpecView Article on Chemistry Central


Robert Lancashire has just published an article in Chemistry Central Journal:

The JSpecView Project: an Open Source Java viewer and converter for JCAMP-DX, and XML spectral data files
Our lab has found this software to be key for communicating organic chemistry results within an Open Notebook Science environment. All NMR raw data and metadata are automatically recorded and users from anywhere can mine the spectra by expanding and integrating at will from a browser interface. This is an enormous improvement over the traditional method of storing and publishing spectra as images that cannot be expanded.
The article describes other useful applications, such as the integration of JSpecView with Jmol, to show the assignment of specific peaks.
The other reason I really like this article is that Robert has used some UsefulChem blog posts as primary references. This is an important way for the scientific blogosphere to get incorporated and accepted by the mainstream.

[RL] Abstract
The JSpecView Open Source project began with the intention of providing both a teaching and research tool for the display of JCAMP-DX spectra. The development of the Java source code commenced under license in 2001 and was released as Open Source in March 2006. The scope was then broadened to take advantage of the XML initiative in Chemistry and routines to read and write AnIML and CMLspect documents were added. JSpecView has the ability to display the full range of JCAMP-DX formats and protocols and to display multiple spectra simultaneously. As an aid for the interpretation of spectra it was found useful to offer routines such that if any part of the spectral display is clicked, that region can be highlighted and the (x,y) coordinates returned. This is conveniently handled using calls from JavaScript and the feedback results can be used to initiate links to other applets like Jmol, to generate a peak table, or even to load audio clips providing helpful hints. Whilst the current user base is still small, there are a number of sites that already feature the applet. A tutorial video showing how to examine NMR spectra using JSpecView has appeared on YouTube and was formatted for replay on iPods and it has been incorporated into a chemistry search engine.

PMR: Congratulations to Robert, both on publishing JSV and choosing an Open Access journal to do it in. I predict that he will get many downloads and hopefully not a few citations.  JSV was not always Open Source and Robert has had to be tenacious to extract it.


We have collaborated with Robert for some years and are two of the co-authors of the CMLSpect paper. There are now several interoperating standards for 1-dimensional spectra – JCAMP-DX, AnIML and CMLSpect. There is really no excuse for publishers not to encourage authors to deposit these. They aren’t enormous, they can be deposited as supplemental data and can then be re-used by the scientific community. At present the publishers force readers to measure spectra with rulers if they want any numerical information.
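To illustrate why machine-readable deposition matters: JCAMP-DX is a simple format of labelled data records (“##LABEL= value”), so a short sketch like the following recovers the numbers directly – no ruler required. The spectrum below is a fabricated fragment, not real data:

```python
def parse_jcamp(text):
    """Parse JCAMP-DX labelled data records into a dict of strings."""
    records = {}
    label = None
    for line in text.splitlines():
        if line.startswith("##"):                 # start of a labelled data record
            label, _, value = line[2:].partition("=")
            records[label.strip()] = value.strip()
        elif label is not None:                   # continuation of the previous record
            records[label] += "\n" + line.strip()
    return records

sample = """\
##TITLE= test spectrum (fabricated values)
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= ABSORBANCE
##XYPOINTS= (XY..XY)
3000.0, 0.12
1700.0, 0.95
##END=
"""
spec = parse_jcamp(sample)
print(spec["XUNITS"], spec["XYPOINTS"].splitlines()[1:])
```

(A real reader also has to handle the compressed XYDATA encodings, which is what libraries like the JCAMP code at SourceForge and JSpecView provide.)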


And others deserve mention – Stefan Kuhn, who has helped to develop much of the spectral software, Christoph Steinbeck and the authors of JCAMP and the software library at sourceforge. We have benefitted from this in the SPECTRa project which allows groups in academia and elsewhere to reposit their spectral and crystallographic information.

Posted in Uncategorized | 5 Comments

Ontologies in Physics and Chemistry

My colleague Nico Adams has just posted on ontologies (Ontologies are overrated?!?)

Here’s a video by the indefatigable Michael Wesch, done in his inimitable style, arguing that maybe ontologies are not needed anymore and that the shelf is obsolete in the area of digital information. In the blurb next to the video it says:

“This video explores the changes in the way we find, store, create, critique, and share information. This video was created as a conversation starter, and works especially well when brainstorming with people about the near future and the skills needed in order to harness, evaluate, and create information effectively.”


[NA] It ties in with arguments made by Weinberger and even gels nicely with a short “after-dinner” talk I gave recently. Are you committing ontological apostasy now, I hear you ask? And you have only just blogged about ontology development methods. Will I incur Aristotle’s wrath? No, I don’t think so… ontologies are still useful if they are used to provide a general frame of reference that describes both likeness and limits of likeness. As long as we appreciate that there may be more than one top node…
PMR: I’ll know more when the video appears, but I can also say that Nico did, indeed, give a nice after-lunch talk with cleverly aggregated visual material.
PMR: I also bumped into the problems of ontologies when talking with Michael Kohlhase last week about PhysML, a language to support physics, developed by Michael and Ebs Hilf in close conjunction with the OpenMath (OM) community. We are working together to make PhysML, MathML and CML interoperate. Some of this is technical, some is ontological.
It may be that physicists think in terms of perdurants and chemists in terms of endurants. Here’s Wikipedia:

Common Terms in Formal Ontologies

The Difference in terminology used between separate Formal upper level ontologies can be quite substantial, but the one and foremost Dichotomy most Formal upper level ontologies apply is that between ‘Endurants’ and ‘Perdurants’.

Endurant

Also known as continuant, or in some cases ‘substance’. Endurants are those entities that can be observed-perceived as a complete concept, at no matter which given Snapshot of time. Were we to freeze time we would still be able to perceive/conceive the entire endurant. Examples are material objects, such as an apple or a human, and abstract ‘fiat’ objects, such as an organisation or the border of a country.

Perdurant

Also known as occurrent, accident or happening. Perdurants are those entities for which only a part exists if we look at them at any given snapshot in time. When we freeze time we can only see a part of the perdurant. Perdurants are often what we know as processes, for example ‘running’. If we freeze time then we only see a part of the running, without any previous knowledge one might not even be able to determine the actual process as being a process of running. Other examples include an activation, a kiss, or a Procedure.

PMR: Ebs and Michael had reviewed CML and questioned why the key concepts were atoms, molecules, electrons and substances, whereas they suggested it would have been better to start from reactions. I think that’s a very clear difference in orientation between endurants and perdurants. Although chemists publish reactions, most of the emphasis is on (new) substances and their properties. CML is designed to map directly onto the way chemists seem to think – at least in their public communication – e.g. through documents. Of course we can also do reactions in CML, but even there the emphasis is often on the components. For my part it has been a useful change of vision to see how the physicists think. Michael will correct me but there are three basic components:

  • create a theory (this can be quite a general term)
  • devise an experiment to test it
  • collect observations (observable) and give the allowable error limits

There is no set list of material endurants (I think) – such as apparatus or material. These are described by dictionaries as and when required.
In our discussions we explored the difference in thought between the formal representation of a chemical equation and a formula for the rate (constant). For a mathematician it must be formally correct. For a chemist it must be useful and work. Both are desirable, but the real world probably requires a compromise.
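A trivial sketch of the chemist’s side of that compromise, assuming an elementary bimolecular step (purely illustrative, not anything from the discussion itself):

```python
# A second-order rate law, rate = k[A][B], written the way a chemist might
# use it. It "works", but nothing here enforces the formal correctness a
# mathematician would want (units, non-negativity, consistency with the
# balanced equation) -- which is exactly the compromise in question.

def rate(k, conc_a, conc_b):
    """Rate of A + B -> products for an elementary bimolecular step."""
    return k * conc_a * conc_b

# k in L mol^-1 s^-1, concentrations in mol L^-1 (by convention, unchecked)
print(rate(2.0e3, 0.01, 0.05))   # 2000 * 0.01 * 0.05 = 1.0 mol L^-1 s^-1
```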

Posted in Uncategorized | Leave a comment

Christoph Steinbeck moves to European Bioinformatics Institute (EBI)

I’d known about this for a few days and am delighted it’s now public:

I’m very delighted to announce my move to the European Bioinformatics Institute (EBI) in Hinxton near Cambridge, UK, at the beginning of 2008.

At EBI, I will re-establish my research group and become leader of the chemoinformatics-related service teams, including the ChEBI and Reactome teams.

Future Directions at EBI

All of my ongoing research projects (SENECA, Bioclipse, CDK) will be continued at EBI and constitute a solid base for the rapid formation of a strong research group.

In addition, there is a plethora of fascinating modeling and data-analysis projects to be envisioned in Systems Biology, in addition to the final grand goal of achieving a whole-cell or even whole-organism metabolic simulation. At EBI we will pursue the implementation of a metabolic simulation environment, with the goal of creating novel approaches to drug discovery. The “one-target, one-drug” paradigm still followed in many pharmaceutical chemoinformatics studies presented at conferences has clearly not led to an increased number of drugs on the market and, most importantly, it is not capable of preventing failures of novel compounds in late clinical trials. Only a systems biology approach to drug discovery, taking into account transport phenomena and the interaction with all or as many targets as possible, may allow us to make correct predictions.

To alleviate the abysmal lack of chemical data in some crucial areas of Systems Biology, we are interested in text, or better, publication mining techniques. In an ongoing collaboration with the Center for Molecular Informatics at Cambridge University, we will aim at creating an automated workflow for the extraction of molecular structures and data from the printed literature – past and present.

The Steinbeck Group offers PhD student positions via the EBI’s PhD program in 2008. Applications need to be submitted by Dec. 17 2007.

As leader of the chemoinformatics-related service teams at EBI (ChEBI, Reactome, etc.), my emphasis will be to enhance the existing resources with complete chemical semantics and raise funding for a substantial growth of the resources while securing their quality. The chemistry-related databases at EBI have the chance to become a valuable resource in pharmaceutical and medicinal chemoinformatics and serve as an integration point between chemo- and bioinformatics.

PMR: I’ve known Christoph for several years. It was a typical “meeting” through Open Source chemistry development – a mixture of what is now the Blue Obelisk group of software. I have been particularly grateful for his wholehearted commitment to this and also to the development of CML.

This is yet another step in the liberation of chemical informatics. It emphasizes the value of the Open approach which is so rare in mainstream chemistry, but relatively common in bioscience, both for code and data. The major bioinformatics centres (NCBI/Pubchem, EBI/ChEBI, etc.) are tired of the lack of chemistry available for bioscientists and are determined to make it happen for themselves.

This is obviously a boost for the Blue Obelisk movement. I was visiting the EBI last week at a meeting on the  MACiE project on enzyme reactions.  This has been a great help in developing CMLReact – CML for chemical reactions – which we are now using in other projects.

It was short-sighted of  Germany to allow the CUBIC group to be closed, and there are now very few “chemoinformatics” groups in the world – certainly compared with bioinformatics which continues to expand rather than contract.

So this is a very welcome development both ad hominem and for the community. We shall obviously work closely together.


Posted in Uncategorized | 3 Comments