Open Bibliography is Essential for modern scholarship

#jiscopenbib

Very simply:

Scholarship (Research, innovation, planning, university business, etc.) is being crippled by the lack of Open Bibliography

That’s a strong statement. (And bibliography is not the only area where lack of Openness is crippling). But I’ll defend it and say what JISC, The Open Knowledge Foundation and others are doing about it. First some examples:

  • A Vice-chancellor (University president, …) wants to know who in her university is collaborating effectively with the BRIC countries. (Any VC who doesn’t want to know this shouldn’t be in the job). How does she find this out? She sends a mail to all academics and gets some sporadic replies. What she can and should be doing is analysing the co-authors of all the university’s publications to see where those co-authors are located. Simple… Except that anyone who tries to read the metadata automatically (which is technically possible) runs into copyright FUD.
  • A researcher wants to find out which papers contain crystal structures of natural products. This information can be obtained from the gratis cis-paywall material (the freely visible part of a scientific publication) of all publishers. He can extract this from OA-Libre publications but not necessarily from OA-gratis or Toll-access ones. But can he do this automatically without being sued? Copyright FUD again.
  • Has there been an increase in the number of papers published jointly by neuroscientists and molecular biologists? Which groups have the most effective collaborations? This would be easy to answer if electronic serial bibliography were Open. Again, you cannot do this in a modern manner because of copyright FUD.

What is copyright FUD? It’s the uncertainty as to whether you are allowed to do something without possibly violating copyright (or some other contractual obligation signed behind closed doors and restricted by non-disclosure – many academics don’t even know what they aren’t allowed to do within their own institution). If you ask someone in the library you’ll get the normal sort of pseudo-syllogism:

“We’re not sure whether you are allowed to do this. There is a risk that if you do this the publishers will object/sue us/cut us off. No risk is worth taking. Therefore you can’t do this.”

The same FUD applies to library catalogues. If they are obtained from other sources they may be contaminated with copyright FUD. After all collections of data may carry IP rights in some jurisdictions (such as Europe).

So the way forward is to embrace Open Bibliography (as itself, and also more widely as Open Scholarship). We believe that all primary publishers will actually see this as an important advance. After all if the bibliography is Open, then more people are likely to access the paper, journal, thesis, monograph, report, grey literature, etc.

The Open Knowledge Foundation has been running an Open Bibliography project for nearly a year (see http://lists.okfn.org/mailman/listinfo/open-bibliography – which anyone can join and contribute to). As a result we’ve (Adrian Pohl) been linking efforts from round the world – such as Jim Pitman’s Bibliography Knowledge Network (BKN), and other luminaries such as Thomas Krichel, Karen Coyle, David Shotton. They are joined by OKF – Jonathan Gray, Rufus Pollock, Ben O’Steen, Jo Walsh, … and myself. For example Jonathan and colleagues have just returned from running an Open Bibliography workshop in Berlin.

This has been enhanced by JISC funding of the #jiscopenbib project (PeterMR, Dave Flanders (JISC)) which has been going about 3 months (Rufus, Ben and Mark McGillivray (E-burgh)). [There’s a complementary project #jiscopencite, with David Shotton, for citations]. We’ve accomplished a great deal, not the least of which is establishing that there is clearly a massive pent-up requirement (not always recognised) for Open bibliography – expressed very well by Paul Miller: bibliography should be Open. We’ve talked with many publishers, many libraries and other producers of bibliography. Ben has made great progress in tying together the various approaches that publishers use for exposing their metadata. Each publisher does it differently, using a sort of microtagging based on PRISM, Dublin Core, the NLM DTD and homegrown schemes. Ben’s approach unifies this and allows it to be searched through an RDF triplestore.
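To give a flavour of what that unification involves, here is a minimal sketch (not Ben’s actual code; the tag mapping, the example property names and the example URL are my own illustrative assumptions) that harvests bibliographic <meta> tags from a publisher splash page and emits simple Dublin Core triples as Turtle:

```python
# Minimal sketch: harvest bibliographic <meta> tags from a splash page and emit
# Turtle triples. The tag names and mapping are illustrative; real pages mix PRISM,
# Dublin Core, Highwire/citation_* and home-grown conventions.
from html.parser import HTMLParser
from urllib.request import urlopen

# Map the different publisher tag names onto one common vocabulary.
TAG_MAP = {
    "dc.title": "dcterms:title",
    "citation_title": "dcterms:title",
    "prism.publicationname": "dcterms:isPartOf",
    "citation_journal_title": "dcterms:isPartOf",
    "dc.identifier": "dcterms:identifier",
    "citation_doi": "dcterms:identifier",
}

class MetaHarvester(HTMLParser):
    """Collects name/content pairs from <meta> tags we recognise."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in TAG_MAP and attrs.get("content"):
            self.fields.setdefault(TAG_MAP[name], []).append(attrs["content"])

def harvest(url):
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = MetaHarvester()
    parser.feed(html)
    return parser.fields

def to_turtle(article_uri, fields):
    lines = ["@prefix dcterms: <http://purl.org/dc/terms/> ."]
    for prop, values in fields.items():
        for value in values:
            lines.append('<%s> %s "%s" .' % (article_uri, prop, value.replace('"', "'")))
    return "\n".join(lines)

if __name__ == "__main__":
    url = "http://example.org/journal/article/123"   # hypothetical splash page
    print(to_turtle(url, harvest(url)))
```

The point is only that once the publisher-specific tag names are mapped onto one vocabulary, the records can be loaded into a triplestore and queried uniformly.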

As I’m part of UKPMC (the UK branch of PubMed Central) I’m keen to re-use the bibliography from that. We can (obviously) use the Open Access subset which is about 1 million records. But we can’t use the whole PMC (20 million). Why not? Copyright FUD. 20 million records that scholars should be using and aren’t. 20 million articles that aren’t being accessed as much as they should because their bibliography is only available through closed sources. (And although some publishers sell their bibliographies to secondary publishers, surely it won’t harm this business to provide the Bibliography in an Open form as well – look at the success of many recent Openings of content from Cory Doctorow to Dave Mackay to Ordnance Survey.)

There’s lots of recent news – read the discussion list and the OKFN blog. I’ll be reporting more as I have time (currently refactoring OSCAR3=>OSCAR4, running Quixote, …).

The message is simple:

We’re creating a set of Open Bibliography principles (like the Panton Principles for Scientific Data). We don’t think these are controversial. For scientific bibliography they can be summarised as:

Bibliographic records (names, titles, ISSNs, pages, addresses – all cis-paywall) should be fully Open:

Free to use, re-use and redistribute without further permission.

Who could sensibly object to that? It’s not a creative work – it’s part of our scholarly infrastructure. The digital plumbing. There are some details such as community norms – best endeavour to ensure fidelity of reproduction, for example.

So we are appealing to creators of bibliographic records (catalogue entries, article metadata, thesis splash pages, …)

Please support us in this endeavour. We can do the techie stuff – all we need is permission (hopefully enthusiastic) to use your records and transform them to an Open Semantic format.

Posted in Uncategorized | 1 Comment

Update and real excitement

#jiscopenbib #jiscxyz #quixotechem

The last week has been hectic and really exciting. I get the feeling that the last 10-20 years of my life are starting to bear fruit. In particular I now believe the World Wide Molecular Matrix will happen as a zero-cash activity. But here are some of last week’s advances:

  • Visit to IUCr Chester with agreement on #jiscopenbib and #jiscxyz. Expect announcement RSN
  • Liberation into OKD-compliant form of major library catalogues. We MUST fight to make all bibliography Open. It’s so blindingly obvious that this should be so, that I cannot understand why we are in almost complete paralysis in some areas. Bibliography should be Open. Oh, and I am exploring a critically important collection of bibliographic scientific records today. And I hope to liberate some major scientific bibliography. Wake up! We MUST have Open Bibliography. Scientific information is crippled without it. Also managed to speak with Jim Pitman (Berkeley) who is also liberating bibliography in the BKN (Bibliographic Knowledge Network)
  • Egon Willighagen has joined us to work on OSCAR-ChEBI funded by EPSRC/OMII. We’ll be visiting EBI to see how it can be deployed
  • I have now written a generic parser technology for compchem input/output/punch files. This should be rapidly deployable in the Quixote project. Scope for parallel development. We are still on track for the deployment of a prototype on October 21.
  • Have to present our work on the Green Chain Reaction to the EPSRC grand challenge Dial-a-Molecule on 2010-10-20. This is the vision that we can extract reactions from the literature to help develop better synthetic procedures.

More later.

Posted in Uncategorized | 2 Comments

Why we need unique addresses and identifiers

#jiscopenbib #quixotechem

This post is about identifier and indexing systems – we shall need these for Lensfield and Quixote.

The hierarchical system seems to come naturally to most humans. I’m not a physiologist, neuroscientist or psychologist but it seems natural for schoolchildren to write something like:

Jane Doe

Bedroom 3

First floor

12 Occupation Road

Abbey Ward

Cambridge

Cambridgeshire

England

Great Britain

United Kingdom

Europe

The World

The Solar System

The Galaxy

The Universe

And it’s the genius and apparent simplicity of so many naming schemes that makes it possible to manage modern information. Many are a contract between a global provider and a local system. Thus when we write:

http://wwmm.ch.cam.ac.uk/blogs/murrayrust

we are actually building a hierarchy based on several providers:

The internet naming authorities (InterNIC) create the top-level domains such as “uk”

The authorities in the countries create the next level (here “ac”). Within that, each university has its own domain such as “cam”. The university then decides on the subcomponents (“ch” is Chemistry).

Our group then gets its own sub…subdomain (“wwmm”). Under this our sysadmin allocates levels such as “blogs”, “svn”, etc. I have the sublevel for “murrayrust”. Now I can do more or less what I like in naming the resources.

 

Some systems – such as the WordPress software we use – allocate semantically void identifiers, such as “1234”. Their only concern is to make sure that no two blog posts ever have the same ID. In this case it seems to work by using serial numbers and this is an excellent approach. The number is meaningless except that smaller ones are earlier. They act as an indexing system for the blog. It means the system has to keep track of everything that has been done. If, for example, I closed the system down and then restarted blogging it might well start at “1” again, and that would foul up anyone who had bookmarks to the earlier posts.

Making sure an identifier is unique isn’t easy and usually requires mapping onto some human activity. A useful way is to include the date, time and location in some way. Another is to generate numbers so large that they are “almost certainly” unique – this is what the UUID approach does. It works for most of us.

Another way is to use a completely semantic system where every step of the hierarchy makes sense. This is what Nick Day does in Crystaleye. He creates a large URL rather like:

http://wwmm.ch.cam.ac.uk/crystaleye/<publisher>/<journal>/<year>/<issue>/<article>/<datafile1>

and a real example:

http://wwmm.ch.cam.ac.uk/crystaleye/summary/acta/e/2007/01-00/data/bh2062/bh2062sup1_I/bh2062sup1_I.cif.summary.html

This works because the whole publishing system is based on unique publishers which manage their information in a professional manner. Almost all follow this sort of strategy.

So how do we extend this to Quixote, where we are calculating chemistry? Here the generation of identifiers is critical. For example, suppose we have 5 people in the group who submit jobs; we might have:

murrayrustgroup/2010/10/01/job23

This requires a group tool that allocates unique job numbers. Individuals MUST use this tool. If they make up their own job numbers then they will certainly clash, probably within days of starting. We could, of course, use a UUID system and have something like:

murrayrustgroup/2010/10/01/a379dc23aaef3490ffeacb23aac

 

This is safe but psychologically depressing for many. People like to remember their information by handy identifiers. I still remember the chemical compounds I worked on in the pharma industry, such as AH19065, GR123976 and so on. That’s about the mental limit. It rests on yet another agreed scheme – a prefix for each company. “AH” is Allen and Hanburys, “GR” is Glaxo Group, and so on. Within each company the compounds had to be unique. For chemistry this was technically and semantically difficult and it still is (and we may revisit this later when we name chemicals).
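As a concrete illustration of the two allocation strategies above (a shared group tool handing out memorable serial job IDs, and UUIDs as the collision-proof but unmemorable alternative), here is a minimal sketch. The group prefix, counter file and ID layout are hypothetical choices, not an agreed scheme:

```python
# Sketch of the two allocation strategies discussed above. The prefix, the counter
# file and the ID layout are hypothetical; a real group tool would also need file
# locking so that simultaneous submissions cannot grab the same number.
import datetime
import pathlib
import uuid

COUNTER_FILE = pathlib.Path("jobcounter.txt")   # in practice: one shared file for the whole group

def next_serial_job_id(prefix="murrayrustgroup"):
    """Memorable IDs like murrayrustgroup/2010/10/01/job23 - everyone MUST use this tool."""
    count = int(COUNTER_FILE.read_text()) + 1 if COUNTER_FILE.exists() else 1
    COUNTER_FILE.write_text(str(count))
    d = datetime.date.today()
    return "%s/%04d/%02d/%02d/job%d" % (prefix, d.year, d.month, d.day, count)

def uuid_job_id(prefix="murrayrustgroup"):
    """Collision-proof but unmemorable, e.g. murrayrustgroup/2010/10/01/a379dc23..."""
    d = datetime.date.today()
    return "%s/%04d/%02d/%02d/%s" % (prefix, d.year, d.month, d.day, uuid.uuid4().hex)

print(next_serial_job_id())
print(uuid_job_id())
```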

So it’s critical in Quixote-Lensfield that we have good identifier schemes. They don’t have to be top-down. But they have to be capable of being integrated with top-down schemes. And they must create unique identifiers. And as we’ve seen before there are only a limited number of ways of doing this EASILY – i.e. where we can rely on everyone doing it.

And the simplest of these is the hierarchical filing system. It’s impossible to create objects with duplicate filenames on the same system. (It’s possible to foul up when you clone an old machine onto a new machine – as I have just done – and then continue using both, as I hope I haven’t done.)

The problem is that most projects don’t fit naturally into any scheme. They nearly do, but there are normally problems. However I’m going to assume that we can have a system rather like:

external-id/organization/person/project

Where external-id is something like a domain name, a DOI, or another guaranteed unique root; /organization is something like a company or university; /person is a unique individual within an organization; and the person can manage their own projects on their own filestore. (Yes, it breaks down when people move organizations, or when they work with two organizations, or when several people work on the same project, or… But there are NO easy answers here.) What I am describing is common enough in many sciences – an individual has the freedom and the responsibility to manage their own information. Unfortunately they rarely have any guidance!

My examples will then be based on Lensfield acting within a hierarchical filing system. The key thing is not to be afraid of creating directories (or folders or whatever term is used). We’re recommending 1 folder for one unit of work. So one folder per crystal structure; 1 folder for calculations on one compound. And create subfolders if you have variant calculations or experiments on the same system.
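A minimal sketch of that convention, using the external-id/organization/person/project scheme above (the organization, compound and variant names are hypothetical examples):

```python
# Sketch of "one folder per unit of work" under the scheme described above.
# The root follows the external-id/organization/person/project pattern; the
# organization, compound and variant names are hypothetical examples.
import pathlib

def work_folder(root, unit, variant=None):
    """Create (if needed) and return the folder for one unit of work,
    e.g. one compound, with an optional sub-folder for a variant calculation."""
    folder = pathlib.Path(root) / unit
    if variant:
        folder = folder / variant
    folder.mkdir(parents=True, exist_ok=True)   # never clobbers existing work
    return folder

root = "wwmm.ch.cam.ac.uk/unicam/pm286/plan9"        # external-id/organization/person/project
print(work_folder(root, "benzene"))                  # one folder per compound
print(work_folder(root, "benzene", "variant-01"))    # sub-folder for a variant calculation
```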

[… I’ll stop here because this is already 3 days in writing and I’ll continue on the train back from Chester …]

Posted in Uncategorized | 3 Comments

Lensfield: In the Beginning was the FileSystem

This post introduces the command line and the file system, which are the bedrock of the Lensfield system we have built to support scientific computing.

Neal Stephenson has a marvellous essay/book http://en.wikipedia.org/wiki/In_the_Beginning…_Was_the_Command_Line . It’s primarily a discussion on proprietary operating systems, but it highlights the role of the command line in supporting one of the most natural and most durable ways of humans communicating with machines. It’s so “natural” that we may feel it’s a fundamental part of human intelligence, mapped onto the machine.

But there were machines before command lines. When I started (what an awful phrase – but please don’t switch off)… there was no keyboard input into machines. Input was via paper tape:

[thanks to wikimedia]

Ours was even worse – it had only 5 holes and would tear at the slightest chance. And ask yourself how you would edit it – yes, it was possible and we did it… You encoded your data on an ASR33. To get your results out, the computer had a tape punch; you fed the resulting tape into the ASR33, which printed it out at 10 characters per second – rather slower than a competent typist.

And where was the filestore? Yes, you’ve guessed it – rolls of paper tape. How were they indexed? By writing on them with pens. If the information was too big it was split over two or more tapes; if you read them in the wrong order, the result was garbage. And if the rolls were large the middle could fall out – a beautiful effect, but topologically diabolical, because every layer in the spiral contributed to a twist that had to be undone by hand. It could take an hour to rewind a tape.

The next advance was magnetic tape. This was almost infinite in storage.

[from Wikimedia]

From Wikipedia (http://en.wikipedia.org/wiki/IBM_729 ):

Initial tape speed was 75 inches per second (2.95 m/s) and recording density was 200 characters per inch, giving a transfer speed of 120 kbps[1]. Later 729 models supported 556 and 800 characters/inch (transfer speed 480 kbps). At 200 characters per inch, a single 2400 foot tape could store the equivalent of some 50,000 punched cards (about 4,000,000 six-bit bytes, or 3 MByte).

A tape could therefore hold many chunks of information. I can’t remember whether we called them “files” – it depended on the machine. But it was really the first filing system. Let’s assume we had 1000 files, each of about 32Kb. The files were sequential. To find one at the end of the tape you had to read through all the other 999. This could take ages. Writing a file could only be done if you had a tape with space at the end (even if you wanted to delete a file in the middle and overwrite it, problems with tape stretch made this dangerous). So generally you read from one tape and wrote to another.
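To make “this could take ages” concrete, here is the back-of-envelope arithmetic, assuming the quoted 200 characters/inch at 75 inches/second, treating 32Kb as roughly 32,000 characters, and ignoring inter-record gaps and rewinds:

```python
# Back-of-envelope arithmetic for the sequential-tape scenario above.
# Assumptions: 200 chars/inch at 75 inches/second (15,000 chars/s), 1000 files of
# ~32,000 characters each, no inter-record gaps, no rewind time.
chars_per_second = 200 * 75        # 15,000 characters per second
file_size = 32_000                 # characters per file
files_to_skip = 999                # everything in front of the file we want

seconds = files_to_skip * file_size / chars_per_second
print("about %.0f seconds (~%.0f minutes) just to reach the last file" % (seconds, seconds / 60))
# -> about 2131 seconds (~36 minutes)
```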

And how did you input instructions to the machine? Not with a command line. Not with buttons and GUIs. But with switches. Here’s a PDP-8 (which controlled my X-ray diffractometer in the mid-1970s).

[thx Wikimedia]

To enter a program you had to key in the start of the operating system – set the toggle switches (orange/white at the bottom) to a 12-bit number and enter it. Set them to another number. Enter it. Do this for about 10 minutes without a mistake (a single mistake meant going back to the start). Then you read the rest of the operating system from paper tape. Then you could read in your program!

The point is that back then…

  • No command line
  • No file system

That’s why when Dennis Ritchie and others introduced the hierarchical file system and the command line in UNIX it was a major breakthrough.

The magnificent thing is that 40 years on these two fundamentals are still the most natural way for many humans to interact with machines. 40 years in which the tools and approaches have been tested and honed and proved to work.

And that’s why Lensfield builds on the filesystem and the command line to provide an infrastructure for modern scientific computing and information management. And why we’ve used Lensfield for the Green Chain Reaction and will be providing it to the Quixote project (and anyone else for anything else).

Posted in Uncategorized | 2 Comments

#quixotechem Another compchem program goes Open Source

#quixotechem

I’m delighted to see that a major computational chemistry program (NWChem) has been released under a fully F/OSS Open Source software licence. There are many programs (“codes”) in compchem but few of them are F/OSS. The norm is either to be fully commercial, or allow free-to-academics. The main exceptions I knew about already were http://en.wikipedia.org/wiki/ABINIT and http://en.wikipedia.org/wiki/MPQC ; the former deals with solids rather than molecules so is outside the scope of Quixote. This means we can now – in principle – create completely open distributions for a community project.

 


Sujet : CCL:G: NWChem version 6.0 (Open Source) released
Date : mercredi 29 septembre 2010, 21:38:09

We are pleased to announce the release of NWChem version 6.0. This version
marks a transition of NWChem to an open-source software package. The software
is being released under the [Educational Community License 2.0] (ECL 2.0).
Users can download the source code and a select set of binaries from the new
open source web site http://www.nwchem-sw.org

New functionality, improvements, and bug fixes include:

* Greatly improved memory management for TCE four-index transformation,
CCSD(T), CR-EOMCCSD(T), and solver for EOMCCSD

* Performance and scalability improvements for TCE CCSD(T), CR-EOMCCSD(T), and
EOMCCSD

* TCE based static CCSD hyperpolarizabilities

* New exchange-correlation functionals available in the Gaussian DFT module

* Range-separated functionals: CAM-B3LYP, LC-BLYP, LC-PBE, LC-PBE0, BNL. These
functionals can also be used to perform TDDFT excited-state calculations

* SSB-D functional

* Double hybrid functionals (Semi-empirical hybrid DFT combined with
perturbative MP2)

* DFT response are now available for order 1 (linear response), single
frequency, electric field and mixed electric-magnetic field perturbations

* Greatly improved documentation for QM/MM simulations

* Spin-orbit now works with direct and distributed data approaches

* Plane-wave BAND module now has parallelization over k-points, AIMD, and
Spin-Orbit pseudopotentials

* Plane-wave modules have improved minimizers for metallic systems and
metadynamics capabilities

* Bug fix for DISP: Empirical long-range vdW contribution

* Bug fix for Hartree-Fock Exchange contributions in NMR

Please let us know if have any issues accessing the new website.

Best wishes,

   Huub

__________________________________________________
Huub van Dam
Scientist
EMSL: Environmental Molecular Science Laboratory
Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, MSIN K8-91
Richland, WA  99352 USA
Tel:  509-372-6441
Fax: 509-371-6445
Hubertus.vanDam : pnl.gov
www.emsl.pnl.gov

_______________________________________________
Blueobelisk-discuss mailing list
Blueobelisk-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss

Posted in Uncategorized | 3 Comments

#quixotechem the different ways of looking at the world [repost]

 

[I have moved machines and as a result overwrote an earlier post. Here it is again]

The meaning and use of words and ideas is critical to the development of the semantic web. Frequently I find that the writings of Jorge Luis Borges express a deep truth about how we describe, classify and interrelate concepts. Here in a single sentence he sums up the problem of classification. It’s sufficiently compelling that it has a whole Wikipedia entry http://en.wikipedia.org/wiki/Celestial_Emporium_of_Benevolent_Knowledge%27s_Taxonomy

“These ambiguities, redundancies, and deficiencies recall those attributed by Dr. Franz Kuhn to a certain Chinese encyclopedia called the Heavenly Emporium of Benevolent Knowledge. In its distant pages it is written that animals are divided into (a) those that belong to the emperor; (b) embalmed ones; (c) those that are trained; (d) suckling pigs; (e) mermaids; (f) fabulous ones; (g) stray dogs; (h) those that are included in this classification; (i) those that tremble as if they were mad; (j) innumerable ones; (k) those drawn with a very fine camel’s-hair brush; (l) etcetera; (m) those that have just broken the flower vase; (n) those that at a distance resemble flies.”[3]

I shall be writing one or more blog posts on a proposed architecture for the Quixote project (which will gather computational chemistry output). I’ll be describing a concept – the World Wide Molecular Matrix – which is about 10 years old but only now is the time right for it to start to flourish. It takes the idea of a decentralised web where there is no ontological dictatorship and people collect and republish what they want.

In classifying the output of computational chemistry we can indeed see the diversity of approaches. Here are some meaningful and useful reasons why someone might wish to aggregate outputs:

  1. Molecules which contain an Iron atom
  2. Calculations of NMR shifts in natural products
  3. Molecules collected by volunteers
  4. Calculations using B3LYP functional
  5. Studies reported in theses at Spanish Universities
  6. Work funded by the UK EPSRC
  7. Molecules with flexible side chains
  8. Large macromolecules with explicit solvent
  9. Work published in J. American Chemical Society
  10. Calculations which cause the program to crash

These are all completely serious and some collections along some of these axes already exist.

The point is that there is no central approach to collection and classification. For that reason there should be no central knowledgebase of calculations, but rather a decentralised set of collections. Note, of course, that both Borges’ and my classifications have intersections and omissions. This is fundamental to the web – it has no centre and is both comprehensive and incomplete.

I’ll be showing how the WWMM now has the technology – and more important the cultural acceptability – to provide a distributed knowledgebase for chemistry. It knocks down walled gardens through the power of Open Knowledge.

Posted in Uncategorized | Leave a comment

Components of the Quixote Open computational chemistry system and the WWMM

#quixotechem #wwmm #jiscxyz

Last week we agreed that a small, very agile group of co-believers would put together a system for collecting, converting, validating, and publishing Open Data for computational chemistry, described by the codeword “Quixote”. This is not a fantasy – it’s based on a 10-year vision which I and colleagues put together and called the “World Wide Molecular Matrix”. I’ve talked about this on various occasions and it’s even got its own Wikipedia entry (http://en.wikipedia.org/wiki/WorldWide_Molecular_Matrix – not my contribution, which is as it should be). We put together a proposal to the UK eScience program in (I think) 2001 which outlined the approach. Like so much else, the original design is lost to me, though it may be in the bowels of the EPSRC grant system. We got as far as presenting it to the great-and-the-good of the program but it failed (at least partly) because it didn’t have “grid-stretch”. [I have been critical of the GRID’s concentration on tera-this and peta-that and the absurdly complex GLOBUS system, but I can’t complain too much because the program gave us 6 FTE-years of funding for “Molecular Standards for the Grid”, which has helped to build the foundations of our current work.] Actually it’s probably a good thing that it failed – we would have had a project which involved herding a lot of cats and where the technology – and even more the culture – simply wasn’t ready. And one of my features is that I underestimate the time to create software systems – it seems to be the boring bits that take the time.

But I think the time has now come where the WWMM can start to take off. It can use crystallography and compchem separately and together as the substrates and then gradually move toward organic synthesis, spectroscopy and materials. We need to build well-engineered, lightweight, portable, self-evident modules and I think we can do this. As an example when we built an early prototype it used WSDL and other heavyweight approaches (there was a 7-layer software stack of components which were meant to connect services and do automatic negotiation – as agile as a battle tank). We were told that SOAP was the way forward. And GLOBUS. And certificates. We were brainwashed into accepting a level of technology which was vastly more complex (and which I suspect has frequently failed in practice). Oh, and Upper Level Ontologies – levels of trust, all the stuff from the full W3C layer cake.

What’s changed is that the bottom-up world has taken a lightweight approach. REST is simple (I hacked the Green Chain Reaction in REST – with necessary help from Sam Adams). The new approach to Linked Open Data is that we should do it first and then look for the heavy ontology stuff later – if at all. Of course there are basics such as an ID system. But URLs don’t have to resolve. Ontological systems don’t have to be provably consistent. The emergent intelligent web is a mixture of machines and humans, not first-order predicate logic on closed systems. There’s a rush towards key-value systems – MongoDB, GoogleData, and so on. Just create the triples and the rest can be added later.

What’s also happened is Openness. If your systems are Open you don’t have to engineer complex human protocols – “who can use my data?” – “anyone!” (that’s why certificates fail). Of course you have to protect your servers from vandalism and of course you have to find the funding somewhere. But Openness encourages gifts – it works both ways as the large providers are keen to see their systems used in public view.

And the costs are falling sharply. I can aggregate the whole of published crystallography on my laptop’s hard drive. Compchem is currently even less (mainly because people don’t publish data). Video resources dwarf many areas of science – there are unnecessary concerns about size, bandwidth, etc.

And our models of data storage are changing. The WWMM was inspired by Napster – the sharing of files across the network. The Napster model worked technically (though it required contributors to give access to local resources which can be seen as a security risk and which we cannot replicate by default). What killed Napster was the lawyers. And that’s why the methods of data distribution and sharing have an impaired image – because they can be used for “illegal” sharing of “intellectual property”. I use these terms without comment. I believe in copyright, but I also challenge the digital gold rush that we’ve seen in the last 20 years and the insatiable desire of organizations to possess material that is morally the property of the human race. That’s a major motivation of the WWMM – to make scientific data fully Open – no walled gardens, however pretty. Data can and will be free. So we see and applaud the development of Biotorrents, Mercurial and Git and many Open storage locations such as BitBucket. These all work towards a distributed knowledge resource system without a centre and without controls. Your power is your moral power, the gift economy.

And that is also where the Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk ) comes to help. Over the last 6 years we have built a loose bottom-up infrastructure where most of the components are deployed. And because we believe in a component-based approach rather than monoliths it is straightforward to reconfigure these parts. The Quixote system will use several Blue Obelisk contributions.

And we have a lot of experience in our group in engineering the new generation of information systems for science. This started with a JISC project, SPECTRa, between Cambridge and Imperial (chemistry and libraries), which seeded the creation of a component-based approach. Several of these projects turned out to be more complex than we thought. People didn’t behave in the way we thought they should, so we’ve adjusted to people rather than enforcing our views. That takes time and at times it looks like no progress. But the latest components are based on previous prototypes and we are confident that they now have a real chance of being adopted.

To keep the post short, I’ll simply list them and discuss in detail later:

  • Lensfield. The brainchild of Jim Downing: a declarative make-like build system for aggregating, converting, transforming and reorganizing files. Originally designed in Clojure (a functional language on the JVM), it has now been followed by a simpler system, Lensfield2, built by Sam Adams. This doesn’t have the full richness and beauty of Clojure – which may come later – but it works. The Green Chain Reaction used this philosophy and processed tens or hundreds of thousands of files in a distributed environment.
  • Emma. The embargo manager. Because data moves along the axis of private->Open we need to manage the time and the manner of its publication. This isn’t easy and with support from JISC (CLARION) we’ve built an Embargo manager. This will be highly valuable in Quixote because people need a way of staging release.
  • Chem# (pronounced “ChemPound”). A CML-RDF repository of chemistry, organised around molecules. We can associate crystallography, spectra, and in this case compchem results and properties. The repository exposes a SPARQL endpoint. This means that a simple key-value approach can be used to search for numeric or string properties (a minimal query sketch follows this list). And we couple this to a chemical search system based on (Open) Blue Obelisk components.
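For flavour, here is a minimal sketch of that kind of key-value query against a Chem#-style SPARQL endpoint. The endpoint URL and the property/dictRef modelling are hypothetical placeholders, not the actual Chem# vocabulary:

```python
# Minimal sketch of querying a Chem#-style SPARQL endpoint for a numeric property.
# The endpoint URL and the property/dictRef/value modelling are hypothetical
# placeholders - the real Chem# vocabulary may differ.
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/chempound/sparql"   # hypothetical endpoint

QUERY = """
PREFIX cml: <http://www.xml-cml.org/schema#>
SELECT ?mol ?energy WHERE {
  ?mol cml:property ?p .
  ?p cml:dictRef "cc:energy" ;
     cml:value ?energy .
  FILTER (?energy < -100.0)
}
LIMIT 10
"""

def run_query(endpoint, query):
    # Standard SPARQL protocol: the query goes in the 'query' parameter and we ask
    # for JSON results via the Accept header.
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    for row in run_query(ENDPOINT, QUERY)["results"]["bindings"]:
        print(row["mol"]["value"], row["energy"]["value"])
```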

The intention is that these components can be easily deployed and managed without our permission (after all, they are Open). They will act as a local resource for people to manage their compchem storage. They can be used to push either to local servers or to community Chem# repositories which we shall start to set up. Using Nick Day’s pub-crawler technology (which builds Crystaleye every night) we can crawl the exposed web for compchem, hopefully exposed through Emma-based servers.

We hope this prompts publishers and editors to start insisting that scientists publish compchem data with their manuscripts. The tools are appearing – is the communal will-to-publish equally encouraging?

Posted in Uncategorized | 2 Comments

The Why, When, Where and How of publishing data

#quixotechem #jiscxyz

One of the major questions that arose at the ZCAM meeting on Computational Chemistry and databases (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2619) was the publication of data.

In some subjects such as crystallography (and increasingly synthetic chemistry), the publication of a manuscript requires the publication of data as supplemental/supporting information/data (the terms vary). This is a time consuming process for authors but many communities feel it is essential. In other disciplines, such as computational chemistry, it hasn’t ever been mandatory. In some cases (e.g. J. Neuroscience) it was mandatory and is now being abolished without a replacement mechanism. In other cases such as Proteomics it wasn’t mandatory and is now being made so. So there are no universals.

At the ZCAM meeting there was general agreement that publishing data was a “good thing” but that there were some barriers. Note that compchem, along with crystallography, is among the least labour-intensive areas as it’s a matter of making the final machine-generated files available. By contrast publishing synthetic chemistry can require weeks of work to craft a PDF document with text, molecular formulae and spectra. Some delegates said that suppInfo could take twice as long as the paper. (There was rejoicing (sadly) in some of the neuroscience community that they no longer needed to publish their data).

So this post explores the positive and negative aspects of publishing data.

Here were some negatives (they were raised and should be addressed)

  • If I publish my data my web site might get hacked (i.e. it is too much trouble to set up a secure server). I have some sympathy – scientists should not have to worry about computer infrastructure if possible. We do, but we are rather unusual.
  • It may be illegal or it may break contractual obligations. Some compchem programs may not be sold to the various enemies of US Democracy (true) and maybe it’s illegal to post their outputs (I don’t buy this, but I live in Europe). Also some vendors put severe restrictions on what can be done with their programs and outputs (true) but I doubt that publishing output breaks the contract (but I haven’t signed such a contract)
  • If I publish my data certain paper-hungry scientists in certain countries will copy my results and publish them as theirs (doesn’t really apply after publication, see below)
  • Too much effort. (I have sympathy)
  • Publishers not supportive (probably true)

Now the positives. They fall into the selfish and the altruistic. The altruistic case is a prisoner’s dilemma (i.e. there is general benefit but *I* benefit only from other people being altruistic). The selfish reasons should be compelling in any circumstances.

Altruistic:

  • The quality of the science improves if results are published and critiqued. Converge on better commonality of practice.
  • New discoveries are made (“The Fourth Paradigm”) from mining this data, mashing it up, linking it, etc.
  • Avoid duplication (the same work being recycled)
  • Avoid fraud (unfortunately always probable)
  • Provide teaching and learning objects (very valuable in this field)
  • Contribute to a better information infrastructure for the discipline

Selfish:

  • Advertise one’s work. Heather Piwowar has shown that publishing data increases citations. This alone should be a compelling reason.
  • Use a better set of tools (e.g. for communal authoring)
  • Speed up the publication process (e.g. less work required to publish data with complying publishers).
  • Be mandated to comply (by funder, publisher, etc.)
  • Store one’s data in a safe (public) place
  • Be able to search one’s own data and share it with the group
  • Find collaborators
  • Create more portable data (saves work everywhere)

That’s the “why”. I hope it’s reasonably compelling. Now the “when” , “where” and “how”.

The “when” is difficult because the publication process is drawn out (months or even years). The data production and publication is decoupled from the review of the manuscript. (This is what our JISCXYZ project is addressing). The “where” is also problematic. I would have hoped to find some institutional repositories that were prepared to take a role in supporting data management, publication, etc. but I can’t find much useful. At best some repositories will store some of the data created by some of their staff in some circumstances. BTW it makes it a lot easier if the data are Open. Libre. CC0. PDDL, etc. Then several technical problems vanish.

So the scientist has very limited resources:

  • Rely on the publisher (works for some crystallography)
  • Rely on (inter)national centres (works for the rest of crystallography).
  • Put it on their own web site. A real hassle. Please let’s try to find another way.
  • Find a friendly discipline repository (Tranche, Dryad). Excellent if it exists. Of course there isn’t a sustainable business model but let’s plough ahead anyway
  • Twist some other arms (please let me know).

Anyway there is no obvious place for compchem data. I’d LOVE a constructive suggestion. The data need not be huge – we could do a lot with a few Tb per year – we wouldn’t get all the data but we’d get most that mattered to make a start.

So, to seed the process, we’ll see what we can do in the Quixote project. If nothing else we (i.e. our group) may have to do it. But I would love a white knight to appear.

That’s the “where”. Now the “when” and “how”. I’d appreciate feedback

If we are to have really useful data it should point back to the publication. Since the data and the manuscript are decoupled that only works when the publisher takes on the responsibility. Some will, others won’t.

An involved publisher will take care of co-publishing the paper and the data files. Many publishers already do this for crystallography. The author will have to supply the files, but our Lensfield system (used in the Green Chain reaction) will help.

Let’s assume we have a non-involved publisher…

Let’s also assume we have a single file in our plan9 project: pm286/plan9/data.dat (although we can manage thousands) and that there is a space for a title. When we know we are going to publish the file we’ll get a DataCite DOI. (I believe this only involves a small fixed yearly cost regardless of the number of minted dataDOIs – please correct me if not). We’ll mint a DOI. Let’s say we have a root of doi:101.202, so we mint: doi:101.202/pm286/plan9/data.dat . We add that to the title (remember that our files are not yet semantic). This file is then semantified into /plan9/data.cml with the field (say)

<metadata DC:identifier="doi:101.202/pm286/plan9/data.cml"/>
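A minimal sketch of that semantification step, using the post’s example names (the element layout is illustrative rather than prescribed CML, and it assumes plan9/data.cml already exists):

```python
# Sketch: write the minted DataCite DOI into the CML file as a Dublin Core identifier.
# The DOI root and paths are the post's examples; the element layout is illustrative
# rather than prescribed CML, and plan9/data.cml is assumed to exist already.
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
DC_NS = "http://purl.org/dc/elements/1.1/"

def add_doi(cml_path, doi):
    ET.register_namespace("", CML_NS)
    ET.register_namespace("DC", DC_NS)
    tree = ET.parse(cml_path)
    metadata = ET.SubElement(tree.getroot(), "{%s}metadata" % CML_NS)
    metadata.set("{%s}identifier" % DC_NS, doi)
    tree.write(cml_path, xml_declaration=True, encoding="utf-8")

add_doi("plan9/data.cml", "doi:101.202/pm286/plan9/data.cml")
```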

The author adds the two identifiers to the manuscript (again, the system could do this automatically, e.g. for Word or LaTeX documents).

After acceptance of the manuscript the two files (data.dat and data.cml) are published into the public repository. Again our Lensfield system and the Clarion/Emma (JISC-CLARION) tools can manage the embargo timing and make this automatic. The author can choose when they do this so they don’t pre-release the data.

So the reader of the manuscript has a DataCite DOI pointing to the repository. What about the reverse?

This can be automated by the repository. Every night (say) it trawls recently published papers, looking for DataCite DOIs. Whenever these are located, the repository is updated to include the paper’s DOI. In that way the repository data will point to the published paper.
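The nightly crawl could be as simple as the following sketch; the paper URL and the repository-update step are hypothetical, and the DOI prefix is the post’s example:

```python
# Sketch of the nightly link-back crawl described above. The paper URL and the
# repository-update step are hypothetical; the DOI prefix is the post's example.
import re
import urllib.request

DATA_DOI_PATTERN = re.compile(r"doi:101\.202/\S+")   # our (example) DataCite prefix

def crawl(paper_urls):
    """Scan recently published papers for our data DOIs and record the back-link."""
    for url in paper_urls:
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        for data_doi in set(DATA_DOI_PATTERN.findall(html)):
            # In the real system this would update the repository record so that the
            # dataset points back at the paper's own DOI/URL.
            print("paper %s cites dataset %s" % (url, data_doi))

crawl(["http://example.org/journal/latest-article"])   # hypothetical paper page
```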

This doesn’t need any collaboration from the publisher except to allow their paper to be read by robots and indexed. They already allow Google to do this. So why not a data repository?

And what publisher would forbid indexing that gave extra pointers to the published work?

So – some of the details will need to be hammered out but the general process is simple and feasible.

In any case we’ll go ahead with a data repository for compchem…

Posted in Uncategorized | 3 Comments

A quixotic approach to an Open knowledgebase for Computational Chemistry

I’ve just got back from a wonderful meeting in Zaragoza on “Databases for Quantum Chemistry” (http://neptuno.unizar.es/events/qcdatabases2010/program.html ). [Don’t switch off, most of the points here are generally to scientific repositories and Open Knowledge]

Quantum Chemistry addresses how we can model chemical systems (molecules, ensembles, solids) of interest to chemistry, biology and materials science. To do that we have to solve Schroedinger’s equation (http://en.wikipedia.org/wiki/Quantum_chemistry ) for our system. This is insoluble analytically (except for the hydrogen atom) so approximations must be made and there are zillions of different approaches. All of these involve numerical methods and all scale badly (e.g. the time and space taken may go up as the fourth power of the system size, or even worse).

The approach has been very successful in the right hands but is also often applied without thought and can give misleading results. There are a wide variety of programs which make different assumptions and which take hugely different amounts of time and resources. Choosing the right methods and parameters for a study is critical.

Millions (probably hundreds of millions) of calculations are run each year and are a major use of supercomputing, grids, clusters, clouds, etc. A great deal of work goes into making sure the results are “correct”, often checked to 12 decimal places or more. People try to develop new methods that give “better” answers and have to be absolutely sure there are no bugs in the program. So testing is critical.

Very large numbers of papers are published which rely in part or in full on compchem results. Yet, surprisingly, the data are often never published Openly. In contrast, for some disciplines (such as crystallography) it’s mandatory to publish supplemental information or deposit data in databases. Journals and their editors will not accept papers that make assertions without formal evidence. But, for whatever reason, this isn’t generally the culture and practice in compchem.

But now we have a chance to change it. There’s a growing realisation that data MUST be published. There are lots of reasons (and I’ll cover them in another post). The meeting had about 30 participants – mainly, but not exclusively, from Europe – and all agreed that, in principle, it was highly beneficial to publish data at the time of publication.

There’s lots of difficulties and lots of problems. Databases have been attempted before and not worked out. The field is large and diverse. Some participants were involved in method development and wanted resources suitable for that. Others were primarily interested in using the methods for scientific and engineering applications. Some required results which had been shown to be “correct”; others were interested in collecting and indexing all public data. Some felt we should use tried and tested database tools, others wanted to use web-oriented approaches.

For that reason I am using the term knowledgebase, so that there is no preconception of what the final architecture should look like.

I was invited to give a demonstration of working software. I and colleagues have been working for many years using CML, RDF, semantics and several other emerging approaches and applying these to a wide range of chemistry applications including compchem. So, recently, in collaboration with Chemical Engineering in Cambridge we have built a lightweight approach to compchem repositories (see e.g. http://como.cheng.cam.ac.uk/index.php?Page=cmcc ). We’ve also shown (in the Green Chain Reaction, http://scienceonlinelondon.wikidot.com/topics:green-chain-reaction ) that we can bring together volunteers to create a knowledgebase with no more than a standard web server.

I called my presentation “A quixotic approach to computational chemistry knowledgebases”. When I explained my quest to liberate scientific information into the Open, a close scientist friend (of great standing) asked “where was my Sancho Panza?” – implying that I was a Don Quixote. I’m tickled by the idea and, since the meeting was in Aragon, it seemed an appropriate title. Since many people in chemistry already regard some of my ideas as barmy, there is everything to gain.

It was a great meeting and a number of us found compelling common ground. So common that it is not an Impossible Dream to see computational chemistry data made Open through web technology. The spirit of Openness has advanced hugely in the last 5 years and there is a groundswell that is unstoppable.

The mechanics are simple. We build it from the bottom up. We pool what we already have and show the world what we can do. And the result will be compelling.

We’ve given ourselves a month to get a prototype working. Working (sic). We’re meeting in Cambridge in a month’s time – the date happened to be fixed and that avoids the delays that happen when you try to arrange a communal get-together. As always everything is – or will be when it’s created – in the Open.

Who owns the project? No-one and everyone. It’s a meritocracy – those who contribute help to decide what we do. No top-down planning – but bottom-up hard work to a tight deadline. So, for those who like to see how Web2.718281828… projects work, here’s our history. It has to be zero cost and zero barrier.

  1. I set up an Etherpad on the OKFN site at http://okfnpad.org/zcam2010 – Etherpads take 15 seconds to create and anyone can play
  2. Pablo Echenique – one of the organizers and guiding lights of the meeting has set up a Wiki at
    http://quixote.wikispot.org/Front_Page

  3. Pablo has also set up a mailing list at http://groups.google.com/group/quixote-qcdb
  4. We are planning to set up a prototype repository at http://wwmm.ch.cam.ac.uk

[I suggested the name Quixote for the project and it’s been well received so that’s what we are going with.]

I have also mailed some of the Blue Obelisk and they have started to collect their resources.

So in summary, what we intend to show on October 21 is:

  • A collection of thousands of Open datafiles produced by a range of compchem programs.
  • Parsers to convert such files into a common abstraction, probably based on CML and maybe Q5COST (a minimal parser sketch follows this list)
  • Tools to collect files from users directories (based on Green Chain experience and code, i.e. Lensfield)
  • Abstraction of the commonest attributes found in compchem (energy, dipole, structure, etc.) This maps onto dictionaries and ontologies
  • Automated processing (perhaps again based on Lensfield)
  • Compelling user interfaces (maybe Avogadro, COMO, etc.)
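As promised above, here is a toy parser sketch of the kind of abstraction we mean. The regexes match Gaussian-like log lines only, and the dictRef names and units are placeholders, not an agreed Quixote dictionary:

```python
# Toy parser sketch: pull a couple of common attributes out of a compchem log and
# emit minimal CML. The regexes match Gaussian-like lines only, and the dictRef
# names and units are placeholders, not an agreed Quixote dictionary.
import re
import sys
import xml.etree.ElementTree as ET

PATTERNS = {
    "cc:scfEnergy": re.compile(r"SCF Done:\s+E\(\S+\)\s+=\s+(-?\d+\.\d+)"),
    "cc:dipoleMoment": re.compile(r"Tot=\s+(-?\d+\.\d+)"),
}
UNITS = {"cc:scfEnergy": "hartree", "cc:dipoleMoment": "debye"}

def parse_to_cml(log_text):
    root = ET.Element("module", {"xmlns": "http://www.xml-cml.org/schema"})
    for dict_ref, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            prop = ET.SubElement(root, "property", {"dictRef": dict_ref})
            scalar = ET.SubElement(prop, "scalar", {"units": UNITS[dict_ref]})
            scalar.text = match.group(1)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    with open(sys.argv[1]) as f:          # e.g. python parse_log.py job23.log
        print(parse_to_cml(f.read()))
```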

By giving ourselves a fixed deadline and working in an Open environment we should make rapid progress.

When we have shown that it is straightforward to capture compchem data we’ll then engage with the publishing process to see how and where the supplemental data can be captured. This is a chance for an enthusiastic University or national repository to make an offer, but we have alternative plans if they don’t.

We’ll fill some of the details later.

I’ll tag this #quixotechem

Posted in Uncategorized | 4 Comments

Where can scientists host their web pages? Please help

I’m currently at a meeting on Computational Chemistry where we are looking at how to store, search and disseminate our results. http://neptuno.unizar.es/events/qcdatabases2010/program.html The problem is a very general one:

A community creates results and wants to make the raw results available under Open licence on the web. The results don’t all have to be in the same place. Value can be added later.

One solution is to publish this as supplemental data for publications. (The crystallographers require this and it’s worked for 30 years). But the compchem people have somewhat larger results – perhaps 1-100 TB/year. And they don’t want the hassle (particularly in the US) of hosting it themselves because they are worried about security (being hacked).

So where can we find a few terabytes of storage? Can university repositories provide this? Would they host data from other universities? Could domain-specific repositories (e.g. Tranche, Dryad) manage this scale of data?

Last time I asked for help on this blog I got no replies and we had to build our own virtual machine and run a webserver. We shouldn’t have to do this. Surely there is a general academic solution – or do we have to buy resources from Amazon? If so, how much does it cost per TB-year?

If we can solve this simple problem then we can make rapid advance in Comp Chem.

Simple web pages, no repository, no RDB, no nothing.

UPDATE
Paul Miller has tweeted a really exciting possibility:
http://aws.amazon.com/publicdatasets/
At first sight this looks very much what we want. It’s public, draws the community together, it’s Open. Any downside?
P.

Posted in Uncategorized | 5 Comments