eResearch: meeting people and a closed/open story

#eres2011

So it’s great to be at eResearch Melbourne. Sitting with Alex Wade and Nico Adams (we are giving a semantic workshop on Thursday). And I’ll forget about Open stuff for a while.

At the get-together I’ve met many old acquaintances. There is a real sense of friendliness and collaboration. Australia has got its act together for eResearch – it’s country-wide – it’s learnt from the UK eScience experience and US cyberinfrastructure. And one of those lessons is that there are no easy answers in eResearch and its data. It’s hard work and requires skills and flexibility.

Except I can’t forget about Open.

I met A (from a university I had visited. Names and genders have been anonymised, though we may be able to publish them later). I asked where B was

A: She had had cancer – a rare type that her doctors hadn’t seen. The doctors wanted to read the literature, but they were in a small town which didn’t have a university.

P: The scholarly poor.

A: Yes, but because she was at a university she could read the literature.

P: How many papers?

A: Hundreds.

P: That would have cost several thousand dollars if she hadn’t been university staff.

A: Oh! I hadn’t realised that. Anyway, she is recovering.

P (getting into normal rant): Did you know that we are forbidden by publisher contracts to index “their” publications? [Note: these are “our” publications, appropriated by the publishers.]

A (who is a librarian): No, I didn’t.

P: And that I am forbidden to text-mine the literature?

A: No, I didn’t.

P: And that if I read too much literature too rapidly the publishers will cut off the whole university?

A: No!!!

P: Yes. They did that to me. Because they think I am a thief.

A: Well, B has kept a record of her use of the literature and may well publish it as a scholarly paper.

P: Under an Open licence?

…. And we will wait to find out …

So:

PMR now has one more anecdote showing that patients have a critical need to read the literature.

Because otherwise they die. [Yes, these are simple words but necessary]

So I will rephrase my dictum:

“Patients assert that (closed access means people die)”

(I have a sample of about 5 patient stories. That’s good enough for me.)

If this still makes you uncomfortable, you can truly state:

“PMR asserts that (patients assert that (closed access means people die))”

That’s a true statement (because I am PMR).

Here is one person who has many, many statements from cancer patients about the need for open access:

Gilles Frydman in real life and his avatar

At least people outside academia care. And I hope to help them find a semantic voice.


eResearch Australia 2011

I am back in Melbourne for the eResearch meeting. Nico Adams, Alex Wade and I are giving a workshop on “The Semantic Web for Physical Science”. (The bio world is spilling over with semantic stuff and Nico has done some very nice ontology creation – blending formal reasoning with pragmatism.) I spent the day with Nico preparing material. From what we can see the delegates mainly come from:

  • Computational science
  • Libraries/repositories
  • Service providers

I always try to aim the material at the likely audience.

Anyway some pictures and a hard question:

The new innovation/exploration building at Monash

 

You can have showers here when you arrive for work

And… (a) what are these people going to? (b) what is their very loose connection with Peter Sefton?


JUMBO components

[Briefer than I would have liked as MSWord crashed and so did the document recovery.]

I explained that I blogged everything now as Open. Imagine what I wrote and Word wiped out.

This blog is mainly for my own memory. It’s not guaranteed accurate. Note that all software mentioned is Open. Use Wikipedia to find the non-PMR stuff.

Prerequisites:

Most software is on Bitbucket under Mercurial source control (yes, look them up – they are friendly). Easy to download with

hg clone

which populates a directory. Use Eclipse, Netbeans or IntelliJ to explore (they’re friendly too after a little while). Blog me if you need help. If you want to build and install, use Maven (look it up) from the commandline (note there are scientists who do not know what a commandline is – please help them – it is the easiest way to do this stuff). And guess where to look it up…

JUMBO (just JUMBO):

If you are a developer, install in order. If you are not then skip this – it’s all packaged for you.

mvn test install

on the commandline.

This gives the basic CML toolkit. It is all packaged in a single Maven jar file. Just use Maven and sleep soundly.

JUMBOConverters:

Convert lots of legacy / foreign file formats to CML. Needed for CIF, Chemdraw, spectra and lots of compchem. Again, skip if not a dev.

mvn test install

There are 26 sub-projects, ranging from CIF and Chemdraw through many molecule formats to spectra, reactions and many, many compchem codes.

If you use Eclipse’s import, it manages them beautifully. Now you have the jumbo-converters package which is used in Quixote, etc. But you can also skip this and rely on the packaged jar.

Dictionaries:

Still intensively under development. Inspect if you like raw stuff, else skip.

Conventions:

Still intensively under development. Skip…

Chempound:

You will need this if you want to run Chempound unless you use the Quixote stuff below. Skip…

  • chempound – build a Chempound repository: https://bitbucket.org/chempound/chempound-aggregator

    mvn test install

There are about 28 packages. If you use Eclipse’s import, it manages them beautifully.

Sam Adams built this. It’s impressive code. Worth reading for an idea of how to manage repository import using SWORD, etc.

Quixote:

You will need this if running a Quixote repository

UPDATE: Docs from Jorge Estrada: https://bitbucket.org/jestrada/quixote-docs

More later


Searchable Semantic CompChem data: Quixote, Chempound, FoX and JUMBO

I am really excited to be at PNNL – there has been a queue of people wanting to talk about semantic data. We go over the fundamentals – if there are N ontologists in a room there are > N^2 fights. Ontologies are very close to people’s souls – they have developed their world models over years and they are similar but slightly different from everyone else’s. So we agree to keep things simple – and that’s feasible in down-to-earth physical science.

We are starting with two areas where data needs to be captured and searched (NMR spectra and Compchem). Here’s compchem

 

We’re using NWChem – IMO the world’s leading Open Source computational chemistry (QC) code, and arguably the leading one anyway. (It’s difficult to tell because the closed source ones don’t all allow benchmarks.) So NWChem is the centre. Traditionally it takes input (pink) and produces output (pink), the point being that this is often done by hand. Some codes have GUI stuff and there are also quite a lot of contributions from the Blue Obelisk. But what you are looking at is a leap beyond that – Quixote, dictionaries and Chempound. All, of course, Open. (Which makes it easy for any others who wish to interoperate rather than compete.)

The problem with hand-created input is that it doesn’t scale: it is desperately tedious, there are frequent human errors, and there is little machine validation of quality. We can fix all that by using Chemical Markup Language (blue). The job itself is described as a product of about 6 axes:

  • Molecule (this can also mean a crystal or other extended phase). There are many legacy formats, and we can convert them all into CML.
  • Commands. There is a relatively common set of things we want to do: optimisation, calculation of properties, etc.
  • Basis sets. PNNL is the world leader, as it produces the popular and valuable Open Basis Set Exchange (BSE). Basis sets describe the atomic orbitals that we shall use to create the molecular orbitals. (Don’t switch off, it gets easier.)
  • Methods and functionals. In simple terms, how we set up Schroedinger’s equation to calculate the energy of the molecule.
  • Physical parameters. What pressure, temperature, electric field etc. do we wish to impose?
  • “Environment”. The local stuff about who ran the job, when, how and on what machine.

Generally we can mix and match any of these (and that’s called a parameter sweep). We might have a fixed protocol and wish to apply it to 10,000 molecules. Or we might wish to take just a few molecules and try 50 basis sets (to see which gives the best agreement with experimental measurements). Or to step through a series of pressures, maybe modelling the earth’s crust.
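The mix-and-match idea is easy to sketch. Here is a minimal Python illustration of a parameter sweep as the Cartesian product of job axes – the molecule and basis-set names are made-up examples, not the real Quixote/CML vocabulary:

```python
from itertools import product

# Illustrative axis values only -- not the real Quixote/CML vocabulary.
molecules = ["water", "methane", "benzene"]
basis_sets = ["STO-3G", "6-31G*", "cc-pVDZ"]
commands = ["optimize"]

# A parameter sweep is just the Cartesian product of the axes:
jobs = [
    {"molecule": m, "basis": b, "command": c}
    for m, b, c in product(molecules, basis_sets, commands)
]

print(len(jobs))  # 3 molecules x 3 basis sets x 1 command = 9 jobs
```

Fixing one axis and sweeping another (10,000 molecules with one protocol, or one molecule with 50 basis sets) is just a matter of changing the lists.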

The parameters are abstractions and independent of the program syntax. In principle they can be used with any of the main programs. The framework means that chemists don’t have to worry about the syntax. So Quixote will generate all the CML input files. These are then converted by templates (automatically) to program input (in the future we’ll use CML directly).
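A rough sketch of that template step in Python – the input-deck layout below is a simplified, hypothetical NWChem flavour, not the actual Quixote template syntax:

```python
from string import Template

# Hypothetical, much-simplified template for an NWChem-style input deck.
nwchem_template = Template(
    "geometry\n  load $molecule_file\nend\n"
    "basis\n  * library $basis\nend\n"
    "task scf $command\n"
)

# Abstract, program-independent parameters, as they might come from CML:
params = {"molecule_file": "water.xyz", "basis": "6-31G*", "command": "optimize"}

deck = nwchem_template.substitute(params)
print(deck)
```

The same abstract parameters could be fed through a different template to produce input for a different code – that is the point of keeping them program-independent.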

The program runs and creates outputs. These are normally created for human viewing but we’ve written JUMBOParser that reads them and extracts all the information into CML. This is then ingested into a Chempound repository (which holds the native input and output, the CML and also transformed RDF). And today we got the Chempound repository working here. It now ingests NWChem files. The repository is designed to allow very sophisticated searches (using RDF, SPARQL and InChI/CML). Currently very few computational chemists store their output in searchable form. Quixote is changing that.

We can be even more professional. Rather than transform and parse, we can read CML directly into the program. It needs a bit of tweaking, but that is done with FoX. FoX was written by Toby White and is now tended by Andrew Walker at Bristol. FoX is a library which reads and writes XML in FORTRAN. A bit creaky compared with other languages, but it works.

Then there are the dictionaries. EVERY CML component in the file must be described by a dictionary entry. That means we have to write them, and we are doing exactly that. Hard work, but necessary. Of course you don’t get impact factors for doing it, though you should.

So one more three-quarter day. I haven’t been here long enough but the momentum is unstoppable.


Open Data; Why I love National Laboratories (STFC, PNNL, CSIRO, EBI)

[EBI? – Europe’s a sort of nation!]

I and others have been trudging a lonely path trying to get people to think about managing their data in semantic form. Henry and I have pushed Chemical Markup Language for 17 years and although its takeup is exponential it has a small constant. (Exponential doesn’t necessarily mean explosive). So it’s been very pleasing over the last year or so to see the interest from National laboratories.

Most countries have national labs. There is always an argument as to whether money is better spent centrally or distributed. Both are right and both frequently run into mismanagement and inefficiency (but we are on Planet Earth so no surprise). National labs are required when you need synchrotrons (here’s Rutherford, STFC, with the Diamond synchrotron at the back). That’s where Cameron Neylon does all his pioneering work on opening up data. That’s where Brian Matthews ran the JISC-I2S2 program for scientific metadata design and workflows that I was part of.

[Images used without permission but I am saying nice things about them]

And Daresbury (where we held our Quixote meeting, thanks to Paul Sherwood) in the linear accelerator tower which has now gone to graceful retirement as a café and seminar building.

Now I am at PNNL, in gorgeous Washington State (that’s thousands of miles from Washington DC). Here’s the guest house

(Photo courtesy of P. Murray-Rust, CC-BY)

And in February I’ll be in CSIRO, Clayton, Victoria, AU. More of that later.

So why the excitement?

National labs often have facilities (like the synchrotron or ISIS neutron source) which are used by the whole country. They are run efficiently and their primary output is data for the customer scientist. So data management is given a lot of thought and investment. Unlike academia, where you get no credit (yet) for managing data professionally, national labs have a need and a pride to do so.

So it’s not surprising that they are interested in semantic tools for data. And that’s why I have come to PNNL. People have been queuing up to talk. We are going to continue to semanticize the data from computational chemistry.

“Semanticize?”

Yes. Suppose you get a program that outputs:

Temp = 298.15

You “know” this is a temperature because 298.15 is a rather special number. It’s 25 Celsius, which is often taken as a standard. Add 273.15 (0 Celsius expressed in kelvin) and you get 298.15 K.

But the machine doesn’t know this. It has to be told. So here it is in CML:

<property dictRef="compchem:temp">
  <scalar units="unit:k" dataType="xsd:double">298.15</scalar>
</property>

Show that to most university chemists and their eyes roll upwards. It’s meaninglessly complex. It takes time and detracts from writing the next grant, getting the next notch on the h-index. But show it to someone in a national lab and their eyes light up. It allows them to manage their data. To search for it. To make sure that it is still fresh in 5 years’ time. That they can export it for others to re-use. Because national labs are not possessive of their data in the same way as academics are (of course this is a generalization, but as an academic I am allowed to mock them).
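For the curious, here is roughly what a machine can do once the markup is there – a few lines of Python standard-library code reading that snippet (namespaces omitted for simplicity):

```python
import xml.etree.ElementTree as ET

cml = (
    '<property dictRef="compchem:temp">'
    '<scalar units="unit:k" dataType="xsd:double">298.15</scalar>'
    "</property>"
)

prop = ET.fromstring(cml)
scalar = prop.find("scalar")
value = float(scalar.text)
units = scalar.get("units")

# Questions no regex over "Temp = 298.15" could answer reliably:
print(prop.get("dictRef"), value, units)  # compchem:temp 298.15 unit:k
```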

So over the last few days and the 1.8 remaining I have been getting CML installed here. We’re going to do several things:

  • Create semantic output (CML) from NWChem. Ideally this is done by adding CML hooks into the code (with FoX).
  • Create semantic output for NMR spectra.
  • Install a Chempound repository. Sam Adams and Jorge Estrada have been hacking this and got it working so that it ingests NWChem.

     

Off to edit FORTRAN under CYGWIN! And install Java under Maven. Exciting times.

 

 


My bike is (fairly) stable

An interlude. The PNNL guest house provides its guests with low-cost rented bicycles. I had a “free” day today (though I did some thinking) and cycled about 5+5 miles through Richland to Bateman Island. But first, here is my bike:

Perceptive cyclists will notice that it has no brakes on the handlebars – in fact it has a backpedal brake. I haven’t ridden one of these before and didn’t find it very easy – if you backpedal you stop, but if you put your foot down you keep going. I’m not sure what you do on steep hills. I think these bikes are popular in the Netherlands, which doesn’t have many hills. I couldn’t go very fast, partly because the bike had only one gear, partly because it’s quite heavy and partly because of the air intake.

Zooming in we see:

All the bikes are named after elements. I got Technetium (http://en.wikipedia.org/wiki/Technetium). The trouble is that all technetium isotopes are unstable/radioactive. The one I know (it is used in medicine) is technetium-99m, with a half-life of 6 hours. This means that after 6 hours half my bike would have disintegrated (actually it depends whether the bike represents a single atom – if so, then there was an evens chance that after 6 hours I would have no bike). I was even more worried because some technetium isotopes probably only last for milliseconds.

I needn’t have worried. The thoughtful PNNL bike people had chosen the isotope with a half-life of 4.2 million years. I had more chance of a car crash than spontaneous disintegration.

A beautiful day, and Bateman Island (http://en.wikipedia.org/wiki/Bateman_Island) was great (and isn’t OpenStreetMap fantastic – it shows the cycleways, unlike most other maps) – lots of birds on the river (which is quite wide here). I saw white pelicans, various grebes, ducks etc., which I’ll try to look up from memory:

And the causeway to the island


Update: Open Science conclusion; and PNNL NWChem/CML/Quixote update

#oss2011 @okfn

#oaweek

After my talk at OSS I published two posts on the value of Open Access – I used challenging language which has upset several people but seems to have struck a chord with others. The discussion has taken place on the Open Knowledge Foundation discussion list (http://lists.okfn.org/pipermail/open-science/2011-October/001032.html and about 25 more posts), culminating currently in a very long and well-researched post (http://lists.okfn.org/pipermail/open-science/2011-October/001053.html) by Jenny Molloy, my co-presenter at OSS. Further discussion can take place on this list – it’s open to everyone.

For the next few days I am now devoting my energies to helping create the first fully Open Computational chemistry system. This is based on:

  • NWChem http://www.nwchem-sw.org/index.php/Main_Page which last year became fully Open Source. It’s the main Open program for atomistic calculations, and is complemented by other Open codes such as ABINIT, MPQC and Quantum ESPRESSO. (Please comment if I have missed any – I am also not aware of a list of Open Source computational chemistry codes – not the same as cheminformatics.)
  • Chemical Markup Language and specializations in conventions such as CMLComp and compchem.
  • The JumboConverters framework
  • The Blue Obelisk (includes cclib, openbabel, Avogadro, Jmol, etc.) and other Open Source chemistry tools.
  • Chempound (a repository for semantic chemistry) built by Sam Adams.
  • The Quixote community http://quixote.wikispot.org/
  • The FoX library for XML and CML in Fortran

I have blogged about most of these before. At present what we are doing is:

  • Define a top-level dictionary for compchem. Bert de Jong at PNNL is optimistic that this is feasible in a reasonable time. It will be a community effort.
  • Define a revised convention for compchem (compchem1, say). Bert thinks there is a very clear infrastructure to almost all QC codes and that we can implement this.
  • Add CML output to NWChem. We are halfway there. I have compiled FoX on Windows and we are currently getting NWChem running on my machine.

This will be supported by dictionary validation and document validation.

Anyone interested should post a comment or mail me or the Quixote list – see main page above.

 


Suboptimal/missing Open Licences by Wiley and Royal Society

#oaweek

Well, Wiley has just proudly announced its first Open Access journals: http://www.wileyopenaccess.com/view/journals.html. They’re not cheap for author-side fees (Brain and Behaviour at 2500 USD – higher than the others – presumably it’s easier to tap brain researchers for money).

What has upset me is that the licence is CC-NC. No commercial use. http://www.wileyopenaccess.com/details/content/12f25d1df44/About.html

Now I’ll be very generous and assume that Wiley isn’t aware of the real problems of CC-NC. If they aren’t they should read my blog post:

/pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/

which also points to definitive sources.

CC-NC is apparently attractive, but actually completely restrictive for anything I want to do.

  • The material cannot be used for teaching as that can be construed as commercial (especially in private universities)
  • It cannot be put on web-pages which carry adverts
  • It cannot be used for text- or data-mining whose results are openly published, because a commercial company might read my paper or website and use it
  • All derivative works must carry CC-NC
  • And worst of all it violates the Budapest Open Access Declaration (and the Open Definition)

I doubt VERY much whether it is the intention of the AUTHORS to forbid commercial use of their material. Effectively they would be saying:

“I don’t want a manufacturer of medical equipment to use any pictures from Brain and Behaviour without paying WILEY money.” (Remember, dear reader, that the AUTHOR gets nothing.)

So, Wiley, I am in a good mood and assume this was a mistake. It would be very nice if you were able to respond to this post (you WILL read it, I know).

There’s a similar case at the Royal Society. Now they already publish Open Biology under CC-BY 3.0, so they know about licences. They’ve recently made all their historical content FREE, which is absolutely stunning (http://royalsociety.org/news/Royal-Society-journal-archive-made-permanently-free-to-access/), but there is no explicit licence. I have also heard that there are actually still paywalls in place for this material.

Please, Royal Society, tell us you simply forgot to add CC-BY on the splash pages and the articles. Because then we can use them for teaching, etc. with a clear legal conscience.

And we can then do some exciting things with the Bibliography!

 

 


Occupied Scholarly Territory: Which publishers do I trust?

#oaweek

For me the primary concern in scholarly publishing is: whom do I – and maybe you – trust? This post gives some personal thoughts and will probably upset some people, but it shows where I stand.

If I am getting windows renewed for the house I need to know which builders I can trust. That’s as important as cost. Who has my interests at heart when I pay them for materials and labour? It’s not a silly idea – and in a small city like Cambridge there are many ways to address it – friends and neighbours who have had work done – reports (good and bad) on the Cambridge blogosphere – visiting showrooms and premises, etc.

And almost always talking to the people involved.

And generally it works. When large commercial companies are involved the personal trust is lacking but it’s still possible to read consumer magazines or the grumblepages of the newspapers. Generally you know what is available with some idea of who the cowboys are (a UK term which is not flattering!). And local tradespeople often have the interest of the community as well – they live there!

But in scholarly publishing it’s different. Who can you trust to look after your interests? Either as author, or reader, or institution, or the wider society?

Answer: There are almost no scholarly publishers you can trust. Certainly not when measured by the volume of publications.

The only publishers I trust are those where I know the people involved, talk with them, and we know each other’s desires and limitations. Here are some I do trust:

  • The International Union of Crystallography. They have a society-based ethic, are innovative, have been part of my life for 45 years. I know the editors and the IUCr boards and committees. They are my ideal, followed by:
  • The European Geosciences Union (publishing through Copernicus). They are aggressively Open Access because they are part of the community and have the community interests at heart.
  • Public Library of Science PLoS. Because it was set up by passionate scientists, who wanted to change the world of scholarly publishing. My trust remains as long as the scientists such as Jonathan Eisen are in control.
  • ASBMB – a society publishing biology and molecular biology. I know the editor Ralph Bradshaw well and we have talked long about the aspirations of the journal for Open Data – the need to back the science with data. He insisted on that for Molecular and Cellular Proteomics (MCP) and the rest of the publishing community sneered. Now they have adopted the principles pushed forward by Ralph. MCP isn’t OA, but I trust it. As long as Ralph is in charge.

I trust these because I trust the people. Other people I currently trust are the immediate editors in Biomed Central who have done a great job in promoting Open Access and Open Data.

But BMC are owned by Springer and I totally distrust Springer as an organization to look after my interests, my university’s interests, and my readers’ interests. I may be slightly romantic, but I come from a background where companies were ethical and wished to provide a fair product or service to those who paid them. It used to be called pride.

But read Richard Poynder’s interview with Springer’s boss http://poynder.blogspot.com/2011/01/interview-with-springers-derk-haank.html . Haank speaking:

“The Big Deal is the best invention since sliced bread. I agree that there was once a serial pricing problem; I have never denied there was a problem. But it was the Big Deal that solved it.

“The truth is that it is in the interests of everyone—publishers and librarians—to keep the Big Deal going.”

I find no mention of “reader” (the end-user of a publisher is the purchasing officer of the university – often the Library).

I find no mention of “author” (other than “author charges”, “author archiving”)

I find no mention of “the scientific community”

The whole article is cold-hearted. About how Springer has designed a product not on its value to the community which is paying for it, but as something artificial that can be manufactured as cheaply as possible and sold at the highest price. It doesn’t matter to Haank whether it helps science – it’s just a commodity. And absolutely no indication of innovation based on what the community wants – oh, no – it’s innovations that Springer thinks it can sell. Like the 35 USD per day rental of papers.

So, sadly, I do not trust BMC long term and it saddens me to say so.

The other commercial publishers (almost all closed access) are all the same. I don’t trust any.

And what about societies? I used to help run the Molecular Graphics Society as treasurer. We didn’t use publications to subsidize the society – we used the society to subsidize subscription costs for members. (Shut up, PMR, you are a stupid romantic – we are in the C21 and sentimentality is a thing of the past.)

Most of the societies have lost their soul and sold out in one way or another. The American Chemical Society’s anticontributions are well-known. The Royal Society of Chemistry stated that “Open Access is ethically flawed”. OK, that was 5 years ago – but how can a society say that at all? Many learned societies, especially large ones, are run for the benefit of their senior officers and the bottom line.

Which is a tragedy, because it is the learned societies and international unions who should be the guardians of scholarship. Not profit-oriented business people, whether commercial or not. I’d love to recover their role – I wish I knew how.

And until that happens we are left with a very few organisations we can trust. A few charities (e.g. Wellcome Trust) and a few (not all) funding bodies.

Oh, and if you think that all commercial OA publishers can be trusted, read Richard Poynder on InTech (http://poynder.blogspot.com/2011/10/oa-interviews-intechs-nicola-rylett.html). Oh for the lost learned societies. Quis custodiet? No one except you and me… We’ll have to do it through the blogosphere.

Because, yes, I can trust the bits of the blogosphere I have learned to trust.

Yes, today seems to be a gloomy start.

 

 


PNNL and eResearch: Semantic Physical Science

[The purpose of this post is to work out my thoughts, test that I can blog from PNNL, let people know I am still alive, and tell the world what I am doing and will do.]

I’m spending 9 days here at PNNL (in Richland, WA, US) with little to distract me so I have a real chance to get my ideas in order about semantic physical science. There’s a natural progression:

  • Create V0.9 of a high-quality computational chemistry dictionary (or ontology, if you like the word). It’s expressed as XML (Chemical Markup Language) but it’s also isomorphic with simple RDF triples. We’ve done the first pass (we have a V0.1) and I’m working with the group here to create the next versions.
  • Then travel to eResearch in Melbourne, where I’m collaborating with Nico Adams, one of my colleagues from Cambridge, who has moved to CSIRO, Clayton. Nico not only buys into the idea of semantic science, he’s pushed it much further than I could have. With Alex Wade we are running a one-day workshop at eResearch (http://conference.eresearch.edu.au/workshops/), “Making the Semantic Web work for Physical Science”. I’m getting my ideas together now, and there will be a concentration on things like chemistry, quantities and units of measurement. If you know what the boiling point of water is, then you are qualified for the workshop.
  • Later, in February, I will be spending some months with Nico. CSIRO is a great place to really develop an infrastructure. National labs (like PNNL, CSIRO and STFC – and international ones like EBI and NCBI) understand the need for proper data management, infrastructure and information engineering. Academia generally doesn’t, and when it does it doesn’t value it.
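The isomorphism with RDF triples mentioned in the first bullet can be sketched like this – the URIs below are illustrative placeholders, not the real dictionary URIs:

```python
# Map one CML scalar property to an RDF-style (subject, predicate, object)
# triple. All URIs here are made up for illustration.
def cml_to_triple(job_uri, dict_ref, value, units):
    predicate = dict_ref.replace("compchem:", "http://example.org/compchem#")
    return (job_uri, predicate, f'"{value}"^^{units}')

triple = cml_to_triple(
    "http://example.org/job/42", "compchem:temp", 298.15, "unit:k"
)
print(triple)
```

Each dictionary entry supplies the predicate – which is one reason every CML component must point to a dictionary entry.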

More later as I get the order worked out.

[Immediate update. I can blog from PNNL Visitor LAN!]
