petermr's blog

Southampton’s Blog3 and ScholarlyHTML

Posted on April 11, 2011 by pm286

#scholarlyhtml

There were several exciting things to come from the recent workshops (World University Network Lab note book, and OREChem) at PNNL; this post is on Southampton’s Blog3 (http://blog3.rubyforge.org) (Jeremy Frey, Simon Coles, Mark Borkum and others). They’ve been using blogging as a means to provide an “electronic lab notebook” (a term which I think is rather dated and where the Soton work goes much beyond).

They’ve been doing it for some years but I think the progress over the last year or so has been important. Originally blog technology was very flaky (and still is). They’ve written their own which creates a better semantic platform. I’m not sure what the interoperability is and the issues in using this beyond Soton, but hope to find out.

It has convinced me that blogging is the way to go for capturing and enhancing scientific work – at least for academia and probably for companies as well. There is so much common ground with established practice on the web. Obviously if strong AAA (Authorisation, Authentication, Accounting) is required this takes a LOT more effort whatever technology is used – there are no easy answers (and academia is a hotchpotch of so many different problems in that area).

This reinforces my conviction that HTML5, not PDF, is the way to go for science. (It always was until PDF lost us ten years of progress). ScholarlyHTML fits perfectly into this. It helps to define what convention(s) a blog should emit and what it can consume. If, for example, the blog has created a chemical compound record, then we should use a convention that supports and constrains this in ScholarlyHTML (/pmr/2011/03/14/scholarly-html-%E2%80%93-major-progress/ ). Of course we can and should embed CML in this where appropriate – e.g. for molecules, crystals, calculations, etc.
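As a rough illustration of what such a compound record might look like, here is a minimal Python sketch that builds an HTML fragment with a CML molecule embedded in it. The class name `chemical-compound` and the function `compound_record` are invented for this example and are not part of any published ScholarlyHTML convention; the CML namespace and the atom attributes (`id`, `elementType`, `hydrogenCount`) come from CML itself. It also assumes an XML-style serialisation of the HTML, so that the namespaced CML elements survive.

```python
import xml.etree.ElementTree as ET

# The CML namespace; the "cml" prefix is only a serialisation choice.
CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("cml", CML_NS)


def cml(tag):
    """Return a tag name qualified with the CML namespace."""
    return f"{{{CML_NS}}}{tag}"


def compound_record(title, atoms):
    """Build a <div> with a human-readable title and an embedded CML molecule.

    `atoms` is a list of (element symbol, hydrogen count) pairs.
    The class name below is hypothetical, standing in for whatever
    a real convention would require.
    """
    div = ET.Element("div", {"class": "chemical-compound"})
    ET.SubElement(div, "h2").text = title
    molecule = ET.SubElement(div, cml("molecule"), {"id": "m1"})
    atom_array = ET.SubElement(molecule, cml("atomArray"))
    for index, (element, hydrogens) in enumerate(atoms, start=1):
        ET.SubElement(
            atom_array,
            cml("atom"),
            {
                "id": f"a{index}",
                "elementType": element,
                "hydrogenCount": str(hydrogens),
            },
        )
    return div


record = compound_record("Methane", [("C", 4)])
html = ET.tostring(record, encoding="unicode")
print(html)
```

The point of the sketch is only that the human-readable part (the heading) and the machine-readable part (the CML) travel together in one document, so that a consumer can check the record against whatever convention it declares.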

If we all adopt ScholarlyHTML for our science – and the relatively modest discipline it imposes – then we can have something close to semantic interoperability.

And where we can’t it’s because we don’t fully understand the science, not because we cannot manage the syntax.

Posted in Uncategorized | Leave a comment

NWChem: a fully Open Source compchem code from PNNL

Posted on April 9, 2011 by pm286

#nwchem #quixotechem

I’ve spent a great 4 days at Pacific Northwest National Laboratory (http://www.pnl.gov/ ) where we’ve been doing a number of things – including OREChem (which I’ll blog later). It’s been great to talk with the people who have developed and are continuing to develop NWChem (www.nwchem-sw.org/ ) – their flagship computational chemistry package (which does both atomistic and plane wave calculations). It’s very large and I’ll be finding out more during the plane journey back.

But the key first thing is that it’s Open Source.

The normal practice in computational chemistry is to develop a business model where costs can be recovered. Sometimes this is free-to-academics-pay-by-industry. Sometimes it’s pay-by-everybody.

I have no moral principles against charging for software. But there is a utilitarian downside. It fragments the society (there are probably 10 other codes which do much-the-same as NWChem). It leads to closed algorithms (“you can’t see our code because you might steal it”). And it is difficult to develop a modern model where there are community contributions.

The result is that many codes have an architecture and community that creaks.

NWChem has broken the mould. (I should mention that there are plane wave codes which have also done this, Quantum ESPRESSO and ABINIT – and I work with them as well.)

So I and other Quixotans are working with Open Source codes to add semantics. That will take them from FORTRAN-like tools with serious impedance in input/output to potentially semi-intelligent information engines. It means that compchem will not have 20 separate languages, but languages which truly reflect the physics and chemistry.

What’s the plane journey got to do with it?

I’m writing a declarative parser for NWChem. Here’s a chunk of current log output:

Lattice Parameters

------------------

lattice vectors in a.u. (scale by 1.000000000 to convert to a.u.)

a1=< 5.920 0.000 0.000 >

a2=< 0.000 10.255 0.000 >

a3=< 0.000 0.000 9.653 >

a= 5.920 b= 10.255 c= 9.653

alpha= 90.000 beta= 90.000 gamma= 90.000

omega= 586.0

reciprocal lattice vectors in a.u.

b1=< 1.061 0.000 0.000 >

b2=< 0.000 0.613 0.000 >

b3=< 0.000 0.000 0.651 >

Now CML understands this – it has lattice vectors (real and reciprocal). But what’s “omega”? I’m a crystallographer and I’ve never heard of omega. There’s a clue later in that the volume is also given as 586.0. So I am guessing that omega is the symbol for volume in some community of practice.
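The guess about omega can be checked from the log itself. Below is a minimal Python sketch, not the actual parser: it uses a small table of regular-expression rules (the "declarative" part, with rule names and helper functions invented for this example) to pull the lattice vectors and omega out of the chunk above, then compares omega with the cell volume computed as the scalar triple product a1 · (a2 × a3).

```python
import re

# A copy of the relevant log lines from the NWChem output shown above.
LOG = """\
 lattice vectors in a.u. (scale by 1.000000000 to convert to a.u.)
 a1=< 5.920 0.000 0.000 >
 a2=< 0.000 10.255 0.000 >
 a3=< 0.000 0.000 9.653 >
 a= 5.920 b= 10.255 c= 9.653
 alpha= 90.000 beta= 90.000 gamma= 90.000
 omega= 586.0
"""

NUM = r"[-+]?\d+\.\d*"


def vector(label):
    """Pattern for a line such as 'a1=< 5.920 0.000 0.000 >'."""
    return rf"{label}=<\s*({NUM})\s+({NUM})\s+({NUM})\s*>"


# Declarative rules: (field name, pattern). Adding a field means adding a row.
RULES = [
    ("a1", vector("a1")),
    ("a2", vector("a2")),
    ("a3", vector("a3")),
    ("omega", rf"omega=\s*({NUM})"),
]


def parse(log):
    """Apply every rule to the log; vectors become lists, scalars floats."""
    fields = {}
    for name, pattern in RULES:
        match = re.search(pattern, log)
        if match:
            values = [float(group) for group in match.groups()]
            fields[name] = values if len(values) > 1 else values[0]
    return fields


def triple_product(a, b, c):
    """Scalar triple product a . (b x c)."""
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    return sum(x * y for x, y in zip(a, cross))


cell = parse(LOG)
volume = abs(triple_product(cell["a1"], cell["a2"], cell["a3"]))
print(f"computed volume = {volume:.2f}, omega = {cell['omega']:.1f}")
```

For this orthogonal cell the triple product is just 5.920 × 10.255 × 9.653, about 586.03 in cubic atomic units, which agrees with omega= 586.0 to the precision printed. That supports the reading of omega as the cell volume, though it does not prove that this is what the NWChem authors intended.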

So we are creating a vocabulary that the whole NWChem community can contribute to. I even hope that someone will comment on this post, but even if not the communal process will soon resolve this problem.

Once and for all.

So by an open community process we make rapid progress. Which will soon mean that the Open codes will have a major semantic advantage over the closed codes.

At that stage scientists will start to wonder whether “free as in beer” and “free as in speech” is actually a very valuable concept and one worth throwing their effort behind.

I look forward to much continued collaboration with the NWChem group and the Quixotans.

And an exciting plane journey.

OSCAR4 Launch

Posted on April 8, 2011 by pm286

#oscar4launch

I am delighted to announce the launch of OSCAR4:

http://www-pmr.ch.cam.ac.uk/wiki/OSCAR4_Launch

OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles. It can be used to identify chemical names, reaction names, ontology terms, enzymes and chemical prefixes and adjectives. In addition, where possible, any chemical names detected will be annotated with structures derived either by lookup, or name-to-structure parsing using OPSIN[1] or with identifiers from the ChEBI (‘Chemical Entities of Biological Interest’) ontology.

The current version of OSCAR, OSCAR4, focuses on providing a core library that facilitates integration with other tools. Its simple-to-use API is modularised to promote extension into other domains and allows for its use within workflow systems like Taverna[2] and U-Compare[3].

We will be hosting a launch on the 13th of April to discuss the new architecture as well as demonstrate some applications that use OSCAR. Tutorial sessions on how to use the new API will also be provided.

OSCAR4 is a major rewrite and the people involved (Lezan Hawizy, Bala Kolluru, David Jessop, Sam Adams and others) deserve great credit. OSCAR4 makes it much easier to incorporate as a module for:

Training/machine-learning
Domain adaptation
Web applications
Etc.

We see OSCAR4 as potentially applicable to a wide range of corpora in physical sciences (not just chemistry); it is particularly suited to named entities, quantities with units and errors, and chemistry-in-other-disciplines.

The Freedom Cloud: The future of our culture is in the balance

Posted on April 1, 2011 by pm286

#okcon2010 #okfn

I have known Becky Hogge for several years – Becky is deeply involved in the Open movement and is inter alia a board member of the OKF. She’s just published an essay, http://www.opendemocracy.net/becky-hogge/freedom-cloud , which so exactly mirrors my own thoughts (and leads them) that I want you all to read it. It also mirrors keynotes at OKCon10. The message is simple:

At this very moment the freedom of the world’s culture is in the balance

That’s a strong statement and it’s an act of faith. We are in the middle of a great cultural change (due to the Internet). Because we are in the middle of history we cannot (by definition) assess it objectively. But in 20 years historians will look back and say that in 2010-2015 the battle for freedom was won or lost. (Of course if the loss is too traumatic – a 1984-like newspeak culture – there will be no historians. And no language in which to express our loss.)

Read Becky, not my summary. But in essence the forces of control (mainly large corporations: Google, Apple, Microsoft (though diminished), Thomson-Reuters, Macmillan, Elsevier, Murdoch) are looking to monopolise our thought and culture. The printing press liberated our culture, but printing presses can be controlled. The heady days of 1993 when everything was possible have withered and we have Facebook, Google, etc.

So why aren’t these a “good thing”? Google does no evil, so we shouldn’t worry. But history teaches that all large organizations self-corrupt. I used to work in pharma (Allen and Hanbury’s). It did no evil. I knew the people who ran it. They made medicines to cure people or manage diseases (e.g. Ventolin). I know they would be incapable of the excesses of current pharma. But now the pharma industry is managed by standard corporate goals. So it wasn’t surprising that a publisher and a pharma company got together to create a fake scientific journal solely for making money for both. Truth was abandoned.

By analogy that has to be true for all large industries. Some have better corporate roots than others but the benevolent dynasties of 19th-century industrialists (from which my personal history springs) – Cadbury, Rowntree, Lever, Nettlefold – have gone and there are no moral or religious checks. So we have to question and check everything that large corporations do.

The key problem is the control of information and through that the control of people and people’s thought. Facebook controls people. Google controls people. And through their lobbying of governments, publishers and media control people. Net-neutrality is critical – we have to fight for it. Established laws are not a useful precedent – we have to create the visions that thinking moral citizens would adopt. Charters and constitutions (e.g. why the OK definition is so important). Our 21st-century equivalent of the Bill of Rights.

The good thing at present is that there are many more educated literate humans than in the 18th century. Even in Scotland, whose Enlightenment was responsible for much of our current freedom of thought.

Some days I wake up and think – what a lot of things we are liberating. And other days I think how much is being ripped away before our eyes. Why does not academia rise up and protect its freedoms? Its primary job is to define our possible cultures and put them in front of us. And if we want to pursue freedom (as opposed to personal glory) to help us and to go to the wall if necessary. Yes, as Becky recounts, freedom is in the balance from Libya to Bahrain. But it’s also in the balance in Washington and London.

All I can do is “keep buggering on”. And hope that the little bits of very hard won freedom will inspire and can be aggregated to an emergent phenomenon of world internet freedom.

It can happen.

Breakfast in Seattle at the Mecca Cafe

Posted on April 1, 2011 by pm286

#quixotechem #acsanaheim

I’m in the Mecca café on Queen Anne in Seattle and in heaven. I’ve been to Seattle quite often – mainly to visit Microsoft in Redmond – and we’ve stayed at the Mediterranean Inn on Queen Anne; one block away is the Mecca. It’s been doing real American breakfasts for 80 years and it’s a refreshing change from MacBurger and the rest. I ordered 3 blueberry pancakes – and the server wisely counselled me to have 2 (and even that is more than an average human should eat…). Free wifi of course.

I’m winding down from an intense week of time-critical demos and talks at the ACS and elsewhere. These are communal projects so lots of people deserve lots of credit: our group (Sam, Joe, Lezan, David, Nick, Daniel), the OKF (Mark, Rufus, Ben, William, Daniel, Alfredo, Mathias), Quixote and the Blue Obelisk (Marcus, Pablo, Jens, Sebastian, Henry – these are only the most involved in last week). I’ll try to blog these later in detail, but they include:

ChemicalTagger. We can now technically read the chemical literature by machine and extract data. But the publishers are actively stopping us.
Open Data. The concept is now clear. Two typical concerns: The ACS copyright data (sic). They didn’t create it, they didn’t edit it, I suspect they didn’t even read it. But they stamp it as theirs. So we’re moving to the situation where we cannot challenge scientific results for fear of being sued. Does no-one else get angry? And there is a cosy cartel where Elsevier, Wiley and Springer feed raw data to the Cambridge Crystallographic Data Centre who then control its active redissemination. Why? Not for any scientific reason but to perpetuate the CCDC’s business model. Result: half the world’s published crystallography is unavailable. (MEMO: I think I will write to the CCDC board.)
Lensfield/Quixote. A tremendous push from everyone. Really tremendous. We had to put in place:
- Parsers and other converters to XML CML. The technology works. It’s simple and could be used in many other areas of physical science. Anyone can develop a parser as long as they understand what the program is actually doing!
- Conventions. Essentially validatable community-driven agreed practice. Think validatable microformats. What XML should have been before XSD ruined community semantics.
- Dictionaries. A formal description of what the input and output to codes are. An OWL-free zone, that normal people can understand.
- Repositories/Chempound. We now have a working chemistry repository that anyone can POST to or SWORD to. That is indexed through RDF and aggregated through OREChem. This could and should become the de facto approach to managing chemical information in the modern world. It works at a lab level and at an “enterprise” (argh) level and also out on the Open Web Of Linked Data. Which is where most of our data should end up.
Open Theses. And last night (for me) we ran the first Open Theses workshop. In Vilnius. It was 0300 for me and the skype was bad. But we created a sense of community. And some initial metadata for theses. I hope to get all Murray-Rust theses into this – I think I have 4 so far. There is no reason why the world should not have Open metadata for Open Theses.

So my next self-imposed deadline is demoing Chempound/ORE at PNNL next week… It has to work and it will work.

Just-in-time

Open Data at the ACS

Posted on March 31, 2011 by pm286

#acsanaheim

I spoke on Monday at the ACS “Open Data” session on the Panton Principles. I had to leave after my talk because I was speaking in the Education session, so my comments on the other papers are based on hearsay and their abstracts. There were only 4 contributed papers.

Mine
One on a commercial software project (the only OD reference was apparently “it would be nice to have some Open Data”)
The Cambridge Crystallographic Data Centre (CCDC) arguing that data should be curated and charged for and that this business model had to be maintained. (I should point out that CCDC is the “official” repository of raw data from crystallographic experiments. Half the publishers (Springer, Wiley, Elsevier) do not publish supplemental crystallography and the authors then donate the data to CCDC. If you want the data you either have to subscribe to the database or can only get a handful of data (I think 25 out of 500,000). There is no right of re-use.)
A paper by the organizer Irina Sens who wasn’t able to come.

In another talk Steve Bachrach reviewed the SOAP report of Open Access. It says – no great surprise – that chemistry is well behind other sciences in OA – estimated at 5 years (I would increase this to 10).

I was told that the ACS supplemental data was now Open. Wow! I was going to jump up and down publicly. There was a JPA (journal publishing agreement) on this: http://pubs.acs.org/userimages/ContentEditor/1285231362937/jpa_user_guide.pdf (11 pp). It (I quote claiming fair use, as the document is copyright) “is a result of ACS’ ongoing efforts to provide the best possible publishing experience for our authors”. (I note this awful phrase “user experience” creeping into the language)… Here’s some more:

The new agreement specifically addresses what authors can do with the different versions of their manuscript – e.g. use in theses and collections, teaching and training, conference presentations, sharing with colleagues, and posting on websites and repositories.

The terms under which these uses can occur are clearly identified to prevent misunderstandings that could jeopardize final publication of a manuscript.

• The new agreement clarifies that the transfer of copyright in Supporting Information is nonexclusive. Authors may use or authorize the use of Supporting Information in which they hold copyright for any purpose and in any format.

• The new agreement extends key terms of use to an author’s previously published work with ACS—as long as the same conditions of use are met.

• Behaviors expected of ACS authors are more fully addressed throughout the agreement.

I haven’t read it all but these seem small positive steps. But I am more interested in what READERS (an archaic term replaced by “end-user”) can do. A reader is a human OR machine who actually wants to do something with the published material. To have an interactive experience. So, with great excitement I turned to the conditions of use of ACS supplemental info:

Electronic Supporting Information files are available without a subscription to ACS Web Editions. The American Chemical Society holds a copyright ownership interest in any copyrightable Supporting Information. Files available from the ACS website may be downloaded for personal use only. Users are not otherwise permitted to reproduce, republish, redistribute, or sell any Supporting Information from the ACS website, either in whole or in part, in either machine-readable form or any other form without permission from the American Chemical Society. For permission to reproduce, republish and redistribute this material, requesters must process their own requests via the RightsLink permission system. Information about how to use the RightsLink permission system can be found at http://pubs.acs.org/page/copyright/permissions.html.

What’s changed? Here’s the same paragraph about 5 years ago

Electronic Supporting Information files are available without a subscription to ACS Web Editions. All files are copyrighted by the American Chemical Society. Files may be downloaded for personal use; users are not permitted to reproduce, republish, redistribute, or resell any Supporting Information, either in whole or in part, in either machine-readable form or any other form. For permission to reproduce this material, contact the ACS Copyright Office by e-mail at copyright@acs.org or by fax at 202-776-8112.

Well the “end-user experience” is pretty much the same. You can’t do anything without permission. Oh, dear – and I was so expectant.

Actually it’s worse. The old version meant it was a straight dialogue with the ACS – I carried this out over several years without much response. The new version has:

The American Chemical Society holds a copyright ownership interest in any copyrightable Supporting Information.

This is so wonderfully fuzzy that it guarantees that you will not get a clear response from the ACS as to what it means. (Well, actually, you won’t get a response anyway. I have had one response in 4 years of trying: “Let’s discuss it at the next ACS meeting”. Not yes, not no, but classic beautiful MUMBLE.)

By contrast two cheers to Chemspider. Chemspider is not an Open resource – it is run by the RSC and the system and content are by default closed. They have collected data and contributed data and some of this is Open. So Tony Williams showed that the Open Data items will be stamped with the OKF button.

Well done Chemspiderman.

Open Theses at EURODOC: 2011-04-01; Sleepless in Seattle

Posted on March 29, 2011 by pm286

#jiscopenbib #opentheses

As part of our JISCOpenBIB project we are running a workshop on Open Theses at EURODOC 2011. “We” is an extended community of volunteers centered round the main JISC project. In that project we have developed an approach to the representation of Open Bibliographic metadata, and now we are extending this to theses.

Why theses?

Because, surprisingly, many theses are not easily discoverable outside their universities. So we are running the workshop to see how much metadata we can collect on European theses. Things like name, university, subject, date, title – standard metadata.

For the workshop we’ll have an Etherpad… http://science.okfnpad.org/Conference-call-Eurodoc-Open-Theses-workshop-20110321 If you haven’t used an Etherpad just go to the address. You can add your material into the pad. Let us know if you are interested in being involved.

There will be a datasheet for collecting data: https://spreadsheets.google.com/ccc?key=0AnCtSdb7ZFJ3dHFTNDhJU0xfdGhIT01WeTBMMDZWOGc&hl=en_GB&authkey=CJuy4owB#gid=0

We’ll also be collecting survey data; the survey location (coming online very soon) is http://bit.ly/Eurodoc-opentheses-survey

This workshop is not limited to participants. I shall be in Seattle US. Sleepless (it’ll be 0300 in the morning there). So all of us can and should participate. I’ll try to add MY thesis data (1967, but I think that counts as European?)

So I’ll blog more info as we create it. But 1300 WEST = 1200UTC is the time we start – make a note to be involved.

ScholarlyHTML – ScholarlyChemistry!

Posted on March 29, 2011 by pm286

#scholarlyhtml #acsanaheim

In this morning’s CINF program http://abstracts.acs.org/chem/241nm/program/divisionindex.php?act=presentations&val=Internet+and+Chemistry:+Social+Networking&ses=Internet+and+Chemistry:+Social+Networking&prog=54108

Alex Clark observed that there was such a mess of different mobile providers (Apple, Blackberry, Android …), all incompatible, that the solution for Chemistry was to adopt HTML5 and Javascript.

Just what we have concluded for Documents!

Let’s build the next generation of chemistry in HTML5! It’s a bit of work, but it will be worth it. I will start hacking CML …
