petermr's blog

The Law of the Excluded Mumble; Please SIGN the Principles of Open Bibliographic Data

Posted on February 3, 2011 by pm286

In classical logic there is a Law of the Excluded Middle

http://en.wikipedia.org/wiki/Law_of_excluded_middle

that states (roughly, but read the article) that something is either TRUE or not TRUE (==FALSE).

This principle does NOT hold in scholarly publishing where there are three states:

The material is OKD-Open LIBRE. You can use it without seeking permission
The material is definitively Not OKD-Open (Gratis or CLOSED). If you re-use it you are liable to take-down mandates, lawyers’ letters, having services arbitrarily cut off by suppliers of “their” content, and personal lawsuits (and this has happened).
MUMBLE.

MUMBLE?

Mumble is the main non-LIBRE response from most publishers when you ask whether there are specific permissions to re-use material. Roughly in order of frequency, the varieties are:

Null response. Yes, most publishers don’t even reply to polite requests for factual information. I once mailed FIVE editors of a scholarly journal asking if I could annotate their material. Not one had the courtesy to reply. How can I ensure that a journal or publisher can at least have the decency to reply to a responsible question? But because I am just a reader I can be ignored (the publisher’s customers, or “end-users” are the purchasing officers – readers don’t count for anything in this market). The problem with the null response is that there are so many ways to justify doing nothing.
The filibuster. The publisher apparently offers to give an answer but never does. We are still waiting for a response from a major publisher after four years. It’s always polite – “let’s talk about it when we next meet” or similar. By comparison my enquiry with Elsevier about whether I can text-mine chemistry is a mere eighteen months old. We’ve finally got to the stage where they have referred it to their legal experts. All will be revealed on this blog. A few days ago I asked if the discussion could be public. So far, null response.
Classic mumble. This can take so many forms. Typical phrases are “it all depends on…”, “well I am not a lawyer, so…” [“my car has broken, can you fix it … Sorry, I am not a lawyer”].
Paper chase. “If you refer to the UK copyright act… you’ll find what you need”. Pointing people to legal documents is a surefire way of bottling the problem. We want answers, not meta-answers.
Reductio ad absurdum. This is using logic and terminology to escape the problem. I had a discussion recently. I won’t reveal the source. I wanted to know if data in publications were free to use. “Well you can re-use really raw data, but data on publishers’ web sites has had creative treatment and so is potentially copyright” (qualified by “it all depends what sort of data”). Could I use graphs, tables? (Elsevier has given me a NO on this – all data in tables and graphs belongs to Elsevier. See BtPDF discussions. At least NO is better than mumble. Of course I do not accept this.) “So is a spectrum printed from a machine really raw data?” “It all depends – the software used to print it is creative so possibly not.” Oh, dear.
The pious hope. Create a declaration that everyone agrees to. The STM publishers agreed 5 years ago (http://www.stm-assoc.org/public_affairs_brussels_declaration.php ). This states “Raw research data should be made freely available to all researchers. Publishers encourage the public posting of the raw data outputs of research. Sets or sub-sets of data that are submitted with a paper to a journal should wherever possible be made freely accessible to other scholars [their emphasis]”. Problem solved. “it all depends what is meant by data”, “it all depends what is meant by free”. “wherever possible”. Classic mumble. I observe that compliance rates are “variable”.

I announced yesterday that DOIs were free of copyright and got a FriendFeed comment (we are not meant to reveal authors):

“but what possible value could one derive [from] asserting copyright over their DOI suffix?”

Well, I am afraid the answer is “Lots”. Some publishers copyright their identifiers (the ACS copyrights CAS identifiers for chemical compounds (http://en.wikipedia.org/wiki/CAS_registry_number) ). Many publishers sell their tables of contents to meta-publishers. For money. The meta-publishers then sell this information back to us. It’s rather as if I want to know my neighbour’s house number – “I can’t tell you because I sold it to a Directory”. “Can I tell other people what your number is?” “Sorry, signed a contract that I mustn’t reveal my house number without permission”. Bibliographic data is the house numbering of scholarship. Without it you cannot find and identify scholarly works. And in the OKF we assert that this information is LIBRE. Not “should be”. IS.

So if you wish to protect your little market of selling bibliographic data you can assert that this data is created by a creative act. You, the publisher, have creatively created a DOI. It’s your property. So if someone republishes “your” bibliographic data and it includes “your” DOIs you can send “your” lawyers to remove all the work done – and that protects your market.

Let’s assume you are a publisher and you think I’m being unfair. And of course all generalizations are unfair – many publishers are very very cooperative. And you are one of them.

The answer is simple:

SIGN THE PRINCIPLES OF OPEN BIBLIOGRAPHIC DATA (http://openbiblio.net/principles/ )

That will not only identify you as a publisher who regards bibliographic data as LIBRE…

… It will identify you as a PUBLISHER WHO CARES! And that solves the problem of the Excluded Mumble.

BTW the principles are for signing by anyone. Libraries and funders are also particularly welcome.

If everyone signs the Principles then the bibliographic data problem is solved! The DOI was just the first step.


Panton Discussions online

Posted on February 2, 2011 by pm286

#pantondiscussions

The Panton discussions are now online. Many people are to be thanked for this – and it’s taken a lot of effort (as always I blunder into things that I don’t understand – recording, streaming, etc.).

They are available at the Cambridge Streaming Media site:

http://sms.cam.ac.uk/institution/CHEM

and also at DSpace:

http://www.dspace.cam.ac.uk/handle/1810/229688

where they will still be bright and fresh in 100 years.

We’ve already had a significant number of downloads.

I think this is a useful format and I particularly appreciated the reverse (where Richard Grant interviewed me for F1000).

Ideas welcome – I think one every two months is about the right frequency.


DOIs are not copyright! What about Bibliographic Data?

Posted on February 2, 2011 by pm286

Every so often we take an important step forward in Openness and today is one example.

Norman Paskin of the DOI foundation has confirmed that the DOI foundation does not regard DOIs as copyright and encourages their re-use:


to	List for Working Group on Open Bibliographic Data <open-bibliography@lists.okfn.org>
date	Wed, Feb 2, 2011 at 11:50 AM
subject	Re: [open-bibliography] DOIs and openbiblio

Peter,
regarding your specific question on whether or not DOIs as identifiers are considered copyright. Like you, I expected that IDF would not make claims of copyright to DOI identifiers. I’m happy to say that I have just confirmed with Norman Paskin, Director of the International DOI Foundation, that IDF does not regard DOI names (identifiers) as copyright and, indeed, encourages their open and widespread use.

Paul

This is tremendous! It’s a precisely and fully solved problem. No-one ever needs to ask the question again (maybe we should formally ask it on http://www.isitopendata.org/ – any volunteers?)

I do not need to waste any more time on it. I can do something else with my time. I do not need to live in fear of the lawyer’s letter. We can add DOIs into OpenBibliography!

By contrast I spend much of my time in wasted attempts to get clear factual answers from publishers. I’ve been waiting for 4 years from one on data. I’ve been in intense discussion with another about text-mining of data for 18 months. They’ve now relayed it to their legal team. I wait with expectation.

Trying to get clear factual answers from publishers is a wearisome journey. It’s easy to feel that

“Oh, it’s that Murray-Rust again. Just don’t bother to answer and he’ll go away.”

Well he won’t and there are others like him.

It’s very easy to get the impression that we are engaged in an ongoing conflict with publishers. That’s not universally true, but it’s common.

So if publishers want to help us scientists, can you please answer a simple question:

“Is the bibliographic data in your publications Open?”

We all know what this means as we have the principles of Open Bibliographic Data. They are simple to understand. Here are some clear answers:

Yes
No [and reasons given]

Here’s an acceptable one:

Gulp – hadn’t thought. We’ll get back by the end of the week – promise

And unacceptable ones:

It all depends on what jurisdictions you are in and how much you are going to use. [This means you cannot use it – so say NO]
We’ll send it to our lawyers [knowing that they won’t reply – too busy buying companies]

And quite unacceptable, impolite and arrogant:

[no reply]


Microsoft Research and University of Cambridge Assign Chemistry Add-In for Word Project to Outercurve Foundation

Posted on February 1, 2011 by pm286

Today we assigned our Chemistry Add-in for Word (“Chem4Word”) to OuterCurve:

http://www.prnewswire.com/news-releases/microsoft-research-and-university-of-cambridge-assign-chemistry-add-in-for-word-project-to-outercurve-foundation-115019264.html

REDMOND, Wash. and WAKEFIELD, Mass., Feb. 1, 2011 /PRNewswire/ — The Outercurve Foundation, in collaboration with Microsoft Research and University of Cambridge, today announced that the Chemistry Add-In for Word project has been added to the Foundation’s Research Accelerators Gallery, a collection of open source projects that benefit the research and science communities. The Chemistry Add-In for Word (also known as the Chem4Word project) was developed by Microsoft Research and Drs. Peter Murray-Rust and Joe Townsend of the University of Cambridge’s Unilever Centre for Molecular Science Informatics. The two organizations assigned the project to the Outercurve Foundation today.

Let me explain what this means and then why we did it. I expect a wide range of reactions.

The ‘what’ is that the Code has been assigned to OuterCurve (http://www.outercurve.org/)

The name ‘Outercurve Foundation’ speaks to our ambition to be a foundation on the leading edge of the open source world, representing the interests of the growing audience of developers and corporations engaging with the traditional FOSS community.

Simply, the code is Open Source, under an Apache licence and so free for anyone to use, develop and distribute. This formally places it in the same licence area as code on the well-known Apache site (http://www.apache.org/ ) where many extremely valuable libraries and other tools are developed in a community and with communal governance. OuterCurve has similar aims, but no two organizations are alike and we expect that OC will create its own tradition. OC provides a place where those interested in developing Open Code can congregate and contribute. It’s likely, but not essential, that the code is based on .NET and/or C# with perhaps WPF and XAML.

C# was developed essentially by Microsoft and it’s probable that >90% of code written in it runs on a .NET platform. At present Chem4Word will only run usefully on a Microsoft operating system. There is an Open Source platform, Mono, that will run C# but not yet the WPF/XAML for the graphics.

One aspiration is that operating environments such as Mono will become increasingly popular and that the remaining deficiencies (graphics) will be developed in the Open Source community. This aspiration is shared by those associated with OuterCurve which will foster Open source approaches to solving the larger problems. At the other end of the spectrum will be the view that Microsoft’s closed platform will remain dominant and that these Open Source developments are irrelevant.

Certainly there is currently little F/OSS code developed in C# in our area. Given that it’s one of the most popular languages in the world this is an opportunity for practitioners to take part in a community project and we hope to see such a community develop.

Am I an idealist in thinking that Microsoft and its practitioners will move towards an Open operating system? Or that the world will consider this so valuable that it will put effort into it, in parallel to the commercial offering? (Experience with Open Office is not a good omen but that shouldn’t dissuade us). Open Source is gaining ground and there are an increasing number of organizations and purchasers requiring it. Maybe the time will come when it’s impossible to sell closed source operating systems into some organizations. I’d applaud them. And Microsoft will need to change its business model to accommodate this – in which case projects such as ours will have helped to show the way.

The project has seen great change during its four-year run. We started with a semi-closed system and moved toward a completely Open one. It’s created the formal, Open, de facto standard of CML. CML is the only validatable content in chemistry, and probably among very few others in science. Joe deserves great credit for that. Whatever happens to the code, the CML specification and practice is completely Open and will help to create better chemistry in the future.

Please join us if you want to develop in C# and want an exciting and useful project.

Comments welcome and expected – I will treat them thoughtfully.


There are other evils than PDF: what’s the problem here?

Posted on January 30, 2011 by pm286

I am writing a parser for the #quixotechem project – in this case NWChem output. The output is generated AFAIK by FORTRAN. I am having difficulty parsing it. Why?

Here is some text I can parse (deliberately a snapshot from a text processor):

And here is the evil output:

What unexpected horror (or semi-unexpected, as I’ve had it before) has caused me to waste a lot of time?

EDIT:

Here’s a very strong hint. This is what I get when I load it into my text editor. Whatever has happened? And what could I do to make it at least human readable?

And what piece of code did I have to write last night to solve the problem in future?
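Without giving the answer away: a generic first step with any file that looks fine in one tool and broken in another is to make the invisible characters visible. What follows is only a minimal Python sketch of that idea, not the code referred to above. The function name `show_invisibles`, the `<U+XXXX>` notation and the sample line are illustrative assumptions, not part of NWChem or the #quixotechem parser.

```python
def show_invisibles(text: str) -> str:
    """Return text with control and non-ASCII characters spelled out.

    Newlines are kept so the line structure survives; everything else that
    is not plain printable ASCII becomes a visible <U+XXXX> marker.
    """
    out = []
    for ch in text:
        if ch == "\n" or (ch.isprintable() and ord(ch) < 128):
            out.append(ch)
        else:
            out.append(f"<U+{ord(ch):04X}>")
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical sample line: a tab and a carriage return hide in it.
    sample = "Total energy\t-76.0\r\n"
    print(show_invisibles(sample), end="")
```

To apply it to a real file without losing any bytes, one option is to read the file in binary mode and decode it as Latin-1 (which maps every byte to a character) before passing the text to the function.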


The Blue Obelisk 5+ years on

Posted on January 26, 2011 by pm286

#blueobelisk

Not only was the blue obelisk a theme at #pmrsymp, but Brian Brook had MADE a blue obelisk! It’s made of wood and nearly 2 metres high. Many of the attendees signed it;

And then I went to San Diego where 5+ years ago we first met under the Blue Obelisk. Yes, there is a real Obelisk. In fact there are two – the Greater Blue Obelisk and the Lesser Blue Obelisk. We met under the “blue obelisk” – most of us found this one:

But Geoff Hutchison found this one!

Geoff waited for half-an-hour and we waited for half-an-hour and each of us thought “how could you possibly not find the obelisk in Horton Plaza”. Luckily we went to look for each other – without that the Blue Obelisk might never have happened.

Anyway I was there because Aaron Culich drove me. Aaron’s a compsci, very interested in how we interact with molecules, and has some great ideas for “strongly typed parser combinators” which will create a domain-specific-language for us.

And he took a picture of me

So the Blue Obelisk is flourishing. More obelisks have found their way to people – I am very flattered by mine. Much more software has been written. Our sphere of influence increases. Our code is taken seriously.

J. Cheminformatics (Open Access) is running a special issue from the PMR Symposium and if anyone has an idea for a Blue-Obelisk-related paper let me or Christoph or David know. We’ve already got a lot of people who say they will contribute.

Until the next time

KEEP SMILING


Beyond the PDF – the good the bad and the future?

Posted on January 24, 2011 by pm286

I’ve finished a hectic 9 days with #pmrhack, #pmrsymp and #beyondthepdf. I’ll need weeks to do justice to all of these. So a few immediate reactions to BtPDF.

A mixed community, mainly scientists, some IT/Library people, some ICT companies and some publishers. Perhaps 50 people. Wonderful hospitality. Great discussions and presentations (with a strong bio-flavour – this was anticipated).

Surprisingly the most exciting stuff happened with the group of people staying on till today with no formal business. We’ve started a “Writing” project which now takes Peter Sefton’s “Fascinator” as the framework. I’d seen it before in USQ / Toowoomba but it’s come on a lot. It wowed everyone who saw it (many didn’t as they were in other sessions). I am sure it is one way that we manage the results of Quixote.

There is also a large groundswell of those who want to develop new types of scholarly communication. So, for example, can we publish our 2 days of #pmrhack? We think so and I’ll be tackling this when I get back. The bio- community is into ontologies, annotations, etc. (There was relatively little non-bio – we didn’t see any quantities with units, any chemical diagrams, any maps, any maths, etc.). But lots of excitement.

On the other hand I am increasingly depressed by the position of publishers. There is an assumption (by them) that they “own” the scholarly content and that they give us permission to use it. If you think I am being over-reactive, read Richard Poynder interviewing Springer http://poynder.blogspot.com/2011/01/interview-with-springers-derk-haank.html . Springer is simply out for what it can milk the academic community for – there is nothing about how Springer brings benefit to academia. Scientists are living in occupied territory – we need permission to use our own creation. We know that there needs to be money in the system – and we discussed business models in the meeting. But unless we scientists design these business models for our benefit we will simply be manipulated by the commercial world.

Plane is leaving – thx to San Diego airport for free wifi


Beyond The PDF Presentation

Posted on January 20, 2011 by pm286

#beyondthepdf

The presentation was a technical success – virtually everything worked. Not sure we got speech back from the skype. But otherwise great communication.

Several vocal contributions over skype including Daniel Mietchen from audience!

No idea how and where it has gone down. There is already a lively debate about PDF vs the rest and closed source vs Open source.


Scholarly communication let’s eat our own Dogfood

Posted on January 20, 2011 by pm286

#beyondthepdf

THIS REPRESENTS AN INTERACTIVE PRESENTATION TO https://sites.google.com/site/beyondthepdf/ ON 2011-01-20. The large letters are for projection

IF YOU WANT TO ASK QUESTIONS OF US IN THE BtPDF SESSION USE THE ETHERPAD:

http://okfnpad.org/quixote20110120

Just go to this page and give your name. Don’t go in unless you want to ask a question – I will be showing the site

===========================================================

This post tries to practise http://en.wikipedia.org/wiki/Eating_your_own_dog_food (Wikipedia)

rather than using Powerpoint (Power corrupts; Powerpoint corrupts absolutely. This kills kittens)

(used with thanks http://ilovecharts.tumblr.com/post/450388944/brownpau-everytime-you-make-a-powerpoint and probably violating copyright – not fair use but funny).

In case you are still reading I shall use this blog to present my message. Not “slides” but communication by “writing” – a form of linear communication. I’ll try to make this a “conversation” in places

“All Science is hypothesis-driven”.

No. Lots of it is data-driven. Part measurement and observation; Part computed. The result is data which can be (and often is) published for its own sake. The fourth paradigm. Problem is:

Many scientists are arrogant about data publication
Many publishers can’t be bothered. Many that are bothered sell our data back to us (ACS, Elsevier are examples in chemistry).

OK – but data is hard, so it’s too expensive to do. All that collecting, standardising etc.

No it’s not. Vannevar Bush proposed the Memex (1945) to remember everything we do.

But that’s impossible

No – it’s possible if we wanted to do it. We make it a way of life and it becomes tractable

Hmmmm

We’re already doing it. I’ll show 3 projects. Based on communities. Not how can I do this, but who wants to work with us.

But that means you work with your competitors!!!

I don’t have competitors. Just apathetics. Anyway here are the projects. They are all ZERO cash.

???

They aren’t funded – they work with volunteer resources and marginal cpu+bandwidth+storage. Like Galaxy Zoo. OpenStreetMap. All results are O-P-E-N. Licensed with PDDL or CC0.

crystallography (Crystaleye). Goes out every night and crawls the scientific literature for crystallography. 200,000 separate structures. Pulls back data files (CIFs), indexes, validates and provides domain search. Sustainability with IUCr.
chemical reactions (Greenchain reaction and results ). We trawled the patent literature for chemical reactions. 100,000. Hypothesis: reactions are getting greener (extract the solvents)
and compchem (Como (example) and Quixote).

But Crystaleye only has half the literature…

…Yes, because Wiley, Elsevier and Springer don’t publish Open supplementary info.

Well, just ask them…

…I have – several times. They never reply. Readers don’t matter.

Stop ranting…

… sorry.

So were the reactions getting greener …

… not that I could tell. But we’ll do another run.

But patents are grotty. Designed to conceal information. Why don’t you use the published literature? There’s 10,000,000 reactions reported…

… answer it yourself

Oh, I see. Only 1,000 are Libre Access…

… you’re getting the message

And Quixote. Why the name?…

… because PMR has mad expectations. And because it started in Spain…

So what makes it different?

… well NO-ONE publishes comp chem data. And it’s the easiest and best data of any discipline. The laws of physics work. So we will scrape scientist’s disks and convert them to semantic

You’ll have to give them the program…

… Yes we’ll let them use Avogadro. It’ll front up all the tough stuff. It’ll act as a computational Memex.

…Here’s Quixote. Also

I don’t like the black background…

… it isn’t black. It just looks black. Browsers have a little way to go – see http://quixote.wikispot.org/Front_Page?action=Files&do=view&target=workflow.png


PMRSymposium

Posted on January 19, 2011 by pm286

#pmrsymp

The symposium http://www-pmr.ch.cam.ac.uk/wiki/Main_Page was a great effort and I personally am very pleased. I believe it was a success – about 100 people came and an unknown number attended virtually (probably about 20). I’m hoping we can capture pictures, reports, etc. But here are some main points:

Invited speakers. All excellent and fitted together well. There was NO coordination between speakers – this was deliberate – and delegates had to make their own sense.
Media. We had all sorts. Powerpoint. Html. Local. Web-based. Recorded slides. With/out audio. Skype conference. Silent and spoken. Slides and demos.
We had to communicate and record this
The timescale was ultra-tight. So when we presented the results of our group we split it into 2 parts. (a) 15 mins, with 10 mini-presentations. Timed. To the second. (b) 30 mins, 7 demos, with people doing them but me speaking over them. It worked!
John Wilbanks contributed a recorded session. Unfortunately we didn’t have a clear way of getting questions asked at the end. Sorry, John.

We had several demos; existing ones

Crystaleye
ChemicalTagger/OSCAR
IsItOpenData

And those hacked over the hackfest

Locative molecular art (David Murray-Rust). Several people were able to “see” local molecules in their phones so we are developing this and I’ll tell you more soon.
Spoken OPSIN. We are able to speak “caffeine” into an Android phone. The phone interprets the speech (sending it to Google for interpretation), then sends the text to OPSIN, and the structure is sent back to the phone. In seconds or less. Awesome (as they now say). Sam Adams, Daniel Lowe, etc.
Kinect molecules. Several people (DSM-R, Ben O’Steen, Dan Hagon, …). You can now dance in front of a Kinect and rotate a molecule or change its geometry. Fantastic.
Ami. Showing computer vision and intelligent molecules

And then the rest

Tom M-R painted some molecules specially for the meeting
Brian made a huge blue obelisk

Sue and Emma put on a wonderful day and evening and looked after speakers etc.

And then Julian and Steve. They did a really marvellous job on streaming. Picked up on mobiles and PCs throughout the world. They spent 3 days preparing. Tried several approaches. Streaming is NOT trivial. For example if you have too many streams the machine may overheat (I never thought of that). How do you route a Skype conversation to remote audio? How do you manage questions? Etc.

So – all in all a great occasion. Thanks to everyone

(Oh, I am not retiring)

