petermr's blog

A Scientist and the Web

 

Archive for July, 2010

Open Chemistry for Science Online 2010

Saturday, July 31st, 2010

Typed into Arcturus

Rapid update for #solo2010.

What’s emerged from talking with Jean-Claude Bradley, Mat Todd, Cameron Neylon and others doing Open Science / Chemistry is that we are going to do an open experiment starting now and in full view until #solo2010 at the start of Sept.

We shall use the Open Access (chemistry) literature to answer the question:

“Do industrial chemists use different (or fewer) chemical reactions than academia?”

This will be a data-driven experiment relying on the Open Access (CC-BY) literature and extracting reactions both manually and automatically (using the Cambridge OSCAR/ChemicalTagger software). The results will be put into an Open RDF repository (CML + RDF) where all data will be Open Data according to the Panton Principles (CC0 or PDDL).

Current sources will be:

  • European patents (i.e. industrial) – PD
  • Acta Crystallographica E (ca 10,000) CC-BY
  • J-C’s open notebook (Drexel)
  • Mat Todd’s theses. (Sydney)

Anyone can take part. All resources must be Open. We’ll probably coordinate through OKF or Unilever Centre.

More later. We’ll expect to do this on a daily basis.

 


 

PP9_0.1: Semantic Open Scientific Data

Friday, July 30th, 2010

Dictated into Arcturus

This post is a first outline – not even a draft – of a proposed Panton Paper on “Semantic Open Scientific Data”

The vision of the Semantic Web 2.0 (if I’ve not lost count) includes Linked Open Data. We’ve dealt a lot with Open and somewhat with Data but not with links. The rules of Linked Open Data (http://en.wikipedia.org/wiki/Linked_Data ):

  • Use URIs to identify things.
  • Use HTTP URIs so that these things can be referred to and looked up (“dereferenced“) by people and user agents.
  • Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
  • Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

The technologies for doing Semantic Web are therefore HTTP/URI, HTML, various XMLs, and RDF. The linking can be through HTML link, href, or XLink (or others).

LOD does not say much about documents, and most scientific data is published in document format. Most documents such as theses and scientific papers contain headers, abstracts, sections, paragraphs, embedded tables, images, attached data, etc. Common formats are Word, LaTeX and HTML. These provide more or less semantics according to the authoring tool, the diligence of the author, etc. So, for example, several publishers have very well marked up HTML.

PDF, PNG, Powerpoint as commonly used have effectively no semantics. (There are areas in some of these allowing the inclusion of semantics but these are not used and are so variable between releases that they are effectively useless.) In the average PDF it is impossible for a machine to tell where one sentence or paragraph ends and another begins. Superscripts and styling are also incredibly difficult to interpret.
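As a rough illustration of the four rules, here is a minimal Python sketch (standard library only) that names a thing with an HTTP URI, says something useful about it, and links it to a related URI, serialised as N-Triples. The `example.org` namespace, the compound and reaction names, and the helper functions are all hypothetical; only the two `rdf-schema` predicates are real vocabulary terms. It is a sketch of the idea, not a recommendation for how a repository should do it.

```python
from urllib.parse import urlparse

# Hypothetical namespace, used only for illustration.
BASE = "http://example.org/compound/"
# Real RDF Schema predicates.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def is_http_uri(uri: str) -> bool:
    """Rules 1 and 2: the identifier is an HTTP(S) URI that can be looked up."""
    parts = urlparse(uri)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def literal(text: str) -> str:
    """Format a plain string as an N-Triples literal (escapes backslash and quote only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one statement as an N-Triples line.

    The object is written as a URI if it looks like an HTTP URI,
    otherwise as a plain literal.
    """
    for uri in (subject, predicate):
        if not is_http_uri(uri):
            raise ValueError(f"not an HTTP URI: {uri!r}")
    obj_term = f"<{obj}>" if is_http_uri(obj) else literal(obj)
    return f"<{subject}> <{predicate}> {obj_term} ."


def describe(name: str, label: str, related: list[str]) -> str:
    """Rule 3 (useful information) plus rule 4 (links to related URIs)."""
    subject = BASE + name
    lines = [triple(subject, RDFS_LABEL, label)]
    lines += [triple(subject, RDFS_SEE_ALSO, uri) for uri in related]
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe("benzene", "benzene", ["http://example.org/reaction/nitration"]))
```

A server that returned these lines when the compound URI was dereferenced would be following the rules in spirit; a real deployment would also need content negotiation and a proper RDF library.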

Most authors author in a (semi)semantic form. Most publishers will accept this, then print it out and scan it or de-semantify it as PDF. Many have the manuscript retyped.

Many graduate theses are required as PDF even though the authoring is in Word or LaTeX.

So here are my recommendations

  • Authors should be provided with incentives and tools to create documents with as much semantics as possible.
  • Publishers must become aware of the value of semantics and retain it during their processing
  • Theses should always preserve the original born-digital document and data. It should always be available alongside any PDF.
  • Repository owners should present their content as Linked Open Data (RDF) wherever possible. This may require managing identifier systems and ontologies
  • Readers should have access to semantic readers supported by repositories and publishers
  • There should be converters from common semi-semantic forms to fully semantic where possible (e.g. as in Chemistry), supported by repositories and publishers
  • Tools should be available for human and machine semantic annotation. This may not always be completely accurate, but it will be useful.

     


 

Panton papers: current state

Friday, July 30th, 2010

Typed into Arcturus

Current state of play for Panton Papers. My contribution to the Panel on Open Access and Open Data this afternoon at #oss2010.

I have been releasing a new Panton Paper pre-draft every 1-2 days. These are nuclei for communal editing to work towards “rough consensus and running implementability”. They are not, naysayers, cast in stone. You are welcome to hack them through the OKF Open wikis, Etherpads and other tools.

Panton Papers on Open Data

Already published:

Not yet published (the numbers may vary)

  • PP5_0.1 When should data be released? (absolutely fundamental and a lot of hard work)
  • PP6_0.1 Access to text- and data-mining. (Firewalled publications are often contractually barred from this. The data must be minable)
  • PP7_0.1 Reproducibility. The data must cater for being used in reproducible science (cf. Victoria Stodden)
  • PP8_0.1 Semantic data. Linked Open Data – the dream of many of us – depends on semantic data

Readers and #oss2010ers, please add other topics and comments.

Science Online 2010: What shall I say? #solo2010

Friday, July 30th, 2010

Typed into Arcturus

I’ve been invited (and accepted) to talk at Science Online 2010 (#solo2010). In previous years this was called “Science Blogging” and I’m currently wearing the T-shirt from one such. And they were held in the Royal Institution where you could lecture from the same place as Michael Faraday and Lawrence Bragg.

This year it’s in the British Library.

Recently Duncan Hull has highlighted that I am the only scientist. (http://duncan.hull.name/2010/07/26/solo10/ ). This has caught traction in the twittersphere. Help!

What shall I do? I had probably planned to talk about how to make science data Open. That’s my current rage. Should I change? Should I try to carry out some research before the meeting? Should I try to crowdsource some research before the meeting? Should I (like Lawrence Bragg) demonstrate a soap-bubble raft? (I don’t think the BL would like me bringing Bunsen burners into their hallowed racks.)

Seriously – is there anything exciting and new we could communally do in the next month? My guess is that it would have to be in the area of data-driven chemistry. I was talking with Jean-Claude Bradley at breakfast about liberating chemical reactions from the literature. There will be new science in that. Not world-shattering, but worthy.

Thoughts?

#OSS2010: Rage against the Publisher-Industrial-Scientific Machine

Friday, July 30th, 2010

Typed and edited into Arcturus

Sitting in the Free Speech Movement Cafe in Berkeley (see picture) it’s impossible not to get the sense of cataclysmic change. It’s doubly nostalgic in that it was followed a few years later by a similar rejection of the University status quo in the UK. I was warden of a hall of residence in the University of Stirling when (in 1972) it publicly exploded its angst across the pages of the world. I was young (30) and, although a lecturer, was critical of the University administration and admired the students’ drive for Free Speech. I can’t remember the exact title but there was essentially a free speech forum where the students invited speakers with a wide range of views, often outside the pale, and guaranteed their message would be heard politely, though not unchallenged.

This is echoed by a comment (I’ll attribute if given the OK) to one of my posts on “Reclaim our Scholarship”:

Cue RATM: we gotta take the power back! :-)

[RATM = http://en.wikipedia.org/wiki/Rage_Against_the_Machine ]

Listening to the presentations here I sense the same rage as in the sixties when that was against the Military-Industrial complex http://en.wikipedia.org/wiki/Military%E2%80%93industrial_complex

This term was and is widened to Military-Industrial-Scientific Complex (e.g. http://www.counterpunch.org/grossman01152009.html ). This is a sad condemnation of much scientific endeavour – whether in industry or academia – and is caused by the rapaciousness of both individuals and institutions. (Just listening now to Tim Hubbard on the failure of Bayh-Dole and the tech-transfer departments – essentially the model has failed to generate significant wealth and has inhibited the use of science for the general good.)

So what are we raging against now?

The Publishing-Industrial-(Scientific) Complex.

Of course not all publishers are implicated and not all industries and not all scientists. But there is a core of corporate resistance, gatekeepers, micro-control, which holds our endeavour back.

If this doesn’t change rapidly the PIS-Complex will fracture. Whether it will be deliberate action or whether it will be the amorphous forces of the zeitgeist and technology I can’t foretell.

But I am conscious that my current actions and attitude are a constant drip of water onto the congealed mass. There are many other erosive forces. Change is in the wind.

#OSS2010: Reclaiming Our Scholarship – what I said

Friday, July 30th, 2010

Typed and edited into Arcturus

There’s an impressive (near verbatim) transcript of the sessions. This is often at least as useful as the slides. In my case essential. I do not use Powerpoint and instead click my way through my “slides” in a non-linear order according to how I interpret the session and audience.

This time all slides were projected remotely (i.e. there was no podium computer and no VGA connector). So I hurriedly typed a few links into a blog post and asked the projectionist to click on about 2-3. The OK Definition, the picture of the Panton Arms and Pantonistas, The Panton Principles. Most of the talk was done with Flowerpoint. What I said is here:

http://gnusha.org/open-science-summit-2010-transcript.html

and my slight editing (e.g. removing “click here” and correcting names) gives:

I am a chemist. I do not do PowerPoint [...] My main method of presentation is flowerpoint. I am old enough to have remembered the 60s and not to have been at Berkeley but it has made a huge contribution to our culture. [describing the flowerpoint] The Open Knowledge Foundation will adopt [flowerpoint] as a way of making my points.

We [OKF] have many different areas- maybe 50- that come under Open, that relate to knowledge in general. First of all, my petals are going to talk about various aspects of Openness. [...] the open knowledge definition. This is the most important thing in [my talk and message].

A piece of knowledge is open if you are free to use, re-use and re-distribute it, subject only [possibly] to attribute and share-alike.

That’s a wonderfully powerful algorithm. If you can do that, it’s open. If not, it’s not open according to this [?definition].

Another picture [the Panton Arms with PP collaborators], Panton Principles. It’s a place called a pub. It’s 200 meters from the chemistry department where I work, and between the pub and the chemistry lab is the Open Knowledge Foundation. Rufus has been successful in getting people to work on [OKF]. A lot of this is about government, public [knowledge].

[petal 1]How many people have written open source software? [many hands]

[petal 2] What about open access papers? [fewer] How many of them had a full CC-BY license? [fewer still] If they weren’t, they didn’t work as open objects. CC-NC causes more problems than it solves.

[petal 3] How many people have either published or have people in their group who have published a digital thesis, not many, right? [few hands] How many of those explicitly carry the CC-BY license. [about zero] That’s an area where we have to work. Open Theses are a part of what we’re trying to set up in the Open Knowledge Foundation. Make the semantic [version] available, LaTeX, Word, whatever they wrote it in, that would be enormously helpful. The digital landgrab in theses is starting and we have to stop it. There are many things we can do.

[petal 4 + 5] There are two projects, and these have been funded by JISC. Open Bibliography and Open Citations. At the moment, we’re being governed by non-accountable proprietary organizations who measure our scholarly worth by citations and metrics that they invent because they are easy to manage and retain control of our scholarship. We can reclaim that within a year or two, and gather all of our citation data, and bibliographic data, and we can then, if we want to do metrics, I am not a fan, but we [emphasis] should be doing them, and not some unaccountable body. Anyone can get involved in Open Bibliography and Open Citations.

[petal 6] The next is open data, and the next is very straight forward. Jordan Hatcher, John Wilbanks from Science Commons, have shown that open data is complex. I think it’s going to take 10 years [to get to terms with Open Data].

[petal 7]This is a group involved in the Panton Principles, Jenny Molloy, Jenny is a student. The power of our students.. undergraduates are not held back by fear and conventions. She has done a fantastic job in the Open Knowledge Foundation. [identifies people in photo] Jordan, then Rufus, John Wilbanks, Cameron, and me, and anyway, we came up with the Panton Principles, [link to ] the Panton Principles, and let’s just deal with the first one [due to time and not being able to scroll down].

Data related to public science should be explicitly placed in the public domain.

There are four principles to use when you publish data. What came out of all of this work is that, one should use a license that explicitly puts your [data] in the public domain – CC0, or PDDL from the Open Knowledge Foundation. So, the motto that I have brought to this [meeting] is one which I’ve been using and been taken up by JISC.. … on the reverse of the flower,

reclaim our scholarship.

That’s a very simple idea, one that’s possible if a large enough number of people in the world look to reclaiming scholarship, we can do it. There are many more difficult things that have been done by concerted activists. We can bring back our scholarship where we [emph] control it, and not others.

[petal 8] I would like to thank the people on these projects, Open Citations (David Shotton) and our funders and collaborators who are JISC, who funds it, BioMed Central who also sponsors this, International Union of Crystallography, Public Library of Science. (applause)

PP4_0.1: Comments on Repository Structure and location

Friday, July 30th, 2010

Dictated into Arcturus

Responses to PPaper4_0.1.

I have had a number of useful comments to my suggestion that Scientific Data be reposited in domain-specific repositories (and a number of tweets to the effect that “PMR is dissing librarians yet again”. To the latter I’d ask the authors to reread what I actually said which was that many librarians think that data should be put in IRs; all the scientists I have spoken to think otherwise. This was a factual statement, not an attack.) The meaningful comments are:

Chris Rusbridge says:

July 28, 2010 at 8:57 pm

My real point to make is that Peter suggests an ideal that I fear cannot be realised in the broad. There are comparatively few existing domain-specific repositories, and most are extremely vulnerable. Witness what happened to the AHDS when the makeup of the policy committee changed slightly. Secondly, don’t think (please!) that domains are consistent; there can be endless divisiveness of approach between many subdomains. Thirdly, why should institutional data repositories not work, given the support of the institutional scholars? Fourthly, how can reasonably well-managed institutional data repositories not be federated so that the sub-domain parts of all the world appear as one? Fifthly, institutional data repositories do have a sustainability case, if linked to a library, an institutional mission, and that vital sense of scholarship disclosure.

I would never seek to undermine a domain repository that existed and worked, but I would hesitate to try to establish (and more importantly sustain) a domain repository where none existed. I would aim to establish IDRs and federate them. I’m not saying the former can’t be done, just that it is MUCH harder!

Jim Downing says:

July 29, 2010 at 10:47 am

@Chris

I have to say that I broadly agree with your points, and that the best sustainability and access is offered by federated institutional / sub-institutional repos.

I don’t think this is the easy path, though. There are few IRs tackling data archiving at a significant level, and even fewer aggregated domain-specific meta-repositories.

In the spirit of paving the cow paths, the best route might be to look for ways to deliver institutional support to domain repositories.

Steve Hitchcock says:

July 29, 2010 at 10:53 am

Peter, You mention ‘open data’ twice in this blog entry, in the opening sentence and in the final sentence. In between you do not address how the extensive requirements can be achieved while continuing to provide open data. You propose to disregard the contribution that might be made by researchers’ institutions, yet intimate roles for scientific unions, societies and publishers. These are likely to provide services at a cost that is not compatible with open data. Since open is axiomatic to what you want, it doesn’t seem to add up here. I think we could, and will, see examples of more diversified structures, with IRs at the apex, to provide the expert data management and curation that you seek, but within our research institutions.

Firstly, none of this will be easy and it may well be impossible in most cases. I see no reason why Institutions should not provide data repositories other than the fact that they do not currently do so and there is little sign of them making any progress. I can certainly conceive of a future where this happens – I just don’t see it happening. There *are* a number of domain-specific repositories, and yes, most of them are fragile. But that is to be compared with almost zero equivalents in IRs.

If you read my actual draft for the PPaper (between the rules) there is no mention of where the repository should be and who should finance it. I have simply made the point that data should not be stored in a general-purpose repository where there is no domain expertise. If you wish to make the point that it *should* be stored like that – without effective federation and without domain expertise for ingestion – I will continue to disagree. If you agree that it needs domain expertise then you will have to get that from practising scientists – there is no way that anyone outside the discipline (libraries, Google, Bing) can rightly manage the intricacies and detail.

The last thing that scientists want is their data spread over ca. 10,000 sites (because that is how many HE institutions there are worldwide). No scientist, editor or journal that I have spoken to would countenance data being reposited in that way.

So if libraries (whom I did not attack) wish to be involved they have to engage with domain scientists. Libraries have the following positives to offer:

  • They have (at least currently) funding for IRs
  • They have some degree of permanency

Domain repositories have the following:

  • They have the technical trust of the community
  • They offer a single point of contact

My (rather tentative) solution is that libraries should actively try to take on one or two domain-specific repositories. Not more. Those repositories should correspond to world-expertise on campus. So the Protein Data Bank (RCSB) is located at Rutgers. I have no idea whether the University supports it. But it is a single point of contact for the discipline.

The future is tough, however you look at it. But the fact that scientists are starting to set up their own repositories sends a message.

I am simply the messenger.

Open Science Summit: My homage to Berkeley – Flowerpoint

Friday, July 30th, 2010

Typed into Arcturus

Open Science Summit: Update.

The live feed is variable, I gather. There have been many very good tweets and they are archived on:

 http://opensciencefoundation.com/oss2010/

(I guess this will be dynamically updated.)

I wasn’t able to show my own HTML so I showed a few HTML links from the blog. I had, however, my main visual support – FLOWERPOINT;


Some of the younger generation may not appreciate the real change that flowerpower made to us in the 60′s and 70′s. I have finally paid my homage to Berkeley.

OSS2010: My slides

Friday, July 30th, 2010

Typed into Arcturus

Open Science Summit: I may be presenting from another machine so here is a blog post with critical links. Most of the talk will be given by Flowerpoint, but we shall need:

Reclaiming our Scholarship

Peter Murray-Rust,

Univ of Cambridge, Churchill College and Open Knowledge Foundation

Power Corrupts; Powerpoint Corrupts Absolutely (Tufte)

Flowerpoint Petals:

  • Open Source
  • FULL Open Access
  • Open Theses
  • Open Bibliography + Open Citations
  • Open Data
  • Panton Principles
  • Thanks (JISC, IUCr, PLoS, BMC and many others)

Links:

Berkeley: Reclaiming our scholarship

Thursday, July 29th, 2010

Dictated into Arcturus

I am giving a 10-minute talk at the Open Science Summit this afternoon at Berkeley. So many things are going round in my head and I still have no clear idea exactly what I’m going to say. The history of freedom on the Berkeley campus is enormous and I’m just off to have my breakfast in the free speech movement cafe in the library (http://www.lib.berkeley.edu/give/bene55/fsm.html ). Goodness only knows what new ideas I will get during breakfast.

I am absolutely sure that we are in the middle of something momentous. The phrase that I have come up with in the Berkeley context is that “openness is the new flower power”. That may sound pretentious but for somebody who remembers the sixties the influence that Berkeley has given us is enormous. Flower power changed our way of thinking throughout the world and directly addressed the military-industrial-complex. Openness has many of the same aspects. It is compelling, it gathers people round it and if enough people become involved then it is surely unstoppable. And it challenges power by making basic assertions that cannot reasonably be denied.

Our new concern is the publisher-industrial-complex. Of course many publishers are enlightened and working to spread knowledge. Many industries are enlightened and bringing valuable wealth creation to the world. But there is a core of members of the publisher-industrial complex who create income by restricting the flow of knowledge. They innovate only to generate greater control and more income. Our scholarship is hobbled by copyrights, patents, firewalls, portals, restrictive conditions and contracts. This has been a problem for the last 50 years and many people have accepted that this is the appropriate way to manage our scholarship. But it is not.

So we must Reclaim Our Scholarship.

And that will be the primary theme of my short talk. It’s now possible. We create the scholarship. We create the meta-data. We create the tools. We can reclaim and reinvent the way that scientific scholarship is created and disseminated. It only needs enough people.