Quixote: CML is now an infrastructure for Computational Chemistry

#quixotechem #cmlcomp

We have had a fantastic two days at Daresbury developing a prototype of a repository system for computational chemistry. Although computational chemistry is a major scientific discipline, supporting bioscience, materials, chemical reactions, energy, etc., it consists of isolated programs (codes) with no common semantics or communications. Although hundreds of millions of dollars (or more) are spent on compchem, almost none of the data is made public. This lack of interoperability means that inter-program communication has to be created for each pair of programs, leading to an n-squared number of interconverters, most of which have no semantic model and are highly lossy. Similarly, if you want to visualize the results, you often need one viewer per code.
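To put rough numbers on the n-squared problem, here is a minimal sketch. It assumes that a "converter" is one-directional (code A to code B) and that a shared format needs one reader and one writer per code; the function names are mine, purely for illustration.

```python
def pairwise_converters(n):
    """Directed converters needed when every code talks directly to every other code."""
    return n * (n - 1)


def shared_format_converters(n):
    """Converters needed when each code only reads and writes one shared format."""
    return 2 * n


if __name__ == "__main__":
    for n in (3, 6, 20):
        print(n, pairwise_converters(n), shared_format_converters(n))
```

The pairwise count grows quadratically while the shared-format count grows linearly, so the gap widens with every code added.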

The Quixote-CML initiative has now changed this. There is a common semantic model based on CML. This model is democratic but controlled. A special interest group creates their own model, based on CML vocabulary, and defines the practices required to make it work. There is no “central top-down control” other than to specify how the conventions are created (but not their content) and the base vocabulary of CML.

This means that, with foresight from the conventions, software creators (such as Avogadro) can expect to read a wide range of files and display the contents without breaking. (Avogadro is much more than a visualizer: it has an “intelligent” approach to the content and, with the semantics of CML embedded, is likely to become a major resource in compchem.)

As always we set ourselves a very ambitious target – to create a searchable repository that could consume the output of any semantified code. We’ve currently written lossless parsers for NWChem, MOLCas, Turbomole, Gaussian, Dalton, and GAMESS-US. The parser technology is such that it takes a few days to write one for a mainstream code and the result is lossless. If anyone is interested, consult us and we’ll show the way forward – it will add considerably to the public value of your output.

We are also developing tools for solid state, with Quantum Espresso as the leading code. It was exciting to see the solid state group starting to form their own convention and repurposing CML for that.

I can’t thank everyone but …

  • Jens Thomas for his enormous enthusiasm and vision. He has made this happen.
  • Paul Sherwood and colleagues at Daresbury for positive involvement and funding.
  • Pablo Echenique for his continued enthusiasm and support while not being able to travel to Daresbury, and Jorge Estrada, who makes up the second part of the Spanish team.
  • My colleagues Sam, Joe and Weerapong. They go along with the crazy deadlines and make them work.
  • Marcus Hanwell for the great belief and the enormous support from Avogadro.

And everyone else at the meeting. Whether or not you believed in all the windmills all the time you took part with enthusiasm and made it work.

So – after 16 years of developing CML this is the first time I have felt that it has become a real fact. People are using it for many different parts of chemistry and continue to advocate it. It’s complex, but not too complex to manage. It’s large, but not too large to implement. It has created its own democracy – which is very rare in chemistry and confined to the Blue Obelisk and a few other areas. The democracy means it moves fast. Faster than many other developments. And because it is continually questioned and open to new ideas the rigour of design is encouraged. It’s now captured the spirit of many mainstream ICT activities.

The next phase is technically trivial. It’s simply to get people to publish their log-files (the output of the calculations). After they have published the work, if necessary. If people did this it would save millions of dollars of repeated calculations. It would enhance teaching and learning. It would enhance quality. You don’t have to write any code – simply make your files available.

Publishers, just require that authors of computational chemistry results make their files available. Quixote will do the rest.

Funders, just require that authors of computational chemistry results make their files available. Quixote will do the rest.

Universities and research labs, just require that authors of computational chemistry results make their files available. Quixote will do the rest. And your institutions will get greater visibility.

It is now unstoppable… the only unknown is the rate of growth.

P.

Posted in Uncategorized | 2 Comments

Scholarly HTML – latest thoughts

#scholarlyhtml

We’ve had a great hackfest – extended over 10 days – working towards scholarly HTML. The idea is simple – we should be using HTML as the main substrate for exchanging information in areas of scholarship, research, education and learning. Most “formats” have shortcomings but only HTML preserves the democratic symmetry of ease of authoring and ease of reading. XML is necessary for complex objects (such as CML, SVG, MathML, etc.) but HTML has everything that is required for most other communication – documents, webpages, publications. PDF, Word, LaTeX may be useful for authoring but have major shortcomings for re-use.

So why don’t we all use HTML? After all, in 1994 we all did. Everyone on the web knew how to author HTML and how to render it. Since that time we’ve seen the growth of the graphical over the semantic. Good graphics requires tools and these tools have been developed in an asymmetric manner – the reader cannot interact with their outputs other than to view them with the human eye. We’ve lost the simple skill of reading someone else’s content, editing it and republishing it.

But academia is all about re-using content and it’s bizarre that we produce it in ways that prevent this. Scholarly HTML – which is nothing more than using HTML sensibly – will change this to return power to the reader.

Peter Sefton and I have spent much of the last week discussing how to do this. Peter is going to continue to develop demos using his suite of tools. We are probably going to concentrate first on repurposing existing scholarly content from Open Access publishers. (We can’t apply this to closed access material because it breaks copyright). We’ll be looking at how existing published material can interoperate simply by converting it to scholarlyHTML. So the front runners are those who are partners on our JISC grants – and who publish Open Access HTML.

Why does this matter? On Friday the EBI held a session on text-mining and the continuing undercurrent was that most of the material provided by publishers made text-mining almost impossible – it couldn’t be easily read (PDF) and the results couldn’t be distributed.

Whereas with ScholarlyHTML textmining is almost trivial.
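As a rough illustration of why HTML is so much kinder to text-miners than PDF, here is a minimal sketch using only Python's standard library. The sample document and the class and function names are made up for the example; real ScholarlyHTML would of course carry more structure than bare paragraphs.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the plain text of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside inline children such as <em> is still part of the paragraph.
        if self._in_paragraph:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join the pieces and normalise runs of whitespace.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._in_paragraph = False


def extract_paragraphs(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Invented sample document, for illustration only.
SAMPLE = """<html><body>
<h1>Results</h1>
<p>The energy of <em>benzene</em> was computed.</p>
<p>All log files are available.</p>
</body></html>"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

The point is not this particular parser but that the text is recoverable with a few lines of standard tooling, with no layout reconstruction needed.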


Things that Scholarly HTMLers do

There has been a lot of interest in #scholarlyhtml and Peter Sefton blogged his latest thoughts yesterday:

http://ptsefton.com/2011/03/18/scholarly-html-fraglets-of-progress.htm

The techniques we’re documenting have been drawn from lots of previous work,


I don’t think we’re duplicating work but trying to make it easy to find, packaging it into a single set of guidelines and providing a locus for tool builders.


A lot of what we are doing amounts to providing guidelines.


we are taking the opposite approach to the one taken by Perl where Larry Wall’s mantra was “There’s more than one way to do it”. On the web, there certainly is, but for Scholarly HTML to succeed I think we need to say to people:

  • There’s one good (enough) way to do it.
  • Here’s a meaningful way to do it.
  • Here’s a simple way to do it.

 

Peter and I are going punting tomorrow morning (anyone in Cambridge is welcome to come – we’ll tweet the rough time) and we’ll discuss more of this. ScHTML is an attitude of mind. So, while we work out the best practice for best practices here are some analogies from real-life which you may find helpful:

  • ScHTMLers do not reinvent the wheel – they make sure the tyres are pumped up
  • ScHTMLers stick the stamp in the top right-hand corner, and don’t just write “Newcastle” but give the country and postcode
  • ScHTMLers write their address on the back of the envelope
  • ScHTMLers don’t write dates like 01/02/03; they say 2003-01-02
  • ScHTMLers help other people across the road
  • ScHTMLers put dates in their email [1], not just “tomorrow”
  • ScHTMLers add explanations rather than making people guess
  • ScHTMLers write “Fragile, Glass” on packages
  • ScHTMLers need only one match to light a fire
  • ScHTMLers think like Ned Flanders not Homer Simpson

[1] a true story.

The day before yesterday Michael sent me an email which arrived yesterday saying we should meet “tomorrow”. I thought it meant today and only realised it meant yesterday when the meeting had started. If the mail had said 2011-03-17 then not only me but also my text-mining software (OSCAR4) would have known what was meant.
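The date example can be made concrete with a small sketch using Python's standard `datetime` module. It shows that a string like 01/02/03 has at least three plausible readings, while an ISO-style date has exactly one. The variable names are mine.

```python
from datetime import date, datetime

ambiguous = "01/02/03"

# Three plausible readings of the same string, depending on local habit.
readings = {
    "day/month/year": datetime.strptime(ambiguous, "%d/%m/%y").date(),
    "month/day/year": datetime.strptime(ambiguous, "%m/%d/%y").date(),
    "year/month/day": datetime.strptime(ambiguous, "%y/%m/%d").date(),
}

# An ISO-style date needs no guessing at all.
meeting = date.fromisoformat("2011-03-17")

if __name__ == "__main__":
    for convention, value in readings.items():
        print(convention, value.isoformat())
    print("meeting", meeting.isoformat())
```

A human can often guess the intended reading from context; software generally cannot, which is why the unambiguous form matters for text-mining.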


Scholarly HTML – major progress

#scholarlyhtml

Our extended hackfest over the last 3 days has made huge progress towards Scholarly HTML. We will be posting reports on #beyondthepdf lists and also continuously and continually updating Etherpads, wikis, code pages, etc. Our current starting place is

http://scholarly-html.okfnpad.org/1

Anyone is free to contribute – we ask you to identify yourself. There is an FAQ at http://okfnpad.org/schtml-faqs . If you have a question, simply ask it in the current style. The Etherpad has a total memory of ALL changes so it can be backtracked.

We’ll be posting more details as fast as I and others can, but here are some basic details:

We held an introductory session in Chemistry on Friday afternoon with presentations from Martin Fenner, Peter Sefton and Brian McMahon. REALLY valuable in setting the scene. Then dinner. On Saturday about 10 physical and 6 virtual attendees started from ca 10 o’clock. Lunch at the Alma. Dinner at the Panton. Sunday hack through morning and snack through lunch till ca 1500 – Martin and Brian leave; others continue with citations-references. Dinner at M-R house. Flop…

Physical attendees:

Peter Murray-Rust

Peter Sefton

Lezan Hawizy

Dan Hagon

Brian McMahon

Martin Fenner

Sam Adams

Mark MacGillivray

David F. Flanders

David Jessop

Cameron Neylon

Nick England

 

Virtual attendees:

Egon Willighagen (Stockholm Area/Sweden)

Jakob Voss (Hamburg/Germany)

Aaron Culich (San Francisco/USA)

Mark Hahnel (London/UK)

Claudia Koltzenburg (Hamburg/Germany)

Graham Steel

 

Outcomes (VERY brief):

Scope of ScHTML. It’s a community-based activity, re-using best practice, but with minimal entry barriers. IOW it can apply to almost every public activity in scholarship (research, education, reference, record). It is not just for submitting publications – it can manage student essays, lab notebooks, Wikipedia entries, chemical databases, etc. See the FAQs http://okfnpad.org/schtml-faqs YOU can contribute questions or suggested answers.

Technology of ScHTML. The minimum entry is simply to be able to create modern well-formed HTML5. Everything beyond that is done by evolving agreements (“conventions”). It can, if necessary, be edited by hand, though we are obviously keen to create a flexible toolchain. In practice this is often already present and a matter of identifying current good tools and good practice.

Social and political aspects. ScHTML is not owned by any one institution or person – it is owned by you. (Wikipedia is quite a good, but not perfect, analogy). Anyone can contribute and their influence is based on the value of their contribution. If someone wants to do something new they can – and the convention structure means that they cannot “break” other parts of the effort. ScHTML is an evolving ecosystem constrained only by the acceptability of the ideas and the ability to create the tools and distribute them. ScHTML is guided by a set of principles, which are as yet only part formed – they will form an evolvable “constitution”. We are informed by the IETF mantra “rough consensus and running code” (http://en.wikipedia.org/wiki/Rough_consensus ).

Conventions. The ability to “do your own thing” is governed by “conventions”. These are sandboxes of practice identified by a unique Identifier (URI) and with a description of the convention. In a convention participants have complete freedom to create their own ScHTML infrastructure, governed only by the constraints of HTML5. If they wish to develop complex objects (e.g. scientific articles) they will have to work out how to create objects, edit them, disseminate them, search them and display them. Their success will depend on the strength of the community, the available toolset, the ability to write their own, their advocacy and the innate compulsion of their activity.
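One way to picture how conventions could keep communities from breaking each other is sketched below. Everything specific here is an assumption made for illustration: the `data-convention` attribute, the example.org URIs and the registry are hypothetical and are not part of any agreed ScHTML convention. The idea is only that a tool acts on the convention URIs it knows and leaves the rest alone.

```python
from html.parser import HTMLParser

# Hypothetical registry: convention URI -> short name of a handler this tool provides.
KNOWN_CONVENTIONS = {
    "http://example.org/schtml/citations": "citations",
}


class ConventionFinder(HTMLParser):
    """Collect the value of every (hypothetical) data-convention attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-convention" and value:
                self.found.append(value)


def classify_conventions(html):
    """Return (known, unknown) convention URIs declared in the document."""
    finder = ConventionFinder()
    finder.feed(html)
    finder.close()
    known = [uri for uri in finder.found if uri in KNOWN_CONVENTIONS]
    unknown = [uri for uri in finder.found if uri not in KNOWN_CONVENTIONS]
    return known, unknown


# Invented sample document using two made-up conventions.
SAMPLE = """<html><body>
<section data-convention="http://example.org/schtml/citations"><p>Refs</p></section>
<section data-convention="http://example.org/schtml/midges"><p>Midges</p></section>
</body></html>"""

if __name__ == "__main__":
    known, unknown = classify_conventions(SAMPLE)
    print("process:", known)
    print("leave untouched:", unknown)
```

Because unknown conventions are skipped rather than rejected, a new community can start its own sandbox without any existing tool having to change.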

Packaging. The major drawback of HTML is that it does not have a universal packaging format (e.g. a page with embedded PNGs cannot be saved or transmitted in a universally safe and recognizable way). This is one reason why DOC and PDF are often used – not for their format, but because they wrap several objects. This is currently the greatest challenge for ScHTML. Should we re-use existing formats (e.g. ePub) or develop our own? There are social and technical pluses and minuses for most ways forward.

Exemplars: We shall use ScHTML ourselves from the start ( http://en.wikipedia.org/wiki/Eating_your_own_dog_food ), both technically and socially. The first activity is on Citations (References) and their management within ScHTML documents. But if you want to set up a convention – anything from molecules to mountains to midges – you can do so. The only criterion is that you adopt the very simple syntax of HTML and the convention mechanism.

Citations. http://okfnpad.org/schtml-citations . We have created the first draft of a convention for citations (again, anyone can contribute). We are creating an example of how a modern scholarly document would use this convention to manage its references. We are adapting existing Open technology to create a new “reference manager” and we hope that RSN we shall start to remove the current hideous and unnecessary waste of time in “formatting” and even worse “reformatting” references. If you want to get “into the spirit and practice” of ScHTML, take part in this activity.

Will it work?

That’s up to YOU. Wikipedia works – and it’s an excellent lead to follow. This is not a moonshot – it’s primarily about doing simple sensible things that already exist. It works best if you do it in the spirit of helping others to use and re-use your work, but it can equally well be used for private material or protected commercial activity.

 

 

 


Mendeley (and other Bib Data): WHAT is Open?

Euan has commented on my enquiry to Mendeley about their “Open data”. He raises important, valid points.


Euan says:

March 12, 2011 at 1:08 am

IANAL, but the data includes (or did last time I checked) many, many abstracts that definitely haven’t been licensed for use in this way, scraped from PDFs or online sources of metadata like PubMed.

PMR: I do not know what data Mendeley are offering – I have specifically asked (see below). IF they are offering abstracts then IMO they are likely to be breaking copyright law.

Though

  1. I’m sure most publishers look favourably on their article metadata being spread around as many ways as possible (I’m speaking personally, not for NPG my employer whom I do not represent in this matter in any way)
  2. Abstracts do exist in a kind of weird grey area where nobody is sure exactly what’s fair use and what isn’t, and some people seem to believe that they’re public domain

You are right in that it is extremely difficult to get definitive answers. I regard abstracts as “mumble” in the Yes-No-Mumble value logic.

… it *doesn’t* seem to me that this means that anybody can package them up with a bunch of homegrown content under the same CC-BY license and say that the whole thing is “open”.

I personally agree with this analysis. An individual or organization cannot declare that someone else’s IP is Open or free of copyright. The problem is that it is difficult to determine what the IP is on things like abstracts. Publishers are extremely unhelpful in this (other than the ones who assert that they own the abstract).

Obvious example: some publishers sell their abstracts and associated metadata to commercial literature databases. The current Mendeley API license implies to me that I could put together my own, identical datasets with the same content from that source and sell it for half the price, thus cutting the original publisher out of the loop. This makes me think that a blanket one line “all data is made available under CC-BY” is insufficient.

I agree. IF the Mendeley data includes abstracts I would refuse to use the abstracts in an Open data collection.

At the very least the attribution for abstracts should be to the copyright holders – preferably the authors, otherwise the publisher – not Mendeley (Again, IANAL, I may be wrong. If somebody wants to tell me so I’ll be very happy).

I’ve mentioned this issue on Twitter a couple of times and know that Mendeley are perfectly aware of it, but haven’t ever had a response and nothing has ever changed on the license page. Jason…? Just having somebody say “we’ve checked it out with our lawyers and it’s all fine” would be good to hear. If it is then I’m off to build my own abstract dataset to sell for $$$.

Copyright is governed by civil law in many domains so if someone believes they own the IP then they can reasonably sue. It doesn’t mean they will or won’t win.

The Mendeley API is awesome and the intentions noble, but you can’t cut legal corners. It won’t do anybody good in the long run (at some point it’ll become a problem) and, at worst, could potentially land people who’ve used the API in legal trouble with the *actual* intellectual property owners.

I agree. And in the OKF we are scrupulous to avoid violation of IP. It often means there are things we cannot reuse that seem “reasonable”. So many of the “free” bibliography collections are not Open in that they may scrape data off other sites.

I’d like to see a little extra effort in living up to the “open data” claim by securing the relevant permissions from copyright holders, or clarifying exactly what attribution should be used for what, or separating out abstracts to be delivered under a different license… whatever would work.

So would I. I’d like to see publishers trying to help scholars re-use material rather than explicitly or implicitly preventing re-use as part of an outdated business model. It would, for example, be possible for publishers to agree that they would not claim copyright over abstracts. That doesn’t remove the author copyright. It would also be possible (though the probability requires a Maxwell demon (James Clerk, not Robert)) for publishers to own the copyright of abstracts and donate it to the world as CC-BY.

Alternatively John Wilbanks saying “it’s not a problem because x” would work for me too.

That depends on “x”.

So this is why I have formally asked Mendeley on IsItOpenData WHAT their data are: http://www.isitopendata.org/enquiry/view/e59da4b5-1ef2-43d1-beab-60ec91196f27/ . So far I have not had a response and I hope I get one.

It would be very useful if you could answer the following question(s):

(1) What is the data? It is important to know precisely what is included and what is not.

(2) How is the data obtained? From your blurb it seems like it may have to be accessed through an API. If so what is the nature of the API?

(3) Are there any limitations on how much data can be downloaded? If so what is the definition of the subset?

(4) Is there any guarantee that updates to your data base will be made available in the same way or is this effectively a snapshot?

(5) What is the “Creative Commons licence” that you mention?

This was answered in my blog. It is CC-BY

(0) This is summed up in the single question: is it OKD-compliant? http://www.opendefinition.org/ “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.

So some final clarification.

All evidence points to core data for individual bibliographic entries being Open. The STM publishers have confirmed publicly to me that they do not regard this as copyrightable. PMC have declared that their collection of bibliographic data is Open. The core is author/journal/title/year/page/language/format etc. It does NOT cover abstracts or images.

It does NOT cover collections of bibliographic data which ARE generally agreed to be copyrightable.

I think that Mendeley need to add clarification to this and I am hoping to get it. Mendeley, PLEASE use the IsItOpenData enquiry to reply since that then becomes public record.

[Of course with goodwill we can solve this problem and the OKF intends to catalyse that.]

 

 

 


Scholarly HTML : Theme and presentations today

#scholarlyhtml

Martin Fenner, Peter Sefton and I have been discussing what Scholarly HTML is, what we intend to do over the weekend and then what Peter and we will do next week. This post is “my slides” to introduce the session…

We believe that HTML is the right way to develop future scholarly communication at all levels. Not only is it technically capable of what we want but it’s inherently democratic – anyone can play. HTML (and some closely associated W3C specs) does many things “right” but there are a few things it doesn’t do well (at least in people’s perception).

We want to address the following (at least) but also set a somewhat bounded scope (so this isn’t about research metrics, how to review papers, course materials, etc.):

  • Packaging. Probably the major failing of HTML – when you get one with complex links or embedded material there is no standard way to proceed. We can change that.
  • Principles. What IS SchHTML? We’ll try to come up with ca 10 principles over the weekend. Example: SchHTML should fail gracefully.
  • Convention, microformats, folksonomies. A major strength of HTML. We plan to develop a strategy so that subcommunities can create their own ways of doing things.
  • Use-cases. Some ideas for the weekend are: a data-journal (maybe in crystallography or chemistry); new vision for e-only articles; and theses.
  • Advocacy. Why people should adopt SchHTML.
  • Symmetric relationships in the publishing community. Currently most authors are disempowered; we need to recreate communities of practice (as the crystallographers do).

There will be thousands of microformats in SchHTML. Here are some common ones, and we might proof one or two this w/e:

  • Bibliographic metadata
  • Citations and references
  • Scientific document structure
  • Sub topics (computation, experimental…)

The presenters today (1530, Todd-Hamied, Chemistry Dept) are (probable order):

  • Martin Fenner – overview of the architecture and process
  • Peter Sefton – tools for SchHTML and examples
  • Brian McMahon – a scholarly community that has developed its own publishing system

The hackfest is on Saturday and Sunday (Unilever Centre), ca 1000 – 1700 each day. If door closed there will be a mobile number to ring. ALL VISITORS SHOULD SIGN IN.


Scholarly HTML – what we are hoping for

#scholarlyhtml

We’ve already started the ScholarlyHTML event (with Peter Sefton’s presence) but tomorrow we start to ramp up with a session presented by Peter Sefton, Martin Fenner and Brian McMahon. (There’ll also be Simon Hodson and others including me). We then have a hackfest weekend and then Peter stays with us for the whole of next week, with more hacking taking place. So what’s it all about?

For me it’s about returning freedom-of-publication to authors. To publish what they want, and how they want.

Can’t they do that already? No. A major role of publishers is to restrict the flow of publications. Some refuse 90% of the publications they are sent. And at the other end they work hard to restrict the readership with paywalls, legal teams, etc. And a major casualty of this is the author. The author has virtually no freedom in how they publish. This is nothing new in creative arts where patrons insist on conditions unacceptable to authors and artists, but science is different. We are the patrons and the publishers are taking our money.

The problem is that this impacts on the “service” to authors. Authors get told that they have to publish in ways that suit the publisher. I’ve had personal experience of this. One where Henry Rzepa fought our paper through the most Kafkaesque system ever devised. But more recently when I wanted to publish a paper using HTML.

HTML? What’s that?

Well it’s the language that the rest of the world uses to create and publish electronic material – websites, adverts … It’s universal. It was designed to communicate science (it happens to sell insurance as well, but science was its motivation). It’s easy to author. Even if you don’t like pointy brackets there are zillions of free / open tools for creating HTML. So it’s obvious that scientists should use HTML for publishing.

So I was invited by a publisher a few years back to contribute to a themed issue. I asked the editor “could I publish in HTML?” “Yes” They said. So I created my manuscript on the basis that I could use HTML. It’s got the great advantage that you can lay things out where you want, it resizes, you can embed interactive objects (e.g. molecules), etc. I checked at regular intervals – I think I sent 50 emails for this one paper.

I came to submit it. The publisher refused. Well, not the publisher, but its publication robot. This is as friendly as a robot salesperson. There was no way I could submit this paper. I contacted the editor but was told I had to create it in Word. Converting my HTML to Word destroyed all my work. Half the figures couldn’t even be included. The final paper was a disaster.

I am not alone in wanting to publish in a plastic medium such as HTML. Many people do their slides in HTML. It’s plastic, fluid and semantic.

So this event is about returning to our basics.

HTML is completely suitable for all forms of modern scientific publication

It was good in the beginning, and now with HTML5 and various W3C and other specs and tools it’s all we need and all we should need. So Scholarly HTML is about reclaiming our right to express ourselves. It’s about authors.

Here are some of the things that HTML can do:

  • Embed a wide range of non-textual objects
  • Provide a machine-validatable specification (whether XHTML, XML, RDF or other)
  • Provide a manifest of what is being submitted
  • Act as a reading and writing environment

There is no reason why students shouldn’t write their theses in HTML. It’s more powerful than any other format and will allow the students and the examiners to agree on what has been submitted. There’s no reason why manuscripts should not be submitted in HTML, reviewed in HTML, processed and edited in HTML and read in HTML.

In a hackfest ideas arise naturally so we don’t want to be too prescriptive, but we have some initial starting points:

  • It’s about authors (not reviewers, or metric-weenies, or backroom production)
  • It’s platform- and tool-chain independent. There must be a toolchain but there can be (and are) many solutions.
  • ScholarlyHTML is declarative. Declarative means you state what something is, not how it is processed. The HTML exists independently of the tools. A molecule is a molecule regardless of how it looks. A table is a table. An author is an author. The declarative nature is probably the central technical core of ScholarlyHTML.
  • It’s Open. It comes from the community, not from a digital neo-colonialist. HTML was not just a markup language, it was a major blow for Freedom. We’ve lost some of the freedom. HTML was and is subversive technology.
  • It’s communal. HTML always envisaged communal activities but it’s taken a little while for good tools to arise. Now we have them. So HTML is publicly read/write. Wikis, blogs, shared docs all are communal HTML readwriters.

And the science…

  • It depends who is there but we shall definitively have some molecules, some crystal structures, hopefully some compchem.
  • One idea is to create a toolchain for writing and assembling theses. A validatable checklist.
  • Another is to create a data-journal – probably in crystallography

So will we change the world? The omens are good – scholarly publishing technology is so far behind what the rest of the world is doing that it cannot last in its present state. When something that makes sense comes along people will change to it. And when enough people are using it, then the rest of the world has to take notice.

So:

  • Authors have a right to author in HTML
  • There is a burden that we should do it responsibly and we’ll address this. We need conventions and styles that make processing straightforward and robust. But it’s technically possible. We are not being unreasonable.

Join us – and help to make history.

 


Mendeley Data IS OPEN!

I have had a very rapid response to my blog post:

Jason Hoyt says:

March 10, 2011 at 3:02 pm

Hi Peter,

All very good and valid questions raised there. To answer your acid test: yes, you could download and mash up our data set in any way you see fit. It is currently covered by a CC-BY license.

As you pointed out with the addition of John Wilbanks from Creative Commons to the API judging panel, we are very serious about making data accessible and agreeable to all parties.

That said, currently there is no bulk data dump option available to all. That option is available to academic researchers who want to work closely with us. The current process of using API methods is the more appropriate tool for developers desiring to build various applications for this contest.

We see the creation and usage of basic developer-friendly APIs as one of the key solutions to making science more open and more digestible by the general public. Large, raw data sets can serve a different purpose.

For serious research, ie not consumer facing apps to make science more accessible, we currently have a data set suited for collaborative filtering algorithm development (http://dev.mendeley.com/datachallenge/). We are also working on a few other large data sets that would be suited to other types of algorithm development and general research.

We will also be taking in feedback from all relevant stakeholders, including yourself, as we go forward in our agenda of making science more open.

Best,
Jason

Jason Hoyt, PhD
Chief Scientist and VP for R&D
Mendeley

So this is very good news. Not just for the fact that 70 million pieces of data are available, but because this is large enough to make a major impact on scholarship. I don’t know much about the data, but I will get myself a login and have a look.

I’m assuming that the bulk download for “academic researchers who want to work closely with us.” will carry restrictions. So the open material is what can be got out of the API. I certainly value the API – for example this is something that could be accessed by Chem4Word. Just as we access Pubchem and OPSIN from C4W we could also access the Mendeley API. If you want small numbers of specific bibliographic records then APIs can be a useful way to go. Indeed we might have a look during the hackfest.

However there are cases where we want all the data. An API should not preclude access to the raw data. And that’s where the “data” question still needs to be answered.

AIUI the data are collected silently from the activities of Mendeley users. Clearly there are data about this process (names of users, patterns of usage) which probably won’t be made public. There may be users’ comments – I don’t know. But for me the core raw data is the bibliography – as specified in The Principles of Open Bibliographic Data. Here we have confirmation from an increasing number of sources that individual bibliographic entries/records/data are not subject to copyright. So – assuming the Mendeley data falls into this category, that is what I am talking about for raw data. Note that the author (or unfortunately the publisher) may claim copyright over material such as abstracts, and I don’t know what Mendeley do about abstracts and related material.

So I still need to know what the “data” is. But assuming that it’s core bibliography then a large amount of that is becoming Open. And it’s not before time.

P.

Oh, and if all of us in JISCOpenBib and related projects feel the same way, expect us to win the 10001 prize. There is a great deal we can do already.

Posted in Uncategorized | Leave a comment

Mendeley API Binary Battle: IsItOpenData?

Mendeley have invited me to enter their “API Binary Battle” and win 10001 USD.

http://dev.mendeley.com/api-binary-battle.

Here’s the blurb:

Build an application with our data, make science more open and win $10,001!

What’s it all about?

At Mendeley we love science. We also love tech. And we’ve built the world’s largest crowdsourced research database, with 70 million documents, usage statistics and reader demographics, social tags, and related research recommendations, all available under a Creative Commons license.

We want to see a world in which science is mashed up… with anything. So, we are really excited to announce the Mendeley API Binary Battle. For you, this means: Build an application with this data, make science more open, win $10,001!

This could be very exciting. If we can really have a database of bibliographic data for 70 million documents AND if that is truly Open then it’s a major step forward. Of course it depends on what the documents are and what the quality of the data is, but that’s minor.

My excitement will depend on the answer to the following question(s), which I asked Mendeley last year but to which I haven’t had a reply:

What is the data?

And is it OKD-compliant?
http://www.opendefinition.org/
“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.

The positive indications are: the use of “open” several times (though if not defined “open” is as useful as “healthy”); the mention of “a Creative Commons licence” (though not all are OKD-compliant); and the presence of John Wilbanks on the team.

My acid test is: can I systematically download all of the data in the Mendeley database, transform it to my own format, and re-use and redistribute it subject only to acknowledging that it came from Mendeley?

If so this is a valuable resource.

I should really be asking this using the OKF IsItOpenData service. So I will. And tell you what it is for and looks like.


IsItOpen[Drug]Data?

I missed an important OKF event on Monday – the identification of Linked Open Drug Data. Linked Open Data is one of the great emerging ideas of the modern Web – the idea that data is semantic, linked and open. There’s already a huge collection: http://richard.cyganiak.de/2007/10/lod/imagemap.html and http://richard.cyganiak.de/2007/10/lod/lod-datasets_2010-09-22.png

That’s our current Linked Open Data. It’s got music, government, bioscience…

But where’s the chemistry?

There isn’t much. That’s because most chemists don’t understand the value of Linked Open Data. The normal model is to licence and sell it. Unfortunately that stops it being linked and being Open. But a larger number of data sets are “freely visible” in some way. That’s a start, but it’s not Open. We think that many data providers, when shown the value of Open Data will change their approach and make the data fully Open.

Before starting on Open, let me clarify “Linked”. This means that each identifiable chunk of information in the data set has a unique identifier (technically a URI). It’s good practice anyway for data sets to have unique IDs, and if yours does, then you can usually create Links by prepending your domain name to each ID. If it doesn’t, then you need to start creating unique IDs.
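As a rough illustration of “prepending your domain name”, here is a minimal Python sketch. The base address `http://data.example.org/compound` and the local IDs are made up for the example, and the helper `make_uri` is hypothetical; a real data set would use its own domain and its own ID scheme.

```python
from urllib.parse import quote

# Hypothetical base address; substitute the domain you actually control.
BASE = "http://data.example.org/compound"


def make_uri(base, local_id):
    """Build a URI for one record by prepending a base URL to its local unique ID."""
    # Percent-encode the ID so that spaces or slashes cannot break the URI.
    return base.rstrip("/") + "/" + quote(str(local_id), safe="")


# Made-up local identifiers, as they might appear in an existing data set.
local_ids = ["A2341", "B 17", "C/9"]
uris = [make_uri(BASE, local_id) for local_id in local_ids]

for uri in uris:
    print(uri)
```

The only requirement is that the local IDs are already unique; the domain prefix then makes them globally unique and, if the server answers at those addresses, linkable.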

Back to Open. This means that anyone can use *and reuse* the data for any purpose without further permission. Simple to state. Simple to make clear – simply add a licence that guarantees Openness. Best choices are CC-BY, CC0 and PDDL. The following are NOT Open:

  • Non-commercial licence (CC-NC). They may be useful for musicians, but they are a menace in science and academia. They look enticing but they are a rathole. Never use them. Persuade your colleagues to get rid of them.
  • Logins (even if no money is required). These cannot be negotiated in the Semantic web of LOD.
  • Incomplete access to data. Many sites provide search facilities (what is entry A2341) for “free”. The problem is you cannot navigate the whole data set. So access-through-search is not Open. Moreover the owner of the site could remove the facility.
  • Restricted subsets. “You can use up to 12345 entries without a contract”. Not Open.
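The checklist above can be written down as a small Python sketch. Everything in it is illustrative: the `DataSource` fields and the function names are invented for this example, and the licence list contains only the three “best choices” named above, so a licence missing from it is flagged for checking against the Open Definition rather than declared closed.

```python
from dataclasses import dataclass
from typing import List, Optional

# Only the recommended licences named in the text; not an exhaustive list.
RECOMMENDED_LICENCES = {"CC-BY", "CC0", "PDDL"}


@dataclass
class DataSource:
    """Hypothetical description of how a provider offers its data."""
    licence: Optional[str]           # e.g. "CC0", or None if nothing is stated
    requires_login: bool = False     # login needed, even if it is free
    search_only: bool = False        # entries reachable only through a search box
    entry_cap: Optional[int] = None  # "up to N entries without a contract"


def openness_problems(src: DataSource) -> List[str]:
    """Return the reasons, from the list above, why a source is not Open."""
    problems = []
    if src.licence is None:
        problems.append("no licence stated")
    elif "-NC" in src.licence.upper():
        problems.append("non-commercial licence")
    elif src.licence.upper() not in RECOMMENDED_LICENCES:
        problems.append("licence not on the recommended list; check the Open Definition")
    if src.requires_login:
        problems.append("login required")
    if src.search_only:
        problems.append("access through search only")
    if src.entry_cap is not None:
        problems.append("restricted subset")
    return problems


def is_open(src: DataSource) -> bool:
    """A source counts as Open here only if no problem is found."""
    return not openness_problems(src)


print(is_open(DataSource("CC0")))
print(openness_problems(DataSource("CC-BY-NC", requires_login=True)))
```

This is no substitute for asking the provider; it only makes explicit which questions an IsItOpen request is trying to settle.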

So a group of OKF Open scientists – anyone can join – on the open-science list (http://lists.okfn.org/mailman/listinfo/open-science ) has started to ask providers whether their data is Open. It’s run by Jenny Molloy and Egon Willighagen who have put great commitment into Openness. Jenny has helped to build the IsItOpen request tool – where we ask providers whether their data is Open, i.e. OKD-compliant. This is inspired by MySociety’s http://whatdotheyknow.com where FOI requests can be made and logged. So here we make requests to data providers, publishers, etc about the Openness of their data. (Note that this is not a legal request – providers can refuse to answer – but they then risk violating community norms by being unresponsive to the needs of the community. We are sure that all responsible publishers will welcome the opportunity to clarify their approach to data – and do this as a public record.)

Anyway 12 OKFers met in the Etherpad on Monday and here’s Jenny’s account:

We had a very productive hack session on Monday night regarding linked open drug data. You can see the full notes here:
http://okfnpad.org/sciencewg-loddhack-201103

In summary, we reviewed the openness of several LODD data sets in CKAN and identified those whose maintainers should be sent an Is It Open Data? request. We drafted letters to send to the World Health Organisation, Global Health Observatory and the maintainers of two datasets at the US National Library of Medicine:
http://okfnpad.org/sciencewg-who-letter
http://okfnpad.org/sciencewg-rxnorm-letter
http://okfnpad.org/sciencewg-nlm-letter

Before we send them via http://www.isitopendata.org/, it would be great to get more signatories from the group, so please add your name to the end of the generic letter on http://okfnpad.org/sciencewg-loddhack-201103 if you are happy to be included. Unfortunately, we didn’t remind all of the hack session participants to do this before they left, so if you helped on Monday then please do sign!

We will be sending the letters on Monday 14th March during a follow up session, of which more details are to follow.

If there is a group on CKAN, or a general topic area that you feel would be a good target for future sessions of this nature, then please let me know!

This is great. And you can be part of it.
