Kitware’s Contribution to the OSTP RFI on publicly funded data: the “Open Source Way”

The US government (OSTP) has recently issued an RFI on Open Access to data resulting from publicly funded research.

 

The deadline for responding to the White House RFI on OA to US federally funded research has been extended to January 12.

http://www.federalregister.gov/articles/2011/11/04/2011-28623/request-for-information-public-access-to-peer-reviewed-scholarly-publications-resulting-from

 

There has been an excellent, lengthy, response from Harvard (which inter alia argues for CC-BY and not CC-NC) [See http://mailman.ecs.soton.ac.uk/pipermail/goal/2012-January/000063.html ].

Here I want to applaud the response from Kitware, a company I know remotely (though have never visited). I work closely with Marcus Hanwell, who has developed Avogadro, an Open Source molecular editor (picture at the end of this mail).

The point here is that Kitware is dedicated to making a successful, profitable commercial business out of “Open Source”. They make their software Freely/Openly available (Avogadro is GNU GPL). Their flagship is VTK, a graphics toolkit. Their business model is to create added value on top of F/OSS (e.g. services, consultancy, privately contracted software, etc.).

The world benefits immensely from this approach.

They have responded to the White House RFI in the “Open Source Way”. Reading their excellent submission shows a philosophy similar to what I have outlined – that publicly funded work should be Open **and clearly defined as such**. This means clear licencing of components. They argue for:

  • CC-BY for documents (articles/papers)
  • OSI-compliant (F/OSS) for source code
  • CC0 or Public domain for data (cf. Panton principles)

I strongly recommend you read their submission. Those active in the Open Access movement should recognise the essential integrity of the whole scientific research community, where industry is an essential and valued member. NC licences discriminate against contributors such as Kitware. [If I published an image such as the zeolite below in a journal under CC-NC, Kitware would not be allowed to use it without my explicit permission.]

 

from Luis Ibanez (Kitware)

“””

At Kitware, we have decided to respond to both RFIs the Open Source Way.

We have posted the draft of our responses in the two public documents below:

and we now invite everyone to join us in refining and extending the answers to both RFIs. The documents are open for editing by anyone with the link.

Please join us in improving the feedback that we are providing to OSTP, or if you are satisfied with the current content of the documents, please join us by signing the response at the end of each document. The response will be submitted in the name of the signing parties. 

This of course, is not intended to preclude nor diminish any other initiatives for responding to the RFIs. We just want to make sure that we grab this unique opportunity to drive federal policy.

 

We will close down edits on January 10th, to format the final document responses and submit them to OSTP by January 12th.

Also, watch for an upcoming inSCIght podcast on this topic, open access and federally funded research, which should be available sometime this week.

 

    Happy New Open Access Year !

 

 

       Luis
“””

Posted in Uncategorized | 5 Comments

Panton Discussion #4: video of Iain Hrynaszkiewicz of BioMedCentral on Open Data, etc.

We have now edited and converted the Panton Discussion with IainH – it’s at http://vimeo.com/34555054.

The full video runs for 28 minutes, so we have given a TOC with start times if you want to jump to particular topics.

In addition we have split the discussion into 10 sections (which we will post to Vimeo) and will try to index them from the main video.

We are also transcribing the video (probably in the same sections) and will make these available shortly.

Please comment either on this blog or on the Vimeo site. I am aware that there are probably technical problems, such as frame rates and dropped index frames – if you see these, please note the time at which they occur.

Topics with start time and duration:

  1. Iain Hrynaszkiewicz (BiomedCentral): a Panton Discussion with Laura Newman and Peter Murray-Rust (0:14)
  2. My background (0:20) (1:02)
  3. My role at BiomedCentral (1:21) (0:42)
  4. The role of Open Data at BiomedCentral (2:03) (1:33)
  5. What is needed to make Open Data happen at BiomedCentral? (3:36) (1:34)
  6. The annual BMC Open Data awards (5:10) (1:09)
  7. Other Open Data advocates in the publishing industry (6:20) (0:49)
  8. Technical issues of data between disciplines (7:09) (1:52)
  9. Does one size fit all for data, or should we subdivide by discipline? (9:01) (0:48)
  10. What are the roles of the scientist, the funder, the publisher and the institution? (9:51) (1:54)
  11. Should each discipline have its own central repository? (11:43) (1:03)
  12. How do you see the Panton Principles? (12:47) (2:22)
  13. What are the barriers to Open Data? (15:09) (1:07)
  14. Business models and funding for publishing data (16:16) (1:44)
  15. Can we avoid fragmentation (e.g. of definitions) for Open Data? (18:01) (1:06)
  16. Would it make sense to have a central organisation for Open Data? (19:07) (0:48)
  17. Creating or capturing machine-readable data (19:56) (1:18)
  18. Examples of where Open Data works well (21:14) (2:12)
  19. What is the influence of Government Open Data initiatives? (23:25) (1:02)
  20. What would you like to see in 5 years’ time? (24:28) (1:37)
  21. What are the greatest challenges facing the Open Data movement? (26:05) (2:03)
Posted in Uncategorized | 2 Comments

Semantic Physical Science Workshop

We (Charlotte Bolton and I) are preparing the material for the Semantic Physical Science Workshop in January (10th-12th). A major feature of this is our Jumbo-Converters, which convert legacy log files to semantic CML. To do that we are cleaning up and testing the code – which runs to probably tens of thousands of lines, designed by Jim Downing and Sam Adams and implemented mainly by me.

To make it usable it has to be cleaned of historical cruft, tested and documented, probably all together and iteratively. We’ve had several iterations of wrappers for J-C, including two versions of “Lensfield”. Currently it looks like we are going back to a much simpler command-line interface, putting some responsibility on the user to write the wrapper. This is a common problem – workflows are hard and local, and don’t seem to generalise or abstract well. Moreover, when you commit to one it’s very hard to remove it and change to another. So, as we have done with OSCAR, we’ve whittled away the wrapper stuff.
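The thin command-line wrapper we are moving towards could look something like this – a minimal sketch only, with made-up converter names and a hypothetical `ConvertCli` entry point rather than the real Jumbo-Converters API:

```java
// Hypothetical sketch of a thin command-line wrapper of the kind described above.
// The converter names and the ConvertCli class are illustrative, not the real
// Jumbo-Converters API.
import java.util.Map;

public class ConvertCli {

    // A user-visible registry mapping converter names to what they convert.
    static final Map<String, String> CONVERTERS = Map.of(
            "gaussian-log", "Gaussian log file -> CML",
            "nwchem-log", "NWChem log file -> CML");

    // Build a human-readable conversion plan; the real tool would run the converter.
    public static String plan(String converter, String input, String output) {
        if (!CONVERTERS.containsKey(converter)) {
            throw new IllegalArgumentException("Unknown converter: " + converter);
        }
        return CONVERTERS.get(converter) + ": " + input + " -> " + output;
    }

    // Usage: java ConvertCli gaussian-log run1.log run1.cml
    public static void main(String[] args) {
        System.out.println(plan(args[0], args[1], args[2]));
    }
}
```

The point of this shape is that the user supplies the orchestration (shell scripts, makefiles, whatever is local to them), while the tool does one conversion per invocation.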

We didn’t get as far as we’d hoped for today, because of this:

It’s a Sparrowhawk (Accipiter nisus) and although we have seen them in our garden from time to time, this one – a female – has started to use our spruce tree as its dining table. Sparrowhawks eat mainly birds (there is a superb article in Wikipedia, http://en.wikipedia.org/wiki/Eurasian_Sparrowhawk, where we learn that this one might eat 1500 Great Tits a year). This one looked like it was eating a Blackbird and we recovered the following

Which I am guessing is a female blackbird (Turdus merula).

We managed to get the photo with my birdwatching telescope and a phone camera pointed at the eyepiece.

Since the hawk takes about an hour to eat a bird and because it went off for another one and came back again, it took up a lot of our time. So the evening will have to make up for what I didn’t manage during the day.

Posted in Uncategorized | Leave a comment

Videos from BiomedCentral: IainH and Gulliver Turtle (Panton #4 and #5) and thanks to Musopen

I have spent the last few days, on and off, editing the material that Laura Newman (OKFN) and I collected from BiomedCentral – interviewing Iain Hrynaszkiewicz and also Gulliver Turtle. Iain has the (slightly edited) video/audio and hopes to let me have comments shortly, after which I can release a final edited version.

Before I talk about the details I want to say how much I appreciate what BiomedCentral has done for publishing science Openly. They have been going a bit over 10 years, and when they started their business model was unproven. I paid tribute to this in /pmr/2010/06/11/reclaiming-our-scholarship-tribute-to-vitek-tracz-and-bmc/ last year. I have also commended their initiatives in going beyond the mainstream. While most publisher “open access” (such as it is) has to be dragged year-by-year from apparently resistant publishers, BMC has gone out in front and wants to show what we should be doing. So there is a lot of this in Iain’s interview.

Editing videos is hard work and I would be grateful for advice from people who’ve done it. It depends critically on the material and Iain was a superb interviewee. He knew what the questions would be and had prepared thoroughly, so when we interviewed him the replies were fluent, without hesitation, deviation or repetition. There are about 20 questions and answers – here is a typical one –

and the whole interview lasts about 28 minutes. So I am planning to create:

  • A complete edited movie of 29 minutes
  • 20 snippets (Q + A), each in its own movie (1-2 mins each)

Each Q+A has the interviewer (mainly Laura) in the semi-background, but quite audible and then Iain’s response. The audio seems very clear – it was an empty room with a lapel mike for Iain and a camera mike for the interviewer.

Q: should I create 20 snippets or try to bundle them into larger themes?

Q: where should I post them (currently I will use VIMEO with a CC-BY licence)?

I also wish to get a transcript of the session (this is very important for indexing by search engines). Last time we asked for OKF volunteers and it took ages. I am considering Mechanical Turk, which costs about 1 USD per minute of video, so ca 30 USD in total. There’s a good tutorial on this (http://waxy.org/2008/09/audio_transcription_with_mechanical_turk/ ), so it seems to be very cost-effective and I expect high quality (given the simplicity of the task and the clarity of the material).

Meanwhile I have also created the final version of Gulliver Turtle’s interview (http://vimeo.com/34259668).

I wanted to add music to the slideshow so that it added to the atmosphere, and @davemurrayrust offered his CC-BY material (http://mo-seph.com/). However it was too good, in that the listener paid more attention to the music than the text. So I started looking for CC-BY or CC-PD music and was pointed to a wonderful site (http://musopen.org). This has many hundreds of public domain recordings (sic, CC-PD), mainly from “the classics”. So it was a question of selecting something that added to the video.

I couldn’t find Carnival of the Animals so first tried Schumann’s Kinderszenen – but we all agreed it was too sentimental. So the animals now have Bach’s Anna Magdalena in the background (far better than I can play it!). It’s fairly easy to add music – you have to trim it to the right length. It’s repeated three times to fit and has a fade at the end. I’d value comments, but I am thinking of using it as the basic AnimalGarden background for any generally “happy” photocomic.

So then I resurrected the slideshow that I had given at the Serpentine Gallery /pmr/2011/10/16/garden-marathon-at-serpentine/ and added music to it. This was harder, as the themes were Innocence, Greed and Treachery, and Hope. Still choosing from Musopen (and again many thanks) I chose Anna Magdalena, Winter (4 Seasons) and Brandenburg 6-1. Certainly I am really happy to have found a PD site that gives me so much choice.

(There will be no music for IainH’s interview)

Posted in Uncategorized | 1 Comment

Panton Discussions #4 and #5

Yesterday Laura Newman (OKF) and I met with Iain Hrynaszkiewicz of BiomedCentral and recorded a Panton discussion. Panton discussions are irregular discussions with figures in the Open world and have traditionally taken place in the Panton Arms in Cambridge, where the Panton Principles (http://pantonprinciples.org/ ) were announced.

The previous three discussions were with:

  • Richard Poynder
  • David Dobbs
  • Peter Murray-Rust (interviewed by Richard Grant, F1000)

Until recently discussions depended on my finding someone with a recorder and preferably video camera as well. Now I have permanent access to a camera and have learnt how to use it, so we don’t have to get people to Cambridge (though it’s nice to try). We’ve also got funding for Panton Fellowships (OSI) so that’s an extra resource and has enabled Laura and me to work together.

IainH has been a great advocate of Open data and this recording explores his ideas. I’m currently editing it but here is the most controversial part as a teaser – how does Iain pronounce his name?:

http://dl.dropbox.com/u/6280676/ih_prononunce.wmv

This clip is a test of whether blogging videos works in my environment.

When AnimalGarden found out we were going to BMC they insisted we had to have an interview with Gulliver Turtle – the Open Access Turtle. So here are a few snippets of their recording:

The current movie is in *.wmv – sorry if you can’t play it. I am getting a converter, so the full interview should be in MP4.

Posted in Uncategorized | 1 Comment

The Open Access Movement is disorganized; this must not continue

I am going to have to reply to an article by Stevan Harnad (http://openaccess.eprints.org/index.php?/archives/862-guid.html), where he argues inter alia that gratis OA (e.g. through Green, CC-restricted) rather than libre OA (e.g. through Gold, or CC-BY) should be adopted because:

“Note that many peer-reviewed journal article authors may not want to allow others to make and publish re-mixes of their verbatim texts”

“It is not at all it clear, however, that researchers want and need the right to make and publish re-mixes of other researchers’ verbatim texts.”

“Nor is it clear that all or most researchers want to allow others to make and publish re-mixes of their verbatim texts. “

“Hence Gratis OA clearly fulfils an important, universal and longstanding need of research and researchers.”

This is a new, specious and highly damaging assertion that I have to challenge. If we restrict ourselves to STM publishing (where almost all of the funders’ efforts are concentrated) there is not a shred of evidence that any author wishes to restrict the re-use of their publications through licences. My analysis is:

  • Most scientists don’t care about Open Access. (Unfortunate, but we have to change that)
  • Of the ones that care, almost none care about licence details. If they are told it is “open Access” and fulfils the funders’ requirements then they will agree to anything. If the publisher has a page labelled “full Open Access – CC-NC – consistent with NIH funding” then they won’t think twice about what the licence is.
  • Of the ones who care I have never met a case of a scientist – and I want to restrict the discussion to STM – who wishes to restrict the use of their material through licences. No author says “You can look at my graph, but I am going to sue you if you reproduce it” (although some publishers, such as Wiley did in the Shelley Batts affair, and presumably still do).

My larger point, however, is that the OA movement is disorganised, and because of that it is ineffective. The movement:

  • Cannot agree on what “open access” means in practice
  • Appears to be composed of factions which, while they agree on some things, disagree on enough others to make this a serious problem
  • Does not sufficiently alert its followers to serious issues
  • Has no simple central resource for public analysis of the major issues
  • Is composed of isolated individuals and groups rather than acting on a concerted strategy
  • Spends (directly or indirectly) large amounts of public money (certainly hundreds of millions of dollars in author-side fees) without changing the balance of the market
  • Has no clear intermediate or end goals

I accept that all movements have differences of opinion. This happened in the Open Source movement. Richard Stallman proposed a model of [UPDATED] Free Software with a strong political/moral basis: software should be free and should be a tool of liberation, hence the viral GPL (similar to CC-BY-SA). Others see software as a public good where optimum value is obtained by requiring libre publication but not restricting downstream works to this model (similar to CC-BY). These coexist fairly well in two families – copyleft and permissive. There are adherents of both – my own approach is permissive, where anyone can use my software downstream for whatever purpose, requiring only that they credit the authors (and, with the Artistic Licence, that forks use a different name).

The point relevant to Open Access is that this market / movement is regulated. The community has formed the Open Source Initiative (OSI, http://www.opensource.org/), which

“is a non-profit corporation with global scope formed to educate about and advocate for the benefits of open source and to build bridges among different constituencies in the open source community.”

“One of our most important activities is as a standards body, maintaining the Open Source Definition for the good of the community. The Open Source Initiative Approved License trademark and program creates a nexus of trust around which developers, users, corporations and governments can organize open source cooperation.”

This is critical.

When I find an Open Source program, I know what I am getting. When I find an Open Access paper I haven’t a clue what I am getting. When I publish my code as Open Source I can’t make up the rules. I must have a licence and it must be approved by OSI (they have a long list of conformant licences and discuss why some other licences are non-conformant).

The Open Source movement decided early on that the Non-Commercial clause in licences was inappropriate / incorrect / unworkable, and NO OSI licences allow such restrictions. This is one of the great achievements of Open Source.

And it’s policed. When I published some GPL-licenced CML-generating code I added the constraint “if you alter this code you cannot claim the result is CML”. I was contacted by the FSF and told this was incompatible with the GPL. So I changed it.

In short the OS community cares about what Open Source is, how it is defined, how it is labelled and whether the practice conforms to the requirements.

By contrast the OA community does not care about these things.

That’s a harsh thing to say, and there are many individuals and organizations who do care. But as a whole there is no coherent place where the OA movement expresses its concern and where concerns can be raised.

The problem can be expressed simply. “Open Access” was defined in the Budapest and other declarations. Budapest (see http://www.earlham.edu/~peters/fos/boaifaq.htm ) says:

“By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

Everyone (including Stevan) would agree that this is now consistent with what is (belatedly) being labelled as OA-libre. Note that Stevan was a signatory to this definition of Open Access.

My immediate concern is that unless we organize the definition, labelling and practice of Open Access we are simply giving OA-opponents or OA-doubters carte blanche to do whatever they like without being brought to account. We are throwing away hundreds of millions of dollars in a wasteful fashion. We are exposing people to legal action because the terms are undefined.

In short “Open Access” is a legally meaningless term. And, whether you like it or not, the law matters. If you try to re-use non-libre material because it was labelled “Open Access” you could still end up in court.

As a UK taxpayer I fund scientists to do medical research (through the MRC). The MRC has decided (rightly) that the results of scientific research should be made Open. But they are not Open according to the BOAI declaration. Every paper costs taxpayers 3000 USD to publish and we do not get our money’s worth.

The fundamental problem is that this is an unregulated micro-monopoly market and no-one really cares. The BBC reported today that some bus operators have 70% of the market. That means they have an effective monopoly. They can run the services they want, not the ones we need. The only possible constraint is government regulation.

Scientific publishing is an unregulated pseudo-monopoly market where the publishers make the rules and make the prices. The bargaining (such as it is) comes from:

  • Libraries (who are fragmented, and who care only about price, not rights)
  • Funders (who are fragmented across disciplines and countries)
  • Small scholarly organisations such as SPARC (BTW I thought SPARC was advocating for OA-libre but I can’t see much sign of it)
  • Individuals such as Stevan, Peter Suber, Alma Swan, with relatively little coordination and no bargaining power

These have been ineffectual in creating a coherent OA market. BMC and PLoS have been very useful in showing that OA is possible on several fronts – without them I think OA in STM would be effectively dead.

So the OA movement desperately needs coordination. Coordination of:

  • Terminology
  • Labelling
  • Dissemination of information and coordinated search
  • Advocacy
  • Enforcement

It also needs to coordinate on price-bargaining in a more effective way than libraries do at the moment. We (our governments, our charities, our universities) provide the money but they don’t coordinate how it’s spent or what value they get.

So my simple proposal is that we need an Open Access Institute/Initiative similar to the OSI for Open Source. It would cost a small fraction of what we already pay in unregulated Open Access fees (who, for example, challenges the 5000 USD that Nature charges for a hybrid paper?). It costs more to run a car than to publish one hybrid paper per year in Nature.

Stevan’s response will be: “let’s concentrate on getting all papers published as Green before we worry about anything else”. I don’t agree with this and I will explain more later. The OAI will have to accommodate such differences of opinion and label the approaches properly rather than allowing everyone to redefine Open Access as they think best or trying to get everyone under a common ultra-fuzzy label.

[UPDATE: I typed “Open Source” instead of “Free Software” by mistake and apologize to RMS for the slip.]

 

Posted in Uncategorized | 25 Comments

LBM2011 Singapore; a milestone in Text-mining and Natural Language Processing, OSCAR, OPSIN, ChemicalTagger

Yesterday I gave an invited plenary lecture at The Fourth International Symposium on Languages in Biology and Medicine (LBM 2011, http://lbm2011.biopathway.org/), Nanyang Technological University, Singapore, 14th and 15th December 2011.

The meeting was on Natural Language Processing – using computational techniques to “understand” science in publications and get machines to help us with the often boring and error-prone part of extracting detailed meaning. It’s an exciting field and progress is steady.

The meeting itself was great. Very high standard of talks. I understood most of them both in their intent and their methodology. And a really great atmosphere – ca 30 people in a relaxed atmosphere and prepared to exchange ideas.

I realised the night before that this lecture represented a milestone in my NLP/TextMining career so I took considerable time to adjust it to the audience and the occasion. Normally in my lectures I don’t know what I am going to say (and choose from HTML slides). This time I wanted to pay tribute to all the people who have contributed so much over the years and have brought us to this milestone. So here’s the second slide of my talk:

(I’ve only included each person once). If there is anyone who is omitted let me know.

I’d like first to thank the people in the Centre I have had the chance to work with (including of course Jim Downing). They’ve been unusual in a good way, in that they haven’t been obsessed with academic competition. They have worked as a team, creating joint products, and they have also put a high premium on creating things that are useful and work. That’s not so common in academia, and this group has traded H-indexes for software and systems that are out there and being used. If only academia gave credit for that, they would be stars. That time will come.

I’m proud to say that they’ve all joined high-tech UK companies which have to be part of our future. Making digital things that people want to buy. Thermodynamics owes more to the steam engine than the steam engine to thermodynamics. We should be learning from the companies that this group has gone into. I’m proud of the software engineering that Jim introduced and that the group adopted without mandates or coercion but simply because it was so evidently right.

That pride has gone into the three products, OPSIN, OSCAR4 and ChemicalTagger.

In a real sense we can draw a line under them. They work, they are “out there” and they are used. We don’t know how much they are used because people are so secretive. I’d guess that there are probably 20-100 installations of OSCAR. We get little feedback because the software works. (We got no formal feedback on OSCAR2, and yet we know that it’s widely used.)

And I’m going to be unusually boastful, because it’s for them, not me:

OPSIN, OSCAR and ChemicalTagger are the best in their class that we know about. There may be private confidential programs that we don’t know about, but hey! Because they are Open Source, people don’t try to compete and duplicate the functionality. They re-use it. So OSCAR is used in Bioclipse, and at the EBI for their chemical databases and ontology. Open source doesn’t necessarily make a program functionally better per se, but it allows other people to work on it. More testers, more bugs discovered, more progress.

Why can we draw a line?

Because essentially we have done what we needed to. We’ve built them as frameworks and we are confident that the frameworks will work for some time before they need refactoring (everything needs refactoring). So if you think OSCAR4 has less functionality than OSCAR3, that’s because it’s modular. There is no point in us writing web interfaces that you then need to put on your server. Instead we have written an API that is so simple and powerful that its basic form is 2 lines of code. Easy to understand, easy to test, easy to install, easy to customise.

There’s lots more to do, but it doesn’t involve rewriting the programs. OSCAR is designed to be extended through APIs. If you want to use a new corpus, there’s an API. A new dictionary/lexicon? An API. A new machine-learning algorithm? Yes, an API. It should be hours, not years, to reconfigure. Here’s how to do it:

ChemicalEntityRecogniser myRecogniser = new PatternRecogniser();
Oscar oscar = new Oscar();
oscar.setRecogniser(myRecogniser);
oscar.setDictionaryRegistry(myDictionaryRegistry);
List<ResolvedNamedEntity> entities = oscar.findResolvableEntities(text);

Five lines of code (of course someone has to write that recogniser and the dictionary but then you can plug and play them). So if you want OSCAR to use Conditional Random Fields, find an Open Source library (there are lots) and bolt it in as a Recogniser.
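The plug-and-play design those five lines rely on can be sketched in miniature. Everything below – the interface, the naive recogniser and the `MiniOscar` host – is an illustrative stand-in for the pattern, not the real OSCAR4 API:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for OSCAR's recogniser abstraction: anything that can find entities in text.
interface EntityRecogniser {
    List<String> findEntities(String text);
}

// One pluggable strategy: a crude pattern-based recogniser (illustrative only).
class NaivePatternRecogniser implements EntityRecogniser {
    public List<String> findEntities(String text) {
        List<String> hits = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            // Toy "chemical" heuristic: common alcohol/alkane name endings.
            if (token.endsWith("ol") || token.endsWith("ane")) {
                hits.add(token);
            }
        }
        return hits;
    }
}

// The host object delegates to whatever recogniser is plugged in.
class MiniOscar {
    private EntityRecogniser recogniser;

    void setRecogniser(EntityRecogniser r) {
        this.recogniser = r;
    }

    List<String> findEntities(String text) {
        return recogniser.findEntities(text);
    }
}

public class RecogniserDemo {
    public static void main(String[] args) {
        MiniOscar oscar = new MiniOscar();
        oscar.setRecogniser(new NaivePatternRecogniser());
        System.out.println(oscar.findEntities("dissolve ethanol in hexane"));
        // prints [ethanol, hexane]
    }
}
```

The point is the seam: swapping in, say, a Conditional Random Fields recogniser means writing one more implementation of the interface, with no change to the host at all.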

Yes, group, I am proud of you and yesterday was the day I said so publicly. I used your own (Powerpoint) slides!

So where does it leave us? What does the signpost point to?

I said yesterday that Language Processing research had two goals and I’ll prepend a third:

  1. LP research develops new approaches to LP. Our main contribution here has been to add chemistry and we’ve covered most of what’s involved. There is no reason why the technology shouldn’t be extended to different human languages, different corpora. We’ve not made any great Chomskian-like breakthroughs in understanding language itself. We haven’t been able to compare chemical corpora because we don’t have a collection of Open ones.
  2. LP uses discourse to give insights into the science itself. I’d hoped to do that in chemistry, but the universal refusal of chemistry publishers to provide Open corpora has meant that we have been restricted to patents. And patents, designed to conceal as well as to reveal, are not where new ideas in the fundamentals of chemistry come from. Contrast bioscience where language is a primary tool for understanding the discipline
  3. LP as a tool provides new useful knowledge to the whole world. Here again we are stymied by the publishers. As this blog has shown publishers are unwilling to make papers Openly available and for the extracted knowledge also to be Open. At a conservative estimate publishers have held back LP by a decade.

I then divided chemical LP into two areas:

  1. Chemical LP in chemical corpora. There’s nothing useful we can do here until the scientific literature is Opened. There are 10 million syntheses published a year, and even being pessimistic, PubCrawler and OSCAR could analyse 2 million of these (I think it’s higher). Richard Whitby in Dial-a-Molecule could use all this in his project for designing the molecules we will need in the future. But we are simply legally forbidden.

     

    So I am giving up LP in chemical documents. There is no point. Some commercial companies will possibly use OSCAR or OSCAR-like tools to do a small bit of this – but necessarily inefficiently. We can forget the idea of a chemical Bingle. Chemistry remains stagnant.

     

  2. Chemical LP in biological documents. Unlike the chemists, biologists really need and want LP/Textmining. They are also hampered by the restrictive practices, but they can probably work out the scope (so long as they don’t publish the extracted data – all our data are belong to the publishers). There are areas such as metabolism (where Peter Corbett had some great and easily implementable ideas) which would yield massive results. Metabolism is bound up with why drugs work and why they don’t work. It matters. Lots of it is in the existing literature but technically and legally locked up, gathering dust.

So I am encouraging the bioscientists to use our software. I am happy to work with anyone – I am not slavishly tied to generating REF points. There is some valuable chemistry to discover by mining the bio-literature.

And I intend to go more into the patient-oriented literature and to use LP to help the scholarly poor. Because it may help them to become scholarly richer in spite of everything. And I picked up quite a lot about medical LP at the meeting so I’m fired up.

Posted in Uncategorized | Leave a comment

Workshop and Symposium on Semantic Physical Science

We are delighted to announce that we are running a Workshop and Symposium on Semantic Physical Science (see below). This will explore how to use physical scientific data in semantic form and will explore: creation by humans and machines, specifications including dictionaries/schemas, writing and using code (Java, FORTRAN), creating semantic data/annotation, data repositories, and publication/re-use.

We are encouraged by the success of a 1-day workshop we ran in Melbourne last month, where we explored the ideas and technology. The feedback was very positive, suggesting that the time has come for Physical Science to embrace semantics. Common components are dictionaries, units of measurement, errors and metadata, and the workshop/symposium will explore how to create and use these. We are also encouraged by a recent JISC/OKF/W3C-SWAT4LS life sciences hackathon that we helped to run (which showed that substantial progress can be made in 2-3 days).

This workshop is made possible by support from EPSRC. Please feel free to mail this to other scientific groups and organizations and mailing lists, especially in chemistry, earth and materials sciences.

 

Workshop and Symposium on Semantic Physical Science, Unilever Centre for Molecular Science Informatics, Cambridge UK (2012-01-10/12)

We are running a hands-on workshop (January 10th/11th 2012) and symposium (January 12th 2012) on Semantic Physical Science, supported by EPSRC (“Pathways to Impact”). At these events, we will be investigating how semantic technologies (dictionaries, mark-up languages, ontologies, data-typing) can be applied to the capture, publication, preservation and re-use of data in the physical sciences (especially chemistry and materials science). We have invited 25 scientists (particularly from the fields of crystallography/solid state, analytical spectroscopy and computational chemistry; see list below) to a two-day workshop where we will review and create toolkits and protocols. We are delighted to see very great interest from national laboratories and national providers of services.

The results of the workshop and general talks on semantic principles will be presented at the full day symposium, which will be of particular interest to creators of chemical software, publishers, repository managers and funders who encourage data publication. The symposium is open to everyone without charge. The approximate program will be released shortly but some details will reflect the progress made in the preceding workshop. There will be significant time for discussion. If you wish to attend the symposium, please email spsworkshop@gmail.com to register; places are limited to 50 – first come, first served!

Please distribute this flyer to anyone who you feel may be interested.

There may be one or two places still available in the workshop – please email spsworkshop@gmail.com for information.

Confirmed attendees:

Nico Adams (CSIRO), Simon Coles (University of Southampton), Clyde Davies (Microsoft), Bert de Jong (Pacific Northwest National Laboratory), Martin Dove (University of Cambridge/Queen Mary University of London), Jorge Estrada (Zaragoza Scientific Center for Advanced Modeling), Marcus Hanwell (Kitware), Marcus Kraft (University of Cambridge), Mahendra Mahey (JISC), Brian McMahon (IUCr), Karl Mueller (Pacific Northwest National Laboratory), Weerapong Phadungsukanan (University of Cambridge), Henry Rzepa (Imperial College), William Shelton (Pacific Northwest National Laboratory), Paul Sherwood (STFC Daresbury Laboratory), Michael Simmons (University of Cambridge), Christoph Steinbeck (EBI), Jens Thomas (STFC Daresbury Laboratory), Andrew Walker (University of Bristol), Alex Wade (Microsoft), Nancy Washton (Pacific Northwest National Laboratory), Mark Williamson (University of Cambridge), Erica Yang (Science and Technology Facilities Council)


JISC/OKFN/SWAT4LS Hackathon: Bibsoup and disease video

This week’s hackathon showed how much can be accomplished in a (short) day and a half. I’ve already given a brief overview, but here I discuss our project – Open Research Reports and disease – in detail.

David Shotton and Tanya Gray introduced Minimal Information for Infectious Disease Investigations (MIIDI). [There was lively debate about whether the Minimal Information idea is useful in bioscience.] Graham Steel and Gilles Frydman brought the patient axis: ACOR (Association of Cancer Online Resources) and PatientsLikeMe. What is absolutely clear is that:

  • Patients want to be and must be equals in the use of information
  • Patients have huge interest, huge energy and increasing community experience and knowledge
  • They are seriously disadvantaged by lack of access to full-text articles (academics do not realise this)
  • Bibliography is critically useful

So we are developing BibSoup to take in selected PubMed IDs (via UKPMC) and ingest them into a BibServer to provide a BibSoup instance. Mark MacGillivray has developed a faceted browsing technology, based on BibJSON, which is both formal and fluid. It’s easy for people to annotate and should be straightforward to extend.
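To give a flavour of why BibJSON is "both formal and fluid": a record is just JSON with a handful of conventional field names, so it is trivially extensible. Here is a minimal sketch of the kind of record a BibServer/BibSoup instance might ingest from a PubMed ID; the field names follow BibJSON conventions, but the title, authors and PMID are invented placeholders, not real entries.

```python
# Minimal sketch of a BibJSON record, of the kind a BibServer/BibSoup
# instance might ingest from a PubMed ID. All values below are invented
# placeholders for illustration; only the field names follow convention.
import json

record = {
    "title": "An example infectious-disease paper",     # hypothetical
    "author": [{"name": "A. Researcher"}],              # list of name objects
    "year": "2011",
    "identifier": [{"type": "pmid", "id": "12345678"}], # hypothetical PMID
    "journal": {"name": "Example Journal of Medicine"}, # hypothetical
}

# BibJSON groups records into a collection, which is what a BibServer serves.
collection = {"records": [record]}
print(json.dumps(collection, indent=2))
```

Because it is plain JSON, annotation is just adding a key ("disease": "malaria", say), which is exactly what makes faceted browsing over an ingested corpus straightforward.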

Along with the other presentations, I have captured ours (3 min 20 s) and annotated some of the footage. It is currently in my public Dropbox (http://dl.dropbox.com/u/6280676/hackathon.mp4 ).

We are excited by how effective the hackathon was in bringing several people together into the BibSoup cluster, and by how the various ideas came together. We hope to create a future event where patients are participants; I spent Friday with Gilles Frydman discussing what that would involve.

Again thanks to Gilles, Graham, Jenny, Mark, Andrea, Mahendra, Naomi and everyone else.


The hackathon was a great success (Open Research Reports, SWAT4LS, JISC, OKFN, Open Bibliography/BibSoup …)

The 1.5-day hackathon prequel to the SWAT4LS workshop was a great success – at least all 30+ people there thought so. This is just a brief report and thanks. First, enormous thanks to Andrea Splendiani and Mahendra Mahey (JISC), who jointly came up with the idea, and to JISC, who sponsored the event. Thanks to Jenny Molloy (OKFN), who spent a lot of time preparing and advertising, and to everyone who contributed (i.e. everyone).

Very briefly: we started with an evening session where people presented ideas – some prepared (Professor OWL showed her video), some extempore. There were ideas about our ORR interests (Open Citations (David Shotton and Tanya Gray), Open Bibliography, and patient-centred information and decisions (Gilles Frydman and Graham Steel)), but also about how to build networks of gene-drug interactions, design an artificial genome, map the incidence of disease, etc.

Then the next day Mahendra gently organized us into groups – about 6 critical masses of people who each felt they could create something by the end of the day. Not necessarily software – it could be exploration, resources, specs, etc. Mahendra, Naomi Lillie (who has recently joined the OKF staff) and I had a roving brief, interviewing tables and people and generally recording the event.

I have recently discovered the joys of video recording and, since Mahendra’s was full, I rushed out to Tottenham Court Road for a tripod. We started to record individual attendees and find out who, why, what, etc. ULU isn’t the easiest place to record, as it has large echoing rooms and passages, banging doors, people with drills, police cars, and students shrieking with laughter, so some of the early efforts were of poor quality. Mahendra has a colleague who can apparently work magic, and he has taken them all away to edit. At lunchtime I rushed out again and got a lapel mike and a spare card, and here we are. Here I’ll just put stills from the movies. I am very impressed with the quality – you can often read every word on the screen.

[Sorry I don’t have names for everyone – feel free to annotate through blog comments]

Graham McDawg, Gilles KosherFrog and Jenny Molloy

Building an artificial genome

 

The BibSoup cluster (anticlockwise): XX, Naomi Lillie, Tanya Gray, Mark MacGillivray, Jenny Molloy, Gilles Frydman, Graham Steel

BibSoup being presented (Jenny Molloy)

The details [there was also a great demo on screen by Mark – I’ll post it when transcoded]

Soup of the evening, beautiful BibSoup

McDawg – stand back – I’m a scientist

David Shotton (middle rear, beard) and others

I am really excited about the whole thing – the different disciplines and experiences really came together. There was a lot of interest in Open Bibliography, Open Citations and BibSoup. These can become central tools in the semantic web – and when allied with UKPMC they specifically serve bioscience. The disease theme was very strong – not just in our group – and there was great interest in patient-centred approaches.

More on all this later.
