APE2008 – Heuer, CERN

APE (Academic Publishing in Europe) was a stimulating meeting, but I wasn’t able to blog any of it as (a) there wasn’t any wireless and (b) there wasn’t any electricity (we were in the Berlin-Brandenburg Academy of Sciences, which made up for the lack with its architecture and the legacy of bullet holes in the masonry). So I took notes while the battery lasted, but they read rather staccato.
The first keynote was very exciting. Rolf-Dieter Heuer is the new Director General of CERN – where they start hunting the Higgs Boson any time now. CERN has decided to run its own publishing venture – SCOAP3 – which I first heard of from Salvatore Mele. I’m hoping to visit him at CERN before they let the hadrons loose.
So my scattered notes…
SCOAP3 requires that ALL countries contribute (i.e. total commitment from the community and support for the poorer members)
closely knit community, 22,000 people
ca 10 MEUR for HEP – much smaller than the experiments (500 MEUR), so easy for CERN to manage (so organising a publishing project is small beer compared with lowering a 1200-tonne magnet down a shaft)
22% of young people in physics use Google as their primary search engine
could we persuade people to spend 30 mins/week on tagging?
what people want
full text
depth of content
quality
build complete HEP platform
integrate present repositories
one-stop shop
integrate content and thesis material [PMR – I agree this is very important]
text- and data-mining
relate documents containing similar information
new hybrid metrics
deploy Web2.0
engage readers in subject tagging
review and comment
preserve and re-use research data
includes programs to read and analyse
data simulations, programs behind experiments
software problem
must have migration
must reuse terminated experiments
[PMR. Interesting that HEP is now keen to re-use data. We often heard that only physicists would understand the data, so why re-use it? But now we see things like the variation of the fundamental constants over time – I *think* this means that the measurement varies, not the actual constants]
preservation
same researchers
similar experiments
future experiments
theorists who want to check
theorists who want to test future theories (e.g. the weak force)
need to reanalyse data over time (JADE experiment: tapes saved weeks before destruction, and an expert was still available)
SERENDIPITOUS discovery showing that the weak force grows less with shorter distance
Raw data 3200 TB
raw -> calibrated -> skimmed -> high-level objects -> physics analysis -> results [see the sketch after these notes]
must store semantic knowledge
involve grey literature and oral tradition
MUST reuse data after experiment is stopped
re-usable by other micro-domains
alliance for permanent access
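[PMR: to make the preservation point concrete for myself, here is a tiny illustrative sketch – mine, not CERN’s or SCOAP3’s, with made-up names and numbers – of the reduction chain in the notes above, where each derived dataset keeps a pointer back to its provenance so it can still be understood after the experiment stops.]

```python
# Illustrative only (not CERN/SCOAP3 code): model the reduction chain
# raw -> calibrated -> skimmed -> high-level objects -> physics analysis -> results
# so that a derived dataset can always be traced back to the raw data.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dataset:
    name: str
    stage: str                          # one of the reduction stages above
    derived_from: Optional["Dataset"] = None
    notes: str = ""                     # calibration versions, "oral tradition", etc.

    def lineage(self) -> list:
        """Walk back to the raw data so a future re-analysis knows what it has."""
        d, chain = self, []
        while d is not None:
            chain.append(f"{d.stage}: {d.name}")
            d = d.derived_from
        return list(reversed(chain))

raw = Dataset("run-2008-001.raw", "raw", notes="part of ~3200 TB of raw data")
calibrated = Dataset("run-2008-001.calib", "calibrated", derived_from=raw,
                     notes="calibration constants v12")
print(calibrated.lineage())   # ['raw: run-2008-001.raw', 'calibrated: run-2008-001.calib']
```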
PMR: I have missed the first part because the battery crashed. But the overall impression is that SCOAP3 will reach beyond physics just as arXiv does. It may rival Wellcome in its impact on Open Access publishing. SCOAP3 has the critical mass of community, probably finance, and it certainly has the will to succeed. Successes tend to breed successes.
… more notes will come at random intervals …


Richard Poynder Interview

I was very privileged to have been invited to talk to Richard Poynder at length in a phone interview. http://poynder.blogspot.com/2008/01/open-access-interviews-peter-murray.html.
I am impressed with the effort that Richard put in – it is a real labour of love. We’ve not met IRL and hope to do so some day.
Interviewers of this quality provide very useful checkpoints – in going over some of the points I was able to realise what might be consistent and contradictory. And there is an objectivity which an individual cannot create for themselves.
So many thanks Richard.


APE 2008

I’m off to the APE meeting in Berlin: APE 2008 “Quality and Publishing”, which asks some questions:

  • What do we really know about publishing?
  • Is ‘Open Access’ a never-ending story?
  • Will there be a battle between for-profit and not-for-profit publishing, and who will be the survivors?
  • Which is the best peer review system in the public interest?
  • What does impact mean in times of the Internet?
  • What are the plans of the European Commission for digital libraries, access and dissemination of information?
  • Will libraries become university presses or repositories?
  • How efficient is ‘OA’ in terms of information delivery?
  • What are the full costs of information?
  • Business models versus subsidies?
  • What is the future role of books and reference works?
  • How important are local languages?
  • Which kind of search engines do we all need?
  • What about non-text and multimedia publications?
  • Which models for bundling and pricing will be accepted?
  • What makes publications so different?
  • Why are some journals in a defined subject field much more successful than other journals?
  • How important is the role of editors and editorial boards?
  • What education and training is required?
  • What skills are needed?
  • Barrier-free information: do we provide sufficient access for the visually impaired?

I often sit at the back and blog so maybe I’ll give some answers. OTOH the hotel offers Internet for a price of 10 EUR/hour so maybe I won’t be able to post anything. (From what I can see Germany is one of the worst countries for charging for casual internet time – can’t we initiate some “Open Access”?)


From Peter Suber: More on the NIH OA mandate

Many points but I pick one:

 

Jocelyn Kaiser, Uncle Sam’s Biomedical Archive Wants Your Papers, Science Magazine, January 18, 2008 (accessible only to subscribers).  Excerpt:

If you have a grant from the U.S. National Institutes of Health (NIH), you will soon be required to take some steps to make the results public. Last week, NIH informed its grantees that, to comply with a new law, they must begin sending copies of their accepted, peer-reviewed manuscripts to NIH for posting in a free online archive. Failure to do so could delay a grant or jeopardize current research funding, NIH warns….
[…]
Scientists who have been sending their papers to PMC say the process is relatively easy, but keeping track of each journal’s copyright policy is not….

PMR: Exactly. It should be trivial to find out what a journal’s policy is – as easy as reading an Open Source licence. An enormous amount of human effort is wasted – by authors and repositarians – on repeatedly trying (and often failing) to get this conceptually simple information.

 

I’ve been doing articles and interviews on OA and Open Data recently and one thing that becomes ever clearer is that we need licences or other tools. Labelling with “open access” doesn’t work.

 


XML, Fortran and Mr Fox at NESC

Toby White (“Fantastic Mr Fox”) has developed a superb system for enabling FORTRAN programs to emit XML in general and CML specifically. He and colleagues are presenting this at Edinburgh as part of the NESC programme (a small illustrative sketch of the kind of output follows the announcement):

Integrating Fortran and XML
28 January, 08 01:00 PM – 30 January, 08 01:00 PM
e-Science Institute, 15 South College Street, Edinburgh
Organiser: Toby White
eScience technologies offer great hope for massive improvements in the quality and quantity of science that we are able to do, particularly in the domains of data management and information delivery. Many of our escience tools rely on XML and related technologies. However, an enormous number of scientific codes are written in Fortran, and many scientists do much of their work using Fortran. Unfortunately, Fortran knows very little about XML, and vice versa; thus many useful scientific codes are de facto excluded from the world of escience. However, there is an increasing number of tools being made available to integrate Fortran into an XML-aware world, and there is a large body of knowledge and lessons learned on sensible strategies for making use of existing scientific codebases in escientific ways. This workshop will aim to instruct participants in the uses of several Fortran-XML tools, and to transfer practical experience about successes in this area. It is expected that participants will then be able to extend their existing Fortran codes to both write and read XML files, and otherwise manipulate XML data.
Target Audience
Scientific programmers from eScience projects who need to integrate existing or legacy Fortran codes into modern data handling infrastructures.
Delegates should have some experience in programming Fortran and of working in a Linux/Unix environment. No prior knowledge of XML is required.
Programme
This event is provisionally scheduled to start at 13:00 Monday 28 January 2008 and close at 13:00 on Wednesday 30 January 2008.
A programme is available at: http://www.nesc.ac.uk/esi/events/841/Fortran XML timetable.pdf
Speakers
Toby White <tow21@cam.ac.uk>, Cambridge
Andrew Walker <awal05@esc.cam.ac.uk>, Cambridge
Dan Wilson <wilson@kristall.uni-frankfurt.de>, JWG-University, Frankfurt, Germany
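PMR: to give a flavour of the target, here is a minimal sketch – in Python rather than Fortran, purely illustrative, with made-up coordinates – of the kind of CML fragment a simulation code might emit; the point of the workshop tools is to let legacy Fortran codes write equivalent XML directly.

```python
# A sketch of the sort of CML a simulation code might emit for one molecule.
# Python is used here only to show the target XML; the workshop is about
# producing this from Fortran. Coordinates and ids are invented.
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("", CML_NS)

molecule = ET.Element(f"{{{CML_NS}}}molecule", id="m1")
atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
for atom_id, element, x, y, z in [("a1", "O", 0.000, 0.000, 0.117),
                                  ("a2", "H", 0.000, 0.757, -0.467),
                                  ("a3", "H", 0.000, -0.757, -0.467)]:
    ET.SubElement(atom_array, f"{{{CML_NS}}}atom", id=atom_id,
                  elementType=element, x3=str(x), y3=str(y), z3=str(z))

print(ET.tostring(molecule, encoding="unicode"))
```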


Science 2.0

Bill Hooker points to an initiative by Scientific American to help collaborative science: Mitch Waldrop on Science 2.0

I’m way behind on this, but anyway: a while back, writer Mitch Waldrop interviewed me and a whole bunch of other people interested in (what I usually call) Open Science, for an upcoming article in Scientific American. A draft of the article is now available for reading, but even better — in a wholly subject matter appropriate twist, it’s also available for input from readers. Quoth Mitch:

Welcome to a Scientific American experiment in “networked journalism,” in which readers – you – get to collaborate with the author to give a story its final form. The article, below, is a particularly apt candidate for such an experiment: it’s my feature story on “Science 2.0,” which describes how researchers are beginning to harness wikis, blogs and other Web 2.0 technologies as a potentially transformative way of doing science. The draft article appears here, several months in advance of its print publication, and we are inviting you to comment on it. Your inputs will influence the article’s content, reporting, perhaps even its point of view.

PMR: It’s a reasonably balanced article, touching many of the efforts mentioned in this blog. It’s under no illusions that this won’t be easy. I’ve just finished doing an interview where at the end I was asked what we would be like in 5 years’ time, and I was rather pessimistic that the current metrics-based dystopia would persist and even get worse (the UK has increased its efforts on metrics-based assessment, in which case almost any innovation, almost by definition, is discouraged). But on the other hand I think the vitality of “2.0” in so many areas may provide unstoppable disruption.


Update, Open Data

I have been distracted by the real world (in some cases to good effect). A lot of progress on CML, Wikipedia, chemical language processing, etc. We’ve also had a WordPress upgrade which, until it happened, stopped me re-opening the CML blog. I shall be at the European meeting on Open Access in Berlin next week and will hope to blog some of that.
Some recent highlights:

  • I’ve done a long interview on Open Data which should be public fairly soon
  • I converted the Serials Review article into Word and that has now been submitted. I have also submitted it to Nature Precedings and that should be available in a day or so.
  • I have finalised the proofs for the Nature “Horizons” article (whose preview is on Nature Precedings). The house style seems to be to remove all names from the text and further reading, and I am not allowed to acknowledge people by name. This makes the article read in a very egocentric style which does not reflect the communal nature of the exercise. It appears in early Feb.

Could Wikia be used for chemistry?

This may be miles off beam, but the following from Peter Suber’s blog caught my eye: Wikia launches

Today Jimmy Wales launched an alpha version of Wikia, the search engine to be built openly and wiki-like by users.  From the about page:

Wikia is working to develop and popularize a freely licensed (open source) search engine. What you see here is our first alpha release.
We are aware that the quality of the search results is low.
Wikia’s search engine concept is that of trusted user feedback from a community of users acting together in an open, transparent, public way. Of course, before we start, we have no user feedback data. So the results are pretty bad. But we expect them to improve rapidly in coming weeks, so please bookmark the site and return often.
Right now, the most important thing you can do is help with the “miniarticles” that appear at the top of popular search terms. These will vary in purpose according to the circumstance, but the primary uses will be:

  • Short definitions
  • Disambiguations
  • Photos
  • See also

At the bottom of every page is a link to “Post bug reports here”… please use that link liberally to give us large amounts of feedback.
I believe that search is a fundamental part of the infrastructure of the Internet, and that it can and should therefore be done in an open, objective, accountable way. This site, which we have been working on for a long time now, represents the first draft of the future of search….

PS:  I believe the Wikia project was first announced in December 2006.

PMR: I haven’t looked at the page so this may be rubbish. But if we have a web search engine we can customise, can’t we adapt it to do InChI properly? It would read pages, scan for chemical names and then specifically index them.
I’ll go and have a look.
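PMR: to make the idea concrete, here is a toy sketch – mine, and nothing to do with Wikia’s actual machinery; the URL and the tiny name list are placeholders – of “read pages, scan for chemical names and InChIs, and index them”:

```python
# Toy sketch of chemistry-aware indexing, not Wikia's API: fetch a page,
# pull out anything that looks like an InChI, spot a few known names,
# and build a term -> pages index that a search engine could use.
import re
from collections import defaultdict
from urllib.request import urlopen

INCHI_RE = re.compile(r"InChI=1S?/[A-Za-z0-9./;+\-()*?,]+")
KNOWN_NAMES = {"benzene", "caffeine", "aspirin"}   # stand-in for a real lexicon

index = defaultdict(set)                           # term -> set of page URLs

def index_page(url: str) -> None:
    html = urlopen(url).read().decode("utf-8", errors="replace")
    for inchi in INCHI_RE.findall(html):
        index[inchi].add(url)
    for name in KNOWN_NAMES:
        if re.search(rf"\b{name}\b", html, re.IGNORECASE):
            index[name].add(url)

# index_page("https://en.wikipedia.org/wiki/Caffeine")   # example use (needs network)
# print(sorted(index))
```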


Community involvement in information capture and extraction

There has been a large increase in the number of people and organisations interested in extracting or capturing chemical information from the public domain. This is typified by the ongoing discussions between individuals and organisations – here’s a comment on this blog from Antony Williams – Chemspiderman – who has been working very hard to develop approaches towards Open Data (comment to Open Data in Science):

I’m in the middle of curating all chemical structures on Wikipedia. I spent a couple of hours discussing it with Martin Walker last night. The process involves a lot of manual work…I’m at over 150 hours right now. There are issues with chemical names not matching the structure diagrams (people can use nomenclature very poorly!) so this will be an ongoing issue for ANYBODY using name to structure conversion. However, there are many names agreeing with the chemical structure. Have you thought about applying OSCAR to Wikipedia to generate a real structure file? You can then add that into the WWMM and hook up to Wikipedia. If you wait a while I’ll have one done and will hopefully be able to get Wikipedia to accept InChIKeys on the structures directly and therefore make Wikipedia searchable by InChIKey. I’ll blog about this soon but have other deadlines in the way at present. I have just co-authored a book chapter on name to structure conversion and talked about OSCAR-3 but couldn’t comment too much on capabilities. I can add it in in proofing. Here are 10 names of structures on Wikipedia…they are correct for the structures. You commented “If the names can be interpreted or looked up then OSCAR does a good job.” How well does OSCAR do on this set of 10? If you want to post the InChI strings I’ll check the structures and let you know…

We are very grateful for this work. We are also doing similar things and we’d be delighted to coordinate – I have also been mailing Martin and booked an IRC with him and WP-CHEM colleagues asap.
As Antony says, there is a lot of hard work. The good news about social computing – of the sort he and we have been fostering – is that in principle it can scale. The difficulty is that technically hard projects are hard to run this way – and this is a technically hard project. The reason is that it is not about certainty – what is the formula of “snow”? – but requires evaluation of assertions (X says the formula of A is C30H22O5; Y says the formula of A is C32H24O5). A small sketch of that evaluation step follows.
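Here is a minimal sketch (invented data; my illustration, not a production tool) of the comparison: collect the assertions, then flag the compounds where sources disagree.

```python
# Illustrative only: gather "source says the formula of compound is ..." assertions
# and flag the compounds where sources disagree - the step a plain lookup cannot do.
from collections import defaultdict

assertions = [
    ("X", "compound-A", "C30H22O5"),
    ("Y", "compound-A", "C32H24O5"),
    ("X", "compound-B", "C6H6"),
    ("Z", "compound-B", "C6H6"),
]

claims = defaultdict(dict)              # compound -> {source: formula}
for source, compound, formula in assertions:
    claims[compound][source] = formula

for compound, by_source in claims.items():
    if len(set(by_source.values())) > 1:
        print(f"CONFLICT for {compound}: {by_source}")
    else:
        print(f"agreed for {compound}: {next(iter(by_source.values()))}")
```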
There is an awful lot of grunt work. First we have to get the data. For Wikipedia this has been done manually, but I am looking at whether data can be extracted from other sources and fed in automatically. There are at least 1000 common compounds that “should” be in WP. There’s the problem of rights – I think we are getting to the stage where the resistance to mining data from chemistry text will weaken. Then we have to deal with the syntax. PDF is still a major hurdle. Can we use images? (I’ll post about that later.) My work over the holiday has shown that extraction from web pages is still fragile, but we can get a lot. (E.g. does anyone have a parser for ALL inline chemical formulae – e.g. C(CH3)2=(CH2)2COC(CH3)2Cl.2H2O? JUMBO does a so-so job. If anyone can do better that would be very useful – a sketch of the easy cases follows.)
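To show what I mean by the easy cases, here is a deliberately small sketch – element symbols, counts, nested parentheses and “.nH2O”-style hydrates only; it knows nothing about “=” bonds, charges or isotopes, so it would reject the awkward example above rather than misparse it.

```python
# Minimal inline-formula atom counter: elements, counts, nested parentheses
# and a ".nH2O"-style hydrate part. Anything else raises an error.
import re
from collections import Counter

TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")

def count_atoms(formula: str) -> Counter:
    total = Counter()
    for part in formula.split("."):                  # "CuSO4.5H2O" -> two parts
        mult, body = re.match(r"(\d*)(.*)", part).groups()
        for element, n in _parse(body).items():
            total[element] += n * int(mult or 1)
    return total

def _parse(s: str) -> Counter:
    stack = [Counter()]
    i = 0
    while i < len(s):
        m = TOKEN.match(s, i)
        if not m:
            raise ValueError(f"cannot parse at: {s[i:]!r}")
        element, count, open_paren, close_paren, group_count = m.groups()
        if element:
            stack[-1][element] += int(count or 1)
        elif open_paren:
            stack.append(Counter())
        else:                                        # closing parenthesis
            group = stack.pop()
            for el, n in group.items():
                stack[-1][el] += n * int(group_count or 1)
        i = m.end()
    return stack[0]

print(count_atoms("CuSO4.5H2O"))   # Counter({'H': 10, 'O': 9, 'Cu': 1, 'S': 1})
```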
Then, when the data have all been extracted, those from different sources can be compared. This often shows real errors. Here we absolutely need reasoning tools like RDF: they will highlight inconsistencies of the type above, but they can’t resolve them. Can we develop heuristics, including probability? Recommender systems – A has fewer inconsistencies per entry than B, so we weight A higher (sketched crudely below).
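Crudely, and with made-up numbers, such a weighting heuristic might look like this:

```python
# Sketch of a recommender-style heuristic: weight each source by how often its
# entries have been found inconsistent, then prefer the value from the most
# trusted source when assertions conflict. The statistics are invented.
source_stats = {"A": {"entries": 1000, "inconsistent": 12},
                "B": {"entries": 800,  "inconsistent": 64}}

def weight(source: str) -> float:
    s = source_stats[source]
    return 1.0 - s["inconsistent"] / s["entries"]

def prefer(claims: dict) -> str:
    """claims maps source -> asserted value; return the value from the most trusted source."""
    return claims[max(claims, key=weight)]

print(prefer({"A": "C30H22O5", "B": "C32H24O5"}))   # -> C30H22O5
```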
I shall respond to the technical questions on images and names in separate posts. I make it very clear that this is research – not a production system. There may be cases where precision and recall run at 20%. This is not a failure, it’s a starting point. Some of this is skunkworks – and I am reluctant to involve the community in skunkworks. It takes time before we can reasonably let development code loose on Sourceforge.
Part of the point is to encourage authors and publishers to deposit semantic data as well as text. If all papers had InChIs (with compound numbers) then we probably wouldn’t have to extract stuff from images. (There are still compounds which can only be represented graphically.) Similarly, if all chemists published NMR spectra with molecular structures and assignments, we wouldn’t have to do any of this. All this is technically possible already.
There are many areas where the community can help. Chemical nomenclature is one. Part of the low recall for OPSIN is that the compounds aren’t in the vocabulary. It’s not part of our research, but it’s relatively easy for anyone to add these – and once added they are done. I’m guessing that we could double the precision/recall by this method. But I’ll comment on this in detail later.
So I think we shall see a valuable increase in distributed Open chemical information projects this year. It’s difficult to get funding – but not impossible – and we are hopeful. One important activity is workshops.
More later (including comments on OPSIN, OSRA, etc.).


CMLBlog: Sourceforge resources

[This is the first of a continuing series of posts destined for the revitalised CMLBlog.]
The major developers’ resource for CML is at Sourceforge. This is the traditional page which each project has, and it offers several useful features:
[screenshot: cml.PNG]
There has been a fairly steady set of releases, with relatively little drift in APIs.
[screenshot: cml1.PNG]
There is one mailing list which has fairly low traffic – mainly requests for technical information. I hope that the blog will be a better format for general discussion.
The SVN repository is the primary resource besides the downloads. For those who are familiar with Subversion (use TortoiseSVN on Windows), the natural way is to SVNUpdate from time to time. For casual browsing, the SVN repository serves HTML pages which are very well set out and a good way to find things if you know what you want.
There are about 10-15 active developers – a very few commit large amounts; most others offer patches, bug fixes and unit tests.
