berlin5 : Maxine Clarke

Maxine is presenting Nature’s practice and philosophy on data. (Hope I capture this OK – there is a lot of material.) In the early 1990s Nature introduced Supplementary Info (SI); by 2007 it had fully integrated online methods. SI is largely free access and unedited.
All policies are common.
Authors should retain all original data and analyses.
Central website for Nature’s policies – includes “availability of data and materials”, with reasons for the policies
[BIND – bioscience databases – has been sold to an informatics startup – shows problems of trying to keep data Open]
Challenges:
data submission for large datasets
finding relevant experts
image manipulation
post publication
access to data and analysis
timely editorial responses
funders
incentivize data sharing
institutions must be more involved
And in questions Maxine suggests a liberal approach to images.
[Nature chemistry is coming next year, so that could be very exciting].
And … for readers who read this blog for the human interest Maxine and I are on very good terms…

Posted in berlin5, open issues | Leave a comment

berlin5 : what did I say?

I am very grateful to Berlin5 and the ESF (European Science Foundation) for inviting me to speak on “Open Data”. In giving talks like this I don’t prepare a linear set of PowerPoint slides (which I despise technically and philosophically) but use a large set of HTML resources, including many active web pages. All told I think there are several thousand slides, some of which I have written, some scraped from elsewhere, plus efforts to capture some of the dynamic ones.
So it’s very exciting when a sponsor agrees to capture this on video. This happened at OIA4 (2005), Google (2006), and Caltech (2007). This captures some of the displays that cannot fit in a machine. It also means that I can speak to a wider audience – possibly including university administrators who have a large part in policy but were not represented at the meeting.
I had ca 27 minutes to speak and my style is to select those slides which I think are most relevant at the time. In some ways it’s a performance, not a lecture. I have a menu which leads to submenus, some of which I might not have seen for some time and which prompt me to say something.
The presentation was heavily influenced by Ilaria’s account of the absolute necessity to share genetic information about disease and the opposition she met when she made it public. I have three levels of polemic (GREEN, AMBER, RED) and had intended to be at AMBER – occasionally prodding various sectors or organizations. But, after hearing Ilaria, I upped it to RED – full-scale rant. Rant against the scientific publication community for its opposition to the free spread of information which is vital to the human race, for its lack of vision in the positive power of technology, for the overhanging cloud of FUD engendered by copyright and access controls. And a milder rant against the scholarly information community for not being braver in challenging the nonsense of copyright on scientific data. You’ve got to get up and speak. And your vice-chancellors and provosts.
I forget what I urged in detail but it’s mainly in my blog. Theses. Clear Open licences. Positive permissions, rather than implicit fuzz (“PLEASE take our data and use it!”). Brief mention of the need for live semantic data.
Very simply, if we wish to save humanity we must make our data Open and positively share it. Otherwise we shall be spread-betting whether we are doomed by Asian ducks or melting penguins.
I look forward to seeing the video.
[NOTE: I asked how many in the audience knew the Keeling curve. Only 2/100 did… We have a little way to go.]

Posted in berlin5, Uncategorized | 2 Comments

berlin5 : Ilaria Capua's bravery

A stunning presentation from Ilaria Capua on the necessity of releasing sequence information relating to avian flu. There’s lots of coverage on the web – here’s one and a snippet:

After Capua took over, IZSVe became Italy’s reference lab for bird flu, testing samples from all over the country. In 2002, OIE asked Capua if IZSVe could serve as one of its global reference labs as well; FAO asked in 2004. As a result, the institute has received a steady stream of samples from H5N1-affected countries, primarily in the Middle East and Africa.
It was because she was at the hub of this research that Capua became aware of the lapse in data sharing. Her discomfort began in February, when WHO asked her to deposit the sequence of a sample from Nigeria, the first African country affected, in a closed-off compartment of a flu database at Los Alamos National Laboratory in New Mexico, to which fewer than 20 labs have access. If she shared her sequence, WHO scientists said, she would have access to the rest of the hidden Los Alamos data.
“I’m very brave. I’m often ahead of others in thinking about important issues.” –Ilaria Capua
Capua refused and instead deposited her sequences in GenBank for the entire world to see. At the same time, in a message on ProMED, an e-mail list for emerging infectious diseases, she asked her colleagues to follow suit (her posting won ProMED’s annual award in August); she also asked Science to investigate (3 March, p. 1224).
WHO defended the closed database on the grounds that H5N1-affected countries often don’t want reference labs in the developed world to publish information about the strains circulating within their borders. But Giovanni Cattoli, the director of research and development in Capua’s lab, says that “is simply not our experience,” noting that of the 15 countries the Capua team has dealt with, 14 said sharing data was “fine.” As to scientists’ worries that they might be scooped if they post their sequences in real time, Capua says: “What is more important? Another paper for Ilaria Capua’s team or addressing a major health threat? Let’s get our priorities straight.”

Simply: established bureaucratic processes had the key data locked up in dusty databases that no one was using. Ilaria insisted that the data should be available to all and “just did it”. A storm of outrage followed, but also growing support, and now her approach and vision are accepted.
This set the scene for my presentation and put me in a polemic mood… more later.

Posted in berlin5, open issues | 1 Comment

berlin5 : how to progress Open Data?

I’m putting together some ideas for my talk tomorrow – probably about 25-30 minutes. It’s sometimes useful to set them out in the blog beforehand so I can refer to it as well as the slides.
The audience is roughly:

  • funders
  • librarians
  • publishers (reader-pays and author-pays)
  • governmental and non-governmental agencies
  • researchers (like me)

For background I’m making the following broad-brush assumptions (and would welcome challenges):

  • green access does not suit a lot of people and there is an increasing movement to insist on gold
  • awareness of the importance of Data is increasing but it is still a poor relation to “full-text”
  • although there is quite a lot of activity in institutional digital repositories they won’t (and shouldn’t) address Data. It’s subject-specific and too complex for the average repository manager.
  • eTheses have an increasing importance
  • BBB (BOAI) is a useful political and philosophic utterance but it isn’t a licence. Licences are critical.
  • funders increasingly understand the issues and that they are the most important agents of change

For the last 10 years we have stood still – the eJournal “revolution” has been stultifyingly stagnant. No new ideas for managing information, no new tools for innovation by authors and readers. And the open access publishers have concentrated so hard on the business model they have simply mimicked the commercial offerings. If the rest of the world had been as bad as this we wouldn’t have Google, Flickr, etc. Their mantras – “Take risks and apologize later”. “Just Do it”.
I’ll try not to concentrate on what is broken – at least not in detail. You can re-read this blog. What’s broken is:

  • publishers oppose change
  • licences are non-existent or awful. Publishers do not clarify them, do not reply, create FUD.
  • librarians have been cowed into submission
  • the god of copyright is worshipped to paralysis
  • young people are frightened of experimenting; old people are largely dismissive and antagonistic.

So my suggestions for positive action on Data:

  • all funders should include statements requiring Open Data.
  • subject repositories should be set up
  • CC licences – or similar – should be required to define the actual practice
  • eThesis deposition must be mandatory
  • Open tools will be required and must be funded
  • we must create effective advocacy for all parties: funders, provosts, researchers, repository managers

Advocacy will include self-sustaining demonstrators showing:

  1. short-term archival (“I can get her thesis data”)
  2. re-use (“I can mash his data with mine”)
  3. exposure (“they cited my paper because the robot found my data”)
  4. communities (“I found these other people in this field”)
  5. semantics (“I never thought of looking at the data in that way”)
  6. human value (“we can tackle this global problem with this data”)

NOTE: I am publishing this before the presentation so that (a) I can link to it and (b) anyone who wants to can suggest modifications.

Posted in berlin5, open issues | 2 Comments

berlin5 : The laws of robotics; request for drafting

I have been asked about what we need for robotic access to publishers’ sites. Several publishers are starting to allow robotic access to their Open material. (Of course the full BBB declarations logically require this, but in practice many publishers haven’t made the connection.) So let’s assume a publisher who espouses Open Access and allows robotic access to their site. Is, say, a CC licence enough?
There are no moral problems with CC, but the use of robots has additional technical problems, even when everyone agrees they want it to happen. There’s a voluntary convention, robots.txt, which suggests how robots should behave on a website. It’s been around since the web started, and there is no software enforcement. In essence it says:

  • I welcome X, Y, Z
  • I don’t welcome A, B, C
  • Feel free to visit pages under /foo, /bar
  • Please don’t visit /plugh, /xyzzy

From the WP article:

This example allows all robots to visit all files because the wildcard “*” specifies all robots:

User-agent: *
Disallow:

This example keeps all robots out:

User-agent: *
Disallow: /

The next is an example that tells all crawlers not to enter into four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

Example that tells a specific crawler not to enter one specific directory:

User-agent: BadBot
Disallow: /private/

Example that tells all crawlers not to enter one specific file:

User-agent: *
Disallow: /directory/file.html

There’s another dimension. Even if the robots go where they are allowed, they mustn’t slaughter the server. 100 hits per second isn’t welcome. So some extensions:

Nonstandard extensions

Several crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server: [1] [2]

User-agent: *
Crawl-delay: 10

Extended Standard

An Extended Standard for Robot Exclusion has been proposed, which adds several new directives, such as Visit-time and Request-rate. For example:

User-agent: *
Disallow: /downloads/
Request-rate: 1/5         # maximum rate is one page every 5 seconds
Visit-time: 0600-0845     # only visit between 6:00 AM and 8:45 AM UT (GMT)
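These conventions can be checked programmatically. As a minimal sketch (the bot name and URLs are made up), Python’s standard library ships a robots.txt parser that understands User-agent/Disallow and, in recent versions, the nonstandard Crawl-delay extension:

```python
# Sketch: checking robots.txt rules locally with Python's stdlib parser.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
]

rp = RobotFileParser()
rp.parse(rules)  # normally you'd call rp.set_url(...) and rp.read()

print(rp.can_fetch("MyBot", "http://example.org/papers/1.html"))   # True
print(rp.can_fetch("MyBot", "http://example.org/private/x.html"))  # False
print(rp.crawl_delay("MyBot"))  # 10
```

Any well-behaved mining robot could run exactly this check before each request.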

I can see roughly two types of robotic behaviour:

  1. systematic download for mining or indexing. CrystalEye is in this category – it visits publishers’ sites every day and attempts to be comprehensive (it doesn’t index Wiley or Elsevier because they don’t expose any crystallography). It would be highly desirable to minimise repetitious indexing, and an enthusiastic publisher could put their XML material in a proper repository framework with a RESTful API (rather than requiring HTML screen-scraping or PDF-hack-and-swear). In return there could be a list of acknowledged robots so that these could act as “proxies” or caches.
  2. random access from links in abstracts or citations. This is likely to happen when the bot is in PMC/UKPMC, or CrystalEye, and discovers an interesting abstract and goes to the full-text on a publisher’s site. The bot may have been created by an individual researcher for a single one-time purpose.

So I’d like to come up with (three?) laws of mining robotics. Here’s a first shot:

  • A publisher should display clear protocols for robots, with explanations of any restrictions and lists of any regular mining bots.
  • A data-miner should use software that is capable of honouring machine-understandable guidance from servers. The robots should be prepared to use secondary sites.
  • Mining software should be Open Source and should honour a common set of public protocols.
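To make the second law concrete, here is a minimal sketch of a miner that honours the machine-understandable guidance above. The bot name, site and fetch function are hypothetical, and real code would also need error handling and Request-rate support:

```python
# Sketch of a "law-abiding" mining robot: check each URL against the site's
# robots.txt rules before fetching, and pause for any advertised Crawl-delay.
import time
from urllib.robotparser import RobotFileParser

BOT = "ExampleMiningBot"  # hypothetical bot name

def polite_fetch(rp, base, paths, fetch, sleep=time.sleep):
    """Fetch only the permitted paths, pausing between requests."""
    delay = rp.crawl_delay(BOT) or 1   # default to 1 s if none advertised
    pages = {}
    for p in paths:
        url = base + p
        if not rp.can_fetch(BOT, url):
            continue                   # honour Disallow rules
        pages[url] = fetch(url)
        sleep(delay)                   # never slaughter the server
    return pages

# Usage with a stubbed fetcher and sleeper (no network, no waiting):
rp = RobotFileParser()
rp.parse("User-agent: *\nDisallow: /private/\nCrawl-delay: 2".splitlines())
got = polite_fetch(rp, "http://example.org", ["/a.html", "/private/b.html"],
                   fetch=lambda u: "<html/>", sleep=lambda s: None)
print(sorted(got))  # ['http://example.org/a.html']
```

Injecting the fetch and sleep functions keeps the politeness logic testable without touching a live server.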

But I would like suggestions from people who have been through this…

Posted in berlin5, open issues | Leave a comment

berlin5 : Chris Armbruster; green ==> gold

Chris argues that we must move rapidly to Gold OA (full OA) rather than Green (self-archiving). The community increasingly requires archival of the final version (e.g. for RAE).
Why should OA publishers (like BMC) mimic conventional ones like Elsevier and try to do all the processes of publication (from registration to archiving)?
The first-copy cost (for OA publishers) is 5 USD (sic). The rest of the 1500–3500 USD is up for grabs…
Functions of publishing:

  • registration
  • certification (peer review)
  • dissemination
  • archiving
  • navigation (overlay services)

Non-exclusive licences and repositories can provide:

  • repos and DLs can take care of registration as well as dissemination
  • publishers provide certification (peer review) and navigation
  • markets for certification and navigation become competitive through standard non-exclusive licensing.

Final copy MUST come back to the authors and their institution – if the publisher provides additional essential value it may have to be bought back (at a guaranteed price).
Overlays can provide:

  • usage impact
  • citation and co-citations
  • data and text mining
Posted in berlin5, open issues | Leave a comment

berlin5 : SCOAP

Salvatore Mele (CERN) on SCOAP (the new CERN OA publishing model). HEP (high energy physics) is decades ahead in OA thinking

  • has been circulating preprints for 40 years
  • arXiv 1991
  • first peer-reviewed electronic journals
  • small community 20,000

and journals are only needed for non-communication reasons. The community

  • need journals as interface to officialdom
  • purchases subscriptions but actually reads arXiv
  • journal format is anachronistic

The authors, not librarians, drive the process.
Hybrid model has negligible success – why should I pay for what I can read anyway?
So, sponsoring model (institutions) or institutional membership (e.g. JHEP and JINST)
So SCOAP3: 5000 articles in a community of 20,000. Tripartite – funders, publishers, authors. “A consortium sponsors HEP pubs and makes them OA by redirecting subscription money”. The consortium pays for peer review – articles are free to everyone.
Sponsoring and inst. membership on a world wide scale.
Most publishers are expected to enter negotiations if SCOAP is seen as stable. SCOAP open to all HEP journals in principle.
Will cost 10 M EUR / year (cost of experiment is 400 M EUR…) [PMR: not sure what experiment].
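A back-of-envelope check of those figures (taking the ~5000 articles per year mentioned above):

```python
# Rough per-article cost implied by the figures above: a 10 M EUR/year
# consortium budget spread over ~5000 HEP articles.
annual_budget_eur = 10_000_000
articles_per_year = 5000
print(annual_budget_eur // articles_per_year)  # 2000 EUR per article
```

That sits comfortably within the 1500–3500 range quoted earlier for per-article publication costs.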
[…Lots of material on cost of publication omitted…]
GOAL: have SCOAP ready for first set of LHC results.
25% of funds have been pledged already.

Posted in berlin5, open issues | 1 Comment

berlin5 : Monetizing informatics – a fantasy

I want to float an idea tomorrow – I don’t know whether it’s mad, but at least if we float it in the blogosphere I’ll find out if others have similar thoughts. In essence we need a method of rewarding people – in monetary terms – for making information free. (At present we reward publishers for not making information free.) Since money is about the most powerful incentive (other than jail) can we devise a radically new approach? I was struck by one idea that came up at scifoo – “fantasy journals”. (I have forgotten who it came through.)
The idea comes from fantasy football (== soccer in most of the world) – you choose a team of players, managers, etc. that you think would do better than other teams. The teams are virtual and their success is an integration of the individual successes of the players in real life. So a fantasist would choose a set of papers – or authors – that they thought would do “well” – of course that is tricky – but it would create a speculative market in the value of science.
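As a toy illustration of how such a fantasy market might score a team – the paper identifiers and citation counts here are entirely invented:

```python
# Toy sketch of the "fantasy journals" idea: each fantasist picks a team of
# papers, and the team's score integrates the real-world performance
# (here, hypothetical citation counts) of its members.
def team_score(team, citations):
    """Sum the performance of each picked paper; unknown papers score 0."""
    return sum(citations.get(paper, 0) for paper in team)

citations = {"doi:10.x/aaa": 12, "doi:10.x/bbb": 3, "doi:10.x/ccc": 40}
my_team = ["doi:10.x/aaa", "doi:10.x/ccc"]
print(team_score(my_team, citations))  # 52
```

Any real-life metric (citations, downloads, re-use of data) could plug into the same scoring function.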
In a similar vein there is already a fantasy market in blogs (from ScienceLibraryPad):
21:28 18/09/2007, Richard Akerman, Science Library Pad
But this press release I ran across from June amused me anyway.

“Science Library Pad was the subject of much speculation when analysts at several firms were heard to be very positive about its recent performance. Its share price rose from B$176.83 to B$245.80. Much of the hype was said to originate from M Melville whose AACR2 (artefact) was said to be involved….”

PMR: so there is already a market in social computing. I assume this is fantasy, not real, but who knows. If you can bet on something, people will…
And in a similar vein from Deepak:

Historically, centralized data repositories like the NCBI, EBI, PDB, etc have been sources of data, but have also provided the most commonly used search interfaces and web services that people use to access that data. A number of services built on local copies of the data have been developed, often for internal use at companies (and I’ve been part of some fantastic ones), and while APIs are available, the trend to provide documented, usable APIs pervasive in the tech world these days is not quite the norm in the life sciences. Assuming that we have excellent public data repositories, with rich APIs and data structures, it would be nice if a mix of application developers, designers and data geeks could start developing visual experiences and web services that enhance the utility of these sites. Unfortunately, as Neil’s and Hari’s experiences have shown, that is simply not the case.
In my own experience, from conferences, etc, it is clear that the world of bioinformatics (all life science informatics actually) faces a major problem. One where too much time is spent moving data back and forth and in formatting/reformatting and just in work that I would call “grunt work”. A decade ago that might have been somewhat acceptable, as the field was still young, but not when bioinformatics becomes a core part of research. It is critical that various biological resources need to do a better job of allowing their customers (and I use the word deliberately) to be more effective using their resources. One of the best comments about Pipeline Pilot came from the head of informatics at a pharma company. He said that using it had made it possible for his informaticians to focus on developing new methods and deploying them to other scientists, since Pipeline Pilot did such a good job of gluing things together. We need to make this process even more simple, and allow the Neil’s of the world to focus on data analysis, software development and methodology and not data munging.
Let me take this thought one step further. I believe that there is a business model to be explored here as well. Philosophically, I believe that knowledge lies in what can be done with data, rather than the data itself. If everyone has equal access to the data, monetizing processes that generate useful information from the data is perfectly fair and square. The one caveat, and perhaps someone can share their thoughts on this, is whether the data producers should be compensated somehow, or is that addressed by the funding, etc they get? Alternatively, data produces are well placed to develop services on top of the data as they have intimate scientific knowledge. And I am not just talking about the AJAX-ification of genome browsers. It is a well known fact that Google and others have built their empire on top of open source software. Others have leveraged services and APIs to provide useful services, e.g. Lijit uses Google Custom search and one of the genome browsers mentioned above uses the Google maps API. Would it be appropriate to take publicly available services, and using them as a backend, develop commercial services? If yes, what are the kinds of businesses that can be built on top of that? What kind of licensing policies would be prevalent? Food for thought and the subject of another post some day.

PMR: I like this sentence:
“If everyone has equal access to the data, monetizing processes that generate useful information from the data is perfectly fair and square. ”
Yes. C21 should be about increasing real value with new products and services, not paying to get grotty C20 data out of jail. The last 10 years have been a total failure for scholarly eInformation. We have gone backwards. The dream of eScience is in ruins in many disciplines and lack of progress is not just zero, it’s negative.
So I want to pursue this and am actively thinking of ways to monetize Open Data. I’d like to hear from others who share the vision. That would open up huge markets in which true competition, not robber barons, would flourish.
So, here’s my wacky idea. The biggest industry in 10 years will be “saving the planet”. It’s already worth 30 billion USD in carbon trading (which to me appears to be fantasy, but people are making lots of money from it). WP suggests it will be 1000 billion USD in a few years. So if we take the axiom:
“Open data and collaborative scholarship are necessary conditions to save the planet”
we could argue that monetizing the process was essential. Given that the EU already has an economic model where farmers are paid not to grow crops but to preserve the countryside, we could argue that publishers might be paid not to ban people from reading “their property”. This would then create a lively market in doing something useful with the data. If the publishers wanted to be in this market they would need to actually do something NEW, or someone will eat their lunch.
Tell me that this is not a fantasy

Posted in berlin5, open issues, Uncategorized | 2 Comments

berlin 5: plenary 1

Random snippets:

  • Fred Friend (original Berlin-1 signatory), now JISC and honorary Univ. College London… Need to go to GOLD OA asap. JISC will not massively fund this in the future (it has done quite a lot so far) but will encourage Research Councils, etc to fund through FullEconomicCosts (FEC). Need a lot of advocacy. Also persuading publishers to use licence to publish rather than copyright.
  • Jens Vigen, CERN. OA culture is FIFTY years old. BUT… researchers don’t get excited about “Operational Circular 6”. Repositories aren’t working – head of physics didn’t know there was one. Capture 0% of theoretical papers, 10% of theses, 90% of experimental papers. Theorists want to submit to arXiv, not CERN. Theses (PMR: my pet love) have a 30% deposit rate if the author is mailed. “Ensuring Green, promoting gold”. Recovering old papers (from 10000+), hunting for theses, promoting OA journals, encouraging OA for conferences, and preparing SCOAP (later talk) – their OA publishing
  • (JV) Top scientists IGNORE LIBRARIANS. Any return MUST be immediate. Authors are glad to deposit theses because of preservation.
  • GIVE THE SCIENTISTS WHAT THEY WANT: TagCrowd shows search, access publication …
  • Institutional and subject repositories go hand in hand. Some publishers are friendly. In order to be discovered publishers have strong interest to feed subject repositories
  • Subbiah Arunachalam (India). Some very good science, some OK, some bad. Many graduates unemployable. Research performance far below potential. Poor libraries (due to cost), dissemination strategy also poor (3500 journals, many not subscribed to in India so work is not known). OA is best solution.
  • SA: Scientists are afraid of publishers. Academies are ready to act, slow to move. No freeCulture.org, So… forging alliance with CC, Cream of Indian Science, CSIR to set up IRs, webometrics research to start. Get students and the Left interested – this will make large and powerful lobby for OA.
  • Hiroya Takeuchi. R&D based on “catching up with the West”. Japan is second only to the US in productivity of publications. But impact is lower. 300 million USD on subscriptions.
  • HT: OA has never been seriously discussed. No signatories on anything. Scientists not interested or aware. BUT IRs are mushrooming – journal price pressure, university PR… BUT researchers effectively have access at present, so not interested, and worry as members of societies about cancellations.

NOTE: Is anyone else blogging this meeting? Because I don’t intend to act as a complete record – I have to write my talk :-). I use the tag “berlin5” – if there are others let’s converge now.

Posted in berlin5, open issues | Leave a comment

berlin 5: Open Data: What am I going to say?

I’m talking tomorrow on “Open Data” at the Berlin 5 conference on Open Access. (See this WP page for most of the terminology.) This is the fifth annual meeting in the series – the first signed the actual declaration (Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities). It’s a mixture of activists, librarians, publishers (mainly OA, but a ?brave delegate from Elsevier is here), NGOs, funders, learned bodies, etc. Some people I have bumped into already:

  • Subbiah Arunachalam (WP). A tireless campaigner for Open Access. I am fixing up to visit India to visit him and other scientists who are interested in Open Access, chemistry, crystallography and Open Data. We hope to involve the EU-Indiagrid project. This is so important it requires separate posts.
  • Alma Swan – who is coordinating the UK RIN project on access to data. Alma led several of us to dinner last night. I ended up sharing too much of a litre of red wine (after the sparkling wine at the reception). (Why do I write this detail – my family only knows what I am doing through my blog…)
  • Kaitlin Thaney (Science Commons). SC is one of the key tools that will be essential (but not sufficient) if we are to liberate data.
  • Susanna Mornati who has put the conference together. Susanna has agreed that we should be able to record my talk. That changes what I say and how I say it.

and there are others I need to meet – Maxine Clarke, who is speaking in the same session and whom I haven’t yet bumped into – we should have a drink together – I am in the second row stage right, Maxine.
So what do I say? I’m very glad this will be recorded as it will reach a wider audience. As I have already blogged, I have ca 10,000 slides and I don’t give a conventional P*w*rP**nt presentation. I’ll leave a lot on my blog, I hope.
And, seriously, until I have heard the other presentations I don’t know what slant to take. What is the role of developing nations? Of funders? Of NGOs – where progress is often slow? Do people understand how important data is, and how poorly, at the moment, it is managed by publishers and Open Access? Do I suggest new practices? [Yes, I shall think of them during the talks.]
I shan’t blog everything – Fred Friend is introducing the great and the good of OA, including some of the original signatories, so I’ll stop here.

Posted in berlin5, open issues | 2 Comments