Talks at Berlin5 on Open Access


Antonella De Robbio has very kindly made available the talks at Berlin 5 Open Access : From Practice to Impact : Consequences of Knowledge Dissemination 19 – 21 September, 2007
They can be viewed starting from the Conference website or from

I am especially grateful since many of my talks involve demonstrations from the web and do not use Powerpoint. My own talk has acceptable audio but is a bit fuzzy on the slides. However I created several blog entries

berlin5 : Open Access to Research Data: surmountable challenges,

berlin5 : how to progress Open Data?

berlin5 : what did I say?

which may help to fill in some gaps.

[Verbal slips - I referred to ACS's description of NIH as "socialist", when the exact term - as on the slide I showed - is "socialized science" [*] – my apologies. And I referred to Peter Suber’s categorization of Open Access as “access barriers and permission barriers” when the better term is “price barriers and permission barriers”]

It is always slightly scary to see what you actually said – particularly since I do not normally have a set order in my slides.

[*] Chambers defines socialize/socialized as:

socialize or socialise verb (socialized, socializing) 1 intrans to meet with people on an informal, friendly basis. 2 intrans to mingle or circulate among guests at a party; to behave sociably. 3 to organize into societies or communities. 4 to make someone or something social.


I made the sweeping assertion at Berlin5 that no-one other than me was blogging (I asked for a show of hands), and am delighted to be proved wrong:

Paolo Gardois Says:
September 24th, 2007 at 3:30 pm

Firstly, compliments for your presentation, it was great!!
Secondly, I was at Berlin 5, blogging the meeting, so you should feel a little less sad… :-) Our blog is in Italian, but if you want to take a look: .
Please let us know what you think…

PMR: Paolo has done a great job and blogged every session.

[Translated from the Italian] Peter Murray Rust, a Cambridge scientist and blogger, continues with a paper on making research data public, starting from the example of data on pollution and global warming. After the unmissable Tufte quotation (“Power Corrupts. PowerPoint Corrupts Absolutely”), Rust goes on to outline the omnipresence of copyright, both in the tables and graphs published in the scientific articles of commercial publishers and in databases (e.g. ACS).

Now, one aspect of open access concerns free access to information, but another concerns the possibility of reusing data. The two are not always linked, and this is a problem for scientists, who generate new knowledge literally by manipulating and reconfiguring publicly available data. One solution is to attach an explicit licence covering the use of the data (e.g. Science Commons).

Doctoral theses should also be released under similar licensing models (see the Harvard initiative).

On another front, there are also technical difficulties in extracting data (formulae, etc.) from publications so that they can be reused. So it is not only a problem of copyright but also of formats. Raw data therefore need to be published in standard formats in public repositories, and in parallel text mining tools (automatic extraction of data from text files) need to be developed: XML of course, not PDF, which destroys science :-)

One example of these tools, useful for the semantic annotation of chemistry articles, is OSCAR3.

But anyway, irony aside, what was said here about closed formats echoes what I wrote yesterday about closed documents as a now obsolete form of publishing knowledge. Often the data are more interesting than the conclusions drawn from them, because they allow discussion and alternative interpretations: “closing” them exclusively inside PDF and Powerpoint is certainly a mistake, and another face of the same problem. The concepts from which one can start reasoning here are: openaccess (cultural, legal, economic, professional aspect), opensource (IT aspect, production), open standards (access, reuse, reconfiguration).

PMR: Good accurate report, like the others. I am interested to see that Italian uses many English terms directly: “copyright”, “repository”, “open access”, “text mining”. I don’t want to seem like an anglophone imperialist, but in the Internet age it can be useful to know that we are using the same terms for the same concept. Of course copyright will be country-dependent in precise meaning.

blogs, folksonomies and tagging – get going!

At the recent “Berlin 5” meeting on Open Access I noted sadly that I was the only person blogging the meeting. Normally there are many bloggers at the meetings I go to so I (and everyone else) can choose what they blog. At berlin5 I felt it was important to show the way so I hacked some notes together for many of the talks – generally typing scattered phrases during the talks (and with even more typos than normal). As a result I spent more time than I would have liked simply noting some of the presentations. In any case it’s not a very good approach since you don’t know what the speaker is going to say and often run out of puff during dense slides. You know them (“Title XVIII chapter 123 of the EC, says …”). Mind you, it isn’t easy to blog my presentations synchronously either…

I made ca. 15-20 posts about Berlin-5. 2-3 before I went, 2 (so far) afterwards and about 12 during the meeting. Many of the latter are simple shorthand notes of speakers with little or no comment. So for example I have copied as many of Alma Swan’s words as possible to give those not there an idea. I can’t type well or fast so it’s limited. And there are no links.

I’m writing this in the hope that librarians, funders and policy makers will be more adventurous and start their own blogs. An increasing number of slides at berlin5 mentioned blogs, wikis, folksonomies, etc. The best way to understand these is to DO them, not read other people’s.

There are of course some top-class blogs from staff working for publishers – Nature and PLoS lead the way. They actually tell us how people in the organization think, work, interact. (Contrast the more formalised magazine-like blogs at some publishers, which are often written by third parties, sometimes recruited from the blogosphere.) And there are some excellent librarian blogs. But I am sure there is a niche for “DGXIII inmate”, “bewildered at RCUK/STFC”, etc. In Open Access we need more than just Peter Suber and Stevan Harnad commenting. They have clear formats and agendas which need complementing. There is a huge need for investigative blogging to reveal the spread and the problems with OA.
The digital library needs metadata and in C21 much of this should be done elsewhere than the library. Two main methods are text-mining and tagging (folksonomies). Here I’ll look at the latter.

If you have just set up a blog, no one will know about it. It can be quite dispiriting. There are many legitimate ways to advertise, including tagging. There are sites such as Technorati which visit all blogs (ca. 100,000,000 exist) and index and link to them.
One thing that Technorati looks for is the tags in a blog.
If you write a blog you can add tags which give an idea of the content. Tags are common in many systems such as and Connotea where communities expect other members to use tags to find similar contributions. There is NO controlled vocabulary – you could use anything (though it’s best to stick to ANSI alphanumerics). If you don’t understand social computing, this is a good place to start. It doesn’t matter what you do – it won’t break anything. And there is no “right” or “wrong” way to do things – what matters is whether it works. So for this meeting I chose “berlin5”. It’s natural and I assumed that others among the 100-200 delegates would choose something similar. Let’s assume they choose “berlin open access” (you can have multiple tags, of course).
In a formal metadata system this is a nightmare, but in the blogosphere it’s trivial. If twenty people read both blogs one of them will probably post a comment “Petermr is using berlin5 – why don’t you add that as well” (or the other way round). So the two of us start to converge. No one tells us to – it’s just obviously a good thing to do.
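
A minimal sketch of that convergence, in Python. Everything here is hypothetical and for illustration only: the blog names, the posts and the `normalise` rule are my assumptions, not how Technorati or any other tag service actually works. It simply shows that once a second blogger adds the same tag, a trivial index is enough to find both blogs.

```python
import re
from collections import defaultdict


def normalise(tag):
    """Lower-case a tag and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


def index_by_tag(posts):
    """Map each normalised tag to the set of blogs that have used it."""
    index = defaultdict(set)
    for blog, tags in posts:
        for tag in tags:
            index[normalise(tag)].add(blog)
    return index


# Hypothetical posts: (blog name, tags on one post).
posts = [
    ("petermr", ["berlin5"]),
    ("other-blog", ["berlin", "open access"]),
    # After a reader's comment the second blogger adds the shared tag.
    ("other-blog", ["Berlin5", "open access"]),
]

index = index_by_tag(posts)
# Tags used by more than one blog are the ones that connect people.
shared = {tag: blogs for tag, blogs in index.items() if len(blogs) > 1}
```

With only the first two posts `shared` would be empty; the third post is what makes “berlin5” a meeting point.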

So here is the list of posts about berlin5 (there are 18). There are 3 which are nothing to do with OA but they are easily ignored as they are old.




As I say, it’s a pity that there isn’t anyone else (although we needn’t have finished).
Let’s look at a more distant meeting – electronic theses and dissertations at Uppsala. If you follow:

You’ll find 17 posts, mainly by me but not all:

ETD Policies, Strategies and Initiatives in…

Das, Anup Kumar and Sen, B. K. and Dutta, Chaitali (2007) ETD Policies, Strategies and Initiatives in India: a Critical Appraisal. In Proceedings 10th International Symposium on Electronic Theses and Dissertations (ETD2007), Uppsala, Sweden.

So now I have made two important contacts – the authors of the article and also EPrints for LIS. That’s just because we both used etd2007 in our posts.

But now let’s look at the really hectic end of the scale, www2007:

303 posts! and although the meeting was 5 months ago, posts mentioning it are still coming in, such as Yet another semantic tagging application in Jakoblog — Das Weblog von Jakob Voß

Moreover, because I have added this link to my blog, Jakoblog will get notified. Technorati keeps count of how often every blog mentions others. E-LIS has 251 other blogs which link to it; I have about 120 (“the authority”), Jakoblog has 37. If I put Jakoblog on my blogroll it would increase to 38. (A popular aggregator/multiple_author blog like ScienceBlogs has nearly 10,000; Bora’s Blog Around the Clock has 700; Dorothea Salo’s Caveat Lector has 199; my colleague Andrew Walkingshaw’s Brighten the Corners has 28.) Of course these numbers are about as useful as citation statistics!
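
As a rough illustration of the counting, here is a small Python sketch. It assumes that “authority” is simply the number of distinct blogs linking to a given blog; that is my reading of the numbers above, not a description of Technorati’s actual algorithm, and the link data is invented.

```python
def authority(links, blog):
    """Count the distinct blogs that link to `blog`, ignoring self-links."""
    return len({source for source, target in links
                if target == blog and source != blog})


# Hypothetical (source blog, target blog) links.
links = [
    ("blog-a", "jakoblog"),
    ("blog-b", "jakoblog"),
    ("blog-a", "jakoblog"),  # a second link from the same blog does not count twice
]

before = authority(links, "jakoblog")
links.append(("petermr", "jakoblog"))  # adding Jakoblog to my blogroll
after = authority(links, "jakoblog")
```

One new linking blog raises the count by exactly one, which is the 37 to 38 step described above.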

The serious message is that if you want to go out and get noticed in the blogosphere you have to get noticed! Tagging is a good way of finding out who is thinking along the same lines as you. Then link to them. They’ll often link back. Aggregators will include all of you, and so on.

So, OA colleagues – and hopefully OD colleagues as well – get out there! Yes, you will reach some people via conventional scholarly publications. But your publications will be noticed much more if they are blogged. Das, Sen and Dutta should get some more readers because I have blogged it. They’ll get me anyway, and that’s because E-LIS blogged it. And so it grows…

berlin5: final thoughts

Some final thoughts on the berlin-5 meeting on Open Access in Padova – I have spent more blog time than I thought and I am probably driving any chemical/software readers up the wall. This should be the last post with the tag. Some discussion is reported in Chatham House Rule manner.
Splendidly organised. Wonderful food and drink. Very relaxed atmosphere.
Fantastic location. Italy is fortunate to have preserved many of its medieval towns almost intact. The best analogy in the UK is probably Cambridge or Oxford, but they don’t have the same compact city boundaries as many of their Italian counterparts.
A reasonably good mix of funders, policy makers (EU, etc.), publishers, researchers, library/IT.

A positive atmosphere. Alma was very upbeat that Open Access was now unstoppable.

I was pleased to see that Open Data was now much higher on the agenda. General agreement that it must be addressed and quickly and I think several people have taken this away and will work on it. Similarly the idea that “Open Access” is not a licence and we have to use CC or SC. Kaitlin Thaney from Science Commons was there and I am sure that people will get in touch with her.

eTheses were also higher on the agenda. Good. At earlier meetings I had asked whether I could run robots over the Dutch theses and was told there was a copyright problem. Now I am told that was incorrect – I can do whatever I like. There are over 10,000 Open theses in NL, so we’ll start pointing our robots there.

Because of my diffident nature I have been in the habit of asking permission for this sort of thing. Now I am getting braver and shall “ask for forgiveness rather than permission”. So here come text and data-mining robots. After all it’s C21.
There was a mixture of views about the legality of Foo, Bar, and Bananas. I am urging that in the C21 copyright is inappropriate for eScience and we should simply declare all scientific data unencumbered by publisher copyright. I pushed one or two publishers like this…
PMR: “are images (graphs, gels, cells) of the scientific record copyright?”

Publisher: “well, we put lots of effort into the design of lettering in images”

PMR: “on gels?”

Publisher: “… er um”

So I think there are an increasing number of publishers who see that the scientific record per se (i.e. the wider “data”) must be free and Open. I talked with one publisher who has got excited about the possibility of Open Data and although they might not be Open Access, see the advantages of making data visible.

I think a lot of people hadn’t seen the power of data- and text-mining and although I had to compress a lot into 27 minutes the message came through.
On a slightly more critical note:

There was very little awareness of what Web2.0 and the rest is about. There is a vast difference between berlin5 and www2007 (scifoo is something else, of course). We who are in the middle of it forget how many academics have never heard of Flickr.

I was disappointed that no-one else was blogging; presumably the awareness of tags and folksonomies is low, and I’ll address that in another post.

I am looking forward to the video and will let you know when it happens.

And as always new contacts and opportunities. I am always happy to visit and demo or spread the word. Open Access, Repositories, Open Data … we are taking off.

berlin5 : Alma Swan

The final keynote by Alma Swan, familiar to all in the OA field. How are we doing? (Alma does a lot of surveys, interviews, questionnaires, etc.)
We are getting definition creep. There should be no qualification of OA – it’s either pregnant or not – not slightly pregnant.  OA is not “delayed OA”…
Awareness, in order:

  • funders
  • publishers (PLoS, BMC doing very good advocacy)
  • peers (word-of-mouth)
  • library (often repositories are not well advertised)
  • and the effect of OA

“self archiving gave my work instant world-wide visibility. As a result I was invited to … conferences … and authoring”

Proven Business model (PLoS, BMC, Hindawi) 70% rise in submissions over last 2 years. Hindawi is profitable, BMC break-even, PLoS OK on all except flagships. Bentham launching 200+ OA this year, and 100 next year
Moving the money around. Need to move from library budgets to author-end. Not trivial but vice-chancellors have to grasp this nettle. Experiments:

  • Nottingham
  • Wisconsin
  • Amsterdam

Reorganising rather than spending new money. 7 billion USD into scholarly publishing.

Learned societies. Not homogeneous. Sounds like publishing, but is NOT. Actually aligns with mission of a scholarly society. Target the scientific officers of society. Please try. Work with LS to help them embrace OA and concept of opening up scholarship. Show benefits. Discuss green and gold. Discuss evidence against damage to business. Be patient. Praise and encourage the ones which are moving. They are too coy about their achievements (e.g. APS and IOP (UK)). Both have built mirror site for arXiv (doing this for benefit of community – let’s praise them). Support members who are struggling to change. History will record who helped and obstructed.

Start by making Society conferences OA (ASCB (Am Soc Cell Bio), Inst. Math. Stat.)

Peter Suber says 380 OA journals from 350 societies.

Digital Repositories. Family of types but shared purpose is dissemination of research in ways not possible up to now. Repositories are at centre of universe. Ingest tools, search and retrieve, aggregate/display, count/assess, peer review (might/not be publishers), editorial (publishers), other value adding

Repos are where content is going to start, at data creation stage.

We need a marketing message for each constituency:

  • institution: visibility and impact. G-factor (Google rank or Web presence). Much higher in US. But Southampton is 3rd of UK universities. Mandatory deposit of research. Many links are to repo.
  • funders. OECD says: boost innovation and better return if proceedings Open Access. Houghton. Drummond Bone – repos are vital to UK economy. EU: SME find it hard to get access to the basic research information they need. A small pharma: cannot pay for TA journals or 30GBP/article
  • authors: WILL comply willingly if mandated (81%), reluctantly (14%). Arthur Sale – QUT has over 50% in repo. Encouragement does NOT work. Mandate AT ACCEPTANCE. The AUTHOR’S FINAL VERSION, even if not OA. Mandate DEPOSIT. Need author’s final version (as well as PDF)
  • usage: UoCalif 2 million downloads. Interoperable Repository Statistics (IRS) will help. Monthly download, Daily downloads, types of referrer, etc. Which universities are accessing. In some cases Wikipedia is top referrer. Authors love it

Rights: Shouldn’t be a block but it is. Promote author addendum. Most address data. Monitor copyright policies and addenda.

It’s about the Web, stupid:

BBC linked to Soton and links were out of date. If Google on author’s name

One third of Soton ECS lack home page, same in MIT. Let the young people help.

Joined up strategy. It IS a web. Data theses, articles.

And work on lobbying:

it’s hard, but PRISM has backfired and this makes it easier. Now we have to SHOUT. need organizing centre. SPARC…

Personal Strategy – stay cheerful

  • Peter Suber’s blog
  • AmSci and SPARC OA lists
  • David Prosser’s paper
  • Alma’s OA calendar

berlin5 : NIH and RCUK

NIH has had an open policy since 1994. Barbar Seto presented an example, GWAS, which has to deal with human subjects. How to make data Open, while protecting identity?
NIH serves as central data repository, including: Genome-wide association study (GWAS), Genbank, Protein cluster, Pubchem.

GWAS – identifies common genetic factors influencing health and disease. Genetic variations associated with observable traits. It combines genomic data with clinical and phenotypic data to understand disease mechanism and prediction of disease.

Because some diseases are rare it is sometimes possible to work out identities from anonymized data.
Cold room for use at grantee institution = data is open within a specified location and can’t be taken away
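
To make the re-identification risk concrete, here is a small Python sketch. It is illustrative only and is NOT the method NIH or GWAS uses: the records, the field names and the threshold `k` are all hypothetical. It flags any combination of attribute values shared by fewer than `k` records, which is exactly where a rare disease can single someone out.

```python
from collections import Counter


def risky_groups(records, quasi_identifiers, k):
    """Return value combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical "anonymized" records: no names, but still identifying in combination.
records = [
    {"region": "north", "age_band": "40-49", "condition": "common"},
    {"region": "north", "age_band": "40-49", "condition": "common"},
    {"region": "south", "age_band": "60-69", "condition": "rare"},
]

flagged = risky_groups(records, ["region", "age_band", "condition"], k=2)
```

Here the single record with the rare condition is flagged, even though no name appears anywhere in the data.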


Mark Thorley NERC and RCUK. Reacting to issues brought up in the morning. 4.1 billion EUR:

  • Data as byproduct of research.
  • Data as part of the scientific record; support publications
  • Data as published output in own right


  • scientific need (e.g. atmospheric physics requires data sharing)
  • increased value – as part of larger collection e.g. oceans
  • value for money. Ship costs 10,000 GBP/day
  • public funds, so public access

One-size-fits all is NOT appropriate.

  • RC’s recognise data as valuable long-term public-good resource
  • Data sharing improves opportunities for exploitation (e.g. mashups). “Power of information” (UK, Cabinet Office) Stimulate knowledge economy
  • Investigator has a right of first use and right to be acknowledged. There must be a limit, but early release can be a problem.
  • Effective exploitation requires effective data management.
  • Must be legal. (e.g. directive on public access to environmental information)

Any differences between RCs are not in policy, but in how to support data sharing.

National facilities (NERC, ESRC) or local delegation (AHRC, BBSRC, MRC)?

  • National – longer term, single point, centres of excellence. expensive, less agile
  • Delegated – more responsive, closer to science, cheaper, lack of long term.
  • Long term commitment. Needs long-term vision, long term support. Are PIs the right people to do this?

berlin5: Open Data and institutional repositories

John Marks (ESF) introduced our session and set the scene on the need for Open Data and sharing. He stated strongly that it was essential that we had discipline-specific repositories for different branches of science. I share this view and blogged it recently (berlin5 : how to progress Open Data?).

My stance comes from meetings this year where I have talked to many people about institutional repositories. I ask them “why are you setting up an IR?” I have got about 8 distinct answers. Very few of them mention data.

Some of us addressed these issues at ETD2007. There are hundreds of different types of biological data, tens of chemistry data, hundreds of geoscience, etc. There is no way that these managers – with the best will in the world – will know how to manage them. So I wrote:

although there is quite a lot of activity in institutional digital repositories they won’t (and shouldn’t) address Data. It’s subject-specific and too complex for the average repository manager.

PMR: In response to this Dorothea Salo (who has run the Caveat Lector blog for some years and has a strong following) commented:

  1. Dorothea Salo Says:

    Disagree somewhat that IRs and their managers shouldn’t address data, though I agree that for now it’s impractical because the software is so wretched and the technical infrastructure insufficiently scalable. Just because IR software in its current state is completely broken with regard to data doesn’t mean it must or should stay that way, though. Moreover, the notion that “domain knowledge” is the sole key to data curation is (bluntly) bunk, and nobody’s yet tested the assertion that it’s harder to teach a librarian domain knowledge than to teach a discipline-practitioner info management. Frankly, “it differs by discipline” doesn’t matter. So does everything else in librarianship, from reference transactions to collection development. We cope. It’s our job to. As for “too complex,” says who? And about which librarians? I think I’ve just been insulted.

    There’s nothing wrong with telling librarians — and the subset of librarians who are repository managers — that we need to brush up our game to deal with these issues. I have a plan in place to learn the principles of data curation for myself over the next year or so. I want to see more librarians planning the same!

    Looks like a good talk. Wish I could be there to hear it!

PMR: I haven’t met Dorothea but I’d like to – her blog is insightful and entertaining and she is unafraid to speak out. She’s also technically proficient in the IT skills required – XML, etc. And the last thing I want to do is upset and antagonize people like Dorothea.

But… There is no single human on the planet who knows how to reposit all of protein structures, variable stars, ice sheets, chemical structures. It needs much more than metadata. So what can a repository manager do? Putting the raw data into the repository without understanding it is not an option. It has to go into a system devised by experts in the discipline. And, for me, that means subject repositories. Maybe each university has a different one. Maybe they are national. Some, like the bioscience ones, will be international.

berlin5 : Maxine Clarke

Maxine is presenting Nature’s practice and philosophy on data. (Hope I capture this OK – there is a lot of material.) In the early 1990s they introduced Supplementary Info (SI). In 2007 they have fully integrated online methods. SI is largely Free access and unedited.
All policies are common.

Authors should retain all original data and analyses.

Central website for Nature’s policies – includes “availability of data and materials”, with reasons for the policies

[BIND - bioscience databases - has been sold to an informatics startup - shows problems of trying to keep data Open]
data submission for large datasets
finding relevant experts
image manipulation

post publication

access to data and analysis
timely editorial responses

incentivize data sharing

institutions must be more involved

And in questions Maxine suggests a liberal approach to images.
[Nature chemistry is coming next year, so that could be very exciting].

And … for readers who read this blog for the human interest Maxine and I are on very good terms…

berlin5 : what did I say?

I am very grateful to Berlin5 and the ESF (European Science Foundation) for inviting me to speak on “Open Data”. In giving talks like this I don’t prepare a linear set of Powerpoint (which I despise technically and philosophically) but use a large set of HTML resources, including many active web pages. All told I think there are several thousand slides, some of which I have written, some scraped from elsewhere, and some efforts to capture the dynamic ones.

So it’s very exciting when a sponsor agrees to capture this on video. This happened at OIA4 (2005), Google (2006), and Caltech (2007). This captures some of the displays that cannot fit in a machine. It also means that I can speak to a wider audience – possibly including university administrators who have a large part in policy but were not represented at the meeting.

I had ca 27 minutes to speak and my style is to select those slides which I think are most relevant at the time. In some ways it’s a performance, not a lecture. I have a menu which leads to submenus, some of which I might not have seen for some time and which prompt me to say something.

The presentation was heavily influenced by Ilaria’s account of the absolute necessity to share genetic information about disease and the opposition she met when she made it public. I have three levels of polemic (GREEN, AMBER, RED) and had intended to be at AMBER – occasionally prodding various sectors or organizations. But, after hearing Ilaria, I upped it to RED – full-scale rant. Rant against the scientific publication community for its opposition to the free spread of information which is vital to the human race, for its lack of vision in the positive power of technology, for the overhanging cloud of FUD engendered by copyright and access controls. And a milder rant against the scholarly information community for not being braver in challenging the nonsense of copyright on scientific data. You’ve got to get up and speak. And your vice-chancellors and provosts.

I forgot what I urged in detail but it’s mainly in my blog. Theses. Clear Open licences. Positive permissions, rather than implicit fuzz (“PLEASE take our data and use it!”). Brief mention of the need for live semantic data.

Very simply, if we wish to save humanity we must make our data Open and positively share it. Otherwise we shall be spread-betting whether we are doomed by Asian ducks or melting penguins.

I look forward to seeing the video.

[NOTE: I asked how many in the audience knew the Keeling curve. Only 2/100 did... We have a little way to go.]

berlin5 : Ilaria Capua’s bravery

A stunning presentation from Ilaria Capua on the necessity of releasing sequence information relating to avian flu. There’s lots of coverage on the web – here’s one and a snippet:

After Capua took over, IZSVe became Italy’s reference lab for bird flu, testing samples from all over the country. In 2002, OIE asked Capua if IZSVe could serve as one of its global reference labs as well; FAO asked in 2004. As a result, the institute has received a steady stream of samples from H5N1-affected countries, primarily in the Middle East and Africa.

It was because she was at the hub of this research that Capua became aware of the lapse in data sharing. Her discomfort began in February, when WHO asked her to deposit the sequence of a sample from Nigeria, the first African country affected, in a closed-off compartment of a flu database at Los Alamos National Laboratory in New Mexico, to which fewer than 20 labs have access. If she shared her sequence, WHO scientists said, she would have access to the rest of the hidden Los Alamos data.

“I’m very brave. I’m often ahead of others in thinking about important issues.” –Ilaria Capua

Capua refused and instead deposited her sequences in GenBank for the entire world to see. At the same time, in a message on ProMED, an e-mail list for emerging infectious diseases, she asked her colleagues to follow suit (her posting won ProMED’s annual award in August); she also asked Science to investigate (3 March, p. 1224).

WHO defended the closed database on the grounds that H5N1-affected countries often don’t want reference labs in the developed world to publish information about the strains circulating within their borders. But Giovanni Cattoli, the director of research and development in Capua’s lab, says that “is simply not our experience,” noting that of the 15 countries the Capua team has dealt with, 14 said sharing data was “fine.” As to scientists’ worries that they might be scooped if they post their sequences in real time, Capua says: “What is more important? Another paper for Ilaria Capua’s team or addressing a major health threat? Let’s get our priorities straight.”

Simply: established bureaucratic processes had the key data locked up in dusty databases that no one was using. Ilaria insisted that data should be available to all and “just did it”. A storm of outrage followed, but also growing support, and now her approach and vision are accepted.

This set the scene for my presentation and put me in a polemic mood… more later.