
Big Science and Long-tail Science

Jim Downing and I were privileged to be the guests of Salvatore Mele at CERN yesterday and to see the Atlas detector of the Large Hadron Collider. This is a "wow" experience - although I "knew" it was big, I hadn't realised how big. I felt like Arthur Dent watching the planet-building in The Hitchhiker's Guide to the Galaxy. It is enormous. And the detectors at the edges have a resolution of microns. I would have no idea how to go about building it. So many thanks to Salvatore and colleagues. And it gives me a feeling of ownership. I shall be looking for my own sponsored hadron (I've never seen one). So this is "Big Science" - big in mass, big in spending, big in organisation, with a bounded community. A recipe for success.


CMS detector for LHC


The main business was digital libraries, repositories, Open publishing, etc. It's clear how CERN with its mega-projects ("big science") can manage ventures such as the SCOAP3 Open Access publishing venture. And the community will need somewhere to find the publications - so that is where repositories come in.

There is no question that High-energy physics (HEP) needs its own domain repository. The coherence, the specialist metadata, the specialist data for re-use. HEPhysicists will not go to institutional repositories - they have their own metadata (SPIRES) and they will want to see the community providing the next generation. And we found a lot of complementarity between our approaches to repositories - as a matter of necessity we have had to develop tools for data-indexing, full-text mining, automatic metadata, etc.

But where do sciences such as chemistry, materials, nanotech, condensed matter, cell biology, biochemistry, neuroscience, etc. etc. fit? They aren't "big science". They often have no coherent communal voice. The publications are often closed. There is a shortage of data.

But there are a LOT of them. I don't know how many chemists there are in the world who read the literature but it's vastly more than the 22,000 HEP scientists. How do we give a name to this activity? "Small science" is not complimentary; "lab science" describes much of it but is too fixed to buildings.

Jim Downing came up with the idea of "Long Tail Science". The Long Tail is the observation that in the modern web the tail of the distribution is often more important than the few large players. Large numbers of small units is an important concept. And it's complimentary and complementary.
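To make the long-tail idea concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that research units follow a Zipf-like size distribution (the unit at rank r has size proportional to 1/r). The function name `tail_share`, the exponent and the number of units are all hypothetical; they are not measurements of any real scientific community.

```python
def tail_share(n_units, head_size, exponent=1.0):
    """Fraction of total activity held by units ranked below the head.

    Assumes a hypothetical Zipf-like distribution: the unit at rank r
    has size 1 / r**exponent. This is an illustrative model only.
    """
    sizes = [1.0 / rank ** exponent for rank in range(1, n_units + 1)]
    total = sum(sizes)
    # Everything after the first `head_size` units counts as the tail.
    tail = sum(sizes[head_size:])
    return tail / total


if __name__ == "__main__":
    # Assumed numbers: 10,000 units, of which the 10 largest are "big science".
    share = tail_share(n_units=10_000, head_size=10)
    print(f"share of activity in the tail: {share:.0%}")
```

With these assumed numbers the ten largest units hold well under half of the total, which is the sense in which the tail can matter more than the head. A steeper exponent would shift the balance back towards the big players.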

So we are exploring how big science and long-tail science work together to communicate their knowledge. Long-tail science needs its domain repositories - I am not sanguine that IRs can provide the metalayers (search, metadata, domain-specific knowledge, domain data) that are needed for effective discovery and re-use. We need our own domain champions. In bioscience it is provided by PubMed. I think we will see the emergence of similar repositories in other domains.

I am on the road a lot so the frequency (and possibly intensity) of posts may decrease somewhat...

APE2008 - Heuer, CERN

APE (Academic Publishing in Europe) was a stimulating meeting, but I wasn't able to blog any of it as (a) there wasn't any wireless and (b) there wasn't any electricity (we were in the Berlin-Brandenburg Academy of Sciences, which made up for the lack by the architecture and the legacy of bullet holes in the masonry). So I took notes while the battery lasted, but they read rather staccato.
The first keynote was very exciting. Rolf-Dieter Heuer is the new Director General of CERN - where they start hunting the Higgs Boson any time now. CERN has decided to run its own publishing venture - SCOAP3 - which I first heard of from Salvatore Mele - I'm hoping to visit him at CERN before they let the hadrons loose.

So my scattered notes...

SCOAP requires all COUNTRIES contribute (i.e. total commitment from the community and support for the poorer members)
closely knit community, 22,000 ppl.
ca 10MEUR for HEP - much smaller than expts (500MEUR) so easy for CERN to manage (So organising a publishing project is small beer compared with lowering a 1200 tonne magnet down a shaft)

22% use of Google by young people in physics as primary search engine
could we persuade people to spend 30 mins/week for tagging

what people want
full text
depth of content

build complete HEP platform
integrate present repositories
one-stop shop
integrate content and thesis material [PMR - I agree this is very important]

text- and data-mining
relate documents containing similar information
new hybrid metrics
deploy Web2.0
engage readers in subject tagging
review and comment

preserve and re-use research data
includes programs to read and analyse
data simulations, programs behind expts
software problem
must have migration
must reuse terminated experiments

[PMR. Interesting that HEP is now keen to re-use data. We often heard that only physicists would understand the data so why re-use it? But now we see things like the variation of the fundamental constants over time - I *think* this means that the measurement varies, not the actual constants]

same researchers
similar experiments
future experiments
theorists who want to check
theorists who want to test future (e.g. weak force)
need to reanalyze data with time (JADE experiment, tapes saved weeks before destruction and had expert)
SERENDIPITOUS discovery showing that weak force grows less with shorter distance

Raw data 3200 TB

raw -> calibrated -> skimmed -> high-level obj -> physics anal -> results
must store semantic knowledge
involve grey literature and oral tradition

MUST reuse data after experiment is stopped

re-usable by other micro domains
alliance for permanent access

PMR: I have missed the first part because battery crashed. But the overall impression is that SCOAP3 will reach beyond physics just as arXiv does. It may rival Wellcome in its impact on Open Access publishing. SCOAP3 has the critical mass of community, probably finance, and it certainly has the will to succeed. Successes tend to breed successes.

... more notes will come at random intervals ...

Richard Poynder Interview

I was very privileged to have been invited to talk to Richard Poynder at length in a phone interview.

I am impressed with the effort that Richard put in - it is a real labour of love. We've not met IRL and hope to do so some day.

Interviewers of this quality provide very useful checkpoints - in going over some of the points I was able to realise what might be consistent and contradictory. And there is an objectivity which an individual cannot create for themself.

So many thanks Richard.

From Peter Suber: More on the NIH OA mandate.

Many points but I pick one:


Jocelyn Kaiser, Uncle Sam's Biomedical Archive Wants Your Papers, Science Magazine, January 18, 2008 (accessible only to subscribers).  Excerpt:

If you have a grant from the U.S. National Institutes of Health (NIH), you will soon be required to take some steps to make the results public. Last week, NIH informed its grantees that, to comply with a new law, they must begin sending copies of their accepted, peer-reviewed manuscripts to NIH for posting in a free online archive. Failure to do so could delay a grant or jeopardize current research funding, NIH warns....


Scientists who have been sending their papers to PMC say the process is relatively easy, but keeping track of each journal's copyright policy is not....

PMR: Exactly. It should be trivial to find out what a journal's policy is. As easy as reading an Open Source licence. An enormous amount of human effort is wasted - by authors and repositarians - on repeatedly trying to (and often failing to) get this conceptually simple information.


I've been doing articles and interviews on OA and Open Data recently and one thing that becomes ever clearer is that we need licences or other tools. Labeling with "open access" doesn't work.


Science 2.0

Bill Hooker points to an initiative by Scientific American to help collaborative science: Mitch Waldrop on Science 2.0

I'm way behind on this, but anyway: a while back, writer Mitch Waldrop interviewed me and a whole bunch of other people interested in (what I usually call) Open Science, for an upcoming article in Scientific American. A draft of the article is now available for reading, but even better -- in a wholly subject matter appropriate twist, it's also available for input from readers. Quoth Mitch:

Welcome to a Scientific American experiment in "networked journalism," in which readers -- you -- get to collaborate with the author to give a story its final form. The article, below, is a particularly apt candidate for such an experiment: it's my feature story on "Science 2.0," which describes how researchers are beginning to harness wikis, blogs and other Web 2.0 technologies as a potentially transformative way of doing science. The draft article appears here, several months in advance of its print publication, and we are inviting you to comment on it. Your inputs will influence the article's content, reporting, perhaps even its point of view.

PMR: It is a reasonably balanced article, touching many of the efforts mentioned in this blog. It's under no illusions that this will be easy. I've just finished doing an interview where at the end I was asked what we would be like in 5 years' time and I was rather pessimistic, fearing that the current metrics-based dystopia would persist and even get worse (the UK has increased its efforts on metrics-based assessment, in which case almost any innovation, almost by definition, is discouraged). But on the other hand I think the vitality of "2.0" in so many areas may provide unstoppable disruption.

Update, Open Data

I have been distracted by the real world (in some cases to good effect). A lot of progress on CML, Wikipedia, chemical language processing, etc. We've also had a WordPress upgrade which until it happened has stopped my re-opening the CML blog.  I shall be at the European meeting on Open Access in Berlin next week and will hope to blog some of that.

Some recent highlights:

  • I've done a long interview on Open Data which should be public fairly soon
  • I converted the Serials Review article into Word and that has now been submitted. I have also submitted it to Nature Precedings and that should be available in a day or so.
  • I have finalised the proofs for the Nature "Horizons" article (whose preview is on Nature Precedings). The house style seems to be to remove all names from the text and further reading and I am not allowed to acknowledge people by name. This makes the article read in a very egocentric style which does not reflect the communal nature of the exercise. It appears in early Feb.

Open Data: I want my data back!


Although I am mainly concerned with campaigning for data associated with scholarly publishing to be Open, the term Open Data has also been used in conjunction with personal data "given" or "lent" to third parties (see Open Data - Wikipedia, which contains Jon Bosak's quote "I want my data back"). Here is a good example of the problems of getting one's personal data (and possibly other people's) back, from Paul Miller of Talis: Scoble, Facebook, Plaxo, open data; time for change? Excerpts (read the whole post for the details):


I am of course talking, like so many others, about Robert Scoble being barred from Facebook for using an as-yet unlaunched capability of Plaxo that clearly and unambiguously breached Facebook's Terms and Conditions.

It all began with a 'tweet' from Robert Scoble, about the time that post-holiday blues kicked in for those returning to work this (UK) morning;

“Oh, oh, Facebook blocked my account because I was hitting it with a script. Naughty, naughty Scoble!”

Twitter exploded, closely followed by large chunks of the blogosphere. ...

Minutiae aside, the whole affair raises a couple of points pertinent to one of the biggest issues for 2008; ownership, portability and openness of data.

  • I want to be able to take my data from a service such as Facebook, and use it somewhere else. That's what Marc Canter has been arguing forever, along with the AttentionTrust, OpenSocial (to a degree), and many more. That's part of the rationale behind all the work we've been doing on the Open Data Commons, too. However, whether I want to or not, doing it the way Scoble did is a breach of the terms and conditions of Facebook; terms and conditions to which I - and he - signed up when we chose to use the site. If you don't like the terms, don't use the service. It's as simple as that;
  • Even were I allowed to export 'my' data, there's a fuzzy line between that which is mine and that which isn't. The fact that I am a Facebook friend with Nova Spivack certainly should be mine to take wherever I choose. The contact details Nova chooses to surface to me as part of that relationship, however? Are they mine to take with me, or his to control where I can surface them? There's clearly work to do there, although it's interesting that 'even' people such as Tara Hunt are reacting (also on Twitter, of course) with;

“I'm appalled that someone can take my info 2 other networks w/o my permission. Rights belong 2 friends, too.”

PMR: I have no additional comments on this other than to say it's going to take hard work, forethought to anticipate problems of this sort, and probably a lot of legal work. Kudos to Paul and Talis and their collaborators for helping in these general areas.


In science it's easy. Our data are ours. They don't belong to Wiley, ACS, Elsevier, Springer. I've just finished a paper on this which you should all see shortly.


We want our data back.


And in future we want to make sure we don't give away our rights to them. Is that a simple message for 2008?




New Year's resolutions

Cameron Neylon has made Some New Year’s resolutions

I don’t usually do New Year’s resolutions. But in the spirit of the several posts from people looking back and looking forwards I thought I would offer a few. This being an open process there will be people to hold me to these so there will be a bit of encouragement there. This promises to be a year in which Open issues move much further up the agenda. These things are little ways that we can take this forward and help to build the momentum.

  1. I will adopt the NIH Open Access Mandate as a minimum standard for papers submitted in 2008. Where possible we will submit to fully Open Access journals but where there is not an appropriate journal in terms of subject area or status we will only submit to journals that allow us to submit a complete version of the paper to PubMed Central within 12 months.
  2. I will get more of our existing (non-ONS) data online and freely available.
  3. Going forward all members of my group will be committed to an Open Notebook Science approach unless this is prohibited or made impractical by the research funders. Where this is the case these projects will be publically flagged as non-ONS and I will apply the principle of the NIH OA Mandate (12 months maximum embargo) wherever possible.
  4. I will do more to publicise Open Notebook Science. Specifically I will give ONS a mention in every scientific talk and presentation I give.
  5. Regardless of the outcome of the funding application I will attempt to get funding to support an international meeting focussed on developing Open Approaches in Research.

PMR: This is highly commendable, especially from someone early in their career. Some comments:

  • In some subjects it's hard to find Open Access journals whose scope covers the work. That's very true of chemistry, and there is some sacrifice required. However, there is a high-risk investment here - publish in an OA journal and you are likely to get higher publicity than from a non-OA journal of similar standing. Senior faculty (like me) must promote the idea that it's what you publish rather than where you publish that matters. All journals start small, but many grow, including OA ones.
  • ONS. This is technically hard in many areas. At this stage the effort is as important as the achievement - get as much online as you can afford. But complex internal workflows do not lend themselves to ONS easily and we certainly need a new generation of tools.
  • I don't know of any funders who explicitly forbid ONS (other than for confidentiality, etc.). Funders should not be concerned about where the work is published, only that it is reviewed and reasonably visible. Funders certainly shouldn't dictate the proposed journal and that's the only obvious mechanism for forbidding ONS.
  • Obviously I hope the application succeeds and we shall be there.

Best of fortune

New free journal from Springer - but no Open Data

Peter Suber reports:

New free journal from Springer

Neuroethics is a new peer-reviewed journal from Springer.  Instead of using Springer's Open Choice hybrid model, it will offer free online access to all its articles, at least for 2008 and 2009.

The page on instructions for authors says nothing about publication fees.  It does, however, require authors to transfer copyright to Springer, which it justifies by saying, "This will ensure the widest possible dissemination of information under copyright laws."  For the moment I'm less interested in the incorrectness of this statement than in the fact that Springer's hybrid journals use an equivalent of the CC-BY license.  It looks like Springer is experimenting with a new access model:  free online access for all articles in a journal (hence, not hybrid); no publication fees; but no reuse rights beyond fair use.  The copyright transfer agreement permits self-archiving of the published version of the text but not the published PDF.

Also see my post last week on Springer's new Evolution: Education and Outreach, with a similar access policy but a few confusing wrinkles of its own.

PMR: Whatever the rights and wrongs of this approach - I accept PeterS's analysis of most situations - it represents one of my fears - the increasing complexity of per-publisher offerings. Springer now has at least 3 models - Closed, OpenChoice and FreeOnlineAccess. Even for the expert it will be non-trivial to decide what can and cannot be done, what should and should not be done. If all the major closed publishers do this, each with a slightly different model where the licence matters, we have chaos. This type of licence proliferation makes it harder to work towards common agreements for access to data (it seems clear that the present one is a step away from Open Data).

I used to think instrument manufacturers were bad, bringing out a different data format with every new machine.  I still do. Now they have been joined by publishers.

What does USD 29 billion buy? and what's its value?

Like many others I'd like to thank The Alliance for Taxpayer Access ...

... a coalition of patient, academic, research, and publishing organizations that supports open public access to the results of federally funded research. The Alliance was formed in 2004 to urge that peer-reviewed articles stemming from taxpayer-funded research become fully accessible and available online at no extra cost to the American public. Details on the ATA may be found at

for its campaigning for the NIH bill. From the ATA site:

The provision directs the NIH to change its existing Public Access Policy, implemented as a voluntary measure in 2005, so that participation is required for agency-funded investigators. Researchers will now be required to deposit electronic copies of their peer-reviewed manuscripts into the National Library of Medicine’s online archive, PubMed Central. Full texts of the articles will be publicly available and searchable online in PubMed Central no later than 12 months after publication in a journal.

"Facilitated access to new knowledge is key to the rapid advancement of science," said Harold Varmus, president of the Memorial Sloan-Kettering Cancer Center and Nobel Prize Winner. "The tremendous benefits of broad, unfettered access to information are already clear from the Human Genome Project, which has made its DNA sequences immediately and freely available to all via the Internet. Providing widespread access, even with a one-year delay, to the full text of research articles supported by funds from all institutes at the NIH will increase those benefits dramatically."

PMR: Heather Joseph - one of the main architects of the struggle - comments:

“Congress has just unlocked the taxpayers’ $29 billion investment in NIH,” said Heather Joseph, Executive Director of SPARC (the Scholarly Publishing and Academic Resources Coalition, a founding member of the ATA). “This policy will directly improve the sharing of scientific findings, the pace of medical advances, and the rate of return on benefits to the taxpayer.”

PMR: Within the rejoicing we must be very careful not to overlook the need to publish research data in full. So, as HaroldV says, "the Human Genome Project [...] made its DNA sequences immediately and freely available to all via the Internet". This was the essential component. If only the fulltext of the papers had been available, the sequences could not have been used - we'd still be trying to hack PDFs for sequences.

So what is the USD 29 billion? I suspect that it's the cost of the research, not the market value of the fulltext PDFs (which is probably much less than $29B). If the full data of this research were available I suspect its value would be much more than $29B.

So I have lots of questions and hope that PubMed, Heather and others can answer them:

  • what does $29B represent?
  • will PubMed require the deposition of data (e.g. crystal structures, spectra, gels, etc.)?
  • if not, will PubMed encourage deposition?
  • if not, will PubMed support deposition?
  • if not, what are we going to do about it?

So, while Cinderella_Open_Access may be going to the ball, is Cinderella_Open_Data still sitting by the ashes hoping that she'll get a few leftovers from the party?