Big Science and Long-tail Science

Jim Downing and I were privileged to be the guests of Salvatore Mele at CERN yesterday and to see the Atlas detector of the Large Hadron Collider. This is a “wow” experience – although I “knew” it was big, I hadn’t realised how big. I felt like Arthur Dent watching the planet-building in The Hitchhiker’s Guide to the Galaxy. It is enormous. And the detectors at the edges have a resolution of microns. I would have no idea how to go about building it. So many thanks to Salvatore and colleagues. And it gives me a feeling of ownership. I shall be looking for my own sponsored hadron (I’ve never seen one). So this is “Big Science” – big in mass, big in spending, big in organisation, with a bounded community. A recipe for success.


CMS detector for LHC


The main business was digital libraries, repositories, Open publishing, etc. It’s clear how CERN with its mega-projects (“big science”) can manage ventures such as SCOAP3 Open Access publishing. And the community will need somewhere to find the publications – so that is where repositories come in.
There is no question that High-energy physics (HEP) needs its own domain repository. The coherence, the specialist metadata, the specialist data for re-use. HEPhysicists will not go to institutional repositories – they have their own metadata (SPIRES) and they will want to see the community providing the next generation. And we found a lot of complementarity between our approaches to repositories – as a matter of necessity we have had to develop tools for data-indexing, full-text mining, automatic metadata, etc.
But where do sciences such as chemistry, materials, nanotech, condensed matter, cell biology, biochemistry, neuroscience, etc. etc. fit? They aren’t “big science”. They often have no coherent communal voice. The publications are often closed. There is a shortage of data.
But there are a LOT of them. I don’t know how many chemists there are in the world who read the literature but it’s vastly more than the 22,000 HEP scientists. How do we give a name to this activity? “Small science” is not complimentary; “lab science” describes much of it but is too fixed to buildings.
Jim Downing came up with the idea of “Long Tail Science”. The Long Tail is the observation that in the modern web the tail of the distribution is often more important than the few large players. Large numbers of small units is an important concept. And it’s complimentary and complementary.
So we are exploring how big science and long-tail science work together to communicate their knowledge. Long-tail science needs its domain repositories – I am not sanguine that IRs can provide the metalayers (search, metadata, domain-specific knowledge, domain data) that are needed for effective discovery and re-use. We need our own domain champions. In bioscience it is provided by PubMed. I think we will see the emergence of similar repositories in other domains.
I am on the road a lot so the frequency (and possibly intensity) of posts may decrease somewhat…

This entry was posted in open issues.

12 Responses to Big Science and Long-tail Science

  1. Hank says:

    “I shall be looking for my own sponsored hadron (I’ve never seen one)”
    You mean we can buy these for our kids? I think that is a terrific fundraising idea. If they can sell distant asteroids in space on the internet it should be simple to sell Baryons and Mesons; there could even be tours to the LHC when your Hadron is due to … whatever it will be doing in there.
    On your main point, you’re right. People will be suspicious of the large media companies/institutions – yet are oddly comforted by a government group like PubMed, which I don’t understand. I think small science by groups that care is the way to go.
    I have offered on multiple occasions to build and host just the kind of thing you are talking about, because I think open access/open science/science 2.0 is good for every discipline. So if you get any answers on this, and someone can come up with a realistic spec of what it needs to look like and how it will function, I still will. Getting people to help on this kind of project would be easy. Getting people to figure out what it should look like and abandon fiefdom mentality? Not so easy.

  2. Two things. One, I’m not sure the metalayers you mention care all that much about where the data lives. It may therefore make sense (assuming an ideal world in which IR software doesn’t suck eggs for data-management purposes) to leave IRs to do low-level storage, description, and curation work, and build metalayer tools that are relatively unconcerned about where they acquire the data they’re crunching. (This means notification systems not unlike TrackBack for weblogs, and better discovery systems, and better data standards in several disciplines, and… it’s a whole ‘nother discussion, but it’s not an impossible goal in my opinion.)
    Two, it would be nice to draw a distinction between IRs and the people who run them. IRs can’t do much, I grant you, and my own writings demonstrate that I’m as frustrated by this as anyone — indeed, more than most! The central locus of my frustration is that I can do much more than the software allows me to do, but as long as I’m shackled to bad software…
    There is hope, though possibly not for me. The various archery-named programs from Australia (along with RepoMMan) are a lot closer to what I envision, and I think you as well, for a repository with appropriate flexibility to deal well with diverse data in large quantities.

  3. pm286 says:

    (1) Thanks, Hank.
    Jim had the idea of spread-betting on where the first hadron might hit. And thanks for the offer – maybe a Wikipedia-like community could help here.
    (2) Thanks, Dorothea.
    The main problem IMO is the domain-specific expertise and software. Thus we have installed several specific searches on CrystalEye, including cell dimensions and substructure search. IMO that has to be layered on top of whatever the fundamental reposited object is.

  4. Right, Peter, I agree. But the search layer, which searches across multiple objects unless I am not understanding you correctly, doesn’t have to care where the reposited object is — it just has to know where to find it and how to get it. That’s completely separable from which repository it happens to be in. OAI-ORE is all about this kind of exchange.
    (With certain constraints, admittedly. If we’re talking the size of dataset that Google is currently sending around in suitcases, then a layer over far-flung repositories may not work because there isn’t enough bandwidth — though even then, some tricks may be possible via sending algorithms rather than data over the pipes, to be crunched through at the repository end and the results returned. I don’t, however, get the sense that you’re talking about anything that big.)

  5. pm286 says:

    (4) In principle I am happy for the material to be scattered over many repositories. BUT the domain layer has to be sophisticated. It doesn’t just have to know where the individual repos are, it has to understand what is in them and how to search them. And the more uncoordinated IRs there are the harder the semantic reconciliation.
    With text it’s fairly easy (at least for anglophones). Give a sighted human the text and it doesn’t generally matter what the machine semantics are. Print it out and that’s all.
    With machine semantics it’s a different matter. Does this file contain a molecule? A reaction? A spectrum? All that has to be done at the domain layer. If it isn’t then each IR will guess what labels to add, and more likely not do it at all.
    Yes, it will be possible for the IRs to add protein sequences and have them retrieved via ORE as it is. That’s because there are 30 years of work in standardising the sequences. So the search is fairly trivial (it can be text-based). (Actually it’s much harder because most people want fuzzy matches – a “sequence like this one” – and that requires locally installed programs.)
    Moreover many of these searches are complex and compute-intensive. They require installing in the repository. Even in eChemistry we shall have to work hard to make this effective. My own approach is to create a meta-index which is layered on top of the repos. But, hey, isn’t that what PubChem is? And – not surprisingly – PubChem is a central partner of our project. So, in a sense, PubChem is a domain meta-layer over our emergent repositories.

  6. Pingback: Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Travel update

  7. Pingback: Open Access Publishing in the Chemical Sciences

  8. Pingback: Introduction | Long tail science

  9. Pingback: Walking Randomly » EPSRC Research Software Engineering Fellow: Mike Croucher

  10. Pingback: EPSRC Study Application Engineering Fellow: Mike Croucher | A bunch of data

  11. Pingback: Walking Randomly » High Performance Computing – There’s plenty of room at the bottom

  12. Pingback: High Performance Computing – There’s plenty of room at the bottom | A bunch of data
