Daily Archives: August 7, 2007

caltech: (talk on) the power of the scientific e-thesis

Many thanks to Eric Van de Velde and colleagues for inviting me to Caltech to give a talk on the scientific e-thesis. Besides being excited about going to Caltech, I am delighted that they wished to record the presentation. (I gather this was technically successful). This is of enormous value to me as it is the only true immediate record of what I have said. When I give a presentation I organize my material as a large number (perhaps 10,000) HTML pages with associated images and possibly movies. To this I add various demos on my own PC - things like OSCAR, Bioclipse, etc. (I didn't get as far as showing Bioclipse this time). The pages are roughly arranged in a 2-level tree category/page but I cannot remember what all of them are.

The actual presentation depends on the makeup of the audience (I try to ask a few people or for a show of hands) to get an idea of the topics that will be most interesting. I have a chunk of topics I am definitely going to cover but otherwise it is often what comes to mind as I go through the talk. (There is always far too much that I want to say). So there is no clear navigation path. I rehearse the major components to make sure they are still working. Unfortunately every browser "upgrade" destroys more demos. (At some stage I expect my SVG to stop working). I have to use both IE (for the SVG animations) and Firefox (for DBPedia). I had hoped we had left "best viewed in browser X" in the 1990's but no such luck.

So there is no simplistic linear record in the slides. It isn't easy (or often meaningful) to leave "a copy of the presentation" as it consists of thousands of slides which make little sense without my commentary and a record of what was shown in what order. That's why it is so valuable to have regular recordings using video and audio as Caltech have provided. So many thanks.

FWIW I try to add something new each talk - each one is a moving average of past and future. For this talk I very happily discovered a Caltech thesis which used Jack Dunitz's ideas and work extensively. From my introduction:

Jack Dunitz postdoc'ed at Caltech 1948-1951 with Linus Pauling and Verner Schomaker. He pioneered the use of very accurate crystal structures in giving an insight into chemistry. Here is an example of a Caltech thesis (1992) which uses his ideas. See Chapter 9.

In 1973-4 I spent a sabbatical with Jack in Zurich showing how data from the literature could be combined to show new phenomena.

PMR: So I found a rather nice timeline. In the early 1970's Jack and co-workers would investigate chemistry by looking at very accurate crystal structures. These could take a month to complete. The thesis refers to their work on non-planar amide bonds where they would design a molecule, persuade a colleague to make it, and do the structure. It could take years to investigate the idea. When I visited in 1974 Jack and Hans-Beat Buergi had shown that it was possible to use structures already in the literature. I used this idea to look at the distortions of tetrahedral molecules (simulating the SN1 reaction). There were over 100 data points and each required me to locate a (physical) paper in a journal with geometry about the molecule. Often I would find a paper that looked promising only to discover that there were no data or it was too inaccurate. So it took several months to find enough papers with good enough data. I got to know the system of document retrieval rather well! Now, 30 years on, we have all the information Openly gathered in crystaleye. We can pull out tetrahedral molecules in seconds.
That still relies on the conventional publication process. If we can do the same thing for theses we shall have unlocked an enormous source of data.

I look forward to the recording and will alert you when it appears. I warned the audience that details were unpredictable and the machine would get tired towards the end. So in the spirit of Openness you will see it all as it happened!

scifoo: data-driven science and storage

I managed to get out to a few sessions at scifoo not related to my immediate concerns, of which two were on the Large Synoptic Survey Telescope and Google's ability and willingness to manage scientific data. They come together because the astronomers are producing hundreds of terabytes every day(?) and academia isn't always the most suitable place to manage the data. So some of them have considered/started shipping it to Google. Obviously it has to be Open Data. There cannot be human-related restrictions that require management.

Everyone thinks they are being overwhelmed with data. Where to keep it temporarily? Can we find it in a year's time? Should we expect CrystalEye data to remain on WWMM indefinitely? But our problems are minute compared with the astronomers', which are probably 3 orders of magnitude greater.

How would you obtain bandwidth to ship data to someone like Google? Remarkably the fastest way to transmit it is on hard disk. Four 750 GByte disks (i.e. 3 TB) fit nicely into a padded box and can be shipped by any major shipping company. And disk storage cost is decreasing at 78% per year.
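To put a rough number on "the fastest way to transmit it is on hard disk", here is a minimal Python sketch. The disk count and capacity come from the text; the 24-hour transit time is purely an assumption for illustration (the post does not say how long shipping takes), and the function names are hypothetical.

```python
# Rough "sneakernet" bandwidth estimate for a box of shipped disks.
# Disk count and size are from the post; TRANSIT_HOURS is an ASSUMPTION.

DISKS_PER_BOX = 4
DISK_CAPACITY_GB = 750      # decimal gigabytes per disk
TRANSIT_HOURS = 24          # assumed overnight courier, not from the post


def box_capacity_tb(disks=DISKS_PER_BOX, capacity_gb=DISK_CAPACITY_GB):
    """Total capacity of one padded box, in decimal terabytes."""
    return disks * capacity_gb / 1000


def effective_mbit_per_s(capacity_tb, transit_hours):
    """Average throughput if capacity_tb arrives after transit_hours."""
    bits = capacity_tb * 1e12 * 8
    seconds = transit_hours * 3600
    return bits / seconds / 1e6


if __name__ == "__main__":
    tb = box_capacity_tb()
    rate = effective_mbit_per_s(tb, TRANSIT_HOURS)
    print(f"{tb:.1f} TB per box, about {rate:.0f} Mbit/s sustained")
```

Under these assumed figures a single box averages a little under 280 Mbit/s, and the rate scales linearly with the number of boxes shipped at once. A slower courier reduces it in proportion.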

I'm tempted to start putting our data into the "cloud" in this way. It's Open, so we don't mind what happens to it (as long as we are recognised as the original creators). It's peanuts for the large players. If we allocate a megabyte for each new published compound (structure, spectra, crystallography, computation, links, etc. and the full-text if we are allowed) and assume a million compounds a year that is just ONE terabyte. The whole of the world's new chemical data each year can fit on a single disk! What the astronomers collect in one minute!
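The back-of-envelope estimate above can be written out as a short Python sketch. The one-megabyte allocation and the million compounds a year are the figures from the text; the box of four 750 GByte disks is the one described earlier; the function names are hypothetical and decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Back-of-envelope estimate of yearly new chemical data.
# Figures are the post's own; decimal units are an assumption.

BYTES_PER_COMPOUND = 1_000_000      # one megabyte per published compound
COMPOUNDS_PER_YEAR = 1_000_000      # assumed publication rate from the post
BOX_BYTES = 4 * 750 * 10**9         # four 750 GByte disks in one box


def yearly_terabytes(bytes_per_compound=BYTES_PER_COMPOUND,
                     compounds=COMPOUNDS_PER_YEAR):
    """New chemical data per year, in decimal terabytes."""
    return bytes_per_compound * compounds / 10**12


def years_per_box(box_bytes=BOX_BYTES):
    """How many years of new data one shipped box could hold."""
    return box_bytes / (BYTES_PER_COMPOUND * COMPOUNDS_PER_YEAR)


if __name__ == "__main__":
    print(f"{yearly_terabytes():.1f} TB per year")
    print(f"one box holds {years_per_box():.1f} years of data")
```

On these numbers a single shipped box would hold about three years of the world's new chemistry.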

But before we all rush off to do this we must think about semantics and metadata. The astronomers have been doing this for years. They haven't solved it fully, but they've made a lot of progress and have some communal dictionaries and ontologies.

So we could have all the world's chemical information on our desktops or access it through GYM (Google/Yahoo/Microsoft).
I wonder why we don't.

scifoo: Open Science

One of the themes at scifoo was "Open Science" or "Open Notebook Science" - the latter term coined by Jean-Claude Bradley. The idea is that science is publicly recorded as it is done. The very first bottom-up session (i.e. Saturday morning) was run by J-CB and Bora Zivkovic of PLoS ONE. Here are two comments:

Corie Lok et al. Scifoo: day 1; open science

It’s late and so I’ll keep this short. I’ll write more detailed accounts of Scifoo soon, but here are some highlights so far.

My day today started off with a contentious talk about open science. It quickly veered off into a complaint session about how the slow publication process in biology and the fear of not being credited and of being scooped are hindering open science (putting prepublication info and data online). But the physicists in the room quickly got annoyed by the complaining (not exactly new complaints either) and so the discussion got back on track to focus on current efforts to put more data and discussion of prepublication research online (such as Jean-Claude Bradley’s open notebook efforts). The session set the stage for several other related ones later in the day. It also spawned one taking place tomorrow about the culture of fear among young scientists: fear of doing open science, at the risk of jeopardizing career prospects. I’ll definitely be at that one. For another perspective on this session, check out Anna’s post on it.

PMR: here's Anna's post:

Swimming in the Ocean

Saturday, 04 Aug 2007 - 22:46 GMT
Have you heard the expression “small fish in a big pond”? I have an updated version. How about, “plankton in an ocean”? That’s me. I am the plankton, spending the weekend with CEOs of major corporations, editors in-chief, a couple Nobel prize winners, people advancing science and media in ways I can hardly comprehend… and Martha Stewart. That, in a nutshell (or an ocean, as the case may be) is Science Foo Camp, where I am currently sitting with mouth hanging open and ears open wide.

One of the major themes of this free-form gathering has been open access publishing. In a group discussion led by Bora Zivkovic of PLoS ONE, tempers flared (which made it even more fun than staring at science celebrities), and the many complications, pros and cons of open access were raised. Does the term “open access” refer to pre- or post-publication open access? Is it open, non-peer reviewed publication of articles or even complete lab notebooks, or access to reviewed, published articles free of charge? That aside, will open access publishing negatively affect the hiring potential of young faculty looking for tenure track positions or funding from organizations such as Wellcome Trust and the NIH?

What about intellectual property? How does one protect findings aired in a public forum? One attendee replied that you don’t, it doesn’t matter, it should all be free and open. As much as I personally admire this free love, Birkenstock/Woodstock approach to science and research, I do not believe it to be feasible at the moment. Science is run by money. In order to get money or funding, one must publish. The changes and minor revolutions in that need to occur in publishing before the concept of the science paper becomes obsolete are staggering. They are also occurring as we speak.

Back to gaping at people far smarter than me.

PMR: and comments to Anna (so far):


  • Bora Zivkovic said:
    Small fish? No way – I was very excited to get to meet you in person.
  • Anna Kushnir said:
    The pleasure was all mine. I am happy I got the chance to meet you!
  • Jean-Claude Bradley said:
    Concerning the question of intellectual property, I am guessing that you are referring to my comment. I was not saying that all research should be open and free – just that people who are interested in intellectual property protection should probably not do Open Notebook Science. And this is no different than in the traditional publication process. People who are interested in intellectual property should not publish manuscripts without filing a patent (at least a provisional US patent). This is an expensive route and completely unrealistic for most scientific research projects. Money is not the sole motivation of scientists. If that were the case who would study fields like archaeology and cosmology? I wish that we had more time to discuss these issues during the session.

  • Deepak Singh said:
    I think the IP issue didn’t get brought up enough, especially with the peer2patent and other IP types there. In many cases the flaws are not in intent, but in the system itself. That said, I think as a community, we know what the problems are. We should just focus on solutions rather than trying to go into what’s wrong in excruciating detail :)

PMR: and Duncan Hull

9.30am: open science 2.0: where we are, where we're going

After breakfast at Googley's, I head off to a session on Open Science 2.0. This session is game of two halves, the first half there is much talk of how publishing is a roadblock to many things we would like to achieve with science on the web. Peter Murray-Rust talks of "conservative chemistry", where (un-named) publishers are the problem, not the solution and block the whole of the University of Cambridge for accessing content in unapproved ways (text-mining). Paul Sereno and Chemist Carl Djerassi discuss the importance of publications in getting jobs and tenure at Stanford. There is talk of the dangerous power of editors of journals, who ultimately decide careers that they are blind too. They don't just accept papers when they publish, they make and break people's livelihoods. Andrew Walkingshaw tells of a common perception amongst young scientists about the importance of the h-index and other publication metrics. Eric Lander points out that publication isn't everything for young scientists, a lot of it comes down to letters of recommendation in job applications and this fact is often overlooked by young scientists. Pamela Silver talks of how the publish or perish mentality is slow like molasses, and sends many talented young scientists at Harvard running and screaming from academia into the arms of anywhere else that will have them, which is a great loss to science. We move on to Open Access, Tim Hubbard, head of informatics at Sanger tells how the Wellcome Trust insists any publications that arise from its funded research projects must be freely available within six months after publication. Jonathan Eisen talks of different types of open access, which is not just about reading papers for free, but reusing them for free too, as in Creative Commons. Somebody possibly Richard Jefferson, talks of a reputation engine called Carmleon? (not sure of spelling).

All of this makes young scientists risk averse and paranoid, which is bad. The only people who can take risks are established scientists, which is a shame. But the discussion takes a u-turn when Paul Ginsparg (arXiv.org) and Dave Carlson, point out we should be having fun not moaning about publishing. We didn't all come here to whinge, we should be talking about the technology that will enable us to break the publishing roadblock and make science a better place to live, work and play. On this note, Bora Zivkovic tells of publication turnaround times at PLOS, which are now "9 weeks not 9 months". This is great for young scientists, who often don't have time to wait for the glacial turnaround times of many publishing companies. He asks what would cyber infrastructure look like in 2015? Jean-Claude Bradley, gives a demo of Usefulchem, see for example this experiment tools like blogs and wikis will play an important contribution in this area.


Science is becoming more open, but it will be a slow evolution not a rapid revolution. We're heading in the right direction, some of the tools for doing it are beginning to work. PLOS asks people to be courageous and send their papers in, this can be a gamble, when scientists often favour the old favourites of Nature, Science and PNAS. This session was typical of scifoo, its a mashup of different ideas from very different people working in different areas. It doesn't always summarise neatly, but thats life. A session on this came later on, called the Culture of Fear: led by Andrew Walkingshaw and Alex Palazzo.

PMR: The session didn't go as planned - JCB had produced material to demonstrate and didn't get to show it till near the end. The meeting got hijacked by the theme of Open Access and I helped in the hijack when I probably should have stayed quiet. It meant that we didn't explore the bright future but reiterated the less inspiring present. But somehow that was the burden that a lot of people had brought with them. Scifoo doesn't run on predictable lines and one good thing was that Alex and Andrew were inspired to run a session (young scientists and the culture of fear) they hadn't planned to when they came.

"Open Science" is a concept whose time has arrived. I prefer "Open Notebook Science" because there is less chance of confusion with other terms which have nothing to do with the concept. Under Open Research WP has a stub which lists a few examples - add some more.

scifoo: young scientists and the culture of fear

On the last day, and as an inspiration from the previous sessions and the community atmosphere of the meeting, Andrew Walkingshaw and Alex Palazzo ran a session on the problems of being a postdoc under the pressure of having to publish in high-impact journals. They explained how the very high sense of competition and the pressure of conformance to a single measure of success constrained innovation - their sense of concern came through very clearly. Here are their blog entries (AW first):

The Scifoo nature

Scifoo was a blast.

Alex Palazzo and I ran a session today on the politics of scientific communication/open access, particularly for young scientists: he writes about our thoughts here. I was really delighted with how it went; many people, including some very successful academics and editor-in-chief of Nature, Philip Campbell, came along and shared their thoughts.

There’ll be more on what we actually discussed in due course, but the thing happening was itself staggering; from half-formed idea to a really deep round-table discussion in less than forty-eight hours. Creating a space where that can happen is priceless; I can’t thank the organisers enough for inviting me, and, equally importantly, everyone there for their generosity of spirit and openness.

PMR: Then AP. Read this in full, and also the commentary it has generated (and may continue to generate):

Scifoo - Day 3 (well that was yesterday, but I just didn't have the time ...)

Category: art, food, music, citylife and other mental stimuli
Posted on: August 6, 2007 10:46 AM, by Alex Palazzo

Our session on Scientific Communication and Young Scientists, the Culture of Fear, was great. Many bigwigs in the scientific publishing industry were present and a lot of ideas were pitched around. I would also like to thank Andrew Walkinshaw who co-hosted the session, Eric Lander for encouraging us to pursue this discussion, Pam Silver for giving a nice perspective on the whole issue, and all the other participants for giving their views.

Now someone had asked that we vlog the session, we actually tried to set it up but we didn't have the time. In retrospect I'm glad we didn't. This became at the last session of scifoo where attendees voiced their comments on the logistics of scifoo, many conference goers preferred to keep video and audio recording devices away from the sessions as they impede open discussion. Conversations off of the record can be more honest and more productive.

So about the session ...

The main point that we wanted to make was that there are problems with the current way that we are communicating science and due to developments with Web2.0 applications there is a big push to change how this is done. But we must keep in mind the anxieties and fears of scientists. How we communicate does not only impact how information is disseminated but does affect the careers of the scientists who produce content. Until there is general consensus from the scientific publishing industry, the major funding institutions, and the higher echelons of academia (for example junior faculty search committees), junior scientists are unlikely to participate in novel and innovative modes of scientific communication. The bottom line is that it is just to risky to do so.

There are two main areas that remain to be clarified by the scientific establishment.

1) Credit. How do we ascertain who deserves credit for an original idea, model or piece of data.

2) Peer-review. Although most scientists and futurists who promote much of the open-access model of scientific publishing support some type of peer-review where the science or consistence of a particular body of work is evaluated, there remains some confusion as to whether peer-review should continue to assess the "value" of a particular manuscript. Right now, manuscripts that are submitted to any scientific publication must attain some level of importance that is at least equal to the standards of that particular journal. When evaluating the scientific contribution of any given scientist, close attention is payed to their publication record and particularly where their manuscripts are published. Now whether we should continue to follow this model where editors and the senior scientists determine the scientific validity of any given manuscript is being questioned. In an alternative model many technologists are pushing post-publication evaluation processes which evaluate the importance of any single manuscript after the manuscript has been released after minimal peer-review. These not only include citations indices, but also newer metrics that are currently being developed by many information scientists. There are many problems with these systems, the most critical is that most of the value cannot be assessed until many years after the publication date. An important piece of work may take years to have an impact in a given particular field. Until the scientific establishment reaches a consensus as to whether these post-publication metrics are indeed useful for determining the credentials of a scientist in the shorter term (<2 years post-publication) it is unlikely that any scientists would risk publishing their findings in a minimally peer-reviewed journal.

There was a strong feeling that the top journals do provide a valuable filtering service. They go through all the crap in order to publish the best work. OK they don't always succeed but competition between all the big journals promotes a high standard. And many scientists are reluctant to give up this model. Journals also help to improve the quality of the published manuscripts, this function would be lost if all we had was PLoS One and Nature Precedings. To all those who think that journals must be eliminated in favour of an ArXiv.org model you are now warned.

PMR: I kept quiet during this session - I have no easy answer. It's clear that the pressure to get scientific jobs is increasing - whereas not so long ago institutions could choose from those they knew (with all the pluses and minuses) now they try to create a "level playing field". And what measure do they have when everyone has rave references? It's difficult not to count the numbers. We did hear that one leading systems biology lab did not simply look at publications but wanted to choose people who could provide a major shift in emphasis and might have a relatively unconventional paper trail. But it's not common.

Much credit to Alex and Andrew for their bravery in running this session, and to scifoo for it being the sort of place where it could happen.

scifoo: blogsession

As I've mentioned, at scifoo the programme was evolved by the participants in a first-come first-accepted process whereby we signed up for free slots. It was hardly surprising that the blogosphere gained a slot and on Sunday we found a community of about 10-15 bloggers discussing how and why they did it. Here are some of the blogs that scifoo members have created, some of whose authors were at the session. (Andrew Walkingshaw created a PlanetScifoo, the aggregation of the blogs updating every half hour). Nothing special about my selection... they weren't all at the session.

So we spent an hour talking about why we did it - what we got out of it - etc. At one end are the compulsive writers - Henry Gee explained how he couldn't help blogging - it was in the journalistic tradition. I sometimes feel like this but not to the extent I am driven to communicate something whatever. Many of us feel we have an "audience", community, whatever with whom we have a fragile rapport. Some bloggers get a lot of feedback, others very little. Often we are dependent on real-life contacts for feedback (I generally get little unless I unwittingly or otherwise turn up the "outrage button" and find out who is at the other end). Many bloggers who act as transducers for the immediate are appreciated by their following - the stream of consciousness of unprepared commentary on the world makes contact.

Some - such as Richard Akerman - have been blogging for years, others like me have yet to reach their first blogversary. Some, especially those in clear employment (e.g. publishers), have boundaries that should not be overstepped. What the boundaries are is not always clear. Some have more than one blog - a day blog and a more anonymous non-work one. Some feel "soft constraints", especially when they are partially hosted by - say - a publisher's umbrella. But I think most would be prepared to speak their mind - here (A Letter to Martha) is Anna Kushnir criticising Martha Stewart for failing to live up to the promise of Scifoo. (I wasn't there, but it sounds like a valid comment).
So blogs are of all sorts. Mine has a life of its own.