Data repositories for long-tail science: setting the scene?

I’m assuming we all believe that we need data repositories for science; that there are about 10 different reasons why (not all consistent); and that many of us (including communities I am in) are starting to build them. So what should they be like?

I’m talking here and in the future about long-tail science – where there are tens of thousands of separate laboratories with no infrastructural coordination and with a huge variety of interest, support, antagonism, suspicion, excitement and boredom about data. A 6-10 person lab which analyses living organisms, or dead organisms, or properties of materials, or makes chemicals, or photographs meteorites, or correlates economic indicators, or runs climate models, or studies nanotubes or… Where the individuals may work together on one or more projects, or where each worker has a separate project. Where there is a mix of postdocs, technical staff, central support, graduate students, interns, undergraduates, visiting staff, public projects, private projects, commercially exploitable material. With 3-month projects, 3-year projects, 3-decade projects. With ground-breaking new science, minor tweaks to existing science, application of science to make new things or develop new processes. Not usually all in the same lab. But often in the same department.

Indeed almost as diverse and heterogeneous as you can imagine. The common theme is that people create new raw stuff through lab experiments, field observations, computer simulations, analysis of existing data, or text. We’ll call the non-material part (i.e. the bits and bytes, not the atoms or photons) of the raw stuff “data” (a singular noun). This data is very important to each researcher, but they have generally had no formal training in how to manage it. They base their “data management policy” on the software already on their machine, what their neighbours suggest, what they see on the public web, and what last year’s student submitted for their thesis.

And unfortunately data are often very complicated. So generic data management systems are likely to be either very abstract, or complicated, or both.

So here I’ll try to suggest some views as to how long-tail scientists regard data… I’ll also treat it as a 3-4 year problem – the length of a typical PhD thesis.

  • At the start data is not an issue. Undergraduate work has designed the environment of an experiment so that you only capture and record a very small amount of stuff. In many PhDs you aren’t expected to start collecting data at this stage. You are meant to be reading and thinking.
  • You read the literature. In the literature data is a second-class citizen. It’s hidden away, never published. Maybe you read some theses from last year. They have a bit more data in them. Usually tables. But it’s still something that you read rather than interact with. There are also graphs and photographs of stuff. They are self-consistent and make sense (they have to make sense to the examiners).
  • You learn how to use the equipment, or grow the bugs, or grow the crystals or collect fruit flies or photograph mating bats or whatever. Sometimes this is fun; sometimes it doesn’t work. You’ve been trained to have a lab book (a blue one with 128 pages with hard covers and “University of Hogwarts” on each numbered page.) You’ve been trained (perhaps) to write down your experiment plan. This is generally required if you work with stuff which has legal or mortal consequences if you do it wrong. Hydrogen peroxide can be used for homeland insecurity. In some cases someone has to sign off what you say you are going to do.
  • Now you do your experiment. You write down – in ballpoint – the date. Then what you are doing, and what happened. And now you have got some stuff. You record it, perhaps as a photograph, perhaps as a spectrum, perhaps in a spreadsheet if it changes with time. Your first data. By this time you are well into your PhD. You’re computer-literate so you have it as a file. But you also have to record it in your lab-book. So, easy – you print it out! Then get some glue and glue it into the book. Now it’s a permanent record of the experiment. [For additional fun, some glues degrade with time, so by the third year all your pasted spectra fall out. Naturally you haven’t labelled which page they were stuck to – why would you? So you have to make an educated guess as to where they used to be.]
  • Oh, and that file with the spectrum in? You have to give it a name – so “first spectrum” and we’ll put it on the Desktop because then we know where to find it. At least it’s safe.
  • 6 months, and the experiments are doing well. Now your desktop is pretty full, so we’ll make the icons smaller. They are called “first spectrum after second purification” and so forth. You can always remember what page in the lab book this related to.
  • A year on, and you’ve gone to the “Unseen University” to use their new entranceometer. The data is all collected and stored on their machine. You get a paper copy of each experiment for your book. There is no point in taking a copy of the data as it’s in binary and the only processing software is at UU. And you are going back next year so any problems can be sorted then.
  • Two years on and you are proud of the new system you have devised. Each bit of stuff has a separate name. Something like “carbon-magnesium-high-temperature/1/aug/200atmosphere/version5”. You’ve finished this part of the study and you need a new machine. So you save your data as 2010.zip. Your shiny new machine has 5 times as much diskspace so you copy the file into “old machine”/2010.zip. It’s safe.
  • Three years on. Time to start thinking about writing up. The entranceometer stuff has been reported at a meeting and got quite a lot of interest. Your supervisor has started to write a paper (some supervisors write their students’ papers. Others don’t). This is good practice for the thesis. You give him the entranceometer diagrams. The paper is sent off.
  • And reviewer 2 doesn’t like the diagram. They’ve used a different design of entranceometer and it plots the data on logarithmic axes. They ask you to replot.
  • What to do? You have the data, so we’ll replot it. Where is it? “old machine”/something_or_other. Finally you find the zip file.
  • It doesn’t unzip – “auxiliary file missing”. You have no idea what that means. So let’s mail the UU quoting the reference number on the printed diagram. After a week or so no answer, so try again. A mail arrives with a binary file: “these are the only files we could find”. You unzip it – “imaginary component of Fourier transform missing”. Basically you’re stuffed. You cannot recompute the diagram.
  • Then you have a bright idea. You could replot the spectra by hand. By measuring every point on the curve you get X and Y. And they want logX. Your mate writes a Javascript tool that reads points off an image and records the X-Y values (a sketch of such a tool follows this list). So you can digitize the spectrum by clicking each point. It only takes 2 hours per spectrum. There are 30 altogether. So you can do it in under a week if you spend most of the day working on it…
  • Now this is not fun, and it’s not science and it’s against health and safety. But it will get the data measured for the paper. And you are now knackered.
  • Wow – the paper got a “highly accessed”. (You can’t actually read it because you’re now visiting a lab which doesn’t subscribe to that journal, so it will have to wait till you can read your own paper.)
  • And now the thesis. It’s a bit of a rush because you had to present the results at a conference because the boss said so. But you got a job offer – assuming you finish your thesis.

  • Help, what does this file mean: “first compound second version really latest version 3.1”? What type of data is it (it doesn’t seem to have an extension)? And should you not use “first compound (version 2) first spectrum”? You can’t find the dates because when you copied the files they all got changed to the date of copying, so they all have the same date. So you talk to the second-year PhD student. “One of the files was run after the machine software changed; which is similar to yours?” “Ah, I have only seen that type.” “Thanks, this must be the later one, I’ll use it.”
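A tool of the kind the story describes is easy to sketch. Here is a minimal point-digitizer in Python/matplotlib rather than Javascript; the file name and the axis-calibration numbers are hypothetical stand-ins for values you would read off the printed axes.

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def pix_to_data(p, p0, p1, d0, d1):
    """Linear map from a pixel coordinate p to data coordinates, given two
    reference pixels (p0, p1) whose data values (d0, d1) are known."""
    return d0 + (p - p0) * (d1 - d0) / (p1 - p0)

img = mpimg.imread("scanned_spectrum.png")  # hypothetical scan of the printout
plt.imshow(img)
plt.title("Click along the curve; press Enter when done")

# ginput collects one (x, y) pixel pair per mouse click
clicks = np.array(plt.ginput(n=-1, timeout=0))

# Calibration pixels read off the printed axis labels (illustrative values)
x = pix_to_data(clicks[:, 0], 60, 940, 1.0, 500.0)
y = pix_to_data(clicks[:, 1], 700, 80, 0.0, 1.0)  # image y runs downwards

np.savetxt("digitized.csv", np.column_stack([x, y]),
           delimiter=",", header="x,y", comments="")
plt.figure()
plt.semilogx(x, y)  # the logarithmic replot the reviewer asked for
plt.show()
```

Even with such a tool, each spectrum is still hours of clicking – which is exactly the point of the story: the data existed digitally once, and was thrown away.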

So, as I continue to stress, the person who needs the data management plan is the researcher themselves. Forget preserving for posterity if you cannot preserve for the present. So let’s formulate principle 1:

“the people who need repositories are the people who will want to put data into them”.

The difficult word is “want”. If we can solve that we have a chance of moving forward. If we can’t we shall not succeed.

 


How much scientific content is there in IRs?

I have suggested (/pmr/2011/08/14/institutional-repositories-are-they-valuable-to-scientists/) that Institutional Repositories are not valuable for scientists. Chris Rusbridge, who used to run the Digital Curation Centre, has commented. I am replying here, rather than in the thread. My argument is that there is little content in most IRs (Soton, UCL and probably CompSci depts. are exceptions) and a fortiori even less scientific content. I have done a simple analysis to back this up.

 

[CR] August 14, 2011 at 8:23 pm
[IRs] are set up to meet rather different aims [from mine], which they do more or less well. I don’t think that means that they are not useful to scientists; indeed in my review of the repository at your university, I met researchers who claimed that repository was an essential tool for them. One told me it would “shatter his work” if it went away. As it happened, their needs were more closely aligned with the repository than yours appear to be.

PMR: I don’t doubt that there are a few places where repositories meet the needs of a few scientists, and a few places where the University (e.g. Soton) has put in significant resources and effort so that there is a critical mass of users. But the use of IRs by everyone, let alone scientists, is very small. Let’s take the Russell Group (the UK’s “top” 20 universities) and look at their contents (items deposited, taken from the latest index of OpenDOAR – http://www.opendoar.org/index.html). I have searched for the name of each University and taken the IR which is most obviously the “central” one. Some figures may not be up to date, but in the absence of any useful TOC/index I have to take what is there.

University of Birmingham 730 items

University of Bristol 1208

University of Cambridge 190836 (almost all mine)

Cardiff University 757

University of Edinburgh 4124

University of Glasgow 21604

Imperial College London 1217 (of which ca 1000 are theses not on public view)

King’s College London (CompSci) 1984

University of Leeds + Sheffield + York ~8000

University of Liverpool 598

London School of Economics & Political Science 19228 + 72 theses

University of Manchester (Maths) 371

Newcastle University 6100

University of Nottingham 801

University of Oxford 2730

Queen’s University Belfast (Pol, Phil) 87

University of Southampton 13 repos, ca 60,000 items

University College London 196442

University of Warwick 3844

The median is around 2000-3000 items, over I suspect about 5 years.
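A quick check of that median, taking the approximate figures above at face value (Leeds/Sheffield/York as one ~8000 entry, Southampton’s 13 repositories as one ~60,000 entry):

```python
import statistics

# Item counts as listed above; LSE counted as 19228 + 72 = 19300.
counts = [730, 1208, 190836, 757, 4124, 21604, 1217, 1984, 8000,
          598, 19300, 371, 6100, 801, 2730, 87, 60000, 196442, 3844]
print(statistics.median(counts))  # -> 2730, i.e. "around 2000-3000"
```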

 

[CR] Despite your gloom, I suspect repositories are more useful than you think. For a start, when you do a Google Scholar or equivalent search for a paper, you will often end up at a version in a repository. It may have been self-archived, it may have been deposited by faculty or library staff on behalf of the author, it may even have been harvested from elsewhere. But there are enough (at least in the areas I look in) for it to be a surprise not to find a version. Some institutions (notably Southampton in the UK, and Michigan, MIT and others in the US) have been markedly successful in getting content into their repositories, and it reflects well on them.

PMR: I don’t think the UK figures above bear that out. They suggest a median of perhaps 500 items per year. Even if all of these are papers (which I doubt – probably half are theses, plus an unknown amount of grey material) then that’s less than a paper per staff member per year. That means it’s extremely unlikely that any given paper will be in the repository. In computer science, perhaps. But not in bioscience or chemistry or materials.

[CR] Most institutional repositories are not data repositories; Cambridge is unusual in focusing more on data (well, on scholarly materials) than on outputs in the form of articles etc. Institutional data repositories are not yet widely available, but they are coming.

PMR: And, even if I were to believe that IR-data is coming, my argument is that this is the wrong way to do it.

[CR] Even they, however, are not particularly likely to provide “data publication and storage at all stages of the scientific endeavour”. I have argued in the past that they should (see posts in the Digital Curation Blog on “negative click repositories” etc.), but in practice most centralised repositories are likely to focus on fairly static data for some time to come, for purely practical reasons. Personally, I think departments and research groups should be building more dynamic repositories and databases to support the earlier stages.

PMR: I’d be surprised if scientists used IRs for data, given that they don’t use them for full-text.

[CR] I also don’t believe that “single-domain” repositories will support all stages of the science endeavour. The best developed field here is bio-informatics, where you see a multitude (>1,000) of databases mostly highly specialised to the needs of certain data types. The more you cross institutions, the harder become some of the problems, particularly sustainability. Many of those 1,000 databases are precariously funded, despite their evident importance.

PMR: I agree that this is precarious. If IRs are less precarious (i.e. they have sustainable funding) then they should be used for domain-specific data on a global basis.

[CR] Some of your other comments are spot on. The repository movement really missed a trick early on in declaring the licence under which material is made available. Even though I pressed my institution, the best they could offer (with the resources they had) was to put up future content with a CC licence. And I also agree that the APIs could be much better; OAI-PMH was built around a two-layer architecture of data providers (repositories) and service providers… very few of whom materialised or are used. I just don’t know whether OAI-ORE is useful for the kinds of things you want to do.

PMR: I agree that this is all difficult, especially the sustainability. I shall be developing this in future posts.

 

 

 


Data publication: some replies

I have had several comments on blog posts about publishing data. Since comments are not very common I’m replying in a post rather than in the thread. They raise important points of general interest, and I’m grateful for the opportunity:

  1. The value of raw data in preventing fraud. Carlos says: August 9, 2011 at 3:49 pm

    I am a strong advocate of the need to deposit the experimental data (digital) for many reasons, such as those you have discussed in your blog. This would have certainly helped to clarify, once and for all, the famous debate with Hexacyclinol (there is much written about it, including my own humble contribution, http://nmr-analysis.blogspot.com/2011/03/hexacyclinol-nmr-spectra-vs-plain.html), but from my point of view, it does not mean that this data cannot be easily manipulated at convenience. For example, in the case of 1H-NMR, I could synthesize very easily a spectrum digitally adding the 13C satellites in their proper place, and change the line widths of the signals (to take into account the different relaxation times), add Gaussian noise, etc. If your aim is cheating, there is now perfect digital “Tippex” for that.

    [PMR>> I have never said it is impossible to fake data. However it requires a lot of different skills. Not everyone can do what you have suggested. Moreover even this can leave subtle signals in the various moments of the data. The data have to be consistent with the known chemistry of the system. Fake data can be “too good to be true”. And if chemists were to publish their spectra digitally then there would be a great deal of prior information – as the crystallographers have.]


    The other point I would like to mention is about the format of the digital data. I think that the proper way to deposit data is by using the original acquired data. Depositing other formats (e.g. JCAMP) is fine if this is done as something supplementary (e.g. for display purposes), but this should not replace the original data, otherwise there will be a loss of information. For example, it is possible that the JCAMP file does not include all the information contained in the original data and that this information can be very important in the future (I have found this problem too often).
    Files in JCAMP format will compress relatively well, but a good compression rate is not so easy to achieve with original data (e.g. raw FID).

    [PMR>> This is an ongoing issue for many data-rich fields. In crystallography the first deposition was the coordinates, then at a later date the anisotropic temperature factors, then the structure factors and now the actual diffraction images. The technology advances continuously. So while I agree the FID is ideal, the real-space spectrum is still extremely valuable and will serve 99% of what is needed. Note also that the FID acts as a complete check that the author has NOT fudged their data – it is probably impossible for anyone to change the raw data without leaving traces.]

  2. Journals supporting supplemental data files. Richard Kidd says: August 9, 2011 at 4:02 pm

    Hi Peter

    RSC have always been more than happy to host the original data as ESI (massive files excepted), and in addition of course anyone can deposit spectra with ChemSpider. For the journals ESI we do at least need a pdf in addition for the peer review process, but that doesn’t exclude the original data in any way.

    No reflection on Figshare though, all credit to Mark

    [PMR: Thanks. This is good to know – that the RSC will act to publish data files in the traditional way]

    Also spotted this: http://acscinf.org/meetings/242/242nm_CINF_abstracts.php#S42

    [PMR: Yes, NIST have been working for several years with publishers to deposit thermochemical data (ThermoML) and have managed to build a model where all data is fully OKD-Open. Kudos to them and to the various journals: Journal of Chemical and Engineering Data, Fluid Phase Equilibria, The Journal of Chemical Thermodynamics, International Journal of Thermophysics, and Thermochimica Acta.

    When an author submits a new manuscript to these journals, it will be reviewed by NIST in two stages. The first stage provides to Editors a NIST Literature Report and the second stage provides a NIST Data Report. These two reports are generated on demand by new tools that NIST recently incorporated into ThermoData Engine (TDE) software. The literature report assists Editors and reviewers with their assessment of the manuscript’s scientific contribution, the degree of overlap with published data, and the need for comparison with those published data. The second stage of NIST review occurs just after peer review is completed and prior to the Editors’ final decision. The data report provides a complete assessment of data quality, the underlying uncertainties, the sample descriptions, and the descriptions of experimental methods.

    It’s an excellent example, like crystallography, of real data curation and publication. The data are thoroughly checked by numerous mechanisms for internal and external consistency. It’s a good example of a community which values data at least as much as, if not more than, full-text.]

  3. Standards. Steven Bachrach says: August 9, 2011 at 5:27 pm

I want to point out that NISO and NFAIS have a working group on standards and best practices for handling supplementary materials. The group is still knee-deep in work and a ways from reporting out any recommendations. http://www.niso.org/publications/isq/free/Beebe_SuppMatls_WG_ISQ_v22no3.pdf.

[PMR: Power to their effort. I hope they swap ideas with Iain H’s group. Here are some of the points from the report:

Some general issues for potential Recommended Practices. Among them are the following:

  • Clear, consistent indicators of content
  • Metadata needs
  • Universal agreement on citation practices
  • Consideration of use of the DOI
  • Potential cost recovery
  • Common vocabulary
  • Peer review
  • Preservation and interaction with repositories
  • Archiving
  • Clearly defined specific responsibilities for the parties]

 

 

 

     


Publishing Data: The long-tail of science

I am going to explore aspects of “publishing data” in STM disciplines and probably run to several posts. This will specifically cover the “long-tail” rather than “big-science” (such as high-energy physics, satellite surveys, climate models, sky surveys, etc.). In big science the data are often collected as part of a large project which has a data management process and specific resources for doing that. These data sets are often huge (petabytes) and require special resources to manage them. I also exclude data collected in major facilities such as neutron or X-ray sources as these usually have good support for data.

By contrast long-tail science represents the tens of thousands of ordinary laboratories – often “wet” – where data are collected as part of the experiment but where there is no central planning of the projects. They range over biomedical science, chemistry, materials, crystallography and other disciplines. There is no clear borderline, but we’ll use these broad categories. The amount of data, and its heterogeneity, vary hugely.

We are now seeing a major cultural change where many funders and some scientific societies are pushing for the publication of data. There are several motivations, and I have discussed some of them in this blog, but they include:

  • A resource to validate the experiment and to prevent careless or fraudulent science
  • A resource for the community to reuse, through aggregation, mashups and to act as reference data

The problem is that there is no simple universal way to do this. The bioscience community has pioneered this – NCBI/Pubmed, EBI and other data centres have many databases and some disciplines require deposition. Despite these being (inter)national centres there is a continuing concern about funding, as it normally has to come from specific grants made for fixed periods. There have been times when key resources such as Swissprot and PDB have had very uncertain futures.

One common model has been to associate data publication with “fulltext” publication as supplemental/supporting data/information. This has been financed – in many cases for over 15 years – by marginal costs within the conventional model – either through author-side fees or subscription. This is possible because the actual costs of long-tail data are also marginal:

  • They are not normally peer-reviewed (they *should* be, but that’s a different matter)
  • They can be trivially transformed into flat files which require little management and where the data are opaque
  • The technology for publishing them is simpler than the main fulltext – indeed almost costless.
  • The actual physical storage costs almost nothing
  • There is no expectation of sustainability beyond that expected for the fulltext

Kudos here (in chemistry at least) goes mainly to society and/or open-access publishers, and to Nature. But not to Elsevier, Wiley and Springer, which seem to have less commitment to maintaining the data record of science. There is a lot of illogicality – the Journal of Neuroscience killed its supplemental information while the proteomics community followed Mol. Cell. Proteomics and insisted on data publication (in a repository). The Am. Chem. Soc. requires crystal structure data but refuses spectra on the basis that it is not a data repository.

Over the last few days I have had conversations and mail which suggest there is a groundswell of people wishing to solve this problem – I reiterate, for the “long-tail”. We’ve discussed, and will continue to discuss, Figshare. Here’s an extremely encouraging meeting run by Iain Hrynaszkiewicz two months ago and just published (http://blogs.openaccesscentral.com/blogs/bmcblog/entry/report_from_the_publishing_open), with attendees: Alex Ball (UKOLN), Theo Bloom (Public Library of Science), Diane Cabell (Oxford Internet Institute), David Carr (Wellcome Trust), Matt Cockerill (BioMed Central), Clare Garvey (Genome Biology), Trish Groves (BMJ), Michael Jubb (Research Information Network), Rebecca Lawrence (F1000), Daniel Mietchen (EvoMRI Communications), Elizabeth Moylan (BioMed Central), Cameron Neylon (Science and Technology Facilities Council), Elizabeth Newbold (British Library), Susanna Sansone (University of Oxford), Tim Stevenson (BioMed Central), Victoria Stodden (Columbia University), Angus Whyte (Digital Curation Centre) and Ruth Wilson (Nature Publishing Group). Notice the wide spread of interests; this groundswell should be taken seriously. Read the report first. I’ll just highlight a little and add my own comments:

Goal 1: Establish a process and policy for implementing a variable publishers’/authors’ license agreement, allowing public domain dedication of data and data elements of scientific articles

PMR: Critical. Solving the licence problems before you start saves years later (this has been a main problem with “Open Access”). At least today no-one is arguing that data are intellectual property belonging to one or more of the parties, around which walled gardens and paywalls could be constructed.

Goal 2: Consensus on the role of peer reviewers in articles including supplementary (additional) data files

PMR: This is objectively difficult and needs a consensus approach. Some reviewers, especially those who suspect the validity of the publication, will wish to use the data as a check. But there are often problems of software and expertise – how many non-chemists would understand CIF, MOL, CML or JCAMP files? How many can read Matlab files without the software? Etc.

Goal 3: Sharing of information and best practices on implementation of journal data sharing/deposition policies

PMR: There will probably be a variety of ad hoc solutions – no single approach suits everyone.

Recent conversations suggest that many funders and publishers are exploring data publication as first-class citizens. I’ve had talks about

  • data-only journals
  • publications where the data and full-text occur side by side
  • publication as a continuous record of the science
  • domain-specific repositories

and more.

The point is that the world is now starting to change and the traditional publication model is increasingly seen as anachronistic. Journals have declining importance except as vehicles for publishers to brand and sell metrics, whereas data repositories are seen as exciting and on the increase. When academia finally adjusts to data-as-a-metric there will be a rush away from journals and towards repositories.

And these repositories will, I hope and expect, be run in a better, more useful fashion than the outdated journal model. They won’t be cost-free, but they won’t be expensive (recent conversations suggest a small fraction of current publication costs).

It won’t be easy or immediate. It will probably take ten years and end up as a complicated heterogeneity. But that is what the current century has to tackle and solve. It could return democracy to the practising scientists.

In later posts I’ll address the fundamental things I would like to see in data repositories.


Institutional Repositories: are they valuable to scientists?

I have had time to reflect on http://www.repositoryfringe.org/ (the meeting of repositarians in Edinburgh) and, having been recently concerned about the publishing of data (about which I shall post more later), I post my current analyses of the UK repository scene (I don’t know enough about elsewhere). I shall try to be objective, possibly constructive, but this will probably be a rather uncomfortable post. Before I start I’ll say that I have been committed in the past to working with my local repo and more generally with the repo community.

I am going to comment (>> PMR) as a working research scientist who needs a repository for (a) collaboration and (b) data publication and storage at all stages of the scientific endeavour. My comments do not necessarily extend to other disciplines or other purposes.

Here are some basic motivations for repos (http://en.wikipedia.org/wiki/Institutional_repository):

  • to provide open access to institutional research output by self-archiving it; >>PMR: This hasn’t worked for science and isn’t going to. I have self-archived some of my publications pre-publication but not post-publication. Most publishers of chemistry do not permit post-publication archiving; the process is complex and distracting; and I know of no cases where scientists search IRs for post-publication material.
  • to create global visibility for an institution’s scholarly research; >>PMR This is a useful function but IRs are generally poorly set up as showcases and there is so little science in most that I don’t go looking. (Why would I look at the output of the University of X? I might if they were headhunting me, but not otherwise)
  • to collect content in a single location; >>PMR this has no value for the average scientist. It is primarily (if at all) for institutional purposes such as managing the Assessment exercises
  • to store and preserve other institutional digital assets, including unpublished or otherwise easily lost (“grey”) literature (e.g., theses or technical reports). >>PMR: This is the only thing that might be useful to me *if I could discover the material easily and read it*. As an example of non-use, Imperial College prevents anyone outside the institution reading any of their ca 1000 theses. This is not the norm, but it is impossible to answer the question “show me all UK theses”. The interfaces to the ca 200 UK IRs are a hotch-potch and completely unnavigable by machine. So I agree with “store and preserve” (which is of no use to most scientists in the modern world) but not “discover”.

And from Alma Swan: (I exclude topics above, teaching, measurement, showcasing):

  • Providing a workspace for work-in-progress, and for collaborative or large-scale projects; >>PMR This is something I have been urging repos to do as I think it’s the only thing that would provide something of value to the average scientist. If scientists used their university system for managing their work processes and data then they would have naturally engaged. But I think repos are running out of time and I think there are existing solutions which have a trajectory and will work.

If repos wish to engage with scientists I think the only real way forward is to help create *single* domain-specific repositories. Examples of these are Dryad, Tranche, the PDB and the NCBI/EBI resources. The model would involve domain scientists running the [single] repository (let’s say for computational chemistry) and one or more traditional repos managing the sustainability. Note that scientists do not, in general, care about preservation beyond a few years at most. Scientists will not and should not put data directly into their own IR – it fragments the discipline and there are no good search tools.

So I have painted a fairly stark picture for IRs and science. They aren’t working and they aren’t going to work in their current form. The only area of possible interest is theses. To do this the IRs must, across all institutions:

  • Make their content Open. If the response is “it’s the student’s copyright, we can’t do anything” then we are not interested.
  • Label the Open content as open (machine-readable). It is *impossible* in any repository I have visited to find specifically Open material in bulk (i.e. by machine-reading). So almost all theses and other content in UK repositories are effectively closed.
  • Make it iterable. It should be possible to list everything in the repository systematically (a sketch of how simple this is follows this list). Google does this, but academics are usually forbidden to do so. Relying on Google to search University information is simply ducking the problem. I have floated this idea and got very little take-up, even though it could be done in a week if the community put its effort into it. I doubt they will, but would be happy to be proved wrong.
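Most IRs already expose OAI-PMH, so the machinery for iteration exists. Here is a minimal sketch in Python using the standard ListIdentifiers verb; the endpoint URL is hypothetical, and a real harvester would need error handling and politeness delays.

```python
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE = "https://repository.example.ac.uk/oai2"  # hypothetical endpoint

def list_identifiers(base_url):
    """Yield every record identifier in the repository, following
    resumptionTokens until the list is exhausted."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
    while True:
        root = ET.fromstring(requests.get(base_url, params=params).content)
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(f"{OAI}ListIdentifiers/{OAI}resumptionToken")
        if not token:
            return
        params = {"verb": "ListIdentifiers", "resumptionToken": token}

for oai_id in list_identifiers(BASE):
    print(oai_id)
```

The protocol has been there for a decade; what is missing is the policy (and the machine-readable licences) to let anyone run this across all ca 200 UK IRs and aggregate the results.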

On the assumption, therefore, that IRs have nothing to offer scientists either in data management or discovery, my next posts will turn to solutions from different sectors.


 


The English riots; what can I do?

I normally blog only about science, scholarship and related matters, and I am making an exception in this post. I feel a sense of wanting to help and, while I have no solutions, am open to ideas. At least from my experience I can relate emotionally as well as intellectually to the current problems.

I used to live in NW London (http://en.wikipedia.org/wiki/West_Harrow, a middle-class, middle-income suburb about 20 km from central London). And if I had still been there I would probably have spent large amounts of last week in the local police station. (I cannot find copyright-free photographs, so see Geograph for some local photographs of the area.)

[Image of West Harrow, from Wikipedia]

In 1981, just before we moved to London, there were serious riots in Brixton, with a strong feeling that the police were institutionally racist. In the aftermath there was an official enquiry which culminated in a report from Lord Scarman (http://en.wikipedia.org/wiki/Scarman_report). This report included the analysis:

According to the Scarman report, the riots were a spontaneous outburst of built-up resentment sparked by particular incidents. Lord Scarman stated that “complex political, social and economic factors” created a “disposition towards violent protest”. The Scarman report highlighted problems of racial disadvantage and inner-city decline, warning that “urgent action” was needed to prevent racial disadvantage becoming an “endemic, ineradicable disease threatening the very survival of our society”.[1]

Scarman made several recommendations, including setting up a system of voluntary “Lay Visitors” (now Independent Custody Visitors [ICV]), and in ca. 1983 I volunteered for this scheme.

Initially, the provision of custody visiting was voluntary on the part of the Police Authorities, but it was placed on a statutory basis in 2002.

Visits to police stations by custody visitors are unannounced and can be made at any time. The custody visitors must be admitted to the custody suite immediately, unless there is a dangerous situation occurring.[7] They are allowed to speak to anyone being detained at the police station, unless a police Inspector (or higher rank) believes that access would place the custody visitors in danger or would “interfere with the process of justice”.[8] The visitors ask the detained person whether they have been informed of their rights under the Police and Criminal Evidence Act codes of practice [PACE] (for example, to speak to a solicitor or to make a telephone call) and whether they are being treated properly.[9] Visitors also check that the cells and other facilities within the custody suite, such as the toilets and food-preparation area, are clean.[10] The custody record, which records everything that happens to someone whilst they are in police custody, may also be examined.[11]

If the custody visitors find any issues, or a detained person raises an issue about their treatment, the visitors raise these with the officer in charge of the custody suite, or of the police station.[12] The visitors complete a report of each visit, which will record their finding including any issues identified during the course of the visit. Copies of the report are sent to the Police Authority.[13]

[If I’m incorrect, say so. I speak without active current knowledge – I stopped ca 7 years ago when I moved to Cambridge]. When the scheme started there was distrust of the police, including by Guardian readers such as me. The role of Visitors is to ensure that police behave legally and appropriately within the custody suites of police stations. (It did not extend to police on the streets, or in general to transport of suspects or prisoners.) The week-to-week activity involved dropping in randomly and being shown directly to the custody suite. Even a few minutes’ delay was unacceptable. We got to see everyone in custody (with some exceptions such as high-security prisoners and immigration overstayers).

Visits and reports are strictly confidential, but my general conclusion was that the police (at least in Harrow) were professional and treated prisoners appropriately; I never saw any evidence of mistreatment. My opinion of the police, in Harrow and in those parts of the operations I saw, rose considerably. There was almost, in some cases, a “business relationship” between the detainees and the police (detainees, not criminals: these people had not yet been to trial, so had not been found guilty of the offence for which they had been arrested). Much of the activity was paperwork, and when PACE came in it increased. I can personally vouch for the amount of writing and form-filling for each detainee – and this is required by law. It seems that increasing clarity in the judicial process requires increasing paper.

There was special provision for unusual circumstances where large numbers of people might be arrested. In these cases a local police station could (should) ask visitors to attend the station before arrested people were brought in. This was particularly important where the detainees belonged to an identifiable social/racial group. The visitors could publicly assert that nothing untoward had taken place, thus removing at least the suspicion that the police had acted illegally or irresponsibly. I would expect that many visitors have spent much time in custody suites last week.

On more than one occasion the visitors were able to represent objective problems encountered by the police to the Home Office and we believe that this had some effect. The most common was overcrowding, particularly when prisons were full. Prisoners on remand were often put in police cells which are completely unsuitable for more than a night’s stay. Moreover police are not trained as or expected to be prison warders.

I did this for two terms (the maximum was 3 years per term). After this I volunteered to be an “Appropriate Adult”.

Appropriate adult is a defined term in the United Kingdom legal system for a parent or guardian or social worker who must be present if a young person or vulnerable adult is to be searched or questioned in police custody.[1] If these are unavailable a volunteer from the local community may fill the role instead.[2]

The role is to accompany young people aged below 17, when they are detained in custody to explain the meaning of legal terms, offer counsel or comfort, give advice, contact relatives, ensure the offender is aware of his rights, and that the offender is receiving the care he or she is entitled to (clean cells with no adult offenders inside, for instance)[2]. The concept was introduced as part of the policing reforms in the Police and Criminal Evidence Act 1984.[3]

When an unaccompanied young person is arrested the custody suite will contact a local Youth Offending Team who has a duty to arrange for an appropriate adult to be available[4]. The request for an appropriate adult is often the first way in which Youth Offending Team’s learn of a young persons offences or re-offences.[citation needed]

Appropriate adults are also often used when vulnerable adults are detained in custody. Vulnerable adults are classed as people who suffer from mental illness, learning difficulties or literacy problems. In these cases it is the appropriate adults role to ensure that the detainee understands the custody process, legal advice and any questions put to them by the police. These appropriate adults usually have specialised mental health training or practical experience of dealing with vulnerable adults.

This meant that I could be called – at random – whenever the custody sergeant could not find or persuade a parent or guardian to come. My role was to make sure that the detainee (normally a juvenile, but sometimes a non-English speaker) understood what was happening and what their rights were. The archetypal situation was a juvenile who had been in custody several times before and whose parent(s) would not come – this was just another arrest. The proceedings were predictable – a short interview with the detainee (who was fully accustomed to the process) and my explaining the right to have free legal representation. This was sometimes accepted, whereupon there was a long wait (it could be hours) for the lawyer to arrive. There are not many more boring occupations than sitting in police custody suites waiting, waiting. Then the interview. The standard ritual is burnt into my brain:

You do not have to say anything, but it may harm your defence if you do not mention when questioned something which you later rely on in court. Anything you do say may be given in evidence.[10]

In passing, I think the first sentence is one of the worst ever written. Given that many detainees will have a poor grasp of English an inverted subordinate clause is total gibberish. It shouts at the detainee that the law does not care about them. It would be so easy to write simple monosyllabic English. But more seriously it also represented an erosion (by PACE) of rights – the “right to remain silent”.

Many repeat detainees would simply “no comment” the interview. Most are impervious to the process – they expect to be charged and appear in court, and this is simply part of their normal life. I once saw a proactive lawyer persuade his client to challenge police officers’ identification, and as a result there was no charge. Sometimes there is no charge; sometimes there is a charge; sometimes bail, to appear in court the next morning. Young people locked up overnight is the consequence.

I did not go to court so I was not part of that system. But often the hearing required more material – social workers, more evidence, etc. – and could be weeks later. By that time some detainees could have been arrested for other offences, including WoW (“wanted on warrant” – see this site). Justice is slow and occurs so long after the offence that it is difficult to see any visceral connection between crime and punishment. Most sentencing of juveniles did not result in being locked up (I am passing no judgment).

By the time I left Harrow I could see clearly for myself at least that the formal criminal justice system had little effect on many persistent offenders. For some there would be a progression to prison which became an option when they turned 17.

Are there other ways forward? One option is http://en.wikipedia.org/wiki/Restorative_justice.

Restorative justice (also sometimes called “reparative justice” [1]) is an approach to justice that focuses on the needs of victims, offenders, as well as the involved community, instead of satisfying abstract legal principles or punishing the offender. Victims take an active role in the process, while offenders are encouraged to take responsibility for their actions, “to repair the harm they’ve done—by apologizing, returning stolen money, or community service”.[2] Restorative justice takes crime seriously without increasing repression and exclusion involving both parties and focusing in on their personal needs. In addition, it provides help for the offender in order to avoid future offences. It is based on a theory of justice that considers crime and wrongdoing to be an offense against an individual or community rather than the state.[3] Restorative justice that fosters dialogue between victim and offender shows the highest rates of victim satisfaction and offender accountability. [4] According to Zehr and Mika (1998), there are three key ideas that support restorative justice.

  • First, is the understanding that the victim and the surrounding community have both been affected by the action of the offender and in addition, restoration is necessary.
  • Second, the offender’s obligation is to make amends with both the victim and the involved community.
  • Third, and the most important process of restorative justice is the concept of ‘healing.’ This step comes in two different parts: the healing for the victim, as well as meeting the offender’s personal needs. Both parties are equally important in this healing process to avoid recidivism and to instill safety back into the victim’s life.

I was introduced to this not by woolly do-gooders but by a senior police officer in Thames Valley Police. He believed that this was a constructive way to reduce crime, and certainly recidivism. It requires a great deal of commitment and hard work on everyone’s part. There is no guarantee of success. It requires people talking and listening to each other in several directions. Unfortunately I left London and have made myself too busy to be involved in police work more recently. But I believe that Restorative Justice must be part of the solution of our current problems.

I have purposely not given homespun solutions. I have been shaken by the events (even though there has been nothing in Cambridge). There are no surprises (other than the actual time of the event). There is still endemic tension and suspicion. I thought it was lessening. It was deeply disturbing to see the exponential growth of disturbance. Not only has this led to huge distress, it has also made it clear we have no mechanisms to prevent it happening again.

Arresting, charging, and imprisoning large numbers of offenders will inexorably put huge pressures on our criminal justice system. Police cells will be overflowing, convicted prisoners will be housed in police stations and maybe even army camps or other completely unsuitable places – I don’t know. It may solve a political problem – from my own experience I do not see it solving much else.

I hope and I believe that the people of England will come up with ideas to repair our society. Simple immediate reactions will not work. Justice and legislation carried out in haste and in reaction rarely work. We can learn from other countries and other cultures. The Brixton riots led to the Scarman report which made modest progress although many of its recommendations were ignored.

Any solution will involve the active involvement of the police. Robert Peel exemplifies the constructive approach to policing in his principles. They are all critical but I have highlighted a few:

  1. The basic mission for which the police exist is to prevent crime and disorder.
  2. The ability of the police to perform their duties is dependent upon the public approval of police actions.
  3. Police must secure the willing co-operation of the public in voluntary observation of the law to be able to secure and maintain the respect of the public.
  4. The degree of co-operation of the public that can be secured diminishes proportionately to the necessity of the use of physical force.
  5. Police seek and preserve public favour not by catering to public opinion, but by constantly demonstrating absolute impartial service to the law.
  6. Police use physical force to the extent necessary to secure observance of the law or to restore order only when the exercise of persuasion, advice, and warning is found to be insufficient.
  7. Police, at all times, should maintain a relationship with the public that gives reality to the historic tradition that the police are the public and the public are the police; the police being only members of the public who are paid to give full-time attention to duties which are incumbent upon every citizen in the interests of community welfare and existence.
  8. Police should always direct their action strictly towards their functions, and never appear to usurp the powers of the judiciary.
  9. The test of police efficiency is the absence of crime and disorder, not the visible evidence of police action in dealing with it.[2]

I feel myself part of the historic tradition “the police are the public and the public are the police”. In that light, what is the way forward?

That’s it – if anything emerges that I could be involved in, I might be. But this blog will revert to informatics, chemistry, openness and rants. To make the transition and raise our spirits, here’s something beautiful I saw in the garden last week – a Red Underwing moth (Catocala nupta) captured on my mobile:

They are normally nocturnal, but this was bright daylight and the red was brilliant.

 

 


Why we need data repositories: prevention of Scientific Fraud (ACS and others please respond)

[Warning – this blog contains praise and criticism of the chemistry publishing industry].

I’ve just been catching up on the chemical blogosphere by reading Chembark (http://blog.chembark.com/about/):

This site is maintained by Paul Bracher. Paul is currently a National Science Foundation ACC Postdoctoral Fellow at the California Institute of Technology. He completed his doctoral work in organic chemistry at Harvard University, his undergraduate studies at New York University, and his secondary education at Thomas Jefferson High School for Science & Technology. Paul enjoys writing about himself in the third person

Chemistry has one of the finest blogospheres and those who criticize grey literature should take time to read it and change their views. Chembark has recently been spending a great deal of time on a very high-profile case of scientific fraud (Sezen/Sames). He has detailed it meticulously. Where he needed official information he sent an FOIA (Freedom of Information Act) request to the government, which took its time in replying. [I am fortunate in the UK where we have “What do they know?” http://www.whatdotheyknow.com/user/peter_murray_rust – is there a similar system in the US?] Here’s Chembark’s commitment to the case:

3 December 2010 – Acknowledgment of receipt of ChemBark’s FOIA request
8 December 2010 – Denial of ChemBark’s request for expedited processing
20 June 2011 – Follow-up letter to DHHS FOIA Office
22 June 2011 – Cover letter from DHHS with Bengu Sezen Investigation FOIA Materials
22 June 2011 – FOIA Materials for Bengu Sezen Investigation

This is contained in (http://blog.chembark.com/2011/07/07/the-sezen-files-%E2%80%93-part-i-new-documents/) – the first of several impressive posts which are worth reading – and we haven’t got to the end.

The case has also been covered by the American Chemical Society (ACS), which has reported very responsibly and has taken the view that not only is the case a disgrace but that Columbia University has not taken appropriate action. Note that the fraud has seriously harmed, perhaps destroyed, the careers of innocent chemists trying to repeat the experiments. Here’s Rudy Baum’s editorial (http://cenblog.org/the-editors-blog/2011/08/sezen-sames-and-columbia/) (quoted without permission but in appreciation of the strength of argument):

Columbia’s investigation focused exclusively on Sezen’s misconduct. From the ORI report obtained by C&EN, it appears that Columbia has not made any attempt to probe whether Sames was guilty of scientific misconduct himself during Sezen’s time in his lab. And in fact, a close reading of Columbia’s policy on research misconduct—which was adopted in February 2006—suggests that, in the university’s eyes, Sames’ behavior throughout was acceptable. Allegations of misconduct, the policy states, must be made to the “appropriate Responsible Academic Officer,” who is a “Chair, Dean, or Director,” not a principal investigator. The policy, which runs to 14 pages of legalistic, process-oriented gobbledygook, doesn’t appear to mention the responsibility of PIs at all.

And Columbia? The claim that the university can’t talk about the case to protect individuals’ privacy is laughable. Most of the redactions in the report of the university’s investigation are easily filled in with the appropriate names. The Sezen/Sames case is an embarrassment, the malefactor has been banished from the ivory tower, an up-and-coming young professor is moving along with his career, and Columbia is putting the unpleasantness behind it. It may be good spin doctoring, but it’s a lousy way to run a great research institution.

This could happen ANYWHERE. In YOUR laboratory. It is impossible to know how much fraud is undetected. In a major case the Int. Union of Crystallography detected massive systematic fraud (http://journals.iucr.org/e/issues/2010/01/00/me0406/me0406bdy.html, where 70 papers were systematically fabricated). Great kudos goes to the IUCr for detecting this.

Which they were able to do because they DEMAND MACHINE-UNDERSTANDABLE DIGITAL DATA AS A REQUIREMENT FOR PUBLICATION. By contrast the chemistry community requires supplemental information, but only as PDF. Here’s the main C&EN article (http://pubs.acs.org/cen/science/89/8932sci1.html). Don’t worry if you don’t understand chemistry – the plot is easy to follow. Again I quote without permission.

Investigators concluded Sezen fabricated 1H NMR data for at least seven compounds, including the product of the arylation reaction shown, based on the pattern of the products’ reported coupling constants. After breaking out their rulers, the committee found that the starting compounds’ 1JCH coupling constants vary, as would be expected in genuine 1H NMR spectra, but the products’ do not. The poor resolution of Sezen’s other printed spectra prevented the committee from conducting a more comprehensive analysis. [PMR’s emphasis]

Basically the investigating committee had to measure distances with rulers on paper spectra, rather than analysing the numbers that the spectrometer had emitted. By printing to paper, rather than preserving the original data, the laboratory system leaves a massive hole for fraud. Despite my and others’ campaigns for digital data, all chemistry laboratories and all journals continue to use e-paper, so all are open to easy fraud. Schulz continues (and don’t be put off by the chemistry). (Note – I have reproduced a diagram, thereby violating copyright. This diagram is actually a creative work of Sezen; see below.)

[Schulz] WHITEOUT: This 31P NMR spectrum from Sezen’s doctoral thesis and research papers was presented as proof that she made the supposedly promising C–H functionalization catalyst RhCl(CO)[P(Fur)3]2. The investigation showed that Sezen created it out of whole cloth by merging 31P spectra of a simple phosphorus compound (triphenylphosphine) and then applying correction fluid to remove peaks from a triphenylphosphine oxide contaminant in that compound. [PMR’s emphasis]

PMR: For non-chemists: A chemist makes a compound and runs an NMR spectrum (these machines cost hundreds of thousands of dollars and have integral computers which analyse all the data in DIGITAL form). The chemist takes the spectrum, analyses (annotates) it – see the black text – and argues that it confirms the identity of the compound. The supervisor or journal editor/reviewer is required to accept this argument to justify publication.

NOTE: The spectrum probably originally contained 16384 measurements along the x-axis (frequency). The image here represents the loss of about 99% of that data. The numbers are unreadable (I actually have NO idea what the range is). If this is the diagram published in the journal article it should never have been allowed. (It’s actually the worst I have seen). If it was part of the publication, then the editor of the journal must take responsibility for allowing it.

What then apparently happened is that two independent spectra were merged. I don’t know how this was done. It might have been in Photoshop™ or Sezen might have written a program (I am sure Chembark will tell me). This, of course, is fraud. Then, because the result didn’t look quite right, Sezen applied correction fluid (“Tippex”) to the PAPER spectrum. [This is pretty crude. It’s ultimately detectable.]

IF THE DATA HAD BEEN DIGITAL THROUGHOUT SOFTWARE COULD HAVE DETECTED THE FRAUD (there were satellite peaks in the wrong place, etc.).
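To make that concrete, here is a minimal sketch (my illustration, not the committee’s actual method) of the kind of check that becomes trivial once the spectrum exists as numbers. scipy’s find_peaks is real; the thresholds, and how the spectrum is loaded, are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def splittings_hz(freq_hz, intensity, rel_height=0.005):
    """Locate peak maxima and return the gaps between adjacent peaks in Hz --
    the coupling constants the committee had to measure with rulers."""
    peaks, _ = find_peaks(intensity, height=rel_height * intensity.max())
    return np.abs(np.diff(np.sort(freq_hz[peaks])))

# Genuine spectra show 1J(CH) couplings that vary from compound to compound,
# and 13C satellites in the right places at roughly 0.55% of the parent peak.
# Identical splittings across supposedly different products, or misplaced or
# missing satellites, would be flagged in seconds rather than weeks.
```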

Mat Todd has asked the ACS if they will accept digital spectra (http://intermolecular.wordpress.com/2011/08/07/raw-data-in-organic-chemistry-papersopen-science/) and …

I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data. [PMR emphasis]

This lack of commitment from the chemistry publishers is unacceptable. It effectively says they can’t be bothered to accept digital data routinely. There is no technical problem – the data are much smaller than the enormous PDF bitmaps they routinely allow as supplemental information, they compress well, and there are Open Source viewers. If the IUCr can do this routinely (and detect fraud), why can’t the ACS, RSC and the rest? “E.g. an institutional repository” is simply ducking the responsibility. The responsibility of the journal is to take reasonable steps to ensure that the science is “correct”, and depositing digital data files with the publisher is trivial. And if they don’t have the expertise to analyse spectra, the blogosphere has lots of Open Software.
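
If anyone doubts the size and compression claims, they take five minutes to check. A rough sketch with synthetic data (not a real spectrometer file, so treat the results as order-of-magnitude only):

```python
# Order-of-magnitude check on file sizes. The "spectrum" is synthetic
# (noise plus three Lorentzian peaks) - only the byte counts matter here.
import gzip
import random

random.seed(0)
POINTS = 16384  # a typical 1H FT-NMR trace length

def lorentzian(x, x0, width=4.0):
    return 1.0 / (1.0 + ((x - x0) / width) ** 2)

ys = [
    1000 * (lorentzian(x, 3000) + lorentzian(x, 9000) + lorentzian(x, 9063))
    + random.gauss(0, 0.5)
    for x in range(POINTS)
]
ascii_form = "\n".join(f"{x} {y:.3f}" for x, y in enumerate(ys))

raw = ascii_form.encode()
print(len(raw) // 1024, "KB as plain ASCII")        # roughly 200 KB
print(len(gzip.compress(raw)) // 1024, "KB gzipped")
```

Either figure is a rounding error next to a multi-megabyte bitmap PDF.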

[If you want an interim solution: Mark Hahnel, Mat and I are investigating Open Source technology and the role of Figshare. We’d love to have others involved. It will happen.]

So maybe we’ll end up with a system where “papers” are published in ACS and RSC journals and “data” is published in Figshare. (This is much more attractive than publishing it in repositories, and I’ll explain why later.)

What happens when the world wakes up and starts to value data and creates metrics for it? The metrics will reference Figshare, not ACS…


Figshare meets Open Drug Discovery

I normally don’t like blogging more than twice a day, but sometimes it’s inevitable. (People sometimes suggest I blog too much, but there is so much we have to change, and so little time, that I take the risk.) This is in response to Mat Todd, a champion of Open Source drug discovery (likely to be an increasingly common theme here). He asked for my thoughts on his blog post http://intermolecular.wordpress.com/2011/08/07/raw-data-in-organic-chemistry-papersopen-science/ which starts:

Open science is a way of conducting science where anyone can participate and all ideas and data are freely available. It’s a sensational idea for speeding up research. We’re starting to see big projects in several fields around the world, showing the value of opening up the scientific process. We’re doing it, and are on the verge of starting up something in open source drug discovery. The process brings up an important question.

I’m an organic chemist. If I want people to get involved and share data in my field I have to think about how to best share those data. I’m on the board of more than one chemistry journal that is thinking about this right now, in terms of whether to allow/encourage authors to deposit data with their papers. Rather than my formulating recommendations for how we should share chemical data, I wanted to throw the issue open, since there are some excellent chemistry bloggers out there in my field who may already have well-founded opinions in this area. Yes, I’m talking about you.

I won’t quote the whole post, but here are my interspersed replies. If you aren’t very chemical, skim quickly…

Peter Murray-Rust, 3:19 pm on August 7, 2011:

Mat, great post – answering various points:

>Mat>Open science is a way of conducting science where anyone can participate and all ideas and data are freely available. It’s a sensational idea for speeding up research. We’re starting to see big projects in several fields around the world, showing the value of opening up the scientific process. We’re doing it, and are on the verge of starting up something in open source drug discovery. The process brings up an important question.

I am excited about the OSDD effort(s) and think there is a lot of Open technology they can use.

>Mat>I’m an organic chemist. If I want people to get involved and share data in my field I have to think about how to best share those data. I’m on the board of more than one chemistry journal that is thinking about this right now, in terms of whether to allow/encourage authors to deposit data with their papers.

Many already do “require” PDFs. There is no agreed way of doing it, but if what you mean is depositing JCAMPs then YES. The OS community can hack any variants.

>Mat>1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.

Agreed – trivial in time and size of files.

>Mat>2) The data files are large. Not really. A 1H NMR spectrum is ca. 200KB.

>Mat> 3) It’s a pain. Yes, a little. But we must suffer for things we love.

see below

>Mat>4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.

Yes – and some bad chemistry has been detected and corrected.

>Mat>An important fact: For many papers, supporting information is actually public domain, not behind a paywall along with the rest of the paper. The ACS, for example, would, by posting raw data as SI, allow the free exchange of raw spectroscopic data. That would be neat.

The ACS requires CIFs and I congratulate them. If they could just extend that to JCAMPs and computational logfiles, that would almost solve everything.

>Mat>1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.

True for all OA journals (though there is not much crystallography there apart from IUCr Acta E). RSC, IUCr and ACS require CIFs (applause). Wiley, Springer and Elsevier do not publish this supplemental data; it is only available from the CCDC, and then not in bulk without a subscription.

>Mat>2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications. We’ve played with it, and in one of our recent papers we deposited all the NMR data in this format in the SI. We’ve been posting JCAMP-DX files in our online electronic lab notebooks, e.g. here. My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong format, we couldn’t read the data with certain programs, but could with others. i.e. we had to get the generation of the file just right.

I don’t fully understand this. There are actually several formats, but the Open Source software reads all of them. CML-Spect supports these and is readable by JSpecView. This need not be a problem if people have the will to solve it.
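
For anyone who has never looked inside one: JCAMP-DX is just labelled ASCII, which is why tolerant readers are easy to write. A minimal sketch, handling only plain ##LABEL=value records and the simple (XY..XY) data form – real files also use the packed SQZ/DIF/DUP encodings, which is where the variant trouble Mat describes mostly lives. The sample record is invented:

```python
# Minimal JCAMP-DX reader sketch: handles only "##LABEL=value" records and
# uncompressed "(XY..XY)" data pairs. Real-world files also use packed ASDF
# (SQZ/DIF/DUP) encodings - the incompatibilities Mat mentions live there.

SAMPLE = """##TITLE=Example 1H spectrum (made-up data)
##JCAMP-DX=4.24
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY
##XYDATA=(XY..XY)
0.0,1.2 0.5,1.9 1.0,3.4
1.5,2.1 2.0,1.1
##END="""

def parse_jcamp(text):
    header, xy = {}, []
    in_data = False
    for line in text.splitlines():
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            header[label.strip()] = value.strip()
            in_data = label.strip() == "XYDATA"
        elif in_data:
            for pair in line.split():
                x, _, y = pair.partition(",")
                xy.append((float(x), float(y)))
    return header, xy

header, xy = parse_jcamp(SAMPLE)
print(header["TITLE"], "-", len(xy), "points")
```

Once the pairs are in memory, plotting, comparing and checking them is routine; the hard part is persuading people to deposit the files at all.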

>Mat>I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data.

I have an idea, which I think will fly. [see below]

>Mat>3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.

JCAMP will hack this.

>Mat>4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?

JCAMP will do this for “1-D” spectra (e.g. not involving GC or multiple steps).

>Mat>5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.

Again, it wouldn’t take much to solve this.

>Mat>6) Anything else?

I think we should use FigShare (see /pmr/2011/08/03/figshare-how-to-publish-your-data-to-write-your-thesis-quicker-and-better/) and I’ll explain why in my blog in a day or so.

… OK, take a breath. The main points have been:

  • The technology for recording digital spectra has been around for at least 30 years
  • The files are not large, and are trivial to upload
  • There’s lots of Open Source software.

The ACS is keen to see these data available but (according to Mat) doesn’t want to act as a database. So:

I copied in Mark Hahnel (see link above). Figshare sounds like exactly what is needed.

Figshare has been developed with zero cash (but a lot of love from Mark). That will scale far enough to establish that the concept works and that scientists like it.

We don’t have to convince the senior chemists – all we have to convince is graduate students. Because they are the ones that will benefit from it and help develop the next phase. Whatever that is.

The University community (including the repositories) should take careful note of what Mark has done, because he has filled a real need rather than building a theoretical design. And this is where innovation comes from (in our own group Nick Day built Crystaleye in his final PhD year – http://wwmm.ch.cam.ac.uk/crystaleye – and we are finding it a permanent home). Let’s have mechanisms for supporting the products of such innovation. Meanwhile, if any graduate student wishes to archive spectra, let’s see how Mark recommends we develop it. And if you wish to deposit crystal structures, let’s do them in the new crystaleye2 (http://crystaleye.ch.cam.ac.uk/).

There is now no technical reason not to archive high-quality data for chemistry in a completely Open manner.


Repository Fringe 11: McBlawg, and a Question for everyone

I spent much of last week at the Repository Fringe in Edinburgh (see http://www.repositoryfringe.org/, which has a really excellent “live blog” – almost verbatim; also see #rfringe11 for tweets). It was an interesting event, with the usual complete spectrum from geeks hacking repo software and content to those who are making policy, financing repositories and getting (or not getting) engagement.

The event was very well organized, held in the new Informatics Forum of the University of Edinburgh. Here’s Eurovision_Nicola’s photo (my hair is just visible at (0.6, 0.35), above Mark Hahnel (FigShare)’s light blue left shoulder). This is the roof garden, with Edinburgh’s Central Mosque and Salisbury Crags (Arthur’s Seat) in the background.

[Photo: delegates on the Informatics Forum roof garden]

(In our group are also Mark MacGillivray (x=0.6), Chen (x=0.73) and Michael Fourman (x=0.77)).

Graham (McDawg and McBlawg) is a prominent and tireless campaigner for Open Access and Open Knowledge. He devotes his own time and money to the cause, and describes himself as

“Scottish International Man of Mystery – Open Science/Access/Data/Knowledge & Patient Advocate”

I have highlighted the “Patient Advocate” as that is what, in large part, has driven Graham to demand Open Knowledge. Graham was a co-founder of the CJD Alliance (http://www.cjdalliance.net/) and describes himself in “Patients Like Me” (http://www.patientslikeme.com/members/view/1644):

Graham has several years experience of obtaining and sharing information between researchers and patients – and now Journals. The patient as always, remains at the forefront – always will.

Graham Steel (42) is a native of Glasgow, Scotland, and works as a property claims adjuster/recovery specialist.  Graham’s brother, Richard, was diagnosed with variant Creutzfeldt-Jakob Disease (vCJD) in April 1999 and died in November 1999 at the age of 33.

Graham joined the committee of the Human BSE Foundation on a voluntary basis in September 2001 and became a Trustee as of 2003 after the Foundation became a Charity.  Since September 2001, he acted as Vice-Chair.  One of his main initial and continued foci had been to develop and maintain the Foundation’s website. Graham left this organisation in October 2005.

Over the last few years, Graham has devoted much time learning more of the background of TSE’s and so called Prion disease, the current and emerging rationale of treatment issues/early diagnostic methodologies and maintaining/seeking contact with many researchers in several Continents. He has also devoted much time assisting in forging links between a number of CJD related support groups from around the world.

So does anyone, anywhere, defend the current system that denies Graham access to the world’s published medical literature? If there is a single motivating example for my advocacy and action for Open Knowledge, it is Graham. (And we and others are continuing to develop positive ways of getting it – watch this blog later in the month/year.)

So Graham went to Edinburgh to find out about the knowledge in Institutional Repositories and how it might be of benefit. Here are some excerpts (http://www.science3point0.com/mcblawg/):

The positive: On day one, one of the opening presentations was by Mo McRoberts (BBC Data Analyst) entitled BBC Digital Public Space project. From memory, this lasted for roughly 20 minutes. I [McB] caught up with Mo before he left the event later on and had a great chat with him. Ultra cool guy.

I agree. The singleness of purpose in the Beeb, liberating their content, contrasted starkly with the fuzziness of academia.

The notsopositive: My [McB] question was along the lines of “I’ve read a number of Enlighten tweets which have links to Manuscripts in the repository and all the ones I’ve looked at are not open access. I’m a bit confused by this and have been meaning to ask why?”

The general response was that (at least in UK terms) only about “10–15%” of the content of these IRs is Open Access. (WOW!! I tweeted.)

Why the surprise?? Well, everything that I’d read and been told about IRs until that moment had led me to believe that ALL of the content of IRs was OA. Nothing at all indicated the contrary.

… McRant …

[after] all of the OA Mandates, only 10–15% of researchers are self-archiving their work into repositories. IRs, at least in terms of OA content (the same cannot be said for non-OA content, which can be accessed by researchers who have on-campus access), do not appear to be particularly effective.

Notice the “researchers who have on-campus access”. That’s the key phrase. It’s so easy for those in universities to forget that they have “free” access to all the published literature. (Yes, I know that libraries are constantly chopping journals, and that it isn’t really “free” – but academics think it is.) McBlawg and the CJD Alliance do not have access to the literature.

How much of the content in IRs is Open Access? And of what sort (green, gold or murky)? Quite simply:

I [PMR] haven’t a clue what is in UK Institutional repositories.

And I suspect that no one else has. If I ask for a list of theses in UK repositories, people suggest that I write an OAI-PMH harvester. I personally can, but I would much rather that the question was already answered. If I ask how much of the content is CC-BY or CC0 licensed, no one has the slightest idea. (And without the licence you cannot re-use the content.)
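
To show why “write a harvester” is both true and beside the point, here is roughly the whole program – a sketch using only the standard OAI-PMH ListRecords verb and resumptionToken paging. The endpoint URL is a placeholder, and real repositories vary wildly in metadata quality. Writing this once is easy; running it over 200 endpoints, deduplicating and interpreting the results is the job that should be done once, centrally:

```python
# Sketch of the OAI-PMH harvester everyone tells me to write. It walks
# ListRecords responses, following resumptionTokens, and prints titles.
# The endpoint URL is a placeholder - substitute a real repository's.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"
ENDPOINT = "https://repository.example.ac.uk/oai"  # placeholder

def list_titles(endpoint):
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        url = endpoint + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            root = ET.parse(resp).getroot()
        for record in root.iter(OAI + "record"):
            title = record.find(".//" + DC + "title")
            if title is not None:
                yield title.text
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no more pages
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}

for t in list_titles(ENDPOINT):
    print(t)
```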

So, UK repositories, I am going to start asking you questions. They are simple to phrase and should be simple to answer. I proposed one as a “good idea” at the RepoFringe. It didn’t win a prize, but it did get helpful comments on this blog (/pmr/2011/08/04/linked-open-repositories-%E2%80%9Cwe-can-do-it-in-an-afternoon%E2%80%9D/). It was phrased in LinkedOpenData language (including the RDF dragon), so I will state it more simply here:

PLEASE GIVE ME A LIST OF ALL THE CONTENT IN ALL UK REPOSITORIES

I think that’s a simple and responsible question. (Please do not tell me I can write software to recursively iterate over the UK repos using OAI-PMH.) I want a list. For Graham and others. So they can look at it and see what is and what is not in the UK academic store of knowledge.

Here’s my simple arithmetic.

ca 200 UK repositories.

Most have << 10,000 entries (I asked at the meeting).

Soton (one of the most active, and also driven from the top) has ca 25,000 (as reported at the meeting)

Several have < 1000 entries (from the meeting)

Assuming an average of ca 2000 entries across the 200 repos (power-law distribution), we get ca 400,000 items.

[Jim Downing and I have personally put ca 200,000 data items into the Cambridge repository. I’m discounting these.]

And

    Many repos have been running for 10 years. I’ll take an average of 3 years for the 200. That gives 600 repo-years in the UK.

    Let’s assume that’s 1000 person-years at full economic cost of 100K GBP/year = 100 million GBP.

Now, before you complain that you don’t get paid anything like 100K GBP, that is about the average amount requested for a postdoc from a research council. It takes into account the computing infrastructure, mowing the lawns, holding meetings, etc. And on top of that there are those supported to develop software and develop projects. An auditor could reasonably claim that our group had >>200K GBP of JISC grants over the years just for repositories, and that’s independent of the actual university costs and support.

So my current arithmetic is:

100,000,000 GBP for 400,000 items

That’s 250 GBP per item deposited.
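
For anyone who wants to challenge the figures, here is the whole back-of-envelope with every assumption explicit. Each number is a guess from the lines above, not a measurement:

```python
# PMR's back-of-envelope, with every assumption explicit and contestable.
repos = 200                 # ca. 200 UK repositories
entries_per_repo = 2000     # assumed average (power law: a few big, many tiny)
items = repos * entries_per_repo              # -> 400,000 items

avg_years_running = 3       # assumed average lifetime per repo
repo_years = repos * avg_years_running        # -> 600 repo-years

person_years = 1000         # assumed staffing across the sector
fec_per_year_gbp = 100_000  # full economic cost of one person-year
total_cost = person_years * fec_per_year_gbp  # -> 100,000,000 GBP

print(f"{items:,} items, {total_cost:,} GBP, "
      f"{total_cost / items:.0f} GBP per item")
# -> 400,000 items, 100,000,000 GBP, 250 GBP per item
```

Halve or double any of the guesses and the answer is still of the order of hundreds of pounds per deposited item.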

QUESTION 1: I need to know exactly how many entries there are in UK institutional repositories on (let’s say) August 31 2011.

Assuming I get some answers, I will then move on to the next questions. Which are all, ultimately, leading up to knowing how much Open Knowledge we have and what use it is.


What’s wrong with Scholarly Publishing? Interim observations and perhaps a solution.

I have been blogging for 3 weeks on the malaise in scholarly publishing. While doing this I have talked to a number of people and got some blog feedback. I think I am more worried and less sure than when I started.

There’s a horrible similarity between the problems of scholarly publishing and the current financial cataclysm. Each has created a set of values which is increasingly incomprehensible to normal people, and which leads recursively to implosion. On the Digital Curation list Simon Fenton-Jones writes:

To give the conversation about research a little more gravitas, let me point out that the 5% that each reader lost from the value of their pension fund last week will lead to an education in what happens when an entire economic system is founded on a financial instrument called a Credit Default Swap.

I looked up CDS on Wikipedia (http://en.wikipedia.org/wiki/Credit_default_swap) and found:

This article may be too technical for most readers to understand. Please improve this article to make it understandable to non-experts, without removing the technical details. (November 2010)

And indeed it was too technical for most readers, myself included. SFJ continues:

Having lost a wife to the Big C, and discovering less than half of cancer research can be found easily, i write to relieve myself of hate (of existing institutions),

The point is that academia has – wittingly or unwittingly – created a process and market in things that average people do not understand. Perhaps the worst is the worship of the journal and its Impact Factor (http://en.wikipedia.org/wiki/Impact_factor).

The impact factor, often abbreviated IF, is a measure reflecting the average number of citations to articles published in science and social science journals.

and

a citation is an abbreviated alphanumeric expression (e.g. [Newell84]) embedded in the body of an intellectual work that denotes an entry in the bibliographic references section of the work for the purpose of acknowledging the relevance of the works of others to the topic of discussion at the spot where the citation appears.

The point I am making is that the previous two sentences describe in large part how we evaluate academic work. How many “average people” understand them? As I have already blogged, the vast majority of research funding comes from public sources. I dislike the term “taxpayer” as it is not politically neutral, but as academics we have a duty to show those who provide the money that we are producing value. If someone is suffering from a disease, is it really acceptable to tell them that they cannot read the research they have funded because *I* want to publish where it advances my career best? And yet this is a universal choice facing academics:

Do I publish where it is openly visible, or where it advances my career and institutional standing?

I have sympathy for the young researchers and would not try to urge them to break ranks. I have no sympathy for senior academics who have nothing to lose by trying to change the system.

Not all academic work is immediately “useful”, and even work that appears to be may have a short life. I am not saying we should change to obviously utilitarian research, although I think research is desperately needed to prevent a number of foreseeable disasters on the planet. I am simply saying:

Reward research which is openly visible and which strives to allow reuse of the research. Do not reward research which is closed.

It is only because we have built a false value system that it is so hard to make the change. As long as we continue to support the glory that comes from “prestigious publication” that only a tiny fraction of people can read, we continue to deprive the world of the hundreds of billions of dollars put each year into “academic” research.

How we get there I do not know, any more than I know how to stop burning petroleum into the atmosphere. The best I can do is to help promote consciousness of the need for change and to publicize any solutions that look good. I also think that events may force our hand. I will not be surprised if a major closed publisher crashes, especially in the present financial crisis. Crashes are good for change, but only if we are prepared (which we are not). In any case funders should insist on open publication. Academics will hate them, but I suspect the “taxpayer” will approve.
