Archive for May, 2008
What I said to the NIH
Saturday, May 31st, 2008
Comments on the NIH policy
Friday, May 30th, 2008
PMR: I’d like to comment, but it’s not very clear whether it’s US-only. I’m going to look through the current comments (which are all public) and see whether there’s anything non-US. Any help will be valuable. Assuming I am eligible I’ll say something about the critical need to allow access to data. Firstly, the NIH could state that data in the papers is factual and therefore not subject to copyright. A public statement of this would be very valuable. Secondly, that it will not impose server-side limitations on robotic downloads. It’s the responsibility of the reader not to break copyright. Thirdly, I will encourage the NIH to require the submission of data as well as free text. So I’d welcome feedback and also urge anyone who is qualified to post. If that’s everyone, then everyone should post. From what I can see they could use a few more…
03:38 30/05/2008, Public comments on the OA mandate at the NIH are due by 5:00 pm (Eastern Standard Time), Saturday, May 31, 2008, less than two days from now. Submit your comments through the NIH web form. But before you do, see some of the comments already submitted. The pro-OA comments will give you ideas, and the anti-OA comments will show you what objections to answer and what perspective might predominate if you don’t send in your own. This time the NIH wants separate answers to four separate questions. The web form has four separate spaces for them:
- Do you have recommendations for alternative implementation approaches to those already reflected in the NIH Public Access Policy?
- In light of the change in law that makes NIH’s public access policy mandatory, do you have recommendations for monitoring and ensuring compliance with the NIH Public Access Policy?
- In addition to the information already posted [here], what additional information, training or communications related to the NIH Public Access Policy would be helpful to you?
- Do you have other comments related to the NIH Public Access Policy?
Please submit a comment and spread the word. Even if you have no suggestions to improve the policy, it’s important to express your support.
If you’re thinking that the NIH just concluded a round of public comments for its March 20 meeting, you’re right. See the comments generated by that round (and my blog post on them). One persistent publisher objection is that the policy has not been sufficiently vetted, and one purpose of the new round no doubt is to give the stakeholders one more chance to speak. We must use it. Publishers will.
Robots can detect error; but images MUST be Open
Thursday, May 29th, 2008
OA enhances error correction
Jeffrey Young, Journals Find Fakery in Many Images Submitted to Support Research, Chronicle of Higher Education, May 29, 2008.
…As computer programs make images easier than ever to manipulate, editors at a growing number of scientific publications are turning into image detectives, examining figures to test their authenticity. And the level of tampering they find is alarming…. One new check on science images, though, is the blogosphere. As more papers are published in open-access journals, an informal group of watchdogs has emerged online. “There’s a lot of folks who in their idle moments just take a good look at some figures randomly,” says John E. Dahlberg, director of the division of investigative oversight at the Office of Research Integrity [at the US Department of Health and Human Services, which includes the NIH]. “We get allegations almost weekly involving people picking up problems with figures in grant applications or papers.” Such online watchdogs were among those who first identified problems with images and other data in a cloning paper published in Science by Woo Suk Hwang, a South Korean researcher. The research was eventually found to be fraudulent, and the journal retracted the paper….
PMR: A typical example is a gel, used to show how many proteins or nucleic acids you have got and how pure they are. I’ve taken this from Wikipedia:
This is a good gel – the bands are parallel and there are no thumb prints. Many gels don’t “run straight” so it’s tempting for the author to “straighten” them in Photoshop or similar.
Robots can detect this. Better than humans.
We have software in our group that can detect errors in chemistry: in graphs, molecular structures, text, etc. It would be fairly straightforward to download all the world’s published chemistry and check it for errors. Note that in chemistry errors are mainly due to human error rather than fraud. However, we find an awful lot. In closed access journals. (That’s most of them, as most chemistry is closed.) Our robots did a check on a journal issue from a prestigious chemical publisher and found an error in almost every article. Some were trivial (missing punctuation), some were spelling errors. Some were serious. In a recent article we’ve been looking at, 30% of the chemical names are seriously wrong. Our robots find that sort of thing. Some of the chemical formulae are wrong. Some of the molecular masses are wrong.
I think this matters. I think the editors of the journal would agree. I think the publishers would not like to have incorrect science in their journal.
We can do this at near-zero cost. Some of our methods (OSCAR) are well tested, others are still alpha. They don’t work on all papers. But we’d like to see if we can detect errors. (And it’s not just error detection, but adding a very significant degree of semantics – I showed some of our automatic thesis-eating robot last week at the RSC).
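To give a flavour of the simplest kind of check mentioned above (a reported molecular mass that does not match the reported formula), here is a minimal sketch in Python. It is not OSCAR and not our group's code: the function names, the small table of rounded atomic weights and the 0.05 tolerance are all assumptions made for illustration, and it only handles flat formulae such as C6H12O6 (no brackets, charges or isotopes).

```python
import re

# Rounded standard atomic weights for a few common elements only.
# This table is an illustrative assumption, not a complete periodic table.
ATOMIC_WEIGHTS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "S": 32.06, "Cl": 35.45, "Br": 79.904,
}

# One element symbol (capital letter, optional lowercase) plus an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def molecular_mass(formula):
    """Return the average molecular mass of a flat formula such as 'C6H12O6'."""
    total = 0.0
    matched = 0  # number of characters accounted for by the tokens
    for symbol, count in TOKEN.findall(formula):
        if symbol not in ATOMIC_WEIGHTS:
            raise ValueError(f"unknown element: {symbol}")
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
        matched += len(symbol) + len(count)
    # Reject empty input and anything the simple tokenizer skipped over.
    if not formula or matched != len(formula):
        raise ValueError(f"cannot parse formula: {formula}")
    return total


def mass_is_consistent(formula, reported_mass, tolerance=0.05):
    """True if the reported mass agrees with the formula within the tolerance."""
    return abs(molecular_mass(formula) - reported_mass) <= tolerance
```

With this sketch, `mass_is_consistent("C6H12O6", 180.16)` accepts the usual value for glucose, while a reported mass of 182.16 for the same formula would be flagged. A real robot has to find the formula and the mass in the text first, which is the hard part.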
The only thing holding us back is that we may be accused of stealing content. We shan’t do this. We would download the articles, scan them for data and delete the articles. We wouldn’t sell them to non-subscribers. We shan’t post them on the web-site (though we would post the extracted data).
So this is a genuine offer to publishers. We are interested in seeing whether we can extract data and detect errors in publications. Last week all the publishers at the RSC agreed that facts weren’t copyright and they agreed that scientific images and artwork should not be regarded as being protected by publisher copyright.
So I’d like feedback from publishers. The time has come when robotic extraction and analysis of data makes sense.
And if a publisher forbids the automatic analysis of scientific data for error detection, and defends this through copyright or server-side police, is that advancing the cause of science? Wouldn’t it make sense to make all the data Open? Now?
Where should we get our computing?
Wednesday, May 28th, 2008
- A visit to the Barcelona supercomputing centre. We had a personal tour during the COST D37 visit – thanks very much to all involved. It’s a splendid sight – a huge glass box occupying most of the interior of a chapel, with racks of blades (some 2500 I think). And the lovely coloured wiring of the fibre, electric coolant, etc.
- I’m chair of the Computer Services Committee in the Department. We’re quite a federated organization and that means there are several server rooms. All of them require power. All of them require space. All of them require cooling. All of them need backing up. All of them chop and change as the kit wears out and my colleagues get new grants. All of them need connecting to the network. We have an excellent group of Computer Officers so I don’t have to think about them but it’s a lot of work and a lot of money.
- And we have a High Performance Computing facility in the University. It got into the top something-or-other for size or power or … I’m not sure what the finances are (well I’m not going to blog them) but we are urged to consider it as a primary resource.
- cpu
- bandwidth
- storage
- data
Open Source Drug Discovery and Closed Access
Wednesday, May 28th, 2008
Analysis
India Takes an Open Source Approach to Drug Discovery
Open source software may have been around for 17 years, but using an open source model to speed up drug discovery is a relatively new idea. This month, India is launching a new open source initiative for developing drugs to treat diseases such as tuberculosis, malaria, and HIV.
Seema Singh
Available online 17 April 2008.
Institutional Repositories: Caveat Roach
Wednesday, May 28th, 2008
“Roaches check in – but they don’t check out!”
The analogy is that many things are checked into institutional repositories but few are checked out. She has formally expanded this view under the general theme that IRs aren’t working and put the preprint in her own (or rather HPOW’s in library-speak) IR:
Please use this identifier to cite or link to this item: http://digital.library.wisc.edu/1793/22088
Title: Innkeeper at the Roach Motel
Authors: Salo, Dorothea
Keywords: institutional repositories open access
Issue Date: 11-Dec-2007
Citation: Salo, Dorothea. “Innkeeper at the Roach Motel.” Library Trends 57:2 (Fall 2008).
Abstract: Trapped by faculty apathy and library uncertainty, institutional repositories face a crossroads: adapt or die. The “build it and they will come” proposition has been decisively proven wrong. Citation advantages and preservation have not attracted faculty participants, though current-generation software and services offer faculty little else. Academic librarianship has not supported repositories or their managers. Most libraries consistently under-resource and understaff repositories, further worsening the participation gap. Software and services are wildly out of touch with faculty needs and the realities of repository management. These problems are not insoluble, but they demand serious reconsideration of repository missions, goals, and means.
URI: http://digital.library.wisc.edu/1793/22088
Appears in Collections: General Library Collection
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
PMR: I’ll comment on the content below but my immediate comment on the meta-data is that I have no indication of my rights to this information. It’s Openly visible and, I assume, permanently visible so it qualifies for OA1 (==weak OA, I think). So we can label it Open Access. But I don’t know whether she has transferred copyright to the publisher and, as I don’t read m/any journals in this area, I don’t know whether the journal has any Open Access policy. The mantra about DSpace copyright shows typical librarian paranoia – if we don’t know what the copyright is, put on the strongest protection possible so we won’t get into trouble. So, if I were a basketologist, I would be frightened off using this article in a course on Open Access and Repositories. Dorothea is a DSpace expert so I imagine it took her within the Harnad-Carr six minute event horizon for deposition.
Dorothea is very clear that IRs aren’t working and libraries have a limited future unless they change. Read the article. I have to say I am in sympathy with this view. I’ve come to meet many librarians and repositarians over the last two years or so and I wish them well. In general they don’t impact on my colleagues’ research (I can’t comment on teaching). If librarians are to continue in a research environment they have to discover what researchers want and give it to them. Here are some suggestions:
- Please help me with the mechanical and arcane parts of putting my CV together.
- Please help me write the grant and give me a higher chance of success. How many institutions have any idea why their grants fail?
- I spend hours preparing papers for publication – please do the boring bits for me
- I’ve just got this collaboration with … – and we need an electronic collaborative environment
- I’ve just lost all my data.
- I’ve lost all my data again
- I don’t want to lose my data a third time
Files in This Item:
- RoachMotel.pdf (211Kb, Adobe PDF): http://digital.library.wisc.edu/1793/22089
- RoachMotel.doc (248Kb, Microsoft Word): http://digital.library.wisc.edu/1793/22090
Visualising Open Knowledge
Tuesday, May 27th, 2008
PMR: I have no idea what the visualisation is actually of – maybe readers would like to guess.
The first Open Visualisation Workshop took place on Saturday as we mentioned last week. Details, notes and links are available on the event’s wiki page. The event took place at Trampoline Systems’ new site in East London. To make sure the event was as informal as it was billed to be – we left the schedule open until the day, so we could see what people were interested in doing and plan the workshop accordingly! After introductions and some brainstorming, we had impromptu talks and demos from:
- Martin Dittus, last.fm
- Julie Tolmie, Centre for Computing in the Humanities, King’s College London
- Jan Berkel, Trampoline Systems
- Jonathan Lister, Osmosoft
- Gregory Jordan, European Bioinformatics Institute
- David Aanensen, Division of Epidemiology, Public Health and Primary Care, Imperial College London
- Jonathan Gray, The Open Knowledge Foundation (me)
Most of the day was occupied by demos and discussions – so we didn’t get around to doing much tinkering with software packages. However, participants said they found it very useful to see people’s work in other fields – and were keen to continue to meet regularly. It was interesting to see how much commonality there existed between visualisation work in very different fields. Suggestions for future activities included:
- continuing to build on the list of open source visualisation packages (on the wiki) – possibly including notes, comments and example visualisations from people who have experience using these;
- domain specific sessions (e.g. visualisation for bioinformatics);
- shared project to work on, using open source visualisation software to represent an open knowledge package – e.g. using Prefuse to represent data from omdb;
- using different visualisation software to represent the same open dataset – and comparing the results;
- making very brief screencasts of different visualisation projects with voiceovers from their developers;
- promoting the open-visualisation mailing list to researchers, developers and practitioners – as participants weren’t aware of any other general mailing list for open-source visualisation technologies;
- developing a wish-list of features that participants would ideally like to see in open source visualisation software.
It was suggested we have another workshop in June to keep the ball rolling. Nearly everyone there was keen – so we’ve created a doodle page to fix the date. If you’d like to participate, please:
- add your name to the Open Visualisation Workshop wiki page;
- select which dates you are free on the doodle page;
- sign up to the open-visualisation mailing list.
How Closed Access makes progress difficult
Monday, May 26th, 2008
This page allows to predict the spectrum from the chemical structure based on “Spinus”. You may find more information on the authors website.
References
- J. Aires-de-Sousa, M. Hemmer, J. Gasteiger, “Prediction of 1H NMR Chemical Shifts Using Neural Networks”, Analytical Chemistry, 2002, 74(1), 80-90: most of the proton descriptors are explained. In that work they were used for the prediction of 1H NMR chemical shifts by counterpropagation neural networks.
- Y. Binev, J. Aires-de-Sousa, “Structure-Based Predictions of 1H NMR Chemical Shifts Using Feed-Forward Neural Networks”, J. Chem. Inf. Comp. Sci., 2004, 44(3), 940-945: the development of the FFNNs and the selection of descriptors is explained.
- Y. Binev, M. Corvo, J. Aires-de-Sousa, “The Impact of Available Experimental Data on the Prediction of 1H NMR Chemical Shifts by Neural Networks”, J. Chem. Inf. Comp. Sci., 2004, 44(3), 946-949: the use of an additional memory is described.
PMR: I missed the link to the authors’ webpage (the font is small) and thought I would start by looking at the literature references. There should be enough in the abstracts to give me a general idea. [To avoid reproducing the whole abstract - which might go beyond fair use - I have removed the past tense of the verb "to be".] The first abstract read:
Copyright © 2001 American Chemical Society [... authors snipped...] Counterpropagation neural networks [...] applied to the fast prediction of 1H NMR chemical shifts of CHn groups in organic compounds. The training set consisted of 744 examples of protons that [...] represented by physicochemical, topological, and geometric descriptors. The selection of descriptors [...] performed by genetic algorithms, and the models obtained were compared to those containing all the descriptors. The best models yielded very good predictions for an independent prediction set of 259 cases (mean absolute error for whole set, 0.25 ppm; mean absolute error for 90% of cases, 0.19 ppm) and for application cases consisting of four natural products recently described. Some stereochemical effects could be correctly predicted. A useful feature of the system resides in its ability to be retrained with a specific data set of compounds if improved predictions for related structures are required.
PMR: This gives some, but not really enough, information about the method. In particular, were 3D coordinates of the molecule generated, and if so, how? Were shifts averaged across topologically equivalent protons? So I went to the second abstract (again copyright ACS so the full abstract is not given):
Feed-forward neural networks [...] trained for the general prediction of 1H NMR chemical shifts of CHn protons in organic compounds in CDCl3. The training set consisted of 744 1H NMR chemical shifts from 120 molecular structures. The method [...] optimized in terms of selected proton descriptors (selection of variables), the number of hidden neurons, and integration of different networks in ensembles. Predictions [...] obtained for an independent test set of 952 cases with a mean average error of 0.29 ppm (0.20 ppm for 90% of the cases). The results [...] significantly better than those obtained with counterpropagation neural networks.
PMR: Still nowhere near enough information. Now I’m at home (it’s a public holiday in the UK) and I’m watching the cricket (which is absorbing). Although I could get a login to Cambridge library I choose not to, as that puts me in the position of a second-class citizen impoverished by closed access. So maybe the authors have bought ACS “Author Choice”. Unfortunately not. I will have to pay 2 * 25 USD for access to these articles. And the access only lasts for 48 hours. These are the sort of papers that can’t easily be fully digested in 2 days. I’m not sure what happens if I keep copies on my hard disk – I expect that it bursts into flame like Mission Impossible and I daren’t take the risk. BTW I can see no possible point in restricting access to 48 hours – and it’s yet another indication of the publisher treating the scientific community solely as a source of revenue. Maybe the ACS staff who read this article will enlighten us.
The point is that this type of procedure – however necessary or not to the survival of the publisher – causes great problems to the community. So the price of preserving a reader-pays publishing economy is to slow down science, encourage many scientists to avoid reading the literature and generally to reduce the coherence of the scientific process. I, for example, am unlikely ever to read these papers now.
[NMRDB note. The link I missed explained that the prediction was based on CORINA structures. I am still unclear generally about nmrdb.org - does it have spectra? how many? are they freely available? etc. The web site says very little.]
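A note on the error figures quoted in the two abstracts above: both give a mean absolute error for the whole test set and a second figure "for 90% of cases". The abstracts do not say how the 90% were chosen. The sketch below assumes it means the 90% of cases with the smallest errors, which is my reading and not something I can confirm without the papers. The function names and the example numbers are made up for illustration.

```python
def mean_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over all cases (e.g. shifts in ppm)."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need two non-empty lists of equal length")
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    return sum(errors) / len(errors)


def mae_best_fraction(predicted, observed, fraction=0.9):
    """MAE over the given fraction of cases with the smallest errors.

    ASSUMPTION: this is one possible reading of "mean absolute error
    for 90% of cases"; the abstracts do not define it.
    """
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need two non-empty lists of equal length")
    errors = sorted(abs(p - o) for p, o in zip(predicted, observed))
    keep = max(1, int(len(errors) * fraction))  # always keep at least one case
    return sum(errors[:keep]) / keep


# Invented example: nine good predictions and one outlier.
observed_ppm = [0.0] * 10
predicted_ppm = [0.1] * 9 + [1.1]
whole_set = mean_absolute_error(predicted_ppm, observed_ppm)
best_ninety = mae_best_fraction(predicted_ppm, observed_ppm)
```

In this invented example a single outlier doubles the whole-set figure compared with the best-90% figure, which is why it matters to know exactly which cases were left out.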
Green OA and Open Data – more
Sunday, May 25th, 2008
Green OA and open data [PMR response snipped...]
PS Comments
- First, I generally agree with PMR’s opening characterization of green OA. I’d only add that we should distinguish green OA itself from the strategy proposal (which I do not endorse) to slow down on the pursuit of open data until we succeed with open texts. As usual, I think we should proceed on all fronts at once. I generally agree as well with PMR’s understanding of the state of open data in OA repositories. But in describing this state, I’d put the accent in a different place.
- It’s true that most OA repositories today are optimized for texts and not optimized for data. It’s also true that few institutions (universities, funders, publishers) encourage or require the deposit of data files in repositories. Finally, it’s true that most OA repositories will accept data files, even if few researchers are depositing data files. With this background, my response reduces to two quick points:
- First, it doesn’t follow that green OA is “irrelevant” for open data, merely that we are under-using the opportunities it provides for open data. We shouldn’t confuse researcher practices or institutional policies with repository capacities or green OA. If under-using an opportunity made it irrelevant, then conservation would be irrelevant to climate change and green OA would be irrelevant even to text files.
- Second, we have a long way to go to make most repositories as useful for data files as they are for text files. But it doesn’t follow that green OA is irrelevant or harmful for open data, merely that its capacity to help users do useful work with OA data files must continue evolving.
There are many projects trying, in many different ways, to make green OA even more relevant and useful for data than it is now, e.g. by increasing data deposits in repositories and allowing fuller use of data already on deposit. For example, see ASSDA (from ANU), CESSDA (from NSD), Commons of Geographic Data (from the U of Maine), DANS (from the Royal Netherlands Academy of Arts and Sciences), LEAP (from AHDS), LinkingOpenData (from W3C), Pangaea (from a coalition of German research institutions), and StORe (from JISC).
PMR: I agree with all this and thank PS for the list of institutions encouraging data. I’ll try to rephrase:
- CC-BY and BBB-compliant OA necessarily support Open Data. This is a logical coupling. If you publish an article in this way you automatically give the world permission to re-use it (and, by implication, its associated files).
- “strong OA” *may* go some way to supporting Open Data. It is possible that some licences will allow re-use. Note that Non-commercial use is incompatible with Open Data and Open Knowledge as currently defined, and it is possible that some “strongOA” sites offer full removal of permission barriers. Even without full removal, the removal of *some* barriers at least points in the right direction and alerts the community (author/reader/repositarian) to the fact that there are barriers and that some people want to remove them. It’s not logically coupled, but there may be some empathy with Open Data. It’s also possible (though I have no evidence) that strong OA offerings are more conscious of the non-copyrightability of factual data and the value of adding supplemental files to web sites.
- gold OA is often paid for. Gold OA may remove none, some or all permission barriers. Only the last leads to Open Data. Since Gold OA requires the author or the funder to pay money, both of them should think hard about what they are getting. PeterS has often noted that certain agreements give authors and readers very little more than what GreenOA offers, but that the author has paid a lot of money. An example is that some publishers expose fulltexts after an initial period (say 2 years) after which GoldOA and Green OA lose any advantage over the freely accessible text.
- hybrid offerings. I think I have a special concern about hybrid publications (where a journal can contain “OA” and “not-OA”). There was a rash of these last year and they did not impress me – poor value for money, poor clarity of presentation, little additional exposure for authors, poor quality of access to data, etc. It’s not surprising that – at least in chemistry – there has been almost no take-up. I reviewed a random number of these offerings in this blog and felt that funders were paying a lot for relatively little. Admittedly I was taking BBB as a baseline and downgrading anything weaker. In principle a hybrid offering could offer almost nothing other than visibility and charge a lot. The main value that hybrids have over green OA is that they are discoverable on the publisher’s site, which may be important.
- green OA. Although logically there is no reason that data cannot be deposited, greenOA offers no logical and little social encouragement to do so. If, tomorrow, the whole world had adopted greenOA but re-use was strictly forbidden then it would be little use to data-rich sciences. That is a caricature, but there is no doubt that BBB – with its insistence on re-use – is of much greater practical value in many subjects.
