Archive for June, 2008
WHAT IF YOU WERE THE PUBLISHER?
My colleagues and I are excited by this and I have written off to Elsevier to ask more about the content of the dataset. However, what can you do with 5000 articles covering the whole of science? Citation analysis has already been done. What else is general to all the disciplines? Well, we have some ideas and we aren’t giving them away here, but here’s one you might like to work on.
Demonstrate your best ideas for how scientific research articles should be presented on the web and compete to win great prizes!
We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset. We will encourage the use of XQuery, but this will not be a mandate. By leveraging these APIs, the contestant becomes the publisher and can render scientific articles to meet their needs including integrating the article into existing applications or combining it with other web service APIs.
Let’s assume that the “fulltext” is chosen randomly from all items on the Elsevier site marked as “fulltext PDF”. These are, of course, chargeable at 31.50 USD. So I’ve done a pilot study on the latest issues of Tetrahedron.
There are 29 fulltext articles (all 31.50 USD) and here are five of them (16% of the total):
|Editorial board Page IFC Preview Purchase PDF (71 K) | Related Articles|
|Graphical contents list Pages 7445-7451|
|Corrigendum to “The potential of intermolecular NO interactions of nitro groups in crystal engineering, as revealed by structures of hexakis(4-nitrophenyl)benzene” [Tetrahedron 63(28) (2007) 6603–6613] Page 7650 Eric Gagnon, Thierry Maris, Kenneth E. Maly, James D. Wuest Preview Purchase PDF (68 K) | Related Articles|
|Calendar Page I Preview Purchase PDF (60 K) | Related Articles|
|IBC: Guide for Authors Page IBC Preview Purchase PDF (238 K) | Related Articles|
So it would normally cost 157.50 USD to read these. Hopefully 16% of the Elsevier Article 2.0 dataset will be of this category and we’ll be able to read and analyse the fulltext for free. That’s about 800 articles, so enough to work on. I do hope Elsevier have included them, because they are clearly worth paying for. Indeed the total cost of these articles would be about 25,000 USD and we can get them for free!
I’ll be continuing my little adventure into which other publishers charge non-subscribers for:
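The sums above can be checked in a few lines. This is only a sketch: the dataset size of 5000 and the 16% share are the figures assumed in this post (the announcement itself says approximately 7,500 articles, and 5 out of 29 is closer to 17%), and the variable and function names are illustrative.

```python
from decimal import Decimal

PRICE_USD = Decimal("31.50")          # per-item price quoted on the site
PILOT_ITEMS = 5                       # the five front-matter items listed above
DATASET_SIZE = 5000                   # assumption: dataset size used in this post
FRONT_MATTER_SHARE = Decimal("0.16")  # assumption: share taken from the pilot study


def total_cost(items, price=PRICE_USD):
    """Cost in USD of buying `items` documents at a flat per-item price."""
    return price * items


pilot_cost = total_cost(PILOT_ITEMS)
expected_items = int(DATASET_SIZE * FRONT_MATTER_SHARE)
dataset_cost = total_cost(expected_items)

print(f"pilot: {pilot_cost} USD")
print(f"expected front-matter items: {expected_items}")
print(f"dataset: {dataset_cost} USD")
```

With these inputs the pilot comes to 157.50 USD and the dataset estimate to 25,200 USD, which is the "about 25,000 USD" quoted above.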
- Editorial Board Info
- List of abstracts
- Guide for authors
Here’s a diversion – as I am sat watching England lose to New Zealand… (Note that I read this without University access as it’s easier to realise what it’s like not to have it).
I’m interested in how easy or difficult it is to find crystallographic structures from various publishers and I’ve turned to Tetrahedron – an Elsevier journal. Tetrahedron does not allow – or at least does not seem to encourage – its authors to expose their crystallographic data as supplemental information. I wrote to them last year asking if they would consider doing so. After all, RSC, IUCr and ACS do. None of the five editors replied. Well – I’m only a reader.
Anyway I turned to the latest issue. No luck (I didn’t expect any). But there was one “article” of interest:
Corrigendum to “The potential of intermolecular NO interactions of nitro groups in crystal engineering, as revealed by structures of hexakis(4-nitrophenyl)benzene” [Tetrahedron 63(28) (2007) 6603–6613] Page 7650 Eric Gagnon, Thierry Maris, Kenneth E. Maly, James D. Wuest Preview Purchase PDF (68 K) | Related Articles
Maybe I can find a (corrected) structure in the corrigendum. After all, that is free…
The corrigendum will cost me 31.50 USD.
I have to pay 31.50 USD to read the original article and then another 31.50 USD to find out it’s wrong. I’d never have thought of such a clever way to make money.
I wonder how many words there are in it. If there are fewer than 31 (excluding the author rubric) then it’s over a dollar a word. Perhaps someone can give this information in a comment. (I don’t want to get sued for comparing the price of articles.)
This is a great way to spend a Saturday – the RI is worth a visit in itself. And so is London.
We sat in a cafe opposite the British Museum and talked about why blogs and social networks do and don’t work.
Nature Networks also runs RL meetings and I’m much in favour of this. Blogs and other social networks can be good ways of meeting and taking a shared purpose forward. Are they a good way of developing that social purpose? Sometimes, but sometimes it helps to meet IRL.
How about a grand challenge project? Open drug discovery? With Open Notebook science? GSK has just put (Gavin Baker reports) all their cell lines into the “public domain” (not sure exactly what licence or antilicence, and I’ll blog that later). What could the blogosphere do with that? It could be enormous.
And, by chance, Matt studied at York and worked in the lab run by crystallographers Guy and Eleanor Dodson who I was just off to meet at the Wellcome building. They remember him.
But the Wellcome and crystallography in 1951 is a separate post.
Ben Goldacre (The Guardian columnist on “Bad Science”) has unearthed a superb interchange between a scientist and the creationists. Richard Lenski’s replies are tours de force and quite apart from rebutting the criticisms they could be read by undergraduate scientists as an example of excellent scientific practice. It’s worth emphasizing that good scientists live in the daily concern that their data must be validatable, reproducible.
The discussion is classic and a must-read. It’s long and according to your makeup will leave you laughing, weeping or raging. I append Ben’s summary below. But read the discussion.
Inter alia RL was attacked by the creationists for failing to provide data to support his claims. RL replies that the data were in the paper but his attacker wilfully failed to understand this (and possibly failed to read the paper anyway). It is absolutely clear that RL provided everything that any responsible scientist would.
Here are some snippets:
Schlafly (Creationist): Submission guidelines for the Proceedings of the National Academy of Science state that “(viii) Materials and Data Availability. To allow others to replicate and build on work published in PNAS, authors must make materials, data, and associated protocols available to readers. Authors must disclose upon submission of the manuscript any restrictions on the availability of materials or information.” Also, your work was apparently funded by taxpayers, providing further reason for making the data publicly available.
Please post the data supporting your remarkable claims so that we can review it, and note where in the data you find justification for your conclusions.
Dear Mr. Schlafly:
I suggest you might want to read our paper itself, which is available for download at most university libraries and is also posted as publication #180 on my website. Here’s a brief summary that addresses your three points….
All these issues and the supporting methods and data are covered in our paper.
Schlafly: Dear Prof. Lenski, This is my second request for your data underlying your recent paper…
If the data are voluminous, then I particularly request access to the data that was made available to the peer reviewers of your paper, and to the data relating to the period during which the bacterial colony supposedly developed Cit+. As before, I’m requesting the organized data themselves, not the graphs and summaries set forth in the paper and referenced in your first reply to me. Note that several times your paper expressly states, “data not shown.”
RL: Finally, let me now turn to our data. As I said before, the relevant methods and data about the evolution of the citrate-using bacteria are in our paper. In three places in our paper, we did say “data not shown”, which is common in scientific papers owing to limitations in page length, especially for secondary or minor points. None of the places where we made such references concern the existence of the citrate-using bacteria; they concern only certain secondary properties of those bacteria. We will gladly post those additional data on my website.
PMR: RL has taken great pains (many pages) to refute the claims of fraud and to assert that the data were visible. BUT this was only possible because the fulltext of the paper was available.
Imagine what would have happened if RL had replied:
“I am sorry, but the article was published in a closed access journal and I have no rights to make the text, which contains the data, publicly available. You will just have to believe that the reviewers and editors agreed with my arguments and that the data supported it. Or each of you and your acolytes will have to purchase the article at a cost of 30 USD from the publisher. And don’t post it on your website – even just the graphs containing the data – or the publishers will send legal letters to you”.
So, in this case, Open visibility was essential to RL’s successful defence. However the “data not shown” was a potentially serious weakness, which was imposed not by the author but by the publication process. Electrons and magnetic disks are infinitely cheaper than “pages” and there is no reason whatever to have “data not shown” in a modern article.
Except of course if, as a publisher, you want to stop us re-using it for free…
- We are doing a lot of molecular simulations on all kinds of minerals and access to crystallography structure data bases is crucial for our work… (USA)
- As the author of webmineral.com, any public access to mineral data promotes and encourages further understanding of the mineral (material) sciences… (USA)
- The information on data basis is generated by persons who, in same cases, have to pay for obtaining it. It is not reasonable; it should be freely available!… (Brazil)
- Because I am only an occasional user of crystallographic databases (for example, during the mineralogy section of my soil chemistry course) I cannot justify paying full price for access to a database. I generally just use crystal structures that are prepackaged with CrystalMaker. Open access would allow my students and I to learn to use the crystallography software for a large variety of soil minerals… (USA)
- The data is all obtained by the scientific community and should be available without charge. As a fact of nature it should not be copywritable… (USA)
- I am a Crystallographer and our University is suffering from a constant lack of funding – so the databases are not being updated and are also not easily available… (Australia)
- This is really a good proposal for people from developing countries… (India)
- I trust there should be no economical barriers to knowledge, and databases are essential tools for the scientific community and should be accessibile to everyone… (Italia)
- Egypt is a devoloping Country and We canot have the regular Crystallographic data base which we realy need it in our work accordingly if the crystallographic community allow us to have it on line it will be a big achievement for the third world countries… (Egypt)
- As a result of our poor exchange rate, researchers in South Africa have to pay a very high price to obtain these databases. As a young academic and researcher, I could not afford the CSD in the first two years of my career, which made crystallographic research very difficult. Even now, obtaining the CSD is extremely expensive, and a large percentage of my research funds is used for this… (South Africa)
… and many more.
So the volunteers at the COD started soliciting contributions, and – I think – typing up some historical and classic structures. (It is quite difficult to get Open structures for the sorts of materials required in undergraduate teaching, especially minerals.) As I understand it the COD sources include:
- typing up public information
- donations from various regular sources – Am. Mineral and others
- donations from individual groups and laboratories.
MatTodd (Tue, 2008-06-24 21:36): We’re drawing up a contract (with WHO and the ARC) to cover our new grant (and hence this site). Our business office would like to know which Creative Commons licence is most suitable. I was assuming Attribution 3.0 unported, since this allows sharing and remixing under attribution. On the face of it, a better alternative is Attribution-Share Alike 3.0 Unported, since this also requires that anyone using the research has to distribute their own work under a similar licence. Anyone have any views on this for science research? Is the share-alike clause unduly restrictive? What if a company, reading results posted on these pages, would like to develop those results for a different for-profit reason that is unrelated to our original research interest – would a ‘share-alike’ licence prevent that?
PMR: It’s a great idea to post this on a blog – I’m relaying it here so it gets to other audiences, especially from the SC and OKF communities. My own take is very strongly influenced by the Open Knowledge Foundation (OKF) and Science Commons (who are an offshoot – not a fork – of CC). I’d strongly urge CC-BY (attribution) for the “fulltext” and the Science Commons + Open Knowledge approach to the data – no licence at all but instruments such as PDDL which make it clear that the data are Open and Free as in air.
“Non-commercial” immediately raises questions as to what is “commercial”. This is almost impossible to define, so generally all it causes is problems. Is a non-profit that sells goods and services commercial? If an academic runs a course and charges fees is that commercial? And so on…
I started this blog with CC-NC, not because of me but because I was worried that readers might not wish to contribute. Then I was persuaded to change to CC-BY and haven’t regretted it. Now it’s clear that anything on this blog is free and open, as long as you acknowledge the author (not always me).
If you want to set it to music, compile a dictionary of incorrect English usage, or use it as a public key for cryptography, you can do it. You can create an anthology of all CC-BY blogs and sell them. You can set up a web service which charges people for access to these blogs…
NC-SA is a recursive nightmare. I have already covered myself in shame for having mistakenly aired views of OKF members here, but at least it got the problem into the open. CC-SA has implications for any other digital artefact it is involved with. It is theoretically viral.
For the data we should use PDDL. From WP on Science Commons:
Using data and CC licenses
Science Commons launched on 16 December 2007 the Protocol for Implementing Open Access Data in conjunction with the Public Domain Dedication License and the Open Knowledge Foundation. The Protocol is of note because, rather than relying on copyright licenses such as the Creative Commons licenses and the GNU GPL, it provides a rationale and methodology for reconstructing the public domain of data.
PMR: This can be coupled with “community norms” – the idea that authors express – in non-legal terms – what they feel is reasonable and not reasonable to do with the data. They can’t override the basic freedoms (e.g. authors cannot prevent export to countries their governments don’t approve of). But in some areas there are complex problems – e.g. the use of human data – and it is important to develop protocols for these. My hope is that communities will start to pick up what practices are seen as acceptable and help to formalize them, hopefully with the help of Learned Societies and International Scientific Unions.
Charles Pratt (1830-1891) was an early pioneer of the natural oil industry in the United States. He was founder of Astral Oil Works in the Greenpoint section of Brooklyn, New York. He joined with his protégé Henry H. Rogers to form Charles Pratt and Company in 1867. Both companies became part of John D. Rockefeller’s Standard Oil in 1874. Pratt is credited with recognizing the growing need for trained industrial workers in a changing economy. In 1886, he founded and endowed the Pratt Institute, which opened in Brooklyn in 1887.
This resonates with other visionaries of the time – I have a long association with Birkbeck College in London:
Working as a doctor in London, Birkbeck, with others, established the London Mechanics Institute in November 1823 – of which he was the first President. The Mechanics Institute concept was quickly adopted in numerous other cities and towns across the UK and overseas, but his association with the ground-breaking London institution was marked by it being renamed the Birkbeck Literary and Scientific Institution in 1866 (now, as Birkbeck College, part of the University of London).
Well, the one similarity is that we are “in a changing economy” so I hope that Charles Pratt would have looked favourably on what we did…
We are very well equipped for workshops at UCC and have 16 machines and a projector/beamer which can be booked for sessions. So rather than my pontificating I prepared some hands-on. You can do this at home – it only needs a web-browser.
I started by asking them what significance they might attach to the number: 10.1039/b804987d
Not all of them got it immediately, so I asked them to Google for it and, of course, it’s a DOI for a scientific article. I then asked them to see what the components were – abstract, full text in HTML, full text in PDF, etc. Could they read the full text? Yes. If they went back to a hotel could they read it? They quickly realised no. Could they tell from the display that the difference was due to the fact that the university had a subscription to the journal? Yes – there was a rubric saying so, though I doubt that many undergraduates or staff in most institutions would notice it.
How did the DOI work? We found the DOI site. What did it provide? Could we have done all this through Google? etc. How did we know a DOI was unique? How do you identify a book? By ISBN… yes, but how do most of the population identify a book? By its Amazon stock number. Oh… That’s the sort of disruption that is changing the role of libraries. Names and addresses in TimBL’s world are conflated. The reality is the web.
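To make the DOI questions concrete, here is a minimal sketch (an illustration, not something run in the workshop). It assumes only the general shape of a DOI, a prefix starting with "10." and a suffix separated by the first slash, and the public resolver at https://doi.org/. The function names are hypothetical and no network request is made.

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"  # public DOI resolver (proxy) address


def split_doi(doi):
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.strip().partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def doi_to_url(doi):
    """Build the web address that the resolver redirects to the article."""
    prefix, suffix = split_doi(doi)
    # Percent-encode unusual characters in the suffix, keeping slashes as they are.
    return RESOLVER + prefix + "/" + quote(suffix, safe="/")


if __name__ == "__main__":
    print(split_doi("10.1039/b804987d"))
    print(doi_to_url("10.1039/b804987d"))
```

The prefix identifies who registered the name and the suffix is whatever that registrant chose, so the name becomes an address only once the resolver is put in front of it.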
Current reality is an illusion unless it has a URI (==URL).
So now we know what the contents of a paper are, here are some exercises – with discussion in between. (You can use them if you want – this blog is CC-BY.)
Goals. To investigate how data is published in leading journals. Each team (2-3 people) should pick a publisher from:
- Royal Society of Chemistry (Org. Biomol. Chem)
- ACS (J. Org Chem)
- Wiley (Angew. Chemie)
- Beilstein Journal of Organic Chemistry
- J Heterocyclic Chemistry
- Molecules (MDPI)
- Is the Journal Open or Closed access?
- Can you access the fulltext? If so is it because you are on the Cambridge network?
- Does the journal publish data embedded in full-text?
- Does it publish data as supplemental/supporting info/data?
- Is there a licence?
- Can you understand it?
- does the author retain copyright?
- is the supplemental data copyrighted? by whom?
- Download OSCAR/Experimental data checker from RSC site (Google for it – I deliberately don’t give URLs any more)
- Who wrote it? (answer some very bright chemistry undergraduates here)
- What is the copyright?
- what is the licence? (Open Source)
- Run it. This worked a dream on Windows – clicking the jar fired up OSCAR.
- Use it to extract data from one of the papers you have found (the paragraph needs to describe “Synthesis…” or “preparation of …”)
- Load CrystalEye (Google)
- Find the latest issue of your journal.
- Is it abstracted by CrystalEye?
- If not, why not? (Because the publisher does not allow or support the publication of crystallographic data as supplemental information)
- Pick another journal. Find the latest issue in the TOC.
- Pick a paper.
- Marvel at Jmol. (Open Source molecular viewer from the Blue Obelisk)
- follow the DOI in CrystalEye to find the article.
- Where in the article is the crystal structure described?
- Where is the CIF file (crystallographic information file)?
- What is its copyright? (Some publishers add their copyright to these files of facts. Did we agree with this? No, we didn’t.)
- variance in the original experiments. All scientific measurements should be quoted with error estimates, which can often be obtained by repeated measurement.
- systematic errors (bias) in the measurements. Sometimes the causes are known but often they are not. Bias is often discovered when measurements are made in different laboratories or with different methods and equipment. Miscalibration of instruments is a common cause.
- misunderstanding or misreporting of the physical quantity or measurement. For example in chemistry there are several concepts of “bond length” – the distance between 2 atoms – and they are fundamentally different. One effect is due to the uncertainty principle – atoms do not occupy a fixed position even at absolute zero.
- omission of relevant independent variables. Thus a crystal structure varies with temperature and pressure. Often these are not explicitly recorded – there is often a default assumption that measurements are done in “normal conditions” – about 25 deg C and 1 atmosphere. But many theoretical calculations relate to absolute zero and no pressure.
- omission of units of measurement. This should never happen, but many computer programs still emit raw numbers and assume the user knows what the units are.
- Transcription and typographical errors. These are still common. Many chemists still measure spectra with rulers. Many scientists write numbers in a lab book and type them wrongly. Many computer operations fail to report invalid input or produce corrupted output. For example we used a well-known theoretical program which takes free format input limited to 80 characters on each line. However lines longer than this were not flagged as errors but simply ignored silently, which led to gross errors that were hard to detect. Even copying files – perhaps by cut-and-paste – can corrupt information.
- Our inability to describe effects comprehensively. In crystallography, for example, it is frequently found that atoms are “disordered” – a simple picture is that they are sometimes in place A and sometimes in place B. Whether they hop between these places or whether the disorder is a statistical average over a macroscopic crystal may not be known. A full treatment of disorder may be difficult and expensive and include weeks of work on a neutron source (which needs a nuclear reactor).
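Two of the points above are easy to illustrate in code. This is a rough sketch only: the first function quotes a repeated measurement as a mean with its standard error, and the second reports input lines longer than 80 characters instead of letting them pass silently. The function names and the sample numbers are made up for illustration and are not taken from any real program or dataset.

```python
import math
import statistics


def mean_with_standard_error(values):
    """Return (mean, standard error of the mean) for repeated measurements."""
    if len(values) < 2:
        raise ValueError("need at least two measurements to estimate an error")
    mean = statistics.mean(values)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def overlong_lines(text, limit=80):
    """Return (line number, length) for every line longer than `limit` characters."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


if __name__ == "__main__":
    # Hypothetical repeated measurements of one bond length, in angstroms.
    mean, sem = mean_with_standard_error([1.52, 1.54, 1.53])
    print(f"bond length = {mean:.3f} +/- {sem:.3f} A")

    # Hypothetical input deck: the second line is one character too long.
    deck = "a" * 80 + "\n" + "b" * 81
    for number, length in overlong_lines(deck):
        print(f"line {number} has {length} characters (limit 80)")
```

Running a check like the second one before handing a file to a program that quietly drops long lines turns a silent error into a visible one.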