What is a chemical compound? and what's a label

Steve Bachrach poses an interesting question on the CHMINF-L list. I have omitted the citations and some other material – you can read the archive if necessary.

 
I have run into an interesting chemical problem that has led to both theoretical and applied database questions. I am hoping that some of the experts on the list can shed some light.
I have been looking into the recent controversy concerning the structure of (+)-hexacyclinol. This compound was first isolated in 2002 by Graefe et al who proposed a structure for it…. La Clair recently synthesized this structure, or at least reportedly so. [SB: By the way – do a google search on hexacyclinol to see how the blogosphere responded to this problem.] Then Rychnovsky …proposed an alternative structure for (+)-hexacyclinol, which was subsequently synthesized and confirmed to be identical to the original natural product.
(PMR: Yes, the blogosphere is worth reading, e.g. Totally Synthetic (blogroll) and Tenderbutton (but this is now password-only).)
So here is first my theoretical question: How do you index such a situation? The original structure of the molecule (+)-hexacyclinol is wrong, and a subsequent one is right. So, when you query a database, which structure matches up with the name “(+)-hexacyclinol”? My guess is that it should be the correct one – but then what do you do with the old structure? Obviously, this is not the first, nor will it be the last, compound whose structure is contested.
Now here is the more applied aspect. A search in SciFinder for (+)-hexacyclinol gives CA 484674-97-7, which is the original (and, we now know, wrong!) structure. Querying for the papers that have this structure returns the Graefe, La Clair and Rychnovsky papers, but not the Porco paper. But entering the “true” hexacyclinol structure and then doing a search locates 2 structures, CA 903574-41-4 and CA 903574-42-5, which look to me to be identical. Furthermore, the only paper that is linked to these “two” structures is the Rychnovsky paper. In other words, the Porco paper that reports the actual synthesis and x-ray structure of hexacyclinol does not have any hexacyclinol structure(s) (correct or not) attached to it!
PMR: Note – SciFinder is a Closed access tool to the Closed Chemical Abstracts database of chemical information. I cannot therefore comment on Steve’s IDs.
(By the way, a PubChem search for hexacyclinol comes up dry, but all of the above papers are indexed in PubMed.) Any explanations?

(PMR: Yes, Hexacyclinol is not very interesting except to chemists so no-one has deposited a data collection containing it to Pubchem. If synthetic chemists contribute collections of targets to Pubchem I am sure Pubchem will be delighted to accept them. However many chemists are still unaware that PubChem exists.)

PMR: There is nothing strange in this – it’s common in almost all disciplines. As a science progresses the interpretation of objects changes. Genes, organisms, galaxies are all frequently reclassified. It’s actually a strange feature of structural chemistry that there are so many cases that aren’t fluid and where a structure and a substance can be associated and where this association can persist for a long time.
The language of “right” and “wrong” is what is causing the problem. These statements should be recast in terms of annotations or assertions, labelled with the authority that makes them. (Incidentally this is what is at the basis of the RDF-based Semantic web). The above could be written:
2002: Graefe asserts that C1 is the structure associated with a given substance (S1). Graefe gives S1 the label “hexacyclinol”. Graefe also asserts that certain reported physical data (D1) belong to S1.
2005? La Clair makes S2 and asserts that it is the same substance as S1 and re-asserts that S1 has the structure C1.
2006. Rychnovsky asserts that S1 should instead be associated with a different structure, C2. Porco then makes S3, which has the structure C2, and asserts that it is identical with S1.
By the laws of chemistry (which say that a given substance S should only be associated with one structure C) we have a contradiction. So…
2006. Many chemists, including the blogosphere, assert that La Clair’s statement is false.
But that may not be the end of it.
That is a simplified picture. Henry Rzepa has written about “What is mauveine?” – it is by no means clear what this industrially spectacular purple pigment was or is. He has devised an RDF scheme for presenting, and possibly resolving, a number of assertions about the structure of a compound.
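As a rough illustration of the "assertions labelled with their authority" idea (this is not Henry's RDF scheme, just a minimal Python sketch), the statements above can be held as plain records and a program can then look for substances to which different authorities have attached different structures. The record layout, the relation names and the label "C2" (standing for the alternative structure mentioned in Steve's message) are all assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    year: int        # year the assertion was made
    authority: str   # who makes the claim
    substance: str   # e.g. "S1"
    relation: str    # "has_label", "has_structure" or "same_as" (hypothetical names)
    value: str       # e.g. "hexacyclinol", "C1", "S1"


# Illustrative data only. 2005 is the uncertain date given in the text ("2005?")
# and "C2" is a placeholder for the alternative structure.
ASSERTIONS = [
    Assertion(2002, "Graefe", "S1", "has_label", "hexacyclinol"),
    Assertion(2002, "Graefe", "S1", "has_structure", "C1"),
    Assertion(2005, "La Clair", "S2", "same_as", "S1"),
    Assertion(2005, "La Clair", "S1", "has_structure", "C1"),
    Assertion(2006, "Rychnovsky", "S1", "has_structure", "C2"),
]


def contested(assertions):
    """Return {substance: {structure: [authorities]}} for every substance
    that has been assigned more than one structure."""
    by_substance = defaultdict(lambda: defaultdict(list))
    for a in assertions:
        if a.relation == "has_structure":
            by_substance[a.substance][a.value].append(a.authority)
    return {
        substance: dict(structures)
        for substance, structures in by_substance.items()
        if len(structures) > 1
    }


print(contested(ASSERTIONS))
```

Nothing is marked "right" or "wrong" here: the database keeps every assertion with its author and date, and the contradiction is something a query reports rather than something the indexer has to resolve.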
Not all chemistry has the luxury of being able to associate a precise formula with a given substance in a jar. Here are some simple examples,
  • What is aluminium chloride?
  • What is glutamate?
  • What is glucose?

These are legitimate scientific questions, and answering each one requires several assertions, linked in a mini-semantic web. That is why we need to move from a twentieth-century way of describing chemistry (as exemplified by the CA numbers) to a semantic one. There is lots of room for volunteers.
I’m reminded of being shown round the British Museum of Natural History – and a room full of fish specimens in ethanol in labelled glass containers. The biologist said that some countries had asked for their specimens – their property – back – the BM had resisted. But if it ever came to that the BM would keep the labels – their metadata.


What is a citation?

I’ve been trying to find out what a “citation” is. At least the sort of citation that governs my future, and the funding of my department and institution. Just to reintroduce this subject, here’s Bill Hooker replying to my post Impact Factors! Hirsch, Erdős and Pauling:

According to Google Scholar, this is me [Bill]: [18,16,16,12,10,9,8,5,5,2,1,1,0,0], which yields an h-index of 7 if I understand the definition. According to the Wikipedia article, a “modestly productive” biomed researcher should have an h-index greater than their “years of service”. Even if those years start when I first published (1995), I’m not doing very well. But I didn’t need a fancy index to tell me that.

I think Bill’s maths is correct. But where do his figures come from? Google Scholar, which I also use because it’s Open and easy, and because I don’t like using products from closed monopolistic commercial information providers. But is a Google citation count the same as a Web Of Knowledge citation? How do we know?
Recently I had to fill in my publications for the current UK Research Assessment Exercise. In this we were asked to give 4 + 2 research publications over the last 5 years. I selected the ones that I was proudest of – not necessarily the ones with the highest Google citations. I think that in this RAE there is still a lot of human assessment so I should give them something interesting to read. In the next one it will be done by robots, so we need to know what robots like.
So research is not now about chasing the puzzles that “nature” sets us, but about guessing what the next metric is going to be. I suspect it’s rather like pop music – for many years the New Musical Express produced hit charts – the lists of how many people bought which records each week. The numbers were collected by the industry – there were presumably no audit processes – and showed which were the most popular records. Not the best, just the most popular. Presumably this is a complex function of quality and marketing. However the numbers had positive feedback – if something sold well it was likely to be played more often and people felt they needed to buy it. But, retrospectively, I doubt many musicologists would claim the numbers were perfect or even good measures of quality. The same is true of films – box office and expert judgement from 20 years on probably have a poorish correlation.
So will the research metrics be different? The music industry had two indicators at least (sheet music and record sales). Perhaps this is analogous to citations and downloads of research articles. So let me take one of my papers that I feel represents a part of my informatics research and scholarship:
Murray-Rust, Peter; Mitchell, John; Rzepa, Henry (2005) “Chemistry in Bioinformatics” BMC Bioinformatics 6 141
http://www.biomedcentral.com/1471-2105/6/141

It’s Open Access, so you can read it. BMC Bioinformatics publishes accesses (see, e.g. this month’s). A month after it was published BMC sent me a mail saying this was one of the highly accessed articles:

From: BioMed Central Editorial
To: Peter Murray-Rust

Subject: Download statistics for your Open Access article
X-OriginalArrivalTime: 08 Sep 2005 06:30:15.0878 (UTC) FILETIME=[C65B6E60:01C5B43E]
Date: 8 Sep 2005 07:30:15 +0100

Title : Chemistry in Bioinformatics
Authors : Peter Murray-Rust, John B Mitchell and Henry S Rzepa
Journal : BMC Bioinformatics
Citation : 6:141
Dear Dr Murray-rust,
We thought you might be interested to know how many people have read your article since it was published:
Total accesses to this article: 1143
Access figures include full text, abstract and PDF downloads from the BMC Bioinformatics website.
These figures only reflect the accesses recorded on the journal’s website and the BioMed Central website and do not include those from PubMed Central or other sites that archive articles published by BioMed Central (see http://www.biomedcentral.com/info/libraries/archive). The overall access statistics for your article are therefore likely to be significantly higher.

(I can’t find the current access count as BMC only seems to keep the last year in its RSS).
But the paper only gets 4 citations in Google Scholar (probably at least two are self-citations), and presumably fewer in ISI (which I cannot access as it is Closed).
So there is clearly a wide variation between reading and citing. Citations have the advantage that they are in principle measurable (albeit I suspect with considerable imprecision, particularly in a changing world). Access cannot be easily audited.
So my questions are, please, (and I genuinely don’t know the answers)

  • How are citations counted?
  • Are different methods in widespread use?
  • If so are there agreed algorithms for converting between different metrics?
  • … or is a single authority accepted?
  • if there is a single authority, what auditing of the counting is available? Does the authority set the metrics themselves? Or is there a community process?

Great scientists will generally rise to the top (though I suspect metrics may make the path different from before). I am not a great scientist – in fact I am primarily a technologist at present. Egon reports that on ISI I get an h-score of 9 – fair enough (although it seems to have missed a lot of my papers – maybe there is a time cutoff).
If we are going to be based on metrics then it is a waste of time writing papers for humans to read. The Bioinformatics article above counts for nothing.
Hilaire Belloc (1870-1953) wrote:

When I am dead, I hope it may be said:
“His sins were scarlet, but his books were read.”

I am not a poet, but feel something like:

“My paper’s published (and it was invited)
Don’t bother reading it, but please let it be cited”


"English sentences without overt grammatical subject"

We had a Ph.D. viva today and a small party afterwards. We got onto interesting scholarly publications, which reminded me of a paper which came out in my early career. It was by Quang Phuc Dong (South Hanoi Institute of Technology) and I remember it in a French journal of linguistics (Langue) though most modern citations are to an English language journal (Language). Since we are now involved in chemical linguistics I mentioned it to Peter Corbett as something that would help us understand the deep structure of sentences:
“English sentences without overt grammatical subject”
It’s a classic not only in linguistics but in scholarship in general. I remember it as being in a physical journal article in the library. It’s not easy to find an Open Access online copy (though I suspect it was published before publishers started appropriating authors’ work). It was a work of merit, being cited within a year (e.g. this Closed Access paper), and although Google Scholar only finds 3 citations (and does not give journal or other details) there are certainly more; the shortfall must be due to the difficulty of finding an indexable online version.
I obviously can’t reproduce it here as I would be breaking copyright, and it would be inappropriate to reproduce parts as it would not represent the full moral rights of the author. I have found what appears to be a pirate copy here (which it pains me to reference as almost all the work in this field – even work 40 years old – is quite rightly Closed Access and available at very reasonable prices).
For further information about the author, try Wikipedia.


Impact Factors! Hirsch, Erdős and Pauling

Having spent 2 hours tidying CML Schema over a flaky CVS connection to sourceforge, I need some relaxation. So, after my disillusionment with the accuracy of citation metrics, I was spooking around Wikipedia and came across the h-index (suggested in 2005 by Jorge E. Hirsch of the University of California, San Diego). This is rather similar to Zipf’s law – so essential in understanding informatics. The h-index is defined as:

A scientist has index h if h of his/her Np papers have at least h citations each, and the other (Np – h) papers have at most h citations each.

WP continues:

In other words, a scholar with an index of h has published h papers with at least h citations each. Thus, the H-index is the result of the balance between the number of publications and the average citations per publication. The index is designed to improve upon simpler measures such as the total number of citations or publications, to distinguish truly influential scientists from those who simply publish many papers. The index is also not affected by single papers that have many citations. The index works properly only for comparing scientists working in the same field; citation conventions differ widely among different fields.

So if a scientist has (say) 10 papers with citations:
200, 15, 12, 8, 5, 4, 2, 1, 0, 0
they have an h-index of 5 (5 papers have >=5 citations). The 200 citations are no more powerful than 20 would be for the first paper. If we have to have citation analysis this might be a good approach (since we have little idea how the actual numbers are obtained or who is using what) and the parametric approach allows for this. (I have yet to find how a “citation” is defined). (BTW if it matters I score ca 14 on Google Scholar – Feynman is quoted at 23, Hawking at 68 – don’t take it too seriously – Galois scores 2).
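For anyone who wants to check their own numbers, the definition is easy to turn into a few lines of Python. This is only a sketch of the definition quoted above: the function name `h_index` is mine, and it says nothing about how the citation counts themselves are obtained.

```python
def h_index(citations):
    """Largest h such that h of the papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only fall from here, so h cannot grow
    return h


# The ten-paper example from the text.
print(h_index([200, 15, 12, 8, 5, 4, 2, 1, 0, 0]))                 # 5
# Same list with the first paper cut from 200 to 20 citations: h is unchanged.
print(h_index([20, 15, 12, 8, 5, 4, 2, 1, 0, 0]))                  # 5
# The Google Scholar counts that Bill Hooker quotes for himself.
print(h_index([18, 16, 16, 12, 10, 9, 8, 5, 5, 2, 1, 1, 0, 0]))    # 7
```

The second call shows the point made above: once a paper has cleared the threshold, extra citations to it make no difference to h.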
Anyway, now for some light vanity. In the links to h-index was the Erdős number. This is named after the legendary Hungarian mathematician who was prolific both in the number of his papers and his collaborators. The number is defined as:

In order to be assigned an Erdős number, an author must co-write a mathematical paper with an author with a finite Erdős number. Paul Erdős has an Erdős number of zero. If the lowest Erdős number of a coauthor is X, then the author’s Erdős number is X + 1.

and

Erdős wrote around 1500 mathematical articles in his lifetime, mostly co-written. He had 504 direct collaborators; these are the people with Erdős number 1. The people who have collaborated with them (but not with Erdős himself) have an Erdős number of 2 (6,984 people), those who have collaborated with people who have an Erdős number of 2 (but not with Erdős or anyone with an Erdős number of 1) have an Erdős number of 3, and so forth. A person with no such coauthorship chain connecting to Erdős has an undefined (or infinite) Erdős number.

So might I have a finite Erdős number? We’ve all heard how small the world is – (Six degrees of separation and Small-world network).
But very unlikely. I have to have written a mathematical paper with someone with a finite Erdős number. So I was browsing through the people with an Erdős number of 4 and suddenly saw Linus Pauling (now of course I have to take WP on trust that this is a genuine entry – I can imagine the debate over Erdős numbers can be quite detailed). So could I make a chain which links to Linus Pauling (I’ve had the honour to meet him)?
Well, I searched Google Scholar for L Pauling (there is also Peter Pauling, his son). And I reckoned there might be a crystallographic chain that connected me. The best I can do is:

  • Pauling + Vernon Schomaker
  • Schomaker + Jack Dunitz
  • Jack Dunitz + PeterMR

That would give me an Erdős number of 7. But, unfortunately, although my paper with Jack was mathematical (cokernels of crystallographic point groups) the Schomaker-Dunitz papers were on electron diffraction (cyclobutane, etc.) and the Pauling-Schomaker papers included the splendid title:
The Use of Punched Cards in Molecular Structure Determinations I. Crystal Structure Calculations
So, reluctantly, unless I can find another chain of mathematical papers I don’t have a finite Erdős number.
But I do have a finite Pauling number – currently 3. (I doubt I can get it lower). And since Pauling is generally acknowledged as the greatest chemist of the twentieth century, why don’t we start a Pauling number?
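A number of this kind is just the shortest-path distance from the named person in a co-authorship graph, so it can be computed for any starting author with a breadth-first search. Below is a minimal Python sketch; the function names are mine, and the only data is the three-link chain listed above (a real calculation would need a full co-authorship database, and an Erdős number would additionally require every link to be a mathematical paper).

```python
from collections import deque


def build_graph(pairs):
    """Turn a list of (author, author) co-authorships into an undirected graph."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def collaboration_numbers(graph, root):
    """Breadth-first search: distance of every reachable author from `root`.
    Authors with no chain to `root` are absent (their number is undefined)."""
    distance = {root: 0}
    queue = deque([root])
    while queue:
        person = queue.popleft()
        for other in graph.get(person, ()):
            if other not in distance:
                distance[other] = distance[person] + 1
                queue.append(other)
    return distance


# The chain from the text; nothing else is assumed about these authors.
PAIRS = [
    ("Pauling", "Schomaker"),
    ("Schomaker", "Dunitz"),
    ("Dunitz", "PeterMR"),
]

numbers = collaboration_numbers(build_graph(PAIRS), "Pauling")
print(numbers["PeterMR"])   # 3
```

Changing the second argument of `collaboration_numbers` gives the same kind of number for any other author in the graph.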
(Oh – FWIW Erdős has an h-number of 54 on Google and Pauling 39. But don’t take these too seriously).


Open Data in psychology

Peter Suber has posted (in Open Access News)


A fair share, Nature, December 7, 2006. An unsigned editorial. Excerpt:

In psychology there is little tradition of making the data on which researchers base their statistical analyses freely available to others after publication. This makes it difficult for anyone to independently reanalyse research results, and prevents small data sets from being combined for meta-analysis, or large ones mined for fresh insights or perspectives.

Psychologists need to rethink their reluctance to share data….

The need for more data sharing has just been amply demonstrated by Jelte Wicherts, a psychologist specializing in research methods at the University of Amsterdam, who tried to check out the robustness of statistical analyses in papers published in top psychology journals.
He selected the November and December 2004 issues of four journals published by the American Psychological Association (APA), which requires its authors to agree to share their data with other researchers after publication. In June 2005, Wicherts wrote to each corresponding author requesting data, in full confidence, for simple reanalysis. Six months and several hundred e-mails later, he abandoned the mission, having received only a quarter of the data sets. He reported his failure in an APA journal in October (J. M. Wicherts et al. Am. Psychol. 61, 726–728; 2006).
Researchers often have valid reasons for constraining access to their raw data, such as the privacy of research subjects. But data from most studies based on confidential information can be coded in a way that will guarantee their subjects’ anonymity. The few cases where this is not possible can be exempted from the move towards data sharing.
A second factor deterring openness is a natural desire to retain exclusive access to data that took years of care and attention to collect….
The APA’s editors and publishers are now planning their response to Wicherts’ report. One result should be the acceleration of moves, already under discussion, to require the deposition of data as supplementary electronic material in APA databases. Where the APA leads, other psychology journals are likely to follow.
Granting bodies must also play a part. In 2003, the US National Institutes of Health introduced rules requiring the public sharing of data in psychology studies for grants exceeding $500,000, allowing exemptions where confidentiality issues cannot be circumvented. Other agencies should follow suit. And university departments need to do more to teach the basics of note-keeping and data presentation, to prepare their students for an era in which data sharing will increasingly become the norm.

Comment (PeterS). Kudos to Nature for this call for open data (and for appropriate exceptions). I hope that the APA will take the arguments to heart. Sharing data can improve research without compromising confidentiality.

Comment (PeterMR) It’s nice to see the term Open Data being used naturally here (and of course the reasoned arguments). I hope that others will start to see the value in preserving, exposing and re-using scientific data.
From the conferences I have been to in the last 2-3 months it seems clear that the closer to “science” we are the easier it is to make a case for Open Data so please forgive me if I start to concentrate on Science for Open Data. And please add your own examples either here or on the SPARC mailing list (see Discussion List Archive).


The most cited chemistry articles??

In my last post (Assessed by Robots and citation Quiz) I argued that our careers are now in the hands of the publishing industry – they provide the numerical metrics and based on this the funders decide whether we keep our jobs. So I thought I’d look at how to improve my citations. I typed something like “most cited papers chemistry” into a well-known search engine and got something like this result. (We’ve just been out for our Christmas lunch, and now that I have got back the results aren’t the same as beforehand – so take everything with a pinch of salt…) Anyway, in the first case I went to CAS Spotlight which announces:

CAS, the world’s leader in providing chemical information is now highlighting the most cited documents. The “Chemistry” category identifies the most highly cited chemistry documents appearing in the 1999-2005 published literature and appearing in journals covered by CAS.
CAS provides this information as a free service to the scientific community.

I went to the Journal articles (2005) button and got:
The following records identify the top ten, most cited journal articles appearing in documents published in 2005.
Sign up to receive notice of future updates.

Title / Author, Affiliation / Source
1. Density-functional thermochemistry. III. The role of exact exchange
Becke, Axel D.
Dep. Chem., Queen’s Univ., Kingston, ON, K7L 3N6, Can.
J. Chem. Phys.
2. Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density
Lee, Chengteh; Yang, Weitao; et al.
Dep. Chem., Univ. North Carolina, Chapel Hill, NC, 27514, USA
Phys. Rev. B: Condens. Matter
3. Density-functional exchange-energy approximation with correct asymptotic behavior
Becke, A. D.
Dep. Chem., Queen’s Univ., Kingston, ON, K7L 3N6, Can.
Phys. Rev. A: Gen. Phys.
4. Generalized gradient approximation made simple
Perdew, John P.; Burke, Kieron; et al.
Dep. Phys. Quantum Theory Group, Tulane Univ., New Orleans, LA, 70118, USA
Phys. Rev. Lett.
5. The Protein Data Bank
Berman, Helen M.; Westbrook, John; et al.
Research Collaboratory for Structural Bioinformatics (RCSB), Rutgers University, Piscataway, NJ, 08854-8087, USA
Nucleic Acids Res.
6. Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen
Dunning, Thom H., Jr.
Chem. Div., Argonne Natl. Lab., Argonne, IL, 60439, USA
J. Chem. Phys.
7. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set
Kresse, G.; Furthmueller, J.
Inst. Theor. Phys., Technische Univ. Wien, Vienna, A-1040, Austria
Phys. Rev. B: Condens. Matter
8. Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells
Elbashir, Sayda M.; Harborth, Jens; et al.
Dep. of Cellular Biochem., Max-Planck-Inst. for Biophys. Chem., Gottingen, D-37077, Germany
Nature (London, U. K.)
9. Ordered mesoporous molecular sieves synthesized by a liquid-crystal template mechanism
Kresge, C. T.; Leonowicz, M. E.; et al.
Paulsboro Res. Lab., Mobil Res. and Dev. Corp., Paulsboro, NJ, 08066, USA
Nature (London)
10. General atomic and molecular electronic structure system
Schmidt, Michael W.; Baldridge, Kim K.; et al.
Dep. Chem., Iowa State Univ., Ames, IA, 50011-0311, USA
J. Comput. Chem.

Most Cited Journal Articles – Chemistry, CAS Science Spotlight

(Note – I am sure this is part of a page that is copyright ACS so I am claiming fair use without asking permission. And I shall be complimentary – so please don’t cut me off). Now… have a look and decide what is common to all of these. Read the abstracts if it helps (I didn’t read the articles as only the abstracts are Openly accessible). That’s what I asked you in the last post.
Yes – they are all about techniques. So my world domination strategy was based on creating things that people want to use, not providing scientific results. (You can, of course, argue that a database or a basis set or a functional is a scientific result, but the citers are using it as a tool).
I reran the search after lunch. I thought the results would be the same but maybe Google, or the lunch or my fingers were different. At top of the bunch now comes Elsevier:

Access key papers as recognised by CAS Science Spotlight
CAS Science Spotlight is a free web service that identifies the most requested research publications as reflected by requests for full text via their online services. Additionally, the most cited chemistry-related research publications as reflected by the more than 100 million citations found in the journals, patents, conference proceedings and other sources covered by CAS are identified.
Elsevier is proud to be the publishers of # 1 and # 2 most requested chemistry papers in 2005*, as recognised by CAS Science Spotlight.
You are invited to access these and other highly-valued articles by clicking on the paper title.
# 1
MOST REQUESTED ‘CHEMISTRY AND RELATED SCIENCE’ PAPER ON CAS

The following Elsevier article was the #2 most requested ‘chemistry and related science’ article for 1Q05 and the #1 most requested for 2Q05 and 3Q05:
Title: A useful bicyclic topological decapeptide template for solution-phase combinatorial synthesis of tetrapodal libraries
Published: Tetrahedron Letters, pp7261-7263, vol.42:41, (2001)
If you do not have access to this article on ScienceDirect, click here
# 2
MOST REQUESTED ‘CHEMISTRY AND RELATED SCIENCE’ PAPER ON CAS
The following Elsevier article was the #1 most requested ‘chemistry and related science’ article for 4Q04 and 1Q05 and #2 most requested for 2Q05 and 3Q05:
Title: Convenient synthesis of human calcitonin and its methionine sulfoxide derivative
Published: Bioorganic & Medicinal Chemistry Letters, pp2237-2240, vol.12:16 (2002)
If you do not have access to this article on ScienceDirect, click here

* Tet. Lett. article, #2, #1, #1 most requested for first three quarters of 2005
BMCL article, #1, #2, #2 most requested for first three quarters of 2005
Final quarter and cumulative year data not yet released by CAS.
OTHER ELSEVIER ‘TOP FIVE’ PAPERS AS IDENTIFIED BY CAS SCIENCE SPOTLIGHT
#1 MOST REQUESTED ‘CHEMISTRY AND RELATED SCIENCE’ PAPER ON CAS
The following Elsevier article was the #1 most requested chemistry article for 2002 and 2003:
Title: Glucan synthesis. Part VI. Total synthesis of cyclomaltohexaose
Published: Carbohydrate Research, pp277-296, vol.164 (1987)
If you do not have access to this article on ScienceDirect, click here
#1 MOST CITED ‘CHEMISTRY AND RELATED SCIENCE’ PAPER ON CAS
The following Elsevier article was the #1 most cited ‘chemistry and related science’ article for 2004 and 2003 and the #2 most cited for 1999, 2000, 2001 and 2002:
Title: A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding
Published: Analytical Biochemistry, pp248-254, vol.72:1-2 (1976)
To access this article, click here

(I didn’t ask their permission to quote this either).
Well I am mystified. There is no correlation between the types of paper given here and the ones earlier. They are not only not the same papers, but they aren’t even on similar topics.
I have probably made a simple mistake. (I think it’s the same CAS Spotlight and the same year.) Elsevier uses slightly different words: “most requested chemistry papers in 2005*” (my italics) and also “MOST REQUESTED ‘CHEMISTRY AND RELATED SCIENCE’ PAPER ON CAS”. So maybe there are two completely different lists. Or maybe there is a different selection criterion. Or a subset.
But imagine you are a busy provost/dean and have to decide whether to close the theoretical section of your chemistry department or the organic (of course you may be thinking of both…). The theoreticians will point to the CAS page, the synthetists to the Elsevier page. And I am sure there are others.
So the real skill in the next decade will not be doing science, but choosing and manipulating the metrics. I suppose it is an advance from HEFCE’s last idea which was to measure research income.


Assessed by Robots and citation Quiz.

Jim Downing delicioused me the following link from the UK HEFCE. We are now going to be assessed by robots. (Since I am promoting the concept of robot readers of journals I can hardly complain). Skim the following (which has enthusiastic comments from those who have devised it – including the claim that it will reduce the financial burden of assessment). I comment at the bottom.

Response to consultation on successor to research assessment exercise

Quality stays at the heart of the process, but the bureaucratic burden on universities will be cut, says Johnson

As part of the pre-Budget report, Education Secretary Alan Johnson today published the outcome of the Government’s recent consultation on a new assessment process to replace the Research Assessment Exercise (RAE) after 2008. To reduce the administrative burden on universities, the RAE will be succeeded by a new overarching framework for assessing research quality and allocating funding which is more metrics based. The Higher Education Funding Council for England (HEFCE) will lead further development of the new process, which will use a set of indicators of research income and quality combined with advice from experts including research users, to rate university research and inform the distribution of Government funding for research funding for English universities.
The Government announced in March alongside the 2006 Budget its intention to replace the RAE after 2008 with an assessment system based on metrics – statistical indicators such as the number of times research is cited by other researchers or the amount of research income a department earns. Metrics offer a less burdensome assessment process than the RAE which depends on universities submitting research outputs for review by subject panels. But the proposals recognised that metrics are more readily applicable to assessment of some subjects (broadly sciences) than others.
Nearly 300 organisations and individuals responded to the consultation, which ran from June to October. Some key criteria for the new process were identified: it should continue to use expert advice; should recognise disciplinary differences within a common framework; and it should use an indicator directly linked to research quality.
Alan Johnson said:

“The response to our consultation was helpful and we have heeded it. The outcome we are announcing today keeps quality at the heart of the assessment process, whilst reducing the administrative burden on universities. HEFCE will work closely with the sector as it takes our plans forward and the timetable we have set means that universities can continue their work towards the 2008 RAE with the assurance that its outcome will have a reasonable lifespan. But they can also be confident that the new arrangements will build on the RAE’s success, and continue to recognise research excellence in all its forms.”

Welcoming the announcement’s inclusion in the pre-Budget report, Financial Secretary John Healey said:

“A world-class research base is essential to enabling the UK to respond to the challenges and opportunities of globalisation. The new framework for research assessment and funding will ensure that excellent research of all types is rewarded, including that most likely to have an economic and social impact.”

Chief Executive HEFCE, Professor David Eastwood, said:

‘We welcome today’s announcement on the future arrangements for assessing research quality, and for allocating the Council’s research grant, beyond 2008. This provides a stable framework for our continuing support of a world leading research base within HE which is dynamic and responsive to the needs of our stakeholders and research users. We will work with the sector to develop further the new assessment and funding arrangements that will be robust and command their confidence, and to secure a smooth transition.’

The outcome announced today is a new process that uses for all subjects a set of indicators based on research income, postgraduate numbers, and a quality indicator. For subjects in science, engineering, technology and medicine (SET) the quality indicator will be a bibliometric statistic relating to research publications or citations. For other subjects, the quality indicator will continue to involve a lighter touch expert review of research outputs, with a substantial reduction in the administrative burden. Experts will also be involved in advising on the weighting of the indicators for all subjects.
The first assessment exercise for SET subjects will be in 2009, and will begin to inform funding in the 2010/11 academic year. Other subjects will have their first assessment under the lighter touch regime during 2013 and this will inform funding from 2014/15.
Alan Johnson has written to the Chairman of HEFCE, David Young, to ask him to lead the further work and consultation necessary to complete development and testing of the new system. HEFCE will conduct a report on progress to the Department for Education and Skills in time for the 2007 pre-Budget report.

So it is citations that matter. Who creates the citation metrics? HEFCE? No – commercial and quasi-commercial organisations, closely allied to the publishing business. So the publishing industry controls our output and now has a stranglehold on the research economy (e.g. my job) through its metrics.
So how do I get cited? Here’s today’s quiz.
What are the ten most cited papers in chemistry in 1995-2005?
(It’s easy enough to cheat – but have a guess before you do…) I’ll then outline my plan to become the (jointly) most cited chemist of all time. And I’m serious – it has a large element of reality… See if you can work out my plan.

Posted in chemistry, open issues

Open Access in Science – 1

I have been thinking about Open Access (and Open Data) since I ran into misunderstandings about what is an Open Access journal. As a result I have asked for clarification on one of the most prominent OA mailing lists and have received considerable information and help both publicly and privately. As a result I’ll try and share my thoughts.
I’m not going to cover this comprehensively – for example Bill Hooker has done an excellent job in one of his blogs. But I’ll try to clarify some of the things I have found difficult. And I’ll try to be objective.
A word of warning. Although the basic idea of OA is simple, its practice is more complicated. There are also confusing terms: OAI (Open Archives Initiative) has nothing to do with Open Access (and OAIS has nothing to do with either). The various declarations (Bethesda, Budapest and Berlin) have very similar acronyms, such as BOAI (but at least their content is very similar, and they are sometimes referred to together as BBB).
On the positive side OA is happening and is unstoppable. So anything critical I say should be viewed in the greater light. The movement has now enough momentum that it can tolerate robust discussion. But it is clear that although there is a generally shared vision there are many reasons why people want it and this sometimes causes tensions.
Firstly to remind people of the motivation:

  • Access. I belong to the eMinerals project funded by the Natural Environment Research Council. NERC is “…tackling the 21st century’s major environmental issues such as climate change, biodiversity and natural hazards.” The project (run by Martin Dove) comprises several UK universities and research establishments. Towards the end of its first phase we thought it was a good idea to publish different aspects of the work in a single issue of a journal, and here it is: the complete journal issue. I hope you liked reading the paper that I was a co-author of (CML tools and information flow in atomic scale simulations). Well, if you did, you were luckier than me and the rest of the project, because none of us could! What? The University of Cambridge? One of the UK deposit libraries couldn’t read a journal? Well, yes – I could probably have gone to our splendid Giles Gilbert Scott building and tried to find a physical copy (but it could have been somewhere else and I simply wanted to skim through the issue, not have an adventure). So, yes, the University of Cambridge does not have an online subscription to Molecular Simulation. Can the authors put a copy of the final manuscript on the project page? Yes! “You are able to post, after a 12-month embargo (STM) or 18-month embargo (SSH), your revised text version of the final article after editing and peer review on your home page”. (See Copyright Transfer FAQs – this is packed with interesting information which I may revisit in later posts.) So, I’ve waited 18 months and now I can read about the projects.
  • Permission. I have already blogged about how publishers’ permissions restrict modern uses of published papers. Despite the BOAI, publishers actively frustrate the text-mining (e.g. cut off by publisher) and robotic indexing of papers (see earlier post where I argue that licenses restrict legitimate re-use).

So the two issues I care about are access and permissions. I’ll come back to these later and try to portray some of the views that OA proponents have.
I’ll finish with the slightly provocative statement that Open Access is still a Movement rather than a Code of practice. The declarations urge general motivation rather than precise actions that should be carried out. A declaration of rights might say that all humans should be treated equally, but may not spell out how this is to be done or whether certain actions conform. Similarly some body can say it is an “Open Access” funder/publisher/repositoryProvider, etc. but is it really? Without agreement on the practice of OA we can’t tell. And, as far as I can tell, there isn’t currently any organization that decides.
Does this matter? Yes – it does now. Many funders are requiring publication as “Open Access”. Do authors and publishers conform? I don’t know. And I’m specifically worried about the permissions aspect of OA – I believe that mandates must require that re-use is supported as strongly as access – that is why I have campaigned for Open Data.
For me, if my robots cannot read the articles then as a human I have no interest at all in reading the “fulltext”. And if I cannot guarantee that they can do this without publishers shutting off the supply of journals then OA will not have delivered.

Posted in open issues

Licenses?!

Chris Rusbridge – who directs the UK Digital Curation Centre – has made an excellent comment on my post Molecules? Does “Open Access” help or hinder Open Science?

I’m confused! Peter says “BOAI permits commercial re-use; MDPI does not.” But his article is licensed under a Creative Commons Attribution Non-commercial licence, so it too fails the BOAI test! And I’m required to submit to that licence in adding my comments (I’m happy to, by the way).

  1. There are a couple of issues here. One is that sites often have stupid licence terms, that say things they do not mean; I suspect that the site might change its terms if it was asked carefully and often enough. As a digression but illustration, I have an email somewhere from many years ago, in which I wrote to Elsevier about the copyright notice on their web site, which then read (my emphasis) “All rights reserved. No part of this service may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the publisher.” Asking for written permission before downloading a web page (which you had to download before you could read the notice) was a bit, well, duhhh! I asked for that written notice, which never arrived, and it took several months for the notice to get changed.
     Another issue is that just because the default licence does not allow commercial re-use, does not mean that commercial re-use is forbidden! It just means that commercial re-use may be subject to a separate licence.
     Maybe the real bugbear is that they still ask for a copyright transfer. I really object to this, and just don’t do it. It is (I understand) uncommon practice in book publishing anyway.

PMR comments:

  • Our site. We discussed the license and thought that a non-commercial license for posters would make it easier to induce them to agree to a common license. It’s a little hard to guess what licenses others would offer. For myself I would offer full BOAI – I just thought a differential license would be confusing. But if anyone has a suggestion for a form of words, fine. (If anyone wishes to re-sell my blog I have to be brave enough to think that the fame I would get would suffice).
  • Stupid licenses. Yes, I agree that many publishers have put up meaningless and complex licenses that date from the time of parchments. I have written to some of them (mainly about scientific data) asking how the licenses should be interpreted in the age of computers. Some give a null response; others reply that they will get back to me (they don’t). I have other things to do with my (sad) life so I have given up this game. Since I have a kind and generous disposition I assume that the fuzz and muddle of these ancient licenses are due to fuzz and muddle. However there are some publishers who are very clear that these licenses are to prevent the re-use of data in the computer age and that they require us to hand over copyright of our scientific facts. So we have to tackle this.
  • Default licenses. I don’t think Chris meant this, but many publishers have additional licenses which the normal reader never sees. Thus the librarian (increasingly a purchasing officer) signs a complex contract with the publisher about what the readers may do (I hate the phrase “users” – there are other uses for journals than reading, though I can think of many for e-journals). These contracts are all about what the readers are not allowed to do and, if they do it, how the publishers will cut the institution off without warning (Do you read journals, or “use a database”?). While I hope all academics are aware of the importance of copyright, surely no one can be expected to be aware of the contractual arrangements with the myriad publishers. (And I bet they are not only per-publisher, but also per journal).
  • I met with Alma Swan this morning – she was collecting data for a survey on libraries. She tells me that science publishers are now starting to embed digital watermarks in their publications. Don’t they trust us?

I often see things like “if you want permission to do this, write to the publisher”. This simply doesn’t work in the eScientific age. Peter Corbett and I and colleagues analyse the biochemical literature. PubMed has millions (sic) of abstracts. We – through our linguistic robots – can read all of these (we’ve only read a few hundred thousand so far).
I have now got some thoughts together for posts on Open Access for Scientists and no longer have early-morning distractions and depressions 🙁 – some of you will understand. So I’ll try to get some of this out – and licenses are a critical part.

Posted in open issues

Molecules? Does "Open Access" help or hinder Open Science?

“Open Access” is often taken to imply certain rights. In fact it is more frequently a fuzzy term whose precise interpretation is unclear and sometimes even counterproductive to Open Science. (I accept this is a provocative statement, so read on…:-).

Molbank, published by Molecular Diversity Preservation International, is one of the oldest of a handful of Open Access journals in chemistry. Although its longevity is a remarkable accomplishment in itself, there is much more to Molbank than meets the eye. Just below the surface is a feature so revolutionary, yet simple, that chemistry publishers years from now will wonder why they didn’t implement it sooner.
A Molbank article consists of a short monograph on a single compound, or possibly two. This may strike some scientists as a strange way to publish results, and it is unusual. On the other hand, this system offers vast potential to capture useful, but “unpublishable” findings that would otherwise be lost. Back when scientists actually read hardcopy journals, such a system would never have been feasible. Today, with hard drive space measured in terabytes, fiber optics cables crisscrossing the planet, Internet connectivity for almost everyone, and servers that can be had for virtually nothing, this system not only looks perfectly feasible, but preferable in many ways to the status quo.
Here’s the revolutionary part: each article that Molbank publishes is accompanied by a publicly-available, machine-readable file encoding the structure of the article’s subject molecule. That’s it. There’s nothing tricky or high-tech about it. In fact, the practice is about as low-tech as you could imagine. The file format in which structures are encoded, molfile, dates back at least fifteen years, and nearly every piece of chemistry software – both end-user and developer tools – can handle it. What makes Molbank’s practice revolutionary is that not a single chemistry journal, Open Access or subscription-based, currently does this.
Why does the simple inclusion of a publicly-available molfile encoding molecular structures in a paper matter so much? This is where the second two entities of the trinity named in this article’s title come into play: Open Source and Open Data. By providing a mechanism for a computer to decipher the chemistry in a paper, Molbank has opened the door to a host of highly-productive integration activities that nobody outside of Chemical Abstracts Service has even been able to contemplate, let alone prepare for.
This article is the first in a series aimed at exploring the wide-open space that Molbank has created. Rather than arguing my point with words, I’ll actually build working demonstrations of what is now easily within reach. At the same time, I’ll document my work on this blog. I’m not sure where all of this will end up, but I do hope to shine some light on a vital, although currently obscure, component of the Open Access debate.
Rich is absolutely right about the potential value. My concern is that the “Open Access” claimed here is actually counterproductive to Open Science and what he and I want to do. I hope that the Open Access community can address this.
Molbank is potentially a new and valuable extension towards the idea of publishing data as Rich describes. It’s similar to journals like Acta Crystallographica E which publish a single crystal structure per article with full associated data.
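To make concrete why a molfile is so easy for machines to handle, here is a minimal sketch in Python of reading one. It assumes the fixed-column V2000 flavour of the format; the function name `parse_molfile` and the two-atom example record are my own illustrations, not anything supplied by Molbank, and the sketch ignores charges, the properties block and the newer V3000 variant.

```python
def parse_molfile(text):
    """Read the name, atoms and bonds from a V2000 molfile (illustrative sketch)."""
    lines = text.splitlines()
    name = lines[0].strip()            # line 1: molecule name
    counts = lines[3]                  # line 4: counts line
    n_atoms = int(counts[0:3])         # columns 1-3: number of atoms
    n_bonds = int(counts[3:6])         # columns 4-6: number of bonds

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])          # three 10-character coordinate fields
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()   # element symbol after one space
        atoms.append((symbol, x, y, z))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first = int(line[0:3])         # 1-based index of first atom
        second = int(line[3:6])        # 1-based index of second atom
        order = int(line[6:9])         # 1 = single, 2 = double, 3 = triple
        bonds.append((first, second, order))

    return name, atoms, bonds


# Hypothetical record: the heavy atoms of methanol, written by hand for this sketch.
EXAMPLE_MOLFILE = "\n".join([
    "methanol",
    "  hand-written example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4300    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "M  END",
])

name, atoms, bonds = parse_molfile(EXAMPLE_MOLFILE)
```

A real robot would hand the file to an established chemistry toolkit rather than slice columns itself; the point is only that the structure is recoverable without any human reading the article.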
Molbank was founded with, I believe, a grant to develop Open Access. But the papers themselves, although openly accessible, are copyright MDPI:
  • Copyright of published papers. We will typically insert the following note at the end of the paper: © 200… by MDPI (http://www.mdpi.org). Reproduction is permitted for noncommercial purposes. For alternate arrangements concerning copyright please contact the Editor-in-Chief.

and it has some form of “differential Open Access”:

  • Important additional information: All thematic special issues will be fully Open Access with publishing fees paid by authors. Open Access (unlimited access by readers) increases publicity and promotes more frequent citations as indicated by several studies. More information is available at http://www.mdpi.org/oaj-supports.htm.

and from the copyright transfer form:

The copyright to this article is hereby transferred to MDPI, effective if and when the article is accepted for publication. The copyright transfer covers the exclusive right to reproduce and distribute the article, including reprints, translations, photographic reproductions, microform, electronic form (offline, online) or any other reproductions of similar nature. In the case of a Work prepared under US Government contract, the US Government may reproduce, royalty-free, all or portions of the Work, for official US Government purposes only, if the US government contract so requires. The author warrants that his contribution is original and that he has full power to make this grant. The author signs for and accepts responsibility for releasing this material on behalf of any and all Coauthors.
The undersigned author, as corresponding co-author of the Work, states that all co-authors have been made aware that this manuscript has been submitted to this journal, that they have or will be provided with a (electronic) copy of the manuscript, that they have consented to be co-authors of the manuscript and to transfer the copyright.

In my view this is absolutely NOT open access according to the Budapest Open Access Initiative which reads (with my italics):
The literature that should be freely accessible online is that which scholars give to the world without expectation of payment. Primarily, this category encompasses their peer-reviewed journal articles, but it also includes any unreviewed preprints that they might wish to put online for comment or to alert colleagues to important research findings. There are many degrees and kinds of wider and easier access to this literature. By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
The rubric from MDPI is clear. It is NOT BOAI-compliant. (I have corresponded some years ago with the Editor but didn’t get a substantive reply on this issue):
  • BOAI permits commercial re-use; MDPI does not.
  • BOAI permits non-exclusivity of copying; MDPI does not
  • BOAI permits automatic crawling of data; MDPI gives no explicit permission
  • BOAI acknowledges the value of copyright to the authors; MDPI requires the authors to surrender this.
By contrast, journals such as the Beilstein Journal of Organic Chemistry explicitly state:

Brief summary of what Open Access means for the reader:

Articles with this logo are immediately and permanently available online. Unrestricted use, distribution and reproduction in any medium is permitted, provided the article is properly cited. See our open access charter.

Anyone is free:

  • to copy, distribute, and display the work;
  • to make derivative works;
  • to make commercial use of the work;

Under the following conditions: Attribution

  • the original author must be given credit;
  • for any reuse or distribution, it must be made clear to others what the license terms of this work are;
  • any of these conditions can be waived if the authors give permission.

Statutory fair use and other rights are in no way affected by the above.

Without an EXPLICIT machine-readable statement of the sort above, “Open Access” is effectively useless for Open Science. Remember that we increasingly want to use machines to trawl sites. If I knew I had permission I would set our robots over the whole of MDPI tomorrow. (I am probably allowed to extract all the molecular files as they are (IMO) “data” unless the grotesque sui generis database restriction applies.)
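As an illustration of what “machine-readable” could mean in practice, here is a minimal Python sketch of the check a robot might make before trawling a page. It assumes the publisher marks its license with a `rel="license"` link, as Creative Commons recommends; the allow-list of URL prefixes and the names `LicenseLinkFinder`, `find_license_urls` and `robot_may_reuse` are my own assumptions for this sketch, not an agreed standard.

```python
from html.parser import HTMLParser

# Assumption: only the plain Attribution license counts as reuse-friendly here.
# "by-nc" and similar variants deliberately do not match these prefixes.
REUSE_FRIENDLY_PREFIXES = (
    "http://creativecommons.org/licenses/by/",
    "https://creativecommons.org/licenses/by/",
)


class LicenseLinkFinder(HTMLParser):
    """Collect the href of every <a> or <link> element carrying rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rel_values and href:
            self.license_urls.append(href)


def find_license_urls(html):
    """Return all license URLs declared in the page, in document order."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    return finder.license_urls


def robot_may_reuse(html):
    """True only if the page explicitly declares a license on the allow-list."""
    return any(url.startswith(REUSE_FRIENDLY_PREFIXES)
               for url in find_license_urls(html))
```

The design choice is that silence means no: a page with no declared license, or with a non-commercial one, is skipped, which is exactly the position a cautious robot owner is forced into today.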

Open Science cannot make effective use of:

  • author self-archiving. Much self-archiving – whether on websites or repositories – will not be accompanied by licenses of the sort above.
  • journals that do not assign copyright to the authors AND do not explicitly allow crawling of the publishers site AND do not provide machine-readable licenses. How many hybrid journals do that?

I would recommend the use of the phrase

“Open Access(BOAI)”

If publishers adopted something like that it would solve my problems. It’s simple. However I guess that an increasing number of publishers are likely to let fuzz and FUD drift around their sites, especially those who have been dragged unwillingly into the “a few authors pay so we are Open Access” camp. We hear encouraging figures about the growth of Open Access journals….

… but how many of these are explicitly BOAI-compliant?

Posted in chemistry, open issues