Finding chemical structures – InChIs et al., an amusement

Totally Synthetic, Chemspider and I have been discussing the value of InChIs in blogs. TS’s blog is, of course, Openly available under a CC licence, and he is widely revered in the community for the beauty and accuracy of his structural diagrams. This post is a slightly light-hearted voyage through what can be discovered with Toll-Access barriers in place. I leave readers to judge whether TS and Pubmed are up to the ease and value of the information from commercial providers.
I’m reading this from outside the University and I do not have a VPN. This is useful as it shows me what it’s like to be an information-impoverished reader. TS blogged today about Diazonamide A, a natural product which was billed as the next big breakthrough in cancer some years ago. (It has 4 reports in Pubmed about its biology, and 26 about the chemical synthesis. Taxol has 30,000.) Anyway, TS has taken the advice of the Blue Obelisk list and managed to put InChIs into his blog.
I’ll show his beautiful-as-always structure at the end, but meanwhile I wanted to see how easy it was to find the structure from freely accessible sites. This includes most abstracts (in science it seems to be almost universal to post abstracts in clear, so be grateful).
Wikipedia does not list it, but has the (intriguing and misleading) entry under “Trivial_name”:

For example, the most important structural feature of Diazonamide is that it’s a nonribosomal peptide, which is denoted by the suffix “amide“.

PMR: it might have started as a peptide but I don’t think many people would now call it that. (Unless there is another Diazonamide that I don’t know of).
So on to the latest synthesis (Magnus, Cheung, Goldberg, Russell, Turnbull and Lynch. JACS, 2007, ASAP. DOI: 10.1021/ja0744448.), remembering I can’t read the full text. The abstract is a superb illustration of hanging links (NullPointerExceptions in Java):

Abstract:
During the course of studies on the synthesis of diazonamide A 1, an unusual O-aryl into C-aryl rearrangement was discovered that allows partial control of the absolute stereochemistry of the C-10 quaternary stereogenic center. Treatment of 30 with TBAF/THF gave the O-tyrosine ethers 31 and 32 (1:1), which on heating each separately in chloroform at reflux rearranged to 33 and 34 in ratios of 84:16 and 56:44, respectively. This corresponds to a 70% yield of the correct C-10 stereoisomer 33 and a 30% yield of the wrong C-10 stereoisomer 34. Attempts to convert 34 into 33 by ipso-protonation and equilibration were unsuccessful. Confirmation of the stereochemical outcome of the rearrangement was obtained by converting 33 into 37, an advanced intermediate in the first synthesis of diazonamide A by Nicolaou et al. It was also found that the success of the above rearrangement is sensitive to the protecting group on both the tryptophan nitrogen atom and the tyrosine nitrogen atom.

PMR: What a splendid piece of non-communication! [My comments could apply to many publishers, not just ACS]. Without the full text (which, after considerable perusal will tell us what 1, 30, 31, 32, 33, 34 and 37 are) it’s almost meaningless. I am reminded of Alice’s comment on Jabberwocky:

“Somehow it seems to fill my head with ideas – only I don’t exactly know what they are! However, SOMEBODY killed SOMETHING: that’s clear, at any rate — ‘”

PMR: and the authors made something from something else…
So off to Pubchem. Many compounds made by synthetic chemists are not in Pubchem because they are of no interest, but Diazonamide is. It has a structural diagram [1]

PMR: Lovely. I think it’s correct, but it’s not exactly beautiful. Like mathematical equations, chemical structures can be pretty or semantic. This is semantically correct and it’s probably pretty to jellyfish (this was a marine compound) but not to humans.
So on to the InChI. Pubchem tells me that the compound has InChI:
InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6
-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24
-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-1
6,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39-,4
0u/m0/s1/f/h44-45H
The problem is that this is not pretty for blogs as it runs over the line ends and spaces are a problem. So IUPAC are working out new approaches and some of these are discussed by the Blue Obelisk.
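The line-end problem can at least be undone mechanically. The following is a minimal sketch, not part of any InChI toolkit: it relies on the fact that an InChI string itself contains no whitespace, so every space or newline in a pasted copy can be dropped. The function names are made up for illustration.

```python
import re

# The five wrapped lines exactly as they appear in the post above.
WRAPPED_INCHI = """\
InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6
-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24
-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-1
6,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39-,4
0u/m0/s1/f/h44-45H
"""


def rejoin_inchi(text):
    """Drop all whitespace (hypothetical helper, not an official API)."""
    return re.sub(r"\s+", "", text)


def formula_layer(inchi):
    """Return the second '/'-separated field, the molecular formula."""
    return inchi.split("/")[1]


inchi = rejoin_inchi(WRAPPED_INCHI)
print(formula_layer(inchi))
```

This only repairs wrapping; it cannot tell whether characters were lost or altered when the identifier was copied.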
There is also a SMILES:
CC(C)C1C2=NC3=C(O2)C45C(NC6=C(C=CC=C64)C7=C8C(=CC=C7)NC(=C8C9=C(N=C3O9)C
l)Cl)OC2=C5C=C(CC(C(=O)N1)NC(=O)C(C(C)C)O)C=C2
which is a linear way of encoding the structure. Let’s go to the Daylight site (they invented SMILES) to see what it looks like:

I think it’s correct, and it’s certainly a lot better than the Pubchem offering but it’s not beauty – except for Shrek.
Let’s try Chemical Abstracts. It’s got every compound ever made. Maybe they will let me have a free go… (STNEasy) I find:
stneasy.PNG
A free demo! Just what I wanted…
stneasy1.png
PMR: This is fine, and it points to the same abstract, but I can’t get at the structure. Let’s try CAS-Number lookup – it will tell me the number and the structure… and there is a free demo as well:
stneasy2.png
Oh dear… Yes, a free demo, but only if you are looking for caffeine. I can get all I want about caffeine from Wikipedia without paying 6.20 USD. Ah well.
So, off to chemspider which is free. The search for diazonamide A reveals:
chemspider.PNG
10472888 is shown at full size. (There are two more structures but both are equally unreadable.) Note that the atom counts of the structures are inconsistent – the actual composition – I think – is that of 4591072. I try to zoom the formula and get a featureless gray square on both IE and Firefox. So I try Jmol (shown right). Now the molecules are three-dimensional but the coordinates in chemspider are those of the 2-D diagram. Personally I regard this as extremely misleading and would NEVER use Jmol for 2D diagrams, but I shan’t pursue this here.
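Checking whether two records agree on atom counts does not need a structure viewer. Here is a small sketch, assuming each record supplies a plain molecular formula such as the C40H34Cl2N6O6 that Pubchem reports. The second formula in the usage lines is invented purely to show a mismatch; it is not taken from any Chemspider record.

```python
import re
from collections import Counter


def atom_counts(formula):
    """Count atoms in a simple formula like 'C40H34Cl2N6O6'.

    Handles element symbols followed by an optional number only;
    brackets, charges and isotopes are outside this sketch.
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def differing_elements(formula_a, formula_b):
    """Map each element whose count differs to the pair of counts."""
    a, b = atom_counts(formula_a), atom_counts(formula_b)
    return {el: (a[el], b[el]) for el in sorted(set(a) | set(b)) if a[el] != b[el]}


pubchem = "C40H34Cl2N6O6"    # formula layer of the Pubchem InChI
invented = "C40H36Cl2N6O6"   # hypothetical second record, for illustration only
print(differing_elements(pubchem, invented))
```

An empty result means the two formulas agree; it says nothing about whether the connectivity or stereochemistry agree.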
So I still don’t know what the molecule is. Where else? Perhaps I can use some more abstracts…
And the fourth one on Pubmed hits gold. It’s from PNAS:
diazonamide1.png and it’s FREE!!!!!
so we find the structure:
diazonamide2.png
Truth at last. (For non-chemists: the exact width of the lines matters, and the pixellation makes it very difficult to be sure. But I’m sure it’s correct.)
And now what you have been waiting for – Totally Synthetic’s structure:
InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-16,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39?,40-/m0/s1
I think you’ll agree that the blogosphere is starting to emerge as a serious place to look for chemistry.
[1] pasted directly from the Pubchem site, suggesting we can create an image library for chemical structures

Posted in chemistry, open issues | 5 Comments

open access : Thank you American Chemical Society

In my reviews of the practice of Open Access (Author Choice in Chemistry at ACS – and elsewhere?) I pointed out that there were deficiencies in access and labelling on Open offerings. I’ve now had a reply from Dave Martinsen:

Peter,

Thanks for pointing out the problem in accessing ACS AuthorChoice articles. This was a technical glitch which is in the process of being fixed. Please be assured that it is our intention that AuthorChoice material is available without charge from the time it is posted on the web. We believe the solutions we’re putting into place will prevent this access problem from happening again.


Dave

*********************************
David Martinsen
American Chemical Society

1155 16th St. NW
Washington, DC 20036
d_martinsen AT work-it-out

PMR: Thank you Dave (Dave – as I have already mentioned – has been very supportive of new approaches to chemical informatics).
AuthorChoice is a “hybrid Open Access” product produced by the ACS. “Hybrid” only applies to publishers (and sometimes specific journals) that are primarily closed (Toll Access, pay-to-read) but where authors may purchase “Open Access” for their specific article. (Many OA publishers require all authors to pay to publish.) Every publisher has a different name for their hybrid products and almost all of them offer different rights and restrictions.
As I have said before, the quality of delivery of hybrid Open Access (and related products) is often poor. They are not well labelled, the navigation is poor, and the rights – if any – are often vague and contradictory. Hybrid offerings (as with the ACS) often still require the author to transfer copyright and do not allow full re-use of the article.
I am not (here) criticizing hybrid OA per se (though personally I think it is a distraction and is likely to be ineffective in every way). Nor am I concerned (here) with the price level, though I personally would not believe that I get good value from many publishers (as I require full permissions, including author retention of copyright). What concerned me here was that the reader (and thereby the author) was not getting what they were entitled to.
It is very clear that the OA community MUST insist on clear labelling and must police the practice. Many “OA” publishers are creating unacceptable offerings – either deliberately or probably through laziness and lack of commitment (I call this systemic failure of the industry). I had not intended to embark on any campaign and I am glad to see that others at Berlin5 are interested in putting in place more formal mechanisms. For example we need a system of labels – but that’s not my story to tell.
I don’t actually like attacking people (institutions are slightly different). Sometimes my role appears to be that of a gadfly. I didn’t know why people use this particular analogy, so I looked it up in WP and found Gadfly:

“Gadfly” is a term for people who upset the status quo by posing upsetting or novel questions, or attempt to stimulate innovation by proving an irritant.
The term “gadfly” was used by Plato to describe Socrates’ relationship of uncomfortable goad to the Athenian political scene, which he compared to a slow and dimwitted horse. It was used earlier by the prophet Jeremiah in chapter 46 of his book. The term has been used to describe many politicians and social commentators.
During his defense when on trial for his life, Socrates, according to Plato’s writings, pointed out that dissent, like the tiny (relative to the size of a horse) gadfly, was easy to swat, but the cost to society of silencing individuals who were irritating could be very high. “If you kill a man like me, you will injure yourselves more than you will injure me,” because his role was that of a gadfly, “to sting people and whip them into a fury, all in the service of truth.”

PMR: I’m delighted to know the etymology (or rather the usage). And perhaps that is sometimes why I like the Socratic approach – posing questions which require definite answers rather than generalities. But, ahem, although it grows here I really don’t like hemlock.

Posted in open issues | Leave a comment

How blogging makes contacts and seeds communities

I mailed yesterday about how blogging links to other blogs and generates new contacts. Here is a direct example:

Jakob Says:


You wrote: “More, because I have added this link to my blog, Jakoblog will get notified.” This is true and it may happen that the author will come and see what you have written and even leave a comment – how often do you experience this with publications on paper? Conventional scholarly publications are so old-fashioned, slow, impractical, and inefficient. If you do your research for the progress of knowledge (and not only for your career) then you should better tag your notes at a social tagging/bookmarking service, write your thoughts in your blog, archive your summary-paper at a publication server, provide your data and sourcecode in data and software libraries, discuss you opinion in mailing lists, compile your research into other people’s work in wikis etc…. this is science in the 21st century!

… and …

From the librarian’s point of view I can tell you that archiving data is probably even more complex then it seems to be. From the computer scientist’s point of view I can tell you that Semantic Web will enlight us easily. From the Open Content movement’s point of view I can tell you that you should just license the data and make it available and usable for anyone – like you said: first make sure THAT the data CAN be used.

PMR: Thanks Jakob. There is a growing number of people like you – we need to link them to generate critical mass. In chemistry we have created the Blue Obelisk community and we have pooled our resources and efforts. This could be done for content systems – informally as well as through institutions – an example is our collaboration with Peter Sefton on authoring tools.

Posted in "virtual communities", semanticWeb | 1 Comment

Does linking to technorati tags generate spam?

In a recent post (blogs, folksonomies and tagging – get going!) I encouraged the Open Access community to start using blogs and tagging. I specifically pointed to Technorati to illustrate the value and showed that some conferences had huge amounts of traffic and others almost none. I gave several examples and gave links to the technorati summary of the posts under given tags. This was based on a particular URL structure.
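The URL structure in question is simple: every tag link in that post is a fixed prefix followed by the tag. As a rough sketch (the helper name is made up, and percent-encoding of unusual characters is an assumption, since the post only uses plain alphanumeric tags):

```python
from urllib.parse import quote

# Prefix shared by the tag links given in the earlier post.
TECHNORATI_TAG_BASE = "http://www.technorati.com/posts/tag/"


def tag_url(tag):
    """Build the tag-summary URL for one tag (hypothetical helper).

    Percent-encoding is an assumption for tags that are not plain
    alphanumerics; how Technorati treated such tags is not stated here.
    """
    return TECHNORATI_TAG_BASE + quote(tag.strip(), safe="")


for tag in ("berlin5", "etd2007", "www2007"):
    print(tag_url(tag))
```

Because the pattern is this predictable, anyone, including a spammer, can work out which page a given tag will appear on.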
On revisiting these sites I find that the lists at Technorati have been drastically altered. The berlin5 one has 11 porno spam links. The method is a fairly recent one – take the content of a genuine post and do some very crude lexical munging of the words and phrases (I get zillions of these each day submitted to the blog comments). Somehow they actually linked to sex sites in Cambridge, so maybe they interpret domain names. So it seems the spammers have found my post yesterday and somewhere generated spam content that is either injected into Technorati or has already been linked. AFAICS the genuine links are still there.
Then I looked at www2007, worrying that I would see the same. But whereas there were 300+ links yesterday to www2007, now there are only 6, all half a year old. Was Technorati spammed, and did it then try to clean up?
If by linking to Technorati I have unwittingly generated spam I apologize, but this can be done in other ways.
I don’t take Technorati counts very seriously – about as seriously as I take ISI citation counts – but it’s a useful way of finding people. But maybe we have to be careful about the exact way we use it. I welcome enlightenment.

Posted in general | Leave a comment

blogs, folksonomies and tagging – get going!

At the recent “Berlin 5” meeting on Open Access I noted sadly that I was the only person blogging the meeting. Normally there are many bloggers at the meetings I go to so I (and everyone else) can choose what they blog. At berlin5 I felt it was important to show the way so I hacked some notes together for many of the talks – generally typing scattered phrases during the talks (and with even more typos than normal). As a result I spent more time than I would have liked simply noting some of the presentations. In any case it’s not a very good approach since you don’t know what the speaker is going to say and often run out of puff during dense slides. You know them (“Title XVIII chapter 123 of the EC, says …”). Mind you, it isn’t easy to blog my presentations synchronously either…
I made ca. 15-20 posts about Berlin-5. 2-3 before I went, 2 (so far) afterwards and about 12 during the meeting. Many of the latter are simple shorthand notes of speakers with little or no comment. So for example I have copied as many of Alma Swan’s words as possible to give those not there an idea. I can’t type well or fast so it’s limited. And there are no links.
I’m writing this in the hope that librarians, funders and policy makers will be more adventurous and start their own blogs. An increasing number of slides at berlin5 mentioned blogs, wikis, folksonomies, etc. The best way to understand these is to DO them, not read other people’s.
There are of course some top-class blogs from staff working for publishers – Nature and PLoS lead the way. They actually tell us how people in the organization think, work, interact. (Contrast the more formalised magazine-like blogs on some publishers which are often written by third parties, sometimes recruited from the blogosphere.) And there are some excellent librarian blogs. But I am sure there is a niche for “DGXIII inmate”, “bewildered at RCUK/STFC”, etc. In Open Access we need more than just Peter Suber and Stevan Harnad commenting. They have clear formats and agendas which need complementing. There is a huge need for investigative blogging to reveal the spread and the problems with OA.
The digital library needs metadata and in C21 much of this should be done elsewhere than the library. Two main methods are text-mining and tagging (folksonomies). Here I’ll look at the latter.
If you have just set up a blog, no one will know about it. It can be quite dispiriting. There are many legitimate ways to advertise, including tagging. There are sites such as Technorati which visit all blogs (ca. 100,000,000 exist) and index and link to them.
One thing that Technorati looks for is the tags in a blog.
If you write a blog you can add tags which give an idea of the content. Tags are common in many systems such as del.icio.us and Connotea where communities expect other members to use tags to find similar contributions. There is NO controlled vocabulary – you could use anything (though it’s best to stick to ANSI alphanumerics). If you don’t understand social computing, this is a good place to start. It doesn’t matter what you do – it won’t break anything. And there is no “right” or “wrong” way to do things – it is whether it works. So for this meeting I chose “berlin5”. It’s natural and I assumed that, of the 100-200 delegates, others would choose something similar. Let’s assume they choose “berlin open access” (you can have multiple tags, of course).
In a formal metadata system this is a nightmare, but in the blogosphere it’s trivial. If twenty people read both blogs one of them will probably post a comment “Petermr is using berlin5 – why don’t you add that as well” (or the other way round). So the two of us start to converge. No one tells us to – it’s just obviously a good thing to do.
So here is the list of posts about berlin5 (there are 18). There are 3 which are nothing to do with OA but they are easily ignored as they are old.
http://www.technorati.com/posts/tag/berlin5
[NOTE ADDED LATER. THIS APPEARS TO HAVE BEEN SPAMMED. THE TOP 11 LINKS ARE SPAM, AUTOGENERATED FROM MY POST. I DON’T KNOW WHETHER THIS IS CHANCE OR WHETHER THE SPAMMERS WATCH POSTINGS POINTING TO TECHNORATI. PROBABLY THE LATTER.
INCIDENTALLY TECHNORATI HAS BEEN QUITE FAST TO INDEX ALL OF THESE
OH DEAR.]
As I say, it’s a pity that there isn’t anyone else (although we needn’t have finished).
Let’s look at a more distant meeting – electronic theses and dissertations at Uppsala. If you follow:
http://www.technorati.com/posts/tag/etd2007
You’ll find 17 posts, mainly by me but not all:

ETD Policies, Strategies and Initiatives in…

Das, Anup Kumar and Sen, B. K. and Dutta, Chaitali (2007) ETD Policies, Strategies and Initiatives in India: a Critical Appraisal. In Proceedings 10th International Symposium on Electronic Theses and Dissertations (ETD2007), Uppsala, Sweden.

So now I have made two important contacts – the authors of the article and also EPrints for LIS. That’s just because we both used etd2007 in our posts.
But now let’s look at the really hectic end of the scale, www2007:
http://www.technorati.com/posts/tag/www2007
303 posts! and although the meeting was 5 months ago, posts mentioning it are still coming in, such as Yet another semantic tagging application in Jakoblog — Das Weblog von Jakob Voß
More, because I have added this link to my blog, Jakoblog will get notified. Technorati keeps count of how often every blog mentions others. E-LIS has 251 other blogs which link to it; I have about 120 (“the authority”), Jakoblog has 37. If I put Jakoblog on my blogroll it would increase to 38. (A popular aggregator/multiple_author blog like ScienceBlogs has nearly 10,000; Bora’s Blog around the clock has 700; Dorothea Salo’s Caveat Lector has 199; my colleague Andrew Walkingshaw’s Brighten the Corners, http://wwmm.ch.cam.ac.uk/blogs/walkingshaw, has 28.) Of course these numbers are about as useful as citation statistics!
The serious message is that if you want to go out and get noticed in the blogosphere you have to get noticed! Tagging is a good way of finding out who is thinking along the same lines as you. Then link to them. They’ll often link back. Aggregators will include all of you, and so on.
So, OA colleagues – and hopefully OD colleagues as well – get out there! Yes, you will reach some people via conventional scholarly publications. But your publications will be noticed much more if they are blogged. Das, Sen and Dutta should get some more readers because I have blogged their paper. They’ll get me anyway, and that’s because E-LIS blogged it. And so it grows…

Posted in berlin5, general | 2 Comments

berlin5: final thoughts

Some final thoughts on the berlin-5 meeting on Open Access in Padova – I have spent more blog time than I thought and I am probably driving any chemical/software readers up the wall. This should be the last post with the tag. Some discussion is reported in Chatham House Rule manner.
Splendidly organised. Wonderful food and drink. Very relaxed atmosphere.
Fantastic location. Italy is fortunate to have preserved many of its medieval towns almost intact. The best analogy in the UK is probably Cambridge or Oxford, but they don’t have the same compact city boundaries as many Italian counterparts.
A reasonably good mix of funders, policy makers (EU, etc.), publishers, researchers, library/IT.
A positive atmosphere. Alma was very upbeat that Open Access was now unstoppable.
I was pleased to see that Open Data was now much higher on the agenda. General agreement that it must be addressed and quickly and I think several people have taken this away and will work on it. Similarly the idea that “Open Access” is not a licence and we have to use CC or SC. Kaitlin Thaney from Science Commons was there and I am sure that people will get in touch with her.
eTheses were also higher on the agenda. Good. At earlier meetings I had asked whether I could run robots over the Dutch theses and was told there was a copyright problem. Now I am told that was incorrect – I can do whatever I like. There are over 10,000 Open theses in NL, so we’ll start pointing our robots there.
Because of my diffident nature I have been in the habit of asking permission for this sort of thing. Now I am getting braver and shall “ask for forgiveness rather than permission”. So here come text and data-mining robots. After all it’s C21.
There was a mixture of views about the legality of Foo, Bar, and Bananas. I am urging that in the C21 copyright is inappropriate for eScience and we should simply declare all scientific data unencumbered by publisher copyright. I pushed one or two publishers like this…
PMR: “are images (graphs, gels, cells) of the scientific record copyright”
Publisher: “well, we put lots of effort into the design of lettering in images”
PMR: “on gels?”
Publisher: “… er um”
So I think there are an increasing number of publishers who see that the scientific record per se (i.e. the wider “data”) must be free and Open. I talked with one publisher who has got excited about the possibility of Open Data and although they might not be Open Access, see the advantages of making data visible.
I think a lot of people hadn’t seen the power of data- and text-mining and although I had to compress a lot into 27 minutes the message came through.
On a slightly more critical note:
There was very little awareness of what Web2.0 and the rest is about. There is a vast difference between berlin5 and www2007 (scifoo is something else, of course). We who are in the middle of it forget how many academics have never heard of Flickr.
I was disappointed that no-one else was blogging, and presumably the awareness of tags and folksonomies is low. I’ll address that in another post.
I am looking forward to the video and will let you know when it happens.
And as always new contacts and opportunities. I am always happy to visit and demo or spread the word. Open Access, Repositories, Open Data … we are taking off.

Posted in berlin5 | Leave a comment

berlin5 : Alma Swan

The final keynote was by Alma Swan, familiar to all in the OA field. How are we doing? (Alma does a lot of surveys, interviews, questionnaires, etc.)
We are getting definition creep. There should be no qualification of OA – it’s either pregnant or not – not slightly pregnant.  OA is not “delayed OA”…
Awareness, in order:

  • funders
  • publishers (PLoS, BMC doing very good advocacy)
  • peers (word-of-mouth)
  • library (often repositories are not well advertised)
  • and the effect of OA

“self archiving gave my work instant world-wide visibility. As a result I was invited to … conferences … and authoring”
Proven Business model (PLoS, BMC, Hindawi) 70% rise in submissions over last 2 years. Hindawi is profitable, BMC break-even, PLoS OK on all except flagships. Bentham launching 200+ OA this year, and 100 next year
Moving the money around. Need to move from library budgets to author-end. Not trivial but vice-chancellors have to grasp this nettle. Experiments:

  • Nottingham
  • Wisconsin
  • Amsterdam

Reorganising rather than spending new money. 7 billion USD into scholarly publishing.
Learned societies. Not homogeneous. Sounds like publishing, but is NOT. Actually aligns with mission of a scholarly society. Target the scientific officers of society. Please try. Work with LS to help them embrace OA and concept of opening up scholarship. Show benefits. Discuss green and gold. Discuss evidence against damage to business. Be patient. Praise and encourage the ones which are moving. They are too coy about their achievements (e.g. APS and IOP (UK)). Both have built mirror sites for arXiv (doing this for benefit of community – let’s praise them). Support members who are struggling to change. History will record who helped and obstructed.
Start by making Society conferences OA (ASCB (Am Soc Cell Bio), Inst. Math. Stat.).
Peter Suber says 380 OA journals from 350 societies.
Digital Repositories. Family of types but shared purpose is dissemination of research in ways not possible up to now. Repositories are at centre of universe. Ingest tools, search and retrieve, aggregate/display, count/assess, peer review (might/not be publishers), editorial (publishers), other value adding
Repos are where content is going to start, at data creation stage.
We need a marketing message for each constituency:

  • institution: visibility and impact. G-factor (Google rank or Web presence). Much higher in US. But Southampton is 3rd of UK universities. Mandatory deposit of research. Many links are to repo.
  • funders. OECD says: boost innovation and better return if proceedings Open Access. Houghton. Drummond Bone – repos are vital to UK economy. EU: SME find it hard to get access to the basic research information they need. A small pharma: cannot pay for TA journals or 30GBP/article
  • authors: WILL comply willingly, if mandated (81%). Reluctantly (14%). Arthur Sale – QUT has over 50% in repo. Encouragement does NOT work. Mandate AT ACCEPTANCE. The AUTHOR’S FINAL VERSION, even if not OA. Mandate DEPOSIT. Need author’s final version (as well as PDF)
  • usage: UoCalif 2 million downloads. Interoperable Repository Statistics (IRS) will help. Monthly download, Daily downloads, types of referrer, etc. Which universities are accessing. In some cases Wikipedia is top referrer. Authors love it

Rights: Shouldn’t be a block but it is. Promote author addendum. Most address data. Monitor copyright policies and addenda.
It’s about the Web, stupid:
BBC linked to Soton and links were out of date. If Google on author’s name
One third of Soton ECS lack home page, same in MIT. Let the young people help.
Joined up strategy. It IS a web. Data, theses, articles.
And work on lobbying:
it’s hard, but PRISM has backfired and this makes it easier. Now we have to SHOUT. need organizing centre. SPARC…
Personal Strategy – stay cheerful

  • Peter Suber’s blog
  • AmSci and SPARC OA lists
  • David Prosser’s paper
  • Alma’s OA calendar

Posted in berlin5, Uncategorized | 1 Comment

TOPAZ and CLADDIER

Hopefully of great interest to those of us looking at self-publishing, Open Notebooks, etc.

When PLoS One was first announced, one of the first things that caught my eye was the fact that it was built on an open source publishing framework, Topaz. As much as anything else, that was one of the reasons that PLoS One was so appealing; a journal that put the web first. Today, Topaz took a big step in becoming a general scientific publishing framework.
Some of the cool new features found in the new release candidate (bold emphasis mine)
  • Enable multiple journals using a single repository
  • Skins for multiple journals.
  • Filter search results by journal using OTM
  • TrackBack linkbacks for articles
  • Citation download of the article
  • Migration to the Struts 2 web application framework

Personally, it is very exciting to see trackbacks implemented in a scientific publishing platform. Trackbacks are a powerful vehicle for communication (I prefer them to commenting), and allows people to discuss papers at length on their blogs in the appropriate context. It was one of the first feature requests that I submitted to the PLoS folks, as I am sure did many others, so one can feel good about it.
The next step for Topaz is to get to version 1.0 when it will become available with source. Many publications, struggling with funding and very poor internet presence could do much worse than adopt the Topaz platform. I wonder what uses people could make for Topaz outside the formal publishing field – Open Notebook Science perhaps?

PMR: This ties up very nicely with what Brian Matthews (RAL) told us at eScience All-hands about CLADDIER (and *). This includes (is?) a data overlay journal with trackback. People with a data entry can know whether someone else in the system has used their entry. Trackback has had a problem with SPAM but I expect that within a research community this is soluble.
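For readers who have not met trackback: in the TrackBack specification a "ping" is an HTTP POST of form-encoded fields (url, plus optional title, excerpt and blog_name) to the ping URL of the entry being cited. The sketch below only builds such a request and does not send it. The function name and both example.org addresses are hypothetical; nothing here is taken from Topaz or CLADDIER.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_trackback_ping(ping_url, entry_url, title="", excerpt="", blog_name=""):
    """Build (but do not send) a TrackBack ping request.

    ping_url  : TrackBack ping URL of the cited article or data entry
    entry_url : permalink of the post or dataset doing the citing
    """
    fields = {"url": entry_url}  # 'url' is the only required field
    optional = {"title": title, "excerpt": excerpt, "blog_name": blog_name}
    fields.update({key: value for key, value in optional.items() if value})
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"}
    return Request(ping_url, data=body, headers=headers, method="POST")


# Hypothetical addresses, for illustration only.
ping = build_trackback_ping(
    "http://example.org/trackback/123",
    "http://example.org/blog/my-post",
    title="Notes on a dataset",
    blog_name="Example blog",
)
print(ping.get_method(), ping.data.decode("utf-8"))
```

Since anyone can POST to a ping URL, this also shows why trackback attracts spam unless the receiving system restricts who may ping it.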

Posted in data | 1 Comment

berlin5 : NIH and RCUK

NIH has had an open policy since 1994. Barbar Seto presented an example, GWAS, which has to deal with human subjects. How to make data Open while protecting identity?
NIH serves as central data repository, including: Genome-wide association study (GWAS), Genbank, Protein cluster, Pubchem, http://www.nlm.nih.gov/databases/
GWAS – identifies common genetic factors influencing health and disease. Genetic variations associated with observable traits. It combines genomic data with clinical and phenotypic data to understand disease mechanism and prediction of disease.
Because some diseases are rare it is sometimes possible to work out identities from anonymized data.
Cold room for use at grantee institution = data is open within a specified location and can’t be taken away
==============
Mark Thorley NERC and RCUK. Reacting to issues brought up in the morning. 4.1 billion EUR:

  • Data as byproduct of research data.
  • Data as part of the scientific record; support publications
  • Data as published output in own right

Drivers:

  • scientific need (e.g. atmospheric physics requires data sharing)
  • increased value – as part of larger collection e.g. oceans
  • value for money. Ship costs 10,000 GBP/day
  • public funds, so public access

One-size-fits-all is NOT appropriate.

  • RCs recognise data as a valuable long-term public-good resource
  • Data sharing improves opportunities for exploitation (e.g. mashups). “Power of information” (UK, Cabinet Office). Stimulate knowledge economy.
  • Investigator has a right of first use and a right to be acknowledged. There must be a limit on first use, but early release can be a problem.
  • Effective exploitation requires effective data management.
  • Must be legal. (e.g. directive on public access to environmental information)

Any differences between RCs are not in policy, but in how to support data sharing.
National facilities (NERC, ESRC) or local delegation (AHRC, BBSRC, MRC)?

  • National – longer term, single point, centres of excellence, but expensive, less agile.
  • Delegated – more responsive, closer to science, cheaper, but lacking in the long term.
  • Long-term commitment. Needs long-term vision, long-term support. Are PIs the right people to do this?

berlin5: Open Data and institutional repositories

John Marks (ESF) introduced our session and set the scene on the need for Open Data and sharing. He stated strongly that it was essential that we had discipline-specific repositories for different branches of science. I share this view and blogged it recently (berlin5 : how to progress Open Data?).
My stance comes from meetings this year where I have talked to many people about institutional repositories. I ask them “why are you setting up an IR?” I have got about 8 distinct answers. Very few of them mention data.
Some of us addressed these issues at ETD2007. There are hundreds of different types of biological data, tens of chemistry data, hundreds of geoscience, etc. There is no way that IR managers, with the best will in the world, will know how to manage them. So I wrote:

although there is quite a lot of activity in institutional digital repositories they won’t (and shouldn’t) address Data. It’s subject-specific and too complex for the average repository manager.

PMR: In response to this, Dorothea Salo (who has run the Caveat Lector blog for some years and has a strong following) commented:

  1. Dorothea Salo Says:

    Disagree somewhat that IRs and their managers shouldn’t address data, though I agree that for now it’s impractical because the software is so wretched and the technical infrastructure insufficiently scalable. Just because IR software in its current state is completely broken with regard to data doesn’t mean it must or should stay that way, though. Moreover, the notion that “domain knowledge” is the sole key to data curation is (bluntly) bunk, and nobody’s yet tested the assertion that it’s harder to teach a librarian domain knowledge than to teach a discipline-practitioner info management. Frankly, “it differs by discipline” doesn’t matter. So does everything else in librarianship, from reference transactions to collection development. We cope. It’s our job to. As for “too complex,” says who? And about which librarians? I think I’ve just been insulted.
    There’s nothing wrong with telling librarians — and the subset of librarians who are repository managers — that we need to brush up our game to deal with these issues. I have a plan in place to learn the principles of data curation for myself over the next year or so. I want to see more librarians planning the same!
    Looks like a good talk. Wish I could be there to hear it!

PMR: I haven’t met Dorothea but I’d like to – her blog is insightful and entertaining and she is unafraid to speak out. She’s also technically proficient in the IT skills required – XML, etc. And the last thing I want to do is upset and antagonize people like Dorothea.
But… There is no single human on the planet who knows how to reposit all of protein structures, variable stars, ice sheets, chemical structures. It needs much more than metadata. So what can a repository manager do? Putting the raw data into the repository without understanding it is not an option. It has to go into a system devised by experts in the discipline. And, for me, that means subject repositories. Maybe each university has a different one. Maybe they are national. Some, like the bioscience ones, will be international.
