Archive for April, 2007
Repositories or Lists of Open Molecules
Sunday, April 29th, 2007
I am looking for lists (or repositories) of small molecules with connection tables (or machine-parsable molecular structures) which are Open. By Open I mean that anyone can, in principle, download, copy or clone part or all of the site, re-use the information and redistribute it without reference to the original site. At present I am aware of:
- PubChem (10 million+, a superset of many Open datasets including NCI; I use this term to subsume everything at nih.gov)
- ChEBI (> 25,000 terms collected at EBI, not all with connection tables)
- MSD (ligands in protein structures, collected at EBI, > 5000)
- WWMM (250,000 calculated structures from the NCI database), reposited in DSpace
- Crystallography Open Database: crystal structures collected from the literature or donated. Soon to be complemented with CrystalEye. This should give nearly 100,000 crystal structures.
- The BlueObelisk Data Repository (BODR). A collection of critical information collected by BO volunteers, primarily as reference data for (Open) software (it includes non-molecular data such as elemental properties). BODR is widely distributed on Gnome and other Open Source distros.
The SWORD is mightier than the pen
Saturday, April 28th, 2007
I am just relaxing in a hotel in Redmond, WA, US after two weeks’ very hard work instead of going to downtown Seattle and shopping (which I hate unless it is for Obelisks). So the posts are going in all directions today. Incidentally a post seems to take at least an hour, which is why I don’t post every day. I am seriously considering audio blogs (podcasts) and would welcome advice.
Part of my environment now is “digital libraries” – at least I get to interact with librarians, information scientists, etc. There is a real buzz of change in the air. One feature of that is the Institutional Repository (IR) – a digital place where people in an institution (usually a University) put digital things. That’s as close as I can get to the communality of vision. Why are people doing this – why are they NOT doing this? What do they want to put in? I’ll probably blog about this later.
Caveat Lector is an intriguing blog from a librarian (Dorothea), an ex-publisher, who describes herself as:
Sure, I’m a geek. But I’m not a gadget geek or a Lib2 geek or even a web geek (more than incidentally). I’m not even really a markup geek any more; I sling XML now and then, but as a side requirement of my real work rather than the focus of my professional attention. What I am is a problem-solving geek. I have a problem with a technology (say, hm, I dunno, DSpace?), I beat the living daylights out of it with the nearest handy rock until either it does what I want or I decide that the problem needs a better tool than a rock and give up (complaining bitterly afterwards, of course).
She sounds like the sort of person that we desperately need – an XML-slinger who works in libraries and information science. We collaborate with Jeremy Frey (Southampton – CombeChem, eCrystals, etc.) and yesterday we agreed that at the top of our shopping list we needed information scientists embedded in chemistry (and other scientific) departments. That’s the role of the modern “library”. Anyway, there has been a lot of debate about repositories and how to get stuff into them. As a scientist I know that this requires two conditions:
- There has to be an overwhelming motivation for the scientist (like losing their job if they don’t do it) and…
- … it has to be trivially easy
One caveat (small C). If you build it they may not come. That’s the challenge. If they do come, and it’s a good SWORD, they’ll be less likely to go away. That’s why we have to work on all areas of encouraging people to capture and reposit digital artifacts.
Well, hot damn
I ask for middleware, and lo, there is middleware!
(No, I don’t think there’s a direct cause-and-effect relationship there. Even on my worst days I’m not that arrogant! Just shows that I’m not the only person with that particular train of thought.)
Bring on the SWORD, y’all. I’ll wield that baby, you betcha.
Dies Martis, 24 Aprili 2007
Repository middleware
I did a lot of IR marketing this week, despite my perfect awareness that IR marketing doesn’t work. For a tactic that doesn’t work, I did manage to come away with some contacts, and it appears that the IR made its way into some heads, and that’s all good.
But if marketing doesn’t work, what does? Here’s the problem I’ve got: there’s a ton of material that’s IR-ready floating around, but I can’t get at it. My nose is mashed up to the window of other people’s hard drives, web servers, workflow silos, and collaboration tools. I want the stuff that comes out of those arenas. I just have no way to grab it.
Here’s the problem everybody else has got: they need the curation, preservation, and “put this important content somewhere safe (but otherwise out of my hair)” tools that an IR theoretically provides, but they don’t need the hassle of extra deposit steps. They need an “Archive It!” button. They just have no way to build one… if they even know about the IR to begin with.
I need middleware, and I need it badly. I don’t think DSpace or EPrints developers should be directly considering building the kinds of tools that Peter Murray-Rust is talking about. We’re the wrong people for the job (we can’t even do versioning!), and the job is being done elsewhere by others anyway, because faculty want and need these tools, and IT is finally listening. (I have direct evidence of that from my own job, but I need to keep fairly quiet about it because work is ongoing. You’ll just have to trust me.)
What DSpace and EPrints developers should be considering is how to hook IRs up to the firehose of research products those other tools are producing. By my one-horse back-of-the-napkin calculations, that means an ingest API (no, not a command-line batch import tool, an API!) that is configurable enough to authorize certain tools for unmediated deposit and then prepopulate metadata fields with what those tools “know” about their content and the people who use them.
It’s a tall order, but I dearly hope it’s not impossible, because I want to get my IR’s ingest pipe connected to that firehose.
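One way to picture the ingest API Dorothea asks for is the toy sketch below. It is only an in-memory illustration under stated assumptions: the class `IngestService`, its methods, the tool name `lab-wiki` and all metadata fields are hypothetical, and nothing here is the DSpace, EPrints or SWORD interface. It shows the two properties named in the post: only authorized tools may deposit without mediation, and each tool's known metadata is prepopulated on the record.

```python
from dataclasses import dataclass, field


@dataclass
class IngestService:
    """Hypothetical in-memory stand-in for a repository ingest API."""
    # tool name -> metadata that the tool is trusted to supply for every deposit
    authorized_tools: dict = field(default_factory=dict)
    # accepted deposits, stored as (metadata record, raw content)
    items: list = field(default_factory=list)

    def authorize(self, tool: str, defaults: dict) -> None:
        """Allow a tool to deposit without human mediation."""
        self.authorized_tools[tool] = dict(defaults)

    def deposit(self, tool: str, content: bytes, metadata: dict = None) -> dict:
        """Store content, prepopulating metadata from what the tool 'knows'."""
        if tool not in self.authorized_tools:
            raise PermissionError(f"{tool!r} is not authorized for unmediated deposit")
        record = dict(self.authorized_tools[tool])  # tool-level defaults first
        record.update(metadata or {})               # per-item fields override them
        record["size_bytes"] = len(content)
        record["item_id"] = len(self.items) + 1
        self.items.append((record, content))
        return record


# Example: an "Archive It!" button in a (hypothetical) lab wiki.
repo = IngestService()
repo.authorize("lab-wiki", {"institution": "Example University", "source": "lab-wiki"})
receipt = repo.deposit("lab-wiki", b"notebook page text", {"title": "Notebook page"})
```

The design choice worth noticing is that authorization and metadata defaults are configured once per tool, so the person pressing the button never fills in a deposit form.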
BioMOO and BlueObelisk Cemetery
Saturday, April 28th, 2007
Jean-Claude Bradley posts in Nature Island Review:
Joanna Scott just wrote a nice little review of what is going on at Nature Island (slurl) on Nature’s Nascent blog since her return from the American Chemical Society meeting in Chicago. The Blue Obelisk Cemetery, where I give my students quiz races on Fridays was featured (only possible through help from Beth and Eloise – thanks again!). Another fun place is Mary Anne Clark’s biological cell that you can enter and float amongst the mitochondria. Nature Island has really become a very interesting place to hang out, meet smart people and learn and share.
A slurl is a Second Life URL – an address in the emerging virtual world. The Blue Obelisk community has built a cemetery of Obelisks – see the picture in J-C’s link. The name sounds a bit gloomy, but the sun seems to be shining – and no doubt the obelisks celebrate something positive.
I have been a great believer in virtual worlds. Pioneers in this were British Telecom/University of Essex (MUD) and Xerox PARC (Pavel Curtis) with the MOO. Many MOOs were built with LambdaMOO software (not an intuitive system!) including BioMOO for bioscientists.
BioMOO is no more – it flourished from ca 1993 to ca 1998. It was years ahead of its time – as was Diversity University. But the digital record is inexorably decaying and this is a tragedy. I regard BioMOO as at least part of the current subjective unconscious of the collective scientific web and a first generation of what we are now seeing in scientific Second Life. Here are some bitshards (electronic potshards) unearthed from Google:
an overview captured in 1997 but all the links are 404. (The Virtual School of Natural Sciences was my creation in Diversity University).
2-4.2 The BioMoo
a summary by TECFA on the value of BioMOO for education. Again most links are 404.
BioMOO PPS97-98
A record of the meetings in BioMOO of the Principles Of Protein Structure course which Alan Mills and I kicked off in 1995. “ClareS” is Clare Sansom, who took over.
[DESCRIPTION] BioMOO announce VR web interface
a description of how to use BioMOO in 1995 (VR = virtual reality)
BioMOO meeting, PPS Base, 14th Mar ’96 17:00 GMT
a transcript of a tutorial on protein structure at which petermr was present.
Diversity University, 29.09.1996
with link to “tour of BioMOO by petermr”
Analyse d’un Mud: le BioMoo
A description of BioMOO including a map. BioMOO (Gustavo Glusman and others) developed a rich system of images to paint a picture of BioMOO. Our course was held in the VSNS-PPS classroom and some of the students decorated the virtual walls with interactive 3D objects in RasMol.
So – Second Life – it’s all been done before. But before the world was ready for it. Because it’s technically hard and takes a lot of infrastructure. So perhaps all these ideas of the 1990s have had to wait for the Cloud, the Web, Second Life etc. to impact on everyday life. So now it is almost costless to try them out. I am delighted to see a garden in which flowers of sorts that I cannot imagine will grow. But the tragedy is that we have lost much of the digital garden of the first generation. Are we in danger of losing the record of the BlueObelisk Cemetery? Can we record it for posterity? I think we probably can.
Another puzzle
Saturday, April 28th, 2007
I have blogged about the broken state of chemical information and the lack of semantics in current commercial chemical software and drawing tools. This is exemplified by the number of incorrect structures received by publishers (and I have talked to several).
What is the most common chemical substance received by a chemical publisher (or at least the Royal Society of Chemistry, thanks to Colin Batchelor)? It’s not water, or a common solvent. It’s an error.
If you KNOW – i.e. I or Colin or someone else have told you – please don’t answer. But otherwise please guess.
Pubchem Pigeons and Parrots
Saturday, April 28th, 2007
I’ve spent some of the last two days talking with Steve Bryant, who runs The PubChem Project. People often think (and I’ve been guilty of this) that there is a lot of junk in PubChem (although the proportion is very low). That’s not really accurate – it would be better to say that there are a lot of name-to-structure links that most people would regard as inaccurate.
The primary problem is names (and naming is one of the critical challenges of the digital age). Lewis Carroll recognised the central role of names – in Alice’s Adventures in Wonderland, Alice’s neck had grown extremely long and …
… a large pigeon had flown into her face, and was beating her violently with its wings.
`Serpent!’ screamed the Pigeon.
`I’m not a serpent!’ said Alice indignantly. `Let me alone!’
[...]
`And just as I’d taken the highest tree in the wood,’ continued the Pigeon, raising its voice to a shriek, `and just as I was thinking I should be free of them at last, they must needs come wriggling down from the sky! Ugh, Serpent!’
`But I’m not a serpent, I tell you!’ said Alice. `I’m a–I’m a–’
`Well! what are you?’ said the Pigeon. `I can see you’re trying to invent something!’
`I–I’m a little girl,’ said Alice, rather doubtfully, as she remembered the number of changes she had gone through that day.
`A likely story indeed!’ said the Pigeon in a tone of the deepest contempt. `I’ve seen a good many little girls in my time, but never one with such a neck as that! No, no! You’re a serpent; and there’s no use denying it. I suppose you’ll be telling me next that you never tasted an egg!’
`I have tasted eggs, certainly,’ said Alice, who was a very truthful child; `but little girls eat eggs quite as much as serpents do, you know.’
`I don’t believe it,’ said the Pigeon; `but if they do, why then they’re a kind of serpent, that’s all I can say.’
This exemplifies a fundamental problem of naming – the pigeon uses phenotypes and Alice uses genotypes, and Alice’s phenotype is inconsistent with her genotype. I’ll try to create an analogy and then map it onto PubChem.
Mr Python sells musical animals such as parrots and mouse organs. He has a number of suppliers who send animals (and occasionally collections of animals) which are labelled. Mr Python is not an ornithologist, but he has bought a molecular biology kit and can sequence the DNA of the things he is sent. He uses the names the suppliers send and his own internal numbering system based on the DNA of the thing he is sent (let’s assume no intra-species variation in the DNA), running from C1…
He also has a cataloguing system for everything he receives, running from S1… Supplier 1 sends a live specimen labelled “Norwegian blue parrot” – its DNA is labelled C1 and its catalogue entry is S1. He now gets:
- A “white mouse” C2 S2
- An “african grey parrot” C3 S3
- A box of assorted animals labelled “animal organ”. Mr Python cannot extract DNA and the label is * S4
- Another parrot labelled “norwegian blue” and with DNA consistent with C1. He labels this C1 S5
- A bird called “oslo beauty” with DNA C1. He labels this as C1 S6.
- A picture of a parrot. Since he is not an art gallery he does not accept this into the collection.
- A parrot labelled “norwegian blue” which does not look very perky. He puts this into his collection as S7. Try as he might, he can’t get any DNA out of it. It is, in fact, a stuffed parrot. So he complains to the supplier – the rest is history – but the entry still stands in the record – “norwegian blue” – S7. He does not offer this to his customers.
- A parrot labelled “norwegian blue” whose DNA corresponds to C3. There is a name collision, but Mr Python is completely ignorant of parrot names and it goes in his collection as “norwegian blue” C3 S8.
- Another bird labelled “parrot” with DNA C3. This is labelled as “parrot” C3 S9.
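Mr Python's two numbering systems can be sketched in a few lines. This is a toy model of the analogy only, not PubChem's actual implementation: the class `Catalogue`, its method names and the placeholder DNA strings (`DNA-A` and so on) are all assumptions made for illustration. Every accepted delivery gets a new S number, identical DNA always gets the same C number, and a delivery with no extractable DNA gets no C number at all.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Catalogue:
    """Toy model of Mr Python's shop (all names here are hypothetical)."""
    c_by_dna: dict = field(default_factory=dict)  # DNA sequence -> "C<n>"
    entries: list = field(default_factory=list)   # (S id, C id or None, label)

    def accept(self, label: str, dna: Optional[str] = None) -> tuple:
        """Record one delivery: always a new S number, a C number only with DNA."""
        c_id = None
        if dna is not None:
            # Identical DNA always maps to the same C number.
            c_id = self.c_by_dna.setdefault(dna, f"C{len(self.c_by_dna) + 1}")
        s_id = f"S{len(self.entries) + 1}"
        self.entries.append((s_id, c_id, label))
        return s_id, c_id

    def labels_for(self, c_id: str) -> set:
        """All supplier labels (lower-cased) attached to one C number."""
        return {label.lower() for _, c, label in self.entries if c == c_id}

    def c_ids_for(self, label: str) -> set:
        """All C numbers that some supplier has attached to this label."""
        wanted = label.lower()
        return {c for _, c, lab in self.entries
                if lab.lower() == wanted and c is not None}


def build_example() -> Catalogue:
    """Replay the deliveries from the story (the picture is never accepted)."""
    shop = Catalogue()
    shop.accept("Norwegian blue parrot", dna="DNA-A")  # S1, C1
    shop.accept("white mouse", dna="DNA-B")            # S2, C2
    shop.accept("african grey parrot", dna="DNA-C")    # S3, C3
    shop.accept("animal organ")                        # S4, no DNA
    shop.accept("norwegian blue", dna="DNA-A")         # S5, C1
    shop.accept("oslo beauty", dna="DNA-A")            # S6, C1
    shop.accept("norwegian blue")                      # S7, stuffed, no DNA
    shop.accept("norwegian blue", dna="DNA-C")         # S8, C3 (name collision)
    shop.accept("parrot", dna="DNA-C")                 # S9, C3
    return shop
```

Asking the example catalogue for `c_ids_for("norwegian blue")` returns both C1 and C3, which is exactly the name collision in the story: the labels are whatever the suppliers wrote, and only the C numbers reflect what the thing actually is.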
Corrections and Retractions from the GreaseMonkey
Saturday, April 28th, 2007
In Correction/Retraction Notice, Noel O’Blog (Noel was in our Centre until recently) shows how the Blue Obelisk Greasemonkey can show a richer view of the chemical literature. The Greasemonkey script, developed by Noel and others and reported in Blue Obelisk mailing lists and (Using Javascript and Greasemonkey for Chemistry – Bowiki), is a clever little thing. It sits in your (Firefox) browser and reads every HTML page you load. Every time it sees some chemistry that it recognises (including journal names) it can add links to the page. So the page becomes enhanced, enriched.
Noel has applied it to the case of the chemistry retractions in ACS publications. When I tried to follow up ChemBark’s story I had DOIs but no links. With the BO Greasemonkey these DOIs get translated to links which I can follow. In this way Noel has constructed the complete set of linked corrections/retractions – there appear to have been at least 6 papers which have been affected. I think you can see from this that the blogosphere plays a positive role in helping post-hoc peer-review in chemistry.
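The DOI-to-link step can be illustrated with a short sketch. This is not the Blue Obelisk Greasemonkey script itself (that is JavaScript running in the browser); it is a Python approximation of the same idea, and the function name `linkify_dois`, the regular expression and the made-up DOI in the example are all assumptions. It finds DOI-like strings in text and wraps each one in a link to the doi.org resolver.

```python
import re

# Rough pattern for DOIs: "10.", a registrant code, "/", then a suffix.
# Real DOI suffixes are more permissive than this; it is only an approximation.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def linkify_dois(text: str) -> str:
    """Wrap every DOI-like string in an HTML link to the DOI resolver."""
    def to_link(match: re.Match) -> str:
        doi = match.group(0)
        trailing = ""
        # Sentence punctuation directly after a DOI is usually not part of it.
        # (Simplification: a DOI that really ends in ")" would be truncated.)
        while doi and doi[-1] in ".,;)":
            trailing = doi[-1] + trailing
            doi = doi[:-1]
        return f'<a href="https://doi.org/{doi}">{doi}</a>{trailing}'

    return DOI_PATTERN.sub(to_link, text)


# Example with a made-up DOI.
html = linkify_dois("See the correction (doi: 10.1234/example.5678).")
```

A production script would also need to skip text that is already inside a link and to URL-encode unusual characters in the DOI; both are left out here to keep the idea visible.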
Noel highlights the fact that anyone reading the original paper does not know that a retraction has been made. Obviously flagging this is not possible in paper journals, but I would have expected a publisher to put up a note saying “this paper has been retracted/corrected”. I am now thoroughly confused as to what I am seeing at the end of a DOI. The fundamental questions are:
- Is a DOI an identifier for a static piece of information (which is what I would expect, as it stands for Digital Object Identifier), or
- Is a DOI a controlled addressing system managed by a purchaser of DOIs? In other words, can a purchaser put different versions of the same information under the same identifier?
Sanctity of the scientific record
Friday, April 27th, 2007
I have not read the chemical blogosphere for some time (other than Blue Obelisk) and have been catching up with some of this on the plane. This is ChemBark from January and some of my specific comments may be out of date, but the general questions remain. Scholars of scientific accuracy/fraud may wish to pursue the posts (which are IMO a valuable addition to the formal review process).
=== ChemBark ==
Interesting…
(I have quoted this without permission and argue fair use.)
The publisher states that figures in the supporting information have been replaced. There is some doubt in the blogosphere as to whether an amended copy of the supporting information was added to the scientific record or whether the record itself was changed. If it is the latter then it is very serious. The ACS policy is:
Posted by Paul on January 11th, 2007
I missed this since I read JACS by the ASAP alerts and every single addition/correction from Sames has been allowed to bypass the system. A tip of my hat to the kind person who e-mailed it.
Edit: I don’t have the time right now to investigate the e-mail’s note that the original supporting information file has been altered so that you cannot compare the new SI with the old, but I wouldn’t be surprised. I’ll get to it tonight, but feel free to investigate this lead on your own. On the surface, it doesn’t look good.
Edited again: The supporting information files from 2005 and 2007 appear to be identical. They are also many many pages long, and I’m busy, so you probably won’t get a post from me today. Sorry.
=======
I haven’t read this in detail so have no idea whether the question of scientific fraud has been resolved. My concern is about the scholarly record. On pursuing the links I find the correction/retraction:
Direct Palladium-Catalyzed C-2 and C-3 Arylation of Indoles: A Mechanistic Rationale for Regioselectivity [J. Am. Chem. Soc. 2005, 127, 8050-8057].
Benjamin S. Lane, Meghann A. Brown, and Dalibor Sames*
For comparison purposes, this article refers to a palladium-catalyzed arylation of free azoles in the presence of magnesium oxide, published previously in a separate communication. Although the magnesium oxide procedure has recently been found irreproducible (J. Am. Chem. Soc. 2006, 128, 8364), this fact does not affect the conclusions of this paper. Consequently, the magnesium oxide protocol has been removed from the Supporting Information. Also, Figures S5 and S8 have been replaced with corrected versions.
Supporting Information Available
Experimental procedures, spectral data, and base optimization data (corrected). This material is available free of charge via the Internet at http://pubs.acs.org.
01/03/2007
----
I make no comment on the question of misconduct, but am concerned with the sanctity of the scientific record. The publishers’ statement could be interpreted as implying that the record has been amended, rather than that an amended piece of information has been added to the scientific record. The blogosphere certainly queried this – I don’t know whether it has been resolved. However the ACS policy on journals is then worrying:
“Socialized Science” (ACS[*] commentary on NIH)
RUDY M. BAUM, Editor-in-Chief, C&E News, September 20 2004, Volume 82, Number 38, p. 7
I find it incredible that a Republican Administration would institute a policy that will have the long-term effect of shifting responsibility for communicating scientific research and maintaining the archive of STM [1] literature from the private sector to the federal government.
What is important to realize is that a subscription to an STM journal is no longer [...] a subscription; in fact, it is an access fee to a database maintained by the publisher.
[...] one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish. Maintaining an archive, however, costs money.
(PM-R’s emphasis and ellipses)
[*] American Chemical Society
----
The scientific record is thus not a paper or even epaper journal – it is a set of database records. With paper journals the record was very clearly preserved – multiple copies were distributed and could never be recalled. Here there is effectively only one record and it is controlled by the publisher. (I know that certain depositing libraries have electronic copies, but it is unrealistic for the average scientist to pursue this).
I don’t doubt that Rudy Baum has a sincere commitment to preserving the scientific record. But I can imagine cases – with less reputable publishers – where it was embarrassing to the publisher for the record to be visible and it was convenient for the database to be amended.
There is another point which the chemical community should take seriously if it cares about the accuracy of scientific publications – and certainly the blogosphere does. ChemBark says that it is impossible to check by eye whether the copies are identical or what has been changed. I gather that there might be PDF diff tools but I don’t have one. OTOH if the supporting information were in CML then it would be possible to compare not only the text but also the spectra, compounds, etc.
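For plain text, at least, the "are these two copies identical, and if not what changed?" question does not need to be answered by eye. The sketch below uses only the Python standard library and assumes the text has already been extracted from both versions of the supporting information (getting reliable text out of PDF is the hard part and is not shown). The function names, the file labels and the sample lines are hypothetical; nothing here reflects the actual content of the files discussed above.

```python
import difflib
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest: equal digests mean the two byte streams are identical."""
    return hashlib.sha256(data).hexdigest()


def describe_changes(old_text: str, new_text: str) -> list:
    """Return unified-diff lines between two versions (empty list if none)."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="SI-old",   # hypothetical labels for the two versions
        tofile="SI-new",
        lineterm="",
    ))


# Made-up sample text standing in for two versions of a document.
old_version = "Procedure A\nFigure S5: original\nProcedure B"
new_version = "Procedure A\nFigure S5: corrected\nProcedure B"

same_file = fingerprint(old_version.encode()) == fingerprint(new_version.encode())
changes = describe_changes(old_version, new_version)
```

A line-based diff like this only sees text. Comparing spectra or compounds would need the data in a structured form such as CML, which is the point made in the paragraph above.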
So I’d like to be sure that the complete record is available. And encourage chemical cows rather than chemical hamburgers.