- Pubchem (10 million+ entries; a superset of many Open datasets, including NCI. I use this term to subsume everything at nih.gov)
- ChEBI (> 25 000 terms collected at EBI, not all with connection tables)
- MSD (ligands in Protein structures, collected at EBI > 5000)
- WWMM (250,000 calculated structures from the NCI database), reposited in DSpace
- Crystallographic Open Database: crystal structures collected from the literature or donated. Soon to be complemented with CrystalEye. This should give nearly 100,000 crystal structures.
- The BlueObelisk Data Repository (BODR): a collection of critical information gathered by BO volunteers, primarily as reference data for (Open) software. It includes non-molecular stuff like elemental properties. BODR is widely distributed on Gnome and other Open Source distros.
Archive for April, 2007
Sure, I’m a geek. But I’m not a gadget geek or a Lib2 geek or even a web geek (more than incidentally). I’m not even really a markup geek any more; I sling XML now and then, but as a side requirement of my real work rather than the focus of my professional attention. What I am is a problem-solving geek. I have a problem with a technology (say, hm, I dunno, DSpace?), I beat the living daylights out of it with the nearest handy rock until either it does what I want or I decide that the problem needs a better tool than a rock and give up (complaining bitterly afterwards, of course).
She sounds like the sort of person that we desperately need – an XML-slinger who works in libraries and information science. We collaborate with Jeremy Frey (Southampton – CombeChem eCrystals, etc.), and yesterday we agreed that at the top of our shopping list we needed information scientists embedded in chemistry (and other scientific) departments. That’s the role of the modern “library”. Anyway, there has been a lot of debate about repositories and how to get stuff into them. As a scientist I know that this requires two conditions:
- There has to be an overwhelming motivation for the scientist (like losing their job if they don’t do it) and…
- … it has to be trivially easy
One caveat (small C). If you build it they may not come. That’s the challenge. If they do come, and it’s a good SWORD, they’ll be less likely to go away. That’s why we have to work on all areas of encouraging people to capture and reposit digital artifacts.
Well, hot damn
I ask for middleware, and lo, there is middleware! (No, I don’t think there’s a direct cause-and-effect relationship there. Even on my worst days I’m not that arrogant! Just shows that I’m not the only person with that particular train of thought.) Bring on the SWORD, y’all. I’ll wield that baby, you betcha.
Tuesday, 24 April 2007
Repository middleware
I did a lot of IR marketing this week, despite my perfect awareness that IR marketing doesn’t work. For a tactic that doesn’t work, I did manage to come away with some contacts, and it appears that the IR made its way into some heads, and that’s all good.
But if marketing doesn’t work, what does? Here’s the problem I’ve got: there’s a ton of material that’s IR-ready floating around, but I can’t get at it. My nose is mashed up to the window of other people’s hard drives, web servers, workflow silos, and collaboration tools. I want the stuff that comes out of those arenas. I just have no way to grab it.
Here’s the problem everybody else has got: they need the curation, preservation, and “put this important content somewhere safe (but otherwise out of my hair)” tools that an IR theoretically provides, but they don’t need the hassle of extra deposit steps. They need an “Archive It!” button. They just have no way to build one… if they even know about the IR to begin with.
I need middleware, and I need it badly. I don’t think DSpace or EPrints developers should be directly considering building the kinds of tools that Peter Murray-Rust is talking about. We’re the wrong people for the job (we can’t even do versioning!), and the job is being done elsewhere by others anyway, because faculty want and need these tools, and IT is finally listening. (I have direct evidence of that from my own job, but I need to keep fairly quiet about it because work is ongoing. You’ll just have to trust me.)
What DSpace and EPrints developers should be considering is how to hook IRs up to the firehose of research products those other tools are producing. By my one-horse back-of-the-napkin calculations, that means an ingest API (no, not a command-line batch import tool, an API!) that is configurable enough to authorize certain tools for unmediated deposit and then prepopulate metadata fields with what those tools “know” about their content and the people who use them.
It’s a tall order, but I dearly hope it’s not impossible, because I want to get my IR’s ingest pipe connected to that firehose.
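To make the request concrete, here is a rough sketch of what such an ingest API might look like from the repository side. Everything in it is hypothetical: `ToolProfile`, `IngestAPI`, the tool names and the metadata fields are invented for illustration and are not part of DSpace, EPrints or SWORD. It only shows the two behaviours asked for above: per-tool authorization for unmediated deposit, and metadata prepopulated from what the tool already knows.

```python
from dataclasses import dataclass, field


@dataclass
class ToolProfile:
    """Hypothetical per-tool configuration held by the repository."""
    name: str
    unmediated: bool                      # True: deposits skip curator review
    default_metadata: dict = field(default_factory=dict)


class IngestAPI:
    """Toy ingest endpoint; not a real DSpace or EPrints interface."""

    def __init__(self, profiles):
        self._profiles = {p.name: p for p in profiles}
        self.archive = []        # deposits accepted straight away
        self.review_queue = []   # deposits waiting for a curator

    def deposit(self, tool, content, metadata=None):
        profile = self._profiles.get(tool)
        if profile is None:
            raise PermissionError(f"tool not registered: {tool}")
        # Start from what the tool "knows", then overlay per-deposit fields.
        record = {**profile.default_metadata, **(metadata or {}),
                  "source_tool": tool}
        if profile.unmediated:
            self.archive.append((record, content))
            return "archived"
        self.review_queue.append((record, content))
        return "queued"


# Hypothetical configuration: one trusted tool, one that still needs review.
api = IngestAPI([
    ToolProfile("lab-wiki", unmediated=True,
                default_metadata={"department": "Chemistry"}),
    ToolProfile("mail-gateway", unmediated=False),
])

wiki_status = api.deposit("lab-wiki", b"spectrum data", {"title": "Draft"})
mail_status = api.deposit("mail-gateway", b"attachment")
```

With this configuration the wiki deposit lands directly in the archive with the department already filled in, while the mail deposit waits in the review queue. An “Archive It!” button in the originating tool would then be a single `deposit` call.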
Joanna Scott just wrote a nice little review of what is going on at Nature Island (slurl) on Nature’s Nascent blog since her return from the American Chemical Society meeting in Chicago. The Blue Obelisk Cemetery, where I give my students quiz races on Fridays, was featured (only possible through help from Beth and Eloise – thanks again!). Another fun place is Mary Anne Clark’s biological cell that you can enter and float amongst the mitochondria. Nature Island has really become a very interesting place to hang out, meet smart people and learn and share.
… a large pigeon had flown into her face, and was beating her violently with its wings.
`Serpent!’ screamed the Pigeon. `I’m not a serpent!’ said Alice indignantly. `Let me alone!’ [...] `And just as I’d taken the highest tree in the wood,’ continued the Pigeon, raising its voice to a shriek, `and just as I was thinking I should be free of them at last, they must needs come wriggling down from the sky! Ugh, Serpent!’ `But I’m not a serpent, I tell you!’ said Alice. `I’m a–I’m a–’ `Well! what are you?’ said the Pigeon. `I can see you’re trying to invent something!’ `I–I’m a little girl,’ said Alice, rather doubtfully, as she remembered the number of changes she had gone through that day. `A likely story indeed!’ said the Pigeon in a tone of the deepest contempt. `I’ve seen a good many little girls in my time, but never one with such a neck as that! No, no! You’re a serpent; and there’s no use denying it. I suppose you’ll be telling me next that you never tasted an egg!’ `I have tasted eggs, certainly,’ said Alice, who was a very truthful child; `but little girls eat eggs quite as much as serpents do, you know.’ `I don’t believe it,’ said the Pigeon; `but if they do, why then they’re a kind of serpent, that’s all I can say.’
This exemplifies a fundamental problem of naming – the pigeon uses phenotypes and Alice uses genotypes, and Alice’s phenotype is inconsistent with her genotype. I’ll try to create an analogy and then map it onto Pubchem.
Mr Python sells musical animals such as parrots and mouse organs. He has a number of suppliers who send animals (and occasionally collections of animals) which are labelled. Mr Python is not an ornithologist, but he has bought a molecular biology kit and can sequence the DNA of the things he is sent. He uses the names the suppliers send and his own internal numbering system based on the DNA of the thing he is sent (let’s assume no intra-species variation in the DNA) running from C1….
He also has a cataloguing system for everything he receives (S1…). Supplier 1 sends a live specimen labelled “Norwegian blue parrot” – its DNA is labelled C1 and its catalog entry is S1. He now gets:
- A “white mouse” C2 S2
- An “african grey parrot” C3 S3
- A box of assorted animals labelled “animal organ”. Mr Python cannot extract DNA and the label is * S4
- Another parrot labelled “norwegian blue” and with DNA consistent with C1. He labels this C1 S5
- A bird called “oslo beauty” with DNA C1. He labels this as C1 S6.
- He gets a picture of a parrot. Since he is not an art gallery he does not accept this into the collection.
- He gets a parrot labelled “norwegian blue” which does not look very perky. He puts this into his collection as S7. Try as he might, he can’t get any DNA out of this. It is, in fact, a stuffed parrot. So he complains to the supplier – the rest is history – but the entry still stands in the record – “norwegian blue” – S7. He does not offer this to his customers.
- He gets a parrot labelled “norwegian blue” whose DNA corresponds to C3. There is a name collision, but Mr Python is completely ignorant of parrot names and it goes in his collection as “norwegian blue” C3 S8.
- He gets another bird labelled “parrot” with DNA C3. This is labelled as “parrot” C3 S9
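The bookkeeping in the list above can be sketched in a few lines of code. This is only a toy model of the analogy, not of how Pubchem actually works: the `Registry` class, its method names and the placeholder DNA strings ("dna-A" and so on) are invented for illustration. The one rule it encodes is that every accepted shipment gets a new S number, while a C number is assigned only when DNA can be read and is shared by every shipment with the same DNA.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Registry:
    """Toy model of Mr Python's two numbering systems (hypothetical)."""
    compounds: dict = field(default_factory=dict)  # DNA string -> "C<n>"
    records: list = field(default_factory=list)    # (S id, label, C id or None)

    def deposit(self, label: str, dna: Optional[str]) -> tuple:
        """Record one shipment; dna is None when no DNA could be extracted."""
        cid = None
        if dna is not None:
            # Identical DNA always maps to the same C number.
            cid = self.compounds.setdefault(dna, f"C{len(self.compounds) + 1}")
        sid = f"S{len(self.records) + 1}"
        self.records.append((sid, label, cid))
        return sid, cid

    def names_for(self, cid: str) -> set:
        """All supplier labels ever attached to one C number."""
        return {label for _, label, c in self.records if c == cid}

    def compounds_for(self, label: str) -> set:
        """All C numbers that one supplier label has been attached to."""
        return {c for _, lab, c in self.records if lab == label and c is not None}


shop = Registry()
shipments = [
    ("Norwegian blue parrot", "dna-A"),  # S1, C1
    ("white mouse", "dna-B"),            # S2, C2
    ("african grey parrot", "dna-C"),    # S3, C3
    ("animal organ", None),              # S4, no DNA extracted
    ("norwegian blue", "dna-A"),         # S5, C1
    ("oslo beauty", "dna-A"),            # S6, C1
    ("norwegian blue", None),            # S7, the stuffed parrot
    ("norwegian blue", "dna-C"),         # S8, C3: the name collision
    ("parrot", "dna-C"),                 # S9, C3
]
for label, dna in shipments:
    shop.deposit(label, dna)

synonyms_of_c1 = shop.names_for("C1")
norwegian_blue_ids = shop.compounds_for("norwegian blue")
```

After these deposits, C1 carries three different names, and the name “norwegian blue” points at both C1 and C3, which is exactly the ambiguity the analogy is meant to show. The picture of a parrot never appears because it is never deposited.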
- Is a DOI an identifier to a static piece of information (which is what I would expect – as it stands for Digital Object Identifier), or
- Is a DOI a controlled addressing system managed by a purchaser of DOIs? IOW, can a purchaser put different versions of the same information under the same identifier?
Posted by Paul on January 11th, 2007