#solo10: The Green Chain Reaction is becoming a chain reaction!

Dictated into Arcturus

I’m waiting until the rain stops before I cycle in to work, so here is an update on The Green Chain Reaction. It’s going incredibly well. The energy of those who have already volunteered is enormous, and so is the speed with which they’ve picked up the ideas, and the competence and initiative they have shown. Absolutely incredible. An important byproduct of this experiment is showing how universal the ideas of collaboration are, and how the tools have developed over the last few years to the point where they are straightforward to use.

Mark and Dan have made spectacular progress on the code. We have produced a system in house which we use regularly for downloading and analysing publicly visible and Open scholarly material. We have a build system which involves about 30 different projects and libraries and is quite complex. I should pay tribute to Jim Downing, Sam Adams, Nick Day, and others in our group for having set this up. I don’t think we could have done it without the infrastructure that we have built, and we’d be delighted to talk with other people who are interested in managing large and varied amounts of scholarly information. What is really exciting is that Mark and Dan have understood what we’re doing, have written some very nice documentation on the Science Online wiki, and have robustified the procedures. Because they are working on other systems, including other operating systems, this is an excellent test of portability. Quite shortly we should be able to create a package which many of you will be able to use in this project.

Heather Piwowar has made wonderful progress on the IsItOpenData resource. We are shortly going to be creating template letters to send to those people who expose data on the web, and we will be sending these to the owners of most of the sources. It may seem strange to enquire of an open access publisher whether their data is open, but this letter will give us a chance to thank them and also an opportunity for them to respond to the project. So if you’re a publisher, or a site, which exposes open data then don’t be surprised if you get a request asking IsItOpen. And thank you.

Mat Todd and Jean-Claude Bradley have already posted their notebooks and lab reports. I’ll be looking at these in detail later this morning, I hope. Mat has also posted a list of potential sources of Open chemistry, and we will use these where it is clear that they are open. Some of those sources are not explicitly open, and Heather and I (and anyone else who wishes to help) will be composing letters and sending them through IsItOpen. We hope that we will get a quick enough response to allow us to use these sources for the project, and if so this will be fantastic.

Because these volunteers have made such rapid progress much of this is in a state where you can join in. This is part of the purpose of the experiment, allowing anyone to get a feel for what data-rich science is like. So for example you will be able to install the package for downloading and analysing Open Data.

ONE CAVEAT – SOME OF THE TOOLS CARRY OUT AUTOMATIC DOWNLOADING. WHERE POSSIBLE WE WISH TO DO THIS ONLY ONCE TO AVOID DENIAL-OF-SERVICE AND MESSING UP ACCESS COUNTS. SO WE WOULD HOPE TO CACHE DATA AFTER IT HAS BEEN DOWNLOADED. ANY SUGGESTIONS ON HOW AND WHERE WE CAN DO THIS (AND OFFERS OF HELP) WOULD BE WELCOMED. NOTE THAT THE DATA WILL ALL BE EXPLICITLY OKD-COMPLIANT SO CANNOT BE SERVED FROM A LESS-THAN-OPEN REPOSITORY.
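To make the caching idea concrete, here is a minimal sketch of a fetch-once downloader: each URL is keyed by a hash, and the network is only touched on a cache miss. This is illustrative Python only (the project’s actual tools are in Java, and the cache directory name is invented):

```python
import hashlib
import os
import urllib.request

CACHE_DIR = "cache"  # hypothetical local cache directory

def cache_path(url):
    # One file per URL, keyed by a hash so the filename is filesystem-safe
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, digest)

def fetch_once(url):
    """Return the document at url, downloading it at most once."""
    path = cache_path(url)
    if os.path.exists(path):           # cache hit: no network traffic,
        with open(path, "rb") as f:    # no distorted access counts
            return f.read()
    data = urllib.request.urlopen(url).read()  # cache miss: download once
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:        # store for every later request
        f.write(data)
    return data
```

Where the shared cache should live (a group server, a cloud store) is exactly the open question above.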

And, yes, I goofed on the tag. It should be #solo10


 

Posted in Uncategorized | 1 Comment

#solo2010: Where can we get Open Chemical Reactions?

Typed/Scraped into Arcturus

Mat Todd has created a great page of possible chemical resources for the Green Chain Reaction (http://scienceonlinelondon.wikidot.com/greenchem:discussion-area ). I’ll comment rapidly here.

Mat Todd Point 1 of 2: At the moment I am assuming that I can help by highlighting places where synthetic organic chemistry experiments can be obtained online. If so, we can start a list here:

Exactly so. I will comment on each on the page.

 

  1. Our ELN online for our open project here
  2. Reaction Attempts includes UsefulChem (Bradley) and Todd notebooks – click on explore link
  3. Chemspider Synthetic Pages
  4. OrgPrepDaily procedures
  5. Beilstein Journal of Organic Chemistry
  6. Small Molecule Papers in PLoS One but there aren’t many with synthetic schemes…
  7. What about publicly available theses – are you wanting html for this, or pdf?

Initial comment – the sample size might be small, and hence the kinds of reactions looked at may skew the analysis a little. We’re focussing on one main reaction type, and so is Jean-Claude. Not exclusively, just a preference. Obviously the wider the sample set the better the analysis.

To this I will add:

  • About 7,000–9,000 syntheses from Acta Crystallographica E (some will be simply “we took this from a bottle” but most are actual preparations). Licence CC-BY
  • Somewhere around 100,000 reactions per year in patents. We expect the text quality of the older material to be lower. Licence PUBLIC DOMAIN
  • A number of gifted theses in Cambridge
  • A number of theses in University repositories. Most licences CC-DONTKNOW, some CC-SA, some CC-BY. Smallish numbers (guess about 100–1,000 theses if we work hard – a really good opportunity for collaboration)

PMR: The main aspects that will be important are:

  • What are the explicit permissions on the site?

    This is more important than anything else. We cannot use material that is visible on the web unless there is explicit permission. This is an ideal opportunity to use IsItOpenData? Of the sources above I shall assume 1 and 2 are Open, 3 is unknown until we hear from Chemspider, 4 is unknown (most blogs are CC-BY or CC-SA but we have to check). We cannot use CC-NC. 5, 6 are fully CC-BY

  • What is the format? PDF is the worst, but probably usable for this project. XHTML is normally excellent. *.doc (theses) is also excellent.
  • How is the information structured? If it’s in diagrams it’s very difficult. If it’s in running text it depends on the style. Formal reports of single compounds are often quite tractable. Highly detailed accounts are potentially much more valuable but harder to parse as there is less consistency.
  • What information is given? Acta E and Patents do not normally give yields (a pity). Theses are usually very rich.

I’ll start liaising with Heather about asking for formal permission on IsItOpen.


 

Posted in Uncategorized | 2 Comments

#solo2010: Sharing Data is Good for All of Us

Dictated into Arcturus

We are delighted that Heather Piwowar has offered to help on the project. This is especially exciting as she says “I know nothing about chemistry” – I doubt that’s absolutely true but I’ll take her at her word. This means that anyone can help. Already she is organizing the Wiki and starting to increase the use of the IsItOpenData resource (http://www.isitopendata.org/ ). I’ll talk more about that later…

Here’s Heather… (http://researchremix.wordpress.com/2010/08/11/green-chain-reaction-project-putting-my-minutes-where-my-mouth-is/ )

Green chain reaction project: putting my minutes where my mouth is


Quick post to highlight a really cool project that is worth following, and participating in!, right now.

The Green Chain Reaction is a ground-breaking, innovative, global experiment. It will apply open data and citizen science to rapidly investigate:

“Are chemical reactions in the literature getting greener?”

Background work will be done in the next month, and some of the investigation will take place in real time at Science Online London 2010. As such the project will be highly visible.

Do you believe in the power of open data, open science, citizen science, and fun sprints around an important problem? Do you talk about it? Then join us.

Yup, us. I’m helping. I’m putting my minutes where my mouth is, and reallocating a few hours in the next month to the cause. I know nothing about chemistry. That is no problem, everybody can help. I’ll probably focus on writing to publishers asking for clarification about whether their publications can be data-mined.

You can help extract information from full text publications, test some software, help flesh out the wiki, tweet about the experiment, contribute ideas, etc.

Don’t wait, check it out now and participate with some minutes. This sprint is only on for a month…

ps Kudos to Peter Murray-Rust, Simon Hodson, and the conference organizers for initiating this project. Innovative, informative, and exciting.

I hope this helps you also to jump in…

Heather is, of course, a star in her own right as she has seminally shown that if you share your data you get more citations. The first step to sharing your data is to find it. Then you have to make it available and if you post it Openly on the Web, with a CC0 or PDDL license then anyone can find it and re-use it. When they re-use it, they cite it. Bingo!

The International Union of Crystallography are showing this enlightenment by exposing their data and we thank them for it. Actually it’s much more than that as they have also worked tirelessly to try to persuade other publishers to expose their data. And to the credit of the ones that agree, they have also exposed their data. So kudos to the Royal Society of Chemistry and the American Chemical Society for making their crystallographic data Open. As a result I expect they get more citations.

Those publishers also expose some of their chemical syntheses and we’d like to follow that up in this project. Because it would help not only in this project but also in the more general area of getting better quality science. Here are some simple mantras:

Open Data is Shared Data

Open Data means Better Science

Shared Data gets More Citations

Better Science gets More Citations

And so

Making Data Open is Good for Everyone

If Heather finds time among the Wiki tending she can point to more convincing figures.


Posted in Uncategorized | 1 Comment

#solo2010: Green Chain Reaction – update and more volunteers!

Dictated into Arcturus

A brief update of the Green Chain Reaction.

Viewed in the light of normal projects with Gantt charts, milestones, etc. the GCR project is completely mad. There is very little formal support other than the excellent wikis provided by the conference organizers. It relies almost completely on volunteer contributions over which we have no moral or other control. People promise what they can when they can, but often they find that they can’t deliver. There’s no shame in that whatever. We all have day jobs, including myself. It’s just that it’s such a wonderful opportunity to do something new and exciting.

What is fantastic is that we have already had a number of really valuable offers, and I have no doubt that we will continue to get more.

I could reasonably count on Jean-Claude Bradley and Mat Todd and today I shall be exploring how they manage reactions and seeing if we can come up with a common representation. This will also be closely coupled to our own work on the JISC funded AMI project. But the other offers have been unexpected and delightful.

Heather Piwowar, who is very well known for her work on the value of Scientific Data, has offered significant help. She has been a tireless campaigner for making scientific content openly available. She has also proved the pragmatic and utilitarian value of doing so. This is not the place to do justice to her work, but here is a recent blog post (https://researchremix.wordpress.com/2010/07/05/studying-reuse-of-geo-datasets-in-the-published-literature/ )

The real value-add of data archiving, though, is in the potential for more efficient and effective scientific progress through data reuse. There have been many calls to quantify the extent and impact… to do a cost/benefit analysis. An estimate of value of reuse would help to make a business case for repository funding, an ethical case for compelling investigators to bear the personal cost of sharing data, and clarify that sharing even esoteric data is useful — as the Nature Neuroscience editorial Got Data? puts it, “After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay.”

The GCR project is looking at re-use and Heather’s contribution will be invaluable.

We’ve also had two amazing offers of computing support. I have already mentioned Dan Hagon, but yesterday we also got an offer from Mark W:

Mark W says:

August 10, 2010 at 10:05 pm

I’m no chemistry expert (more bioinformatics) but can contribute Java coding and/or computing power. Please get in touch if I can help with either!

This is brilliant. I have put Dan and Mark in touch and pointed them at the first batch of data and code to see if they can get this working. They both have experience with high throughput computing, and this will be magnificent for the project.

We’ve also had three contributions about the greenness of the project. I have already mentioned Bob Hanson and the tools that he has built.

Please keep the contributions coming. Today I shall concentrate on making sure that the next batch of code is likely to work, and also that we have a good formal representation for the chemical reactions.

I’m confident that we will get many more volunteers and that between us we will be able to show not only that there is significant value in the published literature but also that a rapidly convened group of committed people can make enormous progress.

Why do I believe this? Because of a wonderful project over 10 years ago that produced the SAX specification for XML. In only one month (see http://www.saxproject.org/sax1-history.html ) a group of people on the xml-dev mailing list had created and tested a de facto standard which now runs on every computer on the planet. There is no reason why the same dynamics should not apply to this project.

A month is a long time in an enthusiastic group of collaborators.


 

Posted in Uncategorized | 3 Comments

#solo2010: Computing volunteers needed for Green Chain Reaction

Typed into Arcturus

Here is a wonderful offer for the Green Chain reaction project at #solo2010.


Dan Hagon says:

August 10, 2010 at 12:28 am

Hi Peter, sounds a really fun project. I’m happy to help out with some Java coding. Also I have a cloud-hosted virtual machine I’m not really making much use of right now which you’re welcome to use.

This is exactly one of the skills we shall need for this project. If we are going to look at patents over many past years we are going to have to use either a lot of humans or a lot of computing.

Dan worked with us as a summer student and then moved on to RAL. He helped us get much of the automation into crystal structure repositories, so I know that this contribution is both possible and valuable.

I’ll explain in more detail what we are going to do, but this is about how. We have written most of the tools (in Java) and we’ll be able to offer them so they can run standalone on any machine. This may require wrapping them as a WAR or other self-starting distributable. We’ll also need to make sure they run remotely (Java is described as write-once-run-anywhere and parodied as write-once-debug-everywhere. So people who know what debugging looks like are highly valued).

The main distributed tool will be natural-language-processing (NLP) for chemical documents and specifically reactions. I’ll describe this in detail in a later post. The overall strategy looks something like:

  1. Download N documents from remote site (e.g. patents, Acta Crystallographica E)
  2. Find all reactions in each document (can be hundreds in patents, only one in Acta)
  3. Carry out NLP on each reaction
  4. Create a datafile from each
  5. Index each datafile (probably using RDF)
  6. Search for green concepts in the RDF repository
  7. Present the results

We’ve got code for 1-4. We’ll need help and imagination with the later stages (5-7), especially since they may come slightly later than the initial parsing. But there will be many of you out there who have some experience of this sort of thing.
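The overall strategy can be sketched as a pipeline skeleton. Everything below is an illustrative Python stand-in – the real steps (our Java NLP tools, RDF indexing) are far richer, and every function here is a placeholder:

```python
# Toy stand-ins so the skeleton runs end-to-end; the real steps are far richer.
def download(url):            # step 1 (would use a cached fetch)
    return "Add X. Heat. Add Y. Stir."

def find_reactions(doc):      # step 2: locate reaction paragraphs
    return [doc]

def parse_reaction(text):     # step 3: NLP on each reaction (toy version)
    return {"actions": [s.strip() for s in text.split(".") if s.strip()]}

def to_rdf(data):             # steps 4-5: datafile as RDF-style triples
    return [("reaction", "hasAction", a) for a in data["actions"]]

def is_green(triple):         # step 6: toy "green concept" filter
    return "Heat" not in triple[2]

def run_pipeline(urls):
    """String the seven steps together; step 7 would present the result."""
    triples = []
    for url in urls:
        for reaction in find_reactions(download(url)):
            triples.extend(to_rdf(parse_reaction(reaction)))
    return [t for t in triples if is_green(t)]
```

The point of the sketch is the shape: steps 1–4 are per-document and independent, which is why the work distributes so easily.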

Note that the cloud is an ideal place to do this sort of work as it is embarrassingly parallel – or can be cast as map-reduce. For example each volunteer could take a year of patents (many tens of thousands of reactions in each year).

So please volunteer for help with the computing – it should be fun.


 

Posted in Uncategorized | 2 Comments

#solo2010: How safe / green is my reaction? Feedback requested

Typed into Arcturus

For our Green Chain Reaction at #solo2010 (http://scienceonlinelondon.wikidot.com/topics:green-chain-reaction )

I’d like to be able to calculate the “greenness” of a reaction. This is obviously subjective, but should illustrate the principles and components of green chemistry even if it’s not completely worked out. Bob Hanson (of Jmol / BlueObelisk fame) has created a Green Chemistry Assistant that calculates Process Mass Efficiency (PME) and Atom Efficiency. Bob’s done great work for getting students involved with thinking about Green Chemistry. I asked him about the nature of the materials as well (e.g. toxicity, flammability) but the calculator doesn’t do this.

DOES ANYONE HAVE A GREEN CHEMISTRY CALCULATOR/PROGRAM THAT ASSESSES THE OVERALL GREENNESS OF A REACTION?

To give an example, here’s our first reaction:

The title compound [1] was synthesized as described in the literature. To glycine (1.00 mol) and potassium hydroxide (1.00 mmol) in 10 ml of methanol and 5 ml of water was added 2-hydroxy-1-naphthaldehyde (1.00 mmol in 10 ml of methanol) dropwise. The yellow solution was stirred for 2.0 h at 333 K. The resultant mixture was added dropwise to Cu(II) nitrate hexahydrate (1.00 mmol) and pyridine (1.00 mmol) in an aqueous methanolic solution (20 ml, 1:1 v/v), and heated with stirring for 2.0 h at 333 K. The brown solution was filtered and left for several days; brown crystals had formed that were filtered off, washed with water, and dried under vacuum.

We can’t address yield (it wasn’t given – crystallographers don’t need lots of material; purity is more important). However the reaction uses methanol and pyridine and several other compounds you will find in Wikipedia – which have some modest hazards associated with them. Anyone doing this in Universities and Industry has to prepare a safety assessment (COSHH in the UK) before doing the work. Can we create a numerical index of safety from that form?
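For the mass-based part of greenness there are at least two well-known metrics: atom economy (molecular weight of the product over the summed molecular weights of the reactants) and process mass efficiency (mass of isolated product over the total mass of everything put in, including solvents and water). A hedged Python sketch of one common formulation of each – definitions vary between authors, and a real index would also need terms for toxicity and flammability, which is exactly what is missing:

```python
def atom_economy(product_mw, reactant_mws):
    """Atom economy (%): how much of the reactants' mass could, at best,
    end up in the product. One common definition; others exist."""
    return 100.0 * product_mw / sum(reactant_mws)

def process_mass_efficiency(product_mass, input_masses):
    """PME (%): isolated product mass over total mass of ALL inputs,
    including solvents and water (again, one common definition)."""
    return 100.0 * product_mass / sum(input_masses)
```

For the reaction above the solvents (methanol, water) dominate the input mass, so the PME would be low even at perfect yield – which is part of what a greenness index should capture.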

Posted in Uncategorized | 2 Comments

#solo2010: Green Chain Reaction; details and offers of help

Typed into Arcturus

As I have blogged already (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2526) we are planning an exciting communal event for Science Online where we investigate whether chemical reactions are getting greener. I’ve already had two offers of help today and I’ll post details later. To keep you up to speed this is what we are hoping to analyse. EVEN IF YOU ARE NOT A CHEMIST TAKE THE TIME TO HAVE A LOOK. It’s not as difficult as it looks.

The title compound was synthesized as described in the literature. To glycine (1.00 mol) and potassium hydroxide (1.00 mmol) in 10 ml of methanol and 5 ml of water was added 2-hydroxy-1-naphthaldehyde (1.00 mmol in 10 ml of methanol) dropwise. The yellow solution was stirred for 2.0 h at 333 K. The resultant mixture was added dropwise to Cu(II) nitrate hexahydrate (1.00 mmol) and pyridine (1.00 mmol) in an aqueous methanolic solution (20 ml, 1:1 v/v), and heated with stirring for 2.0 h at 333 K. The brown solution was filtered and left for several days; brown crystals had formed that were filtered off, washed with water, and dried under vacuum.

Let’s do a Jabberwocky on it. You’ll remember:

 The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
  And burbled as it came!

One, two! One, two! And through and through
  The vorpal blade went snicker-snack!
He left it dead, and with its head
  He went galumphing back.

“And, hast thou slain the Jabberwock?

And Alice says perceptively:

“…Somebody killed something: that’s clear, at any rate–,”

And that’s the point. I think we could all say of the chemistry

“something was added to something and heated: that’s clear at any rate”.

Let’s rewrite:

To SOMETHING1 (1.00 mol) and SOMETHING2 (1.00 mmol) in 10 ml of LIQUID1 and 5 ml of water was added SOMETHING3 (1.00 mmol in 10 ml of LIQUID1) dropwise. The yellow solution was stirred for 2.0 h at 60 degrees C. The resultant mixture was added dropwise to SOMETHING4 (1.00 mmol) and LIQUID2 (1.00 mmol) in an aqueous methanolic solution (20 ml, 1:1 volume/volume), and heated with stirring for 2.0 h at 60 degrees C. The brown solution was filtered and left for several days; brown crystals had formed that were filtered off, washed with water, and dried under vacuum.

[NB ml = millilitre – and 1 litre = 1.7 pints. mmol is millimole, which is not a very small burrowing animal but a chemical measure of how much molecular stuff there is.] That’s really not very different from making cocktails or cakes! You don’t have to understand why it works.

The good news is that the machine knows (or can work out) all the SOMETHINGs – they are in Open databases such as PubChem (http://pubchem.ncbi.nlm.nih.gov/ ). You’ll see that the whole recipe is just a number of ACTIONs. We (Lezan Hawizy and others) have catalogued the actions and they are commonly:

  • ADD SOMETHING to SOMETHING
  • WAIT for TIME
  • HEAT/COOL at TEMPERATURE
  • STIR (vigorously/gently)
  • DRY (under vacuum/over drying agent)
  • FILTER (solution/solid)

There are a few others. There’s also:

  • OBSERVATION (solid, bubbling, colour, explosion, etc.)

So it’s quite feasible for machines or humans to abstract this into a formal account.
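To show how tractable the formal account is, the rewritten recipe above reduces to an ordered list of (ACTION, arguments) pairs. A minimal Python sketch of that representation – the field names are illustrative, not the project’s actual markup:

```python
# The Jabberwocky recipe as an ordered list of formal actions.
# Field names are invented for illustration, not the real schema.
recipe = [
    ("ADD",  {"what": "SOMETHING3", "to": "SOMETHING1 + SOMETHING2",
              "solvent": "LIQUID1/water", "how": "dropwise"}),
    ("STIR", {"duration_h": 2.0, "temperature_C": 60}),
    ("ADD",  {"what": "mixture", "to": "SOMETHING4 + LIQUID2", "how": "dropwise"}),
    ("HEAT", {"duration_h": 2.0, "temperature_C": 60}),
    ("FILTER", {"phase": "solution"}),
    ("WAIT",   {"duration": "several days"}),
    ("OBSERVATION", {"note": "brown crystals formed"}),
    ("FILTER", {"phase": "solid"}),
    ("DRY",    {"method": "under vacuum"}),
]

def actions_used(recipe):
    """The action vocabulary a parser would need for this recipe."""
    return sorted({name for name, _ in recipe})
```

Nine events and seven verbs cover the whole paragraph, which is why a small catalogue of actions goes such a long way.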

 

We are now doing this with Mat Todd from Sydney, and over the next 24 hours we intend to come up with a description of a chemical reaction as a set of events. We’ll publish this shortly.

Meanwhile we are dusting off our natural language-processing tools and we’d be delighted if anyone out there would like to take part. At this stage it probably needs someone who has run Java programs – we’ll get some distribution bugs. But maybe we can also build a server.

… more later

Posted in Uncategorized | 3 Comments

PP4: More on where data should be reposited

Scraped into Arcturus

More debate on where data repositories should be located

Chris Rusbridge says:

August 8, 2010 at 11:39 am

[…]

Let’s set the data to one side for the moment and think about the two models for science outputs (articles). It doesn’t greatly matter if articles are deposited in institutional repositories, domain repositories or one grand central repository (ArXiV on steroids). And however many there are, they don’t really have to be federated (I know I mentioned the word earlier, but I was thinking in a looser context). In fact, the OAI-PMH protocol on which repositories are based was originally based around the idea of federated repositories: there were going to be data providers (repositories) and service providers (specialist search sites, like OAISTER). The service providers would harvest the metadata from the data providers, and build their search and other services on that basis.

This was a great idea of its time, but let’s see what happened. OAISTER and a few other search providers do exist, and let you search the repositories they know about. They generally do a MUCH better job than the previous distributed search paradigm, Z39.50 (and later SRU/SRW), which were very prone to realtime and metadata mismatch failures. But sadly, they turn out to do a MUCH WORSE job than the best search engines, like Google et al.

So now we have a large and growing set of repositories, being indexed by Google, and searched by millions or billions every day. And lo! It works well! […]

PMR: This is a useful analysis. Essentially the message is that a textual document can be discovered and indexed by major search engines outside the repository system. Effectively for discovery all you have to do is post your material on the web and very shortly (hours) it will be discovered. For example this blog is indexed within about 2 hours.

So what is the value of the text-oriented repository system? It has some value, but it’s often not what authors want. And authors will generally only post something if there is value to them. Here are some of the points which have been used to promote the value of IRs:

  • A single place for authors to place material. If authors wish their work to be discovered they will generally create specific web pages – and these will be discovered and indexed. A repository takes time to learn, the early ones were unnatural to use and offered no flexibility in the type of material. So authors don’t use them unless there is a specific benefit.
  • Higher visibility for citations. This is true and valuable where the original material is closed access. It’s often a hassle – it’s not a natural process and it’s not surprising that most people I talk to have no familiarity with their repo.
  • Archival. Scientific authors have no interest in archival. They want their stuff disseminated now. If they publish in a conventional journal it is de facto archived. In general repositories have no facilities for archiving blogs, web pages, hypermedia.
  • Metadata-driven federated discovery. It would be nice if this happened. My simple test: “find all exposed chemistry theses in UK universities”. This is a simple, important, legitimate aspiration. I want it for my Green Chain Reaction Challenge. It depends on two simple fields: “thesis” and “chemistry”, which I would hope would be among the top metadata fields. Yet I cannot do this on even a single site, let alone a federation of UK repos.
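The query itself is trivial once the two metadata fields exist: over Dublin Core records harvested by OAI-PMH it is a two-field filter. A Python sketch over hypothetical harvested metadata (the field layout and all record values are invented for illustration):

```python
def chemistry_theses(records):
    """Filter harvested Dublin Core-style records down to chemistry theses.
    Each record is a dict of DC fields (an assumed, simplified layout)."""
    return [r for r in records
            if "thesis" in r.get("type", "").lower()
            and any("chemistry" in s.lower() for s in r.get("subject", []))]

# Hypothetical harvested metadata from three repositories
records = [
    {"title": "A synthesis study", "type": "Thesis",  "subject": ["Chemistry"]},
    {"title": "Beowulf revisited", "type": "Thesis",  "subject": ["English"]},
    {"title": "Green solvents",    "type": "Article", "subject": ["Chemistry"]},
]
```

The hard part is not this filter but getting repositories to populate those two fields consistently in the first place.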

So the primary roles for repos appear to be archival and citation enhancement.

So to data. As I remember you found out with some of your molecule deposits in the Cambridge repository, data can have a problem: much data will not be indexed by search engines (it “cannot speak for itself” as text does), and hence is not inherently findable. But as we also found out then, if you stick it in some sort of repository with some kind of standardised metadata, the latter becomes indexable and hence searchable. So now well-curated data can perhaps be found even if it cannot speak.

Let’s clarify terms – a general repository such as DSpace has the opportunity for the author to manually add metadata terms for discovery. It has little opportunity to support structured semantic content, and so the metadata is associated not with data items but with pages relating to collections of data items. I have put (or rather Jim Downing has put) 150,000 data sets into our DSpace, with 150,000 splash pages. The pages do not interpret the data – they may give some very limited discovery metadata (though at that stage not enough even to retrieve the data automatically). I cannot, for example, discover an ionization potential between 5 and 10 eV. And yet the data are clearly visible.
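The contrast is concrete: a splash page can only be text-matched, whereas data held as typed triples answers the numeric query directly. A small Python sketch (the triple layout and property names are invented for illustration):

```python
# Data items as (subject, property, typed numeric value) triples -- the
# kind of structure a splash page does not expose to a search engine.
triples = [
    ("mol1", "ionization_potential_eV", 7.2),
    ("mol2", "ionization_potential_eV", 12.1),
    ("mol3", "boiling_point_K", 351.0),
]

def in_range(triples, prop, lo, hi):
    """Find subjects whose numeric property lies in [lo, hi]."""
    return [s for s, p, v in triples if p == prop and lo <= v <= hi]
```

The query "ionization potential between 5 and 10 eV" is then a one-line range filter – impossible against a page of pixels or free text.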

Of course there are lots more problems with data. Data are different from text in many ways. Scale factors can vary enormously: size, numbers of objects, rate of deposit, rate of access and use, rate of change, etc by many orders of magnitude. Raw data, processed data, combined data, interpolated data (eg the UARS Level 0 to Level 3 distinctions). Many more ways, I guess. So the repository infrastructure built for text will only be useful for a small subset of data. It might be a useful subset (eg the data representing figures and tables and otherwise substantiating the findings of a scientific article), but there will be much data that’s not appropriate. I think that’s a great place to start promoting the value of depositing data, as it links so closely to the value point of the science enterprise for so many scientists: the article.

I will agree that articles are a good place to start. First, they are the natural way that many scientists communicate their findings. It’s a pity that most publishers do not care about publishing the data properly and, worse, claim ownership of everything, thus stultifying any attempt to make advances here. The association of tables, figures, spectra, molecules, etc. with text is a good way of providing many types of data. The data, of course, are not properly indexed and will rely on things like captions. Scientific units, etc. are almost always lost. Numbers are squashed into pixels or PDF.

Now I really do think the institutional versus domain data repository dichotomy is a false one. I don’t think I know of any “single” domain repository. There are more than a thousand databases in nucleic acids research alone. There are dozens of data centres in the social sciences, and many more in climate-related areas. There are half a dozen funded by NERC in the natural sciences. But there are still many domain areas without them, and no likelihood of them being set up. In those cases data repositories linked to institutions, faculties, research groups etc are appropriate. If they are properly managed (yes, with adequate domain involvement), the data can be found and used by those who need them.

There’s a misunderstanding here. All the examples given here are what I call domain repositories. They usually cater for a few types of data, and data that are understood by those running the repository. The examples are perfect illustrations of why domain repositories already exist and already work. A nucleic acid repository accepts nucleic acid sequences, not evolutionary biology. A crystallography repository accepts crystallography, not evolutionary biology.

So, if there are domain repositories for your domain, use the most appropriate for your kind of data. If there aren’t, and it is even reasonably appropriate, you could try your institution, especially for the “data behind the graph”. Or you could spend your energy persuading your sub-domain to create and sustain its own.

And that’s exactly what I have been asking for consistently. It takes three things, all beginning with “M”:

  • Motivation (people need to want to do it, and I am trying to show this in chemistry and some other areas). It’s not easy but it’s coming. The greatest motivations are coming from domains which care about published data quality (patchy in chemistry, climate; strong in crystallography, proteomics) and funders who insist.
  • Methods. The data management, the metadata, and the discovery mechanisms have to be in place. Bunging data in an unstructured repository is a waste of time. There has to be a domain-specific discovery tool, whether it’s graph substructure search (for chemistry), numeric indexing (e.g. for crystallography), or a triple store (e.g. for key-value data).
  • Money. It’s not a zero cost operation. That’s hard and there is no general solution.

 

Perhaps the really hard part is persuading folk to want to save and share their data in the first place!

That’s what I am trying to do through Panton Papers and elsewise. But we only get one shot. If we don’t do it properly then recovering will be almost impossible. I’m sorry to say it but IRs in universities have done nothing for scientists (except perhaps in the few universities which have actually tried to promote Open Access).

You rightly say that many domains (and let’s use this to cover any sub-sub… domain) do not have their repositories. Their problems will not be solved by simply putting complex data into general-purpose repositories – they will be worsened because they won’t gain anything.

However IRs have funding (I don’t know for how much longer, but they could do it). They could reach out to domains, but it would have to be on a specific basis. It might be done in conjunction with an Open Access publisher (most closed access publishers currently have a model of “owning data” and selling it back to the community). It would have to be a model where the content was not “owned” by the particular institution. (Indeed the whole idea that data can be fragmented on the basis of people’s employers rather than the natural structure is clearly unworkable). And in many institutions that probably fails on inter-institutional politics at the first hurdle – why should “we” host “their” data.

So for those who still believe that federated IRs provide us with a natural scalable solution, just tell me how to get chemistry theses in the UK. It’s a natural, simple, important request. The only way I can do it is by writing my own crawler and text-mining engine.
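
For what it's worth, here is roughly what that crawler has to do. Most IRs expose OAI-PMH, so the machinery is: fetch ListRecords, parse the Dublin Core, and filter by subject and type. A hedged sketch, parsing a canned response rather than a live endpoint (the record contents and subject vocabulary are invented; real repositories vary wildly in how, or whether, they record this – which is exactly the problem):

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Canned OAI-PMH ListRecords fragment standing in for a live HTTP response.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record><metadata>
   <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
              xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Synthesis of substituted pyridines</dc:title>
    <dc:subject>Chemistry</dc:subject>
    <dc:type>Thesis</dc:type>
   </oai_dc:dc>
  </metadata></record>
  <record><metadata>
   <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
              xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Mediaeval trade routes</dc:title>
    <dc:subject>History</dc:subject>
    <dc:type>Thesis</dc:type>
   </oai_dc:dc>
  </metadata></record>
 </ListRecords>
</OAI-PMH>"""

def chemistry_theses(xml_text):
    """Titles of records whose dc:subject mentions chemistry and dc:type is Thesis."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iter(OAI + "record"):
        subjects = [e.text or "" for e in record.iter(DC + "subject")]
        types = [e.text or "" for e in record.iter(DC + "type")]
        if any("chem" in s.lower() for s in subjects) and "Thesis" in types:
            titles.extend(e.text for e in record.iter(DC + "title"))
    return titles

print(chemistry_theses(SAMPLE))  # ['Synthesis of substituted pyridines']
```

Even this optimistic sketch assumes every repository fills in dc:subject and dc:type consistently; in practice many don't, which is why the crawler ends up needing a text-mining engine behind it.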

And I will probably be told that I don't have the legal rights to do it.

Posted in Uncategorized | 1 Comment

#solo2010: The Green Chain Reaction; please get involved!

Typed into Arcturus

More info on the session we (Simon Hodson from JISC and I) are planning for Science Online 2010 in London on September 4. We have had lots of discussions with the organizers and sponsors and we are intending to do something exciting, novel, important and certainly unpredictable.

The title, "Chain Reaction", was suggested by Allan Sudlow, one of the organizers from the British Library, and I have added the prefix "Green" – see below.

What we want to do is to have a global interactive adventure, hopefully with work being done beforehand in the blogosphere where we – as a world community – carry out data-driven science. The current working title is

“Are chemical reactions becoming greener?”

YOU DON’T HAVE TO BE A CHEMIST TO TAKE PART; ANYONE CAN CONTRIBUTE.

There is lots of information in the published literature and the unpublished literature on chemical synthesis. There are several million chemical syntheses published each year either in the primary literature, theses, or patents. These are normally reported with chemical diagrams and a paragraph recording what was done.

Henry Rzepa suggested that the theme of this event should be greenness. This doesn't mean that the reaction actually looks green to the eye, but that it is more friendly to the environment (it wastes less material and causes fewer problems with toxic or environment-unfriendly chemicals). Green chemistry is described here: http://en.wikipedia.org/wiki/Green_chemistry

And there is a strong push for both industrial processes and academic chemistry to be green.
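
Greenness can be made quantitative. One standard green-chemistry metric is atom economy: the fraction of the reactants' mass that ends up in the desired product. A quick sketch (the esterification example and rounded molecular weights are illustrative, not taken from the project's data):

```python
def atom_economy(product_mw, reactant_mws):
    """Atom economy (%): molecular weight of the desired product divided by
    the total molecular weight of all reactants, times 100."""
    return 100.0 * product_mw / sum(reactant_mws)

# Illustrative: acetic acid (60.05) + ethanol (46.07) -> ethyl acetate (88.11) + water
print(round(atom_economy(88.11, [60.05, 46.07]), 1))  # 83.0
```

Metrics like this are one way a literature-scale study could score reactions once the reactants and products have been extracted.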

The question is:

“Does the literature show that chemists are using greener reactions than previously?”.

I'll be laying out how we might tackle this, emphasizing that we want everyone to take part. The main challenge will be organizing information, and we welcome people who want to carry out data-driven research in the Open.

The only restriction is that the data we use must be Open according to the OKDefinition (http://www.opendefinition.org/). At present almost all databases of chemical reactions are not Open/Libre (and most are not even Gratis – you have to pay for the information). So it will have to be done through text-mining of publications.

Here again we are restricted. Currently the Open material we have is:

  • Jean-Claude Bradley's (Drexel, Philadelphia) pioneering work on Open Notebook Science, where he and his group publish all syntheses to the web as the data are collected.
  • Mat Todd (Sydney), who has pioneered Open Drug Discovery and who will be making syntheses and theses available
  • Cambridge (where we have created semantic theses from the originals)
  • Acta Crystallographica E with about 8000 preparations. These are all Open Access/Libre (CC-BY) and we have the active involvement of the IUCr.
  • BioMed Central, PLoS and Beilstein Journal of Organic Chemistry. These are all Libre/CC-BY publishers and we are already in contact with them.
  • The Open Subset of PubMed Central (and especially UKPubMedCentral)
  • European patents. Ca. 60 per week, maybe 1000–5000 syntheses per week.

Text-mining does not achieve 100% recall or precision, but the noise should be small when identifying key quantities such as temperature, solvent, catalyst and time.
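
As a flavour of what "identifying key things" means in practice, here is a minimal, hedged sketch of pulling temperature and time phrases out of a synthesis paragraph with regular expressions (the patterns and the example sentence are invented for illustration; the real chemistry text-mining tooling is far richer than this):

```python
import re

# Temperatures like "80 °C" or "-20°C"
TEMP = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*C\b")
# Durations like "2 hours", "30 min"
TIME = re.compile(r"(\d+(?:\.\d+)?)\s*(hours?|h|minutes?|min)\b")

text = "The mixture was stirred at 80 °C for 2 hours in toluene."
print(TEMP.findall(text))  # ['80']
print(TIME.findall(text))  # [('2', 'hours')]
```

Recall and precision suffer wherever authors write "reflux", "overnight" or "room temperature" instead of numbers, which is why the aggregate, statistical view across many papers matters more than any single extraction.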

We have tools that will allow you to download and mine patents, and this would be an excellent adventure for early adopters. I'll write more, but being comfortable with alpha/beta code is more important than knowing chemistry, and we'd love volunteers.

We'd also like to invite mainstream chemistry publishers to take part. Traditionally they have not allowed text-mining of their material, but times are changing, and this project is reaching out to them to see if they'd like to show the value of text-mining for chemistry. So we'll be setting up a scheme (using the Open Knowledge Foundation's IsItOpen service) to formally request permission to text-mine experimental data and release it as Open Data. We are sure that many will wish to contribute towards green activities, and we'll record those with positive responses at the meeting.

As you can see, we are developing this as we go. I know August is a bad time of year, but this is a great activity while you are on vacation. It would be great to show a blogospheric and publisher response for the September meeting.

We hope that both Blue Obelisk and OKF adherents will take part. We want this to be completely open, so it needs to use Open Data, Open Source, Open Standards and Open Services. Please jump in…

I hope to blog every day on this topic.

P.

Posted in Uncategorized | 2 Comments

OR2010: Sam Adams is runner-up in the Developer Challenge

Typed and scraped into Arcturus

Many things happened while I've been away, and often it's difficult to take the mental time to do them justice.

Anyway we put in an entry for the Developer Challenge at Open Repositories 2010 (Madrid) and we came second. Here’s the challenge:

We [Developer Challenge] encouraged teams comprised of developers working with non-developers (such as repository managers) to enter. Just to recap, the final challenge was:

Create a functioning repository user-interface, presenting a single metadata record which includes as many automatically created, useful  links to related external content as possible.

Jim Downing thought we should enter and Sam Adams (who is working on the JISC CLARION project) came to bounce ideas off me. We had to get a “pseudo-user” and that was really my only role – to help steer the use case. Effectively Sam did everything and deserves all the credit. (Dev8D attendees may remember that Sam also won a prize there for his impressive interface to roll back web pages to previous years.)

Here’s the announcement for OR2010:

http://devcsi.ukoln.ac.uk/blog/2010/07/13/we-have-a-winner-developer-challenge-at-open-repositories-2010-madrid/

I’ve cut-pasted the announcement here but if it doesn’t work go to the link above.

The runners up!

Sam Adams (Developer) and Peter Murray-Rust  – University of Cambridge

Sam Adams announced as runner-up at the Developer Challenge


Video of Sam presenting his pitch


Interview of Sam talking about his entry

Podcast of interview

I haven’t debriefed IRL with Sam yet, but he was immediately generous in saying that we were beaten by a better entry and I congratulate them wholeheartedly.

Our (Sam’s) entry highlighted the problem of getting useful metadata out of many repositories. A typical example is that there is no unique identifier for the institution or its repository (at least not given in the repo). Given that we are meant to be federating repositories this isn’t going to be easy if we cannot identify them. There are too many clashes where University of X is different from X University.

And you would think it would be easy to find the theses that are in an institutional repository… well, if you read this blog you'd know it's often impossible.

So repositories have some way to go before getting all the useful metadata from them is straightforward.

But to end on a happy note: well done Sam, well done ULCC, well done everyone. And many thanks to the organizers for cooking this challenge up – it's great impetus and great entertainment.

Posted in Uncategorized | Leave a comment