There is no “right structure” for a compound

A few days ago I promised to respond to Antony Williams’ post on associating chemical names with structures. I wrote:
There is no “right structure (sic)” for a compound. There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.”
I still hold by this statement, and Antony’s post reinforces my view. I’ll post most of it and comment… [There is a question at the end that I’d like readers to comment on].

05:37 07/06/2008, Antony Williams,
I refer you back to the original post from which this comment was made as it is taken from a specific context.

Is this [PMR above] a true statement? In many cases I would agree, but I have my own opinion in specific cases, and let’s focus on the drug industry for a moment and trade names. First, let’s talk about me… and my identifiers. Depending on who’s talking about me I am Tony, Antony, Dr Williams, Mr Williams, Dad, sweetheart, son, Tone, AJ, Bro’ and so on. However I am registered with a social security number and exist as a legal entity, a “registered” entity.

PMR: Although humans are peripheral to this discussion, it’s actually very difficult to associate a human with identifiers. The UK is spending zillions of pounds on trying to do this and requiring everyone to have identity cards. They can be forged. They’ll probably need to brand us with a number, and have us rebranded every year in case we try to laser it off.

CS: Now, Zantac is a registered trade name for the chemical here.

PMR: This points to a page in Chemspider (http://www.chemspider.com/Chemical-Structure.571454.html) which I shall refer to as page571454 for simplicity of dialog. It contains the header:

ChemSpider ID: 571454
Empirical Formula: C13H22N4O3S
Molecular Weight: 314.4038
Nominal Mass: 314 Da
Average Mass: 314.4038 Da
Monoisotopic Mass: 314.141261
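PMR (aside): these masses follow directly from standard atomic weights, and the check is cheap. A minimal sketch in Python – the weight values below are mine, not taken from page571454, and tiny differences from the quoted figures reflect the atomic-weight tables used:

```python
# Recompute the quoted masses for C13H22N4O3S from standard atomic
# weights (my values, not anything scraped from page571454).
average = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}
monoisotopic = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
                "O": 15.99491462, "S": 31.97207069}
formula = {"C": 13, "H": 22, "N": 4, "O": 3, "S": 1}

avg = sum(average[el] * n for el, n in formula.items())
mono = sum(monoisotopic[el] * n for el, n in formula.items())
print(f"Average mass:      {avg:.4f}")   # ~314.404
print(f"Monoisotopic mass: {mono:.6f}")  # ~314.141261
```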

CS: I am not an expert in the registration process but I believe that somewhere along the line a defined chemical entity is associated with that name. Whether the chemical entity has been appropriately elucidated by analytical technologies or not is a different question. What is registered as a compound, and associated with the name, is what that name defines.

PMR: I am not currently an expert in registration, but at one stage I worked closely with authorities such as the FDA and WHO on registration of drugs, so my comments may be out of date. I and colleagues also worked for several years on the structure of “ranitidine” – I’ll clarify later.

CS: Now, there are a whole series of other names for the same compound – registry numbers, systematic names, organization numbers. See below

PMR: I will leave these here, and also add some from page571454:

Ranitidine [Wiki]

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethen-1,1-diamin

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethene-1,1-diamine

1,1-Ethenediamine, N-[2-[[[5-[(dimethylamino)methyl]-2-furanyl]methyl]thio]ethyl]-N’-methyl-2-nitro-, (Z)-

128345-62-0 [RN]

266-332-5 [EINECS]

66357-59-3 [RN]

Azantac

GR 122311X

Melfax

N-[2-[[[-5-[(Dimethylamino)methyl]-2-furanyl]methyl]thio]ethyl]-N’-methyl-2-nitro-1,1-ethenediamine

[…]

Raniplex

Ranitidine Base

Sostril

[…]

ZANTAC [Wiki]

Zantic

PMR: The first point is that these are NOT exact synonyms. It is clear that

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethen-1,1-diamin

and

N-[2-[[[-5-[(Dimethylamino)methyl]-2-furanyl]methyl]thio]ethyl]-N’-methyl-2-nitro-1,1-ethenediamine

are not identical. One describes a compound whose stereochemistry is asserted; the other describes one where the stereochemistry is not asserted. Butene and 1-butene and 2-butene and (Z)-2-butene are all different. They all have different InChIs. Some of them may refer to the same concept in some contexts, but they are not synonyms. Fowler (Modern English Usage) says “perfect synonyms are extremely rare”.
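To make the InChI point concrete, here is a minimal sketch (assuming the open-source RDKit toolkit, built with its InChI support; the SMILES strings are my own): each butene name, including 2-butene with no asserted stereochemistry, yields a distinct InChI.

```python
# A minimal sketch, assuming RDKit with InChI support: names that some
# contexts treat as "the same compound" give different InChIs, so a
# machine must not treat them as interchangeable.
from rdkit import Chem

smiles = {
    "1-butene": "C=CCC",
    "2-butene (stereo unasserted)": "CC=CC",
    "(Z)-2-butene": "C/C=C\\C",
    "(E)-2-butene": "C/C=C/C",
}
for name, smi in smiles.items():
    print(name, "->", Chem.MolToInchi(Chem.MolFromSmiles(smi)))
```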

This is not nit-picking or logic-chopping. If we are representing something in a machine, and we assert that the two are to be used interchangeably, then we have to be very sure that they can be. Adding a “(Z)” may appear a reasonable thing to do – in this case it is a disastrous act that corrupts information (I’ll leave that till the next post).

The robotic aggregation of chemical names and identifiers, if done without metadata and ontology, corrupts information. That’s a strong statement, but we can see it in the current case. First, there is junk out there. Robotic name harvesting harvests junk. (Christoph Steinbeck described it in worse terms at the RSC meeting.) Here’s a snip from page571454:

Validated by Experts, Validated by Users, Non-Validated, Removed by Users, Redirected by Users, Redirect Approved by Experts

Ranitidine [Wiki]

(Z)-N-{2-[({5-[(Dim?thylamino)m?thyl]furan-2-yl}m?thyl)sulfanyl]?thyl}-N’-m?thyl-2-nitro?th?ne-1,1-diamine

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethen-1,1-diamin

The “?” characters show up in my browser – I don’t know what they are, but they are not normal “e”s (ASCII 101). The first name is not a synonym – I’m sorry, but it’s junk. Associating junk with good information degrades the good information rather than increasing the quality of the junk (There is a more formal proof somewhere by Shannon – I believe – that machines cannot act as 100% proofreaders).
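A robotic aggregator could at least flag such strings before merging them in as “synonyms”. A crude sketch (my illustration, not ChemSpider’s code):

```python
# Flag harvested "synonyms" containing characters outside the
# repertoire expected in chemical names, instead of silently merging
# them. Crude, but it catches the mojibake shown above.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "0123456789 ()[]{}'\",.:;+-")

def suspect(name: str) -> bool:
    return any(ch not in ALLOWED for ch in name)

for s in ["Ranitidine", "2-nitro?th?ne-1,1-diamine"]:
    print(s, "->", "SUSPECT" if suspect(s) else "ok")
```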

CS: I think that the Trade Name for a compound is definitive since it’s registered. Relative to the statement “There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.”…my question is whether a Registered Trade Name is absolute? I’m asking the question since I’m actually not sure. Thoughts anyone?

PMR: A trade name represents a product, not a compound and certainly not a connection table. In some cases it may refer to a pure substance, which itself is describable by a connection table, but these are not synonyms. And aggregating them as synonyms adds error rather than clarity.
However there is an even stronger reason why “Zantac” does not describe ranitidine. See the FDA page.
Zantac (Ranitidine Hydrochloride) Tablets
Zantac contains (not “is”) ranitidine hydrochloride. Ranitidine is not ranitidine hydrochloride, any more than ammonia is ammonium chloride. Listing them together under synonyms corrupts information.
You may argue that an intelligent, chemically educated chemist will know the difference, and that may be true. But the current aggregations of chemicals (Chemspider, eMolecules, Chempedia) are designed for use by machines as well as humans.
And unless high-quality metadata is given, along with a structured ontology, machine aggregation of chemistry corrupts rather than enhances.
For that reason we are building molecular repositories based on metadata and ontologies. In the current era of the web it’s becoming essential.
Now, I suggested that the “(Z)” should not have been added to “ranitidine” to indicate the stereochemistry. You can find pages out there with “(E)”. What is the “correct structure”? Or is this a meaningless question?

Posted in Uncategorized | 2 Comments

Don't "use Institutional Repositories"; "put it on the web"

A very thoughtful post by Cameron Neylon about a very thoughtful talk by Andy Powell about why institutional repositories don’t work and in their current form won’t work. I’ll post snippets and comment:

20:52 10/06/2008, Cameron Neylon,
[…]
The problem with institutional repositories in their current form is that academics don’t use them. Even when they are being compelled there is massive resistance from academics. There are a variety of reasons for this: academics don’t like being told how to do things; they particularly don’t like being told what to do by their institution; the user interfaces are usually painful to navigate. Nonetheless they are a valuable part of the route towards making more research results available. I use plenty of things with ropey interfaces because I see future potential in them. Yet I don’t use either of the repositories in the places where I work – in fact they make my blood boil when I am forced to. Why?
PMR: Exactly so.

So Andy was talking about the way repositories work and the reasons why people don’t use them. He had already talked about the language problem. We always talk about ‘putting things in the repository’ rather than ‘making them available on the web’. [PMR emphasis].  He had mentioned already that the institutional nature of repositories does not map well onto the social networks of the academic users which probably bear little relationship with institutions and are much more closely aligned to discipline and possibly geographic boundaries (although they can easily be global).
But for me the key moment was when Andy asked ‘How many of you have used SlideShare’. Half the people in the room put their hands up. Most of the speakers during the day pointed to copies of their slides on SlideShare. My response was to mutter under my breath ‘And how many of them have put presentations in the institutional repository?’ The answer to this: probably none. SlideShare is a much better ‘repository’ for slide presentations than IRs. There are more there, people may find mine, it is (probably) Google indexed. But more importantly I can put slides up with one click, it already knows who I am, I don’t need to put in reams of metadata, just a few tags. And on top of this it provides added functionality including embedding in other web documents as well as all the social functions that are a natural part of a ‘Web2.0’ site.
PMR: Yes. This is how the world actually works, not how repositarians would like it to work.


Andy was arguing for global discipline specific repositories. I would suggest that the lesson of the Web2.0 sites is that we should have data type specific repositories. FlickR is for pictures, SlideShare for presentations. In each case the specialisation enables a sort of implicit metadata and for the site to concentrate on providing functionality that adds value to that particular data type. Science repositories could win by doing the same. PDB, GenBank, SwissProt deal with specific types of data. Some might argue that GenBank is breaking under the strain of the different types and quantities of data generated by the new high throughput sequencing tools. Perhaps a new repository is required that is specially designed for this data.
PMR: Fully agreed. CrystalEye is an aggregatory for crystals. eCrystals promises to be a repository. We should have, we shall have, an open repository for chemical preparations. And an Open repository for chemical spectra. All three separate. And it would be nice…
[… role of preservation snipped…]
But the key thing is that all of this should be done automatically and must not require intervention by the author. Nothing drives me up the wall more than having to put the same set of data into two subtly different systems more than once. And as far as I can see there is no need to do so. Aggregate my content automatically, wrap it up and put it in the repository, but I don’t want to have to deal with it. Even in the case of peer reviewed papers it ought to be feasible to pull down the vast majority of the metadata required. Indeed, even for toll access publishers, everything except the appropriate version of the paper. Send me a polite automated email and ask me to attach that and reply. Job done.
For this to really work we need to take an extra step in the tools available. We need to move beyond files that are simply ‘born digital’ because these files are in many ways still born. This current blog post, written in Word on the train, is a good example. The laptop doesn’t really know who I am, it probably doesn’t know where I am, and it has no context for the particular Word document I’m working on. When I plug this into the WordPress interface at OpenWetWare all of this changes. The system knows who I am (and could do that through OpenID). It knows what I am doing (writing a blog post) and the Zemanta Firefox plug-in does much better than that, suggesting tags, links, pictures and keywords.

PMR: Read Cameron’s blog and Andy’s slides. Then ask yourself: what do I want from my repository?
What I do not want is DSpace in its current form. Or ePrints. I don’t know Fedora. For me DSpace is a write-only system. I have (with Jim’s help) put 200,000+ molecules into DSpace at Cambridge. I thought this would make them available to the world. It doesn’t. Yes, someone can get one out by hand if they really want. But not the whole lot. Google doesn’t index them because they have the wrong file suffix.
I have thousands of revisions of code in Sourceforge. I can get every single revision. I share this with anyone in the world who is interested. They can contribute. The system pings me when they do. The material is safe. (I don’t know how it’s safe, but it’s safe.)
Whereas the code on my current machine is less safe. I’ve just had to change machines. What did I lose? I am still finding out. To be fair the COs have backed up everything, so I don’t think I’ll lose actual files. But I lose environment. I have to reinstall programs. Reconfigure fonts, etc.
So what do I want from my Institution? Not a repository.
I want an AUTHORITORY. And I’ll explain what that is in a future post.

Posted in Uncategorized | 3 Comments

ICE-TheOREm – Authoring theses has never been easier

Peter Sefton and ourselves are teaming up. ICE is the Integrated Content Environment for authoring semantic documents. Peter’s been working on this and it’s now becoming an important factor. With the ODT/OOXML arena in many people’s minds, groups like Peter and ourselves who tackle the semantic aspects have a lot to offer. TheOREm is – surprise – based on ORE – Object Reuse and Exchange – being developed collaboratively across the globe, led by Carl Lagoze and Herbert van de Sompel. (I’m on the advisory board).
ORE is important. It’s difficult for the academic community to outguess and outpace the Web, but there is a need for a semantic overlay on the current content of the web, and ORE fits that. At its simplest it’s a navigational RDF graph overlaid on content. With time we can learn to use it in more powerful ways. RDF is so new (in practice) that we don’t know which ways it will go, and its flexibility is both a strength and a problem. On the assumption that ORE becomes a mainstream approach in scholarly content, it will gather people round it, just as HTML did in 1993.
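For the record, here is what “a navigational RDF graph overlaid on content” can look like – a minimal ORE resource map sketched with Python’s rdflib (the thesis URIs are hypothetical):

```python
# A minimal sketch of an ORE resource map: the aggregation simply
# points at the parts of a (hypothetical) thesis.
from rdflib import Graph, Namespace, URIRef

ORE = Namespace("http://www.openarchives.org/ore/terms/")
g = Graph()
g.bind("ore", ORE)

rem = URIRef("http://example.org/thesis/rem")  # the resource map
agg = URIRef("http://example.org/thesis")      # the aggregation it describes
g.add((rem, ORE.describes, agg))
for part in ("chapter1.html", "data/spectra.cml", "thesis.pdf"):
    g.add((agg, ORE.aggregates, URIRef(f"http://example.org/thesis/{part}")))

print(g.serialize(format="turtle"))
```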
So TheOREm – Jim’s term – is a JISC project funded in response to a call for demonstrators for ORE. We thought it would be exciting to base our example on theses – aggregating the material, authoring them, submitting them, reviewing them and getting them into the Institutional Repository. (This is one area where I support the role of the Institution in running a repository – it’s more important to aggregate them there than in the cloud).
Here’s Peter Sefton (much is omitted)

Deflation in repository clicks

At Open Repositories 2008 a group of us Australian developers entered the Repositories Challenge, with an entry entitled Zero Click Ingest [1]. The introduction puts it like this:

This micro-project demonstrates a way to eliminate the repository deposit step altogether, by having the repository software take responsibility for collecting the content that it needs. It involves using the Integrated Content Environment (ICE) (Sefton 2006) [2] as a document authoring system, but the principle could be applied to other content management systems which support metadata- or category-aware Atom or RSS feeds, with the ability to supply the requisite formats. We show how documents created and managed in ICE can be automatically ingested into a repository at the appropriate time, based on document state. […]

In TheOREM we’re going to set up ICE as a ‘Thesis Management System’ where a candidate can work on a thesis which is a true mashup of data and document, aka a datument [3]. When it’s done and the candidate is rubber-stamped with a big PhD, the Thesis Management System will flag that, and the thesis will flow off to the relevant IR and subject repositories, as a fully-fledged part of the semantic web, thanks to the embedded semantics and links to data. […]
At USQ, the Integrated Content Environment is on the way to becoming a ‘core’ system for producing courseware. It has been available for a few years under an open source license…
But at USQ, where we are reaffirming our commitment to flexible delivery (we call it Fleximode), staff know they have to create resources that suit on-campus, web and print use. ICE helps with that, so it has grown organically from our first user to a couple of hundred because overall it makes life easier. Learning ICE is not trivial: you have to do some training, you have to change the way you work, and the organization needs to supply support. If you do use it, though, there’s a net benefit…
This is one reason why TheOREM is so exciting; not that it’s going to look at ORE but that it will be a first step to providing tools for PhD candidates and their supervisors that I hope will be the envy of others, just as Shirley Reushle used the first version of ICE to make an online course that met the USQ standards and her colleagues saw it and wanted to do the same.

[1] L. Monus et al., Zero Click Ingest, Apr. 2008; http://pubs.or08.ecs.soton.ac.uk/119/.

[2] P. Sefton, The Integrated Content Environment for Research and Scholarship, ICE Website, 2006; http://ice.usq.edu.au/introduction/ice_rs.htm.

[3] P. Murray-Rust and H.S. Rzepa, The Next Big Thing: From Hypermedia to Datuments, Journal of Digital Information, vol. 5, 2004, p
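PMR: To make the zero-click idea concrete, here is a minimal sketch of the polling pattern (the feed URL, repository endpoint and “finished” state are hypothetical stand-ins, not ICE’s or TheOREm’s actual interfaces):

```python
# Zero-click ingest, sketched: poll a category-aware Atom feed and push
# completed documents to the repository, with no author intervention.
# All URLs and the "finished" state are hypothetical.
import feedparser
import requests

FEED = "https://ice.example.org/feeds/theses.atom"  # hypothetical
REPO = "https://repo.example.org/deposit"           # hypothetical

for entry in feedparser.parse(FEED).entries:
    states = {tag.term for tag in entry.get("tags", [])}
    if "finished" in states:           # document state drives the ingest
        package = requests.get(entry.link).content
        requests.post(REPO, data=package,
                      headers={"Content-Type": "application/zip"})
```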

Posted in jisc-theorem, Uncategorized | Leave a comment

Research data must be free

A very important piece of work from RIN – about the critical need for data. Peter Suber has summarised it but you have to read it. This study should be at the top of all science funders’ reading. The research – carried out by Key Perspectives, aka Alma Swan – is thorough and compelling. But you don’t need me to tell you – just read it. (The RAE doesn’t come out with glory).

19:20 09/06/2008, Peter Suber

The Research Information Network (RIN) has released a new study, To Share or Not to Share:  Publication and Quality Assurance of Research Data Outputs, June 2008.  The study was commissioned by RIN and executed by Key Perspectives.  From the executive summary:
…There are two essential reasons for making research data publicly-available: first, to make them part of the scholarly record that can be validated and tested; second, so that they can be re-used by others in new research.
This report presents the findings from a study of whether or not researchers do in fact make their research data available to others, and the issues they encounter when doing so. The study is set in a context where the amount of digital data being created and gathered by researchers is increasing rapidly; and there is a growing recognition by researchers, their employers and their funders of the potential value in making new data available for sharing, and in curating them for re-use in the long term….
We gathered information on researchers’ attitudes and data-related practices in six discrete research areas – astronomy, chemical crystallography, classics, climate science, genomics, and social and public health sciences – and two interdisciplinary areas – systems biology and the UK’s rural economy and land use programme. The primary methodology used was interviews with over 100 researchers, data managers and data experts….
Key findings….
3.  …The convention in many fields is that derived or reduced data – as distinct from raw data – are what is made available to other researchers. Providing access to raw data is relatively rare, though it may be the most effective means of ensuring that the research is reproducible. But there is discussion in some fields about the lack of access to raw data.
4.  Many datasets of potential value to other researchers and users – particularly those arising from small-scale projects – are not managed effectively or made readily-accessible and re-usable….
5.  Many research funders are putting policies in place to ensure that datasets judged to be potentially useful to others are curated in ways that allow discovery, access and re-use. But there is not a perfect match between those policies and the norms and practices of researchers in a number of research disciplines….
10.  Some researchers are motivated to publish their data by factors such as altruism, encouragement from peers, or hope of opening up opportunities for collaboration. But the lack of explicit career rewards, and in particular the perceived failure of the Research Assessment Exercise (RAE) explicitly to recognise and reward the creating and sharing of datasets – as distinct from the publication of papers – are major disincentives.
11.  Many researchers wish to retain exclusive use of the data they have created until they have extracted all the publication value they can. When combined with the perceived lack of career rewards for data creation and sharing, this constitutes a major constraint on the publishing of data. Other disincentives include lack of time and resources; lack of experience and expertise in data management and in matters such as the provision of good metadata; legal and ethical constraints; lack of an appropriate archive service; and fear of exploitation or inappropriate use of the data.
12.  Some publishers are taking steps to underpin the scholarly record by creating persistent links from articles to relevant datasets; and this signposting is viewed positively by researchers.
13.  Relatively few researchers have the expertise, resources and inclination to perform themselves all the tasks necessary to make their data not only available, but readily accessible and usable by others.
14.  …Datasets on journal websites are commonly in PDF format which is unsuitable for meaningful re-use.
15. Other obstacles to locating and gaining access to datasets produced by researchers and other organisations include inadequate metadata, refusal to release the data; the need for licences (which may restrict how the data may be used or disseminated) and/or for the payment of fees; or the need to respect personal and other sensitivities.
16. Effective use of raw scientific data in particular may require access to sophisticated specialist tools and technologies, and high level programming skills….
Conclusions and recommendations….
3.  Research funders and institutions should seek more actively to facilitate and encourage data publishing and re-use by [using the following 10 strategies]….
5.  Publishers should wherever possible require their authors to provide links to the datasets upon which their articles are based, or the datasets themselves, for archiving on the journal’s website. Datasets made available on the journal’s website should wherever possible be in formats other than pdf, in order to facilitate re-use.
6. Researchers and publishers should seek to ensure that wherever possible, datasets cited in published papers are available free of charge, even if access to the paper itself depends on the payment of a subscription or other fee.
7. Funders, researchers and publishers should seek to clarify the current confusion with regard to publishers’ policies with regard to allowing access for text-mining tools to their journal contents….

PS:  For background, see our post from June 2007 on the launch of this study.

Posted in Uncategorized | 3 Comments

Xiphos Research Day – What I said

Owen Stephens has blogged the day and also linked to the Twitter log. First here’s Owen’s record. I am really impressed by the amount he has covered – some of the slides are almost verbatim. This is much better than having “lecturers” prepare powerpoint beforehand. Just invite Owen to your meeting (he’s done everyone). I am impressed by the accuracy.

Talis Research Day – Codename Xiphos

Peter presents using HTML – although it’s hard work, he believes that the common alternatives (Powerpoint, PDF) destroy data. I think the question of ‘authoring tools’ – not just for presentations, but in a more general sense of tools that help us capture data/information – is going to come to the fore in the next few years.
Peter has a go at publishers – claiming that publishers are in the business of preventing access to data, rather than facilitating it (at this point he asks if there are any publishers in the audience – two sheepish hands are raised). Peter also mentioning that Chemistry is particularly bad as a discipline in terms of making data accessible – with the American Chemical Society being a real offender.
Peter’s talks tend to be pretty impromptu – so he is just listing some topics he may (or may not) touch on today:

  • Why data matters
  • What is Open Data
  • Differences between Open Access and Open data
  • Demos
  • Repositories
  • eTheses
  • Open Notebook Science
  • Semantic data and the evils of PDF
  • Science Commons, Talis and the OKF
  • Possible Collaborations

Peter demonstrating how a graph without metadata is meaningless – showing a graph on the levels of Atmospheric Carbon Dioxide. If this was in paper form and we wanted to do some further analysis – it would take a lot of effort to take measurements off the graph – but if we have the data from behind the graph, we can immediately leap to doing further work.
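PMR: (Aside: this is exactly the point. A minimal sketch with made-up numbers – once you have the data behind the graph, “further work” is a few lines:)

```python
# Illustrative values only, not real measurements: with the numbers
# behind a CO2 plot in hand, fitting a trend is immediate.
import numpy as np

years = np.array([2000, 2002, 2004, 2006, 2008])
co2_ppm = np.array([369.5, 373.2, 377.5, 381.9, 385.6])  # made up

slope, _ = np.polyfit(years, co2_ppm, 1)
print(f"Fitted trend: {slope:.2f} ppm/year")
```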
Peter now noting that a scholarly publication today looks very much as it would have done 200 years ago. Showing a pdf of an article from Nature – and making the point that it all looks great (illustrations of molecules, proteins and reactions etc.) but is completely inaccessible to machines.
Peter noting that most important bio-data that is published is publicly accessible and reusable – but this is not true in chemistry. This means that in the same article, the data about the proteins is publicly accessible, but the information on the chemical molecules is not.
Peter illustrating how there is a huge industry based on moving and repurposing data (e.g. taking publicly available patent data, and re-distributing in other formats etc.)
Peter now showing how a data rich graph is reduced to a couple of data points to ‘save space’ in journals – a real paper-based paradigm – we need to get away from this. Similarly experimental protocols are reduced to condensed text strings.
Peter now showing ‘JoVE’ – the Journal of Visualized Experiments. In this online publication, scientific protocols are published in both textual and audio-visual formats – so much richer in detail than the type of summarisation that journals currently support. Peter notes – this is really important stuff – that failure to provide enough detail to recreate an experiment can have a huge impact on your reputation and career.
Peter now moving onto ‘big science’ – relating his visit to CERN – how the enormous amounts of data generated by the Large Hadron Collider are captured, as well as relevant metadata. However, most science is not like this – not on this scale. Peter is relating the idea of ‘long tail’ science (coined by Jim Downing) – this is the small-scale science that is still generating (over all activity) large amounts of data – but each from small activities. This is really relevant to me, as this is exactly the discussion I was having at Imperial yesterday – looking at the approach taken by ‘big science’ and wondering if it is applicable to most of the research at Imperial.
So in long-tail science, you may have a ‘lab’ that will have a reasonably ‘loose’ affiliation to the ‘department’ and ‘institution’. Peter noting that most researchers have experienced data loss – and this can be a real selling point for data and publication repositories.
Peter showing a thesis with many diagrams of molecules, graphs etc. Noting there is no way to effectively extract the information about molecules from the paper, as it is a PDF. He is demonstrating a piece of software which extracts data from a chemical thesis – in this case one authored in Word – using OSCAR (a text-mining tool tuned to work in Chemistry), and shows how it can extract relevant chemical data, display it in a table, and reconstruct spectra (from the available data in the text – although these are not complete).
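PMR: (Aside: for readers who have not seen a chemical named-entity recogniser, here is a toy dictionary-based tagger – my illustration only, not OSCAR’s actual API – showing the kind of output such tools produce: character offsets plus the matched string.)

```python
# A toy tagger, nothing like OSCAR internally: longest-match lookup of
# known chemical names, reporting start/end offsets.
import re

lexicon = ["petroleum ether", "diethyl ether", "benzene", "CDCl3"]
pattern = re.compile("|".join(re.escape(t) for t in
                              sorted(lexicon, key=len, reverse=True)))

text = "The product, dissolved in CDCl3, was washed with petroleum ether."
for m in pattern.finditer(text):
    print(m.start(), m.end(), m.group())
```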
Peter asking (rhetorically) what are the major barriers – e.g. Wiley threatened legal action against a student who put graphs on their website.
Peter now demonstrating ‘CrystalEye’ – a system which spiders the web for crystals – reads the raw data, draws a ‘Jmol’ view (3D visualisation) of the structure, links to the journal article etc. This brings together many independent publications in a single place showing crystal structures. Peter saying this could be done across chemistry – but the data is not open, and there are big interests that lobby to keep things this way (specifically mentioning Chemical Abstracts lobbying the US Government).
Peter now talking about development of authoring tools – pointing out that this is much more important than a deposition tool – if the document/information is authored appropriately, it is trivial to deposit (it occurs to me that as long as it is on the open web, then deposit is not the point – although there is some question of preservation etc. – but you could start to take a ‘wayback machine’ type approach). Peter is demonstrating how an animated illustration of chemical synthesis can be created from the raw data.
Peter now coming on to Repositories. Using ‘Sourceforge’ (a computer code repository) as an example. Stressing the importance of ‘versioning’ within Sourceforge – trivial to go back to previous versions of code. Need to look at introducing these tools for science. He is involved in a project called ‘Bioclipse’ – a free, open-source workbench for chemo- and bioinformatics using a versioning approach (based on Eclipse, the software development platform) – Bioclipse stores things like spectra, proteins, sequences, molecules etc.
Peter mentioning issues of researchers not wanting to share data straightaway – we need ‘ESCROW’ systems that can store information which is only published more openly at a later date. The selling point is keeping the data safe.
Peter dotting around during the last few minutes of the talk, mentioning:

  • Science Commons (about customising Creative Commons philosophy for Science)
    • how to license data to make it ‘open’ under appropriate conditions – this is something that Talis has been working on with Creative Commons.
    • Peter saying that, for example, there should be a trivial way of watermarking images so that researchers can say ‘this is open’ – and then if it is published, it will be clear that the publisher does not ‘own’ or have copyright over the image.

Questions:

Me: Economic costs of capturing data outside ‘big science’
PMR: If we try to retro-fit, the costs are substantial. However, data capture can be marginal cost if done as part of research. Analogy of building motorways and cyclepaths – very expensive to add cyclepaths to motorways, but trivial to build them at the same time.
==============================
The microblogging (I nearly called it Twitter, but Andrew W corrected me) was on Content Live from Andy Powell with a few other comments. Worth reading, but as it says a lot of what Owen has captured I won’t repeat it.
=============================
PMR: This is incredibly useful. It’s better than traditional powerpoints – these lack any sense of why a slide was presented. With a bit of editing it almost forms a simple paper.
Little to add to this bit, though I’ll probably say more about the day as a whole. I may appear to be a bit hard on publishers, but I’ll make a simple deal. If a publisher makes factual data available, outside the firewall without copyright restrictions, that’s great and I’ll say so. This obviously applies to all OA-CC-BY publishers (PLoS, BMC, Acta Cryst E, and one or two others). It also applies to RSC and IUCr for their supplemental data for closed-access publications, which is still outside the firewall and uncopyrighted. But Wiley and the ACS copyright factual data. This isn’t right and I continue to say so. I wish I didn’t have to. But they show no signs of changing the policy and won’t enter into constructive discussion of it. It’s simple: factual data belongs to the public domain. And the more of us who say so, the more likely it is to actually be there. So speak up.

Posted in Uncategorized | 1 Comment

Talis Xiphos Research Day

I’ve just been to Birmingham / Solihull to the Talis Xiphos Research Day. Talis are a Library company transmogrifying into a Semantic Web company and have a great vision of the potential future. Xiphos is a potential product – somewhere between Facebook and Zotero (social networking for research scientists who care about their citations).  They spent 4 weeks writing it of which 3 weeks were arguing. This is excellent practice. It is much more common to have 0 weeks arguing, 3 weeks coding, 1 week having a blazing row, 3 more weeks coding, etc.
Paul has/will/will have/might blog the meeting but I can’t yet find anything. Andy Powell has twittered it (and it may have more links). Andy and Paul are keen twitterers. So are several of my colleagues. They are all waiting for the i-something which will be the next quantum leap for mobile twitterers.
Anyway I’ll wait to see what Paul has said before I make my comments. But Talis have helped us a lot with their “Talis Platform” and I’ll say more later.

Posted in Uncategorized | 1 Comment

How many named chemical entities? Interannotator agreement

In a recent post (Text-mining at ERBI: Nothing is 100%) I set a little problem – asking you to estimate the number of named chemical entities in a piece of experimental text. At the ERBI meeting estimates ranged between 4 and 11. I got three blog answers from 8 to 15 (though the latter was looking for lexical as well as semantic matches, so really 8 to 11).
The main point is that unless the problem and methodology are clearly defined there is no answer, and since I didn’t define the problem I expected a spread. The message is that if you talk about text-mining without defining the methodology then your results are meaningless.
In this exercise we are addressing the need for interannotator agreement (IA). In this, a series of domain experts are given precise rules on how to annotate the text (in this case by marking up the named chemical entities). Since I didn’t tell you what an NCE was, I again expected a spread.
In the SciBorg project, which includes help from RSC, Nature and IUCr, Peter Corbett, Colin Batchelor (RSC) and Simone Teufel (Computer Lab) have produced a set of rules for identifying chemical entities. Not interpreting them into connection tables or looking them up, but identifying them as chemical entities and giving the start and end of the character string. There are several classes (adjectives, enzymes, etc.) but the passages I gave you contained only nouns relating to “chemicals”, designated by CM. (Language-processing folks like abbreviations – “was” is the past tense of “be”, so it is BEDZ.) Here’s the guide:

Peter Corbett, Colin Batchelor and Simone Teufel Annotation of Chemical Named Entities. BioNLP 2007: Biological, translational, and clinical language processing, Prague, Czech Republic.

This is quite precise about what a CM is. Most of them were “obvious” but the guide makes clear that “reaction mixture” and “yellow oil” are not CMs while “molecular sieve” and “petroleum ether” are. “Petroleum ether:diethyl ether” contains 2 distinct CMs. So the answer is 9 unique CMs of which one (petroleum ether) occurs twice.
The process necessarily involved arbitrary boundaries. Some of you might feel that they should be more or less broad. The authors have had to set them somewhere and that’s what they have done. They have had to write them up very clearly (about 30 pages). We hope you feel this is a useful resource.
So if we are now given the guidelines we should all agree 100%, shouldn’t we? Well, Peter and Colin tried it on themselves. They took 14 papers from all branches of chemistry and annotated them. It takes ca. 2 weeks. They did not get 100% agreement – they got 93%. Even though they had written the guidelines. This disagreement is universal. There is no IA of 100% except in trivial tasks. They involved a third party and the tripartite agreement was about 90%.
So if humans can only agree 90% of the time we can’t expect machines to do better. And they didn’t. OSCAR3 has been trained on a similar corpus, with some being used for training and some for metrics. Here we test OSCAR3 against the “Gold Standard” – papers marked up (jointly) by the experts as their best estimate of what the guidelines suggest. OSCAR then identifies the character strings which are CMs. The metrics are harsh. If OSCAR gets “ether” instead of “petroleum ether” it gets a negative mark (not even zero).
There are two measures – precision and recall. Each can be improved at the expense of the other. A single summary metric is their harmonic mean, the F-score. OSCAR gets about 80% (Peter has some novel ways of approaching this using probabilities).
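To make the scoring concrete, a minimal sketch of strict-span precision, recall and F-score (the character offsets are invented for illustration). Note how predicting “ether” where the gold standard has “petroleum ether” costs both a false positive and a false negative:

```python
# Strict span matching: an entity counts only if start, end and string
# all agree with the gold standard. Offsets here are invented.
gold = {(28, 43, "petroleum ether"), (48, 61, "diethyl ether")}
pred = {(38, 43, "ether"), (48, 61, "diethyl ether")}

tp = len(gold & pred)   # exact matches
fp = len(pred - gold)   # spurious predictions ("ether")
fn = len(gold - pred)   # missed gold entities ("petroleum ether")

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```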
So if you hear someone saying “My technique identifies 90% of chemical compounds in text”, ask them “where is your corpus?”, “what is your interannotator agreement?”, “where are your guidelines?” and “how did you compute your metrics?”.
If they can’t answer all 4, don’t believe a word.

Posted in Uncategorized | Leave a comment

New settings for this blog

Because comments were getting lost in the Akismet spam (over 200,000 spams since the start of this blog) we have changed the mechanism to reCAPTCHA, where commenters have to enter simple words. I hope this is not too problematic but please mail me if it is.
One downside is that pingbacks are also disabled so I shall not get automatic notification of blogs about this blog. Since some of the discussions ping between blogs this is a slight handicap but I shall read Technorati frequently.
One interesting thing about reCAPTCHA is that it includes words scanned from books. This means that the commenters digitize parts of books while they use it – a word at a time, but it all helps. Whether this makes a major difference to book digitization I have no idea.

Posted in Uncategorized | 2 Comments

Chemical textmining – 2

In a previous post (Text-mining at ERBI: Nothing is 100%) I asked the readership to suggest how many chemical entities there were in a given paragraph. I intended to use your replies – and comments – to help clarify some of the issues in chemical textmining and make clear what some of the essential procedures are. (We don’t believe everybody does it sufficiently rigorously, though it’s difficult to tell when the methods aren’t public).
There are no tricks. Just to clarify, we are describing the sort of activity that a bio-/chemical indexing organisation might undertake. So the following sentences contain one chemical entity each:

  • all spectra were run in CDCl3.
  • benzene melts at 5 deg. C
  • the flask was flushed with He.

and the following do not:

  • She fed the dog
  • He fed the dog
  • The cat is neither alive nor dead

So please revisit the post (Text-mining at ERBI: Nothing is 100%) and add a comment, minimally a number between 4 and 11. You don’t have to give your name.
Update:
Some people have problems commenting on this blog.
Chemspider has posted (How Many Chemical Entities in a Paragraph of Text) a detailed and thoughtful estimate (10) on his blog. Do you agree?

Posted in Uncategorized | 2 Comments

We have more than good arguments

Peter Suber writes:

22:55 04/06/2008, Peter Suber,
Reed Elsevier 1Q lobbying reached $790,000, Associated Press, June 4, 2008.  Excerpt:

The U.S. unit of Reed Elsevier Group PLC spent $790,000 in the first quarter to lobby…the U.S. federal government on…[many issues including] public access to the National Institutes of Health and more….
In the January-to-March period, the company lobbied Congress, NIH and the Health and Human Services Department, according to the report filed April 17 with the House clerk’s office.

Comment.  All we have are good arguments.

PMR: Peter modestly understates our strength. We have a lot more than good arguments. We have people – like Peter – who create those arguments (arguments require a great deal of work). We have research to support those arguments (arguments without facts are usually very weak). We have a community and a sense of shared purpose. We have people like Stevan who are indefatigable and never sleep and never waver. We have many organisations who give support, hold meetings, coordinate and innovate.
And, when our spirits are low, we have Alma’s calligraphic calendar.

Posted in Uncategorized | Leave a comment