Industry suffers from Closed Data

I received the following unsolicited mail two days ago from a scientist in a major chemical company [I have anonymised everything so you will have to take my word].

I work [in industry] and am very interested in improving our ability
to mine information from chemical documents.  The work you, Peter
Corbett, and the rest of your group have been doing is of great
interest to us.  As you are aware, much of this information is
locked up in proprietary databases that are highly overlapping (but
none are comprehensive), and even after paying licensing fees, the
vendors make it very difficult to execute data mining and data
integration workflows.  This is especially frustrating since the
data is available to everyone in the community, but is not easily
extracted.  So, we pay fees to read the journal, then pay again to
gain limited access to the data in a searchable, structured format
(e.g. [a well known database provider] limits exports to 500 chemical
names).

PMR: I used to work in the pharma industry and nothing seems to have changed. It is clear that the anticommons effect is destroying productivity and innovation. It has been estimated that the UK is 500 million pounds worse off because it charges for its maps. The government gets money back from the Ordnance Survey, but that is much smaller than the cost to – say – local government, new media, travel, etc.
In the same way industry is clearly suffering from the restrictive practices of information vendors. This is one of the reasons I am angry about Closed databases, even if they have a free element. It is clear to me that if Chemical Data becomes Open then we shall be better able to develop approaches to disease. I deliberately did not say “develop better drugs”, because chemistry is only part of the scientific problem. It is essential to include chemistry in one’s knowledge toolkit for understanding disease – humans are chemical objects. All future science-based solutions will include chemistry somewhere in their products and certainly in their means of discovery.
How can we take this forward? This is not the first company to make this point. I would encourage others in industry to come forward (if you mail me I will only post with your agreement – indeed I do this with all unsolicited private mail). I have long said that the chemical industry should raise the pre-competitive level so that common knowledge is made Open. For example, I see companies developing their own internal ontologies for science that is already in the public domain. This is a waste since the effort could be shared, and almost counterproductive since the ontologies will be limited and incompatible. That is why we insist on Openness – it works.
I suspect we shall have to catalyze this somehow – perhaps through a real-life meeting.

Posted in data, open issues

Adding semantic markup with InChI

If we could require all authors to provide machine-readable chemical structures in their chemistry articles the quality of chemistry would increase dramatically and immediately. We could create Open databases immediately, that were machine-searchable (just like crystalEye). No-one doubts that, but who is prepared to make it work?

  1. Richard [from RSC] Says:
    October 15th, 2007 at 10:17 am
    […] Sitting here as a publisher, we don’t have half the power you suggest – we have to satisfy authors (we want the best) as well as readers (to give them the best service), and making submission as easy as possible is an absolute requirement. Demanding InChIs from authors isn’t a realistic option yet – we can show the advantages of this information via the enhanced HTML and work towards it, but compulsion’s an attractive but ultimately futile option. The more you push, the worse data you’ll get.
    The publishers aren’t the problem – it’s because the possibilities of processing and reusing this information have only comparatively recently been apparent, and frankly because most people want to read the text and look at the pictures. As authors and readers are encouraged to look beyond print/PDF it’ll happen, but keeping the data within the publishing process is a community issue rather than a publisher one. We’d love it of course!

PMR: Well, who supports it? From Nascent at nature.com

Lunch with Egon Willighagen

We had lunch yesterday with Egon Willighagen, who in his spare time runs the Chemical Blog space, now situated at http://cb.openmolecules.net/ (running on postgenomic code).
The chat over lunch was pretty good; it turns out that Egon’s favorite molecule might be ascorbic acid. One of the topics that really animated Egon was how to link molecules to academic papers. By this I mean, for example, if you do a search in Google, or in some dedicated search engine, for a molecule, how does your search engine know which papers deal with this molecule? There are a couple of problems with solving this. One is that many different fields use different terminology for molecules, especially as the molecules become large, so a plain text search for the name will not get all of the papers that you might be interested in. Another is that papers don’t have semantic markup of molecules.
One solution to marking up molecules is to use an InChI (an IUPAC International Chemical Identifier). These have been championed by Peter Murray-Rust and there is an extensive InChI FAQ available. The short story is that an InChI is a character string which uniquely describes a chemical substance. From any chemical structure you can generate an InChI.
Peter has a writeup on using InChI in blogs, and if every chemical that appeared everywhere was somehow marked up with its InChI, or the article referring to it tagged with them, then the findability problem would be solved by simple string searching.
OK great, well what’s the problem? For a start there is an alternative system, SMILES (the Simplified Molecular Input Line Entry System), a markdown for molecules if you like. There is a very good description of the syntax here and the KinasePro blog has a short comment on how many people use SMILES vs InChI. The bottom line is that more people use SMILES, but it seems easier to search Google with InChI. I’m not a chemist, but it seems from my naive standpoint that the SMILES syntax is closer to the text description of chemistry that we know from school, whereas the InChI system is more rigorous; it requires one further step of abstraction. It reminds me of the difference between LaTeX for math and MathML. MathML is a hell of a lot easier to write a parser for than LaTeX, as LaTeX can be quite expressive; however, no one writes raw MathML. Scientists are lazy and that extra step of abstraction might be the reason why SMILES seems to be used more frequently at the moment.
Egon suggested as a solution that journals should require papers dealing with chemicals to include InChIs. He said that every tool for drawing chemicals (standard issue for anyone writing a paper on the subject) can now output the InChI with the click of a button. Sounds reasonable, seems easy, but there are problems with this approach. I have heard a few times people say, you are Nature, you can make authors do anything in order to get a paper published so why not get them to do x. Well, for a start, that’s an editorial decision, but even so, making more demands on scientists may not be the best decision when the process of publication is already pretty fraught and stressful. Even if we did this what would that gain? A small selection of the literature would be marked up, but the vast majority of journals in the area would need to follow suit in order to gain full coverage. Of course an argument that we should not do x because other people are not doing x is not what I am getting at here, but rather that this cannot be seen to be a final solution to the problem. Journals are naturally shy of any step that can delay the publication time of an article, and so I am also skeptical that we would see such obligatory requirements. Better, I think, to have this step as a voluntary one. Practically all journals allow supplementary information and I am sure all of them would accept InChI as supplementary information.
Even then one is still left with the vast existing corpora of papers that are already published. Egon points out that no one uses the literature in this area from 50 years ago, as modern techniques have advanced so far that this literature is functionally of little use. The implication here is that 50 years in the future we will only need to go back as far as today’s papers. Even so there has to be a value in seeing the evolution of an idea from its insertion into the literature right through to where it has led today, and Egon agreed with this.
So what can we do now to help make connections between papers and molecules? Peter Corbett, who works with Peter Murray-Rust, is working on automated methods of getting computers to read chemistry papers and output semantic markup of them. Tools like this can begin to fill in the semantic blanks, both for papers from the past and for the current literature. Egon has now created RDF pages for molecules on openmolecules.net. These pages use the InChI in their structure, and now each molecule has its own web page. Egon’s pages check Connotea, and pull from Connotea co-tags of InChI tags (Here is a short description of this). If we work on this a bit more we should be able to set up a system where if you tag a paper with an InChI, that paper could appear on Egon’s pages. We got quite excited about this idea yesterday and are certainly going to discuss this further. It’s a small start, but a start nonetheless.
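The "simple string searching" idea in the post above can be sketched in a few lines. This is only an illustration, not how Connotea or openmolecules.net actually work: the paper identifiers and the function names `build_index` and `papers_for` are invented here, and the two InChI strings use the later "standard InChI" prefix (`1S`), not the `1` prefix in use when the post was written. The point is that because one structure gives one InChI, an exact string match is enough to find every tagged paper, whereas the same molecule can be written as several different non-canonical SMILES strings (ethanol as `CCO`, `OCC` or `C(O)C`).

```python
# Hypothetical tag index: InChI string -> papers tagged with it.
# Paper identifiers are invented for illustration.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"


def build_index(tagged_papers):
    """Map each InChI tag to the set of paper identifiers carrying it."""
    index = {}
    for paper_id, inchis in tagged_papers.items():
        for inchi in inchis:
            index.setdefault(inchi, set()).add(paper_id)
    return index


def papers_for(index, inchi):
    """Exact string lookup: return the sorted papers tagged with this InChI."""
    return sorted(index.get(inchi, set()))


tagged_papers = {
    "paper-A": [ETHANOL],
    "paper-B": [ETHANOL, BENZENE],
    "paper-C": [BENZENE],
}
index = build_index(tagged_papers)
print(papers_for(index, ETHANOL))
```

A lookup for an InChI that nobody has tagged simply returns an empty list, which is the coverage problem the post goes on to describe.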
PMR: and Richard replies:
Getting InChIs out from the chemical drawing is easily done now, but I don’t think there will be a realistic way to get them into the authoring process until the tools offer a robust way to keep the InChIs in the right place (and validated). Certainly it’s not a burden we could currently expect of the majority of authors, which is why RSC Project Prospect relies on a combination of text mining and input by skilled technical editors. It’s quite hard to do in practice, but it’s worth it when you see the results which won us the ALPSP/Charlesworth Publishing Innovation award this year. The InChIKey should help to promote acceptance and use as Tony suggests, along with common treatment of these standards across publishers.

PMR: so there seems to be a can-do in biology that is missing in chemistry. So let’s float a revolutionary idea for capturing biology in articles. Let’s start with protein sequences. Now these are complex molecules with lots of atoms, so we’ll make them simpler. We’ll call one group of atoms “A”, another “C” and so on to “Y”. We’ll just use 20 letters.
This will be very very difficult for biochemists. They haven’t had nearly as long as chemists to learn informatics and their molecules are much larger. Insulin has hundreds of atoms – ten times larger than most common molecules and it’s one of the smallest. But even so it is too long to fit on the page:
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
so we’ve had to break it in the middle.
Unfortunately that’s much too difficult for a biochemist to include in a paper and anyway it’s meaningless. You can’t understand it.
Well it was worth a try. It could actually revolutionise biology and perhaps create something we could call bio-informatics (similar to chemo-informatics).
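To make the point concrete, here is a minimal sketch of what a machine can do with the one-letter string above as soon as it appears in a paper. The function names are invented for illustration and nothing here comes from an existing bioinformatics library; it is plain string handling.

```python
from collections import Counter

# The 20 one-letter amino acid codes ("A", "C" and so on to "Y").
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# The two lines printed in the post, joined back into one string.
SEQUENCE = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED"
    "LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)


def is_valid_protein(seq):
    """True if seq is non-empty and uses only the 20 one-letter codes."""
    return bool(seq) and set(seq) <= AMINO_ACIDS


def composition(seq):
    """Count how often each residue letter occurs."""
    return Counter(seq)


def find_motif(seq, motif):
    """Return the 0-based start of every occurrence of motif (overlaps allowed)."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


print(len(SEQUENCE), is_valid_protein(SEQUENCE))
print(composition(SEQUENCE).most_common(3))
print(find_motif(SEQUENCE, "CC"))
```

Validation, counting and searching all fall out of ordinary string operations, which is the "can-do" the paragraph above is teasing chemistry about.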

Posted in chemistry, open issues

What does "Open Access" mean

Stevan Harnad is one of the founders of the OA movement and has tirelessly promoted the idea of Green and Gold OA. I applaud and support Stevan’s achievements. However I find and argue that Green Access does not give the scientific constituency the rights it needs – particularly in data-driven science and cyberscholarship. My concern is that while it may be possible for a human to read an article, they may be prevented from doing other things such as indexing or repurposing. Stevan takes the view that logically you can do everything you want with Green; I take the view that in practice you can’t. Here’s Stevan, then Peter Suber, then my comments

17:20 15/10/2007, Peter Suber, Open Access News
Stevan Harnad, Re-Use Rights Already Come With the (Green) OA Territory: Judicet Lector, Open Access Archivangelism, October 14, 2007.

Summary: Not one [Peter Suber], not two [Robert Kiley], but three [Peter Murray-Rust] of my valued OA comrades-at-arms have so far publicly registered their disagreement with my position on “re-use” rights. Here is my summary of the points at issue: Judicet Lector.
Individual re-use capabilities: If a document’s full-text is freely accessible online (OA), that means any individual can (1) access it, (2) read it, (3) download it, (4) store it (for personal use), (5) print it off (for personal use), (6) “data-mine” it and (7) re-use the results of the data-mining in further research publications (but they may not re-publish or re-sell the full-text itself: “derivative works” must instead link to its URL).
Robotic harvestability: In addition, (8*) robotic harvesters like Google can harvest and index the freely available Web-based text, making it boolean full-text searchable. (9*) Robotic data-miners can also harvest the full-text, machine-analyse it, and re-use the results for research purposes (but they may not re-publish or re-sell the full-text itself: “derivative works” must instead link to its URL).
OA is about access and use, not re-publication or re-sale: Online re-publishing or re-sale rights were never part of OA, any more than on-paper re-publishing or re-sale rights were; nor do they need to be, because of all the capabilities that come with the free online territory.
The Green OA territory: Capabilities (1)-(9*) all come automatically with the Green OA territory. Hence there is no need to pay for Gold OA to have these capabilities, nor any need for further re-use rights beyond those already inherent in Green OA. Sixty-two percent of journals today already endorse immediate Green OA self-archiving.
Gold OA includes Green OA: If you do elect to pay a publisher for Gold OA, you also get the right to deposit your refereed final draft [“postprint”] in your own OA Institutional Repository. Hence even here there is no need for further “re-use rights.” (If you pay for “Gold OA” without also getting this Green OA, you have done something exceedingly foolish.)
“Harvesting rights”? If authors self-archive their articles on the web, accessible freely (Green OA), then robots like Google can and do harvest and data-mine them, and have been doing so without exception or challenge, for years now.
What about Gray publishers? With Gray publishers (i.e., neither Green nor Gold) the interim solution today is (i) Immediate Deposit (IDOA) Mandates, (ii) Closed Access deposit for Gray articles, and (iii) reliance on the semi-automatized “Email Eprint Request” (“Fair Use”) Button to provide for individual researchers’ usage and re-usage needs for these Gray articles during any Closed Access embargo interval (but note that the Fair Use Button cannot provide for robotic harvesting and data-mining of these embargoed full-texts).
Extra Gold OA rights? For those articles published in the 38% of journals that are still non-Green today, I think that to rely on (i)-(iii) above is a far better interim strategy for attaining 100% OA globally than to pay hybrid Gray/Gold publishers for Gold OA today. But regardless of whether you agree that (i)-(iii) is the better strategy in such cases, what is not at issue either way is whether Gold OA itself requires or provides “re-use” rights over and above those capabilities already inherent in Green OA — hence whether in paying for Gold OA one is indeed paying for something further that is needed for research, yet not already vouchsafed by Green OA.

Comments. I hope no one minds if I reprint my comments from June 12, 2007, in which I responded in detail to a very similar post by Stevan:

  • Stevan isn’t saying that OA doesn’t or shouldn’t remove permission barriers. He’s saying that removing price barriers (making work accessible online free of charge) already does most or all of the work of removing permission barriers and therefore that no extra steps are needed.
  • The chief problem with this view is the law. If a work is online without a special license or permission statement, then either it stands or appears to stand under an all-rights-reserved copyright. The only assured rights for users are those collected under fair use or fair dealing. These rights are far fewer and less adequate than OA contemplates, and in any case the boundaries of fair use and fair dealing are vague and contestable.
  • This legal problem leads to a practical problem: conscientious users will feel obliged to err on the side of asking permission and sometimes even paying permission fees (hurdles that OA is designed to remove) or to err on the side of non-use (further damaging research and scholarship). Either that, or conscientious users will feel pressure to become less conscientious. This may be happening, but it cannot be a strategy for a movement which claims that its central practices are lawful.
  • This doesn’t mean that articles in OA repositories without special licenses or permission statements may not be read or used. It means that users have access free of charge (a significant breakthrough) but are limited to fair use.

Update. I’ve often pointed out that the BBB definition of OA requires the removal of permission barriers, not just the removal of price barriers, and I stand by that. Klaus Graf has just collected some of my past statements to this effect along with some of his own. (Thanks, Klaus.)

PMR: The arguments from both are very clear. My own position is that in practice I am forcibly prevented from following through Stevan’s logic. I have described (Indexing Open Access and Free Access articles) how Chemspider (quite appropriately in my opinion) indexed (not copied, except temporarily) articles from the Royal Society of Chemistry labelled as Free Access. The RSC required Chemspider to remove all the links (it was only links, not copies) from the Chemspider site. I cannot see the logic of it, nor can I see the legality, nor can I see the business sense (spammers go to great effort to get links to their sites!).
OK, the articles were not “Open Access”; they were “Free Access”. But there is no guarantee that an Open Access publisher will not do the same. And remember that the publisher has greater powers. I have twice caused publishers to cut off the whole University of Cambridge for things that were completely legal but to which they took exception. In neither case was any notice given. Some people may think I go over the top in my emotion – but sometimes there is provocation.
This spills over to librarians. I have met many librarians in the last 2-3 years and have not met one who wouldn’t defer to some potential copyright infringement even if more imagined than real. If the copyright holder was long dead they would still enforce copyright procedures “just to be on the safe side”. The safe side cripples digital science.
So with Green OA and that Gold OA that carries the publisher’s copyright the default will be “we can’t afford to violate copyright so you can’t legally use it”. It does not assert enough rights to be effectively useful.
As for indexing I don’t understand at all. Google indexes these papers but we are not allowed to. If you type into Google (not Google Scholar):
“CdSeS nanocrystals”
you will get:

Chemical Communications. London, 2003; (24)

High quality CdSeS nanocrystals synthesized by facile single injection process and their electroluminescence / Jang, Eunjoo / Jun, Shinae / Pu, Lyongsun
www.ucm.es/BUCM/compludoc/W/10312/13597345_3.htm – 46k – Cached – Similar pages
So Google has indexed this Free Access article. They haven’t been told to take it down. Why? Perhaps they are too powerful – or maybe they pay RSC. I don’t know.
If everything was Gold there would be no problem.

Posted in data, open issues

Oh Dear … Patent on Name2Structure conversion

Chemspider has reported a new patent which claims the conversion of chemical names to structures. (BTW I am genuinely grateful for this post, as for several of the others). He writes:

Name to Structure Conversion – and What One Little Patent Might Do…
Those of you watching this blog will likely have seen multiple conversations by myself regarding the conversion of chemical names to chemical structures. There are a number of commercial products on the market performing this conversion including those of ACD/Labs, Cambridgesoft, OpenEye, Cheminnovation and ChemAxon (soon). There may be others.  Also, there are now efforts going on in academia.
Last week while searching for some information in the patent database I happened across an interesting patent.
The title and lead in is listed as:

Method, system, and software for deriving chemical structural information
A method and a system are provided for deriving chemical structures from chemical names. Chemical name fragments are grouped into a number of classifications. The method and the system handle new and old chemical names, including names for organic and inorganic substances.

My interpretation of this patent is that this is for the conversion of Chemical Names to Chemical Structures (I am not a patent lawyer though). The patent was granted to Jonathan Brecher of Cambridgesoft as listed here. The patent was granted in 2006.
As a product manager and CSO at ACD/Labs I managed the Name to Structure functionality in their nomenclature software. There was a LOT of prior art when this patent was applied for, in my opinion. Products might not have been on the market but certainly a number of companies had such capabilities. This will be interesting to watch….

PMR: This is very depressing. It’s a classic example of the tragedy of the anticommons – where over protection of IPR leads to nothing for anybody. I have not read all the patent (Method, system, and software for deriving chemical structural information), but here are some bits:

  • Uncommon characters of chemical significance are spelled out using common characters, so that, for example, the character “µ” is changed to “mu”.
  • Also during the preprocessing, if the name or a portion of the name has been submitted in inverted form (e.g., “acetic acid, 2-hydroxy-“), the name or portion is converted to its uninverted form (e.g., “2-hydroxyacetic acid”)
  • The input name is analyzed to mark all potential name fragment boundaries (step 2010). In a specific embodiment, the mark used is an @ sign, which is rarely used in chemical names. In another embodiment, it may be advantageous to use a non-printing character such as control-A (ASCII value 1) that has effectively no chemical significance.
  • The buffer is scanned for any single one of the characters “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”, “?”, or an apostrophe, that is immediately followed by any number (i.e., including zero) of the characters “]”, “)”, “}”, or “h”, in any order, but that is not preceded by the character “d”. If such a sequence is found, any @ sign that immediately follows the sequence is converted to a comma, so that, for example, “1h@3h@5h@2@4@6-pyrimidinetrione” is properly converted to “1h,3h,5h,2,4,6-pyrimidinetrione”.
  • The buffer is scanned for an @ sign immediately preceding any number of periods, where such periods (if any) precede either i) any single one of the characters “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”, “?”, “n”, “o”, “p”, “s”, “N”, “O”, “P”, or “S”; or ii) any of the text strings “ortho”, “meta”, or “para”. If such an @ sign is found that is preceded by any number of apostrophes or periods, which are preceded by any one of the strings “ortho”, “meta”, or “para”, the @ sign is converted to a comma.
  • The name is divided into the smallest number of meaningful fragments of a maximum length. For example, “pentane” is not divided into three fragments “penta”, “n”, and “e”, since the latter two fragments would not be meaningful, but rather is divided into two meaningful fragments “pent” and “ane”. In a specific embodiment, a fragment is determined to be meaningful (“recognized”) if an exact match for the fragment is found in a dictionary of known text strings (“lexicon”) that is maintained by the system.
  • The locant map associates names of individual atoms with respective specific locations in the connection table. For example, an atom named “2” in “2-hydroxy-propanoic acid” may be a specific one of the carbon atoms, and a “3” atom may be a different one of the carbon atoms. Multiple locants can refer to the same atom: “beta” may refer to the same atom as did “2” above.
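To show how ordinary these steps are, here is a minimal sketch of three of the quoted bits: un-inverting a name, turning `@` boundary marks after locants into commas, and splitting a name into the fewest lexicon fragments. It is written only from the wording quoted above. It is not Cambridgesoft's implementation, nor OSCAR/OPSIN's, and the function names and the tiny `LEXICON` are invented for illustration.

```python
import re


def uninvert(name):
    """Handle only the simplest inverted form: 'head, prefix-' -> 'prefixhead'."""
    head, sep, tail = name.partition(", ")
    if sep and tail.endswith("-"):
        return tail[:-1] + head
    return name


# A digit, '?' or apostrophe, then any run of ']', ')', '}' or 'h',
# not preceded by 'd', and immediately followed by the '@' marker.
_LOCANT_THEN_MARK = re.compile(r"(?<!d)([0-9?'][\])}h]*)@")


def marks_to_commas(buffer):
    """Convert an '@' that directly follows such a sequence into a comma."""
    return _LOCANT_THEN_MARK.sub(r"\1,", buffer)


# Hypothetical toy lexicon; a real one would hold thousands of entries.
LEXICON = {"meth", "eth", "prop", "but", "pent", "penta", "hex",
           "ane", "ene", "yl", "ol"}


def fragment(name, lexicon=LEXICON):
    """Split name into the fewest lexicon entries; None if it cannot be split."""
    best = [None] * (len(name) + 1)  # best[i] = fewest fragments covering name[:i]
    best[0] = []
    for end in range(1, len(name) + 1):
        for start in range(end):
            piece = name[start:end]
            if best[start] is not None and piece in lexicon:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(name)]


print(uninvert("acetic acid, 2-hydroxy-"))
print(marks_to_commas("1h@3h@5h@2@4@6-pyrimidinetrione"))
print(fragment("pentane"))
```

With this lexicon “pentane” can only come out as “pent” plus “ane”, because “penta” leaves “ne”, which is not a recognised fragment. That is the same reasoning as in the quoted example.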

PMR: Non-chemists may regard it as a non-obvious invention that “pentane” should be broken down into “pent”- and “ane”. Since this is what we teach our first-year students it is clearly non-obvious (or we wouldn’t need to teach them). But there is a tiny possibility that it is prior art. After all we (and others) have been teaching the students this for over 100 years.
Now I know that prior art doesn’t matter to the USPTO and you can patent almost anything. I suspect that if you proposed identifying a chemical by giving a set of characters called a n-a-m-e you’d be allowed to patent that.
I haven’t read the patent (it requires microfiche). I don’t know how Cambridgesoft’s name2structure software works but I suspect it is probably close to this patent. However I suppose a competent lawyer could claim that any name-2-structure software infringes this patent. That means you have to go to court to defend it.
So what does it mean for OSCAR/OPSIN? Not the slightest idea. We haven’t taken any of Cambridgesoft’s trade secrets and we don’t use @ signs to separate spaces. (I have to say I have used tildes (~) in the past and that is next to the @ key). But I have regarded pentane as being split pent-ane and so does OSCAR, so we are clearly infringing. (OK we did this in 2003, but does that count?).
If you have a culture of patenting the obvious and fundamental then you destroy it.

Posted in open issues

Indexing Open Access and Free Access articles

I reported that Chemspider had been asked to take down indexes of scientific articles (based presumably on chemical names) and stated that I did not think this was reasonable. (My language was probably rather more heated – I shall choose words carefully in the near future.) The position now seems to be clear:

  1. ChemSpiderMan Says:
    October 15th, 2007 at 6:16 pm
    Peter…fyi regarding the situation: Visit http://www.chemspider.com/open-chemistry-web/?p=4
    The post is given here for you too.
    Agreement Reached between ChemRefer and RSC
    We have been requested to remove all RSC articles from the ChemRefer Index.
    The articles in question, from 1997-2004 are marked as ‘Free access’ and, these being indexable according to the robots.txt file, formed the basis of the current indexing. The RSC are unhappy at the way their articles have been presented and linked to in our search results, and consider that the additional intended reuse of the indexed information in ChemSpider without permission violates the terms of use.
    RSC will reconsider the indexing policy for ChemRefer if requested changes are made to the search results and we are presently in discussions with the RSC to identify and execute on these modifications. All RSC articles will be de-indexed from ChemRefer during the next indexing cycle.
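The comment above says the articles were "indexable according to the robots.txt file". As a minimal sketch of what that check looks like in practice, Python's standard library can parse robots.txt rules and answer whether a given crawler may fetch a URL. The rules, the crawler name and the URLs below are made up for illustration and are not the RSC's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ExampleIndexer", "http://example.org/articles/1"))
print(allowed(ROBOTS_TXT, "ExampleIndexer", "http://example.org/private/draft"))
```

Note that robots.txt is only a technical convention for crawlers. As the exchange above shows, a publisher may still object on the basis of its terms of use even when robots.txt permits fetching.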

PMR: thank you for this. I now comment without emotion.
From what I can see the RSC has made copies of its older articles (ca. 2-3 years) “Free Access”. This means they can be read on the Web. They are not labelled Open Access and are copyrighted in the original manner (“This journal is © The Royal Society of Chemistry 2003”). The Permission box states:

Material in RSC and other publishers’ publications is subject to all applicable copyright, database protection, and other rights. Therefore for any article, whether printed or electronic, permission must be obtained to use material for which the author(s) does not already own the copyright. This material may be, for example, a figure, diagram, table, photo or some other image. Note that permission is not needed to re-use your own figures, diagrams, etc, which were originally published in an RSC publication. However, permission should be requested for use of the whole article or chapter.

PMR: This does not indicate whether or how the material may be indexed. I make the following observations:

  • The term “Free Access” is used although it is unclear whether this is simply an incentive to read the paper or describes types of access and re-use.
  • The RSC uses the term “Open Science” to describe its author-pays hybrid

I am not surprised there is confusion, which stems from a lack of clarity. It would be useful to know when and how publishers wished to have links to their articles; it seems a pity to have to take all the links down again. I would have thought that publishers would welcome pointers to their papers.
I asked the general question and asked if Peter Suber would give his best shot at an opinion.

  1. Peter Suber Says:
    October 15th, 2007 at 4:09 am
    Hi Peter. You asked me to comment “on what it is legal to index without publishers’ permission. And what it is reasonable to expect from someone who labels themselves an Open Access publisher.” I’ll give it a try.
    On the first: I wish I knew. Some book publishers are suing Google for indexing their copyrighted books without permission. The case has not come to trial and law professors disagree in their predictions of the outcome. On the publishers’ side is the fact that Google has to make copies in order to create its index, and it makes these copies without the copyright holder’s permission. On Google’s side is the fact that it is not distributing the copies, but only distributing fair-use snippets. I would guess that the kind of indexing you have in mind is even more lawful than Google’s (however lawful that is), since you only distribute uncopyrightable facts. But I just don’t know whether your kind of indexing has ever been tested in court.
    On the second: While the term “open access” has a clear public definition, not every journal describing itself with the term lives up to that definition. Some of them state clearly what they have in mind, which is fair even if we wish they would use the public definition instead. The hybrid OA journals are usually specific about what they offer in exchange for a publication fee, and they usually avoid the term “open access”; they should be held to the terms of their offer. (I’ve appreciated your blog posts monitoring hybrid publishers and calling out those who are not living up to their own terms.) If a journal uses the term without saying what it has in mind, then I think it would be fair to assume it means the BBB definition. I don’t know whether the journal would be legally bound to live up to the BBB definition, but nor do I know how the publisher could complain if users took the term to mean what its public definitions say it means.
    Peter Suber

PMR: Many thanks, Peter. It is a great pity that the current situation is so messy. I cannot see who benefits anywhere. I take the simple-minded Web 2.0 view that the more exposed something is the more value it accretes. I am not advocating the misappropriation of electronic content, simply creating an electronic index of the facts within it.
Unless someone tells me otherwise it is legal to make an index of a printed book. I can, for example, list the names of characters in Harry Potter without infringing. If I legally buy a copy on CDROM can I make an index from that? If I read the RSC articles on the web surely I can type up a list of the melting points (facts) in the article? But if I use a machine to reduce the labour I cannot do this?

Posted in open issues | Leave a comment

Open-Data-driven science and a brokering system for ONS

Cameron Neylon and Jean-Claude Bradley have blogged about a directory of Open Notebook Science (ONS) where projects including this approach can register.

21:19 14/10/2007, Cameron Neylon,
As has been flagged up by Jean-Claude Bradley there are a couple of places now where people can sign up to say that they have Open Notebook Science in their laboratory, practise Open Notebook Science, or even would like to find a place where they can keep an Open Notebook. Jean-Claude has put a list on the Nodalpoint Wiki and I have set up a database at DabbleDB. DabbleDB is a rather cool web based database system that provides free access as long as you make the database contents freely available. Because the data is completely open I am not asking for people’s email addresses.
If you want to be included in the database you can put your details in on the form here. This will allow anyone to re-use the data (which you can find here) to generate lists on appropriate web-pages, or maps or any number of other nice re-uses of the data. If you are interested in the working of the database give me a yell and I can give you admin access.

PMR: As soon as we start to get the results of the NMR calculations on NMRShiftDB we’ll put them up, but I don’t want to register this before we have actually started (I have seen too many empty web pages in my career and I don’t want to leave them myself.) So we all have to be a little patient.
But then I thought that CrystalEye is an ideal resource for data-driven science. I’ve blogged about how crystal-data-driven research started in the mid-1970’s but there is a great opportunity to use CrystalEye in new ways. Unlike the Cambridge Data Centre the data includes inorganic structures. The software is modern and extensible and it should be economic to develop many new applications.
CrystalEye is, of course, OpenData (we use the OKFN licence at present) and anyone can download it (we are still working out how to implement APP – Atom Publishing Protocol – to make this easy). But we’d also love to explore collaborative projects. We have all the data and software here so you don’t have to set it up. Crystallographic data makes good undergraduate, Master’s and PhD projects – Egon should know. So if you – or your collaborator/students/supervisor/whomever – are interested in using this data, perhaps we could explore this on the Wiki.

Posted in blueobelisk, data, open issues, open notebook science | 6 Comments

ODOSOS and an article on OA

Egon reminds us of the importance of the intensity of purpose that we need in the Blue Obelisk. (ODOSOS is our mantra: Open Data, Open Source, Open Standards). I won’t add very much new to that but I’ll also add and contrast OA.


I value ODOSOS very high: they are a key component of science, and scientific research, though not every scientist sees these importance yet. I strongly believe that scientific progress is held back because of scientific results not being open; it’s putting us back into the days of alchemy, where experiments were like black boxes and procedures kept secretly. It was not until the alchemists started to properly write down procedures that it, as a science, took off. Now, with chemoinformatics in mind, we have the opportunity to write down our procedures in high detail.
I keep wondering what the state of drug research would be, if the previous generation of chemoinformaticians would have valued ODOSOS as much as I do. Now, with a close relative being diagnosed last week with a form of cancer with low five-year survival rates, I can not get more angry about those who want to make (unreasonable) money by selling scientific research. A 1M bonus is unreasonable. I can have 10 post-docs work on chemoinformatics research for the same period; I can have them work on drug design for various kinds of cancer.
Therefore, I will continue to use every opportunity to convince people of ODOSOS, and will continue to develop new methods to improve accurate exchange of scientific data and experimental results. I will help people where I can to distribute open data, even if the whole project is not 100% ODOSOS. For example, the Chemistry Development Kit is open source itself (LGPL) which does allow embedding into proprietary software. This does not mean that I will contribute to the proprietary software, and actually am proud not having done so in the last 10 years.
I will continue to advice people how to make their work more ODOSOS, even if they cannot make the full transition. I will also continue to make sure that all my scientific results are ODOSOS, as there is no other kind of science. To set a good example, and, hopefully, to lead the way.
This is why I am a proud member of the Blue Obelisk.

PMR: I have had exactly these thoughts today and I’d like to ask for some literature help.
I have been invited to write an article on Open Data for a closed access journal, Serials Review – Elsevier which has a special issue every so often (ca. 4 years) on Open Access. I normally accept such invitations (assuming it’s on something I want to write on) and this one is important …

Serials Review (v.30, no.4, 2004) was a focus issue on Open Access. It remains one of the most heavily downloaded issues and articles even now. Open Access remains a “hot topic” and fundamental discussion in scholarly communication.

I’m not sure who has also accepted but the invitees are well known in the area.
I have taken my subject “Open Data in Science”. I intend to make exactly the case that Egon has made, that Closed anything usually disadvantages the human race.
In the Blue Obelisk we did not include Open Access, because it wasn’t – and isn’t – central to our activities. We are – I suspect – largely in favour but are forced to publish in Closed access journals because of the conservatism of chemists. We make our protests regularly and ritually – the technical editors know us well for the requests to mount stuff here, add addenda there, etc.
So I started through the disciplines – astronomy is open, chemistry is closed, biology is open. And I thought – if the bioscientists had been as selfish as the chemists we wouldn’t have genomes, we wouldn’t know how HIV works, we wouldn’t have the ribosome structure, we wouldn’t understand amyloid. Back in the mid 1990’s there was a movement to patent ESTs (bits of the genome). I’d be grateful for chapter and verse but essentially Craig Venter wanted to patent these (I know patents are yet another concern) but in 1995 the pharma company Merck donated all its ESTs to the public good. This was typical of the concern about locking up IP.
I’m not sure when journals started to permit and then to require that authors publish their protein and nucleic acid sequences – I remember it as the late 1980’s. But it’s now mandatory. Earlier the pioneers of bioinformatics, e.g.

*Needleman SB, Wunsch CD. (1970). A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48:443-453.

but also Bill Pearson (who’s here in Cambridge for a year and I met last week), Russell Doolittle, David Lipman, and Margaret Dayhoff (called the “founder of bioinformatics” by Lipman). They showed that the mechanical comparison of sequences was an incredibly powerful tool in understanding the function of proteins and genes, of modelling evolutionary processes, including viral mutations. This technique (and many variants) is at the heart of modern molecular bioscience.
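For readers who have not met it, the kind of "mechanical comparison" in the Needleman and Wunsch reference above can be sketched in a few lines. This is the textbook dynamic-programming form of global alignment with a linear gap penalty, not a transcription of the 1970 paper; the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen only for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b.

    Scoring values are illustrative assumptions; real work would use a
    substitution matrix and more realistic gap penalties.
    """
    # prev[j] is the best score for aligning the prefix of a seen so far
    # with b[:j]; the first row aligns the empty prefix against gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against nothing but gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] against a gap
                curr[j - 1] + gap,   # b[j-1] against a gap
            ))
        prev = curr
    return prev[-1]
```

Calling `needleman_wunsch_score("GATTACA", "GATTACA")` gives 7 under these assumed scores, one point per matching position. The point of the sketch is that the method needs nothing but the sequences themselves, which is why open access to them mattered so much.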
NONE OF THIS WOULD HAVE BEEN POSSIBLE IF SCIENTISTS HAD NOT BEEN ABLE TO HAVE ACCESS TO THE WORK OF OTHERS, ROUTINELY AND WITHOUT EXPLICIT PERMISSION.
That is why I and Egon feel so angry when information is less than Open. Without Open information people die. So does our planet.
The technology is here. If we wished we could make every new piece of chemistry Open within a year. How much value would that be in finding new chemistry to use in the service of humanity?
[PS. I’d be grateful for any pointers as to how bioinformatics became free. Are there any lessons there for trying to change the chemists’ mindset?]

Posted in blueobelisk, open issues | 1 Comment

Fun graph

I love Rich Apodaca’s idea of “name that graph” (example). I am not competing, but just occasionally a bit of fun:
slashdot1.png
Shouldn’t be hard

Posted in fun, puzzles | 8 Comments

ACS: Why it matters

I have posted as an outsider why I am concerned about the current state of governance at ACS, particularly with regard to truth and integrity in the scholarly process. You may ask “why is this Brit slagging off a society in a different country of which he is not a member? Or why doesn’t he join and change it by democratic processes?”. I have explained my position here. I have tried to pick up signals from correspondents and the blogosphere, and I have to be VERY careful not to reveal my correspondents. First I give an amalgam of replies and then I comment on what seems to be the most serious allegation and what I think the ACS needs to do if it is to retain respect.
To reiterate very briefly, the key role of a learned society is to represent its members but with a wider responsibility to the national and international communities. It must be transparent, it must act justly and impartially and it must be seen to be doing so. It must uphold the basic tenets of scholarship. It may be involved in professional certification of people, courses or procedures. In all of this it governs only if it is seen to command respect from the chemical and wider community.
As background I restate that the chemical community generally does not care about any of this.
Here are some private quotes from ACS members (anonymized) [I have not deliberately omitted positive comments – there were none, but that is probably not a surprise]:

“By the way, I enjoyed your blog piece on the ACS.”
“The basic problem is the ACS management/staff is not accountable to anyone at the end of the day.”
“I believe that many within ACS do support things like open access, but they are drowned out by people who are very frightened by the prospect of losing the CAS cash cow”
“I almost posted a comment, but decided against it because I’m [… quite active in ACS…]”

and from the ChemBlog:

I’m far more incensed that the ACS isn’t a transparent organization …
I am a lover of open access, but I’m not so sure I can demand that those capitalist pigs, hogging the peoples’ science for themselves, give it to us for free since it was already done using the tax dollars.  […]  I think it’s an obvious concern, but the ACS doesn’t read these blogs anyway.

PMR: … many of the staff do read them – the blogosphere has power.

In order of increasing dubiousness:
  • CAS (through ACS) lobbied the government to have Pubchem effectively shut down. The words may indicate slight differences but the public intent was clear. This was done with no explicit support from the membership and this makes it particularly unacceptable for a learned society to maintain a monopoly in this manner.
  • ACS officers have bonuses that depend on publication income. This, in itself, is not insurmountable, but it needs careful independent oversight. However it seems increasingly clear that “PRISM” is largely, if not wholly driven by the ACS and that its statements – known untruths – are made deliberately to mislead the wider community. There is now widespread comment that this is for the personal benefit of the officers. This is unacceptable for a learned society.
  • Industrial lobbying. I was taken aback by Paul Thacker’s allegations of corporate lobbying and his allegation of the ACS’s subsequent suppression of the normal scientific process. I make it quite clear that these are currently simply allegations and I have no independent knowledge of any of the issues. I post below some extracts from his article:

Thacker: But I believe that what lead me to resign last September probably was set in motion months earlier. In February 2006, Bill Carroll, an executive with Occidental Chemical, called some of the society’s publishing executives to complain about my reporting.
The American Chemical Society is a nonprofit that is run by an elected board and Bill Carroll was the president. Because of Carroll’s call, my editor, Alan Newman, had to defend me to his bosses. In a three-page letter, Newman, responded to Carroll’s characterization of my reporting as “anti-industry” and “liberal,” and that my articles were “not news” but just “muckraking.” Specifically, Carroll had cited my articles “Hidden Ties” and “The Weinberg Proposal.”
In the first article, I documented a hidden campaign by industry lobbyists and the PR firm Pac/West Communications to undo the Endangered Species Act. Pac/West had previously run a multi-million dollar covert public relations drive to pass President Bush’s Healthy Forest legislation in 2004.
The article on the Weinberg Group, a product defense firm, grew out of a letter written by the Weinberg Group to DuPont that I discovered in EPA’s docket on PFOA, a chemical used to make Teflon and other non-stick products. In this letter, the Weinberg Group detailed a campaign they hoped to organize for DuPont to protect them against lawsuits and federal regulations on PFOA. The Weinberg Group suggested creating studies to show that PFOA was not only harmless but actually beneficial and offered to find expert scientists that could help DuPont to prove this.
Newman bristled at Bill Carroll’s attack on my reporting and ended his memo to the ACS publishing executives by saying he was deeply troubled that some individuals feel that they can “go to the top of ACS” as their way to respond. “This is not a genuine attempt to engage in an open and transparent conversation on issues of national importance,” he stated.

Newman added that we had tried to be transparent in our reporting, posting interviews and documents with the story. He ended by saying he stood behind the stories and they had revealed valuable information to the environmental science community.

In reply

ES&T, ACS officials respond:
The policy of ACS, as expressed in the ACS governing documents, clearly prohibits interference in editorial decisions by anyone on the staff of the society or in its governance structure. Editors of ACS publications exercise complete control over the content of their journal or magazine. Any suggestion by Paul Thacker to the contrary is entirely without merit. Britt Erickson and I were uniformly unimpressed with Paul’s journalistic skills, and we told him so. We said that, especially on his investigative stories, he needed much more editorial supervision than ES&T had the resources to devote to him. We did not tell Paul that he could no longer work on such stories, only that he needed prior approval to work on them. As to the specific case of the story on the Weinberg group, it was a hatchet job and running the transcript was embarrassing to Paul and ES&T because Paul’s questions were almost incoherent.

Rudy M. Baum, Editor in Chief, Chemical & Engineering News

Bill Carroll, former ACS president, wrote to say he did not interfere in the ES&T editorial process, but did question editors about whether the stories were more appropriate for Chemical and Engineering News, another ACS publication, because the stories were critical of industry. Carroll added that he chaired the compensation committee but it does not evaluate or award bonuses to editorial employees.

PMR: The matter is summarised by a campaigning group SourceWatch (a Wikipedia-like community). I repeat that I have NO idea whether Thacker’s allegations have any substance. I have met Rudy Baum in the past and I have tried to understand his viewpoint that the NIH is a socialist organisation which is “hell-bent on imposing an “open access” model of publishing on researchers receiving NIH grants. [This] action will inflict long-term damage on the communication of scientific results and on maintenance of the archive of scientific knowledge.”
However I find that the ACS has ever fewer supporters outside its doors and ever more detractors. When I move in the – admittedly woolly liberal – arena of digital scholarship the ACS is often mentioned among the most illiberal organisations and the one that causes most problems. The Pubchem and PRISM affairs have damaged it deeply.
If Thacker’s allegations are true, that is very serious. If they are not true, then the ACS should investigate them publicly and demonstrate their falsehood. That cannot easily be done by officers whose actions and motivations are increasingly in question; it would require external investigators.
I do not intend to write further on this issue in the near future unless new information comes along.

Posted in open issues | 1 Comment

OPSIN/OSCAR: you + us = we; please help

I’m exploring how you and we may be able to work to improve OSCAR and OPSIN. Even if you aren’t interested in chemical names, you may find the general principles useful.
One of the drawbacks of full Open Source and Open Access is that you don’t always get feedback on whether what you are doing is appreciated. There is a small measure of downloads, accesses, and so on but you can only guess at the motivation and followup. When we go to meetings people (often in industry) say things like: “Oh we use OSCAR for identifying compounds and it’s very useful”, “We started using InChI and it’s saved us lots of money”, etc. But unless someone actually tells us what they are doing and what they want we don’t know and can only guess.
Sometimes we get letters of support, and some explore whether they can offer help. I may be able to expand later, but here are a number of positive things you can do:
(a) Simply tell us that you are using OPSIN and use some words that show us why. If possible we can publish this anonymously: “we regularly use OSCAR and have integrated it into our Y process”. That gives motivation to us to continue in the dark hours of the night, to refactor, to document, etc.
(b) Contribute in-kind data. I’ll expand “how” later, but the sorts of things we would like are:

  • names with connection tables (especially if not in Pubchem)
  • ontologies (e.g. similar to ChEBI)
  • tutorials
  • regular expressions for specific tasks (though Peter often has better approaches)
  • insight into document structuring (e.g. the variability and commonality of organization in theses)
  • corpora (especially annotated). These are very valuable but should not be approached casually
  • acronyms
  • numbering systems, both arbitrary and algorithmic
  • journal- and thesis-specific regular expressions

(c) Contribute in-kind code. You can, of course, do this anyway through Sourceforge, but it is best to coordinate. Major areas that OPSIN requires are:

  • stereochemistry (even R- and S- cannot be parsed at present)
  • bridged ring systems (e.g. bicyclo[3.2.1]octane). (I wrote an algorithm for this at one stage but it’s not in OPSIN)
  • fused rings. (This can be fairly hairy, but partial lookup can help a lot)
  • specialised vocabularies (e.g. saccharides, nucleic acids).
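To give a flavour of what a bridged-ring contribution involves, here is a minimal sketch that turns a von Baeyer descriptor such as bicyclo[3.2.1] into an atom count and a bond list. It is not the algorithm mentioned above and it is not OPSIN code; the function name `bicyclo_connection_table` is hypothetical, the stem ("octane") is ignored rather than checked, and only simple unsubstituted bicyclo[x.y.z] names are handled.

```python
import re

def bicyclo_connection_table(name):
    """Return (atom_count, bonds) for a von Baeyer bicyclo[x.y.z] name.

    Atoms are numbered 1..n in the von Baeyer order: from bridgehead 1
    round the largest bridge to the second bridgehead, on round the next
    bridge back to atom 1, then along the shortest bridge starting next
    to atom 1. Each bond is a (low, high) pair of atom numbers.
    """
    m = re.match(r"bicyclo\[(\d+)\.(\d+)\.(\d+)\]", name)
    if m is None:
        raise ValueError(f"not a bicyclo[x.y.z] name: {name!r}")
    x, y, z = (int(g) for g in m.groups())
    if not (x >= y >= z and y >= 1):
        raise ValueError("bridge sizes must be in descending order, y >= 1")

    ring = x + y + 2   # the main ring: both bridgeheads plus two bridges
    n = ring + z       # the third bridge adds z further atoms
    head2 = x + 2      # number of the second bridgehead

    # Main ring: 1-2-...-ring, closed back to atom 1.
    bonds = [(i, i + 1) for i in range(1, ring)]
    bonds.append((1, ring))

    # Third bridge: atom 1, the z bridge atoms, then the second bridgehead.
    # With z == 0 this is a direct bond between the bridgeheads.
    chain = [1] + list(range(ring + 1, n + 1)) + [head2]
    bonds += [tuple(sorted(pair)) for pair in zip(chain, chain[1:])]
    return n, bonds
```

For bicyclo[3.2.1]octane this yields 8 atoms and 9 bonds, with atom 8 bridging atoms 1 and 5. Stereochemistry, substituents, heteroatoms and polycyclic descriptors are all left out, which is exactly where help would be welcome.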

(d) Support the community.
(e) Invite one of us to talk with your organization.
(f) Financial support. This can range from a summer student (a few thousand USD) to larger structured projects, perhaps involving other partners and initiatives (JISC, FP7, NSF, etc.). One of the enormous attractions of Openness is that you immediately get all the benefits of everyone else involved. (If, in our work with RSC, Nature and IUCr we had insisted on all IP being held in silos we would never have had OSCAR – it’s that simple).
SciBorg and OPSIN
It’s valuable to understand where OSCAR fits into the larger picture. In 2004 Ann Copestake and Simone Teufel (from the Computer Laboratory) together with Andy Parker (Cambridge eScience) and myself bid for and got an EPSRC grant which we called “SciBorg”. It includes the very real and positive collaboration of Nature, Int. Union of Crystallography, and the Royal Society of Chemistry. The objectives are:

  1. To develop a natural-language oriented markup language which enables the tight integration of partial information from a wide variety of language processing tools, while being compatible with GRID and Web protocols and having a sound logical basis consistent with Semantic Web standards.

  2. To use this language as a basis for robust and extensible extraction of information from scientific texts.

  3. To model scientific argumentation and citation purpose in order to support novel modes of information access.

  4. To demonstrate the applicability of this infrastructure in a real-world eScience environment by developing technology for Information Extraction and ontology construction applied to Chemistry texts.

The project is larger and more visionary than simply extracting chemical names from text. I like to use the phrase “machine-understanding of scientific literature” – but may be pulled up on over-stressing “understanding”. But the idea is to use a variety of techniques to understand the deep structure of the language at sentence, paragraph, and document level, and to go beyond the linguistic form to infer motivation: “why is this citation important?”, “is this paper challenging conventional views?”.
To do this we need to understand chemical language, and that is where OSCAR3 comes in. It works out the role of chemical words and phrases so that more powerful tools can interpret the larger context. So “methane” is a noun (CM) while “methyl” is an adjective (CJ) and “methylated” is a reaction/verb. (You might mention “methylated spirits”, but exceptions and ambiguity are all part of the fun of parsing human-generated language). And OSCAR gives these decisions a probability (“P450 demethylates caffeine” is more likely to be a verb than “methylated spirit”). But in the grander scheme of things SciBorg does not usually need to know what the actual compounds are to work out the deep structure of the language. (Obviously as we advance there is a chance to add validity and inference – thus “C-14 demethylation of lanosterol” is meaningful whereas “C-15…” is not – but we can’t do it all yet).
The point is that Peter is only partially working on OSCAR. Along the way he has developed a lot of clever tricks and this account does not do him justice. But we now need to take OSCAR/OPSIN forward on a broader front and this post addresses some of this.
OSCAR and OPSIN have become complex. Peter Corbett inherited bleeding-edge code and data which at that time had bits from all sorts of authors (Joe Townsend, Fraser Norton, Sam Adams, Chris Waudby, Richard Marsh, James Bell, Vanessa de Sousa (who wrote “Nessie”), Justin Davies, and PeterMR). It covered document structure, import of legacy, regular expressions, data checking, name2structure, lexicons and other bits and pieces.
The original OSCAR (sometimes OSCAR-1 or even OSCAR-2) was primarily based on regular expressions and data checking. OSCAR-3 continues to identify this part but does not support the OSCAR-2 GUI, nor do the regexes fit all journals and theses – they were aimed at RSC articles. So Justin and Richard created OSCAR-DATA which consumes the data section output of OSCAR3 and then applies rules and presents it. We have a sustainability path for that. More later.
Meanwhile OSCAR3 is being used for a wider range of applications than simply journal articles, in particular patents, internal documents and theses. So we have been able to get funding from Unilever for David and Lezan (who joins us on Monday) and from JISC for Alan and Diana. That gives us a critical mass in the more direct support for OSCAR3 functionality.
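As an illustration of the "regular expressions and data checking" style described above, here is a minimal sketch that pulls melting points out of experimental text and flags implausible ranges. It is not taken from any OSCAR version; the pattern, the name `extract_melting_points` and the accepted "mp" / "m.p." spellings are assumptions, and as noted above a pattern tuned to one journal's house style will miss others.

```python
import re

# Assumed house style: "mp 123-125 °C" or "m.p. 98.5 °C".
MP_PATTERN = re.compile(
    r"\bm\.?p\.?\s*[:=]?\s*"                   # "mp", "m.p.", optional ":" or "="
    r"(?P<low>\d+(?:\.\d+)?)"                  # first (or only) temperature
    r"(?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?"  # optional upper end of a range
    r"\s*°\s*C",                               # Celsius only in this sketch
    re.IGNORECASE,
)

def extract_melting_points(text):
    """Return a list of (low, high, plausible) tuples found in text.

    A single value is reported with low == high. The data check is
    deliberately simple: a range is plausible only if low <= high.
    """
    results = []
    for m in MP_PATTERN.finditer(text):
        low = float(m.group("low"))
        high = float(m.group("high")) if m.group("high") else low
        results.append((low, high, low <= high))
    return results
```

A call such as `extract_melting_points("White solid, mp 123-125 °C")` returns one tuple with the range and a plausibility flag. The flag is the "data checking" half: a reversed range is more likely a typo in the article than a fact worth indexing.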
The first thing is refactoring. No-one likes refactoring, but it feels good when you’ve done it. Jim is designing the overall approach and he is able to do this in a way I can only marvel at. It uses design patterns, testing (of course) but combined with lightweight services (REST, Web 2.0) and JFDI pragmatism. We’ll expose more of this later but some of the themes are likely to be:

  • use all the Open strategies (SVN, Eclipse, JUnit, Maven, etc.)
  • use open source components where possible (PDFBox, Lucene, JUMBO, CDK, etc.)
  • explore how to modularise them where necessary (e.g. CDK)
  • separate components and communicate through XML or RDF (e.g. OSCAR3 -> OSCAR-DATA can be almost completely decoupled)
  • devise a framework based on standard approaches (e.g. Eclipse)
  • actively design and highlight and explain extension points. This is essential if others are to add code and resources in parallel

So we have to do this for ourselves.
If the community can give us clear indications of what they can contribute then this will help us to identify the extensions.

Posted in chemistry, open issues, oscar, programming for scientists, XML | Leave a comment