petermr's blog

A Scientist and the Web

 

Archive for September, 2011

Journal of Cheminformatics special issue: Visions of a Semantic (Molecular) Future

Thursday, September 15th, 2011

Over the past approximately 3 months I and colleagues have been writing and editing 15 articles for the Journal of Cheminformatics on “Visions of a Semantic (Molecular) Future”. We’ve finally got to the stage where all 15 articles have been accepted and are in the final stages of processing. We expect the “issue” to appear RSN (“Real Soon Now”).

Most of the submitted drafts can be found here:

http://www.dspace.cam.ac.uk/handle/1810/238409/browse?type=title&sort_by=1&order=ASC&rpp=20&etal=-1&null=&offset=20

(Note that DSpace is poorly designed for managing collections of documents, so we haven’t been able to provide our own title page which links to the articles and explains them – for that you have to know the “handles” and manually edit them. So the listing also contains additional materials.)

I wrote an editorial which can be found in full here http://www.dspace.cam.ac.uk/handle/1810/238399 and I’ll quote some sections. Note that several of the papers are general and the chemistry is almost incidental.

I’d like to thank the editorial staff of Biomed Central very much. This isn’t ritual thanks – many publishers generally deserve few thanks for their attitudes towards holding scholarship in the dark ages for their own benefit rather than serving readers and authors. It’s also no thanks to Springer, who own BMC (see http://blogs.ch.cam.ac.uk/pmr/2010/11/11/versitaspringer-%E2%80%93-please-edit-our-commercial-journals-for-free-so-we-can-sell-them-to-you/ ) and whose interview with Richard Poynder (http://poynder.blogspot.com/2011/01/interview-with-springers-derk-haank.html ) showed that Springer simply regards academia as a (guaranteed) source of income.

I have commented before on how important BMC has been in establishing the credibility of Gold Open Access: http://blogs.ch.cam.ac.uk/pmr/2010/06/11/reclaiming-our-scholarship-tribute-to-vitek-tracz-and-bmc/

Vitek, Matt Cockerill and others have shown that a publisher aimed at providing a service to the community can make a viable income (including profit). That in itself is valuable. But BMC, probably more than any other OA publisher, has caught the spirit of OA and more generally Openness in that it has been active in developing new facets to Openness (such as the Open Data awards and the adoption of the Panton Principles). And I confidently expect to be working in collaboration with BMC in the future and reporting it on this blog.

Iain Hrynaszkiewicz, Jan Kuras, Bailey Fallon and the editors (Christoph Steinbeck and David Wild) have all helped to adopt new features in this issue. Dan Zaharevitz’s contribution is unusual – it’s a transcript of his talk, which I think captures the historical aspects of cheminformatics far better than sentences with passive verbs. Henry Rzepa and our Open Bibliography group eat our own dogfood and the editors have accepted this (Elsevier totally destroyed my last attempt to publish in HTML).

But the conventional publication process is out-of-date. The reviews have been useful. They’ve caught batches of typos, and we have added sections in response. Some reflect the different slants on publication and the tension between the new and the conventional. There are probably still glitches.

It’s taken about 1-2 months to write the articles (some of the authors like writing, some do not!). And about 10 weeks for the papers to go through the review process (most were posted to DSpace on 2011-07-04). And a bit more before they appear in print. I have to give great thanks to Charlotte (Bolton) who acted as amanuensis (and also to EPSRC who provided Pathways to Impact funding for the symposium and publication process).

So the timescale is probably about as good as it gets. But because BMC is an OA publisher we’ve posted the manuscripts in DSpace and Google (and perhaps you, dear reader) have been reading them. So publication was effectively immediate.

What have we gained from the formal publication process? Undoubtedly there will be people who don’t read blogs who will read them because they are in J.Cheminf. They have a formal bibliographic entry in a way that blogs don’t (yet) – but that will change. They are better because of the review process.

But the main apparent value is that they are citable, and citable for establishing the personal merit of the authors. For me that’s irrelevant – for some of the authors it’s very important. But *why* is the publishing of papers still stressed in this fashion? A paper about OSCAR4 is far less use in practice than the material we provided for the launch (tutorials, examples, downloads, etc.). Open Bibliography will be judged by how well it supports Open Scholarship – not by a paper. Henry and me recounting the development of CML might well be better in a video. Dan Z is certainly better in video! We have to change and there are increasing indications that non-paper outputs will start to be valued.

So here are some snippets from my editorial:

The articles have a common theme of representing information in a semantic manner – i.e. being largely “understandable” by machine. This theme is common across science and many of the articles can and should be read by people outside the chemical sciences, including information scientists, librarians, etc. An emergent phenomenon of the last two decades is that information systems can grow without top-down directions. This is disruptive in that it empowers anyone with energy and web-skills, and is most powerful when exercised in communities of people with similar or complementary skills.

It is often possible to move very quickly, and in our hackfests (one was prepended to the symposium) we have shown that it is possible to prototype within a day or two. This creates a new generation of scientist-hackers (I use “hacker” as “A person who enjoys exploring the details of programmable systems and stretching their capabilities” [1]). Several of the authors in this issue would regard themselves as “hackers” and enjoy communicating through software and systems rather than written English. This stretches the boundaries of the possible but also creates tension where the mainstream world cannot react on a hacker timescale and with hacker ethics.

More generally many scientists and information professionals are increasingly frustrated with the conventional means of disseminating science. Most conventional publishers regard scientific articles as “their content” and a very recent article (2011-06-20) from the STM publishers [2] indicates that the publishers believe they have the right to determine how content is, or more often is not, used. As an example most forbid by default indexing, textmining, repurposing, even of factual data to which the scientist has a legitimate subscription. This has an entirely negative effect on information-driven science, preventing even the development of the technology.

Generally, therefore, there is a culture of bottom-up change (“web democracy”) which looks to the modern web and examples of empowerment. (There are also examples of disempowerment such as attacks on Net-neutrality, walled gardens, information monopolies, vendor lock-in, etc. and this contrast activates many in the modern informatics world). There are several articles, therefore, whose main theme is the access to Open information.


I now believe that in many cases it is unethical to restrict access to publicly funded science. Lessig, in his CERN talk (“Scientific Knowledge Should Not Be Reserved For Academic Elite” [3]), showed that it would cost 500 USD for him to read the top 10 papers relating to his child’s condition. These papers are effectively only available to academics in rich universities. A colleague recently told me he had spent a month researching the literature of his child’s condition (to critically effective purpose) and we agreed he could only do this because he was a professor at a University. That is one reason I support the Open Knowledge Foundation and its projects to define and obtain Open information (of which Open Bibliography [4] in this issue is typical).


Because of this, chemistry has almost no public ontologies, and we have a vicious circle. Without ontologies, authors cannot reasonably be expected to create semantic information, and without a clear need for semantic information, the community will not take on the considerable load of creating ontologies. Several of the articles argue that the creation of lightweight dictionaries and other semantic metadata is affordable by the community and I believe that if the communal will is present, then it would be possible through bodies such as IUPAC and others, to create a full semantic infrastructure for much of the current published chemistry.

The current legal and contractual restrictions on re-using chemical data are seriously holding chemistry behind other subjects. These articles in this issue are not the place for polemics but we hope that traditional creators of information resources in chemistry will now think carefully about the value of making their data fully Openly available. This will be a considerable act of faith, because it will need a change in business model. Some of those providers have been traditionally held in high esteem by the community and if they use that esteem they have the opportunity to change the practice of chemical informatics.


A major feature underlying all of the papers is to give an insight into the process of creating an information ecology. Some of them represent scientific discoveries (e.g. Rzepa) but most are concerned with building a coherent infrastructure usable by the community. It may be useful to liken this infrastructure to the development of instrumentation in many branches of science. Science depended on the microscope, the telescope, the spectrograph, the Geiger counter and many other types of instrumentation. There is sometimes a modern tendency to discount instrumentation and infrastructure as not being ‘proper science’. We hope that this issue will redress that balance.


Several of the articles (CML [13], OSCAR [14], OPSIN, dictionaries [15], WWMM [16]) in this issue cover a decade of work. We hope this will be useful to scientists and scholars who wish to implement new ideas and to give them some idea of what works, and what, more commonly, does not work. Sometimes only the passage of time and persistence achieves some level of success. Again, the short-termism of many infrastructural projects militates against developing a good platform for the future.


A number represent growing points whose development is highly unpredictable. These include the WWMM [16], where the vision of a distributed peer-to-peer knowledge resource has had to wait a decade until it could be implemented. The Quixote project is only months old but takes this vision and has already built an impressive prototype, which I expect to set the model for computationally-based knowledge repositories. These projects rely heavily on community, and this is most clearly shown in the Blue Obelisk movement [20] which aims to, and has largely succeeded in, creating an Open infrastructure for cheminformatics. A major motivation for this has been not just that software and data should be universally available but also that this is the only manner in which science can be reputably validated both by humans and machines. An example of the need for such validation is shown in Henry Rzepa’s article [21].


The relative stagnation of chemical informatics suggests that change is unlikely to happen from within chemistry. As progress occurs in other areas (retail, bioscience etc.) chemistry may be dragged into the semantic world regardless. If chemists wish to retain control over their own systems they will be wise to start investing in Open semantic environments, because otherwise the rest of the world will do it for them.

How can chemical informatics survive and prosper? I think the most likely model will be Open publishing, not just of texts but data and other resources, mandated and paid for by funders. Those publishers which are able to adopt an Open model rather than continuing to maintain their own walled gardens, will ultimately triumph, and probably more rapidly than we expect.

 

 


 

Open API (or glorious API?)

Thursday, September 15th, 2011

It is becoming critical that we (?everyone) define what is meant by “open API” and what it means operationally. This post introduces the problem – the next will suggest some ways forward.

Why does this matter? Isn’t “open” an indication of goodwill towards others? A general philosophy that we’d like to share things and work together? That things should be free?

No.

The problem is that “open” is used in so many contexts, often without thought, that it can become almost meaningless. And if you take it lightly it will cost you money or end you in court.

What does “open Access” mean? I am sure all readers know.

Except they don’t. If you are asked to pay 3000 USD as an author of an Open Access scholarly article, what are you getting? And what are you offering to the rest of the world? Often it is seriously unclear. Why pay 3000 USD if you can post your article as “Green Open Access”? Are you allowed to post your article? Can you re-use it?

In fact don’t you have to read the small print of every single publisher (if you can find it, which usually I cannot)? And make sure that what you do isn’t going to land you in court? Yes, I’m serious. If you post a single image from a Wiley journal you are still in danger of being sued or having your subscription cut off (http://scienceblogs.com/retrospectacle/2007/04/when_fair_use_isnt_fair_1.php ). Claiming that you had some nebulous “open” right or “fair use” isn’t going to remove the lawyers. Wiley still require you to ask permission to re-use “their” material (even if you wrote it or drew the pictures).

In short I believe “Open” is only useful as an operational term if it is clearly defined as something that frees us from the threat of lawyers.

Many people use “open” like Humpty Dumpty uses “glory”:

   “There’s glory for you!”
   “I don’t know what you mean by ‘glory,’ ” Alice said.
   Humpty Dumpty smiled contemptuously. “Of course you don’t – till I tell you. I meant ‘there’s a nice knock-down argument for you!’ ”
   “But ‘glory’ doesn’t mean ‘a nice knock-down argument,’ ” Alice objected.
   “When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean – neither more nor less.”

Here’s a conversation I had with a vendor of information systems about 2 years ago at a JISC meeting:

V: “We have an Open API” (implying this was a GOOD THING)

Me: “can you let me have a copy of the spec?”

V: “No, it’s confidential to customers”

Me: “If I purchased your system could I share the API with others?”

V: “No, that’s a breach of contract” (i.e. I might be sued).

I have had this conversation with other vendors. When I questioned them I was told that I had a different idea of Open from them. (True). This use of “open” seems to be as useful as “healthy”. The most charitable interpretation is that they have actually documented their API. “open” is frequently a marketing word, or a word to make you feel good (about the “open”ers) or just fuzz to show the heart is in the right place.

And as such I shall replace it by Humpty’s word, “glorious”.

V: We have a glorious API.

Me: no quibble. Meaningless marketspeak but I’m used to that

So whenever you hear “open”, substitute “glorious” and see if you have lost any information.

Open Source does this well. I know that SourceForge, Outercurve, Apache, Bitbucket and GitHub host Open Source programs. And if I look this up on OSI I find: http://www.opensource.org/docs/osd

Open source doesn’t just mean access to the source code. The distribution terms of open-source software must comply with the following criteria. They are simple (fit on a page) and crystal clear to English speakers (and have of course been translated). I’m just giving the headings here, but READ them.

1. Free Redistribution

2. Source Code The program must include source code, and must allow distribution in source code

3. Derived Works The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.

4. Integrity of The Author’s Source Code

5. No Discrimination Against Persons or Groups

6. No Discrimination Against Fields of Endeavor

7. Distribution of License The rights attached to the program must apply to all to whom the program is redistributed

8. License Must Not Be Specific to a Product

9. License Must Not Restrict Other Software

10. License Must Be Technology-Neutral

 

This doesn’t stop you running a business on Open Source (Redhat, Kitware ++). Or having moral control – as long as you can exercise it through e-charisma. But, in principle and usually in practice, anyone has the right to copy and fork your code. It may be frowned upon, but it will not bring the lawyers.

Whereas if you fork copyright material – even if “freely” available on the web, and even if not created by the copyright holder, lock the door or leave the country. (The original idea that academics signed over their copyright to publishers so that publishers could protect academics from pirates seems tragically distant now. Publishers “own” OUR material for their own ends).

So the Open Access declaration (I use Budapest http://www.soros.org/openaccess/read ) had the same noble principles:

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

These are great principles, and COULD have been crafted into a legal framework that ensured that readers could re-use Open material without fear. But in practice the community did not address this with the result that until recently no one knew what “open access” meant in practice. If I post a “green open access” copy of a “publisher’s article” and I re-use this for any purpose I can still be sued. There is no legal gift, no legal guarantee.

The major progress in this has been the emergence of “Open Access” publishers. These are – in the main – characterised by using CC-BY licences. A document which EXPLICITLY gives the reader/user rights. With Open Access publishers you can sleep soundly.

Note that if a document is not completely Open its status is effectively closed in legal terms. This is not a quibble – ask the lawyers when they come after you.

Sadly Institutional Repositories have almost completely failed in promoting Open Access. Almost no content carries explicit rights, and without those rights you can only assume that the content is closed. If you doubt this, try to find more than 5% of any IR which is explicitly marked as Open/CC-BY. And how did you search for it? By hand – as repositories generally don’t provide search-by-legal-rights. So almost all content in IRs is “glorious”.

The Open Knowledge Foundation has defined “Open Knowledge” very clearly (http://www.opendefinition.org/ ):

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

AND the OKF has spent time in cataloguing those licences which are OKD-compliant and those which are not. For example CC-BY is, CC-NC is not, compliant. OSI licences are compliant. A document without a licence is not, de facto, compliant. So if something is OKD-compliant, you can sleep. Otherwise you can’t.
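The rule of thumb above can be written down as a small lookup. This is only a sketch in Python: the function name `is_okd_compliant` and the informal licence labels are illustrative, and the table holds only the cases mentioned in this post, not the OKF's actual catalogue. Anything missing or unrecognised is treated as closed.

```python
from typing import Optional

# Only the cases named in this post; the OKF's real catalogue is much longer.
# The labels are informal, not official licence identifiers.
OKD_STATUS = {
    "CC-BY": True,   # attribution only: compliant
    "CC-NC": False,  # non-commercial restriction: not compliant
}


def is_okd_compliant(licence: Optional[str]) -> bool:
    """Return True only when the licence is known to be OKD-compliant.

    A missing or unrecognised licence is treated as closed.
    """
    if licence is None:
        return False
    return OKD_STATUS.get(licence.strip().upper(), False)


for label in ("CC-BY", "CC-NC", None):
    verdict = "you can sleep" if is_okd_compliant(label) else "you can't"
    print(f"{label}: {verdict}")
```

The important design choice is the default: the absence of a licence returns False, which mirrors the point that a document without a licence is, de facto, not compliant.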

All of this leads to a recent discussion (http://lists.okfn.org/pipermail/open-bibliography/2011-September/thread.html ) on OKF’s Open-bibliography list about the use of “Open” (http://lists.okfn.org/pipermail/open-bibliography/2011-September/001141.html).

David Weinberger <self at evident.com> wrote:

> LibraryCloud is a metadata server gathering
> library metadata (circ data, user reviews, etc.) and making it openly
> available via APIs and Linked Open Data.

......

> We are on the verge of making it accessible to a limited public (API key
> required, daily queries limited to 3,152). We're interested in
> contributing what we can as we can. (No, we cannot make its catalog
> available in its entirety. We wish.)

 

These two paragraphs contradict each other.

Is LibraryCloud an open data provider, or not?

Either data is open, and it is possible to get hold of the entire dataset with a clear open license for what you can so with it. Or it is not open.

It is wrong to call data open if it is subject to arbitrary access restrictions like 3K entries a day.

 

--Jim [Pitman]
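The point about the daily limit can be made concrete with a little arithmetic. This is a sketch in Python. Only the figure of 3,152 queries per day comes from the message above; the catalogue size of 12 million records, and the assumption that one query returns one record, are hypothetical and chosen purely for illustration.

```python
import math

DAILY_QUERY_LIMIT = 3152  # from the quoted message


def days_to_harvest(total_records: int, records_per_query: int = 1) -> int:
    """Days needed to fetch every record under the daily query limit."""
    queries_needed = math.ceil(total_records / records_per_query)
    return math.ceil(queries_needed / DAILY_QUERY_LIMIT)


# Hypothetical catalogue size, for illustration only.
catalogue_size = 12_000_000
days = days_to_harvest(catalogue_size)
print(f"{days} days, or about {days / 365:.1f} years")
```

Under those assumptions a complete copy would take roughly a decade to assemble, which is why a rate-limited API is not a substitute for being able to get hold of the entire dataset.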

 

A lively and fruitful discussion followed, with some supporting “open” == OKD-compliant and others arguing that “open” was an arbitrary point on a spectrum. For which I read “glorious”. Here are two typical passages:

 

If something isn't "open" according to [OKD] strict standards it isn't open at all. This completely misses the fact that "open" as in the Harvard API may be completely fine and useful for nearly all real world purposes.

 

PMR: The good intention is clear, but “open” gives no other information. It does not keep the lawyers away.

we cannot make all our catalog data available for bulk download. That is a limitation we all regret but there it is.  I would argue that because the data we make available we make available without restriction, it is reasonable to use "open" as a modifier.

 

PMR: This shows the complexity. Perhaps the individual items *are* Open. In which case good, and in which case give them each a licence.

The sad fact is that in many cases “open” == “glorious”.

If we want “open” to be legally useful, the only practicable approach is to use the Open Definition. Everything else may be interesting points on a political spectrum, but only the OKD makes us safe.

And brings in the promised land of infinite re-use of knowledge.

There *are* technical concerns with OKD-Open APIs and I’ll discuss them in the next post.

 

Chemical Informatics and China; and challenges of language

Wednesday, September 14th, 2011

I was honoured last week to be invited by Professor Xiaoxia Li to speak at the 14th Asian Chemistry Congress in Bangkok and then to visit her group in Beijing. Unfortunately I couldn’t spend longer but the visit impressed me very much, both with the focus of the group and their high morale. The group is in the Institute of Process Engineering, Chinese Academy of Sciences in the Haidian part of Beijing (where there are many universities and scientific institutes). Here is Xiaoxia (left) and some of her group [photos on my phone, so variable quality].

Xiaoxia’s group has a lot in common with us. Several of the group are involved in indexing and information retrieval from the Deep Chemical Web (http://chemport.ipe.ac.cn/IPE-ChINGroup/group-publications-en.html ). Most of the web is actually inaccessible to search engines because the information is exposed through a query interface. “Enter your search term: “. Often you have no idea what to enter, and so Bingle passes it by. Her group is developing heuristics and templates for exploring what is in databases and what information can be extracted. It’s very challenging.

We also talked about software for processing chemical names and natural language wholly or partially in Chinese. We tried an experiment with OPSIN (name to structure). Daniel Lowe has explored how IUPAC names written in Chinese (and other languages) might be processed by OPSIN. He is moderately confident that the core of OPSIN is suitable, and that it is a question of preprocessing and vocabulary. Here is an example taken from http://zh.wikipedia.org/wiki/IUPAC%E6%9C%89%E6%9C%BA%E7%89%A9%E5%91%BD%E5%90%8D%E6%B3%95_%28A%E9%83%A8%29 – a translation of the IUPAC nomenclature rules into Chinese.


2,7,8-三甲基癸

When Daniel reads this from a file into OPSIN it interprets it as 2,7,8-trimethyldecane. We tried to reproduce this on a Chinese commandline, but ran into encoding problems. (Encoding is one of the commonest problems). However I am sure it is soluble.
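Encoding problems of this kind usually arise when the bytes of a name are written in one encoding and read back in another. The following Python sketch is not OPSIN code; it only illustrates the effect on the name shown above. The choice of GBK as the "other" encoding is an assumption about what a Chinese-locale console might use.

```python
# The name from the example above, exactly as shown
# (digits, commas and hyphen are plain ASCII).
name = "2,7,8-三甲基癸"

utf8_bytes = name.encode("utf-8")  # 3 bytes per Chinese character
gbk_bytes = name.encode("gbk")     # 2 bytes per Chinese character

# Same text, different byte sequences of different lengths.
print(len(utf8_bytes), len(gbk_bytes))

# Reading the bytes back with the encoding they were written in is lossless.
assert utf8_bytes.decode("utf-8") == name
assert gbk_bytes.decode("gbk") == name

# Reading UTF-8 bytes as if they were GBK does not recover the name.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled == name)
```

The usual remedy is to state the encoding explicitly wherever text crosses a boundary (file, console, pipe) rather than relying on the platform default.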

 

I was also given a tour of the work in the Institute. There is a lot of exciting work on High-performance computing (using GPGPUs http://en.wikipedia.org/wiki/GPGPU ) and the institute has, I think, the 33rd most powerful machine in the world. Certainly the scale and ambition of investment in science was clear. Among the demonstrations I saw were the simulation of a fluidised bed reactor (flow, temperature) and also the molecular dynamics of a complete influenza H1N1 virus (neuraminidase, haemagglutinin, capsids, RNA, – everything). We have come a long way since I worked in influenza 20 years ago.

 

I was also very very well entertained – driven everywhere – and shown many of the sights of Beijing. Here are two with colleagues from Xiaoxia’s group:

And (in the Forbidden City)

I was very well exercised by the end!

It was great to talk with a group with mutual interests – the discovery and re-use of information on the web. I gave them an overview of our work, our recent manuscripts, and left a fairly complete copy of OSCAR/OPSIN and related software. Some – at least – should be fairly easily adaptable.

And by happy chance “YY” (Prof Yong Zhang, left) was free and invited me to dinner with his group at Tsinghua University. YY spent 3 years in our group at Cambridge and was responsible for the early development of the World Wide Molecular Matrix (publication in press http://www.dspace.cam.ac.uk/handle/1810/238387 ).

Again I was very well looked after.

It is always great to find other groups who interact synergistically. Chemical informatics is not always glamorous, but it’s important and will increase in value as the barriers to information discovery start to disappear. Many many thanks.

14th Asian Chemical Congress: Why we need Open Data/Source/Standards

Tuesday, September 6th, 2011

I am talking tomorrow as an invited lecturer to the 14th Asian Chemical Congress in the Cheminformatics section (http://www.14acc.org/speakers.htm#s8). My message is that Cheminformatics needs Open resources (as in http://www.opendefinition.org/ ). I am not arguing that everything should be Open, but that everything critical should be. To summarise:

  • Data should be open. Unless data are Open they cannot be:
      1. Independently validated
      2. Republished
      3. Re-used for derivative works; this is where the innovation comes from
      4. Used as reference sources
  • Source code should be open to the extent that:
      1. It should be possible to recalculate a model, a set of properties, an analysis independently of closed systems
      2. The algorithm used should be inspectable.

This does not prevent proprietary codes being used for speed, convenience, etc. but they should not be the only way of verifying the work.

  • Standards, including dictionaries. Where files are used to communicate data, the syntax must be agreed (e.g. OpenSMILES, http://www.opensmiles.org/ ) and the documentation openly visible. Where terms/metadata are used they must be defined and agreed by the community (e.g. http://www.xml-cml.org/dictionary/ ). Modern dictionaries should be semantic (i.e. understandable by machine).

Chemistry, and cheminformatics even more, has very little in any of these areas. InChI is one of the few exceptions. Openness is being driven by funders, regulators, some government agencies and (from the bottom-up) the Blue Obelisk (http://sourceforge.net/apps/mediawiki/blueobelisk/index.php?title=Main_Page ).

Without Open Data/Source/Standards, computational/data-driven science is not reproducible.

Many areas in science, especially bioscience, are driven by the vision of the Semantic Web and Linked Open Data (http://en.wikipedia.org/wiki/Linked_Data ) and its graph of datasets (http://en.wikipedia.org/wiki/File:Lod-datasets_2010-09-22_colored.png ). There is very little chemistry here, because very little is Open. Even KEGG will disappear because it’s becoming closed.

I am working with the European Bioinformatics Institute on ChEBI (http://en.wikipedia.org/wiki/ChEBI ) and hopefully also on ChEMBL and related data. The bioinformatics community need Open chemical data and they are prepared to work to make it happen. Maybe at some stage the chemical community will see the value of Open knowledgebases. Until then we will continue to generate collections of computational chemistry, crystallography, spectra, and other properties by using machines to extract or generate them.

Here’s some material I presented earlier to the ChEBI group (2011-06-01) …

Web-based science relies on Linked Open Data.

Topics

| issue | closed data | open data |
| --- | --- | --- |
| sustainability | supported by income | few proven models |
| creation of business model | easyish | hard |
| added human value | often common | anything possible |
| support | usually good | depends on community |
| domain acceptability | well proven | often suspicious |
| cost | high; increasing? | marginal |
| innovation | central authority | fully open |
| reuse | normally NO | fully OPEN |
| speed from source | often slow | immediate |
| mashupability/LODD | very rare | almost universal |
| reaction to new tech. | often slow | very fast |
| comprehensiveness | very good to patchy | potentially v. high |
| global availability | often very poor | universal |
| acceptable to funders | variable; decreasing | very high |

 

In the current talk I shall stress tools for data extraction and creation. In particular:

I hope to show demos of some/all of:

And, if you are excited about creating Open Chemistry, here are some tools to help (http://blogs.ch.cam.ac.uk/pmr/2011/09/04/open-crystallography-how-to-start-it-and-where-should-we-base-it/ ).

 

 

 

Open Crystallography: How to start it and where should we base it?

Sunday, September 4th, 2011

#opencryst

At #iucr2011 Saulius Grazulis and I agreed to set up Open Crystallography. We’ve both been working in this area (he with the Crystallography Open Database community (http://www.crystallography.net/ ) and our group with Crystaleye (http://wwmm.ch.cam.ac.uk/crystaleye/ ), Chempound for crystals (http://crystaleye.ch.cam.ac.uk/ ), etc.). These complement each other very nicely and we are aiming to

make all published crystallography Open.

That’s very similar to the Blue Obelisk, with Open Data and Open Source. COD has a lot of Perl and we’ve a lot of Java. They’ve got 150K structures, we’ve got about 250 K. Many overlap of course. We campaign for more crystallographers to make their structures Open, and this will come. We have to create the social and technical framework where it’s easy and attractive.

Anyone can join us. Crystallography is fun and appeals to mathematicians, high-school students, artists, etc.

So we are starting with creating Open communications where we can share what we have got. That’s something we are well used to in the Open Knowledge Foundation. So here I am outlining what we plan to do and asking readers for help and guidance.

  • Blog. A blog is a really good way of getting messages out and indexed: reporting on news, presenting interim results, etc. And we are developing our own crystallography data blog for structures. What blog? Current suggestion: create a blog on wordpress.com. Comments, please!
  • Wiki. We need a wiki to host all the links, overviews, and pointers to our own material. Where? Current suggestion: use http://wikispot.org/ ; any other suggestions?
  • Mailing list. This is for day-to-day communication of technical issues, policy, debates, etc. Suggestion: use googlegroups. Any other ideas?
  • Realtime communication. This is for virtual meetings. For this we’ll use skype and Etherpads.
  • Sourcecode. We use Bitbucket.org. Not sure what COD use.
  • Hosting databases, repositories, etc. This is hard work and a major effort of both projects.
  • Hashtag. I am suggesting #opencryst, which seems to be free.

We’ve used this mixture very successfully in several virtual projects. Please suggest other things we’ll need to do.

This is a model for Open Foo (so far Foo includes ScienceData, Bibliography, Government Data, and lots more). So if you are interested in developing Open anything, this project may be a useful learning ground.

 

#solo11; post Science Online thoughts

Sunday, September 4th, 2011

I’ve been to all the SOLO meetings (Aug/Sept) and really enjoyed them. The earlier ones concentrated on blogging and it was a great way to meet other bloggers. They have gradually moved towards adding other types of online activity, with a strong sense of doing stuff. Last year I was pre-occupied as we had organized a month-long hackathon to search patents for chemistry using text-mining. (We are forbidden to do this for scientific papers). As with many of my barmy ideas we just managed it with minutes to go…

This time (http://www.scienceonlinelondon.org/ ) was more relaxed for me – I wasn’t presenting anything. There were thousands of tweets – #solo11 which will give you a good idea. So here are my highlights:

  • Michael Nielsen – theoretical physicist turned into Open Science evangelist and doing lots of lectures in Europe (http://michaelnielsen.org/blog/visiting-europe/ ). Try to go if you can. We had a long series of talks earlier this year in Utah. Michael had looked at the principles of collective action (http://en.wikipedia.org/wiki/The_Logic_of_Collective_Action ), ranging from how trades unions started (in small groups) to how Valencia managed its upstream river basin to avoid the “tragedy of the commons”. Michael’s takehome was that it is very difficult to change from a local optimum (I agree in general) and that it is slow. Anyone altering the status quo lays themselves open to freeloaders. This applies to reforming scholarly communications and creating Open science. I felt it was slightly pessimistic, and feel that funders and governments may be able to exert influence. But I have taken to heart that one should start smallish, aim for the achievable, do it well, and allow for time. (Which is, I hope, what my various schemes conform to).
  • Ivan Oransky (Retraction Watch – http://retractionwatch.wordpress.com/ ). RW is a great site. I’d dropped in before, but now I see it as mainstream. How can we make sure that the blogosphere’s work in challenging bad published science is respected and recorded? RW is a great start. What we need – and I have ideas – is a tool where referees can be alerted as to whether one or more authors “have previous”. I have just seen an example of a re-publication of bad science, where I ask “how can the referees possibly have let that through?” It’s not just wrong, but the blogosphere has trashed it – and the referees are oblivious. And many journals don’t seem to care.
  • MaryAnn Martone, Neuroscience Information Framework (NIF), Spinal Muscular Atrophy Foundation (SMA). SMA is a genetic disease of children – fatal over unpredictable timescales. The SMA is committed to funding research that is targeted to deliver real value to patients in an aggressive timescale. They are not interested in impact factors, but research that makes a difference now. She highlighted that at the moment scientists compete and occasionally collaborate. But we should be looking for cooperation and coordination. That’s common in Open Source software which is why it has such a strong resonance with Open Science. Effective achievement of measurable goals is more important than individual brilliance. And I think that charities are perhaps the major force that could make this happen. Less glory, more progress.

  • Kristi Holmes, VIVO. An interdisciplinary national network enabling collaboration and discovery among scientists across all disciplines.

 

And our own session on Open Research Reports (run by David Shotton, supported by Tanya Gray). This will create Open material for disease science and we’ll be doing this at a SWAT4LS hackathon in December in London. But that deserves its own post.

A fantastic meeting and a very broad and valuable delegate list. Thanks to everyone.    

AJCann: http://scienceoftheinvisible.blogspot.com/2011/09/solo11-day-1-rough-thoughts.html

AJCann: http://scienceoftheinvisible.blogspot.com/2011/09/solo11-day2.html?utm_source=twitterfeed&utm_medium=twitter

Bjoern Brembs: http://bjoern.brembs.net/news.php?item.776.5&utm_source=twitterfeed&utm_medium=twitter

 

 

 

Reader Pays (a lot) to read “Sodium Hydride as Oxidant paper” (you don’t need to be a chemist)

Saturday, September 3rd, 2011

YOU DON’T HAVE TO KNOW ANY CHEMISTRY – THE POINT IS DIFFERENT. But you should have 70 USD ready.

I came across the following paper today:

Reductive and Transition-Metal-Free: Oxidation of Secondary Alcohols by Sodium Hydride

Xinbo Wang, Bo Zhang, and David Zhigang Wang

J. Am. Chem. Soc., 2011, 133 (13), p 5160

Publication Date (Web): July 21, 2009 (Addition/Correction)

DOI: 10.1021/ja904224y

 

Hmm… This is the same title as a paper published 2 years ago, which I thought had been withdrawn…

So some more searches:

Reductive and Transition-Metal-Free: Oxidation of Secondary Alcohols by Sodium Hydride

Xinbo Wang, Bo Zhang and David Zhigang Wang

J. Am. Chem. Soc., 2010, 132 (2), p 890

Publication Date (Web): December 23, 2009 (Addition/Correction)

DOI: 10.1021/ja910615z

 

But I thought it was 2009?

Ah yes…

Xinbo Wang, Bo Zhang and David Zhigang Wang

School of Chemical Biology and Biotechnology, Shenzhen Graduate School of Peking University, Shenzhen, China 518055

J. Am. Chem. Soc., Article ASAP

DOI: 10.1021/ja904224y

Publication Date (Web): July 21, 2009

 

Now the 2009 paper was severely criticized by the blogosphere, including repeat work by TotallySynthetic http://totallysynthetic.com/blog/?p=1903 showing that if you kept oxygen out the reaction didn’t happen. The criticizers were moderate in their use of science – given that the ACS promoted the use of the term “junk science” I suggest it applied here. (The reaction was flawed, the explanation was flawed). But that’s not the point here.

The point is “what is in the 2011 paper?”. Now, when I am away from the office I deliberately do not sign into Cambridge University. So if I want to read the 2011 paper I find:

Purchase This Content

Choose from the following options:

Well, it had better be good…

I got a friend to access it for me. I won’t say who in case they get done for stealing content. They let me read the whole content of the paper. I’m going to reproduce it here.

Hang on, that’s violating copyright. You CAN’T reproduce a whole paper.

OK, I’ll cut out the vowels. …

It reads:

“Th*s p*p*r h*s b**n w*thdr*wn f*r sc**nt*f*c r**s*ns”

I hope I haven’t broken fair use.

That’s the WHOLE paper. 35 USD for 2 days.

I haven’t read the 2010 paper. Perhaps someone can post it (but not, of course, the fulltext). Because I am going to spend my dollars in the pub.

 

 

 

 

Science Online: We can make blogs first-class citizens in scholarly publishing

Saturday, September 3rd, 2011

 

#solo11 (Science Online) is fantastic as ever and very different from last year. Masses of great people to meet.

Today (through Martin Fenner) we’re going to look at how to use blogs for science. I’ll probably be blogging quite a lot. Feed at: http://blogs.scienceonlinelondon.org/blog/2011

Here’s my first http://blogs.scienceonlinelondon.org/blog/2011/09/03/the-value-of-blogs/

Many thanks to Martin for setting up this blogging session at #solo11. I have been blogging for 6 years and have found it a very useful way of communicating ideas and getting feedback. I am now strongly convinced that blogs are a better technical platform for formal and informal communication of science. There is, for example, no technical reason why scholarly publications should not be on WordPress rather than some proprietary backroom system.

This post is short, to test Martin’s blog. Here are some advantages of blogs:

  • The authoring interface is natural (I use Word, others type directly, some use LaTeX)
  • There are many natural tools that come as standard (index, search, archive, chronology)
  • Blog feeds can be filtered, combined, repurposed. (Ever tried repurposing a PDF hamburger?)
  • You can subscribe to feeds and get immediate notification
  • There is a really easy way to comment. Some blogs get 100 comments in a day. It’s rapid. Feedback is great
  • Blogs can be hyperlinked so a subject can be discussed in many places
  • specialist plugins can be built (e.g. for chemistry, scholarly publication)
  • … and I can keep going
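
To make the point about filtering, combining and repurposing feeds concrete, here is a minimal sketch in Python using only the standard library. The two feeds, their titles and the example.org links are made up for illustration, and the helper names (items, combine_and_filter, to_rss) are hypothetical; a real blog feed would be fetched over HTTP and would carry more fields (dates, authors, Atom rather than RSS, and so on).

```python
import xml.etree.ElementTree as ET

# Two tiny, invented RSS 2.0 feeds standing in for real blog feeds.
FEED_A = """<rss version="2.0"><channel><title>Blog A</title>
<item><title>Open crystallography news</title><link>http://example.org/a/1</link></item>
<item><title>Conference travel notes</title><link>http://example.org/a/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Blog B</title>
<item><title>Crystallography data blog launched</title><link>http://example.org/b/1</link></item>
<item><title>Thoughts on peer review</title><link>http://example.org/b/2</link></item>
</channel></rss>"""


def items(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [(i.findtext("title", ""), i.findtext("link", ""))
            for i in root.iter("item")]


def combine_and_filter(feeds, keyword):
    """Merge several feeds and keep items whose title mentions the keyword."""
    merged = [entry for feed in feeds for entry in items(feed)]
    return [(t, l) for t, l in merged if keyword.lower() in t.lower()]


def to_rss(title, entries):
    """Repurpose the selected items as a new, minimal RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for t, l in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = t
        ET.SubElement(item, "link").text = l
    return ET.tostring(rss, encoding="unicode")


selected = combine_and_filter([FEED_A, FEED_B], "crystallography")
new_feed = to_rss("Crystallography across blogs", selected)
```

The result is itself a feed, so it can be subscribed to or fed into the same functions again, which is exactly what is hard to do with a PDF.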

and some disadvantages

  • you don’t get citations
  • you don’t get citations
  • you don’t get citations

That’s a human problem, not a technical one.