How can we plan the future?

First, many thanks to Phoenix Sky Harbor Airport for free wireless. Much appreciated. Why can’t others be less mean?
Today we set ourselves the challenge of planning the future of eScholarship. As the report will appear later it would be unfair to anticipate its findings, but I can at least indicate my own suggestions and say that they were well received.
We should look beyond classical adherence to publication-driven metrics, and beyond hiding scientific output before (or even after) conventional publication. I highlighted what Jean-Claude Bradley is doing by publishing his work as he does the experiments, and argued that this type of activity should be promoted in the next decade; that we should build virtual communities based on discovering collaborators through ’net contacts, drawn together by the force-field of similar thought patterns; and that we should cross discipline boundaries in tackling the world’s problems by pooling public knowledge and community expertise. Readers will see that J-C has built part of his laboratory in Second Life.
And that we should have the courage to trust young people’s vision and judgment. (Remember that many of the great chemists of the past were young – read Great Discoveries by Young Chemists by Kendall if you can get the book.) I suggested a radical program of undergraduate education built around the new students – who have never known anything other than the e-universe – and that these students should be encouraged to create their own proposals (how old were the founders of Google?).
We also discussed the governance of e-organisations like Wikipedia and the Blue Obelisk. How can funding bodies help such bodies develop? What happens when they reach a critical size? There was certainly excitement among some of the delegates.

Posted in Uncategorized | Leave a comment

Data-driven scholarship

(I’m afraid I’m going to bang on again about access to data.) I’ve been at the CNI and the NSF/JISC meeting for which I wrote a position paper. The meeting was on Digital Repositories and Data-driven Scholarship (Science). My paper was meant to look at scale and complexity (and my internal presentation did this) but I also added material about how you can’t do data-driven science if you can’t get the data.
The meeting was an invited group of library/informatics/CS/funder people and covered science, humanities and infrastructure. I really enjoyed hearing of the challenges in classical studies – digital museums, the complete e-collection of cuneiform tablets and so on. The idea is for NSF and JISC (funding bodies) to prepare a strategy document for the next decade (we set ourselves the task of planning radically new support for making a major impact by 2015 – a grand challenge – knowing how fast the world is changing). Part of the underlying assumption is that all major knowledge will be digital by that time and will be freely accessible to everyone – not just academic subscribers. We see young people, retired people, people anywhere on the globe participating freely in this endeavour.
But the publishers are still copyrighting our data. I used to think this was a bureaucratic oversight, but it isn’t. The publishers now make no secret that they are taking our data from us and intend to resell it back to us. Although I make noise (on this blog, on the SPARC Open Data list, on Wikipedia) I don’t find much resonance. I’m told there are other people out there who care about Open Data, but I don’t find a groundswell.
Readers, if we do not do something about this very soon we shall simply be owned by commercial and quasi-commercial publishing interests. The publishers are not stupid and they now know what the issues are.
I heard a story from a delegate from a major US laboratory: a major publisher had sent in an inspection team, decided that the laboratory was making too much use of its journals for the price it paid, and then demanded very large amounts of money. No names, sorry. We are fairly close to some sort of warfare with publishers if things don’t change. If so, you’ll know where to find me.

Posted in Uncategorized | 2 Comments

my terabytes

I’ve been offline for ca. 2 days – staying in a hotel which dates from the days of Mae West (she stayed there) but where the internet only works in one place – if you hide under the bed.
The closing plenary at CNI was from Marc Smith (Microsoft Research, Community Technologies Group). I didn’t catch all of it, but it included: everything we do leaves tracks in the digital sand. No privacy. We know that anyway. But our total world line is about 10 terabytes – that’s what an iPod will hold in a few years’ time. So we have the opportunity to record the complete world history from now on. Vannevar Bush’s memex is in sight.
Marc put forward the vision that we’ll carry this information environment with us and that when we meet other people or machines we’ll exchange bytes – rather like pheromones – so we get to know all about the beings we interact with. This is a common theme in SF – one’s memory is portable, downloadable and reloadable.
At the JISC NSF meeting (more later) we also talked about our information environment. The conventionalists see this tied to the institutional repository – where one’s output is collected and disseminated. But I’ve had 6 email addresses in the last 15 years, and every time I change I lose huge amounts of information. I’d rather keep my information environment with me and trade bits of it with any employer.
In the future we won’t put people in prison, we’ll just remove their terabytes and bandwidth.
Posted in Uncategorized | Leave a comment

PDFBox and OCR

Ben Litchfield is the(?) current guru of PDFBox and has sent me an update. (I copy it here because, although I think Jim has fixed the blog (thanks, Jim), I won’t take chances.)
Name: Ben Litchfield URI: http://www.pdfbox.org/ | IP: 170.37.224.2 | Date: April 17, 2007
FYI, work is in progress to fix the subscript issue, keep an eye out for the next version.
PDFBox does not contain OCR, but it does have some magic :)
Ben
====
I’m at the Coalition for Networked Information in Phoenix – and may blog on that – and talked to Glen Newton (from Canada), who suggested that what had happened was that the library had digitised the thesis *and* run OCR over it, then overlaid the OCR text on the TIFF bitmap. This makes sense.
I managed to hack a simple heuristic for subscripts and it worked better than I had hoped. So now I am able to extract chemistry out of PDF theses.
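For the curious, the heuristic is roughly the sketch below. This is not the actual code – it assumes you have already pulled each character, its font size and its baseline y-coordinate out of PDFBox’s TextPosition objects (accessor names differ between PDFBox versions):

    // Sketch of a subscript heuristic for text extracted from PDF.
    // Assumes each glyph carries the character, its font size and its
    // baseline y-coordinate (harvested, e.g., from PDFBox TextPosition
    // objects - accessor names differ between PDFBox versions).
    public class SubscriptHeuristic {

        static class Glyph {
            final char ch;
            final float fontSize;
            final float baselineY; // y increasing down the page
            Glyph(char ch, float fontSize, float baselineY) {
                this.ch = ch;
                this.fontSize = fontSize;
                this.baselineY = baselineY;
            }
        }

        /** True if 'current' looks like a subscript relative to 'previous'. */
        static boolean isSubscript(Glyph previous, Glyph current) {
            boolean smaller = current.fontSize < 0.8f * previous.fontSize;
            // baseline dropped by a noticeable fraction of the previous font size
            boolean dropped = (current.baselineY - previous.baselineY)
                    > 0.15f * previous.fontSize;
            return smaller && dropped;
        }

        public static void main(String[] args) {
            Glyph h = new Glyph('H', 11f, 100f);
            Glyph two = new Glyph('2', 7f, 102.5f);
            // would be emitted as H_{2} rather than a bare "H2"
            System.out.println("subscript? " + isSubscript(h, two));
        }
    }

The same test with the baseline shift reversed catches superscripts (charges, mass numbers and the like).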
BUT – it causes great pain. So please don’t deposit PDF-only. I am saddened to hear so many people at this meeting (which is mainly librarians and information scientists) talking about “depositing the PDF in the repository” as if PDF were some god-given information structure.
I am really excited that Ben has answered – it shows how rapidly the blogosphere helps people make contact. Yes, I could have mailed the PDFBox list – and I probably shall – but the blog also reaches the people who are creating PDFs…
I sometimes hear from people who offer to help but say they aren’t chemists. Here’s an opportunity.
Is anyone interested in helping develop PDF2XML for chemistry? You don’t need to know any chemistry, but you do need to be excited about de-obfuscation (reversing what PDF does to chemistry). For example, at the moment I need a way of removing graphical boxes from the manuscript, as the characters in them bleed into the text.
P.

Posted in Uncategorized | 5 Comments

PDFBox and Hamburgers – the story continues

I blogged recently about how I used PDFBox to turn chemical theses (in hamburger PDF) into text. I have now found some interesting (and I think exciting) developments – but I’d like a reality check.
I downloaded a thesis from a well-known institution. The spectra were bitmaps (and not very pretty when magnified). But the text seemed real text – it was selectable by the Adobe Reader and PDFBox emitted it as ASCII. But I noticed some strange things. As an example:
H2O 13NMR
might come out as
H
2
OJ
3
NMR
– why the “J”? So I looked closer and it seemed that the text was also a bitmap. On magnification it gave jaggy characters, and the “j” was just a bad reading of the “1” of “13”. I’d assumed the document was Word converted to PDF – a sort of undead or still-warm hamburger – but no, it’s a totally cold dead hamburger.
So does PDFBox have an OCR facility inside it? Or is there some magic inside the PDF I don’t understand (it’s all binary gibberish)? If PDFBox can do OCR without even blinking, that is awesome.
But – as you can see – OCR corrupts. Or rather scanning paper documents corrupts.
So – please – when you ingest Electronic Theses and Dissertations (ETDs), please use a semantic format. Please insist the student hands over a semantic electronic version. After all, you can withhold their degree or do other horrible things to them. It’s one of the few places where academia still retains some publishing power. So please show how it should be done properly 🙂

Posted in Uncategorized | 2 Comments

Rise of the Chemical Blogosphere

Another snippet from ChemBark, posted some months ago but still highly relevant:
News Story of 2006: The Rise of the Chemical Blogosphere
I have no doubt that the chemical blogosphere is here to stay and adds important new directions. I said this in my presentation at ACS. We can see the time when most facets of chemistry generate social computing in much the same way as other aspects of life do. It’s not a nine-day wonder.
It is becoming increasingly easy to annotate web pages – either directly or through various standoff methods. Social annotation can be at least as good as – and is much cheaper than – annotation by conventional commercial abstracters. If a significant number of chemists make simple comments on papers or other resources they read, then we could soon have an immediate and valuable meta-resource. (We are going to try this out with CrystalEye, and the Blue Obelisk has already generated some exciting technology.) If you see an error in a paper and could comment on it with a click of the mouse, why wouldn’t you? I have done this for many of the papers in the Beilstein Journal of Organic Chemistry. Yes, it’s unnatural at the moment, but as the MySpace generation moves into chemical research they will have little fear and will regard it as second nature.

Posted in Uncategorized | Leave a comment

Chemical Citizen of 2006 (Wikipedia, Blue Obelisk, etc.)

I am trying to get my past blog-stuff sorted out – some of my unpublished snippets may appear in random order.
I had selected ChemBark’s post:
Chemical Citizen of 2006: Wikipedia User “V8rik”
where CB lauds the contributions from “V8rik”. It’s a typical example of how multi-faceted the world is. There are lots of different groups and people with different approaches to the opportunities of the Net. The WP group are fairly distinct from the Blue Obelisk, for example. But synergy is increasing.
I’ve contributed a modest amount of chemistry to WP – and interacted with some of the main practitioners. WP is great in that it is easy to get started, with an exciting social model. I have predicted that WP (or derivatives yet to come) will displace many chemistry reference works. Have a look at some of your favourite compounds and see the WP entry. That’s all done by quite a small number of people. Imagine what it would be like if we all contributed just a little.
The main problem for me with WP (as with blogs, Wikis, etc.) is that they are syntactically and semantically broken. It’s very difficult to get chemistry into any of them. That may be gradually changing – Henry Rzepa has experimented with a semantic Wiki and reported this at the ACS. But generally, if you try to get a usable chemical connection table into Wikis it’s an effort.
That’s starting to change. Martin Walker is one of the stalwarts of chemical WP and we met at the Blue Obelisk dinner in Chicago. Yesterday he posted to the BO list (I can’t get to the BO archive, so have copied the message):
=== Walkerma ==

There have been a lot of things going on at Wikipedia that will interest
this group.
1. Chemistry/Structure drawing workgroup#ACD ChemSketch – the company is
willing to make a Wikipedia Template on their Freeware
(Sorry about the long URL!)
We are working with Antony Williams, Product Manager for ChemSketch, to
add Wikipedia settings into ChemSketch.  Simple drawing settings alone
would be trivial – we’ve already agreed to use ACS drawing settings – but
they are talking about adding in the image processing to make a PNG file
automatically (we asked for SVG, but we’d still be thrilled with PNG).
Williams is posting on the above talk page, and will provide a test
version soon, I think.  I think this feature will make it easier for
people to post structures to their web pages, never mind Wikipedia.
I did my duty and requested that the software also generate an InChI.
Williams was open/supportive, but said it would take longer.  I would like
to draw a molecule in ChemSketch, then have the magic “Wikipedia button”
generate a PNG or SVG file for me with the InChI attached as metadata, all
ready for uploading to the Wikimedia Commons.  If we can do this in
ChemSketch, I’m sure we can pressurise ChemDraw to follow suit.  I have a
feeling that the “Wikipedia button” (if properly designed) could in effect
become the standard “Upload my structure to my website button”, in which
case we would really like to get the metadata included.  Does anyone here
have any comments or advice on how best to do this?  Adding metadata to
image files in Wikimedia software is difficult, hence my post at the
Commons Village Pump:
Please help!
2. Someone from IUPAC Gold Book has approached us, “offering a
collaboration between IUPAC and Wikipedia and offering to make their data
available to us.”
http://en.wikipedia.org/wiki/Wikipedia talk:WikiProject Chemistry#IUPAC Gold Book
Again, much discussion and many thoughtful responses.  This really looks
like a good chance to add value to Wikipedia articles.  Suggestions,
ideas?
As well as the above, we’ve also had heated debates about SVG vs PNG,
citation policies, an Endnote citation-generator for WP, harmonisation of
chemical structure drawing standards with the German Wikipedia, and much
more, all since I got back from Chicago.  Jmol adoption even got a short
discussion, Bob.  Phew!
I’d really appreciate some expert advice on how best to proceed.  Many
thanks,
Martin

[…]

Martin A. Walker
Department of Chemistry
SUNY College at Potsdam
Potsdam, NY 13676 USA
+1 (315) 267-2271

=== Walkerma ==
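On Martin’s question about getting metadata into image files: one low-tech route is to write the InChI into a PNG tEXt chunk, which the standard Java ImageIO machinery can do. Here is a minimal sketch – the keyword “InChI” and the example value are my own illustrative assumptions, not an agreed convention:

    import javax.imageio.*;
    import javax.imageio.metadata.IIOMetadata;
    import javax.imageio.metadata.IIOMetadataNode;
    import javax.imageio.stream.ImageOutputStream;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.util.Iterator;

    // Sketch: embed an InChI string as a PNG tEXt chunk using plain ImageIO.
    // The keyword "InChI" is an assumption, not a standard.
    public class PngInchiMetadata {

        public static void writePngWithInchi(BufferedImage image, String inchi, File out)
                throws Exception {
            Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("png");
            ImageWriter writer = writers.next();
            ImageWriteParam param = writer.getDefaultWriteParam();
            IIOMetadata metadata = writer.getDefaultImageMetadata(
                    ImageTypeSpecifier.createFromRenderedImage(image), param);

            // Build a tEXt entry: keyword -> InChI string
            IIOMetadataNode entry = new IIOMetadataNode("tEXtEntry");
            entry.setAttribute("keyword", "InChI");
            entry.setAttribute("value", inchi);
            IIOMetadataNode text = new IIOMetadataNode("tEXt");
            text.appendChild(entry);
            IIOMetadataNode root = new IIOMetadataNode("javax_imageio_png_1.0");
            root.appendChild(text);
            metadata.mergeTree("javax_imageio_png_1.0", root);

            ImageOutputStream ios = ImageIO.createImageOutputStream(out);
            writer.setOutput(ios);
            writer.write(null, new IIOImage(image, null, metadata), param);
            ios.close();
            writer.dispose();
        }

        public static void main(String[] args) throws Exception {
            BufferedImage img = new BufferedImage(200, 200, BufferedImage.TYPE_INT_RGB);
            writePngWithInchi(img, "InChI=1S/H2O/h1H2", new File("structure.png"));
        }
    }

Whether downstream tools (the Wikimedia Commons upload chain included) preserve and expose such chunks is a separate question – which is really what Martin is asking.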
(I think the IUPAC contact may be Alan McNaught, and the Gold Book has been impressively converted to a publicly available ontology by the ZVON group.)
Anyway it’s nice to see a chemical software company helping – in general they have been very slow to embrace the new world of collaboration.
Semantic Wikis and blogs are very primitive, and anyone who has ideas or experience would be welcome – I suggest you post to the BO mailing list (http://hardly.cubic.uni-koeln.de/mailman/listinfo/blue-obelisk/).
When (not if) the worlds of chemical Wikipedia and the Blue Obelisk come together, the chemical world will not be able to ignore the power of Openness.

Posted in Uncategorized | Leave a comment

IDID – Idea-Design-Implementation-Dissemination

Software development is hard. Tedious. Frustrating. It usually takes much longer than anyone, including the author, thinks. So what tools and philosophy are useful to the solo – or near-solo – Open Source programmer? Here are some thoughts which  you’re welcome to challenge and improve.
Of course we should have good design, code re-use, testing and so on. And, where possible, create the community where synergy adds insight, quality and productivity. But it doesn’t always work that way. Many projects are essentially single-person: because they are too individual for communal design, because the ideas are untested, because they’re likely to need constant revision, because you can’t expect someone else to go through the pain on your behalf, because the idea is half-baked and only the originator can see it through. Later – when the idea is proven – a community may develop. But until then how do we go ahead?
Certain ideas from the world of mainstream software development are proven. Documentation. Unit testing. They work. Yes, we find them tedious and we try to neglect them but at all stages they pay back. So these are taken for granted. But what else – if anything – works for the solo programmer?
Many of the tenets of programming style assume a team of paid developers working in a well-funded project. Or at least a funded project. Whereas solo Open Source is usually done in marginal time – when you really should be asleep. Extreme and agile programming doesn’t work: pair programming? there isn’t anyone to pair with; 40-hour week? yes – 40+40 = 80.
The much-maligned waterfall model (even though it’s not as mindless as often portrayed)? Not in its classic form (from WP):

  1. Requirements specification
  2. Design
  3. Construction (aka: implementation or coding)
  4. Integration
  5. Testing and debugging (aka: verification)
  6. Installation
  7. Maintenance

But we do need some sequential discipline, and here’s mine. We start with an Idea. At this stage it looks like a good thing to do. Sometimes it starts from nowhere – sometimes it’s the obvious thing to do, or even the essential one. There can be no “requirements” at this stage – it is often pure experimentation.
OSCAR started this way (it wasn’t called OSCAR then – it didn’t have a name). I had the simplistic idea that we could parse chemical language easily. If I had realised how difficult it actually is I would probably have abandoned the whole thing and we would never have had any OSCARs. But I found some regex code, tried it on some papers, got some simple results and started our collaboration with the RSC. We were at least 10 years out of date in our approach – but this gave us the opportunity to meet our colleagues in Natural Language Processing and thence to do it properly (SciBorg). But this was all a learning process – we couldn’t have created a properly structured project at the start as we had no idea where we were going.
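The kind of regex experiment I mean looked roughly like the toy below – nothing like the real OSCAR rules; the patterns are illustrative only:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Toy sketch of regex-based "chemical language" spotting - the kind of
    // experiment OSCAR grew out of, not the real thing.
    public class ChemRegexToy {
        // A molecular formula such as C6H12O6 or CH3COOH (very rough)
        private static final Pattern FORMULA =
                Pattern.compile("\\b(?:[A-Z][a-z]?\\d*){2,}\\b");
        // An NMR shift such as "5.23 ppm"
        private static final Pattern SHIFT =
                Pattern.compile("\\b\\d+\\.\\d+\\s*ppm\\b");

        public static void main(String[] args) {
            String sentence = "The product C6H12O6 showed a singlet at 5.23 ppm.";
            for (Pattern p : new Pattern[] {FORMULA, SHIFT}) {
                Matcher m = p.matcher(sentence);
                while (m.find()) {
                    System.out.println("match: " + m.group());
                }
            }
        }
    }

The false positives such toys produce (the formula pattern happily matches “NMR”, for a start) are exactly why the collaboration with real natural-language people mattered.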
We try the Idea out, but soon need a roadmap. At this stage it’s important to create a Design. Without a design, especially when you are exploring, the code thrashes around. A clear indication of lack of design is difficulty in writing code, whereas with a good design the code can sometimes almost write itself. For me, XML has been invaluable as a design tool. Yes, it’s an end in itself for some of what I do, but it’s also an extremely powerful constraint and guide for programming. I often find that every non-transient data structure can be exported as XML, and that this helps the structuring of the code.
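A hypothetical illustration of what I mean: give every non-transient structure a serialiser from day one, and designing that element tends to expose muddled fields early. (The element and attribute names below are invented for the example – they are not real CML.)

    // Hypothetical example: a non-transient data structure that knows how to
    // serialise itself as XML. Element/attribute names are illustrative only.
    public class Peak {
        private final double shift;       // ppm
        private final double intensity;   // arbitrary units
        private final String multiplicity;

        public Peak(double shift, double intensity, String multiplicity) {
            this.shift = shift;
            this.intensity = intensity;
            this.multiplicity = multiplicity;
        }

        /** Serialise as a small XML element - forces the fields to be well defined. */
        public String toXML() {
            return "<peak shift=\"" + shift + "\" intensity=\"" + intensity
                    + "\" multiplicity=\"" + multiplicity + "\"/>";
        }

        public static void main(String[] args) {
            System.out.println(new Peak(7.26, 1.0, "s").toXML());
        }
    }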
But Design without Implementation is dangerous. Far too many protocols are developed without being fully implemented – or implemented at all. “Rough consensus and running code” (IETF) – the design must be implemented. And, as we said, good design supports implementation.
But often things go wrong here. The implementation doesn’t work out. And that’s a clear indication that the design is wrong. It has to be revised. Sometimes it needs simple additions. In the best cases it requires deletions – it is a great feeling when code can be simplified. Non-programmers don’t appreciate this – they look at a simple beautiful design and say – “that’s obvious” – whereas what we know is “how much work it took to make it simple”.
Sometimes the Design cannot be easily rejigged. That means we have to go back to the Idea. Change what we are trying to do. Or maybe even scrap the whole Idea. And that could be months or years down the line. (I spent a whole year wrapping the W3C DOM in a CMLDOM. It was a nightmare. But I had to work through it to show it wouldn’t work. It was “the right way to do it” at the time. Now we know that the W3C DOM is totally broken.)
JUMBO has been scrapped 4 times – we are now on JUMBO-5. It’s taken years. But at least now it works, the Design is clean and stable. It has to be right. Of course each new regeneration teaches us something. But none of this survives in the final product.
So we finally have an implementation. It’s not much use if no-one uses it. So it has to be Disseminated. Giving an iterative progression (with backtrack and restart):
Idea-Design-Implementation-Dissemination
Each step is a lot harder than the preceding one – maybe half an order of magnitude. It doesn’t always work in this strict order. But generally it’s only at the end that other people can really start to collaborate, because if you invite them in too early you risk your premature Design or Implementation crashing on them. And that’s not fair.

Posted in Uncategorized | Leave a comment

Data Aggregators or the Gift Economy?

The C20th saw the rise and value of scientific data aggregators – organisations that extracted data from the literature, cleaned it, packaged it and offered it for re-use. In some cases they got grants to support this, but most moved to a commercial (or at least non-profit) model where costs were recovered (and profits could be made). In chemistry one of the best known and most archetypal is “Beilstein”, created in 1881 by Friedrich Konrad Beilstein. Other well-known examples are Chemical Abstracts (run by the ACS) and the Cambridge Crystallographic Database (no direct link with ourselves).
These databases were created by necessary human labour (“the sweat of the brow” in US copyright phraseology). The fees or subscriptions were usually seen as a return for fair endeavour. However the problem of monopoly is always present and some have charged very high prices for their cash cows.
This model is now, objectively, becoming increasingly untenable, for many reasons, including:

  • The gift economy (more below)
  • Changes in social attitudes (especially among young people)
  • The dramatically lower cost of creating data, often near zero
  • Unacceptable monopolistic activities by aggregators trying to save their content, leading to public opposition
  • The requirement by funders that the complete scientific record is made Openly available

Although I have a personal opinion that data should be free, this post is intended to be objective. I know people in data aggregation activities and I do not wish them ill, but Cassandra-like I have to predict that they must change or die. (Steve Heller is one of the most eloquent evangelists on this theme, showing the inexorable pressure for change).
Few data aggregators show any signs of their impending problems – maybe this post will waken some.
The software community has already seen the rise of Open Source and some prophets suggest that ultimately all software will become free. Part of the motivation has been described as a “gift economy” (Eric Raymond in Homesteading the Noosphere). He argues that the Open Source movements value gifts rather than conventional material wealth. The costs of software creation are low enough and the rewards from the gift economy high enough that the equation balances. In chemistry the Blue Obelisk epitomises free donation.
I argue here that science will increasingly also create a gift economy and that individuals and organisations should be valued not simply (or even) for their integrated citation count and grant income but for what they have freely donated to science.
Data are even more important than software for future success in science. They may be expensive to gather, but the costs of further dissemination are near-zero. We have shown (SPECTRa) that data can be published as part of the process of their generation, and if funders require this, only marginal costs are needed to integrate this operation into the normal scientific procedure of experiment, analysis and publication. Through SPECTRa, CrystalEye, WWMM etc. the data are published to the community, and current informatics protocols (metadata, harvesting, etc.) are sufficient to create a virtual database.
However, few users of the current commercial databases in chemistry would willingly give them up. Why? Because (quite reasonably) chemistry values historical data. The chemists of the C19th were meticulous experimenters and their work is still valuable today. So both aggregators and most “users” see the historical content as critical. The monopoly continues.
But is this really relevant today? And even if it is, is the cost worth it? Why is historical data important?

  • patents. I accept that historical data is critical for showing prior art, and we shall need to maintain historical data for patent purposes.
  • safety. It may be necessary to search such data widely.

The rest of the arguments are less supportable:

  • comprehensiveness. Many scientists have been trained – or become adapted – to the need for a comprehensive review of the literature. But this is illusory. A large amount of chemistry lies in paper theses or other hardly-accessible sources and this is rarely searched. 80% of crystal structures are never published so the crystallographic data aggregators are only comprehensive in the narrow sense of formal publication.
  • curation and data cleaning. How well are data actually cleaned? Our robots show that most organic chemistry papers contain at least one error. Robots can now add a new dimension to data quality.
  • commentary. Some of the aggregators will add synoptic evaluation and commentary. This is enormously expensive. As social computing and annotation develops it will become uneconomic, except for safety-critical and legal-critical processes.

There will continue to be a need for high-quality data curation in these limited areas. But for everyday scientific research it is unnecessary.
Most young scientists do not read papers. Allen Renear (UIUC) gave a splendid talk at ACS where he argued that the point of browsing the literature was to avoid reading papers. Seriously. The exploratory phase of modern science values speed, multidisciplinarity, impressions as much as it values formality.
Any hindrance to access destroys the rhythm of exploration. I have seen graduate students give up (in seconds) on any paper to which their institution does not subscribe.
But still there is the historical content. The nagging feeling that the pearl of wisdom could be missed. And that the expensive aggregation is essential. I think this is a myth, and will be recognised as such in the next 5-10 years. I presented this at the ACS meeting (in the symposium honouring Gary Wiggins, of Indiana).
Historically the growth of most databases has been exponential, with a doubling period of – say – a decade. During that time labour costs increase, and customers expect prices to remain constant in real terms. So already there is a mismatch – leading to compromises in comprehensiveness and up-to-dateness.
(Here I had a diagram – “data aggregators” – but I can’t get it into the blog.)
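In place of the diagram, a back-of-the-envelope sketch of the mismatch it illustrated – the ten-year doubling period and the flat real-terms price are purely illustrative assumptions:

    // Illustrative only: if content doubles every 10 years while real-terms
    // income stays flat, income per record halves each decade.
    public class AggregatorArithmetic {
        public static void main(String[] args) {
            double records = 1.0; // relative corpus size at year 0
            double income = 1.0;  // relative real-terms subscription income
            for (int year = 0; year <= 40; year += 10) {
                System.out.printf("year %2d: corpus x%.0f, income per record x%.3f%n",
                        year, records, income / records);
                records *= 2; // ten-year doubling period
            }
        }
    }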
In the mid-1990s electronic publications became common and the technical cost of data dropped to near-zero. Many aggregators took advantage of this and moved to an author-submission model for data collection, with a major reduction in costs. However, because of the historical material, prices remained constant or increased, to the benefit of the aggregators and the disadvantage of the community.
Initially the proportion of author-contributed data was small, but now it is significant and growing. Thus we have collected ca. 60,000 crystal structures from the literature. This is perhaps only 20-30% of the historically aggregated crystallography, and might be seen as of limited use. However:

  • it has immediate currency – the robots aggregate it as soon as it is published.
  • the data are robotically curated
  • we can look to the gift economy to grow more rapidly than the historical model. If SPECTRa-T, COD and other gifts provide substantive new data then our Open data will start to outnumber those collected by the aggregators.
  • The community can add innovative data management tools, whereas conventional aggregators are traditionally responsible for their own innovation and will be slower.
  • The community will develop social models for data annotation

The prime barriers to this are restrictive practices and inertia. How long it takes to overcome these depends on the community. But I predict that within 5-10 years many of the current data aggregators in chemistry will have seen their “markets” seriously challenged by the gift economy. It will give better data, more data, better metadata and will be zero-cost. So why not start to embrace it now?

Posted in Uncategorized | Leave a comment

WWMM – The World Wide Molecular Matrix

We have been working on a general, fluid, concept which we labelled “World Wide Molecular Matrix” – starting about 2001. (We actually put in a grant application under that name to the then new UK eScience programme – it didn’t get funded but helped to sort out some ideas. Actually, things turned out better as we got a more limited, but more tractable, funding from UK eScience via the Cambridge eScience Centre and DTI).
The WWMM is described on our home page and, interestingly also in Wikipedia. (I don’t know who started this – that’s one of the great things about WP. But it’s gratifying to think that someone believes it’s worth an article.)
However the WWMM has taken its own course. When we had the idea in 2000/1 it was very much driven by the idea of music-sharing peer2peer systems. We believed that the same would work naturally for chemistry – simply promote the idea, produce some simple examples of how it might work and it would “build itself”. Of course it didn’t, and there are many reasons – some of which I have blogged and most of which come down to how chemists think and behave. And I made my usual mistake of assuming that the path from idea to deployment is trivial.
But the WWMM is now starting to take off. This is not due to any major single breakthrough but to a whole number of related ideas and technologies. They include:

  • CML is now robust, supported and widely deployed
  • InChI has catalysed the community to the realisation that metadata is valuable, works and is worth the effort
  • PubChem has shown that large free resources can be deployed and that people will contribute to them
  • The Open Access movement continues to gain ground
  • The concept of data-reuse is much stronger
  • The younger generation thinks in radically different ways and will not put up with the technical and social dysfunction of chemical providers
  • Google has shown the value of free-text indexing
  • Institutional Repositories are a reality
  • We have received funding (JISC/SPECTRa) to help preserve scientific data from loss.
  • We have continued to develop new informatics methods such as OSCAR3 (Peter Corbett)
  • The Royal Society of Chemistry has promoted the idea of hypertextual documents (Project Prospect)
  • The Blue Obelisk has pulled together much (most?) of the Open Source innovation in chemistry
  • The Crystallographic Open Database has pioneered the idea of self-contributed crystal structures – ca 480,000 structures
  • The biosciences are fed up with the conservatism of chemistry and are funding their own Open chemical resources (e.g. the ChEBI chemistry ontology)
  • The chemical blogosphere has blossomed into a mature, thoughtful, responsive community of new thinkers.

(and more that I have probably forgotten).
The point is that none of this was available in 2002. (If we had had largescale funding for a detailed managed project we’d have ended up somewhere different from where we are now. And perhaps out of sync…)
So why is the WWMM starting to take off? Isn’t PubChem in fact the WWMM? It’s clearly part of it. But the WWMM is broader – it’s not a single entity, good as PubChem is – it’s a set of technical metadata, social aspirations and protocols that allow collaborative knowledge-driven chemistry to flourish at near-zero cost.
I’m not claiming that the whole of the chemical net is the WWMM – it’s a smaller part. It is the idea that, in near-zero-cost, collaboratively planned activities, chemists can create and share certain types of data and certain types of shared knowledge. For example the crystallography, computational chemistry and spectroscopy data (thanks to SPECTRa and other contributions) represent a starting point.
Nick Day in our group has created a very exciting WWMM resource component, CrystalEye, which he presented at the March 2007 meeting of the ACS. CrystalEye (originally called CMLCrystBase) consists of a robot that harvests (scrapes) all legally allowed crystal structures in current publications (back to ca. 1992). [Certain publishers (ACS, Wiley, Springer) forbid the re-use of their data, so we don’t scrape them.] We are also adding in the Crystallographic Open Database (we have to remove duplicates). I’ll blog more about that later.
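The harvesting side is conceptually simple – something like the toy sketch below, repeated per publisher with publisher-specific URL patterns and politeness rules. The start URL and the link pattern are made-up placeholders, not what CrystalEye actually uses:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Toy sketch of a CIF-harvesting robot. The start URL and link pattern are
    // hypothetical placeholders; real publisher sites differ, and CrystalEye's
    // own code is considerably more careful (deduplication, error handling, etc.).
    public class CifHarvester {
        private static final Pattern CIF_LINK =
                Pattern.compile("href=\"([^\"]+\\.cif)\"", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            String tocUrl = "http://journals.example.org/issue/2007/04/"; // placeholder
            Matcher m = CIF_LINK.matcher(fetch(tocUrl));
            while (m.find()) {
                String cifUrl = new URL(new URL(tocUrl), m.group(1)).toString();
                System.out.println("would download: " + cifUrl);
                Thread.sleep(2000); // be polite to the publisher's server
            }
        }

        private static String fetch(String url) throws IOException {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            in.close();
            return sb.toString();
        }
    }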
So the Openness of some publishers, and the contributions of individuals, is an important part of the WWMM. So is the ability of OSCAR3 to scrape those parts of chemical articles that can reasonably be considered factual data (and I’m getting more bullish about what we can scrape and about some cunning ways of doing it). But the real power will come from the regular, zero-cost, automatic contributions of ordinary chemists in departments. If enough chemical theses are (a) Openly exposed and (b) in semantic form with agreed syntax (CML) and metadata, then the concept is proved. The work with repositories (OAI-PMH and OAI-ORE, together with the proven value of free-text indexing) creates the infrastructure for the matrix – the chemistry is layered on top. If all chemical data collected in chemistry departments are exposed in the same way then we will have built the matrix.
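For the repository layer the harvesting protocol already exists: an OAI-PMH ListRecords request is just an HTTP GET. A minimal sketch – the base URL is a placeholder for whichever repository exposes the data:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Minimal sketch of an OAI-PMH harvest: ListRecords over plain HTTP.
    // The base URL is a placeholder; metadataPrefix=oai_dc is the lowest common
    // denominator - a chemistry-aware prefix (e.g. carrying CML) would say more.
    public class OaiPmhSketch {
        public static void main(String[] args) throws Exception {
            String baseUrl = "http://repository.example.org/oai"; // placeholder
            URL request = new URL(baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(request.openStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                // In real use the XML would be parsed and the resumptionToken
                // followed until the repository reports completion.
                System.out.println(line);
            }
            in.close();
        }
    }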
So I’m working to help accelerate it. The toolkit is looking pretty good. Nick’s CrystalEye is great, and the JUMBO and CDK toolkits which power it are also used in SPECTRa. The metadata seem robust, CML-RSS works and can tailor output to individual chemists on demand. Chemical substructure searching is still a challenge, but can be managed for medium-size collections or through PubChem as a linkbase. And I suspect we shall create a standoff chemical search toolkit.
But the fundamental aspect of the matrix is that it is decentralised. All our software is Open and cloneable. Thus anyone can set up a CrystalEye or SPECTRa server. Anyone can download and run an OSCAR3 server (takes ca 10-15 minutes). In this way the whole chemical social computing community (which includes bioscientists, LIS etc.) can share the effort and excitement.
What’s missing?

  • Senior chemists
  • 20th-Century data and information aggregators
  • 20th century chemical software companies
  • The pharmaceutical companies

Does this matter? History will tell, but I doubt it.
It isn’t, of course, completely zero-cost, just as Wikipedia isn’t. But it’s not easy to get dedicated chemistry funding. JISC, EPSRC, DTI and Unilever have helped us in Cambridge. Some of the other collaborators have had useful funding. But the enormous contribution (CDK, NMRShiftDB, etc.) made by the group in Koeln (DE) through Christoph Steinbeck has had its funding terminated. So if there are organisations that are interested in supporting social computing in one of the most exciting current developments, let us know.

Posted in Uncategorized | 2 Comments