Paul Miller on the Web of Data

Paul visited us today (Paul Miller speaking at UCC) and gave a beautiful presentation on the Web of Data. Literally beautiful. He had worked very hard on preparing it and it flowed imperceptibly from simple beginnings to a current conclusion.
Luckily [*] one of us, Diana, couldn’t be present so I asked Nico if he would mind video-ing the presentation. He’s done this before and I think quite likes it, and I knew that Paul – with his efforts in CC and Open Data – would be supportive. So, assuming it’s been captured OK, we will be able to mount a video as well as the pictures.
In this case the words are as important as the slides, and I won’t try to summarise details. Paul started with something we all knew about and very gradually each of us – at different times – started learning something new. The message is an attitude of mind rather than a mantra of orientation – so please watch the video.
In the bar afterwards we shared some of our successes and occasional frustrations. Both groups are trying to change parts of the world and the world doesn’t always understand the change. I was impressed to hear how Talis – a solid, worthy, traditional library company – had re-engineered itself to have a dynamic web-oriented group outlook which manages to balance excitement with reality. They’ve put a lot of legal effort into designing their Open Data community licence, and they’re not possessive about it.
Where’s the payoff? None of us knows. But there is no doubt that “repositories” – whatever that means – will become very important. The first phase of University Institutional Repositories has many of the problems of a first phase, not least that the motivation is often unclear. I’ve not seen the Talis platform, so I’m not commenting – and it may be inappropriate for our sort of area – but I’m keen to make a reverse visit to them. (What’s the currently rather unusual aspect of their website? If you understand, it will be obvious.)
Note added later [*]: In an awful choice of phrase I didn’t mean that it was lucky that Diana wasn’t there, but because she wasn’t we had decided to video it.

Posted in open issues, semanticWeb | 7 Comments

Billion-dollar Scientific Scholarship?

Peter Suber seems to have connections everywhere and picked up this really exciting post about how there is a wide-open market to completely restructure scientific publishing. Alexandre Linhares is the Director-General of the Brazilian Chapter of the Club of Rome. (The Club of Rome published the very challenging Meadows report “Limits to Growth” which affected me deeply). So take it very seriously. If you are a publisher you should be either very excited or very afraid – there is no middle ground.
I think the proposal itself, which concentrates on authoring technology, is useful, but there will be much more. If we bring Open Scientific Data into this, as we should, the market goes up by a factor of ten IMO.
I was sufficiently impressed that I changed the topic of my talk last night to the Cambridge Network to scientific publishing and I argued that there was a billion-dollar market waiting to be exploited. More about that later.

02:12 07/10/2007, Peter Suber, Open Access News
Alexandre Linhares, A modest (billion-dollar) proposal, The Human Intuition Project, October 6, 2007. Linhares is the Director-General of the Brazilian Chapter of the Club of Rome. Excerpt:

Imagine the following scenario. A secretive meeting, years ago, when Apple’s Steve Jobs, the benevolent dictator, put in place a strategy to get into the music business….I have no idea how that meeting went, but one thing is for sure: many people afterwards must have been back-stabbing Jobs, and mentioning “the music business? We’re going to sell music? This guy has totally lost it.”
Fact of the matter was, technology had forever changed the economics of the music business, and Jobs could see it.
Having said that, I’d like to make a modest, billion-dollar, proposal, to the likes of Adobe, Yahoo, Apple, IBM, Microsoft, and whomever else might be up to the task.
Think about science publishing….
The economics of science publishing is completely crazy for this day and age….
[T]echnology has forever changed the economics of the scientific publishing business, and it’s high time for someone like Jobs to step forward.
[idea about Adobe Buzzword omitted as I think we can do better…]
Adobe Buzzword is specially suited to do this. Most scientific publishers…. Buzzword is just my favorite option….Other options could be desktop processors (MsWord, Pages, OpenOffice, etc)….
Now, why would the people in Adobe, Yahoo, SUN, IBM, Microsoft, Google, or others actually want to do a thing like that? There are two reasons. The first one is goodwill, the second one is money….
One crucial point is for the platform to be freely accessible to all. But you can do that, and still block the googlebot, the yahoobot, and all others “bots”, but your own. Let’s say, for instance, that Microsoft does something of the sort. In some years time, not only it gets the goodwill of graduate students who are studying papers published by science.microsoft.org (as opposed to hey-sucker-pay-thirty-bucks-for-your-own-paper-Elsevier), but also the way to search for such information would be only through that website. As we all know, advertising is moving online: according to a recent study, the last year saw “$24 billion spent on internet advertising and $450 billion spent on all advertising“. Soon we’ll reach US$100 Billion/year in advertising on the web. And imagine having a privileged position in the eyeballs of graduate-educated people, from medicine to science to economics to business to engineering to history….
Google might want to do it just to preempt some other company from blocking the googlebot to get its hands on valuable scientific research. Microsoft, the Dracula of the day, certainly needs the goodwill, and it could help it to hang on to the MS-Word lock in. Maybe Amazon would find this interesting–fits nicely with their web storage and search dreams. Yahoo would have the same reason as Google….
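
PMR: The crawler-blocking idea in the excerpt can be made concrete. The sketch below is only an illustration, using Python's standard `urllib.robotparser`: a site stays freely readable by people, but its robots.txt admits only its own crawler. The crawler name `housebot` and the example URL are invented for the purpose, and robots.txt is advisory, so this shows the mechanism rather than any real publisher's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the site's own crawler ("housebot", an
# invented name) may index; every other crawler is asked to stay out.
ROBOTS_TXT = """\
User-agent: housebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url="https://science.example.org/papers/1"):
    """Return True if the robots.txt above permits this crawler to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("housebot", "Googlebot", "Slurp"):
        print(bot, allowed(bot))
```

Human readers are unaffected by these rules; only well-behaved crawlers consult them.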

PeterS: Comments
His proposal raises a larger and more important question. I’d put it this way. Journal publishing is dysfunctional and unsustainable in its dominant form. Open access allows wider distribution and lower costs at the same time. Tools to produce it, services to exploit it, and business models to support it are all multiplying fast. There may be a trajectory in all this activity (and I believe there is), but at least there is ferment. The ferment smells like readiness. Now: Is journal publishing susceptible to a sudden transformation from an unexpected player using a killer app and business plan? Is there a Steve Jobs of journal publishing ready to act?

PMR: Publishers, take this analysis very seriously. Peter has his finger on your pulse better than you do. The values that you will sell in ten years’ time are yet-to-be-discovered but it is a certainty that trying to protect and resell the output of others will not be among them. I’d guess that we will be looking for trust, quality, flexibility and service as some of the key resellable products.
For money.

Posted in data, open issues | Leave a comment

Paul Miller speaking at UCC

I should have blogged this earlier but was too wrapped up in my talk for yesterday. Still, if anyone in the Cambridge area is reading this, Paul Miller of Talis is visiting us today and giving a talk in the afternoon (1415, 2007-10-09, Unilever Centre, Chemistry).
I can’t remember when I ‘met’ Paul, but he invited me to a session on Open Data that he ran at WWW2007 and I was extremely impressed with the activities that he and his extended collaborators are working on. I also heard elsewhere that he had given a very good talk on the future of data on the web so I suggested he should be one of our speakers.
Bio:

Paul Miller

Senior Manager, Technology Evangelist

blog: Thinking about the Future

Paul joined Talis in September 2005 from the Common Information Environment (CIE), where as Director he was instrumental in scoping policy and attracting new members such as the BBC, National Library of Scotland and English Heritage to this group of UK public sector organisations. Previously, Paul was at UKOLN where he was active in a range of cross-domain standardisation and advocacy activities, and before that he was Collections Manager at the Archaeology Data Service. At Talis, Paul is exploring new models of collaboration and identifying further areas in which our technology or knowledge would be of value. Paul has a Doctorate in Archaeology from the University of York.

PMR: Note that you don’t have to have a degree in Computer Science to be a world expert on the semantic web. The future requires people with a vast range of experience and skills.
Abstract:

Moving from a web of documents to the web of data
Dr. Paul Miller, Senior Manager & Technology Evangelist Talis (http://www.talis.com/)
The open web has been transformed by a tide of newly interactive applications, many of which approach the utility of software previously installed locally on your own computer. Flexible and interactive, they also leverage the power of the network to compare your behaviour with that of your peers (Amazon’s famous recommendation services are a prime example), and aggregate individually irrelevant actions together to deliver network-scale benefits.
Into this visibly evolving world of richly ‘Web 2.0’ web sites comes the long-held promise of Tim Berners-Lee’s ‘Semantic Web’. Escaping from the laboratory, previously esoteric technologies such as ‘RDF’, ‘OWL’ and ‘GRDDL’ are being put to work in building the next generation of the rich applications that have so transformed our use of the Web.
This presentation will illustrate and comment upon these trends, as well as introducing ways in which they might impinge upon all of you… and not just when you order a new book from Amazon in your lunch break.

PMR: Among many other talents Paul is able to blog talks as they are given. I can’t type fast enough to do this for Paul, but I’ll try to summarise my impressions.
And anyway, everything that Paul and Talis do is Open. They are leading the way in developing Open licences. I am sure it will be a good business investment.

Posted in semanticWeb, www2007 | 4 Comments

"This explains a lot"

Followers of Peter Suber’s blog know that he is one of the fairest, most objective, writers and thinkers on Open Access. He gives credit where it is due even for advances which he feels are largely suboptimal. I have corresponded with Peter but never met him. In two recent postings he uses words that speak very clearly about boundaries, especially the relation between financial reward and publishing.
I already blogged (Why Open Access really matters) on his This time a real threat to peer review and quality control where he says: “Some publishers worry out loud and groundlessly that OA will undermine peer review and quality control, but then work with pharma companies to undermine peer review and quality control themselves and profit handsomely from it.”
Now he reports a similar concern about the link of money and big business under the title:

This explains a lot (21:42 05/10/2007, Peter Suber, Open Access News)
He is reporting an article
Paul D. Thacker, Investigative reporting can produce a “higher obligation”, SEJournal, Summer 2007.
PeterS: Excerpt:

…[P]ublishing executives and senior editors at ACS get bonuses based on how well the publishing operation performs. These bonuses are approved through the committee on executive compensation….

and PeterS Comments

  • We’ve long known that the American Chemical Society (ACS) pays very high salaries to its executives, at least for a non-profit scientific society. For example, according to its 2005 tax returns, it paid Madeleine Jacobs $919,251 and Bob Massie $1,033,330. At the time, Jacobs was the society’s CEO and Massie was president of its Chemical Abstracts Service (CAS).
  • We’ve also long known that the ACS strongly opposed government OA policies, from PubChem to the NIH policy and FRPAA. According to Nature, the ACS was one of only three publishers (with Wiley and Elsevier) to hear the Dezenhall proposal that recently evolved into PRISM.
  • Thacker’s full article is of interest for another reason: he explains why he was fired from his job as a reporter for one of the ACS journals –after a phone call to his editor from an industry executive who happens to chair the ACS committee on executive compensation.
  • The article includes a sidebar in which two ACS officers dispute parts of Thacker’s account. Excerpt: “Bill Carroll, former ACS president, wrote to say…that he chaired the compensation committee but it does not evaluate or award bonuses to editorial employees.”
  • If your professional society has opposed government OA policies, try to find out whether its executives get bonuses based on the revenues or profits of its publications. If they do, ask in a public meeting whether they believe this is a conflict of interest.

PMR: Read Paul Thacker’s article and make up your own mind. I am not an American, or a member of ACS (although I am grateful for the interest of and interaction with their middle management, who have many times invited me to talk). I do know that the membership that I talk to felt that the discussion on Pubchem was stifled and this discontent appeared on public lists. An excerpt from PaulT’s account:

In February 2006, Bill Carroll, an executive with Occidental Chemical, called some of the society’s publishing executives to complain about my reporting. The American Chemical Society is a nonprofit that is run by an elected board and Bill Carroll was the president.

PMR: Non-chemists may not realise that the ACS has a large industrial membership both individual and corporate.
PMR: For chemical readers, Paul talks about PFOA (perfluorooctanoic acid), which gives me a chance to use the Wikipedia entry:

Perfluorooctanoic acid
IUPAC name: pentadecafluorooctanoic acid
Other names: perfluorooctanoic acid, PFOA
Identifiers
  CAS number: 335-67-1
  RTECS number: RH0781000
  SMILES: OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
Properties
  Molecular formula: C8HF15O2
  Molar mass: 414.07 g/mol
  Appearance: colorless liquid

There is a lot on the public controversy about PFOA. PaulT writes:

The article on the Weinberg Group, a product defense firm, grew out of a letter written by the Weinberg group to DuPont that I discovered in EPA’s docket on PFOA, a chemical used to make Teflon and other non-stick products. In this letter, the Weinberg Group detailed a campaign they hoped to organize for DuPont to protect them against lawsuits and federal regulations on PFOA. The Weinberg Group suggested creating studies to show that PFOA was not only harmless but actually beneficial and offered to find expert scientists that could help DuPont to prove this.

You’ll have to read Paul’s article to see the context. Because I am sitting at home I cannot access major chemical databases (which apart from Pubchem are all closed, pay-for-access). So you will have to make do with what the Wikipedia volunteers come up with (and, in case it came out wrong recently, I am a passionate supporter of this activity):

WP: The durability of PFOA prevents it from breaking down once in the environment, leading to widespread buildup and bioaccumulation in food chains. Traces of PFOA-family chemicals can now be found in the blood of nearly all Americans and in the environment worldwide. Scientists do not yet know how the chemicals are transported or exactly how dangerous they are to humans, although concerns about these issues caused its major manufacturer, 3M, to announce in May 2000 that it would cease producing the chemical. DuPont, one of the largest users of PFOA, then built its own plant in Fayetteville, North Carolina to manufacture PFOA.
DuPont has used PFOA for over 50 years at its Washington Works plant near Belpre, Ohio. Area residents sued DuPont in 2001, claiming the chemical contaminated area drinking water. As part of the settlement, DuPont is paying for blood tests and health surveys of residents believed to be affected. Up to 60,000 people are expected to take part in the study, which will be reviewed by epidemiologists to determine if there are any long-term health effects.
In 2004, DuPont came under investigation by the Environmental Protection Agency (EPA) for allegedly covering up knowledge of possible health effects of PFOA exposure in a study of pregnant employees, including evidence of PFOA in umbilical cord blood.
The EPA pursued charges against DuPont for failure to report violations filed under the Toxic Substances Control Act and the Resource Conservation and Recovery Act. On December 13, 2005 DuPont announced a settlement with the EPA in which DuPont will pay US$10.25 million in fines and an additional US$6.25 million for two supplemental environmental projects without any admission of liability.
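
PMR: As a footnote to the PFOA identifiers quoted earlier in this post, here is a rough sketch of the kind of consistency check that open data makes possible: does the SMILES agree with the molecular formula, and does the formula give the quoted molar mass? The atomic weights are rounded standard values, and the character-counting shortcut is an assumption that only holds for simple SMILES like this one.

```python
import re

# Rounded standard atomic weights (g/mol); only the elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "F": 18.998403, "O": 15.999}

# The SMILES from the infobox with its line break removed:
# a carboxylic acid group followed by seven fluorinated carbons.
PFOA_SMILES = "OC(=O)" + "C(F)(F)" * 7 + "F"


def parse_formula(formula):
    """Turn a simple formula such as 'C8HF15O2' into {'C': 8, 'H': 1, ...}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def molar_mass(formula):
    """Sum atomic weights over the parsed formula."""
    return sum(ATOMIC_WEIGHT[s] * n for s, n in parse_formula(formula).items())


def heavy_atom_counts(smiles):
    """Count C, F and O by character. Only safe for SMILES with no
    two-letter symbols, bracket atoms or aromatic lower-case atoms."""
    return {element: smiles.count(element) for element in "CFO"}


if __name__ == "__main__":
    print(parse_formula("C8HF15O2"))
    print(heavy_atom_counts(PFOA_SMILES))
    print(round(molar_mass("C8HF15O2"), 2))
```

The heavy-atom counts from the SMILES match the formula, and the computed mass agrees with the quoted 414.07 g/mol to two decimal places.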

Posted in open issues | Leave a comment

Thank you JCB for Free XML

From Peter Suber’s blog
17:26 04/10/2007, Peter Suber, Open Access News
Emma Hill, JCB content automatically deposited in PubMed Central (PMC), Journal of Cell Biology, October 1, 2007. An editorial. Excerpt:

Public access to JCB content
The JCB has long been a leader in providing free, public access to the science we publish. Since January 2001 we have released our content six months after publication, and we also provide immediate free-access to colleagues in 143 developing nations. And, in this my first editorial in the journal, I am delighted to announce another enhancement to our commitment to public access. As of November 2007 we will deposit all JCB content in PubMed Central (PMC), where it will be available to the public six months after publication.
PMC, developed and managed by the NIH’s National Center for Biotechnology Information (NCBI), in the National Library of Medicine (NLM), is a free digital archive of literature from biomedical and life science journals. Despite a previous reluctance on our side, we are now happy to provide the NLM with all of our content in XML format. The process requires a certain amount of finessing to ensure accurate conversion (hence the short delay until November). This change in our policy stems from the realization that XML content may have greater longevity than PDF files. We also recognize the necessity for multiple archives of our electronic content as print is phased out.
Additional placement of JCB content in the PMC archive ensures permanent and free access in a central repository alongside research from other leading journals. Our routine deposit in PMC represents de facto compliance for authors with policies formulated by many funding agencies requiring access to research they have funded after a short delay. This service will be free of charge to authors (although HHMI are welcome to pay us $1,000–$1,500 a pop if they so choose)….

Comment. This goes well beyond standard green policies that permit self-archiving. It even goes beyond the few that positively encourage self-archiving. It guarantees OA archiving and doesn’t leave it to the initiative of busy authors unfamiliar with their options. Automatic deposit in PMC is routine for OA journals at PLoS and BMC, but don’t forget that JCB is a TA journal. Moreover, JCB is depositing the published editions of its articles, not the unedited author manuscripts, and doing so from self-interest, not the goad of payments from authors or funding agencies. Compare the HHMI deal with Elsevier in which Elsevier would not deposit even the unedited author manuscripts in PMC without payment of $1,000-1,500 per paper. If other journals follow the lead of JCB, then the OA percentage of the new literature will rapidly approach 100% and funding agencies like HHMI will never again accede to archiving demands like those of Elsevier. Kudos to JCB and Rockefeller University Press.

PMR: This is good news. It is close to complete (delayed) Open Access. (For those unfamiliar with the background, HHMI (Howard Hughes) did a deal where funders paid a lot of money for not very much). My take on the current position is that if you don’t mind the six-month gap then the result is “nearly full Open Access”.
I say “nearly” because the license position isn’t clear from this report. Can the material be fully re-used without permission (maybe someone can comment)?
But the really good news is that what is being deposited is XML! Fantastic. PDF is awful. XML is wonderful. So if we can get at the XML we can start to do wonderful things – like text and data mining. In a sense if someone deposits XML they are inviting people to text-mine it. It’s like leaving out free beer.
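
To show why XML is an invitation to text-mine, here is a minimal sketch using only Python's standard library. The sample document and its element names (`article`, `body`, `sec`, `p`) are made up for illustration; they are an assumption, not the schema JCB or PMC actually uses. The point is that paragraphs come out as clean text with no layout to guess at.

```python
import xml.etree.ElementTree as ET

# A made-up fragment; element names are assumptions for illustration only.
SAMPLE = """<article>
  <front><article-title>An example paper</article-title></front>
  <body>
    <sec><title>Results</title>
      <p>Cells were treated with <italic>taxol</italic> for 2 h.</p>
      <p>No effect was seen with discodermolide.</p>
    </sec>
  </body>
</article>"""


def paragraphs(xml_text):
    """Return the text of every <p>, with inline markup flattened
    and runs of whitespace collapsed."""
    root = ET.fromstring(xml_text)
    return [" ".join("".join(p.itertext()).split()) for p in root.iter("p")]


def mentions(xml_text, terms):
    """Map each term to the indexes of the paragraphs that mention it."""
    found = {}
    for index, text in enumerate(paragraphs(xml_text)):
        for term in terms:
            if term.lower() in text.lower():
                found.setdefault(term, []).append(index)
    return found


if __name__ == "__main__":
    print(paragraphs(SAMPLE))
    print(mentions(SAMPLE, ["taxol", "discodermolide", "PFOA"]))
```

Doing the same from a PDF would first mean reconstructing paragraphs from positioned glyphs, which is where most of the errors creep in.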

Posted in open issues, XML | Leave a comment

Downtime, generic apology, and trivia

WWMM server is going down tomorrow morning (BST, ca 0900-1200 and UTC 0800-1100). So if you read manually or wish to comment, please don’t be surprised.
When I started this blog I did not expect it to take on a role of advocacy. In particular I did not realise that access to and use of information would be a major problem. So some posts are somewhat strong polemic. But:
“The worst offense that can be committed by a polemic is to stigmatize those who hold a contrary opinion as bad and immoral men.” [John Stuart Mill, 1806-73]
Sometimes while writing I say things that upset people. If so, I apologize. There are posts which I would have worded differently in retrospect and recipients include Jan Velterop, Maxine Clarke, Chemspiderman, and one of the editorial staff at Nature. It is not my intention to upset individuals. However it is often necessary to upset organizations and the line between an individual and a representative of an organization is a fine one. Blogs have an immediacy that articles don’t, and are replicated instantly. I will try to be careful, but I am afraid that there are still many many organizations whose practices are unacceptable and where exposure and strong comment is appropriate.
I will buy any or all drinks if we meet.
Now for an anonymous contribution from a reviewer whom I know well and who does a lot of reviewing. Details adjusted to anonymize.

The referee had commented that a number [of objects] was wrong (as could be seen from comparing the text with an illustration).

Reply:
“The number of [objects] has been changed to [n]. The number of [objects] appears to have been a numerical counting error.”


Posted in Uncategorized | 3 Comments

Why Open Access really matters

From Peter Suber’s blog. This time a real threat to peer review and quality control

22:26 01/10/2007, Peter Suber, Open Access News
Sergio Sismondo, Ghost Management: How Much of the Medical Literature Is Shaped Behind the Scenes by the Pharmaceutical Industry?  PLoS Medicine, September 25, 2007.  Excerpt:

There are many reports of medical journal articles being researched and written by or on behalf of pharmaceutical companies, and then published under the name of academics who had played little role earlier in the research and writing process. In extreme cases, drug companies pay for trials by contract research organizations (CROs), analyze the data in-house, have professionals write manuscripts, ask academics to serve as authors of those manuscripts, and pay communication companies to shepherd them through publication in the best journals. The resulting articles affect the conclusions found in the medical literature, and are used in promoting drugs to doctors….
This article enlarges the focus from ghost writing to the more general ghost management of medical research and publishing….
Several of the publication planning firms identified are owned by major publishing houses. For example, Excerpta Medica is “an Elsevier business” and writes that its “relationship with Elsevier allows… access to editors and editorial boards who provide professional advice and deep opinion leader networks”. Wolters Kluwer Health draws attention to its publisher Lippincott Williams & Wilkins, with “nearly 275 periodicals and 1,500 books in more than 100 disciplines,” and to Ovid and its other medical information providers, emphasizing the links it can make between its different arms. Vertical integration is attractive in the industry as a whole: at least three of the world’s largest advertising agencies own not only MECCs [medical education and communication companies], but also CROs….
The CMD [an MECC working for Pfizer] document obtained by [David] Healy suggests that during key marketing periods as many as 40% of published articles focusing on specific drugs are ghost managed….

PS: Comment.  What’s the OA connection?  Some publishers worry out loud and groundlessly that OA will undermine peer review and quality control, but then work with pharma companies to undermine peer review and quality control themselves and profit handsomely from it.

PMR: I have nothing to add except that this is very serious and very depressing. If you want a moral reason why we should adopt OA, this is a large part of it.

Posted in open issues | 2 Comments

Naming chemical compounds

At the risk of boring readers who already understand the issue of names, metadata, recursive annotations and versions, let me do this discussion to death.
I reiterate. A name by itself is neither right nor wrong. It is possible that the syntax might determine whether it is a name or not, but that’s all. “green feathery compound” was a name we gave to a lead compound in Glaxo. (It wasn’t a very good lead).
A connection table obeys some syntax, but otherwise is neither right nor wrong. COOOOOOC is a valid SMILES. So is CO(O)(O)OC. So is [H][U][Ge]. They are unlikely to exist, but they are still valid by the syntax rules.
The name “water” is associated with the compound with formula H2O. So is “wasserstoff”. So is “oxidane”.
I have made three statements. Some of you will assert that I am “wrong”. Some will say the name for H2O is “wasser”. Others might say that “wasserstoff” should be associated with the compound H2.
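
The point about connection tables, that syntactic validity is a separate question from chemical plausibility, can be sketched in code. What follows is a deliberately tiny syntax checker covering only a subset of SMILES (organic-subset atoms, bracket atoms whose contents are not inspected, bonds, branches and ring closures). It is an illustration under those assumptions, not a full parser, and it says nothing about whether a compound could exist. Note that elements outside the organic subset, such as germanium, have to be written in brackets.

```python
import re

# One token per match: bracket atom, organic-subset atom, bond symbol,
# ring-closure label, or a branch parenthesis.
TOKEN = re.compile(
    r"(?P<bracket>\[[^\[\]]+\])"
    r"|(?P<atom>Cl|Br|[BCNOPSFI]|[bcnops])"
    r"|(?P<bond>[-=#:/\\])"
    r"|(?P<ring>%\d\d|\d)"
    r"|(?P<open>\()"
    r"|(?P<close>\))"
)


def is_syntactically_valid(smiles):
    """Check token order, balanced branches and paired ring closures."""
    depth = 0            # current branch nesting
    open_rings = set()   # ring labels opened but not yet closed
    prev = None          # kind of the previous token
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            return False  # character outside the supported subset
        kind = match.lastgroup
        if kind in ("atom", "bracket"):
            kind = "atom"
        elif kind == "bond":
            if prev not in ("atom", "ring", "open", "close"):
                return False
        elif kind == "ring":
            if prev not in ("atom", "ring", "bond"):
                return False
            label = int(match.group().lstrip("%"))
            open_rings ^= {label}  # first use opens, second use closes
        elif kind == "open":
            if prev not in ("atom", "ring", "close"):
                return False
            depth += 1
        elif kind == "close":
            if depth == 0 or prev in ("open", "bond"):
                return False
            depth -= 1
        prev = kind
        pos = match.end()
    return prev in ("atom", "ring", "close") and depth == 0 and not open_rings


if __name__ == "__main__":
    for s in ("COOOOOOC", "CO(O)(O)OC", "[H][U][Ge]", "C1CCCCC1", "C(", "C1CC"):
        print(s, is_syntactically_valid(s))
```

The checker accepts a chain of six oxygens without complaint, which is exactly the point: the syntax has no opinion about chemistry.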

Name: ChemSpiderMan | 
Peter, Your original posting was on how long it took to copy a structure (well that’s my interpretation anyways).
If someone copies a structure and misses a fragment of the structure by default isn’t it wrong? If someone draws Taxol and reverses a stereocenter by accident is that taxol anymore? I don’t think so…I think it’s a poor copy, and wrong.

PMR: So if someone takes my statement “wasserstoff is H2O” and mistypes it as “wasserstoff is H2”, then Chemspider asserts they are wrong because they have made a typo. But they have made a statement with which most (German) chemists might agree.

This is the type of quality issue that is essential to have tracked.

PMR: If you wish to spend your life recording typos in chemical documents, I hope it is fulfilling.

Maybe I misinterpreted your point about right and wrong structures?

PMR: I think you have. Read Alice through the Looking Glass. Carroll understood the issue very well.

  1. Name: DrZZ |
    I agree wholeheartedly with Peter. I think the emphasis on curation is misplaced. It isn’t that the issues are unimportant, it is that many of the curation questions are essentially scientific questions and saying you are going to solve that by curation leads to a situation where these scientific decisions are made in ways that don’t become part of the database record. I think at this point in time the emphasis should be on an architecture that allows multiple structure/name claims from multiple sources to be compared and then ways to annotate and track the discussions that arise as consensus is reached on the inconsistencies. This is exactly what scientists do and I don’t see why we should build databases that hide these issues from the people who use them. To be sure most inconsistencies will be the equivalent of typos, but even these problems are ill served by the current set up. I can tell you that at least 3 or 4 other groups have contributed structures to PubChem that were originally copied from the DTP database and hence are not really independent data sources. If there was a problem in a DTP generated structure do we look at PubChem and say that because a number of independent groups have the same structure that there is a consensus developing on an alternative structure? Not if we knew that all the structures originally came from the same source. Setting up databases that assume the only information needed for a structure is the connection table, coordinates, and whether the representation is right or wrong throws away all the information of where the structure came from, how was it manipulated, what evidence was claimed in support of it, etc. It doesn’t even allow us to know what correction to make. We just had an example where someone found a name for a DTP compound that didn’t match the structure at all. It would be pretty easy to substitute the correct structure for that name, but in this case that would be exactly the wrong thing to do.
    The primary data from the supplier was the structure and it was a DTP mistake that associated an incorrect name with that structure. In my opinion the most pressing need is for an awareness of the importance of these data items and for an architecture that captures and makes use of them. I think PubChem has made a good start and I would very much like to encourage that, rather than use its limited resources to try to be the final (and hidden) arbiter of what is right and wrong.
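
PMR: DrZZ's argument about independence can be sketched as a data structure. The code below is a hypothetical illustration, not how PubChem or DTP actually store anything: each structure/name claim records who deposited it and, where known, where it was copied from, so that apparent agreement can be traced back to its real origins. The depositor labels and structure placeholders are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    name: str                          # compound name being claimed
    structure: str                     # e.g. a SMILES string or other key
    source: str                        # who deposited the claim
    copied_from: Optional[str] = None  # upstream source, if known


def root_source(claim, by_source):
    """Follow copied_from links back to the original source."""
    seen = {claim.source}
    origin = claim.source
    upstream = claim.copied_from
    while upstream is not None and upstream not in seen:  # guard against loops
        seen.add(upstream)
        origin = upstream
        parent = by_source.get(upstream)
        upstream = parent.copied_from if parent else None
    return origin


def independent_support(claims):
    """Map each structure to the set of truly independent origins behind it."""
    by_source = {c.source: c for c in claims}
    support = {}
    for c in claims:
        support.setdefault(c.structure, set()).add(root_source(c, by_source))
    return support


# Invented example: four depositors agree, but three copied from the first.
CLAIMS = [
    Claim("compound X", "structure-1", "DTP"),
    Claim("compound X", "structure-1", "group A", copied_from="DTP"),
    Claim("compound X", "structure-1", "group B", copied_from="DTP"),
    Claim("compound X", "structure-1", "group C", copied_from="group A"),
    Claim("compound X", "structure-2", "supplier"),
]

if __name__ == "__main__":
    print(independent_support(CLAIMS))
```

Counted naively, structure-1 wins four to one. Counted by origin it is one against one, and the record still shows who said what.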

PMR: There is no absolute way of deciding what the name of a compound is. There are authorities who make meta-statements. Thus IUPAC states that various chemical structures have various names. Chemical Abstracts states that various chemical structures have various names. Suppose they differ? Chemspiderman makes the final decision!?
No. The only “absolute” is if there are real-world consequences. If I state that I have sold compound X and it is safe, and someone else says that actually compound X is something else, then we have a court case. The lawyers will argue that Chemical Abstracts is more important than IUPAC. Or vice versa. And I go to jail because I got the wrong name. But I am neither right nor wrong, I have simply made a statement which conflicts with one made by a real-world authority who can send me to jail.
The only modern way to do this is with constant annotation, including versioning. This is what sites like Sourceforge and Wikipedia provide. And the latter has a form of cybergovernance which can never be absolute, even as Plato’s is not absolute, but it’s good enough for me.
So Chemspider is fundamentally and almost irretrievably broken because it does not have metadata. It deals with absolutes, while the modern world deals with assertions. And the technology – RDF/OWL – has now arrived to support assertions.
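
To illustrate what dealing in assertions rather than absolutes might look like, here is a small sketch in plain Python. A real system would more likely use RDF; this is only a stand-in. Every name-to-formula statement is kept in an append-only log with a version number and the party who asserted it, and nothing is ever overwritten or marked "wrong". The source labels are placeholders, not real authorities.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    version: int      # position in the append-only log
    name: str
    formula: str
    asserted_by: str  # placeholder labels, not real authorities


LOG = [
    Assertion(1, "water", "H2O", "source A"),
    Assertion(2, "oxidane", "H2O", "source A"),
    Assertion(3, "wasserstoff", "H2O", "source B"),
    Assertion(4, "wasserstoff", "H2", "source C"),
]


def view(log, up_to_version):
    """Group the assertions made so far: name -> formula -> list of asserters."""
    result = defaultdict(lambda: defaultdict(list))
    for a in log:
        if a.version <= up_to_version:
            result[a.name][a.formula].append(a.asserted_by)
    return result


def contested(log, up_to_version):
    """Names that carry more than one formula at the given version."""
    state = view(log, up_to_version)
    return sorted(name for name, formulas in state.items() if len(formulas) > 1)


if __name__ == "__main__":
    print(contested(LOG, 3))
    print(contested(LOG, 4))
    print(dict(view(LOG, 4)["wasserstoff"]))
```

At version 3 nothing is contested; at version 4 "wasserstoff" is, and the log shows who made each claim, which is the information a community needs to reach its own decision.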

Posted in data, semanticWeb | 2 Comments

Can chemical structures be right or wrong?

Chemspiderman has commented…

  1. ChemSpider Blog » Blog Archive » Dictionary Lookups and Optical Structure Recognition Versus Structure Drawing. Which is Less Error Prone? Says:
    October 2nd, 2007 at 5:48 am
    […] Luqidcarbon has put up a recent blog posting about the speed by which he/she can draw structures in ChemDraw and asked for challengers. PRM has commented in Chemical SpeedDrawing. The challenge is outlined below… […]
  2. ChemSpiderMan Says:
    October 2nd, 2007 at 6:21 am
    Peter, I think the structure of discodermolide is wrong…this is where a look-up in a reference dictionary is necessary…and I think we both support that effort. But it MUST be curated. it IS correct on Wikipedia but drawn incorrectly by liquidcarbon and everyone afterwards…
    It is why I favor the scan and convert software for this…there is the version from Marc Nicklaus’ lab but I must admit that my present bias is to use CLiDE (http://www.simbiosys.ca/clide/index.html) because it can be batched and because the results appear to be so far ahead of the Open Source code at present. We do not have time to work on the Open Source support at present as ChemSpider is very distracting and we are focused on potentially using the batch processing for extracting novel structures from Open Access articles.
    I put a detailed blog posting about this at: http://www.chemspider.com/blog/?p=180

PMR: I have already posted on this blog that – in general – chemical structures are not right or wrong. They may be associated with other information and the chemical community as a whole decides that this association is useful or counterproductive. Please read the argument carefully.
If, for example, I write CH5, is this structure wrong? It violates the valency rule, after all. No. It’s not wrong, it just can’t be found in a bottle in most labs. It can be found in mass specs and interstellar space. There is an arrogance in the chemical informatics community that assumes the only discipline that matters is synthetic organic chemistry. In general no chemical structure that obeys the algebra is wrong. (The algebra says things like “no fractional charges on molecules” (although there can be on crystal cells) and “if A is bonded to B then B is bonded to A”.)
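The two example rules of “the algebra” quoted above can be sketched as a small check. This is only an illustration, not any toolkit's actual validator: the function name `check_algebra`, the dictionary layout and the atom labels are all assumptions made for the sketch. Note that it deliberately applies no valency rule, so a five-coordinate carbon passes.

```python
from fractions import Fraction


def check_algebra(bonds, charges):
    """Return a list of violations of the two example rules (empty if none).

    bonds   -- hypothetical layout: atom label -> set of bonded atom labels
    charges -- hypothetical layout: atom label -> charge (int or Fraction)
    """
    problems = []

    # Rule 1: if A is bonded to B then B is bonded to A.
    for a, neighbours in bonds.items():
        for b in neighbours:
            if a not in bonds.get(b, set()):
                problems.append(f"{a} is bonded to {b}, but {b} is not bonded to {a}")

    # Rule 2: no fractional charge on the molecule as a whole.
    total = sum((Fraction(c) for c in charges.values()), Fraction(0))
    if total.denominator != 1:
        problems.append(f"total charge {total} is fractional")

    return problems


# CH5: one carbon bonded to five hydrogens (labels are illustrative only).
hydrogens = ["H1", "H2", "H3", "H4", "H5"]
ch5_bonds = {"C": set(hydrogens), **{h: {"C"} for h in hydrogens}}
ch5_charges = {atom: 0 for atom in ch5_bonds}

# No valency rule is applied, so CH5 is judged only on the two rules above.
print(check_algebra(ch5_bonds, ch5_charges))
```

A structure only fails here if its bookkeeping is inconsistent, which is the sense in which a structure that “obeys the algebra” is not wrong.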
There are unacceptable uses in mainstream C19 organic chemistry, such as carbon with 5 “valencies”. Such structures may be deemed “wrong” by organic chemists. It was clear that when Chemspider was set up the support for inorganic compounds was almost non-existent – I pointed this out and I think the position is improved somewhat. But I don’t have time to check – I expect there are many compounds represented by discrete “connection tables” which in my view are far worse chemical sins. But I am turning my attention elsewhere.
So “Peter, I think the structure of discodermolide is wrong”. No. I think this means “liquidcarbon has drawn a structure to which s/he has associated the name ‘discodermolide’ and Chemspiderman thinks this association is incompatible with current usage.” OK. Discodermolide is a substance of relatively minor importance compared to penicillin G and THC. It has 103 hits in Pubmed, compared with 30,000 for taxol. Maybe it will become famous one day. Until then I don’t really care that liquidcarbon may have got it “wrong”.
What I do care about is that we develop a community process – not regulated by a closed commercial company or a closed learned society division – that allows us to converge towards a cluster of agreed names at any point in time. In some cases this is easy – I think we all agree what Pen-G is – in some cases this is a question of removing known errors – and Wikipedia is great for this. (BTW I made a correction to the structure of Acetyl-CoA in Wikipedia, and the wikichemists agree the structure is now “correct” – but this is a natural part of using WP and I do these things every other day).
Pubchem has got it right. It simply records what name a human or organization has attached to a connection table, and gives the reference. That is all it needs to do. We then, as a community, need to evolve a Web 2.0 mechanism for annotation that allows us to find the “right” structure rapidly.
That’s the sort of thing we shall soon start to be doing with the peer-reviewed literature – if our grant gets funded. Social computing to create consensus on data and names. All Open. All in public view. Versioned. With metadata. And until the chemical “databases” adopt C21 metadata they are largely useless in the C21. Pubchem understands this. And ChEBI, and some Blue Obelisk efforts. No-one else seems to have got the point.
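The idea of recording who attached which name to which connection table, with a reference, can be sketched as a plain record plus a tally. This is a minimal illustration under stated assumptions: the class `NameAssertion`, its field names, the helper `structures_for` and the placeholder data are all hypothetical, and none of it reflects Pubchem's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NameAssertion:
    """One statement: 'this source attached this name to this structure'."""
    name: str          # the name that was attached
    structure: str     # hypothetical identifier for a connection table
    asserted_by: str   # the human or organization making the statement
    reference: str     # where the statement was made


def structures_for(name, assertions):
    """Tally which structures have been associated with a name.

    Returns (structure, count) pairs, most frequently asserted first.
    Nothing is labelled right or wrong; the tally only shows where
    the recorded statements currently cluster.
    """
    counts = Counter(
        a.structure for a in assertions if a.name.lower() == name.lower()
    )
    return counts.most_common()


# Placeholder data, invented purely for illustration.
assertions = [
    NameAssertion("compound X", "structure-1", "source A", "reference 1"),
    NameAssertion("compound X", "structure-2", "source B", "reference 2"),
    NameAssertion("compound X", "structure-1", "source C", "reference 3"),
]

print(structures_for("compound X", assertions))
```

Keeping every assertion, rather than overwriting with a single “correct” value, is what leaves room for versioning and later annotation.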

Posted in chemistry, open issues | 2 Comments

The chemical blogosphere cares

Wow! I posted a request yesterday (sic) for supporting material for our proposal to JISC for a person to support the blogosphere as a major resource for increasing the quality of published chemistry. I have had valuable contributions from 4 people already and now Egon has created a wonderful summary – just the right length. We’ll either include it as it stands or point to it from the proposal – depends on space. [Recall that Egon’s Chemical Blogspace blog aggregates the whole of chemical blogspace.]

17:01 01/10/2007, Egon Willighagen, chemical blogspace, pmrgrantproposal, chem-bla-ics
Peter is writing up a 1FTE grant proposal for someone to work on the question how automatic agents and, more interestingly, the blogosphere are changing, no improving, the dissemination of scientific literature. He wants our input. To make his work easy, I’ll tag this item pmrgrantproposal and would ask everyone to do the same (Peter unfortunately did not suggest a tag himself). Here are pointers to blog items I wrote, related to the four themes Peter identifies.
The blogosphere oversees all major Open discussion

The blogosphere cares about data

Important bad science cannot hide
I do not feel much like pointing to bad scientific articles, but want to point to the enormous amount of literature being discussed in Chemical blogspace: 60 active chemical blogs discussed just over 1300 peer-reviewed papers from 213 scientific journals in less than 10 months. The top 5 journals have 133, 78, 68, 57 and 48 papers discussed in 22, 24, 10, 11 and 18 different blogs respectively. (Peter, if you need more in depth statistics, just let me know…)
Two examples where I discuss not-bad-at-all scientific literature:

Open Notebook Science
I regularly blog about the chemoinformatics research I do in my blog. A few examples from the last half year:

Update: after comments I have removed one link, which I need to confirm first.

PMR: A few comments. Yes, I didn’t include a tag – but as I have said before the blogosphere rapidly converges. I sympathize with Egon that I don’t particularly like pointing to bad articles. However when the robots start refereeing journals – as they will in our project – they don’t have sentiments and if they find bad data they will expose it without a qualm. Of course we will have to check they “hardly ever” make mistakes (no one is perfect). And, of course, if you publish in Open Access journals there is no place to hide.

Posted in blueobelisk, chemistry, open issues | Leave a comment