Electronic Theses (ETD2007)

I am honoured to be asked to speak at the meeting next week in Uppsala on electronic theses (The Power of the Electronic Scientific Thesis). (This resonates with the JISC meeting on repositories (Digital repositories: Dealing with the digital deluge) which I haven’t yet been able to blog as our server is only just back up.) Some snippets:

Yet our own work in the SPECTRa project has shown that 80% (or more) of scientific data is never published… Electronic theses have the power to change all this. The thesis has several major advantages over current methods of publication:

  • the author and/or their institution retain complete control over the copyright of the work and are not forced to hand it over to the publisher
  • there is a strict quality control system of internal and external examiners. The candidate has to convince them that the data are fit for purpose.
  • the student cannot be “lazy” about the means of authoring. If a university insists on XML then the student will have to do it.
  • an electronic thesis can (and I argue must) be openly available in an institutional repository.
  • an unlimited amount of supporting data can be copublished.

There are technical and socio-political barriers.

  • the thesis is often produced in some form of e-paper (TIFFs or PDF) which completely destroys the semantics
  • XML tools are not yet universal
  • there is no metadata for the scientific data
  • the authors and their supervisors are afraid that someone might read the thesis and (a) show there are errors (b) re-use it in clever ways thus “scooping” the authors. (This is sometimes contaminated with the problems of patents and confidential human information – but there are well accepted mechanisms for this). There are no moral reasons why the average thesis should not be fully visible to the world and re-usable under the BOAI declaration.
  • the university has medieval rules of ownership and copyright but enlightened ones now routinely post their theses.

My utopian vision is that students prepare their theses in XML. This solves all the technical problems. It will also help the students to prepare better theses faster. For example, students are often criticised for not giving scientific units, omitting scales and labels on diagrams, missing out critical information, etc.
I suggest the following simple rules:

  • invest in XML authoring technology for theses (it is then automatic to create PDFs)
  • invest in communal XML languages (MathML, CML, SVG…) for the major scientific domains and to check the quality of material
  • develop departmental awareness and practices for capturing data at source. Our SPECTRa project has done this for crystallography, computational chemistry and spectroscopy.
  • until then ALWAYS co-deposit a Word or LaTeX document, never just the PDF
  • add a copyright notice such as Science/Creative Commons to protect the data from being appropriated by publishers
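As a rough sketch of the kind of automatic checking XML authoring makes possible – note that the element and attribute names below are invented for illustration and are not real CML – a few lines of Python can flag a measurement that omits its units, exactly the kind of error examiners complain about:

```python
import xml.etree.ElementTree as ET

# Hypothetical thesis fragment: element/attribute names are illustrative only.
doc = ET.fromstring("""
<thesis>
  <measurement name="melting point" value="411" units="K"/>
  <measurement name="yield" value="85"/>
</thesis>
""")

# Flag any measurement that omits its units attribute.
missing = [m.get("name") for m in doc.iter("measurement") if m.get("units") is None]
print(missing)  # -> ['yield']
```

A PDF offers no hook for this sort of check; an XML source makes it a one-liner.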

I also prepared a “manifesto” for the JISC meeting – it overlaps with the rules but adds

  • Theses must be born-digital (i.e. NOT PDF)
  • Domain ontologies must be used
  • All data must be included in theses
  • Data must be validated before submission
  • Theses must be openly exposed to data and metadata crawlers
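Exposure to metadata crawlers is already well catered for: most repositories speak the OAI-PMH protocol. As a hedged illustration (the repository name and identifiers below are invented), this sketch parses the kind of XML that an OAI-PMH ListIdentifiers request returns:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# A canned fragment of the sort of response an OAI-PMH ListIdentifiers
# request produces; the identifiers are invented for the example.
response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.repo:thesis-1</identifier></header>
    <header><identifier>oai:example.repo:thesis-2</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""

root = ET.fromstring(response)
ids = [h.findtext(OAI + "identifier") for h in root.iter(OAI + "header")]
print(ids)
```

The hard part is not the protocol but getting the data and full text exposed under terms that allow this harvesting in the first place.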

One critical point from the JISC meeting was that in most institutions the copyright of the thesis is vested in the author (student) (although sometimes it is the institution). For born-nondigital theses this makes it VERY difficult to re-use without explicit permission from the author. A human can read, but not re-use.
This is compounded by the use of the term “Open Access” to describe theses. My interpretation of Open Access is strict BOAI (Budapest Open Access Initiative):

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

Unfortunately it is common practice for many at the JISC meeting to talk of “Open Access” when they mean “Toll Free”. I asked several organizers of thesis repositories specifically whether my robots could download these “Open Access” theses, text-mine them, and publish the results. In all cases I was told that for existing theses this was not allowed. However most agreed that born-digital theses had the opportunity for authors to make their theses fully Open.
The single most important rule, therefore, is that authors should be very strongly encouraged to make their theses fully Open under the BOAI and given the technical and legal tools to do so. Although in many disciplines this is complex (the thesis could contain third-party material, or creative works of the author that they hold valuable (e.g. music, poetry, art…)), in most sciences it is negligible. I would be surprised if many current chemistry PhD students wished anything other than full re-use of their material. (Yes – it’s frightening – there will be errors – inevitably. I am anything but proud of my own thesis presentation and know there are errors, but I might go back and scan it in all the same when I have time.)
Now I’m going to appeal to the chemistry community to see if there are any Open theses I can use.
BTW I am tagging this and future relevant posts as etd2007

Posted in chemistry, etd2007, open issues | Leave a comment

WWMM is back

Our server suffered physical damage due to a power problem and has been off air for several days. Many thanks to the Computer Officer team who have physically mended it.
I have a lot that I would like to write about from the JISC repositories conference and elsewhere…
P.

Posted in Uncategorized | Leave a comment

Podcast on Semantic Web, Open Access and Open Data

Paul Miller of Talis interviewed me over the phone today and the result has been captured as a podcast. It’s rather longer than I suspect either of us expected (70 mins) – I have spent some time explaining the basics of scientific publication, citation, and later RDF and the semantic web.
I have been thinking about podcasting – I suspect this example is too long and that 15 minutes would be the maximum – feedback welcomed. (It could reasonably be split in the middle – the first part being Open Access, the second being Open Data and the Semantic Web.) One downside is that Google doesn’t index the audio. I had a suggestion some time back that there was software which could transcribe audio to a reasonable degree – although this would have errors it might be good enough for indexing.
There has been no editing. Listening to the replay – I sound too pessimistic about Open Access – I should probably say that I see this too much from a chemical point of view and that other subjects are moving much faster.

Posted in data, open issues, semanticWeb | Leave a comment

Data validation in publications

Tony Williams’ comment to my post (Data validation and protocol validation – May 31st, 2007) has several valuable themes which I expand on here and in later posts. Tony and I are in agreement here and working towards something that we can separately and jointly promote. To summarise:

  • The published literature (even after peer-reviewing and technical editing) contains many factual errors.
  • machines can help to eliminate many of these before and during the publication process. In severe cases they can prevent bad science.
  • techniques include the direct computation of observed data and the comparison of data between datasets.
  • this is only reasonably affordable if the data are originally in machine-understandable form. PDF is not good enough.
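As a toy sketch of the comparison idea in the summary above – all the shift values and the tolerance are invented for illustration, and real structure-verification engines are far more sophisticated – flagging observed values that disagree badly with computed ones takes only a few lines once the data are machine-understandable:

```python
# Toy comparison of observed vs computed 13C shifts (ppm); values invented.
observed = {"C1": 128.5, "C2": 77.2, "C3": 214.0}
computed = {"C1": 128.9, "C2": 76.8, "C3": 170.1}

TOLERANCE = 10.0  # ppm; a deliberately loose, purely illustrative threshold

# Any atom whose observed and computed shifts disagree by more than the
# tolerance is flagged for human inspection.
suspect = [atom for atom in observed
           if abs(observed[atom] - computed[atom]) > TOLERANCE]
print(suspect)  # -> ['C3']
```

None of this is possible when the numbers are locked in a PDF; it becomes trivial when they are published as data.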
  1. Antony Williams Says:
    May 31st, 2007 at 6:12 pm I have blogged on your comments on the ChemSpider blog with a trackback and we are in general agreement re the intent and value of the NMRShiftDB.
    I wanted to comment separately on “The role of the primary publisher is critical.” I agree that they can make it a lot easier to extract information and let’s discuss NMR data for now since this is the focus of this discussion. Validation engines will be required to confirm literature NMR data since year on year we have identified 8% errors in the peer-reviewed literature. Your comment re. 1% is one concern…8% is at a whole different level

PMR: It’s important to define what is an error (and much of the debate about NMRShiftDB is about what is an error). Because of the byzantine method of hand-publishing to PDF, many transcription errors get included. Our rough estimate is that many if not most published papers in chemistry contain at least one transcription error (maybe only punctuation, but it fouls machine-reading).
A second procedural error is associating the wrong molecule with data. We don’t know how common this is but we have certainly seen it. I suspect that these two errors run at the 1% level.
These are separate from “scientific errors” where the “wrong” structure is proposed (see below) or where the assignment (i.e. annotation) is “wrong”. I have no comment on these yet – maybe this is the 8%.

  1. Improved automated checking of data is possible. It is one of our primary missions to perform structure verification by NMR as well as auto-assignment and computer-assisted structure elucidation. These technologies are not in their infancy…they are on the maturity curve now. The adoption of such tools by publishers, whether commercial or open source, will be essential if the generation of Open Access QUALITY databases is to proceed. I think I’m speaking to the converted of course….

PMR: completely agreed. We have done the same thing with crystallography and discovered a number of experimental errors and artifacts. Routine calculation of molecular geometry and NMR spectra should now be a prerequisite for these types of studies.

  1. As an example of how computer algorithms for validation of NMR assignments can outperform even skilled spectroscopists I highlight the debacle around hexacyclinol. A search on this term tells an interesting story cited as “into the biggest stink-bomb in organic synthesis in many years” (http://pipeline.corante.com/archives/2006/07/23/hexacyclinol_rides_again.php). The Chemical Blog declares “La Clair to get ass handed to him on hexacyclinol” (http://www.thechemblog.com/?p=108). The story regarding NMR validation algorithms comes AFTER the material was synthesized and AFTER a crystal structure proved the structure and AFTER full H1 and C13 assignments were made of the material. The algorithm went on to show that the assignments were incorrect allowing 7-bond couplings. We have worked with the authors to reassign the molecule and a publication is in preparation to report on the FINAL assignments..and potentially the end of this story.

PMR: fully agreed. The blogosphere had a field day with this and helped to raise the issue of quality in publishing.
There is a general concern among many principals that scientific fraud or sloppiness is of sufficient concern that data must be deposited in repositories so that questions such as this can be at least partially addressed by referral back to the raw data.
So publishers can help by:

  • insisting machine-readable raw data is available
  • using computer-validation where possible.

Note, of course, that the crystallographers already do this – I shall blog on this again very shortly.

Posted in chemistry, data, open issues | Leave a comment

Data validation and protocol validation

This post replies to an ongoing debate about the quality of data and Open vs Closed data and systems. It’s specifically about NMR (spectroscopy) but my points are general. Since I have been publicly critical of some systems I must be prepared to take criticism myself (as happens here).
In Update: Robien on NMRShiftDB Ryan Sasaki (Technical Marketing Specialist for ACD/Labs) writes [PMR’s comments interspersed]:

If you have read my earlier post, you will be aware of Wolfgang Robien’s critique of the NMRShiftDB. Following this critique, Tony Williams from the ChemSpider Blog  and Peter Murray-Rust from the Unilever Cambridge Centre for Molecular Informatics replied to Wolfgang’s comments.

Well now, it appears that Wolfgang has responded to Tony’s comments. You can find his response here.
It appears that Wolfgang remains firm in his stance that the NMRShiftDB is not a good resource for scientists as it contains too many errors. He continues with the comment, “But: Enjoy – it’s free!”

PMR: My case has been that science is impoverished by lack of access to data and information. Neither is free, but there are new methods which lower the costs dramatically and also redistribute them. “Free” may mean underfunded and therefore lower quality, or it may mean “open” and capable of dramatic improvement by the community. In the case of NMRShiftDB I am firmly of the opinion that it leads the way in opening access to scientific information. If the community wishes it can use it as a growing point to develop more and better data. If they don’t, they will continue to use existing non-open systems (or in most cases not use anything at all).
I also state publicly that I support the activity of NMRShiftDB for several reasons. Firstly data (which for me are the central point of NMRShiftDB in this discussion):

  • It allows and promotes the public aggregation of data.
  • It contains mechanisms for assessing data quality automatically. For example software can be run that will indicate whether values are seriously in error.
  • It allows the public to identify errors and report them. It also allows the creation of a developer/committer community that spreads the load of this process.
  • It allows mashups against other data resources (computation, crystal structures, etc.)
  • It acts as a model system that can be adapted for laboratories that wish to develop their own Open data aggregation systems. We had a DAAD–British Council grant to collaborate with Koeln directly for this process and we see NMRShiftDB as a potential model for the extension of our SPECTRa program to capture data into institutional repositories.
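A minimal example of the kind of automatic quality assessment meant above – the entries and the plausibility window are invented for illustration – is a simple range test, which is enough to catch a decimal-point transcription slip:

```python
# Illustrative sanity check: flag 13C shifts outside a plausible window.
# The window bounds and the entries themselves are invented for the example.
PLAUSIBLE_13C = (-10.0, 240.0)  # ppm

entries = [("C1", 128.4), ("C2", 771.9), ("C3", 29.7)]  # 771.9: likely a slip for 77.19

out_of_range = [atom for atom, shift in entries
                if not PLAUSIBLE_13C[0] <= shift <= PLAUSIBLE_13C[1]]
print(out_of_range)  # -> ['C2']
```

Crude checks like this catch the grossest errors cheaply; the public nature of the database then lets the community report and fix the subtler ones.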

Now software. I have no comment as to the relative merits of NMRShiftDB software against commercial systems. However the history of Open Source in chemistry has shown that within a few years software can be communally developed to become leaders in the field. 7 years ago relatively few people had heard of Jmol – now it is one of the leading display packages and widely used by publishers, pharma companies, etc. Similarly OpenBabel was a mess 5 years ago and the community has now put in so much work that they have made it the leading format conversion tool. It is therefore quite possible that NMRShiftDB software can do the same for the NMR community. Certainly if anyone is intending to build an NMR repository I would urge them to look closely at NMRShiftDB.

So I have a couple of responses in regards to Wolfgang’s comments in his follow-up:
When doing this job in a more systematic way not using specific examples as given here, the total number of incorrect assignments exceeds the above mentioned limit of 250 significantly. The intermediate number is at the moment around 300, but about ca. 1,000 pages of printouts are waiting for visual inspection.
Is 300 vs. 250 errors in a dataset of over 200,000 chemical shifts SIGNIFICANT? Is a difference of 50 errors in this dataset statistically significant? That’s 0.025%. I await Wolfgang’s final results and then we can judge whether it is significant. Meanwhile, he should also read the document we produced comparing the prediction accuracy between ACD/CNMR Predictor and Modgraph’s NMRPredict if he wants to challenge our findings. I think it is a good place to pick up our conversation.
“I definitely do not claim, that collections like CSEARCH, NMRPredict and SPECINFO are free of errors – the desired level of errors is always 0.0%; a value which can’t be reached – the acceptable limit is clearly below 0.1%, maybe 0.05% is good compromise between dream and reality.”
I agree, as I mentioned in my last post that while the desired level of error is 0.0%, this is a value that cannot be reached. I certainly would not claim that our prediction databases are free of error. Further, our work reveals about 8% errors in the form of mis-assignments, transcription errors, and incorrect structures within the peer-reviewed literature we comb. Error is human nature.

PMR: One of the core skills for all 21st-century humans is to make judgements about the usability of any information. Without NMRShiftDB my current access to spectra is minimal. If I have 20,000 entries with 1% error that is an enormous advance. Biologists work every day with the knowledge that their gene identifications, sequences, annotations, etc. have serious errors and they try to measure that error rate.

Let me say, I am very confused by the positioning of this question to Christoph Steinbeck:
“Why do you “reinvent” existing systems – there are a lot of systems (with much better performance !) already around  (a few in alphabetical order: ACD, CSEARCH, KnowItAll, NMRPredict, SDBS, SPECINFO)”
Why reinvent existing systems? To improve! To provide better resources for NMR spectroscopists and scientists around the world! While there are certainly better-performing systems to date, there is no reason to believe that these existing systems cannot be surpassed in terms of performance. Further, they offer an alternative to those institutions that do not have access to commercial products.
I think that Wolfgang is misunderstanding something here. From his writing, it seems that he feels threatened by the NMRShiftDB and is trying too hard to discredit the hard work and ideas behind this open source collection. What NMRShiftDB is providing is something very different from anything the commercial products he names are offering. It is a truly open access and open source offering where scientists and spectroscopists can freely share their data and build an NMR database that is freely available to the scientific community.
It’s FREE! It’s not a commercial product like the ones he compares it to!

PMR: I would re-iterate this and add: It’s OPEN.

Christoph’s group is handling this very well and he mentions himself,
validations like Robien’s and the ones performed by us help make a strong case for open access and open source policy.

PMR: certainly. We are in the process of going live with CrystalEye, a near zero cost crystallographic knowledge base. We have made efforts to identify the error rate (which is lower than NMR but non-zero). Our value will be judged on the validity of the protocol, not the validity of individual entries (though we shall be adding automatic checks to them).

Finally, as I mentioned above, I can only make the assumption that Wolfgang has not seen my blog posting that compares the results of his algorithm vs. ACD/Labs. It should make for an interesting discussion.

PMR: In the future the means of publishing pre-validated data will continue to increase so large amounts of Open high-quality data will become available. The current method of human curation of data will only be useful where the values of the data are so important that life or law depends on them.
The role of the primary publisher is critical. If they want they can help speed up this process; if they want to possess and constrict (cf. Wiley copyrighting data) they will slow it down, but ultimately lose both the battle and their credibility.
I shall write more on our strategy in coming weeks.

Posted in chemistry, data, open issues | 3 Comments

Open Data in biomedical science

Heather (Research Remix) has a most important post on data sharing – she has analysed the data deposition policies of some of the major journals/publishers. Note that this is orthogonal to Open Access – not all these publishers are OA, but many are aggressive about requiring data deposition – and that’s good. My comments during and after her post:

Diverse journal requirements for data sharing

Filed under: publishing, data — Heather Piwowar @ 9:04 am
Many academic journals make sharing research data a requirement for publication, but their policies vary widely. I’ve been wanting to understand this better: below is a summary of my Tuesday Morning Delve into the world of “Information for Authors”. I selected 10 journals, two from each of the following ad hoc categories: general science (Nature and Science), medicine (JAMA and NEJM), oncology (JCO and Cancer), genetics (Human Molecular Genetics and PLoS Computational Biology), and bioinformatics (Bioinformatics and BMC Bioinformatics). The results are obviously just the tip of the iceberg, but I found them enlightening.
PMR note: although Science and Nature are general journals almost all the emphasis is on biomedical in this discussion. I would not be surprised to find that the requirements were very different in – say – chemistry or materials science.
Nature has the most stringent requirements, followed closely by Science. These journals required data sharing for the most diverse types of data, specified acceptable databases and escrow requirements, and actually had “teeth” clauses… they specify a statement of consequences for times when you ask for data and the authors don’t provide it. The medical journals do have requirements for clinical trials registries, and sometimes suggestions for data inclusion based on clinical trial design, though they have no mention of requirements or encouragement for sharing (obviously de-identified) research data except that NEJM requires sharing microarray data. I’m out of time this morning to highlight the other findings, but you can have a look for yourself below. These rough conclusions of mine are consistent with Table 2-1, “Policies on Sharing Materials and Data of 56 Most Frequently Cited Journals”, in [Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences (2003). National Research Council of the National Academies]:
Their more exhaustive (though dated) analysis also suggests that few clinical-medicine journals have a policy, or if they do it rarely mentions depositing data. About half the life-science journals have some kind of a policy about depositing data. Almost no journals have a statement of consequences.
Conclusions: kudos to Nature and Science. I’m surprised that the policies of other journals are so lax.
PMR: I am afraid I am not surprised. I don’t know about medical science but the major commercial journals have no incentive for data deposition, although a senior representative from Wiley told me they copyrighted data so they can sell it back to us.
Not sure this analysis is worth digging into more deeply. It isn’t quite where my research is headed, though I do believe the trends would be informative. If anyone else wants to use this as a starting point, have at it!
PMR: this is too important to leave at this stage. It’s something that the blogosphere – with a good wiki – could manage easily. Apart from the spammability of wikis, I’d suggest it very strongly.
{Tried to post table here, but can’t get it to display nicely}
PMR: know how it feels – blogging software is strictly text and image only.
PMR: see below:

Well done Heather. It’s hard work and often depressing. I once tried to read some publishers’ contracts on what authors and readers could do and found them incomprehensible. I thought at the time it was incompetence and out-of-date pages – now I think much of the license area contains deliberate FUD.
I’m interested in the strength of Nature:

Indexed, publicly accessible database, or, where one does not exist, to readers promptly on request. Any supporting data sets for which there is no public repository must be made available [..] any interested reader [..] from the authors directly, [..]. Such material must be hosted on an accredited independent site [..] or sent to the Nature journal at submission [..]. Such material cannot solely be hosted on an author’s personal or institutional web site.

Note that institutional web sites (does that mean repositories?) are not good enough for Nature! If you read this as a non-biologist there are precious few sites where you could deposit data. Maybe somewhere like BMC, which offers repository services. I think there is a great opportunity here for the new semantic web.
… and Science:

Large data sets must be deposited in an approved database and an accession number provided for inclusion in the published paper. Large data sets with no appropriate approved repository must be housed as supporting online material at Science, or when this is not possible, on the author’s web site, provided a copy of the data is held in escrow at Science

What Heather does not mention is what the public access to the data is. Most of the databases are biological and therefore Open.

Approved databases: Worldwide Protein Data Bank; Research Collaboratory for Structural Bioinformatics, Macromolecular Structure Database (MSD EMBL-EBI), or Protein Data Bank Japan; BioMag Res Bank and Electron Microscopy Data Bank (MSD-EBI); Cambridge Crystallographic Data Centre (CLOSED). GenBank or other members of the International Nucleotide Sequence Database Collaboration (EMBL or DDBJ) and SWISS-PROT. Gene Expression Omnibus; ArrayExpress.

The Cambridge Crystallographic Data Centre (no direct connection with PMR) has 350,000 entries and last time I enquired allows only 25 to be downloaded free (about 0.007%). I shall return to this later.
Science is as good as its word – there are many articles with exposed supporting info – here’s a chemistry one and it looks of high technical quality (haven’t read the science):
http://www.sciencemag.org/cgi/content/full/316/5828/1172/DC1
It doesn’t say anything about copyright and I hope that Science can confirm that they do not assert copyright. It would be extremely useful if they suggested (or required) that authors add a Science Commons license to the data. This would act as a high-profile encouragement to the others.
Nature is similar – here is
http://www.nature.com/nature/journal/v447/n7144/extref/nature05881-s1.pdf
The supplemental data here has been formatted by Nature – but no copyright has been added – and again I hope that they can take the same approach I have suggested.
Whatever your views on Open access, these two journals have made a good start on Open Data. A long way to go as the data are in the dreaded hamburger PDF (molecules are destroyed by PDFisation), although plaudits for Nature Chemical Biology which sends molecules to Pubchem. We need more semantic data here, please.
Also up in PMR’s good books are the Royal Soc. Chemistry and Int. Union of Crystallography, which expose all their supplemental data openly and, although muddled, effectively free of copyright.
The ACS is halfway. It does expose supplemental info, but it copyrights them (and I know from first hand intercourse that this is deliberate).
The less satisfactory publishers are harder to be precise about as they hide their information.
Wiley – Hides data for subscribers only and copyrights them aggressively. I suspect that some data are not even required.
Springer – does not seem to manage data itself and hides those that it does get. I have written to my Springer contact asking for clarification but have not yet heard back.
Elsevier – I suspect they require little data, but I have no hard evidence.
It would be EXTREMELY useful for the blogosphere to collect information on these practices. If we all do a little we could cover the whole field. And shame those who need to be shamed.
I shall write more later on supplemental data.

Posted in data, open issues | 1 Comment

Ola Spjuth of Bioclipse

From time to time people get presented with Blue Obelisks and the latest recipient is Ola Spjuth. Presentations – and the preparations for them – are rarely simple – they have included

  • obelisk falling off the table on night of travel
  • being sent a green obelisk (although to be fair this was an addition by the supplier)
  • tramping the length and breadth of San Francisco only to discover there seemed to be no blue obelisks in the whole city. The recipients at that stage got pseudo-blue pseudo-obelisks

and most recently the supply was late, so I resorted to a crystal healing shop in Uppsala – which also had no obelisks but did have a magic blue rock.
So Ola was presented with a blue “foo” for which he was promised a real obelisk next time we met. We hopefully have a photograph of the presentation (ola.png), in which case it should be possible to do some Photoshopping or similar on the photo.
Fortunately I shall be in Uppsala again in June and Ola and I will be able to meet for a physical handover.
For the uninitiated – there is no formula for the award of the blue obelisk – it happens when it happens, but there has to be a physical meeting – they are not delivered by post. There is no citation – if there were an inscription it would read:
“Si monumentum requiris, circumspice”

Posted in "virtual communities", blueobelisk | Leave a comment

Ranking chemistry and blogosphere metrics

I’ve been pointed to ChemRank – a system that allows you to comment on and rank the chemical literature. I hadn’t seen this before and haven’t looked in depth, so I am only commenting on the idea and technology. (As always I would also like to know how open the supporting organisation is.) I copy the post in full and make some comments.

Dissociating the good literature from the bad literature is an endeavor we all do individually (but not for long: ChemRank). If only there were some website where we could tell whether that syn. prep. is accurate or that physical model is valid. The current methods to determine the validity of the literature are: to perform the same experiment, try to determine this from how many people cite that article, go ask around the department for someone who did something similar, or try to relate the quality of the paper to the h-index of the author.

But, what if you still want more? What if you want something more than a numerical qualifier of a paper’s worthiness, or of an author’s scientific quality? What if you want to have a discussion beyond the scope of your labmates? What if you want to have an intellectual discussion about the current chemical literature with every other chemist on the planet? How do you do this and where do you begin?
The chemical blogosphere has done a rather good job in keeping the community of internet-savvy chemists abreast of some of the latest and coolest research (http://wiki.cubic.uni-koeln.de/cb/index.php). But, unless you have a huge audience and build a faithful readership, no one will ever know your views and it can be rather difficult to have an intellectual conversation with just yourself.
With all this in mind, I set out to create a website that would overcome these historic limitations to academic communication. I created a new website called ChemRank: http://www.chemrank.com Screen Shot shown [deleted in this post – PMR]
At ChemRank you can add a paper to the database and then vote whether you find it a good or bad paper. You can also leave comments critiquing the yield or congratulating the authors on a job well done. Most importantly, there is a public record of your views! Your pain does not need to be repeated if you point out the problematic reaction or the incorrect eqn in the comments section.
Another cool feature of ChemRank is the building of a database that knows the good chemical literature from the bad chemical literature. Currently, there is no database to my knowledge, except your own brain or your PI’s brain, that tracks this information. Although, Noel’s experiments with Connotea are a step in the right direction.
There is even something for the already established pioneers of online chemical literature discussion (aka chemical bloggers). For each paper added to the database, a script behind the scenes will generate a little code snippet to allow voting on the literature you’re discussing on your own blog. For example, if you are discussing the recent paper “Atomic Structure of Graphene on SiO2”, you can generate the following by copy/pasting a code snippet – assuming you’ve already added the paper to the database and clicked the link “add to your blog”.

Atomic Structure of Graphene on SiO2

code snippet:

Code: (html)
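The snippet itself was deleted from this post. Purely as an illustration – the URL scheme and markup below are my guesses, not ChemRank’s actual output – a generator for such an embeddable voting widget might look like this:

```python
# Hypothetical generator for an embeddable ChemRank-style voting widget.
# The /vote URL scheme and the markup are illustrative assumptions only;
# the snippet actually produced by the site is not shown in this post.
from html import escape

def embed_snippet(paper_id: int, title: str) -> str:
    base = "http://www.chemrank.com"  # the site's URL, from the post
    return (
        f'<div class="chemrank-vote">\n'
        f'  <a href="{base}/vote?id={paper_id}&amp;dir=up">good paper</a> |\n'
        f'  <a href="{base}/vote?id={paper_id}&amp;dir=down">bad paper</a>\n'
        f'  <span>{escape(title)}</span>\n'
        f'</div>'
    )

print(embed_snippet(42, "Atomic Structure of Graphene on SiO2"))
```

Escaping the title (and any other user-supplied text) matters here, since bloggers would paste this markup straight into their pages.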

This is by no means a completed project; there is still much more to do.

  • Make links that show the papers posted in the last 7 days, 1 month, 1 year, and all time.
  • Make users register before posting.
  • Add the ability to tag articles with relevant keywords.
  • Make a Pipes RSS feed of each author’s most recent papers display in the description as well, generated on the fly and behind the scenes.
  • Make an API so others can access the database and use it for cool new mashups.
  • Add a search feature, though that is kind of pointless while the database has fewer than 10 papers in it.

All of the above can be done in time, but it depends on feedback from the community and the popularity of the website. As with all projects, I don’t necessarily expect it to catch on right away, if at all. A lot of the time I code simply to show how to do it, and how to do it well. Allowing public comments on the current literature is something the chemical publishers should be doing anyway; now we no longer have to wait for them…
Mitch

Comment:
This and similar approaches are very important. We don’t know where metrics like this will take us, but it can’t be worse than the current citation index, which is managed by … by whom???
Before anyone puts too much effort into their own system they should check out carefully what the “Web 2.0” community does for review. As I blogged earlier, Revyu.com – Review Anything – has tools for managing and collecting reviews; there is no technical reason why they shouldn’t include chemistry AFAIK. I would strongly encourage using RDF for reviews – this will happen anyway, and it makes your material more easily discoverable.
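To make the RDF suggestion concrete, here is a rough sketch (plain Python emitting Turtle) of how a literature review might be expressed. The rev: namespace is the RDF Review Vocabulary, but the exact property choices are my assumptions, not Revyu.com’s precise schema, and the paper URI is a placeholder:

```python
# Sketch: express a literature review as RDF (Turtle) so that aggregators
# can discover and combine reviews. Property choices are assumptions, not
# Revyu.com's actual schema; the paper URI is a placeholder.
def review_to_turtle(paper_uri: str, rating: int, text: str, reviewer: str) -> str:
    return "\n".join([
        "@prefix rev: <http://purl.org/stuff/rev#> .",
        "",
        f"<{paper_uri}> rev:hasReview [",
        f"    rev:rating {rating} ;",
        f'    rev:text "{text}" ;',
        f'    rev:reviewer "{reviewer}"',
        "] .",
    ])

print(review_to_turtle("http://example.org/paper/graphene-sio2", 4,
                       "Elegant synthesis; yields reproduced in our lab.",
                       "A. Chemist"))
```

The point of a shared vocabulary like this is precisely the aggregation argued for below: reviews from many independent sites can be merged without any prior coordination.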
There are reviews in the chemical blogosphere already – TotallySynthetic.com runs a poll for the best synthesis of the month/year, etc. I’m not sure whether there is any technology that publishes this in RDF form, but it would be relatively easy. Commentaries in the blogosphere also have a Greasemonkey viewer from Chemical blogspace
and other sites, so that when you read a journal’s table of contents the monkey alerts you to any commentary. It would be technically easy to extend the monkey so it counted eyeballs on papers – how often was this paper READ (not cited, not downloaded, but at least displayed in a browser, presumably with a human somewhere near)? And to add tools so that the monkey could ask the viewer for opinions.
I’m not suggesting there should be one coordinated effort – any more than we need just one movie review database. But I do think that reviewers should try to use the same technology so that their results can be aggregated – and the reviewers can themselves be reviewed!

Posted in "virtual communities", chemistry | 6 Comments

The soul of the Blue Obelisk

There has been a discussion on the Blue Obelisk mailing list about whether we should move the mailing list to Google to attract a less geeky community. I drafted a contribution to this discussion and then felt it would be worth airing more generally, as it touches on some principles of the Open movement, the gift economy, etc.

On 5/29/07, Tobias Kind <tkind@ucdavis.edu> wrote:

However to get more spin and attract more people
the whole setup must be easy, otherwise it will be just an
exclusive geek club 🙂 And some users just want to lurk
around instead of subscribing to the mailing-list.

I have been thinking about the suggestion to relocate the mailing list and offer the following.
The BO was indeed founded essentially as a geek club – authors of software and data resources who realised they needed to work together to make sure they were interoperable and that the resources they shared (data, algorithms, etc.) were normalised. It acts as a convenient first place to post mail of general interest to this community – new major releases, common technical problems, values of data, etc. (Yes, we all subscribe to most of each other’s mailing lists, but not all.) That aspect is still a core part of the BO – looking through past mails there is a vibrant collection of mails on the details of programs, data, etc.
My description of the BO is that it is a meritocratic gift economy (see Eric Raymond) – one where people are valued for their contributions. These contributions are manifold – as above – and range from programs, data, and draft protocols to managing wikis and lists. This type of approach is very common in Open Source projects.
More recently the BO has started to promote the BO brand. Not yet very consistently – but the idea is that software, data, protocols and other resources state more publicly that they are “members” (whatever that means) of the BO. In that way people can get a feel for the quality (or at least the intent) of interoperability and Open Data (the Open Source is a given that most people understand). There is also the feeling that when you are using resource A and want to integrate it with resource B, there are people in the community who would at least try to help in some way – this is not true of all Open Source, where integration is very often left solely to the users.
There are also common difficult problems we all face. Very recently, for example, I have had a request to integrate CML into geochemistry and isotopes are a requirement. I know that isotopes are more difficult than they look and would certainly ask the community for suggestions when or if I ran into trouble.
I think the calls for different or added lists spring from the emerging “political” dimension in the BO world. The motivations seem to be:
* we need more people to discover BO through a more exposed mailing list
* people are debarred from contributing because of the geekiness of SF mail lists.
I am not too worried about the second. BO is primarily about contributing resources. This requires SVN and there is no reason to move this from SF. So it makes sense for the mailing list to be on the same site.
The political aspect (“spin”) is worth discussing. Personally I see the most powerful tools for advocacy being useful running code, interoperable code, valid respected data, coherent algorithms, etc. When I moderated the XML-DEV list a major theme was to develop code that worked and standards that could be implemented (by the “desperate Perl hacker”). For me BO has many of the same features – implementation was the key thing. Now it is clear that BO has political dimensions – it is a destabilising technology, and I spoke on this theme at the Bioclipse meeting. Essentially, if we get Bioclipse onto every scientific desktop and get BO software running for all major chemical tasks, we shall destabilise the current chemical software economy. And for me that would be a good thing – it would allow innovations that we are currently lacking.
But the reality of the current situation is that we primarily need more developers – especially for testing, documentation, integration and dissemination. You do not need to be a geek to do some of these (especially docs, tutorials, etc.). But you do have to be able to use SVN at SF.
I am fairly confident that we haven’t overlooked large pockets of current Open Source chemical developers. Not all Open Source chemistry projects would describe themselves as BO members, but they almost certainly know of our existence and we work with them from time to time. We also try to make sure that we interoperate. So if we want to reach out to a wider community *of contributors* where will they come from? I can think of at least the following:

  • WP-like contributors. We have very good contact with WP-chem and are making sure that our products are interoperable (e.g. through RDF). It may well be that much of the BO data in the future comes through WP. This is a great way of bringing in school students, etc.
  • non-chemical software developers. This is a very important resource – for example Miguel Howard, a major guru of Jmol, knew no chemistry but has done a world-class job of implementing the graphics. So yes, there is an exciting and really valuable role for non-chemical programmers.
  • educators. The BO resources are now of sufficient quality and breadth to make excellent teaching resources. A major offshoot of this would be tutorials and use cases.
  • industrial programmers. I still have a dream that the pharmaceutical industry will do something in the Open Source area. I talk with representatives on a regular basis. They all say how important open source is and they all want to move into it. I know they *use* our software in considerable amounts. But they don’t contribute. We are very receptive to offers.
  • political evangelists (Open Data, Open Science). This is also an important and growing area. We have certainly some exposure here – e.g. at least 4 of us have been invited to Google/Nature FooCamp and we are active in Open forums, etc. Here again the strength of our case is the deeds more than the talk.

Should there be a wider dimension? I am receptive to new ideas. And I want more people to use our code. Is there a “non-geek” dimension that we should expand? I am not sure I can see it immediately, but maybe others can. What is essential, however, is that anyone proposing new ideas should be prepared to put in the work that helps them succeed – ideas by themselves usually don’t flourish – the BO is hard work.

Posted in blueobelisk, programming for scientists | 1 Comment

Remixing Open Data and the cost of not doing so

Welcome to a new blog (Research Remix) from Heather Piwowar, currently doing her PhD in Biomedical Informatics at the University of Pittsburgh. Heather is encountering first-hand the difficulty of doing her research because of the problem of getting access to data. So she’s taking a very systematic approach to analysing the problem. Here’s a typical post:

Open Literature Review on Open Data

Don’t you love to experiment? Me too.
This blog is an experiment. I’m starting my PhD literature review on the topic of biomedical data sharing and reuse, and thought it would be appropriate to do it out in the open.
Not quite sure how it will work: I’m new to this blogging thing. Please send me suggestions, questions, and especially links to related work.
Thanks, and happy experimenting… with your own data or that of others 🙂
Heather

One of the key tools we must have in fighting for Open Data is agreed metrics. That is hard work. It includes much disappointment – in other posts Heather mentions that many researchers don’t reply to requests for data, and many of those that do cannot (or will not) supply it. (To be fair it’s often because it is a lot more work than it might seem – among the first customers for Repositories we often find scientists who have lost their own data!).
It’s also important to realise that this data has cost money. There seems to be an assumption that once the “science” has been published the data are then worthless. That’s usually not true, but even if it were, I think it’s useful to enumerate the actual cost of collecting the data. A useful metric is to work out what the data would cost at commercial rates – if a chemistry department generates (say) 500 crystal structures at a commercial cost of (say) USD 3000 each (and that’s probably an underestimate), that’s 1.5 million dollars. Does it become worthless after publication?
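The back-of-envelope arithmetic is easy to check (the figures are the illustrative ones from the paragraph above, not measured costs):

```python
# Replacement-cost estimate for one department's crystal structures,
# using the illustrative figures from the text (not measured costs).
structures = 500            # crystal structures from one department
usd_per_structure = 3_000   # commercial rate per structure (an underestimate)
total_usd = structures * usd_per_structure
print(f"${total_usd:,}")    # → $1,500,000
```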
So we need metrics. It’s not exciting, but it’s necessary. I would like to know how many chemistry papers are available under “Open Access/Choice” or whatever name – where the author is invited to pay the publisher so that people can read the article Openly. And I am interested in the publishers’ policies on Open Data – is supplemental data Openly available? This is a sizeable task. But with modern Web 2.0 tools it should be easier to aggregate the responses (or non-responses) from the publishers. Suggestions and offers welcome.

Posted in data, open issues | 3 Comments