I'm gathering data for my presentation at OR08. Having appealed to the readership of this blog and found zero, I'm now looking at other blogs. A very valuable post from Cameron Neylon ...
From Neil [Saunders]
My take on the problem is that biologists spend a lot of time generating, analysing and presenting data, but they don’t spend much time thinking about the nature of their data. When people bring me data for analysis I ask questions such as: what kind of data is this? ASCII text? Binary images? Is it delimited? Can we use primary keys? Not surprisingly this is usually met with blank stares, followed by “well…I ran a gel…”.
Part of this is a language issue. Computer scientists and biologists actually mean something quite different when they refer to ‘data’. For a comp sci person data implies structure. For a biologist data is something that requires structure to be made comprehensible. So don’t ask ‘what kind of data is this?’, ask ‘what kind of file are you generating?’. Most people don’t even know what a primary key is, including me as demonstrated by my misuse of the term when talking about CAS numbers which led to significant confusion.
I do believe that any experiment [CN - my emphasis] can be described in a structured fashion, if researchers can be convinced to think generically about their work, rather than about the specifics of their own experiments. All experiments share common features such as: (1) a date/time when they were performed; (2) an aim (”generate PCR product”, “run crystal screen for protein X”); (3) the use of protocols and instruments; (4) a result (correct size band on a gel, crystals in well plate A2). The only free-form part is the interpretation.
Here I disagree, but only at the level of detail. The results of any experiment can probably be structured after the event. But not all experiments can be clearly structured either in advance, or as they happen. Many can, and here Neil’s point is a good one, by making some slight changes in the way people think about their experiment much more structure can be captured. I have said before that the process of using our ‘unstructured’ lab book system has made me think and plan my experiments more carefully. Nonetheless I still frequently go off piste, things happen. What started as an SDS-PAGE gel turns into something else (say a quick column on the FPLC).
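Neil's four common features, plus the free-form interpretation, map naturally onto a small record type. The following is a minimal sketch in Python; the class and field names are illustrative assumptions, not part of any existing lab-book system, and the date is invented. Leaving the result optional reflects Cameron's point that some structure is only available after the event.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ExperimentRecord:
    """Hypothetical record holding the features shared by all experiments."""
    performed_at: datetime                                 # (1) date/time performed
    aim: str                                               # (2) e.g. "generate PCR product"
    protocols: List[str] = field(default_factory=list)     # (3) protocols used
    instruments: List[str] = field(default_factory=list)   # (3) instruments used
    result: Optional[str] = None                           # (4) may only be known afterwards
    interpretation: str = ""                               # the free-form part

    def is_fully_structured(self) -> bool:
        """True once a protocol or instrument and a result have been recorded."""
        return bool(self.protocols or self.instruments) and self.result is not None


# Illustrative values only.
gel = ExperimentRecord(
    performed_at=datetime(2008, 3, 25, 10, 30),
    aim="run SDS-PAGE gel",
    protocols=["SDS-PAGE"],
)
# The experiment went "off piste": record what actually happened afterwards.
gel.instruments.append("FPLC")
gel.result = "ran a quick column on the FPLC instead"
```

The record can be created with only a date and an aim, and filled in as things happen.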
[... and a good deal more...]
PMR: This is very important and I shall draw heavily on this and add my interpretation. Simply put, the whole idea of "putting data in repositories" is misguided. It is not addressing the needs of the scientific community (and I'm not going to expand ideas here because they are only half formed).
Cameron - I'd be grateful for any more thoughts on this issue - public or private. They will be attributed, of course. Your ideas will probably form the "front end" for the work that the Soton group has been doing so attribution will be important there.
... This definitely comes with a health warning as it goes way beyond what I know much about at any technical level. This is therefore handwaving of the highest order. But I haven’t come across anyone else floating the same ideas so I will have a shot at explaining my thoughts.
The Semantic Web, RDF, and XML are all the product of computer scientists thinking about computers and information. You can tell this because they deal with straightforward declarations that are absolute. X has property Y. Putting aside all the issues with the availability of tools and applications, the fact that triple stores don’t scale well, regardless of all the technical problems, a central issue with applying these types of strategy to the real world is that absolutes don’t exist. I may assert that X has property Y, but what happens when I change my mind, or when I realise I made a mistake, or when I find out that the underlying data wasn’t taken properly? How do we get this to work in the real world?
[... lots more - on provenance, probability, etc. snipped ...]
PMR: In essence Cameron outlines the frustration that many of us find with the RDF model. It makes categorical assertions which have 100% weight and - in its default form - are unattributed. Here are three assertions:
- The formula of water is H2O
- The formula of snow is C17H21NO4
- Snow is frozen water
Assuming I have the implicit semantics that freezing does not change the chemical nature of a substance (not always true), these three statements taken at face value create a contradiction.
I can remove the contradiction by introducing the semantic that a formula may be associated with more than one name and that a name may be associated with more than one formula. This taken at face value prevents us from making any useful inferences.
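The contradiction can be made concrete with a toy triple store. This is only a sketch using plain Python tuples rather than a real RDF library; the predicate names (hasFormula, sameSubstanceAs) are made up for illustration, and the single inference rule encodes the assumption that freezing does not change the chemical nature of a substance.

```python
from collections import defaultdict

# The three assertions, as unattributed (subject, predicate, object) triples.
triples = [
    ("water", "hasFormula", "H2O"),
    ("snow", "hasFormula", "C17H21NO4"),
    ("snow", "sameSubstanceAs", "water"),   # "snow is frozen water"
]


def find_contradictions(triples):
    """Return substances that end up with more than one formula."""
    formulas = defaultdict(set)
    for s, p, o in triples:
        if p == "hasFormula":
            formulas[s].add(o)
    # Assumed rule: substances declared the same share all their formulas.
    # A single pass is enough for this small example (no transitive chains).
    for s, p, o in triples:
        if p == "sameSubstanceAs":
            merged = formulas[s] | formulas[o]
            formulas[s] = formulas[o] = merged
    return {name: fs for name, fs in formulas.items() if len(fs) > 1}


conflicts = find_contradictions(triples)
```

Both "water" and "snow" end up with two formulas. Dropping the one-formula-per-substance expectation makes the conflict disappear, but then the checker has nothing left to report, which is the loss of useful inference described above.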
What I have felt a great need for (echoing Cameron) is that the triple should be enhanced with two properties:
- the provenance (the person or software making the assertion)
- the weight of the assertion
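One way to picture such an enhanced triple is as a record carrying two extra fields. The sketch below is an assumption about how this might look, not an existing standard and not OSCAR's data model; the sources and weights are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    obj: str
    provenance: str   # person or software making the assertion
    weight: float     # confidence between 0.0 and 1.0


# Invented example data: two competing statements about "snow".
assertions = [
    Assertion("snow", "hasFormula", "H2O", "source-A", 0.95),
    Assertion("snow", "hasFormula", "C17H21NO4", "source-B", 0.30),
]


def best_assertion(assertions, subject, predicate):
    """Pick the highest-weight assertion for a subject/predicate pair."""
    candidates = [a for a in assertions
                  if a.subject == subject and a.predicate == predicate]
    return max(candidates, key=lambda a: a.weight, default=None)


preferred = best_assertion(assertions, "snow", "hasFormula")
```

Nothing is thrown away: the lower-weight statement stays in the store with its provenance, so a different consumer could apply a different policy.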
"At Dagstuhl it is continuing to snow."
If I pass this sentence to OSCAR it may mark up snow as a chemical substance. In doing so it now gives every annotation a weight based on the confidence (I shan't explain how here). So, for example, it is much more likely that 2-acetylfoobarane is a chemical than HIV (hydrogen-vanadium-iodide) and OSCAR addresses these concerns.
It's possible to add provenance and confidence to RDF but I don't know of a standard approach for doing this. If we start doing this we need to make sure we have consistent schemas.
(Interestingly we've just been discussing the value of adding "strength of statement" to the results of text mining.)
14:43 25/03/2008, Open Access News
The Value of Spatial Information, a report by ACIL Tasman prepared for Australia's Cooperative Research Centre for Spatial Information and ANZLIC, March 2008. (Thanks to Baden Appleyard.) From the executive summary:
...Constraints on access to data are estimated to have reduced the direct productivity impacts in certain sectors by between 5% and 15%. It is estimated that this could have resulted in GDP and consumption being around 7% lower in 2006-07 (around $0.5 billion) than it might otherwise have been....
Comment. These are big numbers and it takes a minute to put them in perspective. In one country (Australia) in one year (2006-07), lack of OA to one kind of data (spatial data) cost the economy $500,000,000.
PMR: I hardly need to comment. However in our current discussions at Dagstuhl on Text Mining and Ontologies in the Life Sciences it is clear how valuable Open Data is. It's also clear how much the lack of open data in chemistry holds back innovation. I don't have numbers, but it would be great to have an economist look at this...
A brief update. I'm privileged to have been invited to a meeting at Dagstuhl in Germany. It's on Text Mining and Ontologies (I expect that we shall all post abstracts over the next few days). It's a heavyish program and - rightly - wireless is forbidden in the seminar room so there won't be many posts. I'm applying Chatham House rule although you can see the participants on the web.
The discussion today included
- whether ontologies modelled reality
- can we create ontologies from text-mining (general answer no, except in limited cases, but it may be a useful help)
- whether ontologies should always be created by domain experts (generally yes - any contracted-out ontology is garbage).
I presented our group's work today - OSCAR is now well known and being used elsewhere. It's nice to be in an area where software and resources are freely available. My optimism level about free knowledge in science has risen.
Also some useful unattributed conversations about repositories - the computer scientists are not impressed with DSpace etc. and the resources spent by universities. I'm gathering ideas for my presentation at OR08 next week on Data Repositories. I am gearing up to generate lively discussion.
Catalysed by a recent comment on a 2007-12 post (Exploring RDF and CML) :
here's an update of where we are at with molecular repositories. We shall have a clearer idea when several of our group present at Open Repositories 2008 (OR08) in 10 days' time (lots of progress can be made in 10 days). I'm omitting details here (so as not to spoil the show next month).
- We are committed to RDF+XML/CML as the future for molecular information. This is the only way that we can manage such diverse information as documents, recipes, results of calculations, spectra, crystallography, physical and analytical properties, etc. The CML schema is now being used in many places and has remained stable for 15 months. Almost all parts have now been tested in the field (the main exception is isotopic variation - e.g. in geoscience). We can easily go from CML to RDF - the reverse is not always possible. The value of CML is that it is currently easier to use for chemical calculations as there is a knowable coherence of related concepts. Note that the CML community is developing a number of subdomains ("conventions") which allows some degree of local autonomy as in CMLComp.
- We are enthusiastic partners in the OREChem project (Chemistry Repositories, and from Jim Downing ORE! Unh! Huh! What is it good for?). This uses named (RDF) graphs to describe local collections ("Aggregates") of URIs. The project will have several molecular repositories of which we shall contribute at least 2 (CrystalEye and our nascent "molecular repository"). All content will be Open Data.
- Jim Downing has developed a lightweight repository (MMRe) based on Atom/REST and RDF. I won't give too much away except to say it is deployed and over the last few days Joe Townsend has been adding data from chemistry theses (SPECTRaT) and Lezan Hawizy has been adding our collection of "common molecules" (scraped from various sources). This can now be queried through SPARQL.
- Andrew Walkingshaw has converted the CML in CrystalEye to RDF - 100,000 entries and probably about 10 million triples. He's been working with a well-known semantic web company (not sure if this is public yet) and has done some very exciting extraction and mashups. SPARQL searches work over this size. Andrew has also developed Golem - a system which extracts dictionary links (cml:@dictRef) from CML computations and is able to build dictionaries (ontologies) automatically and then to extract data.
- In the last four days Thomas Steinke has converted VAMP to emit CML. We have run a few hundred calculations automatically (by extracting molecules from the NMREye repository, converting them to input, running the calculation, and then converting to RDF). The results - which contain coordinates, energies and NMR peaks - are being fed into another local repository.
So we have a variety of sources which will all be available. We face a number of exciting questions.
- How do we express a molecule in RDF? We are gradually converging on an "aggregate" where a molecule has identifiers, properties, and special resources such as chemical formula, the CML connection table, and a list of chemical names.
- How do we assign identifiers? This is a really hard problem. Although for many chemicals there is little doubt about the relationship between names, identities and properties there cannot, in general, be a "correct" structure or a "unique URI" for a chemical. Look, for example, at "Phosphorus Pentoxide" (In WP). Experiment shows that there are several different forms, with different chemical connectivities. There are 2 formulae (P2O5 and P4O10) each with a different CAS number (Chemical Abstracts is a major authority in chemistry). Are these different chemicals or do they represent our changing chemical knowledge? Is one used for early publications and another for later ones? Only CAS can say when one number is used and not the other. It is because of this uncertainty that we cannot know exactly how many different chemicals there are in the CAS collection.
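As a rough illustration of the "aggregate" idea and of why identifiers are hard, the sketch below models a molecule as a bag of triples hanging off one URI. Everything here is hypothetical: the example.org URI, the ex: predicate names and the CAS-A/CAS-B placeholders (real CAS numbers are deliberately not given).

```python
def molecule_aggregate(uri, names, formulae, identifiers):
    """Build (subject, predicate, object) triples for one molecule aggregate."""
    triples = []
    for name in names:
        triples.append((uri, "ex:name", name))
    for formula in formulae:
        triples.append((uri, "ex:formula", formula))
    for scheme, value in identifiers:
        triples.append((uri, "ex:identifier", f"{scheme}:{value}"))
    return triples


# Phosphorus pentoxide: one name, two formulae, two (placeholder) CAS numbers.
p_oxide = molecule_aggregate(
    uri="http://example.org/mol/phosphorus-pentoxide",
    names=["Phosphorus Pentoxide"],
    formulae=["P2O5", "P4O10"],
    identifiers=[("cas", "CAS-A"), ("cas", "CAS-B")],
)

formula_count = sum(1 for _, p, _ in p_oxide if p == "ex:formula")
```

Putting both formulae under a single URI is itself a modelling decision; splitting them into two aggregates would be equally defensible, which is exactly the open question.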
There cannot be a platonic semantic description of chemical identity - many chemicals do not have a "correct" structure. Antony Williams has been doing a heroic and valuable job in detecting inconsistencies in reporting chemical structure and resolving them where possible (eMolecules and ChemSpider - A Respectful Comparison of Capabilities). But he is not establishing a "correct" structure - he is making authoritative statements about the relationship between names, structures and identifiers.
This brings us to why RDF - probably in its quad form (i.e. with provenance) - is important to describe chemical structure.
- Many substances occur in several forms and there is no single structure. We hope that RDF can manage these relationships.
- Many name-to-structure assignments have changed over time as our experimental techniques become more powerful. Thus the C19 chemists would first write PO5 (atomic weights were not "correct"), then P2O5 and only after X-ray crystallography P4O10. To understand historical chemistry we have to know the relationships used at the time.
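The historical point suggests quads in which the fourth element records the context of the statement. A minimal sketch, assuming hypothetical context labels for the three stages just described (no dates are implied):

```python
# (subject, predicate, object, context) quads; context labels are illustrative.
quads = [
    ("phosphorus pentoxide", "hasFormula", "PO5", "early-C19-atomic-weights"),
    ("phosphorus pentoxide", "hasFormula", "P2O5", "corrected-atomic-weights"),
    ("phosphorus pentoxide", "hasFormula", "P4O10", "x-ray-crystallography"),
]


def formula_in_context(quads, name, context):
    """Return the formulae asserted for a name within one context."""
    return [o for s, p, o, c in quads
            if s == name and p == "hasFormula" and c == context]


modern = formula_in_context(quads, "phosphorus pentoxide", "x-ray-crystallography")
historical = formula_in_context(quads, "phosphorus pentoxide", "early-C19-atomic-weights")
```

With the context attached, the three formulae no longer contradict each other; each is true relative to the knowledge of its time.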
We have scraped about 10000 compounds from the web including Wikipedia and have a variety of triples associated with each. There is little overlap of triples - names, CAS, formulae are present or absent. So we now need to use RDF technology to reconcile this information. It's a complex task and we will probably have to add weights/probabilities to some of the statements - some authorities are less reliable than others.
In the first instance we'll probably use some of the commonest identifiers to assert identity and that's the version we should be releasing in a few days.
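Asserting identity from a shared identifier can be sketched as a grouping problem: records that carry the same value for a chosen identifier are merged into one cluster. This is an assumption about one possible approach, not the method we shall actually release; the records and the "cas" values are invented placeholders.

```python
def reconcile(records, key="cas"):
    """Group scraped records that share the same value for `key`.

    Records lacking the key stay as singleton groups.
    """
    groups = {}
    unmatched = []
    for rec in records:
        value = rec.get(key)
        if value is None:
            unmatched.append([rec])
        else:
            groups.setdefault(value, []).append(rec)
    return list(groups.values()) + unmatched


# Invented records showing the patchy overlap of names, CAS and formulae.
records = [
    {"source": "wikipedia", "name": "water", "cas": "X-1", "formula": "H2O"},
    {"source": "site-b", "name": "oxidane", "cas": "X-1"},
    {"source": "site-c", "name": "snow"},            # no CAS, no formula
]

clusters = reconcile(records)
```

The third record cannot be placed by identifier alone, which is where weights or probabilities on the remaining statements would have to come in.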
I am honoured to be a member of the COST D37 CCWF working group on interoperability in chemical computing (see Semantic Chemical Computing for the last meeting in Berlin). COST enables European collaboration by not only having group meetings but also by exchanging scientists (STSM - short term scientific missions).
We're excited by this model because we've had two working visits from COST members. First was Kurt Mikkelsen from Copenhagen with the DALTON code (I was away so I may get details wrong). Kurt is interested in physical quantities such as hyperpolarizability and other tensors of high rank. Although the code is fairly ancient - and (I gather) has spaghetti in places - Kurt and colleagues here (Toby White, Andrew Walkingshaw) were able to put enough FoX in that DALTON can be said to be CMLized.
Now we've had a visit from Thomas Steinke from Berlin for 2 weeks (Thomas also helps run the COST process). Thomas is a developer on the VAMP program (Tim Clark) which is a semi-empirical code from the AMPAC phylogeny. The code is pretty hairy in places (e.g. backwards GOTOs out of loops) and it took us (Thomas, Andrew, Toby and me) about 2 days to get a running version that emitted metadata, calculated NMR shifts, and final properties and coordinates. Yesterday we added the history (steps, cycles, etc.) which is not conceptually difficult but a spaghetti nightmare in places.
So - if you understand a code well - it is possible to make substantial progress to CMLizing it in about 2 days. Well structured code is easy and the difficulties arise primarily from unmaintainable code. I'll blog this in detail later, but it's straightforward to output coordinates, NMR shifts (peakList), energies, and a wide range of scalar, array and tensor/matrix properties.
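To give a feel for what "emitting CML" amounts to, here is a minimal sketch of the output side. It is written in Python with the standard library purely for illustration; the real codes are Fortran and use FoX, and this is not their implementation. The coordinates and energy are placeholder values and the ex:totalEnergy dictionary reference is hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def molecule_to_cml(mol_id, atoms, energy, energy_dict_ref):
    """Serialise coordinates and one scalar property as a CML string.

    atoms: list of (element_symbol, x, y, z) tuples.
    """
    ET.register_namespace("", CML_NS)   # write CML as the default namespace
    root = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})
    atom_array = ET.SubElement(root, f"{{{CML_NS}}}atomArray")
    for i, (element, x, y, z) in enumerate(atoms, start=1):
        ET.SubElement(atom_array, f"{{{CML_NS}}}atom", {
            "id": f"a{i}", "elementType": element,
            "x3": str(x), "y3": str(y), "z3": str(z),
        })
    prop = ET.SubElement(root, f"{{{CML_NS}}}property",
                         {"dictRef": energy_dict_ref})
    scalar = ET.SubElement(prop, f"{{{CML_NS}}}scalar",
                           {"dataType": "xsd:double"})
    scalar.text = str(energy)
    return ET.tostring(root, encoding="unicode")


cml_text = molecule_to_cml(
    "m1",
    [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)],
    energy=-1.0,                        # placeholder value, not a real result
    energy_dict_ref="ex:totalEnergy",   # hypothetical dictionary entry
)
```

The hard part in a legacy code is not this serialisation but finding the places in the control flow where the values are actually final.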
The next phase was to run Andrew's Golem over the outputs and create a dictionary and we're now about to convert the results into RDF and put it into our molecular repository. More later...
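The dictionary-building step can be pictured as harvesting every dictRef that appears in the output. The sketch below is not Golem's actual implementation, only an illustration of the idea using the Python standard library; the sample document and its ex: terms are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample output with hypothetical dictionary references.
sample_cml = """
<module xmlns="http://www.xml-cml.org/schema">
  <property dictRef="ex:totalEnergy"><scalar>-1.0</scalar></property>
  <property dictRef="ex:dipole"><scalar>0.5</scalar></property>
  <property dictRef="ex:totalEnergy"><scalar>-2.0</scalar></property>
</module>
""".strip()


def collect_dict_refs(cml_text):
    """Count every dictRef attribute value found in a CML document."""
    root = ET.fromstring(cml_text)
    return Counter(el.get("dictRef") for el in root.iter()
                   if el.get("dictRef") is not None)


dictionary_terms = collect_dict_refs(sample_cml)
```

The distinct keys are candidate dictionary entries; the counts give a crude indication of which terms a code emits most often.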
I have been invited to give a keynote lecture at Open Repositories 2008 (see the programme - about 25% down) and have chosen the title "Repositories for Scientific Data". I'd value help from the repositarian blogosphere and elsewhere.
My thesis is that the current approach for Institutional Repositories will not translate easily to the capture of scientific data and related research output. In some fields of "big science" (e.g. High Energy Physics) the problem is or will be solved by the community and their funders and institutions have effectively no role. However much - probably most - science is done in labs which are the primary unit of allegiance. Typical disciplines are chemistry, materials science, biochemistry, cell biology, neuroscience, etc. etc. These labs are often focussed on local and short-term issues rather than long-term archival, dissemination of data to the community, etc. Typical worries are:
- My grad student has just left without warning - can I find her spectra?
- How can we rerun the stats that our visitor last year did for us?
- My laptop has just crashed and I've lost all the images from the microscope
- My chosen journal had to retract papers due to recent scientific malpractice. Now they want me to send them all my supporting data to prove I have adopted correct procedures. This will take me an extra month to retype in their format.
If we are to capture and preserve science we have to do it to support the scientist, not because the institution thinks it is a good idea (even if it is a good idea). So we have to embed the data capture directly into the laboratory. Of course in many cases there is a key role for the Department, particularly when - as in chemistry - there is a huge investment in analytical services (crystallography, spectroscopy, computing).
I am developing this theme for the presentation and would be very grateful for anecdotal or other information as to where the institution or department has developed a data capture system which ultimately feeds into medium-term (probably Open) preservation. Two emerging examples are Monash which has acquired a petabyte for storage of University scientific data and will layer a series of access mechanisms (SVN, Active Directory, Samba, RDB, SRB, etc.) on top of it. Recently Oxford has announced a Data Repository.
If you have material that will help give a balanced picture of data reposition in institutions I'd be grateful for email (or comments on the blog but I'll be offline for a few days from Monday). I'm aware that some disciplines have domain repositories independently of institutions (e.g. HEP, bio-sequences, genes, structures, etc and David Shotton's image repository for biology) - I'm after cases where the institution has invested in departmental or lab facilities and which are actually being used.
Many thanks in advance.
Margaret Henty (ANU) has done a very professional job of collecting the material (of several media types) from the APSR 2008 meeting in Brisbane:
APSR is pleased to announce that podcasts and vodcasts of Open Access Collections held in Brisbane on February 14 are now available at the following url.
Because vodcast files are very large, they are available in parts (for those with bandwidth issues) and in total (for those in bandwidth plenty).
Thanks are due to the team of technicians who made this all possible.
National Services Program Coordinator
Australian Partnership for Sustainable Repositories
W. K. Hancock Building (#43)
The Australian National University
Canberra, ACT, 0200, AUSTRALIA
phone 61 2 6125 7685 mob. 0404 878 442
fax 61 2 6125 5526
I haven't yet finished downloading, and I'm pleased / scared to see what I actually looked like and said.
Because my presentations have live demos it is difficult to capture them on traditional slide media so I rely on occasional videos of the event. So many thanks to Margaret (and for a great time in Australia).
I was privileged to be asked to present a homage to Leo Waaijers yesterday at the SURF foundation in Utrecht (actually in an old castle+microbrewery - Stadskasteel Oudaen). Leo has been an architect of so much in the Netherlands that I cannot list all of it - perhaps the best is simply to reference one of his latest articles in Ariadne where he reviews eScholarship - practice and technology - and the principles of Openness.
The theme of the afternoon was "Inspire". It was a great occasion where after the presentations we had a theatrical homage to Leo - hooded monks representing the four elements, and an excellent set of computer graphics and personal tributes from round the world - including many from UK/JISC.
I took the opportunity to honour Leo's work in creating the DARE repository :
Promise of Science makes over 15,000 e-theses searchable.
DAREnet is the result of the DARE programme, funded by SURF. All Dutch universities, the National Library of the Netherlands, the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Netherlands Organisation for Scientific Research (NWO) participate. From 1 January 2007 KNAW Research Information has taken over responsibility for the DAREnet website.
I was able to find over 2000 publications when searching for "chemistry". I haven't checked but I think most are theses. This is a fantastic resource. I could download one and within a minute[*] OSCAR had read much of the chemical data from it. It changes the way we should capture and publish data. The Netherlands can justly feel proud in leading the world with the coherence and commitment of their program on capturing eScholarship.
PS [*] Yes, I have to admit that I had to convert the PDF to something useful (ASCII) and then the parsing takes a few seconds. So, a simple message.
Besides archiving your PDF, archive the WORD or LaTeX file. That's the message. It's simple, so I'll repeat it:
Besides archiving your PDF, archive the WORD or LaTeX file.
I have been so busy I haven't managed to blog the Open Knowledge Foundation's OKCON 2008. Here it is in brief:
OKCon: The Annual Open Knowledge Conference
PMR: Don't offhand know if there's space still available - maybe someone can respond.
Rufus Pollock has asked me to chair the session on visualization.
Session 2 (1200-1315): Visualization and Analysis
I'm really excited about this. I've been involved in graphics for nearly 30 years and the world still hasn't solved the problem of how to get graphics to the client. But graphics is so important in making your message obvious and widely known. So there is still a lot of scope for work here.
And while I'm here enormous Kudos to Rufus for getting OK off the ground and keeping focussed. It's made critical contributions over the last year - e.g. with Science Commons.
Look forward to meeting old and new friends tomorrow.
I'd like to reemphasize how i