- homage to Caltech: Jack Dunitz, Linus Pauling, Verner Schomaker and Ken Trueblood.
- data-driven science in crystallography – examples from 1973 to present day.
- semantic web and chemistry, including DBpedia
- Open Access
- eTheses
- CrystalEye
Archive for the ‘etd2007’ Category
Webcast: the power of the eThesis
Sunday, September 9th, 2007
Data and Institutional Repositories
Saturday, June 23rd, 2007
- To promote the institution and its work
- To make the work more visible
- To manage business processes (e.g. thesis submission, or research assessment exercises).
- To satisfy sponsors and funding bodies
- To preserve and archive the work
- To curate the work
- re-analysis could show the data or the conclusions were flawed
- re-analysis could discover exciting science that the author had missed
- journals could refuse to publish the work (isn’t it tedious to have to mention this in every post?)
If it is so simple, why is this manual so long? In the best of worlds all information would be exchanged using both syntactically and semantically precise protocols. The reality of the current state of information exchange is somewhat different. Most applications still rely on the poorly defined semantics of the PDB format. Although there has been significant effort to define and standardize the information exchange in the crystallographic community using mmCIF, the application of this method of exchange is just beginning to be employed in crystallographic software. Currently, there is wide variation in specification of deposited coordinate and structure factor data. Owing to this uncertainty, it is not practical to attempt fully automated and unsupervised processing of coordinate data. This unfortunately limits the functionality that can be presented to the depositor community through Web tools like ADIT, as it would be undesirable to have depositors deal with the unanticipated software failures arising from data format ambiguities. Rather than provide for an automated pipeline for processing and validating coordinate data, coordinate data processing is performed under the supervision of an annotator. Typical anomalies in coordinate data can be identified and rectified by the annotator in a few steps permitting subsequent processing to be performed in an automated fashion. The remainder of this manual describes in detail each of the data processing steps. The data assembly step in which incoming information is organized and encoded in a standard form is described in the next section.
So data reposition is both highly desirable and complex. I’m not offering easy solutions. But what is not useful is the facile idea that data can simply be reposited in current IRs. (I heard this suggested by a commercial repository supplier at the Digital Scholarship meeting in Glasgow last year. He showed superficial slides of genomes, star maps, etc. and implied that all this could be reposited in IRs. It can’t and I felt compelled to say so.)
At ETD2007 we had a panel session on repositing data. Lars Jensen gave a very useful overview of bioscientific data – here are my own points from it:
- There is a huge amount of bioscience data in repositories.
- These are specialist sites, normally national or international
- There is a commitment to the long-term
- much bioscience is done from the data in these repositories
- the data in them is complex
- the community puts much effort into defining the semantics and ontologies
- specialist staff are required to manage the reposition and maintenance
- there are hundreds of different types of data – each requires a large amount of effort
- the relationships between the data are both complex and exceedingly valuable
- there must be support for the development of Open ontologies and protocols. One model is to encourage groups which are already active and then transfer the maintenance to International Unions or Learned Societies (though this can be difficult when they are also income-generating Closed Access Publishers)
- funders must make sure that the work they support is preserved. Hundreds of millions or more are spent in chemistry departments to create high-quality data and most of these are lost. It’s short-sighted to argue that the data only live until the paper publication
- departments must own the initial preservation of this data. This costs money. I think the simplest solution is for funders to mandate the Open preservation of data (cf. Wellcome Trust).
- the institutions must support the generic preservation process. This requires the departments actually talking to the LIS staff. It also requires LIS staff to be able to converse with scientists on equal terms. This is hard, but essential.
- save your data
- don’t simply put it in your repository
CML on ICE – towards Open chemical/scientific authoring
Saturday, June 23rd, 2007
I mentioned before that at the ETD 2007 conference I met Prof Peter Murray-Rust. [1] We’re going to collaborate on adding support for CML – the Chemical Markup Language – to ICE, so that people can write research publications that include ‘live’ data.
[1] I’m just Petermr or PMR
Here’s a quick demo of the possibilities. I went to the amazing Crystaleye service.
PMR: This is Nick Day’s site. We’d hoped to announce it formally a week or so ago but machine problems kept us back. But we’ll get some posts out this coming week. [We thank Acta Cryst/IUCr for a summer studentship which helped greatly to get it off the ground.]
The aim of the CrystalEye project is to aggregate crystallography from web resources, and to provide methods to easily browse, search, and to keep up to date with the latest published information. http://wwmm.ch.cam.ac.uk/crystaleye/
Crystaleye automatically finds descriptions of crystals in web-accessible literature, turns them into CML and builds pages like Acta Crystallographica Section B, 2007, issue 03-00. From that page I grabbed this two dimensional image of (C6H15N4O2)2(C4H4O6-2),
PMR: Minor point: This is just the anion – there is a separate image for the cation.
(the 3D structure below displays the cations as well).
PMR: image and applet deleted here …
There’s a Java applet on the page that lets you play with the crystal in 3d. Here’s a screenshot of the 3d rendering.
There’s lots more work to be done, but I thought I’d show how easy it is to make an ICE document that shows the 2d view for print, with the 3d view for the web, via the applet. Be warned, this may not work for you. The applet refuses to load in Firefox 2 for me, but it does work in Safari on Mac OS X. If you follow the ‘view this page in PDF’ link above you’ll see just the picture.
What’s happening here? My initial hack is really simple. I grab the image and paste it into ICE like any other image, but then I link it to the CML source. I wrote a tiny fragment of Python in my ICE site to go through every page, and if it finds a link to a CML file containing an image, it adds code to load the CML into the Jmol applet. This is a kind of integration-by-convention, AKA microformat.
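The convention described above (an image wrapped in a link to a CML file) can be sketched in a few lines of Python. This is only an illustration, not the fragment that runs in the ICE site: the function name `add_viewer_hooks` and the `cml-viewer` placeholder markup are hypothetical, and the real hack emits whatever the Jmol applet needs rather than this `div`.

```python
import re

# Matches an <img> wrapped in a link whose target ends in ".cml".
# Regex-based HTML handling is fragile; a real site would use a proper parser.
CML_LINKED_IMAGE = re.compile(
    r'<a\s+[^>]*href="(?P<href>[^"]+\.cml)"[^>]*>\s*<img\b[^>]*>\s*</a>',
    re.IGNORECASE,
)


def add_viewer_hooks(html):
    """Follow each CML-linked image with a placeholder for a 3D viewer."""

    def expand(match):
        # Hypothetical markup: a page script would swap this div for the applet.
        hook = '<div class="cml-viewer" data-cml="{}"></div>'.format(
            match.group("href")
        )
        return match.group(0) + hook

    return CML_LINKED_IMAGE.sub(expand, html)


page = '<p><a href="data/anion.cml"><img src="anion.png" alt="anion"></a></p>'
print(add_viewer_hooks(page))
```

Images that are not linked to a `.cml` file pass through untouched, which is the point of the convention: the print view keeps the plain picture and only the web view gains the 3D hook.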
The main bit of programming only took a few minutes, but sorting out where to put the CML files and the Jmol applet, and integrating the changes into this blog took ages. I ended up putting the files here on my web site, which meant putting a big chunk of stuff into subversion, something that should have been done ages ago. But the version of svn that runs on my other server refuses to do large commits over HTTPS ‘cos of some SSL bug, and I can’t figure out how to update it, which meant switching the repository to use plain HTTP, and so on. It wasn’t made easier by me mucking around with the Airport Extreme router and our ADSL modem at the same time, halting internet access at home for a couple of hours.
PMR: I have now hacked JUMBO so it generates SVG images of 2D (and soon 3D) molecules. Note that this then allows automatic generation of molecular images in PDF files (through FOP/SVG).
To make this integration a bit more usable and robust we want to:
- Work out a workflow that lets you keep CML files in ICE and easily drop images in to your documents, letting ICE render using the applet when it makes HTML.
- Integrate forthcoming work from Peter & team that will provide high quality vector graphics instead of the PNG files I’m using now.
PMR: I am also extremely keen to talk to these teams – as they are doing very similar and complementary work to our SPECTRa and SPECTRa-T projects in capturing scientific data at source. I am impressed by the Australian commitment to Open Access, Open Data and collaborative working. ICE is an excellent example of how we can split the load. ICE likes working with the technical aspects of documents (I don’t really, though I have to). The Blue Obelisk likes working with XML in chemistry. The two components naturally come together. This is something I have been waiting for for about 12 years. We haven’t got there yet, but we are well on the way.
- Investigate embedding CML in an image format such as EPS that word processors understand.
- Generalize this approach for other e-scholarship applications. We’re working with the Alive team at USQ on this.
- Talk to the DART & ARCHER teams.
The power of the scientific eThesis
Friday, June 15th, 2007
- the thesis need not be a dull record of a final result but a creative work which lives and evolves until and beyond the “final submission”
- theses should be semantic and interactive, supported by ontologies and go beyond “hamburger PDF”. Theses are computable.
- We must develop communal semantic authoring/creation environments and processes.
- the process should move rapidly towards embracing open philosophies and methodology. Metadata and ontologies should be open.
- young people should be actively involved in all parts of managing the thesis process. (Harvard Free Culture)
- “Web 2.0” will transform society and therefore the academic process. We must be prepared for this.
- It is not clear that current approaches to “repositories” will help rather than hinder innovation and dissemination of eTheses. They will only be useful for preservation if they are semantic.
- thesis structure (templating) – e.g. USQ’s Integrated Content Environment (ICE) system which supports XML/“Word”
- MathML
- SVG (graphics)
- CML (Chemistry)
- GML (maps)
- Numeric data (various, including CML)
- graphs (various, including CML)
- tables (various, including CML)
- scientific units (various, including CML)
- ontologies and dictionaries (various, including CML)
- Disruptive Technology, Mathias Klang – this is the first PhD thesis in Sweden to be licensed under a Creative Commons license
- Edinburgh Research Archive: Item 1842/433 – Magnus Hagdorn, open thesis in geosciences
- usefulchem » Alicia Holsey – one of the first chemistry theses created on a public Wiki
- OSCAR1 chemical (thesis) validator, written by undergraduates
- Oscar3 – WWMM chemical linguistics including Named Entity recognition, with links to the two resources below
- The PubChem Project (free database of chemical structures of small organic molecules and information on their biological activities, etc.)
- Chemical Entities of Biological Interest (ChEBI), a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
- Bioclipse (including display of molecules from dissertation on 2-Pyridon-katalysierte Esteraminolyse)
- The Blue Obelisk – Bowiki and their Greasemonkey. The Blue Obelisk Data Repository
- The Worldwide molecular matrix (WWMM) and CrystalEye (typical page).
- SPARQL Query Language for RDF on chemistry theses from St Andrews.
- MACiE Homepage – what chemical reactions SHOULD look like (CML example)
- involve young people in all parts of the process – understand Web 2.0 culture and democracy. Be brave
- help promote their vision against the conservatism of institutions, learned societies and commercial interests
- promote thesis creation as a complete part of the research process. Start on day 0 with tools, encouragement. Get students from year+1 to explain the vision
- Harness the power of social computing (Google, Flickr, Wikipedia, etc.). You will have to anyway. Give credit for innovation in this area
- Co-develop semantic authoring tools, including scientific languages. Use rich clients for display.
- Promote the use of ontologies and similar resources as integral parts of the scholarly process. Insist on marked up information and entities
- Use software to validate data in theses. Give these tools to examiners.
- insist that data belongs to the scientific community. Use Creative Commons licenses from day 0.
“open access” – some central questions
Tuesday, June 12th, 2007
Summary [of Stevan's post]: Downloading, printing, saving and data-crunching come with the territory if you make your paper freely accessible online (Open Access). You may not, however, create derivative works out of the words of that text. It is the author’s own writing, not an audio for remix. And that is as it should be. Its contents (meaning) are yours to data-mine and reuse, with attribution. The words themselves, however, are the author’s (apart from attributed fair-use quotes). The frequent misunderstanding that what comes with the OA territory is somehow not enough seems to be based on conflating (1) the text of research articles with (2a) the raw research data on which the text is based, or with (2b) software, or with (2c) multimedia – all the wrong stuff and irrelevant to OA.
Comments
- Stevan is responding to Peter Murray-Rust’s blog post from June 10. But since I agreed with most of what Peter MR wrote, I’ll jump in.
- Stevan isn’t saying that OA doesn’t or shouldn’t remove permission barriers. He’s saying that removing price barriers (making work accessible online free of charge) already does most or all of the work of removing permission barriers and therefore that no extra steps are needed.
- The chief problem with this view is the law. If a work is online without a special license or permission statement, then either it stands or appears to stand under an all-rights-reserved copyright. The only assured rights for users are those collected under fair use or fair dealing. These rights are far fewer and less adequate than OA contemplates, and in any case the boundaries of fair use and fair dealing are vague and contestable.
- This legal problem leads to a practical problem: conscientious users will feel obliged to err on the side of asking permission and sometimes even paying permission fees (hurdles that OA is designed to remove) or to err on the side of non-use (further damaging research and scholarship). Either that, or conscientious users will feel pressure to become less conscientious. This may be happening, but it cannot be a strategy for a movement which claims that its central practices are lawful.
- This doesn’t mean that articles in OA repositories without special licenses or permission statements may not be read or used. It means that users have access free of charge (a significant breakthrough) but are limited to fair use.
Get the Institutional Repository Managers Out of the Decision Loop
The trouble with many Institutional Repositories (IRs) (besides the fact that they don’t have a deposit mandate) is that they are not run by researchers but by “permissions professionals,” accustomed to being mired in institutional author IP protection issues and institutional library 3rd-party usage rights rather than institutional author research give-aways.
PMR: I have had similar thoughts. I got the distinct impression that some IRs are run like Victorian museums – look but don’t touch. The very word “repository” suggests a funereal process – it’s no surprise that having put much of my stuff into DSpace I find it’s an enormous effort to get it out. Why don’t we build “disseminatories” instead? [Stevan's analysis of how we should deposit papers omitted. I don't disagree - I'm just more interested in data at present.]
Now, Peter, I counsel patience! You will immediately reply: “But my robots cannot crunch Closed Access texts: I need to intervene manually!” True, but that problem will only be temporary, and you must not forget the far larger problem that precedes it, which is that 85% of papers are not yet being deposited at all, either as Open Access or Closed Access. That is the inertial practice that needs to be changed, globally, once and for all.
PMR: Here we differ. In many fields there has been little movement and no Green journals. We could wait another five years for no effect. But my main concern is the balance between Green access and copyrighted data. The longer we fail to address the copyrighting of data the worse the situation will become. Publishers are not stupid – they have revenue-oriented business people working out how to make money out of our data – Wiley told me so. Imagine, for example, that a publisher says “I will make all our journals green as long as we retain copyright. And we’ll extend the paper to cover the whole of the scientific record”. That would be wonderful for Stevan and a complete disaster for paper-crunchers. We can’t afford to wait for that to happen.
Just as I have urged that Gold OA (publishing) advocates should not over-reach (“Gold Fever”) – by pushing directly for the conversion of all publishers and authors to Gold OA, and criticizing and even opposing Green OA and Green OA mandates as “not enough” – I urge the advocates of automatized robotic data-mining to be patient and help rather than hinder Green OA and Green OA (and ID/OA) mandates.
PMR: I am not – I hope – hindering Green access. I am not personally agitating for Green or Gold – my energies go into arguing that the experimental process must not be copyrighted by the publisher or anyone else. And that institutional repositories should start to be much much more proactive and actively support the digital research process.
Stevan Harnad on “open access”
Tuesday, June 12th, 2007
- Stevan Harnad Says: June 12th, 2007 at 3:37 am
Open Access: What Comes With the Territory
Peter Murray-Rust’s worries about OA are groundless. Peter worries he can’t be sure that:
Pay no attention. Download, print, save and crunch (just as you could have done if you had keyed in the text from reading the pages of a paper book)! [Free Access vs. Open Access (Dec 2003)]
It will. The University OA IRs all see to that. That’s why they’re making it OA. [Proposed update of BOAI definition of OA: Immediate and Permanent (Mar 2005)]
Versions are tracked by the IR software, and updated versions are tagged as such. Versions can even be DIFFed.
You may not create derivative works. We are talking about someone’s own writing, not an audio for remix. And that is as it should be. The contents (meaning) are yours to data-mine and reuse, with attribution. The words, however, are the author’s (apart from attributed fair-use quotes). Link to them if you need to re-use them verbatim (or ask for permission).
Yes, you can. Download and crunch away.
This is all common sense, and all comes with the OA territory when the author makes his full-text freely accessible for all, online. The rest seems to be based on some conflation between (1) the text of research articles and (2a) the raw research data on which the text is based, and with (2b) software, and with (2c) multimedia – all the wrong stuff and irrelevant to OA.
Stevan Harnad
American Scientist Open Access Forum
More on “open access”
Tuesday, June 12th, 2007
- I agree with much but not all of what Peter MR says. I’m responding at length because I’ve often had many of the same thoughts.
- I’m the principal author of the BOAI definition of OA, and I still support it in full. Whenever the occasion arises, I emphasize that OA removes both price and permission barriers, not just price barriers. I also emphasize that the other major public definitions of OA (from Bethesda and Berlin) have similar requirements.
- I don’t agree that the term “open access” on its own, or apart from its public definitions, highlights the removal of price barriers and neglects the removal of permission barriers. There are many ways to make content more widely accessible, or many digital freedoms, and the term “open access” on its own doesn’t favor or disfavor any of them. Even at the BOAI meeting we realized that the term was not self-explanatory and would need to be accompanied by a clear definition and education campaign.
- The same, BTW, is true for terms like “open content”, “open source”, and “free software”. If “open source” is better understood than “open access”, it’s because its precise definition has spread further, not because the term by itself is self-explanatory or because “open access” lacks a precise definition.
- I do agree that many projects which remove price barriers alone, and not permission barriers, now call themselves OA. I often call them OA myself. This is only to say that the common use of the term has moved beyond the strict definitions. But this is not always regrettable. For most users, removing price barriers alone solves the largest part of the problem with non-OA content, and projects that do so are significant successes worth celebrating. By going beyond the BBB definition, the common use of the term has marked out a spectrum of free online content, ranging from that which removes no permission barriers (beyond those already removed by fair use) to that which removes all the permission barriers that might interfere with scholarship. This is useful, for we often want to refer to that whole category, not just to the upper end. When the context requires precision we can, and should, distinguish OA content from content which is merely free of charge. But we don’t always need this extra precision.
- In other words: Yes, most of us are now using the term “OA” in at least two ways, one strict and one loose, and yes, this can be confusing. But first, this is the case with most technical terms (compare “evolution” and “momentum”). Second, when it’s confusing, there are ways to speak more precisely. Third, it would be at least as confusing to speak with this extra level of precision – distinguishing different ways of removing permission barriers from content that was already free of charge – in every context. (I’m not saying that Peter MR thought we should do the latter.)
- One good way to be precise without introducing terms that might baffle our audience is to use a license. Each of the CC licenses, for example, is clear in its own right and each removes a different set of permission barriers. The same is true for the other OA-friendly licenses. Like Peter MR, I encourage providers to remove permission barriers and to formalize this freedom with a license. Even if we multiplied our technical terms, it will usually be more effective to point to a license than to a technical term when someone wonders exactly what we mean by OA for a given piece of work.
More Open Thesis heroes
Monday, June 11th, 2007
Oleg Evnin at Caltech (successfully defended May 26, 2006) [PMR: blogged by Peter Suber]
…a number of CC-licensed ETDs at the U of Edinburgh and that the earliest seems to be by Magnus Hagdorn, submitted on March 4, 2004.
Many thanks Mathias, and I shall enjoy reading your thesis – this whole area needs some disruptive technology – I am finding that approaches to repositories still look conservative and based on outdated models of thought. I can’t comment in detail on the science but the format of Magnus’ thesis is an excellent example of what a modern thesis should contain – it’s 400Mbyte zipped but contains splendid animations and data of glaciation – worth a look. But the problem with the repositories is that there is no indication that the actual thesis is OpenAccess. The Edinburgh repository announces:
All items in ERA are protected by copyright, with all rights reserved. Copyright for this page [1] belongs to The University of Edinburgh
[1] i.e. the metadata splash page
which discourages the visitor from looking for an Open License within the thesis. I’m sure this isn’t deliberate, but, repository managers, here is a very simple idea: Add dc:rights to the splash page and metadata and proudly proclaim in large letters: THIS THESIS CARRIES A CREATIVE COMMONS LICENCE – ENJOY!
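As a rough sketch of the dc:rights idea, the Python below builds the two pieces a splash page would need: Dublin Core `meta` elements for machines and a visible notice for humans. The function name `rights_markup` is hypothetical, the licence name and URL in the usage line are example values rather than the licence of any particular thesis, and how a given repository platform templates its splash pages is not covered here.

```python
from html import escape

# Namespace of the Dublin Core element set, declared via a schema.DC link
# as in the usual "DC in HTML meta tags" convention.
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def rights_markup(license_name, license_url):
    """Return (head elements, visible notice) stating the thesis licence."""
    name = escape(license_name, quote=True)
    url = escape(license_url, quote=True)
    head = (
        '<link rel="schema.DC" href="{}">\n'
        '<meta name="DC.rights" content="{} ({})">'
    ).format(DC_SCHEMA, name, url)
    notice = (
        '<p class="licence">This thesis carries a '
        '<a href="{}">{}</a> licence.</p>'
    ).format(url, name)
    return head, notice


# Example values only: substitute the licence the author actually chose.
head, notice = rights_markup(
    "Creative Commons Attribution 2.5",
    "http://creativecommons.org/licenses/by/2.5/",
)
print(head)
print(notice)
```

Keeping the machine-readable and human-readable statements in one function means they cannot drift apart, which is the failure the splash page above illustrates.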
Free Culture and Open Theses
Sunday, June 10th, 2007
Welcome to the Harvard College Thesis Repository
Welcome to the Harvard College Thesis Repository, a project of Harvard College Free Culture! Here Harvard students make their senior theses accessible to the world, for the advancement of scholarship and the widening of open access to academic research. Too many academics still permit publishers to restrict access to their work, needlessly limiting – cutting in half, or worse – readership, research impact, and research productivity. For more background, check out our op-ed article in The Harvard Crimson. If you’ve written a thesis in Harvard College, you’re invited to take a step toward open access right here, by uploading your thesis for the world to read. (If you’re heading for an academic career, this can even be a purely selfish move – a first taste of the greater readership and greater impact that comes with open access.) If you’re interested in what the students at (ahem) the finest university in the world have to say at the culmination of their undergraduate careers, look around.
There are 28 theses here and – unlike the green fuzzy repositories – all have been deposited under CC-BY (i.e. completely compliant with BOAI). The web page didn’t make the license position clear but I got the following clarification today:
Yes – all users of our repository agreed to a CC-by license when they uploaded their theses. As part of the submission process, all users agreed to the following terms:
“I am submitting this thesis, my original work, under the terms of the Creative Commons Attribution License, version 2.5: roughly, I grant everyone the freedom to share and adapt this work, so long as they credit me accurately. I have read and understood this license.”
We will work to make this more clear in the metadata for each thesis.
Well done Harvard College Free Culture – you have made an important step forward. Convince students in other institutions to follow your lead and the battle is won. (Not surprisingly there are no chemistry theses but I am sure that can be fixed).
Useful chemistry thesis in RDF
Sunday, June 10th, 2007
As you can see this is semantically useless. A lot of work has gone into this, but none of it is useful to a machine (look closely and you’ll see it’s a JPEG). Even in the native software which was used to draw it, it is unlikely that the semantics can be easily determined. However XML and RDF allow a complete representation. It took me about 1 hour to handcraft the topology – if we had decent tools it would be seconds. The complete set of reaction schemes (I counted 11 in the thesis) can be easily converted to a single RDF file which looks something like this:
uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .
uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .
(uc: refers to the usefulChemistry namespace, pmr: to mine).
There are many Open Source tools for graphing this and here is part of the output of one from the W3C:
Here you can see that reaction1.1a has four reactants (compound 1,2,3,4) and 1 product (comp 5). Comp5 is the reactant for another reaction (clipped to save blog problems). The complete picture for the whole thesis looks like this:
and (assuming you have a large screen) you can see immediately what reactions every compound is involved in.
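The same "what is each compound involved in" view can be had without a picture. The sketch below reads the ten triples listed earlier and indexes every compound by the reactions it takes part in. It is a toy: the whitespace split stands in for a real RDF parser and only works because each statement here is exactly `subject predicate object .`, and the names `parse` and `involvement` are mine, not part of any toolkit.

```python
from collections import defaultdict

# The triples from the reaction scheme above, copied verbatim.
TRIPLES = """
uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .
uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .
"""


def parse(text):
    """Yield (subject, predicate, object) from one-statement-per-line text."""
    for line in text.strip().splitlines():
        subject, predicate, obj, _dot = line.split()
        yield subject, predicate, obj


def involvement(triples):
    """Map each compound to the (reaction, role) pairs it appears in."""
    roles = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "pmr:hasReactant":
            roles[obj].append((subject, "reactant"))
        elif predicate == "pmr:hasProduct":
            roles[obj].append((subject, "product"))
    return dict(roles)


roles = involvement(parse(TRIPLES))
for compound in sorted(roles):
    print(compound, roles[compound])
```

Compound 5 comes out with two entries, product of reaction 1.1a and reactant of 1.1b, which is exactly the link the graph shows.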
That’s only the start as it is possible to ask sophisticated questions from a SPARQL endpoint – and that’s where we are going next…
… IFF you make the theses true Open Access