Monthly Archives: September 2007

NSF/JISC meeting on eScience/cyberinfrastructure

I was privileged to be at a meeting between JISC (UK) and NSF (US). Every paragraph of the report is worth reading – I quote a few…

William Y. Arms and Ronald L. Larsen, The Future of Scholarly Communication: Building the Infrastructure of Cyberscholarship, September 26, 2007. Report of the NSF/JISC Repositories Workshop (Phoenix, April 17-19, 2007). It announces:

The fundamental conclusions of the workshop were:
• The widespread availability of digital content creates opportunities for new forms of research and scholarship that are qualitatively different from traditional ways of using academic publications and research data. We call this “cyberscholarship”.
• The widespread availability of content in digital formats provides an infrastructure for novel forms of research. To support cyberscholarship, such content must be captured, managed, and preserved in ways that are significantly different from conventional methods.
As with other forms of infrastructure, common interests are best served by agreement on general principles that are expressed as a set of standards and approaches that, once adopted, become transparent to the user. Without such agreements, locally optimal decisions may preclude global advancement. Therefore, the workshop concluded that:
• Development of the infrastructure requires coordination at a national and international level. In Britain, JISC can provide this coordination. In the United States, there is no single agency with this mission; we recommend an inter-agency coordinating committee. The Federal Coordinating Council for Science, Engineering and Technology (FCCSET), which coordinated much of the US government’s role in developing high performance computing in the 1990s, provides a good model for the proposed Federal Coordinating Council on Cyberscholarship (FC3S). International coordination should also engage organizations such as the European Strategy Forum on Research Infrastructures (ESFRI), the German research foundation DFG, and the Max Planck Digital Library.
• Development of the content infrastructure requires a blend of interdisciplinary research and development that engages scientists, technologists, and humanities scholars. The time is right for a focused, international effort to experiment, explore, and finally build the infrastructure for cyberscholarship.
• We propose a seven-year timetable for implementation of the infrastructure. The first three years will focus on developing and testing a set of prototypes, followed by implementation of coordinated systems and services.

Computer programs analyze vast amounts of information that could never be processed manually. This is sometimes referred to as “data-driven science”. Some have described data-driven science as a new paradigm of research. This may be an over-statement, but there is no doubt that digital information is leading to new forms of scholarship. In a completely different field, Gregory Crane, a humanities researcher, recently made the simple but profound statement, “When collections get large, only the computer reads every word.” A scholar can read only one document at a time, but a supercomputer can analyze millions, discovering patterns that no human could observe.

The National Virtual Observatory describes itself as “a new way of doing astronomy, moving from an era of observations of small, carefully selected samples of objects in one or a few wavelength bands, to the use of multiwavelength data for millions, if not billions of objects. Such datasets will allow researchers to discover subtle but significant patterns in statistically rich and unbiased databases, and to understand complex astrophysical systems through the comparison of data to numerical simulations.” From: http://www.us-vo.org/

The workshop participants set the following goal:
Ensure that all publicly-funded research products and primary resources will be readily available, accessible, and usable via common infrastructure and tools through space, time, and across disciplines, stages of research, and modes of human expression.

The shortcomings of the current environment for scholarly communication are well known and evident. Journal articles include too little information to replicate an experiment. Restrictions justified by copyright, patents, trade secrets, and security, and the high costs of access all add up to a situation that is far from optimal. Yet this suboptimal system has vigorous supporters, many of whom benefit from its idiosyncrasies.
For example, the high cost of access benefits people who belong to the wealthy organizations that can afford that access. Journal profits subsidize academic societies. Universities use publication patterns as an approximate measure of excellence.

Younger scholars, who grew up with the Web, are less likely to be restrained by the habits of the past. Often – but not always – they are early adopters of innovations such as web search engines, Google Scholar, Wikipedia, and blog-science. Yet, they come under intense pressure early in their careers to conform to the publication norms of the past.

… and so the final proposal

… a seven-year target for the implementation of the infrastructure for cyberscholarship. The goal of establishing an infrastructure for cyberscholarship by 2015 is aggressive, but achievable, when coordinated with other initiatives in the U.S., Britain, and elsewhere. A three-phase program is proposed over seven years: a three-year research prototype phase that explores several competing alternatives, a one-year architecture specification phase that integrates the best ideas from the prototypes, followed by a three-year research and implementation phase in which content infrastructure is deployed and research on value-added services continues. Throughout the seven years, an evaluation component will provide the appropriate focus on measurable capability across comparable services. A “roadmap” for the program is suggested in the following figure.

[... it's too large to cut, so you'll have to read it for yourselves...]

… and the details …

Open grant writing. Can the Chemical Blogosphere help with “Agents and Eyeballs”?

In the current spirit of Openness I’m appealing to the chemical blogosphere for help. Jim Downing and I are writing a grant proposal for the UK’s JISC, which supports education and research through digital libraries, repositories, eScience/cyberinfrastructure, collaborative working, etc. The grant will directly support the activities of the blogosphere, for example by providing better reporting and review tools, hopefully with chemical enhancement.
The basic theme is that the Chemical Blogosphere is now a major force for enhancing data quality in chemical databases and publications, and we are asking for 1 person-year to help build a “Web 2.0”-based system to support the current practice and ethos. The current working title is “Agents and Eyeballs”, reflecting that some of the work will be done by:

  • machines, as in CrystalEye – WWMM, which aggregates and checks published crystal structures on a daily basis.
  • humans as in the Hexacyclinol? Or Not? saga. Readers may remember that there was a report of the synthesis of a complicated molecule. This was heavily criticized in the blogosphere, and indeed the top 9 hits on google for “hexacyclinol” are all blogs – the formal, Closed, peer-reviewed paper comes tenth in interest.

“Given enough eyeballs, all bugs are shallow” – Eric Raymond. In chemistry it is clear that the system of closed peer-review by 2-3 humans sometimes leads to poor data quality and poor science. We’ve found that in some chemistry journals almost every paper has an error – not always “serious”, but … So:
“Agents and eyeballs for better chemical peer-review”.

Not very catchy but we’ll think of something.
It’s unusual to make your grant proposal Open (and we are not actually putting the grant itself online, especially the financial details). But there are parts of the case that we would like the blogosphere to help with. If you have already written a blog post on any of the aspects here, please give the link. You may even wish to write a post:

  • showing that the blogosphere is organised and effectively oversees all major Open discussion in chemistry. I take Chemical blogspace as the best place for a non-chemist (as the reviewers will be) to start.
  • showing that the blogosphere cares about data. Here I would like to point to the Blue Obelisk and the way Chemspider has reacted positively to the concerns about data quality.
  • showing that important bad science cannot hide. I would very much like an overview of the hexacyclinol story – which is still happening – with some of the most useful historical links. Anything showing that the blogosphere was reported in the conventional chemical grey literature would be valuable.
  • on Open Notebook Science.

We have three partners from the conventional publishing industry – I won’t name them – who have offered to help explore how the Agents and Eyeballs approach could help with their data peer review.

You might ask “why is PMR not doing this, but asking the blogosphere?” It’s precisely because I want to show how responsive and responsible the blogosphere is, when we ask questions like this.

There is considerable urgency. To include anything in the grant we’ll need it within 36 hours, although contributions after that will be seen by the reviewers. I suggest that you leave comments on this post, with pointers where necessary. Later I suspect we’ll wikify something, but it’s actually the difficulty of doing this properly and easily that is – in part – motivating the grant.

TIA

Volunteers: does the computer experience translate to chemistry?

One of the spinoffs of having been to scifoo is that I skim over 50+ posts a day from the blogs that participants run. Some are multi-author blogs. Here’s Andy Oram on Tim O’Reilly’s blog, talking about what makes volunteer documenters click. Read it all.

01:47 30/09/2007, Andy Oram, Planet SciFoo
By Andy Oram

[...]If value increasingly comes from communities of volunteers outside the compass of corporate management, isn’t it only right to shift resources to support these communities? I have to deal with that question in my own field of computer documentation, where the shift to community production is happening as fast as it is anywhere. (I examine this trend in a series of articles about community documentation.) [PMR - listed below] But many industries could ask the same question I explore in this article: how can society shift its resources to support the important new source of value in communities?

Volunteerism needs support

The idea that volunteers play an important social role goes at least as far [...]

Volunteers who are paid, of course, are no longer volunteers. Companies have hit upon an enormous number of intermediate forms of reward by now: invitations to focus groups and conferences, honorable mentions, free products, etc. Still, serious problems in the concept of rewarding volunteers have been publicized:

  • Rewards create incentives to game the system, which would ultimately lead productive volunteers to abandon the system as unfair.
  • Even when rewards are fair, they “crowd out” the original incentives that led volunteers to serve in the first place.
  • It’s just plain impossible to determine how much each volunteer’s contribution is worth.

The final point just listed is the killer. The reasons for it are easy to state: the ultimate value created by any new idea may lie far out in the future, and the give-and-take discussion around information makes it hard to trace a valuable idea to an individual or small group. Let’s look at this problem more closely.

The value of information

[...]In computer documentation (as in journalism), certainly, it’s becoming harder and harder to add value to what the community contributes for free. So the challenge becomes how to improve the community’s offerings.

I find the key traits of value in documentation to be:

  • Availability–somebody has to write it in the first place. (Readers also need computers and Internet access in order to meet this goal.)
  • Findability–people need something better than current search techniques to find obscure documents, and particularly need help finding background when they read a document that assumes too much prior knowledge.
  • Quality–this covers such general and complex issues as accuracy, relevance, and readability.

A particularly urgent aspect of quality is keeping a document up to date. Many a project has annoyed its users by starting out with reasonably good documentation and failing to keep it updated. Somehow, people who enjoyed writing something the first time lose interest in maintaining it. This is just as true for comments in source code and commercial books. (Many of my authors have built their reputations and businesses on books they’ve written, and despite good intentions have been unable to find time to update the books.) I myself have lived out the feeling of writing new documentation for a free software project and then lacking the motivation to go back to it.

Thus, companies and user consortia who want to direct resources toward making software more usable can consider:

  • Offering incentives that make the best people contribute, while trying to avoid invoking the crowding-out phenomenon.
  • Providing paths through documentation, so readers can find what they need in their particular state of knowledge. This task is an ongoing research project for any particular body of documentation.
  • Ensuring continuity, by tracking the need to update documents and finding people to do so.
  • Training contributors to do a better job and make the most of their efforts.

The last of the tasks interests me in particular, because it provides scope for offering my skills as an editor and O’Reilly’s as a publisher. But we need some compensation for it.

I feel funny, of course, offering our services as editors or other quality providers when the original authors might not be paid. But if you accept that it’s harder to recruit people for supporting roles than for leading roles, payment is justified.

To conclude, I think volunteers can be supported without being paid directly. If they know their work will be improved to be more useful and will have lasting value, they’ll have more incentive to contribute.

[...]

PMR: and the details:

… writings by Andy Oram about web pages, forums, and other media used by users of technology to educate each other. Articles include (in reverse chronological order):


Andy Oram
Editor, O’Reilly Media
Home page

PMR: This is very relevant to recent developments in the Blue Obelisk, where a volunteer community has become the keeper of the SMILES de facto standard. We should read Andy’s thoughts carefully.

The equations are similar but not isomorphic. Why do people work with the BO? Here are some ideas:

  • A sense of community. This is a major reward for many people, being able to keep in touch and knowing that you are on the right track (or more importantly, on the wrong one). And the price of membership, though not explicitly stated in the gift economy, is to contribute and to uphold the ideals of the work.
  • A fuzzy mixture of morals, ethics and politics. It is the “right thing to do”. If that drives some people, great. On the reverse I have been attacked several times for being immoral in promoting various aspects of Open Chemistry – it destroys the jobs of honest hard-working developers. [No, it creates jobs for those people who wish to translate to C21].
  • Personal “academic” karma. This is a major motivation. As the BO succeeds those people who have been associated with it will be asked to write articles for value publications, to cooperate on the next phase of funded Web 2.0 grants, etc. For aspiring scientists to work together.
  • Personal financial reward. This is a powerful and valid motivation. There is lots of potential – I wouldn’t have a job today if I hadn’t contributed to the development of XML. When we look for people to join us, the blogosphere is an obvious recruiting ground.  And as the balance shifts from closed to open there will need to be ways of monetizing Openness. The chemical information market is worth at least low billions of USD – it’s still going to be there in 10 years’ time. But many of the conventional players will be gone and new ones will have taken their place.
  • Fun. Yes, fun. We like writing algorithms. If you are a Sudoku addict you’d enjoy writing a chemical substructure search. We like drawing molecules. Many artists – like Jane Richardson – have joined the community of molecular graphics. We like building Second Life. We like writing blogs.
  • Changing the world. Everyone contributing to the BO is changing the world… It may not be apparent, but it’s real.

and as Alma Swan, quoting Gandhi, (blogged by Barbara Kirsop) reminded us:

‘first they ignore you, then they laugh at you, then they fight you, then you win’.

The BO has not won yet. It’s somewhere between ignore and laugh, and for the next little while we’d love some documentation volunteers!

Open Access at Abbey Square

Yesterday Jim Downing, Nick Day and I were the guests of Peter Strickland and Brian McMahon at the International Union of Crystallography in their gorgeous offices at Abbey Square, Chester, UK.

The IUCr is a member of ICSU – International Council for Science and as such acts as a governing body. It has taken a very proactive role over the last 5 decades (and probably more, but I can’t remember) on things like data quality, standards, creating a community. So do all Scientific Unions – such as IUPAC (which recently did me the honour of making me a fellow) – but I hope I’m not divisive in giving the IUCr some individual praise.

I remember the IUCr running a community exercise – I think in the 1950/1960 period – where labs were invited to collect data sets from a standard crystal (something like sodium ammonium tartrate, but I forget). That meant that the community could estimate the precision and accuracy that might be possible at that time. The philosophy has continued, and of course technology is much improved, so that routine crystallographic data has excellent precision and accuracy. The IUCr has also emphasized the publication of data sets as part of the scientific record, both so that they can be checked and with the expectation that future scientists might revisit data sets and re-use them. (For example, when I did my doctorate the programs couldn’t model anisotropic scattering from atoms, and it would be easy to re-analyse the data.) The IUCr has always promoted the publication of the raw data, and it’s due to their advocacy that Nick Day has been able to create CrystalEye – WWMM from the supplemental crystallographic data that many responsible scientific publishers mount on their websites. The IUCr had given us some initial support for a summer student – Mark Holt – and we were showing where it had got to. CrystalEye is an excellent model for harvesting data from publisher sites – at least those who don’t try to possess public domain data. More on all of this later.

The IUCr is also a publisher – its flagship journal is Acta Crystallographica (sections A-F). CrystalEye takes data mainly from E, C and parts of B. Acta has a hybrid approach to OA – the cost to authors is 900 USD which is a lot less than most. I think we can expect more developments in this area.

Four theses and a repository

I’ve been advocating that all theses should be deposited in institutional repositories under CC-BY licences, and here’s an interlude with 4 I have personal knowledge of. I’m keeping the authors secret, although those in the know will identify some.

ONE is from someone I have supervised. They have submitted their thesis and are awaiting a viva. I shouldn’t comment publicly on its quality even though I am not an examiner. But we are both keen to see the thesis under OA – the question is what to do with the data – ca 15 GB of computation. Is the repository the right place for it?

TWO is from someone I know well who has also submitted their thesis. It’s in a University which already has a tradition of Open Access in their IR. Although it’s not in a field in which I can claim expertise (music and AI), I think it deserves widespread visibility.

THREE is someone I examined recently at another University. I can’t publicly say what we recommended for the candidate, but at least we had a drink afterwards. I broached the subject of Open Access and the candidate was excited – they want the world to know what they have done. So I have written to the University and although this is not an established routine I got an encouraging reply.

FOUR was written many years ago on manual typewriters (sic) and several carbon papers. I think physical copies still remain. So I wondered if the author of FOUR might wish to see his thesis digitised [1] and made open access. Should I suggest he writes to his alma mater? If he can get his act together.
[1] Institutions such as Caltech have been retrospectively digitising their theses – see for example this one.

Chemical Speeddrawing

There used to be an advert on the London Tube for “Speedwriting” (something like “f u cn rd ths u cn gt a gd jb”). What about speed-drawing of chemical structures? Here’s Liquid Carbon:

Finally, I’d like to offer a small pissing contest. It takes me:
• 30 sec to draw THC
• 38 sec to draw Penicillin G
• 82 sec to draw discodermolide
What about you? The compounds should be drawn with all stereochemical information and in the same general style (bond angles, side chains positioning) as in the picture below.

PMR: some of the blogosphere responded and the times were similar.

What about non-graphical input such as SMILES or even WLN (which Depth-First resurrected in “Everything Old is New Again: Wiswesser Line Notation (WLN)”)? Of course WLN doesn’t do stereo, but I bet the practitioners could beat the times above by some considerable margin. And it wouldn’t be too difficult to include the stereo – in the last 40 years we have lowercase letters on our keyboards!

And it took me 27 seconds to type the SMILES for penicillin (admittedly without stereo and orientation). But, as readers of this blog know, I can’t type either.

The Obelisk SMILES

We are delighted that Craig James has suggested making the molecular format SMILES an Open activity. Egon Willighagen writes:

08:03 28/09/2007, Egon Willighagen,
Craig James wants to make SMILES an open standard, and this has been received with much enthusiasm. SMILES (Simplified molecular input line entry specification) is a de facto standard in chemoinformatics, but the specification is not overly clear, which Craig wants to address. The draft is CC-licensed and will be discussed on the new Blue Obelisk blueobelisk-smiles mailing list.

Illustrative is my confusion about the sp2 hybridized atoms, which use lower case element symbols in SMILES. Very often this is seen as indicating aromaticity. I have written up the arguments supporting both views in the CDK wiki. I held the position that lower case elements indicated sp2 hybridization, and the CDK SMILES parser was converted accordingly some years ago. A recent discussion, however, stirred up the discussion once more (which led to the aforementioned wiki page).

You can imagine my excitement when I looked up the meaning in the new draft. It states:

The formal meaning of a lowercase “aromatic” element in a SMILES string is that the atom is in the sp2 electronic state. When generating a normalized SMILES, all sp2 atoms are written using a lowercase first character of the atomic symbol. When parsing a SMILES, a parser must note the sp2 designation of each atom on input, then when the parsing is complete, the SMILES software must verify that electrons can be assigned without violating the valence rules, consistent with the sp2 markings, the specified or implied hydrogens, external bonds, and charges on the atoms.

PMR: This is excellent. The problem with specifications is that it is VERY difficult to describe them so that independent groups can interpret them consistently. I spent some years helping with the XML effort and apparently simple ideas could cause huge debates. (e.g. namespaces…) It’s well known that some constructs in computer languages, such as

int i = 6;

int j = i++ * i++;

i = i++;
cause enormous confusion. What are the results? (Try to work it out, then try it out and then find the “right answer”; your compiler may surprise you.) [*]

Back to chemistry. Almost all formats have been proprietary. That means that there is unlikely to be much useful interactive public help from the originators, and the only check is likely to be a binary executable. When I joined the pharma industry and started trying to get some standards, one software company threatened to sue anyone who published their molecular file formats. It’s slightly better now, but IMO the responsibility for the current appalling situation lies with the pharma industry, which has had no effective interest in standardising anything and is now paying the price. (It can only survive by using information, and until it makes this standard and largely free it won’t.)
That’s a major reason for developing CML (Chemical Markup Language). CML is open, and uses open standards (XML). It’s much larger than SMILES, and there are places where it is defined less well than we would like, but at least it’s open and that can happen.

SMILES is very widely used. Creating an open standard will take more effort than might appear. The “aromatic” or “lower case” concept is extremely difficult to define. I don’t understand the definition:
The formal meaning of a lowercase “aromatic” element in a SMILES string is that the atom is in the sp2 electronic state.

I don’t believe that SMILES has anything to do with electronic states and I think it should simply be a means for counting atoms, formal bonds and electrons. Is there a difference between Cn(C)C and CN(C)C ? The first represents a planar transition state of trimethylamine, the second a pyramidal ground state.
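
To make the “just count atoms” reading concrete, here is a minimal sketch in Python. It is an illustration only, not part of any SMILES specification or toolkit: the function names are hypothetical, and it assumes a tiny subset of SMILES (unbracketed organic-subset atoms, with no charges, isotopes or explicit hydrogens). Under that reading the two strings above describe the same collection of atoms and differ only in the case marking on nitrogen.

```python
import re
from collections import Counter

# Unbracketed "organic subset" atoms only (an assumption of this sketch).
# Two-letter symbols are listed first so that "Cl" is not read as "C" + "l".
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles):
    """Count heavy atoms, folding lowercase symbols to their element."""
    counts = Counter()
    for token in ATOM.findall(smiles):
        counts[token.capitalize()] += 1
    return counts


def lowercase_atoms(smiles):
    """Return the atom symbols that were written in lowercase."""
    return [token for token in ATOM.findall(smiles) if token.islower()]


if __name__ == "__main__":
    for s in ("Cn(C)C", "CN(C)C"):
        print(s, dict(heavy_atom_counts(s)), lowercase_atoms(s))
```

Whether the lowercase marking should carry any meaning beyond this (aromaticity, hybridization, geometry) is exactly the question for the mailing list; the sketch deliberately takes no position on it.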

But the positive point is that I have the chance to make this view and others the chance to support it, modify it or challenge it. Just like Wikipedia, the Blue Obelisk uses the court of public opinion. And we have the exciting position that a “Web 2.0” community is now about to lead the chemoinformatics world.

Maybe the pharma industry will take us seriously. And, wonder of wonders, might actually come into the open, say so, and offer some support.

[*] Actually, in C both are undefined behaviour, so different compilers may give different answers.

What’s in a name? hexanoic acid still smells of goats

In a recent post I said – rather crudely – that there was no absolute way of understanding chemical names. I have been (rightly) taken to task for imprecision:

ChemSpiderMan Says:
September 25th, 2007 at 5:04 am

I’m not sure what you mean by the comment “Because there is no absolute way of assigning names to structures.” Systematic naming is exactly that….IUPAC Naming, CAS Naming. Well defined rules. Now, are they exhaustive across all forms of chemistry..surely not…inorganics, organometallics, polymers while challenging do have nomenclature standards too while some believe they don’t. Of course chemical structure classes change…there were no rules of fullerenes before they were synthesized. But, in general there IS an absolute way of assigning the names to structures. Maybe I misinterpreted your

PMR: This is true, in principle, for certain classes of compounds (mainly organic). BTW many chemical (informatics) folk are arrogant enough to assume that there is nothing in the world except organic chemistry. There are many chemicals which aren’t organic. The Wikipedians have a lot of problems in deciding how to assign a name to something because they use names as both descriptions and addresses. Naming is hard. Very hard. It’s been said that there are only two hard problems in computer science and naming is one of them. Here are some substances that can’t be represented by a formal name, only by lookup:

calcite / aragonite
Bakelite
invert sugar

and, of course, there are trivial names, such as Diazonamide A. Why use that rather than the systematic name? Because when it was first discovered they didn’t know what it was. It seems they still don’t. Or at least some people don’t. The name relates not to a connection table but to a sample with associated properties such as composition, melting point, NMR, etc. which serve to identify, but not always elucidate.

Trivial names are convenient. Therefore we need an Open (not just free) set of chemical names.

I’ve just remembered. We’ve got several: PubChem, Wikipedia, ChEBI. Set up respectively by biologists, volunteers, biologists. For the service of chemists. Chemists might even get interested in helping them grow.

Structures that InChI and SMILES can’t represent

Even in organic chemistry there are lots of structures that cannot be represented by InChIs and currently cannot be communicated without structure diagrams. I’ve gone randomly to Beilstein Journal of Organic Chemistry (as it’s Open Access) and found three consecutive abstracts. They contain ideas of variable locants, spatial arrangements, non-atomic species (balls), reactions, ion pairs, organometallic coordination. It would be an act of scientific barbarism to copyright anything below.

Novel base catalysed rearrangement of sultone oximes to 1,2-benzisoxazole-3-methane sulfonate derivatives
Veera Reddy Arava, Udaya Bhaskara Rao Siripalli, Vaishali Nadkarni, Rajendiran Chinnapillai
Beilstein Journal of Organic Chemistry 2007, 3:20 (8 June 2007)


m-Iodosylbenzoic acid – a convenient recyclable reagent for highly efficient aromatic iodinations
Andreas Kirschning, Mekhman S Yusubov, Roza Y Yusubova, Ki-Whan Chi, Joo Y Park
Beilstein Journal of Organic Chemistry 2007, 3:19 (4 June 2007)

A convenient catalyst system for microwave accelerated cross-coupling of a range of aryl boronic acids with aryl chlorides
Matthew L Clarke, Marcia B France, Jose A Fuentes, Edward J Milton, Geoffrey J Roff
Beilstein Journal of Organic Chemistry 2007, 3:18 (30 May 2007)


FWIW: CML can manage much of the uncertainty above, but although it is a work of breathtaking beauty it also shouldn’t be copyrighted.

Grazie!

I made the sweeping assertion at Berlin5 that no-one other than me was blogging (I asked for a show of hands), and am delighted to be proved wrong:

Paolo Gardois Says:
September 24th, 2007 at 3:30 pm

Firstly, compliments for your presentation, it was great!!
Secondly, I was at Berlin 5, blogging the meeting, so you should feel a little less sad… :-) Our blog is in Italian, but if you want to take a look: http://unitosbd.wordpress.com .
Please let us know what you think…

PMR: Paolo has done a great job and blogged every session.

Peter Murray Rust, Cambridge scientist and blogger, continues with a paper on making research data public, starting from the example of data on pollution and global warming. After the unmissable Tufte quotation (“Power Corrupts. PowerPoint Corrupts Absolutely”), Rust goes on to outline the omnipresence of copyright, both in the tables and graphs published within the scientific articles of commercial publishers and in databases (e.g. ACS).

Now, one aspect of open access concerns free access to information, but another concerns the possibility of reusing data. The two are not always linked, and this is a problem for scientists, who generate new knowledge literally by manipulating and reconfiguring publicly available data. One solution is to attach an explicit licence covering the use of the data (e.g. Science Commons).

Doctoral theses, too, should be released under similar licensing models (see the Harvard initiative).

On another front, there are also technical difficulties in extracting data (formulae, etc.) from publications in order to reuse them. So it is not only a problem of copyright but also of formats. Raw data therefore need to be published in standard formats in public repositories, and in parallel text mining tools (automatic extraction of data from text files) need to be developed; XML, obviously, not PDF, which destroys science :-)

One example of these tools, useful for the semantic annotation of chemistry articles, is OSCAR3.

In any case, irony aside, what was said here about closed formats echoes what I wrote yesterday about closed documents as a now obsolete form of publishing knowledge. Often the data are more interesting than the conclusions drawn from them, because they allow discussion and alternative interpretations: “locking” them exclusively inside PDF and PowerPoint is certainly a mistake, and another face of the same problem. The concepts from which a line of reasoning can start here are: open access (cultural, legal, economic, professional aspects), open source (IT and production aspects), open standards (access, reuse, reconfiguration).

PMR: Good, accurate report, like the others. I am interested to see that Italian uses many English terms directly: “copyright”, “repository”, “open access”, “text mining”. I don’t want to seem like an anglophone imperialist, but in the Internet age it can be useful to know that we are using the same terms for the same concept. Of course copyright will be country-dependent in precise meaning.