petermr's blog

A Scientist and the Web

 

Archive for the ‘publishing’ Category

APE2008 more thoughts

Wednesday, January 30th, 2008

Because there was no electricity or wireless at the APE meeting (APE 2008) I took some notes, but they seem rather dry now and have lost some of their immediacy. So I shall use the meeting to catalyze some thoughts.

Michael Mabe – CEO of STM – gave a useful presentation about facts in publishing, but they don’t read well on this blog a week later. The growth of publishing is not new – ca. 3.5 percent for the last 300 years. So it’s a good thing that we’ve gone digital, or the whole world would be drowned in the Journal Event Horizon (cf. the Shoe Event Horizon). 1 million authors, over 1 billion article downloads. The primary motive of authors is to disseminate their ideas (it’s reassuring to know that, as we can plan new ways of doing it).

An afternoon session, with some (rather random) snippets:

Ulrich Poeschel, Mainz,
in “bad papers” the main problem is carelessness, not fraud, etc. – overly superficial reports of experiments, non-traceable arguments. He described interactive/dynamic publishing where review has several stages (but I can’t remember his journal – maybe it was Atmospheric Chemistry and Physics?).
Traditional peer review is not efficient today – editors and referees have limited capacity; too few editors and reviewers
traditional discussion papers are very rare – originally 1 in 20 papers received comments, now about 1 in 100
speed conflicts with thorough review
so develop speed first, review later
discussion paper = upper-class preprint (some pre-selection)
lengthier traditional peer-review later
referees can maintain anonymity – self-regulation works

rewards for well-prepared papers
most limiting factor is refereeing capacity
total rejection rate only ca 10%, so referees’ effort is saved
deters careless papers – self-regulation through transparency
5 comments/paper – 1/4 papers get public comment
comment volume is 50% of publication

now #1 in atmospheric physics, #2 in geosciences

Catriona MacCallum: PLoS – …
journals and their discussion sections cannot capture discussions in the ways that blogs do – blogs are self-selecting communities. TOPAZ is open-source publishing software – it makes connections between all the components of publishing systems – blogs, documents, data, services …

Linda Miller Nature.

Purpose of peer-review is to decide where paper should be published
protects public (e.g. health and large policy)
avoids chasing spurious results
quality of review is decreasing

open trial for commenting on (PMR: I think) regular Nature papers.

12% of regular authors (PMR: I assume this is in Nature) accepted comment trial – mainly earth, eco, evo, physics
half papers had comments
comments’ average score was 2/5 (i.e. the comments weren’t very good)
no chemistry, genomics
low awareness of trial

why do people not comment? Overwork – no incentive?

so we need:
motivation for PR
stable identifier for reviewer
high ratings on pubmed
checks and balances on retributions
critical mass of submissions

referees need to get credit
need to develop online reputation score
CVs should include this

Change is inevitable – except from a vending machine (Robert C. Gallagher)

My next thoughts will hopefully include:

  • role of librarians
  • beyond the full-text
  • legal and contractual stuff

APE2008 – Heuer, CERN

Friday, January 25th, 2008

APE (Academic Publishing in Europe) was a stimulating meeting, but I wasn’t able to blog any of it as (a) there wasn’t any wireless and (b) there wasn’t any electricity (we were in the Berlin-Brandenburg Academy of Sciences, which made up for the lack with its architecture and the legacy of bullet holes in the masonry). So I took notes while the battery lasted, but they read rather staccato.
The first keynote was very exciting. Rolf-Dieter Heuer is the new Director General of CERN – where they start hunting the Higgs boson any time now. CERN has decided to run its own publishing venture – SCOAP3 – which I first heard of from Salvatore Mele. I’m hoping to visit him at CERN before they let the hadrons loose.

So my scattered notes…

SCOAP3 requires that all COUNTRIES contribute (i.e. total commitment from the community and support for the poorer members)
closely knit community, 22,000 people
ca 10 MEUR for HEP – much smaller than the experiments (500 MEUR), so easy for CERN to manage (so organising a publishing project is small beer compared with lowering a 1200-tonne magnet down a shaft)

22% use of Google by young people in physics as primary search engine
could we persuade people to spend 30 mins/week on tagging?

what people want
full text
depth of content
quality

build complete HEP platform
integrate present repositories
one-stop shop
integrate content and thesis material [PMR - I agree this is very important]

text-and data-mining
relate documents containing similar information
new hybrid metrics
deploy Web2.0
engage readers in subject tagging
review and comment

preserve and re-use research data
includes programs to read and analyse
data simulations, programs behind expts
software problem
must have migration
must reuse terminated experiments

[PMR: Interesting that HEP is now keen to re-use data. We often heard that only physicists would understand the data, so why re-use it? But now we see things like the variation of the fundamental constants over time – I *think* this means that the measurement varies, not the actual constants.]

preservation
same researchers
similar experiments
future experiments
theorists who want to check
theorists who want to test future theories (e.g. the weak force)
need to reanalyse data over time (JADE experiment – tapes saved weeks before destruction, and the expertise was still available)
SERENDIPITOUS discovery showing that the weak force grows less with shorter distance

Raw data 3200 TB

raw -> calibrated -> skimmed -> high-level objects -> physics analysis -> results
must store semantic knowledge
involve grey literature and oral tradition

MUST reuse data after experiment is stopped

re-usable by other micro-domains
alliance for permanent access

PMR: I have missed the first part because the battery crashed. But the overall impression is that SCOAP3 will reach beyond physics just as arXiv does. It may rival Wellcome in its impact on Open Access publishing. SCOAP3 has the critical mass of community, probably the finance, and it certainly has the will to succeed. Successes tend to breed successes.

… more notes will come at random intervals …

APE 2008

Sunday, January 20th, 2008

I’m off to the APE meeting in Berlin: APE 2008 “Quality and Publishing”, which asks some questions:

  • What do we really know about publishing?
  • Is ‘Open Access’ a never-ending story?
  • Will there be a battle between for-profit and not-for-profit publishing, and who will be the survivors?
  • Which is the best peer review system in the public interest?
  • What does impact mean in times of the Internet?
  • What are the plans of the European Commission for digital libraries, access and dissemination of information?
  • Will libraries become university presses or repositories?
  • How efficient is ‘OA’ in terms of information delivery?
  • What are the full costs of information?
  • Business models versus subsidies?
  • What is the future role of books and reference works?
  • How important are local languages?
  • Which kind of search engines do we all need?
  • What about non-text and multimedia publications?
  • Which models for bundling and pricing will be accepted?
  • What makes publications so different?
  • Why are some journals in a defined subject field much more successful than other journals?
  • How important is the role of editors and editorial boards?
  • What education and training is required?
  • What skills are needed?
  • Barrier-free information: do we provide sufficient access for the visually impaired?

I often sit at the back and blog so maybe I’ll give some answers. OTOH the hotel offers Internet for a price of 10 EUR/hour so maybe I won’t be able to post anything. (From what I can see Germany is one of the worst countries for charging for casual internet time – can’t we initiate some “Open Access”?)

Friday, January 18th, 2008

From Peter Suber: More on the NIH OA mandate.

Many points but I pick one:

 

Jocelyn Kaiser, Uncle Sam’s Biomedical Archive Wants Your Papers, Science Magazine, January 18, 2008 (accessible only to subscribers).  Excerpt:

If you have a grant from the U.S. National Institutes of Health (NIH), you will soon be required to take some steps to make the results public. Last week, NIH informed its grantees that, to comply with a new law, they must begin sending copies of their accepted, peer-reviewed manuscripts to NIH for posting in a free online archive. Failure to do so could delay a grant or jeopardize current research funding, NIH warns….

[...]

Scientists who have been sending their papers to PMC say the process is relatively easy, but keeping track of each journal’s copyright policy is not….

PMR: Exactly. It should be trivial to find out what a journal’s policy is. As easy as reading an Open Source licence. An enormous amount of human effort – authors’, repositarians’ – is wasted on repeatedly trying to (and often failing to) get this conceptually simple information.

 

I’ve been doing articles and interviews on OA and Open Data recently, and one thing that becomes ever clearer is that we need licences or other tools. Labeling with “open access” doesn’t work.

 

Science 2.0

Friday, January 18th, 2008

Bill Hooker points to an initiative by Scientific American to help collaborative science. Mitch Waldrop on Science 2.0

I’m way behind on this, but anyway: a while back, writer Mitch Waldrop interviewed me and a whole bunch of other people interested in (what I usually call) Open Science, for an upcoming article in Scientific American. A draft of the article is now available for reading, but even better — in a wholly subject matter appropriate twist, it’s also available for input from readers. Quoth Mitch:

Welcome to a Scientific American experiment in “networked journalism,” in which readers — you — get to collaborate with the author to give a story its final form. The article, below, is a particularly apt candidate for such an experiment: it’s my feature story on “Science 2.0,” which describes how researchers are beginning to harness wikis, blogs and other Web 2.0 technologies as a potentially transformative way of doing science. The draft article appears here, several months in advance of its print publication, and we are inviting you to comment on it. Your inputs will influence the article’s content, reporting, perhaps even its point of view.

PMR: It’s a reasonably balanced article, touching on many of the efforts mentioned in this blog. It’s under no illusions that this will be easy. I’ve just finished doing an interview where at the end I was asked what things would be like in 5 years’ time, and I was rather pessimistic: the current metrics-based dystopia will persist and even get worse (the UK has increased its emphasis on metrics-based assessment, under which almost any innovation is, almost by definition, discouraged). But on the other hand I think the vitality of “2.0” in so many areas may provide unstoppable disruption.

Open Data in Science

Sunday, January 6th, 2008

I have been invited to write an article for Elsevier’s Serials Review and mentioned it in an earlier post (Open Data: Datument submitted to Elsevier’s Serials Review). I had hoped to post the manuscript immediately afterwards but (a) our DSpace crashed and (b) Nature Precedings doesn’t accept HTML. DSpace is up again now and you can see the article. This post is about the content, not the technology.
[NOTE: The document was created as a full hyperlinked datument, but DSpace cannot handle hyperlinks and it numbers each of the components as a completely separate object with an unpredictable address. So none of the images show up - it's probably not a complete disaster - and you lose any force of the datument concept (available here as zip) which contains an interactive molecule (Jmol) ]

The abstract:

Open Data (OD) is an emerging term in the process of defining how scientific data may be published and re-used without price or permission barriers. Scientists generally see published data as belonging to the scientific community, but many publishers claim copyright over data and will not allow its re-use without permission. This is a major impediment to the progress of scholarship in the digital age. This article reviews the need for Open Data, shows examples of why Open Data are valuable and summarises some early initiatives in formalising the right of access to and re-use of scientific data.

PMR: The article tries not to be too polemical and to review objectively the area of Open Data (in scientific scholarship), in the style that I have done for Wikipedia. The next section shows Open Data in action, both in individual articles and when aggregating large numbers (> 100,000) of articles. Although the illustrations are from chemistry and crystallography, the message should transcend the details. Finally I try to review the various initiatives that have happened very recently, and I would welcome comments and corrections. I think I understand the issues raised in the last month but they will take time to sink in.

So, for example, in the last section I describe and pay tribute to the Open Knowledge Foundation, Talis and colleagues, and Science Commons/Creative Commons. I will blog this later but there is now a formal apparatus for managing Open Data (unlike Open Access, where the lack of one causes serious problems for science data). In summary, we now have:

  • Community Norms (“this is how the community expects A and B and C to behave – the norms have no legal force but if you don’t work with them you might be ostracized, get no grants, etc.”)
  • Protocols. These are high-level declarations which allow licences to be constructed. Both Science Commons and The Open Knowledge Foundation have such instruments. They describe the principles which conformant licences must honour. I use the term meta-licence (analogous to XML, a meta-markup language for creating markup languages).
  • Licences. These include PDDL and CC0 which conform to the protocol.

Throughout the article I stress the need for licences, and draw much analogy from the Open/Free Source communities which have meta-licences and then lists of conformant licences. I think the licence approach will be successful and will be rapidly adopted.

The relationship between Open Access and Open Data will require detailed work – they are distinct and can exist together or independently.  In conclusion I write:

Open Data in science is now recognised as a critically important area which needs much careful and coordinated work if it is to develop successfully. Much of this requires advocacy and it is likely that when scientists are made aware of the value of labeling their work the movement will grow rapidly. Besides the licences and buttons there are other tools which can make it easier to create Open Data (for example modifying software so that it can mark the work and also to add hash codes to protect the digital integrity).

Creative Commons is well known outside Open Access and has a large following. Outside of software, it is seen by many as the default way of protecting their work while making it available in the way they wish. CC has the resources, the community respect and the commitment to continue to develop appropriate tools and strategies.

But there is much more that needs to be done. Full Open Access is the simplest solution but if we have to coexist with closed full-text the problem of embedded data must be addressed, by recognising the right to extract and index data. And in any case conventional publication discourages the full publication of the scientific record. The adoption of Open Notebook Science in parallel with the formal publications of the work can do much to liberate the data. Although data quality and formats are not strictly part of Open Data, their adoption will have marked improvements. The general realisation of the value of reuse will create strong pressure for more and better data. If publishers do not gladly accept this challenge, then scientists will rapidly find other ways of publishing data, probably through institutional, departmental, national or international subject repositories. In any case the community will rapidly move to Open Data and publishers resisting this will be seen as a problem to be circumvented

Why publishers’ technology is obsolete – I

Sunday, January 6th, 2008

I have just finished writing an article for a journal – and I suspect the comments apply to all publishers. To create the Citations (or “references”) they require:

CITATIONS Citations should be double-spaced at the end of the text, with the notes numbered sequentially without superscript format. Authors are responsible for accuracy of references in all aspects. Please verify quotations and page numbers before submitting.

Superscript numerals should be placed at the end of the quotation or of the materials in which the source is mentioned. The numeral should be placed after all punctuation. SR follows the latest edition of the Chicago Manual of Style, published by the University of Chicago Press. Examples of the correct format for most often used references are the following:

Article from a journal: Paul Metz, “Thirteen Steps to Avoiding Bad Luck in a Serials Cancellation Project,” Journal of Academic Librarianship 18 (May 1992): 76-82.

[Note: when each issue is paged separately, include the issue number after the volume number: 18, no. 3 (May 1992): 76-82. Do not abbreviate months. When citing page numbers, omit the digits that remain the same in both the beginning and ending numbers, e.g., 111-13.]

PMR: It’s the author who has to do all this. In a different journal it would be a different style – maybe Harvard, or Oxford or goodness knows. Each with their own bizarre, pointless micro syntax.

As we know there is a simple, effective way of identifying a citation in a journal – the Digital Object Identifier (Wikipedia). It’s a unique identifier, managed by each publisher, and there is a resolution service. OK, not all back journals are in the system, and OK, it doesn’t do non-journal articles, but why not use it for the citations it can support? In many science disciplines almost all modern citations would have DOIs.

Not only would it speed up the process but it would save errors. Authors tend to write abbreviations (J. Acad. Lib.), mangle the volumes and pages, and get the fields in the wrong places. They hate it, and I suspect so do the technical editors when they have to correct the errors. I can’t actually believe the authors save the technical editors any time – I suspect it costs time.

You may argue that the publisher still has to type out the citation from the DOI. Not at all. This is all in standard form. Completely automatic.
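
As a hedged illustration of just how automatic this could be: the DOI resolver supports HTTP content negotiation, so a few lines of code can turn a bare DOI into a formatted citation in whatever house style a journal prefers. This is a sketch under stated assumptions, not any publisher’s actual workflow; the citation style name is only illustrative.

    # A minimal sketch (Python 3, standard library only) of turning a DOI into a
    # formatted citation via HTTP content negotiation on doi.org, assuming the
    # content-negotiation service provided by the DOI registration agencies.
    import urllib.request

    def doi_to_citation(doi, style="chicago-fullnote-bibliography"):
        req = urllib.request.Request(
            "https://doi.org/" + doi,
            headers={"Accept": "text/x-bibliography; style=" + style},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    # e.g. the paper cited elsewhere in this blog
    print(doi_to_citation("10.1039/B411699M"))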

Why also can’t publishers emit their bibliographic metadata in standard XML on their web pages? It’s a solved problem. It would mean that anyone getting a citation would get it right. (I assume that garbled citations don’t get counted in the holy numbers game, so it pays to have your metadata scraped correctly. And XML is the simple, correct way to do that.)

It’s not as if the publishers don’t have an XML Schema (or rather DTD). They do.

It’s called PRISM. Honest. Publishing Requirements for Industry Standard Metadata – a worthy, if probably over-engineered, approach. But maybe the name has got confused.

Of course the NIH/Pubmed has got this problem solved. Because they are the scientific information providers of the future.

Why not borrow their solution?

Why PubMed is so important in the NIH mandate – cont.

Saturday, January 5th, 2008

In Why PubMed is so important in the NIH mandate  – which got sent off prematurely – I started to show why the NIH/PubMed relationship was so important. To pick up…
The difference between PubMed and almost all other repositories is that it has developed over many years as a top-class domain-specific information engine. Here’s a typical top page (click to enlarge):

pubmed.PNG

Notice the range of topics offered. Many of these search collections of named scientific entities, such as genes, proteins, molecules, diseases, etc. One really clever idea – at least two decades old – was that you search in one domain, come back with the hits, then search in another domain, and so on. An early idea of mashups, for example.

You can’t do this with Google. If you search for CAT you get all sorts of things. But in PubMed you can differentiate between the animal, the 3-base codon, the tripeptide, the enzyme, the gene, the scanning technique and so on. Vastly improved accuracy. You can search for CAT scans on cats. And there are the non-textual searches. You can do homology searches for sequences, similarity searches for molecules using connection tables, etc. etc.

Then there is the enormous economy of scale. Let’s say I search for p450 (a liver enzyme). I get 23000+ hits. I can’t possibly read them all. But OSCAR can. OSCAR can read the abstracts anyway, but now it will be able to read many more fulltexts as well. It can pass them to chemistry engines, which pass them onto … and then onto …
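
To make the point concrete, here is a hedged sketch of that kind of machine reading, using NCBI’s public E-utilities interface to PubMed (esearch for the hit list, efetch for the abstracts). The query term and batch size are illustrative only, and a real harvester would page through the full result set and respect NCBI’s usage guidelines.

    # Sketch (Python 3, standard library only): find PubMed IDs with esearch,
    # then pull the corresponding abstracts as plain text with efetch.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

    def search_pubmed(term, retmax=20):
        url = BASE + "esearch.fcgi?" + urllib.parse.urlencode(
            {"db": "pubmed", "term": term, "retmax": retmax})
        with urllib.request.urlopen(url) as resp:
            tree = ET.parse(resp)
        return [e.text for e in tree.findall(".//Id")]

    def fetch_abstracts(pmids):
        url = BASE + "efetch.fcgi?" + urllib.parse.urlencode(
            {"db": "pubmed", "id": ",".join(pmids),
             "rettype": "abstract", "retmode": "text"})
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8")

    abstracts = fetch_abstracts(search_pubmed("p450"))
    # ... hand the text to OSCAR or another chemistry-aware text-mining engine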

You can’t do that with institutional repositories or with self-archiving. They don’t have the domain search engines, they don’t have the comprehensiveness. They don’t emit the science in standard XML.

For science it is likely that we have to have domain repositories. With domain-specific search engines, XML, RDF, ORE, the lot. It’s the natural way that scientists will work.

And PubMed – and its whole information infrastructure of MeSH, PubChem, Entrez, etc. is so well constructed and run that it serves as an excellent example of where we should be aiming. It’s part of the future of scientific information and data-driven science.

Do the Royal Society of Chemistry and Wiley care about my moral rights?

Saturday, January 5th, 2008

In a previous post I asked Did I write this paper??? because I had come across something like this:

wiley2.png

(click to enlarge). Take a long hard look and tell me what the journal is, and who the publisher is. Note also that it costs 30 USD to look at it.

Now I (and others) wrote that paper. When we submitted it I was proud to publish it with the Royal Society of Chemistry. I cite it as:

Org. Biomol. Chem., 2004, 2, 3067 – 3070, DOI: 10.1039/B411699M


Experimental data checker: better information for organic chemists

S. E. Adams, J. M. Goodman, R. J. Kidd, A. D. McNaught, P. Murray-Rust, F. R. Norton, J. A. Townsend and C. A. Waudby

and you can still find it posted at:

http://www.rsc.org/Publishing/Journals/OB/article.asp?doi=B411699M

However when I visit the RSC page – on the RSC site – at:

http://www.rsc.org/publishing/journals/OB/article.asp?DOI=B411699M&type=ForwardLink

I find:

wiley3.PNG

Since this is on the RSC’s own site and it says it’s not an RSC journal article, it’s clearly deliberate, not a mistake. The RSC seems to have transferred the rights of the paper to Wiley, who are reselling it under the name ChemInform. Or maybe both are selling it. Or maybe the RSC don’t know what Wiley are doing. (The best I can see is that Wiley appear to be passing off my/our paper under their name. As far as I can see they are only selling the abstract, and even then it’s the wrong one – but maybe they would also be selling the full text if they were competent enough to get the web site right. And they are asking 30 USD.)

I care very deeply about this. I used to be proud to publish in the journals of the Chemical Society (now the RSC). Can I still be proud? They have disowned my article as not one of theirs. Someone reading the Wiley page would naturally assume that I had published in a Wiley journal and not with the RSC. We’ve worked closely with the RSC – many of the ideas for Project Prospect came from our group.

A major justification for Transfer of Copyright to publishers, whether or not you believe it, is that it allows the publisher to defend the integrity of the work against copyright infringement by others. I contend that what I have depicted here is a gross violation of someone’s copyright. Probably not mine since I gave it away.

Cockup or conspiracy – I don’t know. But I certainly feel my rights have been violated.

Open Data: Datument submitted to Elsevier’s Serials Review

Saturday, January 5th, 2008

I have just finished writing an invited article for Serials Review (Elsevier). I’m making an exception and submitting to a closed-access publisher because (a) this is a special issue – from the invitation from Connie Foster:

*Serials Review*

Serials Review (v.30, no.4, 2004) was a focus issue on Open Access. It remains one of the most heavily downloaded issues and articles even now. Open Access remains a “hot topic” and fundamental discussion in scholarly communication. Your names were suggested by either current board members or previous contributors to the Open Access issue.
At the time of that publication, editors and authors envisioned revisiting the Open Access environment a few years hence since issues, publisher responses, “experiments,” and government mandates were or are in flux.

PMR: and (b) we are all allowed to retain copyright.

[I'll discuss the message later. This post is about the medium. And how today's medium doesn't carry messages very well at all.]

First to publicly thank Connie Foster for her patience. I warned her that I would not submit a conventional manuscript because I wanted to show what Scientific Data are actually like. And you can’t do that in a PDF, can you?

So I asked ahead of time if I could submit HTML. It caused the publisher (Elsevier) a lot of huffing and puffing. The answer seemed to be “yes”, but when I came to submit the manuscript the system only accepted dead documents. So I’ve ended up mailing it to Connie.

The document is a datument – a term that Henry Rzepa and I coined about 4 years ago (From Hypermedia to Datuments: Murray-Rust and Rzepa: JoDI). It emphasizes that information should be seamless – not arbitrarily split into “full-text” and “data” because it’s easier for twentieth century publishers. (I return to this in a later post). The ideal medium for datuments is XML – for example using ICE (Integrated Content Environment) and that’s why I’m going to visit Peter Sefton and colleagues.

But the simple way to create datuments is in valid XHTML. Every editor in the world should now produce XHTML so there is no reason not to do it. It’s a standard. It’s in billions of machines over the world. It’s got everything we need. You see hundreds of examples every day.

XHTML manages:

  • images (it’s done this for 15 years)
  • multimedia (also for 15 years)
  • hyperlinks (for 15 years)
  • interactive objects (also for 15 years, though with some scratchy syntax)
  • foreign namespaces – probably about 10 years
  • vector graphics (SVG) – nearly 10 years

It also manages STYLES. You don’t have to put the style in the content; you put it in a stylesheet. So my datument doesn’t have styles. Elsevier can add those if it wants. Personally I like reading black text on a white background – I know it’s very old-fashioned, but that’s how I was educated.

Also, since it’s in XML you can repurpose it. Extract just the images (see the sketch below). Or discard the applet. Or reorganise the order of the authors’ names. Or mash it with another paper. Or extract the data. Or…
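
As a minimal sketch of that repurposing – assuming a well-formed XHTML datument using the usual XHTML namespace, and a hypothetical filename – a few lines of standard-library code are enough to pull out just the image references:

    # Minimal sketch: because the datument is valid XHTML (i.e. XML), extracting
    # the image references needs nothing beyond the standard library.
    # "datument.xhtml" is a hypothetical filename.
    import xml.etree.ElementTree as ET

    XHTML = "http://www.w3.org/1999/xhtml"
    tree = ET.parse("datument.xhtml")
    for img in tree.iter("{%s}img" % XHTML):
        print(img.get("src"), img.get("alt"))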

So XHTML is a liberating medium in which to publish, while PDF is a dead, restricting and dismal medium. So having created my manuscript as a standard XHTML hyperdocument – no technology that isn’t at least 10 years old – I try to submit it. Doesn’t work. The publisher doesn’t like HTML. This seems barmy since they actually publish in HTML.

I am not prepared to transform the datument into PDF. It destroys the whole point of the article. It would be like publishing a movie as a single snapshot, or a recording of a song as just the score. So I’ve had to zip it up and send it by email. Which is what we do every day anyway.
[In passing - why this elaborate ritual with the publishers' technology? Authors have been producing acceptable manuscripts in HTML for years. Why publish in double-column PDF? I didn't ask for it. It is purely for the benefit of the publishers. To help their branding. (It's not even to make their life easier, as I'll show later because it doesn't).]

So, as a good Open Access advocate I have reposited it in the Cambridge DSpace. DSpace does not deal with hyperdocuments (please tell me I’m wrong). I would have to go through all the documents, find the relative URLs and expand them to the Cambridge DSpace base URL (see the sketch below). This, of course, means that the documents are not portable. So I had to reposit a ZIP file. Fifteen years after the invention of HTML and we cannot reposit HTML hyperdocuments.
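
The sketch below shows roughly what that rewriting involves – expanding every relative src or href to an absolute URL under the repository’s base address. The base URL and filename are hypothetical; the point is simply that the rewritten copy is no longer portable.

    # Sketch (Python 3) of the link rewriting a DSpace deposit would force on a
    # hyperdocument. BASE_URL and the filename are hypothetical.
    import re

    BASE_URL = "https://www.dspace.cam.ac.uk/bitstream/1810/XXXX/"  # hypothetical

    def expand_relative_links(xhtml):
        # rewrite src="..." and href="..." unless they are already absolute or a fragment
        return re.sub(r'((?:src|href)=")(?!https?://|#)([^"]+)',
                      lambda m: m.group(1) + BASE_URL + m.group(2), xhtml)

    with open("datument.xhtml") as f:
        print(expand_relative_links(f.read()))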

[UPDATE: I have since found that it does accept HTML so we'll see how it comes out. ]

[UPDATE2: Yes, it accepts HTML, but no the links don't work. You have to know the address of each image before you deposit them. Then you have to edit the main paper to make them work. Which means it breaks if you export it. So basically you cannot reposit normal HTML in DSpace and expect it to work.]

So, dear reader, if you are a human, and want to read the file, download the zip file, unzip it, point your browser at it, swear at me when the browser breaks.

[UPDATE: Bill says it breaks. I don't understand this.]

And, dear reader, if you are a robot you have no option but to ignore it. It’s a zip file. It’s potentially evil. And anyway you wouldn’t know what you were indexing or looking for. So maybe I will give you the top part of the HTML to look at. You won’t see the pictures, but you probably don’t care at this stage, though in a few years you will.

I also tried to reposit it at Nature Precedings. They wouldn’t let me post a zip file. Only DOC, PPT, PDF. Oh dear.