MOTSI: What is a citation?

We are all now judged by citations. But what *IS* a citation? It’s not easy to answer… and it may not be quite what you think. Wikipedia gives:

Broadly, a citation is a reference to a published or unpublished source (not always the original source). More precisely, a citation is an abbreviated alphanumeric expression (e.g. [Newell84]) embedded in the body of an intellectual work that denotes an entry in the bibliographic references section of the work for the purpose of acknowledging the relevance of the works of others to the topic of discussion at the spot where the citation appears. Generally the combination of both the in-body citation and the bibliographic entry constitutes what is commonly thought of as a citation (whereas bibliographic entries by themselves are not).[PMR’s emphasis]

A prime purpose of a citation is intellectual honesty: to attribute prior or unoriginal work and ideas to the correct sources, and to allow the reader to determine independently whether the referenced material supports the author’s argument in the claimed way.


Bibliographies, and other list-like compilations of references, are generally not considered citations because they do not fulfill the true spirit of the term: deliberate acknowledgment by other authors of the priority of one’s ideas.

By the definition of this Wikipedia article it is clear that a citation consists of not only the bibliographic entry reference, but also the in-text context. Using our own paper (Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry, by BalaKrishna Kolluru, Lezan Hawizy, Peter Murray-Rust, Junichi Tsujii and Sophia Ananiadou) we find a typical context in which the citation occurs.


  1. Different aspects [of text-mining] such as named entity recognition (NER), tokenisation and acronym detection require bespoke approaches because of the complex nature of such texts [1]–[5].

The [1]–[5] represents five pieces of prior work whose role is defined by the surrounding language. This “sentiment” is very difficult for a machine to analyse exactly and requires a human to describe the type of the citation. In this case it is prior work in the field. Here are the resolved bibliographic references:

Kemp N, Lynch M (1998) Extraction of information from the text of chemical patents. 1. identification of specific chemical names. Journal of Chemical Information and Computer Sciences 4: 544–551.

Murray-Rust P, Rzepa H (1999) Chemical markup, xml, and the worldwide web. 1. basic principles. Journal of Chemical Information and Computer Sciences 39: 928–942.

Murray-Rust P, Mitchell J, Rzepa H (2005) Chemistry in bioinformatics. BMC Bioinformatics 6: 141.

Banville D (2006) Mining chemical structural information from the drug literature. Drug Discovery Today 11: 35–42.

Kolrik C, Hofmann-Apitius M, Zimmermann M, Fluck J (2007) Identification of new drug classification terms in textual resources. Bioinformatics 13: 264–272.

Note that the references themselves are NOT citations; it is the combination of each of them with the context (sentence (1)) that defines the citation. This is important as most “citations” do not fulfil this criterion. Note also that two of the references are to works authored in part by one of the authors (PMR); these, when expressed as citations, are sometimes called “self-citations” (though several authors are involved in each case). These citations – in the first paragraph of the paper – can be assumed to reference fairly important prior work on which the paper probably builds.

These citations can be seen to lend some merit to the work described by the references. But it is the complete citation that lends the merit – not merely the fact that the references appear in the paper. In using citations, therefore, we should always include the sentiment. There are many types of sentiment – and we looked at this in our Sciborg project.

Each citation is labelled with exactly one category. The following top-level four-way distinction applies:

  • Weakness: authors point out a weakness in the cited work.

  • Contrast: authors make a contrast/comparison with the cited work (4 categories).

  • Positive: authors agree with, make use of, or show compatibility or similarity with the cited work (6 categories).

  • Neutral: the function of the citation is neutral, weakly signalled, or different from the three functions stated above.
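To make the scheme concrete, here is a minimal sketch (it assumes nothing about the actual Sciborg implementation; the `Citation` class and example values are invented for illustration) of what a “true citation” – bibliographic reference plus in-text context plus exactly one sentiment category – looks like as data:

```python
from dataclasses import dataclass

# The four top-level categories described above (hypothetical encoding).
CATEGORIES = {"Weakness", "Contrast", "Positive", "Neutral"}

@dataclass
class Citation:
    """A 'true' citation: reference + in-text context + sentiment."""
    reference: str  # the bibliographic entry, e.g. "Kemp N, Lynch M (1998) ..."
    context: str    # the sentence in which the in-body citation appears
    category: str   # exactly one of the four top-level categories

    def __post_init__(self):
        # Enforce the "exactly one category" rule from the scheme.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

c = Citation(
    reference="Kemp N, Lynch M (1998) Extraction of information from the "
              "text of chemical patents...",
    context="Different aspects such as NER, tokenisation and acronym "
            "detection require bespoke approaches [1]-[5].",
    category="Neutral",
)
print(c.category)  # Neutral
```

A bibliographic entry alone would populate only the `reference` field; it is the presence of `context` and `category` that makes the record a citation in the Wikipedia sense.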


Some of these might count positively toward an author’s reputation; others would count negatively.


Here’s a similar assessment in botanical systems. A typical extract:

In 2007, Stephen McLaughlin published “Tundra to Tropics: The Floristic Plant Geography of North America” in Sida, Botanical Miscellany. McLaughlin is one of the few authors who included data sources in his work. He stated that “The 245 local floras selected for this study are listed in Appendix A” (p. 3). Although listed in an appendix, they are not included in the bibliography. Instead, his bibliography consists of 28 other publications, mostly books and articles in books, but also articles in Thomson Reuters-monitored journals.


In other words floras (which are citable) may be excluded from “citation analysis” because the authors put them elsewhere in the document. Automated methods of citation analysis cannot pick this up. In some of my own work many references appear in tables.


So a true citation carries sentiment and describes the purpose of the reference. In #jiscopencite (sister to #jiscopenbib) David Shotton and colleagues are developing the Citation Typing Ontology (CiTO).
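As a sketch of the idea, a CiTO-typed citation is simply a triple whose predicate names the citation’s function. The paper URIs below are invented for illustration; the two property names (`citesForInformation`, `usesMethodIn`) are drawn from the published CiTO vocabulary:

```python
# A CiTO-style citation record as plain RDF-like triples (no library needed).
CITO = "http://purl.org/spar/cito/"

triples = [
    # citing paper (hypothetical URI),  CiTO property = the "sentiment",  cited paper
    ("http://example.org/kolluru2011", CITO + "citesForInformation",
     "http://example.org/kemp1998"),
    ("http://example.org/kolluru2011", CITO + "usesMethodIn",
     "http://example.org/murrayrust1999"),
]

# Emit in N-Triples-style syntax so the typed citations are machine-readable.
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

The point is that the predicate carries what a bare bibliography loses: *why* the work is cited, not merely *that* it is cited.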


But, unless you tell me different, the “citation” used in current metrics and in the Science Citation Index and its modern descendants is in fact only a bibliographic reference. It carries no sentiment. This is why citation counts are skewed by negative sentiment, and why some common types of citation (methods or software) can achieve very large counts.


None of this is included in the Journal Impact Factor, which AFAIK simply extracts bibliographic references. A person can get increased “citations” for being criticized. Being controversial may increase your metrics. None of this is surprising, other than to those who think the evaluation of research can be automated.


There are many other reasons why “citations” (bibliographic references) are seriously flawed. I’m not the first to bring these up:

  • Errors in the bibliographic references themselves. José H. Canós Cerdá, Eduardo Mena Nieto and Manuel Llavador Campos show that “an author named ‘I. INTRODUCTION’ has published hundreds of papers. Similarly, according to CiteSeer, the first and third authors of this article have a coauthor named ‘I. Introducción’”. This makes us laugh, but sickly, in that our careers are based on this level of inaccuracy. There are many other problems, such as author identification and disambiguation (ORCID may solve some, but not all, of this). #jiscopencite reveals that many bibliographic references are simply inaccurate. Canós Cerdá, Mena Nieto and Llavador Campos continue: “Citation analysis needs an in-depth transformation. Current systems have been long criticized due to shortcomings such as lack of coverage of publications and low accuracy of the citation data. Surprisingly, incomplete or incorrect data are used to make important decisions about researchers’ careers. We argue that a new approach based on the collection of citation data at the time the papers are created can overcome current limitations, and we propose a new framework in which the research community is the owner of a Global Citation Registry characterized by high quality citation data handled automatically. We envision a registry that will be accessible by all the interested parties and will be the source from which the different impact models can be applied.”
  • Lack of transparency in what sources are used. What is, and is not, citable – the target of a bibliographic reference? Web pages? Broadcast talks? Our media change; if we wish to use them for evaluation, WE, not unaccountable commercial organizations, should be in control.
  • Lack of sentiment (see above). Without knowing why something is cited we cannot attribute motivation and value.


For me the biggest problem is the lack of transparency – if this problem is addressed, the other two follow.

In a paper written 14 years ago, Cameron suggests:


A universal citation database has significant potential to act as a catalyst for reform in scholarly communication by leveling the playing field between alternative forms of scholarly publication. This would happen in two important ways. First, the citation database would ensure that publications in any form are equally visible (but not necessarily equally accessible) to the literature research process. Regardless of which publication venue an author chooses, all that she/he needs to do to make her/his work visible is to cite appropriate previous works. Publication venues would then compete on the important values that they bring to the publication process, such as refereeing standards, editorial control, quality of presentation, timeliness of dissemination and so forth. Publications would no longer enjoy an unfair competitive advantage simply by virtue of being indexed in a particular literature database.

The second way that a universal citation database would promote fairer competition among publication venues is by providing a method for evaluating the significance of individual papers independent of the publication venue chosen. University faculty members are often critically concerned with the recognition that their work receives because of its importance to the evaluation of their academic careers. Because the significance of papers is often judged solely by the perceived quality of the venues in which they are published, this encourages a very conservative approach to choice of publication venues. By providing citation data as an independent means of demonstrating the significance of a particular work, a universal citation database has the potential to encourage authors to choose publication venues for other qualities.

And the implication was clear – academia could and should have initiated this. As I have implied elsewhere, academia has sleepwalked past this opportunity and has now generated a mass of unregulated bibliographic reference collections. Their quality and coverage are not transparent, so I cannot judge them – other than to say that non-transparency generally has little value.

So – as in so much else – IF we created semantic publications, and IF we made them Open, many of the problems would be solved. We should use semantic bibliography (as we have developed in our Open Bibliography project). We should label that with sentiment to create true citations. These should be semantic and published Openly at time of first publication. This would solve most of our problems of bad data, missing data, control by unaccountable third parties, etc.

But it would require authors to change their habits and adopt new and better ways of authoring papers. And there is an established industry – including academia – that benefits from the low-quality processes we currently have.

We are constantly told

“Authors will never do that”

And that’s true IF, but only IF, academia doesn’t care.

Let’s try the following:

“scientists will never assess the safety of their reactions. It’s too much trouble”

“scientists will never bother to report experiments on animals. It’s too much trouble”

So IF academia required scholarly publications to have semantic, accurate citations (not just bibliographies) we could solve this in months. The technology is not the problem.

Academics are the problem.

We are in the grip of our own creations, which we cannot control. In this case our Monster of the Scholarly ID is the “citation”. Let’s tame it.






