Semantic drift of cited references in the medical literature

posted on 28.11.2019 by Gustaf Nelhans, Johan Eklund
Paper presented at the 24th Nordic Workshop on Bibliometrics and Research Policy, 28 November 2019.
Abstract: Adding the semantic content of texts to the study of citations opens up new avenues of research in the field. Words can be used in specific or more general senses, and their meaning changes through use. Correspondingly, the meaning of a cited reference is defined by its use, and it changes as the reference is used in different contexts. Using ‘word embeddings’, we create a conceptual space of references from a window of text around each reference. The model is trained on a set of 2 million full-text articles derived from EuroPMC. We measure the length of the journey of the cited references in this space to determine how much their semantic meaning changes over time. Furthermore, we study the topical heterogeneity of the citation contexts that the citing documents confer on the references.
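
To make the idea of "a window of text around the references" concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that in-text citations have already been replaced by placeholder tokens; the `REF_` prefix, the function name `citation_windows` and the window size are hypothetical choices made for illustration and are not stated in the abstract.

```python
def citation_windows(tokens, window=5, is_reference=lambda t: t.startswith("REF_")):
    """Collect the words surrounding each reference token.

    tokens: a tokenized sentence or paragraph in which citations appear
            as placeholder tokens (hypothetical "REF_<id>" format).
    window: number of tokens taken on each side (assumed value).
    Returns a list of (reference_token, context_tokens) pairs.
    """
    pairs = []
    for i, tok in enumerate(tokens):
        if is_reference(tok):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            # Drop other reference tokens so the context holds words only.
            context = [t for t in left + right if not is_reference(t)]
            pairs.append((tok, context))
    return pairs


tokens = "aspirin reduces risk of stroke REF_123 in elderly patients".split()
pairs = citation_windows(tokens, window=2)
```

Pairs of this kind could then be fed to an embedding model so that each reference token is placed in the same space as the words that surround it.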

• RQ1. Can we identify the degree of topical heterogeneity of a subset of investigated cited references?
• RQ2. Can we identify the semantic drift in cited references over time?
• RQ3. Can we infer the presence of a cited reference in a given text using our trained model? Correspondingly: can we reconstruct the context of a reference in a text?
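
As an illustration of how RQ1 and RQ2 might be operationalized, the sketch below assumes that every citation context of a reference has already been mapped to an embedding vector and grouped by the publication year of the citing document. Heterogeneity is approximated as the mean pairwise cosine distance between context vectors, and drift as the summed cosine distance between consecutive yearly centroids. These measures, the function names and the toy vectors are assumptions chosen for illustration; the abstract does not specify the measures actually used.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def heterogeneity(vectors):
    """Mean pairwise cosine distance among context vectors (proxy for RQ1)."""
    distances = [
        cosine_distance(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(distances) / len(distances) if distances else 0.0


def drift_length(contexts_by_year):
    """Path length through consecutive yearly centroids (proxy for RQ2)."""
    years = sorted(contexts_by_year)
    centroids = [centroid(contexts_by_year[y]) for y in years]
    return sum(cosine_distance(a, b) for a, b in zip(centroids, centroids[1:]))


# Toy two-dimensional context vectors for one cited reference (invented data).
contexts = {
    2001: [[1.0, 0.0], [1.0, 0.0]],
    2002: [[0.0, 1.0]],
    2003: [[0.0, 1.0], [0.0, 2.0]],
}
all_vectors = [v for vectors in contexts.values() for v in vectors]
drift = drift_length(contexts)
spread = heterogeneity(all_vectors)
```

In this toy case the reference moves between 2001 and 2002 and then stays put, so the whole path length comes from the first step. Cosine distance ignores vector length, which is why the 2003 contexts add no drift.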

In this exploratory work we investigate to what degree the semantic meaning of a cited reference can be recognized. Ultimately, we explore the possibility of generating a dynamic classification of research based on its use rather than on its content. This would make it possible to identify similar works irrespective of manifest citation links (bibliographic coupling or co-citation) or identical word content (co-word analysis).


Funding: European Union’s Horizon 2020 research and innovation programme under grant agreement No. 770531.