## Global and Local Knowledge in Citation Network Formation

Talk presented at NetSci-X, Wrocław, 12th January 2016.

Models of the citation distribution of academic papers have a long history (Price, 1965). One aim is to test whether certain simple processes can explain important features. In this paper we focus on the fact that the distribution of citations for papers of a similar age scales primarily with the average number of citations (Radicchi, Fortunato & Castellano, 2008; Evans, Hopkins & Kaube, 2012), with the shape otherwise largely invariant. In particular, the width of such distributions (as measured by the s parameter of a lognormal fit to reasonably well cited papers) shows no temporal evolution. Simple multiplicative processes or basic models such as the Price model (Price, 1965) give dramatically different results: typically the distributions become narrower over time.
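As a rough illustration of what "width" means here, the sketch below estimates a lognormal width from a list of citation counts by taking the standard deviation of ln(c/⟨c⟩) over reasonably well cited papers. This is a crude moment estimate rather than a proper fit to a truncated distribution, and the function name `lognormal_width` and the cutoff `min_ratio` are assumptions made for this example, not values taken from the work cited above.

```python
import math
import statistics

def lognormal_width(citations, min_ratio=0.1):
    """Crude estimate of the lognormal width s.

    Citation counts are rescaled by their mean, papers with a rescaled
    count below min_ratio (a hypothetical cutoff) are dropped, and the
    sample standard deviation of the logarithms is returned.
    """
    mean = sum(citations) / len(citations)
    logs = [math.log(c / mean) for c in citations
            if c > 0 and c / mean >= min_ratio]
    return statistics.stdev(logs)

# Two toy samples that differ only by an overall factor of ten.
young = [1, 2, 4, 8, 16]
old = [10, 20, 40, 80, 160]
print(lognormal_width(young), lognormal_width(old))
```

Because the counts are rescaled by their mean before taking logarithms, two samples that differ only in their average give the same width, which is the sense in which the shape is invariant while the distribution scales with the average.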

We found that to get reasonable agreement our model had to incorporate three key aspects: local searches of the network (these generate preferential attachment), a requirement that such local searches start from recent papers, and finally some knowledge of the global set of papers.
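The three ingredients above can be sketched as a toy growth process. This is a minimal illustration under stated assumptions, not the model presented in the talk: the parameters `n_core`, `p_follow` and `recency_window` are hypothetical stand-ins, and the uniform choice among recent papers is just one simple way to favour recent work.

```python
import random

def grow_citation_network(n_papers, n_core=3, p_follow=0.5,
                          recency_window=50, seed=0):
    """Grow a toy citation network, one paper at a time.

    Returns (refs, core_total, subsidiary_total), where refs maps each
    paper index to the sorted list of older papers it cites.
    All parameters are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    refs = {0: []}  # the first paper has nothing to cite
    core_total = 0
    subsidiary_total = 0
    for new in range(1, n_papers):
        # Global step: pick "core" papers uniformly from the most recent ones.
        recent = range(max(0, new - recency_window), new)
        core = rng.sample(recent, min(n_core, len(recent)))
        cited = set(core)
        # Local step: follow the bibliography of each core paper and
        # cite each entry with probability p_follow.
        for paper in core:
            for entry in refs[paper]:
                if rng.random() < p_follow:
                    cited.add(entry)
        core_total += len(core)
        subsidiary_total += len(cited) - len(core)
        refs[new] = sorted(cited)
    return refs, core_total, subsidiary_total

refs, core_total, subsidiary_total = grow_citation_network(2000)
fraction = subsidiary_total / (core_total + subsidiary_total)
print(f"fraction of subsidiary citations: {fraction:.2f}")
```

In this sketch a paper that already has many citations appears in many bibliographies, so the local step is more likely to reach it. That is how local searches can produce preferential attachment without any paper needing to know global citation counts.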

To check our results we used data from the citation network of the hep-th section of the arXiv repository (KDD cup 2003) as a benchmark. Our three-parameter model was able to produce an acceptable fit to the hep-th data over 11 different years (see figure).

We find that our model fits the data best when around 70% to 80% of papers cited are ‘subsidiary papers’, that is, papers found from local searches through the bibliographies of other papers. Interestingly, similar results were obtained by Simkin and Roychowdhury (2005) from an analysis of mistakes in bibliographic entries. In our terminology these would be citations to subsidiary papers, so the two sets of results are consistent. Further support for this result comes from the transitive reduction analysis of Clough et al. (2015).

We conclude that the citation patterns we see reflect a mixture of two processes: around 25% of cited papers are found through some global source of information that favours recent papers, with the remainder then found by local searches starting from those papers.