# World urban populations and city locations for Benford law

PART 1 – File copied from the R CRAN package *maps*, containing the populations of cities worldwide. In the *maps* package it is available as the *world.cities* dataset. The data were collected for the year 2006 and originated from the Gazetteer website, which now redirects to the Population Mondiale website. The dataset includes 43,645 observations. This file includes the following variables:

ID - ordered number from 1 to 43,645

name - name of the city (in English), 43,645 different cities

country.etc - name of the country the city currently belongs to (in English); 239 countries are included

pop - number of inhabitants of the city in 2006

lat - latitude of the location of the city in WGS84

long - longitude of the location of the city in WGS84

capital - dummy variable that takes value 1 if a city is a capital of the country, and 0 otherwise (in the case of China, 2 for China Municipalities and 3 for China Provincial capitals)

PART 2 – Dataset with the calculations for the paper “*Natural spatial pattern – when mutual socio-geo distances between cities follow Benford's law*”. Calculations were conducted on a cleaned version of the dataset described in PART 1. The dataset was reduced by 1,617 cities and 70 countries by eliminating countries with few cities and very small island countries, where urban location is predetermined by the island's shape. Cleaning was done at the country level: if a country was eliminated, all its cities were eliminated; if a country was kept, all its cities were kept. The computations were run on the remaining 42,028 cities in 169 countries. The number of cities per country varies from 11 to 1,000. The calculations in each row (country) were run on all cities of that country in the cleaned dataset. This file includes the following variables:

ID - ordered number from 1 to 169

country - name of the country (in English); 169 countries are included

continent - region of the world where the country is located ("Europe", "Asia", "Africa", "America South", "America North", "Australia")

name.as.map - name of the country as used in the world map available in the *rworldmap* R package

tot.pop - total population of a given country, summed over its cities reported in the PART 1 file

how.many.cities - number of cities reported for a given country, ranges from 11 to 1,000

coefvar.popul - coefficient of variation (standard deviation divided by the mean) calculated for populations of the cities within a given country

Zipf - Zipf coefficient calculated for cities within a given country

Benford.dist - conformity class for Benford's law, based on MAD (Mean Absolute Deviation) and calculated from distances between cities. First, Euclidean (two-dimensional) distances between all pairs of cities within a given country were calculated from their geo-coordinates. Second, MAD for those distances was derived (see Note 3); those values are reported in the *MAD.dist* variable. Third, the values of MAD were classified into one of four classes (see Note 1).

Benford.pop.vec - conformity class for Benford's law, based on MAD (Mean Absolute Deviation) and calculated from population data. First, MAD for the population values was computed (see Note 3); those values are reported in the *MAD.pop.vec* variable. Second, the values of MAD were classified into one of four classes (see Note 1).

Benford.pop.dist - conformity class for Benford's law, based on MAD (Mean Absolute Deviation) and calculated from population data. First, Euclidean (one-dimensional) distances between all pairs of cities within a given country were calculated from their population size. Second, MAD for those distances was derived (see Note 3); those values are reported in the *MAD.pop.dist* variable. Third, the values of MAD were classified into one of four classes (see Note 1).

Benford.socgeo.dist - conformity class for Benford's law, based on MAD (Mean Absolute Deviation) and calculated from city populations linked to their locations. First, Euclidean (three-dimensional) distances between all pairs of cities within a given country were calculated from their population size and geo-coordinates. Second, MAD for those distances was derived (see Note 3); those values are reported in the *MAD.socgeo.dist* variable. Third, the values of MAD were classified into one of four classes (see Note 1).

RofClarkEvans - R from the Clark-Evans test for spatial randomness. R is the ratio of two values: the average nearest-neighbour distance in the empirical spatial pattern and the average nearest-neighbour distance expected in a random spatial pattern. It can be approximated by [sum(r)/n] / [1/(2*sqrt(n/S))], where r are the distances from each city to its nearest neighbour, n is the number of cities analysed within a given country, and S is the area of the country. In clustered patterns the average distances are shorter than in random spatial distributions, so R<1; in ordered patterns they are longer, so R>1. In a random point pattern R=1.

ClarkEvans.pvalue - p-value of the Clark-Evans test, calculated using a t-test with test statistic t=(R-1)/var(R) (according to Petrere, 1985), where R is reported as *RofClarkEvans* and var(R)=0.2732/n, with n the number of cities analysed within a given country. Values close to 0 (below 0.05) mean that R differs significantly from 1 (so the point pattern is non-random).

ClarkEvans.lab - point pattern detected within a given country, classified on the basis of *RofClarkEvans*. The label "clustering" was given if R is significantly less than 1 (R<1), "ordering" if R is significantly more than 1 (R>1), and "random" if R≈1. Significance is taken from the *ClarkEvans.pvalue* variable (threshold 0.05).
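As an illustration of the R ratio and the labelling rule described above, here is a minimal Python sketch. It is not the authors' code: the function names, the planar coordinates and the area value are hypothetical, and the p-value is taken as an input rather than recomputed.

```python
import math


def clark_evans_r(points, area):
    """Ratio of observed to expected mean nearest-neighbour distance.

    points: list of (x, y) planar coordinates (hypothetical input format).
    area:   area S of the study region, in squared coordinate units.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    nearest = []
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to its closest other point.
        nearest.append(min(math.hypot(xi - xj, yi - yj)
                           for j, (xj, yj) in enumerate(points) if j != i))
    observed = sum(nearest) / n                     # sum(r)/n
    expected = 1.0 / (2.0 * math.sqrt(n / area))    # 1/(2*sqrt(n/S))
    return observed / expected


def label_pattern(r, p_value, alpha=0.05):
    """Label as in ClarkEvans.lab: 'random' unless R differs significantly from 1."""
    if p_value >= alpha:
        return "random"
    return "clustering" if r < 1 else "ordering"


# Four points on the corners of a unit square, with an assumed area of 4.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
r_square = clark_evans_r(square, area=4.0)
print(r_square)  # 2.0
```

In this toy case every nearest-neighbour distance is 1 and the expected distance is 0.5, so R=2, which the rule above would label "ordering" if the p-value were below 0.05.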

MAD.dist - MAD (Mean Absolute Deviation) from Benford's law, calculated from distances between cities. First, Euclidean (two-dimensional) distances between all pairs of cities within a given country were calculated from their geo-coordinates. Second, MAD for those distances was derived (see Note 3). Variable *Benford.dist* reports its classification into four conformity classes.

MAD.pop.vec - MAD (Mean Absolute Deviation) from Benford's law, calculated from population data (see Note 3). Variable *Benford.pop.vec* reports its classification into four conformity classes.

MAD.pop.dist - MAD (Mean Absolute Deviation) from Benford's law, calculated from population data. First, Euclidean (one-dimensional) distances between all pairs of cities within a given country were calculated from their population size. Second, MAD for those distances was derived (see Note 3). Variable *Benford.pop.dist* reports its classification into four conformity classes.

MAD.socgeo.dist - MAD (Mean Absolute Deviation) from Benford's law, calculated from city populations linked to their locations. First, Euclidean (three-dimensional) distances between all pairs of cities within a given country were calculated from their population size and geo-coordinates. Second, MAD for those distances was derived (see Note 3). Variable *Benford.socgeo.dist* reports its classification into four conformity classes.
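The one-, two- and three-dimensional pairwise distances behind the MAD variables can be illustrated with a short Python sketch. It is not the authors' code: the sample cities are made up, and it assumes raw, unscaled values of population, latitude and longitude, since this description does not state whether the inputs were rescaled.

```python
import math
from itertools import combinations


def pairwise_distances(rows):
    """Euclidean distance between every unordered pair of coordinate tuples."""
    return [math.dist(a, b) for a, b in combinations(rows, 2)]


# Hypothetical cities of one country as (pop, lat, long) tuples.
cities = [
    (1000.0, 0.0, 0.0),
    (1003.0, 4.0, 0.0),
    (1000.0, 0.0, 3.0),
]

# Two-dimensional: geo-coordinates only (as for MAD.dist).
geo = pairwise_distances([(lat, lon) for _, lat, lon in cities])
# One-dimensional: population only (as for MAD.pop.dist).
pop = pairwise_distances([(p,) for p, _, _ in cities])
# Three-dimensional: population and geo-coordinates (as for MAD.socgeo.dist).
socgeo = pairwise_distances(cities)
```

With n cities each list holds n*(n-1)/2 distances. Two cities with equal populations give a one-dimensional distance of zero, which has no leading digits and would need to be handled before any digit analysis.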

Benford.dist.nr - *Benford.dist* variable with the order number of the class added to the label. Values as in Note 2. This change keeps the classes in order when plotting.

Benford.pop.vec.nr - *Benford.pop.vec* variable with the order number of the class added to the label. Values as in Note 2. This change keeps the classes in order when plotting.

Benford.pop.dist.nr - *Benford.pop.dist* variable with the order number of the class added to the label. Values as in Note 2. This change keeps the classes in order when plotting.

Benford.socgeo.dist.nr - *Benford.socgeo.dist* variable with the order number of the class added to the label. Values as in Note 2. This change keeps the classes in order when plotting.

Note 1: Values of MAD were classified into one of four classes: "Close conformity", "Acceptable conformity", "Marginally acceptable conformity", "Nonconformity". Values of MAD between 0.0000 and 0.0012 are interpreted as *close conformity*, between 0.0012 and 0.0018 as *acceptable conformity*, between 0.0018 and 0.0022 as *marginally acceptable conformity*, and above 0.0022 as *nonconformity* (Nigrini, 2012).

Note 2: Labels are as follows: "1.Close conformity", "2.Acceptable conformity", "3.Marginally acceptable conformity", "4.Nonconformity"
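The thresholds in Note 1 and the numbered labels in Note 2 can be expressed as a small Python sketch. The function names are hypothetical, and because the ranges in Note 1 share their end points, the sketch assumes that each upper bound belongs to the lower class.

```python
# Upper bounds from Note 1, in increasing order (assumed inclusive).
THRESHOLDS = [
    (0.0012, "Close conformity"),
    (0.0018, "Acceptable conformity"),
    (0.0022, "Marginally acceptable conformity"),
]
CLASSES = [name for _, name in THRESHOLDS] + ["Nonconformity"]


def benford_class(mad):
    """Map a first-two-digits MAD value to one of the four classes of Note 1."""
    if mad < 0:
        raise ValueError("MAD cannot be negative")
    for upper, name in THRESHOLDS:
        if mad <= upper:
            return name
    return "Nonconformity"


def numbered_label(name):
    """Prefix the class with its order number, as in Note 2."""
    return f"{CLASSES.index(name) + 1}.{name}"


print(numbered_label(benford_class(0.0015)))  # 2.Acceptable conformity
```

The numeric prefix means that alphabetical sorting of the labels matches the order of the classes.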

Note 3: MAD was calculated on the first two digits, i.e. over the 90 possible values from 10 to 99.
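For readers unfamiliar with the first-two-digits MAD, the following Python sketch shows one way to compute it. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the expected share of leading digits d is taken as log10(1 + 1/d), and non-positive or non-finite values are skipped.

```python
import math
from collections import Counter


def first_two_digits(x):
    """First two significant digits of a positive number, as an int in 10..99."""
    mantissa = format(x, ".17e")          # e.g. '4.25000000000000000e+02'
    return int(mantissa[0] + mantissa[2])  # digit before and after the point


def benford_mad_first_two(values):
    """Mean absolute deviation between observed and Benford first-two-digit shares."""
    # Assumption: values without leading digits (zero, negative, inf, NaN) are dropped.
    usable = [v for v in values if v > 0 and math.isfinite(v)]
    if not usable:
        raise ValueError("no positive finite values")
    counts = Counter(first_two_digits(v) for v in usable)
    n = len(usable)
    total = 0.0
    for d in range(10, 100):               # the 90 possible two-digit starts
        expected = math.log10(1 + 1 / d)
        observed = counts.get(d, 0) / n
        total += abs(observed - expected)
    return total / 90


# Hypothetical sample in which every value starts with the digits 10.
mad_all_tens = benford_mad_first_two([10, 105, 1.09, 0.0101])
```

A sample concentrated on a single pair of leading digits, as in this toy case, gives a MAD of about 0.0213, far above the 0.0022 nonconformity threshold of Note 1.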

References:

Nigrini, M. J. (2012). Benford's Law: Applications for forensic accounting, auditing, and fraud detection (Vol. 586). John Wiley & Sons.

Petrere, M. (1985). The variance of the index (R) of aggregation of Clark and Evans. Oecologia, 68(1), 158-159.