
World urban populations and city locations for Benford law

Posted on 2022-08-08 - 21:00 authored by Katarzyna Kopczewska

  

PART 1 – file copied from the R CRAN package maps, containing the population of cities worldwide. In the maps R package, it is available as the world.cities dataset. Data were collected for the year 2006; the dataset originated from the Gazetteer website, which now redirects to the Population Mondiale website. The dataset includes 43,645 observations. This file includes the following variables:


ID - sequential number from 1 to 43,645


name - name of the city (in English), 43,645 different cities


country.etc - name of the country to which the city currently belongs (in English); 239 countries are included


pop - number of inhabitants in the city in 2006


lat - latitude of the location of the city in WGS84


long - longitude of the location of the city in WGS84


capital - dummy variable that takes value 1 if a city is a capital of the country, and 0 otherwise (in the case of China, 2 for China Municipalities and 3 for China Provincial capitals)



PART 2 – this dataset includes the calculations for the paper “Natural spatial pattern – when mutual socio-geo distances between cities follow Benford's law”. Calculations were conducted on a cleaned version of the dataset reported in PART 1. The dataset was reduced by 1,617 cities and 70 countries by eliminating countries with few cities and very small island countries, where the urban location is pre-determined by the island's shape. Cleaning was done at the country level: if a country was eliminated, all its cities were eliminated; if a country was kept, all its cities were kept. The computations were run on 42,028 cities in 169 countries. The number of cities per country varies from 11 to 1,000. Calculations in each row (country) were run on all cities (from the cleaned dataset) belonging to the given country. This file includes the following variables:


ID - ordered number from 1 to 169


country - name of the country to which the cities currently belong (in English); 169 countries are included

continent - region of the world where the country is located ("Europe", "Asia", "Africa", "America South", "America North", "Australia")


name.as.map - name of the country consistent with the world map available in rworldmap R software package.


tot.pop - total population of a given country, summed over the cities reported in the PART 1 file


how.many.cities - number of cities reported for a given country; ranges from 11 to 1,000.


coefvar.popul - coefficient of variation (standard deviation divided by the mean) calculated for populations of the cities within a given country


Zipf - Zipf coefficient calculated for cities within a given country


Benford.dist - MAD (Mean Absolute Deviation) conformity class for Benford's law, calculated from distances between cities. First, Euclidean (two-dimensional) distances between all pairs of cities within a given country were calculated based on their geo-coordinates. Second, the MAD for those distances was derived (see Note 3); those values are reported in the MAD.dist variable. Third, the values of MAD were classified into one of four classes (see Note 1).


Benford.pop.vec - MAD (Mean Absolute Deviation) conformity class for Benford's law, calculated from population data. First, the MAD for population values was computed (see Note 3); those values are reported in the MAD.pop.vec variable. Second, the values of MAD were classified into one of four classes (see Note 1).


Benford.pop.dist - MAD (Mean Absolute Deviation) conformity class for Benford's law, calculated from population data. First, Euclidean (one-dimensional) distances between all pairs of cities within a given country were calculated based on their population volume. Second, the MAD for those distances was derived (see Note 3); those values are reported in the MAD.pop.dist variable. Third, the values of MAD were classified into one of four classes (see Note 1).


Benford.socgeo.dist - MAD (Mean Absolute Deviation) conformity class for Benford's law, calculated from population data for cities linked to their location. First, Euclidean (three-dimensional) distances between all pairs of cities within a given country were calculated based on their population volume and geo-coordinates. Second, the MAD for those distances was derived (see Note 3); those values are reported in the MAD.socgeo.dist variable. Third, the values of MAD were classified into one of four classes (see Note 1).


RofClarkEvans - R statistic from the Clark-Evans test for spatial randomness. R is the ratio of two values: the average nearest-neighbour distance in the empirical spatial pattern and the expected average nearest-neighbour distance in a random spatial pattern. It can be approximated by [sum(r)/n]/[1/(2*sqrt(n/S))], where r are the distances from each city to its nearest neighbour, n is the number of cities analysed within a given country, and S is the area of the country. In clustered patterns, the average distances are shorter than in random spatial distributions, so R<1; in ordered patterns they are longer, so R>1. In a random point pattern, R=1.


ClarkEvans.pvalue - p-value of the Clark-Evans test, calculated using a t-test with test statistic t=(R-1)/var(R) (according to Petrere, 1985), where R is reported as RofClarkEvans and var(R)=0.2732/n, with n the number of cities analysed within a given country. Values close to 0 (below 0.05) mean that R differs significantly from 1 (so the point pattern is non-random).


ClarkEvans.lab - point pattern detected within a given country. Classification is based on RofClarkEvans: the label "clustering" was given if R is significantly less than 1 (R<1), the label "ordering" if R is significantly more than 1 (R>1), and the label "random" if R≈1. Significance is taken from the ClarkEvans.pvalue variable (threshold 0.05).


MAD.dist - MAD (Mean Absolute Deviation) value measuring conformity with Benford's law, calculated from distances between cities. First, Euclidean (two-dimensional) distances between all pairs of cities within a given country were calculated based on their geo-coordinates. Second, the MAD for those distances was derived (see Note 3). Variable Benford.dist reports its classification into four conformity classes.


MAD.pop.vec - MAD (Mean Absolute Deviation) value measuring conformity with Benford's law, calculated from population data (see Note 3). Variable Benford.pop.vec reports its classification into four conformity classes.


MAD.pop.dist - MAD (Mean Absolute Deviation) value measuring conformity with Benford's law, calculated from population data. First, Euclidean (one-dimensional) distances between all pairs of cities within a given country were calculated based on their population volume. Second, the MAD for those distances was derived (see Note 3). Variable Benford.pop.dist reports its classification into four conformity classes.


MAD.socgeo.dist - MAD (Mean Absolute Deviation) value measuring conformity with Benford's law, calculated from population data for cities linked to their location. First, Euclidean (three-dimensional) distances between all pairs of cities within a given country were calculated based on their population volume and geo-coordinates. Second, the MAD for those distances was derived (see Note 3). Variable Benford.socgeo.dist reports its classification into four conformity classes.


Benford.dist.nr - Benford.dist variable adjusted by adding the order number of the class to the label. Values as in Note 2. This adjustment is needed for clean plotting.


Benford.pop.vec.nr - Benford.pop.vec variable adjusted by adding the order number of the class to the label. Values as in Note 2. This adjustment is needed for clean plotting.


Benford.pop.dist.nr - Benford.pop.dist variable adjusted by adding the order number of the class to the label. Values as in Note 2. This adjustment is needed for clean plotting.


Benford.socgeo.dist.nr - Benford.socgeo.dist variable adjusted by adding the order number of the class to the label. Values as in Note 2. This adjustment is needed for clean plotting.
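
As a rough illustration of the RofClarkEvans ratio and the var(R) expression described in the list above, the following Python sketch computes R for a small set of points. It is not the code used to produce the dataset. The function names clark_evans_r and variance_of_r are hypothetical, and the sketch assumes planar coordinates with the area S given in the matching squared unit.

```python
import math


def clark_evans_r(points, area):
    """Ratio of the observed mean nearest-neighbour distance to the
    mean expected in a random pattern of the same density."""
    n = len(points)
    if n < 2 or area <= 0:
        raise ValueError("need at least two points and a positive area")
    # Distance from every point to its nearest neighbour (brute force, O(n^2)).
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    observed = sum(nearest) / n
    # Expected mean nearest-neighbour distance: 1 / (2 * sqrt(n / S)).
    expected = 1.0 / (2.0 * math.sqrt(n / area))
    return observed / expected


def variance_of_r(n):
    """var(R) = 0.2732 / n, as quoted from Petrere (1985) in the text."""
    return 0.2732 / n


if __name__ == "__main__":
    # Assumed toy input: four points on the corners of a unit square,
    # inside a hypothetical region of area 4.
    square = [(0, 0), (1, 0), (0, 1), (1, 1)]
    print(clark_evans_r(square, area=4.0))  # prints 2.0, i.e. R > 1 (ordered)
```

The brute-force nearest-neighbour search is adequate for at most 1,000 cities per country; a spatial index would only matter for much larger inputs.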


Note 1: Values of MAD were classified into one of four classes: "Close conformity", "Acceptable conformity", "Marginally acceptable conformity", "Nonconformity". Values of MAD between 0.0000 and 0.0012 are interpreted as close conformity, between 0.0012 and 0.0018 as acceptable conformity, between 0.0018 and 0.0022 as marginally acceptable conformity, and above 0.0022 as nonconformity (Nigrini, 2012).


Note 2: Labels are as follows: "1.Close conformity", "2.Acceptable conformity", "3.Marginally acceptable conformity", "4.Nonconformity"
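
The thresholds in Note 1 and the numbered labels in Note 2 can be sketched as a small lookup in Python. The function name classify_mad and the numbered flag are hypothetical. Treating each upper bound as inclusive is an assumption, since the notes do not say which class a value exactly on a boundary belongs to.

```python
# Upper bound of each class (Note 1), in the order used by Note 2.
CLASSES = [
    (0.0012, "Close conformity"),
    (0.0018, "Acceptable conformity"),
    (0.0022, "Marginally acceptable conformity"),
    (float("inf"), "Nonconformity"),
]


def classify_mad(mad, numbered=False):
    """Map a MAD value to its conformity class.

    With numbered=True the order number is prepended, as in the *.nr variables.
    Boundary values fall into the lower class (an assumption of this sketch).
    """
    if not mad >= 0:  # also rejects NaN
        raise ValueError("MAD must be a non-negative number")
    for order, (upper, label) in enumerate(CLASSES, start=1):
        if mad <= upper:
            return f"{order}.{label}" if numbered else label


if __name__ == "__main__":
    print(classify_mad(0.0015, numbered=True))  # prints 2.Acceptable conformity
```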


Note 3: MAD was calculated on the first two digits, that is, over the 90 values from 10 to 99.
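
A minimal Python sketch of the first-two-digit MAD from Note 3, applied to pairwise Euclidean distances, is given here for orientation only; it is not the author's code. All function names are hypothetical. The sketch assumes that the expected share of each two-digit value d is log10(1 + 1/d), that zero values are skipped because they have no leading digits, and that coordinates are used as given with no rescaling, since the description does not state the units.

```python
import math
from collections import Counter
from decimal import Decimal
from itertools import combinations


def first_two_digits(x):
    """First two significant digits of a positive number, as an int in 10..99."""
    digits = Decimal(repr(float(x))).as_tuple().digits
    first, second = (digits + (0,))[:2]  # pad one-digit values, e.g. 5 -> 50
    return 10 * first + second


def benford_mad_first_two(values):
    """Mean absolute deviation between observed and Benford first-two-digit shares."""
    usable = [v for v in values if v > 0]
    if not usable:
        raise ValueError("need at least one positive value")
    counts = Counter(first_two_digits(v) for v in usable)
    n = len(usable)
    total = sum(
        abs(counts.get(d, 0) / n - math.log10(1 + 1 / d)) for d in range(10, 100)
    )
    return total / 90


def pairwise_distances(points):
    """Euclidean distances between all pairs of equal-length coordinate tuples."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


if __name__ == "__main__":
    # Assumed toy records: (long, lat, pop) for three made-up cities.
    cities = [(21.0, 52.2, 1_700_000), (19.9, 50.1, 760_000), (17.0, 51.1, 640_000)]
    geo = pairwise_distances([(lon, lat) for lon, lat, _ in cities])    # 2-D
    pop = pairwise_distances([(p,) for _, _, p in cities])              # 1-D
    socgeo = pairwise_distances(cities)                                 # 3-D
    for name, dist in [("geo", geo), ("pop", pop), ("socgeo", socgeo)]:
        print(name, benford_mad_first_two(dist))
```

With only three cities the MAD is not meaningful; the toy input merely shows how the one-, two- and three-dimensional variants differ in which columns enter the distance.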


References: 

Nigrini, M. J. (2012). Benford's Law: Applications for forensic accounting, auditing, and fraud detection (Vol. 586). John Wiley & Sons.


Petrere, M. (1985). The variance of the index (R) of aggregation of Clark and Evans. Oecologia, 68(1), 158-159.
