On-line Appendix: Generalized Correlation and Kernel Causality with Applications in Development Economics

This document describes on-line supplementary material explain-ing the data and R software to accompany my paper Vinod (2015) appearing in Communications in Statistics - Simulation and Computation . The aim of this supplement is to assist the reader in extending and implementing the ideas in my paper. Included R code can be used as a template allowing one to use my methods on any data of interest to the reader. My software here is not claimed to be eﬃcient, but does work. Explanations of software steps are often included, but not claimed to be complete. Hence some familiarity with R is needed to use it properly. The software tools and their descriptions provided here need additional work to bring them up to the standard of an oﬃcial R package.


Introduction
This document is intended to be read in conjunction with my paper Vinod (2015) published in Communications in Statistics -Simulation and Computation (CSSC). This document is focused only on implementation and data analysis, and will not repeat the explained theory from the paper. Most R code snippets are named for cross-reference purposes. The names always start with "R-".
There is considerable R code provided here which can be developed into an R package. At the minimum the code can serve as a template for specific graphics or computational applications with minor modifications. The code is seen in red font ready to be copied and pasted by the user. All commands of the type 'require()' or 'library()' will not work unless the particular R package named in parentheses is first loaded at the R installation where the code is used.
The blue font is reserved for outputs from the code. The outputs of many code snippets is suppressed and /or abridged for brevity.

R software Initialization
While many R experts do not agree with me, I prefer to replace the R prompt (>) implicit in the R software with one space and the R continuation symbol (+) with two spaces. This change permits direct copy and paste from this file into any user's R console without modifications, avoiding confusion with the greater than and plus symbols also used in R. I also prefer to use the one stroke (=) symbol for assignment rather than ( <-) requiring 4 strokes after including spaces commonly used in R.
#R-Init rm(list=ls()) #clean up the R memory seed=42 set.seed(seed) print(c("seed=",seed),quote=FALSE) options(prompt = " ", continue = " ", width = 68, useFancyQuotes = FALSE) print(date()) It is recommended that the reader copy and paste this code named "R-Init" before each new project. The next section describes our data sources before use in the subsequent sections.

Details Regarding the Nine Variables in Economic Development (EcDev) Data
Our data are publicly available at the UN, (See http://hdr.undp.org/en/ data), the World Bank http://search.worldbank.org/data, etc. The supplementary material to this paper allows the reader to reproduce our results. Some data and country codes follow the Association of Religion Data Archives (ARDA) www.TheARDA.com, with the Principal Investigator Jaime Harris.
1. GRO: the least squares annual growth rate in Gross Domestic Product (GDP) used with constant GDP per capita data in local currency units. Sample growth rates are China=7.9, India=3.6, Mexico=1.7 and USA=1.9.

INHDI: Inequality-adjusted Human Development Index.
This measure adjusts a country's HDI score based on its scores on three additional indexes: the inequality-adjusted life expectancy index, the inequality-adjusted education index and the inequality-adjusted income index. Sample scores are Albania=0.627, Bangladesh=0.321, In-dia=0.365, USA =0.799 and Pakistan =0.336.
3. GINI: Income Gini coefficient. This is a measure of the deviation of the distribution of income (or consumption) among individuals or households within a country from a perfectly equal distribution. A Lorenz curve plots the cumulative percentages of total income received against the cumulative number of recipients, starting with the poorest individual or household. The Gini index measures the area between the Lorenz curve and a hypothetical line of absolute equality, expressed as a percentage of the maximum area under the line. A value of zero represents absolute equality, a value of 100 absolute inequality. Sample scores are Albania=33, Bangladesh=31, India=36.8, Iran=38.5 and USA=40.6. 4. GII: A country's score on the Gender Inequality Index (GII), a composite index measuring loss in achievements in three dimensions of human development: reproductive health (measured by the maternal mortality ratio and the adolescent fertility rate), empowerment (measured by the female and male population with at least a secondary education and by the female and male shares of parliamentary seats) and the labor market due to inequality between genders (measured by female and male labor force participation rates). These scores are high in countries with large Muslim populations as: Egypt=0.714 and India=0.748 and low in Westernized countries as Japan=0.273 and The Netherlands= 0.174. The USA=0.400 is somewhat higher than expected.

R software for reading Economic Development (EcDev) Data
We use the country codes by ARDA available at http://www.fordham.edu/ economics/vinod/ARDAcountryCodes.xls. The data are available in R format at: http://www.fordham.edu/economics/vinod/DevEcData.Rdata. If the Rdata file does not work an alternative version 'DevEc10.csv' a comma separated csv file is also available at the same Internet location. Its use is shown below. The data for 250 countries contains lots of missing data (NAs). The code named "R-get-EcDev-data" loads entire data into user's R console for the chosen nine variables (abbreviations listed in the previous section) and country code.

Plotting Bivariate Density for EcDev data
Now we provide the R code for plotting Bivariate density seen as " Figure  2: Joint density between economic growth (GRO) and income in-equality (GINI)" in my CSSC paper. One needs to do pair-wise deletion of missing data or NA's by using the code named "R-napair" providing my function called 'napair.' The code named "R-napair-Dev-data" further below uses 'napair' function on EcDev data. It renames the pair-wise matched data as GRO2 and GINI2, while keeping the original GRO and GINI data intact, since the original data will have to be matched eventually with seven other variables.

Computing Generalized Correlation Matrix
The code named "R-gmcxy.np-function" below is my R function which in turn relies on the 'np' package to do kernel regressions in equations (1) and (2) in my CSSC paper. The function 'npregbw' suitably sets up the bandwidths.
The code named "R-gmcmtx0-function" provides my R function to compute the R * (new asymmetric generalized correlation matrix) described in Section 2.5 and defined in equation (10) of my CSSC paper. The zero in the name 'gmcmtx0' is intended to distinguish it from a more CPU-timeconsuming bootstrap version 'gmcmtx,' described later.
Our first example is the European crime data from Section 3.3 of my CSSC paper.
Using the nine variables from 'EcDev' data in the matrix called 'mtx' above we can get the matrix of r * (X i |X j ) defined by eq. (10) in the text of my CSSC paper by the following command. The generalized correlation matrix does not appear in Section 4 of the CSSC paper.

rs=gmcmtx0(mtx) print(rs,digits=2)
The output given below is abridged to only two digits to fit the width of the page. The actual R * matrix in the memory of R has more than eight digits than what is displayed here. Tables 5 and 6 of my CSSC paper displays additional digits for 36 unique pairs and explicitly identify the "cause" by name.  Recall that eq. (16) of my CSSC paper defines the partial correlation coefficients for the R * matrix. The function 'parcor.ijk' computes those partial correlation coefficients bet X i and X j after removing the effect of all X k variables. The function reports a list of removed columns named 'myk'.
#R-getParcor-function # NEEDS gmcmtx0' and partialc2' in memory require(Hmisc) #rcorr does pairwise deletion #otherwise corr matrix will have NAs Nan etc problems getParcor=function(x,rstar=TRUE,verbo=TRUE, nam=colnames(x)) { #get partial coeff from inverse of the correlation matrix #Input x=data matrix e.g. cbind(x,y,z) p=ncol(x) if (p<3) stop("number of columns < 3") if ( ("p,length(nam),ncol(par.cor),nrow(par.cor)") print(c(p,length(nam),ncol(par.cor),nrow(par.cor))) colnames(par.cor)=nam rownames(par.cor)=nam if(verbo)print("partial R-sq on diag and partial correl off-diag") if(verbo) print(par.cor) #cr has R correlations or R* if rstar=TRUE list(cr=cr, par.cor=par.cor) }#end function getParcor Now we are ready to apply the above tools to Economic Development data to create Table 7 of my CSSC paper illustrating the use of partial correlations and betas in multivariate situations. Table 7 has two vertical panels with the left panel for the three variables my3=(MPI, GRO, GINI) and the right panel is for four variables my4=(MPI, GRO, GINI, GEI). Recall that the 'getbeta' function offers a choice of Pearson correlation by using the command 'g1=getbeta(my3,rstar=FALSE)' illustrated below. Of course, we are more interested in the new R * and betas obtained from it obtained by the command: 'g2=getbeta(my3,rstar=TRUE)'. The four rows of the left panel on Table 7 of the CSSC paper are in the matrix 'myx'. The output from the above code creates Table 7 in my CSSC paper, where the discussion of evidence supports Bhagwati's growth policies over Sen's redistribution policies for economic development. It allows consideration of equation (22) (regressing MPI on GRO and GINI) which has two regressors illustrating multivariate regressions. By using partial correlations we can focus on the effect of one regressor at a time, removing the effect of remaining regressors. The right panel of Table 7 considers regression of MPI on GRO, GINI and GEI (i.e., three regressors). The partial correlations in the right panel remove the effect of two regressors.

Non-spherical Error Corrections
Correcting for autocorrelation and heteroscedasticity (non-spherical errors) is readilty done by using the following function called 'autohetero', which in turn needs a function 'sort.matrix' which is given first. The theory is described in Vinod (2010). The basic idea is to use generalized least squares.
The function 'pcause' computes P(cause) defined in eq. (9) of my CSSC paper giving the larger of the two rejection probabilities. The computed P(cause) lies in the range [0,1], where a larger value is more desirable for inference about the causal direction based onδ using the meboot bootstrap.

Comprehensive Tables Displaying 36 Rows for 9 Variables
Note that our kernel causality criterion relies on the asymmetry of the generalized correlation matrix r * (i, j) to identify the causal variable. Since the row variable X i is the "effect" and the column variable X j is the "cause," we identify the cause depending on which matrix entry (i,j) or (j,i) is larger. EcDev data have a large 9×9 matrix and it is not convenient to manually compare such entries. It is safer to let the computer choose all unique (i,j) pairs and explicitly identify the cause by comparing the indicated entries of the asymmetric matrix. Since ( 9 C 2 = 36), our EcDev data needs a comparison of 36 pairs. Tables 4 to 6 of my CSSC paper do have 36 rows for each unique pair and the cause is identified and named in the third column by using a computer program. A typical program is given below.
If the bootstrap is used for inference, a much slower version of gmcmtx0 called gmcmtx needs to be used before we do the 36 pairs. Hence we begin with that code.

Code for Kernel Regression Robustness Checks
Recall that if the model regressing Y on X is more robust than the model regression X on Y we conclude that X kernel causes Y . Section 2.3 of my CSSC paper describes Kernel Regression Robustness Checks where the first robustness check uses a nonparametric test of dependence between residuals and regressors, where low dependence is desirable. Table 5 of CSSC paper illustrates the results of this robustness check for EcDev data. The version of gmcmtx0 in the code named 'R-gmcmtx0npdepfunction' is called 'gmcmtx0npdep.' The 0 in the name refers to fast version (without the meboot) and 'npdep' is the name of the function from the package 'np' used here to assess the dependence of regression residuals with the regressor. The dependence is measured by S ρ explained in Section 2.3 of my CSSC paper. Note that it is desirable to choose the model with the "smaller" value of S ρ .
#R-gmcmtx0npdep-function gmcmtx0npdep=function(mym, nam=colnames(mym)){ #author: H.D.Vinod, Prof of Economics, Forhdam Univ. NY Sept2013 # mym is a data matrix with n rows and p columns # some NAs may be present in the matrix p=NCOL(mym) #print(c("p=",p)) out1=matrix(1,p,p)# out1 stores asymmetric correlations out2=matrix(1,p,p)# out1 stores asymmetric corr wrt residuals for ( Section 2.3 of the CSSC paper also describes another robustness check based on out-of-sample forecasts. The R code for that purpose is given next. # Using out of sample 10% root mean square and scaled require(np) options(np.messages=FALSE) require(akima) #load this package #R-stdze-function to standardize data matrix all columns stdze=function(mtx){ stdx=function(x) (x-mean(x,na.rm=TRUE))/sd(x, na.rm=TRUE) out=apply(mtx, 2, stdx) return(out)} The above standardization makes the mean zero and standard deviation unity for all columns. The following function computes root mean squared (rms) values defined in Section 2.3 of the CSSC paper. It uses the standardization function and then sorts on the first column, while carrying along all other columns.