2023-02-03 Sessión 4

Step by Step - Selection of variables for modeling

Import data

Importing a CSV database under the name of “sandfly”, with “,” as separator:

sandfly <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Linear Regresion Exercise Database/Datasets/sandfly.csv", sep=",")


Spoiler alert
Note

What we are going to do today is a multivariate linear regression and selecting the variables that will be included in this regression with the “Classical model selection”

So, so far we know that the steps to the “Classical model selection” are:

Steps Example
Imagine that you have 6 independent variables and 1 dependent variable. And you want to create a linear regression.
1. Do a simple linear regression (bivariate) which each independent variable So we will have 6 models
2. Note the models with a p-value less than 0.10. Those are the variables we are going to start working with. Exclude the others Imagine that only 4 models had a p-value less than 0.10. So we keep those independent variables
3. Now create a multivariate linear regression model with all the remaining variables So we will create 01 model with the 4 remaining independent variables. (multivariate model 1)
4. Check in the results if any on the variables have a p-value higher than 0.05. If yes, exclude the one with the highest value and create a new model with the remaining variables. Repeat until all variables has a significant p-value (less than 0.05) Imagine that in the results we see that two variables had a p-value bigger than 0.05. We exclude the highest and create a (multivariate model 2). In the new results we still see that one variable had a p-value bigger than 0.05. So we exclude it and create a (multivariate model 3). In the new results all variables has a significant p value.
Now you have the variables that, no matter what happens, they have to be in your model. But, are they ok in that way? or it would be better to put them as interaction? The remaining model (multivariate model 3) only includes 2 independent variables: lm(DependVariabl~ IndepVariab1 + IndependVariable2
5. Create another model(s) with the remaining variables but as an interaction So, we create another model (multivariate model 4) but as an interaction lm(DependVariabl~ IndepVariab1 * IndependVariable2
6. Check if the model(s) with interaction is significantly better than the one without interaction. For this perform an F test (anova table) We will compare the (multivariate model 3) vs the (multivariate model 4) in an anova table.
7. If the results show that no model is better than the other, we keep the model more simple (the one without interaction). Otherwise, we keep the model with interaction Imagine that the p-value of the ANOVA table is not significant (less than 0.05). This means that no model is better than the other. So we keep the more simple model (the one without interaction)
So in this case the final model would be the (multivariate model 3)


Starting the Exercise (Ex05-2023):

In Bihar, India, visceral Leishmaniasis or Kala Azar is transmitted by the phlebotomine sand fly p.argentipes. The disease is known to disproportionally affect the poorest segments of society, this could be due to increased exposure to sand flies as a result of poor housing and environmental conditions. To assess the independent contributions of different factors, a study team conducted an entomological survey and collected information on housing and environmental conditions as well as on assets owned. Five hundred houses from a rural block were randomly sampled and light traps were installed simultaneously for a total of 6 nights in each house. Per house the total number of p. argentipes sand flies captured was determined. The data can be found in sandfly.csv.

1. The outcome variable is tot_parg (total numbers of p.argentipes sandflyes captured per house) Is tot_parg normally distributed?
hist(sandfly$tot_parg,main="Total number of p.argentipes sandflies captured in a house distribution", 
     xlab="Total number of p.argentipes sandflies captured", ylab="nb", col="green", border="dark green")

Note

It is not normally distributed

2. Generate a new variable “log_parg” which is the natural logarithm of tot_parg. Is normally distributed?
sandfly$log_parg <- log(sandfly$tot_parg)    #Creating the natural logarith variable
hist(sandfly$log_parg,main="Log of p.argentipes sandflies captured in a house distribution", 
     xlab="Log of p.argentipes sandflies captured", ylab="nb", col="green", border="dark green")

Note

It seems normally distributed

3. Explored in simple linear regression whether each of the independent variables is associated with log_parg. Make a table listing each factor, the number and proportion of households in which it is present

This is the Classical model selection. So, we need bivariate regressions for each independent variable with log_parg

3.1 Animals

  sandfly$animals <- factor(sandfly$animals)      #Factor because is a categorical variable
  #table (sandfly$animals, useNA="always")         #To explore the frequency and missing values
  lm1 = lm(log_parg ~ animals, data = sandfly)    #Simple linear regression (bivariate)
  summary(lm1)

Call:
lm(formula = log_parg ~ animals, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1944 -0.7965 -0.0164  0.7226  3.3292 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.19440    0.05053  63.218  < 2e-16 ***
animals1     0.81025    0.25265   3.207  0.00143 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.107 on 498 degrees of freedom
Multiple R-squared:  0.02023,   Adjusted R-squared:  0.01827 
F-statistic: 10.29 on 1 and 498 DF,  p-value: 0.001427

3.2 Stove

  sandfly$stove <- factor(sandfly$stove)        #Factor because is a categorical variable
  #table (sandfly$stove, useNA="always")         #To explore the frequency and missing values
  lm2 = lm(log_parg ~ stove, data = sandfly)    #Simple linear regression (bivariate)
  summary(lm2)

Call:
lm(formula = log_parg ~ stove, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4602 -0.8197 -0.0395  0.7336  3.3060 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.21760    0.05095  63.150   <2e-16 ***
stove1       0.24261    0.26138   0.928    0.354    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.117 on 498 degrees of freedom
Multiple R-squared:  0.001727,  Adjusted R-squared:  -0.0002775 
F-statistic: 0.8615 on 1 and 498 DF,  p-value: 0.3538

3.3 Ventilation

  sandfly$ventilation <- factor(sandfly$ventilation)    #Factor because is a categorical variable
  #table (sandfly$ventilation, useNA="always")           #To explore the frequency and missing values
  lm3 = lm(log_parg ~ ventilation, data = sandfly)      #Simple linear regression (bivariate)
  summary(lm3)

Call:
lm(formula = log_parg ~ ventilation, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3918 -0.6837 -0.0740  0.7756  3.1318 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.39179    0.06271  54.084  < 2e-16 ***
ventilation1 -0.42741    0.10094  -4.234 2.73e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.099 on 498 degrees of freedom
Multiple R-squared:  0.03475,   Adjusted R-squared:  0.03281 
F-statistic: 17.93 on 1 and 498 DF,  p-value: 2.731e-05

3.4 Light

  sandfly$light <- factor(sandfly$light)          #Factor because is a categorical variable
  #table (sandfly$light, useNA="always")           #To explore the frequency and missing values
  lm4 = lm(log_parg ~ light, data = sandfly)      #Simple linear regression (bivariate)
  summary(lm4)

Call:
lm(formula = log_parg ~ light, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2306 -0.8216 -0.0118  0.7425  3.2929 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.23063    0.06166  52.394   <2e-16 ***
light1      -0.01115    0.10544  -0.106    0.916    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.118 on 498 degrees of freedom
Multiple R-squared:  2.244e-05, Adjusted R-squared:  -0.001986 
F-statistic: 0.01118 on 1 and 498 DF,  p-value: 0.9158

3.5 Cowdung

  sandfly$cowdung <- factor(sandfly$cowdung)        #Factor because is a categorical variable
  #table (sandfly$cowdung, useNA="always")           #To explore the frequency and missing values
  lm5 = lm(log_parg ~ cowdung, data = sandfly)      #Simple linear regression (bivariate)
  summary(lm5)

Call:
lm(formula = log_parg ~ cowdung, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2419 -0.7570 -0.0230  0.7284  3.2817 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.7388     0.2879   9.512   <2e-16 ***
cowdung1      0.5031     0.2923   1.721   0.0859 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.115 on 498 degrees of freedom
Multiple R-squared:  0.005912,  Adjusted R-squared:  0.003916 
F-statistic: 2.962 on 1 and 498 DF,  p-value: 0.08588

3.6 Water

  sandfly$water <- factor(sandfly$water)         #Factor because is a categorical variable
  #table (sandfly$water, useNA="always")          #To explore the frequency and missing values
  lm6 = lm(log_parg ~ water, data = sandfly)     #Simple linear regression (bivariate)
  summary(lm6)

Call:
lm(formula = log_parg ~ water, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3342 -0.8027 -0.0305  0.7452  3.3229 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.20064    0.05572  57.443   <2e-16 ***
water1       0.13354    0.12586   1.061    0.289    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.117 on 498 degrees of freedom
Multiple R-squared:  0.002256,  Adjusted R-squared:  0.0002522 
F-statistic: 1.126 on 1 and 498 DF,  p-value: 0.2892

3.7 Asset_index

  sandfly$asset_index <- factor(sandfly$asset_index)   #Factor because is a categorical variable}
  #we don't need to relevel
  #table (sandfly$asset_index, useNA="always")          #To explore the frequency and missing values
  lm7 = lm(log_parg ~ asset_index, data = sandfly)     #Simple linear regression (bivariate)
  summary(lm7)

Call:
lm(formula = log_parg ~ asset_index, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5065 -0.7242  0.0206  0.7371  3.2189 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.50646    0.10664  32.882  < 2e-16 ***
asset_index2  0.03697    0.16667   0.222   0.8245    
asset_index3 -0.33546    0.15196  -2.208   0.0278 *  
asset_index4 -0.30015    0.15276  -1.965   0.0500 .  
asset_index5 -0.70187    0.14556  -4.822 1.92e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.072 on 477 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.06171,   Adjusted R-squared:  0.05384 
F-statistic: 7.843 on 4 and 477 DF,  p-value: 3.968e-06

3.8 House_type

  #Already a character (categorical) variable no need to factor
  #table(sandfly$house_type, useNA="always")             #To explore the frequency and missing values
  lm8 = lm(log_parg ~ house_type, data = sandfly)       #Simple linear regression (bivariate)
  summary(lm8)

Call:
lm(formula = log_parg ~ house_type, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1259 -0.6913 -0.0121  0.6752  3.2067 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              2.88012    0.12215  23.578  < 2e-16 ***
house_typepbrick_efloor  0.02238    0.15798   0.142    0.887    
house_typethatched       0.83198    0.14821   5.614  3.3e-08 ***
house_typeupbrick        0.24579    0.14994   1.639    0.102    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.065 on 496 degrees of freedom
Multiple R-squared:  0.09706,   Adjusted R-squared:  0.0916 
F-statistic: 17.77 on 3 and 496 DF,  p-value: 5.686e-11

3.9 Completing the requested table and checking witch models has a p value higher than 0.10

We do this checking all the results above

Note
Factor Number (%) Coefficient p-value
Animals in the house 20 (4.0) 0.81 0.001
Traditional stove 19 (3.8) 0.24 0.35
Ventilation 193 (38.6) -0.43 <0.001
Adequate light 171 (34.2) -0.01 0.92
Cow dung 485 (97.0) 0.50 0.09
Water body 98 (19.6) 0.13 0.29
Asset index*
—–1 (poorest) 101 (21.0) ref ref
—–2 70 (14.5) 0.04 0.82
—–3 98 (20.3) -0.34 0.03
—–4 96 (19.9) -0.30 0.05
—–5 (wealthiest) 117 (24.3) -0.70 <0.001
—–Missing 18
House type
—–Plastered brick, cemented floor 76 (15.2) ref ref
—–Thatched 161 (32.2) 0.83 <0.001
—–Unplastered brick 150 (30.0) 0.25 0.10
—–Plastered brick, earth floor 113 (22.6) 0.02 0.89
Note

There are 3 models that have a p value higher than 0.10 (the ones in red) Those variables are not going to enter to the final model

4. Construct a multiple linear regresion with all variables that were significant. Which variables will you include?

We will incude those who obtained a p value less than 0.10: animals, ventilation, cowdung, asset index and house_type. The regresion model is:

lm9 = lm(log_parg ~ animals + ventilation + cowdung + asset_index + house_type, data = sandfly)
summary(lm9)

Call:
lm(formula = log_parg ~ animals + ventilation + cowdung + asset_index + 
    house_type, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2528 -0.6406 -0.0122  0.7067  3.2542 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              3.03255    0.31616   9.592  < 2e-16 ***
animals1                 0.54503    0.24444   2.230  0.02624 *  
ventilation1            -0.14492    0.10974  -1.321  0.18728    
cowdung1                 0.17978    0.28358   0.634  0.52641    
asset_index2             0.12385    0.16411   0.755  0.45082    
asset_index3            -0.20495    0.14927  -1.373  0.17040    
asset_index4            -0.13879    0.15207  -0.913  0.36187    
asset_index5            -0.33856    0.16225  -2.087  0.03746 *  
house_typepbrick_efloor -0.02115    0.16290  -0.130  0.89678    
house_typethatched       0.51342    0.17465   2.940  0.00345 ** 
house_typeupbrick        0.04044    0.16383   0.247  0.80514    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.04 on 471 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.1274,    Adjusted R-squared:  0.1089 
F-statistic: 6.876 on 10 and 471 DF,  p-value: 4.592e-10
5. Which is the weaksest factor (highest p-value) and how high is its p-value?
Note

Cowndown (has the higher pvalue 0.52641)

Clarification: If we see the previous results, In fact the one with the highest p-value is one category of house_type (brick_efloor: 0.89678). But because there are other categories of the same variable that are significant (has p value less than 0.05), we cannot eliminate that variable.

6. Remove ‘cowdung’, which is now the weakest factor and what is its p-value?

So the next model is:

lm10 = lm(log_parg ~ animals + ventilation + asset_index + house_type, data = sandfly)
summary(lm10)

Call:
lm(formula = log_parg ~ animals + ventilation + asset_index + 
    house_type, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2535 -0.6438 -0.0135  0.6956  3.2634 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              3.195702   0.183532  17.412  < 2e-16 ***
animals1                 0.539125   0.244108   2.209  0.02769 *  
ventilation1            -0.151368   0.109200  -1.386  0.16636    
asset_index2             0.124507   0.163999   0.759  0.44812    
asset_index3            -0.208963   0.149044  -1.402  0.16157    
asset_index4            -0.135136   0.151862  -0.890  0.37399    
asset_index5            -0.342134   0.162053  -2.111  0.03528 *  
house_typepbrick_efloor -0.004478   0.160663  -0.028  0.97778    
house_typethatched       0.530562   0.172439   3.077  0.00221 ** 
house_typeupbrick        0.057824   0.161416   0.358  0.72033    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.039 on 472 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.1266,    Adjusted R-squared:   0.11 
F-statistic: 7.605 on 9 and 472 DF,  p-value: 1.904e-10
Note

Ventilation (has now the higher pvalue 0.16636)

Clarification: If we see the previous results, in fact are others with a highest p-value (ej: asset_index2 or house_typepbrick_efloor). But because there are other categories of the same variable that are significant (has p value less than 0.05, we cannot eliminate that variable.

7. Remove ‘ventilation’. Is there still another factor of which no category is significant at the 5% level ? Which factors do you keep in your final model?

So the new model is:

lm11 = lm(log_parg ~ animals + asset_index + house_type, data = sandfly)
summary(lm11)

Call:
lm(formula = log_parg ~ animals + asset_index + house_type, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2045 -0.6371 -0.0027  0.6814  3.2001 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              3.129930   0.177465  17.637  < 2e-16 ***
animals1                 0.505096   0.243107   2.078 0.038279 *  
asset_index2             0.114276   0.163992   0.697 0.486246    
asset_index3            -0.215243   0.149120  -1.443 0.149563    
asset_index4            -0.137779   0.151997  -0.906 0.365156    
asset_index5            -0.381083   0.159754  -2.385 0.017451 *  
house_typepbrick_efloor -0.006961   0.160810  -0.043 0.965490    
house_typethatched       0.586305   0.167848   3.493 0.000522 ***
house_typeupbrick        0.074528   0.161122   0.463 0.643897    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.04 on 473 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.1231,    Adjusted R-squared:  0.1083 
F-statistic: 8.299 on 8 and 473 DF,  p-value: 1.493e-10
Note

All have at least one category with a p-value < 0.05, So there is not another factor no significant and I dont have to exclude another variable

So, This factors are the ones who are going to stay in the final model: House_type, asset_index and animals.

8. To test for interactions generate the following binary variables:

- rich, ‘yes’ if asset_index = 5, ‘no’ if asset-index = 1-4 (beware of missing values!)

- thatched, ‘yes’ if house_type = ‘thatched, else ’no’

Run the model with ‘animals’, ‘rich’ and ‘thatched’ as predictors, note down the equation. Are all three factors statistically significant?

8.1 Creating the variable Rich

#Variable Rich
      #table(sandfly$asset_index, useNA = "always")
      sandfly$rich <- ifelse(sandfly$asset_index==5,1,0)
        table(sandfly$rich, useNA = "always")

   0    1 <NA> 
 365  117   18 

8.2 Creating the variable thatched

#Variable thatched
      #table(sandfly$house_type, useNA = "always")
      sandfly$thatched <- ifelse(sandfly$house_type=="thatched",1,0)
        table(sandfly$thatched, useNA = "always")

   0    1 <NA> 
 339  161    0 

8.3 With this variables, the regresion model is:

  mod12 <- lm(log_parg ~ animals+rich+thatched, data=sandfly)
  summary(mod12)

Call:
lm(formula = log_parg ~ animals + rich + thatched, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1078 -0.6820  0.0068  0.6770  3.2620 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.08869    0.06985  44.217  < 2e-16 ***
animals1     0.55419    0.24126   2.297  0.02205 *  
rich        -0.32729    0.11664  -2.806  0.00522 ** 
thatched     0.56358    0.10836   5.201 2.95e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.041 on 478 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.1128,    Adjusted R-squared:  0.1072 
F-statistic: 20.25 on 3 and 478 DF,  p-value: 2.277e-12
Note

We can see that all variables are still significant The equation of this model is: log_parg = 3.09 + 0.55* animals – 0.33* rich + 0.56* thatched

9. Theoretically there could be interactions between rich and thatched, rich and animals and animals and thatched. Is any of these interaction terms statistically significant?

Check for Possible interactions (pairs)

mod13 <- lm(log_parg ~ animals+rich*thatched, data=sandfly)

mod14 <- lm(log_parg ~ rich+thatched*animals, data=sandfly)

mod15 <- lm(log_parg ~ animals*rich+thatched, data=sandfly)

mod13 <- lm(log_parg ~ animals+rich*thatched, data=sandfly)
summary(mod13)

Call:
lm(formula = log_parg ~ animals + rich * thatched, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1053 -0.6899 -0.0030  0.6862  3.2730 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.09408    0.07087  43.659  < 2e-16 ***
animals1       0.56026    0.24181   2.317  0.02093 *  
rich          -0.34365    0.12195  -2.818  0.00503 ** 
thatched       0.54954    0.11259   4.881 1.44e-06 ***
rich:thatched  0.19563    0.42168   0.464  0.64291    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.042 on 477 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.1132,    Adjusted R-squared:  0.1057 
F-statistic: 15.22 on 4 and 477 DF,  p-value: 1.019e-11
mod14 <- lm(log_parg ~ rich+thatched*animals, data=sandfly)
summary(mod14)

Call:
lm(formula = log_parg ~ rich + thatched * animals, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0775 -0.6796 -0.0034  0.6854  3.2755 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3.07750    0.06994  43.999  < 2e-16 ***
rich              -0.32953    0.11636  -2.832  0.00482 ** 
thatched           0.60208    0.11010   5.469 7.33e-08 ***
animals1           1.20517    0.42790   2.817  0.00506 ** 
thatched:animals1 -0.95217    0.51750  -1.840  0.06640 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.038 on 477 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.119, Adjusted R-squared:  0.1116 
F-statistic: 16.11 on 4 and 477 DF,  p-value: 2.206e-12
mod15 <- lm(log_parg ~ animals*rich+thatched, data=sandfly)
summary(mod15)

Call:
lm(formula = log_parg ~ animals * rich + thatched, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0898 -0.6869  0.0029  0.6843  3.2723 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.08981    0.06989  44.210  < 2e-16 ***
animals1       0.48546    0.25532   1.901  0.05785 .  
rich          -0.33865    0.11749  -2.882  0.00413 ** 
thatched       0.56918    0.10861   5.241 2.41e-07 ***
animals1:rich  0.64829    0.78580   0.825  0.40978    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.041 on 477 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.114, Adjusted R-squared:  0.1066 
F-statistic: 15.35 on 4 and 477 DF,  p-value: 8.137e-12
Note

None of them are statistically significant. So this models are not good

  - rich*thatched       p= 0.64
  - thatched*animals    p = 0.07
  - animals*rich        p = 0.41

Clarification: To see if some model is better than another, we should perform a F test (anova table), but it will not be done in this exercise

10. As our final model we keep the model with type of house, asset-index and presence of animals. Please make a table presenting this final model.

They are talking about lm11 = lm(log_parg ~ animals + asset_index + house_type, data = sandfly)

lm11 = lm(log_parg ~ animals + asset_index + house_type, data = sandfly)
summary(lm11)

Call:
lm(formula = log_parg ~ animals + asset_index + house_type, data = sandfly)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2045 -0.6371 -0.0027  0.6814  3.2001 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              3.129930   0.177465  17.637  < 2e-16 ***
animals1                 0.505096   0.243107   2.078 0.038279 *  
asset_index2             0.114276   0.163992   0.697 0.486246    
asset_index3            -0.215243   0.149120  -1.443 0.149563    
asset_index4            -0.137779   0.151997  -0.906 0.365156    
asset_index5            -0.381083   0.159754  -2.385 0.017451 *  
house_typepbrick_efloor -0.006961   0.160810  -0.043 0.965490    
house_typethatched       0.586305   0.167848   3.493 0.000522 ***
house_typeupbrick        0.074528   0.161122   0.463 0.643897    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.04 on 473 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.1231,    Adjusted R-squared:  0.1083 
F-statistic: 8.299 on 8 and 473 DF,  p-value: 1.493e-10
Note
  Intercept: 3.129930   
  R2 : 0.1231
  P-value of the model: 1.493e-10 (less than 0.05)

10.1 The table should be like this:

Note
Factor Coefficient p-value
Animals in the house 0.51 0.04
Asset index
-1 (poorest) ref ref
-2 0.11 0.49
-3 -0.22 0.15
-4 -0.14 0.37
-5 (wealthiest) -0.38 0.02
House type
-Thatched 0.59 0.0005
-Unplastered brick 0.07 0.64
-Plastered brick earth floor -0.007 0.97
-Plastered brick cemented floor ref ref
11. How much of the variability in sand fly exposure is explained by the three factors retained? What would be your conclusion about exposure to sand flies in this rural area.
Note

R squared = 0.1231, so the model explains only 12% of the total variability.

- Being rich and living in a good house (Plastered brick earth floor or cemented
  floor) reduces outcome, but far from being factors that play an important role.
- Could be other factors that play a role. 
12. What is the predicted total number of sandflies in a thatched house with animals inside of a family belonging to the poorest quintile? And in a plastered brick house with cemented floor without animals and belonging to the wealthiest quintile?

12.1 For the first condition:

Note

So the conditions are:

- Animals in the hous: Yes
- House type: thatched house
- Assest index: 1 (poorest)

So the equation for lm11 should be

  y = a + b1X1 + b2x2 + b3x3
  y = a + b1(animals) + b2(assest_index) + b3 (house_type)
  y = 3.129930 + 0.51(animals) + 0(assest_index:poorest) + 0.59(house_type:thatched)
  y = 3.13     + 0.51(1)       + 0                       + 0.59(1)
log_parg = 4.23
  #The "log of parg" is 4.23, but I need the value of parg
exp(log_parg) = exp(4.23)
parg = 69
  #parg is 69 with this conditions

12.2 For the second condition:

Note

So the conditions are:

- Animals in the house: No
- House type: Plastered brick cemented floor
- Assest index: 5 (wealthiest)

So the equation for lm11 should be

  y = a + b1X1 + b2x2 + b3x3
  y = a + b1(animals) + b2(assest_index) + b3 (house_type)
  y = 3.129930 + 0.51(animals) -0.38(wealthiest) + 0(Plastered brick cemented floor)
  y = 3.13     + 0.51(0)       -0.38(1)          + 0
log_parg = 2.75
  #The "log of parg" is 2.75 , but I need the value of parg
exp(log_parg) = exp(2.75)
parg = 16
  #parg is 16 with this conditions


Extra
Note

If we want to really check witch model is best than another (like in the step 9), we need to do a f test - anova function and see the p value:

anova(lm11, mod13)

But an error may show. When one of the variables that are in one model have missing values, R eliminate those rows when doing the model. So then if we want to compare, the models might not have the same number of rows and the test will fail.

For this is better to work with a subset of the data with no missing data in all the variables:

So if I know that asset_index has missing value, then:

sandfly_complete <- subset(sandfly, !is.na(sandfly$asset_index))
or
sandfly_complete <- subset(sandfly, complete.cases(sandfly))