<- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Linear Regresion Exercise Database/Datasets/sandfly.csv", sep=",") sandfly
2023-02-03 Sessión 4
- Instituut Voor Tropische Geneeskunde - Antwerp, Belgium
- Javier Silva-Valencia
Step by Step - Selection of variables for modeling
Import data
Importing a CSV database under the name of “sandfly”, with “,” as separator:
Spoiler alert
What we are going to do today is a multivariate linear regression and selecting the variables that will be included in this regression with the “Classical model selection”
So, so far we know that the steps to the “Classical model selection” are:
Steps | Example |
---|---|
Imagine that you have 6 independent variables and 1 dependent variable. And you want to create a linear regression. | |
1. Do a simple linear regression (bivariate) which each independent variable | So we will have 6 models |
2. Note the models with a p-value less than 0.10. Those are the variables we are going to start working with. Exclude the others | Imagine that only 4 models had a p-value less than 0.10. So we keep those independent variables |
3. Now create a multivariate linear regression model with all the remaining variables | So we will create 01 model with the 4 remaining independent variables. (multivariate model 1) |
4. Check in the results if any on the variables have a p-value higher than 0.05. If yes, exclude the one with the highest value and create a new model with the remaining variables. Repeat until all variables has a significant p-value (less than 0.05) | Imagine that in the results we see that two variables had a p-value bigger than 0.05. We exclude the highest and create a (multivariate model 2). In the new results we still see that one variable had a p-value bigger than 0.05. So we exclude it and create a (multivariate model 3). In the new results all variables has a significant p value. |
Now you have the variables that, no matter what happens, they have to be in your model. But, are they ok in that way? or it would be better to put them as interaction? | The remaining model (multivariate model 3) only includes 2 independent variables: lm(DependVariabl~ IndepVariab1 + IndependVariable2 |
5. Create another model(s) with the remaining variables but as an interaction | So, we create another model (multivariate model 4) but as an interaction lm(DependVariabl~ IndepVariab1 * IndependVariable2 |
6. Check if the model(s) with interaction is significantly better than the one without interaction. For this perform an F test (anova table) | We will compare the (multivariate model 3) vs the (multivariate model 4) in an anova table. |
7. If the results show that no model is better than the other, we keep the model more simple (the one without interaction). Otherwise, we keep the model with interaction | Imagine that the p-value of the ANOVA table is not significant (less than 0.05). This means that no model is better than the other. So we keep the more simple model (the one without interaction) |
So in this case the final model would be the (multivariate model 3) |
Starting the Exercise (Ex05-2023):
In Bihar, India, visceral Leishmaniasis or Kala Azar is transmitted by the phlebotomine sand fly p.argentipes. The disease is known to disproportionally affect the poorest segments of society, this could be due to increased exposure to sand flies as a result of poor housing and environmental conditions. To assess the independent contributions of different factors, a study team conducted an entomological survey and collected information on housing and environmental conditions as well as on assets owned. Five hundred houses from a rural block were randomly sampled and light traps were installed simultaneously for a total of 6 nights in each house. Per house the total number of p. argentipes sand flies captured was determined. The data can be found in sandfly.csv.
1. The outcome variable is tot_parg (total numbers of p.argentipes sandflyes captured per house) Is tot_parg normally distributed?
hist(sandfly$tot_parg,main="Total number of p.argentipes sandflies captured in a house distribution",
xlab="Total number of p.argentipes sandflies captured", ylab="nb", col="green", border="dark green")
It is not normally distributed
2. Generate a new variable “log_parg” which is the natural logarithm of tot_parg. Is normally distributed?
$log_parg <- log(sandfly$tot_parg) #Creating the natural logarith variable sandfly
hist(sandfly$log_parg,main="Log of p.argentipes sandflies captured in a house distribution",
xlab="Log of p.argentipes sandflies captured", ylab="nb", col="green", border="dark green")
It seems normally distributed
3. Explored in simple linear regression whether each of the independent variables is associated with log_parg. Make a table listing each factor, the number and proportion of households in which it is present
This is the Classical model selection. So, we need bivariate regressions for each independent variable with log_parg
3.1 Animals
$animals <- factor(sandfly$animals) #Factor because is a categorical variable
sandfly#table (sandfly$animals, useNA="always") #To explore the frequency and missing values
= lm(log_parg ~ animals, data = sandfly) #Simple linear regression (bivariate)
lm1 summary(lm1)
Call:
lm(formula = log_parg ~ animals, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.1944 -0.7965 -0.0164 0.7226 3.3292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.19440 0.05053 63.218 < 2e-16 ***
animals1 0.81025 0.25265 3.207 0.00143 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.107 on 498 degrees of freedom
Multiple R-squared: 0.02023, Adjusted R-squared: 0.01827
F-statistic: 10.29 on 1 and 498 DF, p-value: 0.001427
3.2 Stove
$stove <- factor(sandfly$stove) #Factor because is a categorical variable
sandfly#table (sandfly$stove, useNA="always") #To explore the frequency and missing values
= lm(log_parg ~ stove, data = sandfly) #Simple linear regression (bivariate)
lm2 summary(lm2)
Call:
lm(formula = log_parg ~ stove, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.4602 -0.8197 -0.0395 0.7336 3.3060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.21760 0.05095 63.150 <2e-16 ***
stove1 0.24261 0.26138 0.928 0.354
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.117 on 498 degrees of freedom
Multiple R-squared: 0.001727, Adjusted R-squared: -0.0002775
F-statistic: 0.8615 on 1 and 498 DF, p-value: 0.3538
3.3 Ventilation
$ventilation <- factor(sandfly$ventilation) #Factor because is a categorical variable
sandfly#table (sandfly$ventilation, useNA="always") #To explore the frequency and missing values
= lm(log_parg ~ ventilation, data = sandfly) #Simple linear regression (bivariate)
lm3 summary(lm3)
Call:
lm(formula = log_parg ~ ventilation, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.3918 -0.6837 -0.0740 0.7756 3.1318
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.39179 0.06271 54.084 < 2e-16 ***
ventilation1 -0.42741 0.10094 -4.234 2.73e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.099 on 498 degrees of freedom
Multiple R-squared: 0.03475, Adjusted R-squared: 0.03281
F-statistic: 17.93 on 1 and 498 DF, p-value: 2.731e-05
3.4 Light
$light <- factor(sandfly$light) #Factor because is a categorical variable
sandfly#table (sandfly$light, useNA="always") #To explore the frequency and missing values
= lm(log_parg ~ light, data = sandfly) #Simple linear regression (bivariate)
lm4 summary(lm4)
Call:
lm(formula = log_parg ~ light, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.2306 -0.8216 -0.0118 0.7425 3.2929
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.23063 0.06166 52.394 <2e-16 ***
light1 -0.01115 0.10544 -0.106 0.916
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.118 on 498 degrees of freedom
Multiple R-squared: 2.244e-05, Adjusted R-squared: -0.001986
F-statistic: 0.01118 on 1 and 498 DF, p-value: 0.9158
3.5 Cowdung
$cowdung <- factor(sandfly$cowdung) #Factor because is a categorical variable
sandfly#table (sandfly$cowdung, useNA="always") #To explore the frequency and missing values
= lm(log_parg ~ cowdung, data = sandfly) #Simple linear regression (bivariate)
lm5 summary(lm5)
Call:
lm(formula = log_parg ~ cowdung, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.2419 -0.7570 -0.0230 0.7284 3.2817
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.7388 0.2879 9.512 <2e-16 ***
cowdung1 0.5031 0.2923 1.721 0.0859 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.115 on 498 degrees of freedom
Multiple R-squared: 0.005912, Adjusted R-squared: 0.003916
F-statistic: 2.962 on 1 and 498 DF, p-value: 0.08588
3.6 Water
$water <- factor(sandfly$water) #Factor because is a categorical variable
sandfly#table (sandfly$water, useNA="always") #To explore the frequency and missing values
= lm(log_parg ~ water, data = sandfly) #Simple linear regression (bivariate)
lm6 summary(lm6)
Call:
lm(formula = log_parg ~ water, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.3342 -0.8027 -0.0305 0.7452 3.3229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.20064 0.05572 57.443 <2e-16 ***
water1 0.13354 0.12586 1.061 0.289
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.117 on 498 degrees of freedom
Multiple R-squared: 0.002256, Adjusted R-squared: 0.0002522
F-statistic: 1.126 on 1 and 498 DF, p-value: 0.2892
3.7 Asset_index
$asset_index <- factor(sandfly$asset_index) #Factor because is a categorical variable}
sandfly#we don't need to relevel
#table (sandfly$asset_index, useNA="always") #To explore the frequency and missing values
= lm(log_parg ~ asset_index, data = sandfly) #Simple linear regression (bivariate)
lm7 summary(lm7)
Call:
lm(formula = log_parg ~ asset_index, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.5065 -0.7242 0.0206 0.7371 3.2189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.50646 0.10664 32.882 < 2e-16 ***
asset_index2 0.03697 0.16667 0.222 0.8245
asset_index3 -0.33546 0.15196 -2.208 0.0278 *
asset_index4 -0.30015 0.15276 -1.965 0.0500 .
asset_index5 -0.70187 0.14556 -4.822 1.92e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.072 on 477 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.06171, Adjusted R-squared: 0.05384
F-statistic: 7.843 on 4 and 477 DF, p-value: 3.968e-06
3.8 House_type
#Already a character (categorical) variable no need to factor
#table(sandfly$house_type, useNA="always") #To explore the frequency and missing values
= lm(log_parg ~ house_type, data = sandfly) #Simple linear regression (bivariate)
lm8 summary(lm8)
Call:
lm(formula = log_parg ~ house_type, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.1259 -0.6913 -0.0121 0.6752 3.2067
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.88012 0.12215 23.578 < 2e-16 ***
house_typepbrick_efloor 0.02238 0.15798 0.142 0.887
house_typethatched 0.83198 0.14821 5.614 3.3e-08 ***
house_typeupbrick 0.24579 0.14994 1.639 0.102
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.065 on 496 degrees of freedom
Multiple R-squared: 0.09706, Adjusted R-squared: 0.0916
F-statistic: 17.77 on 3 and 496 DF, p-value: 5.686e-11
3.9 Completing the requested table and checking witch models has a p value higher than 0.10
We do this checking all the results above
Factor | Number (%) | Coefficient | p-value |
---|---|---|---|
Animals in the house | 20 (4.0) | 0.81 | 0.001 |
Traditional stove | 19 (3.8) | 0.24 | 0.35 |
Ventilation | 193 (38.6) | -0.43 | <0.001 |
Adequate light | 171 (34.2) | -0.01 | 0.92 |
Cow dung | 485 (97.0) | 0.50 | 0.09 |
Water body | 98 (19.6) | 0.13 | 0.29 |
Asset index* | |||
—–1 (poorest) | 101 (21.0) | ref | ref |
—–2 | 70 (14.5) | 0.04 | 0.82 |
—–3 | 98 (20.3) | -0.34 | 0.03 |
—–4 | 96 (19.9) | -0.30 | 0.05 |
—–5 (wealthiest) | 117 (24.3) | -0.70 | <0.001 |
—–Missing | 18 | ||
House type | |||
—–Plastered brick, cemented floor | 76 (15.2) | ref | ref |
—–Thatched | 161 (32.2) | 0.83 | <0.001 |
—–Unplastered brick | 150 (30.0) | 0.25 | 0.10 |
—–Plastered brick, earth floor | 113 (22.6) | 0.02 | 0.89 |
There are 3 models that have a p value higher than 0.10 (the ones in red) Those variables are not going to enter to the final model
4. Construct a multiple linear regresion with all variables that were significant. Which variables will you include?
We will incude those who obtained a p value less than 0.10: animals, ventilation, cowdung, asset index and house_type. The regresion model is:
= lm(log_parg ~ animals + ventilation + cowdung + asset_index + house_type, data = sandfly)
lm9 summary(lm9)
Call:
lm(formula = log_parg ~ animals + ventilation + cowdung + asset_index +
house_type, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.2528 -0.6406 -0.0122 0.7067 3.2542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.03255 0.31616 9.592 < 2e-16 ***
animals1 0.54503 0.24444 2.230 0.02624 *
ventilation1 -0.14492 0.10974 -1.321 0.18728
cowdung1 0.17978 0.28358 0.634 0.52641
asset_index2 0.12385 0.16411 0.755 0.45082
asset_index3 -0.20495 0.14927 -1.373 0.17040
asset_index4 -0.13879 0.15207 -0.913 0.36187
asset_index5 -0.33856 0.16225 -2.087 0.03746 *
house_typepbrick_efloor -0.02115 0.16290 -0.130 0.89678
house_typethatched 0.51342 0.17465 2.940 0.00345 **
house_typeupbrick 0.04044 0.16383 0.247 0.80514
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.04 on 471 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.1274, Adjusted R-squared: 0.1089
F-statistic: 6.876 on 10 and 471 DF, p-value: 4.592e-10
5. Which is the weaksest factor (highest p-value) and how high is its p-value?
Cowndown (has the higher pvalue 0.52641)
Clarification: If we see the previous results, In fact the one with the highest p-value is one category of house_type (brick_efloor: 0.89678). But because there are other categories of the same variable that are significant (has p value less than 0.05), we cannot eliminate that variable.
6. Remove ‘cowdung’, which is now the weakest factor and what is its p-value?
So the next model is:
= lm(log_parg ~ animals + ventilation + asset_index + house_type, data = sandfly)
lm10 summary(lm10)
Call:
lm(formula = log_parg ~ animals + ventilation + asset_index +
house_type, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.2535 -0.6438 -0.0135 0.6956 3.2634
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.195702 0.183532 17.412 < 2e-16 ***
animals1 0.539125 0.244108 2.209 0.02769 *
ventilation1 -0.151368 0.109200 -1.386 0.16636
asset_index2 0.124507 0.163999 0.759 0.44812
asset_index3 -0.208963 0.149044 -1.402 0.16157
asset_index4 -0.135136 0.151862 -0.890 0.37399
asset_index5 -0.342134 0.162053 -2.111 0.03528 *
house_typepbrick_efloor -0.004478 0.160663 -0.028 0.97778
house_typethatched 0.530562 0.172439 3.077 0.00221 **
house_typeupbrick 0.057824 0.161416 0.358 0.72033
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.039 on 472 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.1266, Adjusted R-squared: 0.11
F-statistic: 7.605 on 9 and 472 DF, p-value: 1.904e-10
Ventilation (has now the higher pvalue 0.16636)
Clarification: If we see the previous results, in fact are others with a highest p-value (ej: asset_index2 or house_typepbrick_efloor). But because there are other categories of the same variable that are significant (has p value less than 0.05, we cannot eliminate that variable.
7. Remove ‘ventilation’. Is there still another factor of which no category is significant at the 5% level ? Which factors do you keep in your final model?
So the new model is:
= lm(log_parg ~ animals + asset_index + house_type, data = sandfly)
lm11 summary(lm11)
Call:
lm(formula = log_parg ~ animals + asset_index + house_type, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.2045 -0.6371 -0.0027 0.6814 3.2001
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.129930 0.177465 17.637 < 2e-16 ***
animals1 0.505096 0.243107 2.078 0.038279 *
asset_index2 0.114276 0.163992 0.697 0.486246
asset_index3 -0.215243 0.149120 -1.443 0.149563
asset_index4 -0.137779 0.151997 -0.906 0.365156
asset_index5 -0.381083 0.159754 -2.385 0.017451 *
house_typepbrick_efloor -0.006961 0.160810 -0.043 0.965490
house_typethatched 0.586305 0.167848 3.493 0.000522 ***
house_typeupbrick 0.074528 0.161122 0.463 0.643897
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.04 on 473 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.1231, Adjusted R-squared: 0.1083
F-statistic: 8.299 on 8 and 473 DF, p-value: 1.493e-10
All have at least one category with a p-value < 0.05, So there is not another factor no significant and I dont have to exclude another variable
So, This factors are the ones who are going to stay in the final model: House_type, asset_index and animals.
8. To test for interactions generate the following binary variables:
- rich, ‘yes’ if asset_index = 5, ‘no’ if asset-index = 1-4 (beware of missing values!)
- thatched, ‘yes’ if house_type = ‘thatched, else ’no’
Run the model with ‘animals’, ‘rich’ and ‘thatched’ as predictors, note down the equation. Are all three factors statistically significant?
8.1 Creating the variable Rich
#Variable Rich
#table(sandfly$asset_index, useNA = "always")
$rich <- ifelse(sandfly$asset_index==5,1,0)
sandflytable(sandfly$rich, useNA = "always")
0 1 <NA>
365 117 18
8.2 Creating the variable thatched
#Variable thatched
#table(sandfly$house_type, useNA = "always")
$thatched <- ifelse(sandfly$house_type=="thatched",1,0)
sandflytable(sandfly$thatched, useNA = "always")
0 1 <NA>
339 161 0
8.3 With this variables, the regresion model is:
<- lm(log_parg ~ animals+rich+thatched, data=sandfly)
mod12 summary(mod12)
Call:
lm(formula = log_parg ~ animals + rich + thatched, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.1078 -0.6820 0.0068 0.6770 3.2620
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.08869 0.06985 44.217 < 2e-16 ***
animals1 0.55419 0.24126 2.297 0.02205 *
rich -0.32729 0.11664 -2.806 0.00522 **
thatched 0.56358 0.10836 5.201 2.95e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.041 on 478 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.1128, Adjusted R-squared: 0.1072
F-statistic: 20.25 on 3 and 478 DF, p-value: 2.277e-12
We can see that all variables are still significant The equation of this model is: log_parg = 3.09 + 0.55* animals – 0.33* rich + 0.56* thatched
9. Theoretically there could be interactions between rich and thatched, rich and animals and animals and thatched. Is any of these interaction terms statistically significant?
Check for Possible interactions (pairs)
mod13 <- lm(log_parg ~ animals+rich*thatched, data=sandfly)
mod14 <- lm(log_parg ~ rich+thatched*animals, data=sandfly)
mod15 <- lm(log_parg ~ animals*rich+thatched, data=sandfly)
<- lm(log_parg ~ animals+rich*thatched, data=sandfly)
mod13 summary(mod13)
Call:
lm(formula = log_parg ~ animals + rich * thatched, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.1053 -0.6899 -0.0030 0.6862 3.2730
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.09408 0.07087 43.659 < 2e-16 ***
animals1 0.56026 0.24181 2.317 0.02093 *
rich -0.34365 0.12195 -2.818 0.00503 **
thatched 0.54954 0.11259 4.881 1.44e-06 ***
rich:thatched 0.19563 0.42168 0.464 0.64291
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.042 on 477 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.1132, Adjusted R-squared: 0.1057
F-statistic: 15.22 on 4 and 477 DF, p-value: 1.019e-11
<- lm(log_parg ~ rich+thatched*animals, data=sandfly)
mod14 summary(mod14)
Call:
lm(formula = log_parg ~ rich + thatched * animals, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.0775 -0.6796 -0.0034 0.6854 3.2755
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.07750 0.06994 43.999 < 2e-16 ***
rich -0.32953 0.11636 -2.832 0.00482 **
thatched 0.60208 0.11010 5.469 7.33e-08 ***
animals1 1.20517 0.42790 2.817 0.00506 **
thatched:animals1 -0.95217 0.51750 -1.840 0.06640 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.038 on 477 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.119, Adjusted R-squared: 0.1116
F-statistic: 16.11 on 4 and 477 DF, p-value: 2.206e-12
<- lm(log_parg ~ animals*rich+thatched, data=sandfly)
mod15 summary(mod15)
Call:
lm(formula = log_parg ~ animals * rich + thatched, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.0898 -0.6869 0.0029 0.6843 3.2723
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.08981 0.06989 44.210 < 2e-16 ***
animals1 0.48546 0.25532 1.901 0.05785 .
rich -0.33865 0.11749 -2.882 0.00413 **
thatched 0.56918 0.10861 5.241 2.41e-07 ***
animals1:rich 0.64829 0.78580 0.825 0.40978
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.041 on 477 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.114, Adjusted R-squared: 0.1066
F-statistic: 15.35 on 4 and 477 DF, p-value: 8.137e-12
None of them are statistically significant. So this models are not good
- rich*thatched p= 0.64
- thatched*animals p = 0.07
- animals*rich p = 0.41
Clarification: To see if some model is better than another, we should perform a F test (anova table), but it will not be done in this exercise
10. As our final model we keep the model with type of house, asset-index and presence of animals. Please make a table presenting this final model.
They are talking about lm11 = lm(log_parg ~ animals + asset_index + house_type, data = sandfly)
= lm(log_parg ~ animals + asset_index + house_type, data = sandfly)
lm11 summary(lm11)
Call:
lm(formula = log_parg ~ animals + asset_index + house_type, data = sandfly)
Residuals:
Min 1Q Median 3Q Max
-3.2045 -0.6371 -0.0027 0.6814 3.2001
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.129930 0.177465 17.637 < 2e-16 ***
animals1 0.505096 0.243107 2.078 0.038279 *
asset_index2 0.114276 0.163992 0.697 0.486246
asset_index3 -0.215243 0.149120 -1.443 0.149563
asset_index4 -0.137779 0.151997 -0.906 0.365156
asset_index5 -0.381083 0.159754 -2.385 0.017451 *
house_typepbrick_efloor -0.006961 0.160810 -0.043 0.965490
house_typethatched 0.586305 0.167848 3.493 0.000522 ***
house_typeupbrick 0.074528 0.161122 0.463 0.643897
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.04 on 473 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.1231, Adjusted R-squared: 0.1083
F-statistic: 8.299 on 8 and 473 DF, p-value: 1.493e-10
Intercept: 3.129930
R2 : 0.1231
P-value of the model: 1.493e-10 (less than 0.05)
10.1 The table should be like this:
Factor | Coefficient | p-value |
---|---|---|
Animals in the house | 0.51 | 0.04 |
Asset index | ||
-1 (poorest) | ref | ref |
-2 | 0.11 | 0.49 |
-3 | -0.22 | 0.15 |
-4 | -0.14 | 0.37 |
-5 (wealthiest) | -0.38 | 0.02 |
House type | ||
-Thatched | 0.59 | 0.0005 |
-Unplastered brick | 0.07 | 0.64 |
-Plastered brick earth floor | -0.007 | 0.97 |
-Plastered brick cemented floor | ref | ref |
11. How much of the variability in sand fly exposure is explained by the three factors retained? What would be your conclusion about exposure to sand flies in this rural area.
R squared = 0.1231, so the model explains only 12% of the total variability.
- Being rich and living in a good house (Plastered brick earth floor or cemented
floor) reduces outcome, but far from being factors that play an important role.
- Could be other factors that play a role.
12. What is the predicted total number of sandflies in a thatched house with animals inside of a family belonging to the poorest quintile? And in a plastered brick house with cemented floor without animals and belonging to the wealthiest quintile?
12.1 For the first condition:
So the conditions are:
- Animals in the hous: Yes
- House type: thatched house
- Assest index: 1 (poorest)
So the equation for lm11 should be
y = a + b1X1 + b2x2 + b3x3
y = a + b1(animals) + b2(assest_index) + b3 (house_type)
y = 3.129930 + 0.51(animals) + 0(assest_index:poorest) + 0.59(house_type:thatched)
y = 3.13 + 0.51(1) + 0 + 0.59(1)
log_parg = 4.23
#The "log of parg" is 4.23, but I need the value of parg
exp(log_parg) = exp(4.23)
parg = 69
#parg is 69 with this conditions
12.2 For the second condition:
So the conditions are:
- Animals in the house: No
- House type: Plastered brick cemented floor
- Assest index: 5 (wealthiest)
So the equation for lm11 should be
y = a + b1X1 + b2x2 + b3x3
y = a + b1(animals) + b2(assest_index) + b3 (house_type)
y = 3.129930 + 0.51(animals) -0.38(wealthiest) + 0(Plastered brick cemented floor)
y = 3.13 + 0.51(0) -0.38(1) + 0
log_parg = 2.75
#The "log of parg" is 2.75 , but I need the value of parg
exp(log_parg) = exp(2.75)
parg = 16
#parg is 16 with this conditions
Extra
If we want to really check witch model is best than another (like in the step 9), we need to do a f test - anova function and see the p value:
anova(lm11, mod13)
But an error may show. When one of the variables that are in one model have missing values, R eliminate those rows when doing the model. So then if we want to compare, the models might not have the same number of rows and the test will fail.
For this is better to work with a subset of the data with no missing data in all the variables:
So if I know that asset_index has missing value, then:
sandfly_complete <- subset(sandfly, !is.na(sandfly$asset_index))
or
sandfly_complete <- subset(sandfly, complete.cases(sandfly))