<- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Logistic Regresion Exercise Database/Datasets/onch1302.csv", sep=",") onch1302
2023-02-06 Sessión 1
- Instituut Voor Tropische Geneeskunde - Antwerp, Belgium
- Javier Silva-Valencia
Step by Step - Selection of variables for modeling
Import data
Importing a CSV database under the name of “onch1302”, with “,” as separator:
Starting the Exercise (MVA_LogReg_Ex1-2023):
Onchocerciasis (commonly known as River Blindness) is a chronic filarial disease found in sub-Saharan Africa and some parts of Central and South America. An onchocerciasis project was set up in 1982 in the Bo district of Sierra Leone. The aims of the project were to study epidemiological, clin-ical, immunological and entomological aspects of the disease. Prevalence surveys were undertaken in villages selected on the basis of potential high endemicity, being situated on or near rivers which are the breeding sites for the Simulium damnosum blackfly. Of the twelve villages included in the present dataset, five were situated in the south and east of the country in the forest' zone and the other seven were in the
savannah’ zone of the country. A census was taken of each village, and all villagers over the age of five years were asked to participate in the study. Coverage was over 90% in all but one of the selected villages. Diagnosis was made by taking a skin-snip, and clinical and an ocular examination were also performed. The file ONCH1302 contains data for all 1,302 subjects.
1. To make the naming a bit more logical, copy the ‘sex’ variable into a new variable called ‘female’ and the ‘area’ variable into one called ‘forest’. For ‘mf’ and ‘lesions’ you may use the variables as is. Next, recode variables with more than 2 levels (‘agegrp’ and ‘mfload’) to binary variables. For ‘agegrp’ use ‘adult’ and code as ‘TRUE’ for those aged 20 and older, ‘FALSE’ for those under 20 years of age. For ‘mfload’ you can recode to ‘highload’ taking values 0 and 1 together as ‘FALSE’ and values 2 and 3 together as ’TRUE
Creating the new variables (as a copy)
$female <- onch1302$sex
onch1302$forest <- onch1302$area onch1302
Recoding the variables with more than 2 levels into binary variables
#Adult variable (adult=1, non adult=0)
$adult <- ifelse(onch1302$agegrp>1,1,0)
onch1302
#Highload (False = 0, True=1)
#table(onch1302$mfload, useNA = "always")
$highload <- 0
onch1302$highload[onch1302$mfload==2] <- 1
onch1302$highload[onch1302$mfload==3] <- 1 onch1302
2. Get an overview of the data by univariable analysis. Construct a table showing numbers and frequencies of each of the variables
table(onch1302$lesions, useNA = "always") #frecuency
0 1 <NA>
1101 201 0
prop.table(table(onch1302$lesions)) #proportion
0 1
0.8456221 0.1543779
table(onch1302$female, useNA = "always") #frecuency
0 1 <NA>
616 686 0
prop.table(table(onch1302$female)) #proportion
0 1
0.4731183 0.5268817
#Forest
table(onch1302$forest, useNA = "always") #frecuency
0 1 <NA>
548 754 0
prop.table(table(onch1302$forest)) #proportion
0 1
0.4208909 0.5791091
#Highload
table(onch1302$highload, useNA = "always") #frecuency
0 1 <NA>
847 455 0
prop.table(table(onch1302$highload)) #proportion
0 1
0.6505376 0.3494624
#Adult
table(onch1302$adult, useNA = "always") #frecuency
0 1 <NA>
420 882 0
prop.table(table(onch1302$adult)) #proportion
0 1
0.3225806 0.6774194
#Microfilaria
table(onch1302$mf, useNA = "always") #frecuency
0 1 <NA>
480 822 0
prop.table(table(onch1302$mf)) #proportion
0 1
0.3686636 0.6313364
Factor | N |
---|---|
Female gender | 686 (52.7) |
Adult age | 882 (67.7) |
Living in Forest | 754 (57.9) |
Micro filaria infected | 822 (63.1) |
High micro-filarial load | 455 (34.9) |
Eyes affected (lesions) | 201 (15.4) |
3. Make three 2x2 tables to explore the association between mf infection and the three exposures, ‘female’, ‘adult’ and ‘forest’ and manually compute the odds ratios between exposed and unex-posed
We can do that directly with “cc” command from the EpiStat library. If EpiStat doesnt work you can do it manually with “table” command and a calculator
library(EpiStats)
cc(onch1302, mf,female)
$df1
Cases Controls Total
Exposed 396 290 686
Unexposed 426 190 616
Total 822 480 1302
Proportion exposed 0.48 0.60 0.53
$df2
Point estimate 95%CI-ll 95%CI-ul
Odds ratio 0.61 0.48 0.77
Prev. frac. ex. 0.39 0.23 0.52
Prev. frac. pop 0.24 NA NA
chi2(1) 18.22 NA NA
Pr>chi2 0.000 NA NA
Micro filaria infected | ||
---|---|---|
Gender | Yes | No |
Female | 396 | 290 |
Male | 426 | 190 |
OR (female) = 0.61 (0.48-0.77) p value = 0.0001
cc(onch1302, mf,adult)
$df1
Cases Controls Total
Exposed 677 205 882
Unexposed 145 275 420
Total 822 480 1302
Proportion exposed 0.82 0.43 0.68
$df2
Point estimate 95%CI-ll 95%CI-ul
Odds ratio 6.26 4.82 8.15
Attr. frac. ex. 0.84 0.79 0.88
Attr. frac. pop 0.69 NA NA
chi2(1) 218.04 NA NA
Pr>chi2 0.000 NA NA
Micro filaria infected | ||
---|---|---|
Age | Yes | No |
20+ | 677 | 205 |
1-19 | 145 | 275 |
OR (female) = 6.26 (4.82-8.15) p value = 0.0001
cc(onch1302, mf,forest)
$df1
Cases Controls Total
Exposed 541 213 754
Unexposed 281 267 548
Total 822 480 1302
Proportion exposed 0.66 0.44 0.58
$df2
Point estimate 95%CI-ll 95%CI-ul
Odds ratio 2.41 1.90 3.06
Attr. frac. ex. 0.59 0.47 0.67
Attr. frac. pop 0.39 NA NA
chi2(1) 57.15 NA NA
Pr>chi2 0.000 NA NA
Micro filaria infected | ||
---|---|---|
Residence | Yes | No |
Forest | 541 | 213 |
Savanna | 281 | 261 |
OR (female) = 2.41 (1.90-3.06) p value = 0.0001
4. As a next step, use the ‘cc’ command from the ‘Epistats’ package to confirmm the associations between ‘mf’ and exposures ‘female’, ‘adult’ and ‘forest’ and add the 95% confidence intervals. Are these associations statistically significant? Who are more at risk, men or women?
We did this in the previous step.
Number (%) | OR (95% CI) | |
---|---|---|
Female | 686 (52.7) | 0.6 (0.5-0.8) |
Adult age | 883 (67.7) | 6.3 (4.8-8.2) |
Forest | 754 (57.9) | 2.4 (1.9-3.1) |
All bivariate associations are significant
5. Now compare results from table-based analyses with results from logistic regression. Are they consistent? For each of the three models note the OR and it’s 95% CI
Now we are going to calculate the OR again but with the logistic regression
5.1. mf and female
<- glm(mf ~ female, family=binomial, data=onch1302)
GLM1 summary(GLM1) # To see the results of the logistic regression
Call:
glm(formula = mf ~ female, family = binomial, data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5338 -1.3122 0.8589 1.0483 1.0483
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.80742 0.08724 9.255 < 2e-16 ***
female -0.49588 0.11655 -4.255 2.09e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1695.7 on 1300 degrees of freedom
AIC: 1699.7
Number of Fisher Scoring iterations: 4
exp(coef(GLM1)) # To calculate the OR
(Intercept) female
2.2421053 0.6090335
exp(confint(GLM1)) # To calculate the confidence intervarl of the OR
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 1.8930554 2.6654704
female 0.4842344 0.7647899
5.2. mf and adult
<- glm(mf ~ adult, family=binomial, data=onch1302)
GLM2 summary(GLM2) # To see the results of the logistic regression
Call:
glm(formula = mf ~ adult, family = binomial, data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7083 -0.9203 0.7274 0.7274 1.4584
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6400 0.1026 -6.236 4.48e-10 ***
adult 1.8347 0.1300 14.118 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1497.8 on 1300 degrees of freedom
AIC: 1501.8
Number of Fisher Scoring iterations: 4
exp(coef(GLM2)) # To calculate the OR
(Intercept) adult
0.5272727 6.2632464
exp(confint(GLM2)) # To calculate the confidence intervarl of the OR
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.4302207 0.6435193
adult 4.8647343 8.0981555
5.3. mf and forest
<- glm(mf ~ forest, family=binomial, data=onch1302)
GLM3 summary(GLM3) # To see the results of the logistic regression
Call:
glm(formula = mf ~ forest, family = binomial, data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5900 -1.1992 0.8148 0.8148 1.1558
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.05111 0.08546 0.598 0.55
forest 0.88102 0.11767 7.487 7.05e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1657.0 on 1300 degrees of freedom
AIC: 1661
Number of Fisher Scoring iterations: 4
exp(coef(GLM3)) # To calculate the OR
(Intercept) forest
1.052434 2.413363
exp(confint(GLM3)) # To calculate the confidence intervarl of the OR
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.8901384 1.244620
forest 1.9176644 3.042055
Yes, the OR calculated with logistic regresion it is consistent with the OR calculated in step 4
Number (%) | OR (95% CI) | |
---|---|---|
Female | 686 (52.7) | 0.6 (0.5-0.8) |
Adult age | 883 (67.7) | 6.3 (4.8-8.2) |
Forest | 754 (57.9) | 2.4 (1.9-3.1) |
6. The only exposure that we could do something about in an intervention is ‘living in the forest’, do you think age or gender could be confounders in the association between ‘living in the forest’ and being infected with micro filaria?
Yes it could be confounders.
Clarification:
To being counfunders has to fulfill 3 conditions:
1. Has to be related to the VI, 2. TO be related to the VD, and 3. Not be in the causal pathway
(Microfilaria Infection) ----- (living forest)
\ /
Confounders
(Age or gender)
If it is permant residence, then probably age or gender cannot affect the residence
But in this case it doesn't say anything.
If adult men spend part of the year in the forest for professional reasons,
both age and gender could be associated with living in forest and with having microfilaria infect.
Also age or gender do not seem to be in the causar pathway
7. Use ‘CCInter’ from the ‘Epistats’ package to test for confounding. Is there confounding or inter-action in the association between ‘mf’ and ‘forest’ by either ‘female’ or ‘adult’?
7.1 Confounding: being female
CCInter(onch1302, "mf", "forest", "female")
$df1
CCInter mf - forest by(female) Cases Controls P.est. Stats 95%CI-ll
1 female = 1 <NA> <NA> Odds ratio 2.87 2.07
2 Exposed 265 120 Attrib.risk.exp 0.65 0.52
3 Unexposed 131 170 Attrib.risk.pop 0.44 <NA>
4 Total 396 290 <NA> <NA>
5 Exposed % 66.9% 41.4% <NA> <NA>
6 ______________ <NA> <NA> <NA> <NA>
7 female = 0 <NA> <NA> Odds ratio 1.92 1.34
8 Exposed 276 93 Attrib.risk.exp 0.48 0.25
9 Unexposed 150 97 Attrib.risk.pop 0.31 <NA>
10 Total 426 190 <NA> <NA>
11 Exposed % 64.8% 48.9% <NA> <NA>
12 ______________ <NA> <NA> <NA> <NA>
13 Number of obs 1302 <NA> <NA> <NA> <NA>
14 Missing 0 <NA> <NA> <NA> <NA>
95%CI-ul
1 3.97
2 0.75
3 <NA>
4 <NA>
5 <NA>
6 <NA>
7 2.76
8 0.64
9 <NA>
10 <NA>
11 <NA>
12 <NA>
13 <NA>
14 <NA>
$df2
P.estimate Stats 95%CI-ll 95%CI-ul
1 MH test of Homogeneity (p-value) 0.09
2 Crude OR for forest 2.41 1.90 3.06
3 MH OR forest adjusted for female 2.40 1.90 3.02
4 Adjusted/crude relative change -0.75 _ _
Steps for check for confounding
1.Check for the wolf test (homogeneity test) if pvalue higher than 0.05,
means no interaction (no effect modifier) and that could be a confounding
2. If is not interaction, then check for the change between OR crude and OR adjusted,
if that relative change is higher than 10%, it is a confounding
So, with the results:
1. The Homogeneity test had a p value of 0.09 (higher than 0.05)
means no interaction and that could be a confounding
2. Then, the OR crude is 2.41 and the OR adjusted 2.40.
Also the relative change is -0.75
So sex is not a confounding
7.2 Confounding: being adult
CCInter(onch1302, "mf", "forest", "adult")
$df1
CCInter mf - forest by(adult) Cases Controls P.est. Stats 95%CI-ll
1 adult = 1 <NA> <NA> Odds ratio 3.85 2.72
2 Exposed 434 65 Attrib.risk.exp 0.74 0.63
3 Unexposed 243 140 Attrib.risk.pop 0.47 <NA>
4 Total 677 205 <NA> <NA>
5 Exposed % 64.1% 31.7% <NA> <NA>
6 ______________ <NA> <NA> <NA> <NA>
7 adult = 0 <NA> <NA> Odds ratio 2.42 1.53
8 Exposed 107 148 Attrib.risk.exp 0.59 0.34
9 Unexposed 38 127 Attrib.risk.pop 0.43 <NA>
10 Total 145 275 <NA> <NA>
11 Exposed % 73.8% 53.8% <NA> <NA>
12 ______________ <NA> <NA> <NA> <NA>
13 Number of obs 1302 <NA> <NA> <NA> <NA>
14 Missing 0 <NA> <NA> <NA> <NA>
95%CI-ul
1 5.46
2 0.82
3 <NA>
4 <NA>
5 <NA>
6 <NA>
7 3.86
8 0.74
9 <NA>
10 <NA>
11 <NA>
12 <NA>
13 <NA>
14 <NA>
$df2
P.estimate Stats 95%CI-ll 95%CI-ul
1 MH test of Homogeneity (p-value) 0.10
2 Crude OR for forest 2.41 1.90 3.06
3 MH OR forest adjusted for adult 3.23 2.48 4.22
4 Adjusted/crude relative change 34.04 _ _
So, with the results:
1. The Homogeneity test had a p value of 0.10 (higher than 0.05)
means no interaction and that could be a confounding
2. Then, the OR crude is 2.41 and the OR adjusted 3.23
Also the relative change is 34.04 (more than 10%)
So being adult is a confounding
8. Does logistic regression confirm your findings on “adult” and “female” as potential confounders?
8.1 Logistic regresion only with: mf and forest (bivariate)
<- glm(mf ~ forest, family=binomial, data=onch1302)
GLM4 exp(coef(GLM4))
(Intercept) forest
1.052434 2.413363
8.2 Logistic regresion with: mf and forest and female (multivariate)
<- glm(mf ~ forest + female, family=binomial, data=onch1302)
GLM5 exp(coef(GLM5))
(Intercept) forest female
1.3739477 2.3967425 0.6165796
8.3 Logistic regresion with: mf and forest and adult (multivariate)
<- glm(mf ~ forest + adult, family=binomial, data=onch1302)
GLM6 exp(coef(GLM6))
(Intercept) forest adult
0.2427931 3.2653865 7.6328965
8.4 Logistic regresion with: mf and forest and female and adult (multivariate)
<- glm(mf ~ forest + female + adult, family=binomial, data=onch1302)
GLM7 exp(coef(GLM7))
(Intercept) forest female adult
0.3313785 3.2577809 0.5342773 8.0289468
The OR of forest in model GLM4 (only with: mf and forest) is
OR (crude) = 2.413363
Interpretation: The odds of having a microfilarial infection are 2.41 times as high for those
who reside in the forest compared to those in the savanna.
The OR for forest in model GLM5 (with mf and forest and female) is 2.3967425 This ise quite the same with GLM4, so it confirms that adding female do not cause changes, so female it is not a confounder of (mf ~ forest association)
Interpretation: The odds of having a microfilarial infection are 2.39 times as high for those
who reside in the forest compared to those in the savanna, if sex are held constant.
The OR for forest in model GLM6 (with mf and forest and adult) is 3.2653865 This OR is different from the GLM4, so it confirms that adding adult do cause changes, so being adult it is a confounder of (mf ~ forest association)
Interpretation: The odds of having a microfilarial infection are 3.27 times as high for those
who reside in the forest compared to those in the savanna, if age(being adult)
are held constant.
The OR for forest in model GLM7 (with mf and forest and female and adult) is 3.2577809 This is different from the GLM4, so it still confirms that one of the variables added caused changes,
Interpretation: The odds of having a microfilarial infection are 3.26 times as high for those
who reside in the forest compared to those in the savanna, if sex and age are held constant.
9. What happens if you add an additional term in the ‘CCInter’ command, e.g. if you type: ‘CCInter(Onch1302, “mf”,“forest”,“adult”,“female”)’?
CCInter(onch1302, "mf", "forest", "female", "adult")
$df1
CCInter mf - forest by(female) Cases Controls P.est. Stats 95%CI-ll
1 female = 1 <NA> <NA> Odds ratio 2.87 2.07
2 Exposed 265 120 Attrib.risk.exp 0.65 0.52
3 Unexposed 131 170 Attrib.risk.pop 0.44 <NA>
4 Total 396 290 <NA> <NA>
5 Exposed % 66.9% 41.4% <NA> <NA>
6 ______________ <NA> <NA> <NA> <NA>
7 female = 0 <NA> <NA> Odds ratio 1.92 1.34
8 Exposed 276 93 Attrib.risk.exp 0.48 0.25
9 Unexposed 150 97 Attrib.risk.pop 0.31 <NA>
10 Total 426 190 <NA> <NA>
11 Exposed % 64.8% 48.9% <NA> <NA>
12 ______________ <NA> <NA> <NA> <NA>
13 Number of obs 1302 <NA> <NA> <NA> <NA>
14 Missing 0 <NA> <NA> <NA> <NA>
95%CI-ul
1 3.97
2 0.75
3 <NA>
4 <NA>
5 <NA>
6 <NA>
7 2.76
8 0.64
9 <NA>
10 <NA>
11 <NA>
12 <NA>
13 <NA>
14 <NA>
$df2
P.estimate Stats 95%CI-ll 95%CI-ul
1 MH test of Homogeneity (p-value) 0.09
2 Crude OR for forest 2.41 1.90 3.06
3 MH OR forest adjusted for female 2.40 1.90 3.02
4 Adjusted/crude relative change -0.75 _ _
What happened? Nothing. Only worked with female, doesn’t show anything about adults This command does not work for more than 1 independent variable
10.Now try to fit a logistic regression model with ‘mf’ as outcome variable and ‘forest’, ‘adult’ and ‘female’ as predictors. What happens now to the odds ratio of ‘forest’ in comparison to the model with only ‘forest’ and ‘adult’ as predictors?
We already did this in step 8. They are asking for compare GLM7 with GLM6:
The OR for forest in model GLM6 (with mf and forest and adult) is 3.2653865
Interpretation: The odds of having a microfilarial infection are 3.27 times as high for those
who reside in the forest compared to those in the savanna, if age(being adult)
are held constant.
The OR for forest in model GLM7 (with mf and forest and female and adult) is 3.2577809
Interpretation: The odds of having a microfilarial infection are 3.26 times as high for those
who reside in the forest compared to those in the savanna, if sex and age are held constant.
There is a minimal change from 3.2577809 to 3.2653865. This is no surprise because we already knew that adding “female”, that is not a counfounder, so it was not going to cause change in the OR of forest.
Yet, the question still remains: for the final model will it be better to have sex or not? for this we have to compare both models with the likelihood ratio test. we will do it later
11. Please check that the null deviance (the -2LLR of the null model) is 1714.1 on 1301 degrees of freedom. For the model with ‘forest’ and ‘adult’ it is 1416.7 on 1299 degrees of freedom. For the model with ‘forest’, ‘adult’ and ‘female’ it is 1394.1 on 1298 degrees of freedom. So the differ-ence between the last two models is 1416.7-1394.1, which is equal to 22.6, on 1299-1298, i.e. 1 degree of freedom. Is this a significant difference, please check a chi square table or in Excel (formula CHIDIST(22.6,1)?
- Model ‘forest’ and ‘adult’ is GLM6 (Check step 8.3)
summary (GLM6)
Call:
glm(formula = mf ~ forest + adult, family = binomial, data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9765 -1.0805 0.5531 0.9290 1.8072
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4155 0.1432 -9.884 <2e-16 ***
forest 1.1834 0.1359 8.709 <2e-16 ***
adult 2.0325 0.1402 14.499 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1416.7 on 1299 degrees of freedom
AIC: 1422.7
Number of Fisher Scoring iterations: 4
For Model ‘forest’ and ‘adult’ (GLM6)
null deviance: 1714.1
Residual deviance: 1416.7 df: 1299
- Model ‘forest’, ‘adult’ and ‘female’ is GLM7 (Check step 8.4)
summary (GLM7)
Call:
glm(formula = mf ~ forest + female + adult, family = binomial,
data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1302 -0.9543 0.4673 0.7988 1.9465
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1045 0.1564 -7.060 1.66e-12 ***
forest 1.1810 0.1370 8.619 < 2e-16 ***
female -0.6268 0.1335 -4.696 2.65e-06 ***
adult 2.0831 0.1427 14.594 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1394.1 on 1298 degrees of freedom
AIC: 1402.1
Number of Fisher Scoring iterations: 4
For Model ‘forest’, ‘adult’ and ‘female’ (GLM7)
null deviance: 1714.1
Residual deviance: 1394.1 df: 1298
Note that the null deviance is the same for all the models (because the null deviance refers to a model without any factor)
So the differ-ence between the two models is
Residual deviance 1416.7-1394.1 = 22.6,
Degrees of freedom 1299-1298 = 1
Is this a significant difference, please check a chi square table or in Excel (formula CHIDIST(22.6,1)?
With the Chi square in excel we will obtain p=0.00002
So the difference between both models is highly significant.
12. So apparently the model with the three terms, ‘forest’, ‘adult’ and ‘female’ is significantly better than the model with only ‘forest’ and ‘adult’. The easier way to check this is to do a likelihood ra-tio test in R. First run the complex model with three terms, then run the simple model with two terms and next use the anova() function to make the comparison.
12.1 Fist the model with ‘forest’, ‘adult’ and ‘female’ (GLM7)
#We will run again the Model ‘forest’, ‘adult’ and ‘female’ (GLM7)
<- glm(mf ~ forest + female + adult, family=binomial, data=onch1302)
GLM7 summary (GLM7)
Call:
glm(formula = mf ~ forest + female + adult, family = binomial,
data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1302 -0.9543 0.4673 0.7988 1.9465
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1045 0.1564 -7.060 1.66e-12 ***
forest 1.1810 0.1370 8.619 < 2e-16 ***
female -0.6268 0.1335 -4.696 2.65e-06 ***
adult 2.0831 0.1427 14.594 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1394.1 on 1298 degrees of freedom
AIC: 1402.1
Number of Fisher Scoring iterations: 4
12.2 Then the model with ‘forest’ and ‘adult’ (GLM6)
#Model ‘forest’ and ‘adult’
<- glm(mf ~ forest + adult, family=binomial, data=onch1302)
GLM6 summary (GLM7)
Call:
glm(formula = mf ~ forest + female + adult, family = binomial,
data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1302 -0.9543 0.4673 0.7988 1.9465
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1045 0.1564 -7.060 1.66e-12 ***
forest 1.1810 0.1370 8.619 < 2e-16 ***
female -0.6268 0.1335 -4.696 2.65e-06 ***
adult 2.0831 0.1427 14.594 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1394.1 on 1298 degrees of freedom
AIC: 1402.1
Number of Fisher Scoring iterations: 4
12.3 Likelyhood ratio test: To check if adding female change the model significantly
anova(GLM6, GLM7, test="Chisq")
Analysis of Deviance Table
Model 1: mf ~ forest + adult
Model 2: mf ~ forest + female + adult
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1299 1416.7
2 1298 1394.1 1 22.562 2.035e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, according to the likelihood ratio test (anova) the p value is 2.035e-06 (less of 0.05) So adding sex(female) does generate a change in the model. Because of that we keep the more complex model.
The final model would be:
GLM7 <- glm(mf ~ forest + female + adult, family=binomial, data=onch1302)