2023-02-08 Sessión 3
- Instituut Voor Tropische Geneeskunde - Antwerp, Belgium
- Javier Silva-Valencia
Step by Step - Selection of variables for modeling
Import data
Importing a CSV database under the name of “onch1302”, with “,” as separator:
Starting the Exercise (MVA_LogReg_Ex1-2023):
Onchocerciasis (commonly known as River Blindness) is a chronic filarial disease found in sub-Saharan Africa and some parts of Central and South America. An onchocerciasis project was set up in 1982 in the Bo district of Sierra Leone. The aims of the project were to study epidemiological, clin-ical, immunological and entomological aspects of the disease. Prevalence surveys were undertaken in villages selected on the basis of potential high endemicity, being situated on or near rivers which are the breeding sites for the Simulium damnosum blackfly. Of the twelve villages included in the present dataset, five were situated in the south and east of the country in the forest' zone and the other seven were in the
savannah’ zone of the country. A census was taken of each village, and all villagers over the age of five years were asked to participate in the study. Coverage was over 90% in all but one of the selected villages. Diagnosis was made by taking a skin-snip, and clinical and an ocular examination were also performed. The file ONCH1302 contains data for all 1,302 subjects.
This time you are asked to make two logistic regression models, first using classical model selection, then change-in-estimate model selection for the effect of area of residence (forest or savannah). Please answer the questions below.
Part 1
1. Which are the steps you will go through for classical model selection?
We did this in previous excercises
2. Please start by making a table describing the study population. Remember that to get tables with percentages you can use the “proportions” or “prop.table” function around a table (both functions are the same)
2.1 Modifing variables to work correctly
onch1302$female <- factor(onch1302$sex, labels = c("male", "female"))
onch1302$forest <- factor(onch1302$area, labels = c("savanna", "forest"))
onch1302$agegrp <- factor(onch1302$agegrp, labels = c("(5-9)", "(10-19)", "(20-29)", "(40+)"))
onch1302$mf <- factor(onch1302$mf, labels = c("non-cases", "cases"))
onch1302$mfload <- factor(onch1302$mfload, labels = c("(none)", "(1-9)", "(10-49)", "(50+)"))
onch1302$lesions <- factor(onch1302$lesions, labels = c("no", "yes"))
no yes <NA>
1101 201 0
no yes
0.8456221 0.1543779
male female <NA>
616 686 0
male female
0.4731183 0.5268817
savanna forest <NA>
548 754 0
savanna forest
0.4208909 0.5791091
(5-9) (10-19) (20-29) (40+) <NA>
202 218 424 458 0
(5-9) (10-19) (20-29) (40+)
0.1551459 0.1674347 0.3256528 0.3517665
non-cases cases <NA>
480 822 0
non-cases cases
0.3686636 0.6313364
(none) (1-9) (10-49) (50+) <NA>
480 367 277 178 0
(none) (1-9) (10-49) (50+)
0.3686636 0.2818740 0.2127496 0.1367127
Factor | N (%) |
---|---|
Total | 1302 (100) |
Female gender | 686 (52.7) |
Age group | |
5-9 | 202 (15.5) |
10-19 | 218 (16.7) |
20-39 | 424 (32.6) |
40+ | 458 (35.2) |
Living in Forest | 754 (57.9) |
Micro filaria infected | 822 (63.1) |
Micro-filarial load | |
None | 480 (36.8) |
1-9 | 367 (28.1) |
10-49 | 277 (21.2) |
50+ | 178 (13.6) |
Eyes affected (lesions) | 201 (15.4) |
3. Next step is to explore the bivariate associations with the outcome variable, microfilarial infection. Please explore these associations using bivariate logistic regression models and construct an appropriate table. Though odds ratios with 95% confidence intervals show which coefficients are significant, please add a column with p-values from the likelihood ratio test so we can decide which factors to test in our multivariate mode
3.1 Frecuency tables
non-cases cases
0.3686636 0.6313364
#Female
#table (onch1302$female, onch1302$mf)
prop.table(table (onch1302$female, onch1302$mf), margin=2) #Porcentajes por column
non-cases cases
male 0.3958333 0.5182482
female 0.6041667 0.4817518
#Age group
#table (onch1302$agegrp, onch1302$mf)
prop.table(table (onch1302$agegrp, onch1302$mf), margin=2) #Porcentajes por fila
non-cases cases
(5-9) 0.32500000 0.05596107
(10-19) 0.24791667 0.12043796
(20-29) 0.26041667 0.36374696
(40+) 0.16666667 0.45985401
#forest
#table (onch1302$forest, onch1302$mf)
prop.table(table (onch1302$forest, onch1302$mf), margin=2) #Porcentajes por fila
non-cases cases
savanna 0.5562500 0.3418491
forest 0.4437500 0.6581509
#lesions
#table (onch1302$lesions, onch1302$mf)
prop.table(table (onch1302$lesions, onch1302$mf), margin=2) #Porcentajes por fila
non-cases cases
no 0.96041667 0.77858881
yes 0.03958333 0.22141119
3.2 For calculatin the OR and the p-value of the likelihood ratio test
Logistic model: mf ~ female
(Intercept) femalefemale
2.2421053 0.6090335
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 1.8930554 2.6654704
femalefemale 0.4842344 0.7647899
Analysis of Deviance Table
Model: binomial, link: logit
Response: mf
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 1301 1714.1
female 1 18.317 1300 1695.7 1.871e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Logistic model: mf ~ agegrp
(Intercept) agegrp(10-19) agegrp(20-29) agegrp(40+)
0.2948718 2.8213372 8.1120000 16.0239130
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.2099851 0.4059964
agegrp(10-19) 1.8566452 4.3348741
agegrp(20-29) 5.5353487 12.0770820
agegrp(40+) 10.7442422 24.3134387
Analysis of Deviance Table
Model: binomial, link: logit
Response: mf
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 1301 1714.1
agegrp 3 258.4 1298 1455.7 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Logistic model: mf ~ forest
Call:
glm(formula = mf ~ forest, family = binomial, data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5900 -1.1992 0.8148 0.8148 1.1558
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.05111 0.08546 0.598 0.55
forestforest 0.88102 0.11767 7.487 7.05e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1657.0 on 1300 degrees of freedom
AIC: 1661
Number of Fisher Scoring iterations: 4
(Intercept) forestforest
1.052434 2.413363
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.8901384 1.244620
forestforest 1.9176644 3.042055
Analysis of Deviance Table
Model: binomial, link: logit
Response: mf
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 1301 1714.1
forest 1 57.025 1300 1657.0 4.302e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Logistic model: mf ~ lesions
Call:
glm(formula = mf ~ lesions, family = binomial, data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1720 -1.3195 0.4456 1.0416 1.0416
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.32807 0.06109 5.370 7.85e-08 ***
lesionsyes 1.93150 0.24868 7.767 8.03e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1622.9 on 1300 degrees of freedom
AIC: 1626.9
Number of Fisher Scoring iterations: 4
(Intercept) lesionsyes
1.388286 6.899835
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 1.232067 1.565531
lesionsyes 4.348729 11.590217
Analysis of Deviance Table
Model: binomial, link: logit
Response: mf
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 1301 1714.1
lesions 1 91.198 1300 1622.9 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Factor | Cases (% col) | Non-cases (% col) | OR (95% CI) | p-value of LRTest |
---|---|---|---|---|
Total | 480 (36.9) | 822 (63.1) | ||
Female gender (yes) | 396 (48.1) | 290 (60.4) | 0.6090335 (0.4842344 0.7647899) | 1.871e-05 *** (<0.001) |
Female gender (no) | 426 (51.8) | 190 (39.5) | Ref | |
Age group | ||||
5-9 | 46 (5.5) | 156 (32.5) | Ref | 2.2e-16 *** (<0.001) |
10-19 | 99 (12.0) | 119 (24.7) | 2.8213372 (1.8566452 4.3348741) | |
20-39 | 299 (36.3) | 125 (26.0) | 8.1120000 (5.5353487 12.0770820) | |
40+ | 378 (45.9) | 80 (16.6) | 16.0239130 (10.7442422 24.3134387) | |
Living in Forest (yes) | 541 (65.8) | 213 (44.3) | 2.413363 (1.9176644 3.042055) | 4.302e-14 *** (<0.001) |
Living in Forest (no) | 281 (34.1) | 267 (55.6) | Ref | |
Lesions (yes) | 19 (22.1) | 182 (3.9) | 6.899835 (4.348729 11.590217) | 2.2e-16 *** (<0.001) |
Lesions (no) | 461 (77.8) | 640 (96.1) | Ref |
Example of interpretation: Among the non-cases only 3.9% had lesions
4. Which variables will you include in the multivariate model? Should eye lesions be included?
All variable had a p-value of the LRTest less than 0.10 Even though because of the nature of the research, we are looking for epidemiological exposures of microfilaria, eye lesions are not an exposure are more a symptom/sign
5. Fit the model with ‘age group’ as factor, ‘forest’ and ‘female’, which is now the weakest factor (highest p-value)? Does dropping it make the model significantly less precise? Present the final model in a table.
Call:
glm(formula = mf ~ agegrp + forest + female, family = binomial,
data = onch1302)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2212 -0.7666 0.5392 0.7081 2.1459
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6157 0.2066 -7.820 5.30e-15 ***
agegrp(10-19) 0.9429 0.2239 4.211 2.54e-05 ***
agegrp(20-29) 2.3478 0.2108 11.138 < 2e-16 ***
agegrp(40+) 2.8713 0.2171 13.225 < 2e-16 ***
forestforest 1.1227 0.1386 8.099 5.52e-16 ***
femalefemale -0.5813 0.1356 -4.286 1.82e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1714.1 on 1301 degrees of freedom
Residual deviance: 1366.1 on 1296 degrees of freedom
AIC: 1378.1
Number of Fisher Scoring iterations: 4
(Intercept) agegrp(10-19) agegrp(20-29) agegrp(40+) forestforest
0.1987542 2.5673568 10.4623748 17.6593452 3.0731378
femalefemale
0.5591696
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1312769 0.2954028
agegrp(10-19) 1.6627984 4.0044862
agegrp(20-29) 6.9776128 15.9588558
agegrp(40+) 11.6370517 27.2821110
forestforest 2.3475876 4.0437091
femalefemale 0.4279198 0.7284977
In lm5 the variable with the highest p-value is female (p value = 1.82e-05). So we will do a new model without female and then compare it with lm5 to see if it caused a significant change
Analysis of Deviance Table
Model 1: mf ~ agegrp + forest + female
Model 2: mf ~ agegrp + forest
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1296 1366.1
2 1297 1384.8 -1 -18.712 1.521e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p value of the Likelihood ratio test is 1.521e-05 (p < 0.001), So, when I eliminate female, it changes the model significant making it less precise. So we cannot eliminate female.
The model we will stick with is the more complex lm5 (mf ~ agegrp + forest + female) The table of OR of lm5 is:
Factor (n=1302) | OR (95% CI) |
---|---|
Female gender | 0.5591696 (0.4279198 0.7284977) |
Age group: 5-9 years | 2.5673568 (1.6627984 4.0044862) |
10-19 years | 2.5673568 (1.6627984 4.0044862) |
20-39 year | 10.4623748 (6.9776128 15.9588558) |
40 years and above | 17.6593452 (11.6370517 27.2821110) |
Living in forest | 3.0731378 (2.3475876 4.0437091) |
6. Which are the three potential interactions? Does any of them significantly improve the model? If so, make a table with the odds ratio’s in which you have taken into account the interaction.
For this section we need to test is the variables are ok in that way (lm5) or there are interaction. We will compare lm5 with models with interactions
6.1 Model with forest*female as interaction
lm7 <- glm(mf ~ agegrp + forest*female , family=binomial, data=onch1302)
anova(lm5, lm7, test="Chisq")
Analysis of Deviance Table
Model 1: mf ~ agegrp + forest + female
Model 2: mf ~ agegrp + forest * female
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1296 1366.1
2 1295 1361.7 1 4.4028 0.03588 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the Likelihood ratio test is 0.03588 (Lower than 0.05). So, it changes the model significant. So there is an interaction between forest*female We will keep this model
6.2 Model with agegrp*forest as interaction
lm8 <- glm(mf ~ agegrp*forest + female , family=binomial, data=onch1302)
anova(lm5, lm8, test="Chisq")
Analysis of Deviance Table
Model 1: mf ~ agegrp + forest + female
Model 2: mf ~ agegrp * forest + female
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1296 1366.1
2 1293 1360.8 3 5.2497 0.1544
The p-value of the Likelihood ratio test is 0.1544 (Higher than 0.05). So, it does not changes the model significant. So there is not an interaction between agegrp*forest
6.3 Model with agegrp*female as interaction
lm9 <- glm(mf ~ forest + agegrp*female , family=binomial, data=onch1302)
anova(lm5, lm9, test="Chisq")
Analysis of Deviance Table
Model 1: mf ~ agegrp + forest + female
Model 2: mf ~ forest + agegrp * female
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1296 1366.1
2 1293 1361.2 3 4.873 0.1813
The p-value of the Likelihood ratio test is 0.1813 (Higher than 0.05). So, it does not changes the model significant. So there is not an interaction between agegrp*female
6.3 Filling the table requested Because we now know that we have an interaction : forest*female, is better to calculate the OR by categories of the interaction. We could decide to split the data by (women-men) or by (living/ not living in the forest). The most usual is present the results by sex, so:
male female
616 686
The model in women
(Intercept) agegrp(10-19) agegrp(20-29) agegrp(40+) forestforest
0.09191677 2.13748276 10.48287813 23.11558852 4.15267371
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.05060697 0.1596338
agegrp(10-19) 1.13270252 4.1137899
agegrp(20-29) 5.97014423 19.1384989
agegrp(40+) 12.62532875 44.0769379
forestforest 2.85704938 6.1178354
The model in man
(Intercept) agegrp(10-19) agegrp(20-29) agegrp(40+) forestforest
0.2392317 3.1894477 11.5596717 13.8388609 2.1425002
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1395508 0.3959848
agegrp(10-19) 1.7572835 5.9066602
agegrp(20-29) 6.4172108 21.4996506
agegrp(40+) 7.8746659 25.0737487
forestforest 1.4425398 3.1999460
Factor (n=1302) | Women (n=686) OR (95% CI) | Men (N=616) OR (95% CI) |
---|---|---|
Age group:5-9 years | ||
10-19 year | 2.13748276 (1.13270252 4.1137899) | 3.1894477 (1.7572835 5.9066602) |
20-39 year | 10.48287813 (5.97014423 19.1384989) | 11.5596717 (6.4172108 21.4996506) |
40 years + | 23.11558852 (12.62532875 44.0769379) | 13.8388609 (7.8746659 25.0737487) |
Living in forest | 4.15267371 (2.85704938 6.1178354) | 2.1425002 (1.4425398 3.1999460) |
The effect of living in the forest is different for women from men The Odds of having the disease is 4. times as high if you are women living in the forest than women living in savana The Odds of having the disease is 2. times as high if you are men living in the forest than men living in savana
This is because female is an interaction factor (The interaction test (likelihood ratio test) was significant)
Part 2
Change-in-Estimate model (it change the primary OR exposure) 1. We select the factor that can be exposures and the main exposure: Can we change it? 2. First the crude odd ratio bivariate outcome - primary exposure
- Then adding one variable at the time and writing it down.
Factor | Co-factor | OR | Change (Original OR - New OR)/Original OR |
---|---|---|---|
Living in forest | 2.41 | ||
Living in forest | Female gender | 2.40 | = (2.41 - 2-40)/ 2.41 |
= 0.4% | |||
Living in forest | Age group | 3.08 | = (3.08 - 2-41)/ 2.41 |
= 27.8% |
So the final model for now is: mod_2 <- glm(mf ~ forest + agegrp, family=binomial, data=onch1302)
Tring adding interaction effects If we want to check for interaction, is better to have no so may categories. So we will dichotomize agegroup
Warning in Ops.factor(onch1302$agegrp, 1): '>' not meaningful for factors
We compare with the likelihood ratio test the one without interaction (mf ~ forest + agegrp) and the one with interaction (mf ~ forest * agegrp)
The p value is not significant, so we keep with te more simple (mf ~ forest * agegrp)
7.Now that you have made the final predictive model (using the classical moldel), let’s continue with an explanatory model for the effect of living in the forest versus the savannah. Which factors should we consider? We will follow these steps:
- Fit a model containing only the primary exposure variable of interest; write down the effect size, which is the crude OR.
- construct models with two variables, including the primary exposure and another exposure; identify confounders by comparing crude with adjusted ORs
- construct saturated model containing all exposure variables which resulted in a 10% change in effect size of the primary exposure (ref. step 2)
- eliminate exposure variables one after the other, but do not eliminate established confounders; retain exposure variable if removing it resulted in a 10% change in effect size of the primary exposure (ref. step 3)
- Explore interaction between primary exposure and other exposure variables; neglect other types of interaction