2023-02-07 Sessión 2
- Instituut Voor Tropische Geneeskunde - Antwerp, Belgium
- Javier Silva-Valencia
Step by Step - Selection of variables for modeling
Import data
Importing a CSV database under the name of “mwanza”, with “,” as separator:
Starting the Exercise (MVA_LogReg_Ex2-2023):
Case control study of risk factors for HIV in women, Mwanza Tanzania: As part of a prospective study of the impact of STD control on the incidence of HIV infection in Mwanza, Tanzania, a baseline survey of HIV prevalence was carried out in 12 communities. All se-ropositive women (15 years and above) were revisited and, where possible) interviewed about poten-tial risk factors for HIV infection using a standard questionnaire. In addition to interviewing HIV +ve women, a random sample of HIV -ve women were selected from the population lists prepared during the baseline survey and these women were also revisited and, where possible, interviewed. No matching of controls with cases was performed
The question which interests us most in this part of the exercise is whether and how education is associated with HIV infection
So the dependant variable is: HIV infection and main independent variable: education
1. Explore how education and age are associated with HIV infection; the list of variables is present-ed overleaf. Transform ‘case’ into a factor variable with levels TRUE and FALSE, similarly cre-ate two factor variables, ‘age1_f’ and ‘ed_f’, based on ‘age1’ and ‘ed’ respectively.
1.1 Creating the variables
#Case variable (case TRUE=1, non Case FALSE=0)
#table(mwanza$case, useNA="always")
mwanza$case <- factor(mwanza$case)
#Age1 variable
#table(mwanza$age1, useNA="always")
mwanza$age1_f <- factor(mwanza$age1, labels = c("15-19","20-24","25-29","30-34","35-44","45-54"))
#table(mwanza$age1_f, useNA="always")
#Ed1 variable
#table(mwanza$ed, useNA="always")
mwanza$ed_f <- factor(mwanza$ed)
1.2 Before doing anything, first is better to do bi-variates tables with the OR calculation for each independent variable
Bivariate table: VIH - Age group
0 1
15-19 96 13
20-24 108 57
25-29 84 39
30-34 85 33
35-44 107 30
45-54 94 17
0 1
15-19 0.8807339 0.1192661
20-24 0.6545455 0.3454545
25-29 0.6829268 0.3170732
30-34 0.7203390 0.2796610
35-44 0.7810219 0.2189781
45-54 0.8468468 0.1531532
Table of Age group (with manual calculation of OR)
Age group | Case | Control | OR (95% CI) |
---|---|---|---|
15-19 | 13 (11.9) | 96(88.1) | Ref |
20-24 | 57(34.5) | 108(65.5) | 3.9 (2.1-7.8) |
25-29 | 39(31.7) | 84(68.3) | 3.4 (1.8-7.1) |
30-34 | 33(28.0) | 85(72.0) | 2.9 (1.4-6.0) |
35-44 | 30(21.9) | 107(78.1) | 2.1 (1.0-4.3) |
45-54 | 17(15.3) | 94(84.7) | 1.3 (0.6-3.0) |
So far regarding to age, We can see that all age groups have higher odds to be infected than age group ‘15-19’ (reference) (even though not all are significant)
Bivariate table: VIH - Education
0 1
1 263 49
2 51 24
3 255 110
4 5 6
0 1
1 0.8429487 0.1570513
2 0.6800000 0.3200000
3 0.6986301 0.3013699
4 0.4545455 0.5454545
Table of education (with manual calculation of OR)
Years of education | Case | Control | OR (95% CI) |
---|---|---|---|
0 | 49 (15.7) | 263 (84.3) | Ref. |
1-3 | 24 (32.0) | 51 (68.0) | 2.5 (1.4-4.5) |
4-6 | 110 (30.1) | 255 (69.9) | 2.3 (1.6-3.4) |
7+ | 6 (54.5) | 5 (45.5) | 6.4 (1.9-23.1) |
So far regarding to education, we can see that educated women have all significantly higher odds to be infected than uneducated women (reference)
2. Now we are interested in a possible interaction between these two variables. First we should reduce the number of categories for both of them. Transform “ed” into a binary variable called “ed2”, with should take value 1 for “having been at school for at least 1 year” and the value 0 otherwise. Check whether recoding has worked. The new variable “age2” should be coded as fol-lows: 1=15 19, 2=20 29, 3=30 44, 4=45 and more. In your own interest: always check that your recoding has worked
2.1 Reducing the categories
#Education at least 1 year
mwanza$ed2_f <- ifelse(mwanza$ed>1,1,0)
mwanza$ed2_f <- factor(mwanza$ed2_f)
table(mwanza$ed, mwanza$ed2_f)
0 1
1 312 0
2 0 75
3 0 365
4 0 11
#Age reducing categories
mwanza$age2_f <- mwanza$age1
mwanza$age2_f[mwanza$age1==3] <- 2
mwanza$age2_f[mwanza$age1==4] <- 3
mwanza$age2_f[mwanza$age1==5] <- 3
mwanza$age2_f[mwanza$age1==6] <- 4
mwanza$age2_f <- factor(mwanza$age2_f)
table(mwanza$age1, mwanza$age2_f)
1 2 3 4
1 109 0 0 0
2 0 165 0 0
3 0 123 0 0
4 0 0 118 0
5 0 0 137 0
6 0 0 0 111
3. Use logistic regression to look for interaction between education and age.
3.1 First the model without interaction
Call:
glm(formula = case ~ ed2_f + age2_f, family = binomial, data = mwanza)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9623 -0.8965 -0.6131 -0.3542 2.3665
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7375 0.3498 -7.827 5.02e-15 ***
ed2_f1 0.8719 0.2082 4.188 2.81e-05 ***
age2_f2 1.3359 0.3227 4.140 3.48e-05 ***
age2_f3 1.1615 0.3375 3.441 0.000579 ***
age2_f4 0.8583 0.4211 2.038 0.041542 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 854.26 on 762 degrees of freedom
Residual deviance: 807.90 on 758 degrees of freedom
AIC: 817.9
Number of Fisher Scoring iterations: 4
(Intercept) ed2_f1 age2_f2 age2_f3 age2_f4
0.06473065 2.39150115 3.80333613 3.19461007 2.35904293
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.03137867 0.1244741
ed2_f1 1.59983748 3.6230547
age2_f2 2.08418557 7.4537561
age2_f3 1.69527973 6.4225579
age2_f4 1.03890649 5.4731892
3.2 Then the model with interaction
Call:
glm(formula = case ~ ed2_f * age2_f, family = binomial, data = mwanza)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9508 -0.9494 -0.5530 -0.4673 2.1301
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5041 0.5528 -2.721 0.00651 **
ed2_f1 -0.6554 0.6553 -1.000 0.31726
age2_f2 0.2719 0.6307 0.431 0.66636
age2_f3 -0.2964 0.6057 -0.489 0.62458
age2_f4 -0.4177 0.6333 -0.660 0.50951
ed2_f1:age2_f2 1.3245 0.7354 1.801 0.07172 .
ed2_f1:age2_f3 1.8963 0.7256 2.613 0.00897 **
ed2_f1:age2_f4 1.7018 0.8991 1.893 0.05839 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 854.26 on 762 degrees of freedom
Residual deviance: 801.46 on 755 degrees of freedom
AIC: 817.46
Number of Fisher Scoring iterations: 4
(Intercept) ed2_f1 age2_f2 age2_f3 age2_f4
0.2222222 0.5192308 1.3125000 0.7434783 0.6585366
ed2_f1:age2_f2 ed2_f1:age2_f3 ed2_f1:age2_f4
3.7601411 6.6610971 5.4835390
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.06419365 0.5954641
ed2_f1 0.15008794 2.0850593
age2_f2 0.40648790 5.0991675
age2_f3 0.24434285 2.7814891
age2_f4 0.20183699 2.5638698
ed2_f1:age2_f2 0.82155509 15.4284190
ed2_f1:age2_f3 1.47745296 26.7051999
ed2_f1:age2_f4 0.86959967 30.6574966
3.3 Likelihood interaction test to compare both models
Analysis of Deviance Table
Model 1: case ~ ed2_f + age2_f
Model 2: case ~ ed2_f * age2_f
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 758 807.90
2 755 801.46 3 6.4335 0.09232 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p value of the likelihood ratio test (the one with the anova command) is 0.09232 So adding age as interaction has not a significant improve in the model.that also means that we don’t have evidence to said that age is an interaction factor (because doesnt cause a significant change)
Likewise, because of the principle of parsimony and because adding the interaction has not a significant improve in the model, we will keep the simplest model (without interaction)
Still we have some doubt about if there is an interaction or not. Since the Age group variable had many categories, the interaction might not display that way. That is the reason for the next step
4. Combining levels of exposure often increases the power of the likelihood ratio test to show inter-action. Therefore:
a) dichotomise age in a new variable “young”, where 15-19 years is coded as ‘TRUE’ and 20 years is coded as ‘FALSE’. Compare to ‘age2’ to check that it worked well. b) test for interaction, using logistic regression c) estimate appropriate odds ratios for the association between education and HIV infection d) interpret these OR
- Reducing the categories of age
- Test for interaction, using logistic regression
b.1 First the model without interaction
Call:
glm(formula = case ~ ed2_f + young, family = binomial, data = mwanza)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9318 -0.9318 -0.6003 -0.3331 2.4163
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6224 0.1563 -10.380 < 2e-16 ***
ed2_f1 1.0129 0.1891 5.357 8.44e-08 ***
young -1.2413 0.3134 -3.961 7.46e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 854.26 on 762 degrees of freedom
Residual deviance: 810.31 on 760 degrees of freedom
AIC: 816.31
Number of Fisher Scoring iterations: 4
(Intercept) ed2_f1 young
0.1974194 2.7535251 0.2890131
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1437040 0.2655621
ed2_f1 1.9125028 4.0177352
young 0.1498001 0.5164487
b.2 Then the model with interaction
Call:
glm(formula = case ~ ed2_f * young, family = binomial, data = mwanza)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9446 -0.9446 -0.5807 -0.4673 2.1301
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6946 0.1622 -10.449 < 2e-16 ***
ed2_f1 1.1188 0.1955 5.722 1.05e-08 ***
young 0.1905 0.5761 0.331 0.74086
ed2_f1:young -1.7742 0.6839 -2.594 0.00948 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 854.26 on 762 degrees of freedom
Residual deviance: 804.69 on 759 degrees of freedom
AIC: 812.69
Number of Fisher Scoring iterations: 4
(Intercept) ed2_f1 young ed2_f1:young
0.1836735 3.0610396 1.2098765 0.1696256
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.13200733 0.2496751
ed2_f1 2.10056832 4.5268821
young 0.33774768 3.4251394
ed2_f1:young 0.04620559 0.7138024
b.3 With Likelihood interaction test we compare both models
Analysis of Deviance Table
Model 1: case ~ ed2_f + young
Model 2: case ~ ed2_f * young
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 760 810.31
2 759 804.69 1 5.6194 0.01776 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p value of the likelihood ratio test (the one with the anova command) is 0.01776 So adding age (young) as interaction has a significant improve of the model.
But before we had not said that age group was not an interaction?
This is because before age group had 4 categories, not has only 2
Testing for interaction is more sensitive if we dichotomize, i.e. use two groups only
Likewise, because the model with interaction (GLM4) had a significant improve, this is the model
we are going to take.
- estimate appropriate odds ratios for the association between education and HIV infection
- interpret these OR
So the equation for model GLM4 (the one with interaction) is:
(Intercept) ed2_f1 young ed2_f1:young
Odds(HIV infection) = 0.1836735 * 3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young)
When we are talking about the OR (odds ratio) we only work with the coefficients (no te intercept)
Category | OR |
---|---|
Young | |
OR for not educated young | =3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young) |
=3.0610396^0 * 1.2098765^1 * 0.1696256^(0*1) | |
= 1 * 1.2098765 * 1 | |
= 1.2098 | |
OR for educated young | =3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young) |
=3.0610396^1 * 1.2098765^1 * 0.1696256^(1*1) | |
=3.0610396 * 1.2098765 * 0.1696256 | |
=0.628205 | |
Older | |
OR for not educated older | =3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young) |
=3.0610396^0 * 1.2098765^0 * 0.1696256^(0*0) | |
= 1 * 1 * 1 | |
= 1 (Is one because is the reference) | |
OR for educated older | =3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young) |
=3.0610396^1 * 1.2098765^0 * 0.1696256^(1*0) | |
=3.0610396 * 1 * 1 | |
=3.0610396 |
Interpretation:
1.Among old people, the odds of having VIH are 3.06 times as high for those
who have education compared to those without instruction. (=3.06/1)
So education is a risk factor
2.Among young people, the odds of having VIH are 0.519 times as high for those
who have education compared to those without instruction. (=0.628205/1.2098)
So education is a protective factor
- Extra: Another way (maybe better) to doing this last part is to create subsets and doing the regression separate:
e.1 Regressión in the database of youngs:
Call:
glm(formula = case ~ ed2_f, family = binomial, data = young)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6335 -0.4673 -0.4673 -0.4673 2.1301
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5041 0.5528 -2.721 0.00651 **
ed2_f1 -0.6554 0.6553 -1.000 0.31726
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 79.670 on 108 degrees of freedom
Residual deviance: 78.734 on 107 degrees of freedom
AIC: 82.734
Number of Fisher Scoring iterations: 4
(Intercept) ed2_f1
0.2222222 0.5192308
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.06419365 0.5954641
ed2_f1 0.15008794 2.0850593
e.2 Regressión in the database of olders:
Call:
glm(formula = case ~ ed2_f, family = binomial, data = old)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9446 -0.9446 -0.5807 1.4297 1.9304
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6946 0.1622 -10.449 < 2e-16 ***
ed2_f1 1.1188 0.1955 5.722 1.05e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 761.75 on 653 degrees of freedom
Residual deviance: 725.96 on 652 degrees of freedom
AIC: 729.96
Number of Fisher Scoring iterations: 4
(Intercept) ed2_f1
0.1836735 3.0610396
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1320073 0.2496751
ed2_f1 2.1005683 4.5268821
In GLM5 (regresión in database of youngs), the OR for having VIH among young is:
OR: 0.5192308
IC 95% (0.15008794 - 2.0850593)
So, when we search for the association between (VIH and education) among youngs,
we can see that education is a protective factor OR= 0.5192,
but it is not significant (IC takes 1)
In GLM6 (regresión in database of older), the OR for having VIH among older is:
OR: 3.0610396
IC 95% (2.1005683 4.5268821)
So, when we search for the association between (VIH and education) among older,
we can see that education is a risk factor OR= 3.06 and it is significant (IC doesnt takes 1)