2023-02-07 Sessión 2

Step by Step - Selection of variables for modeling

Import data

Importing a CSV database under the name of “mwanza”, with “,” as separator:

mwanza <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Logistic Regresion Exercise Database/Datasets/mwanza.csv", sep=",")


Starting the Exercise (MVA_LogReg_Ex2-2023):

Case control study of risk factors for HIV in women, Mwanza Tanzania: As part of a prospective study of the impact of STD control on the incidence of HIV infection in Mwanza, Tanzania, a baseline survey of HIV prevalence was carried out in 12 communities. All se-ropositive women (15 years and above) were revisited and, where possible) interviewed about poten-tial risk factors for HIV infection using a standard questionnaire. In addition to interviewing HIV +ve women, a random sample of HIV -ve women were selected from the population lists prepared during the baseline survey and these women were also revisited and, where possible, interviewed. No matching of controls with cases was performed

The question which interests us most in this part of the exercise is whether and how education is associated with HIV infection

So the dependant variable is: HIV infection and main independent variable: education

1. Explore how education and age are associated with HIV infection; the list of variables is present-ed overleaf. Transform ‘case’ into a factor variable with levels TRUE and FALSE, similarly cre-ate two factor variables, ‘age1_f’ and ‘ed_f’, based on ‘age1’ and ‘ed’ respectively.

1.1 Creating the variables

#Case variable (case TRUE=1, non Case FALSE=0)
  #table(mwanza$case, useNA="always")
  mwanza$case <- factor(mwanza$case)

#Age1 variable 
  #table(mwanza$age1, useNA="always")
  mwanza$age1_f <- factor(mwanza$age1, labels = c("15-19","20-24","25-29","30-34","35-44","45-54"))
  #table(mwanza$age1_f, useNA="always")
#Ed1 variable 
  #table(mwanza$ed, useNA="always")
  mwanza$ed_f <- factor(mwanza$ed)

1.2 Before doing anything, first is better to do bi-variates tables with the OR calculation for each independent variable

Bivariate table: VIH - Age group

table (mwanza$age1_f, mwanza$case)
       
          0   1
  15-19  96  13
  20-24 108  57
  25-29  84  39
  30-34  85  33
  35-44 107  30
  45-54  94  17
prop.table(table (mwanza$age1_f, mwanza$case), margin=1)  #Porcentajes por fila
       
                0         1
  15-19 0.8807339 0.1192661
  20-24 0.6545455 0.3454545
  25-29 0.6829268 0.3170732
  30-34 0.7203390 0.2796610
  35-44 0.7810219 0.2189781
  45-54 0.8468468 0.1531532
Note

Table of Age group (with manual calculation of OR)

Age group Case Control OR (95% CI)
15-19 13 (11.9) 96(88.1) Ref
20-24 57(34.5) 108(65.5) 3.9 (2.1-7.8)
25-29 39(31.7) 84(68.3) 3.4 (1.8-7.1)
30-34 33(28.0) 85(72.0) 2.9 (1.4-6.0)
35-44 30(21.9) 107(78.1) 2.1 (1.0-4.3)
45-54 17(15.3) 94(84.7) 1.3 (0.6-3.0)
Note

So far regarding to age, We can see that all age groups have higher odds to be infected than age group ‘15-19’ (reference) (even though not all are significant)

Bivariate table: VIH - Education

table (mwanza$ed_f, mwanza$case)
   
      0   1
  1 263  49
  2  51  24
  3 255 110
  4   5   6
prop.table(table (mwanza$ed_f, mwanza$case), margin=1)  #Porcentajes por fila
   
            0         1
  1 0.8429487 0.1570513
  2 0.6800000 0.3200000
  3 0.6986301 0.3013699
  4 0.4545455 0.5454545
Note

Table of education (with manual calculation of OR)

Years of education Case Control OR (95% CI)
0 49 (15.7) 263 (84.3) Ref.
1-3 24 (32.0) 51 (68.0) 2.5 (1.4-4.5)
4-6 110 (30.1) 255 (69.9) 2.3 (1.6-3.4)
7+ 6 (54.5) 5 (45.5) 6.4 (1.9-23.1)
Note

So far regarding to education, we can see that educated women have all significantly higher odds to be infected than uneducated women (reference)

2. Now we are interested in a possible interaction between these two variables. First we should reduce the number of categories for both of them. Transform “ed” into a binary variable called “ed2”, with should take value 1 for “having been at school for at least 1 year” and the value 0 otherwise. Check whether recoding has worked. The new variable “age2” should be coded as fol-lows: 1=15 19, 2=20 29, 3=30 44, 4=45 and more. In your own interest: always check that your recoding has worked

2.1 Reducing the categories

#Education at least 1 year
  mwanza$ed2_f <- ifelse(mwanza$ed>1,1,0)
  mwanza$ed2_f <- factor(mwanza$ed2_f)
  table(mwanza$ed, mwanza$ed2_f)
   
      0   1
  1 312   0
  2   0  75
  3   0 365
  4   0  11
#Age reducing categories
  mwanza$age2_f <- mwanza$age1
  mwanza$age2_f[mwanza$age1==3] <- 2
  mwanza$age2_f[mwanza$age1==4] <- 3
  mwanza$age2_f[mwanza$age1==5] <- 3
  mwanza$age2_f[mwanza$age1==6] <- 4
  mwanza$age2_f <- factor(mwanza$age2_f)
  table(mwanza$age1, mwanza$age2_f)
   
      1   2   3   4
  1 109   0   0   0
  2   0 165   0   0
  3   0 123   0   0
  4   0   0 118   0
  5   0   0 137   0
  6   0   0   0 111
3. Use logistic regression to look for interaction between education and age.

3.1 First the model without interaction

GLM1 <- glm(case ~ ed2_f + age2_f, family=binomial, data=mwanza)
summary(GLM1)

Call:
glm(formula = case ~ ed2_f + age2_f, family = binomial, data = mwanza)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9623  -0.8965  -0.6131  -0.3542   2.3665  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.7375     0.3498  -7.827 5.02e-15 ***
ed2_f1        0.8719     0.2082   4.188 2.81e-05 ***
age2_f2       1.3359     0.3227   4.140 3.48e-05 ***
age2_f3       1.1615     0.3375   3.441 0.000579 ***
age2_f4       0.8583     0.4211   2.038 0.041542 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 854.26  on 762  degrees of freedom
Residual deviance: 807.90  on 758  degrees of freedom
AIC: 817.9

Number of Fisher Scoring iterations: 4
exp(coef(GLM1))
(Intercept)      ed2_f1     age2_f2     age2_f3     age2_f4 
 0.06473065  2.39150115  3.80333613  3.19461007  2.35904293 
exp(confint(GLM1))
Waiting for profiling to be done...
                 2.5 %    97.5 %
(Intercept) 0.03137867 0.1244741
ed2_f1      1.59983748 3.6230547
age2_f2     2.08418557 7.4537561
age2_f3     1.69527973 6.4225579
age2_f4     1.03890649 5.4731892

3.2 Then the model with interaction

GLM2 <- glm(case ~ ed2_f*age2_f, family=binomial, data=mwanza)
summary(GLM2)

Call:
glm(formula = case ~ ed2_f * age2_f, family = binomial, data = mwanza)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9508  -0.9494  -0.5530  -0.4673   2.1301  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)   
(Intercept)     -1.5041     0.5528  -2.721  0.00651 **
ed2_f1          -0.6554     0.6553  -1.000  0.31726   
age2_f2          0.2719     0.6307   0.431  0.66636   
age2_f3         -0.2964     0.6057  -0.489  0.62458   
age2_f4         -0.4177     0.6333  -0.660  0.50951   
ed2_f1:age2_f2   1.3245     0.7354   1.801  0.07172 . 
ed2_f1:age2_f3   1.8963     0.7256   2.613  0.00897 **
ed2_f1:age2_f4   1.7018     0.8991   1.893  0.05839 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 854.26  on 762  degrees of freedom
Residual deviance: 801.46  on 755  degrees of freedom
AIC: 817.46

Number of Fisher Scoring iterations: 4
exp(coef(GLM2))
   (Intercept)         ed2_f1        age2_f2        age2_f3        age2_f4 
     0.2222222      0.5192308      1.3125000      0.7434783      0.6585366 
ed2_f1:age2_f2 ed2_f1:age2_f3 ed2_f1:age2_f4 
     3.7601411      6.6610971      5.4835390 
exp(confint(GLM2))
Waiting for profiling to be done...
                    2.5 %     97.5 %
(Intercept)    0.06419365  0.5954641
ed2_f1         0.15008794  2.0850593
age2_f2        0.40648790  5.0991675
age2_f3        0.24434285  2.7814891
age2_f4        0.20183699  2.5638698
ed2_f1:age2_f2 0.82155509 15.4284190
ed2_f1:age2_f3 1.47745296 26.7051999
ed2_f1:age2_f4 0.86959967 30.6574966

3.3 Likelihood interaction test to compare both models

anova(GLM1, GLM2, test="Chisq")
Analysis of Deviance Table

Model 1: case ~ ed2_f + age2_f
Model 2: case ~ ed2_f * age2_f
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       758     807.90                       
2       755     801.46  3   6.4335  0.09232 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note

The p value of the likelihood ratio test (the one with the anova command) is 0.09232 So adding age as interaction has not a significant improve in the model.that also means that we don’t have evidence to said that age is an interaction factor (because doesnt cause a significant change)

Likewise, because of the principle of parsimony and because adding the interaction has not a significant improve in the model, we will keep the simplest model (without interaction)

Note

Still we have some doubt about if there is an interaction or not. Since the Age group variable had many categories, the interaction might not display that way. That is the reason for the next step

4. Combining levels of exposure often increases the power of the likelihood ratio test to show inter-action. Therefore:

a) dichotomise age in a new variable “young”, where 15-19 years is coded as ‘TRUE’ and 20 years is coded as ‘FALSE’. Compare to ‘age2’ to check that it worked well. b) test for interaction, using logistic regression c) estimate appropriate odds ratios for the association between education and HIV infection d) interpret these OR

  1. Reducing the categories of age
#Reducing Age to Young
  mwanza$young <- ifelse(mwanza$age1==1,1,0)
  #table(mwanza$young, mwanza$age1)
  1. Test for interaction, using logistic regression

b.1 First the model without interaction

GLM3 <- glm(case ~ ed2_f + young, family=binomial, data=mwanza)
summary(GLM3)

Call:
glm(formula = case ~ ed2_f + young, family = binomial, data = mwanza)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9318  -0.9318  -0.6003  -0.3331   2.4163  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.6224     0.1563 -10.380  < 2e-16 ***
ed2_f1        1.0129     0.1891   5.357 8.44e-08 ***
young        -1.2413     0.3134  -3.961 7.46e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 854.26  on 762  degrees of freedom
Residual deviance: 810.31  on 760  degrees of freedom
AIC: 816.31

Number of Fisher Scoring iterations: 4
exp(coef(GLM3))
(Intercept)      ed2_f1       young 
  0.1974194   2.7535251   0.2890131 
exp(confint(GLM3))
Waiting for profiling to be done...
                2.5 %    97.5 %
(Intercept) 0.1437040 0.2655621
ed2_f1      1.9125028 4.0177352
young       0.1498001 0.5164487

b.2 Then the model with interaction

GLM4 <- glm(case ~ ed2_f*young, family=binomial, data=mwanza)
summary(GLM4)

Call:
glm(formula = case ~ ed2_f * young, family = binomial, data = mwanza)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9446  -0.9446  -0.5807  -0.4673   2.1301  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.6946     0.1622 -10.449  < 2e-16 ***
ed2_f1         1.1188     0.1955   5.722 1.05e-08 ***
young          0.1905     0.5761   0.331  0.74086    
ed2_f1:young  -1.7742     0.6839  -2.594  0.00948 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 854.26  on 762  degrees of freedom
Residual deviance: 804.69  on 759  degrees of freedom
AIC: 812.69

Number of Fisher Scoring iterations: 4
exp(coef(GLM4))
 (Intercept)       ed2_f1        young ed2_f1:young 
   0.1836735    3.0610396    1.2098765    0.1696256 
exp(confint(GLM4))
Waiting for profiling to be done...
                  2.5 %    97.5 %
(Intercept)  0.13200733 0.2496751
ed2_f1       2.10056832 4.5268821
young        0.33774768 3.4251394
ed2_f1:young 0.04620559 0.7138024

b.3 With Likelihood interaction test we compare both models

anova(GLM3, GLM4, test="Chisq")
Analysis of Deviance Table

Model 1: case ~ ed2_f + young
Model 2: case ~ ed2_f * young
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       760     810.31                       
2       759     804.69  1   5.6194  0.01776 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note

The p value of the likelihood ratio test (the one with the anova command) is 0.01776 So adding age (young) as interaction has a significant improve of the model.

  But before we had not said that age group was not an interaction?
      This is because before age group had 4 categories, not has only 2
      Testing for interaction is more sensitive if we dichotomize, i.e. use two groups only
  
  Likewise, because the model with interaction (GLM4) had a significant improve, this is the model 
  we are going to take.
  1. estimate appropriate odds ratios for the association between education and HIV infection
  2. interpret these OR
Note

So the equation for model GLM4 (the one with interaction) is:

                      (Intercept)     ed2_f1                young           ed2_f1:young
Odds(HIV infection) = 0.1836735 * 3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young)

When we are talking about the OR (odds ratio) we only work with the coefficients (no te intercept)

Category OR
Young
OR for not educated young =3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young)
=3.0610396^0 * 1.2098765^1 * 0.1696256^(0*1)
= 1 * 1.2098765 * 1
= 1.2098
OR for educated young =3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young)
=3.0610396^1 * 1.2098765^1 * 0.1696256^(1*1)
=3.0610396 * 1.2098765 * 0.1696256
=0.628205
Older
OR for not educated older =3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young)
=3.0610396^0 * 1.2098765^0 * 0.1696256^(0*0)
= 1 * 1 * 1
= 1 (Is one because is the reference)
OR for educated older =3.0610396^Educat * 1.2098765^young * 0.1696256^(Educat*young)
=3.0610396^1 * 1.2098765^0 * 0.1696256^(1*0)
=3.0610396 * 1 * 1
=3.0610396
Interpretation: 

    1.Among old people, the odds of having VIH are 3.06 times as high for those 
    who have education compared to those without instruction. (=3.06/1) 
    So education is a risk factor
    
    2.Among young people, the odds of having VIH are 0.519 times as high for those 
    who have education compared to those without instruction. (=0.628205/1.2098)
    So education is a protective factor
  1. Extra: Another way (maybe better) to doing this last part is to create subsets and doing the regression separate:
young  <-  subset(mwanza, young==1)  #Creating a database for only young
old  <-  subset(mwanza, young==0)    #Creating a database for only older

e.1 Regressión in the database of youngs:

GLM5 <- glm(case ~ ed2_f, family=binomial, data=young)
summary(GLM5)

Call:
glm(formula = case ~ ed2_f, family = binomial, data = young)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6335  -0.4673  -0.4673  -0.4673   2.1301  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -1.5041     0.5528  -2.721  0.00651 **
ed2_f1       -0.6554     0.6553  -1.000  0.31726   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 79.670  on 108  degrees of freedom
Residual deviance: 78.734  on 107  degrees of freedom
AIC: 82.734

Number of Fisher Scoring iterations: 4
exp(coef(GLM5))
(Intercept)      ed2_f1 
  0.2222222   0.5192308 
exp(confint(GLM5))
Waiting for profiling to be done...
                 2.5 %    97.5 %
(Intercept) 0.06419365 0.5954641
ed2_f1      0.15008794 2.0850593

e.2 Regressión in the database of olders:

GLM6 <- glm(case ~ ed2_f, family=binomial, data=old)
summary(GLM6)

Call:
glm(formula = case ~ ed2_f, family = binomial, data = old)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9446  -0.9446  -0.5807   1.4297   1.9304  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.6946     0.1622 -10.449  < 2e-16 ***
ed2_f1        1.1188     0.1955   5.722 1.05e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 761.75  on 653  degrees of freedom
Residual deviance: 725.96  on 652  degrees of freedom
AIC: 729.96

Number of Fisher Scoring iterations: 4
exp(coef(GLM6))
(Intercept)      ed2_f1 
  0.1836735   3.0610396 
exp(confint(GLM6))
Waiting for profiling to be done...
                2.5 %    97.5 %
(Intercept) 0.1320073 0.2496751
ed2_f1      2.1005683 4.5268821
Note

In GLM5 (regresión in database of youngs), the OR for having VIH among young is:

    OR: 0.5192308 
    IC 95% (0.15008794 - 2.0850593)
    
    So, when we search for the association between (VIH and education) among youngs, 
    we can see that education is a protective factor OR= 0.5192, 
    but it is not significant (IC takes 1)
    

In GLM6 (regresión in database of older), the OR for having VIH among older is:

    OR: 3.0610396  
    IC 95% (2.1005683 4.5268821)
    
    So, when we search for the association between (VIH and education) among older, 
    we can see that education is a risk factor OR= 3.06 and it is significant (IC doesnt takes 1)