2023-02-08 Sessión 3

Step by Step - Selection of variables for modeling

Import data

Importing a CSV database under the name of “onch1302”, with “,” as separator:

onch1302 <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Logistic Regresion Exercise Database/Datasets/onch1302.csv", sep=",")


Starting the Exercise (MVA_LogReg_Ex1-2023):

Onchocerciasis (commonly known as River Blindness) is a chronic filarial disease found in sub-Saharan Africa and some parts of Central and South America. An onchocerciasis project was set up in 1982 in the Bo district of Sierra Leone. The aims of the project were to study epidemiological, clin-ical, immunological and entomological aspects of the disease. Prevalence surveys were undertaken in villages selected on the basis of potential high endemicity, being situated on or near rivers which are the breeding sites for the Simulium damnosum blackfly. Of the twelve villages included in the present dataset, five were situated in the south and east of the country in the forest' zone and the other seven were in thesavannah’ zone of the country. A census was taken of each village, and all villagers over the age of five years were asked to participate in the study. Coverage was over 90% in all but one of the selected villages. Diagnosis was made by taking a skin-snip, and clinical and an ocular examination were also performed. The file ONCH1302 contains data for all 1,302 subjects.

This time you are asked to make two logistic regression models, first using classical model selection, then change-in-estimate model selection for the effect of area of residence (forest or savannah). Please answer the questions below.

Part 1

1. Which are the steps you will go through for classical model selection?
2. Please start by making a table describing the study population. Remember that to get tables with percentages you can use the “proportions” or “prop.table” function around a table (both functions are the same)

2.1 Modifing variables to work correctly

onch1302$female <- factor(onch1302$sex, labels = c("male", "female"))
onch1302$forest <- factor(onch1302$area, labels = c("savanna", "forest"))

onch1302$agegrp <- factor(onch1302$agegrp, labels = c("(5-9)", "(10-19)", "(20-29)", "(40+)"))
onch1302$mf <- factor(onch1302$mf, labels = c("non-cases", "cases"))
onch1302$mfload <- factor(onch1302$mfload, labels = c("(none)", "(1-9)", "(10-49)", "(50+)"))
onch1302$lesions <- factor(onch1302$lesions, labels = c("no", "yes"))
#Lesions
table(onch1302$lesions, useNA = "always")     #frecuency

  no  yes <NA> 
1101  201    0 
prop.table(table(onch1302$lesions))           #proportion

       no       yes 
0.8456221 0.1543779 
#female
table(onch1302$female, useNA = "always")     #frecuency

  male female   <NA> 
   616    686      0 
prop.table(table(onch1302$female))           #proportion

     male    female 
0.4731183 0.5268817 
#Forest
table(onch1302$forest, useNA = "always")     #frecuency

savanna  forest    <NA> 
    548     754       0 
prop.table(table(onch1302$forest))           #proportion

  savanna    forest 
0.4208909 0.5791091 
#Agegr
table(onch1302$agegrp, useNA = "always")     #frecuency

  (5-9) (10-19) (20-29)   (40+)    <NA> 
    202     218     424     458       0 
prop.table(table(onch1302$agegrp))           #proportion

    (5-9)   (10-19)   (20-29)     (40+) 
0.1551459 0.1674347 0.3256528 0.3517665 
#Microfilaria
table(onch1302$mf, useNA = "always")     #frecuency

non-cases     cases      <NA> 
      480       822         0 
prop.table(table(onch1302$mf))           #proportion

non-cases     cases 
0.3686636 0.6313364 
#Microfilaria Load
table(onch1302$mfload, useNA = "always")     #frecuency

 (none)   (1-9) (10-49)   (50+)    <NA> 
    480     367     277     178       0 
prop.table(table(onch1302$mfload))           #proportion

   (none)     (1-9)   (10-49)     (50+) 
0.3686636 0.2818740 0.2127496 0.1367127 
Note
Factor N (%)
Total 1302 (100)
Female gender 686 (52.7)
Age group
5-9 202 (15.5)
10-19 218 (16.7)
20-39 424 (32.6)
40+ 458 (35.2)
Living in Forest 754 (57.9)
Micro filaria infected 822 (63.1)
Micro-filarial load
None 480 (36.8)
1-9 367 (28.1)
10-49 277 (21.2)
50+ 178 (13.6)
Eyes affected (lesions) 201 (15.4)
3. Next step is to explore the bivariate associations with the outcome variable, microfilarial infection. Please explore these associations using bivariate logistic regression models and construct an appropriate table. Though odds ratios with 95% confidence intervals show which coefficients are significant, please add a column with p-values from the likelihood ratio test so we can decide which factors to test in our multivariate mode
library(EpiStats)

3.1 Frecuency tables

#Total
#table (onch1302$mf)
prop.table(table (onch1302$mf))  #Porcentajes totales

non-cases     cases 
0.3686636 0.6313364 
#Female
#table (onch1302$female, onch1302$mf)
prop.table(table (onch1302$female, onch1302$mf), margin=2)  #Porcentajes por column
        
         non-cases     cases
  male   0.3958333 0.5182482
  female 0.6041667 0.4817518
#Age group
#table (onch1302$agegrp, onch1302$mf)
prop.table(table (onch1302$agegrp, onch1302$mf), margin=2)  #Porcentajes por fila
         
           non-cases      cases
  (5-9)   0.32500000 0.05596107
  (10-19) 0.24791667 0.12043796
  (20-29) 0.26041667 0.36374696
  (40+)   0.16666667 0.45985401
#forest
#table (onch1302$forest, onch1302$mf)
prop.table(table (onch1302$forest, onch1302$mf), margin=2)  #Porcentajes por fila
         
          non-cases     cases
  savanna 0.5562500 0.3418491
  forest  0.4437500 0.6581509
#lesions
#table (onch1302$lesions, onch1302$mf)
prop.table(table (onch1302$lesions, onch1302$mf), margin=2)  #Porcentajes por fila
     
       non-cases      cases
  no  0.96041667 0.77858881
  yes 0.03958333 0.22141119

3.2 For calculatin the OR and the p-value of the likelihood ratio test

Logistic model: mf ~ female

lm1 <- glm(mf ~ female, family=binomial, data=onch1302)
#summary(lm1)
exp(coef(lm1))
 (Intercept) femalefemale 
   2.2421053    0.6090335 
exp(confint(lm1))
Waiting for profiling to be done...
                 2.5 %    97.5 %
(Intercept)  1.8930554 2.6654704
femalefemale 0.4842344 0.7647899
anova(lm1, test="Chisq")  #To check for the p value of the likelihood ratio test
Analysis of Deviance Table

Model: binomial, link: logit

Response: mf

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    1301     1714.1              
female  1   18.317      1300     1695.7 1.871e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic model: mf ~ agegrp

lm2 <- glm(mf ~ agegrp, family=binomial, data=onch1302)
#summary(lm2)
exp(coef(lm2))
  (Intercept) agegrp(10-19) agegrp(20-29)   agegrp(40+) 
    0.2948718     2.8213372     8.1120000    16.0239130 
exp(confint(lm2))
Waiting for profiling to be done...
                   2.5 %     97.5 %
(Intercept)    0.2099851  0.4059964
agegrp(10-19)  1.8566452  4.3348741
agegrp(20-29)  5.5353487 12.0770820
agegrp(40+)   10.7442422 24.3134387
anova(lm2, test="Chisq")  #To check for the p value of the likelihood ratio test
Analysis of Deviance Table

Model: binomial, link: logit

Response: mf

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    1301     1714.1              
agegrp  3    258.4      1298     1455.7 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic model: mf ~ forest

lm3 <- glm(mf ~ forest, family=binomial, data=onch1302)
summary(lm3)

Call:
glm(formula = mf ~ forest, family = binomial, data = onch1302)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5900  -1.1992   0.8148   0.8148   1.1558  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.05111    0.08546   0.598     0.55    
forestforest  0.88102    0.11767   7.487 7.05e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1714.1  on 1301  degrees of freedom
Residual deviance: 1657.0  on 1300  degrees of freedom
AIC: 1661

Number of Fisher Scoring iterations: 4
exp(coef(lm3))
 (Intercept) forestforest 
    1.052434     2.413363 
exp(confint(lm3))
Waiting for profiling to be done...
                 2.5 %   97.5 %
(Intercept)  0.8901384 1.244620
forestforest 1.9176644 3.042055
anova(lm3, test="Chisq")  #To check for the p value of the likelihood ratio test
Analysis of Deviance Table

Model: binomial, link: logit

Response: mf

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    1301     1714.1              
forest  1   57.025      1300     1657.0 4.302e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic model: mf ~ lesions

lm4 <- glm(mf ~ lesions, family=binomial, data=onch1302)
summary(lm4)

Call:
glm(formula = mf ~ lesions, family = binomial, data = onch1302)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1720  -1.3195   0.4456   1.0416   1.0416  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.32807    0.06109   5.370 7.85e-08 ***
lesionsyes   1.93150    0.24868   7.767 8.03e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1714.1  on 1301  degrees of freedom
Residual deviance: 1622.9  on 1300  degrees of freedom
AIC: 1626.9

Number of Fisher Scoring iterations: 4
exp(coef(lm4))
(Intercept)  lesionsyes 
   1.388286    6.899835 
exp(confint(lm4))
Waiting for profiling to be done...
               2.5 %    97.5 %
(Intercept) 1.232067  1.565531
lesionsyes  4.348729 11.590217
anova(lm4, test="Chisq")  #To check for the p value of the likelihood ratio test
Analysis of Deviance Table

Model: binomial, link: logit

Response: mf

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                     1301     1714.1              
lesions  1   91.198      1300     1622.9 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Factor Cases (% col) Non-cases (% col) OR (95% CI) p-value of LRTest
Total 480 (36.9) 822 (63.1)
Female gender (yes) 396 (48.1) 290 (60.4) 0.6090335 (0.4842344 0.7647899) 1.871e-05 *** (<0.001)
Female gender (no) 426 (51.8) 190 (39.5) Ref
Age group
5-9 46 (5.5) 156 (32.5) Ref 2.2e-16 *** (<0.001)
10-19 99 (12.0) 119 (24.7) 2.8213372 (1.8566452 4.3348741)
20-39 299 (36.3) 125 (26.0) 8.1120000 (5.5353487 12.0770820)
40+ 378 (45.9) 80 (16.6) 16.0239130 (10.7442422 24.3134387)
Living in Forest (yes) 541 (65.8) 213 (44.3) 2.413363 (1.9176644 3.042055) 4.302e-14 *** (<0.001)
Living in Forest (no) 281 (34.1) 267 (55.6) Ref
Lesions (yes) 19 (22.1) 182 (3.9) 6.899835 (4.348729 11.590217) 2.2e-16 *** (<0.001)
Lesions (no) 461 (77.8) 640 (96.1) Ref
Note

Example of interpretation: Among the non-cases only 3.9% had lesions

4. Which variables will you include in the multivariate model? Should eye lesions be included?
Note

All variable had a p-value of the LRTest less than 0.10 Even though because of the nature of the research, we are looking for epidemiological exposures of microfilaria, eye lesions are not an exposure are more a symptom/sign

5. Fit the model with ‘age group’ as factor, ‘forest’ and ‘female’, which is now the weakest factor (highest p-value)? Does dropping it make the model significantly less precise? Present the final model in a table.
lm5 <- glm(mf ~ agegrp + forest + female, family=binomial, data=onch1302)
summary(lm5)

Call:
glm(formula = mf ~ agegrp + forest + female, family = binomial, 
    data = onch1302)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2212  -0.7666   0.5392   0.7081   2.1459  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -1.6157     0.2066  -7.820 5.30e-15 ***
agegrp(10-19)   0.9429     0.2239   4.211 2.54e-05 ***
agegrp(20-29)   2.3478     0.2108  11.138  < 2e-16 ***
agegrp(40+)     2.8713     0.2171  13.225  < 2e-16 ***
forestforest    1.1227     0.1386   8.099 5.52e-16 ***
femalefemale   -0.5813     0.1356  -4.286 1.82e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1714.1  on 1301  degrees of freedom
Residual deviance: 1366.1  on 1296  degrees of freedom
AIC: 1378.1

Number of Fisher Scoring iterations: 4
exp(coef(lm5))
  (Intercept) agegrp(10-19) agegrp(20-29)   agegrp(40+)  forestforest 
    0.1987542     2.5673568    10.4623748    17.6593452     3.0731378 
 femalefemale 
    0.5591696 
exp(confint(lm5))
Waiting for profiling to be done...
                   2.5 %     97.5 %
(Intercept)    0.1312769  0.2954028
agegrp(10-19)  1.6627984  4.0044862
agegrp(20-29)  6.9776128 15.9588558
agegrp(40+)   11.6370517 27.2821110
forestforest   2.3475876  4.0437091
femalefemale   0.4279198  0.7284977
Note

In lm5 the variable with the highest p-value is female (p value = 1.82e-05). So we will do a new model without female and then compare it with lm5 to see if it caused a significant change

lm6 <- glm(mf ~ agegrp + forest, family=binomial, data=onch1302)
anova(lm5, lm6, test="Chisq")
Analysis of Deviance Table

Model 1: mf ~ agegrp + forest + female
Model 2: mf ~ agegrp + forest
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      1296     1366.1                          
2      1297     1384.8 -1  -18.712 1.521e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note

The p value of the Likelihood ratio test is 1.521e-05 (p < 0.001), So, when I eliminate female, it changes the model significant making it less precise. So we cannot eliminate female.

The model we will stick with is the more complex lm5 (mf ~ agegrp + forest + female) The table of OR of lm5 is:

Factor (n=1302) OR (95% CI)
Female gender 0.5591696 (0.4279198 0.7284977)
Age group: 5-9 years 2.5673568 (1.6627984 4.0044862)
10-19 years 2.5673568 (1.6627984 4.0044862)
20-39 year 10.4623748 (6.9776128 15.9588558)
40 years and above 17.6593452 (11.6370517 27.2821110)
Living in forest 3.0731378 (2.3475876 4.0437091)
6. Which are the three potential interactions? Does any of them significantly improve the model? If so, make a table with the odds ratio’s in which you have taken into account the interaction.

For this section we need to test is the variables are ok in that way (lm5) or there are interaction. We will compare lm5 with models with interactions

6.1 Model with forest*female as interaction

lm7 <- glm(mf ~ agegrp + forest*female , family=binomial, data=onch1302)
anova(lm5, lm7, test="Chisq")
Analysis of Deviance Table

Model 1: mf ~ agegrp + forest + female
Model 2: mf ~ agegrp + forest * female
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1      1296     1366.1                       
2      1295     1361.7  1   4.4028  0.03588 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note

The p-value of the Likelihood ratio test is 0.03588 (Lower than 0.05). So, it changes the model significant. So there is an interaction between forest*female We will keep this model

6.2 Model with agegrp*forest as interaction

lm8 <- glm(mf ~ agegrp*forest + female , family=binomial, data=onch1302)
anova(lm5, lm8, test="Chisq")
Analysis of Deviance Table

Model 1: mf ~ agegrp + forest + female
Model 2: mf ~ agegrp * forest + female
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      1296     1366.1                     
2      1293     1360.8  3   5.2497   0.1544
Note

The p-value of the Likelihood ratio test is 0.1544 (Higher than 0.05). So, it does not changes the model significant. So there is not an interaction between agegrp*forest

6.3 Model with agegrp*female as interaction

lm9 <- glm(mf ~ forest + agegrp*female , family=binomial, data=onch1302)
anova(lm5, lm9, test="Chisq")
Analysis of Deviance Table

Model 1: mf ~ agegrp + forest + female
Model 2: mf ~ forest + agegrp * female
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      1296     1366.1                     
2      1293     1361.2  3    4.873   0.1813
Note

The p-value of the Likelihood ratio test is 0.1813 (Higher than 0.05). So, it does not changes the model significant. So there is not an interaction between agegrp*female

6.3 Filling the table requested Because we now know that we have an interaction : forest*female, is better to calculate the OR by categories of the interaction. We could decide to split the data by (women-men) or by (living/ not living in the forest). The most usual is present the results by sex, so:

table(onch1302$female)

  male female 
   616    686 
women  <-  subset(onch1302, female=="female")  #Creating a database for only women
men  <-  subset(onch1302, female=="male")    #Creating a database for only male

The model in women

lm10 <- glm(mf ~ agegrp + forest , family=binomial, data=women)
exp(coef(lm10))
  (Intercept) agegrp(10-19) agegrp(20-29)   agegrp(40+)  forestforest 
   0.09191677    2.13748276   10.48287813   23.11558852    4.15267371 
exp(confint(lm10))
Waiting for profiling to be done...
                    2.5 %     97.5 %
(Intercept)    0.05060697  0.1596338
agegrp(10-19)  1.13270252  4.1137899
agegrp(20-29)  5.97014423 19.1384989
agegrp(40+)   12.62532875 44.0769379
forestforest   2.85704938  6.1178354

The model in man

lm11 <- glm(mf ~ agegrp + forest , family=binomial, data=men)
exp(coef(lm11))
  (Intercept) agegrp(10-19) agegrp(20-29)   agegrp(40+)  forestforest 
    0.2392317     3.1894477    11.5596717    13.8388609     2.1425002 
exp(confint(lm11))
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept)   0.1395508  0.3959848
agegrp(10-19) 1.7572835  5.9066602
agegrp(20-29) 6.4172108 21.4996506
agegrp(40+)   7.8746659 25.0737487
forestforest  1.4425398  3.1999460
Note
Factor (n=1302) Women (n=686) OR (95% CI) Men (N=616) OR (95% CI)
Age group:5-9 years
10-19 year 2.13748276 (1.13270252 4.1137899) 3.1894477 (1.7572835 5.9066602)
20-39 year 10.48287813 (5.97014423 19.1384989) 11.5596717 (6.4172108 21.4996506)
40 years + 23.11558852 (12.62532875 44.0769379) 13.8388609 (7.8746659 25.0737487)
Living in forest 4.15267371 (2.85704938 6.1178354) 2.1425002 (1.4425398 3.1999460)

The effect of living in the forest is different for women from men The Odds of having the disease is 4. times as high if you are women living in the forest than women living in savana The Odds of having the disease is 2. times as high if you are men living in the forest than men living in savana

This is because female is an interaction factor (The interaction test (likelihood ratio test) was significant)

Part 2

Change-in-Estimate model (it change the primary OR exposure) 1. We select the factor that can be exposures and the main exposure: Can we change it? 2. First the crude odd ratio bivariate outcome - primary exposure

mod_crude <- glm(mf ~ forest, family=binomial, data=onch1302)
  1. Then adding one variable at the time and writing it down.
mod_1 <- glm(mf ~ forest + sex, family=binomial, data=onch1302)
mod_2 <- glm(mf ~ forest + agegrp, family=binomial, data=onch1302)
Factor Co-factor OR Change (Original OR - New OR)/Original OR
Living in forest 2.41
Living in forest Female gender 2.40 = (2.41 - 2-40)/ 2.41
= 0.4%
Living in forest Age group 3.08 = (3.08 - 2-41)/ 2.41
= 27.8%
  1. So the final model for now is: mod_2 <- glm(mf ~ forest + agegrp, family=binomial, data=onch1302)

  2. Tring adding interaction effects If we want to check for interaction, is better to have no so may categories. So we will dichotomize agegroup

#Adult variable (adult=1, non adult=0)
  onch1302$adult <- ifelse(onch1302$agegrp>1,1,0)
Warning in Ops.factor(onch1302$agegrp, 1): '>' not meaningful for factors

We compare with the likelihood ratio test the one without interaction (mf ~ forest + agegrp) and the one with interaction (mf ~ forest * agegrp)

The p value is not significant, so we keep with te more simple (mf ~ forest * agegrp)

7.Now that you have made the final predictive model (using the classical moldel), let’s continue with an explanatory model for the effect of living in the forest versus the savannah. Which factors should we consider? We will follow these steps:
  1. Fit a model containing only the primary exposure variable of interest; write down the effect size, which is the crude OR.
  2. construct models with two variables, including the primary exposure and another exposure; identify confounders by comparing crude with adjusted ORs
  3. construct saturated model containing all exposure variables which resulted in a 10% change in effect size of the primary exposure (ref. step 2)
  4. eliminate exposure variables one after the other, but do not eliminate established confounders; retain exposure variable if removing it resulted in a 10% change in effect size of the primary exposure (ref. step 3)
  5. Explore interaction between primary exposure and other exposure variables; neglect other types of interaction
8. Present a table with results of logistic regressions on forest alone, forest + female and forest + age group. Did the odds ratios of forest change as a result of this control for confounding?
9. Gender is not a confounder, but age group is. Normally this would be the end of our procedure. However, we would also like to explore interaction effects between the primary exposure and each of the confounders. This is best done by recoding age group to two levels, creating a variable ‘adult’ set to ‘FALSE’ if age is 0-19 years and ‘TRUE’ if age is 20 years or above. Is ‘adult’ or ‘female’ an effect modifier in the association between ‘forest’ and disease?