2023-02-10 Session 1
- Instituut Voor Tropische Geneeskunde - Antwerp, Belgium
- Javier Silva-Valencia
Step by Step
Import data
Importing a CSV database under the name of “leprosy_comoros”, with “,” as separator:
Starting the Exercise
1. Use the ‘aggregate’ function in R to determine the total population and how many cases arose over the follow-up period of 1 year? (First create a constant with a value of 1)
What was the incidence rate?
The exercise is asking to calculate the total number of cases and population
during the year of the study, using the command aggregate.
So the original data is:
PEP_19 dist_cat pop inc_case
1 0 same hh 735 11
2 0 neighbor <25m 5740 37
3 0 >25-50m 8856 37
4 0 >50-75m 7462 34
5 0 75-100m 5802 14
6 0 >100m 41599 47
7 1 same hh 1241 6
8 1 neighbor <25m 2964 9
9 1 >25-50m 3672 6
10 1 >50-75m 2435 9
11 1 75-100m 1638 5
12 1 >100m 2571 4
The aggregate command could respond to the following request:
"Please agregate( or sum) "inc_case" and "population" by each category of PEP_19
PEP_19 inc_case pop
1 0 180 70194
2 1 39 14521
But because we want all the cases and all the population during the study
(not by each category of PEP),
we have to create a new variable of just one categorie, so that when agregate
it calculates everything.
```{r}
leprosy_comoros$const <- 1
table1 <- aggregate(cbind(inc_case, pop) ~ const, data=leprosy_comoros, FUN=sum)
table1
```
const inc_case pop
1 1 219 84715
So, the incidence rate during the period of the study (1 year) is:
= 219/84715
= 0.002585138
= 2.59 per 1000 population
We can also calculate the incidence ratio using the Poisson regression
Why Poisson regression?, because the Dependent Variable is a rate.
We will just ask for the incidence (using inc_case / population)
with no other variable
```{r}
# Code for poisson regression (univariate analysis)
GLM0 <- glm(inc_case ~ offset(log(pop)) , family=poisson(log), data=leprosy_comoros)
summary(GLM0)
exp(coef(GLM0))
exp(confint(GLM0))
```
Waiting for profiling to be done...
Call:
glm(formula = inc_case ~ offset(log(pop)), family = poisson(log),
data = leprosy_comoros)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.5783 -0.4730 0.7412 2.7829 4.8259
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.95798 0.06757 -88.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 109.51 on 11 degrees of freedom
Residual deviance: 109.51 on 11 degrees of freedom
AIC: 164.39
Number of Fisher Scoring iterations: 5
(Intercept)
0.002585138
2.5 % 97.5 %
0.002257694 0.002942787
We can see in the results that the incidence rate during the period of the study (1 year) is:
= exp(-5.95798)*1000
= 0.002585138 *1000
= 2.59 per 1000 population
So its the same result. We can calculate the incidence rate both ways
2. What can you say about the effect of PEP, did it protect?
So far now the Dependent Variable is: Incidence rate of cases
The Independent variable is: Received PEP (PEP_19)
The Covariant is: Distance to the nearest other person (disc_cat)
2.1 First we explore bivariate relation (incidence cases - PEP_19)
To calculate the RR (between incidence and PEP) we cannot do it in a table2x2 because the variables of interest are not categorical and are not dichotomies, so we use the Poisson regression
Why Poisson regression?, because the Dependent Variable is a rate
Doing the poisson regression (bivariate analysis: Incidenc rate and PEP_19)
```{r}
# Code for poisson regression (bivariate analysis)
GLM1 <- glm(inc_case ~ offset(log(pop)) + PEP_19 , family=poisson(log), data=leprosy_comoros)
summary(GLM1)
exp(coef(GLM1))
exp(confint(GLM1))
```
Waiting for profiling to be done...
Call:
glm(formula = inc_case ~ offset(log(pop)) + PEP_19, family = poisson(log),
data = leprosy_comoros)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.5040 -0.4727 0.6352 2.8244 4.8630
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.96606 0.07454 -80.043 <2e-16 ***
PEP_19 0.04627 0.17663 0.262 0.793
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 109.51 on 11 degrees of freedom
Residual deviance: 109.44 on 10 degrees of freedom
AIC: 166.32
Number of Fisher Scoring iterations: 5
(Intercept) PEP_19
0.002564322 1.047358997
2.5 % 97.5 %
(Intercept) 0.002207713 0.002957382
PEP_19 0.730667264 1.463060462
So, is PEP associated with incidence? No, the p value is 0.793
We can also see the RR RR of PEP = 1.047358997 IC(0.73 - 1.46) So, so far it seems that PEP is a risk factor, not significant, is that true?, We should see counfonders
3. The data presented here mixes up 4 different study arms, in villages randomized to study arm 1, no PEP was provided, in study arm 2 PEP was provided to household contacts of index cases, in study arm 3 PEP was provided to entire villages and in study arm 4 PEP was provided to household contacts as well as others living in the same village and testing positive to a serological screening test but asymptomatic. As a result there may be an association between PEP and distance to nearest index case. If distance would also be associated with risk of becoming an incident case, distance could be a confounder. First check if distance is indeed associated with risk of becoming an incident case. Make a table with numbers of population, cases and risk ratios + confidence intervals for the 6 distance categories
3.1 We check if distance is indeed associated with risk of becoming an incident case
First we count how many cases are by each distance category
```{r}
# We can do it with the command table, but also with aggregate
table2 <- aggregate(cbind(inc_case, pop) ~ dist_cat, data=leprosy_comoros, FUN=sum)
table2
```
dist_cat inc_case pop
1 >100m 51 44170
2 >25-50m 43 12528
3 >50-75m 43 9897
4 75-100m 19 7440
5 neighbor <25m 46 8704
6 same hh 17 1976
Now we do the poisson regression (bivariate analysis: Incidence rate, dist)
```{r}
# Code for poisson regression (bivariate analysis)
GLM2 <- glm(inc_case ~ offset(log(pop)) + factor(dist_cat) , family=poisson(log), data=leprosy_comoros)
summary(GLM2)
exp(coef(GLM2))
exp(confint(GLM2))
```
Waiting for profiling to be done...
Call:
glm(formula = inc_case ~ offset(log(pop)) + factor(dist_cat),
family = poisson(log), data = leprosy_comoros)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.07373 -0.76420 0.06291 0.71563 1.68133
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.7640 0.1400 -48.304 < 2e-16 ***
factor(dist_cat)>25-50m 1.0895 0.2070 5.262 1.42e-07 ***
factor(dist_cat)>50-75m 1.3252 0.2070 6.401 1.55e-10 ***
factor(dist_cat)75-100m 0.7938 0.2688 2.953 0.00314 **
factor(dist_cat)neighbor <25m 1.5211 0.2033 7.480 7.40e-14 ***
factor(dist_cat)same hh 2.0084 0.2800 7.171 7.42e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 109.511 on 11 degrees of freedom
Residual deviance: 16.492 on 6 degrees of freedom
AIC: 81.368
Number of Fisher Scoring iterations: 4
(Intercept) factor(dist_cat)>25-50m
0.00115463 2.97265107
factor(dist_cat)>50-75m factor(dist_cat)75-100m
3.76289507 2.21175943
factor(dist_cat)neighbor <25m factor(dist_cat)same hh
4.57716083 7.45107964
2.5 % 97.5 %
(Intercept) 0.0008660436 0.001501143
factor(dist_cat)>25-50m 1.9737201350 4.456174129
factor(dist_cat)>50-75m 2.4984102103 5.640795139
factor(dist_cat)75-100m 1.2743641132 3.678635507
factor(dist_cat)neighbor <25m 3.0643672475 6.818042875
factor(dist_cat)same hh 4.1804487877 12.625252928
Distance | Population | Cases | RR | 95% IC |
---|---|---|---|---|
Same HH | 1976 | 17 | 7.5 | 4.2-12.6 |
Neighbor <25m | 8704 | 46 | 4.6 | 3.1-6.8 |
25-50m | 12528 | 43 | 3.0 | 2.0-4.5 |
50-75m | 9897 | 43 | 3.8 | 2.5-5.6 |
75-100m | 7440 | 19 | 2.2 | 1.3-3.7 |
100+m | 44170 | 51 | Ref | Ref |
So the distance is indeed associated with the incidence of leprosy with more near are the cases more is a risk factor
There is a clear a link between leprasy and distance The probability of receive a PEP is associated at the distance
Also, we have to consider than distance could be a confounder because: 1. Is to the VD, 2. Is related to the VI, and 3. Is not in the causal pathway
(Leprasy Infection) --------- (PEP-19)
\ /
Confounders
(Distance)
4. Now that you have shown that distance is clearly associated with risk of disease, fit a poisson model that includes distance and PEP. What happens to the effect of PEP?
For this we do poisson regression (multivariate analysis: Incidence, PEP, dist)
```{r}
# Code for poisson regression (multivariate analysis)
GLM3 <- glm(inc_case ~ offset(log(pop)) + PEP_19 + factor(dist_cat) , family=poisson(log), data=leprosy_comoros)
summary(GLM3)
exp(coef(GLM3))
exp(confint(GLM3))
```
Waiting for profiling to be done...
Call:
glm(formula = inc_case ~ offset(log(pop)) + PEP_19 + factor(dist_cat),
family = poisson(log), data = leprosy_comoros)
Deviance Residuals:
1 2 3 4 5 6 7 8
0.8075 0.2849 0.4073 -0.3502 -0.5833 -0.3179 -0.8944 -0.5363
9 10 11 12
-0.8804 0.7566 1.2478 1.4215
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.7397 0.1402 -48.079 < 2e-16 ***
PEP_19 -0.5320 0.1854 -2.869 0.00412 **
factor(dist_cat)>25-50m 1.1940 0.2092 5.708 1.14e-08 ***
factor(dist_cat)>50-75m 1.4079 0.2083 6.758 1.40e-11 ***
factor(dist_cat)75-100m 0.8647 0.2695 3.208 0.00133 **
factor(dist_cat)neighbor <25m 1.6482 0.2066 7.977 1.50e-15 ***
factor(dist_cat)same hh 2.2839 0.2929 7.797 6.33e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 109.5113 on 11 degrees of freedom
Residual deviance: 7.4757 on 5 degrees of freedom
AIC: 74.351
Number of Fisher Scoring iterations: 4
(Intercept) PEP_19
0.001183039 0.587440946
factor(dist_cat)>25-50m factor(dist_cat)>50-75m
3.300353348 4.087421777
factor(dist_cat)75-100m factor(dist_cat)neighbor <25m
2.374303585 5.197433272
factor(dist_cat)same hh
9.815321524
2.5 % 97.5 %
(Intercept) 0.0008871107 0.001538568
PEP_19 0.4032137828 0.835625711
factor(dist_cat)>25-50m 2.1820609064 4.967595519
factor(dist_cat)>50-75m 2.7068955509 6.142667468
factor(dist_cat)75-100m 1.3661994452 3.954939056
factor(dist_cat)neighbor <25m 3.4569176392 7.790047953
factor(dist_cat)same hh 5.3795372678 17.072393655
So, what happens to the efect of PEEP?
Is PEP now significant?
Yes after adjusting with distance, the p value for PEP is 0.00412,
so now the association is significant
Also the RR of PEP now is 0.587440946 IC (0.4032137828 0.835625711) so is significative protective.
5. Assume we are fitting a significance based classic model. Is the model with both PEP and distance better than the model with only distance?
I already know that the association is significant, but for being sure they are asking to compare both models (with and without distance)(GLM1 vs GLM3) to see in there is a significant improvement
Analysis of Deviance Table
Model 1: inc_case ~ offset(log(pop)) + PEP_19 + factor(dist_cat)
Model 2: inc_case ~ offset(log(pop)) + PEP_19
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 5 7.476
2 10 109.443 -5 -101.97 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p value is 2.2e-16 , so it is a significant difference between both models So we keep the GLM3 model
6. Check the dispersion parameter, is it far removed from 1? Is there a problem with the condition that mean should be more or less equal to variance
One of the main conditions for us to use Poisson regression is to check for overdispersion (if the mean is similar to the variance), if the result is close to 1 then if Poisson can be used, otherwise we will use binomial regression negative
```{r}
# Code for checking for overdispersion
dispp <- sum(residuals(GLM3, type="pearson")^2)/GLM3$df.residual
dispp
```
[1] 1.703122
The dispersion parameter is 1.7 so it is near to one.
Then is ok to use the poisson regression.
Yet, just for educational purpuses we are going to test the negative binomial regression
7. Even though it is not necessary in this case, try to fit a negative binomial model. Does the result differ?
If we don’t have the library MASS we should install it
```{r}
# Code for poisson regression (multivariate analysis)
library(MASS)
GLM.NM1 <- glm.nb(inc_case ~ offset(log(pop)) + PEP_19 + factor(dist_cat), data=leprosy_comoros)
```
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Waiting for profiling to be done...
(Intercept) PEP_19
0.001183054 0.587446051
factor(dist_cat)>25-50m factor(dist_cat)>50-75m
3.300273827 4.087400355
factor(dist_cat)75-100m factor(dist_cat)neighbor <25m
2.374295046 5.197325128
factor(dist_cat)same hh
9.815151563
2.5 % 97.5 %
(Intercept) 0.0008870494 0.00153864
PEP_19 0.4031984916 0.83565023
factor(dist_cat)>25-50m 2.1818890837 4.96772620
factor(dist_cat)>50-75m 2.7067289044 6.14295421
factor(dist_cat)75-100m 1.3660182361 3.95513469
factor(dist_cat)neighbor <25m 3.4566639695 7.79027966
factor(dist_cat)same hh 5.3787294252 17.07279368
The OR and IC are similar in bot models (poisson and negative binomial model)