2023-02-10 Session 1

Step by Step

Import data

Importing a CSV database under the name of “leprosy_comoros”, with “,” as separator:

leprosy_comoros <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Poisson Regression Exercise Database/Datasets/leprosy_comoros.csv", sep=",")


Starting the Exercise

1. Use the ‘aggregate’ function in R to determine the total population and how many cases arose over the follow-up period of 1 year? (First create a constant with a value of 1)

What was the incidence rate?

The exercise is asking to calculate the total number of cases and population    
during the year of the study, using the command aggregate.

So the original data is:
leprosy_comoros
   PEP_19      dist_cat   pop inc_case
1       0       same hh   735       11
2       0 neighbor <25m  5740       37
3       0       >25-50m  8856       37
4       0       >50-75m  7462       34
5       0       75-100m  5802       14
6       0         >100m 41599       47
7       1       same hh  1241        6
8       1 neighbor <25m  2964        9
9       1       >25-50m  3672        6
10      1       >50-75m  2435        9
11      1       75-100m  1638        5
12      1         >100m  2571        4
The aggregate command could respond to the following request: 
"Please agregate( or sum) "inc_case" and "population" by each category of PEP_19
```{r}
table1 <- aggregate(cbind(inc_case, pop) ~ PEP_19, data=leprosy_comoros, FUN=sum)
table1
```
  PEP_19 inc_case   pop
1      0      180 70194
2      1       39 14521
But because we want all the cases and all the population during the study
(not by each category of PEP),
we have to create a new variable of just one categorie, so that when agregate
it calculates everything.
```{r}
leprosy_comoros$const <- 1
table1 <- aggregate(cbind(inc_case, pop) ~ const, data=leprosy_comoros, FUN=sum)
table1
```
  const inc_case   pop
1     1      219 84715
So, the incidence rate during the period of the study (1 year) is:
= 219/84715
= 0.002585138
= 2.59 per 1000 population

We can also calculate the incidence ratio using the Poisson regression

Why Poisson regression?, because the Dependent Variable is a rate.

We will just ask for the incidence (using inc_case / population)
with no other variable
```{r}
# Code for poisson regression (univariate analysis)
GLM0 <- glm(inc_case ~  offset(log(pop)) , family=poisson(log), data=leprosy_comoros)
summary(GLM0)
exp(coef(GLM0))
exp(confint(GLM0))
```
Waiting for profiling to be done...

Call:
glm(formula = inc_case ~ offset(log(pop)), family = poisson(log), 
    data = leprosy_comoros)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-6.5783  -0.4730   0.7412   2.7829   4.8259  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.95798    0.06757  -88.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 109.51  on 11  degrees of freedom
Residual deviance: 109.51  on 11  degrees of freedom
AIC: 164.39

Number of Fisher Scoring iterations: 5

(Intercept) 
0.002585138 
      2.5 %      97.5 % 
0.002257694 0.002942787 
Note

We can see in the results that the incidence rate during the period of the study (1 year) is:

= exp(-5.95798)*1000
= 0.002585138 *1000
= 2.59 per 1000 population

So its the same result. We can calculate the incidence rate both ways

2. What can you say about the effect of PEP, did it protect?

So far now the Dependent Variable is: Incidence rate of cases

The Independent variable is: Received PEP (PEP_19)

The Covariant is: Distance to the nearest other person (disc_cat)

2.1 First we explore bivariate relation (incidence cases - PEP_19)

Note

To calculate the RR (between incidence and PEP) we cannot do it in a table2x2 because the variables of interest are not categorical and are not dichotomies, so we use the Poisson regression

Why Poisson regression?, because the Dependent Variable is a rate

Doing the poisson regression (bivariate analysis: Incidenc rate and PEP_19)

```{r}
# Code for poisson regression (bivariate analysis)
GLM1 <- glm(inc_case ~  offset(log(pop)) + PEP_19 , family=poisson(log), data=leprosy_comoros)
summary(GLM1)
exp(coef(GLM1))
exp(confint(GLM1))
```
Waiting for profiling to be done...

Call:
glm(formula = inc_case ~ offset(log(pop)) + PEP_19, family = poisson(log), 
    data = leprosy_comoros)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-6.5040  -0.4727   0.6352   2.8244   4.8630  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.96606    0.07454 -80.043   <2e-16 ***
PEP_19       0.04627    0.17663   0.262    0.793    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 109.51  on 11  degrees of freedom
Residual deviance: 109.44  on 10  degrees of freedom
AIC: 166.32

Number of Fisher Scoring iterations: 5

(Intercept)      PEP_19 
0.002564322 1.047358997 
                  2.5 %      97.5 %
(Intercept) 0.002207713 0.002957382
PEP_19      0.730667264 1.463060462
Note

So, is PEP associated with incidence? No, the p value is 0.793

We can also see the RR RR of PEP = 1.047358997 IC(0.73 - 1.46) So, so far it seems that PEP is a risk factor, not significant, is that true?, We should see counfonders

3. The data presented here mixes up 4 different study arms, in villages randomized to study arm 1, no PEP was provided, in study arm 2 PEP was provided to household contacts of index cases, in study arm 3 PEP was provided to entire villages and in study arm 4 PEP was provided to household contacts as well as others living in the same village and testing positive to a serological screening test but asymptomatic. As a result there may be an association between PEP and distance to nearest index case. If distance would also be associated with risk of becoming an incident case, distance could be a confounder. First check if distance is indeed associated with risk of becoming an incident case. Make a table with numbers of population, cases and risk ratios + confidence intervals for the 6 distance categories

3.1 We check if distance is indeed associated with risk of becoming an incident case

First we count how many cases are by each distance category

```{r}
# We can do it with the command table, but also with aggregate
table2 <- aggregate(cbind(inc_case, pop) ~ dist_cat, data=leprosy_comoros, FUN=sum)
table2
```
       dist_cat inc_case   pop
1         >100m       51 44170
2       >25-50m       43 12528
3       >50-75m       43  9897
4       75-100m       19  7440
5 neighbor <25m       46  8704
6       same hh       17  1976

Now we do the poisson regression (bivariate analysis: Incidence rate, dist)

```{r}
# Code for poisson regression (bivariate analysis)
GLM2 <- glm(inc_case ~  offset(log(pop)) + factor(dist_cat) , family=poisson(log), data=leprosy_comoros)
summary(GLM2)
exp(coef(GLM2))
exp(confint(GLM2))
```
Waiting for profiling to be done...

Call:
glm(formula = inc_case ~ offset(log(pop)) + factor(dist_cat), 
    family = poisson(log), data = leprosy_comoros)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.07373  -0.76420   0.06291   0.71563   1.68133  

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -6.7640     0.1400 -48.304  < 2e-16 ***
factor(dist_cat)>25-50m         1.0895     0.2070   5.262 1.42e-07 ***
factor(dist_cat)>50-75m         1.3252     0.2070   6.401 1.55e-10 ***
factor(dist_cat)75-100m         0.7938     0.2688   2.953  0.00314 ** 
factor(dist_cat)neighbor <25m   1.5211     0.2033   7.480 7.40e-14 ***
factor(dist_cat)same hh         2.0084     0.2800   7.171 7.42e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 109.511  on 11  degrees of freedom
Residual deviance:  16.492  on  6  degrees of freedom
AIC: 81.368

Number of Fisher Scoring iterations: 4

                  (Intercept)       factor(dist_cat)>25-50m 
                   0.00115463                    2.97265107 
      factor(dist_cat)>50-75m       factor(dist_cat)75-100m 
                   3.76289507                    2.21175943 
factor(dist_cat)neighbor <25m       factor(dist_cat)same hh 
                   4.57716083                    7.45107964 
                                     2.5 %       97.5 %
(Intercept)                   0.0008660436  0.001501143
factor(dist_cat)>25-50m       1.9737201350  4.456174129
factor(dist_cat)>50-75m       2.4984102103  5.640795139
factor(dist_cat)75-100m       1.2743641132  3.678635507
factor(dist_cat)neighbor <25m 3.0643672475  6.818042875
factor(dist_cat)same hh       4.1804487877 12.625252928
Note
Distance Population Cases RR 95% IC
Same HH 1976 17 7.5 4.2-12.6
Neighbor <25m 8704 46 4.6 3.1-6.8
25-50m 12528 43 3.0 2.0-4.5
50-75m 9897 43 3.8 2.5-5.6
75-100m 7440 19 2.2 1.3-3.7
100+m 44170 51 Ref Ref

So the distance is indeed associated with the incidence of leprosy with more near are the cases more is a risk factor

There is a clear a link between leprasy and distance The probability of receive a PEP is associated at the distance

Also, we have to consider than distance could be a confounder because: 1. Is to the VD, 2. Is related to the VI, and 3. Is not in the causal pathway

            (Leprasy Infection) --------- (PEP-19)
                              \           /
                               Confounders
                                (Distance)
4. Now that you have shown that distance is clearly associated with risk of disease, fit a poisson model that includes distance and PEP. What happens to the effect of PEP?

For this we do poisson regression (multivariate analysis: Incidence, PEP, dist)

```{r}
# Code for poisson regression (multivariate analysis)
GLM3 <- glm(inc_case ~  offset(log(pop)) + PEP_19 + factor(dist_cat) , family=poisson(log), data=leprosy_comoros)
summary(GLM3)
exp(coef(GLM3))
exp(confint(GLM3))
```
Waiting for profiling to be done...

Call:
glm(formula = inc_case ~ offset(log(pop)) + PEP_19 + factor(dist_cat), 
    family = poisson(log), data = leprosy_comoros)

Deviance Residuals: 
      1        2        3        4        5        6        7        8  
 0.8075   0.2849   0.4073  -0.3502  -0.5833  -0.3179  -0.8944  -0.5363  
      9       10       11       12  
-0.8804   0.7566   1.2478   1.4215  

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -6.7397     0.1402 -48.079  < 2e-16 ***
PEP_19                         -0.5320     0.1854  -2.869  0.00412 ** 
factor(dist_cat)>25-50m         1.1940     0.2092   5.708 1.14e-08 ***
factor(dist_cat)>50-75m         1.4079     0.2083   6.758 1.40e-11 ***
factor(dist_cat)75-100m         0.8647     0.2695   3.208  0.00133 ** 
factor(dist_cat)neighbor <25m   1.6482     0.2066   7.977 1.50e-15 ***
factor(dist_cat)same hh         2.2839     0.2929   7.797 6.33e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 109.5113  on 11  degrees of freedom
Residual deviance:   7.4757  on  5  degrees of freedom
AIC: 74.351

Number of Fisher Scoring iterations: 4

                  (Intercept)                        PEP_19 
                  0.001183039                   0.587440946 
      factor(dist_cat)>25-50m       factor(dist_cat)>50-75m 
                  3.300353348                   4.087421777 
      factor(dist_cat)75-100m factor(dist_cat)neighbor <25m 
                  2.374303585                   5.197433272 
      factor(dist_cat)same hh 
                  9.815321524 
                                     2.5 %       97.5 %
(Intercept)                   0.0008871107  0.001538568
PEP_19                        0.4032137828  0.835625711
factor(dist_cat)>25-50m       2.1820609064  4.967595519
factor(dist_cat)>50-75m       2.7068955509  6.142667468
factor(dist_cat)75-100m       1.3661994452  3.954939056
factor(dist_cat)neighbor <25m 3.4569176392  7.790047953
factor(dist_cat)same hh       5.3795372678 17.072393655
Note

So, what happens to the efect of PEEP?

Is PEP now significant?
Yes after adjusting with distance, the p value for PEP is 0.00412, 
so now the association is significant

Also the RR of PEP now is 0.587440946 IC (0.4032137828 0.835625711) so is significative protective.

5. Assume we are fitting a significance based classic model. Is the model with both PEP and distance better than the model with only distance?

I already know that the association is significant, but for being sure they are asking to compare both models (with and without distance)(GLM1 vs GLM3) to see in there is a significant improvement

anova(GLM3, GLM1, test="Chisq")
Analysis of Deviance Table

Model 1: inc_case ~ offset(log(pop)) + PEP_19 + factor(dist_cat)
Model 2: inc_case ~ offset(log(pop)) + PEP_19
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1         5      7.476                          
2        10    109.443 -5  -101.97 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note

The p value is 2.2e-16 , so it is a significant difference between both models So we keep the GLM3 model

6. Check the dispersion parameter, is it far removed from 1? Is there a problem with the condition that mean should be more or less equal to variance

One of the main conditions for us to use Poisson regression is to check for overdispersion (if the mean is similar to the variance), if the result is close to 1 then if Poisson can be used, otherwise we will use binomial regression negative

```{r}
# Code for checking for overdispersion
dispp <- sum(residuals(GLM3, type="pearson")^2)/GLM3$df.residual
dispp
```
[1] 1.703122
Note

The dispersion parameter is 1.7 so it is near to one.

Then is ok to use the poisson regression.

Yet, just for educational purpuses we are going to test the negative binomial regression

7. Even though it is not necessary in this case, try to fit a negative binomial model. Does the result differ?

If we don’t have the library MASS we should install it

#install.packages("MASS")
```{r}
# Code for poisson regression (multivariate analysis)
library(MASS)
GLM.NM1 <- glm.nb(inc_case ~  offset(log(pop)) + PEP_19 + factor(dist_cat), data=leprosy_comoros)
```
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached

Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
```{r}
exp(coef(GLM.NM1))
exp(confint(GLM.NM1))
```
Waiting for profiling to be done...
                  (Intercept)                        PEP_19 
                  0.001183054                   0.587446051 
      factor(dist_cat)>25-50m       factor(dist_cat)>50-75m 
                  3.300273827                   4.087400355 
      factor(dist_cat)75-100m factor(dist_cat)neighbor <25m 
                  2.374295046                   5.197325128 
      factor(dist_cat)same hh 
                  9.815151563 
                                     2.5 %      97.5 %
(Intercept)                   0.0008870494  0.00153864
PEP_19                        0.4031984916  0.83565023
factor(dist_cat)>25-50m       2.1818890837  4.96772620
factor(dist_cat)>50-75m       2.7067289044  6.14295421
factor(dist_cat)75-100m       1.3660182361  3.95513469
factor(dist_cat)neighbor <25m 3.4566639695  7.79027966
factor(dist_cat)same hh       5.3787294252 17.07279368
Note

The OR and IC are similar in bot models (poisson and negative binomial model)