2023-02-01. Session 2

Step by Step

Import data

Import the CSV file as a data frame named “evans”, with “,” as the field separator and “.” as the decimal mark:

evans <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Linear Regresion Exercise Database/Datasets/evans_county.csv", sep=",", dec= ".")

We assume that all variables are ready to use (we do not need to transform or create any variables)



Starting the Exercise:

Is the Quetelet Index (QTI) associated with systolic blood pressure (SBP)?

Independent variable: QTI

Dependent variable: SBP

0. It is good practice to explore the data first, in this case with a scatterplot
#Code in R:          
  # Make a scatterplot
  plot(evans$QTI, evans$SBP, main="Scatterplot")
  # add a regression line
  abline(lm(SBP~QTI, data = evans), col = "blue")

Note

-We can see that there could be a correlation, although it does not seem very strong

1. What is the regression equation?
#Code for linear regression:
  lmQTI <- lm(SBP ~ QTI, data = evans)
  summary(lmQTI)

Call:
lm(formula = SBP ~ QTI, data = evans)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.855 -19.634  -5.213  14.743 155.497 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  129.420      6.875  18.824   <2e-16 ***
QTI            4.439      1.876   2.366   0.0183 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.4 on 607 degrees of freedom
Multiple R-squared:  0.00914,   Adjusted R-squared:  0.007508 
F-statistic: 5.599 on 1 and 607 DF,  p-value: 0.01828
Note

-From the coefficient table we can write the regression equation

* Formula:  y = a+b(x)
* a = 129.420 (the intercept: the estimated SBP when QTI is 0)
* b = 4.439
    (b is the slope: 
    for every 1-unit increase in QTI, SBP increases by 4.439 on average)

So, the equation would be:
* Y= 129.420 + 4.439(X)
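
As a quick worked example, the fitted equation can be used to estimate SBP for a given QTI. This is only a minimal sketch: the helper name predict_sbp and the QTI value of 3.5 are assumptions chosen for illustration, and the coefficients are the rounded ones from the output above.

```r
# Coefficients copied (rounded) from summary(lmQTI)
a <- 129.420  # intercept
b <- 4.439    # slope for QTI

# Hypothetical helper: estimated SBP for a given QTI
predict_sbp <- function(qti) a + b * qti

qti_example <- 3.5                  # illustrative value, not taken from the dataset
sbp_hat <- predict_sbp(qti_example)
sbp_hat

# A 1-unit increase in QTI changes the estimate by exactly b
slope_check <- predict_sbp(qti_example + 1) - sbp_hat
slope_check
```

With the fitted model object, predict(lmQTI, newdata = data.frame(QTI = 3.5)) should give nearly the same value, because it uses the unrounded coefficients.
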
2. How much is the determination coefficient? What does it mean?
Note

-The determination coefficient is R2 = 0.009
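
Because this model has a single predictor, R2 is also the square of the Pearson correlation between QTI and SBP. The sketch below, which uses only the rounded values printed in the output above, recovers the size of that correlation and the percentage of variability explained.

```r
r2    <- 0.00914   # Multiple R-squared from summary(lmQTI)
slope <- 4.439     # the sign of the slope gives the sign of r

# Correlation implied by R2 in a one-predictor model
r <- sign(slope) * sqrt(r2)
round(r, 4)

# Percentage of SBP variability explained by the model
pct_explained <- round(100 * r2, 1)
pct_explained
```

A correlation this close to zero agrees with the weak pattern seen in the scatterplot.
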

Interpretation: 
" __%  of the variability of the outcome could be explained by the model "
"0.9% of the variability of SBP (blood pressure) could be explained by the model"
3. Is the slope for QTI statistically significant?
Note

-The slope for QTI is 4.439
-Its p-value is 0.0183 (less than 0.05)

Interpretation: 
*The p-value in this case indicates a statistically significant association, which is the same as saying that "the slope of the regression line is significantly different from zero"
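
To see where the t value and p-value come from, they can be recomputed from the estimate and its standard error. This is a sketch based on the rounded numbers in the coefficient table, so the results may differ slightly from the printed ones in the last digit; the 95% confidence level is an assumption.

```r
est <- 4.439   # slope estimate for QTI
se  <- 1.876   # standard error of the slope
df  <- 607     # residual degrees of freedom

# t statistic and two-sided p-value for H0: slope = 0
t_value <- est / se
p_value <- 2 * pt(abs(t_value), df, lower.tail = FALSE)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)

round(t_value, 3)
round(p_value, 4)
round(ci, 2)
```

The interval does not contain zero, which is another way of reading the same result. With the model object, confint(lmQTI) returns this interval directly.
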

Testing the assumptions (linearity, homogeneity of variance, normality…)

First we need to compute the residuals and the estimated outcome (fitted values) and store them as new variables

evans$mod_resid <- residuals(lmQTI)
evans$mod_fitted <- fitted(lmQTI)
4. Is the association between SBP and QTI a linear association?

We plot the residuals against the exposure (the independent variable, QTI)

#Code in R:          
  # Make a scatterplot
  plot(evans$QTI, evans$mod_resid, main="Scatterplot")
  # add a regression line
  abline(lm(mod_resid~QTI, data = evans), col = "blue")

Note
The residuals are plotted against the exposure (the independent variable, QTI).
Because the points form a cloud with no visible pattern, we assume linearity
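
One caveat: a straight line fitted to least-squares residuals against the predictor is flat by construction, so it cannot reveal curvature. A smoother such as lowess() may be more informative. The sketch below uses simulated data (an assumption, not the evans dataset) with a deliberately curved relationship; the names sim_x and sim_y are hypothetical.

```r
set.seed(123)                                   # simulated data, NOT the evans dataset
n <- 200
sim_x <- runif(n, 2, 5)
sim_y <- 100 + 3 * sim_x^2 + rnorm(n, sd = 5)   # deliberately curved relationship

fit <- lm(sim_y ~ sim_x)
res <- residuals(fit)

# The straight line through the residuals always has a slope of (numerically) zero
flat_slope <- unname(coef(lm(res ~ sim_x))[2])
flat_slope

# A lowess smoother can show curvature that the straight line hides
plot(sim_x, res, main = "Residuals vs predictor (simulated)",
     xlab = "Predictor", ylab = "Residuals")
abline(h = 0, col = "blue")
smooth <- lowess(sim_x, res)
lines(smooth, col = "red")
```

For the exercise, the same idea would be applied to evans$QTI and evans$mod_resid; a smoother that stays close to zero supports the linearity assumption.
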
5. Is the variance of the residuals homogeneous?

We plot the residuals against the estimated outcome (fitted)

#Code in R:          
  # Make a scatterplot
  plot(evans$mod_fitted, evans$mod_resid, main="Scatterplot")
  # add a regression line
  abline(lm(mod_resid ~ mod_fitted, data = evans), col = "blue")

Note
The residuals are plotted against the estimated outcome (fitted).
Because the points form a cloud with no funnel shape, we assume homogeneity of variance
6. Are the residuals normally distributed?
#Make a histogram of the residuals
  hist(evans$mod_resid, main="Histogram of residuals", 
     xlab="Residuals", ylab="Frequency", col="green", border="darkgreen")

Note
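It seems not: the residuals have a long right tail (in the model summary the median is -5.2, the minimum -54.9 and the maximum 155.5)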
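
A histogram can be complemented with a normal Q-Q plot and a Shapiro-Wilk test. The sketch below runs on simulated right-skewed residuals (an assumption standing in for evans$mod_resid); sim_resid is a hypothetical name.

```r
set.seed(2023)                          # simulated residuals, NOT the evans data
sim_resid <- rexp(300, rate = 1) - 1    # right-skewed, centred near 0

# Q-Q plot: points far from the line indicate departure from normality
qqnorm(sim_resid, main = "Normal Q-Q plot (simulated residuals)")
qqline(sim_resid, col = "blue")

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(sim_resid)
sw$p.value
```

In the exercise the same calls would take evans$mod_resid. With several hundred observations the test can flag even mild departures, so the plots remain the main guide.
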
7. Are the observations independent?
Note
We do not need any calculation for this.
Because we are not working with sequential measures (in time or in space),
  the observations can be considered independent
8. Are there any extreme values?
Note
For this we need to make a scatterplot of the residuals against the estimated outcome (fitted); we already did this in step 5.

For discussion: there seem to be some extreme values (the largest residual is 155.5).
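
Extreme values can also be flagged numerically with standardized residuals. This is a sketch on simulated data with one planted outlier (an assumption, not the evans dataset); the cutoff of 3 is a common rule of thumb rather than a fixed rule.

```r
set.seed(42)                        # simulated data, NOT the evans dataset
n <- 50
sim_x <- runif(n, 2, 5)
sim_y <- 130 + 4 * sim_x + rnorm(n, sd = 5)
sim_y[10] <- sim_y[10] + 60         # plant one extreme value at row 10

fit <- lm(sim_y ~ sim_x)
rs <- rstandard(fit)                # standardized residuals

# Rows whose standardized residual exceeds the cutoff in absolute value
flagged <- unname(which(abs(rs) > 3))
flagged
```

For the exercise, rstandard(lmQTI) would give the standardized residuals of the fitted model.
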
9. Are there any influential observations?
Note
For this we need to make a scatterplot of the residuals against the exposure; we already did this in step 4.

For discussion: there seem to be some influential observations.
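
Influence can also be quantified with Cook's distance, which combines how unusual the predictor value is with how large the residual is. This is a sketch on simulated data with one planted influential point (an assumption, not the evans dataset); the 4/n cutoff is one common rule of thumb.

```r
set.seed(1)                                       # simulated data, NOT the evans dataset
sim_x <- c(runif(30, 0, 10), 30)                  # row 31 lies far from the other x values
sim_y <- c(2 + 3 * sim_x[1:30] + rnorm(30), 0)    # and far below the trend

fit <- lm(sim_y ~ sim_x)
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (assumption): 4 divided by the number of observations
cutoff <- 4 / length(cd)
influential <- unname(which(cd > cutoff))
most_influential <- unname(which.max(cd))
influential
most_influential

# One vertical bar per observation, with the cutoff as a dashed line
plot(cd, type = "h", main = "Cook's distance (simulated)",
     xlab = "Observation", ylab = "Cook's distance")
abline(h = cutoff, lty = 2)
```

For the exercise, cooks.distance(lmQTI) would return one value per participant.
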
10. Does the model significantly improve when the variable AGE is added?

We check this by fitting a multiple linear regression

#Code for multiple linear regression:
  lmQTI_AGE <- lm(SBP ~ QTI + AGE, data = evans)
  summary(lmQTI_AGE)

Call:
lm(formula = SBP ~ QTI + AGE, data = evans)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.645 -15.916  -3.985  11.393 160.285 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  73.4912     9.2816   7.918 1.16e-14 ***
QTI           5.6031     1.7809   3.146  0.00173 ** 
AGE           0.9630     0.1139   8.452  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.93 on 606 degrees of freedom
Multiple R-squared:  0.1136,    Adjusted R-squared:  0.1107 
F-statistic: 38.84 on 2 and 606 DF,  p-value: < 2.2e-16
Note

-A. From the coefficient table we can write the new equation

* y = a + b1(x1) + b2(x2)
* a= 73.4912
* b1 = 5.6031
* b2 = 0.9630
So, the formula would be:
* Y= 73.4912 + 5.6031(QTI) + 0.9630(AGE)

-B. We also find the R2

* R2 is the "Determination coefficient". 
* R2 =  0.1136
Interpretation: 
"11% of the variability of the SBP could be explained by the model"

-C. Does the model significantly improve when the variable AGE is added?

* It improves (R2 rises from 0.009 to 0.114), but this alone does not tell us whether the improvement is significant. 

To find out whether the improvement is significant, we can compare the two models with an ANOVA table

#Code for ANOVA
  anova(lmQTI, lmQTI_AGE)
Analysis of Variance Table

Model 1: SBP ~ QTI
Model 2: SBP ~ QTI + AGE
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    607 455567                                  
2    606 407530  1     48037 71.432 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note

p < 2.2e-16. The second model differs significantly from the first, so including AGE has significantly improved the model.
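
The F statistic in the ANOVA table can be reproduced by hand from the residual sums of squares of the two models. This is a sketch using the rounded values printed in the table, so the last digit may differ slightly.

```r
rss1 <- 455567; df1 <- 607   # Model 1: SBP ~ QTI
rss2 <- 407530; df2 <- 606   # Model 2: SBP ~ QTI + AGE

# Drop in RSS per added parameter, relative to the residual variance of the larger model
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

round(f_stat, 2)
p_value

# With one added variable, F equals the squared t value of AGE (8.452 in the summary)
round(8.452^2, 2)
```

Because only one variable was added, this F test is equivalent to the t test for AGE in the model summary.
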

11. Which proportion of the variability in SBP is explained by this model?
Note

We already did this in step 10:

* R2 is the “Determination coefficient”.
* R2 = 0.1136
Interpretation: 
“11% of the variability of the SBP could be explained by the model”
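
The summary also reports an adjusted R2, which penalizes R2 for the number of predictors. The sketch below recomputes it from the rounded R2; the number of observations is inferred from the residual degrees of freedom (606 plus 3 estimated coefficients).

```r
r2 <- 0.1136   # Multiple R-squared of SBP ~ QTI + AGE
n  <- 609      # observations used in the fit: 606 residual df + 3 coefficients
p  <- 2        # number of predictors (QTI and AGE)

# Adjusted R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

The result matches the adjusted R-squared of 0.1107 printed in the model summary.
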

12. Please write down the equation of the model
Note

We already did this in step 10:

* y = a + b1(x1) + b2(x2)
* a= 73.4912
* b1 = 5.6031
* b2 = 0.9630
So, the formula would be:
* Y= 73.4912 + 5.6031(QTI) + 0.9630(AGE)
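
As a worked example, the equation can be evaluated for one hypothetical participant. This is a minimal sketch: the helper name predict_sbp2 and the values QTI = 3.5 and AGE = 50 are assumptions for illustration only, and the coefficients are the rounded ones from the summary.

```r
# Coefficients copied (rounded) from summary(lmQTI_AGE)
a  <- 73.4912   # intercept
b1 <- 5.6031    # slope for QTI
b2 <- 0.9630    # slope for AGE

# Hypothetical helper: estimated SBP for given QTI and AGE
predict_sbp2 <- function(qti, age) a + b1 * qti + b2 * age

sbp_hat <- predict_sbp2(qti = 3.5, age = 50)   # illustrative values
sbp_hat

# Holding QTI fixed, one extra year of age adds b2 to the estimate
age_effect <- predict_sbp2(3.5, 51) - sbp_hat
age_effect
```

With the model object, predict(lmQTI_AGE, newdata = data.frame(QTI = 3.5, AGE = 50)) should give nearly the same value.
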

Finish =)