2023-02-01. Session 2
- Instituut Voor Tropische Geneeskunde - Antwerp, Belgium
- Javier Silva-Valencia
Step by Step
Import data
Importing a CSV database under the name “evans”, with “,” as the separator and “.” as the decimal mark:
evans <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Linear Regresion Exercise Database/Datasets/evans_county.csv", sep = ",", dec = ".")
We assume that all variables are ready to use (we don't need to transform or create variables).
Starting the Exercise:
Is the Quetelet index (QTI) associated with systolic blood pressure (SBP)?
Independent variable: QTI
Dependent variable: SBP
0. It is good practice to explore the data first, in this case with a scatterplot
#Code in R:
# Make a scatterplot
plot(evans$QTI, evans$SBP, main="Scatterplot")
# add a regression line
abline(lm(SBP~QTI, data = evans), col = "blue")
-We can see that there could be a correlation, although it does not seem very strong
1. What is the regression equation?
#Code for linear regression:
lmQTI <- lm(SBP ~ QTI, data = evans)
summary(lmQTI)
Call:
lm(formula = SBP ~ QTI, data = evans)
Residuals:
    Min      1Q  Median      3Q     Max
-54.855 -19.634  -5.213  14.743 155.497
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  129.420      6.875  18.824   <2e-16 ***
QTI            4.439      1.876   2.366   0.0183 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27.4 on 607 degrees of freedom
Multiple R-squared: 0.00914, Adjusted R-squared: 0.007508
F-statistic: 5.599 on 1 and 607 DF, p-value: 0.01828
-As a result, we can write the equation with the estimates we obtained:
* Formula: y = a + b(x)
* a = 129.420 (the intercept)
* b = 4.439
(b is the slope: for every 1-unit increase in QTI, SBP increases by 4.439 on average)
So, the equation would be:
* Y = 129.420 + 4.439(X)
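As a small illustration, the equation can be turned into a function that returns the predicted SBP for a given QTI. This is only a sketch: the coefficients are copied from the output above, while the helper name `predict_sbp` and the QTI values 3 and 4 are hypothetical choices made for the example.

```r
# Coefficients copied from the summary(lmQTI) output
intercept <- 129.420
slope_qti <- 4.439

# Predicted SBP from the equation Y = a + b(X)
# (predict_sbp is a hypothetical helper name)
predict_sbp <- function(qti) {
  intercept + slope_qti * qti
}

# Illustrative QTI values, not taken from the dataset
predict_sbp(c(3, 4))

# Raising QTI by 1 unit raises the predicted SBP by exactly the slope
predict_sbp(4) - predict_sbp(3)
```

With the fitted model in memory, `predict(lmQTI, newdata = data.frame(QTI = c(3, 4)))` should give the same predictions, up to rounding of the coefficients.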
2. What is the coefficient of determination? What does it mean?
-The coefficient of determination is R2 = 0.009
Interpretation:
" __% of the variability of the outcome can be explained by the model "
"0.9% of the variability of SBP (blood pressure) can be explained by the model"
3. Is the slope for QTI statistically significant?
-The slope for QTI is 4.439
-Its p-value is 0.0183 (less than 0.05)
Interpretation:
*The p-value indicates a statistically significant association between QTI and SBP (which is the same as saying that "the slope of the regression line is significantly different from zero")
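The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. The sketch below uses only numbers copied from the output; the variable names are illustrative, and the 95% confidence interval is an addition that the exercise does not ask for.

```r
# Values copied from the QTI row of the coefficient table
estimate  <- 4.439
std_error <- 1.876
df_resid  <- 607   # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# 95% confidence interval for the slope
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

round(c(t = t_value, p = p_value), 4)
ci_95
```

An interval that excludes zero leads to the same conclusion as a p-value below 0.05.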
Testing the assumptions (linearity, homogeneity of variance, normality…)
First we need to calculate the residuals and the estimated outcome (fitted values) and store them as new variables:
evans$mod_resid <- residuals(lmQTI)
evans$mod_fitted <- fitted(lmQTI)
4. Is the association between SBP and QTI a linear association?
We plot the residuals against the exposure (independent variable)
#Code in R:
# Make a scatterplot
plot(evans$QTI, evans$mod_resid, main="Scatterplot")
# add a regression line
abline(lm(mod_resid~QTI, data = evans), col = "blue")
Because the points form a cloud with no visible pattern, we assume linearity
5. Is the variance of the residuals homogeneous?
We plot the residuals against the estimated outcome (fitted values)
#Code in R:
# Make a scatterplot
plot(evans$mod_fitted, evans$mod_resid, main="Scatterplot")
# add a regression line
abline(lm(mod_resid ~ mod_fitted, data = evans), col = "blue")
Because the points form a cloud with no visible pattern (such as a funnel shape), we assume homogeneity of variance
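R also offers built-in diagnostic plots for `lm` objects, which can complement the manual scatterplots used in steps 4 and 5. The sketch below wraps two of them in a hypothetical helper, `plot_residual_checks`; it is an alternative view, not the method used in this exercise.

```r
# Draw two of R's built-in diagnostic plots for an lm() model
# (plot_residual_checks is a hypothetical helper name)
plot_residual_checks <- function(model) {
  old_par <- par(mfrow = c(1, 2))  # two panels side by side
  on.exit(par(old_par))            # restore the layout afterwards
  # which = 1: residuals vs fitted values
  # which = 3: scale-location plot, sqrt(|standardized residuals|) vs fitted
  plot(model, which = c(1, 3))
}
```

For the model from step 1 the call would be `plot_residual_checks(lmQTI)`. A roughly flat smoother in both panels is consistent with linearity and homogeneous variance.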
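6. Are the residuals normally distributed?
# Histogram of the residuals
hist(evans$mod_resid, main="Distribution of the residuals",
xlab="Residuals", ylab="Frequency", col="green", border="dark green")
They do not seem to be normally distributed (the residual summary in step 1 already points to a long right tail: minimum -54.9, maximum 155.5)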
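A normal Q-Q plot and the Shapiro-Wilk test are common complements to the histogram. They are not part of the original exercise, so the sketch below is only a suggestion; `check_normality` is a hypothetical helper name.

```r
# Q-Q plot plus Shapiro-Wilk test for a vector of residuals
# (check_normality is a hypothetical helper name)
check_normality <- function(res) {
  qqnorm(res, main = "Normal Q-Q plot of the residuals")
  qqline(res, col = "blue")  # points should follow this line if normal
  shapiro.test(res)          # small p-value: evidence against normality
}
```

For this exercise the call would be `check_normality(evans$mod_resid)`. With several hundred observations the test can flag even mild departures, so the plot is usually the more informative of the two.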
7. Are the observations independent?
We don't need any calculation for this.
Because we are not working with sequential measurements (in time or in space),
we can assume the observations are independent
8. Are there any extreme values?
For this we need to make a scatterplot of the residuals against the estimated outcome (fitted values); we already did this in step 5
For discussion: there seem to be some extreme values.
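To support the discussion with numbers, one option is to flag observations whose standardized residual is large in absolute value. The cutoff of 3 used below is a common rule of thumb, not something defined in this exercise, and `flag_extreme` is a hypothetical helper name.

```r
# Row positions whose standardized residual exceeds the cutoff
# (flag_extreme and the cutoff of 3 are assumptions for illustration)
flag_extreme <- function(model, cutoff = 3) {
  std_res <- rstandard(model)      # residuals scaled by their standard error
  which(abs(std_res) > cutoff)
}
```

For the model from step 1, `evans[flag_extreme(lmQTI), ]` would list the flagged participants.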
9. Are there any influential observations?
For this we need to make a scatterplot of the residuals against the exposure; we already did this in step 4
For discussion: there seem to be some influential observations.
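Influence can also be quantified with Cook's distance, which combines the size of the residual with the leverage of the observation. This is a different technique from the scatterplot used above. The threshold 4/n below is one common rule of thumb, and `flag_influential` is a hypothetical helper name.

```r
# Row positions whose Cook's distance exceeds 4 / n
# (flag_influential and the 4 / n threshold are assumptions for illustration)
flag_influential <- function(model) {
  d <- cooks.distance(model)  # one value per observation
  n <- length(d)
  which(d > 4 / n)
}
```

For the model from step 1 the call would be `flag_influential(lmQTI)`. Refitting the model without the flagged rows shows how much the slope depends on them.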
10. Does the model significantly improve when the variable AGE is added?
We do this by fitting a multiple linear regression
#Code for linear regression:
lmQTI_AGE <- lm(SBP ~ QTI + AGE, data = evans)
summary(lmQTI_AGE)
Call:
lm(formula = SBP ~ QTI + AGE, data = evans)
Residuals:
    Min      1Q  Median      3Q     Max
-57.645 -15.916  -3.985  11.393 160.285
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  73.4912     9.2816   7.918 1.16e-14 ***
QTI           5.6031     1.7809   3.146  0.00173 **
AGE           0.9630     0.1139   8.452  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.93 on 606 degrees of freedom
Multiple R-squared: 0.1136, Adjusted R-squared: 0.1107
F-statistic: 38.84 on 2 and 606 DF, p-value: < 2.2e-16
-A. As a result, we can write the new equation with the estimates we obtained:
* y = a + b1(x1) + b2(x2)
* a = 73.4912
* b1 = 5.6031
* b2 = 0.9630
So, the equation would be:
* Y = 73.4912 + 5.6031(QTI) + 0.9630(AGE)
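In this model each slope is the expected change in SBP when that variable increases by 1 unit while the other is held constant. A minimal sketch with the coefficients copied from the output; the helper name `predict_sbp_adj` and the QTI and AGE values are hypothetical.

```r
# Coefficients copied from the summary(lmQTI_AGE) output
a  <- 73.4912  # intercept
b1 <- 5.6031   # slope for QTI
b2 <- 0.9630   # slope for AGE

# Predicted SBP from Y = a + b1(QTI) + b2(AGE)
# (predict_sbp_adj is a hypothetical helper name)
predict_sbp_adj <- function(qti, age) {
  a + b1 * qti + b2 * age
}

# Same illustrative QTI, ages 10 years apart:
# the difference is 10 times the AGE slope
predict_sbp_adj(3.5, 60) - predict_sbp_adj(3.5, 50)
```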
-B. We also find the R2
* R2 is the coefficient of determination.
* R2 = 0.1136
Interpretation:
"11% of the variability of SBP can be explained by the model"
-C. Does the model significantly improve when the variable AGE is added?
* It improves (R2 rises from 0.009 to 0.114), but that alone does not tell us whether the improvement is statistically significant.
To find out, we can build an ANOVA table comparing the 2 models
#Code for ANOVA
anova(lmQTI, lmQTI_AGE)
Analysis of Variance Table
Model 1: SBP ~ QTI
Model 2: SBP ~ QTI + AGE
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1    607 455567
2    606 407530  1     48037 71.432 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p < 2.2e-16: the second model differs significantly from the first, so including AGE has significantly improved the model.
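The F statistic in the ANOVA table compares the drop in the residual sum of squares (RSS) with the residual variance of the larger model. The sketch below recomputes it from the printed values; the variable names are illustrative.

```r
# Values copied from the ANOVA table
rss_small <- 455567  # RSS of Model 1 (SBP ~ QTI)
rss_large <- 407530  # RSS of Model 2 (SBP ~ QTI + AGE)
df_extra  <- 1       # number of parameters added (AGE)
df_resid  <- 606     # residual degrees of freedom of Model 2

# F = (drop in RSS per added parameter) / (residual variance of Model 2)
f_value <- ((rss_small - rss_large) / df_extra) / (rss_large / df_resid)

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_extra, df_resid, lower.tail = FALSE)

round(f_value, 2)
p_value
```

Because only one variable was added, this F value is also the square of the t value for AGE in the model summary (8.452 squared is about 71.4).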
11. Which proportion of the variability in SBP is explained by this model?
We already did this in step 10:
* R2 is the “Determination coefficient”.
* R2 = 0.1136
Interpretation:
“11% of the variability of the SBP could be explained by the model”
12. Please write down the equation of the model
We already did this in step 10:
* y = a + b1(x1) + b2(x2)
* a = 73.4912
* b1 = 5.6031
* b2 = 0.9630
So, the formula would be:
* Y = 73.4912 + 5.6031(QTI) + 0.9630(AGE)
Finish =)