2023-02-01. Session 2
Linear Regression - Session 2
Starting the Exercise
Instructions:
Use the database evans_county.csv. Explore the data. There is a variable QTI (“Quetelet Index”), which is equivalent to the BMI: weight in kg divided by the square of the height in meters.
Is the Quetelet Index (QTI) associated with systolic blood pressure (SBP)?
- Independent variable: QTI
- Dependent variable: SBP
Import data
Import the CSV database under the name “evans”, with “,” as the separator and “.” as the decimal mark.
We assume that all variables are ready to use (we don't need to transform or create variables).
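The import step can be sketched in R as below. The file location is an assumption (the CSV is expected in the working directory); the separator and decimal mark are the ones stated in the text.

```r
# Assumption: the file sits in the current working directory.
csv_path <- "evans_county.csv"

if (file.exists(csv_path)) {
  # "," as field separator and "." as decimal mark, as described above
  evans <- read.csv(csv_path, sep = ",", dec = ".")
  str(evans)                                # variable names and types
  print(summary(evans[, c("SBP", "QTI")]))  # quick numeric overview
} else {
  message("File not found: ", csv_path)
}
```

The `file.exists()` guard only keeps the sketch from failing when the file is somewhere else; adjust `csv_path` as needed.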
Explore the data
It is good practice to explore the data first (bivariate analysis). In this case we can do it quickly with a scatterplot of SBP against QTI.
We can see that there could be a correlation, although it doesn’t seem very strong.
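A minimal sketch of such a scatterplot in base R follows. The helper name `plot_sbp_vs_qti` is hypothetical, and the file location is again an assumption.

```r
# Hypothetical helper: scatterplot of SBP against QTI for a data frame
# that contains those two columns.
plot_sbp_vs_qti <- function(data) {
  plot(data$QTI, data$SBP,
       xlab = "QTI (Quetelet Index)",
       ylab = "SBP (systolic blood pressure)",
       main = "SBP against QTI")
  invisible(NULL)
}

# Assumption: evans_county.csv is in the working directory.
if (file.exists("evans_county.csv")) {
  evans <- read.csv("evans_county.csv", sep = ",", dec = ".")
  plot_sbp_vs_qti(evans)
  print(cor(evans$QTI, evans$SBP))  # Pearson correlation as a numeric summary
}
```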
Question 1
What is the regression equation?
First we are going to do the regression analysis for only SBP and QTI (bivariate)
Call:
lm(formula = SBP ~ QTI, data = evans)
Residuals:
    Min      1Q  Median      3Q     Max
-54.855 -19.634  -5.213  14.743 155.497
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  129.420      6.875  18.824   <2e-16 ***
QTI            4.439      1.876   2.366   0.0183 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27.4 on 607 degrees of freedom
Multiple R-squared: 0.00914, Adjusted R-squared: 0.007508
F-statistic: 5.599 on 1 and 607 DF, p-value: 0.01828
-As a result: we can write the equation with the coefficients we obtained
* Formula: y = a + b(x)
* a = 129.420 (the intercept: the predicted SBP when QTI is 0)
* b = 4.439 (the slope: for every increase of 1 unit in QTI, SBP increases by 4.439 on average)
So, the equation would be:
* Y = 129.420 + 4.439(X), that is, SBP = 129.420 + 4.439(QTI)
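To see what the equation does, here is a small sketch that plugs values into it. The coefficients are copied from the output above; the QTI values and the function name `predict_sbp` are illustrative assumptions, not taken from the data set.

```r
# Coefficients copied from the regression output
a <- 129.420   # intercept
b <- 4.439     # slope for QTI

# Predicted SBP for a given QTI value
predict_sbp <- function(qti) a + b * qti

# Illustrative QTI values (hypothetical, not taken from the data set)
qti_values <- c(3, 4)
print(predict_sbp(qti_values))

# A one-unit increase in QTI changes the prediction by exactly the slope
print(diff(predict_sbp(qti_values)))
```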
Question 2
How much is the determination coefficient? What does it mean?
-The determination coefficient is R2 = 0.00914 (about 0.009)
Interpretation:
" __% of the variability of the outcome could be explained by the model "
"0.9% of the variability of the SBP (blood pressure) could be explained by the model"
This is a very small share: QTI alone explains almost none of the variability in SBP.
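As a cross-check, the other summary figures in the output can be recomputed from R2 and the degrees of freedom. This is only a sketch using the rounded values printed above, so the results agree up to rounding.

```r
r2     <- 0.00914   # Multiple R-squared from the output
df_res <- 607       # residual degrees of freedom from the output
p      <- 1         # number of predictors (QTI)
n      <- df_res + p + 1   # number of observations implied by the output

r2_percent <- 100 * r2                        # R2 expressed as a percentage
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res     # adjusted R-squared
f_stat <- (r2 / p) / ((1 - r2) / df_res)      # overall F statistic

print(c(n = n, r2_percent = r2_percent, adj_r2 = adj_r2, f_stat = f_stat))
```

The recomputed adjusted R-squared and F statistic should be close to the 0.007508 and 5.599 reported by `summary()`.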
Question 3
Is the slope for QTI statistically significant?
-Yes. The slope for QTI is 4.439 and its p-value is 0.0183 (less than 0.05)
Interpretation:
* The p-value in this case says that there is a statistically significant association between QTI and SBP (i.e., the same as saying that "the slope of the regression line is significantly different from zero").
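The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. The sketch below uses the rounded figures from the output, so small rounding differences are expected.

```r
estimate  <- 4.439   # slope for QTI
std_error <- 1.876   # its standard error
df_res    <- 607     # residual degrees of freedom

# Test statistic for H0: slope = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_res)

print(c(t_value = t_value, p_value = p_value))

# With a single predictor, the squared t value equals the overall F statistic
print(t_value^2)
```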
Testing assumptions
To decide whether it is appropriate to use linear regression, we must first check whether certain conditions are met. These conditions are called assumptions.
What are the assumptions of linear regression? In addition to the dependent variable being a continuous numerical variable, we need:
- Linearity
- Homogeneity of variance
- Normality
- Independence of observations
For the first 3 we have to check the residuals. The last one depends on how the data were collected (the study design), not on a calculation.
To test the assumptions, we first need to calculate the residuals.
The residuals (observed minus estimated outcome) and the estimated outcome (fitted values) can both be saved as new variables in the data set.
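A minimal sketch of saving residuals and fitted values as new variables is shown here. So that it runs on its own, it uses a simulated stand-in data frame (`toy`, which is NOT the Evans County data); the column names `res` and `fit` are assumptions.

```r
# Simulated stand-in data (NOT the Evans County data): used only so the
# sketch runs on its own. With the real data, use `evans` instead of `toy`.
set.seed(1)
toy <- data.frame(QTI = runif(50, min = 2, max = 5))
toy$SBP <- 130 + 4 * toy$QTI + rnorm(50, sd = 25)

model <- lm(SBP ~ QTI, data = toy)

# Store residuals and fitted values as new variables in the data set
toy$res <- unname(residuals(model))
toy$fit <- unname(fitted(model))

# Observed value = fitted value + residual, for every row
print(all.equal(toy$SBP, toy$fit + toy$res))
print(head(toy))
```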
Question 4
Checking for Linearity: Is the association between SBP and QTI a linear association?
For this we scatterplot the residuals (new variable) against the exposure (the independent variable).
[Figure 2: scatterplot of the residuals against the exposure (QTI), used to check linearity]
In this case the dots form a cloud with no visible curve or trend, which suggests that we can assume linearity.
Question 5
Checking for Homogeneity of variance: Is the variance of the residuals homogeneous?
We scatterplot the residuals against the estimated outcome (fitted).
[Figure 3: scatterplot of the residuals against the estimated outcome (fitted), used to check homogeneity of variance]
In this case the dots form a cloud whose vertical spread looks roughly similar across the fitted values, so we assume homogeneity.
Question 6
Checking for Normality: Are the residuals normally distributed?
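To check normality we draw a histogram of the residuals.
In this case they do not look clearly normal, but it is debatable. The residual summary of the first model points the same way: the maximum (155.497) is much farther from the median (-5.213) than the minimum (-54.855), which suggests a right-skewed distribution.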
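The three residual checks from Questions 4 to 6 can be sketched together in base R. As an assumption for the sake of a runnable example, the code uses a simulated stand-in data frame (`toy`), not the Evans County data, so the resulting plots only illustrate the mechanics.

```r
# Simulated stand-in data (NOT the Evans County data)
set.seed(1)
toy <- data.frame(QTI = runif(50, min = 2, max = 5))
toy$SBP <- 130 + 4 * toy$QTI + rnorm(50, sd = 25)

model <- lm(SBP ~ QTI, data = toy)
res <- residuals(model)   # residuals
fit <- fitted(model)      # estimated outcome

op <- par(mfrow = c(1, 3))   # three panels side by side

# Linearity: residuals against the exposure
plot(toy$QTI, res, xlab = "QTI", ylab = "Residuals", main = "Linearity")
abline(h = 0, lty = 2)

# Homogeneity of variance: residuals against the fitted values
plot(fit, res, xlab = "Fitted SBP", ylab = "Residuals",
     main = "Homogeneity of variance")
abline(h = 0, lty = 2)

# Normality: histogram of the residuals
hist(res, xlab = "Residuals", main = "Normality")

par(op)   # restore the previous plot layout
```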
Question 7
Checking for Independence: Are the observations independent?
We don't need to do any calculation for this.
Because we are not working with sequential measurements (in time or in space), we can assume that the observations are independent.
Unusual values
Question 8
Are there any extreme values?
To answer this we need to make a scatterplot of the residuals against the estimated outcome (fitted); we already did this in Figure 3.
For discussion: there do seem to be some extreme values (the largest residual of the first model is 155.497).
Question 9
Are there any influential observations?
To answer this we need to make a scatterplot of the residuals against the exposure; we already did this in Figure 2.
For discussion: there do seem to be some influential observations.
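The exercise answers Questions 8 and 9 visually. As a possible numeric complement (not the method used in this session), base R offers standardized residuals and Cook's distance. The sketch below runs on a simulated stand-in data frame, not the Evans County data, and the cut-offs are common rules of thumb chosen here as assumptions.

```r
# Simulated stand-in data (NOT the Evans County data)
set.seed(1)
toy <- data.frame(QTI = runif(50, min = 2, max = 5))
toy$SBP <- 130 + 4 * toy$QTI + rnorm(50, sd = 25)

model <- lm(SBP ~ QTI, data = toy)

std_res <- rstandard(model)        # standardized residuals
cooks   <- cooks.distance(model)   # Cook's distance per observation

# Rule-of-thumb cut-offs (assumptions, open to discussion)
extreme     <- which(abs(std_res) > 2)        # candidate extreme values
influential <- which(cooks > 4 / nrow(toy))   # candidate influential points

print(extreme)
print(influential)
```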
Question 10
Does the model significantly improve when the variable AGE is added?
The exercise is asking us to fit a multiple (multivariable) linear regression, with QTI and AGE as independent variables.
Call:
lm(formula = SBP ~ QTI + AGE, data = evans)
Residuals:
    Min      1Q  Median      3Q     Max
-57.645 -15.916  -3.985  11.393 160.285
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  73.4912     9.2816   7.918 1.16e-14 ***
QTI           5.6031     1.7809   3.146  0.00173 **
AGE           0.9630     0.1139   8.452  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.93 on 606 degrees of freedom
Multiple R-squared: 0.1136, Adjusted R-squared: 0.1107
F-statistic: 38.84 on 2 and 606 DF, p-value: < 2.2e-16
-A. As a result: we can create the new formula with the data we obtained
* y = a + b1(x1) + b2(x2)
* a= 73.4912
* b1 = 5.6031
* b2 = 0.9630
So, the formula would be:
* Y= 73.4912 + 5.6031(QTI) + 0.9630(AGE)
-B. We also find the R2
* R2 is the "Determination coefficient".
* R2 = 0.1136
Interpretation:
"11% of the variability of the SBP could be explained by the model"
-C. Does the model significantly improve when the variable AGE is added?
* It improves (R2 goes from 0.009 to 0.114), but this alone does not tell us whether the improvement is statistically significant.
To know if it is a significant improvement, we can use an ANOVA table comparing the 2 models.
Analysis of Variance Table
Model 1: SBP ~ QTI
Model 2: SBP ~ QTI + AGE
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1    607 455567
2    606 407530  1     48037 71.432 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p < 2.2e-16
This means that the second model fits significantly better than the first one, so including AGE has significantly improved the model.
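The F statistic in the ANOVA table can be recomputed by hand from the residual sums of squares (RSS) of the two models. This sketch uses the rounded values printed in the table; with the fitted models available, the table itself would typically come from a call such as `anova(model1, model2)`, where both object names are assumptions.

```r
# Values copied from the ANOVA table
rss1 <- 455567; df1 <- 607   # Model 1: SBP ~ QTI
rss2 <- 407530; df2 <- 606   # Model 2: SBP ~ QTI + AGE

df_diff <- df1 - df2         # number of added parameters (AGE)

# F = (reduction in RSS per added parameter) / (residual variance of model 2)
f_stat <- ((rss1 - rss2) / df_diff) / (rss2 / df2)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_diff, df2, lower.tail = FALSE)

print(c(f_stat = f_stat, p_value = p_value))
```

Because only one variable was added, this F statistic also matches the squared t value of AGE in the second model (8.452 squared is about 71.4).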
Question 11
Which proportion of the variability in SBP is explained by this model?
They are asking for the R2 (“Determination coefficient”)
We already did it in question 10
R2 = 0.1136
Interpretation: “11% of the variability of the SBP could be explained by the model”
Question 12
Please write down the equation of the model
We already did it in question 10
y = a + b1(x1) + b2(x2)
- a= 73.4912
- b1 = 5.6031
- b2 = 0.9630
So, the formula would be:
Y= 73.4912 + 5.6031(QTI) + 0.9630(AGE)
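As a usage sketch, the equation can be turned into a small function. The coefficients come from the output of the second model; the function name `predict_sbp` and the example QTI and AGE values are hypothetical.

```r
# Coefficients copied from the output of the second model
a  <- 73.4912   # intercept
b1 <- 5.6031    # slope for QTI
b2 <- 0.9630    # slope for AGE

# Predicted SBP for given QTI and AGE
predict_sbp <- function(qti, age) a + b1 * qti + b2 * age

# Hypothetical person (values chosen only for illustration)
print(predict_sbp(qti = 3.5, age = 50))

# Holding QTI fixed, 10 more years of age add 10 * b2 to the prediction
print(predict_sbp(3.5, 60) - predict_sbp(3.5, 50))
```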
Finish =)