2023-02-01. Session 2

Linear Regression - Session 2

Writer
Affiliation
Javier Silva-Valencia

Instituut Voor Tropische Geneeskunde. Antwerp-Belgium

Published

2023-02-01

Abstract
The exercise asks to review again the concepts of intercept, slope and coefficient of determination. However this time in a multivariate linear regression (for 2 independent variables). But first, to know if we can use linear regression or not, we have to test its assumptions (linearity, variance homogeneity, normality and independence of observations). Likewise to use linear regression correctly we need to check for exceptional values (outliers), influential observations or if we need to do any transformation of the variable.


Starting the Exercise

Instructions:

Use database evans_county.csv. Explore the data. There is a variable QTI,’ Quetelet Index’ which is equivalent to the BMI, weight in kg divided by the squared height in meters.

Is the Queteled Index(QTI) associated with systolic blood pressure(SBP)?

  • Independent variable: QTI
  • Dependent variable: SBP

Import data

Importing a CSV database under the name of “evans”, with “,” as separator and “.” as decimal:

evans <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Linear Regresion Exercise Database/Datasets/evans_county.csv", sep=",", dec= ".")

We assume that all variables ok to start (we dont need to transform or create variables)

Explore the data

It is a good practice to explore the data (Doing bivariate analysis). In this case we can do it quicky with a scatterplot

#Code in R:          
  # Make a scatterplot
  plot(evans$QTI, evans$SBP, main="Scatterplot")
  # add a regression line
  abline(lm(SBP~QTI, data = evans), col = "blue")

Figure 1:

Note

We can see that there could be a correlation, although it doesn’t seem very strong.

Question 1

What is the regression equation?

First we are going to do the regression analysis for only SBP and QTI (bivariate)

#Code for linear regresion:
  lmQTI = lm(SBP ~ QTI, data = evans)
  summary(lmQTI)

Call:
lm(formula = SBP ~ QTI, data = evans)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.855 -19.634  -5.213  14.743 155.497 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  129.420      6.875  18.824   <2e-16 ***
QTI            4.439      1.876   2.366   0.0183 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.4 on 607 degrees of freedom
Multiple R-squared:  0.00914,   Adjusted R-squared:  0.007508 
F-statistic: 5.599 on 1 and 607 DF,  p-value: 0.01828
Note

-As a result: we can create the equation with the data we obtained

* Formula:  y = a+b(x)
* a = 129.420
* b = 4.439
    (b is the slope, 
    it means that for every increase of 1 in QTI the SBP increase in 4.43)

So, the equation would be:
* Y= 129.420 + 4.439(X)

Question 2

How much is the determination coefficient? What does it mean?

Note

-The determination coefficient is R2 = 0.009

Interpretation: 
" __%  of the variability of the outcome could be explained by the model "
"0.9% of the variability of the SBP(blood preasure) could be explained by the model"

Question 3

Is the slope for QTI statistically significant?

Note

-The slope for QTI is 4.439, and its p value is 0.0183 (less than 0.05)

Interpretation: 
*The p-value in this case says that there is a statistical association between QTI and SBP (ie, the same as saying that "the slope of the regression line is significantly different from zero)"

Testing assumptions

In order to say whether it is correct to use linear regression, we must first know if certain conditions are met. These conditions are called assumptions.

What are the assumptions to decide whether or not to use linear regression? In addition that our dependent variables need to be a continuous numerical variable, we seek to comply with:

  1. Linearity
  2. Homogeneity of variance
  3. Normality
  4. Independence of observations

For the first 3 we have to check the residuals. The last one is a theoretical concept of what our data is like.

To test for the assumptions first we need to calculate the residuals.

The residuals can be put as another variable in the data set and also with them the estimated outcome(fitted) can be calculated. Both as new variables

#Creating the residuals and fitted values as new variables
evans$mod_resid <- residuals(lmQTI)
evans$mod_fitted <- fitted(lmQTI)

Question 4

Checking for Linearity: Is the association between SBP and QTI a linear association?

For this we scatterplot the residuals(new variable) against the exposure(the independent variable)

#Code in R:          
  # Make a scatterplot
  plot(evans$QTI, evans$mod_resid, main="Scatterplot")
  # add a regression line
  abline(lm(mod_resid~QTI, data = evans), col = "blue")

Figure 2:

Interpretation

For Linearity we scatterplot the residuals against the exposure(VI)

In this case we see all the dots as a cloud, this suggests that we can assume linearity

Question 5

Checking for Homogeneity of variance: Is the variance of the residuals homogeneous?

We scatterplot the residuals against the estimated outcome(fitted)

#Code in R:          
  # Make a scatterplot
  plot(evans$mod_fitted, evans$mod_resid, main="Scatterplot")
  # add a regression line
  abline(lm(mod_resid ~ mod_fitted, data = evans), col = "blue")

Figure 3:

Interpretation

For Homogeneity of variance we scatterplot the residuals against estimated outcome(fitted)

In this case we see all the dots as a cloud, then we assume homogeneity

Question 6

Checking for Normality: Are the residuals normally distributed?

#Doing an histogram of the residuals
  hist(evans$mod_resid,main="Age distribution of study participants", ylab="nb", col="green", border="dark green")

Figure 4:

Interpretation

For checking normality we do an histogram of the residuals

In this case it seems not, but it is debatable

Question 7

Checking for Independence: Are the observations independent?

Interpretation

We don’t need to do any calculation for this.

Because we are not working with sequential measures (in time or in space): The observations are independent

Rare values

Question 8

Are there any extreme values?

Reply

For answer this we need to made a scatter plot of the residuals against the estimated outcome (fitted) (we already did this in Figure 3 )

For discussion. It does seem to have some extreme values.

Question 9

Are there any influential observation?

Reply

For answer this we need to made a scatter plot of the residuals against the exposure (we already did this in Figure 2 )

For discussion. It does seem to have some influential observations.

Question 10

Does the model significantly improve when the variable AGE is added?

The exercise is asking to doing an multivariate linear regression

#Code for linear regresion:
  lmQTI_AGE = lm(SBP ~ QTI + AGE, data = evans)
  summary(lmQTI_AGE)

Call:
lm(formula = SBP ~ QTI + AGE, data = evans)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.645 -15.916  -3.985  11.393 160.285 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  73.4912     9.2816   7.918 1.16e-14 ***
QTI           5.6031     1.7809   3.146  0.00173 ** 
AGE           0.9630     0.1139   8.452  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.93 on 606 degrees of freedom
Multiple R-squared:  0.1136,    Adjusted R-squared:  0.1107 
F-statistic: 38.84 on 2 and 606 DF,  p-value: < 2.2e-16
Note

-A. As a result: we can create the new formula with the data we obtained

* y = a + b1(x1) + b2(x2)
* a= 73.4912
* b1 = 5.6031
* b2 = 0.9630
So, the formula would be:
* Y= 73.4912 + 5.6031(QTI) + 0.9630(AGE)

-B. We also find the R2

* R2 is the "Determination coefficient". 
* R2 =  0.1136
Interpretation: 
"11% of the variability of the SBP could be explained by the model"

-C. Does the model significantly improve when the variable AGE is added?

* It improves, but I dont know if it is a significantly improvement. 

To know if it is a significant improvement, we can do an anova table comparing the 2 models

#Code for ANOVA
    anova (lmQTI,lmQTI_AGE)
Analysis of Variance Table

Model 1: SBP ~ QTI
Model 2: SBP ~ QTI + AGE
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    607 455567                                  
2    606 407530  1     48037 71.432 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reply

p = 2.2e-16

This means that the second model has a significant difference to the first model, so including AGE has increased significantly the model.

Question 11

Which proportion of the variability in SBP is explained by this model?

Reply

They are asking for the R2 (“Determination coefficient”)

We already did it in question 10

R2 = 0.1136

Interpretation: “11% of the variability of the SBP could be explained by the model”

Question 12

Please write down the equation of the model

Reply

We already did it in question 10

y = a + b1(x1) + b2(x2)

  • a= 73.4912
  • b1 = 5.6031
  • b2 = 0.9630

So, the formula would be:

Y= 73.4912 + 5.6031(QTI) + 0.9630(AGE)

Finish =)