2023-02-02 Linear Regression, Session 3

Step by Step

Import data

Import the CSV dataset under the name “cholest”, with “,” as the field separator and “.” as the decimal mark:

cholest <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Linear Regresion Exercise Database/Datasets/Cholesterol.csv", sep=",", dec= ".")
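A quick way to confirm that the separator and decimal settings were interpreted correctly is to look at the column types after import. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary file (the values and the `demo` name are hypothetical, not the course data) and reads it back with the same `sep` and `dec` arguments.

```r
# Hypothetical mini-dataset written to a temporary file (not the real Cholesterol.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("cholesterol,activity,occupation",
             "3.5,10,1",
             "4.2,0,3"), tmp)

# Same sep/dec arguments as the import above
demo <- read.csv(tmp, sep = ",", dec = ".")

# With the right sep/dec, measurements come in as numbers, not as text
str(demo)
is.numeric(demo$cholesterol)
```

If a measurement column shows up as character, the decimal mark or the separator is the first thing to check.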


Starting the Exercise:

We want to answer the question: how do we best explain the variability of cholesterol with the data we have?

1. We explore the variables - Cleaning

1.1 Cholesterol

hist(cholest$cholesterol)

Note

-Cholesterol seems ok

1.2 Activity

hist(cholest$activity)

summary(cholest$activity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    5.00    8.50   10.92   15.50   26.00 
Note

-Activity seems ok

1.3 Occupation

1.3.1 First we see the values of the variable if they make sense

table(cholest$occupation)

1 2 3 4 
4 7 6 7 
Note

-Seems ok: occupation should only have 4 categories, and there are 4 categories in my data

1.3.2 Second, we check whether the variable is stored as categorical or numerical, as needed. Occupation is stored as numerical, so it needs to be changed to a factor

cholest$occupation_f <- factor(cholest$occupation)

1.3.3 Then, we make sure that the first category of the variable is the reference category

# factor() sorts the levels; lm() uses the first level (here 1) as the reference category
cholest$occupation_f <- factor(cholest$occupation)
Note

-Occupation seems ok
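The checks in 1.3 can be sketched on a small example. The codes below are made up for illustration (they are not the course data), and the names `occupation`, `unexpected` and `occupation_ref3` are hypothetical. The sketch shows how to spot unexpected codes, how to see which level is the reference, and how `relevel()` could be used if a different reference category were wanted.

```r
# Hypothetical occupation codes (made up for illustration, not the course data)
occupation <- c(1, 2, 3, 4, 2, 3, 4, 1)

# Any code outside the expected 1-4 range would be listed here
unexpected <- setdiff(unique(occupation), 1:4)
unexpected

occupation_f <- factor(occupation)

# The first level is the reference category in lm() under the default treatment contrasts
levels(occupation_f)[1]

# relevel() moves another category to the front if a different reference is wanted
occupation_ref3 <- relevel(occupation_f, ref = "3")
levels(occupation_ref3)
```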

Doing the modeling - linear regression - Method: change-in-estimate

Model with only the primary exposure variable

mod1 <- lm(cholesterol ~ activity, data=cholest)
summary(mod1)

Call:
lm(formula = cholesterol ~ activity, data = cholest)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85397 -0.49306 -0.06166  0.34349  1.25118 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.05397    0.22507  18.012 1.17e-14 ***
activity    -0.06410    0.01704  -3.762  0.00108 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6205 on 22 degrees of freedom
Multiple R-squared:  0.3914,    Adjusted R-squared:  0.3638 
F-statistic: 14.15 on 1 and 22 DF,  p-value: 0.001077

Model with two independent variables

mod2 <- lm(cholesterol ~ activity + age, data=cholest)
#mod3 <- lm(cholesterol ~ activity + bmi_cat, data=cholest)
#mod4 <- lm(cholesterol ~ activity + sex_f, data=cholest)
#mod5 <- lm(cholesterol ~ activity + occupation_f, data=cholest)

summary(mod2)
#summary(mod3)
#summary(mod4)
#summary(mod5)

After looking at the percent change in the activity coefficient, we see that only “activity+age” and “activity+BMI” change it by more than 10%
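The percent change can be computed directly from the printed coefficients. The sketch below uses the activity estimate from mod1 (-0.06410) and the one from the activity + age fit (-0.01687, as printed for the same formula further down). The helper name `pct_change` is hypothetical, and using the crude estimate as the denominator is an assumption of this sketch.

```r
# Percent change in the exposure coefficient when a covariate is added.
# Assumption: the crude (unadjusted) estimate is used as the denominator.
pct_change <- function(b_crude, b_adjusted) {
  100 * (b_adjusted - b_crude) / b_crude
}

b_crude <- -0.06410  # activity coefficient in mod1
b_age   <- -0.01687  # activity coefficient in the activity + age model

change_age <- pct_change(b_crude, b_age)
round(change_age, 1)

# Compare the absolute change against the 10% threshold
abs(change_age) > 10
```

With a fitted model the coefficient can be pulled out with `coef(mod1)["activity"]` instead of typing it by hand.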

Multivariate model

#mod6 <- lm(cholesterol ~ activity + age + bmi_cat, data=cholest)
#summary(mod6)

The adjusted effect of activity is: -0.016

Multivariate model without age

#mod7 <- lm(cholesterol ~ activity + bmi_cat, data=cholest)
#summary(mod7)

The adjusted effect of activity now is: -0.029

Is that a substantial change? (-0.029 - (-0.016)) / (-0.016) = 81%. Yes, it is a substantial change, so we should not take age out of the equation

Multivariate model without BMI

mod8 <- lm(cholesterol ~ activity + age, data=cholest)
summary(mod8)

Call:
lm(formula = cholesterol ~ activity + age, data = cholest)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.65374 -0.23438 -0.04463  0.20482  0.54035 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.67711    0.32871   5.102 4.71e-05 ***
activity    -0.01687    0.01078  -1.565    0.132    
age          0.04722    0.00610   7.741 1.39e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3235 on 21 degrees of freedom
Multiple R-squared:  0.8421,    Adjusted R-squared:  0.827 
F-statistic: 55.99 on 2 and 21 DF,  p-value: 3.836e-09

The adjusted effect of activity now is: -0.017

Is that a substantial change? (-0.017 - (-0.016)) / (-0.016) = 6%. No, it is not a substantial change, so we can take BMI out of the equation

So the final model (the simplest one) is cholesterol ~ activity + age (mod8)
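The elimination step used above (drop one covariate from the full model, then compare the activity coefficient with the full-model value) can be sketched in a loop. The data here are simulated purely for illustration and are not the course dataset. The `noise` variable, the effect sizes and the names `sim`, `full` and `changes` are all assumptions of this sketch.

```r
set.seed(1)  # simulated data, purely illustrative
n <- 500
age      <- runif(n, 20, 70)
activity <- 30 - 0.4 * age + rnorm(n, sd = 3)  # activity depends on age
noise    <- rnorm(n)                           # covariate unrelated to everything
cholesterol <- 1.5 + 0.05 * age - 0.05 * activity + rnorm(n, sd = 0.3)
sim <- data.frame(cholesterol, activity, age, noise)

# Full model: exposure plus all candidate covariates
full   <- lm(cholesterol ~ activity + age + noise, data = sim)
b_full <- coef(full)["activity"]

# Drop one covariate at a time and compute the percent change
# in the activity coefficient relative to the full model
changes <- sapply(c("age", "noise"), function(v) {
  reduced <- update(full, as.formula(paste(". ~ . -", v)))
  unname(100 * (coef(reduced)["activity"] - b_full) / b_full)
})
round(changes, 1)

# Covariates whose removal shifts the estimate by more than 10% are kept
abs(changes) > 10
```

In this simulation age is built to be related to both activity and cholesterol, so removing it shifts the activity estimate a lot, while removing the unrelated covariate barely moves it.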