2023-02-02 Linear Regression, Session 3
- Instituut Voor Tropische Geneeskunde - Antwerp, Belgium
- Javier Silva-Valencia
Step by Step
Import data
Import the CSV file as a data frame named “cholest”, with “,” as the field separator and “.” as the decimal mark:
cholest <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Linear Regresion Exercise Database/Datasets/Cholesterol.csv", sep=",", dec=".")
Starting the Exercise:
We want to answer the question: how can we best explain the variability of cholesterol with the data we have?
1. We explore the variables - Cleaning
1.1 Cholesterol
hist(cholest$cholesterol)
-Cholesterol seems ok
1.2 Activity
hist(cholest$activity)
summary(cholest$activity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    5.00    8.50   10.92   15.50   26.00
-Activity seems ok
1.3 Occupation
1.3.1 First, we check whether the values of the variable make sense
table(cholest$occupation)
1 2 3 4
4 7 6 7
-Seems ok: occupation should have only 4 categories, and there are 4 categories in the data
1.3.2 Second, we check whether the variable is stored as categorical or numerical, as needed. Occupation is stored as numeric, so it needs to be converted to a factor
cholest$occupation_f <- factor(cholest$occupation)
1.3.3 Then, we make sure that the first category of the variable is the reference category
# factor() sorts the levels, so category 1 is the first level and therefore the reference in lm()
cholest$occupation_f <- factor(cholest$occupation)
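As a minimal sketch of how the reference category can be checked or changed, the snippet below uses a small invented vector of occupation codes (not the course data). The names `occupation` and `occupation_ref2` are hypothetical and only serve the illustration; `levels()` and `relevel()` are base R functions.

```r
# Toy occupation codes, invented for illustration only (not the course data)
occupation <- c(1, 2, 3, 4, 2, 3)

# factor() orders the levels by sorted unique value, so "1" comes first
occupation_f <- factor(occupation)
levels(occupation_f)   # the first level is the reference category in lm()

# relevel() moves a chosen level to the front; here "2" becomes the reference
occupation_ref2 <- relevel(occupation_f, ref = "2")
levels(occupation_ref2)
```

With the exercise data, the same check would be `levels(cholest$occupation_f)`.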
-Occupation seems ok
Modeling - linear regression - Method: change-in-estimate
Model with only the primary exposure variable
mod1 <- lm(cholesterol ~ activity, data=cholest)
summary(mod1)
Call:
lm(formula = cholesterol ~ activity, data = cholest)
Residuals:
     Min       1Q   Median       3Q      Max
-0.85397 -0.49306 -0.06166  0.34349  1.25118
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.05397    0.22507  18.012 1.17e-14 ***
activity    -0.06410    0.01704  -3.762  0.00108 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6205 on 22 degrees of freedom
Multiple R-squared: 0.3914, Adjusted R-squared: 0.3638
F-statistic: 14.15 on 1 and 22 DF, p-value: 0.001077
Models with two independent variables
mod2 <- lm(cholesterol ~ activity + age, data=cholest)
#mod3 <- lm(cholesterol ~ activity + bmi_cat, data=cholest)
#mod4 <- lm(cholesterol ~ activity + sex_f, data=cholest)
#mod5 <- lm(cholesterol ~ activity + occupation_f, data=cholest)
summary(mod2)
#summary(mod3)
#summary(mod4)
#summary(mod5)
After checking the percent change in the activity coefficient for each model, only “activity + age” and “activity + BMI” show a change greater than 10%
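The screening step above can be sketched as a small helper that fits the crude and the adjusted model and reports the percent change in the exposure coefficient. This is only an illustration under stated assumptions: the function name `pct_change_on_adding` is hypothetical, the denominator is taken to be the crude estimate (the text does not say which convention was used at this step), and the example runs on R's built-in `mtcars` data as a stand-in because the cholesterol file is not available here.

```r
# Percent change in the exposure coefficient when one covariate is added.
# Assumption: the change is expressed relative to the crude estimate.
pct_change_on_adding <- function(data, outcome, exposure, covariate) {
  crude    <- lm(reformulate(exposure, response = outcome), data = data)
  adjusted <- lm(reformulate(c(exposure, covariate), response = outcome), data = data)
  b_crude <- coef(crude)[[exposure]]
  b_adj   <- coef(adjusted)[[exposure]]
  100 * (b_adj - b_crude) / b_crude
}

# Stand-in example with the built-in mtcars data (not the cholesterol data):
# outcome = mpg, exposure = wt, candidate covariates = hp, qsec, drat
candidates <- c("hp", "qsec", "drat")
changes <- sapply(candidates, function(v) pct_change_on_adding(mtcars, "mpg", "wt", v))
round(changes, 1)

# Candidates whose inclusion changes the exposure coefficient by more than 10%
names(changes)[abs(changes) > 10]
```

With the exercise data, the call would look like `pct_change_on_adding(cholest, "cholesterol", "activity", "age")`. Under this convention, the coefficients printed in this document (-0.06410 crude, -0.01687 adjusted for age) correspond to a change of about 74% in absolute value.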
Multivariate model
#mod6 <- lm(cholesterol ~ activity + age + bmi_cat, data=cholest)
#summary(mod6)
The adjusted effect of activity is: -0.016
Multivariate model without age
#mod7 <- lm(cholesterol ~ activity + bmi_cat, data=cholest)
#summary(mod7)
The adjusted effect of activity now is: -0.029
Is that a substantial change? (-0.029 - (-0.016)) / (-0.016) = 0.81, i.e. 81%. Yes, it is a substantial change, so we should not take age out of the equation
Multivariate model without BMI
mod8 <- lm(cholesterol ~ activity + age, data=cholest)
summary(mod8)
Call:
lm(formula = cholesterol ~ activity + age, data = cholest)
Residuals:
     Min       1Q   Median       3Q      Max
-0.65374 -0.23438 -0.04463  0.20482  0.54035
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.67711    0.32871   5.102 4.71e-05 ***
activity    -0.01687    0.01078  -1.565    0.132
age          0.04722    0.00610   7.741 1.39e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3235 on 21 degrees of freedom
Multiple R-squared: 0.8421, Adjusted R-squared: 0.827
F-statistic: 55.99 on 2 and 21 DF, p-value: 3.836e-09
The adjusted effect of activity now is: -0.017
Is that a substantial change? (-0.017 - (-0.016)) / (-0.016) = 0.06, i.e. 6%. No, it is not a substantial change, so we can take BMI out of the equation
So the final model (the simplest one) is mod8: cholesterol ~ activity + age
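The two percent changes used in the elimination steps can be re-checked with a few lines of R. This is a sketch only: the helper name `pct_change` is hypothetical, and the coefficients are the rounded values quoted in the text, with the full model (activity + age + bmi_cat) as the reference.

```r
# Relative change of a reduced-model estimate with respect to the full model
pct_change <- function(reduced, full) 100 * (reduced - full) / full

full_model  <- -0.016  # activity + age + bmi_cat (mod6)
without_age <- -0.029  # activity + bmi_cat (mod7)
without_bmi <- -0.017  # activity + age (mod8)

round(pct_change(without_age, full_model))  # 81: above 10%, so age stays
round(pct_change(without_bmi, full_model))  # 6: below 10%, so BMI can be dropped
```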