2023-01-31. Session 1

Linear Regression - Session 1

Writer
Javier Silva-Valencia

Affiliation
Instituut Voor Tropische Geneeskunde. Antwerp-Belgium

Published

2023-01-31

Abstract
The exercise begins by creating a scatterplot to examine the relationship between two variables, and then fits a bivariate linear regression to explore the extent to which one variable can explain the other. We go through the interpretation of the intercept, the slope and the coefficient of determination.


Starting the Exercise

Instructions:

An ecological study has looked into the relationship between the incidence of low birth weight (“low birth weight per 100 births”) and perinatal mortality (“perinatal mortality per 100 births”) in health districts of a certain region.

Importing data

Import the CSV file as a data frame named lbwpmor, with “,” as the field separator and “.” as the decimal mark:

lbwpmor <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2/Linear Regresion Exercise Database/Datasets/lbwpmor.csv", sep=",", dec= ".")
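
The call above sets `sep` and `dec` explicitly. As a minimal sketch, the block below writes a tiny CSV with made-up values (not the study data; `toy_path` is a hypothetical name) to a temporary file and shows that `read.csv()` already uses “,” and “.” as its defaults. It then runs the usual structure checks that are worth doing after any import.

```r
# Toy CSV written to a temporary file. The values are made up for
# illustration and are NOT the study data.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("inclbw,permor", "7.5,1.4", "9.0,1.8", "8.2,1.6"), toy_path)

# sep = "," and dec = "." are the defaults of read.csv(),
# so both calls return the same data frame
explicit <- read.csv(toy_path, sep = ",", dec = ".")
implicit <- read.csv(toy_path)
identical(explicit, implicit)

# Quick checks after an import: column types and first rows
str(implicit)
head(implicit)
```

The same `str()` and `head()` calls can be run on `lbwpmor` to confirm that `inclbw` and `permor` were read as numeric columns.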

We assume that all variables are ready to use (we don't need to transform or create variables).

Step 1

Check whether inclbw and permor are correlated. (This is a first check of the linearity assumption; there are other ways, but we will start this exercise with this one.)

# Code for the correlation coefficient:
  cor(lbwpmor$inclbw, lbwpmor$permor)
[1] 0.6861169
Tip

As a result: we have an r of 0.69 (0.6861). It is a moderately strong positive correlation.
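
Whether a correlation of this size could be due to chance can be checked with a t statistic computed from r and the sample size. The sketch below assumes n = 20 districts. That number is not stated in the exercise; it is inferred from the 18 residual degrees of freedom (n - 2) in the regression output of Step 3. With the data loaded, `cor.test(lbwpmor$inclbw, lbwpmor$permor)` should report the same test directly.

```r
r <- 0.6861169   # correlation coefficient reported above
n <- 20          # assumption: inferred from 18 residual df (n - 2) in Step 3

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 3)
signif(p_value, 3)
```

Up to rounding, the result should agree with the t value and p-value for inclbw in the Step 3 output, because in a bivariate regression the test of the correlation and the test of the slope are equivalent.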


Step 2

Doing the Scatterplot

```{r}
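# Scatterplot of perinatal mortality against low birth weight incidence
plot(lbwpmor$inclbw, lbwpmor$permor,
     main = "Perinatal mortality vs. low birth weight",
     xlab = "Low birth weight per 100 births",
     ylab = "Perinatal mortality per 100 births")
# Add the fitted regression line
abline(lm(permor ~ inclbw, data = lbwpmor), col = "blue")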
```

Tip

As a result: the graph shows a positive linear relationship between the two variables.

Step 3

Fitting the linear regression for inclbw and permor only (bivariate)

```{r}
# Fit the bivariate linear regression of permor on inclbw
lm_permor <- lm(permor ~ inclbw, data = lbwpmor)
summary(lm_permor)
```

Call:
lm(formula = permor ~ inclbw, data = lbwpmor)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.30030 -0.17013 -0.03691  0.21897  0.34469 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.42139    0.31880   1.322 0.202786    
inclbw       0.14676    0.03668   4.001 0.000837 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.218 on 18 degrees of freedom
Multiple R-squared:  0.4708,    Adjusted R-squared:  0.4414 
F-statistic: 16.01 on 1 and 18 DF,  p-value: 0.0008373
Tip

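-A. As a result: we can write the regression equation with the estimates we obtained

* y = a + b(x)
* a = 0.42139 (intercept)
* b = 0.14676 (slope)
So, the formula would be:
* Y = 0.42 + 0.147(X)
Interpretation:
"For each additional low birth weight per 100 births, perinatal mortality increases on average by 0.147 per 100 births. The intercept (0.42) is the predicted perinatal mortality when the incidence of low birth weight is 0"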
  

-B. We also find the R2

* R2 is the "coefficient of determination".
* R2 = 0.4708
Interpretation:
"47% of the variability in perinatal mortality can be explained by the model"
    
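
The other two model-level numbers in the output, the adjusted R-squared and the F-statistic, follow from R2 and the sample size. The sketch below recomputes them from the reported R2, assuming n = 20 (inferred from the 18 residual degrees of freedom, not stated in the exercise) and one predictor.

```r
r2 <- 0.4708   # Multiple R-squared from the summary() output
n  <- 20       # assumption: 18 residual df + 2 estimated coefficients
p  <- 1        # number of predictors (inclbw)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic of the model, on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(adj_r2, 4)
round(f_stat, 2)
```

Both values should match the "Adjusted R-squared" and "F-statistic" lines of the output up to rounding.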

-C. We also find the p value = 0.000837

Interpretation:
* The p value in this case says that there is a statistically significant association between perinatal mortality
and low birth weight (which is the same as saying that "the slope of the regression line
is significantly different from zero").
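
To see where this p value comes from, the sketch below recomputes the t statistic and the two-sided p value from the slope and its standard error as printed in the output, and adds a 95% confidence interval for the slope. The interval is not part of the original output; with the fitted model available, `confint()` applied to the model object should give the same interval.

```r
estimate  <- 0.14676   # slope for inclbw from the summary() output
std_error <- 0.03668   # its standard error
df_resid  <- 18        # residual degrees of freedom

# t statistic and two-sided p-value for H0: slope = 0
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

round(t_stat, 3)
signif(p_value, 3)
round(ci, 3)
```

An interval that excludes zero tells the same story as the small p value: the data are not compatible with a flat regression line.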
    

-D. As another result we have the “correlation coefficient”

Remember that we calculated the correlation coefficient before in Step 1. But we can also calculate it here because, in a bivariate regression:
      Correlation coefficient = Square root of R2 (with the sign of the slope)
So:
      Correlation coefficient (r) = Sqrt(0.4708)
      Correlation coefficient (r) = 0.6861
Interpretation: 
    "The value of 0.69 suggests a moderately strong positive association"

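The square root of R2 only gives the magnitude of r, so the sign has to be taken from the slope. A minimal sketch using the values reported in the regression output:

```r
r2    <- 0.4708    # Multiple R-squared from the summary() output
slope <- 0.14676   # slope for inclbw

# sqrt(R2) gives the magnitude of r; the sign comes from the slope
r <- sign(slope) * sqrt(r2)
round(r, 4)
```

This shortcut holds only for a regression with a single predictor; with several predictors, R2 no longer corresponds to one pairwise correlation.
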
Step 4

Predict “y” when X is “8”

We have already calculated the formula: “Y= 0.42 +0.147(X)”. So we just need to replace the “X” with 8.

#Calculate:
0.42 +(0.147*8)
[1] 1.596
Tip

So, using the linear regression formula above, we can say that

when "low birth weight" is 8 per 100 births,
the predicted "perinatal mortality" is 1.596 per 100 births
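
The calculation above uses the rounded coefficients. As a sketch, the same prediction can be wrapped in a small helper that uses the coefficients as printed in the Step 3 output. `predict_permor` is a hypothetical name introduced here for illustration and is not part of the exercise.

```r
# Coefficients as reported in the summary() output of Step 3
intercept <- 0.42139
slope     <- 0.14676

# Hypothetical helper: predicted perinatal mortality (per 100 births)
# for one or more values of low birth weight incidence (per 100 births)
predict_permor <- function(inclbw) intercept + slope * inclbw

predict_permor(8)   # close to the 1.596 obtained with rounded coefficients
```

With the fitted model object from Step 3 available, `predict()` with `newdata = data.frame(inclbw = 8)` should give the same value without copying coefficients by hand. Predictions are only trustworthy for values of inclbw within the range observed in the data.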