Linear Regression

A tutorial on linear regression using the tidyverse.

Skip Moses https://example.com/norajones
2022-03-22

What is Linear Regression?

Often the goal of data science is to make predictions from data. This takes the form of fitting our data to a model:

\[y = ax + b + \epsilon\]

where \(y\) is our output/response variable, \(x\) is our input/predictor variable, \(a\) is the slope, \(b\) is a constant (the intercept) and \(\epsilon\) is our irreducible error. Thus, if we have observed data \(\hat{x}\) and \(\hat{y}\), fitting the model corresponds to finding suitable values of \(a\) and \(b\).

Here we have created some synthetic data. We can see that there should be a line that approximates this data well (it was built to be so). But the question remains: how do we decide which line is best? The blue and green lines seem better than the red line, but how do we quantify this?
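The chunk that generated the data and the plot is not shown. A minimal sketch of how such data and candidate lines could be produced is below; the seed, predictor range, coefficients, and noise level are assumptions, not the values behind the actual figure.

library(ggplot2)

set.seed(1)                                   # assumed seed
x1   <- runif(100, 20, 80)                    # assumed predictor range
yhat <- 120 + 4.6 * x1 + rnorm(100, sd = 20)  # assumed coefficients and noise level
data <- data.frame(x1 = x1, yhat = yhat)

ggplot(data, aes(x1, yhat)) +
  geom_point() +
  geom_abline(intercept = 120, slope = 4.6, colour = "blue") +   # a good candidate line
  geom_abline(intercept = 100, slope = 5.0, colour = "green") +  # another good candidate
  geom_abline(intercept = 300, slope = 1.0, colour = "red")      # a poor candidate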

One approach is to select the line that minimizes the aggregate distance from the data points to our model. More rigorously, the line of best fit (our \(a\) and \(b\)) is the one that makes the quantity

\[\sum_{i=1}^N \left\vert a\hat{x}_i + b - \hat{y}_i \right\vert \]

as small as possible. In practice, ordinary least squares (the method used by lm) minimizes the sum of squared residuals \(\sum_{i=1}^N (a\hat{x}_i + b - \hat{y}_i)^2\) instead, which is mathematically more convenient; a full justification is beyond the scope of this tutorial.
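As a concrete illustration, either criterion is easy to compute for a candidate pair \((a, b)\), using the hypothetical data from the sketch above:

a <- 4.6; b <- 120                      # a candidate slope and intercept
sum(abs(a * data$x1 + b - data$yhat))   # aggregate absolute distance
sum((a * data$x1 + b - data$yhat)^2)    # sum of squared residuals (least squares)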

Linear Regression in R with the tidyverse

Fortunately, several software packages can find lines of best fit without any tedious calculations by hand. We will focus on the tidyverse-style interface in R provided by the parsnip package (part of tidymodels).

The first thing we need to do is install the package and load it.
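The install/load chunk is not shown; assuming the tidymodels bundle (which provides the parsnip, broom, and purrr functions used below), it would look roughly like this:

install.packages("tidymodels")   # only needed once
library(tidymodels)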

Next, we need to tell R that we want to create a linear model using regression. Essentially, we are creating an interface for our model that will work nicely with a wide range of packages.
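With parsnip, that interface is created with linear_reg(); a minimal sketch (the object name lm_spec is an assumption):

lm_spec <- linear_reg()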

Now, we will specify an engine for fitting the model. This tells R what software should be used to fit the model. Some examples are Stan and glmnet, but we will simply use the ‘lm’ function.
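Adding the engine and printing the specification produces the output below (again assuming the lm_spec name):

lm_spec <- linear_reg() %>%
  set_engine("lm")   # use stats::lm to fit the model
lm_spec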

Linear Regression Model Specification (regression)

Computational engine: lm 

Lastly, in some cases the outcome is not a numeric quantity. In that case, we need to specify the mode (regression or classification). Here, since we are using linear regression, there is only one mode and we do not need to specify it.
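For completeness, setting the mode explicitly would look like this sketch:

lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")   # the only mode available for linear_reg()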

Now that our model interface is created, we can use it to fit our data.
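The fitting chunk is not shown; judging from the call in the output below, it was along these lines (the names lm_fit, yhat, x1, and data are taken from that output; the final typeof() call is an assumption to explain the trailing "list" line):

lm_fit <- lm_spec %>%
  fit(yhat ~ x1, data = data)   # fit with the lm engine
lm_fit
typeof(lm_fit)                  # a parsnip fit is stored as a list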

parsnip model object

Fit time:  4ms 

Call:
stats::lm(formula = yhat ~ x1, data = data)

Coefficients:
(Intercept)           x1  
    120.764        4.605  
[1] "list"

Notice that the output of fit() is a list containing ‘spec’, ‘fit’, ‘preproc’, and ‘elapsed’. We will only deal with ‘fit’ in this tutorial.

To get a more detailed picture of our model, we can call the summary() function. We use pluck() to pull the ‘fit’ element from our output and create a summary.
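A sketch of that call (pluck() comes from purrr, which is loaded with tidymodels):

lm_fit %>%
  pluck("fit") %>%   # extract the underlying lm object
  summary()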


Call:
stats::lm(formula = yhat ~ x1, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-38.57 -15.24   0.15  14.31  66.25 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 120.7644    15.2940   7.896 4.22e-12 ***
x1            4.6047     0.2997  15.364  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.54 on 98 degrees of freedom
Multiple R-squared:  0.7066,    Adjusted R-squared:  0.7037 
F-statistic: 236.1 on 1 and 98 DF,  p-value: < 2.2e-16

Alternatively, we can use the tidy() function to view our estimated parameters in a nice table.
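A sketch of the call that produced the table below (tidy() is provided by broom, loaded with tidymodels):

tidy(lm_fit)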

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   121.      15.3        7.90 4.22e-12
2 x1              4.60     0.300     15.4  7.60e-28

If we want to analyze the overall model statistics, we call the glance() function.
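And similarly, a sketch of that call:

glance(lm_fit)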

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
1     0.707         0.704  20.5      236. 7.60e-28     1  -443.  892.
# … with 4 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>