---
title: "Regression"
author: "Nirzaree"
date: "31/08/2020"
output:
  html_document:
    fig_caption: true
    toc: true
    toc_float: true
    toc_collapsed: true
    toc_depth: 3
---
# Goal: Understand regression concepts with examples
## Rant/Why am I doing this
Whenever I read about regression, it would seem on one hand the simplest of approaches to try out, while on the other hand there are quite a few nuances in each step that need to be well understood and applied. No single reference tackled all the relevant concepts: many aspects would simply be ignored (mostly because they did not apply to the dataset at hand), but with a new dataset, how would you know whether you are missing something or doing something wrong? This took me waaay longer than I thought it would, and it is quite long, but I still think it will be my reference for all things linear regression going forward, and hopefully it helps some others too.
## Theory
* What is regression?
A technique for estimating a continuous output variable as a function of one or more input variables; in linear regression, that function is a linear, additive combination of the inputs.
* What types of regression techniques exist?
1. Linear: $y = a_0 + a_1*x_1 + a_2*x_2 + a_3*x_3 + \epsilon$
+ Univariate: Only 1 input variable
+ Multivariate: More than 1 input variable
2. Polynomial: $y = a_0 + a_1*x_1 + a_2*(x_1)^2 + a_3*(x_2) + a_4*(x_2)^2 + a_5*x_1*x_2 + \epsilon$
3. Logistic: $log(p_i/(1-p_i)) = a_0 + a_1*x_1 + a_2*x_2 + a_3*x_3$ (the log-odds are modelled directly, so there is no separate error term)
* Assumptions of linear regression
When we apply linear regression to a dataset, we assume that:
1. There is a linear (and, with more than one input variable, additive) relationship between the input and output variables
2. There is constant variance in the error term
3. The error term is normally distributed
4. The error terms are independent
5. In case of multilinear regression, there is no collinearity between the input variables
* Preprocessing for linear regression
* Feature Engineering for linear regression
* Tuning a regression model
* Metrics for evaluating a linear regression model
1. Statistical significance of the model:
+ **p-value**: the probability of obtaining results at least as extreme as the ones observed, assuming the null hypothesis (no relationship) is true.
+ p-value of individual predictor variables:
+ p-value of the entire model:
+ A model can be considered statistically significant only if both these p-values are below the pre-determined significance threshold (commonly 0.05)
+ In R, this is visually represented by the stars at the end of each coefficient row. The more stars beside a variable's p-value, the more significant the variable.
+ **t-value**: the higher the absolute t-value, the lower the chance that the coefficient differs from zero merely by chance.
* Pr(>|t|) indicates the probability of getting a t-value as extreme as, or more extreme than, the one observed, given that the null hypothesis is true. So the lower Pr(>|t|), the more significant the coefficient, and vice versa.
+ $t-value = Coefficient/Standard Error$
+ $p-value = 2 * pt(q = -abs(t-value),df = (nrow - ncol))$
where pt = distribution function of the t-distribution with df degrees of freedom and q = vector of quantiles [3]. (The sketch at the end of this section recomputes these quantities from an `lm` fit.)
2. Model Accuracy:
* The actual amount of information in a dataset is indicated by the amount of variance it contains. $R^2$ tells us the amount of variance of the data that is explained by the model. [3]
+ $R^2 = 1 - (SSE/SST)$
where Sum of Squared Error or $SSE = \sum\limits_{i = 1}^{n}(y_i - \hat{y_i})^2$ &
Total Sum of Squares or $SST = \sum\limits_{i = 1}^n(y_i - \bar{y})^2$ where
$\hat{y_i}$ = predicted value when actual observation = $y_i$ &
$\bar{y}$ = mean of all observed output values
[Nicely explained stepwise $R^2$ computation on a dataset](https://internal.ncl.ac.uk/ask/numeracy-maths-statistics/statistics/regression-and-correlation/coefficient-of-determination-r-squared.html)
* Adjusted $R^2$: As you add variables to the model, the value of $R^2$ will always increase, independent of whether the additional variables are significant to the model or not. Hence, we use $R^2$ Adjusted.
+ Adjusted $R^2 = 1 - MSE/MST$
where $MSE$ (Mean Squared Error) $= SSE/(n - q)$ & $MST = SST/(n-1)$
where $n$ = total number of observations & $q$ = total number of coefficients
Thus Adjusted $R^2$ penalizes a higher number of predictors in the model.
* Low value of $R^2$ not always bad
+ Certain types of problems tend to have lower $R^2$ because the outputs are just harder to predict [5]
+ However, if some predictors are statistically significant, they can still be insightful, even though the model might not be good for precise predictions.
* High value of $R^2$ not always good [5]
```{r highvalueofR2butmisleading,echo=TRUE,eval=TRUE,message = FALSE, warning = FALSE,fig.width = 8, fig.height = 4}
x <- 1:100
y <- x^2                # the true relationship is quadratic, not linear
linModelxy <- lm(y ~ x)
summary(linModelxy)     # R^2 is still very high despite the wrong functional form
plot(x, y)
abline(linModelxy)
```
**Observations**
The linear fit reports a very high $R^2$ even though the relationship is clearly quadratic and the fitted line is systematically off. Always look at the $R^2$ values in conjunction with other metrics, residual plots and domain knowledge.
* F-test: $R^2$ does not provide a formal hypothesis test for the relationship between the predictors and the response variable. Enter the **F-test**.
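The metric definitions above can be checked directly in R. Below is a minimal sketch (using a simple `wt`-only model on the built-in `mtcars` data purely for illustration; it is not one of the models fitted later) that recomputes the t-value, p-value, $R^2$ and adjusted $R^2$ by hand and compares them with what `summary()` reports.
```{r metricsbyhand, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE}
# Illustrative model: mpg as a function of weight only
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)        # number of observations
k <- length(coef(fit))   # number of coefficients (intercept + slope)

# t-value and p-value for the wt coefficient, recomputed from the formulas above
coefEst <- coef(s)["wt", "Estimate"]
coefSE  <- coef(s)["wt", "Std. Error"]
tValue  <- coefEst / coefSE
pValue  <- 2 * pt(q = -abs(tValue), df = n - k)
c(tValue = tValue, pValue = pValue)   # should match coef(s)["wt", ]

# R^2 and adjusted R^2 from SSE and SST
sse   <- sum(residuals(fit)^2)
sst   <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2    <- 1 - sse / sst
r2adj <- 1 - (sse / (n - k)) / (sst / (n - 1))
c(r2 = r2, r2adj = r2adj)             # should match s$r.squared and s$adj.r.squared
```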
## Intuitive Approach
Intuitively, with no theoretical knowhow about feature engineering, model validation etc., we would do the following:
1. Load the dataset & check if there are features that correlate significantly and linearly with the output/response variable (we could also transform the input data, which has its own nuances, but for now we just try to see what best fits the data as it is)
2. Build a model with all variables
3. Check for important variables
4. Reduce the model to include only the significant variables
5. Check how good the model is
6. If it is good enough, use it to make predictions
## MTCars Dataset
```
Dataset description: mtcars
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Format:
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
Task: Regression analysis of fuel efficiency
```
### Intuitive approach
```{r setup1,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
library(data.table)
library(corrplot)
library(ggplot2)
library(ggpmisc)
library(grid)
library(gridExtra)
```
### 1. Load Data
```{r loaddata1,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
dtMTCars <- data.table(mtcars)
head(dtMTCars)
summary(dtMTCars)
```
**High level summary**
* Predictor variables: 10
+ Numeric variables: 8
+ Categorical (binary) variables: 2 (Engine (0 = V-shaped, 1 = straight) & Transmission (0 = automatic, 1 = manual)); an optional factor-encoding sketch follows below
* Response variable: mpg Miles/(US) gallon
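Since `vs` and `am` are stored as 0/1 numerics, one option is to encode them explicitly as factors so that `lm()` labels their levels in the output. A minimal sketch (not evaluated here; the rest of the notebook keeps them as 0/1 numerics, which is equivalent for two-level variables):
```{r encodecategoricals, echo=TRUE, eval=FALSE}
# Optional: recode the two binary columns as labelled factors (data.table syntax)
dtMTCars[, vs := factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight"))]
dtMTCars[, am := factor(am, levels = c(0, 1), labels = c("automatic", "manual"))]
str(dtMTCars)
```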
### 2. Feature Exploration
Let's plot the output variable against each of the input variables, along with the best-fit line, its equation and $R^2$ for each variable.
```{r featureanalysis,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 8, fig.height = 4}
# corrplot(cor(dtMTCars))
#some more plots
genericformula = y~x
cylplot <- ggplot(dtMTCars,aes(x = cyl,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs number of cylinders") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE)
dispplot <- ggplot(dtMTCars,aes(x = disp,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs displacement") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE)
hpplot <- ggplot(dtMTCars,aes(x = hp,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs horsepower") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE)
dratplot <- ggplot(dtMTCars,aes(x = drat,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs rear axle ratio") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE) #not great relation
grid.arrange(cylplot,dispplot,hpplot,dratplot,nrow = 2)
wtplot <- ggplot(dtMTCars,aes(x = wt,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs weight") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE)
qsecplot <- ggplot(dtMTCars,aes(x = qsec,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs qsec") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE) #positive relation but quite a lot of variation.
vsplot <- ggplot(dtMTCars,aes(x = vs,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs engine type") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE)
amplot <- ggplot(dtMTCars,aes(x = am,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs transmission type") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE)
grid.arrange(wtplot,qsecplot,vsplot,amplot,nrow = 2)
gearplot <- ggplot(dtMTCars,aes(x = gear,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs number of forward gears") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE)
carbplot <- ggplot(dtMTCars,aes(x = carb,y = mpg)) + geom_point() + geom_smooth(formula = genericformula,method = "lm") + ggtitle("MPG Vs number of carburetors") + stat_poly_eq(formula = genericformula,aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE) #not great relation
grid.arrange(gearplot,carbplot,nrow = 1)
```
<!-- **Observations:** -->
<!-- MPG against -->
<!-- 1. Number of cylinders: As number of cylinders increase, MPG is reducing. However number of cylinders being a discrete variable with only 3 values in the dataset, would not be able to discern mpg on a finer scale. -->
<!-- 2. Weight: Inversely proportional with mpg. Linear relation. Weight being a continuous variable is nicely fitting the mpg line. -->
<!-- 3. Transmission: Transmission type 1 (manual) overall has higher mpg than class 0 (automatic) however there is a lot of overlapping data meaning that other variables would be required for better mpg prediction along with transmission type. -->
<!-- 4. -->
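Pairwise correlations give a quick numeric summary of the same relationships, and a first look at how strongly the predictors are related to each other. A minimal sketch using the `corrplot` package already loaded in the setup chunk:
```{r correlationoverview, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE, fig.width=8, fig.height=8}
# Pearson correlations between all variables, including the response mpg.
# Large absolute correlations between predictors (e.g. cyl, disp, wt) hint at
# collinearity, which we come back to when fitting the all-variable model.
corMat <- cor(dtMTCars)
round(corMat["mpg", ], 2)              # correlation of each variable with mpg
corrplot(corMat, method = "circle", type = "upper")
```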
### 3. Build model: a) Simple linear regression with all variables
We now build a model using all variables and we check how it looks.
```{r AllVarModel,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
MultLinearModel <- lm(mpg~.,data = dtMTCars)
summary(MultLinearModel)
par(mfrow=c(2,2))
plot(MultLinearModel)
```
**Observations:**
Adjusted $R^2$ is 80.66%, which is decent.
The assumptions are fairly satisfied.
However, none of the variables appears to be statistically significant on its own, hence we need to reduce the variables and try fitting the model again.
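A reasonable adjusted $R^2$ with no individually significant coefficients is a classic symptom of collinearity among the predictors. A hedged sketch of a formal check using variance inflation factors (this assumes the `car` package, which is not loaded elsewhere in this notebook, so the chunk is not evaluated):
```{r collinearitycheck, echo=TRUE, eval=FALSE}
# Variance inflation factors for the all-variable model.
# Rule of thumb: VIF above roughly 5-10 means a predictor is largely explained
# by the other predictors, inflating its standard error (and its p-value).
library(car)   # assumption: the car package is installed
vif(MultLinearModel)
```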
### 3. Build model: b) Simple linear regression with few handpicked variables
We now build a model using a few variables that we would assume to have a fairly strong correlation with mileage. (todo: how to formally check for collinearity between the variables? The VIF sketch above is one option.)
* Weight: As the weight of the car increases, mileage would decrease.
* Transmission type: Mileage of manual > automatic.
* Horsepower: As the horsepower of the car increases, mileage would decrease.
* Engine displacement: Engine displacement and horsepower are quite related, so we can check which one of these helps the model more.
```{r FewHandpickedVarModel,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
MultLinearModel1 <- lm(mpg~wt+am+hp,data = dtMTCars)
summary(MultLinearModel1)
MultLinearModel2 <- lm(mpg~wt+am+disp,data = dtMTCars)
summary(MultLinearModel2)
```
**Observations:**
* Horsepower is better for the model than engine displacement on the adjusted $R^2$ metric, hence we keep the model with horsepower.
* Also, transmission type is not very significant in either model, so we remove it.
```{r FinalHandpickedModel,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
MultLinearModelHandpicked <- lm(mpg~wt+hp,data = dtMTCars)
summary(MultLinearModelHandpicked)
```
**Observations**
* Coefficients:
* Weight: MPG decreases by ~3.87 units per unit increase in weight in the given dataset.
* hp: MPG decreases by ~0.03 units per unit increase in horsepower.
* Adjusted $R^2$: 81.48%, slightly better than the model with all variables. Why so? Because that model was not able to remove unnecessary variables. Now we let the fitting procedure do that itself: we try stepwise linear regression.
### 3. Build model: c) Stepwise linear regression
```{r StepwiseModel,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
LinModelStep <- step(lm(mpg~.,dtMTCars),direction = "both")
summary(LinModelStep)
```
**Observations**
* Features it chose:
* Weight: expected & relevant
* type of transmission: expected and relevant
* qsec: time taken (in seconds) by the car to cover a quarter mile. It is related to the car's power: the more powerful the car, the lower the qsec and typically the lower the mileage, so qsec is positively related to mpg. However, the feature plot of qsec vs mpg alone does not look very strong.
Since the $R^2$ of the handpicked model and the step model do not differ a lot, we can keep the handpicked model as the final candidate; either of the two models should be fine.
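To make that comparison concrete, a minimal sketch comparing the two candidates on adjusted $R^2$ and AIC (AIC is used because the two models are not nested in each other, so a direct F-test between them does not apply):
```{r comparecandidates, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE}
# Compare the handpicked (wt + hp) and stepwise (wt + qsec + am) models.
data.frame(
  model  = c("handpicked", "stepwise"),
  adj_r2 = c(summary(MultLinearModelHandpicked)$adj.r.squared,
             summary(LinModelStep)$adj.r.squared),
  aic    = c(AIC(MultLinearModelHandpicked), AIC(LinModelStep))
)
```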
### 4. Validate best model
We now check if linear regression assumptions are validated by the final model that we chose.
```{r ValidateModel,echo = FALSE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10}
par(mfrow=c(2,2))
plot(MultLinearModelHandpicked)
```
**Observations:**
Linear Regression assumptions:
1. Linearity:
2. Constant error variance (homoscedasticity)
3. Independent error terms (this is difficult to identify through simple plots, so we leave it aside for now)
4. Normal errors
1. Residuals vs fitted:
* No pattern in residuals vs fitted values. In a linear regression the residuals sum to 0; there should be no pattern in the error terms, just a random cloud of points => constant variance assumption met
* Red line is almost horizontal (not great but not too bad) => linearity assumption met
2. Q-Q plot: points follow the line for the most part => normally distributed error terms assumption met
There are other plots that help in identifying non-linearity, non-constant variance and other problematic observations, but we chill on that for now.
3. Scale-location plot (variance in residuals): fairly flat line, hence the variance is reasonably constant across the predictor range (homoscedasticity)
4. Cook's distance: for the 3 highest points it is < 0.5.
A general threshold is 4/N, or 4/(N - k - 1), where k = number of explanatory variables & N = number of observations.
Need to check how much Cook's distance is too much with respect to the given problem; a small sketch follows.
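A minimal sketch making that check concrete: compute Cook's distance for every observation in the chosen model and compare it against the 4/N rule of thumb mentioned above.
```{r cooksdistancecheck, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE}
# Cook's distance for the handpicked model, flagged against the 4/N rule of thumb.
cd <- cooks.distance(MultLinearModelHandpicked)
cutoff <- 4 / nrow(dtMTCars)
cutoff
sort(cd[cd > cutoff], decreasing = TRUE)   # observations worth a second look
```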
### 5. Summary
* We took a dataset with multiple explanatory variables and one response variable of interest.
* We applied a linear model using all variables. It fairly satisfied the assumptions but seemed to contain more variables than required.
* We then tried a model with a few variables that seemed to have a linear relation with the output variable; it had a slightly better adjusted $R^2$ than the model with all the variables, further suggesting that not all variables are required.
* We finally employed stepwise linear regression, which can remove unnecessary variables on its own. It had slightly better $R^2$ than both of the earlier models.
* We then checked whether the assumptions of linearity, constant error variance and normal distribution of errors are satisfied by the final (handpicked) model, and they were fairly satisfied.
* What has not been covered in this notebook: how to handle collinearity between features (beyond the brief VIF sketch), how to find the best linear model, and any pre-processing requirements for a linear model.
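To close the loop on the last step of the intuitive approach (using the model to make predictions), a minimal sketch with a hypothetical new car whose weight and horsepower are made-up values:
```{r predictnewcar, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE}
# Hypothetical car: 3,000 lbs (wt is in units of 1000 lbs) and 150 gross horsepower.
newCar <- data.frame(wt = 3.0, hp = 150)
predict(MultLinearModelHandpicked, newdata = newCar, interval = "prediction")
```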
<!-- Finally, stepwise linear regression with CV: -->
<!-- https://topepo.github.io/caret/train-models-by-tag.html for caret model names -->
<!-- ### 5. Build model: Stepwise linear regression with cross validation -->
<!-- ```{r WithCV,echo = TRUE, message = FALSE, warning = FALSE,fig.width = 16, fig.height = 10} -->
<!-- library(caret) -->
<!-- trainingControl <- trainControl(method = "cv",number = 10) -->
<!-- LinModelWithCV <- train(mpg~., -->
<!-- data = dtMTCars, -->
<!-- method = "lmStepAIC", -->
<!-- trControl = trainingControl -->
<!-- ) -->
<!-- summary(LinModelWithCV) -->
<!-- LinModelWithCV$results -->
<!-- summary(LinModelWithCV$finalModel) -->
<!-- #with DAAG library's function: -->
<!-- # cv.lm(data = dtMTCars,mpg~.) -->
<!-- ``` -->
## References
1. Fuel Efficiency Regression Analysis
[Regression analysis of fuel efficiency using R dataset mtcars](https://rstudio-pubs-static.s3.amazonaws.com/157017_d791d59b482441d386ce8b31b3337ecf.html)
2. Validating linear regression assumptions using diagnostics plots
[Checking Linear Regression Assumptions in R | R Tutorial 5.2 | MarinStatsLectures](https://www.youtube.com/watch?v=eTZ4VUZHzxw)
3. Linear Regression Evaluation Metrics Nicely Explained
[Linear Regression](http://r-statistics.co/Linear-Regression.html)
4. Nicely explained stepwise $R^2$ computation on a dataset
[Coefficient of Determination, R-squared](https://internal.ncl.ac.uk/ask/numeracy-maths-statistics/statistics/regression-and-correlation/coefficient-of-determination-r-squared.html)
5. Nice interpretation of $R^2$ values: [Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?](https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)
## notes to self
1. new package for plotting equation in feature plots
+ library: ggpmisc
+ function: stat_poly_eq
## ToDo / ToAnswer
* p-value: how is it computed?
* p-value: Why is threshold 0.05?
* What's the required threshold for Pr(>|t|) for significance?
* Understand results of cv
* Understand params in cross validation folds, iterations
* How to detect and handle collinearity between features
* Feature selection
* How to find the best fit model
* Show with examples problematic models, how to detect them etc
+ Anscombe's quartet
* Does regularization apply in regression?
* Is it possible to have a high $R^2$ without having any predictors that are individually statistically significant? This is somewhat similar to what's happening when the MTCars data is fit with a model using all variables. What do you understand about such a case? Also the reverse: strongly significant variables but still a low $R^2$ value. What would that mean?
* To read: [Is R-squared Useless?](https://data.library.virginia.edu/is-r-squared-useless/)
* More mind-numbing metrics: F-statistic, AIC, BIC. Examples of how they save us when $R^2$ misleads. When do these new metrics fail? What is the final take on metrics?
* F1 score
* AIC
* glm
* Other regression techniques
* ridge
* lasso
* elastic net
* Quick rundown on 2-3 more datasets