forked from robjhyndman/ETC3550Slides
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path5-regression.Rmd
1062 lines (748 loc) · 27.8 KB
/
5-regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "ETC3550: Applied forecasting for business and economics"
author: "Ch5. Regression models"
date: "OTexts.org/fpp2/"
fontsize: 14pt
output:
beamer_presentation:
theme: metropolis
fig_height: 4.5
fig_width: 7
highlight: tango
includes:
in_header: header.tex
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache=TRUE, warning=FALSE, message=FALSE,
dev.args=list(bg=grey(0.9), pointsize=11))
library(fpp2)
```
# The linear model with time series
## Multiple regression and forecasting
\fontsize{13}{15}\sf
\begin{block}{}\vspace*{-0.3cm}
\[
y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} + \varepsilon_t.
\]
\end{block}
* $y_t$ is the variable we want to predict: the ``response'' variable
* Each $x_{j,t}$ is numerical and is called a ``predictor''.
They are usually assumed to be known for all past and future times.
* The coefficients $\beta_1,\dots,\beta_k$ measure the effect of each
predictor after taking account of the effect of all other predictors
in the model.
That is, the coefficients measure the **marginal effects**.
* $\varepsilon_t$ is a white noise error term
## Example: US consumption expenditure
\fontsize{11}{13}\sf
```{r uschangedata, echo=FALSE}
quarters <- rownames(.preformat.ts(uschange))
```
```{r ConsInc, echo=TRUE, cache=TRUE, fig.height=3.5}
autoplot(uschange[,c("Consumption","Income")]) +
ylab("% change") + xlab("Year")
```
## Example: US consumption expenditure
```{r ConsInc2, echo=FALSE, cache=TRUE}
fit.cons <- tslm(Consumption ~ Income, data=uschange)
uschange %>%
as.data.frame %>%
ggplot(aes(x=Income, y=Consumption)) +
ylab("Consumption (quarterly % change)") +
xlab("Income (quarterly % change)") +
geom_point() +
geom_smooth(method="lm", se=FALSE)
```
## Example: US consumption expenditure
\fontsize{9}{9}\sf
```{r, echo=TRUE, cache=TRUE}
tslm(Consumption ~ Income, data=uschange) %>% summary
```
## Example: US consumption expenditure
```{r MultiPredictors, echo=FALSE, cache=TRUE}
autoplot(uschange[,3:5], facets = TRUE, colour=TRUE) +
ylab("") + xlab("Year") +
guides(colour="none")
```
## Example: US consumption expenditure
```{r ScatterMatrix, echo=FALSE, cache=TRUE}
uschange %>%
as.data.frame %>%
GGally::ggpairs()
```
## Example: US consumption expenditure
\fontsize{7.4}{7.4}\sf
```{r usestim, echo=TRUE}
fit.consMR <- tslm(
Consumption ~ Income + Production + Unemployment + Savings,
data=uschange)
summary(fit.consMR)
```
## Example: US consumption expenditure
```{r usfitted1, echo=FALSE, cache=TRUE}
autoplot(uschange[,'Consumption'], series="Data") +
autolayer(fitted(fit.consMR), series="Fitted") +
xlab("Year") + ylab("") +
ggtitle("Percentage change in US consumption expenditure") +
guides(colour=guide_legend(title=" "))
```
## Example: US consumption expenditure
```{r usfitted2, echo=FALSE, cache=TRUE, message=FALSE, warning=FALSE}
cbind(Data=uschange[,"Consumption"], Fitted=fitted(fit.consMR)) %>%
as.data.frame %>%
ggplot(aes(x=Data, y=Fitted)) +
geom_point() +
xlab("Fitted (predicted values)") +
ylab("Data (actual values)") +
ggtitle("Percentage change in US consumption expenditure") +
geom_abline(intercept=0, slope=1)
```
## Example: US consumption expenditure
```{r}
checkresiduals(fit.consMR, test=FALSE)
```
# Residual diagnostics
## Multiple regression and forecasting
For forecasting purposes, we require the following assumptions:
* $\varepsilon_t$ are uncorrelated and zero mean
* $\varepsilon_t$ are uncorrelated with each $x_{j,t}$.
\pause
It is **useful** to also have $\varepsilon_t \sim \text{N}(0,\sigma^2)$ when producing prediction intervals or doing statistical tests.
## Residual plots
Useful for spotting outliers and whether the linear model was
appropriate.
* Scatterplot of residuals $\varepsilon_t$ against each predictor $x_{j,t}$.
* Scatterplot residuals against the fitted values $\hat y_t$
* Expect to see scatterplots resembling a horizontal band with
no values too far from the band and no patterns such as curvature or
increasing spread.
## Residual patterns
* If a plot of the residuals vs any predictor in the model shows a pattern, then the relationship is nonlinear.
* If a plot of the residuals vs any predictor **not** in the model shows a pattern, then the predictor should be added to the model.
* If a plot of the residuals vs fitted values shows a pattern, then there is heteroscedasticity in the errors. (Could try a transformation.)
## Breusch-Godfrey test
**OLS regression:**
$$
y_{t}=\beta_{0}+\beta_{1}x_{t,1}+\dots+\beta_{k}x_{t,k}+u_{t}
$$
**Auxiliary regression:**
$$
{\hat {u}}_{t}=\beta_{0}+\beta_{1}x_{t,1}+\dots+\beta_{k}x_{t,k}+\rho_{1}{\hat {u}}_{t-1}+\dots +\rho_{p}{\hat {u}}_{t-p}+\varepsilon_{t}
$$
###
If $R^{2}$ statistic is calculated for this model, then
$$
(T-p)R^{2}\,\sim \,\chi_{p}^{2},
$$
when there is no serial correlation up to lag $p$, and $T=$ length of series.
* Breusch-Godfrey test better than Ljung-Box for regression models.
## US consumption again
\fontsize{9}{13}\sf
```{r}
checkresiduals(fit.consMR, plot=FALSE)
```
### If the model fails the Breusch-Godfrey test \dots
* The forecasts are not wrong, but have higher variance than they need to.
* There is information in the residuals that we should exploit.
* This is done with a regression model with ARMA errors.
# Some useful predictors for linear models
## Trend
**Linear trend**
\[
x_t = t
\]
* $t=1,2,\dots,T$
* Strong assumption that trend will continue.
## Dummy variables
\begin{textblock}{6}(0.4,1.5)
If a categorical variable takes only two values (e.g., `Yes'
or `No'), then an equivalent numerical variable can be constructed
taking value 1 if yes and 0 if no. This is called a \textbf{dummy variable}.
\end{textblock}
\placefig{7.7}{1.}{width=5cm}{dummy2}
## Dummy variables
\begin{textblock}{5}(0.4,1.5)
If there are more than two categories, then the variable can
be coded using several dummy variables (one fewer than the total
number of categories).
\end{textblock}
\placefig{5.5}{1.5}{width=7.3cm}{dummy3}
## Beware of the dummy variable trap!
* Using one dummy for each category gives too many dummy variables!
* The regression will then be singular and inestimable.
* Either omit the constant, or omit the dummy for one category.
* The coefficients of the dummies are relative to the omitted category.
## Uses of dummy variables
\fontsize{13}{15}\sf
**Seasonal dummies**
* For quarterly data: use 3 dummies
* For monthly data: use 11 dummies
* For daily data: use 6 dummies
* What to do with weekly data?
\pause
**Outliers**
* If there is an outlier, you can use a dummy variable (taking value 1 for that observation and 0 elsewhere) to remove its effect.
\pause
**Public holidays**
* For daily data: if it is a public holiday, dummy=1, otherwise dummy=0.
## Beer production revisited
```{r, echo=FALSE}
beer <- window(ausbeer, start=1992)
autoplot(beer) + xlab("Year") + ylab("megalitres") +
ggtitle("Australian quarterly beer production")
```
## Beer production revisited
\begin{block}{Regression model}
\centering
$y_t = \beta_0 + \beta_1 t + \beta_2d_{2,t} + \beta_3 d_{3,t} + \beta_4 d_{4,t} + \varepsilon_t$
\end{block}
* $d_{i,t} = 1$ if $t$ is quarter $i$ and 0 otherwise.
## Beer production revisited
\fontsize{8}{8}\sf
```{r, echo=TRUE}
fit.beer <- tslm(beer ~ trend + season)
summary(fit.beer)
```
## Beer production revisited
```{r}
autoplot(beer, series="Data") +
autolayer(fitted(fit.beer), series="Fitted") +
xlab("Year") + ylab("Megalitres") +
ggtitle("Quarterly Beer Production")
```
## Beer production revisited
```{r, echo=FALSE}
cbind(Data=beer, Fitted=fitted(fit.beer)) %>%
as.data.frame %>%
ggplot(aes(x=Data, y=Fitted, colour=as.factor(cycle(beer)))) +
geom_point() +
ylab("Fitted") + xlab("Actual values") +
ggtitle("Quarterly beer production") +
scale_colour_brewer(palette="Dark2", name="Quarter") +
geom_abline(intercept=0, slope=1)
```
## Beer production revisited
```{r}
checkresiduals(fit.beer, test=FALSE)
```
## Beer production revisited
```{r}
fit.beer %>% forecast %>% autoplot
```
## Fourier series
Periodic seasonality can be handled using pairs of Fourier terms:
$$
s_{k}(t) = \sin\left(\frac{2\pi k t}{m}\right)\qquad c_{k}(t) = \cos\left(\frac{2\pi k t}{m}\right)
$$
$$
y_t = a + bt + \sum_{k=1}^K \left[\alpha_k s_k(t) + \beta_k c_k(t)\right] + \varepsilon_t$$
* Every periodic function can be approximated by sums of sin and cos terms for large enough $K$.
* Choose $K$ by minimizing AICc.
* Called "harmonic regression"
```r
fit <- tslm(y ~ trend + fourier(y, K))
```
## Harmonic regression: beer production
\fontsize{8}{8}\sf
```{r fourierbeer, echo=TRUE, cache=TRUE}
fourier.beer <- tslm(beer ~ trend + fourier(beer, K=2))
summary(fourier.beer)
```
## Intervention variables
**Spikes**
* Equivalent to a dummy variable for handling an outlier.
\pause
**Steps**
* Variable takes value 0 before the intervention and 1 afterwards.
\pause
**Change of slope**
* Variables take values 0 before the intervention and values $\{1,2,3,\dots\}$ afterwards.
## Holidays
**For monthly data**
* Christmas: always in December so part of monthly seasonal effect
* Easter: use a dummy variable $v_t=1$ if any part of Easter is in that month, $v_t=0$ otherwise.
* Ramadan and Chinese new year similar.
## Trading days
With monthly data, if the observations vary depending on how many different types of days in the month, then trading day predictors can be useful.
\begin{align*}
z_1 &= \text{\# Mondays in month;} \\
z_2 &= \text{\# Tuesdays in month;} \\
&\vdots \\
z_7 &= \text{\# Sundays in month.}
\end{align*}
## Distributed lags
Lagged values of a predictor.
Example: $x$ is advertising which has a delayed effect
\begin{align*}
x_{1} &= \text{advertising for previous month;} \\
x_{2} &= \text{advertising for two months previously;} \\
& \vdots \\
x_{m} &= \text{advertising for $m$ months previously.}
\end{align*}
## Nonlinear trend
**Piecewise linear trend with bend at $\tau$**
\begin{align*}
x_{1,t} &= t \\
x_{2,t} &= \left\{ \begin{array}{ll}
0 & t <\tau\\
(t-\tau) & t \ge \tau
\end{array}\right.
\end{align*}
\pause\vspace*{1cm}
**Quadratic or higher order trend**
\[
x_{1,t} =t,\quad x_{2,t}=t^2,\quad \dots
\]
\pause\vspace*{-0.5cm}
\centerline{\textcolor{orange}{\textbf{NOT RECOMMENDED!}}}
## Example: Boston marathon winning times
\fontsize{11}{11}\sf
```{r, fig.height=3.5, echo=TRUE}
autoplot(marathon) +
xlab("Year") + ylab("Winning times in minutes")
fit.lin <- tslm(marathon ~ trend)
```
## Example: Boston marathon winning times
```{r marathonLinear, echo=FALSE, message=TRUE, warning=FALSE, cache=TRUE}
autoplot(marathon) +
autolayer(fitted(fit.lin), series = "Linear trend") +
xlab("Year") + ylab("Winning times in minutes") +
guides(colour=guide_legend(title=" ")) +
theme(legend.position = "none")
```
## Example: Boston marathon winning times
\fontsize{11}{11}\sf
```{r, echo=TRUE, fig.height=3.5}
autoplot(residuals(fit.lin)) +
xlab("Year") + ylab("Residuals from a linear trend")
```
## Example: Boston marathon winning times
\fontsize{8}{8}\sf
```{r, echo=TRUE, message=TRUE, warning=FALSE, cache=TRUE}
# Linear trend
fit.lin <- tslm(marathon ~ trend)
fcasts.lin <- forecast(fit.lin, h=10)
# Exponential trend
fit.exp <- tslm(marathon ~ trend, lambda = 0)
fcasts.exp <- forecast(fit.exp, h=10)
# Piecewise linear trend
t.break1 <- 1940
t.break2 <- 1980
t <- time(marathon)
t1 <- ts(pmax(0, t-t.break1), start=1897)
t2 <- ts(pmax(0, t-t.break2), start=1897)
fit.pw <- tslm(marathon ~ t + t1 + t2)
t.new <- t[length(t)] + seq(10)
t1.new <- t1[length(t1)] + seq(10)
t2.new <- t2[length(t2)] + seq(10)
newdata <- cbind(t=t.new, t1=t1.new, t2=t2.new) %>%
as.data.frame
fcasts.pw <- forecast(fit.pw, newdata = newdata)
```
## Example: Boston marathon winning times
```{r, echo=FALSE, message=TRUE, warning=FALSE, cache=TRUE}
autoplot(marathon) +
autolayer(fitted(fit.lin), series = "Linear") +
autolayer(fitted(fit.exp), series="Exponential") +
autolayer(fitted(fit.pw), series = "Piecewise") +
autolayer(fcasts.pw, series="Piecewise") +
autolayer(fcasts.lin$mean, series = "Linear") +
autolayer(fcasts.exp$mean, series="Exponential") +
xlab("Year") + ylab("Winning times in minutes") +
ggtitle("Boston Marathon") +
guides(colour=guide_legend(title=" "))
```
## Example: Boston marathon winning times
```{r residPiecewise, message=FALSE, warning=FALSE, cache=TRUE}
fit.pw$method <- "Piecewise linear regression model"
checkresiduals(fit.pw, test=FALSE)
```
## Interpolating splines
\fullwidth{draftspline}
## Interpolating splines
\fullwidth{spline}
## Interpolating splines
\fullwidth{spline2}
## Interpolating splines
\begin{block}{}
A spline is a continuous function $f(x)$ interpolating all
points ($\kappa_j,y_j$) for $j=1,\dots,K$ and consisting of polynomials between each consecutive pair of `knots' $\kappa_j$ and $\kappa_{j+1}$.
\end{block}
\pause
* Parameters constrained so that $f(x)$ is continuous.
* Further constraints imposed to give continuous derivatives.
## General linear regression splines
* Let $\kappa_1<\kappa_2<\cdots<\kappa_K$ be ``knots'' in interval $(a,b)$.
* Let $x_1=x$, $x_j = (x-\kappa_{j-1})_+$ for $j=2,\dots,K+1$.
* Then the regression is piecewise linear with bends at the knots.
## General cubic splines
* Let $x_1=x$, $x_2 = x^2$, $x_3=x^3$, $x_j =
(x-\kappa_{j-3})_+^3$ for $j=4,\dots,K+3$.
* Then the regression is piecewise cubic, but smooth at the knots.
* Choice of knots can be difficult and arbitrary.
* Automatic knot selection algorithms very slow.
## Example: Boston marathon winning times
\fontsize{7}{7}\sf
```{r, echo=TRUE, message=TRUE, warning=FALSE, cache=TRUE}
# Spline trend
library(splines)
t <- time(marathon)
fit.splines <- lm(marathon ~ ns(t, df=6))
summary(fit.splines)
fits <- ts(fitted(fit.splines))
tsp(fits) <- tsp(marathon)
```
## Example: Boston marathon winning times
\fontsize{8}{8}\sf
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE}
library(splines)
fc <- predict(fit.splines, newdata=data.frame(t=time(fcasts.pw$mean)),
interval='prediction', level=0.95)
fcasts.spline <- fcasts.pw
fcasts.spline$mean <- fc[,"fit"]
se <- (fc[,'upr'] - fc[,'lwr'])/(2*qnorm(0.975))
fcasts.spline$lower[,2] <- fc[,'fit'] - qnorm(.975)*se
fcasts.spline$lower[,1] <- fc[,'fit'] - qnorm(.90)*se
fcasts.spline$upper[,2] <- fc[,'fit'] + qnorm(.975)*se
fcasts.spline$upper[,1] <- fc[,'fit'] + qnorm(.90)*se
autoplot(marathon) +
autolayer(fitted(fit.pw), series = "Piecewise linear") +
autolayer(fcasts.pw, series="Piecewise linear") +
autolayer(fits, series = "Cubic spline") +
autolayer(fcasts.spline, series="Cubic spline") +
xlab("Year") + ylab("Winning times in minutes") +
ggtitle("Boston Marathon") +
guides(colour=guide_legend(title=" "))
```
## splinef
\fontsize{11}{13}\sf
A slightly different type of spline is provided by `splinef`
```{r, echo=TRUE, fig.height=3.3}
fc <- splinef(marathon)
autoplot(fc)
```
## splinef
* Cubic **smoothing** splines (rather than cubic regression splines).
* Still piecewise cubic, but with many more knots (one at each observation).
* Coefficients constrained to prevent the curve becoming too "wiggly".
* Degrees of freedom selected automatically.
* Equivalent to ARIMA(0,2,2) and Holt's method.
# Selecting predictors and forecast evaluation
## Selecting predictors
* When there are many predictors, how should we choose which ones to use?
* We need a way of comparing two competing models.
\pause
**What not to do!**
* Plot $y$ against a particular predictor ($x_j$) and if it shows no noticeable relationship, drop it.
* Do a multiple linear regression on all the predictors and disregard all variables whose $p$ values are greater than 0.05.
* Maximize $R^2$ or minimize MSE
## Comparing regression models
Computer output for regression will always give the $R^2$ value. This is a
useful summary of the model.
* It is equal to the square of the correlation between $y$ and $\hat y$.
* It is often called the ``coefficient of determination''.
* It can also be calculated as follows:
$$R^2 = \frac{\sum(\hat{y}_t - \bar{y})^2}{\sum(y_t-\bar{y})^2}
$$
* It is the proportion of variance accounted for (explained) by the predictors.
## Comparing regression models
However \dots
* $R^2$ does not allow for ``degrees of freedom''.
* Adding *any* variable tends to increase the value of $R^2$, even if that variable is irrelevant.
\pause
To overcome this problem, we can use \emph{adjusted $R^2$}:
\begin{block}{}
\[
\bar{R}^2 = 1-(1-R^2)\frac{T-1}{T-k-1}
\]
where $k=$ no.\ predictors and $T=$ no.\ observations.
\end{block}
\pause
\begin{alertblock}{Maximizing $\bar{R}^2$ is equivalent to minimizing $\hat\sigma^2$.}
\centerline{$\displaystyle
\hat{\sigma}^2 = \frac{1}{T-k-1}\sum_{t=1}^T \varepsilon_t^2$
}
\end{alertblock}
## Cross-validation
**Cross-validation for regression**
(Assuming future predictors are known)
* Select one observation for test set, and use \emph{remaining} observations in training set. Compute error on test observation.
* Repeat using each possible observation as the test set.
* Compute accuracy measure over all errors.
## Cross-validation
**Traditional evaluation**
```{r traintest0, fig.height=.6, echo=FALSE, cache=TRUE}
train = 1:16
test = 17:20
par(mar=c(0,0,0,0))
plot(0,0,xlim=c(0,22),ylim=c(0,2),xaxt="n",yaxt="n",bty="n",xlab="",ylab="",type="n")
arrows(0,0.5,21,0.5,0.05)
points(train, train*0+0.5, pch=19, col="blue")
points(test, test*0+0.5, pch=19, col="red")
text(22,0.5,"time")
text(10,1,"Training data",col="blue")
text(21,1,"Test data",col="red")
```
\pause
**Leave-one-out cross-validation**
```{r, fig.height=3.6}
kcvplot <- function(k,n)
{
par(mar=c(0,0,0,0))
each <- round(n/k)
ord <- sample(1:n,n)
plot(0,0, xlim=c(0,n+2), ylim=c(1-k/n,1-1/n),
xaxt="n", yaxt="n", bty="n", xlab="", ylab="", type="n")
for(j in 1:k)
{
test <- (1:n)[ord[(j-1)*each + (1:each)]]
train <- (1:n)[-test]
arrows(0, 1-j/n, n+1, 1-j/n, 0.05)
points(train,rep(1-j/n,length(train)),pch=19,col="blue")
points(test, rep(1-j/n,length(test)), pch=19, col="red")
#text(28,1-j/11,"time")
}
}
kcvplot(20,20)
```
## Cross-validation
Leave-one-out cross-validation for regression can be carried out using the following steps.
* Remove observation $t$ from the data set, and fit the model using the remaining data. Then compute the error ($e_t^*=y_t-\hat{y}_t$) for the omitted observation.
* Repeat step 1 for $t=1,\dots,T$.
* Compute the MSE from $\{e_1^*,\dots,e_T^*\}$. We shall call this the CV.
###
The best model is the one with minimum CV.
## Cross-validation
**Five-fold cross-validation**
* 20 observations. 4 test observations per fold
```{r, fig.height=1.}
kcvplot(5,20)
```
\pause
**Ten-fold cross-validation**
* 20 observations. 2 test observations per fold
```{r, fig.height=2.}
kcvplot(10,20)
```
## Cross-validation
\structure{Ten-fold cross-validation}
* Randomly split data into 10 parts.
* Select one part for test set, and use remaining parts as training set. Compute accuracy measures on test observations.
* Repeat for each of 10 parts
* Average over all measures.
## Akaike's Information Criterion
\begin{block}{}
\[
\text{AIC} = -2\log(L) + 2(k+2)
\]
\end{block}
where $L$ is the likelihood and $k$ is the number of predictors in the model.\pause
* This is a \emph{penalized likelihood} approach.
* \emph{Minimizing} the AIC gives the best model for prediction.
* AIC penalizes terms more heavily than $\bar{R}^2$.
* Minimizing the AIC is asymptotically equivalent to minimizing MSE via leave-one-out cross-validation.
## Corrected AIC
For small values of $T$, the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed.
\begin{block}{}
\[
\text{AIC}_{\text{C}} = \text{AIC} + \frac{2(k+2)(k+3)}{T-k-3}
\]
\end{block}
As with the AIC, the AIC$_{\text{C}}$ should be minimized.
## Bayesian Information Criterion
\begin{block}{}
\[
\text{BIC} = -2\log(L) + (k+2)\log(T)
\]
\end{block}
where $L$ is the likelihood and $k$ is the number of predictors in the model.\pause
* BIC penalizes terms more heavily than AIC
* Also called SBIC and SC.
* Minimizing BIC is asymptotically equivalent to leave-$v$-out cross-validation when $v = T[1-1/(log(T)-1)]$.
## Choosing regression variables
**Best subsets regression**
* Fit all possible regression models using one or more of the predictors.
* Choose the best model based on one of the measures of predictive ability (CV, AIC, AICc).
\pause
**Warning!**
* If there are a large number of predictors, this is not possible.
* For example, 44 predictors leads to 18 trillion possible models!
## Choosing regression variables
**Backwards stepwise regression**
* Start with a model containing all variables.
* Try subtracting one variable at a time. Keep the model if it
has lower CV or AICc.
* Iterate until no further improvement.
\pause
**Notes**
* Stepwise regression is not guaranteed to lead to the best possible
model.
* Inference on coefficients of final model will be wrong.
## Cross-validation
\fontsize{8}{8}\sf
```{r, echo=TRUE}
tslm(Consumption ~ Income + Production + Unemployment + Savings,
data=uschange) %>% CV()
tslm(Consumption ~ Income + Production + Unemployment,
data=uschange) %>% CV()
tslm(Consumption ~ Income + Production + Savings,
data=uschange) %>% CV()
tslm(Consumption ~ Income + Unemployment + Savings,
data=uschange) %>% CV()
tslm(Consumption ~ Production + Unemployment + Savings,
data=uschange) %>% CV()
```
# Forecasting with regression
## Ex-ante versus ex-post forecasts
* *Ex ante forecasts* are made using only information available in advance.
- require forecasts of predictors
* *Ex post forecasts* are made using later information on the predictors.
- useful for studying behaviour of forecasting models.
* trend, seasonal and calendar variables are all known in advance, so these don't need to be forecast.
## Scenario based forecasting
* Assumes possible scenarios for the predictor variables
* Prediction intervals for scenario based forecasts do not include the uncertainty associated with the future values of the predictor variables.
## Building a predictive regression model {-}
* If getting forecasts of predictors is difficult, you can use lagged predictors instead.
$$y_{t}=\beta_0+\beta_1x_{1,t-h}+\dots+\beta_kx_{k,t-h}+\varepsilon_{t}$$
* A different model for each forecast horizon $h$.
## Beer production
\fontsize{10}{10}\sf
```{r beeryetagain, echo=TRUE, fig.height=3.5}
beer2 <- window(ausbeer, start=1992)
fit.beer <- tslm(beer2 ~ trend + season)
fcast <- forecast(fit.beer)
autoplot(fcast) +
ggtitle("Forecasts of beer production using regression") +
xlab("Year") + ylab("megalitres")
```
## US Consumption
\fontsize{10}{10}\sf
```{r usconsumptionf, echo=TRUE}
fit.consBest <- tslm(
Consumption ~ Income + Savings + Unemployment,
data = uschange)
h <- 4
newdata <- data.frame(
Income = c(1, 1, 1, 1),
Savings = c(0.5, 0.5, 0.5, 0.5),
Unemployment = c(0, 0, 0, 0))
fcast.up <- forecast(fit.consBest, newdata = newdata)
newdata <- data.frame(
Income = rep(-1, h),
Savings = rep(-0.5, h),
Unemployment = rep(0, h))
fcast.down <- forecast(fit.consBest, newdata = newdata)
```
## US Consumption
\fontsize{10}{10}\sf
```{r usconsumptionf2, echo=TRUE, fig.height=3.5}
autoplot(uschange[, 1]) +
ylab("% change in US consumption") +
autolayer(fcast.up, PI = TRUE, series = "increase") +
autolayer(fcast.down, PI = TRUE, series = "decrease") +
guides(colour = guide_legend(title = "Scenario"))
```
# Matrix formulation
## Matrix formulation
\begin{block}{}
$$y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} + \varepsilon_t.$$
\end{block}
\pause
Let $\bm{y} = (y_1,\dots,y_T)'$, $\bm{\varepsilon} = (\varepsilon_1,\dots,\varepsilon_T)'$, $\bm{\beta} = (\beta_0,\beta_1,\dots,\beta_k)'$ and
\[
\bm{X} = \begin{bmatrix}
1 & x_{1,1} & x_{2,1} & \dots & x_{k,1}\\
1 & x_{1,2} & x_{2,2} & \dots & x_{k,2}\\
\vdots & \vdots & \vdots & & \vdots\\
1 & x_{1,T} & x_{2,T} & \dots & x_{k,T}
\end{bmatrix}.
\]\pause
Then
###
$$\bm{y} = \bm{X}\bm{\beta} + \bm{\varepsilon}.$$
## Matrix formulation
**Least squares estimation**
Minimize: $(\bm{y} - \bm{X}\bm{\beta})'(\bm{y} - \bm{X}\bm{\beta})$\pause
Differentiate wrt $\bm{\beta}$ gives
\begin{block}{}
\[
\hat{\bm{\beta}} = (\bm{X}'\bm{X})^{-1}\bm{X}'\bm{y}
\]
\end{block}
\pause
(The ``normal equation''.)\pause
\[
\hat{\sigma}^2 = \frac{1}{T-k-1}(\bm{y} - \bm{X}\hat{\bm{\beta}})'(\bm{y} - \bm{X}\hat{\bm{\beta}})
\]
\structure{Note:} If you fall for the dummy variable trap, $(\bm{X}'\bm{X})$ is a singular matrix.
## Likelihood
If the errors are iid and normally distributed, then
\[
\bm{y} \sim \text{N}(\bm{X}\bm{\beta},\sigma^2\bm{I}).
\]\pause
So the likelihood is
\[
L = \frac{1}{\sigma^T(2\pi)^{T/2}}\exp\left(-\frac1{2\sigma^2}(\bm{y}-\bm{X}\bm{\beta})'(\bm{y}-\bm{X}\bm{\beta})\right)
\]\pause
which is maximized when $(\bm{y}-\bm{X}\bm{\beta})'(\bm{y}-\bm{X}\bm{\beta})$ is minimized.\pause
\centerline{\textcolor{orange}{So \textbf{MLE = OLS}.}}
## Multiple regression forecasts
\begin{block}{Optimal forecasts}\vspace*{-0.2cm}
\[
\hat{y}^* =
\text{E}(y^* | \bm{y},\bm{X},\bm{x}^*) =
\bm{x}^*\hat{\bm{\beta}} = \bm{x}^*(\bm{X}'\bm{X})^{-1}\bm{X}'\bm{y}
\]
\end{block}
where $\bm{x}^*$ is a row vector containing the values of the predictors for the forecasts (in the same format as $\bm{X}$).
\pause
\begin{block}{Forecast variance}\vspace*{-0.2cm}
\[
\text{Var}(y^* | \bm{X},\bm{x}^*) = \sigma^2 \left[1 + \bm{x}^* (\bm{X}'\bm{X})^{-1} (\bm{x}^*)'\right]
\]
\end{block}
\pause
* This ignores any errors in $\bm{x}^*$.