forked from hannabliska/EDA-Fall2022
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathA07_TimeSeries(1).Rmd
268 lines (182 loc) · 10.7 KB
/
A07_TimeSeries(1).Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
---
title: "Assignment 7: Time Series Analysis"
author: "Zhiteng Ma"
output:
pdf_document:
latex_engine: xelatex
geometry: margin=2.54cm
editor_options:
chunk_output_type: inline
---
## OVERVIEW
This exercise accompanies the lessons in Environmental Data Analytics on time series analysis.
## Directions
1. Change "Student Name" on line 3 (above) with your name.
2. Work through the steps, **creating code and output** that fulfill each instruction.
3. Be sure to **answer the questions** in this assignment document.
4. When you have completed the assignment, **Knit** the text and code into a single PDF file.
5. After Knitting, submit the completed exercise (PDF file) to the dropbox in Sakai. Add your last name into the file name (e.g., "Fay_A07_TimeSeries.Rmd") prior to submission.
The completed exercise is due on Tuesday, March 16 at 11:59 pm.
## Set up
1. Set up your session:
* Check your working directory
* Load the tidyverse, lubridate, zoo, and trend packages
* Set your ggplot theme
2. Import the ten datasets from the Ozone_TimeSeries folder in the Raw data folder. These contain ozone concentrations at Garinger High School in North Carolina from 2010-2019 (the EPA air database only allows downloads for one year at a time). Import these either individually or in bulk and then combine them into a single dataframe named `GaringerOzone` of 3589 observation and 20 variables.
```{r, message = FALSE}
#1
library(tidyverse)
library(lubridate)
library(trend)
library(zoo)
#2
getwd()
setwd('c:/Users/Zhiteng Ma/Desktop/EDA-Fall2022-main/Data/Raw/Ozone_TimeSeries')
NC2010 <- read.csv('EPAair_O3_GaringerNC2010_raw.csv',stringsAsFactors =TRUE)
NC2011 <- read.csv('EPAair_O3_GaringerNC2011_raw.csv',stringsAsFactors =TRUE)
NC2012 <- read.csv('EPAair_O3_GaringerNC2012_raw.csv',stringsAsFactors =TRUE)
NC2013 <- read.csv('EPAair_O3_GaringerNC2013_raw.csv',stringsAsFactors =TRUE)
NC2014 <- read.csv('EPAair_O3_GaringerNC2014_raw.csv',stringsAsFactors =TRUE)
NC2015 <- read.csv('EPAair_O3_GaringerNC2015_raw.csv',stringsAsFactors =TRUE)
NC2016 <- read.csv('EPAair_O3_GaringerNC2016_raw.csv',stringsAsFactors =TRUE)
NC2017 <- read.csv('EPAair_O3_GaringerNC2017_raw.csv',stringsAsFactors =TRUE)
NC2018 <- read.csv('EPAair_O3_GaringerNC2018_raw.csv',stringsAsFactors =TRUE)
NC2019 <- read.csv('EPAair_O3_GaringerNC2019_raw.csv',stringsAsFactors =TRUE)
GaringerOzone <- rbind(NC2010, NC2011, NC2012, NC2013, NC2014, NC2015, NC2016,
NC2017, NC2018, NC2019)
```
## Wrangle
3. Set your date column as a date class.
4. Wrangle your dataset so that it only contains the columns Date, Daily.Max.8.hour.Ozone.Concentration, and DAILY_AQI_VALUE.
5. Notice there are a few days in each year that are missing ozone concentrations. We want to generate a daily dataset, so we will need to fill in any missing days with NA. Create a new data frame that contains a sequence of dates from 2010-01-01 to 2019-12-31 (hint: `as.data.frame(seq())`). Call this new data frame Days. Rename the column name in Days to "Date".
6. Use a `left_join` to combine the data frames. Specify the correct order of data frames within this function so that the final dimensions are 3652 rows and 3 columns. Call your combined data frame GaringerOzone.
```{r}
# 3
class(GaringerOzone$Date)
GaringerOzone$Date <- as.Date(GaringerOzone$Date, format = "%m/%d/%Y")
# 4
GaringerOzone_1 <- select(GaringerOzone, Date, Daily.Max.8.hour.Ozone.Concentration,
DAILY_AQI_VALUE)
# 5
Days <- as.data.frame(seq.Date(from = as.Date("2010-01-01"),to= as.Date("2019-12-31"),by=1))
colnames(Days) <- c("Date")
# 6
GaringerOzone_2 <- left_join(Days,GaringerOzone_1, by = c("Date"))
```
## Visualize
7. Create a line plot depicting ozone concentrations over time. In this case, we will plot actual concentrations in ppm, not AQI values. Format your axes accordingly. Add a smoothed line showing any linear trend of your data. Does your plot suggest a trend in ozone concentration over time?
```{r}
#7
wind_data_plot <-
ggplot(GaringerOzone_2, aes(x = Date, y = Daily.Max.8.hour.Ozone.Concentration)) +
#geom_point(color = "red") +
geom_line(color = "black") +
ylab("actual concentrations in ppm") +
geom_smooth( method = lm, color = "blue" )
print(wind_data_plot)
```
>Answer:My chart clearly suggests a trend in ozone concentration over time. Because the ozone concentration tends to decrease over time.
## Time Series Analysis
Study question: Have ozone concentrations changed over the 2010s at this station?
8. Use a linear interpolation to fill in missing daily data for ozone concentration. Why didn't we use a piecewise constant or spline interpolation?
```{r}
#8
head(GaringerOzone_2)
summary(GaringerOzone_2$Daily.Max.8.hour.Ozone.Concentration)
# Adding new column with no missing obs, just for illustration purpose
# In real applications you will simply replace NAs
GaringerOzone_2_clean <-
GaringerOzone_2 %>%
mutate( Daily.Max.8.hour.Ozone.Concentration.clean = zoo::na.approx
(Daily.Max.8.hour.Ozone.Concentration) )
summary(GaringerOzone_2_clean$Daily.Max.8.hour.Ozone.Concentration.clean)
#Note the NA is gone
ggplot(GaringerOzone_2_clean) +
geom_line(aes(x = Date, y = Daily.Max.8.hour.Ozone.Concentration.clean), color = "red") +
geom_line(aes(x = Date, y = Daily.Max.8.hour.Ozone.Concentration), color = "black") +
ylab("actual concentrations in ppm")
```
> Answer: In general, cubic interpolation is better than linear interpolation in most aspects such as smoothness of the function and higher accuracy in approximating the original function. However, there is at least one aspect where linear interpolation is better: the linear interpolation will not produce the "overshoot" situation.
9. Create a new data frame called `GaringerOzone.monthly` that contains aggregated data: mean ozone concentrations for each month. In your pipe, you will need to first add columns for year and month to form the groupings. In a separate line of code, create a new Date column with each month-year combination being set as the first day of the month (this is for graphing purposes only)
```{r}
#9
GaringerOzone_2_clean.monthly <- GaringerOzone_2_clean %>%
mutate(month_year = floor_date(Date, "month")) %>%
group_by(month_year) %>%
summarise(mean.concentration = mean(Daily.Max.8.hour.Ozone.Concentration.clean))
```
10. Generate two time series objects. Name the first `GaringerOzone.daily.ts` and base it on the dataframe of daily observations. Name the second `GaringerOzone.monthly.ts` and base it on the monthly average ozone values. Be sure that each specifies the correct start and end dates and the frequency of the time series.
```{r}
#10
f_day <- day(first(GaringerOzone_2_clean$Date))
f_month <- month(first(GaringerOzone_2_clean.monthly$month_year))
f_year <- year(first(GaringerOzone_2_clean$Date))
GaringerOzone.daily.ts <- ts(GaringerOzone_2_clean$
Daily.Max.8.hour.Ozone.Concentration.clean,
start=c(f_year,f_day),
frequency=365)
GaringerOzone.monthly.ts <- ts(GaringerOzone_2_clean.monthly$mean.concentration,
start=c(f_year,f_month),
frequency=12)
```
11. Decompose the daily and the monthly time series objects and plot the components using the `plot()` function.
```{r}
#11
GaringerOzone_2_decomp_daily <- stl(GaringerOzone.daily.ts,s.window = "periodic")
plot(GaringerOzone_2_decomp_daily)
GaringerOzone_2_decomp_month <- stl(GaringerOzone.monthly.ts,s.window = "periodic")
plot(GaringerOzone_2_decomp_month)
```
12. Run a monotonic trend analysis for the monthly Ozone series. In this case the seasonal Mann-Kendall is most appropriate; why is this?
```{r}
#12
# Run SMK test
GaringerOzone_monthly_trend1 <- Kendall::SeasonalMannKendall(GaringerOzone.monthly.ts)
# Inspect results
GaringerOzone_monthly_trend1
summary(GaringerOzone_monthly_trend1)
GaringerOzone_monthly_trend2 <- trend::smk.test(GaringerOzone.monthly.ts)
# Inspect results
GaringerOzone_monthly_trend2
summary(GaringerOzone_monthly_trend2)
```
> Answer: The Mann-Kendall statistical test for trend is used to assess whether a set of data values is increasing over time or decreasing over time, and whether the trend in either direction is statistically significant. The Mann-Kendall test does NOT assess the magnitude of change.
Seasonal has selected the Mann-Kendall test for trend over other statistical trend tests because it can be used for indicators with varying units and time periods and does not require confidence intervals (which are not available for all indicators). In addition, the test can still be performed even when there are missing values in the set.
13. Create a plot depicting mean monthly ozone concentrations over time, with both a geom_point and a geom_line layer. Edit your axis labels accordingly.
```{r}
# 13
#Visualization
wind_data_plot <-
ggplot(GaringerOzone, aes(x = Date, y = Daily.Max.8.hour.Ozone.Concentration)) +
geom_point() +
geom_line() +
ylab("actual concentrations in ppm") +
geom_smooth( method = lm )
print(wind_data_plot)
```
14. To accompany your graph, summarize your results in context of the research question. Include output from the statistical test in parentheses at the end of your sentence. Feel free to use multiple sentences in your interpretation.
> Answer: It can be seen from the figure in question 13 that from 2010 to 2020, the actual concentrations in ppm have a clear trend of gradually decreasing. And it can be clearly seen that actual concentrations in ppm vary greatly with the change of seasons. Showing a wavy trend.
15. Subtract the seasonal component from the `GaringerOzone.monthly.ts`. Hint: Look at how we extracted the series components for the EnoDischarge on the lesson Rmd file.
16. Run the Mann Kendall test on the non-seasonal Ozone monthly series. Compare the results with the ones obtained with the Seasonal Mann Kendall on the complete series.
```{r}
#15
GaringerOzone_3_clean <- select(GaringerOzone_2_clean, Date,
Daily.Max.8.hour.Ozone.Concentration.clean)
mzt <- ts(GaringerOzone_3_clean, frequency=12, start=c(2010,1))
plot.ts(mzt)
mzt1 <- log(mzt)
plot.ts(mzt1)
mzt2 <- decompose(mzt1)
plot(mzt2)
#16
# Run SMK test
GaringerOzone_monthly_trend_no1 <- Kendall::SeasonalMannKendall(mzt1)
# Inspect results
GaringerOzone_monthly_trend_no1
summary(GaringerOzone_monthly_trend_no1)
GaringerOzone_monthly_trend_no2 <- trend::smk.test(mzt1)
# Inspect results
GaringerOzone_monthly_trend_no2
summary(GaringerOzone_monthly_trend_no2)
```
> Answer: