A07_TimeSeries(1).Rmd

---
title: "Assignment 7: Time Series Analysis"
author: "Zhiteng Ma"
output:
  pdf_document:
    latex_engine: xelatex
geometry: margin=2.54cm
editor_options:
  chunk_output_type: inline
---

## OVERVIEW

This exercise accompanies the lessons in Environmental Data Analytics on time series analysis. 

## Directions
1. Change "Student Name" on line 3 (above) with your name.
2. Work through the steps, **creating code and output** that fulfill each instruction.
3. Be sure to **answer the questions** in this assignment document.
4. When you have completed the assignment, **Knit** the text and code into a single PDF file.
5. After Knitting, submit the completed exercise (PDF file) to the dropbox in Sakai. Add your last name into the file name (e.g., "Fay_A07_TimeSeries.Rmd") prior to submission.

The completed exercise is due on Tuesday, March 16 at 11:59 pm.

## Set up 
1. Set up your session:

* Check your working directory
* Load the tidyverse, lubridate, zoo, and trend packages
* Set your ggplot theme

2. Import the ten datasets from the Ozone_TimeSeries folder in the Raw data folder. These contain ozone concentrations at Garinger High School in North Carolina from 2010-2019 (the EPA air database only allows downloads for one year at a time). Import these either individually or in bulk and then combine them into a single dataframe named `GaringerOzone` of 3589 observation and 20 variables. 

```{r, message = FALSE}
#1
library(tidyverse)
library(lubridate)
library(trend)
library(zoo)
#2
getwd()
setwd('c:/Users/Zhiteng Ma/Desktop/EDA-Fall2022-main/Data/Raw/Ozone_TimeSeries')
NC2010 <- read.csv('EPAair_O3_GaringerNC2010_raw.csv',stringsAsFactors =TRUE)
NC2011 <- read.csv('EPAair_O3_GaringerNC2011_raw.csv',stringsAsFactors =TRUE)
NC2012 <- read.csv('EPAair_O3_GaringerNC2012_raw.csv',stringsAsFactors =TRUE)
NC2013 <- read.csv('EPAair_O3_GaringerNC2013_raw.csv',stringsAsFactors =TRUE)
NC2014 <- read.csv('EPAair_O3_GaringerNC2014_raw.csv',stringsAsFactors =TRUE)
NC2015 <- read.csv('EPAair_O3_GaringerNC2015_raw.csv',stringsAsFactors =TRUE)
NC2016 <- read.csv('EPAair_O3_GaringerNC2016_raw.csv',stringsAsFactors =TRUE)
NC2017 <- read.csv('EPAair_O3_GaringerNC2017_raw.csv',stringsAsFactors =TRUE)
NC2018 <- read.csv('EPAair_O3_GaringerNC2018_raw.csv',stringsAsFactors =TRUE)
NC2019 <- read.csv('EPAair_O3_GaringerNC2019_raw.csv',stringsAsFactors =TRUE)

GaringerOzone <- rbind(NC2010, NC2011, NC2012, NC2013, NC2014, NC2015, NC2016, 
NC2017, NC2018, NC2019)
```

## Wrangle

3. Set your date column as a date class.

4. Wrangle your dataset so that it only contains the columns Date, Daily.Max.8.hour.Ozone.Concentration, and DAILY_AQI_VALUE. 

5. Notice there are a few days in each year that are missing ozone concentrations. We want to generate a daily dataset, so we will need to fill in any missing days with NA. Create a new data frame that contains a sequence of dates from 2010-01-01 to 2019-12-31 (hint: `as.data.frame(seq())`). Call this new data frame Days. Rename the column name in Days to "Date".

6. Use a `left_join` to combine the data frames. Specify the correct order of data frames within this function so that the final dimensions are 3652 rows and 3 columns. Call your combined data frame GaringerOzone. 

```{r}
# 3
class(GaringerOzone$Date)
GaringerOzone$Date <- as.Date(GaringerOzone$Date, format = "%m/%d/%Y")

# 4
GaringerOzone_1 <-   select(GaringerOzone, Date, Daily.Max.8.hour.Ozone.Concentration,
DAILY_AQI_VALUE)

# 5
Days <- as.data.frame(seq.Date(from = as.Date("2010-01-01"),to= as.Date("2019-12-31"),by=1))
       colnames(Days) <- c("Date")
       
# 6
GaringerOzone_2 <- left_join(Days,GaringerOzone_1, by = c("Date"))

```

## Visualize

7. Create a line plot depicting ozone concentrations over time. In this case, we will plot actual concentrations in ppm, not AQI values. Format your axes accordingly. Add a smoothed line showing any linear trend of your data. Does your plot suggest a trend in ozone concentration over time?

```{r}
#7


wind_data_plot <-
ggplot(GaringerOzone_2, aes(x = Date, y = Daily.Max.8.hour.Ozone.Concentration)) +
  #geom_point(color = "red") +
  geom_line(color = "black") +
  ylab("actual concentrations in ppm") +
  geom_smooth( method = lm, color = "blue" )
print(wind_data_plot)


```

>Answer:My chart clearly suggests a trend in ozone concentration over time. Because the ozone concentration tends to decrease over time.

## Time Series Analysis

Study question: Have ozone concentrations changed over the 2010s at this station? 

8. Use a linear interpolation to fill in missing daily data for ozone concentration. Why didn't we use a piecewise constant or spline interpolation?

```{r}
#8

head(GaringerOzone_2)
summary(GaringerOzone_2$Daily.Max.8.hour.Ozone.Concentration)
# Adding new column with no missing obs, just for illustration purpose
# In real applications you will simply replace NAs
GaringerOzone_2_clean <- 
  GaringerOzone_2 %>% 
  mutate( Daily.Max.8.hour.Ozone.Concentration.clean = zoo::na.approx
(Daily.Max.8.hour.Ozone.Concentration) )
summary(GaringerOzone_2_clean$Daily.Max.8.hour.Ozone.Concentration.clean)
#Note the NA is gone
ggplot(GaringerOzone_2_clean) +
  geom_line(aes(x = Date, y = Daily.Max.8.hour.Ozone.Concentration.clean), color = "red") +
  geom_line(aes(x = Date, y = Daily.Max.8.hour.Ozone.Concentration), color = "black") +
  ylab("actual concentrations in ppm") 

```

> Answer: In general, cubic interpolation is better than linear interpolation in most aspects such as smoothness of the function and higher accuracy in approximating the original function. However, there is at least one aspect where linear interpolation is better: the linear interpolation will not produce the "overshoot" situation. 


9. Create a new data frame called `GaringerOzone.monthly` that contains aggregated data: mean ozone concentrations for each month. In your pipe, you will need to first add columns for year and month to form the groupings. In a separate line of code, create a new Date column with each month-year combination being set as the first day of the month (this is for graphing purposes only)

```{r}
#9

GaringerOzone_2_clean.monthly <- GaringerOzone_2_clean %>%
  mutate(month_year = floor_date(Date, "month")) %>%
group_by(month_year) %>%
summarise(mean.concentration = mean(Daily.Max.8.hour.Ozone.Concentration.clean))

```


10. Generate two time series objects. Name the first `GaringerOzone.daily.ts` and base it on the dataframe of daily observations. Name the second `GaringerOzone.monthly.ts` and base it on the monthly average ozone values. Be sure that each specifies the correct start and end dates and the frequency of the time series.

```{r}
#10

f_day <- day(first(GaringerOzone_2_clean$Date))
f_month <- month(first(GaringerOzone_2_clean.monthly$month_year))
f_year <- year(first(GaringerOzone_2_clean$Date))

GaringerOzone.daily.ts <- ts(GaringerOzone_2_clean$
                  Daily.Max.8.hour.Ozone.Concentration.clean,
                   start=c(f_year,f_day),
                   frequency=365) 

GaringerOzone.monthly.ts <- ts(GaringerOzone_2_clean.monthly$mean.concentration,
                   start=c(f_year,f_month),
                   frequency=12) 

```


11. Decompose the daily and the monthly time series objects and plot the components using the `plot()` function.

```{r}
#11

GaringerOzone_2_decomp_daily <- stl(GaringerOzone.daily.ts,s.window = "periodic")
plot(GaringerOzone_2_decomp_daily)

GaringerOzone_2_decomp_month <- stl(GaringerOzone.monthly.ts,s.window = "periodic")
plot(GaringerOzone_2_decomp_month)

```


12. Run a monotonic trend analysis for the monthly Ozone series. In this case the seasonal Mann-Kendall is most appropriate; why is this?

```{r}
#12

# Run SMK test
GaringerOzone_monthly_trend1 <- Kendall::SeasonalMannKendall(GaringerOzone.monthly.ts)

# Inspect results
GaringerOzone_monthly_trend1
summary(GaringerOzone_monthly_trend1)

GaringerOzone_monthly_trend2 <- trend::smk.test(GaringerOzone.monthly.ts)

# Inspect results
GaringerOzone_monthly_trend2
summary(GaringerOzone_monthly_trend2)



```

> Answer: The Mann-Kendall statistical test for trend is used to assess whether a set of data values is increasing over time or decreasing over time, and whether the trend in either direction is statistically significant. The Mann-Kendall test does NOT assess the magnitude of change.
Seasonal has selected the Mann-Kendall test for trend over other statistical trend tests because it can be used for indicators with varying units and time periods and does not require confidence intervals (which are not available for all indicators). In addition, the test can still be performed even when there are missing values in the set.

13. Create a plot depicting mean monthly ozone concentrations over time, with both a geom_point and a geom_line layer. Edit your axis labels accordingly.

```{r}
# 13

#Visualization
wind_data_plot <-
ggplot(GaringerOzone, aes(x = Date, y = Daily.Max.8.hour.Ozone.Concentration)) +
  geom_point() +
  geom_line() +
  ylab("actual concentrations in ppm") +
  geom_smooth( method = lm )
print(wind_data_plot)


```

14. To accompany your graph, summarize your results in context of the research question. Include output from the statistical test in parentheses at the end of your sentence. Feel free to use multiple sentences in your interpretation.

> Answer: It can be seen from the figure in question 13 that from 2010 to 2020, the actual concentrations in ppm have a clear trend of gradually decreasing. And it can be clearly seen that actual concentrations in ppm vary greatly with the change of seasons. Showing a wavy trend.

15. Subtract the seasonal component from the `GaringerOzone.monthly.ts`. Hint: Look at how we extracted the series components for the EnoDischarge on the lesson Rmd file.

16. Run the Mann Kendall test on the non-seasonal Ozone monthly series. Compare the results with the ones obtained with the Seasonal Mann Kendall on the complete series.

```{r}
#15

GaringerOzone_3_clean <-   select(GaringerOzone_2_clean, Date,
                                  Daily.Max.8.hour.Ozone.Concentration.clean)

mzt <- ts(GaringerOzone_3_clean, frequency=12, start=c(2010,1))
plot.ts(mzt)

mzt1 <- log(mzt)
plot.ts(mzt1)

mzt2 <- decompose(mzt1)
plot(mzt2)

#16


# Run SMK test
GaringerOzone_monthly_trend_no1 <- Kendall::SeasonalMannKendall(mzt1)

# Inspect results
GaringerOzone_monthly_trend_no1
summary(GaringerOzone_monthly_trend_no1)

GaringerOzone_monthly_trend_no2 <- trend::smk.test(mzt1)

# Inspect results
GaringerOzone_monthly_trend_no2
summary(GaringerOzone_monthly_trend_no2)


```

> Answer: