---
title: "Dates"
---
A standard historian's joke goes something like this: "On or about October 31, 1517, modernity began." Martin Luther did indeed nail his ninety-five theses to the door of the Wittenberg church on that date. But the beginning of modernity, or capitalism, or conceptions of the self, or any other historical abstraction worth studying cannot be so readily pinned down to a specific date, whether on or about. The problem is one of precision. Historical events and movements have fuzzy beginnings and endings, and even straightforward facts often have uncertain dates. Think of chronological terms like "the long nineteenth century," "the last third of the eighteenth century," "the American Revolution," or "antiquity" and "modernity," and you'll recognize how good historians are at expressing the necessary chronological uncertainty in prose.
The problem of precision and dates is even more vexing when it comes to computation. Computers demand precision in dates. The computer has no built-in concept of spans of time such as centuries. To explain the nineteenth century to a computer, you will have to tell it that the century runs from January 1, 1801 to December 31, 1900.^[Or 1800 and 1899: whichever definition of the start of a century you prefer.] Even more difficult is that computers require precision down to the second or below. On Unix operating systems, dates are computed as the number of seconds elapsed since the beginning of the Unix epoch: 00:00:00 in Coordinated Universal Time (for practical purposes, Greenwich Mean Time) on Thursday, January 1, 1970, excluding leap seconds.^[If we wanted to represent a date before January 1, 1970, then R would store that date as a negative number.]
If I ask for the current system time, I see that more than 1.4 billion seconds have elapsed since then, and R reports the value down to the hundred-thousandth of a second.
```{r get_sys_time}
now <- Sys.time()
as.character(as.numeric(now))
```
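The claim that dates before 1970 are stored as negative numbers can be checked directly. The following is a minimal sketch using only base R, with illustrative variable names. Note that a `Date` object counts days since the epoch, while a date-time (`POSIXct`) object counts seconds.

```{r}
# A date before the Unix epoch, stored as a base R Date object
before_epoch <- as.Date("1969-07-21")

# Date objects are stored as days since 1970-01-01, so this is negative
as.numeric(before_epoch)

# Date-time objects are stored as seconds since the epoch, also negative here
before_epoch_time <- as.POSIXct("1969-07-21 00:00:00", tz = "UTC")
as.numeric(before_epoch_time)
```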
The problem for the computational historian, then, is to work with artificially precise dates in a way that makes sense for the discipline of history. This will require a deep knowledge on your part about the date systems of the period you are working in. A historian of early modern Europe, for example, will have to be aware of the distinction between old style and new style dates, while a historian of China will have to be able to translate Chinese dates into a format amenable to R. In general, R will only work with dates in the [Gregorian calendar](http://en.wikipedia.org/wiki/Gregorian_calendar).
This chapter will work with a sample dataset of missions by the Paulist Fathers, a Roman Catholic missionary order in the United States. After a brief introduction to R's native date formats, the chapter will explain the [lubridate](http://cran.rstudio.org/web/packages/lubridate/) package. In particular it will cover parsing dates, extracting information from them, and using them in calculations.^[I will assume that you know how to [manipulate data](data.html) as explained in that chapter.]
## R's native date system
### Creating a date object
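As a minimal sketch, assuming only base R, a date object can be created from a character string with `as.Date()`. Strings in year-month-day order are understood by default, while other arrangements need a `format` argument. The variable names here are illustrative.

```{r}
# ISO-style strings are parsed without any extra arguments
moon_walk_base <- as.Date("1969-07-21")
moon_walk_base
class(moon_walk_base)

# Other layouts need a format string: %m = month, %d = day, %Y = four-digit year
moon_walk_us <- as.Date("7/21/1969", format = "%m/%d/%Y")
moon_walk_us
```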
### Extracting information from a date object
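Here is a hedged sketch of how components can be pulled out of a base R date object. `format()` returns character strings, which can be converted to integers where needed, and `weekdays()` and `months()` return names in the language of the current locale. The variable name is illustrative.

```{r}
moon_walk_base <- as.Date("1969-07-21")

# format() returns character strings; convert to integer for arithmetic
as.integer(format(moon_walk_base, "%Y"))  # year
as.integer(format(moon_walk_base, "%m"))  # month
as.integer(format(moon_walk_base, "%d"))  # day of the month

# Names of the weekday and month, in the language of the current locale
weekdays(moon_walk_base)
months(moon_walk_base)
```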
## Lubridate
As is the case with much of R, a lot of the pain in manipulating dates can be removed by using the right package. [Lubridate](http://cran.rstudio.org/web/packages/lubridate/) is a package which provides many useful functions for parsing dates, getting information out of them, and performing calculations on them.^[A much fuller introduction to [lubridate](http://cran.rstudio.org/web/packages/lubridate/) can be found in Garrett Grolemund and Hadley Wickham, "Dates and Times Made Easy with lubridate," *Journal of Statistical Software* 40, no. 3 (2011): <http://www.jstatsoft.org/v40/i03/> and the package's documentation and vignette.]
### Parsing dates
The most useful feature of lubridate is the set of functions it provides to parse dates from character strings. If you have control over how data is stored and represented, then you should use the widely agreed upon standard [ISO-8601](http://en.wikipedia.org/wiki/ISO_8601). That standard has many recommendations, but its basic recommendation is that dates be stored as strings, arranged from year to month to day to time. You can omit any value for which you do not have data.^[The standard also allows you to represent dates as numbers without the hyphens, for example, `19690721`.] For example, July 21, 1969 would be represented as `"1969-07-21"`, while just the month of July 1969 would be represented as `"1969-07"`. Dates in this format can be easily parsed, and as a bonus, they can be easily sorted as well, even by a program which only knows how to sort text or numbers and not dates. Chances are good, however, that you do not have control over the formats in which dates are represented in your data, so an easy way of parsing them is important.
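The sorting claim is easy to check. In this minimal sketch (base R only, with illustrative dates), the ISO-8601 strings are sorted as plain text and still come out in chronological order, whereas the same dates written month-first do not.

```{r}
iso_dates <- c("1969-07-21", "1962-09-12", "1963-11-22")
us_dates  <- c("07/21/1969", "09/12/1962", "11/22/1963")

# Sorted as plain character strings, ISO dates are in chronological order
sort(iso_dates)

# Month-first strings sort by month, which scrambles the chronology
sort(us_dates)
```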
Lubridate provides a set of functions in the form `ymd()` which can parse strings into dates. To take our example above, we can use that function to parse our string representing a date into a date object.
```{r}
library(lubridate)
ymd("1969-07-21")
```
But lubridate includes a number of other functions that rearrange the order of year, month, and day. For example, `mdy()` will parse dates arranged by month, day, and year. And lubridate is very good at taking representations of months such as `"January"` or `"Jan."` and turning them into date objects. All of the following are dates that lubridate knows how to parse.
```{r}
ymd("1969 July 21")
mdy("July 21, 1969")
mdy("7/21/1969")
dmy("21 July 1969")
```
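When a single column mixes several of these arrangements, lubridate's `parse_date_time()` function accepts more than one candidate order. The following is a minimal sketch under that assumption. The input vector is made up for illustration, and the result is a date-time (`POSIXct`) object, so it is converted back to a date with `as.Date()`.

```{r}
library(lubridate)

# Hypothetical messy input: the same date in two arrangements
mixed <- c("1969-07-21", "7/21/1969")

# Try year-month-day first, then month-day-year
parsed <- parse_date_time(mixed, orders = c("ymd", "mdy"))
as.Date(parsed)
```

It is worth inspecting the result by eye, since an ambiguous string such as `"01/02/03"` could be matched by more than one order.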
If lubridate does not know how to parse a date, it does the right thing: it emits a warning message so you will know that a string did not parse into a date, and it assigns the value `NA`. For example, a dataset I worked with represented unknown dates with the string `"unknown"`.
```{r}
ymd("unknown")
```
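In a longer column it helps to find out which values failed to parse. This is a minimal sketch with a made-up vector named `recorded`. The `quiet = TRUE` argument silences the warning, and `is.na()` then identifies the failures.

```{r}
library(lubridate)

# Hypothetical column of recorded dates, one of them unknown
recorded <- c("1969-07-21", "unknown", "1962-09-12")

# quiet = TRUE suppresses the warning; failures still become NA
parsed <- ymd(recorded, quiet = TRUE)
parsed

# Count the failures and recover the strings that did not parse
sum(is.na(parsed))
recorded[is.na(parsed)]
```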
If you have a date that represents only a year and a month, lubridate will not be able to parse it because R needs a year, a month, and a day to create a date object. A common practice in such cases is to assign an arbitrary day, usually the first day of the month. As long as you don't mix dates that are specific only down to the month with dates that contain real days, this should not cause a problem.
```{r}
month <- "1969-07"
ymd(month)
# Use paste0() to add an arbitrary day to the date
# (paste() would insert a space before the "-01")
ymd(paste0(month, "-01"))
```
If you have dates that are only years, it is not usually necessary to convert them to date objects. You can treat them simply as integers.
### Extracting information
Once you have dates stored as date objects, you can use lubridate's functions to extract the components of the date. For example, given a list of dates, you might be interested only in the year in which they occurred, or perhaps you want to see if there is a cyclical pattern by extracting the month. Lubridate can also extract the day of the week. (This will only work for Gregorian dates.)
Take our example date of July 21, 1969 from above. If we save this in a variable, we can extract all kinds of information from it in a variety of numeric and character formats.
```{r}
moon_walk <- mdy("July 21, 1969")
year(moon_walk)
month(moon_walk)
month(moon_walk, label = TRUE)
month(moon_walk, label = TRUE, abbr = FALSE)
day(moon_walk)
yday(moon_walk) # day of the year
weekdays(moon_walk) # day of the week (a base R function)
weekdays(moon_walk, abbreviate = TRUE)
week(moon_walk) # week of the year
```
In the section on Paulist missions below, we will demonstrate how extracting this kind of information can lead to useful analysis.
### Calculations with dates
When we use date objects instead of character strings to represent dates, it becomes possible to perform calculations on the dates. We might ask, for example, how long a period there was between John F. Kennedy's "We choose to go to the moon" speech and the Apollo 11 moon walk. After we create the two date objects, we can simply subtract the date of the Kennedy speech from the date of the moon walk:
```{r}
kennedy_speech <- mdy("September 12, 1962")
moon_walk - kennedy_speech
```
The `-` operator here is actually a function: subtracting one date from another returns a base R `difftime` object. We can get the same result by calling the `difftime()` function directly. Doing so will allow us to specify the units:
```{r}
difftime(moon_walk, kennedy_speech, units = "weeks")
```
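Because the result of `difftime()` carries its units along with it, it sometimes helps to convert it to a plain number before doing further arithmetic. This is a minimal sketch using base R only. Dividing by 365.25 is a rough convention for approximating years, not something lubridate prescribes.

```{r}
kennedy_speech <- as.Date("1962-09-12")
moon_walk <- as.Date("1969-07-21")

# Strip the units to get an ordinary number of days
days_between <- as.numeric(difftime(moon_walk, kennedy_speech, units = "days"))
days_between

# Approximate number of years, using an average year of 365.25 days
days_between / 365.25
```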
There are other kinds of arithmetic that can be performed with dates. For example, one could add a week, a month, or a year to the date of the Kennedy speech.
```{r}
kennedy_speech
kennedy_speech + weeks(1)
kennedy_speech + months(1)
kennedy_speech + years(1)
```
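Adding months needs some care when the resulting day does not exist. As a hedged sketch of lubridate's behavior: ordinary addition with `months()` returns `NA` for an impossible date, while the `%m+%` operator rolls back to the last real day of the month. The January 31 date is just an illustration.

```{r}
library(lubridate)

end_of_january <- ymd("1969-01-31")

# There is no February 31, so ordinary period addition yields NA
end_of_january + months(1)

# %m+% rolls back to the last day of the shorter month
end_of_january %m+% months(1)
```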
### Intervals
Lubridate also provides the ability to define intervals of time using the `interval()` function. For example, we could define a set of intervals for each of the presidential administrations:
```{r}
kennedy_admin <- interval(mdy("January 20, 1961"), mdy("November 22, 1963"))
johnson_admin <- interval(mdy("November 22, 1963"), mdy("January 20, 1969"))
nixon_admin <- interval(mdy("January 20, 1969"), mdy("August 9, 1974"))
administrations <- c(kennedy_admin, johnson_admin, nixon_admin)
```
There are a number of helpful operations that can be performed on intervals. The most obvious is to check whether a date occurred within an interval. We can figure out whether Kennedy's speech occurred during his administration.
```{r}
kennedy_speech %within% kennedy_admin
```
This becomes more powerful when we check the date of the first moon walk against the vector of presidential administrations.
```{r}
moon_walk %within% administrations
```
The moon walk was only within the third administration, which was Nixon's. Lubridate will also allow you to perform set operations such as finding the intersection between two intervals and their union, using its interval-aware versions of the functions `intersect()` and `union()` respectively. A particularly useful function is `int_overlaps()`, which returns a `TRUE` value if two intervals overlap at all. For example, we can test which administrations kept U.S. troops in Vietnam.
```{r}
troops_in_vietnam <- interval(mdy("9/1/1950"), mdy("4/30/1975"))
int_overlaps(troops_in_vietnam, administrations)
```
There were American troops in Vietnam during all of the Kennedy, Johnson, and Nixon administrations.
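To illustrate the set operations mentioned above, here is a minimal sketch that intersects two intervals. It redefines the intervals with numeric date strings so that it runs on its own. The name `overlap` is illustrative.

```{r}
library(lubridate)

nixon_admin <- interval(mdy("1/20/1969"), mdy("8/9/1974"))
troops_in_vietnam <- interval(mdy("9/1/1950"), mdy("4/30/1975"))

# The part of the Nixon administration during which troops were in Vietnam
overlap <- intersect(troops_in_vietnam, nixon_admin)
overlap

# Start and end points of the overlapping interval
int_start(overlap)
int_end(overlap)
```

Since the Nixon administration falls entirely inside the longer interval, the intersection is simply the administration itself.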
## Working with dates
Beginning in 1851, Redemptorist priests who became the Paulist Fathers held missions throughout the United States. They recorded their missions in manuscript volumes, keeping track of the start and end dates of the missions, where they were held, and how many Catholics came to confession and how many Protestants converted. A transcription of these records is available in the historydata package. Because of the chronological information contained in the records, they make a good source to explore how to work with dates.
We will begin by loading the data, along with the packages we will use to manipulate and visualize the data.
```{r}
library(lubridate)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(ggplot2)
library(historydata)
data(paulist_missions)
paulist_missions
```
Now we can use lubridate to parse the date fields, `start_date` and `end_date`, into date objects.
```{r}
paulist_missions <- paulist_missions %>%
  mutate(start = mdy(start_date),
         end = mdy(end_date))
```
We can then extract some information from the date objects. Let's find out the year, the month, and the day of the week. We can also create a field for the duration of the mission.
```{r}
paulist_missions <- paulist_missions %>%
  mutate(year_start = year(start),
         year_end = year(end),
         month_start = month(start, label = TRUE),
         month_end = month(end, label = TRUE),
         day_start = weekdays(start),
         day_end = weekdays(end),
         duration = as.numeric(difftime(end, start, units = "days")))
head(paulist_missions)
```
In their narrative sources, the Paulists record that the typical mission began on a Sunday evening and lasted for two weeks until the next Sunday. That was their ideal mission, but how did the actual missions proceed? And did the typical mission vary over time?
We can calculate a simple mean and median of the duration of the missions.
```{r}
mean(paulist_missions$duration)
median(paulist_missions$duration)
```
Surprisingly, the average mission (9.56 days) or median mission (8 days) was quite a bit shorter than what the Paulists described as typical or ideal. We can use dplyr and ggplot2 to find out a breakdown of the kinds of missions.
```{r durations-of-paulist-missions}
missions_duration <- paulist_missions %>%
  group_by(duration) %>%
  summarize(n = n())
missions_duration
ggplot(missions_duration,
       aes(x = duration, y = n)) +
  geom_bar(stat = "identity") +
  xlim(0, 30) +
  ggtitle("Durations of Paulist missions")
```
The missions show a definite multi-modal pattern: missions were usually either one week long or two weeks long, though they could range anywhere from one day to twenty-one.^[One mission is recorded as taking 60 days, but looking at the manuscript shows that it was a series of small three- or four-day-long retreats for which no specific dates were recorded.]
We can complicate the analysis by calculating the average duration of the missions per year, to see whether the Paulists lengthened or shortened their missions. First we will plot the duration of the mission against the start date:
```{r duration-of-missions-by-date}
ggplot(paulist_missions, aes(start, duration)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Duration of Paulist missions by date")
```
This shows no significant variation in the pattern, though we note that after 1880 missions could be considerably longer. We can also calculate the mean length of the mission for each year.
```{r average-mission-length-by-year}
missions_length_by_year <- paulist_missions %>%
  group_by(year_start) %>%
  summarize(duration = mean(duration))
missions_length_by_year
ggplot(missions_length_by_year,
       aes(x = year_start, y = duration)) +
  geom_line() + geom_point() +
  ylim(0, 14) +
  ggtitle("Average mission length by year")
```
By this chart we can see that there are significant fluctuations, but that they do not appear to have any meaning behind them. Notice that in the earlier scatterplot the x-axis was plotted using the date object, which means that its points are organized not just by year but also by month and day; in this chart the x-axis is simply the year.
Since it appears that the Paulists preferred missions that were one or two weeks long, we can ask which days of the week were most popular to start and end the missions. Here we will simply combine the data manipulation and plotting.
```{r}
paulist_missions %>%
  select(day_start, day_end) %>%
  gather(type, day, day_start, day_end) %>%
  mutate(day = factor(day,
                      levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                 "Thursday", "Friday", "Saturday"))) %>%
  ggplot(aes(x = day)) +
  geom_bar() +
  facet_grid(type ~ .) +
  ggtitle("Start and end days of Paulist missions")
```
It appears that as a rule Paulist missions started on a Sunday with only very rare exceptions. In general they ended on a Sunday as well, but it was also common for them to end on some other day of the week.
Finally, the Paulists tell us that they did not go on missions during the summer. We can plot the cyclical pattern of missions each year.
```{r}
ggplot(paulist_missions, aes(x = month_start)) +
  geom_bar() +
  ggtitle("Months in which Paulists started missions")
```
The Paulists never started missions in July (notice it is absent from the chart) and only rarely started missions in June or August. Fall, winter, and spring were the Paulist campaigning seasons.
Using lubridate, we can create tables and visualizations that take into account the diachronic and cyclical patterns in temporal data.
## Times
Both R and lubridate offer functions for dealing with times too. Besides date objects, it is possible to create date and time objects. I have seldom found reason to analyze times as a historian. In general, you should only represent times if you absolutely need them. The simpler form of just date objects is generally sufficient. Read the documentation for lubridate if you need to represent times. The pattern is essentially the same as for dates.
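For the rare case where a time is needed, here is a minimal sketch using lubridate's `ymd_hms()` parser and its `hour()` and `minute()` accessors. The timestamp is purely illustrative, and the parser assumes Coordinated Universal Time unless told otherwise.

```{r}
library(lubridate)

# Parse a date together with a time of day (illustrative timestamp, UTC by default)
example_time <- ymd_hms("1969-07-21 02:56:15")
example_time

# The same kinds of accessors work as for dates
hour(example_time)
minute(example_time)
as.Date(example_time)
```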
## Further Reading
- Garrett Grolemund and Hadley Wickham, "Dates and Times Made Easy with lubridate," *Journal of Statistical Software* 40, no. 3 (2011): <http://www.jstatsoft.org/v40/i03/>.