---
title: "Tidy Data"
output: html_document
---
<!-- This file by RStudio is taken from https://github.com/rstudio/master-the-tidyverse and is licensed under a Creative Commons Attribution 4.0 International License. -->
```{r setup}
library(tidyverse)
# Toy data

# Cases per country, with one column per year (wide format)
cases <- tribble(
  ~Country, ~"2011", ~"2012", ~"2013",
  "FR",       7000,    6900,    7000,
  "DE",       5800,    6000,    6200,
  "US",      15000,   14000,   13000
)

# Particle amounts by city and particle size (long format)
pollution <- tribble(
  ~city,      ~size,   ~amount,
  "New York", "large",      23,
  "New York", "small",      14,
  "London",   "large",      22,
  "London",   "small",      16,
  "Beijing",  "large",     121,
  "Beijing",  "small",     121
)

# Systolic blood pressure, with one column per measurement time
bp_systolic <- tribble(
  ~subject_id, ~time_1, ~time_2, ~time_3,
            1,     120,     118,     121,
            2,     125,     131,      NA,
            3,     141,      NA,      NA
)

# The same measurements, with one row per subject and time
bp_systolic2 <- tribble(
  ~subject_id, ~time, ~systolic,
            1,     1,       120,
            1,     2,       118,
            1,     3,       121,
            2,     1,       125,
            2,     2,       131,
            3,     1,       141
)
```
## Tidy and untidy data
`table1` is tidy:
```{r}
table1
```
For example, it's easy to add a rate column with `mutate()`:
```{r}
table1 %>%
  mutate(rate = cases / population)
```
`table2` isn't tidy: the `count` column really holds two variables, cases and population, and the `type` column says which one each row contains:
```{r}
table2
```
This layout makes the data hard to manipulate: computing a rate, for example, means combining values that sit in two different rows.
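To make the difficulty concrete, here is a minimal sketch of computing the same rate from `table2` without reshaping it first. The intermediate names `cases_rows`, `population_rows`, and `rate_from_table2` are illustrative and not part of the original workbook.

```{r}
library(tidyverse)

# Pull the two variables hidden in `count` into separate tables
cases_rows <- table2 %>%
  filter(type == "cases") %>%
  select(country, year, cases = count)

population_rows <- table2 %>%
  filter(type == "population") %>%
  select(country, year, population = count)

# Match the rows back up by country and year before dividing
rate_from_table2 <- cases_rows %>%
  inner_join(population_rows, by = c("country", "year")) %>%
  mutate(rate = cases / population)

rate_from_table2
```

With `table1` the same result takes a single `mutate()` call, because each variable already has its own column.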
## Your Turn 1
Is `bp_systolic` tidy?
```{r}
bp_systolic
```
## Your Turn 2
Using `bp_systolic2` with `group_by()` and `summarise()`:
* Find the average systolic blood pressure for each subject
* Find the last time each subject was measured
```{r}
bp_systolic2
```
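As a reminder of the pattern, here is a small sketch of `group_by()` followed by `summarise()` on a different data set. It assumes a copy of the `pollution` toy data, rebuilt here as `pollution_copy` so the chunk runs on its own; the summary column names are illustrative.

```{r}
library(tidyverse)

# Same values as `pollution` in the setup chunk
pollution_copy <- tribble(
  ~city,      ~size,   ~amount,
  "New York", "large",      23,
  "New York", "small",      14,
  "London",   "large",      22,
  "London",   "small",      16,
  "Beijing",  "large",     121,
  "Beijing",  "small",     121
)

# One output row per city, with one column per summary
pollution_summary <- pollution_copy %>%
  group_by(city) %>%
  summarise(
    mean_amount = mean(amount),
    max_amount = max(amount)
  )

pollution_summary
```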
## Your Turn 3
On a sheet of paper, draw how the `cases` data set would look if it had the same values grouped into three columns: **country**, **year**, and **n**.
## Your Turn 4
Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**.
```{r}
table4a
```
## Your Turn 5
On a sheet of paper, draw how the `pollution` data set would look if it had the same values grouped into three columns: **city**, **large**, and **small**.
## Your Turn 6
Use `spread()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**.
```{r}
table2
```
***
# Takeaways
Data comes in many formats but R prefers just one: _tidy data_.
A data set is tidy if and only if:
1. Every variable is in its own column
2. Every observation is in its own row
3. Every value is in its own cell (which follows from the above)
What counts as a variable and what counts as an observation may depend on your immediate goal.
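As an illustration, the blood pressure data can be read either way. If each subject is the observation, the wide layout of `bp_systolic` is convenient; if each measurement is the observation, a long layout like `bp_systolic2` fits better. The sketch below moves from one to the other with `pivot_longer()`, the current tidyr function that supersedes `gather()`. The names `bp_wide` and `bp_long` are illustrative, and the data is repeated here only so the chunk runs on its own.

```{r}
library(tidyverse)

# Same values as `bp_systolic` in the setup chunk
bp_wide <- tribble(
  ~subject_id, ~time_1, ~time_2, ~time_3,
            1,     120,     118,     121,
            2,     125,     131,      NA,
            3,     141,      NA,      NA
)

# One row per measurement: subject_id, time, systolic
bp_long <- bp_wide %>%
  pivot_longer(
    cols = starts_with("time_"),
    names_to = "time",
    names_prefix = "time_",   # strip the prefix so only the number remains
    values_to = "systolic",
    values_drop_na = TRUE     # drop measurements that were never taken
  ) %>%
  mutate(time = as.numeric(time))

bp_long
```

The result should hold the same six measurements as `bp_systolic2`, one per row.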