# Chart: Scatterplot {#scatter}
![](images/banners/banner_scatterplot.png)
## Overview
This section covers how to make scatterplots.
## tl;dr
Fancy Example NOW! Gimme Gimme GIMME!
<!-- Explanation: -->
Here's a look at the relationship between brain weight and body weight for 62 species of land mammals:
```{r tldr-show-plot, echo=FALSE, warning=FALSE, fig.height=6, fig.width=9}
library(MASS) # data
library(ggplot2) # plotting

# ratio for color choices
ratio <- mammals$brain / (mammals$body*1000)

ggplot(mammals, aes(x = body, y = brain)) +
  # plot points, group by color
  geom_point(aes(fill = ifelse(ratio >= 0.02, "#0000ff",
                        ifelse(ratio >= 0.01 & ratio < 0.02, "#00ff00",
                        ifelse(ratio >= 0.005 & ratio < 0.01, "#00ffff",
                        ifelse(ratio >= 0.001 & ratio < 0.005, "#ffff00", "#ffffff"))))),
             col = "#656565", alpha = 0.5, size = 4, shape = 21) +
  # add chosen text annotations
  geom_text(aes(label = ifelse(row.names(mammals) %in% c("Mouse", "Human", "Asian elephant", "Chimpanzee", "Owl monkey", "Ground squirrel"),
                               paste(as.character(row.names(mammals)), "→", sep = " "),'')),
            hjust = 1.12, vjust = 0.3, col = "grey35") +
  geom_text(aes(label = ifelse(row.names(mammals) %in% c("Golden hamster", "Kangaroo", "Water opossum", "Cow"),
                               paste("←", as.character(row.names(mammals)), sep = " "),'')),
            hjust = -0.12, vjust = 0.35, col = "grey35") +
  # customize legend/color palette
  scale_fill_manual(name = "Brain Weight, as the\n% of Body Weight",
                    # values = c('#e66101','#fdb863','#b2abd2','#5e3c99'),
                    values = c('#d7191c','#fdae61','#ffffbf','#abd9e9','#2c7bb6'),
                    breaks = c("#0000ff", "#00ff00", "#00ffff", "#ffff00", "#ffffff"),
                    labels = c("Greater than 2%", "Between 1%-2%", "Between 0.5%-1%", "Between 0.1%-0.5%", "Less than 0.1%")) +
  # formatting
  scale_x_log10(name = "Body Weight", breaks = c(0.01, 1, 100, 10000),
                labels = c("10 g", "1 kg", "100 kg", "10K kg")) +
  scale_y_log10(name = "Brain Weight", breaks = c(1, 10, 100, 1000),
                labels = c("1 g", "10 g", "100 g", "1 kg")) +
  ggtitle("An Elephant Never Forgets...How Big A Brain It Has",
          subtitle = "Brain and Body Weights of Sixty-Two Species of Land Mammals") +
  labs(caption = "Source: MASS::mammals") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68")) +
  theme(legend.position = c(0.832, 0.21))
```
And here's the code:
```{r tldr-code, eval=FALSE}
library(MASS) # data
library(ggplot2) # plotting

# ratio for color choices
ratio <- mammals$brain / (mammals$body*1000)

ggplot(mammals, aes(x = body, y = brain)) +
  # plot points, group by color
  geom_point(aes(fill = ifelse(ratio >= 0.02, "#0000ff",
                        ifelse(ratio >= 0.01 & ratio < 0.02, "#00ff00",
                        ifelse(ratio >= 0.005 & ratio < 0.01, "#00ffff",
                        ifelse(ratio >= 0.001 & ratio < 0.005, "#ffff00", "#ffffff"))))),
             col = "#656565", alpha = 0.5, size = 4, shape = 21) +
  # add chosen text annotations
  geom_text(aes(label = ifelse(row.names(mammals) %in% c("Mouse", "Human", "Asian elephant", "Chimpanzee", "Owl monkey", "Ground squirrel"),
                               paste(as.character(row.names(mammals)), "→", sep = " "),'')),
            hjust = 1.12, vjust = 0.3, col = "grey35") +
  geom_text(aes(label = ifelse(row.names(mammals) %in% c("Golden hamster", "Kangaroo", "Water opossum", "Cow"),
                               paste("←", as.character(row.names(mammals)), sep = " "),'')),
            hjust = -0.12, vjust = 0.35, col = "grey35") +
  # customize legend/color palette
  scale_fill_manual(name = "Brain Weight, as the\n% of Body Weight",
                    values = c('#d7191c','#fdae61','#ffffbf','#abd9e9','#2c7bb6'),
                    breaks = c("#0000ff", "#00ff00", "#00ffff", "#ffff00", "#ffffff"),
                    labels = c("Greater than 2%", "Between 1%-2%", "Between 0.5%-1%", "Between 0.1%-0.5%", "Less than 0.1%")) +
  # formatting
  scale_x_log10(name = "Body Weight", breaks = c(0.01, 1, 100, 10000),
                labels = c("10 g", "1 kg", "100 kg", "10K kg")) +
  scale_y_log10(name = "Brain Weight", breaks = c(1, 10, 100, 1000),
                labels = c("1 g", "10 g", "100 g", "1 kg")) +
  ggtitle("An Elephant Never Forgets...How Big A Brain It Has",
          subtitle = "Brain and Body Weights of Sixty-Two Species of Land Mammals") +
  labs(caption = "Source: MASS::mammals") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68")) +
  theme(legend.position = c(0.832, 0.21))
```
For more info on this dataset, type `?MASS::mammals` into the console.
And if you are going crazy not knowing what species is in the top right corner, it's another elephant. Specifically, it's the African elephant. It also never forgets how big a brain it has. <i class="far fa-smile-beam"></i>
## Simple examples
<!-- Simplify Note -->
That was *too* fancy! Much simpler please!
<!-- Simple Explanation of Data: -->
Let's use the `SpeedSki` dataset from `GDAdata` to look at how the speed achieved by the participants relates to their birth year:
```{r simple-example-data}
library(GDAdata)
head(SpeedSki, n = 7)
```
### Scatterplot using base R
```{r base-r}
x <- SpeedSki$Year
y <- SpeedSki$Speed
# plot data
plot(x, y, main = "Scatterplot of Speed vs. Birth Year")
```
<!-- Base R Plot Explanation -->
Base R scatterplots are easy to make: all `plot()` needs is the two variables you want to plot. Although scatterplots can be made with categorical data, the variables you are plotting will usually be continuous.
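
As a rough sketch of how far a few extra arguments go, the chunk below redraws the same plot using the formula interface of `plot()`, with axis labels and solid, semi-transparent points. The specific values (`pch = 16` and the `rgb()` color) are illustrative choices, not part of the original example.

```{r}
library(GDAdata) # data

# formula interface: y ~ x, with both columns looked up in `data`
plot(Speed ~ Year, data = SpeedSki,
     xlab = "Birth Year", ylab = "Speed Achieved (km/hr)",
     main = "Scatterplot of Speed vs. Birth Year",
     pch = 16,                        # solid circles (illustrative choice)
     col = rgb(0, 0, 0, alpha = 0.4)) # semi-transparent black (illustrative choice)
```

The semi-transparent color makes stacked points appear darker, which is handy because `Year` takes only whole-number values.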
### Scatterplot using ggplot2
```{r ggplot}
library(GDAdata) # data
library(ggplot2) # plotting
# main plot
scatter <- ggplot(SpeedSki, aes(Year, Speed)) + geom_point()

# show with trimmings
scatter +
  labs(x = "Birth Year", y = "Speed Achieved (km/hr)") +
  ggtitle("Ninety-One Skiers by Birth Year and Speed Achieved")
```
<!-- ggplot2 explanation -->
`ggplot2` makes it very easy to create scatterplots. With `geom_point()`, you map one variable to the `x` aesthetic and another to the `y` aesthetic, and each observation is drawn as a point. All that is really necessary is the data, the aesthetics, and the geom; extra formatting to make your plots look nice is simple to add on top.
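
Aesthetics are not limited to `x` and `y`. The sketch below maps a third variable to color, assuming the `Sex` column that appears in the `head(SpeedSki)` output above; the object name `scatter_by_sex` is made up for this illustration.

```{r}
library(GDAdata) # data
library(ggplot2) # plotting

# map a third variable to color inside aes(); ggplot2 builds the legend
scatter_by_sex <- ggplot(SpeedSki, aes(Year, Speed, color = Sex)) +
  geom_point() +
  labs(x = "Birth Year", y = "Speed Achieved (km/hr)", color = "Sex")

scatter_by_sex
```

Putting `color = Sex` inside `aes()` ties color to the data; a fixed color would instead go outside `aes()`, for example `geom_point(color = "grey35")`.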
## When to use
<!-- Quick Note on When to use this plot -->
Scatterplots are great for exploring relationships between variables. If you are interested in how variables relate to each other, the scatterplot is a great place to start.
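
One possible way to start that exploration, sketched here with the `SpeedSki` data used earlier, is to pair a correlation coefficient with a fitted line on the plot. The names `speed_year_cor` and `trend_plot` are hypothetical, and a straight-line fit is only an assumption about the shape of the relationship.

```{r}
library(GDAdata) # data
library(ggplot2) # plotting

# Pearson correlation: a one-number summary of the linear relationship
speed_year_cor <- cor(SpeedSki$Year, SpeedSki$Speed)
speed_year_cor

# overlay a least-squares line to make any linear trend easier to see
trend_plot <- ggplot(SpeedSki, aes(Year, Speed)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

trend_plot
```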
## Considerations
<!-- * List of things to pay attention to with examples -->
### Overlapping data
Data with similar values will overlap in a scatterplot and may lead to problems. Consider exploring [alpha blending](iris.html#aside-example-where-alpha-blending-works) or [jittering](iris.html#second-jittering) as remedies (links from [Overlapping Data](iris.html#overlapping-data) section of [Iris Walkthrough](iris.html)).
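
As a quick sketch of both remedies on the `SpeedSki` data (the `alpha`, `size`, and `width` values are illustrative guesses, not tuned recommendations):

```{r}
library(GDAdata) # data
library(ggplot2) # plotting

base_plot <- ggplot(SpeedSki, aes(Year, Speed))

# alpha blending: overlapping points stack up and look darker
base_plot + geom_point(alpha = 0.3, size = 3)

# jittering: add a little random horizontal noise to each point
base_plot + geom_jitter(width = 0.3, height = 0, alpha = 0.6)
```

Here `height = 0` keeps the jitter horizontal only, so the plotted speeds stay exactly where the data puts them.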
### Scaling
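Consider how axis scaling changes the way your data is perceived. The example below draws the same two clusters of points twice, first on a linear x-axis and then on a log-10 x-axis: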
```{r scaling-fix}
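library(ggplot2)

set.seed(42) # arbitrary seed so the simulated points are reproducible
num_points <- 100
# two clusters along x: half centered at 100, half centered at 10
wide_x <- c(rnorm(n = num_points / 2, mean = 100, sd = 2),
            rnorm(n = num_points / 2, mean = 10, sd = 2))
wide_y <- rnorm(n = num_points, mean = 5, sd = 2)
df <- data.frame(wide_x, wide_y)

# same data, linear x-axis
ggplot(df, aes(wide_x, wide_y)) +
  geom_point() +
  ggtitle("Linear X-Axis")

# same data, log-10 x-axis
ggplot(df, aes(wide_x, wide_y)) +
  geom_point() +
  ggtitle("Log-10 X-Axis") +
  scale_x_log10()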
```
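
To put rough numbers on the visual difference, the sketch below simulates two clusters like the ones above and measures what share of the axis range each one spans on a linear scale and on a log-10 scale. The helper `share_of_range()` and the other names are hypothetical, written only for this illustration.

```{r}
set.seed(1) # arbitrary seed, used only so the numbers are reproducible

# two clusters with the same spread (sd = 2) but different centers
low_cluster  <- rnorm(n = 50, mean = 10,  sd = 2)
high_cluster <- rnorm(n = 50, mean = 100, sd = 2)
all_x <- c(low_cluster, high_cluster)

# fraction of the full axis range that one cluster spans
share_of_range <- function(cluster, everything) {
  diff(range(cluster)) / diff(range(everything))
}

# linear axis: both clusters span a similar, small share of the axis
linear_share <- c(low  = share_of_range(low_cluster, all_x),
                  high = share_of_range(high_cluster, all_x))

# log-10 axis: the low cluster spans a much larger share than the high one
log_share <- c(low  = share_of_range(log10(low_cluster), log10(all_x)),
               high = share_of_range(log10(high_cluster), log10(all_x)))

round(rbind(linear_share, log_share), 3)
```

A log axis stretches small values and compresses large ones, so two clusters with identical spread can look very different in width.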
## Modifications
### Heat maps
Heat maps are like a combination of scatterplots and [histograms](histo.html): they allow you to compare two variables while also seeing how densely the observations are distributed.
For these heat maps, we will use the `SpeedSki` dataset.
#### Heat map: default
To create a heat map, simply substitute `geom_point()` with `geom_bin2d()`:
```{r heat-map-default}
ggplot(SpeedSki, aes(Year, Speed)) +
  geom_bin2d()
```
#### Heat map: modify color/bin width
You can change the color palette by adding a fill scale to your chain of `ggplot` function calls. The bin width is set inside the `geom_bin2d()` function call:
```{r heat-map-color-bin-width, message=FALSE}
library(viridis) # viridis color palette

# create plot
g1 <- ggplot(SpeedSki, aes(Year, Speed)) +
  scale_fill_viridis() # modify color

# show plot
g1 + geom_bin2d(binwidth = c(5, 5)) # modify bin width
```
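
If you would rather not load the `viridis` package, a similar result should be possible with `scale_fill_viridis_c()`, the continuous viridis fill scale bundled with recent versions of `ggplot2`. This is a sketch of that alternative; the `"magma"` option and the name `heat_builtin` are illustrative choices.

```{r}
library(GDAdata) # data
library(ggplot2) # plotting

heat_builtin <- ggplot(SpeedSki, aes(Year, Speed)) +
  geom_bin2d(binwidth = c(5, 5)) +        # same bin width as above
  scale_fill_viridis_c(option = "magma")  # built-in continuous viridis scale

heat_builtin
```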
Here are some other examples:
```{r}
# larger bin width
g1 + geom_bin2d(binwidth = c(10, 10))
```
```{r}
# hexagonal bins
library(hexbin)
g1 + geom_hex(binwidth = c(5, 5))
```
```{r}
# hexagonal bins + scatterplot layer
library(hexbin)
g1 + geom_hex(binwidth = c(5, 5), alpha = .4) +
  geom_point(size = 2, alpha = 0.8)
```
```{r}
# hexagonal bins with custom color gradient/bin count
library(hexbin)
ggplot(SpeedSki, aes(Year, Speed)) +
  scale_fill_gradient(low = "#cccccc", high = "#09005F") + # color
  geom_hex(bins = 10) # number of bins horizontally/vertically
```
### Contour lines
<!-- blurb -->
Contour lines give a sense of the density of the data at a glance.
For these contour maps, we will use the `SpeedSki` dataset.
Contour lines can be added to the plot call using `geom_density_2d()`:
```{r}
ggplot(SpeedSki, aes(Year, Speed)) +
  geom_density_2d()
```
Contour lines work best when combined with other layers:
```{r}
ggplot(SpeedSki, aes(Year, Speed)) +
  geom_point() +
  geom_density_2d(bins = 5)
```
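
A filled variant is also worth a look. The sketch below uses `geom_density_2d_filled()`, which is available in `ggplot2` 3.3.0 and later, to shade the density bands and then draws the points on top. The `alpha` and `size` values and the name `filled_contours` are illustrative.

```{r}
library(GDAdata) # data
library(ggplot2) # plotting

filled_contours <- ggplot(SpeedSki, aes(Year, Speed)) +
  geom_density_2d_filled(alpha = 0.7) + # shaded density bands
  geom_point(size = 1)                  # raw points drawn on top

filled_contours
```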
### Scatterplot matrices
If you want to compare multiple variables to each other, consider using a scatterplot matrix. This will allow you to show many pairwise comparisons in a compact and efficient manner.
For these scatterplot matrices, we will use the `movies` dataset from the `ggplot2movies` package.
By default, the base R `plot()` function will create a scatterplot matrix when given a data frame with more than two columns:
```{r message=FALSE, fig.width=7, fig.height=7}
library(ggplot2movies) # data
library(dplyr) # manipulation

set.seed(1) # arbitrary seed so the sample is reproducible
index <- sample(nrow(movies), 500) # sample 500 rows
moviedf <- movies[index, ] # data frame
splomvar <- moviedf %>%
  dplyr::select(length, budget, votes, rating, year)
plot(splomvar)
```
While this is quite useful for personal exploration of a dataset, it is **not** recommended for presentation purposes. Something called the [Hermann grid illusion](https://en.wikipedia.org/wiki/Grid_illusion){target="_blank"} makes this plot very difficult to examine.
To avoid this problem, consider using the `splom()` function from the `lattice` package:
```{r, fig.width=7, fig.height=7}
library(lattice) #sploms
splom(splomvar)
```
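
With 500 points per panel, the default `splom()` output can still look crowded. The sketch below passes a few graphical arguments through `splom()` to shrink and fade the points and to reduce the text size. All the values are illustrative guesses, the names `movie_sample` and `movie_splom` are made up for this example, and the sample is redrawn here so the chunk runs on its own.

```{r, fig.width=7, fig.height=7}
library(ggplot2movies) # data
library(lattice)       # sploms

set.seed(2) # arbitrary seed so the sample is reproducible
movie_sample <- movies[sample(nrow(movies), 500),
                       c("length", "budget", "votes", "rating", "year")]

movie_splom <- splom(movie_sample,
                     pch = 16, cex = 0.4, alpha = 0.4, # small, semi-transparent points
                     axis.text.cex = 0.5,              # smaller axis tick labels
                     varname.cex = 0.8)                # smaller variable names on the diagonal

movie_splom
```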
## Theory
<!-- *Link to textbook -->
* For more info about adding lines/contours, comparing groups, and plotting continuous variables check out [Chapter 5](http://www.gradaanwr.net/content/ch05/){target="_blank"} of the textbook.
## External resources
<!-- - []](){target="_blank"}: Links to resources with quick blurb -->
- [Quick-R article](https://www.statmethods.net/graphs/scatterplot.html){target="_blank"} about scatterplots using Base R. Goes from the simple into the very fancy, with Matrices, High Density, and 3D versions.
- [STHDA Base R](http://www.sthda.com/english/wiki/scatter-plots-r-base-graphs){target="_blank"}: article on scatterplots in Base R. More examples of how to enhance the humble graph.
- [STHDA ggplot2](http://www.sthda.com/english/wiki/ggplot2-scatterplot-easy-scatter-plot-using-ggplot2-and-r-statistical-software){target="_blank"}: article on scatterplots in `ggplot2`. Heavy on the formatting options available and facet wraps.
- [Stack Overflow](https://stackoverflow.com/questions/15624656/label-points-in-geom-point){target="_blank"} on adding labels to points from `geom_point()`
- [ggplot2 cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf){target="_blank"}: Always good to have close by.