---
title: "Build up a targets pipeline"
output: html_notebook
---
# About
To implement the customer churn prediction analysis, we will create a `targets` pipeline. A pipeline is a high-level collection of steps, or *targets*, that make up the workflow. Each *target* has a *command*: a piece of R code that returns a value. In practice, targets are usually data cleaning, model fitting, and results summary steps, and the commands are the R function calls that do those tasks. Please read the instructions in the comments.
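For a quick preview of the syntax, here is one of the targets we will declare later in this chapter: `tar_target()` pairs the target's name with its command.
```{r, eval = FALSE}
# No need to run this code chunk.
# A single target: the name `churn_data` plus the command that computes its value.
library(targets)
tar_target(
  churn_data,             # Name of the target.
  split_data(churn_file)  # Command: R code that returns the target's value.
)
```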
# Setup
```{r}
library(targets)
tar_destroy() # Start fresh.
```
# Functions
The commands of your targets will depend on the functions from `1-functions.Rmd`. For the exercises below, they live in `2-pipelines/functions.R`. Take a quick glance at that file to re-familiarize yourself with the code base.
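If you want to pull the file up from this notebook, one option (a convenience suggestion, not a required step) is `file.edit()`:
```{r, eval = FALSE}
# Optional: open the functions file to review split_data(), prepare_recipe(),
# test_model(), and retrain_run() before building the pipeline.
file.edit("2-pipelines/functions.R")
```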
# Start with a small pipeline.
The first version of the pipeline consists of three *targets*, or steps of computation. Our first targets do the following:
1. Reproducibly track the customer churn CSV data file.
2. Read in the data from the CSV file.
3. Preprocess the data for the deep learning models.
The steps above translate to the following R code.
```{r, eval = FALSE}
# No need to run this code chunk.
library(keras)
library(recipes)
library(rsample)
library(tidyverse)
library(yardstick)
source("2-pipelines/functions.R")
churn_file <- "data/churn.csv" # Sketch of target 1.
churn_data <- split_data(churn_file) # Sketch of target 2.
churn_recipe <- prepare_recipe(churn_data) # Sketch of target 3.
```
To formalize our three targets, we express the computation in a special configuration file called `_targets.R` in the project's root directory. The role of `_targets.R` is to
1. Load any functions and global objects that the targets depend on.
2. Set any options, including packages that the targets depend on.
3. Declare a list of targets using `tar_target()`. The `_targets.R` script must always end with a list of target objects (see the sketch after this list).
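Putting those three pieces together, the `_targets.R` file for our three initial targets looks roughly like the sketch below (an approximation; the prepared file you copy in next follows this same pattern).
```{r, eval = FALSE}
# No need to run this code chunk: a sketch of the initial _targets.R.
library(targets)
source("2-pipelines/functions.R") # 1. Load the functions the targets depend on.
tar_option_set(
  packages = c("keras", "recipes", "rsample", "tidyverse", "yardstick") # 2. Options.
)
list( # 3. The script ends with a list of target objects.
  tar_target(churn_file, "data/churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data))
)
```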
Run the following code chunk to put the `_targets.R` file for this chapter in your working directory.
```{r, echo = FALSE}
file.copy("2-pipelines/initial_targets.R", "_targets.R", overwrite = TRUE)
```
Now, open `_targets.R` in the RStudio IDE for editing. You will manually make changes to `_targets.R` throughout the entire chapter.
```{r}
library(targets)
tar_edit()
```
Functions `tar_make()`, `tar_validate()`, `tar_manifest()`, `tar_glimpse()`, and `tar_visnetwork()` all require `_targets.R`: it lets them invoke the pipeline from a fresh external R process, which helps ensure reproducibility.
```{r}
tar_validate() # Looks for errors.
```
```{r}
tar_manifest(fields = "command") # Data frame of target info.
```
```{r}
tar_glimpse() # Interactive dependency graph.
```
# Dependency relationships
`tar_glimpse()` is particularly helpful because it shows how the targets depend on one another. As the arrows indicate, target `churn_recipe` depends on `churn_data`, and target `churn_data` depends on `churn_file`. The `targets` package detects these relationships automatically using static code analysis. `churn_data` depends on `churn_file` because the command for `churn_data` mentions the symbol `churn_file`. You can see these dependency relationships for yourself using `codetools::findGlobals()`.
```{r}
library(codetools)
findGlobals(function() split_data(churn_file))
```
That means the order in which you write your targets does not matter. Even if you rearrange the calls to `tar_target()` inside the list, you will still get the same workflow, as the example below illustrates.
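This reordered version of the target list (illustration only, no need to put it in `_targets.R`) declares exactly the same pipeline, because the dependency graph comes from the code inside the commands, not from the position in the list.
```{r, eval = FALSE}
# No need to run this code chunk.
# The same three targets in a different order: targets still runs
# churn_file, then churn_data, then churn_recipe.
list(
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(churn_file, "data/churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file))
)
```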
`tar_visnetwork()` also includes functions in the dependency graph, as well as color-coded status information.
```{r}
tar_visnetwork()
```
# Run the pipeline.
Everything we have done so far was just setup. To actually run the pipeline, use the `tar_make()` function. `tar_make()` creates a fresh, reproducible R process that runs `_targets.R` and executes the correct targets in the correct order (determined by the dependency graph, not by the order in the target list).
```{r}
tar_make()
```
# Inspect the results
Targets `churn_data` and `churn_recipe` now live in a special `_targets/` data store.
```{r}
list.files("_targets/objects")
```
`churn_file` is an [external input file](https://books.ropensci.org/targets/files.html#external-input-files), declared with `format = "file"` in `tar_target()`, so its value is not in the data store. However, the actual file path, hash, and other metadata are stored in the `_targets/meta/meta` spreadsheet, which you can read with `tar_meta()`.
```{r}
library(dplyr)
tar_meta(names = starts_with("churn"), fields = path) %>%
  mutate(path = unlist(path))
```
Other user-side functions read the actual data objects.
```{r}
tar_read(churn_data)
```
```{r}
tar_read(churn_recipe)
```
```{r}
tar_load(churn_file)
churn_file
```
# Build up the pipeline.
After inspecting our current targets with `tar_load()` and `tar_read()`, we are ready to add new targets to fit our models. Informally, the new computations look like this:
```{r, eval = FALSE}
run_relu <- test_model(act1 = "relu", churn_data, churn_recipe) # Model 1
run_sigmoid <- test_model(act1 = "sigmoid", churn_data, churn_recipe) # Model 2
```
Your turn: open `_targets.R` for editing.
```{r}
tar_edit()
```
Then, express the model commands above as formal targets in the pipeline.
```{r, eval = FALSE}
# Should go in _targets.R:
library(targets)
source("2-pipelines/functions.R")
tar_option_set(packages = c("keras", "recipes", "rsample", "tidyverse", "yardstick"))
list(
  tar_target(churn_file, "data/churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target("???", "???"), # Your turn: add model 1.
  tar_target("???", "???") # Your turn: add model 2.
)
```
Visualize the graph to check the pipeline. `run_relu` and `run_sigmoid` should be new targets that depend on `churn_data`, `churn_recipe`, `test_model()`, and the custom functions called from `test_model()`.
```{r}
tar_visnetwork()
```
`churn_file`, `churn_data`, and `churn_recipe` are up to date from last time, while `run_relu` and `run_sigmoid` are new and thus outdated.
```{r}
tar_outdated()
```
`tar_make()` automatically runs the new or outdated targets (in this case, the models) and skips the targets that are already up to date.
```{r, output = FALSE}
tar_make() # Ignore messages like "TensorFlow binary was not compiled to use..."
```
Those models took a long time to run, right? That is why `tar_make()` skips them if they are up to date.
```{r}
tar_make()
```
If you are not sure what will run, just call `tar_visnetwork()` or `tar_outdated()` first.
```{r}
tar_outdated()
```
# Check the model targets
`run_relu` and `run_sigmoid` should each be a data frame with the accuracy and hyperparameters.
```{r, paged.print = FALSE}
tar_read(run_relu)
```
```{r, paged.print = FALSE}
tar_read(run_sigmoid)
```
# Add another model
Your turn: open `_targets.R` for editing and type in a third model with a different activation function. Use `act1 = "softmax"` this time.
```{r}
tar_edit()
```
```{r, eval = FALSE}
# Should go in _targets.R:
library(targets)
source("2-pipelines/functions.R")
tar_option_set(packages = c("keras", "recipes", "rsample", "tidyverse", "yardstick"))
list(
  tar_target(churn_file, "data/churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(run_softmax, "???") # Your turn: add a model target with `act1 = "softmax"`.
)
```
Now, only `run_softmax` should be outdated.
```{r}
tar_outdated()
```
```{r}
tar_visnetwork()
```
Run the softmax model. `tar_make()` should skip everything else.
```{r}
tar_make()
```
Inspect the result.
```{r, paged.print = FALSE}
tar_read(run_softmax)
```
# Pick the best model.
The following R code chooses the model run with the highest accuracy.
```{r, eval = FALSE}
# Do not run here.
bind_rows(run_relu, run_sigmoid, run_softmax) %>%
  top_n(1, accuracy) %>%
  head(1)
```
Your turn: open `_targets.R` and type the above into a new target. Note: although commands should usually be concise function calls, this is not a strict requirement, as you will see below.
```{r, eval = FALSE}
# Should go in _targets.R:
library(targets)
source("2-pipelines/functions.R")
tar_option_set(packages = c("keras", "recipes", "rsample", "tidyverse", "yardstick"))
list(
  tar_target(churn_file, "data/churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)),
  tar_target(
    best_run,
    "???" # Your turn: insert code to choose the model run with the highest accuracy.
  )
)
```
`best_run` should depend on `run_relu`, `run_sigmoid`, and `run_softmax`.
```{r}
tar_visnetwork()
```
The next `tar_make()` should run quickly and just build `best_run`.
```{r}
tar_make()
```
`best_run` should contain one row with the accuracy and hyperparameters of the best model.
```{r}
tar_read(best_run)
```
# Retrain the best model
Finally, open `_targets.R` for editing and add a new target to retrain the best model with `retrain_run(best_run, churn_recipe)`. This time, we are returning a Keras model object, so we need to write `format = "keras"` in `tar_target()`.
```{r, eval = FALSE}
# Should go in _targets.R:
library(targets)
source("2-pipelines/functions.R")
tar_option_set(packages = c("keras", "recipes", "rsample", "tidyverse", "yardstick"))
list(
  tar_target(churn_file, "data/churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)),
  tar_target(
    best_run,
    bind_rows(run_relu, run_sigmoid, run_softmax) %>%
      top_n(1, accuracy) %>%
      head(1)
  ),
  tar_target(
    best_model,
    "???", # Your turn: call retrain_run() to retrain the best model.
    format = "keras" # Needed to return the actual Keras model object.
  )
)
```
Check the dependency relationships.
```{r}
tar_visnetwork()
```
Train the model while skipping previous up-to-date targets.
```{r}
tar_make()
```
Inspect the model object.
```{r}
tar_read(best_model)
```
# Solution
The full pipeline in `_targets.R` should look like this.
```{r, eval = FALSE}
# Should be in _targets.R:
library(targets)
source("2-pipelines/functions.R")
tar_option_set(packages = c("keras", "recipes", "rsample", "tidyverse", "yardstick"))
list(
  tar_target(churn_file, "data/churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)),
  tar_target(
    best_run,
    bind_rows(run_relu, run_sigmoid, run_softmax) %>%
      top_n(1, accuracy) %>%
      head(1)
  ),
  tar_target(
    best_model,
    retrain_run(best_run, churn_recipe),
    format = "keras"
  )
)
```
# Recap
Building up a pipeline is a gradual, iterative process, sketched in code after the list below:
1. Add or change some targets in the pipeline.
2. Check the pipeline with `tar_outdated()` and `tar_visnetwork()`.
3. Call `tar_make()` to run the new targets.
4. Check the output with `tar_load()` and/or `tar_read()`.
5. Repeat.
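One pass through that loop, in code (using functions and targets from this chapter):
```{r, eval = FALSE}
# No need to run this code chunk.
tar_edit()         # 1. Add or change targets in _targets.R.
tar_outdated()     # 2. See which targets are no longer up to date.
tar_visnetwork()   # 2. Check the dependency graph.
tar_make()         # 3. Run only the new or outdated targets.
tar_read(best_run) # 4. Inspect a result.
```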
In the next notebook, we will explore what happens when we make changes to the code and data.