-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathdvc.yaml
198 lines (191 loc) · 6.68 KB
/
dvc.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
stages:
ingest:
cmd: Rscript pipeline/00-ingest.R
desc: >
Ingest training and assessment data from Athena + generate condo strata
deps:
- pipeline/00-ingest.R
params:
- assessment
- input
outs:
- input/assessment_data.parquet
- input/char_data.parquet
- input/condo_strata_data.parquet
- input/land_nbhd_rate_data.parquet
- input/training_data.parquet
frozen: true
train:
cmd: Rscript pipeline/01-train.R
desc: >
Train a LightGBM model with cross-validation. Generate model objects,
data recipes, and predictions on the test set (most recent 10% of sales)
deps:
- pipeline/01-train.R
- input/training_data.parquet
params:
- cv
- model.engine
- model.hyperparameter
- model.objective
- model.parameter
- model.predictor
- model.seed
- model.verbose
- ratio_study
- toggle.cv_enable
outs:
- output/intermediate/timing/model_timing_train.parquet:
cache: false
- output/parameter_final/model_parameter_final.parquet:
cache: false
- output/parameter_range/model_parameter_range.parquet:
cache: false
- output/parameter_search/model_parameter_search.parquet:
cache: false
- output/test_card/model_test_card.parquet:
cache: false
- output/workflow/fit/model_workflow_fit.zip:
cache: false
- output/workflow/recipe/model_workflow_recipe.rds:
cache: false
assess:
cmd: Rscript pipeline/02-assess.R
desc: >
Use the trained model to estimate sale prices for all PINS/cards in Cook
County. Also generate flags, calculate land values, and make any
post-modeling changes
deps:
- pipeline/02-assess.R
- input/assessment_data.parquet
- input/condo_strata_data.parquet
- input/land_nbhd_rate_data.parquet
- input/training_data.parquet
- output/workflow/fit/model_workflow_fit.zip
- output/workflow/recipe/model_workflow_recipe.rds
params:
- assessment
- model.predictor.all
- pv
- ratio_study
outs:
- output/assessment_card/model_assessment_card.parquet:
cache: false
- output/assessment_pin/model_assessment_pin.parquet:
cache: false
- output/intermediate/timing/model_timing_assess.parquet:
cache: false
evaluate:
cmd: Rscript pipeline/03-evaluate.R
desc: >
Evaluate the model's performance using two methods:
1. The standard test set, in this case the most recent 10% of sales
2. An assessor-specific ratio study comparing estimated assessments to
the previous year's sales
deps:
- pipeline/03-evaluate.R
- output/assessment_pin/model_assessment_pin.parquet
- output/test_card/model_test_card.parquet
params:
- assessment
- ratio_study
outs:
- output/performance/model_performance_test.parquet:
cache: false
- output/performance_quantile/model_performance_quantile_test.parquet:
cache: false
- output/performance/model_performance_assessment.parquet:
cache: false
- output/performance_quantile/model_performance_quantile_assessment.parquet:
cache: false
- output/intermediate/timing/model_timing_evaluate.parquet:
cache: false
interpret:
cmd: Rscript pipeline/04-interpret.R
desc: >
Generate SHAP values for each card and feature as well as feature
importance metrics for each feature
deps:
- pipeline/04-interpret.R
- input/assessment_data.parquet
- output/workflow/fit/model_workflow_fit.zip
- output/workflow/recipe/model_workflow_recipe.rds
params:
- toggle.shap_enable
- model.predictor.all
outs:
- output/shap/model_shap.parquet:
cache: false
- output/feature_importance/model_feature_importance.parquet:
cache: false
- output/intermediate/timing/model_timing_interpret.parquet:
cache: false
finalize:
cmd: Rscript pipeline/05-finalize.R
desc: >
Save run timings and run metadata to disk and render a performance report
using Quarto.
deps:
- pipeline/05-finalize.R
- output/intermediate/timing/model_timing_train.parquet
- output/intermediate/timing/model_timing_assess.parquet
- output/intermediate/timing/model_timing_evaluate.parquet
- output/intermediate/timing/model_timing_interpret.parquet
params:
- run_note
- toggle
- input
- cv
- model
- pv
- ratio_study
outs:
- output/intermediate/timing/model_timing_finalize.parquet:
cache: false
- output/timing/model_timing.parquet:
cache: false
- output/metadata/model_metadata.parquet:
cache: false
- reports/performance/performance.html:
cache: false
upload:
cmd: Rscript pipeline/06-upload.R
desc: >
Upload performance stats and report to S3, trigger Glue crawlers, and
publish to a model run SNS topic. Will also clean some of the generated
outputs prior to upload and attach a unique run ID. This step requires
access to the CCAO Data AWS account, and so is assumed to be internal-only
deps:
- pipeline/06-upload.R
- output/parameter_final/model_parameter_final.parquet
- output/parameter_range/model_parameter_range.parquet
- output/parameter_search/model_parameter_search.parquet
- output/workflow/fit/model_workflow_fit.zip
- output/workflow/recipe/model_workflow_recipe.rds
- output/test_card/model_test_card.parquet
- output/assessment_card/model_assessment_card.parquet
- output/assessment_pin/model_assessment_pin.parquet
- output/performance/model_performance_test.parquet
- output/performance_quantile/model_performance_quantile_test.parquet
- output/performance/model_performance_assessment.parquet
- output/performance_quantile/model_performance_quantile_assessment.parquet
- output/shap/model_shap.parquet
- output/feature_importance/model_feature_importance.parquet
- output/metadata/model_metadata.parquet
- output/timing/model_timing.parquet
- reports/performance/performance.html
export:
cmd: Rscript pipeline/07-export.R
desc: >
Generate Desk Review spreadsheets and iasWorld upload CSVs from a finished
run. NOT automatically run since it is typically only run once. Manually
run once a model is selected
deps:
- pipeline/07-export.R
params:
- assessment.year
- input.min_sale_year
- input.max_sale_year
- ratio_study
- export
frozen: true