Include imputed strata in assessment card/pin #55

Merged
31 commits
a603001
Include testing file
Damonamajor Nov 13, 2024
223aa02
include commenting for testing file
Damonamajor Nov 13, 2024
fb520c6
remove code from pipeline
Damonamajor Nov 13, 2024
80fd36f
Add 3 enters
Damonamajor Nov 13, 2024
b692b22
Update commenting
Damonamajor Nov 13, 2024
6edaaf3
update pipeline
Damonamajor Nov 14, 2024
3e1a54b
Add testing v2 file
Damonamajor Nov 15, 2024
3e4ab9b
Testing file v2
Damonamajor Nov 15, 2024
45accc7
Add concluding line
Damonamajor Nov 15, 2024
84593d4
include mapping
Damonamajor Nov 18, 2024
0837cd2
Improve structure; add commenting
Damonamajor Nov 18, 2024
8ae8d84
Remove testing files
Damonamajor Nov 18, 2024
aeaa387
lintr
Damonamajor Nov 18, 2024
d0d8135
Add commenting
Damonamajor Nov 18, 2024
0bf170c
Add unname
Damonamajor Nov 18, 2024
75d8d4e
Remove testing
Damonamajor Nov 18, 2024
f22d226
Update pipeline/02-assess.R
Damonamajor Nov 18, 2024
92710bf
Push testing file
Damonamajor Nov 18, 2024
cfe990b
lintr
Damonamajor Nov 18, 2024
3eed3df
Add Dan language
Damonamajor Nov 18, 2024
21ca2ca
Add testing file
Damonamajor Nov 18, 2024
77d6cee
Include True False
Damonamajor Nov 18, 2024
2d9628f
Include testing file
Damonamajor Nov 18, 2024
b08bdc7
Change to flag
Damonamajor Dec 2, 2024
691fa3c
Make less than 80 characters
Damonamajor Dec 2, 2024
1c4dfe6
Final
Damonamajor Dec 3, 2024
ee5666f
Drop testing file
dfsnow Dec 17, 2024
85ff9e6
Simplify strata deconversion code
dfsnow Dec 17, 2024
d97a9cf
Fix `ccao` package function arg names
dfsnow Dec 17, 2024
c2c4d70
Merge branch 'dfsnow/fix-ccao-arg-names' into 54-include-imputed-stra…
dfsnow Dec 17, 2024
efd3bf4
Drop SV added later cols
dfsnow Dec 17, 2024
5 changes: 4 additions & 1 deletion R/recipes.R
@@ -31,7 +31,10 @@ model_main_recipe <- function(data, pred_vars, cat_vars,
# Remove any variables not an outcome var or in the pred_vars vector
step_rm(-all_outcomes(), -all_predictors(), -has_role("ID")) %>%
# Impute missing values using KNN. Specific to condo model, usually used to
# impute missing condo building strata
# impute missing condo building strata. Within step_impute_knn, an estimated
# node value is called with the sample(). This is not deterministic, meaning
# different runs of the model will have different imputed values, and thus
# different FMVs.
Comment on lines +34 to +37
Member:

issue (blocking): This is sort of correct. The imputation will be deterministic between model runs if a seed is set somewhere in this file. However, it won't be deterministic if you run the same prediction twice in a single session (unless you set the seed again).

I would add set.seed(params$input$strata$seed) somewhere at the top of this file to ensure that prediction is always using the same seed. Then run the stage twice (run once, restart, run again) and check that the results are the same.
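
To illustrate the seeding behavior described in this comment, here is a minimal base R sketch. It is not code from this PR: `draw_imputation()` is a hypothetical stand-in for any step that calls `sample()` internally, and the seed value 2024 is arbitrary (the comment suggests reading it from `params$input$strata$seed`).

```r
# Hypothetical stand-in for a step that draws random values internally,
# as step_impute_knn is described as doing via sample()
draw_imputation <- function() {
  sample.int(1e6, size = 5)
}

set.seed(2024)               # arbitrary seed, for illustration only
first <- draw_imputation()
second <- draw_imputation()  # same session, seed not reset: RNG state advanced

set.seed(2024)               # reset the seed before drawing again
third <- draw_imputation()

identical(first, third)      # TRUE: same seed, same starting RNG state
identical(first, second)     # expected FALSE: the RNG state differs
```

This only demonstrates base R RNG behavior. Whether it accounts for the differing FMVs reported in this thread is not established here.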

Contributor (Author):

I added this to the file as described, and also tried setting a global seed. In both cases, different iterations still produced different FMVs. You can use the uploaded testing file to look into this.

step_impute_knn(
all_of(knn_vars),
neighbors = tune(),
78 changes: 60 additions & 18 deletions pipeline/02-assess.R
@@ -30,7 +30,7 @@ land_nbhd_rate <- read_parquet(


#- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 2. Predict Values ------------------------------------------------------------
# 2. Predict Values and Recover Strata ----------------------------------------
#- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
message("Predicting off-market values with trained model")

@@ -40,20 +40,61 @@ lgbm_final_full_recipe <- readRDS(paths$output$workflow_recipe$local)

# Load the data for assessment. This is the universe of condo units
# that need values. Use the trained lightgbm model to estimate a single
# FMV per unit
# FMV per unit. Bake the data first so we can extract transformed columns
assessment_data_pred <- read_parquet(paths$input$assessment$local) %>%
as_tibble() %>%
as_tibble()

assessment_data_baked <- assessment_data_pred %>%
bake(lgbm_final_full_recipe, new_data = ., all_predictors())

assessment_data_pred <- assessment_data_pred %>%
mutate(
pred_card_initial_fmv = predict(
.,
pred_card_initial_fmv = as.numeric(predict(
lgbm_final_full_fit,
new_data = bake(
lgbm_final_full_recipe,
new_data = .,
all_predictors()
)
)$.pred
new_data = assessment_data_baked
)$.pred),
# Strata variables are converted to 0-indexed integers during baking.
# We save those converted values so we can unconvert them below
temp_strata_1 = assessment_data_baked$meta_strata_1,
temp_strata_2 = assessment_data_baked$meta_strata_2
)

# The baked data encodes categorical values as base-0 integers.
# However, here we want to recover the original (unencoded) values of our
# strata variables wherever they've been imputed by the baking step. To do so,
# we create a mapping of the encoded to unencoded values and use them to
# recover both the original strata values and those imputed by
# step_impute_knn (in R/recipes.R)
strata_mapping_1 <- assessment_data_pred %>%
filter(!is.na(meta_strata_1)) %>%
distinct(temp_strata_1, meta_strata_1) %>%
pull(meta_strata_1, name = temp_strata_1)
strata_mapping_2 <- assessment_data_pred %>%
filter(!is.na(meta_strata_2)) %>%
distinct(temp_strata_2, meta_strata_2) %>%
pull(meta_strata_2, name = temp_strata_2)

# Recover the imputed strata values
assessment_data_pred <- assessment_data_pred %>%
mutate(
# Binary variable to identify condos which have imputed strata
flag_strata_is_imputed = is.na(meta_strata_1) | is.na(meta_strata_2),
# Use mappings to replace meta_strata_1 and meta_strata_2 directly
meta_strata_1 = ifelse(
is.na(meta_strata_1),
unname(strata_mapping_1[as.character(temp_strata_1)]),
meta_strata_1
),
meta_strata_2 = ifelse(
is.na(meta_strata_2),
unname(strata_mapping_2[as.character(temp_strata_2)]),
meta_strata_2
)
) %>%
# Remove unnecessary columns
select(-temp_strata_1, -temp_strata_2)
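
The recovery step above relies on a named-vector lookup: encoded values become the names and the original labels become the values. A minimal base R sketch of that technique follows, assuming toy data. The `original` and `encoded` vectors are hypothetical and stand in for `meta_strata_1` and `temp_strata_1`; the real pipeline builds its mapping with dplyr's `distinct()` and `pull()`.

```r
# Hypothetical toy data: original labels (NA where strata are missing) and the
# 0-indexed integer codes a baked recipe might produce after imputation
original <- c("A", "B", NA, "C", NA)
encoded <- c(0L, 1L, 1L, 2L, 0L)

# Build the code -> label mapping from rows where the original label is known
known <- !is.na(original)
pairs <- unique(data.frame(
  code = encoded[known],
  label = original[known],
  stringsAsFactors = FALSE
))
mapping <- setNames(pairs$label, pairs$code)

# Flag imputed rows, then fill only the missing labels using the mapping
is_imputed <- is.na(original)
recovered <- ifelse(
  is_imputed,
  unname(mapping[as.character(encoded)]),
  original
)

recovered  # "A" "B" "B" "C" "A"
```

Indexing by `as.character(encoded)` matters here: indexing a named vector with integers would select by position rather than by name, which would be off by one for 0-indexed codes.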




@@ -154,14 +195,15 @@ assessment_data_merged %>%
select(
meta_year, meta_pin, meta_class, meta_card_num, meta_lline_num,
meta_modeling_group, ends_with("_num_sale"), pred_card_initial_fmv,
all_of(params$model$predictor$all), township_code
all_of(params$model$predictor$all),
flag_strata_is_imputed, township_code
) %>%
mutate(
ccao_n_years_exe_homeowner = as.integer(ccao_n_years_exe_homeowner)
) %>%
ccao::vars_recode(
starts_with("char_"),
type = "long",
cols = starts_with("char_"),
code_type = "long",
as_factor = FALSE
) %>%
write_parquet(paths$output$assessment_card$local)
@@ -203,7 +245,7 @@ sales_data_two_most_recent <- sales_data %>%
meta_pin, meta_year,
meta_sale_price, meta_sale_date, meta_sale_document_num,
sv_outlier_reason1, sv_outlier_reason2, sv_outlier_reason3,
meta_sale_num_parcels, sv_added_later
meta_sale_num_parcels
) %>%
# Include outliers, since these data are used for desk review and
# not for modeling
@@ -225,8 +267,7 @@ sales_data_two_most_recent <- sales_data %>%
meta_sale_outlier_reason1,
meta_sale_outlier_reason2,
meta_sale_outlier_reason3,
meta_sale_num_parcels,
sv_added_later
meta_sale_num_parcels
),
names_glue = "{mr}_{gsub('meta_sale_', '', .value)}"
) %>%
@@ -270,7 +311,8 @@ assessment_data_pin <- assessment_data_merged %>%
meta_year, meta_pin, meta_pin10, meta_triad_code, meta_township_code,
meta_nbhd_code, meta_tax_code, meta_class, meta_tieback_key_pin,
meta_tieback_proration_rate, meta_cdu, meta_modeling_group,
Contributor (Author), Damonamajor, Nov 18, 2024:

Is there any documentation that needs to be updated with new values in pin / card output files? Strata was never in the pin output file to begin with.

Member:

Nope! This should just get automatically added to the equivalent tables in Athena once it's crawled.

meta_pin_num_landlines, char_yrblt,
meta_pin_num_landlines, meta_strata_1, meta_strata_2,
flag_strata_is_imputed, char_yrblt,

# Keep overall building square footage
char_total_bldg_sf = char_building_sf,
@@ -389,7 +431,7 @@ message("Saving final PIN-level data")
assessment_data_pin_final %>%
ccao::vars_recode(
cols = starts_with("char_"),
type = "short",
code_type = "short",
as_factor = FALSE
) %>%
select(-meta_pin10) %>%