-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include imputed strata in assessment card/pin #55
Include imputed strata in assessment card/pin #55
Conversation
pipeline/02-assess.R
Outdated
|
||
mapping_1 <- assessment_data_pred %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be minorly optimized by use deframe from the tibble package, but that's not part of the existing renv.
@@ -268,7 +296,8 @@ assessment_data_pin <- assessment_data_merged %>% | |||
meta_year, meta_pin, meta_pin10, meta_triad_code, meta_township_code, | |||
meta_nbhd_code, meta_tax_code, meta_class, meta_tieback_key_pin, | |||
meta_tieback_proration_rate, meta_cdu, meta_modeling_group, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any documentation that needs to be updated with new values in pin / card output files? Strata was never in the pin output file to begin with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope! This should just get automatically added to the equivalent tables in Athena once it's crawled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Damonamajor This is mostly looking great! A few minor cleanups in the comments. Can you also make a new issue to add the meta_strata
values and flag (is_imputed) to the Desk Review spreadsheets (so edit 07-export.R
).
# impute missing condo building strata. Within step_impute_knn, an estimated | ||
# node value is called with the sample(). This is not deterministic, meaning | ||
# different runs of the model will have different imputed values, and thus | ||
# different FMVs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (blocking): This is sort of correct. The imputation will be deterministic between model runs if a seed is set somewhere in this file. However, it won't be deterministic if you run the same prediction twice in a single session (unless you set the seed again).
I would add set.seed(params$input$strata$seed)
somewhere at the top of this file to ensure that prediction is always using the same seed. Then run the stage twice (run once, restart, run again) and check that the results are the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I included this in the file as described, and as setting a global seed. In each of those situations, different iterations still created different FMVs. You can use the uploaded testing file to look into this.
pipeline/02-assess.R
Outdated
mapping_2 <- assessment_data_pred %>% | ||
filter(!is.na(meta_strata_2)) %>% | ||
distinct(temp_strata_2, meta_strata_2) | ||
|
||
strata_mapping_1 <- setNames(mapping_1$meta_strata_1, mapping_1$temp_strata_1) | ||
strata_mapping_2 <- setNames(mapping_2$meta_strata_2, mapping_2$temp_strata_2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: You can clean this up a tiny bit using set_names
from rlang
or purrr
:
strata_mapping_2 <- assessment_data_pred %>%
filter(!is.na(meta_strata_2)) %>%
distinct(meta_strata_2, temp_strata_2) %>%
purrr::set_names(.$meta_strata_2, .$temp_strata_2)
pipeline/02-assess.R
Outdated
@@ -154,7 +183,8 @@ assessment_data_merged %>% | |||
select( | |||
meta_year, meta_pin, meta_class, meta_card_num, meta_lline_num, | |||
meta_modeling_group, ends_with("_num_sale"), pred_card_initial_fmv, | |||
all_of(params$model$predictor$all), township_code | |||
all_of(params$model$predictor$all), | |||
meta_strata_is_imputed, township_code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Let's change the prefix of this variable to flag_
since its purpose fits slightly better with that group of vars.
@@ -268,7 +296,8 @@ assessment_data_pin <- assessment_data_merged %>% | |||
meta_year, meta_pin, meta_pin10, meta_triad_code, meta_township_code, | |||
meta_nbhd_code, meta_tax_code, meta_class, meta_tieback_key_pin, | |||
meta_tieback_proration_rate, meta_cdu, meta_modeling_group, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope! This should just get automatically added to the equivalent tables in Athena once it's crawled.
Co-authored-by: Dan Snow <[email protected]>
Merge remote-tracking branch 'refs/remotes/origin/54-include-imputed-strata-in-assessment_data-assessment_card' into 54-include-imputed-strata-in-assessment_data-assessment_card # Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
testing_file.R
Outdated
@@ -0,0 +1,85 @@ | |||
assessment_data_pred <- read_parquet(paths$input$assessment$local) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a testing file demonstrating that the re-converted strata are the same as the original strata.
Merge remote-tracking branch 'refs/remotes/origin/54-include-imputed-strata-in-assessment_data-assessment_card' into 54-include-imputed-strata-in-assessment_data-assessment_card
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Damonamajor This looks fine for now. Let's review it as a part of our modeling work in the next couple weeks.
pipeline/02-assess.R
Outdated
{ | ||
set_names(.$meta_strata_1, .$temp_strata_1) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: Why the additional curly braces here? You should be able to just:
{ | |
set_names(.$meta_strata_1, .$temp_strata_1) | |
} | |
set_names(.$meta_strata_1, .$temp_strata_1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey all - this is a really simple issue. You need to use a specific random seed when modeling for consistent results. This is a basic best practice. Even if you somehow are getting deterministic results, it is an anti-pattern to not use random seeds. https://towardsdatascience.com/how-to-use-random-seeds-effectively-54a4cd855a79 Random seeds absolutely must be used. There is no code problem here, you are just missing a random seed param. |
During this PR, it was made known that the
step_impute_knn
is not deterministic and thus different model runs will have different results. This meant we add documentation to the Recipes.R file.Alongside this, we extract values from the bake stage of the condo pipeline and revert them to their original values (adding a
meta_strata_is_imputed
binary variable.