Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include imputed strata in assessment card/pin #55

Conversation

Damonamajor
Copy link
Contributor

During this PR, it was made known that the step_impute_knn is not deterministic and thus different model runs will have different results. This meant we add documentation to the Recipes.R file.
Alongside this, we extract values from the bake stage of the condo pipeline and revert them to their original values (adding a meta_strata_is_imputed binary variable.

@Damonamajor Damonamajor linked an issue Nov 18, 2024 that may be closed by this pull request

mapping_1 <- assessment_data_pred %>%
Copy link
Contributor Author

@Damonamajor Damonamajor Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be minorly optimized by use deframe from the tibble package, but that's not part of the existing renv.

@@ -268,7 +296,8 @@ assessment_data_pin <- assessment_data_merged %>%
meta_year, meta_pin, meta_pin10, meta_triad_code, meta_township_code,
meta_nbhd_code, meta_tax_code, meta_class, meta_tieback_key_pin,
meta_tieback_proration_rate, meta_cdu, meta_modeling_group,
Copy link
Contributor Author

@Damonamajor Damonamajor Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any documentation that needs to be updated with new values in pin / card output files? Strata was never in the pin output file to begin with.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! This should just get automatically added to the equivalent tables in Athena once it's crawled.

@Damonamajor Damonamajor marked this pull request as ready for review November 18, 2024 16:16
@Damonamajor Damonamajor changed the title Include imputed strata in assessment data assessment card Include imputed strata in assessment card/pin Nov 18, 2024
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonamajor This is mostly looking great! A few minor cleanups in the comments. Can you also make a new issue to add the meta_strata values and flag (is_imputed) to the Desk Review spreadsheets (so edit 07-export.R).

Comment on lines +34 to +37
# impute missing condo building strata. Within step_impute_knn, an estimated
# node value is called with the sample(). This is not deterministic, meaning
# different runs of the model will have different imputed values, and thus
# different FMVs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): This is sort of correct. The imputation will be deterministic between model runs if a seed is set somewhere in this file. However, it won't be deterministic if you run the same prediction twice in a single session (unless you set the seed again).

I would add set.seed(params$input$strata$seed) somewhere at the top of this file to ensure that prediction is always using the same seed. Then run the stage twice (run once, restart, run again) and check that the results are the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included this in the file as described, and as setting a global seed. In each of those situations, different iterations still created different FMVs. You can use the uploaded testing file to look into this.

Comment on lines 70 to 75
mapping_2 <- assessment_data_pred %>%
filter(!is.na(meta_strata_2)) %>%
distinct(temp_strata_2, meta_strata_2)

strata_mapping_1 <- setNames(mapping_1$meta_strata_1, mapping_1$temp_strata_1)
strata_mapping_2 <- setNames(mapping_2$meta_strata_2, mapping_2$temp_strata_2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: You can clean this up a tiny bit using set_names from rlang or purrr:

strata_mapping_2 <- assessment_data_pred %>%
  filter(!is.na(meta_strata_2)) %>%
  distinct(meta_strata_2, temp_strata_2) %>%
  purrr::set_names(.$meta_strata_2, .$temp_strata_2)

pipeline/02-assess.R Outdated Show resolved Hide resolved
@@ -154,7 +183,8 @@ assessment_data_merged %>%
select(
meta_year, meta_pin, meta_class, meta_card_num, meta_lline_num,
meta_modeling_group, ends_with("_num_sale"), pred_card_initial_fmv,
all_of(params$model$predictor$all), township_code
all_of(params$model$predictor$all),
meta_strata_is_imputed, township_code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Let's change the prefix of this variable to flag_ since its purpose fits slightly better with that group of vars.

@@ -268,7 +296,8 @@ assessment_data_pin <- assessment_data_merged %>%
meta_year, meta_pin, meta_pin10, meta_triad_code, meta_township_code,
meta_nbhd_code, meta_tax_code, meta_class, meta_tieback_key_pin,
meta_tieback_proration_rate, meta_cdu, meta_modeling_group,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! This should just get automatically added to the equivalent tables in Athena once it's crawled.

Damonamajor and others added 5 commits November 18, 2024 12:46
Merge remote-tracking branch 'refs/remotes/origin/54-include-imputed-strata-in-assessment_data-assessment_card' into 54-include-imputed-strata-in-assessment_data-assessment_card
# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
testing_file.R Outdated
@@ -0,0 +1,85 @@
assessment_data_pred <- read_parquet(paths$input$assessment$local) %>%
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a testing file demonstrating that the re-converted strata are the same as the original strata.

Merge remote-tracking branch 'refs/remotes/origin/54-include-imputed-strata-in-assessment_data-assessment_card' into 54-include-imputed-strata-in-assessment_data-assessment_card
@Damonamajor Damonamajor requested a review from dfsnow December 3, 2024 02:27
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Damonamajor This looks fine for now. Let's review it as a part of our modeling work in the next couple weeks.

Comment on lines 70 to 72
{
set_names(.$meta_strata_1, .$temp_strata_1)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Why the additional curly braces here? You should be able to just:

Suggested change
{
set_names(.$meta_strata_1, .$temp_strata_1)
}
set_names(.$meta_strata_1, .$temp_strata_1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

@dfsnow dfsnow mentioned this pull request Dec 7, 2024
@dfsnow dfsnow changed the base branch from master to 2025-assessment-year December 17, 2024 20:38
@dfsnow dfsnow merged commit 4c55cc3 into 2025-assessment-year Dec 17, 2024
4 checks passed
@dfsnow dfsnow deleted the 54-include-imputed-strata-in-assessment_data-assessment_card branch December 17, 2024 20:39
@ssaurbier
Copy link

Hey all - this is a really simple issue. You need to use a specific random seed when modeling for consistent results. This is a basic best practice. Even if you somehow are getting deterministic results, it is an anti-pattern to not use random seeds.

https://towardsdatascience.com/how-to-use-random-seeds-effectively-54a4cd855a79

Random seeds absolutely must be used. There is no code problem here, you are just missing a random seed param.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include imputed strata in assessment_data / assessment_card
3 participants