Add a data dictionary and remove duplicate features from it #86
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,6 +19,7 @@ cache/ | |
*.rds | ||
*.zip | ||
*.csv | ||
!docs/data-dict.csv | ||
*.xlsx | ||
!condo_nonlivable_demo.xlsx | ||
*.xlsm | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
#!/usr/bin/env Rscript | ||
# Script to check that the data dictionary file is up to date with the | ||
# latest feature set | ||
library(yaml) | ||
|
||
params_filename <- "params.yaml" | ||
data_dict_filename <- "docs/data-dict.csv" | ||
|
||
params <- read_yaml(params_filename) | ||
data_dict <- read.csv(data_dict_filename) | ||
|
||
symmetric_diff <- c( | ||
setdiff(data_dict$variable_name, params$model$predictor$all), | ||
setdiff(params$model$predictor$all, data_dict$variable_name) | ||
) | ||
symmetric_diff_len <- length(symmetric_diff) | ||
|
||
if (symmetric_diff_len > 0) { | ||
err_msg_prefix <- ifelse(symmetric_diff_len == 1, "Param is", "Params are") | ||
err_msg <- paste0( | ||
err_msg_prefix, | ||
" not present in both ", | ||
params_filename, | ||
" and ", | ||
data_dict_filename, | ||
": ", | ||
paste(symmetric_diff, collapse = ", "), | ||
". ", | ||
"Did you forget to reknit README.Rmd after updating ", | ||
params_filename, | ||
"?" | ||
) | ||
stop(err_msg) | ||
} |
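
The check in this script reduces to a symmetric set difference between two character vectors. A minimal sketch of that logic follows, using made-up feature names (`feature_a` and so on are hypothetical and do not come from `params.yaml` or `docs/data-dict.csv`):

```r
# Hypothetical stand-ins for params$model$predictor$all and
# data_dict$variable_name; the names are invented for illustration.
params_preds <- c("feature_a", "feature_b", "feature_c")
dict_vars <- c("feature_a", "feature_b", "feature_d")

# Symmetric difference: names that appear in exactly one of the two vectors
symmetric_diff <- c(
  setdiff(dict_vars, params_preds),
  setdiff(params_preds, dict_vars)
)

# A non-empty result is what would make the script call stop()
print(symmetric_diff)
```

Because `setdiff()` is one-directional, both calls are needed: the first catches stale rows left in the data dictionary, the second catches predictors that were added to the params without regenerating the dictionary.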
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -60,13 +60,14 @@ We leverage these qualities to produce what we call ***strata***, a feature uniq | |
|
||
### Features Used | ||
|
||
Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the 2024 assessment model. | ||
Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the most recent assessment model. | ||
|
||
```{r features_used, message=FALSE, echo=FALSE} | ||
library(dplyr) | ||
library(glue) | ||
library(jsonlite) | ||
library(purrr) | ||
library(readr) | ||
library(tidyr) | ||
library(yaml) | ||
|
||
|
@@ -154,43 +155,62 @@ res_preds <- res_params$model$predictor$all | |
|
||
condo_unique_preds <- setdiff(condo_preds$value, res_preds) | ||
|
||
condo_preds %>% | ||
condo_preds_fmt <- condo_preds %>% | ||
mutate(description = param_notes) %>% | ||
left_join( | ||
ccao::vars_dict, | ||
by = c("value" = "var_name_model") | ||
) %>% | ||
distinct( | ||
`Feature Name` = var_name_pretty, | ||
Category = var_type, | ||
Type = var_data_type, | ||
Notes = description, | ||
value | ||
feature_name = var_name_pretty, | ||
variable_name = value, | ||
description, | ||
category = var_type, | ||
type = var_data_type | ||
) %>% | ||
mutate( | ||
Category = recode( | ||
Category, | ||
category = recode( | ||
category, | ||
char = "Characteristic", acs5 = "ACS5", loc = "Location", | ||
prox = "Proximity", ind = "Indicator", time = "Time", | ||
meta = "Meta", other = "Other", ccao = "Other" | ||
meta = "Meta", other = "Other", ccao = "Other", shp = "Parcel Shape" | ||
), | ||
`Feature Name` = recode( | ||
`Feature Name`, | ||
feature_name = recode( | ||
feature_name, | ||
"Tieback Proration Rate" = "Condominium % Ownership", | ||
"Year Built" = "Condominium Building Year Built" | ||
), | ||
unique_to_condo_model = ifelse( | ||
variable_name %in% condo_unique_preds | | ||
feature_name %in% | ||
c("Condominium Building Year Built", "Condominium % Ownership"), | ||
TRUE, FALSE | ||
) | ||
) %>% | ||
mutate(`Unique to Condo Model` = ifelse( | ||
value %in% condo_unique_preds | | ||
`Feature Name` %in% | ||
c("Condominium Building Year Built", "Condominium % Ownership"), | ||
"X", "" | ||
)) %>% | ||
arrange(desc(`Unique to Condo Model`), Category) %>% | ||
select(-value) %>% | ||
arrange(desc(unique_to_condo_model), category) | ||
|
||
condo_preds_fmt %>% | ||
write_csv("docs/data-dict.csv") | ||
|
||
condo_preds_fmt %>% | ||
mutate(unique_to_condo_model = ifelse(unique_to_condo_model, "X", "")) %>% | ||
rename( | ||
"Feature Name" = "feature_name", | ||
"Variable Name" = "variable_name", | ||
"Description" = "description", | ||
"Category" = "category", | ||
"Type" = "type", | ||
"Unique to Condo Model" = "unique_to_condo_model" | ||
) %>% | ||
knitr::kable(format = "markdown") | ||
``` | ||
|
||
We maintain a few useful resources for working with these features: | ||
|
||
- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model. | ||
- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, in particular the [condo model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_pin_condo_input), which is the source of our training data. | ||
- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions. | ||
Comment on lines +208 to +212

This paragraph is tweaked from the res model, but I'm wondering if perhaps it's too much duplicate information. Let me know if you think there's a slimmer way of pointing users to these resources without fully duplicating this section between the res and condo models.

I think this is totally fine for now. We can consolidate later when we do a rewrite.
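
For readers of the first resource in the list above, here is a rough sketch of filtering input data down to the model's features using the data dictionary. It takes one reading of "inner join": keeping only the input columns whose names appear in the dictionary's `variable_name` column. The data frames and column names are hypothetical stand-ins, not the real input data or the real contents of `docs/data-dict.csv`.

```r
# Hypothetical stand-in for read.csv("docs/data-dict.csv");
# only the variable_name column matters for this sketch.
data_dict <- data.frame(
  variable_name = c("feature_a", "feature_b"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the pulled input data
input_data <- data.frame(
  feature_a = c(1, 2),
  feature_b = c("x", "y"),
  unused_col = c(TRUE, FALSE)
)

# Keep only the columns that are listed in the data dictionary
model_cols <- intersect(names(input_data), data_dict$variable_name)
filtered <- input_data[, model_cols, drop = FALSE]

print(names(filtered))
```

`drop = FALSE` keeps the result a data frame even when only one column matches.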
||
|
||
### Valuation | ||
|
||
For the most part, condos are valued the same way as single- and multi-family residential property. We [train a model](https://github.com/ccao-data/model-res-avm#how-it-works) using individual condo unit sales, predict the value of all units, and then apply any [post-modeling adjustment](https://github.com/ccao-data/model-res-avm#post-modeling). | ||
|
Is "most recent" OK here, or would you rather we continue to update this manually?
"Most recent" is definitely fine.