Add a data dictionary and remove duplicate features from it #86

Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -19,6 +19,7 @@ cache/
*.rds
*.zip
*.csv
!docs/data-dict.csv
*.xlsx
!condo_nonlivable_demo.xlsx
*.xlsm
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -27,3 +27,10 @@ repos:
        entry: Cannot commit .Rhistory, .RData, .Rds or .rds.
        language: fail
        files: '\.(Rhistory|RData|Rds|rds)$'
      - id: check-data-dict
        name: Data dictionary must be up to date with params file
        entry: Rscript R/hooks/check-data-dict.R
        files: (^|/)((params\.yaml)|(data-dict\.csv))$
        language: r
        additional_dependencies:
          - yaml
34 changes: 34 additions & 0 deletions R/hooks/check-data-dict.R
@@ -0,0 +1,34 @@
#!/usr/bin/env Rscript
# Script to check that the data dictionary file is up to date with the
# latest feature set
library(yaml)

params_filename <- "params.yaml"
data_dict_filename <- "docs/data-dict.csv"

params <- read_yaml(params_filename)
data_dict <- read.csv(data_dict_filename)

symmetric_diff <- c(
  setdiff(data_dict$variable_name, params$model$predictor$all),
  setdiff(params$model$predictor$all, data_dict$variable_name)
)
symmetric_diff_len <- length(symmetric_diff)

if (symmetric_diff_len > 0) {
  err_msg_prefix <- ifelse(symmetric_diff_len == 1, "Param is", "Params are")
  err_msg <- paste0(
    err_msg_prefix,
    " not present in both ",
    params_filename,
    " and ",
    data_dict_filename,
    ": ",
    paste(symmetric_diff, collapse = ", "),
    ". ",
    "Did you forget to reknit README.Rmd after updating ",
    params_filename,
    "?"
  )
  stop(err_msg)
}
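
To illustrate what the hook above checks, here is a minimal sketch of the same symmetric-difference logic run on in-memory vectors instead of the real files. The feature names are hypothetical placeholders, not taken from `params.yaml` or `docs/data-dict.csv`.

```r
# Hypothetical stand-ins for params$model$predictor$all and
# data_dict$variable_name; the names are illustrative only
params_preds <- c("char_yrblt", "loc_latitude", "time_sale_year")
dict_vars <- c("char_yrblt", "loc_latitude", "meta_township_code")

# Names present in one source but not the other, in either direction
symmetric_diff <- c(
  setdiff(dict_vars, params_preds),
  setdiff(params_preds, dict_vars)
)

print(symmetric_diff)
```

Any non-empty result is what would cause the hook to call `stop()` and block the commit.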
60 changes: 40 additions & 20 deletions README.Rmd
@@ -60,13 +60,14 @@ We leverage these qualities to produce what we call ***strata***, a feature uniq

### Features Used

Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the 2024 assessment model.
Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the most recent assessment model.
Contributor Author

Is "most recent" OK here, or would you rather we continue to update this manually?

Member

"Most recent" is definitely fine.


```{r features_used, message=FALSE, echo=FALSE}
library(dplyr)
library(glue)
library(jsonlite)
library(purrr)
library(readr)
library(tidyr)
library(yaml)

@@ -154,43 +155,62 @@ res_preds <- res_params$model$predictor$all

condo_unique_preds <- setdiff(condo_preds$value, res_preds)

condo_preds %>%
condo_preds_fmt <- condo_preds %>%
mutate(description = param_notes) %>%
left_join(
ccao::vars_dict,
by = c("value" = "var_name_model")
) %>%
distinct(
`Feature Name` = var_name_pretty,
Category = var_type,
Type = var_data_type,
Notes = description,
value
feature_name = var_name_pretty,
variable_name = value,
description,
category = var_type,
type = var_data_type
) %>%
mutate(
Category = recode(
Category,
category = recode(
category,
char = "Characteristic", acs5 = "ACS5", loc = "Location",
prox = "Proximity", ind = "Indicator", time = "Time",
meta = "Meta", other = "Other", ccao = "Other"
meta = "Meta", other = "Other", ccao = "Other", shp = "Parcel Shape"
),
`Feature Name` = recode(
`Feature Name`,
feature_name = recode(
feature_name,
"Tieback Proration Rate" = "Condominium % Ownership",
"Year Built" = "Condominium Building Year Built"
),
unique_to_condo_model = ifelse(
variable_name %in% condo_unique_preds |
feature_name %in%
c("Condominium Building Year Built", "Condominium % Ownership"),
TRUE, FALSE
)
) %>%
mutate(`Unique to Condo Model` = ifelse(
value %in% condo_unique_preds |
`Feature Name` %in%
c("Condominium Building Year Built", "Condominium % Ownership"),
"X", ""
)) %>%
arrange(desc(`Unique to Condo Model`), Category) %>%
select(-value) %>%
arrange(desc(unique_to_condo_model), category)

condo_preds_fmt %>%
write_csv("docs/data-dict.csv")

condo_preds_fmt %>%
mutate(unique_to_condo_model = ifelse(unique_to_condo_model, "X", "")) %>%
rename(
"Feature Name" = "feature_name",
"Variable Name" = "variable_name",
"Description" = "description",
"Category" = "category",
"Type" = "type",
"Unique to Condo Model" = "unique_to_condo_model"
) %>%
knitr::kable(format = "markdown")
```

We maintain a few useful resources for working with these features:

- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model.
- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, in particular the [condo model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_pin_condo_input) which is the source of our training data.
- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
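
As a rough sketch of the first bullet, the filtering step might look like the following. This assumes the input data has one column per feature, so the "join" amounts to keeping the columns whose names appear in the `variable_name` column of the data dictionary. Both data frames are toy stand-ins with hypothetical column names; in practice the dictionary would come from reading `docs/data-dict.csv`.

```r
# Toy stand-in for docs/data-dict.csv; only variable_name is needed here
data_dict <- data.frame(
  variable_name = c("char_yrblt", "loc_latitude"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the input data (hypothetical columns, one per feature)
input_data <- data.frame(
  meta_pin = c("A", "B"),
  char_yrblt = c(1990, 2005),
  loc_latitude = c(41.8, 41.9),
  extra_column = c(1, 2)
)

# Keep only the columns that are listed in the data dictionary
model_cols <- intersect(names(input_data), data_dict$variable_name)
model_features <- input_data[, model_cols, drop = FALSE]

print(names(model_features))
```

Columns that do not appear in the dictionary (here `meta_pin` and `extra_column`) are dropped, so identifier columns would need to be carried along separately if they are required downstream.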
Comment on lines +208 to +212
Contributor Author

This paragraph is tweaked from the res model, but I'm wondering if perhaps it's too much duplicate information. Let me know if you think there's a slimmer way of pointing users to these resources without fully duplicating this section between the res and condo models.

Member

I think this is totally fine for now. We can consolidate later when we do a rewrite.


### Valuation

For the most part, condos are valued the same way as single- and multi-family residential property. We [train a model](https://github.com/ccao-data/model-res-avm#how-it-works) using individual condo unit sales, predict the value of all units, and then apply any [post-modeling adjustment](https://github.com/ccao-data/model-res-avm#post-modeling).