Add a data dictionary and update README to document it
jeancochrane committed Jan 14, 2025
1 parent 19bc856 commit 388f278
Showing 6 changed files with 287 additions and 105 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -19,6 +19,7 @@ cache/
*.rds
*.zip
*.csv
!docs/data-dict.csv
*.xlsx
!condo_nonlivable_demo.xlsx
*.xlsm
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -27,3 +27,10 @@ repos:
        entry: Cannot commit .Rhistory, .RData, .Rds or .rds.
        language: fail
        files: '\.(Rhistory|RData|Rds|rds)$'
      - id: check-data-dict
        name: Data dictionary must be up to date with params file
        entry: Rscript R/hooks/check-data-dict.R
        files: (^|/)((params\.yaml)|(data-dict\.csv))$
        language: r
        additional_dependencies:
          - yaml
34 changes: 34 additions & 0 deletions R/hooks/check-data-dict.R
@@ -0,0 +1,34 @@
#!/usr/bin/env Rscript
# Script to check that the data dictionary file is up to date with the
# latest feature set
library(yaml)

params_filename <- "params.yaml"
data_dict_filename <- "docs/data-dict.csv"

params <- read_yaml(params_filename)
data_dict <- read.csv(data_dict_filename)

symmetric_diff <- c(
  setdiff(data_dict$variable_name, params$model$predictor$all),
  setdiff(params$model$predictor$all, data_dict$variable_name)
)
symmetric_diff_len <- length(symmetric_diff)

if (symmetric_diff_len > 0) {
  err_msg_prefix <- ifelse(symmetric_diff_len == 1, "Param is", "Params are")
  err_msg <- paste0(
    err_msg_prefix,
    " not present in both ",
    params_filename,
    " and ",
    data_dict_filename,
    ": ",
    paste(symmetric_diff, collapse = ", "),
    ". ",
    "Did you forget to reknit README.Rmd after updating ",
    params_filename,
    "?"
  )
  stop(err_msg)
}
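
The core of the hook above is a symmetric difference between two name vectors. A minimal sketch of that step on toy inputs may make the behavior easier to see. The predictor names here are illustrative assumptions, not values read from `params.yaml` or `docs/data-dict.csv`.

```r
# Hypothetical stand-ins for params$model$predictor$all and
# data_dict$variable_name; the real hook reads both from files
params_predictors <- c("char_yrblt", "loc_longitude", "time_sale_year")
dict_variables <- c("char_yrblt", "loc_longitude", "meta_strata_1")

# Names present in exactly one of the two vectors
symmetric_diff <- c(
  setdiff(dict_variables, params_predictors),
  setdiff(params_predictors, dict_variables)
)

# A non-empty result is the condition under which the hook calls stop()
out_of_sync <- length(symmetric_diff) > 0
```

Because `setdiff()` is one-directional, both calls are needed: the first catches stale dictionary rows, the second catches predictors that were added to the params file without reknitting the README.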
60 changes: 40 additions & 20 deletions README.Rmd
@@ -60,13 +60,14 @@ We leverage these qualities to produce what we call ***strata***, a feature uniq

### Features Used

Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the 2024 assessment model.
Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the most recent assessment model.

```{r features_used, message=FALSE, echo=FALSE}
library(dplyr)
library(glue)
library(jsonlite)
library(purrr)
library(readr)
library(tidyr)
library(yaml)
@@ -154,43 +155,62 @@ res_preds <- res_params$model$predictor$all
condo_unique_preds <- setdiff(condo_preds$value, res_preds)
condo_preds %>%
condo_preds_fmt <- condo_preds %>%
mutate(description = param_notes) %>%
left_join(
ccao::vars_dict,
by = c("value" = "var_name_model")
) %>%
distinct(
`Feature Name` = var_name_pretty,
Category = var_type,
Type = var_data_type,
Notes = description,
value
feature_name = var_name_pretty,
variable_name = value,
description,
category = var_type,
type = var_data_type
) %>%
mutate(
Category = recode(
Category,
category = recode(
category,
char = "Characteristic", acs5 = "ACS5", loc = "Location",
prox = "Proximity", ind = "Indicator", time = "Time",
meta = "Meta", other = "Other", ccao = "Other"
meta = "Meta", other = "Other", ccao = "Other", shp = "Parcel Shape"
),
`Feature Name` = recode(
`Feature Name`,
feature_name = recode(
feature_name,
"Tieback Proration Rate" = "Condominium % Ownership",
"Year Built" = "Condominium Building Year Built"
),
unique_to_condo_model = ifelse(
variable_name %in% condo_unique_preds |
feature_name %in%
c("Condominium Building Year Built", "Condominium % Ownership"),
TRUE, FALSE
)
) %>%
mutate(`Unique to Condo Model` = ifelse(
value %in% condo_unique_preds |
`Feature Name` %in%
c("Condominium Building Year Built", "Condominium % Ownership"),
"X", ""
)) %>%
arrange(desc(`Unique to Condo Model`), Category) %>%
select(-value) %>%
arrange(desc(unique_to_condo_model), category)
condo_preds_fmt %>%
write_csv("docs/data-dict.csv")
condo_preds_fmt %>%
mutate(unique_to_condo_model = ifelse(unique_to_condo_model, "X", "")) %>%
rename(
"Feature Name" = "feature_name",
"Variable Name" = "variable_name",
"Description" = "description",
"Category" = "category",
"Type" = "type",
"Unique to Condo Model" = "unique_to_condo_model"
) %>%
knitr::kable(format = "markdown")
```

We maintain a few useful resources for working with these features:

- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model.
- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, in particular the [condo model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_pin_condo_input), which is the source of our training data.
- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
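
As a rough sketch of the first point, the filter could look like the following. It assumes the input data is wide (one column per feature) and treats the join as a column selection against the dictionary's `variable_name` column, which is one plausible reading rather than a documented recipe. Both data frames are toy stand-ins with illustrative column names; in practice the dictionary would come from `read.csv("docs/data-dict.csv")`.

```r
# Toy stand-in for docs/data-dict.csv (illustrative rows only)
data_dict <- data.frame(
  variable_name = c("char_yrblt", "loc_longitude"),
  feature_name = c("Condominium Building Year Built", "Longitude")
)

# Toy stand-in for the pulled input data; meta_pin is assumed here
# to be a column that is not a model feature
input_data <- data.frame(
  meta_pin = c("A", "B"),
  char_yrblt = c(1995, 2008),
  loc_longitude = c(-87.65, -87.70)
)

# Keep only the columns whose names appear in the data dictionary
feature_cols <- intersect(names(input_data), data_dict$variable_name)
model_features <- input_data[, feature_cols, drop = FALSE]
```

Using `intersect()` rather than indexing by `data_dict$variable_name` directly means a dictionary entry that is missing from the input data is skipped instead of raising an error.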

### Valuation

For the most part, condos are valued the same way as single- and multi-family residential property. We [train a model](https://github.com/ccao-data/model-res-avm#how-it-works) using individual condo unit sales, predict the value of all units, and then apply any [post-modeling adjustment](https://github.com/ccao-data/model-res-avm#post-modeling).