ccao-data · dfsnow · Jan 16, 2025 · Jan 14, 2025 · Jan 14, 2025 · Jan 14, 2025
@@ -19,6 +19,7 @@ cache/
 *.rds
 *.zip
 *.csv
+!docs/data-dict.csv
 *.xlsx
 !condo_nonlivable_demo.xlsx
 *.xlsm

@@ -27,3 +27,10 @@ repos:
         entry: Cannot commit .Rhistory, .RData, .Rds or .rds.
         language: fail
         files: '\.(Rhistory|RData|Rds|rds)$'
+      - id: check-data-dict
+        name: Data dictionary must be up to date with params file
+        entry: Rscript R/hooks/check-data-dict.R
+        files: (^|/)((params\.yaml)|(data-dict\.csv))$
+        language: r
+        additional_dependencies:
+          - yaml
@@ -0,0 +1,34 @@
+#!/usr/bin/env Rscript
+# Script to check that the data dictionary file is up to date with the
+# latest feature set
+library(yaml)
+
+params_filename <- "params.yaml"
+data_dict_filename <- "docs/data-dict.csv"
+
+params <- read_yaml(params_filename)
+data_dict <- read.csv(data_dict_filename)
+
+symmetric_diff <- c(
+  setdiff(data_dict$variable_name, params$model$predictor$all),
+  setdiff(params$model$predictor$all, data_dict$variable_name)
+)
+symmetric_diff_len <- length(symmetric_diff)
+
+if (symmetric_diff_len > 0) {
+  err_msg_prefix <- ifelse(symmetric_diff_len == 1, "Param is", "Params are")
+  err_msg <- paste0(
+    err_msg_prefix,
+    " not present in both ",
+    params_filename,
+    " and ",
+    data_dict_filename,
+    ": ",
+    paste(symmetric_diff, collapse = ", "),
+    ". ",
+    "Did you forget to reknit README.Rmd after updating ",
+    params_filename,
+    "?"
+  )
+  stop(err_msg)
+}
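
To illustrate the comparison the hook performs, here is a minimal sketch on toy vectors. The predictor names are hypothetical stand-ins for `params$model$predictor$all` and `data_dict$variable_name`, not the real feature set.

```r
# Hypothetical toy inputs; these names are illustrative only
params_preds <- c("char_yrblt", "loc_latitude", "time_sale_year")
dict_vars <- c("char_yrblt", "loc_latitude", "meta_township_code")

# Collect names that appear in exactly one of the two vectors,
# mirroring the two setdiff() calls in the hook
symmetric_diff <- c(
  setdiff(dict_vars, params_preds),
  setdiff(params_preds, dict_vars)
)

print(symmetric_diff)
```

In this sketch the result has two elements, so the hook's `stop()` branch would fire and list both names. An empty result means the two files agree.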
@@ -60,13 +60,14 @@ We leverage these qualities to produce what we call ***strata***, a feature uniq
 
 ### Features Used
 
-Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the 2024 assessment model.
+Because our individual condo unit characteristics are sparse and incomplete, we primarily must rely on aggregate geospatial features, economic features, [strata](#condo-strata), and time of sale to determine condo assessed values. The features in the table below are the ones used in the most recent assessment model.
 
 ```{r features_used, message=FALSE, echo=FALSE}
 library(dplyr)
 library(glue)
 library(jsonlite)
 library(purrr)
+library(readr)
 library(tidyr)
 library(yaml)
 
@@ -154,43 +155,62 @@ res_preds <- res_params$model$predictor$all
 
 condo_unique_preds <- setdiff(condo_preds$value, res_preds)
 
-condo_preds %>%
+condo_preds_fmt <- condo_preds %>%
   mutate(description = param_notes) %>%
   left_join(
     ccao::vars_dict,
     by = c("value" = "var_name_model")
   ) %>%
   distinct(
-    `Feature Name` = var_name_pretty,
-    Category = var_type,
-    Type = var_data_type,
-    Notes = description,
-    value
+    feature_name = var_name_pretty,
+    variable_name = value,
+    description,
+    category = var_type,
+    type = var_data_type
   ) %>%
   mutate(
-    Category = recode(
-      Category,
+    category = recode(
+      category,
       char = "Characteristic", acs5 = "ACS5", loc = "Location",
       prox = "Proximity", ind = "Indicator", time = "Time",
-      meta = "Meta", other = "Other", ccao = "Other"
+      meta = "Meta", other = "Other", ccao = "Other", shp = "Parcel Shape"
     ),
-    `Feature Name` = recode(
-      `Feature Name`,
+    feature_name = recode(
+      feature_name,
       "Tieback Proration Rate" = "Condominium % Ownership",
       "Year Built" = "Condominium Building Year Built"
+    ),
+    unique_to_condo_model = ifelse(
+      variable_name %in% condo_unique_preds |
+        feature_name %in%
+          c("Condominium Building Year Built", "Condominium % Ownership"),
+      TRUE, FALSE
     )
   ) %>%
-  mutate(`Unique to Condo Model` = ifelse(
-    value %in% condo_unique_preds |
-      `Feature Name` %in%
-        c("Condominium Building Year Built", "Condominium % Ownership"),
-    "X", ""
-  )) %>%
-  arrange(desc(`Unique to Condo Model`), Category) %>%
-  select(-value) %>%
+  arrange(desc(unique_to_condo_model), category)
+
+condo_preds_fmt %>%
+  write_csv("docs/data-dict.csv")
+
+condo_preds_fmt %>%
+  mutate(unique_to_condo_model = ifelse(unique_to_condo_model, "X", "")) %>%
+  rename(
+    "Feature Name" = "feature_name",
+    "Variable Name" = "variable_name",
+    "Description" = "description",
+    "Category" = "category",
+    "Type" = "type",
+    "Unique to Condo Model" = "unique_to_condo_model"
+  ) %>%
   knitr::kable(format = "markdown")
 ```
 
+We maintain a few useful resources for working with these features:
+
+- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model.
+- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, in particular the [condo model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_pin_condo_input), which is the source of our training data.
+- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
+
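
The join described in the first bullet could look roughly like the sketch below. It uses base R only, and both data frames are hypothetical stand-ins: `data_dict` mimics the `variable_name` and `feature_name` columns of `docs/data-dict.csv`, while `input_data` mimics a wide input table with one column per feature. The real input data may be shaped differently.

```r
# Hypothetical stand-in for read.csv("docs/data-dict.csv")
data_dict <- data.frame(
  variable_name = c("char_yrblt", "loc_latitude"),
  feature_name = c("Condominium Building Year Built", "Latitude")
)

# Hypothetical stand-in for the pulled input data
input_data <- data.frame(
  meta_pin = c("a", "b"),
  char_yrblt = c(1990, 2005),
  loc_latitude = c(41.8, 41.9),
  unused_col = c(1, 2)
)

# merge() performs an inner join by default, so only column names that
# also appear in the data dictionary survive
used <- merge(
  data.frame(variable_name = names(input_data)),
  data_dict,
  by = "variable_name"
)

# Keep only the columns that the dictionary lists as model features
model_features <- input_data[, used$variable_name, drop = FALSE]
print(names(model_features))
```

Because the assumed input is wide, the join runs on column names rather than on rows. For a long table with one row per variable, the same `merge()` call could be applied to the table directly.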
 ### Valuation
 
 For the most part, condos are valued the same way as single- and multi-family residential property. We [train a model](https://github.com/ccao-data/model-res-avm#how-it-works) using individual condo unit sales, predict the value of all units, and then apply any [post-modeling adjustment](https://github.com/ccao-data/model-res-avm#post-modeling).