2024 model data update #670

wrridgeway · 2024-12-05T21:36:55Z

This PR unfortunately combines two issues: updating scripts wherever needed to refresh the modeling data for 2024/2025, and adding a `loaded_at` column to the Athena tables created by scripts in the data-architecture/etl folder (as well as spatial.stadium).

As for the model data refresh, these scripts still need to be run again once data is available:

| Script | Progress | Date | Notes |
| --- | --- | --- | --- |
| ccao/ccao-condominium-pin_condo_char.R | Incomplete | | Waiting on valuations |
| census/census-acs.R | Incomplete | 12/12/2024 | 2024 data not yet available |
| census/census-decennial.R | Incomplete | 12/12/2024 | 2024 data not yet available |
| census/census-dictionary.R | Incomplete | 12/12/2024 | 2024 data not yet available |
| export/export-geojson.R | Incomplete | 12/12/2024 | 2024 data not yet available |
| spatial/spatial-ccao-neighborhood.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-ccao-township.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-census.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-other.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-parcel.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-political.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-tax.R | Incomplete | 12/12/2024 | Waiting on GIS |

Tables with `loaded_at` column added:

- ccao
- ccbor
  - appeals
- census
- other
- sale
  - foreclosure
  - mydec
  - validated
- spatial

wrridgeway · 2024-12-19T22:42:23Z

etl/scripts-ccao-data-raw-us-east-1/spatial/spatial-transit.R

+      str_remove_all(feed_date, "-"), "/download"
+    ),
+    paste0(
+      "https://files.mobilitydatabase.org/mdb-389/mdb-389-",


TransitFeeds is dead, so we need to use Mobility Database now.

wrridgeway · 2024-12-19T22:43:57Z

etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-access.R

+    rename(
+      walkability_rating = Walkabilit,
+      amenities_score = Amenities,
+      transitaccess = TransitAcc
+    ) %>%


This rename chunk started throwing an error for seemingly no reason. Moving the code up a few lines resolved it.

wrridgeway · 2024-12-19T22:44:27Z

etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-access.R

    standardize_expand_geo() %>%
    select(-contains("shape")) %>%
-    mutate(year = "2017") %>%


This was producing two `year` columns, which broke Hive partitioning.

etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-environment-ohare_noise.R

wrridgeway · 2024-12-19T22:46:20Z

etl/utils.R

  dplyr::group_walk(df, ~ {
    partitions_df <- purrr::map_dfr(
-      .y, replace_na, "__HIVE_DEFAULT_PARTITION__"
+      .y, tidyr::replace_na, "__HIVE_DEFAULT_PARTITION__"


I was occasionally running into errors where R seemed to treat `replace_na` as an object(?) rather than a function. Namespacing it as `tidyr::replace_na` solved the problem.
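The exact cause was not confirmed in this thread. As a minimal sketch of one way a bare function name can fail while the namespaced form keeps working, the base R example below shadows `stats::sd` with a non-function object. The helper `apply_fn` and the shadowing assignment are hypothetical and are not taken from `etl/utils.R`.

```r
# Hypothetical helper: applies whatever is passed as `f` to `x`
apply_fn <- function(x, f) f(x)

# A non-function object that shares its name with stats::sd()
sd <- "not a function"

# The bare name now resolves to the character string, so the call errors
bare_result <- tryCatch(
  apply_fn(c(1, 2, 3), sd),
  error = function(e) conditionMessage(e)
)

# The namespaced reference always resolves to the package function
namespaced_result <- apply_fn(c(1, 2, 3), stats::sd)

# Remove the shadowing object so later code sees stats::sd() again
rm(sd)

print(bare_result)
print(namespaced_result)
```

Namespacing sidesteps the lookup entirely, which is why it is a reasonable defensive fix even when the shadowing object cannot be tracked down.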

wrridgeway · 2024-12-19T22:47:15Z

etl/utils.R

This file also needed to be linted.

jeancochrane

Looks good, thanks for picking up this chore! A few nits and questions below, but nothing I'd consider blocking other than my question about disabling renv on CI. @dfsnow should probably take a look at this too, as someone who has more context on these scripts than I do.

jeancochrane · 2024-12-20T16:37:32Z

.github/workflows/lint.yaml

+      - name: Disable renv
+        shell: bash
+        run: rm etl/.Rprofile


[Question, blocking] Can you link to an example workflow that failed prior to this change? I want to make sure I understand why this is the best path forward before we move ahead with it.

etl/scripts-ccao-data-raw-us-east-1/spatial/spatial-ccao.R

jeancochrane · 2024-12-20T16:57:45Z

etl/scripts-ccao-data-warehouse-us-east-1/ccao/ccao-land-land_nbhd_rate.R

+  land_nbhd_rate_2024
+) %>%
+  relocate(land_rate_per_sqft, .after = last_col()) %>%
+  mutate(loaded_at = as.character(Sys.time())) %>%


[Question, non-blocking] Do you know why we missed these files when we fixed line endings as part of the linting PR? I'm guessing it's just because this file didn't require any linting fixes, so we didn't notice the incorrect line endings?

jeancochrane · 2024-12-20T17:01:46Z

etl/scripts-ccao-data-warehouse-us-east-1/ccao/ccao-legacy.R

@@ -99,6 +99,7 @@ cc_dli_senfrr <- map_dfr(files_cc_dli_senfrr$Key, \(f) {

 # Write the files to S3, partitioned by year
 cc_dli_senfrr %>%
+  mutate(loaded_at = as.character(Sys.time())) %>%


[Thought, non-blocking] More a question for @dfsnow, but I wonder if we need loaded_at fields on these tables? They're one-off QC extracts that Mirella provides us with, so I don't expect we'll want freshness tests on them. But I'm open to them if we can think of a good reason why we might need to know when we loaded them.

@jeancochrane My thinking here is that it would be nice and convenient to have a `loaded_at` column on every Athena source so that we can just run `SELECT MAX(loaded_at) FROM $TABLE` for everything in our catalog and know what hasn't been touched in a while.

Right @dfsnow, I was just a bit confused by this one since it seems categorically different from our other sources (one-time ingest for a specific project, rather than something we pull regularly for modeling). But if you think it's easier for everything to follow the same pattern it's fine by me.
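As a rough sketch of the catalog-wide check described above, the R snippet below builds one `SELECT MAX(loaded_at)` query per table and stitches them together with `UNION ALL`. The two table names come from elsewhere in this thread; the vector `tables`, the output column names, and the idea of generating the SQL from R are assumptions for illustration, and nothing here runs against Athena.

```r
# Tables to check; in practice this list would come from the data catalog
tables <- c("ccao.pin_nonlivable", "iasworld.pardat")

# One freshness query per table, labelled with the table name
queries <- sprintf(
  "SELECT '%s' AS table_name, MAX(loaded_at) AS last_loaded FROM %s",
  tables, tables
)

# Combine into a single statement that returns one row per table
freshness_sql <- paste(queries, collapse = "\nUNION ALL\n")

cat(freshness_sql, "\n")
```

Because `loaded_at` is stored as a string in a sortable format, `MAX()` on the column should return the most recent load.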

etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-environment-ohare_noise.R

etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-transit.R


wrridgeway · 2024-12-23T18:46:49Z

etl/scripts-ccao-data-warehouse-us-east-1/ccao/ccao-condominium-pin_condo_char.R

+# At the end of 2024 valuations revisited some old condos and updated their
+# characteristics
+updates <- map(
+  file.path(
+    "s3://ccao-data-raw-us-east-1",
+    aws.s3::get_bucket_df(
+      AWS_S3_RAW_BUCKET,
+      prefix = "ccao/condominium/pin_condo_char/2025"
+    )$Key),
+  \(x) {
+    read_parquet(x) %>%
+      mutate(across(.cols = everything(), as.character))
+  }) %>%
+  bind_rows() %>%
+  rename_with(~gsub("\\.", "_", tolower(.x)), .cols = everything()) %>%
+  select("pin", starts_with("new")) %>%
+  mutate(
+    pin = gsub("-", "", pin),
+    across(starts_with("new"), as.numeric),
+    # Three units with 100 for unit sqft
+    new_unit_sf = ifelse(new_unit_sf == 100, 1000, new_unit_sf)
+  ) %>%
+  filter(!if_all(starts_with("new"), is.na))
+
+# Update parcels with new column values
+chars <- chars %>%
+  bind_rows() %>%
+  left_join(updates, by = "pin") %>%
+  mutate(
+    building_sf = coalesce(new_building_sf, building_sf),
+    unit_sf = coalesce(new_unit_sf, unit_sf),
+    bedrooms = coalesce(new_bedrooms, bedrooms),
+    full_baths = coalesce(new_full_baths, full_baths),
+    half_baths = coalesce(new_half_baths, half_baths)
+  ) %>%
+  select(!starts_with("new"))


Update condos with new chars.
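To illustrate the join-then-`coalesce` pattern used in the diff above, here is a small sketch on toy data. The PINs and values are made up, and only two of the characteristic columns are shown; it assumes `dplyr` is installed.

```r
library(dplyr)

# Toy stand-in for the existing condo characteristics
chars <- tibble(
  pin = c("01", "02", "03"),
  unit_sf = c(900, 1200, NA),
  bedrooms = c(2, 3, 1)
)

# Toy stand-in for the revised values; NA means "no update for this field"
updates <- tibble(
  pin = c("02", "03"),
  new_unit_sf = c(1250, 700),
  new_bedrooms = c(NA, 2)
)

# Prefer the new value when present, otherwise keep the existing one
updated <- chars %>%
  left_join(updates, by = "pin") %>%
  mutate(
    unit_sf = coalesce(new_unit_sf, unit_sf),
    bedrooms = coalesce(new_bedrooms, bedrooms)
  ) %>%
  select(!starts_with("new"))

print(updated)
```

PINs with no matching row in `updates` pass through unchanged, because the left join fills their `new_*` columns with `NA` and `coalesce()` falls back to the original value.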

dfsnow · 2024-12-24T19:13:45Z

.github/workflows/lint.yaml

+      - name: Disable renv
+        shell: bash
+        run: rm etl/.Rprofile


This is definitely a weird and ephemeral bug, but basically the R in superlinter will load renv due to the presence of the .Rprofile file in the working directory. The linter then fails because the renv environment doesn't have lintr in it. Removing the .Rprofile file loads the default superlinter R environment.

@wrridgeway You should add a comment to this step to explain why it's necessary/provide some context.

etl/scripts-ccao-data-raw-us-east-1/ccao/ccao-condominium-pin_condo_char.R

dfsnow · 2024-12-24T19:19:06Z

etl/scripts-ccao-data-warehouse-us-east-1/ccao/ccao-condominium_parking.R

@@ -65,6 +65,7 @@ nonlivable[["neg_pred"]] <- map(
 # Upload all nonlivable spaces to nonlivable table
 nonlivable %>%
  bind_rows() %>%
+  mutate(loaded_at = as.character(Sys.time())) %>%


question (blocking): We want all the `loaded_at` columns across our data to share the same format, precision, and type. Does this yield the same type as the `DATE_FORMAT` calls in SQL?

| query | result |
| --- | --- |
| `select loaded_at from ccao.pin_nonlivable limit 1` | 2024-12-17 16:15:22.613758 |
| `select loaded_at from iasworld.pardat limit 1` | 2024-12-21 07:46:36.530 |

There's slightly more sub-second precision in the columns generated by R. Is that a concern? It seemed a little odd to drop precision just to make the string lengths match when the values are still comparable:

| query | result |
| --- | --- |
| `select '2024-12-21 07:46:36.530' < '2024-12-17 16:15:22.613758'` | false |
| `select '2024-12-21 07:46:36.530' > '2024-12-17 16:15:22.613758'` | true |

Nope, that's totally fine. Thanks for checking.
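The same comparison can be sketched in R using the two values quoted above. The `UTC` time zone is an assumption made only so the parsed comparison is reproducible, and the `%OS3` format is shown as an option in case matching millisecond precision is ever wanted; the PR itself keeps `as.character(Sys.time())`.

```r
# Values quoted in this thread
r_stamp <- "2024-12-17 16:15:22.613758"  # written by the R scripts
sql_stamp <- "2024-12-21 07:46:36.530"   # written by the SQL DATE_FORMAT calls

# Plain string comparison orders the two stamps correctly
string_newer <- sql_stamp > r_stamp

# Parsing to POSIXct gives the same ordering (time zone assumed for the demo)
parsed_newer <- as.POSIXct(sql_stamp, tz = "UTC") >
  as.POSIXct(r_stamp, tz = "UTC")

# Optional: format the current time with three fractional digits
ms_stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%OS3")

print(c(string_newer = string_newer, parsed_newer = parsed_newer))
print(ms_stamp)
```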

dfsnow · 2024-12-24T19:27:29Z

etl/scripts-ccao-data-warehouse-us-east-1/ccao/ccao-legacy.R

@@ -99,6 +99,7 @@ cc_dli_senfrr <- map_dfr(files_cc_dli_senfrr$Key, \(f) {

 # Write the files to S3, partitioned by year
 cc_dli_senfrr %>%
+  mutate(loaded_at = as.character(Sys.time())) %>%


@jeancochrane My thinking here is that it would be nice and convenient to have a `loaded_at` column on every Athena source so that we can just run `SELECT MAX(loaded_at) FROM $TABLE` for everything in our catalog and know what hasn't been touched in a while.


Damonamajor added 30 commits October 9, 2024 21:30

Initial test

acb0154

Sort headings

0652b3a

New test

5191e58

local

c4807fb

surface width

6e2a088

More tests

ccc67fa

Query improvements

6a9c130

Make all distinct_pins

12b2ebf

Make traffic_width unique

a22e314

switch to daily_traffic

d835273

Try master

7bd08b8

Fix master?

ca76942

Another master test

8ba251d

Another master test

1cf60f2

Add config

d3ba259

switch to width

e34b8c8

Make year last column

4ae5554

Merge into master

d2a45fa

Try to remove 2014

aace136

Remove minor_collector

3f070de

Remove parcel.year = pin.year

b50e813

Remove minor again

ebbc52e

Remove Freeway

e8671f7

Start from begining

967114c

Separate freeway

ba0b0a1

Add principal

aa91cf8

Try with major

639468c

Add other

ed4ba5f

re-add major

0c41137

remove year

7cc2b22

wrridgeway added 5 commits December 19, 2024 22:37

Remove temp script

b6bd970

Typo

6377dcc

Undo mydec changes

f6fafdb

Undo spatial raw changes

bf894a7

Typo

2716bdb


wrridgeway marked this pull request as ready for review December 19, 2024 22:52

wrridgeway requested a review from a team as a code owner December 19, 2024 22:52

wrridgeway self-assigned this Dec 19, 2024

dfsnow changed the title ~~2024 modelling data update~~ 2024 model data update Dec 19, 2024

jeancochrane reviewed Dec 20, 2024


wrridgeway and others added 8 commits December 23, 2024 09:25

Update etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-envi…

2e2805a

…ronment-ohare_noise.R Co-authored-by: Jean Cochrane <[email protected]>

Update neighborhood shapefile url

dbaa887

Merge branch '585-add-a-loaded_at-column-to-all-sources' of https://g…

1661ece

…ithub.com/ccao-data/data-architecture into 585-add-a-loaded_at-column-to-all-sources

Update etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-tran…

61a4a4e

…sit.R Co-authored-by: Jean Cochrane <[email protected]>

Add new condo chars source

3c7fb11

Update local file path

c406783

Add condo char updates

8a15f61

Include all changes

d4f6241

wrridgeway commented Dec 23, 2024


dfsnow approved these changes Dec 24, 2024


wrridgeway and others added 2 commits December 24, 2024 21:13

Update etl/scripts-ccao-data-raw-us-east-1/ccao/ccao-condominium-pin_…

0149fb1

…condo_char.R Co-authored-by: Dan Snow <[email protected]>

Commenting

5b1cbcd

wrridgeway merged commit 2979e5a into master Dec 26, 2024
10 checks passed

wrridgeway deleted the 585-add-a-loaded_at-column-to-all-sources branch December 26, 2024 15:50
