2024 model data update #670

Merged on Dec 26, 2024 (177 commits)

@wrridgeway wrridgeway commented Dec 5, 2024

This PR unfortunately combines two issues: updating scripts where needed to refresh modeling data for 2024/2025, and adding a loaded_at column to the Athena tables created by scripts in the data-architecture/etl folder (as well as spatial.stadium).

As for the model data refresh, these scripts still need to be run again once data is available:

| Script | Progress | Date | Notes |
| --- | --- | --- | --- |
| ccao/ccao-condominium-pin_condo_char.R | Incomplete | | waiting on valuations |
| census/census-acs.R | Incomplete | 12/12/2024 | 2024 data not yet available |
| census/census-decennial.R | Incomplete | 12/12/2024 | 2024 data not yet available |
| census/census-dictionary.R | Incomplete | 12/12/2024 | 2024 data not yet available |
| export/export-geojson.R | Incomplete | 12/12/2024 | 2024 data not yet available |
| spatial/spatial-ccao-neighborhood.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-ccao-township.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-census.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-other.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-parcel.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-political.R | Incomplete | 12/12/2024 | Waiting on GIS |
| spatial/spatial-tax.R | Incomplete | 12/12/2024 | Waiting on GIS |

Tables with loaded_at column added:

ccao

  • cc_dli_senfrr
  • cc_pifdb_piexemptre_dise
  • cc_pifdb_piexemptre_ownr
  • cc_pifdb_piexemptre_sted
  • commercial_valuation
  • hie
  • land_nbhd_rate
  • land_site_rate
  • pin_condo_char

ccbor

  • appeals

census

  • acs1
  • acs5
  • decennial
  • table_dict
  • variable_dict

other

  • airport_noise
  • ari
  • dci
  • flood_first_street
  • great_schools_rating
  • ihs_index

sale

  • foreclosure
  • mydec
  • validated

spatial

  • bike_trail
  • board_of_review_district
  • building_footprint
  • cemetery
  • census
  • central_business_district
  • coastline
  • commissioner_district
  • community_area
  • community_college_district
  • congressional_district
  • coordinated_care
  • corner
  • county
  • enterprise_zone
  • fire_protection_district
  • flood_fema
  • geojson
  • golf_course
  • grocery_store
  • hospital
  • hydrology
  • industrial_corridor
  • industrial_growth_zone
  • judicial_district
  • library_district
  • major_road
  • midway_noise_monitor
  • municipality
  • neighborhood
  • ohare_noise_contour
  • ohare_noise_monitor
  • parcel
  • park
  • park_district
  • police_district
  • qualified_opportunity_zone
  • railroad
  • road
  • sanitation_district
  • school_district
  • school_location
  • secondary_road
  • special_service_area
  • stadium
  • stadium_raw
  • state_representative_district
  • state_senate_district
  • subdivision
  • tif_district
  • township
  • transit_dict
  • transit_route
  • transit_stop
  • walkability
  • ward
  • ward_chicago
  • ward_evanston

str_remove_all(feed_date, "-"), "/download"
),
paste0(
"https://files.mobilitydatabase.org/mdb-389/mdb-389-",
Member Author

transitfeeds is dead, so we need to use mobilitydatabase now

Comment on lines +194 to +198
rename(
walkability_rating = Walkabilit,
amenities_score = Amenities,
transitaccess = TransitAcc
) %>%
Member Author

This rename chunk started throwing an error for no apparent reason; moving the code up a few lines resolves it.

standardize_expand_geo() %>%
select(-contains("shape")) %>%
mutate(year = "2017") %>%
Member Author

This was producing two year columns, which broke Hive partitioning.

dplyr::group_walk(df, ~ {
  partitions_df <- purrr::map_dfr(
-   .y, replace_na, "__HIVE_DEFAULT_PARTITION__"
+   .y, tidyr::replace_na, "__HIVE_DEFAULT_PARTITION__"
Member Author

@wrridgeway wrridgeway Dec 19, 2024

I occasionally ran into R treating replace_na as an object(?); namespacing the call as tidyr::replace_na solved it.
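For context, a minimal sketch of the namespaced call on toy data, assuming dplyr, purrr, and tidyr are installed. The `keys` data frame is hypothetical and only stands in for the partition values that group_walk() hands to its callback. One plausible cause of the original error (a guess, not confirmed in this thread) is tidyr not being attached when the callback ran.

```r
# Toy illustration only: `keys` stands in for the one-row data frame of
# partition values (`.y`) that dplyr::group_walk() passes to its callback.
keys <- data.frame(year = "2024", township = NA_character_)

# Namespacing the function means tidyr does not have to be attached with
# library() for the lookup to succeed.
partitions_df <- purrr::map_dfr(
  keys, tidyr::replace_na, "__HIVE_DEFAULT_PARTITION__"
)

print(partitions_df)
```

The missing township value is replaced by the Hive default partition name, while the non-missing year is left untouched.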

Member Author

This file also needed to be linted.

@wrridgeway wrridgeway marked this pull request as ready for review December 19, 2024 22:52
@wrridgeway wrridgeway requested a review from a team as a code owner December 19, 2024 22:52
@wrridgeway wrridgeway self-assigned this Dec 19, 2024
@dfsnow dfsnow changed the title from "2024 modelling data update" to "2024 model data update" Dec 19, 2024
Contributor

@jeancochrane jeancochrane left a comment

Looks good, thanks for picking up this chore! A few nits and questions below, but nothing I'd consider blocking other than my question about disabling renv on CI. @dfsnow should probably take a look at this too, as someone who has more context on these scripts than I do.

Comment on lines +26 to +28
- name: Disable renv
  shell: bash
  run: rm etl/.Rprofile
Contributor

[Question, blocking] Can you link to an example workflow that failed prior to this change? I want to make sure I understand why this is the best path forward before we move ahead with it.

etl/scripts-ccao-data-raw-us-east-1/spatial/spatial-ccao.R (outdated thread, resolved)
land_nbhd_rate_2024
) %>%
relocate(land_rate_per_sqft, .after = last_col()) %>%
mutate(loaded_at = as.character(Sys.time())) %>%
Contributor

[Question, non-blocking] Do you know why we missed these files when we fixed line endings as part of the linting PR? I'm guessing it's just because this file didn't require any linting fixes, so we didn't notice the incorrect line endings?

@@ -99,6 +99,7 @@ cc_dli_senfrr <- map_dfr(files_cc_dli_senfrr$Key, \(f) {

# Write the files to S3, partitioned by year
cc_dli_senfrr %>%
mutate(loaded_at = as.character(Sys.time())) %>%
Contributor

[Thought, non-blocking] More a question for @dfsnow, but I wonder if we need loaded_at fields on these tables? They're one-off QC extracts that Mirella provides us with, so I don't expect we'll want freshness tests on them. But I'm open to them if we can think of a good reason why we might need to know when we loaded them.

Member

@jeancochrane My thinking here is that it would be convenient to have a loaded_at column on every Athena source so that we can just SELECT MAX(loaded_at) FROM $TABLE for everything in our catalog and know what hasn't been touched in a while.
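A rough sketch of what that catalog-wide check could look like, generating the SQL from R. The three table names come from the PR description, but the selection is arbitrary, and the `table_name` and `last_loaded` aliases are made up for illustration. This only builds the query text; how it would be submitted to Athena is left open.

```r
# Hypothetical subset of catalog tables; the names are taken from the PR
# description, but the selection here is arbitrary.
tables <- c("ccao.hie", "sale.foreclosure", "spatial.stadium")

# Build one freshness query per table, then stitch them together with
# UNION ALL so a single query returns one row per source.
queries <- sprintf(
  "SELECT '%s' AS table_name, MAX(loaded_at) AS last_loaded FROM %s",
  tables, tables
)
freshness_sql <- paste(queries, collapse = "\nUNION ALL\n")

cat(freshness_sql, "\n")
```

Because every source would share the same column name, the list of tables is the only thing that changes between catalogs.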

Contributor

@jeancochrane jeancochrane Dec 24, 2024

Right @dfsnow, I was just a bit confused by this one since it seems categorically different from our other sources (one-time ingest for a specific project, rather than something we pull regularly for modeling). But if you think it's easier for everything to follow the same pattern it's fine by me.

Comment on lines +209 to +244
# At the end of 2024 valuations revisited some old condos and updated their
# characteristics
updates <- map(
  file.path(
    "s3://ccao-data-raw-us-east-1",
    aws.s3::get_bucket_df(
      AWS_S3_RAW_BUCKET,
      prefix = "ccao/condominium/pin_condo_char/2025"
    )$Key),
  \(x) {
    read_parquet(x) %>%
      mutate(across(.cols = everything(), as.character))
  }) %>%
  bind_rows() %>%
  rename_with(~gsub("\\.", "_", tolower(.x)), .cols = everything()) %>%
  select("pin", starts_with("new")) %>%
  mutate(
    pin = gsub("-", "", pin),
    across(starts_with("new"), as.numeric),
    # Three units with 100 for unit sqft
    new_unit_sf = ifelse(new_unit_sf == 100, 1000, new_unit_sf)
  ) %>%
  filter(!if_all(starts_with("new"), is.na))

# Update parcels with new column values
chars <- chars %>%
  bind_rows() %>%
  left_join(updates, by = "pin") %>%
  mutate(
    building_sf = coalesce(new_building_sf, building_sf),
    unit_sf = coalesce(new_unit_sf, unit_sf),
    bedrooms = coalesce(new_bedrooms, bedrooms),
    full_baths = coalesce(new_full_baths, full_baths),
    half_baths = coalesce(new_half_baths, half_baths)
  ) %>%
  select(!starts_with("new"))
Member Author

Update condos with new chars.
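To make the join-then-coalesce pattern above concrete, here is a toy version with made-up PINs and values, assuming dplyr is installed. It is only a sketch of the technique, not the script itself, and it covers two of the five columns.

```r
library(dplyr)

# Toy data; PINs and values are invented for illustration.
chars <- tibble(
  pin = c("001", "002", "003"),
  unit_sf = c(900, 100, 1200),
  bedrooms = c(2, 1, 3)
)
updates <- tibble(
  pin = c("002", "003"),
  new_unit_sf = c(1000, NA),
  new_bedrooms = c(NA, 4)
)

updated <- chars %>%
  left_join(updates, by = "pin") %>%
  mutate(
    # coalesce() takes the first non-missing value, so a revised value wins
    # and the existing one is kept where no revision was supplied
    unit_sf = coalesce(new_unit_sf, unit_sf),
    bedrooms = coalesce(new_bedrooms, bedrooms)
  ) %>%
  select(!starts_with("new"))

print(updated)
```

PINs without a matching row in `updates` pick up NA in every `new_` column after the left join, so they pass through unchanged.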

Comment on lines +26 to +28
- name: Disable renv
  shell: bash
  run: rm etl/.Rprofile
Member

This is definitely a weird and ephemeral bug, but basically the R in superlinter will load renv due to the presence of the .Rprofile file in the working directory. The linter then fails because the renv environment doesn't have lintr in it. Removing the .Rprofile file loads the default superlinter R environment.

@wrridgeway You should add a comment to this step to explain why it's necessary/provide some context.

@@ -65,6 +65,7 @@ nonlivable[["neg_pred"]] <- map(
# Upload all nonlivable spaces to nonlivable table
nonlivable %>%
bind_rows() %>%
mutate(loaded_at = as.character(Sys.time())) %>%
Member

question (blocking): We want all the loaded_at columns across our data to share the same format, precision, and type. Does this yield the same type as the DATE_FORMAT calls in SQL?

Member Author

@wrridgeway wrridgeway Dec 25, 2024

| query | result |
| --- | --- |
| select loaded_at from ccao.pin_nonlivable limit 1 | 2024-12-17 16:15:22.613758 |
| select loaded_at from iasworld.pardat limit 1 | 2024-12-21 07:46:36.530 |

The columns generated by R have slightly more sub-second precision. Is that a concern? It seemed a little odd to drop precision just to make the string lengths match when the values are still comparable:

| query | result |
| --- | --- |
| select '2024-12-21 07:46:36.530' < '2024-12-17 16:15:22.613758' | false |
| select '2024-12-21 07:46:36.530' > '2024-12-17 16:15:22.613758' | true |

Member

Nope, that's totally fine. Thanks for checking.
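For anyone revisiting this later, a small base R sketch of the precision question. The timestamp is fixed to one of the values quoted above so the run is reproducible, and the `%OS3` and `%OS6` format strings are only an illustration of how the precision could be controlled, not something the scripts in this PR do.

```r
# Fixed instant so the run is reproducible (taken from the query results
# quoted in this thread).
ts <- as.POSIXct("2024-12-17 16:15:22.613758", tz = "UTC")

# "%OS3" and "%OS6" format seconds with 3 and 6 fractional digits.
millis <- format(ts, "%Y-%m-%d %H:%M:%OS3", tz = "UTC")
micros <- format(ts, "%Y-%m-%d %H:%M:%OS6", tz = "UTC")
print(c(millis, micros))

# With fixed-width, zero-padded fields, string order follows time order
# even when the two sides carry different sub-second precision.
later <- "2024-12-21 07:46:36.530"
earlier <- "2024-12-17 16:15:22.613758"
print(later > earlier)
```

If matching precision were ever wanted, formatting with `%OS3` would give millisecond strings of the same length as the SQL-side values.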

@wrridgeway wrridgeway merged commit 2979e5a into master Dec 26, 2024
10 checks passed
@wrridgeway wrridgeway deleted the 585-add-a-loaded_at-column-to-all-sources branch December 26, 2024 15:50
Successfully merging this pull request may close these issues.

Add a loaded_at column to all sources
4 participants