2024 model data update #670
Conversation
```r
    str_remove_all(feed_date, "-"), "/download"
  ),
  paste0(
    "https://files.mobilitydatabase.org/mdb-389/mdb-389-",
```
transitfeeds is dead, so we need to use the Mobility Database now.
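A rough sketch of the before/after URL construction, with the provider placeholders and the tail of the new URL hypothetical since the diff truncates them:

```r
library(stringr)

feed_date <- "2024-05-01"

# Old (dead): date-stamped transitfeeds download endpoint
old_url <- paste0(
  "https://transitfeeds.com/p/<provider>/<feed>/", # placeholders, not real
  str_remove_all(feed_date, "-"), "/download"
)

# New: Mobility Database file mirror, keyed by the feed's MDB ID (mdb-389).
# Whatever follows the date stamp is truncated in the diff above, so this
# sketch stops where the known prefix ends.
new_url <- paste0(
  "https://files.mobilitydatabase.org/mdb-389/mdb-389-",
  str_remove_all(feed_date, "-")
)
```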
```r
rename(
  walkability_rating = Walkabilit,
  amenities_score = Amenities,
  transitaccess = TransitAcc
) %>%
```
This `rename()` chunk started throwing an error for no apparent reason; shifting the code up a few lines resolves it.
```r
standardize_expand_geo() %>%
  select(-contains("shape")) %>%
  mutate(year = "2017") %>%
```
This was producing two `year` columns, which broke the Hive partitioning.
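One hypothetical way a second `year` column can sneak in (a sketch, not the actual script): joining two tables that each already carry `year` yields suffixed duplicates that a hive-partitioned write can no longer use as a single partition key:

```r
library(dplyr)

# Hypothetical tables; both carry a year column before the join
geo <- tibble(pin = "12345", year = "2017")
attrs <- tibble(pin = "12345", year = "2017")

# The join produces year.x / year.y, so a later partitioned write
# (or Athena reading the result) no longer sees one `year` key
left_join(geo, attrs, by = "pin") %>% names()
#> [1] "pin"    "year.x" "year.y"
```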
etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-environment-ohare_noise.R (outdated diff, comment thread resolved)
```diff
  dplyr::group_walk(df, ~ {
    partitions_df <- purrr::map_dfr(
-     .y, replace_na, "__HIVE_DEFAULT_PARTITION__"
+     .y, tidyr::replace_na, "__HIVE_DEFAULT_PARTITION__"
```
I was running into occasional issues with R treating `replace_na` as an object(?), which namespacing it as `tidyr::replace_na` solved.
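For illustration, a minimal sketch of what the namespaced call does with a group key like the one `group_walk()` passes as `.y` (column name hypothetical):

```r
library(tibble)

# .y inside group_walk() is a one-row tibble of the group's key values;
# an NA key would otherwise produce a malformed S3 partition path
key <- tibble(year = NA_character_)

# Map tidyr::replace_na over each key column, swapping NA for the Hive
# default partition name. The tidyr:: prefix guarantees the function is
# found even if some other object named replace_na masks it, because a
# function passed as a value (unlike one that is called directly) does
# not skip non-function bindings during symbol lookup.
purrr::map_dfr(key, tidyr::replace_na, "__HIVE_DEFAULT_PARTITION__")
#> # A tibble: 1 x 1
#>   year
#>   <chr>
#> 1 __HIVE_DEFAULT_PARTITION__
```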
This file also needed to be linted.
Looks good, thanks for picking up this chore! A few nits and questions below, but nothing I'd consider blocking other than my question about disabling renv on CI. @dfsnow should probably take a look at this too, as someone who has more context on these scripts than I do.
```yaml
- name: Disable renv
  shell: bash
  run: rm etl/.Rprofile
```
[Question, blocking] Can you link to an example workflow that failed prior to this change? I want to make sure I understand why this is the best path forward before we move ahead with it.
```r
  land_nbhd_rate_2024
) %>%
  relocate(land_rate_per_sqft, .after = last_col()) %>%
  mutate(loaded_at = as.character(Sys.time())) %>%
```
[Question, non-blocking] Do you know why we missed these files when we fixed line endings as part of the linting PR? I'm guessing it's just because this file didn't require any linting fixes, so we didn't notice the incorrect line endings?
```diff
@@ -99,6 +99,7 @@ cc_dli_senfrr <- map_dfr(files_cc_dli_senfrr$Key, \(f) {

  # Write the files to S3, partitioned by year
  cc_dli_senfrr %>%
+   mutate(loaded_at = as.character(Sys.time())) %>%
```
[Thought, non-blocking] More a question for @dfsnow, but I wonder if we need `loaded_at` fields on these tables? They're one-off QC extracts that Mirella provides us with, so I don't expect we'll want freshness tests on them. But I'm open to them if we can think of a good reason why we might need to know when we loaded them.
@jeancochrane My thinking here is that it would be nice and convenient to have a `loaded_at` column on every Athena source so that we can just `SELECT MAX(loaded_at) FROM $TABLE` for everything in our catalog and know what hasn't been touched in a while.
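For instance, a hypothetical version of that catalog-wide freshness check, assuming the `noctua` Athena backend used elsewhere in these scripts:

```r
library(DBI)

# Connect to Athena (noctua picks up AWS credentials from the environment)
con <- dbConnect(noctua::athena())

# When was this source last loaded? Repeating this per table in the
# catalog surfaces anything that hasn't been touched in a while.
dbGetQuery(
  con,
  "SELECT MAX(loaded_at) AS last_loaded FROM ccao.pin_nonlivable"
)
```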
Right @dfsnow, I was just a bit confused by this one since it seems categorically different from our other sources (one-time ingest for a specific project, rather than something we pull regularly for modeling). But if you think it's easier for everything to follow the same pattern it's fine by me.
etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-environment-ohare_noise.R (outdated diff, comment thread resolved)
etl/scripts-ccao-data-warehouse-us-east-1/spatial/spatial-transit.R (outdated diff, comment thread resolved)
…ronment-ohare_noise.R Co-authored-by: Jean Cochrane <[email protected]>
…ithub.com/ccao-data/data-architecture into 585-add-a-loaded_at-column-to-all-sources
…sit.R Co-authored-by: Jean Cochrane <[email protected]>
```r
# At the end of 2024, valuations revisited some old condos and updated
# their characteristics
updates <- map(
  file.path(
    "s3://ccao-data-raw-us-east-1",
    aws.s3::get_bucket_df(
      AWS_S3_RAW_BUCKET,
      prefix = "ccao/condominium/pin_condo_char/2025"
    )$Key
  ),
  \(x) {
    read_parquet(x) %>%
      mutate(across(.cols = everything(), as.character))
  }
) %>%
  bind_rows() %>%
  rename_with(~ gsub("\\.", "_", tolower(.x)), .cols = everything()) %>%
  select("pin", starts_with("new")) %>%
  mutate(
    pin = gsub("-", "", pin),
    across(starts_with("new"), as.numeric),
    # Three units with 100 for unit sqft
    new_unit_sf = ifelse(new_unit_sf == 100, 1000, new_unit_sf)
  ) %>%
  filter(!if_all(starts_with("new"), is.na))

# Update parcels with new column values
chars <- chars %>%
  bind_rows() %>%
  left_join(updates, by = "pin") %>%
  mutate(
    building_sf = coalesce(new_building_sf, building_sf),
    unit_sf = coalesce(new_unit_sf, unit_sf),
    bedrooms = coalesce(new_bedrooms, bedrooms),
    full_baths = coalesce(new_full_baths, full_baths),
    half_baths = coalesce(new_half_baths, half_baths)
  ) %>%
  select(!starts_with("new"))
```
Update condos with new chars.
```yaml
- name: Disable renv
  shell: bash
  run: rm etl/.Rprofile
```
This is definitely a weird and ephemeral bug, but basically the R in superlinter will load `renv` due to the presence of the `.Rprofile` file in the working directory. The linter then fails because the `renv` environment doesn't have `lintr` in it. Removing the `.Rprofile` file loads the default superlinter R environment.
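For reference, a renv-managed `.Rprofile` typically contains just the standard bootstrap line (assumed here, it isn't shown in the diff), which is why merely starting R in `etl/` flips the session into the project library:

```r
# Typical contents of etl/.Rprofile under renv: every R session started in
# this directory sources the activation script, which swaps .libPaths() to
# the renv project library and hides superlinter's globally installed lintr
source("renv/activate.R")
```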
@wrridgeway You should add a comment to this step to explain why it's necessary/provide some context.
etl/scripts-ccao-data-raw-us-east-1/ccao/ccao-condominium-pin_condo_char.R (outdated diff, comment thread resolved)
```diff
@@ -65,6 +65,7 @@ nonlivable[["neg_pred"]] <- map(
  # Upload all nonlivable spaces to nonlivable table
  nonlivable %>%
    bind_rows() %>%
+   mutate(loaded_at = as.character(Sys.time())) %>%
```
question (blocking): We want all the `loaded_at` columns across our data to share the same format, precision, and type. Does this yield the same type as the `DATE_FORMAT` calls in SQL?
| query | result |
| --- | --- |
| `select loaded_at from ccao.pin_nonlivable limit 1` | 2024-12-17 16:15:22.613758 |
| `select loaded_at from iasworld.pardat limit 1` | 2024-12-21 07:46:36.530 |
There's slightly more sub-second precision for the columns generated by R; is that a concern? It seemed a little weird to drop precision just to make the string lengths match when the strings are still comparable:

```sql
select '2024-12-21 07:46:36.530' < '2024-12-17 16:15:22.613758'
-- false
select '2024-12-21 07:46:36.530' > '2024-12-17 16:15:22.613758'
-- true
```
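(For reference, a minimal sketch of where the extra precision comes from, assuming R >= 4.3, where `as.character()` on datetimes keeps full sub-second precision:)

```r
# as.character() on POSIXct retains microsecond precision in recent R,
# while the SQL DATE_FORMAT calls emit milliseconds; both sort correctly
# as strings because the prefix format is identical
as.character(Sys.time())
#> e.g. "2024-12-17 16:15:22.613758"
format(Sys.time(), "%Y-%m-%d %H:%M:%OS3") # millisecond-style, like the SQL
#> e.g. "2024-12-21 07:46:36.530"
```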
Nope, that's totally fine. Thanks for checking.
…condo_char.R Co-authored-by: Dan Snow <[email protected]>
This PR unfortunately combines two issues: updating scripts wherever needed in order to refresh the 2024/2025 model data, and adding a `loaded_at` column to Athena tables created through scripts in the `data-architecture/etl` folder (as well as `spatial.stadium`).

As for the model data refresh, these scripts still need to be run again once data is available:
Tables with `loaded_at` column added:

- `ccao`
- `ccbor`
- `census`
- `other`
- `sale`
- `spatial`