-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update read_watersurfaces for version 2024 & read_ws_hab #192
Conversation
- drop hyla_code - add wfd_type_alternative - add wfd_type_alt_name - include water_level_management - add CFe in list water types - use na_if for empty cells ("")
- that is with watersurfaces 2024 - incl. documentation
- watersurfaces 2024 - habitatmap_stdized 2023_v1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, thanks so much @cecileherr ! You analyzed the new data source version really well before upgrading the function, resulting in ever more robust code 🚀
Added some minor comments already in the recent commits of branch watersurf_hab_v6
of n2khab-preprocessing, but I presume you wait for current n2khab branch to become stable before starting that PR. Still, the latter is best merged first, then this one, as the n2khab one also uses the resulting watersurfaces_hab
. (At least the n2khab release – merging to main
– should take place after the preprocessing repo is all done and the new v6 is available at Zenodo.)
What is the best (quickest/most readable) way to handle the fact that there are different structures/different fields/different errors to correct in the different versions? Should we use
if then
structures or use optional clauses (such asmatches
,any_of
,across
+where
, ...)?
- I agree that generalizations to cope with version-specific columns provide the intended result, and are more readable, easier to maintain and more robust (future-proof). So that's really great 👌 (and probably you / we already did such things in the past ?).
- Also IMHO it's OK to mix both approaches where that feels more suitable. What feels most logical and efficient will often be the easiest to understand and maintain.
- One could say that, in the generalized approach, the code is less explicit about how the different data source versions differ. But documenting this is not the aim of the n2khab package, rather the package seeks to align things.
- Good idea to make it an issue in order to apply this throughout the package! Please!
- The main point of attention is to still get essentially the same results for the earlier versions (in the sense of 'no columns lost', 'no errors introduced', 'column names unchanged' etc: ultimately we would implement unit tests to automate such checks), but I didn't notice things that would lead to that, while reading the code.
- As a sidenote, I think we should not be reluctant to implement smaller general improvements of the returned object for all data source versions when it's deemed better at some point, which might e.g. be the use of a factor where a character had been used before etc. True reproducibility should involve freezing the n2khab version anyway.
there is a wrong
WVLC
code (“d”) in de Kleiputten van Heist (Palingpot): not corrected since it is the raw source
- Well, for nasty bugs like this we could still mitigate the drawback by using the same approach as
fix_geom
, e.g.fix_extra
with defaultFALSE
, to still honour the default of just returning the 'raw contents' (even though we do always streamline column names and classes to get some basic uniformity between R objects).TRUE
would solve it, which then also transfers to processed data sources where this setting is used. Thefix_extra
argument could be loosely defined as solving some nasty bugs. - What do you think?
General notes:
- thanks for replacing the outdated
mutate_at()
andmutate_if()
verbs, usingacross()
! - as a consequence there are quite some consecutive
mutate() %>% mutate() %>% ...
statements in the code. These are best replaced by a singlemutate()
statement; I believe (?) theacross()
cases are no exception to this. Did you try?
Further, R CMD check
leaves following Note (here):
Undefined global functions or variables:
area_name connectivity connectivity_name depth_class name polygon_id
wfd_code wfd_type wfd_type_certain wfd_type_name where
where()
needs to be imported.- For other names, this is typical for
select()
statements and the like:R CMD check
neglects tidyselect-specific syntax rules, so it concludes that e.g.area_name
is a separate object that should have been defined before. Before, this was solved forR CMD check
by instead passingx
as.data$x
for tidy-selectionor as.data[["x"]]
for data-maskingin tidy evaluation (still supported, but deprecated for tidy selection; see #179). (EDITED; additions in italic)- currently it is advised to just use
"x"
orc("x", "y")
orall_of(<character object>)
in tidy-selection expressions since {tidyselect} also supports variable names as strings instead of as names. (Advice in changelog.) So for new code we better already implement that. (EDITED; additions in italic) - however several verbs like
mutate()
andfilter()
refer to the vector to work on (not the column name) and this is referred to as data-masking.This means that something likePronouns are still supported in data masking, see response by @cecileherr below and see e.g. https://rlang.r-lib.org/reference/topic-data-mask.html and https://dplyr.tidyverse.org/articles/programming.html#data-masking. (EDITED; additions in italic)... %>% mutate(wfd_type_alt_name = "wfd_type_alternative" %>% ...)
would not work in replacing.data
, and here you should useall_of("wfd_type_alternative")
instead IIUC. - I've not yet tested the above; this is just my current understanding.
- currently it is advised to just use
Minor note:
- just for future commits: between the commit message title and the commit message body, a whiteline is needed. Currently the message body (with item lists) is part of the title, resulting in long titles with wrapped items.
- replace the deprecated .data$xxx by "xxx" in all rename, select, arrange
- replace the deprecated .data$xxx by "xxx" in all (rename, select, arrange,) relocate
Still pending: let's discuss this one live. I am a bit reluctant because of the added complexity.
Done
Corrected with the non deprecated version, that is passing
Note that If you prefer to correct all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! 👌
I agree wrt your choices for revising .data
:
- you were right indeed that
.data
is only deprecated in tidy selection, not for data masking. It is well documented in several places indeed; see the edits in my previous post. - indeed, it's best to be consistent at the function level, and revise existing
.data
everywhere in the functions where you already use the new syntax for newly added code.
Also across()
uses tidyselect expressions, even though it's used inside a data-masking function (a feature also seen in pick()
). Since there were still some .data
expressions inside across()
, I've updated this in PR #193.
Further, I suggest to use any_of("...")
instead of matches("^...$")
, for simplicity. It's included in PR #193. Probably the reason you use matches()
(*) is to allow that a variable may be absent; any_of()
does exactly that. any_of()
and all_of()
are tidyselect helpers that only accept a character or numeric vector.
(*) EDIT: I introduced the first appearances of matches()
, in f640d57.
Still pending: let's discuss this one live
OK!
Apart from this, can you also run styler::style_pkg()
to update the code style? I saw some inconsistencies in parenthesis placement. The package tries to adhere to the tidyverse style guide (quick intro wrt pipelines) and that is what {styler} will (try to) implement.
read_watersurfaces(): updates to PR #192
- for version 2024: polygon_id "d" => "WVLKNO0072"
Conclusion: add the correction in the code itself without extra argument. Done in 225316b |
Adding some reasoning for this conclusion:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work @cecileherr 👌
Update read_watersufaces
A few preliminary checks in preprocessing>miscellaneous>watersurfaces.Rmd :
watersurfaces.zip
Here are the most important remarks/needed corrections for this new version 2024:
WVLC
code (“d”) in de Kleiputten van Heist (Palingpot): not corrected since it is the raw sourcewfd_type_alternative
: added (and made optional since it is not present in all versions)hyla_code
: made optional (since it is not present in all versions)Nice to try in the future (maybe we could add this as issue?)
What is the best (quickest/most readable) way to handle the fact that there are different structures/different fields/different errors to correct in the different versions?
Should we use
if then
structures or use optional clauses (such asmatches
,any_of
,across
+where
, ...)?Update read_watersufaces_hab to version 6
Nothing special to report, v6 has the exact same structure as v5