Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update read_watersurfaces for version 2024 & read_ws_hab #192

Merged
merged 18 commits into from
Jan 8, 2025

Conversation

cecileherr
Copy link
Collaborator

@cecileherr cecileherr commented Dec 11, 2024

Update read_watersufaces

A few preliminary checks in preprocessing>miscellaneous>watersurfaces.Rmd :
watersurfaces.zip

Here are the most important remarks/needed corrections for this new version 2024:

  • there is a wrong WVLC code (“d”) in de Kleiputten van Heist (Palingpot): not corrected since it is the raw source
  • new variable wfd_type_alternative: added (and made optional since it is not present in all versions)
  • variable dropped hyla_code: made optional (since it is not present in all versions)
  • all the empty cells are imported as "" (empty string) instead of NA: converted to NA

Nice to try in the future (maybe we could add this as issue?)
What is the best (quickest/most readable) way to handle the fact that there are different structures/different fields/different errors to correct in the different versions?
Should we use if then structures or use optional clauses (such as matches, any_of, across + where, ...)?

Update read_watersufaces_hab to version 6

Nothing special to report, v6 has the exact same structure as v5

- drop hyla_code
- add wfd_type_alternative
- add wfd_type_alt_name
- include water_level_management
- add CFe in list water types
- use na_if for empty cells ("")
@cecileherr cecileherr requested a review from florisvdh December 11, 2024 15:19
- that is with watersurfaces 2024
- incl. documentation
@cecileherr cecileherr changed the title Update read_watersurfaces for version 2024 Update read_watersurfaces for version 2024 & read_ws_hab Dec 12, 2024
florisvdh referenced this pull request in inbo/n2khab-preprocessing Dec 13, 2024
- watersurfaces 2024
- habitatmap_stdized 2023_v1
Copy link
Member

@florisvdh florisvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks so much @cecileherr ! You analyzed the new data source version really well before upgrading the function, resulting in ever more robust code 🚀

Added some minor comments already in the recent commits of branch watersurf_hab_v6 of n2khab-preprocessing, but I presume you wait for current n2khab branch to become stable before starting that PR. Still, the latter is best merged first, then this one, as the n2khab one also uses the resulting watersurfaces_hab. (At least the n2khab release – merging to main – should take place after the preprocessing repo is all done and the new v6 is available at Zenodo.)

What is the best (quickest/most readable) way to handle the fact that there are different structures/different fields/different errors to correct in the different versions? Should we use if then structures or use optional clauses (such as matches, any_of, across + where, ...)?

  • I agree that generalizations to cope with version-specific columns provide the intended result, and are more readable, easier to maintain and more robust (future-proof). So that's really great 👌 (and probably you / we already did such things in the past ?).
  • Also IMHO it's OK to mix both approaches where that feels more suitable. What feels most logical and efficient will often be the easiest to understand and maintain.
  • One could say that, in the generalized approach, the code is less explicit about how the different data source versions differ. But documenting this is not the aim of the n2khab package, rather the package seeks to align things.
  • Good idea to make it an issue in order to apply this throughout the package! Please!
  • The main point of attention is to still get essentially the same results for the earlier versions (in the sense of 'no columns lost', 'no errors introduced', 'column names unchanged' etc: ultimately we would implement unit tests to automate such checks), but I didn't notice things that would lead to that, while reading the code.
  • As a sidenote, I think we should not be reluctant to implement smaller general improvements of the returned object for all data source versions when it's deemed better at some point, which might e.g. be the use of a factor where a character had been used before etc. True reproducibility should involve freezing the n2khab version anyway.

there is a wrong WVLC code (“d”) in de Kleiputten van Heist (Palingpot): not corrected since it is the raw source

  • Well, for nasty bugs like this we could still mitigate the drawback by using the same approach as fix_geom, e.g. fix_extra with default FALSE, to still honour the default of just returning the 'raw contents' (even though we do always streamline column names and classes to get some basic uniformity between R objects). TRUE would solve it, which then also transfers to processed data sources where this setting is used. The fix_extra argument could be loosely defined as solving some nasty bugs.
  • What do you think?

General notes:

  • thanks for replacing the outdated mutate_at() and mutate_if() verbs, using across()!
  • as a consequence there are quite some consecutive mutate() %>% mutate() %>% ... statements in the code. These are best replaced by a single mutate() statement; I believe (?) the across() cases are no exception to this. Did you try?

Further, R CMD check leaves following Note (here):

Undefined global functions or variables:
  area_name connectivity connectivity_name depth_class name polygon_id
  wfd_code wfd_type wfd_type_certain wfd_type_name where
  • where() needs to be imported.
  • For other names, this is typical for select() statements and the like: R CMD check neglects tidyselect-specific syntax rules, so it concludes that e.g. area_name is a separate object that should have been defined before. Before, this was solved for R CMD check by instead passing x as .data$x for tidy-selection or as .data[["x"]] for data-masking in tidy evaluation (still supported, but deprecated for tidy selection; see #179). (EDITED; additions in italic)
    • currently it is advised to just use "x" or c("x", "y") or all_of(<character object>) in tidy-selection expressions since {tidyselect} also supports variable names as strings instead of as names. (Advice in changelog.) So for new code we better already implement that. (EDITED; additions in italic)
    • however several verbs like mutate() and filter() refer to the vector to work on (not the column name) and this is referred to as data-masking. This means that something like ... %>% mutate(wfd_type_alt_name = "wfd_type_alternative" %>% ...) would not work in replacing .data, and here you should use all_of("wfd_type_alternative") instead IIUC. Pronouns are still supported in data masking, see response by @cecileherr below and see e.g. https://rlang.r-lib.org/reference/topic-data-mask.html and https://dplyr.tidyverse.org/articles/programming.html#data-masking. (EDITED; additions in italic)
    • I've not yet tested the above; this is just my current understanding.

Minor note:

  • just for future commits: between the commit message title and the commit message body, a whiteline is needed. Currently the message body (with item lists) is part of the title, resulting in long titles with wrapped items.

R/read_habitatdata.R Outdated Show resolved Hide resolved
R/read_habitatdata.R Outdated Show resolved Hide resolved
@cecileherr
Copy link
Collaborator Author

there is a wrong WVLC code (“d”) in de Kleiputten van Heist (Palingpot): not corrected since it is the raw source

  • Well, for nasty bugs like this we could still mitigate the drawback by using the same approach as fix_geom, e.g. fix_extra with default FALSE, to still honour the default of just returning the 'raw contents' (even though we do always streamline column names and classes to get some basic uniformity between R objects). TRUE would solve it, which then also transfers to processed data sources where this setting is used. The fix_extra argument could be loosely defined as solving some nasty bugs.
  • What do you think?

Still pending: let's discuss this one live. I am a bit reluctant because of the added complexity.
(since this also have consequences for n2khab_preprocessing, I will not create a pull request there yet)


  • as a consequence there are quite some consecutive mutate() %>% mutate() %>% ... statements in the code. These are best replaced by a single mutate() statement; I believe (?) the across() cases are no exception to this. Did you try?

Done


Undefined global functions or variables:
area_name connectivity connectivity_name depth_class name polygon_id
wfd_code wfd_type wfd_type_certain wfd_type_name where

Corrected with the non deprecated version, that is passing "x" instead of .data$x for the tidy-selection cases.
I made three separated commits:

  • for the new code in read_watersurfaces (which gave the note in R CMD check)
  • for the rest in read_watersurfaces (commit 0756449), that is all existing occurences of arrange, select, rename , pivot, relocate in read_watersurfaces (because I do not like the idea of mixing old and new syntax within the same function).
  • for the rest in read_watersurfaces_hab (commit 434a904)

Note that .data is still present in the mutate, filter... statements. If I understand correctly (???) .data can still be used for data-masking (or at least it was the case in 2023 r-lib/tidyselect#169 and using it does not trigger warnings)

If you prefer to correct all .data occurences in the package at once, (issue #179 ) you can revert the last 2 commits (without creating problems in R CMD check)


Copy link
Member

@florisvdh florisvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👌

I agree wrt your choices for revising .data:

  • you were right indeed that .data is only deprecated in tidy selection, not for data masking. It is well documented in several places indeed; see the edits in my previous post.
  • indeed, it's best to be consistent at the function level, and revise existing .data everywhere in the functions where you already use the new syntax for newly added code.

Also across() uses tidyselect expressions, even though it's used inside a data-masking function (a feature also seen in pick()). Since there were still some .data expressions inside across(), I've updated this in PR #193.

Further, I suggest to use any_of("...") instead of matches("^...$"), for simplicity. It's included in PR #193. Probably the reason you use matches() (*) is to allow that a variable may be absent; any_of() does exactly that. any_of() and all_of() are tidyselect helpers that only accept a character or numeric vector.

(*) EDIT: I introduced the first appearances of matches(), in f640d57.

Still pending: let's discuss this one live

OK!

Apart from this, can you also run styler::style_pkg() to update the code style? I saw some inconsistencies in parenthesis placement. The package tries to adhere to the tidyverse style guide (quick intro wrt pipelines) and that is what {styler} will (try to) implement.

read_watersurfaces(): updates to PR #192
R/read_habitatdata.R Outdated Show resolved Hide resolved
@cecileherr
Copy link
Collaborator Author

there is a wrong WVLC code (“d”) in de Kleiputten van Heist (Palingpot): not corrected since it is the raw source

  • Well, for nasty bugs like this we could still mitigate the drawback by using the same approach as fix_geom, e.g. fix_extra with default FALSE, to still honour the default of just returning the 'raw contents' (even though we do always streamline column names and classes to get some basic uniformity between R objects). TRUE would solve it, which then also transfers to processed data sources where this setting is used. The fix_extra argument could be loosely defined as solving some nasty bugs.
  • What do you think?

Still pending: let's discuss this one live. I am a bit reluctant because of the added complexity. (since this also have consequences for n2khab_preprocessing, I will not create a pull request there yet)

Conclusion: add the correction in the code itself without extra argument. Done in 225316b

@cecileherr
Copy link
Collaborator Author

  • last .data replaced (fe00503)
  • code style cleaned up with styler (bd784a5)
  • false polygon_id corrected (225316b)

@cecileherr cecileherr requested a review from florisvdh January 8, 2025 09:49
@florisvdh
Copy link
Member

florisvdh commented Jan 8, 2025

Conclusion: add the correction in the code itself without extra argument. Done in 225316b

Adding some reasoning for this conclusion:

  • not introducing a 'fix_extra' argument: to avoid confusion & dilemma for users
  • we make this (exceptional) record-specific edit only because it is about the coherence (structure) of the data set – it concerns a lost record ID. The n2khab package does however not intend to fix actual data attributes. Errors in the raw data source must by default be fixed through releasing an updated version of the raw data source itself.

Copy link
Member

@florisvdh florisvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work @cecileherr 👌

@florisvdh florisvdh merged commit 0069463 into dev_nextrelease Jan 8, 2025
3 checks passed
@florisvdh florisvdh deleted the update_read_ws_en_ws_hab_2024 branch January 8, 2025 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants