Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ship released schemas with hubUtils #194

Open
wants to merge 27 commits into
base: main
Choose a base branch
from
Open

Conversation

zkamvar
Copy link
Member

@zkamvar zkamvar commented Nov 20, 2024

This PR enables offline schema validation by shipping the schemas directly with hubUtils.

  1. I've added a script that will download all of the released schemas and update them in the repository in data-raw/schemas.R. Schemas will live in inst/schemas/ and there is a configuration file called inst/schemas/update.json that records the source of the installed version of the schemas (see next section).
  2. I've added a catch to get_schema_valid_versions() and get_schema() that will check if the requested branch is "main" and then return the locally shipped schemas if it is, otherwise they will fall back to checking from GitHub
  3. In the case a user is working from an old version of hubUtils and the requested schema is not available for get_schema() (e.g. they attempt to validate with a newer hub), the relevant schema will be downloaded from GitHub.

This will fix #193

New Development Process

As I mentioned in point 1, data-raw/schemas.R will synchronize the schemas in this repo with the schemas repo. Based on #194 (comment), I've updated the development process so that we will no longer have to go through something like hubverse-org/hubAdmin#71.

Details are found in the CONTRIBUTING document, but in short, the steps for developing against a new schema version are:

One time step (ever)

Add the script as a git pre-push hook:

library(devtools)
use_git_hook("pre-push", proj_path("data-raw/schemas.R"))

With the git push hook installed, it will run every time you push to keep your schemas are up-to-date with the branch defined in inst/schemas/update.json.

This will live locally for this git repository and you can remove it if you do not like it by using unlink(devtools::proj_path(".git/hooks/pre-push"))

Once per new schema version

When you start work on a new dev version of the schema, you set the default branch. You can do this by either modifying inst/schemas/update.json OR you can use the HUBUTILS_DEV_BRANCH environment variable (which will update that file for you). Once the branch is set in inst/schemas/update.json, the script will always check for updates on that branch.

Sys.setenv("HUBUTILS_DEV_BRANCH" = "br-v4.0.1")
source("data-raw/schemas.R")

After new schema release

After the new schema is released, before you can release this package, you must ensure the default branch is set back to "main", which is done in the same way you would set the
development branch.

Sys.setenv("HUBUTILS_DEV_BRANCH" = "main")
source("data-raw/schemas.R")

Actual changes

Note that the 16k changes are mostly due to the schemas. Here's the actual list of content changes:

$ git diff --stat main..HEAD -- R tests data-raw .github inst/schemas/update.json
 .github/CONTRIBUTING.md                   | 226 +++++++++++++++++++++++++++++++++++++++++++
 R/read_config.R                           |   8 +-
 R/utils-get_hub.R                         |   4 +-
 R/utils-model_out_tbl.R                   |  43 +++++++++
 R/utils-schema.R                          |  41 ++++++++
 R/utils-task_ids.R                        |  17 ++++
 data-raw/schemas.R                        | 228 ++++++++++++++++++++++++++++++++++++++++++++
 inst/schemas/update.json                  |   5 +
 tests/testthat/_snaps/read_config.md      |   4 +-
 tests/testthat/test-read_config.R         |   6 +-
 tests/testthat/test-utils-get_hub.R       |   6 +-
 tests/testthat/test-utils-model_out_tbl.R |  57 +++++++++++
 tests/testthat/test-utils-schema.R        |  29 ++++++
 tests/testthat/test-utils-task_ids.R      |  19 ++++
 14 files changed, 678 insertions(+), 15 deletions(-)

This will fall back to github if versions are not found.
Copy link

github-actions bot commented Nov 20, 2024

@annakrystalli
Copy link
Member

annakrystalli commented Dec 13, 2024

sorry I haven't got round to this! It's been on my to do list and promise to have a look on Monday. I have a suggestion on how this might work but probably won't get time to get to it today.

@zkamvar
Copy link
Member Author

zkamvar commented Dec 13, 2024

sorry I haven't got round to this! It's been on my to do list and promise to have a look on Monday. I have a suggestion on how this might work but probably won't get time to get to it today.

No worries! It's not an urgent thing and it would definitely benefit from careful review.

That being said, I will likely be working on the dashboard project next week, so a quick turnaround is not necessary.

@annakrystalli
Copy link
Member

Before doing a detailed review, I wanted to run some higher level thoughts passed you.

This is a great improvement but I think we could push the utility even further.

One of the biggest difficulties throughout hubverse packages is developing against an under development schema. This would still require internet access to access schema from a branch during development and testing (hence we'd still need to add skips for not online in tests and use set tests against dev branches) and a lot of post release cleanup of tests after a schema release.

Given the GitHub version of packages is considered development, I propose we also allow the next under development schema version to be able to be included in hubUtils. This would make such versions available to the rest of the hubverse and hence development of other packages would become SO much easier and remove the requirement for post release clean up of schema related tests downstream.

To enable this I would like to modify the script to download the schema and turn it into a function that can take a branch argument as an input.

For traceability, I also think inclusion of an additional file (yml?) in the inst folder that records:

  • The branch the latest version was updated from
  • The schemas repo commit hash of the latest version
  • A timestamp of the last update

Down the line we could perhaps use such info and webhooks to automatically open PRs to update the latest version? Not needed to start off with though but good documentation for developers will be required instead.

Whatever we decide I feel we should add a section in the schemas README and also create a more detailed custom CONTRIBUTING.md here that details the workflow we land on.

.github/CONTRIBUTING.md Outdated Show resolved Hide resolved
@zkamvar
Copy link
Member Author

zkamvar commented Dec 19, 2024

To enable this I would like to modify the script to download the schema and turn it into a function that can take a branch argument as an input.

For traceability, I also think inclusion of an additional file (yml?) in the inst folder that records:

  • The branch the latest version was updated from
  • The schemas repo commit hash of the latest version
  • A timestamp of the last update

I think this is a good idea. I did not turn it into a function, however because doing so would require us to put "usethis" as one of the Suggested packages. Since this is code that is only meant for the developers to synchronize data, I kept it in data-raw/ and have added instructions to run it via R, bash, and/or git hook.

I have implemented the additional file that records the branch, sha, and timestamp as a JSON file because we already have the tools to parse and write JSON.

The ability to store the in-development version in this package also means that the branch argument is superseded, but it's still a good fallback if someone really needs an in-development schema with the released version of the package.

@zkamvar zkamvar requested a review from annakrystalli January 6, 2025 16:02
@zkamvar zkamvar added the upkeep maintenance, infrastructure, and similar label Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upkeep maintenance, infrastructure, and similar
Projects
Status: Ready for Review
Development

Successfully merging this pull request may close these issues.

Deliver schemas with the hubUtils package instead of via GitHub
2 participants