Ship released schemas with hubUtils #194

Open
wants to merge 27 commits into
base: main
Changes from 17 commits
133 changes: 133 additions & 0 deletions .github/CONTRIBUTING.md
@@ -41,6 +41,139 @@ Our procedures for contributing bigger changes, code in particular, generally fo
- We use [testthat](https://cran.r-project.org/package=testthat) for unit tests.
Contributions with test cases included are easier to accept.

## Working with a development version of `hubverse-org/schemas`

The canonical home for the hubverse schemas is
https://github.com/hubverse-org/schemas. These schemas are copied into this
repository under the `inst/schemas` folder, which allows offline validation for
hubs.
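
As a rough illustration of what "offline" means here, the sketch below lists
the schema versions found in a local schemas folder without touching the
network. The helper name `list_schema_versions()` is hypothetical; it only
assumes that each released version lives in a sub-folder whose name starts with
`v`, as in `inst/schemas`.

```r
# Hypothetical helper: list the version folders inside a schemas directory.
# Assumes version folders are named like "v3.0.1"; other files are ignored.
list_schema_versions <- function(schema_dir) {
  list.files(schema_dir, pattern = "^v")
}

# With hubUtils installed, the shipped schemas could be inspected like this:
# list_schema_versions(system.file("schemas", package = "hubUtils"))
```

Files such as `NEWS.md` and `update.json` sit next to the version folders and
are skipped by the `^v` pattern.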

**If you are developing against an in-development version of the hubverse
schemas, you must ensure that the schemas in this repository are synchronized.**

### Synchronization script

The script that synchronizes the schemas is in
[data-raw/schemas.R](https://github.com/hubverse-org/hubUtils/blob/main/data-raw/schemas.R).
It can be run from within R, as a standalone script, or as a Git hook. It
reads one environment variable, `HUBUTILS_DEV_BRANCH`, which names the schemas
branch to synchronize. If the variable is unset, the branch recorded in
`inst/schemas/update.json` is used; if that file does not exist, the script
falls back to `main`.
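
The lookup order can be sketched as follows. This is a simplified
illustration, not the script itself: `resolve_branch()` is a hypothetical name,
and the fallback to `"main"` when no `update.json` exists is based on the
`get_branch()` helper in `data-raw/schemas.R`.

```r
# Hypothetical helper mirroring the branch lookup order:
#   1. the HUBUTILS_DEV_BRANCH environment variable, if set
#   2. the `branch` field of update.json, if the file exists
#   3. "main" otherwise
resolve_branch <- function(update_cfg_path) {
  from_env <- Sys.getenv("HUBUTILS_DEV_BRANCH", unset = NA)
  if (!is.na(from_env)) {
    return(from_env)
  }
  if (file.exists(update_cfg_path)) {
    # jsonlite is only needed when the file is actually present
    return(jsonlite::read_json(update_cfg_path)$branch)
  }
  "main"
}

# Usage, from the repository root:
# resolve_branch("inst/schemas/update.json")
```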

#### Usage: within R

```r
source("data-raw/schemas.R")
```

#### Usage: from BASH

```bash
Rscript data-raw/schemas.R
```

#### Usage: commit hook

See [Installing the Git Hook](#installing-the-git-hook). A Git hook is a way to
run a local script before or after you do something in Git. For example, a
pre-commit hook (the one we use here) will run every time before you make a
commit. Likewise, a pre-push hook will run every time before you push to a
repository.

#### Details

By default, this script makes a single call to the GitHub API to fetch the most
recent commit on the branch listed in `inst/schemas/update.json`. If the branch
and sha match and the recorded timestamp is later than the date of that commit,
then you are good to go!
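
The decision can be sketched roughly as below. This is an illustration under
assumptions: `needs_update()` and `parse_utc()` are hypothetical names, and
`commit` is a simplified flat list with `sha` and `date` fields rather than the
nested response that the GitHub API returns.

```r
# Parse an ISO 8601 UTC timestamp such as "2024-12-19T16:40:16Z".
parse_utc <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

# Hypothetical helper: TRUE if the local copy should be refreshed.
# `cfg` stands in for the contents of update.json.
needs_update <- function(cfg, commit, branch) {
  branch_changed <- !identical(cfg$branch, branch)
  sha_changed <- !identical(cfg$sha, commit$sha)
  outdated <- isTRUE(parse_utc(cfg$timestamp) < parse_utc(commit$date))
  branch_changed || sha_changed || outdated
}

cfg <- list(branch = "main", sha = "abc123", timestamp = "2024-12-19T16:26:33Z")
commit <- list(sha = "abc123", date = "2024-12-01T09:00:00Z")

needs_update(cfg, commit, branch = "main")
#> [1] FALSE
needs_update(cfg, commit, branch = "br-v4.0.1")
#> [1] TRUE
```

Any one of the three conditions is enough to trigger a fresh clone, which is
why switching branches always results in an update.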

If an update is needed, then your system git is used to clone the branch and
copy it over to `inst/schemas`.

When the script is run non-interactively (for example via `Rscript` or as a Git
hook), the tests are re-run whenever a schema update happens.


### Installing the Git Hook

Using this script as a pre-commit hook is optional but recommended, so that the
schemas are checked for updates before each commit.

```r
usethis::use_git_hook("pre-commit", readLines("data-raw/schemas.R"))
```

This will create or overwrite `.git/hooks/pre-commit`.

**If you want to uninstall the Git hook, remove the `.git/hooks/pre-commit`
file.**
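
Because the hook is a copy of the script, it can go stale when
`data-raw/schemas.R` changes. The script guards against this by comparing MD5
checksums when it runs non-interactively. A minimal base R sketch of that
comparison is shown below; `hook_is_current()` is a hypothetical name, and
returning `NA` when no hook is installed is a choice made for this example.

```r
# Hypothetical helper: does the installed pre-commit hook match the script?
# Returns NA if no hook is installed, otherwise TRUE or FALSE.
hook_is_current <- function(repo_path) {
  hook <- file.path(repo_path, ".git", "hooks", "pre-commit")
  script <- file.path(repo_path, "data-raw", "schemas.R")
  if (!file.exists(hook)) {
    return(NA)
  }
  # md5sum() returns a named vector; drop the names for a clean result
  unname(tools::md5sum(hook) == tools::md5sum(script))
}

# Usage, from the repository root:
# hook_is_current(".")
```

If this returns `FALSE`, re-running the `usethis::use_git_hook()` command above
refreshes the hook.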

### Synchronizing a development branch

To synchronize a development branch, set a temporary environment variable called `HUBUTILS_DEV_BRANCH` to the name of the branch. This can only be done interactively in R or from BASH.

#### Via R

```r
Sys.setenv("HUBUTILS_DEV_BRANCH" = "br-v4.0.1")
source("data-raw/schemas.R")
#> ✔ removing /path/to/hubUtils/inst/schemas
#> ✔ Creating inst/schemas/.
#> ℹ Fetching the latest version of the schemas from GitHub
#> Cloning into '/path/to/temp/folder'...
#> ✔ Copying v4.0.1, v4.0.0, v3.0.1, v3.0.0, v2.0.1, v2.0.0, v1.0.0, v0.0.1, v0.0.0.9,
#> and NEWS.md to inst/schemas
#> [ ... snip ... ]
#> ✔ Done
#> ✔ Schemas up-to-date!
#> ℹ branch: "br-v4.0.1"
#> ℹ sha: "43b2c8aceb3a316b7a1929dbe8d8ead2711d4e84"
#> ℹ timestamp: "2024-12-19T16:40:16Z"
Sys.unsetenv("HUBUTILS_DEV_BRANCH")
```
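
It is easy to forget the final `Sys.unsetenv()` call, which would leave later
runs pointed at the development branch. One possible way to scope the variable
is a small wrapper like the one below. `with_dev_branch()` is a hypothetical
helper, not part of hubUtils; it relies on R's lazy evaluation so that `code`
only runs once the variable has been set.

```r
# Hypothetical helper: run `code` with HUBUTILS_DEV_BRANCH set, then restore
# the previous state (set or unset) even if `code` throws an error.
with_dev_branch <- function(branch, code) {
  old <- Sys.getenv("HUBUTILS_DEV_BRANCH", unset = NA)
  Sys.setenv(HUBUTILS_DEV_BRANCH = branch)
  on.exit(
    if (is.na(old)) {
      Sys.unsetenv("HUBUTILS_DEV_BRANCH")
    } else {
      Sys.setenv(HUBUTILS_DEV_BRANCH = old)
    },
    add = TRUE
  )
  force(code)
}

# Usage, from the repository root:
# with_dev_branch("br-v4.0.1", source("data-raw/schemas.R"))
```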

#### Via BASH

When run via script, if any synchronization happens, tests are automatically run:

```bash
# The variable is set only for this one command, so there is nothing to unset.
HUBUTILS_DEV_BRANCH=br-v4.0.1 Rscript data-raw/schemas.R
#> ✔ removing /path/to/hubUtils/inst/schemas
#> ✔ Creating inst/schemas/.
#> ℹ Fetching the latest version of the schemas from GitHub
#> Cloning into '/path/to/temp/folder'...
#> ✔ Copying v4.0.1, v4.0.0, v3.0.1, v3.0.0, v2.0.1, v2.0.0, v1.0.0, v0.0.1, v0.0.0.9,
#> and NEWS.md to inst/schemas
#> [ ... snip ... ]
#> ✔ Done
#> ✔ Schemas up-to-date!
#> ℹ branch: "br-v4.0.1"
#> ℹ sha: "43b2c8aceb3a316b7a1929dbe8d8ead2711d4e84"
#> ℹ timestamp: "2024-12-19T16:40:16Z"
#> ! Schema updated. Re-running tests.
#> ℹ Testing hubUtils
#> ✔ | F W S OK | Context
#> ✔ | 7 | as_config
#> ✔ | 9 | as_model_out_tbl
#> ✔ | 5 | check_deprecated_schema
#> ✔ | 17 | model_id_merge
#> ✔ | 7 | read_config [6.4s]
#> ✔ | 7 | utils-get_hub
#> ✔ | 16 | utils-model_out_tbl
#> ✔ | 14 | utils-round_ids
#> ✔ | 8 | utils-round-config
#> ✔ | 39 | utils-schema-versions
#> ✔ | 14 | utils-schema [1.2s]
#> ✔ | 3 | utils-task_ids
#> ✔ | 7 | v3-schema-utils
#>
#> ══ Results ════════════════════════════
#> Duration: 9.0 s
#>
#> [ FAIL 0 | WARN 0 | SKIP 0 | PASS 153 ]
```


## Code of Conduct

Please note that the hubUtils project is released with a
2 changes: 2 additions & 0 deletions NEWS.md
@@ -1,5 +1,7 @@
# hubUtils (development version)

* Released schemas are now shipped with the package, so an internet connection
is no longer necessary for local validation.
* Added `subset_task_id_names()` function to subset task ID names from a character vector of column names (#149).
* Added functions `subset_task_id_cols()` and `subset_std_cols()` to subset a `model_out_tbl` or submission `tbl` to task ID or standard (non-task ID) columns respectively (#149).

41 changes: 41 additions & 0 deletions R/utils-schema.R
@@ -39,6 +39,10 @@ get_schema_url <- function(config = c("tasks", "admin", "model"),
#' @examplesIf asNamespace("hubUtils")$not_rcmd_check()
#' get_schema_valid_versions()
get_schema_valid_versions <- function(branch = "main") {
if (branch == "main") {
schema_path <- system.file("schemas", package = "hubUtils")
return(list.files(schema_path, pattern = "^v"))
}
branches <- gh(
"GET /repos/hubverse-org/schemas/branches"
) %>%
@@ -75,6 +79,21 @@ get_schema_valid_versions <- function(branch = "main") {
#' schema_url <- get_schema_url(config = "tasks", version = "v0.0.0.9")
#' get_schema(schema_url)
get_schema <- function(schema_url) {
  # If the branch is "main", then we can use the stored schemas inside the
  # package. isTRUE() keeps this check from erroring when the URL does not
  # match the expected pattern and the extracted branch is NA.
  pieces <- extract_schema_info(schema_url)
  if (isTRUE(pieces$branch[1] == "main")) {
    ver <- pieces$version
    cfg <- pieces$config
    path <- system.file("schemas", ver, cfg, package = "hubUtils")
    if (fs::file_exists(path)) {
      return(jsonlite::prettify(readLines(path)))
    } else {
      cli::cli_alert_warning("{.file {ver}/{cfg}} not found.
        This could mean your version of hubUtils is outdated.
        Attempting to connect to GitHub.")
    }
  }
response <- try(curl_fetch_memory(schema_url), silent = TRUE)

if (inherits(response, "try-error")) {
@@ -96,6 +115,28 @@ get_schema <- function(schema_url) {
}
}

#' Extract the branch, version, and config from a vector of schema URLs
#'
#' @param id a character vector of URLs for hubverse schemas
#' @return a data frame with one row per URL and three columns: branch,
#'   version, and config. URLs that do not match the expected pattern yield a
#'   row of `NA`s.
#'
#' @noRd
#' @examples
#' urls <- c(
#' "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
#' "https://raw.githubusercontent.com/hubverse-org/schemas/main/v2.0.0/admin-schema.json",
#' "https://raw.githubusercontent.com/hubverse-org/schemas/br-v4.0.0/v4.0.0/tasks-schema.json"
#' )
#' extract_schema_info(urls)
extract_schema_info <- function(id) {
  # Literal dots are wrapped in [.] so that they do not match any character
  lead <- "^https[:][/][/]raw[.]githubusercontent[.]com[/]hubverse-org[/]schemas[/]"
  good_stuff <- "(.+?)[/](v[0-9.]+?)[/]([a-z]+?-schema[.]json)$"
  pattern <- paste0(lead, good_stuff)
  # The prototype fixes the names and types of the returned columns
  proto <- stats::setNames(character(3), c("branch", "version", "config"))
  utils::strcapture(pattern, id, proto)
}

#' Get the latest schema version
#'
#' Get the latest schema version from the schema repository if "latest" requested
182 changes: 182 additions & 0 deletions data-raw/schemas.R
@@ -0,0 +1,182 @@
#!/usr/bin/env Rscript
# Update the schemas with our repository
#
# This will delete our existing schemas folder, re-clone the git repository,
# and copy over the schemas and the NEWS to our inst folder.
#
# Since every release is static and the schemas do not materially change, we
# should not see any diffs in the established schemas
#
# This will update the schemas in this repository and update a tracker json file
# called inst/schemas/update.json with the following information:
#
# - branch: The branch the latest version was updated from
# - sha: The schemas repo commit hash of the latest version
# - timestamp: A timestamp of the last update (in ISO 8601 timestamp format,
# UTC time)
#
# USAGE:
#
# GIT HOOK
# ========
#
# This script can be used as a pre-commit hook and will run every time before
# you commit. You can install it with:
#
# ```
# usethis::use_git_hook("pre-commit", readLines("data-raw/schemas.R"))
# ```
#
# If you run this and the schema updates, then you will need to double-check
# your tests to make sure the updates did not affect your work.
#
# STANDALONE
# ==========
#
# By default, this script checks for updates on the branch defined in
# `inst/schemas/update.json`. If there are no updates to be had, nothing
# will be done:
#
# ```
# source("data-raw/schemas.R")
# #> ✔ Schemas up-to-date!
# #> ℹ branch: "main"
# #> ℹ sha: "0163a89cc38ba3846cd829545f6d65c1e40501a6"
# #> ℹ timestamp: "2024-12-19T16:26:33Z"
# ```
#
# TO CHANGE THE BRANCH, update the environment variable called
# `HUBUTILS_DEV_BRANCH`:
#
# ```
# Sys.setenv("HUBUTILS_DEV_BRANCH" = "br-v4.0.1")
# source("data-raw/schemas.R")
# #> ✔ removing /path/to/hubUtils/inst/schemas
# #> ✔ Creating inst/schemas/.
# #> ℹ Fetching the latest version of the schemas from GitHub
# #> Cloning into '/var/folders/9p/m996p3_55hjf1hc62552cqfr0000gr/T//Rtmp3Q4dnp/file377d
# #> 71aaebd4'...
# #> ✔ Copying v4.0.1, v4.0.0, v3.0.1, v3.0.0, v2.0.1, v2.0.0, v1.0.0, v0.0.1, v0.0.0.9,
# #> and NEWS.md to inst/schemas
# #> [ ... snip ... ]
# #> ✔ Done
# #> ✔ Schemas up-to-date!
# #> ℹ branch: "br-v4.0.1"
# #> ℹ sha: "43b2c8aceb3a316b7a1929dbe8d8ead2711d4e84"
# #> ℹ timestamp: "2024-12-19T16:40:16Z"
# ```

# FUNCTIONS -------------------------------------------------------------------
check_hook <- function(repo_path) {
  # if this is running as a git hook, then the first thing to do is to make sure
  # it is up to date with the source material. If it's not, error until it is
  # fixed.
  if (interactive()) {
    return()
  }
  hook <- fs::path(repo_path, ".git/hooks/pre-commit")
  if (fs::file_exists(hook)) {
    schema_script <- fs::path(repo_path, "data-raw/schemas.R")
    okay <- tools::md5sum(hook) == tools::md5sum(schema_script)
    if (!isTRUE(okay)) {
      cmd <- r"[usethis::use_git_hook("pre-commit", readLines("data-raw/schemas.R"))]" # nolint: object_usage_linter
      cli::cli_abort(
        c(
          "git hook outdated",
          "i" = r"[Use {.code {cmd}} to update your hook.]"
        )
      )
    }
  }
}

get_branch <- function(update_cfg_path) {
if (fs::file_exists(update_cfg_path)) {
branch <- jsonlite::read_json(update_cfg_path)$branch
} else {
branch <- "main"
}
branch
}

timestamp <- function() {
format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

get_latest_commit <- function(branch) {
res <- gh::gh("GET /repos/hubverse-org/schemas/branches/{branch}",
branch = branch
)
res$commit
}

check_for_update <- function(update_cfg_path, branch) {
  update <- FALSE
  # If there is no config file, then we automatically update
  if (!fs::file_exists(update_cfg_path)) {
    update <- TRUE
    cfg <- list(
      branch = branch,
      sha = NULL,
      timestamp = "2024-07-16T00:00:00Z"
    )
  } else {
    cfg <- jsonlite::read_json(update_cfg_path)
  }
  # Fetch the latest commit, and check if either the branch has changed, or
  # it's outdated (based on commit date)
  the_commit <- get_latest_commit(branch)
  # identical() always yields a single TRUE/FALSE, even when a field such as
  # `sha` is NULL, so the `||` chain below never sees a length-zero value.
  branch_change <- !identical(cfg$branch, branch)
  outdated <- cfg$timestamp < the_commit$commit$author$date
  sha_different <- !identical(cfg$sha, the_commit$sha)

  update <- update || (branch_change || outdated || sha_different)
  if (update) {
    cfg$branch <- branch
    cfg$sha <- the_commit$sha
    cfg$timestamp <- timestamp()
  }
  return(list(update = update, cfg = cfg))
}

# VARIABLES ------------------------------------------------------------------
check_hook(usethis::proj_path())
schemas <- usethis::proj_path("inst/schemas")
cfg_path <- fs::path(schemas, "update.json")
branch <- Sys.getenv("HUBUTILS_DEV_BRANCH", unset = get_branch(cfg_path))
new <- check_for_update(cfg_path, branch)

# PROCESS UPDATE IF NEEDED ---------------------------------------------------
if (new$update) {
cli::cli_alert_success("removing {.file {schemas}}")
fs::dir_delete(schemas)
usethis::use_directory("inst/schemas")

cli::cli_alert_info("Fetching the latest version of the schemas from GitHub")
tmp <- tempfile()
system2("git", c("clone", "--branch", branch, "https://github.com/hubverse-org/schemas.git", tmp))

versions <- as.character(fs::dir_ls(tmp, type = "dir"))
cli::cli_alert_success("Copying {.file {c(rev(fs::path_file(versions)), 'NEWS.md')}} to {.file inst/schemas}")
purrr::walk(versions, fs::dir_copy, schemas)
fs::file_copy(fs::path(tmp, "NEWS.md"), schemas)
fs::dir_tree(schemas)
fs::dir_delete(tmp)
jsonlite::write_json(new$cfg,
path = cfg_path,
pretty = TRUE,
auto_unbox = TRUE
)
cli::cli_alert_success("Done")
}

# REPORT STATUS --------------------------------------------------------------
cli::cli_alert_success("Schemas up-to-date!")
cli::cli_alert_info("branch: {.val {new$cfg$branch}}")
cli::cli_alert_info("sha: {.val {new$cfg$sha}}")
cli::cli_alert_info("timestamp: {.val {new$cfg$timestamp}}")

# GIT HOOK: RE-TEST ON UPDATE ------------------------------------------------
# If this is being run as a git hook and the schemas were updated, we need
# to signal that the tests should be run again
if (!interactive() && new$update) {
cli::cli_alert_warning("Schema updated. Re-running tests.")
devtools::test(usethis::proj_path())
}