
Elr/rel scores #69

Merged: 13 commits merged from elr/rel_scores into main on Jan 7, 2025
Conversation

elray1 (Contributor) commented Jan 3, 2025

fixes #66

github-actions bot commented Jan 3, 2025

#' `metrics` and should only include proper scores (e.g., it should not contain
#' interval coverage metrics). If `NULL` (the default), no relative metrics
#' will be computed. Relative metrics are only computed if `summarize = TRUE`,
#' and require that `"model_id"` is included in `by`.
elray1 (Contributor, Author) commented Jan 3, 2025
I think the alternative to enforcing this requirement is to add another argument along the lines of compare, as used in scoringutils::get_pairwise_comparisons, with a default of "model_id". I think that would be fine, but since essentially all use cases of this function will include "model_id" in by, I don't think it's necessarily worth introducing the extra argument here?
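
For context, a rough sketch of how compare behaves in scoringutils itself, using its bundled example data; this assumes the scoringutils 2.x API and an illustrative metric choice, and is not code from this PR:

library(scoringutils)

# Score the bundled example quantile forecasts.
scores <- score(as_forecast_quantile(example_quantile))

# `compare` names the column whose levels are compared pairwise (here, models);
# `by` lists any additional grouping columns. The question above is whether
# hubEvals should expose a similar argument defaulting to "model_id".
pairwise <- get_pairwise_comparisons(
  scores,
  compare = "model",
  by = "target_type",
  metric = "wis"
)
head(pairwise)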

Member replied:
I think the current approach is fine since it would be caught by the validation. The problem with extra arguments that affect other arguments is that it becomes difficult for users to remember the relationships between them.

@elray1 elray1 requested a review from nikosbosse January 3, 2025 19:20
zkamvar (Member) left a comment:
This is a good start. That said, I haven't yet looked at the tests because there is a lot going on there; I will take a look after I get back from lunch.

I did make some suggestions to simplify the validation function.

(Resolved review threads on R/validate.R and R/score_model_out.R)

if (length(relative_metrics) > 0 && !"model_id" %in% by) {
  cli::cli_abort(
    "Relative metrics require 'model_id' to be included in {.arg by}."
  )
}
Collaborator:

Is this strictly necessary? If we know that "model_id" always needs to be included in by, we could just put it there.
On top of that, we also filter out "model_id" on line 140 (by = by[by != "model_id"]), but it's also kind of required on line 145 (scores <- scoringutils::summarize_scores(scores = scores, by = by)).

Collaborator:

I haven't manually checked out the PR and run the code, so this is just from a cursory reading on GitHub. Let me know if you'd like me to dig deeper; I'm happy to make a suggestion.

elray1 (Contributor, Author):

I'm good with a cursory review on the level of "this is reasonable or not", given that Zhian is also reviewing.

Regarding your comment above: I don't think this is strictly necessary, but my general preference is to throw errors guiding users toward what we expect rather than modifying their inputs. I think all of this is clear to you already, but just to spell it out:

  • The hard-coded use of compare = "model_id" in the call to add_relative_skill means that we get results broken down by model, but we're not allowed to include the variable we specify for compare in the by argument to that function.
  • This also means that when we call scoringutils::summarize_scores(scores = scores, by = by), we need "model_id" in the by argument for the results to make sense.
  • So in general the by arguments to add_relative_skill and summarize_scores have to be different; we will always need to either drop the "model_id" entry from by in the call to add_relative_skill or add it to by in the call to score_model_out.
  • I think it'll be clearer to users if, for purposes of hubEvals::score_model_out, by is always expected to be the vector of names of variables by which scores are disaggregated in the result.
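
A minimal sketch of this by handling, stitched together from the snippets quoted in this thread; the surrounding variable setup and the metric value are illustrative assumptions, not the actual implementation:

# Hypothetically inside score_model_out(): `by` is the user-facing vector of
# columns the returned scores are disaggregated by, and must include "model_id".
by <- c("model_id", "location")

# add_relative_skill() must not receive the `compare` column in its `by`,
# so "model_id" is dropped here; compare = "model_id" is hard-coded.
scores <- scoringutils::add_relative_skill(
  scores,
  compare = "model_id",
  by = by[by != "model_id"],
  metric = "wis"  # illustrative; the real call may select the metric differently
)

# summarize_scores(), by contrast, needs "model_id" in `by` so that the
# summarized results are broken down per model.
scores <- scoringutils::summarize_scores(scores = scores, by = by)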

zkamvar (Member) left a comment:

I really appreciated your descriptions of the expected errors in the comments!

I went through the tests, and while they work, they are complex and will be painful to debug later on.

The bottom line is that the majority of the code in these tests should be encapsulated in test fixtures.

The most complicated expected score table here amounts to 6 rows and 10 columns, which is trivial for a human to read, even in a git diff. I would recommend storing it as a CSV file in tests/testthat/fixtures/ instead of having nearly 70 lines of code run every time you want to generate it.

Other than that, there was test noise from warnings (from wilcox.test, which cannot be helped), and I proposed a simplification of the equality tests.
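
A minimal sketch of the fixture pattern being recommended here; the file name, its contents, and the object under test are placeholders, not code from this PR:

# tests/testthat/fixtures/expected_scores.csv would hold the hand-checked
# expected table (placeholder name).
test_that("score_model_out() returns the expected relative scores", {
  expected <- read.csv(
    testthat::test_path("fixtures", "expected_scores.csv"),
    stringsAsFactors = FALSE
  )
  # `actual` stands in for the result of the call under test,
  # e.g. a score_model_out(...) call.
  expect_equal(as.data.frame(actual), expected, ignore_attr = TRUE)
})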

(Four resolved review threads on tests/testthat/test-score_model_out_rel_metrics.R)
elray1 (Contributor, Author) commented Jan 7, 2025

Thanks for the review @zkamvar! Here's a summary of my responses:

  • Used your suggested simplification of the checks for data frame equality; thanks for that.
  • After some floundering in 36a810f, which was later overwritten in 1718bcd, the expected scores are now saved as a CSV file under testthat/testdata.
  • I'm proposing not to add the expect_warning wrappers, because (a) these aren't really warnings that I would "expect" from this code; the fact that they are currently being thrown is not intended behavior; and (b) I've filed an issue with scoringutils to address the situation that produces these warnings, which I expect will be resolved soon.

zkamvar (Member) left a comment:

This looks great!

elray1 merged commit 040d000 into main on Jan 7, 2025
8 checks passed
elray1 deleted the elr/rel_scores branch on January 7, 2025 at 20:51
elray1 restored the elr/rel_scores branch on January 8, 2025 at 02:47
elray1 deleted the elr/rel_scores branch on January 8, 2025 at 02:48
Development: successfully merging this pull request may close issue #66, "add support for relative/pairwise scores".
3 participants