-
-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: validation task #983
Closed
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
sebffischer
changed the title
feat: uses_test_set field for learner
feat: contingent properties and validation support
Jan 23, 2024
sebffischer
commented
Jan 25, 2024
sebffischer
commented
Jan 25, 2024
sebffischer
changed the title
feat: contingent properties and validation support
feat: contingent properties and test_test_rows support
Jan 27, 2024
sebffischer
changed the title
feat: contingent properties and test_test_rows support
feat: contingent properties and use_test_rows support
Feb 7, 2024
sebffischer
commented
Feb 12, 2024
sebffischer
changed the title
feat: contingent properties and use_test_rows support
feat: test and holdout task
Feb 20, 2024
sebffischer
changed the title
feat: test and holdout task
feat: validation and holdout task
Mar 18, 2024
be-marc
reviewed
May 17, 2024
#' If `TRUE` (default), the `row_ids` are removed from the primary task's active `"use"` rows. | ||
#' | ||
#' @return Modified `self`. | ||
divide = function(x, remove = TRUE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not two parameters?
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
TODOs:
This PR enables to solve the problem that the test rows, that can e.g. used for early stopping by xgboost, can be preprocessed in a graph learner and that early stopping xgboost in a graph learner now works.
Some explanations for the changes:
$.train_task(task)
method modifies the 'use' rows of task in-place (usually by cbinding, but in principle, anything can happen here, and users have possibly overwritten this method when inheriting fromPipeOpTaskPreproc
.After setting the state of the
PipeOp
, somehow the predictions must be made on the test rows, and added to the task. We previously explored row-binding them to the task, but this was inefficient, as row-binding requires to row-bind all columns, even if they were not altered by the pipeop. In a graph, this would introdcues a rbind-cbind-rbind-cbind, ..., rbind-cbind backend structure, which is a) hard to flatten and b) memory inefficient and can get possibly slow. The solution implemented in this Pull Request sidesteps this problem by simply adding the test task to the task itself, using the newly introduced AB$test_task
. The test task can be conveniently created by the user, using the newly introduced$partition()
method.In practice, this now looks as follows:
Created on 2024-02-16 with reprex v2.0.2
PipeOp
s always preprocess the test_task when it is provided. However, aGraphLearner
only wants to do the preprocessing on the test rows, when they are needed otherwise this is unnecessary computation (as they are currently not used for the learner's$predict()
step. To communicate this, the 'uses_test_task' property was introduced.Because the 'uses_test_task' property is not fixed (its presence depends e.g. on whether he
early_stopping_set
parameter from XGBoost is set to"test"
or"none"
), it was necessary to add the ability to dynamically generate a learner's properties. This was done using the private method.contingent_properties()
that can be overwritten by learners. It is necessary to set this method in theLearner
base class to a function returningcharacter(0)
(and notNULL
), because of a bug inR6
.task$set_row_roles(1, "test")
ortask$set_row_roles(1, "holdout")
.Because we now introduced the
$test_task
field, there would have been two ways to achieve something similar. This made code messy and the interface confusing. For this reason, both theholdout
andtest
row-roles were removed.Because this PR breaks some existing packages (because of the removal of the 'holdout' and 'test' row roles), I have already created Pull Requests in some packages:
The general plan to merge this feature is to:
Make releases for these PRs:
mlr3learners
: Feat/train predict mlr3learners#288 (Xgboost, only dev and paramtest are failing)mlr3tuning
: Fix/train predict mlr3tuning#413 (holdout set is used)mcboost
: update vignette to not use holdout role mcboost#44 (vignette uses holdout set)mlr3fairness
fix incorrect predict set mlr3fairness#74 (there is a bug that I did not cause)mlr3pipelines
https://github.com/mlr-org/mlr3pipelines/pull/761/files (this is needed, because of the way the graphlearner sets its properties)Merge this branch and make a release on CRAN
Implement the feature in pipelines and make a release from this branch:
mlr3extralearners
and bump mlr3 dependency