feat: validation task #983

Closed
wants to merge 33 commits into from

@sebffischer (Member) commented Dec 14, 2023

TODOs:

  • Maybe we should actually rename the test task to validation (?) But the naming is still confusing, as the resampling's test set then becomes the validation set ...
  • Add more checks that verify that the holdout and validation tasks are compatible with the primary task. Pay attention to the different task types (e.g. don't check for the target in a clustering task).

This PR makes it possible to preprocess the test rows (which can be used, e.g., for early stopping by xgboost) in a graph learner, so that early stopping with xgboost inside a graph learner now works.

Some explanations for the changes:

  • The relevant lines of code that restricted how we can implement the preprocessing of test rows can be found here: https://github.com/mlr-org/mlr3pipelines/blob/044762e64e68c4aec39cd2e6b6e1f8ef45f135ca/R/PipeOpTaskPreproc.R#L211-L218. First, the private $.train_task(task) method modifies the 'use' rows of the task in-place (usually by cbinding, but in principle anything can happen here, and users may have overwritten this method when inheriting from PipeOpTaskPreproc).
    After the state of the PipeOp is set, predictions must somehow be made on the test rows and added to the task. We previously explored row-binding them to the task, but this was inefficient, because row-binding has to bind all columns, even those not altered by the PipeOp. In a graph, this would introduce an rbind-cbind-rbind-cbind, ..., rbind-cbind backend structure, which is a) hard to flatten and b) memory inefficient and potentially slow. The solution implemented in this Pull Request sidesteps this problem by simply attaching the test task to the task itself, using the newly introduced active binding $test_task. The test task can be conveniently created by the user with the newly introduced $partition() method.
    In practice, this now looks as follows:
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
task
#> <TaskClassif:iris> (150 x 5): Iris Flowers
#> * Target: Species
#> * Properties: multiclass
#> * Features (4):
#>   - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width
task$divide(1:10, "test")
task
#> <TaskClassif:iris> (140 x 5): Iris Flowers
#> * Target: Species
#> * Properties: multiclass
#> * Features (4):
#>   - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width
#> * Test Task: (10x5)

task$test_task
#> <TaskClassif:iris> (10 x 5): Iris Flowers
#> * Target: Species
#> * Properties: multiclass
#> * Features (4):
#>   - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width

po_pca = po("pca")

taskout = po_pca$train(list(task))[[1L]]
taskout$test_task
#> <TaskClassif:iris> (10 x 5): Iris Flowers
#> * Target: Species
#> * Properties: multiclass
#> * Features (4):
#>   - dbl (4): PC1, PC2, PC3, PC4

Created on 2024-02-16 with reprex v2.0.2

  • PipeOps always preprocess the test_task when it is provided. However, a GraphLearner should only preprocess the test rows when they are actually needed; otherwise this is unnecessary computation (they are currently not used for the learner's $predict() step). To communicate this, the 'uses_test_task' property was introduced.
    Because the 'uses_test_task' property is not fixed (its presence depends, e.g., on whether the early_stopping_set parameter of XGBoost is set to "test" or "none"), it was necessary to add the ability to generate a learner's properties dynamically. This is done via the private method .contingent_properties(), which learners can overwrite. In the Learner base class, this method must be set to a function returning character(0) (and not NULL) because of a bug in R6.
  • Retired interface: We previously had the API task$set_row_roles(1, "test") or task$set_row_roles(1, "holdout").
    Because we now introduced the $test_task field, there would have been two ways to achieve something similar. This would have made the code messy and the interface confusing. For this reason, both the holdout and test row roles were removed.
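
To illustrate the contingent-properties mechanism described above, here is a minimal R6 sketch. It is not the code from this PR: the class names SketchLearner and SketchXgboost and the fields param_vals and .static_properties are made up for illustration. Only .contingent_properties(), the 'uses_test_task' property, and the early_stopping_set parameter are taken from the description.

```r
library(R6)

# Hypothetical base class for illustration; this is NOT mlr3's Learner.
SketchLearner = R6Class("SketchLearner",
  public = list(
    param_vals = NULL,
    initialize = function(properties = character(0), param_vals = list()) {
      private$.static_properties = properties
      self$param_vals = param_vals
    }
  ),
  active = list(
    # Recomputed on every access, so the result follows parameter changes.
    properties = function(rhs) {
      if (!missing(rhs)) stop("properties is read-only")
      unique(c(private$.static_properties, private$.contingent_properties()))
    }
  ),
  private = list(
    .static_properties = NULL,
    # Default: no contingent properties. Returns character(0), not NULL.
    .contingent_properties = function() character(0)
  )
)

# Hypothetical learner whose property set depends on a parameter value.
SketchXgboost = R6Class("SketchXgboost",
  inherit = SketchLearner,
  private = list(
    .contingent_properties = function() {
      if (identical(self$param_vals$early_stopping_set, "test")) {
        "uses_test_task"
      } else {
        character(0)
      }
    }
  )
)

learner = SketchXgboost$new(
  properties = "multiclass",
  param_vals = list(early_stopping_set = "none")
)
props_none = learner$properties

# Switching the parameter changes the reported properties.
learner$param_vals$early_stopping_set = "test"
props_test = learner$properties
```

In this sketch, a caller such as a GraphLearner could check whether "uses_test_task" is in learner$properties before deciding to preprocess the test task.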

This PR breaks some existing packages due to the removal of the 'holdout' and 'test' row roles, so I have already created Pull Requests in some of them:

  • TODO: check whether I really got all packages (only checked those that I have locally available)

The general plan to merge this feature is to:

  1. Make releases for these PRs:

  2. Merge this branch and make a release on CRAN

  3. Implement the feature in pipelines and make a release from this branch:

  4. Make changes in mlr3extralearners and bump mlr3 dependency
  5. Make a gallery post about this

@sebffischer changed the title from "feat: uses_test_set field for learner" to "feat: contingent properties and validation support" on Jan 23, 2024
@sebffischer changed the title from "feat: contingent properties and validation support" to "feat: contingent properties and test_test_rows support" on Jan 27, 2024
@sebffischer changed the title from "feat: contingent properties and test_test_rows support" to "feat: contingent properties and use_test_rows support" on Feb 7, 2024
@sebffischer changed the title from "feat: contingent properties and use_test_rows support" to "feat: test and holdout task" on Feb 20, 2024
@sebffischer changed the title from "feat: test and holdout task" to "feat: validation and holdout task" on Mar 18, 2024
@sebffischer changed the title from "feat: validation and holdout task" to "feat: validation task" on Mar 19, 2024
#' If `TRUE` (default), the `row_ids` are removed from the primary task's active `"use"` rows.
#'
#' @return Modified `self`.
divide = function(x, remove = TRUE) {
Member

Why not two parameters?

@sebffischer sebffischer closed this Jun 4, 2024
2 participants