In classification problems, use the probably package when determining the best threshold. #986

Closed
SHo-JANG opened this issue Jul 8, 2023 · 5 comments

@SHo-JANG

SHo-JANG commented Jul 8, 2023

As far as I understand, we're using prob_to_class_2 as the default when converting predicted probabilities into class predictions.

prob_to_class_2 <- function(x, object) {
  x <- ifelse(x >= 0.5, object$lvl[2], object$lvl[1])
  unname(x)
}

However, in many cases the best threshold is not 0.5, especially with imbalanced datasets.
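To make the point concrete, here is a minimal sketch of the same conversion with the cutoff exposed as an argument. `prob_to_class_threshold()` and its `lvl` and `threshold` arguments are hypothetical names used only for illustration (they are not part of any package), `lvl` stands in for `object$lvl`, and the probabilities are made up.

```r
# Hypothetical helper: same logic as prob_to_class_2(), but with the cutoff
# passed in rather than fixed at 0.5. `lvl` plays the role of object$lvl.
prob_to_class_threshold <- function(x, lvl, threshold = 0.5) {
  out <- ifelse(x >= threshold, lvl[2], lvl[1])
  unname(out)
}

# Made-up predicted probabilities for the second level ("yes").
probs <- c(0.05, 0.10, 0.20, 0.35, 0.60)
lvl <- c("no", "yes")

# Default cutoff of 0.5 versus a lower cutoff of 0.2.
at_half <- prob_to_class_threshold(probs, lvl)
at_low <- prob_to_class_threshold(probs, lvl, threshold = 0.2)

print(at_half)
print(at_low)
```

With the default cutoff only the last observation is assigned to the second level; lowering the cutoff to 0.2 assigns the last three to it.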

In this case, I wonder if we could use the threshold_perf() function from the probably package during the tuning process to check how well the model classifies at thresholds other than 0.5.

I think this is a really necessary feature. What do you think?
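For readers unfamiliar with the idea, the following is a hand-rolled base R sketch of a threshold sweep, the kind of computation the request above is about. It does not call threshold_perf(); the function `sweep_thresholds()`, the simulated data, and the choice of metrics (sensitivity, specificity, and Youden's J) are assumptions made for illustration.

```r
# Simulated data standing in for a fitted model's output (not a real model).
set.seed(1)
n <- 1000
truth <- rbinom(n, 1, 0.1)                    # roughly 10% events
prob <- plogis(-2.5 + 2 * truth + rnorm(n))   # pretend predicted P(event)

# Hypothetical helper: compute metrics at each candidate threshold.
sweep_thresholds <- function(truth, prob, thresholds) {
  rows <- lapply(thresholds, function(t) {
    pred <- as.integer(prob >= t)
    sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)
    spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)
    data.frame(
      threshold = t,
      sensitivity = sens,
      specificity = spec,
      j_index = sens + spec - 1   # Youden's J
    )
  })
  do.call(rbind, rows)
}

res <- sweep_thresholds(truth, prob, seq(0.05, 0.95, by = 0.05))

# Pick the threshold with the largest J as one possible criterion.
best <- res[which.max(res$j_index), ]
print(res)
print(best)
```

Sensitivity falls and specificity rises as the threshold increases, so the choice of cutoff is a trade-off; maximizing J is only one of several reasonable criteria.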

@topepo
Member

topepo commented Jul 8, 2023

It is an important feature. After the Posit conference, we will be working on post-processing tools, and this is one of them.

We'll try to make it natural so that you can treat the threshold parameter like any other tuning parameter. If you use a workflow, it will also adjust the hard class predictions automatically (once you've picked a threshold).

@SHo-JANG
Author

SHo-JANG commented Jul 9, 2023

Thank you so much for all the hard work you do to make the system more complete.

@SHo-JANG
Author

I think that treating the threshold as a hyperparameter and tuning it would be time-consuming and could lead to overfitting.

Instead, I searched for a way to determine the optimal threshold directly (related paper).

In Section 2.3, "Threshold criteria", criterion (6) is PredPrev = Obs.
This means that we want the class ratio of the predicted results to equal the ratio of the observed classes in the training data, i.e., we use quantile(probs = 1 - "Obs class ratio") of the predicted probability vector as the threshold.

The code to implement this in the training process is as follows.

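prob_to_class_2_custom <- function(x, object) {
  # Observed ratio of the second class; assumes object$fit$y is coded 0/1
  obs_ratio <- mean(object$fit$y)
  # Cutoff chosen so the predicted class ratio matches the observed ratio
  pred_equal_obs_threshold <- quantile(x, probs = 1 - obs_ratio)
  x <- ifelse(x >= pred_equal_obs_threshold, object$lvl[2], object$lvl[1])
  unname(x)
}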

I would like to use this function as the default option.
However, it seems that I need to redefine the engine to apply this function. Is there any way to use this function in an existing engine?
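The prevalence-matching rule described above can be checked in isolation with simulated data. This is only a sketch under stated assumptions: `y` and `prob` are made-up stand-ins for a 0/1 training outcome and the predicted probability of the second level, not output from a real model or engine.

```r
# Simulated stand-ins: `y` is a 0/1 outcome, `prob` the predicted
# probability of the second level. Neither comes from a fitted model.
set.seed(42)
n <- 2000
y <- rbinom(n, 1, 0.15)
prob <- plogis(-2 + 1.5 * y + rnorm(n))

# Threshold = the (1 - observed ratio) quantile of the predicted probabilities.
obs_ratio <- mean(y)
threshold <- unname(quantile(prob, probs = 1 - obs_ratio))
pred <- as.integer(prob >= threshold)

# The share of predicted events should be close to the observed share.
pred_prev <- mean(pred)
cat("observed prevalence: ", obs_ratio, "\n")
cat("threshold:           ", threshold, "\n")
cat("predicted prevalence:", pred_prev, "\n")

# Edge case: with a single row, the quantile equals that row's probability.
single <- 0.01
single_pred <- as.integer(single >= quantile(single, probs = 1 - obs_ratio))
```

One thing to keep in mind: because the quantile is taken over whatever vector is passed in, the cutoff depends on the batch being predicted. With a single row, the quantile equals that row's own probability, so the row is always assigned the second level, however small the probability is.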

@simonpcouch
Contributor

Long time no see 😝 We've got some good news here, though: custom probability thresholds and other postprocessing functionality are now available via tailors, which can be added to workflows in the dev version of the workflows package. You can read more about that work in this blog post.

Since these changes will otherwise live on the tailor repo, I'm going to go ahead and close!

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 24, 2024