Native support for categorical splits #29

Open
RAMitchell opened this issue Jul 26, 2023 · 0 comments
Labels
feature request New feature or request

Comments


RAMitchell commented Jul 26, 2023

Existing GBM implementations take one of two approaches:

  1. Implement customised categorical split logic during tree construction: sort the per-category gradient/hessian sums within a feature, then scan the sorted order for the optimal split. LightGBM and XGBoost follow this approach.

  2. Preprocess the dataset to encode new features carrying category information. CatBoost uses this approach; it is also implemented here: https://contrib.scikit-learn.org/category_encoders/catboost.html

Approach 1 is problematic in that it biases split selection towards categorical features over numeric features. Our implementation should ideally be unbiased. Perhaps there is a way to randomly select a single categorical split, as we do for numeric features, in an unbiased way?
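For reference, the per-feature scan in approach 1 can be sketched as follows. This is a self-contained illustration, not this project's code: `best_categorical_split` and the regularisation parameter `lam` are hypothetical names, but the ordering by gradient/hessian ratio mirrors LightGBM's documented strategy.

```python
# Sketch of approach 1 (hypothetical helper, not this repo's API):
# sort a feature's categories by their gradient/hessian ratio, then scan
# that ordering like a numeric feature to find the best partition.
import numpy as np

def best_categorical_split(codes, grad, hess, lam=1.0):
    """codes: integer category per row; grad/hess: per-row gradient/hessian."""
    n_cat = codes.max() + 1
    g = np.bincount(codes, weights=grad, minlength=n_cat)  # per-category sums
    h = np.bincount(codes, weights=hess, minlength=n_cat)
    order = np.argsort(g / (h + lam))        # LightGBM-style category ordering
    G, H = g.sum(), h.sum()
    gl = np.cumsum(g[order])[:-1]            # prefix sums: left partition stats
    hl = np.cumsum(h[order])[:-1]
    # Standard second-order gain: left + right - unsplit.
    gain = gl**2 / (hl + lam) + (G - gl)**2 / (H - hl + lam) - G**2 / (H + lam)
    k = int(np.argmax(gain))
    left = set(order[: k + 1].tolist())      # categories routed to the left child
    return left, float(gain[k])
```

Because the scan considers up to `n_cat - 1` candidate partitions per categorical feature (versus one per numeric threshold under random splitting), this is where the selection bias discussed above comes from.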

Approach 2 is problematic in that it potentially involves expensive distributed shuffling or data movement, and it also requires making a copy of the input matrix. Approach 1 would be significantly less memory-hungry.
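The encoding in approach 2 can be sketched as a CatBoost-style ordered target statistic: each row's category is replaced by a smoothed target mean computed only from earlier rows, avoiding leakage of the row's own label. The function and parameter names here (`ordered_target_encode`, `prior`, `smoothing`) are illustrative, not CatBoost's actual API.

```python
# Sketch of approach 2: CatBoost-style "ordered" target encoding.
# Each row sees only statistics accumulated from preceding rows.
import numpy as np

def ordered_target_encode(codes, y, prior=0.5, smoothing=1.0):
    counts = {}                       # rows of this category seen so far
    sums = {}                         # target sum of this category so far
    out = np.empty(len(codes), dtype=float)
    for i, (c, t) in enumerate(zip(codes, y)):
        n = counts.get(c, 0)
        s = sums.get(c, 0.0)
        out[i] = (s + prior * smoothing) / (n + smoothing)  # smoothed running mean
        counts[c] = n + 1             # update stats only after encoding row i
        sums[c] = s + t
    return out
```

Note that this produces one new dense column per encoded feature, which is the copy-of-the-input-matrix cost mentioned above; in a distributed setting the running statistics also impose an ordering across partitions.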

RAMitchell added the feature request label Jul 26, 2023