Existing GBM implementations present two approaches:
1. Implement customised categorical split logic in the tree construction. Sort gradients/hessians of all categories within a feature and find the optimal split. LightGBM/XGBoost follow this approach.
2. Preprocess the dataset to encode new features carrying category information. CatBoost uses this approach. It is also implemented here: https://contrib.scikit-learn.org/category_encoders/catboost.html
Approach 1 is problematic in that it creates a bias towards selecting categorical features over numeric features. Our implementation should ideally be unbiased. Perhaps there is a way to randomly select a single split, as we do for numeric features, in an unbiased way?
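For concreteness, a rough sketch of the approach-1 split search under a second-order (gradient/hessian) gain formulation is below. The function name, the `lam` regularisation term, and the exact gain expression are illustrative assumptions, not anything specified in this issue; the point is that sorting categories by their gradient/hessian ratio and then scanning the ordered categories gives an exhaustive search over partitions, which is also where the extra splitting power, and hence the bias towards high-cardinality categorical features, comes from.

```python
import numpy as np

def best_categorical_split(categories, grad, hess, lam=1.0):
    """Illustrative sketch of approach 1 (LightGBM/XGBoost-style):
    sort categories by sum(grad)/sum(hess), then scan the ordered
    categories like a numeric feature to find the best binary partition."""
    categories = np.asarray(categories)
    grad, hess = np.asarray(grad, float), np.asarray(hess, float)

    cats = np.unique(categories)
    g = np.array([grad[categories == c].sum() for c in cats])   # per-category gradient sums
    h = np.array([hess[categories == c].sum() for c in cats])   # per-category hessian sums

    order = np.argsort(g / (h + lam))                            # sort by grad/hess ratio
    g, h, cats = g[order], h[order], cats[order]

    G, H = g.sum(), h.sum()
    best_gain, best_left = -np.inf, None
    gl = hl = 0.0
    for i in range(len(cats) - 1):                               # k-1 candidate partitions
        gl, hl = gl + g[i], hl + h[i]
        gr, hr = G - gl, H - hl
        # Gain proportional to the usual second-order split gain (constants dropped).
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_left = gain, set(cats[:i + 1])
    return best_gain, best_left
```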
Approach 2 is problematic in that it potentially involves expensive distributed shuffling or data movement, and it also requires making a copy of the input matrix; approach 1 would be significantly less memory-hungry.
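For comparison, a minimal sketch of the preprocessing route, assuming a CatBoost-style ordered target-statistic encoding. The `prior_weight` smoothing and the single pass over one data ordering are simplifications (CatBoost averages statistics over several random permutations), and the returned array is exactly the extra encoded copy of the column that drives the memory and data-movement concern above.

```python
import numpy as np

def ordered_target_encode(categories, y, prior_weight=1.0):
    """Illustrative sketch of approach 2: replace each category value with
    a running mean of the target over previously seen rows of the same
    category, smoothed towards the global prior, so the encoding of row i
    never uses row i's own label (avoids target leakage)."""
    categories, y = np.asarray(categories), np.asarray(y, float)
    prior = y.mean()
    sums, counts = {}, {}
    encoded = np.empty(len(y), dtype=float)
    for i, (c, t) in enumerate(zip(categories, y)):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        sums[c], counts[c] = s + t, n + 1                # update running statistics
    return encoded                                       # new numeric feature column
```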