Existing GBM implementations present two approaches:
1. Implement customised categorical split logic in the tree construction. Sort gradients/hessians of all categories within a feature and find the optimal split. LightGBM/XGBoost follow this approach.
2. Preprocess the dataset to encode new features carrying category information. CatBoost uses this approach. It is also implemented here: https://contrib.scikit-learn.org/category_encoders/catboost.html
Approach 1 is problematic in that it creates a bias towards selecting categorical features over numeric features. Our implementation should ideally be unbiased. Perhaps there is a way to randomly select a single split, as we do for numeric features, in an unbiased way?
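For concreteness, a rough sketch of the approach-1 split search under a second-order (gradient/hessian) gain formulation is below. The function name, the `lam` regularisation term, and the exact gain expression are illustrative assumptions, not anything specified in this issue; the point is that sorting categories by their gradient/hessian ratio and then scanning the ordered categories gives an exhaustive search over partitions, which is also where the extra splitting power, and hence the bias towards high-cardinality categorical features, comes from.

```python
import numpy as np

def best_categorical_split(categories, grad, hess, lam=1.0):
    """Illustrative sketch of approach 1 (LightGBM/XGBoost-style):
    sort categories by sum(grad)/sum(hess), then scan the ordered
    categories like a numeric feature to find the best binary partition."""
    categories = np.asarray(categories)
    grad, hess = np.asarray(grad, float), np.asarray(hess, float)

    cats = np.unique(categories)
    g = np.array([grad[categories == c].sum() for c in cats])   # per-category gradient sums
    h = np.array([hess[categories == c].sum() for c in cats])   # per-category hessian sums

    order = np.argsort(g / (h + lam))                            # sort by grad/hess ratio
    g, h, cats = g[order], h[order], cats[order]

    G, H = g.sum(), h.sum()
    best_gain, best_left = -np.inf, None
    gl = hl = 0.0
    for i in range(len(cats) - 1):                               # k-1 candidate partitions
        gl, hl = gl + g[i], hl + h[i]
        gr, hr = G - gl, H - hl
        # Gain proportional to the usual second-order split gain (constants dropped).
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_left = gain, set(cats[:i + 1])
    return best_gain, best_left
```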
Approach 2 is problematic in that it potentially involves expensive distributed shuffling or data movement, and it also requires making a copy of the input matrix; approach 1 would be significantly less memory-hungry.
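For comparison, a minimal sketch of the preprocessing route, assuming a CatBoost-style ordered target-statistic encoding. The `prior_weight` smoothing and the single pass over one data ordering are simplifications (CatBoost averages statistics over several random permutations), and the returned array is exactly the extra encoded copy of the column that drives the memory and data-movement concern above.

```python
import numpy as np

def ordered_target_encode(categories, y, prior_weight=1.0):
    """Illustrative sketch of approach 2: replace each category value with
    a running mean of the target over previously seen rows of the same
    category, smoothed towards the global prior, so the encoding of row i
    never uses row i's own label (avoids target leakage)."""
    categories, y = np.asarray(categories), np.asarray(y, float)
    prior = y.mean()
    sums, counts = {}, {}
    encoded = np.empty(len(y), dtype=float)
    for i, (c, t) in enumerate(zip(categories, y)):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        sums[c], counts[c] = s + t, n + 1                # update running statistics
    return encoded                                       # new numeric feature column
```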