-
If not, would it be feasible to run FLAML for LightGBM to get tuned parameters, and then train with those parameters inside a C# application using the Microsoft.ML.LightGBM library?
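A minimal sketch of the Python side of that workflow, assuming a binary classification task (`X_train`, `y_train`, and the time budget are placeholders):

```python
from flaml import AutoML

# Placeholders: X_train / y_train stand in for your data.
automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    estimator_list=["lgbm"],  # restrict the search to LightGBM
    time_budget=600,          # seconds; placeholder budget
)

# The tuned hyperparameters, e.g. n_estimators, num_leaves, learning_rate,
# min_child_samples, colsample_bytree, reg_alpha, reg_lambda.
print(automl.best_config)
```

The values in `best_config` would then be transcribed onto the C# trainer options; for example, `num_leaves`, `n_estimators`, `learning_rate`, and `min_child_samples` correspond to `NumberOfLeaves`, `NumberOfIterations`, `LearningRate`, and `MinimumExampleCountPerLeaf` on the LightGbm trainer's `Options` in Microsoft.ML.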
-
I'm having a hard time getting the same results; the C# model has much lower accuracy. I will continue by comparing the C# and Python defaults. Currently I am setting a few LightGBM variables explicitly and keeping the others at their defaults.
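One way to make that comparison concrete, as a minimal sketch with hypothetical values (`num_leaves` and `learning_rate` below are placeholders, not the actual settings): train on the Python side and print the parameter set the booster was actually trained with, then line it up field by field against the Microsoft.ML.LightGBM trainer options.

```python
import lightgbm as lgb

# Placeholder values for illustration only; substitute the tuned parameters.
params = {
    "objective": "binary",
    "num_leaves": 31,       # hypothetical tuned value
    "learning_rate": 0.1,   # hypothetical tuned value
    "verbosity": -1,
}
train_set = lgb.Dataset(X_train, label=y_train)  # X_train/y_train: your data
booster = lgb.train(params, train_set, num_boost_round=100)

# The parameters this booster was trained with; any field that differs from
# the C# trainer's effective settings (or from LightGBM's documented
# defaults) is a candidate cause of the accuracy gap.
print(booster.params)
```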
-
@sonichi If the random seed makes a big difference on some datasets, that might have implications for the optimal search space. The choice of booster also made quite a big difference. I might try to "tune" the seed and the booster (see the sketch below). Would you or your team have suggestions on the best way to do that?
I have one concern about it: if the tuning results rely on randomly dropped columns, and it matters a lot which columns get dropped, then when the next dataset version removes a column from the beginning of the file, the tuning results are no longer valid and tuning needs to be redone. Not every developer will realize that a small change to the dataset, like removing a column or engineering a new column in the pipeline, might break the optimal values. Therefore, perhaps the default search space should avoid randomness-based parameters altogether? Microsoft.ML does not appear to use them. The tuned results from Microsoft.ML are fairly close to FLAML's, but the tree is much bigger in Microsoft.ML. For experimentation, we may need a public dataset with a high number of columns (and a correspondingly high risk of overfitting) so that colsample_bytree becomes effective. I cannot share the current dataset, but its tuned colsample_bytree values range from 0.67 at the lowest to 0.9 at the highest (depending on the metric), and it has 2500 columns.
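A minimal sketch of one way to fold the seed and the booster into the search, assuming FLAML's `custom_hp` argument (available in recent FLAML versions; the ranges and choices below are placeholders, not recommendations):

```python
from flaml import AutoML, tune

# Hypothetical search-space extension: let the tuner also pick the seed
# and the boosting type for the LightGBM estimator.
custom_hp = {
    "lgbm": {
        "seed": {
            "domain": tune.randint(lower=1, upper=100),
            "init_value": 1,
        },
        "boosting": {
            "domain": tune.choice(["gbdt", "dart"]),
            "init_value": "gbdt",
        },
    }
}

automl = AutoML()
automl.fit(
    X_train,
    y_train,                  # placeholders for your data
    task="classification",
    estimator_list=["lgbm"],
    custom_hp=custom_hp,
    time_budget=600,          # seconds; placeholder budget
)
print(automl.best_config)
```

Fixing a tuned seed at least makes runs reproducible, though, as noted above, the column sampling it drives still depends on column order, so a schema change can still invalidate the result.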
-
Are there plans to create a C# port of the FLAML library? Or do you know of any community projects doing it?