[Bug]: It seems that longer time budgets result in worse outputs #1394

Open
kabeersvohra opened this issue Jan 17, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@kabeersvohra

Describe the bug

I have a dataset on which I have tried to optimise hyperparameters with FLAML, and it seems that the model keeps getting worse the longer I give it. Here is a simple example of the code for the model I am trying to optimise:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import f1_score, confusion_matrix, classification_report, precision_score, recall_score
from flaml import AutoML
import numpy as np
import joblib

def create_and_train_pipeline(X_train, y_train, X_test, y_test, numerical_features, categorical_features, time_budget=60):
    """
    Creates and trains a pipeline without requiring a custom wrapper class
    """
    # First, create and fit the preprocessor
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())
    ])
    
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        remainder='drop',
        sparse_threshold=0
    )
    
    # Fit the preprocessor first
    X_train_transformed = preprocessor.fit_transform(X_train)
    
    # Train AutoML on the transformed data
    automl = AutoML()
    
    # Train AutoML
    settings = {
        "time_budget": time_budget,
        "task": "classification",
        "estimator_list": ['lgbm', 'rf'],
        "eval_method": "cv",
        "metric": "f1",
        "n_splits": 5,
        "split_type": "stratified"
    }
    
    automl.fit(X_train_transformed, y_train, **settings)
    
    # Create final pipeline with best model
    final_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', automl.model.estimator)  # Use the best model directly
    ])
    
    # Print training results
    print("Best ML model:")
    print(automl.model.estimator)
    print("\nBest hyperparameter configuration:")
    print(automl.best_config)
    # best_loss is the objective FLAML minimises (for the built-in "f1" metric this is a loss, not a score)
    print("\nBest score on validation data: {:.4f}".format(automl.best_loss))
    
    # Generate and print metrics on the held-out test set
    y_pred = final_pipeline.predict(X_test)
    print("\nTest Set Metrics:")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    
    # Save the pipeline
    joblib.dump(final_pipeline, 'full_prediction_pipeline.joblib')
    
    return final_pipeline, automl

if __name__ == "__main__":
    # X_train, y_train, X_test, y_test are assumed to be loaded and split beforehand
    categorical_features = ['created_on', 'dex_id', 'price_confidence']
    numerical_features = [col for col in X_train.columns if col not in categorical_features]
    
    pipeline, automl = create_and_train_pipeline(
        X_train=X_train,
        y_train=y_train,
        X_test=X_test,
        y_test=y_test,
        numerical_features=numerical_features,
        categorical_features=categorical_features,
        time_budget=35
    )
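
As far as I understand, the "Best score on validation data" line above actually prints automl.best_loss, which is the objective FLAML minimises; for the built-in "f1" metric this should be 1 - f1, so a lower number is better. Something like this should recover the cross-validated f1 of the best configuration:

# assuming best_loss = 1 - f1 for FLAML's built-in "f1" metric
cv_f1 = 1 - automl.best_loss
print("Best cross-validated f1: {:.4f}".format(cv_f1))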

With a budget of 35 seconds this gives a minority-class f1 of 0.37 and a majority-class f1 of 0.96 on the test set:

Best score on validation data: 0.5886

Test Set Metrics:

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.95      0.96       930
           1       0.32      0.45      0.37        49

    accuracy                           0.92       979
   macro avg       0.64      0.70      0.67       979
weighted avg       0.94      0.92      0.93       979


Confusion Matrix:
[[883  47]
 [ 27  22]]

If I increase the budget to 60 seconds I get a minority-class f1 of 0.34 and a majority-class f1 of 0.96:

Best score on validation data: 0.5815

Test Set Metrics:

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.95      0.96       930
           1       0.30      0.39      0.34        49

    accuracy                           0.92       979
   macro avg       0.63      0.67      0.65       979
weighted avg       0.93      0.92      0.93       979


Confusion Matrix:
[[885  45]
 [ 30  19]]

And after 120 seconds, a minority-class f1 of 0.33 and a majority-class f1 of 0.96:

Test Set Metrics:

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.95      0.96       930
           1       0.29      0.39      0.33        49

    accuracy                           0.92       979
   macro avg       0.63      0.67      0.65       979
weighted avg       0.93      0.92      0.93       979


Confusion Matrix:
[[884  46]
 [ 30  19]]

I am wondering why this happens. The error in the logs keeps going down, yet the resulting model is worse. This also happens when I define my own custom metric (negating its output, of course, since FLAML minimises it). Even as the negative number is minimised (its absolute value getting larger), the final confusion matrix gets worse. What am I doing wrong here? Thanks a lot.
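
For reference, the custom metric I tried is roughly shaped like this (a sketch following the custom-metric signature shown in the FLAML docs; the name minority_f1_metric is just mine):

from sklearn.metrics import f1_score

def minority_f1_metric(
    X_val, y_val, estimator, labels,
    X_train, y_train, weight_val=None, weight_train=None,
    *args, **kwargs,
):
    # FLAML minimises the first return value, so return the negated f1
    # (returning 1 - f1 would work equivalently); the dict is only logged.
    y_pred = estimator.predict(X_val)
    val_f1 = f1_score(y_val, y_pred, pos_label=1)
    return -val_f1, {"val_f1": val_f1}

# passed in via: automl.fit(X_train_transformed, y_train, metric=minority_f1_metric, **other_settings)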

Steps to reproduce

No response

Model Used

No response

Expected Behavior

No response

Screenshots and logs

No response

Additional Information

No response

@kabeersvohra kabeersvohra added the bug Something isn't working label Jan 17, 2025
@thinkall
Collaborator

Hi @kabeersvohra , it could be caused by overfitting or randomness. Looking at the confusion matrices, you can see that the numbers are very close.
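
With only 49 positive samples in your test set, a handful of flipped predictions moves the minority-class f1 by several points, so some variation between runs is expected. To rule out randomness you could fix the search seed and rerun with different budgets, e.g. (a minimal sketch; as far as I remember, fit accepts a seed setting):

settings = {
    "time_budget": 120,
    "task": "classification",
    "estimator_list": ["lgbm", "rf"],
    "eval_method": "cv",
    "metric": "f1",
    "n_splits": 5,
    "split_type": "stratified",
    "seed": 42,  # fix the hyperparameter-search seed so runs with different budgets are comparable
}
automl.fit(X_train_transformed, y_train, **settings)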
