Imbalanced Dataset vs Cost Function #21

Open
S-C-H opened this issue May 24, 2020 · 3 comments

S-C-H commented May 24, 2020

Hi,

@albahnsen Do you have any thoughts on the relationship between the cost matrix and re-balancing the data? I notice that you do not rebalance in your final logistic regression model in your wonderful paper.

If I have a highly imbalanced dataset where:
<1% are Positive
99% are Negative

But the theoretical cost is:
30 if all are labelled positive
and 1 if all are labelled negative

What should I be adjusting to stop it predicting all Positive? The imbalance? The cost? The iterations?

Thanks!

Edit: I've done some Cross Validation to check different C and max_iter, but it seems like the best savings score I can get is 0 (with the worst being -12).
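
For concreteness, here is roughly the setup I mean as a minimal sketch, using costcla's cost-matrix layout (columns [FP, FN, TP, TN]); the counts and per-example costs below are placeholders, not my real figures:

```python
# Minimal sketch of the scenario above, assuming costcla's cost_mat layout:
# columns = [FP cost, FN cost, TP cost, TN cost] per example.
import numpy as np

n = 100_000
y = np.zeros(n, dtype=int)
y[: n // 1000] = 1                  # ~0.1% positives (placeholder rate)

c_fp, c_fn = 1.0, 30.0              # placeholder per-example error costs
cost_mat = np.column_stack([
    np.full(n, c_fp),               # false positive cost
    np.full(n, c_fn),               # false negative cost
    np.zeros(n),                    # true positive cost
    np.zeros(n),                    # true negative cost
])

# Total cost of the two trivial policies:
print(cost_mat[y == 0, 0].sum())    # label everything positive: every negative is a FP
print(cost_mat[y == 1, 1].sum())    # label everything negative: every positive is a FN
```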

albahnsen (Owner) commented

Hi.

If you're assuming a constant cost per type of error, balancing the input dataset is equivalent to adjusting the decision threshold using the costs.
However, if the costs are example-dependent, balancing the dataset does not give you optimal results.
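
As a minimal sketch of what I mean by adjusting the threshold (assuming constant error costs, zero cost for correct predictions, and reasonably calibrated probabilities; the data here is synthetic):

```python
# Cost-based decision threshold vs. the default 0.5.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

c_fp, c_fn = 1.0, 30.0                          # illustrative constant costs

X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Predict positive when the expected cost of a miss exceeds that of an alert:
# p * c_fn >= (1 - p) * c_fp  <=>  p >= c_fp / (c_fp + c_fn)
t = c_fp / (c_fp + c_fn)
y_pred = (p >= t).astype(int)
print(t, y_pred.mean())                         # threshold and alert rate
```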

> Edit: I've done some Cross Validation to check different C and max_iter, but it seems like the best savings score I can get is 0 (with the worst being -12).

That looks quite suspicious.


S-C-H commented Jun 15, 2020

Thanks for the response, @albahnsen! :)
Can I confirm that the columns of the cost matrix are: false positives, false negatives, true positives, and true negatives?

When I print out the model history (a view of the iterations), it suggests the cost per example for the best model is $0.805161. However, when I manually get the savings score:

```python
cost, cost_base, savings_p = savings_score(y_vec, train_predictions, cost_mat)
```

the cost per example is much higher, at roughly the cost per alert, and the model predicts all fraud.

C = 1.0 (no regularization), because I was suspicious about the loss function.
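
For reference, here is how I understand the savings are computed, based on costcla's definition (savings = (cost_base - cost) / cost_base, with cost_base the cheaper of the two trivial all-negative / all-positive policies). Note that in the released costcla, savings_score appears to return a single float, so below I pull the raw costs from cost_loss separately; the numbers are made up to mimic my situation:

```python
# Manual cross-check of savings_score, assuming cost_mat columns = [FP, FN, TP, TN].
import numpy as np
from costcla.metrics import cost_loss, savings_score

n = 1_000
y_true = np.zeros(n, dtype=int)
y_true[0] = 1                                    # a single true fraud
y_pred = np.ones(n, dtype=int)                   # a model predicting all fraud

cost_mat = np.column_stack([np.full(n, 1.0),     # FP (cost per alert)
                            np.full(n, 30.0),    # FN (cost per miss)
                            np.zeros(n),         # TP
                            np.zeros(n)])        # TN

cost = cost_loss(y_true, y_pred, cost_mat)       # 999 alerts -> total cost 999
cost_base = min(cost_loss(y_true, np.zeros(n, dtype=int), cost_mat),
                cost_loss(y_true, np.ones(n, dtype=int), cost_mat))  # all-negative: 30

print(cost / n)                                  # cost per example ~= cost per alert
print(savings_score(y_true, y_pred, cost_mat))   # deeply negative savings
print((cost_base - cost) / cost_base)            # manual value: should match
```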


S-C-H commented Jun 16, 2020

The cost matrix and loss function appear fine, so the problem is with the optimisation of the function.

Now, the reason I suggested downsampling is that whereas you had 0.5% true fraud in your example, mine is more like 0.05% or worse. =( The optimisation therefore tends to converge to predicting a single class, which is not ideal.
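
As a minimal sketch of one alternative to plain downsampling: cost-proportionate example weighting (in the spirit of Elkan 2001 and Zadrozny et al. 2003), passed through sklearn's sample_weight so the optimiser is not dominated by the ~99.95% negative class. The data and costs here are synthetic:

```python
# Cost-proportionate example weighting instead of dropping negative rows.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
y = (rng.random(n) < 0.0005).astype(int)         # ~0.05% positives
X = rng.normal(size=(n, 5)) + y[:, None] * 1.5   # weak synthetic signal

cost_mat = np.column_stack([np.full(n, 1.0),     # FP
                            np.full(n, 30.0),    # FN
                            np.zeros(n),         # TP
                            np.zeros(n)])        # TN

# Weight each example by the cost of misclassifying it.
w = np.where(y == 1, cost_mat[:, 1], cost_mat[:, 0])

clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
print(clf.predict(X).mean())                     # fraction flagged as fraud
```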
