Imbalanced Dataset vs Cost Function #21
Comments
Hi. If you're assuming a constant cost between errors, balancing the input dataset is equivalent to adjusting the decision threshold according to the costs.
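To make that equivalence concrete, here is a minimal sketch. It assumes constant costs, zero cost for correct predictions, and calibrated probabilities. The function names and the example costs (1 per false positive, 30 per false negative) are hypothetical, chosen only for illustration.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability threshold that minimises expected cost.

    Assumes constant costs, zero cost for correct predictions, and
    calibrated probabilities. Predict positive when p >= threshold.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def negative_keep_rate(cost_fp, cost_fn):
    """Fraction of negatives to keep so that a 0.5 threshold on the
    resampled data matches cost_threshold() on the original data."""
    return cost_fp / cost_fn


def resampled_probability(p, keep_rate):
    """Positive-class probability after keeping only `keep_rate` of the
    negatives, given the probability `p` on the original data."""
    return p / (p + keep_rate * (1.0 - p))


# Hypothetical costs: a false negative is 30 times worse than a false positive.
threshold = cost_threshold(1.0, 30.0)        # 1 / 31
keep_rate = negative_keep_rate(1.0, 30.0)    # 1 / 30
# An example sitting exactly on the cost threshold lands on 0.5 after resampling.
p_after = resampled_probability(threshold, keep_rate)
```

Under these assumptions the two routes give the same decision rule, so rebalancing on top of a cost-based threshold would apply the correction twice.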
Thanks for the response @albahnsen! :) When I print out the model history (the view of the iterations), it suggests the cost per example for the best model is $0.805161. However, when I compute the savings score manually, the cost per example is much higher, at the cost per alert, and the model predicts all fraud. I used C = 1.0 (no regularization) because I was suspicious about the loss function.
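For reference, a manual savings check along these lines could look like the sketch below. It assumes the common definition of savings (one minus the model's cost divided by the cost of the cheaper single-class baseline) and a cost matrix with columns ordered `[FP, FN, TP, TN]`; both are assumptions to verify against the library in use. The function names and the toy data are hypothetical.

```python
import numpy as np


def total_cost(y_true, y_pred, cost_mat):
    """Sum of example-dependent costs.

    cost_mat has one row per example; the column order [FP, FN, TP, TN]
    is an assumption of this sketch.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cost_mat = np.asarray(cost_mat, dtype=float)
    cost = (
        y_true * (y_pred * cost_mat[:, 2] + (1 - y_pred) * cost_mat[:, 1])
        + (1 - y_true) * (y_pred * cost_mat[:, 0] + (1 - y_pred) * cost_mat[:, 3])
    )
    return float(cost.sum())


def savings(y_true, y_pred, cost_mat):
    """1 - cost(model) / cost(cheaper of all-negative and all-positive)."""
    y_true = np.asarray(y_true)
    base = min(
        total_cost(y_true, np.zeros_like(y_true), cost_mat),
        total_cost(y_true, np.ones_like(y_true), cost_mat),
    )
    if base == 0:
        raise ValueError("baseline cost is zero; savings is undefined")
    return 1.0 - total_cost(y_true, y_pred, cost_mat) / base


# Toy data: 9 legitimate examples and 1 fraud. Hypothetical costs:
# an alert costs 1 (FP and TP), a missed fraud costs 30, a TN costs 0.
y_true = np.array([0] * 9 + [1])
cost_mat = np.tile([1.0, 30.0, 1.0, 0.0], (10, 1))

all_fraud = savings(y_true, np.ones_like(y_true), cost_mat)
no_fraud = savings(y_true, np.zeros_like(y_true), cost_mat)
perfect = savings(y_true, y_true, cost_mat)
```

With this definition, a model that collapses to the cheaper single class scores exactly 0 and one that collapses to the more expensive class goes negative, which would be consistent with a model that predicts all fraud.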
The cost matrix and loss function appear fine, so the problem is with the optimisation of the function. The reason I suggested downsampling is that, whereas you had 0.5% true fraud in your example, mine is more like 0.05% or worse. =( As a result, the optimisation tends to converge to predicting a single class. This is not ideal.
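The downsampling mentioned here could be sketched roughly as follows, using NumPy only. The function name `undersample_negatives`, the `keep_rate` parameter, and the toy data are hypothetical; the sketch assumes binary labels in {0, 1} with at least one negative example.

```python
import numpy as np


def undersample_negatives(X, y, keep_rate, seed=0):
    """Keep every positive and a random fraction `keep_rate` of negatives.

    Assumes y contains only 0 (negative) and 1 (positive).
    """
    if not 0 < keep_rate <= 1:
        raise ValueError("keep_rate must be in (0, 1]")
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Always keep at least one negative so both classes survive.
    n_keep = max(1, int(round(keep_rate * neg_idx.size)))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([pos_idx, kept_neg]))
    return X[idx], y[idx]


# Toy data: 1000 negatives and 5 positives (about 0.5% positive).
y = np.array([0] * 1000 + [1] * 5)
X = np.arange(y.size * 2).reshape(y.size, 2)
X_small, y_small = undersample_negatives(X, y, keep_rate=0.1)
```

Note that training on the undersampled data inflates the predicted positive probabilities, so any cost-based threshold has to account for that shift.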
Hi,
@albahnsen Do you have any thoughts on the relationship between the cost matrix and re-balancing the data? I notice that you do not rebalance in your final logistic regression model in your wonderful paper.
If I have a highly imbalanced dataset where:
<1% are Positive
99% are Negative
But the theoretical cost is:
30 if all are labelled positive
and 1 if all are labelled negative
What should I be adjusting to stop it predicting all Positive? The imbalance? The cost? The iterations?
Thanks!
Edit: I've done some cross-validation to check different values of C and max_iter, but it seems the best savings score I can get is 0 (with the worst being -12).