Imbalanced Dataset vs Cost Function #21

Open
S-C-H opened this issue May 24, 2020 · 3 comments

S-C-H commented May 24, 2020

Hi,

@albahnsen Do you have any thoughts on the relationship between the cost matrix and re-balancing the data? I notice that you do not rebalance in your final logistic regression model in your wonderful paper.

If I have a highly imbalanced dataset where:
<1% are Positive
99% are Negative

But the theoretical cost is:
30 if all are labelled positive
and 1 if all are labelled negative

What should I be adjusting to stop it predicting all Positive? The imbalance? The cost? The iterations?

Thanks!

Edit: I've done some Cross Validation to check different C and max_iter, but it seems like the best savings score I can get is 0 (with the worst being -12).
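
For concreteness, here is roughly the setup I mean as a minimal sketch, using costcla's cost-matrix layout (columns [FP, FN, TP, TN]); the counts and per-example costs below are placeholders, not my real figures:

```python
# Minimal sketch of the scenario above, assuming costcla's cost_mat layout:
# columns = [FP cost, FN cost, TP cost, TN cost] per example.
import numpy as np

n = 100_000
y = np.zeros(n, dtype=int)
y[: n // 1000] = 1                  # ~0.1% positives (placeholder rate)

c_fp, c_fn = 1.0, 30.0              # placeholder per-example error costs
cost_mat = np.column_stack([
    np.full(n, c_fp),               # false positive cost
    np.full(n, c_fn),               # false negative cost
    np.zeros(n),                    # true positive cost
    np.zeros(n),                    # true negative cost
])

# Total cost of the two trivial policies:
print(cost_mat[y == 0, 0].sum())    # label everything positive: every negative is a FP
print(cost_mat[y == 1, 1].sum())    # label everything negative: every positive is a FN
```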

albahnsen (Owner) commented

Hi.

If you're assuming a constant cost per type of error, balancing the input dataset is equivalent to adjusting the decision threshold using the costs.
However, if the costs are example-dependent, balancing the dataset does not give you optimal results.
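
As a minimal sketch of what I mean by adjusting the threshold (assuming constant error costs, zero cost for correct predictions, and reasonably calibrated probabilities; the data here is synthetic):

```python
# Cost-based decision threshold vs. the default 0.5.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

c_fp, c_fn = 1.0, 30.0                          # illustrative constant costs

X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Predict positive when the expected cost of a miss exceeds that of an alert:
# p * c_fn >= (1 - p) * c_fp  <=>  p >= c_fp / (c_fp + c_fn)
t = c_fp / (c_fp + c_fn)
y_pred = (p >= t).astype(int)
print(t, y_pred.mean())                         # threshold and alert rate
```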

> Edit: I've done some Cross Validation to check different C and max_iter, but it seems like the best savings score I can get is 0 (with the worst being -12).

That looks quite suspicious.


S-C-H commented Jun 15, 2020

Thanks for the response, @albahnsen! :)
Can I confirm that the columns of the cost matrix are: false positives, false negatives, true positives, and true negatives?

When I print out the model history (a view of the iterations), it suggests the cost per example for the best model is $0.805161. However, when I manually get the savings score:

```python
cost, cost_base, savings_p = savings_score(y_vec, train_predictions, cost_mat)
```

the cost per example is much higher, at roughly the cost per alert, and the model predicts all fraud.

C = 1.0 (no regularization), because I was suspicious about the loss function.
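
For reference, here is how I understand the savings are computed, based on costcla's definition (savings = (cost_base - cost) / cost_base, with cost_base the cheaper of the two trivial all-negative / all-positive policies). Note that in the released costcla, savings_score appears to return a single float, so below I pull the raw costs from cost_loss separately; the numbers are made up to mimic my situation:

```python
# Manual cross-check of savings_score, assuming cost_mat columns = [FP, FN, TP, TN].
import numpy as np
from costcla.metrics import cost_loss, savings_score

n = 1_000
y_true = np.zeros(n, dtype=int)
y_true[0] = 1                                    # a single true fraud
y_pred = np.ones(n, dtype=int)                   # a model predicting all fraud

cost_mat = np.column_stack([np.full(n, 1.0),     # FP (cost per alert)
                            np.full(n, 30.0),    # FN (cost per miss)
                            np.zeros(n),         # TP
                            np.zeros(n)])        # TN

cost = cost_loss(y_true, y_pred, cost_mat)       # 999 alerts -> total cost 999
cost_base = min(cost_loss(y_true, np.zeros(n, dtype=int), cost_mat),
                cost_loss(y_true, np.ones(n, dtype=int), cost_mat))  # all-negative: 30

print(cost / n)                                  # cost per example ~= cost per alert
print(savings_score(y_true, y_pred, cost_mat))   # deeply negative savings
print((cost_base - cost) / cost_base)            # manual value: should match
```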


S-C-H commented Jun 16, 2020

The cost matrix and loss function appear fine, so the problem is with the optimisation of the function.

Now, the reason I suggested downsampling is that whereas you had 0.5% true fraud in your example, mine is more like 0.05% or worse. =( The optimisation therefore tends to converge to predicting a single class, which is not ideal.
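
As a minimal sketch of one alternative to plain downsampling: cost-proportionate example weighting (in the spirit of Elkan 2001 and Zadrozny et al. 2003), passed through sklearn's sample_weight so the optimiser is not dominated by the ~99.95% negative class. The data and costs here are synthetic:

```python
# Cost-proportionate example weighting instead of dropping negative rows.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
y = (rng.random(n) < 0.0005).astype(int)         # ~0.05% positives
X = rng.normal(size=(n, 5)) + y[:, None] * 1.5   # weak synthetic signal

cost_mat = np.column_stack([np.full(n, 1.0),     # FP
                            np.full(n, 30.0),    # FN
                            np.zeros(n),         # TP
                            np.zeros(n)])        # TN

# Weight each example by the cost of misclassifying it.
w = np.where(y == 1, cost_mat[:, 1], cost_mat[:, 0])

clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
print(clf.predict(X).mean())                     # fraction flagged as fraud
```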
