-
Notifications
You must be signed in to change notification settings - Fork 254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: relax recall thresholds for CI #3282
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3282 +/- ##
==========================================
+ Coverage 79.00% 79.07% +0.07%
==========================================
Files 246 246
Lines 86900 87471 +571
Branches 86900 87471 +571
==========================================
+ Hits 68655 69168 +513
- Misses 15377 15439 +62
+ Partials 2868 2864 -4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
#[case(4, DistanceType::Dot, 0.8)] | ||
#[case(4, DistanceType::L2, 0.7)] | ||
#[case(4, DistanceType::Cosine, 0.7)] | ||
#[case(4, DistanceType::Dot, 0.7)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's low the recall requirement for only 4bit IVF_PQ:
- L2 -> 0.85
- Cosine -> 0.85
- Dot -> 0.75
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've scoped the change to just 4bit IVF_PQ
659dea1
to
0a551ce
Compare
@BubbleCal this case failed (recall is 0.89 and target is 0.9). Is this a bug or should we relax this threshold too?
|
let's relax dot to 0.85 |
0a551ce
to
74fff6c
Compare
Looks like you beat me to it in 39f12dc#diff-6de816b72e7c722316243c57df4f809ad34dc8581367c72335154dada48c40ed Thanks! |
I am concerned that a number of our tests test with random data and have quite strict recall thresholds. These tests seem to fail on a regular basis.
On the one hand, this may be a concern that we have broken recall but on the other hand I believe these are just in the category of "it's possible to get bad results sometime with the wrong random data".
In other words, if our CI thresholds are set at P=0.05 then that is too strict because if we have a dozen tests and they fail 5% of the time we will have too many CI failures.
I have proposed some very relaxed suggestions here but feel free to propose alternative suggestions. I'd like these tests to be designed so that if the test fails it is a sign that we broke something and we need to fix it, not just something we ignore.
We could potentially add recall benchmarks to our benchmarking suites if we are concerned about the slow degradation of recall or our recall performance. These should probably be designed with real (not random data) and could hopefully be made more reliable in that way.