sparse: add max score ratio downscaling for approximate searching #1018

Merged 1 commit on Jan 10, 2025

Contributor

@sparknack sparknack commented Jan 8, 2025

Reuse wand_bm25_max_score_ratio for approximate searching.

wand_bm25_max_score_ratio now serves two purposes:

1. Upscale the max score to compensate for avgdl changes.
The term frequency part of the BM25 score is:
tf * (k1 + 1) / (tf + k1 * (1 - b + b * (doc_len / avgdl)))
As more documents are added to the collection, avgdl can change. The WAND index precomputes and caches this score to speed up the search process, but if avgdl changes, recomputing the cached scores is expensive. To avoid this, we upscale the max score by a ratio to compensate for avgdl changes. The upscaled max score is larger than the actual max score, which makes the filtering less aggressive but guarantees correctness.

2. Downscale the max score for approximate searching.
During search, we use the sum of the max scores to filter the candidate vectors: if the sum is smaller than the threshold, the current vector is skipped. If approximate searching is enabled, we can make the skipping more aggressive by downscaling the max score. Since it is relatively unlikely that the max scores of all dims in the query appear on the same vector, downscaling does not cause a sharp drop in recall within a certain range of ratios.
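
The two roles described above can be illustrated with a minimal Python sketch. It is not the code from this PR: the helper names `bm25_tf_score` and `can_skip`, the defaults `k1=1.2` and `b=0.75`, the toy numbers, and the choice to multiply the summed max scores by the ratio are all assumptions made for illustration.

```python
def bm25_tf_score(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Term frequency part of BM25, as written in the description above.

    k1 and b are common defaults, assumed here for illustration.
    """
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * (doc_len / avgdl)))


def can_skip(dim_max_scores, threshold, max_score_ratio=1.0):
    """Hypothetical skip test: scale the summed per-dim max scores by the
    ratio and skip the candidate if that bound is below the threshold."""
    upper_bound = max_score_ratio * sum(dim_max_scores)
    return upper_bound < threshold


# Role 1: the cached max score was computed with avgdl = 100, but the
# collection grew and avgdl is now 120, so the true score is higher.
cached_max = bm25_tf_score(tf=3, doc_len=100, avgdl=100)
actual_max = bm25_tf_score(tf=3, doc_len=100, avgdl=120)
# A ratio above 1 (1.05 here) restores a valid upper bound without recomputing.
upscaled_is_safe = 1.05 * cached_max >= actual_max

# Role 2: a ratio below 1 makes skipping more aggressive.
dim_max_scores = [2.0, 1.5, 1.0]  # toy per-dim max scores for one candidate
skip_exact = can_skip(dim_max_scores, threshold=4.2, max_score_ratio=1.0)
skip_approx = can_skip(dim_max_scores, threshold=4.2, max_score_ratio=0.9)

print(cached_max < actual_max, upscaled_is_safe, skip_exact, skip_approx)
```

In this toy case the candidate survives with ratio 1.0 but is skipped with ratio 0.9. It would only be a true top-k result if every query dim reached its max score on that same vector, which is the unlikely case the description refers to.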

Test Result

MSMARCO BM25

| max score ratio | WAND recall | MaxScore recall | WAND query time (ms) | MaxScore query time (ms) |
| --- | --- | --- | --- | --- |
| 1 | 0.996691 | 0.996576 | 3375 | 3456 |
| 0.9 | 0.988983 | 0.992708 | 2599 | 2908 |
| 0.8 | 0.942851 | 0.981490 | 1999 | 2417 |
| 0.7 | 0.797994 | 0.936948 | 1474 | 1912 |

HotpotQA BM25

| max score ratio | WAND recall | MaxScore recall | WAND query time (ms) | MaxScore query time (ms) |
| --- | --- | --- | --- | --- |
| 1 | 0.991757 | 0.991463 | 11885 | 6763 |
| 0.9 | 0.990802 | 0.991445 | 9729 | 6216 |
| 0.8 | 0.981109 | 0.991243 | 8244 | 5592 |
| 0.7 | 0.931632 | 0.988985 | 6752 | 4937 |
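
To read the trade-off out of these numbers, a short sketch like the following computes the speedup and recall drop of each row relative to the ratio 1 row. The rows are copied from the MSMARCO BM25 results; the helper name `tradeoff` and the choice of the ratio 1 row as the baseline are assumptions made for illustration, and recall is treated as a fraction between 0 and 1.

```python
# (ratio, WAND recall, MaxScore recall, WAND ms, MaxScore ms), MSMARCO BM25
msmarco = [
    (1.0, 0.996691, 0.996576, 3375, 3456),
    (0.9, 0.988983, 0.992708, 2599, 2908),
    (0.8, 0.942851, 0.981490, 1999, 2417),
    (0.7, 0.797994, 0.936948, 1474, 1912),
]


def tradeoff(rows):
    """Return, per row, the speedup and recall drop of each algorithm
    relative to the first row (assumed to be the ratio 1 baseline)."""
    _, wand_rec0, ms_rec0, wand_ms0, ms_ms0 = rows[0]
    result = []
    for ratio, wand_rec, ms_rec, wand_ms, ms_ms in rows:
        result.append({
            "ratio": ratio,
            "wand_speedup": wand_ms0 / wand_ms,
            "wand_recall_drop": wand_rec0 - wand_rec,
            "maxscore_speedup": ms_ms0 / ms_ms,
            "maxscore_recall_drop": ms_rec0 - ms_rec,
        })
    return result


for row in tradeoff(msmarco):
    print(row)
```

The same function can be applied to the HotpotQA rows by building a second list in the same column order.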


mergify bot commented Jan 8, 2025

@sparknack 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!


codecov bot commented Jan 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.91%. Comparing base (3c46f4c) to head (ab9cb61).
Report is 285 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##           main    #1018       +/-   ##
=========================================
+ Coverage      0   73.91%   +73.91%     
=========================================
  Files         0       82       +82     
  Lines         0     6981     +6981     
=========================================
+ Hits          0     5160     +5160     
- Misses        0     1821     +1821     

see 82 files with indirect coverage changes

@mergify mergify bot added ci-passed and removed ci-passed labels Jan 9, 2025
* make the max score larger than the actual max score, it makes the
* filtering less aggressive, but guarantees the correctness.
* The larger the ratio, the less aggressive the filtering is.
* wand_bm25_max_score_ratio is assigned two functions:
Collaborator

The term frequency part of the BM25 score is: ... . But if avgdl changes, the max score changes. This ratio can be used to adjust the max score used during the search.

  1. if set to a value greater than 1, ...
  2. if set to a value less than 1, ...

@zhengbuqian
Collaborator

/lgtm
/approve

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sparknack, zhengbuqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zhengbuqian
Collaborator

/kind improvement

@sre-ci-robot sre-ci-robot merged commit dadbcfc into zilliztech:main Jan 10, 2025
13 of 14 checks passed