sparse: add max score ratio downscaling for approximate searching #1018

sparknack · 2025-01-08T15:08:59Z

Reuse wand_bm25_max_score_ratio for approximate searching.

wand_bm25_max_score_ratio now serves two purposes:

1. Upscale the max score to compensate for avgdl changes.
The term-frequency part of the BM25 score is:
tf * (k1 + 1) / (tf + k1 * (1 - b + b * (doc_len / avgdl)))
As more documents are added to the collection, avgdl can change. The
WAND index precomputes and caches this score to speed up the search
process, so a change in avgdl would require recomputing the cached
scores, which is expensive. To avoid this, we upscale the max score
by a ratio to compensate for avgdl changes. The scaled max score is
then larger than the actual max score, which makes the filtering less
aggressive but preserves correctness as long as the scaled value is
still an upper bound on the actual score.

2. Downscale the max score for approximate searching.
During search, we use the sum of the max scores to filter candidate
vectors: if the sum is smaller than the threshold, the current vector
is skipped. When approximate searching is enabled, we can make the
skipping more aggressive by downscaling the max score. Since it is
relatively unlikely that the max scores of all dims in the query
occur on the same vector, the recall rate does not change sharply
within a certain range of ratios.
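
The two roles of the ratio can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in this PR: the function names `bm25_tf_score` and `can_skip`, the defaults k1 = 1.2 and b = 0.75, and every numeric input below are hypothetical values chosen for the example.

```python
def bm25_tf_score(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Term-frequency part of BM25, as written in the PR description.

    The k1 and b defaults are common BM25 choices, assumed for illustration.
    """
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * (doc_len / avgdl)))


def can_skip(term_max_scores, threshold, ratio=1.0):
    """Return True if a candidate vector can be skipped.

    The upper bound is the sum of the per-term max scores scaled by ratio;
    the candidate is skipped when that bound is smaller than the threshold.
    """
    return ratio * sum(term_max_scores) < threshold


# Role 1: a ratio > 1 keeps a cached max score valid after avgdl drifts.
cached = bm25_tf_score(tf=3, doc_len=100, avgdl=100)  # cached at index time
actual = bm25_tf_score(tf=3, doc_len=100, avgdl=120)  # avgdl grew afterwards
stale_bound_is_too_low = cached < actual              # cached value underestimates
upscaled_bound_is_safe = cached * 1.05 >= actual      # ratio 1.05 restores the bound

# Role 2: a ratio < 1 skips more candidates, at the cost of possible misses.
max_scores = [1.5, 1.0, 0.8]  # hypothetical per-term max scores (sum is 3.3)
threshold = 3.0               # hypothetical current top-k threshold
skipped_exact = can_skip(max_scores, threshold, ratio=1.0)   # 3.3 is not below 3.0
skipped_approx = can_skip(max_scores, threshold, ratio=0.9)  # about 2.97 is below 3.0
```

In this example the candidate survives filtering at ratio 1.0 but is skipped at ratio 0.9, even though its true score could be as high as 3.3. That kind of miss is the source of recall loss in approximate mode.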

Test Result

MSMARCO BM25

| max score ratio | WAND recall | MaxScore recall | WAND query time (ms) | MaxScore query time (ms) |
|---|---|---|---|---|
| 1 | 0.996691 | 0.996576 | 3375 | 3456 |
| 0.9 | 0.988983 | 0.992708 | 2599 | 2908 |
| 0.8 | 0.942851 | 0.981490 | 1999 | 2417 |
| 0.7 | 0.797994 | 0.936948 | 1474 | 1912 |

HotpotQA BM25

| max score ratio | WAND recall | MaxScore recall | WAND query time (ms) | MaxScore query time (ms) |
|---|---|---|---|---|
| 1 | 0.991757 | 0.991463 | 11885 | 6763 |
| 0.9 | 0.990802 | 0.991445 | 9729 | 6216 |
| 0.8 | 0.981109 | 0.991243 | 8244 | 5592 |
| 0.7 | 0.931632 | 0.988985 | 6752 | 4937 |
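
To read the trade-off off the tables above, it can help to express each row relative to the ratio = 1 baseline. The snippet below is a small sketch that does this for the MSMARCO WAND columns. The numbers are copied from the table; the helper name `relative_to_baseline` is hypothetical.

```python
# (max score ratio, WAND recall, WAND query time in ms), copied from the MSMARCO table
MSMARCO_WAND = [
    (1.0, 0.996691, 3375),
    (0.9, 0.988983, 2599),
    (0.8, 0.942851, 1999),
    (0.7, 0.797994, 1474),
]


def relative_to_baseline(rows):
    """Return (ratio, recall_drop, speedup) per row, relative to the first row."""
    _, base_recall, base_ms = rows[0]
    return [
        (ratio, base_recall - recall, base_ms / ms)
        for ratio, recall, ms in rows
    ]


for ratio, recall_drop, speedup in relative_to_baseline(MSMARCO_WAND):
    print(f"ratio={ratio:.1f}  recall drop={recall_drop:.6f}  speedup={speedup:.2f}x")
```

For WAND on MSMARCO, a ratio of 0.9 gives roughly a 1.3x speedup for a recall drop of about 0.0077, while the drop grows much faster below 0.8.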

mergify · 2025-01-08T15:09:41Z

@sparknack 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

If you're fixing a bug, label it as kind/bug.
For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

codecov · 2025-01-09T07:51:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.91%. Comparing base (3c46f4c) to head (ab9cb61).
Report is 285 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##           main    #1018       +/-   ##
=========================================
+ Coverage      0   73.91%   +73.91%     
=========================================
  Files         0       82       +82     
  Lines         0     6981     +6981     
=========================================
+ Hits          0     5160     +5160     
- Misses        0     1821     +1821

see 82 files with indirect coverage changes

zhengbuqian · 2025-01-10T02:34:23Z

src/index/sparse/sparse_inverted_index_config.h

-         * make the max score larger than the actual max score, it makes the
-         * filtering less aggressive, but guarantees the correctness.
-         * The larger the ratio, the less aggressive the filtering is.
+         * wand_bm25_max_score_ratio is assigned two functions:


The term frequency part of the BM25 score is: ... . But if avgdl changes, the max score changes. This can be used to adjust the max score used to compute the max score.

if set to a value greater than 1, ...

if set to a value less than 1, ...

Signed-off-by: Shawn Wang <[email protected]>

zhengbuqian · 2025-01-10T06:19:50Z

/lgtm
/approve

sre-ci-robot · 2025-01-10T06:19:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sparknack, zhengbuqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [zhengbuqian]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zhengbuqian · 2025-01-10T06:37:33Z

/kind improvement

sre-ci-robot requested review from hhy3 and liliu-z January 8, 2025 15:09

sre-ci-robot added the size/M label Jan 8, 2025

mergify bot added dco-passed do-not-merge/missing-related-issue ci-passed labels Jan 8, 2025

mergify bot added ci-passed and removed ci-passed labels Jan 9, 2025

zhengbuqian reviewed Jan 10, 2025

sparse: add max score ratio downscaling for approximate searching

ab9cb61

Signed-off-by: Shawn Wang <[email protected]>

sparknack force-pushed the sparse-approx branch from 1a462da to ab9cb61 Compare January 10, 2025 04:14

mergify bot added ci-passed and removed ci-passed labels Jan 10, 2025

sre-ci-robot assigned zhengbuqian Jan 10, 2025

sre-ci-robot added the lgtm label Jan 10, 2025

sre-ci-robot added the approved label Jan 10, 2025

sre-ci-robot added the kind/improvement label Jan 10, 2025

mergify bot removed the do-not-merge/missing-related-issue label Jan 10, 2025

sre-ci-robot merged commit dadbcfc into zilliztech:main Jan 10, 2025
13 of 14 checks passed
