sparse: add max score ratio downscaling for approximate searching #1018

Merged
merged 1 commit into from
Jan 10, 2025
34 changes: 23 additions & 11 deletions src/index/sparse/sparse_inverted_index_config.h
@@ -45,20 +45,32 @@ class SparseInvertedIndexConfig : public BaseConfig {
 /**
  * The term frequency part of score of BM25 is:
  * tf * (k1 + 1) / (tf + k1 * (1 - b + b * (doc_len / avgdl)))
- * as more documents being added to the collection, avgdl can also
- * change. In WAND index we precompute and cache this score in order to
- * speed up the search process, but if avgdl changes, we need to
- * re-compute such score which is expensive. To avoid this, we upscale
- * the max score by a ratio to compensate for avgdl changes. This will
- * make the max score larger than the actual max score, it makes the
- * filtering less aggressive, but guarantees the correctness.
- * The larger the ratio, the less aggressive the filtering is.
+ * The WAND algorithm uses the max score of each dim for pruning;
+ * this score is precomputed and cached in our implementation. The
+ * cached max score is not the actual one: it is scaled based on
+ * wand_bm25_max_score_ratio. Different scaling strategies serve
+ * different purposes:
+ * 1. As more documents are added to the collection, avgdl may
+ *    change. Re-computing the cached scores for each segment is
+ *    expensive. To avoid this, we upscale the actual max score by
+ *    a ratio greater than 1.0 to compensate for avgdl changes.
+ *    This makes the cached max score larger than the actual max
+ *    score, so the filtering is less aggressive, but correctness
+ *    is guaranteed.
+ * 2. During a WAND search, we use the sum of the max scores to
+ *    filter candidate vectors: if the sum is smaller than the
+ *    threshold, the current vector is skipped. If approximate
+ *    search is acceptable, we can make the skipping more aggressive
+ *    by downscaling the max scores with a ratio less than 1.0.
+ *    Since it is unlikely that the max scores of all dims in the
+ *    query occur on the same vector, this does not cause a sharp
+ *    decline in recall within a certain range.
  */
 KNOWHERE_CONFIG_DECLARE_FIELD(wand_bm25_max_score_ratio)
-    .set_range(1.0, 1.3)
+    .set_range(0.5, 1.3)
     .set_default(1.05)
-    .description("ratio to upscale max score to compensate for avgdl changes")
-    .for_train()
+    .description("ratio to upscale/downscale the max score of each dimension")
+    .for_train_and_search()
     .for_deserialize()
     .for_deserialize_from_file();
 }