feat!: support multivector type #3190

Merged
merged 45 commits into from
Jan 8, 2025

Conversation

BubbleCal
Contributor

@BubbleCal BubbleCal commented Dec 2, 2024

@github-actions github-actions bot added the enhancement New feature or request label Dec 2, 2024
@codecov-commenter

codecov-commenter commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 70.27601% with 140 lines in your changes missing coverage. Please review.

Project coverage is 78.66%. Comparing base (c9bb25d) to head (ef5c3f0).

Files with missing lines Patch % Lines
rust/lance/src/dataset/scanner.rs 57.77% 41 Missing and 16 partials ⚠️
rust/lance/src/index/vector/utils.rs 55.31% 20 Missing and 1 partial ⚠️
rust/lance-linalg/src/distance.rs 66.00% 17 Missing ⚠️
rust/lance-index/src/vector/sq/storage.rs 31.25% 11 Missing ⚠️
rust/lance-index/src/vector/transform.rs 74.35% 8 Missing and 2 partials ⚠️
rust/lance-index/src/vector/flat.rs 52.94% 5 Missing and 3 partials ⚠️
rust/lance/src/index/vector/ivf/v2.rs 95.48% 4 Missing and 2 partials ⚠️
rust/lance-arrow/src/floats.rs 40.00% 3 Missing ⚠️
rust/lance/src/index/vector/ivf.rs 0.00% 0 Missing and 2 partials ⚠️
rust/lance/src/io/exec/knn.rs 0.00% 0 Missing and 2 partials ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3190      +/-   ##
==========================================
+ Coverage   78.58%   78.66%   +0.08%     
==========================================
  Files         250      250              
  Lines       89539    89836     +297     
  Branches    89539    89836     +297     
==========================================
+ Hits        70360    70668     +308     
+ Misses      16293    16256      -37     
- Partials     2886     2912      +26     
Flag Coverage Δ
unittests 78.66% <70.27%> (+0.08%) ⬆️


Signed-off-by: BubbleCal <[email protected]>

- let mut knn_node = if q.refine_factor.is_some() {
+ let mut knn_node = if q.refine_factor.is_some() || is_multivec {
Contributor Author

For multivector columns, the refine step is always required.

Contributor

I don't follow. Why is refine always required?

Contributor Author

This follows the algorithm described in the ColBERT paper: the refine step is required to calculate the MaxSim distance. Without refine, the search only finds the nearest chunks and never applies the MaxSim metric.
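To make the role of the refine step concrete, here is a minimal pure-Python sketch of MaxSim re-ranking. It is not the code from this PR: the helper names (`dot`, `maxsim`, `refine`), the use of a plain dot product as the per-vector similarity, and the document ids are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, assumed here as the per-vector similarity."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query: Sequence[Vector], doc: Sequence[Vector]) -> float:
    """For each query vector take its best match in the document, then sum."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def refine(
    query: Sequence[Vector],
    candidates: Dict[str, List[Vector]],
    k: int,
) -> List[Tuple[str, float]]:
    """Re-rank candidate documents by full MaxSim score and keep the top k.

    `candidates` maps a hypothetical document id to all of that document's vectors.
    """
    scored = [(doc_id, maxsim(query, vecs)) for doc_id, vecs in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "a": [[1.0, 0.0], [0.9, 0.1]],  # holds the single nearest chunk to query[0]
    "b": [[0.8, 0.0], [0.0, 0.8]],  # matches both query vectors reasonably well
}
top = refine(query, candidates, k=1)
```

In this toy data, document "a" contains the single closest chunk to the first query vector, yet "b" ranks first once every query vector is accounted for. That gap between nearest chunks and the summed per-query maxima is what a re-scoring pass over the candidates closes.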

Signed-off-by: BubbleCal <[email protected]>
@BubbleCal BubbleCal marked this pull request as ready for review December 16, 2024 08:38
@BubbleCal BubbleCal requested a review from westonpace December 16, 2024 08:39
Signed-off-by: BubbleCal <[email protected]>
Comment on lines 108 to 111
@pytest.fixture()
def multivec_dataset(tmp_path):
    tbl = create_multivec_table()
    yield lance.write_dataset(tbl, tmp_path)
Contributor

For new tests, let's create the dataset in memory (unless we actually need to access the individual files).

Suggested change
- @pytest.fixture()
- def multivec_dataset(tmp_path):
-     tbl = create_multivec_table()
-     yield lance.write_dataset(tbl, tmp_path)
+ @pytest.fixture()
+ def multivec_dataset():
+     tbl = create_multivec_table()
+     yield lance.write_dataset(tbl, "memory://")

Contributor Author

fixed

Comment on lines 114 to 118
@pytest.fixture()
def indexed_multivec_dataset(tmp_path):
    tbl = create_multivec_table()
    dataset = lance.write_dataset(tbl, tmp_path)
    yield dataset.create_index(
Contributor

Suggested change
- @pytest.fixture()
- def indexed_multivec_dataset(tmp_path):
-     tbl = create_multivec_table()
-     dataset = lance.write_dataset(tbl, tmp_path)
-     yield dataset.create_index(
+ @pytest.fixture()
+ def indexed_multivec_dataset(multivec_dataset):
+     yield multivec_dataset.create_index(

Contributor Author

fixed

Comment on lines 520 to 522
def test_multivec_ann(indexed_multivec_dataset):
    query = np.random.randn(5 * 128)
    indexed_multivec_dataset.scanner(nearest={"column": "vector", "q": query, "k": 100})
Contributor

Can we assert that the output has the expected structure?

Also, are there error cases we need to test? For example, we should get a ValueError if we pass the wrong query type.

Contributor Author

added

let (_, element_type) = get_vector_type(self.dataset.schema(), column)?;
let dim = get_vector_dim(self.dataset.schema(), column)?;
// make sure the query is valid
if q.len() % dim != 0 {
Contributor

Could you explain more how those two are different?
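For readers following the `q.len() % dim != 0` check in the snippet above, here is a small Python sketch of the same idea: a flat query is only valid for a multivector column if it can be cut into whole vectors of length `dim`. The function name `split_query` and the error message are hypothetical, not part of Lance.

```python
from typing import List, Sequence


def split_query(q: Sequence[float], dim: int) -> List[List[float]]:
    """Split a flat query into consecutive vectors of length `dim`.

    Raises ValueError when the flat length is not a whole multiple of `dim`,
    mirroring the modulo check in the Rust snippet above.
    """
    if dim <= 0:
        raise ValueError("dim must be positive")
    if len(q) % dim != 0:
        raise ValueError(
            f"query length {len(q)} is not a multiple of dimension {dim}"
        )
    # Each slice of `dim` consecutive values becomes one query vector.
    return [list(q[i:i + dim]) for i in range(0, len(q), dim)]


flat = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
vectors = split_query(flat, 3)
```

With this convention, the test query `np.random.randn(5 * 128)` against a 128-dimensional column would be read as five query vectors.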

Comment on lines -1540 to +1533
- AggregateMode::Final,
+ AggregateMode::Single,
Contributor

Why this change?

Contributor Author

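AggregateMode::Final is only meant for combining partial results from an earlier partial aggregation stage, so AggregateMode::Single is used here instead.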
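As a conceptual illustration of the distinction, the sketch below contrasts a single-pass aggregation with a two-phase partial/final aggregation in plain Python. It does not use DataFusion's API; the function names and the sum-per-key aggregate are assumptions chosen only to show why a "final" step expects already-aggregated partial input.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]


def aggregate_single(rows: Iterable[Row]) -> Dict[str, int]:
    """One pass over raw (key, value) rows, producing the finished sum per key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def aggregate_partial(partition: Iterable[Row]) -> Dict[str, int]:
    """Phase one of the two-phase plan: aggregate a single partition of raw rows."""
    return aggregate_single(partition)


def aggregate_final(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Phase two: merge per-partition subtotals. Its input is partial state, not raw rows."""
    totals: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals)


rows: List[Row] = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
partitions = [rows[:2], rows[2:]]

single_result = aggregate_single(rows)
two_phase_result = aggregate_final(aggregate_partial(p) for p in partitions)
```

Both plans give the same answer, but the final step only makes sense when a partial step ran before it. When the aggregation reads raw rows directly, the single-pass form is the matching one.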

@BubbleCal BubbleCal requested a review from wjones127 December 24, 2024 11:18
Signed-off-by: BubbleCal <[email protected]>
@BubbleCal BubbleCal changed the title feat: support multivector type feat!: support multivector type Dec 24, 2024
Signed-off-by: BubbleCal <[email protected]>
Contributor

@wjones127 wjones127 left a comment

Looks good. Just have one small suggestion.

if pa.types.is_fixed_size_list(field.type):
    dimension = field.type.list_size
elif pa.types.is_list(field.type):
Contributor

Shouldn't you also check that the child is a fixed-size list?

Suggested change
- elif pa.types.is_list(field.type):
+ elif (pa.types.is_list(field.type) and
+       pa.types.is_fixed_size_list(field.type.value_type)):

@BubbleCal BubbleCal merged commit 94e7bf9 into lancedb:main Jan 8, 2025
26 checks passed