-
Notifications
You must be signed in to change notification settings - Fork 254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: support multivector type #3190
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3190 +/- ##
==========================================
+ Coverage 78.58% 78.66% +0.08%
==========================================
Files 250 250
Lines 89539 89836 +297
Branches 89539 89836 +297
==========================================
+ Hits 70360 70668 +308
+ Misses 16293 16256 -37
- Partials 2886 2912 +26
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: BubbleCal <[email protected]>
55b0e08
to
a171aa6
Compare
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
|
||
let mut knn_node = if q.refine_factor.is_some() { | ||
let mut knn_node = if q.refine_factor.is_some() || is_multivec { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for multivector, refine is always required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont follow, why is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this just follows the algo that colbert paper described, this is required for calculating the maxsim distance. without refine, the search just finds nearest chunks without considering maxsim metric
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
@pytest.fixture() | ||
def multivec_dataset(tmp_path): | ||
tbl = create_multivec_table() | ||
yield lance.write_dataset(tbl, tmp_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For new tests, let's create the dataset in memory (unless we actually need to access the individual files.)
@pytest.fixture() | |
def multivec_dataset(tmp_path): | |
tbl = create_multivec_table() | |
yield lance.write_dataset(tbl, tmp_path) | |
@pytest.fixture() | |
def multivec_dataset(): | |
tbl = create_multivec_table() | |
yield lance.write_dataset(tbl, "memory://") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@pytest.fixture() | ||
def indexed_multivec_dataset(tmp_path): | ||
tbl = create_multivec_table() | ||
dataset = lance.write_dataset(tbl, tmp_path) | ||
yield dataset.create_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pytest.fixture() | |
def indexed_multivec_dataset(tmp_path): | |
tbl = create_multivec_table() | |
dataset = lance.write_dataset(tbl, tmp_path) | |
yield dataset.create_index( | |
@pytest.fixture() | |
def indexed_multivec_dataset(multivec_dataset): | |
yield multivec_dataset.create_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
def test_multivec_ann(indexed_multivec_dataset): | ||
query = np.random.randn(5 * 128) | ||
indexed_multivec_dataset.scanner(nearest={"column": "vector", "q": query, "k": 100}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's assert the output has the expected structure?
Also are there errors we need to test? Like we should get ValueError
if we pass the wrong query type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
rust/lance/src/dataset/scanner.rs
Outdated
let (_, element_type) = get_vector_type(self.dataset.schema(), column)?; | ||
let dim = get_vector_dim(self.dataset.schema(), column)?; | ||
// make sure the query is valid | ||
if q.len() % dim != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain more how those two are different?
AggregateMode::Final, | ||
AggregateMode::Single, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final
is used for combining the partial results
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just have one small suggestion.
python/python/lance/dataset.py
Outdated
if pa.types.is_fixed_size_list(field.type): | ||
dimension = field.type.list_size | ||
elif pa.types.is_list(field.type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you also check the child is a fixed size list too?
elif pa.types.is_list(field.type): | |
elif (pa.types.is_list(field.type) and | |
pa.types.is_fixed_size_list(field.type.value_type)): |
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
related to lancedb/lancedb#1838