You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Opening this issue because it is probably worth some discussion and wonder if you have any ideas or possible solutions to the issue.
Premise
Often when running queries, you may want to deduplicate returned results and exclude them from being considered in things like the TopK collector, this is especially prevalent when working with mutable data within Tantivy (where you might have a primary key), as often it is more efficient to lazily delete old documents rather than greedily executing a delete every time you receive a new document.
Unfortunately, this operation is difficult for users to implement unless they know quite a bit about how collectors work and the collector itself is quite complicated with a few issues to discuss (further down.)
The DistinctFilterCollector would act similarly to the FilterCollector but specifically only allow documents with values it hasn't already seen go through.
Possible syntax
let collector = DistinctFilterCollector::by_fields(vec![primary_key_field],TopDocs::with_limit(10));let top_docs = searcher.search(&my_query,&collector);
In an ideal world, I'd also like this collector to return both the Collector::Fruit of the inner collector it wraps and a set of document addresses it filtered out, this would allow users a bit more flexibility around the application of this collector (for example, removing now "old" documents.)
This would make the syntax closer to:
let collector = DistinctFilterCollector::by_fields(vec![primary_key_field],TopDocs::with_limit(10));let(top_docs, ignored_docs) = searcher.search(&my_query,&collector);// (TopDocs::Fruit, Vec<DocAddress>)
Problems
From my understanding of collectors, this collector would only be able to work on a first-come-first-served basis.
If you wanted to the collector to behave by taking the most recent distinct document rather than first-come-first-served, segments would need to always be searched consistently and merges would also need to maintain this order (not sure if merges do this already or not.)
The text was updated successfully, but these errors were encountered:
If you wanted to the collector to behave by taking the most recent distinct document rather than first-come-first-served, segments would need to always be searched consistently and merges would also need to maintain this order (not sure if merges do this already or not.)
I would... not care. As you mention, merges make it more or less impossible.
Opening this issue because it is probably worth some discussion and wonder if you have any ideas or possible solutions to the issue.
Premise
Often when running queries, you may want to deduplicate returned results and exclude them from being considered in things like the TopK collector, this is especially prevalent when working with mutable data within Tantivy (where you might have a primary key), as often it is more efficient to lazily delete old documents rather than greedily executing a delete every time you receive a new document.
Unfortunately, this operation is difficult for users to implement unless they know quite a bit about how collectors work and the collector itself is quite complicated with a few issues to discuss (further down.)
The
DistinctFilterCollector
would act similarly to theFilterCollector
but specifically only allow documents with values it hasn't already seen go through.Possible syntax
In an ideal world, I'd also like this collector to return both the
Collector::Fruit
of the inner collector it wraps and a set of document addresses it filtered out, this would allow users a bit more flexibility around the application of this collector (for example, removing now "old" documents.)This would make the syntax closer to:
Problems
The text was updated successfully, but these errors were encountered: