Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New collector - DistinctFilterCollector #2565

Open
ChillFish8 opened this issue Jan 5, 2025 · 1 comment
Open

New collector - DistinctFilterCollector #2565

ChillFish8 opened this issue Jan 5, 2025 · 1 comment

Comments

@ChillFish8
Copy link
Collaborator

Opening this issue because it is probably worth some discussion and wonder if you have any ideas or possible solutions to the issue.

Premise

Often when running queries, you may want to deduplicate returned results and exclude them from being considered in things like the TopK collector, this is especially prevalent when working with mutable data within Tantivy (where you might have a primary key), as often it is more efficient to lazily delete old documents rather than greedily executing a delete every time you receive a new document.

Unfortunately, this operation is difficult for users to implement unless they know quite a bit about how collectors work and the collector itself is quite complicated with a few issues to discuss (further down.)

The DistinctFilterCollector would act similarly to the FilterCollector but specifically only allow documents with values it hasn't already seen go through.

Possible syntax

let collector = DistinctFilterCollector::by_fields(vec![primary_key_field], TopDocs::with_limit(10));
let top_docs = searcher.search(&my_query,  &collector);

In an ideal world, I'd also like this collector to return both the Collector::Fruit of the inner collector it wraps and a set of document addresses it filtered out, this would allow users a bit more flexibility around the application of this collector (for example, removing now "old" documents.)

This would make the syntax closer to:

let collector = DistinctFilterCollector::by_fields(vec![primary_key_field], TopDocs::with_limit(10));
let (top_docs, ignored_docs) = searcher.search(&my_query,  &collector);  // (TopDocs::Fruit, Vec<DocAddress>)

Problems

  • From my understanding of collectors, this collector would only be able to work on a first-come-first-served basis.
  • If you wanted to the collector to behave by taking the most recent distinct document rather than first-come-first-served, segments would need to always be searched consistently and merges would also need to maintain this order (not sure if merges do this already or not.)
@fulmicoton
Copy link
Collaborator

If you wanted to the collector to behave by taking the most recent distinct document rather than first-come-first-served, segments would need to always be searched consistently and merges would also need to maintain this order (not sure if merges do this already or not.)

I would... not care. As you mention, merges make it more or less impossible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants