New collector - `DistinctFilterCollector` #2565

ChillFish8 · 2025-01-05T21:08:54Z

Opening this issue because it is probably worth some discussion, and I wonder whether you have any ideas or possible solutions.

Premise

Often when running queries, you may want to deduplicate the returned results so that duplicates are not considered by things like the TopK collector. This is especially common when working with mutable data within Tantivy (where you might have a primary key), as it is often more efficient to delete old documents lazily than to eagerly execute a delete every time you receive a new document.

Unfortunately, this operation is difficult for users to implement unless they know quite a bit about how collectors work. The collector itself is also quite complicated and raises a few issues, discussed further down.

The DistinctFilterCollector would act similarly to the FilterCollector, but would only let through documents whose values it has not already seen.
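To make the intended behaviour concrete, here is a minimal sketch of the "first seen wins" rule in plain Rust, with no Tantivy types involved. The `Doc` alias and the `distinct_first_seen` function are hypothetical names used only for illustration; a real collector would read the key from a fast field rather than from a tuple.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a document: (doc id, primary key value).
type Doc = (u32, u64);

/// Keeps only the first document seen for each key; later duplicates are dropped.
fn distinct_first_seen(docs: &[Doc]) -> Vec<u32> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut kept = Vec::new();
    for &(doc_id, key) in docs {
        // `insert` returns true only when the key was not already present.
        if seen.insert(key) {
            kept.push(doc_id);
        }
    }
    kept
}

fn main() {
    // Docs 2 and 4 repeat the keys of docs 0 and 1, so they are filtered out.
    let docs = [(0, 10), (1, 20), (2, 10), (3, 30), (4, 20)];
    println!("{:?}", distinct_first_seen(&docs));
}
```

The filter has to run before the wrapped collector sees the document, so that duplicates never take up one of the top-K slots.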

Possible syntax

let collector = DistinctFilterCollector::by_fields(vec![primary_key_field], TopDocs::with_limit(10));
let top_docs = searcher.search(&my_query, &collector)?;

In an ideal world, I'd also like this collector to return both the Collector::Fruit of the inner collector it wraps and the set of document addresses it filtered out. This would give users a bit more flexibility in how they apply this collector (for example, removing now "old" documents).

This would make the syntax closer to:

let collector = DistinctFilterCollector::by_fields(vec![primary_key_field], TopDocs::with_limit(10));
let (top_docs, ignored_docs) = searcher.search(&my_query, &collector)?;  // (TopDocs::Fruit, Vec<DocAddress>)
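As a rough illustration of the "fruit plus ignored documents" shape, the sketch below wraps a toy inner collector and returns both results as a tuple. Everything here is a simplified stand-in and not Tantivy's API: `MiniCollector`, `TopN`, `DistinctFilter`, and the local `DocAddress` struct are hypothetical, and the key is passed in directly instead of being read from the index.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for Tantivy's `DocAddress`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DocAddress {
    segment_ord: u32,
    doc_id: u32,
}

/// Simplified stand-in for a collector: receives documents, then yields a "fruit".
trait MiniCollector {
    type Fruit;
    fn collect(&mut self, addr: DocAddress, score: f32);
    fn harvest(self) -> Self::Fruit;
}

/// Toy inner collector keeping the `limit` highest-scoring documents.
struct TopN {
    limit: usize,
    hits: Vec<(f32, DocAddress)>,
}

impl MiniCollector for TopN {
    type Fruit = Vec<(f32, DocAddress)>;
    fn collect(&mut self, addr: DocAddress, score: f32) {
        self.hits.push((score, addr));
    }
    fn harvest(mut self) -> Self::Fruit {
        // Sort by descending score, then keep the best `limit` hits.
        self.hits.sort_by(|a, b| b.0.total_cmp(&a.0));
        self.hits.truncate(self.limit);
        self.hits
    }
}

/// Forwards the first document per key to `inner` and records the rest.
struct DistinctFilter<C> {
    inner: C,
    seen: HashSet<u64>,
    ignored: Vec<DocAddress>,
}

impl<C: MiniCollector> DistinctFilter<C> {
    fn new(inner: C) -> Self {
        Self { inner, seen: HashSet::new(), ignored: Vec::new() }
    }
    fn collect(&mut self, key: u64, addr: DocAddress, score: f32) {
        if self.seen.insert(key) {
            self.inner.collect(addr, score);
        } else {
            self.ignored.push(addr);
        }
    }
    /// Returns the inner fruit together with the filtered-out addresses.
    fn harvest(self) -> (C::Fruit, Vec<DocAddress>) {
        (self.inner.harvest(), self.ignored)
    }
}

fn run_demo() -> (Vec<(f32, DocAddress)>, Vec<DocAddress>) {
    let mut collector = DistinctFilter::new(TopN { limit: 2, hits: Vec::new() });
    let addr = |segment_ord, doc_id| DocAddress { segment_ord, doc_id };
    collector.collect(1, addr(0, 0), 0.5);
    collector.collect(2, addr(0, 1), 0.9);
    collector.collect(1, addr(1, 0), 0.8); // duplicate of key 1, ignored
    collector.collect(3, addr(1, 1), 0.1);
    collector.harvest()
}

fn main() {
    let (top_docs, ignored_docs) = run_demo();
    println!("top: {:?}", top_docs);
    println!("ignored: {:?}", ignored_docs);
}
```

Note that in this demo the ignored document has a higher score than the one kept for the same key, which is exactly the first-come-first-served behaviour discussed under Problems.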

Problems

From my understanding of collectors, this collector would only be able to work on a first-come-first-served basis.
If you wanted the collector to behave by taking the most recent distinct document rather than first-come-first-served, segments would always need to be searched in a consistent order, and merges would also need to maintain this order (not sure if merges do this already or not).
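The order dependence can be shown with a small model, again in plain Rust and under stated assumptions: `Segment` is a hypothetical list of (primary key, version) pairs, and `first_seen` simply visits segments in the order it is given. It is not a model of how Tantivy actually schedules segments.

```rust
use std::collections::BTreeMap;

/// (primary key, version) pairs; a hypothetical model of the documents in one segment.
type Segment = Vec<(u64, u32)>;

/// Visits segments in `order` and keeps the first version seen for each key.
fn first_seen(segments: &[Segment], order: &[usize]) -> BTreeMap<u64, u32> {
    let mut winners = BTreeMap::new();
    for &ord in order {
        for &(key, version) in &segments[ord] {
            // `or_insert` leaves an existing entry untouched, so the first visit wins.
            winners.entry(key).or_insert(version);
        }
    }
    winners
}

fn main() {
    // Key 1 exists as version 1 in segment 0 and as version 2 in segment 1.
    let segments: Vec<Segment> = vec![vec![(1, 1)], vec![(1, 2)]];
    println!("{:?}", first_seen(&segments, &[0, 1]));
    println!("{:?}", first_seen(&segments, &[1, 0]));
}
```

The same data yields a different surviving version depending on which segment is visited first, so "most recent wins" cannot be guaranteed without a stable segment order.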


fulmicoton · 2025-01-06T09:12:41Z

> If you wanted the collector to behave by taking the most recent distinct document rather than first-come-first-served, segments would always need to be searched in a consistent order, and merges would also need to maintain this order (not sure if merges do this already or not).

I would... not care. As you mention, merges make it more or less impossible.
