-
-
Notifications
You must be signed in to change notification settings - Fork 699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(aggregators/metric): Add a top_hits aggregator #2198
feat(aggregators/metric): Add a top_hits aggregator #2198
Conversation
Did those discussions already consider the recently introduced |
Hello @ditsuke, is this something you actually have a need for? Can you describe the use case? |
Yes, the idea behind the new algorithm came up in that discussion thanks to @fulmicoton. I didn't use that here since it wasn't found suitable for pagination, but I'm happy to reconsider if that's wrong |
Hi @fulmicoton, we discussed the use-case on #tantivy-help some weeks back. Our use-case is querying for the top docs in each bucket (really the most recent post by some source). For reference. |
I might be misunderstanding things since I wasn't part of that discussion, but my understanding was that the stable sort order needed for pagination is not so much a question of |
You're actually right! My initial impression was that the elimination step would be unstable for distributions with > n/2 elements ≤ median (so when you can potentially eliminate more than n/2 elements and have to make a choice to retain some of the conflicting values, given that we need to cap the elimination to n/2 elements). But in fact this isn't a problem anymore with our comparator falling back to the doc address on conflicts, so we should be able to use |
Fetching DocsOne tricky part that is missing is fetching the actual content of the document. So far aggregations are limited to fast fields due to the way they operate. Document's data sourceGenerally they are two data-sources for a documents data in top hits: The doc store and fast fields. Doc store access is relatively expensive, fast fields (doc values) access are cheap. Fast field terms may be limited to a certain length, so some long texts may be missing there. When to FetchAggregation works roughly like that Segment Collection => Intermediate Result Intermediate results can be de/serialized, merged and converted to a final result. Fetching a documents data could happen when converting to a intermediate result or final result. When the data-source is a fast field, fetching at the intermediate result step will be fine. Fetch at Intermediate ResultThis will cause some overhead as, we will fetch e.g. Top 10 for each Intermediate Result, but after merging 100 Intermediate Result only keep the Top 10. So we would fetch 10_000 documents. From the doc store that would be expensive. From the fast field this should be fine (except maybe very long texts). Fetch at Final ResultThis would require passing additionally metadata like the segment id. In distributed scenarios like quickwit this may require more additional metadata to be able to resolve at the end, as final result conversion may happen at a different node. ConclusionThese approaches are quite different and require a different approach. Managing both would be possible but definitely adds some complexity. The main problem I see with fetching the final result is that is requires to have access to all the We could limit the aggregation to only handle them with the docvalue fields parameter. I'm not sure which variant is the best approach, it very much depends on user queries. Supporting only |
I think this sounds like the most reasonable approach from a risk management and incremental feature development perspective. |
Thank you! I'm working on the tests |
Agreed, that sounds reasonable. @PSeitz do we handle the docvalue fields support in this PR or with a follow-up? |
6cc3cd9
to
74654bf
Compare
It should be in this PR, or we can't add tests for the aggregation. |
Also removes extraneous the extraneous third-party serialization helper.
@PSeitz thanks for the latest round of review, I'll get back to the PR early next week! |
Since a (name, type) constitutes a unique column.
Introduces a translation step to bridge the difference between ColumnarReaders null `\0` separated json field keys to the common `.` separated used by SegmentReader. Although, this should probably be the default behavior for ColumnarReader's public API perhaps.
@PSeitz I believe all review points have been resolved. |
Hi @PSeitz, I don't want to rush the review but I'm just tagging you in case this slipped through your notifications earlier. Please let me know if there are issues with any of the new updates here. |
@PSeitz can you resume review? |
Looks good so far, except the segment ordinal part |
Thank you, I dropped a comment about the SegmentOrdinal in the review thread. |
Thanks for the PR! (and sorry for the slow Review) |
Thanks for merging and the thorough review, very educational! |
* feat(aggregators/metric): Implement a top_hits aggregator * fix: Expose get_fields * fix: Serializer for top_hits request Also removes extraneous the extraneous third-party serialization helper. * chore: Avert panick on parsing invalid top_hits query * refactor: Allow multiple field names from aggregations * perf: Replace binary heap with TopNComputer * fix: Avoid comparator inversion by ComparableDoc * fix: Rank missing field values lower than present values * refactor: Make KeyOrder a struct * feat: Rough attempt at docvalue_fields * feat: Complete stab at docvalue_fields - Rename "SearchResult*" => "Retrieval*" - Revert Vec => HashMap for aggregation accessors. - Split accessors for core aggregation and field retrieval. - Resolve globbed field names in docvalue_fields retrieval. - Handle strings/bytes and other column types with DynamicColumn * test(unit): Add tests for top_hits aggregator * fix: docfield_value field globbing * test(unit): Include dynamic fields * fix: Value -> OwnedValue * fix: Use OwnedValue's native Null variant * chore: Improve readability of test asserts * chore: Remove DocAddress from top_hits result * docs: Update aggregator doc * revert: accidental doc test * chore: enable time macros only for tests * chore: Apply suggestions from review * chore: Apply suggestions from review * fix: Retrieve all values for fields * test(unit): Update for multi-value retrieval * chore: Assert term existence * feat: Include all columns for a column name Since a (name, type) constitutes a unique column. * fix: Resolve json fields Introduces a translation step to bridge the difference between ColumnarReaders null `\0` separated json field keys to the common `.` separated used by SegmentReader. Although, this should probably be the default behavior for ColumnarReader's public API perhaps. * chore: Address review on mutability * chore: s/segment_id/segment_ordinal instances of SegmentOrdinal * chore: Revert erroneous grammar change
Summary
Implements the
top_hits
aggregator. The aggregator is backed bya BinaryHeap based on prior discussions on Discord#tantivy-devthe newTopNComputer
.