forked from duckdb/duckdb
-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add dictionary size, and use dictionary/constant vectors in the aggre…
…gate hash table to speed up finding groups (duckdb#15152) This PR adjusts the dictionary vector to have an optional dictionary size. This size parameter is set only for duplicate-eliminated dictionaries that are read from disk (either from dictionary-compressed columns in DuckDB's own storage, or from dictionary-compressed columns in Parquet). The dictionary size can be used to optimize various optimizations on the vectors, as we can operate on the underlying dictionary instead of on the individual values, allowing us to operate only on unique values in the vector. This PR adjusts the aggregate hash table to utilize both dictionary and constant vectors when inserting groups. In particular, when figuring out which group a value belongs to, we only need to figure this out once per unique value. For constant vectors, that means we only need to probe the hash table once. For dictionary vectors, we only need to probe the unique values within the dictionary. This behavior is added in the `TryAddCompressedGroups` method. We first figure out which unique values are present in the dictionary (that are also referenced in the vector), and then call `FindOrCreateGroups` only for those unique values. #### UnaryExecutor We also use the dictionary vectors to speed up execution of `UnaryExecutor` - however, we only do this if the function cannot throw errors (`FunctionErrors`). The reason for this is that we actually compute the function for the entire dictionary at this layer instead of only the elements that are explicitly referenced - and if the function can throw an error we might execute the function on a filtered out row and introduce errors. Currently we explicitly define `FunctionErrors:: CANNOT_ERROR` only for compressed materialization - so this is still limited (to be expanded in the future). #### Benchmarks Below are some performance numbers for synthetic data (100M rows): ###### Aggregation over string dictionary ```sql CREATE TABLE strings AS SELECT CONCAT('thisisastringwithrepetitions', i%100) AS grp, i FROM range(100_000_000) tbl(i); SELECT grp, SUM(i) FROM strings GROUP BY ALL ORDER BY ALL; ``` | v1.1.3 | new | |--------|--------| | 0.12s | 0.04s | ###### Aggregation over constant dates ```sql CREATE TABLE dates AS SELECT DATE '1900-01-01' + INTERVAL (i // 50000) MONTH grp, i FROM range(100_000_000) tbl(i); SELECT grp, SUM(i) FROM dates GROUP BY ALL ORDER BY ALL; ``` | v1.1.3 | new | |--------|-------| | 0.07s | 0.04s | #### Limitations * Currently this only works for grouping of individual dictionary vectors (i.e. grouping by a single column). Grouping by multiple dictionary vectors is possible but more challenging especially when it comes to caching of the data. * We limit dictionary size to `20000` - for larger dictionaries we don't consider the optimization.
- Loading branch information
Showing
15 changed files
with
470 additions
and
36 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# name: benchmark/micro/aggregate/constant_aggregate.benchmark | ||
# description: Aggregate Over Constant Groups | ||
# group: [aggregate] | ||
|
||
name Aggregate Over Constant Vectors | ||
group aggregate | ||
storage persistent | ||
|
||
load | ||
CREATE TABLE t AS SELECT DATE '1900-01-01' + INTERVAL (i // 50000) MONTH grp, i FROM range(100_000_000) tbl(i); | ||
|
||
run | ||
SELECT grp, SUM(i) FROM t GROUP BY ALL ORDER BY ALL |
115 changes: 115 additions & 0 deletions
115
benchmark/micro/aggregate/dictionary_aggregate.benchmark
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# name: benchmark/micro/aggregate/dictionary_aggregate.benchmark | ||
# description: Aggregate Over Dictionary Vectors | ||
# group: [aggregate] | ||
|
||
name Aggregate Over Dictionary Vectors | ||
group aggregate | ||
storage persistent | ||
|
||
load | ||
CREATE TABLE t AS SELECT CONCAT('thisisastringwithrepetitions', i%100) AS grp, i FROM range(100_000_000) tbl(i); | ||
|
||
run | ||
SELECT grp, SUM(i) FROM t GROUP BY ALL ORDER BY ALL | ||
|
||
result II | ||
thisisastringwithrepetitions0 49999950000000 | ||
thisisastringwithrepetitions1 49999951000000 | ||
thisisastringwithrepetitions10 49999960000000 | ||
thisisastringwithrepetitions11 49999961000000 | ||
thisisastringwithrepetitions12 49999962000000 | ||
thisisastringwithrepetitions13 49999963000000 | ||
thisisastringwithrepetitions14 49999964000000 | ||
thisisastringwithrepetitions15 49999965000000 | ||
thisisastringwithrepetitions16 49999966000000 | ||
thisisastringwithrepetitions17 49999967000000 | ||
thisisastringwithrepetitions18 49999968000000 | ||
thisisastringwithrepetitions19 49999969000000 | ||
thisisastringwithrepetitions2 49999952000000 | ||
thisisastringwithrepetitions20 49999970000000 | ||
thisisastringwithrepetitions21 49999971000000 | ||
thisisastringwithrepetitions22 49999972000000 | ||
thisisastringwithrepetitions23 49999973000000 | ||
thisisastringwithrepetitions24 49999974000000 | ||
thisisastringwithrepetitions25 49999975000000 | ||
thisisastringwithrepetitions26 49999976000000 | ||
thisisastringwithrepetitions27 49999977000000 | ||
thisisastringwithrepetitions28 49999978000000 | ||
thisisastringwithrepetitions29 49999979000000 | ||
thisisastringwithrepetitions3 49999953000000 | ||
thisisastringwithrepetitions30 49999980000000 | ||
thisisastringwithrepetitions31 49999981000000 | ||
thisisastringwithrepetitions32 49999982000000 | ||
thisisastringwithrepetitions33 49999983000000 | ||
thisisastringwithrepetitions34 49999984000000 | ||
thisisastringwithrepetitions35 49999985000000 | ||
thisisastringwithrepetitions36 49999986000000 | ||
thisisastringwithrepetitions37 49999987000000 | ||
thisisastringwithrepetitions38 49999988000000 | ||
thisisastringwithrepetitions39 49999989000000 | ||
thisisastringwithrepetitions4 49999954000000 | ||
thisisastringwithrepetitions40 49999990000000 | ||
thisisastringwithrepetitions41 49999991000000 | ||
thisisastringwithrepetitions42 49999992000000 | ||
thisisastringwithrepetitions43 49999993000000 | ||
thisisastringwithrepetitions44 49999994000000 | ||
thisisastringwithrepetitions45 49999995000000 | ||
thisisastringwithrepetitions46 49999996000000 | ||
thisisastringwithrepetitions47 49999997000000 | ||
thisisastringwithrepetitions48 49999998000000 | ||
thisisastringwithrepetitions49 49999999000000 | ||
thisisastringwithrepetitions5 49999955000000 | ||
thisisastringwithrepetitions50 50000000000000 | ||
thisisastringwithrepetitions51 50000001000000 | ||
thisisastringwithrepetitions52 50000002000000 | ||
thisisastringwithrepetitions53 50000003000000 | ||
thisisastringwithrepetitions54 50000004000000 | ||
thisisastringwithrepetitions55 50000005000000 | ||
thisisastringwithrepetitions56 50000006000000 | ||
thisisastringwithrepetitions57 50000007000000 | ||
thisisastringwithrepetitions58 50000008000000 | ||
thisisastringwithrepetitions59 50000009000000 | ||
thisisastringwithrepetitions6 49999956000000 | ||
thisisastringwithrepetitions60 50000010000000 | ||
thisisastringwithrepetitions61 50000011000000 | ||
thisisastringwithrepetitions62 50000012000000 | ||
thisisastringwithrepetitions63 50000013000000 | ||
thisisastringwithrepetitions64 50000014000000 | ||
thisisastringwithrepetitions65 50000015000000 | ||
thisisastringwithrepetitions66 50000016000000 | ||
thisisastringwithrepetitions67 50000017000000 | ||
thisisastringwithrepetitions68 50000018000000 | ||
thisisastringwithrepetitions69 50000019000000 | ||
thisisastringwithrepetitions7 49999957000000 | ||
thisisastringwithrepetitions70 50000020000000 | ||
thisisastringwithrepetitions71 50000021000000 | ||
thisisastringwithrepetitions72 50000022000000 | ||
thisisastringwithrepetitions73 50000023000000 | ||
thisisastringwithrepetitions74 50000024000000 | ||
thisisastringwithrepetitions75 50000025000000 | ||
thisisastringwithrepetitions76 50000026000000 | ||
thisisastringwithrepetitions77 50000027000000 | ||
thisisastringwithrepetitions78 50000028000000 | ||
thisisastringwithrepetitions79 50000029000000 | ||
thisisastringwithrepetitions8 49999958000000 | ||
thisisastringwithrepetitions80 50000030000000 | ||
thisisastringwithrepetitions81 50000031000000 | ||
thisisastringwithrepetitions82 50000032000000 | ||
thisisastringwithrepetitions83 50000033000000 | ||
thisisastringwithrepetitions84 50000034000000 | ||
thisisastringwithrepetitions85 50000035000000 | ||
thisisastringwithrepetitions86 50000036000000 | ||
thisisastringwithrepetitions87 50000037000000 | ||
thisisastringwithrepetitions88 50000038000000 | ||
thisisastringwithrepetitions89 50000039000000 | ||
thisisastringwithrepetitions9 49999959000000 | ||
thisisastringwithrepetitions90 50000040000000 | ||
thisisastringwithrepetitions91 50000041000000 | ||
thisisastringwithrepetitions92 50000042000000 | ||
thisisastringwithrepetitions93 50000043000000 | ||
thisisastringwithrepetitions94 50000044000000 | ||
thisisastringwithrepetitions95 50000045000000 | ||
thisisastringwithrepetitions96 50000046000000 | ||
thisisastringwithrepetitions97 50000047000000 | ||
thisisastringwithrepetitions98 50000048000000 | ||
thisisastringwithrepetitions99 50000049000000 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.