Add dictionary size, and use dictionary/constant vectors in the aggregate hash table to speed up finding groups (duckdb#15152)

This PR adjusts the dictionary vector to have an optional dictionary
size. This size parameter is set only for duplicate-eliminated
dictionaries that are read from disk (either from dictionary-compressed
columns in DuckDB's own storage, or from dictionary-compressed columns
in Parquet). The dictionary size enables various optimizations on these
vectors, as we can operate on the underlying dictionary instead of on
the individual values - i.e., we only need to process the unique values
in the vector. A short sketch of the new API follows.
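
As a rough illustration, here is a minimal sketch of attaching the
dictionary size when producing a dictionary vector, using the
`Vector::Dictionary` overload and the `DictionaryVector::DictionarySize`
accessor added in this PR (see the `vector.cpp` diff below). The wrapper
function and its arguments are hypothetical:

```cpp
#include "duckdb/common/types/vector.hpp"

using namespace duckdb;

// Hypothetical helper (not part of this PR): build a dictionary vector that
// carries the size of its duplicate-eliminated dictionary.
void MakeSizedDictionaryVector(Vector &result, const Vector &dictionary, idx_t dictionary_size,
                               const SelectionVector &sel, idx_t count) {
	// references "dictionary" and slices it with "sel"; the size is stored in
	// the underlying dictionary buffer
	result.Dictionary(dictionary, dictionary_size, sel, count);

	// downstream operators can query the size; it is only set for
	// duplicate-eliminated dictionaries read from disk
	auto size = DictionaryVector::DictionarySize(result);
	if (size.IsValid()) {
		// size.GetIndex() unique values can be processed instead of "count" rows
	}
}
```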

This PR also adjusts the aggregate hash table to utilize both
dictionary and constant vectors when inserting groups. In particular,
when determining which group a value belongs to, we only need to do so
once per unique value: for constant vectors, we only need to probe the
hash table once, and for dictionary vectors, we only need to probe the
unique values within the dictionary.

This behavior is added in the `TryAddCompressedGroups` method. We first
figure out which unique dictionary values are actually referenced in the
vector, and then call `FindOrCreateGroups` only for those unique values;
a self-contained sketch of the idea is shown below.
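
To illustrate the idea, here is a self-contained sketch using standard
containers - it is not the actual hash table code, and the name
`FindGroupsViaDictionary` is made up:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Self-contained sketch of the TryAddCompressedGroups idea: probe the group
// hash table once per referenced dictionary entry, then reuse that result for
// every row that points at the same entry.
std::vector<uint64_t> FindGroupsViaDictionary(const std::vector<std::string> &dictionary,
                                              const std::vector<uint32_t> &sel, // row -> dict index
                                              std::unordered_map<std::string, uint64_t> &groups) {
	// cached probe result per dictionary entry; UINT64_MAX marks "not seen yet"
	std::vector<uint64_t> entry_group(dictionary.size(), UINT64_MAX);
	std::vector<uint64_t> result(sel.size());
	for (size_t row = 0; row < sel.size(); row++) {
		auto idx = sel[row];
		if (entry_group[idx] == UINT64_MAX) {
			// first time this dictionary entry is referenced: find or create
			// the group exactly once
			auto it = groups.find(dictionary[idx]);
			if (it == groups.end()) {
				it = groups.emplace(dictionary[idx], groups.size()).first;
			}
			entry_group[idx] = it->second;
		}
		result[row] = entry_group[idx]; // all later rows reuse the cached probe
	}
	return result;
}
```

A constant vector is the extreme case of the same idea: a dictionary
with a single entry, so a single probe suffices for the entire vector.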

#### UnaryExecutor

We also use dictionary vectors to speed up execution of the
`UnaryExecutor` - however, we only do this if the function cannot throw
errors (`FunctionErrors`). The reason is that at this layer we compute
the function for the entire dictionary instead of only for the elements
that are explicitly referenced - if the function could throw, we might
execute it on a filtered-out row and introduce spurious errors.
Currently we explicitly define `FunctionErrors::CANNOT_ERROR` only for
compressed materialization - so this optimization is still limited (to
be expanded in the future). A simplified sketch of the fast path is
shown below.
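
The following self-contained sketch (not the actual `UnaryExecutor`
code) shows the shape of the dictionary fast path:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Self-contained sketch: evaluate a function once per dictionary entry and
// gather the results through the selection vector, instead of evaluating it
// once per row. This is only safe if "fun" cannot throw, because entries that
// no row references are evaluated as well.
template <class INPUT, class RESULT, class FUNC>
std::vector<RESULT> ExecuteOverDictionary(const std::vector<INPUT> &dictionary,
                                          const std::vector<uint32_t> &sel, FUNC fun) {
	std::vector<RESULT> dict_result(dictionary.size());
	for (size_t i = 0; i < dictionary.size(); i++) {
		dict_result[i] = fun(dictionary[i]); // one call per unique value
	}
	std::vector<RESULT> result(sel.size());
	for (size_t row = 0; row < sel.size(); row++) {
		result[row] = dict_result[sel[row]]; // cheap gather per row
	}
	return result;
}
```

In the actual PR, this fast path is only taken when the function is
marked `FunctionErrors::CANNOT_ERROR`.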

#### Benchmarks

Below are some performance numbers for synthetic data (100M rows):

###### Aggregation over string dictionary

```sql
CREATE TABLE strings AS SELECT CONCAT('thisisastringwithrepetitions', i%100) AS grp, i FROM range(100_000_000) tbl(i);
SELECT grp, SUM(i) FROM strings GROUP BY ALL ORDER BY ALL;
```

| v1.1.3 |  new   |
|--------|--------|
| 0.12s  | 0.04s |

###### Aggregation over constant dates

```sql
CREATE TABLE dates AS SELECT DATE '1900-01-01' + INTERVAL (i // 50000) MONTH grp, i FROM range(100_000_000) tbl(i);
SELECT grp, SUM(i) FROM dates GROUP BY ALL ORDER BY ALL;
```

| v1.1.3 |  new  |
|--------|-------|
| 0.07s | 0.04s |

#### Limitations

* Currently this only works for grouping by individual dictionary
vectors (i.e. grouping by a single column). Grouping by multiple
dictionary vectors is possible but more challenging, especially when it
comes to caching the data.
* We limit the dictionary size to `20000` - for larger dictionaries we
don't consider the optimization.
Mytherin authored Dec 6, 2024
2 parents 63fb8c1 + e1f0eb6 commit 9da182a
Showing 15 changed files with 470 additions and 36 deletions.
13 changes: 13 additions & 0 deletions benchmark/micro/aggregate/constant_aggregate.benchmark
@@ -0,0 +1,13 @@
# name: benchmark/micro/aggregate/constant_aggregate.benchmark
# description: Aggregate Over Constant Groups
# group: [aggregate]

name Aggregate Over Constant Vectors
group aggregate
storage persistent

load
CREATE TABLE t AS SELECT DATE '1900-01-01' + INTERVAL (i // 50000) MONTH grp, i FROM range(100_000_000) tbl(i);

run
SELECT grp, SUM(i) FROM t GROUP BY ALL ORDER BY ALL
115 changes: 115 additions & 0 deletions benchmark/micro/aggregate/dictionary_aggregate.benchmark
@@ -0,0 +1,115 @@
# name: benchmark/micro/aggregate/dictionary_aggregate.benchmark
# description: Aggregate Over Dictionary Vectors
# group: [aggregate]

name Aggregate Over Dictionary Vectors
group aggregate
storage persistent

load
CREATE TABLE t AS SELECT CONCAT('thisisastringwithrepetitions', i%100) AS grp, i FROM range(100_000_000) tbl(i);

run
SELECT grp, SUM(i) FROM t GROUP BY ALL ORDER BY ALL

result II
thisisastringwithrepetitions0 49999950000000
thisisastringwithrepetitions1 49999951000000
thisisastringwithrepetitions10 49999960000000
thisisastringwithrepetitions11 49999961000000
thisisastringwithrepetitions12 49999962000000
thisisastringwithrepetitions13 49999963000000
thisisastringwithrepetitions14 49999964000000
thisisastringwithrepetitions15 49999965000000
thisisastringwithrepetitions16 49999966000000
thisisastringwithrepetitions17 49999967000000
thisisastringwithrepetitions18 49999968000000
thisisastringwithrepetitions19 49999969000000
thisisastringwithrepetitions2 49999952000000
thisisastringwithrepetitions20 49999970000000
thisisastringwithrepetitions21 49999971000000
thisisastringwithrepetitions22 49999972000000
thisisastringwithrepetitions23 49999973000000
thisisastringwithrepetitions24 49999974000000
thisisastringwithrepetitions25 49999975000000
thisisastringwithrepetitions26 49999976000000
thisisastringwithrepetitions27 49999977000000
thisisastringwithrepetitions28 49999978000000
thisisastringwithrepetitions29 49999979000000
thisisastringwithrepetitions3 49999953000000
thisisastringwithrepetitions30 49999980000000
thisisastringwithrepetitions31 49999981000000
thisisastringwithrepetitions32 49999982000000
thisisastringwithrepetitions33 49999983000000
thisisastringwithrepetitions34 49999984000000
thisisastringwithrepetitions35 49999985000000
thisisastringwithrepetitions36 49999986000000
thisisastringwithrepetitions37 49999987000000
thisisastringwithrepetitions38 49999988000000
thisisastringwithrepetitions39 49999989000000
thisisastringwithrepetitions4 49999954000000
thisisastringwithrepetitions40 49999990000000
thisisastringwithrepetitions41 49999991000000
thisisastringwithrepetitions42 49999992000000
thisisastringwithrepetitions43 49999993000000
thisisastringwithrepetitions44 49999994000000
thisisastringwithrepetitions45 49999995000000
thisisastringwithrepetitions46 49999996000000
thisisastringwithrepetitions47 49999997000000
thisisastringwithrepetitions48 49999998000000
thisisastringwithrepetitions49 49999999000000
thisisastringwithrepetitions5 49999955000000
thisisastringwithrepetitions50 50000000000000
thisisastringwithrepetitions51 50000001000000
thisisastringwithrepetitions52 50000002000000
thisisastringwithrepetitions53 50000003000000
thisisastringwithrepetitions54 50000004000000
thisisastringwithrepetitions55 50000005000000
thisisastringwithrepetitions56 50000006000000
thisisastringwithrepetitions57 50000007000000
thisisastringwithrepetitions58 50000008000000
thisisastringwithrepetitions59 50000009000000
thisisastringwithrepetitions6 49999956000000
thisisastringwithrepetitions60 50000010000000
thisisastringwithrepetitions61 50000011000000
thisisastringwithrepetitions62 50000012000000
thisisastringwithrepetitions63 50000013000000
thisisastringwithrepetitions64 50000014000000
thisisastringwithrepetitions65 50000015000000
thisisastringwithrepetitions66 50000016000000
thisisastringwithrepetitions67 50000017000000
thisisastringwithrepetitions68 50000018000000
thisisastringwithrepetitions69 50000019000000
thisisastringwithrepetitions7 49999957000000
thisisastringwithrepetitions70 50000020000000
thisisastringwithrepetitions71 50000021000000
thisisastringwithrepetitions72 50000022000000
thisisastringwithrepetitions73 50000023000000
thisisastringwithrepetitions74 50000024000000
thisisastringwithrepetitions75 50000025000000
thisisastringwithrepetitions76 50000026000000
thisisastringwithrepetitions77 50000027000000
thisisastringwithrepetitions78 50000028000000
thisisastringwithrepetitions79 50000029000000
thisisastringwithrepetitions8 49999958000000
thisisastringwithrepetitions80 50000030000000
thisisastringwithrepetitions81 50000031000000
thisisastringwithrepetitions82 50000032000000
thisisastringwithrepetitions83 50000033000000
thisisastringwithrepetitions84 50000034000000
thisisastringwithrepetitions85 50000035000000
thisisastringwithrepetitions86 50000036000000
thisisastringwithrepetitions87 50000037000000
thisisastringwithrepetitions88 50000038000000
thisisastringwithrepetitions89 50000039000000
thisisastringwithrepetitions9 49999959000000
thisisastringwithrepetitions90 50000040000000
thisisastringwithrepetitions91 50000041000000
thisisastringwithrepetitions92 50000042000000
thisisastringwithrepetitions93 50000043000000
thisisastringwithrepetitions94 50000044000000
thisisastringwithrepetitions95 50000045000000
thisisastringwithrepetitions96 50000046000000
thisisastringwithrepetitions97 50000047000000
thisisastringwithrepetitions98 50000048000000
thisisastringwithrepetitions99 50000049000000
2 changes: 1 addition & 1 deletion extension/parquet/column_reader.cpp
@@ -576,7 +576,7 @@ idx_t ColumnReader::Read(uint64_t num_values, parquet_filter_t &filter, data_ptr
ConvertDictToSelVec(reinterpret_cast<uint32_t *>(offset_buffer.ptr),
reinterpret_cast<uint8_t *>(define_out), filter, read_now, result_offset);
if (result_offset == 0) {
- result.Slice(*dictionary, dictionary_selection_vector, read_now);
+ result.Dictionary(*dictionary, dictionary_size + 1, dictionary_selection_vector, read_now);
D_ASSERT(result.GetVectorType() == VectorType::DICTIONARY_VECTOR);
} else {
D_ASSERT(result.GetVectorType() == VectorType::FLAT_VECTOR);
19 changes: 19 additions & 0 deletions src/common/enum_util.cpp
@@ -28,6 +28,7 @@
#include "duckdb/common/enums/file_compression_type.hpp"
#include "duckdb/common/enums/file_glob_options.hpp"
#include "duckdb/common/enums/filter_propagate_result.hpp"
#include "duckdb/common/enums/function_errors.hpp"
#include "duckdb/common/enums/index_constraint_type.hpp"
#include "duckdb/common/enums/join_type.hpp"
#include "duckdb/common/enums/joinref_type.hpp"
@@ -1702,6 +1703,24 @@ FunctionCollationHandling EnumUtil::FromString<FunctionCollationHandling>(const
return static_cast<FunctionCollationHandling>(StringUtil::StringToEnum(GetFunctionCollationHandlingValues(), 3, "FunctionCollationHandling", value));
}

const StringUtil::EnumStringLiteral *GetFunctionErrorsValues() {
static constexpr StringUtil::EnumStringLiteral values[] {
{ static_cast<uint32_t>(FunctionErrors::CANNOT_ERROR), "CANNOT_ERROR" },
{ static_cast<uint32_t>(FunctionErrors::CAN_THROW_ERROR), "CAN_THROW_ERROR" }
};
return values;
}

template<>
const char* EnumUtil::ToChars<FunctionErrors>(FunctionErrors value) {
return StringUtil::EnumToString(GetFunctionErrorsValues(), 2, "FunctionErrors", static_cast<uint32_t>(value));
}

template<>
FunctionErrors EnumUtil::FromString<FunctionErrors>(const char *value) {
return static_cast<FunctionErrors>(StringUtil::StringToEnum(GetFunctionErrorsValues(), 2, "FunctionErrors", value));
}

const StringUtil::EnumStringLiteral *GetFunctionNullHandlingValues() {
static constexpr StringUtil::EnumStringLiteral values[] {
{ static_cast<uint32_t>(FunctionNullHandling::DEFAULT_NULL_HANDLING), "DEFAULT_NULL_HANDLING" },
20 changes: 20 additions & 0 deletions src/common/types/vector.cpp
@@ -231,6 +231,7 @@ void Vector::Slice(const SelectionVector &sel, idx_t count) {
if (GetVectorType() == VectorType::DICTIONARY_VECTOR) {
// already a dictionary, slice the current dictionary
auto &current_sel = DictionaryVector::SelVector(*this);
auto dictionary_size = DictionaryVector::DictionarySize(*this);
auto sliced_dictionary = current_sel.Slice(sel, count);
buffer = make_buffer<DictionaryBuffer>(std::move(sliced_dictionary));
if (GetType().InternalType() == PhysicalType::STRUCT) {
@@ -240,6 +241,9 @@ void Vector::Slice(const SelectionVector &sel, idx_t count) {
new_child.auxiliary = make_buffer<VectorStructBuffer>(new_child, sel, count);
auxiliary = make_buffer<VectorChildBuffer>(std::move(new_child));
}
if (dictionary_size.IsValid()) {
this->buffer->Cast<DictionaryBuffer>().SetDictionarySize(dictionary_size.GetIndex());
}
return;
}

@@ -260,11 +264,24 @@ void Vector::Slice(const SelectionVector &sel, idx_t count) {
auxiliary = std::move(child_ref);
}

void Vector::Dictionary(idx_t dictionary_size, const SelectionVector &sel, idx_t count) {
Slice(sel, count);
if (GetVectorType() == VectorType::DICTIONARY_VECTOR) {
buffer->Cast<DictionaryBuffer>().SetDictionarySize(dictionary_size);
}
}

void Vector::Dictionary(const Vector &dict, idx_t dictionary_size, const SelectionVector &sel, idx_t count) {
Reference(dict);
Dictionary(dictionary_size, sel, count);
}

void Vector::Slice(const SelectionVector &sel, idx_t count, SelCache &cache) {
if (GetVectorType() == VectorType::DICTIONARY_VECTOR && GetType().InternalType() != PhysicalType::STRUCT) {
// dictionary vector: need to merge dictionaries
// check if we have a cached entry
auto &current_sel = DictionaryVector::SelVector(*this);
auto dictionary_size = DictionaryVector::DictionarySize(*this);
auto target_data = current_sel.data();
auto entry = cache.cache.find(target_data);
if (entry != cache.cache.end()) {
@@ -275,6 +292,9 @@ void Vector::Slice(const SelectionVector &sel, idx_t count, SelCache &cache) {
Slice(sel, count);
cache.cache[target_data] = this->buffer;
}
if (dictionary_size.IsValid()) {
this->buffer->Cast<DictionaryBuffer>().SetDictionarySize(dictionary_size.GetIndex());
}
} else {
Slice(sel, count);
}