Fully parallelize index construction #21

Maxxen · 2024-06-20T15:42:13Z

Instead of constructing the index while sinking into the create index operator, we now buffer all input and then spawn one task per thread; these tasks construct the index in parallel with all the data available. This means we now parallelize over vectors instead of row groups regardless of how much data we receive, and we no longer need to resize/reallocate the index multiple times (with extra locking) during construction.

On my machine this gives an almost 10x performance increase, but there are still a number of small optimizations we can make.
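
The buffer-then-build scheme described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `sink`, `build_index`, `VECTOR_SIZE`, and the list-backed toy index are all hypothetical names chosen for the example, and Python threads only show the scheduling structure, not real CPU parallelism.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

VECTOR_SIZE = 4  # hypothetical vector size, kept tiny for illustration


def sink(buffer, chunk):
    """Sink phase: only buffer incoming rows; the index is not touched yet."""
    buffer.extend(chunk)


def build_index(buffer, num_threads):
    """Build phase: spawn one task per thread over the fully buffered input."""
    total_rows = len(buffer)
    # The row count is known up front, so the toy index is allocated once.
    index = [None] * total_rows
    num_vectors = (total_rows + VECTOR_SIZE - 1) // VECTOR_SIZE
    next_vector = 0
    lock = threading.Lock()

    def task():
        nonlocal next_vector
        built = 0
        while True:
            # Claim the next unprocessed vector; this is the only shared state.
            with lock:
                vector_idx = next_vector
                next_vector += 1
            if vector_idx >= num_vectors:
                return built
            start = vector_idx * VECTOR_SIZE
            end = min(start + VECTOR_SIZE, total_rows)
            for row_id in range(start, end):
                # Each slot is written by exactly one task.
                index[row_id] = (row_id, buffer[row_id])
                built += 1

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(task) for _ in range(num_threads)]
        rows_built = sum(f.result() for f in futures)
    return index, rows_built
```

Because every task pulls vector-sized slices from the same buffered input, the work is split evenly no matter how the rows originally arrived, and the index never has to grow while tasks are writing to it.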

Maxxen added 4 commits June 20, 2024 17:37

rework index construction, now fully parallel, almost 10x perf on my machine

1c28fa8

format

a6d1903

dead code

8b4679b

yield on main thread, dont flatten scan chunk

966de8b

Maxxen merged commit 8e3a622 into duckdb:main Jun 21, 2024
22 checks passed

Maxxen mentioned this pull request Jul 2, 2024

Can't shard + index 0.5P of data duckdb/duckdb#12805

Closed

2 tasks
