[Perf][Bench] Join is slow on big tables. #574
Possible points/hints to increase performance:
585ms start(597ms) getAllTableColumnFragments ColumnFetcher.cpp:243
68ms start(597ms) getOneTableColumnFragment ColumnFetcher.cpp:176
58ms start(666ms) ColumnarResults ColumnarResults.cpp:124
69ms start(724ms) getOneTableColumnFragment ColumnFetcher.cpp:176
57ms start(793ms) ColumnarResults ColumnarResults.cpp:124
69ms start(851ms) getOneTableColumnFragment ColumnFetcher.cpp:176
57ms start(921ms) ColumnarResults ColumnarResults.cpp:124
8ms start(979ms) getOneTableColumnFragment ColumnFetcher.cpp:176
7ms start(987ms) ColumnarResults ColumnarResults.cpp:124
187ms start(995ms) mergeResults ColumnarResults.cpp:139

Looks like there are 3 chunk copies. The 1st copy is from a specific fragment into the required format (Arrow),
2070ms start(2ms) getAllTableColumnFragments ColumnFetcher.cpp:248
0ms start(2072ms) lock taken, execution started ColumnFetcher.cpp:262
Support for multi-fragment joins has always been poor. Multi-fragment join hash table construction exists, but the references from the hash table to the actual data are 0-indexed and therefore do not support multiple fragments.
Hash tables should be cached, so I suspect the extra copies are part of the query. It might be worthwhile to check that the hash table cache is working between the
I am curious - what is the fragment size when you run these queries? Can you try running a case where the inner join table fits into a single fragment?
Discussed this issue with Alex.
I found how to reduce the generated fragment_size via the number_fo_fragments variable.

10782ms total duration for executeRelAlgQuery
10781ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:216
1ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:217
10779ms start(2ms) execute RelAlgExecutor.cpp:411
10779ms start(2ms) executeStep RelAlgExecutor.cpp:895
10779ms start(2ms) executeWorkUnit RelAlgExecutor.cpp:1389
547ms start(3ms) compileWorkUnit NativeCodegen.cpp:1403
0ms start(551ms) ExecutionKernel::run ExecutionKernel.cpp:12
4678ms start(551ms) fetchChunks Execute.cpp:3092
0ms start(5229ms) create QueryExecutionContext.cpp:94
108ms start(5229ms) executePlan Execute.cpp:3464
10ms start(5340ms) collectAllDeviceResults Execute.cpp:2550
71ms start(5351ms) compileWorkUnit NativeCodegen.cpp:1403
0ms start(5423ms) ExecutionKernel::run ExecutionKernel.cpp:126
3542ms start(5423ms) fetchChunks Execute.cpp:3092
0ms start(8965ms) create QueryExecutionContext.cpp:94
1023ms start(8965ms) executePlan Execute.cpp:3464

with:

8097ms total duration for executeRelAlgQuery
8097ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:216
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:217
8097ms start(0ms) execute RelAlgExecutor.cpp:411
8097ms start(0ms) executeStep RelAlgExecutor.cpp:895
8096ms start(1ms) executeWorkUnit RelAlgExecutor.cpp:1389
13ms start(1ms) compileWorkUnit NativeCodegen.cpp:1403
0ms start(14ms) ExecutionKernel::run ExecutionKernel.cpp:126
1ms start(14ms) fetchChunks Execute.cpp:3092
0ms start(16ms) create QueryExecutionContext.cpp:94
1787ms start(16ms) executePlan Execute.cpp:3464
0ms start(1803ms) collectAllDeviceResults Execute.cpp:255
54ms start(1804ms) compileWorkUnit NativeCodegen.cpp:1403
0ms start(1858ms) ExecutionKernel::run ExecutionKernel.cpp:126
1ms start(1858ms) fetchChunks Execute.cpp:3092
0ms start(1860ms) create QueryExecutionContext.cpp:94
6236ms start(1860ms) executePlan Execute.cpp:3464
A possible fix is at 25d78bc and the following commit.
Original main on the current join:
5742 ms start(514ms) fetchChunks Execute.cpp:3087 (count*)
4237 ms start(6891ms) fetchChunks Execute.cpp:3087 (join)

Updated with 25d78bc:
428 ms start(455ms) fetchChunks Execute.cpp:3093 (count*)
4558 ms start(1653ms) fetchChunks Execute.cpp:3093 (join)
To estimate the output size of the result query and allocate a buffer, we run `count*` before some actual queries. This estimation requires only the query body arguments, without the `select` arguments, so this commit changes input_cols for the `count*` query. Resolves: #574 Signed-off-by: Dmitrii Makarenko <[email protected]>
Current plan on this issue:
./_launcher/solution.R --solution=pyhdk --task=join --nrow=1e9
[1] "./pyhdk/join-pyhdk.py"
# join-pyhdk.py
pyhdk data_name: J1_1e9_NA_0_0
loading datasets J1_1e9_NA_0_0, J1_1e9_1e3_0_0, J1_1e9_1e6_0_0, J1_1e9_1e9_0_0
Using fragment size 4000000
1000000000
1000
1000000
1000000000
joining...
(899999033, 9)
[thread 851878 also had an error][thread 854019 also had an error][thread 851800 also had an error][thread 851224 also had an error][thread 854989 also had an error][thread 853424 also had an error]
[thread 851529 also had an error]2023-09-13T16:57:41.940055 F 726860 0 194 ColumnarResults.cpp:371 Check failed: type->isString()
[thread 852976 also had an error]
[thread 852381 also had an error]
[thread 853851 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f509cc352d6, pid=726860, tid=853223
#
# JRE version: OpenJDK Runtime Environment (20.0) (build 20-internal-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (20-internal-adhoc..src, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# 2023-09-13T16:57:41.939437 F 726860 0 195 ColumnarResults.cpp:371 Check failed: type->isString()
Aborted (core dumped)
486ms start(2904ms) executePlan Execute.cpp:3464
0ms start(6189ms) resultsUnion Execute.cpp:1134

I was reading the timings in the new threads the wrong way. This time is spent on kernel execution:

New thread(6)
1561ms start(0ms) fetchChunks Execute.cpp:3095
3136ms start(1562ms) executePlan Execute.cpp:3464
3136ms start(1562ms) launchCpuCode QueryExecutionContext.cpp:564
0ms start(4699ms) getRowSet QueryExecutionContext.cpp:192
End thread(6)
New thread(7)
1562ms start(0ms) fetchChunks Execute.cpp:3095
3284ms start(1562ms) executePlan Execute.cpp:3464
3284ms start(1562ms) launchCpuCode QueryExecutionContext.cpp:564
0ms start(4846ms) getRowSet QueryExecutionContext.cpp:192
End thread(7)
New thread(8)
1562ms start(0ms) fetchChunks Execute.cpp:3095
3129ms start(1562ms) executePlan Execute.cpp:3464
3129ms start(1562ms) launchCpuCode QueryExecutionContext.cpp:564
0ms start(4692ms) getRowSet QueryExecutionContext.cpp:192
End thread(8)
The issues with copies and hashing are also present here, so this should remain open.
This commit removes useless copying (memcpy) in `getAllTableColumnFragments`. Some parallelization is also added. Resolves: #574 Signed-off-by: Dmitrii Makarenko <[email protected]>
This commit adds a check for a valid pointer to the column buffer so copying can be skipped. It is most effective with the `enable-non-lazy-data-import` option enabled. It checks the number of chunks in storage. Partially resolves: #574 Signed-off-by: Dmitrii Makarenko <[email protected]>
This commit adds parallelization. Resolves: #574 Signed-off-by: Dmitrii Makarenko <[email protected]>
This commit reworks `fill_hash_join_buff_bucketized_cpu` to use TBB and utilize the CPU properly. Partially resolves: #574 Signed-off-by: Dmitrii Makarenko <[email protected]>
This commit refactors and simplifies the method `getAllTableColumnFragments`. Some parallelization is also added. Partially resolves: #574 Signed-off-by: Dmitrii Makarenko <[email protected]>
On tables of size ~10^8 rows, join performance is very low compared with DuckDB.
A join of two tables of 10^8 rows on a single int column takes about 12 s.
The biggest performance drop is in fetchChunks. There are also 2 subqueries; the first is
COUNT(*)
used to collect metadata.