-
-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Querying Hive partitioning parquets is slow #173
Comments
Thanks for opening! Would love your help with debugging this, or anyone else if willing to assist here :) |
/take |
I believe most of the time is spent in the execution process of DuckDB. I tested the release build of pg_analytics and the DuckDB CLI, and found that the query times between pg_analytics and DuckDB are quite similar. We could set duckdb configures to make duckdb print detailed profiling information (including optimizer and binding) SELECT duckdb_execute($$SET enable_profiling='query_tree'$$);
SELECT duckdb_execute($$SET profiling_mode='detailed'$$);
-- execute query
SELECT * FROM t1 WHERE query_len = 80 and response_len = 3000 limit 10; And the slow see duckdb issue
|
@xqe2011 Hi, could you set the parameters and then test it again? Currently, I am not using the ParadeDB Helm Chart; I am directly using pg_analytics. Thank you! |
Yeah, I think so. If the user provides more information indicating that the issue is not with DuckDB, we can reopen the issue. |
Thank you for investigating <3 |
What happens?
Recently, we tried this extensions instead of using a standalone duckdb instance. When we run a simple
SELECT
query on parquet files, it's 2-20 times slower than DuckDB.Profiling method
SELECT duckdb_execute($$SET enable_profiling='query_tree'$$);
and watch logs.To Reproduce
Query one field :
SELECT name FROM public.table1 where code1 = 3261 and code2 = '001' and code3 = '5204' and code4 = '1'
code1
andcode2
are partition fields.DuckDB runs on cli 0.0190s
DuckDB runs in this extension 0.0291s
Total time of using this extension 0.513s
Query multi fields:
SELECT name, level, detxlen, detylen, downid FROM public.table1 where code1 = 3261 and code2 = '001' and code3 = '5204' and code4 = '1'
code1
andcode2
are partition fields.DuckDB runs on cli 0.0413s
DuckDB runs in this extension 0.037s
Total time of using this extension 0.552s
OS:
Ubuntu Server 22.04.3
ParadeDB Version:
paradedb/paradedb:16-v0.11.1
Are you using ParadeDB Docker, Helm, or the extension(s) standalone?
ParadeDB Helm Chart
Full Name:
Liu Qijie
Affiliation:
Dongguan University of Technology
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include the code required to reproduce the issue?
Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?
The text was updated successfully, but these errors were encountered: