Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Querying Hive partitioning parquets is slow #173

Closed
2 tasks done
xqe2011 opened this issue Nov 7, 2024 · 8 comments
Closed
2 tasks done

Querying Hive partitioning parquets is slow #173

xqe2011 opened this issue Nov 7, 2024 · 8 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed priority-medium Medium priority issue user-request This issue was directly requested by a user

Comments

@xqe2011
Copy link

xqe2011 commented Nov 7, 2024

What happens?

Recently, we tried this extensions instead of using a standalone duckdb instance. When we run a simple SELECT query on parquet files, it's 2-20 times slower than DuckDB.

Profiling method
SELECT duckdb_execute($$SET enable_profiling='query_tree'$$); and watch logs.

To Reproduce

Query one field : SELECT name FROM public.table1 where code1 = 3261 and code2 = '001' and code3 = '5204' and code4 = '1'
code1 and code2 are partition fields.

DuckDB runs on cli 0.0190s
DuckDB runs in this extension 0.0291s
Total time of using this extension 0.513s
Query multi fields: SELECT name, level, detxlen, detylen, downid FROM public.table1 where code1 = 3261 and code2 = '001' and code3 = '5204' and code4 = '1'
code1 and code2 are partition fields.

DuckDB runs on cli 0.0413s
DuckDB runs in this extension 0.037s
Total time of using this extension 0.552s

OS:

Ubuntu Server 22.04.3

ParadeDB Version:

paradedb/paradedb:16-v0.11.1

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Helm Chart

Full Name:

Liu Qijie

Affiliation:

Dongguan University of Technology

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have
@xqe2011 xqe2011 added the bug Something isn't working label Nov 7, 2024
@philippemnoel philippemnoel added good first issue Good for newcomers help wanted Extra attention is needed priority-medium Medium priority issue user-request This issue was directly requested by a user labels Nov 7, 2024
@philippemnoel
Copy link
Collaborator

Thanks for opening! Would love your help with debugging this, or anyone else if willing to assist here :)

@kysshsy
Copy link
Contributor

kysshsy commented Jan 5, 2025

/take

@kysshsy
Copy link
Contributor

kysshsy commented Jan 8, 2025

It seems the most of time is spent on duckdb-rs. Need to dive into duckdb-rs.
partition

@kysshsy
Copy link
Contributor

kysshsy commented Jan 20, 2025

I believe most of the time is spent in the execution process of DuckDB. I tested the release build of pg_analytics and the DuckDB CLI, and found that the query times between pg_analytics and DuckDB are quite similar. We could set duckdb configures to make duckdb print detailed profiling information (including optimizer and binding)

SELECT duckdb_execute($$SET enable_profiling='query_tree'$$);
SELECT duckdb_execute($$SET profiling_mode='detailed'$$);
-- execute query
SELECT * FROM t1 WHERE query_len = 80 and  response_len = 3000 limit 10;

And the slow prepare execution might be due to the limitations or the design of DuckDB.

see duckdb issue

The execution is done lazily - but binding is done immediately. In this case that means (1) resolving the glob and gathering a list of files to scan, and (2) reading the metadata of a Parquet file to figure out the names and types that exist in the file.

Image

logs in Postgres:
Image

logs in duckdb:
Image

@kysshsy
Copy link
Contributor

kysshsy commented Jan 20, 2025

@xqe2011 Hi, could you set the parameters and then test it again? Currently, I am not using the ParadeDB Helm Chart; I am directly using pg_analytics. Thank you!

@philippemnoel
Copy link
Collaborator

Thank you for investigating this @kysshsy. @xqe2011 if you are using the Helm chart, please be sure to allocate enough resources.

@kysshsy Do you think we should close this issue if there isn't anything on our end?

@kysshsy
Copy link
Contributor

kysshsy commented Jan 21, 2025

Thank you for investigating this @kysshsy. @xqe2011 if you are using the Helm chart, please be sure to allocate enough resources.

@kysshsy Do you think we should close this issue if there isn't anything on our end?

Yeah, I think so. If the user provides more information indicating that the issue is not with DuckDB, we can reopen the issue.

@philippemnoel
Copy link
Collaborator

Thank you for investigating <3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed priority-medium Medium priority issue user-request This issue was directly requested by a user
Projects
None yet
Development

No branches or pull requests

3 participants