Querying Hive partitioning parquets is slow #173

xqe2011 · 2024-11-07T11:32:53Z

What happens?

Recently, we tried this extension instead of a standalone DuckDB instance. When we run a simple SELECT query on Parquet files, it is 2-20 times slower than DuckDB.

Profiling method
Run SELECT duckdb_execute($$SET enable_profiling='query_tree'$$); and watch the logs.

To Reproduce

Query one field: SELECT name FROM public.table1 WHERE code1 = 3261 AND code2 = '001' AND code3 = '5204' AND code4 = '1'
code1 and code2 are partition fields.

DuckDB CLI: 0.0190s
DuckDB execution inside this extension: 0.0291s
Total time using this extension: 0.513s

Query multiple fields: SELECT name, level, detxlen, detylen, downid FROM public.table1 WHERE code1 = 3261 AND code2 = '001' AND code3 = '5204' AND code4 = '1'
code1 and code2 are partition fields.

DuckDB CLI: 0.0413s
DuckDB execution inside this extension: 0.037s
Total time using this extension: 0.552s
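
The timings above can be split into time reported by DuckDB and time spent elsewhere. The following is a minimal sketch of that arithmetic using only the numbers reported here. The names `timings`, `overhead`, and `overhead_share` are illustrative, and "overhead" simply means total time minus the DuckDB time measured inside the extension.

```python
# Timings in seconds, copied from the measurements reported in this issue.
timings = {
    "one field": {"cli": 0.0190, "in_extension": 0.0291, "total": 0.513},
    "multi fields": {"cli": 0.0413, "in_extension": 0.037, "total": 0.552},
}


def overhead(t):
    """Seconds not accounted for by DuckDB execution inside the extension."""
    return t["total"] - t["in_extension"]


def overhead_share(t):
    """Fraction of the total time that falls outside DuckDB execution."""
    return overhead(t) / t["total"]


for name, t in timings.items():
    print(f"{name}: overhead {overhead(t):.3f}s "
          f"({overhead_share(t):.0%} of total)")
```

In both queries the unaccounted time is roughly half a second, which is more than 90% of the total, so the gap lies outside the execution time that DuckDB itself reports.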

OS:

Ubuntu Server 22.04.3

ParadeDB Version:

paradedb/paradedb:16-v0.11.1

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Helm Chart

Full Name:

Liu Qijie

Affiliation:

Dongguan University of Technology

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

Yes, I have

philippemnoel · 2024-11-07T20:02:19Z

Thanks for opening! Would love your help with debugging this, or anyone else if willing to assist here :)

kysshsy · 2025-01-05T11:48:05Z

/take

kysshsy · 2025-01-08T11:40:46Z

It seems most of the time is spent in duckdb-rs. I need to dig into duckdb-rs.

kysshsy · 2025-01-20T15:59:59Z

I believe most of the time is spent in DuckDB's execution process. I tested the release build of pg_analytics against the DuckDB CLI and found that the query times are quite similar. We can set DuckDB configuration options so that DuckDB prints detailed profiling information (including optimizer and binding):

SELECT duckdb_execute($$SET enable_profiling='query_tree'$$);
SELECT duckdb_execute($$SET profiling_mode='detailed'$$);
-- execute query
SELECT * FROM t1 WHERE query_len = 80 AND response_len = 3000 LIMIT 10;

The slow prepare step may be due to a limitation or design decision in DuckDB.

See the DuckDB issue:

The execution is done lazily - but binding is done immediately. In this case that means (1) resolving the glob and gathering a list of files to scan, and (2) reading the metadata of a Parquet file to figure out the names and types that exist in the file.
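
To make step (1) concrete, here is a rough sketch of what resolving a glob over a Hive-partitioned directory involves. It is not DuckDB's implementation. The directory layout, the empty placeholder files, and the helpers `parse_partitions` and `resolve` are all hypothetical, and step (2), reading Parquet metadata, is not modelled.

```python
import glob
import os
import tempfile


def parse_partitions(path, root):
    """Extract key=value partition pairs from a Hive-style file path."""
    rel = os.path.relpath(os.path.dirname(path), root)
    return dict(part.split("=", 1) for part in rel.split(os.sep) if "=" in part)


def resolve(root, **wanted):
    """List every file under root, then keep those matching the partition filter."""
    pattern = os.path.join(root, "**", "*.parquet")
    all_files = sorted(glob.glob(pattern, recursive=True))
    matching = [
        f for f in all_files
        if all(parse_partitions(f, root).get(k) == v for k, v in wanted.items())
    ]
    return all_files, matching


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: empty files standing in for real Parquet data.
    for code1 in ("3260", "3261", "3262"):
        for code2 in ("001", "002"):
            d = os.path.join(root, f"code1={code1}", f"code2={code2}")
            os.makedirs(d)
            open(os.path.join(d, "data.parquet"), "w").close()

    all_files, matching = resolve(root, code1="3261", code2="001")
    print(len(all_files), "files listed,", len(matching), "match the filter")
```

In this sketch the listing touches every partition directory even though the filter keeps a single file, which is one reason why bind time could grow with the number of partitions rather than with the size of the result.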

logs in Postgres:

logs in duckdb:

kysshsy · 2025-01-20T16:09:53Z

@xqe2011 Hi, could you set the parameters and then test it again? Currently, I am not using the ParadeDB Helm Chart; I am directly using pg_analytics. Thank you!

philippemnoel · 2025-01-20T20:29:42Z

Thank you for investigating this @kysshsy. @xqe2011 if you are using the Helm chart, please be sure to allocate enough resources.

@kysshsy Do you think we should close this issue if there isn't anything on our end?

kysshsy · 2025-01-21T16:07:44Z

> Thank you for investigating this @kysshsy. @xqe2011 if you are using the Helm chart, please be sure to allocate enough resources.
>
> @kysshsy Do you think we should close this issue if there isn't anything on our end?

Yeah, I think so. If the user provides more information indicating that the issue is not with DuckDB, we can reopen the issue.

philippemnoel · 2025-01-21T17:10:02Z

Thank you for investigating <3

xqe2011 added the bug label Nov 7, 2024

philippemnoel added the good first issue, help wanted, priority-medium, and user-request labels Nov 7, 2024

github-actions bot assigned kysshsy Jan 5, 2025

philippemnoel closed this as completed Jan 21, 2025
