Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: 1414: Adds time_bucket() support for duckdb fdw #32

Merged
merged 38 commits into from
Aug 19, 2024

Conversation

devanbenz
Copy link
Contributor

@devanbenz devanbenz commented Aug 3, 2024

Signed-off-by: Devan [email protected]# Ticket(s) Closed

What

`time_bucket(INTERVAL 5 MINUTE, timestamp::DATE)` -> aggregate by 5 min and return DATE
`time_bucket(INTERVAL 5 MINUTE, timestamp)` -> aggregate by 5 min and return timestamp
`time_bucket(INTERVAL 5 MINUTE, timestamp::DATE, INTERVAL '15 DAYS')` -> aggregate by 5 min with a 15 day offset and return DATE
`time_bucket(INTERVAL 5 MINUTE, timestamp, INTERVAL '15 DAYS')` -> aggregate by 5 min with a 15 day offset and return timestamp
`time_bucket(INTERVAL 5 MINUTE, timestamp::DATE, DATE '1990-01-01')` -> aggregate by 5 min with a DATE origin and return date

Tests

test_time_bucket_minutes -> tests for aggregating by minutes/timestamps
test_time_bucket_years -> tests for aggregating by date
test_time_bucket_fallback -> test to ensure that when used on a postgres table it falls back (there is currently no functioning UDF this is only working with duckdb FDW)

src/hooks/executor.rs Outdated Show resolved Hide resolved
Signed-off-by: Devan <[email protected]>
Signed-off-by: Devan <[email protected]>
Signed-off-by: Devan <[email protected]>
Signed-off-by: Devan <[email protected]>
Signed-off-by: Devan <[email protected]>
Signed-off-by: Devan <[email protected]>
@devanbenz
Copy link
Contributor Author

devanbenz commented Aug 5, 2024

I am actually going to natively make this supported by Postgres. Will let you know when the PR is ready again.

Signed-off-by: Devan <[email protected]>
Signed-off-by: Devan <[email protected]>
Signed-off-by: Devan <[email protected]>
@devanbenz
Copy link
Contributor Author

Just need to add some more tests and implement the offsets/origin but currently the date and time base time_bucket() functions are working with postgres + duckdb. TIL that in order to debug pg extensions you need to attach to the running PID of postgres 😅

@devanbenz
Copy link
Contributor Author

Through some edge-case testing I realized that my implementation was not working as expected. I went through a few iterations locally but ultimately implemented the following calculation which when used against my edge cases is working correctly.

result = ((date - origin_unix_epoch) / interval) * interval

This will truncate the chunk of time to make it divisible by the interval. Whatever the result is will be the time bucket (in epoch seconds) that then gets converted to a regular PG timestamp or date. Currently my Timestamp implementation supports UTC timestamps. I will be creating another function using the PG type TimestampWithTimeZone for timezone support in the following days. Prior to implementing this arithmetic I was thinking about setting up a sliding window and then doing some sort of search on it but a simple calculation should be much much MUCH less computationally complex.

Sorry this is taking a bit longer then expected 😅 It appears to be more complex then I initially though. I will also be writing more tests for this specifically for the postgres UDF. I personally am a big fan of writing more tests then actual feature code 😆.

@philippemnoel
Copy link
Collaborator

Through some edge-case testing I realized that my implementation was not working as expected. I went through a few iterations locally but ultimately implemented the following calculation which when used against my edge cases is working correctly.

result = ((date - origin_unix_epoch) / interval) * interval

This will truncate the chunk of time to make it divisible by the interval. Whatever the result is will be the time bucket (in epoch seconds) that then gets converted to a regular PG timestamp or date. Currently my Timestamp implementation supports UTC timestamps. I will be creating another function using the PG type TimestampWithTimeZone for timezone support in the following days. Prior to implementing this arithmetic I was thinking about setting up a sliding window and then doing some sort of search on it but a simple calculation should be much much MUCH less computationally complex.

Sorry this is taking a bit longer then expected 😅 It appears to be more complex then I initially though. I will also be writing more tests for this specifically for the postgres UDF. I personally am a big fan of writing more tests then actual feature code 😆.

Appreciate the thoroughness and the tests, you're doing great. Tagging @od0 who requested the feature in the first place to check the logic

tests/fixtures/mod.rs Outdated Show resolved Hide resolved
tests/time_bucket.rs Outdated Show resolved Hide resolved
@devanbenz
Copy link
Contributor Author

devanbenz commented Aug 10, 2024

@rebasedming thanks for taking a look :) -- will address everything later tonight or tomorrow morning 👍

UPDATE: Changes have been made

@devanbenz
Copy link
Contributor Author

@rebasedming do you know if anyone is working on fixing the failing test in dev? taking a look at the CI the test that is failing is not associated with this PR.

@devanbenz
Copy link
Contributor Author

Awesome I see it was fixed :)

@philippemnoel
Copy link
Collaborator

Awesome I see it was fixed :)

We reverted the faulty PR, sorry about that! And apologies for the silence here, we've been slammed on some search work for a customer, but we'll get back to reviewing analytics PRs very soon!

@od0
Copy link

od0 commented Aug 14, 2024

hey @devanbenz awesome work on this!

just built and tested from your branch. works great from what I can see, with one caveat – when applying a limit to a query with time_bucket, the bucket is still calculated for all rows in the table

e.g. select time_bucket('5m', trips.tpep_pickup_datetime) from trips limit 10; touches all rows not just the first 10

using a subquery seems to work fine as a workaround select time_bucket('5m', t.tpep_pickup_datetime) from (select * from trips limit 10) t;

@vaibhawvipul vaibhawvipul self-requested a review August 19, 2024 13:00
Copy link
Contributor

@vaibhawvipul vaibhawvipul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This successfully adds time_bucket().

We can merge this and open a follow-up perf/bug ticket to improve LIMIT related scans.

@vaibhawvipul vaibhawvipul merged commit 73ddff2 into paradedb:dev Aug 19, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for time_bucket() in ParadeDB pg_lakehouse
5 participants