Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use duckdb functions without having to specify the full type signature #531

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

JelteF
Copy link
Collaborator

@JelteF JelteF commented Jan 14, 2025

Currently when you use read_parquet/read_csv/etc they require specifying the full type signature of the function call. This is quite cumbersome, especially when using SELECT * FROM read_parquet(...) AS (...). Using SELECT * its main advantage is that you don't have to specify all the columns.

This changes that by instead allowing the following new syntax:

SELECT * FROM read_parquet('people.parquet') r WHERE r['age'] > 21;

This syntax is made possible by using two clever hacks to solve two different problems:

The first problem is that Postgres its parser needs to consider the query valid, which means it has to resolve all the types. Sadly, we cannot hook into the type resolution as an extension. Instead the types need to be resolved fully from information in the Postgres catalogs. To do so we create a new type for our extension: duckdb.unresolved_type. This type only exists for the Postgres parser and should never be used explicitly. We add catalog entries that allow this type be casted explicitly to any type that's supported by pg_duckdb. Using explicit casts always is quite annoying though. So we also define various operators and functions for this type, so that things like r['age'] > 21 are allowed.

The second problem is that the Postgres parser replaces the * in SELECT * with an expanded column list. So the query above will look as follows once we receive it (* replaced with r):

SELECT r FROM read_parquet('people.parquet') r WHERE r['age'] > 21;

To resolve this problem we let read_parquet return a duckdb.row, which we then replace with a * again before in our pgduckdb_get_querydef function. This might sound pretty simple but there's tricks that need to be done for subqueries that return both a duckdb.row type and some other columns.

TODO:

  • Support for subscripts, aka array/json indexing
  • More comments about edge cases (see reminders for myself)
  • Add some more tests with more complex queries
  • Support all supported PG versions (not just PG17), hopefully by moving some of the code to the src/pgduckdb_ruleutils.cpp which is shared across versions.
  • BUG: Fix CREATE TABLE AS (broken by a recent commit)

@JelteF JelteF force-pushed the redesigned-read-parquet branch 5 times, most recently from ffb9a26 to 59d8a57 Compare January 14, 2025 15:44
@JelteF JelteF force-pushed the redesigned-read-parquet branch from 59d8a57 to 1992351 Compare January 14, 2025 15:45
}

if (list_length(indirection) != 1) {
/* TODO: Fix this, if the column contains an array we should be able to subscript that type */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a test with an array/json column.

Comment on lines +422 to +423
/* TODO: Trigger cache population, probably we should do this somewhere else */
pgduckdb::IsExtensionRegistered();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where to put this though, this might very well be the first place we can actually hook into.

@@ -9015,6 +9128,45 @@ get_rule_expr(Node *node, deparse_context *context,
appendStringInfoString(buf, " := ");
get_rule_expr(refassgnexpr, context, showimplicit);
}
else if (IsA(sbsref->refexpr, Var) && pgduckdb_var_is_duckdb_row((Var*) sbsref->refexpr)) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a comment about why this is needed.

@@ -11082,6 +11234,10 @@ get_const_expr(Const *constval, deparse_context *context, int showtype)
char *extval;
bool needlabel = false;

if (pgduckdb_is_unresolved_type(constval->consttype)) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a comment.

@@ -12121,8 +12277,11 @@ get_from_clause_item(Node *jtnode, Query *query, deparse_context *context)
/* Print the relation alias, if needed */
get_rte_alias(rte, varno, false, context);

if (pgduckdb_func_returns_duckdb_row(rtfunc1)) {
/* Don't print column aliases for functions that return a duckdb.row */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we do that?

/* TODO: Fix this, if the column contains an array we should be able to subscript that type */
elog(ERROR, "Subscripting duckdb.row is only supported with a single subscript");
}
pprint(indirection);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove pprint here and in other places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant