List needed changes to implement UDTFs on FROMs #746
Comments
What's the purpose?
SELECT file_path
FROM commits c
INNER JOIN commit_files cf
ON c.repository_id = cf.repository_id AND c.commit_hash = cf.commit_hash |
Function calls in the FROM clause are not valid SQL syntax. Or, at least, not valid MySQL syntax. So we would need to change it in vitess (which they are not going to accept, because it is not valid MySQL syntax) or manually parse the FROMs if we want this. Also, what about indexes? What would be the syntax for creating indexes with this? Is there any other construct in MySQL that behaves like this? If there isn't, why not use the concepts that are already there: tables? Just like we have ref_commits and so on right now. What's the advantage of this over tables? We must take into account the disadvantages that this brings (invented concept, invalid syntax, awkward index creation, ...). |
I'll try to clarify the context, since it seems there're no appropriate minutes or extensive docs about this. The context is the discussion about functions returning multiple rows per input row, also known as table-generating functions (or UDTF, when we're talking about the UDF counterpart). This is something that came up during the discussion about diff (#553), diff-tree (#691) and blame implementation. We discussed two broad approaches:
Tables
Implementing each of these (diff, diff-tree, blame) as a table. This has some major advantages:
While it has some drawbacks:
Functions
This is actually not a single approach, but a family of them. The common thing is that there is some function construct that maps one input row to multiple output rows. There are two main approaches: array based and table based. For array based:
Then there's also the possibility of using table-generating functions, which return tables directly. This is supported in HiveQL (not SQL! See docs), and to some degree in IBM DB2 and Oracle. There's a similar concept in the latest SQL standard (SQL:2016) called polymorphic table functions, currently supported by Oracle. The array-based methods are still pretty familiar to users of advanced SQL in PostgreSQL, Spark or BigQuery. Depending on how they are implemented, they have the shortcoming that they might require accumulating a full array in memory before exploding the results. In any case, they provide some more flexibility regarding passing parameters. On the other hand, they can be considerably more complex to implement. I understand it would imply changing the vitess and Spark parsers, as well as a few planning rules, and probably also changing the go-mysql-server interfaces to have a common ground between all tables (including this kind of derived table) so that we can still apply optimizations. In any case, the latest status of these discussions is that we need further research on what it would take to implement a function-based approach in both gitbase and gitbase-spark-connector. |
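To make the memory trade-off mentioned above concrete, here is a minimal Go sketch, not gitbase code. All names in it (`Row`, `filesAsArray`, `explode`, `filesAsTable`) are hypothetical and the file names are made up. It only contrasts an array-based function, which builds the whole array before it is exploded into rows, with a table-generating function that hands rows to the consumer one at a time.

```go
package main

import "fmt"

// Row is a hypothetical stand-in for a SQL row.
type Row []interface{}

// filesAsArray models the array-based approach: the function returns the
// whole array for one input row, and a later "explode" step turns it into rows.
func filesAsArray(commit string, n int) []string {
	files := make([]string, 0, n)
	for i := 0; i < n; i++ {
		files = append(files, fmt.Sprintf("%s/file%d", commit, i))
	}
	return files
}

// explode emits one output row per array element.
func explode(commit string, files []string) []Row {
	rows := make([]Row, 0, len(files))
	for _, f := range files {
		rows = append(rows, Row{commit, f})
	}
	return rows
}

// filesAsTable models the table-generating approach: rows are handed to the
// consumer one at a time, so nothing has to be accumulated first.
func filesAsTable(commit string, n int, emit func(Row) bool) {
	for i := 0; i < n; i++ {
		if !emit(Row{commit, fmt.Sprintf("%s/file%d", commit, i)}) {
			return // the consumer asked to stop early (for example, a LIMIT)
		}
	}
}

func main() {
	// Array-based: all 3 elements exist in memory before any row is produced.
	fmt.Println(len(explode("abc", filesAsArray("abc", 3))))

	// Table-based: the consumer stops after 2 rows; the other 998 are never built.
	seen := 0
	filesAsTable("abc", 1000, func(Row) bool { seen++; return seen < 2 })
	fmt.Println(seen)
}
```

The callback style is just one way to stream rows; an iterator with a `Next` method would show the same point.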
For UDTFs to be implemented in gitbase we would need the following things in both The biggest workload is on go-mysql-server, along with adapting the squash rule to work with this.
go-mysql-server
Parser
Vitess' parser does not handle functions in the FROM part of a query. This is not likely to be merged in vitess, because this is not even a thing in MySQL, so we would have to fork vitess or something.
Joins
Assuming we want something like what we have in the example query, if we have a cross join like:
Right now, it's transformed into:
Where both branches are absolutely isolated and don't know anything about each other. This is how the semantics of cross join work. The only place where both branches have access to each other's values is in an If we want In any case, we would need to generate a set of rows in the UDTF for each row in the other branch. UDTFs could implement this interface instead of
// GeneratedTable is a table whose rows are generated for a given set of values.
type GeneratedTable interface {
	// RowsForValues returns the rows generated for the given values.
	RowsForValues(ctx *sql.Context, values ...interface{}) (sql.RowIter, error)
	// ... other methods such as Schema, ...
}
Then, the joins would need to special-case the situation where one of the branches is a generated table: for each row on the other side, gather the values and get the iterator for those values from the generated table. Probably, it would just be better to use a rule to replace
Caveats
Intuitive to do
Analyzer
Core
A way to register UDTFs and ensure UDTF names cannot clash with real tables or other UDFs.
gitbase
Squash
Squash works for inner joins, but this pattern is used with cross joins (because It would require new rules to chain these new UDTFs with what's already there based on the parameters they take.
UDTFs
Implementation of the UDTFs themselves and registration on the server.
Personal opinion on using UDTFs
Disclaimer: I know this was not asked for in the task, but there is no other place right now to put this, nor any meeting in the near future that I know of about the topic, so I'm adding this here. While I see why it might be desired to use this instead of just plain tables, there are a lot of objective disadvantages to this approach:
And the only advantages I see are:
|
Personally, I still don't see any benefits of functions here, and I don't want to repeat what Miguel has already written. Functions are much harder to optimize. |
Thanks for the analysis @erizocosmico and also for your feedback @kuba--! We'll be settling this probably next week. |
@ajnavarro I guess this issue can be closed already. |
Check all the changes that we will need to be able to execute the following query:
SELECT file_path FROM commits as c, FILES(c.repository_id, c.commit_hash)
Internally, the FILES function will behave like a table with mandatory filters (repository_id and commit_hash).
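As a rough illustration of what executing this query would involve, the sketch below evaluates `commits, FILES(c.repository_id, c.commit_hash)` by calling a generated table once per left-hand row. It is a simplified assumption and not the go-mysql-server API: `Row`, `filesTable` and `joinGenerated` are hypothetical names, the sample data is made up, and `RowsForValues` returns a plain slice and takes no context, where the interface proposed in the discussion takes a `*sql.Context` and returns a `sql.RowIter`.

```go
package main

import "fmt"

// Row is a hypothetical stand-in for a SQL row (not the go-mysql-server type).
type Row []interface{}

// GeneratedTable is a simplified version of the interface proposed in the
// discussion: it produces rows for a given set of argument values.
type GeneratedTable interface {
	RowsForValues(values ...interface{}) ([]Row, error)
}

// filesTable is a toy FILES(repository_id, commit_hash) backed by a map.
type filesTable struct {
	paths map[string][]string // key: repository_id + "@" + commit_hash
}

// RowsForValues treats both arguments as mandatory filters.
func (t filesTable) RowsForValues(values ...interface{}) ([]Row, error) {
	if len(values) != 2 {
		return nil, fmt.Errorf("FILES expects 2 arguments, got %d", len(values))
	}
	key := fmt.Sprintf("%v@%v", values[0], values[1])
	var rows []Row
	for _, p := range t.paths[key] {
		rows = append(rows, Row{p})
	}
	return rows, nil
}

// joinGenerated evaluates `left, GEN(args...)`: for each left row it gathers
// the argument values from the given column indexes, asks the generated table
// for the matching rows, and appends each of them to a copy of the left row.
func joinGenerated(left []Row, gen GeneratedTable, argCols ...int) ([]Row, error) {
	var out []Row
	for _, l := range left {
		args := make([]interface{}, len(argCols))
		for i, c := range argCols {
			args[i] = l[c]
		}
		rows, err := gen.RowsForValues(args...)
		if err != nil {
			return nil, err
		}
		for _, r := range rows {
			joined := append(append(Row{}, l...), r...)
			out = append(out, joined)
		}
	}
	return out, nil
}

func main() {
	// commits(repository_id, commit_hash)
	commits := []Row{{"repo1", "aaa"}, {"repo1", "bbb"}}
	files := filesTable{paths: map[string][]string{
		"repo1@aaa": {"README.md", "main.go"},
		"repo1@bbb": {"README.md"},
	}}
	rows, err := joinGenerated(commits, files, 0, 1)
	if err != nil {
		panic(err)
	}
	for _, r := range rows {
		fmt.Println(r[2]) // file_path
	}
}
```

Note that a commit with no files produces no output rows here, which matches inner-join-like behaviour; whether that is the desired semantics is one of the open questions in this issue.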