core/translate: refactor query planner again to be simpler #853
+997
−1,661
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Simplify bookkeeping of referenced tables in the query planner
This PR refactors the way we track referenced tables and associated planner operations related to them (scans, index searches, etc).
The problem with what we currently have:
SourceOperator
which is either aJoin
,Scan
,Search
,Subquery
orNothing
.Join
has two boxed child operators, and so e.g. for a 3-table query you have, for example, 3Scan
nodes and then 2Join
nodes.referenced_tables
that columns can refer to by index:referenced_tables
is used because SQLite joins are ann^tables_len
nested loop, and we need to figure out during which loop to evaluate a condition expression. A lot of plumbing in the current code exists for this, e.g. "pushing predicates" inoptimizer
even though "predicate pushdown" as a query planner concept is an optimization, but in our current system the "pushdown" is really a necessity to move the condition expressions to the correctSourceOperator::predicates
vector so that they are evaluated at the right point.referenced_tables
is also used to map identifiers in the query to the correct table, e.g. 'foo'SELECT foo FROM users
becomes anast::Expr::Column { table: 0, .. }
ifusers
is the first table inreferenced_tables
.In addition to this, we ALSO had a
TableReferenceType
separately for checking whether the upper-level query is reading from a BTree table or a Subquery....even though we already have an
Operator::Subquery
that should be able to encode this information, but doesn't, because it's a tree and we refer to things by index inreferenced_tables
.Why this is especially stupid
SourceOperator
andTableReference
are basically just two representations of the same thing, one in tree format and another in vector format.SourceOperator
even carries around its own copy ofTableReference
, even though we ALSO havereferenced_tables: Vec<TableReference>
🤡Note that I'm allowed to call the existing code stupid because I wrote it.
What we can do instead
Basically, we can just fold the concerns from
SourceOperator
intoTableReference
and have a list of those in the query plan, one per table, in loop order (outermost loop is 0, and so on).Funnily enough, when Limbo had very very few features we used to have a Vec of LoopInfos similarly, obviously with a lot less information than now, but for SQLite it's probably the right abstraction. :)
And we keep the "operation" part from the "operator", but in a simple form:
Now we don't need to carry around both the operator tree and
Vec<TableReference>
, because they are the same thing. If something refers to then'th table
, it is justplan.table_references[n]
.We also don't need to recurse through the operator tree and usually we can just loop from outermost table to innermost table.
Handling the "where to evaluate a condition expression" problem
You can see I've also removed the
predicates
vector fromScan
and friends. Previously eachSourceOperator
had a vector ofpredicates
so that we knew at which loop depth to evaluate a condition. Now we align more with what SQLite does -- it puts all the conditions, even the join conditions, in theWHERE
clause and adds extra metadata to them:Final notes
I've been wanting to make this refactor for a long time now, but the last straw was when I was making a PR trying to reduce some of the massive amount of allocations happening in the read path currently, and I got stuck because of this Operator + referenced_tables shit getting in the way constantly. So I decided I wanted to get it over with now.
The PR is again very big, but I've artificially split it up into commits that don't individually compile but at least separate the changes a bit for you, dear reader.