Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: merge_insert with subcols sometimes outputs unexpected nulls (#3407
) Fixes #3406 At the root of this is a bit of a footgun with DataFusion. Prior to this change, the query plan for getting data that was supposed to be sorted by `_rowaddr` was: ``` ProjectionExec: expr=[id@0 as id, vector@1 as vector, _rowaddr@2 as _rowaddr, _rowaddr@2 >> 32 as _fragment_id], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N, _fragment_id:UInt64;N] RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N] SortExec: expr=[_rowaddr@2 ASC], preserve_partitioning=[false], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N] StreamingTableExec: partition_sizes=1, projection=[id, vector, _rowaddr], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N] ``` Note the `RepartitionExec` **after** the `SortExec`. This caused the final order to be non-deterministic. After these changes, the plan is: ``` SortPreservingMergeExec: [_rowaddr@2 ASC], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N, _fragment_id:UInt64;N] SortExec: expr=[_rowaddr@2 ASC], preserve_partitioning=[true], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N, _fragment_id:UInt64;N] ProjectionExec: expr=[id@0 as id, vector@1 as vector, _rowaddr@2 as _rowaddr, _rowaddr@2 >> 32 as _fragment_id], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N, _fragment_id:UInt64;N] RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N] StreamingTableExec: partition_sizes=1, projection=[id, vector, _rowaddr], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N] ``` Which does provide a deterministic order.
- Loading branch information