Concept
Icechunk solves the problem of handling incremental updates to Zarr stores, meaning that users can transparently track changes to datasets at the chunk level. Real-world data pipelines often compute some aggregate result from one or more entire input datasets, but currently, if you change just one chunk in a store, getting the new result likely means recomputing over the entire dataset. This is potentially massively wasteful if only part of the new result actually depends on chunks that were changed since the last version of the input dataset.
Can we use Cubed to automatically re-compute only the output chunks that actually depend on the updated input chunks?
This would be an extremely powerful optimization: in the extreme case the re-computed result might depend on only 1 or 2 updated chunks in the original dataset, so only 1 or 2 output chunks would need to be re-computed instead of re-computing the entire thing.
Cubed potentially has enough information in the plan to trace back from the desired result all the way to the input chunks that are actually needed.
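To make this concrete, here is a minimal sketch (not Cubed's actual API; the plan representation, node names, and producer-to-consumer edge direction are all assumptions) of tracing from a desired output chunk back to the input chunks it needs, treating the plan as a networkx.DiGraph of chunk/task keys:

```python
import networkx as nx

# Hypothetical toy "plan": input chunks feed intermediate tasks, which feed
# output chunks. Edges point from producer to consumer (data-flow direction).
plan = nx.DiGraph()
plan.add_edges_from([
    ("input-0", "task-0"), ("input-1", "task-0"),
    ("input-2", "task-1"), ("input-3", "task-1"),
    ("task-0", "output-0"), ("task-1", "output-1"),
])

# Trace back from one desired output chunk to the input chunks it actually needs.
needed = {k for k in nx.ancestors(plan, "output-1") if k.startswith("input-")}
print(needed)  # {'input-2', 'input-3'}
```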
Funnily enough, it seems like you could reverse the graph so inputs become outputs, apply a selection to the inputs to isolate the chunks that have changed, call cull, then reverse the graph back.
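A rough sketch of that reverse/cull/reverse trick, under the same assumptions as above (plan viewed as a networkx.DiGraph with producer-to-consumer edges; cull here is a stand-in for the usual task-graph cull, i.e. keep the targets plus everything they depend on):

```python
import networkx as nx

def cull(dag: nx.DiGraph, targets):
    # Keep the target nodes plus everything they depend on (their ancestors).
    keep = set(targets)
    for t in targets:
        keep |= nx.ancestors(dag, t)
    return dag.subgraph(keep).copy()

def affected_subgraph(plan: nx.DiGraph, changed_input_chunks):
    # Reverse the graph so inputs play the role of outputs, cull against the
    # changed input chunks, then reverse back to restore the data-flow direction.
    reversed_plan = plan.reverse(copy=True)
    culled = cull(reversed_plan, changed_input_chunks)
    return culled.reverse(copy=True)
```

On the toy plan above, affected_subgraph(plan, {"input-2"}) keeps only input-2 → task-1 → output-1, i.e. only output-1 would need re-computing.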
cc @rabernat (whose idea this was) @tomwhite @sharkinsspatial