Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reintroduce slice fusion #11638

Merged
merged 3 commits into from
Jan 10, 2025
Merged

Reintroduce slice fusion #11638

merged 3 commits into from
Jan 10, 2025

Conversation

phofl
Copy link
Collaborator

@phofl phofl commented Jan 7, 2025

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

This is mostly what the old implementation did...

I still have to add more docs and probably more tests too, but the general idea is there. We are only looking at fused tasks,
since they might have the pattern we care about that caused pydata/xarray#9926

The way we are currently modifying these fused tasks isn't great though, we are using the dictionary that defines the fused task and replace fused getitem tasks with an alias before modifying the parent task with the new slice, i.e.


inner_graph = {
    "a": Task("a", getitem, TaskRef("original_open_dataset"), slice(0, 100)),
    "b": Task("b", getitem, TaskRef("a"), slice(30, 50)),
}

# translates to

inner_graph = {
    "a": Task("a", getitem, TaskRef("original_open_dataset"), slice(30, 50)),
    "b": Alias("b", "a"),
}

We might want to consider adding an API for this instead of modifying the dictionary inplace? cc @fjetter for thoughts?

Copy link
Contributor

github-actions bot commented Jan 7, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files  ± 0       15 suites  ±0   4h 30m 42s ⏱️ + 2m 56s
 17 154 tests + 1   15 959 ✅ + 1   1 195 💤 ±0  0 ❌ ±0 
211 342 runs  +13  194 167 ✅ +15  17 175 💤  - 2  0 ❌ ±0 

Results for commit 27a5441. ± Comparison against base commit 7f97e68.

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member

fjetter commented Jan 8, 2025

We might want to consider adding an API for this instead of modifying the dictionary inplace?

I'm open to adding new APIs as long as they are somewhat general. I wouldn't want to add a Task.maybe_fuse_slices method but a Task.unfuse which returns a list of the nested tasks (i.e. disassemble the dict for you) is something I could see working. Such an API is more costly than the inplace mutation, of course.

Is there any other API you have in mind that would be useful to have?

@fjetter
Copy link
Member

fjetter commented Jan 8, 2025

cc @hendrikmakait who's thinking about slimming down fuse and possibly changing its internal representation

@phofl
Copy link
Collaborator Author

phofl commented Jan 8, 2025

I honestly don't know what a good API here would look like. This all feel like horrible abstraction leaks. We might consider just living with the ugly thing here?

Rebuilding the list of nested tasks just puts the ugliness somewhere else

@fjetter
Copy link
Member

fjetter commented Jan 8, 2025

I find the fuse/unfuse to not be horrible. In fact, I could see this being useful outside of this application, e.g. if we encountered two sequential fused tasks and wanted to fuse those more efficiently like... Task.fuse(*t.unfuse() for t in all_tasks)

Comment on lines +95 to +103
"""Optimize slices
1. Fuse repeated slices, like x[5:][2:6] -> x[7:11]

This is generally not very important since we are fusing those tasks anyway. There
is one specific exception to how xarray implements opening netcdf files and subsequent
slices. Not merging them together can cause reading the whole netcdf file before
we drop the unnecessary data. Fusing slices avoids that pattern.

See https://github.com/pydata/xarray/issues/9926
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we somehow confirm this is actually working now? A manual confirmation is good enough for me at this point

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this manually with netcdf files, but the test also validates this behavior more generally

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants