Complex-filters recap #2182

Closed
tcompa opened this issue Jan 15, 2025 · 5 comments
Labels: flexibility (Support more workflow-execution use cases)

Comments

tcompa (Collaborator) commented Jan 15, 2025

This placeholder issue will collect all remaining open questions and TBD items from:

tcompa added the flexibility (Support more workflow-execution use cases) label on Jan 15, 2025
tcompa (Collaborator, Author) commented Jan 16, 2025

Here is a first recap of the relevant points, as they are currently implemented in v2.11.

EDIT: we updated this comment after #2187, #2190 and #2191.

A review would be most welcome - cc @jluethi

Which objects have filters

  • DatasetV2 has columns for attribute_filters and type_filters.
  • JobV2 has an attribute_filters column.
  • WorkflowTaskV2 has a type_filters column.
  • TaskV2 has input_types and output_types columns.
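
For reference, the shapes of these columns correspond roughly to the following sketch (the type-alias names are illustrative, not the actual fractal-server definitions):

```python
from typing import Union

# Illustrative type aliases (names are not the actual fractal-server ones).
# Attribute filters: each key maps to a list of allowed scalar values.
AttributeFilters = dict[str, list[Union[int, float, str, bool]]]
# Type filters: each type name maps to a required boolean flag.
TypeFilters = dict[str, bool]

# DatasetV2:      attribute_filters (AttributeFilters), type_filters (TypeFilters)
# JobV2:          attribute_filters (AttributeFilters)
# WorkflowTaskV2: type_filters (TypeFilters)
# TaskV2:         input_types (TypeFilters), output_types (TypeFilters)
```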

Valid attribute filters / typical cases

Standard valid examples:

attribute_filters={}
-> do not apply any filter

attribute_filters={"key1": ["A"]}
-> require that image has attribute key1 equal to A

attribute_filters={"key1": ["A", "B"]}
-> require that image has attribute key1 equal to A or B

attribute_filters={"key1": [1, 2]}
-> require that image has attribute key1 equal to 1 or 2

attribute_filters={"key1": [True, False]}
-> require that image has attribute key1 equal to True or False (which corresponds to no filtering)

attribute_filters={"key1": [1.5, -1.2]}
-> require that image has attribute key1 equal to 1.5 or -1.2

attribute_filters={"key1": [1, 2], "key2": ["A", "B"]}
-> require that image has (attribute key1 equal to 1 or 2) and (attribute key2 equal to A or B)
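
In other words, values within a single key are combined with OR, while different keys are combined with AND. A minimal sketch of this matching logic (illustrative only, not the actual runner code; the function name and image layout are assumptions):

```python
def image_matches_attribute_filters(
    image_attributes: dict,
    attribute_filters: dict[str, list],
) -> bool:
    """Return True if the image attributes satisfy all attribute filters."""
    for key, allowed_values in attribute_filters.items():
        # AND across keys, OR within a single key.
        if image_attributes.get(key) not in allowed_values:
            return False
    return True


image = {"key1": "A", "key2": 2}
assert image_matches_attribute_filters(image, {})                    # no filtering
assert image_matches_attribute_filters(image, {"key1": ["A", "B"]})  # key1 equal to A or B
assert not image_matches_attribute_filters(image, {"key1": ["A"], "key2": [3]})
```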

Valid attribute filters / unusual cases (TBD)

There are two edge cases which are currently considered valid:

First

attribute_filters={"key1": []}
-> require that image has attribute key1 in [], meaning that no image could ever match with this filter

This one is formally correct but meaningless, and we could turn it into an invalid case.

Second

attribute_filters={"key1": None, "key2": ["A", "B"]}
-> require that image has attribute key2 equal to A or B, and do not apply any filtering for key1

This is due to:

  • Remaining backwards-compatible (since up to v2.10 setting an attribute value to None would disable that filter).
  • The fact that we often had to merge different sets of filters (not any more, see Replace attribute filters rather than merging them #2155), and this was a way to represent a patch that would disable a previously set filter.

The latter reason no longer holds, and I guess the first one is not very relevant in real-life usage, so I'd be in favor of deprecating this option. Either way, we will have to review this (because it is currently not fully backwards compatible in the API schemas) and update the data-migration script.

Invalid attribute filters

The following sets of attribute filters are currently considered invalid:

{True: ["value"]}
because the key is not a string

{1: ["value"]},  # non-string key
because the key is not a string

{"key1": 1}
because the value is not a list

{"key1": True}
because the value is not a list

{"key1": "something"}
because the value is not a list

{"key1": [1], " key1  ": [1]}
because keys are not unique, after their normalization

{"key1": [None]}
because None is not a valid type

{"key1": [1, 2.0]}
because 1 is an integer and 2.0 is a float
(if needed, we could modify this by defining a new type which is a union of int and float)

{"key1": [1, "a"]}
because the list of values has non-homogeneous types

{"key1": [1, True]}
because the list of values has non-homogeneous types

{"key1": [[1, 2], [1, 2]]}
because the list of values has non-scalar items
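
A rough illustration of these constraints as a standalone check (a sketch only; the actual validation lives in Pydantic schemas, and the key normalization is assumed here to be whitespace stripping):

```python
SCALAR_TYPES = (int, float, str, bool)


def validate_attribute_filters(attribute_filters: dict) -> None:
    """Raise ValueError for the invalid cases listed above (illustrative sketch)."""
    normalized_keys = []
    for key, values in attribute_filters.items():
        if not isinstance(key, str):
            raise ValueError(f"Non-string key: {key!r}")
        normalized_keys.append(key.strip())  # assumed normalization: strip whitespace
        if values is None:
            continue  # the (to-be-deprecated) "None disables this filter" case
        if not isinstance(values, list):
            raise ValueError(f"Value for {key!r} is not a list")
        if not all(isinstance(value, SCALAR_TYPES) for value in values):
            raise ValueError(f"Non-scalar or None item for {key!r}")
        # Homogeneous types only (note that bool and int count as different types).
        if len({type(value) for value in values}) > 1:
            raise ValueError(f"Non-homogeneous value types for {key!r}")
    if len(set(normalized_keys)) != len(normalized_keys):
        raise ValueError("Keys are not unique after normalization")
```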

Net effect of running a job on dataset filters

  1. Running a job cannot lead to a change of dataset.attribute_filters, since there is no mechanism for changing dataset.attribute_filters via job execution.
  2. Running a job can lead to a change of dataset.type_filters.

Which filters are consumed within the runner

  1. Dataset attribute filters are ignored in the runner; job attribute filters are the only ones that are applied.
  2. The clients are responsible for offering an easy way for the user to say "let's use the dataset filters as job filters", or to provide a custom set of filters.
  3. When preparing the list of images to be processed by a single task, the following filters are used:
    • job.attribute_filters (identical for all tasks)
    • dataset.type_filters (may change during workflow execution)
    • workflowtask.type_filters and workflowtask.task.input_types (both may be different for each task)
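
A sketch of how these three sources could combine when selecting the images for a single task (the function and argument names, the image layout, and the merge precedence are illustrative assumptions, not the actual runner code):

```python
def select_images_for_task(images, job_attribute_filters, dataset_type_filters,
                           wftask_type_filters, task_input_types):
    """Illustrative per-task image selection (not the actual runner code)."""
    # Combine the type-filter sources (assumed precedence, left to right).
    type_filters = {**dataset_type_filters, **wftask_type_filters, **task_input_types}
    selected = []
    for image in images:
        attributes_ok = all(
            image["attributes"].get(key) in values
            for key, values in job_attribute_filters.items()
        )
        types_ok = all(
            image["types"].get(key, False) == flag
            for key, flag in type_filters.items()
        )
        if attributes_ok and types_ok:
            selected.append(image)
    return selected
```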

Which filters are produced/updated within the runner

After a task has run, there is a single possible source for updates to the dataset type_filters, namely the workflowtask.task.output_types. This is typically set in the task-package manifest.
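
As a minimal sketch of this update (plain dict-merge semantics and the type names are assumptions for illustration):

```python
def updated_dataset_type_filters(current_type_filters: dict, output_types: dict) -> dict:
    """Illustrative update of dataset type filters after a task has run."""
    return {**current_type_filters, **output_types}


# E.g. a task declaring output_types={"3D": False, "registered": True}
# (hypothetical type names) sets/overrides those flags in the dataset type filters:
assert updated_dataset_type_filters(
    {"3D": True}, {"3D": False, "registered": True}
) == {"3D": False, "registered": True}
```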

tcompa (Collaborator, Author) commented Jan 21, 2025

We edited the comment above (#2182 (comment)) as per the latest discussions and updates.

@jluethi: a review would be useful (this can go together with actual tests on the staging server). A minor point which is still to be confirmed is the one in the "Valid attribute filters / unusual cases" section.

(cc @ychiucco)

jluethi (Collaborator) commented Jan 21, 2025

@tcompa Agreed to all! Looks good to me. I'll review the additional type-filter questions in that issue separately.

attribute_filters={"key1": []}

I have no strong opinion on this. In general, we should not allow people to actually set such filters in the interface.

attribute_filters={"key1": None, "key2": ["A", "B"]}
...
Thus I'd be in favor of deprecating this option.

For job attribute filters, we can remove this without an issue. Given that the only source of defaults for job attribute filters is the dataset attribute filters, I don't see a use for keeping the "None = unset the filter" support anymore either.
Agreed on deprecating it.

{"key1": 1}
because the value is not a list

That was the old data structure, right? As long as we transition all existing dataset attribute filters that may have been {"key1": 1} to {"key1": [1]}, I'm fine with that. I don't think they will have been used very often.

tcompa (Collaborator, Author) commented Jan 21, 2025

{"key1": 1}
because the value is not a list

That was the old data structure, right?

Yes.

As long as we transition over all existing dataset attribute filters that may have been {"key1": 1} to {"key1": [1]}, I'm fine with that.

Yes, this is part of a data-migration script for v2.11.0.
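
As an illustration, the transformation could look roughly like this (a sketch; the handling of None-valued entries is an assumption, not the actual migration code):

```python
def migrate_attribute_filters(old_filters: dict) -> dict:
    """Sketch of migrating old-style scalar filters to the new list-based format."""
    new_filters = {}
    for key, value in old_filters.items():
        if value is None:
            continue  # old "None disables this filter" entries are dropped (assumed)
        new_filters[key] = value if isinstance(value, list) else [value]
    return new_filters


assert migrate_attribute_filters({"key1": 1, "key2": None}) == {"key1": [1]}
```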

tcompa (Collaborator, Author) commented Jan 28, 2025

This was a recap issue for 2.11 work. Closing.
