Python Workflow Definition #62

jan-janssen opened this issue Nov 18, 2024 · 5 comments


jan-janssen commented Nov 18, 2024

Based on the discussions at the core hackathon (flip charts).

Requirements:

  • No for-loops, if-statements or while-loops (no cycles)
  • No execution flow or execution parameters
  • Just a DAG
  • No storage

=> Just the language - only DAG function calls
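
As an illustration (a hedged sketch, not from the flip charts), such a DAG-only workflow would just be plain Python function calls whose data dependencies form the graph:

# Hypothetical sketch: a workflow written only as function calls; the data
# dependencies between the calls define the DAG, there is no for-loop,
# if-statement or while-loop and no execution parameters.
def add_x_and_y(x, y):
    return x + y

def add_x_and_y_and_z(x, y, z):
    return x + y + z

tmp = add_x_and_y(1, 2)                 # node with the user inputs x=1 and y=2
result = add_x_and_y_and_z(1, 2, tmp)   # node consuming the output of the previous node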

Needs:

  • Transformers -> convert to function nodes
  • Flatten macro nodes

This would also be compatible with CWL and Snakemake - from DAG to CWL and Snakemake.

Different levels of Interoperability

  1. Reuse nodes - execute a CWL workflow with a pyiron workflow
  2. Transfer DAGs from one framework to another (recipe only)
  3. pyiron node store - about 6 months of development required; a tag system and reviews would improve quality; as edges are dataclass nodes they would also be included
  4. Share workflows - a JSON representation of workflows, as functions are no longer binary but can be referenced from the pyiron node store.

Publish workflows

  • provide Python and JSON
  • create the JSON as part of the Continuous Integration environment - no additional effort on the user side.

jan-janssen commented Nov 24, 2024

Based on the discussions we had as part of the core hackathon, I tried to work on a way to exchange workflow graphs between pyiron_base and pyiron_workflow, based on the dictionary representation pyironflow already uses to communicate the workflow graph with the visual programming interface. The two example notebooks for pyiron_base are available at:

The two example notebooks for pyiron_workflow are available at:

The two example notebooks for jobflow are available at:

The format is currently very simple, based on the get_edges() function in pyironflow:
https://github.com/pyiron/pyironFlow/blob/main/pyironflow/reactflow.py#L178

edges_lst = [
    {'target': 0, 'targetHandle': 'x', 'source': 1, 'sourceHandle': 'x'},
    {'target': 1, 'targetHandle': 'x', 'source': 2, 'sourceHandle': None},
    {'target': 1, 'targetHandle': 'y', 'source': 3, 'sourceHandle': None},
    {'target': 0, 'targetHandle': 'y', 'source': 1, 'sourceHandle': 'y'},
    {'target': 0, 'targetHandle': 'z', 'source': 1, 'sourceHandle': 'z'},
]

At the moment the nodes dictionary simply maps the ids used for target and source in edges_lst to the functions as Python objects:

nodes_dict = {
    0: add_x_and_y_and_z,
    1: add_x_and_y,
    2: 1,
    3: 2,
}

With the functions being simply:

def add_x_and_y(x, y):
    z = x + y
    return {"x": x, "y": y, "z": z}

def add_x_and_y_and_z(x, y, z):
    w = x + y + z
    return w
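
For completeness, a minimal sketch of how this representation could be executed by resolving the inputs of each node from the edges (the evaluate_graph helper below is hypothetical, not part of pyironflow):

def evaluate_graph(nodes_dict, edges_lst, output_id=0):
    """Hypothetical sketch: evaluate the DAG by recursively resolving node inputs."""
    cache = {}

    def evaluate_node(node_id):
        if node_id in cache:
            return cache[node_id]
        node = nodes_dict[node_id]
        if not callable(node):  # data node, e.g. 2: 1
            cache[node_id] = node
            return node
        kwargs = {}
        for edge in edges_lst:
            if edge["target"] == node_id:
                value = evaluate_node(edge["source"])
                if edge["sourceHandle"] is not None:  # select one output channel
                    value = value[edge["sourceHandle"]]
                kwargs[edge["targetHandle"]] = value
        cache[node_id] = node(**kwargs)
        return cache[node_id]

    return evaluate_node(output_id)

evaluate_graph(nodes_dict, edges_lst)  # returns 6 for the example above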

In the future these functions could also be defined in a module and then just be referenced by their path, so the nodes dictionary would look like this:

nodes_dict = {
    0: my_module.add_x_and_y_and_z,
    1: my_module.add_x_and_y,
    2: 1,
    3: 2,
}

Such a workflow could be serialized as JSON and, together with my_module, would provide a reproducible and interoperable way to publish workflows.
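
A hedged sketch of what such a JSON serialization could look like, assuming callables are stored by their import path (the serialize_workflow helper and the key names are hypothetical):

import json

def serialize_workflow(nodes_dict, edges_lst):
    """Hypothetical sketch: reference callables by '<module>.<function>' instead of
    the Python object so that nodes and edges become plain JSON."""
    nodes_json = {}
    for node_id, node in nodes_dict.items():
        if callable(node):
            nodes_json[node_id] = {"function": node.__module__ + "." + node.__name__}
        else:
            nodes_json[node_id] = {"value": node}
    return json.dumps({"nodes": nodes_json, "edges": edges_lst}, indent=2)

with open("workflow.json", "w") as f:
    f.write(serialize_workflow(nodes_dict, edges_lst))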

@jan-janssen

Alternative notation based on tuples rather than dictionaries to improve human readability:

edges_lst = [
    ('add_x_and_y/in/x', 'var_add_x_y__x'),
    ('add_x_and_y/in/y', 'var_add_x_y__y'),
    ('add_x_and_y_and_z/in/x', 'add_x_and_y/out/x'),
    ('add_x_and_y_and_z/in/y', 'add_x_and_y/out/y'),
    ('add_x_and_y_and_z/in/z', 'add_x_and_y/out/z'),
]

nodes_lst = [
    ('var_add_x_y__x@int', 1),
    ('var_add_x_y__y@int', 2),
    ('add_x_and_y@callable', add_x_and_y),
    ('add_x_and_y_and_z@callable', add_x_and_y_and_z),
]
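
A small hypothetical parser to illustrate how the 'node/(in|out)/handle' and 'name@type' conventions could be split back into explicit fields:

def parse_edge(edge):
    # Hypothetical helper: turn one tuple edge into explicit target/source fields.
    target, source = edge
    target_node, _, target_handle = target.split("/")
    if "/" in source:  # source is the output handle of another node
        source_node, _, source_handle = source.split("/")
    else:              # source is a plain data node
        source_node, source_handle = source, None
    return {"target": target_node, "targetHandle": target_handle,
            "source": source_node, "sourceHandle": source_handle}

def parse_node(node):
    # Hypothetical helper: split the 'name@type' label of one tuple node.
    label, value = node
    name, node_type = label.split("@")
    return {"name": name, "type": node_type, "value": value}

edges_parsed = [parse_edge(edge) for edge in edges_lst]
nodes_parsed = [parse_node(node) for node in nodes_lst]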


XzzX commented Nov 29, 2024

  • Since edges are directed we could omit in and out.
  • If default != value and we do not have a recipe, we create a virtual data node?
  • How to identify nodes? I suggest a dict with at least name, libpath, module and version.
  • Add an option to also save input data.

Do we want to use the exchange format also for long term storage of graphs and data?

@jan-janssen

  • Since edges are directed we could omit in and out.

The in and out is redundant, I agree. Still, it helps us prevent duplicated names: some frameworks require unique names for channels, independent of whether they are input or output channels, and as a function is likely to take a structure as an input and return a structure as an output, it is not uncommon that the same variable name is used for both. At the moment I prefer the solution with in and out. Still, I agree there is a unique transformation from one to the other, so going without in and out is also an option; then I would write myself a conversion routine to add the in and out again (a sketch follows below).
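
A hedged sketch of such a conversion routine (hypothetical helper, relying on the edge direction: the first tuple entry is always an input, the second an output):

def add_in_out(edges_lst):
    # Hypothetical sketch: re-insert the redundant in/out segments.
    converted = []
    for target, source in edges_lst:
        node, handle = target.rsplit("/", 1)
        new_target = node + "/in/" + handle
        if "/" in source:  # source is the output handle of another node
            node, handle = source.rsplit("/", 1)
            new_source = node + "/out/" + handle
        else:              # source is a plain data node
            new_source = source
        converted.append((new_target, new_source))
    return converted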

  • If default != value and we do not have a recipe, we create a virtual data node?

I am not exactly sure what you are referring to. The workflow should not have any loose ends: either the user sets a value or we have a recipe to get the input from previous functions. I currently have not considered the case of incomplete inputs.

  • How to identify nodes? I suggest a dict with at least name, libpath, module and version.

I currently consider the case that a workflow consists of a Python module with a number of functions, a conda environment file and a JSON representation of the workflow. Ideally the Python module should be minimal and most Python functions should be distributed as conda packages. So for the workflow definition I would store the import path of the module and the name of the function; I guess that covers name, libpath and module. For the version I would refer to the conda environment file: a change in the dependencies of the modules we use in our workflow can already lead to different results, so just storing the version of the specific module seems insufficient from my perspective.
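
Under these assumptions, re-importing a node function from the stored module path could look roughly like this (sketch only; the load_node_function helper and the key names are hypothetical, and my_module stands for the published workflow module):

import importlib

def load_node_function(node_entry):
    # Hypothetical sketch: resolve a callable from its stored import path,
    # e.g. {"module": "my_module", "function": "add_x_and_y"}.
    module = importlib.import_module(node_entry["module"])
    return getattr(module, node_entry["function"])

# The dependency versions would be pinned via the conda environment file rather
# than via an explicit version field in the workflow JSON.
add_x_and_y = load_node_function({"module": "my_module", "function": "add_x_and_y"})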

  • Add an option to also save input data.

Input data is saved as additional nodes. For default inputs there is no need to save them separately as those are already stored in the function definition.

Do we want to use the exchange format also for long term storage of graphs and data?

This is a topic to discuss. The primary use case is to exchange workflows between different workflow frameworks, currently AiiDA, jobflow (Materials Project) and pyiron, to achieve the interoperability of workflows and make them FAIR. https://arxiv.org/abs/2410.03490

Beyond this use case I see the option to extend the format and also use it for long-term storage of both graphs and data. I think such an interoperable storage format would allow us to use both pyiron_base and pyiron_workflow, and also future versions, in parallel, in contrast to a solution which is too specifically optimized for one backend. Still, I do not have an overview of the performance impact this would have on pyiron_workflow in comparison to an optimized format.
