Python Workflow Definition #62

jan-janssen opened this issue Nov 18, 2024 · 5 comments


jan-janssen commented Nov 18, 2024

Based on the discussions at the core hackathon (flip charts).

Requirements:

  • No for-loops, if-statements or while-loops (no cycles)
  • No execution flow or execution parameters
  • Just a DAG
  • No storage

=> Just the language - only DAG function calls
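
As an illustration (a hedged sketch, not from the flip charts), such a DAG-only workflow would just be plain Python function calls whose data dependencies form the graph:

# Hypothetical sketch: a workflow written only as function calls; the data
# dependencies between the calls define the DAG, there is no for-loop,
# if-statement or while-loop and no execution parameters.
def add_x_and_y(x, y):
    return x + y

def add_x_and_y_and_z(x, y, z):
    return x + y + z

tmp = add_x_and_y(1, 2)                 # node with the user inputs x=1 and y=2
result = add_x_and_y_and_z(1, 2, tmp)   # node consuming the output of the previous node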

Needs:

  • Transformers -> convert to function nodes
  • Flatten macro nodes

This would also be compatible with CWL and Snakemake - from DAG to CWL and Snakemake.

Different levels of Interoperability

  1. Reuse nodes - execute a CWL workflow with a pyiron workflow
  2. Transfer DAGs from one framework to another (recipe only)
  3. pyiron node store - about 6 months of development required; a tag system and reviews would improve quality; as edges are dataclass nodes they would also be included
  4. Share workflows - a JSON representation of workflows, as functions are no longer binary but can be referenced from the pyiron node store.

Publish workflows

  • provide Python and JSON
  • create the JSON as part of the Continuous Integration environment - no additional effort on the user side.

jan-janssen commented Nov 24, 2024

Based on the discussions we had as part of the core hackathon, I tried to work on a way to exchange workflow graphs between pyiron_base and pyiron_workflow, based on the dictionary representation pyironflow already uses to communicate the workflow graph with the visual programming interface. The two example notebooks for pyiron_base are available at:

The two example notebooks for pyiron_workflow are available at:

The two example notebooks for jobflow are available at:

The format is currently very simple, based on the get_edges() function in pyironflow:
https://github.com/pyiron/pyironFlow/blob/main/pyironflow/reactflow.py#L178

edges_lst = [
    {'target': 0, 'targetHandle': 'x', 'source': 1, 'sourceHandle': 'x'},
    {'target': 1, 'targetHandle': 'x', 'source': 2, 'sourceHandle': None},
    {'target': 1, 'targetHandle': 'y', 'source': 3, 'sourceHandle': None},
    {'target': 0, 'targetHandle': 'y', 'source': 1, 'sourceHandle': 'y'},
    {'target': 0, 'targetHandle': 'z', 'source': 1, 'sourceHandle': 'z'},
]

At the moment the nodes dictionary simply maps the ids used for target and source in edges_lst to the functions as Python objects:

nodes_dict = {
    0: add_x_and_y_and_z,
    1: add_x_and_y,
    2: 1,
    3: 2,
}

With the functions being simply:

def add_x_and_y(x, y):
    z = x + y
    return {"x": x, "y": y, "z": z}

def add_x_and_y_and_z(x, y, z):
    w = x + y + z
    return w
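
For completeness, a minimal sketch of how this representation could be executed by resolving the inputs of each node from the edges (the evaluate_graph helper below is hypothetical, not part of pyironflow):

def evaluate_graph(nodes_dict, edges_lst, output_id=0):
    """Hypothetical sketch: evaluate the DAG by recursively resolving node inputs."""
    cache = {}

    def evaluate_node(node_id):
        if node_id in cache:
            return cache[node_id]
        node = nodes_dict[node_id]
        if not callable(node):  # data node, e.g. 2: 1
            cache[node_id] = node
            return node
        kwargs = {}
        for edge in edges_lst:
            if edge["target"] == node_id:
                value = evaluate_node(edge["source"])
                if edge["sourceHandle"] is not None:  # select one output channel
                    value = value[edge["sourceHandle"]]
                kwargs[edge["targetHandle"]] = value
        cache[node_id] = node(**kwargs)
        return cache[node_id]

    return evaluate_node(output_id)

evaluate_graph(nodes_dict, edges_lst)  # returns 6 for the example above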

In the future these functions could also be defined in a module and then just be referenced by their path, so the nodes dictionary would look like this:

nodes_dict = {
    0: my_module.add_x_and_y_and_z,
    1: my_module.add_x_and_y,
    2: 1,
    3: 2,
}

Such a workflow could be serialized as JSON and, together with my_module, would provide a reproducible and interoperable way to publish workflows.
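
A hedged sketch of what such a JSON serialization could look like, assuming callables are stored by their import path (the serialize_workflow helper and the key names are hypothetical):

import json

def serialize_workflow(nodes_dict, edges_lst):
    """Hypothetical sketch: reference callables by '<module>.<function>' instead of
    the Python object so that nodes and edges become plain JSON."""
    nodes_json = {}
    for node_id, node in nodes_dict.items():
        if callable(node):
            nodes_json[node_id] = {"function": node.__module__ + "." + node.__name__}
        else:
            nodes_json[node_id] = {"value": node}
    return json.dumps({"nodes": nodes_json, "edges": edges_lst}, indent=2)

with open("workflow.json", "w") as f:
    f.write(serialize_workflow(nodes_dict, edges_lst))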

@jan-janssen

Alternative notation based on tuples rather than dictionaries to improve human readability:

edges_lst = [
    ('add_x_and_y/in/x', 'var_add_x_y__x'),
    ('add_x_and_y/in/y', 'var_add_x_y__y'),
    ('add_x_and_y_and_z/in/x', 'add_x_and_y/out/x'),
    ('add_x_and_y_and_z/in/y', 'add_x_and_y/out/y'),
    ('add_x_and_y_and_z/in/z', 'add_x_and_y/out/z'),
]

nodes_lst = [
    ('var_add_x_y__x@int', 1),
    ('var_add_x_y__y@int', 2),
    ('add_x_and_y@callable', add_x_and_y),
    ('add_x_and_y_and_z@callable', add_x_and_y_and_z),
]
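
A small hypothetical parser to illustrate how the 'node/(in|out)/handle' and 'name@type' conventions could be split back into explicit fields:

def parse_edge(edge):
    # Hypothetical helper: turn one tuple edge into explicit target/source fields.
    target, source = edge
    target_node, _, target_handle = target.split("/")
    if "/" in source:  # source is the output handle of another node
        source_node, _, source_handle = source.split("/")
    else:              # source is a plain data node
        source_node, source_handle = source, None
    return {"target": target_node, "targetHandle": target_handle,
            "source": source_node, "sourceHandle": source_handle}

def parse_node(node):
    # Hypothetical helper: split the 'name@type' label of one tuple node.
    label, value = node
    name, node_type = label.split("@")
    return {"name": name, "type": node_type, "value": value}

edges_parsed = [parse_edge(edge) for edge in edges_lst]
nodes_parsed = [parse_node(node) for node in nodes_lst]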


XzzX commented Nov 29, 2024

  • Since edges are directed we could omit in and out.
  • If default != value and we do not have a recipe, we create a virtual data node?
  • How to identify nodes? I suggest a dict with at least name, libpath, module and version.
  • Add an option to also save input data.

Do we want to use the exchange format also for long term storage of graphs and data?

@jan-janssen

  • Since edges are directed we could omit in and out.

The in and out is redundant, I agree. Still, it helps us prevent duplicated names: some frameworks require unique names for channels, independent of whether they are input or output channels, and as a function is likely to take a structure as an input and return a structure as an output, it is not uncommon that the same variable name is used for both. At the moment I prefer the solution with in and out. Still, I agree there is a unique transformation from one to the other, so going without in and out is also an option; then I would write myself a conversion routine to add the in and out again (a sketch follows below).
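
A hedged sketch of such a conversion routine (hypothetical helper, relying on the edge direction: the first tuple entry is always an input, the second an output):

def add_in_out(edges_lst):
    # Hypothetical sketch: re-insert the redundant in/out segments.
    converted = []
    for target, source in edges_lst:
        node, handle = target.rsplit("/", 1)
        new_target = node + "/in/" + handle
        if "/" in source:  # source is the output handle of another node
            node, handle = source.rsplit("/", 1)
            new_source = node + "/out/" + handle
        else:              # source is a plain data node
            new_source = source
        converted.append((new_target, new_source))
    return converted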

  • If default != value and we do not have a recipe, we create a virtual data node?

I am not exactly sure what you are referring to. The workflow should not have any loose ends: either the user sets a value or we have a recipe to get the input from previous functions. I currently have not considered the case of incomplete inputs.

  • How to identify nodes? I suggest a dict with at least name, libpath, module and version.

I currently consider the case that a workflow consists of a Python module with a number of functions, a conda environment file and a JSON representation of the workflow. Ideally the Python module should be minimal and most Python functions should be distributed as conda packages. So for the workflow definition I would store the import path of the module and the name of the function; I guess that covers name, libpath and module. For the version I would refer to the conda environment file: a change in the dependencies of the modules we use in our workflow can already lead to different results, so just storing the version of the specific module seems insufficient from my perspective.
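
Under these assumptions, re-importing a node function from the stored module path could look roughly like this (sketch only; the load_node_function helper and the key names are hypothetical, and my_module stands for the published workflow module):

import importlib

def load_node_function(node_entry):
    # Hypothetical sketch: resolve a callable from its stored import path,
    # e.g. {"module": "my_module", "function": "add_x_and_y"}.
    module = importlib.import_module(node_entry["module"])
    return getattr(module, node_entry["function"])

# The dependency versions would be pinned via the conda environment file rather
# than via an explicit version field in the workflow JSON.
add_x_and_y = load_node_function({"module": "my_module", "function": "add_x_and_y"})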

  • Add an option to also save input data.

Input data is saved as additional nodes. For default inputs there is no need to save them separately as those are already stored in the function definition.

Do we want to use the exchange format also for long term storage of graphs and data?

This is a topic to discuss. The primary use case is to exchange workflows between different workflow frameworks, currently AiiDA, jobflow (Materials Project) and pyiron, to achieve the interoperability of workflows and make them FAIR. https://arxiv.org/abs/2410.03490

Beyond this use case I see the option to extend the format and also use it for long-term storage of both graphs and data. I think such an interoperable storage format would allow us to use both pyiron_base and pyiron_workflow, and also future versions, in parallel, in contrast to a solution which is too specifically optimized for one backend. Still, I do not have an overview of the performance impact this would have on pyiron_workflow in comparison to an optimized format.
