Init work on publishing final output of "simple" workflow to volume #50

Draft
wants to merge 22 commits into
base: main
from

trey-stafford
Member

@trey-stafford trey-stafford commented Jan 15, 2025

resolves #48

This PR adds a final step to OGDC "simple" recipes: it publishes the final data to a subpath of the qgnet-ogdc-workflow-pvc PVC based on the recipe's ID.

The --overwrite flag can be used to overwrite a recipe's data if it has already been published. E.g.,

ogdc-runner submit --wait --overwrite ../ogdc-recipes/recipes/seal-tags/

Otherwise an OgdcDataAlreadyPublished exception is raised.

This is a step toward chaining multiple workflows together. In fact, this PR introduces two new Argo workflows: one that deletes existing published data if --overwrite is passed, and a second that checks whether data has already been published (if it has and --overwrite is not passed, the OgdcDataAlreadyPublished exception is raised). These Argo workflows are submitted and executed independently of the Argo workflow that runs the requested data transformation recipe.
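
The overwrite control flow described above could be sketched roughly as follows. This is a local stand-in under stated assumptions, not the PR's implementation: the PR runs the existence check and the deletion as separate Argo workflows against the PVC, whereas this sketch uses a plain directory. The function name `prepare_publication_dir` and the `publish_root` parameter are hypothetical, and the exception class here is a stand-in defined only so the example is self-contained.

```python
import shutil
from pathlib import Path


class OgdcDataAlreadyPublished(Exception):
    """Stand-in for the exception named in the PR description."""


def prepare_publication_dir(
    publish_root: Path, recipe_id: str, *, overwrite: bool
) -> Path:
    """Return an empty directory for a recipe's published data.

    Hypothetical helper: `publish_root` plays the role of the PVC mount,
    and the recipe ID is used as the subpath, as described in the PR.
    """
    target = publish_root / recipe_id
    if target.exists():
        if not overwrite:
            # Mirrors the "check" workflow: refuse to clobber published data.
            raise OgdcDataAlreadyPublished(
                f"Data for recipe {recipe_id!r} already exists at {target}"
            )
        # Mirrors the "delete" workflow: remove the previous publication.
        shutil.rmtree(target)
    target.mkdir(parents=True)
    return target
```

Calling the helper twice for the same recipe ID with `overwrite=False` raises the exception on the second call; passing `overwrite=True` clears the existing directory first.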

Note: this PR introduces a "publication" mechanism that is fairly simple in implementation. It just puts data into persistent storage. It does not trigger any other processes that might put that data into e.g., a DataONE dataset or expose it to a user for download. I anticipate these will be "next steps", soon to come. For now, the approach of saving data to a known location on the OGDC workflows PVC sets us up for chaining additional workflows (#45) that take the previous workflow's publication location as input.

In other words, this PR currently sets us up to do:

```mermaid
flowchart LR
    OGDC_RECIPE["OGDC Recipe (id=foo) Argo Workflow"] --> PVC[("**workflow-pvc**<br>foo/{data}")]
```

Next, we will want to support something like:

```mermaid
flowchart LR
    OGDC_RECIPE["OGDC Recipe (id=foo) Argo Workflow"]
    PVC_1[("**workflow-pvc**<br>foo/{data}")]
    WORKFLOW_TEMPLATE["Workflow template (e.g., viz-workflow)"]
    PVC_2[("**workflow-pvc**<br>foo_viz/{data}")]

    OGDC_RECIPE --> PVC_1
    PVC_1 --> WORKFLOW_TEMPLATE --> PVC_2
```
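
As a rough illustration of the chaining idea in the diagram, a downstream workflow could derive its input and output locations from the upstream recipe's ID. This is only a sketch: the function names and the `suffix` parameter are hypothetical, and the `foo` / `foo_viz` naming is taken from the diagram rather than from any implemented convention.

```python
from pathlib import PurePosixPath


def published_subpath(recipe_id: str) -> PurePosixPath:
    """Subpath on the workflow PVC where a recipe's data is published."""
    return PurePosixPath(recipe_id)


def chained_subpaths(
    recipe_id: str, suffix: str
) -> tuple[PurePosixPath, PurePosixPath]:
    """Return (input, output) subpaths for a hypothetical downstream workflow.

    The input is the upstream recipe's publication location; the output
    is a sibling subpath named `<recipe_id>_<suffix>`.
    """
    upstream = published_subpath(recipe_id)
    downstream = PurePosixPath(f"{recipe_id}_{suffix}")
    return upstream, downstream
```

With `recipe_id="foo"` and `suffix="viz"`, the input subpath is `foo` and the output subpath is `foo_viz`, matching the second diagram.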

@trey-stafford
Member Author

trey-stafford commented Jan 15, 2025

Next steps:

  • Address published data in test environment. Cleanup?
  • Figure out how to determine if a recipe has already published data. Raise an error if so. Provide an option to overwrite.
  • Consider config for setting up volumes. We can default to qgnet-ogdc-workflow-pvc, but this should be configurable.
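
For the last item, one possible shape for a configurable volume name is an environment variable with a default. This is a sketch under assumptions: the variable name `OGDC_WORKFLOW_PVC_NAME` and the function name are hypothetical; only the default value `qgnet-ogdc-workflow-pvc` comes from this PR.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Default taken from the PR description.
DEFAULT_WORKFLOW_PVC_NAME = "qgnet-ogdc-workflow-pvc"

# Hypothetical environment variable name, chosen for illustration only.
PVC_NAME_ENV_VAR = "OGDC_WORKFLOW_PVC_NAME"


def get_workflow_pvc_name(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the PVC name to use, falling back to the default.

    An unset or empty environment variable yields the default name.
    """
    env = os.environ if environ is None else environ
    return env.get(PVC_NAME_ENV_VAR) or DEFAULT_WORKFLOW_PVC_NAME
```

Accepting the environment as a parameter keeps the lookup easy to test without mutating `os.environ`.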

@trey-stafford trey-stafford force-pushed the publish-outputs-to-volume branch from 25fba91 to 2e9f997 Compare January 16, 2025 00:23
@rmarow
Contributor

rmarow commented Jan 16, 2025

As of this morning I'm able to run this!

trey-stafford and others added 3 commits January 16, 2025 11:36
Will be used for control flow. We should avoid overwriting existing, published
data unless an overwrite flag is given (TODO).
@trey-stafford trey-stafford force-pushed the publish-outputs-to-volume branch from d2bdf86 to 6d1ad27 Compare January 16, 2025 18:36
@trey-stafford trey-stafford changed the base branch from main to test-image-config-from-env January 16, 2025 18:36
Base automatically changed from test-image-config-from-env to main January 21, 2025 15:36
@rmarow
Contributor

rmarow commented Jan 22, 2025

work on overwrite option

Prep for "workflow of workflows" approach
Was thinking that we would construct potentially many Argo workflows and then
orchestrate them with a parent Argo workflow, but this doesn't work so well in
practice. Some features, like artifacts, do not work within child workflows.
Anticipate the need for more specific exception handling
Some of the errors around traversing nodes and checking outputs are a bit
confusing. I think the way we have the workflow set up means that the relevant
attrs will be present. May want to consider more robust error checking (or maybe
wrap all of it in try/except...) down the road.
We expect OGDC workflows to have access to the workflow PVC so that data
outputs can be written
Makes it a little easier to understand the logic
Will revisit this. The `submit_ogdc_recipe` function may end up submitting more
than one workflow that we want to preserve (or clean up) in the future, which
would mean that we can't just return the name of a single workflow. Maybe we end
up having a result object that contains references to all workflows executed for
a given recipe.
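
The result object floated in the last commit message might look something like the sketch below. Everything here is hypothetical (class name, fields, and the workflow names in the usage note); it only illustrates returning references to all workflows submitted for a recipe instead of a single workflow name.

```python
from dataclasses import dataclass, field


@dataclass
class RecipeSubmissionResult:
    """Hypothetical container for every workflow submitted for one recipe."""

    recipe_id: str
    # Names of submitted Argo workflows, in submission order.
    workflow_names: list[str] = field(default_factory=list)

    def add_workflow(self, name: str) -> None:
        """Record another workflow submitted on behalf of this recipe."""
        self.workflow_names.append(name)

    @property
    def last_workflow_name(self) -> str:
        """Name of the most recently submitted workflow."""
        if not self.workflow_names:
            raise ValueError("No workflows have been recorded yet")
        return self.workflow_names[-1]
```

A caller could record, say, a check workflow and then the recipe workflow with `add_workflow`, and later iterate over `workflow_names` to preserve or clean up each one.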
Successfully merging this pull request may close these issues.

Publish final outputs to persistent volume
2 participants