Support for GlobusComputeExecutor #3607

Open
wants to merge 2 commits into master
Conversation

@yadudoc (Member) commented Sep 5, 2024

Description

This PR adds a new GlobusComputeExecutor that wraps the Globus Compute SDK to allow Parsl to execute tasks via Globus Compute. This mechanism supports remote execution of tasks, similar to the functionality that parsl.channels enabled, and is a potential replacement for it.

Since Globus Compute often runs on remote machines that do not share a filesystem with the Parsl runtime, tests have been updated with new shared_fs and staging_required pytest markers. I have not added CI actions to run these tests in our CI system, but you can run them locally with these steps (a minimal configuration sketch follows the list):

  1. Install globus-compute-sdk with `pip install '.[globus-compute]'`
  2. Configure and start a Globus Compute endpoint: `globus-compute-endpoint start <endpoint_name>`
  3. Set an environment variable with the endpoint id for the tests: `export GLOBUS_COMPUTE_ENDPOINT=<endpoint_id>`
  4. Run the tests: `pytest -v -k "not shared_fs" --config parsl/tests/configs/globus_compute.py parsl/tests/`
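
For orientation, here is a minimal configuration sketch in the spirit of the test config referenced in step 4 (`parsl/tests/configs/globus_compute.py`). The constructor arguments shown are assumptions about this PR's API and may differ from the final version:

```python
import os

from globus_compute_sdk import Executor
from parsl.config import Config
from parsl.executors import GlobusComputeExecutor

# Assumes GlobusComputeExecutor wraps a globus_compute_sdk.Executor pointed at
# the endpoint id exported in step 3 above (GLOBUS_COMPUTE_ENDPOINT).
config = Config(
    executors=[
        GlobusComputeExecutor(
            label="globus_compute",
            executor=Executor(endpoint_id=os.environ["GLOBUS_COMPUTE_ENDPOINT"]),
        )
    ]
)
```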

Changed Behaviour

N/A

Fixes

Fixes # (issue)

Type of change

Choose which options apply, and delete the ones which do not apply.

  • New feature

@benclifford (Collaborator)

Aside from the lack of a CI test, my main concern (from seeing people implement this before) is what goes on with different Parsl versions: there's one version in the endpoint (pinned to a specific version by Globus Compute) and another version on the submit side, and having different Parsl versions like that is out of scope for Parsl.

This might be a case of documenting what users should expect to work or not work, or it might be something deeper. At the very least we should expect them to be able to use the same combination of versions as used in CI.

@yadudoc force-pushed the globus_compute_executor.py branch from 8273cc6 to 3ed482e on September 16, 2024 16:04
@yadudoc force-pushed the globus_compute_executor.py branch from 3ed482e to 495d009 on October 1, 2024 19:09
@yadudoc force-pushed the globus_compute_executor.py branch 2 times, most recently from b004cf2 to a3cad96 on October 17, 2024 17:36
@yadudoc force-pushed the globus_compute_executor.py branch 3 times, most recently from 3b26095 to dbc14cf on October 24, 2024 19:18
github-merge-queue bot pushed a commit that referenced this pull request on Oct 31, 2024
# Description

This PR adds a new `staging_required` marker, and applies a `shared_fs` marker to several
tests that assume a shared filesystem in order to work. Similarly, a few tests
that use staging are now marked with the `staging_required` marker.

This PR splits out changes from #3607.

# Changed Behaviour

These markers should not affect any test behavior, since none of the test
entries in the Makefile make use of them.
These changes will take effect once they are used by `GlobusComputeExecutor`
and potentially `KubernetesProvider` tests.

## Type of change

Choose which options apply, and delete the ones which do not apply.

- New feature
- Code maintenance/cleanup
@yadudoc force-pushed the globus_compute_executor.py branch 4 times, most recently from 7d7e895 to 4f54526 on November 7, 2024 15:27
@yadudoc force-pushed the globus_compute_executor.py branch 2 times, most recently from d01a346 to d9b7edc on November 7, 2024 17:36
make clean_coverage
# Temporary fix, until changes make it into compute releases
git clone -b configure_tasks_working_dir https://github.com/globus/globus-compute.git
Collaborator

Maybe it's good to wait for Globus Compute to support the necessary features in a release, so that this can pin a relevant supported release?

Collaborator

(does this mean that right now a user shouldn't expect to be able to use an actual Globus Compute but instead needs to install this fork?)

Member Author

The GitHub Action uses ThreadPoolEngine because it is lightweight compared to running the default config that uses GlobusComputeEngine. The branch that I'm pointing to fixes a ThreadPoolEngine-only bug. You can run these same tests against a local Globus Compute endpoint by setting GLOBUS_COMPUTE_ENDPOINT=<ep_id>, and I've done those tests against a few configs. I believe Chris (NOAA) has also done a bunch of testing at this point.


I have been testing the GlobusComputeExecutor via this branch in combination with globus-compute-endpoint==2.30.1. I am using that release, not a special branch/tag. The one feature I haven't yet tested is passing in the user_endpoint_config at call time rather than at GlobusComputeExecutor instantiation time. What I've done so far is working great!
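
For illustration, a sketch of the two modes mentioned above: defaults set at GlobusComputeExecutor instantiation time versus a per-call override. The `parsl_resource_specification` kwarg is standard Parsl; the `user_endpoint_config` key inside it, the executor label, and the example values are assumptions about this PR's plumbing, not confirmed behavior:

```python
from parsl import python_app

# Assumes parsl.load(config) was called with a GlobusComputeExecutor labelled "globus_compute".
@python_app(executors=["globus_compute"])
def hostname(parsl_resource_specification=None):
    import platform
    return platform.node()

# Instantiation-time: whatever defaults were set on the executor apply to every task.
print(hostname().result())

# Call-time (the untested path mentioned above): a hypothetical per-task override
# routed through the resource specification; key name and values are illustrative.
print(hostname(parsl_resource_specification={
    "user_endpoint_config": {"worker_init": "module load python"},
}).result())
```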


self._executor.resource_specification = res_spec
self._executor.user_endpoint_config = user_endpoint_config
return self._executor.submit(func, *args, **kwargs)
Collaborator

This feels a bit horribly thread unsafe, were Parsl ever to get multithreaded submission - the sort of thing DFK.submitter_lock was put in place to deal with.

Member Author

Potentially yes. I don't see how this is only a GCE specific problem. Secondly, I think it's safe to wait until users report this as an issue, or ask for it before working on this.

Collaborator

Err, I don't think it is a GCE specific problem. But I think the GCE class needs to handle it all the same. What's wrong with wrapping this work in the lock?

Or is the point that this method is called from within the lock already?

Collaborator

When this is used from the DFK, this is already in a lock (although as you're both well aware, there is strong community pressure to use pieces of Parsl outside of the DFK). That lock was introduced in PR #625 because executors are not expected to be thread safe on submission, and so in that context this code is not dangerous.

This is a more general backpressure against using this style of API, which seems to conflate the state of the submission system as a whole with parameters to a single task execution. I've definitely fixed concurrency bugs in Parsl caused by this coding style before, bugs that led not to Parsl errors but to subtle misexecutions that mostly look plausible.
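
For readers following the thread, a minimal sketch of the lock-wrapping option discussed above; the class and lock names are hypothetical and this is not the PR's code:

```python
import threading


class LockedSubmitSketch:
    """Illustrative only: serialize mutate-then-submit so concurrent callers
    cannot run a task with another task's settings."""

    def __init__(self, gc_executor):
        self._executor = gc_executor          # e.g. a globus_compute_sdk.Executor
        self._submit_lock = threading.Lock()  # hypothetical guard, analogous to DFK.submitter_lock

    def submit(self, func, res_spec, user_endpoint_config, *args, **kwargs):
        # Hold the lock across both the attribute mutation and the submit, so
        # another thread cannot interleave its own per-task settings in between.
        with self._submit_lock:
            self._executor.resource_specification = res_spec
            self._executor.user_endpoint_config = user_endpoint_config
            return self._executor.submit(func, *args, **kwargs)
```

As noted above, when submission only ever happens under DFK.submitter_lock this extra lock is redundant; it matters only if executors are used outside the DFK or the DFK ever submits from multiple threads.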

Collaborator

(see #3492, #1413)

@yadudoc force-pushed the globus_compute_executor.py branch 2 times, most recently from 24d1b07 to 53f97fd on November 14, 2024 20:10
@yadudoc (Member Author) commented Nov 15, 2024

@benclifford @khk-globus Thanks for the review and all the comments on this issue. I believe I've addressed all of them, and at this point we are only waiting for a release of globus-compute-endpoint with fixes from Globus Compute PR #1689 to update the CI testing target.

@khk-globus (Collaborator) left a comment

I haven't finished reading through the PR, but these are my first thoughts. I'll chime in on Monday morning (-0400).

Comment on lines 92 to 104
name: runinfo-3.11-${{ steps.job-info.outputs.as-ascii }}-${{ github.sha }}
path: |
Collaborator

Same context as earlier: it would be good not to hard-code the Python version but to collect it from the environment.

Comment on lines 90 to 91
5. `parsl.executors.globus_compute.GlobusComputeExecutor`: This executor uses `Globus Compute <https://globus-compute.readthedocs.io/en/latest/index.html>`_
as the execution backend to run tasks on remote systems.
Collaborator

Please consider adding more context here. I'm not calling for complete documentation, but this currently only answers the "what." How about a "why" or in what cases the GCE would be a good fit?

@yadudoc force-pushed the globus_compute_executor.py branch 3 times, most recently from 4639d2b to 9b552fa on January 14, 2025 17:13
setup.py
Comment on lines 48 to 49
'globus_compute': ['globus_compute_sdk>=2.27.1'],
# Disabling psi-j since github direct links are not allowed by pypi
Collaborator

It's been a minute since we started this PR; might increase this to the most up-to-date as of ~now (2.34.0)

Member Author

Fixed.

@khk-globus force-pushed the globus_compute_executor.py branch from 21fdcd1 to 41e8d4e on January 14, 2025 22:55
…te_endpoints

* A new `GlobusComputeExecutor` implementation
* Docs, tests, and examples
* Github Action for GlobusComputeExecutor (#3619)
@yadudoc force-pushed the globus_compute_executor.py branch from 464c202 to f083ad3 on January 16, 2025 00:06
@yadudoc (Member Author) commented Jan 16, 2025

@khk-globus I've added some configuration examples and squashed all the previous changes.

I've opted to keep the executor.shutdown mechanism using result_watcher.shutdown. I agree that this isn't the best solution here, but the other approach we discussed was causing hangs: cancelling the futures so that the result_watcher exits when the exit handler is triggered appears to cause a hang. From the logs, it looks like the atexit handler triggered RW.shutdown, but instead of recognizing that the futures were cancelled, it does a blocking wait until the task completes, which then throws an error since the future is already cancelled. This smells like a bug. I've not figured out the root cause on the globus_compute_sdk.Executor side, but I do not think we should hold this PR up for that.
