Newly frequent WorkQueueTaskFailure in CI #2914

Open
benclifford opened this issue Oct 14, 2023 · 17 comments

@benclifford
Collaborator

Describe the bug

I'm seeing this WorkQueueExecutor heisenbug happen in CI a lot recently. I'm not clear what has changed to make it happen more often. For example, in https://github.com/Parsl/parsl/actions/runs/6518865549/job/17704749713

ERROR    parsl.dataflow.dflow:dflow.py:350 Task 207 failed after 0 retry attempts
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/parsl/dataflow/dflow.py", line 301, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/parsl/parsl/parsl/dataflow/dflow.py", line 571, in _unwrap_remote_exception_wrapper
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
parsl.executors.workqueue.errors.WorkQueueTaskFailure: ('work queue result: The result file was not transfered from the worker.\nThis usually means that there is a problem with the python setup,\nor the wrapper that executes the function.\nTrace:\n', FileNotFoundError(2, 'No such file or directory'))
INFO     parsl.dataflow.dflow:dflow.py:1390 Standard output for task 207 available at std.out

I don't have any immediate strong ideas about what is going on. I've had a little poke but can't see anything that stands out right away.

I've opened:

I haven't been successful in recreating this on my laptop. However, I have seen a related error on Perlmutter under certain high-load / high-concurrency conditions, which is a bit more reproducible, so maybe I can debug from there.

cc @dthain

@benclifford
Collaborator Author

benclifford commented Oct 14, 2023

Maybe related, maybe not: I've also seen this in CI. It looks like something to do with staging files in, not out? See https://github.com/Parsl/parsl/actions/runs/6519478342/job/17706018626

E               parsl.executors.errors.BadStateException: Executor WorkQueueExecutor failed due to: Error 1:
E               	EXIT CODE: 139
E               	STDOUT: Found cores : 2
E               Launching worker: 1
E               work_queue_worker: creating workspace /tmp/worker-1001-5848
E               work_queue_worker: using 2 cores, 6932 MB memory, 18382 MB disk, 0 gpus
E               connected to manager fv-az201-276:9000 via local address 10.1.0.39:38854
E               
E               	STDERR: Network function: connection from ('127.0.0.1', 50818)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50824)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50828)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50834)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 40740)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 40756)
E               Network function: recieved event: {'fn_
E               ...
E               ': 'direct'}
E               Network function: connection from ('127.0.0.1', 38228)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 38236)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function encountered exception  [Errno 2] No such file or directory: 't.271'
E               Traceback (most recent call last):
E                 File "/opt/hostedtoolcache/Python/3.8.18/x64/bin/parsl_coprocess.py", line 141, in <module>
E                   main()
E                 File "/opt/hostedtoolcache/Python/3.8.18/x64/bin/parsl_coprocess.py", line 69, in main
E                   task_id = int(input_spec[1])
E               IndexError: list index out of range
E               /home/runner/work/parsl/parsl/runinfo/003/submit_scripts/parsl.WorkQueueExecutor.block-0.1697310662.3360648.sh: line 10:  5848 Segmentation fault      (core dumped) PARSL_WORKER_BLOCK_ID=0 work_queue_worker --coprocess parsl_coprocess.py fv-az201-276 9000

@benclifford
Collaborator Author

I've tried my DESC development branch of parsl with ndcctools 7.7.0 and still experience sporadic FileNotFound errors as reported in the main body of this issue.

@dthain
Contributor

dthain commented Oct 16, 2023

So that error is almost certainly coming from this line, where the coprocess attempts to chdir to the task directory (t.271) corresponding to the function-call task:
https://github.com/cooperative-computing-lab/cctools/blob/master/poncho/src/poncho/wq_network_code.py#L75

Now, it's hard for me to imagine that the directory does not really exist, because the worker creates it before sending the function to the coprocess. But it would be wise for the coprocess to check this and send back an error message.
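
A minimal sketch of such a check, assuming task directories are named `t.<task_id>` inside the worker workspace as in the log above. The function name `check_task_dir` and the shape of the error dictionary are hypothetical, chosen for illustration; they are not the cctools API.

```python
import os


def check_task_dir(workspace, task_id):
    """Return (task_dir, None) if the task directory exists,
    otherwise (None, error) where error is a dict that the caller
    could send back instead of raising FileNotFoundError.

    Hypothetical helper: the name and the error format are assumptions.
    """
    task_dir = os.path.join(workspace, f"t.{task_id}")
    if not os.path.isdir(task_dir):
        error = {
            "success": False,
            "reason": f"task directory {task_dir} does not exist",
        }
        return None, error
    return task_dir, None
```

The caller would report the error dictionary to the worker and move on to the next request rather than letting the exception end the coprocess.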

But I think the problem is really that the coprocess doesn't do the complementary chdir(..) under all exit paths. For example, if the coprocess catches an exception, it skips the chdir(..) on the way out. So I think we need a more idempotent approach that always returns to the same absolute directory each time through the loop.
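
One way to picture that idea is the sketch below. It is not the actual cctools fix; `run_in_task_dir` is a hypothetical helper, and the `t.<task_id>` naming is taken from the log above. The `finally` clause returns to the absolute workspace path on every exit path, including when the function raises or the task directory is missing.

```python
import os


def run_in_task_dir(workspace, task_id, fn):
    """Run fn() inside <workspace>/t.<task_id>, then always return
    to the absolute workspace directory.

    Hypothetical helper for illustration only.
    """
    workspace = os.path.abspath(workspace)
    try:
        os.chdir(os.path.join(workspace, f"t.{task_id}"))
        return fn()
    finally:
        # Runs on success, on an exception from fn(), and on a failed
        # chdir, so the next iteration starts from a known directory.
        os.chdir(workspace)
```

Because the return path is absolute rather than relative (`..`), a failure in one task cannot leave later tasks resolving `t.<task_id>` from the wrong place.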

@tphung3 what do you think?

@tphung3
Contributor

tphung3 commented Oct 17, 2023

@benclifford I just merged a fix for the chdir error (see cooperative-computing-lab/cctools#3542). What's the quickest way to see if it works?

@benclifford
Collaborator Author

@tphung3 if you have a URL for a binary of cctools (from anywhere, doesn't need to be an official release) it is hopefully easy to make a branch of parsl, edit the install path for ndcctools, hack the dependency problem mentioned elsewhere and see what happens

@dthain
Contributor

dthain commented Oct 17, 2023

@dthain
Contributor

dthain commented Oct 17, 2023

benclifford added a commit that referenced this issue Oct 24, 2023
This PR upgrades cctools to bring in bugfixes that should address #2914



Co-authored-by: Ben Clifford <[email protected]>
@benclifford
Collaborator Author

On the desc parsl branch, I'm still seeing some segfaults and other work queue problems, for example here:

https://github.com/Parsl/parsl/actions/runs/6668296438/job/18123571251?pr=2012#step:6:9134

I don't have a feel for whether this is something breaking in the parsl branch-specific functionality, which is then breaking things in WQ, or whether something else is going on, so I'm just noting that error here for now.

@dthain
Contributor

dthain commented Oct 27, 2023

It looks like this test is running cctools 7.7.1, but the fix for that segfault is in 7.7.2:
https://github.com/cooperative-computing-lab/cctools/releases/tag/release%2F7.7.2

@benclifford
Collaborator Author

ok, easy to bump that branch up by 0.0.1 - I'll do that now

@benclifford
Collaborator Author

I'm still seeing this in the desc branch of parsl in CI sometimes:

Network function: connection from ('127.0.0.1', 60014)
Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result', 'log'], 'remote_task_exec_method': 'direct'}
Network function: connection from ('127.0.0.1', 60020)
Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result', 'log'], 'remote_task_exec_method': 'direct'}
Network function encountered exception  [Errno 2] No such file or directory: 't.107'
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/.venv/bin/parsl_coprocess.py", line 135, in <module>
    main()
  File "/home/runner/work/parsl/parsl/.venv/bin/parsl_coprocess.py", line 68, in main
    task_id = int(input_spec[1])
IndexError: list index out of range
/home/runner/work/parsl/parsl/runinfo/003/submit_scripts/parsl.WorkQueueExecutor.block-0.1699912866.767312.sh: line 10:  6144 Segmentation fault      (core dumped) PARSL_WORKER_BLOCK_ID=0 work_queue_worker --coprocess parsl_coprocess.py fv-az340-503 9000

https://github.com/Parsl/parsl/actions/runs/6854767971/job/18642922623?pr=2012#step:7:1883

This is with CCTOOLS_VERSION=7.7.2

@dthain
Contributor

dthain commented Nov 14, 2023

Hmm, that is surprising -- @tphung3 will look into it.
We are at Supercomputing in Denver this week, so we may be a bit delayed.

@dthain
Contributor

dthain commented Nov 20, 2023

Ok, I think we see where the problem is, let me bring in @colinthomas-z80 who is going to sort things out.

@colinthomas-z80
Contributor

It appears this was fixed in the cctools library code but didn't get moved over here. See above PR

@colinthomas-z80
Contributor

Would it be feasible to include the generation of parsl_coprocess.py somewhere in the build process?

@benclifford
Collaborator Author

I would like that. I don't know enough about Python build/install to know how to do it, but some packages manage to compile C, etc., so I'd guess it's possible.

@dthain
Contributor

dthain commented Jan 29, 2024

Let's do this generation at runtime by running poncho_package_serverize appropriately, which is what we do in native TaskVine applications.
