Newly frequent WorkQueueTaskFailure in CI #2914
Comments
Maybe related, maybe not: I've also seen this in CI. It looks like it has something to do with staging files in, not out? See https://github.com/Parsl/parsl/actions/runs/6519478342/job/17706018626
I've tried my DESC development branch of parsl with ndcctools 7.7.0 and still experience sporadic FileNotFound errors as reported in the main body of this issue.
So that error is almost certainly coming from this line, where the coprocess attempts to chdir to the task directory. Now, it's hard for me to imagine that the directory does not really exist, because the worker creates it before sending the function to the coprocess. But it would be wise for the coprocess to check this and send back an error message. But, I think the problem is really that the coprocess doesn't do the complementary @tphung3 what do you think?
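The check suggested above could be sketched roughly as follows. This is not the cctools coprocess code: `run_in_task_directory` and the result dictionary layout are hypothetical names chosen for illustration, and restoring the previous working directory afterwards is a design choice of this sketch rather than something the thread confirms.

```python
import os


def run_in_task_directory(path, func, *args):
    """Run func(*args) inside `path`, reporting failures as data.

    Hypothetical helper for illustration only; it does not reproduce
    the actual coprocess implementation.
    """
    previous = os.getcwd()
    try:
        # Raises an OSError subclass (e.g. FileNotFoundError) if the
        # task directory is missing or cannot be entered.
        os.chdir(path)
    except OSError as exc:
        return {
            "success": False,
            "error": f"cannot enter task directory {path!r}: {exc}",
        }
    try:
        return {"success": True, "result": func(*args)}
    except Exception as exc:
        return {
            "success": False,
            "error": f"task raised {type(exc).__name__}: {exc}",
        }
    finally:
        # Assumption of this sketch: always return to the directory we
        # started in, so later tasks are not affected by this one.
        os.chdir(previous)
```

With this shape, a missing directory produces an error message for the caller instead of an unhandled exception inside the coprocess.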
@benclifford I just merged a fix for the chdir error (see cooperative-computing-lab/cctools#3542), what's the quickest way to see if it works?
@tphung3 if you have a URL for a binary of cctools (from anywhere, doesn't need to be an official release) it is hopefully easy to make a branch of parsl, edit the install path for ndcctools, hack the dependency problem mentioned elsewhere and see what happens
See draft release here with fix included:
This PR upgrades cctools to bring in bugfixes that should address #2914 Co-authored-by: Ben Clifford <[email protected]>
On the https://github.com/Parsl/parsl/actions/runs/6668296438/job/18123571251?pr=2012#step:6:9134 I don't have a feel for whether this is something that is breaking in the parsl branch-specific functionality, which is then breaking things in WQ, or what else is going on, so I'm just noting that error here for now.
It looks like this test is running cctools 7.7.1, but the fix for that segfault is in 7.7.2:
OK, easy to bump that branch up by 0.0.1. I'll do that now.
I'm still seeing this in the
https://github.com/Parsl/parsl/actions/runs/6854767971/job/18642922623?pr=2012#step:7:1883 This is with
Hmm, that is surprising. @tphung3 will look into it.
OK, I think we see where the problem is, let me bring in @colinthomas-z80 who is going to sort things out.
It appears this was fixed in the cctools library code but didn't get moved over here. See above PR.
Would it be feasible to include the generation of parsl_coprocess.py somewhere in the build process?
I would like that. I don't know enough about Python build/install to know how to do it, but some packages manage to compile C, etc., so I'd guess it's possible.
Let's do this generation at runtime by running
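A minimal sketch of what runtime generation might look like is below. The command named in the comment above is cut off in the text, so the sketch takes it as a parameter. `ensure_generated` is a hypothetical helper, and the idea that the generator writes the file contents to stdout is an assumption, not something the thread states.

```python
import subprocess
from pathlib import Path


def ensure_generated(target, command):
    """Create `target` from the stdout of `command` unless it exists.

    Hypothetical helper: `command` stands in for whatever generator the
    project actually uses, which is assumed here to print the file
    contents to standard output.
    Returns True if the file was generated, False if it was already there.
    """
    target = Path(target)
    if target.exists():
        return False
    # check=True raises CalledProcessError if the generator fails, so a
    # broken generator never leaves a half-written file behind.
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    target.write_text(result.stdout)
    return True
```

Generating on first use keeps the generated file in step with the installed cctools, at the cost of one extra subprocess call the first time the executor starts.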
Describe the bug
I'm seeing this WorkQueueExecutor heisenbug happen in CI a lot recently: I'm not clear what has changed to make it happen more - for example in https://github.com/Parsl/parsl/actions/runs/6518865549/job/17704749713
I don't have any immediate strong ideas about what is going on. I've had a little poke but can't see anything that sticks out right away.
I've opened:
I haven't been successful in recreating this on my laptop. However, I have seen a related error on Perlmutter under certain high load / high concurrency conditions which is a bit more reproducible, and maybe I can debug from there.
cc @dthain