Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-processing w/ s3 logs, unless you use spawn to launch your subprocesses #840

Closed
max-kaufmann opened this issue Nov 13, 2024 · 1 comment

Comments

@max-kaufmann
Copy link
Contributor

max-kaufmann commented Nov 13, 2024

We use s3fs to access s3, which is apparently not safe for multi-processing. i.e. I have data loading code which calls read_eval_log. Using a ProcessPoolExecutor (but not a ThreadPoolExecutor!) hangs when using read_eval_log on an s3 bucket:

with ProcessPoolExecutor(max_workers=num_workers) as executor:
    futures = {executor.submit(read_eval_log, eval_log): eval_log for eval_log in eval_logs}

I opened an issue in their repo, and they replied saying this was a known limitation, and suggested to use spawn (new python interpret state) when doing multi-processing, i.e.:

with ProcessPoolExecutor(max_workers=num_workers,mp_context=get_context("spawn")) as executor:
    futures = {executor.submit(process_eval_log_single, eval_log): eval_log for eval_log in eval_logs}

This works, but we should have a more robust solution to multi-processing

@max-kaufmann max-kaufmann changed the title The library we use to access s3 isn't thread-safe Multi-processing with s3 is broken, unless you use spawn() Nov 13, 2024
@max-kaufmann max-kaufmann changed the title Multi-processing with s3 is broken, unless you use spawn() Multi-processing while reading s3 logs hangs, unless you use spawn to launch your subprocesses Nov 13, 2024
@max-kaufmann max-kaufmann changed the title Multi-processing while reading s3 logs hangs, unless you use spawn to launch your subprocesses Multi-processing w/ s3 logs, unless you use spawn to launch your subprocesses Nov 13, 2024
@jjallaire
Copy link
Collaborator

@max-kaufmann We have made some other changes to back off of s3fs concurrency (as it may have been the cause of other deadlocks we saw stack traces for). I think in the name of not ever hanging we probably won't do more inside of Inspect here, but I'd certainly be interested if there is a documented solution for the "right" way to do this with multi-processing (and would add that to our docs for users looking to get better parallelism).

Another thought would be to find some robust way to download S3 logs in parallel (an entirely different package or alternate use of s3fs) and then once local use multi-processing on them (as then it will just be zip reading which seems like it would be multi-process safe). Again, if we can sort out something robust here we could either included it in Inspect or minimally add it to the docs.

I'm going to close this for now because we don't have plans to address internally, but again happy to take a PR to the library or the docs if the community discovers a good solution for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants