Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Improve subprocess logging #90

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

sbrandstaeter
Copy link
Contributor

@sbrandstaeter sbrandstaeter commented Jan 9, 2025

Description and Context

This MR proposes reworking job logging within JobscriptDrivers (and its subclasses) to simplify our current "manual" logging approach. The suggested implementation aims to streamline the logging process while maintaining (almost) the same functionality.

Key motivations and changes are as follows:

  1. Direct Bash Logging: The MR suggests switching to direct logging through bash commands, which should simplify the code and reduce the need for manual logging checks within Python. By embedding logging directly within bash, we may achieve the same tracking of job status with much less self-written, complex code.
  2. Timeout for Zombie Prevention: Our current manual logging was initially implemented partly to detect “zombie” jobs by checking for the termination expression in continuously running jobs. A match would indicate a job that is running but failed -> a zombie. However, this method has proven to be unreliable. Instead, the MR proposes to replace it with a straightforward job timeout mechanism, which would automatically terminate jobs that exceed a set time limit. This timeout has a clear upper limit: the Dask worker wall time.

Benefits:

  • Code Simplification: By logging directly with bash commands, we reduce the complexity in our Python code and make logging more robust.
  • Improved Zombie Prevention: Using a timeout provides a more reliable approach to handling long-running jobs compared to manual detection.

If the changes are accepted, we can likely remove the existing manual logging, reducing maintenance overhead while retaining the essential functionality.

Related Issues and Pull Requests

How Has This Been Tested?

Checklist

  • My commit messages mention the appropriate GitLab issue numbers (advised but optional).
  • I mention the appropriate GitLab issue numbers in this MR.
  • I updated documentation where necessary.
  • I updated the README where necessary
  • I have added as (few,) small and fast tests as possible to cover my changes.

Additional Information

Interested Parties

Possible reviewers: @gilrrei

Other interested parties:

@sbrandstaeter
Copy link
Contributor Author

Closes #70 #66

@sbrandstaeter
Copy link
Contributor Author

We should continue the discussion started on GitLab: https://gitlab.lrz.de/queens_community/queens/-/merge_requests/769#note_3426821

@sbrandstaeter
Copy link
Contributor Author

In the original MR on GitLab @gilrrei commented:
"I think we can remove the code for run_subprocess_with_logging in run_subprocess.py as well as get_job_logger in logger_settins.py"

Which I fully agree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant