Add --megascale_abort_on_hangs flag for multi-slice TPU jobs #731

Open · wants to merge 1 commit into base: main
5 changes: 5 additions & 0 deletions axlearn/common/compiler_options.py
@@ -52,6 +52,11 @@ def default_xla_options(
# concurrently with gradient computation for the following layer.
xla_tpu_enable_data_parallel_all_reduce_opt="true",
xla_tpu_data_parallel_opt_different_sized_ops="true",
# If a MegaScale runtime error is encountered when running multi-slice
# jobs, enabling this flag aborts the job and causes the process to exit.
# This is set to true to prevent the job from silently hanging and to
# reduce time to recovery.
megascale_abort_on_hangs="true",
Contributor


Is it an XLA flag? Curious, since the other XLA flags have an xla_ prefix.


This is not an XLA compiler flag, but rather a libtpu runtime flag. As long as it is eventually passed into LIBTPU_INIT_ARGS, it should work.
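The distinction matters for how the flag is delivered: XLA compiler flags are typically passed via XLA_FLAGS, while libtpu runtime flags must reach the LIBTPU_INIT_ARGS environment variable before the TPU runtime initializes. A minimal sketch of that split (the helper and the prefix-based heuristic here are illustrative, not axlearn's actual API):

```python
import os

def format_libtpu_init_args(options: dict) -> str:
    """Render runtime options as space-separated --key=value pairs.

    Illustrative helper: flags without an ``xla_`` prefix (e.g. megascale_*)
    are treated as libtpu runtime flags destined for LIBTPU_INIT_ARGS,
    rather than XLA compiler flags destined for XLA_FLAGS.
    """
    runtime_flags = {k: v for k, v in options.items() if not k.startswith("xla_")}
    return " ".join(f"--{k}={v}" for k, v in runtime_flags.items())

options = {
    "xla_tpu_enable_data_parallel_all_reduce_opt": "true",
    "megascale_abort_on_hangs": "true",
}
os.environ["LIBTPU_INIT_ARGS"] = format_libtpu_init_args(options)
print(os.environ["LIBTPU_INIT_ARGS"])  # --megascale_abort_on_hangs=true
```

The key point is timing: the environment variable must be set before the first TPU device initialization, or the runtime never sees the flag.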

Contributor

@apghml Oct 9, 2024


IIRC, this won't work with AOT compilation. Could you run the AOT compilation script run_aot_compilation.py to confirm?
The reason I ask is that the other megascale flags I have used don't work with AOT compilation.
If it doesn't work with AOT compilation, we can move the megascale flag to launch.py.
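If the flag can't travel through the compiler-options path, one way a launcher could inject it is to append it to LIBTPU_INIT_ARGS before the trainer process starts. A hypothetical sketch (the helper name is made up, and launch.py's actual structure may differ):

```python
import os

def add_libtpu_flag(name: str, value: str) -> None:
    """Append a libtpu runtime flag to LIBTPU_INIT_ARGS.

    Preserves any flags already present in the environment, so a launcher
    can layer this on top of user-supplied runtime settings.
    """
    existing = os.environ.get("LIBTPU_INIT_ARGS", "")
    flag = f"--{name}={value}"
    os.environ["LIBTPU_INIT_ARGS"] = f"{existing} {flag}".strip()

# Set the flag in the launcher, before any TPU initialization happens.
add_libtpu_flag("megascale_abort_on_hangs", "true")
print(os.environ["LIBTPU_INIT_ARGS"])
```

Because the environment variable is set in the launcher rather than baked into compiler options, it sidesteps the AOT compilation path entirely.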

)

# Validate options. Will never fail if this function is implemented correctly.
apghml marked this conversation as resolved.