Commit

run par as an entrypoint if there is no patch or jetter patch. (pytorch#994)

Summary:
Pull Request resolved: pytorch#994

# Context:
Currently, when running a torchx local job, we use penv_python as the entrypoint. That means we pass the actual .par or .xar file as an argument to penv_python. Within penv_python, the par/xar is executed as a new process.

# Old way to run torchx local job

For example, if the local job is running "jetter --help", torchx runs it like:
  PENV_PAR='/data/users/yikai/fbsource/buck-out/v2/gen/fbcode/a6cb9616985b22b0/jetter/__jetter-bin__/jetter-bin-inplace.par' penv_python -m jetter.main --help
It passes the par file via an environment variable called "PENV_PAR". (There is another way to pass this to penv_python: passing "PENV_PARNAME" as an env variable and resolving the par file's path from it. But that is very rare, accounting for only 0.1% of total usage.)

# New way to run torchx local job
After the migration, we will run it like:
  PAR_MAIN_OVERRIDE=jetter.main /data/users/yikai/fbsource/buck-out/v2/gen/fbcode/a6cb9616985b22b0/jetter/__jetter-bin__/jetter-bin-inplace.par --help
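The role PAR_MAIN_OVERRIDE plays can be sketched in plain Python. This is a hypothetical simplification of a par's bootstrap, not the actual fbcode implementation; `resolve_main` and `par_bootstrap` are made-up names:

```python
import importlib
import os


def resolve_main(default_main: str) -> str:
    # PAR_MAIN_OVERRIDE, when set, replaces the par's baked-in main module,
    # so the par can be invoked directly as the entrypoint.
    return os.environ.get("PAR_MAIN_OVERRIDE", default_main)


def par_bootstrap(default_main: str) -> None:
    """Run the selected main module, roughly as `python -m` would."""
    mod = importlib.import_module(resolve_main(default_main))
    mod.main()  # assumes the module exposes a main() callable
```

With PAR_MAIN_OVERRIDE=jetter.main set, `resolve_main` would pick `jetter.main` regardless of the par's default.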

NOTE: This diff only migrates one of the most common use cases, in which: 1. there is no patch or jetter patch; 2. it's a par, not a xar; 3. the par file is passed via the "PENV_PAR" env variable. For the other use cases, we still run penv_python as the entrypoint.
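The fallback logic described in the note above can be sketched as a small helper. The names (`build_local_cmd`, the `JETTER_PATCH` env-var check) are hypothetical stand-ins; the real torchx code differs:

```python
from typing import Dict, List, Tuple


def build_local_cmd(
    par_path: str,
    main_module: str,
    extra_args: List[str],
    base_env: Dict[str, str],
) -> Tuple[List[str], Dict[str, str]]:
    """Pick the command and env for a local job: run the par directly when
    the fast-path conditions hold, else fall back to penv_python."""
    env = dict(base_env)
    is_par = par_path.endswith(".par")  # condition: a par, not a xar
    has_patch = "JETTER_PATCH" in env   # stand-in for the patch check
    if is_par and not has_patch:
        # New way: the par is its own entrypoint; override its main module.
        env["PAR_MAIN_OVERRIDE"] = main_module
        return [par_path, *extra_args], env
    # Old way: penv_python is the entrypoint; the par is passed via PENV_PAR.
    env["PENV_PAR"] = par_path
    return ["penv_python", "-m", main_module, *extra_args], env
```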

Reviewed By: Sanjay-Ganeshan

Differential Revision: D66621649
yikaiMeta authored and facebook-github-bot committed Dec 17, 2024
1 parent ff22758 commit 701e972
Showing 1 changed file with 19 additions and 2 deletions.
21 changes: 19 additions & 2 deletions torchx/schedulers/local_scheduler.py
@@ -28,6 +28,7 @@
 import warnings
 from dataclasses import asdict, dataclass
 from datetime import datetime
+from subprocess import Popen
 from types import FrameType
 from typing import (
     Any,
@@ -696,12 +697,11 @@ def _popen(
         log.debug(f"Running {role_name} (replica {replica_id}):\n {args_pfmt}")
         env = self._get_replica_env(replica_params)

-        proc = subprocess.Popen(
+        proc = self.run_local_job(
             args=replica_params.args,
             env=env,
             stdout=stdout_,
             stderr=stderr_,
-            start_new_session=True,
             cwd=replica_params.cwd,
         )
         return _LocalReplica(
@@ -714,6 +714,23 @@ def _popen(
             error_file=env.get("TORCHELASTIC_ERROR_FILE", "<N/A>"),
         )

+    def run_local_job(
+        self,
+        args: List[str],
+        env: Dict[str, str],
+        stdout: Optional[io.FileIO],
+        stderr: Optional[io.FileIO],
+        cwd: Optional[str] = None,
+    ) -> Popen[bytes]:
+        return subprocess.Popen(
+            args=args,
+            env=env,
+            stdout=stdout,
+            stderr=stderr,
+            start_new_session=True,
+            cwd=cwd,
+        )
+
     def _get_replica_output_handles(
         self,
         replica_params: ReplicaParam,
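Factoring the Popen call out of _popen into run_local_job also gives subclasses a single hook for changing how the process is launched. A minimal standalone sketch of that pattern, using simplified stand-in classes rather than the actual torchx API:

```python
import io
import subprocess
from typing import Dict, List, Optional


class BaseLocalScheduler:
    """Simplified stand-in for LocalScheduler's launch hook."""

    def run_local_job(
        self,
        args: List[str],
        env: Dict[str, str],
        stdout: Optional[io.FileIO],
        stderr: Optional[io.FileIO],
        cwd: Optional[str] = None,
    ) -> subprocess.Popen:
        # Same shape as the refactored method: one place that spawns the process.
        return subprocess.Popen(
            args=args,
            env=env,
            stdout=stdout,
            stderr=stderr,
            start_new_session=True,
            cwd=cwd,
        )


class LoggingLocalScheduler(BaseLocalScheduler):
    """Hypothetical subclass: observe the command line before launching."""

    def run_local_job(self, args, env, stdout, stderr, cwd=None):
        print(f"launching: {' '.join(args)}")
        return super().run_local_job(args, env, stdout, stderr, cwd)
```

A subclass that swaps penv_python for the direct-par entrypoint only has to override this one method; _popen and the replica bookkeeping stay untouched.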
