Draft: Refactor JobSet for Pathways #918

Draft
wants to merge 65 commits into base: main
65 commits
582329a
Adding support for Pathways proxy
jesus-orozco Sep 9, 2024
8d3c643
Update pathways-utils dependency and fix formatting
jesus-orozco Oct 1, 2024
0e61b76
Move pathways package to its own dependency tree and pin it to a spec…
jesus-orozco Oct 7, 2024
4260c38
Relocate pathwaysutils import
jesus-orozco Oct 9, 2024
ae80a36
Create custom jobset for pathways
jesus-orozco Oct 9, 2024
9960027
Updates to pathways jobset creation
jesus-orozco Oct 16, 2024
f02d345
Merge branch 'main' into feature/jax_pathways
jesus-orozco Oct 17, 2024
6c3c083
Merge branch 'apple:main' into feature/jax_pathways
jesus-orozco Oct 25, 2024
ac6bcd2
Update pathwaysutils source to pypi
jesus-orozco Oct 25, 2024
43ac5e4
Merge branch 'main' into feature/pathways_workload
jesus-orozco Oct 28, 2024
05f311d
Merge branch 'apple:main' into feature/jax_pathways
jesus-orozco Oct 28, 2024
8972933
trillium testing baseline
jesus-orozco Nov 6, 2024
a952c6f
revert dockerfile for upstream merge
jesus-orozco Nov 6, 2024
34571aa
Merge branch 'apple:main' into trillium_testing
jesus-orozco Nov 6, 2024
0e8ae86
fixed pdbs 3 for fuji 70b
jesus-orozco Nov 6, 2024
d9d458a
testing pdbs 3 with 2 v6e-256 slices
jesus-orozco Nov 6, 2024
73da333
use maxtext xla flags only
jesus-orozco Nov 7, 2024
01bcf8f
new baseline for pdbs=3 without ffn_dim
jesus-orozco Nov 7, 2024
b1ba4fe
try xla sc offload flags
jesus-orozco Nov 7, 2024
6b24cd2
revert AR + SC offload flags
jesus-orozco Nov 7, 2024
aa06778
output jobset to yaml file
jesus-orozco Nov 7, 2024
efabbfd
calculate batch size based on flags
jesus-orozco Nov 7, 2024
cd556c7
enable ffn 3.5 and test pdbs 3 with 4 slices
jesus-orozco Nov 7, 2024
90b0d26
retry 4 slices with pdbs 3
jesus-orozco Nov 7, 2024
4974458
test xla_tpu_enable_sparse_core_collective_offload_all_reduce
jesus-orozco Nov 8, 2024
caff395
remove xla_tpu_enable_sparse_core_collective_offload_all_reduce
jesus-orozco Nov 8, 2024
8277980
dynamic global batch size based on pdbs and slices
jesus-orozco Nov 8, 2024
e23a4d0
enable xla_enable_async_all_reduce
jesus-orozco Nov 11, 2024
1797c52
Merge branch 'main' into feature/jax_pathways
jesus-orozco Nov 11, 2024
4868815
Refactor pathways config flag
jesus-orozco Nov 11, 2024
fe62afd
Merge branch 'apple:main' into feature/pathways_workload
jesus-orozco Nov 11, 2024
27ceea0
install libtpu nightly
jesus-orozco Nov 11, 2024
5c80fc8
Merge branch 'apple:main' into trillium_testing
jesus-orozco Nov 11, 2024
200ac48
custom remat policy for fuji-70b
jesus-orozco Nov 15, 2024
3bfda15
sparscore xla flags and nothing_saveable remat policy
jesus-orozco Nov 15, 2024
51e2e90
calculate batch size with jax devices
jesus-orozco Nov 18, 2024
b894304
update remat policy offloading
jesus-orozco Nov 18, 2024
ce5f1a2
Merge branch 'main' into feature/pathways_workload
jesus-orozco Nov 18, 2024
b51e67e
Update axlearn/cloud/gcp/job.py
jesus-orozco Nov 18, 2024
6537274
Update job.py with dynamic module imports
jesus-orozco Nov 18, 2024
7dbb1b9
Update job.py - remove pathways from dynamic import error message
jesus-orozco Nov 18, 2024
af7c746
Merge remote-tracking branch 'origin/feature/jax_pathways' into pathw…
jesus-orozco Nov 22, 2024
5704ce5
pathways jobset updates
jesus-orozco Nov 22, 2024
bf82554
merge trillium changes
jesus-orozco Nov 22, 2024
4f58291
Merge branch 'apple:main' into pathways_trillium
jesus-orozco Nov 25, 2024
d206110
Install pathwaysutils
jesus-orozco Nov 25, 2024
4152db3
Disable force eval
jesus-orozco Nov 25, 2024
0c33403
Launch pathways on trainer_main
jesus-orozco Nov 25, 2024
315fa6b
add v6e mesh rules
jesus-orozco Dec 5, 2024
765f64b
pin libtpu version
jesus-orozco Dec 5, 2024
f387807
update pathways jobset definition
jesus-orozco Dec 5, 2024
735ff37
Merge branch 'apple:main' into pathways_trillium
jesus-orozco Dec 6, 2024
2e0dd4b
dump xla flags to gcs
jesus-orozco Dec 9, 2024
e9d8341
Refactor jobset to align with new pathways structure
jesus-orozco Dec 12, 2024
357a322
Apply formatting
jesus-orozco Dec 12, 2024
89ac1ea
refactor pathways jobset to new spec
jesus-orozco Jan 10, 2025
d63dd6a
Rebase to axlearn main
jiya-zhang Jan 14, 2025
4393453
revert changes to gitignore and remove dockerignore
jesus-orozco Jan 14, 2025
9ba599b
revert changes to dependencies
jesus-orozco Jan 14, 2025
ca5c883
revert changes to trainer and model configs
jesus-orozco Jan 14, 2025
c7bc5df
remove unnecessary updates to gke tpu job for pathways workloads
jesus-orozco Jan 14, 2025
a0bf9df
Revert updates to fuji model config
jesus-orozco Jan 14, 2025
9f19159
Update pathways container specs for gke tpu job
jesus-orozco Jan 15, 2025
8f71508
Update gke tpu job to bypass jobset coordinator
jesus-orozco Jan 16, 2025
f467064
Swap pathways check for tpu gke job config flag
jesus-orozco Jan 31, 2025
2 changes: 1 addition & 1 deletion Dockerfile
@@ -93,7 +93,7 @@ RUN apt-get install -y google-perftools
 ENV PIP_FIND_LINKS=https://storage.googleapis.com/jax-releases/libtpu_releases.html
 # Ensure we install the TPU version, even if building locally.
 # Jax will fallback to CPU when run on a machine without TPU.
-RUN pip install .[core,tpu]
+RUN pip install .[core,tpu,pathways]
 RUN if [ -n "$EXTRAS" ]; then pip install .[$EXTRAS]; fi
 COPY . .
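
Because the image now installs an extra `pathways` dependency group, code that runs both inside and outside this image may want to check whether the optional dependency is present before using it. The following is a minimal sketch, not code from this PR. It assumes the `pathways` extra provides the `pathwaysutils` module (the commit list mentions installing `pathwaysutils`, but the extras definition is not shown here), and the helper names `has_module` and `pathways_available` are hypothetical.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def pathways_available() -> bool:
    """Report whether the optional Pathways dependency appears to be installed.

    Assumption: the `pathways` extra installs a module named `pathwaysutils`.
    """
    return has_module("pathwaysutils")


if __name__ == "__main__":
    if pathways_available():
        print("pathwaysutils found: Pathways-specific code paths can be enabled.")
    else:
        print("pathwaysutils not found: install the `pathways` extra to enable it.")
```

A check like this keeps the dependency optional: an environment built with only `.[core,tpu]` can still import the calling module, and the Pathways-specific path is taken only when the extra is installed.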
