Add non-mcore fsdp2 strategy #11525

BoxiangW · 2024-12-09T23:36:25Z

What does this PR do?

Add a one-line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line-by-line info on the high-level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you have read and followed the Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

Signed-off-by: Boxiang Wang <[email protected]>

Signed-off-by: BoxiangW <[email protected]>

nemo/lightning/pytorch/strategies/fsdp2_strategy.py

* Initial commit Signed-off-by: Piotr Kaminski <[email protected]> * Apply isort and black reformatting Signed-off-by: Laplasjan107 <[email protected]> --------- Signed-off-by: Piotr Kaminski <[email protected]> Signed-off-by: Laplasjan107 <[email protected]> Co-authored-by: Piotr Kaminski <[email protected]> Co-authored-by: Laplasjan107 <[email protected]>

* Make HfDatasetDataModule a datasets.load_dataset wrapper Signed-off-by: Alexandros Koumparoulis <[email protected]> * add logging Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Update HFDatasetDataModule Signed-off-by: Alexandros Koumparoulis <[email protected]> * refactor Signed-off-by: Alexandros Koumparoulis <[email protected]> * refactor fixup Signed-off-by: Alexandros Koumparoulis <[email protected]> * refactor fixup #2 Signed-off-by: Alexandros Koumparoulis <[email protected]> * do not expand Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * doc Signed-off-by: Alexandros Koumparoulis <[email protected]> * doc Signed-off-by: Alexandros Koumparoulis <[email protected]> * add synonym Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * typo Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * Add train/val/test attributes Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add test for hf-datamodule Signed-off-by: Alexandros Koumparoulis <[email protected]> * Import lazily to avoid breaking with older megatron versions Signed-off-by: Alexandros Koumparoulis <[email protected]> * bot happy Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * bot happy2 Signed-off-by: 
Alexandros Koumparoulis <[email protected]> * add doc-strings and collate-fn arg Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]>

Signed-off-by: Oliver Koenig <[email protected]>

Signed-off-by: ashors1 <[email protected]>

Signed-off-by: Oliver Koenig <[email protected]>

* ci: Remove token from checkout Signed-off-by: Oliver Koenig <[email protected]> * bump version Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]>

Signed-off-by: Oliver Koenig <[email protected]>

* Fix llm.deploy api Signed-off-by: Hemil Desai <[email protected]> * fix Signed-off-by: Hemil Desai <[email protected]> * fix Signed-off-by: Hemil Desai <[email protected]> * fix Signed-off-by: Hemil Desai <[email protected]> * fix Signed-off-by: Hemil Desai <[email protected]> * fix Signed-off-by: Hemil Desai <[email protected]> * Apply isort and black reformatting Signed-off-by: hemildesai <[email protected]> * PR feedback Signed-off-by: Hemil Desai <[email protected]> * fix Signed-off-by: Hemil Desai <[email protected]> --------- Signed-off-by: Hemil Desai <[email protected]> Signed-off-by: hemildesai <[email protected]> Co-authored-by: hemildesai <[email protected]>

Signed-off-by: Malay Nagda <[email protected]> Co-authored-by: oliver könig <[email protected]>

* update recipe Signed-off-by: yaoyu-33 <[email protected]> * fix mllama mock ds Signed-off-by: yaoyu-33 <[email protected]> * update to use attention bias Signed-off-by: yaoyu-33 <[email protected]> * remove example Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix docstring mock.py Signed-off-by: yaoyu-33 <[email protected]> * fix docstring language.py Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix docstring language.py Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix docstring mllama/base.py Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix docstring mllama/language.py Signed-off-by: yaoyu-33 <[email protected]> * bump mcore Signed-off-by: Oliver Koenig <[email protected]> * Add scripts for mllama Signed-off-by: yaoyu-33 <[email protected]> * fix Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * update script Signed-off-by: yaoyu-33 <[email protected]> * fix pylint Signed-off-by: yaoyu-33 <[email protected]> * revert Dockerfile.ci Signed-off-by: Yu Yao <[email protected]> * add scripts Signed-off-by: yaoyu-33 <[email protected]> * add vlm training test in ci Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix docstring issues Signed-off-by: yaoyu-33 <[email protected]> * update script match recipe Signed-off-by: yaoyu-33 <[email protected]> * update recipes Signed-off-by: yaoyu-33 <[email protected]> * Update mllama_train.py Signed-off-by: Yu Yao <[email protected]> * update mllama 90b recipe Signed-off-by: 
yaoyu-33 <[email protected]> * update to use tmp in ci tests Signed-off-by: yaoyu-33 <[email protected]> * update default llava config Signed-off-by: yaoyu-33 <[email protected]> * add nemo run scripts Signed-off-by: yaoyu-33 <[email protected]> * fix vpp issue Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix cicd Signed-off-by: yaoyu-33 <[email protected]> * fix cicd Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * remove duplicated script Signed-off-by: yaoyu-33 <[email protected]> * ci: Add HF cache Signed-off-by: oliver könig <[email protected]> * update to use SP in recipe Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix Signed-off-by: yaoyu-33 <[email protected]> * upgrade Signed-off-by: yaoyu-33 <[email protected]> * Revert "upgrade" This reverts commit f6ad2cd. * update neva api Signed-off-by: yaoyu-33 <[email protected]> * update neva api Signed-off-by: yaoyu-33 <[email protected]> * fix neva processing Signed-off-by: yaoyu-33 <[email protected]> * fix lint Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix data fields Signed-off-by: yaoyu-33 <[email protected]> * few fixes Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Yu Yao <[email protected]> Signed-off-by: oliver könig <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Co-authored-by: Oliver Koenig <[email protected]>

* Add from_dict method Signed-off-by: Alexandros Koumparoulis <[email protected]> * add test_load_from_dict Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * add test_load_from_dict Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]>

* prevent llama3.1 from using linear interpolation * Apply isort and black reformatting Signed-off-by: suiyoubi <[email protected]> --------- Signed-off-by: suiyoubi <[email protected]> Co-authored-by: suiyoubi <[email protected]>

Signed-off-by: Ryan <[email protected]>

* update for nest release Signed-off-by: stevehuang52 <[email protected]> * make pylint happier Signed-off-by: stevehuang52 <[email protected]> * fix for lhotse dataloader Signed-off-by: stevehuang52 <[email protected]> * update yaml Signed-off-by: stevehuang52 <[email protected]> * minor refactor Signed-off-by: stevehuang52 <[email protected]> * clean up Signed-off-by: stevehuang52 <[email protected]> * clean up Signed-off-by: stevehuang52 <[email protected]> --------- Signed-off-by: stevehuang52 <[email protected]>

* Port changes related to SFT text+speech dataloading Signed-off-by: Piotr Żelasko <[email protected]> * Revert changes from Canary(nonLLM) code Signed-off-by: Piotr Żelasko <[email protected]> * Add joint text/audio dataloading capability to speechllm Signed-off-by: Piotr Żelasko <[email protected]> * include text-only into fprop of training and eval; TODO: text-only predict Signed-off-by: zhehuaichen <[email protected]> * Actually working forward step Signed-off-by: Piotr Żelasko <[email protected]> * Support for source-target text file pair training for MT+speech Signed-off-by: Piotr Żelasko <[email protected]> * Include supervision text tokens in audio example's num tokens Signed-off-by: Piotr Żelasko <[email protected]> * Disable conformer seq len NCCL sync Signed-off-by: Piotr Żelasko <[email protected]> * Preliminary sampler fusion stragies support: mux/zip/round_robin/randomized_round_robin Signed-off-by: Piotr Żelasko <[email protected]> * Working V2 version of multimodal dataloading. Each modality gets its own batch settings that can be merged with zip sampler to enjoy max batch sizes for both modalities in a single training step. Each modality runs fwd+bwd in turn to save GPU memory (instead of running fwd separately and bwd together). 
Signed-off-by: Piotr Żelasko <[email protected]> * Add missing config Signed-off-by: Piotr Żelasko <[email protected]> * Revert multimodal grad accum and fix mask padding issue Signed-off-by: Piotr Żelasko <[email protected]> * Add modality weights support via cfg.model.modality_weights Signed-off-by: Piotr Żelasko <[email protected]> * Fix for V2 dataloader shuffling CRITICAL Signed-off-by: Piotr Żelasko <[email protected]> * Restore multimodal grad accum Signed-off-by: Piotr Żelasko <[email protected]> * Fix unit tests for multi-sampler configurations Signed-off-by: Piotr Żelasko <[email protected]> * Apply isort and black reformatting Signed-off-by: pzelasko <[email protected]> * nemo gemma to hf conversion (#9629) * adding script for gemma nemo to hf Signed-off-by: Krishna Puvvada <[email protected]> * adding verification for convert_gemma_nemo_to_hf Signed-off-by: Krishna Puvvada <[email protected]> * Apply isort and black reformatting Signed-off-by: krishnacpuvvada <[email protected]> --------- Signed-off-by: Krishna Puvvada <[email protected]> Signed-off-by: krishnacpuvvada <[email protected]> Co-authored-by: Krishna Puvvada <[email protected]> Co-authored-by: krishnacpuvvada <[email protected]> * support FSDP (thank Yifan for early trying) (#10062) Note: as of now, this is still not fully working on the cluster. See above doc for details. 
Signed-off-by: zhehuaichen <[email protected]> * Fix unit tests after rebasing on recent main Signed-off-by: Piotr Żelasko <[email protected]> * support megatron_amp_O2 and tp (#10599) * Port changes related to SFT text+speech dataloading Signed-off-by: Piotr Żelasko <[email protected]> * Revert changes from Canary(nonLLM) code Signed-off-by: Piotr Żelasko <[email protected]> * Add joint text/audio dataloading capability to speechllm Signed-off-by: Piotr Żelasko <[email protected]> * include text-only into fprop of training and eval; TODO: text-only predict Signed-off-by: zhehuaichen <[email protected]> * Actually working forward step Signed-off-by: Piotr Żelasko <[email protected]> * Support for source-target text file pair training for MT+speech Signed-off-by: Piotr Żelasko <[email protected]> * Include supervision text tokens in audio example's num tokens Signed-off-by: Piotr Żelasko <[email protected]> * Disable conformer seq len NCCL sync Signed-off-by: Piotr Żelasko <[email protected]> * Preliminary sampler fusion stragies support: mux/zip/round_robin/randomized_round_robin Signed-off-by: Piotr Żelasko <[email protected]> * Working V2 version of multimodal dataloading. Each modality gets its own batch settings that can be merged with zip sampler to enjoy max batch sizes for both modalities in a single training step. Each modality runs fwd+bwd in turn to save GPU memory (instead of running fwd separately and bwd together). 
Signed-off-by: Piotr Żelasko <[email protected]> * Add missing config Signed-off-by: Piotr Żelasko <[email protected]> * Revert multimodal grad accum and fix mask padding issue Signed-off-by: Piotr Żelasko <[email protected]> * Add modality weights support via cfg.model.modality_weights Signed-off-by: Piotr Żelasko <[email protected]> * Fix for V2 dataloader shuffling CRITICAL Signed-off-by: Piotr Żelasko <[email protected]> * Restore multimodal grad accum Signed-off-by: Piotr Żelasko <[email protected]> * Fix unit tests for multi-sampler configurations Signed-off-by: Piotr Żelasko <[email protected]> * Apply isort and black reformatting Signed-off-by: pzelasko <[email protected]> * nemo gemma to hf conversion (#9629) * adding script for gemma nemo to hf Signed-off-by: Krishna Puvvada <[email protected]> * adding verification for convert_gemma_nemo_to_hf Signed-off-by: Krishna Puvvada <[email protected]> * Apply isort and black reformatting Signed-off-by: krishnacpuvvada <[email protected]> --------- Signed-off-by: Krishna Puvvada <[email protected]> Signed-off-by: krishnacpuvvada <[email protected]> Co-authored-by: Krishna Puvvada <[email protected]> Co-authored-by: krishnacpuvvada <[email protected]> * support FSDP (thank Yifan for early trying) Signed-off-by: zhehuaichen <[email protected]> * debug TP deadlock Signed-off-by: zhehuaichen <[email protected]> * some fixes for fsdp and tp /lustre/fsw/portfolios/llmservice/users/zhehuaic/results/canary-v0_speechllm/prompt_lhmerge5_p2b_oci_FC-GPT_llama_canaryset_b6s4kf-sunolong_noCC_langtemp0.5_dsettemp0.5_lr1e-4wd1e-3_CosineAnnealing_warmup2500_minlr1e-6_gbs2048_mbs16_ep200/error-1417621-0.out /lustre/fsw/portfolios/llmservice/users/zhehuaic/results/canary-v0_speechllm/prompt_lhmerge5_p2b_tp_oci_FC-GPT_llama_canaryset_b6s4kf-sunolong_noCC_langtemp0.5_dsettemp0.5_lr1e-4wd1e-3_CosineAnnealing_warmup2500_minlr1e-6_gbs128_mbs16_ep200/error-1421103-3.out Signed-off-by: zhehuaichen <[email protected]> * nit fix 
Signed-off-by: zhehuaichen <[email protected]> * fix for llama3.1 Signed-off-by: zhehuaichen <[email protected]> * for llama3.1 Signed-off-by: zhehuaichen <[email protected]> * fix for inference Signed-off-by: zhehuaichen <[email protected]> * fix inference Signed-off-by: zhehuaichen <[email protected]> * fix grad accu Signed-off-by: zhehuaichen <[email protected]> * fix inference Signed-off-by: zhehuaichen <[email protected]> * initial impl to support megatron_amp_O2 in salm, bestow, salm-t5 Signed-off-by: zhehuaichen <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: zhehuaichen <[email protected]> Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: pzelasko <[email protected]> Signed-off-by: Krishna Puvvada <[email protected]> Signed-off-by: krishnacpuvvada <[email protected]> Co-authored-by: Piotr Żelasko <[email protected]> Co-authored-by: Piotr Żelasko <[email protected]> Co-authored-by: pzelasko <[email protected]> Co-authored-by: Krishna Puvvada <[email protected]> Co-authored-by: Krishna Puvvada <[email protected]> Co-authored-by: krishnacpuvvada <[email protected]> * minor change in dataloader (#10601) * Speechllm dataset basic unit test (#10631) * Basic unit test for speechllm lhotse dataset Signed-off-by: Piotr Żelasko <[email protected]> * cleanup Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> * Unit test for existing speechllm dataset with llama2 prompt format (#10634) Signed-off-by: Piotr Żelasko <[email protected]> * [speechllm] Replace TextProcessing with PromptFormatter (#10639) * [speechllm] Replace TextProcessing with PromptFormatter Signed-off-by: Piotr Żelasko <[email protected]> * Test for tokens_to_generate Signed-off-by: Piotr Żelasko <[email protected]> * Padding optimization for speechlm dataset Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> * Multimodal 
conversation format dataloading (#10683) * Draft implementation of NeMo Multimodal Conversation format Signed-off-by: Piotr Żelasko <[email protected]> * Fully working data parsing and iteration Signed-off-by: Piotr Żelasko <[email protected]> * Fully working dataloading with tokenization + prompting Signed-off-by: Piotr Żelasko <[email protected]> * Collapse consecutive user turns into single turn Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> * a few fixes for the new prompt template based dataloader and lora+distributed fused adam (#10701) * Draft implementation of NeMo Multimodal Conversation format Signed-off-by: Piotr Żelasko <[email protected]> * Fully working data parsing and iteration Signed-off-by: Piotr Żelasko <[email protected]> * Fully working dataloading with tokenization + prompting Signed-off-by: Piotr Żelasko <[email protected]> * Collapse consecutive user turns into single turn Signed-off-by: Piotr Żelasko <[email protected]> * compatible with previous expts Signed-off-by: zhehuaichen <[email protected]> * support gemma Signed-off-by: zhehuaichen <[email protected]> * handle the case max_seq_length is smaller than input_id length Signed-off-by: zhehuaichen <[email protected]> * fix max seq case Signed-off-by: zhehuaichen <[email protected]> * fix lora ckpt storing and loading Signed-off-by: zhehuaichen <[email protected]> * temp fix for distributed fused adam Signed-off-by: zhehuaichen <[email protected]> * revert changes in nemo_adapters.py Signed-off-by: zhehuaichen <[email protected]> * Fix tokenize_with_prompt Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: zhehuaichen <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: zhehuaichen <[email protected]> Signed-off-by: Piotr Żelasko <[email protected]> Co-authored-by: Piotr Żelasko <[email protected]> * Mechanism to insert BOS/EOS at the beginning/end of dialog (#10923) * 
Mechanism to insert BOS/EOS at the beginning/end of dialog Signed-off-by: Piotr Żelasko <[email protected]> * Fix Gemma prompt formatter test Signed-off-by: Piotr Żelasko <[email protected]> * Add a test specifically for multiturn insertion of bos/eos Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> * Add options to override default map/iterable dataset style selection in lhotse dataloader Signed-off-by: Piotr Żelasko <[email protected]> * Feature/conversations tarred (#11086) * Multimodal conversation tarring script Signed-off-by: Piotr Żelasko <[email protected]> * Fix sharding logic Signed-off-by: Piotr Żelasko <[email protected]> * Fix dir creation Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> * EMMeTT support in SpeechLLM + tutorial for Lhotse Multimodal Dataloading (#10927) * Preliminary support for oomptimizer Signed-off-by: Piotr Żelasko <[email protected]> * OOMptimizer for SpeechLLM Signed-off-by: Piotr Żelasko <[email protected]> * Initial version of estimate token bins script Signed-off-by: Piotr Żelasko <[email protected]> * Initial support for multimodal 2d bucketing Signed-off-by: Piotr Żelasko <[email protected]> * Extend to text-to-text oomptimizer Signed-off-by: Piotr Żelasko <[email protected]> * Preliminary support for Llama2 prompt format in ast+mt Signed-off-by: Piotr Żelasko <[email protected]> * Support for 1D estimate token bins Signed-off-by: Piotr Żelasko <[email protected]> * Support for 1D estimate token bins Signed-off-by: Piotr Żelasko <[email protected]> * Fix Signed-off-by: Piotr Żelasko <[email protected]> * Fix Signed-off-by: Piotr Żelasko <[email protected]> * Minor tweaks Signed-off-by: Piotr Żelasko <[email protected]> * Add min/max tokens filter Signed-off-by: Piotr Żelasko <[email protected]> * Change to bisect_left for bucket idx selection Signed-off-by: Piotr Żelasko <[email protected]> * Add 
reconfigure_num_microbatches_calculator at the start of train epoch for modular models Signed-off-by: Piotr Żelasko <[email protected]> * Update lhotse multi-sampler config and make validation datasets finite Signed-off-by: Piotr Żelasko <[email protected]> * Initial implementation of text+audio training for T5 modular models Signed-off-by: Piotr Żelasko <[email protected]> * megatron t5 nmt prompt formatter Signed-off-by: Piotr Żelasko <[email protected]> * Fixes for MT+AST T5 oomptimizer and training Signed-off-by: Piotr Żelasko <[email protected]> * configs, fixes, token-per-token filtering * Support text modality in predict_step Signed-off-by: Piotr Żelasko <[email protected]> * Support text data in val/test dl Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * fix infinite Signed-off-by: Piotr Żelasko <[email protected]> * prompt format fixes Signed-off-by: Piotr Żelasko <[email protected]> * Fixes in audio supervision Signed-off-by: Piotr Żelasko <[email protected]> * remove superficial padding Signed-off-by: Piotr Żelasko <[email protected]> * test config and prompt context fetching fixes Signed-off-by: Piotr Żelasko <[email protected]> * support text-only decoding for salm/bestow Signed-off-by: Piotr Żelasko <[email protected]> * Add unit tests for EMMETT / refactor prompt_format_fn Signed-off-by: Piotr Żelasko <[email protected]> * make t5nmt prompt formatter auto discoverable Signed-off-by: Piotr Żelasko <[email protected]> * include token count / tpt filtering in estimate_token_bins Signed-off-by: Piotr Żelasko <[email 
protected]> * fix max token filter Signed-off-by: Piotr Żelasko <[email protected]> * some fixes Signed-off-by: Piotr Żelasko <[email protected]> * custom mixin for text adapters Signed-off-by: Piotr Żelasko <[email protected]> * Warmup in oomptimizer-speechlm Signed-off-by: Piotr Żelasko <[email protected]> * Move oomptimizer-speechllm to separate directory Signed-off-by: Piotr Żelasko <[email protected]> * Initial cleanup Signed-off-by: Piotr Żelasko <[email protected]> * Refactoring of prompt format fn and length measurement and filtering for data types; improved unit test coverage Signed-off-by: Piotr Żelasko <[email protected]> * Refactor sampler constraints / filters into sampling.py Signed-off-by: Piotr Żelasko <[email protected]> * Tests and support for sampler length measurement of multimodal conversations Signed-off-by: Piotr Żelasko <[email protected]> * Update estimate_token_bins.py Signed-off-by: Piotr Żelasko <[email protected]> * Move estimate_token_bins.py to speech_llm scripts Signed-off-by: Piotr Żelasko <[email protected]> * Minor tweaks Signed-off-by: Piotr Żelasko <[email protected]> * Fixes for SpeechLLM dataset Signed-off-by: Piotr Żelasko <[email protected]> * Apply isort and black reformatting Signed-off-by: pzelasko <[email protected]> * Add missing emmett tests Signed-off-by: Piotr Żelasko <[email protected]> * Add tutorial about multimodal lhotse dataloading Signed-off-by: Piotr Żelasko <[email protected]> * Updated documentation for multimodal dataloading Signed-off-by: Piotr Żelasko <[email protected]> * Prompt Formatter tutorial Signed-off-by: Piotr Żelasko <[email protected]> * Review comments Signed-off-by: Piotr Żelasko <[email protected]> * Fixes for sampling filters None values Signed-off-by: Piotr Żelasko <[email protected]> * Changes requested by Steve: moving some args to main config namespace in multi config sampler Signed-off-by: Piotr Żelasko <[email protected]> * fix Signed-off-by: Piotr Żelasko <[email protected]> * 
Update default configs to the modified config schema Signed-off-by: Piotr Żelasko <[email protected]> * Fix omegaconf use issue Signed-off-by: Piotr Żelasko <[email protected]> * Update the docs to the modified multi config format Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: pzelasko <[email protected]> Co-authored-by: pzelasko <[email protected]> * Remove old TODO comments Signed-off-by: Piotr Żelasko <[email protected]> * Remove prompts/fn.py Signed-off-by: Piotr Żelasko <[email protected]> * Copyright notices Signed-off-by: Piotr Żelasko <[email protected]> * Make linter happy Signed-off-by: Piotr Żelasko <[email protected]> * Make linter happy Signed-off-by: Piotr Żelasko <[email protected]> * Fix megatron test Signed-off-by: Piotr Żelasko <[email protected]> * Fix megatron test Signed-off-by: Piotr Żelasko <[email protected]> * Disable plugin for high entropy strings in secrets detector Signed-off-by: Piotr Żelasko <[email protected]> * Fix CodeQL errors Signed-off-by: Piotr Żelasko <[email protected]> * fix unit tests Signed-off-by: Piotr Żelasko <[email protected]> * fix another unit test Signed-off-by: Piotr Żelasko <[email protected]> * Fix multimodal tests Signed-off-by: Piotr Żelasko <[email protected]> * Apply isort and black reformatting Signed-off-by: pzelasko <[email protected]> * fixes after merging canary2 pr to main Signed-off-by: Piotr Żelasko <[email protected]> * fix headers Signed-off-by: Piotr Żelasko <[email protected]> * fix canary integration test + formatting Signed-off-by: Piotr Żelasko <[email protected]> * Address reviews - add sync_max_audio_length flag for conformer encoder Signed-off-by: Piotr Żelasko <[email protected]> * Revert change in secrets detector Signed-off-by: Piotr Żelasko <[email protected]> * Revert change in secrets detector Signed-off-by: Piotr Żelasko <[email protected]> * Revert change in 
secrets detector Signed-off-by: Piotr Żelasko <[email protected]> * Address code review Signed-off-by: Piotr Żelasko <[email protected]> * Address Steve's review Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: zhehuaichen <[email protected]> Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: pzelasko <[email protected]> Signed-off-by: Krishna Puvvada <[email protected]> Signed-off-by: krishnacpuvvada <[email protected]> Co-authored-by: zhehuaichen <[email protected]> Co-authored-by: pzelasko <[email protected]> Co-authored-by: Krishna Puvvada <[email protected]> Co-authored-by: Krishna Puvvada <[email protected]> Co-authored-by: krishnacpuvvada <[email protected]> Co-authored-by: zhehuaichen <[email protected]>

* Sync validation metrics for ASRModel Signed-off-by: Piotr Żelasko <[email protected]> * support sync for single-dataloader case Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko <[email protected]>

* nemo 2 support Signed-off-by: Onur Yilmaz <[email protected]> * Remove unwanted params in DDP init in Megatron Parallel Signed-off-by: Hemil Desai <[email protected]> * nemo2 working with query Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * multigpu deployment with nemo2 works Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * add max output lenght Signed-off-by: Onur Yilmaz <[email protected]> * Remove prints Signed-off-by: Onur Yilmaz <[email protected]> * Fix merge conflicts Signed-off-by: Onur Yilmaz <[email protected]> * readded this file Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: Hemil Desai <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: Hemil Desai <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]>

* Add SFT/PEFT HF tests Signed-off-by: Alexandros Koumparoulis <[email protected]> * move hf examples to examples dir Signed-off-by: Alexandros Koumparoulis <[email protected]> * bot Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * use mini_squad Signed-off-by: Alexandros Koumparoulis <[email protected]> * use mini_squad Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * add 2gpu DDP Signed-off-by: Alexandros Koumparoulis <[email protected]> * refactor Signed-off-by: Alexandros Koumparoulis <[email protected]> * use labels as passed by the user Signed-off-by: Alexandros Koumparoulis <[email protected]> * update samples/ tests Signed-off-by: Alexandros Koumparoulis <[email protected]> * rm unused imports Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add tests with subset split names, e.g. train[:100] Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * add --disable-ckpt Signed-off-by: Alexandros Koumparoulis <[email protected]> * use self-hosted-azure-gpus-1 for single-gpu test Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add TRANSFORMERS_OFFLINE=1 to hf tests Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]>

Signed-off-by: BoxiangW <[email protected]>

tests/collections/llm/hf/utils.py

@@ -0,0 +1,11 @@
+from importlib.metadata import version
+from packaging.version import Version as PkgVersion


github-actions · 2025-01-07T23:51:10Z

beep boop 🤖: 🚨 The following files must be fixed before merge!

Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.lightning.pytorch.strategies.fsdp2_strategy
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:85:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:91:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:116:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:142:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:151:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:161:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:169:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:177:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:198:4: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.25/10

Mitigation guide:

Add sensible and useful docstrings to functions and methods
For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.
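
The three mitigation options above can be sketched as follows. This is only an illustration of the comment placement the guide describes: `shard_size` and `StrategyConfig` are hypothetical names and are not taken from `fsdp2_strategy.py`.

```python
def shard_size(total_params: int, world_size: int) -> int:
    """Return the number of parameters each rank holds, rounding up."""
    # Option 1: a real docstring, which is what C0116 asks for.
    return -(-total_params // world_size)


class StrategyConfig:
    """Hypothetical container used only to show where the pylint comments go."""

    def __init__(self, world_size: int) -> None:
        self._world_size = world_size

    @property
    def is_distributed(self) -> bool:
        # Option 2: trivial accessor, disable comment inside the function itself.
        # pylint: disable=C0116
        return self._world_size > 1

    # Option 3: disable before the first method and re-enable after the last.
    # pylint: disable=C0116
    def get_world_size(self) -> int:
        return self._world_size

    def set_world_size(self, value: int) -> None:
        self._world_size = value

    # pylint: enable=C0116
```

Option 1 is usually preferable for anything non-trivial, since the disable comments silence the check without documenting the method.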

By applying these rules, we reduce the occurrence of this message in future.

Thank you for improving NeMo's documentation!

github-actions · 2025-01-08T13:25:36Z

[🤖]: Hi @BoxiangW 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot, so I'll leave it to you what to do next.

//cc @pablo-garay @ko3n1g

ko3n1g

Please add the tests to the final step:)

Signed-off-by: Alexandros Koumparoulis <[email protected]>

github-actions · 2025-01-08T18:29:57Z

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.

Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.llm.gpt.model.hf_auto_model_for_causal_lm
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:27:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:35:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:63:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:74:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:77:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:104:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:107:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:130:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:150:4: C0116: Missing function or method docstring (missing-function-docstring)
************* Module nemo.lightning.pytorch.strategies.megatron_strategy
nemo/lightning/pytorch/strategies/megatron_strategy.py:286:4: C0116: Missing function or method docstring (missing-function-docstring)
************* Module nemo.lightning.pytorch.strategies.utils
nemo/lightning/pytorch/strategies/utils.py:41:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/pytorch/strategies/utils.py:50:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:58:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:70:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:86:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:121:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:131:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:186:0: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.75/10

Mitigation guide:

Add sensible and useful docstrings to functions and methods
For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.

By applying these rules, we reduce the occurrence of this message in future.

Thank you for improving NeMo's documentation!

github-actions · 2025-01-08T18:30:17Z

beep boop 🤖: 🚨 The following files must be fixed before merge!

Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.lightning.pytorch.strategies.fsdp2_strategy
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:85:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:91:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:116:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:142:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:151:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:161:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:169:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:177:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:198:4: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.25/10

Mitigation guide:

Add sensible and useful docstrings to functions and methods
For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.

By applying these rules, we reduce the occurrence of this message in future.

Thank you for improving NeMo's documentation!

Add fsdp2 strategy

3338e06

Signed-off-by: Boxiang Wang <[email protected]>

BoxiangW self-assigned this Dec 9, 2024

Apply isort and black reformatting

4c9b5df

Signed-off-by: BoxiangW <[email protected]>

github-advanced-security bot found potential problems Dec 9, 2024

nemo/lightning/pytorch/strategies/fsdp2_strategy.py

BoxiangW and others added 26 commits December 9, 2024 16:01

Add imports

11a4637

Signed-off-by: Boxiang Wang <[email protected]>

Apply isort and black reformatting

5971cf4

Signed-off-by: BoxiangW <[email protected]>

Merge branch 'main' into boxiangw/non-mcore-fsdp2

3533b89

Add init import

7c30f82

Signed-off-by: Boxiang Wang <[email protected]>

Apply isort and black reformatting

ef11d67

Signed-off-by: BoxiangW <[email protected]>

ci: Bump release workflow (#11544)

b81df9e

Signed-off-by: Oliver Koenig <[email protected]>

ci: Use SHA for cut-off (#11545)

dfbb87f

Signed-off-by: Oliver Koenig <[email protected]>

link to mcore documentation (#11538)

1fe0310

Signed-off-by: ashors1 <[email protected]>

ci: Adjust inputs for code-freeze workflow (#11550)

02c2cdf

Signed-off-by: Oliver Koenig <[email protected]>

ci: Bump release freeze (#11551)

c37570a

Signed-off-by: Oliver Koenig <[email protected]>

Ko3n1g/ci/commit sha for cutoff (#11553)

37ee432

* ci: Remove token from checkout
  Signed-off-by: Oliver Koenig <[email protected]>
* bump version
  Signed-off-by: Oliver Koenig <[email protected]>
---------
Signed-off-by: Oliver Koenig <[email protected]>

ci: Bump code-freeze workflow (#11554)

fc39d24

Signed-off-by: Oliver Koenig <[email protected]>

ci: Bump code freeze workflow (#11557)

2aff616

Signed-off-by: Oliver Koenig <[email protected]>

perf summary docs link (#11262)

f68208e

Signed-off-by: Malay Nagda <[email protected]> Co-authored-by: oliver könig <[email protected]>

[TTS] Add audio and mel codec HF models to docs (#11526)

c1bb950

Signed-off-by: Ryan <[email protected]>

Sync validation metrics for ASRModel (#11533)

9c264b7

* Sync validation metrics for ASRModel
  Signed-off-by: Piotr Żelasko <[email protected]>
* support sync for single-dataloader case
  Signed-off-by: Piotr Żelasko <[email protected]>
---------
Signed-off-by: Piotr Żelasko <[email protected]>

Fix import

ca5ffe4

BoxiangW added Run CICD and removed Run CICD labels Jan 7, 2025

BoxiangW added 2 commits January 7, 2025 09:44

Merge branch 'main' into boxiangw/non-mcore-fsdp2

42f4ee8

Merge branch 'main' into boxiangw/non-mcore-fsdp2

1430226

BoxiangW added Run CICD and removed Run CICD labels Jan 7, 2025

BoxiangW and others added 2 commits January 7, 2025 14:40

fix test

b78205a

Apply isort and black reformatting

1e069c3

Signed-off-by: BoxiangW <[email protected]>

BoxiangW added Run CICD and removed Run CICD labels Jan 7, 2025

github-advanced-security bot found potential problems Jan 7, 2025

tests/collections/llm/hf/utils.py

@@ -0,0 +1,11 @@

from importlib.metadata import version

from packaging.version import Version as PkgVersion

Check notice (Code scanning / CodeQL): Unused import [Note, test]

Import of 'PkgVersion' is not used.
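
A notice like this is typically resolved either by deleting the import or by actually using it. The sketch below shows the second option, using the two imports from the diff to compare an installed package version against a minimum. The helper `installed_at_least` is hypothetical; the thread does not show what `tests/collections/llm/hf/utils.py` does with these imports.

```python
from importlib.metadata import version

from packaging.version import Version as PkgVersion


def installed_at_least(package: str, minimum: str) -> bool:
    """Return True if the installed `package` is at least version `minimum`.

    Raises importlib.metadata.PackageNotFoundError if `package` is not installed.
    """
    # PkgVersion compares release segments numerically, so "2.10.0" sorts
    # after "2.9.1", which a plain string comparison would get wrong.
    return PkgVersion(version(package)) >= PkgVersion(minimum)
```

A test could then be skipped or adjusted based on, for example, `installed_at_least("packaging", "20.0")`.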

Add copyright

81aeb50

BoxiangW added Run CICD and removed Run CICD labels Jan 7, 2025

akoumpa previously approved these changes Jan 8, 2025

BoxiangW enabled auto-merge (squash) January 8, 2025 18:05

ko3n1g requested changes Jan 8, 2025

pablo-garay previously approved these changes Jan 8, 2025

include test list

d8e7247

Signed-off-by: Alexandros Koumparoulis <[email protected]>

akoumpa dismissed stale reviews from pablo-garay and themself via d8e7247 January 8, 2025 18:29

pablo-garay approved these changes Jan 8, 2025

ko3n1g disabled auto-merge January 8, 2025 18:32

ko3n1g merged commit 5d8baa4 into main Jan 8, 2025
26 of 28 checks passed

ko3n1g deleted the boxiangw/non-mcore-fsdp2 branch January 8, 2025 18:32
