Add non-mcore fsdp2 strategy #11525

Merged: 135 commits, Jan 8, 2025

Conversation

@BoxiangW (Collaborator) commented Dec 9, 2024

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install (e.g., Numba, Pynini, Apex)?
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Boxiang Wang <[email protected]>
@BoxiangW BoxiangW self-assigned this Dec 9, 2024
BoxiangW and others added 26 commits December 9, 2024 16:01
Signed-off-by: Boxiang Wang <[email protected]>
Signed-off-by: Boxiang Wang <[email protected]>
* Initial commit

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

---------

Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>
* Make HfDatasetDataModule a datasets.load_dataset wrapper

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add logging

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Update HFDatasetDataModule

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor fixup

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor fixup #2

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* do not expand

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* doc

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* doc

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add synonym

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* typo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* Add train/val/test attributes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add test for hf-datamodule

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Import lazily to avoid breaking with older megatron versions

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* bot happy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* bot happy2

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add doc-strings and collate-fn arg

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
* ci: Remove token from checkout

Signed-off-by: Oliver Koenig <[email protected]>

* bump version

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
* Fix llm.deploy api

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* Apply isort and black reformatting

Signed-off-by: hemildesai <[email protected]>

* PR feedback

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

---------

Signed-off-by: Hemil Desai <[email protected]>
Signed-off-by: hemildesai <[email protected]>
Co-authored-by: hemildesai <[email protected]>
Signed-off-by: Malay Nagda <[email protected]>
Co-authored-by: oliver könig <[email protected]>
* update recipe

Signed-off-by: yaoyu-33 <[email protected]>

* fix mllama mock ds

Signed-off-by: yaoyu-33 <[email protected]>

* update to use attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* remove example

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring mock.py

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring language.py

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring language.py

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring mllama/base.py

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring mllama/language.py

Signed-off-by: yaoyu-33 <[email protected]>

* bump mcore

Signed-off-by: Oliver Koenig <[email protected]>

* Add scripts for mllama

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update script

Signed-off-by: yaoyu-33 <[email protected]>

* fix pylint

Signed-off-by: yaoyu-33 <[email protected]>

* revert Dockerfile.ci

Signed-off-by: Yu Yao <[email protected]>

* add scripts

Signed-off-by: yaoyu-33 <[email protected]>

* add vlm training test in ci

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring issues

Signed-off-by: yaoyu-33 <[email protected]>

* update script match recipe

Signed-off-by: yaoyu-33 <[email protected]>

* update recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update mllama_train.py

Signed-off-by: Yu Yao <[email protected]>

* update mllama 90b recipe

Signed-off-by: yaoyu-33 <[email protected]>

* update to use tmp in ci tests

Signed-off-by: yaoyu-33 <[email protected]>

* update default llava config

Signed-off-by: yaoyu-33 <[email protected]>

* add nemo run scripts

Signed-off-by: yaoyu-33 <[email protected]>

* fix vpp issue

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix cicd

Signed-off-by: yaoyu-33 <[email protected]>

* fix cicd

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* remove duplicated script

Signed-off-by: yaoyu-33 <[email protected]>

* ci: Add HF cache

Signed-off-by: oliver könig <[email protected]>

* update to use SP in recipe

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* upgrade

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "upgrade"

This reverts commit f6ad2cd.

* update neva api

Signed-off-by: yaoyu-33 <[email protected]>

* update neva api

Signed-off-by: yaoyu-33 <[email protected]>

* fix neva processing

Signed-off-by: yaoyu-33 <[email protected]>

* fix lint

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix data fields

Signed-off-by: yaoyu-33 <[email protected]>

* few fixes

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Oliver Koenig <[email protected]>
* Add from_dict method

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add test_load_from_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add test_load_from_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
* prevent llama3.1 from using linear interpolation

* Apply isort and black reformatting

Signed-off-by: suiyoubi <[email protected]>

---------

Signed-off-by: suiyoubi <[email protected]>
Co-authored-by: suiyoubi <[email protected]>
* update for nest release

Signed-off-by: stevehuang52 <[email protected]>

* make pylint happier

Signed-off-by: stevehuang52 <[email protected]>

* fix for lhotse dataloader

Signed-off-by: stevehuang52 <[email protected]>

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* minor refactor

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>
* Port changes related to SFT text+speech dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert changes from Canary(nonLLM) code

Signed-off-by: Piotr Żelasko <[email protected]>

* Add joint text/audio dataloading capability to speechllm

Signed-off-by: Piotr Żelasko <[email protected]>

* include text-only into fprop of training and eval; TODO: text-only predict

Signed-off-by: zhehuaichen <[email protected]>

* Actually working forward step

Signed-off-by: Piotr Żelasko <[email protected]>

* Support for source-target text file pair training for MT+speech

Signed-off-by: Piotr Żelasko <[email protected]>

* Include supervision text tokens in audio example's num tokens

Signed-off-by: Piotr Żelasko <[email protected]>

* Disable conformer seq len NCCL sync

Signed-off-by: Piotr Żelasko <[email protected]>

* Preliminary sampler fusion strategies support: mux/zip/round_robin/randomized_round_robin

Signed-off-by: Piotr Żelasko <[email protected]>

* Working V2 version of multimodal dataloading. Each modality gets its own batch settings that can be merged with zip sampler to enjoy max batch sizes for both modalities in a single training step. Each modality runs fwd+bwd in turn to save GPU memory (instead of running fwd separately and bwd together).

Signed-off-by: Piotr Żelasko <[email protected]>

* Add missing config

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert multimodal grad accum and fix mask padding issue

Signed-off-by: Piotr Żelasko <[email protected]>

* Add modality weights support via cfg.model.modality_weights

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix for V2 dataloader shuffling CRITICAL

Signed-off-by: Piotr Żelasko <[email protected]>

* Restore multimodal grad accum

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix unit tests for multi-sampler configurations

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* nemo gemma to hf conversion (#9629)

* adding script for gemma nemo to hf

Signed-off-by: Krishna Puvvada <[email protected]>

* adding verification for convert_gemma_nemo_to_hf

Signed-off-by: Krishna Puvvada <[email protected]>

* Apply isort and black reformatting

Signed-off-by: krishnacpuvvada <[email protected]>

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: krishnacpuvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: krishnacpuvvada <[email protected]>

* support FSDP (thanks to Yifan for trying it early) (#10062)

Note: as of now, this is still not fully working on the cluster. See above doc for details.
Signed-off-by: zhehuaichen <[email protected]>

* Fix unit tests after rebasing on recent main

Signed-off-by: Piotr Żelasko <[email protected]>

* support megatron_amp_O2 and tp (#10599)

* Port changes related to SFT text+speech dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert changes from Canary(nonLLM) code

Signed-off-by: Piotr Żelasko <[email protected]>

* Add joint text/audio dataloading capability to speechllm

Signed-off-by: Piotr Żelasko <[email protected]>

* include text-only into fprop of training and eval; TODO: text-only predict

Signed-off-by: zhehuaichen <[email protected]>

* Actually working forward step

Signed-off-by: Piotr Żelasko <[email protected]>

* Support for source-target text file pair training for MT+speech

Signed-off-by: Piotr Żelasko <[email protected]>

* Include supervision text tokens in audio example's num tokens

Signed-off-by: Piotr Żelasko <[email protected]>

* Disable conformer seq len NCCL sync

Signed-off-by: Piotr Żelasko <[email protected]>

* Preliminary sampler fusion strategies support: mux/zip/round_robin/randomized_round_robin

Signed-off-by: Piotr Żelasko <[email protected]>

* Working V2 version of multimodal dataloading. Each modality gets its own batch settings that can be merged with zip sampler to enjoy max batch sizes for both modalities in a single training step. Each modality runs fwd+bwd in turn to save GPU memory (instead of running fwd separately and bwd together).

Signed-off-by: Piotr Żelasko <[email protected]>

* Add missing config

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert multimodal grad accum and fix mask padding issue

Signed-off-by: Piotr Żelasko <[email protected]>

* Add modality weights support via cfg.model.modality_weights

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix for V2 dataloader shuffling CRITICAL

Signed-off-by: Piotr Żelasko <[email protected]>

* Restore multimodal grad accum

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix unit tests for multi-sampler configurations

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* nemo gemma to hf conversion (#9629)

* adding script for gemma nemo to hf

Signed-off-by: Krishna Puvvada <[email protected]>

* adding verification for convert_gemma_nemo_to_hf

Signed-off-by: Krishna Puvvada <[email protected]>

* Apply isort and black reformatting

Signed-off-by: krishnacpuvvada <[email protected]>

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: krishnacpuvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: krishnacpuvvada <[email protected]>

* support FSDP (thanks to Yifan for trying it early)

Signed-off-by: zhehuaichen <[email protected]>

* debug TP deadlock

Signed-off-by: zhehuaichen <[email protected]>

* some fixes for fsdp and tp

/lustre/fsw/portfolios/llmservice/users/zhehuaic/results/canary-v0_speechllm/prompt_lhmerge5_p2b_oci_FC-GPT_llama_canaryset_b6s4kf-sunolong_noCC_langtemp0.5_dsettemp0.5_lr1e-4wd1e-3_CosineAnnealing_warmup2500_minlr1e-6_gbs2048_mbs16_ep200/error-1417621-0.out

/lustre/fsw/portfolios/llmservice/users/zhehuaic/results/canary-v0_speechllm/prompt_lhmerge5_p2b_tp_oci_FC-GPT_llama_canaryset_b6s4kf-sunolong_noCC_langtemp0.5_dsettemp0.5_lr1e-4wd1e-3_CosineAnnealing_warmup2500_minlr1e-6_gbs128_mbs16_ep200/error-1421103-3.out

Signed-off-by: zhehuaichen <[email protected]>

* nit fix
Signed-off-by: zhehuaichen <[email protected]>

* fix for llama3.1
Signed-off-by: zhehuaichen <[email protected]>

* for llama3.1
Signed-off-by: zhehuaichen <[email protected]>

* fix for inference
Signed-off-by: zhehuaichen <[email protected]>

* fix inference
Signed-off-by: zhehuaichen <[email protected]>

* fix grad accu
Signed-off-by: zhehuaichen <[email protected]>

* fix inference
Signed-off-by: zhehuaichen <[email protected]>

* initial impl to support megatron_amp_O2 in salm, bestow, salm-t5

Signed-off-by: zhehuaichen <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: zhehuaichen <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: krishnacpuvvada <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: pzelasko <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: krishnacpuvvada <[email protected]>

* minor change in dataloader (#10601)

* Speechllm dataset basic unit test (#10631)

* Basic unit test for speechllm lhotse dataset

Signed-off-by: Piotr Żelasko <[email protected]>

* cleanup

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Unit test for existing speechllm dataset with llama2 prompt format (#10634)

Signed-off-by: Piotr Żelasko <[email protected]>

* [speechllm] Replace TextProcessing with PromptFormatter (#10639)

* [speechllm] Replace TextProcessing with PromptFormatter

Signed-off-by: Piotr Żelasko <[email protected]>

* Test for tokens_to_generate

Signed-off-by: Piotr Żelasko <[email protected]>

* Padding optimization for speechlm dataset

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Multimodal conversation format dataloading (#10683)

* Draft implementation of NeMo Multimodal Conversation format

Signed-off-by: Piotr Żelasko <[email protected]>

* Fully working data parsing and iteration

Signed-off-by: Piotr Żelasko <[email protected]>

* Fully working dataloading with tokenization + prompting

Signed-off-by: Piotr Żelasko <[email protected]>

* Collapse consecutive user turns into single turn

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* a few fixes for the new prompt template based dataloader and lora+distributed fused adam (#10701)

* Draft implementation of NeMo Multimodal Conversation format

Signed-off-by: Piotr Żelasko <[email protected]>

* Fully working data parsing and iteration

Signed-off-by: Piotr Żelasko <[email protected]>

* Fully working dataloading with tokenization + prompting

Signed-off-by: Piotr Żelasko <[email protected]>

* Collapse consecutive user turns into single turn

Signed-off-by: Piotr Żelasko <[email protected]>

* compatible with previous expts

Signed-off-by: zhehuaichen <[email protected]>

* support gemma

Signed-off-by: zhehuaichen <[email protected]>

* handle the case max_seq_length is smaller than input_id length

Signed-off-by: zhehuaichen <[email protected]>

* fix max seq case

Signed-off-by: zhehuaichen <[email protected]>

* fix lora ckpt storing and loading

Signed-off-by: zhehuaichen <[email protected]>

* temp fix for distributed fused adam

Signed-off-by: zhehuaichen <[email protected]>

* revert changes in nemo_adapters.py
Signed-off-by: zhehuaichen <[email protected]>

* Fix tokenize_with_prompt

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: zhehuaichen <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: zhehuaichen <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>

* Mechanism to insert BOS/EOS at the beginning/end of dialog (#10923)

* Mechanism to insert BOS/EOS at the beginning/end of dialog

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix Gemma prompt formatter test

Signed-off-by: Piotr Żelasko <[email protected]>

* Add a test specifically for multiturn insertion of bos/eos

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Add options to override default map/iterable dataset style selection in lhotse dataloader

Signed-off-by: Piotr Żelasko <[email protected]>

* Feature/conversations tarred (#11086)

* Multimodal conversation tarring script

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix sharding logic

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix dir creation

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* EMMeTT support in SpeechLLM + tutorial for Lhotse Multimodal Dataloading (#10927)

* Preliminary support for oomptimizer

Signed-off-by: Piotr Żelasko <[email protected]>

* OOMptimizer for SpeechLLM

Signed-off-by: Piotr Żelasko <[email protected]>

* Initial version of estimate token bins script

Signed-off-by: Piotr Żelasko <[email protected]>

* Initial support for multimodal 2d bucketing

Signed-off-by: Piotr Żelasko <[email protected]>

* Extend to text-to-text oomptimizer

Signed-off-by: Piotr Żelasko <[email protected]>

* Preliminary support for Llama2 prompt format in ast+mt

Signed-off-by: Piotr Żelasko <[email protected]>

* Support for 1D estimate token bins

Signed-off-by: Piotr Żelasko <[email protected]>

* Support for 1D estimate token bins

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix

Signed-off-by: Piotr Żelasko <[email protected]>

* Minor tweaks

Signed-off-by: Piotr Żelasko <[email protected]>

* Add min/max tokens filter

Signed-off-by: Piotr Żelasko <[email protected]>

* Change to bisect_left for bucket idx selection

Signed-off-by: Piotr Żelasko <[email protected]>

* Add reconfigure_num_microbatches_calculator at the start of train epoch for modular models

Signed-off-by: Piotr Żelasko <[email protected]>

* Update lhotse multi-sampler config and make validation datasets finite

Signed-off-by: Piotr Żelasko <[email protected]>

* Initial implementation of text+audio training for T5 modular models

Signed-off-by: Piotr Żelasko <[email protected]>

* megatron t5 nmt prompt formatter

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes for MT+AST T5 oomptimizer and training

Signed-off-by: Piotr Żelasko <[email protected]>

* configs, fixes, token-per-token filtering

* Support text modality in predict_step

Signed-off-by: Piotr Żelasko <[email protected]>

* Support text data in val/test dl

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix infinite

Signed-off-by: Piotr Żelasko <[email protected]>

* prompt format fixes

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes in audio supervision

Signed-off-by: Piotr Żelasko <[email protected]>

* remove superficial padding

Signed-off-by: Piotr Żelasko <[email protected]>

* test config and prompt context fetching fixes

Signed-off-by: Piotr Żelasko <[email protected]>

* support text-only decoding for salm/bestow

Signed-off-by: Piotr Żelasko <[email protected]>

* Add unit tests for EMMETT / refactor prompt_format_fn

Signed-off-by: Piotr Żelasko <[email protected]>

* make t5nmt prompt formatter auto discoverable

Signed-off-by: Piotr Żelasko <[email protected]>

* include token count / tpt filtering in estimate_token_bins

Signed-off-by: Piotr Żelasko <[email protected]>

* fix max token filter

Signed-off-by: Piotr Żelasko <[email protected]>

* some fixes

Signed-off-by: Piotr Żelasko <[email protected]>

* custom mixin for text adapters

Signed-off-by: Piotr Żelasko <[email protected]>

* Warmup in oomptimizer-speechlm

Signed-off-by: Piotr Żelasko <[email protected]>

* Move oomptimizer-speechllm to separate directory

Signed-off-by: Piotr Żelasko <[email protected]>

* Initial cleanup

Signed-off-by: Piotr Żelasko <[email protected]>

* Refactoring of prompt format fn and length measurement and filtering for data types; improved unit test coverage

Signed-off-by: Piotr Żelasko <[email protected]>

* Refactor sampler constraints / filters into sampling.py

Signed-off-by: Piotr Żelasko <[email protected]>

* Tests and support for sampler length measurement of multimodal conversations

Signed-off-by: Piotr Żelasko <[email protected]>

* Update estimate_token_bins.py

Signed-off-by: Piotr Żelasko <[email protected]>

* Move estimate_token_bins.py to speech_llm scripts

Signed-off-by: Piotr Żelasko <[email protected]>

* Minor tweaks

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes for SpeechLLM dataset

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* Add missing emmett tests

Signed-off-by: Piotr Żelasko <[email protected]>

* Add tutorial about multimodal lhotse dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Updated documentation for multimodal dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Prompt Formatter tutorial

Signed-off-by: Piotr Żelasko <[email protected]>

* Review comments

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes for sampling filters None values

Signed-off-by: Piotr Żelasko <[email protected]>

* Changes requested by Steve: moving some args to main config namespace in multi config sampler

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* Update default configs to the modified config schema

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix omegaconf use issue

Signed-off-by: Piotr Żelasko <[email protected]>

* Update the docs to the modified multi config format

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
Co-authored-by: pzelasko <[email protected]>

* Remove old TODO comments

Signed-off-by: Piotr Żelasko <[email protected]>

* Remove prompts/fn.py

Signed-off-by: Piotr Żelasko <[email protected]>

* Copyright notices

Signed-off-by: Piotr Żelasko <[email protected]>

* Make linter happy

Signed-off-by: Piotr Żelasko <[email protected]>

* Make linter happy

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix megatron test

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix megatron test

Signed-off-by: Piotr Żelasko <[email protected]>

* Disable plugin for high entropy strings in secrets detector

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix CodeQL errors

Signed-off-by: Piotr Żelasko <[email protected]>

* fix unit tests

Signed-off-by: Piotr Żelasko <[email protected]>

* fix another unit test

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix multimodal tests

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* fixes after merging canary2 pr to main

Signed-off-by: Piotr Żelasko <[email protected]>

* fix headers

Signed-off-by: Piotr Żelasko <[email protected]>

* fix canary integration test + formatting

Signed-off-by: Piotr Żelasko <[email protected]>

* Address reviews - add sync_max_audio_length flag for conformer encoder

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert change in secrets detector

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert change in secrets detector

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert change in secrets detector

Signed-off-by: Piotr Żelasko <[email protected]>

* Address code review

Signed-off-by: Piotr Żelasko <[email protected]>

* Address Steve's review

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: zhehuaichen <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: krishnacpuvvada <[email protected]>
Co-authored-by: zhehuaichen <[email protected]>
Co-authored-by: pzelasko <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: krishnacpuvvada <[email protected]>
Co-authored-by: zhehuaichen <[email protected]>
* Sync validation metrics for ASRModel

Signed-off-by: Piotr Żelasko <[email protected]>

* support sync for single-dataloader case

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
* nemo 2 support

Signed-off-by: Onur Yilmaz <[email protected]>

* Remove unwanted params in DDP init in Megatron Parallel

Signed-off-by: Hemil Desai <[email protected]>

* nemo2 working with query

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* multigpu deployment with nemo2 works

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* add max output length

Signed-off-by: Onur Yilmaz <[email protected]>

* Remove prints

Signed-off-by: Onur Yilmaz <[email protected]>

* Fix merge conflicts

Signed-off-by: Onur Yilmaz <[email protected]>

* re-added this file

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Hemil Desai <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: Hemil Desai <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>
* Add SFT/PEFT HF tests

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move hf examples to examples dir

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* bot

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use mini_squad

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use mini_squad

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* add 2gpu DDP

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use labels as passed by the user

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update samples/ tests

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rm unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add tests with subset split names, e.g. train[:100]

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* add --disable-ckpt

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use self-hosted-azure-gpus-1 for single-gpu test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add TRANSFORMERS_OFFLINE=1 to hf tests

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
@BoxiangW BoxiangW added Run CICD and removed Run CICD labels Jan 7, 2025
@@ -0,0 +1,11 @@
from importlib.metadata import version
from packaging.version import Version as PkgVersion

Check notice: Code scanning / CodeQL

Unused import (Note, test): Import of 'PkgVersion' is not used.
@BoxiangW BoxiangW added Run CICD and removed Run CICD labels Jan 7, 2025
Contributor

github-actions bot commented Jan 7, 2025

beep boop 🤖: 🚨 The following files must be fixed before merge!


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.lightning.pytorch.strategies.fsdp2_strategy
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:85:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:91:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:116:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:142:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:151:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:161:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:169:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:177:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:198:4: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.25/10

Mitigation guide:

  • Add sensible and useful docstrings to functions and methods
  • For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
  • To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.
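
A minimal sketch of the three mitigation options listed above. The class and method names are hypothetical and are not taken from fsdp2_strategy.py; only the `# pylint: disable=C0116` and `# pylint: enable=C0116` comments come from the guide itself.

```python
"""Hypothetical module illustrating the three C0116 mitigation options."""


class ExampleStrategy:
    """Hypothetical stand-in class; not the real FSDP2 strategy."""

    def __init__(self, world_size: int) -> None:
        self._world_size = world_size
        self._ready = False

    # Option 1: give a non-trivial method a real docstring.
    def shard_count(self) -> int:
        """Return the number of shards, assumed here to be one per rank."""
        return self._world_size

    # Option 2: silence the check inside a trivial getter.
    @property
    def world_size(self) -> int:
        # pylint: disable=C0116
        return self._world_size

    # Option 3: bracket a run of trivial methods with disable/enable.
    # pylint: disable=C0116
    def setup(self) -> None:
        self._ready = True

    def teardown(self) -> None:
        self._ready = False

    # pylint: enable=C0116
```

Option 1 is usually preferable for anything with real behavior; the disable comments are better reserved for getters, setters, and similar one-liners where a docstring would only restate the name.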

By applying these rules, we reduce the occurrence of this message in the future.

Thank you for improving NeMo's documentation!

Contributor

github-actions bot commented Jan 8, 2025

[🤖]: Hi @BoxiangW 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot, so I'll leave it up to you what to do next.

//cc @pablo-garay @ko3n1g

akoumpa
akoumpa previously approved these changes Jan 8, 2025
@BoxiangW BoxiangW enabled auto-merge (squash) January 8, 2025 18:05
Collaborator

@ko3n1g ko3n1g left a comment

Please add the tests to the final step:)

pablo-garay
pablo-garay previously approved these changes Jan 8, 2025
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa dismissed stale reviews from pablo-garay and themself via d8e7247 January 8, 2025 18:29
Contributor

github-actions bot commented Jan 8, 2025

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.llm.gpt.model.hf_auto_model_for_causal_lm
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:27:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:35:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:63:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:74:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:77:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:104:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:107:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:130:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/model/hf_auto_model_for_causal_lm.py:150:4: C0116: Missing function or method docstring (missing-function-docstring)
************* Module nemo.lightning.pytorch.strategies.megatron_strategy
nemo/lightning/pytorch/strategies/megatron_strategy.py:286:4: C0116: Missing function or method docstring (missing-function-docstring)
************* Module nemo.lightning.pytorch.strategies.utils
nemo/lightning/pytorch/strategies/utils.py:41:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/pytorch/strategies/utils.py:50:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:58:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:70:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:86:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:121:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:131:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/utils.py:186:0: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.75/10

Mitigation guide:

  • Add sensible and useful docstrings to functions and methods
  • For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
  • To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.

By applying these rules, we reduce the occurrence of this message in the future.

Thank you for improving NeMo's documentation!

Contributor

github-actions bot commented Jan 8, 2025

beep boop 🤖: 🚨 The following files must be fixed before merge!


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.lightning.pytorch.strategies.fsdp2_strategy
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:85:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:91:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:116:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:142:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:151:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:161:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:169:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:177:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/strategies/fsdp2_strategy.py:198:4: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.25/10

Mitigation guide:

  • Add sensible and useful docstrings to functions and methods
  • For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
  • To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.

By applying these rules, we reduce the occurrence of this message in the future.

Thank you for improving NeMo's documentation!

@ko3n1g ko3n1g disabled auto-merge January 8, 2025 18:32
@ko3n1g ko3n1g merged commit 5d8baa4 into main Jan 8, 2025
26 of 28 checks passed
@ko3n1g ko3n1g deleted the boxiangw/non-mcore-fsdp2 branch January 8, 2025 18:32