
[Bugfix] fix qwen tokenizer config when converting to nemo format #11098

Merged: 4 commits into NVIDIA:main on Jan 9, 2025

Conversation

@chrjxj (Contributor) commented Oct 30, 2024

What does this PR do ?

This PR fixes the Qwen tokenizer config when converting Qwen2 from HF to NeMo format.

Background: when running Qwen SFT, _build_tokenizer() returns either Qwen2Tokenizer or Qwen2TokenizerFast, depending on cfg.tokenizer.

Without this update, _build_tokenizer uses the default value of use_fast (False) and creates a Qwen2Tokenizer instance (wrapped by NeMo/nemo/collections/common/tokenizers/huggingface/auto_tokenizer.py). A Qwen2Tokenizer instance does not have the vocab attribute, which results in errors during SFT data loading.
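
For illustration, the class transformers returns depends on the use_fast flag; a minimal check (the model ID below is only an example and is not part of this PR):

```python
from transformers import AutoTokenizer

# use_fast=False (the previous default on this code path) yields the slow
# Qwen2Tokenizer, which per the PR description does not expose the vocab
# attribute that NeMo's AutoTokenizer wrapper reads during SFT data loading.
slow = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B", use_fast=False)

# use_fast=True yields Qwen2TokenizerFast instead.
fast = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B", use_fast=True)

print(type(slow).__name__, hasattr(slow, "vocab"))
print(type(fast).__name__, hasattr(fast, "vocab"))
```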

The above changes were tested in nvcr.io/nvidia/nemo:24.07 and passed.

Collection: LLM, Qwen2

Changelog

  • Lines 86 to 92 in NeMo/scripts/checkpoint_converters/convert_qwen2_hf_to_nemo.py (see the sketch below)
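
Not the verbatim diff, but a minimal sketch of the kind of change those lines make, assuming the converter populates the tokenizer section of the NeMo model config as a dict (the helper and key names here are illustrative and may differ from the actual script):

```python
from omegaconf import OmegaConf

def build_tokenizer_cfg(hf_model_name_or_path: str):
    """Illustrative helper (not from the PR): force the fast tokenizer so the
    wrapped tokenizer exposes the vocab attribute SFT data loading relies on."""
    return OmegaConf.create(
        {
            "library": "huggingface",
            "type": hf_model_name_or_path,
            "use_fast": True,  # the key point of the fix: avoid the slow Qwen2Tokenizer
        }
    )
```

The idea is that the exported model_config.yaml then carries this setting, so _build_tokenizer picks Qwen2TokenizerFast during SFT.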

Usage

  • won't affect usage

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@ntajbakhsh (Collaborator)

@cuichenx can you review this PR please?

@cuichenx cuichenx self-requested a review November 18, 2024 19:10
@cuichenx cuichenx enabled auto-merge (squash) November 18, 2024 19:12
@cuichenx (Collaborator) left a comment


LGTM! thanks for the fix

@zhaoyang-star commented Dec 4, 2024

Thanks for your bugfix. I tested your code and found that I cannot get the right response when doing inference with the generated .nemo file. Is the .nemo file correct? Could you please have a look? Thanks. @chrjxj @cuichenx

1. HF -> nemo

Convert Qwen2.5-7B to Qwen2.5-7B.nemo. After tar xvf, the contents of Qwen2.5-7B.nemo are as follows:

root@inp16075348439349544319-3-1:/mnt/tenant-home_speed/nemo_models# tar xvf Qwen2.5-7B.nemo 
./
./model_config.yaml
./model_weights/
./model_weights/.metadata
./model_weights/__0_0.distcp
./model_weights/__0_1.distcp
./model_weights/common.pt
./model_weights/metadata.json

2. Inference using Qwen2.5-7B.nemo

Then I tried to do inference using Qwen2.5-7B.nemo.

python3 megatron_gpt_eval.py \
            gpt_model_file=Qwen2.5-7B.nemo \
            inference.greedy=True \
            inference.add_BOS=True \
            trainer.devices=1 \
            trainer.num_nodes=1 \
            tensor_model_parallel_size=1 \
            pipeline_model_parallel_size=1 \
            prompts='["who are you?", "What is the captial of China?"]'

The response seems wrong. Part of the output:

[NeMo I 2024-12-04 21:57:07 nlp_overrides:1386] Model MegatronGPTModel was successfully restored from Qwen2.5-7B.nemo.
prompt=========:['who are you?', 'What is the captial of China?']
setting number of microbatches to constant 1
***************************
{'sentences': ['who are you?1000000000000000000000000000000000', 'What is the captial of China?100000000000000000000000000000'], 'tokens': [['<|im_start|>', 'who', 'Ġare', 'Ġyou', '?', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], ['<|im_start|>', 'What', 'Ġis', 'Ġthe', 'Ġcapt', 'ial', 'Ġof', 'ĠChina', '?', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']], 'logprob': None, 'full_logprob': None, 'token_ids': [[151644, 14623, 525, 498, 30, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15], [151644, 3838, 374, 279, 6427, 530, 315, 5616, 30, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]], 'offsets': [[0, 0, 3, 7, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45], [0, 0, 4, 7, 11, 16, 19, 22, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58]]}
***************************
[NeMo I 2024-12-04 21:57:14 megatron_gpt_model:1717] Pipeline model parallel rank: 0, Tensor model parallel rank: 0, Number of model parameters on device: 7.62e+09. Number of precise model parameters on device: 7615616512.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
    
Predicting DataLoader 0:   0%|                                                                                                                                          | 0/1 [00:00<?, ?it/s]setting number of microbatches to constant 1
Predicting DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  0.51it/s]
***************************
[{'sentences': ['who are you?1000000000000000000000000000000000', 'What is the captial of China?100000000000000000000000000000'], 'tokens': [['<|im_start|>', 'who', 'Ġare', 'Ġyou', '?', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], ['<|im_start|>', 'What', 'Ġis', 'Ġthe', 'Ġcapt', 'ial', 'Ġof', 'ĠChina', '?', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']], 'logprob': None, 'full_logprob': None, 'token_ids': [[151644, 14623, 525, 498, 30, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15], [151644, 3838, 374, 279, 6427, 530, 315, 5616, 30, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]], 'offsets': [[0, 0, 3, 7, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45], [0, 0, 4, 7, 11, 16, 19, 22, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58]]}]
***************************

github-actions bot commented Dec 19, 2024

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Dec 19, 2024
@snowmanwwg (Collaborator)

So what is preventing this from being merged? @cuichenx @chrjxj

@github-actions github-actions bot removed the stale label Dec 25, 2024

github-actions bot commented Jan 8, 2025

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Jan 8, 2025
@cuichenx cuichenx added Run CICD and removed Run CICD labels Jan 8, 2025
@github-actions github-actions bot removed the stale label Jan 9, 2025
@cuichenx cuichenx added Run CICD and removed Run CICD labels Jan 9, 2025
@cuichenx cuichenx merged commit 7aac482 into NVIDIA:main Jan 9, 2025
198 of 201 checks passed