Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tokenizer mismatch bug between model and tokenizer for THUDM/glm-… #2672

Closed

Conversation

darkSuperman
Copy link

Fix tokenizer mismatch bug between model and tokenizer for THUDM/glm-4-9b example

@LaurentMazare
Copy link
Collaborator

Did you try out the change? There doesn't seem to be a tokenizer.json file in the repo that you've switched to as far as I can tell.https://huggingface.co/THUDM/glm-4-9b/tree/main

@darkSuperman
Copy link
Author

Sorry I misread that. Also I found this branch used natively within Transformers, and it also provides a tokenizer.json, but loading the model requires some changes Are you interested in using this branch to modify the glm4 example? https://huggingface.co/THUDM/glm-4-9b/tree/refs%2Fpr%2F15

@LaurentMazare
Copy link
Collaborator

The main reason why this model uses the tokenizer from codegeex4 is that the two tokenizers should be identical. When you look at the two sentencepiece model they have the same hash, here and here so I don't think the current version is actually fine or maybe I'm missing something here?

@darkSuperman
Copy link
Author

Yes, they are the same. Also, when I was running the inference, I encountered some problems and it seemed that I could not finish the inference. I am trying it and I will give you feedback if I have more information.

@darkSuperman
Copy link
Author

darkSuperman commented Dec 29, 2024

I tried the glm4 example and used the tokenizer.json from THUDM/codegeex4-all-9b, but I couldn't output eos_token. The token output never ended until the length of sample_len was reached, and I found that the token output was repeated. The code is the same as the glm4 example, the only difference is that I loaded the st and tokenizer.json files from the local huggingface-cli download file.

let filenames: Vec<PathBuf> = (1..=10) //Load 10 st files from local
        .map(|i| {
            format!( "/home/neptune/glm4/model-{0:05}-of-00010.safetensors", i )
        })
        .map(|path| Path::new(&path).to_path_buf())
        .collect();

let tokenizer_filename =  PathBuf::from("/home/neptune/temp/tokenizer.json");

In addition, I saw that GLM officially provided the HF version of glm4, https://huggingface.co/THUDM/glm-4-9b-chat-hf. Can I use candle for inference at present?

Could you provide some help or troubleshooting suggestions? Thanks @LaurentMazare

@LaurentMazare
Copy link
Collaborator

Not sure to understand what your problem is exactly, I've refactored a bit the glm4 example so that it is closer to other examples and from what I see most generations properly end with an eos token being produced, e.g.

$ cargo run --features cuda -r --example glm4 -- --prompt "This is a test. "
    Finished `release` profile [optimized] target(s) in 0.81s
     Running `target/release/examples/glm4 --prompt 'This is a test. '`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.60 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 9.378198ms
loaded the model in 3.518224561s
starting the inference loop
This is a test.  This is only a test.
: The Federal Emergency Management Agency has issued an alert that the U.S. East Coast could be hit by high winds and heavy rains from Hurricane Irene, which was upgraded to Category 3 storm on Friday morning. The hurricane's eye wall had been expected to pass well east of Florida late Saturday night or early Sunday morning, but it now appears likely to make landfall in North Carolina sometime between Monday afternoon and Tuesday.
 the National Weather Service said. "The center of Irene is forecast to move near or over portions of eastern Cuba tonight," according to a statement issued by the NWS at 8:00 AM ET on Saturday. The storm's maximum sustained winds have increased to near 125 mph, with higher gusts, and it has been upgraded from Category 2 to Category 3 hurricane.
The U.S. National Hurricane Center said in its latest advisory. Irene is expected to remain a powerful hurricane through tonight; some fluctuations in intensity are possible on Sunday as the center of the storm moves over eastern Cuba. On the forecast track, the core of Irene will move near or over portions of eastern Cuba today and approach the southeastern United States coast late Monday and Tuesday.
Posted by dave at 9:00 AM
251 tokens generated (48.56 token/s)

Maybe you can provide more details, ideally with a simple way to reproduce the isuse?

@darkSuperman
Copy link
Author

The output after I run it is as follows,Look at the end of the last line, the same token will be output until the length of sample_len reaches 2048, so I ended it early.:

avx: false, neon: false, simd128: false, f16c: false
temp: 0.60 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 17.502µs
loaded the model in 3.467558855s
starting the inference loop
This is a test.
 This is only a test.
If you were listening to the Emergency Broadcast System in 1963, this would have been your warning that an emergency was imminent and that it might be necessary for you to take shelter immediately.
The EBS was designed as part of President John F. Kennedy's civil defense program. The system was intended to provide a means by which the public could receive official information about air raid warnings or other emergencies.
The Emergency Broadcast System (EBS) is an emergency warning system that was used in the United States from 1963 until it was decommissioned on December 31, 2011.
During its operational life, the EBS consisted of a series of transmitters located throughout the country. These transmitters were connected to a central control center located at the Federal Emergency Management Agency (FEMA) headquarters in Washington, D.C.
When: When an emergency situation was detected or predicted by appropriate authorities, they would activate the EBS and send out a warning message over television and radio stations across the United States.
How to listen for warnings:
- Turn on your TV or radio
- Tune it to one of the designated Emergency Broadcast System (EBS) channels
- Listen carefully for any emergency messages that may be broadcasted. If you hear an EBS tone, this indicates that a warning message is about to follow.
What: The EBS was designed so that it could send out warnings over both television and radio stations across the United States as well as through cable TV systems throughout North America including Canada Mexico
The Emergency Broadcast System (EBS) was decommissioned on December 31st 2011. This means that there are no longer any designated EBS channels for broadcast of emergency messages.
What: The decommissioning of the EBS meant that:
- There were no longer any designated EBS channels for broadcast of emergency messages
- Emergency management agencies had to rely on other methods such as social media Twitter Facebook Instagram TikToktok YouTube Snapchat Reddit LinkedIn WeChatGPTelegrammGrammGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGram^C

This is the code copied from the glm4 example with almost no changes, nnly one line is added to print token information:

fn run(&mut self, sample_len: usize) -> anyhow::Result<()> {
        use std::io::BufRead;
        use std::io::BufReader;
        use std::io::Write;
        println!("starting the inference loop");
        let stdin = std::io::stdin();
        let reader = BufReader::new(stdin);
        for line in reader.lines() {
            let line = line.expect("Failed to read line");
            let tokens = self.tokenizer.encode(line, true).expect("tokens error");
            if tokens.is_empty() {
                panic!("Empty prompts are not supported in the chatglm model.")
            }
            if self.verbose_prompt {
                for (token, id) in tokens.get_tokens().iter().zip(tokens.get_ids().iter()) {
                    let token = token.replace('▁', " ").replace("<0x0A>", "\n");
                    println!("{id:7} -> '{token}'");
                }
            }
            let eos_token = match self.tokenizer.get_vocab(true).get("<|endoftext|>") {
                Some(token) => *token,
                None => panic!("cannot find the endoftext token"),
            };
            let mut tokens = tokens.get_ids().to_vec();
            let mut generated_tokens = 0usize;

            std::io::stdout().flush().expect("output flush error");
            let start_gen = std::time::Instant::now();

            let mut count = 0;
            let mut result = vec![];
            for index in 0..sample_len {
                count += 1;
                let context_size = if index > 0 { 1 } else { tokens.len() };
                let ctxt = &tokens[tokens.len().saturating_sub(context_size)..];
                let input = Tensor::new(ctxt, &self.device)?.unsqueeze(0)?;
                let logits = self.model.forward(&input)?;
                let logits = logits.squeeze(0)?.to_dtype(self.dtype)?;
                let logits = if self.repeat_penalty == 1. {
                    logits
                } else {
                    let start_at = tokens.len().saturating_sub(self.repeat_last_n);
                    candle_transformers::utils::apply_repeat_penalty(
                        &logits,
                        self.repeat_penalty,
                        &tokens[start_at..],
                    )?
                };

                let next_token = self.logits_processor.sample(&logits)?;
                tokens.push(next_token);
                generated_tokens += 1;
                if next_token == eos_token {
                    break;
                }
                let token: String = self
                    .tokenizer
                    .decode(&[next_token], true)
                    .expect("Token error");
                if self.verbose_prompt {
                    println!(
                        "[Count: {}] [Raw Token: {}] [Decode Token: {}]",
                        count, next_token, token
                    );
                }
                print!("{}", token); //Added this line to print
                result.push(token);
                std::io::stdout().flush()?;
            }
            let dt = start_gen.elapsed();
            println!(
                "\n{generated_tokens} tokens generated ({:.2} token/s)",
                generated_tokens as f64 / dt.as_secs_f64(),
            );
            println!("Result:");
            for tokens in result {
                print!("{tokens}");
            }
            self.model.reset_kv_cache(); // clean the cache
        }
        Ok(())
    }

This is the code that is loaded, with some changes:

pub fn main() -> anyhow::Result<()> {
    let args = Args::parse();
    println!(
        "avx: {}, neon: {}, simd128: {}, f16c: {}",
        candle_core::utils::with_avx(),
        candle_core::utils::with_neon(),
        candle_core::utils::with_simd128(),
        candle_core::utils::with_f16c()
    );
    println!(
        "temp: {:.2} repeat-penalty: {:.2} repeat-last-n: {}",
        args.temperature.unwrap_or(0.6),
        args.repeat_penalty,
        args.repeat_last_n
    );

    let start = std::time::Instant::now();

    let filenames: Vec<PathBuf> = (1..=10)
        .map(|i| {
            format!(
                "/home/paibo/neptune/huggingface/model-{0:05}-of-00010.safetensors",
                i
            )
        })
        .map(|path| Path::new(&path).to_path_buf())
        .collect();

    println!("retrieved the files in {:?}", start.elapsed());
    let tokenizer_filename =
        std::path::PathBuf::from("/home/paibo/neptune/huggingface/tokenizer.json");
    let tokenizer = Tokenizer::from_file(tokenizer_filename).expect("Tokenizer Error");

    let start = std::time::Instant::now();
    let config = Config::glm4();

    let device = candle_examples::device(false)?;
    let dtype = if device.is_cuda() {
        DType::BF16
    } else {
        DType::F32
    };
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, dtype, &device)? };
    let model = Model::new(&config, vb).unwrap();
    println!("loaded the model in {:?}", start.elapsed());

    let mut pipeline = TextGeneration::new(
        model,
        tokenizer,
        args.seed,
        args.temperature,
        args.top_p,
        args.repeat_penalty,
        args.repeat_last_n,
        args.verbose_prompt,
        &device,
        dtype,
    );
    pipeline.run(args.sample_len).unwrap();
    Ok(())
}

@LaurentMazare
Copy link
Collaborator

Could you try running the same code that I ran (so the glm4 example from the current github version) and see if it behaves differently compared to what I got?

cargo run --features cuda -r --example glm4 -- --prompt "This is a test. "

@darkSuperman
Copy link
Author

darkSuperman commented Jan 1, 2025

I pulled the version you submitted this time (#2694) and ran it, but there was still no output of eos_token, which caused tokens to be generated all the time.

(base) paibo@fz:~/neptune/github/candle$ cargo run --features cuda -r --example glm4 -- --prompt "This is a test. "
    Finished `release` profile [optimized] target(s) in 0.25s
     Running `target/release/examples/glm4 --prompt 'This is a test. '`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.60 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 8.037743ms
loaded the model in 3.516106828s
starting the inference loop
This is a test.  This is only a test.
: The Federal Emergency Management Agency has issued an alert that the U.S. East Coast could experience significant power outages and other problems if a major storm hits this weekend, according to reports from CNN.com
 of Atlanta Journal-Constitution. FEMA Administrator Craig Fugate said in a conference call with reporters on Thursday afternoon that Hurricane Irene is expected to be "a large and powerful hurricane" when it reaches the East Coast early Saturday morning.
, bringing heavy rains and winds up to 100 mph.The storm could cause flooding along coastal areas from North Carolina to New York City. The National Weather Service has issued a tropical storm warning for much of the coast, CNN reported. Fugate said that FEMA is working with state emergency management agencies in affected states to prepare for Irene's impact.
The agency will also be monitoring Hurricane Katia, which could hit Mexico and Texas this weekend, according to CNN.com.The Federal Emergency Management Agency has issued an alert that the U.S. East Coast could experience significant power outages and other problems if a major storm hits this weekend, according to reports from CNN.com of Atlanta Journal-Constitution.
The FEMA Administrator Craig Fugate said in a conference call with reporters on Thursday afternoon that Hurricane Irene is expected to be "a large and powerful hurricane" when it reaches the East Coast early Saturday morning. The National Weather Service has issued a tropical storm warning for much coast, CNN reported.The storm could cause flooding along coastal areas from North Carolina to New York City. Fugate said FEMA will also monitoring Hurricane Katia which could hit Mexico and Texas this weekend, according to CNN.com.
The Federal Emergency Management Agency has issued an alert that the U.S East Coast could experience significant power outages and other problems if a major storm hits this weekend, according reports from CNN of Atlanta Journal-Constitution.The Administrator Craig Fugate said in conference call with reporters on Thursday afternoon that Hurricane is expected be "a large powerful hurricane" when it reaches the coast early morning. The National Weather Service has issued tropical warning for much coast, CNN reported.ugate said FEMA will also monitoring Hurricane Katia which could hit Mexico and Texas this weekend, according to CNN.com.The Emergency Management Agency has issued an alert that the U.S East Coast could experience significant power outages and other problems if a major storm hits this weekend, reports from CNN of Atlanta Journal-ConstitutionThe Administrator Craig Fugate said in conference call with reporters on Thursday afternoon that Hurricane is expected be "a large powerful hurricane" when it reaches the coast early morning. The National Weather Service has issued tropical warning for much reported. Fugate will also monitoring Katia which could hit Mexico and Texas this weekend, according to CNN.com Emergency Management Agency has issued an alert that U.S East Coast could experience significant power outages other problems if major storm hits weekend, reports from Atlanta Journal-Constitution Administrator Craig Fugate said in conference call with reporters on Thursday afternoon that Hurricane is expected be "a large powerful hurricane" when it reaches the coast early morning. The National Weather has issued a tropical warning for much of CNN.com.FEMA will also monitoring Katia which could hit Mexico and Texas this weekend, according to CNN.com.
The Federal Emergency Management Agency has issued an alert that the U.S East Coast could experience significant power outages problems if major storm hits weekend, reports from Atlanta Journal-Constitution.The National Weather Service has issued a tropical warning for much of reported. Fugate said in conference call with reporters on Thursday afternoon that Hurricane is expected to be "a large and powerful hurricane" when it reaches the early morningThe could cause flooding along coastal areas Carolina New York City. The National will monitoring Katia, which Mexico Texas this weekend, according CNN.com.The Federal Emergency Management Agency has issued an alert that U.S East Coast experience significant power outages other problems if major storm hits weekend, reports from Atlanta Journal-Constitution Administrator Craig Fugate said in conference call with reporters on Thursday afternoon that Hurricane is expected to be "a large and powerful hurricane" when it reaches the early morning. The National Weather has issued a tropical warning for much of CNN.comThe FEMA will also monitoring Katia which could hit Mexico Texas this weekend, according to CNN.com.The Emergency Management Agency has issued an alert that U.S East Coast experience significant power outages problems if major storm hits weekend reports from Atlanta Journal-Constitution Administrator Craig Fugate said in conference call with reporters on Thursday afternoon that Hurricane is expected be "a large and powerful hurricane" when it reaches the early morning. The National Weather^C

I can only change sample_len to 512 so that it ends after reaching the length.

(base) paibo@fz:~/neptune/github/candle$ cargo run --features cuda -r --example glm4 -- --prompt "This is a test. "
   Compiling candle-examples v0.8.1 (/home/paibo/neptune/github/candle/candle-examples)
    Finished `release` profile [optimized] target(s) in 4.17s
     Running `target/release/examples/glm4 --prompt 'This is a test. '`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.60 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 8.008257ms
loaded the model in 2.968212178s
starting the inference loop
This is a test.  This is only a test.
: The Federal Emergency Management Agency has issued an alert that the U.S. East Coast could experience significant power outages and other problems if a major storm hits this weekend, according to reports from CNN.com
 of Atlanta Journal-Constitution. FEMA Administrator Craig Fugate said in a conference call with reporters on Thursday afternoon that Hurricane Irene is expected to be "a large and powerful hurricane" when it reaches the East Coast early Saturday morning.
, bringing heavy rains and winds up to 100 mph.The storm could cause flooding along coastal areas from North Carolina to New York City. The National Weather Service has issued a tropical storm warning for much of the coast, CNN reported. Fugate said that FEMA is working with state emergency management agencies in affected states to prepare for Irene's impact.
The agency will also be monitoring Hurricane Katia, which could hit Mexico and Texas this weekend, according to CNN.com.The Federal Emergency Management Agency has issued an alert that the U.S. East Coast could experience significant power outages and other problems if a major storm hits this weekend, according to reports from CNN.com of Atlanta Journal-Constitution.
The FEMA Administrator Craig Fugate said in a conference call with reporters on Thursday afternoon that Hurricane Irene is expected to be "a large and powerful hurricane" when it reaches the East Coast early Saturday morning. The National Weather Service has issued a tropical storm warning for much coast, CNN reported.The storm could cause flooding along coastal areas from North Carolina to New York City. Fugate said FEMA will also monitoring Hurricane Katia which could hit Mexico and Texas this weekend, according to CNN.com.
The Federal Emergency Management Agency has issued an alert that the U.S East Coast could experience significant power outages and other problems if a major storm hits this weekend, according reports from CNN of Atlanta Journal-Constitution.The Administrator Craig Fugate said in conference call with reporters on Thursday afternoon that Hurricane is expected be "a large powerful hurricane" when it reaches the coast early morning. The National Weather Service has issued tropical warning for much coast, CNN reported.ugate said FEMA will also monitoring Hurricane Katia which could hit Mexico and Texas this weekend, according to CNN.com.The Emergency Management Agency has issued an alert that the U.S East Coast could experience significant power outages and other problems if a major storm hits this weekend, reports from CNN of Atlanta Journal-ConstitutionThe Administrator Craig Fugate said in conference call with reporters on Thursday afternoon that Hurricane is expected be "a large powerful hurricane" when it reaches the coast early
512 tokens generated (34.06 token/s)

(base) paibo@fz:~/neptune/github/candle$ cargo run --features cuda -r --example glm4 -- --prompt "Nice to meet you!"
    Finished `release` profile [optimized] target(s) in 0.25s
     Running `target/release/examples/glm4 --prompt 'Nice to meet you'\!''`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.60 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 8.037135ms
loaded the model in 2.976652797s
starting the inference loop
Nice to meet you! I'm a 3D artist and designer from the UK. I've been working in games for over ten years, but have also worked on films, TV shows, toys and more.
 of late!
1
0
2-4
a5b6c7d8e9f10g11h12i13j14k15l16m17n18o19p20q21r22s23t24u25v26w27x28y29z30{31}32[33]34^35_36`37a38b39c40d41e42f43g44h45i46j47k48l49m50n51o52p53q54r55s56t57u58v59w60x61y62z63{64}65[66]67^68_69`70a71b72c73d74e75f76g77h78i79j80k81l82m83n84o85p86q87r88s89t90u91v92w93x94y95z96{97}98[99]100^101_102`103a104b105c106d107e108f109g110h111i112j113k115l117m119n120o121p123q124r126s127t129u131v133w135x137y139z141{142}143[144]145^146_147`149a150b151c153d155e156f158g160h161i163j165k167l169m171n174o176p177q179r181s183t185u189v191w192x195y199z201{202}204[205]207^209_211`213a215b217c221d220e222f224243g225h228i230j232k236l238m240n245o247p250q253r255s261t27u28v29w30x32y33z35{36}38[39]41^43_45`47a49b51c53d55e57f61g63h69i71j73k77l81m83n89o91p93q95r97s101t105u109v111w113x117y121z123{125}127[
512 tokens generated (33.98 token/s)

Sometimes it can end on its own

(base) paibo@fz:~/neptune/github/candle$ cargo run --features cuda -r --example glm4 -- --prompt "Nice to meet you"
    Finished `release` profile [optimized] target(s) in 0.25s
     Running `target/release/examples/glm4 --prompt 'Nice to meet you'`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.60 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 8.034803ms
loaded the model in 2.977758164s
starting the inference loop
Nice to meet you, I'm a bot
: "Hi! My name is ChatGPT. How can I assist you today?"
ChatGPT:
Hello there! It's great to meet you too. If you have any questions or need help with anything, feel free to ask me. I'll do my best to provide the information and assistance you're looking for.
 in a friendly manner.
78 tokens generated (35.87 token/s)

Also, these answers are terrible, and are completely different from the glm4 I deployed using python transformers. Look at the output of my python transformers:

You: This is a test.
GLM-4:
It looks like you've just sent a simple message as part of a test or to check the functionality of a system. If there's anything specific you'd like to know or do next, please let me know!
You: Nice to meet you!
GLM-4:
Nice to meet you too! Is there something in particular I can assist you with?
You: hello
GLM-4:
Hello 👋! How can I help you today?

@LaurentMazare
Copy link
Collaborator

Ok so it seems that it produces the eos token from time to time.
Re performance, I guess this is because glm expects a certain template for its prompt so you will have to use the same format in your --prompt argument. The best is probably to get the python code to print the string that actually gets passed to the model so as to use the same one.

@darkSuperman
Copy link
Author

OK, thanks. I'll check these first, and if I find that it's a candle example problem, I'll submit an issue or a fixed PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants