openai API `max_completion_tokens` argument is ignored #1907

BenjaminMarechalEVITECH · 2025-01-24T20:34:02Z

Prerequisites

Please answer the following questions for yourself before submitting an issue.

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

I'm running the llama_cpp server with the following command:

python3 -m llama_cpp.server --model models/mys/ggml_llava-v1.5-13b/ggml-model-q4_k.gguf --clip_model_path models/mys/ggml_llava-v1.5-13b/mmproj-model-f16.gguf --model_alias llava-v1.5-13b-q4_k --chat_format llava-1-5 --port 10322

(models downloaded from https://huggingface.co/mys/ggml_llava-v1.5-13b/tree/main)

When I call the server using the openai Python package:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10322/v1", # "http://<Your api-server IP>:port"
    api_key = "sk-no-key-required"
)

chat_completion = client.chat.completions.create(
    model="models/mys/ggml_llava-v1.5-13b/ggml-model-q4_k.gguf",
    messages=[
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ],
    max_tokens=3,
)
print(chat_completion.usage.completion_tokens)  # prints 3, as expected.
print(chat_completion.choices[0].finish_reason)  # prints "length", as expected.

chat_completion = client.chat.completions.create(
    model="models/mys/ggml_llava-v1.5-13b/ggml-model-q4_k.gguf",
    messages=[
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ],
    max_completion_tokens=3,
)
print(chat_completion.usage.completion_tokens)  # prints much more than 3 (the complete answer).
print(chat_completion.choices[0].finish_reason)  # prints "stop".

According to the OpenAI API reference, the max_completion_tokens argument replaces the deprecated max_tokens argument.
It seems that the server honors only max_tokens and ignores max_completion_tokens.
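
Until the server handles the new argument, a client-side workaround might look like the sketch below. It assumes the server still honors max_tokens (as observed above). The helper name `with_legacy_token_limit` is hypothetical and is not part of openai or llama-cpp-python; it only rewrites the request keyword arguments before they are sent.

```python
from typing import Any, Dict


def with_legacy_token_limit(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params with max_completion_tokens moved to max_tokens.

    Hypothetical helper for illustration only; not part of any library.
    """
    out = dict(params)
    limit = out.pop("max_completion_tokens", None)
    # Keep an explicit max_tokens if the caller already set one.
    if limit is not None and out.get("max_tokens") is None:
        out["max_tokens"] = limit
    return out


request = with_legacy_token_limit({
    "model": "models/mys/ggml_llava-v1.5-13b/ggml-model-q4_k.gguf",
    "messages": [
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ],
    "max_completion_tokens": 3,
})
print(request["max_tokens"])
```

The resulting dictionary could then be passed as `client.chat.completions.create(**request)`, so the limit reaches the server under the name it currently respects.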

Environment and Context

llama_cpp installed with pip install llama-cpp-python[server]
print(llama_cpp.__version__): 0.3.6
print(openai.__version__): 1.59.7
