
(Prefill) KV Cache Indexing error if started multiple TGI servers concurrently #2675

Open

nathan-az opened this issue Oct 21, 2024 · 0 comments


System Info

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.0
Commit sha: a094729386b5689aabfba40b7fdb207142dec8d5
Docker label: sha-a094729
nvidia-smi:
Mon Oct 21 10:38:14 2024       
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA H100 80GB HBM3          On  |   00000000:0F:00.0 Off |                    0 |
   | N/A   37C    P0            132W /  700W |       0MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
   |   1  NVIDIA H100 80GB HBM3          On  |   00000000:2D:00.0 Off |                    0 |
   | N/A   43C    P0            105W /  700W |       0MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
   |   2  NVIDIA H100 80GB HBM3          On  |   00000000:44:00.0 Off |                    0 |
   | N/A   33C    P0             97W /  700W |       0MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
   |   3  NVIDIA H100 80GB HBM3          On  |   00000000:5B:00.0 Off |                    0 |
   | N/A   37C    P0            102W /  700W |       0MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
   |   4  NVIDIA H100 80GB HBM3          On  |   00000000:89:00.0 Off |                    0 |
   | N/A   32C    P0             97W /  700W |       0MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
   |   5  NVIDIA H100 80GB HBM3          On  |   00000000:A8:00.0 Off |                    0 |
   | N/A   36C    P0            100W /  700W |       0MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
   |   6  NVIDIA H100 80GB HBM3          On  |   00000000:C0:00.0 Off |                    0 |
   | N/A   36C    P0            101W /  700W |       0MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
   |   7  NVIDIA H100 80GB HBM3          On  |   00000000:D8:00.0 Off |                    0 |
   | N/A   33C    P0             95W /  700W |       0MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
                                                                                            
   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   |  No running processes found                                                             |
   +-----------------------------------------------------------------------------------------+
xpu-smi:
N/A

The node is on a Kubernetes cluster via a managed service.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I am starting multiple inference servers in the main container with text-generation-launcher. The motivation is that throughput is significantly higher with data parallelism than with tensor parallelism.

I am testing by sending round-robin async requests to each server, saturating the compute, roughly as sketched below.
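A simplified sketch of the request loop (the ports, prompts, and generation parameters here are placeholders; the real code is more involved):

    import asyncio
    import itertools

    import httpx

    # Hypothetical ports, one per data-parallel server.
    PORTS = [8080, 8081, 8082, 8083]


    async def generate(client: httpx.AsyncClient, port: int, prompt: str) -> str:
        # TGI's /generate endpoint; payload trimmed to the essentials.
        resp = await client.post(
            f"http://localhost:{port}/generate",
            json={"inputs": prompt, "parameters": {"max_new_tokens": 128}},
            timeout=120.0,
        )
        resp.raise_for_status()
        return resp.json()["generated_text"]


    async def main(prompts: list[str]) -> list[str]:
        # Round-robin prompts across the servers and fire all requests concurrently.
        async with httpx.AsyncClient() as client:
            tasks = [
                generate(client, port, prompt)
                for prompt, port in zip(prompts, itertools.cycle(PORTS))
            ]
            return await asyncio.gather(*tasks)


    if __name__ == "__main__":
        print(asyncio.run(main(["Hello"] * 16)))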

If I start the servers sequentially, waiting for a 200 from /health before launching the next, there are no issues making requests.

If I start the servers concurrently to decrease spin-up time (note the model is already cached), I get errors during inference regarding KV cache indexing.
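The startup logic is roughly as follows (the model ID, ports, and GPU pinning are placeholders; the real code is more involved):

    import os
    import subprocess
    import time

    import httpx

    MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
    PORTS = [8080, 8081, 8082, 8083]


    def launch(gpu: int, port: int) -> subprocess.Popen:
        # One launcher per GPU; each server is pinned to a single device.
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
        return subprocess.Popen(
            ["text-generation-launcher", "--model-id", MODEL_ID, "--port", str(port)],
            env=env,
        )


    def wait_healthy(port: int, timeout: float = 300.0) -> None:
        # Poll /health until the server answers with a 200.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                if httpx.get(f"http://localhost:{port}/health").status_code == 200:
                    return
            except httpx.TransportError:
                pass
            time.sleep(2)
        raise TimeoutError(f"server on port {port} never became healthy")


    # Sequential startup: no issues.
    # for gpu, port in enumerate(PORTS):
    #     launch(gpu, port)
    #     wait_healthy(port)

    # Concurrent startup: triggers the prefill IndexError under load.
    procs = [launch(gpu, port) for gpu, port in enumerate(PORTS)]
    for port in PORTS:
        wait_healthy(port)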

The error is shared below. If the full code for server spin-up or making requests is required, I am happy to share it; it's just quite verbose.

2024-10-21T10:21:15.831487Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 109, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 280, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 153, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1602, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1505, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 651, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 595, in forward
    kv_cache[i],
IndexError: list index out of range

Note that I am on a cluster that already runs a Docker container without DinD enabled, so I can't test whether this behaviour also occurs when starting multiple containers.

Expected behavior

This is not a common use case, so I understand there may not be plans to fix this. Still, I expected no difference in behaviour between starting the servers concurrently versus sequentially. The main motivation behind concurrent server creation is simply to cut down on startup time, since each server takes about a minute to start.

As an aside, native handling of data-parallel inference (perhaps integrating a basic load balancer into the server) would be great, but again, I understand this is probably not a priority for the project.
