System Info
The node is on a Kubernetes cluster via a managed service.
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
I am starting multiple inference servers on the main container with text-generation-launcher. The motivation is that throughput is significantly higher with data parallelism than tensor parallelism.
I am testing by sending round-robin async requests to each server, saturating the compute.
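Roughly, the request side is a round-robin fan-out over the servers. A trimmed-down sketch (aiohttp assumed; the ports and payload are placeholders, /generate is TGI's standard endpoint):

import asyncio
import itertools

import aiohttp

# Placeholder ports; one server per GPU in the real setup.
SERVERS = [f"http://localhost:{port}" for port in (8080, 8081, 8082, 8083)]

async def generate(session, base_url, prompt):
    # TGI's /generate endpoint; generation parameters trimmed for brevity.
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 128}}
    async with session.post(f"{base_url}/generate", json=payload) as resp:
        resp.raise_for_status()
        return await resp.json()

async def main(prompts):
    servers = itertools.cycle(SERVERS)
    async with aiohttp.ClientSession() as session:
        tasks = [generate(session, next(servers), p) for p in prompts]
        return await asyncio.gather(*tasks)

results = asyncio.run(main(["Hello"] * 64))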
If I start the servers synchronously, waiting for a 200 from /health before launching the next, there are no issues making requests.
If I start the servers asynchronously to cut down spin-up time (the model is already cached), I get errors during inference related to KV cache indexing.
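A trimmed-down sketch of the two spin-up modes (the model id, ports, and GPU assignment are placeholders; the real script also sets a few per-instance launcher flags that are omitted here):

import os
import subprocess
import time

import requests

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; the model is already cached
PORTS = (8080, 8081, 8082, 8083)

def launch(port, gpu):
    # One launcher per GPU, each on its own port.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    return subprocess.Popen(
        ["text-generation-launcher", "--model-id", MODEL_ID, "--port", str(port)],
        env=env,
    )

def wait_healthy(port, timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/health").status_code == 200:
                return
        except requests.ConnectionError:
            pass
        time.sleep(2)
    raise TimeoutError(f"server on port {port} never became healthy")

# Synchronous spin-up: launch, wait for /health, then launch the next. No issues.
# for gpu, port in enumerate(PORTS):
#     launch(port, gpu)
#     wait_healthy(port)

# Asynchronous spin-up: launch everything first, then wait on /health.
# This is the variant that later hits the KV cache IndexError during inference.
procs = [launch(port, gpu) for gpu, port in enumerate(PORTS)]
for port in PORTS:
    wait_healthy(port)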
The error is shared below. I'm happy to share the full spin-up and request code if needed; it's just quite verbose.
IndexError: list index out of range
2024-10-21T10:21:15.831487Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 109, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 280, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 153, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1602, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1505, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 651, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 595, in forward
kv_cache[i],
IndexError: list index out of range
Note that I am on a cluster that already runs a Docker container without DinD enabled, so I can't test whether this behaviour also occurs when starting multiple containers.
Expected behavior
This is not a common use case, so I understand there may not be plans to fix this. Still, I expected no difference in behaviour between starting the servers concurrently versus sequentially. The main motivation behind the concurrent server creation is simply to cut down on startup time, since each server takes about a minute to start.
As an aside, native handling of data-parallel inference (perhaps integrating a basic load balancer into the server) would be great, but again, I understand this is probably not a priority for the project.