System Info
text-generation-inference v2.2.0
Information
Tasks
Reproduction
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Exception ignored in: <function Server.__del__ at 0xXXXXXXXXX>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 194, in del
cygrpc.schedule_coro_threadsafe(
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
self._check_closed()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
Error: ShardFailed
future: <Task finished name='HandleExceptions[/generate.v2.TextGenerationService/Prefill]' coro=<()> exception=SystemExit(1)>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 142, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1141, in generate_token
prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.50 GiB. GPU has a total capacity of 79.15 GiB of which 6.94 GiB is free. Process 63385 has 72.21 GiB memory in use. Of the allocated memory 69.81 GiB is allocated by PyTorch, and 309.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Expected behavior
In practice, the shard fails and the whole service reboots, so all queued and in-flight requests fail with:
openai.InternalServerError: upstream
connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
To avoid this, it would be really good to estimate how much memory a request will need before admitting it. That estimate could be used as a criterion to decide whether the request goes on the queue at all, instead of processing a request that we already know will blow up the system. A rough sketch of the idea follows.
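A minimal sketch of what such an admission check could look like, assuming the scheduler can query free VRAM via torch.cuda.mem_get_info and using a hypothetical estimated_prefill_bytes heuristic. This is not TGI's actual router/scheduler API, only an illustration of the criterion:

```python
# Hypothetical admission check: refuse (or keep queued) a prefill request whose
# estimated memory footprint does not fit in the currently free VRAM.
# Illustration only -- TGI's real router/scheduler does not expose this interface.
import torch


def estimated_prefill_bytes(num_tokens: int, vocab_size: int, dtype_bytes: int = 2) -> int:
    # Rough upper bound: the prefill logprobs tensor alone holds
    # num_tokens * vocab_size entries (this is the allocation that OOMs above),
    # multiplied by a safety factor for activations and the log_softmax temporary.
    logprobs_bytes = num_tokens * vocab_size * dtype_bytes
    return 3 * logprobs_bytes  # safety factor, tune empirically


def fits_in_free_vram(num_tokens: int, vocab_size: int, headroom: float = 0.9) -> bool:
    # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the current device.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return estimated_prefill_bytes(num_tokens, vocab_size) < headroom * free_bytes


# Example: a ~32k-token prompt against a ~128k-entry vocabulary, which roughly
# matches the 7.50 GiB log_softmax allocation in the traceback above.
if not fits_in_free_vram(num_tokens=32_000, vocab_size=128_000):
    raise RuntimeError("Request rejected: expected to exceed available GPU memory")
```

The 3x safety factor is arbitrary; the point is only that the failing allocation can be bounded ahead of time from the prompt length and vocabulary size.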
As a workaround, I currently do a manual pre-check on the payload length of every request I send, so that requests I know are too large never reach the server (a client-side sketch follows).
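For reference, a minimal sketch of that client-side pre-check, assuming the served model's tokenizer is available locally; MAX_PROMPT_TOKENS, the base_url, and MODEL_ID are placeholders for this illustration, not values exposed by TGI:

```python
# Client-side guard: count prompt tokens locally before submitting a request,
# so oversized payloads never reach the server. MAX_PROMPT_TOKENS and MODEL_ID
# are assumed placeholders; adjust them to your deployment.
from openai import OpenAI
from transformers import AutoTokenizer

MAX_PROMPT_TOKENS = 16_000                          # assumed budget for this example
MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"   # placeholder: the model actually served

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")  # OpenAI-compatible endpoint


def safe_chat(prompt: str):
    # Reject the request locally if the prompt exceeds the token budget.
    n_tokens = len(tokenizer.encode(prompt))
    if n_tokens > MAX_PROMPT_TOKENS:
        raise ValueError(f"Prompt too long: {n_tokens} tokens > {MAX_PROMPT_TOKENS}")
    return client.chat.completions.create(
        model="tgi",
        messages=[{"role": "user", "content": prompt}],
    )
```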