System Info
text-generation-inference v2.2.0
Information
Tasks
Reproduction
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Exception ignored in: <function Server.__del__ at 0xXXXXXXXXX>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 194, in del
cygrpc.schedule_coro_threadsafe(
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
self._check_closed()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
Error: ShardFailed
future: <Task finished name='HandleExceptions[/generate.v2.TextGenerationService/Prefill]' coro=<()> exception=SystemExit(1)>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 142, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1141, in generate_token
prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.50 GiB. GPU has a total capacity of 79.15 GiB of which 6.94 GiB is free. Process 63385 has 72.21 GiB memory in use. Of the allocated memory 69.81 GiB is allocated by PyTorch, and 309.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Expected behavior
In practice, the shard fails and the whole service reboots, so all queued and in-flight requests fail with:
openai.InternalServerError: upstream
connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
To avoid this, it would be really good to estimate how much memory a request will need before admitting it. That estimate could be used as a criterion to decide whether the request goes on the queue at all, instead of processing a request that we already know will blow up the system. A rough sketch of the idea follows.
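A minimal sketch of what such an admission check could look like, assuming the scheduler can query free VRAM via torch.cuda.mem_get_info and using a hypothetical estimated_prefill_bytes heuristic. This is not TGI's actual router/scheduler API, only an illustration of the criterion:

```python
# Hypothetical admission check: refuse (or keep queued) a prefill request whose
# estimated memory footprint does not fit in the currently free VRAM.
# Illustration only -- TGI's real router/scheduler does not expose this interface.
import torch


def estimated_prefill_bytes(num_tokens: int, vocab_size: int, dtype_bytes: int = 2) -> int:
    # Rough upper bound: the prefill logprobs tensor alone holds
    # num_tokens * vocab_size entries (this is the allocation that OOMs above),
    # multiplied by a safety factor for activations and the log_softmax temporary.
    logprobs_bytes = num_tokens * vocab_size * dtype_bytes
    return 3 * logprobs_bytes  # safety factor, tune empirically


def fits_in_free_vram(num_tokens: int, vocab_size: int, headroom: float = 0.9) -> bool:
    # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the current device.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return estimated_prefill_bytes(num_tokens, vocab_size) < headroom * free_bytes


# Example: a ~32k-token prompt against a ~128k-entry vocabulary, which roughly
# matches the 7.50 GiB log_softmax allocation in the traceback above.
if not fits_in_free_vram(num_tokens=32_000, vocab_size=128_000):
    raise RuntimeError("Request rejected: expected to exceed available GPU memory")
```

The 3x safety factor is arbitrary; the point is only that the failing allocation can be bounded ahead of time from the prompt length and vocabulary size.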
As a workaround, I currently do a manual pre-check on the payload length of every request I send, so that requests I know are too large never reach the server (a client-side sketch follows).
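For reference, a minimal sketch of that client-side pre-check, assuming the served model's tokenizer is available locally; MAX_PROMPT_TOKENS, the base_url, and MODEL_ID are placeholders for this illustration, not values exposed by TGI:

```python
# Client-side guard: count prompt tokens locally before submitting a request,
# so oversized payloads never reach the server. MAX_PROMPT_TOKENS and MODEL_ID
# are assumed placeholders; adjust them to your deployment.
from openai import OpenAI
from transformers import AutoTokenizer

MAX_PROMPT_TOKENS = 16_000                          # assumed budget for this example
MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"   # placeholder: the model actually served

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")  # OpenAI-compatible endpoint


def safe_chat(prompt: str):
    # Reject the request locally if the prompt exceeds the token budget.
    n_tokens = len(tokenizer.encode(prompt))
    if n_tokens > MAX_PROMPT_TOKENS:
        raise ValueError(f"Prompt too long: {n_tokens} tokens > {MAX_PROMPT_TOKENS}")
    return client.chat.completions.create(
        model="tgi",
        messages=[{"role": "user", "content": prompt}],
    )
```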