Slow GPU training Laptop 4060 8 Gb VRAM #703

Open
kivrus opened this issue Jan 18, 2025 · 1 comment
Comments


kivrus commented Jan 18, 2025

@lpscr thanks a lot!
I managed to run it on my Lenovo Legion laptop with a 4060 (8 GB VRAM).

Context: I'm very new to this whole thing.

But now I have issues with the training process. It is very slow for a GPU... the best I managed was about 1 epoch every 10 minutes.
I'll give more details below.

My commands:

For preprocessing (yes, I'm using the ru language):
python3 -m piper_train.preprocess --language ru --input-dir ~/piper/my-dataset --output-dir ~/piper/my-training --dataset-format ljspeech --single-speaker --sample-rate 22050 --max-workers 1 --debug
I checked the .jsonl file in my-training and it looked fine afaik. It had the text and links to the .wav files.
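For reference, a quick way to spot-check the preprocessed data from the shell (assuming the output file is named dataset.jsonl, which is what a default piper_train.preprocess run produces as far as I know):

head -n 2 ~/piper/my-training/dataset.jsonl    # look at the first couple of entries
wc -l ~/piper/my-training/dataset.jsonl        # count how many utterances survived preprocessing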

For training:

python3 -m piper_train \
  --dataset-dir ~/piper/my-training \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 2200 \
  --resume_from_checkpoint ~/piper/epoch=2164-step=1355540.ckpt \
  --accelerator 'gpu' \
  --checkpoint-epochs 1 \
  --precision 16
The GPU is at 100% utilization, confirmed with several tools, and VRAM usage always sits around 7.3-7.8 GB.
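One such tool, for anyone who wants to reproduce the measurement (nvidia-smi ships with the NVIDIA driver):

watch -n 1 nvidia-smi    # refreshes GPU utilization and VRAM usage every second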
I've tried different datasets: 10K .wav files, 800 .wav files... They are really short, good studio quality, 6-7 words at most. The total duration of the 10K set is about 8-9 hours. The result is the same.
I tested different batch sizes, all the way from 128 down to 8. It runs out of memory with anything higher than 32, so I stopped at 32. The logs said:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 8.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 14.07 GiB is allocated by PyTorch, and 493.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I think I should be able to run training much more smoothly. But the only setup that still works is very slow, on both GPU and CPU...
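As an aside, the OOM message above suggests its own mitigation for fragmentation; following that hint would mean setting the allocator option before launching training, roughly like this (untested here):

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True    # per the hint in the CUDA OOM message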

P.S. When I followed the original guide, I could only run training on the CPU, but at least I could see progress: timers, epochs, etc.
With this fix from @lpscr, I don't see anything. I've installed TensorBoard, but it shows nothing; the charts are empty (I used the latest "version" directory in lightning_logs, of course).
That's the only output I get:
DEBUG:fsspec.local:open file: /home/kosov/piper/my-training/lightning_logs/version_39/hparams.yaml
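For anyone hitting the same empty charts, one thing worth trying is pointing TensorBoard at the whole lightning_logs directory (the path here is taken from the DEBUG line above; adjust it to your own setup):

tensorboard --logdir ~/piper/my-training/lightning_logs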


kivrus commented Jan 18, 2025

I lowered the batch size:
--batch-size 16

and added a phoneme limit:
--max-phoneme-ids 300

Now my speed is ~140 epochs/hour.
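For completeness, the full training invocation with those two changes folded in would look something like this (all other flags exactly as in the original command above):

python3 -m piper_train \
  --dataset-dir ~/piper/my-training \
  --devices 1 \
  --batch-size 16 \
  --max-phoneme-ids 300 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 2200 \
  --resume_from_checkpoint ~/piper/epoch=2164-step=1355540.ckpt \
  --accelerator 'gpu' \
  --checkpoint-epochs 1 \
  --precision 16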
