Very slow training speed. Is this expected for my system setup? #34
The AlexNet v2 model is relatively small; it runs at about 0.2 sec/step on an NVIDIA GTX TITAN X, so I have never seen training progress this slow. Did you run any other GPU programs concurrently? Maybe a restart will help. You can also change the summary interval to a very large number (e.g., 999999) to effectively disable summary writing; that may help. ctw-baseline/classification/train.py, line 30 in 081c836
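For reference, if the training loop is TF-Slim based (an assumption here; only line 30 of train.py is referenced above, and the code itself is not shown), raising the summary interval looks roughly like the sketch below. The argument names are from the TF 1.x slim API, and the 999999 value mirrors the suggestion in the comment.

```python
# Minimal sketch (not the repo's actual code) of pushing the summary interval
# up so that summary writing stops adding overhead to each step.
# Assumes a TF-Slim style training loop under TensorFlow 1.x.
import tensorflow as tf

slim = tf.contrib.slim

def train(train_op, log_dir):
    slim.learning.train(
        train_op,
        logdir=log_dir,
        log_every_n_steps=100,
        save_summaries_secs=999999,  # effectively disables periodic summaries
        save_interval_secs=600)      # still write a checkpoint every 10 minutes
```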
In my experience, the warning
Thanks for the response! I don't have any other GPU programs running concurrently. What would be the best way to check if the GPU is actually being used?
On Linux, I run nvidia-smi. Your GPU-Util should be very low: for each step the GPU can do the computation in 0.2 sec, but the step takes 30 sec, so for about 29 sec you will see GPU-Util at 0, and only in the remaining ~1 sec will it be non-zero. It seems most of the time is spent preparing data and transferring it to GPU memory.
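Since the diagnosis above is that the GPU idles while each batch is prepared, a common mitigation is to decode images on several CPU threads and prefetch batches so the next one is ready the moment the GPU finishes a step. The sketch below is not the project's actual loader; it assumes TFRecord input and uses a hypothetical parse_example function (TF 1.x tf.data API).

```python
# Sketch of an input pipeline that overlaps data preparation with GPU compute.
# `parse_example` is a hypothetical parser; the project's real loader may differ.
import tensorflow as tf

def parse_example(serialized):
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features['image'], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, features['label']

def input_fn(tfrecord_path, batch_size=64):
    dataset = (tf.data.TFRecordDataset(tfrecord_path)
               .map(parse_example, num_parallel_calls=8)  # decode on several CPU threads
               .batch(batch_size)
               .prefetch(2))  # keep batches ready while the GPU is busy
    return dataset.make_one_shot_iterator().get_next()
```

Prefetching one or two batches is usually enough; larger buffers mainly add memory pressure without improving throughput.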
I am currently training a model using the Chinese Text in the Wild (CTW) image data. My system setup is as follows:
The speed is shown below: each step takes close to 30 seconds. The training has been running for 2 days and has only completed 5410 steps so far. It seems like the GPU is being utilized: 96% of the GPU memory is in use. The CPU also shows quite a bit of activity, e.g., about 40% by the Python process in which the training is running.
Also, when I started training, I got the message
failed to allocate 15.90G (17071144960 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Not sure if this is related. So my question is whether the speed I am observing is normal for this kind of computer setup, and how I might improve it. Thanks!
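As a side note on that allocation message: TensorFlow's default allocator tries to reserve nearly all GPU memory at startup and falls back to a smaller block when the full reservation fails, so the warning is usually harmless and separate from the speed problem. Below is a minimal sketch of configuring the allocator, assuming the script creates a plain tf.Session (the project may set this up differently).

```python
# Sketch: let TensorFlow grab GPU memory incrementally instead of trying to
# reserve the whole ~16 GB up front, which is what triggers the
# "failed to allocate 15.90G" message. Assumes TF 1.x and a plain tf.Session.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate memory only as needed
# or: config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    pass  # build and run the training graph here
```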