
Resource exhausted: OOM when allocating tensor with shape[144,12,20,2048] #179

Open
coderclear opened this issue Apr 26, 2018 · 1 comment

coderclear commented Apr 26, 2018

I can only train with batch_size=1; with batch_size > 1, the error below occurs.
With batch_size=1 and all other parameters left at their defaults, the loss does not go down. Here is the loss log:
step 19960 loss = 1.322, (0.337 sec/step)
step 19961 loss = 1.341, (0.336 sec/step)
step 19962 loss = 1.302, (0.336 sec/step)
step 19963 loss = 1.324, (0.337 sec/step)
step 19964 loss = 1.317, (0.335 sec/step)
step 19965 loss = 1.298, (0.337 sec/step)
step 19966 loss = 1.319, (0.336 sec/step)
step 19967 loss = 1.304, (0.335 sec/step)
step 19968 loss = 1.294, (0.336 sec/step)
step 19969 loss = 1.305, (0.336 sec/step)
step 19970 loss = 1.347, (0.335 sec/step)
step 19971 loss = 1.314, (0.337 sec/step)
step 19972 loss = 1.304, (0.337 sec/step)
step 19973 loss = 1.310, (0.336 sec/step)
step 19974 loss = 1.301, (0.336 sec/step)
step 19975 loss = 1.301, (0.337 sec/step)
step 19976 loss = 1.387, (0.336 sec/step)
step 19977 loss = 1.320, (0.335 sec/step)
step 19978 loss = 1.305, (0.336 sec/step)
step 19979 loss = 1.309, (0.336 sec/step)
step 19980 loss = 1.302, (0.336 sec/step)
step 19981 loss = 1.304, (0.335 sec/step)
step 19982 loss = 1.325, (0.337 sec/step)
step 19983 loss = 1.321, (0.336 sec/step)
step 19984 loss = 1.316, (0.336 sec/step)
step 19985 loss = 1.332, (0.337 sec/step)
step 19986 loss = 1.299, (0.336 sec/step)
step 19987 loss = 1.312, (0.336 sec/step)
step 19988 loss = 1.290, (0.335 sec/step)
step 19989 loss = 1.323, (0.337 sec/step)
step 19990 loss = 1.318, (0.336 sec/step)
step 19991 loss = 1.307, (0.336 sec/step)
step 19992 loss = 1.364, (0.336 sec/step)
step 19993 loss = 1.324, (0.335 sec/step)
step 19994 loss = 1.314, (0.335 sec/step)
step 19995 loss = 1.301, (0.336 sec/step)
step 19996 loss = 1.291, (0.336 sec/step)
step 19997 loss = 1.317, (0.338 sec/step)
step 19998 loss = 1.322, (0.337 sec/step)
step 19999 loss = 1.293, (0.335 sec/step)
The checkpoint has been created.
step 20000 loss = 1.320, (11.272 sec/step)
With batch_size > 1, the error below is raised (a sketch of one possible mitigation follows the traceback):

2018-04-26 13:43:56.846459: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:277] *************************************************************************************************___
2018-04-26 13:43:56.849249: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[144,12,20,2048]
Traceback (most recent call last):
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1327, in _do_call
return fn(*args)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1306, in _run_fn
status, run_metadata)
File "F:\soft\anaconda\lib\contextlib.py", line 66, in exit
next(self.gen)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 272, in
main()
File "train.py", line 252, in main
loss_value, images, labels, preds, summary, _ = sess.run([reduced_loss, image_batch, label_batch, pred, total_summary, train_op], feed_dict=feed_dict)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
options, run_metadata)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'fc1_voc12_c3/convolution/SpaceToBatchND', defined at:
File "train.py", line 272, in
main()
File "train.py", line 146, in main
net = DeepLabResNetModel({'data': image_batch}, is_training=args.is_training, num_classes=args.num_classes)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 48, in init
self.setup(is_training, num_classes)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\deeplab_resnet\model.py", line 411, in setup
.atrous_conv(3, 3, num_classes, 24, padding='SAME', relu=False, name='fc1_voc12_c3'))
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 22, in layer_decorated
layer_output = op(self, layer_input, *args, **kwargs)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 173, in atrous_conv
output = convolve(input, kernel)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 168, in
convolve = lambda i, k: tf.nn.atrous_conv2d(i, k, dilation, padding=padding)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 974, in atrous_conv2d
name=name)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 672, in convolution
op=op)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 451, in with_space_to_batch
paddings=paddings)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 3359, in space_to_batch_nd
name=name)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2628, in create_op
original_op=self._default_original_op, op_def=op_def)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
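
For anyone hitting the same error: the first dimension of the OOM tensor is the batch size multiplied by the squared dilation rate of the atrous convolution (SpaceToBatchND multiplies the batch dimension by the block size), so this buffer grows quickly with batch_size. Below is a minimal, hypothetical sketch of two common TF 1.x mitigations; it is not the repository's code, and allow_growth only changes how memory is allocated, so the batch size (and possibly the input crop size) still has to fit on the card.

```python
# Hypothetical sketch (not train.py itself): easing GPU memory pressure in TF 1.x.
import tensorflow as tf

BATCH_SIZE = 2  # keep this small enough for the GPU; batch_size=1 is known to fit here

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
# config.gpu_options.per_process_gpu_memory_fraction = 0.9  # optional hard cap

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training loop as in train.py, with the reduced batch size ...
```

If train.py exposes an input-size option (check its --help), shrinking the training crop reduces this tensor in the same proportion and can be combined with a smaller batch.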


ghost commented May 12, 2018

The OOM could be due to your GPU's capabilities (its memory capacity).
To get the loss to keep decreasing, try lowering your learning rate by a factor of 10.
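
For reference, here is a minimal sketch of that suggestion, assuming a standard TF 1.x momentum optimizer; the base value and variable names are placeholders rather than the repository's actual training code:

```python
import tensorflow as tf

base_lr = 2.5e-4  # placeholder; use whatever train.py's learning-rate flag is currently set to
learning_rate = tf.constant(base_lr / 10.0, tf.float32)  # 10x smaller, as suggested above
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
# train_op = optimizer.minimize(reduced_loss)  # reduced_loss as seen in the traceback above
```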
