I can only train with batch_size=1; if I set batch_size>1, an error occurs (full traceback further below).
With batch_size=1 and all other parameters left at their defaults, the loss does not go down. Below is the loss log:
step 19960 loss = 1.322, (0.337 sec/step)
step 19961 loss = 1.341, (0.336 sec/step)
step 19962 loss = 1.302, (0.336 sec/step)
step 19963 loss = 1.324, (0.337 sec/step)
step 19964 loss = 1.317, (0.335 sec/step)
step 19965 loss = 1.298, (0.337 sec/step)
step 19966 loss = 1.319, (0.336 sec/step)
step 19967 loss = 1.304, (0.335 sec/step)
step 19968 loss = 1.294, (0.336 sec/step)
step 19969 loss = 1.305, (0.336 sec/step)
step 19970 loss = 1.347, (0.335 sec/step)
step 19971 loss = 1.314, (0.337 sec/step)
step 19972 loss = 1.304, (0.337 sec/step)
step 19973 loss = 1.310, (0.336 sec/step)
step 19974 loss = 1.301, (0.336 sec/step)
step 19975 loss = 1.301, (0.337 sec/step)
step 19976 loss = 1.387, (0.336 sec/step)
step 19977 loss = 1.320, (0.335 sec/step)
step 19978 loss = 1.305, (0.336 sec/step)
step 19979 loss = 1.309, (0.336 sec/step)
step 19980 loss = 1.302, (0.336 sec/step)
step 19981 loss = 1.304, (0.335 sec/step)
step 19982 loss = 1.325, (0.337 sec/step)
step 19983 loss = 1.321, (0.336 sec/step)
step 19984 loss = 1.316, (0.336 sec/step)
step 19985 loss = 1.332, (0.337 sec/step)
step 19986 loss = 1.299, (0.336 sec/step)
step 19987 loss = 1.312, (0.336 sec/step)
step 19988 loss = 1.290, (0.335 sec/step)
step 19989 loss = 1.323, (0.337 sec/step)
step 19990 loss = 1.318, (0.336 sec/step)
step 19991 loss = 1.307, (0.336 sec/step)
step 19992 loss = 1.364, (0.336 sec/step)
step 19993 loss = 1.324, (0.335 sec/step)
step 19994 loss = 1.314, (0.335 sec/step)
step 19995 loss = 1.301, (0.336 sec/step)
step 19996 loss = 1.291, (0.336 sec/step)
step 19997 loss = 1.317, (0.338 sec/step)
step 19998 loss = 1.322, (0.337 sec/step)
step 19999 loss = 1.293, (0.335 sec/step)
The checkpoint has been created.
step 20000 loss = 1.320, (11.272 sec/step)
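To double-check that the loss really has plateaued rather than just being noisy, I averaged it over windows of steps. This is only a minimal sketch: it assumes stdout was captured to a file named train.log (a hypothetical name) and relies on the "step N loss = X, (Y sec/step)" line format shown above.

```python
# Minimal sketch: parse the training log above and check whether the loss
# is still trending down. Assumes stdout was captured to "train.log";
# the "step N loss = X, (Y sec/step)" format matches the output above.
import re

losses = []
with open("train.log") as f:
    for line in f:
        m = re.search(r"step\s+(\d+)\s+loss\s*=\s*([\d.]+)", line)
        if m:
            losses.append(float(m.group(2)))

window = 100
if len(losses) >= 2 * window:
    early = sum(losses[-2 * window:-window]) / window
    late = sum(losses[-window:]) / window
    print("mean loss over previous %d steps: %.3f" % (window, early))
    print("mean loss over last %d steps:     %.3f" % (window, late))
    # For the log above, both means come out around 1.31,
    # i.e. the loss has effectively stopped decreasing.
```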
With batch_size>1, this is the error:
2018-04-26 13:43:56.846459: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:277] *************************************************************************************************___
2018-04-26 13:43:56.849249: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[144,12,20,2048]
Traceback (most recent call last):
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1327, in _do_call
return fn(*args)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1306, in _run_fn
status, run_metadata)
File "F:\soft\anaconda\lib\contextlib.py", line 66, in exit
next(self.gen)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 272, in
main()
File "train.py", line 252, in main
loss_value, images, labels, preds, summary, _ = sess.run([reduced_loss, image_batch, label_batch, pred, total_summary, train_op], feed_dict=feed_dict)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
options, run_metadata)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'fc1_voc12_c3/convolution/SpaceToBatchND', defined at:
File "train.py", line 272, in
main()
File "train.py", line 146, in main
net = DeepLabResNetModel({'data': image_batch}, is_training=args.is_training, num_classes=args.num_classes)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 48, in init
self.setup(is_training, num_classes)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\deeplab_resnet\model.py", line 411, in setup
.atrous_conv(3, 3, num_classes, 24, padding='SAME', relu=False, name='fc1_voc12_c3'))
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 22, in layer_decorated
layer_output = op(self, layer_input, *args, **kwargs)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 173, in atrous_conv
output = convolve(input, kernel)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 168, in
convolve = lambda i, k: tf.nn.atrous_conv2d(i, k, dilation, padding=padding)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 974, in atrous_conv2d
name=name)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 672, in convolution
op=op)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 451, in with_space_to_batch
paddings=paddings)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 3359, in space_to_batch_nd
name=name)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2628, in create_op
original_op=self._default_original_op, op_def=op_def)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
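The shapes in the OOM message are at least consistent with how tf.nn.atrous_conv2d works: it wraps the convolution in SpaceToBatchND, which multiplies the batch dimension by rate². Reading 2304 = 4 × 24² and 144 = 4 × 6² (24 and 6 are rates used by the fc1_voc12 branches) suggests this run used batch_size=4, but that is an inference from the numbers, not something I verified. A rough back-of-the-envelope check of the memory these intermediates need, assuming float32 activations:

```python
# Back-of-the-envelope memory check for the tensors in the OOM messages.
# Assumption: float32 activations (4 bytes per element).
def tensor_bytes(shape, bytes_per_elem=4):
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem

# Failing SpaceToBatchND output: with rate=24 and (assumed) batch_size=4,
# the leading dimension becomes 4 * 24**2 = 2304, matching the error.
oom_shape = (2304, 5, 7, 2048)
print("%.2f GiB" % (tensor_bytes(oom_shape) / 2**30))   # ~0.61 GiB for one tensor

# Tensor from the earlier warning (rate=6 branch): 4 * 6**2 = 144.
warn_shape = (144, 12, 20, 2048)
print("%.2f GiB" % (tensor_bytes(warn_shape) / 2**30))  # ~0.26 GiB
```

Since several such intermediates (one per atrous branch) plus their gradients have to fit on the GPU at the same time, it seems plausible that even a small batch size exhausts memory on this card; reducing the input size or batch size would be the obvious things to try, though I have not confirmed that here.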