
Loss not going down #96

Open
vijayg78 opened this issue Jul 5, 2017 · 20 comments

@vijayg78

vijayg78 commented Jul 5, 2017

Hi,
I started training from scratch with train.py on the VOC2012 dataset. I downloaded the augmented ground truths (GTs) and plugged them into the dataset, so the GTs are now the augmented GTs and the images are the original JPEG files from the dataset.
The loss is not going down; it is oscillating. Any clue on how to get it working?
Regards, Vijay

@DrSleep
Owner

DrSleep commented Jul 6, 2017

from scratch

Do you mean with a randomly initialised model?

@vijayg78
Author

vijayg78 commented Jul 6, 2017

I used deeplab_resnet_init.ckpt and ran train.py. The loss was oscillating and not coming down at all. I also tried deeplab_resnet.ckpt; same behaviour.

@vijayg78
Author

vijayg78 commented Jul 6, 2017

I used the JPEGImages from VOCdevkit, and the GTs point to the augmented images I downloaded from this GitHub. That's correct, right?
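(For reference, a quick way to sanity-check that pairing, assuming the usual dataset/train.txt list with one "image_path mask_path" pair per line; the paths below are illustrative, not the repo's exact ones:)

import os
from PIL import Image

DATA_DIR = "/path/to/VOCdevkit/VOC2012"   # illustrative
LIST_PATH = "dataset/train.txt"           # illustrative

with open(LIST_PATH) as f:
    for line in f:
        parts = line.split()
        if len(parts) != 2:
            continue
        img_path, mask_path = DATA_DIR + parts[0], DATA_DIR + parts[1]
        if not (os.path.exists(img_path) and os.path.exists(mask_path)):
            print("missing pair:", line.strip())
            continue
        # masks should be single-channel label images, not RGB colour maps
        if Image.open(mask_path).mode not in ("L", "P"):
            print("unexpected mask mode for", mask_path)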

@akshittyagi

Same problem for a model which doesn't use the deeplab_resnet.ckpt file to init

@DrSleep
Owner

DrSleep commented Jul 7, 2017

What do the images in your TensorBoard look like after a few iterations?

@Hjy20255

I have the same problem. I use my own dataset (3 classes) to train. The loss value was oscillating and not coming down at all; it stays around 1.2–1.3.

@akshittyagi

@DrSleep there are no images being produced in TensorBoard.

@DrSleep
Owner

DrSleep commented Jul 12, 2017

To all: the hyperparameters (learning rate, batch size, momentum, etc.) were chosen on Pascal VOC (for the procedure behind these choices, please refer to the original paper).
The same hyperparameters will not necessarily suit other datasets, so it is your task to find an appropriate set for your own data.

This repository is a replication of an academic paper. Anything beyond that is a bonus (like the ability to train on your own datasets).
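For context, the default schedule follows the paper's "poly" learning-rate decay, lr = base_lr * (1 - step / max_steps)^0.9, which is one of the first things to re-tune on a new dataset. A minimal sketch of that schedule in TF 1.x (the base_lr, power and num_steps values below are assumptions for illustration, not necessarily the repo's exact defaults, and the optimiser wiring is simplified):

import tensorflow as tf

base_lr = 2.5e-4     # assumed starting point; tune for your dataset
power = 0.9          # "poly" decay exponent from the paper
num_steps = 20000    # total training steps, illustrative

step = tf.train.get_or_create_global_step()
learning_rate = base_lr * tf.pow(
    1.0 - tf.cast(step, tf.float32) / num_steps, power)
optimiser = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
# train_op = optimiser.minimize(loss, global_step=step)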

@akshittyagi

Okay. But the model is also not working on the VOC dataset when not using the pretrained .ckpt file.

@DrSleep
Owner

DrSleep commented Jul 12, 2017

It works (proof, proof) on VOC with either pre-trained or not pre-trained files.
Make sure that your setup is correct.
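One frequent setup problem is ground-truth masks that are RGB colour maps, or that contain values other than the class indices 0–20 plus 255 for "ignore". A quick check sketch (the file path is illustrative):

import numpy as np
from PIL import Image

mask = np.array(Image.open("SegmentationClassAug/2007_000032.png"))  # illustrative path
print(mask.shape, mask.dtype)   # expect (H, W) uint8, not (H, W, 3)
print(np.unique(mask))          # expect values in 0..20, plus 255 for ignored pixels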

@wangruixing

I also met this problem. I used VOC2012 and the pretrained model.

@dongzhuoyao

same here

@chenyuZha

In my case I used my own dataset for training. At first I used train.py and the loss went down very slowly (from 10 to 8 over 60,000 steps). Then I switched to train_msc.py and the loss began to drop quickly; the second script trained much better, since the final loss was far smaller (about 3 instead of 8 in my case).

@zhengyang-wang

May I ask what the final loss is after running train.py for 20K iterations with deeplab_resnet_init.ckpt as a start? I used the PASCAL dataset and the final loss was about 1.3.
It would be even better if you could provide a graph of your training curve.

@ChuanWang90

Same here. With the default configuration and Pascal VOC the loss oscillates between 1.2 and 1.3. Could someone plot the training curve, or share the loss values after 20K iterations, for example? Thanks!

@FeiWard

FeiWard commented Dec 27, 2017

Has anyone shared the loss after 20K iterations? It is about 1.18 on my machine. Does anyone know the reason?
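As a rough reference point (a back-of-envelope sketch, not a claim about any particular run): a 21-class softmax that predicts uniformly gives a per-pixel cross-entropy of ln(21) ≈ 3.04, and a plateau around 1.2–1.3 corresponds to an average probability of about e^-1.3 ≈ 0.27 on the true class, i.e. better than random, but the foreground classes are not yet being separated.

import numpy as np

print(np.log(21))     # ~3.04, loss of a uniform 21-class predictor
print(np.exp(-1.3))   # ~0.27, average true-class probability implied by a 1.3 plateau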

@EternityZY

My loss is always about 1.3 and the predicted images are all black, with no visible result. I use the default hyperparameters and the VOC2012 dataset with deeplab_resnet.ckpt as a start. Why doesn't it work?
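An all-black output usually means every pixel is being assigned class 0 (background). A quick diagnostic sketch (the prediction file name is illustrative):

import numpy as np
from PIL import Image

pred = np.array(Image.open("output/mask.png"))   # illustrative prediction file
print(np.unique(pred, return_counts=True))       # only zeros => nothing but background predicted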

@PallawiSinghal

> My loss is always about 1.3 and the predicted images are all black, with no visible result. I use the default hyperparameters and the VOC2012 dataset with deeplab_resnet.ckpt as a start. Why doesn't it work?

Hi, were you able to solve the issue?

@PallawiSinghal

Hi, my loss does not change; it has become stagnant. I have tried everything related to DeepLabv3+ mentioned on every blog.
I am training to detect roads. My images are 2000x2000.
My training data has 45k images.
I have created my dataset in the PASCAL VOC format. I have three kinds of pixels:
background = [0,0,0]
void class = [255,255,255]
road = [1,1,1]
so the number of classes = 3.
I am using PASCAL VOC pre-trained weights.

The changes in train_util.py are:
1.
ignore_weight = 0
label0_weight = 10
label1_weight = 15
# per-pixel loss weights: label-1 pixels weighted by 10, label-2 pixels by 15,
# and ignore_label pixels by 0
not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 1)) * label0_weight \
    + tf.to_float(tf.equal(scaled_labels, 2)) * label1_weight \
    + tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight

2. Variables that will not be restored:

exclude_list = ['global_step', 'logits']
if not initialize_last_layer:
    exclude_list.extend(last_layers)

My train.py command:

nohup python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=65000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_batch_size=2 \
    --initialize_last_layer=False \
    --last_layers_contain_logits_only=True \
    --dataset="pascal_voc_seg" \
    --tf_initial_checkpoint="/data/old_model/models/research/deeplabv3_pascal_trainval/model.ckpt" \
    --train_logdir="/data/old_model/models/research/deeplab/mycheckpoints" \
    --dataset_dir="/data/models/research/deeplab/datasets/tfrecord" > my_output.log &

Please help 👍
INFO:tensorflow:global step 700: loss = 0.1759 (0.449 sec/step)
INFO:tensorflow:global step 710: loss = 0.1695 (0.655 sec/step)
INFO:tensorflow:global step 720: loss = 0.1742 (0.689 sec/step)
INFO:tensorflow:global step 730: loss = 0.1710 (0.505 sec/step)
INFO:tensorflow:global step 740: loss = 0.1708 (0.868 sec/step)
INFO:tensorflow:global step 750: loss = 0.1683 (0.632 sec/step)
INFO:tensorflow:global step 760: loss = 0.1692 (0.442 sec/step)
INFO:tensorflow:global step 770: loss = 0.1693 (0.597 sec/step)
INFO:tensorflow:global step 780: loss = 0.1665 (0.441 sec/step)
INFO:tensorflow:global step 790: loss = 0.1680 (0.548 sec/step)
INFO:tensorflow:global step 800: loss = 0.1708 (0.372 sec/step)
INFO:tensorflow:global step 810: loss = 0.1674 (0.327 sec/step)
INFO:tensorflow:global step 820: loss = 0.1666 (0.951 sec/step)
INFO:tensorflow:global step 830: loss = 0.1651 (0.557 sec/step)
INFO:tensorflow:global step 840: loss = 0.1663 (0.506 sec/step)
INFO:tensorflow:global step 850: loss = 0.1646 (0.446 sec/step)
INFO:tensorflow:global step 860: loss = 0.1666 (0.424 sec/step)
INFO:tensorflow:global step 870: loss = 0.1654 (0.520 sec/step)
INFO:tensorflow:global step 880: loss = 0.1662 (0.675 sec/step)
INFO:tensorflow:global step 890: loss = 0.1673 (0.325 sec/step)
INFO:tensorflow:global step 900: loss = 0.1633 (0.548 sec/step)
INFO:tensorflow:global step 910: loss = 0.1659 (0.374 sec/step)
INFO:tensorflow:global step 920: loss = 0.1639 (0.663 sec/step)
INFO:tensorflow:global step 930: loss = 0.1658 (0.442 sec/step)
INFO:tensorflow:global step 940: loss = 0.1654 (0.568 sec/step)

@subbulakshmisubha

@PallawiSinghal Did you find a solution to your problem?
