Output of MHA EfficientNet model #8
Comments
Hi Haohe, Thanks for reaching out. It has been a while since I coded the model, so I might be wrong. In the PSLA paper, figure 2 caption, we said "We multiply the output of each branch element-wise and apply a temporal mean pooling (implemented by summation)", which is reflected in psla/src/models/HigherModels.py Line 165 in 7f8fafa
I guess if you change it to […]. Please let me know what you think. -Yuan
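For context, here is a minimal sketch of the pooling described above. It assumes the head outputs are combined by summation (my reading of why the final outputs can exceed one); the shapes, head count, and combination rule are hypothetical, not the exact PSLA code:

```python
import torch

# Hypothetical sketch: each head computes per-frame attention weights and
# per-frame class probabilities, multiplies them element-wise, and sums
# over time (temporal "mean" pooling implemented by summation).
batch, frames, classes, n_heads = 2, 100, 527, 4

head_outputs = []
for _ in range(n_heads):
    att = torch.sigmoid(torch.randn(batch, frames, classes))
    att = att / att.sum(dim=1, keepdim=True)        # normalize over time
    cla = torch.sigmoid(torch.randn(batch, frames, classes))
    head_outputs.append((att * cla).sum(dim=1))     # each head's output lies in (0, 1)

# If the head outputs are summed (assumption), the result can exceed 1.
out = torch.stack(head_outputs).sum(dim=0)
print(out.max())  # frequently > 1 with several heads
```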
Hi Yuan, I would like to double down on this issue, because I don't think it is about whether you use […], and the output could end up at any value, since you're not constraining it either explicitly (e.g. by normalizing) or implicitly (through regularization terms). That said, you do clamp the output of the network to [0, 1] before passing it to BCELoss: Line 103 in 7f8fafa
So rather than using a smooth, squashing activation function like sigmoid at the very end of the model, (whether intended or not) you are using a troublesome piecewise-linear clamp. This means that unless you have super carefully initialized your model's parameters and use a very small learning rate, training would stall whenever the output goes above one or below zero (zero gradient). So do you have any explanation for this particular design choice of clamping, rather than using a smooth activation function, or avoiding the need for any final activation altogether by enforcing constraints on the head weights?
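To make the zero-gradient point concrete, here is a minimal, self-contained illustration with plain PyTorch (not the repo's code):

```python
import torch

x = torch.tensor([-0.5, 0.5, 1.5], requires_grad=True)

# clamp is piecewise linear: gradient is 1 inside [0, 1] and exactly 0 outside.
y = torch.clamp(x, 0, 1)
y.sum().backward()
print(x.grad)  # tensor([0., 1., 0.]) -> no learning signal outside [0, 1]

x.grad = None
# sigmoid is smooth: its gradient is small in the tails but never exactly 0.
z = torch.sigmoid(x)
z.sum().backward()
print(x.grad)  # all entries strictly positive
```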
On a different note, I see you normalize the attention values across the temporal axis: psla/src/models/HigherModels.py Line 162 in 7f8fafa
This would seemingly encourage the model to attend to a single temporal unit (in the output layer) at the expense of not attending to other temporal slices. Given that many events are dynamic and have a larger extent than a single unit of time, especially considering event-dense AudioSet recordings, what would be the inductive bias for such a choice? Furthermore, in order to obtain these normalized attention values for each head, you first pass them through a sigmoid function and then normalize them by division by the sum: psla/src/models/HigherModels.py Line 162 in 7f8fafa
Is there any particular reason for this choice of "sigmoid + normalization by sum" versus the more mainstream approach of applying a softmax to the attention values directly? They are of course not equivalent: softmax depends exclusively on the differences between values, i.e. softmax(x) = softmax(x + c) for any constant c, whereas sigmoid followed by sum-normalization has no such shift invariance.
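As a small numeric check of the non-equivalence (hypothetical logits):

```python
import torch

logits = torch.tensor([1.0, 2.0, 3.0])

# sigmoid followed by division by the sum, as in the repo
sig = torch.sigmoid(logits)
sig_norm = sig / sig.sum()

# plain softmax over the same values
soft = torch.softmax(logits, dim=0)

print(sig_norm)  # ~[0.285, 0.344, 0.371] -- relatively flat
print(soft)      # ~[0.090, 0.245, 0.665] -- much peakier

# softmax depends only on differences, so shifting all logits changes nothing:
print(torch.softmax(logits + 10.0, dim=0))   # identical to soft
# sigmoid + sum-normalization is not shift-invariant: it flattens toward uniform.
shifted = torch.sigmoid(logits + 10.0)
print(shifted / shifted.sum())               # ~[0.333, 0.333, 0.333]
```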
Hi there, Thanks so much for your questions. I need time to think about them. The main model architecture is from a previous paper: http://groups.csail.mit.edu/sls/archives/root/publications/2019/LoganFord_Interspeech-2019.PDF.
But before that, I want to clarify that we do not cherry-pick random seeds or successful runs at all. All experiments are run 3 times and we report the mean, which should be reproducible with the provided code. In the paper, we show the variance is pretty small. Your proposed "more reasonable" solution might lead to more stable optimization and probably better results. Have you tried it? -Yuan
Hi Yuan,
Thanks for open-sourcing this repo. I have a quick question about the MHA EfficientNet model you proposed. When I tried EfficientNet-b2 with the multi-head attention model, I found that some values in the out variable were bigger than one, rather than between 0 and 1. Is that by design?
Many Thanks