Hi, thank you for the interesting work :)
I was wondering how the "attention heatmap" in the paper was drawn.
If I have understood your method correctly, the learnable parameters are only added to the "Video Q-former", which cross-attends with the 32 x T queries generated by the frozen "Visual Q-former". The 32 visual queries attend to different regions of the frame, but since they are frozen, the attention would not have changed.
It would really help if you could share the code/method you used to visualize the attention map.
Hi rbsohee, thank you very much for your interest in our work! I apologize for the delay due to some deadlines. We use a simplified method similar to attention rollout to extract the attention weights from the Video Q-former. The 32 visual queries are frozen. However, we append the learnable queries which interact with the visual queries through the self-attention layers. This causes the representations of the queries to change, which also affects the attention weights. Additionally, due to the complexity of the model, we used a simplified version before and are now evaluating new ways to extract such attention maps. We are working on cleaning up the script and code component to extract the attention maps for public use and will release it once it is cleaned and tested.
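For readers unfamiliar with attention rollout, here is a minimal, generic sketch of the standard technique in NumPy. It is not the authors' script, which has not been released, and it does not reproduce their "simplified" variant. The function name `attention_rollout`, the `residual_weight` parameter, and the input layout (one `(num_heads, num_tokens, num_tokens)` array per layer) are assumptions made for illustration only.

```python
import numpy as np


def attention_rollout(attentions, residual_weight=0.5):
    """Generic attention rollout (illustrative; not the authors' code).

    attentions: list of arrays, one per layer, ordered first layer to last,
        each of shape (num_heads, num_tokens, num_tokens) with rows that
        sum to 1 (post-softmax attention weights). This layout is an
        assumption for this sketch.
    residual_weight: share given to the identity matrix to account for
        the residual connection around each attention layer.
    Returns an array of shape (num_tokens, num_tokens) where entry [i, j]
    estimates how much output token i draws on input token j.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        # Average the attention weights over heads.
        attn = layer_attn.mean(axis=0)
        # Mix in the identity to model the residual connection.
        attn = (1.0 - residual_weight) * attn + residual_weight * identity
        # Re-normalize so each row is a distribution again.
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Compose with the rollout accumulated from earlier layers.
        rollout = attn @ rollout
    return rollout


if __name__ == "__main__":
    # Random post-softmax attention: 3 layers, 4 heads, 6 tokens.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(3, 4, 6, 6))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    print(attention_rollout(list(attn)).shape)
```

To turn a row of the result into a heatmap, one would still need to map token positions back to image patches, which depends on the model's tokenization and is not covered by this sketch.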