Hi authors, thank you for your awesome work.
I was going through the VFIformer paper and became curious about something in the ablation study.
It would have been great to attend CVPR and ask in person, but unfortunately I cannot do so, so I leave my question here.
In short, to my understanding, the main contribution of the paper is:
the use of transformer layers in VFI, with a novel cross-scale window attention, reaching state-of-the-art performance.
So I assume the 'Model 1' configuration in Table 2 consists of convolutional layers only, yet it still outperforms the best baseline on Vimeo90K (36.27 vs. 36.18).
I wonder what the reason for this is.
To me, the 'Model 1' configuration did not seem to have anything special (no offense) since it did not contain the proposed modules.
Could you explain this?
What was the difference that led to such a strong base model (Model 1)?
Or did I miss something in the 'Model 1' configuration?
Hi, thanks for your interest in our work. As mentioned in the appendix of our paper, the main difference between Model 1 and the best baseline model is the flow estimator with the proposed Bilateral Local Refinement Blocks (BLRBs, Fig. 9(b)), which in fact bring an improvement of about 0.1 dB. However, we do not claim BLRB as one of our key contributions, because once the model is equipped with transformer layers, the contribution of the BLRBs is limited.
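As a quick sanity check on the numbers quoted in this thread, the gap between Model 1 (36.27) and the best baseline (36.18) can be computed directly and compared with the "about 0.1 dB" attributed to the BLRBs. The sketch below assumes the quoted scores are PSNR values in dB. The conversion to an MSE ratio uses the standard definition PSNR = 10 * log10(MAX^2 / MSE), which holds exactly only for a single measurement, so for dataset-averaged PSNR it is a rough illustration at best. The function name `mse_ratio` is made up for this example.

```python
# Scores quoted in the question (Table 2, Vimeo90K); assumed to be PSNR in dB.
model1_psnr = 36.27
baseline_psnr = 36.18

# Gap between Model 1 and the best baseline, rounded to the table's precision.
gap_db = round(model1_psnr - baseline_psnr, 2)


def mse_ratio(delta_db: float) -> float:
    """MSE ratio implied by a PSNR gap, from PSNR = 10 * log10(MAX^2 / MSE).

    A positive gap gives a ratio below 1, i.e. a lower error for the
    higher-scoring model. Illustrative only for averaged PSNR values.
    """
    return 10 ** (-delta_db / 10)


ratio = mse_ratio(gap_db)
print(f"gap: {gap_db:.2f} dB, implied MSE ratio: {ratio:.4f}")
```

The computed gap of 0.09 dB is consistent with the roughly 0.1 dB the authors attribute to the BLRB-based flow estimator, and under the single-measurement assumption it corresponds to an error reduction of about 2%.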