October 2019
tl;dr: Extend the work of deep3Dbox by regressing residual center positions.
The paper has a good summary of mono 3DOD in its introduction.
The geometric constraints admit a closed-form solution. This is similar to deep3Dbox but slightly different (over-constrained vs. exactly constrained).
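To make the over-constrained case concrete, here is a minimal sketch of the deep3Dbox-style formulation: each side of the 2D bbox touches the projection of one 3D box corner, which gives four linear equations in the three unknowns of the translation, solved by least squares. This is not the closed-form solution from the Shift R-CNN paper. The function names, the axis convention in `box_corners`, and the camera intrinsics are all assumptions made for illustration, and the side-to-corner assignment is read off a synthetic projection rather than searched over.

```python
import numpy as np

def box_corners(dims):
    """Eight corners of a box centred at the origin, in the object frame.
    dims = (h, w, l); assumed convention: x spans l, y spans h, z spans w."""
    h, w, l = dims
    return np.array([[sx * l / 2, sy * h / 2, sz * w / 2]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])

def rot_y(yaw):
    """Rotation about the camera y axis by the global yaw."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def solve_translation(K, R, corners, bbox, assignment):
    """Least-squares translation from four side-to-corner constraints.
    bbox = (xmin, ymin, xmax, ymax); assignment[i] is the corner touching side i."""
    rows, rhs = [], []
    for side, (value, idx) in enumerate(zip(bbox, assignment)):
        axis = side % 2                 # 0: x constraint, 1: y constraint
        a = K[axis] - value * K[2]      # a . (R X + t) = 0 for the touching corner X
        rows.append(a)
        rhs.append(-a @ (R @ corners[idx]))
    # 4 equations, 3 unknowns: over-constrained, so solve in the least-squares sense
    t, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return t

# Synthetic check: project a known box, then recover its translation.
K = np.array([[700.0, 0.0, 600.0], [0.0, 700.0, 180.0], [0.0, 0.0, 1.0]])
R = rot_y(0.3)
corners = box_corners((1.5, 1.6, 3.9))
t_true = np.array([2.0, 1.0, 20.0])

uvw = (K @ ((R @ corners.T).T + t_true).T).T
uv = uvw[:, :2] / uvw[:, 2:]
bbox = (uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max())
assignment = (uv[:, 0].argmin(), uv[:, 1].argmin(),
              uv[:, 0].argmax(), uv[:, 1].argmax())

t_est = solve_translation(K, R, corners, bbox, assignment)
```

In practice the assignment is unknown, so each candidate configuration is solved separately and one of them is selected afterwards.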
The ideas of Shift R-CNN and FQNet are quite similar. Both build on deep3Dbox and refine its first guess. But FQNet passively samples densely around the GT and trains a regressor to tell the difference to the GT, whereas Shift R-CNN actively learns to regress the difference. The follow-up work to FQNet is RAR-Net, which also actively predicts the offset, but does so iteratively with a DRL agent.
- RoiAligned feature to regress 3D orientation and 3D dimension.
- Optimization to solve for 3D bbox location t'.
- ShiftNet is a 2-layer FC network that regresses the refined final translation of the 3D center, t''. The input features are t', the 2D bbox, dimensions, local yaw, global yaw, and the camera projection matrix.
- The volume displacement loss is decomposed into 3 sums of 3 terms; each term has the form $\Delta x \times h \times w$ and the like, where w and h are estimated 3D dimensions.
- They use the best IoU to pick the best configuration. This differs from previous methods such as FQNet or Deep3DBox, which pick the configuration that minimizes the residual from least-squares fitting. This IoU-based selection is also used in MVRA.
- Questions and notes on how to improve/revise the current work
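
The refinement stage described above can be sketched as a small NumPy forward pass. This is a rough illustration, not the paper's implementation: the hidden size, the ReLU activation, the weight initialisation, and all function names are hypothetical. The `volume_displacement` function is only one possible reading of the loss: it covers a single three-term sum, and the pairing of each axis with a face area beyond the $\Delta x \times h \times w$ term is assumed by symmetry.

```python
import numpy as np

def shiftnet_features(t_init, bbox2d, dims, local_yaw, global_yaw, P):
    """Flatten the listed inputs into one vector: 3 + 4 + 3 + 1 + 1 + 12 = 24."""
    return np.concatenate([t_init, bbox2d, dims,
                           [local_yaw, global_yaw], np.ravel(P)])

def init_params(in_dim, hidden_dim, rng):
    """Random weights for two FC layers (hidden_dim and scale are hypothetical)."""
    return {
        "W1": rng.normal(0.0, 0.01, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.01, (hidden_dim, 3)),
        "b2": np.zeros(3),
    }

def shiftnet_forward(params, feats):
    """Two FC layers; the output is taken to be the refined translation t''."""
    hidden = np.maximum(feats @ params["W1"] + params["b1"], 0.0)  # FC + ReLU (assumed)
    return hidden @ params["W2"] + params["b2"]

def volume_displacement(delta, dims):
    """One reading of the loss: each axis displacement times a face area.
    Only the dx * h * w term is stated in the notes; the other two are assumed."""
    dx, dy, dz = np.abs(delta)
    h, w, l = dims
    return dx * h * w + dy * w * l + dz * h * l

# Example with made-up values for a single detection.
rng = np.random.default_rng(0)
t_init = np.array([2.0, 1.0, 20.0])              # t' from the geometric solve
bbox2d = np.array([600.0, 150.0, 720.0, 230.0])  # xmin, ymin, xmax, ymax
dims = np.array([1.5, 1.6, 3.9])                 # h, w, l
P = np.hstack([np.diag([700.0, 700.0, 1.0]), np.zeros((3, 1))])  # 3x4 projection

feats = shiftnet_features(t_init, bbox2d, dims, 0.2, 0.3, P)
params = init_params(feats.size, 64, rng)
t_refined = shiftnet_forward(params, feats)
loss = volume_displacement(t_refined - np.array([2.1, 1.0, 20.5]), dims)
```

With untrained weights the output is meaningless; the sketch only shows how the listed inputs fit together and how a displacement is weighted by the estimated dimensions.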