Pillar-based Object Detection for Autonomous Driving

November 2020

Three key improvements over MVF. The ablation studies in this paper are very clean and persuasive.

Multiview architecture
- Voxelize points into pillars in BEV, spherical view, or cylindrical view.
- Extract pillar features.
- Project pillar features back to points with nearest-neighbor or bilinear interpolation and concatenate them to the point features.
- Transform point features to BEV.
- Detection backbone + head.
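
The first three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the grid range, pillar size, and function names (`pillarize_bev`, `gather_to_points`) are hypothetical, mean pooling stands in for the learned pillar feature extractor, and the gather step uses nearest-neighbor assignment.

```python
import numpy as np

def pillarize_bev(points, x_range=(0.0, 4.0), y_range=(0.0, 4.0), pillar_size=1.0):
    """Assign each point to a BEV pillar and mean-pool the points per pillar.

    points: (N, C) array whose first two columns are x and y.
    Returns (flat pillar index per point, pillar features, grid shape).
    Ranges and pillar size are illustrative values (assumptions).
    """
    nx = int(round((x_range[1] - x_range[0]) / pillar_size))
    ny = int(round((y_range[1] - y_range[0]) / pillar_size))
    ix = np.floor((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / pillar_size).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    flat = np.where(inside, ix * ny + iy, -1)  # -1 marks out-of-range points

    feats = np.zeros((nx * ny, points.shape[1]))
    counts = np.zeros(nx * ny)
    np.add.at(feats, flat[inside], points[inside])  # sum points per pillar
    np.add.at(counts, flat[inside], 1)
    nonempty = counts > 0
    feats[nonempty] /= counts[nonempty, None]       # sum -> mean
    return flat, feats, (nx, ny)

def gather_to_points(flat, pillar_feats):
    """Nearest-neighbor projection: each point copies its own pillar's feature."""
    out = np.zeros((flat.shape[0], pillar_feats.shape[1]))
    valid = flat >= 0
    out[valid] = pillar_feats[flat[valid]]
    return out

points = np.array([[0.5, 0.5, 1.0],
                   [0.7, 0.2, 3.0],
                   [2.5, 3.5, 0.0],
                   [9.0, 9.0, 9.0]])  # last point falls outside the grid
flat, pillar_feats, grid_shape = pillarize_bev(points)
view_feats = gather_to_points(flat, pillar_feats)
# Concatenate the view feature to the raw point feature, as in the third step.
point_feats = np.concatenate([points, view_feats], axis=1)
```

In the multiview setting the same two functions would be applied once per view (with the view's own coordinates in the first two columns), and the per-view results concatenated per point.
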
Anchor-free, pillar-based prediction: like CenterPoint and PIXOR.
- Both PointPillars and MVF use anchor-based prediction.
- Going anchor-free avoids a complicated anchor-matching strategy.
- Ablation studies show that anchor-based < point-based << pillar-based.
Cylindrical view: height z, azimuth angle, radial distance. The radial distance is treated as channels.
- Cylindrical view is better than spherical view because the size of distant vehicles is not distorted: distant cars appear smaller in spherical view but keep the same size in cylindrical view. LaserNet uses a range view (RV), which is very similar to spherical view, and the original MVF also uses a spherical view.
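
A small numeric check makes the distortion argument concrete. The sketch below assumes the sensor sits at the origin at ground height and uses one common angle convention (elevation measured from the horizontal plane); neither choice comes from the paper, and the helper names are hypothetical.

```python
import numpy as np

def to_cylindrical(xyz):
    """(x, y, z) -> (rho, phi, z): radial distance, azimuth, height."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    return np.stack([np.hypot(x, y), np.arctan2(y, x), z], axis=1)

def to_spherical(xyz):
    """(x, y, z) -> (r, phi, theta): range, azimuth, elevation angle."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.linalg.norm(xyz, axis=1)
    theta = np.arctan2(z, np.hypot(x, y))  # elevation above the horizontal plane
    return np.stack([r, np.arctan2(y, x), theta], axis=1)

def vertical_extent(distance, height, convert):
    """Extent along the view's vertical axis of an object of the given height."""
    bottom_top = np.array([[distance, 0.0, 0.0],
                           [distance, 0.0, height]])
    vertical = convert(bottom_top)[:, 2]
    return vertical[1] - vertical[0]

car_height = 1.5  # metres, illustrative value
cyl_near = vertical_extent(10.0, car_height, to_cylindrical)
cyl_far = vertical_extent(50.0, car_height, to_cylindrical)
sph_near = vertical_extent(10.0, car_height, to_spherical)
sph_far = vertical_extent(50.0, car_height, to_spherical)
# Cylindrical: both extents equal the metric height of the car.
# Spherical: the angular extent shrinks roughly in proportion to 1 / distance.
```
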
Bilinear upsampling when transferring pillar features to points.
- This avoids the spatial inconsistency introduced by quantization, where the feature a point receives depends on which bin it happens to fall into.
- Bilinear interpolation is better than nearest neighbor. This observation is consistent with the comparison between RoIAlign and RoIPooling.
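
The difference between the two projection schemes can be seen on a toy feature map. The following is a minimal NumPy sketch, assuming point coordinates are already expressed in units of pillar centers (so integer coordinates coincide with pillar centers); the function names are hypothetical and the feature map is made up for illustration.

```python
import numpy as np

def sample_nearest(fmap, u, v):
    """Each point copies the feature of the closest pillar center.

    fmap: (H, W, C) pillar feature map; u, v: (N,) continuous coordinates.
    """
    H, W, _ = fmap.shape
    i = np.clip(np.rint(u).astype(int), 0, H - 1)
    j = np.clip(np.rint(v).astype(int), 0, W - 1)
    return fmap[i, j]

def sample_bilinear(fmap, u, v):
    """Each point blends the four surrounding pillar features."""
    H, W, _ = fmap.shape
    u = np.clip(u, 0, H - 1)
    v = np.clip(v, 0, W - 1)
    i0 = np.floor(u).astype(int)
    j0 = np.floor(v).astype(int)
    i1 = np.minimum(i0 + 1, H - 1)
    j1 = np.minimum(j0 + 1, W - 1)
    du = (u - i0)[:, None]  # fractional offsets, broadcast over channels
    dv = (v - j0)[:, None]
    top = (1 - dv) * fmap[i0, j0] + dv * fmap[i0, j1]
    bottom = (1 - dv) * fmap[i1, j0] + dv * fmap[i1, j1]
    return (1 - du) * top + du * bottom

# Toy 2x2 map with one channel: values 0, 1 in the first row and 2, 3 in the second.
fmap = np.arange(4, dtype=float).reshape(2, 2, 1)

# Two points that are almost at the same location but straddle a bin boundary.
u = np.array([0.49, 0.51])
v = np.array([0.0, 0.0])
nearest = sample_nearest(fmap, u, v)
bilinear = sample_bilinear(fmap, u, v)
```

With nearest neighbor the two nearly coincident points receive the features of two different pillars, while bilinear interpolation gives them nearly identical features, which is the inconsistency the note refers to.
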
