| Reference | Sensors | Object Type | Sensing Modality Representations and Processing | Network Pipeline | How to generate Region Proposals (RP) | When to fuse | Fusion Operation and Method | Fusion Level | Dataset(s) used |
|---|---|---|---|---|---|---|---|---|---|
| Meyer and Kuschk, 2019 [pdf][ref] | Radar, visual camera | 3D Vehicle | Radar point cloud, RGB image. Fused features extracted from CNN. | Faster R-CNN | Before and after RP | Average mean | Region proposal | Early, Middle | Astyx HiRes2019 |
| Nabati et al., 2019 [pdf][ref] | Radar, visual camera | 2D Vehicle | Radar object, RGB image. Radar projected to image frame. | Fast R-CNN | Radar used to generate region proposal | Implicit at RP | Region proposal | Middle | nuScenes |
| Liang et al., 2019 [pdf][ref] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist | LiDAR BEV maps, RGB image. Each processed by a ResNet with auxiliary tasks: depth estimation and ground segmentation | Faster R-CNN | Predictions with fused features | Before RP | Addition, continuous fusion layer | Middle | KITTI, self-recorded |
| Wang et al., 2019 [pdf][ref] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist, Indoor objects | LiDAR voxelized frustum (each frustum processed by PointNet), RGB image (using a pre-trained detector). | R-CNN | Pre-trained RGB image detector | After RP | Using RP from RGB image detector to build LiDAR frustums | Late | KITTI, SUN-RGBD |
| Dou et al., 2019 [pdf][ref] | LiDAR, visual camera | 3D Car | LiDAR voxel (processed by VoxelNet), RGB image (processed by an FCN to get semantic features) | Two stage detector | Predictions with fused features | Before RP | Feature concatenation | Middle | KITTI |
| Sindagi et al., 2019 [pdf][ref] | LiDAR, visual camera | 3D Car | LiDAR voxel (processed by VoxelNet), RGB image (processed by a pre-trained 2D image detector). | One stage detector | Predictions with fused features | Before RP | Feature concatenation | Early, Middle | KITTI |
| Bijelic et al., 2019 [pdf][ref] | LiDAR, visual camera | 2D Car in foggy weather | LiDAR front-view images (depth, intensity, height), RGB image. Each processed by VGG16 | SSD | Predictions with fused features | Before RP | Feature concatenation | From early to middle layers | Self-recorded datasets focused on foggy weather, simulated foggy images from KITTI |
| Chadwick et al., 2019 [pdf][ref] | Radar, visual camera | 2D Vehicle | Radar range and velocity maps, RGB image. Each processed by ResNet | One stage detector | Predictions with fused features | Before RP | Addition, feature concatenation | Middle | Self-recorded |
| Pfeuffer et al., 2018 [pdf][ref] | LiDAR, visual camera | Multiple 2D objects | LiDAR spherical and front-view sparse depth, dense depth image, RGB image. Each processed by VGG16 | Faster R-CNN | RPN from fused features | Before RP | Feature concatenation | Early, Middle, Late | KITTI |
| Liang et al., 2018 [pdf][ref] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist | LiDAR BEV maps, RGB image. Each processed by ResNet | One stage detector | Predictions with fused features | Before RP | Addition, continuous fusion layer | Middle | KITTI, self-recorded |
| Du et al., 2018 [pdf][ref] | LiDAR, visual camera | 3D Car | LiDAR voxel (processed by RANSAC and model fitting), RGB image (processed by VGG16 and GoogLeNet) | R-CNN | Pre-trained RGB image detector produces 2D bounding boxes to crop LiDAR points, which are then clustered | Before and at RP | Ensemble: use RGB image detector to regress car dimensions for a model fitting algorithm. | Late | KITTI, self-recorded data |
| Kim et al., 2018 [pdf][ref] | LiDAR, visual camera | 2D Car | LiDAR front-view depth image, RGB image. Each input processed by VGG16 | SSD | SSD with fused features | Before RP | Feature concatenation, Mixture of Experts | Middle | KITTI |
| Yang et al., 2018 [pdf][ref] | LiDAR, HD-map | 3D Car | LiDAR BEV maps, road mask image from HD map. Inputs processed by PIXOR++ [ref] with a backbone similar to FPN | One stage detector | Detector predictions | Before RP | Feature concatenation | Early | KITTI, TOR4D Dataset [ref] |
| Casas et al., 2018 [pdf][ref] | LiDAR, HD-map | 3D Car | Sequential LiDAR BEV maps, several sequential road topology mask images from HD map. Each input processed by a base network with residual blocks | One stage detector | Detector predictions | Before RP | Feature concatenation | Middle | Self-recorded data |
| Guan et al., 2018 [pdf][ref] | visual camera, thermal camera | 2D Pedestrian | RGB image, thermal image. Each processed by a base network built on VGG16 | Faster R-CNN | RPN with fused features | Before and after RP | Feature concatenation, Mixture of Experts | Early, Middle, Late | KAIST Pedestrian Dataset |
| Shin et al., 2018 [pdf][ref] | LiDAR, visual camera | 3D Car | LiDAR point clouds (processed by PointNet [ref]), RGB image (processed by a 2D CNN) | R-CNN | A 3D object detector for RGB image | After RP | Using RP from RGB image detector to search LiDAR point clouds | Late | KITTI |
| Chen et al., 2017 [pdf][ref] | LiDAR, visual camera | 3D Car | LiDAR BEV and spherical maps, RGB image. Each processed by a base network built on VGG16 | Faster R-CNN | An RPN from LiDAR BEV map | After RP | Average mean, deep fusion | Early, Middle, Late | KITTI |
| Asvadi et al., 2017 [pdf][ref] | LiDAR, visual camera | 2D Car | LiDAR front-view dense-depth (DM) and reflectance maps (RM), RGB image. Each processed through a YOLO net | YOLO | YOLO outputs for LiDAR DM and RM maps, and RGB image | After RP | Ensemble: feed engineered features from ensembled bounding boxes to a network to predict scores for NMS | Late | KITTI |
| Oh et al., 2017 [pdf][ref] | LiDAR, visual camera | 2D Car, Pedestrian, Cyclist | LiDAR front-view dense-depth map (for fusion: processed by VGG16), LiDAR voxel (for ROIs: segmentation and region growing), RGB image (for fusion: processed by VGG16; for ROIs: segmentation and grouping) | R-CNN | LiDAR voxel and RGB image separately | After RP | Association matrix using basic belief assignment | Late | KITTI |
| Wang et al., 2017 [pdf][ref] | LiDAR, visual camera | 3D Car, Pedestrian | LiDAR BEV map, RGB image. Each processed by a RetinaNet [ref] | One stage detector | Fused LiDAR and RGB image features extracted from CNN | Before RP | Sparse mean manipulation | Middle | KITTI |
| Ku et al., 2017 [pdf][ref] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist | LiDAR BEV map, RGB image. Each processed by VGG16 | Faster R-CNN | Fused LiDAR and RGB image features extracted from CNN | Before and after RP | Average mean | Early, Middle, Late | KITTI |
| Xu et al., 2017 [pdf][ref] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist, Indoor objects | LiDAR points (processed by PointNet), RGB image (processed by ResNet) | R-CNN | Pre-trained RGB image detector | After RP | Feature concatenation for local and global features | Middle | KITTI, SUN-RGBD |
| Qi et al., 2017 [pdf][ref] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist, Indoor objects | LiDAR points (processed by PointNet), RGB image (using a pre-trained detector) | R-CNN | Pre-trained RGB image detector | After RP | Feature concatenation | Middle, Late | KITTI, SUN-RGBD |
| Du et al., 2017 [pdf][ref] | LiDAR, visual camera | 2D Car | LiDAR voxel (processed by RANSAC and model fitting), RGB image (processed by VGG16 and GoogLeNet) | Faster R-CNN | First clustered by LiDAR point clouds, then fine-tuned by an RPN of RGB image | Before RP | Ensemble: feed LiDAR RP to RGB image-based CNN for final prediction | Late | KITTI |
| Schneider et al., 2017 [pdf][ref] | visual camera | Multiple 2D objects | RGB image (processed by GoogLeNet), depth image from stereo camera (processed by NiN net) | SSD | SSD predictions. | Before RP | Feature concatenation | Early, Middle, Late | Cityscapes |
| Takumi et al., 2017 [pdf][ref] | visual camera, thermal camera | Multiple 2D objects | RGB image, NIR, MIR, FIR images. Each processed by YOLO | YOLO | YOLO predictions for each spectral image | After RP | Ensemble: ensemble final predictions for each YOLO detector | Late | Self-recorded data |
| Matti et al., 2017 [pdf][ref] | LiDAR, visual camera | 2D Pedestrian | LiDAR points (clustering with DBSCAN) and RGB image (processed by ResNet) | R-CNN | Clustered by LiDAR point clouds, then size and ratio corrected on RGB image. | Before and at RP | Ensemble: feed LiDAR RP to RGB image-based CNN for final prediction | Late | KITTI |
| Schlosser et al., 2016 [pdf][ref] | LiDAR, visual camera | 2D Pedestrian | LiDAR HHA image, RGB image. Each processed by a small ConvNet | R-CNN | Deformable Parts Model with RGB image | After RP | Feature concatenation | Early, Middle, Late | KITTI |
| Kim et al., 2016 [pdf][ref] | LiDAR, visual camera | 2D Pedestrian, Cyclist | LiDAR front-view depth image, RGB image. Each processed by Fast R-CNN network [ref] | Fast R-CNN | Selective search for LiDAR and RGB image separately. | At RP | Ensemble: joint RPs are fed to an RGB image-based CNN. | Late | KITTI |
| Mees et al., 2016 [pdf][ref] | RGB-D camera | 2D Pedestrian | RGB image, depth image from depth camera, optical flow. Each processed by GoogLeNet | Fast R-CNN | Dense multi-scale sliding window for RGB image | After RP | Mixture of Experts | Late | RGB-D People Unihall Dataset, InOutDoor RGB-D People Dataset. |
| Wagner et al., 2016 [pdf][ref] | visual camera, thermal camera | 2D Pedestrian | RGB image, thermal image. Each processed by CaffeNet | R-CNN | ACF+T+THOG detector | After RP | Feature concatenation | Early, Late | KAIST Pedestrian Dataset |
| Liu et al., 2016 [pdf][ref] | visual camera, thermal camera | 2D Pedestrian | RGB image, thermal image. Each processed by NiN network | Faster R-CNN | RPN with fused (or separate) features | Before and after RP | Feature concatenation, average mean, score fusion (Cascaded CNN) | Early, Middle, Late | KAIST Pedestrian Dataset |
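
Several of the fusion operations named in the table (addition, average mean, feature concatenation, Mixture of Experts) can be illustrated on two toy feature maps. The sketch below is an illustration under stated assumptions, not the implementation of any listed method: the function names, the `(channels, height, width)` layout, and the fixed gate logits are all hypothetical. In a real Mixture of Experts setup the gate weights would usually come from a learned gating network rather than a constant.

```python
import numpy as np

def fuse_addition(a, b):
    """Element-wise addition; both maps must share the same shape."""
    return a + b

def fuse_average(a, b):
    """Element-wise average mean of the two feature maps."""
    return (a + b) / 2.0

def fuse_concat(a, b):
    """Feature concatenation along the channel axis (axis 0 here)."""
    return np.concatenate([a, b], axis=0)

def fuse_mixture_of_experts(a, b, gate_logits):
    """Weighted sum of two 'expert' feature maps.

    gate_logits is a length-2 array (hypothetical stand-in for the output
    of a gating network); a softmax turns it into weights that sum to 1.
    """
    w = np.exp(gate_logits - np.max(gate_logits))
    w = w / w.sum()
    return w[0] * a + w[1] * b

# Toy feature maps standing in for, e.g., LiDAR and RGB image features.
rng = np.random.default_rng(0)
lidar_feat = rng.standard_normal((8, 4, 4))
image_feat = rng.standard_normal((8, 4, 4))

added = fuse_addition(lidar_feat, image_feat)        # shape (8, 4, 4)
averaged = fuse_average(lidar_feat, image_feat)      # shape (8, 4, 4)
stacked = fuse_concat(lidar_feat, image_feat)        # shape (16, 4, 4)
gated = fuse_mixture_of_experts(lidar_feat, image_feat, np.array([0.0, 0.0]))
```

Addition and average mean keep the channel count and require aligned feature maps, whereas concatenation doubles the channels and leaves the mixing to later layers. With equal gate logits, the Mixture of Experts sketch reduces to the average mean.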