| Reference | Sensors | Semantics | Sensing Modality Representations | Fusion Operation and Method | Fusion Level | Dataset(s) Used |
|---|---|---|---|---|---|---|
| Chen et al., 2019 [pdf][ref] | LiDAR, visual camera | Road segmentation | RGB image, altitude difference image. Each processed by a CNN | Feature adaptation module, modified concatenation | Middle | KITTI |
| Valada et al., 2019 [pdf][ref] | Visual camera, depth camera, thermal camera | Multiple 2D objects | RGB image, thermal image, depth image. Each processed by an FCN with ResNet backbone (AdapNet++ architecture) | Extension of Mixture of Experts | Middle | Six datasets, including Cityscapes and SUN RGB-D |
| Sun et al., 2019 [pdf][ref] | Visual camera, thermal camera | Multiple 2D objects in campus environments | RGB image, thermal image. Each processed by a base network built on ResNet | Element-wise summation in the encoder networks | Middle | Datasets published by [ref] |
| Caltagirone et al., 2019 [pdf][ref] | LiDAR, vision camera | Road segmentation | LiDAR front-view depth images, RGB image. Each input processed by an FCN | Feature concatenation (for early and late fusion), weighted addition similar to a gating network (for middle-level cross fusion) | Early, Middle, Late | KITTI |
| Erkent et al., 2018 [pdf][ref] | LiDAR, visual camera | Multiple 2D objects | LiDAR BEV occupancy grids (processed based on Bayesian filtering and tracking), RGB image (processed by an FCN with VGG16 backbone) | Feature concatenation | Middle | KITTI, self-recorded |
| Lv et al., 2018 [pdf][ref] | LiDAR, vision camera | Road segmentation | LiDAR BEV maps, RGB image. Each input processed by an FCN with dilated convolution operator. RGB image features are also projected onto the LiDAR BEV plane before fusion | Feature concatenation | Middle | KITTI |
| Wulff et al., 2018 [pdf][ref] | LiDAR, vision camera | Road segmentation. Alternatives: freespace, ego-lane detection | LiDAR BEV maps, RGB image projected onto BEV plane. Inputs processed by an FCN with UNet | Feature concatenation | Early | KITTI |
| Kim et al., 2018 [pdf][ref] | LiDAR, vision camera | 2D off-road terrains | LiDAR voxel (processed by 3D convolution), RGB image (processed by ENet) | Addition | Early, Middle, Late | Self-recorded |
| Guan et al., 2018 [pdf][ref] | Vision camera, thermal camera | 2D pedestrian | RGB image, thermal image. Each processed by a base network built on VGG16 | Feature concatenation, Mixture of Experts | Early, Middle, Late | KAIST Pedestrian Dataset |
| Yang et al., 2018 [pdf][ref] | LiDAR, vision camera | Road segmentation | LiDAR points (processed by PointNet++), RGB image (processed by an FCN with VGG16 backbone) | Optimizing Conditional Random Field (CRF) | Late | KITTI |
| Gu et al., 2018 [pdf][ref] | LiDAR, visual camera | Road segmentation | LiDAR front-view depth and height maps (processed by an inverse-depth histogram based line scanning strategy), RGB image (processed by an FCN) | Optimizing Conditional Random Field (CRF) | Late | KITTI |
| Cai et al., 2018 [pdf][ref] | Satellite map with route information, visual camera | Road segmentation | Route map image, RGB image. Images are fused and processed by an FCN | Overlaying the line and curve segments in the route map onto the RGB image to generate the Map Fusion Image (MFI) | Early | Self-recorded data |
| Ha et al., 2017 [pdf][ref] | Vision camera, thermal camera | Multiple 2D objects in campus environments | RGB image, thermal image. Each processed by an FCN and mini-inception block | Feature concatenation, addition ("short-cut fusion") | Middle | Self-recorded data |
| Valada et al., 2017 [pdf][ref] | Vision camera, thermal camera | Multiple 2D objects | RGB image, thermal image, depth image. Each processed by an FCN with ResNet backbone | Mixture of Experts | Late | Cityscapes, Freiburg Multispectral Dataset, Synthia |
| Schneider et al., 2017 [pdf][ref] | Vision camera | Multiple 2D objects | RGB image, depth image | Feature concatenation | Early, Middle, Late | Cityscapes |
| Schneider et al., 2017 [pdf][ref] | Vision camera | Multiple 2D objects | RGB image (processed by GoogLeNet), depth image from stereo camera (processed by NiN net) | Feature concatenation | Early, Middle, Late | Cityscapes |
| Valada et al., 2016 [pdf][ref] | Vision camera, thermal camera | Multiple 2D objects in forested environments | RGB image, thermal image, depth image. Each processed by the UpNet (built on VGG16 and up-convolution) | Feature concatenation, addition | Early, Late | Self-recorded data |
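
The fusion operations that recur in the table (feature concatenation, element-wise addition, and Mixture of Experts) can be illustrated with a minimal sketch on plain feature vectors. This is not the implementation of any listed paper: the function names are hypothetical, features are flat Python lists rather than CNN feature maps, and the gate scores are supplied by hand, whereas a real Mixture of Experts would predict them with a learned gating network.

```python
import math


def concat_fusion(feat_a, feat_b):
    """Feature concatenation: join the two feature vectors end to end."""
    return list(feat_a) + list(feat_b)


def sum_fusion(feat_a, feat_b):
    """Element-wise addition: both vectors must have the same length."""
    if len(feat_a) != len(feat_b):
        raise ValueError("element-wise fusion needs equal-length features")
    return [a + b for a, b in zip(feat_a, feat_b)]


def softmax(scores):
    """Turn raw gate scores into weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_fusion(expert_feats, gate_scores):
    """Mixture-of-Experts style fusion: gate-weighted sum of expert outputs.

    Assumption: gate_scores are given by the caller; in practice they would
    come from a gating network conditioned on the inputs.
    """
    weights = softmax(gate_scores)
    length = len(expert_feats[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, expert_feats))
        for i in range(length)
    ]


if __name__ == "__main__":
    rgb = [1.0, 2.0]      # toy "RGB" feature vector
    thermal = [3.0, 5.0]  # toy "thermal" feature vector
    print(concat_fusion(rgb, thermal))
    print(sum_fusion(rgb, thermal))
    print(moe_fusion([rgb, thermal], [0.0, 0.0]))
```

Concatenation grows the feature dimension and leaves the weighting of modalities to later layers, addition keeps the dimension fixed but requires matching shapes, and the gated sum lets the weighting vary per input. The fusion level (early, middle, late) only changes where in the network such an operation is applied.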