Detection

Reference	Sensors	Object Type	Sensing Modality Representations and Processing	Network Pipeline	How to generate Region Proposals (RP)	When to fuse	Fusion Operation and Method	Fusion Level	Dataset(s) used
Meyer and Kuschk, 2019 [pdf][ref]	Radar, visual camera	3D Vehicle	Radar pointcloud, RGB image. Fused features extracted from CNN.	Faster R-CNN	Before and after RP	Average mean	Region proposal	Early, Middle	Astyx HiRes2019
Nabati et al., 2019 [pdf][ref]	Radar, visual camera	2D Vehicle	Radar object, RGB image. Radar projected to image frame.	Fast R-CNN	Radar used to generate region proposal	Implicit at RP	Region proposal	Middle	nuScenes
Liang et al., 2019 [pdf][ref]	LiDAR, visual camera	3D Car, Pedestrian, Cyclist	LiDAR BEV maps, RGB image. Each processed by a ResNet with auxiliary tasks: depth estimation and ground segmentation	Faster R-CNN	Predictions with fused features	Before RP	Addition, continuous fusion layer	Middle	KITTI, self-recorded
Wang et al., 2019 [pdf][ref]	LiDAR, visual camera	3D Car, Pedestrian, Cyclist, Indoor objects	LiDAR voxelized frustum (each frustum processed by the PointNet), RGB image (using a pre-trained detector).	R-CNN	Pre-trained RGB image detector	After RP	Using RP from RGB image detector to build LiDAR frustums	Late	KITTI, SUN-RGBD
Dou et al., 2019 [pdf][ref]	LiDAR, visual camera	3D Car	LiDAR voxel (processed by VoxelNet), RGB image (processed by a FCN to get semantic features)	Two stage detector	Predictions with fused features	Before RP	Feature concatenation	Middle	KITTI
Sindagi et al., 2019 [pdf][ref]	LiDAR, visual camera	3D Car	LiDAR voxel (processed by VoxelNet), RGB image (processed by a pre-trained 2D image detector).	One stage detector	Predictions with fused features	Before RP	Feature concatenation	Early, Middle	KITTI
Bijelic et al., 2019 [pdf][ref]	LiDAR, visual camera	2D Car in foggy weather	LiDAR front-view images (depth, intensity, height), RGB image. Each processed by VGG16	SSD	Predictions with fused features	Before RP	Feature concatenation	From early to middle layers	Self-recorded datasets focused on foggy weather, simulated foggy images from KITTI
Chadwick et al., 2019 [pdf][ref]	Radar, visual camera	2D Vehicle	Radar range and velocity maps, RGB image. Each processed by ResNet	One stage detector	Predictions with fused features	Before RP	Addition, feature concatenation	Middle	Self-recorded
Pfeuffer et al., 2018 [pdf][ref]	LiDAR, visual camera	Multiple 2D objects	LiDAR spherical, and front-view sparse depth, dense depth image, RGB image. Each processed by VGG16	Faster-RCNN	RPN from fused features	Before RP	Feature concatenation	Early, Middle, Late	KITTI
Liang et al., 2018 [pdf][ref]	LiDAR, visual camera	3D Car, Pedestrian, Cyclist	LiDAR BEV maps, RGB image. Each processed by ResNet	One stage detector	Predictions with fused features.	Before RP	Addition, continuous fusion layer	Middle	KITTI, self-recorded
Du et al., 2018 [pdf][ref]	LiDAR, visual camera	3D Car	LiDAR voxel (processed by RANSAC and model fitting), RGB image (processed by VGG16 and GoogLeNet)	R-CNN	Pre-trained RGB image detector produces 2D bounding boxes to crop LiDAR points, which are then clustered	Before and at RP	Ensemble: use RGB image detector to regress car dimensions for a model fitting algorithm.	Late	KITTI, self-recorded data
Kim et al., 2018 [pdf][ref]	LiDAR, visual camera	2D Car	LiDAR front-view depth image, RGB image. Each input processed by VGG16	SSD	SSD with fused features	Before RP	Feature concatenation, Mixture of Experts	Middle	KITTI
Yang et al., 2018 [pdf][ref]	LiDAR, HD-map	3D Car	LiDAR BEV maps, Road mask image from HD map. Inputs processed by PIXOR++ [ref] with the backbone similar to FPN	One stage detector	Detector predictions	Before RP	Feature concatenation	Early	KITTI, TOR4D Dataset [ref]
Casas et al., 2018 [pdf][ref]	LiDAR, HD-map	3D Car	Sequential LiDAR BEV maps, several sequential road topology mask images from HD map. Each input processed by a base network with residual blocks	One stage detector	Detector predictions	Before RP	Feature concatenation	Middle	Self-recorded data
Guan et al., 2018 [pdf][ref]	visual camera, thermal camera	2D Pedestrian	RGB image, thermal image. Each processed by a base network built on VGG16	Faster-RCNN	RPN with fused features	Before and after RP	Feature concatenation, Mixture of Experts	Early, Middle, Late	KAIST Pedestrian Dataset
Shin et al., 2018 [pdf][ref]	LiDAR, visual camera	3D Car	LiDAR point clouds (processed by PointNet [ref]), RGB image (processed by a 2D CNN)	R-CNN	A 3D object detector for RGB image	After RP	Using RP from RGB image detector to search LiDAR point clouds	Late	KITTI
Chen et al., 2017 [pdf][ref]	LiDAR, visual camera	3D Car	LiDAR BEV and spherical maps, RGB image. Each processed by a base network built on VGG16	Faster-RCNN	An RPN from LiDAR BEV map	After RP	Average mean, deep fusion	Early, Middle, Late	KITTI
Asvadi et al., 2017 [pdf][ref]	LiDAR, visual camera	2D Car	LiDAR front-view dense-depth (DM) and reflectance maps (RM), RGB image. Each processed through a YOLO net	YOLO	YOLO outputs for LiDAR DM and RM maps, and RGB image	After RP	Ensemble: feed engineered features from ensembled bounding boxes to a network to predict scores for NMS	Late	KITTI
Oh et al., 2017 [pdf][ref]	LiDAR, visual camera	2D Car, Pedestrian, Cyclist	LiDAR front-view dense-depth map (for fusion: processed by VGG16), LiDAR voxel (for ROIs: segmentation and region growing), RGB image (for fusion: processed by VGG16; for ROIs: segmentation and grouping)	R-CNN	LiDAR voxel and RGB image separately	After RP	Association matrix using basic belief assignment	Late	KITTI
Wang et al., 2017 [pdf][ref]	LiDAR, visual camera	3D Car, Pedestrian	LiDAR BEV map, RGB image. Each processed by a RetinaNet [ref]	One stage detector	Fused LiDAR and RGB image features extracted from CNN	Before RP	Sparse mean manipulation	Middle	KITTI
Ku et al., 2017 [pdf][ref]	LiDAR, visual camera	3D Car, Pedestrian, Cyclist	LiDAR BEV map, RGB image. Each processed by VGG16	Faster-RCNN	Fused LiDAR and RGB image features extracted from CNN	Before and after RP	Average mean	Early, Middle, Late	KITTI
Xu et al., 2017 [pdf][ref]	LiDAR, visual camera	3D Car, Pedestrian, Cyclist, Indoor objects	LiDAR points (processed by PointNet), RGB image (processed by ResNet)	R-CNN	Pre-trained RGB image detector	After RP	Feature concatenation for local and global features	Middle	KITTI, SUN-RGBD
Qi et al., 2017 [pdf][ref]	LiDAR, visual camera	3D Car, Pedestrian, Cyclist, Indoor objects	LiDAR points (processed by PointNet), RGB image (using a pre-trained detector)	R-CNN	Pre-trained RGB image detector	After RP	Feature concatenation	Middle, Late	KITTI, SUN-RGBD
Du et al., 2017 [pdf][ref]	LiDAR, visual camera	2D Car	LiDAR voxel (processed by RANSAC and model fitting), RGB image (processed by VGG16 and GoogLeNet)	Faster-RCNN	First clustered by LiDAR point clouds, then fine-tuned by a RPN of RGB image	Before RP	Ensemble: feed LiDAR RP to RGB image-based CNN for final prediction	Late	KITTI
Schneider et al., 2017 [pdf][ref]	visual camera	Multiple 2D objects	RGB image (processed by GoogLeNet), depth image from stereo camera (processed by NiN net)	SSD	SSD predictions	Before RP	Feature concatenation	Early, Middle, Late	Cityscapes
Takumi et al., 2017 [pdf][ref]	visual camera, thermal camera	Multiple 2D objects	RGB image, NIR, MIR, FIR image. Each processed by YOLO	YOLO	YOLO predictions for each spectral image	After RP	Ensemble: ensemble final predictions for each YOLO detector	Late	Self-recorded data
Matti et al., 2017 [pdf][ref]	LiDAR, visual camera	2D Pedestrian	LiDAR points (clustering with DBSCAN) and RGB image (processed by ResNet)	R-CNN	Clustered by LiDAR point clouds, then size and ratio corrected on RGB image.	Before and at RP	Ensemble: feed LiDAR RP to RGB image-based CNN for final prediction	Late	KITTI
Schlosser et al., 2016 [pdf][ref]	LiDAR, visual camera	2D Pedestrian	LiDAR HHA image, RGB image. Each processed by a small ConvNet	R-CNN	Deformable Parts Model with RGB image	After RP	Feature concatenation	Early, Middle, Late	KITTI
Kim et al., 2016 [pdf][ref]	LiDAR, visual camera	2D Pedestrian, Cyclist	LiDAR front-view depth image, RGB image. Each processed by Fast-RCNN network [ref]	Fast-RCNN	Selective search for LiDAR and RGB image separately.	At RP	Ensemble: joint RP are fed to RGB image based CNN.	Late	KITTI
Mees et al., 2016 [pdf][ref]	RGB-D camera	2D Pedestrian	RGB image, depth image from depth camera, optical flow. Each processed by GoogLeNet	Fast-RCNN	Dense multi-scale sliding window for RGB image	After RP	Mixture of Experts	Late	RGB-D People Unihall Dataset, InOutDoor RGB-D People Dataset.
Wagner et al., 2016 [pdf][ref]	visual camera, thermal camera	2D Pedestrian	RGB image, thermal image. Each processed by CaffeNet	R-CNN	ACF+T+THOG detector	After RP	Feature concatenation	Early, Late	KAIST Pedestrian Dataset
Liu et al., 2016 [pdf][ref]	visual camera, thermal camera	2D Pedestrian	RGB image, thermal image. Each processed by NiN network	Faster-RCNN	RPN with fused (or separate) features	Before and after RP	Feature concatenation, average mean, Score fusion (Cascaded CNN)	Early, Middle, Late	KAIST Pedestrian Dataset
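
The "Fusion Operation and Method" column names several ways of combining per-modality features. As a rough illustration only, the sketch below shows element-wise addition, average mean, feature concatenation, and a softmax-gated Mixture of Experts on plain Python lists. The function names and the toy feature vectors are assumptions made for this example and are not taken from any of the listed papers. In the cited networks these operations act on CNN feature maps, and the gate scores would come from a learned gating network rather than being passed in by hand.

```python
import math
from typing import List, Sequence

Vector = List[float]


def _check_same_length(features: Sequence[Vector]) -> None:
    """Raise if the feature vectors do not all have the same length."""
    if len({len(f) for f in features}) != 1:
        raise ValueError("all feature vectors must have the same length")


def fuse_addition(a: Vector, b: Vector) -> Vector:
    """Element-wise addition of two feature vectors."""
    _check_same_length([a, b])
    return [x + y for x, y in zip(a, b)]


def fuse_average(features: Sequence[Vector]) -> Vector:
    """Element-wise mean over any number of feature vectors."""
    _check_same_length(features)
    n = len(features)
    return [sum(vals) / n for vals in zip(*features)]


def fuse_concat(a: Vector, b: Vector) -> Vector:
    """Concatenation: the fused vector keeps every input channel."""
    return list(a) + list(b)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_mixture_of_experts(features: Sequence[Vector],
                            gate_scores: Sequence[float]) -> Vector:
    """Weighted sum of expert features; weights are softmax(gate_scores).

    gate_scores is a hypothetical stand-in for the output of a gating network.
    """
    _check_same_length(features)
    if len(gate_scores) != len(features):
        raise ValueError("need exactly one gate score per expert")
    weights = softmax(gate_scores)
    return [sum(w * v for w, v in zip(weights, vals)) for vals in zip(*features)]


# Toy feature vectors (assumed values, for illustration only).
lidar_feat = [1.0, 2.0, 3.0]
rgb_feat = [3.0, 2.0, 1.0]

print(fuse_addition(lidar_feat, rgb_feat))
print(fuse_average([lidar_feat, rgb_feat]))
print(fuse_concat(lidar_feat, rgb_feat))
print(fuse_mixture_of_experts([lidar_feat, rgb_feat], gate_scores=[0.0, 0.0]))
```

Addition and average mean require the modalities to share a feature shape, whereas concatenation does not but increases the channel count that later layers must handle. With equal gate scores the Mixture of Experts sketch reduces to the average mean.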

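Many rows list "LiDAR BEV maps" as the LiDAR representation. A minimal sketch of the general idea, assuming a simple point-count grid, is given below: each 3D point is dropped into a 2D cell on the ground plane. The function name, axis convention, ranges, and cell size are assumptions chosen for this example. The cited methods use their own encodings (for example height slices, intensity, or density channels), which are not reproduced here.

```python
from typing import List, Sequence, Tuple

# Assumed convention: (x forward, y left, z up), all in metres.
Point = Tuple[float, float, float]


def bev_occupancy(points: Sequence[Point],
                  x_range: Tuple[float, float] = (0.0, 4.0),
                  y_range: Tuple[float, float] = (-2.0, 2.0),
                  cell: float = 1.0) -> List[List[int]]:
    """Count LiDAR points per bird's-eye-view cell.

    Rows index x (forward), columns index y (lateral). Points outside
    the configured ranges are ignored; z is discarded in this sketch.
    """
    n_rows = int(round((x_range[1] - x_range[0]) / cell))
    n_cols = int(round((y_range[1] - y_range[0]) / cell))
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _z in points:
        inside_x = x_range[0] <= x < x_range[1]
        inside_y = y_range[0] <= y < y_range[1]
        if not (inside_x and inside_y):
            continue
        # Truncation equals floor here because the offsets are non-negative.
        row = min(int((x - x_range[0]) / cell), n_rows - 1)
        col = min(int((y - y_range[0]) / cell), n_cols - 1)
        grid[row][col] += 1
    return grid


# Toy point cloud (assumed values); the last point lies outside the grid.
demo_points = [(0.5, -1.5, 0.2), (0.7, -1.2, 1.0), (3.9, 1.9, 0.0), (5.0, 0.0, 0.0)]
for grid_row in bev_occupancy(demo_points):
    print(grid_row)
```

The resulting grid can be treated like an image, which is why BEV maps are commonly paired with 2D CNN backbones in the table above.
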
Deep Multi-modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges

Di Feng, Christian Haase-Schuetz, Lars Rosenbaum, Heinz Hertlein, Claudius Glaeser, Fabian Timm, Werner Wiesbeck and Klaus Dietmayer
Robert Bosch GmbH in cooperation with Ulm University and Karlsruhe Institute of Technology
* Contributed equally