Adaptive Multi-Scale Feature Fusion for Real-Time Object Detection in Autonomous Driving Systems Using Deep Convolutional Neural Networks

Authors

  • Raden Nur Rachman Dzakiyullah

Keywords:

Computer vision; Object detection; Deep learning; Feature pyramid networks; Autonomous driving; Convolutional neural networks; Multi-scale feature fusion; Real-time inference

Abstract

Real-time object detection remains a fundamental yet challenging problem in computer vision, particularly in safety-critical applications such as autonomous driving. Existing methods frequently struggle to balance detection accuracy against computational efficiency, especially when processing multi-scale objects in complex, dynamic environments. In this paper, we propose AdapFuse-Net, a novel deep convolutional neural network architecture that incorporates an Adaptive Multi-Scale Feature Fusion (AMSFF) module. AdapFuse-Net dynamically weights feature maps across multiple resolution levels using a lightweight attention mechanism, enabling the model to capture both fine-grained local features and high-level semantic context without incurring prohibitive computational cost. We evaluate AdapFuse-Net on three benchmark datasets (COCO 2017, KITTI, and BDD100K), achieving a mean Average Precision (mAP) of 54.7%, 89.3%, and 52.1%, respectively, while maintaining an inference speed of 43 frames per second on a single NVIDIA RTX 3090 GPU. Ablation studies confirm that the AMSFF module contributes a 4.2% mAP improvement over baseline architectures. Our results demonstrate that AdapFuse-Net offers a compelling accuracy-efficiency trade-off, making it suitable for deployment in real-world autonomous systems.
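The abstract does not specify the internals of the AMSFF module, so the following is only a minimal sketch of the general idea of attention-weighted fusion across resolution levels, not the AdapFuse-Net implementation. It assumes single-channel feature maps stored as nested lists, nearest-neighbour upsampling to the finest resolution, and one hypothetical learned scalar per level (`gains`) standing in for the attention parameters. The function names `upsample_nearest`, `softmax`, and `adaptive_fuse` are illustrative and do not come from the paper.

```python
import math


def upsample_nearest(fmap, out_h, out_w):
    """Resize a 2D map (list of rows) to out_h x out_w by nearest-neighbour lookup."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(levels, gains):
    """Fuse single-channel feature maps taken from several resolutions.

    levels: list of 2D maps, finest resolution first.
    gains:  one scalar per level; a hypothetical stand-in for learned
            attention parameters (an assumption, not from the paper).
    Returns the fused map at the finest resolution and the level weights.
    """
    out_h, out_w = len(levels[0]), len(levels[0][0])
    # Squeeze: global average pooling summarises each level as one number.
    pooled = [sum(map(sum, f)) / (len(f) * len(f[0])) for f in levels]
    # Attention: input-dependent scores, normalised across levels.
    weights = softmax([g * p for g, p in zip(gains, pooled)])
    # Bring every level to the finest resolution before mixing.
    resized = [upsample_nearest(f, out_h, out_w) for f in levels]
    fused = [
        [sum(w * m[r][c] for w, m in zip(weights, resized)) for c in range(out_w)]
        for r in range(out_h)
    ]
    return fused, weights


if __name__ == "__main__":
    # Toy pyramid: 4x4, 2x2, and 1x1 maps (illustrative values only).
    p3 = [[float(r + c) for c in range(4)] for r in range(4)]
    p4 = [[1.0, 2.0], [3.0, 4.0]]
    p5 = [[5.0]]
    fused, weights = adaptive_fuse([p3, p4, p5], gains=[0.5, 1.0, 1.5])
    print("level weights:", weights)
    print("fused shape:", len(fused), "x", len(fused[0]))
```

Because the weights depend on the pooled statistics of the input, the mix between fine and coarse levels changes from image to image, which is the sense in which such a fusion is "adaptive". A practical module would operate on multi-channel tensors and would learn its attention parameters during training.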

Published

2026-06-05

How to Cite

Adaptive Multi-Scale Feature Fusion for Real-Time Object Detection in Autonomous Driving Systems Using Deep Convolutional Neural Networks. (2026). Journal of Computer Science Innovations and Research, 1(1), 1-6. https://jcsir.org/index.php/jcsir/article/view/1