A Hybrid Faster R-CNN and YOLOv5 Model with Transformer Augmentation for Enhanced Object Detection
Kunal Sahu
Scholar, Department of Computer Science, National P.G. College, Lucknow, India
Author
Khushi Rajput
Scholar, Department of Computer Science, National P.G. College, Lucknow, India
Author
Shweta Sinha
Assistant Professor, Department of Computer Science, National P.G. College, Lucknow, India
Author
Rinku Raheja
Assistant Professor, Department of Computer Science, National P.G. College, Lucknow, India
Author
📌 DOI: https://doi.org/10.63920/tjths.44009
🔑 Keywords: Object Detection; YOLOv5; Faster R-CNN; Transformer decoder; Hybrid Model
📅 Publication Date: 06 October 2025
📜 License:
This work is licensed under a Creative Commons Attribution 4.0 International License
- Share — Copy and Redistribute the material
- Adapt — Remix, Transform, and build upon the material
- The licensor cannot revoke these freedoms as long as you follow the license terms.
Abstract:
We propose a three-stage model for detecting small objects (smaller than 32x32 pixels), such as backpacks, handbags, or other discarded items, in security camera images. First, YOLOv5 generates candidate boxes. Next, Faster R-CNN refines those boxes for more precise localization. Finally, a small Transformer decoder is added to improve detection of the smallest objects. We will compress the model with weight pruning and INT8 quantization, aiming to reduce its size by 20-30% and to reach 20-30 frames per second on a Jetson Nano so that it can run in real time. Training will use mixed precision on a custom surveillance dataset that focuses on small objects. Our goal is to raise small-object recall above the 30% YOLOv5 baseline, with clear benefits for autonomous driving, smart security, and farm monitoring applications. The model will be evaluated on our dataset first, and later on COCO and KITTI, along with its ability to handle video streams.
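The target metric in the abstract, recall on objects smaller than 32x32 pixels, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the `(x1, y1, x2, y2)` box format, the IoU threshold of 0.5, and the greedy one-to-one matching are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]

# "Small" threshold taken from the abstract: area below 32x32 pixels.
SMALL_AREA = 32 * 32


def area(box: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def small_object_recall(gt: List[Box], pred: List[Box],
                        iou_thr: float = 0.5) -> float:
    """Fraction of small ground-truth boxes matched by some prediction.

    Simplification (assumption): each ground-truth box greedily takes the
    unused prediction with the highest IoU, ignoring confidence scores.
    """
    small_gt = [g for g in gt if area(g) < SMALL_AREA]
    if not small_gt:
        return 0.0
    used = set()
    hits = 0
    for g in small_gt:
        best_j, best_iou = -1, iou_thr
        for j, p in enumerate(pred):
            if j in used:
                continue
            v = iou(g, p)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            used.add(best_j)
            hits += 1
    return hits / len(small_gt)


if __name__ == "__main__":
    # Two small ground-truth boxes and one large one; one prediction.
    gt = [(0, 0, 20, 20), (100, 100, 120, 120), (0, 0, 200, 200)]
    pred = [(1, 1, 21, 21)]
    print(small_object_recall(gt, pred))
```

In this toy case only one of the two small ground-truth boxes is matched, and the large box is excluded from the denominator. A full benchmark such as COCO would also account for prediction confidence and multiple IoU thresholds.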
📖 How to Cite
Kunal Sahu, Khushi Rajput, Shweta Sinha and Rinku Raheja (2025). A Hybrid Faster R-CNN and YOLOv5 Model with Transformer Augmentation for Enhanced Object Detection. TEJAS J. Technol. Humanit. Sci., Vol. 04, Issue 04. https://doi.org/10.63920/tjths.44009
