TEJAS Journal of Technologies and Humanitarian Science

ISSN : 2583-5599

Open Access | Quarterly | Peer Reviewed Journal


A Hybrid Faster R-CNN and YOLOv5 Model with Transformer Augmentation for Enhanced Object Detection


Kunal Sahu
Scholar, Department of Computer Science, National P.G. College, Lucknow, India

Author

Khushi Rajput
Scholar, Department of Computer Science, National P.G. College, Lucknow, India

Author

Shweta Sinha
Assistant Professor, Department of Computer Science, National P.G. College, Lucknow, India

Author

Rinku Raheja
Assistant Professor, Department of Computer Science, National P.G. College, Lucknow, India

Author


📌 DOI: https://doi.org/10.63920/tjths.44009

🔑 Keywords: Object Detection; YOLOv5; Faster R-CNN; Transformer decoder; Hybrid Model

📅 Publication Date: 06 October 2025

📜 License:

  • Share — Copy and Redistribute the material
  • Adapt — Remix, Transform, and build upon the material
  • The licensor cannot revoke these freedoms as long as you follow the license terms.

Abstract:

We propose a three-stage model for detecting small objects, those smaller than 32x32 pixels, such as backpacks, handbags, or other abandoned items in security camera images. First, YOLOv5 generates candidate bounding boxes. Next, Faster R-CNN refines those boxes for more precise localization. Finally, a small Transformer decoder is added to improve detection of the smallest objects. To make the model run in real time, we will compress it with weight pruning and INT8 quantization, aiming to reduce its size by 20-30% and to reach 20-30 frames per second on a Jetson Nano. Training will use mixed precision on a custom surveillance dataset that concentrates on small objects. Our goal is to raise small-object recall above the 30% baseline of YOLOv5, with benefits for autonomous driving, smart security, and farm monitoring. The model will first be evaluated on our own dataset, then on COCO and KITTI, and finally on video streams.
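The target metric, recall on objects smaller than 32x32 pixels, could be computed roughly as in the sketch below. This is an illustration only, not the authors' evaluation code: the box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5, the greedy one-to-one matching, and the function names `iou` and `small_object_recall` are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]

# Size threshold taken from the abstract: objects smaller than 32x32 pixels.
SMALL_AREA = 32 * 32


def area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def small_object_recall(gt: List[Box], pred: List[Box],
                        iou_thr: float = 0.5) -> float:
    """Fraction of small ground-truth boxes matched by a prediction.

    Matching is greedy and one-to-one: each prediction can explain
    at most one ground-truth box (an assumption of this sketch).
    """
    small_gt = [g for g in gt if area(g) < SMALL_AREA]
    if not small_gt:
        return 0.0
    used = set()
    hits = 0
    for g in small_gt:
        best_j, best_iou = -1, iou_thr
        for j, p in enumerate(pred):
            if j in used:
                continue
            v = iou(g, p)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            used.add(best_j)
            hits += 1
    return hits / len(small_gt)


# Hypothetical example: two small objects and one large one.
gt_boxes = [(0, 0, 20, 20), (100, 100, 120, 120), (200, 200, 300, 300)]
pred_boxes = [(1, 1, 21, 21), (400, 400, 420, 420)]
recall = small_object_recall(gt_boxes, pred_boxes)
beats_baseline = recall > 0.30  # 30% YOLOv5 baseline quoted in the abstract
```

In this toy case only one of the two small ground-truth boxes is matched, so the recall is 0.5; the large box is ignored because it falls outside the small-object size range.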



📖 How to Cite

Kunal Sahu, Khushi Rajput, Shweta Sinha, and Rinku Raheja (2025). A Hybrid Faster R-CNN and YOLOv5 Model with Transformer Augmentation for Enhanced Object Detection. TEJAS J. Technol. Humanit. Sci., Vol. 04, Issue 04. https://doi.org/10.63920/tjths.44009

