






















Abstract:Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% APS on VisDrone 2019 and 48.95% AP50 tiny on TinyPerson, showing strong performance on small-object benchmarks. The code and models are available at this https URL.
| Comments: | 6 pages, 6 figures,accepted to IJCNN 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2604.14884 [cs.CV] |
| (or arXiv:2604.14884v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2604.14884 arXiv-issued DOI via DataCite |
From: Tao Yan [view email]
[v1]
Thu, 16 Apr 2026 11:22:13 UTC (6,367 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。