EFSI-DETR: Efficient Frequency-Semantic Integration for Real-Time Small Object Detection in UAV Imagery
Yu Xia, Chan · 2026-05-26 · via cs.CV updates on arXiv.org

Abstract: Real-time small object detection in Unmanned Aerial Vehicle (UAV) imagery remains challenging due to limited feature representation and ineffective multi-scale fusion. Existing methods underutilize frequency information and rely on static convolutional operations, which limits their capacity to obtain rich feature representations and hinders effective exploitation of deep semantic features. To address these issues, we propose EFSI-DETR, a novel detection framework that integrates efficient semantic feature enhancement with dynamic frequency-spatial guidance. EFSI-DETR comprises two main components: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion, and (2) an Efficient Semantic Feature Concentrator (ESFC) that enables deep semantic extraction with minimal computational cost. Furthermore, a Fine-grained Feature Retention (FFR) strategy incorporates spatially rich shallow features during fusion to preserve fine-grained details, which are crucial for small object detection in UAV imagery. Extensive experiments on the VisDrone and CODrone benchmarks demonstrate that EFSI-DETR achieves state-of-the-art performance with real-time efficiency, improving AP and AP$_{s}$ on VisDrone by 1.6% and 5.8%, respectively, while running at 188 FPS on a single RTX 4090 GPU.
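
The abstract reports AP$_{s}$ and FPS without defining them. The sketch below illustrates both quantities under stated assumptions: it assumes AP$_{s}$ follows the COCO convention, in which a "small" object has a box area below 32x32 pixels (the abstract does not say which convention is used), and it converts a throughput figure into per-frame latency. The function names `size_bucket` and `latency_ms` are hypothetical helpers written for this illustration; they are not part of EFSI-DETR.

```python
# Illustrative helpers only; not taken from the EFSI-DETR paper or its code.

# Assumption: COCO-style area thresholds (in pixels^2) for object size buckets.
SMALL_MAX_AREA = 32 ** 2    # below this area, an object counts as "small"
MEDIUM_MAX_AREA = 96 ** 2   # below this area (and not small), "medium"


def size_bucket(width: float, height: float) -> str:
    """Classify a bounding box as small, medium, or large by pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def latency_ms(fps: float) -> float:
    """Convert throughput in frames per second to milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    # A 20x15 pixel box has area 300, well under the assumed 1024 threshold.
    print(size_bucket(20, 15))
    # Per-frame time budget implied by the reported 188 FPS.
    print(round(latency_ms(188), 2))
```

Under these assumptions, the reported 188 FPS corresponds to roughly 5.3 ms per frame, and AP$_{s}$ would be averaged only over objects that fall into the "small" bucket.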
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2601.18597 [cs.CV]
  (or arXiv:2601.18597v2 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2601.18597

arXiv-issued DOI via DataCite

Submission history

From: Yu Xia
[v1] Mon, 26 Jan 2026 15:41:37 UTC (10,396 KB)
[v2] Mon, 25 May 2026 05:33:03 UTC (20,236 KB)