Large Model for Small Data: Foundation Model for Cross-Modal RF Human Activity Recognition
Yuxuan Weng, Guoquan Wu, Tianyue Zheng, Yanbing Yang, Jun Luo · 2024-10-13 · via cs.CV updates on arXiv.org

Radio-Frequency (RF)-based Human Activity Recognition (HAR) has emerged as a promising solution for applications that are not amenable to computer-vision-based techniques. However, labeled RF data is scarce because RF signals are not human-interpretable, which poses a significant obstacle. Thanks to recent breakthroughs in foundation models (FMs), extracting deep semantic insights from unlabeled visual data has become viable, yet these vision-based FMs fall short when applied to small RF datasets. To bridge this gap, we introduce FM-Fi, a cross-modal framework engineered to transfer the knowledge of vision-based FMs to RF-based HAR systems. FM-Fi involves a novel cross-modal contrastive knowledge distillation mechanism that enables an RF encoder to inherit the interpretative power of FMs and thereby achieve zero-shot learning. It also employs the intrinsic capabilities of the FM and RF to remove extraneous features for better alignment between the two modalities. The framework is further refined through metric-based few-shot learning techniques, which aim to boost performance on predefined HAR tasks. Comprehensive evaluations indicate that FM-Fi rivals the effectiveness of vision-based methods and provide empirical evidence of its generalizability across various environments.
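The two mechanisms named in the abstract can be illustrated with a small, dependency-free sketch. It is not the FM-Fi implementation: the abstract does not state the loss or the metric used, so the InfoNCE-style loss, the cosine-similarity prototype classifier, the temperature value, and every function name below (`info_nce`, `prototypes`, `classify`) are assumptions chosen for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(rf_embs, fm_embs, temperature=0.1):
    """Assumed InfoNCE-style distillation loss (hypothetical, not from the paper).

    rf_embs[i] and fm_embs[i] are embeddings of the same synchronized sample;
    every other FM embedding in the batch acts as a negative.
    """
    rf = [l2_normalize(v) for v in rf_embs]
    fm = [l2_normalize(v) for v in fm_embs]
    total = 0.0
    for i, r in enumerate(rf):
        logits = [dot(r, f) / temperature for f in fm]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]  # cross-entropy with target index i
    return total / len(rf)


def prototypes(support):
    """Metric-based few-shot step: one normalized mean embedding per class.

    support maps a class label to a list of RF embeddings for that class.
    """
    protos = {}
    for label, embs in support.items():
        dim = len(embs[0])
        mean = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
        protos[label] = l2_normalize(mean)
    return protos


def classify(query, protos):
    """Return the label whose prototype has the highest cosine similarity."""
    q = l2_normalize(query)
    return max(protos, key=lambda label: dot(q, protos[label]))


# Toy data: 2-D embeddings invented for the example.
fm_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_rf = [[0.9, 0.1], [0.1, 0.9]]   # RF embeddings close to their FM pairs
swapped_rf = [[0.1, 0.9], [0.9, 0.1]]   # RF embeddings close to the wrong pair

loss_aligned = info_nce(aligned_rf, fm_batch)
loss_swapped = info_nce(swapped_rf, fm_batch)

protos = prototypes({
    "walk": [[1.0, 0.1], [0.9, 0.0]],
    "sit": [[0.0, 1.0], [0.1, 0.8]],
})
predicted = classify([0.8, 0.2], protos)
```

In this toy setup the loss is lower when each RF embedding points toward its paired FM embedding than when the pairs are swapped, which is the behavior a contrastive distillation objective is meant to reward. The prototype step then needs only a few labeled RF samples per activity, since classification reduces to a nearest-prototype lookup in the shared embedding space.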