惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

L
LangChain Blog
D
Darknet – Hacking Tools, Hacker News & Cyber Security
G
GRAHAM CLULEY
Latest news
Latest news
H
Heimdal Security Blog
The Hacker News
The Hacker News
AI
AI
S
Secure Thoughts
L
Lohrmann on Cybersecurity
T
Troy Hunt's Blog
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
S
Securelist
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
T
Threatpost
大猫的无限游戏
大猫的无限游戏
I
InfoQ
Google DeepMind News
Google DeepMind News
GbyAI
GbyAI
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
博客园 - 三生石上(FineUI控件)
博客园 - 聂微东
NISL@THU
NISL@THU
C
CERT Recently Published Vulnerability Notes
Cyberwarzone
Cyberwarzone
Microsoft Security Blog
Microsoft Security Blog
Apple Machine Learning Research
Apple Machine Learning Research
T
Tailwind CSS Blog
The Register - Security
The Register - Security
Y
Y Combinator Blog
W
WeLiveSecurity
K
KPMG report finds enterprise disconnect between AI and its ROI | CIO
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
V
V2EX - 技术
T
Tor Project blog
MongoDB | Blog
MongoDB | Blog
爱范儿
爱范儿
V
Visual Studio Blog
O
OpenAI News
S
SegmentFault 最新的问题
博客园 - Franky
博客园 - 叶小钗
Hacker News: Ask HN
Hacker News: Ask HN
阮一峰的网络日志
阮一峰的网络日志
Forbes - Security
Forbes - Security
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
V
V2EX
T
Threat Research - Cisco Blogs
月光博客
月光博客
IT之家
IT之家
美团技术团队

cs.CV updates on arXiv.org

Demographic and Linguistic Bias Evaluation in Omnimodal Language Models Cross-Cultural Value Awareness in Large Vision-Language Models From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories PhysInOne: Visual Physics Learning and Reasoning in One Suite Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation Neural Distribution Prior for LiDAR Out-of-Distribution Detection Adding Another Dimension to Image-based Animal Detection Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval Detecting Diffusion-generated Images via Dynamic Assembly Forests Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS) MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification BIAS: A Biologically Inspired Algorithm for Video Saliency Detection DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs CatalogStitch: Dimension-Aware and Occlusion-Preserving Object Compositing for Catalog Image Generation Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning SenBen: Sensitive Scene Graphs for Explainable Content Moderation Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring Deep Learning-Based Tracking and Lineage Reconstruction of Ligament Breakup Unified Multimodal Uncertain Inference EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding On Semiotic-Grounded Interpretive Evaluation of Generative Art Generative 3D Gaussian Splatting for Arbitrary-ResolutionAtmospheric Downscaling and Forecasting From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction Needle in a Haystack: One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology A Semi-Automated Framework for 3D Reconstruction of Medieval Manuscript Miniatures Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models Pretrain-then-Adapt: Uncertainty-Aware Test-Time Adaptation for Text-based Person Search R3PM-Net: Real-time, Robust, Real-world Point Matching Network B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers When & How to Write for Personalized Demand-aware Query Rewriting in Video Search Relational Visual Similarity OmniPrism: Learning Disentangled Visual Concept for Image Generation
VisualClaw: A Real-Time, Personalized Agent for the Physical World
[Submitted on 15 Jun 2026] · 2026-06-16 · via cs.CV updates on arXiv.org

View PDF HTML (experimental)

Abstract:Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average -98% versus full-frame upload and by -25.9% over the offline uniform 8 frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a -9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

Submission history

From: Haoqin Tu [view email]
[v1] Mon, 15 Jun 2026 06:58:22 UTC (3,810 KB)