Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez · 2026-06-11 · via cs.CV updates on arXiv.org

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.
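The abstract describes grounding a symbolic goal state in bounding-box detections. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: it assumes a goal of the form "object inside region" and checks it by testing whether the center of the object's box falls within the region's box. The names `Box`, `Goal`, and `goal_satisfied`, the pixel-coordinate convention, and the sample detections are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (assumed format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    def contains(self, point: tuple[float, float]) -> bool:
        x, y = point
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


@dataclass(frozen=True)
class Goal:
    """Hypothetical symbolic goal: `obj` should end up inside `region`."""
    obj: str
    region: str


def goal_satisfied(goal: Goal, detections: dict[str, Box]) -> bool:
    """Return True if the object's box center lies inside the region's box.

    Returns False when either the object or the region was not detected.
    """
    if goal.obj not in detections or goal.region not in detections:
        return False
    return detections[goal.region].contains(detections[goal.obj].center())


# Made-up detections: labels and coordinates are illustrative only.
before = {
    "top shelf": Box(0, 0, 400, 100),
    "bottom shelf": Box(0, 200, 400, 300),
    "mug": Box(150, 220, 200, 280),
}
after = dict(before, mug=Box(150, 20, 200, 80))

goal = Goal(obj="mug", region="top shelf")
print(goal_satisfied(goal, before))  # mug center is inside the bottom shelf
print(goal_satisfied(goal, after))   # mug center is inside the top shelf
```

In a pipeline of the kind the abstract outlines, a language model would presumably produce the `Goal` from the user's query and a detector would supply the boxes; the center-in-box test here is just one simple way such a predicate could be evaluated.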