Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu · 2025-10-02 · via cs.AI updates on arXiv.org

Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.
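The abstract names the two uncertainty signals but gives no formulas, so the following is only a minimal sketch of how they might be computed and combined, not the authors' implementation. It assumes perceptual uncertainty is the symmetric KL divergence between the model's output distributions for an original and a perturbed view of the same image, and that output uncertainty is the entropy of the policy's distribution. The group-normalized advantage follows the usual GRPO form. The multiplicative reweighting rule and the names `alpha`, `beta`, and `reweighted_advantages` are hypothetical, introduced purely for illustration.

```python
import math
from statistics import mean, pstdev


def entropy(p):
    """Shannon entropy (nats) of a categorical distribution: output uncertainty."""
    return -sum(x * math.log(x) for x in p if x > 0)


def symmetric_kl(p, q, eps=1e-12):
    """KL(p||q) + KL(q||p) for two categorical distributions.

    Assumption: p and q are the model's output distributions for the original
    and a perturbed view of the same image. Some definitions halve this sum.
    """
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return kl_pq + kl_qp


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: rewards standardized within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reweighted_advantages(rewards, perceptual_u, output_u, alpha=0.5, beta=0.5):
    """Hypothetical recalibration: scale each advantage up with its uncertainty.

    alpha and beta are illustrative weights, not values from the paper.
    """
    base = group_advantages(rewards)
    return [
        a * (1.0 + alpha * pu + beta * ou)
        for a, pu, ou in zip(base, perceptual_u, output_u)
    ]


if __name__ == "__main__":
    clean_view = [0.7, 0.2, 0.1]      # answer distribution, original image
    perturbed_view = [0.4, 0.4, 0.2]  # answer distribution, perturbed image
    pu = symmetric_kl(clean_view, perturbed_view)
    ou = entropy(clean_view)
    # Two rollouts in one group: one correct (reward 1), one incorrect (reward 0).
    print(reweighted_advantages([1.0, 0.0], [pu, 0.0], [ou, 0.0]))
```

Under this assumed rule, a rollout whose answer shifts strongly under image perturbation, or whose policy distribution is high-entropy, receives a larger-magnitude advantage than one with the same reward and low uncertainty. That is one simple way to "focus learning on states with high perceptual or decisional ambiguity".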