OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
Mingxin Huan · 2026-05-27 · via cs.CL updates on arXiv.org

Abstract: Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at this https URL.
Comments: ICLR 2026
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2505.17163 [cs.LG]
  (or arXiv:2505.17163v2 [cs.LG] for this version)
  https://doi.org/10.48550/arXiv.2505.17163

arXiv-issued DOI via DataCite

Submission history

From: Dezhi Peng
[v1] Thu, 22 May 2025 15:25:14 UTC (5,342 KB)
[v2] Tue, 26 May 2026 01:50:12 UTC (4,073 KB)