When VLMs "Fix" Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-Line Handwritten Math OCR
Jin Seong, W · 2026-04-28 · via cs.CV updates on arXiv.org


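Abstract: Accurate transcription of handwritten mathematics is essential for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior work focuses on single-line expressions and relies on lexical metrics such as BLEU, which cannot assess semantic reasoning in multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math optical character recognition (OCR) and reveal a critical failure mode of vision-language models (VLMs): over-correction. Rather than faithfully transcribing a student's work, these models often "fix" mistakes, hiding the very errors that educational assessment is meant to detect. To address this, we propose PINK (Penalty-based INK score), a semantic evaluation metric that uses a large language model (LLM) for rubric-based evaluation and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals significant ranking reversals relative to BLEU: models such as GPT-4o are heavily penalized for over-correction, while Gemini 2.5 Flash emerges as the most faithful transcriber. In addition, a human expert study shows that PINK aligns significantly better with human judgment (a 55.0% preference rate versus 39.5% for BLEU), providing a more reliable evaluation framework for handwritten math OCR in educational settings.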
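The over-correction failure mode described in the abstract can be illustrated with a small toy example. This is only a sketch under strong assumptions, not the paper's PINK metric: PINK uses an LLM with a rubric, whereas this sketch uses a character-overlap score from Python's `difflib` as a stand-in for a lexical metric, and it assumes a reference "correct" solution is available for each line. The function names, the `penalty` parameter, and the sample data are all hypothetical.

```python
from difflib import SequenceMatcher


def line_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a toy stand-in for a lexical metric."""
    return SequenceMatcher(None, a, b).ratio()


def is_over_corrected(student: str, model: str, correct: str) -> bool:
    """True when the student's line was wrong and the model emitted the correct line instead."""
    return student != correct and model == correct


def lexical_score(student_lines, model_lines) -> float:
    """Mean per-line similarity between the student's work and the model transcription."""
    sims = [line_similarity(s, m) for s, m in zip(student_lines, model_lines)]
    return sum(sims) / len(sims)


def penalized_score(student_lines, model_lines, correct_lines, penalty: float = 1.0) -> float:
    """Like lexical_score, but subtracts `penalty` (floored at 0) on over-corrected lines."""
    total = 0.0
    for s, m, c in zip(student_lines, model_lines, correct_lines):
        sim = line_similarity(s, m)
        if is_over_corrected(s, m, c):
            sim = max(0.0, sim - penalty)
        total += sim
    return total / len(student_lines)


# Hypothetical data: the student makes an arithmetic slip on the second line.
student = ["2x + 3 = 7", "2x = 10", "x = 5"]
correct = ["2x + 3 = 7", "2x = 4", "x = 2"]
faithful = list(student)        # transcribes the mistake as written
over_corrected = list(correct)  # silently "fixes" the student's work

for name, output in [("faithful", faithful), ("over-corrected", over_corrected)]:
    print(name, round(lexical_score(student, output), 3),
          round(penalized_score(student, output, correct), 3))
```

In this toy setting the over-corrected transcription still gets a high overlap score because only a few characters differ, while the penalized score drops sharply. That is the kind of ranking reversal the abstract reports, although the paper's actual scoring procedure may differ substantially.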
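Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2604.22774 [cs.CY]
  (or arXiv:2604.22774v2 [cs.CY] for this version)
  https://doi.org/10.48550/arXiv.2604.22774

arXiv-issued DOI via DataCite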

Submission history

From: Jin Seong
[v1] Wed, 1 Apr 2026 05:27:52 UTC (2,486 KB)
[v2] Tue, 26 May 2026 06:37:15 UTC (2,494 KB)