























We introduce PulseBench-Tab, an open multilingual benchmark for evaluating table extraction from document images. The benchmark comprises 1,820 human-annotated tables spanning 9 languages and 4 scripts (Latin, CJK, Arabic, Cyrillic), drawn from 380 real-world source documents including financial filings, government reports, and regulatory disclosures. Tables range from 2 to 1,183 cells, with 48.1% containing merged or spanning cells. Alongside the dataset, we propose T-LAG (Table Logical Adjacency Graph), a novel evaluation metric that models tables as directed graphs over cell adjacencies and computes structural and content fidelity in a single score via optimal bipartite matching. We evaluate 9 commercial and open-source table extraction systems across the benchmark and report per-language breakdowns. The full dataset, scoring code, and all provider outputs are publicly available.
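The abstract describes T-LAG only at a high level, so the following is a toy illustration of the general idea (cells as graph nodes, directed adjacency edges, optimal bipartite matching between ground-truth and predicted cells), not the authors' metric. The cell tuple layout, the function names `adjacency_edges`, `best_matching`, and `toy_graph_score`, the use of `difflib` for text similarity, and the edge-credit formula are all assumptions made for illustration. The brute-force matcher stands in for a proper assignment solver and is only practical for very small tables.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Assumed cell layout (hypothetical): (row, col, rowspan, colspan, text)


def grid_of(cells):
    """Map every occupied (row, col) grid position to the index of its cell."""
    grid = {}
    for idx, (r, c, rs, cs, _) in enumerate(cells):
        for dr in range(rs):
            for dc in range(cs):
                grid[(r + dr, c + dc)] = idx
    return grid


def adjacency_edges(cells):
    """Directed edges (src, dst, direction) between neighbouring cells."""
    grid = grid_of(cells)
    edges = set()
    for (r, c), src in grid.items():
        for direction, pos in (("right", (r, c + 1)), ("down", (r + 1, c))):
            dst = grid.get(pos)
            if dst is not None and dst != src:  # skip positions inside one span
                edges.add((src, dst, direction))
    return edges


def similarity(a, b):
    """Text similarity in [0, 1]; difflib is a stand-in, not the paper's choice."""
    return SequenceMatcher(None, a, b).ratio()


def best_matching(gt, pred):
    """Brute-force optimal assignment of gt cell index -> pred cell index or None."""
    n, m = len(gt), len(pred)
    # Pad with None so that every ground-truth cell has a slot to take.
    slots = list(range(m)) + [None] * max(0, n - m)
    best, best_score = {}, -1.0
    for perm in permutations(slots, n):
        score = sum(
            similarity(gt[i][4], pred[j][4])
            for i, j in enumerate(perm)
            if j is not None
        )
        if score > best_score:
            best, best_score = dict(enumerate(perm)), score
    return best


def toy_graph_score(gt, pred):
    """Single score combining structure (edges) and content (text similarity)."""
    match = best_matching(gt, pred)
    gt_edges, pred_edges = adjacency_edges(gt), adjacency_edges(pred)
    denom = max(len(gt_edges), len(pred_edges))
    if denom == 0:
        return 0.0  # degenerate case (no adjacencies); not handled in this sketch
    credit = 0.0
    for src, dst, direction in gt_edges:
        p_src, p_dst = match[src], match[dst]
        if p_src is None or p_dst is None:
            continue
        if (p_src, p_dst, direction) in pred_edges:
            # An edge earns credit scaled by how well both endpoints' text matches.
            credit += similarity(gt[src][4], pred[p_src][4]) * similarity(
                gt[dst][4], pred[p_dst][4]
            )
    return credit / denom


ground_truth = [
    (0, 0, 1, 1, "Year"), (0, 1, 1, 1, "Revenue"),
    (1, 0, 1, 1, "2023"), (1, 1, 1, 1, "41.2"),
]
# A prediction that wrongly merges the header row into one spanning cell.
merged_header = [
    (0, 0, 1, 2, "Year Revenue"),
    (1, 0, 1, 1, "2023"), (1, 1, 1, 1, "41.2"),
]

perfect = toy_graph_score(ground_truth, ground_truth)  # identical tables score 1.0
degraded = toy_graph_score(ground_truth, merged_header)
```

In this sketch an incorrectly merged header lowers the score because edges that exist in the ground truth have no counterpart in the prediction. A real implementation would replace the permutation search with a polynomial-time assignment solver, since the benchmark's tables reach 1,183 cells.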