





















Abstract: Existing benchmarks and training pipelines for industrial intelligent customer service (ICS) remain misaligned with real-world dialogue requirements: they overemphasize verifiable task success while under-measuring subjective service quality and realistic failure modes, leaving a gap between offline gains and deployable dialogue behavior. We close this gap with a benchmark-to-optimization loop. First, we introduce OlaBench, an ICS benchmark spanning retrieval-augmented generation, workflow-based systems, and agentic settings, which evaluates service capability, safety, and latency sensitivity. Second, motivated by OlaBench results showing that state-of-the-art LLMs still fall short, we propose OlaMind, which distills reusable reasoning patterns and service strategies from expert dialogues and applies staged exploration-exploitation reinforcement learning with instance-level rubric-aware guidance to improve model capability. OlaMind surpasses GPT-5.2 and Gemini 3 Pro on OlaBench (83.64 vs. 70.58 and 70.84, respectively) and, in online A/B tests, delivers an average +23.67% in issue resolution and -6.6% in human transfer rate versus the baseline, bridging offline gains to deployment. Together, OlaBench and OlaMind advance ICS systems toward more anthropomorphic, professional, and reliable deployment. The project page and evaluation are available at this https URL.
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2510.22143 [cs.CL] |
| | (or arXiv:2510.22143v3 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2510.22143 (arXiv-issued DOI via DataCite) |
From: Tianhong Gao
[v1] Sat, 25 Oct 2025 03:29:55 UTC (5,283 KB)
[v2] Thu, 8 Jan 2026 09:45:39 UTC (10,951 KB)
[v3] Mon, 25 May 2026 04:01:13 UTC (13,855 KB)