




















Abstract: Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each model (Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B) is run locally at temperature 0 with the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear, and the native author spot-checks a stratified 80-row sample. We find large gaps in refusal rate (English minus Somali) for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.
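The abstract reports English-minus-Somali refusal-rate gaps with 95% bootstrap confidence intervals, and judge/native-author agreement as Cohen's kappa. The Python sketch below shows one way such statistics could be computed from per-prompt judge labels. It is not the authors' code: it assumes a paired percentile bootstrap that resamples prompts with replacement (keeping each English/Somali pair together), 10,000 resamples by default, and that both `complied` and `unclear` count as non-refusals. The function names and toy labels are hypothetical.

```python
import math
import random
from collections import Counter


def refusal_rate(labels):
    """Share of responses labeled "refused"; "complied" and "unclear" count as non-refusals."""
    return sum(1 for label in labels if label == "refused") / len(labels)


def paired_bootstrap_gap(en_labels, so_labels, n_boot=10_000, alpha=0.05, seed=0):
    """English-minus-Somali refusal gap with a percentile bootstrap CI.

    Assumption (not stated in the abstract): prompts are resampled with
    replacement as pairs, so each draw keeps both language versions together.
    """
    if len(en_labels) != len(so_labels):
        raise ValueError("English and Somali label lists must be paired")
    n = len(en_labels)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    point = refusal_rate(en_labels) - refusal_rate(so_labels)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample prompt indices
        en_rate = refusal_rate([en_labels[i] for i in idx])
        so_rate = refusal_rate([so_labels[i] for i in idx])
        gaps.append(en_rate - so_rate)
    gaps.sort()
    # Percentile interval: take the alpha/2 and 1 - alpha/2 order statistics.
    lower = gaps[int(alpha / 2 * n_boot)]
    upper = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, (lower, upper)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters (e.g. LLM judge vs. native author) on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return math.nan  # both raters used one identical category: kappa is undefined
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels for 10 paired prompts (not data from the paper).
en = ["refused"] * 9 + ["complied"]
so = ["refused"] * 3 + ["unclear"] * 5 + ["complied"] * 2
gap, (lo, hi) = paired_bootstrap_gap(en, so)
print(f"gap={gap:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
print(cohens_kappa(["refused", "complied", "complied", "refused"],
                   ["refused", "complied", "complied", "complied"]))  # 0.5
```

On the toy labels the point gap is 0.9 - 0.3 = 0.6. Resampling pairs rather than each language independently respects the paired design of the benchmark; a kappa of 1.00, as reported for the 80-row spot-check, corresponds to observed agreement of 1 whenever chance agreement is below 1.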
| Comments: | 12 pages, 3 figures, 4 tables. Code: this https URL. Dataset: this https URL |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) |
| ACM classes: | I.2.7 |
| Cite as: | arXiv:2605.25420 [cs.CL] |
| | (or arXiv:2605.25420v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.25420 arXiv-issued DOI via DataCite (pending registration) |
From: Khalid Yusuf Dahir Mr
[v1] Mon, 25 May 2026 04:45:44 UTC (33 KB)