
























Abstract: AI models are already deployed in societies affected by armed conflict, where journalists, humanitarian workers, governments and ordinary citizens rely on them for information and in their day-to-day work. No established practice exists for checking whether their outputs can make those conflicts worse. We tested nine model configurations from four providers (OpenAI, Anthropic, DeepSeek, xAI) on 90 multi-turn scenarios designed to surface misaligned behaviour in conflict contexts: false equivalence between documented atrocities, denial of genocide, and failure to recognise ethnic slurs, among others. When such outputs feed into journalism, humanitarian reporting, or public debate, they can deepen divisions in fragile societies. Failure rates range from 6% for the best-performing model to 47% for the worst, which makes model choice a safety question in its own right. When users pushed for "balance" in cases where international courts have already assigned responsibility, five of nine configurations failed 80% to 100% of the time. We release the first evaluation framework for this domain and propose adding it to alignment evaluation portfolios.
| Comments: | Preprint. 8 pages, 2 figures. Code and evaluation framework: this https URL |
| Subjects: | Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC) |
| ACM classes: | I.2.7; K.4.1 |
| Cite as: | arXiv:2605.22720 [cs.AI] (or arXiv:2605.22720v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.22720 (arXiv-issued DOI via DataCite, pending registration) |
From: Andrii Kryshtal
[v1] Thu, 21 May 2026 16:55:39 UTC (68 KB)