This paper has been withdrawn by Niruthiha Selvanayagam
Abstract: As Large Multimodal Models (LMMs) become integral to daily digital life, understanding their safety architectures is a critical problem for AI alignment. This paper presents a systematic analysis of OpenAI's GPT-4o mini, a globally deployed model, on the difficult task of multimodal hate speech detection. Using the Hateful Memes Challenge dataset, we conduct a multi-phase investigation on 500 samples to probe the model's reasoning and failure modes. Our central finding is the experimental identification of a "Unimodal Bottleneck," an architectural flaw in which the model's advanced multimodal reasoning is systematically preempted by context-blind safety filters. A quantitative validation of 144 content policy refusals reveals that these overrides are triggered in equal measure by unimodal visual (50%) and textual (50%) content. We further demonstrate that this safety system is brittle: it blocks not only high-risk imagery but also benign, common meme formats, leading to predictable false positives. These findings expose a fundamental tension between capability and safety in state-of-the-art LMMs, highlighting the need for more integrated, context-aware alignment strategies so that AI systems can be deployed both safely and effectively.
| Comments: | This paper reports preliminary findings from a small-scale study whose sample size is insufficient to support the stated conclusions. The authors are withdrawing it to conduct a more comprehensive evaluation. |
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2509.13608 [cs.LG] (or arXiv:2509.13608v2 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2509.13608 (arXiv-issued DOI via DataCite) |
From: Niruthiha Selvanayagam
[v1] Wed, 17 Sep 2025 00:46:42 UTC (4,147 KB)
[v2] Sat, 23 May 2026 00:20:32 UTC (1 KB) (withdrawn)