

















Abstract:Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct long inputs by combining neutral and harmful sentences, and systematically vary four factors: input length (600--30,000 tokens), the proportion of harmful sentences (0.01--0.50), harm realization (explicit vs. implicit), and the position of harmful sentences within the input (beginning, middle, end), enabling a controlled stress-test evaluation. Experiments across toxic, offensive, and hate content, and across LLaMA-3.1, Qwen-2.5, and Mistral, reveal consistent patterns: sensitivity is non-monotonic with respect to harmful prevalence, peaking at moderate levels; sensitivity degrades as input length increases; harmful sentences placed earlier in the input are more strongly prioritized; and explicit harm is more reliably identified than implicit harm. These findings provide a systematic view of how LLMs prioritize harmful sentences in long input under controlled stress conditions, highlighting both emerging strengths and remaining challenges for safety-related use.
| Subjects: | Computation and Language (cs.CL); Computers and Society (cs.CY) |
| Cite as: | arXiv:2510.05864 [cs.CL] |
| (or arXiv:2510.05864v2 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2510.05864 arXiv-issued DOI via DataCite |
From: Faeze Ghorbanpour [view email]
[v1]
Tue, 7 Oct 2025 12:33:21 UTC (7,316 KB)
[v2]
Tue, 26 May 2026 13:54:15 UTC (7,334 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。