AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap

Jaykumar Kasundra · 2025-12-23 · via Hugging Face - Blog

Back to Articles

Large Language Models (LLMs) have rapidly evolved from text-only assistants into complex agentic systems capable of performing multi-step reasoning, calling external tools, retrieving memory, and executing code. With this evolution comes an increasingly sophisticated threat landscape: not only traditional content safety risks, but also multi-turn jailbreaks, prompt injections, memory hijacking, and tool manipulation.

In this work, we introduce AprielGuard, an 8B parameter safety–security safeguard model designed to detect:

16 categories of safety risks, spanning toxicity, hate, sexual content, misinformation, self-harm, illegal activities, and more.
Wide range of adversarial attacks, including prompt injection, jailbreaks, chain-of-thought corruption, context hijacking, memory poisoning, and multi-agent exploit sequences.
Safety violations and adversarial attacks in agentic workflows, including tool calls and model reasoning traces.

AprielGuard is available in both reasoning and non-reasoning modes, enabling explainable classification when needed and low-latency classification for production pipelines.

Model: https://huggingface.co/ServiceNow-AI/AprielGuard
Technical Paper: https://arxiv.org/abs/2512.20293

Motivation
AprielGuard Overview
Taxonomy
Training Dataset
Model Architecture
Training Setup
Evaluation
Conclusion
Limitations

Motivation

Traditional safety classifiers primarily focus on a limited classification spectrum (e.g., toxicity or self-harm), assume short inputs, and evaluate single user messages. Modern deployments, however, feature:

Multi-turn conversations
Long contexts
Structured reasoning steps producing chains of thought
Tool-assisted multi-step workflows (agents)
A growing class of adversarial attacks exploiting reasoning, tools, or memory

As a result, production teams increasingly rely on workarounds: multiple guard models for different stages, regex filters, static rules, or hand-crafted heuristics. These approaches are brittle and do not scale.

AprielGuard addresses these issues with a unified model and a unified safety + adversarial taxonomy, built specifically for modern LLM agent ecosystems.

AprielGuard Overview

AprielGuard operates across three input formats:

Standalone Prompt
Multi-turn Conversation
Agentic Workflow (tool calls, reasoning traces, memory, system context)

It outputs:

Safety classification and a list of violated categories from the taxonomy
Adversarial attack classification
Optional structured reasoning explaining the decision

AprielGuard overview

Taxonomy

A. Safety Taxonomy

Category	Description
O1	Toxic Content
O2	Unfair Representation
O3	Adult Content
O4	Erosion of Trust in Public Information
O5	Propagating Misconceptions/False Beliefs
O6	Risky Financial Practices
O7	Trade and Compliance
O8	Dissemination of Dangerous Information
O9	Privacy Infringement
O10	Security Threats
O11	Defamation
O12	Fraud or Deceptive Action
O13	Influence Operations
O14	Illegal Activities
O15	Persuasion and Manipulation
O16	Violation of Personal Property

(These 16 categories are inspired from SALAD-Bench)

B. Adversarial Attack Taxonomy

The model detects and evaluates a wide range of adversarial prompt patterns designed to manipulate model behavior or evade safety mechanisms. The model outputs a binary classification (e.g., adversarial / non_adversarial) rather than fine-grained attack categories.

The training data covers diverse adversarial types such as role-playing, world-building, persuasion, and stylization, among many other complex prompt manipulation strategies. These examples represent only a subset of the broader adversarial scenarios incorporated in the training data.

Training Dataset

Synthetic data: AprielGuard is trained on a synthetically generated training dataset. The training data points are generated at a sub-topic level of the taxonomy for better coverage. We leverage Mixtral-8x7B and internally developed uncensored models to generate unsafe content for training purposes. Models were prompted with higher temperature to induce output variation. Prompting templates are meticulously tailored to ensure accurate data generation. Adversarial attacks are constructed using a combination of synthetic data points, diverse prompt templates, and rule-based generation techniques. We leveraged NVIDIA NeMo Curator to generate large-scale, multi-turn conversational datasets featuring complex, realistic scenarios with iterative and evolving attacks through context switches. This approach enabled us to systematically synthesize diverse interaction patterns, improving the robustness of the model to long-horizon reasoning, adversarial turns, and evolving user intent. We also used SyGra framework for synthetic data generation processes for harmful prompts and attacks generation. The training dataset encompasses diverse content formats such as conversational dialogues, forum posts, tweets, instructional prompts, questions, and how-to guides.
Data augmentation: To enhance model robustness, a range of data augmentation techniques were applied to the training data. These augmentations are designed to expose the model to natural variations and perturbations that commonly occur in real-world scenarios. Specifically, the dataset includes transformations such as character-level noise, insertion of typographical errors, leetspeak substitutions, word-level paraphrasing, and syntactic reordering. Such augmentations help the model generalize better by reducing sensitivity to superficial variations in input, thereby improving resilience against adversarial manipulations and non-standard text representations.
Agentic workflows: Agentic workflows represent real-world scenarios where autonomous agents execute multi-step tasks involving planning, reasoning, and interaction with tools, APIs, and other agents. These workflows often include sequences of user prompts, system messages, intermediate reasoning steps, and tool invocations, making them susceptible to diverse attack vectors. To construct these training data points, we synthetically generate a wide range of scenarios across multiple domains, capturing realistic agentic interactions between a user and an agentic system. Each data point is enriched with detailed contextual elements—including tool definitions, tool invocation logs, agent roles and policies, execution traces, conversation history, memory states, and scratch-pad reasoning. For malicious or adversarial examples, we corrupt the relevant segment of the workflow to reflect a specific attack vector. Depending on the scenario, this may involve modifying user prompts, altering intermediate reasoning traces, modifying the tool outputs, injecting false memory states, or disrupting inter-agent communication. By systematically perturbing different components of the agentic workflow, we produce high-fidelity examples that expose a model to a diverse spectrum of realistic and challenging attack patterns. Each data point was simulated to reflect realistic executions, incorporating both benign and adversarial sequences.
Long context use cases: We curated a specialized long context dataset composed of diverse, high-length use cases such as Retrieval-Augmented Generation (RAG) work-flows, multi-turn conversational threads, incident details, and operational reports containing detailed communications. These examples simulate real-world environments where large text contexts are typical.

Synthetic data generation flow

Model Architecture

AprielGuard is built on top of an Apriel-1.5 Thinker Base variant, downscaled to an 8B configuration for efficient deployment.

Causal decoder-only transformer
Dual-mode operation:
- Reasoning Mode → emits structured explanations
- Fast Mode → classification only

Training Setup

Parameter	Value
Base Model	Apriel 1.5 Thinker Base (downscaled)
Model Size	8B parameters
Precision	bfloat16
Batch Size	1 with grad-accumulation = 8
LR	2e-4
Optimizer	Adam (β1=0.9, β2=0.999)
Epochs	3
Sequence Length	Up to 32k
Reasoning Mode	Enabled/Disabled via instruction template

Evaluation Summary

AprielGuard is evaluated across:

Public safety benchmarks
Public adversarial benchmarks
Internal Agentic workflow benchmarks
internal Long-context use case benchmarks (up to 32k)
Multilingual evaluation (8 languages)

Safety Benchmark Results

AprielGuard performance on the public safety benchmarks.

A comparative assessment of model performance using aggregated results from safety benchmarks.

Adversarial Detection Results

AprielGuard performance on the public adversarial benchmarks.

A comparative assessment of model performance using aggregated results from adversarial benchmarks.

Agentic Workflow Evaluation

We curated an internal benchmark dataset aimed at evaluating the detection of Safety Risks and Adversarial Attacks within agentic workflows. To construct this benchmark, we systematically designed multiple attack scenarios targeting different components of the workflow—such as prompt inputs, reasoning traces, tool parameters, memory states, and inter-agent communications. Each instance was annotated according to the taxonomy of vulnerabilities. Each workflow was simulated to reflect realistic executions, incorporating both benign and adversarial sequences. The dataset captures granular attack points across various stages such as planning, reasoning, execution, and response generation to provide fine-grained evaluation of model robustness. Overall, the dataset comprises a balanced mixture of safety risks and adversarial attacks.

Safety performance of different models on the agentic benchmark.

Adversarial performance of different models on the agentic benchmark..

Long-Context Robustness (Upto 32k Tokens)

Many real world safety or adversarial risks do not manifest in short, isolated text snippets, but rather emerge across use cases such as Retrieval-Augmented Generation (RAG) workflows, multi-turn conversational threads, organizational incident details, and operational reports containing detailed communications. A guardian model must therefore detect subtle or "needle-in-a-haystack" cases, where malicious or manipulative content is sparsely distributed, embedded across multiple references, or intentionally obscured within benign text.

To evaluate AprielGuard’s long-context reasoning capabilities, we curated a specialized test dataset composed of diverse, high-length use cases. We considered the data upto 32k tokens for this evaluation. The baseline data was initially constructed from benign content representative of these domains. Malicious elements were then systematically injected to simulate adversarial or unsafe scenarios while maintaining the overall coherence of the text. For example, in an incident case summarization, an injection could be embedded within the case description, hidden in a metadata section, or inserted as part of a comment thread. Similarly, in multi-turn dialogue data, adversarial content might appear mid-conversation, near the end or at the beginning to test long range dependency tracking.

Safety Risks performance

Model	Reasoning	Precision ↑	Recall ↑	F1 ↑	FPR ↓
AprielGuard-8B	Without	0.99	0.96	0.97	0.01
AprielGuard-8B	With	0.92	0.98	0.95	0.11

Adversarial Attacks performance

Model	Reasoning	Precision ↑	Recall ↑	F1 ↑	FPR ↓
AprielGuard-8B	Without	1.00	0.78	0.88	0.00
AprielGuard-8B	With	0.93	0.94	0.94	0.10

Multilingual evaluation

A major limitation in the current landscape of content moderation research is the scarcity of high- quality multilingual benchmarks. To address this gap and comprehensively assess the multilingual capabilities of AprielGuard, we extended the Safety Risks benchmarks and Adversarial Attack benchmarks into multiple non-English languages. The translation process was conducted using the MADLAD400-3B-MT model, a multilingual machine translation model based on the T5 architecture.

For this study, we selected eight of the most widely used non-English languages to ensure broad linguistic and geographical coverage: French, French-Canadian, German, Japanese, Dutch, Spanish, Portuguese-Brazilian, and Italian. Each instance from the English Safety and Adversarial benchmarks was translated into the eight target languages. During translation, we preserved the original English role identifiers, such as User: and Assistant:, while translating only the conversational content. This design choice ensures alignment with AprielGuard’s moderation framework, where the role context plays a crucial part in evaluating safety and adversarial intent.

Multilingual performance of AprielGuard

Conclusion

AprielGuard unifies safety, security, and agentic robustness into a single guardian model capable of handling:
- Comprehensive safety risk classification
- Adversarial attack detection, including prompt injection and jailbreak attempts
- Various input modalities, such as standalone prompts, multi-turn conversations, and full agentic workflows
- Long-context inputs
- Multilingual inputs
- Explainable reasoning

As LLMs move toward deeply integrated agentic systems, the need for unified pipelines becomes more critical. AprielGuard is a step toward that future — reducing complexity, improving coverage, and offering a scalable foundation for trustworthy AI deployments.

Limitations

Language Coverage: While AprielGuard has been primarily trained on English data, limited testing indicates it performs reasonably well across several languages, including: English, German, Spanish, French, French (Canada), Italian, Dutch, and Portuguese (Brazil). However, thorough testing and calibration are strongly recommended before deploying the model for production use in non-English settings.
Adversarial Robustness: Despite targeted training on adversarial and manipulative behaviors, the model may still exhibit vulnerability to complex or unseen attack strategies.
Domain Sensitivity: AprielGuard may underperform on highly specialized or technical domains (e.g., legal, medical, or scientific contexts) that require nuanced contextual understanding.
Latency–Interpretability Trade-off: Enabling reasoning traces enhances explainability but increases latency and compute cost. For low-latency or large-scale use cases, non-reasoning mode is recommended.
Reasoning Mode Sensitivity: The model exhibits occasional inconsistencies in classification outcomes between reasoning-enabled and non-reasoning inference modes.
Intended use: AprielGuard is intended strictly for use as a safeguard and risk assessment model. It classifies potential safety risks and adversarial threats according to the AprielGuard unified taxonomy. Any deviation from the prescribed inference may lead to unintended, unsafe, or unreliable behavior.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hugging Face - Blog

Table of Contents

Motivation

AprielGuard Overview

Taxonomy

A. Safety Taxonomy

B. Adversarial Attack Taxonomy

Training Dataset

Model Architecture

Training Setup

Evaluation Summary

Safety Benchmark Results

Adversarial Detection Results

Agentic Workflow Evaluation

Long-Context Robustness (Upto 32k Tokens)

Multilingual evaluation

Conclusion

Limitations