Harness, Scaffold, and the AI Agent Terms Worth Getting Right
2026-05-25 · via Hugging Face - Blog


By Sergio Paniego and Aritra Roy Gosthipaty

When a field evolves quickly, its vocabulary often evolves faster than its shared understanding. Terms start to blur, get reused in different contexts, or become shorthand for ideas that are never fully explained. We are currently seeing this happen in the field of AI Agents, where concepts are getting mixed together, some are renamed, and others are widely used for a few months before quietly disappearing.

This can be overwhelming for newcomers, and even for practitioners trying to keep up with the latest developments. After ICLR 2026, one of us (@ariG23498) posted a question that captured this confusion well:

"What do you mean by the terms 'harness' and 'scaffold' in the context of agents? I have heard a lot of explanations while I was at ICLR, but I could not understand why they did not converge to a single explanation."

This glossary is our attempt to ground the terms that keep coming up without clear, consistent explanations. It is not meant to be a comprehensive dictionary of every term in the field. Instead, we focus on the concepts that are often mixed up, reused in different ways, or assumed to be obvious when they are not.

Most of these terms come up whether you're building an agent, deploying one, or just using tools like Claude Code, Codex, or Hermes Agent. The last section covers concepts specific to training models, which is more relevant if you work on that side of things.

Many of these terms don't have universally accepted definitions yet, and different frameworks use the same word differently. The goal here is not to enforce one correct vocabulary, but to provide a practical mental model that makes discussions easier to follow.

Let's get started.


Model

The model is the LLM: it takes text in and produces text out (e.g., Claude, Qwen, GPT, Kimi, DeepSeek…). On its own, it has no memory between calls and no loop: it answers one prompt and stops. It can express the intent to call a tool, but it needs a harness to actually execute that call. Wrap it in scaffolding and a harness and it becomes an agent.

Scaffolding

The behavior-defining layer around the model: system prompt, tool descriptions, how the model's responses get parsed, what it remembers across steps (context management). It shapes how the model sees the world and acts in it, whether during training or at inference.
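To make these pieces concrete, here is a minimal sketch of scaffolding as plain data plus two helpers. The `Scaffold` class, its field names, and the JSON tool-call convention are all hypothetical, chosen for illustration rather than taken from any framework.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Scaffold:
    """Hypothetical container for what the model works from."""
    system_prompt: str
    tool_descriptions: dict[str, str] = field(default_factory=dict)
    max_history: int = 4  # context management: how many past steps are kept

    def build_prompt(self, history: list[str], user_message: str) -> str:
        """Assemble what the model sees on this step."""
        tools = "\n".join(
            f"- {name}: {desc}" for name, desc in self.tool_descriptions.items()
        )
        recent = history[-self.max_history:] if self.max_history > 0 else []
        return "\n".join([self.system_prompt, "Tools:", tools, *recent, user_message])

    def parse_response(self, text: str) -> dict:
        """Assumed convention: a JSON object with a "tool" key is a tool call;
        anything else is treated as the final answer."""
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return {"type": "answer", "content": text}
        if isinstance(data, dict) and "tool" in data:
            return {"type": "tool_call", "tool": data["tool"], "args": data.get("args", {})}
        return {"type": "answer", "content": text}
```

Nothing here runs the model or executes a tool: the scaffold only decides what the model is shown and how its output is read.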

Products like Claude Code, Codex, and Antigravity CLI call the whole thing a harness. Claude Code's own docs say it directly: "Claude Code serves as the agentic harness around Claude." That's the broad use: harness means everything that isn't the model. The scaffold/harness distinction matters most when you need to reason about them separately, as in a training pipeline. You'll also hear "scaffold" used more broadly to cover any infrastructure the harness relies on: hooks, runtime configuration, even directory structure.

Some products like Claude Code and Codex are tightly coupled to their provider's models. Others like Antigravity CLI and Hermes Agent let you plug in any model.

Harness

The execution layer inside the agent: it calls the model, handles its tool calls, decides when to stop. The harness is what makes the agent run. Scaffolding, defined above, is what the model works from: its instructions, its tools, its format.

Harness engineering is the discipline of designing this layer well: deciding when the agent should stop, how errors get handled, and what guardrails keep it on track. It applies at both training and inference. Addy Osmani's piece and OpenAI's account of building with Codex both cover this from the inference side.

At evaluation time, the same pattern shows up as an eval harness: it runs a fixed set of scenarios against a model checkpoint and records metrics, rather than collecting training data and updating weights.
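As a rough illustration, an eval harness can be as small as the function below. It assumes the agent is a callable from input text to output text and that each scenario carries an exact expected answer. `run_eval` and the scenario keys are hypothetical, and real eval harnesses usually score with richer checks than string equality.

```python
from typing import Callable


def run_eval(agent: Callable[[str], str], scenarios: list[dict]) -> dict:
    """Run a fixed scenario set and record metrics; no weights are updated."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    passed = [agent(s["input"]) == s["expected"] for s in scenarios]
    return {
        "total": len(passed),
        "passed": sum(passed),
        "accuracy": sum(passed) / len(passed),
    }


scenarios = [
    {"input": "abc", "expected": "ABC"},
    {"input": "hello", "expected": "hello"},
]
# str.upper stands in for an agent under evaluation.
print(run_eval(str.upper, scenarios))
```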

Agent

The term comes from reinforcement learning, where an agent is simply a function that takes an observation and returns an action. The environment takes that action and returns a new observation, and the loop repeats. That loop is still at the core of how LLM agents work.
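Stripped of any LLM, that loop looks like the sketch below. The agent and environment are deliberately trivial and entirely made up: the observation is an integer, and the agent steps toward zero.

```python
def agent(observation: int) -> int:
    """A function from observation to action: step toward zero."""
    if observation > 0:
        return -1
    if observation < 0:
        return 1
    return 0  # nothing left to do


def environment_step(state: int, action: int) -> int:
    """Apply the action and return the new observation."""
    return state + action


def run_loop(start: int, max_steps: int = 10) -> list[int]:
    observation = start
    trace = [observation]
    for _ in range(max_steps):
        action = agent(observation)
        if action == 0:
            break
        observation = environment_step(observation, action)
        trace.append(observation)
    return trace


print(run_loop(3))  # [3, 2, 1, 0]
```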

In the LLM world, the term has expanded. An agent is a model plus everything around it that lets it act, not just respond. It turns raw text generation into something that can act in a loop: taking in information, deciding what to do, and acting on the results.

Take a coding agent as a concrete example. The system prompt, tool descriptions, and the output format the model follows form the scaffolding. The loop that calls the model, handles its tool calls, and decides when to stop is the harness. At training time, the harness also runs many of these loops in parallel and feeds the results back to update the model.

[Figure: agent diagram showing Harness, Scaffold, and Model as components inside Agent, with Sub-agent below]

In the community, it's usually put as Agent = Model + Harness (@Vtrivedy10 and Will Brown's tweet for reference). If you're not the model, you're the harness. The subtle distinction between harness and scaffold that creates most of the confusion is what the two sections above address.

When people talk about products like Claude Code, Codex, or Cursor, they're referring to a specific harness built on top of a specific model, designed and optimized together. Two products using the same underlying model can feel completely different because their harnesses make different choices. And swapping a better model into the same harness also changes the experience. The model, the harness, and the product are three different things.

Context Engineering

Designing what goes into the agent's context window, that is, what the model sees at each step: system prompt, tool descriptions, conversation history, retrieved knowledge. It's not a one-time decision: as the model runs, previous turns shape what goes into future calls, and the harness actively manages this throughout the run. It applies at both training and inference, but the cost of getting it wrong is very different. At training time, what the model sees shapes what gets learned. Get it wrong and you're retraining. At inference, it's just text: change a prompt and redeploy. The HF Context Engineering Course covers this in depth.

Memory is part of this picture. Short-term memory is what stays in the context window during a single run: conversation history, tool results, previous reasoning. Long-term memory persists across sessions, stored externally and retrieved on demand, then injected back into context when relevant.
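Here is a small sketch of how the two kinds of memory might meet in the context window. Everything in it is an assumption made for illustration: `LongTermMemory` uses naive word overlap where a real system would more likely use embeddings or search, and `build_context` keeps only the last few turns as short-term memory.

```python
class LongTermMemory:
    """Hypothetical external store that persists across sessions."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def store(self, note: str) -> None:
        self.notes.append(note)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return up to k notes sharing at least one word with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(n.lower().split())), n) for n in self.notes]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [note for score, note in ranked if score > 0][:k]


def build_context(
    system_prompt: str,
    history: list[str],
    memory: LongTermMemory,
    user_message: str,
    max_turns: int = 3,
) -> list[str]:
    """Assemble what the model sees on this step."""
    recalled = [f"[memory] {n}" for n in memory.retrieve(user_message)]
    short_term = history[-max_turns:]  # only recent turns stay in the window
    return [system_prompt, *recalled, *short_term, user_message]
```

Calling `build_context` again on every step is what makes this an ongoing decision rather than a one-time one: the history grows, and different notes get recalled as the conversation moves.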

Policy

A policy is the behavior an agent follows: given any situation, it defines the probability of taking each possible action. In LLM systems, part of that policy is learned in the model weights, but the behavior also depends on the surrounding scaffolding and harness. The same model can behave very differently depending on its prompts, tools, memory, and execution loop. A policy is not an agent. The policy defines behavior; the agent is the full system that acts in an environment. Wrap a checkpoint in scaffolding and a harness and deploy it, and you get an agent whose behavior is the policy.
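As a toy illustration of "the probability of taking each possible action", the sketch below turns per-action scores into a distribution with a softmax. The action names and scores are invented, and using a softmax here is an assumption for illustration. In an LLM agent the distribution is over generated tokens and is shaped by the whole context.

```python
import math


def action_probabilities(scores: dict[str, float]) -> dict[str, float]:
    """Softmax over action scores: returns a probability for every action."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {action: math.exp(score - top) for action, score in scores.items()}
    total = sum(exps.values())
    return {action: value / total for action, value in exps.items()}


# Hypothetical scores for one situation in a coding task.
probs = action_probabilities({"run_tests": 2.0, "edit_file": 1.0, "stop": 0.0})
for action, p in probs.items():
    print(f"{action}: {p:.2f}")
```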

Tool Use

How agents reach outside themselves: APIs, code interpreters, databases, web search, file systems. The model expresses the intent to use a tool in a structured format. Modern inference APIs surface this as a first-class object: the harness receives the call directly and routes it to the right function. The result gets fed back into context and the loop continues.
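The routing step might look like the sketch below. The call shape (a `name` plus JSON-encoded `arguments`) resembles what several chat APIs return, but the exact field names, the `TOOLS` registry, and the `get_weather` stub are assumptions made for this example.

```python
import json


def get_weather(city: str) -> str:
    """Stub tool; a real one would call an external API."""
    return f"Sunny in {city}"


# Hypothetical registry mapping tool names to Python functions.
TOOLS = {"get_weather": get_weather}


def route_tool_call(call: dict) -> dict:
    """Execute one structured tool call and return a message for the context."""
    func = TOOLS.get(call["name"])
    if func is None:
        content = f"error: unknown tool {call['name']}"
    else:
        content = func(**json.loads(call["arguments"]))
    return {"role": "tool", "name": call["name"], "content": content}


call = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
print(route_tool_call(call))
```

Returning the error as a message, instead of raising, lets the model see the failure and try something else on the next step.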

Skills

Reusable, structured packages of knowledge that enable multi-step tasks. Where a tool is an action ("run this command"), a skill bundles everything needed to accomplish a goal ("investigate this bug, form a hypothesis, write a fix"). They are portable across agents and loaded on demand. The line between tool, skill, and sub-agent shifts across frameworks. The HF Context Engineering Course covers skills in depth.

Sub-agents

An agent called by another agent to handle a specific subtask. It has its own model and scaffold, reasons independently, and returns a result. The calling agent doesn't need to know how it works internally. This is what separates a sub-agent from a tool (a function call) or a skill (packaged knowledge): a sub-agent can itself reason, use tools, and call further sub-agents.
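A minimal sketch of that relationship, with stub functions standing in for models. The `make_agent` helper and the `DELEGATE <name>: <subtask>` convention are hypothetical. The point is only that the caller sees the sub-agent as a black box that takes a task and returns a result.

```python
from typing import Callable, Optional

Agent = Callable[[str], str]


def make_agent(
    model: Callable[[str], str],
    system_prompt: str,
    sub_agents: Optional[dict[str, Agent]] = None,
) -> Agent:
    """Build an agent with its own model and scaffold (here, just a prompt)."""
    sub_agents = sub_agents or {}

    def run(task: str) -> str:
        output = model(f"{system_prompt}\n{task}")
        # Toy delegation format: "DELEGATE <name>: <subtask>"
        if output.startswith("DELEGATE "):
            name, _, subtask = output[len("DELEGATE "):].partition(": ")
            result = sub_agents[name](subtask)  # internals stay hidden
            return model(f"{system_prompt}\n{task}\n[{name} returned] {result}")
        return output

    return run


def reviewer_model(prompt: str) -> str:
    return "review: looks fine"


def lead_model(prompt: str) -> str:
    if "[reviewer returned]" in prompt:
        return "merged after " + prompt.rsplit("] ", 1)[-1]
    return "DELEGATE reviewer: check the diff"


reviewer = make_agent(reviewer_model, "You review code.")
lead = make_agent(lead_model, "You lead the task.", {"reviewer": reviewer})
print(lead("ship the fix"))
```

Because `reviewer` is itself built with `make_agent`, it could be given its own sub-agents in the same way.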

Training

The terms above apply whether you're training or deploying. The four below are specific to training, where the agent runs through tasks, gets scored, and its model's weights get updated. Every RL training system for LLMs is built around the same pipeline:

[Figure: RL training pipeline showing RL Environment, Trainer, and Reward connected by rollout and updated policy]

RL Environment

The environment is anything you can interact with: a stateful object that takes an action as input, updates its internal state, and returns an observation. In the LLM context, actions are typically tool calls. A filesystem is a simple example: the action touch foo.txt updates the state by creating the file, and the observation might be the updated file listing. Definitions vary across frameworks.
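The filesystem example can be written as a tiny in-memory environment. This is a sketch under assumptions: `FileSystemEnv` and its `step` method are hypothetical names, it supports only three made-up commands, and it never touches the real disk.

```python
class FileSystemEnv:
    """Stateful toy environment: action in, state update, observation out."""

    def __init__(self) -> None:
        self.files: set[str] = set()  # internal state

    def step(self, action: str) -> str:
        parts = action.split()
        if len(parts) == 2 and parts[0] == "touch":
            self.files.add(parts[1])
        elif len(parts) == 2 and parts[0] == "rm":
            self.files.discard(parts[1])
        elif action.strip() != "ls":
            return f"error: unsupported action {action!r}"
        # Observation: the updated file listing.
        return " ".join(sorted(self.files))


env = FileSystemEnv()
print(env.step("touch foo.txt"))  # foo.txt
print(env.step("touch bar.txt"))  # bar.txt foo.txt
```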

We recently published a dedicated guide on this, so rather than compress it here, see The Ultimate Guide to RL Environments for a complete breakdown of types, frameworks, and examples.

Trainer

The trainer is what makes the agent better: it runs many agent episodes, scores the results, and uses them to update the inner model's weights. TRL's GRPOTrainer is a concrete example: a single class that handles episode generation, reward scoring, and weight updates.
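To show the three jobs in one place without any dependencies, here is a deliberately tiny trainer. It is not TRL's implementation and not a faithful RL algorithm: the "weights" are one preference number per action, the reward function is made up, and the update simply nudges each sampled action by its reward minus the batch average.

```python
import math
import random


def softmax(prefs: dict[str, float]) -> dict[str, float]:
    top = max(prefs.values())
    exps = {a: math.exp(p - top) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def reward_fn(action: str) -> float:
    """Hypothetical scorer: only writing tests earns a reward."""
    return 1.0 if action == "write_tests" else 0.0


def train(prefs: dict[str, float], steps: int = 50, batch_size: int = 8,
          lr: float = 0.5, seed: int = 0) -> dict[str, float]:
    rng = random.Random(seed)
    prefs = dict(prefs)  # the "weights" being trained
    for _ in range(steps):
        probs = softmax(prefs)
        # 1. Run many (one-step) episodes by sampling from the current policy.
        actions = rng.choices(list(probs), weights=list(probs.values()), k=batch_size)
        # 2. Score the results.
        rewards = [reward_fn(a) for a in actions]
        baseline = sum(rewards) / len(rewards)
        # 3. Update the weights: better-than-average actions go up.
        for action, reward in zip(actions, rewards):
            prefs[action] += lr * (reward - baseline)
    return prefs


trained = train({"write_tests": 0.0, "skip_tests": 0.0})
print(softmax(trained))
```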

Rollout

A rollout is one full agent run from start to finish: what the agent saw, what it did, and what reward it got at each step. It's also called a trajectory or a trace, depending on the context. This is the raw data RL algorithms learn from.
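As a data structure, a rollout can be as plain as a list of steps. The `Step` and `Rollout` classes below are hypothetical, not taken from any library. The example uses a sparse reward, so every step scores zero except the last.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    observation: str  # what the agent saw
    action: str       # what it did
    reward: float     # what it got for this step


@dataclass
class Rollout:
    steps: list[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        return sum(step.reward for step in self.steps)


rollout = Rollout([
    Step("tests failing", "read traceback", 0.0),
    Step("bug located", "edit file", 0.0),
    Step("tests passing", "stop", 1.0),  # sparse: scored only at the end
])
print(rollout.total_reward())  # 1.0
```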

Reward

The score that tells the training algorithm how well the agent did. Rewards vary along two axes. By source, a reward can be verifiable (tests pass/fail, answer matches) or learned (human preferences, LLM-as-judge). By frequency, it can be sparse (one score at the end of an episode) or dense (a score at each step). This is what the trainer uses to actually update the inner model's weights. For a thorough breakdown of each type, see the Reward Architecture section in Adithya's guide.

Rubrics break the reward into explicit dimensions with weights, rather than a single number. OpenEnv and Verifiers implement rubrics as objects you can combine (WeightedSum, Sequential, Gate).
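The idea of combining weighted dimensions can be sketched with two small functions. These are hypothetical helpers written for this example, not the OpenEnv or Verifiers API. They only borrow the names of the weighted-sum and gate ideas.

```python
def weighted_sum(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores into one reward, normalized by total weight."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def gate(passed: bool, score: float) -> float:
    """Zero out the reward unless a hard requirement is met."""
    return score if passed else 0.0


scores = {"correctness": 1.0, "style": 0.5}
weights = {"correctness": 3, "style": 1}

reward = weighted_sum(scores, weights)
print(reward)                # 0.875
print(gate(False, reward))   # 0.0, e.g. when the code does not compile
```

Keeping the dimensions separate until the last step makes it possible to log each one, which helps when a reward looks fine overall but one dimension is quietly failing.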

Learn More

If any definition feels imprecise or you've encountered a term we've missed, we'd love to hear from you.

Thanks to Pedro Cuenca, Quentin Gallouédec, Shaun Smith, and Adithya S Kolavi for reviewing this post.