Building Reliable Autonomous Agentic AI
Tony Karrer · 2026-01-13 · via AI Archives – TechEmpower

Over the past few years, CTOs have been building LLM-based systems as DAG (directed acyclic graph) workflows, where the developer defines the steps and their order. Autonomous agentic systems are a different sport. Reliability has always been a key question for LLM systems, and it becomes even more critical when a model can take actions (call tools, write to systems, trigger workflows). There’s incredible power here, but also big challenges.

A few definitions to start

Autonomous agentic system: an LLM wrapped in a loop that can plan, take actions via tools, observe results, and continue until it reaches a stop condition (or it’s forced to stop).

Tool calling: the agent selecting from a constrained action space (tool names + schemas) and emitting structured calls; your runtime executes them, validates outputs, and feeds results back into the loop.

Orchestration (the “real software” around the model): state management, retries, idempotency, timeouts, tool gating, context assembly/pruning, audit logging, and escalation paths.

Closed-loop evaluation (Plan -> Act -> Judge -> Revise): a repeatable harness where you run realistic tasks, score outcomes (ideally against ground truth and human-calibrated judges), learn what broke, and iterate.

Guardrails + safe stopping: runtime-enforced constraints (policies, budgets, circuit breakers, permissions) that limit what the agent can do and force it to stop or escalate when risk rises or progress stalls.

A small set of practices that pay off fast

Treat your tools like a product surface, not a pile of functions.
The failure mode is “death by a thousand tools”: overlapping capabilities, ambiguous names, and huge schemas that make selection brittle. Keep tools narrow, make them obviously distinct, and hide tools by default unless they’re relevant to the current step. “Just-in-time” instructions and tool visibility are a pragmatic way to scale without drowning the model in choices.
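To make "hidden by default" concrete, here is a minimal sketch in Python. The tool names, the step tags, and the `visible_tools` / `tool_prompt` helpers are all hypothetical, invented for illustration; a real system would likely decide relevance with something richer than a static tag.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    # Hypothetical tags naming the workflow steps where this tool is relevant.
    steps: frozenset = field(default_factory=frozenset)


# Hypothetical registry: three narrow, obviously distinct tools.
REGISTRY = [
    Tool("lookup_order", "Fetch one order by ID.", frozenset({"triage", "refund"})),
    Tool("issue_refund", "Refund one order.", frozenset({"refund"})),
    Tool("send_email", "Email the customer.", frozenset({"notify"})),
]


def visible_tools(step: str, registry=REGISTRY) -> list[Tool]:
    """Return only the tools tagged for the current step (hidden by default)."""
    return [t for t in registry if step in t.steps]


def tool_prompt(step: str) -> str:
    """Build the just-in-time tool section of the prompt for one step."""
    lines = [f"- {t.name}: {t.description}" for t in visible_tools(step)]
    return "\n".join(lines) if lines else "(no tools available for this step)"
```

With this shape, the model sees two tools during the "refund" step and none at all for a step nobody tagged, which keeps the selection problem small as the registry grows.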

Move reliability into deterministic infrastructure (not prompt magic).
If an agent can trigger side effects (create a ticket, refund an order, email a customer), you need transactional thinking: idempotent tools, checkpointing, “undo stacks,” and clear commit points. Prompts don’t roll back production systems; your runtime does. 
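As a rough illustration of that transactional thinking, the sketch below pairs an idempotent tool with an undo stack. `RefundTool`, `UndoStack`, and the idempotency-key scheme are assumptions made up for this example, and the in-memory dictionary stands in for whatever durable store a real runtime would use.

```python
class RefundTool:
    """Toy side-effecting tool, made idempotent with a caller-supplied key."""

    def __init__(self):
        self.balance_refunded = 0
        self._seen = {}  # idempotency_key -> result of the first call

    def refund(self, idempotency_key: str, amount: int) -> dict:
        # A retried call with the same key returns the stored result
        # instead of applying the side effect a second time.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance_refunded += amount
        result = {"status": "refunded", "amount": amount}
        self._seen[idempotency_key] = result
        return result

    def reverse(self, idempotency_key: str) -> None:
        # Compensating action; reversing twice is a no-op.
        result = self._seen.pop(idempotency_key, None)
        if result is not None:
            self.balance_refunded -= result["amount"]


class UndoStack:
    """Records compensating actions until an explicit commit point."""

    def __init__(self):
        self._undo = []

    def record(self, compensate) -> None:
        self._undo.append(compensate)

    def commit(self) -> None:
        # Past the commit point, earlier actions are no longer undone.
        self._undo.clear()

    def rollback(self) -> None:
        # Undo in reverse order of execution.
        while self._undo:
            self._undo.pop()()
```

The runtime, not the prompt, would call `record` after each successful side effect and `rollback` when a later step fails before the commit point.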

Put hard budgets and explicit stop reasons into the main loop.
Most “runaway agents” are simply missing two kinds of guardrails: hard limits on iterations, tool calls, dollars, and wall-clock time, and “no progress” detectors (same tool call repeating, same plan restated, same error class recurring). When the agent hits a threshold, it should stop with a structured summary: what it tried, what it learned, and what it needs from a human.
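One way such a loop could look is sketched below. Everything here is an assumption for illustration: the `Budget` defaults, the `StopReport` fields, the stop-reason strings, and the `step_fn` contract (a callable returning a dict with `done`, `tool_call`, `cost`, and `note`). Only the repeated-tool-call detector is shown; the other "no progress" signals would follow the same pattern.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    max_iterations: int = 20
    max_tool_calls: int = 50
    max_dollars: float = 1.00
    max_seconds: float = 300.0
    max_repeats: int = 3  # identical consecutive tool calls tolerated


@dataclass
class StopReport:
    reason: str
    tried: list = field(default_factory=list)
    learned: list = field(default_factory=list)
    needs_from_human: str = ""


def run_agent(step_fn, budget: Budget, clock=time.monotonic) -> StopReport:
    """Drive step_fn until it finishes or a budget / no-progress check trips."""
    start = clock()
    report = StopReport(reason="")
    tool_calls, dollars, repeats, last_call = 0, 0.0, 0, None

    for i in range(budget.max_iterations):
        if clock() - start > budget.max_seconds:
            report.reason = "wall_clock_exceeded"
            break
        out = step_fn(i)
        dollars += out.get("cost", 0.0)
        report.learned.append(out.get("note", ""))
        call = out.get("tool_call")
        if call is not None:
            tool_calls += 1
            report.tried.append(call)
            # Count how many times in a row the same call has been made.
            repeats = repeats + 1 if call == last_call else 1
            last_call = call
        if out.get("done"):
            report.reason = "completed"
            break
        if repeats >= budget.max_repeats:
            report.reason = "no_progress_repeated_tool_call"
        elif tool_calls >= budget.max_tool_calls:
            report.reason = "tool_call_budget_exceeded"
        elif dollars >= budget.max_dollars:
            report.reason = "dollar_budget_exceeded"
        if report.reason:
            break
    else:
        # The loop ran out of iterations without finishing or tripping a check.
        report.reason = "iteration_budget_exceeded"

    if report.reason != "completed":
        report.needs_from_human = "Review the attempts above and advise how to proceed."
    return report
```

The point of the explicit `reason` field is that every exit from the loop is named, so dashboards and escalation paths can tell a finished task from a stalled one.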

Design for long-running work with durable state and resumability.
If the agent’s job can outlast a single context window (or a single process), assume it will crash, time out, or be interrupted. Store state externally, make steps replayable, and separate “planning notes” from the minimal context required to proceed. The goal is to resume cleanly without redoing expensive work or compounding earlier mistakes.
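A minimal sketch of resumable steps, assuming a single JSON checkpoint file on local disk (a real deployment would more likely use a database or a workflow engine). The function names and the state layout are hypothetical. Planning notes would live under `notes`, while each step receives only the completed results it needs.

```python
import json
import os
import tempfile


def load_state(path: str) -> dict:
    """Load the checkpoint if one exists; otherwise start fresh."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"completed": {}, "notes": []}


def save_state(path: str, state: dict) -> None:
    """Write atomically: a crash mid-write leaves the old checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run_steps(path: str, steps: dict) -> dict:
    """Run named steps in order, skipping any that are already checkpointed."""
    state = load_state(path)
    for name, fn in steps.items():
        if name in state["completed"]:
            continue  # resume: do not redo expensive work
        # Each step sees only prior results, not the planning notes.
        state["completed"][name] = fn(state["completed"])
        save_state(path, state)
    return state["completed"]
```

If the process dies during step two, rerunning `run_steps` with the same path picks up at step two and leaves step one's result untouched.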

Make evaluation real: production-like tasks, ground truth, and judges you can trust.
Vibe checks don’t catch regressions. You want a small-but-representative set of real tasks sampled from production distributions, with ground truth where possible, and automated judges that are calibrated against human agreement (so you know what “good” means). Also assume reward hacking and metric gaming will happen. Build detection for it the same way you do for any other adversarial input.
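Judge calibration can start very simply: label a sample by hand, run the automated judge on the same sample, and measure how often they agree. The sketch below computes raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. Kappa is my choice of statistic for illustration, not something the sources here prescribe, and the label values are made up.

```python
from collections import Counter


def agreement_rate(judge: list, human: list) -> float:
    """Fraction of items where the automated judge matches the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list, human: list) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement_rate(judge, human)
    jc, hc = Counter(judge), Counter(human)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum(jc[label] * hc[label] for label in jc.keys() | hc.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
print(agreement_rate(judge_labels, human_labels))  # 0.75
print(cohens_kappa(judge_labels, human_labels))    # 0.5
```

A judge that agrees with humans 75% of the time can still have a modest kappa, as here, which is a useful reminder to check the chance-corrected number before trusting the judge in a regression gate.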

Security guardrails: constrain action space, validate everything, and sandbox execution.
Tool calling expands your attack surface (prompt injection is just one angle). Practical defaults: strict schema validation, allow-lists for tool targets, content sanitization, least-privilege credentials, and sandboxed execution for anything that can run code or touch sensitive systems.
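Two of those defaults, strict schema validation and a target allow-list, fit in a few lines. This is a sketch under stated assumptions: the `fetch`-style tool, its schema, the allowed host, and the timeout range are all hypothetical, and a production system would typically use a full schema validator rather than this hand-rolled type check.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example.com"}  # hypothetical allow-list

# Hypothetical schema: argument name -> required Python type.
FETCH_SCHEMA = {"url": str, "timeout_s": int}


class ToolCallRejected(Exception):
    """Raised when a model-emitted tool call fails a guardrail check."""


def validate_call(args: dict, schema: dict) -> dict:
    """Reject unknown, missing, or wrongly typed arguments."""
    unknown = set(args) - set(schema)
    missing = set(schema) - set(args)
    if unknown or missing:
        raise ToolCallRejected(f"unknown={sorted(unknown)} missing={sorted(missing)}")
    for name, expected in schema.items():
        # Exact type match, so True is not accepted where an int is required.
        if type(args[name]) is not expected:
            raise ToolCallRejected(f"{name} must be {expected.__name__}")
    return args


def check_target(url: str) -> str:
    """Allow only HTTPS requests to hosts on the allow-list."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname not in ALLOWED_HOSTS:
        raise ToolCallRejected(f"target not allowed: {url!r}")
    return url


def guarded_fetch_args(raw_args: dict) -> dict:
    """Run every check before the runtime is allowed to execute the call."""
    args = validate_call(raw_args, FETCH_SCHEMA)
    check_target(args["url"])
    if not 1 <= args["timeout_s"] <= 30:
        raise ToolCallRejected("timeout_s out of range")
    return args
```

Comparing the parsed hostname, rather than searching the raw URL string, is what stops look-alike targets such as an allowed name used as a subdomain prefix of another domain.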


More reading

Building production-ready agentic systems: Lessons from Shopify Sidekick (Aug 26, 2025, Shopify)

The most “copyable” part is how they hit tool sprawl in the real world and moved to just-in-time instructions, plus a very concrete evaluation approach (ground-truth sets, human agreement, judge calibration, and the reality of reward hacking).

AI grew up and got a job: Lessons from 2025 on agents and trust (Dec 18, 2025, Google Cloud)

A CTO-level framing of why “agents” change the trust model: autonomy, integration into workflows, atomicity/rollback thinking, and why governance has to be part of the architecture.

Effective harnesses for long-running agents (Nov 26, 2025, Anthropic)

Focuses on the annoying reality: agents that run for hours/days need a harness that’s built for resumability, recoverability, and controlled progress—not just bigger context windows.

What 1,200 Production Deployments Reveal About LLMOps in 2025 (Dec 19, 2025, ZenML)

A dense, case-study-heavy sweep of what shows up across production systems: context engineering, infrastructure guardrails, circuit breakers, and why “software fundamentals” keep winning over clever prompting.

Ground Truth Curation Process for AI Systems (Aug 20, 2025, Microsoft)

If you’re serious about closed-loop improvement, this is the unglamorous foundation: how to build and maintain ground truth sets that support regression testing and meaningful “judge” signals.

Function calling using LLMs (May 6, 2025, Martin Fowler)

A solid mental model for “tools as a constrained action space,” plus practical guardrails (unit tests around tool selection, injection defenses, and how to reduce boilerplate as your toolset grows).

How to build your first agentic AI system (Oct 2, 2025, TechTarget)

A pragmatic implementation-oriented checklist, including explicit loop limits, retry patterns, and when to escalate—useful for teams moving from prototypes to something operational.