Claude Code for Canary Deployments: How I Ship to 1% of Users Before Breaking Everything
Nex Tools · 2026-05-26 · via DEV Community

I used to ship by faith. The change passed code review, the tests went green, the deploy button was right there, and I pressed it. Most of the time it was fine. The handful of times it was not fine cost me weekends, customer trust, and a real amount of money. The worst incident I can remember was a single line change that took down checkout for forty minutes during a marketing push. The change had passed every test we had. The bug only showed up under real traffic patterns.

After that incident, I built a canary deployment workflow. Every risky change now ships to one percent of traffic first, sits there for a defined observation window, and gets promoted to the full population only when the metrics from the canary cohort show no meaningful regression relative to the metrics from the control cohort. It works. The serious incidents I used to ship have been replaced by canary failures that get caught and rolled back before they reach the majority of users.

The hard part of canary deployment is not the routing layer. The routing layer is a solved problem. The hard part is everything around the routing layer: choosing the right metrics to watch, deciding what counts as a regression, building the decision logic that promotes or rolls back, and connecting it all to the deployment pipeline. That hard part is where Claude Code reshaped how I work. Here is the workflow.


Why Canary Deployments Are Underused

Most teams I have worked with talk about canary deployments more than they actually do them. The reason is almost always the same. Setting up the infrastructure is more work than people initially expect, and the work is spread across several systems that each have their own conventions.

You need a routing layer that can split traffic by percentage and by user cohort. You need a metrics pipeline that can compare the canary cohort to the control cohort on the dimensions that matter. You need a decision policy that knows when to promote, when to hold, and when to roll back. You need a control plane that ties it all together and gives humans visibility. And you need all of it to be reliable enough that people trust it.

Most teams end up with two or three of these pieces but not the full set. The result is a canary system that exists in name only. Deployments still go to everyone at once, with a vague intention to "watch the dashboards for a few minutes" that no one ever has time to follow through on.

The gap between a real canary system and the vague intention of one is the gap between "we caught it before it shipped" and "we caught it because customers complained." The difference looks small from a distance. Up close, the two are completely different worlds.

Once you have a real canary system, you also discover that you start writing different kinds of changes. Changes that would have been considered too risky become routine because you have a safety net for them. The cost of every individual change goes up slightly because you have to wait for the canary window, but the cost of failed changes drops to nearly zero. The expected value calculation flips, and the team ships more aggressively.

The workflow I describe below is the workflow that closed the gap for me. The Claude Code skills do the work that humans were not doing because the work was tedious and the payoff was abstract.


The Cohort Skill

The first skill in the workflow handles cohort assignment. Given a user identifier, the skill returns whether the user belongs to the canary cohort or the control cohort for a particular deployment.

The assignment is stable. The same user identifier always returns the same answer for the same deployment. The stability matters because it means a user who hits the canary on their first request continues to hit the canary on subsequent requests within the same session. Without stability, half a user's requests would go to the canary and half to the control, which would distort the metrics and could also create user-visible inconsistencies.

The assignment is also fast. The skill computes a deterministic hash of the user identifier and the deployment identifier, takes the result modulo 100, and compares it to the canary percentage. The computation takes single-digit microseconds, so it can run in the request hot path without measurably affecting latency.

The skill also handles cohort segmentation. For some deployments, the canary should be limited to specific user populations. The skill accepts a population filter and respects it. The most useful filter I have is internal users only, which lets me canary internal-facing changes to employees before they reach customers.
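As a rough illustration of the hash-and-modulo assignment and the population filter described above, here is a minimal Python sketch. The function names (`bucket_for`, `assign_cohort`, `is_internal`), the choice of SHA-256, and the `emp-` prefix used to mark internal users are assumptions made for the example, not details of the actual skill.

```python
import hashlib
from typing import Callable, Optional


def bucket_for(user_id: str, deployment_id: str) -> int:
    """Map a (deployment, user) pair to a stable bucket in 0..99."""
    key = f"{deployment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo 100.
    return int.from_bytes(digest[:8], "big") % 100


def assign_cohort(
    user_id: str,
    deployment_id: str,
    canary_percent: int,
    population_filter: Optional[Callable[[str], bool]] = None,
) -> str:
    """Return "canary" or "control" for this user and this deployment."""
    # Users outside the targeted population never see the canary.
    if population_filter is not None and not population_filter(user_id):
        return "control"
    in_canary = bucket_for(user_id, deployment_id) < canary_percent
    return "canary" if in_canary else "control"


# Hypothetical filter: treat IDs that start with "emp-" as internal users.
def is_internal(user_id: str) -> bool:
    return user_id.startswith("emp-")
```

Putting the deployment identifier into the hash input means a user who lands in the canary for one deployment is not automatically in the canary for every other deployment. The sketch uses a digest from `hashlib` rather than Python's built-in `hash()`, because the built-in is salted per process for strings and would not return the same bucket on two different servers.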

If you want to see how this cohort approach connects to a broader feature flag system, the workflow I described in "Claude Code for Feature Flags" is the layer that sits one level up from canary assignment. Canaries are a specialized use of feature flags where the cohort is randomized by user identifier rather than chosen explicitly.


The Metrics Skill

The second skill handles metrics comparison. Given a deployment, a canary cohort, a control cohort, and a time window, the skill produces a comparison of every tracked metric between the two cohorts.

The metrics are dimensional. The skill does not just compare averages across the cohorts. It compares the latency distribution at p50, p90, p99, and p99.9, since a regression in the tail can leave the mean almost unchanged. It compares the error rate, the throughput, the success rate, the cache hit rate, and any custom metric the deployment opts into.
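To make the percentile comparison concrete, the following sketch computes a latency profile for each cohort and subtracts them. It uses the nearest-rank definition of a percentile. The helper names and the fixed list of percentile points are illustrative assumptions.

```python
import math
from typing import Dict, Sequence

# Percentile points taken from the text; the tuple name is an assumption.
POINTS = (50, 90, 99, 99.9)


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[min(rank, len(ordered)) - 1]


def latency_profile(samples: Sequence[float]) -> Dict[str, float]:
    """Return a mapping like {"p50": ..., "p90": ..., "p99": ..., "p99.9": ...}."""
    return {f"p{p:g}": percentile(samples, p) for p in POINTS}


def profile_delta(canary: Sequence[float], control: Sequence[float]) -> Dict[str, float]:
    """Canary minus control at every percentile point (positive means slower)."""
    canary_profile = latency_profile(canary)
    control_profile = latency_profile(control)
    return {key: canary_profile[key] - control_profile[key] for key in canary_profile}
```

In a cohort where only the slowest few requests get slower, the p50 and p90 deltas stay at zero while the p99 delta is large, which is the kind of regression an average would blur.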

The comparison is statistical. The skill knows the difference between a real change and noise. A two percent jump in error rate on a small sample is probably noise. A two percent jump on a large sample is probably real. The skill reports both the point estimate and the confidence interval, and it flags differences that are unlikely to be noise.
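One common way to separate a real change from noise in an error rate is a two-proportion z-test. The sketch below is an assumption about how such a check could look, not the method the skill actually uses. It reports the point estimate, a confidence interval for the difference, and a two-sided p-value.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass(frozen=True)
class RateComparison:
    canary_rate: float
    control_rate: float
    diff: float        # canary_rate - control_rate
    ci_low: float      # lower bound of the confidence interval for diff
    ci_high: float     # upper bound of the confidence interval for diff
    p_value: float     # two-sided p-value under "no difference"
    significant: bool


def compare_rates(canary_errors: int, canary_total: int,
                  control_errors: int, control_total: int,
                  alpha: float = 0.05) -> RateComparison:
    if canary_total <= 0 or control_total <= 0:
        raise ValueError("both cohorts need at least one request")
    p_canary = canary_errors / canary_total
    p_control = control_errors / control_total
    diff = p_canary - p_control

    # Test statistic uses the pooled rate, as is standard for this test.
    pooled = (canary_errors + control_errors) / (canary_total + control_total)
    se_test = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / control_total))
    if se_test == 0:
        p_value = 1.0  # both cohorts are all-success or all-failure
    else:
        z = diff / se_test
        p_value = math.erfc(abs(z) / math.sqrt(2))

    # The interval uses the unpooled standard error.
    se_ci = math.sqrt(p_canary * (1 - p_canary) / canary_total
                      + p_control * (1 - p_control) / control_total)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return RateComparison(p_canary, p_control, diff,
                          diff - z_crit * se_ci, diff + z_crit * se_ci,
                          p_value, p_value < alpha)


small = compare_rates(3, 100, 1, 100)            # 3% vs 1% on 100 requests each
large = compare_rates(300, 10_000, 100, 10_000)  # 3% vs 1% on 10,000 requests each
print(small.significant, large.significant)      # prints: False True
```

The same two-point gap is reported as noise on the small sample, where the interval straddles zero, and as a real difference on the large one.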

The output is a structured comparison report. Each metric has a row showing the canary value, the control value, the absolute difference, the relative difference, and the statistical significance. Rows where the canary is meaningfully worse than the control are at the top. Rows where the canary is meaningfully better are also surfaced, because improvements are interesting too.
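A report row and its ordering might be modeled as below. This is a sketch under assumptions: the `MetricRow` fields, the `higher_is_worse` flag, and the tie-breaking by relative difference are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MetricRow:
    name: str
    canary: float
    control: float
    significant: bool             # result of the statistical comparison
    higher_is_worse: bool = True  # set False for metrics like cache hit rate

    @property
    def abs_diff(self) -> float:
        return self.canary - self.control

    @property
    def rel_diff(self) -> float:
        if self.control == 0:
            # Relative change from zero is undefined; use signed infinity.
            return 0.0 if self.abs_diff == 0 else math.copysign(math.inf, self.abs_diff)
        return self.abs_diff / self.control

    @property
    def verdict(self) -> str:
        if not self.significant or self.abs_diff == 0:
            return "neutral"
        got_worse = (self.abs_diff > 0) == self.higher_is_worse
        return "worse" if got_worse else "better"


_ORDER = {"worse": 0, "better": 1, "neutral": 2}


def sort_report(rows: Iterable[MetricRow]) -> List[MetricRow]:
    """Regressions first, then improvements, then everything else."""
    return sorted(rows, key=lambda r: (_ORDER[r.verdict], -abs(r.rel_diff)))
```

Sorting on the verdict first keeps every regression above every improvement, regardless of how large the improvement is.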


The Decision Skill

The third skill turns the comparison report into a deployment decision. Given a comparison and a deployment policy, the skill produces one of three outcomes: promote, hold, or roll back.

The policy is the interesting part. The policy specifies which metrics matter and what regressions are tolerable. For a payment service, the policy might say that any increase in checkout error rate is a roll back, but small latency regressions are tolerable. For a search service, the policy might say that small error rate increases are tolerable but latency regressions over 50 ms are a roll back.

The policy also specifies the observation window. Some changes need a short canary because the signals appear quickly. Others need a long canary because the relevant signals only appear during certain traffic patterns. The skill respects the configured window and does not declare a verdict until the window has elapsed.
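The policy and the three-way decision can be sketched as a small data structure plus one function. Everything here is an assumption for illustration: the `Policy` fields, the metric names, and every threshold except the 50 ms search latency limit mentioned above. The sketch also assumes the caller passes in only statistically significant differences, with insignificant ones set to 0.0.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class Decision(Enum):
    PROMOTE = "promote"
    HOLD = "hold"
    ROLL_BACK = "roll_back"


@dataclass(frozen=True)
class Policy:
    name: str
    # Metric name -> largest tolerated increase (canary minus control).
    max_regression: Mapping[str, float]
    observation_window_s: int


# Hypothetical policies; only the 50 ms latency limit comes from the text.
PAYMENTS = Policy("payments", {"checkout_error_rate": 0.0, "latency_p99_ms": 20.0}, 3600)
SEARCH = Policy("search", {"error_rate": 0.002, "latency_p99_ms": 50.0}, 1800)


def decide(deltas: Mapping[str, float], policy: Policy, elapsed_s: int) -> Decision:
    """Roll back on any breach, promote only after a clean full window."""
    missing = False
    for metric, limit in policy.max_regression.items():
        if metric not in deltas:
            missing = True  # no data yet for a metric the policy cares about
            continue
        if deltas[metric] > limit:
            return Decision.ROLL_BACK
    if missing or elapsed_s < policy.observation_window_s:
        return Decision.HOLD
    return Decision.PROMOTE
```

In this sketch a breach triggers a roll back immediately, while a promote verdict has to wait for the window to elapse. Treating a missing metric as a hold rather than a pass is a deliberately cautious choice.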

The decision is auditable. The skill produces a structured record of the decision, the metrics that drove it, the policy that was applied, and the timestamp of the verdict. The record goes to a deployment log. If a decision is later questioned, the record is the evidence for what was known at the time.

The decision is also overridable. A human with appropriate permissions can override a decision in either direction. An override is logged and requires a reason. In practice, the overrides are rare. Most of the time, the skill's decision is the right one, and the policy is what would need to change if it is not.
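An audit record with a logged override could look something like the following. The field names and the JSON log format are assumptions, and the permission check on the person overriding is left out of the sketch.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Mapping, Optional


def decision_record(deployment_id: str, decision: str, policy_name: str,
                    metrics: Mapping[str, float],
                    now: Optional[datetime] = None) -> Dict[str, Any]:
    """Capture what was decided, why, under which policy, and when."""
    now = now or datetime.now(timezone.utc)
    return {
        "deployment_id": deployment_id,
        "decision": decision,
        "policy": policy_name,
        "metrics": dict(metrics),
        "decided_at": now.isoformat(),
        "override": None,
    }


def apply_override(record: Mapping[str, Any], new_decision: str, actor: str,
                   reason: str, now: Optional[datetime] = None) -> Dict[str, Any]:
    """Return a copy of the record with an override attached; a reason is mandatory."""
    if not reason.strip():
        raise ValueError("an override requires a reason")
    now = now or datetime.now(timezone.utc)
    updated = dict(record)
    updated["override"] = {
        "decision": new_decision,
        "actor": actor,
        "reason": reason.strip(),
        "overridden_at": now.isoformat(),
    }
    return updated


def to_log_line(record: Mapping[str, Any]) -> str:
    """Serialize one record as a single JSON line for the deployment log."""
    return json.dumps(record, sort_keys=True)
```

The override is stored next to the original decision instead of replacing it, so the log still shows what the skill concluded before a human stepped in.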


The Promotion Skill

When the decision is promote, the promotion skill handles the rollout. The promotion is not a single step from 1% to 100%. It is a series of steps with observation windows between them.

A typical promotion ladder goes 1%, 5%, 25%, 50%, 100%. Each step has its own observation window and its own comparison. The skill executes each step, runs the comparison, applies the policy, and decides whether to proceed to the next step or hold or roll back. The ladder gives multiple chances to catch a regression that did not show up at lower traffic levels.
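The ladder logic reduces to a short loop. In this sketch, `set_traffic` and `evaluate` are hypothetical callbacks standing in for the routing layer and for the comparison plus policy check; `evaluate` is assumed to block for the step's observation window and return "promote", "hold", or "roll_back".

```python
from typing import Callable, Sequence, Tuple

# Step percentages taken from the text.
LADDER = (1, 5, 25, 50, 100)


def run_ladder(set_traffic: Callable[[int], None],
               evaluate: Callable[[int], str],
               ladder: Sequence[int] = LADDER) -> Tuple[str, int]:
    """Walk the ladder; stop at the first step that does not earn a promote."""
    for percent in ladder:
        set_traffic(percent)
        # How a comparison is made at the last step is left to the callback.
        verdict = evaluate(percent)
        if verdict == "roll_back":
            set_traffic(0)  # send all traffic back to the previous version
            return ("rolled_back", percent)
        if verdict != "promote":
            return ("held", percent)
    return ("promoted", ladder[-1])
```

A regression that only appears at 25% ends the run there, with traffic returned to the previous version and the failing step recorded in the result.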

The promotion also handles communication. Each step posts a status update to the deployment channel. The update includes the current traffic percentage, the metrics from the most recent comparison, and the time until the next step. Humans can follow along without having to query the system.

The full promotion typically takes one to three hours. The duration sounds long compared to a traditional deployment that ships in minutes, but the duration is the price of safety. The bugs that get caught at the 5% step would otherwise be in front of every customer by the time anyone noticed.


How the Workflow Runs in Practice

The deployment pipeline integrates with the canary workflow at the deploy step. Instead of pushing the new version to the full fleet, the pipeline pushes it to a canary subset and registers the deployment with the cohort skill.

The metrics skill starts collecting comparisons immediately. The first comparison usually runs after fifteen minutes of canary traffic. The skill emits a structured report that the decision skill consumes.

If the decision is hold, the comparison continues. The metrics skill produces a new comparison every fifteen minutes, and the decision skill re-evaluates each time. The hold continues until either the observation window expires with a promote decision or a regression appears and triggers a roll back.
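The hold loop can be pictured as repeated re-evaluation on a fixed interval. The sketch below assumes a hypothetical `evaluate` callback that wraps the metrics and decision skills, and it adds a `max_wait_s` cap, which is not in the description above, so that the loop cannot run forever.

```python
import time
from typing import Callable

FIFTEEN_MINUTES_S = 15 * 60


def observe(evaluate: Callable[[int], str],
            max_wait_s: int,
            interval_s: int = FIFTEEN_MINUTES_S,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Re-evaluate every interval until a promote or roll_back verdict arrives."""
    elapsed = 0
    while elapsed < max_wait_s:
        sleep(interval_s)
        elapsed += interval_s
        verdict = evaluate(elapsed)  # "promote", "hold", or "roll_back"
        if verdict in ("promote", "roll_back"):
            return verdict
    # Still undecided at the cap: leave it on hold for a human to look at.
    return "hold"
```

Passing `sleep` as a parameter lets the loop be exercised in tests without actually waiting fifteen minutes per iteration.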

If the decision is promote, the promotion skill takes over. It steps through the promotion ladder, running comparisons at each step, until the deployment reaches 100% traffic. At that point, the canary is done and the change is live for everyone.

If the decision is roll back, the routing layer reverts the canary cohort to the previous version. The metrics that triggered the roll back are attached to the deployment record. The author of the change gets a notification with the comparison data, which is usually enough information to identify the bug.


What This Workflow Did to My Practice

The most visible change is in incident frequency. The category of incident I used to see most often, where a code change went to 100% of traffic and broke something, has nearly disappeared. The category that replaced it is canary roll backs, which catch the same class of bugs without the customer impact.

The second change is in deployment speed for safe changes. Because the workflow is automated, deployments that would have required careful human attention now run in the background. I can deploy a low-risk change at any time and the workflow handles the promotion without me having to be present. The net effect is that risky changes get more attention and safe changes get less, which is the right allocation.

The third change is cultural. The team writes different code now. The kinds of changes that would have been postponed or batched are now shipped continuously, because the cost of a small risky change is much lower than it used to be. The cycle time on individual changes has dropped, even though each individual deployment takes longer than it used to.

If you want to see how this connects to the broader picture, "Claude Code for Incident Response" covers what happens when something does break despite the canary. The two workflows together form most of the safety net I rely on in production.

For the rest of my practical workflows around shipping software with Claude Code, the full series is on DEV.to.


FAQ

Does this require a service mesh?

No. The cohort skill can run anywhere a routing decision is made. A service mesh makes it easier, but a load balancer, an API gateway, or even an application-level router works.

What if my service has too little traffic for statistical significance at 1%?

Increase the initial canary percentage. The workflow does not require 1%. It requires that the canary cohort is small enough that a regression does not affect most users and large enough to produce a meaningful statistical signal. The right percentage depends on your traffic volume.
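For a back-of-the-envelope estimate of the right percentage, the standard normal-approximation sample size for comparing two proportions can be used. The sketch below is an approximation under stated assumptions: it treats the two groups as equal in size, which overstates what the canary needs when the control cohort is much larger, and the function names and default significance and power levels are choices made for the example.

```python
import math
from statistics import NormalDist


def requests_needed(baseline_rate: float, degraded_rate: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate requests per cohort needed to detect baseline -> degraded."""
    if not (0 < baseline_rate < 1 and 0 < degraded_rate < 1):
        raise ValueError("rates must be strictly between 0 and 1")
    if baseline_rate == degraded_rate:
        raise ValueError("rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + degraded_rate * (1 - degraded_rate))
    effect = degraded_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def min_canary_percent(requests_in_window: int, baseline_rate: float,
                       degraded_rate: float) -> float:
    """Smallest canary share that yields enough requests within one window."""
    needed = requests_needed(baseline_rate, degraded_rate)
    return min(100.0, 100.0 * needed / requests_in_window)
```

With a 1% baseline error rate and a goal of detecting a doubling to 2%, the estimate comes to a little over two thousand canary requests per window. A service that handles a million requests in the window can stay under 1%, while one that handles fifty thousand needs a canary of roughly 5%.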

What about changes that affect every request the same way?

For uniform changes, the per-cohort comparison still works, because only the canary cohort runs the new version and the metrics are computed independently for each cohort. The skill will detect differences even when the change affects every request, as long as the change produces a measurable signal.

How do I write the policy?

Start conservative. List the metrics that matter most for your service. For each one, choose a regression threshold that is large enough to be unambiguous. Tighten the policy over time as you learn what false positives look like.


The canary deployment workflow is not glamorous. It does not produce the kind of architectural diagrams that get applauded at conferences. What it does is take an entire category of operational pain and make it disappear quietly. The change to the team's day-to-day experience is huge, even though the surface change to the system is small. That ratio of impact to visible complexity is exactly what I look for when I decide where to invest engineering time, and it is why I would build this workflow first if I were starting a new production service today.