How a model upgrade silently broke our extraction prompt (and how we caught it)
shaun vd · 2026-05-23 · via DEV Community

A friend's product summarizes customer support tickets using a carefully tuned
LLM prompt. It worked perfectly on GPT-4o for six months. Then OpenAI deprecated
4o, the team migrated to GPT-4.1, ran a smoke test in the playground, said
"looks fine," and shipped.

Two weeks later a customer escalated: "Your urgency tagging is wrong on
basically everything since last Wednesday."

The prompt asked for {"intent": "...", "urgency": "low|medium|high"}. On
4o, the model returned exactly that. On 4.1, it started returning
{"intent": "...", "urgency_level": "..."} — semantically identical, but
the downstream classifier was indexing on urgency and silently fell
through to a default value of "low" on 100% of new tickets.
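
The failure mode is easy to reproduce. Here is a minimal sketch, assuming the downstream code read the field with a dictionary lookup that falls back to a default. The post doesn't show the real classifier, so `tag_urgency_lenient` and `tag_urgency_strict` are hypothetical names, and the sample outputs are made up to match the two shapes described above.

```python
import json

ALLOWED_URGENCY = {"low", "medium", "high"}


def tag_urgency_lenient(raw: str) -> str:
    """Hypothetical downstream code: a missing key silently becomes "low"."""
    return json.loads(raw).get("urgency", "low")


def tag_urgency_strict(raw: str) -> str:
    """Same lookup, but a missing or unexpected value raises instead of defaulting."""
    data = json.loads(raw)
    if "urgency" not in data:
        raise KeyError(f"missing 'urgency'; got keys {sorted(data)}")
    if data["urgency"] not in ALLOWED_URGENCY:
        raise ValueError(f"unexpected urgency value: {data['urgency']!r}")
    return data["urgency"]


# Illustrative outputs in the old and new shapes (not real model responses).
old_output = '{"intent": "refund", "urgency": "high"}'
new_output = '{"intent": "refund", "urgency_level": "high"}'

print(tag_urgency_lenient(old_output))  # the field is present, so the real value comes back
print(tag_urgency_lenient(new_output))  # the field was renamed, so the default comes back
```

The strict variant turns the rename into a loud `KeyError` on the first ticket instead of two weeks of quietly mis-tagged ones.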

Nobody saw it because:

  • The prompt didn't error. JSON parsed. Fields existed.
  • The unit tests checked the prompt string, not the prompt output.
  • The integration tests mocked the LLM call.
  • The output was indistinguishable from "everything's fine and quiet."

This is the silent regression problem. Code has tests; prompts have vibes.

Three categories of model-swap failure

After looking at a dozen of these incidents, the failures cluster into three
groups. Knowing which kind you're looking at tells you what to test.

1. Format drift. The model decides to rename a field, drop a field, add
a field you didn't ask for, or change list ordering. JSON still parses. Your
downstream code breaks.

2. Reasoning regression. The model is "improved" but loses a hidden
constraint your prompt depended on. Classic example: GPT-4 reliably extracted
all requirements from a contract; GPT-4-Turbo extracted "the most important
ones," dropping 15-20% of clauses. The format was fine. The data was wrong.

3. Tone shift. Less common but expensive. The new model's outputs are
more verbose, less verbose, friendlier, blunter. If anything downstream
(another model, a regex, a fuzzy matcher) was tuned to the old tone, it
breaks.
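
Each of the three categories above can get a cheap deterministic check before any LLM judge is involved. The following is a rough sketch with hypothetical helper names. It assumes top-level JSON objects for format drift, exact string matching for extracted items (crude, since real clauses would need normalization first), and word count as a proxy for tone.

```python
import json
from typing import Dict, List


def format_drift(baseline_raw: str, candidate_raw: str) -> Dict[str, List[str]]:
    """Compare top-level keys and value types of two JSON objects."""
    baseline = json.loads(baseline_raw)
    candidate = json.loads(candidate_raw)
    base_keys, cand_keys = set(baseline), set(candidate)
    return {
        "missing": sorted(base_keys - cand_keys),
        "unexpected": sorted(cand_keys - base_keys),
        "type_changed": sorted(
            key
            for key in base_keys & cand_keys
            if type(baseline[key]) is not type(candidate[key])
        ),
    }


def recall_against_baseline(baseline_items: List[str], candidate_items: List[str]) -> float:
    """Fraction of distinct baseline items that the candidate also returned."""
    base = set(baseline_items)
    if not base:
        return 1.0
    return len(base & set(candidate_items)) / len(base)


def length_ratio(baseline_text: str, candidate_text: str) -> float:
    """Candidate word count divided by baseline word count."""
    base_words = len(baseline_text.split())
    cand_words = len(candidate_text.split())
    if base_words == 0:
        return 1.0 if cand_words == 0 else float("inf")
    return cand_words / base_words
```

A recall well below 1.0 points at a reasoning regression, and a length ratio far from 1.0 is a hint that the tone moved. Where to set the thresholds depends on the prompt.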

What the team should have had

A test suite of 30 representative tickets, each with an expected JSON shape.
On model swap day:

$ promptfork test summarize_ticket --baseline gpt-4o
→ running v12 across [gpt-4.1] vs baseline [gpt-4o]
✗ 30/30 ok, but 6 regressions detected
  - urgency_field_renamed: 6 cases
  - severity 2 (functional)

Six lines. Seven seconds. Two-week customer-facing bug avoided.

How to actually do this

The setup for the team that got bitten took four minutes:

pip install promptfork

# Save the current production prompt, version 1
promptfork push summarize_ticket \
  --file prompts/summarize.txt \
  --message "current prod"

# Pin 30 real tickets from your support inbox
for t in tickets/*.json; do
  name=$(basename "$t" .json)
  promptfork add-test summarize_ticket "$name" \
    --input ticket="$(cat "$t")" \
    --rubric "must return urgency in {low,medium,high}"
done

# Run baseline on 4o
promptfork test summarize_ticket --models gpt-4o

# Now upgrade — push the new prompt as v2 (or keep v1 and swap models)
# Run with v1 (4o) as the baseline, get an LLM-judge regression report
promptfork test summarize_ticket --baseline 1 --models gpt-4.1

That's it. The --baseline flag is what catches drift — it pulls the
baseline output, runs the candidate, and asks Claude Haiku to compare them
under a strict "only flag strictly worse" rubric.
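
The shape of that comparison is simple enough to sketch by hand. This is not PromptFork's implementation, only an illustration of the baseline-versus-candidate loop. Every name is hypothetical, the two "models" are offline stand-ins, and the judge is a deterministic key check standing in for the LLM judge described above.

```python
import json
from typing import Callable, Dict, List

Model = Callable[[str], str]        # ticket text -> raw model output
Judge = Callable[[str, str], bool]  # (baseline, candidate) -> True if candidate is strictly worse


def find_regressions(
    cases: Dict[str, str], baseline: Model, candidate: Model, is_worse: Judge
) -> List[str]:
    """Run every pinned case through both models and return the names the judge flags."""
    return [
        name
        for name, ticket in cases.items()
        if is_worse(baseline(ticket), candidate(ticket))
    ]


# Stand-ins so the sketch runs offline; real versions would call the model APIs.
def fake_gpt4o(ticket: str) -> str:
    return json.dumps({"intent": "refund", "urgency": "high"})


def fake_gpt41(ticket: str) -> str:
    return json.dumps({"intent": "refund", "urgency_level": "high"})


def drops_baseline_keys(baseline_out: str, candidate_out: str) -> bool:
    """Stand-in judge: the candidate is worse if any baseline key disappeared."""
    return not set(json.loads(baseline_out)) <= set(json.loads(candidate_out))


cases = {"ticket_001": "Card charged twice, need this fixed today."}
print(find_regressions(cases, fake_gpt4o, fake_gpt41, drops_baseline_keys))
# prints ['ticket_001']
```

Keeping the judge as a plain callable means the same loop works with a deterministic check, an LLM judge, or both in sequence.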

The CI version

The same command in a GitHub Action means no prompt change ever ships
without running against a known-good baseline:

- uses: shaunvand/promptfork-cli@v0
  with:
    prompt: summarize_ticket
    baseline: 1
    api-key: ${{ secrets.PROMPTFORK_API_KEY }}

The action exits non-zero on regression. Branch protection blocks the merge.
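
The exit code is the whole mechanism here: CI treats any non-zero exit as a failed step. For a home-grown check, the gate could look like the sketch below. The `gate` function and its input (a list of regressed case names) are hypothetical, and this is not the action's actual code.

```python
import sys
from typing import List


def gate(regressions: List[str]) -> int:
    """Print each regression to stderr and return a process exit code."""
    for name in regressions:
        print(f"regression: {name}", file=sys.stderr)
    return 1 if regressions else 0
```

A script's entry point would end with `sys.exit(gate(regressions))`, so the CI step fails whenever the list is non-empty.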

If you ship LLM features, you need this. The first time it catches a silent
regression, it pays for itself a hundred times over. PromptFork has a free
tier (3 prompts, 50 runs/mo) at https://promptfork.online/diff — set it up
in five minutes, sleep better forever.