The AI That Improves Itself: Autonomous Prompt Iteration Loop
Odilon HUGON · 2026-05-27 · via DEV Community

Each roast was taking 50 seconds per upload. Quality was unknown — we had a feeling, not data. The prompt had been written "by instinct" and never seriously evaluated. The question was simple: how do you know if a prompt is good, and how do you improve it without spending the whole day reading roasts manually?

The answer: automate the evaluation work using AI itself, in a loop. Write a tool that sends 30 photos to Claude, measures quality metrics, and produces a report. Modify the prompt, rerun, compare. Five iterations later, here's what we learned.

Context: RateMyFace

RateMyFace is an AI roast-by-photo site: the user uploads a photo, Claude analyzes it and generates satirical text along with a score and a "tier label" (e.g. "WiFi Signal With Legs"). The result is rendered as a collectible trading card.

The stack: Go monolith, SQLite, Claude CLI (claude --print) called as a subprocess. The prompt asked Claude to produce 5 roast styles (standard, rap, Shakespeare, passive-aggressive mom, Gordon Ramsay) + a score + a label, all in JSON.

Two concrete problems: roasts were taking ~50 seconds (too slow for interactivity) and their quality was opaque. We knew we were generating something, not whether it was good.

The idea: measure before optimizing

The usual reflex in prompt engineering is to iterate manually — modify, test on 2-3 examples, estimate if it's better. The problem: you optimize on the examples you chose, not on the real distribution. And "it seems better" isn't a metric.

Alternative approach: define what "good" means in a measurable way, generate enough examples to have stable statistics, and automate the evaluation. Metrics chosen:

  • Average length — target < 150 chars. A viral roast is short.
  • Score variance — target > 2.0. If everyone scores 5-6, the score is useless.
  • Fallback rate — how often Claude fails and we return the default text.
  • Score distribution — 1-10 histogram, to visualize biases.
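The four metrics above are simple to compute once each run is reduced to a text, a score, and a fallback flag. Here is a minimal sketch in Go. The `Result`, `Metrics`, and `Compute` names are hypothetical and do not come from the original harness. It also assumes population variance, excludes fallbacks from the length and score statistics, and rounds scores to the nearest integer for the histogram; the article does not specify any of those choices.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Result is a hypothetical shape for one evaluated roast.
type Result struct {
	Text     string  // the "standard" roast text
	Score    float64 // 1-10 score returned by the model
	Fallback bool    // true when the default text was returned
}

// Metrics holds the four aggregate numbers described above.
type Metrics struct {
	AvgChars      float64
	ScoreVariance float64
	FallbackRate  float64
	Histogram     map[int]int // rounded score -> count
}

// Compute aggregates results. Fallbacks count toward FallbackRate only;
// they are excluded from length and score statistics (an assumption).
func Compute(results []Result) Metrics {
	m := Metrics{Histogram: map[int]int{}}
	if len(results) == 0 {
		return m
	}
	var valid, fallbacks int
	var sumChars, sumScore, sumSq float64
	for _, r := range results {
		if r.Fallback {
			fallbacks++
			continue
		}
		valid++
		sumChars += float64(utf8.RuneCountInString(r.Text)) // characters, not bytes
		sumScore += r.Score
		sumSq += r.Score * r.Score
		m.Histogram[int(r.Score+0.5)]++
	}
	m.FallbackRate = float64(fallbacks) / float64(len(results))
	if valid > 0 {
		mean := sumScore / float64(valid)
		m.AvgChars = sumChars / float64(valid)
		m.ScoreVariance = sumSq/float64(valid) - mean*mean // population variance
	}
	return m
}

func main() {
	m := Compute([]Result{
		{Text: "short roast", Score: 4},
		{Text: "another one", Score: 6},
		{Fallback: true},
	})
	fmt.Printf("avg chars=%.1f variance=%.2f fallback rate=%.2f\n",
		m.AvgChars, m.ScoreVariance, m.FallbackRate)
}
```

Counting runes rather than bytes matters here because the French roasts contain accented characters.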

The tool: an evaluation harness in Go

A standalone binary in cmd/prompttest/main.go. No HTTP server, direct call to Claude CLI. 30 fixed test cases (photos from randomuser.me — men and women FR/EN), run sequentially with duration measurement.

func callClaude(ctx context.Context, photoURL, lang string) (*RoastResult, string, time.Duration, error) {
    prompt := buildPrompt(lang)
    fullPrompt := fmt.Sprintf("First, read the image file at %s and look at it carefully. Then:\n\n%s", photoURL, prompt)

    args := []string{
        "--print",
        "--model", "sonnet",
        "--effort", "low",       // reduces time from ~50s to ~27s
        "--allowedTools", "Read",
        "--dangerously-skip-permissions",
        "-p", fullPrompt,
    }

    start := time.Now()
    out, err := exec.CommandContext(ctx, "claude", args...).Output()
    dur := time.Since(start)
    // ...
}


The --effort low flag is the first speed optimization: in our runs it cut response time from ~50s to ~27s. It is not officially documented, but its behavior has been stable.
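The elided part of callClaude has to turn the CLI's text output into a RoastResult and decide when to count a fallback. The article does not show that step, so the following is only a sketch: the field names in `RoastResult`, the `parseRoast` helper, and the validation rules are all assumptions. It takes the outermost `{...}` in the output so that stray prose around the JSON does not cause a failure.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// RoastResult uses hypothetical field names; the real struct is not shown in the article.
type RoastResult struct {
	Standard string  `json:"standard"`
	Score    float64 `json:"score"`
	Tier     string  `json:"tier"`
}

// parseRoast extracts the outermost JSON object from raw CLI output and
// validates it. Any error means the caller should count a fallback.
func parseRoast(raw string) (*RoastResult, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end <= start {
		return nil, errors.New("no JSON object in output")
	}
	var r RoastResult
	if err := json.Unmarshal([]byte(raw[start:end+1]), &r); err != nil {
		return nil, fmt.Errorf("invalid JSON: %w", err)
	}
	if r.Standard == "" || r.Score < 1 || r.Score > 10 {
		return nil, errors.New("missing roast text or score out of range")
	}
	return &r, nil
}

func main() {
	out := "Here you go:\n{\"standard\": \"Nice tie.\", \"score\": 5.5, \"tier\": \"Intern Energy\"}"
	r, err := parseRoast(out)
	if err != nil {
		fmt.Println("fallback:", err)
		return
	}
	fmt.Printf("score=%.1f tier=%q\n", r.Score, r.Tier)
}
```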

End-of-run report:

╔══════════════════════════════════════════════╗
║   PROMPTTEST v4 — 30 tests
╠══════════════════════════════════════════════╣
║ Avg chars standard : 128 (target < 200)
║ Score variance     : 1.09 (target > 2.0)
║ Fallbacks          : 0/30 (target < 5%)
║ Avg score          : 5.8
║ Avg duration       : 33.859s
╠══════════════════════════════════════════════╣
║ Score distribution:
║    4: █████ (5)
║    5: ████ (4)
║    6: ███████ (7)
║    7: ██████████████ (14)
╠══════════════════════════════════════════════╣
║ Sample roasts (first 5 valid):
║  [homme FR 1] score=4.0 tier="PDG de Rien du Tout"
║  → Le costume + les joues de bébé + le regard vide — t'es le seul
║    mec à avoir l'air d'un enfant ET d'un PDG raté en même temps.
╚══════════════════════════════════════════════╝
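The score distribution section of the report is a plain text histogram. A possible way to render it is sketched below; `histogramLines` is a hypothetical helper, and the counts in `main` are copied from the v4 report above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// histogramLines renders one bar per score bucket, lowest score first.
func histogramLines(hist map[int]int) []string {
	scores := make([]int, 0, len(hist))
	for s := range hist {
		scores = append(scores, s)
	}
	sort.Ints(scores) // map iteration order is random, so sort the keys
	lines := make([]string, 0, len(scores))
	for _, s := range scores {
		n := hist[s]
		lines = append(lines, fmt.Sprintf("%2d: %s (%d)", s, strings.Repeat("█", n), n))
	}
	return lines
}

func main() {
	// Counts taken from the v4 report.
	for _, l := range histogramLines(map[int]int{4: 5, 5: 4, 6: 7, 7: 14}) {
		fmt.Println("║   " + l)
	}
}
```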


Five versions, five lessons

| Version | Avg chars | Score variance | Fallbacks | Main issue |
|---------|-----------|----------------|-----------|------------|
| v1 | 216 | 0.44 | 0 | "T'as [X] de quelqu'un qui": 17/30 roasts identical in structure |
| v2 | 110 | 0.58 | 0 | "[item] dit/says [X]": new dominant cliché |
| v3 | 95 | 0.67 | 0 | "C'est la photo LinkedIn de...": third cliché |
| v4 | 128 | 1.09 | 0 | Best version: specific, varied roasts |
| v5 | 146 | 0.90 | 1 | 3 scores of 8 appeared, but overall variance dropped |
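Comparing versions can also be made mechanical. The sketch below encodes the five rows above and picks a winner using one possible rule: highest score variance among versions that stay under the 150-character target and produce no fallbacks. That rule, and the `Run` and `best` names, are assumptions for illustration; in the article the choice of v4 also rested on reading the roasts.

```go
package main

import "fmt"

// Run holds the per-version numbers from the comparison above.
type Run struct {
	Version   string
	AvgChars  float64
	Variance  float64
	Fallbacks int
}

// best returns the run with the highest score variance among runs that stay
// under maxChars and have no fallbacks. This selection rule is an assumption.
func best(runs []Run, maxChars float64) (Run, bool) {
	var pick Run
	found := false
	for _, r := range runs {
		if r.AvgChars >= maxChars || r.Fallbacks > 0 {
			continue
		}
		if !found || r.Variance > pick.Variance {
			pick, found = r, true
		}
	}
	return pick, found
}

func main() {
	runs := []Run{
		{"v1", 216, 0.44, 0},
		{"v2", 110, 0.58, 0},
		{"v3", 95, 0.67, 0},
		{"v4", 128, 1.09, 0},
		{"v5", 146, 0.90, 1},
	}
	if r, ok := best(runs, 150); ok {
		fmt.Println("best:", r.Version)
	}
}
```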

Lesson 1 — Every positive example creates a cliché

In v1, the prompt gave examples of good roasts. Claude immediately copied the structure of those examples on 17 of the 30 cases. We banned that pattern, gave new examples — and Claude used the new examples as its new cliché. Three times in a row.

The solution (v4): drop positive structure examples entirely. Instead, describe the emotional target ("a roast that a stranger would screenshot and forward to a group chat") and only accumulate negative examples (explicitly banned patterns).

BANNED STARTERS (these patterns are overused trash):
- "[item] dit/crie/says [X]" → BANNED
- "T'as [X] de quelqu'un qui..." → BANNED
- "C'est la photo de profil LinkedIn de..." → BANNED
- Any sentence starting with "C'est la photo" → BANNED
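The same banned list can be checked in the harness, so each run reports how many roasts still open with a known cliché. This is a sketch only: the regular expressions are my own approximate translations of the patterns above, and `bannedStarters` and `bannedHit` are hypothetical names. Curly apostrophes are normalized first, since model output may use either form.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bannedStarters are hypothetical regex translations of the banned list.
var bannedStarters = []struct {
	Name string
	Re   *regexp.Regexp
}{
	// "[item] dit/crie/says [X]": the verb appears within the first five words.
	{"item dit/crie/says", regexp.MustCompile(`(?i)^\S+(\s+\S+){0,3}\s+(dit|crie|says)\b`)},
	{"t'as X de quelqu'un qui", regexp.MustCompile(`(?i)^t'as .+ de quelqu'un qui`)},
	{"c'est la photo", regexp.MustCompile(`(?i)^c'est la photo`)},
}

// bannedHit returns the name of the first banned pattern the roast opens
// with, or "" if none matches.
func bannedHit(roast string) string {
	s := strings.ReplaceAll(strings.TrimSpace(roast), "’", "'")
	for _, b := range bannedStarters {
		if b.Re.MatchString(s) {
			return b.Name
		}
	}
	return ""
}

func main() {
	roasts := []string{
		"T'as la tête de quelqu'un qui a raté son train.",
		"Le front avance plus vite que ta carrière.",
	}
	hits := 0
	for _, r := range roasts {
		if name := bannedHit(r); name != "" {
			hits++
			fmt.Printf("banned (%s): %s\n", name, r)
		}
	}
	fmt.Printf("%d/%d roasts use a banned starter\n", hits, len(roasts))
}
```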


Lesson 2 — Score variance has a natural ceiling

No matter how we phrased the scoring instruction, variance plateaued around 1.1 with randomuser.me photos. These photos are intentionally "average": they serve as generic profile photos. A variance of 2.0 is not a realistic target for a distribution that's naturally compressed between 4 and 7.
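A quick bound makes the ceiling concrete. It assumes scores really are confined to the interval [4, 7], as in the v4 histogram, and uses Popoviciu's inequality for the variance of a bounded variable.

```latex
% Popoviciu's inequality for a variable X bounded in [m, M]:
\operatorname{Var}(X) \le \frac{(M - m)^2}{4}
% With m = 4 and M = 7:
\operatorname{Var}(X) \le \frac{(7 - 4)^2}{4} = 2.25
% The bound is reached only when half the scores are 4 and half are 7.
% A uniform spread over the n = 4 integer scores \{4, 5, 6, 7\} gives:
\operatorname{Var}(X) = \frac{n^2 - 1}{12} = \frac{15}{12} = 1.25
```

So within that range, a variance of 2.0 would require an almost perfectly polarized set of scores, while even a perfectly even spread stays at 1.25. The observed 1.09 is already close to that.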

This isn't a prompt problem. It's a constraint of the input data. With real user photos (which include genuinely ugly or beautiful people), variance should be naturally higher. The v4 prompt is about as good as this test set allows.

Lesson 3 — Claude is conservative with low scores

Even when we explicitly asked for scores of 2-3 for "objectively difficult to look at" people, Claude resisted. Our reading is that its safety training pushes it away from calling a real person ugly. We rarely got below 4.0 despite repeated instructions.

For a use case like ours (consented humorous roasting), this is slightly frustrating but understandable. The real question: does the user who uploads a photo expect a 2/10? Probably not, even if it's "more honest."

Lesson 4 — Text quality improves dramatically

This is the real gain from iteration. Between v1 and v4, the quality of the roasts is incomparable:

v1: "T'as la tête de quelqu'un qui a mis 'passionné par les synergies' dans son bio LinkedIn — le bâtiment derrière toi est plus intéressant que toi."

v4: "Le front avance plus vite que ta carrière et le regard est resté coincé à la page de chargement."

Same subject, same person. v4 is about a third shorter, far more specific, and much funnier. Iterating on measurable metrics (length, fallbacks) forced prompt changes that had an indirect effect on subjective quality.

Limits of the approach

The autonomous iteration loop has important limits to keep in mind.

No ground truth. The metrics (length, variance) measure properties of the text, not its quality. A 90-char roast isn't necessarily funnier than a 180-char one. You're optimizing proxies, not the actual target.

The test set doesn't represent real users. randomuser.me = generic, neutral, well-lit profile photos. Real users upload party photos, blurry selfies, people in costume. The real distribution is different.

Each run takes ~15 minutes. 30 calls × ~30s = 15 min of waiting per iteration. Five iterations add up to about 75 minutes of runs, plus analysis time. This isn't real-time optimization.

Score variance plateaus, and that's okay. We tried for 3 iterations to improve variance without major success. Recognizing the plateau and stopping is a skill in itself.

What the loop actually enables

The main value isn't reaching the "perfect prompt". It's making visible what's invisible. Without the tool, we wouldn't have known that 17/30 roasts had the exact same structure. We would have kept thinking the roasts were "pretty good" based on the 3 examples we tested manually.
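Spotting a shared structure like the 17/30 case does not require knowing the cliché in advance. One cheap heuristic, sketched here as an assumption rather than as the method the harness used, is to count how often each n-word opener appears across a run. `opener` and `dominantOpener` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// opener returns the first n words of a roast, lowercased.
func opener(roast string, n int) string {
	words := strings.Fields(strings.ToLower(roast))
	if len(words) > n {
		words = words[:n]
	}
	return strings.Join(words, " ")
}

// dominantOpener returns the most frequent n-word opener and its count.
// Ties go to the opener that reached the count first.
func dominantOpener(roasts []string, n int) (string, int) {
	counts := map[string]int{}
	best, bestCount := "", 0
	for _, r := range roasts {
		key := opener(r, n)
		counts[key]++
		if counts[key] > bestCount {
			best, bestCount = key, counts[key]
		}
	}
	return best, bestCount
}

func main() {
	roasts := []string{
		"T'as la tête de quelqu'un qui a raté son train.",
		"T'as le regard de quelqu'un qui attend le week-end.",
		"Le front avance plus vite que ta carrière.",
	}
	o, c := dominantOpener(roasts, 1)
	fmt.Printf("dominant opener %q: %d/%d roasts\n", o, c, len(roasts))
}
```

A high count for a single opener is a hint to read those roasts side by side; it does not prove they share a structure.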

The loop forces you to define "good" before optimizing. That's the real work: not writing the prompt, but deciding which proxy metrics are relevant. Once the metrics are defined, the AI does the rest — generating tests, measuring, revealing patterns.

This approach is reproducible for any text generation with measurable properties: length, presence of certain patterns, JSON parsing failure rate, distribution of a numeric value produced. If you can write it in a 5-line report, you can automate it.

Conclusion

We started with a prompt written by instinct, roasts averaging 216 characters, and a dominant cliché covering 57% of cases (17 of 30). We ended with roasts averaging 128 characters, 0 fallbacks, and specific, varied output. Most importantly, we came away with a clear understanding of why each version was better or worse than the previous one.

What surprised me: the most effective iteration (v4) wasn't the one that gave the AI the most instructions. It was the one that gave it the fewest — describe the emotional target, ban the failed patterns, and trust the model to find something else. Fewer positive constraints, more creative freedom within negative constraints.

The quality plateau exists. At some point, iterations no longer improve anything measurable. That's the signal to stop — not because the prompt is perfect, but because the marginal gains aren't worth the time invested anymore. Knowing when to stop is just as important as knowing how to iterate.