Improving AI resume matching with prompt iteration — 7.37 to 8.37/10
Azeez Roheem · 2026-05-25 · via DEV Community

The problem nobody talks about

Here is what most AI resume rewriters actually do when you give them a weak bullet:

Original: "Attended daily standups"

AI rewritten: "Developed responsive web applications using React and TypeScript, collaborating in agile daily standups to deliver high-quality frontend solutions"

That looks better. It has keywords. It has structure. It sounds professional.

It is also a lie.

The candidate never developed React applications. They attended meetings. The AI looked at the job description, saw React and TypeScript, and quietly fabricated experience the candidate doesn't have. If they get an interview, the first technical question will expose it.

This isn't a hypothetical. This is what my resume rewriter — Resume AI Tailor — was doing before I fixed it. I know because I measured it.

What Resume AI Tailor is

Resume AI Tailor is a SaaS product I'm building under NanoCrafts. You upload a resume, paste a job description, and get a tailored version back in seconds. The stack is Next.js, gpt-4o-mini, Clerk, Neon/Drizzle, and Vercel.

The core promise is straightforward: better resume, better match score, better interview chances. The problem is that promise breaks down the moment the AI starts inventing skills the candidate doesn't have.

A resume rewriter that fabricates React experience for a WordPress developer isn't helping that developer. It's setting them up to fail in a room.

What I set out to do

I spent two weeks applying advanced prompt engineering techniques — few-shot prompting, role prompting, and LLM-as-judge eval loops — specifically to the rewrite and analyse routes in Resume AI Tailor.

The goal wasn't to make outputs look better. It was to make them measurably better, with a number I could track across every prompt change.

I started by building an eval loop: a second model call that scores each rewritten bullet on five criteria (action verb, keyword fit, outcome, truthfulness, brevity) using a structured rubric. Then I ran 10 resumes through the pipeline and established a baseline.

Week 1 baseline: 7.37/10 across 38 bullets.

Then I iterated. Two weeks, multiple prompt changes, edge case testing across 5 resume profiles. Every change measured against that baseline.

Final score: 8.37/10. Net improvement: +1.0 points.

This post documents what changed, why it worked, and what I learned that isn't in any prompt engineering tutorial.

Few-shot prompting: show, don't tell

The original rewrite prompt looked like this:

const prompt = `You are an expert resume writer. Rewrite the following 
bullet point to be stronger, specific, and results-oriented.

Original bullet: "${bullet}"
Rewritten bullet:`;

This is zero-shot prompting. One instruction, no examples. The model knows what "stronger" means in the abstract but has no idea what stronger looks like in the specific context of a resume bullet for a software engineer applying to a senior role.

The outputs were confident-sounding and structurally inconsistent. Sometimes paragraphs. Sometimes bullet lists. Sometimes the model added metrics. Sometimes it didn't. And sometimes it hallucinated skills entirely.

Few-shot prompting fixes the format problem by demonstrating the output you want rather than describing it.

const REWRITE_SYSTEM_PROMPT = `You are a senior ATS engineer and resume 
specialist with 10 years of experience configuring applicant tracking 
systems at Fortune 500 companies.

Study the examples below carefully. Your output must match this format 
exactly — return only the rewritten bullet, no explanation.

EXAMPLE 1:

Original: "Worked on the backend of the company's main product"
Target keywords: Node.js, REST APIs, microservices
Job title: Senior Software Engineer
Candidate skills: Node.js, PostgreSQL, Docker, REST APIs

Rewritten: Engineered Node.js REST APIs for core product, improving 
system performance across microservices architecture

EXAMPLE 2:

Original: "Attended daily standups"
Target keywords: React, TypeScript, agile, frontend
Job title: Frontend Developer
Candidate skills: HTML, CSS, JavaScript, WordPress

Rewritten: Participated in agile daily standups supporting [React] 
and [TypeScript] frontend delivery

Return ONLY the rewritten bullet. No explanation. Maximum 20 words.`;

Three things are happening simultaneously in that prompt:

The role primes attention. "Senior ATS engineer" tells the model which part of its training to draw on — keyword compliance, format discipline, parse-ability. Not generic writing quality.

The examples pin the format. Every run now produces a single-line output in the exact structure shown. No paragraphs, no numbered lists, no explanatory preamble.

Example 2 demonstrates truthfulness. The candidate has HTML, CSS, JavaScript, WordPress — not React or TypeScript. The example shows the model exactly what to do with missing skills: use bracket placeholders like [React] rather than claiming the skill directly.

That last point is the most important. Instructions alone didn't fix the hallucination problem. Showing the model a worked example of correct placeholder behaviour, with the candidate's actual skills listed inside the example, did.

The key insight: few-shot examples override written instructions when they conflict. If your examples all show confident keyword integration from candidates who plausibly have those skills, the model will follow the examples — not the instruction that says "be truthful." Show what you want. Be careful what you show.
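
One way to keep examples and format rules in sync is to build the system prompt from data rather than hand-editing a long string. The sketch below is an assumption, not the code Resume AI Tailor uses: the `FewShotExample` type and `buildFewShotPrompt` function are hypothetical names, and the layout simply mirrors the prompt shown earlier.

```typescript
// Hypothetical shape for one worked example; field names mirror the prompt above.
interface FewShotExample {
  original: string;
  targetKeywords: string[];
  jobTitle: string;
  candidateSkills: string[];
  rewritten: string;
}

// Render the role line, the numbered examples, and the closing format rule.
export function buildFewShotPrompt(role: string, examples: FewShotExample[]): string {
  const blocks = examples.map((ex, i) =>
    [
      `EXAMPLE ${i + 1}:`,
      '',
      `Original: "${ex.original}"`,
      `Target keywords: ${ex.targetKeywords.join(', ')}`,
      `Job title: ${ex.jobTitle}`,
      `Candidate skills: ${ex.candidateSkills.join(', ')}`,
      '',
      `Rewritten: ${ex.rewritten}`,
    ].join('\n')
  );
  return [
    role,
    'Study the examples below carefully. Your output must match this format exactly.',
    ...blocks,
    'Return ONLY the rewritten bullet. No explanation. Maximum 20 words.',
  ].join('\n\n');
}
```

Keeping examples as objects makes it easier to audit what they demonstrate: every example carries its own candidate skills list, so a missing-skill case like Example 2 is visible at a glance.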

Role prompting: giving the model a lens

Role prompting is assigning a persona to the model via the system prompt before any user content is processed. The model doesn't become that person. It shifts which part of its training data it draws on most heavily.

The difference matters. "You are a hiring manager" is an attention directive, not a capability upgrade. The model already knows what hiring managers care about. The role tells it to weight that knowledge above everything else.

I tested three personas against the same resume on Day 2.

The ATS engineer

`You are a senior ATS engineer with 10 years of experience configuring 
applicant tracking systems at Fortune 500 companies like Greenhouse, 
Workday, and Lever.

When analysing a resume, evaluate it exactly as an ATS would before 
a human ever sees it. Your priorities are:

1. Keyword match — does the resume contain exact terms a recruiter 
   would search for?
2. Formatting compliance — tables, columns, and text boxes cause 
   parsing failures. Flag them.
3. Date formatting — inconsistent dates confuse parsers.`

Output: clinical, narrow, reliable. Caught machine-layer problems — inconsistent date formats, missing keywords, non-standard section names. Ignored everything a human would care about.

The FAANG hiring manager

`You are a senior engineering hiring manager at a FAANG company. 
You have conducted over 500 hiring committee reviews across multiple 
product and infrastructure teams.

When analysing a resume, evaluate it as you would in a 30-second 
hiring committee pre-screen. Your priorities are:

1. Impact framing — does each bullet describe an outcome, not a task?
2. Scope signals — team size, user base, revenue influenced.
3. Levelling language — "Helped" signals junior. "Led" signals senior.`

The "30-second pre-screen" framing was the single most effective addition. It changed the model's evaluative posture from thorough to filtering — prioritising signal density over completeness.

Output: direct, specific, high-signal. Flagged levelling language gaps, missing scope context, and vague bullets that give a hiring committee nothing to evaluate. Most actionable of the three.

The career coach

`You are a senior career coach with 12 years of experience helping 
engineers at all levels land roles they care about.

When analysing a resume, evaluate it as you would in a first coaching 
session: with honesty, warmth, and a focus on what the candidate can 
improve before their next application.`

Output: warm, narrative-focused, occasionally too encouraging. Caught underselling and generic language that the technical personas ignored. It also defaulted to opening with "Thank you for sharing your resume!", so I had to add an explicit guardrail: "Do not open with pleasantries."

What I shipped

The FAANG hiring manager became the primary analysis prompt. It produced the most actionable feedback by a clear margin.

The ATS engineer runs as a lightweight parallel pass — cheap output, focused scope, catches the machine layer the hiring manager ignores.

The career coach is held for a future Pro tier feature. The instincts are right but it needs tightening before production.

/api/analyse
  ├── hiring manager prompt   (always: primary analysis)
  ├── ATS engineer prompt     (always: parallel pass)
  └── career coach prompt     (Pro tier: future)

The key lesson: vocabulary inside the role description matters as much as the job title. "You are a hiring manager" is weaker than "you are a hiring manager who evaluates candidates in 30-second pre-screens and cares about levelling language." The second version gives the model a specific lens, not just a label.
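
The two always-on passes can be sketched as a pair of concurrent calls. This is a minimal illustration under assumptions: `ModelCall` and `analyseResume` are hypothetical names, and the model client is injected as a function so the sketch stays independent of any particular SDK.

```typescript
// Hypothetical model caller: takes a system prompt and the resume text,
// returns the model's reply. In a real route this would wrap the chat API.
type ModelCall = (systemPrompt: string, resumeText: string) => Promise<string>;

interface AnalysisResult {
  hiringManager: string;
  ats: string;
}

export async function analyseResume(
  resumeText: string,
  prompts: { hiringManager: string; ats: string },
  callModel: ModelCall
): Promise<AnalysisResult> {
  // Both passes start immediately and are awaited together,
  // so the ATS pass adds little wall-clock time to the primary analysis.
  const [hiringManager, ats] = await Promise.all([
    callModel(prompts.hiringManager, resumeText),
    callModel(prompts.ats, resumeText),
  ]);
  return { hiringManager, ats };
}
```

Note that `Promise.all` rejects if either call fails. If the ATS pass should be optional, `Promise.allSettled` would be the safer choice.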

Eval loops: measuring what you built

After a week of prompt iteration I had outputs that looked better. I had no way to know if they actually were better.

Manual scoring across 5 resumes took 45 minutes and produced scores I couldn't fully trust — after reading enough rewrites, judgment drifts. I needed something consistent, fast, and repeatable.

The LLM-as-judge pattern is a second model call that receives both the original and the rewritten bullet, evaluates them against a structured rubric, and returns a quality score plus reasoning.

export async function evalBullet(
  original: string,
  rewritten: string,
  targetKeywords: string[],
  candidateSkills: string[]
): Promise<EvalResult> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: JUDGE_SYSTEM_PROMPT },
      { role: 'user', content: `
        Original bullet: "${original}"
        Rewritten bullet: "${rewritten}"
        Target keywords: ${targetKeywords.join(', ')}
        Candidate skills: ${candidateSkills.join(', ')}
      `}
    ],
    temperature: 0  // always 0 for judges
  });
  // parse and return structured score
}

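Set the judge's temperature to 0. The rewriter uses 0.3 because lexical variation is desirable there. The judge should score the same bullet as consistently as possible across runs, and a temperature above 0 adds scoring drift that makes baseline comparisons meaningless. Even at 0 the output is not perfectly deterministic, so expect a small residual variance between runs.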

The rubric

The rubric is the most important part. A vague rubric produces vague scores regardless of which model you use.

Score this rewritten bullet on 5 criteria, 2 points each (total 10):

1. Action verb (0-2)
   2 = strong past-tense ownership verb: Led, Built, Architected
   1 = weak or passive: Helped, Worked on, Assisted
   0 = no clear action verb

2. Keyword fit (0-2)
   2 = target keywords integrated naturally — reads well without them
   1 = keywords present but sentence feels constructed around them
   0 = no target keywords present

3. Outcome present (0-2)
   2 = clear measurable outcome, or placeholder [X%] used correctly
   1 = outcome implied but not quantified
   0 = task description only, no outcome signal

4. Truthfulness (0-2)
   2 = all claims supported by original or candidate skills,
       OR missing keywords correctly wrapped in [brackets]
   1 = minor extrapolation, defensible
   0 = claims skill not in original or skills list, not bracketed

5. Brevity (0-2)
   2 = one sentence, 20 words or under, no filler phrases
   1 = slightly long or contains padding
   0 = multiple sentences or significantly over 20 words

The additive structure — 5 criteria, 2 points each — forces the judge to evaluate each dimension independently. Holistic scoring ("rate this 1-10") collapses everything into a gut feeling that drifts. Mechanical criteria with explicit level descriptions produce consistent scores you can aggregate and trend.
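
The additive rubric also makes the judge's output easy to validate in code. The sketch below assumes the judge is asked to reply with a JSON object holding one integer per criterion. The key names (`actionVerb`, `keywordFit`, and so on) and the `parseJudgeScore` function are hypothetical, since the post does not show the judge's actual output format.

```typescript
// Assumed JSON keys, one per rubric criterion.
const CRITERIA = ['actionVerb', 'keywordFit', 'outcome', 'truthfulness', 'brevity'] as const;
type Criterion = (typeof CRITERIA)[number];

interface RubricScore {
  scores: Record<Criterion, number>;
  total: number; // sum of the five criteria, 0 to 10
}

export function parseJudgeScore(raw: string): RubricScore {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Judge output is not a JSON object');
  }
  const record = parsed as Record<string, unknown>;
  const scores = {} as Record<Criterion, number>;
  let total = 0;
  for (const criterion of CRITERIA) {
    const value = record[criterion];
    // Each criterion must be an integer in the 0-2 range the rubric defines.
    if (typeof value !== 'number' || !Number.isInteger(value) || value < 0 || value > 2) {
      throw new Error(`Invalid score for ${criterion}: ${String(value)}`);
    }
    scores[criterion] = value;
    total += value;
  }
  return { scores, total };
}
```

Computing the total in code instead of trusting a total from the judge removes one source of inconsistency: the model only has to get five small decisions right.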

Rubric design matters more than judge model

In my experience, upgrading the judge from gpt-4o-mini to gpt-4o produces less improvement than tightening the rubric by one criterion. The bottleneck in LLM-as-judge systems is almost never model capability. It's instruction clarity.

Compare:

// Vague — produces drift
"Rate keyword integration 0-2"

// Precise — produces consistent scores  
"Keyword fit (0-2):
 2 = target keywords integrated naturally — the sentence reads 
     well without them
 1 = keywords present but the sentence feels constructed around them
 0 = no target keywords present"

The precise version gives the judge a test it can apply mechanically. "Would this sentence read well without the keywords?" is a yes/no question. "Is keyword integration good?" is a judgment call. Judgment calls drift. Mechanical tests don't.

Scores are relative, not absolute

A bullet scoring 7/10 doesn't mean it's objectively a 7/10 resume bullet. It means it scored 7/10 against your rubric, run by your judge, on this version of your prompt.

What it does have is comparative value. If you change the rewrite prompt and the same 10 bullets average 8.2/10 instead of 6.8/10 — everything else held constant — the prompt improved. That's what systematic prompt engineering looks like: not intuition, evidence.

This is why I log results to a JSON file across runs:

import * as fs from 'node:fs';
import * as path from 'node:path';

export function logResults(results: EvalResult[], outputPath: string): void {
  // Create the output directory on first run.
  const dir = path.dirname(outputPath);
  if (!fs.existsSync(dir)) fs.mkdirSync(dir, { recursive: true });

  // Load earlier runs so new results are appended, not overwritten.
  const existing = fs.existsSync(outputPath)
    ? JSON.parse(fs.readFileSync(outputPath, 'utf-8'))
    : [];

  fs.writeFileSync(outputPath, JSON.stringify(
    [...existing, ...results], null, 2
  ));
}

Each run appends to the file. The file is your baseline. Without it you're iterating blind.

The fix that actually worked: structural constraints over instructions

After adding few-shot examples and role prompting, the rewriter was still hallucinating on underqualified candidates.

Sarah Johnson has HTML, CSS, JavaScript, and WordPress. She applied to a React and TypeScript role. The rewriter was producing this:

Original:  "Attended daily standups"
Rewritten: "Developed responsive web applications using React and TypeScript,
            collaborating in agile daily standups to deliver frontend solutions"

React and TypeScript. Fabricated. Every time.

I had a truthfulness instruction in the system prompt:

Truthfulness — never invent numbers or metrics not present in the original.
If a target keyword represents a skill the candidate has not explicitly used,
do NOT claim they used it.

It wasn't working. Three prompt iterations, same result. The instruction was being overridden by the few-shot examples — every example showed confident keyword integration from a candidate who plausibly had those skills. The model followed the examples, not the written rule.

The fix wasn't a better instruction. It was a structural change.

Before — instruction only

export async function rewriteBullet(
  bullet: string,
  targetKeywords: string[],
  jobTitle: string
): Promise<string>

The model had no reference list to check keywords against. It knew the job required React. It didn't know the candidate didn't have it. So it integrated React — confidently, fluently, incorrectly.

After — candidateSkills as a parameter

export async function rewriteBullet(
  bullet: string,
  targetKeywords: string[],
  jobTitle: string,
  candidateSkills: string[]  // added
): Promise<string> {
  // The skills list goes into every call, next to the target keywords.
  // (Comments must stay outside the template literal, or they end up in the prompt.)
  const userMessage = `Original: "${bullet}"
Target keywords: ${targetKeywords.join(', ')}
Job title: ${jobTitle}
Candidate skills: ${candidateSkills.join(', ')}

Rewritten:`;

Now the model has both sides of the equation in every single call. Target keywords on one side. Candidate's actual skills on the other. When it sees React in the keywords but not in the skills list, it uses a placeholder instead of claiming the skill:

Original:  "Attended daily standups"
Rewritten: "Participated in agile daily standups supporting [React] 
            and [TypeScript] frontend delivery"

[React] and [TypeScript] are honest. They tell the candidate: these are the skills the job requires that you don't currently have. The resume becomes a gap analysis rather than a fabrication.

Why this worked when the instruction didn't

Passing candidateSkills as a parameter is a structural constraint. The data is present in the context window on every call. The model has to work with it explicitly.

A prompt instruction saying "be truthful" is a suggestion the model weighs against everything else in its context — including the few-shot examples that demonstrate confident integration. When examples and instructions conflict, examples win.

But a skills list in the context window doesn't compete with examples. It's data. The model is much less likely to ignore data it has been handed than to de-prioritise an instruction.

The lesson: structural constraints beat instructional constraints for truthfulness enforcement. If you want the model to check something before acting, give it the thing to check — don't just tell it to check.

This is the most transferable insight from two weeks of prompt iteration. It applies beyond resume rewriting to any situation where you need the model to respect a boundary that its training data might push against.
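
The same skills list can also back a deterministic check after the rewrite. The sketch below is an extension, not part of the pipeline described here: `findUnsupportedKeywords` is a hypothetical helper that flags target keywords claimed outside brackets when they are missing from the candidate's skills.

```typescript
export function findUnsupportedKeywords(
  rewritten: string,
  targetKeywords: string[],
  candidateSkills: string[]
): string[] {
  const skills = new Set(candidateSkills.map((s) => s.toLowerCase()));
  // Drop bracketed placeholders such as [React] before searching,
  // so only directly claimed keywords are considered.
  const claimedText = rewritten.replace(/\[[^\]]*\]/g, ' ').toLowerCase();
  return targetKeywords.filter(
    (k) => !skills.has(k.toLowerCase()) && claimedText.includes(k.toLowerCase())
  );
}
```

The match is a plain case-insensitive substring test, so it has known blind spots: "Java" would match inside "JavaScript", and non-skill keywords such as "agile" get flagged unless they are left out of the list being checked. It is a cheap tripwire to run alongside the judge, not a replacement for it.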

Results: before and after

The hallucination problem, fixed

Here is the same candidate (Sarah Johnson, an HTML/CSS/JavaScript/WordPress developer applying to a React and TypeScript role) run through two versions of the prompt.

Zero-shot (Week 1 — no examples, no candidateSkills):

Original:  "Worked on client websites"
Rewritten: "Developed responsive client websites using React and TypeScript,
            implementing modern CSS frameworks in an agile environment"

Original:  "Helped with bug fixes"  
Rewritten: "Resolved bugs in React and TypeScript, enhancing frontend 
            performance and responsiveness within agile development cycles"

Original:  "Attended daily standups"
Rewritten: "Developed responsive web applications using React and TypeScript,
            collaborating in agile daily standups to deliver frontend solutions"

React and TypeScript appear in every bullet. Sarah has neither. The rewriter optimised for the job description and ignored the candidate entirely.

Week 2 (role prompting + few-shot + candidateSkills):

Original:  "Worked on client websites"
Rewritten: "Developed client websites using HTML, CSS, and JavaScript, 
            ensuring responsive design and supporting [React] and 
            [TypeScript] frameworks"

Original:  "Helped with bug fixes"
Rewritten: "Assisted in bug fixes for frontend components using HTML, 
            CSS, and JavaScript while supporting agile development 
            and [React] integration, improving [metric] by [X%]"

Original:  "Attended daily standups"
Rewritten: "Participated in agile daily standups supporting [React] 
            and [TypeScript] frontend delivery"

The difference is structural. HTML, CSS, and JavaScript — skills Sarah actually has — are integrated directly. [React] and [TypeScript] appear as bracketed placeholders, not claims. The resume now tells an honest story: here is what I have, here is what the role requires that I don't yet have.

That's more useful to a candidate than a fabrication. It's also something they can actually defend in a room.

Strong bullets preserved correctly

The pipeline doesn't rewrite everything. scoreBullet evaluates each bullet before the rewrite pass and flags which ones actually need work.

Here is Priya Patel's resume — Staff Engineer applying for a Principal Engineer role — run through the same pipeline:

Original:  "Architected event-driven microservices platform handling 
            50M daily events, reducing infrastructure cost by 40%"
Rewritten: [unchanged — scoreBullet returned needs_rewrite: false]

Original:  "Led team of 8 engineers across 3 time zones to deliver 
            platform re-architecture 2 weeks ahead of schedule"
Rewritten: [unchanged]

Original:  "Defined engineering standards adopted across 4 product 
            teams, reducing incident rate by 35%"
Rewritten: [unchanged]

Three of Priya's bullets were preserved. Strong bullets with real metrics, ownership verbs, and scope context don't need rewriting — and the pipeline correctly identifies that. The rewriter amplifies good inputs. It cannot manufacture signal that isn't there.
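
The score-then-rewrite flow can be sketched as a simple gate. The post names `scoreBullet` and the `needs_rewrite` flag but does not show their signatures, so the types below are assumptions, and `rewriteWeakBullets` is a hypothetical name. Both functions are injected so the gate can be tested without a model call.

```typescript
// Assumed signatures: only the names scoreBullet and needs_rewrite come from the post.
type ScoreBullet = (bullet: string) => Promise<{ needs_rewrite: boolean }>;
type Rewrite = (bullet: string) => Promise<string>;

export async function rewriteWeakBullets(
  bullets: string[],
  scoreBullet: ScoreBullet,
  rewrite: Rewrite
): Promise<string[]> {
  return Promise.all(
    bullets.map(async (bullet) => {
      const { needs_rewrite } = await scoreBullet(bullet);
      // Strong bullets pass through untouched; only flagged ones reach the rewriter.
      return needs_rewrite ? rewrite(bullet) : bullet;
    })
  );
}
```

`Promise.all` preserves input order, so the returned array lines up with the original resume.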

The score floor

Every bullet that scored below 6/10 in the eval loop traced back to an original with no action verb, no skill, and no outcome:

5/10 — original: "Helped with bug fixes"
       no tool, no scale, no outcome — nothing to amplify

5/10 — original: "Attended daily standups"  
       describes presence, not contribution

No prompt engineering fixes this. The information needed to write a strong bullet doesn't exist in the original. That's a product problem, not a prompt problem. The right fix is surfacing a coaching message to the user: "This bullet doesn't give us enough to work with. Try adding what you built, what tool you used, or what changed because of your work."

The rewriter ceiling on content-free bullets is approximately 6/10 regardless of prompt quality. Knowing that ceiling exists — and being honest about it with users — is more valuable than pretending the AI can fix anything.
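
Surfacing that coaching message could be as small as a threshold check on the eval score. This is a sketch under assumptions: `coachingFor` is a hypothetical helper, and the default threshold of 6 is taken from the "below 6/10" observation above, not from shipped code.

```typescript
const COACHING_MESSAGE =
  "This bullet doesn't give us enough to work with. Try adding what you built, " +
  'what tool you used, or what changed because of your work.';

// Returns the coaching message for low-scoring bullets, or null when none is needed.
export function coachingFor(total: number, threshold = 6): string | null {
  return total < threshold ? COACHING_MESSAGE : null;
}
```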

The numbers

Score progression

| Run | Score | Key change |
| --- | --- | --- |
| Week 1 baseline | 7.37/10 | Combined role + few-shot prompts |
| Week 2 Day 6 | 8.26/10 | Outcome placeholder instruction |
| Week 2 Day 7 | 8.18/10 | Edge case fixes (within noise) |
| Week 2 Day 8 final | 8.37/10 | Stable confirmed score |

Net improvement: +1.0 points over two weeks.

Two runs at 8.37 with the same lowest-scoring bullets confirmed this is a stable measurement, not a lucky run. The ±0.15 variance across runs at temperature 0 is the expected noise floor — changes within that range are not meaningful.
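
Treating the noise floor as an explicit threshold keeps small fluctuations from being read as progress. A minimal sketch, assuming the ±0.15 figure above as the default. `meanScore` and `compareRuns` are hypothetical helper names.

```typescript
type Verdict = 'improved' | 'regressed' | 'within noise';

// Average the per-bullet totals of one run.
export function meanScore(totals: number[]): number {
  if (totals.length === 0) throw new Error('No scores to average');
  return totals.reduce((sum, t) => sum + t, 0) / totals.length;
}

// Compare two run averages, ignoring differences inside the noise floor.
export function compareRuns(baseline: number, candidate: number, noiseFloor = 0.15): Verdict {
  const delta = candidate - baseline;
  if (Math.abs(delta) <= noiseFloor) return 'within noise';
  return delta > 0 ? 'improved' : 'regressed';
}
```

With the numbers from the table, the move from Day 6 (8.26) to Day 7 (8.18) falls inside the noise floor, while the move from the 7.37 baseline to 8.37 does not.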

Dimension breakdown

| Criterion | Week 1 | Week 2 | Change |
| --- | --- | --- | --- |
| Action verb | 1.82/2 | 1.89/2 | +0.07 |
| Keyword fit | 1.45/2 | 1.37/2 | -0.08 |
| Outcome | 1.08/2 | 1.97/2 | +0.89 |
| Truthfulness | 1.68/2 | 1.66/2 | -0.02 |
| Brevity | 1.26/2 | 1.47/2 | +0.21 |

The outcome story

Outcome is the standout improvement — +0.89 on a 2-point scale from a single instruction change.

Week 1: the rewriter was good at verbs and keywords but almost never added outcome signal. When the original bullet had no outcome, the model left the rewrite outcome-free too. A bullet that describes a task without a result gives a hiring committee nothing to evaluate.

The fix was one additional priority in the system prompt:

7. Outcome placeholder — if no outcome exists in the original bullet,
   add a placeholder rather than leaving the bullet outcome-free:
   "improving [metric] by [X%]", "resulting in [outcome]",
   "reducing [problem] by [X%]", or "enabling [result]".
   Never leave a rewritten bullet without any outcome signal.


That instruction moved outcome from 1.08/2 to 1.97/2, nearly a full point on a 2-point scale. It was the highest-ROI prompt change of the two weeks.

The keyword fit regression

Keyword fit dropped slightly from 1.45/2 to 1.37/2. This was deliberate.

The rewriter was injecting generic ATS phrases — "delivery record", "strong portfolio", "5+ years experience" — into every weak bullet regardless of relevance. A filter was added to the route handler:

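// Generic ATS phrases to drop before they reach the rewriter prompt.
// Matching below is a case-insensitive substring test, so 'delivery record'
// already covers 'strong delivery record'; the longer entry is redundant but harmless.
const GENERIC_KEYWORD_FILTER = [
  'delivery record',
  'strong delivery record',
  '5+ years experience',
  'strong portfolio',
  'fast learner',
  'team player',
];

// Merge both keyword sources, de-duplicate, then remove any keyword
// that contains one of the generic phrases above.
const targetKeywords = [
  ...new Set([...(missingSkills || []), ...(atsKeywords || [])])
].filter(k => !GENERIC_KEYWORD_FILTER.some(
  g => k.toLowerCase().includes(g.toLowerCase())
));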


Filtering generic phrases reduced keyword coverage marginally. It eliminated keyword stuffing across long resumes entirely. That tradeoff is correct — a resume where every bullet ends with "enhancing delivery record and meeting business requirements" signals machine generation to any experienced recruiter.
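To see what the filter does to a concrete input, here is a minimal sketch that wraps the same logic in a function. The function name `buildTargetKeywords` and the sample keywords are assumptions made for illustration; they are not taken from the route handler.

```typescript
// Illustrative only: the same filter idea as the route handler, as a pure function.
const GENERIC_PHRASES: string[] = [
  "delivery record",
  "5+ years experience",
  "strong portfolio",
  "fast learner",
  "team player",
];

// Hypothetical helper name; merges, de-duplicates, then drops generic phrases.
function buildTargetKeywords(
  missingSkills: string[] | undefined,
  atsKeywords: string[] | undefined
): string[] {
  const merged = Array.from(
    new Set([...(missingSkills ?? []), ...(atsKeywords ?? [])])
  );
  // Case-insensitive substring match against every generic phrase.
  return merged.filter(
    (k) => !GENERIC_PHRASES.some((g) => k.toLowerCase().includes(g))
  );
}

// Sample data is made up for this example.
const kept = buildTargetKeywords(
  ["TypeScript", "Strong delivery record"],
  ["TypeScript", "PostgreSQL", "Team player"]
);
console.log(kept); // ["TypeScript", "PostgreSQL"]
```

Note that de-duplication through `Set` is case-sensitive, so "TypeScript" and "typescript" would both survive; whether that matters depends on how the upstream keyword lists are produced.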

What 8.37/10 actually means

It's not a grade. It's a baseline.

Every future prompt change gets measured against 8.37. If a change produces 8.6 on the same 10 resumes with the same rubric and judge, the prompt improved. If it produces 8.1, it regressed. The number is only meaningful in comparison — not in isolation.
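A comparison like this is easy to encode so that nobody reads meaning into a change that sits inside the noise. The sketch below is an assumption about how that check might look, not code from the repo: the helper name `compareToBaseline` is hypothetical, and the constants are the 8.37 baseline and ±0.15 noise floor reported earlier in this post.

```typescript
// Hypothetical helper: classify a new eval average against the stored baseline.
type Verdict = "improved" | "regressed" | "within_noise";

const BASELINE_SCORE = 8.37; // stable Week 2 score from this post
const NOISE_FLOOR = 0.15; // run-to-run variance reported for temperature 0

function compareToBaseline(
  newScore: number,
  baseline: number = BASELINE_SCORE,
  noise: number = NOISE_FLOOR
): Verdict {
  const delta = newScore - baseline;
  // Differences inside the noise floor are treated as "no measurable change".
  if (Math.abs(delta) <= noise) return "within_noise";
  return delta > 0 ? "improved" : "regressed";
}

console.log(compareToBaseline(8.6)); // "improved"
console.log(compareToBaseline(8.1)); // "regressed"
console.log(compareToBaseline(8.3)); // "within_noise"
```

The comparison only holds when the resumes, rubric, and judge are identical between runs; the function has no way to check that, so it has to be enforced by the eval harness.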

This is the eval loop working as designed. Without it, every prompt change is a guess. With it, every prompt change is an experiment.

Eval stats

Bullets evaluated:    38
Average score:        8.37 / 10
Improved over orig:   30 / 38 (79%)
Unchanged (strong):   8 / 38

Average breakdown:
  Action verb:        1.89 / 2
  Keyword fit:        1.37 / 2
  Outcome:            1.97 / 2
  Truthfulness:       1.66 / 2
  Brevity:            1.47 / 2
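
For readers who want to produce a summary like the one above from their own judge output, here is a minimal aggregation sketch. It assumes the rubric implied by the tables in this post (five criteria, 0 to 2 points each); the `BulletScore` shape and the `summarizeScores` name are hypothetical, not the repo's actual types.

```typescript
// Assumed per-bullet judge output: five criteria, each scored 0-2.
interface BulletScore {
  actionVerb: number;
  keywordFit: number;
  outcome: number;
  truthfulness: number;
  brevity: number;
}

const CRITERIA = ["actionVerb", "keywordFit", "outcome", "truthfulness", "brevity"] as const;
type Criterion = (typeof CRITERIA)[number];

interface EvalSummary {
  bullets: number;
  averageTotal: number;
  averageByCriterion: Record<Criterion, number>;
}

const round2 = (x: number): number => Math.round(x * 100) / 100;

function summarizeScores(scores: BulletScore[]): EvalSummary {
  if (scores.length === 0) throw new Error("no scores to summarize");
  const sums: Record<Criterion, number> = {
    actionVerb: 0, keywordFit: 0, outcome: 0, truthfulness: 0, brevity: 0,
  };
  // Accumulate each criterion across all bullets.
  for (const s of scores) {
    for (const c of CRITERIA) sums[c] += s[c];
  }
  const averageByCriterion = { ...sums };
  let total = 0;
  for (const c of CRITERIA) {
    total += sums[c];
    averageByCriterion[c] = round2(sums[c] / scores.length);
  }
  return {
    bullets: scores.length,
    averageTotal: round2(total / scores.length),
    averageByCriterion,
  };
}

// Made-up scores for two bullets, purely to show the shape of the output.
const summary = summarizeScores([
  { actionVerb: 2, keywordFit: 1, outcome: 2, truthfulness: 2, brevity: 1 },
  { actionVerb: 1, keywordFit: 1, outcome: 0, truthfulness: 2, brevity: 1 },
]);
console.log(summary.averageTotal); // 6.5
```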


What I learned that isn't in the documentation

1. Few-shot examples override written instructions when they conflict

I wrote "never claim skills the candidate doesn't have" in the system prompt. Every few-shot example showed confident keyword integration from a candidate who plausibly had those skills. The model followed the examples.

This isn't a bug — it's how in-context learning works. The model pattern-matches to demonstrated behaviour more reliably than it follows written rules. Which means the examples you choose are more consequential than the instructions you write.

The practical implication: audit your few-shot examples as carefully as your instructions. If your examples demonstrate behaviour you don't want in edge cases, the model will replicate it in those edge cases regardless of what the instructions say.

Show, don't tell. But be careful what you show.

2. Safety and quality are not the same measurement

I had a validateRewrite function that acted as a gate before serving any rewritten bullet:

// Returns: "use_rewrite" | "use_original" | "rewrite_again"
export async function validateRewrite(
  original: string,
  rewritten: string,
  targetKeywords: string[],
  candidateSkills: string[]
)


It answered the question: is this safe to serve? No fabricated skills, keywords present, different from original — use it.

The eval loop answered a different question: how good is this?

A bullet could pass the first question and fail the second:

validateRewrite:  recommendation = use_rewrite
                  truthfulness_risk = low       ← passed the gate

eval loop:        total_score = 5/10
                  action_verb = 1               ← weak verb
                  outcome = 0                   ← no outcome signal
                  brevity = 1                   ← slightly long



The bullet was safe. It wasn't good. Both assessments were correct — they were measuring different things.

Most developers building AI features stop at the safety layer. They add validation to prevent hallucination and call it done. The gap between "safe to serve" and "good enough to be useful" is where user trust quietly erodes. The eval loop makes that gap visible.

The production fix is to wire evalBullet as a second gate with a minimum threshold — bullets that pass safety but fail quality get a coaching message rather than a mediocre rewrite. That's a Week 3 task.
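A rough sketch of how the two gates could compose is below. It is a guess at the shape, not the planned implementation: it takes the results of `validateRewrite` and `evalBullet` as already-computed inputs, and the `Served` type, the `chooseServedBullet` name, and the `totalScore` field are all hypothetical. The threshold of 6 is the one this post uses for the coaching message.

```typescript
// Assumed to mirror the three recommendations validateRewrite can return.
type GateRecommendation = "use_rewrite" | "use_original" | "rewrite_again";

// Hypothetical shape of an evalBullet result; only the total is needed here.
interface BulletEval {
  totalScore: number; // 0-10
}

type Served =
  | { kind: "rewrite"; text: string }
  | { kind: "original"; text: string }
  | { kind: "coaching"; message: string }
  | { kind: "retry" };

const MIN_QUALITY_SCORE = 6;
const COACHING_MESSAGE =
  "This bullet doesn't give us enough to work with. Try adding what you built, " +
  "what tool you used, or what changed because of your work.";

function chooseServedBullet(
  original: string,
  rewritten: string,
  recommendation: GateRecommendation,
  evalResult: BulletEval
): Served {
  // Gate 1: safety. Anything the validator rejects never reaches the quality check.
  if (recommendation === "use_original") return { kind: "original", text: original };
  if (recommendation === "rewrite_again") return { kind: "retry" };
  // Gate 2: quality. Safe but weak rewrites become a coaching message.
  if (evalResult.totalScore < MIN_QUALITY_SCORE) {
    return { kind: "coaching", message: COACHING_MESSAGE };
  }
  return { kind: "rewrite", text: rewritten };
}

const decision = chooseServedBullet(
  "Helped with bug fixes",
  "Assisted with bug fixes, enabling [result]",
  "use_rewrite",
  { totalScore: 5 }
);
console.log(decision.kind); // "coaching"
```

Keeping the decision a pure function of both results makes it straightforward to unit-test the threshold without calling the model.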

3. There is a hard ceiling on what prompt engineering can achieve

Every bullet scoring below 6/10 in the eval loop had the same profile: an original with no action verb, no skill, and no outcome.

"Attended daily standups"    → no tool, no contribution, no result
"Helped with bug fixes"      → passive, no scope, no outcome
"Worked on improving things" → no specificity whatsoever


The rewriter ceiling on these bullets is approximately 6/10 regardless of how good the prompt is. The information needed to write a strong bullet simply does not exist in the original.

This is not a prompt problem. It is an input problem.

No amount of instruction, few-shot examples, or role prompting manufactures signal that isn't there. The right response to a content-free bullet is not a better rewrite — it's asking the candidate for more information.

"This bullet doesn't give us enough to work with.
Try adding: what you built, what tool you used, 
or what changed because of your work.

Example: 'Attended daily standups' → 
'Attended daily standups for a 6-person React team 
shipping features weekly'"


That coaching message is more useful to a candidate than any rewrite the model could produce from the original. It's also honest in a way that confident-sounding AI output rarely is.

The most important thing prompt engineering taught me over two weeks is where its limits are. Knowing the ceiling — and being transparent about it with users — is more valuable than pretending the AI can fix anything you give it.

What changes in Week 3

The eval loop surfaced four specific gaps that prompt changes alone cannot fix. These are the opening tasks for Week 3:

Code-level keyword distribution tracking. The keyword distribution instruction in the system prompt has no effect across stateless API calls — each bullet is a separate call with no memory of what keywords appeared in previous bullets. The fix is a post-processing pass in the route handler that tracks keyword frequency across the full rewritten set and removes over-repeated phrases. A partial version is already in production. A proper implementation needs cross-bullet state.
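As a starting point for the cross-bullet state described above, the sketch below counts how many bullets use each tracked phrase and flags every use beyond a cap. It is an illustration under assumptions: `findOverusedPhrases`, `OveruseFlag`, and the sample bullets are made up, and the step that actually removes or re-requests a flagged phrase is deliberately left out because stripping text mid-sentence can break grammar.

```typescript
interface OveruseFlag {
  bulletIndex: number; // position in the rewritten set
  phrase: string;
  occurrence: number; // 1-based count of bullets containing the phrase so far
}

// Hypothetical helper: walks the full rewritten set in order, keeping state
// across bullets, which a stateless per-bullet model call cannot do.
function findOverusedPhrases(
  bullets: string[],
  phrases: string[],
  maxUses: number
): OveruseFlag[] {
  const counts = new Map<string, number>();
  const flags: OveruseFlag[] = [];
  bullets.forEach((bullet, bulletIndex) => {
    const lower = bullet.toLowerCase();
    for (const phrase of phrases) {
      const key = phrase.toLowerCase();
      // Counts a phrase once per bullet, even if it repeats inside that bullet.
      if (!lower.includes(key)) continue;
      const seen = (counts.get(key) ?? 0) + 1;
      counts.set(key, seen);
      if (seen > maxUses) flags.push({ bulletIndex, phrase, occurrence: seen });
    }
  });
  return flags;
}

// Made-up bullets for illustration.
const flagged = findOverusedPhrases(
  [
    "Built REST APIs in Node.js with cross-functional collaboration",
    "Led PostgreSQL migration through cross-functional collaboration",
    "Automated CI pipelines via cross-functional collaboration",
  ],
  ["cross-functional collaboration"],
  2
);
console.log(flagged); // one flag: bulletIndex 2, occurrence 3
```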

Soft skills matching on zero-technical-match candidates. When a candidate has zero technical skills matching the job description, the model makes a top-level "no match" assessment and stops evaluating the skills array for soft skill overlap. A teacher applying to a data analyst role has Communication and Leadership in her skills array — both required by the job — but the analyser returns matched_skills: []. The fix is a code-level post-processing step that compares the skills array directly against JD keywords in TypeScript, bypassing the model for this field.
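The direct comparison described above could be as small as the following sketch. It assumes exact matching after trimming and lowercasing, which is the simplest possible rule; the function name `matchSkillsToKeywords` and the sample job keywords are hypothetical, and synonyms such as "Team leadership" versus "Leadership" would need extra handling.

```typescript
function normalizeTerm(s: string): string {
  return s.trim().toLowerCase();
}

// Hypothetical post-processing step: compute matched skills in code
// instead of relying on the model's top-level "no match" assessment.
function matchSkillsToKeywords(
  candidateSkills: string[],
  jdKeywords: string[]
): string[] {
  const wanted = new Set(jdKeywords.map(normalizeTerm));
  const seen = new Set<string>();
  const matched: string[] = [];
  for (const skill of candidateSkills) {
    const key = normalizeTerm(skill);
    // Keep the candidate's original casing, and report each skill once.
    if (wanted.has(key) && !seen.has(key)) {
      seen.add(key);
      matched.push(skill);
    }
  }
  return matched;
}

// The teacher example from the paragraph above, with made-up JD keywords.
const matchedSkills = matchSkillsToKeywords(
  ["Communication", "Leadership", "Lesson planning"],
  ["SQL", "Python", "communication", "leadership"]
);
console.log(matchedSkills); // ["Communication", "Leadership"]
```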

Production eval gate. The eval loop currently runs offline as a measurement tool. Week 3 wires evalBullet into the production route as an async quality gate — bullets that pass validateRewrite but score below 6/10 on the eval rubric get a coaching message rather than a mediocre rewrite. The gate runs in the background so it doesn't add latency to the critical path.

Controlled comparison: combined vs few-shot only. Week 1 and Week 2 only tested the combined prompt (role + few-shot). The contribution of role prompting in isolation was never measured. Week 3 runs the same 10 resumes through a few-shot-only variant — no role declaration — and scores that too. The difference between those two scores is the actual measured contribution of role prompting. That's the kind of evidence that makes a blog post technically credible rather than anecdotal.

Target: 8.8/10 average across the same 10 resumes.

The build-in-public part

This post documents Week 2 of a 10-week curriculum applying prompt engineering techniques to a real SaaS product — Resume AI Tailor — that real users pay for.

The eval baseline started at 7.37/10. It ended at 8.37/10. The target for Week 3 is 8.8/10.

I'll publish the Week 3 results — including the controlled comparison, the production eval gate, and whatever new edge cases surface — in the next post. If the score doesn't hit 8.8, I'll say that too.

The code for the eval loop, the prompt variants, and the full pipeline is on GitHub: github.com/Azeez1314/resume-ai

The product is live at: resumetailor.cv

One question for you

If you're building AI-powered features and measuring output quality with more than "it looks good to me" — I'd genuinely like to know what you're using.

The LLM-as-judge pattern is well-documented in research but underused in production SaaS. The gap between knowing the technique exists and actually having a scored baseline you make decisions from is significant. If you've closed that gap, I'd like to hear how.

Leave a comment, or find me on X and LinkedIn as NanoCrafts.

Resume AI Tailor is part of the NanoCrafts portfolio — a collection of focused SaaS tools built and shipped in public.

resumetailor.cv · nanocrafts.xyz · azeezroheem.hashnode.dev