Improving AI resume matching with prompt iteration — 7.37 to 8.37/10

The problem nobody talks about

Here is what most AI resume rewriters actually do when you give them a weak bullet:

Original: "Attended daily standups"

AI rewritten: "Developed responsive web applications using React and TypeScript, collaborating in agile daily standups to deliver high-quality frontend solutions"

That looks better. It has keywords. It has structure. It sounds professional.

It is also a lie.

The candidate never developed React applications. They attended meetings. The AI looked at the job description, saw React and TypeScript, and quietly fabricated experience the candidate doesn't have. If they get an interview, the first technical question will expose it.

This isn't a hypothetical. This is what my resume rewriter — Resume AI Tailor — was doing before I fixed it. I know because I measured it.

What Resume AI Tailor is

Resume AI Tailor is a SaaS product I'm building under NanoCrafts. You upload a resume, paste a job description, and get a tailored version back in seconds. The stack is Next.js, gpt-4o-mini, Clerk, Neon/Drizzle, and Vercel.

The core promise is straightforward: better resume, better match score, better interview chances. The problem is that promise breaks down the moment the AI starts inventing skills the candidate doesn't have.

A resume rewriter that fabricates React experience for a WordPress developer isn't helping that developer. It's setting them up to fail in a room.

What I set out to do

I spent two weeks applying advanced prompt engineering techniques — few-shot prompting, role prompting, and LLM-as-judge eval loops — specifically to the rewrite and analyse routes in Resume AI Tailor.

The goal wasn't to make outputs look better. It was to make them measurably better, with a number I could track across every prompt change.

I started by building an eval loop: a second model call that scores each rewritten bullet on five criteria (action verb, keyword fit, outcome, truthfulness, brevity) using a structured rubric. Then I ran 10 resumes through the pipeline and established a baseline.

Week 1 baseline: 7.37/10 across 38 bullets.

Then I iterated. Two weeks, multiple prompt changes, edge case testing across 5 resume profiles. Every change measured against that baseline.

Final score: 8.37/10. Net improvement: +1.0 points.

This post documents what changed, why it worked, and what I learned that isn't in any prompt engineering tutorial.

Few-shot prompting: show, don't tell

The original rewrite prompt looked like this:

const prompt = `You are an expert resume writer. Rewrite the following 
bullet point to be stronger, specific, and results-oriented.

Original bullet: "${bullet}"
Rewritten bullet:`;

This is zero-shot prompting. One instruction, no examples. The model knows what "stronger" means in the abstract but has no idea what stronger looks like in the specific context of a resume bullet for a software engineer applying to a senior role.

The outputs were confident-sounding and structurally inconsistent. Sometimes paragraphs. Sometimes bullet lists. Sometimes the model added metrics. Sometimes it didn't. And sometimes it hallucinated skills entirely.

Few-shot prompting fixes the format problem by demonstrating the output you want rather than describing it.

const REWRITE_SYSTEM_PROMPT = `You are a senior ATS engineer and resume 
specialist with 10 years of experience configuring applicant tracking 
systems at Fortune 500 companies.

Study the examples below carefully. Your output must match this format 
exactly — return only the rewritten bullet, no explanation.

EXAMPLE 1:

Original: "Worked on the backend of the company's main product"
Target keywords: Node.js, REST APIs, microservices
Job title: Senior Software Engineer
Candidate skills: Node.js, PostgreSQL, Docker, REST APIs

Rewritten: Engineered Node.js REST APIs for core product, improving 
system performance across microservices architecture

EXAMPLE 2:

Original: "Attended daily standups"
Target keywords: React, TypeScript, agile, frontend
Job title: Frontend Developer
Candidate skills: HTML, CSS, JavaScript, WordPress

Rewritten: Participated in agile daily standups supporting [React] 
and [TypeScript] frontend delivery

Return ONLY the rewritten bullet. No explanation. Maximum 20 words.`;

Three things are happening simultaneously in that prompt:

The role primes attention. "Senior ATS engineer" tells the model which part of its training to draw on — keyword compliance, format discipline, parse-ability. Not generic writing quality.

The examples pin the format. Every run now produces a single-line output in the exact structure shown. No paragraphs, no numbered lists, no explanatory preamble.

Example 2 demonstrates truthfulness. The candidate has HTML, CSS, JavaScript, WordPress — not React or TypeScript. The example shows the model exactly what to do with missing skills: use bracket placeholders like [React] rather than claiming the skill directly.

That last point is the most important. Instructions alone didn't fix the hallucination problem. A worked example of correct placeholder behaviour, paired with the candidate's actual skills list in every call, did.

The key insight: few-shot examples override written instructions when they conflict. If your examples all show confident keyword integration from candidates who plausibly have those skills, the model will follow the examples — not the instruction that says "be truthful." Show what you want. Be careful what you show.
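One way to make the examples easier to audit is to keep them as data and render the prompt from them. The sketch below is a minimal illustration, not the code in the repository: the `FewShotExample` shape and the `buildFewShotPrompt` name are assumptions, and the role text is shortened.

```typescript
// Hypothetical shape for one few-shot example; the field names are assumptions.
interface FewShotExample {
  original: string;
  targetKeywords: string[];
  jobTitle: string;
  candidateSkills: string[];
  rewritten: string;
}

// Render every example in the same layout the model sees at inference time,
// so the demonstrated format and the real input format cannot drift apart.
function buildFewShotPrompt(role: string, examples: FewShotExample[]): string {
  const blocks = examples.map((ex, i) =>
    [
      `EXAMPLE ${i + 1}:`,
      "",
      `Original: "${ex.original}"`,
      `Target keywords: ${ex.targetKeywords.join(", ")}`,
      `Job title: ${ex.jobTitle}`,
      `Candidate skills: ${ex.candidateSkills.join(", ")}`,
      "",
      `Rewritten: ${ex.rewritten}`,
    ].join("\n")
  );
  return [
    role,
    "Study the examples below carefully. Return only the rewritten bullet, no explanation.",
    ...blocks,
    "Return ONLY the rewritten bullet. No explanation. Maximum 20 words.",
  ].join("\n\n");
}

const fewShotPrompt = buildFewShotPrompt(
  "You are a senior ATS engineer and resume specialist.",
  [
    {
      original: "Worked on the backend of the company's main product",
      targetKeywords: ["Node.js", "REST APIs", "microservices"],
      jobTitle: "Senior Software Engineer",
      candidateSkills: ["Node.js", "PostgreSQL", "Docker", "REST APIs"],
      rewritten:
        "Engineered Node.js REST APIs for core product, improving system performance across microservices architecture",
    },
    {
      original: "Attended daily standups",
      targetKeywords: ["React", "TypeScript", "agile", "frontend"],
      jobTitle: "Frontend Developer",
      candidateSkills: ["HTML", "CSS", "JavaScript", "WordPress"],
      rewritten:
        "Participated in agile daily standups supporting [React] and [TypeScript] frontend delivery",
    },
  ]
);
```

With the examples in an array, reviewing which behaviours are actually demonstrated becomes a matter of reading a list rather than scanning one long string.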

Role prompting: giving the model a lens

Role prompting is assigning a persona to the model via the system prompt before any user content is processed. The model doesn't become that person. It shifts which part of its training data it draws on most heavily.

The difference matters. "You are a hiring manager" is an attention directive, not a capability upgrade. The model already knows what hiring managers care about. The role tells it to weight that knowledge above everything else.

I tested three personas against the same resume on Day 2.

The ATS engineer

`You are a senior ATS engineer with 10 years of experience configuring 
applicant tracking systems at Fortune 500 companies like Greenhouse, 
Workday, and Lever.

When analysing a resume, evaluate it exactly as an ATS would before 
a human ever sees it. Your priorities are:

1. Keyword match — does the resume contain exact terms a recruiter 
   would search for?
2. Formatting compliance — tables, columns, and text boxes cause 
   parsing failures. Flag them.
3. Date formatting — inconsistent dates confuse parsers.`

Output: clinical, narrow, reliable. Caught machine-layer problems — inconsistent date formats, missing keywords, non-standard section names. Ignored everything a human would care about.

The FAANG hiring manager

`You are a senior engineering hiring manager at a FAANG company. 
You have conducted over 500 hiring committee reviews across multiple 
product and infrastructure teams.

When analysing a resume, evaluate it as you would in a 30-second 
hiring committee pre-screen. Your priorities are:

1. Impact framing — does each bullet describe an outcome, not a task?
2. Scope signals — team size, user base, revenue influenced.
3. Levelling language — "Helped" signals junior. "Led" signals senior.`

The "30-second pre-screen" framing was the single most effective addition. It changed the model's evaluative posture from thorough to filtering — prioritising signal density over completeness.

Output: direct, specific, high-signal. Flagged levelling language gaps, missing scope context, and vague bullets that give a hiring committee nothing to evaluate. Most actionable of the three.

The career coach

`You are a senior career coach with 12 years of experience helping 
engineers at all levels land roles they care about.

When analysing a resume, evaluate it as you would in a first coaching 
session: with honesty, warmth, and a focus on what the candidate can 
improve before their next application.`

Output: warm, narrative-focused, occasionally too encouraging. Caught underselling and generic language that the technical personas ignored. Defaulted to "Thank you for sharing your resume!" as an opener — a guardrail I had to add explicitly: "Do not open with pleasantries."

What I shipped

The FAANG hiring manager became the primary analysis prompt. It produced the most actionable feedback by a clear margin.

The ATS engineer runs as a lightweight parallel pass — cheap output, focused scope, catches the machine layer the hiring manager ignores.

The career coach is held for a future Pro tier feature. The instincts are right but it needs tightening before production.

/api/analyse
  ├── hiring manager prompt   (always: primary analysis)
  ├── ATS engineer prompt     (always: parallel pass)
  └── career coach prompt     (Pro tier: future)

The key lesson: vocabulary inside the role description matters as much as the job title. "You are a hiring manager" is weaker than "you are a hiring manager who evaluates candidates in 30-second pre-screens and cares about levelling language." The second version gives the model a specific lens, not just a label.
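The route layout above could be wired roughly as follows. This is a sketch under assumptions: the `Tier` and `Persona` names, the `"free"` tier label, and the injected `callModel` function are all hypothetical, and the career coach pass is not shipped yet.

```typescript
type Tier = "free" | "pro";
type Persona = "hiringManager" | "atsEngineer" | "careerCoach";

// Which personas run for a given tier, mirroring the /api/analyse layout.
function personasForTier(tier: Tier): Persona[] {
  const always: Persona[] = ["hiringManager", "atsEngineer"];
  const proOnly: Persona[] = ["careerCoach"];
  return tier === "pro" ? always.concat(proOnly) : always;
}

// `callModel` is injected so the sketch stays independent of any SDK.
type CallModel = (systemPrompt: string, resume: string) => Promise<string>;

async function analyseResume(
  resume: string,
  tier: Tier,
  prompts: Record<Persona, string>,
  callModel: CallModel
): Promise<Partial<Record<Persona, string>>> {
  const personas = personasForTier(tier);
  // The passes do not depend on each other, so they can run concurrently.
  const outputs = await Promise.all(
    personas.map((p) => callModel(prompts[p], resume))
  );
  const result: Partial<Record<Persona, string>> = {};
  personas.forEach((p, i) => {
    result[p] = outputs[i];
  });
  return result;
}
```

Because the passes run concurrently, the added latency of the ATS pass is bounded by the slower of the two calls rather than their sum.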

Eval loops: measuring what you built

After a week of prompt iteration I had outputs that looked better. I had no way to know if they actually were better.

Manual scoring across 5 resumes took 45 minutes and produced scores I couldn't fully trust — after reading enough rewrites, judgment drifts. I needed something consistent, fast, and repeatable.

The LLM-as-judge pattern is a second model call that receives both the original and the rewritten bullet, evaluates them against a structured rubric, and returns a quality score plus reasoning.

export async function evalBullet(
  original: string,
  rewritten: string,
  targetKeywords: string[],
  candidateSkills: string[]
): Promise<EvalResult> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: JUDGE_SYSTEM_PROMPT },
      { role: 'user', content: `
        Original bullet: "${original}"
        Rewritten bullet: "${rewritten}"
        Target keywords: ${targetKeywords.join(', ')}
        Candidate skills: ${candidateSkills.join(', ')}
      `}
    ],
    temperature: 0  // always 0 for judges
  });
  // parse and return structured score
}

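Set the judge's temperature to 0. The rewriter uses 0.3 because lexical variation is desirable there. The judge needs to score the same bullet as consistently as possible, and any temperature above 0 adds scoring drift that makes baseline comparisons harder to trust. Even at 0 the scores are not perfectly repeatable, so expect a small residual variance between runs.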
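The "parse and return structured score" step is not shown above. A minimal sketch of it follows, assuming the judge is asked to reply with a JSON object holding one integer per criterion. The `EvalResult` fields and the `parseJudgeReply` name are assumptions, since the post names the type but does not show it.

```typescript
// Assumed result shape: the field names are hypothetical.
interface EvalResult {
  actionVerb: number;
  keywordFit: number;
  outcome: number;
  truthfulness: number;
  brevity: number;
  total: number;
  reasoning: string;
}

const CRITERIA = [
  "actionVerb",
  "keywordFit",
  "outcome",
  "truthfulness",
  "brevity",
] as const;

function parseJudgeReply(raw: string): EvalResult {
  const data = JSON.parse(raw) as Record<string, unknown>;
  const scores = {} as Record<(typeof CRITERIA)[number], number>;
  let total = 0;
  for (const key of CRITERIA) {
    const value = data[key];
    // Reject anything outside the 0-2 rubric instead of silently coercing it.
    if (
      typeof value !== "number" ||
      !Number.isInteger(value) ||
      value < 0 ||
      value > 2
    ) {
      throw new Error(`Judge returned an invalid score for ${key}: ${String(value)}`);
    }
    scores[key] = value;
    total += value;
  }
  const reasoning = data["reasoning"];
  // The total is recomputed from the criteria rather than trusted from the judge.
  return {
    ...scores,
    total,
    reasoning: typeof reasoning === "string" ? reasoning : "",
  };
}
```

Recomputing the total locally means an arithmetic slip by the judge cannot leak into the logged baseline.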

The rubric

The rubric is the most important part. A vague rubric produces vague scores regardless of which model you use.

Score this rewritten bullet on 5 criteria, 2 points each (total 10):

1. Action verb (0-2)
   2 = strong past-tense ownership verb: Led, Built, Architected
   1 = weak or passive: Helped, Worked on, Assisted
   0 = no clear action verb

2. Keyword fit (0-2)
   2 = target keywords integrated naturally — reads well without them
   1 = keywords present but sentence feels constructed around them
   0 = no target keywords present

3. Outcome present (0-2)
   2 = clear measurable outcome, or placeholder [X%] used correctly
   1 = outcome implied but not quantified
   0 = task description only, no outcome signal

4. Truthfulness (0-2)
   2 = all claims supported by original or candidate skills,
       OR missing keywords correctly wrapped in [brackets]
   1 = minor extrapolation, defensible
   0 = claims skill not in original or skills list, not bracketed

5. Brevity (0-2)
   2 = one sentence, 20 words or under, no filler phrases
   1 = slightly long or contains padding
   0 = multiple sentences or significantly over 20 words

The additive structure — 5 criteria, 2 points each — forces the judge to evaluate each dimension independently. Holistic scoring ("rate this 1-10") collapses everything into a gut feeling that drifts. Mechanical criteria with explicit level descriptions produce consistent scores you can aggregate and trend.
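Because each criterion is scored separately, the results can be averaged per dimension as well as in total. The sketch below shows that aggregation; the `Criterion` and `BulletScores` names are assumptions made for the example.

```typescript
type Criterion =
  | "actionVerb"
  | "keywordFit"
  | "outcome"
  | "truthfulness"
  | "brevity";

type BulletScores = Record<Criterion, number>;

const ALL_CRITERIA: Criterion[] = [
  "actionVerb",
  "keywordFit",
  "outcome",
  "truthfulness",
  "brevity",
];

// Mean score per criterion, plus the mean total out of 10.
function summarise(scores: BulletScores[]): {
  perCriterion: BulletScores;
  averageTotal: number;
} {
  if (scores.length === 0) {
    throw new Error("No scores to summarise");
  }
  const perCriterion = {} as BulletScores;
  let averageTotal = 0;
  for (const c of ALL_CRITERIA) {
    const mean = scores.reduce((sum, s) => sum + s[c], 0) / scores.length;
    perCriterion[c] = mean;
    // The mean total equals the sum of the per-criterion means.
    averageTotal += mean;
  }
  return { perCriterion, averageTotal };
}
```

A per-criterion view is what makes a regression in one dimension visible even when the overall average goes up.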

Rubric design matters more than judge model

In my experience, upgrading the judge from gpt-4o-mini to gpt-4o produces less improvement than tightening a single rubric criterion. The bottleneck in LLM-as-judge systems is rarely model capability. It is usually instruction clarity.

Compare:

// Vague — produces drift
"Rate keyword integration 0-2"

// Precise — produces consistent scores  
"Keyword fit (0-2):
 2 = target keywords integrated naturally — the sentence reads 
     well without them
 1 = keywords present but the sentence feels constructed around them
 0 = no target keywords present"

The precise version gives the judge a test it can apply mechanically. "Would this sentence read well without the keywords?" is a yes/no question. "Is keyword integration good?" is a judgment call. Judgment calls drift. Mechanical tests don't.

Scores are relative, not absolute

A bullet scoring 7/10 doesn't mean it's objectively a 7/10 resume bullet. It means it scored 7/10 against your rubric, run by your judge, on this version of your prompt.

What it does have is comparative value. If you change the rewrite prompt and the same 10 bullets average 8.2/10 instead of 6.8/10 — everything else held constant — the prompt improved. That's what systematic prompt engineering looks like: not intuition, evidence.

This is why I log results to a JSON file across runs:

import * as fs from 'fs';
import * as path from 'path';

export function logResults(results: EvalResult[], outputPath: string): void {
  const dir = path.dirname(outputPath);
  if (!fs.existsSync(dir)) fs.mkdirSync(dir, { recursive: true });

  const existing = fs.existsSync(outputPath)
    ? JSON.parse(fs.readFileSync(outputPath, 'utf-8'))
    : [];

  fs.writeFileSync(outputPath, JSON.stringify(
    [...existing, ...results], null, 2
  ));
}

Each run appends to the file. The file is your baseline. Without it you're iterating blind.

The fix that actually worked: structural constraints over instructions

After adding few-shot examples and role prompting, the rewriter was still hallucinating on underqualified candidates.

Sarah Johnson has HTML, CSS, JavaScript, and WordPress. She applied to a React and TypeScript role. The rewriter was producing this:

Original:  "Attended daily standups"
Rewritten: "Developed responsive web applications using React and TypeScript,
            collaborating in agile daily standups to deliver frontend solutions"

React and TypeScript. Fabricated. Every time.

I had a truthfulness instruction in the system prompt:

Truthfulness — never invent numbers or metrics not present in the original.
If a target keyword represents a skill the candidate has not explicitly used,
do NOT claim they used it.

It wasn't working. Three prompt iterations, same result. The instruction was being overridden by the few-shot examples — every example showed confident keyword integration from a candidate who plausibly had those skills. The model followed the examples, not the written rule.

The fix wasn't a better instruction. It was a structural change.

Before — instruction only

export async function rewriteBullet(
  bullet: string,
  targetKeywords: string[],
  jobTitle: string
): Promise<string>

The model had no reference list to check keywords against. It knew the job required React. It didn't know the candidate didn't have it. So it integrated React — confidently, fluently, incorrectly.

After — candidateSkills as a parameter

export async function rewriteBullet(
  bullet: string,
  targetKeywords: string[],
  jobTitle: string,
  candidateSkills: string[]  // added
): Promise<string> {
  // The skills list is sent in every call, next to the target keywords.
  const userMessage = `Original: "${bullet}"
Target keywords: ${targetKeywords.join(', ')}
Job title: ${jobTitle}
Candidate skills: ${candidateSkills.join(', ')}

Rewritten:`;

Now the model has both sides of the equation in every single call. Target keywords on one side. Candidate's actual skills on the other. When it sees React in the keywords but not in the skills list, it uses a placeholder instead of claiming the skill:

Original:  "Attended daily standups"
Rewritten: "Participated in agile daily standups supporting [React] 
            and [TypeScript] frontend delivery"

[React] and [TypeScript] are honest. They tell the candidate: these are the skills the job requires that you don't currently have. The resume becomes a gap analysis rather than a fabrication.

Why this worked when the instruction didn't

Passing candidateSkills as a parameter is a structural constraint. The data is present in the context window on every call. The model has to work with it explicitly.

A prompt instruction saying "be truthful" is a suggestion the model weighs against everything else in its context — including the few-shot examples that demonstrate confident integration. When examples and instructions conflict, examples win.

But a skills list in the context window doesn't compete with examples. It's data. The model is far less likely to ignore data it has to work with than to de-prioritise an instruction.

The lesson: structural constraints beat instructional constraints for truthfulness enforcement. If you want the model to check something before acting, give it the thing to check — don't just tell it to check.

This is the most transferable insight from two weeks of prompt iteration. It applies beyond resume rewriting to any situation where you need the model to respect a boundary that its training data might push against.
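The same idea can be backed by a deterministic check in code. The sketch below flags target keywords that the candidate does not list but that appear outside brackets in the rewrite. It is an illustration only, not the product's validateRewrite, and the function names are hypothetical. It also treats every target keyword as a skill, so non-skill terms such as "agile" would need to be filtered out before calling it.

```typescript
// True if `term` occurs in `text` as a whole token (case-insensitive).
function containsTerm(text: string, term: string): boolean {
  const hay = text.toLowerCase();
  const needle = term.toLowerCase();
  if (needle === "") return false;
  let from = 0;
  while (true) {
    const at = hay.indexOf(needle, from);
    if (at === -1) return false;
    // charAt returns "" out of range, which counts as a boundary here.
    const before = hay.charAt(at - 1);
    const after = hay.charAt(at + needle.length);
    if (!/[a-z0-9]/.test(before) && !/[a-z0-9]/.test(after)) return true;
    from = at + 1;
  }
}

// Target keywords the candidate does not list, yet the rewrite states
// them outside [brackets].
function findUnbracketedClaims(
  rewritten: string,
  targetKeywords: string[],
  candidateSkills: string[]
): string[] {
  const owned = new Set(candidateSkills.map((s) => s.toLowerCase()));
  // Drop every [placeholder] so only direct claims remain.
  const withoutPlaceholders = rewritten.replace(/\[[^\]]*\]/g, " ");
  return targetKeywords.filter(
    (k) => !owned.has(k.toLowerCase()) && containsTerm(withoutPlaceholders, k)
  );
}
```

The whole-token match matters: a plain substring test would report "Java" as claimed whenever the bullet mentions JavaScript.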

Results: before and after

The hallucination problem, fixed

Here is the same candidate — Sarah Johnson, HTML/CSS/JavaScript/WordPress developer, applying to a React and TypeScript role — run through three versions of the prompt.

Zero-shot (Week 1 — no examples, no candidateSkills):

Original:  "Worked on client websites"
Rewritten: "Developed responsive client websites using React and TypeScript,
            implementing modern CSS frameworks in an agile environment"

Original:  "Helped with bug fixes"  
Rewritten: "Resolved bugs in React and TypeScript, enhancing frontend 
            performance and responsiveness within agile development cycles"

Original:  "Attended daily standups"
Rewritten: "Developed responsive web applications using React and TypeScript,
            collaborating in agile daily standups to deliver frontend solutions"

React and TypeScript appear in every bullet. Sarah has neither. The rewriter optimised for the job description and ignored the candidate entirely.

Week 2 (role prompting + few-shot + candidateSkills):

Original:  "Worked on client websites"
Rewritten: "Developed client websites using HTML, CSS, and JavaScript, 
            ensuring responsive design and supporting [React] and 
            [TypeScript] frameworks"

Original:  "Helped with bug fixes"
Rewritten: "Assisted in bug fixes for frontend components using HTML, 
            CSS, and JavaScript while supporting agile development 
            and [React] integration, improving [metric] by [X%]"

Original:  "Attended daily standups"
Rewritten: "Participated in agile daily standups supporting [React] 
            and [TypeScript] frontend delivery"

The difference is structural. HTML, CSS, and JavaScript — skills Sarah actually has — are integrated directly. [React] and [TypeScript] appear as bracketed placeholders, not claims. The resume now tells an honest story: here is what I have, here is what the role requires that I don't yet have.

That's more useful to a candidate than a fabrication. It's also something they can actually defend in a room.

Strong bullets preserved correctly

The pipeline doesn't rewrite everything. scoreBullet evaluates each bullet before the rewrite pass and flags which ones actually need work.

Here is Priya Patel's resume — Staff Engineer applying for a Principal Engineer role — run through the same pipeline:

Original:  "Architected event-driven microservices platform handling 
            50M daily events, reducing infrastructure cost by 40%"
Rewritten: [unchanged — scoreBullet returned needs_rewrite: false]

Original:  "Led team of 8 engineers across 3 time zones to deliver 
            platform re-architecture 2 weeks ahead of schedule"
Rewritten: [unchanged]

Original:  "Defined engineering standards adopted across 4 product 
            teams, reducing incident rate by 35%"
Rewritten: [unchanged]

Three of Priya's bullets were preserved. Strong bullets with real metrics, ownership verbs, and scope context don't need rewriting — and the pipeline correctly identifies that. The rewriter amplifies good inputs. It cannot manufacture signal that isn't there.

The score floor

Every bullet that scored below 6/10 in the eval loop traced back to an original with no action verb, no skill, and no outcome:

5/10 — original: "Helped with bug fixes"
       no tool, no scale, no outcome — nothing to amplify

5/10 — original: "Attended daily standups"  
       describes presence, not contribution

No prompt engineering fixes this. The information needed to write a strong bullet doesn't exist in the original. That's a product problem, not a prompt problem. The right fix is surfacing a coaching message to the user: "This bullet doesn't give us enough to work with. Try adding what you built, what tool you used, or what changed because of your work."

The rewriter ceiling on content-free bullets is approximately 6/10 regardless of prompt quality. Knowing that ceiling exists — and being honest about it with users — is more valuable than pretending the AI can fix anything.

The numbers

Score progression

Run	Score	Key change
Week 1 baseline	7.37/10	Combined role + few-shot prompts
Week 2 Day 6	8.26/10	Outcome placeholder instruction
Week 2 Day 7	8.18/10	Edge case fixes — within noise
Week 2 Day 8 final	8.37/10	Stable confirmed score

Net improvement: +1.0 points over two weeks.

Two runs at 8.37 with the same lowest-scoring bullets confirmed this is a stable measurement, not a lucky run. The ±0.15 variance across runs at temperature 0 is the expected noise floor — changes within that range are not meaningful.

Dimension breakdown

Criterion	Week 1	Week 2	Change
Action verb	1.82/2	1.89/2	+0.07
Keyword fit	1.45/2	1.37/2	-0.08
Outcome	1.08/2	1.97/2	+0.89
Truthfulness	1.68/2	1.66/2	-0.02
Brevity	1.26/2	1.47/2	+0.21

The outcome story

Outcome is the standout improvement — +0.89 on a 2-point scale from a single instruction change.

Week 1: the rewriter was good at verbs and keywords but almost never added outcome signal. When the original bullet had no outcome, the model left the rewrite outcome-free too. A bullet that describes a task without a result gives a hiring committee nothing to evaluate.

The fix was one additional priority in the system prompt:

7. Outcome placeholder — if no outcome exists in the original bullet,
   add a placeholder rather than leaving the bullet outcome-free:
   "improving [metric] by [X%]", "resulting in [outcome]",
   "reducing [problem] by [X%]", or "enabling [result]".
   Never leave a rewritten bullet without any outcome signal.

That instruction moved outcome from 1.08/2 to 1.97/2. Nearly a full point on a 2-point scale. The highest ROI prompt change of the two weeks.

The keyword fit regression

Keyword fit dropped slightly from 1.45/2 to 1.37/2. This was deliberate.

The rewriter was injecting generic ATS phrases — "delivery record", "strong portfolio", "5+ years experience" — into every weak bullet regardless of relevance. A filter was added to the route handler:

const GENERIC_KEYWORD_FILTER = [
  'delivery record',
  'strong delivery record',
  '5+ years experience',
  'strong portfolio',
  'fast learner',
  'team player',
];

const targetKeywords = [
  ...new Set([...(missingSkills || []), ...(atsKeywords || [])])
].filter(k => !GENERIC_KEYWORD_FILTER.some(
  g => k.toLowerCase().includes(g.toLowerCase())
));

Filtering generic phrases reduced keyword coverage marginally. It eliminated keyword stuffing across long resumes entirely. That tradeoff is correct — a resume where every bullet ends with "enhancing delivery record and meeting business requirements" signals machine generation to any experienced recruiter.

What 8.37/10 actually means

It's not a grade. It's a baseline.

Every future prompt change gets measured against 8.37. If a change produces 8.6 on the same 10 resumes with the same rubric and judge, the prompt improved. If it produces 8.1, it regressed. The number is only meaningful in comparison — not in isolation.

This is the eval loop working as designed. Without it, every prompt change is a guess. With it, every prompt change is an experiment.
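The comparison rule can be written down so that small movements are not mistaken for progress. The sketch below uses the ±0.15 noise floor reported earlier as its default; the `compareRuns` name and the `Verdict` labels are assumptions.

```typescript
type Verdict = "improved" | "regressed" | "within noise";

// Compare a new average against the baseline, ignoring changes that
// fall inside the run-to-run noise floor.
function compareRuns(
  baseline: number,
  candidate: number,
  noiseFloor: number = 0.15
): Verdict {
  const delta = candidate - baseline;
  if (Math.abs(delta) <= noiseFloor) return "within noise";
  return delta > 0 ? "improved" : "regressed";
}
```

Under this rule the move from 8.26 to 8.18 counts as within noise, matching how that run is labelled in the score progression table.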

Eval stats

Bullets evaluated:    38
Average score:        8.37 / 10
Improved over orig:   30 / 38 (79%)
Unchanged (strong):   8 / 38

Average breakdown:
  Action verb:        1.89 / 2
  Keyword fit:        1.37 / 2
  Outcome:            1.97 / 2
  Truthfulness:       1.66 / 2
  Brevity:            1.47 / 2

What I learned that isn't in the documentation

1. Few-shot examples override written instructions when they conflict

I wrote "never claim skills the candidate doesn't have" in the system prompt. Every few-shot example showed confident keyword integration from a candidate who plausibly had those skills. The model followed the examples.

This isn't a bug — it's how in-context learning works. The model pattern-matches to demonstrated behaviour more reliably than it follows written rules. Which means the examples you choose are more consequential than the instructions you write.

The practical implication: audit your few-shot examples as carefully as your instructions. If your examples demonstrate behaviour you don't want in edge cases, the model will replicate it in those edge cases regardless of what the instructions say.

Show, don't tell. But be careful what you show.

2. Safety and quality are not the same measurement

I had a validateRewrite function that acted as a gate before serving any rewritten bullet:

// Returns: "use_rewrite" | "use_original" | "rewrite_again"
export async function validateRewrite(
  original: string,
  rewritten: string,
  targetKeywords: string[],
  candidateSkills: string[]
)

It answered the question: is this safe to serve? No fabricated skills, keywords present, different from original — use it.

The eval loop answered a different question: how good is this?

A bullet could pass the first question and fail the second:

validateRewrite:  recommendation = use_rewrite
                  truthfulness_risk = low       ← passed the gate

eval loop:        total_score = 5/10
                  action_verb = 1               ← weak verb
                  outcome = 0                   ← no outcome signal
                  brevity = 1                   ← slightly long

The bullet was safe. It wasn't good. Both assessments were correct — they were measuring different things.

Most developers building AI features stop at the safety layer. They add validation to prevent hallucination and call it done. The gap between "safe to serve" and "good enough to be useful" is where user trust quietly erodes. The eval loop makes that gap visible.

The production fix is to wire evalBullet as a second gate with a minimum threshold — bullets that pass safety but fail quality get a coaching message rather than a mediocre rewrite. That's a Week 3 task.
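A rough sketch of how the two gates might combine is below. It assumes the three recommendation values returned by validateRewrite and a minimum eval score of 6; the `Served` type and the `decideOutput` name are hypothetical, and this is not the shipped implementation.

```typescript
type Recommendation = "use_rewrite" | "use_original" | "rewrite_again";

type Served =
  | { kind: "rewrite"; text: string }
  | { kind: "original"; text: string }
  | { kind: "retry" }
  | { kind: "coaching"; text: string; message: string };

const COACHING_MESSAGE =
  "This bullet doesn't give us enough to work with. Try adding what you " +
  "built, what tool you used, or what changed because of your work.";

function decideOutput(
  original: string,
  rewritten: string,
  recommendation: Recommendation,
  evalTotal: number,
  minScore: number = 6
): Served {
  // Gate 1: safety. Anything the validator rejects never reaches gate 2.
  if (recommendation === "rewrite_again") return { kind: "retry" };
  if (recommendation === "use_original") {
    return { kind: "original", text: original };
  }
  // Gate 2: quality. Safe but weak rewrites become a coaching prompt.
  if (evalTotal < minScore) {
    return { kind: "coaching", text: original, message: COACHING_MESSAGE };
  }
  return { kind: "rewrite", text: rewritten };
}
```

Keeping the two gates as separate steps preserves the distinction drawn above: one answers whether the bullet is safe, the other whether it is good.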

3. There is a hard floor on what prompt engineering can achieve

Every bullet scoring below 6/10 in the eval loop had the same profile: an original with no action verb, no skill, and no outcome.

"Attended daily standups"    → no tool, no contribution, no result
"Helped with bug fixes"      → passive, no scope, no outcome
"Worked on improving things" → no specificity whatsoever

The rewriter ceiling on these bullets is approximately 6/10 regardless of how good the prompt is. The information needed to write a strong bullet simply does not exist in the original.

This is not a prompt problem. It is an input problem.

No amount of instruction, few-shot examples, or role prompting manufactures signal that isn't there. The right response to a content-free bullet is not a better rewrite — it's asking the candidate for more information.

"This bullet doesn't give us enough to work with.
Try adding: what you built, what tool you used, 
or what changed because of your work.

Example: 'Attended daily standups' → 
'Attended daily standups for a 6-person React team 
shipping features weekly'"

That coaching message is more useful to a candidate than any rewrite the model could produce from the original. It's also honest in a way that confident-sounding AI output rarely is.

The most important thing prompt engineering taught me over two weeks is where its limits are. Knowing the ceiling — and being transparent about it with users — is more valuable than pretending the AI can fix anything you give it.

What changes in Week 3

The eval loop surfaced four specific gaps that prompt changes alone cannot fix. These are the opening tasks for Week 3:

Code-level keyword distribution tracking. The keyword distribution instruction in the system prompt has no effect across stateless API calls — each bullet is a separate call with no memory of what keywords appeared in previous bullets. The fix is a post-processing pass in the route handler that tracks keyword frequency across the full rewritten set and removes over-repeated phrases. A partial version is already in production. A proper implementation needs cross-bullet state.

Soft skills matching on zero-technical-match candidates. When a candidate has zero technical skills matching the job description, the model makes a top-level "no match" assessment and stops evaluating the skills array for soft skill overlap. A teacher applying to a data analyst role has Communication and Leadership in her skills array — both required by the job — but the analyser returns matched_skills: []. The fix is a code-level post-processing step that compares the skills array directly against JD keywords in TypeScript, bypassing the model for this field.

Production eval gate. The eval loop currently runs offline as a measurement tool. Week 3 wires evalBullet into the production route as an async quality gate — bullets that pass validateRewrite but score below 6/10 on the eval rubric get a coaching message rather than a mediocre rewrite. The gate runs in the background so it doesn't add latency to the critical path.

Controlled comparison: combined vs few-shot only. Week 1 and Week 2 only tested the combined prompt (role + few-shot). The contribution of role prompting in isolation was never measured. Week 3 runs the same 10 resumes through a few-shot-only variant — no role declaration — and scores that too. The difference between those two scores is the actual measured contribution of role prompting. That's the kind of evidence that makes a blog post technically credible rather than anecdotal.
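The first two tasks are plain code rather than prompt work, and could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the cap of two bullets per keyword is an arbitrary placeholder, and matching is a simple case-insensitive comparison.

```typescript
// Count how many bullets mention each keyword (case-insensitive substring).
function keywordBulletCounts(
  bullets: string[],
  keywords: string[]
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const k of keywords) {
    const needle = k.toLowerCase();
    counts[k] = bullets.filter((b) => b.toLowerCase().includes(needle)).length;
  }
  return counts;
}

// Keywords that appear in more than `maxBullets` bullets are candidates
// for trimming in a post-processing pass.
function overusedKeywords(
  bullets: string[],
  keywords: string[],
  maxBullets: number = 2
): string[] {
  const counts = keywordBulletCounts(bullets, keywords);
  return keywords.filter((k) => (counts[k] ?? 0) > maxBullets);
}

// Compare the skills array directly against JD keywords, with no model call,
// so soft skills are matched even when no technical skill overlaps.
function matchSkills(candidateSkills: string[], jdKeywords: string[]): string[] {
  const wanted = new Set(jdKeywords.map((k) => k.trim().toLowerCase()));
  return candidateSkills.filter((s) => wanted.has(s.trim().toLowerCase()));
}
```

Both functions see the full set of bullets or skills at once, which is exactly the cross-call state that a stateless per-bullet prompt cannot have.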

Target: 8.8/10 average across the same 10 resumes.

The build-in-public part

This post documents Week 2 of a 10-week curriculum applying prompt engineering techniques to a real SaaS product — Resume AI Tailor — that real users pay for.

The eval baseline started at 7.37/10. It ended at 8.37/10. The target for Week 3 is 8.8/10.

I'll publish the Week 3 results — including the controlled comparison, the production eval gate, and whatever new edge cases surface — in the next post. If the score doesn't hit 8.8, I'll say that too.

The code for the eval loop, the prompt variants, and the full pipeline is on GitHub: github.com/Azeez1314/resume-ai

The product is live at: resumetailor.cv

One question for you

If you're building AI-powered features and measuring output quality with more than "it looks good to me" — I'd genuinely like to know what you're using.

The LLM-as-judge pattern is well-documented in research but underused in production SaaS. The gap between knowing the technique exists and actually having a scored baseline you make decisions from is significant. If you've closed that gap, I'd like to hear how.

Leave a comment, or find me on X and LinkedIn as NanoCrafts.

Resume AI Tailor is part of the NanoCrafts portfolio — a collection of focused SaaS tools built and shipped in public.

resumetailor.cv · nanocrafts.xyz · azeezroheem.hashnode.dev
