























Getting prompts right is still the hardest part of shipping reliable LLM applications. Small wording changes can swing accuracy by 20 percent. What works on a few examples often breaks at scale. When a multi-step pipeline returns a wrong answer, finding the failing step means inspecting intermediate outputs by hand.
Cisco AI introduced FAPO to address that bottleneck. FAPO stands for Fully Automated Prompt Optimization. It is a Claude Code-driven system that optimizes LLM pipelines from baseline prompts to target accuracy. You supply a dataset and an initial prompt. FAPO then evaluates, classifies failures, proposes variants, validates them, and iterates. The whole loop is orchestrated by Claude Code agents. The project ships open source under Apache 2.0, and also supports Codex as the optimization agent.
In Cisco’s reported evaluation, FAPO beat GEPA, a state-of-the-art prompt optimizer, on 15 of 18 model-benchmark comparisons. On the two benchmarks where FAPO escalated to pipeline changes, the mean gain over GEPA reached +33.8pp.
FAPO is a multi-tenant evaluation and optimization framework. A tenant is a self-contained optimization project. Each tenant directory holds one task’s prompts, dataset, chain definition, scorer, and config. Tenants stay isolated, so unrelated tasks optimize side by side without interference.
The core engine is named hephaestus and is domain-agnostic. It handles evaluation, chain execution, and scoring. Chains are LangGraph state graphs that process each test case. Out of the box, FAPO supports three providers: OpenAI, Baseten, and SageMaker.
The one input you must bring is a dataset: paired inputs and expected outputs that define success. FAPO splits it into a validation set and a held-out test set. The validation set drives iteration; the test set is used only for a final one-shot evaluation. From a task description, Claude can scaffold the rest: the initial prompt, the chain, and the scorer.
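To make the validation/test split concrete, here is a minimal sketch of one way to partition cases reproducibly. It is not FAPO's actual splitting code: the function names, the hash-based assignment, and the 20 percent test fraction are all assumptions chosen for illustration.

```python
import hashlib


def assign_split(case_id: str, test_fraction: float = 0.2) -> str:
    """Assign a case to 'validation' or 'test' from a stable hash of its id.

    Hypothetical helper: hashing the id keeps the assignment identical
    across runs, so the held-out test set never leaks into iteration.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "validation"


def split_cases(cases, test_fraction: float = 0.2):
    """Partition a list of case dicts (each with a 'case_id') into two splits."""
    splits = {"validation": [], "test": []}
    for case in cases:
        splits[assign_split(case["case_id"], test_fraction)].append(case)
    return splits
```

Because the assignment depends only on `case_id`, adding new cases later does not reshuffle the existing ones.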
Once the pieces exist, FAPO runs a closed loop until target accuracy is reached. Each cycle runs six stages:
The system works at three escalating levels. Prompt edits are lowest cost and tried first. Parameter changes adjust config values like retrieval_k or temperature. Structural changes alter chain topology, such as adding a self-reflection node or switching to a ReAct pattern. FAPO exhausts one level before escalating to the next.
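The escalation rule can be sketched as a small loop. This is an illustration of the idea described above, not FAPO's implementation: `run_escalation`, the candidate names, and the `evaluate` callback are hypothetical.

```python
LEVELS = ("prompt", "parameter", "structural")  # cheapest first


def run_escalation(candidates, evaluate, target):
    """Exhaust each level before escalating; stop once the target is met.

    candidates: dict mapping a level name to a list of variant names.
    evaluate:   callable returning a score for a variant name (assumed).
    Returns (level, name, score) for the best variant seen, or None.
    """
    best = None
    for level in LEVELS:
        for name in candidates.get(level, []):
            score = evaluate(name)
            if best is None or score > best[2]:
                best = (level, name, score)
        # Only check the target after the whole level has been tried.
        if best is not None and best[2] >= target:
            return best
    return best
```

In this sketch, costlier parameter and structural variants are never evaluated if a prompt-level variant already meets the target.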
Step attribution sorts failures into four classes. Retrieval failures return empty or irrelevant content. Cascading failures begin when an early step produces empty output. Format failures hide the correct answer inside text the scorer cannot parse. Reasoning failures occur when good inputs still produce a wrong conclusion. Format and reasoning issues are prompt-addressable. Retrieval and cascade issues are structural-addressable.
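As a rough illustration of the four classes, the sketch below labels one failed case from its step trace. The trace shape (a list of dicts with `kind` and `output`), the function name, and the simple heuristics are assumptions. The source describes an agent-driven attribution step, which can judge things like irrelevant retrieval that this rule-based version cannot.

```python
# Which optimization level addresses each failure class, per the text above.
ADDRESSED_BY = {
    "format": "prompt",
    "reasoning": "prompt",
    "retrieval": "structural",
    "cascade": "structural",
}


def classify_failure(steps, final_output, expected):
    """Label a failed case as retrieval, cascade, format, or reasoning.

    steps: ordered list of {"kind": "retrieval" | "llm", "output": str}
    (a hypothetical trace format). Call this only for cases already
    scored as failures.
    """
    for index, step in enumerate(steps):
        if not step["output"].strip():
            if step["kind"] == "retrieval":
                return "retrieval"  # retriever returned nothing
            if index < len(steps) - 1:
                return "cascade"    # an early step starved later steps
    if expected.strip().lower() in final_output.lower():
        return "format"             # right answer, unparseable wrapper
    return "reasoning"              # good inputs, wrong conclusion
```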
Guardrails keep the optimizer from overfitting. It inspects only training-split cases, while validation and test expose aggregate scores only. Every variant is a new immutable file, never edited in place. An independent reviewer checks each proposal before it runs.
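The "new immutable file, never edited in place" rule is easy to enforce at the filesystem level. The sketch below shows one way to do it, using Python's exclusive-create mode. The helper name and the `_v001.txt` naming scheme are hypothetical and do not reflect FAPO's actual on-disk layout.

```python
from pathlib import Path


def write_variant(directory: Path, stem: str, text: str) -> Path:
    """Write a prompt variant to the next free versioned file.

    Opening with mode "x" raises FileExistsError instead of overwriting,
    so an existing variant can never be modified by this function.
    """
    directory.mkdir(parents=True, exist_ok=True)
    version = 1
    while True:
        path = directory / f"{stem}_v{version:03d}.txt"
        try:
            with open(path, "x", encoding="utf-8") as handle:
                handle.write(text)
            return path
        except FileExistsError:
            version += 1  # slot taken, try the next version number
```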
The Cisco team evaluated FAPO against GEPA (Genetic-Pareto), a state-of-the-art prompt optimization method. GEPA uses evolutionary search with genetic operators to optimize prompts for multi-step pipelines. Both systems started from identical baseline pipelines and prompts. FAPO could escalate to structural changes when attribution found bottlenecks. GEPA was limited to prompt-level optimization.
The comparison spanned six benchmarks and three task models: GPT-4.1-mini, GPT-5.4-mini, and Gemma 3-12B. Claude Opus 4.6 served as both FAPO’s orchestrator and GEPA’s reflector. Scores below are averaged across the three task models.
| Benchmark | Baseline | GEPA | FAPO | Gain vs. GEPA |
|---|---|---|---|---|
| HoVer | 35.9 | 48.5 | 83.8 | +35.3pp |
| IFBench | 35.7 | 48.5 | 80.7 | +32.2pp |
| LiveBench-Math | 51.0 | 52.6 | 62.0 | +9.4pp |
| HotpotQA | 50.9 | 61.8 | 68.3 | +6.5pp |
| Papillon | 73.6 | 90.7 | 94.9 | +4.2pp |
| AIME | 16.7 | 16.0 | 12.9 | -3.1pp |
FAPO won 15 of 18 model-benchmark comparisons, with a mean gain of +14.1pp over GEPA. On HoVer and IFBench, where FAPO escalated to pipeline changes, it won all six model-benchmark pairs. The mean gain there was +33.8pp. On the four benchmarks without structural changes, FAPO still won 9 of 12 through prompt optimization alone. AIME was the only benchmark where GEPA led, by 3.1pp. The gap is smaller than the standard deviation across stochastic trials.
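The summary figures can be re-derived from the table. The sketch below uses only the table's rounded per-benchmark averages, so it is a consistency check on the published numbers, not a reproduction of Cisco's per-model results. `Decimal` is used to avoid floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

# benchmark: (GEPA, FAPO), copied from the table above
SCORES = {
    "HoVer": ("48.5", "83.8"),
    "IFBench": ("48.5", "80.7"),
    "LiveBench-Math": ("52.6", "62.0"),
    "HotpotQA": ("61.8", "68.3"),
    "Papillon": ("90.7", "94.9"),
    "AIME": ("16.0", "12.9"),
}
STRUCTURAL = ("HoVer", "IFBench")  # benchmarks where FAPO changed the pipeline


def gain(benchmark):
    """FAPO minus GEPA, in percentage points."""
    gepa, fapo = SCORES[benchmark]
    return Decimal(fapo) - Decimal(gepa)


def mean_gain(benchmarks):
    """Mean gain over the given benchmarks, rounded to one decimal place."""
    names = list(benchmarks)
    total = sum((gain(name) for name in names), Decimal("0"))
    return (total / len(names)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


print(mean_gain(SCORES))      # 14.1
print(mean_gain(STRUCTURAL))  # 33.8
```

Since every benchmark is averaged over the same three task models, the mean of the six per-benchmark gains matches the mean over all 18 model-benchmark pairs, up to rounding in the table.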
A capability comparison shows the design difference reported by Cisco. Every row below reflects the source description of the two systems.
| Capability | GEPA | FAPO |
|---|---|---|
| Optimization levels | Prompt text only | Prompt → parameter → structural |
| Can change chain structure | No | Yes, when attribution finds bottlenecks |
| How it is driven | Evolutionary search with genetic operators | Claude Code or Codex agent loop |
| Result across 18 model-benchmark pairs | Reference | Wins 15 of 18; +14.1pp mean |
FAPO targets multi-step LLM pipelines, not single prompts. A few concrete examples:
The fastest path is to let Claude Code create the tenant files. From the repo, describe your task in plain English, then add a JSONL dataset. Each line is one test case with case_id, task_type, context, expected, and metadata:
```json
{"case_id": "1", "task_type": "qa", "context": {"question": "What is the capital of France?"}, "expected": {"answer": "Paris"}, "metadata": {}}
{"case_id": "2", "task_type": "qa", "context": {"question": "What is 2 + 2?"}, "expected": {"answer": "4"}, "metadata": {}}
```

A scorer compares the chain output to the expected answer. It implements validate_case to catch bad data early and score_case to return a composite score:

```python
from hephaestus.scoring.scorer import Scorer as BaseScorer


class Scorer(BaseScorer):
    def validate_case(self, case, scoring_profile):
        # Fail fast on malformed cases before any model call is made.
        assert "answer" in case.expected, "Missing 'answer' in expected"

    def score_case(self, case, output_text, scoring_profile):
        # Case-insensitive exact match after trimming whitespace.
        expected = case.expected["answer"].strip().lower()
        predicted = output_text.strip().lower()
        em = 100.0 if predicted == expected else 0.0
        return {"composite_score": em, "score_breakdown": {"exact_match": em}}
```

Verify the setup with a baseline evaluation:

```bash
export OPENAI_API_KEY="sk-..."
python -m hephaestus.cli eval --config tenants/my_project/configs/eval.json
```

Then invoke the optimization agent with a tenant, config, and success criteria such as composite_score >= 90. Claude Code produces a scope contract, then iterates autonomously. Every prompt variant, config, and per-variant analysis is written to disk, so each run stays auditable. A local read-only UI called FAPO Explorer browses the artifacts afterward.
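Before running the baseline, it can help to sanity-check the dataset file itself. The following is a minimal pre-flight sketch, assuming only the five field names listed earlier. `validate_jsonl` is a hypothetical standalone helper and is not part of hephaestus.

```python
import json

REQUIRED_KEYS = ("case_id", "task_type", "context", "expected", "metadata")


def validate_jsonl(lines):
    """Return a list of (line_number, problem) tuples; empty means no issues."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            case = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(case, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in case]
        if missing:
            problems.append((number, "missing keys: " + ", ".join(missing)))
        if "case_id" in case:
            case_id = str(case["case_id"])
            if case_id in seen_ids:
                problems.append((number, f"duplicate case_id: {case_id}"))
            seen_ids.add(case_id)
    return problems
```

To check a file, pass an open file handle, for example `validate_jsonl(open(path, encoding="utf-8"))`, where `path` points at your dataset.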