Is Your Agent Skill Actually Good? Microsoft's Dual-Paper Deep Dive into Skill Evaluation and Self-Evolving Optimization
WonderLab · 2026-05-31 · via DEV Community

The Question Nobody Wants to Ask: Does Your Skill Actually Help?

You spent an afternoon crafting a carefully structured Skill for your agent. Clear steps, thorough edge-case notes, well-formatted output requirements. You tested it manually a few times, the outputs looked great. You shipped it.

Three weeks later, you notice that some task success rates have gone down compared to before the Skill existed.

This is not a hypothetical. In May 2026, Microsoft Research published two concurrent papers — SkillLens ("From Raw Experience to Skill Consumption") and SkillOpt ("Executive Strategy for Self-Evolving Agent Skills") — that measured this failure mode at scale. Their finding: negative transfer happens in 25% of cases, and you cannot reliably identify the bad skills just by reading the text.

One paper answers "why skills sometimes backfire." The other answers "how to make skills systematically better." Together they sketch a new paradigm for agent capability improvement.


Part One: SkillLens — Mapping the Full Skill Lifecycle

A Skill Is Not a Point — It's a Pipeline

Most practitioners think of a Skill as "a block of text instructions for an agent." SkillLens decomposes this into a three-stage lifecycle:

Stage 1: Experience Generation
    Target model M runs training tasks, producing an experience pool
    of trajectories (both successes and failures)
    ↓
Stage 2: Skill Extraction
    Extractor model E distills the experience pool into a structured
    skill document — procedural knowledge under a fixed budget
    ↓
Stage 3: Skill Consumption
    The same target model M, equipped with the extracted skill,
    is evaluated on held-out test tasks

Notice there are two distinct roles in this chain: the Extractor (distills knowledge from trajectories) and the Target (consumes knowledge to improve task performance). SkillLens's central insight is that these two roles are independent — a strong task executor is not necessarily a strong extractor, and vice versa.

Two New Metrics: EE and TE

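To separate these two effects, the paper introduces two complementary metrics. Both average Δ(E, M, D): the change in target M's score on domain D when it is given the skill extracted by E, relative to running without a skill.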

Extraction Efficacy (EE): Fix an extractor. How reliably does it produce helpful skills across different target models?

$$\text{EE}(E, \mathcal{D}) = \frac{1}{|\mathcal{M}|} \sum_{M \in \mathcal{M}} \Delta(E, M, \mathcal{D})$$

Target Evolvability (TE): Fix a target model. How much does it improve when different extractors distill skills from its own experience?

$$\text{TE}(M, \mathcal{D}) = \frac{1}{|\mathcal{E}|} \sum_{E \in \mathcal{E}} \Delta(E, M, \mathcal{D})$$

A useful analogy: EE measures "how good a teacher is at helping many different students," while TE measures "how much a student can learn from many different teachers." Both dimensions being high is ideal.
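As a minimal sketch of the two averages, assume the gains Δ for one domain are stored in a dictionary keyed by (extractor, target). The names `delta`, `E1`, `M1` and the numbers are made up for illustration and are not the paper's data.

```python
from statistics import mean

# Hypothetical gains for one domain D:
# delta[(extractor, target)] = score with skill minus score without skill.
delta = {
    ("E1", "M1"): 4.0, ("E1", "M2"): -1.0,
    ("E2", "M1"): 2.0, ("E2", "M2"): 3.0,
}

def extraction_efficacy(delta, extractor):
    """EE: mean gain of one fixed extractor across all target models."""
    return mean(d for (e, _), d in delta.items() if e == extractor)

def target_evolvability(delta, target):
    """TE: mean gain of one fixed target across all extractors."""
    return mean(d for (_, m), d in delta.items() if m == target)

print(extraction_efficacy(delta, "E1"))  # 1.5
print(target_evolvability(delta, "M1"))  # 3.0
```

EE averages over one row of the extractor × target matrix and TE over one column, so the same Δ table yields both metrics.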

Experimental Scale: An Extractor × Target Cross-Matrix

The study evaluates a cross-matrix of extractors and targets in each domain:

  • 5 domains: Embodied planning (ALFWorld), Productivity (SpreadsheetBench), Software engineering (SWE-bench-Verified), Web search (SEAL-0), Tool calling (BFCL-v4)
  • 6 target models: GPT-5.4, GPT-5.4-mini, Gemini-3.1-Pro, Gemini-3.1-Flash-Lite, Qwen3.5-35B, Qwen3.5-9B
  • 5 extractor models: Same set (Qwen3.5-9B excluded as extractor — it couldn't reliably follow the structured extraction protocol)

A key design principle is minimal scaffolding: the extraction framework intentionally strips out domain-specific heuristics, filtering rules, and optimization tricks. Only a bare two-stage "per-trajectory analysis → hierarchical consolidation" pipeline remains, ensuring performance differences reflect the extractor's intrinsic capability, not pipeline engineering.

Core Finding 1: 75% Positive, But 25% Negative Transfer

Table 1 shows a stark picture:

Model-generated skills improve downstream performance in 75% of entries. Yet negative transfer remains common: 25% of entries have Δ < 0.

Negative transfer rates vary substantially across domains: SpreadsheetBench and SWE-bench-Verified show the lowest rates (~13%), while ALFWorld reaches 47% — nearly half the time, adding a skill to this domain hurts rather than helps.
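The negative transfer rate is simply the share of entries with Δ < 0. A small sketch, using hypothetical per-domain gains (the domain names and values below are placeholders, not the paper's numbers):

```python
def negative_transfer_rate(deltas):
    """Fraction of (extractor, target) entries whose gain is below zero."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("need at least one entry")
    return sum(1 for d in deltas if d < 0) / len(deltas)

# Hypothetical per-domain gains, for illustration only.
gains_by_domain = {
    "domain_a": [3.1, 0.4, -0.2, 5.0],
    "domain_b": [-1.5, -0.3, 2.2, 0.0],
}

for domain, gains in gains_by_domain.items():
    print(domain, negative_transfer_rate(gains))
```

Note that a gain of exactly zero is not counted as negative transfer here, matching the Δ < 0 definition.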

Even more counterintuitive: a stronger task-execution model does not predict extraction quality. On SpreadsheetBench, the lightweight Gemini-3.1-Flash-Lite achieves the highest EE, while GPT-5.4 — the strongest performer on the benchmark itself — ranks last as an extractor. Converting target-specific trajectories into procedural guidance that the target can actually use is a distinct capability from solving the tasks.

Core Finding 2: Format Doesn't Matter, Content Does

You might guess that an ordered-list Skill outperforms a prose-format Skill. SkillLens tested this directly:

Rewrite the same skill into four canonical formats (ordered list, unordered list, checklist, prose) and use the Friedman test to check whether any format consistently ranks higher.

Result: Format has no detectable effect on any target (all p > 0.34). Swapping extractors, by contrast, produced significant effects on 5 of 6 targets (p < 0.005).
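For readers who want to see what the Friedman test computes, here is a standard-library sketch for exactly four formats. It assumes `scores[i][j]` holds the downstream score of skill i rewritten in format j, applies no tie correction, and uses the chi-square approximation with 3 degrees of freedom, which is rough for small samples. The function names are mine; in practice a statistics library such as SciPy's `friedmanchisquare` would be the usual choice.

```python
import math

def average_ranks(row):
    """Rank one row's scores (1 = lowest); tied scores share their mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def friedman_four_formats(scores):
    """Return (statistic, p_value) for n skills x 4 formats."""
    n, k = len(scores), 4
    if n == 0 or any(len(row) != k for row in scores):
        raise ValueError("expected a non-empty list of rows with 4 scores each")
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function; this closed form holds only for 3 dof.
    p = math.erfc(math.sqrt(stat / 2)) + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2)
    return stat, p

# Hypothetical scores where no format is consistently better than another.
balanced = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 1, 2, 3]]
print(friedman_four_formats(balanced))
```

A high p-value, as in the balanced example, means the rankings give no evidence that any format consistently wins.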

The implication is direct: obsessing over Skill formatting is wasted effort. What the skill says matters far more than how it's laid out.

Core Finding 3: Plausible-Looking Skills Don't Predict Utility

This is the most surprising finding. The experiment asks GPT-5.4 to act as a judge: given two skills extracted from the same (target model, domain) pair, pick the one that will perform better on downstream tasks.

Without guidance, the judge achieves only 46.4% accuracy, indistinguishable from random guessing (50%). On pairs where the actual performance gap is large (δ ≥ 5%), the judge picks the genuinely higher-performing skill only 15.8% of the time, which means its preferences are inverted relative to actual utility.

In other words, on clearly separated pairs, the skill that reads more fluently and coherently tends to be the one that performs worse downstream. Textual plausibility has become divorced from actual utility.

This has a direct practical implication: you cannot reliably screen skills by asking an LLM to judge the text. The quality gap lies deeper than surface form.

Digging Into Each Stage: What Actually Drives Skill Quality?

Stage 1 (Experience Generation): Success/Failure Ratio Sets the Ceiling

Fixing the extractor (GPT-5.4-mini), the researchers sampled experience pools with success ratios of 0%/25%/50%/75%/100% from the same trajectories and compared the resulting skills.

Key finding: experience composition strongly shapes skill quality, and the optimal success-failure ratio is domain-specific.

  • SpreadsheetBench favors mostly-successful experience
  • SWE-bench peaks at a mostly-successful pool
  • ALFWorld performs best with failure-heavy pools (failures reveal invalid actions and dead-end states)

One universal rule: all-failure pools consistently produce the worst skills. Successful trajectories are the foundation of skill extraction — they provide positive procedural signals that narrow the agent's action space rather than merely listing what to avoid.
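A sketch of how such fixed-ratio pools could be sampled, assuming trajectories are already split into successes and failures. The function name, the dictionary layout and the pool size are illustrative assumptions, not the paper's code.

```python
import random

def sample_pool(successes, failures, size, success_ratio, seed=0):
    """Draw an experience pool of `size` trajectories with a fixed success ratio."""
    n_success = round(size * success_ratio)
    n_failure = size - n_success
    if n_success > len(successes) or n_failure > len(failures):
        raise ValueError("not enough trajectories for this ratio")
    rng = random.Random(seed)  # seeded so pools are reproducible
    pool = rng.sample(successes, n_success) + rng.sample(failures, n_failure)
    rng.shuffle(pool)
    return pool

# Hypothetical trajectories; real ones would carry actions and observations.
successes = [{"id": i, "success": True} for i in range(10)]
failures = [{"id": i, "success": False} for i in range(10, 20)]

for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    pool = sample_pool(successes, failures, size=8, success_ratio=ratio)
    print(ratio, sum(t["success"] for t in pool), "successes of", len(pool))
```

Sampling every ratio from the same underlying trajectories keeps the comparison about composition rather than about which tasks happened to be included.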

Stage 2 (Skill Extraction): Depth Matters, Not Aesthetics

Starting from "why do plausible-looking skills fail?", a qualitative inspection of high-Δ vs. low-Δ skill pairs reveals the real difference:

High-Δ skills provide concrete, executable remedies (e.g., "when the host engine doesn't evaluate formula strings, precompute static values and write them into cells directly"). Low-Δ skills offer generic platitudes (e.g., "resolve the contract before coding").

A vivid analogy: a high-quality skill reads like a practitioner's debugging journal — recording specific failure modes in specific contexts and their concrete fixes. A low-quality skill reads like a "we all know this already" lecture about best practices.

Stage 3 (Skill Consumption): Same Skill, Very Different Effects Across Targets

Injecting the same skill into different models can produce wildly different results. On SpreadsheetBench, the strong-pool skill boosts GPT-5.4 by +9.0 but produces negative transfer on some Qwen3.5-9B conditions.

Behavioral analysis explains why: skill consumption reshapes the target's default policy rather than triggering new explicit tool calls. For GPT-5.4, the skill steers it away from writing spreadsheet formulas toward computing results in Python and writing back static values — exactly the right strategy correction for formula stability issues. For Qwen3.5-9B, the same guidance pushes it toward more complex workbook-native workflows that improve structural correctness on sheet-level tasks but introduce more execution failures on fine-grained operations.

From Diagnosis to Intervention: Meta-Skill Guided Extraction (RQ3)

The analysis reveals that skill quality is driven by hidden dimensions that are invisible on the surface. RQ3 asks: can these findings be turned into a concrete, drop-in improvement to the extraction process itself?

Step 1: Discover which dimensions actually predict utility

The paper designs a fully automated rubric-discovery pipeline:

  1. Feed high-gap skill pairs into GPT-5.4, which extracts per-pair difference features
  2. Iteratively merge and consolidate these into 7 candidate dimensions (the raw rubric)
  3. Measure each dimension's "better-rate" — how often does the higher-Δ skill receive a more favorable judgment when scored on this dimension alone?
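The better-rate in step 3 can be sketched as follows. The pair layout, the dimension names `specificity` and `fluency`, and the rule that a tied score does not count as "more favorable" are assumptions made for this example.

```python
def better_rate(pairs, dimension):
    """Fraction of pairs where the higher-delta skill also scores strictly
    higher on `dimension`. Pairs with equal deltas are skipped."""
    hits = total = 0
    for p in pairs:
        if p["delta_a"] == p["delta_b"]:
            continue
        total += 1
        a_is_better = p["delta_a"] > p["delta_b"]
        score_a, score_b = p["scores_a"][dimension], p["scores_b"][dimension]
        if (a_is_better and score_a > score_b) or (not a_is_better and score_b > score_a):
            hits += 1
    if total == 0:
        raise ValueError("no pairs with a performance gap")
    return hits / total

# Hypothetical judged pairs; deltas are measured gains, scores come from a judge.
pairs = [
    {"delta_a": 6.0, "delta_b": 0.5,
     "scores_a": {"specificity": 4, "fluency": 3},
     "scores_b": {"specificity": 2, "fluency": 5}},
    {"delta_a": -1.0, "delta_b": 5.0,
     "scores_a": {"specificity": 1, "fluency": 4},
     "scores_b": {"specificity": 3, "fluency": 4}},
]

for dim in ("specificity", "fluency"):
    print(dim, better_rate(pairs, dim))
```

A dimension whose better-rate sits at or below 0.5 carries no usable signal about utility, which is the criterion for dropping it from the rubric.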

Only 3 dimensions consistently align with utility, forming the validated rubric:

| Dimension | What It Captures |
| --- | --- |
| Failure Mechanism Encoding | Does the skill encode specific failure modes and their triggers? |
| Actionable Specificity | Does the skill provide executable guidance tailored to concrete situations? |
| High-Risk Action Blacklist | Does the skill explicitly name high-risk operations to avoid? |

Step 2: Verify the rubric's discriminating power

Using the 3-dimension validated rubric to guide the judge raises overall accuracy from 46.4% to 73.8% on 151 high-gap pairs. On the hardest pairs (δ ≥ 5%), where the unguided judge picked the better skill only 15.8% of the time, the guided judge now picks correctly the majority of the time.

Step 3: Operationalize it as a Meta-Skill

The validated rubric is packaged as a compact meta-skill: a generation-time prior injected directly into the extractor's system prompt. The paper compares three extraction conditions:

  • Original unguided extraction
  • Extraction guided by the 7-dimension "plausibility rubric"
  • Extraction guided by the 3-dimension validated rubric

The results are unambiguous:

The plausibility rubric hurts average performance (−0.59pp, 6 of 9 cells regress). The validated rubric improves all nine cells (+1.55pp average), with the largest gains on SpreadsheetBench (+2.3 to +3.7pp).

Using "seems reasonable" criteria actively damages extraction. Only empirically validated dimensions reliably improve it.
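To make "a generation-time prior injected into the extractor's system prompt" concrete, here is a hypothetical sketch. The three dimension names come from the validated rubric; the prompt wording and the function `build_extractor_prompt` are my own placeholders, not the paper's meta-skill text.

```python
# Dimension names are from the validated rubric; the one-line
# instructions are illustrative paraphrases, not the paper's wording.
VALIDATED_RUBRIC = {
    "Failure Mechanism Encoding": "Encode specific failure modes and their triggers.",
    "Actionable Specificity": "Give executable guidance tailored to concrete situations.",
    "High-Risk Action Blacklist": "Explicitly name high-risk operations to avoid.",
}

def build_extractor_prompt(base_prompt, rubric):
    """Append the rubric to the extractor's system prompt as a meta-skill."""
    lines = [base_prompt.rstrip(), "", "When writing the skill, prioritize:"]
    lines += [f"- {name}: {instruction}" for name, instruction in rubric.items()]
    return "\n".join(lines)

prompt = build_extractor_prompt(
    "You distill agent trajectories into a skill document.",  # hypothetical base prompt
    VALIDATED_RUBRIC,
)
print(prompt)
```

Because the rubric is applied at generation time, it adds no extra judging pass: the extractor is steered before it writes the skill.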


Part Two: SkillOpt — Training a Skill Document Like a Neural Network

The Core Analogy: Skill ≈ Model Weights

SkillLens tells us skills can be systematically evaluated and improved. SkillOpt asks: can we apply an optimization loop to the skill document itself — the same discipline that makes weight-space optimization reproducible?

This analogy is the foundation of SkillOpt's entire design, laid out explicitly in the paper:

| Deep Learning | SkillOpt Equivalent |
| --- | --- |
| Parameter (weights) | Skill document |
| Gradient direction | Trajectory-derived edit direction |
| Learning rate | Edit budget $L_t$ |
| Validation check | Held-out selection gate |
| Stable training setting | Batch / minibatch / schedule / gate |

This is not decorative framing — it's operational. Batch size, learning rate, validation, momentum — every concept has a corresponding text-space implementation in SkillOpt.

The SkillOpt Loop

The optimization pipeline operates at two timescales: per-step updates and epoch-wise consolidation.

Forward Pass: Rollout Evidence

At each optimization step, the frozen target model runs a batch of training tasks with the current skill and produces scored trajectories. Small batches update quickly but noisily; larger batches expose more stable patterns. SkillOpt also supports accumulation: multiple rollout batches can be reflected on separately and merged into one update, decoupling execution throughput from update frequency.

Backward Pass: Minibatch Reflection

The optimizer model (a separate frontier LLM) converts trajectories into structured skill edits. Crucially, it processes both failure and success trajectories:

  • Failure minibatches → propose missing or corrective rules
  • Success minibatches → identify behaviors that already work and should be preserved

Single trajectories tend to produce anecdotal fixes; minibatches expose reusable procedural errors — the agent consistently searches the wrong source, always formats answers incorrectly, or reliably fails to verify tool results.

Local proposals are merged hierarchically: failure-driven edits consolidated first, then success-driven edits, with failure corrections given priority. This filters duplicates, contradictions, and instance-specific suggestions before the optimizer selects the final bounded update.

Bounded Text Updates: The Learning Rate Budget

This is the sharpest distinction between SkillOpt and "just rewrite the skill when it seems wrong."

Each optimization step has an edit budget $L_t$: the optimizer may apply at most $L_t$ add/delete/replace operations to the skill document. Candidate edits ranked below the cutoff are discarded.

Why bounding matters:

  • Unbounded rewrites erase useful rules: hard-won lessons from prior epochs get overwritten in one large revision
  • Unbounded rewrites introduce contradictions: iterating without a budget produces an incoherent patchwork
  • Bounded updates preserve continuity: later optimizer calls can learn from what helped, what failed, and what should be protected
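A bounded update can be sketched as "rank the candidate edits, keep the top $L_t$, discard the rest". The `Edit` class, its `score` field and the rule-list representation of a skill are assumptions for illustration; the paper's actual edit format is not shown in this article.

```python
from dataclasses import dataclass

@dataclass
class Edit:
    op: str          # "add", "delete", or "replace"
    score: float     # optimizer's ranking score (hypothetical field)
    text: str = ""   # new rule text for add/replace
    index: int = -1  # index of the targeted rule for delete/replace

def apply_bounded(rules, candidates, budget):
    """Apply only the `budget` highest-ranked edits; the rest are discarded.
    Indices always refer to positions in the original `rules` list."""
    chosen = sorted(candidates, key=lambda e: e.score, reverse=True)[:budget]
    new_rules = list(rules)
    deleted = set()
    for e in chosen:
        if e.op == "replace":
            new_rules[e.index] = e.text
        elif e.op == "delete":
            deleted.add(e.index)
        elif e.op != "add":
            raise ValueError(f"unknown op: {e.op}")
    new_rules = [r for i, r in enumerate(new_rules) if i not in deleted]
    new_rules += [e.text for e in chosen if e.op == "add"]
    return new_rules

rules = ["rule a", "rule b", "rule c"]
candidates = [
    Edit("add", 0.9, text="rule d"),
    Edit("delete", 0.8, index=0),
    Edit("replace", 0.1, text="rule B", index=1),  # ranked below the cutoff
]
print(apply_bounded(rules, candidates, budget=2))
```

With a budget of 2, the low-ranked replace is dropped and "rule b" survives untouched, which is the continuity property the list above describes.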

SkillOpt supports four edit budget schedules: constant, linear, cosine (default), and autonomous. The default cosine schedule starts with larger edit budgets and decays toward smaller consolidation steps.
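The three non-autonomous schedules can be sketched with the usual learning-rate formulas. The exact formulas, the `l_max`/`l_min` defaults and the decaying direction of the linear schedule are assumptions; the article only states that the cosine schedule starts large and decays. The autonomous schedule, where the budget is presumably chosen by the optimizer itself, is not sketched.

```python
import math

def edit_budget(step, total_steps, l_max=8, l_min=1, schedule="cosine"):
    """Edit budget L_t for a 0-indexed step. All defaults are illustrative."""
    if total_steps < 1 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    frac = step / (total_steps - 1) if total_steps > 1 else 0.0
    if schedule == "constant":
        value = l_max
    elif schedule == "linear":
        value = l_max - (l_max - l_min) * frac
    elif schedule == "cosine":
        value = l_min + 0.5 * (l_max - l_min) * (1 + math.cos(math.pi * frac))
    else:
        raise ValueError(f"unsupported schedule: {schedule}")
    return max(l_min, round(value))

for name in ("constant", "linear", "cosine"):
    print(name, [edit_budget(t, 10, schedule=name) for t in range(10)])
```

Compared with the linear schedule, the cosine curve stays near the large budget for the first few steps and near the small budget for the last few, giving a longer consolidation tail.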

The Validation Gate: Strict Acceptance

Every candidate skill is evaluated on the held-out selection split. SkillOpt accepts a candidate only if its selection score is strictly greater than the current best — ties are rejected. This converts reflection into propose-and-test optimization rather than unconditional self-editing.

Rejected edits don't disappear. The optimizer records an epoch-local rejected-edit buffer containing:

  • The failure patterns the rejected edits attempted to address
  • The edits that were tried
  • The score drop they caused

Later reflection calls in the same epoch receive this buffer, steering the optimizer away from known-harmful directions. This provides negative feedback at zero additional inference cost during deployment.
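The gate and the rejected-edit buffer together amount to a few lines of control flow. This is a sketch under assumed names (`gate_step`, the buffer's dictionary keys); only the strict-inequality rule and the three recorded items come from the description above.

```python
def gate_step(best_skill, best_score, candidate, candidate_score,
              rejected, edits, failure_patterns):
    """Accept `candidate` only on a strict improvement; otherwise log it."""
    if candidate_score > best_score:  # ties are rejected
        return candidate, candidate_score
    rejected.append({
        "failure_patterns": failure_patterns,  # what the edits tried to fix
        "edits": edits,                        # the edits that were tried
        "score_drop": best_score - candidate_score,
    })
    return best_skill, best_score

rejected = []  # epoch-local buffer; cleared at the start of each epoch
skill, score = "skill v1", 0.70

# A tie on the selection split is rejected and recorded.
skill, score = gate_step(skill, score, "skill v2", 0.70, rejected,
                         ["add rule X"], ["searches the wrong source"])
# A strict improvement is accepted.
skill, score = gate_step(skill, score, "skill v3", 0.75, rejected,
                         ["add rule Y"], ["unverified tool results"])
print(skill, score, len(rejected))
```

The buffer is passed to later reflection calls in the same epoch, so it costs nothing at deployment time: it never becomes part of the shipped skill.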

Think of it as running an A/B test on every proposed Skill revision: only the version that passes validation gets promoted. Rejected attempts become institutional memory that prevents repeating mistakes.

Epoch-Wise Slow/Meta Update: Learning Across Epochs

Fast updates learn from the current rollout batch. The slow/meta update learns from the comparison across adjacent epochs — longer-horizon patterns that individual batches can't expose.

At epoch end, SkillOpt runs the same training items under both the previous epoch's skill and the current epoch's skill, categorizing results into: improvements, regressions, persistent failures, and stable successes. The optimizer model writes a concise longitudinal guidance block — capturing which edit patterns helped, which failed, and which failure modes persist across epochs.
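The four-way categorization is a straightforward comparison of per-task outcomes. A sketch, assuming each epoch's results are a mapping from task id to a pass/fail boolean (the representation and task names are my own):

```python
def compare_epochs(prev, curr):
    """Bucket tasks by outcome under the previous vs. current epoch's skill.
    `prev` and `curr` map task id -> True if the task was solved."""
    buckets = {
        "improvements": [],         # failed before, solved now
        "regressions": [],          # solved before, failed now
        "persistent_failures": [],  # failed under both skills
        "stable_successes": [],     # solved under both skills
    }
    for task in sorted(prev.keys() & curr.keys()):
        if curr[task] and not prev[task]:
            buckets["improvements"].append(task)
        elif prev[task] and not curr[task]:
            buckets["regressions"].append(task)
        elif prev[task]:
            buckets["stable_successes"].append(task)
        else:
            buckets["persistent_failures"].append(task)
    return buckets

prev = {"t1": False, "t2": True, "t3": False, "t4": True}
curr = {"t1": True, "t2": False, "t3": False, "t4": True}
print(compare_epochs(prev, curr))
```

The regressions and persistent failures are the most informative buckets for the guidance block: the first shows which recent edits hurt, the second shows what the fast updates keep missing.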

This guidance is stored in a protected region of the skill document that step-level edits cannot overwrite, preventing short-term noise from erasing long-term lessons.

Key deployment note: the optimizer-side meta skill is never shipped with best_skill.md. It lives only in the optimizer model's reflection context, so the deployed artifact stays compact and portable while the training process still benefits from the full editing history.

Experimental Results: 52 for 52

SkillOpt is evaluated across 6 benchmarks × 7 target models × 3 execution harnesses (direct chat, Codex agentic loop, Claude Code agentic loop), against 7 baselines: no skill, human skill, one-shot LLM skill, Trace2Skill, TextGrad, GEPA, and EvoSkill.

SkillOpt is best or tied-best on all 52 evaluated (model, benchmark, harness) cells.

Key headline numbers for GPT-5.5:

| Execution mode | Average gain over no-skill |
| --- | --- |
| Direct chat | +23.5 points |
| Codex harness | +24.8 points |
| Claude Code harness | +19.1 points |

Individual benchmark gains are striking: OfficeQA jumps from 33.1 to 72.1 (+39.0), SpreadsheetBench from 41.8 to 80.7 (+38.9), ALFWorld on GPT-5.4-nano from 34.3 to 69.4 (×2.0). Procedural benchmarks with strict format requirements see the largest absolute gains — exactly where frontier models are most exposed zero-shot.

Ablations: Every Design Choice Earns Its Place

Table 3 ablation results confirm each component contributes:

| Component removed | SearchQA drop | SpreadsheetBench drop | LiveMath drop |
| --- | --- | --- | --- |
| Learning-rate form (unbounded) | −2.5 | −1.8 | −4.0 |
| Rejected-edit buffer | −1.6 | −4.6 | −2.4 |
| Meta skill + slow update | −0.6 | −22.5 | −3.2 |

Removing the slow/meta update is most damaging for SpreadsheetBench (−22.5 points), because this benchmark requires accumulated procedural knowledge — output format conventions, formula evaluation strategies — exactly what the epoch-wise slow update is designed to protect.

What Do Learned Skills Actually Look Like?

Figure 4 in the paper reproduces one representative learned rule per benchmark, verbatim from the deployed best_skill.md:

SearchQA: "Infer the expected answer type from clue wording, then choose the shortest canonical entity supported by co-occurring distinctive evidence."

SpreadsheetBench: "Inspect workbook structure and formulas, then write evaluated static values across the full requested target range instead of relying on Excel recalculation."

OfficeQA: "Treat oracle parsed pages as primary evidence, lock table/date/unit context, and output exactly the requested rounded value without extra labels."

LiveMathematicianBench: "In strongest-statement MCQs, rank choices by theorem strength and prefer a justified stronger-result option over true but weaker corollaries."

ALFWorld: "Keep a horizon-aware visited/frontier ledger, diversify search after repeated same-type failures, and avoid revisiting the destination until holding the target."

Three properties stand out:

  1. Procedural, not instance-specific: no specific question, filename, or entity is named
  2. They encode discipline that frontier models don't apply zero-shot: answer format constraints, evidence-binding strategies, search-frontier management
  3. They read like notes a thoughtful practitioner would write after a day with the benchmark — except they were produced automatically by the optimizer under bounded updates and validation gating

Compactness: final best_skill.md lengths range from 379 tokens (LiveMathematicianBench) to 1,995 tokens (SpreadsheetBench), with a median around 920 tokens. The number of actually accepted edits is 1 to 4 (median 2.5) — the optimizer proposes far more, but only a handful survive the validation gate.

Transfer: Train Once, Deploy Broadly

One of SkillOpt's most compelling findings is that the optimized artifact transfers well beyond its training setting.

Cross-model transfer: A SpreadsheetBench skill trained on GPT-5.4 retains 82% of the in-domain gain when transferred to GPT-5.4-mini (+9.4 of +11.4). A LiveMath skill transferred to GPT-5.4-nano actually surpasses the in-domain SkillOpt reference (28.8 transferred vs. 27.2 in-domain).

Cross-harness transfer: A SpreadsheetBench skill trained in the Codex loop transfers to Claude Code with a +59.7 point absolute gain over the Claude Code no-skill baseline (22.1 → 81.8), exceeding the Claude Code in-domain SkillOpt reference of 80.4. The transferred skill appears to encode workbook-level procedures — structure-first inspection, formula-aware verification, static-value materialization — that are harness-agnostic.

Cross-benchmark transfer: An OlympiadBench skill applied to Omni-MATH (sharing only the broad "math" family) produces positive gains across all three model scales (+3.7/+1.8/+1.3), supporting the interpretation that the skill encodes reusable mathematical procedure rather than memorized test-specific formatting.
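The transfer figures quoted above follow from simple arithmetic, reproduced here as a check. The helper name `gain_retention` is mine; the numbers are the ones stated in the text.

```python
def gain_retention(transferred_gain, in_domain_gain):
    """Share of the in-domain gain that survives transfer."""
    return transferred_gain / in_domain_gain

# Cross-model: +9.4 transferred vs. +11.4 in-domain on SpreadsheetBench.
print(f"{gain_retention(9.4, 11.4):.0%}")  # 82%

# Cross-harness: absolute gain over the Claude Code no-skill baseline.
print(round(81.8 - 22.1, 1))  # 59.7
```

Retention is a ratio of gains, while the cross-harness figure is an absolute difference in score, so the two numbers are not directly comparable.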

Practical implication: optimize a skill in one execution environment, reuse it across multiple models, harnesses, and related benchmarks — without touching model weights.


The Combined Picture: Diagnosis Meets Optimization

Placing both papers side by side, their contributions form a closed loop:

SkillLens: Understand the problem
├── Finding: 25% negative transfer — driven by variable extraction quality
├── Finding: Format doesn't matter; content does
├── Finding: Plausible text ≠ downstream utility (46.4% = random)
└── Solution: Validated rubric (3 dimensions) + Meta-Skill improves extraction

SkillOpt: Systematically solve it
├── Core idea: Skill as trainable text parameter
├── Mechanism: Bounded edits + validation gate + rejected buffer + slow update
├── Results: 52/52 best-or-tied, +17.6 average across 7 models
└── Properties: compact artifact, transferable, auditable, zero inference overhead

Both papers converge on the same core insight: a Skill should not be a static document written by intuition. It should be a dynamically optimized artifact driven by execution data. SkillLens tells you which dimensions genuinely matter. SkillOpt gives you the machinery to push those dimensions forward systematically.


Practical Takeaways

If you maintain a Skill library:

  1. Don't judge skills by how well they read: in the SkillLens experiments an unguided LLM judge picked the better skill only 46.4% of the time, no better than chance
  2. Build a small evaluation set (10–20 cases) and use deterministic checks to catch negative transfer before it reaches production
  3. When building experience pools, mix successes and failures — pure-failure pools consistently produce the worst skills
  4. Guide your extraction with Failure Mechanism Encoding, Actionable Specificity, and High-Risk Action Blacklist — the three empirically validated quality dimensions
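Takeaway 2 can be sketched as a tiny regression check: run each case with and without the skill and compare pass rates using deterministic checkers. Everything here is hypothetical scaffolding; `run_agent` stands in for whatever calls your agent, and `fake_agent` is a stub used only to make the example runnable.

```python
def check_skill(cases, run_agent, skill):
    """cases: list of (task_input, checker), where checker(output) -> bool.
    run_agent(task_input, skill) -> output; skill=None means no skill."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    baseline = sum(bool(check(run_agent(task, None))) for task, check in cases)
    with_skill = sum(bool(check(run_agent(task, skill))) for task, check in cases)
    return {
        "baseline": baseline / len(cases),
        "with_skill": with_skill / len(cases),
        "negative_transfer": with_skill < baseline,
    }

# Stub agent for illustration: the "skill" makes it upper-case its answer.
def fake_agent(task, skill):
    return task.upper() if skill else task

cases = [
    ("a", lambda out: out == "a"),
    ("b", lambda out: out == "b"),
    ("c", lambda out: out == "C"),
]
report = check_skill(cases, fake_agent, skill="always answer in upper case")
print(report)
```

With only 10 to 20 cases the pass rates are noisy, so a check like this is best treated as a tripwire for clear regressions rather than as a precise measurement.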

If you're considering systematic skill optimization:

  1. SkillOpt's core architecture — bounded edits + validation gate + negative feedback buffer — is fundamentally more stable than "unconstrained rewriting"
  2. A modest edit budget (learning rate $L_t = 4$) performs competitively across most benchmarks; you don't need large rewrites to make progress
  3. Rejected edits are valuable: recording them prevents repeating known-harmful directions
  4. The slow/meta update is critical for procedural domains — without it, short-term noise overwrites the long-term lessons your optimization was accumulating

Summary

Two Microsoft papers, one cohesive answer to why skills sometimes fail and how to fix them systematically.

SkillLens maps the full three-stage skill lifecycle across five domains, six targets, and five extractors. It discovers that 25% of skill deployments produce negative transfer, that skill format is irrelevant while content depth is decisive, and that "reads well" is a poor predictor of "performs well." It distills these findings into three validated quality dimensions — Failure Mechanism Encoding, Actionable Specificity, High-Risk Action Blacklist — and packages them as a meta-skill prior that improves every evaluated extraction condition.

SkillOpt treats the skill document as a trainable text-space parameter. By combining bounded edit budgets, a strict held-out validation gate, a rejected-edit buffer for negative feedback, and an epoch-wise slow/meta update for long-horizon consolidation, it turns ad hoc skill editing into a controlled optimization loop. The result: best or tied-best on 52 of 52 evaluated cells, +17.6 average improvement across seven models, compact deployable artifacts (< 2,000 tokens, 1–4 accepted edits), and transfer that works across model scales, harnesses, and related benchmarks without touching model weights.

Skill optimization is graduating from craft to engineering.

