skill-insp: A Skill That Scores Other Skills
Conan · 2026-05-26 · via DEV Community

If you've been building Claude Code skills for a while, you've probably noticed a pattern: every skill author makes the same mistakes the first few times. Vague descriptions that fail to trigger. Workflows that don't say what to do when files are missing. allowed-tools that ask for Bash with no glob restriction. No eval scenarios, so you have no idea if the skill actually works.

I built skill-insp to catch those mistakes automatically. It's a skill that inspects other skills, scores them across 8 dimensions, and tells you what to fix.

Source: github.com/conanttu/skills

What It Does

You point skill-insp at a folder containing a SKILL.md and it gives you:

  1. A 0–100 score across 8 weighted dimensions (Structure, Triggering, Usability, Completeness, Progressive Disclosure, Testability, Maintainability, Safety & Trust)
  2. A prioritized list of findings (High / Medium / Low) with file:line references
  3. An HTML report you can open in a browser
  4. The ability to apply high-priority fixes with an automatic backup, and to revert them with SHA-256 hash verification
  5. A functional eval runner that spawns sub-agents against fixture skills to verify the skill's own logic

Sample output:

✨ skill-insp ✨

Overall: my-skill scores 66/100 — risk low, readiness usable-with-improvements.

Key strengths
- Minimal, appropriate permissions (Read and Write only)
- Clear 3-step workflow
- Clean YAML frontmatter with version tracking

Scorecard
| Dimension              | Score |
|------------------------|-------|
| Structure              |  8/10 |
| Triggering             |  9/15 |
| Usability              |  9/15 |
| Completeness           |  7/15 |
| Progressive Disclosure |  7/10 |
| Testability            |  4/10 |
| Maintainability        |  8/10 |
| Safety & Trust         | 14/15 |
| Total                  | 66/100 |

Recommendations
  Medium  Add error-handling for missing or unreadable files.
  Medium  Create evals/ with at least one input and expected output.
  Low     Expand README with usage examples or remove the placeholder.

HTML report: /abs/path/to/cache/my-skill/latest.html
✨ skill-insp ✨

The Scoring Rubric

The 100 points are weighted toward what actually breaks skills in practice:

| Dimension              | Max | Why it matters                                            |
|------------------------|-----|-----------------------------------------------------------|
| Structure              |  10 | Frontmatter parses, folder layout makes sense             |
| Triggering             |  15 | Description gets the skill invoked in the right contexts  |
| Usability              |  15 | Workflow steps are concrete and runnable                  |
| Completeness           |  15 | Edge cases, inputs/outputs, failure handling              |
| Progressive Disclosure |  10 | SKILL.md stays lean; details live in references           |
| Testability            |  10 | Evals or success criteria exist                           |
| Maintainability        |  10 | No duplication, no stale placeholders                     |
| Safety & Trust         |  15 | Permissions scoped, no hidden network, no destructive ops |
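
The arithmetic behind the total is easy to sketch. This is a minimal illustration rather than skill-insp's implementation (the skill has the model do the scoring); `RUBRIC_MAX` and `totalScore` are hypothetical names, and the maxima are copied from the table above.

```javascript
// Maximum points per dimension, copied from the rubric table.
const RUBRIC_MAX = {
  "Structure": 10,
  "Triggering": 15,
  "Usability": 15,
  "Completeness": 15,
  "Progressive Disclosure": 10,
  "Testability": 10,
  "Maintainability": 10,
  "Safety & Trust": 15,
};

// Sum per-dimension scores, rejecting missing or out-of-range values.
// (Hypothetical helper, not part of skill-insp.)
function totalScore(scores) {
  let total = 0;
  for (const [dimension, max] of Object.entries(RUBRIC_MAX)) {
    const score = scores[dimension];
    if (!Number.isInteger(score) || score < 0 || score > max) {
      throw new RangeError(`${dimension}: expected an integer 0..${max}, got ${score}`);
    }
    total += score;
  }
  return total;
}

// The my-skill scorecard from the sample output earlier in the post.
const example = {
  "Structure": 8,
  "Triggering": 9,
  "Usability": 9,
  "Completeness": 7,
  "Progressive Disclosure": 7,
  "Testability": 4,
  "Maintainability": 8,
  "Safety & Trust": 14,
};

console.log(totalScore(example)); // 66
```

The maxima sum to 100, so the total can be read directly as a percentage.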

Safety & Trust is the dimension where pattern-matching tools fall apart, so skill-insp does semantic analysis here. It distinguishes between documentation and executable code: a SKILL.md that says "check for rm -rf usage" in a safety checklist is not a destructive operation. A scripts/cleanup.sh that actually runs rm -rf "$TEMP_DIR" is, and gets flagged with the file:line reference.

How It's Built

The skill follows the "model is the analyzer" pattern. There's no Python or Node script that parses YAML and counts characters. Instead:

skill-insp/
├── SKILL.md                       # Workflow + the 4 modes (default, detailed, apply, revert) + Run Evals
├── README.md
├── references/
│   ├── rubric.md                  # Scoring dimensions and what good looks like
│   └── output-format.md           # JSON schema for analysis.json
├── scripts/
│   ├── render-html.js             # analysis.json → latest.html
│   └── run-evals.js               # fixture setup + sub-agent prompt generation
├── assets/
│   └── report_template.html       # Self-contained HTML template
└── evals/
    └── evals.json                 # 8 functional eval scenarios

Two deterministic scripts handle the parts that should be deterministic:

  • render-html.js turns a JSON analysis into a self-contained HTML report. The template uses CSS custom properties for theming, conic-gradient score rings, and an auto-hidden eval results section.
  • run-evals.js creates a fixture skill in cache/_fixtures/<id>/, copies a snapshot of skill-insp itself into _skill_home/ so sub-agents can resolve <this-skill> references, and prints a self-contained sub-agent prompt.
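
For a sense of what the score ring involves, here is a rough sketch of the CSS a renderer could emit. It is an assumption about the approach, not the code in render-html.js: `scoreRingStyle` and the custom property names `--ring-fill` and `--ring-track` are made up for illustration.

```javascript
// Build an inline style for a score ring: the filled arc covers score/max of the circle.
// The second color stop at position 0 is clamped to the first stop's position,
// which gives a hard edge between fill and track instead of a blend.
// (Hypothetical helper and property names.)
function scoreRingStyle(score, max) {
  const pct = Math.round((score / max) * 100);
  return `background: conic-gradient(var(--ring-fill) ${pct}%, var(--ring-track) 0);`;
}

console.log(scoreRingStyle(66, 100));
// background: conic-gradient(var(--ring-fill) 66%, var(--ring-track) 0);
console.log(scoreRingStyle(4, 10));
// background: conic-gradient(var(--ring-fill) 40%, var(--ring-track) 0);
```

Because the colors come from custom properties, a theme only has to redefine two variables to restyle every ring.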

Everything else (reading files, parsing YAML, evaluating the rubric, distinguishing documentation from code) is done by the model. This is slower than a parser, but it's the only way to get the last of those right: a regex doesn't know that rm -rf inside a markdown code fence labeled "examples to flag" is not the same as rm -rf inside an executable script.
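
To make the false-positive problem concrete, here is a small sketch of the naive approach. It is hypothetical code, not part of skill-insp: `naiveScan` and the two sample files are invented for illustration.

```javascript
// A naive scanner: flag every line that contains "rm -rf", regardless of context.
function naiveScan(files) {
  const findings = [];
  for (const [name, text] of Object.entries(files)) {
    text.split("\n").forEach((line, i) => {
      if (/\brm\s+-rf\b/.test(line)) findings.push(`${name}:${i + 1}`);
    });
  }
  return findings;
}

const files = {
  // Documentation: a checklist item that merely mentions the command.
  "SKILL.md": "## Safety checklist\n- check for rm -rf usage in scripts",
  // Executable code: this line really deletes a directory when run.
  "scripts/cleanup.sh": '#!/bin/sh\nrm -rf "$TEMP_DIR"',
};

console.log(naiveScan(files));
// [ 'SKILL.md:2', 'scripts/cleanup.sh:2' ]
```

Both lines are flagged with equal weight, although only the second is a destructive operation. Telling them apart takes the surrounding context, which is the job skill-insp hands to the model.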

Progressive Disclosure in Practice

Earlier versions of skill-insp had a 300-line SKILL.md with the entire rubric inline. That hit the context budget hard and made the skill harder to edit. The current layout pushes details to references:

  • SKILL.md holds the workflow and mode entry points. ~150 lines.
  • rubric.md holds the scoring dimensions, priority levels, compactness rules, and safety guidance. Only loaded when the skill actually scores something.
  • output-format.md holds the JSON schema. Only loaded when writing analysis.json.

The result: SKILL.md is small enough to read in one sitting, and the model only loads the rubric when it's about to score.

The Eval Runner

This is the part that took the most iteration. The idea is simple: skill-insp ships with 8 eval scenarios in evals/evals.json, each describing a user prompt, fixture files to create, and expectations to verify. To run them:

node scripts/run-evals.js <skill-path> list
node scripts/run-evals.js <skill-path> setup <id>

setup creates a fixture directory, writes the fixture files, copies skill-insp's own resources into _skill_home/, and prints a JSON payload that includes a sub_agent_prompt. The parent agent reads that JSON, spawns a sub-agent with the prompt, and after the sub-agent finishes, checks each expectation:

  • File expectations ("analysis.json is written") → find over the fixture directory.
  • Content expectations ("a high-priority finding is raised") → read the output files.
  • Behavioral expectations ("the model reports an error and stops") → judge from the sub-agent's text output.
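
The first two kinds of check are mechanical, so they can be sketched in a few lines. The expectation shapes below (`kind`, `path`, `includes`) are hypothetical; the real evals.json schema may look different, and this sketch checks a known relative path rather than running a recursive find.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical expectation shapes (not the real evals.json schema):
//   { kind: "file", path: "analysis.json" }
//   { kind: "content", path: "analysis.json", includes: "..." }
//   { kind: "behavior", description: "..." }
function checkExpectation(fixtureDir, exp) {
  const target = exp.path ? path.join(fixtureDir, exp.path) : null;
  switch (exp.kind) {
    case "file":
      return fs.existsSync(target);
    case "content":
      return fs.existsSync(target) && fs.readFileSync(target, "utf8").includes(exp.includes);
    case "behavior":
      return null; // not mechanically checkable; judged from the sub-agent's text output
    default:
      throw new Error(`unknown expectation kind: ${exp.kind}`);
  }
}

// Demo against a throwaway directory (a temp dir is fine here; no sub-agent is involved).
const fixture = fs.mkdtempSync(path.join(os.tmpdir(), "fixture-"));
fs.writeFileSync(
  path.join(fixture, "analysis.json"),
  JSON.stringify({ findings: [{ priority: "high" }] })
);

console.log(checkExpectation(fixture, { kind: "file", path: "analysis.json" })); // true
console.log(checkExpectation(fixture, { kind: "content", path: "analysis.json", includes: '"priority":"high"' })); // true
console.log(checkExpectation(fixture, { kind: "file", path: "latest.html" })); // false
```

Returning null for behavioral expectations keeps the mechanical checks honest: anything that needs judgment is passed back to the parent agent instead of being guessed at.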

The Sandbox Gotcha

In the first version, fixtures were created under os.tmpdir(). This worked when I ran it manually, but sub-agents spawned by the harness were sandboxed to the project root — they got "permission denied" on every Read and Bash call against /var/folders/.../T/eval-skill-insp-*. Three out of eight evals failed for sandbox reasons that had nothing to do with the skill's logic.

The fix was a one-line change: move fixtures into cache/_fixtures/<id>/ inside the project. Now sub-agents inherit the project's filesystem permissions, and the cache directory is .gitignored so it doesn't pollute commits. After the change, all 8 evals run cleanly.

Lesson worth remembering: if you're going to spawn sub-agents, keep their working directory inside the parent's sandbox. Temp directories outside the project tree look like a clean choice but break under tighter permission policies.
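
A cheap guard against this class of failure is to check, before spawning anything, that the fixture path really sits under the project root. The sketch below is an illustration under assumptions: `isInsideProject` and `fixtureDir` are hypothetical helpers, and `/work/skills` is a made-up project root.

```javascript
const os = require("node:os");
const path = require("node:path");

// True when `candidate` is the project root itself or lives somewhere beneath it.
function isInsideProject(projectRoot, candidate) {
  const rel = path.relative(path.resolve(projectRoot), path.resolve(candidate));
  const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  return !escapes;
}

// Fixture location as described above: cache/_fixtures/<id>/ inside the project.
function fixtureDir(projectRoot, evalId) {
  return path.join(projectRoot, "cache", "_fixtures", String(evalId));
}

const projectRoot = path.resolve("/work/skills"); // hypothetical project root

// The current layout stays inside the sandbox.
console.log(isInsideProject(projectRoot, fixtureDir(projectRoot, 3))); // true

// The original layout under os.tmpdir() does not.
console.log(isInsideProject(projectRoot, path.join(os.tmpdir(), "eval-skill-insp-3"))); // false
```

Failing fast on a path that escapes the root turns a confusing "permission denied" inside a sub-agent into a clear error in the parent.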

Apply and Revert

When you say "apply recommendations", skill-insp:

  1. Re-reads the target files (state may have changed since the inspection).
  2. Copies each file it's about to modify into <cache_dir>/last-apply/.
  3. Computes SHA-256 hashes of each file before and after the edit, recording them in manifest.json.
  4. Makes the minimal edits.
  5. Re-runs the analysis so you see the new score.

The manifest looks like this:

{
  "applied_at": "2026-05-25T23:16:00Z",
  "recommendations_applied": [
    { "priority": "high", "dimension": "triggering", "text": "Replace vague description..." },
    { "priority": "high", "dimension": "usability", "text": "Add a ## Workflow section..." },
    { "priority": "high", "dimension": "completeness", "text": "Add allowed-tools list..." }
  ],
  "files": [
    {
      "relative_path": "SKILL.md",
      "before_sha256": "e698920f94613f8fc335cd0e941938e0990bedd72cea66e52a6b956d4ff47845",
      "after_sha256":  "10c1fafffbfd2b0089a85e72aafc43432374ef627cb0b41602f8396083fa2800"
    }
  ]
}
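
The backup-and-hash part of apply can be sketched with Node's built-in modules. This is a simplified illustration, not skill-insp's code (in the skill, the model performs the edits): `applyWithBackup` and the `edit` callback are hypothetical, and placing manifest.json inside last-apply/ is an assumption.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

function sha256(file) {
  return crypto.createHash("sha256").update(fs.readFileSync(file)).digest("hex");
}

// Back up each file, apply `edit` (a hypothetical text -> text function), and
// write a manifest holding the hashes from before and after the edit.
function applyWithBackup(skillDir, cacheDir, relativePaths, edit) {
  const backupDir = path.join(cacheDir, "last-apply");
  fs.mkdirSync(backupDir, { recursive: true });
  const files = relativePaths.map((relativePath) => {
    const target = path.join(skillDir, relativePath);
    const backup = path.join(backupDir, relativePath);
    fs.mkdirSync(path.dirname(backup), { recursive: true });
    fs.copyFileSync(target, backup); // backup first, so a failed edit is recoverable
    const before_sha256 = sha256(target);
    fs.writeFileSync(target, edit(fs.readFileSync(target, "utf8"), relativePath));
    return { relative_path: relativePath, before_sha256, after_sha256: sha256(target) };
  });
  const manifest = { applied_at: new Date().toISOString(), files };
  // Assumption: the manifest sits next to the backups.
  fs.writeFileSync(path.join(backupDir, "manifest.json"), JSON.stringify(manifest, null, 2));
  return manifest;
}

// Demo on a throwaway skill folder.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "apply-demo-"));
const skillDir = path.join(root, "my-skill");
fs.mkdirSync(skillDir);
fs.writeFileSync(path.join(skillDir, "SKILL.md"), "description: helps with stuff\n");

const manifest = applyWithBackup(skillDir, path.join(root, "cache"), ["SKILL.md"], (text) =>
  text.replace("helps with stuff", "Scores Claude Code skills")
);
console.log(manifest.files[0].before_sha256 !== manifest.files[0].after_sha256); // true
```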

Revert is the inverse: read the manifest, verify the current file hash matches the recorded after_sha256 (so we don't blow away edits made after the apply), then restore from backup. If the hash doesn't match, skill-insp reports the conflict instead of overwriting. It never falls back to git reset --hard or git checkout -- — those are blast-radius operations that don't belong in a recovery path.
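
The hash check is the core of revert, and it fits in a short sketch. As before, this is an illustration under assumptions rather than the skill's own code: `revert` is a hypothetical function, and the manifest is assumed to sit in the backup directory.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

function sha256(file) {
  return crypto.createHash("sha256").update(fs.readFileSync(file)).digest("hex");
}

// Restore from backup only if every file still matches the hash recorded after apply.
function revert(skillDir, backupDir) {
  const manifest = JSON.parse(fs.readFileSync(path.join(backupDir, "manifest.json"), "utf8"));
  const conflicts = manifest.files
    .filter((f) => sha256(path.join(skillDir, f.relative_path)) !== f.after_sha256)
    .map((f) => f.relative_path);
  if (conflicts.length > 0) return { reverted: false, conflicts }; // report, never overwrite
  for (const f of manifest.files) {
    fs.copyFileSync(path.join(backupDir, f.relative_path), path.join(skillDir, f.relative_path));
  }
  return { reverted: true, conflicts: [] };
}

// Demo: recreate the state right after an apply.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "revert-demo-"));
const skillDir = path.join(root, "my-skill");
const backupDir = path.join(root, "cache", "last-apply");
fs.mkdirSync(skillDir);
fs.mkdirSync(backupDir, { recursive: true });
fs.writeFileSync(path.join(backupDir, "SKILL.md"), "original\n"); // backup taken before apply
fs.writeFileSync(path.join(skillDir, "SKILL.md"), "applied\n"); // file as apply left it
fs.writeFileSync(
  path.join(backupDir, "manifest.json"),
  JSON.stringify({
    files: [{ relative_path: "SKILL.md", after_sha256: sha256(path.join(skillDir, "SKILL.md")) }],
  })
);

// A later manual edit changes the hash, so revert reports a conflict and touches nothing.
fs.appendFileSync(path.join(skillDir, "SKILL.md"), "manual edit\n");
const blocked = revert(skillDir, backupDir);
console.log(blocked); // { reverted: false, conflicts: [ 'SKILL.md' ] }

// With the file back in its post-apply state, revert restores the backup.
fs.writeFileSync(path.join(skillDir, "SKILL.md"), "applied\n");
const restored = revert(skillDir, backupDir);
console.log(restored.reverted, fs.readFileSync(path.join(skillDir, "SKILL.md"), "utf8").trim()); // true original
```

Checking every file before restoring any of them keeps the operation all-or-nothing: a single conflict leaves the whole skill untouched.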

Self-Inspection

Because skill-insp is itself a skill, it can score itself (评估 is the Chinese trigger word, meaning "evaluate"):

You: 评估 .claude/skills/skill-insp
Claude: ✨ skill-insp ✨
        Overall: skill-insp scores 94/100 — risk low, readiness ready.
        ...

The first time I did this, the report flagged things I'd already half-noticed but not bothered to fix: the cache slug derivation rule was dense without an example, the description didn't mention "fix" or "improve" as triggers, and there was no Node.js version floor documented. All of these became Low/Medium recommendations, which I then applied — and the score went up.

This is the most useful feedback loop I've found for skill authoring: write the skill, run skill-insp against it, apply the high-priority recommendations, repeat. The eval suite then verifies the workflow still works end-to-end.

When NOT to Use It

skill-insp's description explicitly says "Not for general code review." It's not a linter for arbitrary Python or TypeScript. It's specifically tuned to the structure and conventions of Claude Code skills:

  • It expects SKILL.md with YAML frontmatter.
  • It scores against a skill-specific rubric.
  • Its safety analysis is calibrated to skills (permission scoping, undisclosed network access in scripts, prompt injection in instructions).

If you point it at a normal source tree, it'll refuse to score because there's no SKILL.md — that's the intended behavior, not a bug.
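
The precondition itself is simple enough to express as code. In skill-insp the model makes this call while reading the folder, so treat the following as a hypothetical sketch: `refusalReason` is an invented name, and the frontmatter test only looks for a leading `---` block rather than parsing the YAML.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Returns a reason to refuse scoring, or null if the folder looks like a skill.
function refusalReason(dir) {
  const skillFile = path.join(dir, "SKILL.md");
  if (!fs.existsSync(skillFile)) return "no SKILL.md found";
  const text = fs.readFileSync(skillFile, "utf8");
  // Frontmatter: the file opens with a '---' line and has a later closing '---' line.
  const hasFrontmatter = /^---\r?\n[\s\S]*?\r?\n---\s*(\r?\n|$)/.test(text);
  return hasFrontmatter ? null : "SKILL.md has no YAML frontmatter";
}

// Demo: one ordinary source tree, one minimal skill folder.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "precheck-"));
const sourceTree = path.join(root, "my-app");
const skill = path.join(root, "my-skill");
fs.mkdirSync(sourceTree);
fs.mkdirSync(skill);
fs.writeFileSync(path.join(sourceTree, "index.js"), "console.log('hi');\n");
fs.writeFileSync(
  path.join(skill, "SKILL.md"),
  "---\nname: my-skill\ndescription: Does one thing\n---\n# my-skill\n"
);

console.log(refusalReason(sourceTree)); // no SKILL.md found
console.log(refusalReason(skill)); // null
```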

Trying It

Clone the repo and drop the folder into your Claude Code skills directory:

git clone https://github.com/conanttu/skills.git
ln -s "$(pwd)/skills/skill-insp" ~/.claude/skills/skill-insp

Then in any Claude Code session:

inspect the skill at ./my-skill

Or in Chinese:

评估 ./my-skill

The trigger phrases are listed in the description so the skill is invoked automatically. After the inspection, follow the numbered prompts:

  1. detailed mode to expand evidence
  2. apply recommendations to auto-fix high-priority findings
  3. run evals to verify the skill with eval scenarios
  4. revert to undo the last apply

What's Next

The current version is 1.0.0. A few things I'd like to add:

  • Diff view in the HTML report showing what apply actually changed.
  • Cross-skill consistency checks for repos that ship multiple skills.
  • Optional schema validation for evals.json and analysis.json.

If you build skills regularly, give it a try and let me know what falls over. The eval scenarios in evals/evals.json are a good place to start if you want to extend it — adding a new scenario is just adding a JSON entry with a prompt, fixture files, and expectations.