repo·evals · 2026-05-05 · main@HEAD (in anthropics/skills monorepo, pushed 2026-05-03)

skill-creator · anthropics/skill-creator · 🏭 81 / 100

Lifecycle: 01 Research → 02 Plan & design → 03 Code & review → 04 Package → 05 Maintain

Score bands: 🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100 → this repo: 🏭 81 / 100
  • 6 claims passed, no critical failures
  • MIT / Apache / etc., installable per deployment.install_methods
  • release_pipeline=1, recently_active=True
  • EN-only or ZH-only README
  • static-only eval; live e2e pending


Pipeline: Skill brief (or an existing skill to improve) → SKILL.md draft (auto + human review) → Analyzer agent (frontmatter / trigger fit) → Grader + comparator (blind A/B on benchmark) → eval-viewer (1,325-line HTML report) → Published skill + auditable benchmarks

Install methods:
  • `git clone anthropics/skills` + `cp -r skills/skill-creator ~/.claude/skills/` (any · moderate)
  • Git sparse-checkout for just this one skill (any · moderate)
Dependencies:
  • Anthropic Claude API: the LLM that runs the skill itself and powers all 3 sub-agents (analyzer / comparator / grader). The iterative improvement loop multiplies token cost per round, since grader, comparator, and analyzer all consume tokens; budget accordingly (a rough sketch follows below).
  • Python 3 runtime: runs the 9 scripts (eval / package / validate / report). Standard local Python, no extra cost.
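
To make "budget accordingly" concrete, here is a back-of-envelope sketch. The per-call token counts are invented placeholders, not measured values from this repo; only the three sub-agent roles come from the dossier.

```python
# Back-of-envelope token budget for the iterative improvement loop.
# Per-call token counts below are ILLUSTRATIVE assumptions, not measured
# values from this repo; only the three sub-agent roles are documented.

ROUND_CALLS = {
    "analyzer": 8_000,     # assumed tokens per analyzer pass
    "comparator": 12_000,  # assumed tokens per blind A/B comparison
    "grader": 10_000,      # assumed tokens per grading pass
}

def loop_budget(rounds: int, benchmark_trials: int = 1) -> int:
    """Estimate total tokens for `rounds` improvement iterations."""
    per_round = sum(ROUND_CALLS.values()) * benchmark_trials
    return per_round * rounds

# 5 rounds over a 10-trial benchmark ≈ 1.5M tokens under these assumptions.
print(loop_budget(rounds=5, benchmark_trials=10))
```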
Claims: 7 total · 6 passed, 1 untested
Score components: +40, +19, +10, +12, 0, 0 → total 81

6 / 7 claims passed: claim-001 through claim-006 passed; claim-007 untested.

Static check dimensions:
  • input_contract · output_contract · determinism · idempotence · no_skill_callouts · failure_mode_clarity
  • workflow_correctness · declared_call_graph · stop_conditions · handoff_points · atom_evidence · error_propagation · partial_failure_handling
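
A minimal sketch of how these two check families might be recorded. The check names are verbatim from above; grouping the first family at the atom (per-script) level and the second at the molecule (workflow) level, and the result shape, are assumptions.

```python
# Check names are verbatim from the dossier; the atom-vs-molecule grouping
# and the None-until-evaluated result shape are assumptions.

ATOM_CHECKS = [
    "input_contract", "output_contract", "determinism",
    "idempotence", "no_skill_callouts", "failure_mode_clarity",
]
MOLECULE_CHECKS = [
    "workflow_correctness", "declared_call_graph", "stop_conditions",
    "handoff_points", "atom_evidence", "error_propagation",
    "partial_failure_handling",
]

def empty_checklist() -> dict[str, str | None]:
    """One slot per check; None means not yet evaluated."""
    return {name: None for name in ATOM_CHECKS + MOLECULE_CHECKS}
```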

Verdict ceilings:
  • ceiling 1 · core user-facing layer untested → capped at 'usable'
  • ceiling 2 · hybrid-repo rule: archetype 'hybrid-skill' requires end-to-end evaluation of the user-facing layer
  • ceiling 3 · evidence_completeness='partial' (not portable) → capped at 'usable'
  • only 3/4 critical claims covered

archetype: hybrid-skill · core_layer_tested: False · evidence: partial · recommended: usable · final: usable
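
A minimal sketch of the ceiling logic the record above implies, assuming a simple ordered verdict scale; the function and constant names are hypothetical, and only the three ceiling rules come from the dossier.

```python
# Hypothetical reconstruction of the verdict-ceiling rules above.
# The verdict scale ordering is an assumption.

VERDICTS = ["blocked", "usable", "team-ready", "recommend"]

def apply_ceilings(recommended: str, *, archetype: str,
                   core_layer_tested: bool,
                   evidence_completeness: str) -> str:
    cap = "recommend"  # no cap by default
    if not core_layer_tested:
        cap = "usable"  # ceiling 1: core user-facing layer untested
    if archetype == "hybrid-skill" and not core_layer_tested:
        cap = "usable"  # ceiling 2: hybrid-repo rule
    if evidence_completeness == "partial":
        cap = "usable"  # ceiling 3: evidence not portable
    return min(recommended, cap, key=VERDICTS.index)

# Matches the record above: recommended 'usable' stays 'usable'.
assert apply_ceilings("usable", archetype="hybrid-skill",
                      core_layer_tested=False,
                      evidence_completeness="partial") == "usable"
```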

  • claim-001 · SKILL.md is a genuinely substantial document (≥ 400 lines) with standard frontmatter · critical · skill-shape · ● passed
  • claim-002 · each of the 9 Python scripts is a real, non-placeholder implementation · critical · tooling-completeness · ● passed
  • claim-003 · all 3 sub-agent prompts are real evaluation / comparison / analysis prompts · critical · eval-discipline · ● passed
  • claim-004 · eval-viewer is a real, working local eval dashboard · high · eval-tooling · ● passed
  • claim-005 · schemas.md provides a clear JSON schema reference · high · contract · ● passed
  • claim-006 · the subdirectory ships its own LICENSE rather than relying on the parent · high · licensing · ● passed
  • claim-007 · end-to-end: using skill-creator to create a new skill produces an eval report · critical · end-to-end · ○ untested

Runs (1): run-static-checks · 2026-05-05 · tokens in ? / out ?
# anthropics/skill-creator — final verdict (2026-05-05)

## Repo

- **Slug:** anthropics/skill-creator
- **Actual location:** github.com/anthropics/skills/tree/main/skills/skill-creator
- **Archetype:** hybrid-skill · **Layer:** molecule
- **Parent stars:** 128,303 · **License:** Apache 2.0 (in subdir) · **Pushed:** 2026-05-03

## What was evaluated

| Claim | Status | Notes |
|---|---|---|
| 001 SKILL.md substantive + frontmatter | passed | 485 lines, multi-intent triggers |
| 002 9 scripts non-trivial | passed | 7 of 9 scripts 102–401 lines (utils intentionally small) |
| 003 3 sub-agent prompts | passed | analyzer 274 / comparator 202 / grader 223 lines |
| 004 eval-viewer real | passed | 471-line Python + 1,325-line HTML |
| 005 schemas.md depth | passed | 430 lines of JSON schemas |
| 006 LICENSE in subdir | passed | Apache 2.0 (201 lines) |
| 007 live "create a skill" e2e | untested | needs Claude Code + Anthropic API |

## Real findings

1. **Eval-discipline = 3, the only repo in the batch to earn it.**
   This skill IS an eval framework. It ships:
   - `run_eval.py` (310 lines)
   - `run_loop.py` (328 lines, iterative improvement)
   - `aggregate_benchmark.py` (401 lines, variance analysis)
   - 3 LLM grader/comparator/analyzer agents
   - 1,325-line HTML viewer for browsing results
   No other repo evaluated has even half this depth on its own
   output quality. This is the canonical example for the
   `eval_discipline_score=3` field (a conceptual sketch of the loop
   follows this findings list).

2. **Heavyweight by design.** 485-line SKILL.md + ~70 KB Python +
   1.3K-line HTML viewer is an unusual amount of surface for a
   single skill. The scope (create / modify / eval / benchmark /
   optimize description) genuinely needs that much, but users
   evaluating "should I install this for a 50-line skill?" should
   know the overhead in advance — covered in `watch_out`.

3. **Sub-directory of a catalog (not a standalone repo) is a
   first-class case for the framework.** The other 18 repos in this
   batch are standalone; skill-creator lives inside
   `anthropics/skills/skills/skill-creator/`. The framework handled
   it gracefully — repo_url points to the subtree, parent's stars
   inherited as ecosystem signal, LICENSE bundled in subdir
   sufficed.

4. **Apache 2.0 inside the subdir is a great pattern.** Many
   in-house Anthropic projects ship without LICENSE; here it's
   self-contained at the skill level, so anyone copying just this
   one folder still has clear legal cover. Worth recommending as
   the default pattern for sub-skills inside catalog repos.

5. **3 sub-agents implement actually-honest evaluation:**
   - **comparator** is *blind* — doesn't know which version it's
     judging
   - **grader** scores against documented expectations, not vibes
   - **analyzer** is post-hoc — only looks at results, doesn't
     write or judge live runs
   That's the right shape for variance-aware skill evaluation.
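
As flagged in finding 1, here is a conceptual sketch of the draft → grade → compare → revise loop that `run_loop.py` implements. Every function body is a placeholder stand-in; only the loop shape and the three sub-agent roles come from the dossier.

```python
# Conceptual sketch only: all bodies are stubs, not this repo's real code.

def analyze(skill: str, scores: list[float]) -> str:
    return "suggested edits"        # post-hoc analyzer (stub)

def revise(skill: str, suggestions: str) -> str:
    return skill + "\n# revised"    # apply analyzer suggestions (stub)

def grade(skill: str) -> list[float]:
    return [0.5]                    # per-task grader scores (stub)

def improvement_loop(draft: str, rounds: int = 3) -> str:
    best, best_scores = draft, grade(draft)
    for _ in range(rounds):
        candidate = revise(best, analyze(best, best_scores))
        scores = grade(candidate)
        if sum(scores) > sum(best_scores):  # comparator picks the keeper
            best, best_scores = candidate, scores
    return best
```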
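And a sketch of the "blind" property from finding 5: the two versions are presented under shuffled neutral labels, so the judge cannot favor the newer one. `judge` is a hypothetical stand-in for the LLM comparator call.

```python
import random

def judge(sample_a: str, sample_b: str) -> str:
    """Stub for the LLM comparison prompt; returns 'A' or 'B'."""
    return random.choice(["A", "B"])

def blind_compare(baseline: str, candidate: str) -> str:
    versions = {"baseline": baseline, "candidate": candidate}
    order = list(versions)
    random.shuffle(order)  # the judge never learns which version is which
    verdict = judge(versions[order[0]], versions[order[1]])
    return order[0] if verdict == "A" else order[1]

print(blind_compare("old output", "new output"))  # 'baseline' or 'candidate'
```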

## Why the score is high

This is the exemplar repo for what the score model rewards:

- Static evidence: 6/7 claims passed → near-cap static eval points
- Maintainer evidence: eval-discipline=3 (+5) + recently_active (+5) → +10 of +15
- Ecosystem: 128K-star parent (+12)
- Layer bonus: molecule (+0)
- Penalties: 0

Predicted score: ~89/100, **🏭 Team-ready** territory.
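
For cross-checking, a minimal re-addition of the components from the summary panel at the top of this page. The +40 / +19 split appears there without labels, so treating both as static evidence is an assumption; the sum lands on the published 81, a little under the ~89 predicted here.

```python
# Components as shown in the summary panel; the static-evidence split
# (+40 / +19) is unlabeled there, so its interpretation is an assumption.
components = {
    "static_evidence": 40 + 19,  # near-cap static eval points
    "maintainer": 10,            # eval-discipline=3 (+5) + recently_active (+5)
    "ecosystem": 12,             # 128K-star parent
    "layer_bonus": 0,            # molecule
}
penalties = 0
print(sum(components.values()) - penalties)  # 81, the published 🏭 score
```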

## Why not higher

`recommendable` (90+) requires multi-evaluator coverage and live e2e
evidence. We have neither. claim-007 (live skill-creation flow) is
the gating evidence — until someone runs `run_loop.py` end-to-end
and logs the results, the dossier honestly says "team-ready" not
"recommend".

## Path to ⭐ Recommend

1. Run a happy-path scenario: in Claude Code, ask the skill to
   create a new skill, let it walk through draft → eval → iterate.
   Log in `runs/<date>/run-live-skill-creation/business-notes.md`.
2. Run a benchmark variance scenario: invoke
   `scripts/aggregate_benchmark.py` on an existing skill with ≥10
   trials; verify variance numbers + viewer.html report (a minimal
   aggregation sketch follows this list).
3. Multi-evaluator: have a second person on a different machine run
   the same flow and confirm reproducibility.
4. Update claim-007 to passed; re-run verdict_calculator.
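
As noted in step 2, here is a minimal sketch of the variance aggregation that step asks for, in the spirit of `aggregate_benchmark.py`. The trial scores are invented example data.

```python
from statistics import mean, stdev

# Invented example data: per-trial benchmark scores for one skill (≥10 trials).
trial_scores = [0.72, 0.68, 0.75, 0.71, 0.69, 0.74, 0.70, 0.73, 0.67, 0.76]

print(f"mean={mean(trial_scores):.3f}  stdev={stdev(trial_scores):.3f}")
# A stdev that is large relative to the gap between two versions means a
# single-run A/B call is noise; that is why ≥10 trials are required.
```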

## Recommended

```yaml
status: evaluated
```