# anthropics/skill-creator — dossier summary (2026-05-05)

- **Skill:** skill-creator (anthropics/skill-creator)
- **Source:** main@HEAD in the anthropics/skills monorepo, pushed 2026-05-03
- **Score:** 🏭 81 / 100 (bands: 🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100)

Summary:

- ✓ 6 claims passed, no critical failures
- ✓ MIT / Apache / etc., installable per deployment.install_methods
- ◐ release_pipeline=1, recently_active=True
- ⚪ EN-only or ZH-only README
- ⚪ static-only eval; live e2e pending
## Install methods

| Install method | Compatibility | Effort |
|---|---|---|
| git clone anthropics/skills + cp -r skills/skill-creator ~/.claude/skills/ | any | moderate |
| Git sparse-checkout for just this one skill | any | moderate |
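A minimal sketch of the sparse-checkout route in the second row, assuming the skill lives at `skills/skill-creator` inside the monorepo and that `~/.claude/skills/` is the install target (both taken from the rows above); the repo URL and destination layout are assumptions, not a documented install procedure.

```python
# Sketch: install only skills/skill-creator from the anthropics/skills
# monorepo via git sparse-checkout, then copy it into ~/.claude/skills/.
# Repo URL, subdirectory, and destination are taken from the rows above and
# the Repo section below, but this is not a documented install procedure.
import shutil
import subprocess
from pathlib import Path

REPO = "https://github.com/anthropics/skills.git"
SUBDIR = "skills/skill-creator"                     # path inside the monorepo
DEST = Path.home() / ".claude" / "skills" / "skill-creator"
WORKDIR = Path("skills-sparse")

def run(*cmd, cwd=None):
    subprocess.run(cmd, cwd=cwd, check=True)

# Blobless, sparse clone keeps the download small; then materialize one subdir.
run("git", "clone", "--filter=blob:none", "--sparse", REPO, str(WORKDIR))
run("git", "sparse-checkout", "set", SUBDIR, cwd=WORKDIR)

shutil.copytree(WORKDIR / SUBDIR, DEST, dirs_exist_ok=True)
print(f"installed to {DEST}")
```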
## Dependencies

| Dependency | Role | Cost |
|---|---|---|
| Anthropic Claude API | LLM that runs the skill itself and powers all 3 sub-agents (analyzer / comparator / grader) | Iterative improvement loop multiplies token cost per round (grader + comparator + analyzer all consume tokens); budget accordingly |
| Python 3 runtime | Runs the 9 scripts (eval / package / validate / report) | Standard local Python, no extra cost |
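The token-cost note above is worth sanity-checking before turning on the loop. A minimal back-of-the-envelope sketch follows; every per-agent token figure in it is a hypothetical placeholder, not a measurement of skill-creator's actual prompts or runs.

```python
# Back-of-the-envelope budget for the iterative improvement loop.
# Every per-agent token figure here is a hypothetical placeholder,
# not a measurement of skill-creator's actual prompts or runs.
AGENT_TOKENS_PER_ROUND = {
    "grader": 6_000,      # scores the candidate against documented expectations
    "comparator": 8_000,  # blind A/B comparison of two versions
    "analyzer": 5_000,    # post-hoc analysis of eval results
}
DRAFT_TOKENS_PER_ROUND = 10_000  # the skill rewriting its draft each round

def estimate_tokens(rounds: int) -> int:
    per_round = DRAFT_TOKENS_PER_ROUND + sum(AGENT_TOKENS_PER_ROUND.values())
    return rounds * per_round

for rounds in (1, 3, 5):
    print(f"{rounds} round(s): ~{estimate_tokens(rounds):,} tokens")
```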
## Score breakdown

- Claims: 6 / 7 passed (claim-001 through claim-006 passed; claim-007 untested)
- Score components: +40 + 19 + 10 + 12 + 0 + 0 = 81 (component labels not preserved)
Static check dimensions:

- input_contract, output_contract, determinism, idempotence, no_skill_callouts, failure_mode_clarity
- workflow_correctness, declared_call_graph, stop_conditions, handoff_points, atom_evidence, error_propagation, partial_failure_handling
## Verdict ceilings

- ceiling 1 · core user-facing layer untested → capped at 'usable'
- ceiling 2 · hybrid-repo rule: archetype 'hybrid-skill' requires end-to-end evaluation of the user-facing layer
- ceiling 3 · evidence_completeness='partial' (not portable) → capped at 'usable'
- only 3/4 critical claims covered

Decision chain: archetype hybrid-skill → core_layer_tested? False → evidence: partial → recommended: usable → final: usable (a code sketch of how such ceilings compose follows below).
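A minimal sketch of how a chain of ceilings like the one above can be applied; the level names, field names, and function signature are illustrative assumptions, not the actual `verdict_calculator` implementation.

```python
# Illustrative re-statement of the ceiling chain above. Level names, fields,
# and the function are assumptions, not the actual verdict_calculator code.
LEVELS = ["unproven", "usable", "team-ready", "recommend"]  # low -> high

def lower(a: str, b: str) -> str:
    """Return the lower (more conservative) of two levels."""
    return min(a, b, key=LEVELS.index)

def apply_ceilings(recommended: str, *, archetype: str,
                   core_layer_tested: bool, evidence_completeness: str) -> str:
    final = recommended
    if not core_layer_tested:                                   # ceiling 1
        final = lower(final, "usable")
    if archetype == "hybrid-skill" and not core_layer_tested:   # ceiling 2
        final = lower(final, "usable")
    if evidence_completeness == "partial":                      # ceiling 3
        final = lower(final, "usable")
    return final

# skill-creator's inputs, per the decision chain above:
print(apply_ceilings("usable", archetype="hybrid-skill",
                     core_layer_tested=False, evidence_completeness="partial"))
# -> usable
```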
## Claims

| ID | Claim | Severity | Dimension | Status | Notes |
|---|---|---|---|---|---|
| claim-001 | SKILL.md is a real, substantial document (≥ 400 lines) with standard frontmatter | critical | skill-shape | ● passed | |
| claim-002 | Each of the 9 Python scripts is a real, non-placeholder implementation | critical | tooling-completeness | ● passed | |
| claim-003 | All 3 sub-agent prompts are real evaluation / comparison / analysis prompts | critical | eval-discipline | ● passed | |
| claim-004 | eval-viewer is a real, working local evaluation dashboard | high | eval-tooling | ● passed | |
| claim-005 | schemas.md provides a clear JSON schema reference | high | contract | ● passed | |
| claim-006 | The subdirectory ships its own LICENSE and does not rely on the parent | high | licensing | ● passed | |
| claim-007 | End-to-end: using skill-creator to create a new skill produces an eval report | critical | end-to-end | ○ untested | |
## Run log

- run-static-checks (2026-05-05) — 0% token usage (tokens in ? / out ?), 0.00s elapsed
# anthropics/skill-creator — final verdict (2026-05-05)
## Repo
- **Slug:** anthropics/skill-creator
- **Actual location:** github.com/anthropics/skills/tree/main/skills/skill-creator
- **Archetype:** hybrid-skill · **Layer:** molecule
- **Parent stars:** 128,303 · **License:** Apache 2.0 (in subdir) · **Pushed:** 2026-05-03
## What was evaluated
| Claim | Status | Notes |
|---|---|---|
| 001 SKILL.md substantive + frontmatter | passed | 485 lines, multi-intent triggers |
| 002 9 scripts non-trivial | passed | 7 of 9 scripts 102–401 lines (utils intentionally small) |
| 003 3 sub-agent prompts | passed | analyzer 274 / comparator 202 / grader 223 lines |
| 004 eval-viewer real | passed | 471-line Python + 1,325-line HTML |
| 005 schemas.md depth | passed | 430 lines of JSON schemas |
| 006 LICENSE in subdir | passed | Apache 2.0 (201 lines) |
| 007 live "create a skill" e2e | untested | needs Claude Code + Anthropic API |
## Real findings
1. **Eval-discipline = 3, the only repo in the batch to earn it.**
This skill IS an eval framework. It ships:
- `run_eval.py` (310 lines)
- `run_loop.py` (328 lines, iterative improvement)
- `aggregate_benchmark.py` (401 lines, variance analysis)
- 3 LLM grader/comparator/analyzer agents
- 1,325-line HTML viewer for browsing results
No other repo evaluated puts even half this depth into its own
output quality. This is the canonical example for the
`eval_discipline_score=3` field.
2. **Heavyweight by design.** 485-line SKILL.md + ~70 KB Python +
1.3K-line HTML viewer is an unusual amount of surface for a
single skill. The scope (create / modify / eval / benchmark /
optimize description) genuinely needs that much, but users
evaluating "should I install this for a 50-line skill?" should
know the overhead in advance — covered in `watch_out`.
3. **Sub-directory of a catalog (not a standalone repo) is a
first-class case for the framework.** The other 18 repos in this
batch are standalone; skill-creator lives inside
`anthropics/skills/skills/skill-creator/`. The framework handled
it gracefully — repo_url points to the subtree, parent's stars
inherited as ecosystem signal, LICENSE bundled in subdir
sufficed.
4. **Apache 2.0 inside the subdir is a great pattern.** Many
in-house Anthropic projects ship without LICENSE; here it's
self-contained at the skill level, so anyone copying just this
one folder still has clear legal cover. Worth recommending as
the default pattern for sub-skills inside catalog repos.
5. **3 sub-agents implement actually-honest evaluation:**
- **comparator** is *blind* — doesn't know which version it's
judging
- **grader** scores against documented expectations, not vibes
- **analyzer** is post-hoc — only looks at results, doesn't
write or judge live runs
That's the right shape for variance-aware skill evaluation.
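To make the "blind" property in finding 5 concrete, here is a minimal sketch of a blind A/B comparison under assumed interfaces; it is not skill-creator's comparator prompt or harness, just the shape the finding describes.

```python
import random

# Sketch of a blind A/B comparison: the judge never learns which output came
# from the "old" vs "new" skill version. Interfaces are assumed, not the
# comparator prompt or harness that skill-creator actually ships.
def blind_compare(output_old: str, output_new: str, judge) -> str:
    labeled = [("old", output_old), ("new", output_new)]
    random.shuffle(labeled)                               # hide version identity
    verdict = judge(a=labeled[0][1], b=labeled[1][1])     # judge sees only a/b
    return labeled[0][0] if verdict == "a" else labeled[1][0]

# Trivial stand-in judge that prefers the longer answer:
pick_longer = lambda a, b: "a" if len(a) >= len(b) else "b"
print(blind_compare("short answer", "a longer, more detailed answer", pick_longer))
# -> new
```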
## Why the score is high
This is the exemplar repo for what the score model rewards:
- Static evidence: 6/7 claims passed → near-cap static eval points
- Maintainer evidence: eval-discipline=3 (+5) + recently_active (+5) → +10 of +15
- Ecosystem: 128K-star parent (+12)
- Layer bonus: molecule (+0)
- Penalties: 0
Predicted score: ~89/100, **🏭 Team-ready** territory.
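A minimal sketch of the additive composition the bullets imply; the static-evidence value is left as a parameter because the bullets only say "near-cap", and nothing here is the actual scoring rubric.

```python
# Additive composition implied by the bullets above. static_points is left as
# a parameter because "near-cap" is not quantified; the other values are the
# ones named in the bullets. This is not the full scoring rubric.
def predicted_score(static_points: int) -> int:
    maintainer = 10   # eval-discipline=3 (+5) + recently_active (+5)
    ecosystem = 12    # 128K-star parent
    layer_bonus = 0   # molecule
    penalties = 0
    return static_points + maintainer + ecosystem + layer_bonus - penalties

# A static component in the high 60s would land on the ~89 quoted above:
print(predicted_score(67))   # -> 89
```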
## Why not higher
`recommendable` (90+) requires multi-evaluator coverage and live e2e
evidence. We have neither. claim-007 (live skill-creation flow) is
the gating evidence — until someone runs `run_loop.py` end-to-end
and logs the results, the dossier honestly says "team-ready" not
"recommend".
## Path to ⭐ Recommend
1. Run a happy-path scenario: in Claude Code, ask the skill to
create a new skill, let it walk through draft → eval → iterate.
Log in `runs/<date>/run-live-skill-creation/business-notes.md`.
2. Run a benchmark variance scenario: invoke
`scripts/aggregate_benchmark.py` on an existing skill with ≥10
trials; verify variance numbers + viewer.html report (a sketch of
the variance check follows this list).
3. Multi-evaluator: have a second person on a different machine run
the same flow and confirm reproducibility.
4. Update claim-007 to passed; re-run verdict_calculator.
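As referenced in step 2, a minimal sketch of what "verify variance numbers" means, assuming the ≥10 trial scores are available as plain floats; this is not `aggregate_benchmark.py`'s actual interface.

```python
import statistics

# Sketch of the "variance numbers" check from step 2: given >= 10 trial scores
# for the same skill, report the spread. Not aggregate_benchmark.py's interface.
def summarize_trials(scores: list) -> dict:
    if len(scores) < 10:
        raise ValueError("need at least 10 trials for a meaningful estimate")
    return {
        "trials": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

# Example with made-up trial scores:
print(summarize_trials([0.82, 0.79, 0.85, 0.81, 0.80, 0.83, 0.78, 0.84, 0.82, 0.80]))
```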
## Recommended
```yaml
status: evaluated
```