# goose-skills — evaluation report (2026-05-04)

- **Repo:** gooseworks-ai/goose-skills
- **Ref evaluated:** main@HEAD (skills-index v1.2.0, generated 2026-04-29)
- **Layer model:** ⚛ atom → ⚗ molecule → 🧬 compound
- **Score:** 🛠 66 / 100 (bands: 🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100)
- ✓ 6 claims passed, no critical failures
- ⚠ README may claim a license, but no LICENSE file exists
- ✓ release_pipeline_score=2 and pushed within the 90-day window
- ⚪ README is EN-only or ZH-only
- ⚪ static-only evaluation; live end-to-end run pending
## Install methods

| Method | Works with | Difficulty |
|---|---|---|
| `npx gooseworks install --claude/--cursor/--codex/--all` | any (npm) | easy |
| `git clone` + copy `skills/` to `~/.claude/skills/` | any | moderate |

## Dependencies

| Dependency | Role | Notes |
|---|---|---|
| Anthropic Claude API (or Cursor / Codex) | LLM that consumes the skill prompts | Per-skill execution token cost varies; some skills also hit external APIs (Reddit, GitHub) |
| Per-skill external APIs | Some composites call Reddit / GitHub / Meta Ad Library / Apollo / Semrush / Ahrefs / Apify | Each skill in the catalog may need its own API keys — read SKILL.md before running. Apollo / Semrush / Ahrefs are paid |
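Because each skill may require its own keys, a preflight check before launching a composite fails fast instead of dying mid-run on the first external call. A minimal sketch — the key names `REDDIT_CLIENT_ID` and `APOLLO_API_KEY` are hypothetical examples, not names this repo documents; read each skill's SKILL.md for the real ones:

```python
import os

# Hypothetical key names -- substitute whatever the skill's SKILL.md lists.
REQUIRED_KEYS = ["REDDIT_CLIENT_ID", "APOLLO_API_KEY"]

def missing_keys(required=REQUIRED_KEYS):
    """Return the required environment variables that are unset or empty."""
    return [key for key in required if not os.environ.get(key)]
```

Calling `missing_keys()` at the top of a run surfaces every absent credential at once, rather than one failure per external call.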
## Scoring

7 claims: 4 passed · 2 passed-with-concerns · 1 untested

| Points | Component |
|---|---|
| +40 | |
| +18 | |
| +10 | |
| 0 | |
| 0 | |
| −2 | |

**6 / 7** claims counted as passing:

- passed: claim-001, claim-002, claim-003, claim-004, claim-005, claim-006
- untested: claim-007
**Atom rubric**

| Check | Result |
|---|---|
| input_contract | |
| output_contract | |
| determinism | |
| idempotence | |
| no_skill_callouts | |
| failure_mode_clarity | |

**Molecule rubric**

| Check | Result |
|---|---|
| workflow_correctness | |
| declared_call_graph | |
| stop_conditions | |
| handoff_points | |
| atom_evidence | |
| error_propagation | |
| partial_failure_handling | |
- core user-facing layer untested → capped at 'usable'
- hybrid-repo rule: archetype 'prompt-skill' requires end-to-end evaluation of the user-facing layer
- evidence_completeness='partial' (not portable) → capped at 'usable'
- only 4/5 critical claims covered
Decision chain: archetype = prompt-skill → core_layer_tested? false → evidence = partial → recommended = usable → final = usable
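The ceiling chain above is mechanical, so it can be sketched as a function. The bucket names and rule wording are taken from this report; the actual repo-evals rule engine may differ:

```python
BUCKETS = ["usable", "reusable", "recommendable"]  # low -> high

def final_bucket(archetype, core_layer_tested, evidence_completeness,
                 recommended="usable"):
    """Apply the three ceilings from this report; return the final bucket."""
    ceiling = BUCKETS[-1]
    # Ceiling 1: core user-facing layer untested.
    if not core_layer_tested:
        ceiling = "usable"
    # Ceiling 2: prompt-skill archetype requires end-to-end evaluation.
    if archetype == "prompt-skill" and not core_layer_tested:
        ceiling = "usable"
    # Ceiling 3: evidence_completeness='partial' (not portable).
    if evidence_completeness == "partial":
        ceiling = "usable"
    # Final bucket is the lower of the recommendation and the ceiling.
    return min(recommended, ceiling, key=BUCKETS.index)
```

For this run, `final_bucket("prompt-skill", False, "partial")` reproduces the `usable` verdict; with live execution logged and complete evidence, the same recommendation would pass through uncapped.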
| ID | Claim | Severity | Area | Status | Notes |
|---|---|---|---|---|---|
| claim-001 | skills-index.json aligns with the total skill count the docs promise | critical | catalog-coverage | ◐ partial | |
| claim-002 | all three categories — capabilities / composites / playbooks — genuinely exist and are non-empty | critical | taxonomy | ● passed | |
| claim-003 | every skill follows a uniform metadata contract | high | contract | ● passed | |
| claim-004 | the npm package actually installs and the bin entry point exists | critical | install | ◐ partial | |
| claim-005 | skill packs (e.g. lead-gen-devtools) really are multi-skill bundles | high | composition | ● passed | |
| claim-006 | every skill can be invoked on all three of Claude Code / Cursor / Codex | critical | cross-platform | ● passed | |
| claim-007 | end-to-end: installing and invoking a skill inside a real agent completes a task | critical | end-to-end | ○ untested | |
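A check like claim-001 reduces to counting manifest entries and comparing against the documented number. A sketch, assuming skills-index.json has a top-level `skills` array and the README states its count as "N skills" (the actual schema and wording may differ):

```python
import json
import re
from pathlib import Path

def manifest_vs_readme(index_path="skills-index.json", readme_path="README.md"):
    """Return (manifest skill count, first 'N skills' figure in README)."""
    skills = json.loads(Path(index_path).read_text())["skills"]  # assumed schema
    match = re.search(r"(\d+)\s+skills", Path(readme_path).read_text())
    documented = int(match.group(1)) if match else None
    return len(skills), documented
```

For this repo the run found 204 manifest entries against "108 skills" in the README — the mismatch that makes claim-001 partial rather than passed.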
## Runs

| Run | Date | Coverage | Duration | Tokens |
|---|---|---|---|---|
| run-static-checks | 2026-05-04 | 0% | 0.00s | in ? / out ? |
# goose-skills — final verdict (2026-05-04)
## Repo
- **Name:** gooseworks-ai/goose-skills
- **Branch evaluated:** main@HEAD (skills-index 1.2.0, generated 2026-04-29)
- **Archetype:** prompt-skill (catalog of prompt skills)
- **Layer:** **molecule** at the repo level (catalog wired by
manifest + npm installer); individual skills have their own layer
(capabilities ≈ atom, composites ≈ molecule, playbooks ≈ compound)
- **Eval framework:** repo-evals layer model v1 (fe256e5)
## Bucket
**`usable`** — strong static layer; capped by the molecule rule
because no live skill execution has been logged on this evaluator's
machine.
## What was evaluated
### Atom + molecule level (static, this run)
| Claim | Status | Notes |
|---|---|---|
| 001 catalog count | passed_with_concerns | 204 skills in manifest vs 108 in README — docs stale |
| 002 three categories | passed | 143 capabilities + 56 composites + 5 playbooks, all non-empty |
| 003 metadata contract | passed | Sampled 3 skills — uniform shape with `installation.{base_command, supports}` |
| 004 npm + bin | passed_with_concerns | bin/goose-skills.js exists (12.5 KB); npm@1.1.0 is ahead of repo@1.0.1 |
| 005 packs | passed | 2 real packs (lead-gen-devtools=7 skills, video-production=5 skills); README's "7-skill" claim matches |
| 006 cross-platform | passed | All 204 skills declare `supports = [claude, cursor, codex]` (100% uniform) |
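Claim-006's 100% figure is a uniformity check over the manifest. The `installation.supports` field is the one the metadata-contract check (claim-003) observed; the top-level `skills` array is an assumed schema detail:

```python
import json
from pathlib import Path

EXPECTED = {"claude", "cursor", "codex"}

def supports_coverage(index_path="skills-index.json"):
    """Fraction of manifest entries declaring all three supported platforms."""
    skills = json.loads(Path(index_path).read_text())["skills"]  # assumed schema
    ok = sum(1 for s in skills
             if set(s.get("installation", {}).get("supports", [])) == EXPECTED)
    return ok / len(skills) if skills else 0.0
```

A return value of `1.0` corresponds to the "100% uniform" result above; anything lower pinpoints stragglers worth listing in the notes column.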
### Molecule level (deferred)
| Claim | Status | Required |
|---|---|---|
| 007 live skill execution | untested | install via `npx`, run 1 capability + 1 composite + 1 playbook in a real agent session, log token + output evidence |
## Real findings worth surfacing
1. **Cap/Comp/Play taxonomy ≈ atom/molecule/compound.** Goose's
internal classification (capabilities → composites → playbooks) is
functionally identical to repo-evals' atom/molecule/compound layer
model. We didn't invent the insight; we formalized it. This is
worth surfacing in the meta-reflection on framework neutrality.
2. **README is meaningfully out of date.** "108 skills" is the
headline, "204" is the reality. Not a false claim, but it
under-sells the catalog and could send users to the npm package
thinking the surface is half what it is.
3. **npm is one minor version ahead of repo.** A user reading the
source on `main` (v1.0.1) sees something different from what
`npx goose-skills install` ships (v1.1.0 on the registry). Not
broken, but a maintainer / contributor will be confused.
4. **Pack contract is real, not marketing.** Both packs ship genuine
`shared_files` (.env.example + requirements.txt + more), so
"configure once, use whole pack" is structurally enforced, not
just suggested.
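Finding 4 can be verified mechanically: a pack's declared `shared_files` must actually exist on disk, or "configure once" is only implied. A sketch, assuming a pack is a manifest entry carrying a `shared_files` list of repo-relative paths:

```python
from pathlib import Path

def missing_shared_files(pack_manifest, repo_root="."):
    """Return declared shared_files that do not exist under the repo root."""
    declared = pack_manifest.get("shared_files", [])  # assumed field name
    return [f for f in declared if not (Path(repo_root) / f).is_file()]
```

An empty return for both packs would turn this finding from a spot check into a regression test.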
## Why not higher
`usable` is the right ceiling because:
- No live skill execution evidence on this machine. The catalog could
have 204 manifest entries and still ship low-signal SKILL.md content
inside any one of them. Per-skill quality is the trust-determining
variable, and we sampled only the manifest, not the prompt content.
- Skill quality is heterogeneous by definition (different authors,
different review depth) — we'd need to sample, not assume.
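The sampling the last bullet calls for needs no tooling; a seeded random draw over the skill directories makes the manual SKILL.md review reproducible. The `skills/` path is the layout used by the clone-based install above; the seed is arbitrary:

```python
import random
from pathlib import Path

def sample_skills(root="skills", k=10, seed=2026):
    """Pick k skill directories deterministically for manual SKILL.md review."""
    dirs = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    rng = random.Random(seed)  # fixed seed -> same sample on re-run
    return rng.sample(dirs, k=min(k, len(dirs)))
```

Recording the seed alongside the reviewed names lets a second evaluator audit exactly the same slice of the catalog.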
## Path to `reusable`
1. `npx gooseworks install --claude` in fresh Claude Code.
2. Pick 1 skill per category (suggested: `brand-voice-extractor` /
`competitor-intel` / `competitor-monitoring-system`).
3. Run each on a representative input. Capture the agent's
intermediate plan, final output, and token usage.
4. Log under `runs/<date>/run-live-execution/` with one business-notes
   file per skill.
5. Update claim-007 status. If all three pass with a useful artefact
and the `lead-gen-devtools` pack also runs end-to-end, candidate
for `reusable` (still not `recommendable` until 2nd evaluator).
## Recommended
```yaml
current_bucket: usable
status: evaluated
```