repo-evals · 2026-05-04 · main@HEAD (skills-index v1.2.0, generated 2026-04-29)

goose-skills (gooseworks-ai/goose-skills) — 🛠 66 / 100
Workflow stage: 01 Research · 02 Plan & design · 03 Code & review · 04 Package · 05 Maintain
Score bands: 🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100

🛠 66 / 100
  • 6 claims passed, no critical failures
  • README may claim a license but no LICENSE file exists
  • release_pipeline_score=2 + pushed in 90-day window
  • EN-only or ZH-only README
  • static-only eval; live e2e pending


[Architecture diagram] Coding agent (Claude / Cursor / Codex) → skills-index.json (catalog router) → 143 Capabilities (atomic skills) → 56 Composites (multi-skill chains) → 5 Playbooks (end-to-end) → GTM artifacts (briefs / posts / leads / battlecards)

Install options:

| Command | Platform | Difficulty |
|---|---|---|
| `npx gooseworks install --claude/--cursor/--codex/--all` | any (npm) | easy |
| `git clone` + copy `skills/` to `~/.claude/skills/` | any | moderate |

Dependencies:

- 📡 Anthropic Claude API (or Cursor / Codex) — the LLM that consumes the skill prompts. Per-skill execution token cost varies; some skills hit external APIs (Reddit, GitHub) too.
- Per-skill external APIs — some composites call Reddit / GitHub / Meta Ad Library / Apollo / Semrush / Ahrefs / Apify. Each skill in the catalog may need its own API keys; read SKILL.md before running. Apollo / Semrush / Ahrefs are paid.
7 claims · 4 passed · 2 partial · 1 untested

Score components: +40 +18 +10 +0 +0 −2 = 66

6 / 7 claims passed:

● claim-001 · ● claim-002 · ● claim-003 · ● claim-004 · ● claim-005 · ● claim-006 · ○ claim-007 (untested)

Atom-level criteria: input_contract · output_contract · determinism · idempotence · no_skill_callouts · failure_mode_clarity

Molecule-level criteria: workflow_correctness · declared_call_graph · stop_conditions · handoff_points · atom_evidence · error_propagation · partial_failure_handling

archetype: prompt-skill · core_layer_tested: false · evidence: partial · recommended: usable · final: usable

- ceiling 1 · core user-facing layer untested → capped at 'usable'
- ceiling 2 · hybrid-repo rule: archetype 'prompt-skill' requires end-to-end evaluation of the user-facing layer
- ceiling 3 · evidence_completeness='partial' (not portable) → capped at 'usable'
- only 4/5 critical claims covered

| Claim | Check | Severity | Area | Status |
|---|---|---|---|---|
| claim-001 | skills-index.json matches the skill count promised in the docs | critical | catalog-coverage | ◐ partial |
| claim-002 | all three categories (capabilities / composites / playbooks) genuinely exist and are non-empty | critical | taxonomy | ● passed |
| claim-003 | every skill follows a uniform metadata contract | high | contract | ● passed |
| claim-004 | the npm package actually installs and the bin entry exists | critical | install | ◐ partial |
| claim-005 | skill packs (e.g. lead-gen-devtools) are genuine multi-skill collections | high | composition | ● passed |
| claim-006 | every skill can be invoked from Claude Code / Cursor / Codex alike | critical | cross-platform | ● passed |
| claim-007 | end-to-end: installing and invoking a skill inside a real agent completes a task | critical | end-to-end | ○ untested |

Run log: run-static-checks · 2026-05-04 · 0% · 0.00s · tokens in ? / out ?
# goose-skills — final verdict (2026-05-04)

## Repo

- **Name:** gooseworks-ai/goose-skills
- **Branch evaluated:** main@HEAD (skills-index 1.2.0, generated 2026-04-29)
- **Archetype:** prompt-skill (catalog of prompt skills)
- **Layer:** **molecule** at the repo level (catalog wired by
  manifest + npm installer); individual skills have their own layer
  (capabilities ≈ atom, composites ≈ molecule, playbooks ≈ compound)
- **Eval framework:** repo-evals layer model v1 (fe256e5)

## Bucket

**`usable`** — strong static layer; capped by the molecule rule
because no live skill execution has been logged on this evaluator's
machine.

## What was evaluated

### Atom + molecule level (static, this run)

| Claim | Status | Notes |
|---|---|---|
| 001 catalog count | passed_with_concerns | 204 skills in manifest vs 108 in README — docs stale |
| 002 three categories | passed | 143 capabilities + 56 composites + 5 playbooks, all non-empty |
| 003 metadata contract | passed | Sampled 3 skills — uniform shape with `installation.{base_command, supports}` |
| 004 npm + bin | passed_with_concerns | bin/goose-skills.js exists (12.5 KB); npm@1.1.0 is ahead of repo@1.0.1 |
| 005 packs | passed | 2 real packs (lead-gen-devtools=7 skills, video-production=5 skills); README's "7-skill" claim matches |
| 006 cross-platform | passed | All 204 skills declare `supports = [claude, cursor, codex]` (100% uniform) |
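
The static layer above is cheap to re-run. A minimal sketch of the manifest checks behind claims 002/003/006 — the field names (`skills`, `category`, `installation.base_command`, `installation.supports`) and the category values are assumptions extrapolated from the sampled shape in claim 003, not the evaluator's actual harness:

```python
import json

PLATFORMS = {"claude", "cursor", "codex"}


def check_manifest(path="skills-index.json"):
    """Static checks mirroring claims 002/003/006 (field names assumed)."""
    with open(path) as f:
        index = json.load(f)

    skills = index.get("skills", [])
    by_category = {}
    problems = []
    for skill in skills:
        by_category.setdefault(skill.get("category", "?"), []).append(skill)
        inst = skill.get("installation", {})
        # claim 003: uniform metadata contract on every entry
        if "base_command" not in inst or "supports" not in inst:
            problems.append(f"{skill.get('name')}: missing installation contract")
        # claim 006: all three platforms declared
        elif set(inst["supports"]) != PLATFORMS:
            problems.append(f"{skill.get('name')}: supports={inst['supports']}")
    # claim 002: all three categories non-empty (category values assumed)
    for category in ("capability", "composite", "playbook"):
        if not by_category.get(category):
            problems.append(f"category '{category}' empty or missing")
    return len(skills), problems
```

An empty `problems` list plus a count of 204 would reproduce the passed rows; any divergence points at the offending entry.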

### Molecule level (deferred)

| Claim | Status | Required |
|---|---|---|
| 007 live skill execution | untested | install via `npx`, run 1 capability + 1 composite + 1 playbook in a real agent session, log token + output evidence |

## Real findings worth surfacing

1. **Cap/Comp/Play taxonomy ≈ atom/molecule/compound.** Goose's
   internal classification (capabilities → composites → playbooks) is
   functionally identical to repo-evals' atom/molecule/compound layer
   model. We didn't invent the insight; we formalized it. This is
   worth surfacing in the meta-reflection on framework neutrality.

2. **README is meaningfully out of date.** "108 skills" is the
   headline, "204" is the reality. Not a false claim, but it
   under-sells the catalog and could send users to the npm package
   thinking the surface is half what it is.

3. **npm is one minor version ahead of repo.** A user reading the
   source on `main` (v1.0.1) sees something different from what
   `npx goose-skills install` ships (v1.1.0 on the registry). Not
   broken, but a maintainer / contributor will be confused.

4. **Pack contract is real, not marketing.** Both packs ship genuine
   `shared_files` (.env.example + requirements.txt + more), so
   "configure once, use whole pack" is structurally enforced, not
   just suggested.
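
Findings 2 and 3 are both drift problems, and the count half is easy to watch mechanically. A sketch, assuming the README headline matches a simple "N skills" pattern (the real wording may differ; the version-skew half would need a registry query and is omitted):

```python
import re


def count_drift(readme_text, manifest):
    """Compare the README's headline skill count against the manifest.

    The "N skills" regex is an assumption about the README's wording;
    adjust it to the actual phrasing before relying on the result.
    """
    match = re.search(r"(\d+)\s+skills", readme_text)
    claimed = int(match.group(1)) if match else None
    actual = len(manifest.get("skills", []))
    return claimed, actual, (claimed is not None and claimed != actual)
```

Against this repo it would flag (108, 204, True) — exactly the stale-docs finding above.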

## Why not higher

`usable` is the right ceiling because:

- No live skill execution evidence on this machine. The catalog could
  have 204 manifest entries and still ship low-signal SKILL.md content
  inside any one of them. Per-skill quality is the trust-determining
  variable, and we sampled only the manifest, not the prompt content.
- Skill quality is heterogeneous by definition (different authors,
  different review depth) — we'd need to sample, not assume.
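
The sampling caveat can be made concrete: a reproducible random draw over the manifest, so a second evaluator reviews the same SKILL.md files. The `skills`/`name` shape is an assumption carried over from the manifest structure described in claim 003:

```python
import random


def sample_skills(manifest, k=10, seed=2026):
    """Reproducibly sample skill names for manual SKILL.md review.

    `manifest["skills"][i]["name"]` is an assumed shape; adapt it to
    the real skills-index.json structure.
    """
    names = sorted(s["name"] for s in manifest.get("skills", []))
    rng = random.Random(seed)
    return sorted(rng.sample(names, min(k, len(names))))
```

Fixing the seed means the sample itself becomes part of the evidence trail rather than an unrepeatable choice.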

## Path to `reusable`

1. `npx gooseworks install --claude` in fresh Claude Code.
2. Pick 1 skill per category (suggested: `brand-voice-extractor` /
   `competitor-intel` / `competitor-monitoring-system`).
3. Run each on a representative input. Capture the agent's
   intermediate plan, final output, and token usage.
4. Log under `runs/<date>/run-live-execution/` with one business-notes
   file per skill.
5. Update claim-007 status. If all three pass with a useful artefact
   and the `lead-gen-devtools` pack also runs end-to-end, candidate
   for `reusable` (still not `recommendable` until 2nd evaluator).
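
Step 4's evidence layout can be scaffolded up front so each run lands in a predictable place. A sketch following the `runs/<date>/run-live-execution/` convention from the list; the per-skill filename and stub fields are assumptions:

```python
import json
from datetime import date
from pathlib import Path


def scaffold_run(skills, root="runs"):
    """Create the evidence layout for a live-execution run (step 4).

    One notes stub per skill, to be filled with the agent's plan,
    final output, and token usage after the run. Filenames beyond the
    runs/<date>/run-live-execution/ convention are assumptions.
    """
    run_dir = Path(root) / date.today().isoformat() / "run-live-execution"
    run_dir.mkdir(parents=True, exist_ok=True)
    for skill in skills:
        stub = {"skill": skill, "plan": None, "output": None,
                "tokens": {"in": None, "out": None}}
        (run_dir / f"{skill}.business-notes.json").write_text(
            json.dumps(stub, indent=2))
    return run_dir
```

Running it for the three suggested skills before the session starts leaves no ambiguity about where claim-007 evidence belongs.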

## Recommended

```yaml
current_bucket: usable
status: evaluated
```