repo·evals · 2026-04-13 · main@HEAD

content-toolkit

zinan92/content-toolkit

Pipeline stages:
1. Signal scanning
2. Content acquisition
3. Content understanding
4. Topic curation
5. Content production
6. Creative assembly
7. Distribution & feedback
8. Learning

Score bands: 🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100

Score: 🛠 67 / 100
  • 12 claims passed, no critical failures
  • README may claim a license but no LICENSE file exists
  • release_pipeline=1, recently_active=True
  • EN-only or ZH-only README
  • compound layer needs a logged scenario run


Routing flow:
Input: URL / file / text / dir / verb
  → Smart router (intent + shape)
  → one of: ctk-download, ctk-analyze, ctk-rewrite, ctk-videocut, ctk-publish, ctk-xiaohongshu (native ops)
  → Skill-specific output
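The "intent + shape" routing in the flow above can be sketched in a few lines. This is an illustration only, not the toolkit's actual router: the function name `suggest` is hypothetical, the verb list is taken from the `ctk-*` skill names, the two aliases are the ones named in claim-006, and the URL and `.mp4` rules mirror claim-004 and claim-005. How text and directory inputs are handled is not stated in this report, so the sketch returns `None` for them.

```python
from typing import Optional
from urllib.parse import urlparse

# Only the two alias pairs named in claim-006; the real table may be longer.
ALIASES = {"intelligence": "analyze", "xhs": "xiaohongshu"}

# Verbs derived from the ctk-* skill names in the routing flow.
VERBS = {"download", "analyze", "rewrite", "videocut", "publish", "xiaohongshu"}


def suggest(arg: str) -> Optional[str]:
    """Return the skill a bare argument would be routed to, or None if unknown."""
    verb = ALIASES.get(arg.lower(), arg.lower())
    if verb in VERBS:  # intent: an explicit verb or one of its aliases
        return verb
    parsed = urlparse(arg)
    if parsed.scheme in ("http", "https") and parsed.netloc:  # shape: URL
        return "download"
    if arg.lower().endswith(".mp4"):  # shape: video file
        return "videocut"
    return None


if __name__ == "__main__":
    for example in ("xhs", "https://example.com/v/1", "clip.mp4", "frobnicate"):
        print(example, "->", suggest(example))
```

Checking the verb before the input shape keeps an explicit command from being reinterpreted as a file name; that ordering is a design assumption of this sketch.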

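Install options:

| Method | Platform | Difficulty |
|--------|----------|------------|
| `npm install -g @zinan92/content-toolkit` | any (Node.js 18+, Python 3.11+ on demand) | easy |
| `git clone` + `npm link` | any | moderate |

🌐 External dependencies:

| Service | Used for | Notes |
|---------|----------|-------|
| GitHub (for downstream skill repos) | First-time skill auto-install (git clone) | Public repos; needs network on first use of each skill |
| Anthropic API (downstream) | ctk-analyze + ctk-rewrite use Claude | Required only for the analyze + rewrite verbs |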
Claims: 16 (12 passed, 1 partial, 3 failed)
Score contributions: +40, +27, +5, 0, -3, -2
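The six signed numbers above add up to the headline score of 67. The sketch below checks that arithmetic and looks up the matching band. It assumes the numbers are additive score components (their labels did not survive in this report) and uses the band boundaries from the score legend; `band_for` is a hypothetical helper name.

```python
# Signed score components as listed in the report (labels not available).
CONTRIBUTIONS = [40, 27, 5, 0, -3, -2]

# (low, high, marker) for each band in the score legend.
BANDS = [(0, 29, "🛑"), (30, 49, "⚠️"), (50, 79, "🛠"), (80, 100, "🏭")]


def band_for(score: int) -> str:
    """Return the legend marker whose inclusive range contains score."""
    for low, high, marker in BANDS:
        if low <= score <= high:
            return marker
    raise ValueError(f"score out of range: {score}")


total = sum(CONTRIBUTIONS)

if __name__ == "__main__":
    print(total, band_for(total))
```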


input_contract
output_contract
determinism
idempotence
no_skill_callouts
failure_mode_clarity

workflow_correctness
declared_call_graph
stop_conditions
handoff_points
atom_evidence
error_propagation
partial_failure_handling

goal_achievement
direction_judgment
quality_judgment
meaningful_autonomy
handoff_timing
observed_call_graph
failure_recovery

  • archetype: orchestrator
  • core_layer_tested: False
  • evidence: partial
  • recommended: usable
  • final: usable
  • ceiling 1: core user-facing layer untested → capped at 'usable'
  • ceiling 2: hybrid-repo rule: archetype 'orchestrator' requires end-to-end evaluation of the user-facing layer
  • ceiling 3: evidence_completeness='partial' (not portable) → capped at 'usable'
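One way to read the ceilings above is that the final bucket is the lowest of the recommended bucket and every cap that applies. The sketch below shows that rule under stated assumptions: the bucket order `usable` < `reusable` < `recommendable` is inferred from the "Path to" sections of the verdict, any buckets below `usable` are not named in this report and are left out, and `apply_ceilings` is a hypothetical name, not the verdict calculator's API.

```python
from typing import List

# Ascending order, inferred from the report; lower buckets are not named there.
BUCKETS = ["usable", "reusable", "recommendable"]


def apply_ceilings(recommended: str, ceilings: List[str]) -> str:
    """Return the lowest bucket among the recommendation and all ceilings."""
    return min([recommended, *ceilings], key=BUCKETS.index)


if __name__ == "__main__":
    # Ceilings 1 and 3 both cap at 'usable'; the recommendation is already 'usable'.
    print(apply_ceilings("usable", ["usable", "usable"]))
```

With this rule a single `usable` cap is enough to hold the verdict there, regardless of how strong the recommendation would otherwise be.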

| Claim | Description | Severity | Area | Status |
|-------|-------------|----------|------|--------|
| claim-001 | Unified CLI entry point routes to 7 capabilities | critical | orchestration | ● passed |
| claim-002 | Bare CLI shows Chinese help with all commands | critical | orchestration | ● passed |
| claim-003 | Auto-install: capabilities installed on first use | critical | orchestration | ● passed |
| claim-004 | Bare URL input → suggests content download | critical | smart-routing | ● passed |
| claim-005 | Bare .mp4 input → suggests videocut subcommand | critical | smart-routing | ● passed |
| claim-006 | Alias normalization: intelligence→analyze, xhs→xiaohongshu, etc. | high | smart-routing | ● passed |
| claim-007 | videocut transcribe produces transcript files | high | videocut | ● passed |
| claim-008 | videocut autocut produces cut.mp4 | high | videocut | ● passed |
| claim-009 | videocut subtitle produces subtitled video | high | videocut | ✕ failed |
| claim-010 | health command shows per-capability status | high | orchestration | ● passed |
| claim-011 | Unknown command shows helpful error | high | error-handling | ● passed |
| claim-012 | Error propagation: upstream errors are passed through | high | error-handling | ◐ partial |
| claim-013 | 7 capabilities badge claim | medium | meta | ● passed |
| claim-014 | CLI test suite passes | high | test-infra | ✕ failed |
| claim-015 | intelligence/analyze capability works | medium | intelligence | ✕ failed |
| claim-016 | Zero npm dependencies | medium | meta | ● passed |
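The headline summary ("12 claims passed, no critical failures") can be re-derived from the claim list above. The following is a minimal sketch, with the IDs, severities, and statuses transcribed from that list; the variable names are illustrative.

```python
from collections import Counter

# (claim id, severity, status), transcribed from the claim list above.
CLAIMS = [
    ("claim-001", "critical", "passed"),
    ("claim-002", "critical", "passed"),
    ("claim-003", "critical", "passed"),
    ("claim-004", "critical", "passed"),
    ("claim-005", "critical", "passed"),
    ("claim-006", "high", "passed"),
    ("claim-007", "high", "passed"),
    ("claim-008", "high", "passed"),
    ("claim-009", "high", "failed"),
    ("claim-010", "high", "passed"),
    ("claim-011", "high", "passed"),
    ("claim-012", "high", "partial"),
    ("claim-013", "medium", "passed"),
    ("claim-014", "high", "failed"),
    ("claim-015", "medium", "failed"),
    ("claim-016", "medium", "passed"),
]

# Count claims per status and collect any critical claim that did not pass.
status_counts = Counter(status for _, _, status in CLAIMS)
critical_failures = [
    claim_id
    for claim_id, severity, status in CLAIMS
    if severity == "critical" and status != "passed"
]

if __name__ == "__main__":
    print(dict(status_counts))
    print("critical failures:", critical_failures)
```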

run-smoke-2026-04-13

2026-04-13
0% tokens in ? / out ?

# Final Verdict

## Repo

- Name: zinan92/content-toolkit
- Date: 2026-04-13
- Archetype: orchestrator
- Final bucket: **usable**
- Confidence: medium

## Verdict Rationale

### Baseline: usable

Per verdict calculator rules:
- Critical claims **claim-001 through claim-006** all PASSED (routing, help, auto-install, smart hints)
- But coverage of critical downstream functionality is only partial: the test suite is 100% broken (claim-014),
  `videocut subtitle` fails silently (claim-009), and the intelligence capability is degraded (claim-015)
- Error propagation is inconsistent (claim-012: partial)

### Ceiling applied: none

The orchestrator archetype has no default ceiling. However, the broken test suite
and silent failures effectively self-cap the verdict at `usable`: a tool cannot be recommended
when automated verification is completely absent and some capabilities fail silently.

## Evaluation Dimensions (Orchestrator-Specific)

| Dimension | Rating | Notes |
|-----------|--------|-------|
| **Routing correctness** | ★★★★☆ | Excellent. All tested routes work. Aliases normalize correctly. Smart input detection is a nice touch. |
| **Error propagation** | ★★☆☆☆ | Inconsistent. `download` passes errors through, but `videocut subtitle` exits 0 with empty output. |
| **Partial failure handling** | ★★★☆☆ | Not tested deeply, but the case where auto-install leaves a capability degraded is handled well (health reports it). |
| **End-to-end happy path** | ★★★☆☆ | `transcribe` and `autocut` work; `subtitle` fails silently. The pipeline was not tested in this run, but TEST-REPORT.md says it passes. |
| **Per-area coverage** | ★★☆☆☆ | 7 capabilities claimed; only download and videocut (2 of 7 subcommands) verified working. Intelligence is degraded. publish and xiaohongshu are untested (they require auth and external services). |
| **Observability** | ★★★★☆ | `content health` is genuinely useful — shows per-capability status, git ref, known issues. |

## Score Summary

| Category | Passed | Failed | Partial | Total |
|----------|--------|--------|---------|-------|
| Critical | 6 | 0 | 0 | 6 |
| High | 4 | 2 | 1 | 7 |
| Medium | 2 | 1 | 0 | 3 |
| **Total** | **12** | **3** | **1** | **16** |

## What I Would Say In Plain English

**content-toolkit's orchestration layer is well-designed — the routing, help, aliases,
and smart input detection are genuinely good.** If you already know which commands work,
it's a useful tool.

**But it's not reliable enough to recommend.** The test suite is 100% broken (all 80+ tests
fail on import), some capabilities silently fail (subtitle exits 0 with empty output),
and the intelligence capability auto-installs into a degraded state. The repo's own
TEST-REPORT.md honestly documents a 7/20 pass rate from March 31 — and nothing has
been fixed since.

**The gap is not in design but in execution quality.** The architecture is sound, the
skill system is thoughtful, and the health reporting is better than most. What remains
is to fix the test suite, fix the silent failures, and fix the 13 known issues in TEST-REPORT.md.

## Path to `reusable`

1. **Fix test suite** — export functions from cli.js, prevent help side effect on import.
   Currently zero automated verification of routing logic.
2. **Fix silent failures** — videocut subtitle (and likely clip, cover, speed per TEST-REPORT.md)
   must either produce output or surface a clear error. Exit 0 + empty dir is unacceptable.
3. **Fix intelligence capability** — pyproject.toml module path so auto-install produces
   a working capability, not degraded.
4. **Address TEST-REPORT.md backlog** — at least the 4 MEDIUM bugs (BUG-3/4/5/6) and
   2 HIGH UX issues (UX-1/2).
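
Item 2 can also be guarded from the evaluation side. The sketch below treats a run as successful only if the command exits 0 and leaves at least one non-empty file in its output directory, which would flag the "exit 0 + empty dir" pattern. It is a hedged illustration: `run_and_verify` is a hypothetical helper, and the demo uses a stand-in Python one-liner rather than the real `videocut subtitle` command, whose exact arguments are not given in this report.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Tuple


def run_and_verify(cmd: List[str], out_dir: Path) -> Tuple[bool, str]:
    """Run cmd; succeed only on exit code 0 plus at least one non-empty output file."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return False, f"exit {result.returncode}: {result.stderr.strip()}"
    produced = [p for p in out_dir.rglob("*") if p.is_file() and p.stat().st_size > 0]
    if not produced:
        return False, "exit 0 but no non-empty output files (silent failure)"
    return True, f"{len(produced)} output file(s)"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a command that exits 0 without writing anything.
        print(run_and_verify([sys.executable, "-c", "pass"], Path(tmp)))
```

Checking file size as well as existence catches the case where a tool creates an output file but writes nothing to it.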

## Path to `recommendable`

Everything in `reusable` plus:

5. **Per-area claim maps** — each downstream capability (download, extract, rewrite,
   videocut, publish, xiaohongshu) gets its own eval under `areas/<slug>/`
6. **End-to-end workflow verification** — the douyin-to-xhs and pipeline presets
   tested with real content
7. **Consistent error propagation** — every downstream failure surfaces at the
   orchestrator boundary
8. **CI integration** — test suite runs on push, catches regressions

## Remaining Risks

- **Silent failure pattern may be systemic.** We tested only 3 of 7 videocut subcommands,
  and one of them already fails silently. TEST-REPORT.md documents similar issues in clip, cover, and speed.
- **No CI.** Regressions accumulate silently. The test suite broke and nobody noticed.
- **External capabilities (publish, xiaohongshu) are untested** — they require auth
  and external services, making them hard to evaluate without credentials.
- **intelligence capability has a packaging bug** in the upstream repo, but content-toolkit
  claims it as a capability. Users will encounter a broken experience.