repo·evals · 2026-05-04 · v1.11.0 (release page) / main@HEAD

RedBox

Jamailar/RedBox

🛠 58 / 100

Pipeline stages: 01 Signal scanning · 02 Content acquisition · 03 Content understanding · 04 Topic curation · 05 Content production · 06 Creative assembly · 07 Distribution & feedback · 08 Learning
📍 xiaohongshu

Score bands: 🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100 (this repo: 58)
  • 3 claims passed, no critical failures
  • MIT / Apache / etc., installable per deployment.install_methods
  • release_pipeline_score=3 + pushed in 90-day window
  • multilingual_readme=true
  • compound layer needs a logged scenario run


  • Browser plugin (XHS / YouTube / web capture)
  • Local KB (content + tags + search)
  • Wander (topic spark)
  • AI editor (manuscript)
  • Image / video generation
  • RedClaw (LLM-routed automation)
  • Scheduled XHS publish (real account session)

Install: download signed DMG / EXE / DEB · macOS / Windows / Linux × aarch64 + amd64 + x86 · difficulty: easy

| Dependency | Used for | Setup |
|---|---|---|
| OpenAI / Anthropic / Google (any compatible) | LLM for content generation + analysis | Configure endpoint + key + model in Settings; Vercel AI SDK v6 supports openai-compatible too |
| Xiaohongshu (real account) | RedClaw automation drives the user's session | Use the companion browser extension |
| Image / video generation provider (e.g. GPT-image-2) | Cover image + short-video generation in the creation page | Pay-per-generation; depends on which provider you point at |
Claims: 7 (2 · 1 · 4)
Score components: +40, +9, +12, 0, -3, 0
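
The six adjustments listed above appear to add up to the headline score, and the score then falls into one of the four bands in the legend. The following is a minimal sketch of that arithmetic, assuming the values are additive components (the page does not label them) and that the band ranges are inclusive; `BANDS`, `COMPONENTS`, and `band_for` are illustrative names, not part of the eval framework.

```python
# Score bands as shown in the dashboard legend: (low, high, marker).
BANDS = [
    (0, 29, "🛑"),
    (30, 49, "⚠️"),
    (50, 79, "🛠"),
    (80, 100, "🏭"),
]

# Adjustments as listed on the page; their labels are not shown there.
COMPONENTS = [40, 9, 12, 0, -3, 0]


def band_for(score: int) -> str:
    """Return the legend marker whose inclusive range contains `score`."""
    for low, high, marker in BANDS:
        if low <= score <= high:
            return marker
    raise ValueError(f"score out of range: {score}")


total = sum(COMPONENTS)
print(total, band_for(total))
```

Under these assumptions the sum lands in the 50–79 band, which matches the 🛠 marker shown next to the score.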

Claims passed: 3 / 7
  • claim-001: passed
  • claim-002: passed
  • claim-003: passed
  • claim-004: untested
  • claim-005: untested
  • claim-006: untested
  • claim-007: untested

input_contract
output_contract
determinism
idempotence
no_skill_callouts
failure_mode_clarity

workflow_correctness
declared_call_graph
stop_conditions
handoff_points
atom_evidence
error_propagation
partial_failure_handling

goal_achievement
direction_judgment
quality_judgment
meaningful_autonomy
handoff_timing
observed_call_graph
failure_recovery

  • core user-facing layer untested → capped at 'usable'
  • hybrid-repo rule: archetype 'orchestrator' requires end-to-end evaluation of the user-facing layer
  • evidence_completeness='partial' (not portable) → capped at 'usable'

  • only 3/5 critical claims covered

archetype: orchestrator · core_layer_tested? False · evidence: partial · recommended: usable · final: usable

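| Claim | Description | Priority | Area | Status |
|---|---|---|---|---|
| claim-001 | Cross-platform installers actually exist and their versions are consistent | critical | distribution | ● passed |
| claim-002 | Browser-extension manifest matches the capture scope declared in the README | critical | browser-capture | ◐ partial |
| claim-003 | Desktop app uses the Vercel AI SDK and supports a custom endpoint/key/model | critical | ai-providers | ● passed |
| claim-004 | End-to-end creation flow: capture → knowledge base → editor → images | critical | end-to-end | ○ untested |
| claim-005 | RedClaw automation can complete a task on its own within a single session | critical | redclaw-automation | ○ untested |
| claim-006 | Background scheduled tasks really do keep running | high | scheduling | ○ untested |
| claim-007 | Failure modes are user-friendly (missing API key / unreachable model / capture-site redesign) | high | error-propagation | ○ untested |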
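
The "3/5 critical claims covered" figure can be reproduced from the claim list above. This is a small sketch under one assumption: a claim counts as "covered" when its status is anything other than `untested`. The tuple layout and the `critical_coverage` helper are illustrative and do not reflect the framework's actual data model.

```python
from collections import Counter

# (claim id, priority, status) copied from the claim list above.
CLAIMS = [
    ("claim-001", "critical", "passed"),
    ("claim-002", "critical", "partial"),
    ("claim-003", "critical", "passed"),
    ("claim-004", "critical", "untested"),
    ("claim-005", "critical", "untested"),
    ("claim-006", "high", "untested"),
    ("claim-007", "high", "untested"),
]


def critical_coverage(claims):
    """Return (covered, total) for critical claims.

    Assumption: "covered" means any status other than "untested".
    """
    critical = [c for c in claims if c[1] == "critical"]
    covered = [c for c in critical if c[2] != "untested"]
    return len(covered), len(critical)


status_counts = Counter(status for _, _, status in CLAIMS)
covered, total = critical_coverage(CLAIMS)
print(f"{covered}/{total} critical claims covered")
```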


Run: run-static-checks · 2026-05-04

# RedBox — final verdict (2026-05-04)

## Repo

- **Name:** Jamailar/RedBox
- **Release evaluated:** v1.11.0 (browser-extension v1.9.7)
- **Archetype:** orchestrator
- **Layer:** **compound** — RedClaw automation console runs LLM-driven
  multi-step tasks; background scheduler keeps long-running work alive
- **Eval framework version:** repo-evals layer model v1 (cee2351)

## Bucket

**`usable`** — capped by the compound-layer ceiling rule.

The static layer is in good shape and the distribution / provider /
extension foundations all check out. But the user-facing value
proposition (creation flow, RedClaw automation, background scheduling,
failure-mode UX) is compound-level and has zero logged scenarios on
this evaluator's machine. Per `docs/LAYERS.md`, compound cannot exceed
`usable` without ≥1 logged scenario, and cannot exceed `reusable`
without ≥3.
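
The ceiling rule quoted above can be sketched as a small function. This is an illustration only, not the logic in `verdict_calculator.py`: the bucket ordering `usable < reusable < recommendable` is assumed from the names used in this verdict, and `compound_ceiling` and `apply_ceiling` are hypothetical helpers. Other ceilings, such as evidence completeness, would apply separately.

```python
# Assumed ordering, lowest to highest; only these three names appear in this verdict.
BUCKETS = ["usable", "reusable", "recommendable"]


def compound_ceiling(logged_scenarios: int) -> str:
    """Highest bucket the compound rule allows for a given number of logged scenarios."""
    if logged_scenarios < 0:
        raise ValueError("logged_scenarios must be non-negative")
    if logged_scenarios >= 3:
        return "recommendable"  # the compound rule no longer caps the bucket
    if logged_scenarios >= 1:
        return "reusable"
    return "usable"


def apply_ceiling(recommended: str, logged_scenarios: int) -> str:
    """Return the lower of the recommended bucket and the compound ceiling."""
    ceiling = compound_ceiling(logged_scenarios)
    return min(recommended, ceiling, key=BUCKETS.index)


# This evaluation has zero logged compound scenarios.
print(apply_ceiling("reusable", 0))
```

With zero logged scenarios the result is `usable` regardless of how strong the static layer looks, which is the situation described here.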

## What was evaluated

### Atom + molecule level (static, this run)

| Claim | Status | Notes |
|---|---|---|
| 001 distribution | passed | All 7 assets resolve, sizes 14–24 MB (small for Electron — heavy assets likely deferred per build script) |
| 002 capture coverage | passed_with_concerns | 9/10 platforms covered; **YouTube missing from `host_permissions` despite manifest description listing it** |
| 003 ai providers | passed | Vercel `ai` v6 + Anthropic + OpenAI + openai-compatible + Google; Electron 39.6.0 |

### Compound level (deferred)

| Claim | Status | Required for promotion |
|---|---|---|
| 004 end-to-end creation flow | untested | install + provider key + run a real article through workspace |
| 005 RedClaw single-session autonomy | untested | live RedClaw session with multi-step task |
| 006 background scheduling | untested | scheduled task that survives window close |
| 007 user-friendly failure modes | untested | deliberately broken inputs at three layers |

## Real bugs / mismatches surfaced

1. **YouTube capture promised but unimplemented.** The browser-extension
   manifest's own `description` field lists YouTube alongside the other
   capture sources, but `host_permissions` has no `*.youtube.com`
   entries. Without that permission the extension cannot inject its
   content scripts on YouTube pages, so a capture attempt there would
   fail silently. Either add the host permission or remove YouTube from
   the description.

2. **Desktop package version lags release tag (cosmetic).**
   `desktop/package.json` is at `1.9.0` while the release tag is
   `v1.11.0`. Not user-visible during install, but a sign the release
   pipeline is not bumping the package version automatically.
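
Both mismatches are the kind of thing a static check can catch. The sketch below is illustrative: `hosts_covering` and `version_matches_tag` are hypothetical helpers, and `SAMPLE_MANIFEST` is made-up data shaped like a browser-extension manifest, not the actual RedBox file. It inspects only `host_permissions`, as the first finding does.

```python
def hosts_covering(manifest: dict, domain: str) -> list[str]:
    """Return the host_permissions patterns that cover `domain` or its subdomains."""
    matches = []
    for pattern in manifest.get("host_permissions", []):
        if pattern == "<all_urls>":
            matches.append(pattern)
            continue
        # Match patterns look like scheme://host/path; isolate the host part.
        host = pattern.split("://", 1)[-1].split("/", 1)[0]
        if host in ("*", domain, f"*.{domain}") or host.endswith(f".{domain}"):
            matches.append(pattern)
    return matches


def version_matches_tag(package_version: str, tag: str) -> bool:
    """Compare a package.json version with a release tag such as 'v1.11.0'."""
    return package_version == tag.removeprefix("v")


# Made-up sample for illustration; not the real RedBox manifest.
SAMPLE_MANIFEST = {
    "description": "Capture from XHS, YouTube and web pages",
    "host_permissions": ["*://*.xiaohongshu.com/*"],
}

print(hosts_covering(SAMPLE_MANIFEST, "youtube.com"))
print(version_matches_tag("1.9.0", "v1.11.0"))
```

An empty list from the first call and `False` from the second would correspond to the two findings above.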

## Why not higher

`usable` is the right ceiling now because:

- The framework's compound rule explicitly caps at `usable` until ≥1
  scenario passes, and at `reusable` until ≥3 — same logic that caps
  hybrid-skill repos with untested LLM layers.
- Even ignoring layers, claim-002 has a real defect (YouTube capture)
  that should not be papered over by averaging.
- Single-evaluator, single-OS, single-day pass — even a clean compound
  scenario would not justify `recommendable` until repeated by other
  operators on other OSes.

## Path to `reusable`

Run the four compound experiments rendered on the dashboard
(`dashboard/repos/Jamailar--RedBox.html`). Each is a system prompt + a
"watch for" list. Log the result in
`repos/Jamailar--RedBox/runs/<date>/run-<scenario>/business-notes.md`
and update the matching claim's `status` in `claims/claim-map.yaml`.
Once three scenarios pass with full evidence, re-run `verdict_calculator.py`
and the bucket can move to `reusable`.
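
As a convenience, the log location for a scenario run can be assembled from the path template above. This is a sketch: `notes_path` is a hypothetical helper, and the ISO `YYYY-MM-DD` form for `<date>` is an assumption based on the dates shown on this page.

```python
from datetime import date
from pathlib import Path


def notes_path(repo_slug: str, run_date: date, scenario: str, root: Path = Path(".")) -> Path:
    """Build repos/<slug>/runs/<date>/run-<scenario>/business-notes.md under `root`."""
    return (
        root
        / "repos"
        / repo_slug
        / "runs"
        / run_date.isoformat()  # assumed date format: YYYY-MM-DD
        / f"run-{scenario}"
        / "business-notes.md"
    )


# "static-checks" mirrors the run name shown on the dashboard.
path = notes_path("Jamailar--RedBox", date(2026, 5, 4), "static-checks")
print(path.as_posix())
```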

## Recommended bucket

```yaml
current_bucket: usable
status: evaluated
```