repo-evals · zinan92/repo-evals · 2026-05-05 · main@b94031e
Score: 🛠 78 / 100
Pipeline phases: 01 Research · 02 Plan & design · 03 Code & review · 04 Package · 05 Maintain

Category bands: 🛑 Don't use 0–29 · ⚠️ Risky 30–49 · 🛠 Available 50–79 · 🏭 Production 80–100
Verdict: 🛠 Available · 78 / 100
  • 9 claims passed, no critical failures
  • MIT / Apache / etc., installable per deployment.install_methods
  • release_pipeline=1, recently_active=True
  • EN-only or ZH-only README
  • static-only eval; live e2e pending


Pipeline:
1. new-repo-eval.sh (scaffold dirs)
2. Author claim-map.yaml (human, 30-60 min)
3. Static checks (verify each claim)
4. verdict_calculator.py (0-100 + category)
5. render_verdict_html.py (bilingual dossier)
6. build_master_dashboard.py (corpus index)
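The claim-map.yaml authoring step is the only human-authored artifact in the pipeline. A hypothetical entry, sketched from the claim fields that appear in this dossier (id, text, priority, tag, status); the real schema may differ:

```yaml
# Hypothetical claim-map.yaml entry; field names are assumptions
# modeled on the fields shown in this dossier, not the real schema.
claims:
  - id: claim-008
    text: Repository has a LICENSE file
    priority: critical      # critical | high
    tag: licensing
    status: untested        # filled in by the static-checks step
```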

Install: git clone + python3 -m pip install pyyaml · runs anywhere (Python 3.11+) · difficulty: easy

Dependencies:
  • 📡 GitHub (target repos to evaluate): source of the repos being evaluated; needed only for cloning the target. Public repos only — no auth required for the eval flow itself.
  • Python ecosystem (PyYAML): YAML parsing for repo.yaml + claim-map.yaml. pip install pyyaml — single dependency.
Claims: 10 · 9 passed, 1 untested
Score breakdown (summing to 78): base +40 · static_eval +28 · maintainer +10 · ecosystem +0 · layer_bonus +0 · penalties 0
9 / 10: claim-001 through claim-009 passed; claim-010 untested
Check dimensions:
  • input_contract, output_contract, determinism, idempotence, no_skill_callouts, failure_mode_clarity
  • workflow_correctness, declared_call_graph, stop_conditions, handoff_points, atom_evidence, error_propagation, partial_failure_handling

  • ceiling 1 · core user-facing layer untested → capped at 'usable'
  • ceiling 2 · evidence_completeness='partial' (not portable) → capped at 'usable'
  • only 4/5 critical claims covered

archetype: pure-cli · core_layer_tested: False · evidence: partial · recommended: usable · final: usable
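The capping logic above can be sketched as follows. The function name, its parameters, and the four-rung bucket ladder are illustrative assumptions drawn from this dossier's own labels, not the framework's actual API; only the two capping rules come from the dossier.

```python
# Sketch of the verdict "ceiling" rules described above.
# apply_ceilings and the bucket ladder are illustrative assumptions.

def apply_ceilings(recommended: str, core_layer_tested: bool,
                   evidence_completeness: str) -> str:
    """Cap the recommended verdict when key evidence is missing."""
    ladder = ["unusable", "usable", "reusable", "recommendable"]
    cap = len(ladder) - 1
    if not core_layer_tested:              # ceiling 1: core layer untested
        cap = min(cap, ladder.index("usable"))
    if evidence_completeness != "full":    # ceiling 2: evidence not portable
        cap = min(cap, ladder.index("usable"))
    return ladder[min(ladder.index(recommended), cap)]

# This repo's own inputs: core layer untested, partial evidence.
print(apply_ceilings("usable", False, "partial"))  # -> usable
```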

| Claim | Description | Priority | Tag | Status |
|---|---|---|---|---|
| claim-001 | 0-100 score is auditable — every point traces to a named bucket | critical | score-auditability | ● passed |
| claim-002 | Dossier is bilingual EN/ZH with runtime toggle | critical | i18n | ● passed |
| claim-003 | 4-category collapse (Production / Available / Risky / Don't use) is tested at every boundary | critical | category-correctness | ● passed |
| claim-004 | Workflow diagram renders all 3 layouts (io / linear / tree) | high | visualization | ● passed |
| claim-005 | Similar-repos comparison computes peer scores live, not from stored sidecars | high | comparison-freshness | ● passed |
| claim-006 | A 30-repo corpus already exists for cross-comparison | high | corpus-size | ● passed |
| claim-007 | Test suite covers the new code paths and passes | high | test-discipline | ● passed |
| claim-008 | Repository has a LICENSE file | critical | licensing | ● passed |
| claim-009 | README reflects the current scoring + category model | high | docs-currency | ● passed |
| claim-010 | End-to-end: a new evaluator can clone, scaffold, fill, render | critical | end-to-end | ○ untested |


run-static-checks · 2026-05-05 · 90% · tokens in ? / out ?
  • claim-001 · passed
  • claim-002 · passed
  • claim-003 · passed
  • claim-004 · passed
  • claim-005 · passed
  • claim-006 · passed
  • claim-007 · passed
  • claim-008 · passed
  • claim-009 · passed

# zinan92/repo-evals — final verdict (2026-05-05)

## Repo

- **Name:** zinan92/repo-evals · **Stars:** 0 (private/personal)
- **Archetype:** pure-cli · **Layer:** **molecule**
- **License:** README claims MIT but no LICENSE file
- **Pushed:** 2026-05-05 (today, commit `b94031e`)

## What was evaluated

| Claim | Status | Notes |
|---|---|---|
| 001 score is auditable | passed | 6 named breakdown buckets, math tested |
| 002 bilingual EN/ZH | passed | All 30 dossiers + new SVG diagrams toggle correctly |
| 003 4-category mapping | passed | All 6 boundaries tested |
| 004 3-layout workflow diagrams | passed | io / linear / tree all rendered in 3 golden dossiers |
| 005 similar-repos live scores | passed | `_load_other_repo_for_compare` calls `compute_verdict` at render time |
| 006 30-repo corpus exists | passed | 30 dirs under repos/ |
| 007 tests pass | passed | 142/142 |
| 008 LICENSE | **failed** | README says MIT but no LICENSE file (HTTP 404 if browsed) |
| 009 README is current | **failed** | README still describes the deprecated 4-bucket model |
| 010 live e2e onboarding | untested | needs a fresh user + logged session |

## Real findings — meta-eval edition

1. **The framework caught its own LICENSE gap.** This is the same defect
   we flagged on `karpathy/autoresearch`: README has a `## License — MIT`
   section + a license-MIT badge but no LICENSE file. The score model
   correctly applies the −2 penalty (small repo tier, <1K stars). Same
   one-line fix.

2. **README is now stale.** The framework migrated to the 0-100 score
   + 4-category model on 2026-05-05. README still says:
     > The evaluation has two layers ... each evaluated repo ends up in one and only one reliability bucket:
     > unusable / usable / reusable / recommendable
   That's the deprecated 4-bucket model. Anyone reading README will
   form a mental model that doesn't match what the dossiers actually
   show.

3. **Score-model auditability is the real product.** Six named
   components — base 40, static_eval ±30, maintainer +15, ecosystem +15,
   layer_bonus, penalties — every dossier shows the breakdown. That's
   the API for methodology debate. A reader who disagrees can challenge
   any single number.
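A minimal sketch of that auditability, using this repo's own breakdown numbers. `compute_verdict` here is illustrative and not the framework's real signature; only the bucket names, values, and category bands come from the dossier.

```python
# Every point traces to a named bucket; the sum plus the category
# collapse is the whole scoring model. Values are this repo's own
# dossier numbers; the function is a sketch, not the actual code.

def compute_verdict(buckets: dict[str, int]) -> tuple[int, str]:
    """Sum named buckets, clamp to 0-100, collapse to a category."""
    score = max(0, min(100, sum(buckets.values())))
    if score >= 80:
        category = "Production"
    elif score >= 50:
        category = "Available"
    elif score >= 30:
        category = "Risky"
    else:
        category = "Don't use"
    return score, category

breakdown = {
    "base": 40,          # flat starting point
    "static_eval": 28,   # within the ±30 band
    "maintainer": 10,    # max +15
    "ecosystem": 0,      # max +15; 0 stars -> +0
    "layer_bonus": 0,    # molecule layer
    "penalties": 0,
}
print(compute_verdict(breakdown))  # -> (78, 'Available')
```

A reader who disputes any single bucket can recompute the total and the category in a few lines, which is exactly the property claim-001 asserts.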

4. **30-repo corpus is past cold-start.** When the similar-repos block
   was added, it had 30 candidates to draw from. For repo-evals itself
   though, the corpus has zero peers — none of our 30 are eval
   frameworks, so we honestly say so in `similar_repos_pending` rather
   than forcing a wrong comparison.

5. **Self-eval as discipline test.** A framework that's unwilling to
   flag its own gaps will always shade its own claims. This eval found
   2 real defects (LICENSE, stale README) and 1 honest gap (no live
   e2e logged). That's the pattern we want to keep.

## Why the score lands where it does

- 7/9 testable claims passed; 2 failed (LICENSE + stale README); 1 untested
- 0 stars → ecosystem +0
- Recently active +5 + eval_discipline_score=3 (max +5) +
  release_pipeline_score=1 (no release tags yet) → maintainer +10
- Molecule layer: +0
- LICENSE missing: −2 (small-repo tier)
- High claim failed (claim-009 README): −4
- Critical claim failed (claim-008 LICENSE): −10

Predicted score: roughly **45-50** — somewhere between ⚠️ Risky and
🛠 Available. The honest read is that repo-evals scores its own
status correctly: usable for its current author + immediate audience,
but not yet at "share with strangers" because of the LICENSE +
out-of-date README.
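Working those bullets backward is a useful sanity check (plain arithmetic, not framework code): the fixed components pin down how much static_eval must have contributed for the 45-50 prediction to hold.

```python
# Arithmetic check of the predicted 45-50 range, using the bullets above.
base, maintainer, ecosystem, layer_bonus = 40, 10, 0, 0
penalties = -2 - 4 - 10   # missing-LICENSE tier, failed high, failed critical
fixed = base + maintainer + ecosystem + layer_bonus + penalties
print(fixed)  # -> 34

# A predicted total of 45-50 therefore implies static_eval contributed
# roughly +11 to +16 of its ±30 band.
print(45 - fixed, 50 - fixed)  # -> 11 16
```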

## Path to higher score

1. **Add LICENSE file.** One commit. Recovers +10 from the failed
   critical claim → ~55-60 (mid 🛠 Available).
2. **Update README to describe 0-100 + 4-category model.** Recovers
   +4 + the failed-high deduction → ~60-65 (firm 🛠 Available).
3. **Run a logged live e2e** — fresh clone, scaffold a new repo,
   render its dossier, log the session. Moves claim-010 to passed
   and unblocks the path to 70+ (entering 🏭 Production-ready
   territory).
4. **Build the live-eval tooling** the framework keeps assuming
   exists. Right now no claim can earn the "evidence_completeness:
   full" tier without a manual logged session.

## Recommended

```yaml
status: evaluated
```