repo·evals
· 2026-05-13 ·master@HEAD

autoresearch

karpathy/autoresearch

🛠52 / 100
🎯

🗺
01Research调研02Plan & design计划与设计03Code & review开发与评审04Package打包发布05Maintain维护
📍
🧬

🛑
0–29
⚠️
30–49
🛠
50–79
🏭
80–100
52
🛠· 52 / 100
  • 1 critical claim(s) failed
  • README may claim a license but no LICENSE file exists
  • release_pipeline=1, recently_active=True
  • EN-only or ZH-only README
  • compound layer needs a logged scenario run

#1👤
#2🎯
#3🧭
#4

program.mdprogram.md(skill = agent(skill = agentinstructions)指令)Agent editsAgent 改train.pytrain.py(one file only)(只改一个文件)uv run train.pyuv run train.py(5-min budget)(5 分钟预算)Parse val_bpb解析 val_bpb(lower = better)(越低越好)Keep or revert保留或回滚+ journal entry+ 日志记录Next iteration下一轮迭代(~100 / night(H100 一晚on H100)约 100 次)

git clone + uv sync + uv run prepare.pyLinux + single NVIDIA H100moderate
Use community fork (Mac / MLX / Windows-RTX / AMD)macOS / Windows / AMDmoderate
  • 🌐
  • ⚠️
Anthropic Claude / OpenAI Codex / similar agent CLI
The agent that edits train.py + iterates
Each experiment iteration consumes tokens; overnight 100-experiment runs can be expensive — set a token budget
Single NVIDIA H100 GPU (or compatible via fork)
Runs the 5-minute training experiments
Cloud H100 ≈ $2-4/hr depending on provider; 12 experiments/hour, ~100 overnight
FineWeb / shakespeare-style training data (auto-downloaded)
Data prep via prepare.py
Public datasets; download is one-time ~2 minutes
· 8
6 1 1
+40
+12
0
0
0
0

7 / 8
passed claim-001

passed claim-002

passed claim-003

passed claim-004

failed claim-005

passed claim-006

untested claim-007

passed claim-008

input_contract
output_contract
determinism
idempotence
no_skill_callouts
failure_mode_clarity

workflow_correctness
declared_call_graph
stop_conditions
handoff_points
atom_evidence
error_propagation
partial_failure_handling

goal_achievement
direction_judgment
quality_judgment
meaningful_autonomy
handoff_timing
observed_call_graph
failure_recovery

  • core user-facing layer untested → capped at 'usable'
  • hybrid-repo rule: archetype 'hybrid-skill' requires end-to-end evaluation of the user-facing layer
  • evidence_completeness='partial' (not portable) → capped at 'usable'

  • critical claim claim-005 failed

archetype: hybrid-skillcore_layer_tested? Falseevidence: partialrecommended: unusablefinal: usable
ceiling 1 · core user-facing layer untested → capped at 'usable'
ceiling 2 · hybrid-repo rule: archetype 'hybrid-skill' requires end-to-end evaluation of the user-facing layer
ceiling 3 · evidence_completeness='partial' (not portable) → capped at 'usable'

claim-0013 个核心文件齐全且非占位criticalpipeline-shape● passed
claim-002pyproject.toml + uv.lock 真有依赖管理criticalinstall● passed
claim-003program.md 是真实的"agent 该如何工作"指令文档criticalagent-instructions● passed
claim-004train.py 包含完整的 GPT 模型 + 优化器 + 训练循环hightraining-completeness● passed
claim-005仓库有 LICENSE 文件criticallicensing✕ failed
claim-006README 列出的 4 个社区 fork 都真实存在highcommunity-coverage● passed
claim-007端到端:在 H100 上跑出过 1 个有效 baseline 实验criticalend-to-end○ untested
claim-008agent 不会修改它不该改的文件criticalsafety● passed

0%
0.00s
0

run-static-checks

2026-05-13
0% tokens in ? / out ?

run-static-checks

2026-05-05
0% tokens in ? / out ?

run-static-checks

2026-05-13
0% tokens in ? / out ?

run-static-checks

2026-05-05
0% tokens in ? / out ?
# karpathy/autoresearch — refreshed verdict (2026-05-13)

## Bucket

⚪ **usable** (manual override applied — matches 2026-05-05 verdict).

The calculator's raw output is 🔴 unusable, driven entirely by the
LICENSE-missing claim being marked critical+failed. Override is applied
because the LICENSE gap affects redistribution legality, not the repo's
runnability; on actual hardware the pipeline works. The override is
documented in `2026-05-13-verdict-input.yaml`.

Bucket stays at `usable` (not `reusable`) because the compound runtime
layer (claim-007) is still untested — that gap is real and would need a
live H100 run to close.

## Repo state

- **Name:** karpathy/autoresearch · **Stars:** ~79K · **Archetype:** hybrid-skill · **Layer:** compound
- **Upstream:** unchanged since prior eval — last commit `228791fb` on 2026-03-25
- **Refresh trigger:** user invoked `/repo-evals` — re-running because policy says new run overwrites old

## Claims (8 total)

| Claim | Priority | Status | Notes |
|---|---|---|---|
| 001 3-file pipeline shape | critical | ✅ passed | prepare 389 + train 630 + program 114 lines |
| 002 pyproject + uv.lock | critical | ✅ passed | Python 3.10+, pytorch-cu128, locked |
| 003 program.md is real | critical | ✅ passed | 5 sections (Setup / Experimentation / Output / Logging / Loop) |
| 004 train.py model+optim | high | ✅ passed | 25 model/optimizer signatures |
| 005 LICENSE file | critical | ❌ failed | README says MIT, no LICENSE at root (HTTP 404) |
| 006 4 community forks | high | ✅ passed | All HTTP 200 (Mac / MLX / Win-RTX / AMD) |
| 007 e2e H100 training | critical | ⏭ untested | needs H100 + GPU time — skipped, no test rig |
| 008 agent safety scope | critical | ✅ passed | program.md explicitly fences `prepare.py` as read-only |

## Calculator output (authoritative)

- **Recommended:** 🔴 unusable
- **Confidence:** high
- **Ceiling reasons:**
  - core user-facing layer untested → capped at `usable`
  - hybrid-skill requires end-to-end evaluation of the user-facing layer
  - `evidence_completeness=partial` → capped at `usable`
- **Blocking issue:** critical claim claim-005 (LICENSE) failed → drops below `usable` to `unusable`

## What this actually means

Two-line plain English:

1. The repo is real, well-shaped, and the static pieces are healthy — 6/8 claims pass on direct inspection of the code.
2. We can't bless it as "usable / reusable / recommendable" because (a) nobody on this machine has actually run the 5-minute training experiment on an H100 to confirm end-to-end, and (b) Karpathy says MIT in the README but didn't ship a LICENSE file, so legal status for forks is technically unclear.

## Real findings worth surfacing

1. **`program.md` is the single best published example of agent-safety
   scope I've seen.** It explicitly declares `prepare.py` read-only and
   names `evaluate_bpb` as the ground-truth metric. Most "AI does my
   research overnight" repos hand-wave this; this one fences it. Worth
   recommending as a template even if you don't use the rest.

2. **Missing LICENSE on a 79K-star Karpathy repo is striking.** README
   closes with `## License — MIT` but the LICENSE file is HTTP 404.
   License scanners / SBOM tools / risk-averse adopters will all flag.
   One-line upstream fix.

3. **Community fork ecology is healthy.** All 4 listed forks live
   (Mac / MLX / Win-RTX / AMD). Unusual for a single-author repo —
   suggests the audience forks actively rather than waiting upstream.

4. **Compound classification is honest.** The agent decides at runtime
   what to change, runs the 5-min experiment, parses `val_bpb`, decides
   keep-or-discard, iterates. Static eval can't validate that; only a
   live run can. This is why core_layer_tested=false.

## Path to a higher bucket

- Ship a `LICENSE` file upstream → claim-005 passes → bucket can move to `usable`
- Run one logged H100 baseline (`uv sync && uv run prepare.py && uv run train.py`) → claim-007 passes + `core_layer_tested=true` → bucket can move to `reusable`
- Run one adversarial agent-safety probe (tell agent to modify `prepare.py`, watch it refuse) → strengthens claim-008 from static to live