# karpathy/autoresearch

2026-05-13 · master@HEAD · 🛠 52 / 100

Layer map: ⚛ → ⚗ → 🧬

Score legend: 🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100
🛠 52 / 100

- ✗ 1 critical claim failed
- ⚠ README may claim a license but no LICENSE file exists
- ◐ release_pipeline=1, recently_active=True
- ⚪ EN-only or ZH-only README
- ⚪ compound layer needs a logged scenario run
| Setup path | Platform | Effort |
|---|---|---|
| git clone + uv sync + uv run prepare.py | Linux + single NVIDIA H100 | moderate |
| Use a community fork (Mac / MLX / Windows-RTX / AMD) | macOS / Windows / AMD | moderate |

| Resource | Role | Cost notes |
|---|---|---|
| Anthropic Claude / OpenAI Codex / similar agent CLI | The agent that edits train.py and iterates | Each experiment iteration consumes tokens; overnight 100-experiment runs can be expensive — set a token budget |
| Single NVIDIA H100 GPU (or compatible via fork) | Runs the 5-minute training experiments | Cloud H100 ≈ $2–4/hr depending on provider; 12 experiments/hour, ~100 overnight |
| FineWeb / shakespeare-style training data (auto-downloaded) | Data prep via prepare.py | Public datasets; download is one-time, ~2 minutes |
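The GPU-cost figures in the table above can be sanity-checked with quick arithmetic. These are the quoted estimates (12 experiments/hour, ~100 overnight, $2–4/hr), not measurements:

```python
# Back-of-envelope GPU cost for an overnight run, using the estimates quoted above.
experiments = 100          # ~100 experiments overnight
per_hour = 12              # 5-minute experiments -> 12/hour
hours = experiments / per_hour      # GPU-hours needed
low, high = 2 * hours, 4 * hours    # at $2-4/hr cloud H100 pricing
print(f"{hours:.1f} h, ${low:.0f}-${high:.0f}")  # 8.3 h, $17-$33
```

Token spend for the agent CLI is on top of this and depends entirely on the model and budget cap.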
Claims: 8 total · 6 passed · 1 failed · 1 untested

Score components (sum = 52): +40, +12, 0, 0, 0, 0
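A minimal sketch of how the component points and the score legend combine, assuming the band boundaries shown in the header legend (0–29, 30–49, 50–79, 80–100). This is illustrative, not the evaluator's actual code:

```python
# Map a 0-100 score to its legend band (boundaries assumed from the header legend).
def bucket(score: int) -> str:
    if score <= 29:
        return "🛑"
    if score <= 49:
        return "⚠️"
    if score <= 79:
        return "🛠"
    return "🏭"

components = [40, 12, 0, 0, 0, 0]   # the component points listed above
score = sum(components)
print(score, bucket(score))          # 52 🛠
```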
7 / 8

- claim-001 · passed
- claim-002 · passed
- claim-003 · passed
- claim-004 · passed
- claim-005 · failed
- claim-006 · passed
- claim-007 · untested
- claim-008 · passed
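The per-claim statuses above can be tallied programmatically; a small sketch:

```python
# Recompute the tallies from the per-claim statuses listed above.
statuses = {
    "claim-001": "passed", "claim-002": "passed", "claim-003": "passed",
    "claim-004": "passed", "claim-005": "failed", "claim-006": "passed",
    "claim-007": "untested", "claim-008": "passed",
}
tested = [s for s in statuses.values() if s != "untested"]
passed = sum(s == "passed" for s in statuses.values())
print(f"{len(tested)} / {len(statuses)} tested, {passed} passed")  # 7 / 8 tested, 6 passed
```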
| input_contract | |
|---|---|
| output_contract | |
| determinism | |
| idempotence | |
| no_skill_callouts | |
| failure_mode_clarity | |

| workflow_correctness | |
|---|---|
| declared_call_graph | |
| stop_conditions | |
| handoff_points | |
| atom_evidence | |
| error_propagation | |
| partial_failure_handling | |

| goal_achievement | |
|---|---|
| direction_judgment | |
| quality_judgment | |
| meaningful_autonomy | |
| handoff_timing | |
| observed_call_graph | |
| failure_recovery | |
- core user-facing layer untested → capped at 'usable'
- hybrid-repo rule: archetype 'hybrid-skill' requires end-to-end evaluation of the user-facing layer
- evidence_completeness='partial' (not portable) → capped at 'usable'
- critical claim claim-005 failed
Decision chain: archetype hybrid-skill → core_layer_tested? false → evidence: partial → recommended: unusable → final: usable
| Claim | Description | Priority | Scope | Status | Notes |
|---|---|---|---|---|---|
| claim-001 | All 3 core files present and not placeholders | critical | pipeline-shape | ● passed | |
| claim-002 | pyproject.toml + uv.lock provide real dependency management | critical | install | ● passed | |
| claim-003 | program.md is a genuine "how the agent should work" instruction document | critical | agent-instructions | ● passed | |
| claim-004 | train.py contains a complete GPT model + optimizer + training loop | high | training-completeness | ● passed | |
| claim-005 | The repo has a LICENSE file | critical | licensing | ✕ failed | |
| claim-006 | All 4 community forks listed in the README actually exist | high | community-coverage | ● passed | |
| claim-007 | End-to-end: at least one valid baseline experiment has been run on an H100 | critical | end-to-end | ○ untested | |
| claim-008 | The agent does not modify files it shouldn't | critical | safety | ● passed | |
Run history:

- run-static-checks · 2026-05-13 · 0% — tokens in ? / out ?
- run-static-checks · 2026-05-05 · 0% — tokens in ? / out ?
# karpathy/autoresearch — refreshed verdict (2026-05-13)

## Bucket

⚪ **usable** (manual override applied — matches the 2026-05-05 verdict).

The calculator's raw output is 🔴 unusable, driven entirely by the LICENSE-missing claim being marked critical+failed. The override is applied because the LICENSE gap affects redistribution legality, not the repo's runnability; on actual hardware the pipeline works. The override is documented in `2026-05-13-verdict-input.yaml`.

Bucket stays at `usable` (not `reusable`) because the compound runtime layer (claim-007) is still untested — that gap is real and would need a live H100 run to close.

## Repo state

- **Name:** karpathy/autoresearch · **Stars:** ~79K · **Archetype:** hybrid-skill · **Layer:** compound
- **Upstream:** unchanged since the prior eval — last commit `228791fb` on 2026-03-25
- **Refresh trigger:** user invoked `/repo-evals` — re-running because policy says a new run overwrites the old one

## Claims (8 total)

| Claim | Priority | Status | Notes |
|---|---|---|---|
| 001 3-file pipeline shape | critical | ✅ passed | prepare 389 + train 630 + program 114 lines |
| 002 pyproject + uv.lock | critical | ✅ passed | Python 3.10+, pytorch-cu128, locked |
| 003 program.md is real | critical | ✅ passed | 5 sections (Setup / Experimentation / Output / Logging / Loop) |
| 004 train.py model+optim | high | ✅ passed | 25 model/optimizer signatures |
| 005 LICENSE file | critical | ❌ failed | README says MIT, no LICENSE at root (HTTP 404) |
| 006 4 community forks | high | ✅ passed | All HTTP 200 (Mac / MLX / Win-RTX / AMD) |
| 007 e2e H100 training | critical | ⏭ untested | needs H100 + GPU time — skipped, no test rig |
| 008 agent safety scope | critical | ✅ passed | program.md explicitly fences `prepare.py` as read-only |

## Calculator output (authoritative)

- **Recommended:** 🔴 unusable
- **Confidence:** high
- **Ceiling reasons:**
  - core user-facing layer untested → capped at `usable`
  - hybrid-skill requires end-to-end evaluation of the user-facing layer
  - `evidence_completeness=partial` → capped at `usable`
- **Blocking issue:** critical claim claim-005 (LICENSE) failed → drops below `usable` to `unusable`

## What this actually means

Two lines of plain English:

1. The repo is real, well-shaped, and the static pieces are healthy — 6/8 claims pass on direct inspection of the code.
2. We can't bless it as "usable / reusable / recommendable" because (a) nobody on this machine has actually run the 5-minute training experiment on an H100 to confirm end-to-end, and (b) Karpathy says MIT in the README but didn't ship a LICENSE file, so the legal status of forks is technically unclear.

## Real findings worth surfacing

1. **`program.md` is the single best published example of agent-safety scope I've seen.** It explicitly declares `prepare.py` read-only and names `evaluate_bpb` as the ground-truth metric. Most "AI does my research overnight" repos hand-wave this; this one fences it. Worth recommending as a template even if you don't use the rest.
2. **A missing LICENSE on a 79K-star Karpathy repo is striking.** The README closes with `## License — MIT` but the LICENSE file is HTTP 404. License scanners, SBOM tools, and risk-averse adopters will all flag it. One-line upstream fix.
3. **The community fork ecology is healthy.** All 4 listed forks are live (Mac / MLX / Win-RTX / AMD). That's unusual for a single-author repo — it suggests the audience forks actively rather than waiting on upstream.
4. **The compound classification is honest.** The agent decides at runtime what to change, runs the 5-minute experiment, parses `val_bpb`, decides keep-or-discard, and iterates. Static eval can't validate that; only a live run can. This is why `core_layer_tested=false`.
## Path to a higher bucket

- Ship a `LICENSE` file upstream → claim-005 passes → bucket can move to `usable`
- Run one logged H100 baseline (`uv sync && uv run prepare.py && uv run train.py`) → claim-007 passes + `core_layer_tested=true` → bucket can move to `reusable`
- Run one adversarial agent-safety probe (tell the agent to modify `prepare.py`, watch it refuse) → strengthens claim-008 from static to live