· 2026-05-13 ·master@HEAD

autoresearch

karpathy/autoresearch

🛠52 / 100

✅

⚠

🎯

⚠

🗺

📍

⚛

→

⚗

→

🧬

🛑

0–29

⚠️

30–49

🛠

50–79

🏭

80–100

▼

🛠· 52 / 100

✗1 critical claim(s) failed
⚠README may claim a license but no LICENSE file exists
◐release_pipeline=1, recently_active=True
⚪EN-only or ZH-only README
⚪compound layer needs a logged scenario run

#1👤

#2🎯

#3🧭

#4⇄

anthropics/skill-creator

🏭 · 81molecule

zinan92/doc-driven-dev-workflow

🛠 · 59molecule


`git clone + uv sync + uv run prepare.py`	Linux + single NVIDIA H100	moderate
`Use community fork (Mac / MLX / Windows-RTX / AMD)`	macOS / Windows / AMD	moderate

🌐
⚠️

Anthropic Claude / OpenAI Codex / similar agent CLI

The agent that edits train.py + iterates

Each experiment iteration consumes tokens; overnight 100-experiment runs can be expensive — set a token budget

Single NVIDIA H100 GPU (or compatible via fork)

Runs the 5-minute training experiments

Cloud H100 ≈ $2-4/hr depending on provider; 12 experiments/hour, ~100 overnight

FineWeb / shakespeare-style training data (auto-downloaded)

Data prep via prepare.py

Public datasets; download is one-time ~2 minutes

· 8

6 1 1

	+40
	+12
	0
	0
	0
	0

7 / 8

passed claim-001

passed claim-002

passed claim-003

passed claim-004

failed claim-005

passed claim-006

untested claim-007

passed claim-008

`input_contract`
`output_contract`
`determinism`
`idempotence`
`no_skill_callouts`
`failure_mode_clarity`

`workflow_correctness`
`declared_call_graph`
`stop_conditions`
`handoff_points`
`atom_evidence`
`error_propagation`
`partial_failure_handling`

`goal_achievement`
`direction_judgment`
`quality_judgment`
`meaningful_autonomy`
`handoff_timing`
`observed_call_graph`
`failure_recovery`

core user-facing layer untested → capped at 'usable'
hybrid-repo rule: archetype 'hybrid-skill' requires end-to-end evaluation of the user-facing layer
evidence_completeness='partial' (not portable) → capped at 'usable'

critical claim claim-005 failed

archetype: hybrid-skill→core_layer_tested? False→evidence: partial→recommended: unusable→final: usable

ceiling 1 · core user-facing layer untested → capped at 'usable'

ceiling 2 · hybrid-repo rule: archetype 'hybrid-skill' requires end-to-end evaluation of the user-facing layer

ceiling 3 · evidence_completeness='partial' (not portable) → capped at 'usable'


claim-001	3 个核心文件齐全且非占位	critical	pipeline-shape	● passed
claim-002	pyproject.toml + uv.lock 真有依赖管理	critical	install	● passed
claim-003	program.md 是真实的"agent 该如何工作"指令文档	critical	agent-instructions	● passed
claim-004	train.py 包含完整的 GPT 模型 + 优化器 + 训练循环	high	training-completeness	● passed
claim-005	仓库有 LICENSE 文件	critical	licensing	✕ failed
claim-006	README 列出的 4 个社区 fork 都真实存在	high	community-coverage	● passed
claim-007	端到端：在 H100 上跑出过 1 个有效 baseline 实验	critical	end-to-end	○ untested
claim-008	agent 不会修改它不该改的文件	critical	safety	● passed

0.00s

run-static-checks

2026-05-13

0% — tokens in ? / out ?

run-static-checks

2026-05-05

0% — tokens in ? / out ?

run-static-checks

2026-05-13

0% — tokens in ? / out ?

run-static-checks

2026-05-05

0% — tokens in ? / out ?

# karpathy/autoresearch — refreshed verdict (2026-05-13)

## Bucket

⚪ **usable** (manual override applied — matches 2026-05-05 verdict).

The calculator's raw output is 🔴 unusable, driven entirely by the
LICENSE-missing claim being marked critical+failed. Override is applied
because the LICENSE gap affects redistribution legality, not the repo's
runnability; on actual hardware the pipeline works. The override is
documented in `2026-05-13-verdict-input.yaml`.

Bucket stays at `usable` (not `reusable`) because the compound runtime
layer (claim-007) is still untested — that gap is real and would need a
live H100 run to close.

## Repo state

- **Name:** karpathy/autoresearch · **Stars:** ~79K · **Archetype:** hybrid-skill · **Layer:** compound
- **Upstream:** unchanged since prior eval — last commit `228791fb` on 2026-03-25
- **Refresh trigger:** user invoked `/repo-evals` — re-running because policy says new run overwrites old

## Claims (8 total)

| Claim | Priority | Status | Notes |
|---|---|---|---|
| 001 3-file pipeline shape | critical | ✅ passed | prepare 389 + train 630 + program 114 lines |
| 002 pyproject + uv.lock | critical | ✅ passed | Python 3.10+, pytorch-cu128, locked |
| 003 program.md is real | critical | ✅ passed | 5 sections (Setup / Experimentation / Output / Logging / Loop) |
| 004 train.py model+optim | high | ✅ passed | 25 model/optimizer signatures |
| 005 LICENSE file | critical | ❌ failed | README says MIT, no LICENSE at root (HTTP 404) |
| 006 4 community forks | high | ✅ passed | All HTTP 200 (Mac / MLX / Win-RTX / AMD) |
| 007 e2e H100 training | critical | ⏭ untested | needs H100 + GPU time — skipped, no test rig |
| 008 agent safety scope | critical | ✅ passed | program.md explicitly fences `prepare.py` as read-only |

## Calculator output (authoritative)

- **Recommended:** 🔴 unusable
- **Confidence:** high
- **Ceiling reasons:**
  - core user-facing layer untested → capped at `usable`
  - hybrid-skill requires end-to-end evaluation of the user-facing layer
  - `evidence_completeness=partial` → capped at `usable`
- **Blocking issue:** critical claim claim-005 (LICENSE) failed → drops below `usable` to `unusable`

## What this actually means

Two-line plain English:

1. The repo is real, well-shaped, and the static pieces are healthy — 6/8 claims pass on direct inspection of the code.
2. We can't bless it as "usable / reusable / recommendable" because (a) nobody on this machine has actually run the 5-minute training experiment on an H100 to confirm end-to-end, and (b) Karpathy says MIT in the README but didn't ship a LICENSE file, so legal status for forks is technically unclear.

## Real findings worth surfacing

1. **`program.md` is the single best published example of agent-safety
   scope I've seen.** It explicitly declares `prepare.py` read-only and
   names `evaluate_bpb` as the ground-truth metric. Most "AI does my
   research overnight" repos hand-wave this; this one fences it. Worth
   recommending as a template even if you don't use the rest.

2. **Missing LICENSE on a 79K-star Karpathy repo is striking.** README
   closes with `## License — MIT` but the LICENSE file is HTTP 404.
   License scanners / SBOM tools / risk-averse adopters will all flag.
   One-line upstream fix.

3. **Community fork ecology is healthy.** All 4 listed forks live
   (Mac / MLX / Win-RTX / AMD). Unusual for a single-author repo —
   suggests the audience forks actively rather than waiting upstream.

4. **Compound classification is honest.** The agent decides at runtime
   what to change, runs the 5-min experiment, parses `val_bpb`, decides
   keep-or-discard, iterates. Static eval can't validate that; only a
   live run can. This is why core_layer_tested=false.

## Path to a higher bucket

- Ship a `LICENSE` file upstream → claim-005 passes → bucket can move to `usable`
- Run one logged H100 baseline (`uv sync && uv run prepare.py && uv run train.py`) → claim-007 passes + `core_layer_tested=true` → bucket can move to `reusable`
- Run one adversarial agent-safety probe (tell agent to modify `prepare.py`, watch it refuse) → strengthens claim-008 from static to live