repo·evals · 2026-05-04 · main@HEAD (v0.2.1, JCST'26 paper)

# OpenMAIC (THU-MAIC/OpenMAIC)

🛠 75 / 100
Score buckets: 🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100
  • 5 claims passed, no critical failures
  • MIT / Apache / etc., installable per deployment.install_methods
  • release_pipeline_score=2 + pushed in 90-day window
  • multilingual_readme=true
  • compound layer needs a logged scenario run


Classroom-generation flow:

1. **Topic / document** in (paper / chapter / brief).
2. **Outline strategy?** The LLM picks depth-first or breadth-first:
   - depth-first outline (1 hard concept, deep);
   - breadth-first outline (5 concepts, light).
3. **Whiteboard or simulation?** The LLM decides per topic:
   - whiteboard-agent (math / diagram);
   - simulation-agent (interactive process).
4. **teacher-agent + peer-agent** deliver the lesson (with TTS streaming).
5. **Quiz or continue?** The LLM reads engagement; quiz-agent runs a checkpoint when warranted.
6. **Classroom delivered** (you actually learned).
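That branching can be sketched as plain TypeScript. This is a hypothetical model of the decision points only; every type and function name below is an assumption, not code from the OpenMAIC repo (which uses LangGraph):

```typescript
// Hypothetical model of the classroom-generation routing.
// None of these names come from the OpenMAIC codebase.
type OutlineStrategy = "depth-first" | "breadth-first";
type LessonAgent = "whiteboard-agent" | "simulation-agent";

interface ClassroomPlan {
  strategy: OutlineStrategy;
  conceptCount: number; // 1 deep concept vs 5 light ones
  agent: LessonAgent;
  quizCheckpoint: boolean;
}

// The three callbacks stand in for the LLM decisions at each branch.
function planClassroom(
  pickStrategy: () => OutlineStrategy,
  pickAgent: (concept: string) => LessonAgent,
  engagementLow: () => boolean,
): ClassroomPlan {
  const strategy = pickStrategy();
  return {
    strategy,
    conceptCount: strategy === "depth-first" ? 1 : 5,
    agent: pickAgent("current concept"),
    quizCheckpoint: engagementLow(), // quiz-agent runs as a checkpoint
  };
}
```

The point of the sketch is that every arrow in the diagram is an LLM judgment call, which is exactly why the compound layer cannot be validated statically.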

| Deploy method | Target | Difficulty |
|---|---|---|
| Vercel one-click deploy | Vercel | easy |
| `docker compose up` | any (Docker) | moderate |
| Hosted demo at open.maic.chat | any browser | easy |
| Dependency | Role | Notes |
|---|---|---|
| OpenAI / Anthropic / Google / DeepSeek / Grok | LLM for classroom generation | Per-classroom token cost can be substantial — pick a model and lock spend before opening to non-tech users |
| OpenAI / Azure / GLM / Qwen / MiniMax TTS | Voice synthesis for AI teachers | Optional — disable TTS for text-only mode; self-hosted VoxCPM2 is free |
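The claims table verified a KEY + BASE_URL + MODELS entry point per provider; in a `.env` that pattern might look like the following. The variable names and values are guesses for illustration, not the repo's actual keys:

```shell
# Hypothetical .env sketch — actual variable names may differ.
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODELS=gpt-4o-mini

DEEPSEEK_API_KEY=...
DEEPSEEK_BASE_URL=https://api.deepseek.com
DEEPSEEK_MODELS=deepseek-chat
```

Failover between TTS providers then reduces to pointing the TTS variables at a different backend.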
Claims: 7 total (5 passed, 2 untested). Score deltas: +40 +14 +15 +9 -3 +0 (sum 75).

5 / 7 claims passed (claim-001 through claim-005); claim-006 and claim-007 untested.

Rubric dimensions by layer:

- **Atom:** input_contract, output_contract, determinism, idempotence, no_skill_callouts, failure_mode_clarity
- **Molecule:** workflow_correctness, declared_call_graph, stop_conditions, handoff_points, atom_evidence, error_propagation, partial_failure_handling
- **Compound:** goal_achievement, direction_judgment, quality_judgment, meaningful_autonomy, handoff_timing, observed_call_graph, failure_recovery

  • core user-facing layer untested → capped at 'usable'
  • hybrid-repo rule: archetype 'orchestrator' requires end-to-end evaluation of the user-facing layer
  • evidence_completeness='partial' (not portable) → capped at 'usable'

  • only 2/3 critical claims covered

archetype: orchestrator · core_layer_tested: false · evidence: partial · recommended: usable · final: usable
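The ceiling rules read as: compute a recommended bucket, then apply each applicable cap and never raise. A minimal sketch of that logic, assuming the bucket ordering (only `usable` and `reusable` are named in this report; the other bucket names and all identifiers are assumptions, not the real verdict_calculator):

```typescript
// Hypothetical sketch of ceiling capping — not the actual verdict_calculator.
type Bucket = "unusable" | "risky" | "usable" | "reusable";
const ORDER: Bucket[] = ["unusable", "risky", "usable", "reusable"];

interface EvalContext {
  coreLayerTested: boolean;
  evidence: "partial" | "complete";
}

interface Ceiling {
  reason: string;
  cap: Bucket;
  applies: (ctx: EvalContext) => boolean;
}

const CEILINGS: Ceiling[] = [
  {
    reason: "core user-facing layer untested",
    cap: "usable",
    applies: (ctx) => !ctx.coreLayerTested,
  },
  {
    reason: "evidence_completeness='partial' (not portable)",
    cap: "usable",
    applies: (ctx) => ctx.evidence === "partial",
  },
];

function finalBucket(recommended: Bucket, ctx: EvalContext): Bucket {
  let result = recommended;
  for (const c of CEILINGS) {
    // A ceiling can only lower the bucket, never raise it.
    if (c.applies(ctx) && ORDER.indexOf(result) > ORDER.indexOf(c.cap)) {
      result = c.cap;
    }
  }
  return result;
}
```

With `coreLayerTested: false` and `evidence: "partial"`, even a `reusable` recommendation caps to `usable`, which matches the verdict chain above.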

| Claim | Description | Priority | Area | Status |
|---|---|---|---|---|
| claim-001 | Next.js + React 19 + LangGraph 1.1 stack is real and consistent | critical | tech-stack | ● passed |
| claim-002 | All 5 LLM providers have real env-var entry points | critical | ai-providers | ● passed |
| claim-003 | All 5 TTS providers have real env-var entry points | high | tts-providers | ● passed |
| claim-004 | Ships its own eval harness (not just talk; real self-tests) | high | testing-discipline | ● passed |
| claim-005 | OpenClaw skill actually exists (README integration section is not marketing copy) | high | openclaw-integration | ● passed |
| claim-006 | End-to-end happy path: one sentence → a real classroom | critical | end-to-end | ○ untested |
| claim-007 | AGPL-3.0 + multi-provider deployment cost curve disclosed | high | economics | ○ untested |

Run log: run-static-checks · 2026-05-04 · 0.00s · 0% tokens (in ?, out ?)

# OpenMAIC — final verdict (2026-05-04)

## Repo

- **Name:** THU-MAIC/OpenMAIC
- **Branch evaluated:** main@HEAD (v0.2.1, JCST'26 paper)
- **Archetype:** orchestrator
- **Layer:** **compound** — LangGraph multi-agent classroom
  generation
- **Eval framework:** repo-evals layer model v1 (f9ed1e9)

## Bucket

**`usable`** — strong static layer with rare positive signals
(in-repo eval harness, well-disclosed multi-provider env, clean
OpenClaw integration). The compound rule caps the bucket at `usable`
until at least one live classroom generation is logged.

## What was evaluated

### Atom + molecule level (static, this run)

| Claim | Status | Notes |
|---|---|---|
| 001 tech stack | passed | next 16.1.2 / react 19.2.3 / langgraph ^1.1.1 / tailwind ^4 — matches README badges |
| 002 5 LLM providers | passed | OpenAI/Anthropic/Google/DeepSeek/Grok all with KEY+BASE_URL+MODELS |
| 003 5 TTS providers | passed | OpenAI/Azure/GLM/Qwen/MiniMax all with KEY+BASE_URL; MiniMax has default endpoint |
| 004 eval harness | passed | 2 named eval scripts (eval:whiteboard + eval:outline-language) reference real tsx runners |
| 005 OpenClaw skill | passed | skills/openmaic/SKILL.md (102 lines) with user-invocable, confirmation-heavy SOP |

### Compound level (deferred)

| Claim | Status | Required |
|---|---|---|
| 006 live classroom generation | untested | open.maic.chat or self-hosted; verify slides + quiz + sim + whiteboard + TTS |
| 007 cost transparency | untested | README to add per-classroom token + TTS cost estimate |

## Real findings worth surfacing

1. **In-repo eval harness is rare and disciplined.** Most "AI demo"
   repos don't ship `eval/`. OpenMAIC has two named evals
   (whiteboard-layout, outline-language) with their own runners and
   a `shared/` for common code. That's a strong testing-intent
   signal.

2. **OpenClaw SOP is safety-conscious.** The skill explicitly says
   "Run one phase at a time and ask for confirmation before each
   state-changing step". This is the right posture for a multi-step
   AI orchestrator that might write files / clone repos / start
   services on the user's machine.

3. **TTS surface is unusually broad.** 5 commercial providers + a
   self-hosted VoxCPM2 (added in v0.2.1) means the classroom doesn't
   degrade silently if one provider has issues — the operator can
   fail over.

4. **Active development cadence.** 4 minor releases in the 6 weeks
   leading up to eval (v0.1.0 through v0.2.1). Healthy for an
   academic-affiliated open-source project.
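For finding 1, the harness wiring is presumably plain npm scripts. A guessed sketch of the relevant `package.json` fragment — only the two script names come from the eval; the paths and runner layout are assumptions:

```json
{
  "scripts": {
    "eval:whiteboard": "tsx eval/whiteboard-layout/run.ts",
    "eval:outline-language": "tsx eval/outline-language/run.ts"
  }
}
```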

## Why not higher

`usable` because:

- No live classroom generation logged on this evaluator's machine.
  Compound layer's user value is the multi-agent dance — static
  evidence cannot validate that the agents actually teach
  meaningfully.
- Cost transparency is genuinely missing; non-technical users would
  benefit from a line like "A 30-min classroom on a typical topic
  costs roughly $X with default config."
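That cost line is simple arithmetic once per-classroom volumes are known. A sketch with made-up numbers — every figure and identifier below is an assumption for illustration, not a measurement of OpenMAIC:

```typescript
// Hypothetical per-classroom cost estimate — every number here is assumed.
interface CostModel {
  llmTokensIn: number;       // prompt tokens per classroom
  llmTokensOut: number;      // completion tokens per classroom
  pricePerMTokIn: number;    // USD per 1M input tokens
  pricePerMTokOut: number;   // USD per 1M output tokens
  ttsChars: number;          // characters synthesized
  pricePerMTtsChars: number; // USD per 1M TTS characters
}

function classroomCostUSD(m: CostModel): number {
  const llm =
    (m.llmTokensIn / 1e6) * m.pricePerMTokIn +
    (m.llmTokensOut / 1e6) * m.pricePerMTokOut;
  const tts = (m.ttsChars / 1e6) * m.pricePerMTtsChars;
  return llm + tts;
}

// e.g. a 30-min classroom with assumed volumes and list prices:
const estimate = classroomCostUSD({
  llmTokensIn: 200_000,
  llmTokensOut: 60_000,
  pricePerMTokIn: 2.5,
  pricePerMTokOut: 10,
  ttsChars: 40_000,
  pricePerMTtsChars: 15,
});
// 0.5 + 0.6 + 0.6 ≈ 1.7 USD per classroom under these assumptions
```

Publishing even a rough table like this for each supported provider would close claim-007.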

## Path to `reusable`

1. Run a live classroom on open.maic.chat with a real LLM key.
2. Self-host a fork; verify Vercel one-click deploy works.
3. Try one PDF-upload classroom; verify the OpenClaw skill SOP
   end-to-end.
4. Trigger an LLM-provider failure (revoked key) and verify the
   classroom degrades gracefully.
5. Update claim-006 → `passed`. If the README later adds a cost
   estimate, claim-007 → `passed`. Re-run verdict_calculator.

## Recommended

```yaml
current_bucket: usable
status: evaluated
```