repo·evals · 2026-04-13 · main@HEAD

wewrite

oaker-io/wewrite

🛠 66 / 100

01 Signal scanning
02 Content acquisition
03 Content understanding
04 Topic curation
05 Content production
06 Creative assembly
07 Distribution & feedback
08 Learning

🛑 0–29 · ⚠️ 30–49 · 🛠 50–79 · 🏭 80–100

🛠 · 66 / 100
  • 10 claims passed, no critical failures
  • MIT / Apache / etc., installable per deployment.install_methods
  • release_pipeline=1, recently_active=True
  • EN-only or ZH-only README
  • static-only eval; live e2e pending


"写一篇\n公众号文章""写一篇\n公众号文章"Hotspot scrape热搜抓取(Weibo + Toutiao(微博 + 头条+ Baidu)+ 百度)Topic scoring选题打分+ history dedup+ 历史去重Pick 1 of 77 选 1 框架frameworks ++ 素材采集material scrapeWrite w/ persona按人格 + 范文风格+ exemplar style写作 + humanness+ humanness check自检SEO + 9-providerSEO + 9 providerimage gen生图(封面 + 内文)(cover + inline)16 themes +16 主题排版 +WeChat fixes +公众号兼容修复 +draft-box push推送草稿箱

Install: `git clone` + `pip install -r requirements.txt` · Platform: any (Python 3.11+) · Difficulty: easy
| Dependency | Purpose | Cost / access |
|------------|---------|---------------|
| WeChat Official Account API | Push articles to draft box; fetch read-stats | Free; needs verified WeChat Official Account with appid/secret |
| Image-gen providers (9 supported) | Cover + inline image generation | DashScope/Doubao ~¥0.1/img; OpenAI/Gemini priced higher; auto-fallback chain handles outages |
| Hotspot sources (Weibo / Toutiao / Baidu) | Live trending topic scrape | Public endpoints; rate-limited but no signup |
| SEO sources (Baidu / 360) | Search suggestions for keyword scoring | Public endpoints |
Claims: 13 — 10 passed · 1 failed · 2 untested
Score breakdown: +40 · +18 · +5 · +3 · 0 · 0 (sums to 66)

11 / 13 tested

- passed: claim-001 through claim-010
- failed: claim-011
- untested: claim-101, claim-102

Contract checks: input_contract · output_contract · determinism · idempotence · no_skill_callouts · failure_mode_clarity

Workflow checks: workflow_correctness · declared_call_graph · stop_conditions · handoff_points · atom_evidence · error_propagation · partial_failure_handling


  • only 4/6 critical claims covered

archetype: hybrid-skill · core_layer_tested: False · evidence: partial · recommended: usable · final: usable
ceiling 1 · core user-facing layer untested → capped at 'usable'
ceiling 2 · hybrid-repo rule: archetype 'hybrid-skill' requires end-to-end evaluation of the user-facing layer
ceiling 3 · evidence_completeness='partial' (not portable) → capped at 'usable'

| Claim | Description | Priority | Area | Status |
|-------|-------------|----------|------|--------|
| claim-001 | pip install succeeds from requirements.txt | critical | support-install | ● passed |
| claim-002 | 6 CLI commands all respond to --help | critical | support-cli | ● passed |
| claim-003 | Markdown→WeChat HTML conversion works | critical | support-converter | ● passed |
| claim-004 | Hotspot fetching returns live data from 3 sources | critical | support-hotspots | ● passed |
| claim-005 | 16 themes exist with full YAML config + dark mode | high | support-themes | ● passed |
| claim-006 | 9 image generation providers implemented | high | support-image-gen | ● passed |
| claim-007 | 5 writing personas with rich YAML config | high | support-personas | ● passed |
| claim-008 | SEO keyword scoring works with live data | medium | support-seo | ● passed |
| claim-009 | Humanness scoring provides multi-tier analysis | medium | support-quality | ● passed |
| claim-010 | Evals exist for 3 scenarios | medium | support-quality | ● passed |
| claim-011 | Unit test suite exists | high | support-testing | ✕ failed |
| claim-101 | Full 8-step article generation workflow | critical | core-llm | ○ untested |
| claim-102 | Anti-AI detection quality measures | critical | core-llm | ○ untested |

run-smoke · 2026-04-13 · 0% · 0.00s · tokens in ? / out ?

# Final Verdict

## Repo

- Name: oaker-io/wewrite
- Date: 2026-04-13
- Archetype: hybrid-skill
- Final bucket: **usable**
- Confidence: medium

## Why This Bucket

- **Core outcome**: Support layer is impressive — all 6 CLI commands work, converter produces real WeChat HTML, hotspot fetching returns live data, 16 themes + 9 image providers + 5 personas all verified. But the **core LLM workflow (8-step article generation) is untested** — it requires a full Claude Code session with WeChat API credentials.
- **Scenario breadth**: Only tested support layer (deterministic code). Core layer (LLM-driven writing) untested. For a hybrid-skill, this triggers the **hybrid cap**: core layer untested → cannot exceed `usable`.
- **Repeatability**: Converter, hotspots, and CLI commands all work consistently in repeated runs. LLM layer repeatability unknown.
- **Failure transparency**: CLI tools handle missing inputs gracefully. Error messages are actionable.
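The converter finding above hinges on a WeChat constraint worth making concrete: the WeChat editor strips `<style>` blocks and class attributes, so every rule must be inlined. A minimal illustration of that technique (this is NOT wewrite's converter, just a toy for one element type):

```python
# Why Markdown->WeChat conversion must inline CSS: WeChat's editor strips
# <style> blocks and class attributes, so styling survives only as inline
# style="..." attributes. Toy example for paragraphs only; the theme rule
# below is invented, not one of wewrite's 16 themes.

import re

THEME = {"p": "margin:0 0 1em;line-height:1.75;color:#333;"}

def md_paragraphs_to_wechat_html(md: str) -> str:
    """Convert blank-line-separated paragraphs to <p> tags with inline CSS."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", md) if b.strip()]
    return "\n".join(f'<p style="{THEME["p"]}">{b}</p>' for b in blocks)

html = md_paragraphs_to_wechat_html("first para\n\nsecond para")
```

A real converter also has to handle headings, images, code blocks, footnoted links, and dark-mode attributes, which is exactly why regressions there are worth guarding against.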

## Hybrid-Skill Ceiling Applied

Per hybrid-skill archetype rules: the **core user-facing layer (LLM-driven article generation)** was not tested. The support layer (converter, hotspots, themes, personas, image providers) all pass. But without core layer evidence, verdict is **capped at `usable`**.
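The cap logic can be written down explicitly. The bucket names `usable`, `reusable`, and `recommendable` come from this report; the lower bucket names and the exact rule encoding are my assumptions, not the eval tool's code:

```python
# Sketch of the hybrid-skill verdict cap. Bucket ordering and the two rules
# mirror this report; names below 'usable' are placeholders I invented.

BUCKETS = ["broken", "risky", "usable", "reusable", "recommendable"]  # low -> high

def apply_ceilings(recommended: str, core_layer_tested: bool, evidence: str) -> str:
    ceiling = len(BUCKETS) - 1
    if not core_layer_tested:              # hybrid-repo rule: core untested
        ceiling = min(ceiling, BUCKETS.index("usable"))
    if evidence == "partial":              # evidence not portable
        ceiling = min(ceiling, BUCKETS.index("usable"))
    return BUCKETS[min(BUCKETS.index(recommended), ceiling)]

final = apply_ceilings("usable", core_layer_tested=False, evidence="partial")
```

For this repo both ceilings bind independently, so even a higher recommended bucket would still land on `usable`.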

## Score Summary

| Category | Passed | Failed | Partial | Untested | Total |
|----------|--------|--------|---------|----------|-------|
| Critical (support) | 4 | 0 | 0 | 0 | 4 |
| Critical (core) | 0 | 0 | 0 | 2 | 2 |
| High | 3 | 1 | 0 | 0 | 4 |
| Medium | 3 | 0 | 0 | 0 | 3 |
| **Total** | **10** | **1** | **0** | **2** | **13** |

## What I Would Say In Plain English

**wewrite's support layer is genuinely impressive for a skill repo.** The converter produces real WeChat-compatible HTML (inline CSS, footnoted links, dark mode attributes). Hotspot fetching returns live trends from 3 Chinese platforms. 16 themes, 9 image providers, 5 personas — all verified to exist with correct structure. The eval system (3 structured scenarios) shows maturity.

**But it's a writing skill that I haven't seen write.** The entire 8-step article generation pipeline is LLM-driven and requires WeChat API credentials to test end-to-end. The support layer works, but the core promise — "一句话搞定公众号" ("one sentence and your Official Account post is done") — is unverified.

**The one real gap: zero unit tests.** 2,232 lines of Python toolkit code with no pytest tests at all. The eval specs test agent behavior, not code correctness. A converter regression would go undetected.

## Path to `reusable`

1. **Test the core LLM workflow** — run a full agent session, generate an article, score it against the quality contract and humanness_score.py
2. **Add unit tests** — converter.py (548 lines) especially needs test coverage for WeChat HTML edge cases
3. **Verify at least 2 image providers** with real API keys
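Item 2 asks for converter unit tests; a hedged pytest-style sketch of what the first few would look like. The import path and `convert` signature are assumptions (a runnable stand-in is used here so the sketch executes), so adjust to wewrite's real module:

```python
# Sketch of the converter tests item 2 calls for. In the real repo this would
# be:  from wewrite.converter import convert   (path/name are my guess).
# A trivial stand-in is defined here so the sketch runs on its own.

def convert(md: str) -> str:
    return "<p>" + md.replace("\n\n", "</p><p>") + "</p>"

def test_paragraphs_become_p_tags():
    html = convert("a\n\nb")
    assert html.count("<p>") == 2

def test_no_external_stylesheets():
    # WeChat strips <link>/<style>, so converted HTML must never rely on them
    assert "<link" not in convert("hello")
    assert "<style" not in convert("hello")

test_paragraphs_become_p_tags()
test_no_external_stylesheets()
```

Run under pytest these become automatic regression guards; wiring them into CI is what item 7 under `recommendable` asks for.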

## Path to `recommendable`

Everything in `reusable` plus:
4. **Multiple article generation runs** showing consistency across personas and frameworks
5. **Anti-slop verification** — generated articles scored against banned phrase list
6. **Publish flow verification** — draft-to-WeChat pipeline tested with real credentials
7. **CI for converter tests** — prevent WeChat HTML regressions
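The anti-slop check in item 5 amounts to scoring a draft against a banned phrase list. A toy version, where the phrase list and zero-tolerance threshold are illustrative rather than wewrite's actual quality contract:

```python
# Toy anti-slop check in the spirit of item 5. The banned phrases and the
# threshold are illustrative; wewrite's real list and scoring rules differ.

BANNED = ["delve into", "in today's fast-paced world", "game-changer"]

def slop_hits(text: str) -> list[str]:
    """Return every banned phrase found in the text (case-insensitive)."""
    lower = text.lower()
    return [p for p in BANNED if p in lower]

def passes_anti_slop(text: str, max_hits: int = 0) -> bool:
    return len(slop_hits(text)) <= max_hits

clean = passes_anti_slop("A plain, concrete opening paragraph.")
flagged = passes_anti_slop("Let's delve into this game-changer.")
```

Running every generated article through a check like this, and recording the hit list, would turn "anti-AI detection quality" from an untested claim into measurable evidence.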

## Remaining Risks

- **Core workflow completely untested** — the entire value prop of the skill is unverified
- **No unit tests** — 2,232 lines of Python with zero pytest coverage
- **Image providers cannot be tested without API keys** — 9 providers verified as code, but none tested for actual image generation
- **WeChat API dependency** — publish flow requires real WeChat Official Account credentials
- **camoufox dependency** — browser-based hotspot fetching may break if source sites change layout