Reports

Real-LLM evaluation, capability, and stress campaigns. Every claim is backed by runs against real models (a remote gateway and on-device Ollama) and graded with pass^k and grounded process checks.

2026-06-15 — Comprehensive real-LLM campaign (TS + Python) — 17-area coverage, functional pass^k, 9/9 flows both SDKs, clean concurrent stress, cross-model table.
2026-06-13 — Agent kernel full test report — consolidated campaign summary, findings, fixes, boundaries.
2026-06-13 — Remote-model framework stress — 450 episodes, concurrency, policy/memory/replay under a remote large model.
2026-06-12 — On-device testing campaign — 6-phase on-device master report.
2026-06-11 — On-device capability & reliability — plan-then-execute + voting → 100%.
2026-06-10 — On-device long run — 22-min, 194-episode multi-step run.
2026-06-09 — On-device performance — single-shot latency/repair across 5 local models.
2026-05-30 — Official agent benchmark run — BFCL / tau-bench / tau2.
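
For readers unfamiliar with the pass^k grading mentioned above, here is a minimal sketch of how such a score can be computed. It assumes the definition popularized by tau-bench: for a task with `c` successes out of `n` trials, the chance that `k` trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function names `pass_hat_k` and `mean_pass_hat_k` are hypothetical, and the reports listed here may compute the metric differently.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k trials of one task ALL succeed.

    Assumed definition (tau-bench style): C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb() returns 0 when successes < k, so the score drops to 0.0
    # as soon as fewer than k trials succeeded.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks given (successes, trials) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)
```

Unlike pass@k, which rewards at least one success in `k` attempts, this score falls as `k` grows unless the agent succeeds on every trial, which is why it is a common choice for reliability claims.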