model-ondevice long-run multi-step E2E report¶
- Date: 2026-06-10
- Package:
@agentic-kernel/model-ondevice - Hardware / runtime: Apple Silicon laptop, local Ollama 0.24
- What this tests: the FULL kernel loop, not single-shot planning.
OnDevicePlannerdrives the realcreateAgentEngine(policy, tool scheduler, state, observer) on complex multi-step tasks —plan → tool → observe → plan …— run as many episodes as fit in a 22-minute wall-clock window. This exercises what the single-shot benchmark could not: growing transcript (token-trim + atomic action/observation blocks), repair retries under real reasoning, bounded autonomy, and sustained reliability across rounds.
Setup¶
- Models:
qwen3.5:2b,gemma3:4b(raw decoding,think:false,temp:0). - Tools (deterministic):
search(city)→ population from a fixed KB;calculator(expression)→ safe arithmetic eval. - Tasks (each has known ground truth, forces search×N → calculator → answer):
diff(2 cities): Cairo − London = 13sum2(2 cities): New York + Mumbai = 29sum3(3 cities, familiar): Tokyo + Delhi + Shanghai = 76sum3b(3 cities, less familiar): Paris + Lagos + Istanbul = 42- Action vocabulary restricted to
tool_call,final_answer,thought. - Bounded autonomy:
maxSteps=24,repeatedActionThreshold=3,maxFailures=8,defaultMaxIterations=30,maxRepairAttempts=2. - Each episode verified: final answer must contain the ground-truth number.
Headline: infrastructure is 100% reliable¶
| metric | value |
|---|---|
| wall clock | 22.1 min |
| episodes | 194 |
| completed (no crash / stall / infinite loop) | 194 / 194 (100%) |
| tool calls dispatched | 534 |
| planner round-trips | 872 |
Across 194 multi-step episodes and 872 planner round-trips, the kernel + adapter never failed, hung, or looped — every run terminated cleanly with a final answer. Bounded autonomy held in every case.
Per-model results¶
| model | n | completed | correct | mean steps | mean tools | wall p50 | p90 | p99 |
|---|---|---|---|---|---|---|---|---|
| qwen3.5:2b | 98 | 98/98 | 49/98 (50%) | 3.0 | 3.0 | 6087ms | 6313ms | 6710ms |
| gemma3:4b | 96 | 96/96 | 72/96 (75%) | 2.5 | 2.5 | 8070ms | 9590ms | 13489ms |
Correctness by task — where small models break¶
| task | cities | qwen3.5:2b | gemma3:4b |
|---|---|---|---|
diff |
2 | 25/25 ✓ | 24/24 ✓ |
sum2 |
2 | 24/24 ✓ | 24/24 ✓ |
sum3 (familiar) |
3 | 0/25 | 24/24 ✓ |
sum3b (less familiar) |
3 | 0/24 | 0/24 |
A clean capability gradient emerges:
- 2-step chains: both models 100%.
- 3-step, familiar entities: gemma3:4b 100%, qwen3.5:2b 0%.
- 3-step, less-familiar entities: both 0%.
The failure mode is dropping the last step and hallucinating to fill it, not
an infrastructure bug. Verified by replay: for sum3b both models search
Paris(11) and Lagos(15), then skip searching Istanbul and invent the third
number before calling the calculator:
gemma3:4b: search(Paris)=11, search(Lagos)=15, calculator("11 + 15 + 8")=34 → "34" (should be 42)
qwen3.5:2b: search(Paris)=11, search(Lagos)=15, calculator("11+15+11")=37 → "37" (should be 42)
The kernel dispatched every tool correctly and fed every observation back; the calculator received exactly what the model asked. The models simply lose track of the third item in a longer chain. (temp=0 makes this deterministic, hence the clean 0/N.) Implication for on-device agents: keep tool chains short (≤2 hops) or decompose multi-hop tasks into separate runs.
The repair loop earns its place (real evidence)¶
Single-shot benchmarks showed 0 repairs everywhere. The multi-step loop tells a different story (measured by detecting the planner's correction turn):
| model | repairs | of plan-calls | episodes with ≥1 repair | still completed |
|---|---|---|---|---|
| qwen3.5:2b | 0 | 0 / 96 (0.0%) | 0 / 24 | 24/24 |
| gemma3:4b | 17 | 17 / 111 (15.3%) | 17 / 22 | 22/22 |
gemma3:4b malforms ~15% of its actions — and always the same way: it puts the
tool name in the type field instead of using tool_call:
{"type":"calculator","arguments":{"expression":"14 + 33 + 29"}} ← what gemma emits
{"type":"tool_call","toolName":"calculator","arguments":{...}} ← what's required
The adapter rejects type:"calculator" ("action type not allowed"), re-prompts
with the error, and gemma corrects on the retry — every time. Without the
repair loop, every gemma3:4b calculator step would fail. qwen3.5:2b never needs
it (0%). This is the resilience the adapter was designed for, now demonstrated
under real conditions.
Fixes applied after this test¶
- Tolerant
type = <toolName>normalization (shipped, both SDKs). The parser now normalizes{"type":"calculator",…}to atool_callwhen thetypematches an available tool name andtool_callis allowed — turning the common small-model mistake into a zero-cost parse instead of a repair round-trip. Re-running the same workload confirms it:
| model | repairs before | repairs after |
|---|---|---|
| gemma3:4b | 17 / 111 (15.3%) | 0 / 111 (0.0%) |
| qwen3.5:2b | 0 / 96 (0.0%) | 0 / 96 (0.0%) |
Completion stayed 100% and correctness was unchanged (the step-dropping issue is orthogonal); every eliminated repair is one saved LLM round-trip.
-
onRepairtelemetry (shipped, both SDKs). An optional callback fired once per repair attempt ({ attempt, error, output }), so hosts can measure malform rates directly instead of inferring them. This test originally miscounted repairs viaplan_calls − steps(contaminated bythought/final_answercalls); the corrected figures above are measured at the source. -
Not an SDK fix — documented instead: the step-dropping / hallucination on 3-hop chains is a model-capability limit. Guidance (keep chains ≤2 hops or decompose) is in the on-device README sections of both SDKs.
Conclusions¶
- The adapter + kernel are production-solid for sustained on-device agent loops: 194/194 episodes completed over 22 min, no crashes/stalls/loops.
- The repair loop is real and earns its place (a 4B model hit it 15% of the
time and recovered every time) — but the dominant cause here, the
type = <toolName>mistake, is now absorbed by tolerant normalization, so the repair loop falls back to a true last-resort safety net. - The binding constraint is model planning, not the SDK: small on-device models reliably handle ≤2-hop tool chains but drop steps and hallucinate on 3-hop chains. Design on-device agents around short chains or decomposition.
- Latency is steady under load: p50 6.1s (qwen 2b) / 8.1s (gemma 4b) per multi-step episode, tight p99 — no degradation across 194 rounds.