`model-ondevice` long-run multi-step E2E report¶

Date: 2026-06-10
Package: @agentic-kernel/model-ondevice
Hardware / runtime: Apple Silicon laptop, local Ollama 0.24
What this tests: the FULL kernel loop, not single-shot planning. OnDevicePlanner drives the real createAgentEngine (policy, tool scheduler, state, observer) on complex multi-step tasks — plan → tool → observe → plan … — run as many episodes as fit in a 22-minute wall-clock window. This exercises what the single-shot benchmark could not: growing transcript (token-trim + atomic action/observation blocks), repair retries under real reasoning, bounded autonomy, and sustained reliability across rounds.

Setup¶

Models: qwen3.5:2b, gemma3:4b (raw decoding, think:false, temp:0).
Tools (deterministic): search(city) → population from a fixed KB; calculator(expression) → safe arithmetic eval.
Tasks (each has known ground truth, forces search×N → calculator → answer):
diff (2 cities): Cairo − London = 13
sum2 (2 cities): New York + Mumbai = 29
sum3 (3 cities, familiar): Tokyo + Delhi + Shanghai = 76
sum3b (3 cities, less familiar): Paris + Lagos + Istanbul = 42
Action vocabulary restricted to tool_call, final_answer, thought.
Bounded autonomy: maxSteps=24, repeatedActionThreshold=3, maxFailures=8, defaultMaxIterations=30, maxRepairAttempts=2.
Each episode verified: final answer must contain the ground-truth number.

Headline: infrastructure is 100% reliable¶

metric	value
wall clock	22.1 min
episodes	194
completed (no crash / stall / infinite loop)	194 / 194 (100%)
tool calls dispatched	534
planner round-trips	872

Across 194 multi-step episodes and 872 planner round-trips, the kernel + adapter never failed, hung, or looped — every run terminated cleanly with a final answer. Bounded autonomy held in every case.

Per-model results¶

model	n	completed	correct	mean steps	mean tools	wall p50	p90	p99
qwen3.5:2b	98	98/98	49/98 (50%)	3.0	3.0	6087ms	6313ms	6710ms
gemma3:4b	96	96/96	72/96 (75%)	2.5	2.5	8070ms	9590ms	13489ms

Correctness by task — where small models break¶

task	cities	qwen3.5:2b	gemma3:4b
`diff`	2	25/25 ✓	24/24 ✓
`sum2`	2	24/24 ✓	24/24 ✓
`sum3` (familiar)	3	0/25	24/24 ✓
`sum3b` (less familiar)	3	0/24	0/24

A clean capability gradient emerges:

2-step chains: both models 100%.
3-step, familiar entities: gemma3:4b 100%, qwen3.5:2b 0%.
3-step, less-familiar entities: both 0%.

The failure mode is dropping the last step and hallucinating to fill it, not an infrastructure bug. Verified by replay: for sum3b both models search Paris(11) and Lagos(15), then skip searching Istanbul and invent the third number before calling the calculator:

gemma3:4b: search(Paris)=11, search(Lagos)=15, calculator("11 + 15 + 8")=34  → "34"   (should be 42)
qwen3.5:2b: search(Paris)=11, search(Lagos)=15, calculator("11+15+11")=37     → "37"   (should be 42)

The kernel dispatched every tool correctly and fed every observation back; the calculator received exactly what the model asked. The models simply lose track of the third item in a longer chain. (temp=0 makes this deterministic, hence the clean 0/N.) Implication for on-device agents: keep tool chains short (≤2 hops) or decompose multi-hop tasks into separate runs.

The repair loop earns its place (real evidence)¶

Single-shot benchmarks showed 0 repairs everywhere. The multi-step loop tells a different story (measured by detecting the planner's correction turn):

model	repairs	of plan-calls	episodes with ≥1 repair	still completed
qwen3.5:2b	0	0 / 96 (0.0%)	0 / 24	24/24
gemma3:4b	17	17 / 111 (15.3%)	17 / 22	22/22

gemma3:4b malforms ~15% of its actions — and always the same way: it puts the tool name in the type field instead of using tool_call:

{"type":"calculator","arguments":{"expression":"14 + 33 + 29"}}     ← what gemma emits
{"type":"tool_call","toolName":"calculator","arguments":{...}}       ← what's required

The adapter rejects type:"calculator" ("action type not allowed"), re-prompts with the error, and gemma corrects on the retry — every time. Without the repair loop, every gemma3:4b calculator step would fail. qwen3.5:2b never needs it (0%). This is the resilience the adapter was designed for, now demonstrated under real conditions.

Fixes applied after this test¶

Tolerant type = <toolName> normalization (shipped, both SDKs). The parser now normalizes {"type":"calculator",…} to a tool_call when the type matches an available tool name and tool_call is allowed — turning the common small-model mistake into a zero-cost parse instead of a repair round-trip. Re-running the same workload confirms it:

model	repairs before	repairs after
gemma3:4b	17 / 111 (15.3%)	0 / 111 (0.0%)
qwen3.5:2b	0 / 96 (0.0%)	0 / 96 (0.0%)

Completion stayed 100% and correctness was unchanged (the step-dropping issue is orthogonal); every eliminated repair is one saved LLM round-trip.

onRepair telemetry (shipped, both SDKs). An optional callback fired once per repair attempt ({ attempt, error, output }), so hosts can measure malform rates directly instead of inferring them. This test originally miscounted repairs via plan_calls − steps (contaminated by thought / final_answer calls); the corrected figures above are measured at the source.
Not an SDK fix — documented instead: the step-dropping / hallucination on 3-hop chains is a model-capability limit. Guidance (keep chains ≤2 hops or decompose) is in the on-device README sections of both SDKs.

Conclusions¶

The adapter + kernel are production-solid for sustained on-device agent loops: 194/194 episodes completed over 22 min, no crashes/stalls/loops.
The repair loop is real and earns its place (a 4B model hit it 15% of the time and recovered every time) — but the dominant cause here, the type = <toolName> mistake, is now absorbed by tolerant normalization, so the repair loop falls back to a true last-resort safety net.
The binding constraint is model planning, not the SDK: small on-device models reliably handle ≤2-hop tool chains but drop steps and hallucinate on 3-hop chains. Design on-device agents around short chains or decomposition.
Latency is steady under load: p50 6.1s (qwen 2b) / 8.1s (gemma 4b) per multi-step episode, tight p99 — no degradation across 194 rounds.

model-ondevice long-run multi-step E2E report¶