model-ondevice performance report¶
- Date: 2026-06-09
- Package:
@agentic-kernel/model-ondevice(TS) /agentic_kernel.model_ondevice(Python) - Hardware: Apple Silicon laptop, models served by local Ollama 0.24
- Method: the adapter drives real local models through an injected
TextGeneratorbacked by Ollama/api/chat(stream:false,temperature:0,think:falseunless noted). Per call we record wall latency, Ollama's token counts/durations, generator round-trips (= 1 + repair retries), and the parsed action kind. Benchmark scripts are not committed (kept under/tmp).
TL;DR¶
- The adapter is not the bottleneck — by 3–4 orders of magnitude. Its own CPU work (prompt render + token trim + tolerant parse) is microseconds; end-to-end latency is seconds and is ~100% model inference.
- The repair loop is free on capable models. Across 5 models × 78 calls,
78/78 produced a valid action on the first try (0 repairs) — including a
reasoning model whose answer was wrapped in
<think>…</think>. The restricted 5-kind vocabulary + schema-shaped prompt is what makes small models reliable. - Biggest end-to-end win is configuration, not code: disabling "thinking" (or not using a reasoning model) on an interactive lane is ~17×.
End-to-end latency (raw decoding, 6 iters × 3 scenarios)¶
Scenarios: tool-use (expects tool_call), final-answer, after-observation.
| model | ok/n | p50 | p90 | max | mean repairs | prompt tok | out tok | decode tok/s | adapter+net |
|---|---|---|---|---|---|---|---|---|---|
| qwen3.5:0.8b | 18/18 | 780ms | 921ms | 925ms | 0.00 | 207 | 23 | 48.3 | 1.6ms |
| qwen3.5:2b | 18/18 | 1460ms | 1581ms | 1613ms | 0.00 | 207 | 24 | 28.3 | 2.8ms |
| gemma3:4b | 18/18 | 1360ms | 1639ms | 1766ms | 0.00 | 208 | 27 | 30.6 | 6.4ms |
| qwen3:8b | 18/18 | 1911ms | 2262ms | 3228ms | 0.00 | 202 | 25 | 16.8 | 11.2ms |
| deepseek-r1:8b (think) | 12/12 | 32940ms | 44124ms | 44833ms | 0.00 | 183 | 466 | 17.5 | 6.1ms |
adapter+net = wall − Ollama total_duration ≈ localhost HTTP + JSON parse
(not adapter compute; see microbench). It tracks response size/scheduling, not
adapter work.
Reading it: latency scales with model size (0.8B→8B: ~0.78s→1.9s p50) and
decode throughput falls (48→17 tok/s) — both intrinsic to the model. The 0.8B
model is fast but under-powered (it chose ask_user where it should have called
search); 2–4B is the sweet spot here (<1.7s to a correct action). The
reasoning model is accurate but its <think> block (avg 466 output tokens) makes
it 7–44s — unsuitable for an interactive lane unless thinking is disabled.
Constrained decoding (Ollama format = buildActionSchema())¶
Passing the action schema (now exported as buildActionSchema) to the runtime's
structured-output mode. 4 iters × 3 scenarios:
| model | ok/n | p50 | p90 | mean repairs |
|---|---|---|---|---|
| qwen3.5:2b | 12/12 | 1480ms | 1555ms | 0.00 |
| gemma3:4b | 12/12 | 1231ms | 1528ms | 0.00 |
On these models (already 0-repair raw) constrained decoding is latency-neutral.
Its value is insurance: it makes invalid output structurally impossible, so
it matters most on weaker / more aggressively quantized / older models, and it
can trim leading filler tokens. The adapter already passes the schema as
GenerateRequest.schema; hosts opt in inside their TextGenerator.
Adapter-only overhead (no model) — render + trim + parse, µs/call¶
Pure CPU cost as the transcript grows, before vs after the render short-circuit (materialize newest-first, stop at the first block over budget → O(kept), not O(history)):
| transcript blocks | before p50 | after p50 | before p99 | after p99 |
|---|---|---|---|---|
| 0 | 8.0µs | 4.5µs | 39µs | 39µs |
| 40 | 22.0µs | 8.1µs | 105µs | 65µs |
| 100 | 38.9µs | 14.1µs | 185µs | 74µs |
| 600 | 178.4µs | 19.6µs | 440µs | 79µs |
At 600 blocks the optimization is ~9× faster and overhead is now roughly flat. Either way the adapter costs sub-millisecond vs seconds of inference — this change is correctness/scaling hygiene, not a latency lever.
Tuning guide (by impact)¶
- Interactive lane: disable thinking / avoid reasoning models — ~17×.
- Pick a 2–4B model for the latency/quality sweet spot on-device.
- Reuse the runtime's KV prefix cache: the adapter puts the stable
systemprompt (role/goal/instructions/tools/action shapes) first and the volatile transcript last, so a fixed tool set keeps a byte-stable prefix across steps. Note: once token-trim starts dropping the oldest blocks, the user-message prefix changes and the cache for that segment is invalidated. - Tool-heavy agents:
toolPromptStyle: "compact"rendersname(p1, p2): descriptioninstead of inlining each tool's full JSON schema, cutting prefill tokens (matters as tool count grows; negligible for 1–2 tools). - Weak/quantized models: enable constrained decoding via
buildActionSchemawired to your runtime's grammar/format param — turns rare repairs into structurally-impossible ones. maxRepairAttemptsis a safety net, not a routine cost — it never fired on any tested model. Leave it at the default (2).