On-device agentic capability & reliability — enhancements and evidence¶
- Date: 2026-06-11
- Package:
@agentic-kernel/model-ondevice(TS) /agentic_kernel.model_ondevice(Python) - Goal: maximize the validity and reliability of on-device/small-model agents through framework levers, measured against real local models (Ollama).
- Method: the adapter drives the real
createAgentEngineloop over Ollama (qwen3.5:2b,gemma3:4b) on multi-step search→calculate→answer tasks. - Validity = grounded and answer-correct and well-formed. "Grounded" means the answer was actually computed by tools (a calculator observation equals ground truth and every entity was searched) — a hallucinated but numerically-right answer counts as INVALID.
- Reliability =
pass^k/ voted-correct over K stochastic trials (temp 0.7), aggregated with the SDK's ownsummarizeModelScenario.
Levers shipped (all opt-in, additive; core and the remote lane untouched)¶
| Lever | Option / API | Targets | Verdict |
|---|---|---|---|
| Plan-then-execute | planMode: "plan-then-execute" |
step-dropping (validity) | Validated win |
| Episode self-consistency | runSelfConsistent(run, {samples}) |
variance (reliability) | Validated win (when model ≥ ~decent) |
| Terminal verification | verifyAnswer + groundingVerifier() |
hallucination | Shipped; not exercised by this task set |
| Few-shot examples | examples |
adherence | Shipped |
| Per-step self-consistency | SelfConsistencyPlanner |
per-decision variance | Tried, did NOT help agents — kept with caveat |
Finding 1 — plan-then-execute lifts validity (temp 0.7, K=4)¶
| model | react | plan-then-execute |
|---|---|---|
| qwen3.5:2b | 0% | 69% |
| gemma3:4b | 50% | 81% |
Externalizing the plan as a replayable thought and re-injecting it each step
stops weak models from dropping the last hop. These are grounded gains
(answerMatch == grounded), not lucky guesses.
Methodology caveat that changed the conclusion: an earlier temp=0 single-shot run made plan-mode look like it broke qwen (0%). Multi-sample evaluation at temp 0.7 revealed the opposite — plan-mode greatly helps. Deterministic single-shot testing lies; always multi-sample.
Finding 2 — reliability is a separate axis, and the bottleneck¶
High single-shot validity does not imply reliability. gemma plan-mode scored 81%
validity but only 50% pass^4 — half the scenarios failed to be valid on all
4 trials. pass^k is the metric that matters for shipping an agent.
Finding 3 — episode-level self-consistency is the reliability lever¶
Run the whole agent N times independently and majority-vote the final answer
(runSelfConsistent). K=15, vote N=5, disjoint groups:
| model | task | single-shot validity | voted-correct (reliability) |
|---|---|---|---|
| gemma3:4b | sum3 | 73% | 100% (3/3 groups) |
| qwen3.5:2b | sum3b | 27% | 67% (2/3) |
| gemma3:4b | sum3b | 60% | 67% (2/3) |
| qwen3.5:2b | sum3 | 13% | 0% (0/3) |
Voting amplifies a correct plurality: it makes a decent model reliable (gemma sum3 73%→100%) but cannot rescue a model that is mostly wrong (qwen sum3 13%→0% — the wrong answers scatter and never reach the truth). The boundary is ~"the correct answer must already be the single most common outcome."
Finding 4 — per-step self-consistency is the wrong granularity¶
Voting on each step (SelfConsistencyPlanner) did not help and sometimes
hurt (qwen 40%→20%, gemma 60%→60%): it reinforces the model's modal next step,
which on a hard task is often the wrong one, and the extra sampling derailed weak
runs (well-formed 80%→60%). Classic self-consistency votes on outcomes, not
intermediate steps. The per-step wrapper is kept (it correctly stabilizes a
single high-variance decision) but documented as the wrong tool for agent-level
reliability.
Finding 5 — verification (B): situational alone, decisive in the full stack¶
groundingVerifier rejects answers whose numbers aren't in observations (the
"lucky guess"). At temp 0.7 the models rarely hit that exact failure, so grounding
verify alone was roughly neutral. But a stronger process verifier — a host
rule wired through the same verifyAnswer seam — is what finally reaches 100%
(Finding 6). Earlier "B not exercised" runs were temp=0 artifacts.
Finding 6 — 100% reliability is reachable: plan + process-verify + voting¶
Combining all three levers and a host process verifier (via verifyAnswer)
that rejects a final_answer unless: every entity was searched, the calculator
was used, the answer equals the calculator result, the result is an integer, and
the calculator's operands are exactly the looked-up values. K=15, vote N=5:
| model | diff | sum2 | sum3 | sum3b |
|---|---|---|---|---|
| gemma3:4b | 100% | 100% | 100% | 100% |
| qwen3.5:2b | 100% | 100% | 67% | 67% |
gemma3:4b reaches 100% reliability on every task at vote N=5 — the process
verifier eliminated the last residual (gemma sum3b's 42.6 non-integer
miscalculation: 67%→100%). The verifier is host domain logic ("populations
are integers", "use each value once"); the SDK supplies the seam and a generic
groundingVerifier, the host encodes the task's invariants. Correct layering —
only the host knows the task's correctness rules.
qwen3.5:2b reached 100% on 2-hop but only ~67% on 3-hop at N=5. That was NOT a capability floor — it was an under-sized ensemble. The correct answer is still qwen's modal output on 3-hop (just less dominant: ~30–44% single-valid, wrong answers scattered), so a larger ensemble captures it. At vote N=9:
| model (3-hop) | N=5 | N=9 |
|---|---|---|
| qwen3.5:2b sum3 | 67% | 100% ([76,76,76]) |
| qwen3.5:2b sum3b | 67% | 100% ([42,42,42]) |
So both models reach 100% on all 4 tasks with the full recipe — the weaker
model just needs a bigger vote ensemble. Ensemble size N is the cost/reliability
knob: a stronger model (gemma) gets there at N=5; a weaker one (qwen) at N=9.
Recipe to ensure 100%: plan-then-execute + a process verifyAnswer that
enforces the task's correctness invariants + runSelfConsistent, with the vote
ensemble N sized to the model (larger for weaker models). The one real
requirement: the correct answer must be the model's modal output — then enough
votes make it win deterministically. If a model is so weak that a wrong answer is
its mode, no ensemble fixes it; use agreement to abstain/escalate (never
confidently wrong) or pick a stronger model.
Recommended recipe for on-device agents¶
planMode: "plan-then-execute"— the validity lever; biggest single gain.- A process
verifyAnswerthat encodes the task's correctness invariants (all inputs gathered, answer == tool result, domain constraints) — this is what turns "grounded" into "correct" and removes systematic miscalculations. runSelfConsistent({ samples: N })— votes the final answer across runs; sizeNto the model: gemma3:4b reaches 100% at N=5, qwen3.5:2b at N=9. AddacceptFirst(with the sound verifier) andminSamples: 1so easy cases cost ~1 episode instead of N — 100% without paying full N every time (Finding 7).- The one requirement: correct must be the model's modal output. Voting then makes it win deterministically. Both tested models clear this on all 4 tasks.
- If a wrong answer is the model's mode, no ensemble fixes it — use
agreementto abstain/escalate (never confidently wrong) or pick a stronger model. Keep tool chains short (≤2 hops) or decompose to stay above the bar. - Few-shot (
examples) and the genericgroundingVerifierare available too.
With 1–3, both qwen3.5:2b and gemma3:4b reach 100% on all 4 tested tasks (qwen needs the larger N=9 ensemble) — a concrete demonstration that "ensure 100%" is reachable for on-device agents with the full stack.
Finding 7 — cost: 100% reliability without paying full N every time¶
Voting at fixed N is wasteful — most runs just confirm an already-obvious
agreement. runSelfConsistent gained three cost controls (all opt-in,
back-compatible):
minSamples<samples— stop as soon as the leader's lead exceeds the remaining budget (same answer as full N, fewer runs).acceptFirst— accept the first run that passes a sound check. When the process verifier certifies correctness, one verified run is enough — no vote.concurrency— run waves in parallel for batched-inference hosts (latency).
Measured on real models (max N=9, plan + process-verify, vote correctness held at 100% throughout):
| strategy | gemma3:4b diff — avg episodes |
|---|---|
| fixed N=9 | 9.0 |
| adaptive (early-stop) | 5.5 (−39%) |
| accept-first-verified | 1.0 (−89%) |
On an easy task the sound verifier collapses cost from 9 episodes to 1 with no loss of correctness. On hard tasks accept-first needs a few retries (the model fails the verifier more often) so the saving shrinks, but adaptive early-stop and accept-first both stay strictly cheaper than fixed N. This is the on-device advantage realized: 100% reliability at a fraction of the naive voting cost.
The composable verifiers module (requireToolUsed, answerEqualsToolResult,
answerIsInteger, answerMatches, allVerifiers) lets a host assemble the sound
process verifier from tested pieces.
Boundary (honest)¶
Framework levers maximize what the model can do; they don't replace capability.
The one hard requirement is that the correct answer be the model's modal
output — both tested models clear this on all 4 tasks (qwen needs N=9). If a
wrong answer were a model's mode, no ensemble fixes it: abstain/escalate on low
agreement, or use a stronger model.
Status¶
Both SDKs at full gate (TS 530 tests; Python 100% statement+branch, ruff +
mypy --strict). All levers opt-in / default-off; core and the remote lane
unchanged. Not version-bumped or published. Eval scripts live under /tmp
(not committed).