Design Spec: model-ondevice adapter¶
- Date: 2026-06-08
- Status: Draft — awaiting review
- Scope: new
Planneradapter package, both SDKs (TS + Python). No core change.
Context & goal¶
We run a hybrid deployment where remote and on-device models go separate
lanes ("各走各的"): the host starts an independent engine.run() per lane —
remote tasks with OpenAIPlanner, on-device tasks with the new on-device
planner. There is no routing/fallback/escalation in the SDK; combining lanes
is a host concern.
On-device / small / local models differ from remote models on several axes — most painfully unreliable structured output and small context windows. This adapter contains all of that resilience inside the planner, so:
coreis untouched — it only sees thePlannerinterface (one successful plan, or one failure).- The remote path (
core+model-openai) does not change at all — it does not know on-device exists. The large-model capability cannot regress.
The retry/repair lives inside OnDevicePlanner.plan(), invisible to core.
This is why the earlier idea of adding maxPlanRetries/repairHint to core
was the wrong layer and was dropped.
Non-goals¶
- Bundling any model runtime (WebLLM / Ollama / transformers.js / llama.cpp) —
the runtime is injected, exactly like
model-openaiinjectsfetch. - Routing / fallback / escalation between lanes (host concern).
- Replacing
model-openai. - Any change to
coreor other packages.
Package¶
- TS:
@agentic-kernel/model-ondevice(peer ofmodel-openai). - Python:
agentic_kernel.model_ondevice(sibling subpackage). - Implements the existing
Plannerseam (plan+ optionalplanStream).
Public API (locked decisions)¶
Injected text generator — the only required dependency (no runtime bundled):
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type GenerateRequest = {
messages: ChatMessage[];
schema?: JsonObject; // for runtimes that support constrained decoding; else ignored
signal?: AbortSignal;
};
// Blocking or streaming; the planner adapts to either.
type TextGenerator = (req: GenerateRequest) => Promise<string> | AsyncIterable<string>;
Planner:
new OnDevicePlanner({
generate: TextGenerator, // required: WebLLM / Ollama / transformers.js / ...
actions?: ActionKind[], // restricted vocabulary; DEFAULT below
maxRepairAttempts?: number, // default 2
maxPromptTokens?: number, // default 3500
countTokens?: (s: string) => number,// default ~ ceil(chars / 4)
clock?: Clock, // to stamp action.createdAt if model omits it
idGenerator?: IdGenerator // to stamp action.id if model omits it
});
Locked decisions:
- Default action vocabulary (restricted):
tool_call,final_answer,ask_user,stop,thought. Host opts into more (spawn_subagent,join_subagent,cancel_subagent,push_goal,complete_goal,schedule) viaactions. Fewer choices ⇒ higher reliability for small models. - Streaming in v1: implements
planStream; reasoning/text deltas surface asPlannerChunks. Falls back to blocking when the injected generator returns a string. messages[]chat interface (not a single prompt string).- Name:
model-ondevice. (Capability is broader — works for any raw-text model — but we name by deployment per the product framing.)
Internal pipeline (per plan())¶
- Render
AgentContext→ compact chat messages, trimmed tomaxPromptTokens: - system: agent role/goal/instructions; "Return exactly ONE JSON action"; the allowed action shapes (compact); available tools (name + 1-line desc + input schema).
- transcript: recent messages, recent action→observation pairs as terse
lines, retrieved memory (compact),
compactedSummaryif present. - trim order when over budget: drop oldest transcript/memory first; always keep the task input, the tool list + action schema, and the most recent observation.
- Generate via the injected
generate(passschemafor constrained-decode runtimes; stream chunks toonChunk/planStreamwhen async-iterable). - Tolerant parse →
Action: - strip
```jsonfences and surrounding prose; extract the first balanced JSON object;JSON.parse. - require
type∈ allowed vocabulary; fillid(idGenerator) andcreatedAt(clock) if missing; validate required fields per type (tool_call→toolName+arguments,final_answer→content, etc.). - Repair loop on parse/validation failure: append the bad output + a strict
correction turn ("Your last reply was not a valid action:
. Reply with ONLY one JSON action, no prose."), retry up to maxRepairAttempts. - Give up → throw
OnDevicePlannerError. Core's runtime catches a planner throw and fails the run exactly as today — no core change, no behavior change for any other planner.
Failure & observability semantics¶
- All retries are internal; core sees success or one terminal failure.
- The planner MAY surface repair attempts via
onChunktext or its own logging; it does not emit core events (it has no access to the event log — correct separation). Hosts that want repair telemetry wrapgenerate.
Parity (TS ↔ Python)¶
- Identical design:
OnDevicePlanner+TextGenerator(Python: aCallableprotocol), same options (snake_case), same render/parse/repair/token pipeline. - The injected generator is platform-specific and host-provided (TS browser: WebLLM; Python local: Ollama/llama.cpp). The package ships no binding.
- Both at full gate: TS build + typecheck + prettier + vitest thresholds; Python ruff + mypy --strict + pytest 100% + CI.
Testing (no real model needed)¶
Inject a scripted TextGenerator returning canned strings to exercise:
- clean JSON → action; fenced JSON; prose-wrapped JSON;
- malformed-then-fixed sequence → repair loop succeeds; always-malformed →
throws after
maxRepairAttempts; - each allowed action type; a disallowed type → rejected/repaired;
- missing
id/createdAt→ stamped; - token-budget trimming (long context → oldest dropped, task/tools/last-obs kept);
- streaming generator → chunks surfaced + final action;
- 100% statement+branch (Python) / thresholds (TS).
Example: two independent lanes (host)¶
// remote lane — unchanged
const remote = createAgentEngine({ planner: new OpenAIPlanner({ fetch, apiKey, model }), policy, toolRegistry });
await remote.run({ agent, task: remoteTask, host });
// on-device lane — independent run, same kernel
const onDevice = createAgentEngine({
planner: new OnDevicePlanner({ generate: webllmGenerate }),
policy, toolRegistry
});
await onDevice.run({ agent, task: localTask, host });
// Optional: inject the same MemoryManager into both to share lessons.
Future (not v1)¶
- Constrained/grammar decoding helpers per runtime (currently just pass
schemathrough; tolerant parse + repair is the safety net). - Multi-action batch output (weak models struggle; single-action only in v1).
- Token-aware compaction in
core(optional; the adapter trims its own prompt, so not required).