Token usage¶
The kernel measures token usage and lets you bound a run by it, the same two
responsibilities (measuring and bounding) it already has for latency/counts and
RunBudget. It does not price tokens or route by cost: that stays a host concern,
derived from the counts the kernel exposes. (Added in 0.6.0.)
How it flows¶
- A planner reports usage for each model call via an optional onUsage/on_token_usage callback (non-breaking: planners that can't measure it never call it).
- The runtime records each report as a token_usage event and aggregates a cumulative state.tokenUsage (state.token_usage in Python).
- Replay re-aggregates from those events, so a replayed state matches live state.
TokenUsage = { inputTokens, outputTokens, totalTokens, model?, metadata? }
(snake_case in Python).
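The flow above can be pictured as a pure reducer over recorded events. This is an illustrative sketch, not the kernel's implementation: the RunEvent shape and the addUsage, aggregateUsage, and recordedEvents names are hypothetical, and only the TokenUsage fields, the onUsage callback name, and the token_usage event type come from the text.

```typescript
// Shape taken from the text; every other name in this sketch is hypothetical.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  model?: string;
  metadata?: Record<string, unknown>;
}

// Assumed event shape: the kernel's real event type may differ.
type RunEvent =
  | { type: "token_usage"; usage: TokenUsage }
  | { type: "other" };

const ZERO: TokenUsage = { inputTokens: 0, outputTokens: 0, totalTokens: 0 };

// Add one report to the running total.
function addUsage(total: TokenUsage, report: TokenUsage): TokenUsage {
  return {
    inputTokens: total.inputTokens + report.inputTokens,
    outputTokens: total.outputTokens + report.outputTokens,
    totalTokens: total.totalTokens + report.totalTokens,
  };
}

// Replay: fold over the recorded events to rebuild the cumulative total.
function aggregateUsage(log: RunEvent[]): TokenUsage {
  let total = ZERO;
  for (const entry of log) {
    if (entry.type === "token_usage") total = addUsage(total, entry.usage);
  }
  return total;
}

// Live run: the planner callback records an event and updates the total.
const recordedEvents: RunEvent[] = [];
let liveTotal: TokenUsage = ZERO;
const onUsage = (usage: TokenUsage): void => {
  recordedEvents.push({ type: "token_usage", usage });
  liveTotal = addUsage(liveTotal, usage);
};

onUsage({ inputTokens: 120, outputTokens: 30, totalTokens: 150 });
onUsage({ inputTokens: 200, outputTokens: 50, totalTokens: 250 });

// Re-aggregating from the log yields the same totals as the live run.
const replayedTotal = aggregateUsage(recordedEvents);
```

Because the cumulative value is derived only from the events, a planner that never calls the callback simply leaves the total at zero.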
Reading it¶
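A minimal sketch of reading the cumulative counts, assuming a state object that carries the tokenUsage field named above. RunStateLike, formatUsage, hostCost, and the rate values are hypothetical; the rates in particular are placeholders a host would supply, not real prices.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  model?: string;
  metadata?: Record<string, unknown>;
}

// Stand-in for the kernel's state: only tokenUsage is taken from the text.
interface RunStateLike {
  tokenUsage: TokenUsage;
}

// Hypothetical helper: render the cumulative counts for a log line.
function formatUsage(state: RunStateLike): string {
  const { inputTokens, outputTokens, totalTokens } = state.tokenUsage;
  return `tokens in=${inputTokens} out=${outputTokens} total=${totalTokens}`;
}

// Hypothetical host-side pricing: the kernel exposes counts, the host prices them.
interface HostRates {
  inputPerMillion: number;
  outputPerMillion: number;
}

function hostCost(usage: TokenUsage, rates: HostRates): number {
  return (
    (usage.inputTokens * rates.inputPerMillion +
      usage.outputTokens * rates.outputPerMillion) / 1e6
  );
}

const runState: RunStateLike = {
  tokenUsage: { inputTokens: 320, outputTokens: 80, totalTokens: 400 },
};

const summaryLine = formatUsage(runState);
// Placeholder rates for illustration only.
const runCost = hostCost(runState.tokenUsage, { inputPerMillion: 1, outputPerMillion: 2 });
```

In Python the same field would be read as state.token_usage, with snake_case keys.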
Budgeting by tokens¶
RunBudget gained token caps, enforced in the same pre-round check as
maxToolCalls/maxWallClockMs. The run stops with budget_exhausted once a cap is
reached.
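The pre-round check might look roughly like the sketch below. The text does not name the token cap fields, so maxInputTokens, maxOutputTokens, and maxTotalTokens are hypothetical placeholders, as are RunBudgetSketch, RunProgress, and checkBudget. Only maxToolCalls, maxWallClockMs, and budget_exhausted come from the text.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

interface RunBudgetSketch {
  maxToolCalls?: number;
  maxWallClockMs?: number;
  // Hypothetical cap names: check the real RunBudget type for the actual fields.
  maxInputTokens?: number;
  maxOutputTokens?: number;
  maxTotalTokens?: number;
}

// Hypothetical snapshot of what the run has consumed so far.
interface RunProgress {
  toolCalls: number;
  elapsedMs: number;
  tokenUsage: TokenUsage;
}

type BudgetCheck =
  | { ok: true }
  | { ok: false; reason: "budget_exhausted"; cap: keyof RunBudgetSketch };

// A cap is "reached" when usage meets or exceeds it; unset caps never trigger.
function reached(limit: number | undefined, used: number): boolean {
  return limit !== undefined && used >= limit;
}

// Runs before each round: the first reached cap stops the run.
function checkBudget(budget: RunBudgetSketch, progress: RunProgress): BudgetCheck {
  const checks: Array<[keyof RunBudgetSketch, number]> = [
    ["maxToolCalls", progress.toolCalls],
    ["maxWallClockMs", progress.elapsedMs],
    ["maxInputTokens", progress.tokenUsage.inputTokens],
    ["maxOutputTokens", progress.tokenUsage.outputTokens],
    ["maxTotalTokens", progress.tokenUsage.totalTokens],
  ];
  for (const [cap, used] of checks) {
    if (reached(budget[cap], used)) {
      return { ok: false, reason: "budget_exhausted", cap };
    }
  }
  return { ok: true };
}

const budget: RunBudgetSketch = { maxToolCalls: 10, maxTotalTokens: 400 };

const beforeCap = checkBudget(budget, {
  toolCalls: 2,
  elapsedMs: 1500,
  tokenUsage: { inputTokens: 200, outputTokens: 50, totalTokens: 250 },
});

const atCap = checkBudget(budget, {
  toolCalls: 3,
  elapsedMs: 2400,
  tokenUsage: { inputTokens: 320, outputTokens: 80, totalTokens: 400 },
});
```

Since the check runs before a round rather than during it, the round that crosses a cap completes and the next one is refused, so final usage can exceed the cap by up to one round's worth of tokens.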
Where the numbers come from¶
- model-openai reads the provider usage (Responses API input_tokens/output_tokens) for both blocking and streaming calls, so its counts are exact.
- model-ondevice uses exact counts when the injected generator returns { text, usage } (e.g. Ollama's prompt_eval_count/eval_count); otherwise it estimates from countTokens and marks metadata.estimated = true.
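The exact-versus-estimated fallback for an on-device model can be sketched as follows. This is an assumption-laden illustration rather than model-ondevice's code: GeneratorResult, resolveUsage, fromOllama, and wordCount are hypothetical, and the inner shape of the generator's usage object is a guess. The prompt_eval_count and eval_count fields are the ones Ollama reports in its generate response.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  model?: string;
  metadata?: Record<string, unknown>;
}

// Assumed generator contract: plain text, or text plus measured usage.
type GeneratorResult =
  | string
  | { text: string; usage: { inputTokens: number; outputTokens: number } };

// Map an Ollama generate response onto the assumed { text, usage } shape.
function fromOllama(res: {
  response: string;
  prompt_eval_count: number;
  eval_count: number;
}): GeneratorResult {
  return {
    text: res.response,
    usage: { inputTokens: res.prompt_eval_count, outputTokens: res.eval_count },
  };
}

// Prefer measured counts; otherwise estimate and flag the result as estimated.
function resolveUsage(
  prompt: string,
  result: GeneratorResult,
  countTokens: (text: string) => number,
): TokenUsage {
  if (typeof result !== "string") {
    const { inputTokens, outputTokens } = result.usage;
    return { inputTokens, outputTokens, totalTokens: inputTokens + outputTokens };
  }
  const inputTokens = countTokens(prompt);
  const outputTokens = countTokens(result);
  return {
    inputTokens,
    outputTokens,
    totalTokens: inputTokens + outputTokens,
    metadata: { estimated: true },
  };
}

// Crude stand-in for countTokens: counts whitespace-separated words.
const wordCount = (text: string): number =>
  text.trim().split(/\s+/).filter(Boolean).length;

const exactUsage = resolveUsage(
  "hi there",
  fromOllama({ response: "ok", prompt_eval_count: 12, eval_count: 5 }),
  wordCount,
);
const estimatedUsage = resolveUsage("one two three", "four five", wordCount);
```

Consumers that need exact numbers, such as billing reconciliation on the host, can filter on metadata.estimated before trusting a count.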
Exporting & evaluating¶
- observer-otel emits the OpenTelemetry GenAI metric gen_ai.client.token.usage (a histogram split by gen_ai.token.type = input/output, with gen_ai.request.model) for each token_usage event.
- evaluator adds token totals to EvalTrialMetrics and averageTotalTokens/totalTokens to the summary. Cost stays with the host estimateCost hook, which can now be fed real token counts.
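Both consumers reduce to small transformations of the same counts. The sketch below avoids any OpenTelemetry SDK calls and only shows the data each one would produce. TokenDataPoint, toDataPoints, TrialTokens, and summarizeTrials are hypothetical names, and the placement of totalTokens on a trial record is an assumption. The metric and attribute names are the ones given in the text.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  model?: string;
}

// One histogram recording: a value plus its attributes.
interface TokenDataPoint {
  metric: "gen_ai.client.token.usage";
  value: number;
  attributes: Record<string, string>;
}

// Split one token_usage report into an input point and an output point.
function toDataPoints(usage: TokenUsage): TokenDataPoint[] {
  const base: Record<string, string> =
    usage.model !== undefined ? { "gen_ai.request.model": usage.model } : {};
  return [
    {
      metric: "gen_ai.client.token.usage",
      value: usage.inputTokens,
      attributes: { ...base, "gen_ai.token.type": "input" },
    },
    {
      metric: "gen_ai.client.token.usage",
      value: usage.outputTokens,
      attributes: { ...base, "gen_ai.token.type": "output" },
    },
  ];
}

// Assumed per-trial record: only the token total matters for this sketch.
interface TrialTokens {
  totalTokens: number;
}

// Summary fields named in the text, computed across trials.
function summarizeTrials(trials: TrialTokens[]): {
  totalTokens: number;
  averageTotalTokens: number;
} {
  const totalTokens = trials.reduce((sum, trial) => sum + trial.totalTokens, 0);
  const averageTotalTokens = trials.length > 0 ? totalTokens / trials.length : 0;
  return { totalTokens, averageTotalTokens };
}

const dataPoints = toDataPoints({
  inputTokens: 120,
  outputTokens: 30,
  totalTokens: 150,
  model: "example-model", // placeholder model name
});
const trialSummary = summarizeTrials([{ totalTokens: 150 }, { totalTokens: 250 }]);
```

Recording input and output as separate points on one histogram lets a backend filter or sum by gen_ai.token.type without a second metric.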
Non-goals¶
No pricing tables, per-token rates, or cost-based routing in the kernel. Cost is a host / evaluator concern computed from these counts.