
expected_error + λ·cost.
The three moving parts
1. Embedder
Every prompt is run through a sentence embedder (MiniLM-L6-v2, bundled in the wheel) to produce a 384-dimensional vector. This is a pure function: the same prompt always produces the same vector.
2. Cluster assigner
The embedder output is assigned to one of 100 pre-trained semantic clusters. Cluster centroids live in the weights package you downloaded on first run. Examples (cluster names from the default weights):
- cluster 47 → “mathematical proofs and formal reasoning”
- cluster 84 → “short factual lookup”
- cluster 88 → “data-structure code generation”
- cluster 29 → “creative short-form writing”
By default each prompt gets a single hard cluster assignment; pass use_soft_assignment=True when loading the router to get weighted (soft) assignments across clusters instead.
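Mechanically, assignment against stored centroids amounts to nearest-centroid search. The sketch below uses toy 2-D vectors in place of the 384-dim embeddings; the soft variant (a softmax over negative squared distances) is one plausible weighting scheme, not necessarily the one the shipped weights use:

```python
import math

def assign(embedding, centroids, soft=False):
    # squared Euclidean distance from the embedding to each centroid
    dists = [sum((e - c) ** 2 for e, c in zip(embedding, cen)) for cen in centroids]
    if not soft:
        return dists.index(min(dists))        # hard: a single cluster id
    # soft: softmax over negative distances -> one weight per cluster
    exps = [math.exp(-d) for d in dists]
    total = sum(exps)
    return [x / total for x in exps]

centroids = [[0.0, 0.0], [10.0, 10.0]]        # toy stand-ins for 100 centroids
print(assign([1.0, 1.0], centroids))          # 0 (closest centroid)
weights = assign([1.0, 1.0], centroids, soft=True)
```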
3. Per-model error profiles
For every model the router knows about, there’s a vector Ψ of length 100: Ψ[i] is the model’s empirical error rate on cluster i. Error is measured as “fraction of validation examples where this model got it wrong” during profile fitting.
A routing decision then picks the candidate model that minimizes expected_error + λ·cost, where expected_error is Ψ[assigned cluster]. λ is the cost_weight argument you pass to load_router().
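As a minimal, self-contained sketch of that scoring rule (all numbers invented; in the real router the profiles are the length-100 Ψ vectors and costs come from the registry):

```python
def choose_model(cluster, profiles, costs, cost_weight=0.5):
    # score(m) = expected_error + lambda * cost, with expected_error = psi_m[cluster]
    return min(profiles, key=lambda m: profiles[m][cluster] + cost_weight * costs[m])

# two made-up models with per-cluster error rates (length 100 in reality)
profiles = {"frontier": [0.04, 0.30], "cheap": [0.06, 0.85]}
costs = {"frontier": 1.0, "cheap": 0.08}

print(choose_model(0, profiles, costs))   # "cheap": 0.10 beats 0.54
print(choose_model(1, profiles, costs))   # "frontier": 0.80 beats 0.89
```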
Using it
The whole thing collapses to two lines: load the router, then call .route() on a prompt. What you get back is a RoutingDecision (full schema in the Router reference).
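The exact import path and full RoutingDecision schema are in the Router reference; purely to show the call shape, here is a stand-in (MockRouter and every field below are illustrative assumptions mirroring quantities discussed on this page, not the library’s actual API):

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    # illustrative fields only; see the Router reference for the real schema
    model: str
    cluster: int
    expected_error: float
    cost: float

class MockRouter:
    """Stand-in with the same call shape as a loaded router."""
    def route(self, prompt: str) -> RoutingDecision:
        # canned answer; the real router embeds, clusters, and scores
        return RoutingDecision(model="cheap-model", cluster=84,
                               expected_error=0.08, cost=0.1)

router = MockRouter()                      # real code: router = load_router(...)
decision = router.route("What year did the Berlin Wall fall?")
```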
Tuning the cost-quality dial
cost_weight (λ) is the one knob you’ll actually touch:
| λ | Behavior |
|---|---|
| 0.0 | Pick whichever model has the lowest predicted error; ignore cost. |
| 0.5 | Balanced — the common default. A tiny error delta won’t justify a 10× cost. |
| 1.0 | Strongly prefer cheaper models; only escalate if they’re demonstrably bad. |
| 2.0+ | Aggressively cheap; escalate only on the worst prompts. |
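To see the dial move, here is the arithmetic for two invented models: a strong one (error 0.05, cost 1.0) and a cheap one (error 0.12, cost 0.1):

```python
models = {"strong": (0.05, 1.0), "cheap": (0.12, 0.1)}   # (error, cost), toy numbers

def pick(lam):
    # same score as the router: error + lambda * cost
    return min(models, key=lambda m: models[m][0] + lam * models[m][1])

print(pick(0.0))   # "strong": pure error comparison, 0.05 < 0.12
print(pick(0.5))   # "cheap": 0.12 + 0.05 = 0.17 beats 0.05 + 0.50 = 0.55
```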
Restricting the candidate pool
By default the router considers every model in the loaded registry; you can restrict it to a subset of candidates (see the Router reference for the exact parameters).
The two backends
load_router has a single parameter you’ll barely ever touch: engine.
- engine="go" (default) — spawns the bundled Go engine as a subprocess. Fast (sub-millisecond routing); the production path. The Go binary is bundled per-platform; if it isn’t present you’ll see a clear error.
- engine="python" — pure Python implementation, no subprocess. Slower, but useful in environments where process spawning is forbidden or where you want to introspect every internal (e.g. swap the cluster assigner, monkey-patch profiles).
- engine="auto" — prefer Go, fall back to Python if the binary is missing. Not recommended as a default because the fallback is silent — if something’s wrong with the binary, you want to know, not route 10× slower without noticing.
How routing changes over time
The profiles you loaded are from a benchmark the weights were trained on. Your production traffic will be different — maybe your users ask more code questions than the benchmark assumed. Two mechanisms adapt the router:
- blend_with_profiles — periodically combine the benchmark’s per-model error profile with the one observed in production: Ψ_new = α · Ψ_prod + (1 − α) · Ψ_benchmark. The feedback module has utilities for this; see the “self-learning” section of the basic_router_to_self_learning notebook.
- Alias swapping — when a distilled student is ready for a cluster you’ve worked on, you add it to the registry, point the alias at it, and from that moment the router can select it for prompts in that cluster. See Distillation.
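The blend itself is just an elementwise weighted average. A sketch, assuming profiles are plain per-cluster error lists (the feedback module’s actual helpers may differ):

```python
def blend_profiles(psi_prod, psi_benchmark, alpha=0.5):
    # psi_new = alpha * psi_prod + (1 - alpha) * psi_benchmark, per cluster
    return [alpha * p + (1 - alpha) * b for p, b in zip(psi_prod, psi_benchmark)]

# production shows a worse error on this cluster than the benchmark claimed
print(blend_profiles([0.75], [0.25]))   # [0.5]
```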
When auto-routing isn’t enough
For two shapes of problem, auto-routing alone won’t cut it:
- You have hard policy constraints. “Never route X to Anthropic.” In that case, combine with a Router (explicit, rule-based) — the logical alias can still be semantic, but the candidates are constrained.
- Your prompts don’t cluster well. If everything you do is one narrow domain that doesn’t match any of the pre-trained clusters, you’ll get mediocre routing decisions. Solution: retrain the weights on your traffic (opentracy.training.full_training_pipeline), or fall back to Router with hand-picked deployments.
Next
- Distillation: the counterpart — how the student models that auto-routing swaps in get built.
- Router reference: load_router parameters, .route() / .route_batch() signatures, full RoutingDecision schema.
