ot.distill

ot.distill() runs the full 4-phase BOND distillation pipeline (data-generation → curation → training → export) in-process and returns a callable Student. No FastAPI service, no ClickHouse, no job-polling — if you just want to train a student from a dataset and use it, this is the API. For the long-running, queued, multi-tenant REST flow, see the Distiller client — same engine, different surface.

import opentracy as ot

student = ot.distill(
    dataset="tickets.jsonl",
    teacher="openai/gpt-4o",
    student="llama-3.2-1b",
    steps=60,
)

# Use it directly
print(student("Classify: please refund the double charge"))

# Or serve it under a logical name
student.deploy("ticket-classifier")
ot.completion(model="ticket-classifier", messages=[...])

ot.distill() needs opentracy[distill] and a CUDA GPU. It fails fast (before any teacher API spend) if torch can’t import or torch.cuda.is_available() is False.

Signature

ot.distill(
    dataset: str | PathLike | list[dict] | Callable[[], Iterable[dict]],
    *,
    teacher: str = "openai/gpt-4o",
    student: str = "llama-3.2-1b",
    num_prompts: Optional[int] = None,
    steps: int = 500,
    n_samples: int = 4,
    bond_beta: float = 0.5,
    bond_gamma: float = 0.1,
    temperature: float = 0.8,
    output_dir: Optional[str | PathLike] = None,
    quantize: Optional[str | list[str]] = "q4_k_m",
    engine_url: Optional[str] = None,
    on_progress: Optional[Callable[[dict], None]] = None,
) -> Student

Parameters

Name	Type	Description
`dataset`	path / list / callable	A `.jsonl` / `.json` file, a list of `{"prompt": ..., "response": ...}` dicts, or a zero-arg callable that yields them (for streaming from traces).
`teacher`	`str`	Provider-prefixed teacher model (e.g. `"openai/gpt-4o"`, `"anthropic/claude-sonnet-4-6"`).
`student`	`str`	Short alias (`"llama-3.2-1b"`) or a full HF repo id. Aliases map via `opentracy.distillation.schemas.STUDENT_MODEL_MAP`.
`num_prompts`	`int?`	Cap on prompts consumed from `dataset`. Default: all of them.
`steps`	`int`	Fine-tune optimizer steps. Small datasets need fewer (60–100 works for 20–50 rows).
`n_samples`	`int`	Best-of-N candidates generated by the teacher per prompt (BOND).
`bond_beta`	`float`	BOND preference weight. Defaults fine for classification.
`bond_gamma`	`float`	KL regularization strength. Raise if the student overfits.
`temperature`	`float`	Teacher sampling temperature for candidate generation.
`output_dir`	`path?`	Where artifacts land. Defaults to a fresh temp dir.
`quantize`	`str \| list[str] \| None`	GGUF quantization(s) to export. `"q4_k_m"` (default) is ~500 MB; `None` skips the GGUF phase and returns a PEFT adapter.
`engine_url`	`str?`	Override the Go engine URL used for teacher calls. If unset, a fresh engine is spawned for the duration and torn down at the end.
`on_progress`	`callable?`	Fires once per pipeline phase transition plus any log line, with a dict `{"job_id", "phase", "status", "progress", "log"}`.

The `Student` returned by `distill()`

A thin wrapper around the freshest artifact.

student = ot.distill(...)

student.backend      # "gguf" if a quantization was exported, else "peft"
student.model_path   # absolute path to the .gguf file or the adapter dir
student.base_model   # HF repo id of the base model (needed for PEFT load)

student("Classify: refund please")          # direct inference
student.batch(["Classify: ...", "Classify: ..."])
student.generate(messages=[...])            # full OpenAI-shape response

student.save("./ticket-classifier-v1")      # copy artifact to a durable path
student.deploy("ticket-classifier")         # register under a local alias

After .deploy(alias), calling ot.completion(model=alias, ...) dispatches to this student locally — no provider call, no HTTP hop. See the Student reference below for the full API.

Dataset shapes

All three are equivalent:

# 1. Path to a .jsonl (one dict per line) or .json (single list)
ot.distill(dataset="tickets.jsonl", ...)

# 2. List of dicts
rows = [
    {"prompt": "Classify: ...", "response": "billing"},
    {"prompt": "Classify: ...", "response": "technical"},
]
ot.distill(dataset=rows, ...)

# 3. Callable that yields dicts — useful for streaming from traces
def from_clickhouse():
    for trace in my_trace_source():
        yield {"prompt": trace["prompt"], "response": trace["label"]}

ot.distill(dataset=from_clickhouse, ...)

Row field aliases: prompt / input / text all work for the input; response / expected_output all work for the gold answer.

Progress callback

Useful for building a UI around the run or just keeping a tidy timeline in a notebook:

last_phase = None
def on_progress(evt):
    global last_phase
    if evt["phase"] and evt["phase"] != last_phase:
        print(f"→ {evt['phase']}")
        last_phase = evt["phase"]
    if evt["log"]:
        print(f"   {evt['log']}")

ot.distill(dataset=rows, on_progress=on_progress)

Graceful export fallback

If phase 4 (GGUF conversion) fails — for example, llama.cpp isn’t installed on the host — ot.distill() does not crash. It logs a warning and returns a Student(backend="peft", model_path=<adapter>) pointing at the LoRA adapter that was successfully trained. You still get a working model; you just serve it via PEFT (1 GB base model in VRAM) instead of a standalone GGUF file. To force a GGUF-only path and raise on failure, call the REST-backed Distiller instead.

Errors — `DistillError`

ot.distill() raises opentracy.DistillError for pipeline failures. Common causes:

Message	Cause	Fix
`Training needs PyTorch, but \`import torch` failed.`	`torch` not installed.	`pip install -U opentracy[distill]`.
`No CUDA GPU is visible to PyTorch.`	Training phase is CUDA-only.	Run on a GPU host; on Colab switch Runtime → T4.
`Dataset is empty — distill() needs at least one prompt.`	Empty dataset file or all rows filtered out.	Check the file format + field names.
`Distillation requires the \`[distill]` extra.`	`opentracy` installed without training deps.	`pip install -U 'opentracy[distill]'`.

The preflight that raises the torch/CUDA errors can be bypassed in tests by setting OPENTRACY_SKIP_DISTILL_PREFLIGHT=1. Don’t set this in production — it’ll let a GPU-less job burn money on teacher calls before dying in phase 3.

Student class

opentracy.Student is callable and serializes to disk. It’s what ot.distill() returns, but you can also instantiate it yourself to load a previously trained adapter.

from opentracy import Student

# Load a PEFT adapter from a path you saved earlier
student = Student(
    backend="peft",
    model_path="./ticket-classifier-v1",
    base_model="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
)

student("Classify: refund the double charge")

Constructor

Student(
    backend: Literal["peft", "gguf"],
    model_path: str,
    base_model: Optional[str] = None,
    metadata: dict = {},
)

Name	Type	Description
`backend`	`"peft"` or `"gguf"`	`"peft"` loads a LoRA adapter onto a base model (requires `transformers`, `peft`, `torch`). `"gguf"` loads a GGUF file via `llama_cpp` — CPU-friendly, no base model needed.
`model_path`	`str`	Absolute path to the adapter directory (PEFT) or the `.gguf` file (GGUF).
`base_model`	`str?`	HF repo id. Required for `peft` — read from `adapter_config.json` if omitted.
`metadata`	`dict`	Free-form metadata persisted to disk via `.save()` and surfaced through the alias registry.

Methods

`student(prompt, max_new_tokens=512, temperature=0.0, **kwargs) → str`

Single-prompt inference. Returns the text response.

`student.batch(prompts, max_new_tokens=512, temperature=0.0) → list[str]`

Many prompts in one call (sequential).

`student.generate(messages, *, max_tokens=512, temperature=0.0, top_p=None, stop=None) → dict`

Full-chat-shape generation. Returns an OpenAI-shaped dict — this is what ot.completion(model=<Student instance>, ...) dispatches to internally.

`student.save(path) → Path`

Copy the artifact (adapter dir or .gguf file) to a durable location. Returns the resolved destination path.

`student.deploy(alias, engine_url=None) → dict`

Register the student under alias in the local file-based registry (~/.opentracy/aliases.json). After this, ot.completion(model=alias, ...) resolves to this student. If engine_url is provided, the call also POSTs to the engine’s /v1/models/register so server-side callers see the alias — failures there only emit a warning.

Preflights at load time

Loading a PEFT student checks:

torch, transformers, and peft are importable — else raises StudentError pointing at pip install opentracy[distill].
jinja2 >= 3.1 is present — else raises StudentError with the exact {sys.executable} -m pip install -U 'jinja2>=3.1' command for the interpreter currently running. Stale jinja2 3.0.x in a system Python is a common footgun when a uvicorn on PATH picks up a different interpreter than the one that has opentracy installed.

Alias registry

Aliases map a logical name to a Student. The registry lives at ~/.opentracy/aliases.json (or $OPENTRACY_DATA_HOME/aliases.json) and is read by ot.completion() on every call.

import opentracy as ot

# Register — equivalent to student.deploy("ticket-classifier")
ot.set_alias(
    "ticket-classifier",
    backend="peft",
    model_path="/abs/path/to/adapter",
    base_model="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
)

# Introspect
ot.list_aliases()
# {"ticket-classifier": {"backend": "peft", "model_path": "...", "base_model": "...", "metadata": {...}, "registered_at": "..."}}

ot.get_alias("ticket-classifier")  # single entry or None

# Remove
ot.unset_alias("ticket-classifier")  # True if removed, False if not registered

Any ot.completion(model="ticket-classifier", ...) call from any Python process owned by the same user will resolve through this registry and dispatch locally — no provider call, no HTTP hop, no shared state with a remote engine. The alias-swap pattern:

# Day 1 — alias points at a provider
ot.set_alias("smart", backend="peft", ...)          # or call through the engine

# Day 10 — distill a student from traffic
student = ot.distill(dataset=recent_traces, ...)
student.deploy("smart")                              # atomic re-point

# App code for "smart" never changed
ot.completion(model="smart", messages=[...])

Serving the alias as an OpenAI-compatible HTTP endpoint

For a network-accessible endpoint, wrap the alias in a few lines of FastAPI:

# serve.py
from fastapi import FastAPI
from pydantic import BaseModel
import opentracy as ot

app = FastAPI()

class ChatRequest(BaseModel):
    model: str = "ticket-classifier"
    messages: list
    max_tokens: int = 64
    temperature: float = 0.0

@app.post("/v1/chat/completions")
def complete(req: ChatRequest):
    return ot.completion(
        model=req.model,
        messages=req.messages,
        max_tokens=req.max_tokens,
        temperature=req.temperature,
    )

Launch with the interpreter that has opentracy installed (not a random uvicorn on PATH):

python -m pip install fastapi uvicorn
python -m uvicorn serve:app --host 0.0.0.0 --port 9000

Distillation concepts

What the 4-phase pipeline is and when to retrain.

Distiller (REST client)

Long-running, queued jobs against a remote engine.

Python SDK

REST API

Signature

Parameters

The `Student` returned by `distill()`

Dataset shapes

Progress callback

Graceful export fallback

Errors — `DistillError`

Student class

Constructor

Methods

`student(prompt, max_new_tokens=512, temperature=0.0, **kwargs) → str`

`student.batch(prompts, max_new_tokens=512, temperature=0.0) → list[str]`

`student.generate(messages, *, max_tokens=512, temperature=0.0, top_p=None, stop=None) → dict`

`student.save(path) → Path`

`student.deploy(alias, engine_url=None) → dict`

Preflights at load time

Alias registry

Serving the alias as an OpenAI-compatible HTTP endpoint

Next

Distillation concepts

Distiller (REST client)

​Signature

​Parameters

​The Student returned by distill()

​Dataset shapes

​Progress callback

​Graceful export fallback

​Errors — DistillError

​Student class

​Constructor

​Methods

​student(prompt, max_new_tokens=512, temperature=0.0, **kwargs) → str

​student.batch(prompts, max_new_tokens=512, temperature=0.0) → list[str]

​student.generate(messages, *, max_tokens=512, temperature=0.0, top_p=None, stop=None) → dict

​student.save(path) → Path

​student.deploy(alias, engine_url=None) → dict

​Preflights at load time

​Alias registry

​Serving the alias as an OpenAI-compatible HTTP endpoint

​Next

Distillation concepts

Distiller (REST client)

Signature

Parameters

The `Student` returned by `distill()`

Dataset shapes

Progress callback

Graceful export fallback

Errors — `DistillError`

Student class

Constructor

Methods

`student(prompt, max_new_tokens=512, temperature=0.0, **kwargs) → str`

`student.batch(prompts, max_new_tokens=512, temperature=0.0) → list[str]`

`student.generate(messages, *, max_tokens=512, temperature=0.0, top_p=None, stop=None) → dict`

`student.save(path) → Path`

`student.deploy(alias, engine_url=None) → dict`

Preflights at load time

Alias registry

Serving the alias as an OpenAI-compatible HTTP endpoint

Next