Skip to main content
ot.distill() runs the full 4-phase BOND distillation pipeline (data-generation → curation → training → export) in-process and returns a callable Student. No FastAPI service, no ClickHouse, no job-polling — if you just want to train a student from a dataset and use it, this is the API. For the long-running, queued, multi-tenant REST flow, see the Distiller client — same engine, different surface.
import opentracy as ot

student = ot.distill(
    dataset="tickets.jsonl",
    teacher="openai/gpt-4o",
    student="llama-3.2-1b",
    steps=60,
)

# Use it directly
print(student("Classify: please refund the double charge"))

# Or serve it under a logical name
student.deploy("ticket-classifier")
ot.completion(model="ticket-classifier", messages=[...])
ot.distill() needs opentracy[distill] and a CUDA GPU. It fails fast (before any teacher API spend) if torch can’t import or torch.cuda.is_available() is False.

Signature

ot.distill(
    dataset: str | PathLike | list[dict] | Callable[[], Iterable[dict]],
    *,
    teacher: str = "openai/gpt-4o",
    student: str = "llama-3.2-1b",
    num_prompts: Optional[int] = None,
    steps: int = 500,
    n_samples: int = 4,
    bond_beta: float = 0.5,
    bond_gamma: float = 0.1,
    temperature: float = 0.8,
    output_dir: Optional[str | PathLike] = None,
    quantize: Optional[str | list[str]] = "q4_k_m",
    engine_url: Optional[str] = None,
    on_progress: Optional[Callable[[dict], None]] = None,
) -> Student

Parameters

NameTypeDescription
datasetpath / list / callableA .jsonl / .json file, a list of {"prompt": ..., "response": ...} dicts, or a zero-arg callable that yields them (for streaming from traces).
teacherstrProvider-prefixed teacher model (e.g. "openai/gpt-4o", "anthropic/claude-sonnet-4-6").
studentstrShort alias ("llama-3.2-1b") or a full HF repo id. Aliases map via opentracy.distillation.schemas.STUDENT_MODEL_MAP.
num_promptsint?Cap on prompts consumed from dataset. Default: all of them.
stepsintFine-tune optimizer steps. Small datasets need fewer (60–100 works for 20–50 rows).
n_samplesintBest-of-N candidates generated by the teacher per prompt (BOND).
bond_betafloatBOND preference weight. Defaults fine for classification.
bond_gammafloatKL regularization strength. Raise if the student overfits.
temperaturefloatTeacher sampling temperature for candidate generation.
output_dirpath?Where artifacts land. Defaults to a fresh temp dir.
quantizestr | list[str] | NoneGGUF quantization(s) to export. "q4_k_m" (default) is ~500 MB; None skips the GGUF phase and returns a PEFT adapter.
engine_urlstr?Override the Go engine URL used for teacher calls. If unset, a fresh engine is spawned for the duration and torn down at the end.
on_progresscallable?Fires once per pipeline phase transition plus any log line, with a dict {"job_id", "phase", "status", "progress", "log"}.

The Student returned by distill()

A thin wrapper around the freshest artifact.
student = ot.distill(...)

student.backend      # "gguf" if a quantization was exported, else "peft"
student.model_path   # absolute path to the .gguf file or the adapter dir
student.base_model   # HF repo id of the base model (needed for PEFT load)

student("Classify: refund please")          # direct inference
student.batch(["Classify: ...", "Classify: ..."])
student.generate(messages=[...])            # full OpenAI-shape response

student.save("./ticket-classifier-v1")      # copy artifact to a durable path
student.deploy("ticket-classifier")         # register under a local alias
After .deploy(alias), calling ot.completion(model=alias, ...) dispatches to this student locally — no provider call, no HTTP hop. See the Student reference below for the full API.

Dataset shapes

All three are equivalent:
# 1. Path to a .jsonl (one dict per line) or .json (single list)
ot.distill(dataset="tickets.jsonl", ...)

# 2. List of dicts
rows = [
    {"prompt": "Classify: ...", "response": "billing"},
    {"prompt": "Classify: ...", "response": "technical"},
]
ot.distill(dataset=rows, ...)

# 3. Callable that yields dicts — useful for streaming from traces
def from_clickhouse():
    for trace in my_trace_source():
        yield {"prompt": trace["prompt"], "response": trace["label"]}

ot.distill(dataset=from_clickhouse, ...)
Row field aliases: prompt / input / text all work for the input; response / expected_output all work for the gold answer.

Progress callback

Useful for building a UI around the run or just keeping a tidy timeline in a notebook:
last_phase = None
def on_progress(evt):
    global last_phase
    if evt["phase"] and evt["phase"] != last_phase:
        print(f"→ {evt['phase']}")
        last_phase = evt["phase"]
    if evt["log"]:
        print(f"   {evt['log']}")

ot.distill(dataset=rows, on_progress=on_progress)

Graceful export fallback

If phase 4 (GGUF conversion) fails — for example, llama.cpp isn’t installed on the host — ot.distill() does not crash. It logs a warning and returns a Student(backend="peft", model_path=<adapter>) pointing at the LoRA adapter that was successfully trained. You still get a working model; you just serve it via PEFT (1 GB base model in VRAM) instead of a standalone GGUF file. To force a GGUF-only path and raise on failure, call the REST-backed Distiller instead.

Errors — DistillError

ot.distill() raises opentracy.DistillError for pipeline failures. Common causes:
MessageCauseFix
Training needs PyTorch, but \import torch` failed.`torch not installed.pip install -U opentracy[distill].
No CUDA GPU is visible to PyTorch.Training phase is CUDA-only.Run on a GPU host; on Colab switch Runtime → T4.
Dataset is empty — distill() needs at least one prompt.Empty dataset file or all rows filtered out.Check the file format + field names.
Distillation requires the \[distill]` extra.`opentracy installed without training deps.pip install -U 'opentracy[distill]'.
The preflight that raises the torch/CUDA errors can be bypassed in tests by setting OPENTRACY_SKIP_DISTILL_PREFLIGHT=1. Don’t set this in production — it’ll let a GPU-less job burn money on teacher calls before dying in phase 3.

Student class

opentracy.Student is callable and serializes to disk. It’s what ot.distill() returns, but you can also instantiate it yourself to load a previously trained adapter.
from opentracy import Student

# Load a PEFT adapter from a path you saved earlier
student = Student(
    backend="peft",
    model_path="./ticket-classifier-v1",
    base_model="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
)

student("Classify: refund the double charge")

Constructor

Student(
    backend: Literal["peft", "gguf"],
    model_path: str,
    base_model: Optional[str] = None,
    metadata: dict = {},
)
NameTypeDescription
backend"peft" or "gguf""peft" loads a LoRA adapter onto a base model (requires transformers, peft, torch). "gguf" loads a GGUF file via llama_cpp — CPU-friendly, no base model needed.
model_pathstrAbsolute path to the adapter directory (PEFT) or the .gguf file (GGUF).
base_modelstr?HF repo id. Required for peft — read from adapter_config.json if omitted.
metadatadictFree-form metadata persisted to disk via .save() and surfaced through the alias registry.

Methods

student(prompt, max_new_tokens=512, temperature=0.0, **kwargs) → str

Single-prompt inference. Returns the text response.

student.batch(prompts, max_new_tokens=512, temperature=0.0) → list[str]

Many prompts in one call (sequential).

student.generate(messages, *, max_tokens=512, temperature=0.0, top_p=None, stop=None) → dict

Full-chat-shape generation. Returns an OpenAI-shaped dict — this is what ot.completion(model=<Student instance>, ...) dispatches to internally.

student.save(path) → Path

Copy the artifact (adapter dir or .gguf file) to a durable location. Returns the resolved destination path.

student.deploy(alias, engine_url=None) → dict

Register the student under alias in the local file-based registry (~/.opentracy/aliases.json). After this, ot.completion(model=alias, ...) resolves to this student. If engine_url is provided, the call also POSTs to the engine’s /v1/models/register so server-side callers see the alias — failures there only emit a warning.

Preflights at load time

Loading a PEFT student checks:
  • torch, transformers, and peft are importable — else raises StudentError pointing at pip install opentracy[distill].
  • jinja2 >= 3.1 is present — else raises StudentError with the exact {sys.executable} -m pip install -U 'jinja2>=3.1' command for the interpreter currently running. Stale jinja2 3.0.x in a system Python is a common footgun when a uvicorn on PATH picks up a different interpreter than the one that has opentracy installed.

Alias registry

Aliases map a logical name to a Student. The registry lives at ~/.opentracy/aliases.json (or $OPENTRACY_DATA_HOME/aliases.json) and is read by ot.completion() on every call.
import opentracy as ot

# Register — equivalent to student.deploy("ticket-classifier")
ot.set_alias(
    "ticket-classifier",
    backend="peft",
    model_path="/abs/path/to/adapter",
    base_model="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
)

# Introspect
ot.list_aliases()
# {"ticket-classifier": {"backend": "peft", "model_path": "...", "base_model": "...", "metadata": {...}, "registered_at": "..."}}

ot.get_alias("ticket-classifier")  # single entry or None

# Remove
ot.unset_alias("ticket-classifier")  # True if removed, False if not registered
Any ot.completion(model="ticket-classifier", ...) call from any Python process owned by the same user will resolve through this registry and dispatch locally — no provider call, no HTTP hop, no shared state with a remote engine. The alias-swap pattern:
# Day 1 — alias points at a provider
ot.set_alias("smart", backend="peft", ...)          # or call through the engine

# Day 10 — distill a student from traffic
student = ot.distill(dataset=recent_traces, ...)
student.deploy("smart")                              # atomic re-point

# App code for "smart" never changed
ot.completion(model="smart", messages=[...])

Serving the alias as an OpenAI-compatible HTTP endpoint

For a network-accessible endpoint, wrap the alias in a few lines of FastAPI:
# serve.py
from fastapi import FastAPI
from pydantic import BaseModel
import opentracy as ot

app = FastAPI()

class ChatRequest(BaseModel):
    model: str = "ticket-classifier"
    messages: list
    max_tokens: int = 64
    temperature: float = 0.0

@app.post("/v1/chat/completions")
def complete(req: ChatRequest):
    return ot.completion(
        model=req.model,
        messages=req.messages,
        max_tokens=req.max_tokens,
        temperature=req.temperature,
    )
Launch with the interpreter that has opentracy installed (not a random uvicorn on PATH):
python -m pip install fastapi uvicorn
python -m uvicorn serve:app --host 0.0.0.0 --port 9000

Next

Distillation concepts

What the 4-phase pipeline is and when to retrain.

Distiller (REST client)

Long-running, queued jobs against a remote engine.