Evaluations
LLM-as-judge against recorded sessions, with custom rubrics.
thirdeye lets you grade a recorded session by dispatching one of your installed CLI agents (claude, codex, or gemini) as an LLM-as-judge. Rubrics, called eval definitions, are named blocks of directive text; thirdeye ships sensible defaults, and every definition is editable per user. Results are appended to <session>/evals.jsonl (the file is append-only), and each per-turn finding is anchored to the seq of the event it comments on, so thirdeye events <id> can annotate its timeline with the findings inline.
The eval run itself is captured as a thirdeye-traced session, so every grading run has its own audit trail.
Eval definitions (rubrics)
Three shipped defaults are lazily materialized into <thirdeye_home>/evals/defs/ the first time you list them, so you can edit them in place:
- default: general-purpose adherence check.
- token-efficiency: flags redundant or oversized turns.
- tool-quality: flags incorrect tool selection and error-recovery patterns.
thirdeye eval def list # available rubrics
thirdeye eval def show default # see the directive
thirdeye eval def create my-rubric --directive "<text>" # custom rubric
thirdeye eval def edit my-rubric # open in $EDITOR
thirdeye eval def rm my-rubric # delete
A directive is just markdown text: describe what to look for, what verdicts mean, and what shape findings should take.
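As an illustration of those three parts, here is a minimal sketch of a custom directive and a wrapper that registers it. The rubric name citation-check, the directive wording, and the create_rubric helper are all hypothetical, and the sketch assumes --directive accepts multi-line text passed as a single argument, which this page does not confirm.

```shell
# Hypothetical directive text, written to a temp file for illustration only.
directive_file=$(mktemp)
cat > "$directive_file" <<'EOF'
# Citation check

## What to look for
Flag any turn where the assistant states a fact about the codebase
without first reading the relevant file.

## Verdicts
- pass: no unsupported claims.
- warn: one or two unsupported claims that did not change the outcome.
- fail: an unsupported claim led to an incorrect edit or answer.

## Findings
One finding per offending turn: quote the claim and say what should
have been read first.
EOF

# Hypothetical helper: $1 is the rubric name, $2 is the directive file.
# Assumes --directive takes the whole markdown text as one argument.
create_rubric() {
  thirdeye eval def create "$1" --directive "$(cat "$2")"
}

# Example call (not run here):
# create_rubric citation-check "$directive_file"
```

Keeping the directive in a file makes it easy to version alongside your project; thirdeye eval def edit remains the simpler route for small tweaks.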
Running an eval
thirdeye eval run <id> --agent claude
thirdeye eval run <id> --using token-efficiency --agent gemini
--using selects the rubric (defaults to default). --agent picks which CLI runs as judge.
The dispatched agent runs read-only with platform-specific sandboxes:
- Claude: --allowedTools allowlist limited to thirdeye and sqlite3.
- Codex: --sandbox read-only.
- Gemini: --approval-mode plan.
No new Python deps — thirdeye shells out to the agent binaries you already have installed.
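Since the judge is whichever agent binary is already on your machine, a wrapper script may want to check what is installed before choosing a value for --agent. A minimal sketch, assuming the binaries are named exactly claude, codex, and gemini; first_installed_agent is a hypothetical helper, not a thirdeye command.

```shell
# Print the first judge CLI found on PATH, or return 1 if none is installed.
# Assumes the binary names match the --agent values (claude, codex, gemini).
first_installed_agent() {
  for agent in claude codex gemini; do
    if command -v "$agent" >/dev/null 2>&1; then
      printf '%s\n' "$agent"
      return 0
    fi
  done
  return 1
}

# Example call (not run here):
# thirdeye eval run <id> --agent "$(first_installed_agent)"
```

The search order here is arbitrary; reorder the list to match the judge you prefer.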
Background mode
For long-running evals, detach into the background:
thirdeye eval run <id> --agent claude --background
thirdeye eval status # poll background jobs
Exit code is 0 regardless of verdict: the verdict is in the result, not the process status.
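Because the process exits 0 either way, a script that should fail a CI job on a failing verdict has to read the verdict from the result itself. A minimal sketch, assuming each line of <session>/evals.jsonl is a JSON object with a top-level "verdict" string; that field name and the verdict_exit_code helper are illustrative assumptions, not documented schema.

```shell
# Hypothetical helper: map the newest verdict in an evals.jsonl file to an exit code.
# Assumes one JSON object per line with a "verdict" key (not confirmed by the docs).
verdict_exit_code() {
  evals_file=$1
  # The file is append-only, so the last line is the newest result.
  # grep -o plus head keeps only the first "verdict" key on that line.
  verdict=$(tail -n 1 "$evals_file" 2>/dev/null |
    grep -o '"verdict"[[:space:]]*:[[:space:]]*"[a-z]*"' |
    head -n 1 |
    sed 's/.*"\([a-z]*\)"$/\1/')
  case "$verdict" in
    pass|warn) return 0 ;;
    fail)      return 1 ;;
    *)         return 2 ;;  # missing file, empty file, or unexpected shape
  esac
}

# Example call (not run here), with the session directory left as a placeholder:
# verdict_exit_code "<session>/evals.jsonl" || exit 1
```

Treating warn as success is a design choice in this sketch; move it to the failing branch for a stricter gate. If the real schema differs, a proper JSON parser is a safer choice than pattern matching.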
Viewing results
thirdeye eval show <id> # latest result for a session
thirdeye eval show <id> --using my-rubric
thirdeye eval list # history across all sessions
thirdeye eval list --since 2026-05-01 --verdict warn
Each result has a verdict (pass / warn / fail), score(s), and per-turn findings.
Inline annotations
By default, thirdeye events <id> and thirdeye event <id> <seq> weave findings into the event timeline using each finding's anchored seq:
thirdeye events <id> # annotated by default
thirdeye events <id> --no-findings # suppress
thirdeye events <id> --eval token-efficiency # only one rubric's findings
This is the fastest way to read an eval: you see the model's critique sitting right next to the turn it's critiquing.
Tips
- Audit your judges. Because each eval is itself a traced session, you can run thirdeye eval run against the eval's own session to grade the grader.
- Compare rubrics on the same session. Run two rubrics and thirdeye eval list <id> shows both, side by side.
- Start with the defaults. default, token-efficiency, and tool-quality cover most one-off questions. Author a custom rubric when you find yourself asking the same question repeatedly.
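The rubric-comparison tip can be scripted as a small loop. This is a sketch rather than a built-in feature: run_rubrics is a hypothetical helper that uses only the eval run and eval list invocations shown on this page.

```shell
# Hypothetical helper: grade one session with several rubrics, then list the results.
# Usage: run_rubrics <session-id> <agent> <rubric> [<rubric> ...]
run_rubrics() {
  session_id=$1
  agent=$2
  shift 2
  for rubric in "$@"; do
    # A nonzero exit means the run itself failed; the verdict never affects it.
    thirdeye eval run "$session_id" --using "$rubric" --agent "$agent" || return 1
  done
  thirdeye eval list "$session_id"
}

# Example call (not run here):
# run_rubrics <id> claude default token-efficiency
```

The runs are sequential, so the final listing shows every rubric's result for the session in one place.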