Applies to:
- Plan -
- Deployment -
Summary
Goal: Make interrupted eval workers resume onto the same experiment rows instead of appending duplicates. Features:upsert_id (TypeScript Eval), experiment.log() with stable id, _object_delete insert flag, attemptId metadata tagging.
Configuration steps
Step 1: Set a stable row ID in TypeScript Eval
Useupsert_id on each EvalCase, derived from a value that is stable across retries (e.g., sample ID or input hash).
upsert_id land on the same logical row in the UI.
Step 2: Set a stable row ID when logging manually
If you are using lower-level experiment logging instead of the Eval API, pass the stable key asid.
experiment.log() directly.
Step 3: Understand Python Eval limitations
Eval() in Python does not expose an equivalent upsert_id field on EvalCase. Options:
- Use lower-level
experiment.log()with a stableid(see Step 2). - Implement resume logic in your worker to skip already-completed cases before starting the eval.
Step 4: Tag every span with a unique attemptId
Stable row IDs do not automatically remove child spans from a prior interrupted attempt. Tag every span (root and children) with an attempt-scoped value so stale spans can be identified later.
Step 5: Delete stale child spans from a prior attempt
Query for spans with the oldattemptId, then delete them via the insert endpoint.
Step 6: Alternative — use a fresh experiment per attempt
If managing stale span cleanup is too complex, create a new experiment for each retry attempt and designate the final successful experiment as canonical. This avoids orphan span cleanup entirely.Behavior notes
- Experiment inserts are append-only. Reusing a stable
iddoes not rewrite history; it creates a new version. The UI displays the latest version per row. - Reusing a stable row
idon retry does not delete child spans from the previous attempt. Orphan spans must be removed explicitly using_object_delete. - Do not manually set
span_idorroot_span_idas the primary deduplication strategy. The high-level Eval runner manages these internally, and overriding them can produce rows with multiple traces.