Use evals when you need to measure model behavior against your own tasks, datasets, scoring code, and operational settings. An eval has two layers:

| Layer | What it stores |
|---|---|
| Suite | A versioned manifest with tasks, datasets, prompt templates, preprocessors, graders, and metric definitions. |
| Run | A durable execution of a suite version against one or more models, with generation settings, judge model, embedding model, concurrency, and result artifacts. |
Candidate model calls go through POST /api/v1/llm/responses.
Model-backed Python graders call Responses and Embeddings through a gateway-owned bridge, so grader code never receives your API key.
Before you start
You need:

| Requirement | Notes |
|---|---|
| API key scopes | Use a key with write:evals to create suites and runs, read:evals to read results, and file upload access for /files. |
| Candidate model access | The run creator chooses the candidate model IDs in models. |
| Optional judge model access | Required when Python graders call ctx.responses_create(model="auto", ...). |
| Optional embedding model access | Required when Python graders call ctx.embeddings_create(model="auto", ...). |
| Dataset | Upload JSONL/CSV with purpose=evals, or reference a supported Hugging Face dataset. |
| Python grader | Provide inline Python in the manifest, or upload a .py file with purpose=evals. |
Send the X-On-Behalf-Of header when the eval belongs to a specific end user context.
Suites, runs, uploaded eval files, and result artifacts are scoped to the authenticated team context.
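A minimal sketch of the request headers, assuming the API key is sent as a bearer token in the Authorization header. This guide does not show the authentication scheme, so treat that part as an assumption. `build_headers` is a hypothetical helper, not part of any SDK.

```python
from typing import Dict, Optional


def build_headers(api_key: str, on_behalf_of: Optional[str] = None) -> Dict[str, str]:
    """Build request headers for eval calls.

    Assumption: the key is sent as a bearer token. Check your gateway's
    authentication docs for the exact scheme.
    """
    headers = {"Authorization": f"Bearer {api_key}"}
    # Attach X-On-Behalf-Of only when the eval belongs to a specific end user.
    if on_behalf_of:
        headers["X-On-Behalf-Of"] = on_behalf_of
    return headers
```

Leaving the header out entirely, rather than sending an empty value, keeps team-level requests distinguishable from end-user requests.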
Workflow
The normal flow is:
- Upload dataset and optional Python files through /files.
- Create an eval suite with a manifest.
- Start an eval run for one or more models.
- Poll the run until it reaches a terminal status.
- Inspect sample rows and download generated artifact files.
- Create a new suite version when you edit the manifest.
Run statuses include queued, running, completed, and failed.
Step 1 - Upload a dataset
Upload JSONL or CSV files with purpose=evals.
JSONL preserves nested objects and arrays.
CSV values are parsed as strings.
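The difference between the two formats can be seen with Python's standard library. This is an illustration only: the rows and column names are hypothetical, and the service's own parser may differ in details beyond what is stated above.

```python
import csv
import io
import json

# Hypothetical single-row dataset, written in both formats.
jsonl_text = '{"id": "1", "question": "2+2?", "choices": ["3", "4"], "meta": {"difficulty": 1}}\n'
csv_text = "id,question,difficulty\n1,2+2?,1\n"

# JSONL: one JSON object per line, so nested values keep their types.
jsonl_rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# CSV: every cell comes back as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

print(type(jsonl_rows[0]["choices"]).__name__)   # list
print(type(csv_rows[0]["difficulty"]).__name__)  # str
```

If a grader needs numbers or lists from a CSV dataset, it has to convert them itself, which is one reason to prefer JSONL for structured rows.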
For example, uploading eval-smoke.jsonl returns a file_... ID.
Step 2 - Upload a Python grader file
You can put grader source inline in the manifest. For reusable graders, upload Python files with purpose=evals.
For example, uploading exact_match_grader.py returns a file_... ID.
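The contents of the grader file are not reproduced here. As a rough sketch of the comparison an exact-match grader might perform, the helpers below score two strings. The function names and the normalization rule are assumptions, and the entry-point signature the platform expects is defined in the Python grader guide, not here.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the normalized strings are equal, otherwise 0.0."""
    return 1.0 if normalize(output) == normalize(expected) else 0.0
```

Whether to normalize at all is a design choice: strict byte-for-byte comparison is simpler but penalizes models for harmless whitespace or casing differences.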
Step 3 - Create a suite
A suite manifest defines one or more tasks. Each task renders a prompt from one dataset row, sends the prompt to each run model, extracts the model output, and grades the sample.
Creating a suite returns an eval.suite object.
Use the suite id when you start a run.
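The per-task flow described in this step (render, generate, extract, grade) can be pictured with a small local simulation. This is a conceptual sketch, not the manifest format or the service's implementation: `run_task`, the stub model, and the use of `str.format` as a template engine are all assumptions for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def run_task(
    rows: List[Row],
    template: str,
    models: Dict[str, Callable[[str], str]],
    extract: Callable[[str], str],
    grade: Callable[[str, Row], float],
) -> List[dict]:
    """Render, generate, extract, and grade every (row, model) pair."""
    samples = []
    for row in rows:
        # str.format stands in for the real template engine (assumption).
        prompt = template.format(**row)
        for model_id, generate in models.items():
            raw = generate(prompt)
            output = extract(raw)
            samples.append({
                "model": model_id,
                "prompt": prompt,
                "raw": raw,
                "output": output,
                "score": grade(output, row),
            })
    return samples


# Stub "model" that always gives the same answer, for illustration only.
samples = run_task(
    rows=[{"question": "2+2?", "answer": "4"}],
    template="Q: {question}\nA:",
    models={"stub-model": lambda prompt: "Answer: 4"},
    extract=lambda raw: raw.split(":")[-1].strip(),
    grade=lambda output, row: 1.0 if output == row["answer"] else 0.0,
)
print(samples[0]["score"])  # 1.0
```

Each dataset row produces one sample per model, which is why sample counts scale with both dataset size and the number of models in a run.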
Step 4 - Start a run
A run chooses the model or models to test. It can also choose a task subset, judge model, embedding model, generation settings, concurrency, and sample cap.
| Field | Purpose |
|---|---|
| suite_id | Suite to run. |
| suite_version | Optional immutable version number. Defaults to the active suite version. |
| models | Candidate model IDs. Maximum 20 per run. Duplicate model IDs are rejected. |
| task_ids | Optional subset of task IDs. Omit it to run every task in the suite version. |
| judge_model | Model used when Python grader code calls ctx.responses_create(model="auto", ...). |
| embedding_model | Model used when Python grader code calls ctx.embeddings_create(model="auto", ...). |
| generation | Candidate model settings and eval execution controls. |
| concurrency | Number of samples to process concurrently. Range is 1 to 25. |
| max_samples_per_task | Optional cap for smoke tests or partial runs. |
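The limits in the table can be checked client-side before a run is submitted. The sketch below assumes the request body is a flat JSON object keyed by the field names in the table. `build_run_request` is a hypothetical helper and the IDs in the usage example are placeholders.

```python
from typing import List, Optional


def build_run_request(
    suite_id: str,
    models: List[str],
    concurrency: Optional[int] = None,
    max_samples_per_task: Optional[int] = None,
    task_ids: Optional[List[str]] = None,
) -> dict:
    """Validate the documented limits and return a run request body."""
    if not models:
        raise ValueError("models must contain at least one model ID")
    if len(models) > 20:
        raise ValueError("models accepts at most 20 IDs per run")
    if len(set(models)) != len(models):
        raise ValueError("duplicate model IDs are rejected")
    if concurrency is not None and not 1 <= concurrency <= 25:
        raise ValueError("concurrency must be between 1 and 25")

    body = {"suite_id": suite_id, "models": list(models)}
    # Optional fields are omitted so that server-side defaults apply.
    if concurrency is not None:
        body["concurrency"] = concurrency
    if max_samples_per_task is not None:
        body["max_samples_per_task"] = max_samples_per_task
    if task_ids is not None:
        body["task_ids"] = list(task_ids)
    return body


# Placeholder IDs: a small smoke-test run against two models.
body = build_run_request("suite_123", ["model-a", "model-b"], concurrency=4, max_samples_per_task=10)
```

Failing fast on the client avoids a round trip for errors the API would reject anyway, though the server remains the source of truth for validation.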
Step 5 - Poll the run
Until the run completes, metrics is null or empty.
When it completes, metrics are grouped by model and by task.
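Polling can be sketched as a loop that stops on a terminal status. The code below assumes the run object exposes its state in a `status` field and treats completed, failed, and cancelled as terminal. `fetch_run` is a hypothetical callable you supply (for example, a function that performs the GET request and returns the parsed JSON).

```python
import time
from typing import Callable

# Terminal statuses named in this guide.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_run(
    fetch_run: Callable[[str], dict],
    run_id: str,
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_run until the run reaches a terminal status or polls run out."""
    for _ in range(max_polls):
        run = fetch_run(run_id)
        # Assumption: the run object carries its state in a "status" field.
        if run.get("status") in TERMINAL_STATUSES:
            return run
        sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the loop testable without real delays. The poll interval and cap are arbitrary defaults, not documented limits.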
Step 6 - Inspect samples
List samples when you need per-row debugging. You can filter by task_id, model, or status.
Each sample includes response_id, raw model output, extracted output, scores, judge details, and error details.
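A small sketch of building the samples request path with optional filters. The path and parameter names come from this guide; `samples_path` itself is a hypothetical helper, and the result is relative to whatever base URL your deployment uses.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def samples_path(
    run_id: str,
    task_id: Optional[str] = None,
    model: Optional[str] = None,
    status: Optional[str] = None,
    limit: Optional[int] = None,
    after: Optional[str] = None,
) -> str:
    """Return the samples path for a run, including only the filters that are set."""
    params = {"task_id": task_id, "model": model, "status": status, "limit": limit, "after": after}
    query = urlencode({key: value for key, value in params.items() if value is not None})
    path = f"/evals/runs/{quote(run_id, safe='')}/samples"
    return f"{path}?{query}" if query else path
```

Narrowing by model and task first usually makes per-row debugging faster than paging through every sample in a large run.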
Step 7 - Fetch artifacts
Completed runs create result files with purpose=evals.
Use the artifacts endpoint to find the result and sample artifact file IDs.
Edit a suite
Suites are versioned. Create a new immutable version when you change a manifest. Set make_active to false when you want to stage a draft version without making it the default for new runs.
Cancel a run
Cancel a run when it is queued, in progress, or finalizing.
A cancelled run reports the status cancelled.
Pagination and filtering
List endpoints use cursor pagination.

| Endpoint | Filters |
|---|---|
| GET /evals/suites | after, limit |
| GET /evals/suites/{suite_id}/versions | after, limit |
| GET /evals/runs | after, limit, suite_id, status |
| GET /evals/runs/{run_id}/samples | after, limit, task_id, model, status |
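Cursor pagination with the after parameter can be sketched as a generator. The response shape is not shown in this guide, so the code assumes each page is a dict with a `data` list and a `has_more` flag, and that the next cursor is the `id` of the last item. `fetch_page` is a hypothetical callable that takes the cursor and returns one parsed page.

```python
from typing import Callable, Iterator, Optional


def iter_items(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every item from a cursor-paginated list endpoint.

    Assumptions: pages look like {"data": [...], "has_more": bool} and the
    cursor for the next page is the "id" of the last item on this page.
    """
    after: Optional[str] = None
    while True:
        page = fetch_page(after)
        items = page.get("data", [])
        yield from items
        # Stop when the server reports no more pages or returns an empty page.
        if not page.get("has_more") or not items:
            return
        after = items[-1]["id"]
```

Stopping on an empty page as well as on `has_more` guards against looping forever if the flag is missing or wrong.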
What to read next
- Design eval task suites covers datasets, task manifests, templates, few-shot examples, output extraction, generation knobs, and metrics.
- Write Python eval graders covers sample, batch, and model-backed Python contracts.
- Use the endpoint paths in this guide with the generated request and response objects returned by the API.