Run evals - MKA1

Use evals when you need to measure model behavior against your own tasks, datasets, scoring code, and operational settings. An eval has two layers:

Layer	What it stores
Suite	A versioned manifest with tasks, datasets, prompt templates, preprocessors, graders, and metric definitions.
Run	A durable execution of a suite version against one or more models, with generation settings, judge model, embedding model, concurrency, and result artifacts.

Eval runs use normal MKA1 routing. Candidate generations go through POST /api/v1/llm/responses. Model-backed Python graders call Responses and Embeddings through a gateway-owned bridge, so grader code never receives your API key.

Before you start

You need:

Requirement	Notes
API key scopes	Use a key with `write:evals` to create suites and runs, `read:evals` to read results, and file upload access for `/files`.
Candidate model access	The run creator chooses the candidate model IDs in `models`.
Optional judge model access	Required when Python graders call `ctx.responses_create(model="auto", ...)`.
Optional embedding model access	Required when Python graders call `ctx.embeddings_create(model="auto", ...)`.
Dataset	Upload JSONL/CSV with `purpose=evals`, or reference a supported Hugging Face dataset.
Python grader	Provide inline Python in the manifest, or upload a `.py` file with `purpose=evals`.

Send the X-On-Behalf-Of header when the eval belongs to a specific end-user context. Suites, runs, uploaded eval files, and result artifacts are scoped to the authenticated team context.

Workflow

The normal flow is:

1. Upload dataset and optional Python files through /files.
2. Create an eval suite with a manifest.
3. Start an eval run for one or more models.
4. Poll the run until it reaches a terminal status.
5. Inspect sample rows and download generated artifact files.
6. Create a new suite version when you edit the manifest.

Eval run statuses move through:

queued -> in_progress -> finalizing -> completed
       \                         \-> failed
        \-> cancelling -> cancelled

Sample statuses are queued, generating, ready_to_score, scoring, running, completed, and failed.
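
The run status diagram above ends in three terminal states: completed, failed, and cancelled. A minimal polling sketch built on that reading follows. `wait_for_run` and `fetch_run` are hypothetical names, not part of the MKA1 API; `fetch_run` stands in for whatever code returns the parsed run object, and the demonstration uses canned responses instead of HTTP calls.

```python
import time
from typing import Callable

# Terminal run statuses, read off the status diagram in this guide.
TERMINAL_RUN_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_run(
    fetch_run: Callable[[], dict],
    interval_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_run() until the run reports a terminal status.

    fetch_run is a hypothetical callable that returns the run object as a
    dict, for example the parsed JSON body of a run retrieval request.
    """
    for _ in range(max_polls):
        run = fetch_run()
        if run.get("status") in TERMINAL_RUN_STATUSES:
            return run
        sleep(interval_seconds)
    raise TimeoutError(f"run still active after {max_polls} polls")


# Offline demonstration: canned statuses replace real API responses.
_canned = iter([
    {"status": "queued"},
    {"status": "in_progress"},
    {"status": "finalizing"},
    {"status": "completed"},
])
final_run = wait_for_run(lambda: next(_canned), sleep=lambda _seconds: None)
print(final_run["status"])
```

Injecting `fetch_run` and `sleep` keeps the loop independent of any HTTP client and easy to exercise without network access.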

Step 1 - Upload a dataset

Upload JSONL or CSV files with purpose=evals. JSONL preserves nested objects and arrays. CSV values are parsed as strings.

eval-smoke.jsonl

{"question":"Repeat exactly: MKA1_EVAL_SMOKE_OK","answer":"MKA1_EVAL_SMOKE_OK"}

curl https://apigw.mka1.com/api/v1/llm/files \
  --request POST \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --form 'purpose=evals' \
  --form 'file=@./eval-smoke.jsonl;type=application/jsonl'

const form = new FormData();
form.append('purpose', 'evals');
form.append('file', new Blob([
  '{"question":"Repeat exactly: MKA1_EVAL_SMOKE_OK","answer":"MKA1_EVAL_SMOKE_OK"}\n',
], { type: 'application/jsonl' }), 'eval-smoke.jsonl');

const fileRes = await fetch('https://apigw.mka1.com/api/v1/llm/files', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.MKA1_API_KEY}`,
    'X-On-Behalf-Of': '<end-user-id>',
  },
  body: form,
});

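// Fail fast on a non-2xx response before parsing the body.
if (!fileRes.ok) {
  throw new Error(`File upload failed: ${fileRes.status}`);
}

const datasetFile = await fileRes.json();
console.log(datasetFile.id); // Keep this file ID for the suite manifest.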

Store the returned file_... ID.
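
The difference between the two dataset formats can be seen locally. The sketch below round-trips one row through JSONL and CSV with Python's standard library only; it illustrates the general behavior of the formats, not the service's own parser, which may differ in details. The `tags` and `weight` columns are hypothetical additions to the smoke row.

```python
import csv
import io
import json

rows = [
    {
        "question": "Repeat exactly: MKA1_EVAL_SMOKE_OK",
        "answer": "MKA1_EVAL_SMOKE_OK",
        "tags": ["smoke"],  # hypothetical nested column
        "weight": 1,        # hypothetical numeric column
    },
]

# JSONL: one JSON object per line, so arrays and numbers keep their types.
jsonl_text = "".join(json.dumps(row) + "\n" for row in rows)
jsonl_rows = [json.loads(line) for line in jsonl_text.splitlines() if line]

# CSV: every cell is text, so the same values come back as strings.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))

print(type(jsonl_rows[0]["tags"]).__name__, type(jsonl_rows[0]["weight"]).__name__)
print(type(csv_rows[0]["tags"]).__name__, type(csv_rows[0]["weight"]).__name__)
```

If a grader needs numbers or nested fields, JSONL avoids having to re-parse string cells inside the grader.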

Step 2 - Upload a Python grader file

You can put grader source inline in the manifest. For reusable graders, upload Python files with purpose=evals.

exact_match_grader.py

def grade(sample, item):
    # Compare the extracted model output with the target, ignoring surrounding whitespace.
    output = (sample.get("extracted_output") or "").strip()
    target = (item.get("target") or "").strip()
    return {
        "scores": {
            "exact_match": 1.0 if output == target else 0.0
        },
        "judge": {
            "output": output,
            "target": target
        }
    }

curl

curl https://apigw.mka1.com/api/v1/llm/files \
  --request POST \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --form 'purpose=evals' \
  --form 'file=@./exact_match_grader.py;type=text/x-python'

Store the returned grader file_... ID.
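
A grader can usually be exercised locally before it is uploaded. The sketch below is a variant of the exact-match grader that adds a second, more lenient score and then calls `grade` with hand-built dictionaries. `normalized_match` and `_normalize` are hypothetical; the extra metric would also need an entry in the task's `metrics` list, and whether the grader runtime accepts helper functions and imports like these is an assumption.

```python
import unicodedata


def _normalize(text):
    """Unicode-normalize, collapse whitespace runs, and case-fold."""
    text = unicodedata.normalize("NFKC", text or "")
    return " ".join(text.split()).casefold()


def grade(sample, item):
    output = sample.get("extracted_output") or ""
    target = item.get("target") or ""
    return {
        "scores": {
            # Same rule as the exact-match grader in this guide.
            "exact_match": 1.0 if output.strip() == target.strip() else 0.0,
            # Hypothetical lenient metric: ignores case and spacing differences.
            "normalized_match": 1.0 if _normalize(output) == _normalize(target) else 0.0,
        },
        "judge": {"output": output, "target": target},
    }


# Local check only; leave this part out of the uploaded file.
result = grade(
    {"extracted_output": "  mka1_eval_smoke_ok \n"},
    {"target": "MKA1_EVAL_SMOKE_OK"},
)
print(result["scores"])
```

Reporting both scores side by side makes it easier to tell formatting drift apart from genuinely wrong answers.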

Step 3 - Create a suite

A suite manifest defines one or more tasks. Each task renders a prompt from one dataset row, sends the prompt to each run model, extracts the model output, and grades the sample.

curl

curl https://apigw.mka1.com/api/v1/llm/evals/suites \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --data '{
    "name": "Production smoke eval",
    "description": "A minimal uploaded JSONL and Python grader eval.",
    "manifest": {
      "schema_version": "2026-05-27",
      "tasks": [
        {
          "id": "repeat_exactly",
          "type": "custom",
          "dataset": {
            "file_id": "file_dataset123",
            "format": "jsonl"
          },
          "prompt_template": "{{question}}",
          "target_template": "{{answer}}",
          "output_extraction": {
            "type": "none"
          },
          "metrics": [
            { "id": "exact_match" }
          ],
          "grader": {
            "type": "python",
            "contract": "sample",
            "file_id": "file_grader123",
            "timeout_seconds": 120
          }
        }
      ]
    },
    "metadata": {
      "owner": "eval-team"
    }
  }'

The response returns an eval.suite object. Use the suite id when you start a run.
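
To make the per-row flow concrete, the sketch below mimics how the example task's `prompt_template` and `target_template` could resolve against one dataset row. The template engine used by the service is not described here, so plain `{{column}}` substitution is an assumption, and `render_template` is a hypothetical helper rather than an MKA1 function.

```python
import re

# Matches {{column}} placeholders, allowing optional inner spaces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, row: dict) -> str:
    """Replace each {{column}} placeholder with the row's value.

    A missing column raises KeyError in this sketch; how the service
    treats missing columns is not specified in this guide.
    """
    return _PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)


row = {
    "question": "Repeat exactly: MKA1_EVAL_SMOKE_OK",
    "answer": "MKA1_EVAL_SMOKE_OK",
}
prompt = render_template("{{question}}", row)
target = render_template("{{answer}}", row)
print(prompt)
print(target)
```

Rendering a few rows this way before creating the suite is a cheap check that template placeholders match the dataset's column names.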

Step 4 - Start a run

A run chooses the model or models to test. It can also choose a task subset, judge model, embedding model, generation settings, concurrency, and sample cap.

curl

curl https://apigw.mka1.com/api/v1/llm/evals/runs \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --data '{
    "suite_id": "eval_suite_abc123",
    "models": [
      "openai:gpt-4.1-mini"
    ],
    "task_ids": [
      "repeat_exactly"
    ],
    "generation": {
      "temperature": 0,
      "max_output_tokens": 32,
      "max_retries": 1,
      "max_empty_retries": 1,
      "timeout_seconds": 120
    },
    "generation_concurrency": 1,
    "grader_concurrency": 1,
    "max_samples_per_task": 1,
    "metadata": {
      "experiment": "smoke"
    }
  }'

Useful run fields:

Field	Purpose
`suite_id`	Suite to run.
`suite_version`	Optional immutable version number. Defaults to the active suite version.
`models`	Candidate model IDs. Maximum 20 per run. Duplicate model IDs are rejected.
`task_ids`	Optional subset of task IDs. Omit it to run every task in the suite version.
`judge_model`	Model used when Python grader code calls `ctx.responses_create(model="auto", ...)`.
`embedding_model`	Model used when Python grader code calls `ctx.embeddings_create(model="auto", ...)`.
`generation`	Candidate model settings and eval execution controls.
`generation_concurrency`	Number of samples to generate concurrently. Range is 1 to 256.
`grader_concurrency`	Number of samples to grade concurrently. Range is 1 to 256.
`max_samples_per_task`	Optional cap for smoke tests or partial runs.
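
The limits in the table above can be checked client-side before a run request is sent. The following is a minimal sketch; `validate_run_request` is a hypothetical helper, it assumes `suite_id` and a non-empty `models` list are required, and the server remains the authority on what is accepted.

```python
def validate_run_request(payload: dict) -> list:
    """Return a list of problems found in a run request body (empty if none)."""
    problems = []
    models = payload.get("models") or []

    if not payload.get("suite_id"):
        problems.append("suite_id is required")
    if not models:
        problems.append("models must list at least one candidate model")
    if len(models) > 20:
        problems.append("models allows at most 20 entries per run")
    if len(set(models)) != len(models):
        problems.append("models contains duplicate IDs")

    # Both concurrency fields share the documented 1 to 256 range.
    for field in ("generation_concurrency", "grader_concurrency"):
        value = payload.get(field)
        if value is None:
            continue
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not (is_int and 1 <= value <= 256):
            problems.append(f"{field} must be an integer from 1 to 256")

    return problems


smoke_run = {
    "suite_id": "eval_suite_abc123",
    "models": ["openai:gpt-4.1-mini"],
    "generation_concurrency": 1,
    "grader_concurrency": 1,
    "max_samples_per_task": 1,
}
print(validate_run_request(smoke_run))
```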

Step 5 - Poll the run

curl

curl https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123 \
  --header 'Authorization: Bearer <mka1-api-key>'

While a run is active, the metrics field is null or empty. When the run completes, metrics are grouped by model and by task:

{
  "status": "completed",
  "request_counts": {
    "total": 1,
    "completed": 1,
    "failed": 0
  },
  "metrics": {
    "by_model": {
      "openai:gpt-4.1-mini": {
        "sample_count": 1,
        "failed_count": 0,
        "metrics": {
          "exact_match": 1
        }
      }
    },
    "by_task": {
      "repeat_exactly": {
        "openai:gpt-4.1-mini": {
          "sample_count": 1,
          "failed_count": 0,
          "metrics": {
            "exact_match": 1
          }
        }
      }
    }
  }
}
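
For side-by-side comparison of models, the nested `by_task` block can be flattened into rows. The sketch below assumes the run object has the shape shown in the example response; `flatten_task_metrics` is a hypothetical helper, and it returns an empty list while `metrics` is still null or empty.

```python
def flatten_task_metrics(run: dict) -> list:
    """Turn metrics.by_task into (task_id, model, metric_id, value, sample_count) rows."""
    rows = []
    metrics = run.get("metrics") or {}  # null or empty while the run is active
    for task_id, per_model in (metrics.get("by_task") or {}).items():
        for model, summary in per_model.items():
            for metric_id, value in (summary.get("metrics") or {}).items():
                rows.append((task_id, model, metric_id, value, summary.get("sample_count")))
    return rows


# Trimmed copy of the example response, keeping only the fields used here.
completed_run = {
    "status": "completed",
    "metrics": {
        "by_task": {
            "repeat_exactly": {
                "openai:gpt-4.1-mini": {
                    "sample_count": 1,
                    "failed_count": 0,
                    "metrics": {"exact_match": 1},
                }
            }
        }
    },
}

for row in flatten_task_metrics(completed_run):
    print(row)
```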

Step 6 - Inspect samples

List samples when you need per-row debugging. You can filter by task_id, model, or status.

curl

curl 'https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/samples?limit=10&task_id=repeat_exactly' \
  --header 'Authorization: Bearer <mka1-api-key>'

Each sample includes the source row, rendered prompt, target, stored Responses response_id, raw model output, extracted output, scores, judge details, and error details.

{
  "object": "eval.sample",
  "task_id": "repeat_exactly",
  "model": "openai:gpt-4.1-mini",
  "status": "completed",
  "dataset_row": {
    "question": "Repeat exactly: MKA1_EVAL_SMOKE_OK",
    "answer": "MKA1_EVAL_SMOKE_OK"
  },
  "prompt": "Repeat exactly: MKA1_EVAL_SMOKE_OK",
  "target": "MKA1_EVAL_SMOKE_OK",
  "response_id": "resp_...",
  "output_text": "MKA1_EVAL_SMOKE_OK",
  "extracted_output": "MKA1_EVAL_SMOKE_OK",
  "scores": {
    "exact_match": 1
  },
  "judge": {
    "output": "MKA1_EVAL_SMOKE_OK",
    "target": "MKA1_EVAL_SMOKE_OK"
  },
  "error": null
}
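
When a run has many samples, it helps to narrow the list to rows that deserve a closer look. The sketch below filters sample objects shaped like the example above; `needs_review` is a hypothetical helper, and the default threshold of 1.0 on `exact_match` is an assumption suited to this smoke eval.

```python
def needs_review(sample: dict, metric_id: str = "exact_match", threshold: float = 1.0) -> bool:
    """Flag samples that failed, carry an error, or score below the threshold."""
    if sample.get("status") == "failed" or sample.get("error") is not None:
        return True
    score = (sample.get("scores") or {}).get(metric_id)
    # A missing score is treated as suspicious rather than as a pass.
    return score is None or score < threshold


# Hand-built samples for illustration; only the fields used by the filter are set.
samples = [
    {"task_id": "repeat_exactly", "status": "completed",
     "scores": {"exact_match": 1}, "error": None},
    {"task_id": "repeat_exactly", "status": "completed",
     "scores": {"exact_match": 0}, "error": None},
    {"task_id": "repeat_exactly", "status": "failed",
     "scores": None, "error": {"message": "hypothetical error payload"}},
]

flagged = [sample for sample in samples if needs_review(sample)]
print(len(flagged))
```

For the flagged rows, comparing `output_text` with `extracted_output` shows whether the model or the extraction step caused the mismatch.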

Step 7 - Fetch artifacts

Completed runs create result files with purpose=evals. Use the artifacts endpoint to find the result and sample artifact file IDs.

curl

curl https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/artifacts \
  --header 'Authorization: Bearer <mka1-api-key>'

Then download the files through the Files API:

curl

curl https://apigw.mka1.com/api/v1/llm/files/file_result123/content \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --output eval-result.json

The result artifact summarizes run metadata and final metrics. The samples artifact preserves per-sample details for offline analysis.
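
The download step can also be scripted with Python's standard library. The shape of the artifacts response is not shown in this guide, so the sketch below starts from a file ID you already hold. `build_file_content_request` and `download_file` are hypothetical helper names; the URL mirrors the curl example above.

```python
import shutil
import urllib.request

BASE_URL = "https://apigw.mka1.com/api/v1/llm"


def build_file_content_request(file_id: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for a file's content without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/files/{file_id}/content",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def download_file(file_id: str, api_key: str, destination: str) -> None:
    """Stream the file content to a local path."""
    request = build_file_content_request(file_id, api_key)
    with urllib.request.urlopen(request, timeout=60) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


# Building the request is side-effect free, so it can be inspected offline.
request = build_file_content_request("file_result123", "<mka1-api-key>")
print(request.get_method(), request.full_url)
```

Calling `download_file("file_result123", api_key, "eval-result.json")` would then perform the actual transfer.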

Edit a suite

Suites are versioned. Create a new immutable version when you change a manifest. Set make_active to false when you want to stage a draft version without making it the default for new runs.

curl

curl https://apigw.mka1.com/api/v1/llm/evals/suites/eval_suite_abc123/versions \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --data '{
    "make_active": true,
    "manifest": {
      "schema_version": "2026-05-27",
      "tasks": [
        {
          "id": "repeat_exactly",
          "type": "custom",
          "dataset": { "file_id": "file_dataset456", "format": "jsonl" },
          "prompt_template": "{{question}}",
          "target_template": "{{answer}}",
          "metrics": [{ "id": "exact_match" }],
          "grader": {
            "type": "python",
            "contract": "sample",
            "file_id": "file_grader456"
          }
        }
      ]
    },
    "metadata": {
      "change": "larger validation split"
    }
  }'

Runs keep the suite version they were created with. Changing the active version does not mutate historical runs.

Cancel a run

You can cancel a run while its status is queued, in_progress, or finalizing.

curl

curl https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/cancel \
  --request POST \
  --header 'Authorization: Bearer <mka1-api-key>'

Cancellation is best effort. Samples that are already running may finish before the workflow reaches cancelled.

Pagination and filtering

List endpoints use cursor pagination.

Endpoint	Filters
`GET /evals/suites`	`after`, `limit`
`GET /evals/suites/{suite_id}/versions`	`after`, `limit`
`GET /evals/runs`	`after`, `limit`, `suite_id`, `status`
`GET /evals/runs/{run_id}/samples`	`after`, `limit`, `task_id`, `model`, `status`

Example:

curl

curl 'https://apigw.mka1.com/api/v1/llm/evals/runs?status=completed&limit=20' \
  --header 'Authorization: Bearer <mka1-api-key>'
