Use evals when you need to measure model behavior against your own tasks, datasets, scoring code, and operational settings. An eval has two layers:

| Layer | What it stores |
| --- | --- |
| Suite | A versioned manifest with tasks, datasets, prompt templates, preprocessors, graders, and metric definitions. |
| Run | A durable execution of a suite version against one or more models, with generation settings, judge model, embedding model, concurrency, and result artifacts. |

Eval runs use normal MKA1 routing. Candidate generations go through POST /api/v1/llm/responses. Model-backed Python graders call Responses and Embeddings through a gateway-owned bridge, so grader code never receives your API key.

Before you start

You need:

| Requirement | Notes |
| --- | --- |
| API key scopes | Use a key with write:evals to create suites and runs, read:evals to read results, and file upload access for /files. |
| Candidate model access | The run creator chooses the candidate model IDs in models. |
| Optional judge model access | Required when Python graders call ctx.responses_create(model="auto", ...). |
| Optional embedding model access | Required when Python graders call ctx.embeddings_create(model="auto", ...). |
| Dataset | Upload JSONL/CSV with purpose=evals, or reference a supported Hugging Face dataset. |
| Python grader | Provide inline Python in the manifest, or upload a .py file with purpose=evals. |

Send the X-On-Behalf-Of header when the eval belongs to a specific end-user context. Suites, runs, uploaded eval files, and result artifacts are scoped to the authenticated team context.

Workflow

The normal flow is:
  1. Upload dataset and optional Python files through /files.
  2. Create an eval suite with a manifest.
  3. Start an eval run for one or more models.
  4. Poll the run until it reaches a terminal status.
  5. Inspect sample rows and download generated artifact files.
  6. Create a new suite version when you edit the manifest.
Eval run statuses move through:
queued -> in_progress -> finalizing -> completed
       \                         \-> failed
        \-> cancelling -> cancelled
Sample statuses are queued, running, completed, and failed.

Step 1 - Upload a dataset

Upload JSONL or CSV files with purpose=evals. JSONL preserves nested objects and arrays. CSV values are parsed as strings.
eval-smoke.jsonl
{"question":"Repeat exactly: MKA1_EVAL_SMOKE_OK","answer":"MKA1_EVAL_SMOKE_OK"}
curl https://apigw.mka1.com/api/v1/llm/files \
  --request POST \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --form 'purpose=evals' \
  --form 'file=@./eval-smoke.jsonl;type=application/jsonl'
Store the returned file_... ID.
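
The difference between the two formats can be checked locally before you upload anything. The sketch below uses only the Python standard library. The helper names to_jsonl and read_jsonl, the second example row, and the tiny CSV snippet are illustrative and not part of the MKA1 API. Python's csv module is used only as an analogy for the documented "CSV values are parsed as strings" behavior.

```python
import csv
import io
import json


def to_jsonl(rows):
    """Serialize an iterable of dicts as JSONL: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def read_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


rows = [
    {"question": "Repeat exactly: MKA1_EVAL_SMOKE_OK", "answer": "MKA1_EVAL_SMOKE_OK"},
    # Hypothetical row with a nested array, to show what JSONL keeps intact.
    {"question": "Sum the list", "answer": "6", "inputs": [1, 2, 3]},
]

jsonl_text = to_jsonl(rows)
round_tripped = read_jsonl(jsonl_text)

# The same kind of data read from CSV arrives as strings only.
csv_rows = list(csv.DictReader(io.StringIO("question,score\nhello,3\n")))

print(round_tripped[1]["inputs"])  # still a list of integers
print(repr(csv_rows[0]["score"]))  # a string, not an integer
```

Writing jsonl_text to a file such as eval-smoke.jsonl gives you something you can pass to the upload command above.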

Step 2 - Upload a Python grader file

You can put grader source inline in the manifest. For reusable graders, upload Python files with purpose=evals.
exact_match_grader.py
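def grade(sample, item):
    """Score one sample by exact match between the extracted output and the target."""
    # Treat missing or null values as empty strings before comparing.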
    output = (sample.get("extracted_output") or "").strip()
    target = (item.get("target") or "").strip()
    return {
        "scores": {
            "exact_match": 1.0 if output == target else 0.0
        },
        "judge": {
            "output": output,
            "target": target
        }
    }
curl https://apigw.mka1.com/api/v1/llm/files \
  --request POST \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --form 'purpose=evals' \
  --form 'file=@./exact_match_grader.py;type=text/x-python'
Store the returned grader file_... ID.
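
Before uploading, it can help to exercise the grader locally. The sketch below repeats the grade function from exact_match_grader.py and calls it with hand-built dictionaries. The input shapes are an assumption: they contain only the two keys this grader reads (extracted_output and target), while the real sample and item objects passed by the eval runtime likely carry more fields.

```python
def grade(sample, item):
    """Exact-match grader, identical in behavior to exact_match_grader.py."""
    output = (sample.get("extracted_output") or "").strip()
    target = (item.get("target") or "").strip()
    return {
        "scores": {"exact_match": 1.0 if output == target else 0.0},
        "judge": {"output": output, "target": target},
    }


# Hypothetical inputs: surrounding whitespace is stripped before comparing.
passing = grade(
    {"extracted_output": " MKA1_EVAL_SMOKE_OK\n"},
    {"target": "MKA1_EVAL_SMOKE_OK"},
)

# A missing (None) output is treated as an empty string and scores 0.0.
failing = grade({"extracted_output": None}, {"target": "MKA1_EVAL_SMOKE_OK"})

print(passing["scores"], failing["scores"])
```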

Step 3 - Create a suite

A suite manifest defines one or more tasks. Each task renders a prompt from one dataset row, sends the prompt to each run model, extracts the model output, and grades the sample.
curl https://apigw.mka1.com/api/v1/llm/evals/suites \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --data '{
    "name": "Production smoke eval",
    "description": "A minimal uploaded JSONL and Python grader eval.",
    "manifest": {
      "schema_version": "2026-05-27",
      "tasks": [
        {
          "id": "repeat_exactly",
          "type": "custom",
          "dataset": {
            "file_id": "file_dataset123",
            "format": "jsonl"
          },
          "prompt_template": "{{question}}",
          "target_template": "{{answer}}",
          "output_extraction": {
            "type": "none"
          },
          "metrics": [
            { "id": "exact_match" }
          ],
          "grader": {
            "type": "python",
            "contract": "sample",
            "file_id": "file_grader123",
            "timeout_seconds": 120
          }
        }
      ]
    },
    "metadata": {
      "owner": "eval-team"
    }
  }'
The response returns an eval.suite object. Use the suite id when you start a run.
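
To make the per-row rendering step concrete, here is a minimal sketch of how prompt_template and target_template might be filled from one dataset row. It assumes plain {{field}} substitution with no filters, nesting, or escaping. This guide does not describe the actual template engine, so it may behave differently. The function name render_template is hypothetical.

```python
import re

# Matches {{field}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, row):
    """Replace each {{field}} with str(row[field]); fail loudly on unknown fields."""

    def replace(match):
        key = match.group(1)
        if key not in row:
            raise KeyError(f"dataset row has no field {key!r}")
        return str(row[key])

    return _PLACEHOLDER.sub(replace, template)


row = {
    "question": "Repeat exactly: MKA1_EVAL_SMOKE_OK",
    "answer": "MKA1_EVAL_SMOKE_OK",
}

prompt = render_template("{{question}}", row)  # sent to each run model
target = render_template("{{answer}}", row)    # compared by the grader

print(prompt)
print(target)
```

Rendering a few rows this way before creating the suite is a cheap check that every placeholder in a template names a column that exists in the dataset.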

Step 4 - Start a run

A run chooses the model or models to test. It can also choose a task subset, judge model, embedding model, generation settings, concurrency, and sample cap.
curl https://apigw.mka1.com/api/v1/llm/evals/runs \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --data '{
    "suite_id": "eval_suite_abc123",
    "models": [
      "openai:gpt-4.1-mini"
    ],
    "task_ids": [
      "repeat_exactly"
    ],
    "generation": {
      "temperature": 0,
      "max_output_tokens": 32,
      "max_retries": 1,
      "max_empty_retries": 1,
      "timeout_seconds": 120
    },
    "concurrency": 1,
    "max_samples_per_task": 1,
    "metadata": {
      "experiment": "smoke"
    }
  }'
Useful run fields:

| Field | Purpose |
| --- | --- |
| suite_id | Suite to run. |
| suite_version | Optional immutable version number. Defaults to the active suite version. |
| models | Candidate model IDs. Maximum 20 per run. Duplicate model IDs are rejected. |
| task_ids | Optional subset of task IDs. Omit it to run every task in the suite version. |
| judge_model | Model used when Python grader code calls ctx.responses_create(model="auto", ...). |
| embedding_model | Model used when Python grader code calls ctx.embeddings_create(model="auto", ...). |
| generation | Candidate model settings and eval execution controls. |
| concurrency | Number of samples to process concurrently. Range is 1 to 25. |
| max_samples_per_task | Optional cap for smoke tests or partial runs. |
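
The limits listed for run fields can be checked client-side before sending the request. The sketch below is a hypothetical helper (validate_run_request is not an MKA1 function) that encodes only the constraints stated in this guide: at most 20 models, no duplicate model IDs, and concurrency between 1 and 25. Treating an omitted concurrency as acceptable is an assumption. The API remains the authority on validation.

```python
def validate_run_request(body):
    """Return a list of problems found in a run request body (empty if none)."""
    problems = []
    models = body.get("models") or []

    if not body.get("suite_id"):
        problems.append("suite_id is required")
    if not models:
        problems.append("models must list at least one candidate model")
    if len(models) > 20:
        problems.append("models allows at most 20 entries per run")
    if len(set(models)) != len(models):
        problems.append("models must not contain duplicates")

    # Assumption: concurrency may be omitted; when present it must be 1..25.
    concurrency = body.get("concurrency")
    if concurrency is not None and not 1 <= concurrency <= 25:
        problems.append("concurrency must be between 1 and 25")

    return problems


ok = validate_run_request(
    {"suite_id": "eval_suite_abc123", "models": ["openai:gpt-4.1-mini"], "concurrency": 1}
)
bad = validate_run_request(
    {
        "suite_id": "eval_suite_abc123",
        "models": ["openai:gpt-4.1-mini", "openai:gpt-4.1-mini"],
        "concurrency": 26,
    }
)

print(ok)
print(bad)
```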

Step 5 - Poll the run

curl https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123 \
  --header 'Authorization: Bearer <mka1-api-key>'
While a run is active, metrics is null or empty. When it completes, metrics are grouped by model and by task:
{
  "status": "completed",
  "request_counts": {
    "total": 3,
    "completed": 3,
    "failed": 0
  },
  "metrics": {
    "by_model": {
      "openai:gpt-4.1-mini": {
        "sample_count": 3,
        "failed_count": 0,
        "metrics": {
          "exact_match": 1
        }
      }
    },
    "by_task": {
      "repeat_exactly": {
        "openai:gpt-4.1-mini": {
          "sample_count": 1,
          "failed_count": 0,
          "metrics": {
            "exact_match": 1
          }
        }
      }
    }
  }
}
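
Polling can be wrapped in a small loop that stops at the terminal statuses named in the workflow (completed, failed, cancelled). The sketch below is one possible shape, not an official client. The function name wait_for_run, the injected fetch_run callable, and the default interval and timeout are all assumptions. fetch_run is expected to perform the GET shown above and return the run object as a dict. The demo substitutes a fake so the example runs without network access.

```python
import time

# Terminal statuses taken from the run status flow in this guide.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_run(fetch_run, run_id, interval_seconds=5.0,
                 timeout_seconds=600.0, sleep=time.sleep):
    """Call fetch_run(run_id) until the run reaches a terminal status.

    The timeout counts only the time spent sleeping between polls.
    """
    waited = 0.0
    while True:
        run = fetch_run(run_id)
        if run.get("status") in TERMINAL_STATUSES:
            return run
        if waited >= timeout_seconds:
            raise TimeoutError(f"run {run_id} still {run.get('status')!r}")
        sleep(interval_seconds)
        waited += interval_seconds


# Fake fetcher that walks through the documented status sequence.
_statuses = iter(["queued", "in_progress", "finalizing", "completed"])


def fake_fetch_run(run_id):
    return {"id": run_id, "status": next(_statuses)}


final = wait_for_run(fake_fetch_run, "eval_run_abc123", sleep=lambda _s: None)
print(final["status"])
```

Injecting fetch_run and sleep keeps the loop independent of any particular HTTP library and makes it easy to test.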

Step 6 - Inspect samples

List samples when you need per-row debugging. You can filter by task_id, model, or status.
curl 'https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/samples?limit=10&task_id=repeat_exactly' \
  --header 'Authorization: Bearer <mka1-api-key>'
Each sample includes the source row, rendered prompt, target, stored Responses response_id, raw model output, extracted output, scores, judge details, and error details.
{
  "object": "eval.sample",
  "task_id": "repeat_exactly",
  "model": "openai:gpt-4.1-mini",
  "status": "completed",
  "dataset_row": {
    "question": "Repeat exactly: MKA1_EVAL_SMOKE_OK",
    "answer": "MKA1_EVAL_SMOKE_OK"
  },
  "prompt": "Repeat exactly: MKA1_EVAL_SMOKE_OK",
  "target": "MKA1_EVAL_SMOKE_OK",
  "response_id": "resp_...",
  "output_text": "MKA1_EVAL_SMOKE_OK",
  "extracted_output": "MKA1_EVAL_SMOKE_OK",
  "scores": {
    "exact_match": 1
  },
  "judge": {
    "output": "MKA1_EVAL_SMOKE_OK",
    "target": "MKA1_EVAL_SMOKE_OK"
  },
  "error": null
}
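
Once sample objects are fetched, a few lines of Python are enough for a first pass at debugging. The sketch below groups samples by model, averages one metric over completed samples, and collects failed samples for inspection. It relies only on the model, status, scores, and error fields shown above. The helper name summarize_samples and the sample data are hypothetical, and a plain mean is an assumption that may not match how the run's own metrics are aggregated.

```python
from collections import defaultdict


def summarize_samples(samples, metric):
    """Return (mean of `metric` per model over completed samples, failed samples)."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [score sum, sample count]
    failures = []
    for sample in samples:
        if sample.get("status") == "failed":
            failures.append(sample)
            continue
        score = (sample.get("scores") or {}).get(metric)
        if sample.get("status") == "completed" and score is not None:
            bucket = totals[sample["model"]]
            bucket[0] += score
            bucket[1] += 1
    means = {model: total / count for model, (total, count) in totals.items()}
    return means, failures


# Hypothetical samples shaped like the eval.sample object above.
samples = [
    {"model": "openai:gpt-4.1-mini", "status": "completed", "scores": {"exact_match": 1}},
    {"model": "openai:gpt-4.1-mini", "status": "completed", "scores": {"exact_match": 0}},
    {"model": "openai:gpt-4.1-mini", "status": "failed", "scores": None,
     "error": {"message": "example failure"}},
]

means, failures = summarize_samples(samples, "exact_match")
print(means)
print(len(failures))
```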

Step 7 - Fetch artifacts

Completed runs create result files with purpose=evals. Use the artifacts endpoint to find the result and sample artifact file IDs.
curl https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/artifacts \
  --header 'Authorization: Bearer <mka1-api-key>'
Then download the files through the Files API:
curl https://apigw.mka1.com/api/v1/llm/files/file_result123/content \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --output eval-result.json
The result artifact summarizes run metadata and final metrics. The samples artifact preserves per-sample details for offline analysis.

Edit a suite

Suites are versioned. Create a new immutable version when you change a manifest. Set make_active to false when you want to stage a draft version without making it the default for new runs.
curl https://apigw.mka1.com/api/v1/llm/evals/suites/eval_suite_abc123/versions \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --data '{
    "make_active": true,
    "manifest": {
      "schema_version": "2026-05-27",
      "tasks": [
        {
          "id": "repeat_exactly",
          "type": "custom",
          "dataset": { "file_id": "file_dataset456", "format": "jsonl" },
          "prompt_template": "{{question}}",
          "target_template": "{{answer}}",
          "metrics": [{ "id": "exact_match" }],
          "grader": {
            "type": "python",
            "contract": "sample",
            "file_id": "file_grader456"
          }
        }
      ]
    },
    "metadata": {
      "change": "larger validation split"
    }
  }'
Runs keep the suite version they were created with. Changing the active version does not mutate historical runs.

Cancel a run

Cancel a run when it is queued, in progress, or finalizing.
curl https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/cancel \
  --request POST \
  --header 'Authorization: Bearer <mka1-api-key>'
Cancellation is best effort. Samples that are already running may finish before the workflow reaches cancelled.

Pagination and filtering

List endpoints use cursor pagination.

| Endpoint | Filters |
| --- | --- |
| GET /evals/suites | after, limit |
| GET /evals/suites/{suite_id}/versions | after, limit |
| GET /evals/runs | after, limit, suite_id, status |
| GET /evals/runs/{run_id}/samples | after, limit, task_id, model, status |

Example:
curl 'https://apigw.mka1.com/api/v1/llm/evals/runs?status=completed&limit=20' \
  --header 'Authorization: Bearer <mka1-api-key>'
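
Walking every page of a list endpoint might look like the sketch below. The after and limit parameters come from the table above, but the response shape is an assumption: this guide does not show a list response, so the code assumes a body with a data array, a has_more flag, and items that carry an id usable as the next after cursor. Check a real response before relying on it. The names iter_pages and fetch_page are hypothetical. fetch_page stands in for whatever performs the HTTP GET, and the demo uses an in-memory fake.

```python
def iter_pages(fetch_page, limit=20, **filters):
    """Yield every item from a cursor-paginated list endpoint.

    Assumed response shape: {"data": [...], "has_more": bool}, with each
    item exposing an "id" that is passed back as the `after` cursor.
    """
    after = None
    while True:
        params = dict(filters, limit=limit)
        if after is not None:
            params["after"] = after
        page = fetch_page(params)
        items = page.get("data") or []
        yield from items
        if not page.get("has_more") or not items:
            return
        after = items[-1]["id"]


# In-memory stand-in for GET /evals/runs, five fake runs in total.
_RUNS = [{"id": f"eval_run_{n}", "status": "completed"} for n in range(5)]


def fake_fetch_page(params):
    start = 0
    if "after" in params:
        start = [run["id"] for run in _RUNS].index(params["after"]) + 1
    chunk = _RUNS[start:start + params["limit"]]
    return {"data": chunk, "has_more": start + len(chunk) < len(_RUNS)}


run_ids = [run["id"] for run in iter_pages(fake_fetch_page, limit=2, status="completed")]
print(run_ids)
```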
  • Design eval task suites covers datasets, task manifests, templates, few-shot examples, output extraction, generation knobs, and metrics.
  • Write Python eval graders covers sample, batch, and model-backed Python contracts.
  • Use the endpoint paths in this guide with the generated request and response objects returned by the API.