Write Python eval graders

Eval scoring is Python-backed. Each task defines a Python grader with one of three contracts:

Contract	Function	Best for
`sample`	`grade(sample, item[, ctx])`	Per-sample exact match, F1, numeric tolerance, regex, custom rubric checks.
`batch`	`grade_batch(samples[, ctx])`	Aggregate scoring, macro metrics, task-level judges, or sample updates after seeing the whole batch.
`model_backed`	`grade` or `grade_batch` with `ctx` model helpers	LLM-as-judge and embedding/similarity scoring through MKA1.

Python runs in the sandbox service. Do not pass API keys or provider credentials into grader code. Use ctx.responses_create and ctx.embeddings_create for model-backed scoring.

Grader declaration

Use inline source for short graders:

{
  "grader": {
    "type": "python",
    "contract": "sample",
    "metric_id": "score",
    "source": "def grade(sample, item):\n    return 1.0 if sample['extracted_output'] == item['target'] else 0.0\n",
    "timeout_seconds": 120
  }
}

Use uploaded files for reusable graders:

{
  "grader": {
    "type": "python",
    "contract": "sample",
    "file_id": "file_grader123",
    "timeout_seconds": 120
  }
}

Fields:

Field	Description
`type`	Must be `python`.
`contract`	`sample`, `batch`, or `model_backed`. Defaults to `sample`.
`execution`	Optional override: `per_sample` or `aggregate`. Defaults from `contract`.
`model_access`	`mka1` enables `ctx.responses_create` and `ctx.embeddings_create`. Defaults to `mka1` for `model_backed`, otherwise `none`.
`metric_id`	Score key used when the grader returns a single float. Defaults to `score`.
`source`	Inline Python source.
`file_id`	Uploaded Python file ID with `purpose=evals`.
`timeout_seconds`	Sandbox execution timeout. Range is 1 to 600. Defaults to 120.
`max_model_calls`	Maximum bridge calls from one grader execution. Range is 0 to 500. Defaults to 64.

Provide either source or file_id.
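
The constraints in the table above can be checked locally before a task is submitted. The following is a minimal sketch: `validate_grader` is a hypothetical helper, not part of any MKA1 SDK, and it assumes that exactly one of `source` and `file_id` should be present.

```python
# Hypothetical pre-flight check. Field names, ranges, and defaults are taken
# from the field table; the helper itself is an illustration only.
VALID_CONTRACTS = {"sample", "batch", "model_backed"}

def validate_grader(grader):
    """Return a list of problems found in a grader declaration dict."""
    problems = []
    if grader.get("type") != "python":
        problems.append("type must be 'python'")
    contract = grader.get("contract", "sample")
    if contract not in VALID_CONTRACTS:
        problems.append(f"unknown contract: {contract!r}")
    # Assumption: exactly one of source / file_id is expected.
    if ("source" in grader) == ("file_id" in grader):
        problems.append("provide either source or file_id")
    timeout = grader.get("timeout_seconds", 120)
    if not 1 <= timeout <= 600:
        problems.append("timeout_seconds must be between 1 and 600")
    max_calls = grader.get("max_model_calls", 64)
    if not 0 <= max_calls <= 500:
        problems.append("max_model_calls must be between 0 and 500")
    return problems
```

An empty list means the declaration passed these local checks; the service may still apply rules that are not covered here.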

Sample contract

Sample graders define:

def grade(sample, item):
    ...

They can also accept ctx:

def grade(sample, item, ctx):
    ...

sample describes the model output:

{
  "output_text": "Raw assistant text",
  "extracted_output": "Parsed answer",
  "model": "openai:gpt-4.1-mini",
  "prompt": "Rendered prompt",
  "task_id": "qa_exact_match",
  "run_id": "eval_run_abc123",
  "sample_id": "eval_sample_abc123"
}

item contains the dataset row plus convenience fields:

{
  "question": "2+2?",
  "answer": "4",
  "prompt": "Rendered prompt",
  "target": "4",
  "reference_answer": "4",
  "choices": [],
  "task_id": "qa_exact_match"
}

The exact row fields depend on your dataset and preprocessor.
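
Because a sample grader is a plain function over two dicts, it can be exercised locally before it is uploaded. The sketch below uses hand-built `sample` and `item` values that mirror the shapes above; the values are made up and only a subset of fields is included.

```python
def grade(sample, item):
    return 1.0 if sample.get("extracted_output") == item.get("target") else 0.0

# Hand-built inputs for a local check (illustrative values only).
sample = {
    "output_text": "The answer is 4.",
    "extracted_output": "4",
    "model": "openai:gpt-4.1-mini",
    "prompt": "2+2?",
    "task_id": "qa_exact_match",
}
item = {"question": "2+2?", "answer": "4", "target": "4", "task_id": "qa_exact_match"}

score_hit = grade(sample, item)
# Same sample with a wrong extracted answer.
score_miss = grade({**sample, "extracted_output": "5"}, item)
print(score_hit, score_miss)
```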

Return a single score

Return a finite float when the task has one metric. MKA1 stores it under metric_id.

def grade(sample, item):
    return 1.0 if sample.get("extracted_output") == item.get("target") else 0.0

If metric_id is omitted, the score key is score.

Return multiple scores

Return a dict with a scores object when the task has multiple metrics.

def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip().lower()
    target = (item.get("target") or "").strip().lower()

    exact = 1.0 if output == target else 0.0
    contains = 1.0 if target and target in output else 0.0

    return {
        "scores": {
            "exact_match": exact,
            "contains_target": contains
        },
        "judge": {
            "output": output,
            "target": target
        }
    }

Only finite numeric score values are stored in scores. The optional judge object is preserved on the sample for debugging and dashboards.

Invalid results

The following results become invalid and receive a zero score:

Exceptions.
Non-finite floats such as NaN or Infinity.
Boolean returns.
Strings or other non-dict, non-number returns.
Dicts without any finite numeric score.

The sample judge payload includes the raw invalid payload and error details.
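
The rules above can be approximated in a local check so that invalid returns are caught before a run. This is a sketch of the documented rules, not the service's actual validator; `looks_valid` and `is_finite_number` are hypothetical names. Note that `bool` is a subclass of `int` in Python, so it has to be excluded explicitly.

```python
import math

def is_finite_number(value):
    # bool is a subclass of int, so reject it before the numeric check.
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float)) and math.isfinite(value)

def looks_valid(result):
    """Local approximation of the validity rules listed above."""
    if is_finite_number(result):
        return True
    if isinstance(result, dict):
        scores = result.get("scores")
        if isinstance(scores, dict):
            # At least one finite numeric score is needed.
            return any(is_finite_number(v) for v in scores.values())
    return False
```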

Batch contract

Batch graders define:

def grade_batch(samples):
    ...

They run once per task and model during finalization. Use them when the metric needs the whole set of samples.

Each batch sample contains:

{
  "sample_id": "eval_sample_abc123",
  "task_id": "macro_f1_task",
  "model": "openai:gpt-4.1-mini",
  "prompt": "Rendered prompt",
  "target": "positive",
  "output_text": "The answer is positive.",
  "extracted_output": "positive",
  "dataset_row": {
    "text": "Great support experience.",
    "label": "positive"
  },
  "response_id": "resp_...",
  "scores": {},
  "judge": null
}

Return aggregate metrics:

def grade_batch(samples):
    total = len(samples)
    correct = 0

    for sample in samples:
        if sample.get("extracted_output") == sample.get("target"):
            correct += 1

    return {
        "metrics": {
            "accuracy": correct / total if total else 0.0
        }
    }

Return sample updates when you want to add per-sample scores, judge details, or corrected extracted outputs:

def grade_batch(samples):
    updates = []
    for sample in samples:
        correct = sample.get("extracted_output") == sample.get("target")
        updates.append({
            "sample_id": sample["sample_id"],
            "scores": {
                "batch_correct": 1.0 if correct else 0.0
            },
            "judge": {
                "checked_in_batch": True
            }
        })

    return {
        "metrics": {
            "batch_accuracy": sum(u["scores"]["batch_correct"] for u in updates) / len(updates) if updates else 0.0
        },
        "samples": updates
    }

Sample update fields:

Field	Description
`sample_id`	Required. The sample to update.
`scores`	Optional numeric score keys to merge into the sample.
`judge`	Optional object to replace sample judge details. Use `null` to clear.
`extracted_output`	Optional string or `null` to replace the sample extracted output.

If a task declares metrics, unexpected batch metric IDs are dropped from final aggregates. This protects dashboards from accidental metric drift.
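
The effect of that filtering can be pictured with a small sketch. `keep_declared_metrics` is a hypothetical helper that mimics the described behavior; it assumes that a task with no declared metrics keeps every returned metric.

```python
def keep_declared_metrics(returned, declared):
    """Illustrative only: drop metric IDs the task did not declare."""
    if not declared:
        # Assumption: with no declared metrics, everything is kept.
        return dict(returned)
    return {key: value for key, value in returned.items() if key in declared}

returned = {"accuracy": 0.75, "debug_count": 12.0}
filtered = keep_declared_metrics(returned, ["accuracy"])
unfiltered = keep_declared_metrics(returned, [])
```

Here `debug_count` would not appear in the final aggregates of a task that declares only `accuracy`.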

Model-backed graders

Use contract: "model_backed" when Python needs model or embedding calls. The grader code requests the call through ctx; the gateway performs it with the run’s authenticated context and returns the result to the sandbox.

{
  "grader": {
    "type": "python",
    "contract": "model_backed",
    "model_access": "mka1",
    "max_model_calls": 4,
    "file_id": "file_judge_grader123"
  }
}

The run should set judge_model and embedding_model when the grader uses model="auto":

{
  "suite_id": "eval_suite_abc123",
  "models": ["openai:gpt-4.1-mini"],
  "judge_model": "openai:gpt-4.1-mini",
  "embedding_model": "openai:text-embedding-3-small"
}

If Python passes an explicit model ID, that explicit model is used. If it passes model="auto" or omits model, MKA1 uses the run’s judge_model or embedding_model.
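
The resolution rule can be written out as a short sketch. `resolve_model` is a hypothetical function used only to illustrate the rule; raising an error when the run has no default is an assumption about the failure mode, not documented behavior.

```python
def resolve_model(requested, run_default):
    """Illustrative only: mirrors the resolution rule described above."""
    if requested in (None, "auto"):
        if run_default is None:
            # Assumption: a missing run-level model is treated as an error.
            raise ValueError("run has no judge_model or embedding_model configured")
        return run_default
    # An explicit model ID always wins.
    return requested
```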

LLM-as-judge

import json

def grade(sample, item, ctx):
    response = ctx.responses_create(
        model="auto",
        input=(
            "You are grading an answer. Return JSON only.\n"
            f"Question: {sample.get('prompt', '')}\n"
            f"Reference: {item.get('target', '')}\n"
            f"Prediction: {sample.get('output_text', '')}\n"
            'Schema: {"verdict":"correct"|"incorrect","rationale":"..."}'
        ),
        temperature=0,
        text={"format": {"type": "json_object"}},
        metadata={"judge": "reference_binary"}
    )

    parsed = json.loads(response["output_text"])
    score = 1.0 if parsed.get("verdict") == "correct" else 0.0

    return {
        "scores": {
            "judge_score": score
        },
        "judge": {
            "verdict": parsed.get("verdict"),
            "rationale": parsed.get("rationale"),
            "response_id": response.get("id")
        }
    }

ctx.responses_create accepts the same request shape as the MKA1 Responses API, except the eval system forces stream=false, store=true, and background=false. Judge responses are stored and can be audited like normal Responses traffic.
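
A judge model can return text that is not valid JSON, and an uncaught exception makes the result invalid. One option, sketched below, is to catch the parse error and keep the raw text in the judge payload. `parse_verdict` and `StubCtx` are hypothetical names; the stub only imitates the `id` and `output_text` fields used in the example above so the grader can be tried offline.

```python
import json

def parse_verdict(raw_text):
    """Return (score, details) from judge text, tolerating malformed JSON."""
    try:
        parsed = json.loads(raw_text)
    except (TypeError, ValueError):
        return 0.0, {"parse_error": True, "raw": raw_text}
    if not isinstance(parsed, dict):
        return 0.0, {"parse_error": True, "raw": raw_text}
    verdict = parsed.get("verdict")
    score = 1.0 if verdict == "correct" else 0.0
    return score, {"verdict": verdict, "rationale": parsed.get("rationale")}

def grade(sample, item, ctx):
    response = ctx.responses_create(
        model="auto",
        input=(
            f"Reference: {item.get('target', '')}\n"
            f"Prediction: {sample.get('output_text', '')}"
        ),
        temperature=0,
    )
    score, details = parse_verdict(response.get("output_text"))
    details["response_id"] = response.get("id")
    return {"scores": {"judge_score": score}, "judge": details}

class StubCtx:
    """Hypothetical stand-in for ctx, for local testing only."""
    def __init__(self, text):
        self.text = text

    def responses_create(self, **kwargs):
        return {"id": "resp_stub", "output_text": self.text}

good = grade({}, {}, StubCtx('{"verdict": "correct", "rationale": "ok"}'))
bad = grade({}, {}, StubCtx("not json"))
```

Scoring a parse failure as 0.0 is a design choice; letting the exception propagate also yields a zero score but marks the result invalid.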

Embedding similarity

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def grade(sample, item, ctx):
    result = ctx.embeddings_create(
        model="auto",
        input=[
            sample.get("output_text", ""),
            item.get("target", "")
        ]
    )

    output_embedding = result["data"][0]["embedding"]
    target_embedding = result["data"][1]["embedding"]
    score = cosine(output_embedding, target_embedding)

    return {
        "scores": {
            "semantic_similarity": score
        },
        "judge": {
            "embedding_model": result.get("model")
        }
    }

ctx.embeddings_create routes through the MKA1 Embeddings API. Usage is logged under the run’s authenticated context.
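
Cosine similarity ranges from -1 to 1, so the score above can be negative. Whether negative scores are acceptable depends on how the metric is consumed; the sketch below shows one option, clamping to the range 0 to 1, with `clamp01` as a hypothetical helper and small vectors standing in for real embeddings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def clamp01(value):
    """Clamp a similarity into [0, 1]; also absorbs tiny float overshoot."""
    return max(0.0, min(1.0, value))

partial = cosine([1.0, 0.0], [1.0, 1.0])    # 45 degrees apart
opposite = cosine([1.0, 0.0], [-1.0, 0.0])  # pointing in opposite directions
zero_vec = cosine([0.0, 0.0], [1.0, 1.0])   # guarded zero-norm case
```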

Python preprocessors

Preprocessors are not graders, but they use the same sandbox execution model and file loading rules. They run before prompt rendering.

Row preprocessor:

def transform(row):
    row["question"] = row["question"].strip()
    row["answer"] = str(row["answer"]).strip()
    return row

Batch preprocessor:

def transform_batch(rows):
    out = []
    for row in rows:
        if row.get("skip"):
            continue
        row["difficulty"] = row.get("difficulty") or "unknown"
        out.append(row)
    return out

Declare them on a task:

{
  "preprocess": {
    "type": "python",
    "contract": "batch",
    "file_id": "file_preprocessor123",
    "timeout_seconds": 120
  }
}
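
Like graders, preprocessors are plain functions and can be tried on a few hand-written rows first. The sketch below runs the batch preprocessor from above on made-up rows to show which rows are dropped and which defaults are filled in.

```python
def transform_batch(rows):
    out = []
    for row in rows:
        if row.get("skip"):
            continue
        row["difficulty"] = row.get("difficulty") or "unknown"
        out.append(row)
    return out

# Illustrative rows; the field names follow the examples above.
rows = [
    {"question": "2+2?", "answer": "4"},
    {"question": "ignored", "answer": "n/a", "skip": True},
    {"question": "3*3?", "answer": "9", "difficulty": "easy"},
]
result = transform_batch(rows)
print([row["difficulty"] for row in result])
```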

Common scoring recipes

Exact match

def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip()
    target = (item.get("target") or "").strip()
    return {"scores": {"exact_match": 1.0 if output == target else 0.0}}

Case-insensitive exact match

def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip().lower()
    target = (item.get("target") or "").strip().lower()
    return {"scores": {"exact_match": 1.0 if output == target else 0.0}}

Numeric tolerance

def grade(sample, item):
    try:
        output = float(sample.get("extracted_output"))
        target = float(item.get("target"))
    except (TypeError, ValueError):
        return {"scores": {"numeric_match": 0.0}}

    return {
        "scores": {
            "numeric_match": 1.0 if abs(output - target) <= 0.01 else 0.0,
            "absolute_error": abs(output - target)
        }
    }
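
A fixed absolute tolerance can be too strict for large targets. A variant, sketched here with the standard library's `math.isclose`, combines a relative and an absolute tolerance and also rejects non-finite values such as a parsed `nan`. `REL_TOL` and `ABS_TOL` are hypothetical constants chosen for illustration.

```python
import math

REL_TOL = 1e-3  # hypothetical: 0.1% relative tolerance
ABS_TOL = 0.01  # same absolute tolerance as the recipe above

def grade(sample, item):
    try:
        output = float(sample.get("extracted_output"))
        target = float(item.get("target"))
    except (TypeError, ValueError):
        return {"scores": {"numeric_match": 0.0}}

    # float("nan") and float("inf") parse successfully, so guard explicitly.
    if not (math.isfinite(output) and math.isfinite(target)):
        return {"scores": {"numeric_match": 0.0}}

    close = math.isclose(output, target, rel_tol=REL_TOL, abs_tol=ABS_TOL)
    return {"scores": {"numeric_match": 1.0 if close else 0.0}}
```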

Token F1

import re
from collections import Counter

def tokens(text):
    return re.findall(r"\w+", (text or "").lower())

def f1(prediction, target):
    pred = tokens(prediction)
    gold = tokens(target)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0

    overlap = Counter(pred) & Counter(gold)
    common = sum(overlap.values())
    if common == 0:
        return 0.0

    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

def grade(sample, item):
    return {
        "scores": {
            "token_f1": f1(sample.get("output_text", ""), item.get("target", ""))
        }
    }
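
A worked example helps confirm the arithmetic. The sketch below repeats the `tokens` and `f1` helpers so it runs on its own, then scores two made-up pairs. For "The cat sat." against "the cat sat down", precision is 3/3 and recall is 3/4, giving an F1 of 6/7.

```python
import re
from collections import Counter

def tokens(text):
    return re.findall(r"\w+", (text or "").lower())

def f1(prediction, target):
    pred = tokens(prediction)
    gold = tokens(target)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    # Counter intersection keeps the minimum count per token.
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

partial = f1("The cat sat.", "the cat sat down")  # precision 1.0, recall 0.75
repeated = f1("a a a", "a")                       # precision 1/3, recall 1.0
```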

Macro F1 with `grade_batch`

def grade_batch(samples):
    labels = sorted({
        s.get("target")
        for s in samples
        if s.get("target") is not None
    })

    per_label = []
    for label in labels:
        tp = fp = fn = 0
        for sample in samples:
            pred = sample.get("extracted_output")
            target = sample.get("target")
            if pred == label and target == label:
                tp += 1
            elif pred == label and target != label:
                fp += 1
            elif pred != label and target == label:
                fn += 1

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label.append(score)

    return {
        "metrics": {
            "macro_f1": sum(per_label) / len(per_label) if per_label else 0.0
        },
        "samples": [
            {
                "sample_id": s["sample_id"],
                "scores": {
                    "correct": 1.0 if s.get("extracted_output") == s.get("target") else 0.0
                }
            }
            for s in samples
        ]
    }
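
Macro F1 and accuracy can diverge on imbalanced predictions. The following sketch uses a compact, hypothetical `macro_f1` helper over (prediction, target) pairs, equivalent in logic to the loop above, on four made-up samples. Accuracy is 3/4, while the per-label F1 values are 2/3 for "positive" and 4/5 for "negative", so macro F1 is 11/15.

```python
def macro_f1(pairs):
    """pairs is a list of (predicted_label, target_label) tuples."""
    labels = sorted({target for _, target in pairs})
    per_label = []
    for label in labels:
        tp = sum(1 for p, t in pairs if p == label and t == label)
        fp = sum(1 for p, t in pairs if p == label and t != label)
        fn = sum(1 for p, t in pairs if p != label and t == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label.append(score)
    return sum(per_label) / len(per_label) if per_label else 0.0

pairs = [
    ("positive", "positive"),
    ("negative", "positive"),  # one missed positive
    ("negative", "negative"),
    ("negative", "negative"),
]
accuracy = sum(1 for p, t in pairs if p == t) / len(pairs)
score = macro_f1(pairs)
```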

Debugging graders

Use sample details first:

curl 'https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/samples?limit=10&status=failed' \
  --header 'Authorization: Bearer <mka1-api-key>'
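
The same request can be made from Python with the standard library. This is a sketch: `build_samples_request` and `fetch_samples` are hypothetical helpers, the URL and query parameters are copied from the curl command above, and the shape of the JSON response is not assumed.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://apigw.mka1.com/api/v1/llm/evals/runs"

def build_samples_request(run_id, api_key, limit=10, status="failed"):
    """Build (but do not send) the request for a run's samples."""
    query = urllib.parse.urlencode({"limit": limit, "status": status})
    url = f"{BASE_URL}/{run_id}/samples?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})

def fetch_samples(run_id, api_key, **params):
    """Send the request and decode the JSON body as-is."""
    request = build_samples_request(run_id, api_key, **params)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keep the API key out of grader source and dataset rows; read it from the environment of the machine running this script.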

Look at:

Field	What to check
`prompt`	The rendered prompt after preprocessing and few-shot examples.
`target`	The rendered target string.
`output_text`	Raw model output after stop-sequence handling.
`extracted_output`	Output after extraction.
`scores`	Numeric values that affected aggregate metrics.
`judge`	Grader-returned details plus raw execution payload when available.
`error`	Candidate generation, extraction, preprocessing, or sandbox failure details.

Common fixes:

Symptom	Fix
Score is always `0`	Check that the grader returns a float or a dict with finite numeric `scores`.
`ctx.responses_create` fails	Set `contract: "model_backed"` and `model_access: "mka1"`, then provide a `judge_model` on the run or an explicit model in Python.
`ctx.embeddings_create` fails	Set an `embedding_model` on the run or pass an explicit embedding model in Python.
Prompt fields are blank	Verify the dataset row field names and preprocessor output.
Batch metrics are missing	Declare the metric IDs in `metrics`, or omit `metrics` for aggregate tasks that should accept every returned metric.
Grader times out	Increase `timeout_seconds`, reduce model bridge calls, or move expensive aggregate work into smaller tasks.

Security model

Python graders and preprocessors execute in the sandbox service. They are intended for eval logic, not for arbitrary application workflows. Keep these rules in mind:

Do not put secrets in grader source, dataset rows, or metadata.
Do not expect raw network credentials inside Python.
Use ctx.responses_create and ctx.embeddings_create for model calls.
Keep uploaded Python files scoped to the team that owns the suite.
Prefer uploaded grader files for reusable logic so suite versions clearly track their dependencies.

The eval API preserves the model call details it can observe, including candidate response_ids and grader judge payloads, so you can audit how scores were produced.
