Eval scoring is Python-backed. Each task defines a Python grader with one of three contracts:
| Contract | Function | Best for |
| --- | --- | --- |
| sample | grade(sample, item[, ctx]) | Per-sample exact match, F1, numeric tolerance, regex, custom rubric checks. |
| batch | grade_batch(samples[, ctx]) | Aggregate scoring, macro metrics, task-level judges, or sample updates after seeing the whole batch. |
| model_backed | grade or grade_batch with ctx model helpers | LLM-as-judge and embedding/similarity scoring through MKA1. |
Python runs in the sandbox service. Do not pass API keys or provider credentials into grader code. Use ctx.responses_create and ctx.embeddings_create for model-backed scoring.

Grader declaration

Use inline source for short graders:
{
  "grader": {
    "type": "python",
    "contract": "sample",
    "metric_id": "score",
    "source": "def grade(sample, item):\n    return 1.0 if sample['extracted_output'] == item['target'] else 0.0\n",
    "timeout_seconds": 120
  }
}
Use uploaded files for reusable graders:
{
  "grader": {
    "type": "python",
    "contract": "sample",
    "file_id": "file_grader123",
    "timeout_seconds": 120
  }
}
Fields:
| Field | Description |
| --- | --- |
| type | Must be python. |
| contract | sample, batch, or model_backed. Defaults to sample. |
| execution | Optional override: per_sample or aggregate. Defaults from contract. |
| model_access | mka1 enables ctx.responses_create and ctx.embeddings_create. Defaults to mka1 for model_backed, otherwise none. |
| metric_id | Score key used when the grader returns a single float. Defaults to score. |
| source | Inline Python source. |
| file_id | Uploaded Python file ID with purpose=evals. |
| timeout_seconds | Sandbox execution timeout. Range is 1 to 600. Defaults to 120. |
| max_model_calls | Maximum bridge calls from one grader execution. Range is 0 to 500. Defaults to 64. |
Provide either source or file_id.
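The field rules above can be checked locally before a suite is submitted. The following is a minimal sketch, not MKA1's validation logic: validate_grader is a hypothetical helper, and treating "both source and file_id present" as an error is an assumption drawn from the sentence above.

```python
# Hypothetical local validator; it mirrors the documented field rules only.
ALLOWED_CONTRACTS = {"sample", "batch", "model_backed"}

def validate_grader(grader):
    """Return a list of problems found in a grader declaration dict."""
    problems = []
    if grader.get("type") != "python":
        problems.append("type must be 'python'")
    contract = grader.get("contract", "sample")  # documented default
    if contract not in ALLOWED_CONTRACTS:
        problems.append(f"unknown contract: {contract!r}")
    # Assumption: exactly one of source / file_id should be present.
    if ("source" in grader) == ("file_id" in grader):
        problems.append("provide either source or file_id")
    timeout = grader.get("timeout_seconds", 120)  # documented default
    if not 1 <= timeout <= 600:
        problems.append("timeout_seconds must be between 1 and 600")
    max_calls = grader.get("max_model_calls", 64)  # documented default
    if not 0 <= max_calls <= 500:
        problems.append("max_model_calls must be between 0 and 500")
    return problems
```

An empty list means the declaration passed these local checks; the service may still apply rules that are not listed on this page.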

Sample contract

Sample graders define:
def grade(sample, item):
    ...
They can also accept ctx:
def grade(sample, item, ctx):
    ...
sample describes the model output:
{
  "output_text": "Raw assistant text",
  "extracted_output": "Parsed answer",
  "model": "openai:gpt-4.1-mini",
  "prompt": "Rendered prompt",
  "task_id": "qa_exact_match",
  "run_id": "eval_run_abc123",
  "sample_id": "eval_sample_abc123"
}
item contains the dataset row plus convenience fields:
{
  "question": "2+2?",
  "answer": "4",
  "prompt": "Rendered prompt",
  "target": "4",
  "reference_answer": "4",
  "choices": [],
  "task_id": "qa_exact_match"
}
The exact row fields depend on your dataset and preprocessor.
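Because a sample grader is a plain function over two dictionaries, it can be smoke-tested locally with fixtures shaped like the ones above. This is a sketch only: the fixture values are illustrative and contain just a subset of the documented fields.

```python
def grade(sample, item):
    return 1.0 if sample.get("extracted_output") == item.get("target") else 0.0

# Illustrative fixtures that copy a few of the documented fields.
sample = {
    "output_text": "The answer is 4.",
    "extracted_output": "4",
    "task_id": "qa_exact_match",
}
item = {
    "question": "2+2?",
    "answer": "4",
    "target": "4",
    "task_id": "qa_exact_match",
}

matching_score = grade(sample, item)
# Same sample with a different extracted answer.
mismatch_score = grade({**sample, "extracted_output": "5"}, item)
```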

Return a single score

Return a finite float when the task has one metric. MKA1 stores it under metric_id.
def grade(sample, item):
    return 1.0 if sample.get("extracted_output") == item.get("target") else 0.0
If metric_id is omitted, the score key is score.

Return multiple scores

Return a dict with a scores object when the task has multiple metrics.
def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip().lower()
    target = (item.get("target") or "").strip().lower()

    exact = 1.0 if output == target else 0.0
    contains = 1.0 if target and target in output else 0.0

    return {
        "scores": {
            "exact_match": exact,
            "contains_target": contains
        },
        "judge": {
            "output": output,
            "target": target
        }
    }
Only finite numeric score values are stored in scores. The optional judge object is preserved on the sample for debugging and dashboards.

Invalid results

The following results become invalid and receive a zero score:
  • Exceptions.
  • Non-finite floats such as NaN or Infinity.
  • Boolean returns.
  • Strings or other non-dict, non-number returns.
  • Dicts without any finite numeric score.
The sample judge payload includes the raw invalid payload and error details.
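To catch invalid returns before a run, a local check can approximate the rules listed above. This sketch is an assumption-laden approximation, not the platform's implementation: normalize_result is a hypothetical helper, it accepts ints as numbers, and it returns an empty scores dict plus an error string where MKA1 would record a zero score.

```python
import math

def _is_finite_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )

def normalize_result(result, metric_id="score"):
    """Approximate the documented rules; returns (scores, error)."""
    if isinstance(result, bool):
        return {}, "boolean return"
    if isinstance(result, (int, float)):
        if math.isfinite(result):
            # A single number is stored under metric_id.
            return {metric_id: float(result)}, None
        return {}, "non-finite number"
    if isinstance(result, dict):
        raw = result.get("scores")
        raw = raw if isinstance(raw, dict) else {}
        # Keep only finite numeric values, as described for scores.
        scores = {k: float(v) for k, v in raw.items() if _is_finite_number(v)}
        if scores:
            return scores, None
        return {}, "dict without any finite numeric score"
    return {}, f"unsupported return type: {type(result).__name__}"
```

Running a grader's return value through a check like this in unit tests makes a boolean or NaN return visible before it shows up as a zero on a dashboard.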

Batch contract

Batch graders define:
def grade_batch(samples):
    ...
They run once per task and model during finalization. Use them when the metric needs the whole set of samples.
Each batch sample contains:
{
  "sample_id": "eval_sample_abc123",
  "task_id": "macro_f1_task",
  "model": "openai:gpt-4.1-mini",
  "prompt": "Rendered prompt",
  "target": "positive",
  "output_text": "The answer is positive.",
  "extracted_output": "positive",
  "dataset_row": {
    "text": "Great support experience.",
    "label": "positive"
  },
  "response_id": "resp_...",
  "scores": {},
  "judge": null
}
Return aggregate metrics:
def grade_batch(samples):
    total = len(samples)
    correct = 0

    for sample in samples:
        if sample.get("extracted_output") == sample.get("target"):
            correct += 1

    return {
        "metrics": {
            "accuracy": correct / total if total else 0.0
        }
    }
Return sample updates when you want to add per-sample scores, judge details, or corrected extracted outputs:
def grade_batch(samples):
    updates = []
    for sample in samples:
        correct = sample.get("extracted_output") == sample.get("target")
        updates.append({
            "sample_id": sample["sample_id"],
            "scores": {
                "batch_correct": 1.0 if correct else 0.0
            },
            "judge": {
                "checked_in_batch": True
            }
        })

    return {
        "metrics": {
            # Guard against an empty batch to avoid a ZeroDivisionError.
            "batch_accuracy": sum(u["scores"]["batch_correct"] for u in updates) / len(updates) if updates else 0.0
        },
        "samples": updates
    }
Sample update fields:
| Field | Description |
| --- | --- |
| sample_id | Required. The sample to update. |
| scores | Optional numeric score keys to merge into the sample. |
| judge | Optional object to replace sample judge details. Use null to clear. |
| extracted_output | Optional string or null to replace the sample extracted output. |
If a task declares metrics, unexpected batch metric IDs are dropped from final aggregates. This protects dashboards from accidental metric drift.
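The update and filtering semantics described above can be pictured with a small sketch. Both functions are hypothetical and only illustrate the documented behavior (scores merge, judge and extracted_output replace, undeclared metric IDs are dropped); they are not the service's code.

```python
def apply_sample_update(sample, update):
    """Return a copy of sample with one documented update applied."""
    merged = dict(sample)
    if "scores" in update:
        # Scores are merged key by key into the existing scores.
        merged["scores"] = {**(sample.get("scores") or {}), **update["scores"]}
    if "judge" in update:
        # judge replaces the existing details; None (JSON null) clears them.
        merged["judge"] = update["judge"]
    if "extracted_output" in update:
        merged["extracted_output"] = update["extracted_output"]
    return merged

def filter_metrics(metrics, declared_ids):
    """Keep only declared metric IDs; keep everything if none are declared."""
    if not declared_ids:
        return dict(metrics)
    return {k: v for k, v in metrics.items() if k in declared_ids}
```

Note the difference between omitting judge from an update, which leaves the existing details untouched, and passing null, which clears them.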

Model-backed graders

Use contract: "model_backed" when Python needs model or embedding calls. The grader code requests the call through the ctx helpers, the gateway performs it with the run’s authenticated context, and the result is returned to the sandbox.
{
  "grader": {
    "type": "python",
    "contract": "model_backed",
    "model_access": "mka1",
    "max_model_calls": 4,
    "file_id": "file_judge_grader123"
  }
}
The run should set judge_model and embedding_model when the grader uses model="auto":
{
  "suite_id": "eval_suite_abc123",
  "models": ["openai:gpt-4.1-mini"],
  "judge_model": "openai:gpt-4.1-mini",
  "embedding_model": "openai:text-embedding-3-small"
}
If Python passes an explicit model ID, that explicit model is used. If it passes model="auto" or omits model, MKA1 uses the run’s judge_model or embedding_model.
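The resolution rule above amounts to a small decision. The sketch below uses a hypothetical resolve_model helper; raising an error when neither an explicit model nor a run default exists is an assumption, since this page only states that the call fails in that case.

```python
def resolve_model(requested, run_default):
    """Pick the model for a bridge call (hypothetical helper).

    requested: the model argument passed from Python, or None if omitted.
    run_default: the run's judge_model or embedding_model, or None.
    """
    if requested is None or requested == "auto":
        if run_default is None:
            # Assumption: the real system reports a failure of some kind here.
            raise ValueError("model is 'auto' but the run sets no default")
        return run_default
    # An explicit model ID always wins over the run default.
    return requested
```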

LLM-as-judge

import json

def grade(sample, item, ctx):
    response = ctx.responses_create(
        model="auto",
        input=(
            "You are grading an answer. Return JSON only.\n"
            f"Question: {sample.get('prompt', '')}\n"
            f"Reference: {item.get('target', '')}\n"
            f"Prediction: {sample.get('output_text', '')}\n"
            'Schema: {"verdict":"correct"|"incorrect","rationale":"..."}'
        ),
        temperature=0,
        text={"format": {"type": "json_object"}},
        metadata={"judge": "reference_binary"}
    )

    parsed = json.loads(response["output_text"])
    score = 1.0 if parsed.get("verdict") == "correct" else 0.0

    return {
        "scores": {
            "judge_score": score
        },
        "judge": {
            "verdict": parsed.get("verdict"),
            "rationale": parsed.get("rationale"),
            "response_id": response.get("id")
        }
    }
ctx.responses_create accepts the same request shape as the MKA1 Responses API, except the eval system forces stream=false, store=true, and background=false. Judge responses are stored and can be audited like normal Responses traffic.
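A judge grader can be exercised locally without model calls by passing a stand-in for ctx. The sketch below assumes only the response keys used above (id and output_text); FakeCtx and the shortened prompt are hypothetical. It also shows one way to handle malformed judge output explicitly instead of letting the exception invalidate the sample.

```python
import json

class FakeCtx:
    """Hypothetical stand-in for ctx; returns a canned judge response."""

    def __init__(self, output_text):
        self.output_text = output_text
        self.calls = []  # records every request for later inspection

    def responses_create(self, **request):
        self.calls.append(request)
        return {"id": "resp_fake", "output_text": self.output_text}

def grade(sample, item, ctx):
    response = ctx.responses_create(
        model="auto",
        input=(
            f"Reference: {item.get('target', '')}\n"
            f"Prediction: {sample.get('output_text', '')}"
        ),
        temperature=0,
    )
    raw = response.get("output_text")
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        parsed = None
    if not isinstance(parsed, dict):
        # Malformed judge output: score 0 and keep the raw text for debugging.
        return {"scores": {"judge_score": 0.0}, "judge": {"raw_output": raw}}
    score = 1.0 if parsed.get("verdict") == "correct" else 0.0
    return {"scores": {"judge_score": score}, "judge": {"verdict": parsed.get("verdict")}}

ctx_ok = FakeCtx('{"verdict": "correct", "rationale": "matches"}')
result = grade({"output_text": "4"}, {"target": "4"}, ctx_ok)
```

Returning an explicit zero with the raw text keeps the sample valid and leaves a trace in judge, which can be easier to triage than an exception.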

Embedding similarity

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def grade(sample, item, ctx):
    result = ctx.embeddings_create(
        model="auto",
        input=[
            sample.get("output_text") or "",
            item.get("target") or ""
        ]
    )

    output_embedding = result["data"][0]["embedding"]
    target_embedding = result["data"][1]["embedding"]
    score = cosine(output_embedding, target_embedding)

    return {
        "scores": {
            "semantic_similarity": score
        },
        "judge": {
            "embedding_model": result.get("model")
        }
    }
ctx.embeddings_create routes through the MKA1 Embeddings API. Usage is logged under the run’s authenticated context.
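The same local-testing approach works for embedding graders. In this sketch, FakeEmbeddingsCtx, its fixed two-dimensional vectors, and the "fake-embedding-model" name are all hypothetical; only the result shape (data[i]["embedding"] and model) is taken from the example above.

```python
import math

class FakeEmbeddingsCtx:
    """Hypothetical stand-in for ctx that serves fixed vectors per text."""

    def __init__(self, vectors):
        self.vectors = vectors  # maps text -> embedding list

    def embeddings_create(self, model, input):
        data = [{"embedding": self.vectors[text]} for text in input]
        return {"model": "fake-embedding-model", "data": data}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def grade(sample, item, ctx):
    result = ctx.embeddings_create(
        model="auto",
        input=[sample.get("output_text") or "", item.get("target") or ""],
    )
    score = cosine(result["data"][0]["embedding"], result["data"][1]["embedding"])
    return {"scores": {"semantic_similarity": score}}

ctx = FakeEmbeddingsCtx({
    "great support": [1.0, 0.0],
    "excellent help": [1.0, 0.0],   # same direction as "great support"
    "slow refund": [0.0, 1.0],      # orthogonal to "great support"
})
same = grade({"output_text": "great support"}, {"target": "excellent help"}, ctx)
different = grade({"output_text": "great support"}, {"target": "slow refund"}, ctx)
```

With these vectors the first pair points in the same direction and the second pair is orthogonal, so the two scores bracket the range a real embedding model would typically fall within for related and unrelated text.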

Python preprocessors

Preprocessors are not graders, but they use the same sandbox execution model and file loading rules. They run before prompt rendering.
Row preprocessor:
def transform(row):
    row["question"] = row["question"].strip()
    row["answer"] = str(row["answer"]).strip()
    return row
Batch preprocessor:
def transform_batch(rows):
    out = []
    for row in rows:
        if row.get("skip"):
            continue
        row["difficulty"] = row.get("difficulty") or "unknown"
        out.append(row)
    return out
Declare them on a task:
{
  "preprocess": {
    "type": "python",
    "contract": "batch",
    "file_id": "file_preprocessor123",
    "timeout_seconds": 120
  }
}
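Preprocessors are plain functions over rows, so the two examples above can be checked locally before upload. A minimal sketch with made-up rows follows; the row contents are illustrative only.

```python
def transform(row):
    row["question"] = row["question"].strip()
    row["answer"] = str(row["answer"]).strip()
    return row

def transform_batch(rows):
    out = []
    for row in rows:
        if row.get("skip"):
            continue
        row["difficulty"] = row.get("difficulty") or "unknown"
        out.append(row)
    return out

# Illustrative rows: one with stray whitespace, one flagged to skip.
rows = [
    {"question": "  2+2? ", "answer": 4},
    {"question": "draft row", "answer": "", "skip": True},
    {"question": "3*3?", "answer": "9", "difficulty": "easy"},
]

cleaned = transform(dict(rows[0]))  # copy so the fixture stays unchanged
processed = transform_batch([dict(r) for r in rows])
```

Both functions mutate the row they receive, which is why the sketch passes copies of the fixtures.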

Common scoring recipes

Exact match

def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip()
    target = (item.get("target") or "").strip()
    return {"scores": {"exact_match": 1.0 if output == target else 0.0}}

Case-insensitive exact match

def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip().lower()
    target = (item.get("target") or "").strip().lower()
    return {"scores": {"exact_match": 1.0 if output == target else 0.0}}

Numeric tolerance

def grade(sample, item):
    try:
        output = float(sample.get("extracted_output"))
        target = float(item.get("target"))
    except (TypeError, ValueError):
        return {"scores": {"numeric_match": 0.0}}

    return {
        "scores": {
            "numeric_match": 1.0 if abs(output - target) <= 0.01 else 0.0,
            "absolute_error": abs(output - target)
        }
    }
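A fixed absolute tolerance such as 0.01 can be too loose for small targets and too strict for large ones. One alternative, sketched here with the standard library's math.isclose, combines a relative and an absolute tolerance. The tolerance values are illustrative, not platform defaults, and scoring non-finite inputs as 0 is a choice made for this sketch.

```python
import math

def grade(sample, item):
    try:
        output = float(sample.get("extracted_output"))
        target = float(item.get("target"))
    except (TypeError, ValueError):
        return {"scores": {"numeric_match": 0.0}}

    # float() accepts "nan" and "inf"; treat those as a miss.
    if not (math.isfinite(output) and math.isfinite(target)):
        return {"scores": {"numeric_match": 0.0}}

    # rel_tol scales with the magnitude of the values; abs_tol covers targets near 0.
    close = math.isclose(output, target, rel_tol=1e-3, abs_tol=1e-9)
    return {"scores": {"numeric_match": 1.0 if close else 0.0}}
```

With these settings, 1000.5 matches a target of 1000, while 0.02 does not match a target of 0.01.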

Token F1

import re
from collections import Counter

def tokens(text):
    return re.findall(r"\w+", (text or "").lower())

def f1(prediction, target):
    pred = tokens(prediction)
    gold = tokens(target)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0

    overlap = Counter(pred) & Counter(gold)
    common = sum(overlap.values())
    if common == 0:
        return 0.0

    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

def grade(sample, item):
    return {
        "scores": {
            "token_f1": f1(sample.get("output_text", ""), item.get("target", ""))
        }
    }
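A worked example helps confirm how the token F1 above treats repeated tokens. The functions are repeated so the snippet runs on its own; the input strings are made up for illustration.

```python
import re
from collections import Counter

def tokens(text):
    return re.findall(r"\w+", (text or "").lower())

def f1(prediction, target):
    pred = tokens(prediction)
    gold = tokens(target)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    # Counter intersection keeps the minimum count per token,
    # so a repeated word is only credited as often as it appears in both.
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

# pred = [paris, paris, france], gold = [paris, france]
# common = 2, precision = 2/3, recall = 2/2, so F1 is approximately 0.8
example_score = f1("Paris Paris France", "paris france")
```

The extra "Paris" in the prediction lowers precision but not recall, which is why the score falls below 1.0 even though every target token is present.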

Macro F1 with grade_batch

def grade_batch(samples):
    labels = sorted({
        s.get("target")
        for s in samples
        if s.get("target") is not None
    })

    per_label = []
    for label in labels:
        tp = fp = fn = 0
        for sample in samples:
            pred = sample.get("extracted_output")
            target = sample.get("target")
            if pred == label and target == label:
                tp += 1
            elif pred == label and target != label:
                fp += 1
            elif pred != label and target == label:
                fn += 1

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label.append(score)

    return {
        "metrics": {
            "macro_f1": sum(per_label) / len(per_label) if per_label else 0.0
        },
        "samples": [
            {
                "sample_id": s["sample_id"],
                "scores": {
                    "correct": 1.0 if s.get("extracted_output") == s.get("target") else 0.0
                }
            }
            for s in samples
        ]
    }

Debugging graders

Use sample details first:
curl 'https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/samples?limit=10&status=failed' \
  --header 'Authorization: Bearer <mka1-api-key>'
Look at:
| Field | What to check |
| --- | --- |
| prompt | The rendered prompt after preprocessing and few-shot examples. |
| target | The rendered target string. |
| output_text | Raw model output after stop-sequence handling. |
| extracted_output | Output after extraction. |
| scores | Numeric values that affected aggregate metrics. |
| judge | Grader-returned details plus raw execution payload when available. |
| error | Candidate generation, extraction, preprocessing, or sandbox failure details. |
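The samples request at the start of this section can also be issued from Python. This is a sketch using only the standard library: build_samples_request and fetch_samples are hypothetical helper names, the URL and query parameters are copied from the curl command, and the shape of the JSON response is not described on this page, so the body is returned undecoded beyond JSON parsing.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://apigw.mka1.com/api/v1/llm/evals/runs"

def build_samples_request(run_id, api_key, limit=10, status="failed"):
    """Build the GET request for a run's samples without sending it."""
    query = urllib.parse.urlencode({"limit": limit, "status": status})
    url = f"{BASE_URL}/{run_id}/samples?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})

def fetch_samples(run_id, api_key, **params):
    """Send the request and parse the JSON body (response shape not assumed)."""
    request = build_samples_request(run_id, api_key, **params)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Read the API key from the environment or a secret store when calling fetch_samples; do not hard-code it in scripts.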
Common fixes:
| Symptom | Fix |
| --- | --- |
| Score is always 0 | Check that the grader returns a float or a dict with finite numeric scores. |
| ctx.responses_create fails | Set contract: "model_backed" and model_access: "mka1", then provide a judge_model on the run or an explicit model in Python. |
| ctx.embeddings_create fails | Set an embedding_model on the run or pass an explicit embedding model in Python. |
| Prompt fields are blank | Verify the dataset row field names and preprocessor output. |
| Batch metrics are missing | Declare the metric IDs in metrics, or omit metrics for aggregate tasks that should accept every returned metric. |
| Grader times out | Increase timeout_seconds, reduce model bridge calls, or move expensive aggregate work into smaller tasks. |

Security model

Python graders and preprocessors execute in the sandbox service. They are intended for eval logic, not for arbitrary application workflows. Keep these rules in mind:
  • Do not put secrets in grader source, dataset rows, or metadata.
  • Do not expect raw network credentials inside Python.
  • Use ctx.responses_create and ctx.embeddings_create for model calls.
  • Keep uploaded Python files scoped to the team that owns the suite.
  • Prefer uploaded grader files for reusable logic so suite versions clearly track their dependencies.
The eval API preserves the model call details it can observe, including candidate response_ids and grader judge payloads, so you can audit how scores were produced.