Eval scoring is Python-backed.
Each task defines a Python grader with one of three contracts:
| Contract | Function | Best for |
|---|---|---|
| sample | grade(sample, item[, ctx]) | Per-sample exact match, F1, numeric tolerance, regex, custom rubric checks. |
| batch | grade_batch(samples[, ctx]) | Aggregate scoring, macro metrics, task-level judges, or sample updates after seeing the whole batch. |
| model_backed | grade or grade_batch with ctx model helpers | LLM-as-judge and embedding/similarity scoring through MKA1. |
Python runs in the sandbox service.
Do not pass API keys or provider credentials into grader code.
Use ctx.responses_create and ctx.embeddings_create for model-backed scoring.
Grader declaration
Use inline source for short graders:
{
  "grader": {
    "type": "python",
    "contract": "sample",
    "metric_id": "score",
    "source": "def grade(sample, item):\n    return 1.0 if sample['extracted_output'] == item['target'] else 0.0\n",
    "timeout_seconds": 120
  }
}
Use uploaded files for reusable graders:
{
  "grader": {
    "type": "python",
    "contract": "sample",
    "file_id": "file_grader123",
    "timeout_seconds": 120
  }
}
Fields:
| Field | Description |
|---|---|
| type | Must be python. |
| contract | sample, batch, or model_backed. Defaults to sample. |
| execution | Optional override: per_sample or aggregate. The default is derived from contract. |
| model_access | mka1 enables ctx.responses_create and ctx.embeddings_create. Defaults to mka1 for model_backed, otherwise none. |
| metric_id | Score key used when the grader returns a single float. Defaults to score. |
| source | Inline Python source. |
| file_id | Uploaded Python file ID with purpose=evals. |
| timeout_seconds | Sandbox execution timeout. Range is 1 to 600. Defaults to 120. |
| max_model_calls | Maximum bridge calls from one grader execution. Range is 0 to 500. Defaults to 64. |
Provide either source or file_id.
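The field rules above can be checked locally before a task is submitted. The following is a minimal sketch, not MKA1's own validation: validate_grader is a hypothetical helper, it assumes that exactly one of source or file_id should be present, and it assumes numeric values for timeout_seconds and max_model_calls.

```python
VALID_CONTRACTS = {"sample", "batch", "model_backed"}


def _in_range(value, low, high):
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return low <= value <= high


def validate_grader(grader):
    """Return a list of problems found in a grader declaration (empty if none)."""
    problems = []
    if grader.get("type") != "python":
        problems.append("type must be 'python'")
    contract = grader.get("contract", "sample")  # documented default
    if contract not in VALID_CONTRACTS:
        problems.append(f"unknown contract: {contract!r}")
    # Assumption: exactly one of source / file_id should be provided.
    if ("source" in grader) == ("file_id" in grader):
        problems.append("provide exactly one of source or file_id")
    if not _in_range(grader.get("timeout_seconds", 120), 1, 600):
        problems.append("timeout_seconds must be between 1 and 600")
    if not _in_range(grader.get("max_model_calls", 64), 0, 500):
        problems.append("max_model_calls must be between 0 and 500")
    return problems
```

An empty list means the declaration passed these local checks; the service may still apply rules that this sketch does not cover.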
Sample contract
Sample graders define:
def grade(sample, item):
    ...
They can also accept ctx:
def grade(sample, item, ctx):
    ...
sample describes the model output:
{
  "output_text": "Raw assistant text",
  "extracted_output": "Parsed answer",
  "model": "openai:gpt-4.1-mini",
  "prompt": "Rendered prompt",
  "task_id": "qa_exact_match",
  "run_id": "eval_run_abc123",
  "sample_id": "eval_sample_abc123"
}
item contains the dataset row plus convenience fields:
{
  "question": "2+2?",
  "answer": "4",
  "prompt": "Rendered prompt",
  "target": "4",
  "reference_answer": "4",
  "choices": [],
  "task_id": "qa_exact_match"
}
The exact row fields depend on your dataset and preprocessor.
Return a single score
Return a finite float when the task has one metric.
MKA1 stores it under metric_id.
def grade(sample, item):
    return 1.0 if sample.get("extracted_output") == item.get("target") else 0.0
If metric_id is omitted, the score key is score.
Return multiple scores
Return a dict with a scores object when the task has multiple metrics.
def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip().lower()
    target = (item.get("target") or "").strip().lower()
    exact = 1.0 if output == target else 0.0
    contains = 1.0 if target and target in output else 0.0
    return {
        "scores": {
            "exact_match": exact,
            "contains_target": contains
        },
        "judge": {
            "output": output,
            "target": target
        }
    }
Only finite numeric score values are stored in scores.
The optional judge object is preserved on the sample for debugging and dashboards.
Invalid results
The following results become invalid and receive a zero score:
- Exceptions.
- Non-finite floats such as NaN or Infinity.
- Boolean returns.
- Strings or other non-dict, non-number returns.
- Dicts without any finite numeric score.
The sample judge payload includes the raw invalid payload and error details.
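To see how these rules play out, a grader's return value can be classified locally before it is uploaded. This is a rough sketch of the rules listed above, not the eval system's actual implementation; normalize_result is a hypothetical helper, and it assumes that plain integers count as valid numbers.

```python
import math


def _finite_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


def normalize_result(result, metric_id="score"):
    """Return a scores dict, or None when the result would be invalid."""
    if _finite_number(result):
        # A single number is stored under metric_id.
        return {metric_id: float(result)}
    if isinstance(result, dict):
        raw = result.get("scores")
        if isinstance(raw, dict):
            # Keep only finite numeric values; drop everything else.
            scores = {k: float(v) for k, v in raw.items() if _finite_number(v)}
            if scores:
                return scores
    # Booleans, strings, NaN/Infinity, and dicts without a finite score.
    return None
```

For example, normalize_result(True) and normalize_result(float("nan")) both return None, while a dict whose scores mix finite and non-finite values keeps only the finite ones.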
Batch contract
Batch graders define:
def grade_batch(samples):
    ...
They run once per task and model during finalization.
Use them when the metric needs the whole set of samples.
Each batch sample contains:
{
  "sample_id": "eval_sample_abc123",
  "task_id": "macro_f1_task",
  "model": "openai:gpt-4.1-mini",
  "prompt": "Rendered prompt",
  "target": "positive",
  "output_text": "The answer is positive.",
  "extracted_output": "positive",
  "dataset_row": {
    "text": "Great support experience.",
    "label": "positive"
  },
  "response_id": "resp_...",
  "scores": {},
  "judge": null
}
Return aggregate metrics:
def grade_batch(samples):
    total = len(samples)
    correct = 0
    for sample in samples:
        if sample.get("extracted_output") == sample.get("target"):
            correct += 1
    return {
        "metrics": {
            "accuracy": correct / total if total else 0.0
        }
    }
Return sample updates when you want to add per-sample scores, judge details, or corrected extracted outputs:
def grade_batch(samples):
    updates = []
    for sample in samples:
        correct = sample.get("extracted_output") == sample.get("target")
        updates.append({
            "sample_id": sample["sample_id"],
            "scores": {
                "batch_correct": 1.0 if correct else 0.0
            },
            "judge": {
                "checked_in_batch": True
            }
        })
    total_correct = sum(u["scores"]["batch_correct"] for u in updates)
    return {
        "metrics": {
            # Guard against an empty batch to avoid ZeroDivisionError.
            "batch_accuracy": total_correct / len(updates) if updates else 0.0
        },
        "samples": updates
    }
Sample update fields:
| Field | Description |
|---|---|
| sample_id | Required. The sample to update. |
| scores | Optional numeric score keys to merge into the sample. |
| judge | Optional object to replace sample judge details. Use null to clear. |
| extracted_output | Optional string or null to replace the sample extracted output. |
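The merge-versus-replace behavior in the table can be illustrated with a small local model. This is a sketch based only on the field descriptions above; apply_update is a hypothetical helper, and the assumption that an absent judge or extracted_output key leaves the existing value untouched is ours.

```python
_MISSING = object()  # distinguishes "key absent" from an explicit None


def apply_update(sample, update):
    """Return a copy of sample with one sample update applied."""
    if update["sample_id"] != sample["sample_id"]:
        raise ValueError("update targets a different sample")
    merged = dict(sample)
    # scores: merged key by key, so existing keys survive unless overwritten.
    merged["scores"] = {**(sample.get("scores") or {}), **(update.get("scores") or {})}
    # judge and extracted_output: replaced only when the key is present;
    # an explicit None clears the value.
    for key in ("judge", "extracted_output"):
        value = update.get(key, _MISSING)
        if value is not _MISSING:
            merged[key] = value
    return merged
```

With this model, an update of {"sample_id": ..., "judge": None} clears the judge details, while an update that omits judge keeps them.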
If a task declares metrics, unexpected batch metric IDs are dropped from final aggregates.
This protects dashboards from accidental metric drift.
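The effect of that filter can be sketched as follows. The helper name filter_metrics is hypothetical, and representing the task's declared metrics as a list of ID strings is an assumption made for illustration.

```python
def filter_metrics(returned, declared=None):
    """Keep only declared metric IDs; keep everything if none are declared."""
    if not declared:
        # No declared metrics: every returned metric is accepted.
        return dict(returned)
    allowed = set(declared)
    return {key: value for key, value in returned.items() if key in allowed}


kept = filter_metrics({"accuracy": 0.9, "debug_count": 3.0}, declared=["accuracy"])
```

Here kept contains only accuracy, because debug_count was not declared.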
Model-backed graders
Use contract: "model_backed" when Python needs model or embedding calls.
The Python code does not call a provider directly: it requests the call through ctx.
The gateway performs the call with the run’s authenticated context and returns the result to the sandbox.
{
  "grader": {
    "type": "python",
    "contract": "model_backed",
    "model_access": "mka1",
    "max_model_calls": 4,
    "file_id": "file_judge_grader123"
  }
}
The run should set judge_model and embedding_model when the grader uses model="auto":
{
  "suite_id": "eval_suite_abc123",
  "models": ["openai:gpt-4.1-mini"],
  "judge_model": "openai:gpt-4.1-mini",
  "embedding_model": "openai:text-embedding-3-small"
}
If Python passes an explicit model ID, that explicit model is used.
If it passes model="auto" or omits model, MKA1 uses the run’s judge_model or embedding_model.
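The resolution rule can be written down as a small function. This is an illustrative sketch only: resolve_model is a hypothetical name, and raising an error when no run-level default exists is an assumption about the failure mode, not documented behavior.

```python
def resolve_model(requested, run_default):
    """Pick the model for a ctx call from the request and the run default."""
    if requested in (None, "auto"):
        if not run_default:
            # Assumption: a missing run-level default is treated as an error.
            raise ValueError("no explicit model and no run-level default")
        return run_default
    # An explicit model ID always wins over the run default.
    return requested


judge = resolve_model("auto", "openai:gpt-4.1-mini")
explicit = resolve_model("openai:text-embedding-3-small", "ignored-default")
```

Pass the run's judge_model as run_default for ctx.responses_create calls and its embedding_model for ctx.embeddings_create calls.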
LLM-as-judge
import json

def grade(sample, item, ctx):
    response = ctx.responses_create(
        model="auto",
        input=(
            "You are grading an answer. Return JSON only.\n"
            f"Question: {sample.get('prompt', '')}\n"
            f"Reference: {item.get('target', '')}\n"
            f"Prediction: {sample.get('output_text', '')}\n"
            'Schema: {"verdict":"correct"|"incorrect","rationale":"..."}'
        ),
        temperature=0,
        text={"format": {"type": "json_object"}},
        metadata={"judge": "reference_binary"}
    )
    parsed = json.loads(response["output_text"])
    score = 1.0 if parsed.get("verdict") == "correct" else 0.0
    return {
        "scores": {
            "judge_score": score
        },
        "judge": {
            "verdict": parsed.get("verdict"),
            "rationale": parsed.get("rationale"),
            "response_id": response.get("id")
        }
    }
ctx.responses_create accepts the same request shape as the MKA1 Responses API, except the eval system forces stream=false, store=true, and background=false.
Judge responses are stored and can be audited like normal Responses traffic.
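If the judge returns malformed JSON, json.loads raises and the sample is marked invalid. One alternative is to catch the parse error and record it in the judge payload, so the failure is visible alongside the other judge details. The sketch below shows that option; parse_verdict is a hypothetical helper, and the parse_error and raw keys are illustrative names, not fields that MKA1 defines.

```python
import json


def parse_verdict(output_text):
    """Turn judge text into a grader result without raising on bad JSON."""
    try:
        parsed = json.loads(output_text)
    except (TypeError, ValueError) as exc:  # JSONDecodeError is a ValueError
        return {
            "scores": {"judge_score": 0.0},
            "judge": {"parse_error": str(exc), "raw": output_text},
        }
    if not isinstance(parsed, dict):
        return {
            "scores": {"judge_score": 0.0},
            "judge": {"parse_error": "judge output is not a JSON object", "raw": output_text},
        }
    verdict = parsed.get("verdict")
    return {
        "scores": {"judge_score": 1.0 if verdict == "correct" else 0.0},
        "judge": {"verdict": verdict, "rationale": parsed.get("rationale")},
    }
```

Inside a grader this would be called as parse_verdict(response["output_text"]). Either way the sample scores zero; the difference is whether it is recorded as a valid zero with an explanation or as an invalid result.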
Embedding similarity
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def grade(sample, item, ctx):
    result = ctx.embeddings_create(
        model="auto",
        input=[
            sample.get("output_text", ""),
            item.get("target", "")
        ]
    )
    output_embedding = result["data"][0]["embedding"]
    target_embedding = result["data"][1]["embedding"]
    score = cosine(output_embedding, target_embedding)
    return {
        "scores": {
            "semantic_similarity": score
        },
        "judge": {
            "embedding_model": result.get("model")
        }
    }
ctx.embeddings_create routes through the MKA1 Embeddings API.
Usage is logged under the run’s authenticated context.
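Because ctx only exists inside the sandbox, it can help to exercise a model-backed grader locally against a stand-in first. The sketch below assumes the response shape used in the grader above (data[i].embedding plus model); FakeCtx and its letter-count "embedding" are hypothetical test scaffolding, not part of MKA1, and the grader is a condensed copy of the one above.

```python
import math


class FakeCtx:
    """Hypothetical local stand-in for ctx; not part of MKA1."""

    def embeddings_create(self, model="auto", input=()):
        # Toy embedding: counts of the letters a-z. Real embeddings differ.
        def embed(text):
            text = (text or "").lower()
            return [float(text.count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]

        return {
            "model": "fake-embedding",
            "data": [{"embedding": embed(text)} for text in input],
        }


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def grade(sample, item, ctx):
    result = ctx.embeddings_create(
        model="auto",
        input=[sample.get("output_text", ""), item.get("target", "")],
    )
    score = cosine(result["data"][0]["embedding"], result["data"][1]["embedding"])
    return {"scores": {"semantic_similarity": score}}


# Identical texts should score close to 1.0; an empty output scores 0.0.
same = grade({"output_text": "positive"}, {"target": "positive"}, FakeCtx())
empty = grade({"output_text": ""}, {"target": "positive"}, FakeCtx())
```

A stand-in like this only checks the grader's control flow and return shape; it says nothing about how real embeddings will score.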
Python preprocessors
Preprocessors are not graders, but they use the same sandbox execution model and file loading rules.
They run before prompt rendering.
Row preprocessor:
def transform(row):
    row["question"] = row["question"].strip()
    row["answer"] = str(row["answer"]).strip()
    return row
Batch preprocessor:
def transform_batch(rows):
    out = []
    for row in rows:
        if row.get("skip"):
            continue
        row["difficulty"] = row.get("difficulty") or "unknown"
        out.append(row)
    return out
Declare them on a task:
{
  "preprocess": {
    "type": "python",
    "contract": "batch",
    "file_id": "file_preprocessor123",
    "timeout_seconds": 120
  }
}
Common scoring recipes
Exact match
def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip()
    target = (item.get("target") or "").strip()
    return {"scores": {"exact_match": 1.0 if output == target else 0.0}}
Case-insensitive exact match
def grade(sample, item):
    output = (sample.get("extracted_output") or "").strip().lower()
    target = (item.get("target") or "").strip().lower()
    return {"scores": {"exact_match": 1.0 if output == target else 0.0}}
Numeric tolerance
def grade(sample, item):
    try:
        output = float(sample.get("extracted_output"))
        target = float(item.get("target"))
    except (TypeError, ValueError):
        return {"scores": {"numeric_match": 0.0}}
    return {
        "scores": {
            "numeric_match": 1.0 if abs(output - target) <= 0.01 else 0.0,
            "absolute_error": abs(output - target)
        }
    }
Token F1
import re
from collections import Counter

def tokens(text):
    return re.findall(r"\w+", (text or "").lower())

def f1(prediction, target):
    pred = tokens(prediction)
    gold = tokens(target)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = Counter(pred) & Counter(gold)
    common = sum(overlap.values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

def grade(sample, item):
    return {
        "scores": {
            "token_f1": f1(sample.get("output_text", ""), item.get("target", ""))
        }
    }
Macro F1 with grade_batch
def grade_batch(samples):
    labels = sorted({
        s.get("target")
        for s in samples
        if s.get("target") is not None
    })
    per_label = []
    for label in labels:
        # Count true positives, false positives, and false negatives per label.
        tp = fp = fn = 0
        for sample in samples:
            pred = sample.get("extracted_output")
            target = sample.get("target")
            if pred == label and target == label:
                tp += 1
            elif pred == label and target != label:
                fp += 1
            elif pred != label and target == label:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label.append(score)
    return {
        "metrics": {
            "macro_f1": sum(per_label) / len(per_label) if per_label else 0.0
        },
        "samples": [
            {
                "sample_id": s["sample_id"],
                "scores": {
                    "correct": 1.0 if s.get("extracted_output") == s.get("target") else 0.0
                }
            }
            for s in samples
        ]
    }
Debugging graders
Use sample details first:
curl 'https://apigw.mka1.com/api/v1/llm/evals/runs/eval_run_abc123/samples?limit=10&status=failed' \
--header 'Authorization: Bearer <mka1-api-key>'
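The same request can be built from Python with the standard library. The sketch below only constructs the request; build_samples_request is a hypothetical helper, and the URL, query parameters, and header mirror the curl command above.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://apigw.mka1.com/api/v1/llm/evals"


def build_samples_request(run_id, api_key, limit=10, status="failed"):
    """Build (but do not send) the sample-listing request for a run."""
    query = urllib.parse.urlencode({"limit": limit, "status": status})
    run = urllib.parse.quote(run_id, safe="")
    url = f"{BASE_URL}/runs/{run}/samples?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


request = build_samples_request("eval_run_abc123", "<mka1-api-key>")
```

Send it with urllib.request.urlopen(request) and decode the body with json.load. The response shape is not described here, so inspect it before relying on specific keys.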
Look at:
| Field | What to check |
|---|---|
| prompt | The rendered prompt after preprocessing and few-shot examples. |
| target | The rendered target string. |
| output_text | Raw model output after stop-sequence handling. |
| extracted_output | Output after extraction. |
| scores | Numeric values that affected aggregate metrics. |
| judge | Grader-returned details plus raw execution payload when available. |
| error | Candidate generation, extraction, preprocessing, or sandbox failure details. |
Common fixes:
| Symptom | Fix |
|---|---|
| Score is always 0 | Check that the grader returns a float or a dict with finite numeric scores. |
| ctx.responses_create fails | Set contract: "model_backed" and model_access: "mka1", then provide a judge_model on the run or an explicit model in Python. |
| ctx.embeddings_create fails | Set an embedding_model on the run or pass an explicit embedding model in Python. |
| Prompt fields are blank | Verify the dataset row field names and preprocessor output. |
| Batch metrics are missing | Declare the metric IDs in metrics, or omit metrics for aggregate tasks that should accept every returned metric. |
| Grader times out | Increase timeout_seconds, reduce model bridge calls, or move expensive aggregate work into smaller tasks. |
Security model
Python graders and preprocessors execute in the sandbox service.
They are intended for eval logic, not for arbitrary application workflows.
Keep these rules in mind:
- Do not put secrets in grader source, dataset rows, or metadata.
- Do not expect raw network credentials inside Python.
- Use ctx.responses_create and ctx.embeddings_create for model calls.
- Keep uploaded Python files scoped to the team that owns the suite.
- Prefer uploaded grader files for reusable logic so suite versions clearly track their dependencies.
The eval API preserves the model call details it can observe, including candidate response_ids and grader judge payloads, so you can audit how scores were produced.