Documentation Index
Fetch the complete documentation index at: https://docs.mka1.com/llms.txt
Use this file to discover all available pages before exploring further.
An eval suite is a reusable manifest.
It describes what to run, how to render prompts, how to preprocess rows, how to extract model outputs, and which Python grader should score each sample or task batch.
This page covers the manifest surface.
For the end-to-end run flow, start with Run evals.
For Python scoring contracts, see Write Python eval graders.
Manifest shape
Every suite version stores one manifest.
{
"schema_version": "2026-05-27",
"tasks": [
{
"id": "qa_exact_match",
"name": "QA exact match",
"type": "qa",
"dataset": {
"file_id": "file_dataset123",
"format": "jsonl"
},
"prompt_template": "Answer with only the final answer.\n\nQuestion: {{question}}",
"target_template": "{{answer}}",
"output_extraction": {
"type": "take_first",
"lines": 1
},
"metrics": [
{
"id": "exact_match",
"aggregation": "mean",
"higher_is_better": true
}
],
"grader": {
"type": "python",
"contract": "sample",
"file_id": "file_grader123"
},
"metadata": {
"source": "qa-regression"
}
}
],
"metadata": {
"project": "model-quality"
}
}
Manifest limits:
| Limit | Value |
|---|
| Tasks per suite version | 1 to 100 |
| Task ID characters | Letters, numbers, _, ., and - |
| Few-shot examples per sample | 0 to 100 |
| Run models per run | 1 to 20 |
| Run concurrency | 1 to 25 |
Hugging Face max_rows | Up to 1,000,000 |
Task types
type labels the task for humans and downstream reporting.
Scoring is not hardcoded by task type.
The Python grader determines the actual metric behavior.
Supported task labels:
| Type | Common use |
|---|
classification | Label classification or rubric categories. |
multiple_choice | Multiple-choice answer selection. |
qa | Question answering with exact match, F1, semantic, or judge scoring. |
summarization | Summary quality, coverage, and factuality checks. |
semantic_similarity | Embedding or sentence-similarity scoring through Python. |
llm_judge | Model-backed rubric scoring through ctx.responses_create. |
numeric | Numeric extraction and tolerance scoring. |
math | Math or symbolic-answer evaluation. |
custom | Any task that does not fit the other labels. |
Use metadata on tasks and runs for your own experiment identifiers.
Metadata keys may be up to 64 characters and values up to 512 characters.
Uploaded datasets
Use uploaded files for controlled, reproducible datasets.
Upload them through /files with purpose=evals, then reference the file ID in the task.
{
"dataset": {
"file_id": "file_dataset123",
"format": "jsonl"
}
}
format can be jsonl or csv.
If you omit it, the gateway infers the format from the uploaded filename.
JSONL rows must be objects:
{"question":"2+2?","answer":"4","difficulty":"easy"}
{"question":"Capital of France?","answer":"Paris","difficulty":"easy"}
CSV files use the first row as headers:
question,answer,difficulty
2+2?,4,easy
Capital of France?,Paris,easy
Hugging Face datasets
Use Hugging Face when your eval should run from a public or preconfigured private dataset.
The gateway loads rows from the Hugging Face datasets server.
{
"dataset": {
"source": "huggingface",
"path": "facebook/belebele",
"name": "ukr_Cyrl",
"split": "test",
"revision": "main",
"max_rows": 100
}
}
Fields:
| Field | Description |
|---|
source | Set to huggingface. It is inferred when path is present. |
path | Dataset path, such as facebook/belebele. |
name | Dataset config. If omitted, MKA1 tries to discover a config for the requested split. |
split | Split to load. Defaults to test. |
revision | Optional dataset revision. |
max_rows | Optional cap for loading and smoke tests. |
dataset_kwargs | Additional loading metadata retained for parity with existing eval definitions. |
Private Hugging Face datasets require a Hugging Face token configured for the gateway environment.
The token is not supplied in the eval manifest.
Contact MKA1 if your team needs access to private or gated Hugging Face datasets.
Hugging Face json and csv builders
For path: "json" or path: "csv", provide data_files.
Only HTTP(S) URLs are allowed, and local/private network hosts are rejected.
{
"dataset": {
"source": "huggingface",
"path": "json",
"split": "validation",
"data_files": {
"validation": "https://huggingface.co/datasets/acme/evals/resolve/main/validation.jsonl"
},
"max_rows": 500
}
}
You can also put data_files under dataset_kwargs.data_files.
Templates
prompt_template and target_template render with values from the current dataset row.
{
"prompt_template": "Context: {{context}}\n\nQuestion: {{question}}\n\nChoices:\n{{choice_list}}\n\nAnswer:",
"target_template": "{{answer}}",
"choices": ["A", "B", "C", "D"]
}
Template behavior:
| Pattern | Result |
|---|
{{field}} | Reads a top-level row field. |
{{nested.field}} | Reads a nested object field. |
{{row}} | Inserts the full row as JSON. |
{{choices}} | Inserts the task choices array as JSON. |
{{choice_list}} | Inserts the task choices joined by newlines. |
Missing values render as an empty string.
Objects and arrays render as JSON.
Few-shot examples
Few-shot examples are rendered before the evaluated prompt.
Use num_fewshot for lm-eval style manifests, or the expanded fewshot object when you need more control.
Expanded form:
{
"fewshot": {
"count": 2,
"dataset": {
"source": "huggingface",
"path": "facebook/belebele",
"name": "ukr_Cyrl",
"split": "train"
},
"prompt_template": "Question: {{question}}",
"target_template": "{{correct_answer_num}}",
"example_template": "{{prompt}}\nAnswer: {{target}}",
"separator": "\n\n---\n\n",
"strategy": "first",
"seed": 0
}
}
Few-shot fields:
| Field | Description |
|---|
count | Number of examples to prepend. |
dataset | Optional separate dataset. If omitted, examples come from the task dataset. Hugging Face few-shot datasets default to train when split is omitted. |
prompt_template | Optional template for few-shot prompts. Defaults to the task prompt_template. |
target_template | Optional template for few-shot targets. Defaults to the task target_template. |
example_template | Template that combines rendered {{prompt}} and {{target}}. Defaults to {{prompt}}\n{{target}}. |
separator | Separator between examples and the evaluated prompt. |
strategy | first or random. |
seed | Seed used by random selection. |
The current row is excluded when few-shot examples are drawn from the same in-memory row list.
Python preprocessors
Use preprocessors to normalize dataset rows before prompt rendering.
They run in the Python sandbox with the same file scoping rules as graders.
{
"preprocess": {
"type": "python",
"contract": "row",
"source": "def transform(row):\n row['question'] = row['question'].strip()\n return row\n",
"timeout_seconds": 120
}
}
Contracts:
| Contract | Function names | Use |
|---|
row | transform_row(row), transform(row), or process_doc(row) | Transform one row at a time. |
batch | transform_batch(rows) or process_docs(rows) | Transform, filter, duplicate, or join rows as a batch. |
Preprocessors may return:
| Return value | Meaning |
|---|
dict | One transformed row. |
list[dict] | Zero, one, or many transformed rows. |
None | No rows for that input. |
If preprocessing fails, the run fails before sample execution for that task.
Output extraction turns raw model text into extracted_output.
The grader receives both sample["output_text"] and sample["extracted_output"].
| Type | Fields | Behavior |
|---|
none | None | Trims common model noise and whitespace. |
take_first | lines | Takes the first non-empty line or lines. |
regex | pattern, group, flags | Returns the first regex match group. |
regex_last | pattern, group, flags | Returns the last regex match group. |
label_set | labels, case_sensitive | Returns the first matching label. |
number | None | Extracts a numeric value. |
Examples:
{
"output_extraction": {
"type": "regex",
"pattern": "Answer:\\s*([A-D])",
"group": 1,
"flags": "i"
}
}
{
"output_extraction": {
"type": "label_set",
"labels": ["positive", "neutral", "negative"],
"case_sensitive": false
}
}
Regex patterns are checked for safety.
Very large regex inputs and excessive match counts are capped.
Metrics
Metrics declare which score keys should be aggregated.
Python graders produce the score values.
{
"metrics": [
{
"id": "exact_match",
"aggregation": "mean",
"higher_is_better": true,
"description": "1 when the extracted output exactly matches the target."
}
]
}
Fields:
| Field | Description |
|---|
id | Score key. Letters, numbers, _, ., and - are allowed. |
aggregation | mean or none. Defaults to mean. |
higher_is_better | Defaults to true. |
description | Optional description for dashboards and humans. |
If a per-sample grader returns a single float, MKA1 stores it under the grader metric_id, which defaults to score.
If a per-sample task omits metrics, MKA1 adds a default metric for that metric_id.
Aggregate batch graders can provide final task metrics through grade_batch.
Generation settings
Run-level generation controls candidate model calls.
It includes normalized Responses fields, lm-eval aliases, provider passthrough fields, and eval execution controls.
{
"generation": {
"instructions": "Answer concisely.",
"temperature": 0,
"top_p": 1,
"max_gen_toks": 128,
"until": ["<|endoftext|>"],
"max_retries": 2,
"max_empty_retries": 1,
"timeout_seconds": 120
}
}
Normalized fields:
| Field | Notes |
|---|
instructions | System/developer instruction text for compatible Responses models. |
temperature, top_p | Sampling controls. |
max_output_tokens | MKA1 Responses token cap. |
max_gen_toks | lm-eval alias for max_output_tokens. |
stop | String or list of stop sequences. |
until | lm-eval alias for stop sequences. |
tools, tool_choice, parallel_tool_calls, max_tool_calls | Responses tool settings. |
reasoning, text, truncation, service_tier | Passed to compatible Responses models. |
presence_penalty, frequency_penalty | Penalty settings. |
Provider passthrough fields:
| Field | Notes |
|---|
top_k | For compatible non-OpenAI upstreams. |
min_p | For compatible non-OpenAI upstreams. |
repetition_penalty | For compatible non-OpenAI upstreams. |
do_sample | For compatible non-OpenAI upstreams. |
extra_body | Extra provider body fields after unsafe keys are removed. |
chat_template_kwargs | Chat-template options such as enable_thinking. |
prefill_think | Thinking/prefill control for compatible providers. |
use_cache | Provider cache control when supported. |
Provider passthrough is only sent to OpenAI-compatible non-OpenAI upstreams.
Real OpenAI upstreams receive only supported fields.
Execution controls:
| Field | Description |
|---|
timeout_seconds | Per-model-call timeout. Range is 1 to 3600. |
max_retries | Retries for retryable provider/network errors. Range is 0 to 10. |
max_empty_retries | Retries when the candidate output is empty. Range is 0 to 10. |
Example: multiple-choice Hugging Face task
{
"schema_version": "2026-05-27",
"tasks": [
{
"id": "belebele_ukrainian",
"type": "multiple_choice",
"dataset": {
"source": "huggingface",
"path": "facebook/belebele",
"name": "ukr_Cyrl",
"split": "test",
"max_rows": 100
},
"choices": ["1", "2", "3", "4"],
"prompt_template": "Read the passage and answer with the option number only.\n\n{{flores_passage}}\n\nQuestion: {{question}}\n1. {{mc_answer1}}\n2. {{mc_answer2}}\n3. {{mc_answer3}}\n4. {{mc_answer4}}",
"target_template": "{{correct_answer_num}}",
"output_extraction": {
"type": "label_set",
"labels": ["1", "2", "3", "4"]
},
"metrics": [
{ "id": "accuracy" }
],
"grader": {
"type": "python",
"contract": "sample",
"source": "def grade(sample, item):\n return {'scores': {'accuracy': 1.0 if sample.get('extracted_output') == item.get('target') else 0.0}}\n"
}
}
]
}
Example: few-shot task with a separate training split
{
"id": "fewshot_qa",
"type": "qa",
"dataset": {
"source": "huggingface",
"path": "acme/support-qa",
"split": "test"
},
"fewshot": {
"count": 3,
"dataset": {
"source": "huggingface",
"path": "acme/support-qa",
"split": "train"
},
"example_template": "Question: {{prompt}}\nAnswer: {{target}}",
"separator": "\n\n"
},
"prompt_template": "{{question}}",
"target_template": "{{answer}}",
"grader": {
"type": "python",
"contract": "sample",
"file_id": "file_grader123"
}
}
Versioning guidance
Create a new suite version when you change any behavior that affects results:
- Dataset file or Hugging Face path/config/split.
- Prompt, target, few-shot, preprocessing, or output extraction.
- Metric IDs or aggregation behavior.
- Python grader source or file ID.
Store run-specific choices on the run:
- Candidate models.
- Judge model.
- Embedding model.
- Generation settings.
- Concurrency and sample caps.
This keeps the suite reusable while preserving exactly what each run executed.