Design eval task suites

An eval suite is a reusable manifest. It describes what to run, how to render prompts, how to preprocess rows, how to extract model outputs, and which Python grader should score each sample or task batch. This page covers the manifest surface. For the end-to-end run flow, start with Run evals. For Python scoring contracts, see Write Python eval graders.

Manifest shape

Every suite version stores one manifest.

{
  "schema_version": "2026-05-27",
  "tasks": [
    {
      "id": "qa_exact_match",
      "name": "QA exact match",
      "type": "qa",
      "dataset": {
        "file_id": "file_dataset123",
        "format": "jsonl"
      },
      "prompt_template": "Answer with only the final answer.\n\nQuestion: {{question}}",
      "target_template": "{{answer}}",
      "output_extraction": {
        "type": "take_first",
        "lines": 1
      },
      "metrics": [
        {
          "id": "exact_match",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "grader": {
        "type": "python",
        "contract": "sample",
        "file_id": "file_grader123"
      },
      "metadata": {
        "source": "qa-regression"
      }
    }
  ],
  "metadata": {
    "project": "model-quality"
  }
}

Manifest limits:

Limit	Value
Tasks per suite version	1 to 100
Task ID characters	Letters, numbers, `_`, `.`, and `-`
Few-shot examples per sample	0 to 100
Run models per run	1 to 20
Run concurrency	1 to 25
Hugging Face `max_rows`	Up to 1,000,000

Task types

type labels the task for humans and downstream reporting. Scoring is not hardcoded by task type. The Python grader determines the actual metric behavior. Supported task labels:

Type	Common use
`classification`	Label classification or rubric categories.
`multiple_choice`	Multiple-choice answer selection.
`qa`	Question answering with exact match, F1, semantic, or judge scoring.
`summarization`	Summary quality, coverage, and factuality checks.
`semantic_similarity`	Embedding or sentence-similarity scoring through Python.
`llm_judge`	Model-backed rubric scoring through `ctx.responses_create`.
`numeric`	Numeric extraction and tolerance scoring.
`math`	Math or symbolic-answer evaluation.
`custom`	Any task that does not fit the other labels.

Use metadata on tasks and runs for your own experiment identifiers. Metadata keys may be up to 64 characters and values up to 512 characters.

Uploaded datasets

Use uploaded files for controlled, reproducible datasets. Upload them through /files with purpose=evals, then reference the file ID in the task.

{
  "dataset": {
    "file_id": "file_dataset123",
    "format": "jsonl"
  }
}

format can be jsonl or csv. If you omit it, the gateway infers the format from the uploaded filename. JSONL rows must be objects:

eval.jsonl

{"question":"2+2?","answer":"4","difficulty":"easy"}
{"question":"Capital of France?","answer":"Paris","difficulty":"easy"}

CSV files use the first row as headers:

eval.csv

question,answer,difficulty
2+2?,4,easy
Capital of France?,Paris,easy

Hugging Face datasets

Use Hugging Face when your eval should run from a public or preconfigured private dataset. The gateway loads rows from the Hugging Face datasets server.

{
  "dataset": {
    "source": "huggingface",
    "path": "facebook/belebele",
    "name": "ukr_Cyrl",
    "split": "test",
    "revision": "main",
    "max_rows": 100
  }
}

Fields:

Field	Description
`source`	Set to `huggingface`. It is inferred when `path` is present.
`path`	Dataset path, such as `facebook/belebele`.
`name`	Dataset config. If omitted, MKA1 tries to discover a config for the requested split.
`split`	Split to load. Defaults to `test`.
`revision`	Optional dataset revision.
`max_rows`	Optional cap for loading and smoke tests.
`dataset_kwargs`	Additional loading metadata retained for parity with existing eval definitions.

Private Hugging Face datasets require a Hugging Face token configured for the gateway environment. The token is not supplied in the eval manifest. Contact MKA1 if your team needs access to private or gated Hugging Face datasets.

Hugging Face `json` and `csv` builders

For path: "json" or path: "csv", provide data_files. Only HTTP(S) URLs are allowed, and local/private network hosts are rejected.

{
  "dataset": {
    "source": "huggingface",
    "path": "json",
    "split": "validation",
    "data_files": {
      "validation": "https://huggingface.co/datasets/acme/evals/resolve/main/validation.jsonl"
    },
    "max_rows": 500
  }
}

You can also put data_files under dataset_kwargs.data_files.

Templates

prompt_template and target_template render with values from the current dataset row.

{
  "prompt_template": "Context: {{context}}\n\nQuestion: {{question}}\n\nChoices:\n{{choice_list}}\n\nAnswer:",
  "target_template": "{{answer}}",
  "choices": ["A", "B", "C", "D"]
}

Template behavior:

Pattern	Result
`{{field}}`	Reads a top-level row field.
`{{nested.field}}`	Reads a nested object field.
`{{row}}`	Inserts the full row as JSON.
`{{choices}}`	Inserts the task `choices` array as JSON.
`{{choice_list}}`	Inserts the task `choices` joined by newlines.

Missing values render as an empty string. Objects and arrays render as JSON.

Few-shot examples

Few-shot examples are rendered before the evaluated prompt. Use num_fewshot for lm-eval style manifests, or the expanded fewshot object when you need more control.

{
  "num_fewshot": 2
}

Expanded form:

{
  "fewshot": {
    "count": 2,
    "dataset": {
      "source": "huggingface",
      "path": "facebook/belebele",
      "name": "ukr_Cyrl",
      "split": "train"
    },
    "prompt_template": "Question: {{question}}",
    "target_template": "{{correct_answer_num}}",
    "example_template": "{{prompt}}\nAnswer: {{target}}",
    "separator": "\n\n---\n\n",
    "strategy": "first",
    "seed": 0
  }
}

Few-shot fields:

Field	Description
`count`	Number of examples to prepend.
`dataset`	Optional separate dataset. If omitted, examples come from the task dataset. Hugging Face few-shot datasets default to `train` when `split` is omitted.
`prompt_template`	Optional template for few-shot prompts. Defaults to the task `prompt_template`.
`target_template`	Optional template for few-shot targets. Defaults to the task `target_template`.
`example_template`	Template that combines rendered `{{prompt}}` and `{{target}}`. Defaults to `{{prompt}}\n{{target}}`.
`separator`	Separator between examples and the evaluated prompt.
`strategy`	`first` or `random`.
`seed`	Seed used by `random` selection.

The current row is excluded when few-shot examples are drawn from the same in-memory row list.

Python preprocessors

Use preprocessors to normalize dataset rows before prompt rendering. They run in the Python sandbox with the same file scoping rules as graders.

{
  "preprocess": {
    "type": "python",
    "contract": "row",
    "source": "def transform(row):\n    row['question'] = row['question'].strip()\n    return row\n",
    "timeout_seconds": 120
  }
}

Contracts:

Contract	Function names	Use
`row`	`transform_row(row)`, `transform(row)`, or `process_doc(row)`	Transform one row at a time.
`batch`	`transform_batch(rows)` or `process_docs(rows)`	Transform, filter, duplicate, or join rows as a batch.

Preprocessors may return:

Return value	Meaning
`dict`	One transformed row.
`list[dict]`	Zero, one, or many transformed rows.
`None`	No rows for that input.

If preprocessing fails, the run fails before sample execution for that task.

Output extraction

Output extraction turns raw model text into extracted_output. The grader receives both sample["output_text"] and sample["extracted_output"].

Type	Fields	Behavior
`none`	None	Trims common model noise and whitespace.
`take_first`	`lines`	Takes the first non-empty line or lines.
`regex`	`pattern`, `group`, `flags`	Returns the first regex match group.
`regex_last`	`pattern`, `group`, `flags`	Returns the last regex match group.
`label_set`	`labels`, `case_sensitive`	Returns the first matching label.
`number`	None	Extracts a numeric value.

Examples:

{
  "output_extraction": {
    "type": "regex",
    "pattern": "Answer:\\s*([A-D])",
    "group": 1,
    "flags": "i"
  }
}

{
  "output_extraction": {
    "type": "label_set",
    "labels": ["positive", "neutral", "negative"],
    "case_sensitive": false
  }
}

Regex patterns are checked for safety. Very large regex inputs and excessive match counts are capped.

Metrics

Metrics declare which score keys should be aggregated. Python graders produce the score values.

{
  "metrics": [
    {
      "id": "exact_match",
      "aggregation": "mean",
      "higher_is_better": true,
      "description": "1 when the extracted output exactly matches the target."
    }
  ]
}

Fields:

Field	Description
`id`	Score key. Letters, numbers, `_`, `.`, and `-` are allowed.
`aggregation`	`mean` or `none`. Defaults to `mean`.
`higher_is_better`	Defaults to `true`.
`description`	Optional description for dashboards and humans.

If a per-sample grader returns a single float, MKA1 stores it under the grader metric_id, which defaults to score. If a per-sample task omits metrics, MKA1 adds a default metric for that metric_id. Aggregate batch graders can provide final task metrics through grade_batch.

Generation settings

Run-level generation controls candidate model calls. It includes normalized Responses fields, lm-eval aliases, provider passthrough fields, and eval execution controls.

{
  "generation": {
    "instructions": "Answer concisely.",
    "temperature": 0,
    "top_p": 1,
    "max_gen_toks": 128,
    "until": ["<|endoftext|>"],
    "max_retries": 2,
    "max_empty_retries": 1,
    "timeout_seconds": 120
  }
}

Normalized fields:

Field	Notes
`instructions`	System/developer instruction text for compatible Responses models.
`temperature`, `top_p`	Sampling controls.
`max_output_tokens`	MKA1 Responses token cap.
`max_gen_toks`	lm-eval alias for `max_output_tokens`.
`stop`	String or list of stop sequences.
`until`	lm-eval alias for stop sequences.
`tools`, `tool_choice`, `parallel_tool_calls`, `max_tool_calls`	Responses tool settings.
`reasoning`, `text`, `truncation`, `service_tier`	Passed to compatible Responses models.
`presence_penalty`, `frequency_penalty`	Penalty settings.

Provider passthrough fields:

Field	Notes
`top_k`	For compatible non-OpenAI upstreams.
`min_p`	For compatible non-OpenAI upstreams.
`repetition_penalty`	For compatible non-OpenAI upstreams.
`do_sample`	For compatible non-OpenAI upstreams.
`extra_body`	Extra provider body fields after unsafe keys are removed.
`chat_template_kwargs`	Chat-template options such as `enable_thinking`.
`prefill_think`	Thinking/prefill control for compatible providers.
`use_cache`	Provider cache control when supported.

Provider passthrough is only sent to OpenAI-compatible non-OpenAI upstreams. Real OpenAI upstreams receive only supported fields. Execution controls:

Field	Description
`timeout_seconds`	Per-model-call timeout. Range is 1 to 3600.
`max_retries`	Retries for retryable provider/network errors. Range is 0 to 10.
`max_empty_retries`	Retries when the candidate output is empty. Range is 0 to 10.

Example: multiple-choice Hugging Face task

{
  "schema_version": "2026-05-27",
  "tasks": [
    {
      "id": "belebele_ukrainian",
      "type": "multiple_choice",
      "dataset": {
        "source": "huggingface",
        "path": "facebook/belebele",
        "name": "ukr_Cyrl",
        "split": "test",
        "max_rows": 100
      },
      "choices": ["1", "2", "3", "4"],
      "prompt_template": "Read the passage and answer with the option number only.\n\n{{flores_passage}}\n\nQuestion: {{question}}\n1. {{mc_answer1}}\n2. {{mc_answer2}}\n3. {{mc_answer3}}\n4. {{mc_answer4}}",
      "target_template": "{{correct_answer_num}}",
      "output_extraction": {
        "type": "label_set",
        "labels": ["1", "2", "3", "4"]
      },
      "metrics": [
        { "id": "accuracy" }
      ],
      "grader": {
        "type": "python",
        "contract": "sample",
        "source": "def grade(sample, item):\n    return {'scores': {'accuracy': 1.0 if sample.get('extracted_output') == item.get('target') else 0.0}}\n"
      }
    }
  ]
}

Example: few-shot task with a separate training split

{
  "id": "fewshot_qa",
  "type": "qa",
  "dataset": {
    "source": "huggingface",
    "path": "acme/support-qa",
    "split": "test"
  },
  "fewshot": {
    "count": 3,
    "dataset": {
      "source": "huggingface",
      "path": "acme/support-qa",
      "split": "train"
    },
    "example_template": "Question: {{prompt}}\nAnswer: {{target}}",
    "separator": "\n\n"
  },
  "prompt_template": "{{question}}",
  "target_template": "{{answer}}",
  "grader": {
    "type": "python",
    "contract": "sample",
    "file_id": "file_grader123"
  }
}

Versioning guidance

Create a new suite version when you change any behavior that affects results:

Dataset file or Hugging Face path/config/split.
Prompt, target, few-shot, preprocessing, or output extraction.
Metric IDs or aggregation behavior.
Python grader source or file ID.

Store run-specific choices on the run:

Candidate models.
Judge model.
Embedding model.
Generation settings.
Concurrency and sample caps.

This keeps the suite reusable while preserving exactly what each run executed.

​Manifest shape

​Task types

​Uploaded datasets

​Hugging Face datasets

​Hugging Face json and csv builders

​Templates

​Few-shot examples

​Python preprocessors

​Output extraction

​Metrics

​Generation settings

​Example: multiple-choice Hugging Face task

​Example: few-shot task with a separate training split

​Versioning guidance