Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.mka1.com/llms.txt

Use this file to discover all available pages before exploring further.

An eval suite is a reusable manifest. It describes what to run, how to render prompts, how to preprocess rows, how to extract model outputs, and which Python grader should score each sample or task batch. This page covers the manifest surface. For the end-to-end run flow, start with Run evals. For Python scoring contracts, see Write Python eval graders.

Manifest shape

Every suite version stores one manifest.
{
  "schema_version": "2026-05-27",
  "tasks": [
    {
      "id": "qa_exact_match",
      "name": "QA exact match",
      "type": "qa",
      "dataset": {
        "file_id": "file_dataset123",
        "format": "jsonl"
      },
      "prompt_template": "Answer with only the final answer.\n\nQuestion: {{question}}",
      "target_template": "{{answer}}",
      "output_extraction": {
        "type": "take_first",
        "lines": 1
      },
      "metrics": [
        {
          "id": "exact_match",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "grader": {
        "type": "python",
        "contract": "sample",
        "file_id": "file_grader123"
      },
      "metadata": {
        "source": "qa-regression"
      }
    }
  ],
  "metadata": {
    "project": "model-quality"
  }
}
Manifest limits:
LimitValue
Tasks per suite version1 to 100
Task ID charactersLetters, numbers, _, ., and -
Few-shot examples per sample0 to 100
Run models per run1 to 20
Run concurrency1 to 25
Hugging Face max_rowsUp to 1,000,000

Task types

type labels the task for humans and downstream reporting. Scoring is not hardcoded by task type. The Python grader determines the actual metric behavior. Supported task labels:
TypeCommon use
classificationLabel classification or rubric categories.
multiple_choiceMultiple-choice answer selection.
qaQuestion answering with exact match, F1, semantic, or judge scoring.
summarizationSummary quality, coverage, and factuality checks.
semantic_similarityEmbedding or sentence-similarity scoring through Python.
llm_judgeModel-backed rubric scoring through ctx.responses_create.
numericNumeric extraction and tolerance scoring.
mathMath or symbolic-answer evaluation.
customAny task that does not fit the other labels.
Use metadata on tasks and runs for your own experiment identifiers. Metadata keys may be up to 64 characters and values up to 512 characters.

Uploaded datasets

Use uploaded files for controlled, reproducible datasets. Upload them through /files with purpose=evals, then reference the file ID in the task.
{
  "dataset": {
    "file_id": "file_dataset123",
    "format": "jsonl"
  }
}
format can be jsonl or csv. If you omit it, the gateway infers the format from the uploaded filename. JSONL rows must be objects:
eval.jsonl
{"question":"2+2?","answer":"4","difficulty":"easy"}
{"question":"Capital of France?","answer":"Paris","difficulty":"easy"}
CSV files use the first row as headers:
eval.csv
question,answer,difficulty
2+2?,4,easy
Capital of France?,Paris,easy

Hugging Face datasets

Use Hugging Face when your eval should run from a public or preconfigured private dataset. The gateway loads rows from the Hugging Face datasets server.
{
  "dataset": {
    "source": "huggingface",
    "path": "facebook/belebele",
    "name": "ukr_Cyrl",
    "split": "test",
    "revision": "main",
    "max_rows": 100
  }
}
Fields:
FieldDescription
sourceSet to huggingface. It is inferred when path is present.
pathDataset path, such as facebook/belebele.
nameDataset config. If omitted, MKA1 tries to discover a config for the requested split.
splitSplit to load. Defaults to test.
revisionOptional dataset revision.
max_rowsOptional cap for loading and smoke tests.
dataset_kwargsAdditional loading metadata retained for parity with existing eval definitions.
Private Hugging Face datasets require a Hugging Face token configured for the gateway environment. The token is not supplied in the eval manifest. Contact MKA1 if your team needs access to private or gated Hugging Face datasets.

Hugging Face json and csv builders

For path: "json" or path: "csv", provide data_files. Only HTTP(S) URLs are allowed, and local/private network hosts are rejected.
{
  "dataset": {
    "source": "huggingface",
    "path": "json",
    "split": "validation",
    "data_files": {
      "validation": "https://huggingface.co/datasets/acme/evals/resolve/main/validation.jsonl"
    },
    "max_rows": 500
  }
}
You can also put data_files under dataset_kwargs.data_files.

Templates

prompt_template and target_template render with values from the current dataset row.
{
  "prompt_template": "Context: {{context}}\n\nQuestion: {{question}}\n\nChoices:\n{{choice_list}}\n\nAnswer:",
  "target_template": "{{answer}}",
  "choices": ["A", "B", "C", "D"]
}
Template behavior:
PatternResult
{{field}}Reads a top-level row field.
{{nested.field}}Reads a nested object field.
{{row}}Inserts the full row as JSON.
{{choices}}Inserts the task choices array as JSON.
{{choice_list}}Inserts the task choices joined by newlines.
Missing values render as an empty string. Objects and arrays render as JSON.

Few-shot examples

Few-shot examples are rendered before the evaluated prompt. Use num_fewshot for lm-eval style manifests, or the expanded fewshot object when you need more control.
{
  "num_fewshot": 2
}
Expanded form:
{
  "fewshot": {
    "count": 2,
    "dataset": {
      "source": "huggingface",
      "path": "facebook/belebele",
      "name": "ukr_Cyrl",
      "split": "train"
    },
    "prompt_template": "Question: {{question}}",
    "target_template": "{{correct_answer_num}}",
    "example_template": "{{prompt}}\nAnswer: {{target}}",
    "separator": "\n\n---\n\n",
    "strategy": "first",
    "seed": 0
  }
}
Few-shot fields:
FieldDescription
countNumber of examples to prepend.
datasetOptional separate dataset. If omitted, examples come from the task dataset. Hugging Face few-shot datasets default to train when split is omitted.
prompt_templateOptional template for few-shot prompts. Defaults to the task prompt_template.
target_templateOptional template for few-shot targets. Defaults to the task target_template.
example_templateTemplate that combines rendered {{prompt}} and {{target}}. Defaults to {{prompt}}\n{{target}}.
separatorSeparator between examples and the evaluated prompt.
strategyfirst or random.
seedSeed used by random selection.
The current row is excluded when few-shot examples are drawn from the same in-memory row list.

Python preprocessors

Use preprocessors to normalize dataset rows before prompt rendering. They run in the Python sandbox with the same file scoping rules as graders.
{
  "preprocess": {
    "type": "python",
    "contract": "row",
    "source": "def transform(row):\n    row['question'] = row['question'].strip()\n    return row\n",
    "timeout_seconds": 120
  }
}
Contracts:
ContractFunction namesUse
rowtransform_row(row), transform(row), or process_doc(row)Transform one row at a time.
batchtransform_batch(rows) or process_docs(rows)Transform, filter, duplicate, or join rows as a batch.
Preprocessors may return:
Return valueMeaning
dictOne transformed row.
list[dict]Zero, one, or many transformed rows.
NoneNo rows for that input.
If preprocessing fails, the run fails before sample execution for that task.

Output extraction

Output extraction turns raw model text into extracted_output. The grader receives both sample["output_text"] and sample["extracted_output"].
TypeFieldsBehavior
noneNoneTrims common model noise and whitespace.
take_firstlinesTakes the first non-empty line or lines.
regexpattern, group, flagsReturns the first regex match group.
regex_lastpattern, group, flagsReturns the last regex match group.
label_setlabels, case_sensitiveReturns the first matching label.
numberNoneExtracts a numeric value.
Examples:
{
  "output_extraction": {
    "type": "regex",
    "pattern": "Answer:\\s*([A-D])",
    "group": 1,
    "flags": "i"
  }
}
{
  "output_extraction": {
    "type": "label_set",
    "labels": ["positive", "neutral", "negative"],
    "case_sensitive": false
  }
}
Regex patterns are checked for safety. Very large regex inputs and excessive match counts are capped.

Metrics

Metrics declare which score keys should be aggregated. Python graders produce the score values.
{
  "metrics": [
    {
      "id": "exact_match",
      "aggregation": "mean",
      "higher_is_better": true,
      "description": "1 when the extracted output exactly matches the target."
    }
  ]
}
Fields:
FieldDescription
idScore key. Letters, numbers, _, ., and - are allowed.
aggregationmean or none. Defaults to mean.
higher_is_betterDefaults to true.
descriptionOptional description for dashboards and humans.
If a per-sample grader returns a single float, MKA1 stores it under the grader metric_id, which defaults to score. If a per-sample task omits metrics, MKA1 adds a default metric for that metric_id. Aggregate batch graders can provide final task metrics through grade_batch.

Generation settings

Run-level generation controls candidate model calls. It includes normalized Responses fields, lm-eval aliases, provider passthrough fields, and eval execution controls.
{
  "generation": {
    "instructions": "Answer concisely.",
    "temperature": 0,
    "top_p": 1,
    "max_gen_toks": 128,
    "until": ["<|endoftext|>"],
    "max_retries": 2,
    "max_empty_retries": 1,
    "timeout_seconds": 120
  }
}
Normalized fields:
FieldNotes
instructionsSystem/developer instruction text for compatible Responses models.
temperature, top_pSampling controls.
max_output_tokensMKA1 Responses token cap.
max_gen_tokslm-eval alias for max_output_tokens.
stopString or list of stop sequences.
untillm-eval alias for stop sequences.
tools, tool_choice, parallel_tool_calls, max_tool_callsResponses tool settings.
reasoning, text, truncation, service_tierPassed to compatible Responses models.
presence_penalty, frequency_penaltyPenalty settings.
Provider passthrough fields:
FieldNotes
top_kFor compatible non-OpenAI upstreams.
min_pFor compatible non-OpenAI upstreams.
repetition_penaltyFor compatible non-OpenAI upstreams.
do_sampleFor compatible non-OpenAI upstreams.
extra_bodyExtra provider body fields after unsafe keys are removed.
chat_template_kwargsChat-template options such as enable_thinking.
prefill_thinkThinking/prefill control for compatible providers.
use_cacheProvider cache control when supported.
Provider passthrough is only sent to OpenAI-compatible non-OpenAI upstreams. Real OpenAI upstreams receive only supported fields. Execution controls:
FieldDescription
timeout_secondsPer-model-call timeout. Range is 1 to 3600.
max_retriesRetries for retryable provider/network errors. Range is 0 to 10.
max_empty_retriesRetries when the candidate output is empty. Range is 0 to 10.

Example: multiple-choice Hugging Face task

{
  "schema_version": "2026-05-27",
  "tasks": [
    {
      "id": "belebele_ukrainian",
      "type": "multiple_choice",
      "dataset": {
        "source": "huggingface",
        "path": "facebook/belebele",
        "name": "ukr_Cyrl",
        "split": "test",
        "max_rows": 100
      },
      "choices": ["1", "2", "3", "4"],
      "prompt_template": "Read the passage and answer with the option number only.\n\n{{flores_passage}}\n\nQuestion: {{question}}\n1. {{mc_answer1}}\n2. {{mc_answer2}}\n3. {{mc_answer3}}\n4. {{mc_answer4}}",
      "target_template": "{{correct_answer_num}}",
      "output_extraction": {
        "type": "label_set",
        "labels": ["1", "2", "3", "4"]
      },
      "metrics": [
        { "id": "accuracy" }
      ],
      "grader": {
        "type": "python",
        "contract": "sample",
        "source": "def grade(sample, item):\n    return {'scores': {'accuracy': 1.0 if sample.get('extracted_output') == item.get('target') else 0.0}}\n"
      }
    }
  ]
}

Example: few-shot task with a separate training split

{
  "id": "fewshot_qa",
  "type": "qa",
  "dataset": {
    "source": "huggingface",
    "path": "acme/support-qa",
    "split": "test"
  },
  "fewshot": {
    "count": 3,
    "dataset": {
      "source": "huggingface",
      "path": "acme/support-qa",
      "split": "train"
    },
    "example_template": "Question: {{prompt}}\nAnswer: {{target}}",
    "separator": "\n\n"
  },
  "prompt_template": "{{question}}",
  "target_template": "{{answer}}",
  "grader": {
    "type": "python",
    "contract": "sample",
    "file_id": "file_grader123"
  }
}

Versioning guidance

Create a new suite version when you change any behavior that affects results:
  • Dataset file or Hugging Face path/config/split.
  • Prompt, target, few-shot, preprocessing, or output extraction.
  • Metric IDs or aggregation behavior.
  • Python grader source or file ID.
Store run-specific choices on the run:
  • Candidate models.
  • Judge model.
  • Embedding model.
  • Generation settings.
  • Concurrency and sample caps.
This keeps the suite reusable while preserving exactly what each run executed.