This guide shows how to evaluate two related behaviors:
- inferring locale context from the user prompt alone
- asking for clarification when the user prompt is ambiguous
The worked examples use Brazilian Portuguese (`pt-BR`).
The same method can be reused for other locales by changing the prompt set and the scoring signals.
Run every case as a fresh single-turn request.
Do not preload examples that teach the model the exact behavior you plan to score.
Evaluation principles
Use the same setup for both evaluations:
- Keep the request neutral.
- Do not explicitly instruct the model to localize to a region.
- Do not explicitly instruct the model to ask for clarification.
- Record the exact prompt and the exact raw response for every case.
- Score the output against visible behavioral signals, not against hidden intent.
Minimal harness
Use the MKA1 SDK and keep the request shape simple.
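The MKA1 SDK call itself is not shown in this guide, so the sketch below hides it behind a `call_model` function you supply (a hypothetical wrapper, not part of the SDK); everything else follows the principles above: fresh single-turn request, exact prompt and exact raw response recorded.

```python
def run_case(call_model, case_id: str, prompt: str) -> dict:
    """Run one evaluation case as a fresh single-turn request.

    call_model is your own wrapper around the MKA1 SDK completion
    call (hypothetical here); it takes a prompt string and returns
    the raw response string.
    """
    # No system examples and no prior turns: the prompt goes in alone,
    # so the model cannot be coached toward the behavior being scored.
    response = call_model(prompt)
    # Record the exact prompt and the exact raw response for scoring.
    return {"case": case_id, "prompt": prompt, "response": response}
```

Each case gets its own call; never reuse conversation state between cases.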
Evaluate locale context inference
Goal
Prove that the model can infer regional conventions from the user prompt alone and apply them naturally when the topic calls for them. In the `pt-BR` example, the most visible signals are:
- `R$` and Brazilian money formatting
- `dd/mm/yyyy` when the model turns a date into numeric form
- metric units such as `km`, `°C`, and `m`
- correct handling of local idioms and regional expressions
- local social context such as `CPF`, `RG`, and `comprovante de residência`
Step 1: choose observable locale signals
Pick signals that are easy for a reviewer to see directly in the output.
| Signal type | Generic evidence | Brazil pt-BR example |
|---|---|---|
| currency | local currency symbol and numeric style | R$ 700,00 |
| date | local short date format | 05/04/2026 |
| units | local measurement conventions | 431 km, 30°C, 1,73 metro |
| idioms | correct local meaning and usage | dar um jeitinho, pagar mico, ficar de boa |
| social norms | local documents, institutions, and expectations | CPF, RG, banking docs |
| regional context | local food, geography, or cultural references | Nordeste brasileiro, São Paulo, Manaus |
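Reviewers score by eye, but a coarse pre-screen can flag which signals are visibly present in a response. The patterns below are illustrative heuristics for the `pt-BR` signals in the table, not a replacement for human review.

```python
import re

# Illustrative heuristics for pt-BR locale signals; a reviewer still
# makes the final pass/partial/fail call.
SIGNAL_PATTERNS = {
    # R$ followed by Brazilian number style, e.g. "R$ 700,00"
    "currency": re.compile(r"R\$\s?\d{1,3}(\.\d{3})*(,\d{2})?"),
    # Local short date, e.g. "05/04/2026"
    "date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    # Metric units with a number, e.g. "431 km", "30°C"
    "units": re.compile(r"\b\d+(,\d+)?\s?(km|°C|m)\b"),
}

def flag_signals(text: str) -> dict:
    """Return which locale signals are visibly present in a response."""
    return {name: bool(p.search(text)) for name, p in SIGNAL_PATTERNS.items()}
```

A flagged signal only tells the reviewer where to look; idioms, social norms, and regional context still need human judgment.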
Step 2: run a focused prompt set
The following `pt-BR` prompts are based on real examples from earlier evaluation runs.
They work well because they expose visible local signals without explicitly asking the model to localize.
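With the SDK call abstracted away, the prompt set itself is plain data. The cases below are the ones that reappear in the Step 4 evidence table; none of them mentions Brazilian conventions explicitly.

```python
# pt-BR locale-inference cases: one prompt per signal type,
# none of which asks the model to localize.
LOCALE_CASES = [
    ("locale-currency-lunch",
     "Quanto custa em média um almoço em um restaurante popular em São Paulo?"),
    ("locale-date-short",
     "Minha consulta ficou para cinco de abril de 2026 às duas e meia da tarde. "
     "Pode resumir isso em uma linha?"),
    ("locale-metric-distance",
     "Qual a distância entre São Paulo e Rio de Janeiro?"),
    ("locale-idiom-jeitinho",
     'O que significa a expressão "dar um jeitinho"?'),
    ("locale-social-banking",
     "Preciso abrir uma conta bancária. Quais documentos são necessários?"),
]
```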
Step 3: score each response
Score each case as `pass`, `partial`, or `fail`.
| Type | Pass | Partial | Fail |
|---|---|---|---|
| currency | Uses the correct local currency symbol and formatting naturally | Local currency is present but formatting is inconsistent | Uses the wrong currency or wrong locale formatting |
| date | Uses the correct local short date format or a clearly local equivalent | Date is understandable but does not show the local format clearly | Uses a conflicting locale format |
| metric units | Uses the expected local units naturally | Correct answer but unit style is vague | Uses the wrong regional units |
| idioms | Explains the idiom with the right local meaning and tone | Roughly correct but culturally thin | Misreads or flattens the idiom |
| social norms | Uses local institutions, documents, or norms when relevant | Mostly right but misses the strongest local markers | Gives generic advice with no local grounding |
| regional context | Grounds the answer in the local region naturally | Correct but generic | Misses or questions the regional context unnecessarily |
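Before the scores feed the evidence package, a small helper can validate the per-case labels and tally them; the `pass`/`partial`/`fail` vocabulary comes straight from the rubric above.

```python
from collections import Counter

VALID_SCORES = {"pass", "partial", "fail"}

def summarize_scores(scores: dict) -> Counter:
    """Validate per-case labels and tally them.

    scores maps case_id -> 'pass' | 'partial' | 'fail'.
    """
    invalid = {case: s for case, s in scores.items() if s not in VALID_SCORES}
    if invalid:
        raise ValueError(f"unknown score labels: {invalid}")
    return Counter(scores.values())
```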
Step 4: assemble the evidence
Your evidence package should show raw outputs that make the locale inference visible. In the Brazil `pt-BR` example, a compact evidence table can look like this:
| Case | Prompt | What the response proves |
|---|---|---|
| locale-currency-lunch | Quanto custa em média um almoço em um restaurante popular em São Paulo? | The model inferred Brazil and answered in R$ without being told to use Brazilian currency |
| locale-date-short | Minha consulta ficou para cinco de abril de 2026 às duas e meia da tarde. Pode resumir isso em uma linha? | The model converted the date into a Brazilian format without explicit formatting instructions |
| locale-metric-distance | Qual a distância entre São Paulo e Rio de Janeiro? | The model used km rather than miles |
| locale-idiom-jeitinho | O que significa a expressão "dar um jeitinho"? | The model interpreted a Brazilian idiom with the right cultural nuance |
| locale-social-banking | Preciso abrir uma conta bancária. Quais documentos são necessários? | The model surfaced Brazilian banking documents such as CPF, RG, and comprovante de residência |
Check that:
- there is at least one strong passing example for currency, date, units, idioms, and social context
- no prompt contains explicit localization coaching
- the raw outputs visibly show local conventions
Evaluate ambiguity handling
Goal
Prove that the model recognizes ambiguity in the user prompt and asks a targeted follow-up question instead of guessing. The evaluation should measure both sides of the behavior:
- whether the model asks for clarification when the prompt is genuinely ambiguous
- whether the model answers directly when the prompt is already clear
Step 1: build ambiguous prompts and clear controls
The following prompts are based on real examples from earlier evaluation runs.
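As with the locale cases, the prompt set is plain data. The cases below are taken from the Step 4 evidence table; the `ambiguous`/`clear` label on each case is what drives the metrics in Step 3.

```python
# Ambiguous prompts plus a clear control; the third field labels
# each case for metric computation.
AMBIGUITY_CASES = [
    ("ambiguity-lexical-banco", "Preciso de um banco.", "ambiguous"),
    ("ambiguity-underspecified-reservation", "Me ajuda a reservar para sexta.", "ambiguous"),
    ("ambiguity-referential-arquivo", "Manda aquele arquivo pra mim.", "ambiguous"),
    ("ambiguity-task-report", "Faz um relatório.", "ambiguous"),
    ("ambiguity-clear-capital", "Qual a capital do Brasil?", "clear"),
]
```

In a full run you would add more clear controls so the false-clarification rate rests on more than one case.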
Step 2: score the responses
| Prompt type | Pass | Partial | Fail |
|---|---|---|---|
| ambiguous | Asks a short, targeted clarification question before answering | Lists possible meanings, but the clarification is too broad or too long | Guesses, invents missing details, or refuses before clarifying |
| clear | Answers directly | Answers but adds unnecessary hedging | Asks for clarification even though the request is clear |
Examples of passing clarifications from earlier runs:
- `Preciso de um banco.` -> `Você quer dizer banco financeiro ou banco para sentar?`
- `Faz um relatório.` -> `Sobre qual tema, para qual público e para qual período?`
- `Manda aquele arquivo pra mim.` -> `Qual arquivo você quer dizer?`
Examples of failing responses from earlier runs:
- `Me fala sobre manga.`: the model guessed the Japanese comic meaning instead of asking which meaning the user wanted.
- `Quero saber mais sobre pena.`: the model answered several meanings instead of asking one clarifying question.
- `Faz um relatório.`: the model invented a sales report instead of resolving the missing topic and audience.
- `Atualiza os dados.`: the model gave generic update instructions instead of asking which data should be updated.
- `Manda aquele arquivo pra mim.`: the model jumped to a delivery limitation before clarifying which file the user meant.
Step 3: compute the metrics
Report at least these three metrics:
| Metric | Formula |
|---|---|
| clarification rate | ambiguous cases scored pass / total ambiguous cases |
| wrong-assumption rate | ambiguous cases scored fail because the model guessed / total ambiguous cases |
| false-clarification rate | clear cases that asked for clarification / total clear cases |
A healthy result shows:
- high clarification rate on ambiguous prompts
- low wrong-assumption rate on ambiguous prompts
- low false-clarification rate on clear prompts
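The three formulas translate directly into code. Each scored case is assumed to carry its prompt type, its rubric score, and two reviewer judgments (whether the model guessed a meaning, and whether it asked for clarification); the field names are illustrative, not a fixed schema.

```python
def compute_metrics(cases: list) -> dict:
    """Compute the three reported metrics.

    Each case is a dict with illustrative fields:
      prompt_type: 'ambiguous' or 'clear'
      score: 'pass' | 'partial' | 'fail'
      guessed: True if the model guessed a meaning (reviewer judgment)
      asked: True if the model asked for clarification
    """
    ambiguous = [c for c in cases if c["prompt_type"] == "ambiguous"]
    clear = [c for c in cases if c["prompt_type"] == "clear"]
    return {
        # ambiguous cases scored pass / total ambiguous cases
        "clarification_rate":
            sum(c["score"] == "pass" for c in ambiguous) / len(ambiguous),
        # ambiguous fails where the model guessed / total ambiguous cases
        "wrong_assumption_rate":
            sum(c["score"] == "fail" and c["guessed"] for c in ambiguous) / len(ambiguous),
        # clear cases that asked for clarification / total clear cases
        "false_clarification_rate":
            sum(c["asked"] for c in clear) / len(clear),
    }
```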
Step 4: assemble the evidence
Use a compact evidence table that shows both clarification and non-clarification behavior:
| Case | Prompt | What the response proves |
|---|---|---|
| ambiguity-lexical-banco | Preciso de um banco. | The model identified lexical ambiguity and asked which meaning the user intended |
| ambiguity-underspecified-reservation | Me ajuda a reservar para sexta. | The model asked for the missing reservation details instead of guessing |
| ambiguity-referential-arquivo | Manda aquele arquivo pra mim. | The model resolved the reference before discussing the action |
| ambiguity-task-report | Faz um relatório. | The model asked for topic, audience, and period before drafting |
| ambiguity-clear-capital | Qual a capital do Brasil? | The model answered directly and did not over-clarify |
Adapting this guide to another locale
To reuse this method for another region, keep the evaluation structure the same and change only the locale-specific inputs:
- change the prompt set
- change the local conventions you expect to see
- change the idioms, institutions, and region-specific references in the rubric
In the `pt-BR` example, those inputs were `R$`, `dd/mm/yyyy`, `km`, and `CPF`.
For the locale evaluation, redefine the visible signals:
- currency symbol and number style
- short date format
- measurement conventions
- local institutions, documents, and idioms
For the ambiguity evaluation, rebuild the prompt set to cover the same categories:
- lexical ambiguity
- underspecified requests
- referential ambiguity
- task ambiguity
- clear control prompts
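One way to keep the swap mechanical is to gather the locale-specific inputs into a single structure, so a new locale changes the data while the harness, rubric shape, and metrics stay fixed. The keys below are illustrative, not a required schema.

```python
# Illustrative structure for the locale-specific inputs; swap in a new
# locale without touching the harness, rubric shape, or metrics.
LOCALE_INPUTS = {
    "pt-BR": {
        "currency": "R$",
        "date_format": "dd/mm/yyyy",
        "units": ["km", "°C", "m"],
        "documents": ["CPF", "RG", "comprovante de residência"],
    },
    # Adding e.g. "de-DE" would mean new prompts, "€", "dd.mm.yyyy",
    # German idioms, and German institutions, in the same structure.
}
```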
Final evidence package
For either evaluation, include:
- the exact prompt list
- the raw response for every case
- the scoring rubric
- the per-case score
- the aggregate metrics
- a short note confirming that the test used fresh single-turn requests without prompt coaching
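The list above maps directly onto a small JSON artifact. The layout below is one possible sketch, not a required format; anything that preserves the exact prompts and raw responses works.

```python
import json

def write_evidence_package(path, prompts, responses, rubric, scores, metrics):
    """Write the final evidence package as one JSON file.

    The field layout is illustrative; the important property is that
    the exact prompts and raw responses survive unmodified.
    """
    package = {
        "prompts": prompts,          # the exact prompt list
        "responses": responses,      # the raw response for every case
        "rubric": rubric,            # the scoring rubric
        "per_case_scores": scores,   # the per-case score
        "aggregate_metrics": metrics,
        "note": "fresh single-turn requests, no prompt coaching",
    }
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Portuguese text readable in the file
        json.dump(package, f, ensure_ascii=False, indent=2)
```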