This guide shows how to evaluate two model behaviors in a way that is reproducible and easy to adapt:
  • inferring locale context from the user prompt alone
  • asking for clarification when the user prompt is ambiguous
The worked example in this document uses Brazilian Portuguese (pt-BR). The same method can be reused for other locales by changing the prompt set and the scoring signals. Run every case as a fresh single-turn request. Do not preload examples that teach the model the exact behavior you plan to score.

Evaluation principles

Use the same setup for both evaluations:
  • Keep the request neutral.
  • Do not explicitly instruct the model to localize to a region.
  • Do not explicitly instruct the model to ask for clarification.
  • Record the exact prompt and the exact raw response for every case.
  • Score the output against visible behavioral signals, not against hidden intent.
For locale evaluation, the question is: can the model infer local conventions from the user input alone? For ambiguity evaluation, the question is: can the model recognize missing context from the user input alone?

Minimal harness

Use the MKA1 SDK and keep the request shape simple:
import { SDK } from '@meetkai/mka1';

const mka1 = new SDK({
  bearerAuth: `Bearer ${process.env.MKA1_API_KEY}`,
});

const REQUEST_OPTIONS = {
  headers: { 'X-On-Behalf-Of': '<end-user-id>' },
};

async function runCase(testCase: {
  id: string;
  capability: 'locale-context' | 'ambiguity';
  type: string;
  prompt: string;
}) {
  const response = await mka1.llm.responses.create(
    {
      model: 'meetkai:functionary-pt',
      input: testCase.prompt,
      stream: false,
      metadata: {
        capability: testCase.capability,
        eval_case: testCase.id,
        eval_type: testCase.type,
      },
    },
    REQUEST_OPTIONS
  );

  return {
    ...testCase,
    outputText: response.outputText,
  };
}

Evaluate locale context inference

Goal

Prove that the model can infer regional conventions from the user prompt alone and apply them naturally when the topic calls for them. In the pt-BR example, the most visible signals are:
  • R$ and Brazilian money formatting
  • dd/mm/yyyy when the model turns a date into numeric form
  • metric units such as km, °C, and m
  • correct handling of local idioms and regional expressions
  • local social context such as CPF, RG, and comprovante de residência

Step 1: choose observable locale signals

Pick signals that are easy for a reviewer to see directly in the output.
| Signal type | Generic evidence | Brazil pt-BR example |
| --- | --- | --- |
| currency | local currency symbol and numeric style | R$ 700,00 |
| date | local short date format | 05/04/2026 |
| units | local measurement conventions | 431 km, 30°C, 1,73 metro |
| idioms | correct local meaning and usage | dar um jeitinho, pagar mico, ficar de boa |
| social norms | local documents, institutions, and expectations | CPF, RG, banking docs |
| regional context | local food, geography, or cultural references | Nordeste brasileiro, São Paulo, Manaus |
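The most mechanical of these signals can be pre-screened automatically before a human pass. The sketch below is a minimal heuristic, assuming simple regex patterns for the pt-BR currency, date, and unit signals; the patterns and function name are illustrative, not part of the MKA1 SDK, and a reviewer still makes the final call on idioms and social norms:

```typescript
// Heuristic pre-screen for mechanical pt-BR locale signals in raw output text.
// Idioms and social norms still need a human reviewer; these regexes only
// catch the formatting-level signals (currency, short date, metric units).
type LocaleSignals = {
  currency: boolean; // e.g. "R$ 700,00"
  date: boolean;     // e.g. "05/04/2026"
  metric: boolean;   // e.g. "431 km", "30°C", "1,73 metro"
};

function detectLocaleSignals(outputText: string): LocaleSignals {
  return {
    currency: /R\$\s?\d{1,3}(\.\d{3})*(,\d{2})?/.test(outputText),
    date: /\b\d{2}\/\d{2}\/\d{4}\b/.test(outputText),
    metric: /\b\d+([.,]\d+)?\s?(km|°C|m|metros?)\b/.test(outputText),
  };
}
```

A response such as "custa em média R$ 45,00" would set `currency` to true; the reviewer then confirms the formatting reads naturally rather than mechanically.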

Step 2: run a focused prompt set

The following pt-BR prompts are based on real examples from earlier evaluation runs. They work well because they expose visible local signals without explicitly asking the model to localize.
const localeCases = [
  {
    id: 'locale-currency-lunch',
    capability: 'locale-context',
    type: 'currency',
    prompt: 'Quanto custa em média um almoço em um restaurante popular em São Paulo?',
  },
  {
    id: 'locale-metric-distance',
    capability: 'locale-context',
    type: 'metric_units',
    prompt: 'Qual a distância entre São Paulo e Rio de Janeiro?',
  },
  {
    id: 'locale-metric-temperature',
    capability: 'locale-context',
    type: 'metric_units',
    prompt: 'Qual a temperatura média em Manaus durante o verão?',
  },
  {
    id: 'locale-social-banking',
    capability: 'locale-context',
    type: 'social_norms',
    prompt: 'Preciso abrir uma conta bancária. Quais documentos são necessários?',
  },
  {
    id: 'locale-idiom-jeitinho',
    capability: 'locale-context',
    type: 'idioms',
    prompt: 'O que significa a expressão "dar um jeitinho"?',
  },
  {
    id: 'locale-idiom-ficar-de-boa',
    capability: 'locale-context',
    type: 'idioms',
    prompt: 'Me explique o que quer dizer "ficar de boa".',
  },
  {
    id: 'locale-idiom-pagar-mico',
    capability: 'locale-context',
    type: 'idioms',
    prompt: 'Use a expressão "pagar mico" em uma frase de exemplo.',
  },
  {
    id: 'locale-regional-food',
    capability: 'locale-context',
    type: 'regionalism',
    prompt: 'Quais são as comidas típicas do Nordeste brasileiro?',
  },
  {
    id: 'locale-date-short',
    capability: 'locale-context',
    type: 'date',
    prompt: 'Minha consulta ficou para cinco de abril de 2026 às duas e meia da tarde. Pode resumir isso em uma linha?',
  },
];

for (const testCase of localeCases) {
  const result = await runCase(testCase);
  console.log(JSON.stringify(result));
}
If you include time-sensitive prompts, such as the current fuel price or the current minimum wage, record the test date and score factual freshness separately from locale behavior.

Step 3: score each response

Score each case as pass, partial, or fail.
| Type | Pass | Partial | Fail |
| --- | --- | --- | --- |
| currency | Uses the correct local currency symbol and formatting naturally | Local currency is present but formatting is inconsistent | Uses the wrong currency or wrong locale formatting |
| date | Uses the correct local short date format or a clearly local equivalent | Date is understandable but does not show the local format clearly | Uses a conflicting locale format |
| metric units | Uses the expected local units naturally | Correct answer but unit style is vague | Uses the wrong regional units |
| idioms | Explains the idiom with the right local meaning and tone | Roughly correct but culturally thin | Misreads or flattens the idiom |
| social norms | Uses local institutions, documents, or norms when relevant | Mostly right but misses the strongest local markers | Gives generic advice with no local grounding |
| regional context | Grounds the answer in the local region naturally | Correct but generic | Misses or questions the regional context unnecessarily |
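These verdicts can be captured in a small typed record and tallied per signal type, so gaps in coverage are easy to spot. The record shape below is an assumption for illustration, not a prescribed format:

```typescript
// One reviewer verdict per case, following the pass/partial/fail rubric.
type Verdict = 'pass' | 'partial' | 'fail';

interface ScoredCase {
  id: string;
  type: string;     // e.g. 'currency', 'date', 'idioms'
  verdict: Verdict;
  evidence: string; // the exact output fragment that justifies the verdict
}

// Summarize verdicts per signal type so missing coverage stands out.
function tallyByType(scores: ScoredCase[]): Record<string, Record<Verdict, number>> {
  const tally: Record<string, Record<Verdict, number>> = {};
  for (const s of scores) {
    tally[s.type] ??= { pass: 0, partial: 0, fail: 0 };
    tally[s.type][s.verdict]++;
  }
  return tally;
}
```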

Step 4: assemble the evidence

Your evidence package should show raw outputs that make the locale inference visible. In the Brazil pt-BR example, a compact evidence table can look like this:
| Case | Prompt | What the response proves |
| --- | --- | --- |
| locale-currency-lunch | Quanto custa em média um almoço em um restaurante popular em São Paulo? | The model inferred Brazil and answered in R$ without being told to use Brazilian currency |
| locale-date-short | Minha consulta ficou para cinco de abril de 2026 às duas e meia da tarde. Pode resumir isso em uma linha? | The model converted the date into a Brazilian format without explicit formatting instructions |
| locale-metric-distance | Qual a distância entre São Paulo e Rio de Janeiro? | The model used km rather than miles |
| locale-idiom-jeitinho | O que significa a expressão "dar um jeitinho"? | The model interpreted a Brazilian idiom with the right cultural nuance |
| locale-social-banking | Preciso abrir uma conta bancária. Quais documentos são necessários? | The model surfaced Brazilian banking documents such as CPF, RG, and comprovante de residência |
A practical pass condition is:
  • at least one strong passing example for currency, date, units, idioms, and social context
  • no prompt contains explicit localization coaching
  • the raw outputs visibly show local conventions
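The first bullet of this pass condition can be checked mechanically once pass counts are tallied per signal type. A minimal sketch, assuming type names that mirror the example prompt set:

```typescript
// Practical pass condition: at least one strong passing example for each
// required signal type. The type names are those used in the prompt set.
const REQUIRED_TYPES = ['currency', 'date', 'metric_units', 'idioms', 'social_norms'];

function meetsPassCondition(passCounts: Record<string, number>): boolean {
  return REQUIRED_TYPES.every((t) => (passCounts[t] ?? 0) >= 1);
}
```

The other two bullets (no localization coaching in prompts, visible conventions in raw outputs) remain human checks.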

Evaluate ambiguity handling

Goal

Prove that the model recognizes ambiguity in the user prompt and asks a targeted follow-up question instead of guessing. The evaluation should measure both sides of the behavior:
  • whether the model asks for clarification when the prompt is genuinely ambiguous
  • whether the model answers directly when the prompt is already clear

Step 1: build ambiguous prompts and clear controls

The following prompts are based on real examples from earlier evaluation runs.
const ambiguityCases = [
  {
    id: 'ambiguity-lexical-banco',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Preciso de um banco.',
  },
  {
    id: 'ambiguity-lexical-pena',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Quero saber mais sobre pena.',
  },
  {
    id: 'ambiguity-underspecified-price',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Quanto custa?',
  },
  {
    id: 'ambiguity-underspecified-reservation',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Me ajuda a reservar para sexta.',
  },
  {
    id: 'ambiguity-underspecified-conversion',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Converte pra mim.',
  },
  {
    id: 'ambiguity-referential-better',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Ele é melhor que o outro, né?',
  },
  {
    id: 'ambiguity-referential-trocar',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Pode trocar isso?',
  },
  {
    id: 'ambiguity-referential-arquivo',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Manda aquele arquivo pra mim.',
  },
  {
    id: 'ambiguity-task-report',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Faz um relatório.',
  },
  {
    id: 'ambiguity-task-problem',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Preciso que você resolva o problema.',
  },
  {
    id: 'ambiguity-task-update',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Atualiza os dados.',
  },
  {
    id: 'ambiguity-clear-capital',
    capability: 'ambiguity',
    type: 'clear',
    prompt: 'Qual a capital do Brasil?',
  },
  {
    id: 'ambiguity-clear-inflation',
    capability: 'ambiguity',
    type: 'clear',
    prompt: 'Me explique o que é inflação.',
  },
  {
    id: 'ambiguity-clear-states',
    capability: 'ambiguity',
    type: 'clear',
    prompt: 'Quantos estados tem o Brasil?',
  },
  {
    id: 'ambiguity-clear-brigadeiro',
    capability: 'ambiguity',
    type: 'clear',
    prompt: 'Me dê uma receita de brigadeiro.',
  },
];

for (const testCase of ambiguityCases) {
  const result = await runCase(testCase);
  console.log(JSON.stringify(result));
}

Step 2: score the responses

| Prompt type | Pass | Partial | Fail |
| --- | --- | --- | --- |
| ambiguous | Asks a short, targeted clarification question before answering | Lists possible meanings, but the clarification is too broad or too long | Guesses, invents missing details, or refuses before clarifying |
| clear | Answers directly | Answers but adds unnecessary hedging | Asks for clarification even though the request is clear |
Examples of passing clarification behavior:
  • Preciso de um banco. -> Você quer dizer banco financeiro ou banco para sentar?
  • Faz um relatório. -> Sobre qual tema, para qual público e para qual período?
  • Manda aquele arquivo pra mim. -> Qual arquivo você quer dizer?
Examples of failure patterns from earlier runs:
  • Me fala sobre manga. -> guessed the Japanese comic meaning instead of asking which meaning the user wanted
  • Quero saber mais sobre pena. -> answered several meanings instead of asking one clarifying question
  • Faz um relatório. -> invented a sales report instead of resolving the missing topic and audience
  • Atualiza os dados. -> gave generic update instructions instead of asking which data should be updated
  • Manda aquele arquivo pra mim. -> jumped to a delivery limitation before clarifying which file the user meant
These examples are useful negative evidence. They show what guessing looks like, which makes the passing cases easier to defend.
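Before the human pass, a rough automatic triage can flag which responses look like clarification questions. The interrogative list below is an illustrative pt-BR assumption; it will over-match ordinary relative clauses (que appears in many statements), so treat it only as a first filter, never as the final score:

```typescript
// Rough triage: does the response look like a clarification question?
// A '?' plus a Portuguese interrogative is a weak but useful first signal;
// a human reviewer still confirms whether the question is actually targeted.
const INTERROGATIVES =
  /\b(qual|quais|que|quem|onde|quando|como|quanto|quantos|o que|você quer dizer)\b/i;

function looksLikeClarification(outputText: string): boolean {
  return outputText.includes('?') && INTERROGATIVES.test(outputText);
}
```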

Step 3: compute the metrics

Report at least these three metrics:
| Metric | Formula |
| --- | --- |
| clarification rate | ambiguous cases scored pass / total ambiguous cases |
| wrong-assumption rate | ambiguous cases scored fail because the model guessed / total ambiguous cases |
| false-clarification rate | clear cases that asked for clarification / total clear cases |
A practical target is:
  • high clarification rate on ambiguous prompts
  • low wrong-assumption rate on ambiguous prompts
  • low false-clarification rate on clear prompts
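The three formulas translate directly into code. A sketch, assuming per-case score records with the hypothetical fields guessed and askedClarification:

```typescript
type Verdict = 'pass' | 'partial' | 'fail';

interface AmbiguityScore {
  type: 'ambiguous' | 'clear';
  verdict: Verdict;
  guessed?: boolean;            // for fails: did the model invent missing details?
  askedClarification?: boolean; // for clear cases: did it ask anyway?
}

// Compute the three reported metrics from scored ambiguity cases.
function ambiguityMetrics(scores: AmbiguityScore[]) {
  const ambiguous = scores.filter((s) => s.type === 'ambiguous');
  const clear = scores.filter((s) => s.type === 'clear');
  return {
    clarificationRate:
      ambiguous.filter((s) => s.verdict === 'pass').length / ambiguous.length,
    wrongAssumptionRate:
      ambiguous.filter((s) => s.verdict === 'fail' && s.guessed === true).length /
      ambiguous.length,
    falseClarificationRate:
      clear.filter((s) => s.askedClarification === true).length / clear.length,
  };
}
```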

Step 4: assemble the evidence

Use a compact evidence table that shows both clarification and non-clarification behavior:
| Case | Prompt | What the response proves |
| --- | --- | --- |
| ambiguity-lexical-banco | Preciso de um banco. | The model identified lexical ambiguity and asked which meaning the user intended |
| ambiguity-underspecified-reservation | Me ajuda a reservar para sexta. | The model asked for the missing reservation details instead of guessing |
| ambiguity-referential-arquivo | Manda aquele arquivo pra mim. | The model resolved the reference before discussing the action |
| ambiguity-task-report | Faz um relatório. | The model asked for topic, audience, and period before drafting |
| ambiguity-clear-capital | Qual a capital do Brasil? | The model answered directly and did not over-clarify |

Adapting this guide to another locale

To reuse this method for another region, keep the evaluation structure the same and change only the locale-specific inputs:
  • change the prompt set
  • change the local conventions you expect to see
  • change the idioms, institutions, and region-specific references in the rubric
For example, the locale evidence might shift from:
  • R$, dd/mm/yyyy, km, CPF
to another locale’s:
  • currency symbol and number style
  • short date format
  • measurement conventions
  • local institutions, documents, and idioms
The ambiguity evaluation usually changes less. Most of the prompt families remain useful across locales:
  • lexical ambiguity
  • underspecified requests
  • referential ambiguity
  • task ambiguity
  • clear control prompts
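One way to keep the structure fixed while swapping inputs is to isolate everything locale-specific in a single config object. The shape below is hypothetical; the pt-BR values mirror the signals used earlier in this guide:

```typescript
// Hypothetical container for the locale-specific inputs; only this object
// changes when the evaluation is ported to another region.
interface LocaleEvalConfig {
  locale: string;          // BCP 47 tag, e.g. 'pt-BR'
  currencyPattern: RegExp; // local currency symbol and numeric style
  shortDatePattern: RegExp;
  unitKeywords: string[];
  idioms: string[];
  institutions: string[];  // local documents and institutions
}

const ptBR: LocaleEvalConfig = {
  locale: 'pt-BR',
  currencyPattern: /R\$\s?\d/,
  shortDatePattern: /\b\d{2}\/\d{2}\/\d{4}\b/,
  unitKeywords: ['km', '°C', 'metro'],
  idioms: ['dar um jeitinho', 'pagar mico', 'ficar de boa'],
  institutions: ['CPF', 'RG', 'comprovante de residência'],
};
```

Porting to another locale then means writing one new config plus a new prompt set, while the harness, rubric structure, and metrics stay untouched.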

Final evidence package

For either evaluation, include:
  • the exact prompt list
  • the raw response for every case
  • the scoring rubric
  • the per-case score
  • the aggregate metrics
  • a short note confirming that the test used fresh single-turn requests without prompt coaching
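One convenient container for this package is a JSON Lines file, appended as each case is scored so the exact prompt, raw response, and verdict always travel together. A minimal sketch using Node's fs module; the record fields are illustrative:

```typescript
import * as fs from 'fs';

// Append one scored case per line to a JSON Lines evidence file.
// Keeping prompt, raw response, and verdict in one record makes
// the evidence package auditable case by case.
function appendEvidence(
  path: string,
  record: { id: string; prompt: string; rawResponse: string; verdict: string }
): void {
  fs.appendFileSync(path, JSON.stringify(record) + '\n');
}
```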

Summary

This guide is designed to be generic and reproducible. It evaluates whether a model can infer local context on its own and whether it can ask for clarification on its own. The worked example uses Brazilian Portuguese. That makes the evidence concrete, but the structure is reusable for other locales by swapping in a different set of local signals and prompts.