This guide shows how to evaluate two related behaviors:
- inferring locale context from the user prompt alone
- asking for clarification when the user prompt is ambiguous
The worked examples use Brazilian Portuguese (`pt-BR`).
The same method can be reused for other locales by changing the prompt set and the scoring signals.
Run every case as a fresh single-turn request.
Do not preload examples that teach the model the exact behavior you plan to score.
Evaluation principles
Use the same setup for both evaluations:
- Keep the request neutral.
- Do not explicitly instruct the model to localize to a region.
- Do not explicitly instruct the model to ask for clarification.
- Record the exact prompt and the exact raw response for every case.
- Score the output against visible behavioral signals, not against hidden intent.
Minimal harness
Use the MKA1 SDK and keep the request shape simple.
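The MKA1 SDK call itself is not shown in this guide, so the sketch below hides it behind a `call_model` function you supply (a hypothetical wrapper, not part of the SDK); everything else follows the principles above: fresh single-turn request, exact prompt and exact raw response recorded.

```python
def run_case(call_model, case_id: str, prompt: str) -> dict:
    """Run one evaluation case as a fresh single-turn request.

    call_model is your own wrapper around the MKA1 SDK completion
    call (hypothetical here); it takes a prompt string and returns
    the raw response string.
    """
    # No system examples and no prior turns: the prompt goes in alone,
    # so the model cannot be coached toward the behavior being scored.
    response = call_model(prompt)
    # Record the exact prompt and the exact raw response for scoring.
    return {"case": case_id, "prompt": prompt, "response": response}
```

Each case gets its own call; never reuse conversation state between cases.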
Evaluate locale context inference
Goal
Prove that the model can infer regional conventions from the user prompt alone and apply them naturally when the topic calls for them. In the `pt-BR` example, the most visible signals are:
- `R$` and Brazilian money formatting
- `dd/mm/yyyy` when the model turns a date into numeric form
- metric units such as `km`, `°C`, and `m`
- correct handling of local idioms and regional expressions
- local social context such as `CPF`, `RG`, and `comprovante de residência`
Step 1: choose observable locale signals
Pick signals that are easy for a reviewer to see directly in the output.
| Signal type | Generic evidence | Brazil pt-BR example |
|---|---|---|
| currency | local currency symbol and numeric style | R$ 700,00 |
| date | local short date format | 05/04/2026 |
| units | local measurement conventions | 431 km, 30°C, 1,73 metro |
| idioms | correct local meaning and usage | dar um jeitinho, pagar mico, ficar de boa |
| social norms | local documents, institutions, and expectations | CPF, RG, banking docs |
| regional context | local food, geography, or cultural references | Nordeste brasileiro, São Paulo, Manaus |
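Reviewers score by eye, but a coarse pre-screen can flag which signals are visibly present in a response. The patterns below are illustrative heuristics for the `pt-BR` signals in the table, not a replacement for human review.

```python
import re

# Illustrative heuristics for pt-BR locale signals; a reviewer still
# makes the final pass/partial/fail call.
SIGNAL_PATTERNS = {
    # R$ followed by Brazilian number style, e.g. "R$ 700,00"
    "currency": re.compile(r"R\$\s?\d{1,3}(\.\d{3})*(,\d{2})?"),
    # Local short date, e.g. "05/04/2026"
    "date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    # Metric units with a number, e.g. "431 km", "30°C"
    "units": re.compile(r"\b\d+(,\d+)?\s?(km|°C|m)\b"),
}

def flag_signals(text: str) -> dict:
    """Return which locale signals are visibly present in a response."""
    return {name: bool(p.search(text)) for name, p in SIGNAL_PATTERNS.items()}
```

A flagged signal only tells the reviewer where to look; idioms, social norms, and regional context still need human judgment.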
Step 2: run a focused prompt set
The following `pt-BR` prompts are based on real examples from earlier evaluation runs.
They work well because they expose visible local signals without explicitly asking the model to localize.
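With the SDK call abstracted away, the prompt set itself is plain data. The cases below are the ones that reappear in the Step 4 evidence table; none of them mentions Brazilian conventions explicitly.

```python
# pt-BR locale-inference cases: one prompt per signal type,
# none of which asks the model to localize.
LOCALE_CASES = [
    ("locale-currency-lunch",
     "Quanto custa em média um almoço em um restaurante popular em São Paulo?"),
    ("locale-date-short",
     "Minha consulta ficou para cinco de abril de 2026 às duas e meia da tarde. "
     "Pode resumir isso em uma linha?"),
    ("locale-metric-distance",
     "Qual a distância entre São Paulo e Rio de Janeiro?"),
    ("locale-idiom-jeitinho",
     'O que significa a expressão "dar um jeitinho"?'),
    ("locale-social-banking",
     "Preciso abrir uma conta bancária. Quais documentos são necessários?"),
]
```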
Step 3: score each response
Score each case as `pass`, `partial`, or `fail`.
| Type | Pass | Partial | Fail |
|---|---|---|---|
| currency | Uses the correct local currency symbol and formatting naturally | Local currency is present but formatting is inconsistent | Uses the wrong currency or wrong locale formatting |
| date | Uses the correct local short date format or a clearly local equivalent | Date is understandable but does not show the local format clearly | Uses a conflicting locale format |
| metric units | Uses the expected local units naturally | Correct answer but unit style is vague | Uses the wrong regional units |
| idioms | Explains the idiom with the right local meaning and tone | Roughly correct but culturally thin | Misreads or flattens the idiom |
| social norms | Uses local institutions, documents, or norms when relevant | Mostly right but misses the strongest local markers | Gives generic advice with no local grounding |
| regional context | Grounds the answer in the local region naturally | Correct but generic | Misses or questions the regional context unnecessarily |
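Before the scores feed the evidence package, a small helper can validate the per-case labels and tally them; the `pass`/`partial`/`fail` vocabulary comes straight from the rubric above.

```python
from collections import Counter

VALID_SCORES = {"pass", "partial", "fail"}

def summarize_scores(scores: dict) -> Counter:
    """Validate per-case labels and tally them.

    scores maps case_id -> 'pass' | 'partial' | 'fail'.
    """
    invalid = {case: s for case, s in scores.items() if s not in VALID_SCORES}
    if invalid:
        raise ValueError(f"unknown score labels: {invalid}")
    return Counter(scores.values())
```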
Step 4: assemble the evidence
Your evidence package should show raw outputs that make the locale inference visible. In the Brazil `pt-BR` example, a compact evidence table can look like this:
| Case | Prompt | What the response proves |
|---|---|---|
| locale-currency-lunch | Quanto custa em média um almoço em um restaurante popular em São Paulo? | The model inferred Brazil and answered in R$ without being told to use Brazilian currency |
| locale-date-short | Minha consulta ficou para cinco de abril de 2026 às duas e meia da tarde. Pode resumir isso em uma linha? | The model converted the date into a Brazilian format without explicit formatting instructions |
| locale-metric-distance | Qual a distância entre São Paulo e Rio de Janeiro? | The model used km rather than miles |
| locale-idiom-jeitinho | O que significa a expressão "dar um jeitinho"? | The model interpreted a Brazilian idiom with the right cultural nuance |
| locale-social-banking | Preciso abrir uma conta bancária. Quais documentos são necessários? | The model surfaced Brazilian banking documents such as CPF, RG, and comprovante de residência |
Check that:
- there is at least one strong passing example for currency, date, units, idioms, and social context
- no prompt contains explicit localization coaching
- the raw outputs visibly show local conventions
Evaluate ambiguity handling
Goal
Prove that the model recognizes ambiguity in the user prompt and asks a targeted follow-up question instead of guessing. The evaluation should measure both sides of the behavior:
- whether the model asks for clarification when the prompt is genuinely ambiguous
- whether the model answers directly when the prompt is already clear
Step 1: build ambiguous prompts and clear controls
The following prompts are based on real examples from earlier evaluation runs.
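As with the locale cases, the prompt set is plain data. The cases below are taken from the Step 4 evidence table; the `ambiguous`/`clear` label on each case is what drives the metrics in Step 3.

```python
# Ambiguous prompts plus a clear control; the third field labels
# each case for metric computation.
AMBIGUITY_CASES = [
    ("ambiguity-lexical-banco", "Preciso de um banco.", "ambiguous"),
    ("ambiguity-underspecified-reservation", "Me ajuda a reservar para sexta.", "ambiguous"),
    ("ambiguity-referential-arquivo", "Manda aquele arquivo pra mim.", "ambiguous"),
    ("ambiguity-task-report", "Faz um relatório.", "ambiguous"),
    ("ambiguity-clear-capital", "Qual a capital do Brasil?", "clear"),
]
```

In a full run you would add more clear controls so the false-clarification rate rests on more than one case.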
Step 2: score the responses
| Prompt type | Pass | Partial | Fail |
|---|---|---|---|
| ambiguous | Asks a short, targeted clarification question before answering | Lists possible meanings, but the clarification is too broad or too long | Guesses, invents missing details, or refuses before clarifying |
| clear | Answers directly | Answers but adds unnecessary hedging | Asks for clarification even though the request is clear |
Examples of passing clarifications from earlier runs:
- `Preciso de um banco.` -> `Você quer dizer banco financeiro ou banco para sentar?`
- `Faz um relatório.` -> `Sobre qual tema, para qual público e para qual período?`
- `Manda aquele arquivo pra mim.` -> `Qual arquivo você quer dizer?`
Examples of failing responses from earlier runs:
- `Me fala sobre manga.`: the model guessed the Japanese comic meaning instead of asking which meaning the user wanted.
- `Quero saber mais sobre pena.`: the model answered several meanings instead of asking one clarifying question.
- `Faz um relatório.`: the model invented a sales report instead of resolving the missing topic and audience.
- `Atualiza os dados.`: the model gave generic update instructions instead of asking which data should be updated.
- `Manda aquele arquivo pra mim.`: the model jumped to a delivery limitation before clarifying which file the user meant.
Step 3: compute the metrics
Report at least these three metrics:
| Metric | Formula |
|---|---|
| clarification rate | ambiguous cases scored pass / total ambiguous cases |
| wrong-assumption rate | ambiguous cases scored fail because the model guessed / total ambiguous cases |
| false-clarification rate | clear cases that asked for clarification / total clear cases |
A healthy result shows:
- high clarification rate on ambiguous prompts
- low wrong-assumption rate on ambiguous prompts
- low false-clarification rate on clear prompts
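The three formulas translate directly into code. Each scored case is assumed to carry its prompt type, its rubric score, and two reviewer judgments (whether the model guessed a meaning, and whether it asked for clarification); the field names are illustrative, not a fixed schema.

```python
def compute_metrics(cases: list) -> dict:
    """Compute the three reported metrics.

    Each case is a dict with illustrative fields:
      prompt_type: 'ambiguous' or 'clear'
      score: 'pass' | 'partial' | 'fail'
      guessed: True if the model guessed a meaning (reviewer judgment)
      asked: True if the model asked for clarification
    """
    ambiguous = [c for c in cases if c["prompt_type"] == "ambiguous"]
    clear = [c for c in cases if c["prompt_type"] == "clear"]
    return {
        # ambiguous cases scored pass / total ambiguous cases
        "clarification_rate":
            sum(c["score"] == "pass" for c in ambiguous) / len(ambiguous),
        # ambiguous fails where the model guessed / total ambiguous cases
        "wrong_assumption_rate":
            sum(c["score"] == "fail" and c["guessed"] for c in ambiguous) / len(ambiguous),
        # clear cases that asked for clarification / total clear cases
        "false_clarification_rate":
            sum(c["asked"] for c in clear) / len(clear),
    }
```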
Step 4: assemble the evidence
Use a compact evidence table that shows both clarification and non-clarification behavior:
| Case | Prompt | What the response proves |
|---|---|---|
| ambiguity-lexical-banco | Preciso de um banco. | The model identified lexical ambiguity and asked which meaning the user intended |
| ambiguity-underspecified-reservation | Me ajuda a reservar para sexta. | The model asked for the missing reservation details instead of guessing |
| ambiguity-referential-arquivo | Manda aquele arquivo pra mim. | The model resolved the reference before discussing the action |
| ambiguity-task-report | Faz um relatório. | The model asked for topic, audience, and period before drafting |
| ambiguity-clear-capital | Qual a capital do Brasil? | The model answered directly and did not over-clarify |
Adapting this guide to another locale
To reuse this method for another region, keep the evaluation structure the same and change only the locale-specific inputs:
- change the prompt set
- change the local conventions you expect to see
- change the idioms, institutions, and region-specific references in the rubric
In the `pt-BR` example, those inputs were `R$`, `dd/mm/yyyy`, `km`, and `CPF`.
For the locale evaluation, redefine the visible signals:
- currency symbol and number style
- short date format
- measurement conventions
- local institutions, documents, and idioms
For the ambiguity evaluation, rebuild the prompt set to cover the same categories:
- lexical ambiguity
- underspecified requests
- referential ambiguity
- task ambiguity
- clear control prompts
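One way to keep the swap mechanical is to gather the locale-specific inputs into a single structure, so a new locale changes the data while the harness, rubric shape, and metrics stay fixed. The keys below are illustrative, not a required schema.

```python
# Illustrative structure for the locale-specific inputs; swap in a new
# locale without touching the harness, rubric shape, or metrics.
LOCALE_INPUTS = {
    "pt-BR": {
        "currency": "R$",
        "date_format": "dd/mm/yyyy",
        "units": ["km", "°C", "m"],
        "documents": ["CPF", "RG", "comprovante de residência"],
    },
    # Adding e.g. "de-DE" would mean new prompts, "€", "dd.mm.yyyy",
    # German idioms, and German institutions, in the same structure.
}
```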
Final evidence package
For either evaluation, include:
- the exact prompt list
- the raw response for every case
- the scoring rubric
- the per-case score
- the aggregate metrics
- a short note confirming that the test used fresh single-turn requests without prompt coaching
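The list above maps directly onto a small JSON artifact. The layout below is one possible sketch, not a required format; anything that preserves the exact prompts and raw responses works.

```python
import json

def write_evidence_package(path, prompts, responses, rubric, scores, metrics):
    """Write the final evidence package as one JSON file.

    The field layout is illustrative; the important property is that
    the exact prompts and raw responses survive unmodified.
    """
    package = {
        "prompts": prompts,          # the exact prompt list
        "responses": responses,      # the raw response for every case
        "rubric": rubric,            # the scoring rubric
        "per_case_scores": scores,   # the per-case score
        "aggregate_metrics": metrics,
        "note": "fresh single-turn requests, no prompt coaching",
    }
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Portuguese text readable in the file
        json.dump(package, f, ensure_ascii=False, indent=2)
```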