
# Evaluate regional localization and ambiguity handling

> Run neutral evaluations with the MKA1 SDK that show whether a model localizes to a region and asks for clarification without being told to.

This guide shows how to evaluate two model behaviors in a way that is reproducible and easy to adapt:

* inferring locale context from the user prompt alone
* asking for clarification when the user prompt is ambiguous

The worked example in this document uses Brazilian Portuguese (`pt-BR`).
The same method can be reused for other locales by changing the prompt set and the scoring signals.

Run every case as a fresh single-turn request.
Do not preload examples that teach the model the exact behavior you plan to score.

## Evaluation principles

Use the same setup for both evaluations:

* Keep the request neutral.
* Do not explicitly instruct the model to localize to a region.
* Do not explicitly instruct the model to ask for clarification.
* Record the exact prompt and the exact raw response for every case.
* Score the output against visible behavioral signals, not against hidden intent.

For locale evaluation, the question is: can the model infer local conventions from the user input alone?

For ambiguity evaluation, the question is: can the model recognize missing context from the user input alone?
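
The neutrality rules above can be partly checked before any request is sent. The following is a minimal sketch, assuming a small hand-written list of coaching phrases; `COACHING_PATTERNS`, `PromptCase`, and `findCoachedPrompts` are hypothetical names, and the phrase list is only a starting point that a reviewer would extend.

```ts
// Hypothetical coaching phrases; these patterns are assumptions, not an official list.
const COACHING_PATTERNS: RegExp[] = [
  /formato brasileiro/i,     // tells the model which locale format to use
  /\bem reais\b/i,           // tells the model which currency to use
  /\bpt-BR\b/i,              // names the target locale outright
  /pe[çc]a esclarecimento/i, // tells the model to ask for clarification
  /fa[çc]a uma pergunta/i,   // tells the model to ask a question
];

interface PromptCase {
  id: string;
  prompt: string;
}

// Returns the ids of prompts that contain at least one coaching phrase.
function findCoachedPrompts(cases: PromptCase[]): string[] {
  return cases
    .filter((c) => COACHING_PATTERNS.some((re) => re.test(c.prompt)))
    .map((c) => c.id);
}
```

An empty result does not prove a prompt set is neutral; it only rules out the phrases listed.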

## Minimal harness

Use the MKA1 SDK and keep the request shape simple:

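```ts MKA1 SDK
import { SDK } from '@meetkai/mka1';

// Authenticate with the API key from the environment.
const mka1 = new SDK({
  bearerAuth: `Bearer ${process.env.MKA1_API_KEY}`,
});

// Replace the placeholder with the ID of the end user the request is made for.
const REQUEST_OPTIONS = {
  headers: { 'X-On-Behalf-Of': '<end-user-id>' },
};

// Sends one test case as a fresh single-turn request and returns the case with its raw output.
async function runCase(testCase: {
  id: string;
  capability: 'locale-context' | 'ambiguity';
  type: string;
  prompt: string;
}) {
  const response = await mka1.llm.responses.create(
    {
      model: 'meetkai:functionary-pt',
      // Only the user prompt is sent: no instructions, history, or examples.
      input: testCase.prompt,
      stream: false,
      // Tag the request with the case details.
      metadata: {
        capability: testCase.capability,
        eval_case: testCase.id,
        eval_type: testCase.type,
      },
    },
    REQUEST_OPTIONS
  );

  return {
    ...testCase,
    outputText: response.outputText,
  };
}
```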

## Evaluate locale context inference

### Goal

Prove that the model can infer regional conventions from the user prompt alone and apply them naturally when the topic calls for them.

In the `pt-BR` example, the most visible signals are:

* `R$` and Brazilian money formatting
* `dd/mm/yyyy` when the model turns a date into numeric form
* metric units such as `km`, `°C`, and `m`
* correct handling of local idioms and regional expressions
* local social context such as `CPF`, `RG`, and `comprovante de residência`

### Step 1: choose observable locale signals

Pick signals that are easy for a reviewer to see directly in the output.

| Signal type      | Generic evidence                                | Brazil `pt-BR` example                          |
| ---------------- | ----------------------------------------------- | ----------------------------------------------- |
| currency         | local currency symbol and numeric style         | `R$ 700,00`                                     |
| date             | local short date format                         | `05/04/2026`                                    |
| units            | local measurement conventions                   | `431 km`, `30°C`, `1,73 metro`                  |
| idioms           | correct local meaning and usage                 | `dar um jeitinho`, `pagar mico`, `ficar de boa` |
| social norms     | local documents, institutions, and expectations | `CPF`, `RG`, banking docs                       |
| regional context | local food, geography, or cultural references   | Nordeste brasileiro, São Paulo, Manaus          |
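
Some of the signals in the table can be flagged automatically as a first pass before human review. The sketch below assumes simple regular expressions for the `pt-BR` currency, date, unit, and document signals; `PT_BR_SIGNALS` and `detectSignals` are hypothetical names, and the patterns are illustrative rather than exhaustive. Idioms and regional context are left to the reviewer.

```ts
// Hypothetical first-pass detectors for pt-BR signals; a reviewer makes the final call.
const PT_BR_SIGNALS: Record<string, RegExp> = {
  // "R$ 700,00" or "R$ 1.250,50": R$ plus dot-grouped thousands and optional comma decimals
  currency: /R\$\s?\d{1,3}(\.\d{3})*(,\d{2})?/,
  // dd/mm/yyyy, for example "05/04/2026"
  date: /\b\d{2}\/\d{2}\/\d{4}\b/,
  // a digit followed by km, °C, m, or metro(s)
  units: /\d\s?(km|°C|m|metros?)\b/,
  // common Brazilian identity and address documents
  documents: /\b(CPF|RG)\b|comprovante de residência/,
};

// Returns, for each signal name, whether the pattern appears in the raw output.
function detectSignals(outputText: string): Record<string, boolean> {
  const found: Record<string, boolean> = {};
  for (const key of Object.keys(PT_BR_SIGNALS)) {
    found[key] = PT_BR_SIGNALS[key].test(outputText);
  }
  return found;
}
```

A match only shows that the surface form is present. Whether the model used it naturally is still a reviewer judgment.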

### Step 2: run a focused prompt set

The following `pt-BR` prompts are based on real examples from earlier evaluation runs.
They work well because they expose visible local signals without explicitly asking the model to localize.

```ts MKA1 SDK
const localeCases = [
  {
    id: 'locale-currency-lunch',
    capability: 'locale-context',
    type: 'currency',
    prompt: 'Quanto custa em média um almoço em um restaurante popular em São Paulo?',
  },
  {
    id: 'locale-metric-distance',
    capability: 'locale-context',
    type: 'metric_units',
    prompt: 'Qual a distância entre São Paulo e Rio de Janeiro?',
  },
  {
    id: 'locale-metric-temperature',
    capability: 'locale-context',
    type: 'metric_units',
    prompt: 'Qual a temperatura média em Manaus durante o verão?',
  },
  {
    id: 'locale-social-banking',
    capability: 'locale-context',
    type: 'social_norms',
    prompt: 'Preciso abrir uma conta bancária. Quais documentos são necessários?',
  },
  {
    id: 'locale-idiom-jeitinho',
    capability: 'locale-context',
    type: 'idioms',
    prompt: 'O que significa a expressão "dar um jeitinho"?',
  },
  {
    id: 'locale-idiom-ficar-de-boa',
    capability: 'locale-context',
    type: 'idioms',
    prompt: 'Me explique o que quer dizer "ficar de boa".',
  },
  {
    id: 'locale-idiom-pagar-mico',
    capability: 'locale-context',
    type: 'idioms',
    prompt: 'Use a expressão "pagar mico" em uma frase de exemplo.',
  },
  {
    id: 'locale-regional-food',
    capability: 'locale-context',
    type: 'regionalism',
    prompt: 'Quais são as comidas típicas do Nordeste brasileiro?',
  },
  {
    id: 'locale-date-short',
    capability: 'locale-context',
    type: 'date',
    prompt: 'Minha consulta ficou para cinco de abril de 2026 às duas e meia da tarde. Pode resumir isso em uma linha?',
  },
];

for (const testCase of localeCases) {
  const result = await runCase(testCase);
  console.log(JSON.stringify(result));
}
```

If you include time-sensitive prompts such as current fuel price or current minimum wage, record the test date and score factual freshness separately from locale behavior.
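
One way to record the test date is to stamp each result when it is written out. This is a minimal sketch, assuming results shaped like the output of `runCase`; the `CaseResult` interface, the `toEvidenceLine` helper, and the `testedAt` field name are illustrative choices, not part of the SDK.

```ts
// Assumed shape of a result returned by the harness.
interface CaseResult {
  id: string;
  capability: string;
  type: string;
  prompt: string;
  outputText: string;
}

// Serializes one result as a JSON line with an ISO 8601 timestamp.
// testedAt defaults to the current time and can be overridden for reproducible logs.
function toEvidenceLine(result: CaseResult, testedAt: Date = new Date()): string {
  return JSON.stringify({ ...result, testedAt: testedAt.toISOString() });
}
```

Keeping the timestamp next to the raw output lets a reviewer judge factual freshness later without rerunning the case.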

### Step 3: score each response

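Score each case as `pass`, `partial`, or `fail`.
The Type column uses readable names for the `type` values in the prompt set: `metric units` corresponds to `metric_units`, `social norms` to `social_norms`, and `regional context` to `regionalism`.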

| Type             | Pass                                                                   | Partial                                                           | Fail                                                   |
| ---------------- | ---------------------------------------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------ |
| currency         | Uses the correct local currency symbol and formatting naturally        | Local currency is present but formatting is inconsistent          | Uses the wrong currency or wrong locale formatting     |
| date             | Uses the correct local short date format or a clearly local equivalent | Date is understandable but does not show the local format clearly | Uses a conflicting locale format                       |
| metric units     | Uses the expected local units naturally                                | Correct answer but unit style is vague                            | Uses the wrong regional units                          |
| idioms           | Explains the idiom with the right local meaning and tone               | Roughly correct but culturally thin                               | Misreads or flattens the idiom                         |
| social norms     | Uses local institutions, documents, or norms when relevant             | Mostly right but misses the strongest local markers               | Gives generic advice with no local grounding           |
| regional context | Grounds the answer in the local region naturally                       | Correct but generic                                               | Misses or questions the regional context unnecessarily |
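
For the currency row, a rough pre-score can be computed before a reviewer confirms it. The sketch below is one possible heuristic, not the rubric itself: it assumes that a response with no `R$` amount is a `fail`, that consistent Brazilian formatting is a `pass`, and that anything in between is `partial`. `preScoreCurrency` and `BR_AMOUNT` are hypothetical names.

```ts
type Score = 'pass' | 'partial' | 'fail';

// Brazilian style: optional dot-grouped thousands and optional comma decimals.
const BR_AMOUNT = /^R\$\s?(\d{1,3}(\.\d{3})+|\d+)(,\d{2})?$/;

// Heuristic pre-score for currency cases; a reviewer should confirm the result.
function preScoreCurrency(outputText: string): Score {
  // Collect every "R$ <number>" mention in the raw output.
  const mentions: string[] = outputText.match(/R\$\s?\d[\d.,]*/g) ?? [];
  if (mentions.length === 0) return 'fail';
  // Strip trailing sentence punctuation before checking the number style.
  const allLocal = mentions.every((m) => BR_AMOUNT.test(m.replace(/[.,]+$/, '')));
  return allLocal ? 'pass' : 'partial';
}
```

The heuristic cannot tell whether the currency was used naturally, so it is best treated as a triage step that narrows what the reviewer has to read.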

### Step 4: assemble the evidence

Your evidence package should show raw outputs that make the locale inference visible.

In the Brazil `pt-BR` example, a compact evidence table can look like this:

| Case                     | Prompt                                                                                                      | What the response proves                                                                            |
| ------------------------ | ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| `locale-currency-lunch`  | `Quanto custa em média um almoço em um restaurante popular em São Paulo?`                                   | The model inferred Brazil and answered in `R$` without being told to use Brazilian currency         |
| `locale-date-short`      | `Minha consulta ficou para cinco de abril de 2026 às duas e meia da tarde. Pode resumir isso em uma linha?` | The model converted the date into a Brazilian format without explicit formatting instructions       |
| `locale-metric-distance` | `Qual a distância entre São Paulo e Rio de Janeiro?`                                                        | The model used `km` rather than miles                                                               |
| `locale-idiom-jeitinho`  | `O que significa a expressão "dar um jeitinho"?`                                                            | The model interpreted a Brazilian idiom with the right cultural nuance                              |
| `locale-social-banking`  | `Preciso abrir uma conta bancária. Quais documentos são necessários?`                                       | The model surfaced Brazilian banking documents such as `CPF`, `RG`, and `comprovante de residência` |

A practical pass condition is:

* at least one strong passing example for currency, date, units, idioms, and social context
* no prompt contains explicit localization coaching
* the raw outputs visibly show local conventions
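
The first bullet of the pass condition can be checked mechanically once each case has a score. A minimal sketch follows, assuming scored cases carry the same `type` values as the prompt set; `ScoredCase`, `REQUIRED_TYPES`, and `meetsLocalePassCondition` are hypothetical names. The other two bullets still need a reviewer.

```ts
type Score = 'pass' | 'partial' | 'fail';

interface ScoredCase {
  id: string;
  type: string;
  score: Score;
}

// Types that need at least one passing case, using the prompt set's type values.
const REQUIRED_TYPES: string[] = ['currency', 'date', 'metric_units', 'idioms', 'social_norms'];

// Reports whether every required type has at least one case scored "pass".
function meetsLocalePassCondition(scored: ScoredCase[]): { ok: boolean; missing: string[] } {
  const passedTypes = new Set(scored.filter((c) => c.score === 'pass').map((c) => c.type));
  const missing = REQUIRED_TYPES.filter((t) => !passedTypes.has(t));
  return { ok: missing.length === 0, missing };
}
```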

## Evaluate ambiguity handling

### Goal

Prove that the model recognizes ambiguity in the user prompt and asks a targeted follow-up question instead of guessing.

The evaluation should measure both sides of the behavior:

* whether the model asks for clarification when the prompt is genuinely ambiguous
* whether the model answers directly when the prompt is already clear

### Step 1: build ambiguous prompts and clear controls

The following prompts are based on real examples from earlier evaluation runs.

```ts MKA1 SDK
const ambiguityCases = [
  {
    id: 'ambiguity-lexical-banco',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Preciso de um banco.',
  },
  {
    id: 'ambiguity-lexical-pena',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Quero saber mais sobre pena.',
  },
  {
    id: 'ambiguity-underspecified-price',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Quanto custa?',
  },
  {
    id: 'ambiguity-underspecified-reservation',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Me ajuda a reservar para sexta.',
  },
  {
    id: 'ambiguity-underspecified-conversion',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Converte pra mim.',
  },
  {
    id: 'ambiguity-referential-better',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Ele é melhor que o outro, né?',
  },
  {
    id: 'ambiguity-referential-trocar',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Pode trocar isso?',
  },
  {
    id: 'ambiguity-referential-arquivo',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Manda aquele arquivo pra mim.',
  },
  {
    id: 'ambiguity-task-report',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Faz um relatório.',
  },
  {
    id: 'ambiguity-task-problem',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Preciso que você resolva o problema.',
  },
  {
    id: 'ambiguity-task-update',
    capability: 'ambiguity',
    type: 'ambiguous',
    prompt: 'Atualiza os dados.',
  },
  {
    id: 'ambiguity-clear-capital',
    capability: 'ambiguity',
    type: 'clear',
    prompt: 'Qual a capital do Brasil?',
  },
  {
    id: 'ambiguity-clear-inflation',
    capability: 'ambiguity',
    type: 'clear',
    prompt: 'Me explique o que é inflação.',
  },
  {
    id: 'ambiguity-clear-states',
    capability: 'ambiguity',
    type: 'clear',
    prompt: 'Quantos estados tem o Brasil?',
  },
  {
    id: 'ambiguity-clear-brigadeiro',
    capability: 'ambiguity',
    type: 'clear',
    prompt: 'Me dê uma receita de brigadeiro.',
  },
];

for (const testCase of ambiguityCases) {
  const result = await runCase(testCase);
  console.log(JSON.stringify(result));
}
```

### Step 2: score the responses

| Prompt type | Pass                                                           | Partial                                                                 | Fail                                                           |
| ----------- | -------------------------------------------------------------- | ----------------------------------------------------------------------- | -------------------------------------------------------------- |
| ambiguous   | Asks a short, targeted clarification question before answering | Lists possible meanings, but the clarification is too broad or too long | Guesses, invents missing details, or refuses before clarifying |
| clear       | Answers directly                                               | Answers but adds unnecessary hedging                                    | Asks for clarification even though the request is clear        |

Examples of passing clarification behavior:

* `Preciso de um banco.` -> `Você quer dizer banco financeiro ou banco para sentar?`
* `Faz um relatório.` -> `Sobre qual tema, para qual público e para qual período?`
* `Manda aquele arquivo pra mim.` -> `Qual arquivo você quer dizer?`

Examples of failure patterns from earlier runs:

* `Me fala sobre manga.` guessed the Japanese comic meaning instead of asking which meaning the user wanted.
* `Quero saber mais sobre pena.` answered several meanings instead of asking one clarifying question.
* `Faz um relatório.` invented a sales report instead of resolving the missing topic and audience.
* `Atualiza os dados.` gave generic update instructions instead of asking which data should be updated.
* `Manda aquele arquivo pra mim.` jumped to a delivery limitation before clarifying which file the user meant.

These examples are useful negative evidence.
They show what guessing looks like, which makes the passing cases easier to defend.
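
A crude triage step can separate likely clarifications from likely answers before manual scoring. The sketch below assumes that a clarification is short and ends with a question mark; `classifyBehavior`, `ObservedBehavior`, and the 300-character default are illustrative assumptions, and the heuristic will misjudge cases such as a short answer that ends with a polite follow-up question.

```ts
type ObservedBehavior = 'clarified' | 'answered';

// Heuristic: treat a short response that ends with "?" as a clarification.
// maxChars is an arbitrary illustrative threshold, not a calibrated value.
function classifyBehavior(outputText: string, maxChars: number = 300): ObservedBehavior {
  const text = outputText.trim();
  return text.endsWith('?') && text.length <= maxChars ? 'clarified' : 'answered';
}
```

A long response that lists several meanings and then asks a question is labeled `answered` here, which roughly matches the `partial` row of the rubric and should be reviewed by hand.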

### Step 3: compute the metrics

Report at least these three metrics:

| Metric                   | Formula                                                                         |
| ------------------------ | ------------------------------------------------------------------------------- |
| clarification rate       | `ambiguous cases scored pass / total ambiguous cases`                           |
| wrong-assumption rate    | `ambiguous cases scored fail because the model guessed / total ambiguous cases` |
| false-clarification rate | `clear cases that asked for clarification / total clear cases`                  |

A practical target is:

* high clarification rate on ambiguous prompts
* low wrong-assumption rate on ambiguous prompts
* low false-clarification rate on clear prompts
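
The three formulas can be computed directly from the scored cases. This is a minimal sketch, assuming the reviewer records two extra flags per case; `ScoredAmbiguityCase`, `guessed`, `askedClarification`, and `computeMetrics` are hypothetical names.

```ts
type Score = 'pass' | 'partial' | 'fail';

interface ScoredAmbiguityCase {
  id: string;
  type: 'ambiguous' | 'clear';
  score: Score;
  guessed?: boolean;            // reviewer flag: the fail was caused by guessing
  askedClarification?: boolean; // reviewer flag: the model asked a clarifying question
}

// Computes the three rates; an empty group yields 0 to avoid division by zero.
function computeMetrics(cases: ScoredAmbiguityCase[]) {
  const ambiguous = cases.filter((c) => c.type === 'ambiguous');
  const clear = cases.filter((c) => c.type === 'clear');
  const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);
  return {
    clarificationRate: ratio(ambiguous.filter((c) => c.score === 'pass').length, ambiguous.length),
    wrongAssumptionRate: ratio(
      ambiguous.filter((c) => c.score === 'fail' && c.guessed === true).length,
      ambiguous.length
    ),
    falseClarificationRate: ratio(
      clear.filter((c) => c.askedClarification === true).length,
      clear.length
    ),
  };
}
```

For example, with four ambiguous cases (two passes, one partial, one fail caused by guessing) and two clear cases (one of which asked for clarification), the rates are 0.5, 0.25, and 0.5.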

### Step 4: assemble the evidence

Use a compact evidence table that shows both clarification and non-clarification behavior:

| Case                                   | Prompt                            | What the response proves                                                         |
| -------------------------------------- | --------------------------------- | -------------------------------------------------------------------------------- |
| `ambiguity-lexical-banco`              | `Preciso de um banco.`            | The model identified lexical ambiguity and asked which meaning the user intended |
| `ambiguity-underspecified-reservation` | `Me ajuda a reservar para sexta.` | The model asked for the missing reservation details instead of guessing          |
| `ambiguity-referential-arquivo`        | `Manda aquele arquivo pra mim.`   | The model asked which file the user meant before discussing the action           |
| `ambiguity-task-report`                | `Faz um relatório.`               | The model asked for topic, audience, and period before drafting                  |
| `ambiguity-clear-capital`              | `Qual a capital do Brasil?`       | The model answered directly and did not over-clarify                             |
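
An evidence table like the one above can be generated from reviewed cases instead of being written by hand. The sketch below assumes each row already carries a reviewer-written note; `EvidenceRow` and `toEvidenceTable` are hypothetical names.

```ts
interface EvidenceRow {
  id: string;
  prompt: string;
  proves: string; // reviewer-written note on what the response shows
}

// Renders reviewed cases as a Markdown table with the same three columns.
function toEvidenceTable(rows: EvidenceRow[]): string {
  // Escape pipes so cell content cannot break the table layout.
  const esc = (s: string): string => s.replace(/\|/g, '\\|');
  const header = ['| Case | Prompt | What the response proves |', '| --- | --- | --- |'];
  const body = rows.map(
    (r) => '| `' + r.id + '` | `' + esc(r.prompt) + '` | ' + esc(r.proves) + ' |'
  );
  return header.concat(body).join('\n');
}
```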

## Adapting this guide to another locale

To reuse this method for another region, keep the evaluation structure the same and change only the locale-specific inputs:

* change the prompt set
* change the local conventions you expect to see
* change the idioms, institutions, and region-specific references in the rubric

For example, the locale evidence might shift from:

* `R$`, `dd/mm/yyyy`, `km`, `CPF`

to another locale's:

* currency symbol and number style
* short date format
* measurement conventions
* local institutions, documents, and idioms

The ambiguity evaluation usually changes less.
Most of the prompt families remain useful across locales:

* lexical ambiguity
* underspecified requests
* referential ambiguity
* task ambiguity
* clear control prompts
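
One way to keep the structure fixed while swapping locale inputs is to hold the expected signals in a small profile object. The sketch below is an assumption about how that could look; `LocaleProfile`, `PT_BR`, `EN_US`, and `markersFound` are hypothetical names, the `en-US` values are only an illustration of a second profile, and the plain substring checks are deliberately simple.

```ts
interface LocaleProfile {
  locale: string;
  currencySymbol: string;
  unitMarkers: string[];
  socialMarkers: string[];
}

const PT_BR: LocaleProfile = {
  locale: 'pt-BR',
  currencySymbol: 'R$',
  unitMarkers: ['km', '°C'],
  socialMarkers: ['CPF', 'RG', 'comprovante de residência'],
};

// Illustrative second profile; adjust the markers to the locale under test.
const EN_US: LocaleProfile = {
  locale: 'en-US',
  currencySymbol: '$',
  unitMarkers: ['miles', '°F'],
  socialMarkers: ['Social Security number', "driver's license"],
};

// Lists which of the profile's markers appear in the raw output (substring match).
function markersFound(profile: LocaleProfile, outputText: string) {
  return {
    currency: outputText.includes(profile.currencySymbol),
    units: profile.unitMarkers.filter((m) => outputText.includes(m)),
    social: profile.socialMarkers.filter((m) => outputText.includes(m)),
  };
}
```

Substring matching can produce false positives. For example, `$` also matches inside `R$`, so results from different profiles should not be compared without review.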

## Final evidence package

For either evaluation, include:

* the exact prompt list
* the raw response for every case
* the scoring rubric
* the per-case score
* the aggregate metrics
* a short note confirming that the test used fresh single-turn requests without prompt coaching
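
The items above can be bundled into a single object so the package is easy to archive and share. This is a minimal sketch, assuming per-case scores and aggregate metrics were produced earlier; `ScoredResult`, `EvidencePackage`, and `buildEvidencePackage` are hypothetical names, and the wording of the method note is only an example.

```ts
type Score = 'pass' | 'partial' | 'fail';

interface ScoredResult {
  id: string;
  prompt: string;
  outputText: string;
  score: Score;
}

interface EvidencePackage {
  prompts: { id: string; prompt: string }[];
  responses: { id: string; outputText: string }[];
  rubric: string;
  scores: { id: string; score: Score }[];
  metrics: Record<string, number>;
  methodNote: string;
}

// Groups prompts, raw responses, rubric, scores, and metrics into one package.
function buildEvidencePackage(
  results: ScoredResult[],
  rubric: string,
  metrics: Record<string, number>
): EvidencePackage {
  return {
    prompts: results.map((r) => ({ id: r.id, prompt: r.prompt })),
    responses: results.map((r) => ({ id: r.id, outputText: r.outputText })),
    rubric,
    scores: results.map((r) => ({ id: r.id, score: r.score })),
    metrics,
    methodNote: 'All cases ran as fresh single-turn requests without prompt coaching.',
  };
}
```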

## Summary

This guide is designed to be generic and reproducible.
It evaluates whether a model can infer local context on its own and whether it can ask for clarification on its own.

The worked example uses Brazilian Portuguese.
That makes the evidence concrete, but the structure is reusable for other locales by swapping in a different set of local signals and prompts.
