# Speech

> Transcribe audio and generate speech with the MKA1 API. Use speaker-labeled segments when you need multi-speaker separation.

Use the MKA1 API speech endpoints when you need file-based speech-to-text or text-to-speech.
For real-time, bidirectional voice sessions, use [Advanced voice mode](/docs/advanced-voice-mode).

## Choose the right endpoint

| Use case                                | Endpoint                                                                           | Notes                                   |
| --------------------------------------- | ---------------------------------------------------------------------------------- | --------------------------------------- |
| Transcribe a recorded file              | [Speech-to-text transcription](/api-reference/speech/speech-to-text-transcription) | Upload audio with `multipart/form-data` |
| Generate a WAV file from text           | [Text-to-speech](/api-reference/speech/text-to-speech)                             | Best for complete file generation       |
| Start playback as soon as audio arrives | [Streaming text-to-speech](/api-reference/speech/streaming-text-to-speech)         | Best for low-latency playback           |

## Transcribe audio

Send an audio file to the transcription endpoint when you want text output from a recorded file.
If your app acts on behalf of an end user, also send the `X-On-Behalf-Of` header with that user's ID.

Supported audio formats: `FLAC`, `MP3`, `MP4`, `MPEG`, `MPGA`, `M4A`, `OGG`, `WAV`, `WebM`, `PCM`.

<CodeGroup>
  ```bash CLI theme={null}
  mka1 llm speech transcribe \
    --file ./episode.wav \
    --language en \
    --prompt 'This is a technical podcast about machine learning.' \
    --temperature 0.2 \
    -H 'X-On-Behalf-Of: <end-user-id>'
  ```

  ```ts MKA1 SDK theme={null}
  import { SDK } from '@meetkai/mka1';
  import { openAsBlob } from 'node:fs';

  const mka1 = new SDK({
    bearerAuth: 'Bearer <mka1-api-key>',
  });

  const result = await mka1.llm.speech.transcribe({
    language: 'en',
    prompt: 'This is a technical podcast about machine learning.',
    temperature: 0.2,
    requestBody: {
      file: await openAsBlob('episode.wav'),
    },
  }, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

  console.log(result.text);
  console.log(result.language);
  console.log(result.confidence);
  ```

  ```csharp C# SDK theme={null}
  using MeetKai.MKA1;
  using MeetKai.MKA1.Types.Components;
  using MeetKai.MKA1.Types.Requests;

  var sdk = new SDK(bearerAuth: "Bearer YOUR_API_KEY");

  var result = await sdk.Llm.Speech.TranscribeAsync(new TranscribeRequest()
  {
      Language = "en",
      Prompt = "This is a technical podcast about machine learning.",
      Temperature = 0.2,
      Body = new TranscribeRequestBody()
      {
          File = new TranscribeFile()
          {
              FileName = "episode.wav",
              Content = File.ReadAllBytes("episode.wav"),
          },
      },
  });

  Console.WriteLine(result.TranscriptionResponse!.Text);
  Console.WriteLine(result.TranscriptionResponse!.Language);
  Console.WriteLine(result.TranscriptionResponse!.Confidence);
  ```

  ```python Python SDK theme={null}
  from mka1 import SDK

  sdk = SDK(bearer_auth="Bearer YOUR_API_KEY")

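  # Open the file in a context manager so the handle is closed after the upload.
  with open("episode.wav", "rb") as audio_file:
      result = sdk.llm.speech.transcribe(
          file={"file_name": "episode.wav", "content": audio_file},
          language="en",
          prompt="This is a technical podcast about machine learning.",
          temperature=0.2,
      )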

  print(result.text)
  print(result.language)
  print(result.confidence)
  ```

  ```bash cURL theme={null}
  curl 'https://apigw.mka1.com/api/v1/llm/speech/transcriptions?language=en&prompt=This%20is%20a%20technical%20podcast%20about%20machine%20learning.&temperature=0.2' \
    --request POST \
    --header 'Authorization: Bearer <mka1-api-key>' \
    --header 'X-On-Behalf-Of: <end-user-id>' \
    --form 'file=@episode.wav'
  ```
</CodeGroup>

The response includes the transcript text, the detected language, and a confidence value:

```json theme={null}
{
  "text": "Hello! We're excited to show you our native speech capabilities.",
  "language": "en",
  "confidence": 0.8429018476208717
}
```
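If you call the endpoint over raw HTTP instead of through an SDK, it can help to validate the parsed JSON before using it. The sketch below is a minimal example that assumes only the three fields shown in the response above. The `TranscriptionResult` interface, the `isTranscriptionResult` guard, and the `needsReview` helper with its `0.8` threshold are hypothetical names and values chosen for illustration, not part of the MKA1 SDK.

```ts
// Shape taken from the example response above; the real API may return more fields.
interface TranscriptionResult {
  text: string;
  language: string;
  confidence: number;
}

// Type guard: true only when `value` has the three documented fields
// with the expected primitive types.
function isTranscriptionResult(value: unknown): value is TranscriptionResult {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record['text'] === 'string' &&
    typeof record['language'] === 'string' &&
    typeof record['confidence'] === 'number'
  );
}

// Hypothetical policy: flag transcripts whose confidence is below `threshold`.
// The 0.8 default is an illustrative value, not an MKA1 recommendation.
function needsReview(result: TranscriptionResult, threshold = 0.8): boolean {
  return result.confidence < threshold;
}

const parsed: unknown = JSON.parse(
  '{"text":"Hello!","language":"en","confidence":0.8429018476208717}',
);

if (isTranscriptionResult(parsed)) {
  console.log(parsed.language, needsReview(parsed));
}
```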

## Separate speakers in one transcript

If you need diarization, set `include_speaker_data` on the transcription request (`includeSpeakerData` in the TypeScript SDK).
When it is enabled, the response can include a `speakers` array of speaker-labeled segments with timing metadata.

<Warning>
  For `include_speaker_data`, upload WAV or PCM audio for non-streaming transcription. Other audio formats return `400 BAD_REQUEST` with the message `Speaker diarization currently requires WAV/PCM audio for non-streaming transcription`.
</Warning>

```ts theme={null}
const result = await mka1.llm.speech.transcribe({
  language: 'en',
  includeSpeakerData: true,
  prompt: 'This is a short podcast clip about AI product updates.',
  temperature: 0.2,
  requestBody: {
    file: await openAsBlob('panel.wav'),
  },
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

console.log(result.speakers);
```
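Because non-streaming requests with speaker data are rejected with `400 BAD_REQUEST` unless the audio is WAV or PCM, a client-side check can fail fast before uploading. The sketch below is one way to do that. It assumes the file extension matches the actual audio encoding, and `canRequestSpeakerData` is a hypothetical helper, not an SDK function.

```ts
// Assumption: the file extension reflects the actual audio encoding.
// The allowed set comes from the warning above (WAV or PCM only).
const DIARIZATION_EXTENSIONS = new Set(['wav', 'pcm']);

// Returns the lowercase extension without the dot, or '' if there is none.
function fileExtension(fileName: string): string {
  const dot = fileName.lastIndexOf('.');
  return dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
}

// True when the file looks safe to send with speaker data enabled.
function canRequestSpeakerData(fileName: string): boolean {
  return DIARIZATION_EXTENSIONS.has(fileExtension(fileName));
}

console.log(canRequestSpeakerData('panel.wav'));
console.log(canRequestSpeakerData('episode.mp3'));
```

If the check fails, you can either convert the audio to WAV first or send the request without speaker data.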

Example response with speaker separation:

```json theme={null}
{
  "text": "Welcome back to the show. Today we're looking at how speech APIs fit into production apps. We'll keep it practical and focus on latency, accuracy, and speaker turns.",
  "language": "en",
  "confidence": 0.91177404,
  "speakers": [
    {
      "speaker": "Speaker-1",
      "text": "Welcome back to the show.",
      "confidence": 0.91177404,
      "offset_ms": 80,
      "duration_ms": 1280
    },
    {
      "speaker": "Speaker-2",
      "text": "Today we're looking at how speech APIs fit into production apps.",
      "confidence": 0.91177404,
      "offset_ms": 1540,
      "duration_ms": 3380
    },
    {
      "speaker": "Speaker-1",
      "text": "We'll keep it practical and focus on latency, accuracy, and speaker turns.",
      "confidence": 0.91177404,
      "offset_ms": 5220,
      "duration_ms": 3660
    }
  ]
}
```

Use the top-level `text` field when you need a single merged transcript.
Use `speakers` when you need captions, turn-taking, or downstream speaker analytics.
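As an illustration of both uses, the sketch below turns `speakers` segments into WebVTT captions and totals talk time per speaker. It assumes the segment fields shown in the example response (`speaker`, `text`, `offset_ms`, `duration_ms`). The `SpeakerSegment` interface and the `toWebVtt` and `talkTimeMs` helpers are hypothetical names, not part of the MKA1 SDK.

```ts
// Field names taken from the example response above.
interface SpeakerSegment {
  speaker: string;
  text: string;
  confidence: number;
  offset_ms: number;
  duration_ms: number;
}

// Formats milliseconds as a WebVTT timestamp: hh:mm:ss.ttt
function formatVttTimestamp(totalMs: number): string {
  const ms = Math.max(0, Math.round(totalMs));
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  const pad = (n: number, width: number): string => String(n).padStart(width, '0');
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(millis, 3)}`;
}

// Escapes the characters that are special inside WebVTT cue text.
function escapeCueText(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Builds one cue per segment, using a voice span to label the speaker.
function toWebVtt(segments: SpeakerSegment[]): string {
  const cues = segments.map((segment) => {
    const start = formatVttTimestamp(segment.offset_ms);
    const end = formatVttTimestamp(segment.offset_ms + segment.duration_ms);
    return `${start} --> ${end}\n<v ${segment.speaker}>${escapeCueText(segment.text)}`;
  });
  return ['WEBVTT', ...cues].join('\n\n') + '\n';
}

// Sums segment durations per speaker label.
function talkTimeMs(segments: SpeakerSegment[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const segment of segments) {
    totals.set(segment.speaker, (totals.get(segment.speaker) ?? 0) + segment.duration_ms);
  }
  return totals;
}

const sampleSegments: SpeakerSegment[] = [
  { speaker: 'Speaker-1', text: 'Welcome back to the show.', confidence: 0.91177404, offset_ms: 80, duration_ms: 1280 },
  { speaker: 'Speaker-2', text: "Today we're looking at how speech APIs fit into production apps.", confidence: 0.91177404, offset_ms: 1540, duration_ms: 3380 },
];

console.log(toWebVtt(sampleSegments));
console.log(talkTimeMs(sampleSegments));
```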

## Generate speech

Use the standard text-to-speech endpoint when you want a complete WAV file.
The response body is binary audio, and the response headers include `X-Language-Code`.

<CodeGroup>
  ```bash CLI theme={null}
  mka1 llm speech speak \
    --text 'Welcome to the MKA1 API speech guide.' \
    --language en \
    --output-file speech.wav
  ```

  ```ts MKA1 SDK theme={null}
  import { writeFileSync } from 'node:fs';

  const result = await mka1.llm.speech.speak({
    text: 'Welcome to the MKA1 API speech guide.',
    language: 'en',
  }, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

  const audioBody = result.body as Blob | Uint8Array;
  const audioBuffer = audioBody instanceof Uint8Array
    ? Buffer.from(audioBody)
    : Buffer.from(await audioBody.arrayBuffer());
  const languageCode =
    result.headers['X-Language-Code'] ?? result.headers['x-language-code'];

  writeFileSync('speech.wav', audioBuffer);
  console.log(languageCode);
  ```

  ```csharp C# SDK theme={null}
  using MeetKai.MKA1;
  using MeetKai.MKA1.Types.Components;

  var sdk = new SDK(bearerAuth: "Bearer YOUR_API_KEY");

  var result = await sdk.Llm.Speech.SpeakAsync(new TextToSpeechRequest()
  {
      Text = "Welcome to the MKA1 API speech guide.",
      Language = TextToSpeechRequestLanguage.En,
  });

  File.WriteAllBytes("speech.wav", result.Bytes!);
  ```

  ```python Python SDK theme={null}
  result = sdk.llm.speech.speak(
      text="Welcome to the MKA1 API speech guide.",
      language="en",
  )

  with open("speech.wav", "wb") as f:
      f.write(result.body)
  ```

  ```bash cURL theme={null}
  curl https://apigw.mka1.com/api/v1/llm/speech/tts \
    --request POST \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer <mka1-api-key>' \
    --header 'X-On-Behalf-Of: <end-user-id>' \
    --data '{
      "text": "Welcome to the MKA1 API speech guide.",
      "language": "en"
    }' \
    --output speech.wav
  ```
</CodeGroup>
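HTTP header names are case-insensitive, which is why the TypeScript example above checks both `X-Language-Code` and `x-language-code`. A small lookup helper can cover every casing in one place. The sketch below assumes the headers arrive as a plain object keyed by header name. `getHeader` is a hypothetical helper, not an SDK function.

```ts
// Assumption: headers are a plain object; values may be a string or a list of strings.
type HeaderMap = Record<string, string | string[] | undefined>;

// Finds a header regardless of how its name is capitalized.
// For multi-valued headers, returns the first value.
function getHeader(headers: HeaderMap, name: string): string | undefined {
  const wanted = name.toLowerCase();
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() === wanted) {
      return Array.isArray(value) ? value[0] : value;
    }
  }
  return undefined;
}

console.log(getHeader({ 'x-language-code': 'en' }, 'X-Language-Code'));
```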

## Stream speech for lower latency

Use streaming text-to-speech when you want playback to start before the full audio file is ready.
Choose `mp3` for smaller payloads or `pcm` for uncompressed audio.

<CodeGroup>
  ```bash CLI theme={null}
  mka1 llm speech speak-streaming \
    --text 'Start speaking this response as soon as audio is ready.' \
    --language en \
    --format-param mp3 \
    --output-file speech.mp3
  ```

  ```ts MKA1 SDK theme={null}
  const result = await mka1.llm.speech.speakStreaming({
    text: 'Start speaking this response as soon as audio is ready.',
    language: 'en',
    format: 'mp3',
  }, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

  const contentType =
    result.headers['Content-Type'] ?? result.headers['content-type'];
  const languageCode =
    result.headers['X-Language-Code'] ?? result.headers['x-language-code'];

  console.log(contentType);
  console.log(languageCode);
  ```

  ```csharp C# SDK theme={null}
  using MeetKai.MKA1;
  using MeetKai.MKA1.Types.Components;

  var sdk = new SDK(bearerAuth: "Bearer YOUR_API_KEY");

  var result = await sdk.Llm.Speech.SpeakStreamingAsync(new TextToSpeechStreamingRequest()
  {
      Text = "Start speaking this response as soon as audio is ready.",
      Language = TextToSpeechStreamingRequestLanguage.En,
      Format = TextToSpeechStreamingRequestFormat.Mp3,
  });

  // Response contains either MP3 or WAV bytes depending on format
  var audioBytes = result.TwoHundredAudioMpegBytes ?? result.TwoHundredAudioWavBytes;
  File.WriteAllBytes("speech.mp3", audioBytes!);
  ```

  ```python Python SDK theme={null}
  result = sdk.llm.speech.speak_streaming(
      text="Start speaking this response as soon as audio is ready.",
      language="en",
      format="mp3",
  )

  with open("speech.mp3", "wb") as f:
      f.write(result.body)
  ```

  ```bash cURL theme={null}
  curl https://apigw.mka1.com/api/v1/llm/speech/tts/stream \
    --request POST \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer <mka1-api-key>' \
    --header 'X-On-Behalf-Of: <end-user-id>' \
    --data '{
      "text": "Start speaking this response as soon as audio is ready.",
      "language": "en",
      "format": "mp3"
    }' \
    --output speech.mp3
  ```
</CodeGroup>
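The examples above write or inspect the response after the call returns. To act on audio while it is still arriving, you can read the HTTP response body chunk by chunk. The sketch below is one possible approach using the standard `fetch` API (Node.js 18+ or a browser) rather than the MKA1 SDK. It assumes the URL, headers, and JSON body shown in the cURL example. `buildStreamRequest` and `streamSpeech` are hypothetical helpers.

```ts
// Assumption: URL and body fields mirror the cURL example above.
const TTS_STREAM_URL = 'https://apigw.mka1.com/api/v1/llm/speech/tts/stream';

interface StreamOptions {
  apiKey: string;
  text: string;
  language: string;
  format: 'mp3' | 'pcm';
  onBehalfOf?: string; // optional end-user ID for X-On-Behalf-Of
}

interface StreamRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the request without sending it, so it can be inspected or logged.
function buildStreamRequest(options: StreamOptions): StreamRequest {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${options.apiKey}`,
  };
  if (options.onBehalfOf !== undefined) {
    headers['X-On-Behalf-Of'] = options.onBehalfOf;
  }
  const body = JSON.stringify({
    text: options.text,
    language: options.language,
    format: options.format,
  });
  return { url: TTS_STREAM_URL, init: { method: 'POST', headers, body } };
}

// Sends the request and passes each audio chunk to `onChunk` as it arrives.
// Resolves with the total number of bytes received.
async function streamSpeech(
  request: StreamRequest,
  onChunk: (chunk: Uint8Array) => void,
): Promise<number> {
  const response = await fetch(request.url, request.init);
  if (!response.ok || response.body === null) {
    throw new Error(`TTS stream failed with HTTP ${response.status}`);
  }
  const reader = response.body.getReader();
  let totalBytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    totalBytes += value.byteLength;
    onChunk(value);
  }
  return totalBytes;
}

// Usage (not executed here because it needs a real API key):
// const request = buildStreamRequest({ apiKey: '<mka1-api-key>', text: 'Hi', language: 'en', format: 'mp3' });
// await streamSpeech(request, (chunk) => player.enqueue(chunk)); // `player` is hypothetical
```

Keeping request construction separate from the network call makes the request easy to unit test without hitting the API.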

## Next steps

* Review the [Speech-to-text transcription](/api-reference/speech/speech-to-text-transcription) reference for request and response details
* Review the [Text-to-speech](/api-reference/speech/text-to-speech) reference for WAV generation
* Review the [Streaming text-to-speech](/api-reference/speech/streaming-text-to-speech) reference for low-latency output
* Use [Advanced voice mode](/docs/advanced-voice-mode) for real-time conversations