Speech

Use the MKA1 API speech endpoints when you need file-based speech-to-text or text-to-speech. For real-time, bidirectional voice sessions, use Advanced voice mode.

Choose the right endpoint

Use case	Endpoint	Notes
Transcribe a recorded file	Speech-to-text transcription	Upload audio with `multipart/form-data`
Generate a WAV file from text	Text-to-speech	Best for complete file generation
Start playback as soon as audio arrives	Streaming text-to-speech	Best for low-latency playback

Transcribe audio

Send an audio file to the transcription endpoint when you want text output from a recorded file. If your app acts on behalf of an end user, also send X-On-Behalf-Of. Supported audio formats: FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, WebM, PCM.

mka1 llm speech transcribe \
  --file ./episode.wav \
  --language en \
  --prompt 'This is a technical podcast about machine learning.' \
  --temperature 0.2 \
  -H 'X-On-Behalf-Of: <end-user-id>'

The response includes the transcript text plus detected language and confidence:

{
  "text": "Hello! We're excited to show you our native speech capabilities.",
  "language": "en",
  "confidence": 0.8429018476208717
}
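A minimal sketch of how an app might act on this response, for example by routing low-confidence transcripts to manual review. It assumes `confidence` is a score between 0 and 1 where higher means more reliable, which this guide does not state explicitly. `needsReview` and the `MIN_CONFIDENCE` threshold are hypothetical names, not part of the MKA1 API.

```typescript
// Shape of the transcription response shown in this guide.
interface TranscriptionResult {
  text: string;
  language: string;
  confidence: number;
}

// Hypothetical threshold (assumption); tune it against your own audio.
const MIN_CONFIDENCE = 0.7;

// Returns true when the transcript is empty or its confidence falls
// below the threshold, so the caller can send it to manual review.
function needsReview(
  result: TranscriptionResult,
  minConfidence: number = MIN_CONFIDENCE,
): boolean {
  return result.text.trim().length === 0 || result.confidence < minConfidence;
}

// The example response from this guide.
const example: TranscriptionResult = {
  text: "Hello! We're excited to show you our native speech capabilities.",
  language: 'en',
  confidence: 0.8429018476208717,
};

console.log(needsReview(example)); // false with the 0.7 threshold
```

Keeping the threshold as a parameter lets different features apply different cutoffs, for example a stricter one for captions than for search indexing.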

Separate speakers in one transcript

If you need diarization, enable speaker data in the transcription request. When enabled, the response can include a speakers array with speaker-labeled segments and timing metadata.

Diarization in non-streaming transcription requires WAV or PCM audio. If include_speaker_data is set and the upload uses any other format, the request returns 400 BAD_REQUEST with the message "Speaker diarization currently requires WAV/PCM audio for non-streaming transcription."

import { openAsBlob } from 'node:fs';

// `mka1` is an already-initialized SDK client.
const result = await mka1.llm.speech.transcribe({
  language: 'en',
  includeSpeakerData: true,
  prompt: 'This is a short podcast clip about AI product updates.',
  temperature: 0.2,
  requestBody: {
    file: await openAsBlob('panel.wav'),
  },
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

console.log(result.speakers);
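The format rules in this guide can also be checked on the client before any bytes are uploaded. The sketch below is one way to do that, assuming the file extension matches the actual audio format, which the server may verify differently. `checkAudioFile` and `extensionOf` are hypothetical helpers, not part of the MKA1 SDK.

```typescript
// Formats listed as supported for transcription in this guide.
const SUPPORTED_FORMATS = new Set([
  'flac', 'mp3', 'mp4', 'mpeg', 'mpga', 'm4a', 'ogg', 'wav', 'webm', 'pcm',
]);

// Formats accepted when speaker data is requested (non-streaming).
const DIARIZATION_FORMATS = new Set(['wav', 'pcm']);

// Lowercased extension without the dot, or '' when there is none.
function extensionOf(fileName: string): string {
  const dot = fileName.lastIndexOf('.');
  return dot < 0 ? '' : fileName.slice(dot + 1).toLowerCase();
}

// Returns an error message, or null when the file name looks acceptable.
// Assumption: the extension reflects the real audio format.
function checkAudioFile(
  fileName: string,
  includeSpeakerData: boolean,
): string | null {
  const ext = extensionOf(fileName);
  if (!SUPPORTED_FORMATS.has(ext)) {
    return `Unsupported audio format: ${ext || '(none)'}`;
  }
  if (includeSpeakerData && !DIARIZATION_FORMATS.has(ext)) {
    return 'Speaker diarization currently requires WAV/PCM audio for non-streaming transcription.';
  }
  return null;
}

console.log(checkAudioFile('panel.wav', true)); // null
console.log(checkAudioFile('panel.mp3', true)); // diarization message
```

A check like this only saves a round trip. The API remains the source of truth, so the 400 BAD_REQUEST case still needs handling.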

Example response with speaker separation:

{
  "text": "Welcome back to the show. Today we're looking at how speech APIs fit into production apps. We'll keep it practical and focus on latency, accuracy, and speaker turns.",
  "language": "en",
  "confidence": 0.91177404,
  "speakers": [
    {
      "speaker": "Speaker-1",
      "text": "Welcome back to the show.",
      "confidence": 0.91177404,
      "offset_ms": 80,
      "duration_ms": 1280
    },
    {
      "speaker": "Speaker-2",
      "text": "Today we're looking at how speech APIs fit into production apps.",
      "confidence": 0.91177404,
      "offset_ms": 1540,
      "duration_ms": 3380
    },
    {
      "speaker": "Speaker-1",
      "text": "We'll keep it practical and focus on latency, accuracy, and speaker turns.",
      "confidence": 0.91177404,
      "offset_ms": 5220,
      "duration_ms": 3660
    }
  ]
}

Use the top-level text field when you need a single merged transcript. Use speakers when you need captions, turn-taking, or downstream speaker analytics.
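As an illustration of the captions use case, the sketch below converts a `speakers` array into WebVTT cues. It assumes `offset_ms` is the segment start in whole milliseconds from the beginning of the audio and that `duration_ms` is the segment length, as the field names suggest. `toWebVtt` and `toTimestamp` are hypothetical helpers, and segment text is written as-is without escaping `&` or `<`.

```typescript
// Segment shape taken from the example response in this guide.
interface SpeakerSegment {
  speaker: string;
  text: string;
  confidence: number;
  offset_ms: number;
  duration_ms: number;
}

// Format whole milliseconds as a WebVTT timestamp (HH:MM:SS.mmm).
function toTimestamp(ms: number): string {
  const pad = (n: number, width: number) => String(n).padStart(width, '0');
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(ms % 1000, 3)}`;
}

// Build a WebVTT document with one cue per speaker segment.
// Each cue uses a voice tag (<v Name>) to carry the speaker label.
function toWebVtt(segments: SpeakerSegment[]): string {
  const cues = segments.map((s) => {
    const start = toTimestamp(s.offset_ms);
    const end = toTimestamp(s.offset_ms + s.duration_ms);
    return `${start} --> ${end}\n<v ${s.speaker}>${s.text}`;
  });
  return ['WEBVTT', ...cues].join('\n\n') + '\n';
}

const vtt = toWebVtt([
  {
    speaker: 'Speaker-1',
    text: 'Welcome back to the show.',
    confidence: 0.91177404,
    offset_ms: 80,
    duration_ms: 1280,
  },
]);
console.log(vtt);
```

The same mapping works for other caption formats. Only the timestamp format and the cue layout change.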

Generate speech

Use the standard text-to-speech endpoint when you want a complete WAV file. The response body is binary audio, and the response headers include X-Language-Code.

mka1 llm speech speak \
  --text 'Welcome to the MKA1 API speech guide.' \
  --language en \
  --output-file speech.wav
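If you call the endpoint from code instead of the CLI, two things are worth checking on the response: the X-Language-Code header and whether the body really is a WAV file. The sketch below covers only that inspection step and deliberately leaves out the HTTP call, since this guide does not show the endpoint URL or the SDK method. `inspectSpeechResponse`, `looksLikeWav`, and `HeaderReader` are hypothetical names. The RIFF/WAVE signature check relies on the standard WAV container layout.

```typescript
// Minimal structural type, so the sketch works with fetch's Headers
// object or anything else that exposes a compatible get() method.
interface HeaderReader {
  get(name: string): string | null;
}

// Read `length` bytes starting at `start` as an ASCII string.
function asciiAt(bytes: Uint8Array, start: number, length: number): string {
  let out = '';
  for (let i = start; i < start + length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// True when the bytes start with the RIFF/WAVE container signature:
// "RIFF" at offset 0 and "WAVE" at offset 8.
function looksLikeWav(bytes: Uint8Array): boolean {
  if (bytes.length < 12) return false;
  return asciiAt(bytes, 0, 4) === 'RIFF' && asciiAt(bytes, 8, 4) === 'WAVE';
}

// Summarize a text-to-speech response: the language code from the
// X-Language-Code header and whether the body looks like WAV audio.
function inspectSpeechResponse(headers: HeaderReader, body: Uint8Array) {
  return {
    languageCode: headers.get('X-Language-Code'),
    isWav: looksLikeWav(body),
  };
}
```

With the Fetch API, this could be called as `inspectSpeechResponse(res.headers, new Uint8Array(await res.arrayBuffer()))` once the request itself is in place.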

Stream speech for lower latency

Use streaming text-to-speech when you want playback to start before the full audio file is ready. Choose mp3 for smaller payloads or pcm for uncompressed audio.

mka1 llm speech speak-streaming \
  --text 'Start speaking this response as soon as audio is ready.' \
  --language en \
  --format-param mp3 \
  --output-file speech.mp3
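To show what "start before the full file is ready" can look like in application code, here is a minimal sketch of a prebuffer gate: it collects chunks as they arrive and signals once enough bytes are buffered to begin playback. The chunk source is left out on purpose, because this guide does not show the SDK's streaming method. `PlaybackBuffer` and `startThresholdBytes` are hypothetical, and a real player would also need to decode the chosen format.

```typescript
// Hypothetical helper: collects streamed audio chunks and fires a
// callback once enough bytes have arrived to begin playback.
class PlaybackBuffer {
  private chunks: Uint8Array[] = [];
  private bufferedBytes = 0;
  private started = false;
  private readonly startThresholdBytes: number;
  private readonly onStart: () => void;

  constructor(startThresholdBytes: number, onStart: () => void) {
    this.startThresholdBytes = startThresholdBytes;
    this.onStart = onStart;
  }

  // Call once per chunk as it arrives from the streaming response.
  push(chunk: Uint8Array): void {
    this.chunks.push(chunk);
    this.bufferedBytes += chunk.length;
    // Fire onStart exactly once, the first time the threshold is met.
    if (!this.started && this.bufferedBytes >= this.startThresholdBytes) {
      this.started = true;
      this.onStart();
    }
  }

  // Concatenate everything received so far, e.g. to save the file
  // after the stream ends.
  toBytes(): Uint8Array {
    const out = new Uint8Array(this.bufferedBytes);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.length;
    }
    return out;
  }
}

const buffer = new PlaybackBuffer(4, () => console.log('start playback'));
buffer.push(new Uint8Array([1, 2]));    // below threshold, nothing happens
buffer.push(new Uint8Array([3, 4, 5])); // threshold reached, onStart fires
```

A smaller threshold starts playback sooner but leaves less margin if the network stalls, so the right value depends on the format and bitrate.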

Next steps

Review the Speech-to-text transcription reference for request and response details
Review the Text-to-speech reference for WAV generation
Review the Streaming text-to-speech reference for low-latency output
Use Advanced voice mode for real-time conversations
