Documentation Index

Fetch the complete documentation index at: https://docs.mka1.com/llms.txt

Use this file to discover all available pages before exploring further.

Use the MKA1 API speech endpoints when you need file-based speech-to-text or text-to-speech. For real-time, bidirectional voice sessions, use Advanced voice mode.

Choose the right endpoint

| Use case | Endpoint | Notes |
| --- | --- | --- |
| Transcribe a recorded file | Speech-to-text transcription | Upload audio with multipart/form-data |
| Generate a WAV file from text | Text-to-speech | Best for complete file generation |
| Start playback as soon as audio arrives | Streaming text-to-speech | Best for low-latency playback |

Transcribe audio

Send an audio file to the transcription endpoint when you want text output from a recorded file. If your app acts on behalf of an end user, also send X-On-Behalf-Of. Supported audio formats: FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, WebM, PCM.
mka1 llm speech transcribe \
  --file ./episode.wav \
  --language en \
  --prompt 'This is a technical podcast about machine learning.' \
  --temperature 0.2 \
  -H 'X-On-Behalf-Of: <end-user-id>'
The response includes the transcript text plus detected language and confidence:
{
  "text": "Hello! We're excited to show you our native speech capabilities.",
  "language": "en",
  "confidence": 0.8429018476208717
}
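The response fields map onto a small type. The sketch below models that shape and gates downstream logic on the reported confidence; the `TranscriptionResult` name and the 0.7 threshold are illustrative choices, not part of the API.

```typescript
// Shape of the transcription response shown above.
interface TranscriptionResult {
  text: string;
  language: string;
  confidence: number;
}

// Gate downstream logic on the reported confidence; 0.7 is an
// arbitrary example threshold, not an API recommendation.
function isReliable(result: TranscriptionResult, minConfidence = 0.7): boolean {
  return result.confidence >= minConfidence;
}
```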

Separate speakers in one transcript

If you need diarization, enable speaker data in the transcription request. When enabled, the response can include a speakers array with speaker-labeled segments and timing metadata.
For include_speaker_data, upload WAV or PCM audio for non-streaming transcription. Other audio formats return 400 BAD_REQUEST with the message Speaker diarization currently requires WAV/PCM audio for non-streaming transcription.
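Because other formats are rejected with a 400, it can be worth checking the upload client-side before enabling diarization. A minimal sketch, assuming the format can be inferred from the file extension:

```typescript
// include_speaker_data only accepts WAV or PCM for non-streaming
// transcription; other formats return 400 BAD_REQUEST.
const DIARIZATION_FORMATS = new Set(['wav', 'pcm']);

// Client-side guard keyed off the file extension (an assumption:
// the extension may not match the actual container format).
function supportsSpeakerData(filename: string): boolean {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  return DIARIZATION_FORMATS.has(ext);
}
```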
import { openAsBlob } from 'node:fs'; // Node.js 19.8+

const result = await mka1.llm.speech.transcribe({
  language: 'en',
  includeSpeakerData: true,
  prompt: 'This is a short podcast clip about AI product updates.',
  temperature: 0.2,
  requestBody: {
    file: await openAsBlob('panel.wav'),
  },
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

console.log(result.speakers);
Example response with speaker separation:
{
  "text": "Welcome back to the show. Today we're looking at how speech APIs fit into production apps. We'll keep it practical and focus on latency, accuracy, and speaker turns.",
  "language": "en",
  "confidence": 0.91177404,
  "speakers": [
    {
      "speaker": "Speaker-1",
      "text": "Welcome back to the show.",
      "confidence": 0.91177404,
      "offset_ms": 80,
      "duration_ms": 1280
    },
    {
      "speaker": "Speaker-2",
      "text": "Today we're looking at how speech APIs fit into production apps.",
      "confidence": 0.91177404,
      "offset_ms": 1540,
      "duration_ms": 3380
    },
    {
      "speaker": "Speaker-1",
      "text": "We'll keep it practical and focus on latency, accuracy, and speaker turns.",
      "confidence": 0.91177404,
      "offset_ms": 5220,
      "duration_ms": 3660
    }
  ]
}
Use the top-level text field when you need a single merged transcript. Use speakers when you need captions, turn-taking, or downstream speaker analytics.
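The offset_ms and duration_ms fields make it straightforward to turn the speakers array into caption cues. A sketch under the assumption that each segment matches the shape of the example response above (the `SpeakerSegment` name is ours, not the API's):

```typescript
// Shape of one entry in the speakers array from the example response.
interface SpeakerSegment {
  speaker: string;
  text: string;
  confidence: number;
  offset_ms: number;
  duration_ms: number;
}

// Render milliseconds as mm:ss.mmm for caption timestamps.
function toTimestamp(ms: number): string {
  const totalSec = Math.floor(ms / 1000);
  const min = String(Math.floor(totalSec / 60)).padStart(2, '0');
  const sec = String(totalSec % 60).padStart(2, '0');
  const frac = String(ms % 1000).padStart(3, '0');
  return `${min}:${sec}.${frac}`;
}

// Each segment becomes one caption line: start/end times fall out of
// offset_ms and offset_ms + duration_ms directly.
function toCaptions(speakers: SpeakerSegment[]): string[] {
  return speakers.map((s) =>
    `[${toTimestamp(s.offset_ms)} --> ${toTimestamp(s.offset_ms + s.duration_ms)}] ${s.speaker}: ${s.text}`
  );
}
```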

Generate speech

Use the standard text-to-speech endpoint when you want a complete WAV file. The response body is binary audio, and the response headers include X-Language-Code.
mka1 llm speech speak \
  --text 'Welcome to the MKA1 API speech guide.' \
  --language en \
  --output-file speech.wav

Stream speech for lower latency

Use streaming text-to-speech when you want playback to start before the full audio file is ready. Choose mp3 for smaller payloads or pcm for uncompressed audio.
mka1 llm speech speak-streaming \
  --text 'Start speaking this response as soon as audio is ready.' \
  --language en \
  --format-param mp3 \
  --output-file speech.mp3
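The point of the streaming endpoint is that audio can be consumed chunk by chunk rather than buffered whole. The sketch below shows that consumption pattern against a stand-in async generator; with the real API, the chunk source would be the streaming response body (no SDK method name is assumed here):

```typescript
// Stand-in for a streaming response body; the bytes below are
// placeholders, not real MP3 data.
async function* fakeChunks(): AsyncGenerator<Uint8Array> {
  yield new Uint8Array([0x49, 0x44, 0x33]);
  yield new Uint8Array([0x00, 0x01]);
}

// Hand each chunk to a player or file stream as soon as it arrives,
// returning the total byte count once the stream ends.
async function consumeAudio(
  chunks: AsyncIterable<Uint8Array>,
  onChunk: (chunk: Uint8Array) => void,
): Promise<number> {
  let total = 0;
  for await (const chunk of chunks) {
    onChunk(chunk); // e.g. append to a playback buffer or write stream
    total += chunk.length;
  }
  return total;
}
```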

Next steps