Transcribe audio and generate speech with the MKA1 API. Use speaker-labeled segments when you need multi-speaker separation.
Use the MKA1 API speech endpoints when you need file-based speech-to-text or text-to-speech.
For real-time, bidirectional voice sessions, use Advanced voice mode.
Send an audio file to the transcription endpoint when you want text output from a recorded file.
If your app acts on behalf of an end user, also send the X-On-Behalf-Of header.
Supported audio formats: FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, WebM, PCM.
    mka1 llm speech transcribe \
      --file ./episode.wav \
      --language en \
      --prompt 'This is a technical podcast about machine learning.' \
      --temperature 0.2 \
      -H 'X-On-Behalf-Of: <end-user-id>'
The response includes the transcript text plus detected language and confidence:
    {
      "text": "Hello! We're excited to show you our native speech capabilities.",
      "language": "en",
      "confidence": 0.8429018476208717
    }
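The supported-format list above can be turned into a quick client-side check before uploading. The sketch below is illustrative only: `extensionOf` and `isSupportedAudioFile` are hypothetical helper names, not part of the MKA1 SDK, and a file extension is only a hint about the real encoding, so the endpoint's own validation remains authoritative.

```typescript
// Formats this guide lists as accepted by the transcription endpoint.
const SUPPORTED_FORMATS: readonly string[] = [
  "flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm", "pcm",
];

// Hypothetical helper (not an SDK function): lowercase extension, or "" if none.
function extensionOf(path: string): string {
  const name = path.split(/[\\/]/).pop() ?? "";
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot + 1).toLowerCase() : "";
}

// True when the file extension matches one of the listed formats.
function isSupportedAudioFile(path: string): boolean {
  return SUPPORTED_FORMATS.includes(extensionOf(path));
}
```

Calling `isSupportedAudioFile('./episode.wav')` before the upload lets an app reject obviously unsupported files without a network round trip.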
If you need diarization, enable speaker data in the transcription request.
When enabled, the response can include a speakers array with speaker-labeled segments and timing metadata.
When include_speaker_data is enabled, upload WAV or PCM audio for non-streaming transcription. Other audio formats return 400 BAD_REQUEST with the message "Speaker diarization currently requires WAV/PCM audio for non-streaming transcription".
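    import { openAsBlob } from 'node:fs';

    // Assumes `mka1` is an already-initialized MKA1 SDK client.
    const result = await mka1.llm.speech.transcribe({
      language: 'en',
      includeSpeakerData: true,
      prompt: 'This is a short podcast clip about AI product updates.',
      temperature: 0.2,
      requestBody: {
        file: await openAsBlob('panel.wav'),
      },
    }, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

    // Speaker-labeled segments, present when diarization is enabled.
    console.log(result.speakers);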
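If a diarization request is sent with a format other than WAV or PCM, the documented 400 response can be recognized and handled separately from other failures. The following is a minimal sketch under assumptions: `ApiErrorLike` is a hypothetical error shape (the real SDK error type may expose the status and message differently), and the match uses a substring because the exact punctuation of the message is not specified here.

```typescript
// Message text as quoted in this guide for non-WAV/PCM diarization requests.
const DIARIZATION_FORMAT_MESSAGE =
  "Speaker diarization currently requires WAV/PCM audio for non-streaming transcription";

// Hypothetical error shape; adapt the field names to the actual SDK error.
interface ApiErrorLike {
  status: number;
  message: string;
}

// True when the error matches the documented diarization format rejection.
function isDiarizationFormatError(err: ApiErrorLike): boolean {
  return err.status === 400 && err.message.includes(DIARIZATION_FORMAT_MESSAGE);
}
```

A caller might respond to this case by re-encoding the audio to WAV, or by retrying without speaker data when a merged transcript is enough.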
Example response with speaker separation:
    {
      "text": "Welcome back to the show. Today we're looking at how speech APIs fit into production apps. We'll keep it practical and focus on latency, accuracy, and speaker turns.",
      "language": "en",
      "confidence": 0.91177404,
      "speakers": [
        {
          "speaker": "Speaker-1",
          "text": "Welcome back to the show.",
          "confidence": 0.91177404,
          "offset_ms": 80,
          "duration_ms": 1280
        },
        {
          "speaker": "Speaker-2",
          "text": "Today we're looking at how speech APIs fit into production apps.",
          "confidence": 0.91177404,
          "offset_ms": 1540,
          "duration_ms": 3380
        },
        {
          "speaker": "Speaker-1",
          "text": "We'll keep it practical and focus on latency, accuracy, and speaker turns.",
          "confidence": 0.91177404,
          "offset_ms": 5220,
          "duration_ms": 3660
        }
      ]
    }
Use the top-level text field when you need a single merged transcript.
Use speakers when you need captions, turn-taking, or downstream speaker analytics.
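As one illustration of the captions use case, the speakers array can be converted to WebVTT cues. This sketch assumes that offset_ms is measured from the start of the audio and that each segment ends at offset_ms + duration_ms, which matches the sample response but is not stated explicitly. `toWebVtt` and `formatTimestamp` are hypothetical helper names.

```typescript
// Shape of one entry in the `speakers` array, taken from the sample response.
interface SpeakerSegment {
  speaker: string;
  text: string;
  confidence: number;
  offset_ms: number;
  duration_ms: number;
}

// Formats milliseconds as a WebVTT timestamp (hh:mm:ss.ttt).
function formatTimestamp(ms: number): string {
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(ms % 1000, 3)}`;
}

// Escapes characters that have special meaning inside WebVTT cue text.
function escapeCueText(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Builds a WebVTT document with one cue per segment and a voice tag per speaker.
function toWebVtt(segments: SpeakerSegment[]): string {
  const cues = segments.map((seg) => {
    const start = formatTimestamp(seg.offset_ms);
    const end = formatTimestamp(seg.offset_ms + seg.duration_ms);
    return `${start} --> ${end}\n<v ${escapeCueText(seg.speaker)}>${escapeCueText(seg.text)}`;
  });
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}
```

Passing `result.speakers` from a diarized transcription to `toWebVtt` yields text that can be saved as a `.vtt` file and attached to a video or audio player.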
Use the standard text-to-speech endpoint when you want a complete WAV file.
The response body is binary audio, and the response headers include X-Language-Code.
    mka1 llm speech speak \
      --text 'Welcome to the MKA1 API speech guide.' \
      --language en \
      --output-file speech.wav
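After saving the response body, it can be useful to confirm that the bytes really are a WAV file and to read basic properties such as duration. The sketch below parses the standard RIFF/WAVE chunk layout. It is a generic WAV reader, not an MKA1 SDK feature, and the sample rate or bit depth the endpoint actually returns is not specified in this guide. `readWavInfo` is a hypothetical helper name.

```typescript
interface WavInfo {
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
  durationSeconds: number;
}

// Reads the fmt and data chunks of a RIFF/WAVE byte buffer.
function readWavInfo(bytes: Uint8Array): WavInfo {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Reads a four-character chunk identifier at the given offset.
  const tag = (o: number): string =>
    String.fromCharCode(view.getUint8(o), view.getUint8(o + 1), view.getUint8(o + 2), view.getUint8(o + 3));

  if (bytes.byteLength < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("Not a RIFF/WAVE file");
  }

  let channels = 0;
  let sampleRate = 0;
  let byteRate = 0;
  let bitsPerSample = 0;
  let dataBytes = -1;
  let offset = 12; // first chunk starts after "RIFF", size, "WAVE"

  while (offset + 8 <= bytes.byteLength) {
    const id = tag(offset);
    const size = view.getUint32(offset + 4, true); // little-endian chunk size
    const body = offset + 8;
    if (id === "fmt " && body + 16 <= bytes.byteLength) {
      channels = view.getUint16(body + 2, true);
      sampleRate = view.getUint32(body + 4, true);
      byteRate = view.getUint32(body + 8, true);
      bitsPerSample = view.getUint16(body + 14, true);
    } else if (id === "data") {
      // Clamp in case the header declares more data than is present.
      dataBytes = Math.min(size, bytes.byteLength - body);
      break;
    }
    offset = body + size + (size % 2); // chunks are padded to even length
  }

  if (byteRate === 0 || dataBytes < 0) {
    throw new Error("Missing fmt or data chunk");
  }
  return { channels, sampleRate, bitsPerSample, durationSeconds: dataBytes / byteRate };
}
```

In Node.js, the input could come from `new Uint8Array(await readFile('speech.wav'))` using `readFile` from `node:fs/promises`.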
Use streaming text-to-speech when you want playback to start before the full audio file is ready.
Choose mp3 for smaller payloads or pcm for uncompressed audio.
    mka1 llm speech speak-streaming \
      --text 'Start speaking this response as soon as audio is ready.' \
      --language en \
      --format-param mp3 \
      --output-file speech.mp3
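To show why streaming allows playback to begin early, the sketch below consumes audio chunk by chunk and hands each chunk to a callback as soon as it arrives, while also assembling the full payload. It assumes the streamed audio is available as an `AsyncIterable<Uint8Array>`. How the MKA1 SDK or an HTTP client exposes the stream is not covered in this guide, so `collectAudio` and its parameters are hypothetical.

```typescript
// Consumes streamed audio chunks, notifying the caller as each one arrives.
// Returns the concatenated bytes once the stream ends.
async function collectAudio(
  chunks: AsyncIterable<Uint8Array>,
  onChunk?: (chunk: Uint8Array, totalBytes: number) => void,
): Promise<Uint8Array> {
  const parts: Uint8Array[] = [];
  let total = 0;

  for await (const chunk of chunks) {
    parts.push(chunk);
    total += chunk.byteLength;
    // A player could start decoding here, before the stream has finished.
    if (onChunk) onChunk(chunk, total);
  }

  // Join all chunks into one contiguous buffer for saving to disk.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.byteLength;
  }
  return out;
}
```

The `onChunk` callback is where an app would feed a decoder or audio buffer. The returned bytes correspond to what the CLI writes with `--output-file`.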