Use the MKA1 API speech endpoints when you need file-based speech-to-text or text-to-speech.
For real-time, bidirectional voice sessions, use Advanced voice mode.
## Choose the right endpoint
| Use case | Endpoint | Notes |
|---|---|---|
| Transcribe a recorded file | Speech-to-text transcription | Upload audio with `multipart/form-data` |
| Generate a WAV file from text | Text-to-speech | Best for complete file generation |
| Start playback as soon as audio arrives | Streaming text-to-speech | Best for low-latency playback |
## Transcribe audio
Send an audio file to the transcription endpoint when you want text output from a recorded file.
If your app acts on behalf of an end user, also send the `X-On-Behalf-Of` header.
Supported audio formats: FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, WebM, PCM.
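Before uploading, you can screen file names against this list locally. A minimal sketch — the extension check is only a heuristic, and the server still validates the actual container:

```typescript
// Extensions accepted by the transcription endpoint, per the list above.
const SUPPORTED_AUDIO_EXTENSIONS = new Set([
  'flac', 'mp3', 'mp4', 'mpeg', 'mpga', 'm4a', 'ogg', 'wav', 'webm', 'pcm',
]);

// Quick client-side guard before uploading: checks only the file extension,
// not the actual audio container.
function isSupportedAudioFormat(filename: string): boolean {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  return SUPPORTED_AUDIO_EXTENSIONS.has(ext);
}
```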
```typescript
import { SDK } from '@meetkai/mka1';
import { openAsBlob } from 'node:fs';

const mka1 = new SDK({
  bearerAuth: 'Bearer <mka1-api-key>',
});

const result = await mka1.llm.speech.transcribe({
  language: 'en',
  // Optional context to guide the transcription.
  prompt: 'This is a technical podcast about machine learning.',
  temperature: 0.2,
  requestBody: {
    // Uploaded as multipart/form-data.
    file: await openAsBlob('episode.wav'),
  },
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

console.log(result.text);
console.log(result.language);
console.log(result.confidence);
```
The response includes the transcript text plus detected language and confidence:
```json
{
  "text": "Hello! We're excited to show you our native speech capabilities.",
  "language": "en",
  "confidence": 0.8429018476208717
}
```
## Separate speakers in one transcript
If you need diarization, enable speaker data in the transcription request.
When enabled, the response can include a `speakers` array with speaker-labeled segments and timing metadata.
To use `includeSpeakerData`, upload WAV or PCM audio for non-streaming transcription. Other audio formats return `400 BAD_REQUEST` with the message `Speaker diarization currently requires WAV/PCM audio for non-streaming transcription`.
```typescript
const result = await mka1.llm.speech.transcribe({
  language: 'en',
  includeSpeakerData: true,
  prompt: 'This is a short podcast clip about AI product updates.',
  temperature: 0.2,
  requestBody: {
    file: await openAsBlob('panel.wav'),
  },
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

console.log(result.speakers);
```
Example response with speaker separation:
```json
{
  "text": "Welcome back to the show. Today we're looking at how speech APIs fit into production apps. We'll keep it practical and focus on latency, accuracy, and speaker turns.",
  "language": "en",
  "confidence": 0.91177404,
  "speakers": [
    {
      "speaker": "Speaker-1",
      "text": "Welcome back to the show.",
      "confidence": 0.91177404,
      "offset_ms": 80,
      "duration_ms": 1280
    },
    {
      "speaker": "Speaker-2",
      "text": "Today we're looking at how speech APIs fit into production apps.",
      "confidence": 0.91177404,
      "offset_ms": 1540,
      "duration_ms": 3380
    },
    {
      "speaker": "Speaker-1",
      "text": "We'll keep it practical and focus on latency, accuracy, and speaker turns.",
      "confidence": 0.91177404,
      "offset_ms": 5220,
      "duration_ms": 3660
    }
  ]
}
```
Use the top-level `text` field when you need a single merged transcript.
Use `speakers` when you need captions, turn-taking, or downstream speaker analytics.
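The timing fields make it straightforward to turn segments into captions. A minimal sketch — the `SpeakerSegment` shape mirrors the example response above, and the `[start --> end]` timestamp format is an illustrative choice, not a standard the API emits:

```typescript
// Shape mirroring one entry of the speakers array shown above.
interface SpeakerSegment {
  speaker: string;
  text: string;
  confidence: number;
  offset_ms: number;
  duration_ms: number;
}

// Format milliseconds as mm:ss.mmm for caption timestamps.
function formatTimestamp(ms: number): string {
  const minutes = Math.floor(ms / 60_000);
  const seconds = Math.floor((ms % 60_000) / 1000);
  const millis = ms % 1000;
  return `${String(minutes).padStart(2, '0')}:${String(seconds).padStart(2, '0')}.${String(millis).padStart(3, '0')}`;
}

// Build one caption line per segment from offset_ms and duration_ms.
function toCaptionLines(speakers: SpeakerSegment[]): string[] {
  return speakers.map((s) => {
    const start = formatTimestamp(s.offset_ms);
    const end = formatTimestamp(s.offset_ms + s.duration_ms);
    return `[${start} --> ${end}] ${s.speaker}: ${s.text}`;
  });
}
```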
## Generate speech
Use the standard text-to-speech endpoint when you want a complete WAV file.
The response body is binary audio, and the response headers include `X-Language-Code`.
```typescript
import { writeFileSync } from 'node:fs';

const result = await mka1.llm.speech.speak({
  text: 'Welcome to the MKA1 API speech guide.',
  language: 'en',
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

// The body may arrive as a Blob or a Uint8Array depending on the runtime.
const audioBody = result.body as Blob | Uint8Array;
const audioBuffer = audioBody instanceof Uint8Array
  ? Buffer.from(audioBody)
  : Buffer.from(await audioBody.arrayBuffer());

// Header names may be lower-cased by the HTTP client, so check both.
const languageCode =
  result.headers['X-Language-Code'] ?? result.headers['x-language-code'];

writeFileSync('speech.wav', audioBuffer);
console.log(languageCode);
```
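Since the endpoint returns a complete WAV file, a quick sanity check of the RIFF header can catch truncated or mislabeled responses before you persist them. A sketch, based on the standard WAV layout (`RIFF` at byte 0, `WAVE` at byte 8):

```typescript
// Returns true if the buffer starts with a RIFF/WAVE header:
// bytes 0-3 are "RIFF" and bytes 8-11 are "WAVE".
function looksLikeWav(bytes: Uint8Array): boolean {
  if (bytes.length < 12) return false;
  const ascii = (start: number, end: number) =>
    String.fromCharCode(...bytes.subarray(start, end));
  return ascii(0, 4) === 'RIFF' && ascii(8, 12) === 'WAVE';
}
```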
## Stream speech for lower latency
Use streaming text-to-speech when you want playback to start before the full audio file is ready.
Choose `mp3` for smaller payloads or `pcm` for uncompressed audio.
```typescript
const result = await mka1.llm.speech.speakStreaming({
  text: 'Start speaking this response as soon as audio is ready.',
  language: 'en',
  format: 'mp3',
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

// Header names may be lower-cased by the HTTP client, so check both.
const contentType =
  result.headers['Content-Type'] ?? result.headers['content-type'];
const languageCode =
  result.headers['X-Language-Code'] ?? result.headers['x-language-code'];

console.log(contentType);
console.log(languageCode);
```
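When weighing `mp3` against `pcm`, it helps to estimate what uncompressed audio costs on the wire: PCM bandwidth is sample rate × channels × bytes per sample. The rates in the comment below are common speech settings used for illustration, not values documented for this endpoint:

```typescript
// Bytes per second of raw PCM: sampleRate * channels * (bitsPerSample / 8).
function pcmBytesPerSecond(
  sampleRate: number,
  channels: number,
  bitsPerSample: number,
): number {
  return sampleRate * channels * (bitsPerSample / 8);
}

// e.g. 16-bit mono PCM at 24 kHz streams at 48,000 bytes/s (384 kbit/s),
// roughly an order of magnitude more than a typical speech MP3 stream.
```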
## Next steps