Use the MKA1 API speech endpoints when you need file-based speech-to-text or text-to-speech. For real-time, bidirectional voice sessions, use Advanced voice mode.

Choose the right endpoint

| Use case | Endpoint | Notes |
| --- | --- | --- |
| Transcribe a recorded file | Speech-to-text transcription | Upload audio with multipart/form-data |
| Generate a WAV file from text | Text-to-speech | Best for complete file generation |
| Start playback as soon as audio arrives | Streaming text-to-speech | Best for low-latency playback |

Transcribe audio

Send an audio file to the transcription endpoint when you want text output from a recorded file. If your app acts on behalf of an end user, also send X-On-Behalf-Of. Supported audio formats: FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, WebM, PCM.
import { SDK } from '@meetkai/mka1';
import { openAsBlob } from 'node:fs';

const mka1 = new SDK({
  bearerAuth: 'Bearer <mka1-api-key>',
});

const result = await mka1.llm.speech.transcribe({
  language: 'en',
  prompt: 'This is a technical podcast about machine learning.',
  temperature: 0.2,
  requestBody: {
    file: await openAsBlob('episode.wav'),
  },
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

console.log(result.text);
console.log(result.language);
console.log(result.confidence);
The response includes the transcript text plus detected language and confidence:
{
  "text": "Hello! We're excited to show you our native speech capabilities.",
  "language": "en",
  "confidence": 0.8429018476208717
}
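Because the response reports an overall confidence score, a simple gate can route low-confidence transcripts to human review. A minimal sketch; the TranscriptionResult type and the 0.75 threshold are illustrative assumptions, not part of the API:

```typescript
// Shape of the transcription response shown above (illustrative).
interface TranscriptionResult {
  text: string;
  language: string;
  confidence: number;
}

// Hypothetical policy: treat anything below the threshold as needing review.
function needsReview(result: TranscriptionResult, threshold = 0.75): boolean {
  return result.confidence < threshold;
}

const sample: TranscriptionResult = {
  text: "Hello! We're excited to show you our native speech capabilities.",
  language: 'en',
  confidence: 0.8429018476208717,
};

console.log(needsReview(sample)); // false: confidence clears the threshold
```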

Separate speakers in one transcript

If you need diarization, enable speaker data in the transcription request. When enabled, the response can include a speakers array with speaker-labeled segments and timing metadata.
Diarization with includeSpeakerData requires WAV or PCM audio for non-streaming transcription. Other audio formats return 400 BAD_REQUEST with the message "Speaker diarization currently requires WAV/PCM audio for non-streaming transcription."
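One way to avoid that 400 is to check the file extension before uploading. A small sketch; the helper name is ours, not part of the SDK:

```typescript
// Formats accepted for diarization with non-streaming transcription,
// per the note above.
const DIARIZATION_FORMATS = new Set(['wav', 'pcm']);

// Returns true when the file extension is WAV or PCM.
function supportsDiarization(filename: string): boolean {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  return DIARIZATION_FORMATS.has(ext);
}

console.log(supportsDiarization('panel.wav')); // true
console.log(supportsDiarization('panel.mp3')); // false
```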
const result = await mka1.llm.speech.transcribe({
  language: 'en',
  includeSpeakerData: true,
  prompt: 'This is a short podcast clip about AI product updates.',
  temperature: 0.2,
  requestBody: {
    file: await openAsBlob('panel.wav'),
  },
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

console.log(result.speakers);
Example response with speaker separation:
{
  "text": "Welcome back to the show. Today we're looking at how speech APIs fit into production apps. We'll keep it practical and focus on latency, accuracy, and speaker turns.",
  "language": "en",
  "confidence": 0.91177404,
  "speakers": [
    {
      "speaker": "Speaker-1",
      "text": "Welcome back to the show.",
      "confidence": 0.91177404,
      "offset_ms": 80,
      "duration_ms": 1280
    },
    {
      "speaker": "Speaker-2",
      "text": "Today we're looking at how speech APIs fit into production apps.",
      "confidence": 0.91177404,
      "offset_ms": 1540,
      "duration_ms": 3380
    },
    {
      "speaker": "Speaker-1",
      "text": "We'll keep it practical and focus on latency, accuracy, and speaker turns.",
      "confidence": 0.91177404,
      "offset_ms": 5220,
      "duration_ms": 3660
    }
  ]
}
Use the top-level text field when you need a single merged transcript. Use speakers when you need captions, turn-taking, or downstream speaker analytics.
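The speakers array maps directly onto caption lines. A sketch of that mapping using segments from the example response above; the SpeakerSegment type and the timestamp formatting are our own, not part of the SDK:

```typescript
// Shape of one entry in the speakers array shown above (illustrative).
interface SpeakerSegment {
  speaker: string;
  text: string;
  confidence: number;
  offset_ms: number;
  duration_ms: number;
}

// Format milliseconds as mm:ss for caption timestamps.
function mmss(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const m = Math.floor(totalSeconds / 60);
  const s = totalSeconds % 60;
  return `${String(m).padStart(2, '0')}:${String(s).padStart(2, '0')}`;
}

// Turn speaker-labeled segments into caption lines.
function toCaptions(speakers: SpeakerSegment[]): string[] {
  return speakers.map(
    (seg) => `[${mmss(seg.offset_ms)}] ${seg.speaker}: ${seg.text}`,
  );
}

const segments: SpeakerSegment[] = [
  {
    speaker: 'Speaker-1',
    text: 'Welcome back to the show.',
    confidence: 0.91177404,
    offset_ms: 80,
    duration_ms: 1280,
  },
  {
    speaker: 'Speaker-2',
    text: "Today we're looking at how speech APIs fit into production apps.",
    confidence: 0.91177404,
    offset_ms: 1540,
    duration_ms: 3380,
  },
];

console.log(toCaptions(segments).join('\n'));
// [00:00] Speaker-1: Welcome back to the show.
// [00:01] Speaker-2: Today we're looking at how speech APIs fit into production apps.
```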

Generate speech

Use the standard text-to-speech endpoint when you want a complete WAV file. The response body is binary audio, and the response headers include X-Language-Code.
import { writeFileSync } from 'node:fs';

const result = await mka1.llm.speech.speak({
  text: 'Welcome to the MKA1 API speech guide.',
  language: 'en',
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

const audioBody = result.body as Blob | Uint8Array;
const audioBuffer = audioBody instanceof Uint8Array
  ? Buffer.from(audioBody)
  : Buffer.from(await audioBody.arrayBuffer());
const languageCode =
  result.headers['X-Language-Code'] ?? result.headers['x-language-code'];

writeFileSync('speech.wav', audioBuffer);
console.log(languageCode);

Stream speech for lower latency

Use streaming text-to-speech when you want playback to start before the full audio file is ready. Choose mp3 for smaller payloads or pcm for uncompressed audio.
const result = await mka1.llm.speech.speakStreaming({
  text: 'Start speaking this response as soon as audio is ready.',
  language: 'en',
  format: 'mp3',
}, { headers: { 'X-On-Behalf-Of': '<end-user-id>' } });

const contentType =
  result.headers['Content-Type'] ?? result.headers['content-type'];
const languageCode =
  result.headers['X-Language-Code'] ?? result.headers['x-language-code'];

console.log(contentType);
console.log(languageCode);
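The snippet above reads only the response headers; to actually start playback early, consume the body incrementally. A sketch, assuming the streaming body is exposed as a web ReadableStream of Uint8Array chunks (the SDK's actual body type may differ):

```typescript
// Consume a streaming body chunk by chunk, invoking onChunk as audio
// arrives (e.g. to feed a player), and return the full buffer at the end.
async function consumeAudioStream(
  body: ReadableStream<Uint8Array>,
  onChunk: (chunk: Uint8Array) => void,
): Promise<Buffer> {
  const chunks: Uint8Array[] = [];
  const reader = body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(value); // playback can begin on the first chunk
    chunks.push(value);
  }
  return Buffer.concat(chunks);
}
```

With the result from speakStreaming, usage would look like `await consumeAudioStream(result.body as ReadableStream<Uint8Array>, feedPlayer)`, where feedPlayer is whatever hands bytes to your audio pipeline.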

Next steps