The MKA1 API provides a real-time voice interface through LiveKit. This guide covers how to obtain a room token, connect to a voice session, send audio and text input, and capture the agent’s responses.
## Overview
The voice integration consists of three main components:
- Room Token: A JWT that grants access to a LiveKit room
- LiveKit Connection: WebRTC-based real-time communication
- Voice Agent: Processes audio/text input and generates spoken responses
The agent pipeline works as follows:
1. STT (Speech-to-Text): Audio is streamed via WebSocket at 16kHz and transcribed
2. LLM: Transcribed text is processed by the MKA1 Responses API
3. TTS (Text-to-Speech): LLM output is synthesized to audio at 24kHz
Every request the voice agent sends to the Responses API automatically includes "voice_mode": "true" in the request metadata. This lets you distinguish voice-originated responses from text-based ones when reviewing usage or response history.
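Because the flag arrives as the string "true" rather than a boolean, comparisons against a boolean will silently fail. A small guard like the following can help when filtering stored responses; the metadata shape here is illustrative, not part of the documented API surface.

```typescript
// Illustrative shape of a response's metadata map; the Responses API
// stores metadata values as strings.
type ResponseMetadata = Record<string, string>;

// The voice agent sets voice_mode to the *string* "true", so compare
// against the string rather than a boolean.
function isVoiceResponse(metadata: ResponseMetadata | undefined): boolean {
  return metadata?.voice_mode === 'true';
}
```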
## Getting a room token
To start a voice session, first request a room token from the MKA1 API. The token endpoint requires an API key and optionally accepts X-On-Behalf-Of to identify end users. See Authentication for details.
```typescript
import { SDK } from '@meetkai/mka1';

const mka1 = new SDK({
  bearerAuth: `Bearer ${YOUR_API_KEY}`,
});

const session = await mka1.llm.speech.livekitToken({
  llm: {
    model: 'meetkai:qwen3.5-35b-a3b',
    reasoning: { effort: 'none' }
  }
}, { headers: { 'X-On-Behalf-Of': 'user-123' } }); // Optional

console.log(session.token);    // JWT token
console.log(session.url);      // WebSocket URL
console.log(session.roomName); // Room name
```
### Parameters
The request body has two top-level objects:
```json
{
  "llm": { ... },  // Required — LLM configuration
  "stt": { ... }   // Optional — speech-to-text tuning
}
```
#### llm — LLM configuration (required)
The llm object accepts the same fields as the Responses API request body, minus fields managed by the voice agent (input, stream, store, background).
| Field | Required | Description |
|---|---|---|
| model | Yes | LLM model to use (e.g., meetkai:qwen3.5-35b-a3b) |
| instructions | No | Custom system instructions for the agent |
| previous_response_id | No | Chain this session to a specific response from a previous session |
| conversation | No | Continue an existing conversation — pass { "id": "conv_abc123..." } or the conversation ID as a string |
| tools | No | Array of tool definitions (function, web_search, file_search, etc.) |
| tool_choice | No | How the model selects tools ("auto", "none", "required", or a specific tool) |
| parallel_tool_calls | No | Whether to allow parallel tool execution |
| max_tool_calls | No | Maximum number of tool calls per response (default: 30) |
| temperature | No | Sampling temperature (e.g., 0.7) |
| max_output_tokens | No | Maximum tokens in the response |
| reasoning | No | Reasoning configuration (e.g., { "effort": "high" }). Set { "effort": "none" } for voice sessions to minimize latency — see note below. |
| top_p | No | Nucleus sampling parameter |
| presence_penalty | No | Presence penalty for token repetition |
| frequency_penalty | No | Frequency penalty for token repetition |
| truncation | No | "auto" or "disabled" — controls context truncation |
| context_management | No | Context management strategies for conversation truncation |
| service_tier | No | "auto", "default", "flex", or "priority" |
| prompt | No | Reference to a prompt template and its variables |
| text | No | Text output configuration (format, verbosity) |
| metadata | No | Key-value metadata passed to the Responses API |
You cannot specify both previous_response_id and conversation.
The token metadata is embedded in a JWT, which is passed as an HTTP header. Keep the total llm payload under ~8 KB — large tools arrays may need to be trimmed.
For voice sessions, disable reasoning by setting "reasoning": { "effort": "none" }. Reasoning adds thinking time before the model responds, which increases latency and creates noticeable pauses in conversation. Disabling it keeps responses fast and natural.
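Because the llm payload rides inside the JWT, it is cheap to check its serialized size before requesting a token. The sketch below applies the ~8 KB guidance from above as a client-side guard; the exact server-side limit is not specified here, so treat the threshold as an assumption.

```typescript
// Serialized size, in bytes, of an llm payload once JSON-encoded.
function llmPayloadBytes(llm: object): number {
  return new TextEncoder().encode(JSON.stringify(llm)).length;
}

// Throw before requesting a token if the payload is likely too large to
// fit in the JWT metadata (~8 KB per the guidance above).
function assertPayloadFits(llm: object, maxBytes = 8 * 1024): void {
  const size = llmPayloadBytes(llm);
  if (size > maxBytes) {
    throw new Error(
      `llm payload is ${size} bytes (limit ~${maxBytes}); trim the tools array or instructions`
    );
  }
}
```

Calling assertPayloadFits right before livekitToken turns a hard-to-debug token failure into an immediate, descriptive error.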
#### stt — Speech-to-text configuration (optional)
Controls server-side voice activity detection (VAD) and endpointing behavior.
| Field | Required | Description |
|---|---|---|
| silence_timeout_ms | No | Milliseconds of silence before finalizing speech (100–5000) |
| initial_silence_timeout_ms | No | Timeout before any speech is detected (1000–30000) |
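The documented ranges can be enforced client-side before requesting a token. This helper simply clamps values to the bounds in the table above; the camelCase field names follow the SDK style used in the examples below.

```typescript
// Clamp a value into a closed range.
const clamp = (value: number, min: number, max: number): number =>
  Math.min(Math.max(value, min), max);

// Apply the documented bounds: silence_timeout_ms 100–5000,
// initial_silence_timeout_ms 1000–30000. Omitted fields stay omitted.
function normalizeSttConfig(stt: {
  silenceTimeoutMs?: number;
  initialSilenceTimeoutMs?: number;
}) {
  return {
    ...(stt.silenceTimeoutMs !== undefined && {
      silenceTimeoutMs: clamp(stt.silenceTimeoutMs, 100, 5000),
    }),
    ...(stt.initialSilenceTimeoutMs !== undefined && {
      initialSilenceTimeoutMs: clamp(stt.initialSilenceTimeoutMs, 1000, 30000),
    }),
  };
}
```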
### Advanced configuration
You can pass tools, custom instructions, and STT tuning in a single token request:
```typescript
const session = await mka1.llm.speech.livekitToken({
  llm: {
    model: 'meetkai:qwen3.5-35b-a3b',
    instructions: 'You are a helpful travel assistant. Be concise in voice responses.',
    temperature: 0.7,
    tools: [
      {
        type: 'web_search',
        userLocation: { country: 'US' }
      },
      {
        type: 'function',
        name: 'book_flight',
        description: 'Book a flight for the user',
        parameters: {
          type: 'object',
          properties: {
            origin: { type: 'string' },
            destination: { type: 'string' },
            date: { type: 'string' }
          },
          required: ['origin', 'destination', 'date']
        }
      }
    ],
    toolChoice: 'auto'
  },
  stt: {
    silenceTimeoutMs: 500,
    initialSilenceTimeoutMs: 10000
  }
});
```
### Response

```json
{
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "url": "wss://apigw.mka1.com/api/v1/livekit",
  "roomName": "550e8400-e29b-41d4-a716-446655440000"
}
```
| Field | Description |
|---|---|
| token | JWT access token (5-minute TTL) with room join, publish, and subscribe permissions |
| url | LiveKit WebSocket URL to connect to |
| roomName | Auto-generated UUID for this session |
The token includes metadata that the voice agent uses to configure the session.
### Continuing a session
To continue from a previous response:
```typescript
const mka1 = new SDK({
  bearerAuth: `Bearer ${YOUR_API_KEY}`,
});

const session = await mka1.llm.speech.livekitToken({
  llm: {
    model: 'meetkai:qwen3.5-35b-a3b',
    previousResponseId: 'resp_abc123...'
  }
}, { headers: { 'X-On-Behalf-Of': 'user-123' } }); // Optional, but must match if the original session used it
```
To continue an existing conversation:
```typescript
const mka1 = new SDK({
  bearerAuth: `Bearer ${YOUR_API_KEY}`,
});

const session = await mka1.llm.speech.livekitToken({
  llm: {
    model: 'meetkai:qwen3.5-35b-a3b',
    conversation: { id: 'conv_abc123...' }
  }
}, { headers: { 'X-On-Behalf-Of': 'user-123' } }); // Optional, but must match if the original session used it
```
When continuing a session, the API key and X-On-Behalf-Of header (if used) must match the original session. The voice agent encrypts both into the room token and passes them to all downstream MKA1 services. If they don’t match, the agent will not have access to the previous context.
## Connecting to a room
Once you have a token, use the LiveKit SDK to connect to the room.
```typescript
import { Room, RoomEvent, Track } from 'livekit-client';

const room = new Room();

// Connect to the room
await room.connect(session.url, session.token);
console.log('Connected to room:', room.name);
```
The agent accepts audio input via the LiveKit room's audio track and processes it at a 16kHz sample rate.
```typescript
import { createLocalAudioTrack } from 'livekit-client';

// Create a local audio track from the microphone
const audioTrack = await createLocalAudioTrack({
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true
});

// Publish the track to the room
await room.localParticipant.publishTrack(audioTrack);
```
### Audio behavior
- Voice Activity Detection (VAD): VAD is handled server-side by the MKA1 agent, not locally. The agent automatically detects when you stop speaking and begins processing.
- Sample rate: Audio is streamed at 16kHz to the STT service.
- Endpointing: The agent uses server-side endpointing to determine when speech ends. There is no local endpointing delay.
## Sending text input
You can also send text messages directly to the agent without speaking.
```typescript
// Send a text message to the agent
const message = JSON.stringify({
  type: 'user_message',
  content: 'What is the capital of France?'
});

await room.localParticipant.publishData(
  new TextEncoder().encode(message),
  { reliable: true, topic: 'lk.chat' }
);
```
## Receiving agent responses
The agent responds in three ways:
- Audio output: Synthesized speech via an audio track
- Transcription: Text of what the agent is saying (for captions)
- Response metadata: Response ID and conversation ID via data channel
### Subscribing to audio output

```typescript
import { RoomEvent, Track } from 'livekit-client';

room.on(RoomEvent.TrackSubscribed, (track, publication, participant) => {
  if (track.kind === Track.Kind.Audio && participant.identity !== room.localParticipant.identity) {
    // This is the agent's audio output
    const audioElement = track.attach();
    document.body.appendChild(audioElement);
  }
});
```
### Receiving transcriptions
The agent publishes transcriptions of its speech. You can use these for captions or logging.
```typescript
room.on(RoomEvent.TranscriptionReceived, (segments, participant) => {
  for (const segment of segments) {
    console.log(`Agent said: ${segment.text}`);
  }
});
```
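Transcription segments for one utterance typically arrive incrementally and share an id, with interim text replaced until a final segment lands. A small accumulator can turn the event stream into stable caption text; the CaptionSegment type below is a minimal local mirror of the id/text/final fields on livekit-client's TranscriptionSegment, not an import.

```typescript
// Minimal local shape of a transcription segment; mirrors the id/text/final
// fields of livekit-client's TranscriptionSegment.
interface CaptionSegment {
  id: string;
  text: string;
  final: boolean;
}

class CaptionBuffer {
  private segments = new Map<string, CaptionSegment>();

  // Interim segments with the same id overwrite each other; once a
  // segment is final, that utterance's text is frozen.
  update(incoming: CaptionSegment[]): void {
    for (const seg of incoming) {
      const existing = this.segments.get(seg.id);
      if (existing?.final) continue; // never downgrade a final segment
      this.segments.set(seg.id, seg);
    }
  }

  // Current caption: all segments joined in arrival order.
  text(): string {
    return [...this.segments.values()].map((s) => s.text).join(' ');
  }
}
```

Wire it into the handler above by calling buffer.update(segments) and rendering buffer.text().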
### Receiving response metadata

The agent publishes the response_id and conversation_id (if applicable) when it starts generating a response. Save the response_id to chain future sessions using previous_response_id.
```typescript
room.on(RoomEvent.DataReceived, (payload, participant) => {
  if (participant.identity !== room.localParticipant.identity) {
    const data = JSON.parse(new TextDecoder().decode(payload));
    if (data.response_id) {
      console.log('Response ID:', data.response_id);
      console.log('Conversation ID:', data.conversation_id); // present if using a conversation
      // Save response_id to chain future sessions with previous_response_id
    }
  }
});
```
## Conversation continuity
The agent supports multi-turn conversations with persistent memory. Every response is automatically assigned a response_id, while conversations must be explicitly created and managed through the Conversations API.
There are two ways to continue a conversation:
llm.previous_response_id chains a new session to a specific response. The agent receives the context from that response and all prior responses in the chain. Use this when:
- You want to continue from a specific point in a conversation
- You’re building a linear conversation flow
- You want to branch from a specific response
llm.conversation references a conversation created via the Conversations API. Use this when:
- You need to manage conversation metadata (titles, tags, etc.)
- You want to list or search past conversations
- You’re building a chat interface with persistent conversation history
- Multiple clients need to access the same conversation
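Because previous_response_id and conversation are mutually exclusive, a small builder can enforce that rule at the call site rather than waiting for the API's 400 response. This is a sketch; the camelCase field names follow the SDK style used in the examples in this guide.

```typescript
interface ContinuationOptions {
  previousResponseId?: string;
  conversationId?: string;
}

// Build the continuation fields of an llm token payload, rejecting the
// invalid combination up front (the API returns 400 if both are set).
function buildContinuation(opts: ContinuationOptions): Record<string, unknown> {
  if (opts.previousResponseId && opts.conversationId) {
    throw new Error('Specify previousResponseId or conversationId, not both');
  }
  if (opts.previousResponseId) {
    return { previousResponseId: opts.previousResponseId };
  }
  if (opts.conversationId) {
    return { conversation: { id: opts.conversationId } };
  }
  return {}; // fresh session
}
```

Spread the result into the llm object alongside model when requesting a token.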
### Starting a new session
```typescript
import { Room, RoomEvent } from 'livekit-client';
import { SDK } from '@meetkai/mka1';

const mka1 = new SDK({
  bearerAuth: `Bearer ${YOUR_API_KEY}`,
});

// 1. Get a token for a new session
const session = await mka1.llm.speech.livekitToken({
  llm: { model: 'meetkai:qwen3.5-35b-a3b' }
}, { headers: { 'X-On-Behalf-Of': 'user-123' } }); // Optional

// 2. Connect to the room
const room = new Room();
await room.connect(session.url, session.token);

// 3. Track the response ID when the agent responds
let lastResponseId: string;
room.on(RoomEvent.DataReceived, (payload, participant) => {
  const data = JSON.parse(new TextDecoder().decode(payload));
  if (data.response_id) {
    lastResponseId = data.response_id;
  }
});

// 4. Have a conversation...

// 5. Disconnect when done
room.disconnect();
```
### Continuing from a previous response
Use previous_response_id to chain a new session to the last response, preserving conversation context:
```typescript
// Use the same API key and X-On-Behalf-Of as the original session
const mka1 = new SDK({
  bearerAuth: `Bearer ${YOUR_API_KEY}`,
});

// 1. Get a new token chained to the previous response
const session = await mka1.llm.speech.livekitToken({
  llm: {
    model: 'meetkai:qwen3.5-35b-a3b',
    previousResponseId: lastResponseId
  }
}, { headers: { 'X-On-Behalf-Of': 'user-123' } }); // Optional, but must match if the original session used it

// 2. Connect to the new room
const room = new Room();
await room.connect(session.url, session.token);

// 3. The agent now has context from the previous session
// User: "What did I ask you earlier?"
// Agent: "You asked about the capital of France..."
```
### Continuing from a conversation

Use the conversation field with a conversation ID to continue an existing conversation created via the Conversations API:
```typescript
// Use the same API key and X-On-Behalf-Of as the original session
const mka1 = new SDK({
  bearerAuth: `Bearer ${YOUR_API_KEY}`,
});

// 1. Get a new token with the conversation ID
const session = await mka1.llm.speech.livekitToken({
  llm: {
    model: 'meetkai:qwen3.5-35b-a3b',
    conversation: { id: conversationId }
  }
}, { headers: { 'X-On-Behalf-Of': 'user-123' } }); // Optional, but must match if the original session used it

// 2. Connect to the new room
const room = new Room();
await room.connect(session.url, session.token);

// 3. The agent now has context from the entire conversation history
```
When continuing a conversation, the API key and X-On-Behalf-Of header (if used) must match the original session. The context is scoped to the authenticated identity.
### Handling disconnection
Tokens expire after 5 minutes. If you need longer sessions, implement reconnection logic:
```typescript
room.on(RoomEvent.Disconnected, async () => {
  console.log('Disconnected from room');

  // Get a new token (continuing from the last response)
  const newSession = await mka1.llm.speech.livekitToken({
    llm: {
      model: 'meetkai:qwen3.5-35b-a3b',
      previousResponseId: savedResponseId
    }
  });

  // Reconnect
  await room.connect(newSession.url, newSession.token);
});
```
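Instead of waiting for the disconnect, you can refresh proactively by reading the token's exp claim and scheduling a reconnect shortly before the 5-minute TTL elapses. The sketch below assumes a standard three-part JWT with a base64url-encoded payload and uses Node.js's Buffer for decoding (in a browser, substitute atob); the 30-second lead time is an arbitrary choice.

```typescript
// Milliseconds until the JWT's exp claim, relative to `now`.
// Assumes a standard three-part JWT with a base64url-encoded payload.
function msUntilExpiry(token: string, now: number = Date.now()): number {
  const payloadPart = token.split('.')[1];
  const payload = JSON.parse(Buffer.from(payloadPart, 'base64url').toString('utf8'));
  return payload.exp * 1000 - now;
}

// Schedule a refresh callback 30 seconds before expiry (never in the past).
function scheduleRefresh(
  token: string,
  refresh: () => void
): ReturnType<typeof setTimeout> {
  const leadMs = 30_000;
  return setTimeout(refresh, Math.max(0, msUntilExpiry(token) - leadMs));
}
```

In the refresh callback, request a new token with previousResponseId and reconnect as shown above.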
## Complete example
Here’s a complete example putting it all together:
```typescript
import { Room, RoomEvent, Track, createLocalAudioTrack } from 'livekit-client';
import { SDK } from '@meetkai/mka1';

async function startVoiceSession(model: string = 'meetkai:qwen3.5-35b-a3b') {
  const mka1 = new SDK({ bearerAuth: `Bearer ${YOUR_API_KEY}` });

  // Get room credentials
  const session = await mka1.llm.speech.livekitToken({ llm: { model } });

  // Create and connect to room
  const room = new Room();
  let lastResponseId: string | undefined;

  // Handle agent audio output
  room.on(RoomEvent.TrackSubscribed, (track, publication, participant) => {
    if (track.kind === Track.Kind.Audio) {
      const audio = track.attach();
      document.body.appendChild(audio);
    }
  });

  // Handle transcriptions
  room.on(RoomEvent.TranscriptionReceived, (segments) => {
    for (const segment of segments) {
      console.log('Agent:', segment.text);
    }
  });

  // Handle response metadata
  room.on(RoomEvent.DataReceived, (payload, participant) => {
    const data = JSON.parse(new TextDecoder().decode(payload));
    if (data.response_id) {
      lastResponseId = data.response_id;
    }
  });

  // Connect to the room
  await room.connect(session.url, session.token);

  // Capture and publish microphone
  const audioTrack = await createLocalAudioTrack({
    echoCancellation: true,
    noiseSuppression: true
  });
  await room.localParticipant.publishTrack(audioTrack);

  // The agent will greet you automatically
  // Start speaking to interact!

  return { room, getLastResponseId: () => lastResponseId };
}
```
## Error handling
### Token endpoint errors
These are returned as HTTP responses when requesting a room token:
| Error | Cause | Solution |
|---|---|---|
| 400 Bad Request | Missing required llm.model parameter | Include model inside the llm object |
| 400 Bad Request | Both previous_response_id and conversation specified | Use only one, not both |
| 401 Unauthorized | Invalid or missing API key | Check your API key is valid |
### In-session errors
During an active voice session, the agent publishes errors via the LiveKit data channel. Listen for them alongside response metadata:
```typescript
room.on(RoomEvent.DataReceived, (payload, participant) => {
  if (participant.identity === room.localParticipant.identity) return;

  const data = JSON.parse(new TextDecoder().decode(payload));
  if (data.error) {
    console.error(`[${data.error.service}] ${data.error.code}: ${data.error.message}`);
    // data.error.details may contain additional debugging info
  }
  if (data.response_id) {
    lastResponseId = data.response_id;
  }
});
```
The error payload structure:
```json
{
  "error": {
    "code": "rate_limited",
    "message": "HTTP 429",
    "service": "llm",
    "details": "..."
  }
}
```
| Field | Description |
|---|---|
| code | Error code (see table below) |
| message | Short description of the error |
| service | Which part of the pipeline failed: llm, stt, or tts |
| details | Additional context for debugging (optional) |
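Since error payloads and response metadata share the same data channel, a type guard keeps the handler tidy. The interface below mirrors the payload structure above; only code and message are checked, because some session-level errors arrive without a service field.

```typescript
// Shape of an in-session error payload, per the structure above.
interface AgentError {
  code: string;
  message: string;
  service?: 'llm' | 'stt' | 'tts' | string;
  details?: string;
}

// Narrow a decoded data-channel message to one carrying an error.
function isAgentError(data: unknown): data is { error: AgentError } {
  if (typeof data !== 'object' || data === null) return false;
  const err = (data as { error?: unknown }).error;
  return (
    typeof err === 'object' &&
    err !== null &&
    typeof (err as AgentError).code === 'string' &&
    typeof (err as AgentError).message === 'string'
  );
}
```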
Error codes:
| Code | Service | Cause |
|---|---|---|
| invalid_session | — | Missing required metadata fields (sub, llm) |
| auth_error | — | Failed to decrypt credentials from the room token |
| session_error | — | Agent failed to start the voice session |
| invalid_request | llm | Bad request to the Responses API (HTTP 400) |
| auth_error | llm | Invalid API key (HTTP 401) |
| access_denied | llm | Insufficient permissions (HTTP 403) |
| rate_limited | llm | Rate limit exceeded (HTTP 429) |
| service_error | llm | Internal server error (HTTP 500) |
| service_unavailable | llm | Upstream unavailable (HTTP 502/503) |
| timeout | llm | Request timed out (HTTP 504) |
| connection_error | llm | Failed to connect to the Responses API |
| transcription_error | stt | Speech-to-text processing failed |
| speech_error | tts | Text-to-speech synthesis failed |
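A reasonable client policy, offered here as an assumption rather than part of the API contract, is to retry the transient codes with backoff and surface the rest to the user immediately:

```typescript
// Codes from the table above that typically indicate a transient
// condition worth retrying; auth and request errors are not retryable.
const RETRYABLE_CODES = new Set([
  'rate_limited',
  'timeout',
  'service_unavailable',
  'connection_error',
  'service_error',
]);

function isRetryable(code: string): boolean {
  return RETRYABLE_CODES.has(code);
}

// Exponential backoff delay for the nth retry (0-based), capped at 10 s.
function backoffMs(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```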
### Connection errors
| Issue | Cause | Solution |
|---|---|---|
| Connection timeout | Network issues or invalid token | Get a fresh token and retry |
| Token expired | Session exceeded 5 minutes | Get a new token with previous_response_id to continue |
## Next steps