Multimodal output

The MKA1 API can return text, audio, and images. Text is the default output modality. Set the modalities and audio parameters to enable speech output, or add the image_generation tool to produce images.

Supported output types

Modality	How to enable	Output format
Text	Default — no extra config	`output_text` in response
Audio (speech)	Set `modalities: ["text", "audio"]`	Base64 audio + transcript
Image	Add `image_generation` tool	Image URL or base64

Generate audio (text-to-speech)

Request audio output by setting modalities to ["text", "audio"] and specifying a voice and format in the audio parameter. The response includes both the text transcript and base64-encoded audio data.

Audio configuration

Parameter	Options	Default
`voice`	`alloy` and other voice profiles	`alloy`
`format`	`wav`, `mp3`, `flac`, `opus`, `pcm16`	`wav`

Audio is synthesized at 24 kHz, 16-bit mono.

mka1 llm responses create \
  -H 'X-On-Behalf-Of: <end-user-id>' \
  --body '{
    "model": "auto",
    "input": "Say hello in a friendly way. Keep it very short.",
    "modalities": ["text", "audio"],
    "audio": { "voice": "alloy", "format": "wav" }
  }'

The response contains an output_audio item with the base64-encoded audio and a transcript of what was spoken:

{
  "status": "completed",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        { "type": "output_text", "text": "Hello!" }
      ]
    },
    {
      "type": "output_audio",
      "id": "audio_460caf1079b34fa0b4aa74448dff4ea7",
      "data": "<Base64-encoded WAV audio data>",
      "transcript": "Hi there!",
      "status": "completed"
    }
  ]
}

The data field contains the full audio file (268 KB in this example). The transcript field contains the text the model chose to speak, which may differ slightly from the text output: in this example the text output is "Hello!" while the transcript is "Hi there!".
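
The same response can also be handled in code. The following is a minimal Python sketch, assuming the response JSON has already been parsed into a dict with the shape shown above. The helper name split_audio_response and the placeholder payload are hypothetical and not part of any MKA1 SDK.

```python
import base64


def split_audio_response(response: dict) -> dict:
    """Separate text, transcript, and decoded audio from a parsed response."""
    text_parts = []
    audio = None
    transcript = None
    for item in response.get("output", []):
        if item.get("type") == "message":
            # Text output lives in the message's content parts.
            for part in item.get("content", []):
                if part.get("type") == "output_text":
                    text_parts.append(part.get("text", ""))
        elif item.get("type") == "output_audio":
            # `data` is base64; decoding yields the raw bytes of the audio file.
            audio = base64.b64decode(item["data"])
            transcript = item.get("transcript")
    return {"text": "".join(text_parts), "transcript": transcript, "audio": audio}


# Stand-in payload mirroring the example above; real `data` holds a full audio file.
sample = {
    "status": "completed",
    "output": [
        {"type": "message", "role": "assistant",
         "content": [{"type": "output_text", "text": "Hello!"}]},
        {"type": "output_audio",
         "data": base64.b64encode(b"RIFF-placeholder").decode("ascii"),
         "transcript": "Hi there!", "status": "completed"},
    ],
}
parts = split_audio_response(sample)
print(parts["text"], "|", parts["transcript"], "|", len(parts["audio"]), "bytes")
```

Keeping the text and the transcript as separate fields makes it easy to show one and log the other when they differ.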

Save audio to a file

# Generate audio and extract the base64 data, then decode to a file
mka1 llm responses create \
  --body '{
    "model": "auto",
    "input": "Read this sentence aloud: The quick brown fox jumps over the lazy dog.",
    "modalities": ["text", "audio"],
    "audio": { "voice": "alloy", "format": "mp3" }
  }' \
  --output-format json \
  --jq '.output[] | select(.type == "output_audio") | .data' | base64 -d > output.mp3
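
If you request pcm16 instead of a container format, the decoded bytes are presumably raw samples that most players will not open directly. The following is a minimal Python sketch that wraps such samples in a WAV container using the standard library. It assumes that pcm16 means headerless little-endian 16-bit samples at the 24 kHz mono rate stated under Audio configuration; this page does not confirm that layout, and pcm16_to_wav is a hypothetical helper name.

```python
import io
import wave

# Values taken from "Audio is synthesized at 24 kHz, 16-bit mono".
SAMPLE_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2
CHANNELS = 1


def pcm16_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM samples (assumed headerless) in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()


def duration_seconds(pcm: bytes) -> float:
    """Playback length implied by the byte count at the stated rate."""
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


# One second of silence stands in for decoded API output.
one_second_of_silence = b"\x00\x00" * SAMPLE_RATE_HZ
wav_bytes = pcm16_to_wav(one_second_of_silence)
print(len(wav_bytes), "bytes,", duration_seconds(one_second_of_silence), "s")
```

At this rate, uncompressed audio takes 48,000 bytes per second, which is useful when estimating payload sizes for longer responses.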

Supported languages

Audio output supports automatic language detection and 20+ languages including English, Chinese, Hindi, Spanish, Arabic, Bengali, Portuguese, Russian, Japanese, Punjabi, German, Korean, French, Turkish, Italian, Thai, Polish, Dutch, Indonesian, Vietnamese, and Urdu.

Generate images

Use the image_generation tool to create images from text prompts. The model interprets your message, generates a prompt for the image model, and returns the result.

Image generation models

Model	Best for
`meetkai:flux-2-klein`	Fast generation, general purpose (default)
`meetkai:z-image-turbo`	High-quality, detailed images

Image generation options

Parameter	Options	Default
`size`	`1024x1024`, `1024x1536`, `1536x1024`, `auto`	`auto`
`quality`	`low`, `medium`, `high`, `auto`	`auto`
`output_format`	`png`, `webp`, `jpeg`	`png`
`background`	`transparent`, `opaque`, `auto`	`auto`

mka1 llm responses create --body '{
  "model": "auto",
  "input": "Generate an image of a sunset over a mountain lake.",
  "tools": [
    {
      "type": "image_generation",
      "model": "auto",
      "quality": "high",
      "size": "1024x1024",
      "output_format": "png"
    }
  ]
}'

The response includes an image_generation_call item with the generated image URL and the revised prompt used by the image model:

{
  "status": "completed",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "I'll generate an image of a beautiful sunset over a mountain lake for you."
        }
      ]
    },
    {
      "type": "image_generation_call",
      "id": "ig_abc123",
      "status": "completed",
      "result": "<Generated Image URL>",
      "revised_prompt": "A breathtaking sunset over a pristine mountain lake, with golden and orange hues reflecting on the calm water surface. Snow-capped mountain peaks in the background, dramatic clouds in the sky with vibrant sunset colors of pink, purple, and orange.",
      "size": "auto",
      "quality": "auto",
      "output_format": "png"
    }
  ]
}

The result field contains a URL to the generated image. The revised_prompt shows the expanded prompt the image model used — the LLM enhances your brief instruction into a detailed image description.
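
To collect generated images from a parsed response, the following minimal Python sketch may help. It assumes the response shape shown above. Because the output table lists "Image URL or base64", it treats any result that starts with http:// or https:// as a URL and anything else as base64 data. That heuristic, the helper name collect_images, and the example.com URL are assumptions for illustration.

```python
import base64


def collect_images(response: dict) -> list:
    """Return one record per completed image_generation_call item."""
    images = []
    for item in response.get("output", []):
        if item.get("type") != "image_generation_call":
            continue
        if item.get("status") != "completed":
            continue
        result = item.get("result", "")
        # Assumed heuristic: URLs start with a scheme, otherwise treat as base64.
        is_url = result.startswith(("http://", "https://"))
        images.append({
            "id": item.get("id"),
            "url": result if is_url else None,
            "bytes": None if is_url else base64.b64decode(result),
            "revised_prompt": item.get("revised_prompt"),
        })
    return images


# Stand-in payload; the URL is a placeholder, not a real MKA1 address.
sample = {
    "status": "completed",
    "output": [
        {"type": "message", "role": "assistant",
         "content": [{"type": "output_text", "text": "Here is your image."}]},
        {"type": "image_generation_call", "id": "ig_abc123", "status": "completed",
         "result": "https://example.com/sunset.png",
         "revised_prompt": "A breathtaking sunset over a pristine mountain lake."},
    ],
}
images = collect_images(sample)
print(images[0]["url"], "|", images[0]["revised_prompt"])
```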

Force image generation

Use tool_choice to ensure the model generates an image rather than responding with text only.

mka1 llm responses create --body '{
  "model": "auto",
  "input": "A red circle on a white background.",
  "tools": [{ "type": "image_generation" }],
  "tool_choice": { "type": "image_generation" }
}'
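
When request bodies are built programmatically, it can help to validate tool options before sending them. The following is a minimal Python sketch that checks options against the values listed under Image generation options and optionally adds the tool_choice shown above. The function name build_image_request and the client-side validation are assumptions; the API may accept values this page does not list.

```python
import json

# Allowed values copied from the "Image generation options" table.
ALLOWED = {
    "size": {"1024x1024", "1024x1536", "1536x1024", "auto"},
    "quality": {"low", "medium", "high", "auto"},
    "output_format": {"png", "webp", "jpeg"},
    "background": {"transparent", "opaque", "auto"},
}


def build_image_request(prompt: str, force: bool = False, **options) -> dict:
    """Build a request body with an image_generation tool."""
    for name, value in options.items():
        if name not in ALLOWED:
            raise ValueError(f"unknown option: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"unsupported {name}: {value!r}")
    body = {
        "model": "auto",
        "input": prompt,
        "tools": [{"type": "image_generation", **options}],
    }
    if force:
        # Ensures the model generates an image rather than replying with text only.
        body["tool_choice"] = {"type": "image_generation"}
    return body


body = build_image_request(
    "A red circle on a white background.", force=True, size="1024x1024"
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the --body argument of mka1 llm responses create.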

Image output structure

When an image is generated, the response output array can contain the following items (the example response above shows only message and image_generation_call):

function_call — the model’s call to the image generation tool with the refined prompt
image_generation_call — the generation result with status: "completed" and result (image URL)
function_call_output — the raw tool output containing the URL
message — the model’s text response describing or referencing the image

Image URLs expire after 1 hour. Download or cache them if you need long-term access.
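
One way to respect the expiry is to record when each response arrived and download the image right away. The following is a minimal Python sketch using only the standard library. It assumes the one-hour window starts when the response is received (this page does not say when the clock starts), and the safety margin, url_is_fresh, and download_image are hypothetical choices for illustration.

```python
import time
import urllib.request

URL_TTL_SECONDS = 60 * 60      # "Image URLs expire after 1 hour"
SAFETY_MARGIN_SECONDS = 60     # assumed buffer for clock skew and slow downloads


def url_is_fresh(received_at: float, now=None) -> bool:
    """True while the URL is still inside its assumed validity window."""
    now = time.time() if now is None else now
    return (now - received_at) < (URL_TTL_SECONDS - SAFETY_MARGIN_SECONDS)


def download_image(url: str, path: str, received_at: float) -> str:
    """Save the image locally so it outlives the URL."""
    if not url_is_fresh(received_at):
        raise RuntimeError("image URL has probably expired; regenerate the image")
    urllib.request.urlretrieve(url, path)
    return path


# Usage: note the arrival time, then download soon after.
received_at = time.time()
print(url_is_fresh(received_at))
```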

Standalone APIs

For direct access without going through the Responses API, MKA1 also provides standalone endpoints:

Text-to-speech API

mka1 llm speech speak \
  --text 'Hello, welcome to the MKA1 platform.' \
  --language en \
  --output-file output.wav

Images API

mka1 llm images create \
  --model auto \
  --prompt 'A futuristic city skyline at dusk' \
  --size 1024x1024 \
  --quality hd

Next steps

Multimodal input — send images, audio, and documents to the model
Speech — transcribe audio and generate speech with the standalone speech endpoints
Advanced voice mode — real-time voice conversations with LiveKit
Generate a response — text requests and multi-turn exchanges
