Multimodal input

The Responses API accepts text, images, audio, and files in a single request. Use structured input with content arrays to combine modalities.

Supported input types

| Type | Content type | Formats | Delivery |
| --- | --- | --- | --- |
| Text | `input_text` | Plain text | Inline |
| Image | `input_image` | JPEG, PNG, WebP, GIF, TIFF | URL, base64 data URI, or `file_id` |
| Audio | `input_audio` | WAV, MP3 | Base64 |
| Document | `input_file` | PDF, DOCX, XLSX, PPTX, RTF, TXT, CSV | URL, base64 data URI, or `file_id` |
| Video | `input_file` | MP4 | Base64 data URI or `file_id` |
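
As a rough illustration of the table above, the helper below picks a content type from a file extension. The function name `content_type_for` is hypothetical and not part of the `mka1` CLI; the mapping only mirrors the table and is a sketch, not an authoritative list.

```shell
# Hypothetical helper (not part of mka1): map a filename to the content
# type listed in the table of supported input types.
content_type_for() {
  local ext
  # Lowercase the extension so "PHOTO.JPG" and "photo.jpg" take the same branch.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|jpeg|png|webp|gif|tif|tiff) echo "input_image" ;;
    wav|mp3)                        echo "input_audio" ;;
    pdf|docx|xlsx|pptx|rtf|txt|csv) echo "input_file" ;;
    mp4)                            echo "input_file" ;;  # video is sent as input_file
    *) echo "unsupported file type: $1" >&2; return 1 ;;
  esac
}
```

For example, `content_type_for report.pdf` prints `input_file`, which is the value to use for `"type"` in the content array.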

Image input

Send an image for the model to describe, analyze, or answer questions about. Provide the image as a URL, a base64 data URI, or the `file_id` of a previously uploaded file.

Image via URL

mka1 llm responses create \
  -H 'X-On-Behalf-Of: <end-user-id>' \
  --body '{
    "model": "auto",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": [
          { "type": "input_text", "text": "Describe what you see in this image." },
          {
            "type": "input_image",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
          }
        ]
      }
    ]
  }'

Image via base64

Encode the image as a data URI with the appropriate MIME type.

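# Encode as a single line: GNU base64 wraps output at 76 columns by default,
# and embedded newlines would break the JSON string below.
IMAGE_B64=$(base64 < photo.jpg | tr -d '\n')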

mka1 llm responses create \
  --body "{
    \"model\": \"auto\",
    \"input\": [
      {
        \"type\": \"message\",
        \"role\": \"user\",
        \"content\": [
          { \"type\": \"input_text\", \"text\": \"What is in this photo?\" },
          {
            \"type\": \"input_image\",
            \"image_url\": \"data:image/jpeg;base64,${IMAGE_B64}\"
          }
        ]
      }
    ]
  }"
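
The data URI must carry a MIME type that matches the image format. One way to avoid hardcoding it is sketched below: a hypothetical `image_data_uri` function (an illustration, not part of the `mka1` CLI) that derives the MIME type from the file extension and emits the encoded image on a single line.

```shell
# Hypothetical helper (not part of mka1): print a base64 data URI for an image.
image_data_uri() {
  local file="$1" mime
  case "$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')" in
    jpg|jpeg) mime="image/jpeg" ;;
    png)      mime="image/png" ;;
    webp)     mime="image/webp" ;;
    gif)      mime="image/gif" ;;
    tif|tiff) mime="image/tiff" ;;
    *) echo "unsupported image type: $file" >&2; return 1 ;;
  esac
  # Read from stdin and strip newlines so the result is one line with both
  # GNU and BSD/macOS base64 (GNU wraps at 76 columns by default).
  printf 'data:%s;base64,%s' "$mime" "$(base64 < "$file" | tr -d '\n')"
}
```

The output of `image_data_uri photo.jpg` can be substituted for the `image_url` value in the request above.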

Image via file_id

Upload an image with the Files API first, then reference it by ID.

# Upload the image
FILE_ID=$(mka1 llm files upload \
  --file @photo.jpg \
  --purpose assistants | jq -r '.id')

# Use the file_id
mka1 llm responses create \
  --body "{
    \"model\": \"auto\",
    \"input\": [
      {
        \"type\": \"message\",
        \"role\": \"user\",
        \"content\": [
          { \"type\": \"input_text\", \"text\": \"Describe this image.\" },
          { \"type\": \"input_image\", \"file_id\": \"${FILE_ID}\" }
        ]
      }
    ]
  }"

Audio input

Send audio for the model to process. The audio is automatically transcribed and the model responds to the spoken content. Supported formats: WAV and MP3 (max 25 MB).

# Encode as a single line; wrapped base64 output would break the JSON string.
AUDIO_B64=$(base64 < recording.wav | tr -d '\n')

mka1 llm responses create \
  --body "{
    \"model\": \"auto\",
    \"input\": [
      {
        \"type\": \"message\",
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"input_audio\",
            \"input_audio\": {
              \"data\": \"${AUDIO_B64}\",
              \"format\": \"wav\"
            }
          }
        ]
      }
    ]
  }"

For example, sending a WAV file containing “Hello, how are you today?” returns a response like this:

{
  "status": "completed",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Hello! I'm doing well, thank you for asking. I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"
        }
      ]
    }
  ]
}
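
Because audio is limited to WAV and MP3 files of at most 25 MB, it can help to validate a file before encoding it. The following is a minimal sketch of such a pre-flight check. The function name `audio_format` is hypothetical, and the check assumes that "25 MB" means 25 MiB and that the limit applies to the raw file rather than to the encoded payload; neither is confirmed here.

```shell
# Hypothetical pre-flight check (not part of mka1).
# Assumption: "25 MB" is interpreted as 25 MiB of raw, unencoded audio.
MAX_AUDIO_BYTES=$((25 * 1024 * 1024))

# Prints the value for the "format" field, or fails with a message on stderr.
audio_format() {
  local file="$1" size
  # wc pads its output with spaces on some systems, so strip them.
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$MAX_AUDIO_BYTES" ]; then
    echo "audio file is $size bytes; limit is $MAX_AUDIO_BYTES" >&2
    return 1
  fi
  case "$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')" in
    wav) echo "wav" ;;
    mp3) echo "mp3" ;;
    *) echo "unsupported audio format: $file" >&2; return 1 ;;
  esac
}
```

Base64 encoding enlarges the data by about one third, so a file close to the limit produces a request body of roughly 33 MB.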

Document input

Send documents for the model to read and reason over. PDFs and scanned documents are processed with OCR automatically; no extra configuration is needed.

Document via URL

mka1 llm responses create \
  --body '{
    "model": "auto",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": [
          { "type": "input_text", "text": "Summarize this document in three bullet points." },
          {
            "type": "input_file",
            "file_url": "https://example.com/report.pdf",
            "filename": "report.pdf"
          }
        ]
      }
    ]
  }'

Document via base64

Encode the file as a data URI. Include the MIME type so the API can route it to the correct processor.

# Encode as a single line; wrapped base64 output would break the JSON string.
PDF_B64=$(base64 < contract.pdf | tr -d '\n')

mka1 llm responses create \
  --body "{
    \"model\": \"auto\",
    \"input\": [
      {
        \"type\": \"message\",
        \"role\": \"user\",
        \"content\": [
          { \"type\": \"input_text\", \"text\": \"What are the key terms in this contract?\" },
          {
            \"type\": \"input_file\",
            \"file_data\": \"data:application/pdf;base64,${PDF_B64}\",
            \"filename\": \"contract.pdf\"
          }
        ]
      }
    ]
  }"

Scanned documents and OCR

Scanned PDFs and images of documents are processed automatically. The API uses OCR to extract text from:

- Scanned PDF pages (converted to images at 150 DPI, then OCR’d)
- Photos of documents (JPEG, PNG, TIFF)
- Office files (DOCX, XLSX, PPTX; converted to PDF first, then OCR’d)

Multi-page documents are processed in parallel. The extracted text is returned as Markdown and passed to the model for reasoning. No special parameters are needed — just send the file as input_file and the pipeline handles detection, conversion, and OCR.
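
To get a feel for what 150 DPI means in practice, the arithmetic below estimates the pixel dimensions of a rasterized page. It is a sketch under the assumption that the pipeline scales linearly from PDF points (72 points per inch); the function name `page_pixels` is hypothetical, and the actual rasterizer may round differently.

```shell
# Estimate raster dimensions for a PDF page rendered at a given DPI.
# PDF page sizes are expressed in points; 72 points = 1 inch.
# Assumption: linear scaling, rounded to the nearest pixel.
page_pixels() {
  local width_pt="$1" height_pt="$2" dpi="${3:-150}"
  # Add half the divisor (36) before dividing so integer math rounds to nearest.
  echo "$(( ($width_pt * $dpi + 36) / 72 ))x$(( ($height_pt * $dpi + 36) / 72 ))"
}

page_pixels 612 792   # US Letter (612 x 792 pt) at 150 DPI: prints 1275x1650
```

By the same arithmetic, an A4 page (595 x 842 pt) comes out at about 1240 x 1754 pixels.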

Supported document formats

| Format | MIME type | Processing |
| --- | --- | --- |
| PDF | `application/pdf` | OCR per page at 150 DPI |
| JPEG / PNG / TIFF / WebP / GIF | `image/*` | Direct OCR |
| Word (.doc, .docx) | `application/msword`, `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Convert to PDF, then OCR |
| Excel (.xls, .xlsx) | `application/vnd.ms-excel`, `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | Convert to PDF, then OCR |
| PowerPoint (.ppt, .pptx) | `application/vnd.ms-powerpoint`, `application/vnd.openxmlformats-officedocument.presentationml.presentation` | Convert to PDF, then OCR |
| RTF | `application/rtf` | Convert to PDF, then OCR |
| Plain text / CSV | `text/plain`, `text/csv` | Read directly |

Size limit: 30 MB per file.
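
The MIME types and the size limit above can be combined into a small helper that builds the `file_data` value for a document. This is a minimal sketch: the function names `document_mime` and `document_data_uri` are hypothetical, and the check assumes that "30 MB" means 30 MiB measured on the raw file.

```shell
# Hypothetical helpers (not part of mka1).
# Assumption: "30 MB" is interpreted as 30 MiB of raw, unencoded data.
MAX_FILE_BYTES=$((30 * 1024 * 1024))

# Map a document's extension to the MIME type from the table.
document_mime() {
  case "$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')" in
    pdf)  echo "application/pdf" ;;
    doc)  echo "application/msword" ;;
    docx) echo "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ;;
    xls)  echo "application/vnd.ms-excel" ;;
    xlsx) echo "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" ;;
    ppt)  echo "application/vnd.ms-powerpoint" ;;
    pptx) echo "application/vnd.openxmlformats-officedocument.presentationml.presentation" ;;
    rtf)  echo "application/rtf" ;;
    txt)  echo "text/plain" ;;
    csv)  echo "text/csv" ;;
    *)    return 1 ;;
  esac
}

# Print a single-line base64 data URI, or fail if the file is unsupported or too large.
document_data_uri() {
  local file="$1" mime size
  mime=$(document_mime "$file") || { echo "unsupported document type: $file" >&2; return 1; }
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -le "$MAX_FILE_BYTES" ] || { echo "file exceeds $MAX_FILE_BYTES bytes: $file" >&2; return 1; }
  printf 'data:%s;base64,%s' "$mime" "$(base64 < "$file" | tr -d '\n')"
}
```

For large files, an inline base64 string passed on the command line may exceed the operating system's limit on argument length. Delivering the document by URL or `file_id` avoids that.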

Mixed input

Combine multiple content types in a single message. The model sees all inputs together and can reason across them.

mka1 llm responses create \
  --body '{
    "model": "auto",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": [
          { "type": "input_text", "text": "Compare the chart in the image with the data in the spreadsheet. Are the numbers consistent?" },
          {
            "type": "input_image",
            "image_url": "https://example.com/chart.png"
          },
          {
            "type": "input_file",
            "file_url": "https://example.com/data.xlsx",
            "filename": "data.xlsx"
          }
        ]
      }
    ]
  }'

Next steps

- Multimodal output: generate audio and images in responses
- Files and vector stores: upload and manage files for reuse
- Generate a response: text-only requests and multi-turn exchanges
- Advanced voice mode: real-time voice conversations with LiveKit
