input with content arrays to combine modalities.
Supported input types
| Type | Content type | Formats | Delivery |
|---|---|---|---|
| Text | input_text | Plain text | Inline |
| Image | input_image | JPEG, PNG, WebP, GIF, TIFF | URL, base64 data URI, or file_id |
| Audio | input_audio | WAV, MP3 | Base64 |
| Document | input_file | PDF, DOCX, XLSX, PPTX, RTF, TXT, CSV | URL, base64 data URI, or file_id |
| Video | input_file | MP4 | Base64 data URI or file_id |
Image input
Send an image for the model to describe, analyze, or answer questions about. Provide the image as a URL, a base64 data URI, or a previously uploadedfile_id.
Image via URL
Image via base64
Encode the image as a data URI with the appropriate MIME type.Image via file_id
Upload an image with the Files API first, then reference it by ID.Audio input
Send audio for the model to process. The audio is automatically transcribed and the model responds to the spoken content. Supported formats: WAV and MP3 (max 25 MB).Document input
Send documents for the model to read and reason over. PDF and scanned documents are automatically processed with OCR — no extra configuration needed.Document via URL
Document via base64
Encode the file as a data URI. Include the MIME type so the API can route it to the correct processor.Scanned documents and OCR
Scanned PDFs and images of documents are processed automatically. The API uses OCR to extract text from:- Scanned PDF pages (converted to images at 150 DPI, then OCR’d)
- Photos of documents (JPEG, PNG, TIFF)
- Office files (DOCX, XLSX, PPTX — converted to PDF first, then OCR’d)
input_file and the pipeline handles detection, conversion, and OCR.
Supported document formats
| Format | MIME type | Processing |
|---|---|---|
application/pdf | OCR per page at 150 DPI | |
| JPEG / PNG / TIFF / WebP / GIF | image/* | Direct OCR |
| Word (.doc, .docx) | application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document | Convert to PDF, then OCR |
| Excel (.xls, .xlsx) | application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Convert to PDF, then OCR |
| PowerPoint (.ppt, .pptx) | application/vnd.ms-powerpoint, application/vnd.openxmlformats-officedocument.presentationml.presentation | Convert to PDF, then OCR |
| RTF | application/rtf | Convert to PDF, then OCR |
| Plain text / CSV | text/plain, text/csv | Read directly |
Mixed input
Combine multiple content types in a single message. The model sees all inputs together and can reason across them.Next steps
- Multimodal output — generate audio and images in responses
- Files and vector stores — upload and manage files for reuse
- Generate a response — text-only requests and multi-turn exchanges
- Advanced voice mode — real-time voice conversations with LiveKit