The MKA1 API enforces rate limits on a per-key basis. Each API key has its own quota configured independently — requests per second, minute, hour, or day. When a key exceeds its limit, the gateway returns 429 Too Many Requests before the request reaches the model. No tokens are consumed and no usage is billed.

Rate limit configuration

Rate limits are set per API key at provisioning time. The available granularities are:
Window        Example
Per second    10 requests/second
Per minute    60 requests/minute
Per hour      1,000 requests/hour
Per day       10,000 requests/day
Limits can be combined — for example, 10 requests/second and 5,000 requests/hour on the same key.
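When limits are combined, a request must have capacity in every configured window at once. The sketch below is an illustrative client-side model of that "all windows must allow it" rule, using simple fixed-window counters; it is not the gateway's actual implementation, and the class and window shapes are hypothetical.

```typescript
// Illustrative model of combined per-key limits: a request is allowed
// only if EVERY configured window still has capacity. This is a sketch,
// not the gateway's real algorithm.
type Window = { limit: number; periodMs: number; count: number; resetAt: number };

class MultiWindowLimiter {
  private windows: Window[];

  constructor(limits: { limit: number; periodMs: number }[]) {
    this.windows = limits.map((l) => ({ ...l, count: 0, resetAt: 0 }));
  }

  // Returns true if the request is allowed under every configured window.
  tryAcquire(now: number = Date.now()): boolean {
    for (const w of this.windows) {
      if (now >= w.resetAt) {
        // Fixed window has elapsed: reset the counter for a fresh window.
        w.count = 0;
        w.resetAt = now + w.periodMs;
      }
    }
    // If any window is already full, the whole request is rejected (429).
    if (this.windows.some((w) => w.count >= w.limit)) return false;
    for (const w of this.windows) w.count++;
    return true;
  }
}
```

For example, with 2 requests/second and 3 requests/hour on the same key, a third request in the same second is rejected by the per-second window even though the hourly window still has room.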

Demonstration: rate limiting in action

The following example uses a real API key configured with a rate limit of 1 request per minute. The first request succeeds, and the second request — sent immediately after — is rejected with HTTP 429.

Request 1 — succeeds

curl https://apigw.mka1.com/api/v1/llm/responses \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer mk-YfcgFr...BsFLzg' \
  --data '{
    "model": "auto",
    "input": "Hello, how are you?",
    "stream": false
  }'
Response: HTTP 200
{
  "id": "resp_5ab19d58984940c88a971c6829ae201c",
  "object": "response",
  "status": "completed",
  "model": "auto",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [{ "type": "output_text", "text": "Hello! I'm doing well..." }]
    }
  ]
}

Request 2 — rate limited (sent immediately after)

curl https://apigw.mka1.com/api/v1/llm/responses \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer mk-YfcgFr...BsFLzg' \
  --data '{
    "model": "auto",
    "input": "This should be rate limited",
    "stream": false
  }'
Response: HTTP 429
{
  "error": "Rate limit exceeded"
}
The gateway rejects the request before it reaches the model — no tokens are consumed and no usage is billed.

Programmatic example

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: '<mka1-api-key>',
  baseURL: 'https://apigw.mka1.com/api/v1/llm/',
});

// First request — succeeds
const response = await openai.responses.create({
  model: 'auto',
  input: 'Hello',
  stream: false,
});
console.log(response.status); // "completed"

// Second request — rate limited
try {
  await openai.responses.create({
    model: 'auto',
    input: 'This will be rate limited',
    stream: false,
  });
} catch (err) {
  if (err instanceof OpenAI.RateLimitError) {
    console.log(err.status);   // 429
    console.log(err.message);  // "429 Rate limit exceeded"
    console.log(err.error);    // { error: "Rate limit exceeded" }
  }
}

The 429 response

When a request is rate limited, the gateway returns:
Field            Value
HTTP status      429 Too Many Requests
Content-Type     application/json
Body             {"error": "Rate limit exceeded"}
Tokens consumed  0 (the request never reaches the model)
Usage billed     None
The OpenAI SDK surfaces this as an OpenAI.RateLimitError with err.status === 429. The MKA1 SDK throws an APIError with err.statusCode === 429.
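Because the two SDKs expose the status under different property names (err.status vs. err.statusCode), retry logic that needs to work with either can normalize the check. The helper below is a small illustrative sketch; only the two property names described above are assumed.

```typescript
// Detects a rate-limit error from either SDK: the OpenAI SDK exposes
// err.status, while the MKA1 SDK's APIError exposes err.statusCode.
function isRateLimitError(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false;
  const e = err as { status?: number; statusCode?: number };
  return e.status === 429 || e.statusCode === 429;
}
```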

Handling rate limits in your application

When your application receives a 429 response, retry with exponential backoff:
async function requestWithRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Retry only on 429, and only while attempts remain.
      if (err.status === 429 && attempt < maxRetries - 1) {
        const waitMs = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
        await new Promise((r) => setTimeout(r, waitMs));
        continue;
      }
      throw err;
    }
  }
  throw new Error('unreachable'); // the loop always returns or throws
}

const response = await requestWithRetry(() =>
  openai.responses.create({ model: 'auto', input: 'Hello', stream: false })
);
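A common refinement to the fixed 1s/2s/4s schedule is to add jitter, so that many clients hitting the same limit do not all retry at the same instant. This is a general client-side pattern, not an MKA1 requirement; the helper below is a sketch of "full jitter", drawing a random delay from [0, base × 2^attempt].

```typescript
// Exponential backoff delay with optional "full jitter": without jitter
// the delay is base * 2^attempt (1s, 2s, 4s, ...); with jitter it is a
// random value in [0, base * 2^attempt], which spreads out retries from
// many clients that were rate limited at the same moment.
function backoffDelayMs(attempt: number, baseMs = 1000, jitter = false): number {
  const cap = baseMs * Math.pow(2, attempt);
  return jitter ? Math.random() * cap : cap;
}
```

To use it, replace the waitMs computation in the retry loop above with backoffDelayMs(attempt, 1000, true).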

See also