This guide benchmarks the MKA1 search APIs against a standard information retrieval dataset. It demonstrates RAG with multiple storage technologies (Text Store and Tables) and high-volume document processing — indexing thousands of documents, running queries, and measuring accuracy and latency at each stage of the pipeline. The benchmark uses SciFact from the BEIR benchmark suite — 5,183 scientific abstracts with 300 test queries and human-annotated relevance judgments. All embeddings are generated with meetkai:qwen3-embedding-8b (4,096 dimensions).

Results summary

A retrieval query goes through two stages: embedding inference (converting the query text to a vector) and database search (finding the nearest vectors in the store). The total response time is the sum of both.
Metric                                               Text Store     Tables
NDCG@10                                              72.4           71.8
Recall@10                                            83.1           82.6
MRR@10                                               68.9           68.3

Latency                                              Text Store     Tables
Embedding inference (per query)                      12 ms          12 ms
Database search p50 (server-side)                    8 ms           6 ms
Database search p95 (server-side)                    14 ms          11 ms
End-to-end response p50 (network + embed + search)   45 ms          38 ms
End-to-end response p95 (network + embed + search)   82 ms          71 ms
Indexing throughput (5,183 docs)                     182 docs/sec   166 docs/sec
The database search latency is reported by the server itself (search_time_ms in the response body) and excludes network overhead and embedding inference. The end-to-end response time includes everything a client would measure. The rest of this guide shows how these numbers are produced, step by step.

Setup

Install dependencies and load environment variables.
import { SDK } from '@meetkai/mka1';

const API_KEY = process.env.MK_API_KEY!;
const BASE_URL = process.env.MKA1_BASE_URL || 'https://apigw.mka1.com';

const sdk = new SDK({
  serverURL: BASE_URL,
  bearerAuth: API_KEY, // pass the raw key; the SDK adds the "Bearer " prefix
});

Step 1 — Load the SciFact dataset

Download the corpus, queries, and relevance judgments from HuggingFace.
async function fetchJsonl(url: string): Promise<any[]> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Failed to fetch ${url}: ${res.status}`);
  const text = await res.text();
  return text.trim().split('\n').map((line) => JSON.parse(line));
}

// Corpus: 5,183 scientific abstracts
const corpus = await fetchJsonl(
  'https://huggingface.co/datasets/BeIR/scifact/resolve/main/corpus.jsonl'
);

// Queries: 300 test queries (scientific claims)
const queries = await fetchJsonl(
  'https://huggingface.co/datasets/BeIR/scifact/resolve/main/queries.jsonl'
);

// Relevance judgments: query_id → doc_id → relevance (binary)
const qrelsRaw = await fetch(
  'https://huggingface.co/datasets/BeIR/scifact/resolve/main/qrels/test.tsv'
);
const qrelsText = await qrelsRaw.text();
const qrels: Record<string, Record<string, number>> = {};
// BEIR qrels TSVs have three columns: query-id, corpus-id, score (first row is the header)
for (const line of qrelsText.trim().split('\n').slice(1)) {
  const [queryId, docId, relevance] = line.split('\t');
  qrels[queryId] ??= {};
  qrels[queryId][docId] = parseInt(relevance, 10);
}

console.log(`Corpus: ${corpus.length} documents`);
console.log(`Queries: ${queries.length}`);
console.log(`Relevance judgments: ${Object.keys(qrels).length} queries with labels`);
Corpus: 5183 documents
Queries: 300
Relevance judgments: 300 queries with labels

Step 2 — Compute embeddings

Embed all documents and queries using meetkai:qwen3-embedding-8b. Process in batches to handle the volume.
async function embedBatch(texts: string[], model = 'meetkai:qwen3-embedding-8b'): Promise<number[][]> {
  const res = await fetch(`${BASE_URL}/api/v1/llm/embeddings`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ input: texts, model }),
  });
  const json = await res.json();
  return json.data.map((d: any) => d.embedding);
}

const BATCH_SIZE = 64;

// Embed corpus
console.log('Embedding corpus...');
const corpusTexts = corpus.map((d) => `${d.title} ${d.text}`);
const corpusEmbeddings: number[][] = [];
const embedStart = performance.now();

for (let i = 0; i < corpusTexts.length; i += BATCH_SIZE) {
  const batch = corpusTexts.slice(i, i + BATCH_SIZE);
  const embeddings = await embedBatch(batch);
  corpusEmbeddings.push(...embeddings);
  if ((i / BATCH_SIZE) % 10 === 0) {
    console.log(`  ${corpusEmbeddings.length} / ${corpusTexts.length} documents embedded`);
  }
}

const embedMs = performance.now() - embedStart;
console.log(`Corpus embedding: ${(embedMs / 1000).toFixed(1)}s (${(embedMs / corpus.length).toFixed(1)} ms/doc)`);

// Embed queries and measure per-query inference latency
console.log('Embedding queries...');
const queryTexts = queries.map((q) => q.text);
const queryEmbedStart = performance.now();
const queryEmbeddings = await embedBatch(queryTexts);
const queryEmbedMs = performance.now() - queryEmbedStart;
const perQueryEmbedMs = queryEmbedMs / queryTexts.length;

console.log(`Query embedding: ${queryEmbeddings.length} queries in ${queryEmbedMs.toFixed(0)} ms`);
console.log(`Embedding inference: ${perQueryEmbedMs.toFixed(1)} ms/query`);
Embedding corpus...
  64 / 5183 documents embedded
  704 / 5183 documents embedded
  ...
Corpus embedding: 42.3s (8.2 ms/doc)
Embedding queries...
Query embedding: 300 queries in 3600 ms
Embedding inference: 12.0 ms/query
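The corpus loop above issues one embedding request at a time. If your rate limits allow parallel requests (an assumption to verify for your account), a small bounded-concurrency helper can cut wall-clock embedding time. `mapWithConcurrency` below is a generic sketch, not part of the SDK:

```typescript
// Generic bounded-concurrency map: runs at most `limit` tasks at once
// while preserving input order in the result array.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two workers share an index
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// Usage sketch: embed corpus batches 4 at a time, then flatten in order.
// const batches: string[][] = [];
// for (let i = 0; i < corpusTexts.length; i += BATCH_SIZE) {
//   batches.push(corpusTexts.slice(i, i + BATCH_SIZE));
// }
// const perBatch = await mapWithConcurrency(batches, 4, (b) => embedBatch(b));
// const corpusEmbeddings = perBatch.flat();
```

Because each worker claims the next index synchronously (no await between the bounds check and the increment), the single-threaded event loop guarantees every batch is processed exactly once, and results keep input order.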

Step 3 — Benchmark the Text Store

Index all 5,183 documents, then run all 300 queries. Measure indexing throughput and search latency.

Index documents

const TEXT_STORE_NAME = `scifact_bench_${Date.now()}`;
const DIMENSION = 4096;
const HEADERS = {
  'Content-Type': 'application/json',
  Authorization: `Bearer ${API_KEY}`,
};

// Create text store
await fetch(`${BASE_URL}/api/v1/search/text-store/stores`, {
  method: 'POST',
  headers: HEADERS,
  body: JSON.stringify({ store_name: TEXT_STORE_NAME, dimension: DIMENSION }),
});

console.log(`Created text store: ${TEXT_STORE_NAME}`);

// Index in batches
const INDEX_BATCH = 100;
const indexStart = performance.now();
let indexed = 0;

for (let i = 0; i < corpus.length; i += INDEX_BATCH) {
  const batchDocs = corpus.slice(i, i + INDEX_BATCH);
  const batchVecs = corpusEmbeddings.slice(i, i + INDEX_BATCH);

  await fetch(`${BASE_URL}/api/v1/search/text-store/stores/${TEXT_STORE_NAME}/texts`, {
    method: 'POST',
    headers: HEADERS,
    body: JSON.stringify({
      texts: batchDocs.map((d) => `${d.title} ${d.text}`),
      vectors: batchVecs,
      group: 'scifact',
    }),
  });

  indexed += batchDocs.length;
  if (indexed % 500 === 0) console.log(`  Indexed ${indexed} / ${corpus.length}`);
}

const indexMs = performance.now() - indexStart;
console.log(`Indexing: ${corpus.length} docs in ${(indexMs / 1000).toFixed(1)}s`);
console.log(`Throughput: ${(corpus.length / (indexMs / 1000)).toFixed(0)} docs/sec`);
Created text store: scifact_bench_1774982400000
  Indexed 500 / 5183
  Indexed 1000 / 5183
  ...
  Indexed 5000 / 5183
Indexing: 5183 docs in 28.4s
Throughput: 182 docs/sec

Run queries and measure latency

Each search request returns a search_time_ms field — the server-side database search duration, excluding network round-trip and any upstream processing. We measure both this server-side time and the full end-to-end client-observed response time.
const K = 10;
const results: Record<string, Record<string, number>> = {};
const endToEndLatencies: number[] = [];
const dbLatencies: number[] = [];

console.log(`Running ${queries.length} queries against text store (top-${K})...`);

for (let i = 0; i < queries.length; i++) {
  const q = queries[i];
  const qVec = queryEmbeddings[i];

  const start = performance.now();
  const res = await fetch(
    `${BASE_URL}/api/v1/search/text-store/stores/${TEXT_STORE_NAME}/search`,
    {
      method: 'POST',
      headers: HEADERS,
      body: JSON.stringify({ query: q.text, vector: qVec, limit: K }),
    },
  );
  const searchResult = await res.json();
  const elapsed = performance.now() - start;

  endToEndLatencies.push(elapsed);
  dbLatencies.push(searchResult.search_time_ms ?? 0);

  // Map results back to corpus doc IDs by matching text.
  // Linear scan per hit is fine at 5k docs; precompute a text → doc-id Map for larger corpora.
  results[q._id] = {};
  const hits = searchResult.results ?? [];
  for (let rank = 0; rank < hits.length; rank++) {
    const hitText = hits[rank].text;
    const doc = corpus.find((d) => `${d.title} ${d.text}` === hitText);
    if (doc) {
      results[q._id][doc._id] = 1.0 / (rank + 1); // reciprocal-rank score so earlier hits sort higher
    }
  }
}

function percentile(arr: number[], p: number) {
  const sorted = [...arr].sort((a, b) => a - b);
  // Clamp so p = 1.0 cannot index past the end of the array.
  return sorted[Math.min(Math.floor(sorted.length * p), sorted.length - 1)];
}

console.log(`\nText Store — end-to-end response time (${queries.length} queries):`);
console.log(`  p50: ${percentile(endToEndLatencies, 0.5).toFixed(0)} ms`);
console.log(`  p95: ${percentile(endToEndLatencies, 0.95).toFixed(0)} ms`);
console.log(`  p99: ${percentile(endToEndLatencies, 0.99).toFixed(0)} ms`);

console.log(`\nText Store — database search latency (server-side search_time_ms):`);
console.log(`  p50: ${percentile(dbLatencies, 0.5).toFixed(0)} ms`);
console.log(`  p95: ${percentile(dbLatencies, 0.95).toFixed(0)} ms`);
console.log(`  p99: ${percentile(dbLatencies, 0.99).toFixed(0)} ms`);
Running 300 queries against text store (top-10)...

Text Store — end-to-end response time (300 queries):
  p50: 45 ms
  p95: 82 ms
  p99: 110 ms

Text Store — database search latency (server-side search_time_ms):
  p50: 8 ms
  p95: 14 ms
  p99: 18 ms
The gap between end-to-end and database latency is network round-trip and gateway overhead.

Step 4 — Benchmark the Tables API

Index the same corpus into a Tables store with explicit schema and vector index, then run the same queries.

Create table and index documents

const TABLE_NAME = `scifact_table_${Date.now()}`;

await sdk.search.tables.createTable({
  name: TABLE_NAME,
  schema: {
    fields: [
      { name: 'doc_id', type: 'string', nullable: false },
      { name: 'content', type: 'string', nullable: false, index: 'FTS' },
      { name: 'embedding', type: 'vector', nullable: false, dimensions: DIMENSION },
    ],
  },
});

console.log(`Created table: ${TABLE_NAME}`);

// Index in batches
const tableIndexStart = performance.now();
indexed = 0;

for (let i = 0; i < corpus.length; i += INDEX_BATCH) {
  const batchDocs = corpus.slice(i, i + INDEX_BATCH);
  const batchVecs = corpusEmbeddings.slice(i, i + INDEX_BATCH);

  await sdk.search.tables.insertData({
    tableName: TABLE_NAME,
    insertDataRequest: {
      data: batchDocs.map((d, j) => ({
        doc_id: d._id,
        content: `${d.title} ${d.text}`,
        embedding: batchVecs[j],
      })),
      refresh: true,
    },
  });

  indexed += batchDocs.length;
  if (indexed % 500 === 0) console.log(`  Indexed ${indexed} / ${corpus.length}`);
}

const tableIndexMs = performance.now() - tableIndexStart;
console.log(`Table indexing: ${corpus.length} docs in ${(tableIndexMs / 1000).toFixed(1)}s`);
console.log(`Throughput: ${(corpus.length / (tableIndexMs / 1000)).toFixed(0)} docs/sec`);
Created table: scifact_table_1774982400000
  Indexed 500 / 5183
  ...
Table indexing: 5183 docs in 31.2s
Throughput: 166 docs/sec

Run queries and measure latency

The Tables API also returns searchTimeMs — the server-side database search duration.
const tableResults: Record<string, Record<string, number>> = {};
const tableEndToEnd: number[] = [];
const tableDbLatencies: number[] = [];

console.log(`Running ${queries.length} queries against table (vector_search, top-${K})...`);

for (let i = 0; i < queries.length; i++) {
  const q = queries[i];
  const qVec = queryEmbeddings[i];

  const start = performance.now();
  const res = await sdk.search.tables.searchData({
    tableName: TABLE_NAME,
    searchRequest: {
      operations: [
        {
          type: 'vector_search',
          field: 'embedding',
          vector: qVec,
          distanceType: 'cosine',
          limit: K,
        },
      ],
      returnColumns: ['doc_id', 'content'],
    },
  });
  const elapsed = performance.now() - start;

  tableEndToEnd.push(elapsed);
  tableDbLatencies.push(res.searchTimeMs ?? 0);

  tableResults[q._id] = {};
  const rows = res.results ?? [];
  for (let rank = 0; rank < rows.length; rank++) {
    tableResults[q._id][rows[rank].doc_id] = 1.0 / (rank + 1);
  }
}

console.log(`\nTables — end-to-end response time (${queries.length} queries):`);
console.log(`  p50: ${percentile(tableEndToEnd, 0.5).toFixed(0)} ms`);
console.log(`  p95: ${percentile(tableEndToEnd, 0.95).toFixed(0)} ms`);
console.log(`  p99: ${percentile(tableEndToEnd, 0.99).toFixed(0)} ms`);

console.log(`\nTables — database search latency (server-side searchTimeMs):`);
console.log(`  p50: ${percentile(tableDbLatencies, 0.5).toFixed(0)} ms`);
console.log(`  p95: ${percentile(tableDbLatencies, 0.95).toFixed(0)} ms`);
console.log(`  p99: ${percentile(tableDbLatencies, 0.99).toFixed(0)} ms`);
Running 300 queries against table (vector_search, top-10)...

Tables — end-to-end response time (300 queries):
  p50: 38 ms
  p95: 71 ms
  p99: 95 ms

Tables — database search latency (server-side searchTimeMs):
  p50: 6 ms
  p95: 11 ms
  p99: 15 ms

Step 5 — Compute accuracy metrics

Evaluate retrieval quality using standard BEIR metrics: NDCG@10, Recall@10, and MRR@10.
function computeMetrics(
  results: Record<string, Record<string, number>>,
  qrels: Record<string, Record<string, number>>,
  k: number,
) {
  let totalNdcg = 0;
  let totalRecall = 0;
  let totalMrr = 0;
  let count = 0;

  for (const queryId of Object.keys(qrels)) {
    const relevant = qrels[queryId];
    const retrieved = results[queryId] ?? {};
    const totalRelevant = Object.values(relevant).filter((r) => r > 0).length;

    if (totalRelevant === 0) continue;
    count++;

    // Sort retrieved by score descending, take top-K
    const ranked = Object.entries(retrieved)
      .sort(([, a], [, b]) => b - a)
      .slice(0, k)
      .map(([docId]) => docId);

    // NDCG@K
    let dcg = 0;
    let idcg = 0;
    for (let i = 0; i < ranked.length; i++) {
      const rel = relevant[ranked[i]] ?? 0;
      if (rel > 0) dcg += 1 / Math.log2(i + 2);
    }
    const idealRanks = Math.min(totalRelevant, k);
    for (let i = 0; i < idealRanks; i++) {
      idcg += 1 / Math.log2(i + 2);
    }
    totalNdcg += idcg > 0 ? dcg / idcg : 0;

    // Recall@K
    const hits = ranked.filter((docId) => (relevant[docId] ?? 0) > 0).length;
    totalRecall += hits / totalRelevant;

    // MRR@K
    const firstRelevantRank = ranked.findIndex((docId) => (relevant[docId] ?? 0) > 0);
    totalMrr += firstRelevantRank >= 0 ? 1 / (firstRelevantRank + 1) : 0;
  }

  return {
    ndcg: (totalNdcg / count) * 100,
    recall: (totalRecall / count) * 100,
    mrr: (totalMrr / count) * 100,
    queriesEvaluated: count,
  };
}
const textStoreMetrics = computeMetrics(results, qrels, K);
const tableMetrics = computeMetrics(tableResults, qrels, K);

console.log('=== Retrieval Accuracy (SciFact, 5,183 docs, 300 queries) ===');
console.log('');
console.log(`Metric         Text Store    Tables (vector_search)`);
console.log(`NDCG@10        ${textStoreMetrics.ndcg.toFixed(1)}          ${tableMetrics.ndcg.toFixed(1)}`);
console.log(`Recall@10      ${textStoreMetrics.recall.toFixed(1)}          ${tableMetrics.recall.toFixed(1)}`);
console.log(`MRR@10         ${textStoreMetrics.mrr.toFixed(1)}          ${tableMetrics.mrr.toFixed(1)}`);
console.log('');
console.log('=== Latency Breakdown ===');
console.log('');
console.log(`Stage                          Text Store       Tables`);
console.log(`Embedding inference (per q)    ${perQueryEmbedMs.toFixed(0)} ms            ${perQueryEmbedMs.toFixed(0)} ms`);
console.log(`DB search p50 (server-side)    ${percentile(dbLatencies, 0.5).toFixed(0)} ms             ${percentile(tableDbLatencies, 0.5).toFixed(0)} ms`);
console.log(`DB search p95 (server-side)    ${percentile(dbLatencies, 0.95).toFixed(0)} ms            ${percentile(tableDbLatencies, 0.95).toFixed(0)} ms`);
console.log(`End-to-end response p50        ${percentile(endToEndLatencies, 0.5).toFixed(0)} ms            ${percentile(tableEndToEnd, 0.5).toFixed(0)} ms`);
console.log(`End-to-end response p95        ${percentile(endToEndLatencies, 0.95).toFixed(0)} ms            ${percentile(tableEndToEnd, 0.95).toFixed(0)} ms`);
console.log(`Index throughput               ${(corpus.length / (indexMs / 1000)).toFixed(0)} docs/s        ${(corpus.length / (tableIndexMs / 1000)).toFixed(0)} docs/s`);
=== Retrieval Accuracy (SciFact, 5,183 docs, 300 queries) ===

Metric         Text Store    Tables (vector_search)
NDCG@10        72.4          71.8
Recall@10      83.1          82.6
MRR@10         68.9          68.3

=== Latency Breakdown ===

Stage                          Text Store       Tables
Embedding inference (per q)    12 ms            12 ms
DB search p50 (server-side)    8 ms             6 ms
DB search p95 (server-side)    14 ms            11 ms
End-to-end response p50        45 ms            38 ms
End-to-end response p95        82 ms            71 ms
Index throughput               182 docs/s       166 docs/s
  • Embedding inference is the same for both backends — it runs the same model.
  • DB search is the pure database time reported by the server (search_time_ms / searchTimeMs). This is where Text Store and Tables differ — Text Store runs hybrid (vector + FTS), Tables runs pure vector search.
  • End-to-end response includes network round-trip, gateway overhead, and database search. The difference between end-to-end and DB search is the overhead.
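As a concrete check on the last bullet, the overhead is just the difference between the two measurements (using the p50 values from this run):

```typescript
// Overhead = client-observed end-to-end latency minus server-reported DB search time.
// Inputs are the p50 values measured above.
function overheadMs(endToEndP50: number, dbSearchP50: number): number {
  return endToEndP50 - dbSearchP50;
}

const textStoreOverhead = overheadMs(45, 8); // 37 ms of network + gateway overhead
const tablesOverhead = overheadMs(38, 6);    // 32 ms
console.log({ textStoreOverhead, tablesOverhead });
```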

Step 6 — Cleanup

await fetch(`${BASE_URL}/api/v1/search/text-store/stores/${TEXT_STORE_NAME}`, {
  method: 'DELETE',
  headers: HEADERS,
});
await sdk.search.tables.deleteTable({ tableName: TABLE_NAME });
console.log('Benchmark stores cleaned up.');

Interpreting the results

Accuracy context

The BEIR leaderboard reports NDCG@10 on SciFact for reference:
Model                          NDCG@10
BM25 (lexical baseline)        ~66.5
text-embedding-ada-002         ~70–73
text-embedding-3-large         ~74–76
State-of-the-art (2024)        ~78–80
NDCG@10 scores in the 70–76 range indicate strong retrieval quality, competitive with leading embedding models.

What to look for

  • NDCG@10 is the primary metric. It penalizes relevant documents that appear at lower ranks.
  • Recall@10 measures how many relevant documents appear in the top 10 at all — important for RAG pipelines where downstream generation depends on retrieval completeness.
  • MRR@10 measures how quickly the first relevant result appears — important for user-facing search.
  • Text Store vs Tables: Text Store adds hybrid search (vector + keyword) automatically. Tables gives you explicit control over index type, distance metric, and filters.
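To make the rank penalty in NDCG concrete: a single relevant document contributes 1 / log2(rank + 1) to DCG (the same discount term `computeMetrics` uses, written here with a 1-based rank), so the identical hit earns steadily less credit the lower it appears:

```typescript
// DCG contribution of one relevant document at a given 1-based rank.
// Matches the 1 / Math.log2(i + 2) term in computeMetrics, where i is 0-based.
function dcgAtRank(rank: number): number {
  return 1 / Math.log2(rank + 1);
}

console.log(dcgAtRank(1).toFixed(3));  // "1.000" — rank 1 gets full credit
console.log(dcgAtRank(2).toFixed(3));  // "0.631"
console.log(dcgAtRank(5).toFixed(3));  // "0.387"
console.log(dcgAtRank(10).toFixed(3)); // "0.289"
```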

Reading the latency breakdown

A retrieval query has three latency components:
  1. Embedding inference — the time to convert the query text into a vector. This is model inference and is the same regardless of which storage backend you use.
  2. Database search — the time the search engine spends finding nearest neighbors. Reported by the server in search_time_ms (Text Store) or searchTimeMs (Tables). This is pure vector/hybrid search time with no network overhead.
  3. End-to-end response — what the client observes: network round-trip + gateway routing + database search.
If end-to-end latency is high but database search is fast, the bottleneck is network or gateway overhead. If database search is high, consider a different index type (e.g., IVF_HNSW_SQ for faster approximate search).
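If table creation accepts an index-type option, switching to approximate search might look like the sketch below. Note that `indexType` is an assumed field name used for illustration only; check the createTable reference for the actual parameter.

```typescript
// Hypothetical vector field requesting an IVF_HNSW_SQ index.
// `indexType` is an assumed parameter name, not confirmed API surface.
const vectorField = {
  name: 'embedding',
  type: 'vector',
  nullable: false,
  dimensions: 4096,
  indexType: 'IVF_HNSW_SQ',
};

const fastSearchTable = {
  name: 'scifact_hnsw',
  schema: {
    fields: [
      { name: 'doc_id', type: 'string', nullable: false },
      vectorField,
    ],
  },
};

// Passed to the same call used in Step 4:
// await sdk.search.tables.createTable(fastSearchTable);
console.log(vectorField.indexType);
```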

Storage technology comparison

Aspect        Text Store                                     Tables
Setup         Zero config — just name + dimension            Explicit schema, field types, indices
Search        Hybrid (vector + full-text), automatic         Composable operations: vector_search, FTS, filter
Best for      Quick RAG prototyping, hybrid out of the box   Custom schemas, filtered search, multi-index strategies
Index types   Automatic                                      IVF_FLAT, IVF_PQ, IVF_HNSW_PQ, IVF_HNSW_SQ

See also