This guide benchmarks the MKA1 search APIs against a standard information retrieval dataset.
It demonstrates RAG with multiple storage technologies (Text Store and Tables) and high-volume document processing — indexing thousands of documents, running queries, and measuring accuracy and latency at each stage of the pipeline.

The benchmark uses SciFact from the BEIR benchmark suite — 5,183 scientific abstracts with 300 test queries and human-annotated relevance judgments.
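A minimal loader sketch for the dataset, assuming the standard BEIR file layout (corpus.jsonl, queries.jsonl, qrels/test.tsv) unpacked under ./scifact; this produces the corpus, queries, and qrels variables used throughout the guide:

```typescript
import { readFile } from 'node:fs/promises';

const readJsonl = async (path: string) =>
  (await readFile(path, 'utf8'))
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));

// Corpus entries look like { _id, title, text }.
const corpus = await readJsonl('./scifact/corpus.jsonl');

// qrels: query ID -> doc ID -> graded relevance, from a TSV with a header row.
const qrels: Record<string, Record<string, number>> = {};
const tsv = await readFile('./scifact/qrels/test.tsv', 'utf8');
for (const row of tsv.split('\n').slice(1)) {
  const [queryId, corpusId, score] = row.trim().split('\t');
  if (!queryId) continue;
  (qrels[queryId] ??= {})[corpusId] = Number(score);
}

// Keep only the queries that have test judgments (300 for SciFact).
const allQueries = await readJsonl('./scifact/queries.jsonl');
const queries = allQueries.filter((q: { _id: string }) => qrels[q._id]);

console.log(`${corpus.length} docs, ${queries.length} test queries`);
```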
All embeddings are generated with the auto model (4,096 dimensions).
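The query embeddings and the average per-query inference time used later (queryEmbeddings, perQueryEmbedMs) can be produced up front. The embeddings route and request fields in this sketch are assumptions for illustration, not confirmed by this guide:

```typescript
// Hypothetical embedding step: the /api/v1/embeddings route and the
// request/response field names below are assumptions, not a confirmed API.
const embedStart = performance.now();
const queryEmbeddings: number[][] = [];
for (const q of queries) {
  const res = await fetch(`${BASE_URL}/api/v1/embeddings`, { // assumed route
    method: 'POST',
    headers: HEADERS,
    body: JSON.stringify({ model: 'auto', input: q.text }),  // assumed fields
  });
  const body = await res.json();
  queryEmbeddings.push(body.embedding);                      // assumed field
}
// Average per-query embedding inference time, used in the summary table.
const perQueryEmbedMs = (performance.now() - embedStart) / queries.length;
```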
A retrieval query goes through two stages: embedding inference (converting the query text to a vector) and database search (finding the nearest vectors in the store). The total response time is the sum of both.
The database search latency is reported by the server itself (search_time_ms in the response body) and excludes network overhead and embedding inference. The end-to-end response time includes everything a client would measure.

The rest of this guide shows how these numbers are produced, step by step.
Each search request returns a search_time_ms field — the server-side database search duration, excluding network round-trip and any upstream processing.
We measure both this server-side time and the full end-to-end client-observed response time.
```typescript
const K = 10;
const results: Record<string, Record<string, number>> = {};
const endToEndLatencies: number[] = [];
const dbLatencies: number[] = [];

console.log(`Running ${queries.length} queries against text store (top-${K})...`);

for (let i = 0; i < queries.length; i++) {
  const q = queries[i];
  const qVec = queryEmbeddings[i];

  const start = performance.now();
  const res = await fetch(
    `${BASE_URL}/api/v1/search/text-store/stores/${TEXT_STORE_NAME}/search`,
    {
      method: 'POST',
      headers: HEADERS,
      body: JSON.stringify({ query: q.text, vector: qVec, limit: K }),
    },
  );
  const searchResult = await res.json();
  const elapsed = performance.now() - start;

  endToEndLatencies.push(elapsed);
  dbLatencies.push(searchResult.search_time_ms ?? 0);

  // Map results back to corpus doc IDs by matching text
  results[q._id] = {};
  const hits = searchResult.results ?? [];
  for (let rank = 0; rank < hits.length; rank++) {
    const hitText = hits[rank].text;
    const doc = corpus.find((d) => `${d.title} ${d.text}` === hitText);
    if (doc) {
      results[q._id][doc._id] = 1.0 / (rank + 1); // score by rank
    }
  }
}

function percentile(arr: number[], p: number) {
  const sorted = [...arr].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * p)];
}

console.log(`\nText Store — end-to-end response time (${queries.length} queries):`);
console.log(` p50: ${percentile(endToEndLatencies, 0.5).toFixed(0)} ms`);
console.log(` p95: ${percentile(endToEndLatencies, 0.95).toFixed(0)} ms`);
console.log(` p99: ${percentile(endToEndLatencies, 0.99).toFixed(0)} ms`);

console.log(`\nText Store — database search latency (server-side search_time_ms):`);
console.log(` p50: ${percentile(dbLatencies, 0.5).toFixed(0)} ms`);
console.log(` p95: ${percentile(dbLatencies, 0.95).toFixed(0)} ms`);
console.log(` p99: ${percentile(dbLatencies, 0.99).toFixed(0)} ms`);
```
Expected output
```
Running 300 queries against text store (top-10)...

Text Store — end-to-end response time (300 queries):
 p50: 45 ms
 p95: 82 ms
 p99: 110 ms

Text Store — database search latency (server-side search_time_ms):
 p50: 8 ms
 p95: 14 ms
 p99: 18 ms
```
The gap between end-to-end and database latency is network round-trip and gateway overhead.
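To see that gap directly, subtract the server-reported time from the client-observed time per request; the two arrays are parallel because they were filled in the same loop above:

```typescript
// Per-request overhead: end-to-end time minus server-side search_time_ms.
const overhead = endToEndLatencies.map((t, i) => t - dbLatencies[i]);
console.log(`Overhead p50: ${percentile(overhead, 0.5).toFixed(0)} ms`);
console.log(`Overhead p95: ${percentile(overhead, 0.95).toFixed(0)} ms`);
```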
```typescript
const textStoreMetrics = computeMetrics(results, qrels, K);
const tableMetrics = computeMetrics(tableResults, qrels, K);

console.log('=== Retrieval Accuracy (SciFact, 5,183 docs, 300 queries) ===');
console.log('');
console.log(`Metric       Text Store   Tables (vector_search)`);
console.log(`NDCG@10      ${textStoreMetrics.ndcg.toFixed(1)}         ${tableMetrics.ndcg.toFixed(1)}`);
console.log(`Recall@10    ${textStoreMetrics.recall.toFixed(1)}         ${tableMetrics.recall.toFixed(1)}`);
console.log(`MRR@10       ${textStoreMetrics.mrr.toFixed(1)}         ${tableMetrics.mrr.toFixed(1)}`);
console.log('');
console.log('=== Latency Breakdown ===');
console.log('');
console.log(`Stage                        Text Store   Tables`);
console.log(`Embedding inference (per q)  ${perQueryEmbedMs.toFixed(0)} ms        ${perQueryEmbedMs.toFixed(0)} ms`);
console.log(`DB search p50 (server-side)  ${percentile(dbLatencies, 0.5).toFixed(0)} ms         ${percentile(tableDbLatencies, 0.5).toFixed(0)} ms`);
console.log(`DB search p95 (server-side)  ${percentile(dbLatencies, 0.95).toFixed(0)} ms        ${percentile(tableDbLatencies, 0.95).toFixed(0)} ms`);
console.log(`End-to-end response p50      ${percentile(endToEndLatencies, 0.5).toFixed(0)} ms        ${percentile(tableEndToEnd, 0.5).toFixed(0)} ms`);
console.log(`End-to-end response p95      ${percentile(endToEndLatencies, 0.95).toFixed(0)} ms        ${percentile(tableEndToEnd, 0.95).toFixed(0)} ms`);
console.log(`Index throughput             ${(corpus.length / (indexMs / 1000)).toFixed(0)} docs/s   ${(corpus.length / (tableIndexMs / 1000)).toFixed(0)} docs/s`);
```
Expected output
```
=== Retrieval Accuracy (SciFact, 5,183 docs, 300 queries) ===

Metric       Text Store   Tables (vector_search)
NDCG@10      72.4         71.8
Recall@10    83.1         82.6
MRR@10       68.9         68.3

=== Latency Breakdown ===

Stage                        Text Store   Tables
Embedding inference (per q)  12 ms        12 ms
DB search p50 (server-side)  8 ms         6 ms
DB search p95 (server-side)  14 ms        11 ms
End-to-end response p50      45 ms        38 ms
End-to-end response p95      82 ms        71 ms
Index throughput             182 docs/s   166 docs/s
```
Embedding inference is the same for both backends — it runs the same model.
DB search is the pure database time reported by the server (search_time_ms / searchTimeMs). This is where Text Store and Tables differ — Text Store runs hybrid (vector + FTS), Tables runs pure vector search.
End-to-end response includes network round-trip, gateway overhead, and database search. The difference between end-to-end and DB search is the network and gateway overhead.
NDCG@10 is the primary metric. It rewards relevant documents that appear near the top of the ranking, discounting their contribution logarithmically as rank increases.
Recall@10 measures how many relevant documents appear in the top 10 at all — important for RAG pipelines where downstream generation depends on retrieval completeness.
MRR@10 measures how quickly the first relevant result appears — important for user-facing search.
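For reference, here is one reasonable implementation of the computeMetrics helper called in the summary step above. This is a sketch assuming BEIR-style qrels (query ID to doc ID to graded relevance) and reporting each metric on a 0-100 scale to match the expected output:

```typescript
function computeMetrics(
  results: Record<string, Record<string, number>>,
  qrels: Record<string, Record<string, number>>,
  k: number,
) {
  let ndcg = 0, recall = 0, mrr = 0, n = 0;
  for (const qid of Object.keys(results)) {
    const rels = qrels[qid];
    if (!rels) continue;
    n++;
    // Rank retrieved docs by score, descending, and keep the top k.
    const ranked = Object.entries(results[qid])
      .sort((a, b) => b[1] - a[1])
      .slice(0, k)
      .map(([docId]) => docId);
    // NDCG@k: DCG over the retrieved ranking divided by the ideal DCG.
    const dcg = ranked.reduce(
      (s, docId, i) => s + (rels[docId] ?? 0) / Math.log2(i + 2), 0);
    const idcg = Object.values(rels)
      .sort((a, b) => b - a)
      .slice(0, k)
      .reduce((s, rel, i) => s + rel / Math.log2(i + 2), 0);
    ndcg += idcg > 0 ? dcg / idcg : 0;
    // Recall@k: fraction of all relevant docs that made the top k.
    const relevant = Object.keys(rels).filter((d) => rels[d] > 0);
    const hitCount = ranked.filter((d) => (rels[d] ?? 0) > 0).length;
    recall += relevant.length > 0 ? hitCount / relevant.length : 0;
    // MRR@k: reciprocal rank of the first relevant result, 0 if none.
    const firstHit = ranked.findIndex((d) => (rels[d] ?? 0) > 0);
    mrr += firstHit >= 0 ? 1 / (firstHit + 1) : 0;
  }
  return { ndcg: (ndcg / n) * 100, recall: (recall / n) * 100, mrr: (mrr / n) * 100 };
}
```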
Text Store vs Tables: Text Store adds hybrid search (vector + keyword) automatically. Tables gives you explicit control over index type, distance metric, and filters.
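For contrast with the Text Store request above, a single Tables query might look like the sketch below. The route, table name, and request fields are illustrative assumptions, not confirmed by this guide; only the vector_search operation name and the searchTimeMs response field come from the text:

```typescript
// Hypothetical request shape: the route, TABLE_NAME, and body fields below
// are assumptions for illustration, not a confirmed API.
const res = await fetch(
  `${BASE_URL}/api/v1/tables/${TABLE_NAME}/vector_search`, // assumed route
  {
    method: 'POST',
    headers: HEADERS,
    body: JSON.stringify({
      vector: qVec, // pure vector search: no text query, no FTS component
      limit: K,
      metric: 'cosine', // assumed: explicit distance metric
      filter: { year: { gte: 2015 } }, // assumed: metadata filter syntax
    }),
  },
);
const { results: hits, searchTimeMs } = await res.json(); // searchTimeMs per the guide
```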
Embedding inference — the time to convert the query text into a vector. This is model inference and is the same regardless of which storage backend you use.
Database search — the time the search engine spends finding nearest neighbors. Reported by the server in search_time_ms (Text Store) or searchTimeMs (Tables). This is pure vector/hybrid search time with no network overhead.
End-to-end response — what the client observes: network round-trip + gateway routing + database search.
If end-to-end latency is high but database search is fast, the bottleneck is network or gateway overhead.
If database search is high, consider a different index type (e.g., IVF_HNSW_SQ for faster approximate search).
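If the index type is configurable when a table is created, switching to an approximate index could look like the following sketch. The route and field names are assumptions; only the IVF_HNSW_SQ index type name comes from the note above:

```typescript
// Hypothetical table creation with an explicit index configuration.
await fetch(`${BASE_URL}/api/v1/tables`, {
  method: 'POST',
  headers: HEADERS,
  body: JSON.stringify({
    name: TABLE_NAME, // assumed identifier
    vector: {
      dimensions: 4096,          // matches the embedding model used in this guide
      index_type: 'IVF_HNSW_SQ', // approximate index: trades exactness for speed
      metric: 'cosine',          // assumed default
    },
  }),
});
```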