This evaluation compares GraphRAG with traditional RAG on the same benchmark corpus and the same question set. The goal was to measure whether graph-aware retrieval improved multi-hop question answering.

What was implemented

The evaluated system did not use flat chunk retrieval alone. It implemented GraphRAG with these stages:
  1. Split source documents into chunks.
  2. Extract entities and relationships from those chunks.
  3. Build a knowledge graph from the extracted entities and relationships.
  4. Run standard retrieval to get seed evidence for a user question.
  5. Expand through graph links to collect connected evidence.
  6. Re-rank the final evidence set before answer generation.
That is the GraphRAG behavior that was evaluated.
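The six stages above can be sketched end to end. This is a toy illustration, not the service's implementation: entity extraction here is a capitalized-phrase regex standing in for the extraction model, seed retrieval is keyword overlap standing in for embedding search, and reranking is omitted.

```python
import re
from collections import defaultdict

def split_chunks(text, size=800):
    # Stage 1: fixed-size character chunks (the evaluated store used
    # chunk_size 800; chunk overlap is omitted in this sketch).
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_entities(chunk):
    # Stage 2: toy extraction - multiword capitalized spans stand in
    # for model-extracted entities.
    return re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", chunk)

def build_graph(chunks):
    # Stage 3: link entities that co-occur in the same chunk, and
    # remember which chunks mention each entity.
    graph, entity_chunks = defaultdict(set), defaultdict(set)
    for idx, chunk in enumerate(chunks):
        ents = extract_entities(chunk)
        for e in ents:
            entity_chunks[e].add(idx)
        for a in ents:
            for b in ents:
                if a != b:
                    graph[a].add(b)
    return graph, entity_chunks

def seed_retrieval(question, chunks, k=1):
    # Stage 4: keyword-overlap scoring stands in for embedding search.
    q = set(question.lower().split())
    scored = sorted(range(len(chunks)),
                    key=lambda i: -len(q & set(chunks[i].lower().split())))
    return scored[:k]

def expand(seeds, chunks, graph, entity_chunks, hops=2):
    # Stage 5: walk graph edges from the seed chunks' entities,
    # collecting the chunks that mention each reached entity.
    frontier = {e for i in seeds for e in extract_entities(chunks[i])}
    evidence = set(seeds)
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph[e]}
        for e in frontier:
            evidence |= entity_chunks[e]
    return evidence

docs = ["Rivera Logistics won the Northern Bridge Sensors contract.",
        "Atlas Infrastructure Group owns Rivera Logistics."]
chunks = [c for d in docs for c in split_chunks(d)]
graph, entity_chunks = build_graph(chunks)
seeds = seed_retrieval("Who owns the Northern Bridge Sensors contract winner?", chunks)
evidence = expand(seeds, chunks, graph, entity_chunks)
print(sorted(evidence))  # [0, 1] - expansion pulls in the ownership chunk
```

The seed query retrieves only the contract chunk; the one-hop expansion through the shared "Rivera Logistics" node is what brings in the ownership chunk.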

What it was compared against

The comparison used a traditional RAG baseline with the same corpus and the same answer model. The only difference between the two runs was retrieval mode:
  • Baseline RAG: flat chunk retrieval only
  • GraphRAG: graph-seeded expansion plus graph-aware reranking
This matters: holding the corpus and answer model constant isolates the effect of the retrieval strategy itself.

Benchmark design

The benchmark was designed to test multi-hop retrieval rather than simple one-chunk lookup. It used:
  • one target knowledge graph
  • several semantically similar distractor graphs
  • questions that required linking facts across multiple chunks
This design is important. If every answer already appears in one obvious chunk, GraphRAG will not show much benefit over standard RAG.
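The design property can be stated concretely. In the sketch below, the target fact chain is split across chunks and sits next to distractors; all distractor names (Harbor Freight Partners, Summit Infrastructure Group, Southern Tunnel Sensors) are hypothetical stand-ins, not benchmark contents.

```python
# One target fact chain split across chunks, plus semantically
# similar distractor chunks (distractor names are illustrative).
target_chain = [
    "Rivera Logistics won the Northern Bridge Sensors contract.",
    "Atlas Infrastructure Group owns Rivera Logistics.",
    "Javier Solis is the chief financial officer of Atlas Infrastructure Group.",
]
distractors = [
    "Harbor Freight Partners won the Southern Tunnel Sensors contract.",
    "Summit Infrastructure Group owns Harbor Freight Partners.",
]
# The property the benchmark enforces: no single chunk contains both
# the question's anchor entity and the final answer.
single_chunk_answerable = any(
    "Northern Bridge Sensors" in c and "Javier Solis" in c
    for c in target_chain + distractors)
print(single_chunk_answerable)  # False
```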

Evaluation method

Both retrieval modes were run over the same benchmark corpus and the same question set. Both then used the same answer-generation model and the same answer prompt. The evaluation recorded three metrics:
  • Exact match: whether the final answer exactly matched the gold answer
  • Token F1: token overlap between the final answer and the gold answer
  • Evidence recall@5: how much of the required supporting evidence appeared in the top 5 retrieved chunks
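The three metrics can be sketched as follows. The normalization shown (lowercasing, whitespace tokenization) is an assumption; the evaluation's exact normalization rules are not specified in this document.

```python
from collections import Counter

def exact_match(pred, gold):
    # 1.0 if the normalized answers are identical, else 0.0.
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    # Harmonic mean of token precision and recall between the answers.
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def evidence_recall_at_k(retrieved, required, k=5):
    # Fraction of required evidence chunks present in the top-k results.
    return len(set(retrieved[:k]) & set(required)) / len(required)

print(exact_match("Javier Solis", "javier solis"))             # 1.0
print(round(token_f1("Javier A. Solis", "Javier Solis"), 3))   # 0.8
print(evidence_recall_at_k(["c1", "c4", "c9"], ["c1", "c2"]))  # 0.5
```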

How the API was used

The evaluated API flow was simple. The same store and the same documents were used for both the baseline and GraphRAG runs. Only the query mode changed.

1. Create a GraphRAG store

curl https://apigw.mka1.com/api/v1/search/graphrag/stores \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --data '{
    "store_name": "benchmark_graphrag",
    "embedding_model": "meetkai:qwen3-embedding-8b",
    "extraction_model": "openai:gpt-4.1-mini",
    "chunk_size": 800,
    "chunk_overlap": 120,
    "max_hops": 2
  }'

2. Ingest documents

curl https://apigw.mka1.com/api/v1/search/graphrag/stores/benchmark_graphrag/documents \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --data '{
    "documents": [
      {
        "document_id": "doc_contract_award",
        "text": "Rivera Logistics won the Northern Bridge Sensors contract.",
        "metadata": {
          "source": "benchmark"
        }
      },
      {
        "document_id": "doc_parent_company",
        "text": "Atlas Infrastructure Group owns Rivera Logistics.",
        "metadata": {
          "source": "benchmark"
        }
      }
    ]
  }'

3. Run the baseline RAG query

curl https://apigw.mka1.com/api/v1/search/graphrag/stores/benchmark_graphrag/query \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --data '{
    "query": "Who is the chief financial officer of the company that owns the Northern Bridge Sensors contract winner?",
    "mode": "baseline",
    "limit": 5,
    "seed_k": 8
  }'

4. Run the GraphRAG query

curl https://apigw.mka1.com/api/v1/search/graphrag/stores/benchmark_graphrag/query \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --data '{
    "query": "Who is the chief financial officer of the company that owns the Northern Bridge Sensors contract winner?",
    "mode": "graph",
    "limit": 5,
    "seed_k": 8
  }'
That last step is the key comparison. The query, corpus, and answer model stayed the same; only the mode field changed:
  • baseline = traditional flat retrieval
  • graph = graph-seeded retrieval and reranking
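The paired run can be mirrored in Python. The payloads below follow the curl examples, and the point of the comparison is that they are identical except for "mode" (the `run_query` helper is a sketch of the same POST, not an official client):

```python
import json
import urllib.request

BASE = "https://apigw.mka1.com/api/v1/search/graphrag/stores/benchmark_graphrag/query"
QUESTION = ("Who is the chief financial officer of the company that owns "
            "the Northern Bridge Sensors contract winner?")

def build_payload(mode):
    # Everything except mode is held constant across the two runs.
    return {"query": QUESTION, "mode": mode, "limit": 5, "seed_k": 8}

def run_query(payload, api_key, end_user_id):
    # POST one query; headers mirror the curl examples above.
    req = urllib.request.Request(
        BASE,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}",
                 "X-On-Behalf-Of": end_user_id},
        method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

baseline_payload = build_payload("baseline")
graph_payload = build_payload("graph")
# The only field that differs between the two runs:
diff = {k for k in baseline_payload if baseline_payload[k] != graph_payload[k]}
print(diff)  # {'mode'}
```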

Measured results

The live benchmark run produced the following results:
Method         Exact Match   Token F1   Evidence Recall@5
Baseline RAG   50.0%         50.0%      78.1%
GraphRAG       87.5%         87.5%      90.6%
Improvement:
  • Exact Match: +37.5 points
  • Token F1: +37.5 points
  • Evidence Recall@5: +12.5 points

Acceptance threshold

The benchmark used the following pass criteria:
  • exact match improvement of at least +5.0 points
  • evidence recall@5 improvement of at least +10.0 points
The evaluated GraphRAG implementation passed both thresholds.
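The deltas and the pass/fail decision can be recomputed directly from the results table above:

```python
# Figures taken from the measured-results table above.
results = {
    "baseline": {"exact_match": 50.0, "token_f1": 50.0, "evidence_recall_at_5": 78.1},
    "graphrag": {"exact_match": 87.5, "token_f1": 87.5, "evidence_recall_at_5": 90.6},
}
# Pass criteria: +5.0 points exact match, +10.0 points evidence recall@5.
thresholds = {"exact_match": 5.0, "evidence_recall_at_5": 10.0}

deltas = {m: round(results["graphrag"][m] - results["baseline"][m], 1)
          for m in results["baseline"]}
passed = all(deltas[m] >= t for m, t in thresholds.items())
print(deltas)  # {'exact_match': 37.5, 'token_f1': 37.5, 'evidence_recall_at_5': 12.5}
print(passed)  # True
```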

Representative question-level outcomes

Examples where GraphRAG succeeded and baseline RAG did not:
  • “Who is the chief financial officer of the company that owns the Northern Bridge Sensors contract winner?”
    • Baseline RAG: unknown
    • GraphRAG: Javier Solis
  • “Which company acquired the firm that prepared a risk report for Meridian Ports Authority?”
    • Baseline RAG: unknown
    • GraphRAG: Atlas Infrastructure Group
  • “Who is the chief financial officer of the company that acquired the firm that prepared a risk report for Meridian Ports Authority?”
    • Baseline RAG: unknown
    • GraphRAG: Javier Solis
These are multi-hop questions. They require linking facts across connected entities rather than retrieving a single directly matching chunk.

Why GraphRAG performed better

Baseline RAG retrieved semantically similar chunks, but it sometimes failed to retrieve the connected evidence needed to complete the reasoning chain. GraphRAG improved performance by:
  • identifying the relevant seed entities from the question
  • traversing graph relationships to find linked evidence
  • reranking the final evidence set with graph signals in addition to semantic similarity
That is why the improvement appears most clearly on multi-hop questions.
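The hop structure behind the first representative question can be made explicit with a toy relation store. The entities and the answer come from the examples above; the traversal logic is an illustration, not the service's.

```python
# Toy subject-relation-object store covering the first representative question.
relations = {
    ("Rivera Logistics", "won"): "Northern Bridge Sensors contract",
    ("Atlas Infrastructure Group", "owns"): "Rivera Logistics",
    ("Javier Solis", "cfo_of"): "Atlas Infrastructure Group",
}

def subjects_with(relation, obj):
    # Reverse lookup: which subjects hold this relation to the object?
    return [s for (s, r), o in relations.items() if r == relation and o == obj]

# Hop 1: who won the contract?
winner = subjects_with("won", "Northern Bridge Sensors contract")[0]
# Hop 2: who owns the winner?
owner = subjects_with("owns", winner)[0]
# Hop 3 (the answer): who is the owner's CFO?
cfo = subjects_with("cfo_of", owner)[0]
print(cfo)  # Javier Solis
```

No single relation links the question's anchor entity to the answer, which is exactly the situation where flat chunk retrieval tends to miss a link in the chain.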

Summary

On this benchmark, GraphRAG outperformed traditional RAG on both final-answer accuracy and supporting-evidence retrieval. The largest gains appeared on questions that required linking facts across multiple connected entities.