This evaluation compares GraphRAG with traditional RAG on the same benchmark corpus and the same question set. The goal was to measure whether graph-aware retrieval improved multi-hop question answering.

What was implemented

The evaluated system did not use flat chunk retrieval alone. It implemented GraphRAG with these stages:
  1. Split source documents into chunks.
  2. Extract entities and relationships from those chunks.
  3. Build a knowledge graph from the extracted entities and relationships.
  4. Run standard retrieval to get seed evidence for a user question.
  5. Expand through graph links to collect connected evidence.
  6. Re-rank the final evidence set before answer generation.
That is the GraphRAG behavior that was evaluated.
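The six stages above can be sketched end to end. This is a toy illustration, not the service's implementation: entity extraction here is a capitalized-phrase regex standing in for the extraction model, seed retrieval is keyword overlap standing in for embedding search, and reranking is omitted.

```python
import re
from collections import defaultdict

def split_chunks(text, size=800):
    # Stage 1: fixed-size character chunks (the evaluated store used
    # chunk_size 800; chunk overlap is omitted in this sketch).
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_entities(chunk):
    # Stage 2: toy extraction - multiword capitalized spans stand in
    # for model-extracted entities.
    return re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", chunk)

def build_graph(chunks):
    # Stage 3: link entities that co-occur in the same chunk, and
    # remember which chunks mention each entity.
    graph, entity_chunks = defaultdict(set), defaultdict(set)
    for idx, chunk in enumerate(chunks):
        ents = extract_entities(chunk)
        for e in ents:
            entity_chunks[e].add(idx)
        for a in ents:
            for b in ents:
                if a != b:
                    graph[a].add(b)
    return graph, entity_chunks

def seed_retrieval(question, chunks, k=1):
    # Stage 4: keyword-overlap scoring stands in for embedding search.
    q = set(question.lower().split())
    scored = sorted(range(len(chunks)),
                    key=lambda i: -len(q & set(chunks[i].lower().split())))
    return scored[:k]

def expand(seeds, chunks, graph, entity_chunks, hops=2):
    # Stage 5: walk graph edges from the seed chunks' entities,
    # collecting the chunks that mention each reached entity.
    frontier = {e for i in seeds for e in extract_entities(chunks[i])}
    evidence = set(seeds)
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph[e]}
        for e in frontier:
            evidence |= entity_chunks[e]
    return evidence

docs = ["Rivera Logistics won the Northern Bridge Sensors contract.",
        "Atlas Infrastructure Group owns Rivera Logistics."]
chunks = [c for d in docs for c in split_chunks(d)]
graph, entity_chunks = build_graph(chunks)
seeds = seed_retrieval("Who owns the Northern Bridge Sensors contract winner?", chunks)
evidence = expand(seeds, chunks, graph, entity_chunks)
print(sorted(evidence))  # [0, 1] - expansion pulls in the ownership chunk
```

The seed query retrieves only the contract chunk; the one-hop expansion through the shared "Rivera Logistics" node is what brings in the ownership chunk.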

What it was compared against

The comparison used a traditional RAG baseline with the same corpus and the same answer model. The only difference between the two runs was retrieval mode:
  • Baseline RAG: flat chunk retrieval only
  • GraphRAG: graph-seeded expansion plus graph-aware reranking
This matters: holding the corpus and answer model constant isolates the effect of the retrieval strategy itself.

Benchmark design

The benchmark was designed to test multi-hop retrieval rather than simple one-chunk lookup. It used:
  • one target knowledge graph
  • several semantically similar distractor graphs
  • questions that required linking facts across multiple chunks
This design is important. If every answer already appears in one obvious chunk, GraphRAG will not show much benefit over standard RAG.
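The design property can be stated concretely. In the sketch below, the target fact chain is split across chunks and sits next to distractors; all distractor names (Harbor Freight Partners, Summit Infrastructure Group, Southern Tunnel Sensors) are hypothetical stand-ins, not benchmark contents.

```python
# One target fact chain split across chunks, plus semantically
# similar distractor chunks (distractor names are illustrative).
target_chain = [
    "Rivera Logistics won the Northern Bridge Sensors contract.",
    "Atlas Infrastructure Group owns Rivera Logistics.",
    "Javier Solis is the chief financial officer of Atlas Infrastructure Group.",
]
distractors = [
    "Harbor Freight Partners won the Southern Tunnel Sensors contract.",
    "Summit Infrastructure Group owns Harbor Freight Partners.",
]
# The property the benchmark enforces: no single chunk contains both
# the question's anchor entity and the final answer.
single_chunk_answerable = any(
    "Northern Bridge Sensors" in c and "Javier Solis" in c
    for c in target_chain + distractors)
print(single_chunk_answerable)  # False
```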

Evaluation method

Both retrieval modes were run over the same benchmark corpus and the same question set. Both then used the same answer-generation model and the same answer prompt. The evaluation recorded three metrics:
  • Exact match: whether the final answer exactly matched the gold answer
  • Token F1: token overlap between the final answer and the gold answer
  • Evidence recall@5: how much of the required supporting evidence appeared in the top 5 retrieved chunks
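The three metrics can be sketched as follows. The normalization shown (lowercasing, whitespace tokenization) is an assumption; the evaluation's exact normalization rules are not specified in this document.

```python
from collections import Counter

def exact_match(pred, gold):
    # 1.0 if the normalized answers are identical, else 0.0.
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    # Harmonic mean of token precision and recall between the answers.
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def evidence_recall_at_k(retrieved, required, k=5):
    # Fraction of required evidence chunks present in the top-k results.
    return len(set(retrieved[:k]) & set(required)) / len(required)

print(exact_match("Javier Solis", "javier solis"))             # 1.0
print(round(token_f1("Javier A. Solis", "Javier Solis"), 3))   # 0.8
print(evidence_recall_at_k(["c1", "c4", "c9"], ["c1", "c2"]))  # 0.5
```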

How the API was used

The evaluated API flow was simple. The same store and the same documents were used for both the baseline and GraphRAG runs. Only the query mode changed.

1. Create a GraphRAG store

curl https://apigw.mka1.com/api/v1/search/graphrag/stores \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --data '{
    "store_name": "benchmark_graphrag",
    "embedding_model": "meetkai:qwen3-embedding-8b",
    "extraction_model": "openai:gpt-4.1-mini",
    "chunk_size": 800,
    "chunk_overlap": 120,
    "max_hops": 2
  }'

2. Ingest documents

curl https://apigw.mka1.com/api/v1/search/graphrag/stores/benchmark_graphrag/documents \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --data '{
    "documents": [
      {
        "document_id": "doc_contract_award",
        "text": "Rivera Logistics won the Northern Bridge Sensors contract.",
        "metadata": {
          "source": "benchmark"
        }
      },
      {
        "document_id": "doc_parent_company",
        "text": "Atlas Infrastructure Group owns Rivera Logistics.",
        "metadata": {
          "source": "benchmark"
        }
      }
    ]
  }'

3. Run the baseline RAG query

curl https://apigw.mka1.com/api/v1/search/graphrag/stores/benchmark_graphrag/query \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --data '{
    "query": "Who is the chief financial officer of the company that owns the Northern Bridge Sensors contract winner?",
    "mode": "baseline",
    "limit": 5,
    "seed_k": 8
  }'

4. Run the GraphRAG query

curl https://apigw.mka1.com/api/v1/search/graphrag/stores/benchmark_graphrag/query \
  --request POST \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <mka1-api-key>' \
  --header 'X-On-Behalf-Of: <end-user-id>' \
  --data '{
    "query": "Who is the chief financial officer of the company that owns the Northern Bridge Sensors contract winner?",
    "mode": "graph",
    "limit": 5,
    "seed_k": 8
  }'
That last step is the key comparison. The query, corpus, and answer model stayed the same; only the mode field changed:
  • baseline = traditional flat retrieval
  • graph = graph-seeded retrieval and reranking
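The paired run can be mirrored in Python. The payloads below follow the curl examples, and the point of the comparison is that they are identical except for "mode" (the `run_query` helper is a sketch of the same POST, not an official client):

```python
import json
import urllib.request

BASE = "https://apigw.mka1.com/api/v1/search/graphrag/stores/benchmark_graphrag/query"
QUESTION = ("Who is the chief financial officer of the company that owns "
            "the Northern Bridge Sensors contract winner?")

def build_payload(mode):
    # Everything except mode is held constant across the two runs.
    return {"query": QUESTION, "mode": mode, "limit": 5, "seed_k": 8}

def run_query(payload, api_key, end_user_id):
    # POST one query; headers mirror the curl examples above.
    req = urllib.request.Request(
        BASE,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}",
                 "X-On-Behalf-Of": end_user_id},
        method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

baseline_payload = build_payload("baseline")
graph_payload = build_payload("graph")
# The only field that differs between the two runs:
diff = {k for k in baseline_payload if baseline_payload[k] != graph_payload[k]}
print(diff)  # {'mode'}
```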

Measured results

The live benchmark run produced the following results:
Method         Exact Match   Token F1   Evidence Recall@5
Baseline RAG   50.0%         50.0%      78.1%
GraphRAG       87.5%         87.5%      90.6%
Improvement:
  • Exact Match: +37.5 points
  • Token F1: +37.5 points
  • Evidence Recall@5: +12.5 points

Acceptance threshold

The benchmark used the following pass criteria:
  • exact match improvement of at least +5.0 points
  • evidence recall@5 improvement of at least +10.0 points
The evaluated GraphRAG implementation passed both thresholds.
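The deltas and the pass/fail decision can be recomputed directly from the results table above:

```python
# Figures taken from the measured-results table above.
results = {
    "baseline": {"exact_match": 50.0, "token_f1": 50.0, "evidence_recall_at_5": 78.1},
    "graphrag": {"exact_match": 87.5, "token_f1": 87.5, "evidence_recall_at_5": 90.6},
}
# Pass criteria: +5.0 points exact match, +10.0 points evidence recall@5.
thresholds = {"exact_match": 5.0, "evidence_recall_at_5": 10.0}

deltas = {m: round(results["graphrag"][m] - results["baseline"][m], 1)
          for m in results["baseline"]}
passed = all(deltas[m] >= t for m, t in thresholds.items())
print(deltas)  # {'exact_match': 37.5, 'token_f1': 37.5, 'evidence_recall_at_5': 12.5}
print(passed)  # True
```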

Representative question-level outcomes

Examples where GraphRAG succeeded and baseline RAG did not:
  • “Who is the chief financial officer of the company that owns the Northern Bridge Sensors contract winner?”
    • Baseline RAG: unknown
    • GraphRAG: Javier Solis
  • “Which company acquired the firm that prepared a risk report for Meridian Ports Authority?”
    • Baseline RAG: unknown
    • GraphRAG: Atlas Infrastructure Group
  • “Who is the chief financial officer of the company that acquired the firm that prepared a risk report for Meridian Ports Authority?”
    • Baseline RAG: unknown
    • GraphRAG: Javier Solis
These are multi-hop questions. They require linking facts across connected entities rather than retrieving a single directly matching chunk.

Why GraphRAG performed better

Baseline RAG retrieved semantically similar chunks, but it sometimes failed to retrieve the connected evidence needed to complete the reasoning chain. GraphRAG improved performance by:
  • identifying the relevant seed entities from the question
  • traversing graph relationships to find linked evidence
  • reranking the final evidence set with graph signals in addition to semantic similarity
That is why the improvement appears most clearly on multi-hop questions.
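The hop structure behind the first representative question can be made explicit with a toy relation store. The entities and the answer come from the examples above; the traversal logic is an illustration, not the service's.

```python
# Toy subject-relation-object store covering the first representative question.
relations = {
    ("Rivera Logistics", "won"): "Northern Bridge Sensors contract",
    ("Atlas Infrastructure Group", "owns"): "Rivera Logistics",
    ("Javier Solis", "cfo_of"): "Atlas Infrastructure Group",
}

def subjects_with(relation, obj):
    # Reverse lookup: which subjects hold this relation to the object?
    return [s for (s, r), o in relations.items() if r == relation and o == obj]

# Hop 1: who won the contract?
winner = subjects_with("won", "Northern Bridge Sensors contract")[0]
# Hop 2: who owns the winner?
owner = subjects_with("owns", winner)[0]
# Hop 3 (the answer): who is the owner's CFO?
cfo = subjects_with("cfo_of", owner)[0]
print(cfo)  # Javier Solis
```

No single relation links the question's anchor entity to the answer, which is exactly the situation where flat chunk retrieval tends to miss a link in the chain.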

Summary

On this benchmark, GraphRAG outperformed traditional RAG on both final-answer accuracy and supporting-evidence retrieval. The largest gains appeared on questions that required linking facts across multiple connected entities.