What was implemented
The evaluated system did not use flat chunk retrieval alone. It implemented GraphRAG with these stages:

- Split source documents into chunks.
- Extract entities and relationships from those chunks.
- Build a knowledge graph from the extracted entities and relationships.
- Run standard retrieval to get seed evidence for a user question.
- Expand through graph links to collect connected evidence.
- Re-rank the final evidence set before answer generation.
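The stages above can be sketched in minimal form. Every name here (`chunk`, `build_graph`, `expand`) is illustrative; the source does not show the actual implementation:

```python
# Hypothetical sketch of the GraphRAG pipeline stages described above.
# All function names are illustrative, not the evaluated system's API.
from collections import defaultdict

def chunk(text, size=500):
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_graph(triples):
    """Build an adjacency map from (head, relation, tail) triples."""
    graph = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
        graph[tail].append((rel, head))  # treat edges as undirected for expansion
    return graph

def expand(graph, seeds, hops=2):
    """Collect entities reachable from the seed set within `hops` links."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {t for e in frontier for _, t in graph.get(e, [])} - seen
        seen |= frontier
    return seen
```

Entity/relationship extraction and reranking are omitted here; the point is the chunk-to-graph-to-expansion shape of the pipeline.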
What it was compared against
The comparison used a traditional RAG baseline with the same corpus and the same answer model. The only difference between the two runs was retrieval mode:

- Baseline RAG: flat chunk retrieval only
- GraphRAG: graph-seeded expansion plus graph-aware reranking
Benchmark design
The benchmark was designed to test multi-hop retrieval rather than simple one-chunk lookup. It used:

- one target knowledge graph
- several semantically similar distractor graphs
- questions that required linking facts across multiple chunks
Evaluation method
Both retrieval modes were run over the same benchmark corpus and the same question set. Both then used the same answer-generation model and the same answer prompt. The evaluation recorded three metrics:

- Exact match: whether the final answer exactly matched the gold answer
- Token F1: token overlap between the final answer and the gold answer
- Evidence recall@5: how much of the required supporting evidence appeared in the top 5 retrieved chunks
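These three metrics can be computed as follows. The exact normalization rules (casing, whitespace) used in the evaluation are not stated in the source, so this is a common-convention sketch:

```python
# Sketch of the three evaluation metrics; normalization details are assumed.
from collections import Counter

def exact_match(pred, gold):
    """True if the prediction matches the gold answer after simple normalization."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def evidence_recall_at_k(retrieved, required, k=5):
    """Fraction of required evidence chunks present in the top-k retrieved."""
    top = set(retrieved[:k])
    return sum(1 for r in required if r in top) / len(required)
```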
How the API was used
The evaluated API flow was simple. The same store and the same documents were used for both the baseline and GraphRAG runs:

1. Create a GraphRAG store
2. Ingest documents
3. Run the baseline RAG query
4. Run the GraphRAG query

Only the query `mode` changed between steps 3 and 4:

- `baseline` = traditional flat retrieval
- `graph` = graph-seeded retrieval and reranking
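The four-step flow could look like the sketch below. The real API surface is not shown in the source, so the class and method names here are assumptions; only the `baseline`/`graph` mode values come from the report:

```python
# Hypothetical client sketch of the four-step flow. Class and method names
# are assumptions; only the mode values ("baseline", "graph") are from the report.
class GraphRAGStore:
    def __init__(self):
        self.docs = []

    def ingest(self, docs):
        self.docs.extend(docs)

    def query(self, question, mode="baseline"):
        # mode="baseline": traditional flat retrieval
        # mode="graph":    graph-seeded retrieval and reranking
        assert mode in ("baseline", "graph")
        return {"question": question, "mode": mode, "evidence": []}

store = GraphRAGStore()                                   # 1. create store
store.ingest(["doc one", "doc two"])                      # 2. ingest documents
base = store.query("Who is the CFO?", mode="baseline")    # 3. baseline RAG query
graph = store.query("Who is the CFO?", mode="graph")      # 4. GraphRAG query
```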
Measured results
The live benchmark run produced the following results:

| Method | Exact Match | Token F1 | Evidence Recall@5 |
|---|---|---|---|
| Baseline RAG | 50.0% | 50.0% | 78.1% |
| GraphRAG | 87.5% | 87.5% | 90.6% |
Deltas relative to the baseline:

- Exact Match: +37.5 points
- Token F1: +37.5 points
- Evidence Recall@5: +12.5 points
Acceptance threshold
The benchmark used the following pass criteria:

- exact match improvement of at least +5.0 points
- evidence recall@5 improvement of at least +10.0 points
Representative question-level outcomes
Examples where GraphRAG succeeded and baseline RAG did not:

- “Who is the chief financial officer of the company that owns the Northern Bridge Sensors contract winner?”
  - Baseline RAG: unknown
  - GraphRAG: Javier Solis
- “Which company acquired the firm that prepared a risk report for Meridian Ports Authority?”
  - Baseline RAG: unknown
  - GraphRAG: Atlas Infrastructure Group
- “Who is the chief financial officer of the company that acquired the firm that prepared a risk report for Meridian Ports Authority?”
  - Baseline RAG: unknown
  - GraphRAG: Javier Solis
Why GraphRAG performed better
Baseline RAG retrieved semantically similar chunks, but it sometimes failed to retrieve the connected evidence needed to complete the reasoning chain. GraphRAG improved performance by:

- identifying the relevant seed entities from the question
- traversing graph relationships to find linked evidence
- reranking the final evidence set with graph signals in addition to semantic similarity
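The reranking step can be sketched as a combined score: semantic similarity plus a graph-proximity bonus relative to the seed entities. The actual scoring function and weights used by the evaluated system are not specified, so everything here is an assumption:

```python
# Illustrative reranker: combine a semantic score with a graph-proximity bonus.
# The evaluated system's real scoring function is not specified; this is a sketch.
def rerank(candidates, seed_entities, graph_distance, alpha=0.7):
    """Sort evidence chunks by semantic score plus a graph-closeness signal.

    candidates: list of (chunk_id, semantic_score, entities_in_chunk)
    graph_distance: func(seed, entity) -> hops, or None if unreachable
    alpha: weight on the semantic score (illustrative default)
    """
    def graph_bonus(entities):
        hops = [graph_distance(s, e) for s in seed_entities for e in entities]
        hops = [h for h in hops if h is not None]
        return 1.0 / (1 + min(hops)) if hops else 0.0

    scored = [
        (cid, alpha * sem + (1 - alpha) * graph_bonus(ents))
        for cid, sem, ents in candidates
    ]
    return [cid for cid, _ in sorted(scored, key=lambda x: -x[1])]
```

The graph bonus is what lets a chunk with a slightly lower semantic score outrank a semantically closer chunk whose entities are unreachable from the question's seed entities, which matches the multi-hop failures seen in the baseline.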