ComRAG Supplementary Results

This directory contains supplementary experimental results, figures, and brief notes that are not included in the main paper. The results summarize ComRAG performance across datasets, retrieval settings, and RAG baselines.

1. ComRAG Accuracy with Different Embedding Models

This section reports ComRAG accuracy on HotpotQA, 2WikiMultiHopQA, and MuSiQue when different embedding models are used.

Results

Embedding Model	Hotpot Machine Acc.	Hotpot GPT Acc.	2Wiki Machine Acc.	2Wiki GPT Acc.	MuSiQue Machine Acc.	MuSiQue GPT Acc.	Avg. Machine Acc.	Avg. GPT Acc.
all-mpnet-base-v2	0.560	0.640	0.680	0.730	0.256	0.320	0.499	0.563
all-MiniLM-L6-v2	0.550	0.620	0.670	0.710	0.230	0.280	0.483	0.537
bge-large-en-v1.5	0.573	0.660	0.710	0.750	0.260	0.330	0.514	0.580
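
The last two columns appear to be unweighted means over the three datasets. A minimal Python sketch, assuming that interpretation, recomputes them from the per-dataset values in the table. The dictionary layout and the function name are illustrative and are not taken from the ComRAG code.

```python
# Per-dataset accuracies copied from the table above, in column order:
# (Hotpot machine, Hotpot GPT, 2Wiki machine, 2Wiki GPT, MuSiQue machine, MuSiQue GPT)
RESULTS = {
    "all-mpnet-base-v2": (0.560, 0.640, 0.680, 0.730, 0.256, 0.320),
    "all-MiniLM-L6-v2": (0.550, 0.620, 0.670, 0.710, 0.230, 0.280),
    "bge-large-en-v1.5": (0.573, 0.660, 0.710, 0.750, 0.260, 0.330),
}


def macro_average(row):
    """Return (avg machine acc., avg GPT acc.) as unweighted means over datasets."""
    machine = row[0::2]  # every other value starting at the first: machine accuracy
    gpt = row[1::2]      # every other value starting at the second: GPT accuracy
    return sum(machine) / len(machine), sum(gpt) / len(gpt)


for model, row in RESULTS.items():
    avg_machine, avg_gpt = macro_average(row)
    print(f"{model}: machine={avg_machine:.3f}, gpt={avg_gpt:.3f}")
```

Rounded to three decimals, the recomputed values agree with the Avg. columns for all three embedding models.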

Figure

Observations

bge-large-en-v1.5 performs best overall across the three datasets, with an average machine accuracy of 0.514 and an average GPT accuracy of 0.580.
2WikiMultiHopQA has the highest accuracy among the three datasets. With bge-large-en-v1.5, ComRAG reaches 0.710 machine accuracy and 0.750 GPT accuracy.
MuSiQue is substantially harder than HotpotQA and 2WikiMultiHopQA under the current ComRAG configuration: its best machine accuracy is 0.260, compared with 0.573 on HotpotQA and 0.710 on 2WikiMultiHopQA.
GPT accuracy is higher than machine accuracy for all models and datasets, suggesting that automatic string matching underestimates some semantically correct answers whose wording differs from the reference.

2. ComRAG Accuracy on 2WikiMultiHopQA with Different `embedding_topk` and `direct_bm25_topm`

This section reports a two-variable ablation on 2WikiMultiHopQA. The evaluation subset contains 150 questions sampled from the 2WikiMultiHopQA dev set. The experiment varies only embedding_topk and direct_bm25_topm, and reports both automatic (machine) exact match and manually verified real accuracy.
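
A two-variable ablation of this kind can be organized as a small grid sweep. The sketch below is only an illustration: the grid uses the parameter values named in this section's observations (the authors' full grid may differ), and `evaluate` is a hypothetical callable standing in for a real ComRAG evaluation run.

```python
from itertools import product

# Values mentioned in this section's observations; the full grid used by the
# authors may contain other values (assumption).
EMBEDDING_TOPK = (3, 6, 9)
DIRECT_BM25_TOPM = (1, 2, 3)


def sweep(evaluate):
    """Call `evaluate(embedding_topk, direct_bm25_topm)` for every grid cell.

    `evaluate` is a hypothetical callable supplied by the caller; it should
    return a score such as exact match on the 150-question subset.
    """
    return {
        (topk, topm): evaluate(topk, topm)
        for topk, topm in product(EMBEDDING_TOPK, DIRECT_BM25_TOPM)
    }


def best_setting(scores):
    """Return the (embedding_topk, direct_bm25_topm) pair with the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Dummy scorer for demonstration only; it does not reproduce real results.
    demo_scores = sweep(lambda topk, topm: -abs(topk - 3) - abs(topm - 2))
    print(best_setting(demo_scores))
```

Keeping every other setting fixed while only these two values change is what lets differences between cells be attributed to the two parameters.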

Figures

Machine Exact Match

Real Accuracy

Observations

The best setting is embedding_topk=3 and direct_bm25_topm=2, reaching 0.7067 machine exact match and 0.7667 real accuracy.
When embedding_topk is fixed, direct_bm25_topm=2 is generally better than direct_bm25_topm=1 or direct_bm25_topm=3. Too few direct BM25 candidates can miss key evidence, while too many candidates can introduce noise.
Increasing embedding_topk does not produce stable gains. The best result with embedding_topk=3 is clearly higher than those with embedding_topk=6 or embedding_topk=9, indicating that too many candidates may interfere with later reasoning and reranking.
Real accuracy is consistently higher than machine exact match because automatic exact match rejects semantically equivalent answers whose surface form differs from the reference. Real accuracy should therefore be prioritized when comparing settings, while machine exact match remains useful as a reproducible automatic metric.
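
To illustrate why exact match is strict, here is a minimal sketch of a commonly used normalization-based exact-match check (lowercasing, stripping punctuation and English articles). This is an assumption for illustration; the matching rule actually used for the machine metric in ComRAG is not specified here and may differ.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """Return True only if the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(reference)


# Differences in case, articles, or punctuation are tolerated...
print(exact_match("The Eiffel Tower.", "eiffel tower"))
# ...but a correct answer phrased differently still counts as a miss.
print(exact_match("New York City", "NYC"))
```

With 150 questions, the reported 0.7067 and 0.7667 are consistent with 106 and 115 correct answers respectively, so the gap between the two metrics corresponds to about nine questions.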

3. Early-Workload Runtime Comparison Across RAG Methods

This section evaluates only the 2WikiMultiHopQA dataset and focuses on the early workload range, where ComRAG's low-preparation design is most relevant. Total runtime is computed as preprocessing time plus accumulated question-processing time. ComRAG has no preprocessing time. Intersection points are estimated by linear interpolation between adjacent question counts.

Results

Question Count	ComRAG Time	LinearRAG Time	HippoRAG2 Time	HippoRAG Time	Notes
50	1,508.3s	1,801.8s	6,239.4s	5,857.0s	ComRAG faster than all baselines
100	3,013.5s	2,983.9s	6,396.2s	5,964.6s	ComRAG has just become slower than LinearRAG
200	6,413.4s	5,399.5s	6,743.6s	6,188.1s	ComRAG still faster than HippoRAG2, but slower than HippoRAG and LinearRAG
300	10,202.9s	7,828.0s	7,098.5s	6,410.4s	ComRAG slower than all baselines

Summary

Method	Preprocessing Time	Avg. Question Time	Total Time at 50	Total Time at 300
ComRAG	0.0s	34.0097s	1,508.3s	10,202.9s
LinearRAG	524.6s	23.8284s	1,801.8s	7,828.0s
HippoRAG2	6,050.6s	3.4345s	6,239.4s	7,098.5s
HippoRAG	5,751.0s	2.2788s	5,857.0s	6,410.4s

Intersections

Baseline	Intersection Question Count	Interpretation
LinearRAG	94.9	ComRAG has lower total runtime below about 95 questions; LinearRAG is lower beyond that point.
HippoRAG	193.0	ComRAG has lower total runtime below about 193 questions; HippoRAG is lower beyond that point.
HippoRAG2	214.0	ComRAG has lower total runtime below about 214 questions; HippoRAG2 is lower beyond that point.
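
The interpolation step can be sketched in Python as follows. This sketch uses only the four question counts tabulated in Results, which is an assumption: the reported intersections were presumably computed on a denser set of question counts that is not listed here. The function and variable names are illustrative.

```python
# Total runtimes in seconds, copied from the Results table above.
QUESTION_COUNTS = [50, 100, 200, 300]
TOTAL_TIME = {
    "ComRAG": [1508.3, 3013.5, 6413.4, 10202.9],
    "LinearRAG": [1801.8, 2983.9, 5399.5, 7828.0],
    "HippoRAG2": [6239.4, 6396.2, 6743.6, 7098.5],
    "HippoRAG": [5857.0, 5964.6, 6188.1, 6410.4],
}


def crossover(counts, ours, baseline):
    """Return the interpolated question count where `ours` stops being faster.

    Finds the first adjacent pair of counts where the runtime gap
    (ours - baseline) goes from negative to non-negative, then interpolates
    linearly inside that interval. Returns None if there is no crossing.
    """
    gaps = [a - b for a, b in zip(ours, baseline)]
    for i in range(len(counts) - 1):
        g0, g1 = gaps[i], gaps[i + 1]
        if g0 < 0 <= g1:
            frac = -g0 / (g1 - g0)
            return counts[i] + frac * (counts[i + 1] - counts[i])
    return None


for name in ("LinearRAG", "HippoRAG", "HippoRAG2"):
    x = crossover(QUESTION_COUNTS, TOTAL_TIME["ComRAG"], TOTAL_TIME[name])
    print(f"{name}: {x:.1f}")
```

On this coarse grid the estimates come out at roughly 95.4, 192.9, and 209.6 questions. They are close to, but not identical with, the reported 94.9, 193.0, and 214.0, which is what one would expect if the reported values were interpolated between more closely spaced question counts.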

Figure

Observations

ComRAG's main runtime advantage appears at the beginning of a workload. Unlike methods that build dense indexes, graph structures, or document-level embeddings before inference, ComRAG does not perform a full offline embedding pass over the corpus. This makes ComRAG cheap to start on a new corpus and gives it the lowest total runtime at 50 questions.
The crossover points show the operating range more clearly than far-tail question counts. ComRAG has lower total runtime than LinearRAG up to about 95 questions, than HippoRAG up to about 193 questions, and than HippoRAG2 up to about 214 questions.
This pattern reflects a deliberate tradeoff: ComRAG avoids fixed offline preprocessing cost, but pays more computation during online question processing through local retrieval, question decomposition, and runtime evidence scoring. As the same corpus is queried many more times, preprocessing-based systems can amortize their offline cost and eventually become faster in total runtime.
Therefore, these results should be read as evidence for ComRAG's low-preparation, early-workload advantage rather than as a claim of universal runtime superiority. ComRAG is most suitable when the corpus is new, the question set is modest, the corpus changes frequently, or offline preprocessing is not affordable.

4. Dataset Analysis

This section analyzes the embedding cosine similarity between each question and the evidence document at each hop for HotpotQA, 2WikiMultiHopQA, and MuSiQue. The x-axis is cosine similarity, and the y-axis is an approximate density curve derived from each hop's mean and standard deviation. Curves farther to the right indicate that the evidence at that hop is more semantically similar to the original question. In the notes below, P1 through P4 refer to the evidence at hops 1 through 4.
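
For reference, the similarity measure can be written as a short Python function. This is a generic sketch of cosine similarity, not the analysis script used for the figure; the three-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real embeddings (made-up values).
question = [0.2, 0.7, 0.1]
hop1_evidence = [0.25, 0.65, 0.05]
hop2_evidence = [0.6, 0.1, 0.4]
print(cosine_similarity(question, hop1_evidence))
print(cosine_similarity(question, hop2_evidence))
```

In this toy case the first-hop vector points in nearly the same direction as the question and scores higher than the second-hop vector, which is the pattern the figure summarizes per hop.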

Figure

Direct Observations

2WikiMultiHopQA shows a clear decreasing trend across hops: P1=0.6869 -> P2=0.6014 -> P3=0.4404 -> P4=0.4370. The drop from P2 to P3 is 0.1610, indicating that later-hop evidence becomes much less similar to the question. P3 and P4 nearly overlap, with a difference of only 0.0034.
HotpotQA has P1=0.7163 and P2=0.6925, a drop of only 0.0238. The two-hop distributions are very close, which means hop-specific question-similarity signals are weak in this setting.
MuSiQue is not monotonically decreasing: P1=0.6382 -> P2=0.5615, then P3 rises slightly to 0.5700 before P4 drops to 0.5416. This suggests that intermediate bridge entities or bridge sentences can make later hops closer to the original question again.
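
The hop-to-hop differences quoted above can be recomputed from the reported means. The sketch below uses only the numbers stated in this section; the helper names are illustrative.

```python
# Mean question-evidence cosine similarity per hop, copied from the text above.
HOP_MEANS = {
    "2WikiMultiHopQA": [0.6869, 0.6014, 0.4404, 0.4370],
    "HotpotQA": [0.7163, 0.6925],
    "MuSiQue": [0.6382, 0.5615, 0.5700, 0.5416],
}


def hop_drops(means):
    """Drop in mean similarity between consecutive hops (positive = decrease)."""
    return [round(a - b, 4) for a, b in zip(means, means[1:])]


def is_monotonically_decreasing(means):
    """True if every hop has a strictly lower mean than the previous hop."""
    return all(a > b for a, b in zip(means, means[1:]))


for dataset, means in HOP_MEANS.items():
    print(dataset, hop_drops(means), is_monotonically_decreasing(means))
```

A negative entry in the list of drops marks a hop whose evidence is closer to the question than the previous hop's, as happens for MuSiQue between P2 and P3.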

Per-Dataset Analysis

For 2WikiMultiHopQA, the semantic distribution differs most strongly across hops, especially from P1 to P2 and from P2 to P3. As hop count increases, evidence becomes much less similar to the original question, making direct retrieval from the original question more likely to miss later evidence. This makes 2WikiMultiHopQA well suited to hop-aware retrieval, scoring, and reranking, and helps explain why ComRAG's reasoning specialization and question decomposition are more effective on this dataset.
For HotpotQA, the P1 and P2 similarity distributions are highly similar, and P1 -> P2 has only a small-to-medium effect size (d=0.357). This suggests that complex hop-aware gating is less appropriate. A unified threshold, unified reranker, or stable two-hop joint retrieval strategy may be more suitable than many hop-specific rules.
For MuSiQue, the hop signal is structurally non-monotonic. P2 -> P3 increases (d=-0.106), so the assumption that later hops are always less relevant does not hold. This dataset needs more path-aware features, such as bridge entities, relation types, and step consistency, rather than relying only on monotonic decay in question similarity.
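
The effect sizes above are reported as d values. A minimal sketch of Cohen's d with a pooled standard deviation follows; this is the common textbook definition and is an assumption here, since the exact variant used for these results is not stated. The sign convention (earlier hop minus later hop) is chosen so that an increase in similarity yields a negative d, matching the MuSiQue P2 -> P3 case. The sample values are made up for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(first, second):
    """Cohen's d with pooled SD: (mean(first) - mean(second)) / s_pooled."""
    n1, n2 = len(first), len(second)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two values")
    pooled_var = (
        (n1 - 1) * variance(first) + (n2 - 1) * variance(second)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero")
    return (mean(first) - mean(second)) / math.sqrt(pooled_var)


# Made-up similarity samples for two hops, for illustration only.
earlier_hop = [0.70, 0.66, 0.74, 0.69, 0.72]
later_hop = [0.68, 0.65, 0.71, 0.70, 0.66]
print(round(cohens_d(earlier_hop, later_hop), 3))
```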

Cross-Dataset Analysis

Both HotpotQA hop distributions lie to the right of every MuSiQue hop distribution (means of 0.7163 and 0.6925 versus at most 0.6382), meaning HotpotQA evidence is more semantically similar to the original question. This helps explain why traditional RAG usually performs better on HotpotQA and faces more retrieval difficulty on MuSiQue.
2WikiMultiHopQA has the largest hop-to-question similarity gap among the three datasets. Its difficulty is concentrated in later hops becoming farther from the surface semantics of the original question, which directly strengthens the need for multi-step reasoning, subquestion generation, and staged evidence discovery.
MuSiQue does not have the largest hop gap, yet it has the lowest accuracy of the three datasets in Section 1. Its difficulty is therefore not explained only by later hops moving farther from the question. Combined with prior findings, it is more likely related to entity rewriting, relation paraphrasing, and complex evidence-chain structure. ComRAG's smaller gain on MuSiQue is therefore reasonable: ComRAG strengthens multi-hop reasoning and question decomposition, while MuSiQue also requires stronger entity and expression alignment.

Overall, evidence-question similarity usually decreases as hop count grows, weakening direct one-shot retrieval from the original question. This supports the design motivation for ComRAG: multi-hop QA should use explicit reasoning, subquestion decomposition, and staged evidence discovery rather than relying only on a single retrieval step.
