This directory contains supplementary experimental results, figures, and brief notes that are not included in the main paper. The results summarize ComRAG performance across datasets, retrieval settings, and RAG baselines.
- 1. ComRAG Accuracy with Different Embedding Models
- 2. ComRAG Accuracy on 2WikiMultiHopQA with Different `embedding_topk` and `direct_bm25_topm`
- 3. Early-Workload Runtime Comparison Across RAG Methods
- 4. Dataset Analysis
This section reports ComRAG accuracy on HotpotQA, 2WikiMultiHopQA, and MuSiQue when different embedding models are used.
| Embedding Model | Hotpot Machine Acc. | Hotpot GPT Acc. | 2Wiki Machine Acc. | 2Wiki GPT Acc. | MuSiQue Machine Acc. | MuSiQue GPT Acc. | Avg. Machine Acc. | Avg. GPT Acc. |
|---|---|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 0.560 | 0.640 | 0.680 | 0.730 | 0.256 | 0.320 | 0.499 | 0.563 |
| all-MiniLM-L6-v2 | 0.550 | 0.620 | 0.670 | 0.710 | 0.230 | 0.280 | 0.483 | 0.537 |
| bge-large-en-v1.5 | 0.573 | 0.660 | 0.710 | 0.750 | 0.260 | 0.330 | 0.514 | 0.580 |
- `bge-large-en-v1.5` performs best overall across the three datasets, with an average machine accuracy of 0.514 and an average GPT accuracy of 0.580.
- 2WikiMultiHopQA has the highest accuracy among the three datasets. With `bge-large-en-v1.5`, ComRAG reaches 0.710 machine accuracy and 0.750 GPT accuracy.
- MuSiQue is substantially harder than HotpotQA and 2WikiMultiHopQA under the current ComRAG configuration.
- GPT accuracy is higher than machine accuracy for all models and datasets, suggesting that automatic string matching underestimates some semantically correct answers whose wording differs from the reference.
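The averaged columns in the table appear to be unweighted means over the three datasets. The sketch below recomputes them from the per-dataset values under that assumption; the dictionary names and the `macro_average` helper are illustrative and not part of the ComRAG code.

```python
# Per-dataset accuracies copied from the table above, in the order
# (HotpotQA, 2WikiMultiHopQA, MuSiQue).
MACHINE_ACC = {
    "all-mpnet-base-v2": (0.560, 0.680, 0.256),
    "all-MiniLM-L6-v2": (0.550, 0.670, 0.230),
    "bge-large-en-v1.5": (0.573, 0.710, 0.260),
}
GPT_ACC = {
    "all-mpnet-base-v2": (0.640, 0.730, 0.320),
    "all-MiniLM-L6-v2": (0.620, 0.710, 0.280),
    "bge-large-en-v1.5": (0.660, 0.750, 0.330),
}


def macro_average(scores):
    """Unweighted mean across datasets, rounded to three decimals (assumed convention)."""
    return round(sum(scores) / len(scores), 3)


for model in MACHINE_ACC:
    print(model, macro_average(MACHINE_ACC[model]), macro_average(GPT_ACC[model]))
```

Under this assumption the recomputed values match the "Avg." columns, which suggests that each dataset is weighted equally regardless of its size.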
This section reports a two-variable ablation on the 2WikiMultiHopQA dev set. The test split contains 150 questions sampled from 2WikiMultiHopQA. The experiment changes only `embedding_topk` and `direct_bm25_topm`, and reports both automatic exact match and manually verified real accuracy.
[Tables: Machine Exact Match and Real Accuracy for each combination of `embedding_topk` and `direct_bm25_topm`]
- The best setting is `embedding_topk=3` and `direct_bm25_topm=2`, reaching 0.7067 machine exact match and 0.7667 real accuracy.
- When `embedding_topk` is fixed, `direct_bm25_topm=2` is generally better than `direct_bm25_topm=1` or `direct_bm25_topm=3`. Too few direct BM25 candidates can miss key evidence, while too many candidates can introduce noise.
- Increasing `embedding_topk` does not produce stable gains. The best result with `embedding_topk=3` is clearly higher than those with `embedding_topk=6` or `embedding_topk=9`, indicating that too many candidates may interfere with later reasoning and reranking.
- Real accuracy is consistently higher than machine exact match, because automatic exact match is strict for semantically equivalent answers with different surface forms. Real accuracy should therefore be prioritized when comparing settings, while machine exact match remains a reproducible automatic metric.
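To illustrate why automatic exact match can undercount correct answers, here is a minimal sketch of a normalized exact-match check. The normalization steps (lowercasing, stripping punctuation and English articles) follow a common QA evaluation convention and are an assumption; the source does not state which normalization ComRAG's evaluation uses, and `normalize_answer`, `exact_match`, and `accuracy` are hypothetical helper names.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(reference)


def accuracy(correct: int, total: int) -> float:
    """Fraction of correct answers, rounded to four decimals."""
    return round(correct / total, 4)


# Formatting differences are absorbed by normalization ...
print(exact_match("The Eiffel Tower.", "Eiffel Tower"))
# ... but a longer answer that names the same entity still fails.
print(exact_match("New York City", "New York"))
```

The second case is the kind of answer a manual check may accept but string matching rejects. As a consistency check, 0.7067 and 0.7667 are what `accuracy(106, 150)` and `accuracy(115, 150)` return, so the best setting would correspond to 106 and 115 correct answers out of 150 if both metrics are plain fractions of the split.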
This section evaluates only the 2WikiMultiHopQA dataset and focuses on the early workload range, where ComRAG's low-preparation design is most relevant. Total runtime is computed as preprocessing time plus accumulated question-processing time. ComRAG has no preprocessing time. Intersection points are estimated by linear interpolation between adjacent question counts.
| Question Count | ComRAG Time | LinearRAG Time | HippoRAG2 Time | HippoRAG Time | Notes |
|---|---|---|---|---|---|
| 50 | 1,508.3s | 1,801.8s | 6,239.4s | 5,857.0s | ComRAG faster than all baselines |
| 100 | 3,013.5s | 2,983.9s | 6,396.2s | 5,964.6s | ComRAG has just become slower than LinearRAG |
| 200 | 6,413.4s | 5,399.5s | 6,743.6s | 6,188.1s | ComRAG still faster than HippoRAG2, but slower than HippoRAG and LinearRAG |
| 300 | 10,202.9s | 7,828.0s | 7,098.5s | 6,410.4s | ComRAG slower than all baselines |
| Method | Preprocessing Time | Avg. Question Time | Total Time at 50 | Total Time at 300 |
|---|---|---|---|---|
| ComRAG | 0.0s | 34.0097s | 1,508.3s | 10,202.9s |
| LinearRAG | 524.6s | 23.8284s | 1,801.8s | 7,828.0s |
| HippoRAG2 | 6,050.6s | 3.4345s | 6,239.4s | 7,098.5s |
| HippoRAG | 5,751.0s | 2.2788s | 5,857.0s | 6,410.4s |
| Baseline | Intersection Question Count | Interpretation |
|---|---|---|
| LinearRAG | 94.9 | ComRAG has lower total runtime below about 95 questions; LinearRAG is lower beyond that point. |
| HippoRAG | 193.0 | ComRAG has lower total runtime below about 193 questions; HippoRAG is lower beyond that point. |
| HippoRAG2 | 214.0 | ComRAG has lower total runtime below about 214 questions; HippoRAG2 is lower beyond that point. |
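The crossover estimation can be sketched as follows, using only the four question counts listed in the first runtime table. The reported intersection points were presumably interpolated over a finer grid of question counts that is not shown here, so this coarse version is an approximation, and the `crossover` function is illustrative rather than the script used for the table.

```python
# Total runtimes in seconds, copied from the runtime table above.
COUNTS = [50, 100, 200, 300]
TOTAL_TIME = {
    "ComRAG": [1508.3, 3013.5, 6413.4, 10202.9],
    "LinearRAG": [1801.8, 2983.9, 5399.5, 7828.0],
    "HippoRAG2": [6239.4, 6396.2, 6743.6, 7098.5],
    "HippoRAG": [5857.0, 5964.6, 6188.1, 6410.4],
}


def crossover(counts, ours, baseline):
    """Interpolated question count where `ours` first becomes slower than `baseline`.

    Returns None if no sign change occurs between adjacent counts.
    """
    diffs = [o - b for o, b in zip(ours, baseline)]
    for i in range(len(counts) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 <= 0 < d1:
            # Linear interpolation of the zero crossing of the time difference.
            frac = -d0 / (d1 - d0)
            return counts[i] + frac * (counts[i + 1] - counts[i])
    return None


for name in ("LinearRAG", "HippoRAG", "HippoRAG2"):
    point = crossover(COUNTS, TOTAL_TIME["ComRAG"], TOTAL_TIME[name])
    print(f"{name}: {point:.1f}")
```

On this coarse grid the estimates come out at roughly 95, 193, and 210 questions. That is close to the reported 94.9 and 193.0, and a little below the reported 214.0, which is consistent with the reported values having been computed from more closely spaced measurements.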
- ComRAG's main runtime advantage appears at the beginning of a workload. Unlike methods that build dense indexes, graph structures, or document-level embeddings before inference, ComRAG does not perform a full offline embedding pass over the corpus. This makes ComRAG cheap to start on a new corpus and gives it the lowest total runtime at 50 questions.
- The crossover points show the operating range more clearly than far-tail question counts. ComRAG has lower total runtime than LinearRAG up to about 95 questions, than HippoRAG up to about 193 questions, and than HippoRAG2 up to about 214 questions.
- This pattern reflects a deliberate tradeoff: ComRAG avoids fixed offline preprocessing cost, but pays more computation during online question processing through local retrieval, question decomposition, and runtime evidence scoring. As the same corpus is queried many more times, preprocessing-based systems can amortize their offline cost and eventually become faster in total runtime.
- Therefore, these results should be read as evidence for ComRAG's low-preparation, early-workload advantage rather than as a claim of universal runtime superiority. ComRAG is most suitable when the corpus is new, the question set is modest, the corpus changes frequently, or offline preprocessing is not affordable.
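The amortization argument can be made concrete with a simple linear cost model, in which total time is preprocessing time plus a constant per-question time. This is a deliberate simplification and an assumption of the sketch: the measured totals are not exactly linear, and `total_time` and `break_even` are hypothetical helpers.

```python
def total_time(prep_s: float, per_question_s: float, n_questions: int) -> float:
    """Total runtime under a constant per-question cost (simplifying assumption)."""
    return prep_s + per_question_s * n_questions


def break_even(prep_a: float, rate_a: float, prep_b: float, rate_b: float):
    """Question count at which methods A and B have equal total time.

    Returns None when both methods have the same per-question rate.
    """
    if rate_a == rate_b:
        return None
    return (prep_b - prep_a) / (rate_a - rate_b)


# Preprocessing and average per-question times from the summary table:
# ComRAG (0.0 s, 34.0097 s) against HippoRAG (5,751.0 s, 2.2788 s).
print(break_even(0.0, 34.0097, 5751.0, 2.2788))
```

For HippoRAG this constant-rate model gives a break-even of about 181 questions, somewhat below the interpolated 193.0. The gap is expected: ComRAG's measured cost over the first 50 questions is 1,508.3 s / 50, or about 30.2 s per question, which is lower than its 34.0 s average, so the interpolated values from measured totals remain the better estimate.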
This section analyzes the embedding cosine similarity between each question and the evidence document at each hop for HotpotQA, 2WikiMultiHopQA, and MuSiQue. The x-axis is cosine similarity, and the y-axis is an approximate density distribution smoothed from the mean and standard deviation. Curves farther to the right indicate that the evidence at that hop is more semantically similar to the original question.
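A minimal sketch of the two quantities behind these curves: the cosine similarity between a question embedding and an evidence embedding, and a normal density built from a mean and standard deviation. The per-hop standard deviations are not listed in this document, so `ASSUMED_STD` is a placeholder value, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gaussian_density(x, mean, std):
    """Normal probability density at x for the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


# Mean similarities per hop for 2WikiMultiHopQA, as reported in this section.
HOP_MEANS_2WIKI = {"P1": 0.6869, "P2": 0.6014, "P3": 0.4404, "P4": 0.4370}
ASSUMED_STD = 0.1  # placeholder: the real per-hop standard deviations are not given here

# Sample each smoothed curve on a grid of similarity values in [0, 1].
grid = [i / 100 for i in range(101)]
curves = {
    hop: [gaussian_density(x, mean, ASSUMED_STD) for x in grid]
    for hop, mean in HOP_MEANS_2WIKI.items()
}
```

Each curve peaks at its hop's mean similarity, so a curve shifted to the left corresponds to a hop whose evidence is less similar to the question.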
- 2WikiMultiHopQA shows a clear decreasing trend across hops: P1=0.6869 -> P2=0.6014 -> P3=0.4404 -> P4=0.4370. The drop from P2 to P3 is 0.1610, indicating that later-hop evidence becomes much less similar to the question. P3 and P4 nearly overlap, with a difference of only 0.0034.
- HotpotQA has P1=0.7163 and P2=0.6925, a drop of only 0.0238. The two-hop distributions are very close, which means hop-specific question-similarity signals are weak in this setting.
- MuSiQue is not monotonically decreasing: P1=0.6382 -> P2=0.5615, then P3 rises slightly to 0.5700 before P4 drops to 0.5416. This suggests that intermediate bridge entities or bridge sentences can make later hops closer to the original question again.
- For 2WikiMultiHopQA, the semantic distribution differs most strongly across hops, especially from P1 to P2 and from P2 to P3. As hop count increases, evidence becomes much less similar to the original question, making direct retrieval from the original question more likely to miss later evidence. This makes 2WikiMultiHopQA well suited to hop-aware retrieval, scoring, and reranking, and helps explain why ComRAG's reasoning specialization and question decomposition are more effective on this dataset.
- For HotpotQA, the P1 and P2 similarity distributions are highly similar, and P1 -> P2 has only a small-to-medium effect size (`d=0.357`). This suggests that complex hop-aware gating is less appropriate. A unified threshold, unified reranker, or stable two-hop joint retrieval strategy may be more suitable than many hop-specific rules.
- For MuSiQue, the hop signal is structurally non-monotonic. P2 -> P3 increases (`d=-0.106`), so the assumption that later hops are always less relevant does not hold. This dataset needs more path-aware features, such as bridge entities, relation types, and step consistency, rather than relying only on monotonic decay in question similarity.
- Each HotpotQA hop distribution generally lies farther to the right than the MuSiQue distributions, meaning HotpotQA evidence is more semantically similar to the original question. This helps explain why traditional RAG usually performs better on HotpotQA and faces more retrieval difficulty on MuSiQue.
- 2WikiMultiHopQA has the largest hop-to-question similarity gap among the three datasets. Its difficulty is concentrated in later hops becoming farther from the surface semantics of the original question, which directly strengthens the need for multi-step reasoning, subquestion generation, and staged evidence discovery.
- MuSiQue does not have the largest hop gap, but it remains one of the harder datasets. Its difficulty is therefore not explained only by later hops moving farther from the question. Combined with prior findings, this is more likely related to entity rewriting, relation paraphrasing, and complex evidence-chain structure. ComRAG's smaller gain on MuSiQue is therefore reasonable: ComRAG strengthens multi-hop reasoning and question decomposition, while MuSiQue also requires stronger entity and expression alignment.
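The effect sizes quoted above (`d=0.357` for HotpotQA P1 -> P2 and `d=-0.106` for MuSiQue P2 -> P3) read like Cohen's d between the similarity distributions of adjacent hops. The sketch below uses the standard pooled-standard-deviation form with the earlier hop first, so that an increase in similarity gives a negative value. The exact formula and sign convention used for the reported numbers are not stated, so this is an assumption, and the sample values are made up for illustration.

```python
import math
import statistics


def cohens_d(earlier, later):
    """Cohen's d with pooled standard deviation: (mean(earlier) - mean(later)) / s_pooled."""
    n1, n2 = len(earlier), len(later)
    var1 = statistics.variance(earlier)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(later)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(earlier) - statistics.mean(later)) / pooled_sd


# Illustrative similarity samples only, not data from the experiments.
hop1 = [0.70, 0.80, 0.90]
hop2 = [0.60, 0.70, 0.80]
print(cohens_d(hop1, hop2))  # positive: similarity drops from hop 1 to hop 2
print(cohens_d(hop2, hop1))  # negative: similarity rises
```

Under this convention a small positive d means the later hop is only slightly less similar to the question, and a negative d means the later hop is closer to the question than the earlier one, which matches how the HotpotQA and MuSiQue values are interpreted above.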
Overall, evidence-question similarity usually decreases as hop count grows, weakening direct one-shot retrieval from the original question. This supports the design motivation for ComRAG: multi-hop QA should use explicit reasoning, subquestion decomposition, and staged evidence discovery rather than relying only on a single retrieval step.




