syswonder/ComRAG

ComRAG Supplementary Results

This directory contains supplementary experimental results, figures, and brief notes that are not included in the main paper. The results summarize ComRAG performance across datasets, retrieval settings, and RAG baselines.

1. ComRAG Accuracy with Different Embedding Models

This section reports ComRAG accuracy on HotpotQA, 2WikiMultiHopQA, and MuSiQue when different embedding models are used.

Results

| Embedding Model | Hotpot Machine Acc. | Hotpot GPT Acc. | 2Wiki Machine Acc. | 2Wiki GPT Acc. | MuSiQue Machine Acc. | MuSiQue GPT Acc. | Avg. Machine Acc. | Avg. GPT Acc. |
|---|---|---|---|---|---|---|---|---|
| all-mpnet-base-v2 | 0.560 | 0.640 | 0.680 | 0.730 | 0.256 | 0.320 | 0.499 | 0.563 |
| all-MiniLM-L6-v2 | 0.550 | 0.620 | 0.670 | 0.710 | 0.230 | 0.280 | 0.483 | 0.537 |
| bge-large-en-v1.5 | 0.573 | 0.660 | 0.710 | 0.750 | 0.260 | 0.330 | 0.514 | 0.580 |
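The two average columns are consistent with an unweighted mean over the three datasets. The following minimal sketch recomputes them from the per-dataset values in the table; the dictionary layout and the name `macro_average` are illustrative assumptions, not part of the ComRAG code.

```python
# Per-dataset accuracies copied from the table above, in the order
# (HotpotQA, 2WikiMultiHopQA, MuSiQue).
RESULTS = {
    "all-mpnet-base-v2": {"machine": (0.560, 0.680, 0.256), "gpt": (0.640, 0.730, 0.320)},
    "all-MiniLM-L6-v2": {"machine": (0.550, 0.670, 0.230), "gpt": (0.620, 0.710, 0.280)},
    "bge-large-en-v1.5": {"machine": (0.573, 0.710, 0.260), "gpt": (0.660, 0.750, 0.330)},
}


def macro_average(scores):
    """Unweighted mean over datasets, rounded to three decimals (hypothetical helper)."""
    return round(sum(scores) / len(scores), 3)


averages = {
    model: {metric: macro_average(values) for metric, values in metrics.items()}
    for model, metrics in RESULTS.items()
}

for model, avg in averages.items():
    print(f"{model}: machine={avg['machine']:.3f} gpt={avg['gpt']:.3f}")
```

Because the mean is unweighted, each dataset contributes equally regardless of how many questions it contains.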

Figure

ComRAG accuracy under different embedding models

Observations

  • bge-large-en-v1.5 performs best overall across the three datasets, with an average machine accuracy of 0.514 and an average GPT accuracy of 0.580.
  • 2WikiMultiHopQA has the highest accuracy among the three datasets. With bge-large-en-v1.5, ComRAG reaches 0.710 machine accuracy and 0.750 GPT accuracy.
  • MuSiQue is substantially harder than HotpotQA and 2WikiMultiHopQA under the current ComRAG configuration.
  • GPT accuracy is higher than machine accuracy for all models and datasets, suggesting that automatic string matching underestimates some semantically correct answers whose wording differs from the reference.
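The gap between machine and GPT accuracy can be illustrated with a small string-matching sketch. It assumes a SQuAD-style normalization (lowercasing, removing punctuation and English articles); the scorer actually used for these results may differ, and the answer pairs are hypothetical examples rather than dataset items.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


# Hypothetical (prediction, reference) pairs, for illustration only.
pairs = [
    ("The Eiffel Tower", "Eiffel Tower"),  # differs only by an article
    ("New York City", "NYC"),              # same entity, different wording
    ("in 1969", "1969"),                   # correct year with an extra word
]

for prediction, reference in pairs:
    print(f"{prediction!r} vs {reference!r}: {exact_match(prediction, reference)}")
```

Only the first pair survives normalization. The other two would count as errors under string matching even though a human or LLM judge would likely accept them.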

2. ComRAG Accuracy on 2WikiMultiHopQA with Different embedding_topk and direct_bm25_topm

This section reports a two-variable ablation on the 2WikiMultiHopQA dev set. The evaluation sample contains 150 questions drawn from 2WikiMultiHopQA. The experiment varies only embedding_topk and direct_bm25_topm, and reports both automatic exact match and manually verified accuracy (called real accuracy below).

Figures

2WikiMultiHopQA machine exact match heatmap (Machine Exact Match)
2WikiMultiHopQA real accuracy heatmap (Real Accuracy)

Observations

  • The best setting is embedding_topk=3 and direct_bm25_topm=2, reaching 0.7067 machine exact match and 0.7667 real accuracy.
  • When embedding_topk is fixed, direct_bm25_topm=2 is generally better than direct_bm25_topm=1 or direct_bm25_topm=3. Too few direct BM25 candidates can miss key evidence, while too many candidates can introduce noise.
  • Increasing embedding_topk does not produce stable gains. The best result with embedding_topk=3 is clearly higher than those with embedding_topk=6 or embedding_topk=9, indicating that too many candidates may interfere with later reasoning and reranking.
  • Real accuracy is consistently higher than machine exact match, because exact match rejects semantically equivalent answers whose surface form differs from the reference. Real accuracy should therefore be prioritized when comparing settings, while machine exact match remains useful as a reproducible automatic metric.
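Because the sample has 150 questions, each reported accuracy corresponds to a whole number of correct answers. The sketch below converts the best-setting scores back into counts; the helper name `correct_count` is a hypothetical one introduced for illustration.

```python
SAMPLE_SIZE = 150  # questions in the evaluation sample, as stated above


def correct_count(accuracy: float, n: int = SAMPLE_SIZE) -> int:
    """Nearest whole number of correct answers implied by a rounded accuracy."""
    return round(accuracy * n)


em_correct = correct_count(0.7067)    # machine exact match at the best setting
real_correct = correct_count(0.7667)  # real accuracy at the best setting
step = 1 / SAMPLE_SIZE                # accuracy change caused by one question

print(em_correct, real_correct, real_correct - em_correct, round(step, 4))
```

At the best setting this gives 106 exact-match hits and 115 manually accepted answers, so nine answers were judged correct despite failing exact match. One question shifts accuracy by about 0.0067, which is worth keeping in mind when comparing neighboring cells of the heatmaps.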

3. Early-Workload Runtime Comparison Across RAG Methods

This section evaluates only the 2WikiMultiHopQA dataset and focuses on the early workload range, where ComRAG's low-preparation design is most relevant. Total runtime is computed as preprocessing time plus accumulated question-processing time. ComRAG has no preprocessing time. Intersection points are estimated by linear interpolation between adjacent question counts.

Results

| Question Count | ComRAG Time | LinearRAG Time | HippoRAG2 Time | HippoRAG Time | Notes |
|---|---|---|---|---|---|
| 50 | 1,508.3s | 1,801.8s | 6,239.4s | 5,857.0s | ComRAG faster than all baselines |
| 100 | 3,013.5s | 2,983.9s | 6,396.2s | 5,964.6s | ComRAG has just become slower than LinearRAG |
| 200 | 6,413.4s | 5,399.5s | 6,743.6s | 6,188.1s | ComRAG still faster than HippoRAG2, but slower than HippoRAG and LinearRAG |
| 300 | 10,202.9s | 7,828.0s | 7,098.5s | 6,410.4s | ComRAG slower than all baselines |

Summary

| Method | Preprocessing Time | Avg. Question Time | Total Time at 50 | Total Time at 300 |
|---|---|---|---|---|
| ComRAG | 0.0s | 34.0097s | 1,508.3s | 10,202.9s |
| LinearRAG | 524.6s | 23.8284s | 1,801.8s | 7,828.0s |
| HippoRAG2 | 6,050.6s | 3.4345s | 6,239.4s | 7,098.5s |
| HippoRAG | 5,751.0s | 2.2788s | 5,857.0s | 6,410.4s |

Intersections

| Baseline | Intersection Question Count | Interpretation |
|---|---|---|
| LinearRAG | 94.9 | ComRAG has lower total runtime below about 95 questions; LinearRAG is lower beyond that point. |
| HippoRAG | 193.0 | ComRAG has lower total runtime below about 193 questions; HippoRAG is lower beyond that point. |
| HippoRAG2 | 214.0 | ComRAG has lower total runtime below about 214 questions; HippoRAG2 is lower beyond that point. |
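The interpolation step can be sketched as follows, using only the four question counts listed in the Results table. The function name `crossover` and the data layout are assumptions made for illustration. With only these four points the sketch gives roughly 95.4, 192.9, and 209.6, which is close to but not identical to the reported intersections; the reported values may have been computed from a finer grid of question counts, which is an assumption on our part.

```python
# Total runtimes in seconds at each question count, copied from the Results table.
COUNTS = [50, 100, 200, 300]
TOTAL_TIME = {
    "ComRAG": [1508.3, 3013.5, 6413.4, 10202.9],
    "LinearRAG": [1801.8, 2983.9, 5399.5, 7828.0],
    "HippoRAG2": [6239.4, 6396.2, 6743.6, 7098.5],
    "HippoRAG": [5857.0, 5964.6, 6188.1, 6410.4],
}


def crossover(counts, ours, theirs):
    """Question count where `ours` stops being faster, or None if it never does.

    Linearly interpolates the runtime gap between adjacent measured counts.
    """
    gaps = [a - b for a, b in zip(ours, theirs)]  # negative means ours is faster
    for i in range(len(counts) - 1):
        g0, g1 = gaps[i], gaps[i + 1]
        if g0 < 0 <= g1:  # the sign change lies inside this interval
            fraction = -g0 / (g1 - g0)
            return counts[i] + fraction * (counts[i + 1] - counts[i])
    return None


for baseline in ("LinearRAG", "HippoRAG", "HippoRAG2"):
    point = crossover(COUNTS, TOTAL_TIME["ComRAG"], TOTAL_TIME[baseline])
    print(f"{baseline}: {point:.1f}")
```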

Figure

Early-workload total execution time of different RAG methods

Observations

  • ComRAG's main runtime advantage appears at the beginning of a workload. Unlike methods that build dense indexes, graph structures, or document-level embeddings before inference, ComRAG does not perform a full offline embedding pass over the corpus. This makes ComRAG cheap to start on a new corpus and gives it the lowest total runtime at 50 questions.
  • The crossover points show the operating range more clearly than far-tail question counts. ComRAG has lower total runtime than LinearRAG up to about 95 questions, than HippoRAG up to about 193 questions, and than HippoRAG2 up to about 214 questions.
  • This pattern reflects a deliberate tradeoff: ComRAG avoids fixed offline preprocessing cost, but pays more computation during online question processing through local retrieval, question decomposition, and runtime evidence scoring. As the same corpus is queried many more times, preprocessing-based systems can amortize their offline cost and eventually become faster in total runtime.
  • Therefore, these results should be read as evidence for ComRAG's low-preparation, early-workload advantage rather than as a claim of universal runtime superiority. ComRAG is most suitable when the corpus is new, the question set is modest, the corpus changes frequently, or offline preprocessing is not affordable.

4. Dataset Analysis

This section analyzes the embedding cosine similarity between each question and the evidence document at each hop for HotpotQA, 2WikiMultiHopQA, and MuSiQue. The x-axis is cosine similarity, and the y-axis is an approximate density distribution smoothed from the mean and standard deviation. Curves farther to the right indicate that the evidence at that hop is more semantically similar to the original question.
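As a minimal sketch of the two quantities on the plot, the code below computes cosine similarity between a question vector and per-hop evidence vectors, and evaluates a density curve from a mean and standard deviation. Two assumptions are made here: the vectors are toy 3-dimensional stand-ins for real embeddings, and the smoothed curve is treated as a normal density, which the text above does not state explicitly.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def normal_density(x, mean, std):
    """Gaussian density at x (assumes the plotted curve is a normal fit)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


# Toy vectors standing in for real embeddings (hypothetical values).
question = [1.0, 0.0, 1.0]
hop_evidence = {"hop1": [1.0, 0.0, 1.0], "hop2": [1.0, 1.0, 0.0]}

similarities = {
    hop: cosine_similarity(question, vector) for hop, vector in hop_evidence.items()
}
print(similarities)
```

In this toy case the first-hop vector points in the same direction as the question, while the second-hop vector shares only one component, so its similarity is lower.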

Figure

Embedding similarity distribution by dataset and hop

Direct Observations

  • Writing Pk for the similarity at hop k, 2WikiMultiHopQA shows a clear decreasing trend across hops: P1=0.6869 -> P2=0.6014 -> P3=0.4404 -> P4=0.4370. The drop from P2 to P3 is 0.1610, indicating that later-hop evidence becomes much less similar to the question. P3 and P4 nearly overlap, with a difference of only 0.0034.
  • HotpotQA has P1=0.7163 and P2=0.6925, a drop of only 0.0238. The two-hop distributions are very close, which means hop-specific question-similarity signals are weak in this setting.
  • MuSiQue is not monotonically decreasing: P1=0.6382 -> P2=0.5615, then P3 rises slightly to 0.5700 before P4 drops to 0.5416. This suggests that intermediate bridge entities or bridge sentences can make later hops closer to the original question again.
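The hop-to-hop differences quoted above can be recomputed directly from the reported values. The sketch below does so and checks whether each sequence decreases strictly; the names `hop_drops` and `is_strictly_decreasing` are hypothetical helpers.

```python
# Reported per-hop similarity values (P1, P2, ...) from the observations above.
HOP_SIMILARITY = {
    "2WikiMultiHopQA": [0.6869, 0.6014, 0.4404, 0.4370],
    "HotpotQA": [0.7163, 0.6925],
    "MuSiQue": [0.6382, 0.5615, 0.5700, 0.5416],
}


def hop_drops(values):
    """Drop between consecutive hops; a negative entry means similarity rose."""
    return [round(a - b, 4) for a, b in zip(values, values[1:])]


def is_strictly_decreasing(values):
    return all(a > b for a, b in zip(values, values[1:]))


for dataset, values in HOP_SIMILARITY.items():
    first_to_last = round(values[0] - values[-1], 4)
    print(dataset, hop_drops(values), first_to_last, is_strictly_decreasing(values))
```

The first-to-last gap is 0.2499 for 2WikiMultiHopQA, 0.0966 for MuSiQue, and 0.0238 for HotpotQA, and MuSiQue is the only dataset whose sequence is not strictly decreasing.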

Per-Dataset Analysis

  • For 2WikiMultiHopQA, the semantic distribution differs most strongly across hops, especially from P1 to P2 and from P2 to P3. As hop count increases, evidence becomes much less similar to the original question, making direct retrieval from the original question more likely to miss later evidence. This makes 2WikiMultiHopQA well suited to hop-aware retrieval, scoring, and reranking, and helps explain why ComRAG's reasoning specialization and question decomposition are more effective on this dataset.
  • For HotpotQA, the P1 and P2 similarity distributions are highly similar, and P1 -> P2 has only a small-to-medium effect size (d=0.357). This suggests that complex hop-aware gating is less appropriate. A unified threshold, unified reranker, or stable two-hop joint retrieval strategy may be more suitable than many hop-specific rules.
  • For MuSiQue, the hop signal is structurally non-monotonic. P2 -> P3 increases (d=-0.106), so the assumption that later hops are always less relevant does not hold. This dataset needs more path-aware features, such as bridge entities, relation types, and step consistency, rather than relying only on monotonic decay in question similarity.
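The effect sizes quoted above (d=0.357 and d=-0.106) follow the sign convention of earlier hop minus later hop, so a negative d means similarity increased. The sketch below shows one common definition, Cohen's d with a pooled standard deviation. Whether the reported values use exactly this variant is an assumption, and the sample values are hypothetical, not taken from the datasets.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d with pooled standard deviation: (mean_a - mean_b) / s_pooled."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(sample_b)
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled


# Hypothetical per-question similarity samples for two consecutive hops.
earlier_hop = [0.70, 0.74, 0.66, 0.72, 0.68]
later_hop = [0.68, 0.71, 0.65, 0.70, 0.66]

d = cohens_d(earlier_hop, later_hop)
print(f"d = {d:.3f}")
```

Swapping the two arguments flips the sign of d, which is why a rise from one hop to the next shows up as a negative effect size.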

Cross-Dataset Analysis

  • Both HotpotQA hops lie farther to the right than every MuSiQue hop (0.6925 at the lowest for HotpotQA versus 0.6382 at the highest for MuSiQue), meaning HotpotQA evidence is more semantically similar to the original question. This helps explain why traditional RAG usually performs better on HotpotQA and faces more retrieval difficulty on MuSiQue.
  • 2WikiMultiHopQA has the largest hop-to-question similarity gap among the three datasets. Its difficulty is concentrated in later hops becoming farther from the surface semantics of the original question, which directly strengthens the need for multi-step reasoning, subquestion generation, and staged evidence discovery.
  • MuSiQue does not have the largest hop gap, but it remains one of the harder datasets. Its difficulty is therefore not explained only by later hops moving farther from the question. Combined with prior findings, this is more likely related to entity rewriting, relation paraphrasing, and complex evidence-chain structure. ComRAG's smaller gain on MuSiQue is therefore reasonable: ComRAG strengthens multi-hop reasoning and question decomposition, while MuSiQue also requires stronger entity and expression alignment.

Overall, evidence-question similarity usually decreases as hop count grows, weakening direct one-shot retrieval from the original question. This supports the design motivation for ComRAG: multi-hop QA should use explicit reasoning, subquestion decomposition, and staged evidence discovery rather than relying only on a single retrieval step.
