ignorejjj/LongRefiner

LongRefiner | Hierarchical Document Refinement for Long-context Retrieval-augmented Generation

🔍 Overview

LongRefiner is an efficient plug-and-play refinement system for long-context RAG applications. Through hierarchical document refinement, it compresses retrieved documents by 10x while maintaining superior performance.

✨ Key Features

  • Low Latency: 10x computational cost reduction compared to baselines
  • 🔌 Plug and Play: Compatible with any LLM and retrieval system
  • 📑 Hierarchical Document Structuring: XML-based efficient document representation
  • 🔄 Adaptive Refinement: Dynamic content selection based on query requirements
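
To make the last two features more concrete, here is a minimal toy sketch of selecting sections from an XML-structured document under a budget. Everything in it is an assumption for illustration: the tag names, the word-overlap scorer, and the word-count budget are hypothetical and are not LongRefiner's actual schema, scoring model, or selection algorithm.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; tag and attribute names are illustrative only.
DOC_XML = """
<document title="Solar power">
  <section title="History">Early photovoltaic cells appeared in the 1950s.</section>
  <section title="Efficiency">Modern panels convert roughly a fifth of sunlight into electricity.</section>
  <section title="Recycling">Old panels can be recycled for glass and metals.</section>
</document>
"""


def parse_sections(xml_text):
    """Return (title, text) pairs for every <section> element."""
    root = ET.fromstring(xml_text.strip())
    return [(s.get("title", ""), (s.text or "").strip()) for s in root.iter("section")]


def overlap_score(query, text):
    """Toy relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def select_under_budget(query, sections, budget):
    """Greedily keep the highest-scoring sections whose word counts fit the budget."""
    ranked = sorted(
        sections,
        key=lambda s: overlap_score(query, s[0] + " " + s[1]),
        reverse=True,
    )
    chosen, used = [], 0
    for title, text in ranked:
        cost = len(text.split())  # stand-in for a real token count
        if used + cost <= budget:
            chosen.append(title)
            used += cost
    return chosen


sections = parse_sections(DOC_XML)
selected = select_under_budget("panel efficiency of sunlight", sections, budget=12)
```

A real system would replace the word-overlap scorer with a learned relevance model and count tokens with the target model's tokenizer.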

🗺️ RoadMap

  • Release training code
  • Release trained modules
  • Release evaluation code
  • Release code for building custom training data

🛠️ Installation

git clone https://github.com/ignorejjj/LongRefiner.git
cd LongRefiner
pip install -r requirements.txt
pip install -e .

For training, additionally install the LLaMA-Factory framework by following the instructions in its official repository:

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

🚀 Quick Start

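Pre-trained LoRA models are available for download; the example below references them by model ID (for example, jinjiajie/Query-Analysis-Qwen2.5-3B-Instruct).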

import json
from longrefiner import LongRefiner

# Initialize
query_analysis_module_lora_path = "jinjiajie/Query-Analysis-Qwen2.5-3B-Instruct"
doc_structuring_module_lora_path = "jinjiajie/Doc-Structuring-Qwen2.5-3B-Instruct"
selection_module_lora_path = "jinjiajie/Global-Selection-Qwen2.5-3B-Instruct"

refiner = LongRefiner(
    base_model_path="Qwen/Qwen2.5-3B-Instruct",
    query_analysis_module_lora_path=query_analysis_module_lora_path,
    doc_structuring_module_lora_path=doc_structuring_module_lora_path,
    global_selection_module_lora_path=selection_module_lora_path,
    score_model_name="bge-reranker-v2-m3",
    score_model_path="BAAI/bge-reranker-v2-m3",
    max_model_len=25000,
)

# Load sample data
with open("assets/sample_data.json", "r") as f:
    data = json.load(f)
question = list(data.keys())[0]
document_list = list(data.values())[0]

# Process documents
refined_result = refiner.run(question, document_list, budget=2048)
print(refined_result)
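
The example above refines only the first entry of the sample file. A minimal sketch for processing every entry follows, assuming the file keeps the same shape (a JSON object mapping each question to its document list). The `refine_all` helper and the `fake_run` stand-in are hypothetical and not part of the longrefiner package; the plain strings used as documents are placeholders, since the element format is not specified here.

```python
from typing import Any, Callable, Dict, List


def refine_all(
    data: Dict[str, List[Any]],
    run: Callable[..., Any],
    budget: int = 2048,
) -> Dict[str, Any]:
    """Call run(question, document_list, budget=...) for every entry in data."""
    return {
        question: run(question, document_list, budget=budget)
        for question, document_list in data.items()
    }


def fake_run(question, document_list, budget):
    """Stand-in for refiner.run so the sketch runs without loading any model."""
    return document_list[:1]


# Placeholder data in the assumed question -> document list shape.
sample = {"q1": ["doc a", "doc b"], "q2": ["doc c"]}
results = refine_all(sample, fake_run, budget=512)
```

With a real refiner, pass `refiner.run` in place of `fake_run` and the loaded `data` dictionary in place of `sample`.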

📚 Training

Before training, prepare the datasets for the three tasks in JSON format. Reference samples can be found in the training_data folder. Training uses the LLaMA-Factory framework. After setting up the training data, run:

cd scripts/training
# Train query analysis module
llamafactory-cli train train_config_step1.yaml
# Train doc structuring module
llamafactory-cli train train_config_step2.yaml
# Train global selection module
llamafactory-cli train train_config_step3.yaml
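
If you prefer to launch the three steps from Python, the commands above can be wrapped as in the sketch below. The `build_train_commands` and `run_all` helpers are hypothetical conveniences, not part of this repository; they assume `llamafactory-cli` is on your PATH and that the script is started from the repository root.

```python
import subprocess
from typing import List

# Config files named in the commands above, in training order.
CONFIGS = [
    "train_config_step1.yaml",  # query analysis module
    "train_config_step2.yaml",  # doc structuring module
    "train_config_step3.yaml",  # global selection module
]


def build_train_commands(configs: List[str] = CONFIGS) -> List[List[str]]:
    """Build one `llamafactory-cli train <config>` command per config file."""
    return [["llamafactory-cli", "train", config] for config in configs]


def run_all(cwd: str = "scripts/training", dry_run: bool = True) -> None:
    """Print the commands, or run them in order and stop at the first failure."""
    for cmd in build_train_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)  # switch to dry_run=False to actually train
```

Because `check=True` raises on a non-zero exit code, a failed step prevents the later steps from starting.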

📊 Evaluation

We use the FlashRAG framework for RAG task evaluation. Required files:

  • Evaluation dataset (recommended to obtain from FlashRAG's official repository)
  • Retrieval results for each query in the dataset
  • Model paths (same as above)

After preparation, configure the paths in scripts/evaluation/run_eval.sh and run:

cd scripts/evaluation
bash run_eval.sh

About

Code for the paper "Hierarchical Document Refinement for Long-context Retrieval-augmented Generation" [ACL 2025 Oral]
