LongRefiner | Hierarchical Document Refinement for Long-context Retrieval-augmented Generation

Installation | Quick-Start | Training | Evaluation | Huggingface Models

🔍 Overview

LongRefiner is an efficient plug-and-play refinement system for long-context RAG applications. It achieves 10x compression while maintaining superior performance through hierarchical document refinement.

✨ Key Features

⚡ Low Latency: 10x computational cost reduction compared to baselines
🔌 Plug and Play: Compatible with any LLM and retrieval system
📑 Hierarchical Document Structuring: XML-based efficient document representation
🔄 Adaptive Refinement: Dynamic content selection based on query requirements
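
To make the idea of a hierarchical, XML-based document representation concrete, here is a minimal sketch using only Python's standard library. The tag names (`document`, `section`, `paragraph`) and the `title` attribute are assumptions chosen for illustration; they are not taken from LongRefiner's actual schema, which may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure: tag and attribute names are illustrative assumptions,
# not the schema LongRefiner actually emits.
xml_text = """<document title="Example">
  <section title="Background">
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
  </section>
  <section title="Results">
    <paragraph>Third paragraph.</paragraph>
  </section>
</document>"""


def outline(xml_string):
    """Return (section title, paragraph count) pairs for a structured document."""
    root = ET.fromstring(xml_string)
    return [
        (section.get("title"), len(section.findall("paragraph")))
        for section in root.findall("section")
    ]


def select_sections(xml_string, wanted_titles):
    """Return the paragraph texts of the sections whose titles were requested."""
    root = ET.fromstring(xml_string)
    selected = []
    for section in root.findall("section"):
        if section.get("title") in wanted_titles:
            selected.extend(p.text for p in section.findall("paragraph"))
    return selected
```

A tree like this lets a refiner keep or drop whole sections instead of scoring every sentence independently, which is one plausible reading of how structure supports selection.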

🗺️ RoadMap

Release training code
Release trained modules
Release evaluation code
Release code for building custom training data

🛠️ Installation

cd LongRefiner
pip install -r requirements.txt
pip install -e .

For training, also install the LLaMA-Factory framework by following the instructions in its official repository:

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

🚀 Quick Start

The pre-trained LoRA modules are hosted on Hugging Face; their model IDs appear in the example below.

import json
from longrefiner import LongRefiner

# Initialize
query_analysis_module_lora_path = "jinjiajie/Query-Analysis-Qwen2.5-3B-Instruct"
doc_structuring_module_lora_path = "jinjiajie/Doc-Structuring-Qwen2.5-3B-Instruct"
selection_module_lora_path = "jinjiajie/Global-Selection-Qwen2.5-3B-Instruct"

refiner = LongRefiner(
    base_model_path="Qwen/Qwen2.5-3B-Instruct",
    query_analysis_module_lora_path=query_analysis_module_lora_path,
    doc_structuring_module_lora_path=doc_structuring_module_lora_path,
    global_selection_module_lora_path=selection_module_lora_path,
    score_model_name="bge-reranker-v2-m3",
    score_model_path="BAAI/bge-reranker-v2-m3",
    max_model_len=25000,
)

# Load sample data
with open("assets/sample_data.json", "r") as f:
    data = json.load(f)
question = list(data.keys())[0]
document_list = list(data.values())[0]

# Process documents
refined_result = refiner.run(question, document_list, budget=2048)
print(refined_result)
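
The sample-loading step above treats the JSON file as an object that maps each question to its list of documents. A small helper along these lines can make that assumption explicit and fail early on malformed input. The function name `load_first_example` is hypothetical and not part of the `longrefiner` package, and the format of each individual document is not checked because the snippet above does not specify it.

```python
import json


def load_first_example(path):
    """Return (question, document_list) for the first entry of a sample file.

    Assumes the file holds a JSON object mapping questions to document lists,
    as implied by the Quick Start snippet. This helper is illustrative only.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict) or not data:
        raise ValueError("expected a non-empty JSON object keyed by question")
    # Dicts preserve insertion order, so this matches list(data.keys())[0].
    question, document_list = next(iter(data.items()))
    if not isinstance(document_list, list):
        raise ValueError("expected each question to map to a list of documents")
    return question, document_list
```

With this in place, the loading lines reduce to `question, document_list = load_first_example("assets/sample_data.json")`.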

📚 Training

Before training, prepare a JSON dataset for each of the three tasks (query analysis, document structuring, and global selection). Reference samples can be found in the training_data folder. Training uses the LLaMA-Factory framework. After setting up the training data, run:

cd scripts/training
# Train query analysis module
llamafactory-cli train train_config_step1.yaml  
# Train doc structuring module
llamafactory-cli train train_config_step2.yaml  
# Train global selection module
llamafactory-cli train train_config_step3.yaml
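
If you prefer to launch the three steps from one script, the commands above can be wrapped as sketched below. The config file names and the `llamafactory-cli train` invocation come from the commands above; the helper names, the `workdir` default, and the `dry_run` flag are assumptions added for illustration.

```python
import subprocess

# Config files in the order listed above: query analysis, doc structuring,
# global selection.
CONFIGS = [
    "train_config_step1.yaml",
    "train_config_step2.yaml",
    "train_config_step3.yaml",
]


def build_commands(configs=CONFIGS):
    """Build one llamafactory-cli command (as an argument list) per config."""
    return [["llamafactory-cli", "train", config] for config in configs]


def run_all(workdir="scripts/training", dry_run=False):
    """Run the training steps in order, stopping at the first failure."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero,
            # so later steps do not start after a failed one.
            subprocess.run(cmd, cwd=workdir, check=True)
```

Calling `run_all(dry_run=True)` prints the commands without executing them, which is a cheap way to confirm the order before starting a long run.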

📊 Evaluation

We use the FlashRAG framework for RAG task evaluation. Required files:

Evaluation dataset (recommended to obtain from FlashRAG's official repository)
Retrieval results for each query in the dataset
Model paths (same as above)

After preparation, configure the paths in scripts/evaluation/run_eval.sh and run:

cd scripts/evaluation
bash run_eval.sh
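
Before launching the evaluation, it can help to confirm that the required files listed above are actually present. The sketch below is a generic pre-flight check; the function name and the example labels and paths are placeholders, not values read from `run_eval.sh`. Model entries that are Hugging Face Hub IDs rather than local directories should be left out of the check.

```python
from pathlib import Path


def missing_paths(required):
    """Return the labels of entries whose path does not exist on disk.

    `required` maps a human-readable label to a local file or directory path.
    """
    return [label for label, path in required.items() if not Path(path).exists()]


# Example usage with placeholder paths (replace with your own):
# problems = missing_paths({
#     "evaluation dataset": "path/to/dataset",
#     "retrieval results": "path/to/retrieval_results.json",
# })
# if problems:
#     raise SystemExit("Missing: " + ", ".join(problems))
```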
