
REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval


*(Figure: REaR architecture overview)*

Abstract

REaR is a modular, multi-stage retrieval pipeline for discovering relevant database tables in response to natural language queries. Unlike single-pass retrieval methods, REaR first retrieves candidate tables via dense vector similarity, then expands the candidate set by discovering joinable tables through column-level matching, and finally refines results using cross-encoder scoring that considers both query-to-table and table-to-table relationships. This retrieve-expand-refine approach consistently improves recall and precision for multitable retrieval, making it well-suited for downstream tasks such as natural language to SQL generation.

Paper: REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval


Quick Start

Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/<your-username>/REaR.git
   cd REaR
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install the dependencies:

   ```bash
   bash setup.sh
   ```

Setup Configuration

1. Create a `.env` file with your API keys:

   ```bash
   # LLM API Keys (at least one required for SQL generation)
   GOOGLE_API_KEY=your-google-api-key-here
   OPENAI_API_KEY=your-openai-api-key-here
   DEEPINFRA_API_KEY=your-deepinfra-api-key-here
   ```

2. Prepare your data: place your table repository JSON and questions file in an accessible path (see Data Format below).

3. Build the vector indices (one-time preprocessing):

   ```bash
   # Column-level FAISS index (for the Expand stage)
   python preprocessing/create_col_store.py \
     --table-repository combined_database_with_desc.json \
     --embedding-model BAAI/bge-base-en-v1.5 \
     --index-out vs_col.bin \
     --metadata-out doc_metadata.json

   # Table-level vector store (for the Retrieve stage)
   python preprocessing/create_vector_score.py \
     --table-repository combined_database_with_desc.json \
     --output-dir storage_bge \
     --embedding-model BAAI/bge-base-en-v1.5
   ```
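Conceptually, both indices store one L2-normalized embedding per unit (a whole table, or a single column) and rank candidates by cosine similarity at query time. A minimal NumPy sketch of that ranking step — random vectors stand in for real `BAAI/bge-base-en-v1.5` embeddings, and the helper names are illustrative, not functions from this repo:

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize rows so a dot product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def top_k(index: np.ndarray, query: np.ndarray, k: int) -> list[int]:
    """Return indices of the k rows most similar to the query vector."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q
    return np.argsort(-scores)[:k].tolist()

rng = np.random.default_rng(0)
table_vecs = rng.normal(size=(100, 768))               # stand-ins for table embeddings
index = build_index(table_vecs)
query_vec = table_vecs[42] + 0.01 * rng.normal(size=768)  # query close to table 42
print(top_k(index, query_vec, k=5))                    # table 42 should rank first
```

A real FAISS index replaces the brute-force `index @ q` with an approximate nearest-neighbor search, but the normalize-then-dot-product logic is the same.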

Usage Examples

End-to-End Pipeline

Run all four stages (Retrieve → Expand → Refine → Generate) in a single command:

```bash
python full_inference.py \
  --env-file .env \
  --questions-file merged_questions.json \
  --vector-store-path storage_bge \
  --faiss-index vs_col.bin \
  --faiss-metadata doc_metadata.json \
  --table-repository combined_database_with_desc.json \
  --output-file e2e_results.json \
  --intermediate-dir runs/
```

Running Individual Stages

```bash
# Stage 1: Retrieve — top-k tables per question via dense retrieval
python retrieve.py \
  --questions-file merged_questions.json \
  --vector-store-path storage_bge \
  --output-file base.json

# Stage 2: Expand — discover joinable tables via column-level similarity
python expand.py \
  --retrieval-file base.json \
  --faiss-index vs_col.bin \
  --faiss-metadata doc_metadata.json \
  --output-file expansion.json

# Stage 3: Refine — prune with cross-encoder scoring
python refine.py \
  --expansion-file expansion.json \
  --table-repository combined_database_with_desc.json \
  --output-file pruned.json

# Stage 4: Generate — produce SQL from the pruned table set
python generate.py \
  --pruned-file pruned.json \
  --table-repository combined_database_with_desc.json \
  --provider gemini \
  --output-file sql.json \
  --env-file .env
```

Evaluation

```bash
python evaluate_retrieval.py \
  --predictions-file base.json \
  --ground-truth-file merged_questions.json \
  --output-file eval.json
```

Data Format

Table Repository

```json
{
  "tables": {
    "table_name": [
      {
        "id": "tbl_<hash>",
        "columns": ["col1", "col2"],
        "data": [["val1", "val2"]],
        "primary_keys": ["col1"],
        "foreign_keys": [],
        "table_description": "Optional description"
      }
    ]
  }
}
```

Questions File

```json
{
  "questions": [
    {
      "question_id": "q1",
      "question": "Natural language query text",
      "tables": [{"id": "tbl_xxx"}]
    }
  ]
}
```
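Both files can be loaded with the standard library alone. A minimal sketch, assuming exactly the layouts above (the helper names are illustrative, not part of this repo):

```python
import json

def load_table_repository(path: str) -> dict[str, dict]:
    """Map table id -> table record (columns, keys, description)."""
    with open(path, encoding="utf-8") as f:
        repo = json.load(f)
    by_id = {}
    for name, versions in repo["tables"].items():
        for table in versions:      # each name may hold several table records
            table["name"] = name
            by_id[table["id"]] = table
    return by_id

def load_questions(path: str) -> list[dict]:
    """Return questions with their ground-truth table ids resolved."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [
        {
            "question_id": q["question_id"],
            "question": q["question"],
            "gold_ids": [t["id"] for t in q["tables"]],
        }
        for q in data["questions"]
    ]
```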

Pipeline Outputs

Each stage produces a JSON file that feeds into the next:

| Stage | Output | Description |
|---|---|---|
| Retrieve | `base.json` | Top-k candidate tables per question |
| Expand | `expansion.json` | Candidate set augmented with joinable tables |
| Refine | `pruned.json` | Final table set after cross-encoder pruning |
| Generate | `sql.json` | Generated SQL queries with metadata |

Project Structure

```
REaR/
├── setup.sh                         # Dependency installation
├── imports.py                       # Centralized imports
├── utils.py                         # Utility functions
├── preprocessing/
│   ├── create_col_store.py          # Build column-level FAISS index
│   ├── create_vector_score.py       # Build table-level vector store
│   ├── generate_questions_list.py   # Extract questions from SQL
│   └── generate_table_descriptions.py  # Generate table descriptions
├── retrieve.py                      # Stage 1: Dense retrieval
├── expand.py                        # Stage 2: Join-aware expansion
├── refine.py                        # Stage 3: Cross-encoder refinement
├── generate.py                      # Stage 4: SQL generation
├── evaluate_retrieval.py            # Evaluation metrics
└── full_inference.py                # End-to-end runner
```

Configuration

Environment Variables

| Variable | Description | Required |
|---|---|---|
| `GOOGLE_API_KEY` | Google Generative AI API key | For Gemini provider |
| `OPENAI_API_KEY` | OpenAI API key | For OpenAI provider |
| `DEEPINFRA_API_KEY` | DeepInfra API key | For DeepInfra provider |

Key Parameters

| Parameter | Stage | Default | Description |
|---|---|---|---|
| `--top-k` | Retrieve | 5 | Number of candidate tables per question |
| `--embedding-model` | Retrieve / Expand | `BAAI/bge-base-en-v1.5` | HuggingFace embedding model |
| `--top-n` | Refine | 5 | Final number of tables after pruning |
| `--alpha` | Refine | | Weight for query-to-table scores |
| `--beta` | Refine | | Weight for table-to-table scores |
| `--provider` | Generate | `gemini` | LLM provider (`gemini`, `openai`, `deepinfra`) |

Supported LLM Providers

| Provider | Model | Environment Variable |
|---|---|---|
| Google Gemini | `gemini-2.0-flash` | `GOOGLE_API_KEY` |
| OpenAI | `gpt-4o-mini` | `OPENAI_API_KEY` |
| DeepInfra | `Llama-3.2-3B-Instruct` | `DEEPINFRA_API_KEY` |

Architecture

REaR's pipeline consists of four stages:

  1. Retrieve: A dual-encoder with FAISS ranks tables by dense vector similarity to the input question using BAAI/bge-base-en-v1.5 embeddings.
  2. Expand: A column-level FAISS index discovers additional joinable tables that share compatible columns with the retrieved candidates.
  3. Refine: A cross-encoder (jina-reranker-v2-base-multilingual) scores table combinations using both query-to-table (QT) and table-to-table (T2T) relevance, then prunes the set to the top-n tables.
  4. Generate: An LLM produces SQL queries from the refined table set and the original question.
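The Refine stage's `--alpha`/`--beta` weighting can be sketched as a weighted sum of query-to-table relevance and each table's average relevance to the other candidates. A minimal NumPy illustration — the score matrices are made-up stand-ins for cross-encoder outputs, and the exact aggregation in `refine.py` may differ:

```python
import numpy as np

def refine_scores(qt: np.ndarray, t2t: np.ndarray,
                  alpha: float, beta: float, top_n: int) -> list[int]:
    """Combine query-to-table (QT) and table-to-table (T2T) relevance.

    qt:  shape (n,)    cross-encoder score of each candidate table vs. the query
    t2t: shape (n, n)  pairwise table-to-table relevance scores
    Returns indices of the top_n tables under the combined score.
    """
    # Each table's T2T support: mean relevance to the *other* candidates.
    support = (t2t.sum(axis=1) - np.diag(t2t)) / max(t2t.shape[0] - 1, 1)
    combined = alpha * qt + beta * support
    return np.argsort(-combined)[:top_n].tolist()

qt = np.array([0.9, 0.2, 0.8, 0.1])
t2t = np.array([[1.0, 0.1, 0.9, 0.0],
                [0.1, 1.0, 0.2, 0.0],
                [0.9, 0.2, 1.0, 0.1],
                [0.0, 0.0, 0.1, 1.0]])
print(refine_scores(qt, t2t, alpha=0.7, beta=0.3, top_n=2))  # → [0, 2]
```

Tables 0 and 2 survive here because they score well against the query *and* strongly support each other, which is exactly the join-aware behavior the Refine stage targets.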

Evaluation Metrics

| Metric | Description |
|---|---|
| Recall | Coverage of ground-truth tables in the retrieved set |
| Precision | Fraction of retrieved tables that are relevant |
| F1 | Harmonic mean of recall and precision |
| Full Recall / Precision | Perfect match rate (all-or-nothing) |
| Coverage | Schema-field completeness beyond table-level recall |
| Redundancy | Pairwise similarity within the retrieved set |
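As a concrete reference for the table-level metrics above, a minimal per-question sketch over table ids (the function name is illustrative; `evaluate_retrieval.py` may compute these differently):

```python
def retrieval_metrics(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Per-question recall, precision, F1, and all-or-nothing full recall."""
    pred, truth = set(predicted), set(gold)
    hits = len(pred & truth)
    recall = hits / len(truth) if truth else 0.0
    precision = hits / len(pred) if pred else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1,
            "full_recall": float(truth <= pred)}  # 1.0 only if every gold table was retrieved

# One gold table found out of two, with three tables retrieved:
print(retrieval_metrics(["tbl_a", "tbl_b", "tbl_c"], ["tbl_a", "tbl_d"]))
```

Corpus-level numbers would average these per-question values over all questions.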

Citation

If you use REaR in your research, please cite our paper:

```bibtex
@misc{rear2025,
      title={REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval},
      author={},
      year={2025},
      eprint={XXXX.XXXXX},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
}
```

License

This project is licensed under the Creative Commons Attribution 4.0 International License — see the LICENSE file for details.

About

REaR is a fast, LLM-free framework for multi-table retrieval that separates semantic relevance from structural joinability. By retrieving relevant tables, expanding the set with joinable ones, and refining away noisy candidates, it consistently improves multi-table QA and Text-to-SQL performance, matching LLM-based methods at much lower cost and latency.
