
REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval


*(Figure: REaR architecture overview)*

Abstract

REaR is a modular, multi-stage retrieval pipeline for discovering relevant database tables in response to natural language queries. Unlike single-pass retrieval methods, REaR first retrieves candidate tables via dense vector similarity, then expands the candidate set by discovering joinable tables through column-level matching, and finally refines results using cross-encoder scoring that considers both query-to-table and table-to-table relationships. This retrieve-expand-refine approach consistently improves recall and precision for multitable retrieval, making it well-suited for downstream tasks such as natural language to SQL generation.

Paper: REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval


Quick Start

Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/<your-username>/REaR.git
   cd REaR
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install the dependencies:

   ```bash
   bash setup.sh
   ```

Setup Configuration

1. Create a `.env` file with your API keys:

   ```bash
   # LLM API Keys (at least one required for SQL generation)
   GOOGLE_API_KEY=your-google-api-key-here
   OPENAI_API_KEY=your-openai-api-key-here
   DEEPINFRA_API_KEY=your-deepinfra-api-key-here
   ```

2. Prepare your data: place your table repository JSON and questions file in an accessible path (see Data Format below).

3. Build the vector indices (one-time preprocessing):

   ```bash
   # Column-level FAISS index (for the Expand stage)
   python preprocessing/create_col_store.py \
     --table-repository combined_database_with_desc.json \
     --embedding-model BAAI/bge-base-en-v1.5 \
     --index-out vs_col.bin \
     --metadata-out doc_metadata.json

   # Table-level vector store (for the Retrieve stage)
   python preprocessing/create_vector_score.py \
     --table-repository combined_database_with_desc.json \
     --output-dir storage_bge \
     --embedding-model BAAI/bge-base-en-v1.5
   ```
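Conceptually, both indices store one L2-normalized embedding per unit (a whole table, or a single column) and rank candidates by cosine similarity at query time. A minimal NumPy sketch of that ranking step — random vectors stand in for real `BAAI/bge-base-en-v1.5` embeddings, and the helper names are illustrative, not functions from this repo:

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize rows so a dot product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def top_k(index: np.ndarray, query: np.ndarray, k: int) -> list[int]:
    """Return indices of the k rows most similar to the query vector."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q
    return np.argsort(-scores)[:k].tolist()

rng = np.random.default_rng(0)
table_vecs = rng.normal(size=(100, 768))               # stand-ins for table embeddings
index = build_index(table_vecs)
query_vec = table_vecs[42] + 0.01 * rng.normal(size=768)  # query close to table 42
print(top_k(index, query_vec, k=5))                    # table 42 should rank first
```

A real FAISS index replaces the brute-force `index @ q` with an approximate nearest-neighbor search, but the normalize-then-dot-product logic is the same.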

Usage Examples

End-to-End Pipeline

Run all four stages (Retrieve → Expand → Refine → Generate) in a single command:

```bash
python full_inference.py \
  --env-file .env \
  --questions-file merged_questions.json \
  --vector-store-path storage_bge \
  --faiss-index vs_col.bin \
  --faiss-metadata doc_metadata.json \
  --table-repository combined_database_with_desc.json \
  --output-file e2e_results.json \
  --intermediate-dir runs/
```

Running Individual Stages

```bash
# Stage 1: Retrieve — top-k tables per question via dense retrieval
python retrieve.py \
  --questions-file merged_questions.json \
  --vector-store-path storage_bge \
  --output-file base.json

# Stage 2: Expand — discover joinable tables via column-level similarity
python expand.py \
  --retrieval-file base.json \
  --faiss-index vs_col.bin \
  --faiss-metadata doc_metadata.json \
  --output-file expansion.json

# Stage 3: Refine — prune with cross-encoder scoring
python refine.py \
  --expansion-file expansion.json \
  --table-repository combined_database_with_desc.json \
  --output-file pruned.json

# Stage 4: Generate — produce SQL from the pruned table set
python generate.py \
  --pruned-file pruned.json \
  --table-repository combined_database_with_desc.json \
  --provider gemini \
  --output-file sql.json \
  --env-file .env
```

Evaluation

```bash
python evaluate_retrieval.py \
  --predictions-file base.json \
  --ground-truth-file merged_questions.json \
  --output-file eval.json
```

Data Format

Table Repository

```json
{
  "tables": {
    "table_name": [
      {
        "id": "tbl_<hash>",
        "columns": ["col1", "col2"],
        "data": [["val1", "val2"]],
        "primary_keys": ["col1"],
        "foreign_keys": [],
        "table_description": "Optional description"
      }
    ]
  }
}
```

Questions File

```json
{
  "questions": [
    {
      "question_id": "q1",
      "question": "Natural language query text",
      "tables": [{"id": "tbl_xxx"}]
    }
  ]
}
```
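Both files can be loaded with the standard library alone. A minimal sketch, assuming exactly the layouts above (the helper names are illustrative, not part of this repo):

```python
import json

def load_table_repository(path: str) -> dict[str, dict]:
    """Map table id -> table record (columns, keys, description)."""
    with open(path, encoding="utf-8") as f:
        repo = json.load(f)
    by_id = {}
    for name, versions in repo["tables"].items():
        for table in versions:      # each name may hold several table records
            table["name"] = name
            by_id[table["id"]] = table
    return by_id

def load_questions(path: str) -> list[dict]:
    """Return questions with their ground-truth table ids resolved."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [
        {
            "question_id": q["question_id"],
            "question": q["question"],
            "gold_ids": [t["id"] for t in q["tables"]],
        }
        for q in data["questions"]
    ]
```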

Pipeline Outputs

Each stage produces a JSON file that feeds into the next:

| Stage | Output | Description |
|---|---|---|
| Retrieve | `base.json` | Top-k candidate tables per question |
| Expand | `expansion.json` | Candidate set augmented with joinable tables |
| Refine | `pruned.json` | Final table set after cross-encoder pruning |
| Generate | `sql.json` | Generated SQL queries with metadata |

Project Structure

```
REaR/
├── setup.sh                         # Dependency installation
├── imports.py                       # Centralized imports
├── utils.py                         # Utility functions
├── preprocessing/
│   ├── create_col_store.py          # Build column-level FAISS index
│   ├── create_vector_score.py       # Build table-level vector store
│   ├── generate_questions_list.py   # Extract questions from SQL
│   └── generate_table_descriptions.py  # Generate table descriptions
├── retrieve.py                      # Stage 1: Dense retrieval
├── expand.py                        # Stage 2: Join-aware expansion
├── refine.py                        # Stage 3: Cross-encoder refinement
├── generate.py                      # Stage 4: SQL generation
├── evaluate_retrieval.py            # Evaluation metrics
└── full_inference.py                # End-to-end runner
```

Configuration

Environment Variables

| Variable | Description | Required |
|---|---|---|
| `GOOGLE_API_KEY` | Google Generative AI API key | For Gemini provider |
| `OPENAI_API_KEY` | OpenAI API key | For OpenAI provider |
| `DEEPINFRA_API_KEY` | DeepInfra API key | For DeepInfra provider |

Key Parameters

| Parameter | Stage | Default | Description |
|---|---|---|---|
| `--top-k` | Retrieve | 5 | Number of candidate tables per question |
| `--embedding-model` | Retrieve / Expand | `BAAI/bge-base-en-v1.5` | HuggingFace embedding model |
| `--top-n` | Refine | 5 | Final number of tables after pruning |
| `--alpha` | Refine | | Weight for query-to-table scores |
| `--beta` | Refine | | Weight for table-to-table scores |
| `--provider` | Generate | `gemini` | LLM provider (`gemini`, `openai`, `deepinfra`) |

Supported LLM Providers

| Provider | Model | Environment Variable |
|---|---|---|
| Google Gemini | `gemini-2.0-flash` | `GOOGLE_API_KEY` |
| OpenAI | `gpt-4o-mini` | `OPENAI_API_KEY` |
| DeepInfra | `Llama-3.2-3B-Instruct` | `DEEPINFRA_API_KEY` |

Architecture

REaR's pipeline consists of four stages:

  1. Retrieve: A dual-encoder with FAISS ranks tables by dense vector similarity to the input question using BAAI/bge-base-en-v1.5 embeddings.
  2. Expand: A column-level FAISS index discovers additional joinable tables that share compatible columns with the retrieved candidates.
  3. Refine: A cross-encoder (jina-reranker-v2-base-multilingual) scores table combinations using both query-to-table (QT) and table-to-table (T2T) relevance, then prunes the set to the top-n tables.
  4. Generate: An LLM produces SQL queries from the refined table set and the original question.
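The Refine stage's `--alpha`/`--beta` weighting can be sketched as a weighted sum of query-to-table relevance and each table's average relevance to the other candidates. A minimal NumPy illustration — the score matrices are made-up stand-ins for cross-encoder outputs, and the exact aggregation in `refine.py` may differ:

```python
import numpy as np

def refine_scores(qt: np.ndarray, t2t: np.ndarray,
                  alpha: float, beta: float, top_n: int) -> list[int]:
    """Combine query-to-table (QT) and table-to-table (T2T) relevance.

    qt:  shape (n,)    cross-encoder score of each candidate table vs. the query
    t2t: shape (n, n)  pairwise table-to-table relevance scores
    Returns indices of the top_n tables under the combined score.
    """
    # Each table's T2T support: mean relevance to the *other* candidates.
    support = (t2t.sum(axis=1) - np.diag(t2t)) / max(t2t.shape[0] - 1, 1)
    combined = alpha * qt + beta * support
    return np.argsort(-combined)[:top_n].tolist()

qt = np.array([0.9, 0.2, 0.8, 0.1])
t2t = np.array([[1.0, 0.1, 0.9, 0.0],
                [0.1, 1.0, 0.2, 0.0],
                [0.9, 0.2, 1.0, 0.1],
                [0.0, 0.0, 0.1, 1.0]])
print(refine_scores(qt, t2t, alpha=0.7, beta=0.3, top_n=2))  # → [0, 2]
```

Tables 0 and 2 survive here because they score well against the query *and* strongly support each other, which is exactly the join-aware behavior the Refine stage targets.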

Evaluation Metrics

| Metric | Description |
|---|---|
| Recall | Coverage of ground-truth tables in the retrieved set |
| Precision | Fraction of retrieved tables that are relevant |
| F1 | Harmonic mean of recall and precision |
| Full Recall / Precision | Perfect match rate (all-or-nothing) |
| Coverage | Schema-field completeness beyond table-level recall |
| Redundancy | Pairwise similarity within the retrieved set |
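As a concrete reference for the table-level metrics above, a minimal per-question sketch over table ids (the function name is illustrative; `evaluate_retrieval.py` may compute these differently):

```python
def retrieval_metrics(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Per-question recall, precision, F1, and all-or-nothing full recall."""
    pred, truth = set(predicted), set(gold)
    hits = len(pred & truth)
    recall = hits / len(truth) if truth else 0.0
    precision = hits / len(pred) if pred else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1,
            "full_recall": float(truth <= pred)}  # 1.0 only if every gold table was retrieved

# One gold table found out of two, with three tables retrieved:
print(retrieval_metrics(["tbl_a", "tbl_b", "tbl_c"], ["tbl_a", "tbl_d"]))
```

Corpus-level numbers would average these per-question values over all questions.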

Citation

If you use REaR in your research, please cite our paper:

```bibtex
@misc{rear2025,
      title={REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval},
      author={},
      year={2025},
      eprint={XXXX.XXXXX},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
}
```

License

This project is licensed under the Creative Commons Attribution 4.0 International License — see the LICENSE file for details.

About

REaR is a fast, LLM-free framework for multi-table retrieval that separates semantic relevance from structural joinability. By retrieving relevant tables, expanding the set with joinable ones, and refining away noisy candidates, it consistently improves multi-table QA and Text-to-SQL performance, matching LLM-based methods at much lower cost and latency.
