lil-lab/benchpress

BenchPress: A Standardized Evaluation Suite for Context Compression

BenchPress provides a unified benchmark for evaluating context compression methods across 10 diverse question-answering datasets, spanning short-context and mid-range scenarios.

Paper: No Mean Feat: Simple, Strong Baselines for Context Compression

The benchmark dataset is available on the Hugging Face Hub at yairfeldman/benchpress.

Code for training the models presented in the paper is available at lil-lab/simple-context-compression.

Installation

Install with uv:

uv sync

Then run commands with:

uv run ...

Or activate the virtual environment:

source .venv/bin/activate

Quick Start

import benchpress

# Load the benchmark dataset
dataset = benchpress.load(path="yairfeldman/benchpress")
# OR from local path: dataset = benchpress.load(path="./benchpress_data")

# Prepare a prompt for a sample
sample = dataset[0]
prompt = benchpress.prepare_prompt(
    sample["dataset_name"],
    sample["context"],
    sample["question"],
)

# Evaluate predictions against references
predictions = ["Paris"]
references = [["Paris", "paris"]]
scores = benchpress.evaluate(predictions, references)
print(benchpress.aggregate(scores))
# {'M': 1.0, 'EM': 1.0, 'F1': 1.0, 'Precision': 1.0, 'Recall': 1.0}

Dataset Overview

BenchPress includes 10 datasets organized into two subsets:

Short-Context Datasets

| Dataset | Source | Split | Template Type (count) | In-Domain |
|---|---|---|---|---|
| squad | rajpurkar/squad_v2 | validation | extractive_qa (101) | Yes |
| narrativeqa | NarrativeQA summaries | validation | qa (96) | Yes |
| hotpotqa | HotpotQA distractor | validation | extractive_qa (101) | Yes |
| adversarial_qa | AdversarialQA droberta | validation | extractive_qa (101) | No |
| triviaqa_verified | TriviaQA verified | dev | extractive_qa (101) | No |
| paraphrase_rc | DuoRC ParaphraseRC | validation | qa (96) | No |

Mid-Range Datasets (LongBench)

| Dataset | Source | Split | Template Type (count) |
|---|---|---|---|
| longbench_qasper_e | LongBench qasper_e | test | inline (1) |
| longbench_multifieldqa_en_e | LongBench multifieldqa_en_e | test | inline (1) |
| longbench_hotpotqa_e | LongBench hotpotqa_e | test | inline (1) |
| longbench_2wikimqa_e | LongBench 2wikimqa_e | test | inline (1) |

Usage Guide

Loading Datasets

import benchpress

dataset_path = "yairfeldman/benchpress" # or local path: "./benchpress_data"

# Load all datasets
dataset = benchpress.load(path=dataset_path)

# Load only short-context datasets
short = benchpress.load(path=dataset_path, subset="short")

# Load only mid-range datasets
mid = benchpress.load(path=dataset_path, subset="mid_range")

# Load specific dataset(s)
squad = benchpress.load(path=dataset_path, datasets="squad")
subset = benchpress.load(path=dataset_path, datasets=["squad", "hotpotqa"])

# Load in-domain or out-of-domain splits
in_domain = benchpress.load(path=dataset_path, split_type="in_domain")
out_domain = benchpress.load(path=dataset_path, split_type="out_of_domain")
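As a rough mental model, the `subset`, `datasets`, and `split_type` options behave like filters over the dataset tables in the Dataset Overview. The sketch below is not BenchPress code: `REGISTRY` and `select` are hypothetical names, and treating the mid-range datasets as having no in-domain label is an assumption, since their table has no In-Domain column.

```python
from typing import Optional, Sequence, Union

# name -> (subset, in_domain); values copied from the Dataset Overview tables.
# The mid-range table has no In-Domain column, so None is used (assumption).
REGISTRY: dict[str, tuple[str, Optional[bool]]] = {
    "squad": ("short", True),
    "narrativeqa": ("short", True),
    "hotpotqa": ("short", True),
    "adversarial_qa": ("short", False),
    "triviaqa_verified": ("short", False),
    "paraphrase_rc": ("short", False),
    "longbench_qasper_e": ("mid_range", None),
    "longbench_multifieldqa_en_e": ("mid_range", None),
    "longbench_hotpotqa_e": ("mid_range", None),
    "longbench_2wikimqa_e": ("mid_range", None),
}


def select(
    subset: Optional[str] = None,
    datasets: Union[str, Sequence[str], None] = None,
    split_type: Optional[str] = None,
) -> list[str]:
    """Return the dataset names that match every filter that was given."""
    if split_type not in (None, "in_domain", "out_of_domain"):
        raise ValueError(f"unknown split_type: {split_type!r}")
    if isinstance(datasets, str):
        datasets = [datasets]
    wanted_domain = {"in_domain": True, "out_of_domain": False}.get(split_type)
    names = []
    for name, (name_subset, in_domain) in REGISTRY.items():
        if subset is not None and name_subset != subset:
            continue
        if datasets is not None and name not in datasets:
            continue
        if split_type is not None and in_domain is not wanted_domain:
            continue
        names.append(name)
    return names


print(select(subset="mid_range"))
print(select(split_type="in_domain"))
```

Filters combine with logical AND in this sketch; whether `benchpress.load` combines its arguments the same way is not stated here.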

Filtering by Context Length

import benchpress

# Keep only samples with at most 1024 context tokens
short_ctx = benchpress.load(
    path="yairfeldman/benchpress",
    max_context_tokens=1024,
)
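Conceptually, `max_context_tokens` acts as a filter on the `num_context_tokens` column described under Dataset Schema. A minimal sketch over plain dictionaries, assuming the bound is inclusive ("at most"); `filter_by_context_length` is a hypothetical helper, not part of the BenchPress API:

```python
def filter_by_context_length(samples, max_context_tokens):
    """Keep samples whose context has at most `max_context_tokens` tokens."""
    return [s for s in samples if s["num_context_tokens"] <= max_context_tokens]


# Toy records with made-up token counts, for illustration only.
samples = [
    {"id": "a", "num_context_tokens": 180},
    {"id": "b", "num_context_tokens": 1024},
    {"id": "c", "num_context_tokens": 5300},
]
kept = filter_by_context_length(samples, max_context_tokens=1024)
print([s["id"] for s in kept])  # ['a', 'b']
```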

Working with Templates

import benchpress

# List available templates for a dataset
templates = benchpress.get_templates("squad")
print(f"SQuAD has {len(templates)} templates")  # 101

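# Example inputs; any context/question pair works
context = "Paris is the capital of France."
question = "What is the capital of France?"

# Sample a template deterministically for a given sample
template = benchpress.sample_template("squad", context, question)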

# Apply a template manually
prompt = benchpress.apply_template(template, context, question)

# Or use the convenience function
prompt = benchpress.prepare_prompt("squad", context, question)
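One way to sample a template deterministically is to hash the sample's content and use the hash to index into the template list, so the same (dataset, context, question) triple always maps to the same template. The sketch below illustrates that idea only. BenchPress may use a different scheme, and the template strings, `pick_template`, and the `{context}`/`{question}` placeholder syntax are all hypothetical.

```python
import hashlib

# Hypothetical templates; the real BenchPress templates are not shown here.
TEMPLATES = [
    "{context}\n\nQuestion: {question}\nAnswer:",
    "Read the passage and answer.\n\n{context}\n\nQ: {question}\nA:",
    "Passage: {context}\nBased on the passage, {question}",
]


def pick_template(dataset_name: str, context: str, question: str) -> str:
    """Map a sample to a template via a stable content hash."""
    key = "\x1f".join([dataset_name, context, question]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(TEMPLATES)
    return TEMPLATES[index]


context = "Paris is the capital of France."
question = "What is the capital of France?"
template = pick_template("squad", context, question)
prompt = template.format(context=context, question=question)

# The same inputs always select the same template.
assert template == pick_template("squad", context, question)
print(prompt)
```

The sketch uses `hashlib` rather than Python's built-in `hash()` because string hashes from `hash()` are randomized per process by default, which would break reproducibility across runs.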

Evaluating Predictions

import benchpress

# predictions: list of model output strings
# references: list of lists of valid reference answers
scores = benchpress.evaluate(predictions, references)
# Returns: {"M": [...], "EM": [...], "F1": [...], "Precision": [...], "Recall": [...]}

# Aggregate to dataset-level means
means = benchpress.aggregate(scores)
# Returns: {"M": 0.85, "EM": 0.72, ...}
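For orientation, EM, F1, Precision, and Recall in extractive QA are commonly computed SQuAD-style: normalize both strings, compare tokens, and keep the best score over all references. The sketch below follows that common convention. It is an assumption that BenchPress matches it in detail, and the M metric is omitted because its definition is not given here. `normalize`, `score_one`, `score`, and `mean_scores` are hypothetical names.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def score_one(prediction: str, reference: str) -> dict[str, float]:
    """Token-overlap scores for one prediction against one reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"EM": float(pred == ref), "F1": f1,
            "Precision": precision, "Recall": recall}


def score(predictions, references):
    """Per-sample scores, using the best-matching reference for each sample."""
    per_sample = {"EM": [], "F1": [], "Precision": [], "Recall": []}
    for prediction, refs in zip(predictions, references):
        best = max((score_one(prediction, r) for r in refs),
                   key=lambda s: (s["EM"], s["F1"]))
        for name, value in best.items():
            per_sample[name].append(value)
    return per_sample


def mean_scores(per_sample):
    """Average each metric over samples."""
    return {name: sum(v) / len(v) for name, v in per_sample.items()}


scores = score(["the city of Paris", "Lyon"], [["Paris", "paris"], ["Paris"]])
print(mean_scores(scores))
```

In this toy run the first prediction is not an exact match but shares one of its three normalized tokens with the reference (precision 1/3, recall 1, F1 0.5), while the second shares none, so the mean F1 is 0.25.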

Teacher-Normalized Scores

import benchpress

# Compare a compression method against teacher and no-context baselines
normalized = benchpress.teacher_normalized_score(
    score=0.80,          # Your method's score
    teacher_score=0.90,  # Teacher (full context) score
    no_context_score=0.30,  # No-context baseline score
)
# Returns: (0.80 - 0.30) / (0.90 - 0.30) ≈ 0.833
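The normalization rescales a score so that the no-context baseline maps to 0 and the teacher maps to 1. A minimal sketch of that arithmetic, based only on the formula in the comment above; `normalized_score` is a hypothetical stand-in, and raising on a zero denominator is an assumption about how to handle that edge case:

```python
def normalized_score(score: float, teacher_score: float, no_context_score: float) -> float:
    """Compute (score - no_context) / (teacher - no_context)."""
    gap = teacher_score - no_context_score
    if gap == 0:
        raise ValueError("teacher and no-context scores are equal; normalization is undefined")
    return (score - no_context_score) / gap


print(round(normalized_score(0.80, 0.90, 0.30), 3))  # 0.833
print(normalized_score(0.30, 0.90, 0.30))            # 0.0 (no better than no context)
print(normalized_score(0.90, 0.90, 0.30))            # 1.0 (matches the teacher)
```

Values are not clamped in this sketch: a method that beats the teacher scores above 1, and one that does worse than the no-context baseline scores below 0.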

Preparing the Dataset from Source

To rebuild the unified dataset from the original Hugging Face sources:

uv sync --extra prepare
uv run scripts/prepare_dataset.py --output-dir ./benchpress_data

This downloads all 10 source datasets, applies dataset-specific preprocessing, tokenizes contexts with Qwen/Qwen3-1.7B, and saves the unified dataset in Hugging Face Arrow format.
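The token-counting step can be pictured as mapping a tokenizer over each context and storing the resulting length. The sketch below keeps the tokenizer pluggable so it runs without downloads; the whitespace tokenizer is a stand-in for illustration only and will not reproduce Qwen/Qwen3-1.7B counts. `add_token_counts` is a hypothetical helper and does not claim to mirror `scripts/prepare_dataset.py`.

```python
from typing import Callable, Iterable


def add_token_counts(
    samples: Iterable[dict],
    tokenize: Callable[[str], list],
) -> list[dict]:
    """Return copies of the samples with a `num_context_tokens` field added."""
    return [{**s, "num_context_tokens": len(tokenize(s["context"]))} for s in samples]


# Stand-in tokenizer: whitespace split (NOT the Qwen tokenizer).
samples = [{"id": "demo-0", "context": "Paris is the capital of France."}]
counted = add_token_counts(samples, tokenize=str.split)
print(counted[0]["num_context_tokens"])  # 6
```

With the `transformers` package installed, a tokenizer loaded via `AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")` could be passed instead, for example as `lambda text: tok(text)["input_ids"]`. Whether the preparation script counts special tokens is not stated here, so counts from such a substitute may differ from the published `num_context_tokens` values.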

Dataset Schema

Each sample in the unified dataset has the following columns:

| Column | Type | Description |
|---|---|---|
| id | string | Unique sample identifier |
| context | string | The passage/context text |
| question | string | The question |
| answer | string | Single canonical answer |
| answers | list[string] | All valid reference answers |
| dataset_name | string | Source dataset name |
| num_context_tokens | int | Context token count (Qwen/Qwen3-1.7B) |
| domain | string or null | Domain/type label if available |
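For type-checked code, the schema above can be mirrored as a `TypedDict`. This is a convenience sketch, not something BenchPress is known to export; the class name `BenchPressSample` and the example values are made up.

```python
from typing import Optional, TypedDict


class BenchPressSample(TypedDict):
    id: str
    context: str
    question: str
    answer: str                 # single canonical answer
    answers: list[str]          # all valid reference answers
    dataset_name: str
    num_context_tokens: int     # counted with Qwen/Qwen3-1.7B
    domain: Optional[str]       # None when no label is available


# Made-up example row, for illustration only.
sample: BenchPressSample = {
    "id": "demo-0",
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "answers": ["Paris", "paris"],
    "dataset_name": "squad",
    "num_context_tokens": 8,
    "domain": None,
}

# Check that the row has every column the schema lists.
missing = set(BenchPressSample.__annotations__) - set(sample)
print(missing)  # set()
```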

Citation

@misc{feldman2025simplecontextcompressionmeanpooling,
      title={Simple Context Compression: Mean-Pooling and Multi-Ratio Training}, 
      author={Yair Feldman and Yoav Artzi},
      year={2025},
      eprint={2510.20797},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.20797}, 
}

License

This project is licensed under the MIT License. See LICENSE for details.
