BenchPress: A Standardized Evaluation Suite for Context Compression

BenchPress provides a unified benchmark for evaluating context compression methods across 10 diverse question-answering datasets, spanning short-context and mid-range scenarios.

Paper: No Mean Feat: Simple, Strong Baselines for Context Compression

The benchmark dataset is available on the Hugging Face Hub at yairfeldman/benchpress.

Code for training the models presented in the paper is available at lil-lab/simple-context-compression.

Installation

Install with uv:

uv sync

Then run commands with:

uv run ...

Or activate the virtual environment:

source .venv/bin/activate

Quick Start

import benchpress

# Load the benchmark dataset
dataset = benchpress.load(path="yairfeldman/benchpress")
# OR from local path: dataset = benchpress.load(path="./benchpress_data")

# Prepare a prompt for a sample
sample = dataset[0]
prompt = benchpress.prepare_prompt(
    sample["dataset_name"],
    sample["context"],
    sample["question"],
)

# Evaluate predictions against references
predictions = ["Paris"]
references = [["Paris", "paris"]]
scores = benchpress.evaluate(predictions, references)
print(benchpress.aggregate(scores))
# {'M': 1.0, 'EM': 1.0, 'F1': 1.0, 'Precision': 1.0, 'Recall': 1.0}

Dataset Overview

BenchPress includes 10 datasets organized into two subsets:

Short-Context Datasets

Dataset	Source	Split	Template Type (number of templates)	In-Domain
`squad`	`rajpurkar/squad_v2`	validation	extractive_qa (101)	Yes
`narrativeqa`	NarrativeQA summaries	validation	qa (96)	Yes
`hotpotqa`	HotpotQA distractor	validation	extractive_qa (101)	Yes
`adversarial_qa`	AdversarialQA droberta	validation	extractive_qa (101)	No
`triviaqa_verified`	TriviaQA verified	dev	extractive_qa (101)	No
`paraphrase_rc`	DuoRC ParaphraseRC	validation	qa (96)	No

Mid-Range Datasets (LongBench)

Dataset	Source	Split	Template Type (number of templates)
`longbench_qasper_e`	LongBench qasper_e	test	inline (1)
`longbench_multifieldqa_en_e`	LongBench multifieldqa_en_e	test	inline (1)
`longbench_hotpotqa_e`	LongBench hotpotqa_e	test	inline (1)
`longbench_2wikimqa_e`	LongBench 2wikimqa_e	test	inline (1)

Usage Guide

Loading Datasets

import benchpress

dataset_path = "yairfeldman/benchpress" # or local path: "./benchpress_data"

# Load all datasets
dataset = benchpress.load(path=dataset_path)

# Load only short-context datasets
short = benchpress.load(path=dataset_path, subset="short")

# Load only mid-range datasets
mid = benchpress.load(path=dataset_path, subset="mid_range")

# Load specific dataset(s)
squad = benchpress.load(path=dataset_path, datasets="squad")
subset = benchpress.load(path=dataset_path, datasets=["squad", "hotpotqa"])

# Load in-domain or out-of-domain splits
in_domain = benchpress.load(path=dataset_path, split_type="in_domain")
out_domain = benchpress.load(path=dataset_path, split_type="out_of_domain")

Filtering by Context Length

# Keep only samples with at most 1024 context tokens
short_ctx = benchpress.load(
    path="yairfeldman/benchpress",
    max_context_tokens=1024,
)
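
For samples that are already in memory, the same cutoff can be sketched in plain Python over the `num_context_tokens` column described in the schema. The records and the helper name `filter_by_context_length` below are illustrative and not part of the BenchPress API.

```python
from typing import Iterable


def filter_by_context_length(samples: Iterable[dict], max_context_tokens: int) -> list[dict]:
    """Keep samples whose context has at most `max_context_tokens` tokens."""
    return [s for s in samples if s["num_context_tokens"] <= max_context_tokens]


# Made-up records that mimic the unified schema (only the relevant columns).
samples = [
    {"id": "a", "dataset_name": "squad", "num_context_tokens": 180},
    {"id": "b", "dataset_name": "longbench_qasper_e", "num_context_tokens": 4200},
    {"id": "c", "dataset_name": "hotpotqa", "num_context_tokens": 1024},
]

kept = filter_by_context_length(samples, max_context_tokens=1024)
print([s["id"] for s in kept])  # ['a', 'c']
```

The comparison is inclusive, matching "at most 1024 context tokens" above.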

Working with Templates

import benchpress

# List available templates for a dataset
templates = benchpress.get_templates("squad")
print(f"SQuAD has {len(templates)} templates")  # 101

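# Example inputs (illustrative)
context = "Paris is the capital of France."
question = "What is the capital of France?"

# Sample a template deterministically for a given sample
template = benchpress.sample_template("squad", context, question)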

# Apply a template manually
prompt = benchpress.apply_template(template, context, question)

# Or use the convenience function
prompt = benchpress.prepare_prompt("squad", context, question)
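
One common way to make template sampling deterministic is to hash the sample's content and use the digest to index into the template list. The sketch below only illustrates that idea. It is an assumption, not BenchPress's actual implementation, and the template strings and the name `pick_template` are hypothetical.

```python
import hashlib

# Hypothetical templates; the real ones come from benchpress.get_templates().
TEMPLATES = [
    "Context: {context}\nQuestion: {question}\nAnswer:",
    "{context}\n\nQ: {question}\nA:",
    "Read the passage and answer.\n{context}\n{question}",
]


def pick_template(templates: list[str], context: str, question: str) -> str:
    """Choose a template from a stable hash of the sample's content."""
    digest = hashlib.sha256(f"{context}\x00{question}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(templates)
    return templates[index]


context = "Paris is the capital of France."
question = "What is the capital of France?"

template = pick_template(TEMPLATES, context, question)
prompt = template.format(context=context, question=question)

# The same inputs always select the same template.
assert pick_template(TEMPLATES, context, question) == template
```

A cryptographic digest is used here instead of Python's built-in `hash()`, because string hashes are salted per process by default and would not be reproducible across runs.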

Evaluating Predictions

import benchpress

# predictions: list of model output strings
# references: list of lists of valid reference answers
scores = benchpress.evaluate(predictions, references)
# Returns: {"M": [...], "EM": [...], "F1": [...], "Precision": [...], "Recall": [...]}

# Aggregate to dataset-level means
means = benchpress.aggregate(scores)
# Returns: {"M": 0.85, "EM": 0.72, ...}
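
For intuition about the per-sample metrics, the sketch below computes exact match and token-level F1 in the common SQuAD style, taking the best score over the reference answers. This is an assumption about the general family of metrics, not BenchPress's implementation: its text normalization may differ, and the `M` metric is not covered here. All function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction: str, references: list[str]) -> float:
    """Score a prediction against every reference and keep the best."""
    return max(metric(prediction, ref) for ref in references)


em = best_over_references(exact_match, "The Eiffel Tower", ["Eiffel Tower", "the tower"])
f1 = best_over_references(token_f1, "Eiffel Tower in Paris", ["Eiffel Tower"])
print(em, round(f1, 3))  # 1.0 0.667
```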

Teacher-Normalized Scores

import benchpress

# Compare a compression method against teacher and no-context baselines
normalized = benchpress.teacher_normalized_score(
    score=0.80,          # Your method's score
    teacher_score=0.90,  # Teacher (full context) score
    no_context_score=0.30,  # No-context baseline score
)
# Returns: (0.80 - 0.30) / (0.90 - 0.30) ≈ 0.833
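
The formula in the comment above can be written out as a standalone function, which makes the two reference points explicit: a score equal to the no-context baseline maps to 0, and a score equal to the teacher maps to 1. The function name and the zero-gap error handling are assumptions made for this sketch; the library's behavior in that edge case is not documented here.

```python
def teacher_normalized(score: float, teacher_score: float, no_context_score: float) -> float:
    """Fraction of the teacher's gain over the no-context baseline that a method recovers."""
    gap = teacher_score - no_context_score
    if gap == 0:
        # Assumption: treat an empty gap as an error rather than returning NaN.
        raise ValueError("teacher and no-context scores are equal; the ratio is undefined")
    return (score - no_context_score) / gap


print(round(teacher_normalized(0.80, 0.90, 0.30), 3))  # 0.833
print(round(teacher_normalized(0.30, 0.90, 0.30), 3))  # 0.0 (no better than no context)
print(round(teacher_normalized(0.90, 0.90, 0.30), 3))  # 1.0 (matches the teacher)
```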

Preparing the Dataset from Source

To rebuild the unified dataset from the original Hugging Face sources:

uv sync --extra prepare
uv run scripts/prepare_dataset.py --output-dir ./benchpress_data

This downloads all 10 source datasets, applies dataset-specific preprocessing, tokenizes contexts with Qwen/Qwen3-1.7B, and saves the unified dataset in Hugging Face Arrow format.

Dataset Schema

Each sample in the unified dataset has the following columns:

Column	Type	Description
`id`	`string`	Unique sample identifier
`context`	`string`	The passage/context text
`question`	`string`	The question
`answer`	`string`	Single canonical answer
`answers`	`list[string]`	All valid reference answers
`dataset_name`	`string`	Source dataset name
`num_context_tokens`	`int`	Context token count (Qwen/Qwen3-1.7B)
`domain`	`string` or `null`	Domain/type label if available
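
The columns above can be mirrored as a typed record when writing downstream code. The `TypedDict` below is a sketch derived from the table. The class name `BenchPressSample` and the example row are illustrative and not part of the BenchPress API.

```python
from typing import Optional, TypedDict


class BenchPressSample(TypedDict):
    """Illustrative type for one row of the unified dataset."""
    id: str
    context: str
    question: str
    answer: str
    answers: list[str]
    dataset_name: str
    num_context_tokens: int
    domain: Optional[str]


# A made-up row, for illustration only.
row: BenchPressSample = {
    "id": "example-0",
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "answers": ["Paris", "paris"],
    "dataset_name": "squad",
    "num_context_tokens": 8,
    "domain": None,
}

# Check that the row carries every column listed in the schema.
missing = set(BenchPressSample.__annotations__) - set(row)
print(sorted(missing))  # []
```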

Citation

@misc{feldman2025simplecontextcompressionmeanpooling,
      title={Simple Context Compression: Mean-Pooling and Multi-Ratio Training}, 
      author={Yair Feldman and Yoav Artzi},
      year={2025},
      eprint={2510.20797},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.20797}, 
}

License

This project is licensed under the MIT License. See LICENSE for details.
