BenchPress provides a unified benchmark for evaluating context compression methods across 10 diverse question-answering datasets, spanning short-context and mid-range scenarios.
Paper: No Mean Feat: Simple, Strong Baselines for Context Compression
The benchmark dataset is available on the Hugging Face Hub at yairfeldman/benchpress.
Code for training the models presented in the paper is available at lil-lab/simple-context-compression.
Install with uv:

```bash
uv sync
```

Then run commands with:

```bash
uv run ...
```

Or activate the virtual environment:

```bash
source .venv/bin/activate
```

Basic usage:

```python
import benchpress

# Load the benchmark dataset
dataset = benchpress.load(path="yairfeldman/benchpress")
# OR from local path: dataset = benchpress.load(path="./benchpress_data")

# Prepare a prompt for a sample
sample = dataset[0]
prompt = benchpress.prepare_prompt(
    sample["dataset_name"],
    sample["context"],
    sample["question"],
)

# Evaluate predictions against references
predictions = ["Paris"]
references = [["Paris", "paris"]]
scores = benchpress.evaluate(predictions, references)
print(benchpress.aggregate(scores))
# {'M': 1.0, 'EM': 1.0, 'F1': 1.0, 'Precision': 1.0, 'Recall': 1.0}
```

BenchPress includes 10 datasets organized into two subsets:
Short-context datasets:

| Dataset | Source | Split | Template Type (# Templates) | In-Domain |
|---|---|---|---|---|
| `squad` | rajpurkar/squad_v2 | validation | extractive_qa (101) | Yes |
| `narrativeqa` | NarrativeQA summaries | validation | qa (96) | Yes |
| `hotpotqa` | HotpotQA distractor | validation | extractive_qa (101) | Yes |
| `adversarial_qa` | AdversarialQA droberta | validation | extractive_qa (101) | No |
| `triviaqa_verified` | TriviaQA verified | dev | extractive_qa (101) | No |
| `paraphrase_rc` | DuoRC ParaphraseRC | validation | qa (96) | No |
Mid-range datasets:

| Dataset | Source | Split | Template Type (# Templates) |
|---|---|---|---|
| `longbench_qasper_e` | LongBench qasper_e | test | inline (1) |
| `longbench_multifieldqa_en_e` | LongBench multifieldqa_en_e | test | inline (1) |
| `longbench_hotpotqa_e` | LongBench hotpotqa_e | test | inline (1) |
| `longbench_2wikimqa_e` | LongBench 2wikimqa_e | test | inline (1) |
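For scripting over the benchmark, the dataset names from the two tables can be kept in plain Python structures. This is only an illustrative sketch: the constant names `SHORT_CONTEXT` and `MID_RANGE` are hypothetical and are not exported by the `benchpress` package.

```python
# Dataset names copied from the tables; the boolean is the "In-Domain" column.
SHORT_CONTEXT = {
    "squad": True,
    "narrativeqa": True,
    "hotpotqa": True,
    "adversarial_qa": False,
    "triviaqa_verified": False,
    "paraphrase_rc": False,
}

# Mid-range datasets have no in-domain flag in the table.
MID_RANGE = [
    "longbench_qasper_e",
    "longbench_multifieldqa_en_e",
    "longbench_hotpotqa_e",
    "longbench_2wikimqa_e",
]

all_names = list(SHORT_CONTEXT) + MID_RANGE
in_domain = [name for name, flag in SHORT_CONTEXT.items() if flag]

print(len(all_names))  # 10
print(in_domain)       # ['squad', 'narrativeqa', 'hotpotqa']
```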
Load the full benchmark, a subset, or specific datasets:

```python
import benchpress

dataset_path = "yairfeldman/benchpress"  # or local path: "./benchpress_data"

# Load all datasets
dataset = benchpress.load(path=dataset_path)

# Load only short-context datasets
short = benchpress.load(path=dataset_path, subset="short")

# Load only mid-range datasets
mid = benchpress.load(path=dataset_path, subset="mid_range")

# Load specific dataset(s)
squad = benchpress.load(path=dataset_path, datasets="squad")
subset = benchpress.load(path=dataset_path, datasets=["squad", "hotpotqa"])

# Load in-domain or out-of-domain splits
in_domain = benchpress.load(path=dataset_path, split_type="in_domain")
out_domain = benchpress.load(path=dataset_path, split_type="out_of_domain")
```

Filter by context length:

```python
import benchpress

# Keep only samples with at most 1024 context tokens
short_ctx = benchpress.load(
    path="yairfeldman/benchpress",
    max_context_tokens=1024,
)
```
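Conceptually, the `max_context_tokens` filter is a threshold on each sample's precomputed `num_context_tokens` column. The following is a minimal sketch of that idea on plain dictionaries, not the library's implementation; `filter_by_context_tokens` is a hypothetical helper and the records are made up for illustration.

```python
from typing import Iterable


def filter_by_context_tokens(samples: Iterable[dict], max_context_tokens: int) -> list[dict]:
    """Keep samples whose stored context length is at most the limit."""
    return [s for s in samples if s["num_context_tokens"] <= max_context_tokens]


# Made-up records that only mimic the relevant columns.
toy_samples = [
    {"id": "a", "dataset_name": "squad", "num_context_tokens": 180},
    {"id": "b", "dataset_name": "longbench_qasper_e", "num_context_tokens": 5200},
    {"id": "c", "dataset_name": "hotpotqa", "num_context_tokens": 1024},
]

kept = filter_by_context_tokens(toy_samples, max_context_tokens=1024)
print([s["id"] for s in kept])  # ['a', 'c']
```

The comparison is inclusive, matching "at most 1024 context tokens" in the example above.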
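Work with prompt templates:

```python
import benchpress

# Take the context and question from a loaded sample
dataset = benchpress.load(path="yairfeldman/benchpress", datasets="squad")
sample = dataset[0]
context, question = sample["context"], sample["question"]

# List available templates for a dataset
templates = benchpress.get_templates("squad")
print(f"SQuAD has {len(templates)} templates")  # 101

# Sample a template deterministically for a given sample
template = benchpress.sample_template("squad", context, question)

# Apply a template manually
prompt = benchpress.apply_template(template, context, question)

# Or use the convenience function
prompt = benchpress.prepare_prompt("squad", context, question)
```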
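One common way to make template sampling deterministic is to hash the sample's content and use the hash to index into the template list. The sketch below illustrates that idea only; it is an assumption about how such sampling could work, not BenchPress's actual implementation. The `pick_template` helper, the `{context}`/`{question}` placeholder syntax, and the two toy templates are all hypothetical.

```python
import hashlib


def pick_template(templates: list[str], dataset_name: str, context: str, question: str) -> str:
    """Choose a template by hashing the sample, so equal inputs always give the same template."""
    key = "\x1f".join([dataset_name, context, question]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(templates)
    return templates[index]


# Toy templates, not the ones shipped with the benchmark.
toy_templates = [
    "Context: {context}\nQuestion: {question}\nAnswer:",
    "Read the passage and answer.\n{context}\nQ: {question}\nA:",
]

context = "Paris is the capital of France."
question = "What is the capital of France?"

first = pick_template(toy_templates, "squad", context, question)
second = pick_template(toy_templates, "squad", context, question)
assert first == second  # same inputs, same template

prompt = first.format(context=context, question=question)
print(prompt)
```

A cryptographic hash is used here instead of Python's built-in `hash()` because string hashing in `hash()` is randomized per process, which would make the choice differ between runs.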
Evaluate predictions:

```python
import benchpress

# predictions: list of model output strings
# references: list of lists of valid reference answers
scores = benchpress.evaluate(predictions, references)
# Returns: {"M": [...], "EM": [...], "F1": [...], "Precision": [...], "Recall": [...]}

# Aggregate to dataset-level means
means = benchpress.aggregate(scores)
# Returns: {"M": 0.85, "EM": 0.72, ...}
```
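For intuition, the sketch below computes SQuAD-style exact match and token-level F1 against multiple references, keeping the best-scoring reference per sample. This is an assumption about what `EM`, `F1`, `Precision`, and `Recall` mean here: the exact normalization BenchPress applies and the definition of `M` are not stated in this README, so `M` is left out. The helper names `normalize`, `score_one`, and `score_against_references` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def score_one(prediction: str, reference: str) -> dict:
    """Score one prediction against one reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Empty answers only match empty answers.
        value = float(pred_tokens == ref_tokens)
        return {"EM": value, "F1": value, "Precision": value, "Recall": value}
    em = float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"EM": em, "F1": 0.0, "Precision": 0.0, "Recall": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"EM": em, "F1": f1, "Precision": precision, "Recall": recall}


def score_against_references(prediction: str, references: list[str]) -> dict:
    """Keep the scores from the reference with the highest F1 (ties broken by EM)."""
    candidates = [score_one(prediction, ref) for ref in references]
    return max(candidates, key=lambda s: (s["F1"], s["EM"]))


predictions = ["Paris", "The Eiffel Tower in Paris"]
references = [["Paris", "paris"], ["Eiffel Tower"]]

per_sample = [score_against_references(p, refs) for p, refs in zip(predictions, references)]
mean_f1 = sum(s["F1"] for s in per_sample) / len(per_sample)

print(per_sample[0])       # {'EM': 1.0, 'F1': 1.0, 'Precision': 1.0, 'Recall': 1.0}
print(round(mean_f1, 3))   # 0.833
```

The second prediction contains both reference tokens plus two extra ones, so its recall is 1.0 and its precision is 0.5, which gives an F1 of about 0.667.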
Compute a teacher-normalized score:

```python
import benchpress

# Compare a compression method against teacher and no-context baselines
normalized = benchpress.teacher_normalized_score(
    score=0.80,             # Your method's score
    teacher_score=0.90,     # Teacher (full context) score
    no_context_score=0.30,  # No-context baseline score
)
# Returns: (0.80 - 0.30) / (0.90 - 0.30) ≈ 0.833
```

To rebuild the unified dataset from the original Hugging Face sources:

```bash
uv sync --extra prepare
uv run scripts/prepare_dataset.py --output-dir ./benchpress_data
```

This downloads all 10 source datasets, applies dataset-specific preprocessing, tokenizes contexts with Qwen/Qwen3-1.7B, and saves the unified dataset in Hugging Face Arrow format.
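The token count stored for each context is the length of its tokenized form. Below is a minimal sketch of that computation with the tokenizer passed in as a plain callable, so the example runs without downloading a model. `count_context_tokens` is a hypothetical helper, the whitespace tokenizer is only a stand-in for the Qwen tokenizer, and whether the preparation script includes special tokens in its count is not stated here.

```python
from typing import Callable, Sequence


def count_context_tokens(context: str, encode: Callable[[str], Sequence]) -> int:
    """Return how many tokens `encode` produces for `context`."""
    return len(encode(context))


# Stand-in tokenizer so the sketch runs offline; NOT the Qwen tokenizer.
whitespace_encode = str.split

sample = {"context": "Paris is the capital of France."}
sample["num_context_tokens"] = count_context_tokens(sample["context"], whitespace_encode)
print(sample["num_context_tokens"])  # 6

# With Hugging Face transformers installed, the real tokenizer could be plugged in
# along these lines (assumption: special tokens are excluded from the count):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
#   encode = lambda text: tok.encode(text, add_special_tokens=False)
```

Counts from a whitespace split will differ from subword tokenizer counts, so the stand-in is only useful for seeing the shape of the computation.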
Each sample in the unified dataset has the following columns:
| Column | Type | Description |
|---|---|---|
| `id` | string | Unique sample identifier |
| `context` | string | The passage/context text |
| `question` | string | The question |
| `answer` | string | Single canonical answer |
| `answers` | list[string] | All valid reference answers |
| `dataset_name` | string | Source dataset name |
| `num_context_tokens` | int | Context token count (Qwen/Qwen3-1.7B) |
| `domain` | string or null | Domain/type label if available |
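To make the schema concrete, here is a sketch of one record as a Python `TypedDict`, with a small check that a record carries the expected columns. The `BenchPressSample` name, the `validate_sample` helper, and the example values are illustrative assumptions and are not part of the `benchpress` package.

```python
from typing import Optional, TypedDict


class BenchPressSample(TypedDict):
    id: str
    context: str
    question: str
    answer: str
    answers: list[str]
    dataset_name: str
    num_context_tokens: int
    domain: Optional[str]


def validate_sample(sample: dict) -> None:
    """Raise ValueError if a column is missing or has an obviously wrong type."""
    expected = set(BenchPressSample.__annotations__)
    missing = expected - set(sample)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not isinstance(sample["answers"], list):
        raise ValueError("answers must be a list of strings")
    if not isinstance(sample["num_context_tokens"], int):
        raise ValueError("num_context_tokens must be an int")


# Made-up record for illustration only.
example: BenchPressSample = {
    "id": "example-0",
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "answers": ["Paris", "paris"],
    "dataset_name": "squad",
    "num_context_tokens": 8,
    "domain": None,
}

validate_sample(example)  # passes silently
```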
Citation:

```bibtex
@misc{feldman2025simplecontextcompressionmeanpooling,
  title={Simple Context Compression: Mean-Pooling and Multi-Ratio Training},
  author={Yair Feldman and Yoav Artzi},
  year={2025},
  eprint={2510.20797},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.20797},
}
```
This project is licensed under the MIT License. See LICENSE for details.