vLLM backend crashes with "please provide at least one prompt" when max_gen_toks exceeds model context window

## Problem

When running benchmarks with the vLLM backend, evaluation crashes with the following error if `max_gen_toks` exceeds the model's maximum sequence length:

```
ValueError: please provide at least one prompt
ERROR: Engine core proc EngineCore_0 died unexpectedly, shutting down client.
```

This does not occur with the HuggingFace (`hf`) backend.

## Root Cause Analysis

The vLLM integration in lm-evaluation-harness calculates available prompt space as:

```python
available_prompt_space = max_model_len - max_gen_toks
```

When `max_gen_toks >= max_model_len`, this results in `available_prompt_space <= 0`, causing the prompt to be truncated to empty. vLLM then raises a `ValueError` because there's no prompt to generate from.
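A minimal sketch of how a negative prompt budget can empty the prompt. It assumes the harness left-truncates the tokenized prompt with a slice of the form `tokens[-available_prompt_space:]`. That slice form and the helper name `truncate_left` are assumptions for illustration, not code taken from lm-evaluation-harness:

```python
def truncate_left(tokens: list[int], available_prompt_space: int) -> list[int]:
    """Keep the last `available_prompt_space` tokens (hypothetical helper)."""
    # With a negative budget, -available_prompt_space is a positive start index,
    # so the slice skips that many tokens from the front of the prompt.
    return tokens[-available_prompt_space:]


prompt_tokens = list(range(100))  # stand-in for a 100-token prompt

# Normal case: 4096-token context, 256 generation tokens, prompt fits untouched.
kept_ok = truncate_left(prompt_tokens, 4096 - 256)

# Failing case from this issue: 4096-token context, 32768 generation tokens.
# The budget is -28672, so the slice becomes tokens[28672:], which is empty.
kept_bad = truncate_left(prompt_tokens, 4096 - 32768)

print(len(kept_ok), len(kept_bad))
```

Under this assumption, any prompt shorter than `max_gen_toks - max_model_len` tokens is reduced to an empty list before it reaches vLLM.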

## Environment

- evalchemy
- vLLM: 0.10.1.1

## Reproduction Steps

```bash
# Using ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3 (4096 context window)
# with MATH500 benchmark (32768 default max_tokens)

# This crashes with vLLM:
python -m eval.eval --model vllm \
  --tasks MATH500 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"

# This works with HuggingFace:
python -m eval.eval --model hf \
  --tasks MATH500 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"
```

## Affected Benchmarks

The issue occurs when a benchmark's default `max_tokens` exceeds the model's context window. I tested the following benchmarks and confirmed that they fail with their default settings:

- AIME24
- AIME25
- AMC23
- MATH500
- LiveCodeBench
- GPQADiamond
- JEEBench

This is not an exhaustive list. Any benchmark can be affected if `max_tokens` (default or via `--max_tokens` argument) exceeds the model's context window.

## Expected Behavior

The evaluation should gracefully cap `max_gen_toks` to fit within the available context window instead of crashing.

## Proposed Solution

Dynamically cap `max_gen_toks` per-prompt based on actual prompt length in `_normalize_model_args`:

```python
max_allowed = max_model_len - prompt_length - 16  # 16 token safety buffer
capped_max_new_tokens = min(max_new_tokens, max(1, max_allowed))
```


