vLLM backend crashes with "please provide at least one prompt" when max_gen_toks exceeds model context window #152

Description

@Ali-Elganzory

Problem

When running benchmarks with the vLLM backend, evaluation crashes with the following error if max_gen_toks exceeds the model's maximum sequence length:

ValueError: please provide at least one prompt
ERROR: Engine core proc EngineCore_0 died unexpectedly, shutting down client.

This does not occur with the HuggingFace (hf) backend.

Root Cause Analysis

The vLLM integration in lm-evaluation-harness calculates available prompt space as:

available_prompt_space = max_model_len - max_gen_toks

When max_gen_toks >= max_model_len, this results in available_prompt_space <= 0, causing the prompt to be truncated to empty. vLLM then raises a ValueError because there's no prompt to generate from.
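
To make the arithmetic concrete, here is a minimal sketch of the failure mode. It assumes the harness left-truncates the tokenized prompt with a negative slice; `truncate_prompt_left` is a hypothetical helper for illustration, not the actual lm-evaluation-harness code. The numbers match the reproduction case (4096 context window, 32768 max_gen_toks).

```python
def truncate_prompt_left(prompt_tokens, max_model_len, max_gen_toks):
    """Keep only the rightmost tokens that fit next to the generation budget.

    Hypothetical stand-in for the harness's truncation step; the slicing
    strategy is an assumption made for this illustration.
    """
    available_prompt_space = max_model_len - max_gen_toks
    # With a negative budget, -available_prompt_space becomes a large positive
    # start index, so the slice falls past the end of a short prompt.
    return prompt_tokens[-available_prompt_space:]


prompt = list(range(100))  # a 100-token prompt

# Generation budget fits: the prompt is left untouched.
ok = truncate_prompt_left(prompt, max_model_len=4096, max_gen_toks=1024)

# Generation budget exceeds the context window: nothing is left of the prompt.
broken = truncate_prompt_left(prompt, max_model_len=4096, max_gen_toks=32768)

print(len(ok), len(broken))
```

Under these assumptions the second call returns an empty list, which is the empty prompt that vLLM rejects.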

Environment

  • evalchemy
  • vLLM: 0.10.1.1

Reproduction Steps

# Using ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3 (4096 context window)
# with MATH500 benchmark (32768 default max_tokens)

# This crashes with vLLM:
python -m eval.eval --model vllm \
  --tasks MATH500 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"

# This works with HuggingFace:
python -m eval.eval --model hf \
  --tasks MATH500 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"

Affected Benchmarks

The issue occurs when a benchmark's default max_tokens exceeds the model's context window. I tested the following benchmarks and confirmed that they fail with their default settings:

  • AIME24
  • AIME25
  • AMC23
  • MATH500
  • LiveCodeBench
  • GPQADiamond
  • JEEBench

This is not an exhaustive list. Any benchmark can be affected if max_tokens (whether the default or set via the --max_tokens argument) exceeds the model's context window.

Expected Behavior

The evaluation should gracefully cap max_gen_toks to fit within the available context window instead of crashing.

Proposed Solution

Dynamically cap max_gen_toks per prompt, based on the actual prompt length, in _normalize_model_args:

max_allowed = max_model_len - prompt_length - 16  # 16 token safety buffer
capped_max_new_tokens = min(max_new_tokens, max(1, max_allowed))
