
Unable to run "Qwen/Qwen3-0.6B" (or a model of any size) using either the default driver or pie-driver-vllm #450

Description

@keithomayot

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02 KMD Version: 610.47 CUDA UMD Version: 13.3 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3050 ... On | 00000000:01:00.0 Off | N/A |
| N/A 36C P0 12W / 65W | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

The small models run perfectly when using vLLM directly, but when I run them in pie they fail to start, no matter the model size.
I get this error in pie using the vllm driver:
"RuntimeError: Insufficient KV cache budget: 0 bytes / 516096 bytes per global block = 0 blocks. Increase gpu_mem_utilization or reduce model size.
worker pid=27465 died with exit code 1 before ready
pie: starting one-shot engine: starting subprocess driver (vllm) for model "default" group 0: reading handshake for vllm group 0: launcher exited before handshake completed; check stderr for the launcher's last log line"
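The arithmetic in that error can be sketched as follows. This is only a minimal illustration, not pie's or vLLM's actual code: `kv_budget` and `kv_blocks` are hypothetical helper names, and the budget model (usable share of GPU memory minus weights and runtime overhead) is an assumption. The only number taken from the log is 516096 bytes per block.

```python
# Hypothetical sketch of the block-count arithmetic reported in the error.
# Not pie's or vLLM's implementation; names and the budget model are assumptions.

BYTES_PER_BLOCK = 516096  # taken from the error message


def kv_budget(total_bytes: int, utilization: float,
              weights_bytes: int, overhead_bytes: int) -> int:
    """Assumed model: usable share of GPU memory minus weights and overhead, floored at 0."""
    usable = int(total_bytes * utilization)
    return max(usable - weights_bytes - overhead_bytes, 0)


def kv_blocks(budget_bytes: int, bytes_per_block: int) -> int:
    """Number of whole KV cache blocks that fit in the budget."""
    if bytes_per_block <= 0:
        raise ValueError("bytes_per_block must be positive")
    return max(budget_bytes, 0) // bytes_per_block


# A budget of 0 bytes yields 0 blocks, which is what the error reports.
print(kv_blocks(0, BYTES_PER_BLOCK))
```

Under this assumed model, a budget of 0 means that everything else (weights, activations, CUDA graph capture, other processes) already consumed the usable memory on the 4096 MiB card before any KV cache could be reserved.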

My pie config:

[model.driver]
type = "vllm"
device = ["cuda:0"]
activation_dtype = "bfloat16"
tensor_parallel_size = 1

# Dedicated throughput workers can trade CPU for lower IPC wake latency.

ipc_profile = "latency"

[model.driver.options]
venv = "/home/openok/.pie/venvs/pie-vllm"
attention_backend = "FLASHINFER" # FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / …
gpu_memory_utilization = 1.0
enforce_eager = false # disable CUDA graphs
max_num_seqs = 1 # optional active sequence cap
max_num_batched_tokens = 2048 # optional vLLM per-step token budget
max_model_len = 12000 # optional context length cap
quantization = "bitsandbytes"
kv_cache_dtype = "nvfp4"

# n-gram speculative decoding (driver-side drafts)

spec_ngram_enabled = true
spec_ngram_num_drafts = 4
spec_ngram_min_n = 2
spec_ngram_max_n = 4

I even tried changing the quantization setting, but with no success. Has anyone got this running? If so, how did you do it?

Thanks,
