Unable to run "Qwen/Qwen3-0.6B" (or a model of any size) with either the default driver or pie-driver-vllm #450

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02              KMD Version: 610.47        CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   36C    P0             12W /   65W |       0MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Small models run perfectly when I use vLLM directly, but when I run them in pie they fail to start, no matter the model size.
I get this error in pie using the vllm driver:
"RuntimeError: Insufficient KV cache budget: 0 bytes / 516096 bytes per global block = 0 blocks. Increase gpu_mem_utilization or reduce model size.
worker pid=27465 died with exit code 1 before ready
pie: starting one-shot engine: starting subprocess driver (vllm) for model "default" group 0: reading handshake for vllm group 0: launcher exited before handshake completed; check stderr for the launcher's last log line"
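
The arithmetic behind the first line of that error can be reproduced with a short Python sketch. The budget and per-block size are copied from the log, and `max_model_len` comes from the config in this issue. The 16 tokens per block is an assumption (the log does not state the block size), and the helper names are made up for illustration; this is not how pie itself computes the value.

```python
import math

# Values copied from the error message and the config in this issue.
KV_BUDGET_BYTES = 0           # "0 bytes" reported by the driver
BYTES_PER_BLOCK = 516_096     # "516096 bytes per global block"
MAX_MODEL_LEN = 12_000        # max_model_len from [model.driver.options]

# ASSUMPTION: 16 tokens per KV block (not stated in the log or the config).
TOKENS_PER_BLOCK = 16


def available_blocks(budget_bytes: int, bytes_per_block: int) -> int:
    """Whole KV blocks that fit in the budget (floor division)."""
    return budget_bytes // bytes_per_block


def bytes_for_context(num_tokens: int, tokens_per_block: int, bytes_per_block: int) -> int:
    """Bytes needed to hold one sequence of num_tokens tokens."""
    blocks = math.ceil(num_tokens / tokens_per_block)
    return blocks * bytes_per_block


blocks_now = available_blocks(KV_BUDGET_BYTES, BYTES_PER_BLOCK)
needed = bytes_for_context(MAX_MODEL_LEN, TOKENS_PER_BLOCK, BYTES_PER_BLOCK)

print(f"blocks available: {blocks_now}")
print(f"bytes for one {MAX_MODEL_LEN}-token sequence: {needed} ({needed / 2**20:.1f} MiB)")
```

With a budget of 0 bytes the result is 0 blocks whatever the block size is. Under the 16-token assumption, one 12000-token sequence would need 750 blocks, roughly 369 MiB, so the failure comes from the budget being computed as 0 rather than from the size of a single block.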

My pie config:

[model.driver]
type = "vllm"
device = ["cuda:0"]
activation_dtype = "bfloat16"
tensor_parallel_size = 1
## Dedicated throughput workers can trade CPU for lower IPC wake latency.
# ipc_profile = "latency"


[model.driver.options]
venv = "/home/openok/.pie/venvs/pie-vllm"
attention_backend       = "FLASHINFER"   # FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / …
gpu_memory_utilization  = 1.0
enforce_eager           = false          # disable CUDA graphs
max_num_seqs            = 1            # optional active sequence cap
max_num_batched_tokens  = 2048           # optional vLLM per-step token budget
max_model_len           = 12000           # optional context length cap
quantization            = "bitsandbytes"
kv_cache_dtype          = "nvfp4"

## n-gram speculative decoding (driver-side drafts)
spec_ngram_enabled      = true
spec_ngram_num_drafts   = 4
spec_ngram_min_n        = 2
spec_ngram_max_n        = 4

I even tried changing the quantization setting, but without success. Has anyone gotten this running? If so, how did you do it?

Thanks,
