+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02 KMD Version: 610.47 CUDA UMD Version: 13.3 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3050 ... On | 00000000:01:00.0 Off | N/A |
| N/A 36C P0 12W / 65W | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Small models run perfectly when I use vLLM directly, but when I run them in pie with the vLLM driver, nothing starts, no matter the model size.
I get this error in pie using vllm:
"RuntimeError: Insufficient KV cache budget: 0 bytes / 516096 bytes per global block = 0 blocks. Increase gpu_mem_utilization or reduce model size.
worker pid=27465 died with exit code 1 before ready
pie: starting one-shot engine: starting subprocess driver (vllm) for model "default" group 0: reading handshake for vllm group 0: launcher exited before handshake completed; check stderr for the launcher's last log line"
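The numbers in that error can be reproduced with plain integer division. The sketch below is only a reading of the message, not pie's or vLLM's actual code: the function name `kv_blocks` and the 1 GiB comparison budget are illustrative assumptions; only the 516096 bytes-per-block figure and the 0-byte budget come from the log.

```python
# Illustrative only: mirrors the arithmetic printed in the error message.
# BYTES_PER_BLOCK is taken from the log; everything else is hypothetical.
BYTES_PER_BLOCK = 516_096


def kv_blocks(budget_bytes: int, bytes_per_block: int = BYTES_PER_BLOCK) -> int:
    """Return how many whole KV cache blocks fit in the budget."""
    if bytes_per_block <= 0:
        raise ValueError("bytes_per_block must be positive")
    # A negative budget is treated as zero; partial blocks are discarded.
    return max(budget_bytes, 0) // bytes_per_block


if __name__ == "__main__":
    print(kv_blocks(0))        # the 0-byte budget reported in the error
    print(kv_blocks(1 << 30))  # a hypothetical 1 GiB budget, for comparison
```

With a 0-byte budget the result is 0 blocks whatever the block size is, so the open question is why nothing is left over for the KV cache on the 4096MiB card.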
My pie config:
[model.driver]
type = "vllm"
device = ["cuda:0"]
activation_dtype = "bfloat16"
tensor_parallel_size = 1
# Dedicated throughput workers can trade CPU for lower IPC wake latency.
ipc_profile = "latency"
[model.driver.options]
venv = "/home/openok/.pie/venvs/pie-vllm"
attention_backend = "FLASHINFER" # FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / …
gpu_memory_utilization = 1.0
enforce_eager = false # disable CUDA graphs
max_num_seqs = 1 # optional active sequence cap
max_num_batched_tokens = 2048 # optional vLLM per-step token budget
max_model_len = 12000 # optional context length cap
quantization = "bitsandbytes"
kv_cache_dtype = "nvfp4"
# n-gram speculative decoding (driver-side drafts)
spec_ngram_enabled = true
spec_ngram_num_drafts = 4
spec_ngram_min_n = 2
spec_ngram_max_n = 4
I even tried changing the quantization, but with no success. Has anyone got this running? If so, how did you do it?
Thanks,