vLLM 0.15.1 crashes on T4 GPU — CUTLASS DSL fails to detect sm_75 compute capability #57

@rmdodhia

vLLM 0.15.1 (the latest release at the time of writing) cannot serve Fara-7B on a T4 GPU (sm_75, 15GB VRAM), the most common free GPU available via Google Colab. The CUTLASS DSL compiler fails to detect the T4's compute capability, which crashes the engine during initialization.

Environment

  • GPU: NVIDIA T4 (Google Colab free tier), 15GB VRAM, sm_75
  • vLLM: 0.15.1
  • PyTorch: 2.9.1+cu128
  • Python: 3.12
  • OS: Ubuntu (Colab default)

Error
(EngineCore_DP0 pid=5369) Starting to load model microsoft/Fara-7B...
(EngineCore_DP0 pid=5369) ERROR EngineCore failed to start.
File "nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/compiler.py", line 148, in compile
pm.run(module.operation)
cutlass._mlir._mlir_libs._site_initialize.<locals>.MLIRError: Failure while executing pass pipeline:
error: unknown: failed to verify the compilation unit (error 7: NVVM_ERROR_INVALID_OPTION),
libNVVM extra log: libnvvm : error: -arch=compute
is an unsupported option
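
For comparison, libNVVM architecture options normally take the form `-arch=compute_XY`, where X and Y are the major and minor compute capability, so a T4 would be expected to produce `-arch=compute_75`. The sketch below is illustrative only: `nvvm_arch_flag` and `detect_capability` are hypothetical helpers, not CUTLASS DSL code, and the idea that the flag is built by appending the capability digits is an assumption. It shows how a missing capability could yield the truncated `-arch=compute` seen in the log, and how to check what PyTorch reports for the device.

```python
from typing import Optional, Tuple


def nvvm_arch_flag(capability: Optional[Tuple[int, int]]) -> str:
    """Hypothetical helper: build a libNVVM-style -arch option.

    `capability` is (major, minor), e.g. (7, 5) for a T4. None models a
    failed detection, the failure mode suspected in this issue.
    """
    if capability is None:
        # Nothing to append: matches the malformed option shown in the log.
        return "-arch=compute"
    major, minor = capability
    return f"-arch=compute_{major}{minor}"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the capability of GPU 0 via PyTorch, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return (major, minor)
```

If `detect_capability()` returns `(7, 5)` on the Colab T4 while the CUTLASS DSL log still shows no suffix, that would point at the DSL's own detection rather than at the driver or PyTorch.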

Attempted workarounds (all failed)

  1. --enforce-eager: same CUTLASS error; it still tries to compile
  2. VLLM_USE_V1=0 (V0 engine): same error
  3. TORCH_CUDA_ARCH_LIST=7.5: same error; the env var is not picked up by CUTLASS DSL
  4. SGLang as an alternative: cuDNN version mismatch, then OOM on FP16
  5. --max-model-len 8192 --gpu-memory-utilization 0.98: crashes before reaching model loading
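
The attempts that involve vLLM (1, 2, 3 and 5) can be scripted for reproduction. This is a sketch under assumptions: the issue does not give the exact command line, so the `vllm serve microsoft/Fara-7B` base invocation is assumed, and `ATTEMPTS`, `build_attempt` and `describe` are hypothetical names. Only the flags and environment variables listed above are used.

```python
import os
import shlex
from typing import Dict, List, Tuple

# Assumed base command; the issue does not show the exact invocation.
BASE_CMD: List[str] = ["vllm", "serve", "microsoft/Fara-7B"]

# label -> (extra CLI flags, extra environment variables), from the list above.
ATTEMPTS: Dict[str, Tuple[List[str], Dict[str, str]]] = {
    "enforce-eager": (["--enforce-eager"], {}),
    "v0-engine": ([], {"VLLM_USE_V1": "0"}),
    "arch-list": ([], {"TORCH_CUDA_ARCH_LIST": "7.5"}),
    "low-memory": (
        ["--max-model-len", "8192", "--gpu-memory-utilization", "0.98"],
        {},
    ),
}


def build_attempt(label: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for one attempt without launching anything."""
    flags, extra_env = ATTEMPTS[label]
    env = dict(os.environ)
    env.update(extra_env)
    return BASE_CMD + flags, env


def describe(label: str) -> str:
    """Render an attempt as a copy-pasteable shell line."""
    flags, extra_env = ATTEMPTS[label]
    prefix = [f"{key}={shlex.quote(value)}" for key, value in extra_env.items()]
    return " ".join(prefix + [shlex.quote(arg) for arg in BASE_CMD + flags])
```

The `(argv, env)` pair from `build_attempt` can be passed to `subprocess.run(argv, env=env)` on a T4 runtime to capture the log of each attempt separately.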

Impact
The T4 is the default free GPU on Google Colab, making it the most accessible option for users who:

  • Don't have a local GPU with 24GB+ VRAM
  • Can't access Azure Foundry
  • Want to try Fara-7B without hardware investment

The README recommends vLLM for self-hosting but doesn't document this incompatibility.

Suggested fixes

  • Document T4 incompatibility in the README's self-hosting section
  • Pin a known-working vLLM version (e.g., 0.8.x) in requirements or document it
  • Provide an AWQ-quantized model — FP16 Fara-7B (14.5GB weights) leaves almost no room for KV cache on T4's 15GB, even if vLLM worked. A 4-bit quantized version that preserves the vision encoder would make T4 viable.
  • Add a Colab notebook to the repo (see related issue)
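
The memory argument in the quantization suggestion can be made concrete with rough arithmetic. The sketch below uses the 15GB and 14.5GB figures from this issue and the 0.98 utilization from the attempted workarounds. The 4-bit estimate simply scales the FP16 weights by 4/16, which is an idealized assumption: it ignores quantization metadata and a vision encoder kept at higher precision, so real headroom would be smaller.

```python
def headroom_gb(total_gb: float, weights_gb: float, utilization: float) -> float:
    """Memory left for KV cache and activations, in GB."""
    return total_gb * utilization - weights_gb


T4_VRAM_GB = 15.0        # from the issue
FP16_WEIGHTS_GB = 14.5   # from the issue
UTILIZATION = 0.98       # --gpu-memory-utilization value tried in the workarounds

# Assumption: 4-bit weights take 4/16 of the FP16 footprint (idealized).
AWQ_WEIGHTS_GB = FP16_WEIGHTS_GB * 4 / 16

fp16_left = headroom_gb(T4_VRAM_GB, FP16_WEIGHTS_GB, UTILIZATION)
awq_left = headroom_gb(T4_VRAM_GB, AWQ_WEIGHTS_GB, UTILIZATION)

print(f"FP16 headroom:  {fp16_left:.2f} GB")
print(f"4-bit headroom: {awq_left:.2f} GB")
```

Under these assumptions FP16 leaves roughly 0.2 GB on a T4, while the idealized 4-bit case leaves about 11 GB, which is why a quantized checkpoint matters independently of the CUTLASS failure.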
