
When using Transformers’ .generate() with Tensor Parallel (TP), it seems the method is not TP-aware by default. #7

@JamieCodingDev

Hi, I’m trying to understand the workflow for generating and reusing a KV cache with large models (70B) and LMCache. I have a few questions:

1. Transformers’ .generate() does not appear to be Tensor Parallel (TP)-aware by default. So if I generate a KV cache with Transformers under TP and then try to decode from it, I might run into vocab/GPU mismatch issues. Is that correct?

2. vLLM supports TP-aware prefill and generation, which solves the TP distribution problem. However, as far as I can tell, it currently does not support injecting external past_key_values. Does this mean I cannot directly load a KV cache generated by Transformers into vLLM for decoding?

3. My workflow is: generate KV cache → compress with CacheGen/LMCache → decode to test F1 score. What would be the recommended approach? Should I:

   (a) stick with Transformers and implement TP-aware generate manually, or
   (b) use vLLM for both prefill and decode, even if I cannot reuse an external KV cache?
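For the last step of that workflow, the F1 variant is not stated above. A minimal sketch, assuming the usual token-overlap F1 used in QA-style evaluation (whitespace tokenization and lowercasing are assumptions, and `token_f1` is a hypothetical helper name), could look like this:

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a decoded answer and a reference answer.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Comparing this score for text decoded from the original KV cache against text decoded from the compressed one gives a rough measure of how much quality the compression costs, independent of which engine performs the decode.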

withdrawchezingt The role of art in society
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [67,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [68,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [69,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [70,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [71,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [72,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [73,0,0] Assertion srcIndex < srcSelectDimSize failed.
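The `srcIndex < srcSelectDimSize` assertion fires when an index passed to an index-select operation is not smaller than the size of the dimension being indexed. One possible cause, and it is only an assumption that it applies here, is a token ID that exceeds the number of rows in the embedding table seen by that GPU. A small check like the following sketch (the helper name `find_out_of_range_ids` is hypothetical) can be run on the token IDs, converted to a plain Python list, before they reach the model:

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return (position, token_id) pairs that fall outside [0, vocab_size).

    token_ids: a flat list of integer token IDs.
    vocab_size: number of rows in the embedding table being indexed
                (assumption: this is the size the lookup actually uses).
    """
    return [
        (pos, tid)
        for pos, tid in enumerate(token_ids)
        if not 0 <= tid < vocab_size
    ]


# Example: ID 32005 does not fit in a 32000-row table.
bad = find_out_of_range_ids([1, 15043, 32005, 2], vocab_size=32000)
print(bad)
```

An empty result rules out this particular cause. A non-empty one points at the tokenizer or at how the vocabulary is split across GPUs rather than at the KV cache itself.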
