Clarify misleading KV cache Storage Throughput metric #467

@jay-tau

Problem

The KV cache benchmark displays storage_throughput_tokens_per_sec under the label Storage Throughput. It is calculated as:

total_tokens_generated / total_storage_io_latency

That name is confusing because elsewhere storage clearly refers to the disk/NVMe tier, e.g. tier_storage_kv_bytes_read_gb, tier_storage_kv_bytes_written_gb, tier_storage_read_bandwidth_gbps, and tier_storage_write_bandwidth_gbps.

As a result, readers naturally interpret Storage Throughput as raw disk/NVMe throughput, but it is really token throughput through the cache I/O path. CPU RAM hits can increase this metric even while raw disk pressure drops, which makes the discovery note hard to understand:

Storage Throughput shows only 1.1x at cpu_mem=0GB but 2.2x at cpu_mem=4GB
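To make the distinction concrete, here is a minimal sketch of the ratio described above. The function name and every number in it are hypothetical and chosen only for illustration; they are not taken from the benchmark code or from a real run.

```python
def tokens_per_sec_of_io_latency(total_tokens_generated: int,
                                 total_storage_io_latency_s: float) -> float:
    """Tokens generated divided by total cache I/O latency, in seconds."""
    if total_storage_io_latency_s <= 0:
        raise ValueError("total_storage_io_latency_s must be positive")
    return total_tokens_generated / total_storage_io_latency_s


# Hypothetical run A: no CPU RAM tier, every cache access goes to disk.
run_a_metric = tokens_per_sec_of_io_latency(10_000, 50.0)
run_a_disk_read_gb = 40.0  # made-up stand-in for tier_storage_kv_bytes_read_gb

# Hypothetical run B: some accesses hit CPU RAM, so the summed latency is lower
# and fewer bytes are read from the disk tier.
run_b_metric = tokens_per_sec_of_io_latency(10_000, 25.0)
run_b_disk_read_gb = 15.0  # made-up stand-in for tier_storage_kv_bytes_read_gb

# The token metric doubles while the disk tier does less work.
metric_ratio = run_b_metric / run_a_metric
disk_read_ratio = run_b_disk_read_gb / run_a_disk_read_gb
print(f"token metric ratio: {metric_ratio:.2f}x")
print(f"disk bytes read ratio: {disk_read_ratio:.3f}x")
```

In this made-up case the token metric rises by 2x even though the disk tier serves less data, which is the pattern that reads as "the disk got faster" under the current name.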

Proposed fix

Rename the displayed/JSON metric to something like:

cache_io_tokens_per_sec

or, if keeping backwards compatibility, add the new name as an alias and mark storage_throughput_tokens_per_sec deprecated in docs/output.
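The backwards-compatible option could look roughly like the sketch below. It assumes the results are assembled as a plain dict before being written to JSON; the helper name with_metric_alias and the deprecated_metrics key are hypothetical and do not exist in the benchmark today.

```python
OLD_KEY = "storage_throughput_tokens_per_sec"
NEW_KEY = "cache_io_tokens_per_sec"  # proposed name from this issue


def with_metric_alias(metrics: dict) -> dict:
    """Return a copy of metrics with the new key added alongside the old one.

    Hypothetical helper: the old key is kept so existing consumers keep
    working, and a deprecated_metrics map records which key replaces it.
    """
    out = dict(metrics)
    if OLD_KEY in out and NEW_KEY not in out:
        out[NEW_KEY] = out[OLD_KEY]
    # Copy any existing map so the caller's dict is not mutated.
    deprecated = dict(out.get("deprecated_metrics", {}))
    deprecated[OLD_KEY] = NEW_KEY
    out["deprecated_metrics"] = deprecated
    return out


# Usage with a made-up value:
results = with_metric_alias({OLD_KEY: 123.4})
```

Keeping both keys for a release or two would let downstream scripts migrate before the old name is removed.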

Reserve Storage Throughput wording for actual disk-tier metrics:

tier_storage_read_bandwidth_gbps
tier_storage_write_bandwidth_gbps
tier_storage_kv_bytes_read_gb
tier_storage_kv_bytes_written_gb

Why this matters

The current name makes it look like increasing cpu_mem makes the disk faster. What is actually happening is that CPU RAM reduces cache I/O latency, so tokens / cache_io_latency increases. Raw disk saturation should be judged from tier storage bandwidth/bytes or iostat, not this token metric.
