VLM: wire performance instrumentation and logits output #72

Merged
stikves merged 3 commits into apple:main from stikves:sukru/vlm-instrumentation
Jul 2, 2026

@stikves commented Jul 1, 2026

Contributor

Wire performance metrics and logits output for the VLM inference path, matching the LLM path behavior.

Changes

  • Report prompt throughput (prefill t/s) and generation throughput in the performance summary
  • Wire --print-logits for VLM: shows the top-5 token logits for each generated step
  • Wire --save-logits for VLM: saves top-K logits to JSON (same format as LLM path)
  • Make TokenLogits and TopLogitEntry properties public (cross-module access)

Sample output

$ llm-runner --model <vlm-bundle> --image photo.jpg --prompt "What is this?" --max-tokens 10

Generating...
The image features a small kitten sitting on

Performance Summary:
==================================================
Model Load: 15299.1ms
Prompt:     1028.3ms, 590 tokens, 573.8 tokens/sec
Generation: 480.8ms, 9 tokens, 18.7 tokens/sec
Total:      24.479s
==================================================
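The throughput figures in the summary follow directly from the token counts and phase durations. A minimal sketch of that arithmetic, assuming a hypothetical helper named `tokensPerSecond` (not taken from the PR), with the inputs copied from the sample summary:

```swift
import Foundation

/// Tokens per second from a token count and an elapsed time in milliseconds.
/// Returns 0 for a non-positive duration so an unmeasured phase never divides by zero.
/// NOTE: hypothetical helper for illustration; the runner's real code is not shown in this PR text.
func tokensPerSecond(tokens: Int, elapsedMs: Double) -> Double {
    guard elapsedMs > 0 else { return 0 }
    return Double(tokens) / (elapsedMs / 1000.0)
}

// Inputs copied from the sample "Performance Summary".
let prefillRate = tokensPerSecond(tokens: 590, elapsedMs: 1028.3)
let generationRate = tokensPerSecond(tokens: 9, elapsedMs: 480.8)

print(String(format: "Prompt:     %.1f tokens/sec", prefillRate))     // 573.8
print(String(format: "Generation: %.1f tokens/sec", generationRate))  // 18.7
```

Both values match the summary, which suggests the reported rates are plain tokens-over-elapsed-time for each phase.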

With --print-logits:

Generating...
  logits top5: [450]=26.812 [910]=25.406 [512]=24.422 [319]=22.609 [739]=22.109 The
  logits top5: [1967]=25.562 [9088]=19.609 [7623]=18.797 [15373]=18.734 [17739]=17.516 image
  ...
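Each line above lists the five largest logits for one generation step as `[token_id]=value`, followed by the decoded text of the chosen token. A rough sketch of how such a line could be produced from a raw logits vector; `topK` and `formatTopLogits` are hypothetical names, and the toy vocabulary is made up for illustration:

```swift
import Foundation

/// Returns the k largest logits with their token IDs, highest first.
/// NOTE: hypothetical helper; not the implementation used by llm-runner.
func topK(_ logits: [Float], k: Int) -> [(tokenID: Int, logit: Float)] {
    let ranked = logits.enumerated().sorted { $0.element > $1.element }
    return ranked.prefix(max(0, k)).map { (tokenID: $0.offset, logit: $0.element) }
}

/// Formats entries in the "[id]=value" style of the sample output.
func formatTopLogits(_ entries: [(tokenID: Int, logit: Float)]) -> String {
    entries
        .map { "[\($0.tokenID)]=" + String(format: "%.3f", $0.logit) }
        .joined(separator: " ")
}

// Toy 8-token vocabulary; a real vocabulary has tens of thousands of entries.
let stepLogits: [Float] = [0.5, 3.25, -1.0, 7.125, 2.0, 6.5, 0.0, 4.75]
let top5 = topK(stepLogits, k: 5)
print("  logits top5: " + formatTopLogits(top5))
// prints "  logits top5: [3]=7.125 [5]=6.500 [7]=4.750 [1]=3.250 [4]=2.000"
```

A full sort is the simplest approach for a sketch; with a large vocabulary a partial selection would avoid sorting every entry on every step.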

With --save-logits /tmp/vlm_logits.json:

{
  "tokens": [{
    "token_id": 450,
    "incremental_text": "The",
    "top_logits": [
      {"token_id": 450, "incremental_text": "The", "logit": 26.8125},
      {"token_id": 910, "incremental_text": "This", "logit": 25.406}
    ]
  }]
}
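For anyone consuming the saved file, the JSON shape above maps naturally onto `Codable` types. The sketch below is an assumption-laden reader: `SavedLogitsFile`, `SavedToken`, and `SavedTopLogit` are hypothetical mirror types, not the `TokenLogits` and `TopLogitEntry` types touched by this PR, whose definitions are not shown here. Only the key names are taken from the sample.

```swift
import Foundation

// Hypothetical mirror types for reading the saved file. Key names come from
// the sample JSON; the type and property names are illustrative only.
struct SavedTopLogit: Codable {
    let tokenID: Int
    let incrementalText: String
    let logit: Double

    enum CodingKeys: String, CodingKey {
        case tokenID = "token_id"
        case incrementalText = "incremental_text"
        case logit
    }
}

struct SavedToken: Codable {
    let tokenID: Int
    let incrementalText: String
    let topLogits: [SavedTopLogit]

    enum CodingKeys: String, CodingKey {
        case tokenID = "token_id"
        case incrementalText = "incremental_text"
        case topLogits = "top_logits"
    }
}

struct SavedLogitsFile: Codable {
    let tokens: [SavedToken]
}

// The sample from the PR description, embedded so the sketch runs standalone.
let sampleJSON = """
{
  "tokens": [{
    "token_id": 450,
    "incremental_text": "The",
    "top_logits": [
      {"token_id": 450, "incremental_text": "The", "logit": 26.8125},
      {"token_id": 910, "incremental_text": "This", "logit": 25.406}
    ]
  }]
}
"""

let savedLogits = try JSONDecoder().decode(SavedLogitsFile.self, from: Data(sampleJSON.utf8))
for token in savedLogits.tokens {
    let candidates = token.topLogits.map { "\($0.incrementalText) (\($0.tokenID))" }
    print("chose \(token.incrementalText) from: \(candidates.joined(separator: ", "))")
}
```

To read a real file, the embedded string would be replaced with `Data(contentsOf:)` on the path passed to --save-logits.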

Test plan

  • Build passes
  • VLM shows correct prompt/generation t/s in performance summary
  • --print-logits displays top-5 logits per token during VLM generation
  • --save-logits produces valid JSON with top-K entries
  • Verbose output (--verbose) shows full timing breakdown table
  • Verify no regression on text-only LLM path

- Report prompt throughput (prefill t/s) and generation throughput
  for VLM inference, matching the LLM path performance summary
- Wire --print-logits for VLM: shows top-5 token probabilities per step
- Wire --save-logits for VLM: saves top-K logits to JSON file
- Make TokenLogits and TopLogitEntry properties public (needed by runner)

Tested with LLaVA-1.5-7B bundle: prompt 590 tokens at 579 t/s,
generation 20 tokens at 19.4 t/s. Logits JSON output verified.
@stikves force-pushed the sukru/vlm-instrumentation branch from 60c0a79 to dbde899 on July 1, 2026 05:16
@stikves marked this pull request as ready for review on July 1, 2026 05:17
@stikves force-pushed the sukru/vlm-instrumentation branch from a084bb9 to 5c68d49 on July 1, 2026 05:23
@stikves self-assigned this on Jul 1, 2026
Comment thread swift/Sources/Tools/llm-runner/LLMRunnerMain.swift
Comment thread swift/Sources/Tools/llm-runner/LLMRunnerMain.swift Outdated
@carinapeng

Contributor

Thank you for taking this on, @stikves!

We seem to have implemented the runner-level design (in runVLMInference, call setPromptTokenCount(vlmTokens.count) and wrap the prefill and generation); see #70.

I proposed an engine-level design here as well. It seems to me it could be more sustainable: if we instrument CoreAISequentialVLMEngine, metrics work for any caller, which is how we do it for the text engines too.

I wonder whether that would be a better, more generic design?
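To make the trade-off concrete, here is a rough sketch of what engine-level instrumentation could look like. Every name in it (`SketchEngine`, `InstrumentedEngine`, `MetricsRecorder`, `EchoEngine`) is hypothetical and invented for illustration; it does not reflect the actual API of CoreAISequentialVLMEngine or the runner. The idea is that the timing lives in a wrapper around the engine, so callers get metrics without adding timing code of their own.

```swift
import Foundation

/// Hypothetical metrics sink; stands in for whatever the real runner records into.
final class MetricsRecorder {
    private(set) var promptTokenCount = 0
    private(set) var prefillSeconds = 0.0
    private(set) var generatedTokenCount = 0
    private(set) var generationSeconds = 0.0

    func recordPrefill(tokens: Int, seconds: Double) {
        promptTokenCount = tokens
        prefillSeconds = seconds
    }

    func recordGeneration(tokens: Int, seconds: Double) {
        generatedTokenCount = tokens
        generationSeconds = seconds
    }
}

/// Hypothetical minimal engine interface (an assumption, not the real engine API).
protocol SketchEngine {
    func prefill(_ tokens: [Int])
    func generate(maxTokens: Int) -> [Int]
}

/// Engine-level instrumentation: the wrapper times each phase itself,
/// so any caller gets metrics without wrapping calls in the runner.
struct InstrumentedEngine<Base: SketchEngine>: SketchEngine {
    let base: Base
    let metrics: MetricsRecorder
    let now: () -> Double  // injectable clock, in seconds

    func prefill(_ tokens: [Int]) {
        let start = now()
        base.prefill(tokens)
        metrics.recordPrefill(tokens: tokens.count, seconds: now() - start)
    }

    func generate(maxTokens: Int) -> [Int] {
        let start = now()
        let output = base.generate(maxTokens: maxTokens)
        metrics.recordGeneration(tokens: output.count, seconds: now() - start)
        return output
    }
}

/// Stub engine so the sketch runs without a model.
struct EchoEngine: SketchEngine {
    func prefill(_ tokens: [Int]) {}
    func generate(maxTokens: Int) -> [Int] { Array(0..<maxTokens) }
}

let recorder = MetricsRecorder()
let engine = InstrumentedEngine(
    base: EchoEngine(),
    metrics: recorder,
    now: { Date().timeIntervalSinceReferenceDate }
)
engine.prefill(Array(repeating: 1, count: 590))
let generated = engine.generate(maxTokens: 9)
print("prompt tokens: \(recorder.promptTokenCount), generated: \(recorder.generatedTokenCount)")
// prints "prompt tokens: 590, generated: 9"
```

In the runner-level alternative, the same `recordPrefill` and `recordGeneration` calls would sit in the calling function instead, which is simpler to add but has to be repeated by each new caller.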

@stikves force-pushed the sukru/vlm-instrumentation branch from 20cfdc5 to 4e11890 on July 1, 2026 23:15
- Report prompt throughput (prefill t/s) and generation throughput
  for VLM inference, matching the LLM path performance summary
- Wire --print-logits for VLM: shows top-5 token probabilities per step
- Wire --save-logits for VLM: saves top-K logits to JSON via LogitsWriter
- Make TokenLogits and TopLogitEntry properties public (cross-module access)

Tested with VLM bundle: prompt 590 tokens at 579 t/s,
generation 20 tokens at 19.4 t/s. Logits JSON output verified.
@stikves force-pushed the sukru/vlm-instrumentation branch from 9d33bfa to 91fecb2 on July 1, 2026 23:20
@stikves merged commit 1303957 into apple:main on Jul 2, 2026
3 checks passed
@stikves deleted the sukru/vlm-instrumentation branch on July 2, 2026 03:15