fix: record answer prompt size for cross-provider context-cost reporting #24
Merged
Runner token accounting is transport-dependent: the claude CLI reports the user prompt inside cache_creation_input_tokens alongside ~26K of its own system overhead, so usage.input_tokens (~10/case) wildly under-reports the fed context, and no clean per-prompt figure is separable.

Record len(answer_prompt) per case (answer_prompt_chars) and the per-provider mean in qa-summary. Chars of assembled prompt are the comparable cross-provider context-size measure (the same answerer sees them all), and they anchor the accuracy-vs-context-cost axis of any published comparison. Runner-reported token fields are kept as-is for cost telemetry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
Summary
The claude CLI transport buries the user prompt in `cache_creation_input_tokens` along with ~26K of its own system overhead, so `usage.input_tokens` (~10/case) wildly under-reports fed context and no clean per-prompt figure is separable.

This records `answer_prompt_chars` per QA case plus `mean_answer_prompt_chars` per provider. That is the comparable cross-provider context-size measure (same answerer sees them all), anchoring the accuracy-vs-context-cost axis any credible comparison needs (full-context baselines will show ~10-50x the context of retrieval systems). Runner-reported token fields stay for cost telemetry.

1 new test; suite green (100), lint clean.
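For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the recording and aggregation described above. It is an illustration under assumptions, not the code in this PR: the helper names `record_prompt_size` and `mean_prompt_chars_by_provider`, the `provider` key, and the dict-based case records are hypothetical; only the field names `answer_prompt_chars` and `mean_answer_prompt_chars` come from this description.

```python
from collections import defaultdict
from statistics import mean


def record_prompt_size(case: dict, answer_prompt: str) -> dict:
    """Return a copy of a per-case record with the prompt size in characters.

    Hypothetical helper: the runner-reported token fields already on the
    record are left untouched, since they are still wanted for cost telemetry.
    """
    out = dict(case)
    out["answer_prompt_chars"] = len(answer_prompt)
    return out


def mean_prompt_chars_by_provider(cases: list[dict]) -> dict:
    """Group case records by provider and average answer_prompt_chars."""
    by_provider = defaultdict(list)
    for case in cases:
        by_provider[case["provider"]].append(case["answer_prompt_chars"])
    return {
        provider: {"mean_answer_prompt_chars": mean(sizes)}
        for provider, sizes in by_provider.items()
    }


# Illustrative data only; the provider names and prompt sizes are made up.
cases = [
    record_prompt_size({"provider": "full-context", "usage": {"input_tokens": 10}}, "x" * 4000),
    record_prompt_size({"provider": "full-context", "usage": {"input_tokens": 10}}, "x" * 6000),
    record_prompt_size({"provider": "retrieval", "usage": {"input_tokens": 10}}, "x" * 200),
]
summary = mean_prompt_chars_by_provider(cases)
```

The point of the sketch is that `usage.input_tokens` is identical across the made-up cases while the character counts differ, which is why the character count is the figure that can be compared across providers.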
🤖 Generated with Claude Code