Add usage.jsonl read + backfill to llm-cost-attribution by riddim-developer-bot[bot] · Pull Request #4 · RiddimSoftware/groove

riddim-developer-bot · 2026-05-27T20:28:13Z

Summary

Implements the consumer side of the Symphony Coding-Agent Cost Telemetry Extension spec (PR #3) inside the llm-cost-attribution package (PR #2). Users can now bake the cost-relevant projection of their transcripts into a much smaller usage.jsonl file and then delete the transcripts.

llm-cost backfill --out ~/llm-cost-history.jsonl   # bake transcripts → spec-compliant usage.jsonl
llm-cost EPAC-1940 --from-usage ~/llm-cost-history.jsonl   # query the bake instead of transcripts
rm -rf ~/.claude/projects ~/.codex/sessions   # safe — cost data is in the bake now

Real-world numbers (this machine, 4,309 sessions)

	Before backfill	After backfill
Disk footprint	5.0 GB	83 MB (60× smaller)
`llm-cost EPAC-1940` query time	~3 min (full Codex scan)	~0.3 s (~600× faster)
EPAC-1940 token total	52,605,306	52,605,306 ✓
EPAC-1940 turn count	341	341 ✓
EPAC-1940 wall clock	1h 53m 40s	1h 53m 40s ✓

Backfill emitted 190,481 spec-compliant records from 4,309 sessions; 1,841 sessions were correctly skipped (ad-hoc CLI work outside any Symphony workspace).
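
As an illustration of how the three EPAC-1940 rows above could be derived from usage records, here is a minimal sketch. The function name `rollupIssue` is hypothetical and is not the package's rollupUsageRecords. It also assumes that wall clock is the sum of per-turn `endedAt - startedAt` durations, which the PR does not state.

```javascript
// Sketch only: rollupIssue is a hypothetical helper, not a package export.
// Assumption: wall clock is the sum of each turn's (endedAt - startedAt).
function rollupIssue(records, issueIdentifier) {
  const matching = records.filter((r) => r.issueIdentifier === issueIdentifier);
  let totalTokens = 0;
  let wallClockMs = 0;
  for (const r of matching) {
    totalTokens += r.totalTokens;
    wallClockMs += Date.parse(r.endedAt) - Date.parse(r.startedAt);
  }
  // One record per turn, so the turn count is the number of matching records.
  return { totalTokens, turns: matching.length, wallClockMs };
}

// Made-up sample records, for illustration only.
const sampleRecords = [
  { issueIdentifier: 'EPAC-1940', totalTokens: 1500,
    startedAt: '2026-05-27T20:00:00Z', endedAt: '2026-05-27T20:01:00Z' },
  { issueIdentifier: 'EPAC-1940', totalTokens: 500,
    startedAt: '2026-05-27T20:05:00Z', endedAt: '2026-05-27T20:05:30Z' },
  { issueIdentifier: 'OTHER-1', totalTokens: 999,
    startedAt: '2026-05-27T20:00:00Z', endedAt: '2026-05-27T20:10:00Z' },
];

console.log(rollupIssue(sampleRecords, 'EPAC-1940'));
```

Because the rollup only needs the fields kept in each record, the same totals can be reproduced from the bake without the transcripts.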

New library API

import {
  computeIssueCostFromUsage,
  backfillUsageFromTranscripts,
  readUsageRecords,
  appendUsageRecords,
  validateUsageRecord,
  sessionToUsageRecords,
  rollupUsageRecords,
  SCHEMA_VERSION,
} from 'llm-cost-attribution';

New CLI surface

llm-cost backfill --out <path>            # transcripts → spec-compliant usage.jsonl
llm-cost <ISSUE> --from-usage <path>      # read from usage.jsonl/dir instead of transcripts

--from-usage accepts either a single .jsonl file or a directory of usage*.jsonl files (per spec §4.1's "writers MAY split, readers MUST concatenate" rule).
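
The directory case can be pictured roughly as follows. This is a sketch under assumptions, not the package's reader: `selectUsageFiles`, `parseJsonl`, and `concatUsage` are hypothetical names, file contents are passed in as strings to keep the example free of filesystem calls, and sorting by name is only a choice made here for deterministic output.

```javascript
// Sketch only: these helpers are hypothetical, not exports of llm-cost-attribution.

// Keep names that match usage*.jsonl; sort so concatenation order is stable.
function selectUsageFiles(fileNames) {
  return fileNames
    .filter((name) => /^usage.*\.jsonl$/.test(name))
    .sort();
}

// Parse one JSONL document: one JSON object per non-empty line.
function parseJsonl(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Concatenate the records of every selected file into a single array.
// filesByName maps a file name to that file's text content.
function concatUsage(filesByName) {
  return selectUsageFiles(Object.keys(filesByName))
    .flatMap((name) => parseJsonl(filesByName[name]));
}
```

A single file passed directly, such as ~/llm-cost-history.jsonl in the Summary, would skip the name filter and go straight to parsing.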

Fidelity tradeoffs (called out in README)

The spec deliberately drops three things from the raw transcripts. After backfill you lose:

Claude cache-tier split (5m vs 1h cache creation tokens) — collapsed into the input total
Codex reasoning-vs-visible output split — collapsed into the output total
Codex rate_limits.{primary,secondary}.used_percent quota samples — not in the spec schema

Grand totals, per-turn ordinals, models, timestamps, runIDs, and workspacePath provenance are preserved exactly.
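
To make the first two losses concrete, the sketch below collapses hypothetical per-turn token breakdowns into single totals. The input field names (`uncachedInput`, `cacheCreation5m`, `cacheCreation1h`, `visibleOutput`, `reasoningOutput`) are invented for illustration and do not mirror the real transcript formats, and which counters the package actually folds into each total is an assumption.

```javascript
// Sketch only: hypothetical input shapes, not the real transcript schemas.

// Fold both cache-creation tiers into one input total.
function collapseInput({ uncachedInput, cacheCreation5m, cacheCreation1h }) {
  return uncachedInput + cacheCreation5m + cacheCreation1h;
}

// Fold reasoning and visible output into one output total.
function collapseOutput({ visibleOutput, reasoningOutput }) {
  return visibleOutput + reasoningOutput;
}

// Two different splits produce the same totals, so the split
// cannot be recovered from the collapsed record.
const turnA = { uncachedInput: 100, cacheCreation5m: 400, cacheCreation1h: 0 };
const turnB = { uncachedInput: 100, cacheCreation5m: 0, cacheCreation1h: 400 };
console.log(collapseInput(turnA) === collapseInput(turnB)); // true
```

This is why the tradeoff only matters to users who need tier-level or reasoning-level breakdowns; totals are unaffected.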

Spec conformance (§5.1 Required fields)

Every backfilled record carries: schemaVersion, recordedAt, runID (UUID; the CLI session ID), turn (1-based monotonic), issueIdentifier, provider, model, botRole (always developer — spec §5.1 says "Implementations that do not distinguish a reviewer role MUST emit developer"), inputTokens, outputTokens, totalTokens, usageSource: "provider_reported", startedAt, endedAt. Plus the optional workspacePath since we already have it.
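
A presence check over the required fields listed above might look like the sketch below. It is not the package's validateUsageRecord: `checkRecord` is a hypothetical name, the numeric checks are assumptions about what the spec requires, and the example values (including the `schemaVersion`, `provider`, and `model` strings) are placeholders.

```javascript
// Field names come from the list above; the checks themselves are assumptions.
const REQUIRED_FIELDS = [
  'schemaVersion', 'recordedAt', 'runID', 'turn', 'issueIdentifier',
  'provider', 'model', 'botRole', 'inputTokens', 'outputTokens',
  'totalTokens', 'usageSource', 'startedAt', 'endedAt',
];

// Returns a list of human-readable problems; an empty list means the record passed.
function checkRecord(record) {
  const errors = REQUIRED_FIELDS
    .filter((field) => record[field] === undefined || record[field] === null)
    .map((field) => `missing ${field}`);

  // Turn is described as 1-based, so reject zero, negatives, and non-integers.
  if (record.turn != null && !(Number.isInteger(record.turn) && record.turn >= 1)) {
    errors.push('turn must be an integer >= 1');
  }
  for (const field of ['inputTokens', 'outputTokens', 'totalTokens']) {
    const value = record[field];
    if (value != null && !(Number.isInteger(value) && value >= 0)) {
      errors.push(`${field} must be a non-negative integer`);
    }
  }
  return errors;
}

// Placeholder record for illustration; none of these values come from real data.
const exampleRecord = {
  schemaVersion: '1',
  recordedAt: '2026-05-27T20:28:13Z',
  runID: '00000000-0000-4000-8000-000000000000',
  turn: 1,
  issueIdentifier: 'EPAC-1940',
  provider: 'example-provider',
  model: 'example-model',
  botRole: 'developer',
  inputTokens: 1200,
  outputTokens: 300,
  totalTokens: 1500,
  usageSource: 'provider_reported',
  startedAt: '2026-05-27T20:00:00Z',
  endedAt: '2026-05-27T20:01:00Z',
};

console.log(checkRecord(exampleRecord)); // []
```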

Test plan

All 27 package tests pass (node --test packages/llm-cost-attribution/test/*.test.mjs): 11 existing + 16 new across usage-jsonl.test.mjs and transcript-to-usage.test.mjs
Every backfilled record produced from real transcripts passes validateUsageRecord
llm-cost EPAC-1940 and llm-cost EPAC-1940 --from-usage <backfill> produce identical token totals, turn counts, and wall-clock spans
node --check clean on every .mjs
CI workflow updated to run the new test files

Implements the consumer side of the Symphony Coding-Agent Cost Telemetry Extension spec (groove/specs/symphony-cost-telemetry-extension) so users can:

1. Read cost data from a spec-compliant usage.jsonl source instead of the raw CLI transcripts.
2. Backfill a usage.jsonl from existing transcripts, after which the transcripts can be safely deleted.

New library exports:
- computeIssueCostFromUsage(issueId, pathOrDir)
- backfillUsageFromTranscripts({ outFile, onProgress, ... })
- rollupUsageRecords / sessionToUsageRecords
- readUsageRecords / appendUsageRecords / validateUsageRecord
- findUsageFiles / SCHEMA_VERSION

New CLI surface:
llm-cost backfill --out <path>
llm-cost <ISSUE> --from-usage <path-or-dir>

End-to-end verified on real data (4,309 sessions / 5 GB transcripts):
- Backfill: produces 190,481 spec-compliant records in an 83 MB file (60x compression vs the source transcripts).
- Read-back: `llm-cost EPAC-1940 --from-usage <backfilled-file>` returns identical 52,605,306 tokens / 341 turns / 1h53m wall clock to the transcript-source path, in 0.3 seconds (vs ~3 minutes for the transcript scan).

Fidelity tradeoffs (documented in README):
- Cache-tier split (Claude 5m vs 1h cache creation) is collapsed.
- Reasoning-vs-visible output split (Codex) is collapsed.
- Per-window quota samples (Codex rate_limits) are not in the spec schema, so they're lost on backfill.

Grand totals, turn ordinals, models, timestamps, runIDs, and workspacePath provenance are all preserved exactly.

New tests (16 of them, in two new files):
- test/usage-jsonl.test.mjs: validate + read + write
- test/transcript-to-usage.test.mjs: session → usage record mapping

All 27 package tests pass. CI workflow updated to run the new files.

github-actions Bot enabled auto-merge (squash) May 27, 2026 20:28

github-actions Bot merged commit c049aaf into main May 27, 2026
2 checks passed

github-actions[bot] merged 1 commit into main from sunny/llm-cost-usage-jsonl
