fix(summarize-insights): drop reader-engagement filler, reframe audience as downstream AI #708
Merged
'If you read nothing else, read this.' was a reader hook for humans, but the primary consumer of insights.md is downstream AI (another agent, a planning loop, a follow-on extractor) — token-density of useful signal matters more than engagement. The structural markers (## Bad news first, ### Likely deal-breakers, verdict labels) already do the work the prefix tried to do; restating it in prose burns tokens.
Removed the prefix from render_bad_news_first. The substantive sentence ('Every item below is a signal the plan does not survive its own assumptions. Items are ordered by severity.') stays because it explains what's in the section.
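A minimal sketch of what the section renderer might look like after the change. The real `render_bad_news_first` signature is not shown in this PR, so the `items` parameter and its `title` and `severity` keys are hypothetical; only the section heading and the retained sentence come from the commit message.

```python
# Hypothetical sketch: the real function signature and item schema are not
# shown in the PR. Only the heading and the retained sentence are from the source.
SECTION_INTRO = (
    "Every item below is a signal the plan does not survive its own "
    "assumptions. Items are ordered by severity."
)


def render_bad_news_first(items):
    """Render the section without any reader-engagement prefix.

    items: assumed list of dicts with 'title' (str) and 'severity' (number,
    higher means worse).
    """
    lines = ["## Bad news first", "", SECTION_INTRO, ""]
    # Most severe items come first, matching "ordered by severity".
    ordered = sorted(items, key=lambda item: item["severity"], reverse=True)
    lines.extend(f"- {item['title']}" for item in ordered)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_bad_news_first([
        {"title": "Funding gate unreachable", "severity": 3},
        {"title": "Schedule slip likely", "severity": 1},
    ]))
```

The structural heading carries the emphasis, so the body starts directly with the sentence that explains the section's contents.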
SKILL.md gains a new 'Audience and tone' section codifying the principle: no reader-engagement prefixes, no filler sentences whose only job is to motivate the next sentence, keep substantive explanations (signal, not filler). Replaced 'reader' with 'downstream consumer' in the no-sycophancy rule to reflect the audience reframe.
Verified: all three reference plans regenerate insights.md cleanly with the prefix gone. Smoke 8/8, unittest 45/45.
ChatGPT's review of v33 raised 15 items; this commit ships the five 'quick win' ones that fit on top of the existing per-run state. All five are runner-side analyses plus matching insights.md sections, with no schema changes and no LLM prompt edits.

§14 binding-gate frequency tracking. For every min() aggregate, the runner records which dependency provided the min value in each run, then aggregates conditional on the aggregate failing its threshold. Faraday demonstration: the weakest_program_gate fails in 9,826 of 10,000 runs; mil_std_cert_funding is the binder in 67% of those, cash_flow_trigger in 32%, inventory_overhang in 0.5%. That tells the reader which sub-gate to fix first; the previous output only knew it failed.

§7 quartile pass-rates. For each threshold × driver, P(threshold passes | driver in bottom quartile) vs P(threshold passes | driver in top quartile). The delta in percentage points is much more actionable than Pearson r: 'P(coverage 99%) goes from 18% in worst-quartile satellite-failure runs to 74% in best-quartile' is a directly usable lever.

§13 required-input thresholds. For each FAILING gate (P < 80%), find the input-bound restriction that would lift conditional pass rate to >= 80%. Empty list means no single-input restriction is enough; Faraday's weakest_program_gate gets an empty list, correctly diagnosing it as structurally unreachable.

§8 missing-value priority. Rank missing_values entries by |delta_pp on worst gate| * (1 - pass_prob) * bound_width_ratio. The highest-scoring entries are the ones most worth replacing with real data instead of an assumed range.

§10 model confidence grades. Per output, grade HIGH/MEDIUM/LOW based on the fraction of upstream input bounds anchored in 'data' vs 'assumption' and the average bound-width-to-base ratio. Cutoffs: data >= 70% AND width < 0.5 -> HIGH; data < 30% OR width > 1.5 -> LOW; else MEDIUM. The reasons array names the specific evidence.
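The §14 binding-gate tracking can be sketched roughly as follows. This is an illustration under stated assumptions, not the runner's actual code: the function name `binding_gate_frequencies`, the per-run dict layout, and the rule that the aggregate fails when its min value is below the threshold are all hypothetical.

```python
from collections import Counter


def binding_gate_frequencies(runs, threshold):
    """Count which dependency is the binder in runs where a min() aggregate fails.

    runs: assumed list of dicts mapping dependency name -> value for one run.
    threshold: assumed pass level; the aggregate fails when min(values) < threshold.
    Returns (failure_count, {dependency: share of failing runs it bound}).
    """
    binders = Counter()
    failures = 0
    for deps in runs:
        # The binder is the dependency that supplied the min value in this run.
        binder = min(deps, key=deps.get)
        if deps[binder] < threshold:
            failures += 1
            binders[binder] += 1
    if failures == 0:
        return 0, {}
    shares = {name: count / failures for name, count in binders.items()}
    return failures, shares


if __name__ == "__main__":
    # Tiny synthetic example, not Faraday data.
    demo_runs = [
        {"gate_a": 0.2, "gate_b": 0.9},
        {"gate_a": 0.8, "gate_b": 0.1},
        {"gate_a": 0.3, "gate_b": 0.7},
        {"gate_a": 0.9, "gate_b": 0.95},
    ]
    print(binding_gate_frequencies(demo_runs, threshold=0.5))
```

Conditioning on failure is the point of the design: a dependency that is often the minimum in passing runs says nothing about which sub-gate to fix first.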
Five new render functions in summarize_insights.py emit these as separate sections after the existing verdict table. Five new unittest.TestCase methods (TestNewAnalysisBlocks) cover each block end-to-end against a small synthetic fixture. Smoke 8/8, unittest 50/50. Reference runs regenerated for Nuuk, Cross-Border Rail, Faraday, India Census.
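The §10 confidence-grade cutoffs quoted in the commit message translate into a small decision function. The sketch below is illustrative only: `InputBound`, `grade_confidence`, and the handling of empty inputs and zero base values are assumptions, while the 70% / 30% data fractions and the 0.5 / 1.5 width ratios come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class InputBound:
    """Hypothetical shape of one upstream input bound."""
    low: float
    high: float
    base: float
    source: str  # "data" or "assumption"


def grade_confidence(bounds):
    """Return (grade, reasons) using the cutoffs from the commit message."""
    if not bounds:
        # Assumption: with no upstream bounds there is no evidence to grade on.
        return "LOW", ["no upstream input bounds"]

    data_fraction = sum(b.source == "data" for b in bounds) / len(bounds)
    # Assumption: a zero base value is treated as an unbounded width ratio.
    widths = [
        (b.high - b.low) / abs(b.base) if b.base != 0 else float("inf")
        for b in bounds
    ]
    avg_width = sum(widths) / len(widths)

    reasons = [
        f"{data_fraction:.0%} of input bounds anchored in data",
        f"average bound-width-to-base ratio {avg_width:.2f}",
    ]
    if data_fraction >= 0.70 and avg_width < 0.5:
        return "HIGH", reasons
    if data_fraction < 0.30 or avg_width > 1.5:
        return "LOW", reasons
    return "MEDIUM", reasons


if __name__ == "__main__":
    measured = InputBound(9.0, 11.0, 10.0, "data")
    guessed = InputBound(9.0, 11.0, 10.0, "assumption")
    print(grade_confidence([measured, measured, measured, guessed]))
```

The HIGH and LOW conditions cannot both hold for the same inputs, so the order of the two checks does not change the result.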
* main: prompts: add Hauts-de-France hyperscale AI datacenter test case
neoneye added a commit that referenced this pull request on May 16, 2026
Two small wording changes from ChatGPT's v38 feedback (the reviewer called the format 'production-candidate; freeze the structure', and these are the only follow-ups).

1) DOOM verdict band 'almost certainly fails' -> 'rarely passes under current bounds'. Avoids the epistemic overclaim of 'certainly' on a model-relative pass rate.

2) Decision implications intro line reworded from 'The actual plan revision is for human or LLM interpretation against the source report' to 'This section identifies the affected planning lever; concrete revisions should be derived by reading the source report and the relevant intermediary artifacts.' Less self-referential.

Status doc renamed 20260516_claude.md -> 20260517_claude.md and rewritten for the current state: PR #708 (merged, five analysis blocks) and PR #710 (in flight, this branch, the v34->v38 insights-format iteration). Schema v3 frozen. Cross-plan validation extended to five distinct domains. Open issues mostly carry over from 0516 plus a new manifest-regression-test gap.
Summary
The user pointed out that the line "If you read nothing else, read this." in insights.md is wasted tokens.

The structural markers (## Bad news first, ### Likely deal-breakers, the verdict labels) already do the work that prefix tried to do. The reader-hook framing came from when we were optimising for a project-manager audience; the actual primary consumer is downstream AI.

What changed

* summarize_insights.py: removed the prefix from render_bad_news_first. The substantive sentence that explains what items are in the section stays: 'Every item below is a signal the plan does not survive its own assumptions. Items are ordered by severity.'
* summarize-insights/SKILL.md: new ## Audience and tone section codifying the principle so it survives future edits. The no-sycophancy rule also had "the reader" replaced with "the downstream consumer" to match the audience reframe.
Test plan
* All three reference plans regenerate insights.md files with the prefix gone
* Smoke 8/8 (tests/run_smoke.py)
* unittest 45/45 (tests/test_run_monte_carlo.py)

🤖 Generated with Claude Code