feat(napkin-math): restructure insights.md as a thin interpretation layer #710
Merged
…ayer

insights.md no longer reproduces the raw simulation tables (per-output distributions, full per-driver quartile blocks, full sensitivity tables, full required-input-threshold blocks). It now declares what it is, points at the intermediary artifacts, and surfaces only the decision-relevant signals: a JSON manifest, a provenance map, gate verdicts (with an aggregation warning when units are incompatible and no min() aggregate exists), one-row failure drivers per failing gate, missing-input priority, confidence and trust boundaries, a short scenario sanity check, and five suggested next actions.

Stable retrievable section names (## Artifact contract, ## Machine summary, ## Provenance map, ## Critical findings, ## Gate verdicts, ## Failure drivers, ## Suggested next actions) let programmatic consumers target sections by heading; a retrieval sketch follows below. The manifest is JSON, not YAML. No 'AI' framing in headings or output prose: the file describes what it is (an interpretation layer), not who reads it.

Reference outputs across six plans land at 140-180 lines (was 180-300+). Smoke (8/8) and unit tests (50/50) pass.
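The stable headings are what make the file programmatically consumable. A minimal sketch of how a downstream tool could target one section by heading; the function name and file path are illustrative, not part of this PR:

```python
import re
from pathlib import Path

def read_section(path: str, heading: str) -> str:
    """Return the body of one '## '-level section, e.g. 'Machine summary'.

    Relies only on the stable section names this commit introduces.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Everything between '## <heading>' and the next '## ' heading (or EOF).
    pattern = rf"^## {re.escape(heading)}\n(.*?)(?=^## |\Z)"
    match = re.search(pattern, text, flags=re.M | re.S)
    if match is None:
        raise KeyError(f"section not found: {heading}")
    return match.group(1).strip()

# e.g. json.loads() on the fenced block inside
# read_section("insights.md", "Machine summary")
```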
…tifact_set, basis column
ChatGPT v35 feedback: add the bridge from gate result to planning consequence, version the artifact format, make the source identifier portable, and rename the misleading 'data' source label.
Machine summary changes: insights_schema_version=1; primary_model_result is now an object (label/reason/worst_gate/worst_gate_pass_rate) rather than a bare label; artifact_set object (version/plan_slug/relative_dir) parsed from path; source_plan_dir kept for local use but documented as non-primary.
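A sketch of the affected manifest fragment under schema v1; the field names are from this commit, all values are hypothetical:

```json
{
  "insights_schema_version": 1,
  "primary_model_result": {
    "label": "fragile",
    "reason": "worst declared gate passes in a minority of simulated runs",
    "worst_gate": "runway_months_gate",
    "worst_gate_pass_rate": 0.18
  },
  "artifact_set": {
    "version": 35,
    "plan_slug": "example_plan",
    "relative_dir": "output/example_plan/v35"
  },
  "source_plan_dir": "/home/user/project/output/example_plan/v35"
}
```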
New section ## Decision implications: one row per gate in DOOM/FRAGILE/MARGINAL, with verdict-keyed structural consequence and revision direction derived from quartile_analysis top driver. Templates are generic — plan-specific tactical revisions still require human/LLM interpretation. New section ## Open questions for next analysis pass: five standing audit questions.
Renames: Missing-inputs 'Source' column -> 'Basis' with translated values ('data' -> 'report_derived', 'assumption' -> 'model_assumption') and a footnote disclaiming empirical ground truth. Scenario sanity check columns: 'Low/Middle/High' -> 'Low inputs/Base inputs/High inputs' to match scenarios.json keys.
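The Basis translation is a fixed label mapping; a minimal sketch (names illustrative):

```python
# Extractor-internal source labels -> reader-facing 'Basis' values.
# Neither label claims empirical ground truth; the rendered table's
# footnote states that explicitly.
BASIS_LABELS = {
    "data": "report_derived",          # bound taken from the source report
    "assumption": "model_assumption",  # bound supplied by the model
}

def basis_label(source: str) -> str:
    # Pass unknown labels through unchanged rather than guessing.
    return BASIS_LABELS.get(source, source)
```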
Reference plans regenerated, all in 170-220 line band. Smoke 8/8, unit 50/50 green.
…shold basis, plan-specific hints, schema_notes)

ChatGPT v36 feedback round: four changes to freeze the format.

1. primary_model_result renamed: 'label' -> 'overall_risk_band'; added a 'basis' field that explicitly says this is the worst declared gate's pass-rate band and NOT a calibrated whole-plan probability. Reduces the risk that a downstream consumer reads 'doom' as 'whole plan is impossible'.
2. Gate verdicts gains a 'Threshold basis' column (report_explicit / report_inferred / model_defined / unknown) derived from the corresponding key_value's value_type. Surfaces the difference between thresholds the plan states directly and those the extractor inferred.
3. Decision implications gains a fifth column, 'Plan-specific revision hint', that lifts the gate's own rationale from parameters.recommended_first_calculations[].why_first (or derived_questions[].why_it_matters) and names the threshold parameter the formula tests against. It still does not invent tactical advice; it surfaces the plan's own framing.
4. Added a schema_notes block to the JSON manifest listing allowed enums for overall_risk_band, verdict, basis, and threshold_basis, plus a primary_model_result_semantics disclaimer; see the sketch below.

Bumped insights_schema_version to 2. Reference plans regenerated. Length range 201-244 lines across 11 plans (was 173-216). Smoke 8/8, unit 50/50.
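A sketch of the schema_notes block as of v2. The basis and threshold_basis members are listed in this commit; the risk-band and verdict members are inferred from the doom/fragile/marginal/viable bands used elsewhere in this PR, so treat them as illustrative:

```json
{
  "insights_schema_version": 2,
  "schema_notes": {
    "overall_risk_band_enum": ["doom", "fragile", "marginal", "viable"],
    "verdict_enum": ["DOOM", "FRAGILE", "MARGINAL", "VIABLE"],
    "basis_enum": ["report_derived", "model_assumption"],
    "threshold_basis_enum": ["report_explicit", "report_inferred", "model_defined", "unknown"],
    "primary_model_result_semantics": "pass-rate band of the worst declared gate; not a calibrated whole-plan probability"
  }
}
```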
…cal findings, declared-source label, wider basis enum)

ChatGPT v37 feedback round: four small refinements after the structure was declared ready to freeze.

1. Decision implications column 'Plan-specific revision hint' renamed to 'Gate meaning'. The column was always gate rationale, not revision direction; the new name matches what it carries and stops over-promising tactical advice the script cannot deterministically generate.
2. Critical findings DOOM/FRAGILE bullets switched from rhetorical to dry: 'The plan depends on this holding; the math says it almost never does' becomes 'fails X% of simulated runs under the current input bounds'. Stable language, no commentary.
3. Per-output confidence column 'Data-sourced inputs' renamed to 'Declared-source inputs', with a footnote stating that neither declared-source nor model-assumption bounds are empirical real-world data. Mirrors the Basis-column discipline applied earlier to Missing inputs ranked by impact.
4. schema_notes.basis_enum expanded from ['report_derived', 'model_assumption'] to include 'report_explicit', 'report_inferred', 'external_reference', 'manual_override', and 'unknown' for future provenance types. The current pipeline still only emits the original two (see below).

insights_schema_version bumped to 3. Reference plans regenerated, 206-249 lines across 13 plans. Smoke 8/8, unit 50/50.
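The widened v3 enum in full, shown as the schema_notes fragment it lives in; only the first two values are emitted by the current pipeline:

```json
{
  "basis_enum": [
    "report_derived",
    "model_assumption",
    "report_explicit",
    "report_inferred",
    "external_reference",
    "manual_override",
    "unknown"
  ]
}
```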
Two small wording changes from ChatGPT's v38 feedback (the reviewer called the format 'production-candidate; freeze the structure', and these are the only follow-ups).

1. DOOM verdict band 'almost certainly fails' -> 'rarely passes under current bounds'. Avoids the epistemic overclaim of 'certainly' on a model-relative pass rate.
2. Decision implications intro line reworded from 'The actual plan revision is for human or LLM interpretation against the source report' to 'This section identifies the affected planning lever; concrete revisions should be derived by reading the source report and the relevant intermediary artifacts.' Less self-referential.

Status doc renamed 20260516_claude.md -> 20260517_claude.md and rewritten for the current state: PR #708 (merged, five analysis blocks) and PR #710 (in flight, this branch, the v34->v38 insights-format iteration). Schema v3 frozen. Cross-plan validation extended to five distinct domains. Open issues mostly carry over from 0516, plus a new manifest-regression-test gap.
ChatGPT's v39 review: 'insights.md is too vague and too human-marketing-ish; the file is a structured interpretation layer, and the name should match.' Top recommendation was assessment.md.

- File rename: insights.md -> assessment.md.
- H1 heading '# Insights:' -> '# Assessment:'.
- Skill summarize-insights -> summarize-assessment.
- Script summarize_insights.py -> summarize_assessment.py.
- Build function build_insights -> build_assessment.
- Manifest field insights_schema_version -> assessment_schema_version.

The artifact_type value 'interpretation_layer' is unchanged; ChatGPT specifically called it out as the right characterisation. Schema bumped v3 -> v4 because the manifest's field name changed; other manifest content (artifact_set, primary_model_result, schema_notes, etc.) is unchanged. SKILL.md and the smoke test updated to match. Reference plans regenerated as assessment.md (length range 206-249 lines across 13 plans); stale insights.md files removed from output/ dirs (gitignored anyway). Status doc (20260517_claude.md) updated: the v34 -> v39 iteration table now closes with a v38 -> v39 rename row; the frozen-state section and JSON snippet reflect schema v4. Smoke 8/8, unit 50/50 green.
ChatGPT v39-rename feedback: 'now that the file is called assessment.md, you could make the relationship explicit'. 'This file is a derived interpretation layer...' -> 'This assessment is a derived interpretation layer...'. One-word swap; no schema change. Smoke 8/8.
neoneye added a commit that referenced this pull request on May 17, 2026
…room run, and the 'fix prompts, not outputs' lesson

Two new sections at the end of the doc:

- '## Later on 0517 — PR #710 merged, PR #711 in flight, casino_royale run'. Covers the deterministic Python validator (replacing the LLM skill that was unrunnable on digest-extractor output), the optional unmodelled_gates field on parameters.json, the schema v4 -> v5 -> v6 bumps (known_unmodelled_existential_gates + assessment_scope_warning flat fields, a SCOPE WARNING bullet in Critical findings, source-anchor prefixing), and the casino_royale v39-v43 iteration history that drove the changes; a sketch of the new field follows below.
- '## Process insight — fix prompts, not outputs'. Documents the v40-v43 sequence as a concrete violation-and-recovery of the feedback_fix_prompts_not_outputs.md memory rule I had saved and ignored: I hand-patched parameters.json across v40/v41/v42 instead of re-running the extractor against the updated prompt, which hid three real bugs (validator regex on _suffix tokens, LLM emission of _shortfall-named outputs, LLM emission of dead-end variables) until the clean-room v43 run surfaced them. Adds five process rules for future runs.

Stale schema-version mentions updated to v6: the cross-plan reference set header, the Cross-plan generalisation row, the regression-test outstanding issue, the schema-v6 basis_enum entry, and the 'manifest regression test' next step. Historical references to v3/v4 inside the iteration history table are left as-is.
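A sketch of the two schema additions named above. The field names (unmodelled_gates, known_unmodelled_existential_gates, assessment_scope_warning) are from the commit; the entry shape and all wording are hypothetical. First, the optional field on parameters.json:

```json
{
  "unmodelled_gates": [
    {
      "gate": "regulatory_approval",
      "why_unmodelled": "binary legal outcome with no defensible probability bounds"
    }
  ]
}
```

which the assessment manifest then surfaces as flat fields (schema v6):

```json
{
  "assessment_schema_version": 6,
  "known_unmodelled_existential_gates": ["regulatory_approval"],
  "assessment_scope_warning": "existential gates exist that the simulation does not model"
}
```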
Summary
Restructures `insights.md` per the ChatGPT feedback on v33: the file becomes a thin interpretation layer over the intermediary artifacts (`parameters.json`, `bounds.json`, `calculations.py`, `scenarios.json`, `montecarlo_settings.json`, `montecarlo.json`, `validation.json`) rather than a redundant copy of the raw simulation tables.

- New `## Machine summary` JSON manifest with `artifact_type`, `plan_name`, `source_plan_dir`, `primary_model_result` (doom/fragile/marginal/viable), `validation_status`, `simulation`, `primary_failed_gates`, `primary_uncertainty_drivers`, `do_not_treat_as`; a sketch follows after this list.
- Stable section names: `## Artifact contract`, `## Machine summary`, `## Provenance map`, `## Modelling frame`, `## Simulation settings`, `## Critical findings`, `## Gate verdicts`, `## Failure drivers`, `## Missing inputs ranked by impact`, `## Confidence and trust boundaries`, `## Scenario sanity check`, `## Suggested next actions`.
- Aggregation warning when gate units are incompatible and there is no `min()` aggregate, telling the consumer not to average gates across dimensions.
- `SKILL.md` is rewritten the same way.
- Tables dropped from the file but still available in `montecarlo.json`: the per-output distribution table, the full per-driver quartile tables, the full sensitivity tables, the full required-input-threshold blocks. Failure drivers now condenses to one row per failing gate.
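A hedged sketch of that manifest as described at this point (before the v35 change, so `primary_model_result` is still a bare label); the field names are from the list above, all values are hypothetical:

```json
{
  "artifact_type": "interpretation_layer",
  "plan_name": "example_plan",
  "source_plan_dir": "output/example_plan/v34",
  "primary_model_result": "fragile",
  "validation_status": "passed",
  "simulation": {"runs": 10000},
  "primary_failed_gates": ["runway_months_gate"],
  "primary_uncertainty_drivers": ["monthly_burn_eur"],
  "do_not_treat_as": "a calibrated probability of plan success"
}
```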
Test plan

- Smoke (`tests/run_smoke.py`): 8/8; updated the assertions from "Verdict table" to the new stable section names (`## Gate verdicts`, `## Machine summary`, `## Artifact contract`, `## Provenance map`).
- Unit (`tests/test_run_monte_carlo.py`): 50/50, 6 subtests pass.
- Spot-checked the `min()` aggregate output (`weakest_financial_gate_surplus_eur`) in the regenerated reference plans.

🤖 Generated with Claude Code