feat(napkin-math): restructure insights.md as a thin interpretation layer #710
Merged
…ayer

insights.md no longer reproduces the raw simulation tables (per-output distributions, full per-driver quartile blocks, full sensitivity tables, full required-input-threshold blocks). It now declares what it is, points at the intermediary artifacts, and surfaces only the decision-relevant signals: a JSON manifest, a provenance map, gate verdicts (with an aggregation warning when units are incompatible and no min() aggregate exists), one-row failure drivers per failing gate, missing-input priority, confidence and trust boundaries, a short scenario sanity check, and five suggested next actions.

Stable retrievable section names (## Artifact contract, ## Machine summary, ## Provenance map, ## Critical findings, ## Gate verdicts, ## Failure drivers, ## Suggested next actions) let programmatic consumers target sections by heading; a retrieval sketch follows below. The manifest is JSON, not YAML. No 'AI' framing in headings or output prose: the file describes what it is (an interpretation layer), not who reads it.

Reference outputs across six plans land at 140-180 lines (was 180-300+). Smoke (8/8) and unit tests (50/50) pass.
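The stable headings are what make the file programmatically consumable. A minimal sketch of how a downstream tool could target one section by heading; the function name and file path are illustrative, not part of this PR:

```python
import re
from pathlib import Path

def read_section(path: str, heading: str) -> str:
    """Return the body of one '## '-level section, e.g. 'Machine summary'.

    Relies only on the stable section names this commit introduces.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Everything between '## <heading>' and the next '## ' heading (or EOF).
    pattern = rf"^## {re.escape(heading)}\n(.*?)(?=^## |\Z)"
    match = re.search(pattern, text, flags=re.M | re.S)
    if match is None:
        raise KeyError(f"section not found: {heading}")
    return match.group(1).strip()

# e.g. json.loads() on the fenced block inside
# read_section("insights.md", "Machine summary")
```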
…tifact_set, basis column
ChatGPT v35 feedback: add the bridge from gate result to planning consequence, version the artifact format, make the source identifier portable, and rename the misleading 'data' source label.
Machine summary changes: insights_schema_version=1; primary_model_result is now an object (label/reason/worst_gate/worst_gate_pass_rate) rather than a bare label; artifact_set object (version/plan_slug/relative_dir) parsed from path; source_plan_dir kept for local use but documented as non-primary.
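A sketch of the affected manifest fragment under schema v1; the field names are from this commit, all values are hypothetical:

```json
{
  "insights_schema_version": 1,
  "primary_model_result": {
    "label": "fragile",
    "reason": "worst declared gate passes in a minority of simulated runs",
    "worst_gate": "runway_months_gate",
    "worst_gate_pass_rate": 0.18
  },
  "artifact_set": {
    "version": 35,
    "plan_slug": "example_plan",
    "relative_dir": "output/example_plan/v35"
  },
  "source_plan_dir": "/home/user/project/output/example_plan/v35"
}
```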
New section ## Decision implications: one row per gate in DOOM/FRAGILE/MARGINAL, with verdict-keyed structural consequence and revision direction derived from quartile_analysis top driver. Templates are generic — plan-specific tactical revisions still require human/LLM interpretation. New section ## Open questions for next analysis pass: five standing audit questions.
Renames: Missing-inputs 'Source' column -> 'Basis' with translated values ('data' -> 'report_derived', 'assumption' -> 'model_assumption') and a footnote disclaiming empirical ground truth. Scenario sanity check columns: 'Low/Middle/High' -> 'Low inputs/Base inputs/High inputs' to match scenarios.json keys.
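The Basis translation is a fixed label mapping; a minimal sketch (names illustrative):

```python
# Extractor-internal source labels -> reader-facing 'Basis' values.
# Neither label claims empirical ground truth; the rendered table's
# footnote states that explicitly.
BASIS_LABELS = {
    "data": "report_derived",          # bound taken from the source report
    "assumption": "model_assumption",  # bound supplied by the model
}

def basis_label(source: str) -> str:
    # Pass unknown labels through unchanged rather than guessing.
    return BASIS_LABELS.get(source, source)
```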
Reference plans regenerated, all in 170-220 line band. Smoke 8/8, unit 50/50 green.
…shold basis, plan-specific hints, schema_notes)

ChatGPT v36 feedback round: four changes to freeze the format.

1. primary_model_result renamed: 'label' -> 'overall_risk_band'; added a 'basis' field that explicitly says this is the worst declared gate's pass-rate band and NOT a calibrated whole-plan probability. Reduces the risk that a downstream consumer reads 'doom' as 'whole plan is impossible'.
2. Gate verdicts gains a 'Threshold basis' column (report_explicit / report_inferred / model_defined / unknown) derived from the corresponding key_value's value_type. Surfaces the difference between thresholds the plan states directly and those the extractor inferred.
3. Decision implications gains a fifth column, 'Plan-specific revision hint', that lifts the gate's own rationale from parameters.recommended_first_calculations[].why_first (or derived_questions[].why_it_matters) and names the threshold parameter the formula tests against. It still does not invent tactical advice; it surfaces the plan's own framing.
4. Added a schema_notes block to the JSON manifest listing allowed enums for overall_risk_band, verdict, basis, and threshold_basis, plus a primary_model_result_semantics disclaimer; see the sketch below.

Bumped insights_schema_version to 2. Reference plans regenerated. Length range 201-244 lines across 11 plans (was 173-216). Smoke 8/8, unit 50/50.
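A sketch of the schema_notes block as of v2. The basis and threshold_basis members are listed in this commit; the risk-band and verdict members are inferred from the doom/fragile/marginal/viable bands used elsewhere in this PR, so treat them as illustrative:

```json
{
  "insights_schema_version": 2,
  "schema_notes": {
    "overall_risk_band_enum": ["doom", "fragile", "marginal", "viable"],
    "verdict_enum": ["DOOM", "FRAGILE", "MARGINAL", "VIABLE"],
    "basis_enum": ["report_derived", "model_assumption"],
    "threshold_basis_enum": ["report_explicit", "report_inferred", "model_defined", "unknown"],
    "primary_model_result_semantics": "pass-rate band of the worst declared gate; not a calibrated whole-plan probability"
  }
}
```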
…cal findings, declared-source label, wider basis enum)

ChatGPT v37 feedback round: four small refinements after the structure was declared ready to freeze.

1. Decision implications column 'Plan-specific revision hint' renamed to 'Gate meaning'. The column was always gate rationale, not revision direction; the new name matches what it carries and stops over-promising tactical advice the script cannot deterministically generate.
2. Critical findings DOOM/FRAGILE bullets switched from rhetorical to dry: 'The plan depends on this holding; the math says it almost never does' becomes 'fails X% of simulated runs under the current input bounds'. Stable language, no commentary.
3. Per-output confidence column 'Data-sourced inputs' renamed to 'Declared-source inputs', with a footnote stating that neither declared-source nor model-assumption bounds are empirical real-world data. Mirrors the Basis-column discipline applied earlier to Missing inputs ranked by impact.
4. schema_notes.basis_enum expanded from ['report_derived', 'model_assumption'] to include 'report_explicit', 'report_inferred', 'external_reference', 'manual_override', and 'unknown' for future provenance types. The current pipeline still only emits the original two (see below).

insights_schema_version bumped to 3. Reference plans regenerated, 206-249 lines across 13 plans. Smoke 8/8, unit 50/50.
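The widened v3 enum in full, shown as the schema_notes fragment it lives in; only the first two values are emitted by the current pipeline:

```json
{
  "basis_enum": [
    "report_derived",
    "model_assumption",
    "report_explicit",
    "report_inferred",
    "external_reference",
    "manual_override",
    "unknown"
  ]
}
```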
Two small wording changes from ChatGPT's v38 feedback (the reviewer called the format 'production-candidate; freeze the structure', and these are the only follow-ups).

1. DOOM verdict band 'almost certainly fails' -> 'rarely passes under current bounds'. Avoids the epistemic overclaim of 'certainly' on a model-relative pass rate.
2. Decision implications intro line reworded from 'The actual plan revision is for human or LLM interpretation against the source report' to 'This section identifies the affected planning lever; concrete revisions should be derived by reading the source report and the relevant intermediary artifacts.' Less self-referential.

Status doc renamed 20260516_claude.md -> 20260517_claude.md and rewritten for the current state: PR #708 (merged, five analysis blocks) and PR #710 (in flight, this branch, the v34->v38 insights-format iteration). Schema v3 frozen. Cross-plan validation extended to five distinct domains. Open issues mostly carry over from 0516, plus a new manifest-regression-test gap.
ChatGPT's v39 review: 'insights.md is too vague and too human-marketing-ish; the file is a structured interpretation layer, and the name should match.' Top recommendation was assessment.md.

- File rename: insights.md -> assessment.md.
- H1 heading '# Insights:' -> '# Assessment:'.
- Skill summarize-insights -> summarize-assessment.
- Script summarize_insights.py -> summarize_assessment.py.
- Build function build_insights -> build_assessment.
- Manifest field insights_schema_version -> assessment_schema_version.

The artifact_type value 'interpretation_layer' is unchanged; ChatGPT specifically called it out as the right characterisation. Schema bumped v3 -> v4 because the manifest's field name changed; other manifest content (artifact_set, primary_model_result, schema_notes, etc.) is unchanged. SKILL.md and the smoke test updated to match. Reference plans regenerated as assessment.md (length range 206-249 lines across 13 plans); stale insights.md files removed from output/ dirs (gitignored anyway). Status doc (20260517_claude.md) updated: the v34 -> v39 iteration table now closes with a v38 -> v39 rename row; the frozen-state section and JSON snippet reflect schema v4. Smoke 8/8, unit 50/50 green.
ChatGPT v39-rename feedback: 'now that the file is called assessment.md, you could make the relationship explicit'. 'This file is a derived interpretation layer...' -> 'This assessment is a derived interpretation layer...'. One-word swap; no schema change. Smoke 8/8.
neoneye added a commit that referenced this pull request on May 17, 2026
…room run, and the 'fix prompts, not outputs' lesson

Two new sections at the end of the doc:

- '## Later on 0517 — PR #710 merged, PR #711 in flight, casino_royale run'. Covers the deterministic Python validator (replacing the LLM skill that was unrunnable on digest-extractor output), the optional unmodelled_gates field on parameters.json, the schema v4 -> v5 -> v6 bumps (known_unmodelled_existential_gates + assessment_scope_warning flat fields, a SCOPE WARNING bullet in Critical findings, source-anchor prefixing), and the casino_royale v39-v43 iteration history that drove the changes; a sketch of the new field follows below.
- '## Process insight — fix prompts, not outputs'. Documents the v40-v43 sequence as a concrete violation-and-recovery of the feedback_fix_prompts_not_outputs.md memory rule I had saved and ignored: I hand-patched parameters.json across v40/v41/v42 instead of re-running the extractor against the updated prompt, which hid three real bugs (validator regex on _suffix tokens, LLM emission of _shortfall-named outputs, LLM emission of dead-end variables) until the clean-room v43 run surfaced them. Adds five process rules for future runs.

Stale schema-version mentions updated to v6: the cross-plan reference set header, the Cross-plan generalisation row, the regression-test outstanding issue, the schema-v6 basis_enum entry, and the 'manifest regression test' next step. Historical references to v3/v4 inside the iteration history table are left as-is.
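A sketch of the two schema additions named above. The field names (unmodelled_gates, known_unmodelled_existential_gates, assessment_scope_warning) are from the commit; the entry shape and all wording are hypothetical. First, the optional field on parameters.json:

```json
{
  "unmodelled_gates": [
    {
      "gate": "regulatory_approval",
      "why_unmodelled": "binary legal outcome with no defensible probability bounds"
    }
  ]
}
```

which the assessment manifest then surfaces as flat fields (schema v6):

```json
{
  "assessment_schema_version": 6,
  "known_unmodelled_existential_gates": ["regulatory_approval"],
  "assessment_scope_warning": "existential gates exist that the simulation does not model"
}
```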
Summary
Restructures `insights.md` per the ChatGPT feedback on v33: the file becomes a thin interpretation layer over the intermediary artifacts (`parameters.json`, `bounds.json`, `calculations.py`, `scenarios.json`, `montecarlo_settings.json`, `montecarlo.json`, `validation.json`) rather than a redundant copy of the raw simulation tables.

- New `## Machine summary` JSON manifest with `artifact_type`, `plan_name`, `source_plan_dir`, `primary_model_result` (doom/fragile/marginal/viable), `validation_status`, `simulation`, `primary_failed_gates`, `primary_uncertainty_drivers`, `do_not_treat_as`; a sketch follows after this list.
- Stable section names: `## Artifact contract`, `## Machine summary`, `## Provenance map`, `## Modelling frame`, `## Simulation settings`, `## Critical findings`, `## Gate verdicts`, `## Failure drivers`, `## Missing inputs ranked by impact`, `## Confidence and trust boundaries`, `## Scenario sanity check`, `## Suggested next actions`.
- Aggregation warning when gate units are incompatible and there is no `min()` aggregate, telling the consumer not to average gates across dimensions.
- `SKILL.md` is rewritten the same way.
- Tables dropped from the file but still available in `montecarlo.json`: the per-output distribution table, the full per-driver quartile tables, the full sensitivity tables, the full required-input-threshold blocks. Failure drivers now condenses to one row per failing gate.
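A hedged sketch of that manifest as described at this point (before the v35 change, so `primary_model_result` is still a bare label); the field names are from the list above, all values are hypothetical:

```json
{
  "artifact_type": "interpretation_layer",
  "plan_name": "example_plan",
  "source_plan_dir": "output/example_plan/v34",
  "primary_model_result": "fragile",
  "validation_status": "passed",
  "simulation": {"runs": 10000},
  "primary_failed_gates": ["runway_months_gate"],
  "primary_uncertainty_drivers": ["monthly_burn_eur"],
  "do_not_treat_as": "a calibrated probability of plan success"
}
```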
Test plan

- Smoke (`tests/run_smoke.py`): 8/8; updated the assertions from "Verdict table" to the new stable section names (`## Gate verdicts`, `## Machine summary`, `## Artifact contract`, `## Provenance map`).
- Unit (`tests/test_run_monte_carlo.py`): 50/50, 6 subtests pass.
- Spot-checked the `min()` aggregate output (`weakest_financial_gate_surplus_eur`) in the regenerated reference plans.

🤖 Generated with Claude Code