Draft
193 commits
6acec9d
HTML API docs experiment: plan contract and markdown renderer.
sirreal Jun 11, 2026
947ca71
HTML API docs experiment: task corpus and execution harness.
sirreal Jun 11, 2026
df08126
HTML API docs experiment: round tooling and protocol.
sirreal Jun 11, 2026
cf0fcdc
HTML API docs experiment: workflow scripts and trial persistence.
sirreal Jun 11, 2026
aa1c305
HTML API docs experiment: round 0 baseline results.
sirreal Jun 11, 2026
58140b2
HTML API docs round 1, hypothesis 1: closer-token depth semantics.
sirreal Jun 11, 2026
2d763ed
HTML API docs round 1, hypothesis 2: rehabilitate HTML Processor next…
sirreal Jun 11, 2026
0b9366f
HTML API docs round 1, hypothesis 3: get_modifiable_text() returns de…
sirreal Jun 11, 2026
5266d91
HTML API docs experiment: round 1 results — all hypotheses confirmed.
sirreal Jun 11, 2026
6af8349
HTML API docs experiment: corpus revision per review.
sirreal Jun 11, 2026
5e3f92f
HTML API docs round 3, hypothesis 1: set_attribute() placement rules.
sirreal Jun 11, 2026
ea22ff5
HTML API docs round 3, hypothesis 2: correct the class-level support …
sirreal Jun 11, 2026
fb1f01c
HTML API docs round 3, hypothesis 3: serialize_token() rewrite idiom.
sirreal Jun 11, 2026
74d4b5f
HTML API docs experiment: round 2 Haiku baseline results.
sirreal Jun 11, 2026
6a63e54
HTML API docs experiment: ingest tooling for the long-run loop.
sirreal Jun 11, 2026
514587f
HTML API docs round 4, hypothesis 1: serialization is not how you rea…
sirreal Jun 11, 2026
41fe8b9
HTML API docs round 4, hypothesis 2: which tokens carry modifiable text.
sirreal Jun 11, 2026
3c1b7ab
HTML API docs round 4, hypothesis 3: bookmark name reuse is the last-…
sirreal Jun 11, 2026
11e5a14
HTML API docs experiment: round 3 results — refine serialize guidance…
sirreal Jun 11, 2026
e6f1dcb
HTML API docs round 5, hypothesis 1: building markup from a template.
sirreal Jun 11, 2026
62d133e
HTML API docs round 5, hypotheses 2-3: next_tag() matching contract; …
sirreal Jun 11, 2026
43a81c4
HTML API docs round 5, hypothesis 4: why the subtree walk uses >=.
sirreal Jun 11, 2026
d098352
HTML API docs experiment: round 4 results — train 94.18 (+3.5), T07 c…
sirreal Jun 11, 2026
da7f4e2
HTML API docs round 6 hypotheses: processor chooser, tree-awareness b…
sirreal Jun 11, 2026
290227e
HTML API docs experiment: round 5 results — train 94.77, T04 +49.2.
sirreal Jun 11, 2026
3359336
HTML API docs round 7 hypotheses: RCDATA text location on the HTML Pr…
sirreal Jun 11, 2026
614e4ed
HTML API docs experiment: round 6 checkpoint — train 97.84, held-out …
sirreal Jun 11, 2026
28742d9
HTML API docs round 8 hypotheses: UTF-8 output statement; break-form …
sirreal Jun 11, 2026
6fe7f8b
HTML API docs experiment: round 7 results — train 97.51, N03 perfect.
sirreal Jun 11, 2026
2b1b6d3
HTML API docs round 9 hypotheses: the shared cursor, and surfacing th…
sirreal Jun 11, 2026
4123688
HTML API docs round 9, hypothesis 2: surface the last-X bookmark idio…
sirreal Jun 11, 2026
6bfa07a
HTML API docs experiment: round 8 results — train 97.70, first satura…
sirreal Jun 11, 2026
6dd4bf3
HTML API docs round 10 hypothesis: the RCDATA exception belongs on th…
sirreal Jun 11, 2026
df814e1
HTML API docs experiment: round 9 checkpoint — train 98.66, T08 stabi…
sirreal Jun 11, 2026
31f421e
HTML API docs round 11 hypotheses: the equality case is the reason fo…
sirreal Jun 11, 2026
d26ca3a
HTML API docs experiment: round 10 results — train 98.70, T08 perfect.
sirreal Jun 11, 2026
f451089
HTML API docs round 12 hypotheses: add_class never removes; quoting s…
sirreal Jun 11, 2026
1531f93
HTML API docs experiment: round 11 results — train 98.28, T09 perfect.
sirreal Jun 11, 2026
0ecd406
HTML API docs round 13 hypotheses: no closer-guard needed by default;…
sirreal Jun 11, 2026
8599a39
HTML API docs experiment: round 12 checkpoint — held-out 91.04, N05 +…
sirreal Jun 11, 2026
f584345
HTML API docs round 14 hypothesis: post-collection measurement noted …
sirreal Jun 11, 2026
43eb4e1
HTML API docs experiment: round 13 results — first 100% functional sw…
sirreal Jun 11, 2026
a1782bc
HTML API docs round 15 hypothesis: construction asymmetry stated on b…
sirreal Jun 11, 2026
f4b1039
HTML API docs experiment: round 14 results — construction-asymmetry g…
sirreal Jun 11, 2026
7575d15
HTML API docs round 16 hypothesis: asymmetry reminder where text read…
sirreal Jun 12, 2026
21e532d
HTML API docs experiment: round 15 checkpoint — T05 cured, N05 one pl…
sirreal Jun 12, 2026
54f53c7
HTML API docs experiment: round 16 results — three concepts at 100; r…
sirreal Jun 12, 2026
f556693
HTML API docs experiment: round 17 hold round — campaign-best 98.93 w…
sirreal Jun 12, 2026
bac6933
Tighten HTML API corpus edge cases
sirreal Jun 12, 2026
f05a24a
Improve HTML API markdown docs
sirreal Jun 12, 2026
a16da73
Document next HTML API docs experiment phase
sirreal Jun 12, 2026
d2094c9
Add GOAL
sirreal Jun 12, 2026
fff14cb
Short-circuit text excerpt collection
sirreal Jun 12, 2026
0d31597
Require string hrefs when collecting links
sirreal Jun 12, 2026
7ff10d2
Replace quoted paragraph task with nested lists
sirreal Jun 12, 2026
5fbac88
Trim table extraction prompt hints
sirreal Jun 12, 2026
1fc5d79
Tighten last heading task
sirreal Jun 12, 2026
65301ca
Replace same-html task with tracking attributes
sirreal Jun 12, 2026
908c3c4
Note token-name span check
sirreal Jun 12, 2026
b012226
Simplify external class task prompt
sirreal Jun 12, 2026
7a8c97f
Note reference failure handling checks
sirreal Jun 12, 2026
10d429a
Clarify HTML API docs improvement goal
sirreal Jun 12, 2026
952ee4e
Replace N03 with first list item count task
sirreal Jun 12, 2026
79ef9c1
Replace N04 with normalize fallback task
sirreal Jun 12, 2026
dde6aed
Tighten N05 document title task
sirreal Jun 12, 2026
633a8fb
Replace N06 with table of contents extraction
sirreal Jun 12, 2026
be9f1a4
Replace H04 with empty paragraph holdout
sirreal Jun 12, 2026
6c38ec2
Fold handoff notes into experiment docs
sirreal Jun 12, 2026
6ed3b38
Reconcile HTML API experiment corpus refresh
sirreal Jun 12, 2026
7027483
Record blocked current-corpus scoring attempt
sirreal Jun 12, 2026
549e041
Prepare current-corpus HTML API baseline tooling
sirreal Jun 12, 2026
4d07135
Add HTML API experiment state audit
sirreal Jun 12, 2026
af84d03
Verify staged HTML API docs scratch isolation
sirreal Jun 12, 2026
a205cd8
Prepare round 18 current-corpus baseline
sirreal Jun 12, 2026
2a011ea
Validate HTML API round artifact lifecycle
sirreal Jun 12, 2026
bece681
Generate workflow args from round metadata
sirreal Jun 12, 2026
8b114d9
Reject incomplete metadata-backed HTML API rounds
sirreal Jun 12, 2026
1b73fb4
Record staged docs hashes for HTML API rounds
sirreal Jun 12, 2026
824b4bf
Validate HTML API workflow outputs before ingestion
sirreal Jun 12, 2026
63b03f7
Record HTML API source digests for prepared rounds
sirreal Jun 12, 2026
e0c237c
Verify staged scratch hashes before HTML API scoring
sirreal Jun 12, 2026
99536e7
Add HTML API corpus reference validator
sirreal Jun 12, 2026
9b94f32
Report prepared HTML API baseline in state audit
sirreal Jun 12, 2026
6d62bc1
Emit HTML API workflow launch manifest
sirreal Jun 12, 2026
4bdf097
Reject incomplete HTML API trial workflow outputs
sirreal Jun 12, 2026
006e7d0
Declare HTML API trial isolation contract
sirreal Jun 12, 2026
80246c5
Reject malformed HTML API judge workflow outputs
sirreal Jun 12, 2026
7c1be8c
Verify HTML API source digests during round validation
sirreal Jun 12, 2026
f23aa5a
Validate HTML API trial artifact contents
sirreal Jun 12, 2026
966bfe8
Validate HTML API judge artifact contents
sirreal Jun 12, 2026
e0e34ed
Reject malformed HTML API trial code payloads
sirreal Jun 12, 2026
fa9629a
Reject malformed HTML API workflow envelopes
sirreal Jun 12, 2026
4bf4c49
Verify HTML API scored summary reproducibility
sirreal Jun 13, 2026
8397f20
Pin HTML API round corpus inputs
sirreal Jun 13, 2026
1f15092
Validate current HTML API baseline candidates
sirreal Jun 13, 2026
f0a4af2
Reject HTML API source drift before scoring
sirreal Jun 13, 2026
3820213
Clarify HTML API workflow preflight bypass
sirreal Jun 13, 2026
9e22792
Clarify current HTML API baseline policy
sirreal Jun 13, 2026
177f8ef
Require HTML API trial isolation attestation
sirreal Jun 13, 2026
0dbc1e2
Return HTML API trial isolation envelope
sirreal Jun 13, 2026
5d02b91
Require explicit HTML API trial agent type
sirreal Jun 13, 2026
4676f0e
Refresh HTML API round 18 launch metadata
sirreal Jun 13, 2026
8be1ddc
Record HTML API workflow launch provenance
sirreal Jun 13, 2026
b79ef64
Prevent HTML API result artifact overwrites
sirreal Jun 13, 2026
90db44d
Validate HTML API corpus before workflow launch
sirreal Jun 13, 2026
15792fa
Match HTML API launch preflight to selected tasks
sirreal Jun 13, 2026
92ab8b7
Validate HTML API round lifecycle artifacts
sirreal Jun 13, 2026
c10f4d0
Validate HTML API trial harness output before persist
sirreal Jun 13, 2026
9582628
Clean up HTML API trial batch failures
sirreal Jun 13, 2026
bd4b2d3
Clean up HTML API judge ingest failures
sirreal Jun 13, 2026
f0b29ff
Allow HTML API workflow args handoff files
sirreal Jun 13, 2026
c71e79b
Clean up HTML API trial isolation failures
sirreal Jun 13, 2026
47b2154
Try to not get stuck on goal
sirreal Jun 13, 2026
86a265d
Record round 18 workflow handoff
sirreal Jun 13, 2026
a3495aa
Add local Codex trial runner
sirreal Jun 13, 2026
b57bbba
Update round 18 local runner handoff
sirreal Jun 13, 2026
edf372a
Surface local trial runner in audit
sirreal Jun 13, 2026
da48a49
Fix Codex trial runner approval config
sirreal Jun 13, 2026
dbda405
Embed local runner trial inputs
sirreal Jun 13, 2026
a58f43e
Ingest round 18 trial artifacts
sirreal Jun 13, 2026
369593a
Add local Codex judge runner
sirreal Jun 13, 2026
6815e3b
Surface local judge runner in audit
sirreal Jun 13, 2026
aeef196
Update round 18 judge handoff
sirreal Jun 13, 2026
e656643
Record local judge runner approval
sirreal Jun 13, 2026
69a99ba
Fix Codex judge runner schema
sirreal Jun 13, 2026
b4e2158
Score round 18 current-corpus baseline
sirreal Jun 13, 2026
7436145
Add local Codex discoverability probe runner
sirreal Jun 13, 2026
6651d33
Fix Codex probe runner isolated workdir
sirreal Jun 13, 2026
382a04c
Record incomplete-token discoverability probe
sirreal Jun 13, 2026
012dabb
Keep probe artifacts outside scored rounds
sirreal Jun 13, 2026
1726193
Add HTML Processor region-scan recipe
sirreal Jun 13, 2026
881fba4
Score round 19 region-scan recipe
sirreal Jun 13, 2026
528361b
Record text-content discoverability probe
sirreal Jun 13, 2026
cccb11d
Score round 20 low-effort calibration
sirreal Jun 13, 2026
29d2511
Record generic recipe discoverability probe
sirreal Jun 13, 2026
27077e0
Add HTML Processor text and rewrite recipes
sirreal Jun 13, 2026
0a0b406
Score round 21 HTML Processor recipes
sirreal Jun 13, 2026
bd624d8
Score round 22 current-docs calibration
sirreal Jun 13, 2026
dfb84bb
Record Tag Processor text-boundary probe
sirreal Jun 13, 2026
f7c83bf
Clarify Tag Processor text-token recipe boundary
sirreal Jun 13, 2026
050b1b4
Score round 23 Tag Processor text boundary
sirreal Jun 13, 2026
0dd50bc
Score round 24 checkpoint
sirreal Jun 13, 2026
25f2f4f
Record read-only text policy probe
sirreal Jun 13, 2026
3d6e1da
Score text policy scratch A/B
sirreal Jun 13, 2026
3af3535
Score ordinary text scratch A/B
sirreal Jun 13, 2026
95173a4
Clarify ordinary subtree text policy
sirreal Jun 13, 2026
f3e8132
Score ordinary subtree text policy
sirreal Jun 13, 2026
ac5dbf2
Record next_tag cursor probe
sirreal Jun 13, 2026
c3660cd
Score next_tag cursor scratch A/B
sirreal Jun 13, 2026
19a49c1
Clarify HTML Processor next_tag cursor searches
sirreal Jun 13, 2026
bbe6768
Score HTML Processor next_tag cursor source edit
sirreal Jun 13, 2026
22f26cf
Reconcile next diagnostic action
sirreal Jun 13, 2026
5151e1f
Score depth-bounded traversal scratch A/B
sirreal Jun 13, 2026
b87cf80
Teach audit about diagnostic subset rounds
sirreal Jun 13, 2026
bb760eb
Run checkpoint before traversal card promotion
sirreal Jun 13, 2026
ba25ad6
Teach audit about checkpoint rounds
sirreal Jun 13, 2026
77ec812
Record traversal card promotion action
sirreal Jun 13, 2026
6548356
Document depth-bounded traversal recipe
sirreal Jun 13, 2026
4a39f78
Teach audit to score source hypotheses
sirreal Jun 13, 2026
8ca976b
Score depth traversal recipe source edit
sirreal Jun 13, 2026
aa2b580
Test method-local text policy scratch variant
sirreal Jun 13, 2026
95739cd
Probe serialize token fallback policy
sirreal Jun 13, 2026
babc0b1
Test serialization fallback scratch variant
sirreal Jun 13, 2026
c5dacae
Run fallback policy checkpoint
sirreal Jun 13, 2026
27c764f
Document serialization rewrite fallback policy
sirreal Jun 13, 2026
ac41d64
Score serialization fallback source edit
sirreal Jun 13, 2026
8441f6b
Test text policy decision table scratch variant
sirreal Jun 13, 2026
44facea
Run text policy checkpoint
sirreal Jun 13, 2026
29a148a
Document HTML Processor text extraction policy
sirreal Jun 13, 2026
09aed17
Score text extraction policy source edit
sirreal Jun 13, 2026
9aaa0ce
Test read-only extraction completion policy
sirreal Jun 13, 2026
feca7c2
Run read-only policy checkpoint
sirreal Jun 13, 2026
5edc48f
Calibrate lower reasoning weak tier
sirreal Jun 13, 2026
65e60e6
Teach audit weak-tier ladder
sirreal Jun 13, 2026
2e163c0
Calibrate mini high weak tier
sirreal Jun 13, 2026
b6dd751
Calibrate mini low weak tier
sirreal Jun 13, 2026
33764b4
Run serialization fallback A/B control
sirreal Jun 13, 2026
feda3a6
Test serialization rewrite fallback card
sirreal Jun 13, 2026
1c0fabd
Document HTML Processor rewrite fallback policy
sirreal Jun 13, 2026
1107adb
Record source score subject tier
sirreal Jun 13, 2026
b8ca8b2
Score rewrite fallback source edit
sirreal Jun 13, 2026
75137f5
Teach audit checkpoint next action
sirreal Jun 13, 2026
d019629
Allow prepared round artifacts in audit
sirreal Jun 13, 2026
14ce436
Checkpoint rewrite fallback source edit
sirreal Jun 13, 2026
f5f875a
Teach audit diagnostic subset lifecycle
sirreal Jun 13, 2026
29eb033
Run traversal boundary A/B control
sirreal Jun 13, 2026
f791b3e
Run traversal boundary A/B variant
sirreal Jun 13, 2026
625debc
Run bounded traversal A/B variant
sirreal Jun 13, 2026
a1286c1
Probe traversal contract discoverability
sirreal Jun 13, 2026
abe45c5
Clarify weak-tier probe followup
sirreal Jun 13, 2026
0d80a17
Probe attribute value contract
sirreal Jun 13, 2026
e8c85ad
Pause exhausted traversal diagnostics
sirreal Jun 13, 2026
727e824
Clarify signal exhaustion pause
sirreal Jun 13, 2026
The diff you're trying to view is too large. We only load the first 3000 changed files.
22 changes: 22 additions & 0 deletions .claude/agents/docs-test-subject.md
@@ -0,0 +1,22 @@
---
name: docs-test-subject
description: Documentation-only test subject for the HTML API doc-improvement experiment. Implements a PHP function using only the two provided documentation files. Tool access is restricted to Read and Grep by design — do not widen it.
tools: Read, Grep
---

You are a test subject in a documentation-quality experiment. You implement
a single PHP function using the WordPress HTML API.

Hard rules:

- Your ONLY information sources are the documentation files whose absolute
paths are given in your task prompt. Read or search them as much as you
like.
- You must not attempt to access any other file, directory, or resource.
- You never execute code; you reason from documentation alone.
- Do not invent methods, constants, or behaviors that the documentation
does not describe. If the documentation seems incomplete, choose the
best-supported approach it does describe.

Your final message is your deliverable and must follow the output format
specified in your task prompt exactly.
136 changes: 136 additions & 0 deletions GOAL.md
@@ -0,0 +1,136 @@
# HTML API Documentation Improvement Goal

<EXPLICIT_PERMISSIONS>
I, the user, approve using the local Codex judge runner for this round and all subsequent rounds, including sending judge-visible materials to the model provider.
</EXPLICIT_PERMISSIONS>

Improve the rendered documentation usability for `WP_HTML_Tag_Processor` and
`WP_HTML_Processor`, measured by how well weaker models complete real HTML API
tasks using only the staged rendered documentation.

The only source documentation hypothesis edits are docblock changes in:

- `src/wp-includes/html-api/class-wp-html-tag-processor.php`
- `src/wp-includes/html-api/class-wp-html-processor.php`

Do not change PHP behavior. Infrastructure/tooling changes are allowed only
when needed to keep the experiment valid, and must be tracked separately from
documentation hypothesis edits.

The primary deliverable is improved source documentation in the two HTML API
docblock files. Tooling, handoff files, audits, manifests, and result hygiene
are support work only; they are not progress on the goal unless they unblock
the next documentation-measurement or documentation-edit step.

## Authoritative State

`GOAL.md` defines the stable objective and guardrails. It must not be treated
as the current phase record.

At the start of every run, determine the active phase and next action from:

- `doc-experiment/PLAN.md` - experiment contract
- `doc-experiment/PROTOCOL.md` - operational runbook
- `doc-experiment/NEXT-HYPOTHESES.md` - current hypothesis backlog
- `doc-experiment/LOG.md` - latest experiment narrative
- `doc-experiment/results/round-*` - persisted measurements
- `git status` - unresolved local drift

If these sources conflict, pause scoring and reconcile the experiment state
before continuing.

## Start-of-Run Checklist

Before making edits or running a score:

1. Inspect the worktree and preserve existing user changes.
2. Identify the latest completed trusted round and its score.
3. Identify the current round mode using the modes defined in `PROTOCOL.md`
and the state in `LOG.md`, results, and the worktree.
4. Identify the current model policy, subject tier, judge tier, and whether the
subject tier has a no-edit baseline.
5. Check whether source docs, tooling, corpus, or results changed since the last
trusted score.
6. Determine the next action implied by the plan: calibration, probe, scratch
A/B, normal scoring, checkpoint, source promotion, revert, or stop.
7. Record any mismatch before trusting new scores.
8. Classify the next action as one of:
   - `documentation-edit`
   - `measurement`
   - `result-ingestion`
   - `state-reconciliation`
   - `external-action-required`

   If the next action is `external-action-required`, do not substitute
   unrelated tooling work for it.
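
As an illustration of the classification in step 8, the sketch below models the five action labels and the rule that `external-action-required` blocks further work. The names `NextAction` and `may_proceed` are hypothetical and are not part of the experiment tooling; only the label strings come from the checklist.

```python
from enum import Enum


class NextAction(Enum):
    """The five next-action labels listed in the checklist."""

    DOCUMENTATION_EDIT = "documentation-edit"
    MEASUREMENT = "measurement"
    RESULT_INGESTION = "result-ingestion"
    STATE_RECONCILIATION = "state-reconciliation"
    EXTERNAL_ACTION_REQUIRED = "external-action-required"


def may_proceed(action: NextAction) -> bool:
    """Return False when the run must stop and wait for an external action.

    Hypothetical helper: it encodes only the rule that unrelated tooling
    work must not be substituted for a required external action.
    """
    return action is not NextAction.EXTERNAL_ACTION_REQUIRED
```

Parsing the label with `NextAction("measurement")` raises `ValueError` for any string outside the five listed labels, which keeps an unclassified action from slipping through.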

## Operating Rules

### Progress Priority

- Prefer actions in this order:
  1. Run or ingest the measurement required by the active phase.
  2. Analyze trusted measurements and choose a documentation hypothesis.
  3. Edit source docblocks for one evidence-backed hypothesis.
  4. Stage, score, aggregate, log, and commit that hypothesis.
  5. Fix tooling only when a specific observed or imminent failure would make
     the above steps invalid or non-retryable.
- Do not perform opportunistic infrastructure hardening merely because the
  required scoring or documentation action is unavailable.
- A tooling change must name the experiment-validity failure it prevents and
  must be followed by a re-audit of the actual next documentation/measurement
  action.

- Test subjects may read only the staged markdown docs and task prompt.
- Never expose `reference.php`, `tests.json`, source files, logs, plans, or
hypothesis docs to test-subject agents.
- Use one primary subject tier per scored round. Do not mix model tiers into a
main round score.
- Use the judge/model policy from `PLAN.md` and `PROTOCOL.md`; if runner tooling
disagrees with that policy, fix or explicitly record the mismatch before
comparing scores.
- Held-out tasks are checkpoint/regression sentinels only and must never drive
documentation edits.
- Compare scores only across comparable rounds: same corpus, same round mode,
same primary subject tier, same judge policy, and compatible tooling.
- Scratch rendered-doc variants must stay out of source docblocks until they win
by evidence.
- Promote only general API documentation improvements, not task-shaped answers.
- When trials, judges, or probes repeatedly reveal surprising API behavior,
recurring hallucinated methods, or missing API affordances, record the pattern
in a consistent backlog location for later consideration. Use
`doc-experiment/NEXT-HYPOTHESES.md` for documentation hypotheses, and keep
future API/design observations distinct from immediate docblock edits.
- Keep `@since` tags intact and do not fabricate changelog entries.
- After every source docblock edit, run the docs-only guard, stage docs, run the
appropriate scored flow, aggregate results, update `LOG.md`, and commit one
source hypothesis at a time.
- Commit experiment results separately from source documentation hypotheses
where practical.
- Stop or pause according to `PLAN.md`/`PROTOCOL.md`, especially when signal is
exhausted, failures are generic model noise, or the experiment state is
inconsistent.
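
The comparability rule above (same corpus, round mode, primary subject tier, judge policy, and compatible tooling) can be sketched as a field-by-field check. This is a minimal sketch under stated assumptions: `RoundMeta`, its field names, and `mismatches` are hypothetical, and "compatible tooling" is simplified to equality of a compatibility label.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RoundMeta:
    """Hypothetical per-round metadata; field names are assumptions."""

    corpus_digest: str
    round_mode: str
    subject_tier: str
    judge_policy: str
    tooling_compat: str  # assumption: one label stands in for "compatible tooling"


def mismatches(a: RoundMeta, b: RoundMeta) -> list[str]:
    """List the fields on which two rounds differ.

    An empty list means the rounds are comparable under this simplified
    model; any entry means their scores should not be compared.
    """
    return [
        f.name
        for f in fields(RoundMeta)
        if getattr(a, f.name) != getattr(b, f.name)
    ]
```

Returning the differing field names, rather than a bare boolean, gives the mismatch something concrete to record before new scores are trusted.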

### External Runner Gate

- If the active next action is to launch trials or judges in an external
  Workflow runner and that runner is not available in the current session:
  1. Generate or verify the exact handoff payload once.
  2. Report the command/files needed for the external runner.
  3. Stop work and ask for one of:
     - external runner output to ingest,
     - explicit authorization to use an alternative runner,
     - explicit authorization to bypass the measurement gate.
- Do not continue with additional tooling, corpus, or documentation edits while
  waiting for that external action unless the user explicitly asks for them.
- Do not mark the documentation goal as making substantive progress from
  handoff preparation alone.

## Promotion Standard

A source documentation edit is justified only when local evidence shows a
specific documentation usability failure: missing contract, misleading wording,
poor placement, low discoverability, or excessive rendered-doc noise.

Evidence may come from scored train rounds, no-edit baselines, citation-only
discoverability probes, judge analyses, or paired scratch-doc A/B tests.
Held-out-only evidence is not sufficient.
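
One way to read the evidence rule is as a set check: at least one accepted evidence kind must be present, and held-out evidence never counts on its own. The sketch below is illustrative only; the label strings and the function name are hypothetical, not identifiers from the experiment tooling.

```python
# Hypothetical labels for the evidence kinds named in the Promotion Standard.
ACCEPTED_EVIDENCE = frozenset(
    {
        "scored-train-round",
        "no-edit-baseline",
        "citation-only-probe",
        "judge-analysis",
        "scratch-ab-test",
    }
)


def evidence_is_sufficient(evidence: set[str]) -> bool:
    """Return True when at least one accepted evidence kind is present.

    Labels outside ACCEPTED_EVIDENCE (for example a hypothetical
    "held-out" label) are ignored, so held-out-only evidence fails.
    """
    return bool(evidence & ACCEPTED_EVIDENCE)
```

This covers only the evidence-source half of the standard; whether the evidence shows a specific usability failure still needs human or judge review.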