Html api docs improvement by sirreal · Pull Request #61 · sirreal/wordpress-develop

sirreal · 2026-06-12T19:39:25Z

Trac ticket:

Use of AI Tools

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

Scaffolding for the autonomous documentation-improvement loop:
- PLAN.md records the full agreed design (corpus, scoring, isolation, harness, round flow, revert and stopping rules).
- render-docs-markdown.py deterministically renders phpdoc-parser JSON to agent-readable markdown, excluding implementation leakage.

16 tasks (12 train + 4 held-out), each with a subagent-facing prompt, a validated reference implementation, and frozen hidden test cases. Expected outputs were generated from the references and cross-checked against PHP's Dom\HTMLDocument where semantics overlap (text extraction, links, tables, outlines) — all agree. Harness executes candidates standalone (no WordPress boot) with shims for the six WP functions the html-api files reference; each test case runs in an isolated subprocess with a 10s timeout so parse errors, fatals, and infinite loops are contained and reported.
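The per-case isolation described above can be sketched as follows. This is a minimal illustration, not the PR's harness: `run_case` and its return shape are hypothetical names, and the Python interpreter stands in for the PHP candidate so the snippet runs on its own.

```python
import subprocess
import sys


def run_case(argv, timeout_s=10.0):
    """Run one test case in its own process and classify the outcome."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Infinite loops end up here instead of hanging the whole round.
        return {"status": "timeout", "stdout": "", "stderr": ""}
    # A parse error or fatal in the child shows up as a non-zero exit code.
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}


if __name__ == "__main__":
    # Stand-in commands; a real harness would invoke the candidate instead.
    print(run_case([sys.executable, "-c", "print('hello')"])["status"])
    print(run_case([sys.executable, "-c", "raise SystemExit(3)"])["status"])
    print(run_case([sys.executable, "-c", "while True: pass"], timeout_s=0.5)["status"])
```

Because every case gets a fresh process, one crashing or hanging candidate cannot affect the cases that follow it.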

- stage-round.sh: regenerate JSON, render markdown, stage isolated scratch dir containing only the two markdown files.
- docs-only-guard.php: comment-stripped token-stream identity vs HEAD plus php -l, run before every round that follows doc edits.
- aggregate-round.py: trial/task/round scoring per PLAN.md formula.
- PROTOCOL.md: runbook with exact test-subagent and judge prompt templates, judge rubric, and results layout.
- docs-test-subject agent definition (Read+Grep only) for structural isolation in future sessions.

Pilot validated end-to-end: Sonnet test subject on T01 returned well-formed output passing 8/8 hidden cases.
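The idea behind the docs-only guard (two versions of a file must have identical token streams once comments are removed) can be illustrated on Python source with the standard `tokenize` module. This is an analogy only: the guard in this PR is a PHP script operating on PHP tokens, and the helper names below are hypothetical.

```python
import io
import tokenize

# Token types that carry no executable meaning for this comparison.
IGNORED = {tokenize.COMMENT, tokenize.NL}


def code_tokens(source):
    """Return (type, text) pairs for every token that is not a comment."""
    readline = io.StringIO(source).readline
    return [
        (tok.type, tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in IGNORED
    ]


def is_comment_only_change(before, after):
    """True when the two sources differ only in comments and blank lines."""
    return code_tokens(before) == code_tokens(after)


print(is_comment_only_change("x = 1\n", "# note\nx = 1  # trailing\n"))  # True
print(is_comment_only_change("x = 1\n", "x = 2\n"))  # False
```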

trials-workflow.js fans out one docs-only test subject per task-trial with structured output; judge-workflow.js fans out one Opus judge per task with the adherence rubric and doc-gap analysis; persist-trials.py writes candidates to results/ and executes them against hidden tests.

48 Sonnet trials (16 tasks x 3) judged by 16 Opus judges. TRAIN 93.57 / HELD-OUT 93.47. Dominant systematic failure: undocumented closer-token depth semantics plus missing subtree-walk idiom (T03, T06, H04). Secondary: get_modifiable_text() decoding unstated (T08, H04); serialize_token() rewrite idiom undocumented (T12); misleading tables-unsupported bullet (T08).

Round-0 failures in T03, T06, and held-out H04 shared one root cause: nothing documents that a closing-tag token reports the PARENT's depth (the element is already popped when matched on its closer). All three T03 trials lost trailing text after nested elements by breaking their walk loops at 'depth <= opener depth'. get_current_depth(): state the closer rule explicitly, define depth as breadcrumb count including non-element tokens, extend the existing example through the closing tokens, and add the canonical visit-every-token-inside-an-element loop (depth >= opener depth). is_tag_closer() (HTML Processor): note that breadcrumbs and depth reflect the parent context when matched on a closer.
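The closer rule can be reproduced with a toy token stream. The sketch below is not the WordPress API: `tokens_with_depth` is a hypothetical model that only encodes the rule stated above (a closer reports its parent's depth, text counts as one level deeper than its parent), and its depths start at zero rather than including any surrounding document context.

```python
import re


def tokens_with_depth(html):
    """Yield (kind, value, depth) for a tiny well-formed fragment (toy model)."""
    stack = []
    for raw in re.findall(r"</?[a-z]+>|[^<]+", html):
        if raw.startswith("</"):
            stack.pop()  # the element is already popped on its closer
            yield ("closer", raw[2:-1], len(stack))
        elif raw.startswith("<"):
            stack.append(raw[1:-1])
            yield ("opener", raw[1:-1], len(stack))
        else:
            yield ("text", raw, len(stack) + 1)


def text_inside(html, tag, keep_walking):
    """Collect text inside the first `tag` while keep_walking(depth, opener_depth) holds."""
    stream = tokens_with_depth(html)
    for kind, value, depth in stream:
        if kind == "opener" and value == tag:
            opener_depth = depth
            break
    else:
        return ""
    out = []
    for kind, value, depth in stream:  # continues from the same cursor
        if not keep_walking(depth, opener_depth):
            break
        if kind == "text":
            out.append(value)
    return "".join(out)


fragment = "<div><p>a<b>x</b>y</p>z</div>"
correct = text_inside(fragment, "p", lambda d, start: d >= start)
buggy = text_inside(fragment, "p", lambda d, start: d > start)
print(correct)  # axy
print(buggy)    # ax
```

In the buggy walk, the `</b>` closer reports the same depth as the `<p>` opener, so breaking at "depth <= opener depth" ends the walk before the trailing `y` is seen.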

@since

…_token(). The docblock described the method as internal ('do not use') and steered readers to the Tag Processor 'for access to the raw tokens'. That is the opposite of the right guidance for structure-aware text collection, and round-0 judges identified it as a driver of the T06 failures (two of three trials collected nothing). Rewrite the description: define tokens, position next_token() as the right tool when non-tag content matters alongside structure, document that closers are visited for every opener (including implicit and end-of-input closes), warn that text may split across consecutive #text tokens, and add the canonical collect-text-of-an-element example in both depth-guard and breadcrumbs-guard forms (both verified by execution). @since history left as-is.

…coded text. Round-0 judges (T08, H04) flagged that nothing states whether the returned text has character references decoded — the single most load-bearing fact for text extraction. Several subjects bolted on a redundant html_entity_decode() pass, which double-decodes and corrupts text like '&amp;'. State the decoding rule with its boundaries (decoded for #text and RCDATA elements like TEXTAREA/TITLE; verbatim for raw text SCRIPT/STYLE and comment interiors — all verified by execution), add a one-line example, and note the set_modifiable_text() inverse so callers work in decoded space on both sides.
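The double-decoding hazard is easy to reproduce outside PHP. In this sketch Python's `html.unescape` stands in for `html_entity_decode()`, which is an assumption made for illustration; the input string is invented.

```python
import html

# Source markup for the visible text "Use &amp; to write an ampersand".
source = "Use &amp;amp; to write an ampersand"

decoded_once = html.unescape(source)         # what a decoding reader returns
decoded_twice = html.unescape(decoded_once)  # the redundant second pass

print(decoded_once)   # Use &amp; to write an ampersand
print(decoded_twice)  # Use & to write an ampersand
```

The second pass silently changes the meaning of the text, which is why callers should not decode a value that is already decoded.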

TRAIN 98.78 (+5.21 vs baseline). 36/36 trials passed every hidden case. T03 +13.95 (closer-depth rule + subtree-walk example), T06 +46.33 (next_token() rehabilitation), no regressions beyond judge noise. Sonnet has plateaued >=90 for two consecutive rounds; next step per plan is the Haiku re-baseline. Round-2 adherence targets logged.

Task-first rebalance: add six tasks forcing undercovered concepts (class removal, contextual selection, truncated-input detection, normalize() failure handling, full-document parsing, HTML-vs-SVG image namespace). New held-out set: N01/N02/N05/H04; H01-H03 retired; T01/T02 relabeled smoke. Every task now carries role/commonness/concept/processor labels and the aggregator reports per-concept means. All new references harness-validated; N02/N05/N06 cross-checked against Dom\HTMLDocument (covering image->img conversion and img-breaking-out-of-svg).

Two of three Haiku trials on the build-figure task produced correct markup with src/alt swapped and scored 0/6 — the docs never explain where set_attribute() puts attributes. Verified by execution: updates replace in place keeping position; NEW attributes insert after the tag name before existing ones; multiple new attributes sort by attribute name regardless of call order. Document all three rules plus the start-from-a-template idiom for when output order matters. Also fixes a judge-discovered bug in the paused_at_incomplete_token() example, which called the nonexistent get_next_tag() instead of next_tag().

…claims. The class docblock claimed the HTML Processor cannot process any element inside a TABLE, any foreign content (SVG/MathML), or anything outside the IN BODY insertion mode. All three claims are false on this branch — round-2 trials parsed well-formed tables, SVG content, and full documents with head content; judges traced T08's defensive fallback code directly to this passage. Replace with verified behavior: the processor parses these fine and aborts only on specific constructs — foster-parented content (e.g. a DIV directly inside TABLE) and mis-nested formatting requiring advance-and-rewind reconstruction (e.g. '<b>one<i>two</b>three</i>'), both confirmed by execution, with simple mis-nesting supported. Also document how aborts surface: get_last_error(), get_unsupported_exception(), and null from serialize()/normalize().

Round-1 judges (T12) flagged that nothing connects serialize_token() to its purpose: subjects mixed token loops with whole-string normalize(), unsure which was right. Document that concatenating serialize_token() across a next_token() walk reproduces serialize(), that the token-by-token form exists for selective rewriting (skip to remove, emit around to wrap), and that closers of skipped elements must be skipped too — with an execution-verified removal example. Cross-reference guidance: serialize() for unchanged output, the loop for transformations.
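The shape of the token-by-token rewrite can be shown with a toy tokenizer. This is a hedged sketch, not the execution-verified PHP example from the docblock: `tokenize` and `remove_elements` are hypothetical helpers, and plain string concatenation stands in for serialize_token().

```python
import re


def tokenize(html):
    """Split a tiny well-formed fragment into tag and text tokens (toy tokenizer)."""
    return re.findall(r"</?[a-z]+>|[^<]+", html)


def remove_elements(html, tag):
    """Drop every `tag` element and its contents by skipping tokens while emitting."""
    out = []
    skipping = 0  # nesting depth of `tag` elements currently being skipped
    for token in tokenize(html):
        if token == f"<{tag}>":
            skipping += 1
            continue
        if token == f"</{tag}>":
            # The closer of a skipped element must be skipped too.
            skipping -= 1
            continue
        if skipping == 0:
            out.append(token)
    return "".join(out)


source = "<p>keep <span>drop <b>this</b></span>and this</p>"
# Emitting every token unchanged reproduces the input.
assert "".join(tokenize(source)) == source
print(remove_elements(source, "span"))  # <p>keep and this</p>
```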

All-19 91.47 / core 90.47 / train 92.56 / held-out 87.38. Round-1 edits transfer to Haiku (T03, T06 perfect). Per-concept reporting exposes the gaps the aggregate hides: attributes 72.2 (set_attribute ordering), full-document 78.0 (held-out, no edit made), namespace 85.9. Round-3 hypothesis edits committed separately.

ingest-trials.py and ingest-judges.py condense per-round bookkeeping (persist, execute, aggregate, compare, gap digest with held-out gaps marked DO-NOT-ACT) into single commands, keeping orchestration overhead low across the 100+ round goal.

…d edits. Round 3 confirmed the serialize_token() idiom (round-3 H3) helped its targets (T09 +8.6, T12 +2.2) but induced a T07 regression (-33.7): two trials called serialize() after add_class(), got null (scanning had begun), and returned the unmodified input. Refining rather than reverting, disclosed in LOG: state the boundary explicitly on both serialize() and serialize_token() — queued attribute/class/text updates are read with the inherited get_updated_html(); serialize() demands a fresh processor and returns null once scanning has begun; serialization is for normalizing/rewriting, get_updated_html() for edits.

Two of three round-3 trials on the build-figure task produced empty captions: they matched the empty FIGCAPTION tag and called set_modifiable_text(), which returns false there — ordinary container elements carry no text of their own and an empty element has no #text token to modify. Nothing documented this. State the eligible token kinds, the empty-element limitation, the check-the-return-value rule, and the placeholder-template idiom (verified by execution).

…X idiom. T10 adherence sat at ~80 because the set_bookmark() docblock forbids programmatic names without stating the supported alternative; subjects hedged with bookmark-count workarounds. State explicitly that re-setting an existing name MOVES the bookmark (no leak, no release needed) and that same-name-per-match is the idiom for tracking the last occurrence in one pass (verified by execution; the docblock's own last-li example already relied on it silently). Also state the documented default for next_tag()'s tag_closers option ('skip'), which round-3 judges flagged as unstated.

…, two new gaps. All-19 87.41 / train 90.66 (-1.9) / held-out 75.22. Round-3 edits helped their targets (T09 +8.6, T12 +2.2, N06 +10.7, N04 100) but the serialization idiom induced T07 -33.7 (serialize() after mutations). Refined rather than reverted, with the boundary now stated. Round-4 hypotheses committed separately.

T04 trials each absorbed exactly one of the two template-building facts (pre-seeded attribute order in set_attribute(), placeholder text in set_modifiable_text()) and failed on the other — the facts live in two distant method docblocks. Add a 'Building markup from a template' section to the class overview, where template builders first look, stating both rules together with one execution-verified example using a link template (deliberately unlike any corpus task).

…decoded reads; add_class idempotency. Judges across four tasks flagged the same unstated guarantees subjects kept inferring (correctly, but unguided):
- next_tag(): tag-name matching is ASCII case-insensitive with source casing preserved; comments/CDATA/rawtext can never match; truncated trailing tags are never matched or modified (cross-ref paused_at_incomplete_token()). Stated as a 'What this matches' block.
- get_attribute(): string values come back DECODED (don't decode again), inverse of set_attribute's encode-on-write.
- add_class(): creates/appends without disturbing existing classes; re-adding an existing class is a no-op with an exact byte-for-byte duplicate check (adding 'NOTE' to class="note" appends, as verified; an initial case-insensitive claim was caught wrong by probe before commit).
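The add_class() guarantees listed above can be restated as a small model. This is a toy function that mirrors only the rules as written in this commit message, not WordPress code; `add_class_model` is a hypothetical name and it works on a bare attribute value rather than on markup.

```python
def add_class_model(class_attr, name):
    """Return the class attribute value after adding `name` (toy model)."""
    existing = class_attr.split() if class_attr else []
    if name in existing:
        # Exact, case-sensitive match: re-adding is a no-op.
        return class_attr
    # Otherwise append, leaving the existing classes untouched.
    return f"{class_attr} {name}" if class_attr else name


print(add_class_model("", "note"))          # note
print(add_class_model("note", "note"))      # note
print(add_class_model("note", "NOTE"))      # note NOTE
print(add_class_model("a b", "important"))  # a b important
```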

T08 judges noted the only depth-bounded walk example nests one level, where >= and > behave identically, so readers can't learn which is right. State the rule: >= is correct at any depth; > ends the walk at the first direct-child closer (verified: with > the UL walk stops after the first LI's contents).

…ured.

…oundary, get_updated_html identity, >= warning placement. Round-5's two single-trial collapses both trace to unstated boundaries: a T06 trial attempted tree-aware work in the Tag Processor (whose docs never say it lacks depth/breadcrumbs), and a T03 trial copied the next_token() example but guessed '>' because the >= warning only existed in get_current_depth().
- Tag Processor overview: 'Which processor should I use?' section stating it has NO tree awareness and where those methods live; HTML Processor overview gets the matching half.
- get_updated_html(): own description at last (was a copy of __toString's): read-your-edits semantics, byte preservation, safe mid-scan.
- next_token() example now carries the >= warning inline where the failing trial actually read.

…ocessor, >= beside the operator, drain idiom, add_class return semantics. Round-6 train gaps: the HTML Processor's own get_modifiable_text() override never stated decoding or that SCRIPT/STYLE/TEXTAREA/TITLE carry their text on the element token (no #text child) — stated now with a verified full-parser TITLE example; the >= rule now sits beside the operator in the get_current_depth() example with the nested-closer/sibling-text explanation inline; the paused_at_incomplete_token() example gains the drain-all-tokens idiom its single-tag example obscured; add_class() return documented as enqueued-not-applied (false only with no matched tag, verified).

…above baseline.

…boundary rule. Round-7's only functional miss (T05 5/9) sliced multibyte text without an explicit encoding; the docs say UTF-8 is the only supported input but never said the output of get_modifiable_text() is UTF-8 nor showed the mb_* explicit-encoding idiom — stated on both classes now. T08's recurring boundary confusion appears in break-form code that the continue-form-only warning misses: stated the equivalence (break at < depth, never <= depth).
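The slicing failure has a direct analogue in any language that distinguishes bytes from characters. The sketch below uses Python byte strings to stand in for slicing UTF-8 text without an explicit encoding; it is an illustration of the hazard, not the mb_* idiom from the docblock, and the sample text is invented.

```python
text = "na\u00efve caf\u00e9"  # "naïve café"
raw = text.encode("utf-8")

# Slicing bytes can cut a multibyte character in half.
print(raw[:3])  # b'na\xc3'
try:
    raw[:3].decode("utf-8")
except UnicodeDecodeError:
    print("byte slice is not valid UTF-8")

# Decoding with an explicit encoding first makes slices count characters.
print(raw.decode("utf-8")[:3])  # naï
```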

…e last-X bookmark idiom. T08's recurring failure class is nested walk loops double-advancing the single cursor: the inner collect-until-close loop exits already matched on the next region's boundary token, which the outer loop's next_token() then skips. Document the one-cursor contract on next_token() with the closer-driven single-pass state-machine shape (verified DT example), noting it stays reliable on malformed input because closers are always visited. Also surface the re-set-the-same-bookmark-name idiom in the overview bookmarks narrative where T10 trials kept missing it.
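The double-advance failure can be modelled with a plain Python iterator playing the role of the single cursor. Both functions are hypothetical illustrations rather than the verified DT example from the docblock, and the token list is hand-written instead of produced by a parser.

```python
def dt_texts_nested(tokens):
    """Buggy shape: the inner loop stops on a boundary token the outer loop then skips."""
    cursor = iter(tokens)  # one shared cursor, like a single processor instance
    terms = []
    for token in cursor:
        if token != "<dt>":
            continue
        parts = []
        for inner in cursor:
            if inner.startswith("<") and not inner.startswith("</"):
                break  # the cursor now sits on the next opener
            if not inner.startswith("<"):
                parts.append(inner)
        terms.append("".join(parts))
        # The outer loop advances past that opener before inspecting it.
    return terms


def dt_texts_single_pass(tokens):
    """Closer-driven state machine: one loop, one advance per iteration."""
    terms, parts, in_dt = [], [], False
    for token in tokens:
        if token == "<dt>":
            in_dt, parts = True, []
        elif token == "</dt>":
            in_dt = False
            terms.append("".join(parts))
        elif in_dt and not token.startswith("<"):
            parts.append(token)
    return terms


tokens = ["<dl>", "<dt>", "A", "</dt>", "<dt>", "B", "</dt>", "<dd>", "1", "</dd>", "</dl>"]
print(dt_texts_nested(tokens))       # ['A']
print(dt_texts_single_pass(tokens))  # ['A', 'B']
```

The single-pass version never advances the cursor in more than one place, so no boundary token can be consumed without being inspected.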

sirreal added 30 commits June 11, 2026 18:37

HTML API docs experiment: round 4 results — train 94.18 (+3.5), T07 c…

d098352

…ured.

HTML API docs experiment: round 5 results — train 94.77, T04 +49.2.

290227e

HTML API docs experiment: round 6 checkpoint — train 97.84, held-out …

614e4ed

…above baseline.

HTML API docs experiment: round 7 results — train 97.51, N03 perfect.

6fe7f8b

sirreal added 30 commits June 13, 2026 17:36

Run fallback policy checkpoint

c5dacae

Document serialization rewrite fallback policy

27c764f

Score serialization fallback source edit

ac41d64

Test text policy decision table scratch variant

8441f6b

Run text policy checkpoint

44facea

Document HTML Processor text extraction policy

29a148a

Score text extraction policy source edit

09aed17

Test read-only extraction completion policy

9aaa0ce

Run read-only policy checkpoint

feca7c2

Calibrate lower reasoning weak tier

5edc48f

Teach audit weak-tier ladder

65e60e6

Calibrate mini high weak tier

2e163c0

Calibrate mini low weak tier

b6dd751

Run serialization fallback A/B control

33764b4

Test serialization rewrite fallback card

feda3a6

Document HTML Processor rewrite fallback policy

1c0fabd

Record source score subject tier

1107adb

Score rewrite fallback source edit

b8ca8b2

Teach audit checkpoint next action

75137f5

Allow prepared round artifacts in audit

d019629

Checkpoint rewrite fallback source edit

14ce436

Teach audit diagnostic subset lifecycle

f5f875a

Run traversal boundary A/B control

29eb033

Run traversal boundary A/B variant

f791b3e

Run bounded traversal A/B variant

625debc

Probe traversal contract discoverability

a1286c1

Clarify weak-tier probe followup

abe45c5

Probe attribute value contract

0d80a17

Pause exhausted traversal diagnostics

e8c85ad

Clarify signal exhaustion pause

727e824

sirreal wants to merge 193 commits into trunk from html-api-docs-improvement
