Html api docs improvement #61
Draft
sirreal wants to merge 193 commits into
Scaffolding for the autonomous documentation-improvement loop:
- PLAN.md records the full agreed design (corpus, scoring, isolation, harness, round flow, revert and stopping rules).
- render-docs-markdown.py deterministically renders phpdoc-parser JSON to agent-readable markdown, excluding implementation leakage.
16 tasks (12 train + 4 held-out), each with a subagent-facing prompt, a validated reference implementation, and frozen hidden test cases. Expected outputs were generated from the references and cross-checked against PHP's Dom\HTMLDocument where semantics overlap (text extraction, links, tables, outlines) — all agree. Harness executes candidates standalone (no WordPress boot) with shims for the six WP functions the html-api files reference; each test case runs in an isolated subprocess with a 10s timeout so parse errors, fatals, and infinite loops are contained and reported.
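The per-case isolation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's harness: the function name `run_case` and the result dictionary are hypothetical, and the child process here is the Python interpreter (so the sketch runs anywhere) rather than PHP.

```python
import subprocess
import sys


def run_case(code, timeout_s=10.0):
    """Run one test case in its own subprocess and classify the outcome.

    Hypothetical helper: the real harness executes PHP candidates; this
    stand-in launches the current Python interpreter instead.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # an infinite loop is killed and reported
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "stderr": ""}
    # A non-zero exit code stands in for parse errors and fatals.
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

Because every case gets a fresh process, a crash or hang in one case cannot affect the next one; the caller only ever sees a status and the captured output.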
- stage-round.sh: regenerate JSON, render markdown, stage isolated scratch dir containing only the two markdown files.
- docs-only-guard.php: comment-stripped token-stream identity vs HEAD plus php -l, run before every round that follows doc edits.
- aggregate-round.py: trial/task/round scoring per PLAN.md formula.
- PROTOCOL.md: runbook with exact test-subagent and judge prompt templates, judge rubric, and results layout.
- docs-test-subject agent definition (Read+Grep only) for structural isolation in future sessions.

Pilot validated end-to-end: Sonnet test subject on T01 returned well-formed output passing 8/8 hidden cases.
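The idea behind a comment-stripped token-stream check can be illustrated with a small analog. This sketch is an assumption-laden stand-in: it tokenizes Python source with the standard `tokenize` module, whereas docs-only-guard.php works on PHP files, and the helper names are hypothetical.

```python
import io
import tokenize

# Token kinds that carry no executable meaning in this analog:
# comments, and the bare newlines that follow comment-only lines.
IGNORED = {tokenize.COMMENT, tokenize.NL}


def code_tokens(source):
    """Return the (type, text) stream of a source string, minus comments."""
    stream = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tok.type, tok.string) for tok in stream if tok.type not in IGNORED]


def is_comment_only_change(before, after):
    """True when two versions differ only in comments (hypothetical helper)."""
    return code_tokens(before) == code_tokens(after)
```

Comparing token streams instead of raw text means whitespace and comment edits pass, while any change to an identifier, literal, or operator fails the guard.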
trials-workflow.js fans out one docs-only test subject per task-trial with structured output; judge-workflow.js fans out one Opus judge per task with the adherence rubric and doc-gap analysis; persist-trials.py writes candidates to results/ and executes them against hidden tests.
48 Sonnet trials (16 tasks x 3) judged by 16 Opus judges. TRAIN 93.57 / HELD-OUT 93.47. Dominant systematic failure: undocumented closer-token depth semantics plus missing subtree-walk idiom (T03, T06, H04). Secondary: get_modifiable_text() decoding unstated (T08, H04); serialize_token() rewrite idiom undocumented (T12); misleading tables-unsupported bullet (T08).
Round-0 failures in T03, T06, and held-out H04 shared one root cause: nothing documents that a closing-tag token reports the PARENT's depth (the element is already popped when matched on its closer). All three T03 trials lost trailing text after nested elements by breaking their walk loops at 'depth <= opener depth'. get_current_depth(): state the closer rule explicitly, define depth as breadcrumb count including non-element tokens, extend the existing example through the closing tokens, and add the canonical visit-every-token-inside-an-element loop (depth >= opener depth). is_tag_closer() (HTML Processor): note that breadcrumbs and depth reflect the parent context when matched on a closer.
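The closer-depth rule can be shown with a toy model. The sketch below does not call the HTML API; it assigns depths to a hand-written token list using the three rules stated above (an opener counts itself, a non-element token adds one breadcrumb, a closer reports its parent's depth). All names are hypothetical.

```python
def with_depths(tokens):
    """Attach a depth to each (kind, value) token, following the stated rules."""
    open_elements = 0
    out = []
    for kind, value in tokens:
        if kind == "open":
            open_elements += 1
            depth = open_elements        # the opener counts itself
        elif kind == "close":
            open_elements -= 1
            depth = open_elements        # already popped: the parent's depth
        else:  # "text"
            depth = open_elements + 1    # non-element tokens add a breadcrumb
        out.append((kind, value, depth))
    return out


def collect_text(tokens, start, keep_going):
    """Gather text after the opener at `start` while keep_going(depth, opener_depth)."""
    opener_depth = tokens[start][2]
    parts = []
    for kind, value, depth in tokens[start + 1:]:
        if not keep_going(depth, opener_depth):
            break
        if kind == "text":
            parts.append(value)
    return "".join(parts)


# Models <div>a<b>x</b>c</div>after
STREAM = with_depths([
    ("open", "DIV"), ("text", "a"), ("open", "B"), ("text", "x"),
    ("close", "B"), ("text", "c"), ("close", "DIV"), ("text", "after"),
])

correct = collect_text(STREAM, 0, lambda depth, opener: depth >= opener)
# Equivalent to breaking at 'depth <= opener depth'.
buggy = collect_text(STREAM, 0, lambda depth, opener: depth > opener)
```

In this model the `</b>` closer reports the DIV's depth, so the strict comparison ends the walk there and drops the trailing "c", while `>=` continues until the DIV's own closer reports a smaller depth.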
…_token().
The docblock described the method as internal ('do not use') and steered
readers to the Tag Processor 'for access to the raw tokens' — the
opposite of the right guidance for structure-aware text collection,
which round-0 judges identified as a driver of the T06 failures (two of
three trials collected nothing).
Rewrite the description: define tokens, position next_token() as the
right tool when non-tag content matters alongside structure, document
that closers are visited for every opener (including implicit and
end-of-input closes), warn that text may split across consecutive #text
tokens, and add the canonical collect-text-of-an-element example in both
depth-guard and breadcrumbs-guard forms (both verified by execution).
@since history left as-is.
…coded text. Round-0 judges (T08, H04) flagged that nothing states whether the returned text has character references decoded — the single most load-bearing fact for text extraction. Several subjects bolted on a redundant html_entity_decode() pass, which double-decodes and corrupts text like '&amp;'. State the decoding rule with its boundaries (decoded for #text and RCDATA elements like TEXTAREA/TITLE; verbatim for raw text SCRIPT/STYLE and comment interiors — all verified by execution), add a one-line example, and note the set_modifiable_text() inverse so callers work in decoded space on both sides.
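The double-decoding hazard is easy to reproduce outside PHP. The sketch below uses Python's `html.unescape` as a stand-in for a redundant `html_entity_decode()` pass; the sample string is made up for illustration.

```python
import html

# Source markup that is meant to display the literal text "&amp;".
source_text = "Write &amp;amp; to show an ampersand entity"

# One decode is what an already-decoding getter would return.
decoded_once = html.unescape(source_text)

# A second decode, like the bolted-on pass, destroys the literal entity.
decoded_twice = html.unescape(decoded_once)
```

After one pass the text correctly reads `&amp;`; after the second it collapses to `&`, which is why callers of a decoding getter should not decode again.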
TRAIN 98.78 (+5.21 vs baseline). 36/36 trials passed every hidden case. T03 +13.95 (closer-depth rule + subtree-walk example), T06 +46.33 (next_token() rehabilitation), no regressions beyond judge noise. Sonnet has plateaued >=90 for two consecutive rounds; next step per plan is the Haiku re-baseline. Round-2 adherence targets logged.
Task-first rebalance: add six tasks forcing undercovered concepts (class removal, contextual selection, truncated-input detection, normalize() failure handling, full-document parsing, HTML-vs-SVG image namespace). New held-out set: N01/N02/N05/H04; H01-H03 retired; T01/T02 relabeled smoke. Every task now carries role/commonness/concept/processor labels and the aggregator reports per-concept means. All new references harness-validated; N02/N05/N06 cross-checked against Dom\HTMLDocument (covering image->img conversion and img-breaking-out-of-svg).
Two of three Haiku trials on the build-figure task produced correct markup with src/alt swapped and scored 0/6 — the docs never explain where set_attribute() puts attributes. Verified by execution: updates replace in place keeping position; NEW attributes insert after the tag name before existing ones; multiple new attributes sort by attribute name regardless of call order. Document all three rules plus the start-from-a-template idiom for when output order matters. Also fixes a judge-discovered bug in the paused_at_incomplete_token() example, which called the nonexistent get_next_tag() instead of next_tag().
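The three ordering rules can be modeled in a few lines. This is a toy model of the behavior described above, not the Tag Processor's implementation: `apply_updates` is a hypothetical function, and attribute names are assumed to be lowercase already.

```python
def apply_updates(existing, updates):
    """Model the described attribute ordering.

    existing: list of (name, value) pairs in source order.
    updates:  list of (name, value) pairs in call order.
    """
    existing_names = {name for name, _ in existing}
    latest = dict(updates)  # a later call for the same name wins

    # Rule 1: updates to existing attributes replace in place.
    updated = [(name, latest.get(name, value)) for name, value in existing]

    # Rule 3: new attributes sort by name, regardless of call order.
    new = sorted(
        (name, value) for name, value in latest.items()
        if name not in existing_names
    )

    # Rule 2: new attributes come right after the tag name, before existing ones.
    return new + updated


# Starting from a bare <img>: call order src, alt yields alt before src.
from_bare = apply_updates([], [("src", "a.png"), ("alt", "A")])

# Starting from a template <img src="" alt="">: positions are kept.
from_template = apply_updates(
    [("src", ""), ("alt", "")], [("alt", "A"), ("src", "a.png")]
)
```

The second call shows why the start-from-a-template idiom helps when output order matters: pre-seeded attributes fix the positions, and later updates only change the values.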
…claims. The class docblock claimed the HTML Processor cannot process any element inside a TABLE, any foreign content (SVG/MathML), or anything outside the IN BODY insertion mode. All three claims are false on this branch — round-2 trials parsed well-formed tables, SVG content, and full documents with head content; judges traced T08's defensive fallback code directly to this passage. Replace with verified behavior: the processor parses these fine and aborts only on specific constructs — foster-parented content (e.g. a DIV directly inside TABLE) and mis-nested formatting requiring advance-and-rewind reconstruction (e.g. '<b>one<i>two</b>three</i>'), both confirmed by execution, with simple mis-nesting supported. Also document how aborts surface: get_last_error(), get_unsupported_exception(), and null from serialize()/normalize().
Round-1 judges (T12) flagged that nothing connects serialize_token() to its purpose: subjects mixed token loops with whole-string normalize(), unsure which was right. Document that concatenating serialize_token() across a next_token() walk reproduces serialize(), that the token-by-token form exists for selective rewriting (skip to remove, emit around to wrap), and that closers of skipped elements must be skipped too — with an execution-verified removal example. Cross-reference guidance: serialize() for unchanged output, the loop for transformations.
All-19 91.47 / core 90.47 / train 92.56 / held-out 87.38. Round-1 edits transfer to Haiku (T03, T06 perfect). Per-concept reporting exposes the gaps the aggregate hides: attributes 72.2 (set_attribute ordering), full-document 78.0 (held-out, no edit made), namespace 85.9. Round-3 hypothesis edits committed separately.
ingest-trials.py and ingest-judges.py condense per-round bookkeeping (persist, execute, aggregate, compare, gap digest with held-out gaps marked DO-NOT-ACT) into single commands, keeping orchestration overhead low across the 100+ round goal.
…d edits. Round 3 confirmed the serialize_token() idiom (round-3 H3) helped its targets (T09 +8.6, T12 +2.2) but induced a T07 regression (-33.7): two trials called serialize() after add_class(), got null (scanning had begun), and returned the unmodified input. Refining rather than reverting, disclosed in LOG: state the boundary explicitly on both serialize() and serialize_token() — queued attribute/class/text updates are read with the inherited get_updated_html(); serialize() demands a fresh processor and returns null once scanning has begun; serialization is for normalizing/rewriting, get_updated_html() for edits.
Two of three round-3 trials on the build-figure task produced empty captions: they matched the empty FIGCAPTION tag and called set_modifiable_text(), which returns false there — ordinary container elements carry no text of their own and an empty element has no #text token to modify. Nothing documented this. State the eligible token kinds, the empty-element limitation, the check-the-return-value rule, and the placeholder-template idiom (verified by execution).
…X idiom.
T10 adherence sat at ~80 because the set_bookmark() docblock forbids
programmatic names without stating the supported alternative; subjects
hedged with bookmark-count workarounds. State explicitly that
re-setting an existing name MOVES the bookmark (no leak, no release
needed) and that same-name-per-match is the idiom for tracking the
last occurrence in one pass (verified by execution; the docblock's
own last-li example already relied on it silently).
Also state the documented default for next_tag()'s tag_closers option
('skip'), which round-3 judges flagged as unstated.
…, two new gaps. All-19 87.41 / train 90.66 (-1.9) / held-out 75.22. Round-3 edits helped their targets (T09 +8.6, T12 +2.2, N06 +10.7, N04 100) but the serialization idiom induced T07 -33.7 (serialize() after mutations). Refined rather than reverted, with the boundary now stated. Round-4 hypotheses committed separately.
T04 trials each absorbed exactly one of the two template-building facts (pre-seeded attribute order in set_attribute(), placeholder text in set_modifiable_text()) and failed on the other — the facts live in two distant method docblocks. Add a 'Building markup from a template' section to the class overview, where template builders first look, stating both rules together with one execution-verified example using a link template (deliberately unlike any corpus task).
…decoded reads; add_class idempotency. Judges across four tasks flagged the same unstated guarantees subjects kept inferring (correctly, but unguided):
- next_tag(): tag-name matching is ASCII case-insensitive with source casing preserved; comments/CDATA/rawtext can never match; truncated trailing tags are never matched or modified (cross-ref paused_at_incomplete_token()). Stated as a 'What this matches' block.
- get_attribute(): string values come back DECODED (don't decode again), inverse of set_attribute's encode-on-write.
- add_class(): creates/appends without disturbing existing classes; re-adding an existing class is a no-op with an exact byte-for-byte duplicate check (add 'NOTE' to class="note" appends; verified. An initial case-insensitive claim was caught wrong by probe before commit).
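The add_class() duplicate rule from the list above can be sketched as a toy function. This models only the described outcome; the function name, the single-space separator, and the use of Python's whitespace splitting are assumptions, not the processor's code.

```python
def add_class(class_attr, name):
    """Return the class attribute value after adding `name`.

    class_attr is the current attribute value, or None when absent.
    The duplicate check is exact, so 'NOTE' and 'note' are different classes.
    """
    if not class_attr:
        return name                      # creates the attribute
    if name in class_attr.split():
        return class_attr                # exact duplicate: no-op
    return f"{class_attr} {name}"        # append, existing classes untouched
```

Calling it with `"note"` and `"NOTE"` appends rather than treating the two as the same class, matching the verified behavior noted above.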
T08 judges noted the only depth-bounded walk example nests one level, where >= and > behave identically, so readers can't learn which is right. State the rule: >= is correct at any depth; > ends the walk at the first direct-child closer (verified: with > the UL walk stops after the first LI's contents).
…oundary, get_updated_html identity, >= warning placement. Round-5's two single-trial collapses both trace to unstated boundaries: a T06 trial attempted tree-aware work in the Tag Processor (whose docs never say it lacks depth/breadcrumbs), and a T03 trial copied the next_token() example but guessed '>' because the >= warning only existed in get_current_depth().
- Tag Processor overview: 'Which processor should I use?' section stating it has NO tree awareness and where those methods live; HTML Processor overview gets the matching half.
- get_updated_html(): own description at last (was a copy of __toString's): read-your-edits semantics, byte preservation, safe mid-scan.
- next_token() example now carries the >= warning inline where the failing trial actually read.
…ocessor, >= beside the operator, drain idiom, add_class return semantics. Round-6 train gaps: the HTML Processor's own get_modifiable_text() override never stated decoding or that SCRIPT/STYLE/TEXTAREA/TITLE carry their text on the element token (no #text child) — stated now with a verified full-parser TITLE example; the >= rule now sits beside the operator in the get_current_depth() example with the nested-closer/sibling-text explanation inline; the paused_at_incomplete_token() example gains the drain-all-tokens idiom its single-tag example obscured; add_class() return documented as enqueued-not-applied (false only with no matched tag, verified).
…boundary rule. Round-7's only functional miss (T05 5/9) sliced multibyte text without an explicit encoding; the docs say UTF-8 is the only supported input but never said the output of get_modifiable_text() is UTF-8 nor showed the mb_* explicit-encoding idiom — stated on both classes now. T08's recurring boundary confusion appears in break-form code that the continue-form-only warning misses: stated the equivalence (break at < depth, never <= depth).
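The multibyte slicing failure has a direct analog in any language that separates bytes from characters. The sketch below uses Python bytes and strings to show it; in PHP the corresponding fix would be along the lines of `mb_substr( $text, 0, 3, 'UTF-8' )`, shown here only as a pointer and not taken from the PR.

```python
# "naïve café" written with escapes so the code points are unambiguous.
text = "na\u00efve caf\u00e9"
raw = text.encode("utf-8")

# Slicing by bytes cuts the two-byte "ï" in half.
try:
    raw[:3].decode("utf-8")
    byte_slice_is_valid = True
except UnicodeDecodeError:
    byte_slice_is_valid = False

# Slicing by characters keeps every code point whole.
char_slice = text[:3]
```

The byte slice is not valid UTF-8, while the character slice is; stating the encoding explicitly is what moves a PHP string function from the first behavior to the second.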
…e last-X bookmark idiom. T08's recurring failure class is nested walk loops double-advancing the single cursor: the inner collect-until-close loop exits already matched on the next region's boundary token, which the outer loop's next_token() then skips. Document the one-cursor contract on next_token() with the closer-driven single-pass state-machine shape (verified DT example), noting it stays reliable on malformed input because closers are always visited. Also surface the re-set-the-same-bookmark-name idiom in the overview bookmarks narrative where T10 trials kept missing it.
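The double-advance problem is a general property of sharing one cursor between nested loops, and a Python iterator reproduces it. This sketch is a toy model with a hand-written token list and hypothetical function names; it is not the verified DT example from the docblock.

```python
# Models <dt>A</dt><dt>B</dt><dd>1</dd> as (kind, value) tokens.
TOKENS = [
    ("open", "DT"), ("text", "A"), ("close", "DT"),
    ("open", "DT"), ("text", "B"), ("close", "DT"),
    ("open", "DD"), ("text", "1"), ("close", "DD"),
]


def nested_walk(tokens):
    """Buggy shape: the inner loop consumes the next region's boundary token."""
    cursor = iter(tokens)  # one shared cursor, like a single processor
    terms = []
    for kind, value in cursor:
        if (kind, value) == ("open", "DT"):
            parts = []
            for inner_kind, inner_value in cursor:
                if inner_kind == "open":
                    break  # boundary token is consumed; the outer loop never sees it
                if inner_kind == "text":
                    parts.append(inner_value)
            terms.append("".join(parts))
    return terms


def single_pass_walk(tokens):
    """Closer-driven state machine: one loop, every token seen exactly once."""
    terms = []
    current = None  # text parts of the DT being collected, if any
    for kind, value in tokens:
        if kind == "open" and value == "DT":
            current = []
        elif kind == "close" and value == "DT" and current is not None:
            terms.append("".join(current))
            current = None
        elif kind == "text" and current is not None:
            current.append(value)
    return terms
```

The nested version loses the second term because its opener was eaten as the first term's boundary; the single-pass version makes every decision from the token currently under the cursor, so nothing is skipped.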
Trac ticket:
Use of AI Tools
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.