Add property tests for WP_Token_Map#48

Draft: sirreal wants to merge 6 commits into trunk from token-map-properties


sirreal commented Jun 10, 2026

Owner

What
Adds deterministic property tests for WP_Token_Map and fixes divergences they exposed.

Why
Existing tests covered fixture behavior, but not adversarial or generated token sets, non-default key_length values, ASCII-case-insensitive byte semantics, or escaping of generated source.

Details

  • Adds generated-property coverage for:
    • contains() vs a linear reference
    • read_token() vs a longest-match reference
    • nested-prefix greedy matching
    • to_array() / from_array() behavioral round-trips
    • precomputed PHP source round-trips
    • ASCII-only case-insensitive matching with high bytes
  • Fixes key_length=1 export in to_array().
  • Implements ASCII-only case folding instead of relying on PHP case-insensitive string helpers.
  • Handles folded group-key collisions in ASCII-insensitive lookup while preserving case-sensitive fast paths.
  • Fixes short-token reads against empty/too-short text.
  • Escapes generated precomputed PHP source for quotes, backslashes, $, control bytes, and high bytes.
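
The reference checks named in the list above can be pictured as small brute-force oracles. The following is a minimal sketch only: `ref_ascii_fold()`, `ref_contains()`, and `ref_read_token()` are hypothetical names, not the helpers in this PR's test file, and the sketch assumes tokens are non-empty byte strings held in a plain list.

```php
<?php
// Hypothetical reference oracles for comparison against WP_Token_Map.
// They are deliberately slow and obvious, not the PR's actual test helpers.

/** Folds only the bytes A-Z to a-z; every other byte, including high bytes, is untouched. */
function ref_ascii_fold( string $bytes ): string {
	return strtr( $bytes, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
}

/** Linear-scan oracle for contains(). */
function ref_contains( array $tokens, string $word, bool $ignore_case = false ): bool {
	foreach ( $tokens as $token ) {
		$token = (string) $token;
		$equal = $ignore_case
			? ref_ascii_fold( $token ) === ref_ascii_fold( $word )
			: $token === $word;
		if ( $equal ) {
			return true;
		}
	}
	return false;
}

/** Longest-match oracle for read_token(): the longest token found at $offset, or null. */
function ref_read_token( array $tokens, string $text, int $offset = 0, bool $ignore_case = false ): ?string {
	$best = null;
	foreach ( $tokens as $token ) {
		$token     = (string) $token;
		$length    = strlen( $token );
		$candidate = (string) substr( $text, $offset, $length );
		if ( strlen( $candidate ) !== $length ) {
			continue; // Not enough bytes remain at this offset.
		}
		$equal = $ignore_case
			? ref_ascii_fold( $candidate ) === ref_ascii_fold( $token )
			: $candidate === $token;
		if ( $equal && ( null === $best || $length > strlen( $best ) ) ) {
			$best = $token;
		}
	}
	return $best;
}
```

A property test would then generate token sets and probe strings from a fixed seed and assert that the map and the oracle agree on every probe.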

Verification

  • vendor/bin/phpcs --standard=phpcs.xml.dist src/wp-includes/class-wp-token-map.php tests/phpunit/tests/wp-token-map/wpTokenMapProperties.php
  • WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group token-map
  • WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group html-api-token-map

Trac ticket:

Use of AI Tools

Yes! Codex GPT 5.5, Claude Fable 5, others.


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal force-pushed the token-map-properties branch from ee891ae to 3572d0d on June 10, 2026 21:46

sirreal commented Jun 10, 2026

Owner Author

1. Array Export Key Length
Commit: 05cfed923c Fix WP_Token_Map array export key length
Code: class-wp-token-map.php

Bug: to_array() reconstructed long-token prefixes using a hard-coded prefix length of 2. That only works when key_length === 2. For maps built with another key length, especially key_length === 1, exported long-token keys were wrong. In the one-byte case, the exported prefix could include the NUL group separator, producing keys like "a\0b" instead of "ab".

What would break: Any caller round-tripping through to_array() and from_array() could get a semantically different token map. This affects persistence, debugging, generated tables, and any verification that expects to_array() to faithfully represent the map.

Fix: Use $this->key_length as the prefix byte length, while still using $this->key_length + 1 as the record stride.

Why important: WP_Token_Map supports variable group-key lengths. Export must preserve that invariant or every downstream round-trip can silently corrupt keys.
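
To make the prefix-versus-stride distinction concrete, here is a minimal sketch. It assumes a packed layout in which each group record is `key_length` prefix bytes followed by one NUL separator, as the description above implies; both function names are hypothetical and neither is the code in class-wp-token-map.php.

```php
<?php
// Hypothetical layout: each group record is `key_length` prefix bytes plus one NUL separator.

/** Reads prefixes using the map's key length, as the fix describes. */
function sketch_group_prefixes( string $groups, int $key_length ): array {
	$stride   = $key_length + 1; // Prefix bytes plus the NUL separator.
	$prefixes = array();
	for ( $at = 0; $at + $stride <= strlen( $groups ); $at += $stride ) {
		$prefixes[] = substr( $groups, $at, $key_length );
	}
	return $prefixes;
}

/** Reproduces the described bug: the prefix length is hard-coded to 2. */
function sketch_group_prefixes_hardcoded( string $groups, int $key_length ): array {
	$stride   = $key_length + 1;
	$prefixes = array();
	for ( $at = 0; $at + $stride <= strlen( $groups ); $at += $stride ) {
		$prefixes[] = substr( $groups, $at, 2 ); // Wrong unless $key_length is 2.
	}
	return $prefixes;
}
```

With `key_length` of 1 and groups `"a\0b\0"`, the hard-coded variant yields `"a\0"` and `"b\0"`, so an exported key built from the first prefix and the suffix `b` becomes `"a\0b"` rather than `"ab"`.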

2. read_token() Bounds And Fallbacks
Commit: d875beeb79 Fix WP_Token_Map read_token bounds
Code: class-wp-token-map.php (four linked locations)

Bug: read_token() decided whether to attempt long-token lookup from the total document length, not the remaining length at $offset. Near the end of a string, it could enter long-token logic even when there were not enough bytes left for a long-token group key.

Bug: If the would-be long-token group key contained NUL, lookup returned null immediately instead of falling back to small-token lookup. That meant a valid small token could be missed just because following document bytes contained NUL.

Bug: read_small_token() assumed $search_text[0] and later $search_text[$adjust] existed. Empty or truncated input at the offset could trigger out-of-range string offset warnings and let comparisons proceed against bytes that do not exist.

What would break: Parsers calling read_token() while scanning arbitrary input could miss valid short tokens, especially at offsets near NUL bytes or near the end of input. They could also emit warnings on truncated probes.

Fix: Gate long-token lookup on $text_length - $offset, fall back to small-token lookup when the long group key is invalid due to NUL, return null for empty small-token probes, and check each searched byte exists before comparing it.

Why important: read_token() is used as a streaming/parser primitive. It must be safe at every byte offset, including EOF-adjacent offsets and malformed input.
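
The gating described in the fix can be sketched as a small decision function. This is an illustration under stated assumptions, not the PR's code: `sketch_lookup_plan()` is a hypothetical name, and it assumes long tokens are those with more than `key_length` bytes.

```php
<?php
/**
 * Hypothetical sketch: decides which lookups are safe at $offset.
 * Assumes long tokens are strictly longer than $key_length bytes.
 */
function sketch_lookup_plan( string $text, int $offset, int $key_length ): string {
	// Measure what remains at the offset, not the whole document.
	$remaining = strlen( $text ) - $offset;

	if ( $remaining <= 0 ) {
		return 'none'; // Empty probe: nothing can match.
	}

	if ( $remaining <= $key_length ) {
		return 'small-only'; // Too few bytes left for any long token.
	}

	// A group key containing NUL cannot name a group, but a small token may still match.
	$group_key = substr( $text, $offset, $key_length );
	if ( false !== strpos( $group_key, "\x00" ) ) {
		return 'small-only';
	}

	return 'long-then-small';
}
```

For example, with `key_length` of 2 and the text `xxa` probed at offset 2, a check on the total length (3) would allow a long lookup, while the remaining length (1) correctly restricts the probe to small tokens.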

3. ASCII Matching
Commit: 2110e539bd Fix WP_Token_Map ASCII matching
Code: class-wp-token-map.php (four linked locations)

Bug: Small-token contains() used stripos() over the packed fixed-record storage. That can search across record boundaries rather than only at token-record starts.

Bug: Short-token ASCII-insensitive matching had a parenthesization error: it effectively uppercased a boolean comparison instead of uppercasing the stored byte. A token like ab could fail to match AB.

Bug: Case folding needed to be explicitly ASCII-only and byte-oriented. Non-ASCII bytes must stay literal.

What would break: ASCII-case-insensitive lookup could produce false negatives for short tokens, false positives from matches that straddle packed-record boundaries, and incorrect results around high bytes wherever comparison was not explicitly byte-preserving.

Fix: Scan small-token storage by fixed record boundaries; add matches_at() for length-checked comparisons; add ascii_lowercase() that folds only A-Z; use those helpers consistently for ASCII-insensitive contains/read paths.

Why important: The API mode is specifically ascii-case-insensitive, not Unicode-insensitive or locale-sensitive. Token maps contain raw byte strings, so matching must be deterministic byte logic.
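
A minimal sketch of the two ideas in this fix, ASCII-only folding and scanning only at record starts, follows. It assumes small tokens are stored as fixed records of `key_length + 1` bytes padded with NUL; the function names are hypothetical and only loosely mirror the `ascii_lowercase()` helper named above.

```php
<?php
/** Hypothetical stand-in for an ASCII-only fold: A-Z become a-z, all other bytes stay literal. */
function sketch_ascii_lowercase( string $bytes ): string {
	return strtr( $bytes, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
}

/**
 * Hypothetical small-token lookup that compares only at record starts.
 * Assumes each record is the token padded with NUL to $key_length + 1 bytes.
 */
function sketch_small_contains( string $small_words, int $key_length, string $word, bool $ignore_case ): bool {
	if ( '' === $word || strlen( $word ) > $key_length ) {
		return false;
	}

	$stride = $key_length + 1;
	$term   = str_pad( $word, $stride, "\x00" );
	if ( $ignore_case ) {
		$term = sketch_ascii_lowercase( $term );
	}

	// Step by whole records so a match can never begin in the middle of one.
	for ( $at = 0; $at + $stride <= strlen( $small_words ); $at += $stride ) {
		$record = substr( $small_words, $at, $stride );
		if ( $ignore_case ) {
			$record = sketch_ascii_lowercase( $record );
		}
		if ( $record === $term ) {
			return true;
		}
	}

	return false;
}
```

Because the fold touches only A-Z, the UTF-8 bytes of `é` (`C3 A9`) and `É` (`C3 89`) never compare equal, which is the byte-literal behavior the section calls for.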

4. Folded Group-Key Collisions
Commit: fc0d5febcc Handle WP_Token_Map folded group keys
Code: class-wp-token-map.php (four linked locations)

Bug: Long tokens are grouped by a fixed-length prefix. In ASCII-insensitive mode, multiple stored group keys can fold to the same lookup key, e.g. Ab and aB. The old code used a single stripos() result, so it checked only the first folded-equivalent group.

What would break: Tokens in later folded-equivalent groups were invisible to contains() and read_token() in ASCII-insensitive mode. Worse, read_token() could return a shorter or wrong token because it never inspected all possible folded groups.

Fix: Add find_group_indexes() to return all matching group indexes in ASCII-insensitive mode. contains() checks every matching group. read_token() checks all folded-equivalent groups and returns the longest actual match. Case-sensitive lookup keeps the single exact-group fast path.

Why important: Longest-match behavior is a core parser invariant. Missing one folded-equivalent group can produce wrong tokenization, not just a missed optimization.
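
The collision handling can be sketched as a lookup that returns every folded-equivalent group rather than the first one. This is a hedged illustration: `sketch_find_group_indexes()` is a hypothetical stand-in for the `find_group_indexes()` helper named above, and it assumes the same record layout of `key_length` prefix bytes plus one NUL separator.

```php
<?php
/** Hypothetical ASCII-only fold: A-Z become a-z, all other bytes stay literal. */
function sketch_fold( string $bytes ): string {
	return strtr( $bytes, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
}

/**
 * Hypothetical sketch: returns the index of every group whose prefix matches $group_key.
 * Case-sensitive lookups stop at the first exact match, since at most one group can match.
 */
function sketch_find_group_indexes( string $groups, int $key_length, string $group_key, bool $ignore_case ): array {
	$stride  = $key_length + 1; // Prefix bytes plus the NUL separator.
	$needle  = $ignore_case ? sketch_fold( $group_key ) : $group_key;
	$count   = intdiv( strlen( $groups ), $stride );
	$indexes = array();

	for ( $index = 0; $index < $count; $index++ ) {
		$prefix = substr( $groups, $index * $stride, $key_length );
		if ( $ignore_case ) {
			$prefix = sketch_fold( $prefix );
		}
		if ( $prefix === $needle ) {
			$indexes[] = $index;
			if ( ! $ignore_case ) {
				break; // Exact-match fast path.
			}
		}
	}

	return $indexes;
}
```

A caller in the insensitive mode would then inspect the tokens in every returned group and keep the longest one that actually matches at the offset.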

5. Precomputed Source Escaping
Commit: 505e46b560 Escape WP_Token_Map precomputed source
Code: class-wp-token-map.php (four linked locations)

Bug: precomputed_php_source_table() emitted raw token/mapping bytes into generated PHP double-quoted strings and comments. Escaping was partial: some mapping bytes were handled, but tokens, group strings, small mappings, $, high bytes, and comment-sensitive bytes were not consistently escaped.

What would break: Generated PHP source could become syntactically invalid or semantically different. Examples: " can terminate a string, \ can introduce escapes, $ can trigger interpolation, control/high bytes can be unstable in source, and ?> inside a generated comment can close PHP mode.

Fix: Build long-group binary data with pack() and raw concatenation, then escape the complete string literal. Escape groups, small words, and small mappings through the same string-literal helper. Escape generated comment text separately, including ?, backslash, control bytes, and high bytes.

Why important: Precomputed tables are meant to be pasted/evaluated as PHP source for fast static loading. If source generation is not byte-safe, the optimization path can corrupt maps or generate invalid PHP for perfectly valid token data.
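
One way to make a double-quoted literal byte-safe is sketched below. It is an assumption-laden illustration, not the helper added in this PR: `sketch_php_string_literal()` is a hypothetical name, and the choice of two-digit `\x` escapes for control and high bytes is this sketch's own.

```php
<?php
/**
 * Hypothetical sketch: renders arbitrary bytes as a PHP double-quoted string literal.
 * Quotes, backslashes, and `$` are backslash-escaped; control and high bytes become \xHH.
 */
function sketch_php_string_literal( string $bytes ): string {
	$out    = '"';
	$length = strlen( $bytes );

	for ( $i = 0; $i < $length; $i++ ) {
		$byte = $bytes[ $i ];
		$code = ord( $byte );

		if ( '"' === $byte || '\\' === $byte || '$' === $byte ) {
			// Prevents early termination, stray escapes, and variable interpolation.
			$out .= '\\' . $byte;
		} elseif ( $code < 0x20 || $code >= 0x7F ) {
			// Always two hex digits, so a following hex digit is never absorbed.
			$out .= sprintf( '\\x%02X', $code );
		} else {
			$out .= $byte;
		}
	}

	return $out . '"';
}
```

A round-trip property for such a helper is that evaluating the generated literal as PHP source returns exactly the original bytes, for every possible byte value. Generated comment text would need its own, separate escaping, as the fix notes.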
