Skip to content

Package simplification pass + main-section layout surfacing#142

Merged
gitronald merged 8 commits into
feature/v0.9.0from
claude/simplify-websearcher-pkg
May 31, 2026
Merged

Package simplification pass + main-section layout surfacing#142
gitronald merged 8 commits into
feature/v0.9.0from
claude/simplify-websearcher-pkg

Conversation

@gitronald
Copy link
Copy Markdown
Owner

Summary

A quality-focused pass over the WebSearcher package that started as a /simplify review and grew into surfacing, hardening, and cleaning up the main-section layout machinery. Every step is verified green (ruff, ruff format, pyrefly, pytest) and lands incrementally so each commit is reviewable on its own.

Net: 315 tests passing (up from 296), 66 snapshots intact.

What's here, commit by commit

  1. Simplification pass — behavior-preserving cleanups from a 4-angle review (reuse / simplify / efficiency / altitude):

    • knowledge_box: drop two dead first_text computations that were unconditionally overwritten (flagged by all four angles); hoist a function-local import.
    • general classifier: compute class_tokens only in the branch that uses it.
    • footer classifier: natural run-then-check loop.
    • general.py parsers: hoist per-call re.compile to module constants.
    • components.py: drop dead result_counter; fix to_records loop var shadowing the Component class.
    • utils.get_domain: collapse to early returns.
  2. Remove dead Component.get_metadata — never called anywhere in code/tests/docs.

  3. Fix add_component to honor an explicit cmpt_rank of 0 — the not cmpt_rank guard treated a legitimate rank 0 as missing. Uses the is None sentinel. No internal caller passes cmpt_rank today, so it's a no-op on current behavior; pinned with two new tests.

  4. Add features.main_layout — the main-section layout label was internal-only dispatch state; now surfaced in parse_serp output. Additive (existing results consumers unaffected); regenerates snapshots with the new key.

  5. Pin every layout branch + fix no-rso duplication bug — the SERP fixtures only witness 3 of 10 observable layout_label values, so this adds characterization pins for the routing truth table, all standard-* sub-types, top-bars-*, left-bar, and no-rso. Fixes a real bug: extract_from_no_rso appended the page-level sec2 section once per sec1 block; hoisted out of the loop.

  6. Refactor standard-* ladder into a data-driven table — a 4-branch, ~95-line if-ladder (with a 6× copy-pasted preamble) collapses into a _StandardLayout table + two shared extraction shapes. Detection and extraction now iterate the same table, so they can no longer drift. Guarded by the pins from the previous commit.

  7. Rename standard-N labels to observable names + add plan 030 — opaque numeric suffixes become names derived from the kp-wp-tab-* container each detects (standard-overview/-songs/-sports-standings/-airfares, plus standard-fallback). docs/plans/030-main-layout-known-issues.md documents the deferred layout issues (left-bar document-wide scope, dead fallback body, rso shadowing, the layout_label dual role), all blocked on capturing witnessed fixtures.

Notes for review

  • No behavior change is intended in commits 1–2. Commit 3 only differs for an explicit cmpt_rank=0, which no caller passes. Commits 4 and 7 change the parsed-output features block (additive key, then a value rename) — that's the only output-contract change, and it's covered by the regenerated snapshots.
  • Unwitnessed extraction logic was deliberately left alone. Several layout branches have no fixture coverage; rather than guess, they're pinned as-is and the open questions are tracked in plan 030.
  • The "standard" strings in ads.py / component_types.py are an unrelated ad/component sub_type, intentionally untouched.

https://claude.ai/code/session_01XQ6dz41mc3okJAbfFRpniF


Generated by Claude Code

claude added 7 commits May 31, 2026 01:10
Quality-only, behavior-preserving (296 passed, 66 snapshots unchanged,
pyrefly clean). Findings from a 4-angle review (reuse/simplify/
efficiency/altitude) of the WebSearcher package:

- classifiers/main.py knowledge_box: remove two dead first_text
  computations (splitlines + split fallback) that were unconditionally
  overwritten by the _iter_text_fragments walk; hoist the function-local
  import to module top. Flagged by all four review angles.
- classifiers/main.py general: compute class_tokens only in the branch
  that uses it (skip the alloc on the classless path).
- classifiers/footer.py: reorder the short-circuit loop to the natural
  run-then-check form.
- component_parsers/general.py: hoist per-call re.compile in
  parse_ratings/parse_product to module constants, matching the existing
  _ARIA_*_RE pattern.
- components.py: drop dead result_counter attribute; fix to_records loop
  var shadowing the Component class name.
- utils.py get_domain: collapse to early returns, building only the
  domain string actually returned.

Skipped (noted for follow-up): extractor_main standard-layout ladder->
table (hot path, subtle per-branch filtering, needs uncovered-layout
pins first); the endswith('locations') header special-case (registry-
field change is over-engineering for one marker); Component.get_metadata
removal (public API on a published lib) and the cmpt_rank==0 guard
(behavior-changing).
Never called anywhere in the package, tests, or docs, and carried a
mutable-default-arg smell (key_filter=[...]). Confirmed dead before
removal. 296 passed, 66 snapshots unchanged, pyrefly clean.
add_component used `if not cmpt_rank` to detect a missing rank, but the
sentinel is cmpt_rank=None -- so an explicit 0 was treated as missing and
silently replaced by the counter. Use `is None` to match the documented
sentinel. No internal caller passes cmpt_rank today (the only splat carries
elem/section/type), so this is a no-op on current behavior and a
correctness fix for external/future callers.

Add two ComponentList.add_component tests: auto-increment from 0, and the
explicit-zero pin (fails on the old `not cmpt_rank` form).

298 passed, 66 snapshots unchanged, pyrefly clean.
The main-section layout label was internal-only dispatch state on
ExtractorMain. Surface it as features.main_layout so the SERP layout type
is observable in parsed output (e.g. 'standard', 'standard-0').

- SERPFeatures: add main_layout: str | None = None
- parse_serp: set features.main_layout from extractor.main_handler
  .layout_label after extraction (no FeatureExtractor<->Extractor coupling)
- Regenerate 66 snapshots (additive: +1 key each; 62 standard, 3
  standard-0, 1 standard-4)
- Pin the wiring with test_features_expose_main_layout

Additive output change; existing results consumers are unaffected.
299 passed, pyrefly clean.
Establish characterization tests for every layout_label outcome before any
ladder refactor -- the fixtures only witness standard/standard-0/standard-4,
leaving 7 of 10 observable labels untested. Adds pins for:

- get_layout routing truth table (standard, no-rso, left-bar, top-bars,
  and the top-bars-over-left-bar precedence)
- standard-0/1/2/4 sub-type dispatch + the standard-3 empty fallback label
- top-bars-divs / top-bars-children
- left-bar (pins the current document-wide TzHB6b scope) and no-rso

Fix: in extract_from_no_rso the page-level div.WvKfwe.a3spGf (sec2) section
was appended inside the 'for div in sec1' loop, so its content was emitted
once per sec1 block. Hoist it out of the loop so it appears once per page.
Unwitnessed branch -> no snapshot churn.

315 passed (+16 pins), 66 snapshots unchanged, pyrefly clean.
The standard sub-layouts were a 4-branch if-ladder (~95 lines) whose
branches differed only in data: detection container/selectors, extraction
container, and which class token(s) to keep. The extract_top_divs preamble
was copy-pasted 6x across the file.

Collapse to a _StandardLayout table + two shared extraction shapes:
- shape A (keep_tokens set): direct children with the first matching token,
  top_divs + main_divs returned unfiltered (standard-0, standard-4)
- shape B (keep_tokens None): all text-inclusive children, then drop bad
  tags + empties from the combined column (standard-1, standard-2)

Detection now iterates the same table instead of a parallel dict, so the
detect/extract halves can no longer drift. Behavior-preserving: guarded by
the characterization pins added last commit (standard-0/1/2/4 + standard-3
fallback). 315 passed, 66 snapshots unchanged, pyrefly clean.
Rename the opaque numeric standard-N labels to names derived from the
observable kp-wp-tab-* container each detects (no guessing at content
meaning):

  standard-0 -> standard-overview         (kp-wp-tab-overview)
  standard-1 -> standard-songs            (kp-wp-tab-Songs)
  standard-2 -> standard-sports-standings (kp-wp-tab-SportsStandings)
  standard-4 -> standard-airfares         (kp-wp-tab-AIRFARES)
  standard-3 -> standard-fallback         (empty-rso fallback, no tab)

These surface in features.main_layout, so the rename updates 4 snapshots
(the witnessed overview/airfares SERPs) and the pins. top-bars-divs/
children, left-bar, no-rso are already observable-derived and unchanged.
Checked the codebase: the other 'standard' strings (ads.py, component_types
.py, test_ads.py) are an unrelated ad/component sub_type, left untouched.

Add docs/plans/030-main-layout-known-issues.md documenting the deferred
layout issues (left-bar document-wide scope, standard-fallback dead body,
left-bar/top-bars shadowing rso, layout_label dual role, dead defensive
code) -- all blocked on capturing witnessed fixtures.

315 passed, 66 snapshots, pyrefly clean.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR performs a cleanup/simplification pass across the parsing pipeline and makes the main-section layout label observable to downstream consumers by surfacing ExtractorMain.layout_label as features.main_layout in parse_serp output. It also hardens layout dispatch/extraction behavior with characterization tests and refactors the standard-* layout handling into a data-driven table.

Changes:

  • Simplify/refactor several parsers/classifiers/extractors (remove dead code, hoist regex compilation, reduce redundant work).
  • Surface main layout classification via SERPFeatures.main_layout and update snapshots accordingly.
  • Add characterization tests for layout routing/extraction, including fixing no-rso section duplication.

Reviewed changes

Copilot reviewed 78 out of 78 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
WebSearcher/utils.py Simplifies get_domain() to early-return logic and avoids intermediate variables.
WebSearcher/parsers.py Exposes ExtractorMain.layout_label as features.main_layout in parse_serp() output.
WebSearcher/models/features.py Adds main_layout field to SERPFeatures.
WebSearcher/extractors/extractor_main.py Refactors standard-* dispatch into _STANDARD_LAYOUTS, fixes no-rso sec2 duplication, and keeps layout labels consistent.
WebSearcher/components.py Removes dead fields/methods, fixes cmpt_rank=0 handling, and fixes to_records() shadowing.
WebSearcher/component_parsers/general.py Hoists per-call regex compilation into module constants.
WebSearcher/classifiers/main.py Avoids unnecessary class-token computation; uses shared _iter_text_fragments for faithful first-text detection.
WebSearcher/classifiers/footer.py Fixes classifier loop to “run then check” (avoids unreachable assignment ordering).
tests/test_parse_serp.py Adds assertion that parsed SERPs expose features.main_layout and pins witnessed distribution.
tests/test_extractor_main.py Adds layout routing/extraction characterization tests (including no-rso sec2 regression pin).
tests/test_components.py Adds tests covering ComponentList.add_component rank auto-assignment and explicit cmpt_rank=0.
tests/snapshots/test_parse_serp/test_parse_serp[01f85d1329ba].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[032572e185d3].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[0410241ce1e2].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[0d3fc3b49b76].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[0ed311025efc].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[130eba186e94].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[18eccfe8454e].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[2c0aa0bbcd0c].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[2d1b05a046b2].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[305b53af69be].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[30926d7c7ae9].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[30c5d6bdb650].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[39617f527744].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[3c03a4a2cb7c].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[3c09a0f0c92f].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[3f5efb1dc358].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[45b6e019bfa2].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[4c8d8d2f226c].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[53940e35cc92].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[56cbcf8cd4dc].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[56f2eab63e9d].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[5898b04fb534].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[6978d0cd767d].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[6aa70651b0cd].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[6dc5bc34ff55].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[6e206db14899].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[6e401e618433].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[7049404a2dd6].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[7333536d2911].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[7ad9715f3597].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[7b89c00120e3].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[7d76d3a83ebc].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[811a27f92284].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[82e35954f552].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[83b17a6a7750].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[8d1b75b71e7f].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[8e820f7b024f].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[8f98fa9c0bef].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[9101d12ab778].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[923a428c1c22].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[97404b7b7c61].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[984065877aad].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[9a7e39d95bf0].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[9ed1baa7715d].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[a6c881e003e2].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[a6c8fe7fe769].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[aa594f199c3d].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[b15c5131b06c].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[b186024ec98a].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[b2e1777bf0f2].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[be99c971b8f7].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[c48f8aa3f6da].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[c9ab650f5bda].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[cad43c3268a8].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[ce37f114963e].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[d1855fa9cd1c].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[d1ac0c4abb10].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[d920789249af].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[da9b4fce9ab0].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[dc5861b33dda].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[e71a1cb4cd70].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[e828d00dc1b3].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[eab14aa4ff5d].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[f006c9318116].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[f6fae1c9a96e].json Snapshot updated to include features.main_layout.
tests/snapshots/test_parse_serp/test_parse_serp[faa9c7c889db].json Snapshot updated to include features.main_layout.
docs/plans/030-main-layout-known-issues.md Adds a plan doc capturing known issues/follow-ups for main-layout routing/extraction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

self.layout_divs["top-bars"], rso=self.layout_divs["rso"]
)
or []
spec = _STANDARD_LAYOUTS[sub_type]
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in d5aad50. Switched to _STANDARD_LAYOUTS.get(sub_type) with an early return [] for a missing recipe, restoring the pre-refactor "unknown subtype → empty result" behavior. It's unreachable from the current caller (which only passes keys straight from _STANDARD_LAYOUTS.items()), but it makes the helper robust given the sub_type: str = "" default, and is now pinned by a test.


Generated by Claude Code

Address PR review: _STANDARD_LAYOUTS[sub_type] would KeyError if the helper
were ever called with an unknown/empty sub_type (the signature still
defaults to ""). Use .get() and return [] for a missing recipe, restoring
the pre-refactor "unknown subtype -> empty result" behavior. Unreachable
from the current caller, but makes the helper robust. Pinned with a test.
@gitronald gitronald merged commit 75c0b86 into feature/v0.9.0 May 31, 2026
3 checks passed
gitronald pushed a commit that referenced this pull request May 31, 2026
Update frontmatter post-merge: status draft -> done, fill completed date
and pr: 142. The implementation landed in #142; the Open issues section
remains as the deferred follow-up record.
@gitronald gitronald deleted the claude/simplify-websearcher-pkg branch May 31, 2026 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants