Package simplification pass + main-section layout surfacing#142
Conversation
Quality-only, behavior-preserving (296 passed, 66 snapshots unchanged,
pyrefly clean). Findings from a 4-angle review (reuse/simplify/
efficiency/altitude) of the WebSearcher package:
- classifiers/main.py knowledge_box: remove two dead first_text
computations (splitlines + split fallback) that were unconditionally
overwritten by the _iter_text_fragments walk; hoist the function-local
import to module top. Flagged by all four review angles.
- classifiers/main.py general: compute class_tokens only in the branch
that uses it (skip the alloc on the classless path).
- classifiers/footer.py: reorder the short-circuit loop to the natural
run-then-check form.
- component_parsers/general.py: hoist per-call re.compile in
parse_ratings/parse_product to module constants, matching the existing
_ARIA_*_RE pattern.
- components.py: drop dead result_counter attribute; fix to_records loop
var shadowing the Component class name.
- utils.py get_domain: collapse to early returns, building only the
domain string actually returned.
Skipped (noted for follow-up): extractor_main standard-layout ladder->
table (hot path, subtle per-branch filtering, needs uncovered-layout
pins first); the endswith('locations') header special-case (registry-
field change is over-engineering for one marker); Component.get_metadata
removal (public API on a published lib) and the cmpt_rank==0 guard
(behavior-changing).
Never called anywhere in the package, tests, or docs, and carried a mutable-default-arg smell (key_filter=[...]). Confirmed dead before removal. 296 passed, 66 snapshots unchanged, pyrefly clean.
add_component used `if not cmpt_rank` to detect a missing rank, but the sentinel is cmpt_rank=None -- so an explicit 0 was treated as missing and silently replaced by the counter. Use `is None` to match the documented sentinel. No internal caller passes cmpt_rank today (the only splat carries elem/section/type), so this is a no-op on current behavior and a correctness fix for external/future callers. Add two ComponentList.add_component tests: auto-increment from 0, and the explicit-zero pin (fails on the old `not cmpt_rank` form). 298 passed, 66 snapshots unchanged, pyrefly clean.
The main-section layout label was internal-only dispatch state on ExtractorMain. Surface it as features.main_layout so the SERP layout type is observable in parsed output (e.g. 'standard', 'standard-0'). - SERPFeatures: add main_layout: str | None = None - parse_serp: set features.main_layout from extractor.main_handler .layout_label after extraction (no FeatureExtractor<->Extractor coupling) - Regenerate 66 snapshots (additive: +1 key each; 62 standard, 3 standard-0, 1 standard-4) - Pin the wiring with test_features_expose_main_layout Additive output change; existing results consumers are unaffected. 299 passed, pyrefly clean.
Establish characterization tests for every layout_label outcome before any ladder refactor -- the fixtures only witness standard/standard-0/standard-4, leaving 7 of 10 observable labels untested. Adds pins for: - get_layout routing truth table (standard, no-rso, left-bar, top-bars, and the top-bars-over-left-bar precedence) - standard-0/1/2/4 sub-type dispatch + the standard-3 empty fallback label - top-bars-divs / top-bars-children - left-bar (pins the current document-wide TzHB6b scope) and no-rso Fix: in extract_from_no_rso the page-level div.WvKfwe.a3spGf (sec2) section was appended inside the 'for div in sec1' loop, so its content was emitted once per sec1 block. Hoist it out of the loop so it appears once per page. Unwitnessed branch -> no snapshot churn. 315 passed (+16 pins), 66 snapshots unchanged, pyrefly clean.
The standard sub-layouts were a 4-branch if-ladder (~95 lines) whose branches differed only in data: detection container/selectors, extraction container, and which class token(s) to keep. The extract_top_divs preamble was copy-pasted 6x across the file. Collapse to a _StandardLayout table + two shared extraction shapes: - shape A (keep_tokens set): direct children with the first matching token, top_divs + main_divs returned unfiltered (standard-0, standard-4) - shape B (keep_tokens None): all text-inclusive children, then drop bad tags + empties from the combined column (standard-1, standard-2) Detection now iterates the same table instead of a parallel dict, so the detect/extract halves can no longer drift. Behavior-preserving: guarded by the characterization pins added last commit (standard-0/1/2/4 + standard-3 fallback). 315 passed, 66 snapshots unchanged, pyrefly clean.
Rename the opaque numeric standard-N labels to names derived from the observable kp-wp-tab-* container each detects (no guessing at content meaning): standard-0 -> standard-overview (kp-wp-tab-overview) standard-1 -> standard-songs (kp-wp-tab-Songs) standard-2 -> standard-sports-standings (kp-wp-tab-SportsStandings) standard-4 -> standard-airfares (kp-wp-tab-AIRFARES) standard-3 -> standard-fallback (empty-rso fallback, no tab) These surface in features.main_layout, so the rename updates 4 snapshots (the witnessed overview/airfares SERPs) and the pins. top-bars-divs/ children, left-bar, no-rso are already observable-derived and unchanged. Checked the codebase: the other 'standard' strings (ads.py, component_types .py, test_ads.py) are an unrelated ad/component sub_type, left untouched. Add docs/plans/030-main-layout-known-issues.md documenting the deferred layout issues (left-bar document-wide scope, standard-fallback dead body, left-bar/top-bars shadowing rso, layout_label dual role, dead defensive code) -- all blocked on capturing witnessed fixtures. 315 passed, 66 snapshots, pyrefly clean.
There was a problem hiding this comment.
Pull request overview
This PR performs a cleanup/simplification pass across the parsing pipeline and makes the main-section layout label observable to downstream consumers by surfacing ExtractorMain.layout_label as features.main_layout in parse_serp output. It also hardens layout dispatch/extraction behavior with characterization tests and refactors the standard-* layout handling into a data-driven table.
Changes:
- Simplify/refactor several parsers/classifiers/extractors (remove dead code, hoist regex compilation, reduce redundant work).
- Surface main layout classification via
SERPFeatures.main_layoutand update snapshots accordingly. - Add characterization tests for layout routing/extraction, including fixing
no-rsosection duplication.
Reviewed changes
Copilot reviewed 78 out of 78 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| WebSearcher/utils.py | Simplifies get_domain() to early-return logic and avoids intermediate variables. |
| WebSearcher/parsers.py | Exposes ExtractorMain.layout_label as features.main_layout in parse_serp() output. |
| WebSearcher/models/features.py | Adds main_layout field to SERPFeatures. |
| WebSearcher/extractors/extractor_main.py | Refactors standard-* dispatch into _STANDARD_LAYOUTS, fixes no-rso sec2 duplication, and keeps layout labels consistent. |
| WebSearcher/components.py | Removes dead fields/methods, fixes cmpt_rank=0 handling, and fixes to_records() shadowing. |
| WebSearcher/component_parsers/general.py | Hoists per-call regex compilation into module constants. |
| WebSearcher/classifiers/main.py | Avoids unnecessary class-token computation; uses shared _iter_text_fragments for faithful first-text detection. |
| WebSearcher/classifiers/footer.py | Fixes classifier loop to “run then check” (avoids unreachable assignment ordering). |
| tests/test_parse_serp.py | Adds assertion that parsed SERPs expose features.main_layout and pins witnessed distribution. |
| tests/test_extractor_main.py | Adds layout routing/extraction characterization tests (including no-rso sec2 regression pin). |
| tests/test_components.py | Adds tests covering ComponentList.add_component rank auto-assignment and explicit cmpt_rank=0. |
| tests/snapshots/test_parse_serp/test_parse_serp[01f85d1329ba].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[032572e185d3].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[0410241ce1e2].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[0d3fc3b49b76].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[0ed311025efc].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[130eba186e94].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[18eccfe8454e].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[2c0aa0bbcd0c].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[2d1b05a046b2].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[305b53af69be].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[30926d7c7ae9].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[30c5d6bdb650].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[39617f527744].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[3c03a4a2cb7c].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[3c09a0f0c92f].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[3f5efb1dc358].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[45b6e019bfa2].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[4c8d8d2f226c].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[53940e35cc92].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[56cbcf8cd4dc].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[56f2eab63e9d].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[5898b04fb534].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[6978d0cd767d].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[6aa70651b0cd].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[6dc5bc34ff55].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[6e206db14899].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[6e401e618433].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[7049404a2dd6].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[7333536d2911].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[7ad9715f3597].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[7b89c00120e3].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[7d76d3a83ebc].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[811a27f92284].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[82e35954f552].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[83b17a6a7750].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[8d1b75b71e7f].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[8e820f7b024f].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[8f98fa9c0bef].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[9101d12ab778].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[923a428c1c22].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[97404b7b7c61].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[984065877aad].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[9a7e39d95bf0].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[9ed1baa7715d].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[a6c881e003e2].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[a6c8fe7fe769].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[aa594f199c3d].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[b15c5131b06c].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[b186024ec98a].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[b2e1777bf0f2].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[be99c971b8f7].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[c48f8aa3f6da].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[c9ab650f5bda].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[cad43c3268a8].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[ce37f114963e].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[d1855fa9cd1c].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[d1ac0c4abb10].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[d920789249af].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[da9b4fce9ab0].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[dc5861b33dda].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[e71a1cb4cd70].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[e828d00dc1b3].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[eab14aa4ff5d].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[f006c9318116].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[f6fae1c9a96e].json | Snapshot updated to include features.main_layout. |
| tests/snapshots/test_parse_serp/test_parse_serp[faa9c7c889db].json | Snapshot updated to include features.main_layout. |
| docs/plans/030-main-layout-known-issues.md | Adds a plan doc capturing known issues/follow-ups for main-layout routing/extraction. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self.layout_divs["top-bars"], rso=self.layout_divs["rso"] | ||
| ) | ||
| or [] | ||
| spec = _STANDARD_LAYOUTS[sub_type] |
There was a problem hiding this comment.
Good catch — fixed in d5aad50. Switched to _STANDARD_LAYOUTS.get(sub_type) with an early return [] for a missing recipe, restoring the pre-refactor "unknown subtype → empty result" behavior. It's unreachable from the current caller (which only passes keys straight from _STANDARD_LAYOUTS.items()), but it makes the helper robust given the sub_type: str = "" default, and is now pinned by a test.
Generated by Claude Code
Address PR review: _STANDARD_LAYOUTS[sub_type] would KeyError if the helper were ever called with an unknown/empty sub_type (the signature still defaults to ""). Use .get() and return [] for a missing recipe, restoring the pre-refactor "unknown subtype -> empty result" behavior. Unreachable from the current caller, but makes the helper robust. Pinned with a test.
Update frontmatter post-merge: status draft -> done, fill completed date and pr: 142. The implementation landed in #142; the Open issues section remains as the deferred follow-up record.
Summary
A quality-focused pass over the
WebSearcherpackage that started as a/simplifyreview and grew into surfacing, hardening, and cleaning up the main-section layout machinery. Every step is verified green (ruff, ruff format, pyrefly, pytest) and lands incrementally so each commit is reviewable on its own.Net: 315 tests passing (up from 296), 66 snapshots intact.
What's here, commit by commit
Simplification pass — behavior-preserving cleanups from a 4-angle review (reuse / simplify / efficiency / altitude):
knowledge_box: drop two deadfirst_textcomputations that were unconditionally overwritten (flagged by all four angles); hoist a function-local import.generalclassifier: computeclass_tokensonly in the branch that uses it.footerclassifier: natural run-then-check loop.general.pyparsers: hoist per-callre.compileto module constants.components.py: drop deadresult_counter; fixto_recordsloop var shadowing theComponentclass.utils.get_domain: collapse to early returns.Remove dead
Component.get_metadata— never called anywhere in code/tests/docs.Fix
add_componentto honor an explicitcmpt_rankof 0 — thenot cmpt_rankguard treated a legitimate rank0as missing. Uses theis Nonesentinel. No internal caller passescmpt_ranktoday, so it's a no-op on current behavior; pinned with two new tests.Add
features.main_layout— the main-section layout label was internal-only dispatch state; now surfaced inparse_serpoutput. Additive (existingresultsconsumers unaffected); regenerates snapshots with the new key.Pin every layout branch + fix
no-rsoduplication bug — the SERP fixtures only witness 3 of 10 observablelayout_labelvalues, so this adds characterization pins for the routing truth table, allstandard-*sub-types,top-bars-*,left-bar, andno-rso. Fixes a real bug:extract_from_no_rsoappended the page-levelsec2section once persec1block; hoisted out of the loop.Refactor
standard-*ladder into a data-driven table — a 4-branch, ~95-line if-ladder (with a 6× copy-pasted preamble) collapses into a_StandardLayouttable + two shared extraction shapes. Detection and extraction now iterate the same table, so they can no longer drift. Guarded by the pins from the previous commit.Rename
standard-Nlabels to observable names + add plan 030 — opaque numeric suffixes become names derived from thekp-wp-tab-*container each detects (standard-overview/-songs/-sports-standings/-airfares, plusstandard-fallback).docs/plans/030-main-layout-known-issues.mddocuments the deferred layout issues (left-bar document-wide scope, dead fallback body, rso shadowing, thelayout_labeldual role), all blocked on capturing witnessed fixtures.Notes for review
cmpt_rank=0, which no caller passes. Commits 4 and 7 change the parsed-outputfeaturesblock (additive key, then a value rename) — that's the only output-contract change, and it's covered by the regenerated snapshots."standard"strings inads.py/component_types.pyare an unrelated ad/componentsub_type, intentionally untouched.https://claude.ai/code/session_01XQ6dz41mc3okJAbfFRpniF
Generated by Claude Code