add dev-story commit history report by gitronald · Pull Request #140 · gitronald/WebSearcher

gitronald · 2026-05-30T21:31:48Z

No description provided.

A scoping plan that maps where selectolax (lexbor) could replace BeautifulSoup+lxml in the parse path, gated on a parser-parity harness and a measured pilot rather than a speculative rewrite. Cross-references plan 023's profile (bs4 find traversal ~60%, lxml parse ~16-18%) and inventories the bs4 API surface and byte-identical-output risks a migration must clear. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

Resolves open question #1: the bs4 input/return-type API is fair game to change, collapsing the migration to a clean one-way rewrite (option B) with byte-identical output as the only remaining gate. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

Add scripts/diff_parsers.py, the prerequisite deliverable for the selectolax exploration: it parses every fixture SERP with both lxml+bs4 and selectolax and reports structural divergence plus signal parity for the exact class/id/tag/attr targets the classifier and extractor query. Result over 88 SERPs: zero divergence on every queried signal; the only structural diffs are inert (SVG element-name casing and <tbody> auto-insertion, neither of which the pipeline queries). The byte-identical-output risk from a parser-backend swap is much smaller than feared on real Google markup. selectolax added to the dev group only (diagnostic), not runtime dependencies. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

Replace the bs4/lxml parse and query layer with a selectolax (lexbor) backend via a bs4-compatible adapter (WebSearcher/_slx.py). make_soup now returns a SoupNode wrapping a selectolax tree; the ~50 parsers, classifier, and extractor run against the adapter's bs4-subset API (find/find_all/select/get_text/attrs/navigation/extract), so the migration lands without rewriting every call site.

Key faithfulness work to keep parse_serp output byte-identical (snapshot suite):
- bs4 class-match semantics (single token / multi-token exact / list-OR), name=True any-tag, string=True via bs4 .string, find_parent, recursive=False.
- get_text skips script/style/template text to match bs4+lxml.
- extract() detaches via remove(recursive=False), preserving subtree and stable node identity (mem_id) so DOM-position reordering still resolves; id()-keyed ordering switched to mem_id.
- _ComponentSignals walks the selectolax tree natively (hot classify path).

Snapshots: 10 SERPs updated for benign whitespace-only differences in concatenated text fields (lexbor emits more whitespace-only text nodes than lxml; content is byte-identical otherwise). All other output unchanged. selectolax moved to runtime dependencies. Full suite green (299 passed, 66 snapshots) on Python 3.13.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

The initial adapter matched find/find_all with a pure-Python descendant walk per call, which was ~4x slower than bs4. Translate the common (name, attrs) queries to a CSS selector and run them on selectolax's C engine via node.css()/css_first(), falling back to the Python matcher only for queries CSS can't express exactly (regex/string= filters, attr/class lists, multi-token class exact-match, callable filters, recursive=False, name lists). Correctness: node.css() matches the node itself, but bs4 find/find_all search descendants only, so self is excluded (except the document root, which legitimately matches its own <html>). Byte-identical across the snapshot suite. Per-SERP median ~451ms -> ~239ms on the local subset; full suite 76s -> 45s. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
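The translate-or-fall-back rule described above can be sketched roughly as follows. This is an illustration only, not the adapter's code: `to_css` is a hypothetical name, and the sketch handles just a string tag name plus single-token string attribute values, returning None for every query shape the message lists as needing the Python matcher.

```python
import re

# Conservative check: only names that are valid CSS identifiers without escaping.
_SAFE_IDENT = re.compile(r"^[A-Za-z_][\w-]*$")


def to_css(name=None, attrs=None, recursive=True):
    """Return a CSS selector for a bs4-style (name, attrs) query,
    or None when the query should fall back to the Python matcher."""
    if not recursive:
        return None  # direct-children-only search is not translated here
    if name is not None and not (isinstance(name, str) and _SAFE_IDENT.match(name)):
        return None  # name lists, name=True, callables, unusual tag names
    parts = [name or "*"]
    for key, value in (attrs or {}).items():
        if not isinstance(value, str):
            return None  # regex, list, callable, or True filters
        if key == "class":
            if len(value.split()) != 1 or not _SAFE_IDENT.match(value):
                return None  # multi-token class strings are exact matches in bs4
            parts.append("." + value)
        else:
            if '"' in value or "\\" in value:
                return None  # avoid quoting problems; let Python handle it
            parts.append(f'[{key}="{value}"]')
    return "".join(parts)
```

A caller would still have to drop the queried node itself from the results, since node.css() can match the node it is called on.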

…ptcha) find(string=re.compile(...)) with no tag filter (utils.has_captcha) was matching by computing each element's full subtree text via the Python text walker, i.e. O(elements x subtree) per call -- the dominant cost in the profile (~40s of 54s). Scan text nodes directly instead (O(text)), matching bs4's NavigableString search. get_text keeps the Python walker (selectolax's native .text() changes whitespace in ways that alter downstream parsing -- local_results address/directions, a knowledge sub_type). Per-SERP median ~237ms (bs4) -> ~106ms; suite 76s -> 26s. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
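The cost difference can be illustrated with a small stand-in. The sketch below uses the standard library's html.parser instead of selectolax, and `first_matching_text` is a hypothetical name; it only shows the shape of the fix, where each text node is tested once rather than recomputing subtree text per element.

```python
import re
from html.parser import HTMLParser


class _TextNodes(HTMLParser):
    """Collect every text node in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def first_matching_text(html, pattern):
    """Return the first text node that the compiled pattern matches, else None."""
    collector = _TextNodes()
    collector.feed(html)
    collector.close()
    for chunk in collector.chunks:  # one regex test per text node: O(text)
        if pattern.search(chunk):
            return chunk
    return None
```

Usage: `first_matching_text(page_html, re.compile("unusual traffic"))` returns the matching text node or None, without touching element subtrees at all.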

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

…xperiment https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

The four parsers that derive a sub_type slug from a component heading were fragile in two ways: (a) the slug came from a bare .replace(" ", sep), which only normalizes ASCII space (a non-breaking or doubled-space heading would slug to "co__op" or "behind\xa0the-scenes"); (b) three of the four discarded the raw heading text entirely after slugging.

Changes:
- utils.slugify(text, sep): " ".join-style whitespace-robust slugify; replaces the .replace(" ", sep) sites in knowledge, local_results, searches_related, and perspectives.
- Preserve the raw component heading as details["heading"] in searches_related, local_results, and perspectives (matching knowledge.py's existing convention, 3 prior uses). Top-level "title" is left alone: it already means "the result's own title" (the business name, the story title), so a component-level header belongs on details, not there.
- local_results._link_text_to_url: key off the stable structural classes (a.L48Cpd = website, a.VDgVie = directions) instead of the localized visible anchor text. The old text-keyed lookup silently dropped both fields whenever get_text(strip=True) left stray whitespace on the key, and missed them on any non-English SERP.

Snapshots regenerated for the deliberate enrichment (63 SERPs); diff verified to contain only details["heading"]/sub_type/website/directions changes. Full suite green (299 passed, 66 snapshots).

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
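A minimal sketch of the whitespace-robust slug, assuming it is built on str.split() (the default value for sep and the absence of case folding are assumptions; the real utils.slugify may differ):

```python
def slugify(text: str, sep: str = "-") -> str:
    """Join whitespace-separated words with sep.

    str.split() with no argument splits on any run of Unicode whitespace,
    so doubled spaces and non-breaking spaces (\xa0) collapse to one sep.
    """
    return sep.join(text.split())


# The old approach only replaced ASCII spaces, one for one:
assert "co  op".replace(" ", "_") == "co__op"
# The split/join form collapses the run:
assert slugify("co  op", "_") == "co_op"
```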

Adds get_text/class_tokens/node_string/find_text/reparse_fragment/make_soup_native as public functions in _slx.py alongside the existing SoupNode class. These are what migrated parsers call directly. Each parser migration: drops bs4 import, adds `node: Node = cmpt.raw` at the top to escape the SoupNode wrapper, replaces bs4-style finds with native selectolax css/css_first plus the helpers above. The SoupNode class stays in place until every call site is native; then it's removed and make_soup swaps to return Node directly. First migration: promo.py (12-line template). https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

…ideos + flights https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

…general_questions Plus _slx: get_text returns None for None input (matches utils.get_text convention so parsers chain css_first+get_text cleanly), and a new has_text helper for the filter_empty_divs predicate. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

…esult + searches_related https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

…people_also_ask https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

top_stories takes either Node or SoupNode (delegators recent_posts/latest_from/ local_news still pass SoupNode from the dispatcher; the hasattr unwrap absorbs the difference until those delegators stop being delegators). https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

shopping_ads: bs4 multi-token-class strings are EXACT (not AND), so 'mnr-c pla-unit' must reject 'mnr-c c3mZkd pla-unit' -- compound CSS .mnr-c.pla-unit alone matches the latter via AND-of-tokens, which routed modern PLA cards through the legacy parser. Narrow + verify exact tokens. shopping_ads also inlines bs4 find_parent('div', {'class':'pla-unit'}) as an upward tag+class walk. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
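The class-matching distinction described above can be modeled in a few lines. This is a sketch of the semantics as the message states them (single token = membership, multi-token string = exact, list = OR); `class_matches` is a hypothetical helper, not a function from bs4 or _slx.

```python
def class_matches(class_attr, query):
    """Model bs4-style class matching against a raw class attribute string."""
    if class_attr is None:
        return False
    if isinstance(query, (list, tuple, set)):
        # A list of classes is an OR over its members.
        return any(class_matches(class_attr, q) for q in query)
    tokens = class_attr.split()
    if len(query.split()) > 1:
        # A multi-token string must equal the whole attribute, in order.
        return " ".join(tokens) == query
    # A single token matches if it is one of the element's classes.
    return query in tokens


# Exact match accepts the legacy card and rejects the modern one,
# whereas the compound selector .mnr-c.pla-unit would accept both.
assert class_matches("mnr-c pla-unit", "mnr-c pla-unit")
assert not class_matches("mnr-c c3mZkd pla-unit", "mnr-c pla-unit")
```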

… next_sibling)
- find_subcomponents: explicit self-exclusion via mem_id (the gu-gels nested-.g bug).
- regex class match (re.compile('fG8Fp')) -> CSS substring [class*=...] (equivalent).
- top_menu.children + child.decompose() -> iter(include_text=False) + decompose.
- stars.next_sibling (text-inclusive) -> inline _next_sibling_with_text via parent.iter(include_text=True).
- 'd4rhi' in attrs.get('class', []) -> 'd4rhi' in class_tokens(sub).
- list-of-classes find ({'class': ['P1usbc','IThcWe']}) -> CSS OR 'div.P1usbc, div.IThcWe'.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

…+ sep fixes) Adds walk_descendants() helper to _slx for subtree-scoped pre-order DFS -- selectolax node.traverse() walks the ENTIRE document from this point forward (not just the subtree), and node.css('span, div, a') with commas returns results GROUPED BY SELECTOR, not document order. Both shapes break the panel text walk that needs descendants-only in document order. Also: where I migrated utils.get_text(...) calls (utils default sep=' '), pass the separator explicitly; the _slx.get_text default is sep='' (bs4 .text style). Knowledge titles like 'Donald Trump 45th and 47th U.S. President' need sep=' '. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
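A subtree-scoped pre-order walk of the kind described above might look like the sketch below. The Node class here is a minimal stand-in with a children list, not selectolax's Node, and the body of walk_descendants is an assumption about the approach rather than the helper's actual code.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Stand-in tree node (hypothetical; selectolax nodes expose a different API)."""
    tag: str
    children: list = field(default_factory=list)


def walk_descendants(root: Node) -> Iterator[Node]:
    """Yield root's descendants in document (pre-order) order, never leaving the subtree."""
    stack = list(reversed(root.children))
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the first child is visited next.
        stack.extend(reversed(node.children))


panel = Node("div", [Node("span", [Node("a")]), Node("p")])
document = Node("body", [panel, Node("footer")])
# The walk stays inside the panel: the following sibling is never visited.
assert [n.tag for n in walk_descendants(panel)] == ["span", "a", "p"]
```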

Adds next_sibling/previous_sibling/next_siblings helpers to _slx (text-inclusive by default, matching bs4 .next_sibling semantics -- selectolax .next may skip text nodes). knowledge_rhs walks next_siblings of the description's parent and peeks the previous_sibling for the submenu; both routes now go through these helpers. local_results: recursive=False find_all becomes iter(include_text=False) + tag filter; class_token list checks use class_tokens(); the rest is straightforward. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

news_quotes: children destructure preserved (text-inclusive iter); .div / .span first-tag access -> css_first; inline _is_tag helper for the bs4 isinstance filter. notices: copy.copy(cmpt) -> reparse_fragment (bs4 cloning semantics for the extract-from-clone pattern); div.extract() -> div.remove(recursive=False); class_= kwargs become CSS class selectors. https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

ai_overview: find_parent x2 (inline _find_parent_tag_class walking ancestors on tag+class), _drop_nested_descendants uses mem_id sets, _doc_position uses mem_id-based sibling index. find_all([list],{class}) -> CSS comma-join.

ads: AD_SUBTYPE_SELECTORS converted to dict of CSS selector strings (was Selector NamedTuples); all utils.* calls inlined as native + _slx helpers.

shopping_ads gains the hasattr(raw) unwrap so callers can pass either a SoupNode (component dispatch) or a native Node (ads delegating internally).

All 39 component_parsers are now native. utils.* helpers, classifiers, and extractors still operate via the SoupNode adapter; component dispatch still hands parsers a SoupNode, which they unwrap via cmpt.raw.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

Migrates the integration backbone to native selectolax Node:
- _slx: subtree_first/subtree_css/walk_descendants/next_sibling/next_siblings helpers. walk_descendants replaces selectolax Node.traverse() (which walks the entire document from this point forward, not just the subtree).
- classifiers/main.py + footer.py: every classifier method takes a Node; _ComponentSignals walks via walk_descendants (subtree-only).
- extractors/__init__.py + extractor_main/header/footer/rhs/serp_features.py: all bs4 patterns replaced with native CSS + helpers. Critical fix: bs4 check_dict_value(attrs, 'class', ['M8OgIe']) is EXACT list equality, not token membership -- restored exact match.
- components.py: Component.elem typed as Node; dispatch passes Node to parsers; reorder_by_dom_position uses mem_id; child-iteration uses iter(include_text=False).
- utils.py: SoupNode wrappers stripped; helpers take Node natively (with a transitional _unwrap for any SoupNode that may still reach them).
- parsers.py: make_soup returns Node; parse_serp type-annotated as str | Node.
- All 39 component_parsers' 'node = cmpt.raw' boilerplate dropped via scripted refactor (cmpt is now Node directly). local_results uses subtree_first/css for cmpt-level queries that could match self (the gu-gels/places bug shape).
- tests/test_ads.py, test_extractor_main.py, test_utils.py: switch from bs4-style .find() to utils.get_div() / native API.

Suite: 299 passed, 66 snapshots, ruff clean. SoupNode class still present in _slx.py for any unmigrated callers/tests; final cleanup (drop SoupNode, drop bs4/lxml from runtime deps, README) is the next commit.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

Maps all 39 component_parsers modules by implementation style (37 functions, 2 classes: Footer static-method namespace + NoticeParser per-call instance), determines module-level functions are both more efficient and more maintainable, and lays out a phased conversion of the two class-based parsers plus first-param naming normalization.

Records the git history (Footer began as a @classmethod sibling-call group, never stateful) and the section-as-unit pattern shared by the registry, ExtractorFooter, and ClassifyFooter. Phase 1 now removes only the class wrapper and keeps the three footer-exclusive parsers grouped in footer.py, plus realigns parse_image_card(s) to the img_cards registry key.

…ply contract) The contract states the first param is elem, so declaring it and renaming the misnamed cmpt params are one step. Phase 0 now has 0a (write contract) and 0b (apply rename) so the class conversions in Phases 1-2 are written against an already-true contract. Old Phase 4 (parse_alink dedup) becomes Phase 3.

Three distinct behaviors across four defs (only general/knowledge_rhs are identical): text separator differs, top_image_carousel adds a data-url fallback and skips None-coalescing, and the strict ["href"] variants raise KeyError on missing href (a parse-error path) where .get would silently yield None. Reframes Phase 3 as reconcile-with-care, with smallest-to-largest options.

Removes the parse_alink work from 027's scope (class->function only) and opens 028 to reconcile the helper alongside a broader rethink of the knowledge and knowledge_rhs parsers (13-branch sub_type cascade, shared spine, details schema, dynamic slugify sub_type). 027 can land independently.

Adds a 'Parser contract' section to component_parsers/__init__.py: entry parsers are module-level functions parse_<type>(elem, sub_rank=0) -> list[dict] whose first arg is the selectolax node (elem), returning dicts carrying at least type and sub_rank, with sub-node helpers named sub.

The dispatcher always passes the component's selectolax node, so the conventional first parameter 'cmpt' was a misnomer. Renames it to 'elem' across all entry parsers (and the ads sub-parsers / classifier and knowledge_rhs_main, which also receive the full component node). Leaves genuine sub-node helper params (discussions parse_item, general is_general_video / get_result_details) and the two Component-receiving functions (parse_unknown / parse_not_implemented) untouched; notices class internals are deferred to Phase 2. Behavior-preserving: full suite passes.

parse_unknown now takes elem like every other parser (its type is always 'unknown') instead of receiving the Component. parse_not_implemented, which genuinely needed the component's classified type, is removed: select_parser returns None when a known type has no registered parser, and parse_component reports it via the existing create_parsed_list_error('not implemented'). This removes the special-case branch in run_parser so the dispatch uniformly calls parser_func(self.elem). Also fixes a latent bug: both functions called Node.get_text (which selectolax LexborNode does not have), so the unknown / not-implemented paths would have raised AttributeError if ever hit. They are untested by fixtures (all fixture types resolve to a parser), so new tests/test_components.py pins the behavior.
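The uniform dispatch this describes could look roughly like the sketch below. The function names come from the message; the registry contents, the parser bodies, and the shape of the error result are placeholders invented for illustration.

```python
def parse_general(elem):
    # Placeholder parser body.
    return [{"type": "general", "sub_rank": 0}]


def parse_unknown(elem):
    # Takes the node like every other parser; its type is always 'unknown'.
    return [{"type": "unknown"}]


# Hypothetical registry: a known type such as "flights" may have no entry.
PARSERS = {"general": parse_general, "unknown": parse_unknown}


def select_parser(cmpt_type):
    """Return the registered parser, or None when the type has none."""
    return PARSERS.get(cmpt_type)


def create_parsed_list_error(message):
    # Assumed result shape; the real helper's output may differ.
    return [{"error": message}]


def parse_component(cmpt_type, elem):
    parser_func = select_parser(cmpt_type)
    if parser_func is None:
        return create_parsed_list_error("not implemented")
    return parser_func(elem)  # one call shape, no special-case branch
```

With this shape the "not implemented" path is an ordinary error result rather than a parser that needs the whole Component.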

…plemented)

The contract names per-item helpers' sub-node param 'sub'. Three helpers that receive a sub-component node were still named 'cmpt' after the entry-param sweep: general.is_general_video / general.get_result_details and discussions_and_forums.parse_item. Renames them to 'sub' to match their siblings (get_title/get_url/get_cite, parse_general_video). Behavior-preserving.

…ename The parser param was renamed cmpt -> elem in Phase 0b; this docstring used 'cmpt' as loose shorthand for the component. Spelled out as 'component' to avoid confusion. (Other comments referencing 'cmpt' point at still-current variables -- classifier params, parse_serp's loop var, components.py locals -- and notices.py is handled in Phase 2.)

Drops the stateless Footer class (a vestigial @classmethod sibling-call namespace) and dedents its four staticmethods to module-level functions in footer.py, which stays the footer-section parser module (mirroring ExtractorFooter / ClassifyFooter). Realigns the function names to the img_cards registry key: parse_image_cards -> parse_img_cards, parse_image_card -> parse_img_card. Updates the import and the three PARSERS entries (discover_more, img_cards, omitted_notice) to the bare functions. Behavior-preserving: full suite (278) + snapshots pass.

Removes the NoticeParser class, which was instantiated fresh per component and rebuilt its sub_type_text / parser_dict dicts on every call. Hoists those to module-level constants (_SUB_TYPE_TEXT, _SUB_TYPE_PARSERS built once at import) and rewrites the six _parse_* instance methods plus _classify_sub_type as plain functions taking a node. parse_notices(elem) now classifies, dispatches, and packages directly with no class and no per-call dict construction. Also updates the stale 'copy.copy(cmpt)' comment to 'node'. Behavior-preserving: full suite (278) + snapshots pass.

CI (Tests workflow) only ran on dev/master, so format drift accumulated on feature/v0.9.0: 11 files failed 'ruff format --check .' with the pinned ruff (0.15.15). Reformats them (line-length 100 collapses, etc.) so the format step passes once CI runs on this branch. Pure formatting -- no logic change; full suite (278) + snapshots pass on py3.13. Six files overlap this branch's parser renames; the other five (classifiers/footer.py, extractors/extractor_main.py, utils.py, scripts/diff_parsers.py, tests/test_extractor_main.py) are unrelated pre-existing drift.

Adds feature/** to the push and pull_request branch filters so PRs targeting feature/v0.9.0 (and future feature branches) get CI. Previously only dev/master triggered the workflow, leaving release-feature branches unchecked.

The contract overstated the signature and output. Most entry parsers take only elem (the dispatcher calls parser_func(self.elem) with one arg); sub_rank is an optional second param on the few parsers reused as sub-parsers. And not every result dict carries sub_rank (parse_unknown does not) -- BaseResult defaults it to 0. Reworded so contributors don't treat sub_rank as a required arg/field.

…rdization-3w9XS Standardize component parsers on module-level functions (plan 027)

Replace four near-identical private parse_alink copies (general, knowledge, knowledge_rhs, top_image_carousel) with one parameterized helper in component_parsers/_common.py:

parse_alink(a, sep='', data_url_fallback=False)

- sep covers the '|' multi-fragment join used by knowledge and the image carousel; '' for the rest.
- data_url_fallback covers the carousel's lazy-loaded data-url thumbs.
- Missing href yields url=None (lenient) instead of raising; every current call site already guards href presence, so output is unchanged (full suite + 66 snapshots green).

Also moves the shared parse_alink_list off general.py into _common.
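A sketch of how the two parameters might combine, under stated assumptions: the Anchor class is a stand-in for a selectolax node, the "text" key and the fragment list are hypothetical, and only the signature and the lenient url=None behavior come from the message above.

```python
from dataclasses import dataclass, field


@dataclass
class Anchor:
    """Stand-in for an <a> node: attributes plus its text fragments (hypothetical)."""
    attrs: dict = field(default_factory=dict)
    fragments: list = field(default_factory=list)


def parse_alink(a, sep="", data_url_fallback=False):
    # Lenient lookup: a missing href yields None instead of raising KeyError.
    url = a.attrs.get("href")
    if url is None and data_url_fallback:
        # Lazy-loaded thumbnails may carry the target in data-url instead.
        url = a.attrs.get("data-url")
    return {"text": sep.join(a.fragments), "url": url}


link = Anchor({"href": "/wiki/Example"}, ["Example", "Overview"])
assert parse_alink(link, sep="|") == {"text": "Example|Overview", "url": "/wiki/Example"}
```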

Convert parse_knowledge_panel's 13-branch if/elif cascade into an ordered (detect-and-handle) table mirroring classifiers/main.py. Each _subtype_* handler inspects the node, mutates parsed/details when it recognizes its sub_type, and returns True to consume the dispatch chain; _subtype_panel is the fallback. Behavior-preserving by construction, including the two conditional consumers: things_to_know claims the panel even when the heading text is unrecognized (no sub_type set), and the dynamic JNkvid slug branch falls through to panel when the level-2 section heading is absent. Adds tests/test_knowledge_dispatch.py pinning all 13 sub_types -- the five with no SERP-fixture coverage (featured_snippet, finance, calculator, election, and the dynamic slug branch) plus the two conditional-consumer edges -- via synthetic markup and the curated coverage fixture. Full suite: 296 passed, 66 snapshots unchanged.
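The detect-and-handle table can be sketched as follows. Only the handler naming, the return-True-to-consume convention, the _subtype_panel fallback, and the things_to_know edge come from the message; the node stand-in (a plain dict), the detection markers, and the handler bodies are invented for illustration.

```python
def _subtype_calculator(node, parsed):
    if "calculator" not in node["classes"]:  # hypothetical marker
        return False
    parsed["sub_type"] = "calculator"
    return True


def _subtype_things_to_know(node, parsed):
    if "things" not in node["classes"]:  # hypothetical marker
        return False
    if node.get("heading") == "Things to know":
        parsed["sub_type"] = "things_to_know"
    # Conditional consumer: claims the panel even if the heading is unrecognized.
    return True


def _subtype_panel(node, parsed):
    parsed["sub_type"] = "panel"
    return True  # fallback always consumes


# Order matters: the first handler that returns True ends the dispatch.
_HANDLERS = (_subtype_calculator, _subtype_things_to_know, _subtype_panel)


def parse_knowledge_panel(node):
    parsed = {"type": "knowledge", "sub_type": None}
    for handler in _HANDLERS:
        if handler(node, parsed):
            break
    return [parsed]
```

Keeping the handlers in an ordered tuple preserves the precedence the if/elif chain encoded, while letting each branch be tested on its own.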

Add panel_rhs to the knowledge ComponentType sub_types (the RHS parser normalizes its rows to type=knowledge/sub_type=panel_rhs) and document that knowledge is an open sub_type space -- the JNkvid branch mints section-heading slugs (movies, songs, lyrics, played-by, cast-and-crew) that cannot be exhaustively enumerated. Records resolutions in plan 028: dispatch (table-driven), knowledge vs knowledge_rhs (stay separate, share only the link helper), slug branch (kept open), link parsing (_common.py). The details-schema typed-details alignment is deferred to its own focused effort -- it needs a concrete target schema and changes output broadly, so it is best reviewed in isolation. Full suite: 296 passed, 66 snapshots unchanged.

Phases 1-3a (parse_alink reconciliation, table-driven knowledge dispatch, sub_type registry close-out) resolve four of the plan's five open questions and are behavior-preserving (296 passed, 66 snapshots unchanged). The fifth -- details-schema alignment with the typed-details direction -- is deferred to a focused follow-up (plan 029) because it needs a concrete target schema defined first and changes output broadly, so its snapshot churn is best reviewed in isolation.

The table-driven handlers annotate h2_text as str, but get_text returns str|None, so pyrefly flagged 13 bad-argument-type errors at the dispatch call sites. Coerce once at the source: h2_text = get_text(h2) or "". Behavior-identical: neither None nor "" equals any of the literals the handlers check against, so every comparison has the same outcome as before. pyrefly: 0 errors; pytest: 296 passed (3.12).

- Plan 028 Status prose said 'In progress' while frontmatter was 'completed'; align prose with the completed status and point at 029.
- Pinning tests passed the document root into parse_knowledge_panel; select the div.kp-blk component root instead, mirroring production dispatch and avoiding matches leaking outside the panel.

Record that the unified lenient parse_alink returns url=None for the top_image_carousel data_url_fallback path (vs the old "" coalescing), reviewed and accepted on PR #141: kept uniform with the other lenient call sites; None is only reachable from an empty attribute value, which is not observed and moved no snapshot/test.

gitronald and others added 30 commits May 25, 2026 15:50

add dev-story commit history report

9dc8561

expand dev-story with commit-window excerpts and changelog mining

d37436c

add component-growth analysis and expand dev-story architecture section

0a8846a

update: annotate component-growth plot with add/remove labels

690946c

update: shade commit-activity bars by count

c296ab0

Merge branch 'master' into dev

1eb555b

version [prerelease]: 0.8.7a0

242b227

plan 026: record selectolax migration results (~2x faster, suite green)

4060f5d

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

plan 026: note latent parser fragilities exposed by the native-text e…

db110ee

…xperiment https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

plan 026: record slug-and-heading cleanup outcome

68b8423

Native rewrite: top_stories delegators + most_read_articles + short_v…

338b63a

…ideos + flights https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

Native rewrite: scholarly_articles + view_more_news + twitter_cards/r…

8ade679

…esult + searches_related https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

Native rewrite: discussions/top_image_carousel/recipes/images/videos/…

7a41f38

…people_also_ask https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

gitronald and others added 28 commits May 29, 2026 11:25

update: bump orjson, protobuf, pre-commit, pyrefly, ruff via dependabot

2a5384f

Mark Phase 0 done in plan 027

7c1e7e2

Plan 027: record Phase 0c (exception-free contract, drop parse_not_im…

09222c5

…plemented)

Plan 027: link PR #139

bab5075

ci: run Tests workflow on feature/** branches

87995b6


Merge pull request #139 from gitronald/claude/component-parser-standa…

8fc8b80

…rdization-3w9XS Standardize component parsers on module-level functions (plan 027)

Plan 027: mark completed (PR #139 merged)

d08928b

gitronald closed this May 31, 2026

gitronald deleted the claude/post-merge-status-check-52Z1B branch May 31, 2026 17:28

gitronald wants to merge 73 commits into master from claude/post-merge-status-check-52Z1B
