add dev-story commit history report #140

Closed
gitronald wants to merge 73 commits into master from claude/post-merge-status-check-52Z1B

Conversation

@gitronald
Owner

No description provided.

gitronald and others added 30 commits May 25, 2026 15:50
A scoping plan that maps where selectolax (lexbor) could replace
BeautifulSoup+lxml in the parse path, gated on a parser-parity harness and
a measured pilot rather than a speculative rewrite. Cross-references plan
023's profile (bs4 find traversal ~60%, lxml parse ~16-18%) and inventories
the bs4 API surface and byte-identical-output risks a migration must clear.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
Resolves open question #1: the bs4 input/return-type API is fair game to
change, collapsing the migration to a clean one-way rewrite (option B) with
byte-identical output as the only remaining gate.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
Add scripts/diff_parsers.py, the prerequisite deliverable for the selectolax
exploration: it parses every fixture SERP with both lxml+bs4 and selectolax and
reports structural divergence plus signal parity for the exact class/id/tag/attr
targets the classifier and extractor query.

Result over 88 SERPs: zero divergence on every queried signal; the only
structural diffs are inert (SVG element-name casing and <tbody> auto-insertion,
neither of which the pipeline queries). The byte-identical-output risk from a
parser-backend swap is much smaller than feared on real Google markup.

selectolax added to the dev group only (diagnostic), not runtime dependencies.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
Replace the bs4/lxml parse and query layer with a selectolax (lexbor) backend
via a bs4-compatible adapter (WebSearcher/_slx.py). make_soup now returns a
SoupNode wrapping a selectolax tree; the ~50 parsers, classifier, and extractor
run against the adapter's bs4-subset API (find/find_all/select/get_text/attrs/
navigation/extract), so the migration lands without rewriting every call site.

Key faithfulness work to keep parse_serp output byte-identical (snapshot suite):
- bs4 class-match semantics (single token / multi-token exact / list-OR), name=True
  any-tag, string=True via bs4 .string, find_parent, recursive=False.
- get_text skips script/style/template text to match bs4+lxml.
- extract() detaches via remove(recursive=False), preserving subtree and stable
  node identity (mem_id) so DOM-position reordering still resolves; id()-keyed
  ordering switched to mem_id.
- _ComponentSignals walks the selectolax tree natively (hot classify path).

Snapshots: 10 SERPs updated for benign whitespace-only differences in concatenated
text fields (lexbor emits more whitespace-only text nodes than lxml; content is
byte-identical otherwise). All other output unchanged. selectolax moved to runtime
dependencies. Full suite green (299 passed, 66 snapshots) on Python 3.13.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
The initial adapter matched find/find_all with a pure-Python descendant walk per
call, which was ~4x slower than bs4. Translate the common (name, attrs) queries to
a CSS selector and run them on selectolax's C engine via node.css()/css_first(),
falling back to the Python matcher only for queries CSS can't express exactly
(regex/string= filters, attr/class lists, multi-token class exact-match, callable
filters, recursive=False, name lists).

Correctness: node.css() matches the node itself, but bs4 find/find_all search
descendants only, so self is excluded (except the document root, which legitimately
matches its own <html>). Byte-identical across the snapshot suite.

Per-SERP median ~451ms -> ~239ms on the local subset; full suite 76s -> 45s.
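
A minimal sketch of the fast-path idea in this commit, assuming a hypothetical `to_css` helper (illustrative, not the adapter's actual code). Simple `(name, attrs)` queries become a CSS selector string. Anything CSS cannot express exactly returns `None`, so a caller would fall back to the Python matcher.

```python
import re

# Hypothetical helper: the name and the exact rules are assumptions for illustration.
_SIMPLE_TOKEN = re.compile(r"^[A-Za-z_][\w-]*$")


def to_css(name=None, attrs=None):
    """Return a CSS selector for a simple (name, attrs) query, or None for the slow path."""
    if name is not None and not isinstance(name, str):
        return None  # name lists, callables, name=True: leave to the Python matcher
    selector = name or "*"
    for key, value in (attrs or {}).items():
        if not isinstance(value, str):
            return None  # regex, list, or callable filters
        if key == "class":
            if not _SIMPLE_TOKEN.match(value):
                return None  # multi-token class strings need exact-match handling
            selector += "." + value
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            selector += f'[{key}="{escaped}"]'
    return selector
```

Under this sketch, `to_css("div", {"class": "g"})` gives `div.g`, while a multi-token class string such as `"mnr-c pla-unit"` gives `None`. Because `node.css()` can match the node itself, a caller would still need to drop self from the results to keep bs4's descendants-only behavior.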

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
…ptcha)

find(string=re.compile(...)) with no tag filter (utils.has_captcha) was matching
by computing each element's full subtree text via the Python text walker, i.e.
O(elements x subtree) per call -- the dominant cost in the profile (~40s of 54s).
Scan text nodes directly instead (O(text)), matching bs4's NavigableString search.

get_text keeps the Python walker (selectolax's native .text() changes whitespace
in ways that alter downstream parsing -- local_results address/directions, a
knowledge sub_type). Per-SERP median ~237ms (bs4) -> ~106ms; suite 76s -> 26s.
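
The text-node scan described in this commit can be sketched on a toy tree. `Elem` is a hypothetical stand-in for a parsed element, and this `find_text` is an assumption about the shape of the idea, not the `_slx` implementation. Each text node is visited once, instead of recomputing the full subtree text for every element.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Elem:
    """Hypothetical stand-in for an element; children are Elem or str (text nodes)."""
    tag: str
    children: list = field(default_factory=list)


def iter_text_nodes(elem):
    """Yield each text node under elem once, in document order."""
    for child in elem.children:
        if isinstance(child, str):
            yield child
        else:
            yield from iter_text_nodes(child)


def find_text(elem, pattern):
    """Return the first text node matching pattern, or None if nothing matches."""
    for text in iter_text_nodes(elem):
        if pattern.search(text):
            return text
    return None
```

With a tree such as `Elem("body", [Elem("p", ["Hello"]), Elem("div", [Elem("span", ["Please solve the CAPTCHA"])])])`, `find_text(tree, re.compile("CAPTCHA"))` returns the matching text node without building any per-element subtree strings.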

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
The four parsers that derive a sub_type slug from a component heading were
fragile in two ways: (a) the slug came from a bare .replace(" ", sep), which
only normalizes ASCII space (a non-breaking or doubled-space heading would
slug to "co__op" or "behind\xa0the-scenes"); (b) three of the four discarded
the raw heading text entirely after slugging.

Changes:
- utils.slugify(text, sep): " ".join-style whitespace-robust slugify; replaces
  the .replace(" ", sep) sites in knowledge, local_results, searches_related,
  and perspectives.
- Preserve the raw component heading as details["heading"] in searches_related,
  local_results, and perspectives (matching knowledge.py's existing convention,
  3 prior uses). Top-level "title" is left alone: it already means "the result's
  own title" (the business name, the story title), so a component-level header
  belongs on details, not there.
- local_results._link_text_to_url: key off the stable structural classes
  (a.L48Cpd = website, a.VDgVie = directions) instead of the localized visible
  anchor text. The old text-keyed lookup silently dropped both fields whenever
  get_text(strip=True) left stray whitespace on the key, and missed them on any
  non-English SERP.
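
A minimal sketch of a whitespace-robust slugify in the `" ".join` style the first bullet describes. The default `sep` and the absence of lowercasing are assumptions; the real `utils.slugify` may differ.

```python
def slugify(text: str, sep: str = "_") -> str:
    """Join whitespace-separated tokens with sep.

    str.split() with no arguments splits on any run of Unicode whitespace,
    so doubled spaces and non-breaking spaces collapse to a single separator.
    """
    return sep.join(text.split())
```

For comparison, `"co  op".replace(" ", "_")` yields `"co__op"`, while `slugify("co  op", "_")` yields `"co_op"`. Likewise `slugify("behind\xa0the scenes", "-")` yields `"behind-the-scenes"`.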

Snapshots regenerated for the deliberate enrichment (63 SERPs); diff verified
to contain only details["heading"]/sub_type/website/directions changes. Full
suite green (299 passed, 66 snapshots).

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
Adds get_text/class_tokens/node_string/find_text/reparse_fragment/make_soup_native
as public functions in _slx.py alongside the existing SoupNode class. These are
what migrated parsers call directly.

Each parser migration: drops bs4 import, adds `node: Node = cmpt.raw` at the
top to escape the SoupNode wrapper, replaces bs4-style finds with native
selectolax css/css_first plus the helpers above. The SoupNode class stays in
place until every call site is native; then it's removed and make_soup swaps to
return Node directly.

First migration: promo.py (12-line template).

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
…general_questions

Plus _slx: get_text returns None for None input (matches utils.get_text
convention so parsers chain css_first+get_text cleanly), and a new has_text
helper for the filter_empty_divs predicate.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
top_stories takes either Node or SoupNode (delegators recent_posts/latest_from/
local_news still pass SoupNode from the dispatcher; the hasattr unwrap absorbs
the difference until those delegators stop being delegators).

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
shopping_ads: bs4 multi-token-class strings are EXACT (not AND), so
'mnr-c pla-unit' must reject 'mnr-c c3mZkd pla-unit' -- compound CSS
.mnr-c.pla-unit alone matches the latter via AND-of-tokens, which routed
modern PLA cards through the legacy parser. Narrow + verify exact tokens.
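
The difference between the two matching rules can be shown on plain class strings. This is a sketch with hypothetical helper names. It normalizes whitespace before comparing, which is an assumption and may not mirror bs4 exactly.

```python
def css_compound_match(class_attr: str, query: str) -> bool:
    """AND-of-tokens, like the compound selector .mnr-c.pla-unit."""
    return set(query.split()) <= set(class_attr.split())


def exact_class_match(class_attr: str, query: str) -> bool:
    """Exact token sequence, the behavior this commit attributes to bs4 multi-token strings."""
    return class_attr.split() == query.split()
```

Here `css_compound_match("mnr-c c3mZkd pla-unit", "mnr-c pla-unit")` is true but `exact_class_match` on the same inputs is false. A CSS query is therefore only a first narrowing step and needs an exact-token check afterwards.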

shopping_ads also inlines bs4 find_parent('div', {'class':'pla-unit'}) as
an upward tag+class walk.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
… next_sibling)

- find_subcomponents: explicit self-exclusion via mem_id (the gu-gels nested-.g bug).
- regex class match (re.compile('fG8Fp')) -> CSS substring [class*=...] (equivalent).
- top_menu.children + child.decompose() -> iter(include_text=False) + decompose.
- stars.next_sibling (text-inclusive) -> inline _next_sibling_with_text via
  parent.iter(include_text=True).
- 'd4rhi' in attrs.get('class', []) -> 'd4rhi' in class_tokens(sub).
- list-of-classes find ({'class': ['P1usbc','IThcWe']}) -> CSS OR 'div.P1usbc, div.IThcWe'.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
…+ sep fixes)

Adds walk_descendants() helper to _slx for subtree-scoped pre-order DFS --
selectolax node.traverse() walks the ENTIRE document from this point forward
(not just the subtree), and node.css('span, div, a') with commas returns
results GROUPED BY SELECTOR, not document order. Both shapes break the panel
text walk that needs descendants-only in document order.
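
A subtree-scoped pre-order walk of the kind described here can be sketched on a toy tree. `Elem` is a hypothetical stand-in for a node, and this `walk_descendants` is an illustration of the traversal order, not the `_slx` helper itself.

```python
from dataclasses import dataclass, field


@dataclass
class Elem:
    """Hypothetical stand-in for an element node with ordered children."""
    tag: str
    children: list = field(default_factory=list)


def walk_descendants(root):
    """Yield descendants of root in document (pre-order) order, excluding root itself."""
    stack = list(reversed(root.children))
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the first child is popped next.
        stack.extend(reversed(node.children))
```

The walk never leaves the subtree: siblings of `root` and anything after it in the document are not visited, and mixed tags come back in document order rather than grouped by selector.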

Also: where I migrated utils.get_text(...) calls (utils default sep=' '), pass
the separator explicitly; the _slx.get_text default is sep='' (bs4 .text style).
Knowledge titles like 'Donald Trump 45th and 47th U.S. President' need sep=' '.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
Adds next_sibling/previous_sibling/next_siblings helpers to _slx (text-inclusive
by default, matching bs4 .next_sibling semantics -- selectolax .next may skip
text nodes). knowledge_rhs walks next_siblings of the description's parent and
peeks the previous_sibling for the submenu; both routes now go through these
helpers.

local_results: recursive=False find_all becomes iter(include_text=False) + tag
filter; class_token list checks use class_tokens(); the rest is straightforward.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
news_quotes: children destructure preserved (text-inclusive iter); .div / .span
first-tag access -> css_first; inline _is_tag helper for the bs4 isinstance
filter.

notices: copy.copy(cmpt) -> reparse_fragment (bs4 cloning semantics for the
extract-from-clone pattern); div.extract() -> div.remove(recursive=False);
class_= kwargs become CSS class selectors.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
ai_overview: find_parent x2 (inline _find_parent_tag_class walking ancestors
on tag+class), _drop_nested_descendants uses mem_id sets, _doc_position uses
mem_id-based sibling index. find_all([list],{class}) -> CSS comma-join.

ads: AD_SUBTYPE_SELECTORS converted to dict of CSS selector strings (was
Selector NamedTuples); all utils.* calls inlined as native + _slx helpers.
shopping_ads gains the hasattr(raw) unwrap so callers can pass either a
SoupNode (component dispatch) or a native Node (ads delegating internally).

All 39 component_parsers are now native. utils.* helpers, classifiers, and
extractors still operate via the SoupNode adapter; component dispatch still
hands parsers a SoupNode, which they unwrap via cmpt.raw.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh
Migrates the integration backbone to native selectolax Node:
- _slx: subtree_first/subtree_css/walk_descendants/next_sibling/next_siblings
  helpers. walk_descendants replaces selectolax Node.traverse() (which walks
  the entire document from this point forward, not just the subtree).
- classifiers/main.py + footer.py: every classifier method takes a Node;
  _ComponentSignals walks via walk_descendants (subtree-only).
- extractors/__init__.py + extractor_main/header/footer/rhs/serp_features.py:
  all bs4 patterns replaced with native CSS + helpers. Critical fix: bs4
  check_dict_value(attrs, 'class', ['M8OgIe']) is EXACT list equality, not
  token membership -- restored exact match.
- components.py: Component.elem typed as Node; dispatch passes Node to parsers;
  reorder_by_dom_position uses mem_id; child-iteration uses iter(include_text=False).
- utils.py: SoupNode wrappers stripped; helpers take Node natively (with a
  transitional _unwrap for any SoupNode that may still reach them).
- parsers.py: make_soup returns Node; parse_serp type-annotated as str | Node.
- All 39 component_parsers' 'node = cmpt.raw' boilerplate dropped via scripted
  refactor (cmpt is now Node directly). local_results uses subtree_first/css
  for cmpt-level queries that could match self (the gu-gels/places bug shape).
- tests/test_ads.py, test_extractor_main.py, test_utils.py: switch from
  bs4-style .find() to utils.get_div() / native API.

Suite: 299 passed, 66 snapshots, ruff clean. SoupNode class still present in
_slx.py for any unmigrated callers/tests; final cleanup (drop SoupNode, drop
bs4/lxml from runtime deps, README) is the next commit.

https://claude.ai/code/session_01RdPLdE9zJx9mK4rKaKtpJh

gitronald and others added 28 commits May 29, 2026 11:25
Maps all 39 component_parsers modules by implementation style (37 functions,
2 classes: Footer static-method namespace + NoticeParser per-call instance),
determines module-level functions are both more efficient and more
maintainable, and lays out a phased conversion of the two class-based
parsers plus first-param naming normalization.

Records the git history (Footer began as a @classmethod sibling-call group,
never stateful) and the section-as-unit pattern shared by the registry,
ExtractorFooter, and ClassifyFooter. Phase 1 now removes only the class
wrapper and keeps the three footer-exclusive parsers grouped in footer.py,
plus realigns parse_image_card(s) to the img_cards registry key.

…ply contract)

The contract states the first param is elem, so declaring it and renaming the
misnamed cmpt params are one step. Phase 0 now has 0a (write contract) and 0b
(apply rename) so the class conversions in Phases 1-2 are written against an
already-true contract. Old Phase 4 (parse_alink dedup) becomes Phase 3.

Three distinct behaviors across four defs (only general/knowledge_rhs are
identical): text separator differs, top_image_carousel adds a data-url
fallback and skips None-coalescing, and the strict ["href"] variants raise
KeyError on missing href (a parse-error path) where .get would silently yield
None. Reframes Phase 3 as reconcile-with-care, with smallest-to-largest
options.

Removes the parse_alink work from 027's scope (class->function only) and opens
028 to reconcile the helper alongside a broader rethink of the knowledge and
knowledge_rhs parsers (13-branch sub_type cascade, shared spine, details
schema, dynamic slugify sub_type). 027 can land independently.

Adds a 'Parser contract' section to component_parsers/__init__.py: entry
parsers are module-level functions parse_<type>(elem, sub_rank=0) -> list[dict]
whose first arg is the selectolax node (elem), returning dicts carrying at
least type and sub_rank, with sub-node helpers named sub.

The dispatcher always passes the component's selectolax node, so the
conventional first parameter 'cmpt' was a misnomer. Renames it to 'elem'
across all entry parsers (and the ads sub-parsers / classifier and
knowledge_rhs_main, which also receive the full component node). Leaves
genuine sub-node helper params (discussions parse_item, general
is_general_video / get_result_details) and the two Component-receiving
functions (parse_unknown / parse_not_implemented) untouched; notices class
internals are deferred to Phase 2. Behavior-preserving: full suite passes.

parse_unknown now takes elem like every other parser (its type is always
'unknown') instead of receiving the Component. parse_not_implemented, which
genuinely needed the component's classified type, is removed: select_parser
returns None when a known type has no registered parser, and parse_component
reports it via the existing create_parsed_list_error('not implemented'). This
removes the special-case branch in run_parser so the dispatch uniformly calls
parser_func(self.elem).

Also fixes a latent bug: both functions called Node.get_text (which selectolax
LexborNode does not have), so the unknown / not-implemented paths would have
raised AttributeError if ever hit. They are untested by fixtures (all fixture
types resolve to a parser), so new tests/test_components.py pins the behavior.
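
The dispatch simplification in the first paragraph of this commit can be sketched as follows. The function names come from the commit message, but every signature, the registry contents, and the return shapes are assumptions for illustration; `"flights"` is a made-up type with no registered parser.

```python
def parse_promo(elem):
    """Hypothetical entry parser: takes the component node, returns a list of dicts."""
    return [{"type": "promo", "sub_rank": 0}]


# Registry of implemented parsers (illustrative contents only).
PARSERS = {"promo": parse_promo}


def select_parser(cmpt_type):
    """Return the registered parser, or None when the type has no parser."""
    return PARSERS.get(cmpt_type)


def create_parsed_list_error(message):
    """Stand-in for the existing error helper; the real return shape may differ."""
    return [{"error": message}]


def parse_component(cmpt_type, elem):
    """Uniform dispatch: every parser is called the same way, with the node only."""
    parser_func = select_parser(cmpt_type)
    if parser_func is None:
        return create_parsed_list_error("not implemented")
    return parser_func(elem)
```

With no parser object that needs the classified type, the missing-parser case becomes an ordinary `None` check rather than a special-case branch.
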

The contract names per-item helpers' sub-node param 'sub'. Three helpers that
receive a sub-component node were still named 'cmpt' after the entry-param
sweep: general.is_general_video / general.get_result_details and
discussions_and_forums.parse_item. Renames them to 'sub' to match their
siblings (get_title/get_url/get_cite, parse_general_video). Behavior-preserving.

…ename

The parser param was renamed cmpt -> elem in Phase 0b; this docstring used
'cmpt' as loose shorthand for the component. Spelled out as 'component' to
avoid confusion. (Other comments referencing 'cmpt' point at still-current
variables -- classifier params, parse_serp's loop var, components.py locals --
and notices.py is handled in Phase 2.)

Drops the stateless Footer class (a vestigial @classmethod sibling-call
namespace) and dedents its four staticmethods to module-level functions in
footer.py, which stays the footer-section parser module (mirroring
ExtractorFooter / ClassifyFooter). Realigns the function names to the
img_cards registry key: parse_image_cards -> parse_img_cards,
parse_image_card -> parse_img_card. Updates the import and the three PARSERS
entries (discover_more, img_cards, omitted_notice) to the bare functions.
Behavior-preserving: full suite (278) + snapshots pass.

Removes the NoticeParser class, which was instantiated fresh per component and
rebuilt its sub_type_text / parser_dict dicts on every call. Hoists those to
module-level constants (_SUB_TYPE_TEXT, _SUB_TYPE_PARSERS built once at import)
and rewrites the six _parse_* instance methods plus _classify_sub_type as plain
functions taking a node. parse_notices(elem) now classifies, dispatches, and
packages directly with no class and no per-call dict construction. Also updates
the stale 'copy.copy(cmpt)' comment to 'node'. Behavior-preserving: full suite
(278) + snapshots pass.

CI (Tests workflow) only ran on dev/master, so format drift accumulated on
feature/v0.9.0: 11 files failed 'ruff format --check .' with the pinned ruff
(0.15.15). Reformats them (line-length 100 collapses, etc.) so the format step
passes once CI runs on this branch. Pure formatting -- no logic change; full
suite (278) + snapshots pass on py3.13. Six files overlap this branch's parser
renames; the other five (classifiers/footer.py, extractors/extractor_main.py,
utils.py, scripts/diff_parsers.py, tests/test_extractor_main.py) are unrelated
pre-existing drift.

Adds feature/** to the push and pull_request branch filters so PRs targeting
feature/v0.9.0 (and future feature branches) get CI. Previously only dev/master
triggered the workflow, leaving release-feature branches unchecked.

The contract overstated the signature and output. Most entry parsers take only
elem (the dispatcher calls parser_func(self.elem) with one arg); sub_rank is an
optional second param on the few parsers reused as sub-parsers. And not every
result dict carries sub_rank (parse_unknown does not) -- BaseResult defaults it
to 0. Reworded so contributors don't treat sub_rank as a required arg/field.

…rdization-3w9XS

Standardize component parsers on module-level functions (plan 027)

Replace four near-identical private parse_alink copies (general,
knowledge, knowledge_rhs, top_image_carousel) with one parameterized
helper in component_parsers/_common.py:

  parse_alink(a, sep='', data_url_fallback=False)

- sep covers the '|' multi-fragment join used by knowledge and the
  image carousel; '' for the rest.
- data_url_fallback covers the carousel's lazy-loaded data-url thumbs.
- Missing href yields url=None (lenient) instead of raising; every
  current call site already guards href presence, so output is
  unchanged (full suite + 66 snapshots green).

Also moves the shared parse_alink_list off general.py into _common.
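
A rough sketch of the parameterized helper's shape, based only on the signature and bullets above. `Anchor` is a hypothetical stand-in for an `<a>` node, the returned keys are assumed, and the condition under which the `data-url` fallback applies is a guess.

```python
from dataclasses import dataclass, field


@dataclass
class Anchor:
    """Hypothetical stand-in for an <a> node: its attributes and its text fragments."""
    attributes: dict = field(default_factory=dict)
    fragments: list = field(default_factory=list)


def parse_alink(a, sep="", data_url_fallback=False):
    """Return the link's text and url; a missing href yields url=None instead of raising."""
    url = a.attributes.get("href")
    if url is None and data_url_fallback:
        # Assumed behavior: lazy-loaded thumbnails carry the target in data-url.
        url = a.attributes.get("data-url")
    return {"text": sep.join(a.fragments), "url": url}
```

Calling `parse_alink(Anchor({"href": "/a"}, ["A", "B"]), sep="|")` joins the fragments as `"A|B"`, and an anchor without `href` produces `url=None` rather than a `KeyError`.
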

Convert parse_knowledge_panel's 13-branch if/elif cascade into an
ordered (detect-and-handle) table mirroring classifiers/main.py. Each
_subtype_* handler inspects the node, mutates parsed/details when it
recognizes its sub_type, and returns True to consume the dispatch
chain; _subtype_panel is the fallback.

Behavior-preserving by construction, including the two conditional
consumers: things_to_know claims the panel even when the heading text
is unrecognized (no sub_type set), and the dynamic JNkvid slug branch
falls through to panel when the level-2 section heading is absent.
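
The ordered detect-and-handle table can be sketched with three handlers. The handler names follow the `_subtype_*` convention from the commit, but the detection conditions, the dict-based node, and the single-dict return value are assumptions made to keep the example self-contained.

```python
def _subtype_featured_snippet(node, parsed):
    """Plain handler: claims the node only when it recognizes the heading."""
    if node.get("heading") != "Featured snippet":
        return False
    parsed["sub_type"] = "featured_snippet"
    return True


def _subtype_things_to_know(node, parsed):
    """Conditional consumer: claims the panel even if the heading is unrecognized."""
    if "things_to_know" not in node.get("markers", ()):
        return False
    if node.get("heading") == "Things to know":
        parsed["sub_type"] = "things_to_know"
    return True  # stops the chain with or without a sub_type


def _subtype_panel(node, parsed):
    """Fallback: always matches, so it must stay last in the table."""
    parsed["sub_type"] = "panel"
    return True


# Order matters: the first handler that returns True consumes the dispatch chain.
_SUBTYPE_HANDLERS = (_subtype_featured_snippet, _subtype_things_to_know, _subtype_panel)


def parse_knowledge_panel(node):
    parsed = {"type": "knowledge", "sub_type": None}
    for handler in _SUBTYPE_HANDLERS:
        if handler(node, parsed):
            break
    return parsed
```

A handler that returns `False` falls through to the next entry, which is how a branch like the dynamic slug case can defer to the panel fallback when its heading is absent.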

Adds tests/test_knowledge_dispatch.py pinning all 13 sub_types -- the
five with no SERP-fixture coverage (featured_snippet, finance,
calculator, election, and the dynamic slug branch) plus the two
conditional-consumer edges -- via synthetic markup and the curated
coverage fixture. Full suite: 296 passed, 66 snapshots unchanged.

Add panel_rhs to the knowledge ComponentType sub_types (the RHS parser
normalizes its rows to type=knowledge/sub_type=panel_rhs) and document
that knowledge is an open sub_type space -- the JNkvid branch mints
section-heading slugs (movies, songs, lyrics, played-by, cast-and-crew)
that cannot be exhaustively enumerated.

Records resolutions in plan 028: dispatch (table-driven), knowledge vs
knowledge_rhs (stay separate, share only the link helper), slug branch
(kept open), link parsing (_common.py). The details-schema typed-details
alignment is deferred to its own focused effort -- it needs a concrete
target schema and changes output broadly, so it is best reviewed in
isolation. Full suite: 296 passed, 66 snapshots unchanged.

Phases 1-3a (parse_alink reconciliation, table-driven knowledge
dispatch, sub_type registry close-out) resolve four of the plan's five
open questions and are behavior-preserving (296 passed, 66 snapshots
unchanged). The fifth -- details-schema alignment with the typed-details
direction -- is deferred to a focused follow-up (plan 029) because it
needs a concrete target schema defined first and changes output broadly,
so its snapshot churn is best reviewed in isolation.

The table-driven handlers annotate h2_text as str, but get_text returns
str|None, so pyrefly flagged 13 bad-argument-type errors at the dispatch
call sites. Coerce once at the source: h2_text = get_text(h2) or "".
Behavior-identical -- neither None nor "" matches any handler's literal
checks. pyrefly: 0 errors; pytest: 296 passed (3.12).

- Plan 028 Status prose said 'In progress' while frontmatter was
  'completed'; align prose with the completed status and point at 029.
- Pinning tests passed the document root into parse_knowledge_panel;
  select the div.kp-blk component root instead, mirroring production
  dispatch and avoiding matches leaking outside the panel.

Record that the unified lenient parse_alink returns url=None for the
top_image_carousel data_url_fallback path (vs the old "" coalescing),
reviewed and accepted on PR #141: kept uniform with the other lenient
call sites; None is only reachable from an empty attribute value, which
is not observed and moved no snapshot/test.
@gitronald gitronald closed this May 31, 2026
@gitronald gitronald deleted the claude/post-merge-status-check-52Z1B branch May 31, 2026 17:28
2 participants