Plan 028: knowledge dispatch + parse_alink reconciliation#141
Merged
Conversation
Replace four near-identical private parse_alink copies (general, knowledge, knowledge_rhs, top_image_carousel) with one parameterized helper in component_parsers/_common.py: parse_alink(a, sep='', data_url_fallback=False) - sep covers the '|' multi-fragment join used by knowledge and the image carousel; '' for the rest. - data_url_fallback covers the carousel's lazy-loaded data-url thumbs. - Missing href yields url=None (lenient) instead of raising; every current call site already guards href presence, so output is unchanged (full suite + 66 snapshots green). Also moves the shared parse_alink_list off general.py into _common.
Convert parse_knowledge_panel's 13-branch if/elif cascade into an ordered (detect-and-handle) table mirroring classifiers/main.py. Each _subtype_* handler inspects the node, mutates parsed/details when it recognizes its sub_type, and returns True to consume the dispatch chain; _subtype_panel is the fallback. Behavior-preserving by construction, including the two conditional consumers: things_to_know claims the panel even when the heading text is unrecognized (no sub_type set), and the dynamic JNkvid slug branch falls through to panel when the level-2 section heading is absent. Adds tests/test_knowledge_dispatch.py pinning all 13 sub_types -- the five with no SERP-fixture coverage (featured_snippet, finance, calculator, election, and the dynamic slug branch) plus the two conditional-consumer edges -- via synthetic markup and the curated coverage fixture. Full suite: 296 passed, 66 snapshots unchanged.
Add panel_rhs to the knowledge ComponentType sub_types (the RHS parser normalizes its rows to type=knowledge/sub_type=panel_rhs) and document that knowledge is an open sub_type space -- the JNkvid branch mints section-heading slugs (movies, songs, lyrics, played-by, cast-and-crew) that cannot be exhaustively enumerated. Records resolutions in plan 028: dispatch (table-driven), knowledge vs knowledge_rhs (stay separate, share only the link helper), slug branch (kept open), link parsing (_common.py). The details-schema typed-details alignment is deferred to its own focused effort -- it needs a concrete target schema and changes output broadly, so it is best reviewed in isolation. Full suite: 296 passed, 66 snapshots unchanged.
Phases 1-3a (parse_alink reconciliation, table-driven knowledge dispatch, sub_type registry close-out) resolve four of the plan's five open questions and are behavior-preserving (296 passed, 66 snapshots unchanged). The fifth -- details-schema alignment with the typed-details direction -- is deferred to a focused follow-up (plan 029) because it needs a concrete target schema defined first and changes output broadly, so its snapshot churn is best reviewed in isolation.
Contributor
There was a problem hiding this comment.
Pull request overview
Implements Plan 028’s core refactor work by consolidating repeated parse_alink logic into a shared helper module and converting parse_knowledge_panel sub-type detection into an ordered, table-driven dispatch. It also updates the knowledge subtype registry/documentation and adds focused tests pinning previously uncovered dispatch branches.
Changes:
- Added
component_parsers/_common.pywith sharedparse_alink/parse_alink_list, and rewired existing parsers to use it. - Refactored
parse_knowledge_panelfrom a largeif/elifcascade into ordered detect-and-handle handlers with a panel fallback. - Added dispatch pinning tests and updated plan docs / component subtype registry (
panel_rhs).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
WebSearcher/component_types.py |
Documents open knowledge sub-type space and adds panel_rhs to the closed registry set. |
WebSearcher/component_parsers/_common.py |
Introduces shared anchor/link parsing helpers used by multiple parsers. |
WebSearcher/component_parsers/top_image_carousel.py |
Switches carousel link parsing to shared parse_alink helper (with data-url fallback). |
WebSearcher/component_parsers/knowledge.py |
Converts knowledge panel subtype detection to ordered handler table + fallback. |
WebSearcher/component_parsers/knowledge_rhs.py |
Switches RHS link parsing to shared parse_alink. |
WebSearcher/component_parsers/general.py |
Switches hyperlink list parsing to shared parse_alink_list. |
tests/test_knowledge_dispatch.py |
Adds pinning coverage for knowledge dispatch subtypes and edge-case consumers. |
docs/plans/027-component-parser-class-vs-function-standardization.md |
Marks Plan 027 as completed. |
docs/plans/028-knowledge-parsers-and-alink-reconciliation.md |
Updates Plan 028 status/decisions and documents remaining work split-out. |
docs/plans/029-knowledge-details-schema-alignment.md |
Adds new Plan 029 draft for deferred knowledge details schema alignment work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The table-driven handlers annotate h2_text as str, but get_text returns str|None, so pyrefly flagged 13 bad-argument-type errors at the dispatch call sites. Coerce once at the source: h2_text = get_text(h2) or "". Behavior-identical -- None and "" compare equal-to-nothing against every handler's literal checks. pyrefly: 0 errors; pytest: 296 passed (3.12).
- Plan 028 Status prose said 'In progress' while frontmatter was 'completed'; align prose with the completed status and point at 029. - Pinning tests passed the document root into parse_knowledge_panel; select the div.kp-blk component root instead, mirroring production dispatch and avoiding matches leaking outside the panel.
Record that the unified lenient parse_alink returns url=None for the top_image_carousel data_url_fallback path (vs the old "" coalescing), reviewed and accepted on PR #141: kept uniform with the other lenient call sites; None is only reachable from an empty attribute value, which is not observed and moved no snapshot/test.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements the dispatch/reconciliation core of Plan 028 (knowledge parsers rethink +
parse_alinkreconciliation), plus bookkeeping for the merged Plan 027. Built on thefeature/v0.9.0line.All phases are behavior-preserving: 296 tests pass (+18 new), 66 snapshots unchanged, ruff clean.
Phases
Phase 1 —
parse_alinkreconciliation. Replaced four near-identical privateparse_alinkcopies (general,knowledge,knowledge_rhs,top_image_carousel) with one parameterized helper in a newcomponent_parsers/_common.py:parse_alink(a, sep="", data_url_fallback=False). Missinghrefnow yieldsurl=None(lenient) rather than raising; every current call site already guards href presence, so output is unchanged. Sharedparse_alink_listmoved offgeneral.py.Phase 2 — table-driven knowledge dispatch. Converted
parse_knowledge_panel's 13-branchif/elifcascade into an ordered detect-and-handle table (_SUBTYPE_HANDLERS+_subtype_panelfallback), mirroringclassifiers/main.py. Behavior-preserving by construction, including the two conditional-consumer branches (things_to_knowclaims the panel even when the heading text is unrecognized; the dynamicJNkvidslug branch falls through topanelwhen the level-2 heading is absent). Addstests/test_knowledge_dispatch.py(18 pins) covering the five sub_types no SERP fixture exercised + the two edge cases.Phase 3a — registry close-out. Added
panel_rhsto theknowledgeComponentTypesub_types and documented the open dynamic-slug space.Plan bookkeeping
completed(its PR Standardize component parsers on module-level functions (plan 027) #139 was merged).completedfor this scope; split the deferred 5th open question —details-schema typed alignment — into Plan 029, since it needs a concrete target schema defined first and changes output broadly (best reviewed in isolation).Resolved design questions (4 of 5)
knowledgevsknowledge_rhssharingparse_alinkwas true duplication (now unified)component_parsers/_common.pydetailsschema alignmenthttps://claude.ai/code/session_01XQ6dz41mc3okJAbfFRpniF
Generated by Claude Code