Add HTML import pipeline with html-snapshot tool for browser-rendered capture. #3444
OnionsYu wants to merge 188 commits into
…ten incremental builds.
…wed properties and reporting structural errors.
…before importing.
…d accept rgb/rgba background colors.
…inline SVG renders correctly.
…um and emits subset snapshots for pagx import.
… contents form elements and SVG color cascade.
…e HTML mocks to viewport.
…erve original visual order.
…recover padding before stretch erases their width.
…serves the on-screen size.
… renders without network access.
…o longer leak past their corners on import.
…or clearer reuse.
…y across the HTML corpus.
…alifies so PAGX import preserves author flex layout.
…use it without losing flex flow.
…into per-kind functions for clearer reuse.
…ch the live page and pagx render output.
…during import so post cards stop collapsing on capture.
…t so centred badges and padded labels land in the right spot.
…riting-mode support and new normalization passes.
…ont assets across snapshot requests.
…rt_html2 # Conflicts: # include/pagx/PAGXDocument.h # src/pagx/svg/SVGExporter.cpp
| Builder& withOptions(const Options& options); | ||
| Builder& addDefaultPasses(); | ||
| Builder& addPass(std::unique_ptr<HTMLTransformPass> pass); |
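Builder::addPass takes a std::unique_ptr<HTMLTransformPass>, but the HTMLTransformPass class is declared only in src/pagx/html/importer/HTMLTransformContext.h and is not exposed publicly. External callers can neither #include the class nor derive from it to inject a custom pass, so in practice addPass is usable only by tests inside the repository. Suggestion: either move the HTMLTransformPass abstract base class into include/pagx/ as well, or remove Builder/addPass from the public header and keep only the static Transform().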
| * config, which is the right choice for HTML-imported documents that | ||
| * want to keep the importer's fallback stack. | ||
| */ | ||
| void applyLayout(const FontConfig* fontConfig = nullptr); |
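Semantic trap: HTMLImporter writes the fallback families collected from CSS into document->fontConfig(), but a caller who follows the usual convention and passes their own FontConfig via applyLayout(&myConfig) overwrites the importer-injected fallbacks wholesale, so glyph fallback is lost. The doc comment's "merge them in via fontConfig() first" shifts the responsibility onto the caller, which is error-prone. Suggestion: switch to merge semantics (for example, add setFontConfig plus a no-argument applyLayout(), or merge inside applyLayout) so that callers do not silently lose the fallbacks.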
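One way the merge semantics suggested above might look is sketched below. `FontConfigSketch` and `MergeFontConfig` are hypothetical stand-ins invented for illustration; the real pagx `FontConfig` is not shown in this review and may be shaped quite differently.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for pagx's FontConfig (assumption:
// the real type carries more than a list of fallback family names).
struct FontConfigSketch {
  std::vector<std::string> fallbackFamilies;
};

// Merge instead of overwrite: the caller's families keep priority, and any
// importer-collected family that is not already present is appended.
FontConfigSketch MergeFontConfig(const FontConfigSketch& importerConfig,
                                 const FontConfigSketch* callerConfig) {
  FontConfigSketch merged;
  if (callerConfig != nullptr) {
    merged.fallbackFamilies = callerConfig->fallbackFamilies;
  }
  for (const auto& family : importerConfig.fallbackFamilies) {
    auto& list = merged.fallbackFamilies;
    if (std::find(list.begin(), list.end(), family) == list.end()) {
      list.push_back(family);
    }
  }
  return merged;
}

// Small self-check showing that the importer's fallback survives the merge.
void CheckMergeKeepsImporterFallbacks() {
  FontConfigSketch importer{{"Noto Sans CJK", "Arial"}};
  FontConfigSketch caller{{"Arial"}};
  FontConfigSketch merged = MergeFontConfig(importer, &caller);
  assert(merged.fallbackFamilies.size() == 2);
  assert(merged.fallbackFamilies[1] == "Noto Sans CJK");
}
```

With a merge like this, passing a caller config no longer discards what the importer collected, and passing nullptr keeps the importer's list unchanged.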
| * that don't admit a clean inference are left untouched. Default false (lossy | ||
| * heuristic — opt in explicitly). No effect when `autoNormalize` is false. | ||
| */ | ||
| bool inferFlexFromAbsolute = false; |
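Options such as inferFlexFromAbsolute / preserveUnknownElements / strict exist in both HTMLImporter::Options and HTMLSubsetTransformer::Options, and HTMLParserContext copies them one by one in parseDOM (topts.strict = _options.strict and so on). If the two Options sets carry the same semantics, consider having HTMLImporter::Options directly own an HTMLSubsetTransformer::Options field (or a std::optional). That avoids updating two places every time a transformer option is added and lowers the risk of the two drifting out of sync.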
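The ownership idea can be sketched roughly as follows. `TransformerOptionsSketch`, `ImporterOptionsSketch`, and `TransformerOptionsFor` are hypothetical names, and the default values are placeholders rather than the PR's actual defaults.

```cpp
#include <cassert>

// Hypothetical stand-in for HTMLSubsetTransformer::Options.
struct TransformerOptionsSketch {
  bool strict = false;
  bool preserveUnknownElements = false;
  bool inferFlexFromAbsolute = false;
};

// Hypothetical stand-in for HTMLImporter::Options: the transformer options
// are owned as a single nested field instead of being redeclared.
struct ImporterOptionsSketch {
  bool autoNormalize = true;
  TransformerOptionsSketch transform;
};

// The importer hands the nested struct through unchanged, so a newly added
// transformer option needs no field-by-field copy code.
const TransformerOptionsSketch& TransformerOptionsFor(const ImporterOptionsSketch& options) {
  return options.transform;
}

void CheckNestedOptionsForwarding() {
  ImporterOptionsSketch options;
  options.transform.strict = true;
  assert(TransformerOptionsFor(options).strict);
  assert(!TransformerOptionsFor(options).inferFlexFromAbsolute);
}
```

The trade-off is that callers write `options.transform.strict` instead of `options.strict`, which is a source-level API change.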
| * Note: only the contents at call time are copied; the pointer does not need to outlive | ||
| * the returned document. | ||
| */ | ||
| FontConfig* fontConfig = nullptr; |
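fontConfig is a raw non-const pointer, yet the doc comment explicitly states "only the contents at call time are copied; the pointer does not need to outlive the returned document". In other words, the importer reads it once, deep-copies it, and never needs to modify it. Suggest changing it to const FontConfig* so callers are not misled into thinking the importer holds or mutates the object.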
| bool composeMatrix; | ||
| }; | ||
| const TransformHandlerEntry kTransformHandlers[] = { |
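The project coding convention explicitly requires "all caps with underscores (static constants), no k prefix". kTransformHandlers / kEps here (kEps also appears at HTMLTransformPasses.cpp:1272) violate that convention. The rest of the newly added code follows the rule, so suggest renaming these consistently to TransformHandlers / Eps or EPS.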
| return HexToColor(hex, /*hasAlpha=*/length == 5); | ||
| } | ||
| if (length == 7 || length == 9) { | ||
| uint32_t hex = std::strtoul(value.c_str() + 1, nullptr, 16); |
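For the #hex format, parseColor validates only the length (4/5/7/9) and does not check that the characters are valid hexadecimal digits. strtoul stops at the first invalid character and returns the value of the prefix parsed so far, so "#ZZZZZZ" is parsed as 0 (pure black) and no diagnostic warning is emitted. Suggest comparing endptr against the end of the string after parsing and, for invalid hex, either falling through to the named-color branch or emitting an "unrecognised color" warning.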
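A minimal sketch of the stricter check follows. `ParseHexDigits` is a hypothetical helper, not the PR's parseColor. It checks every character with `std::isxdigit` in addition to the endptr comparison, because `strtoul` with base 16 also accepts a leading "0x" prefix, which an endptr check alone would let through.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: returns the numeric value of a "#rgb", "#rgba",
// "#rrggbb" or "#rrggbbaa" literal, or std::nullopt if the length is wrong
// or any character after '#' is not a hexadecimal digit.
std::optional<uint32_t> ParseHexDigits(const std::string& value) {
  const std::size_t length = value.size();
  if (length != 4 && length != 5 && length != 7 && length != 9) {
    return std::nullopt;
  }
  if (value[0] != '#') {
    return std::nullopt;
  }
  // Reject non-hex characters up front; this also rules out the "0x" prefix,
  // sign characters and whitespace that strtoul would otherwise accept.
  for (std::size_t i = 1; i < length; ++i) {
    if (!std::isxdigit(static_cast<unsigned char>(value[i]))) {
      return std::nullopt;
    }
  }
  const char* begin = value.c_str() + 1;
  char* end = nullptr;
  const unsigned long parsed = std::strtoul(begin, &end, 16);
  // Defensive endptr check: the whole digit run must have been consumed.
  if (end != begin + (length - 1)) {
    return std::nullopt;
  }
  return static_cast<uint32_t>(parsed);
}

void CheckHexParsing() {
  assert(!ParseHexDigits("#ZZZZZZ").has_value());
  assert(ParseHexDigits("#ff8000").value() == 0xff8000u);
}
```

A caller could treat `std::nullopt` as the signal to try the named-color branch or to emit a diagnostic, whichever the importer prefers.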
| * The transformer is structured as a configurable pipeline of independent passes (see | ||
| * `Builder`) so new behaviour can be added without modifying the default flow. | ||
| */ | ||
| class HTMLSubsetTransformer { |
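Is it really necessary to expose HTMLSubsetTransformer as a full public API (including Builder/Diagnostic/Severity/Result) under include/pagx/? HTMLImporter already invokes it automatically via autoNormalize, so ordinary users never need to touch it directly; normalization and the importer are two steps of the same pipeline and are implementation details. Suggest evaluating: (a) exposing only a minimal static normalize() function; or (b) moving the whole transformer into src/pagx/html/importer/ as an internal component, with the spec committing only to the importer's behavior. Either option shrinks the public API surface and reduces the future maintenance and compatibility burden.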
| @@ -0,0 +1,1692 @@ | |||
| ///////////////////////////////////////////////////////////////////////////////////////////////// | |||
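This file is 1692 lines and HTMLStyleCascade.cpp is 1008 lines. Single files this large slow down incremental builds (the PR description itself stresses "splitting by responsibility to shorten incremental build time"). Suggest moving each Pass into its own cpp file (DocumentSkeletonPass.cpp, AbsoluteToFlexInferencePass.cpp, and so on), keeping only the shared helpers and the registration entry point in the main file.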
| */ | ||
| class HTMLImporter { | ||
| public: | ||
| struct Options { |
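From a public-API point of view, HTMLImporter::Options has rather many fields (targetWidth/Height, preferBodySize, strict, preserveUnknownElements, autoNormalize, inferFlexFromAbsolute, fontConfig: seven entries in total), and autoNormalize, preserveUnknownElements, and inferFlexFromAbsolute are all transformer behavior or debugging knobs. Suggest splitting them: keep the commonly used ones (targetWidth/Height, fontConfig, strict) at the top level and pack the rest into a nested DebugOptions / NormalizeOptions struct. The main-path API then looks cleaner, and future additions will not clutter the top level.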
| * Custom passes can be inserted via `addPass`. The pipeline is owned by the resulting | ||
| * transformer instance. | ||
| */ | ||
| class Builder { |
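Two calling styles are exposed publicly, the static Transform() and Builder, but Builder exists only to serve a "non-default pass list", and HTMLTransformPass is currently not exposed publicly (see H1 above). If Builder is to stay in the public API, suggest documenting its use case clearly (spec/example). Otherwise, keep static Transform() as the single entry point.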
| * (`<body style="width:…">` or the body's intrinsic content size). Both targetWidth | ||
| * and targetHeight must be set (non-NaN) to take effect. | ||
| */ | ||
| float targetWidth = NAN; |
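targetWidth/Height default to NAN, and the documentation relies on "When not NaN, overrides ..." to express "not set", which is unintuitive for callers (a NaN check is easy to miswrite as ==, and the header has to include <cmath>). In a C++17 project, std::optional<float> is the clearer choice. The same change is suggested for canvasWidth/canvasHeight in HTMLSubsetTransformer::Options and targetWidth/Height in HTMLImporter::Options; it also removes the cmath dependency.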
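A small sketch of the optional-based variant is shown below. `SizeOptionsSketch`, `SizeSketch`, and `ResolveCanvasSize` are hypothetical names; the rule that both dimensions must be set before the override applies is taken from the doc comment in the diff.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the size-related part of HTMLImporter::Options.
// An empty optional means "not set", so no NaN sentinel is needed.
struct SizeOptionsSketch {
  std::optional<float> targetWidth;
  std::optional<float> targetHeight;
};

struct SizeSketch {
  float width;
  float height;
};

// The override applies only when both dimensions are set; otherwise the
// size derived from the document body is used.
SizeSketch ResolveCanvasSize(const SizeOptionsSketch& options, SizeSketch bodySize) {
  if (options.targetWidth.has_value() && options.targetHeight.has_value()) {
    return {*options.targetWidth, *options.targetHeight};
  }
  return bodySize;
}

void CheckCanvasSizeResolution() {
  SizeOptionsSketch options;
  options.targetWidth = 320.0f;  // only one dimension set: no override yet
  assert(ResolveCanvasSize(options, {800.0f, 600.0f}).width == 800.0f);
  options.targetHeight = 240.0f;  // both set: override takes effect
  assert(ResolveCanvasSize(options, {800.0f, 600.0f}).width == 320.0f);
}
```

The "is it set" test becomes an ordinary `has_value()` call, which cannot be accidentally replaced by an equality comparison against NaN.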
| */ | ||
| FontConfig* fontConfig = nullptr; | ||
| Options() { |
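Explicitly defining an empty Options() {} is unnecessary: every member already has an in-class initializer, so default aggregate/implicit construction in C++ behaves identically. The empty constructor actually prevents Options from being an aggregate (which affects designated initializers and similar features). Suggest simply deleting the constructor. HTMLSubsetTransformer::Options has the same issue.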
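The aggregate point can be checked at compile time with `std::is_aggregate_v` (C++17). The two structs below are hypothetical reductions of an Options type with placeholder fields and defaults, not the PR's actual declarations.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical reduction: a user-provided constructor, even an empty one,
// makes the type a non-aggregate.
struct OptionsWithEmptyCtor {
  OptionsWithEmptyCtor() {}
  bool strict = false;
  bool preserveUnknownElements = false;
};

// Same members without the constructor: still default-constructible with the
// same values, and the type remains an aggregate.
struct OptionsWithoutCtor {
  bool strict = false;
  bool preserveUnknownElements = false;
};

static_assert(!std::is_aggregate_v<OptionsWithEmptyCtor>,
              "a user-provided constructor disables aggregate status");
static_assert(std::is_aggregate_v<OptionsWithoutCtor>,
              "default member initializers alone keep the type an aggregate");

// Aggregate initialization sets `strict`; the omitted member falls back to
// its default member initializer.
OptionsWithoutCtor MakeStrictOptions() {
  return OptionsWithoutCtor{true};
}

void CheckAggregateDefaults() {
  OptionsWithoutCtor defaults;
  assert(!defaults.strict);
  assert(MakeStrictOptions().strict);
  assert(!MakeStrictOptions().preserveUnknownElements);
}
```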
…c child offsets surface as padding.
…o flex containers under stretch parents keep accurate padding.
… works after install without a libpag checkout.
…napshot tool dir in setup script.
…tter labels do not collapse on import.
…apshot so toggle thumbs and similar generated boxes survive into the subset HTML.
…ing snapshot so paths styled via stylesheet selectors keep their fill stroke and dash settings in the subset HTML.
…inline SVG arcs so pure CSS ring spinners survive into the subset HTML.
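Adds an HTML-to-PAGX import pipeline and the accompanying html-snapshot tool, so that any JS-driven HTML page can be captured and converted into a PAGX document.
Main changes:
src/pagx/html module:
tools/html-snapshot tool:
Tests:
Other: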