Add HTML import pipeline with html-snapshot tool for browser-rendered capture. #3444

Open
OnionsYu wants to merge 188 commits into Tencent:main from OnionsYu:feature/onionsyu_import_html2

@OnionsYu

Contributor

Adds an HTML-to-PAGX import pipeline, together with a companion html-snapshot tool, so that any JS-driven HTML page can be captured and converted into a PAGX document.

Main changes:

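src/pagx/html module:

  • Adds translation units HTMLImporter, HTMLParserContext, HTMLLayerBuilder, HTMLStyleResolver, HTMLValueParser, HTMLSubsetPropertyTable, and others, split by responsibility to shorten incremental build times
  • HTMLSubsetTransformer runs as a preliminary pass that automatically normalizes non-subset HTML into the importable subset
  • HTMLTransformPasses provides several passes, including inferring flex from absolute positioning, border-radius resolution, object-fit mapping, and background-clip text gradients
  • All code lives in the pagx::html namespace, with lambdas and duplicated code removed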

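tools/html-snapshot tool:

  • Renders the target page in headless Chromium and produces a snapshot conforming to the html_subset spec, with support for URL input and cookie/header forwarding
  • Split into lib modules, with schema-driven style output, caching of resolved text styles, and shared gradient and decoration helpers
  • Provides an eval subcommand that runs a snapshot-to-pagx fidelity evaluation over a corpus and renders the comparison in Chromium
  • Defaults to a 2x render scale, automatically inlines remote image resources, and installs puppeteer on demand on first run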

Tests:

  • Adds PAGXHTMLImporterTest and PAGXHTMLSubsetTransformerTest, covering transform fallback, length resolver, gradient stops cascade, comments, CLI error paths, box-shadow edge cases, and similar scenarios; html module coverage rises from 84% to 90%

Other:

  • Adds the html_subset spec document and the html-pagx-gen skill to help generate HTML input

OnionsYu added 30 commits May 13, 2026 20:40
…wed properties and reporting structural errors.
…um and emits subset snapshots for pagx import.
… contents form elements and SVG color cascade.
…recover padding before stretch erases their width.
…alifies so PAGX import preserves author flex layout.
…during import so post cards stop collapsing on capture.
…t so centred badges and padded labels land in the right spot.

Builder& withOptions(const Options& options);
Builder& addDefaultPasses();
Builder& addPass(std::unique_ptr<HTMLTransformPass> pass);

Collaborator

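Builder::addPass takes a unique_ptr, but the HTMLTransformPass class is declared only in src/pagx/html/importer/HTMLTransformContext.h and is not exposed publicly. External callers can neither #include the class nor subclass it to inject a custom pass, so in practice addPass can only be used by tests inside the repository. Suggestion: either move the HTMLTransformPass abstract base class into include/pagx/ as well, or remove Builder/addPass from the public header and keep only the static Transform().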
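
A minimal sketch of the first option, to show why the base class has to be visible to callers before addPass is useful. The names `TransformPass`, `PipelineBuilder`, and `StripTabsPass` are hypothetical stand-ins, not the PR's real types, and the pass operates on a plain string only to keep the example small.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical public pass interface (stand-in for an exported HTMLTransformPass).
class TransformPass {
 public:
  virtual ~TransformPass() = default;
  virtual void run(std::string& html) = 0;
};

// Hypothetical builder that mirrors the addPass(std::unique_ptr<...>) shape in the diff.
class PipelineBuilder {
 public:
  PipelineBuilder& addPass(std::unique_ptr<TransformPass> pass) {
    passes.push_back(std::move(pass));
    return *this;
  }

  // Runs every registered pass in insertion order and returns the result.
  std::string transform(std::string html) const {
    for (const auto& pass : passes) {
      pass->run(html);
    }
    return html;
  }

 private:
  std::vector<std::unique_ptr<TransformPass>> passes;
};

// A caller-defined pass. This only compiles when TransformPass is in a public header.
class StripTabsPass : public TransformPass {
 public:
  void run(std::string& html) override {
    for (auto& ch : html) {
      if (ch == '\t') {
        ch = ' ';
      }
    }
  }
};
```

With the base class exported, a caller could write `PipelineBuilder().addPass(std::make_unique<StripTabsPass>())`; without it, the subclass cannot be declared outside the repository at all.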

* config, which is the right choice for HTML-imported documents that
* want to keep the importer's fallback stack.
*/
void applyLayout(const FontConfig* fontConfig = nullptr);

Collaborator

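Semantic trap: HTMLImporter writes the fallback families collected from CSS into document->fontConfig(), but a caller who follows convention and calls applyLayout(&myConfig) with their own FontConfig will overwrite the importer-injected fallbacks wholesale, losing glyph fallback. The doc's "merge them in via fontConfig() first" shifts the responsibility onto the caller and is error-prone. Suggest switching to merge semantics (for example, adding setFontConfig plus a no-argument applyLayout(), or merging inside applyLayout) so that callers do not silently lose fallbacks.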
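
One way the merge semantics could look, as a sketch only. `FontConfigSketch` and `MergeFontConfig` are hypothetical; the real FontConfig layout is not shown in this thread, and the ordering (caller families first, importer fallbacks appended) is an assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for FontConfig: just an ordered fallback family list.
struct FontConfigSketch {
  std::vector<std::string> fallbackFamilies;
};

// Keeps the caller's families in their original order, then appends any
// importer-collected family that the caller did not already list.
FontConfigSketch MergeFontConfig(const FontConfigSketch& caller,
                                 const FontConfigSketch& imported) {
  FontConfigSketch merged = caller;
  auto& list = merged.fallbackFamilies;
  for (const auto& family : imported.fallbackFamilies) {
    if (std::find(list.begin(), list.end(), family) == list.end()) {
      list.push_back(family);
    }
  }
  return merged;
}
```

Merging inside the layout call would mean a caller passing their own config keeps the CSS-derived fallbacks without having to know they exist.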

* that don't admit a clean inference are left untouched. Default false (lossy
* heuristic — opt in explicitly). No effect when `autoNormalize` is false.
*/
bool inferFlexFromAbsolute = false;

Collaborator

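Options such as inferFlexFromAbsolute / preserveUnknownElements / strict exist in both HTMLImporter::Options and HTMLSubsetTransformer::Options, and HTMLParserContext copies them field by field in parseDOM (topts.strict = _options.strict, and so on). If the two Options structs share the same semantics, consider having HTMLImporter::Options directly own an HTMLSubsetTransformer::Options field (or a std::optional), so that adding a transformer option no longer requires updating two places in sync, which reduces the risk of the two drifting apart.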
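
A sketch of the "own the transformer options" shape, under stated assumptions: `TransformerOptions`, `ImporterOptions`, the `normalize` field name, and `ResolveTransformerOptions` are all hypothetical, and every default except `inferFlexFromAbsolute = false` is assumed rather than taken from the diff.

```cpp
// Hypothetical stand-in for HTMLSubsetTransformer::Options.
struct TransformerOptions {
  bool strict = false;                   // assumed default
  bool preserveUnknownElements = false;  // assumed default
  bool inferFlexFromAbsolute = false;    // default shown in the diff
};

// Hypothetical stand-in for HTMLImporter::Options that embeds the transformer
// options instead of redeclaring each field.
struct ImporterOptions {
  bool autoNormalize = true;  // assumed default
  TransformerOptions normalize;
};

// The importer forwards the embedded struct as a whole, so a new transformer
// option needs no extra copy statement here.
TransformerOptions ResolveTransformerOptions(const ImporterOptions& options) {
  return options.normalize;
}
```

The per-field copy in parseDOM then collapses to a single assignment, and a newly added transformer field cannot be forgotten on the importer side.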

* Note: only the contents at call time are copied; the pointer does not need to outlive
* the returned document.
*/
FontConfig* fontConfig = nullptr;

Collaborator

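fontConfig is a raw non-const pointer, but the doc comment explicitly states "only the contents at call time are copied; the pointer does not need to outlive the returned document", meaning the importer reads it once, deep-copies it, and never needs to modify it. Suggest changing it to const FontConfig* so callers are not misled into thinking the importer holds on to or modifies the object.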

bool composeMatrix;
};

const TransformHandlerEntry kTransformHandlers[] = {

Collaborator

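The project coding standard explicitly requires "all uppercase with underscores (static constants), no k prefix". kTransformHandlers / kEps here (also at HTMLTransformPasses.cpp:1272) violate that rule. The rest of the new code follows it; suggest renaming these consistently to TransformHandlers / Eps or EPS.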

return HexToColor(hex, /*hasAlpha=*/length == 5);
}
if (length == 7 || length == 9) {
uint32_t hex = std::strtoul(value.c_str() + 1, nullptr, 16);

Collaborator

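parseColor validates only the length (4/5/7/9) for #hex formats and does not check that the characters are valid hexadecimal digits. strtoul stops at the first invalid character and returns the value of the prefix parsed so far, so "#ZZZZZZ" is parsed as 0 (pure black) and no diagnostic warning is emitted. Suggest comparing endptr against the end of the string after parsing, and for invalid hex either falling through to the named color branch or emitting an unrecognised color warning.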
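
A sketch of stricter validation for the 7/9-character case, assuming a hypothetical helper `ParseHexDigits` (not a function in the PR). It checks every digit with std::isxdigit before calling strtoul rather than relying on endptr alone, because strtoul with base 16 also accepts a leading sign or a "0x" prefix, so an endptr check by itself would still let "#0x1234" through.

```cpp
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: returns the numeric value of "#RRGGBB" or "#RRGGBBAA",
// or std::nullopt when the length is wrong or any character is not a hex digit.
std::optional<uint32_t> ParseHexDigits(const std::string& value) {
  const auto length = value.size();
  if ((length != 7 && length != 9) || value[0] != '#') {
    return std::nullopt;
  }
  for (auto it = value.begin() + 1; it != value.end(); ++it) {
    // Cast to unsigned char first: passing a negative char to isxdigit is undefined.
    if (std::isxdigit(static_cast<unsigned char>(*it)) == 0) {
      return std::nullopt;
    }
  }
  // At most 8 hex digits remain, so the value fits in 32 bits.
  return static_cast<uint32_t>(std::strtoul(value.c_str() + 1, nullptr, 16));
}
```

A caller would treat std::nullopt as the signal to try the named color branch or to emit the diagnostic, instead of silently producing black.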

* The transformer is structured as a configurable pipeline of independent passes (see
* `Builder`) so new behaviour can be added without modifying the default flow.
*/
class HTMLSubsetTransformer {

Collaborator

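Is it really necessary to expose HTMLSubsetTransformer as a full public API (including Builder/Diagnostic/Severity/Result) under include/pagx/? HTMLImporter already invokes it automatically via autoNormalize, so ordinary users never need to touch it directly; normalization and the importer are two implementation steps of the same pipeline. Suggest evaluating: (a) exposing only a minimal static normalize() function; or (b) moving the whole transformer into src/pagx/html/importer/ as an internal detail, with the spec promising only the importer's behaviour. Either option shrinks the public API surface and reduces the future maintenance and compatibility burden.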

@@ -0,0 +1,1692 @@
/////////////////////////////////////////////////////////////////////////////////////////////////

Collaborator

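This file is 1692 lines and HTMLStyleCascade.cpp is 1008 lines; oversized single files slow down incremental builds (the PR description itself stresses "split by responsibility to shorten incremental build times"). Suggest moving each Pass into its own cpp file (DocumentSkeletonPass.cpp, AbsoluteToFlexInferencePass.cpp, and so on), keeping only the shared helpers and the registration entry point in the main file.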

*/
class HTMLImporter {
public:
struct Options {

Collaborator

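HTMLImporter::Options has rather many fields for a public API (targetWidth/Height, preferBodySize, strict, preserveUnknownElements, autoNormalize, inferFlexFromAbsolute, fontConfig, 7 items in total), and autoNormalize, preserveUnknownElements, and inferFlexFromAbsolute are all transformer behaviour or debugging knobs. Suggest splitting: keep the commonly used ones (targetWidth/Height, fontConfig, strict) at the top level and pack the rest into a nested DebugOptions / NormalizeOptions struct, so the main-path API looks cleaner and future extensions do not clutter the top level.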

* Custom passes can be inserted via `addPass`. The pipeline is owned by the resulting
* transformer instance.
*/
class Builder {

Collaborator

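Both the static Transform() and the Builder are exposed publicly as two ways to call the transformer, but Builder only serves the "non-default pass list" case, and HTMLTransformPass is currently not exposed publicly (see H1 above). If Builder is to remain a public API, suggest documenting its use cases (spec/example). Otherwise, keep static Transform() as the single entry point.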

* (`<body style="width:…">` or the body's intrinsic content size). Both targetWidth
* and targetHeight must be set (non-NaN) to take effect.
*/
float targetWidth = NAN;

Collaborator

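targetWidth/Height default to NAN, and the doc relies on "When not NaN, overrides ..." to express "unset", which is unintuitive for callers (a NaN check is easily miswritten as ==, and the header has to #include <cmath>). In a C++17 project, std::optional is the clearer choice. The same change is suggested for canvasWidth/canvasHeight in HTMLSubsetTransformer::Options and targetWidth/Height in HTMLImporter::Options, which also removes the cmath dependency.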
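
A sketch of the optional-based shape, assuming `std::optional<float>` as the field type. `SizeOptions`, `CanvasSize`, and `ResolveCanvasSize` are hypothetical names; the "both must be set to take effect" rule is taken from the doc comment in the diff.

```cpp
#include <optional>

// Hypothetical options struct: an empty optional means "not set", no NaN involved.
struct SizeOptions {
  std::optional<float> targetWidth;
  std::optional<float> targetHeight;
};

struct CanvasSize {
  float width;
  float height;
};

// Uses the override only when both dimensions are set; otherwise falls back to
// the size derived from the body.
CanvasSize ResolveCanvasSize(const SizeOptions& options, CanvasSize bodySize) {
  if (options.targetWidth.has_value() && options.targetHeight.has_value()) {
    return {*options.targetWidth, *options.targetHeight};
  }
  return bodySize;
}
```

The "unset" check becomes `has_value()`, which cannot be miswritten the way `value == NAN` can (that comparison is always false).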

*/
FontConfig* fontConfig = nullptr;

Options() {

Collaborator

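Explicitly defining an empty Options() {} is unnecessary: every member already has an in-class initializer, so the behaviour is identical to C++'s default aggregate/implicit construction. The empty constructor actually prevents Options from being an aggregate (which affects designated initializers and the like). Suggest deleting the constructor. HTMLSubsetTransformer::Options has the same problem.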
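
The aggregate point can be checked at compile time. This is a small illustration with two hypothetical structs (`WithEmptyCtor`, `WithoutCtor`) and made-up fields, not the PR's Options; it requires C++17 for std::is_aggregate_v.

```cpp
#include <type_traits>

// A user-provided constructor, even an empty one, makes the type a non-aggregate.
struct WithEmptyCtor {
  bool strict = false;
  float scale = 1.0f;
  WithEmptyCtor() {
  }
};

// Same members, no constructor: still default-constructible, and an aggregate.
struct WithoutCtor {
  bool strict = false;
  float scale = 1.0f;
};

static_assert(!std::is_aggregate_v<WithEmptyCtor>, "user-provided ctor blocks aggregate");
static_assert(std::is_aggregate_v<WithoutCtor>, "plain struct stays an aggregate");

// Aggregate initialization: strict is set explicitly, scale keeps its in-class default.
WithoutCtor MakeStrict() {
  return WithoutCtor{true};
}
```

Both structs default-construct to the same values, so dropping the empty constructor changes nothing for `Options()` call sites while re-enabling brace initialization of individual members.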

OnionsYu added 12 commits June 11, 2026 20:00
…o flex containers under stretch parents keep accurate padding.
… works after install without a libpag checkout.
…apshot so toggle thumbs and similar generated boxes survive into the subset HTML.
…ing snapshot so paths styled via stylesheet selectors keep their fill stroke and dash settings in the subset HTML.
…inline SVG arcs so pure CSS ring spinners survive into the subset HTML.