webEmbedding is a source-first website cloning engine for AI coding agents: it captures live pages with Playwright, replays network evidence from HAR artifacts, rebuilds only when direct reuse is blocked, and self-verifies the result.
It ships as a Skill + MCP server. Instead of asking a model to "clone this site" from a screenshot, it inspects the URL, chooses a reuse or rebuild route, captures DOM/runtime HTML/styles/assets/network traces, generates bounded frontend reconstruction artifacts, and checks the output with visual, DOM, computed-style, interaction, and responsive-breakpoint verification.
GitHub listing, social preview, and launch-copy recommendations are in docs/github-listing.md.
The current pipeline is strongest for static and semi-static web pages:
- company, brand, marketing, and documentation pages
- public landing pages
- iframe-blocked pages that need capture-based reconstruction
- responsive page snapshots across desktop, tablet, and mobile
It is not a full backend or app-logic clone engine. Login-only screens, app-first or native-app-required services, captcha-heavy sites, maps, games, canvas/WebGL-heavy pages, real-time feeds, payments, booking flows, and private server behavior still need separate handling.
Operationally, the repo is now a production-candidate clone engine for URL-based capture and bounded reconstruction: jobs can be queued, network evidence can be replay-audited from HAR artifacts, authenticated dashboard runs can be driven from user-owned browser state, and local gates verify the route corpus, score checks, package contents, and CI wiring. The remaining hard boundary is server-side product behavior, not front-end evidence capture and reconstruction.
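The filesystem-backed job queue described above (durable JSON records plus worker locks) follows a well-known pattern. As an illustrative sketch only — not the project's actual implementation, and with hypothetical function and field names — it can be approximated with stdlib file locking via `O_EXCL`:

```python
import json
import os
import time
import uuid
from pathlib import Path

def enqueue(queue_dir: Path, url: str) -> str:
    """Persist a clone job as a durable JSON record; survives process restarts."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    job_id = uuid.uuid4().hex
    record = {"id": job_id, "url": url, "status": "queued",
              "attempts": 0, "created_at": time.time()}
    (queue_dir / f"{job_id}.json").write_text(json.dumps(record))
    return job_id

def claim(queue_dir: Path, worker: str):
    """Claim the oldest queued job using an O_EXCL lock file per job."""
    for path in sorted(queue_dir.glob("*.json")):
        lock = path.with_suffix(".lock")
        try:
            # O_EXCL makes creation atomic: only one worker wins the lock
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # another worker holds this job
        os.write(fd, worker.encode())
        os.close(fd)
        record = json.loads(path.read_text())
        if record["status"] != "queued":
            os.unlink(lock)
            continue
        record["status"] = "running"
        record["attempts"] += 1
        path.write_text(json.dumps(record))
        return record
    return None
```

The real queue additionally handles retry scheduling, cancellation, and manifest annotation; this sketch only shows the durability and mutual-exclusion core.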
Recent local benchmark runs from this repo:
| URL | Path | Score |
|---|---|---|
| https://developer.mozilla.org/en-US/ | iframe-blocked bounded rebuild | root 94, visual 95, mobile 94, tablet 94, breakpoint average 94 |
| https://www.mozilla.org/ | bounded rebuild | root 94, visual 100 |
| https://www.python.org | harder bounded rebuild sample | root 90, visual 100 |
| https://www.example.com | exact reuse | ready yes |
These are generated by the local self-verify pipeline, not manually assigned ratings.
The reproducible commands and score thresholds are tracked in docs/benchmark-evidence.json.
Production readiness gates are tracked in docs/production-pipeline-gates.json.
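The scores in the table above come from comparison metrics such as DOM snapshot similarity. The project's actual scoring is not reproduced here; as a rough stdlib illustration of the idea, a normalized sequence similarity over serialized snapshots can be computed with `difflib`:

```python
import difflib
import json

def dom_similarity(reference: dict, candidate: dict) -> float:
    """Score 0-100: normalized similarity of two serialized DOM snapshots.
    Sorting keys keeps the serialization stable across capture runs.
    This is an illustrative metric, not the engine's actual one."""
    a = json.dumps(reference, sort_keys=True)
    b = json.dumps(candidate, sort_keys=True)
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio(), 1)

ref = {"tag": "main", "children": [{"tag": "h1", "text": "Docs"}]}
same = dom_similarity(ref, ref)  # identical snapshots score 100.0
drifted = dom_similarity(ref, {"tag": "main", "children": []})
```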
- Source-first routing:
  - direct iframe or embed reuse when it is safe and frameable
  - original preview, export, remix, or source routes when available
  - bounded rebuild only when exact reuse is unavailable
- Live browser capture:
  - DOM snapshot
  - runtime HTML
  - full-page screenshot
  - computed style summaries
  - CSS analysis
  - asset inventory
  - HAR-like network metadata
  - interaction states and replay traces
  - storage state export for session-aware flows
- Blocked-site rebuild:
  - handles `X-Frame-Options` and CSP-blocked pages by rebuilding from captured evidence
  - generates reusable frontend reconstruction artifacts from captured page structure
  - preserves custom tags, shadow-root host structure, and semantic document structure where captured
- Evidence limitation reporting:
  - separates directly captured artifacts from inferred or missing evidence in reproduction results and prompts
  - marks app-gated, auth-gated, and native-app-led surfaces as bounded evidence, with recommendations for user screenshots or authenticated session capture
- Operational failure classification:
  - reports typed pipeline action codes such as `network-replay-limited`, `auth-session-missing`, `public-app-gate`, and `canvas-visual-fallback`
  - exposes HAR/network `replay_readiness` before treating captured network evidence as replay-grade
- Production pipeline helpers:
  - filesystem-backed async clone job queue with durable JSON records, worker locks, retry scheduling, cancellation, and manifest annotation
  - deterministic HAR replay engine for standard HAR, near-HAR, and captured `network/manifest.json` artifacts
  - authenticated dashboard live corpus runner that accepts a user-provided `storage_state_path` or `user_data_dir` outside the repo
- Self-verification:
  - screenshot similarity
  - DOM snapshot similarity
  - computed-style similarity
  - hover/focus/click interaction state parity
  - interaction trace parity
  - desktop/mobile/tablet breakpoint reports
- Responsive benchmark support:
  - primary desktop viewport: `1440x1200`
  - tablet profile: `768x1024`
  - mobile profile: `390x844`
- Repair loop:
  - bounded self-repair can run when the first scaffold misses the readiness threshold
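The routing decision between direct iframe reuse and bounded rebuild hinges on frameability signals in the target's response headers. As a minimal sketch — the real router weighs more signals (preview/export routes, capture depth, auth gates), and `choose_route` is a hypothetical name — the standard `X-Frame-Options` and CSP `frame-ancestors` checks look like this:

```python
def choose_route(headers: dict) -> str:
    """Return 'iframe-reuse' when nothing blocks framing, else 'bounded-rebuild'.

    Header semantics follow standard browser behavior:
    X-Frame-Options DENY/SAMEORIGIN and restrictive CSP frame-ancestors
    both prevent third-party embedding.
    """
    h = {k.lower(): v for k, v in headers.items()}
    xfo = h.get("x-frame-options", "").strip().lower()
    if xfo in ("deny", "sameorigin"):
        return "bounded-rebuild"
    csp = h.get("content-security-policy", "")
    for directive in csp.split(";"):
        directive = directive.strip().lower()
        if directive.startswith("frame-ancestors"):
            sources = directive.split()[1:]
            # 'frame-ancestors *' allows embedding; 'none'/'self' block third parties
            if "*" not in sources:
                return "bounded-rebuild"
    return "iframe-reuse"
```

A page with no blocking headers routes to reuse; MDN-style pages with `X-Frame-Options: DENY` route to bounded rebuild.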
- Node.js 18 or newer
- Python 3.9 or newer
- Chrome or Chromium available locally for Playwright runtime capture
The package uses playwright-core; it does not download a browser by itself.
Installing this project adds the source-first-clone plugin bundle, the exact-clone-intake skill, and the MCP server that exposes the URL inspection, capture, rebuild, and verification tools.
```bash
npm install -g web-embedding
web-embedding install
web-embedding doctor
```

Clone a public URL after installing:
```bash
web-embedding clone \
  --url https://developer.mozilla.org/en-US/ \
  --output-dir ./.tmp/mdn-clone \
  --wait-seconds 2 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet
```

If you already have an older local plugin installed, overwrite it with:

```bash
web-embedding install --force
web-embedding doctor
```

You can also run the installer without a global install:
```bash
npx web-embedding install
```

```bash
curl -fsSL https://github.com/jongko54/webEmbedding/releases/latest/download/install.sh | bash
```

```bash
git clone https://github.com/jongko54/webEmbedding.git
cd webEmbedding
npm install
node ./bin/web-embedding.mjs install
node ./bin/web-embedding.mjs doctor
```

Useful for testing without touching your real agent home:

```bash
python3 python/web_embedding/installer.py install --target-home ./.tmp/home
python3 python/web_embedding/installer.py doctor --target-home ./.tmp/home
python3 python/web_embedding/installer.py uninstall --target-home ./.tmp/home
```

Telemetry is disabled by default. On an interactive first install, `web-embedding install` asks once and defaults to No. Non-interactive installs such as CI and `curl | bash` do not prompt. If you opt in, web-embedding sends a small anonymous command-completion event to a JSON POST endpoint you control. It does not send target URLs, local paths, captured HTML, screenshots, storage state, environment variables, API keys, or command output.
Enable it during install:
```bash
web-embedding install --telemetry --telemetry-endpoint https://your-collector.example/events
```

Or manage it later:

```bash
web-embedding telemetry enable --endpoint https://your-collector.example/events
web-embedding telemetry status
web-embedding telemetry disable
web-embedding telemetry reset-id
```
web-embedding telemetry reset-idEach event contains an anonymous install id, package version, command name, success/failure status, OS/runtime basics, and coarse option flags such as breakpoint_count or install_source.
Environment controls:
```bash
WEB_EMBEDDING_TELEMETRY=1
WEB_EMBEDDING_NO_TELEMETRY=1
WEB_EMBEDDING_TELEMETRY_PROMPT=0
WEB_EMBEDDING_TELEMETRY_ENDPOINT=https://your-collector.example/events
WEB_EMBEDDING_TELEMETRY_LOG=./telemetry.jsonl
```

Run a local/self-hosted JSONL collector:
```bash
npm run telemetry:collector -- --host 127.0.0.1 --port 8765 --out ./telemetry.jsonl
WEB_EMBEDDING_TELEMETRY=1 \
WEB_EMBEDDING_TELEMETRY_ENDPOINT=http://127.0.0.1:8765/events \
web-embedding doctor
```

Summarize collected usage:

```bash
npm run telemetry:summarize -- ./telemetry.jsonl
```

The summary includes install and clone executions, total command executions, unique anonymous install IDs, command counts, and version counts. See docs/telemetry.md for collector and analyzer details.
Inspect a URL and get route hints:
```bash
node ./bin/web-embedding.mjs inspect \
  --url https://developer.mozilla.org/en-US/
```

Run the full clone workflow:

```bash
node ./bin/web-embedding.mjs clone \
  --url https://developer.mozilla.org/en-US/ \
  --output-dir ./.tmp/mdn-clone \
  --wait-seconds 2 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet
```

Run a lightweight quality benchmark:

```bash
python3 scripts/check_clone_quality_bench.py \
  https://developer.mozilla.org/en-US/ \
  --output-root ./.tmp/clone-quality-bench \
  --wait-seconds 1 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet
```

The benchmark prints compact rows for root, visual, and breakpoint scores. The full artifacts are written under the output directory.
```bash
node ./bin/web-embedding.mjs capabilities
node ./bin/web-embedding.mjs install
node ./bin/web-embedding.mjs doctor
node ./bin/web-embedding.mjs uninstall
node ./bin/web-embedding.mjs paths
node ./bin/web-embedding.mjs telemetry status
```

```bash
node ./bin/web-embedding.mjs inspect --url https://www.mozilla.org/
```

```bash
node ./bin/web-embedding.mjs capture \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/capture-mozilla \
  --breakpoints mobile tablet
```

```bash
node ./bin/web-embedding.mjs reproduce \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/reproduce-mozilla \
  --breakpoints mobile tablet
```

```bash
node ./bin/web-embedding.mjs clone \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/clone-mozilla \
  --breakpoints mobile tablet
```

```bash
node ./bin/web-embedding.mjs verify \
  --reference-bundle ./.tmp/reference/capture.json \
  --candidate-bundle ./.tmp/candidate/capture.json
```

A clone run can produce:
- `capture.json`
- `pipeline-run-manifest.json`
- `dom/snapshot.json`
- `dom/runtime.html`
- `styles/computed-summary.json`
- `styles/css-analysis.json`
- `network/manifest.json`
- `network/har.json`
- `network/har-like.json`
- `network/replay-report.json`
- `assets/inventory.json`
- `interactions/states.json`
- `interactions/trace.json`
- `screenshots/runtime.png`
- `session/storage-state.json`
- `reproduction/plan.json`
- `reproduction/evidence-limitations.json`
- `reproduction/rebuild-prompt.txt`
- `reproduction/rebuild/starter.html`
- `reproduction/rebuild/starter.css`
- `reproduction/rebuild/starter.tsx`
- `reproduction/rebuild/next-app/`
- `reproduction/self-verify/summary.json`
- `reproduction/self-verify/renderers/*/verification.json`
- `reproduction/self-verify/renderers/*/visual-qa.json`
- `reproduction/self-verify/renderers/*/breakpoints/*-verification.json`
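Because not every run produces every artifact, it can be handy to audit which of the listed files a given output directory actually contains. A minimal sketch (the `audit_artifacts` helper is hypothetical; the paths come from the artifact list above, trimmed to a representative subset):

```python
from pathlib import Path

# Representative subset of the documented clone-run artifacts.
EXPECTED = [
    "capture.json",
    "pipeline-run-manifest.json",
    "dom/snapshot.json",
    "dom/runtime.html",
    "network/manifest.json",
    "screenshots/runtime.png",
    "reproduction/self-verify/summary.json",
]

def audit_artifacts(output_dir: str) -> dict:
    """Partition expected artifact paths into present and missing."""
    root = Path(output_dir)
    present = [p for p in EXPECTED if (root / p).exists()]
    missing = [p for p in EXPECTED if not (root / p).exists()]
    return {"present": present, "missing": missing}
```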
Run the default small benchmark:
```bash
npm run check:clone-bench:local
```

Run the universal route regression corpus and expectations gate:

```bash
npm run check:benchmark-routes:local
```

Run a lightweight clone score gate:

```bash
npm run check:clone-score-gate:local
```

Validate the committed benchmark evidence manifest:

```bash
npm run check:benchmark-evidence:local
```

Validate production pipeline gates:

```bash
npm run check:production-readiness:local
```

Run the operational smokes individually:

```bash
npm run check:job-queue:local
npm run check:har-replay:local
npm run check:authenticated-corpus:local
```

Classify failure/action codes from a route report:

```bash
npm run classify:pipeline-failures -- --report ./.tmp/universal-route-benchmark/universal-route-report.json
```

Find low-scoring persisted benchmark artifacts:

```bash
npm run summarize:benchmark-scores -- --root ./.tmp --min-score 60 --max-score 70
```

Run specific URLs:
```bash
python3 scripts/check_clone_quality_bench.py \
  https://www.example.com \
  https://www.mozilla.org/ \
  --no-breakpoints
```

Run a responsive benchmark:

```bash
python3 scripts/check_clone_quality_bench.py \
  https://developer.mozilla.org/en-US/ \
  --breakpoints mobile tablet
```

```bash
python3 -m py_compile \
  bundle/source-first-clone/mcp/source_first_clone/*.py \
  scripts/check_integration_smoke.py \
  scripts/check_clone_quality_bench.py
```

```bash
npm run check:integration:local
```

```bash
git diff --check
```

| Path | Purpose |
|---|---|
| `bundle/source-first-clone` | Installed plugin bundle, MCP server, and exact-clone intake skill. |
| `bundle/source-first-clone/mcp/source_first_clone` | Capture, planning, rebuild, repair, and verification engine. |
| `bin/web-embedding.mjs` | Node CLI wrapper. |
| `python/web_embedding/installer.py` | Shared installer and command dispatcher. |
| `scripts/check_clone_quality_bench.py` | URL clone quality benchmark helper. |
| `scripts/benchmark_routes.py` | Universal route/capture-depth regression benchmark helper. |
| `scripts/check_benchmark_report.py` | Benchmark expectation validator for exact, minimum, and contains-style checks. |
| `scripts/check_benchmark_evidence.py` | Benchmark evidence manifest validator. |
| `scripts/check_job_queue_smoke.py` | Filesystem async clone job queue smoke test. |
| `scripts/check_har_replay_smoke.py` | Deterministic HAR replay engine smoke test. |
| `scripts/benchmark_authenticated_corpus.py` | User-provided authenticated dashboard corpus runner. |
| `scripts/summarize_benchmark_scores.py` | Utility for finding low or high scoring persisted benchmark artifacts under an output root. |
| `scripts/classify_pipeline_failures.py` | Operational failure/action taxonomy summarizer for reports and capture artifacts. |
| `scripts/check_production_readiness.py` | Production readiness gate validator for corpus, failure taxonomy, CI wiring, and policy docs. |
| `scripts/check_integration_smoke.py` | Release, install, and URL-only clone smoke test. |
| `scripts/release_bundle.py` | Release artifact builder. |
| `docs/` | Architecture notes and universal benchmark documentation. |
The strongest claim for this project is:
> A source-first website cloning engine that combines Playwright capture, HAR replay, MCP tools, and self-verification to rebuild iframe-blocked public pages with reproducible visual, DOM, style, interaction, and responsive scores.
Avoid treating the output as a legal or ownership bypass. The engine can reconstruct public page structure, but permission, licensing, and acceptable use still matter.
MIT
