Vision-native browser eyes and hands for agents.
Tendo captures web page state — annotated screenshots, a grounded element map, console and network errors — and executes deterministic actions by element id. The calling agent does the vision and reasoning; Tendo is the eyes and hands. It ships no LLM: no API key, no config, no per-token cost of its own. Built for autonomous agents that drive a browser via shell execution.
A DOM/selector tool is cheaper on clean semantic HTML — use one there. Tendo wins where the DOM lies or isn't the truth:
- Canvas / WebGL / video — charts, maps, games, design tools. No DOM nodes to select; only pixels.
- Visual assertions — "does the total actually render, not overlap, have enough contrast, and not shift on load."
- State setup: `--after "type email, click sign in, add to cart"` gets you to a primed page in one call.
It grounds actions on role + name + visual position, so the caller clicks the exact right pixels without guessing coordinates or parsing DOM.
No install needed — any capable agent can run the CLI directly:
Execute `npx -y tendo look <url>` to get browser eyes and hands.
Tendo drives Chromium via Playwright (Node 20+). On first run, install the browser once:
npx playwright install chromium

tendo look https://example.com # capture: element map + screenshot + errors
tendo look https://example.com --annotate # numbered set-of-marks overlay on the screenshot
tendo look https://example.com --text-only # cheapest tier: no screenshot
tendo look https://example.com --session s1 # keep the browser alive for follow-up act
tendo look https://shop.com --after "click sign in" # grounded setup actions before capture
tendo act --session s1 --element 3 --type "lofi" # deterministic: type into element #3
tendo act --session s1 "click the checkout button" # text mode: fuzzy role+name match
tendo act https://example.com "click Learn more" # one-shot: one action on a fresh load
tendo sessions # list live sessions + TTL remaining
tendo kill s1 # close one session
tendo kill --all # close all sessions

Every `look` writes screenshots to disk and prints a machine-readable summary (TOON by default; pass `--format json` to opt out). Screenshot bytes are never inlined: the summary carries only file paths, which the agent reads on demand. Every `act` returns the fused post-action state inline, never a bare "Done".
- `tendo look <url> --session s1 --annotate`: get the numbered screenshot + element map.
- The agent reads the annotated image with its own vision: search box = `3`, checkout = `1`.
- `tendo act --session s1 --element 1`: click the exact element, get the new state back.
- Repeat. Reasoning lives in the agent; grounding and capture live in Tendo.
Default to the cheapest tier and only spend pixels when needed: `--text-only` → `--region <selector>` → full `look` → `--annotate`. Every response includes `hints:` that nudge you down a rung.
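The ladder above can be sketched as a small shell helper. This is an illustrative sketch, not part of Tendo: the function name `tendo_look_tier`, the tier labels (`text`, `region`, `full`, `annotate`) and the `TENDO` override variable are assumptions made here, and it assumes `--region` is passed to `look` alongside the URL.

```shell
# Hypothetical helper (not shipped with Tendo): run `look` at a chosen cost tier.
# TENDO is an assumed override so the sketch can be dry-run with TENDO=echo.
tendo_look_tier() {  # $1 = tier label, $2 = url, $3 = selector (region tier only)
  case "$1" in
    text)     "${TENDO:-tendo}" look "$2" --text-only ;;
    region)   "${TENDO:-tendo}" look "$2" --region "$3" ;;
    full)     "${TENDO:-tendo}" look "$2" ;;
    annotate) "${TENDO:-tendo}" look "$2" --annotate ;;
    *)        echo "unknown tier: $1" >&2; return 2 ;;
  esac
}
```

Starting with `tendo_look_tier text https://example.com` and escalating only when the text-only summary is not enough keeps screenshot cost down.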
| Command | Description |
|---|---|
| `look` | Capture page state → screenshots on disk + element map + diagnostics |
| `act` | Execute one grounded action, return the fused post-action look payload |
| `sessions` | List live browser sessions and their idle TTL |
| `kill` | Close a session (`<id>`) or all sessions (`--all`) |
`act` reports one of: `ok` · `not_found` (element gone → fresh state returned) · `ambiguous` (ranked candidates returned, pick by id) · `error`.
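A calling agent might branch on these four statuses roughly as follows. This is a minimal sketch under stated assumptions: the function name `next_step_for_status` and the step labels it prints are invented for illustration, and extracting the status string from the TOON or JSON summary is left to the caller because the field layout is not shown here.

```shell
# Hypothetical dispatcher (not shipped with Tendo): map an `act` status to the
# agent's next step. The printed labels are illustrative, not Tendo output.
next_step_for_status() {  # $1 = status string reported by `act`
  case "$1" in
    ok)        echo "continue" ;;                 # action landed; use the returned state
    not_found) echo "replan-from-fresh-state" ;;  # element gone; fresh state was returned
    ambiguous) echo "pick-candidate-by-id" ;;     # choose one ranked candidate, retry by id
    error)     echo "stop-and-report" ;;          # assumed policy: surface the failure
    *)         echo "unknown-status"; return 1 ;;
  esac
}
```

For `ambiguous`, retrying with `--element <id>` instead of text mode follows the "pick by id" guidance above.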
- `--help`: show help for any command
- `-V`, `--version`: show the installed `tendo` version
--session <id> keeps a browser alive across calls (agent turns are minutes apart). A background daemon holds the live page and auto-spawns on first use; sessions idle-reap after 10 minutes. Without --session, look/act run one-shot — launch, capture, kill.
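Rather than relying on the 10-minute idle reap, a caller can close its session explicitly once it is done. The wrapper below is a sketch of that pattern, not something Tendo provides: the name `with_tendo_session` and the `TENDO` override variable are assumptions, and only the `look --session` and `kill` invocations come from the commands listed earlier.

```shell
# Hypothetical wrapper (not shipped with Tendo): open a session, run a command,
# then always close the session instead of waiting for the idle reap.
# TENDO is an assumed override so the sketch can be dry-run with TENDO=echo.
with_tendo_session() {  # $1 = session id, $2 = url, remaining args = command to run
  _sid=$1; _url=$2; shift 2
  "${TENDO:-tendo}" look "$_url" --session "$_sid" || return 1
  _rc=0
  "$@" || _rc=$?                    # run the caller's command while the session is live
  "${TENDO:-tendo}" kill "$_sid"    # explicit cleanup, even if the command failed
  return "$_rc"                     # propagate the command's exit status
}
```

For example, `with_tendo_session s1 https://example.com tendo act --session s1 --element 3 --type "lofi"` would capture the page, type into element 3, and close `s1` afterwards.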
npm install # install all workspace dependencies
npm run build --workspaces # build core → browser → cli
node apps/cli/dist/index.js look <url> # run the built CLI

See AGENTS.md for architecture and contributor guidance, and SCOPE.md for the design record and roadmap.
MIT