Give AI agents control of any Windows app — no vision model, no screenshots, no cloud.
ApexComputerUse reads the Windows accessibility tree (the same data the OS exposes to screen readers) and serves it over a plain HTTP REST API. Any AI agent — in any language, on any machine — can find, inspect, and control any desktop app or browser by making simple HTTP requests. No screenshots. No pixel coordinates. No cloud dependency.
5–20 tokens per action instead of 1,000–3,500 for a screenshot. A full browser page in onscreen-only mode is ~126 elements of compact JSON — less than the cost of a single screenshot of the same page.
Works on Win32, WPF, UWP, WinForms, and browsers. Controlled via HTTP REST, named pipes, cmd.exe, and Telegram.
Requirements: Windows 10/11 · .NET 10 SDK
git clone https://github.com/your-org/ApexComputerUse
cd ApexComputerUse
dotnet build
dotnet run --project ApexComputerUse
- The app opens. The HTTP server starts automatically on port 8080 (HttpAutoStart=true in appsettings.json).
- By default it binds to localhost only (HttpBindAll=false), so no first-run UAC network setup is required.
- If you enable HttpBindAll=true, the app prompts once (UAC) to configure URL ACL + Windows Firewall for the selected port.
- The API key is shown in the Remote Control tab → API Key field; copy it.
- Open http://localhost:8080/?apiKey=<key> in a browser; the interactive console appears (the browser console pre-fills the key).
- Pick any open window from the Windows panel on the left.
- Browse its element tree, click an action button, see the result.
Chat tab: switch to the Chat tab and click Load Chat to open the streaming AI chat UI directly inside the app. Configure provider and API key in the settings group above, then chat away.
Clients tab: use the Clients tab to register other machines running ApexComputerUse. Add each machine's name, IP/host, port, and API key, then click Test to confirm the connection is live. This registry lets you — or an AI agent — track and target multiple Apex endpoints from a single instance.
Or go straight to curl (replace <key> with the API key from the Remote Control tab):
# Confirm the server is up
curl -H "X-Api-Key: <key>" http://localhost:8080/ping
# Find Notepad and read its text editor content (two calls)
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/find \
-H "Content-Type: application/json" -d '{"window":"Notepad"}'
curl -H "X-Api-Key: <key>" "http://localhost:8080/exec?action=gettext"
# Or combine both in one call
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/find-exec \
-H "Content-Type: application/json" -d '{"window":"Notepad","action":"gettext"}'
OCR: requires eng.traineddata; download it from github.com/tesseract-ocr/tessdata and place it in tessdata\ next to the executable.
AI Vision: requires a GGUF vision model and projector; see Usage — AI.
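For agents that call the API from Python rather than curl, the combined find-exec call above can be sketched with the standard library alone. The helper names build_find_exec_request and send are illustrative and not part of ApexComputerUse; the endpoint, header, and body fields are the ones shown in the curl example.

```python
import json
import urllib.request


def build_find_exec_request(base_url, api_key, window, action):
    """Build (but do not send) the POST /find-exec request from the curl example."""
    body = json.dumps({"window": window, "action": action}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/find-exec",
        data=body,
        method="POST",
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
    )


def send(request, timeout=10):
    """Send a prepared request and return the raw response text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires a running server and a real key):
#   req = build_find_exec_request("http://localhost:8080", "<key>", "Notepad", "gettext")
#   print(send(req))
```

Keeping request construction separate from sending makes it easy to log or inspect exactly what an agent is about to do before it touches the desktop.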
Most AI computer-use tools — Claude Computer Use, OpenAI CUA, UI-TARS, OmniParser — work by sending a screenshot to a vision model and guessing pixel coordinates to click. This approach has compounding costs:
- Screenshot token costs scale with resolution and vary by provider. A 1024×768 image runs ~765 tokens (OpenAI) to ~1,050 tokens (Anthropic). At 1920×1080 that rises to ~1,840 tokens (Anthropic) or ~2,125 tokens (OpenAI). At 2048×2048, OpenAI charges ~2,765 tokens and Anthropic ~2,500–3,500 tokens. Gemini is the exception, typically staying under 1,000 tokens even for ~4K images. And this cost is paid on every single step.
- Screenshots stack in conversation history — a 20-step task accumulates 20+ images in context.
- Coordinate grounding is fragile: it breaks on window resize, DPI scaling, and multi-monitor setups.
- Published benchmarks confirm the accuracy ceiling: even specialist 7B vision models score only 18.9% on real professional UIs (ScreenSpot-Pro, 2025). GPT-4o scores below 2% on unscaled professional screens.
ApexComputerUse reads the accessibility tree the OS already maintains — the same tree used by screen readers and test automation. This gives every element a name, control type, and AutomationId, without rendering a pixel.
Interacting with an element by name costs 5–20 tokens. The element map for a full browser page in onscreen-only mode is typically 100–200 elements of compact JSON — compared to ~1,050 tokens for a single screenshot of the same page, with none of the coordinate fragility.
This is the same direction taken by the most efficient browser-only tools: browser-use claims 50% fewer tokens than screenshot alternatives; Vercel's agent-browser returns 200–400 tokens per page snapshot and uses 82–93% fewer tokens than Playwright MCP. ApexComputerUse brings the same approach to the entire Windows desktop.
| Tool | Coverage | HTTP API | Stable element IDs | Onscreen filter | Status |
|---|---|---|---|---|---|
| ApexComputerUse | Windows desktop + browsers | ✅ REST | ✅ SHA-256 hash | ✅ ?onscreen=true | Active |
| UFO2 (Microsoft) | Windows desktop + browsers | ❌ research agent | ❌ bounding-box | Partial | Research only |
| UI Automata | Windows desktop + browsers | MCP only | Selector-based | Shadow DOM cache | Active |
| Windows-Use | Windows desktop | ❌ Python lib | ❌ | Partial | Active |
| WinAppDriver | Windows desktop | WebDriver | XPath / selectors | ❌ | Paused by Microsoft |
| browser-use | Browser only | ❌ Python lib | Element hash | ✅ | Active |
| Playwright MCP | Browser only | MCP | Session-scoped refs | Partial | Active |
| Claude Computer Use | Any (screenshot) | Cloud API | ❌ coordinates | ❌ | Active |
No other tool combines: Windows UIA3 coverage, SHA-256 stable element IDs, a language-agnostic HTTP REST API, and an onscreen visibility filter — in a single deployable binary.
ApexComputerUse exposes a plain HTTP REST API, which means any AI agent that can execute shell commands or fetch a URL can use it. No SDK, no plugin, no special integration required — if the agent can run curl, it can drive any Windows app or browser through this server.
There are three ways an agent can interact with ApexComputerUse:
1. Shell / terminal access (curl or any HTTP client)
Any agent that can run shell commands can call the API directly with curl, Python requests, or PowerShell Invoke-RestMethod. This covers the widest range of tools and requires no configuration beyond starting the HTTP server.
2. URL fetch / WebFetch tool
Some agents have a dedicated tool for fetching URLs rather than running shell commands. ApexComputerUse's HTML responses embed a full <script type="application/json" id="apex-result"> block, so any agent that can fetch a webpage gets structured JSON data back without needing a vision model.
3. MCP server (optional wrapper)
Several agents support the Model Context Protocol. If you prefer a tighter integration, the REST API can be wrapped as an MCP server so the agent sees your actions as named tools rather than raw HTTP calls.
| Agent | Type | Shell access | URL fetch | MCP | Notes |
|---|---|---|---|---|---|
| Claude Code | CLI | ✅ Bash tool | ✅ WebFetch tool | ✅ | curl is blocked by default but Claude Code automatically falls back to Python requests for the same result |
| Cline | VS Code extension | ✅ Terminal | ✅ Via shell | ✅ | Full agentic loop; browser control; human-in-the-loop approval for each command |
| Aider | CLI | ✅ Shell | ✅ Via shell | ❌ | Oldest and most widely deployed open-source coding CLI; works with any model via Ollama or API key |
| Goose (Block) | CLI + Desktop | ✅ Shell | ✅ Via shell | ✅ | Apache 2.0; model-agnostic; native MCP support |
| Cursor (Agent Mode) | IDE | ✅ Terminal | ✅ Via shell | ✅ | Agent mode can run terminal commands; MCP support available |
| Windsurf (Cascade) | IDE | ✅ Terminal | ✅ Via shell | ✅ | Cascade runs commands automatically; MCP support with admin controls |
| GitHub Copilot (Agent Mode) | VS Code extension | ✅ Terminal | ✅ Via shell | ✅ | VS Code Agent mode handles terminal commands and iteration |
| OpenHands / Devin | Cloud agent | ✅ Shell | ✅ Via shell | Varies | Requires network path from the cloud sandbox to your Windows machine |
| Roo Code / Continue | VS Code extension | ✅ Terminal | ✅ Via shell | ✅ | Open-source; BYOK; shell access via VS Code terminal integration |
| Autocomplete-only tools | Extension | ❌ | ❌ | ❌ | Tabnine, Supermaven, etc. generate code only — no agentic shell or HTTP access |
Local model users: any agent backed by a local model via Ollama (Qwen Coder, DeepSeek Coder, CodeLlama, etc.) that also has shell access works the same way. The model itself doesn't need internet access — the agent runtime executes the curl commands.
Start the HTTP server, then drop this into your Claude Code session:
The ApexComputerUse REST API is running at http://localhost:8080.
Use curl (or Python requests if curl is blocked) to control Windows apps.
Start with: curl http://localhost:8080/ping
Then: curl http://localhost:8080/windows (to see what's open)
Then find and interact with any element using /find and /exec (or /find-exec for both in one call).
Claude Code will handle the rest — finding windows, reading the element tree, clicking, typing, and verifying results across turns using its stable element IDs.
Every element is assigned a SHA-256 hash-based numeric ID derived from its control type, name, AutomationId, and position in the tree. These IDs are stable across sessions — an agent can reference the same element in turn 1 and turn 20 without re-querying the tree. No other tool in the Windows desktop automation space publishes this property.
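To make the idea of a hash-derived numeric ID concrete, here is a minimal sketch. It is an assumption-laden illustration, not the project's actual algorithm: the field separator, the use of a textual tree path for "position in the tree", and the truncation to 31 bits are all hypothetical choices.

```python
import hashlib


def stable_id(control_type, name, automation_id, tree_path):
    """Hypothetical scheme: hash the identifying fields and keep 31 bits as an int."""
    # Join the fields with a separator that is unlikely to appear in UI text.
    key = "\x1f".join([control_type, name, automation_id, tree_path])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 4 bytes as a big-endian integer and clear the top bit
    # so the value also fits a signed 32-bit field.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF
```

Because the ID depends only on properties of the element, the same inputs give the same number in any process or session, which is the property that lets an agent reuse an ID many turns later.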
GET /elements?onscreen=true prunes any element where IsOffscreen = true during the tree scan, skipping entire offscreen subtrees. On a live Chewy.com product page this reduces 634 elements to 126 — an 80% reduction — putting token cost per step in the same range as the best browser-only tools while covering all desktop apps too.
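The pruning rule can be sketched on a plain dictionary tree. This is an illustration only: the node shape (isOffscreen, children) mirrors the JSON fields named in this README, and the helper names are hypothetical.

```python
def prune_offscreen(node):
    """Return a copy of node without offscreen subtrees, or None if node is offscreen."""
    if node.get("isOffscreen"):
        return None  # the whole subtree is skipped; its children are never visited
    kept = []
    for child in node.get("children", []):
        pruned = prune_offscreen(child)
        if pruned is not None:
            kept.append(pruned)
    result = {k: v for k, v in node.items() if k != "children"}
    if kept:
        result["children"] = kept
    return result


def count_nodes(node):
    """Count a node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.get("children", []))
```

Comparing count_nodes before and after pruning gives the kind of reduction figure quoted above (634 to 126 on the example page).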
The filter composes with the type filter and the new depth/expansion params: ?onscreen=true&type=Button.
When ?match= is combined with ?onscreen=true, the match search scans all elements (including offscreen ones) so content that has been scrolled out of view can still be found by text search. Offscreen matches are tagged with "isOffscreen": true in the response. Use exec action=scrollinto on the returned element ID to bring an offscreen match into view before interacting with it.
For deep pages, fetch a shallow overview first, then drill into only the branches you care about:
# Step 1 — shallow overview (fast, small response)
curl "http://localhost:8080/elements?depth=2&onscreen=true"
# Nodes that have children beyond the depth limit show "childCount": N instead of "children"
# Step 2 — expand a specific node by its ID (IDs are stable between calls)
curl "http://localhost:8080/elements?id=708379645&depth=2&onscreen=true"
# Returns only that subtree, 2 levels deep; existing map entries are preserved
This lets an AI agent navigate to the relevant section of a large page without fetching the whole tree on every step.
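The overview-then-expand pattern can be modelled on an in-memory tree as follows. This is a sketch under assumptions: here depth counts levels of children kept below the starting node, which may not match the server's exact convention, and truncate and find_by_id are hypothetical helper names.

```python
def truncate(node, depth):
    """Copy node keeping `depth` levels of children; deeper levels become childCount."""
    result = {k: v for k, v in node.items() if k != "children"}
    children = node.get("children", [])
    if not children:
        return result
    if depth <= 0:
        result["childCount"] = len(children)  # signals that the node can be expanded
    else:
        result["children"] = [truncate(child, depth - 1) for child in children]
    return result


def find_by_id(node, element_id):
    """Depth-first lookup of a node by its id; returns None when absent."""
    if node.get("id") == element_id:
        return node
    for child in node.get("children", []):
        found = find_by_id(child, element_id)
        if found is not None:
            return found
    return None
```

An agent would read the shallow overview, pick a node that reports childCount, and then request only that node's subtree by ID.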
Modern web pages often wrap every visible element in several identity-less Pane/Group/Custom nodes and produce deep trees with many one-child chains. Two opt-in /elements parameters strip that noise:
# RECOMMENDED: global text search — replaces almost all hierarchical drill-down.
# Searches Name, AutomationId, Value, AND ClassName across the entire window tree
# (including offscreen elements). Returns every match with its ancestor path plus
# `depth` levels of descendants. Combine with includePath=true for breadcrumbs.
# The parameter name is `match=` — there is NO separate `global=true` flag; `match=`
# alone forces a full-tree scan and the tester-friendly behaviour described above.
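# When `match=` is set, `depth=` does not limit the search itself (otherwise depth
# pruning would hide deep matches before they could be found); it only controls how
# many levels of descendants are returned under each match.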
curl "http://localhost:8080/elements?match=add+to+cart&onscreen=true&depth=1&includePath=true"
# Collapse "1-in-1-in-1" wrapper chains. A wrapper is skipped only when it has
# exactly one child, no name, no AutomationId, and its control type is Pane,
# Group, or Custom. Named containers and anything with an AutomationId survive.
curl "http://localhost:8080/elements?onscreen=true&collapseChains=true"
# Ancestor breadcrumb on every emitted node: "Chrome > Document > Main > Form".
curl "http://localhost:8080/elements?onscreen=true&includePath=true"
# Opt into Value pattern + HelpText on every node — useful for web inputs
# whose Name is empty and whose visible content lives in the Value pattern.
curl "http://localhost:8080/elements?onscreen=true&properties=extra"
# All new filters combine cleanly with existing ones.
curl "http://localhost:8080/elements?onscreen=true&collapseChains=true&match=submit&type=Button&depth=1&properties=extra"
Truncated nodes (ones whose children were cut off by depth) now also emit descendantCount alongside childCount, so an agent can decide whether a subtree is worth expanding without another round trip. Element IDs are computed against the real, unflattened tree: hoisting a descendant through collapseChains does not change its ID, and follow-up /elements?id=<id> and /execute id=<id> calls still resolve.
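The collapseChains rule stated in the comments above (exactly one child, no name, no AutomationId, control type Pane, Group, or Custom) can be sketched like this. The code is an illustration on dictionary nodes, not the server's implementation, and the function names are hypothetical.

```python
WRAPPER_TYPES = {"Pane", "Group", "Custom"}


def is_wrapper(node):
    """True when the node matches the skip rule for identity-less wrappers."""
    return (
        len(node.get("children", [])) == 1
        and not node.get("name")
        and not node.get("automationId")
        and node.get("controlType") in WRAPPER_TYPES
    )


def collapse_chains(node):
    """Return a copy of the tree with single-child anonymous wrappers hoisted away."""
    while is_wrapper(node):
        node = node["children"][0]  # hoist the only child; its id is left untouched
    result = {k: v for k, v in node.items() if k != "children"}
    children = [collapse_chains(child) for child in node.get("children", [])]
    if children:
        result["children"] = children
    return result
```

Since hoisting only changes where a node appears in the output and never rewrites the node itself, IDs computed beforehand remain valid afterwards.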
/find now populates the response's structured element object (id, controlType, name, automationId, className, frameworkId, isEnabled, isOffscreen, boundingRectangle, plus value/helpText when properties=extra), in addition to the existing human-readable string in message.
- Find any window and element by name or AutomationId (exact or fuzzy match)
- Filter element search by ControlType
- Persistent, hash-based stable element and window IDs (survive app restarts)
- Onscreen-only element map (?onscreen=true): prunes offscreen subtrees at scan time
- Progressive tree expansion (?depth=N + ?id=<elementId>): fetch a shallow overview then drill into only the branches you need, without re-scanning the whole window
- Element nodes include boundingRectangle (x, y, width, height) for spatial context and visual rendering
- Execute all common UI actions: click, type, select, toggle, scroll, drag & drop, etc.
- OCR any UI element using Tesseract
- Multimodal AI: describe UI elements, ask questions about them, analyse image/audio files using a local vision LLM (LLamaSharp MTMD)
- Remote control via HTTP REST API (curl-friendly JSON)
- Remote control via named pipe (PowerShell module included)
- Remote control via cmd.exe batch helper (apex.cmd)
- Remote control via Telegram bot
- Screenshot capture of elements, windows, and full screen (returned as base64 PNG)
- Interactive HTTP test console: served at GET /, includes live windows list, element tree browser, grouped command builder covering every action, inline capture/OCR/AI vision/UI map buttons, format selector (JSON/HTML/Text/PDF), format demo links, and a response log
- AI Drawing: POST /draw renders any combination of shapes (rect, ellipse, circle, line, arrow, polygon, text) to a base64 PNG; GET /draw/demo renders a built-in multi-colour space scene; ?overlay=true shows the result as a click-through screen overlay
- Layered Scene Editor: persistent, structured drawing canvas with stable shape IDs so AI can generate a composition and the user can refine it; full REST API at /scenes/*; interactive WinForms editor (Tools → Scene Editor) and browser editor (GET /editor)
- UI Map Renderer: renders the element tree as a colour-coded overlay drawn directly on screen, and optionally exports a PNG image; accessible via Tools → Render UI Map or GET /uimap
- Format-adaptive responses: every endpoint serves HTML, plain text, JSON, or PDF via URL extension (.json, .html, .txt, .pdf), ?format= parameter, or Accept header; default is an HTML page with embedded JSON readable by any AI that can fetch a URL
- System utility routes: /health (unauthenticated), /ping, /metrics, /sysinfo, /env, /ls, /run, /run-tests, /shutdown for AI agents that need OS-level context without a separate tool
- WindowMonitor: background STA poll thread detects desktop window opens / closes / title changes once per second; fires WindowsChanged / WindowClosed events that auto-prune the CommandProcessor element + window caches when a window goes away (no more stale-handle errors when an app closes mid-session). Optional WatchElements mode adds descendant-level diff tracking, narrowable to the foreground window or to titles matching a substring filter for tractable scan cost. Inspect activity via GET /winmon/log and drain via POST /winmon/clear
- Live monitoring dashboard: browser-based status page at GET /dashboard; shows health, per-route metrics, system info, registered clients, AI chat session status, and WindowMonitor activity log. Auto-refreshes every 5 seconds. Requires AllowDiagnostics permission.
- Native HTTPS: opt-in TLS via http.sys (no proxy); Scripts/setup-https.ps1 generates a self-signed cert, binds it via netsh http add sslcert, and adds a Firewall rule in one elevated step. Supports user-supplied PFX. Three remote-access options documented in Scripts/README-remote-access.md: SSH tunnel, native HTTPS, and Caddy reverse proxy.
- Embedded AI chat in the Chat tab: the Chat tab opens the streaming HTML chat UI (/chat) in your default browser; click Open In Browser to launch it. The HTML page handles streaming, provider/model display, and session reset natively.
- AI Chat over HTTP: streaming chat UI at GET /chat backed by /chat/send, /chat/status, /chat/reset; same 8 providers as the desktop AI Chat window; also accessible from any browser
- Agentic tool loop in AI Chat: when the local HTTP server is running, the AI can issue ApexComputerUse API calls inside ```apex code blocks; results are fed back automatically for up to 8 turns until the AI produces a clean answer (AiChatService.SendAsync + SetLocalServer)
- Auto-start on launch: HTTP server starts automatically (HttpAutoStart=true by default), binds to localhost by default (HttpBindAll=false), and can be switched to all-interfaces mode with one-time netsh setup (URL ACL + Firewall rule)
- Auto-download setup: Model tab "Download All" button fetches the LFM2.5-VL model, projector, and Tesseract data to fixed local paths on first launch
git clone https://github.com/your-org/ApexComputerUse
cd ApexComputerUse
dotnet run --project ApexComputerUse
When HttpBindAll=true, ApexComputerUse checks whether the HTTP URL ACL and Windows Firewall inbound rule exist for the configured port. If either is missing, a single elevated cmd window opens (one UAC prompt) and runs:
netsh http add urlacl url=http://+:{port}/ user=Everyone
netsh advfirewall firewall add rule name="ApexComputerUse" dir=in action=allow protocol=TCP localport={port}
This happens once and is tracked in %APPDATA%\ApexComputerUse\settings.json. With the default HttpBindAll=false, this setup is skipped.
Open the Model tab and click Download All to automatically fetch:
- LFM2.5-VL-450M-Q4_0.gguf: vision LLM (450 M parameters, quantized)
- mmproj-LFM2.5-VL-450m-F16.gguf: multimodal projector
- eng.traineddata: Tesseract English OCR data
Files are saved to models\ and tessdata\ next to the executable. On first launch the app detects missing files and switches to the Model tab automatically.
To download manually: copy eng.traineddata from github.com/tesseract-ocr/tessdata into tessdata\, and place both .gguf files in models\.
Three options — see Scripts/README-remote-access.md for full details:
| Option | When to use | Setup |
|---|---|---|
| SSH tunnel | Ad-hoc, no certificates | .\Scripts\ssh-tunnel.ps1 -Server user@mypc |
| Native HTTPS | Permanent TLS, no proxy | .\Scripts\setup-https.ps1 (run as Admin), then set HttpsEnabled: true in appsettings.json |
| Caddy proxy | Public domain + auto Let's Encrypt | caddy run --config Scripts/Caddyfile with DOMAIN= set |
- Message @BotFather on Telegram and create a bot with /newbot.
- Copy the token (format: 123456789:ABC-DEF...).
123456789:ABC-DEF...). - Paste it into the Bot Token field in the app and click Start Telegram.
- Add your Telegram chat ID to the Allowed Chat IDs field to restrict who can send commands.
Every HTTP request must include the API key. Three equivalent methods:
# Authorization header (recommended)
curl -H "Authorization: Bearer <key>" http://localhost:8080/ping
# X-Api-Key header
curl -H "X-Api-Key: <key>" http://localhost:8080/ping
# Query parameter (use only for browser links / quick tests)
curl "http://localhost:8080/ping?apiKey=<key>"
Requests without a valid key receive HTTP 401. The interactive web console (GET /) pre-fills the key automatically; paste it from the Remote Control tab on first launch.
To disable authentication (local development only), clear the API Key field in the app.
The named pipe is ACL-restricted to the current Windows user. Other local users and unprivileged processes cannot connect.
Enter one or more Telegram chat IDs in the Allowed Chat IDs field (comma-separated). Any message from an unlisted chat ID receives "Unauthorized." and is logged. Leave the field empty only for local testing.
Requests from localhost / loopback always have full access. Non-loopback callers are matched against entries in the Clients tab and constrained by per-client permissions (allow_automation, allow_capture, allow_ai, allow_scenes, allow_shell_run, allow_clients, allow_diagnostics). Unknown non-loopback callers are denied.
The POST /run and GET /run endpoints execute arbitrary cmd.exe commands. They are disabled by default. Enable them explicitly:
- In appsettings.json: "EnableShellRun": true
- Or via environment variable: APEX_ENABLE_SHELL_RUN=true
Settings can be layered from three sources; when the same setting appears in more than one, environment variables take priority over appsettings.json:
appsettings.json (next to the executable — shipped defaults shown):
{
"HttpPort": 8080,
"HttpBindAll": false,
"HttpAutoStart": true,
"PipeName": "ApexComputerUse",
"LogLevel": "Information",
"EnableShellRun": false,
"TelegramToken": "",
"TestRunnerExePath": "",
"TestRunnerConfigPath": ""
}
Shipped defaults are HttpAutoStart=true and HttpBindAll=false (auto-start on localhost only). Set HttpBindAll=true for LAN access.
Environment variables (prefix APEX_, override appsettings.json):
| Variable | Description |
|---|---|
| APEX_HTTP_PORT | HTTP listen port (default 8080) |
| APEX_HTTP_BIND_ALL | true to bind all interfaces instead of localhost only |
| APEX_HTTP_AUTOSTART | true to auto-start HTTP server in GUI mode |
| APEX_PIPE_NAME | Named pipe name |
| APEX_LOG_LEVEL | Serilog minimum level: Debug / Information / Warning / Error |
| APEX_ENABLE_SHELL_RUN | true to enable the /run shell-execution endpoint |
| APEX_API_KEY | Override the auto-generated API key |
| APEX_ALLOWED_CHAT_IDS | Comma-separated Telegram chat ID whitelist |
| APEX_TELEGRAM_TOKEN | Telegram bot token |
| APEX_MODEL_PATH | Default LLM .gguf path |
| APEX_MMPROJ_PATH | Default multimodal projector .gguf path |
| APEX_TEST_RUNNER_EXE_PATH | Path to TestApplications/TestRunner executable for /run-tests |
| APEX_TEST_RUNNER_CONFIG_PATH | Optional config file path passed to TestRunner |
Network binding: HttpBindAll = false (the default) binds to http://localhost:{port}/ — loopback only, safe for single-machine use. Set APEX_HTTP_BIND_ALL=true to bind all interfaces for network-wide access (ensure firewall rules are in place).
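The override order and the binding rule can be modelled in a few lines. This is a sketch only: the variable-to-setting mapping follows the table above, but the parsing rules (for example, treating only the string "true" as enabled) and the function names are assumptions rather than the application's actual behaviour.

```python
def _as_bool(value):
    """Assumed rule: only the text 'true' (any case) enables a flag."""
    return value.strip().lower() == "true"


# Subset of the APEX_* variables mapped to their appsettings.json keys.
ENV_TO_SETTING = {
    "APEX_HTTP_PORT": ("HttpPort", int),
    "APEX_HTTP_BIND_ALL": ("HttpBindAll", _as_bool),
    "APEX_HTTP_AUTOSTART": ("HttpAutoStart", _as_bool),
    "APEX_PIPE_NAME": ("PipeName", str),
    "APEX_LOG_LEVEL": ("LogLevel", str),
    "APEX_ENABLE_SHELL_RUN": ("EnableShellRun", _as_bool),
}


def resolve_settings(file_settings, env):
    """Overlay APEX_* environment values on top of appsettings.json values."""
    resolved = dict(file_settings)
    for variable, (key, parse) in ENV_TO_SETTING.items():
        if variable in env:
            resolved[key] = parse(env[variable])
    return resolved


def listen_prefix(settings):
    """URL prefix implied by the binding rule: '+' for all interfaces, else localhost."""
    host = "+" if settings["HttpBindAll"] else "localhost"
    return f"http://{host}:{settings['HttpPort']}/"
```

Passing os.environ as env reproduces the layering for the current process; passing a plain dictionary keeps the function easy to test.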
Logs are written to %LOCALAPPDATA%\ApexComputerUse\Logs\apex-YYYYMMDD.log (daily rotation, 7-day retention).
ApexComputerUse can run headlessly as a Windows service (no GUI):
# Install
sc.exe create ApexComputerUse binPath="C:\ApexComputerUse\ApexComputerUse.exe --service" start=auto
sc.exe start ApexComputerUse
# Uninstall
sc.exe stop ApexComputerUse
sc.exe delete ApexComputerUse
Configure via appsettings.json or APEX_* environment variables before starting the service. The APEX_TELEGRAM_TOKEN and APEX_API_KEY variables are the recommended way to inject secrets in a service context.
Program.cs supports lightweight startup overrides:
- --port <n> sets APEX_HTTP_PORT for that process
- --pipe <name> sets APEX_PIPE_NAME for that process
- --client marks the instance as a subordinate client instance
| Field | Description |
|---|---|
| Window Name | Partial title of the target window. Fuzzy-matched if no exact match found. |
| AutomationId | The element's AutomationId (checked first). |
| Element Name | The element's Name property (fallback if AutomationId is blank). |
| Search Type | Filter the element search to a specific ControlType. All searches everything. |
| Control Type | Selects the action group (Button, TextBox, etc.). |
| Action | The action to perform on the found element. |
| Value / Index | Input for actions that need it (text to type, index, row,col, x,y, etc.). |
Find Element: locates the window and element, logs what was found.
Execute Action: runs the selected action against the last found element.
| Item | Description |
|---|---|
| Run AI Computer Use Mode | Launches the interactive multimodal AI agent loop (requires model loaded on the Model tab). |
| Output UI Map | Scans the current window's element tree and logs it as nested JSON to the console tab. |
| Render UI Map | Scans the current window's element tree, draws a colour-coded bounding-box overlay on screen for 5 seconds, and offers to save the overlay as a PNG image. |
| Scene Editor | Opens the layered scene editor — create scenes, add shapes to layers, drag to reposition, use AI to generate and refine compositions. |
| AI Chat | Opens a standalone streaming chat window with support for 8 AI providers (OpenAI, Anthropic, DeepSeek, Grok, Groq, Duck, LM Studio, LlamaSharp). Configure API keys in ai-settings.json next to the executable. The Chat tab opens the same chat UI in your default browser — click Open In Browser after the HTTP server starts. |
Every window and element is assigned a stable numeric ID (SHA-256 hash-based) that persists across sessions. These IDs can be used in find commands instead of titles or AutomationIds.
# 1. Get windows with their IDs
curl http://localhost:8080/windows
# Returns: [{"id":42,"title":"Notepad"},{"id":107,"title":"Calculator"},...]
# 2. Get elements with their IDs for the current window
curl http://localhost:8080/elements
# Onscreen elements only (prunes offscreen subtrees — 80% fewer elements on browser pages)
curl "http://localhost:8080/elements?onscreen=true"
# Limit tree depth — nodes at the cutoff show "childCount" instead of "children"
curl "http://localhost:8080/elements?depth=2&onscreen=true"
# Expand a specific subtree by numeric ID (IDs are stable; map is preserved between expansion calls)
curl "http://localhost:8080/elements?id=708379645&depth=2&onscreen=true"
# Combine with type filter
curl "http://localhost:8080/elements?onscreen=true&type=Button"
# Returns nested JSON including bounding rectangles:
# {
# "id": 105,
# "controlType": "Edit",
# "name": "Text Editor",
# "automationId": "15",
# "boundingRectangle": { "x": 0, "y": 30, "width": 800, "height": 600 },
# "children": [...]
# }
#
# When a depth limit truncates a node's children, "childCount" appears instead:
# {
# "id": 708379645,
# "controlType": "Pane",
# "name": "",
# "boundingRectangle": { ... },
# "childCount": 7 <-- call /elements?id=708379645 to expand
# }
# 3. Find using numeric IDs (no fuzzy matching, direct map lookup)
curl -X POST http://localhost:8080/find \
-H "Content-Type: application/json" \
-d '{"window":42,"id":105}'
Using numeric IDs is faster and unambiguous: the element is resolved directly from the in-memory map without any search or fuzzy logic. Every find call also auto-focuses the matched window. When a title/name search is low-confidence or ambiguous, /find now refuses to guess and returns error_data.candidates; choose one of those candidates or use IDs from /windows and /elements.
Map rendering isn't just a debugging convenience — it has compounding implications for token consumption at scale.
With screenshot-based AI automation, every interaction requires sending a fresh image to the model. At typical desktop resolutions that's 1,000–3,500 tokens per screenshot depending on the provider and resolution — every single step, accumulating in conversation history. With ApexComputerUse's map approach, the UI is rendered once as a structured, text-based representation. After that initial render, each individual interaction references elements by name, costing 5–20 tokens on average.
The ?onscreen=true filter further reduces the element map to only what is visible in the current viewport. On a real browser page this produces 126 elements of compact JSON — well under the cost of a single screenshot of the same page.
| | Per step | 20-step task |
|---|---|---|
| Screenshot (1024×768) | ~765–1,050 tokens | ~15,000–21,000 tokens in images alone |
| Screenshot (1920×1080) | ~1,840–2,125 tokens | ~37,000–43,000 tokens in images alone |
| Screenshot (2048×2048) | ~2,765–3,500 tokens | ~55,000–70,000 tokens in images alone |
| ApexComputerUse (full map) | 400–1,800 tokens (one-time) + ~10 per action | ~1,000 tokens total |
| ApexComputerUse (?onscreen=true) | 200–600 tokens (one-time) + ~10 per action | ~400 tokens total |
Provider breakdown: at 1024×768, Anthropic ≈ 1,050 tokens / OpenAI ≈ 765 tokens. At 1920×1080, Anthropic ≈ 1,840 / OpenAI ≈ 2,125. At 2048×2048, OpenAI ≈ 2,765 / Anthropic ≈ 2,500–3,500. Gemini is notably more efficient — typically under 1,000 tokens even for ~4K images. All providers compound costs across steps: every screenshot remains in context for the life of the conversation.
Screenshot: 2,500 tokens each · Initial map: 400 tokens · Per-action after map: 8 tokens
By time period — 1 person:
| Timeframe | Screen Capture | Map Approach | Tokens Saved |
|---|---|---|---|
| 1 day | 250,000 | 1,192 | 248,808 |
| 1 week | 1,750,000 | 8,344 | 1,741,656 |
| 1 year | 91,250,000 | 435,080 | 90,814,920 |
Annual totals — by team size:
| Team Size | Screen Capture | Map Approach | Reduction Factor |
|---|---|---|---|
| 1 person | 91,250,000 | 435,080 | ~210x |
| 10 people | 912,500,000 | 4,350,800 | ~210x |
| 50 people | 4,562,500,000 | 21,754,000 | ~210x |
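The figures in the two tables above can be reproduced with a short calculation. The per-item costs come from the assumptions line (2,500 / 400 / 8 tokens); the workload of 100 steps per person per day is not stated in the text and is inferred here from 250,000 / 2,500, as is the reading that the map is built once per day and counts as the first step.

```python
SCREENSHOT_TOKENS = 2_500  # per screenshot, from the stated assumptions
INITIAL_MAP_TOKENS = 400   # one-time element map
ACTION_TOKENS = 8          # each action after the map
STEPS_PER_DAY = 100        # inferred from the table: 250,000 / 2,500


def screenshot_tokens(days, people=1):
    """Every step sends a fresh screenshot."""
    return SCREENSHOT_TOKENS * STEPS_PER_DAY * days * people


def map_tokens(days, people=1):
    """One map per day, then the remaining 99 steps at the per-action cost."""
    per_day = INITIAL_MAP_TOKENS + (STEPS_PER_DAY - 1) * ACTION_TOKENS
    return per_day * days * people


for label, days in [("1 day", 1), ("1 week", 7), ("1 year", 365)]:
    shots, mapped = screenshot_tokens(days), map_tokens(days)
    print(f"{label}: {shots:,} vs {mapped:,} "
          f"(saved {shots - mapped:,}, ~{shots / mapped:.0f}x)")
```

Both totals scale linearly with days and team size, which is why the reduction factor stays at roughly 210x in every row.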
Start the HTTP server from the Remote Control group box, then use curl or open http://localhost:8080/?apiKey=<key> in a browser to access the interactive test console.
Authentication reminder: every route except GET /health requires the API key. For curl, add -H "X-Api-Key: <key>". For browser URLs, append ?apiKey=<key>.
Opening the root URL in any browser launches a dark-themed console with:
- Windows panel — live list of all open windows; click to select and auto-load its element tree
- Elements panel — nested element tree flattened with indentation; onscreen-only toggle; ControlType filter; click any element to select it
- Command builder — grouped action buttons covering every action: Click, Text, Keys, State, Scroll, Toggle, Select, Window, Range/Slider, Grid/Table, Transform, Wait, Capture, AI Vision; Value input (multiline, Ctrl+Enter to execute) with context-sensitive hints; ▶ Execute button
- AI Vision buttons: status, describe, ask, file; requires model loaded on the Model tab
- Format selector: dropdown in the header (JSON / HTML / Text / PDF); all requests use the selected format; format demo links (help, status, windows) open directly in a new tab in the chosen format
- Scene Editor link — opens the browser-based canvas editor in a new tab
- Response log — newest result at top; captures rendered as inline images (click to zoom); PDF responses shown as an "Open PDF" link (browser-native rendering)
Every endpoint adapts its response to whatever format the caller can consume, selected by priority:
- URL file extension: append .json, .html, .txt, or .pdf to any path
- ?format= query parameter: html, text, json, or pdf
- Accept request header: text/html, text/plain, application/json, or application/pdf
- Default: html
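The priority order in the list above can be expressed as a small resolver. This is a client-side sketch of the documented rule, not the server's code; in particular, picking the first recognised media type from the Accept header and ignoring q-values is a simplifying assumption.

```python
from urllib.parse import parse_qs, urlsplit

EXTENSIONS = {".json": "json", ".html": "html", ".txt": "text", ".pdf": "pdf"}
ACCEPT_TYPES = {
    "text/html": "html",
    "text/plain": "text",
    "application/json": "json",
    "application/pdf": "pdf",
}


def resolve_format(url, accept_header=""):
    """Pick a format: URL extension, then ?format=, then Accept, then html."""
    parts = urlsplit(url)
    for extension, fmt in EXTENSIONS.items():
        if parts.path.lower().endswith(extension):
            return fmt
    requested = parse_qs(parts.query).get("format", [""])[0].lower()
    if requested in EXTENSIONS.values():
        return requested
    # Simplification: take the first recognised media type, ignoring q-values.
    for media_type in accept_header.split(","):
        fmt = ACCEPT_TYPES.get(media_type.split(";")[0].strip().lower())
        if fmt:
            return fmt
    return "html"
```

An agent can use the same logic to predict which format it will get back before choosing a parser for the response.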
# URL extension (highest priority — works even if the AI cannot set headers or query params)
curl http://localhost:8080/status.json
curl http://localhost:8080/help.txt
curl http://localhost:8080/windows.html
curl http://localhost:8080/status.pdf --output status.pdf
# ?format= query parameter
curl "http://localhost:8080/ping?format=text"
curl "http://localhost:8080/ping?format=json"
# Accept header
curl -H "Accept: application/json" http://localhost:8080/ping
curl -H "Accept: application/pdf" http://localhost:8080/help --output help.pdf
# HTML response (default — works in any browser or AI that can fetch a page)
curl http://localhost:8080/ping
HTML includes a `<pre>` block for human readability and an embedded `<script type="application/json" id="apex-result">` block containing the full result as JSON, allowing any AI that can fetch a webpage to extract structured data without a vision model.
PDF is a valid A4 document using the built-in Courier font (no external dependencies). Useful for AI systems that can only accept PDF attachments.
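The format priority described above (extension, then `?format=`, then `Accept`, then the `html` default) can be sketched as follows. This is an illustration of the documented order only, not the server's code; `resolve_format` is a hypothetical name, and how the server treats `Accept` quality values is not documented, so this sketch simply takes the first recognised media type.

```python
from typing import Optional

EXTENSIONS = {".json": "json", ".html": "html", ".txt": "text", ".pdf": "pdf"}
ACCEPT_TYPES = {
    "application/json": "json",
    "text/html": "html",
    "text/plain": "text",
    "application/pdf": "pdf",
}
QUERY_VALUES = {"json", "html", "text", "pdf"}


def resolve_format(path: str, query_format: Optional[str] = None,
                   accept: Optional[str] = None) -> str:
    """Pick a response format using the documented priority order."""
    # 1. URL file extension has the highest priority.
    for extension, fmt in EXTENSIONS.items():
        if path.lower().endswith(extension):
            return fmt
    # 2. ?format= query parameter.
    if query_format and query_format.lower() in QUERY_VALUES:
        return query_format.lower()
    # 3. Accept header: first media type this sketch recognises (assumption).
    if accept:
        for part in accept.split(","):
            media_type = part.split(";")[0].strip().lower()
            if media_type in ACCEPT_TYPES:
                return ACCEPT_TYPES[media_type]
    # 4. Default.
    return "html"
```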
All command endpoints accept both POST (JSON body) and GET (query string parameters), so any command can be expressed as a plain URL — no request body required:
# Find a window via GET
curl "http://localhost:8080/find?window=Notepad"
# Execute an action via GET
curl "http://localhost:8080/exec?action=gettext"
# Combine with URL extension for full URL-only access
curl "http://localhost:8080/find.json?window=Notepad&id=15"
curl "http://localhost:8080/exec.pdf?action=describe" --output result.pdf
GET parameter names match the JSON body field names: `window`, `id` / `automationId`, `name` / `elementName`, `type` / `searchType`, `action`, `value`, `onscreen`, `depth`, `prompt`, `model`, `proj`.
`/elements`-specific: `depth=N` limits tree depth (truncated nodes show `childCount`); `id=<numericId>` expands from a previously mapped element without clearing the rest of the map.
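Because every command can be written as a plain URL, an agent can build requests with nothing more than string formatting. The sketch below assumes the default port; `command_url` is a hypothetical helper, and lower-casing booleans to `true`/`false` is an assumption based on the `onscreen=true` examples.

```python
from urllib.parse import urlencode

# Assumption: default port from the quick start.
BASE_URL = "http://localhost:8080"


def command_url(endpoint: str, ext: str = "json", **params) -> str:
    """Return a URL-only command such as /find.json?window=Notepad&id=15."""
    query = {}
    for key, value in params.items():
        if value is None:
            continue  # skip parameters the caller did not set
        # Booleans are written as true/false, matching onscreen=true.
        query[key] = str(value).lower() if isinstance(value, bool) else str(value)
    url = f"{BASE_URL}/{endpoint}.{ext}"
    return f"{url}?{urlencode(query)}" if query else url
```

For example, `command_url("elements", match="add to cart", onscreen=True)` yields the same `match=add+to+cart&onscreen=true` query string used in the curl examples.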
All endpoints return the same canonical structure:
{
"success": true,
"action": "ping",
"data": { "key": "value", ... },
"error": null,
"error_data": null
}
HTTP status: 200 on success, 400 on error.
error_data is an additive object populated on failures (null when there is no error). Its shape is action-specific — for example, action-execution failures may carry failed_pattern, supported_patterns, element_state, and a remediation hint; waitfor timeouts carry timeout_ms, predicate, property, expected, and last_observed; wait-window timeouts carry last_observed_titles. Existing callers that only read success / data / error continue to work unchanged.
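A client can treat the envelope uniformly: return `data` on success and surface `error` plus `error_data` on failure. The following is a minimal sketch; `ApexError` and `unwrap` are hypothetical names, and the failing sample body in the usage below is made up for illustration.

```python
import json


class ApexError(Exception):
    """Raised when the response envelope reports success=false."""

    def __init__(self, action, message, error_data):
        super().__init__(f"{action}: {message}")
        self.action = action
        # error_data is null when absent, so fall back to an empty dict.
        self.error_data = error_data or {}


def unwrap(body: str):
    """Return the data field of a JSON response body, or raise ApexError."""
    envelope = json.loads(body)
    if envelope.get("success"):
        return envelope.get("data")
    raise ApexError(envelope.get("action"), envelope.get("error"),
                    envelope.get("error_data"))
```

Callers that only care about `success` / `data` / `error` never touch `error_data`, while a retry loop can inspect fields such as `last_observed` when they are present.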
gettext and getvalue responses include a source field inside data — one of TextPattern, ValuePattern, LegacyIAccessible, or Name — naming the UIA accessor that produced the text. Inside batch step results this appears as extras.source.
Element nodes returned by /elements and /find include className alongside id, controlType, name, automationId, frameworkId, isEnabled, isOffscreen, and boundingRectangle. match= searches className along with the other text fields.
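Since `/elements` returns nested nodes, a client usually flattens the tree before choosing a target. This sketch assumes each node lists its children under a `children` key, as in the element JSON sample elsewhere in this README; `iter_elements` and `find_by_type` are hypothetical helper names.

```python
from typing import Iterator


def iter_elements(node: dict) -> Iterator[dict]:
    """Yield a node and all of its descendants, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from iter_elements(child)


def find_by_type(root: dict, control_type: str, enabled_only: bool = True) -> list:
    """Collect nodes whose controlType matches, optionally skipping disabled ones."""
    return [
        node for node in iter_elements(root)
        if node.get("controlType") == control_type
        and (not enabled_only or node.get("isEnabled", True))
    ]
```

The numeric `id` of any node found this way can then be passed back to `/find` to avoid fuzzy matching.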
# Unauthenticated liveness probe — safe for external monitoring (the only route that doesn't require the API key)
curl http://localhost:8080/health
# Authenticated health check
curl -H "X-Api-Key: <key>" http://localhost:8080/ping
# Per-route request counters
curl -H "X-Api-Key: <key>" http://localhost:8080/metrics
# Recent WindowMonitor activity (window open/close/rename, optional element add/remove). Append .json for raw JSON.
curl -H "X-Api-Key: <key>" http://localhost:8080/winmon/log.json
# Drain the buffer
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/winmon/clear.json
# System information (OS, machine, user, CPU, CLR)
curl -H "X-Api-Key: <key>" http://localhost:8080/sysinfo
# All environment variables
curl -H "X-Api-Key: <key>" http://localhost:8080/env
# Directory listing (defaults to current working directory)
curl -H "X-Api-Key: <key>" http://localhost:8080/ls
curl -H "X-Api-Key: <key>" "http://localhost:8080/ls?path=C:\Users"
# Trigger the bundled integration test runner (TestApplications/TestRunner)
# Requires TestRunnerExePath (or APEX_TEST_RUNNER_EXE_PATH) to be configured.
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/run-tests
# Gracefully stop the HTTP server
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/shutdown
# Run a shell command (cmd.exe /c); 30-second timeout
# Requires EnableShellRun = true in appsettings.json or APEX_ENABLE_SHELL_RUN=true
curl -H "X-Api-Key: <key>" "http://localhost:8080/run?cmd=whoami"
curl -H "X-Api-Key: <key>" "http://localhost:8080/run?command=whoami"
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/run \
-H "Content-Type: application/json" \
-d '{"command":"dir C:\\"}'
`/run` response data fields: `cmd`, `stdout`, `stderr`, `exit_code`.
Security note: `/run` executes arbitrary commands as the process user. It is disabled by default and should only be enabled in trusted, authenticated environments.
# List all open windows (with stable IDs)
curl http://localhost:8080/windows
# Get current state
curl http://localhost:8080/status
# List all elements in the current window (nested JSON with IDs and bounding rectangles)
curl http://localhost:8080/elements
# Onscreen elements only — prunes offscreen subtrees for maximum token efficiency
curl "http://localhost:8080/elements?onscreen=true"
# Limit depth — truncated nodes show "childCount" so you know where to drill in
curl "http://localhost:8080/elements?depth=2&onscreen=true"
# Expand a specific node by numeric ID (preserves the rest of the map — IDs stay stable)
curl "http://localhost:8080/elements?id=<elementId>&depth=2&onscreen=true"
# Filter by ControlType
curl "http://localhost:8080/elements?type=Button"
# Text search across Name, AutomationId, Value, and ClassName — returns only
# matching branches, each wrapped in its ancestor path, with `depth` levels below.
curl "http://localhost:8080/elements?match=add+to+cart&onscreen=true&depth=1"
# Collapse identity-less single-child Pane/Group/Custom wrapper chains
# (named containers and anything with an AutomationId are preserved).
curl "http://localhost:8080/elements?onscreen=true&collapseChains=true"
# Add an ancestor breadcrumb ("path") to every emitted node.
curl "http://localhost:8080/elements?onscreen=true&includePath=true"
# Opt into Value pattern + HelpText (omitted by default to keep payloads small).
curl "http://localhost:8080/elements?onscreen=true&properties=extra"
# All filters combined
curl "http://localhost:8080/elements?depth=3&onscreen=true&type=Button&collapseChains=true&match=submit&properties=extra"
# Render the current window's UI element tree as a colour-coded PNG (returns base64)
curl http://localhost:8080/uimap
# Help
curl http://localhost:8080/help
# Find a window and element by title/name
curl -X POST http://localhost:8080/find \
-H "Content-Type: application/json" \
-d '{"window":"Notepad","id":"15"}'
# Find by element name with ControlType filter
curl -X POST http://localhost:8080/find \
-H "Content-Type: application/json" \
-d '{"window":"Notepad","name":"Text Editor","type":"Edit"}'
# Find by numeric window/element IDs (fast, no fuzzy search)
curl -X POST http://localhost:8080/find \
-H "Content-Type: application/json" \
-d '{"window":42,"id":105}'
# Visual Studio handoff targets:
# F5/debug: find name="Debug Target" type="SplitButton", then exec keys {F5}
# Ctrl+F5/no-debug: find name="Start Without Debugging" type="Button", then exec keys Ctrl+{F5}
# Type text into the found element
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"action":"type","value":"Hello World"}'
# Click a button
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"action":"click"}'
# Read text from element
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"action":"gettext"}'
# Capture current element (returns base64 PNG in data field)
curl -X POST http://localhost:8080/capture
# Capture full screen
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"screen"}'
# Capture multiple elements stitched into one image
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"elements","value":"42,105,106"}'
# OCR the found element
curl -X POST http://localhost:8080/ocr
# OCR a region (x,y,width,height) within the element
curl -X POST http://localhost:8080/ocr \
-H "Content-Type: application/json" \
-d '{"value":"0,0,300,50"}'
# Check AI model status
curl http://localhost:8080/ai/status
# Load a vision/audio LLM (run once; model stays loaded until the server restarts)
curl -X POST http://localhost:8080/ai/init \
-H "Content-Type: application/json" \
-d '{"model":"C:\\models\\vision.gguf","proj":"C:\\models\\mmproj.gguf"}'
# Describe the currently selected UI element using the vision model
# Captures the element as an image and sends it to the LLM
curl -X POST http://localhost:8080/ai/describe
# Describe with a custom prompt
curl -X POST http://localhost:8080/ai/describe \
-H "Content-Type: application/json" \
-d '{"prompt":"List every button you can see."}'
# Ask a specific question about the current element
curl -X POST http://localhost:8080/ai/ask \
-H "Content-Type: application/json" \
-d '{"prompt":"Is there an error message visible?"}'
# Describe an image file on disk
curl -X POST http://localhost:8080/ai/file \
-H "Content-Type: application/json" \
-d '{"value":"C:\\screenshots\\app.png","prompt":"What dialog is shown?"}'
| Field | Aliases | Description |
|---|---|---|
| `window` | — | Window title (partial match) or numeric ID from `/windows` |
| `automationId` | `id` | Element AutomationId string or numeric ID from `/elements` |
| `elementName` | `name` | Element Name property (fallback if `id` not given) |
| `searchType` | `type` | ControlType filter (`All` or e.g. `Button`) |
| `action` | — | Action name (see list below) |
| `value` | — | Value/input for the action |
| `model` | `modelPath` | AI: path to LLM `.gguf` file |
| `proj` | `mmProjPath` | AI: path to multimodal projector `.gguf` file |
| `prompt` | — | AI: question or instruction text |
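The alias column above can be applied client-side so that the rest of an agent's code only deals with canonical field names. This is a sketch under assumptions: `normalize_request` is a hypothetical helper, and rejecting a request that supplies both a field and its alias is this sketch's choice, since the server's behaviour in that case is not documented.

```python
# Alias -> canonical field name, taken from the field table.
ALIASES = {
    "id": "automationId",
    "name": "elementName",
    "type": "searchType",
    "modelPath": "model",
    "mmProjPath": "proj",
}


def normalize_request(fields: dict) -> dict:
    """Rewrite aliased keys to their canonical names."""
    normalized = {}
    for key, value in fields.items():
        canonical = ALIASES.get(key, key)
        if canonical in normalized:
            # Assumption: treat a field given twice as a caller error.
            raise ValueError(f"'{key}' duplicates '{canonical}'")
        normalized[canonical] = value
    return normalized
```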
The drawing engine renders GDI+ shapes to a base64 PNG on demand. Every shape type supports colour, opacity, fill/stroke, and dashed lines.
# Draw a filled blue circle with white text
# (the drawing is a JSON string inside "value"; keep it on one line, since raw newlines are not valid inside a JSON string)
curl -X POST http://localhost:8080/draw \
-H "Content-Type: application/json" \
-d '{"value": "{\"canvas\":\"blank\",\"width\":400,\"height\":300,\"shapes\":[{\"type\":\"circle\",\"x\":200,\"y\":150,\"r\":80,\"color\":\"royalblue\",\"fill\":true},{\"type\":\"text\",\"x\":200,\"y\":140,\"text\":\"Hello!\",\"color\":\"white\",\"font_size\":20,\"font_bold\":true,\"align\":\"center\"}]}"}'
# Render the built-in space scene
curl http://localhost:8080/draw/demo
# Show it as a full-screen overlay for 6 seconds
curl "http://localhost:8080/draw/demo?overlay=true&ms=6000"
The `data.result` field contains the base64 PNG. The web console renders it inline.
| Type | Key fields | Description |
|---|---|---|
| `rect` | `x` `y` `w` `h` `corner_radius` | Rectangle (rounded if `corner_radius` > 0) |
| `ellipse` | `x` `y` `w` `h` | Ellipse inside bounding box |
| `circle` | `x` `y` `r` | Circle — x,y is the centre |
| `line` | `x` `y` `x2` `y2` | Straight line |
| `arrow` | `x` `y` `x2` `y2` | Line with arrowhead at (x2,y2) |
| `polygon` | `points[]` | Closed polygon — flat array of x,y pairs |
| `triangle` | `x` `y` `w` `h` | Triangle — bounding-box anchored, top-centre apex |
| `arc` | `x` `y` `w` `h` `start_angle` `sweep_angle` | Open arc — angles in degrees, clockwise from 3 o'clock |
| `text` | `x` `y` `text` `font_size` `font_bold` `align` `background` | Rendered text |
Common fields on all shapes: color, fill (bool), stroke_width, opacity (0–1), dashed (bool), rotation (degrees, centre-origin).
Canvas values: blank (transparent), white, black, screen (live screenshot), window (current window), element (current element).
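In the `/draw` request the drawing itself is a JSON string carried in the `value` field, which makes hand-escaping error-prone. A small builder that encodes twice avoids that; `draw_payload` is a hypothetical helper name, and the defaults mirror the curl example rather than any documented server default.

```python
import json


def draw_payload(shapes: list, canvas: str = "blank",
                 width: int = 400, height: int = 300) -> str:
    """Return a /draw request body with the drawing encoded as a JSON string."""
    drawing = {"canvas": canvas, "width": width, "height": height, "shapes": shapes}
    # The inner json.dumps produces the string stored in "value";
    # the outer json.dumps escapes its quotes for the request body.
    return json.dumps({"value": json.dumps(drawing)})
```

The returned string can be sent as the POST body with `Content-Type: application/json`.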
The scene system lets AI agents and users collaborate on persistent, structured drawings. Every shape has a stable ID; coordinates are always accurate; the AI can read them back and refine the composition at any time.
# Create a scene
curl -X POST http://localhost:8080/scenes \
-H "Content-Type: application/json" \
-d '{"name":"My Scene","width":800,"height":600,"background":"#1a1a2e"}'
# → data.scene contains the full scene with id
# List scenes
curl http://localhost:8080/scenes
# Get a scene
curl http://localhost:8080/scenes/{id}
# Add a layer
curl -X POST http://localhost:8080/scenes/{id}/layers \
-H "Content-Type: application/json" \
-d '{"name":"Background"}'
# Add a shape to a layer
curl -X POST http://localhost:8080/scenes/{id}/layers/{lid}/shapes \
-H "Content-Type: application/json" \
-d '{"shape":{"type":"circle","x":400,"y":300,"r":80,"color":"royalblue","fill":true},"name":"Planet"}'
# Render the scene to a PNG
curl http://localhost:8080/scenes/{id}/render
# → data.result is base64 PNG
# Patch shape geometry (after user drags it — never clobbers color/style)
curl -X PATCH http://localhost:8080/scenes/{id}/layers/{lid}/shapes/{sid} \
-H "Content-Type: application/json" \
-d '{"x":420,"y":310}'
# Move a shape to a different layer
curl -X POST http://localhost:8080/scenes/{id}/shapes/{sid}/move \
-H "Content-Type: application/json" \
-d '{"target_layer_id":"{newLayerId}"}'
# Delete a shape / layer / scene
curl -X DELETE http://localhost:8080/scenes/{id}/layers/{lid}/shapes/{sid}
curl -X DELETE http://localhost:8080/scenes/{id}/layers/{lid}
curl -X DELETE http://localhost:8080/scenes/{id}
| Method | Route | Description |
|---|---|---|
| `GET` / `POST` | `/scenes` | List all scenes / create scene |
| `GET` / `PUT` / `PATCH` / `DELETE` | `/scenes/{id}` | Get / update meta / delete scene |
| `GET` | `/scenes/{id}/render` | Render scene → base64 PNG |
| `GET` / `POST` | `/scenes/{id}/layers` | List layers / add layer |
| `GET` / `PUT` / `PATCH` / `DELETE` | `/scenes/{id}/layers/{lid}` | Get / update / delete layer |
| `GET` / `POST` | `/scenes/{id}/layers/{lid}/shapes` | List shapes / add shape |
| `GET` / `PUT` / `PATCH` / `DELETE` | `/scenes/{id}/layers/{lid}/shapes/{sid}` | Get / replace / patch geometry / delete shape |
| `POST` | `/scenes/{id}/shapes/{sid}/move` | Move shape to a different layer |
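The nested scene routes in the table follow one pattern (scene, then layer, then shape), so a client can derive them from the IDs it holds. A minimal sketch, with `scene_path` as a hypothetical helper name:

```python
from typing import Optional


def scene_path(scene_id: Optional[str] = None, layer_id: Optional[str] = None,
               shape_id: Optional[str] = None) -> str:
    """Build /scenes, /scenes/{id}, .../layers/{lid} or .../shapes/{sid}."""
    if layer_id is not None and scene_id is None:
        raise ValueError("a layer path needs its scene id")
    if shape_id is not None and layer_id is None:
        raise ValueError("a shape path needs its layer id")
    path = "/scenes"
    if scene_id is not None:
        path += f"/{scene_id}"
    if layer_id is not None:
        path += f"/layers/{layer_id}"
    if shape_id is not None:
        path += f"/shapes/{shape_id}"
    return path
```

Appending `/render` to `scene_path(scene_id)` gives the render route; the move route sits directly under the scene and is not covered by this helper.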
The desktop editor opens a standalone window with:
- Scene list — create, select, or delete scenes
- Toolbar — arrow (select/move), rect, ellipse, circle, line, text, delete
- Canvas — double-buffered; drag shapes to reposition; draw new shapes by clicking and dragging; mouse wheel to zoom
- Layers panel — add/delete layers; click to select the active layer; eye icon to toggle visibility
- Properties panel — x, y, w, h, r fields for the selected shape; edits commit to the store immediately
- Keyboard shortcuts — V/R/E/C/L/T for tools, Delete to remove selected shape, Escape to deselect
All changes are persisted to disk (%LOCALAPPDATA%\ApexComputerUse\scenes\{id}.json) and immediately available via the REST API.
Open http://localhost:8080/editor?apiKey=<key> for the same editing experience in a browser:
- HTML5 Canvas renderer for all 7 shape types
- Click-and-drag to place shapes; click to select and drag to move
- Layer panel with add/delete/visibility toggle
- Properties panel showing live coordinates
- Keyboard shortcuts (V/R/E/C/L/T, Delete, Escape)
- All changes sync to the same `/scenes/*` REST API
After starting the bot, send commands to it in any Telegram chat:
/find window=Notepad id=15
/find window=Calculator name=Equals type=Button
/exec action=type value="Hello from Telegram"
/exec action=click
/exec action=gettext
/ocr
/ocr value=0,0,300,50
/status
/windows
/elements
/elements type=Button
/help
Key=value pairs support quoted values for multi-word strings:
/find window="My Application" name="Save Button"
/exec action=type value="some text with spaces"
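The key=value syntax with optional quoting can be parsed with a single regular expression. This is only an illustration of the syntax shown above, not the bot's actual parser; `parse_command` and `PAIR` are hypothetical names.

```python
import re

# key="quoted value" or key=bare_value (bare values end at whitespace).
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_command(text: str):
    """Split '/find window="My App" id=15' into ('find', {...})."""
    head, _, rest = text.strip().partition(" ")
    command = head.lstrip("/")
    args = {}
    for key, quoted, bare in PAIR.findall(rest):
        # Exactly one of the two groups matched; the other is empty.
        args[key] = bare or quoted
    return command, args
```

A regular expression is used instead of shell-style splitting so that backslashes in Windows paths such as `C:\screenshots\app.png` pass through unchanged.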
AI commands work the same way:
/ai action=status
/ai action=init model=C:\models\vision.gguf proj=C:\models\mmproj.gguf
/ai action=describe
/ai action=describe prompt="List every button you can see."
/ai action=ask prompt="Is there an error message visible?"
/ai action=file value=C:\screenshots\app.png prompt="What dialog is shown?"
The app exposes a named pipe server (default name ApexComputerUse). Start it from the Remote Control group box, then use the bundled ApexComputerUse.psm1 module:
# Import the module
Import-Module .\Scripts\ApexComputerUse.psm1
# Connect to the pipe (must be started in the app first)
Connect-FlaUI # default pipe name: ApexComputerUse
Connect-FlaUI -PipeName MyPipe -TimeoutMs 10000
# Discovery
Get-FlaUIWindows # list all open window titles
Get-FlaUIStatus # current window/element state
Get-FlaUIHelp # command reference
Get-FlaUIElements # list all elements in current window
Get-FlaUIElements -Type Button # filter by ControlType
# Find
Find-FlaUIElement -Window 'Notepad'
Find-FlaUIElement -Window 'Notepad' -Name 'Text Editor' -Type Edit
Find-FlaUIElement -Window 'Calculator' -Id 'num5Button'
# Execute actions
Invoke-FlaUIAction -Action click
Invoke-FlaUIAction -Action type -Value 'Hello from PowerShell'
Invoke-FlaUIAction -Action gettext
Invoke-FlaUIAction -Action screenshot
# OCR
Invoke-FlaUIOcr
Invoke-FlaUIOcr -Region '0,0,300,50'
# AI
Invoke-FlaUIAi -SubCommand init -Model 'C:\models\v.gguf' -Proj 'C:\models\p.gguf'
Invoke-FlaUIAi -SubCommand status
Invoke-FlaUIAi -SubCommand describe -Prompt 'What buttons are visible?'
Invoke-FlaUIAi -SubCommand ask -Prompt 'Is there an error message?'
Invoke-FlaUIAi -SubCommand file -Value 'C:\screen.png' -Prompt 'Describe this.'
# Send raw JSON (advanced)
Send-FlaUICommand @{ command='find'; window='Notepad'; elementName='Text Editor' }
# Disconnect
Disconnect-FlaUI
| Cmdlet | Key Parameters | Description |
|---|---|---|
| `Connect-FlaUI` | `PipeName`, `TimeoutMs` | Connect to the pipe server |
| `Disconnect-FlaUI` | — | Close the connection |
| `Send-FlaUICommand` | `Request` (hashtable) | Send a raw JSON command |
| `Get-FlaUIWindows` | — | List open window titles |
| `Get-FlaUIStatus` | — | Show current window/element |
| `Get-FlaUIHelp` | — | Server command reference |
| `Get-FlaUIElements` | `Type` | List elements in current window |
| `Find-FlaUIElement` | `Window`, `Id`, `Name`, `Type` | Find a window and element |
| `Invoke-FlaUIAction` | `Action`, `Value` | Execute action on current element |
| `Invoke-FlaUIOcr` | `Region` | OCR current element or region |
| `Invoke-FlaUICapture` | `Target`, `Value` | Capture screen/window/element(s); returns base64 PNG in `data` |
| `Invoke-FlaUIAi` | `SubCommand`, `Model`, `Proj`, `Prompt`, `Value` | Multimodal AI sub-commands |
The pipe connection is session-based: window and element state are preserved across calls within a single `Connect-FlaUI` / `Disconnect-FlaUI` session. Use `Find-FlaUIElement` to select a target, then call `Invoke-FlaUIAction` as many times as needed without re-finding.
Use Scripts\apex.cmd — a batch helper that wraps the HTTP server with simpler positional syntax. Requires the HTTP server to be started first and curl (built-in on Windows 10+).
:: Optional: override port (default is 8080)
set APEX_HTTP_PORT=8080
:: Discovery
apex windows
apex status
apex elements
apex elements Button
apex help
:: Find a window and element
apex find Notepad
apex find "My App" id=btnOK
apex find Notepad name="Text Editor" type=Edit
:: Execute actions
apex exec click
apex exec type value=Hello
apex exec gettext
apex exec screenshot
:: Capture
apex capture
apex capture action=screen
apex capture action=window
apex capture action=elements value=42,105,106
:: OCR
apex ocr
apex ocr 0,0,300,50
:: AI
apex ai status
apex ai init model=C:\models\v.gguf proj=C:\models\p.gguf
apex ai describe
apex ai describe prompt="What do you see?"
apex ai ask prompt="Is there an error message?"
apex ai file value=C:\screen.png prompt="Describe this."
Add `Scripts\` to your PATH (or copy `apex.cmd` next to your scripts) to use it from any directory.
The AI command set is backed by MtmdHelper, which uses LLamaSharp to run a local multimodal (vision + audio) LLM. No cloud API is required.
Download a vision-capable GGUF model and its multimodal projector (e.g. LFM2.5-VL from LM Studio) and note the paths to both .gguf files, or use Download All on the Model tab. Then call ai init before any inference commands.
| Sub-action | Required params | Optional params | Description |
|---|---|---|---|
| `init` | `model=<path>` `proj=<path>` | — | Load the LLM and projector into memory |
| `status` | — | — | Report whether the model is loaded and which modalities it supports |
| `describe` | — (uses current element) | `prompt=<text>` | Capture the current UI element as an image and ask the vision model to describe it |
| `ask` | `prompt=<text>` | — | Ask a specific question about the current UI element (captures element image) |
| `file` | `value=<file path>` | `prompt=<text>` | Send an image or audio file from disk to the model |
Note: `describe`, `ask`, and `file` require a prior `find` command to select a window/element. The model must be initialized with `init` before any inference call. Each inference call starts completely fresh; no chat history is retained between calls.
The HTTP test console (GET /) has a dedicated AI Vision button group (purple-tinted):
| Button | Endpoint | Value field |
|---|---|---|
| status | `GET /ai/status` | — |
| describe | `POST /ai/describe` | Optional prompt (e.g. `list all buttons`) |
| ask | `POST /ai/ask` | Required question (e.g. `what number is shown?`) |
Select an element in the Elements panel first, then click describe or ask. The console shows a "Running vision model…" notice immediately and updates with the result when inference completes.
The UI Map Renderer scans the current window's accessibility tree and renders every element's bounding rectangle as a colour-coded overlay. Each control type gets a deterministic, visually distinct colour. Element names are drawn inside the bounding box.
# Returns base64-encoded PNG of the current window's element tree
curl http://localhost:8080/uimap
Requires a prior `find` call to select a window. The response `data.result` field contains the base64 PNG, in the same format as the `/capture` endpoints. In the interactive test console, the UI map button (in the Capture group) renders the result inline in the response log.
Tools → Render UI Map draws the overlay directly on screen for 5 seconds (press Escape to dismiss early) and offers to save it as a PNG file. This live on-screen overlay is only available from the menu, not via the HTTP API.
Tools → Output UI Map logs the raw nested JSON element tree to the console tab — useful for inspecting the tree structure or copying it for use with an AI agent.
Element JSON includes bounding rectangles:
{
"id": 105,
"controlType": "Button",
"name": "OK",
"automationId": "btn_ok",
"boundingRectangle": { "x": 120, "y": 340, "width": 80, "height": 30 },
"children": []
}
| Action | Aliases | Value | Description |
|---|---|---|---|
| `click` | — | — | Smart click: Invoke → Toggle → SelectionItem → mouse fallback |
| `mouse-click` | `mouseclick` | — | Force mouse left-click (bypasses smart chain) |
| `middle-click` | `middleclick` | — | Middle-mouse-button click |
| `invoke` | — | — | Invoke pattern directly |
| `right-click` | `rightclick` | — | Right-click |
| `double-click` | `doubleclick` | — | Double-click |
| `click-at` | `clickat` | `x,y` | Click at pixel offset from element top-left |
| `drag` | — | `x,y` | Drag element to screen coordinates |
| `hover` | — | — | Move mouse over element |
| `highlight` | — | — | Draw orange highlight around element for 1 second |
| `focus` | — | — | Set keyboard focus |
| `keys` | — | text | Send keystrokes; supports `{CTRL}`, `{ALT}`, `{SHIFT}`, `{F5}`, `Ctrl+A`, `Alt+F4`, etc. |
| `screenshot` | `capture` | — | Save element image to `Desktop\Apex_Captures` |
| `describe` | — | — | Return full element property description (UIA properties — not AI vision) |
| `patterns` | — | — | List automation patterns supported by the element |
| `bounds` | — | — | Return bounding rectangle |
| `isenabled` | — | — | Returns `True` or `False` |
| `isvisible` | — | — | Returns `True` or `False` |
| `wait` | — | automationId | Wait for element with given AutomationId to appear |
| `wait-page-load` | `waitpageload` | seconds (default 10) | Poll window title until browser page finishes loading; returns page title on success |
Visual Studio run buttons: for a test handoff, target name="Debug Target" with type="SplitButton" for the F5/debug path, and name="Start Without Debugging" with type="Button" for the Ctrl+F5/no-debug path. Prefer numeric element IDs after an /elements scan to avoid fuzzy matching entirely.
| Action | Aliases | Value | Description |
|---|---|---|---|
| `waitfor` | — | see below | Poll the current element until predicate satisfied or timeout |
| `wait-window` | — | see below | Poll the desktop window list until a window title satisfies predicate |
waitfor parameters: predicate=<equals|contains|not-empty|visible|gone>, optional property=<value|text|name|isvisible|isenabled>, optional expected=<text>, optional timeout=<ms> (default 10000), optional interval=<ms> (default 200, min 50). visible and gone are element-level — they ignore property and expected. The success response includes elapsed_ms, property, and predicate inside data. On timeout, error_data.last_observed carries the value at the last poll ("offscreen"/"visible" for visible, "present" for gone-while-still-present, otherwise the property string).
wait-window parameters: predicate=<equals|contains|not-empty|gone>, expected=<title-substring> (required for all but not-empty), optional timeout=<ms> (default 10000), optional interval=<ms> (default 250). On match, the new window is registered in the window map and set as the current window — the next /find or /elements call resolves it without needing a window= field. Timeout error_data.last_observed_titles is the array of titles seen at the last poll, useful for debugging.
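The polling rules above (predicate, timeout, interval with a 50 ms floor, and `last_observed` on timeout) can be mirrored in a client-side loop, which is handy for testing agent logic without a live server. This is a sketch of the documented semantics for the value-based predicates only, not the server's implementation; `wait_for`, `check`, and the injectable `clock` / `sleep` parameters are hypothetical.

```python
import time
from typing import Callable, Optional


def check(predicate: str, observed: str, expected: Optional[str]) -> bool:
    """Evaluate one of the value-based predicates against the observed text."""
    if predicate == "equals":
        return observed == expected
    if predicate == "contains":
        return expected is not None and expected in observed
    if predicate == "not-empty":
        return observed != ""
    raise ValueError(f"unsupported predicate: {predicate}")


def wait_for(read: Callable[[], str], predicate: str,
             expected: Optional[str] = None,
             timeout_ms: int = 10000, interval_ms: int = 200,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll read() until the predicate holds or the timeout elapses."""
    interval_ms = max(interval_ms, 50)  # documented minimum interval
    start = clock()
    while True:
        observed = read()
        elapsed_ms = int((clock() - start) * 1000)
        if check(predicate, observed, expected):
            return {"success": True, "elapsed_ms": elapsed_ms,
                    "predicate": predicate}
        if elapsed_ms >= timeout_ms:
            return {"success": False,
                    "error_data": {"timeout_ms": timeout_ms,
                                   "predicate": predicate,
                                   "expected": expected,
                                   "last_observed": observed}}
        sleep(interval_ms / 1000)
```

Injecting `clock` and `sleep` keeps the loop deterministic under test; in normal use the defaults apply.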
# Wait for a debug console window to appear after launching an app
curl -X POST http://localhost:8080/exec -H "X-Api-Key: <key>" \
-d '{"action":"wait-window","predicate":"contains","expected":"Debug Console","timeout":15000}'
# Wait for the current text element to contain a specific value
curl -X POST http://localhost:8080/exec -H "X-Api-Key: <key>" \
-d '{"action":"waitfor","predicate":"contains","property":"value","expected":"OK","timeout":5000}'
# Wait for the current element to become visible
curl -X POST http://localhost:8080/exec -H "X-Api-Key: <key>" \
-d '{"action":"waitfor","predicate":"visible","timeout":3000}'
Send `actions=[...]` to `/exec` to run several commands sequentially in one round trip. Each entry is a full sub-request; `cmd` defaults to `"execute"`, so simple action lists need only `action` and (where relevant) `value`. The optional `stop_on_error` field defaults to `true`: the first failing step ends the batch and remaining steps are skipped.
curl -X POST http://localhost:8080/exec -H "X-Api-Key: <key>" \
-d '{"actions":[
{"action":"clear"},
{"action":"type","value":"hello"},
{"action":"keys","value":"{CTRL}s"}
]}'
The response's `data.result` contains `stop_on_error`, `total_steps`, `executed`, `succeeded`, and a `results` array. Each entry has `step`, `cmd`, `action`, `success`, `data`, `extras` (e.g. `source` for `gettext`/`getvalue` steps), and `message`.
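A client can build the batch body from a list of steps and then scan the `results` array for the step that stopped the run. This is a minimal sketch: `batch_body` and `first_failure` are hypothetical helpers, placing `stop_on_error` at the top level of the body is an assumption, and the sample result in the usage below (including its step numbering) is invented for illustration.

```python
import json


def batch_body(steps, stop_on_error: bool = True) -> str:
    """Build an /exec batch body from (action, value-or-None) pairs."""
    actions = []
    for action, value in steps:
        entry = {"action": action}
        if value is not None:
            entry["value"] = value  # only include value when the step needs one
        actions.append(entry)
    # Assumption: stop_on_error is sent alongside the actions array.
    return json.dumps({"actions": actions, "stop_on_error": stop_on_error})


def first_failure(result: dict):
    """Return the first failed entry from data.result['results'], or None."""
    for entry in result.get("results", []):
        if not entry.get("success"):
            return entry
    return None
```

With `stop_on_error` left at its default, the entry returned by `first_failure` is also the last step that ran.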
| Action | Aliases | Value | Description |
|---|---|---|---|
| `type` | `enter` | text | Enter text (smart: Value pattern → keyboard) |
| `insert` | — | text | Type at current caret position |
| `gettext` | `text` | — | Smart read: Text pattern → Value → LegacyIAccessible → Name |
| `getvalue` | `value` | — | Smart read: Value → Text → LegacyIAccessible → Name |
| `setvalue` | — | text | Smart set: Value pattern (if writable) → RangeValue (if numeric) → keyboard |
| `clearvalue` | — | — | Set value to empty string via Value pattern |
| `appendvalue` | — | text | Append text to current value |
| `getselectedtext` | — | — | Get selected text via Text pattern |
| `selectall` | — | — | Ctrl+A |
| `copy` | — | — | Ctrl+C |
| `cut` | — | — | Ctrl+X |
| `paste` | — | — | Ctrl+V |
| `undo` | — | — | Ctrl+Z |
| `clear` | — | — | Select all and delete |

| Action | Aliases | Value | Description |
|---|---|---|---|
| `setrange` | — | number | Set RangeValue pattern |
| `getrange` | — | — | Read current RangeValue |
| `rangeinfo` | — | — | Min / max / smallChange / largeChange |

| Action | Aliases | Value | Description |
|---|---|---|---|
| `toggle` | — | — | Toggle CheckBox (cycles state) |
| `toggle-on` | `toggleon` | — | Set toggle to On |
| `toggle-off` | `toggleoff` | — | Set toggle to Off |
| `gettoggle` | — | — | Read current toggle state (On / Off / Indeterminate) |

| Action | Aliases | Value | Description |
|---|---|---|---|
| `expand` | — | — | Expand via ExpandCollapse pattern |
| `collapse` | — | — | Collapse via ExpandCollapse pattern |
| `expandstate` | — | — | Read current ExpandCollapse state |

| Action | Aliases | Value | Description |
|---|---|---|---|
| `select` | — | item text | Select ComboBox/ListBox item by text |
| `select-item` | `selectitem` | — | Select current element via SelectionItem pattern |
| `addselect` | — | — | Add element to multi-selection |
| `removeselect` | — | — | Remove element from selection |
| `isselected` | — | — | Returns `True` or `False` |
| `getselection` | — | — | Get selected items from a Selection container |
| `select-index` | `selectindex` | n | Select ComboBox/ListBox item by zero-based index |
| `getitems` | — | — | List all items in a ComboBox or ListBox (newline-separated) |
| `getselecteditem` | — | — | Get currently selected item text |

| Action | Aliases | Value | Description |
|---|---|---|---|
| `minimize` | — | — | Minimize window |
| `maximize` | — | — | Maximize window |
| `restore` | — | — | Restore window to normal state |
| `windowstate` | — | — | Read current window visual state (Normal / Maximized / Minimized) |

| Action | Aliases | Value | Description |
|---|---|---|---|
| `move` | — | `x,y` | Move element via Transform pattern |
| `resize` | — | `w,h` | Resize element via Transform pattern |
Mouse scroll actions move the cursor to the element centre before firing the scroll event, so scrolling reliably lands in the browser content area rather than wherever the cursor happens to be.
| Action | Aliases | Value | Description |
|---|---|---|---|
| `scroll-up` | `scrollup` | n (optional) | Move cursor to element centre, scroll up n clicks (default 3) |
| `scroll-down` | `scrolldown` | n (optional) | Move cursor to element centre, scroll down n clicks (default 3) |
| `scroll-left` | `scrollleft` | n (optional) | Move cursor to element centre, horizontal scroll left n clicks (default 3) |
| `scroll-right` | `scrollright` | n (optional) | Move cursor to element centre, horizontal scroll right n clicks (default 3) |
| `scrollinto` | `scrollintoview` | — | Scroll element into view |
| `scrollpercent` | — | `h,v` | Scroll to h%/v% position via Scroll pattern (0–100) |
| `getscrollinfo` | — | — | Scroll position and scrollable flags |

| Action | Aliases | Value | Description |
|---|---|---|---|
| `griditem` | — | `row,col` | Get element description at grid cell |
| `gridinfo` | — | — | Row and column counts |
| `griditeminfo` | — | — | Row / column / span for a GridItem element |
Returns a screen capture inline as a base64-encoded PNG in the data field. Supports four targets.
| Target | Description |
|---|---|
| `element` (default) | Current element (requires a prior `find`) |
| `window` | Current window (requires a prior `find`) |
| `screen` | Full display |
| `elements` | Multiple elements by ID, stitched vertically into one image |
For elements, provide comma-separated numeric IDs from a prior elements scan in the value field.
# Current element
curl -X POST http://localhost:8080/capture
# Full screen
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"screen"}'
# Current window
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"window"}'
# Multiple elements stitched into one image
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"elements","value":"42,105,106"}'
Response `data` field contains the base64 PNG. Decode it to get the image:
curl -s -X POST http://localhost:8080/capture -d '{"action":"screen"}' \
| python -c "import sys,json,base64; d=json.load(sys.stdin)['data']; open('screen.png','wb').write(base64.b64decode(d))"
Telegram: `/capture` sends the image as a photo message (not text).
/capture
/capture action=screen
/capture action=window
/capture action=elements value=42,105,106
PowerShell:
$r = Send-FlaUICommand @{ command='capture'; action='screen' }
[IO.File]::WriteAllBytes('screen.png', [Convert]::FromBase64String($r.data))
Note: This is distinct from the `screenshot` exec action, which saves to `Desktop\Apex_Captures` and returns only the file path.
OCR uses Tesseract. Download language files from github.com/tesseract-ocr/tessdata and place them in a tessdata\ folder next to the executable (e.g. tessdata\eng.traineddata). Additional languages work the same way.
Captures saved by OCR Element + Save go to Desktop\Apex_Captures\.
The AI command set is backed by MtmdHelper using LLamaSharp's multimodal (MTMD) API. Supports vision and audio modalities depending on the model. Every inference call is fully stateless — no chat history is retained between calls.
Download a vision-capable GGUF model and its multimodal projector (e.g. LFM2.5-VL from LM Studio) and note the paths to both .gguf files, or click Download All on the Model tab. Then call ai init before any inference commands.
ApexComputerUse/
├── Program.cs — Entry point (`--service`, `--port`, `--pipe`, `--client`)
├── appsettings.json — Deployment defaults (Http/pipe/log/shell/test-runner)
├── ai-settings.json — AI provider credentials/settings
├── AI/
│ ├── AiChatService.cs — Provider-agnostic chat service (streaming + session state)
│ ├── AIDrawingCommand.cs — GDI+ drawing engine (`/draw`, overlays, built-in demo scene)
│ ├── MtmdHelper.cs — Local multimodal model wrapper (LLamaSharp MTMD)
│ ├── MtmdInteractiveModeExecute.cs — Interactive AI computer-use mode
│ └── SceneChatAgent.cs — Scene-oriented assistant logic
├── Automation/
│ ├── FlaUIHelper*.cs — UIA wrappers (find, actions, capture, text, keyboard, scrolling)
│ ├── ElementIdGenerator.cs — Stable hash-based element/window IDs
│ └── UiMapRenderer.cs — Colour-coded tree renderer to PNG/overlay
├── Commands/
│ ├── CommandProcessor*.cs — Core command handlers (find/exec/ocr/capture/ai/scenes/help)
│ ├── CommandLineParser.cs — cmd.exe command parsing
│ ├── CommandRequest.cs — Normalized command DTO
│ └── CommandRequestJsonMapper.cs — HTTP JSON/query mapping helpers
├── Servers/
│ ├── HttpCommandServer*.cs — HTTP API + chat/page/scene/system route handlers
│ ├── FormatAdapter.cs — Response negotiation (HTML/JSON/text/PDF; includes `PdfWriter`)
│ ├── PipeCommandServer.cs — Named-pipe server
│ └── TelegramController.cs — Telegram command surface
├── Scenes/
│ ├── Scene.cs — Scene/layer/shape models with stable IDs
│ └── SceneStore.cs — Thread-safe scene store (`%LOCALAPPDATA%\ApexComputerUse\scenes`)
├── Clients/
│ ├── RemoteClient.cs — Remote endpoint metadata
│ ├── ClientPermissions.cs — Per-client endpoint permission gates
│ └── ClientStore.cs — Persistent client registry (`%LOCALAPPDATA%\ApexComputerUse\clients`)
├── Infrastructure/
│ ├── AppConfig.cs / AppSettings.cs — Config layering (`appsettings.json` + `APEX_*` + user prefs)
│ ├── AppLog.cs — Serilog bootstrap/log sink wiring
│ ├── OcrHelper.cs — Tesseract OCR wrapper
│ ├── DownloadManager.cs — Model/OCR asset download support
│ └── ApexService.cs — Windows Service host
└── UI/
├── Form1.cs / Form1.Designer.cs — Main WinForms host
├── ServerTabController.cs — HTTP/pipe/server lifecycle controls
├── ChatTabController.cs — Embedded `/chat` WebView + provider controls
├── ModelTabController.cs — Model/asset management
├── ClientsTabController.cs — Multi-endpoint registry UI
├── SceneEditorForm.cs / .Designer.cs — WinForms scene editor
└── ClientEditForm.cs / .Designer.cs — Client create/edit dialog
Scripts/ — `ApexComputerUse.psm1` (pipe module) and `apex.cmd` (HTTP helper)
restart-apex.bat / restart-apex.ps1 — Restart helpers for local development
AIClients/ — AI messaging libraries and harness projects
TestApplications/ — WPF/WinForms/Web test apps and TestRunner
OCR: place Tesseract language files in a `tessdata\` folder next to the executable. They are not included in the repo; download them from github.com/tesseract-ocr/tessdata.
# Restore and build (Release)
dotnet build -c Release ApexComputerUse/ApexComputerUse.csproj
# Run from source
dotnet run --project ApexComputerUse/ApexComputerUse.csproj

Requires the .NET 10 SDK and the Windows Desktop workload (`dotnet workload install windows`).

dotnet test ApexComputerUse.Tests/ApexComputerUse.Tests.csproj

The test suite covers the pure-logic and data-model layers (everything that can be tested without a live desktop session):
| Test file | Coverage area |
|---|---|
| `ElementIdGeneratorTests.cs` | Hash mode, incremental mode, reset, thread safety |
| `SceneStoreTests.cs` | CRUD, disk persistence, concurrent creates |
| `SceneModelTests.cs` | FlattenForRender, ZIndex ordering, opacity, SceneIds |
| `AIDrawingCommandTests.cs` | JSON parsing, canvas backgrounds, all 8 shape types |
| `TelegramParseCommandTests.cs` | Command + key-value parser, `DictExtensions.Get` |
| `PipeCommandServerTests.cs` | Named-pipe JSON protocol parser |
| `LevenshteinTests.cs` | Edit-distance boundary and domain cases |
| `CommandResponseTests.cs` | ToText / ToJson serialisation |
| `OcrHelperTests.cs` | CropBitmap region logic, `OcrResult.ToString` |
Components that require an active Windows session (FlaUI UIA, Tesseract, LLamaSharp, WinForms UI) are covered by the existing integration script Scripts/test_controls.py and manual testing.
TestApplications/TestRunner/ is a cycle-based orchestrator that launches the WinForms, WPF, and web test apps, runs the full suite against the live HTTP API, and reports results. Use it whenever changes touch CommandProcessor, FlaUIHelper, or HttpCommandServer.
# Demo mode — human-readable output, 3 cycles
dotnet run --project TestApplications/TestRunner -- --mode demo
# Benchmark mode — JSON-line output, 25 cycles
dotnet run --project TestApplications/TestRunner -- --mode benchmark

Test apps:
- WinForms (`TortureTestForm.cs`): textbox, button, checkbox, radio, combo, listbox, slider, menu, grid
- WPF (`TortureTestWindow.xaml`): same controls plus Expander, ViewModel-driven state
- Web (`index.html`): menu, tabs, form controls, scrollable regions
The runner interacts exclusively through the HTTP API, so a failed assertion is reported as the exact curl call that failed. The same suite can also be triggered remotely via POST /run-tests.
All notable changes to ApexComputerUse are documented in this file.
- New `Infrastructure/WindowMonitor.cs`: owns a dedicated background STA thread (UIA3 is COM apartment-affine; thread-pool timers can't safely call into it) that polls the desktop once per second, diffs the window set against the previous snapshot, and fires events:
  - `WindowsChanged(IReadOnlyList<WindowSnapshot>)`: fires whenever any window opens, closes, or changes title
  - `WindowClosed(IntPtr hwnd)`: per-HWND closure event, used for cache invalidation
- `WindowSnapshot` record: `(Hwnd, ProcessId, Title, ElementId)`. `ElementId` is generated with `excludeName: true` so the title can change without rotating the ID.
- Auto-starts on `Form1.Load` and is stopped/disposed on `OnFormClosed`.
- Tools menu items: Start/Stop Window Monitoring, Watch Elements (slow), Watch Top Window Only, Set Element Window Filter… (substring match against window titles, settable via the inline dialog or programmatically by AI code).
- `WatchElements` property: when on, each poll also scans every monitored window's UIA descendants and fires `WindowElementsChanged(window, added, removed)` with a per-window add/remove diff. Off-screen elements are skipped. Disabled by default (slow).
- `TopWindowOnly` (P/Invoke `GetForegroundWindow`) and `ElementWindowFilter` (case-insensitive title contains) narrow the element-scan set so it stays tractable.
- Per-window state is dropped automatically when a window closes; the first scan of a newly-discovered window establishes a baseline (no event), so opens don't dump every control as "added".
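The snapshot diff described above (opens, closes, and title changes keyed by HWND) can be sketched in Python. This is an illustration of the idea only, not the C# implementation: the `diff_windows` function and the lower-case field names are hypothetical, and `ElementId` is omitted for brevity.

```python
from typing import List, NamedTuple, Tuple


class WindowSnapshot(NamedTuple):
    hwnd: int
    process_id: int
    title: str


def diff_windows(
    previous: List[WindowSnapshot], current: List[WindowSnapshot]
) -> Tuple[List[WindowSnapshot], List[int], List[Tuple[int, str, str]]]:
    """Compare two polls and return (opened, closed_hwnds, renamed)."""
    prev = {w.hwnd: w for w in previous}
    curr = {w.hwnd: w for w in current}
    opened = [w for h, w in curr.items() if h not in prev]
    closed = [h for h in prev if h not in curr]
    renamed = [
        (h, prev[h].title, w.title)
        for h, w in curr.items()
        if h in prev and prev[h].title != w.title
    ]
    return opened, closed, renamed


before = [WindowSnapshot(100, 1, "Untitled - Notepad"), WindowSnapshot(200, 2, "Calculator")]
after = [WindowSnapshot(100, 1, "notes.txt - Notepad"), WindowSnapshot(300, 3, "Paint")]
opened, closed, renamed = diff_windows(before, after)
print("opened:", opened)
print("closed:", closed)
print("renamed:", renamed)
```

Keying on the window handle rather than the title is what lets a rename show up as a change to an existing window instead of a close followed by an open.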
- New `CommandProcessor.InvalidateClosedWindow(IntPtr hwnd)` (in `CommandProcessor.Windows.cs`), wired from `WindowMonitor.WindowClosed`. Takes `_stateLock`, prunes `_windowMap` entries with the matching HWND, sweeps `_elementMap` for now-invalid `AutomationElement` entries via the existing `IsElementValid` static, removes them from all parallel maps (`_elementHashes`, `_elementReverse`, `_elementParents`, `_elementDescriptors`), clears `_currentElement` / `CurrentWindow` if they went stale, and clears `_mappedWindowHandle` if it matched. `_elementReverse.Remove` is wrapped in try/catch + `LogSwallowed` because `Dictionary<AutomationElement,_>.Remove` calls UIA's `CompareElements` and can throw `COMException` on stale proxies.
- Verified end-to-end: `/find` Notepad → close Notepad → wait one poll cycle → `/status` reports `Window: (none)` and Notepad is gone from `/windows`.
- `WindowMonitor` carries a thread-safe `ConcurrentQueue<MonitorLogEntry>` (default cap 500, FIFO eviction) of recent activity: opens, closes, renames, element add/remove, and internal poll errors.
- `AppendLog`, `GetLog`, `ClearLog`, plus an `IsRunning` lifecycle property.
- New HTTP routes (named `/winmon/...` to avoid collision with the existing `/monitor/{id}` RegionMonitor namespace):
  - `GET /winmon/log` → `{ count, running, entries: [...] }` (append `.json` for raw JSON)
  - `POST /winmon/clear` → `{ cleared: N }`
- Both routes are gated by `AllowDiagnostics`; loopback callers always pass.
- 13 new unit tests (`WindowMonitorTests`, `CommandProcessorInvalidateTests`) covering the diff logic, log buffer, FIFO eviction, lifecycle, the new properties, and the safe paths of `InvalidateClosedWindow`. UIA-dependent paths (live element pruning, descendant scan) verified manually via the running app.
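A capped log with FIFO eviction, as described above, can be sketched in a few lines of Python. The class and method names here are hypothetical stand-ins for `AppendLog` / `GetLog` / `ClearLog`, and the sketch uses `collections.deque` rather than the `ConcurrentQueue` the C# code is said to use.

```python
from collections import deque
from typing import Deque, List


class MonitorLog:
    """Bounded activity log: once full, the oldest entry is evicted first."""

    def __init__(self, capacity: int = 500) -> None:
        # A deque with maxlen silently drops items from the left when full.
        self._entries: Deque[str] = deque(maxlen=capacity)

    def append(self, message: str) -> None:
        self._entries.append(message)

    def snapshot(self) -> List[str]:
        return list(self._entries)

    def clear(self) -> int:
        """Empty the log and return how many entries were removed."""
        removed = len(self._entries)
        self._entries.clear()
        return removed


log = MonitorLog(capacity=3)
for event in ["open A", "open B", "rename A", "close B"]:
    log.append(event)
print(log.snapshot())
print("cleared:", log.clear())
```

Returning the removed count from `clear` mirrors the `{ cleared: N }` shape of the `POST /winmon/clear` response.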
- New `ElementAnnotation` model and `ElementAnnotationStore`: per-element notes and exclusion flags keyed by stable element hash, persisted at `%LOCALAPPDATA%\ApexComputerUse\annotations\elements.json`. Empty records are auto-GC'd.
- New verbs in `CommandProcessor.Annotations.cs`: `annotate`, `unannotate`, `exclude`, `unexclude`, `annotations`, `excluded`
- New HTTP routes: `POST /annotate`, `POST /unannotate`, `POST /exclude`, `POST /unexclude`, `GET /annotations`, `GET /excluded`
- Notes appear as a `note` field on `/elements` output; excluded subtrees are skipped during scan (the root is never excluded, via a `depth > 0` guard)
- New query param `?unfiltered=true` on `/elements` bypasses the exclusion filter
- 7 new unit tests in `ElementAnnotationStoreTests.cs`
- New `RegionMap` model and `RegionMapStore`: persistent named pixel-coordinate grids tied to a window or stable element hash. One file per map under `<exe>/regionmaps/{id}.json`
- Built for AI self-calibration loops on canvas-rendered content (board games, emulators, video timelines) where individual cells are not UIA elements
- Static helpers: `CellToPixel(map, row, col)` returns the cell center; `BuildGridDrawRequest(...)` produces a re-usable draw request for both overlay and render paths
- New verbs in `CommandProcessor.RegionMaps.cs`: `regionmap` umbrella with sub-actions `list|get|delete|overlay|render|cell`
- New HTTP routes in `HttpCommandServer.AnnotationRoutes.cs`:
  - `GET|POST /regionmap`: list/create
  - `GET|PUT|PATCH|DELETE /regionmap/{id}`: per-map ops
  - `POST /regionmap/{id}/overlay`: click-through screen overlay
  - `POST /regionmap/{id}/render`: base64 PNG of screen (or current window) with grid drawn over it; supports `{"canvas":"screen"}` (default) or `{"canvas":"window"}` (auto-translates grid coords to window-local)
  - `POST /regionmap/{id}/cell`: `{row, col}` → `{x, y}` for `click-at`
- 10 new unit tests in `RegionMapStoreTests.cs` (incl. corner-case cell-coord math)
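The cell-center lookup that `CellToPixel` is described as performing can be illustrated with a small Python sketch. The actual `RegionMap` schema is not shown in this changelog, so the `GridMap` fields below (origin, uniform cell size, row and column counts) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GridMap:
    # Hypothetical fields: top-left origin in pixels plus a uniform cell size.
    x: int
    y: int
    cell_width: int
    cell_height: int
    rows: int
    cols: int


def cell_to_pixel(grid: GridMap, row: int, col: int) -> Tuple[int, int]:
    """Return the pixel at the center of the given zero-based cell."""
    if not (0 <= row < grid.rows and 0 <= col < grid.cols):
        raise IndexError(f"cell ({row}, {col}) is outside the grid")
    center_x = grid.x + col * grid.cell_width + grid.cell_width // 2
    center_y = grid.y + row * grid.cell_height + grid.cell_height // 2
    return center_x, center_y


board = GridMap(x=100, y=200, cell_width=50, cell_height=40, rows=8, cols=8)
print(cell_to_pixel(board, 0, 0))
print(cell_to_pixel(board, 2, 3))
```

Targeting the cell center rather than a corner keeps a click inside the cell even when the calibrated grid is off by a few pixels.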
- New `RegionMonitor` model and `RegionMonitorStore`: persistent per-region screen-change watchers, one file per monitor under `%LOCALAPPDATA%\ApexComputerUse\monitors\{id}.json`. Each monitor holds an array of `MonitorRegion` so one logical "watch" can cover multiple indicators (LEDs, status icons, etc.) with independent diffs.
- New `RegionMonitorRunner`: background dispatcher; one Task per enabled monitor; per-region capture → diff vs previous → fire SSE event when over threshold. First tick is the baseline (no fire). Disabled monitors are not polled. Region-count changes are handled at runtime. Diff via `LockBits + Marshal.Copy`: a pixel counts as "changed" when its maximum per-channel difference exceeds the tolerance.
- New verbs in `CommandProcessor.Monitors.cs`: `monitor` umbrella with sub-actions `list|get|delete|start|stop|check`.
- New HTTP routes in `HttpCommandServer.MonitorRoutes.cs`:
  - `GET|POST /monitor`: list/create
  - `GET|PUT|DELETE /monitor/{id}`: per-monitor CRUD
  - `POST /monitor/{id}/start` / `/stop`: toggle enabled
  - `POST /monitor/{id}/check`: manual one-shot diff vs current baselines
  - `POST /monitor/{id}/snapshot?index=N`: base64 PNG of region N right now
- Notifications via the existing `/events` SSE stream as `monitor.fired` events: `{monitorId, name, regionIndex, label, x, y, width, height, percentDiff, threshold, seq, time}`.
- Defaults: `intervalMs=1000` (floor 100ms), `thresholdPct=5.0`, `tolerance=8`, `enabled=false`.
- Last-fire telemetry persisted on the monitor: `lastFiredUtc`, `lastPercentDiff`, `lastRegionIndex`, `hitCount`.
- 11 new unit tests in `RegionMonitorStoreTests.cs` covering CRUD, telemetry, persistence, and diff math.
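The diff rule above (a pixel is "changed" when its largest per-channel difference exceeds the tolerance, and the monitor fires when the changed percentage is over the threshold) can be sketched in Python. This is an illustration over plain RGB tuples, not the `LockBits`-based C# code; the function names are hypothetical, and treating "over threshold" as strictly greater-than is an assumption.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]


def percent_changed(before: Sequence[Pixel], after: Sequence[Pixel], tolerance: int = 8) -> float:
    """Percentage of pixels whose max channel difference exceeds tolerance."""
    if len(before) != len(after):
        raise ValueError("captures must have the same number of pixels")
    if not before:
        return 0.0
    changed = sum(
        1
        for old, new in zip(before, after)
        if max(abs(a - b) for a, b in zip(old, new)) > tolerance
    )
    return 100.0 * changed / len(before)


def should_fire(percent_diff: float, threshold_pct: float = 5.0) -> bool:
    # Assumption: "over threshold" means strictly greater than.
    return percent_diff > threshold_pct


baseline = [(0, 0, 0)] * 100
# 10 pixels change clearly; 5 more drift by exactly the tolerance and are ignored.
current = [(0, 20, 0)] * 10 + [(8, 0, 0)] * 5 + [(0, 0, 0)] * 85
diff = percent_changed(baseline, current)
print(diff, should_fire(diff))
```

The tolerance absorbs small rendering noise such as anti-aliasing drift, so only visibly different pixels count toward the percentage.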
- `EventEnvelope` reshaped: `int? WindowId` plus `IReadOnlyDictionary<string, object?> Data` replace the fixed `WindowId` / `Title` fields. Window events still carry `id` / `title` inside `Data`; non-window subsystems attach arbitrary payloads.
- New public `EventBroker.Emit(string type, IDictionary<string, object?> data, int? windowId = null)` for non-window emitters (region monitors today, anything else later).
- SSE serializer in `HttpCommandServer.Events.cs` now flattens `Data` into the frame payload alongside `seq` / `time`, so both event families render uniformly.
- `JsonElementExtensions.Dbl(name)` helper added for parsing `thresholdPct`.
- New optional setting `PublicHelpPage` (default `false`): when on, `GET /help` is reachable without an API key
- New setting `PublicHelpRateLimit` (default `30` req/min/IP): sliding 60-second per-IP window protects the unauthenticated route
- Returns HTTP 429 with `Retry-After: 60` when the limit is exceeded
- Loopback callers and API-keyed callers always have full access (never rate-limited)
- New `RuntimeFlags` static: mutable mirror of `AppConfig` values seeded at startup, so GUI changes take effect without a restart
- GUI controls in Remote Control tab: `chkPublicHelp` checkbox + `numHelpRateLimit` numeric input. Persisted in `%APPDATA%\ApexComputerUse\settings.json` alongside other user prefs.
- `appsettings.json` keys: `PublicHelpPage`, `PublicHelpRateLimit`. Env: `APEX_PUBLIC_HELP_PAGE`, `APEX_PUBLIC_HELP_RATE_LIMIT`.
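A sliding per-IP window like the one described for the public help page can be sketched in Python as follows. The class name and method are hypothetical, and timestamps are passed in explicitly so the behavior is easy to reason about; the C# implementation may differ in detail.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within any `window_seconds` span."""

    def __init__(self, limit: int = 30, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would answer HTTP 429 with Retry-After
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window_seconds=60.0)
for t in (0.0, 1.0, 2.0, 3.0, 61.0):
    print(t, limiter.allow("203.0.113.7", now=t))
```

Unlike a fixed one-minute bucket, the sliding window cannot be gamed by bursting at the boundary between two buckets, because each request is judged against the preceding 60 seconds.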
- Remote Control tab cleaned up: `lblTelegramStatus` moved from (8, 168), where it overlapped the new public-help checkbox, to (465, 104) on the bot-token row. `btnStartTelegram` shrunk from 120 to 100 wide to make room.
- Added tooltips to all interactive controls in Remote Control tab (HTTP port/start, API key, Copy, bot token, Start Telegram, allowed chat IDs, public help, rate limit, pipe name, Start Pipe, status labels).
- New `LICENSE`: PolyForm Noncommercial 1.0.0. Source-available; commercial use requires a separate license.
- New `THIRD_PARTY_NOTICES.md`: license attributions for all 9 NuGet dependencies (FlaUI, Serilog, LLamaSharp, Telegram.Bot, Tesseract, etc.). MIT and Apache 2.0 obligations met inline.
- Merged `ACU_AI_CONTROL_GUIDE.md` into `ACU_CONTROL_GUIDE.md` (deleted the former), leaving a single comprehensive guide. Added "Rules of Thumb" section + Annotations + Region Maps coverage.
- Slimmed `ACU_SYSTEM_PROMPT.md` from 9 KB → 3.8 KB. Now points at the auto-generated `/help` page for endpoint reference instead of duplicating tables that drift. Retains auth, mental model, 10 critical rules, minimal control loop.
- Updated `ACU_OPERATIONAL_REFERENCE.md` for staleness: added `/winrun`, annotations, region maps, `note` field, `?unfiltered`, `PublicHelpPage` / `PublicHelpRateLimit` config keys.
- `CommandProcessor.ScanElementsIntoMap` now consults `ElementAnnotationStore` to skip excluded subtrees and attach notes during scan; existing `/elements` callers see no behavioral change unless annotations exist.
- `RegionMap.canvas:"window"` mode correctly translates screen-absolute grid coords into window-local space before drawing, so the grid lines up with the captured window image.
- Added `--port` command-line argument to override the HTTP listen port when running multiple instances
- Added `--pipe` command-line argument to override the named-pipe name
- Added `--client` command-line argument to mark an instance as a subordinate client (disables Launch Instance button)
- Port auto-increment in `HttpCommandServer.Start()`: automatically tries the next available port if the preferred port is taken
- New buttons in Clients tab: "Open Web UI" (launches `/chat` page in default browser) and "Launch Instance" (spawns new instance with incremented port)
- `ClientsTabController.LaunchInstance()` auto-registers the spawned instance in the client list
- New `ClientPermissions` class with per-client flags: `AllowAutomation`, `AllowCapture`, `AllowAi`, `AllowScenes`, `AllowShellRun`, `AllowClients`
- Permissions stored in JSON alongside each `RemoteClient` and loaded on reconnect
- Permission enforcement in `HttpCommandServer`: loopback (127.0.0.1) always gets full access; registered clients get their stored permissions; unknown IPs get full access
- All endpoints gated by appropriate permission: `/run` requires `AllowShellRun`, `/capture` / `/ocr` require `AllowCapture`, `/ai` / `/chat` require `AllowAi`, `/scenes` / `/editor` require `AllowScenes`, `/clients` requires `AllowClients`, everything else requires `AllowAutomation`
- `ClientEditForm` redesigned with two tabs: "Connection" (existing fields) and "Permissions" (6 checkboxes with ShellRun/Clients highlighted in orange)
- `ClientStore.FindByHost(string host)`: case-insensitive lookup by hostname
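The endpoint-to-permission mapping listed above can be expressed as a small lookup, sketched here in Python. The route table is taken from the list; the function names, and the choice to match on the first path segment, are assumptions for illustration rather than a description of how `HttpCommandServer` routes requests.

```python
from typing import Dict, Optional

# Route prefixes and flags as listed in the changelog entry above.
PERMISSION_BY_ROUTE: Dict[str, str] = {
    "/run": "AllowShellRun",
    "/capture": "AllowCapture",
    "/ocr": "AllowCapture",
    "/ai": "AllowAi",
    "/chat": "AllowAi",
    "/scenes": "AllowScenes",
    "/editor": "AllowScenes",
    "/clients": "AllowClients",
}


def required_permission(path: str) -> str:
    """Map a request path to its permission flag (first-segment match is an assumption)."""
    route = path.split("?", 1)[0]
    first_segment = "/" + route.strip("/").split("/", 1)[0]
    return PERMISSION_BY_ROUTE.get(first_segment, "AllowAutomation")


def is_allowed(path: str, client_flags: Optional[Dict[str, bool]]) -> bool:
    """client_flags is None for loopback and unknown callers, which get full access."""
    if client_flags is None:
        return True
    return client_flags.get(required_permission(path), False)


print(required_permission("/ai/describe"))
print(required_permission("/find"))
print(is_allowed("/run", {"AllowAutomation": True}))
```

Defaulting unlisted routes to `AllowAutomation` matches the "everything else" rule, so a newly added endpoint is gated rather than left open to registered clients.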
- `AiChatService.SetLocalServer(int port, string? apiKey)` and `ClearLocalServer()`: configure local HTTP server context for AI chat
- Agentic tool loop in `AiChatService.SendAsync()`: AI can issue ApexComputerUse API calls via `apex` code blocks
- System prompt auto-extended with API reference when server context is set, including endpoint list and example calls
- Loop runs for up to 8 turns, executing calls and feeding results back until the AI produces a clean answer
- `ServerTabController.ToggleHttp()` calls `SetLocalServer()` on start and `ClearLocalServer()` on stop
- Parsing and system prompt generation exposed as `internal` for testing
- Timing-safe API key comparison using `CryptographicOperations.FixedTimeEquals()` (replaced three separate `==` comparisons)
- Shell command execution in `/run` now uses `ProcessStartInfo.ArgumentList` instead of string concatenation to prevent injection
- `HttpCommandServer.Stop()` now explicitly closes `HttpListener` to immediately release port handles
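For clients or test harnesses written in Python, the same timing-safe idea is available in the standard library. The sketch below is a Python analogue of the `FixedTimeEquals` change, not the project's code; `api_key_matches` is a hypothetical helper name.

```python
import hmac


def api_key_matches(supplied: str, expected: str) -> bool:
    """Compare two API keys without leaking where they first differ.

    A plain == can return as soon as one character mismatches, which lets an
    attacker infer the key prefix from response timing. compare_digest takes
    time independent of the position of the first difference.
    """
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))


print(api_key_matches("s3cret-key", "s3cret-key"))
print(api_key_matches("s3cret-kez", "s3cret-key"))
```

Encoding both sides to bytes first lets the comparison accept keys containing non-ASCII characters, which `compare_digest` rejects when given `str` arguments.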
- Fixed `MtmdInteractiveModeExecute` infinite loop with hardcoded test path; replaced with a proper `Console.ReadLine()` loop
- Fixed `CommandProcessor` element ID lookup to use `Equals()` instead of `ReferenceEquals()` (FlaUI uses `IUIAutomation.CompareElements`)
- Added 50k-entry cap on `CommandProcessor._elementMap` to prevent unbounded growth during long sessions
- Fixed `Form1.SetupNetshIfNeeded()` blocking the UI thread; made async with proper timeout
- Fixed `Form1.AutoLoadModelIfConfigured()` fire-and-forget; now logs async exceptions via `.ContinueWith()`
- `SceneEditorForm` canvas paint optimization: eliminated per-paint full-scene bitmap allocation during drag
- `Program.IsClientInstance`: public static property detecting the `--client` flag for UI gating
- Command-line arg parsing restructured to support flag-only arguments alongside key-value pairs
- `HttpCommandServer` constructor now accepts optional `ClientStore? clientStore` parameter
- `HttpCommandServer.Port { get; private set; }`: made settable internally by `Start()` for auto-increment
- `RemoteClient.Permissions`: new property with `ClientPermissions` value
- `Form1.Designer.cs`: added "Open Web UI" and "Launch Instance" buttons to Clients tab
- `ClientEditForm.Designer.cs`: complete redesign with TabControl (Connection / Permissions tabs)
- `ClientsTabController` constructor signature expanded with button references and port getter
- New test file `ApexComputerUse.Tests/AiChatServiceTests.cs` with 22 tests covering apex call parsing and system prompt generation
- `ParseApexCalls` and `BuildApexSystemPrompt` exposed as `internal` via existing `InternalsVisibleTo` attribute
- All 171 tests passing (149 existing + 22 new)
- AI tool-use loop is non-streaming (full response assembled before delivery)
- IP-spoofing could bypass permission sandboxing on local network
- Clients tab (remote machine registry): a new "Clients" tab (sixth tab in the main UI) lets users and AI maintain a persistent directory of other Apex-enabled machines. Each entry stores a friendly name, host/IP, port, API key, OS version, and description. Entries are listed in a six-column `ListView` and persisted to `<exe>/clients/{id}.json` using the same thread-safe JSON store pattern as scenes.
- `ClientStore` (`Clients/ClientStore.cs`): thread-safe store that loads all client records from disk on startup and writes individual JSON files on every create, update, or delete.
- `RemoteClient` (`Clients/RemoteClient.cs`): data model with `[JsonPropertyName]` attributes matching the project's snake_case serialization convention.
- `ClientsTabController` (`UI/ClientsTabController.cs`): tab logic wired to Add, Edit, Remove, and Test buttons. Test Connection fires an async `GET /ping` against the selected client's host:port (with its API key if set) and updates a live Status column green/red in place, with no UI blocking.
- `ClientEditForm` (`UI/ClientEditForm.cs` / `ClientEditForm.Designer.cs`): fixed-size dialog for creating and editing client entries, with name/host required-field validation and port range validation.
- Embedded HTML chat in the Chat tab: the Chat tab's RichTextBox, input field, and Send button have been replaced by an embedded `Microsoft.Web.WebView2` control hosting the existing `/chat` streaming page directly inside the app. Click Load Chat to navigate the WebView2 to `http://localhost:{port}/chat?apiKey=...`. The HTML page handles streaming, the "New chat" reset, and provider/model status display natively.
- HTTP server auto-start on launch: `HttpAutoStart` and `HttpBindAll` are now `true` by default in `appsettings.json`. The HTTP server starts and binds to all interfaces automatically when the app opens; no manual click on the Remote Control tab is required.
- Model auto-load on launch: if model and projector paths are saved in `settings.json`, the local vision model is loaded automatically at startup without opening the Model tab.
- First-run netsh setup: on the very first launch, the app checks whether the HTTP URL ACL (`http://+:8081/`) and the Windows Firewall inbound rule (`ApexComputerUse`) exist. If either is missing, a single elevated `cmd` session (one UAC prompt) runs both `netsh` commands. The result is persisted to `settings.json` (`NetshConfigured = true`) so the check never repeats.
- Restart scripts: `restart-apex.bat` and `restart-apex.ps1` at the repo root kill all running instances (`taskkill /F /IM ApexComputerUse.exe`) and relaunch the app. Both prefer the Release build, fall back to Debug, and fall back to `dotnet run` if no built exe is found.
- `ChatTabController`: removed `_rtbChatHistory`, `_txtChatInput`, `_btnChatSend`, `AppendToChat`, `AppendColoredText`, `SendOrCancelAsync`, `ExecuteCommandsFromResponse`, and `CurlRx`. Constructor now accepts a `WebView2` instead. `OpenChat()` navigates the embedded WebView2; `ResetChat()` calls `Reload()`.
- `AppSettings`: added `NetshConfigured` bool field (persisted to `%APPDATA%\ApexComputerUse\settings.json`) for first-run netsh tracking.
- `/elements?match=<text>`: case-insensitive substring search across `Name`, `AutomationId`, and `Value` pattern. Returns only branches containing matches, each wrapped in its ancestor path (non-matching siblings pruned). `depth` now controls how deep to render under each match, so one call replaces the repeated drill-down pattern of "fetch tree → spot candidate → fetch subtree". Composes with `type=` and `onscreen=true`.
- `/elements?collapseChains=true`: folds "1-in-1-in-1" wrapper chains that dominate web accessibility trees. A node is skipped only when it has exactly one child, no `Name`, no `AutomationId`, and its control type is `Pane`, `Group`, or `Custom`. Named containers and anything with an AutomationId are preserved. IDs of hoisted descendants are unchanged, so follow-up `/elements?id=<id>` and `/execute id=<id>` calls continue to work against the real (unflattened) tree.
- `/elements?includePath=true`: every emitted node gains a `path` breadcrumb string (e.g. `"Chrome > Document > Main > Form"`) so an agent can orient itself without climbing back up the tree.
- `/elements?properties=extra`: opt-in per-node `value` (via Value pattern, when the element supports it) and `helpText` properties. Off by default so token budgets don't change silently; needed for web inputs whose `Name` is empty and whose visible content lives in the Value pattern.
- `descendantCount` on truncated nodes: nodes cut off by `depth` now emit `descendantCount: N` alongside the existing `childCount`, so an agent can decide whether a subtree is worth expanding without another round trip.
- Structured `/find` response: `/find` now populates a JSON `element` object on the response (`id`, `controlType`, `name`, `automationId`, `className`, `frameworkId`, `isEnabled`, `isOffscreen`, `boundingRectangle`, plus `value` / `helpText` when `properties=extra`) alongside the existing human-readable string in `message`. The element's numeric ID is recovered from the most recent `/elements` scan when available.
- Tree-shape unit tests (`ApexComputerUse.Tests/CommandProcessorTreeTests.cs`): cover `FilterTreeByMatch` (case-insensitive, AutomationId + Value lookup, sibling pruning), `CollapseSingleChildChains` (identity-less-only collapse, multi-child preservation, ID stability), and `ElementNode` JSON round-trip for the new opt-in fields.
- `CommandProcessor.ElementNode` / `BoundingRect` promoted from `private` to `internal sealed class` so the new in-process post-processors (`FilterTreeByMatch`, `CollapseSingleChildChains`) and the test project (`InternalsVisibleTo`) can exercise them directly.
- `ScanElementsIntoMap` now accepts a `ScanOptions` struct (IncludePath + IncludeExtra + depth) and threads the parent breadcrumb through recursion without changing call-site signatures for existing endpoints.
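The `collapseChains` rule stated above (skip a node only when it has exactly one child, no name, no AutomationId, and a control type of `Pane`, `Group`, or `Custom`) can be sketched in Python over the nested JSON shape that `/elements` returns. This is an illustration, not the C# `CollapseSingleChildChains` implementation; keeping the root node unconditionally is an assumption of the sketch.

```python
from typing import Any, Dict

Node = Dict[str, Any]
COLLAPSIBLE_TYPES = {"Pane", "Group", "Custom"}


def collapse_chains(node: Node, is_root: bool = True) -> Node:
    """Hoist the only child of anonymous wrapper nodes; IDs are left untouched."""
    children = [collapse_chains(child, is_root=False) for child in node.get("children", [])]
    collapsed = {**node, "children": children}
    is_wrapper = (
        len(children) == 1
        and not node.get("name")
        and not node.get("automationId")
        and node.get("controlType") in COLLAPSIBLE_TYPES
    )
    # Assumption: the root is always kept so the result still starts at the window.
    if is_wrapper and not is_root:
        return children[0]
    return collapsed


tree = {
    "id": 1, "controlType": "Window", "name": "Chrome", "children": [
        {"id": 2, "controlType": "Pane", "name": "", "children": [
            {"id": 3, "controlType": "Group", "name": "", "children": [
                {"id": 7, "controlType": "Button", "name": "OK", "children": []},
            ]},
        ]},
        {"id": 4, "controlType": "Pane", "name": "Toolbar", "children": [
            {"id": 5, "controlType": "Button", "name": "Back", "children": []},
        ]},
    ],
}
flat = collapse_chains(tree)
print([(c["id"], c["controlType"]) for c in flat["children"]])
```

Because the hoisted button keeps its original `id`, a later call that targets that ID still resolves against the real tree, which is the property the changelog entry calls out.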
- AI Chat window: Tools → AI Chat opens a standalone chat interface powered by the `AiMessagingCore` library. Supports 8 providers: OpenAI, Anthropic, DeepSeek, Grok, Groq, Duck, LM Studio, and LlamaSharp (local GGUF). Streams tokens in real time; shows timing metrics (total tokens, tokens/second, time-to-first-token). Provider, model, system prompt, and sample query are persisted to `ai-settings.json` next to the executable.
- `AIClients` solution integrated: both `AiMessagingCore` (class library) and `AIClients` (standalone WinForms harness) are now included in `ApexComputerUse.sln` for single-solution editing. `AIClients.sln` and `AIClients.exe` remain fully independent and buildable on their own.
- `ai-settings.json`: starter settings file (copied to output on build) with placeholder API keys for all 8 providers. Replace placeholders with real keys to activate each provider.
- `ProviderSettings.ApiKey` and `AiLibrarySettings.DefaultProvider` changed from `init`-only to `set` so runtime configuration updates (provider switch, API key override) can be applied without reconstructing the settings objects.
- `HandleChatStatus` in `HttpCommandServer` now returns `Dictionary<string, string>` matching the `ApexResult.Data` contract; `sessionActive` is serialized as `"True"` / `"False"`.
- `capture` command: returns screen captures inline as base64 PNG in the `data` response field. No file is written to disk. Four targets via `action=`:
  - `screen`: full display
  - `window`: current window (requires prior `find`)
  - `element` (default): current element (requires prior `find`)
  - `elements value=id1,id2,...`: multiple elements by numeric ID, stitched vertically into one image
- HTTP: `POST /capture`
- Named pipe / PowerShell: `command=capture`; new `Invoke-FlaUICapture` cmdlet in `ApexComputerUse.psm1`
- cmd.exe: `apex capture [action=...] [value=...]` in `apex.cmd`
- Telegram: `/capture`, with the response delivered as a photo message, not text
- Persistent element ID map: the `elements` command now recursively scans the UI tree using `ElementIdGenerator` (SHA-256 hash-based, deterministic across sessions). Each element receives a stable numeric ID that survives app restarts.
- Nested JSON element map output: `elements` returns the full window tree as indented, nested JSON (`id`, `controlType`, `name`, `automationId`, `children`), replacing the flat string list.
- Window map with persistent IDs: the `windows` command now returns a JSON array of `{id, title}` pairs. IDs are hash-based and stable for the same window across sessions.
- Map-based lookup in `find`: pass a numeric ID from either `windows` or `elements` as the `window=` or `id=` parameter; the element is resolved directly from the in-memory map without a fuzzy search.
- Auto-focus on every `find`: the matched window is brought into foreground focus automatically; no separate `focus` action required.
- "Output UI Map" menu item: Tools menu item captures the UI tree of the currently selected window and prints the nested JSON to the log.
- Full `ElementOperations` parity. All UIA patterns are now covered by both `ApexHelper` and `CommandProcessor`:
| Action | Description |
|---|---|
| `mouse-click` | Force mouse left-click (bypasses Invoke/Toggle/SelectionItem) |
| `middle-click` | Middle-mouse-button click |
| `click-at value=x,y` | Click at pixel offset from element top-left |
| `drag value=x,y` | Drag element to screen coordinates |
| `highlight` | Draw orange highlight around element for 1 second |
| `isenabled` | Returns True/False |
| `isvisible` | Returns True/False |
| `clearvalue` | Set value to empty string (Value pattern) |
| `appendvalue` | Append text to current value |
| `getselectedtext` | Selected text via Text pattern |
| `setrange value=n` | Set RangeValue pattern |
| `getrange` | Read current RangeValue |
| `rangeinfo` | Min / max / smallChange / largeChange |
| `toggle-on` / `toggle-off` | Set toggle to a specific state |
| `gettoggle` | Read current toggle state (On / Off / Indeterminate) |
| `expandstate` | Read ExpandCollapse state |
| `select-item` | Select via SelectionItem pattern |
| `addselect` | Add element to multi-selection |
| `removeselect` | Remove element from selection |
| `isselected` | Check SelectionItem selected state |
| `getselection` | Get selected items from a Selection container |
| `select-index value=n` | Select ComboBox / ListBox item by zero-based index |
| `getitems` | List all items in a ComboBox or ListBox |
| `getselecteditem` | Get currently selected item text |
| `minimize` / `maximize` / `restore` | Window visual state |
| `windowstate` | Read current window visual state |
| `move value=x,y` | Move element via Transform pattern |
| `resize value=w,h` | Resize element via Transform pattern |
| `scroll-left` / `scroll-right value=n` | Horizontal mouse scroll |
| `scrollpercent value=h,v` | Scroll to h%/v% via Scroll pattern |
| `getscrollinfo` | Scroll position and scrollable flags |
| `griditem value=row,col` | Get element at grid cell |
| `gridinfo` | Row and column counts |
| `griditeminfo` | Row / column / span for a GridItem element |
| Action | Change |
|---|---|
| `click` | Now smart: Invoke → Toggle → SelectionItem → mouse fallback |
| `gettext` | Smart chain: Text pattern → Value → LegacyIAccessible → Name |
| `getvalue` | Smart chain: Value → Text → LegacyIAccessible → Name |
| `setvalue` | Smart chain: Value (if writable) → RangeValue (if numeric) → keyboard |
| `select` | Tries SelectionItem on list child first, then FlaUI wrappers |
| `keys` | Full `{KEY}` token notation (`{CTRL}`, `{F5}`, …) and `Ctrl+A` / `Alt+F4` combo syntax |
- `windows` command returns a JSON array of `{id, title}` for all open windows, enabling the AI to select precisely without relying on fuzzy matching.
- Named-pipe server (`PipeCommandServer`): exposes the full command set over a Windows named pipe (default name `ApexComputerUse`). Each client connection is session-based (state is preserved across commands on the same connection). Accepts and returns newline-delimited JSON.
- Pipe server UI: new row in the Remote Control group box with a configurable pipe name, Start/Stop button, and live status label.
- `Scripts\ApexComputerUse.psm1`: PowerShell module providing idiomatic cmdlets over the named pipe: `Connect-FlaUI`, `Disconnect-FlaUI`, `Send-FlaUICommand`, `Get-FlaUIWindows`, `Get-FlaUIStatus`, `Get-FlaUIHelp`, `Get-FlaUIElements`, `Find-FlaUIElement`, `Invoke-FlaUIAction`, `Invoke-FlaUIOcr`, `Invoke-FlaUIAi`.
- `Scripts\apex.cmd`: cmd.exe batch helper wrapping the HTTP server with simpler positional syntax (e.g. `apex find Notepad`, `apex exec click`, `apex ai describe`). Requires curl (built into Windows 10+).
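Newline-delimited JSON framing, as used by the pipe protocol described above, can be sketched in Python without opening a pipe. The `command` and `action` keys follow the PowerShell example earlier in this README; the helper names, the UTF-8 encoding, and the shape of the sample replies are assumptions for illustration.

```python
import json
from typing import Any, Dict, List, Tuple


def encode_command(command: Dict[str, Any]) -> bytes:
    """Serialize one command as a single line terminated by a newline."""
    return (json.dumps(command, separators=(",", ":")) + "\n").encode("utf-8")


def decode_lines(buffer: bytes) -> Tuple[List[Dict[str, Any]], bytes]:
    """Split a receive buffer into complete JSON messages plus the unread tail."""
    *complete, remainder = buffer.split(b"\n")
    messages = [json.loads(line) for line in complete if line.strip()]
    return messages, remainder


wire = encode_command({"command": "capture", "action": "screen"})
print(wire)

# Two complete replies followed by a partial one still arriving (sample data).
incoming = b'{"ok":true}\n{"ok":false}\n{"ok":tr'
messages, tail = decode_lines(incoming)
print(messages, tail)
```

Keeping the unread tail and prepending it to the next read is what makes the framing robust when a reply is split across several reads.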
- AI multimodal command set (`MtmdHelper` integration): expose the existing `MtmdHelper` class through all remote interfaces.
- `CommandRequest` extended with `ModelPath`, `MmProjPath`, and `Prompt` fields.
- `ai` command in `CommandProcessor` with five sub-actions:
  - `init`: load the LLM and multimodal projector from disk (`model=` + `proj=` paths).
  - `status`: report whether the model is loaded and which modalities it supports.
  - `describe`: capture the current UI element and ask the vision model to describe it (optional `prompt=`).
  - `file`: send an image or audio file from disk to the model (`value=<path>`, optional `prompt=`).
  - `ask`: ask an arbitrary question about the current UI element (`prompt=` required).
- HTTP endpoints for AI commands: `GET /ai/status`; `POST /ai/init`, `/ai/describe`, `/ai/file`, `/ai/ask`.
- Telegram `/ai` command: same sub-action set via `action=<sub>` key-value syntax.
- Updated `help` command output to list all `ai` sub-actions.
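
The five `ai` sub-actions and their HTTP endpoints can be summarised as a small routing table. The Python sketch below maps each sub-action to its method and path and checks the parameters that the entries above mark as needed. The function name `plan_ai_call` is hypothetical, and using `model`, `proj`, `value`, and `prompt` as HTTP parameter names is an assumption carried over from the key=value syntax.

```python
from typing import Dict, Tuple

# Method and path per sub-action, as listed in the HTTP endpoints entry.
AI_ROUTES: Dict[str, Tuple[str, str]] = {
    "status": ("GET", "/ai/status"),
    "init": ("POST", "/ai/init"),
    "describe": ("POST", "/ai/describe"),
    "file": ("POST", "/ai/file"),
    "ask": ("POST", "/ai/ask"),
}

# Parameters each sub-action needs; prompt is optional for describe and file.
REQUIRED: Dict[str, Tuple[str, ...]] = {
    "status": (),
    "init": ("model", "proj"),
    "describe": (),
    "file": ("value",),
    "ask": ("prompt",),
}


def plan_ai_call(action: str, **params: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (method, path, params) for a sub-action, or raise ValueError."""
    if action not in AI_ROUTES:
        raise ValueError(f"unknown ai sub-action: {action!r}")
    missing = [key for key in REQUIRED[action] if not params.get(key)]
    if missing:
        raise ValueError(f"{action} is missing: {', '.join(missing)}")
    method, path = AI_ROUTES[action]
    return method, path, params


print(plan_ai_call("ask", prompt="What does this dialog say?"))
```

Validating on the client side keeps obviously incomplete requests, such as `init` without a projector path, from reaching the server at all.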
- HTTP REST server (`HttpCommandServer`): control the application via curl on a configurable port (default 8080). Endpoints: `GET /status`, `/windows`, `/elements`, `/help`; `POST /find`, `/execute`, `/ocr`.
- Telegram bot (`TelegramController`): same command set over Telegram. Supports `/find`, `/exec`, `/ocr`, `/status`, `/windows`, `/elements`, `/help`. Key=value argument syntax with quoted multi-word values.
- `CommandProcessor`: shared command engine used by both remote interfaces. Auto-accepts fuzzy window/element matches (no UI prompts in remote mode). Fires `OnLog` events forwarded to the form's status box.
- Remote Control group box in the UI: start/stop HTTP server and Telegram bot with live status indicators.
- `FlaUIHelper.ListWindowTitles()`: returns titles of all open windows.
- `FlaUIHelper.ListElements(Window, ControlType?)`: lists all elements in a window with optional `ControlType` filter.
- `README.md`: full usage documentation including curl examples and Telegram command reference.
- `CHANGELOG.md`: this file.
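
The Telegram entry above mentions key=value arguments with quoted multi-word values. One way such a syntax could be parsed is sketched below in Python using the standard `shlex` module. This is an illustration of the syntax only, not the bot's actual parser; the function name `parse_args` and the sample keys are hypothetical.

```python
import shlex
from typing import Dict


def parse_args(text: str) -> Dict[str, str]:
    """Split 'key=value key2="two words"' into a dict of strings."""
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True  # split on whitespace only, not punctuation
    lexer.escape = ""              # keep backslashes literal (Windows paths)
    lexer.commenters = ""          # do not treat '#' as a comment start
    args: Dict[str, str] = {}
    for token in lexer:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        args[key] = value
    return args


print(parse_args('window=Notepad name="Text Editor" action=click'))
# {'window': 'Notepad', 'name': 'Text Editor', 'action': 'click'}
```

Disabling backslash escapes is a deliberate choice in this sketch: with the default POSIX rules, a value such as `C:\models\model.gguf` would lose its backslashes.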
- OCR (`OcrHelper`): captures any UI element and runs Tesseract OCR on it.
  - `OcrElement`: capture and recognise.
  - `OcrElementAndSave`: capture, save image to disk, then recognise (useful for debugging).
  - `OcrElementRegion`: OCR a sub-rectangle of the element.
  - `OcrFile`: OCR an existing image file.
- `tessdata\eng.traineddata` bundled in project and copied to output on build.
- OCR actions available in the Any Element action group in the UI.
- Fuzzy window matching — tries exact match, then contains, then Levenshtein closest. Prompts for approval on non-exact matches.
- Fuzzy element matching — same three-tier logic, applied to AutomationId or Name.
- Search Type combo: filter element search by `ControlType`. `All` searches every type without restriction. `All` is never passed as a `ControlType` value to FlaUI.
- Levenshtein distance implementation in `FlaUIHelper`.
- `FlaUIHelper.FindWindowFuzzy` and `FlaUIHelper.FindElementFuzzy` returning match metadata (exact vs fuzzy, matched value).
- Form height extended to accommodate the new Search Type row.
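
The three-tier fuzzy matching described above (exact match, then contains, then Levenshtein closest) can be sketched in Python as follows. This is an illustration under stated assumptions, not the `FlaUIHelper` code: case-insensitive comparison is assumed, and the `(matched_value, is_exact)` return shape is a hypothetical stand-in for the match metadata.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance using the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_fuzzy(query: str, candidates: Iterable[str]) -> Optional[Tuple[str, bool]]:
    """Return (matched_value, is_exact), or None when there are no candidates."""
    items = list(candidates)
    q = query.lower()  # assumption: matching ignores case
    for c in items:    # tier 1: exact match
        if c.lower() == q:
            return c, True
    for c in items:    # tier 2: query contained in the candidate
        if q in c.lower():
            return c, False
    if not items:
        return None
    # tier 3: candidate with the smallest edit distance
    return min(items, key=lambda c: levenshtein(q, c.lower())), False


titles = ["Untitled - Notepad", "Calculator"]
print(find_fuzzy("Notepad", titles))     # ('Untitled - Notepad', False)
print(find_fuzzy("Calculater", titles))  # ('Calculator', False)
```

Because the third tier always returns something when candidates exist, a caller would typically treat `is_exact == False` as the signal to ask for approval, or to auto-accept in remote mode.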
- Initial AI computer use application (WinForms) targeting .NET 10.
- FlaUIHelper class wrapping FlaUI UIA3 for all common WPF/WinForms control interactions:
- Button, TextBox, PasswordBox, Label, ComboBox, CheckBox, RadioButton, ListBox, ListView, DataGrid, TreeView, Menu/MenuItem, TabControl, Slider, ProgressBar, Hyperlink.
- Mouse operations: click, right-click, double-click, hover, drag & drop, scroll.
- Keyboard: type, send key, shortcuts (Ctrl+A/C/X/V/Z).
- Text: select all, copy, cut, paste, undo, clear, insert at caret.
- Value/RangeValue patterns, ExpandCollapse, ScrollItem, Transform.
- Screenshots via `FlaUI.Core.Capturing`.
- `Retry.WhileNull` for waiting on dynamic elements.
- Window operations: move, resize, minimize, maximize, restore, close.
- Focus: `SetFocus`, `GetFocusedElement`.
- Form UI with:
- Window Name, AutomationId, Element Name fields.
- Control Type picker (action groups) and Action picker.
- Value/Index field for parameterised actions.
- Find Element, Execute Action, Clear Log buttons.
- Timestamped output log.
- Designer-compatible `Form1.Designer.cs` (standard generated format, no lambdas or helpers inside `InitializeComponent`).
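
The `Retry.WhileNull` entry above refers to polling for an element that may not exist yet. The general pattern is sketched below in Python as an analogue only. It is not FlaUI's implementation, and the function name, timeout, and interval values are assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_while_none(probe: Callable[[], Optional[T]],
                     timeout: float = 5.0,
                     interval: float = 0.1) -> Optional[T]:
    """Call probe() until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None  # gave up: the element never appeared
        time.sleep(interval)


# Demo: a fake lookup that only succeeds on the third attempt.
attempts = {"count": 0}


def find_dialog() -> Optional[str]:
    attempts["count"] += 1
    return "Save As dialog" if attempts["count"] >= 3 else None


print(retry_while_none(find_dialog, timeout=2.0, interval=0.01))
```

The probe always runs at least once, even with a zero timeout, so an element that is already present is returned without any delay.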