ApexComputerUse

Give AI agents control of any Windows app — no vision model, no screenshots, no cloud.

ApexComputerUse reads the Windows accessibility tree (the same data the OS exposes to screen readers) and serves it over a plain HTTP REST API. Any AI agent — in any language, on any machine — can find, inspect, and control any desktop app or browser by making simple HTTP requests. No screenshots. No pixel coordinates. No cloud dependency.

5–20 tokens per action instead of 1,000–3,500 for a screenshot. A full browser page in onscreen-only mode is ~126 elements of compact JSON — less than the cost of a single screenshot of the same page.

Works on Win32, WPF, UWP, WinForms, and browsers. Controlled via HTTP REST, named pipes, cmd.exe, and Telegram.

Screenshots

Main Desktop UI

Interactive Web Console (`GET /`)

Scene Editor — WinForms

Scene Editor — Browser (`GET /editor`)

AI-Generated Drawing!

UI Map Overlay

Quickstart

Requirements: Windows 10/11 · .NET 10 SDK

git clone https://github.com/your-org/ApexComputerUse
cd ApexComputerUse
dotnet build
dotnet run --project ApexComputerUse

The app opens. The HTTP server starts automatically on port 8080 (HttpAutoStart=true in appsettings.json).
By default it binds to localhost only (HttpBindAll=false), so no first-run UAC network setup is required.
If you enable HttpBindAll=true, the app prompts once (UAC) to configure URL ACL + Windows Firewall for the selected port.
The API key is shown in the Remote Control tab → API Key field — copy it.
Open http://localhost:8080/?apiKey=<key> in a browser — the interactive console appears (the browser console pre-fills the key).
Pick any open window from the Windows panel on the left.
Browse its element tree, click an action button, see the result.

Chat tab: switch to the Chat tab and click Load Chat to open the streaming AI chat UI directly inside the app. Configure provider and API key in the settings group above, then chat away.

Clients tab: use the Clients tab to register other machines running ApexComputerUse. Add each machine's name, IP/host, port, and API key, then click Test to confirm the connection is live. This registry lets you — or an AI agent — track and target multiple Apex endpoints from a single instance.

Or go straight to curl (replace <key> with the API key from the Remote Control tab):

# Confirm the server is up
curl -H "X-Api-Key: <key>" http://localhost:8080/ping

# Find Notepad and read its text editor content (two calls)
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/find \
     -H "Content-Type: application/json" -d '{"window":"Notepad"}'
curl -H "X-Api-Key: <key>" "http://localhost:8080/exec?action=gettext"

# Or combine both in one call
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/find-exec \
     -H "Content-Type: application/json" -d '{"window":"Notepad","action":"gettext"}'

OCR: requires eng.traineddata — download from github.com/tesseract-ocr/tessdata and place it in tessdata\ next to the executable.

AI Vision: requires a GGUF vision model and projector — see Usage — AI.

Why ApexComputerUse

The problem with screenshot-based automation

Most AI computer-use tools — Claude Computer Use, OpenAI CUA, UI-TARS, OmniParser — work by sending a screenshot to a vision model and guessing pixel coordinates to click. This approach has compounding costs:

Screenshot token costs scale with resolution and vary by provider. A 1024×768 image runs ~765 tokens (OpenAI) to ~1,050 tokens (Anthropic). At 1920×1080 that rises to ~1,840 tokens (Anthropic) or ~2,125 tokens (OpenAI). At 2048×2048, OpenAI charges ~2,765 tokens and Anthropic ~2,500–3,500 tokens. Gemini is the exception, typically staying under 1,000 tokens even for ~4K images. And this cost is paid on every single step.
Screenshots stack in conversation history — a 20-step task accumulates 20+ images in context.
Coordinate grounding is fragile: it breaks on window resize, DPI scaling, and multi-monitor setups.
Published benchmarks confirm the accuracy ceiling: even specialist 7B vision models score only 18.9% on real professional UIs (ScreenSpot-Pro, 2025). GPT-4o scores below 2% on unscaled professional screens.

The structured-tree approach

ApexComputerUse reads the accessibility tree the OS already maintains — the same tree used by screen readers and test automation. This gives every element a name, control type, and AutomationId, without rendering a pixel.

Interacting with an element by name costs 5–20 tokens. The element map for a full browser page in onscreen-only mode is typically 100–200 elements of compact JSON — compared to ~1,050 tokens for a single screenshot of the same page, with none of the coordinate fragility.

This is the same direction taken by the most efficient browser-only tools: browser-use claims 50% fewer tokens than screenshot alternatives; Vercel's agent-browser returns 200–400 tokens per page snapshot and uses 82–93% fewer tokens than Playwright MCP. ApexComputerUse brings the same approach to the entire Windows desktop.

How it compares

Tool	Coverage	HTTP API	Stable element IDs	Onscreen filter	Status
ApexComputerUse	Windows desktop + browsers	✅ REST	✅ SHA-256 hash	✅ `?onscreen=true`	Active
UFO2 (Microsoft)	Windows desktop + browsers	❌ research agent	❌ bounding-box	Partial	Research only
UI Automata	Windows desktop + browsers	MCP only	Selector-based	Shadow DOM cache	Active
Windows-Use	Windows desktop	❌ Python lib	❌	Partial	Active
WinAppDriver	Windows desktop	WebDriver	XPath / selectors	❌	Paused by Microsoft
browser-use	Browser only	❌ Python lib	Element hash	✅	Active
Playwright MCP	Browser only	MCP	Session-scoped refs	Partial	Active
Claude Computer Use	Any (screenshot)	Cloud API	❌ coordinates	❌	Active

No other tool combines: Windows UIA3 coverage, SHA-256 stable element IDs, a language-agnostic HTTP REST API, and an onscreen visibility filter — in a single deployable binary.

Compatible AI Agents

ApexComputerUse exposes a plain HTTP REST API, which means any AI agent that can execute shell commands or fetch a URL can use it. No SDK, no plugin, no special integration required — if the agent can run curl, it can drive any Windows app or browser through this server.

Access paths

There are three ways an agent can interact with ApexComputerUse:

1. Shell / terminal access (curl or any HTTP client): Any agent that can run shell commands can call the API directly with curl, Python requests, or PowerShell Invoke-RestMethod. This covers the widest range of tools and requires no configuration beyond starting the HTTP server.

2. URL fetch / WebFetch tool: Some agents have a dedicated tool for fetching URLs rather than running shell commands. ApexComputerUse's HTML responses embed a full <script type="application/json" id="apex-result"> block, so any agent that can fetch a webpage gets structured JSON data back without needing a vision model.

3. MCP server (optional wrapper): Several agents support the Model Context Protocol. If you prefer a tighter integration, the REST API can be wrapped as an MCP server so the agent sees your actions as named tools rather than raw HTTP calls.

Agent compatibility table

Agent	Type	Shell access	URL fetch	MCP	Notes
Claude Code	CLI	✅ Bash tool	✅ WebFetch tool	✅	`curl` is blocked by default but Claude Code automatically falls back to Python `requests` for the same result
Cline	VS Code extension	✅ Terminal	✅ Via shell	✅	Full agentic loop; browser control; human-in-the-loop approval for each command
Aider	CLI	✅ Shell	✅ Via shell	❌	Oldest and most widely deployed open-source coding CLI; works with any model via Ollama or API key
Goose (Block)	CLI + Desktop	✅ Shell	✅ Via shell	✅	Apache 2.0; model-agnostic; native MCP support
Cursor (Agent Mode)	IDE	✅ Terminal	✅ Via shell	✅	Agent mode can run terminal commands; MCP support available
Windsurf (Cascade)	IDE	✅ Terminal	✅ Via shell	✅	Cascade runs commands automatically; MCP support with admin controls
GitHub Copilot (Agent Mode)	VS Code extension	✅ Terminal	✅ Via shell	✅	VS Code Agent mode handles terminal commands and iteration
OpenHands / Devin	Cloud agent	✅ Shell	✅ Via shell	Varies	Requires network path from the cloud sandbox to your Windows machine
Roo Code / Continue	VS Code extension	✅ Terminal	✅ Via shell	✅	Open-source; BYOK; shell access via VS Code terminal integration
Autocomplete-only tools	Extension	❌	❌	❌	Tabnine, Supermaven, etc. generate code only — no agentic shell or HTTP access

Local model users: any agent backed by a local model via Ollama (Qwen Coder, DeepSeek Coder, CodeLlama, etc.) that also has shell access works the same way. The model itself doesn't need internet access — the agent runtime executes the curl commands.

Quickest agent integration (Claude Code example)

Start the HTTP server, then drop this into your Claude Code session:

The ApexComputerUse REST API is running at http://localhost:8080.
Use curl (or Python requests if curl is blocked) to control Windows apps.
Send the header "X-Api-Key: <key>" with every request.
Start with: curl -H "X-Api-Key: <key>" http://localhost:8080/ping
Then: curl -H "X-Api-Key: <key>" http://localhost:8080/windows  (to see what's open)
Then find and interact with any element using /find and /exec (or /find-exec for both in one call).

Claude Code will handle the rest — finding windows, reading the element tree, clicking, typing, and verifying results across turns using its stable element IDs.

Stable element IDs

Every element is assigned a SHA-256 hash-based numeric ID derived from its control type, name, AutomationId, and position in the tree. These IDs are stable across sessions — an agent can reference the same element in turn 1 and turn 20 without re-querying the tree. No other tool in the Windows desktop automation space publishes this property.
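
As a rough illustration of how an ID like this could be derived, here is a minimal Python sketch. The field order, the separator, the `tree_path` encoding, and the truncation to 31 bits are all assumptions made for the example; the source does not document the exact scheme ApexComputerUse uses.

```python
import hashlib


def stable_element_id(control_type: str, name: str, automation_id: str, tree_path: str) -> int:
    """Hypothetical ID scheme: hash the identifying fields and keep 31 bits."""
    # Join with a unit-separator character that is unlikely to occur in the
    # fields, so ("ab", "c") and ("a", "bc") do not produce the same key.
    key = "\x1f".join([control_type, name, automation_id, tree_path])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Truncate to a non-negative integer that fits a signed 32-bit value.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF
```

Because the inputs are properties of the element rather than a runtime handle, the same element hashes to the same number on every scan, which is what lets an agent reuse an ID many turns later.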

The onscreen filter

GET /elements?onscreen=true prunes any element where IsOffscreen = true during the tree scan, skipping entire offscreen subtrees. On a live Chewy.com product page this reduces 634 elements to 126 — an 80% reduction — putting token cost per step in the same range as the best browser-only tools while covering all desktop apps too.

The filter composes with the type filter and the new depth/expansion params: ?onscreen=true&type=Button.

When ?match= is combined with ?onscreen=true, the match search scans all elements (including offscreen ones) so content that has been scrolled out of view can still be found by text search. Offscreen matches are tagged with "isOffscreen": true in the response. Use exec action=scrollinto on the returned element ID to bring an offscreen match into view before interacting with it.
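
The pruning described above can be sketched in Python on a tree shaped like the /elements JSON. This is only an illustration under assumptions: the real filter runs during the UI Automation scan, whereas this version works on an already-fetched dictionary and assumes every node carries an `isOffscreen` flag.

```python
from typing import Optional


def prune_offscreen(node: dict) -> Optional[dict]:
    """Return a copy of the tree without offscreen nodes (None if the node itself is offscreen)."""
    if node.get("isOffscreen", False):
        return None  # the whole subtree is skipped; its children are never visited
    pruned = {k: v for k, v in node.items() if k != "children"}
    kept = []
    for child in node.get("children", []):
        kept_child = prune_offscreen(child)
        if kept_child is not None:
            kept.append(kept_child)
    if kept:
        pruned["children"] = kept
    return pruned


def count_nodes(node: dict) -> int:
    """Count a node plus all of its descendants."""
    return 1 + sum(count_nodes(c) for c in node.get("children", []))
```

Comparing `count_nodes` before and after pruning gives the kind of reduction figure quoted above (634 to 126 elements).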

Progressive tree expansion

For deep pages, fetch a shallow overview first, then drill into only the branches you care about:

# Step 1 — shallow overview (fast, small response)
curl "http://localhost:8080/elements?depth=2&onscreen=true"
# Nodes that have children beyond the depth limit show "childCount": N instead of "children"

# Step 2 — expand a specific node by its ID (IDs are stable between calls)
curl "http://localhost:8080/elements?id=708379645&depth=2&onscreen=true"
# Returns only that subtree, 2 levels deep — existing map entries are preserved

This lets an AI agent navigate to the relevant section of a large page without fetching the whole tree on every step.
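
A client-side version of this two-step loop might look like the Python sketch below. The `fetch` callable is hypothetical: it stands in for a GET to /elements?id=<id>&depth=2&onscreen=true and is assumed to return the requested node with a `children` list. The `wanted` predicate is likewise the caller's own choice of which branches matter.

```python
from typing import Callable, Iterator


def truncated_nodes(node: dict) -> Iterator[dict]:
    """Yield nodes cut off by the depth limit (they carry childCount, not children)."""
    if "childCount" in node and "children" not in node:
        yield node
    for child in node.get("children", []):
        yield from truncated_nodes(child)


def expand_matching(root: dict,
                    fetch: Callable[[int], dict],
                    wanted: Callable[[dict], bool]) -> dict:
    """Expand, in place, only the truncated nodes for which wanted(node) is true."""
    for node in list(truncated_nodes(root)):
        if wanted(node):
            subtree = fetch(node["id"])  # one request per expanded branch
            node.pop("childCount", None)
            node["children"] = subtree.get("children", [])
    return root
```

Branches the agent does not care about keep their `childCount` marker and cost no further requests.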

Browser-friendly tree filters

Modern web pages often wrap every visible element in several identity-less Pane/Group/Custom nodes and produce deep trees with many one-child chains. The opt-in /elements parameters below cut that noise or add context to the nodes that remain:

# RECOMMENDED: global text search — replaces almost all hierarchical drill-down.
# Searches Name, AutomationId, Value, AND ClassName across the entire window tree
# (including offscreen elements). Returns every match with its ancestor path plus
# `depth` levels of descendants. Combine with includePath=true for breadcrumbs.
# The parameter name is `match=` — there is NO separate `global=true` flag; `match=`
# alone forces a full-tree scan and the tester-friendly behaviour described above.
# When `match=` is set, `depth=` does not limit the scan (otherwise depth pruning would
# hide deep matches before they could be found); it only bounds the descendants
# returned under each match.
curl "http://localhost:8080/elements?match=add+to+cart&onscreen=true&depth=1&includePath=true"

# Collapse "1-in-1-in-1" wrapper chains. A wrapper is skipped only when it has
# exactly one child, no name, no AutomationId, and its control type is Pane,
# Group, or Custom. Named containers and anything with an AutomationId survive.
curl "http://localhost:8080/elements?onscreen=true&collapseChains=true"

# Ancestor breadcrumb on every emitted node: "Chrome > Document > Main > Form".
curl "http://localhost:8080/elements?onscreen=true&includePath=true"

# Opt into Value pattern + HelpText on every node — useful for web inputs
# whose Name is empty and whose visible content lives in the Value pattern.
curl "http://localhost:8080/elements?onscreen=true&properties=extra"

# All new filters combine cleanly with existing ones.
curl "http://localhost:8080/elements?onscreen=true&collapseChains=true&match=submit&type=Button&depth=1&properties=extra"

Truncated nodes (ones whose children were cut off by depth) now also emit descendantCount alongside childCount, so an agent can decide whether a subtree is worth expanding without another round trip. Element IDs are computed against the real, unflattened tree, so hoisting a descendant through collapseChains does not change its ID, and follow-up /elements?id=<id> and /exec id=<id> calls still resolve.
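
The collapseChains rule stated above (exactly one child, no name, no AutomationId, control type Pane, Group, or Custom) can be sketched in Python as follows. This is an illustration on the JSON shape only, not the server's implementation; in particular, whether the server also collapses the root node is not stated in the source.

```python
WRAPPER_TYPES = {"Pane", "Group", "Custom"}


def is_wrapper(node: dict) -> bool:
    """True when the node is an identity-less container with exactly one child."""
    return (len(node.get("children", [])) == 1
            and not node.get("name")
            and not node.get("automationId")
            and node.get("controlType") in WRAPPER_TYPES)


def collapse_chains(node: dict) -> dict:
    """Return a copy of the tree with one-child wrapper chains hoisted away."""
    while is_wrapper(node):
        node = node["children"][0]  # hoist the only child; its id is left untouched
    result = {k: v for k, v in node.items() if k != "children"}
    if node.get("children"):
        result["children"] = [collapse_chains(c) for c in node["children"]]
    return result
```

Note that the hoisted node keeps its original `id`, which mirrors the guarantee that collapsing does not change element IDs.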

/find now populates the response's structured element object (id, controlType, name, automationId, className, frameworkId, isEnabled, isOffscreen, boundingRectangle, plus value/helpText when properties=extra), in addition to the existing human-readable string in message.

Features

Find any window and element by name or AutomationId (exact or fuzzy match)
Filter element search by ControlType
Persistent, hash-based stable element and window IDs (survive app restarts)
Onscreen-only element map (?onscreen=true) — prunes offscreen subtrees at scan time
Progressive tree expansion (?depth=N + ?id=<elementId>) — fetch a shallow overview then drill into only the branches you need, without re-scanning the whole window
Element nodes include boundingRectangle (x, y, width, height) for spatial context and visual rendering
Execute all common UI actions: click, type, select, toggle, scroll, drag & drop, etc.
OCR any UI element using Tesseract
Multimodal AI: describe UI elements, ask questions about them, analyse image/audio files using a local vision LLM (LLamaSharp MTMD)
Remote control via HTTP REST API (curl-friendly JSON)
Remote control via named pipe (PowerShell module included)
Remote control via cmd.exe batch helper (apex.cmd)
Remote control via Telegram bot
Screenshot capture of elements, windows, and full screen (returned as base64 PNG)
Interactive HTTP test console — served at GET /, includes live windows list, element tree browser, grouped command builder covering every action, inline capture/OCR/AI vision/UI map buttons, format selector (JSON/HTML/Text/PDF), format demo links, and a response log
AI Drawing — POST /draw renders any combination of shapes (rect, ellipse, circle, line, arrow, polygon, text) to a base64 PNG; GET /draw/demo renders a built-in multi-colour space scene; ?overlay=true shows the result as a click-through screen overlay
Layered Scene Editor — persistent, structured drawing canvas with stable shape IDs so AI can generate a composition and the user can refine it; full REST API at /scenes/*; interactive WinForms editor (Tools → Scene Editor) and browser editor (GET /editor)
UI Map Renderer — renders the element tree as a colour-coded overlay drawn directly on screen, and optionally exports a PNG image; accessible via Tools → Render UI Map or GET /uimap
Format-adaptive responses — every endpoint serves HTML, plain text, JSON, or PDF via URL extension (.json, .html, .txt, .pdf), ?format= parameter, or Accept header; default is an HTML page with embedded JSON readable by any AI that can fetch a URL
System utility routes — /health (unauthenticated), /ping, /metrics, /sysinfo, /env, /ls, /run, /run-tests, /shutdown for AI agents that need OS-level context without a separate tool
WindowMonitor — background STA poll thread detects desktop window opens / closes / title changes once per second; fires WindowsChanged / WindowClosed events that auto-prune the CommandProcessor element + window caches when a window goes away (no more stale-handle errors when an app closes mid-session). Optional WatchElements mode adds descendant-level diff tracking, narrowable to the foreground window or to titles matching a substring filter for tractable scan cost. Inspect activity via GET /winmon/log and drain via POST /winmon/clear
Live monitoring dashboard — browser-based status page at GET /dashboard; shows health, per-route metrics, system info, registered clients, AI chat session status, and WindowMonitor activity log. Auto-refreshes every 5 seconds. Requires AllowDiagnostics permission.
Native HTTPS — opt-in TLS via http.sys (no proxy); Scripts/setup-https.ps1 generates a self-signed cert, binds it via netsh http add sslcert, and adds a Firewall rule in one elevated step. Supports user-supplied PFX. Three remote-access options documented in Scripts/README-remote-access.md: SSH tunnel, native HTTPS, and Caddy reverse proxy.
Embedded AI chat in the Chat tab — the Chat tab opens the streaming HTML chat UI (/chat) in your default browser; click Open In Browser to launch it. The HTML page handles streaming, provider/model display, and session reset natively.
AI Chat over HTTP — streaming chat UI at GET /chat backed by /chat/send, /chat/status, /chat/reset; same 8 providers as the desktop AI Chat window; also accessible from any browser
Agentic tool loop in AI Chat — when the local HTTP server is running, the AI can issue ApexComputerUse API calls inside ```apex code blocks; results are fed back automatically for up to 8 turns until the AI produces a clean answer (AiChatService.SendAsync + SetLocalServer)
Auto-start on launch — HTTP server starts automatically (HttpAutoStart=true by default), binds to localhost by default (HttpBindAll=false), and can be switched to all-interfaces mode with one-time netsh setup (URL ACL + Firewall rule)
Auto-download setup — Model tab "Download All" button fetches the LFM2.5-VL model, projector, and Tesseract data to fixed local paths on first launch

Setup

1. Build and run

git clone https://github.com/your-org/ApexComputerUse
cd ApexComputerUse
dotnet run --project ApexComputerUse

2. First-run network setup (only when `HttpBindAll=true`)

When HttpBindAll=true, ApexComputerUse checks whether the HTTP URL ACL and Windows Firewall inbound rule exist for the configured port. If either is missing, a single elevated cmd window opens (one UAC prompt) and runs:

netsh http add urlacl url=http://+:{port}/ user=Everyone
netsh advfirewall firewall add rule name="ApexComputerUse" dir=in action=allow protocol=TCP localport={port}

This happens once and is tracked in %APPDATA%\ApexComputerUse\settings.json. With the default HttpBindAll=false, this setup is skipped.

3. Models and OCR data (optional — auto-download available)

Open the Model tab and click Download All to automatically fetch:

LFM2.5-VL-450M-Q4_0.gguf — vision LLM (450 M parameters, quantized)
mmproj-LFM2.5-VL-450m-F16.gguf — multimodal projector
eng.traineddata — Tesseract English OCR data

Files are saved to models\ and tessdata\ next to the executable. On first launch the app detects missing files and switches to the Model tab automatically.

To download manually: copy eng.traineddata from github.com/tesseract-ocr/tessdata into tessdata\, and place both .gguf files in models\.

4. Remote access (optional)

Three options — see Scripts/README-remote-access.md for full details:

Option	When to use	Setup
SSH tunnel	Ad-hoc, no certificates	`.\Scripts\ssh-tunnel.ps1 -Server user@mypc`
Native HTTPS	Permanent TLS, no proxy	`.\Scripts\setup-https.ps1` (run as Admin), then set `HttpsEnabled: true` in `appsettings.json`
Caddy proxy	Public domain + auto Let's Encrypt	`caddy run --config Scripts/Caddyfile` with `DOMAIN=` set

5. Telegram Bot (optional)

Message @BotFather on Telegram and create a bot with /newbot.
Copy the token (format: 123456789:ABC-DEF...).
Paste it into the Bot Token field in the app and click Start Telegram.
Add your Telegram chat ID to the Allowed Chat IDs field to restrict who can send commands.

Security & Configuration

HTTP API Authentication

Every HTTP request except GET /health must include the API key. Three equivalent methods:

# Authorization header (recommended)
curl -H "Authorization: Bearer <key>" http://localhost:8080/ping

# X-Api-Key header
curl -H "X-Api-Key: <key>" http://localhost:8080/ping

# Query parameter (use only for browser links / quick tests)
curl "http://localhost:8080/ping?apiKey=<key>"

Requests without a valid key receive HTTP 401. The interactive web console (GET /) pre-fills the key when it is opened with ?apiKey=<key> in the URL; copy the key from the Remote Control tab on first launch.

To disable authentication (local development only), clear the API Key field in the app.
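
For agents that call the API from Python rather than curl, the three methods could be wrapped in a small helper like the sketch below. The function name and its `method` argument are invented for this example; only the header names and the `apiKey` query parameter come from the section above.

```python
from urllib.parse import urlencode


def with_api_key(url: str, api_key: str, method: str = "bearer"):
    """Return (url, headers) carrying the key via one of the three documented methods."""
    if method == "bearer":
        return url, {"Authorization": f"Bearer {api_key}"}
    if method == "x-api-key":
        return url, {"X-Api-Key": api_key}
    if method == "query":
        separator = "&" if "?" in url else "?"
        return url + separator + urlencode({"apiKey": api_key}), {}
    raise ValueError(f"unknown method: {method}")


# Usage with the standard library (not executed here):
#   import urllib.request
#   url, headers = with_api_key("http://localhost:8080/ping", "<key>")
#   urllib.request.urlopen(urllib.request.Request(url, headers=headers))
```

The header forms keep the key out of URLs, which tend to end up in logs and browser history; that is the usual reason to reserve the query form for quick tests.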

Named Pipe Security

The named pipe is ACL-restricted to the current Windows user. Other local users and unprivileged processes cannot connect.

Telegram Bot Authorization

Enter one or more Telegram chat IDs in the Allowed Chat IDs field (comma-separated). Any message from an unlisted chat ID receives "Unauthorized." and is logged. Leave the field empty only for local testing.

Client Permission Gating (non-loopback callers)

Requests from localhost / loopback always have full access. Non-loopback callers are matched against entries in the Clients tab and constrained by per-client permissions (allow_automation, allow_capture, allow_ai, allow_scenes, allow_shell_run, allow_clients, allow_diagnostics). Unknown non-loopback callers are denied.
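
The gating rule can be summarised in a few lines of Python. This is a sketch under assumptions: the source does not say how a caller is matched to a Clients entry, so the example keys the registry by IP address, and the `clients` dictionary shape is hypothetical.

```python
import ipaddress


def is_allowed(remote_ip: str, permission: str, clients: dict) -> bool:
    """Decide whether a caller may use a route guarded by `permission`."""
    if ipaddress.ip_address(remote_ip).is_loopback:
        return True  # localhost / loopback always has full access
    client = clients.get(remote_ip)  # assumption: clients are keyed by IP
    if client is None:
        return False  # unknown non-loopback callers are denied
    return bool(client.get(permission, False))


# Hypothetical registry entry using the permission names listed above.
EXAMPLE_CLIENTS = {
    "192.168.1.20": {"allow_automation": True, "allow_capture": True, "allow_shell_run": False},
}
```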

Shell Execution (`/run`)

The POST /run and GET /run endpoints execute arbitrary cmd.exe commands. They are disabled by default. Enable them explicitly:

In appsettings.json: "EnableShellRun": true
Or via environment variable: APEX_ENABLE_SHELL_RUN=true

Configuration

Settings are layered from the sources below; later sources override earlier ones, so environment variables win over appsettings.json:

appsettings.json (next to the executable — shipped defaults shown):

{
  "HttpPort":             8080,
  "HttpBindAll":          false,
  "HttpAutoStart":        true,
  "PipeName":             "ApexComputerUse",
  "LogLevel":             "Information",
  "EnableShellRun":       false,
  "TelegramToken":        "",
  "TestRunnerExePath":    "",
  "TestRunnerConfigPath": ""
}

Shipped defaults are HttpAutoStart=true and HttpBindAll=false (auto-start on localhost only). Set HttpBindAll=true for LAN access.

Environment variables (prefix APEX_, override appsettings.json):

Variable	Description
`APEX_HTTP_PORT`	HTTP listen port (default `8080`)
`APEX_HTTP_BIND_ALL`	`true` to bind all interfaces instead of localhost only
`APEX_HTTP_AUTOSTART`	`true` to auto-start HTTP server in GUI mode
`APEX_PIPE_NAME`	Named pipe name
`APEX_LOG_LEVEL`	Serilog minimum level: `Debug` / `Information` / `Warning` / `Error`
`APEX_ENABLE_SHELL_RUN`	`true` to enable the `/run` shell-execution endpoint
`APEX_API_KEY`	Override the auto-generated API key
`APEX_ALLOWED_CHAT_IDS`	Comma-separated Telegram chat ID whitelist
`APEX_TELEGRAM_TOKEN`	Telegram bot token
`APEX_MODEL_PATH`	Default LLM `.gguf` path
`APEX_MMPROJ_PATH`	Default multimodal projector `.gguf` path
`APEX_TEST_RUNNER_EXE_PATH`	Path to `TestApplications/TestRunner` executable for `/run-tests`
`APEX_TEST_RUNNER_CONFIG_PATH`	Optional config file path passed to TestRunner

Network binding: HttpBindAll = false (the default) binds to http://localhost:{port}/ — loopback only, safe for single-machine use. Set APEX_HTTP_BIND_ALL=true to bind all interfaces for network-wide access (ensure firewall rules are in place).
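
To make the override order concrete, here is a small Python sketch of the layering (shipped defaults, then appsettings.json, then APEX_* environment variables). It covers only four of the settings, and the parsing rules (case-insensitive `true`, integer ports) are assumptions; the application's own loader may behave differently.

```python
import json

DEFAULTS = {"HttpPort": 8080, "HttpBindAll": False, "HttpAutoStart": True, "EnableShellRun": False}

# Environment variable -> appsettings.json key, taken from the table above.
ENV_MAP = {
    "APEX_HTTP_PORT": "HttpPort",
    "APEX_HTTP_BIND_ALL": "HttpBindAll",
    "APEX_HTTP_AUTOSTART": "HttpAutoStart",
    "APEX_ENABLE_SHELL_RUN": "EnableShellRun",
}


def load_settings(appsettings_text: str, env: dict) -> dict:
    """Layer defaults < appsettings.json < environment variables."""
    settings = dict(DEFAULTS)
    settings.update(json.loads(appsettings_text))
    for variable, key in ENV_MAP.items():
        if variable not in env:
            continue
        raw = env[variable]
        if isinstance(DEFAULTS[key], bool):  # check bool first: bool is a subclass of int
            settings[key] = raw.strip().lower() == "true"
        else:
            settings[key] = int(raw)
    return settings
```

In real use the second argument would be `os.environ`; passing a plain dictionary keeps the sketch easy to test.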

Logs are written to %LOCALAPPDATA%\ApexComputerUse\Logs\apex-YYYYMMDD.log (daily rotation, 7-day retention).

Run as a Windows Service

ApexComputerUse can run headlessly as a Windows service (no GUI):

# Install
sc.exe create ApexComputerUse binPath= "C:\ApexComputerUse\ApexComputerUse.exe --service" start= auto
sc.exe start ApexComputerUse

# Uninstall
sc.exe stop ApexComputerUse
sc.exe delete ApexComputerUse

Configure via appsettings.json or APEX_* environment variables before starting the service. The APEX_TELEGRAM_TOKEN and APEX_API_KEY variables are the recommended way to inject secrets in a service context.

Command-line overrides

Program.cs supports lightweight startup overrides:

--port <n> sets APEX_HTTP_PORT for that process
--pipe <name> sets APEX_PIPE_NAME for that process
--client marks the instance as a subordinate client instance

Usage — UI

Field	Description
Window Name	Partial title of the target window. Fuzzy-matched if no exact match found.
AutomationId	The element's `AutomationId` (checked first).
Element Name	The element's `Name` property (fallback if AutomationId is blank).
Search Type	Filter the element search to a specific `ControlType`. `All` searches everything.
Control Type	Selects the action group (Button, TextBox, etc.).
Action	The action to perform on the found element.
Value / Index	Input for actions that need it (text to type, index, row,col, x,y, etc.).

Find Element: locates the window and element, and logs what was found.
Execute Action: runs the selected action against the last found element.

Tools menu

Item	Description
Run AI Computer Use Mode	Launches the interactive multimodal AI agent loop (requires model loaded on the Model tab).
Output UI Map	Scans the current window's element tree and logs it as nested JSON to the console tab.
Render UI Map	Scans the current window's element tree, draws a colour-coded bounding-box overlay on screen for 5 seconds, and offers to save the overlay as a PNG image.
Scene Editor	Opens the layered scene editor — create scenes, add shapes to layers, drag to reposition, use AI to generate and refine compositions.
AI Chat	Opens a standalone streaming chat window with support for 8 AI providers (OpenAI, Anthropic, DeepSeek, Grok, Groq, Duck, LM Studio, LlamaSharp). Configure API keys in `ai-settings.json` next to the executable. The Chat tab opens the same chat UI in your default browser — click Open In Browser after the HTTP server starts.

Window and Element ID Mapping

Every window and element is assigned a stable numeric ID (SHA-256 hash-based) that persists across sessions. These IDs can be used in find commands instead of titles or AutomationIds.

# 1. Get windows with their IDs
curl http://localhost:8080/windows
# Returns: [{"id":42,"title":"Notepad"},{"id":107,"title":"Calculator"},...]

# 2. Get elements with their IDs for the current window
curl http://localhost:8080/elements

# Onscreen elements only (prunes offscreen subtrees — 80% fewer elements on browser pages)
curl "http://localhost:8080/elements?onscreen=true"

# Limit tree depth — nodes at the cutoff show "childCount" instead of "children"
curl "http://localhost:8080/elements?depth=2&onscreen=true"

# Expand a specific subtree by numeric ID (IDs are stable; map is preserved between expansion calls)
curl "http://localhost:8080/elements?id=708379645&depth=2&onscreen=true"

# Combine with type filter
curl "http://localhost:8080/elements?onscreen=true&type=Button"

# Returns nested JSON including bounding rectangles:
# {
#   "id": 105,
#   "controlType": "Edit",
#   "name": "Text Editor",
#   "automationId": "15",
#   "boundingRectangle": { "x": 0, "y": 30, "width": 800, "height": 600 },
#   "children": [...]
# }
#
# When a depth limit truncates a node's children, "childCount" appears instead:
# {
#   "id": 708379645,
#   "controlType": "Pane",
#   "name": "",
#   "boundingRectangle": { ... },
#   "childCount": 7    <-- call /elements?id=708379645 to expand
# }

# 3. Find using numeric IDs (no fuzzy matching, direct map lookup)
curl -X POST http://localhost:8080/find \
     -H "Content-Type: application/json" \
     -d '{"window":42,"id":105}'

Using numeric IDs is faster and unambiguous — the element is resolved directly from the in-memory map without any search or fuzzy logic. Every find call also auto-focuses the matched window. When a title/name search is low-confidence or ambiguous, /find now refuses to guess and returns error_data.candidates; choose one of those candidates or use IDs from /windows and /elements.

Token Economics

Map rendering isn't just a debugging convenience — it has compounding implications for token consumption at scale.

The Core Difference

With screenshot-based AI automation, every interaction requires sending a fresh image to the model. At typical desktop resolutions that's 1,000–3,500 tokens per screenshot depending on the provider and resolution — every single step, accumulating in conversation history. With ApexComputerUse's map approach, the UI is rendered once as a structured, text-based representation. After that initial render, each individual interaction references elements by name, costing 5–20 tokens on average.

The ?onscreen=true filter further reduces the element map to only what is visible in the current viewport. On a real browser page this produces 126 elements of compact JSON — well under the cost of a single screenshot of the same page.

Real-world token costs (approximate — varies by provider and resolution)

Approach	Per step	20-step task
Screenshot (1024×768)	~765–1,050 tokens	~15,000–21,000 tokens in images alone
Screenshot (1920×1080)	~1,840–2,125 tokens	~37,000–43,000 tokens in images alone
Screenshot (2048×2048)	~2,765–3,500 tokens	~55,000–70,000 tokens in images alone
ApexComputerUse (full map)	400–1,800 tokens (one-time) + ~10 per action	~600–2,000 tokens total
ApexComputerUse (`?onscreen=true`)	200–600 tokens (one-time) + ~10 per action	~400–800 tokens total

Provider breakdown: at 1024×768, Anthropic ≈ 1,050 tokens / OpenAI ≈ 765 tokens. At 1920×1080, Anthropic ≈ 1,840 / OpenAI ≈ 2,125. At 2048×2048, OpenAI ≈ 2,765 / Anthropic ≈ 2,500–3,500. Gemini is notably more efficient — typically under 1,000 tokens even for ~4K images. All providers compound costs across steps: every screenshot remains in context for the life of the conversation.
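
The 20-step figures are simple arithmetic on the per-step costs, which the following Python sketch reproduces. The function names are made up for the example, and the per-action cost of about 10 tokens is the approximate figure quoted in this section.

```python
def screenshot_task_tokens(tokens_per_screenshot: int, steps: int) -> int:
    """Image tokens for a task that sends one screenshot per step."""
    return tokens_per_screenshot * steps


def map_task_tokens(map_tokens: int, tokens_per_action: int, steps: int) -> int:
    """One-time element map plus a small per-action cost."""
    return map_tokens + tokens_per_action * steps


STEPS = 20
low_screenshot = screenshot_task_tokens(765, STEPS)     # 1024x768, lower bound
high_screenshot = screenshot_task_tokens(3_500, STEPS)  # 2048x2048, upper bound
small_map = map_task_tokens(200, 10, STEPS)             # onscreen map, lower bound
large_map = map_task_tokens(1_800, 10, STEPS)           # full map, upper bound
```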

Example 1 — Small App (Calculator, tray utility, simple tool)

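Screenshot: 2,500 tokens each · Initial map: 400 tokens · Per-action after map: 8 tokens · Workload implied by the figures below: 100 actions per day, with the map rendered once per day (400 + 99 × 8 = 1,192 tokens)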

By time period — 1 person:

Timeframe	Screen Capture	Map Approach	Tokens Saved
1 day	250,000	1,192	248,808
1 week	1,750,000	8,344	1,741,656
1 year	91,250,000	435,080	90,814,920

Annual totals — by team size:

Team Size	Screen Capture	Map Approach	Reduction Factor
1 person	91,250,000	435,080	~210x
10 people	912,500,000	4,350,800	~210x
50 people	4,562,500,000	21,754,000	~210x
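
The tables above can be reproduced with the short Python calculation below. The workload of 100 actions per day with one map render per day is inferred from the table values (250,000 / 2,500 = 100 and 400 + 99 × 8 = 1,192); the source does not state it explicitly.

```python
SCREENSHOT_TOKENS = 2_500
MAP_TOKENS = 400
ACTION_TOKENS = 8
ACTIONS_PER_DAY = 100  # inferred from the table, not stated in the source


def daily_screenshot_tokens() -> int:
    return SCREENSHOT_TOKENS * ACTIONS_PER_DAY


def daily_map_tokens() -> int:
    # The first action of the day pays for the map; the other 99 pay the per-action cost.
    return MAP_TOKENS + ACTION_TOKENS * (ACTIONS_PER_DAY - 1)


def annual_tokens(daily: int, people: int = 1, days: int = 365) -> int:
    return daily * days * people


def reduction_factor() -> float:
    return daily_screenshot_tokens() / daily_map_tokens()
```

The reduction factor is independent of team size because both columns scale linearly with the number of people.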

Usage — HTTP API

Start the HTTP server from the Remote Control group box, then use curl or open http://localhost:8080/?apiKey=<key> in a browser to access the interactive test console.

Authentication reminder: every route except GET /health requires the API key. For curl, add -H "X-Api-Key: <key>". For browser URLs, append ?apiKey=<key>.

Interactive Test Console (`GET /`)

Opening the root URL in any browser launches a dark-themed console with:

Windows panel — live list of all open windows; click to select and auto-load its element tree
Elements panel — nested element tree flattened with indentation; onscreen-only toggle; ControlType filter; click any element to select it
Command builder — grouped action buttons covering every action: Click, Text, Keys, State, Scroll, Toggle, Select, Window, Range/Slider, Grid/Table, Transform, Wait, Capture, AI Vision; Value input (multiline, Ctrl+Enter to execute) with context-sensitive hints; ▶ Execute button
AI Vision buttons — status, describe, ask, file; requires model loaded on the Model tab
Format selector — dropdown in the header (JSON / HTML / Text / PDF); all requests use the selected format; format demo links (help, status, windows) open directly in a new tab in the chosen format
Scene Editor link — opens the browser-based canvas editor in a new tab
Response log — newest result at top; captures rendered as inline images (click to zoom); PDF responses shown as an "Open PDF" link (browser-native rendering)

Format negotiation

Every endpoint adapts its response to whatever format the caller can consume, selected by priority:

URL file extension — append .json, .html, .txt, or .pdf to any path
?format= query parameter — html, text, json, or pdf
Accept request header — text/html, text/plain, application/json, or application/pdf
Default: html
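
The priority order can be modelled in a few lines. The sketch below is a client-side illustration of the rule as listed, not the server's implementation; in particular, how the server treats unknown values or Accept q-values is not documented here, so this version simply ignores them.

```python
from urllib.parse import parse_qs, urlsplit

EXTENSIONS = {".json": "json", ".html": "html", ".txt": "text", ".pdf": "pdf"}
MIME_TYPES = {
    "text/html": "html",
    "text/plain": "text",
    "application/json": "json",
    "application/pdf": "pdf",
}


def resolve_format(url, accept=None):
    """Return the format a request would select under the documented priority."""
    parts = urlsplit(url)
    # 1. URL file extension wins over everything else.
    for extension, fmt in EXTENSIONS.items():
        if parts.path.lower().endswith(extension):
            return fmt
    # 2. ?format= query parameter.
    requested = parse_qs(parts.query).get("format", [""])[0].lower()
    if requested in MIME_TYPES.values():
        return requested
    # 3. Accept header: first recognised media type, q-values ignored.
    for item in (accept or "").split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in MIME_TYPES:
            return MIME_TYPES[media_type]
    # 4. Default.
    return "html"
```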

# URL extension (highest priority — works even if the AI cannot set headers or query params)
curl http://localhost:8080/status.json
curl http://localhost:8080/help.txt
curl http://localhost:8080/windows.html
curl http://localhost:8080/status.pdf --output status.pdf

# ?format= query parameter
curl "http://localhost:8080/ping?format=text"
curl "http://localhost:8080/ping?format=json"

# Accept header
curl -H "Accept: application/json"  http://localhost:8080/ping
curl -H "Accept: application/pdf"   http://localhost:8080/help --output help.pdf

# HTML response (default — works in any browser or AI that can fetch a page)
curl http://localhost:8080/ping

HTML includes a <pre> block for human readability and an embedded <script type="application/json" id="apex-result"> block containing the full result as JSON — allowing any AI that can fetch a webpage to extract structured data without a vision model.
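
A caller that can only fetch HTML can recover the structured result from that script block. The following sketch uses Python's standard `html.parser`; the class and function names are hypothetical, and the sample page in the usage note is made up for illustration.

```python
import json
from html.parser import HTMLParser


class ApexResultParser(HTMLParser):
    """Collect the text inside <script id="apex-result">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "apex-result":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_result(html_text):
    """Return the embedded JSON result as a dict."""
    parser = ApexResultParser()
    parser.feed(html_text)
    parser.close()
    return json.loads("".join(parser.chunks))
```

For example, `extract_result('<pre>pong</pre><script type="application/json" id="apex-result">{"success": true}</script>')` yields a dict with `success` set to `True`. If the page has no such block, `json.loads` raises `JSONDecodeError`.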

PDF is a valid A4 document using the built-in Courier font (no external dependencies). Useful for AI systems that can only accept PDF attachments.

GET access to command endpoints

All command endpoints accept both POST (JSON body) and GET (query string parameters), so any command can be expressed as a plain URL — no request body required:

# Find a window via GET
curl "http://localhost:8080/find?window=Notepad"

# Execute an action via GET
curl "http://localhost:8080/exec?action=gettext"

# Combine with URL extension for full URL-only access
curl "http://localhost:8080/find.json?window=Notepad&id=15"
curl "http://localhost:8080/exec.pdf?action=describe" --output result.pdf

GET parameter names match the JSON body field names: window, id / automationId, name / elementName, type / searchType, action, value, onscreen, depth, prompt, model, proj.
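
Since every command can be written as a URL, a small builder is enough to drive the API from code. This is a sketch with a hypothetical `command_url` helper; it percent-encodes values and lower-cases booleans to match the `onscreen=true` style used in this README.

```python
from urllib.parse import quote, urlencode


def command_url(route, fmt=None, base="http://localhost:8080", **params):
    """Build a GET URL for a command route such as find, exec or elements."""
    path = f"/{route}" + (f".{fmt}" if fmt else "")
    cleaned = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
        if value is not None  # skip parameters the caller left unset
    }
    query = urlencode(cleaned, quote_via=quote)
    return base + path + (f"?{query}" if query else "")


print(command_url("find", "json", window="Notepad", id=15))
print(command_url("exec", action="type", value="Hello World"))
print(command_url("elements", onscreen=True, depth=2))
```

The API key still has to be supplied separately, either as the `X-Api-Key` header or as an extra `apiKey` parameter.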

/elements-specific: depth=N limits tree depth (truncated nodes show childCount); id=<numericId> expands from a previously-mapped element without clearing the rest of the map.
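
A depth-limited scan is typically followed by drilling into the truncated nodes. The sketch below finds them in a parsed `/elements` result. It assumes that a truncated node carries `childCount` and has no `children` entries, which is an interpretation of the description above rather than a documented schema; the sample tree is invented.

```python
def truncated_nodes(node):
    """Yield (id, childCount) for nodes whose children were cut off by depth=N."""
    children = node.get("children") or []
    if node.get("childCount", 0) > 0 and not children:
        yield node["id"], node["childCount"]
    for child in children:
        yield from truncated_nodes(child)


# Hypothetical depth-limited result, for illustration only.
tree = {
    "id": 1, "controlType": "Window", "name": "Shop",
    "children": [
        {"id": 12, "controlType": "Pane", "name": "Sidebar", "childCount": 4},
        {"id": 20, "controlType": "Pane", "name": "Content", "children": [
            {"id": 31, "controlType": "List", "name": "Results", "childCount": 9},
            {"id": 32, "controlType": "Button", "name": "Next", "children": []},
        ]},
    ],
}

for element_id, count in truncated_nodes(tree):
    print(f"/elements?id={element_id}&depth=2&onscreen=true  ({count} hidden children)")
```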

Response format

All endpoints return the same canonical structure:

{
  "success": true,
  "action": "ping",
  "data":   { "key": "value", ... },
  "error":  null,
  "error_data": null
}

HTTP status: 200 on success, 400 on error.

error_data is an additive object populated on failures (null when there is no error). Its shape is action-specific — for example, action-execution failures may carry failed_pattern, supported_patterns, element_state, and a remediation hint; waitfor timeouts carry timeout_ms, predicate, property, expected, and last_observed; wait-window timeouts carry last_observed_titles. Existing callers that only read success / data / error continue to work unchanged.
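
Client code can treat the envelope uniformly and surface `error_data` only when it is present. This is a minimal sketch; `ApexError` and `unwrap` are hypothetical names, and the failing body in the usage note is an invented example shaped after the waitfor fields listed above.

```python
import json


class ApexError(Exception):
    """Raised when the envelope reports success = false."""

    def __init__(self, action, message, error_data=None):
        super().__init__(f"{action}: {message}")
        self.action = action
        self.error_data = error_data or {}


def unwrap(body):
    """Return `data` on success, otherwise raise ApexError with error_data attached."""
    envelope = json.loads(body)
    if envelope.get("success"):
        return envelope.get("data")
    raise ApexError(
        envelope.get("action"),
        envelope.get("error"),
        envelope.get("error_data"),
    )
```

A caller can then branch on the structured details, for example `except ApexError as exc: print(exc.error_data.get("last_observed"))`, while older code that ignores `error_data` keeps working.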

gettext and getvalue responses include a source field inside data — one of TextPattern, ValuePattern, LegacyIAccessible, or Name — naming the UIA accessor that produced the text. Inside batch step results this appears as extras.source.

Element nodes returned by /elements and /find include className alongside id, controlType, name, automationId, frameworkId, isEnabled, isOffscreen, and boundingRectangle. match= searches className along with the other text fields.

System / utility routes

# Unauthenticated liveness probe — safe for external monitoring (the only route that doesn't require the API key)
curl http://localhost:8080/health

# Authenticated health check
curl -H "X-Api-Key: <key>" http://localhost:8080/ping

# Per-route request counters
curl -H "X-Api-Key: <key>" http://localhost:8080/metrics

# Recent WindowMonitor activity (window open/close/rename, optional element add/remove). Append .json for raw JSON.
curl -H "X-Api-Key: <key>" http://localhost:8080/winmon/log.json
# Drain the buffer
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/winmon/clear.json

# System information (OS, machine, user, CPU, CLR)
curl -H "X-Api-Key: <key>" http://localhost:8080/sysinfo

# All environment variables
curl -H "X-Api-Key: <key>" http://localhost:8080/env

# Directory listing (defaults to current working directory)
curl -H "X-Api-Key: <key>" http://localhost:8080/ls
curl -H "X-Api-Key: <key>" "http://localhost:8080/ls?path=C:\Users"

# Trigger the bundled integration test runner (TestApplications/TestRunner)
# Requires TestRunnerExePath (or APEX_TEST_RUNNER_EXE_PATH) to be configured.
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/run-tests

# Gracefully stop the HTTP server
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/shutdown

# Run a shell command (cmd.exe /c); 30-second timeout
# Requires EnableShellRun = true in appsettings.json or APEX_ENABLE_SHELL_RUN=true
curl -H "X-Api-Key: <key>" "http://localhost:8080/run?cmd=whoami"
curl -H "X-Api-Key: <key>" "http://localhost:8080/run?command=whoami"
curl -H "X-Api-Key: <key>" -X POST http://localhost:8080/run \
     -H "Content-Type: application/json" \
     -d '{"command":"dir C:\\"}'

/run response data fields: cmd, stdout, stderr, exit_code.

Security note: /run executes arbitrary commands as the process user. It is disabled by default and should only be enabled in trusted, authenticated environments.

UI automation routes

# Every route in this section requires the API key: add -H "X-Api-Key: <key>" to each command.
# List all open windows (with stable IDs)
curl http://localhost:8080/windows

# Get current state
curl http://localhost:8080/status

# List all elements in the current window (nested JSON with IDs and bounding rectangles)
curl http://localhost:8080/elements

# Onscreen elements only — prunes offscreen subtrees for maximum token efficiency
curl "http://localhost:8080/elements?onscreen=true"

# Limit depth — truncated nodes show "childCount" so you know where to drill in
curl "http://localhost:8080/elements?depth=2&onscreen=true"

# Expand a specific node by numeric ID (preserves the rest of the map — IDs stay stable)
curl "http://localhost:8080/elements?id=<elementId>&depth=2&onscreen=true"

# Filter by ControlType
curl "http://localhost:8080/elements?type=Button"

# Text search across Name, AutomationId, Value, and ClassName — returns only
# matching branches, each wrapped in its ancestor path, with `depth` levels below.
curl "http://localhost:8080/elements?match=add+to+cart&onscreen=true&depth=1"

# Collapse identity-less single-child Pane/Group/Custom wrapper chains
# (named containers and anything with an AutomationId are preserved).
curl "http://localhost:8080/elements?onscreen=true&collapseChains=true"

# Add an ancestor breadcrumb ("path") to every emitted node.
curl "http://localhost:8080/elements?onscreen=true&includePath=true"

# Opt into Value pattern + HelpText (omitted by default to keep payloads small).
curl "http://localhost:8080/elements?onscreen=true&properties=extra"

# All filters combined
curl "http://localhost:8080/elements?depth=3&onscreen=true&type=Button&collapseChains=true&match=submit&properties=extra"

# Render the current window's UI element tree as a colour-coded PNG (returns base64)
curl http://localhost:8080/uimap

# Help
curl http://localhost:8080/help

# Find a window and element by title/name
curl -X POST http://localhost:8080/find \
     -H "Content-Type: application/json" \
     -d '{"window":"Notepad","id":"15"}'

# Find by element name with ControlType filter
curl -X POST http://localhost:8080/find \
     -H "Content-Type: application/json" \
     -d '{"window":"Notepad","name":"Text Editor","type":"Edit"}'

# Find by numeric window/element IDs (fast, no fuzzy search)
curl -X POST http://localhost:8080/find \
     -H "Content-Type: application/json" \
     -d '{"window":42,"id":105}'

# Visual Studio handoff targets:
# F5/debug: find name="Debug Target" type="SplitButton", then exec keys {F5}
# Ctrl+F5/no-debug: find name="Start Without Debugging" type="Button", then exec keys Ctrl+{F5}

# Type text into the found element
curl -X POST http://localhost:8080/execute \
     -H "Content-Type: application/json" \
     -d '{"action":"type","value":"Hello World"}'

# Click a button
curl -X POST http://localhost:8080/execute \
     -H "Content-Type: application/json" \
     -d '{"action":"click"}'

# Read text from element
curl -X POST http://localhost:8080/execute \
     -H "Content-Type: application/json" \
     -d '{"action":"gettext"}'

# Capture current element (returns base64 PNG in data field)
curl -X POST http://localhost:8080/capture

# Capture full screen
curl -X POST http://localhost:8080/capture \
     -H "Content-Type: application/json" \
     -d '{"action":"screen"}'

# Capture multiple elements stitched into one image
curl -X POST http://localhost:8080/capture \
     -H "Content-Type: application/json" \
     -d '{"action":"elements","value":"42,105,106"}'

# OCR the found element
curl -X POST http://localhost:8080/ocr

# OCR a region (x,y,width,height) within the element
curl -X POST http://localhost:8080/ocr \
     -H "Content-Type: application/json" \
     -d '{"value":"0,0,300,50"}'

# Check AI model status
curl http://localhost:8080/ai/status

# Load a vision/audio LLM (run once; model stays loaded until the server restarts)
curl -X POST http://localhost:8080/ai/init \
     -H "Content-Type: application/json" \
     -d '{"model":"C:\\models\\vision.gguf","proj":"C:\\models\\mmproj.gguf"}'

# Describe the currently selected UI element using the vision model
# Captures the element as an image and sends it to the LLM
curl -X POST http://localhost:8080/ai/describe

# Describe with a custom prompt
curl -X POST http://localhost:8080/ai/describe \
     -H "Content-Type: application/json" \
     -d '{"prompt":"List every button you can see."}'

# Ask a specific question about the current element
curl -X POST http://localhost:8080/ai/ask \
     -H "Content-Type: application/json" \
     -d '{"prompt":"Is there an error message visible?"}'

# Describe an image file on disk
curl -X POST http://localhost:8080/ai/file \
     -H "Content-Type: application/json" \
     -d '{"value":"C:\\screenshots\\app.png","prompt":"What dialog is shown?"}'

Request body fields

Field	Aliases	Description
`window`	—	Window title (partial match) or numeric ID from `/windows`
`automationId`	`id`	Element AutomationId string or numeric ID from `/elements`
`elementName`	`name`	Element Name property (fallback if `id` not given)
`searchType`	`type`	ControlType filter (`All` or e.g. `Button`)
`action`	—	Action name (see list below)
`value`	—	Value/input for the action
`model`	`modelPath`	AI: path to LLM `.gguf` file
`proj`	`mmProjPath`	AI: path to multimodal projector `.gguf` file
`prompt`	—	AI: question or instruction text
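
When requests are assembled from several sources, it can help to normalise aliases to one spelling before sending. The sketch below maps the aliases from the table to their canonical field names on the client side. How the server behaves when both an alias and its canonical name are supplied is not documented here, so this version rejects that case.

```python
ALIASES = {
    "id": "automationId",
    "name": "elementName",
    "type": "searchType",
    "modelPath": "model",
    "mmProjPath": "proj",
}


def normalize(body):
    """Return a copy of `body` with alias keys replaced by canonical field names."""
    result = {}
    for key, value in body.items():
        canonical = ALIASES.get(key, key)
        if canonical in result:
            raise ValueError(f"{key!r} duplicates {canonical!r}")
        result[canonical] = value
    return result


print(normalize({"window": "Notepad", "name": "Text Editor", "type": "Edit"}))
```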

Usage — AI Drawing

The drawing engine renders GDI+ shapes to a base64 PNG on demand. Every shape type supports colour, opacity, fill/stroke, and dashed lines.

Quick draw

# Draw a filled blue circle with white text
# The drawing is a JSON document encoded as a string inside "value". Keep it on one line:
# strict JSON parsers reject raw newlines inside a string.
curl -X POST http://localhost:8080/draw \
     -H "Content-Type: application/json" \
     -d '{"value":"{\"canvas\":\"blank\",\"width\":400,\"height\":300,\"shapes\":[{\"type\":\"circle\",\"x\":200,\"y\":150,\"r\":80,\"color\":\"royalblue\",\"fill\":true},{\"type\":\"text\",\"x\":200,\"y\":140,\"text\":\"Hello!\",\"color\":\"white\",\"font_size\":20,\"font_bold\":true,\"align\":\"center\"}]}"}'

# Render the built-in space scene
curl http://localhost:8080/draw/demo

# Show it as a full-screen overlay for 6 seconds
curl "http://localhost:8080/draw/demo?overlay=true&ms=6000"

The data.result field contains the base64 PNG. The web console renders it inline.

Shape types

Type	Key fields	Description
`rect`	`x y w h corner_radius`	Rectangle (rounded if `corner_radius > 0`)
`ellipse`	`x y w h`	Ellipse inside bounding box
`circle`	`x y r`	Circle — x,y is the centre
`line`	`x y x2 y2`	Straight line
`arrow`	`x y x2 y2`	Line with arrowhead at (x2,y2)
`polygon`	`points[]`	Closed polygon — flat array of x,y pairs
`triangle`	`x y w h`	Triangle — bounding-box anchored, top-centre apex
`arc`	`x y w h start_angle sweep_angle`	Open arc — angles in degrees, clockwise from 3 o'clock
`text`	`x y text font_size font_bold align background`	Rendered text

Common fields on all shapes: color, fill (bool), stroke_width, opacity (0–1), dashed (bool), rotation (degrees, centre-origin).

Canvas values: blank (transparent), white, black, screen (live screenshot), window (current window), element (current element).
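
In the Quick draw request, the drawing travels as a JSON document encoded inside the `value` string, which makes hand-escaping error-prone. Generating the body programmatically avoids that. This is a sketch with a hypothetical `draw_body` helper; the field names are taken from the tables above.

```python
import json


def draw_body(shapes, canvas="blank", width=400, height=300):
    """Return the /draw request body: the drawing is JSON-encoded twice."""
    drawing = {"canvas": canvas, "width": width, "height": height, "shapes": shapes}
    # Inner dumps produces the string for "value"; outer dumps escapes it.
    return json.dumps({"value": json.dumps(drawing)})


body = draw_body([
    {"type": "circle", "x": 200, "y": 150, "r": 80, "color": "royalblue", "fill": True},
    {"type": "text", "x": 200, "y": 140, "text": "Hello!", "color": "white",
     "font_size": 20, "font_bold": True, "align": "center"},
])
print(body)
```

The printed body can be passed to curl with `-d` or sent with any HTTP client, together with `Content-Type: application/json`.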

Usage — Layered Scene Editor

The scene system lets AI agents and users collaborate on persistent, structured drawings. Every shape has a stable ID; coordinates are always accurate; the AI can read them back and refine the composition at any time.

REST API (`/scenes/*`)

# Create a scene
curl -X POST http://localhost:8080/scenes \
     -H "Content-Type: application/json" \
     -d '{"name":"My Scene","width":800,"height":600,"background":"#1a1a2e"}'
# → data.scene contains the full scene with id

# List scenes
curl http://localhost:8080/scenes

# Get a scene
curl http://localhost:8080/scenes/{id}

# Add a layer
curl -X POST http://localhost:8080/scenes/{id}/layers \
     -H "Content-Type: application/json" \
     -d '{"name":"Background"}'

# Add a shape to a layer
curl -X POST http://localhost:8080/scenes/{id}/layers/{lid}/shapes \
     -H "Content-Type: application/json" \
     -d '{"shape":{"type":"circle","x":400,"y":300,"r":80,"color":"royalblue","fill":true},"name":"Planet"}'

# Render the scene to a PNG
curl http://localhost:8080/scenes/{id}/render
# → data.result is base64 PNG

# Patch shape geometry (after user drags it — never clobbers color/style)
curl -X PATCH http://localhost:8080/scenes/{id}/layers/{lid}/shapes/{sid} \
     -H "Content-Type: application/json" \
     -d '{"x":420,"y":310}'

# Move a shape to a different layer
curl -X POST http://localhost:8080/scenes/{id}/shapes/{sid}/move \
     -H "Content-Type: application/json" \
     -d '{"target_layer_id":"{newLayerId}"}'

# Delete a shape / layer / scene
curl -X DELETE http://localhost:8080/scenes/{id}/layers/{lid}/shapes/{sid}
curl -X DELETE http://localhost:8080/scenes/{id}/layers/{lid}
curl -X DELETE http://localhost:8080/scenes/{id}

Full route reference

Method	Route	Description
`GET` / `POST`	`/scenes`	List all scenes / create scene
`GET` / `PUT` / `PATCH` / `DELETE`	`/scenes/{id}`	Get / update meta / delete scene
`GET`	`/scenes/{id}/render`	Render scene → base64 PNG
`GET` / `POST`	`/scenes/{id}/layers`	List layers / add layer
`GET` / `PUT` / `PATCH` / `DELETE`	`/scenes/{id}/layers/{lid}`	Get / update / delete layer
`GET` / `POST`	`/scenes/{id}/layers/{lid}/shapes`	List shapes / add shape
`GET` / `PUT` / `PATCH` / `DELETE`	`/scenes/{id}/layers/{lid}/shapes/{sid}`	Get / replace / patch geometry / delete shape
`POST`	`/scenes/{id}/shapes/{sid}/move`	Move shape to a different layer
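
The PATCH route is meant for geometry updates, so a client that tracks a shape before and after a drag only needs to send the fields that moved. The sketch below computes such a patch. The list of geometry keys is collected from the Shape types table; which keys the server actually treats as geometry is an assumption.

```python
GEOMETRY_KEYS = ("x", "y", "w", "h", "r", "x2", "y2")  # assumption: from the shape table


def geometry_patch(before, after):
    """Return only the geometry fields that changed, leaving style untouched."""
    return {
        key: after[key]
        for key in GEOMETRY_KEYS
        if key in after and after[key] != before.get(key)
    }


before = {"type": "circle", "x": 400, "y": 300, "r": 80, "color": "royalblue", "fill": True}
after = {"type": "circle", "x": 420, "y": 310, "r": 80, "color": "red", "fill": True}
print(geometry_patch(before, after))
```

The colour change in `after` is deliberately dropped: style edits would go through PUT (replace) rather than the geometry PATCH.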

Scene Editor — WinForms (Tools → Scene Editor)

The desktop editor opens a standalone window with:

Scene list — create, select, or delete scenes
Toolbar — arrow (select/move), rect, ellipse, circle, line, text, delete
Canvas — double-buffered; drag shapes to reposition; draw new shapes by clicking and dragging; mouse wheel to zoom
Layers panel — add/delete layers; click to select the active layer; eye icon to toggle visibility
Properties panel — x, y, w, h, r fields for the selected shape; edits commit to the store immediately
Keyboard shortcuts — V/R/E/C/L/T for tools, Delete to remove selected shape, Escape to deselect

All changes are persisted to disk (%LOCALAPPDATA%\ApexComputerUse\scenes\{id}.json) and immediately available via the REST API.

Scene Editor — Browser (`GET /editor`)

Open http://localhost:8080/editor?apiKey=<key> for the same editing experience in a browser:

HTML5 Canvas renderer for all 7 shape types
Click-and-drag to place shapes; click to select and drag to move
Layer panel with add/delete/visibility toggle
Properties panel showing live coordinates
Keyboard shortcuts (V/R/E/C/L/T, Delete, Escape)
All changes sync to the same /scenes/* REST API

Usage — Telegram Bot

After starting the bot, send commands to it in any Telegram chat:

/find window=Notepad id=15
/find window=Calculator name=Equals type=Button
/exec action=type value="Hello from Telegram"
/exec action=click
/exec action=gettext
/ocr
/ocr value=0,0,300,50
/status
/windows
/elements
/elements type=Button
/help

Key=value pairs support quoted values for multi-word strings:

/find window="My Application" name="Save Button"
/exec action=type value="some text with spaces"
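
The key=value syntax with optional quotes is easy to reproduce when generating or testing bot commands. The following is an illustrative parser, not the bot's own code. It uses a regular expression rather than `shlex`, because `shlex` in POSIX mode treats backslashes as escapes and would mangle Windows paths.

```python
import re

# key="quoted value"  or  key=bare-value
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_command(text):
    """Split '/cmd key=value key="two words"' into (command, args)."""
    command, _, rest = text.strip().partition(" ")
    args = {key: quoted or bare for key, quoted, bare in PAIR.findall(rest)}
    return command.lstrip("/"), args


print(parse_command('/find window="My Application" name="Save Button"'))
print(parse_command(r"/ai action=init model=C:\models\vision.gguf proj=C:\models\mmproj.gguf"))
```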

AI commands work the same way:

/ai action=status
/ai action=init model=C:\models\vision.gguf proj=C:\models\mmproj.gguf
/ai action=describe
/ai action=describe prompt="List every button you can see."
/ai action=ask prompt="Is there an error message visible?"
/ai action=file value=C:\screenshots\app.png prompt="What dialog is shown?"

Usage — PowerShell

The app exposes a named pipe server (default name ApexComputerUse). Start it from the Remote Control group box, then use the bundled ApexComputerUse.psm1 module:

# Import the module
Import-Module .\Scripts\ApexComputerUse.psm1

# Connect to the pipe (must be started in the app first)
Connect-FlaUI                        # default pipe name: ApexComputerUse
Connect-FlaUI -PipeName MyPipe -TimeoutMs 10000

# Discovery
Get-FlaUIWindows                     # list all open window titles
Get-FlaUIStatus                      # current window/element state
Get-FlaUIHelp                        # command reference
Get-FlaUIElements                    # list all elements in current window
Get-FlaUIElements -Type Button       # filter by ControlType

# Find
Find-FlaUIElement -Window 'Notepad'
Find-FlaUIElement -Window 'Notepad' -Name 'Text Editor' -Type Edit
Find-FlaUIElement -Window 'Calculator' -Id 'num5Button'

# Execute actions
Invoke-FlaUIAction -Action click
Invoke-FlaUIAction -Action type  -Value 'Hello from PowerShell'
Invoke-FlaUIAction -Action gettext
Invoke-FlaUIAction -Action screenshot

# OCR
Invoke-FlaUIOcr
Invoke-FlaUIOcr -Region '0,0,300,50'

# AI
Invoke-FlaUIAi -SubCommand init     -Model 'C:\models\v.gguf' -Proj 'C:\models\p.gguf'
Invoke-FlaUIAi -SubCommand status
Invoke-FlaUIAi -SubCommand describe -Prompt 'What buttons are visible?'
Invoke-FlaUIAi -SubCommand ask      -Prompt 'Is there an error message?'
Invoke-FlaUIAi -SubCommand file     -Value 'C:\screen.png' -Prompt 'Describe this.'

# Send raw JSON (advanced)
Send-FlaUICommand @{ command='find'; window='Notepad'; elementName='Text Editor' }

# Disconnect
Disconnect-FlaUI

PowerShell cmdlet reference

Cmdlet	Key Parameters	Description
`Connect-FlaUI`	`PipeName`, `TimeoutMs`	Connect to the pipe server
`Disconnect-FlaUI`	—	Close the connection
`Send-FlaUICommand`	`Request` (hashtable)	Send a raw JSON command
`Get-FlaUIWindows`	—	List open window titles
`Get-FlaUIStatus`	—	Show current window/element
`Get-FlaUIHelp`	—	Server command reference
`Get-FlaUIElements`	`Type`	List elements in current window
`Find-FlaUIElement`	`Window`, `Id`, `Name`, `Type`	Find a window and element
`Invoke-FlaUIAction`	`Action`, `Value`	Execute action on current element
`Invoke-FlaUIOcr`	`Region`	OCR current element or region
`Invoke-FlaUICapture`	`Target`, `Value`	Capture screen/window/element(s); returns base64 PNG in `data`
`Invoke-FlaUIAi`	`SubCommand`, `Model`, `Proj`, `Prompt`, `Value`	Multimodal AI sub-commands

The pipe connection is session-based: window and element state are preserved across calls within a single Connect-FlaUI / Disconnect-FlaUI session. Use Find-FlaUIElement to select a target, then call Invoke-FlaUIAction as many times as needed without re-finding.

Usage — cmd.exe

Use Scripts\apex.cmd, a batch helper that wraps the HTTP API in a simpler positional syntax. It requires the HTTP server to be running and curl to be available (bundled with Windows 10 version 1803 and later).

:: Optional: override port (default is 8080)
set APEX_HTTP_PORT=8080

:: Discovery
apex windows
apex status
apex elements
apex elements Button
apex help

:: Find a window and element
apex find Notepad
apex find "My App" id=btnOK
apex find Notepad name="Text Editor" type=Edit

:: Execute actions
apex exec click
apex exec type value=Hello
apex exec gettext
apex exec screenshot

:: Capture
apex capture
apex capture action=screen
apex capture action=window
apex capture action=elements value=42,105,106

:: OCR
apex ocr
apex ocr 0,0,300,50

:: AI
apex ai status
apex ai init model=C:\models\v.gguf proj=C:\models\p.gguf
apex ai describe
apex ai describe prompt="What do you see?"
apex ai ask prompt="Is there an error message?"
apex ai file value=C:\screen.png prompt="Describe this."

Add Scripts\ to your PATH (or copy apex.cmd next to your scripts) to use it from any directory.

Usage — AI (Multimodal)

The AI command set is backed by MtmdHelper, which uses LLamaSharp to run a local multimodal (vision + audio) LLM. No cloud API is required.

Setup

Download a vision-capable GGUF model and its multimodal projector (e.g. LFM2.5-VL from LM Studio) and note the paths to both .gguf files, or use Download All on the Model tab. Then call ai init before any inference commands.

AI sub-commands

Sub-action	Required params	Optional params	Description
`init`	`model=<path>` `proj=<path>`	—	Load the LLM and projector into memory
`status`	—	—	Report whether the model is loaded and which modalities it supports
`describe`	— (uses current element)	`prompt=<text>`	Capture the current UI element as an image and ask the vision model to describe it
`ask`	`prompt=<text>`	—	Ask a specific question about the current UI element (captures element image)
`file`	`value=<file path>`	`prompt=<text>`	Send an image or audio file from disk to the model

Note: describe, ask, and file require a prior find command to select a window/element. The model must be initialized with init before any inference call. Each inference call starts completely fresh — no chat history is retained between calls.

AI Vision in the test console

The HTTP test console (GET /) has a dedicated AI Vision button group (purple-tinted):

Button	Endpoint	Value field
status	`GET /ai/status`	—
describe	`POST /ai/describe`	Optional prompt (e.g. `list all buttons`)
ask	`POST /ai/ask`	Required question (e.g. `what number is shown?`)

Select an element in the Elements panel first, then click describe or ask. The console shows a "Running vision model…" notice immediately and updates with the result when inference completes.

UI Map Renderer

The UI Map Renderer scans the current window's accessibility tree and renders every element's bounding rectangle as a colour-coded overlay. Each control type gets a deterministic, visually distinct colour. Element names are drawn inside the bounding box.

Via HTTP API

# Returns base64-encoded PNG of the current window's element tree
curl http://localhost:8080/uimap

Requires a prior find call to select a window. The response data.result field contains the base64 PNG — identical format to the /capture endpoints. In the interactive test console, the UI map button (in the Capture group) renders the result inline in the response log.

Via the desktop UI

Tools → Render UI Map draws the overlay directly on screen for 5 seconds (press Escape to dismiss early) and offers to save it as a PNG file. The live on-screen overlay is only available from the desktop UI; the HTTP API returns the rendered PNG instead.

Tools → Output UI Map logs the raw nested JSON element tree to the console tab — useful for inspecting the tree structure or copying it for use with an AI agent.

Element JSON includes bounding rectangles:

{
  "id": 105,
  "controlType": "Button",
  "name": "OK",
  "automationId": "btn_ok",
  "boundingRectangle": { "x": 120, "y": 340, "width": 80, "height": 30 },
  "children": []
}
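
Because every node carries its bounding rectangle, an agent can derive positions without a screenshot. This is a short sketch that walks a parsed tree and computes each element's centre point; `flatten` and `centre` are hypothetical helpers, and the sample node is the one shown above.

```python
def flatten(node, depth=0):
    """Yield (depth, node) for a node and all of its descendants, parents first."""
    yield depth, node
    for child in node.get("children", []):
        yield from flatten(child, depth + 1)


def centre(rect):
    """Return the integer centre (x, y) of a boundingRectangle."""
    return rect["x"] + rect["width"] // 2, rect["y"] + rect["height"] // 2


element = {
    "id": 105,
    "controlType": "Button",
    "name": "OK",
    "automationId": "btn_ok",
    "boundingRectangle": {"x": 120, "y": 340, "width": 80, "height": 30},
    "children": [],
}

for depth, node in flatten(element):
    print("  " * depth + f'{node["controlType"]} "{node["name"]}" '
          f'centre={centre(node["boundingRectangle"])}')
```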

Available Actions (exec/execute)

General

Action	Aliases	Value	Description
`click`	—	—	Smart click: Invoke → Toggle → SelectionItem → mouse fallback
`mouse-click`	`mouseclick`	—	Force mouse left-click (bypasses smart chain)
`middle-click`	`middleclick`	—	Middle-mouse-button click
`invoke`	—	—	Invoke pattern directly
`right-click`	`rightclick`	—	Right-click
`double-click`	`doubleclick`	—	Double-click
`click-at`	`clickat`	`x,y`	Click at pixel offset from element top-left
`drag`	—	`x,y`	Drag element to screen coordinates
`hover`	—	—	Move mouse over element
`highlight`	—	—	Draw orange highlight around element for 1 second
`focus`	—	—	Set keyboard focus
`keys`	—	text	Send keystrokes; supports `{CTRL}`, `{ALT}`, `{SHIFT}`, `{F5}`, `Ctrl+A`, `Alt+F4`, etc.
`screenshot`	`capture`	—	Save element image to `Desktop\Apex_Captures`
`describe`	—	—	Return full element property description (UIA properties — not AI vision)
`patterns`	—	—	List automation patterns supported by the element
`bounds`	—	—	Return bounding rectangle
`isenabled`	—	—	Returns `True` or `False`
`isvisible`	—	—	Returns `True` or `False`
`wait`	—	automationId	Wait for element with given AutomationId to appear
`wait-page-load`	`waitpageload`	seconds (default 10)	Poll window title until browser page finishes loading; returns page title on success

Visual Studio run buttons: for a test handoff, target name="Debug Target" with type="SplitButton" for the F5/debug path, and name="Start Without Debugging" with type="Button" for the Ctrl+F5/no-debug path. Prefer numeric element IDs after an /elements scan to avoid fuzzy matching entirely.

Wait

Action	Aliases	Value	Description
`waitfor`	—	see below	Poll the current element until predicate satisfied or timeout
`wait-window`	—	see below	Poll the desktop window list until a window title satisfies predicate

waitfor parameters: predicate=<equals|contains|not-empty|visible|gone>, optional property=<value|text|name|isvisible|isenabled>, optional expected=<text>, optional timeout=<ms> (default 10000), optional interval=<ms> (default 200, min 50). visible and gone are element-level — they ignore property and expected. The success response includes elapsed_ms, property, and predicate inside data. On timeout, error_data.last_observed carries the value at the last poll ("offscreen"/"visible" for visible, "present" for gone-while-still-present, otherwise the property string).

wait-window parameters: predicate=<equals|contains|not-empty|gone>, expected=<title-substring> (required for all but not-empty), optional timeout=<ms> (default 10000), optional interval=<ms> (default 250). On match, the new window is registered in the window map and set as the current window — the next /find or /elements call resolves it without needing a window= field. Timeout error_data.last_observed_titles is the array of titles seen at the last poll, useful for debugging.
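
The polling behaviour described for waitfor can be pictured with a small model. The sketch below is a client-side illustration of the documented semantics (predicate, timeout, interval with a 50 ms floor, `last_observed` on timeout), not the server's code. It covers only the value predicates `equals`, `contains` and `not-empty`; `read` is a hypothetical callable that returns the current property value.

```python
import time


def wait_for(read, predicate, expected=None, timeout_ms=10_000, interval_ms=200):
    """Poll `read()` until the predicate holds or the timeout elapses."""
    interval_ms = max(interval_ms, 50)  # documented minimum interval
    checks = {
        "equals": lambda value: value == expected,
        "contains": lambda value: expected in value,
        "not-empty": lambda value: bool(value),
    }
    check = checks[predicate]
    start = time.monotonic()
    while True:
        observed = read()
        elapsed_ms = int((time.monotonic() - start) * 1000)
        if check(observed):
            return {"success": True, "elapsed_ms": elapsed_ms, "predicate": predicate}
        if elapsed_ms >= timeout_ms:
            return {
                "success": False,
                "error_data": {
                    "timeout_ms": timeout_ms,
                    "predicate": predicate,
                    "expected": expected,
                    "last_observed": observed,
                },
            }
        time.sleep(interval_ms / 1000)
```

The value is always read at least once, so a predicate that already holds returns immediately, and the value from the final poll is what ends up in `last_observed`.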

# Wait for a debug console window to appear after launching an app
curl -X POST http://localhost:8080/exec -H "X-Api-Key: <key>" \
  -H "Content-Type: application/json" \
  -d '{"action":"wait-window","predicate":"contains","expected":"Debug Console","timeout":15000}'

# Wait for the current text element to contain a specific value
curl -X POST http://localhost:8080/exec -H "X-Api-Key: <key>" \
  -H "Content-Type: application/json" \
  -d '{"action":"waitfor","predicate":"contains","property":"value","expected":"OK","timeout":5000}'

# Wait for the current element to become visible
curl -X POST http://localhost:8080/exec -H "X-Api-Key: <key>" \
  -H "Content-Type: application/json" \
  -d '{"action":"waitfor","predicate":"visible","timeout":3000}'

Batch (multiple actions in one /exec call)

Send actions=[...] to /exec to run several commands sequentially in one round trip. Each entry is a full sub-request — cmd defaults to "execute", so simple action lists need only action and (where relevant) value. The optional stop_on_error field defaults to true: the first failing step ends the batch and remaining steps are skipped.

curl -X POST http://localhost:8080/exec -H "X-Api-Key: <key>" \
  -H "Content-Type: application/json" \
  -d '{"actions":[
        {"action":"clear"},
        {"action":"type","value":"hello"},
        {"action":"keys","value":"{CTRL}s"}
      ]}'

The response's data.result contains stop_on_error, total_steps, executed, succeeded, and a results array. Each entry has step, cmd, action, success, data, extras (e.g. source for gettext/getvalue steps), and message.
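
Building the batch and inspecting its result can both be done in a few lines. This is a sketch with hypothetical helper names; the sample response is invented to match the field list above, and whether `step` is zero- or one-based is not stated, so the sample value is only a placeholder.

```python
import json


def batch_body(steps, stop_on_error=True):
    """Build an /exec batch body from (action, value) pairs; value may be None."""
    actions = [
        {"action": action} if value is None else {"action": action, "value": value}
        for action, value in steps
    ]
    return json.dumps({"actions": actions, "stop_on_error": stop_on_error})


def first_failure(response):
    """Return (step, action, message) for the first failed step, or None."""
    for entry in response["data"]["result"]["results"]:
        if not entry["success"]:
            return entry["step"], entry["action"], entry["message"]
    return None


print(batch_body([("clear", None), ("type", "hello"), ("keys", "{CTRL}s")]))
```

With `stop_on_error` left at its default, `executed` is smaller than `total_steps` whenever `first_failure` returns a step, because the remaining steps are skipped.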

Text / Value

Action	Aliases	Value	Description
`type`	`enter`	text	Enter text (smart: Value pattern → keyboard)
`insert`	—	text	Type at current caret position
`gettext`	`text`	—	Smart read: Text pattern → Value → LegacyIAccessible → Name
`getvalue`	`value`	—	Smart read: Value → Text → LegacyIAccessible → Name
`setvalue`	—	text	Smart set: Value pattern (if writable) → RangeValue (if numeric) → keyboard
`clearvalue`	—	—	Set value to empty string via Value pattern
`appendvalue`	—	text	Append text to current value
`getselectedtext`	—	—	Get selected text via Text pattern
`selectall`	—	—	Ctrl+A
`copy`	—	—	Ctrl+C
`cut`	—	—	Ctrl+X
`paste`	—	—	Ctrl+V
`undo`	—	—	Ctrl+Z
`clear`	—	—	Select all and delete
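
The "smart read" order for gettext and getvalue is a fallback chain, and the accessor that succeeds is what the `source` field reports. The sketch below models that chain under stated assumptions: `accessors` is a hypothetical mapping from accessor name to a zero-argument reader, and treating an empty string or an exception as "try the next accessor" is a guess at the behaviour, not something the source confirms.

```python
GETTEXT_ORDER = ("TextPattern", "ValuePattern", "LegacyIAccessible", "Name")
GETVALUE_ORDER = ("ValuePattern", "TextPattern", "LegacyIAccessible", "Name")


def smart_read(accessors, order):
    """Try each accessor in order and report which one produced the text."""
    for source in order:
        reader = accessors.get(source)
        if reader is None:
            continue  # element does not support this accessor
        try:
            text = reader()
        except Exception:
            continue  # assumption: a failing accessor falls through to the next
        if text:
            return {"text": text, "source": source}
    return {"text": "", "source": None}


# Hypothetical element exposing only a Value pattern and a Name.
accessors = {"ValuePattern": lambda: "42", "Name": lambda: "Display"}
print(smart_read(accessors, GETTEXT_ORDER))
```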

Range / Slider

Action	Aliases	Value	Description
`setrange`	—	number	Set RangeValue pattern
`getrange`	—	—	Read current RangeValue
`rangeinfo`	—	—	Min / max / smallChange / largeChange

Toggle / CheckBox

Action	Aliases	Value	Description
`toggle`	—	—	Toggle CheckBox (cycles state)
`toggle-on`	`toggleon`	—	Set toggle to On
`toggle-off`	`toggleoff`	—	Set toggle to Off
`gettoggle`	—	—	Read current toggle state (`On` / `Off` / `Indeterminate`)

Expand / Collapse

Action	Aliases	Value	Description
`expand`	—	—	Expand via ExpandCollapse pattern
`collapse`	—	—	Collapse via ExpandCollapse pattern
`expandstate`	—	—	Read current ExpandCollapse state

Selection (SelectionItem / Selection)

Action	Aliases	Value	Description
`select`	—	item text	Select ComboBox/ListBox item by text
`select-item`	`selectitem`	—	Select current element via SelectionItem pattern
`addselect`	—	—	Add element to multi-selection
`removeselect`	—	—	Remove element from selection
`isselected`	—	—	Returns `True` or `False`
`getselection`	—	—	Get selected items from a Selection container
`select-index`	`selectindex`	n	Select ComboBox/ListBox item by zero-based index
`getitems`	—	—	List all items in a ComboBox or ListBox (newline-separated)
`getselecteditem`	—	—	Get currently selected item text

Window State

Action	Aliases	Value	Description
`minimize`	—	—	Minimize window
`maximize`	—	—	Maximize window
`restore`	—	—	Restore window to normal state
`windowstate`	—	—	Read current window visual state (Normal / Maximized / Minimized)

Transform (Move / Resize)

Action	Aliases	Value	Description
`move`	—	`x,y`	Move element via Transform pattern
`resize`	—	`w,h`	Resize element via Transform pattern

Scroll

Mouse scroll actions move the cursor to the element centre before firing the scroll event, so scrolling reliably lands in the browser content area rather than wherever the cursor happens to be.

Action	Aliases	Value	Description
`scroll-up`	`scrollup`	n (optional)	Move cursor to element centre, scroll up n clicks (default 3)
`scroll-down`	`scrolldown`	n (optional)	Move cursor to element centre, scroll down n clicks (default 3)
`scroll-left`	`scrollleft`	n (optional)	Move cursor to element centre, horizontal scroll left n clicks (default 3)
`scroll-right`	`scrollright`	n (optional)	Move cursor to element centre, horizontal scroll right n clicks (default 3)
`scrollinto`	`scrollintoview`	—	Scroll element into view
`scrollpercent`	—	`h,v`	Scroll to h%/v% position via Scroll pattern (0–100)
`getscrollinfo`	—	—	Scroll position and scrollable flags

Grid / Table

Action	Aliases	Value	Description
`griditem`	—	`row,col`	Get element description at grid cell
`gridinfo`	—	—	Row and column counts
`griditeminfo`	—	—	Row / column / span for a GridItem element

Capture

Returns a screen capture inline as a base64-encoded PNG in the data field. Supports four targets.

Target	Description
`element` (default)	Current element (requires a prior `find`)
`window`	Current window (requires a prior `find`)
`screen`	Full display
`elements`	Multiple elements by ID, stitched vertically into one image

For elements, provide comma-separated numeric IDs from a prior elements scan in the value field.

# Current element
curl -X POST http://localhost:8080/capture

# Full screen
curl -X POST http://localhost:8080/capture \
     -H "Content-Type: application/json" \
     -d '{"action":"screen"}'

# Current window
curl -X POST http://localhost:8080/capture \
     -H "Content-Type: application/json" \
     -d '{"action":"window"}'

# Multiple elements stitched into one image
curl -X POST http://localhost:8080/capture \
     -H "Content-Type: application/json" \
     -d '{"action":"elements","value":"42,105,106"}'

The response's data field contains the base64 PNG. Decode it to get the image:

curl -s -X POST http://localhost:8080/capture -d '{"action":"screen"}' \
  | python -c "import sys,json,base64; d=json.load(sys.stdin)['data']; open('screen.png','wb').write(base64.b64decode(d))"
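
For agents that call /capture from Python rather than curl, the request body and the decode step might look like the sketch below. It only builds and parses JSON; the HTTP call itself is left out. The helper names are hypothetical, and sending an explicit "element" action for the default target is an assumption (the curl example for the current element sends no body).

```python
import base64
import json

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def capture_payload(target="element", element_ids=None):
    """Build a JSON body for POST /capture, using the field names from the curl examples."""
    body = {"action": target}
    if target == "elements":
        if not element_ids:
            raise ValueError("'elements' needs at least one numeric ID")
        # IDs go into the value field as a comma-separated string
        body["value"] = ",".join(str(i) for i in element_ids)
    return json.dumps(body)


def decode_capture(response_text):
    """Return raw PNG bytes from a /capture JSON response."""
    data = json.loads(response_text)["data"]
    png = base64.b64decode(data)
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("data field did not decode to a PNG")
    return png
```

Checking the PNG signature before writing the file catches the case where the response carried an error message instead of image data.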

Telegram: /capture sends the image as a photo message (not text).

/capture
/capture action=screen
/capture action=window
/capture action=elements value=42,105,106

PowerShell:

$r = Send-FlaUICommand @{ command='capture'; action='screen' }
[IO.File]::WriteAllBytes('screen.png', [Convert]::FromBase64String($r.data))

Note: This is distinct from the screenshot exec action, which saves to Desktop\Apex_Captures and returns only the file path.

OCR

OCR uses Tesseract. Download language files from github.com/tesseract-ocr/tessdata and place them in a tessdata\ folder next to the executable (e.g. tessdata\eng.traineddata). Additional languages work the same way.

Captures saved by OCR Element + Save go to Desktop\Apex_Captures\.

AI (Multimodal)

The AI command set is backed by MtmdHelper using LLamaSharp's multimodal (MTMD) API. Supports vision and audio modalities depending on the model. Every inference call is fully stateless — no chat history is retained between calls.

Download a vision-capable GGUF model and its multimodal projector (e.g. LFM2.5-VL from LM Studio) and note the paths to both .gguf files, or click Download All on the Model tab. Then call ai init before any inference commands.

Project Structure

ApexComputerUse/
├── Program.cs                            — Entry point (`--service`, `--port`, `--pipe`, `--client`)
├── appsettings.json                      — Deployment defaults (Http/pipe/log/shell/test-runner)
├── ai-settings.json                      — AI provider credentials/settings
├── AI/
│   ├── AiChatService.cs                  — Provider-agnostic chat service (streaming + session state)
│   ├── AIDrawingCommand.cs               — GDI+ drawing engine (`/draw`, overlays, built-in demo scene)
│   ├── MtmdHelper.cs                     — Local multimodal model wrapper (LLamaSharp MTMD)
│   ├── MtmdInteractiveModeExecute.cs     — Interactive AI computer-use mode
│   └── SceneChatAgent.cs                 — Scene-oriented assistant logic
├── Automation/
│   ├── FlaUIHelper*.cs                   — UIA wrappers (find, actions, capture, text, keyboard, scrolling)
│   ├── ElementIdGenerator.cs             — Stable hash-based element/window IDs
│   └── UiMapRenderer.cs                  — Colour-coded tree renderer to PNG/overlay
├── Commands/
│   ├── CommandProcessor*.cs              — Core command handlers (find/exec/ocr/capture/ai/scenes/help)
│   ├── CommandLineParser.cs              — cmd.exe command parsing
│   ├── CommandRequest.cs                 — Normalized command DTO
│   └── CommandRequestJsonMapper.cs       — HTTP JSON/query mapping helpers
├── Servers/
│   ├── HttpCommandServer*.cs             — HTTP API + chat/page/scene/system route handlers
│   ├── FormatAdapter.cs                  — Response negotiation (HTML/JSON/text/PDF; includes `PdfWriter`)
│   ├── PipeCommandServer.cs              — Named-pipe server
│   └── TelegramController.cs             — Telegram command surface
├── Scenes/
│   ├── Scene.cs                          — Scene/layer/shape models with stable IDs
│   └── SceneStore.cs                     — Thread-safe scene store (`%LOCALAPPDATA%\ApexComputerUse\scenes`)
├── Clients/
│   ├── RemoteClient.cs                   — Remote endpoint metadata
│   ├── ClientPermissions.cs              — Per-client endpoint permission gates
│   └── ClientStore.cs                    — Persistent client registry (`%LOCALAPPDATA%\ApexComputerUse\clients`)
├── Infrastructure/
│   ├── AppConfig.cs / AppSettings.cs     — Config layering (`appsettings.json` + `APEX_*` + user prefs)
│   ├── AppLog.cs                         — Serilog bootstrap/log sink wiring
│   ├── OcrHelper.cs                      — Tesseract OCR wrapper
│   ├── DownloadManager.cs                — Model/OCR asset download support
│   └── ApexService.cs                    — Windows Service host
└── UI/
    ├── Form1.cs / Form1.Designer.cs      — Main WinForms host
    ├── ServerTabController.cs            — HTTP/pipe/server lifecycle controls
    ├── ChatTabController.cs              — Embedded `/chat` WebView + provider controls
    ├── ModelTabController.cs             — Model/asset management
    ├── ClientsTabController.cs           — Multi-endpoint registry UI
    ├── SceneEditorForm.cs / .Designer.cs — WinForms scene editor
    └── ClientEditForm.cs / .Designer.cs  — Client create/edit dialog

Scripts/                                  — `ApexComputerUse.psm1` (pipe module) and `apex.cmd` (HTTP helper)
restart-apex.bat / restart-apex.ps1       — Restart helpers for local development
AIClients/                                — AI messaging libraries and harness projects
TestApplications/                         — WPF/WinForms/Web test apps and TestRunner

OCR: place Tesseract language files in a tessdata\ folder next to the executable. Not included in the repo — download from github.com/tesseract-ocr/tessdata.

Development

Build

# Restore and build (Release)
dotnet build -c Release ApexComputerUse/ApexComputerUse.csproj

# Run from source
dotnet run --project ApexComputerUse/ApexComputerUse.csproj

Requires the .NET 10 SDK and the Windows Desktop workload (dotnet workload install windows).

Unit Tests

dotnet test ApexComputerUse.Tests/ApexComputerUse.Tests.csproj

The test suite covers the pure-logic and data-model layers — everything that can be tested without a live desktop session:

Test file	Coverage area
`ElementIdGeneratorTests.cs`	Hash mode, incremental mode, reset, thread safety
`SceneStoreTests.cs`	CRUD, disk persistence, concurrent creates
`SceneModelTests.cs`	`FlattenForRender`, ZIndex ordering, opacity, `SceneIds`
`AIDrawingCommandTests.cs`	JSON parsing, canvas backgrounds, all 8 shape types
`TelegramParseCommandTests.cs`	Command + key-value parser, `DictExtensions.Get`
`PipeCommandServerTests.cs`	Named-pipe JSON protocol parser
`LevenshteinTests.cs`	Edit-distance boundary and domain cases
`CommandResponseTests.cs`	`ToText` / `ToJson` serialisation
`OcrHelperTests.cs`	`CropBitmap` region logic, `OcrResult.ToString`

Components that require an active Windows session (FlaUI UIA, Tesseract, LLamaSharp, WinForms UI) are covered by the existing integration script Scripts/test_controls.py and manual testing.

Integration Test Runner

TestApplications/TestRunner/ is a cycle-based orchestrator that launches the WinForms, WPF, and web test apps, runs the full suite against the live HTTP API, and reports results. Use it whenever changes touch CommandProcessor, FlaUIHelper, or HttpCommandServer.

# Demo mode — human-readable output, 3 cycles
dotnet run --project TestApplications/TestRunner -- --mode demo

# Benchmark mode — JSON-line output, 25 cycles
dotnet run --project TestApplications/TestRunner -- --mode benchmark

Test apps:

WinForms — TortureTestForm.cs: textbox, button, checkbox, radio, combo, listbox, slider, menu, grid
WPF — TortureTestWindow.xaml: same controls plus Expander, ViewModel-driven state
Web — index.html: menu, tabs, form controls, scrollable regions

The runner interacts exclusively through the HTTP API, so a failed assertion is reported as the exact curl call that failed. The same suite can also be triggered remotely via POST /run-tests.

Changelog

All notable changes to ApexComputerUse are documented in this file.

[0.16.0] — 2026-05-10

Added

WindowMonitor — desktop window change detection

New Infrastructure/WindowMonitor.cs — owns a dedicated background STA thread (UIA3 is COM apartment-affine; thread-pool timers can't safely call into it) that polls the desktop once per second, diffs the window set against the previous snapshot, and fires events:
- WindowsChanged(IReadOnlyList<WindowSnapshot>) — fires whenever any window opens, closes, or changes title
- WindowClosed(IntPtr hwnd) — per-HWND closure event, used for cache invalidation
WindowSnapshot record — (Hwnd, ProcessId, Title, ElementId). ElementId is generated with excludeName: true so the title can change without rotating the ID.
Auto-starts on Form1.Load and is stopped/disposed on OnFormClosed.
Tools menu items: Start/Stop Window Monitoring, Watch Elements (slow), Watch Top Window Only, Set Element Window Filter… (substring match against window titles, settable via the inline dialog or programmatically by AI code).
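
The snapshot diff described above can be sketched in a few lines of Python. This is an illustration of the idea, not the C# in WindowMonitor.cs: the field names mirror the WindowSnapshot record, while diff_windows and its return shape are hypothetical.

```python
from typing import NamedTuple


class WindowSnapshot(NamedTuple):
    hwnd: int
    process_id: int
    title: str
    element_id: int


def diff_windows(previous, current):
    """Compare two polls keyed by HWND and return (opened, closed, renamed)."""
    prev = {w.hwnd: w for w in previous}
    curr = {w.hwnd: w for w in current}
    opened = [w for h, w in curr.items() if h not in prev]
    closed = [h for h in prev if h not in curr]  # HWNDs only, like WindowClosed
    renamed = [w for h, w in curr.items() if h in prev and prev[h].title != w.title]
    return opened, closed, renamed
```

Keying on the HWND rather than the title is what lets a rename show up as a change to an existing window instead of a close followed by an open.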

Optional element-level watching

WatchElements property — when on, each poll also scans every monitored window's UIA descendants and fires WindowElementsChanged(window, added, removed) with a per-window add/remove diff. Off-screen elements are skipped. Disabled by default (slow).
TopWindowOnly (P/Invoke GetForegroundWindow) and ElementWindowFilter (case-insensitive title contains) narrow the element-scan set so it stays tractable.
Per-window state is dropped automatically when a window closes; the first scan of a newly-discovered window establishes a baseline (no event), so opens don't dump every control as "added".
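
A minimal sketch of the baseline behaviour, assuming elements are tracked as sets of IDs per HWND. ElementWatcher and its methods are hypothetical names used only for illustration.

```python
class ElementWatcher:
    """Tracks element IDs per window and reports add/remove diffs between scans."""

    def __init__(self):
        self._seen = {}  # hwnd -> set of element IDs from the previous scan

    def scan(self, hwnd, element_ids):
        """Return (added, removed), or None when this scan only set the baseline."""
        current = set(element_ids)
        previous = self._seen.get(hwnd)
        self._seen[hwnd] = current
        if previous is None:
            return None  # first scan of a new window: baseline, no event
        return sorted(current - previous), sorted(previous - current)

    def window_closed(self, hwnd):
        """Drop per-window state so a reused HWND starts from a fresh baseline."""
        self._seen.pop(hwnd, None)
```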

Cache invalidation in CommandProcessor

New CommandProcessor.InvalidateClosedWindow(IntPtr hwnd) (in CommandProcessor.Windows.cs), wired from WindowMonitor.WindowClosed. Under _stateLock it:
- prunes _windowMap entries with the matching HWND
- sweeps _elementMap for now-invalid AutomationElement entries via the existing IsElementValid static
- removes those entries from all parallel maps (_elementHashes, _elementReverse, _elementParents, _elementDescriptors)
- clears _currentElement / CurrentWindow if they went stale, and clears _mappedWindowHandle if it matched
_elementReverse.Remove is wrapped in try/catch + LogSwallowed because Dictionary<AutomationElement,_>.Remove calls UIA's CompareElements and can throw COMException on stale proxies.
Verified end-to-end: /find Notepad → close Notepad → wait one poll cycle → /status reports Window: (none) and Notepad is gone from /windows.

Inspectable activity buffer

WindowMonitor carries a thread-safe ConcurrentQueue<MonitorLogEntry> (default cap 500, FIFO eviction) of recent activity — opens, closes, renames, element add/remove, and internal poll errors. AppendLog, GetLog, ClearLog, plus an IsRunning lifecycle property.
New HTTP routes (named /winmon/... to avoid collision with the existing /monitor/{id} RegionMonitor namespace):
- GET /winmon/log → { count, running, entries: [...] } (append .json for raw JSON)
- POST /winmon/clear → { cleared: N }
Both routes are gated by AllowDiagnostics; loopback callers always pass.
13 new unit tests (WindowMonitorTests, CommandProcessorInvalidateTests) covering the diff logic, log buffer, FIFO eviction, lifecycle, the new properties, and the safe paths of InvalidateClosedWindow. UIA-dependent paths (live element pruning, descendant scan) verified manually via the running app.
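
The capped activity buffer behaves like a bounded FIFO. A rough Python equivalent, assuming a lock-guarded deque in place of the ConcurrentQueue and a hypothetical entry shape, could look like this:

```python
from collections import deque
from threading import Lock


class MonitorLog:
    """Bounded activity log: the oldest entries are evicted once the cap is reached."""

    def __init__(self, capacity=500):
        self._entries = deque(maxlen=capacity)  # deque drops from the left when full
        self._lock = Lock()

    def append(self, kind, message):
        with self._lock:
            self._entries.append({"kind": kind, "message": message})

    def get(self):
        with self._lock:
            return list(self._entries)  # copy, so callers never see later mutations

    def clear(self):
        """Empty the log and return how many entries were removed."""
        with self._lock:
            removed = len(self._entries)
            self._entries.clear()
            return removed
```

Returning the removed count from clear matches the { cleared: N } shape of the POST /winmon/clear response.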

[0.15.0] — 2026-05-06

Added

Element Annotations & Filtering

New ElementAnnotation model and ElementAnnotationStore — per-element notes and exclusion flags keyed by stable element hash, persisted at %LOCALAPPDATA%\ApexComputerUse\annotations\elements.json. Empty records auto-GC'd.
New verbs in CommandProcessor.Annotations.cs: annotate, unannotate, exclude, unexclude, annotations, excluded
New HTTP routes: POST /annotate, POST /unannotate, POST /exclude, POST /unexclude, GET /annotations, GET /excluded
Notes appear as a note field on /elements output; excluded subtrees are skipped during scan (root never excluded — depth > 0 guard)
New query param ?unfiltered=true on /elements bypasses the exclusion filter
7 new unit tests in ElementAnnotationStoreTests.cs
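
To make the exclusion rules concrete, here is a small sketch of a scan that skips excluded subtrees, never drops the root, and attaches notes. The node shape (hash, name, children) and the scan function are assumptions for illustration, not the layout used by ScanElementsIntoMap.

```python
def scan(node, excluded, notes, unfiltered=False, depth=0):
    """Copy a tree, skipping excluded subtrees and attaching notes.

    node is a dict: {"hash": str, "name": str, "children": [...]} (hypothetical shape).
    """
    # depth > 0 guard: the root is kept even if its hash is in the excluded set
    if depth > 0 and not unfiltered and node["hash"] in excluded:
        return None
    out = {"hash": node["hash"], "name": node["name"]}
    if node["hash"] in notes:
        out["note"] = notes[node["hash"]]
    kids = [scan(c, excluded, notes, unfiltered, depth + 1)
            for c in node.get("children", [])]
    kids = [k for k in kids if k is not None]
    if kids:
        out["children"] = kids
    return out
```

Passing unfiltered=True plays the role of ?unfiltered=true: notes are still attached, but nothing is pruned.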

Region Maps

New RegionMap model and RegionMapStore — persistent named pixel-coordinate grids tied to a window or stable element hash. One file per map under <exe>/regionmaps/{id}.json
Built for AI self-calibration loops on canvas-rendered content (board games, emulators, video timelines) where individual cells are not UIA elements
Static helpers: CellToPixel(map, row, col) returns cell center; BuildGridDrawRequest(...) produces a re-usable draw request for both overlay and render paths
New verbs in CommandProcessor.RegionMaps.cs: regionmap umbrella with sub-actions list|get|delete|overlay|render|cell
New HTTP routes in HttpCommandServer.AnnotationRoutes.cs:
- GET|POST /regionmap — list/create
- GET|PUT|PATCH|DELETE /regionmap/{id} — per-map ops
- POST /regionmap/{id}/overlay — click-through screen overlay
- POST /regionmap/{id}/render — base64 PNG of screen (or current window) with grid drawn over it; supports {"canvas":"screen"} (default) or {"canvas":"window"} (auto-translates grid coords to window-local)
- POST /regionmap/{id}/cell — {row, col} → {x, y} for click-at
10 new unit tests in RegionMapStoreTests.cs (incl. corner-case cell-coord math)
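
The cell-center arithmetic behind CellToPixel can be illustrated as follows. The grid fields used here (x, y, cell_w, cell_h, rows, cols) are hypothetical stand-ins; the actual RegionMap schema is not shown in this document.

```python
def cell_to_pixel(grid, row, col):
    """Return the pixel at the center of a grid cell.

    grid is a dict with origin (x, y), cell size (cell_w, cell_h) and
    dimensions (rows, cols); all of these field names are assumptions.
    """
    if not (0 <= row < grid["rows"] and 0 <= col < grid["cols"]):
        raise IndexError(f"cell ({row}, {col}) is outside the grid")
    x = grid["x"] + col * grid["cell_w"] + grid["cell_w"] // 2
    y = grid["y"] + row * grid["cell_h"] + grid["cell_h"] // 2
    return x, y
```

For an 8 by 8 grid whose origin is (100, 50) with 40 by 30 pixel cells, cell (0, 0) maps to (120, 65), which is the kind of coordinate an agent would then pass to click-at.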

Region Monitors

New RegionMonitor model and RegionMonitorStore — persistent per-region screen-change watchers, one file per monitor under %LOCALAPPDATA%\ApexComputerUse\monitors\{id}.json. Each monitor holds an array of MonitorRegion so one logical "watch" can cover multiple indicators (LEDs, status icons, etc.) with independent diffs.
New RegionMonitorRunner — background dispatcher; one Task per enabled monitor; per-region capture → diff vs previous → fire SSE event when over threshold. First tick is the baseline (no fire). Disabled monitors are not polled. Region-count changes handled at runtime. Diff via LockBits + Marshal.Copy — per-pixel max-channel-difference > tolerance counts as "changed".
New verbs in CommandProcessor.Monitors.cs: monitor umbrella with sub-actions list|get|delete|start|stop|check.
New HTTP routes in HttpCommandServer.MonitorRoutes.cs:
- GET|POST /monitor — list/create
- GET|PUT|DELETE /monitor/{id} — per-monitor CRUD
- POST /monitor/{id}/start / /stop — toggle enabled
- POST /monitor/{id}/check — manual one-shot diff vs current baselines
- POST /monitor/{id}/snapshot?index=N — base64 PNG of region N right now
Notifications via the existing /events SSE stream as monitor.fired events: {monitorId, name, regionIndex, label, x, y, width, height, percentDiff, threshold, seq, time}.
Defaults: intervalMs=1000 (floor 100ms), thresholdPct=5.0, tolerance=8, enabled=false.
Last-fire telemetry persisted on the monitor: lastFiredUtc, lastPercentDiff, lastRegionIndex, hitCount.
11 new unit tests in RegionMonitorStoreTests.cs covering CRUD, telemetry, persistence, and diff math.
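
As an illustration of the diff rule (a pixel counts as changed when its largest per-channel difference exceeds the tolerance), the sketch below works on plain lists of RGB tuples instead of LockBits buffers. The function names are hypothetical, and treating "over threshold" as a strict greater-than is an assumption.

```python
def percent_changed(prev, curr, tolerance=8):
    """Percentage of pixels whose max channel difference exceeds tolerance.

    prev and curr are equal-length lists of (r, g, b) tuples.
    """
    if not prev or len(prev) != len(curr):
        raise ValueError("regions must be non-empty and the same size")
    changed = sum(
        1 for p, c in zip(prev, curr)
        if max(abs(a - b) for a, b in zip(p, c)) > tolerance
    )
    return 100.0 * changed / len(prev)


def should_fire(prev, curr, threshold_pct=5.0, tolerance=8):
    """True when the changed share of the region is over the threshold."""
    return percent_changed(prev, curr, tolerance) > threshold_pct
```

The tolerance absorbs small rendering noise such as anti-aliasing or compression shimmer, while the threshold decides how much of the region has to move before an event is worth sending.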

EventBroker generalization

EventEnvelope reshaped: int? WindowId plus IReadOnlyDictionary<string, object?> Data replace the fixed WindowId/Title fields. Window events still carry id/title inside Data; non-window subsystems attach arbitrary payloads.
New public EventBroker.Emit(string type, IDictionary<string, object?> data, int? windowId = null) for non-window emitters (region monitors today, anything else later).
SSE serializer in HttpCommandServer.Events.cs now flattens Data into the frame payload alongside seq/time — both event families render uniformly.
JsonElementExtensions.Dbl(name) helper added for parsing thresholdPct.

Public /help Page

New optional setting PublicHelpPage (default false): when on, GET /help is reachable without an API key
New setting PublicHelpRateLimit (default 30 req/min/IP): sliding 60-second per-IP window protects the unauthenticated route
Returns HTTP 429 with Retry-After: 60 when limit exceeded
Loopback callers and API-keyed callers always have full access (never rate-limited)
New RuntimeFlags static — mutable mirror of AppConfig values seeded at startup, allows GUI changes to take effect without restart
GUI controls in Remote Control tab: chkPublicHelp checkbox + numHelpRateLimit numeric input. Persisted in %APPDATA%\ApexComputerUse\settings.json alongside other user prefs.
appsettings.json keys: PublicHelpPage, PublicHelpRateLimit. Env: APEX_PUBLIC_HELP_PAGE, APEX_PUBLIC_HELP_RATE_LIMIT.
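
A sliding per-IP window of this kind can be sketched as below. It is a generic illustration of the technique with hypothetical names, not the code behind the /help route; the caller passes in the current time so the logic stays easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within any `window_s`-second span."""

    def __init__(self, limit=30, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def check(self, ip, now):
        """Return (status_code, headers) for a request from `ip` at time `now`."""
        hits = self._hits[ip]
        # Drop timestamps that have slid out of the window
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return 429, {"Retry-After": "60"}
        hits.append(now)
        return 200, {}
```

Loopback and API-keyed callers would bypass check entirely, since the changelog says they are never rate-limited.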

Changed

UI

Remote Control tab cleaned up: lblTelegramStatus moved from (8, 168) — was overlapping the new public-help checkbox — to (465, 104) on the bot-token row. btnStartTelegram shrunk from 120 to 100 wide to make room.
Added tooltips to all interactive controls in Remote Control tab (HTTP port/start, API key, Copy, bot token, Start Telegram, allowed chat IDs, public help, rate limit, pipe name, Start Pipe, status labels).

Documentation

New LICENSE: PolyForm Noncommercial 1.0.0. Source-available; commercial use requires a separate license.
New THIRD_PARTY_NOTICES.md: license attributions for all 9 NuGet dependencies (FlaUI, Serilog, LLamaSharp, Telegram.Bot, Tesseract, etc.). MIT and Apache 2.0 obligations met inline.
Merged ACU_AI_CONTROL_GUIDE.md into ACU_CONTROL_GUIDE.md (deleted the former) — single comprehensive guide. Added "Rules of Thumb" section + Annotations + Region Maps coverage.
Slimmed ACU_SYSTEM_PROMPT.md from 9 KB → 3.8 KB. Now points at the auto-generated /help page for endpoint reference instead of duplicating tables that drift. Retains auth, mental model, 10 critical rules, minimal control loop.
Updated ACU_OPERATIONAL_REFERENCE.md for staleness: added /winrun, annotations, region maps, note field, ?unfiltered, PublicHelpPage/PublicHelpRateLimit config keys.

Fixed

CommandProcessor.ScanElementsIntoMap now consults ElementAnnotationStore to skip excluded subtrees and attach notes during scan; existing /elements callers see no behavioral change unless annotations exist.
RegionMap.canvas:"window" mode correctly translates screen-absolute grid coords into window-local space before drawing, so the grid lines up with the captured window image.

[0.14.0] — 2026-04-27

Added

Multiple Instance Support

Added --port command-line argument to override HTTP listen port for running multiple instances
Added --pipe command-line argument to override named-pipe name
Added --client command-line argument to mark an instance as a subordinate client (disables Launch Instance button)
Port auto-increment in HttpCommandServer.Start() — automatically tries next available port if preferred port is taken
New buttons in Clients tab: "Open Web UI" (launches /chat page in default browser) and "Launch Instance" (spawns new instance with incremented port)
ClientsTabController.LaunchInstance() auto-registers spawned instance in client list
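
The port auto-increment idea, sketched in Python under some assumptions: a raw socket bind stands in for the HttpListener start, the attempt cap of 20 is hypothetical, and the availability check is injectable so the search logic can be exercised without touching the network.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Best-effort check: can we bind a TCP socket to host:port right now?"""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def pick_port(preferred, attempts=20, is_free=port_is_free):
    """Return the first free port at or above `preferred`."""
    for port in range(preferred, preferred + attempts):
        if is_free(port):
            return port
    raise RuntimeError(f"no free port in {preferred}..{preferred + attempts - 1}")
```

A probe like this is inherently racy (another process can take the port between the check and the real bind), so a server would normally treat the actual start attempt as the source of truth and retry on failure.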

Client Permissions System

New ClientPermissions class with per-client flags: AllowAutomation, AllowCapture, AllowAi, AllowScenes, AllowShellRun, AllowClients
Permissions stored in JSON alongside each RemoteClient and loaded on reconnect
Permission enforcement in HttpCommandServer: loopback (127.0.0.1) always gets full access; registered clients get their stored permissions; unknown IPs get full access
All endpoints are gated by the appropriate permission: /run requires AllowShellRun; /capture and /ocr require AllowCapture; /ai and /chat require AllowAi; /scenes and /editor require AllowScenes; /clients requires AllowClients; everything else requires AllowAutomation
ClientEditForm redesigned with two tabs: "Connection" (existing fields) and "Permissions" (6 checkboxes with ShellRun/Clients highlighted in orange)
ClientStore.FindByHost(string host) — case-insensitive lookup by hostname
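
The enforcement rules above can be modelled roughly as follows. Several things here are assumptions made for the sketch: the snake_case field names, the all-true defaults, keying registered clients by IP string, matching routes by their first path segment, and reading "/capture/ocr", "/ai/chat", and "/scenes/editor" as pairs of separate routes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientPermissions:
    allow_automation: bool = True
    allow_capture: bool = True
    allow_ai: bool = True
    allow_scenes: bool = True
    allow_shell_run: bool = True
    allow_clients: bool = True


FULL = ClientPermissions()

# First path segment -> permission flag; anything unlisted needs allow_automation
ROUTE_FLAGS = {
    "/run": "allow_shell_run",
    "/capture": "allow_capture", "/ocr": "allow_capture",
    "/ai": "allow_ai", "/chat": "allow_ai",
    "/scenes": "allow_scenes", "/editor": "allow_scenes",
    "/clients": "allow_clients",
}


def resolve(remote_ip, registered):
    """Loopback and unknown IPs get full access; registered clients get stored flags."""
    if remote_ip == "127.0.0.1":
        return FULL
    return registered.get(remote_ip, FULL)


def is_allowed(path, perms):
    """Check the flag that guards the route's first path segment."""
    route = path.split("?", 1)[0]
    first = "/" + route.lstrip("/").split("/", 1)[0]
    return getattr(perms, ROUTE_FLAGS.get(first, "allow_automation"))
```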

AI Chat with API Tools

AiChatService.SetLocalServer(int port, string? apiKey) and ClearLocalServer() — configure local HTTP server context for AI chat
Agentic tool loop in AiChatService.SendAsync() — AI can issue ApexComputerUse API calls via apex code blocks
System prompt auto-extended with API reference when server context is set, including endpoint list and example calls
The loop runs for up to 8 turns, executing calls and feeding results back until the AI produces a clean answer
ServerTabController.ToggleHttp() calls SetLocalServer() on start and ClearLocalServer() on stop
Parsing and system prompt generation exposed as internal for testing
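
The shape of such a bounded tool loop is sketched below. The model, the call extractor, and the executor are passed in as plain functions because the apex block format and the provider API are not described here; every name in the sketch is hypothetical.

```python
def run_tool_loop(ask_model, extract_calls, execute_call, user_message, max_turns=8):
    """Ask the model, run any API calls it requests, feed results back, repeat.

    Stops as soon as a reply contains no calls, or after max_turns replies.
    """
    transcript = [("user", user_message)]
    reply = ""
    for _ in range(max_turns):
        reply = ask_model(transcript)
        transcript.append(("assistant", reply))
        calls = extract_calls(reply)
        if not calls:
            return reply  # clean answer: nothing left to execute
        results = [execute_call(call) for call in calls]
        transcript.append(("tool", "\n".join(results)))
    return reply  # turn budget exhausted; return the last reply as-is
```

The hard cap is what keeps a model that never stops requesting calls from looping forever.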

Security Hardening

Timing-safe API key comparison using CryptographicOperations.FixedTimeEquals() (replaced three separate == comparisons)
Shell command execution in /run now uses ProcessStartInfo.ArgumentList instead of string concatenation to prevent injection
HttpCommandServer.Stop() now explicitly closes HttpListener to immediately release port handles
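
For readers more familiar with Python, the closest analogue to CryptographicOperations.FixedTimeEquals() is hmac.compare_digest. The sketch below shows the pattern; api_key_matches is a hypothetical helper, not part of this project.

```python
import hmac


def api_key_matches(supplied, expected):
    """Compare API keys without leaking how many leading characters matched."""
    supplied_bytes = (supplied or "").encode("utf-8")
    expected_bytes = expected.encode("utf-8")
    # compare_digest takes time independent of where the first mismatch occurs
    return hmac.compare_digest(supplied_bytes, expected_bytes)
```

An ordinary == comparison can return as soon as it finds a differing byte, which in principle lets an attacker recover a key one character at a time by measuring response latency.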

Bug Fixes

Fixed MtmdInteractiveModeExecute infinite loop with hardcoded test path — replaced with proper Console.ReadLine() loop
Fixed CommandProcessor element ID lookup to use Equals() instead of ReferenceEquals() (FlaUI uses IUIAutomation.CompareElements)
Added 50k-entry cap on CommandProcessor._elementMap to prevent unbounded growth during long sessions
Fixed Form1.SetupNetshIfNeeded() blocking UI thread — made async with proper timeout
Fixed Form1.AutoLoadModelIfConfigured() fire-and-forget — now logs async exceptions via .ContinueWith()
SceneEditorForm canvas paint optimization — eliminated per-paint full-scene bitmap allocation during drag

Changed

Program Structure

Program.IsClientInstance — public static property detecting --client flag for UI gating
Command-line arg parsing restructured to support flag-only arguments alongside key-value pairs

API & Configuration

HttpCommandServer constructor now accepts optional ClientStore? clientStore parameter
HttpCommandServer.Port { get; private set; } — made settable internally by Start() for auto-increment
RemoteClient.Permissions — new property with ClientPermissions value

UI

Form1.Designer.cs — added "Open Web UI" and "Launch Instance" buttons to Clients tab
ClientEditForm.Designer.cs — complete redesign with TabControl (Connection / Permissions tabs)
ClientsTabController constructor signature expanded with button references and port getter

Testing

New test file ApexComputerUse.Tests/AiChatServiceTests.cs with 22 tests covering apex call parsing and system prompt generation
ParseApexCalls and BuildApexSystemPrompt exposed as internal via existing InternalsVisibleTo attribute
All 171 tests passing (149 existing + 22 new)

Known Limitations

AI tool-use loop is non-streaming (full response assembled before delivery)
IP spoofing could bypass permission sandboxing on a local network

[0.13.0] — 2026-04-26

Added

Clients tab — remote machine registry — a new "Clients" tab (sixth tab in the main UI) lets users and AI maintain a persistent directory of other Apex-enabled machines. Each entry stores a friendly name, host/IP, port, API key, OS version, and description. Entries are listed in a six-column ListView and persisted to <exe>/clients/{id}.json using the same thread-safe JSON store pattern as scenes.
ClientStore (Clients/ClientStore.cs) — thread-safe store that loads all client records from disk on startup and writes individual JSON files on every create, update, or delete.
RemoteClient (Clients/RemoteClient.cs) — data model with [JsonPropertyName] attributes matching the project's snake_case serialization convention.
ClientsTabController (UI/ClientsTabController.cs) — tab logic wired to Add, Edit, Remove, and Test buttons. Test Connection fires an async GET /ping against the selected client's host:port (with its API key if set) and updates a live Status column green/red in-place, with no UI blocking.
ClientEditForm (UI/ClientEditForm.cs / ClientEditForm.Designer.cs) — fixed-size dialog for creating and editing client entries, with name/host required-field validation and port range validation.

[0.12.0] — 2026-04-26

Added

Embedded HTML chat in the Chat tab — the Chat tab's RichTextBox, input field, and Send button have been replaced by an embedded Microsoft.Web.WebView2 control hosting the existing /chat streaming page directly inside the app. Click Load Chat to navigate the WebView2 to http://localhost:{port}/chat?apiKey=.... The HTML page handles streaming, the "New chat" reset, and provider/model status display natively.
HTTP server auto-start on launch — HttpAutoStart and HttpBindAll are now true by default in appsettings.json. The HTTP server starts and binds to all interfaces automatically when the app opens; no manual click on the Remote Control tab is required.
Model auto-load on launch — if model and projector paths are saved in settings.json, the local vision model is loaded automatically at startup without opening the Model tab.
First-run netsh setup — on the very first launch, the app checks whether the HTTP URL ACL (http://+:8081/) and the Windows Firewall inbound rule (ApexComputerUse) exist. If either is missing, a single elevated cmd session (one UAC prompt) runs both netsh commands. The result is persisted to settings.json (NetshConfigured = true) so the check never repeats.
Restart scripts — restart-apex.bat and restart-apex.ps1 at the repo root kill all running instances (taskkill /F /IM ApexComputerUse.exe) and relaunch the app. Both prefer the Release build, fall back to Debug, and fall back to dotnet run if no built exe is found.

Changed

ChatTabController — removed _rtbChatHistory, _txtChatInput, _btnChatSend, AppendToChat, AppendColoredText, SendOrCancelAsync, ExecuteCommandsFromResponse, and CurlRx. Constructor now accepts a WebView2 instead. OpenChat() navigates the embedded WebView2; ResetChat() calls Reload().
AppSettings — added NetshConfigured bool field (persisted to %APPDATA%\ApexComputerUse\settings.json) for first-run netsh tracking.

[0.11.0] — 2026-04-16

Added

/elements?match=<text> — case-insensitive substring search across Name, AutomationId, and Value pattern. Returns only branches containing matches, each wrapped in its ancestor path (non-matching siblings pruned). depth now controls how deep to render under each match, so one call replaces the repeated drill-down pattern of "fetch tree → spot candidate → fetch subtree". Composes with type= and onscreen=true.
/elements?collapseChains=true — folds "1-in-1-in-1" wrapper chains that dominate web accessibility trees. A node is skipped only when it has exactly one child, no Name, no AutomationId, and its control type is Pane, Group, or Custom. Named containers and anything with an AutomationId are preserved. IDs of hoisted descendants are unchanged — follow-up /elements?id=<id> and /execute id=<id> calls continue to work against the real (unflattened) tree.
/elements?includePath=true — every emitted node gains a path breadcrumb string (e.g. "Chrome > Document > Main > Form") so an agent can orient itself without climbing back up the tree.
/elements?properties=extra — opt-in per-node value (via Value pattern, when the element supports it) and helpText properties. Off by default so token budgets don't change silently; needed for web inputs whose Name is empty and whose visible content lives in the Value pattern.
descendantCount on truncated nodes — nodes cut off by depth now emit descendantCount: N alongside the existing childCount, so an agent can decide whether a subtree is worth expanding without another round trip.
Structured /find response — /find now populates a JSON element object on the response (id, controlType, name, automationId, className, frameworkId, isEnabled, isOffscreen, boundingRectangle, plus value/helpText when properties=extra) alongside the existing human-readable string in message. The element's numeric ID is recovered from the most recent /elements scan when available.
Tree-shape unit tests (ApexComputerUse.Tests/CommandProcessorTreeTests.cs) — covers FilterTreeByMatch (case-insensitive, AutomationId + Value lookup, sibling pruning), CollapseSingleChildChains (identity-less-only collapse, multi-child preservation, ID stability), and ElementNode JSON round-trip for the new opt-in fields.
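
The two tree post-processors can be approximated in Python as shown below. This is a simplified sketch on plain dicts, not the C# FilterTreeByMatch and CollapseSingleChildChains: it ignores the depth parameter and keeps the full subtree under each match.

```python
COLLAPSIBLE = {"Pane", "Group", "Custom"}


def collapse_chains(node):
    """Hoist the only child of an anonymous Pane/Group/Custom wrapper."""
    children = [collapse_chains(c) for c in node.get("children", [])]
    node = {**node, "children": children}
    if (len(children) == 1
            and not node.get("name")
            and not node.get("automationId")
            and node.get("controlType") in COLLAPSIBLE):
        return children[0]  # the child keeps its own ID
    return node


def filter_by_match(node, text):
    """Keep matching nodes plus their ancestor path; prune everything else."""
    needle = text.lower()
    hit = any(needle in (node.get(key) or "").lower()
              for key in ("name", "automationId", "value"))
    if hit:
        return {**node}  # a match keeps its whole subtree in this sketch
    kept = [k for k in (filter_by_match(c, text) for c in node.get("children", []))
            if k is not None]
    if kept:
        return {**node, "children": kept}
    return None
```

Because collapsing only changes which nodes are emitted and never rewrites IDs, a follow-up call by ID still resolves against the real tree.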

Changed

CommandProcessor.ElementNode / BoundingRect promoted from private to internal sealed class so the new in-process post-processors (FilterTreeByMatch, CollapseSingleChildChains) and the test project (InternalsVisibleTo) can exercise them directly.
ScanElementsIntoMap now accepts a ScanOptions struct (IncludePath + IncludeExtra + depth) and threads the parent breadcrumb through recursion without changing call-site signatures for existing endpoints.

[0.10.0] — 2026-04-16

Added

AI Chat window — Tools → AI Chat opens a standalone chat interface powered by the AiMessagingCore library. Supports 8 providers: OpenAI, Anthropic, DeepSeek, Grok, Groq, Duck, LM Studio, and LlamaSharp (local GGUF). Streams tokens in real-time; shows timing metrics (total tokens, tokens/second, time-to-first-token). Provider, model, system prompt, and sample query are persisted to ai-settings.json next to the executable.
AIClients solution integrated — both AiMessagingCore (class library) and AIClients (standalone WinForms harness) are now included in ApexComputerUse.sln for single-solution editing. AIClients.sln and AIClients.exe remain fully independent and buildable on their own.
ai-settings.json — starter settings file (copied to output on build) with placeholder API keys for all 8 providers. Replace placeholders with real keys to activate each provider.

Fixed

ProviderSettings.ApiKey and AiLibrarySettings.DefaultProvider changed from init-only to set so runtime configuration updates (provider switch, API key override) can be applied without reconstructing the settings objects.
HandleChatStatus in HttpCommandServer now returns Dictionary<string, string> matching the ApexResult.Data contract; sessionActive is serialized as "True" / "False".

[0.9.0] — 2026-04-07

Added

capture command — returns screen captures inline as base64 PNG in the data response field. No file is written to disk. Four targets via action=:
- screen — full display
- window — current window (requires prior find)
- element (default) — current element (requires prior find)
- elements value=id1,id2,... — multiple elements by numeric ID, stitched vertically into one image
HTTP: POST /capture
Named pipe / PowerShell: command=capture; new Invoke-FlaUICapture cmdlet in ApexComputerUse.psm1
cmd.exe: apex capture [action=...] [value=...] in apex.cmd
Telegram: /capture — response delivered as a photo message, not text

[0.8.0] — 2026-04-07

Added

Persistent element ID map — elements command now recursively scans the UI tree using ElementIdGenerator (SHA-256 hash-based, deterministic across sessions). Each element receives a stable numeric ID that survives app restarts.
Nested JSON element map output — elements returns the full window tree as indented, nested JSON (id, controlType, name, automationId, children), replacing the flat string list.
Window map with persistent IDs — windows command now returns a JSON array of {id, title} pairs. IDs are hash-based and stable for the same window across sessions.
Map-based lookup in find — pass a numeric ID from either windows or elements as the window= or id= parameter; the element is resolved directly from the in-memory map without a fuzzy search.
Auto-focus on every find — the matched window is brought into foreground focus automatically; no separate focus action required.
"Output UI Map" menu item — Tools menu item captures the UI tree of the currently selected window and prints the nested JSON to the log.
Full ElementOperations parity — all UIA patterns now covered by both ApexHelper and CommandProcessor:

New exec actions

| Action | Description |
| --- | --- |
| `mouse-click` | Force mouse left-click (bypasses Invoke/Toggle/SelectionItem) |
| `middle-click` | Middle-mouse-button click |
| `click-at value=x,y` | Click at pixel offset from element top-left |
| `drag value=x,y` | Drag element to screen coordinates |
| `highlight` | Draw orange highlight around element for 1 second |
| `isenabled` | Returns `True`/`False` |
| `isvisible` | Returns `True`/`False` |
| `clearvalue` | Set value to empty string (Value pattern) |
| `appendvalue` | Append text to current value |
| `getselectedtext` | Selected text via Text pattern |
| `setrange value=n` | Set RangeValue pattern |
| `getrange` | Read current RangeValue |
| `rangeinfo` | Min / max / smallChange / largeChange |
| `toggle-on` / `toggle-off` | Set toggle to a specific state |
| `gettoggle` | Read current toggle state (On / Off / Indeterminate) |
| `expandstate` | Read ExpandCollapse state |
| `select-item` | Select via SelectionItem pattern |
| `addselect` | Add element to multi-selection |
| `removeselect` | Remove element from selection |
| `isselected` | Check SelectionItem selected state |
| `getselection` | Get selected items from a Selection container |
| `select-index value=n` | Select ComboBox / ListBox item by zero-based index |
| `getitems` | List all items in a ComboBox or ListBox |
| `getselecteditem` | Get currently selected item text |
| `minimize` / `maximize` / `restore` | Window visual state |
| `windowstate` | Read current window visual state |
| `move value=x,y` | Move element via Transform pattern |
| `resize value=w,h` | Resize element via Transform pattern |
| `scroll-left` / `scroll-right value=n` | Horizontal mouse scroll |
| `scrollpercent value=h,v` | Scroll to h%/v% via Scroll pattern |
| `getscrollinfo` | Scroll position and scrollable flags |
| `griditem value=row,col` | Get element at grid cell |
| `gridinfo` | Row and column counts |
| `griditeminfo` | Row / column / span for a GridItem element |
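
The persistent IDs in this release are described as SHA-256 hash-based and deterministic across sessions. The sketch below shows one way such an ID could be derived. It is not the actual ElementIdGenerator: the choice of input properties, the separator, and the 32-bit truncation are all assumptions made for illustration.

```python
import hashlib


def stable_id(*parts: str) -> int:
    """Derive a deterministic numeric ID from identifying strings.

    Assumption: the inputs (for example control type, name, automation ID)
    are joined with a separator that cannot appear in normal UI text, and
    the first four bytes of the SHA-256 digest become the ID.
    """
    key = "\x1f".join(parts).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big")


# The same inputs always hash to the same ID, which is what lets an ID
# survive an application restart.
ok_button = stable_id("Button", "OK", "btnOk")
cancel_button = stable_id("Button", "Cancel", "btnCancel")
```

Because the ID depends only on the element's properties, nothing has to be stored between sessions. The trade-off is that two elements with identical properties would collide unless something like the parent path is also hashed.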

Upgraded exec actions

| Action | Change |
| --- | --- |
| `click` | Now smart: Invoke → Toggle → SelectionItem → mouse fallback |
| `gettext` | Smart chain: Text pattern → Value → LegacyIAccessible → Name |
| `getvalue` | Smart chain: Value → Text → LegacyIAccessible → Name |
| `setvalue` | Smart chain: Value (if writable) → RangeValue (if numeric) → keyboard |
| `select` | Tries SelectionItem on list child first, then FlaUI wrappers |
| `keys` | Full `{KEY}` token notation (`{CTRL}`, `{F5}`, …) and `Ctrl+A` / `Alt+F4` combo syntax |
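
The "smart chain" entries above all follow the same shape: try the most specific UIA pattern first and fall through to the next one when it is unavailable. A minimal sketch of that control flow follows. The strategy callables are hypothetical stand-ins, not the real FlaUI pattern calls, and treating both an exception and a None result as "not supported" is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

Strategy = Tuple[str, Callable[[], Optional[str]]]


def run_chain(strategies: Sequence[Strategy]) -> Tuple[str, str]:
    """Try each strategy in order and return (name, result) of the first success.

    A strategy counts as unsupported if it raises or returns None.
    """
    for name, attempt in strategies:
        try:
            result = attempt()
        except Exception:
            # Pattern not available on this element: fall through to the next.
            continue
        if result is not None:
            return name, result
    raise LookupError("no strategy in the chain succeeded")


def text_pattern_unsupported() -> Optional[str]:
    raise NotImplementedError("Text pattern not supported")


# A gettext-style chain: Text pattern, then Value, then LegacyIAccessible, then Name.
gettext_chain = [
    ("Text", text_pattern_unsupported),
    ("Value", lambda: None),
    ("LegacyIAccessible", lambda: "hello"),
    ("Name", lambda: "fallback name"),
]
```

Returning the name of the strategy that succeeded makes it easy to log which pattern actually handled the request.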

[0.7.0] — 2026-04-06

Added

windows command returns a JSON array of {id, title} for all open windows, enabling the AI to select precisely without relying on fuzzy matching.
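
A client can use that array to pick a window by exact title and then pass the ID on, instead of sending a title and hoping the fuzzy matcher agrees. The sketch below assumes the response body is exactly the JSON array described above; the sample IDs and titles are made up.

```python
import json
from typing import Optional


def pick_window_id(windows_json: str, title: str) -> Optional[int]:
    """Return the id of the window whose title matches exactly, or None."""
    for window in json.loads(windows_json):
        if window["title"] == title:
            return window["id"]
    return None


# Illustrative payload only; real IDs come from the running application.
sample = '[{"id": 101, "title": "Untitled - Notepad"}, {"id": 202, "title": "Calculator"}]'
calculator_id = pick_window_id(sample, "Calculator")
```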

[0.6.0] — 2026-04-06

Added

Named-pipe server (PipeCommandServer) — exposes the full command set over a Windows named pipe (default name ApexComputerUse). Each client connection is session-based (state is preserved across commands on the same connection). Accepts and returns newline-delimited JSON.
Pipe server UI — new row in the Remote Control group box: configurable pipe name, Start/Stop button, and live status label.
Scripts\ApexComputerUse.psm1 — PowerShell module providing idiomatic cmdlets over the named pipe: Connect-FlaUI, Disconnect-FlaUI, Send-FlaUICommand, Get-FlaUIWindows, Get-FlaUIStatus, Get-FlaUIHelp, Get-FlaUIElements, Find-FlaUIElement, Invoke-FlaUIAction, Invoke-FlaUIOcr, Invoke-FlaUIAi.
Scripts\apex.cmd — cmd.exe batch helper wrapping the HTTP server with simpler positional syntax (e.g. apex find Notepad, apex exec click, apex ai describe). Requires curl, which is built into Windows 10 and later.
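
For clients that do not use the PowerShell module, the newline-delimited JSON framing can be sketched directly. The pipe path follows from the default name above. The message keys (command plus arbitrary fields) are an assumption based on the command= syntax used elsewhere in this changelog, and the reply shape is not specified here.

```python
import json

# Default pipe name from the changelog, in Windows named-pipe path form.
PIPE_PATH = r"\\.\pipe\ApexComputerUse"


def encode_command(command: str, **fields: str) -> bytes:
    """Serialise one command as a single JSON line terminated by a newline."""
    message = {"command": command, **fields}
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_reply(line: bytes) -> dict:
    """Parse one newline-terminated JSON reply."""
    return json.loads(line.decode("utf-8"))


def send(pipe, command: str, **fields: str) -> dict:
    """Write one command to an open pipe and read back one reply line."""
    pipe.write(encode_command(command, **fields))
    pipe.flush()
    return decode_reply(pipe.readline())
```

On Windows the pipe can be opened like a file, for example `with open(PIPE_PATH, "r+b", buffering=0) as pipe: send(pipe, "windows")`. Because the server keeps state per connection, a find followed by an exec should go over the same open handle.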

[0.5.0] — 2026-04-06

Added

AI multimodal command set (MtmdHelper integration) — exposes the existing MtmdHelper class through all remote interfaces.
CommandRequest extended with ModelPath, MmProjPath, and Prompt fields.
ai command in CommandProcessor with five sub-actions:
- init — load the LLM and multimodal projector from disk (model= + proj= paths).
- status — report whether the model is loaded and which modalities it supports.
- describe — capture the current UI element and ask the vision model to describe it (optional prompt=).
- file — send an image or audio file from disk to the model (value=<path>, optional prompt=).
- ask — ask an arbitrary question about the current UI element (prompt= required).
HTTP endpoints for AI commands: GET /ai/status; POST /ai/init, /ai/describe, /ai/file, /ai/ask.
Telegram /ai command — same sub-action set via action=<sub> key-value syntax.
Updated help command output to list all ai sub-actions.
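
The sub-actions above differ in which parameters they need and in HTTP method. The helper below is a sketch that encodes those rules so a client can fail early. The parameter names (model, proj, value, prompt) are taken from the list above, but using them as JSON body keys is an assumption, and any paths passed in are the caller's own.

```python
from typing import Dict, Tuple

# Required parameters per sub-action, as listed in the changelog entry.
REQUIRED: Dict[str, Tuple[str, ...]] = {
    "init": ("model", "proj"),
    "status": (),
    "describe": (),
    "file": ("value",),
    "ask": ("prompt",),
}


def ai_request(action: str, **params: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (method, path, body) for an ai sub-action, validating required params."""
    if action not in REQUIRED:
        raise ValueError(f"unknown ai sub-action: {action}")
    missing = [name for name in REQUIRED[action] if not params.get(name)]
    if missing:
        raise ValueError(f"{action} requires: {', '.join(missing)}")
    method = "GET" if action == "status" else "POST"
    return method, f"/ai/{action}", dict(params)
```

For example, `ai_request("describe", prompt="What is on this button?")` yields a POST to /ai/describe with the optional prompt in the body.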

[0.4.0] — 2026-04-06

Added

HTTP REST server (HttpCommandServer) — control the application via curl on a configurable port (default 8080). Endpoints: GET /status, /windows, /elements, /help; POST /find, /execute, /ocr.
Telegram bot (TelegramController) — same command set over Telegram. Supports /find, /exec, /ocr, /status, /windows, /elements, /help. Key=value argument syntax with quoted multi-word values.
CommandProcessor — shared command engine used by both remote interfaces. Auto-accepts fuzzy window/element matches (no UI prompts in remote mode). Fires OnLog events forwarded to the form's status box.
Remote Control group box in the UI — start/stop HTTP server and Telegram bot with live status indicators.
FlaUIHelper.ListWindowTitles() — returns titles of all open windows.
FlaUIHelper.ListElements(Window, ControlType?) — lists all elements in a window with optional ControlType filter.
README.md — full usage documentation including curl examples and Telegram command reference.
CHANGELOG.md — this file.
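
The Telegram key=value syntax with quoted multi-word values can be parsed with shell-style tokenisation. The sketch below is a Python illustration of that syntax, not the TelegramController implementation; the sample argument names are only examples.

```python
import shlex
from typing import Dict, Tuple


def parse_command(text: str) -> Tuple[str, Dict[str, str]]:
    """Split '/cmd key=value key="two words"' into (cmd, {key: value})."""
    tokens = shlex.split(text)
    if not tokens or not tokens[0].startswith("/"):
        raise ValueError("expected a command such as /find")
    args: Dict[str, str] = {}
    for token in tokens[1:]:
        key, separator, value = token.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        args[key] = value
    return tokens[0][1:], args


command, arguments = parse_command('/find window=Notepad name="Text Editor"')
```

One caveat of this approach: shlex in its default mode treats backslashes as escapes, so a Windows path passed as a value would need quoting or a different tokeniser.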

[0.3.0] — 2026-04-06

Added

OCR (OcrHelper) — captures any UI element and runs Tesseract OCR on it.
- OcrElement — capture and recognise.
- OcrElementAndSave — capture, save image to disk, then recognise (useful for debugging).
- OcrElementRegion — OCR a sub-rectangle of the element.
- OcrFile — OCR an existing image file.
tessdata\eng.traineddata bundled in project and copied to output on build.
OCR actions available in the Any Element action group in the UI.

[0.2.0] — 2026-04-06

Added

Fuzzy window matching — tries exact match, then contains, then Levenshtein closest. Prompts for approval on non-exact matches.
Fuzzy element matching — same three-tier logic, applied to AutomationId or Name.
Search Type combo — filters element search by ControlType. The "All" option searches every type without restriction and is never passed to FlaUI as a ControlType value.
Levenshtein distance implementation in FlaUIHelper.
FlaUIHelper.FindWindowFuzzy and FlaUIHelper.FindElementFuzzy returning match metadata (exact vs fuzzy, matched value).
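
The three-tier matching described above (exact, then contains, then Levenshtein closest) can be sketched as follows. This is a Python illustration rather than the FlaUIHelper code; case-insensitive comparison and the returned tier label are assumptions.

```python
from typing import Optional, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def find_fuzzy(query: str, candidates: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return (match, tier), where tier is 'exact', 'contains' or 'levenshtein'."""
    if not candidates:
        return None
    needle = query.lower()
    for candidate in candidates:
        if candidate.lower() == needle:
            return candidate, "exact"
    for candidate in candidates:
        if needle in candidate.lower():
            return candidate, "contains"
    closest = min(candidates, key=lambda c: levenshtein(needle, c.lower()))
    return closest, "levenshtein"


titles = ["Untitled - Notepad", "Calculator", "Settings"]
```

Returning the tier alongside the match is what allows a caller to prompt for approval only when the match was not exact.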

Changed

Form height extended to accommodate the new Search Type row.

[0.1.0] — 2026-04-06

Added

Initial AI computer use application (WinForms) targeting .NET 10.
FlaUIHelper class wrapping FlaUI UIA3 for all common WPF/WinForms control interactions:
- Button, TextBox, PasswordBox, Label, ComboBox, CheckBox, RadioButton, ListBox, ListView, DataGrid, TreeView, Menu/MenuItem, TabControl, Slider, ProgressBar, Hyperlink.
- Mouse operations: click, right-click, double-click, hover, drag & drop, scroll.
- Keyboard: type, send key, shortcuts (Ctrl+A/C/X/V/Z).
- Text: select all, copy, cut, paste, undo, clear, insert at caret.
- Value/RangeValue patterns, ExpandCollapse, ScrollItem, Transform.
- Screenshots via FlaUI.Core.Capturing.
- Retry.WhileNull for waiting on dynamic elements.
- Window operations: move, resize, minimize, maximize, restore, close.
- Focus: SetFocus, GetFocusedElement.
Form UI with:
- Window Name, AutomationId, Element Name fields.
- Control Type picker (action groups) and Action picker.
- Value/Index field for parameterised actions.
- Find Element, Execute Action, Clear Log buttons.
- Timestamped output log.
Designer-compatible Form1.Designer.cs (standard generated format, no lambdas or helpers inside InitializeComponent).
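
The Retry.WhileNull item above refers to polling until a dynamic element appears. The same idea, sketched in Python as an analogue rather than the FlaUI API: the function name, default timeout, and polling interval here are all illustrative.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_while_none(probe: Callable[[], Optional[T]],
                     timeout: float = 5.0,
                     interval: float = 0.1) -> Optional[T]:
    """Call probe repeatedly until it returns a non-None value or the timeout elapses.

    Returns the first non-None result, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Using a monotonic clock keeps the timeout correct even if the system clock is adjusted while waiting.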

