Computer Using Agent (CUA)

This repository contains a lightweight Model Context Protocol (MCP) server that gives an AI agent the ability to operate a desktop just like a human: move and click the mouse, type on the keyboard, and capture screenshots. The server is designed to be launched from Cursor, VS Code, or any other MCP‑aware client.

Capabilities

Mouse movement and clicks (left, right, middle)
Drag-and-drop gestures
Scroll wheel control (vertical and horizontal)
Keyboard typing (including configurable delays)
Key presses and hotkeys
Fullscreen or region screenshots returned as base64‑encoded PNGs
Health checks (ping) so clients can verify connectivity

Project layout

├── cua_server.py      # JSON-RPC server that exposes desktop control tools
├── requirements.txt   # Python dependencies for desktop automation
└── README.md          # This file

Prerequisites

Python 3.10 or newer
A desktop session (the automation libraries need an active display)
The following Python packages (install with pip install -r requirements.txt):
- pyautogui
- pillow

Note: pyautogui depends on native tools (python3-xlib, scrot, or their equivalents) on some Linux distributions. Make sure those prerequisites are installed system-wide.

Running the server

python cua_server.py

The server speaks JSON-RPC 2.0 over standard input/output. Once launched, it waits for requests on stdin and streams responses on stdout. An MCP client (such as Cursor or VS Code) should manage the process lifecycle and exchange messages over pipes.
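For clients that manage the process themselves, the following is a minimal sketch of the pipe exchange. It assumes newline-delimited JSON (one request per line, one response per line). The helper names `start_server` and `call` are illustrative and not part of this repository.

```python
import json
import subprocess
import sys

# Command used to launch the server; adjust the path to your checkout.
SERVER_COMMAND = [sys.executable, "cua_server.py"]


def start_server(command):
    """Spawn a child process with text-mode pipes on stdin and stdout."""
    return subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def call(proc, method, params=None, request_id=1):
    """Send one JSON-RPC request and read one reply line.

    Assumption: the server frames messages as one JSON object per line.
    """
    request = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        request["params"] = params
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()  # without a flush the request may stay buffered
    line = proc.stdout.readline()
    if not line:
        raise RuntimeError("server closed stdout before replying")
    return json.loads(line)
```

A typical session would call `proc = start_server(SERVER_COMMAND)`, then `call(proc, "ping")`, and finally close `proc.stdin` and wait for the process to exit.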

Sample automation script

This repo ships with a small driver (run_actions.py) that reads a JSON file and replays the actions through the server. First, ensure the dependencies are installed:

python3 -m pip install --user -r requirements.txt

Then execute the sample script (it spawns the server automatically):

python run_actions.py sample_actions.json

sample_actions.json demonstrates a short sequence: move the cursor, click, type text, and capture a screenshot (using the "command": "s" shorthand). Feel free to duplicate the file and adjust the coordinates, keystrokes, and screenshot path to suit your workflow.

For a more involved example, sample_vim_build.json opens a terminal (Ctrl+Alt+T), launches Vim, writes a tiny C program, saves it, compiles it with gcc, and runs the binary. To see drag-and-scroll gestures in action, run sample_drag_scroll.yaml.

Available JSON commands:

move, click, drag, scroll, type, press, hotkey, s (screenshot)
wait (or sleep), with a seconds value, pauses between steps; useful while windows are opening.

drag actions expect start and end objects with x/y coordinates, plus optional button, duration, and moveDuration keys.

scroll actions require an amount (positive scrolls up/left, negative down/right), an optional axis (vertical by default), and optional x/y coordinates to move the cursor before scrolling.
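The drag and scroll shapes described above can be sketched as Python dictionaries with a small sanity check. This assumes each action names itself with a "command" key (as the "command": "s" shorthand suggests) and that axis takes the strings "vertical" or "horizontal". Both are assumptions, and `check_action` is a hypothetical helper, not part of run_actions.py.

```python
def check_action(action):
    """Return a list of problems found in a drag or scroll action dict."""
    problems = []
    command = action.get("command")

    if command == "drag":
        # start and end must each be an object carrying x and y
        for key in ("start", "end"):
            point = action.get(key)
            if not isinstance(point, dict) or not all(k in point for k in ("x", "y")):
                problems.append(f"drag needs a '{key}' object with x and y")
    elif command == "scroll":
        if "amount" not in action:
            problems.append("scroll needs an 'amount'")
        # axis is optional and defaults to vertical
        if action.get("axis", "vertical") not in ("vertical", "horizontal"):
            problems.append("axis must be 'vertical' or 'horizontal'")
    return problems


drag = {
    "command": "drag",
    "start": {"x": 100, "y": 200},
    "end": {"x": 400, "y": 200},
    "button": "left",  # optional
    "duration": 0.5,   # optional
}
scroll = {"command": "scroll", "amount": -3, "axis": "vertical"}  # negative scrolls down

print(check_action(drag))                   # []
print(check_action(scroll))                 # []
print(check_action({"command": "scroll"}))  # ["scroll needs an 'amount'"]
```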

Example CLI usage

You can exercise the server manually using another terminal:

printf '%s\n' \
  '{"jsonrpc":"2.0","id":"1","method":"ping"}' \
  '{"jsonrpc":"2.0","id":"2","method":"click","params":{"x":400,"y":400,"button":"left"}}' | \
python cua_server.py

Each JSON-RPC request must include:

jsonrpc: "2.0"
id: a unique string or number (mirrored in the response)
method: one of the supported commands (ping, click, move, drag, scroll, type, press, hotkey, screenshot)
params: optional dictionary with method-specific parameters
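As an illustration of these fields, the sketch below builds requests with a fresh id each time and checks that a response mirrors the id of its request. `make_request` and `belongs_to` are hypothetical helper names. The counter is just one simple way to keep ids unique.

```python
import itertools
import json

_next_id = itertools.count(1)


def make_request(method, params=None):
    """Build a request dict with a fresh id; params is omitted when not given."""
    request = {"jsonrpc": "2.0", "id": next(_next_id), "method": method}
    if params is not None:
        request["params"] = params
    return request


def belongs_to(response, request):
    """True when the response mirrors the id of the given request."""
    return response.get("id") == request["id"]


ping = make_request("ping")
click = make_request("click", {"x": 400, "y": 400, "button": "left"})
print(json.dumps(ping))  # {"jsonrpc": "2.0", "id": 1, "method": "ping"}
print(json.dumps(click))
```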

Response schema

Successful responses follow:

{
  "jsonrpc": "2.0",
  "id": "2",
  "result": {
    "ok": true,
    "data": { "...": "..." }
  }
}

Errors follow JSON-RPC's standard shape:

{
  "jsonrpc": "2.0",
  "id": "2",
  "error": {
    "code": 400,
    "message": "Invalid parameter: missing x coordinate",
    "data": { "...": "..." }
  }
}
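A client can fold both shapes into one helper that returns the payload on success and raises on failure. This is a minimal sketch: `CuaError` and `unwrap` are hypothetical names, the success payload in the demo is invented for illustration, and treating a result whose ok is not true as an error is an assumption.

```python
class CuaError(Exception):
    """Raised when the server answers with a JSON-RPC error object."""

    def __init__(self, code, message, data=None):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.data = data


def unwrap(response):
    """Return result.data for a success, or raise CuaError for an error."""
    if "error" in response:
        err = response["error"]
        raise CuaError(err.get("code"), err.get("message"), err.get("data"))
    result = response["result"]
    if not result.get("ok", False):
        # Assumption: a result without ok=true is treated as a failure.
        raise CuaError(None, "result.ok was not true", result)
    return result.get("data")


# Illustrative payloads; the data contents are made up for the demo.
success = {"jsonrpc": "2.0", "id": "2", "result": {"ok": True, "data": {"x": 400, "y": 400}}}
failure = {
    "jsonrpc": "2.0",
    "id": "2",
    "error": {"code": 400, "message": "Invalid parameter: missing x coordinate"},
}

print(unwrap(success))
try:
    unwrap(failure)
except CuaError as exc:
    print(exc)  # [400] Invalid parameter: missing x coordinate
```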

Screenshot payload

result.data.image contains a base64-encoded PNG. Example:

{
  "result": {
    "ok": true,
    "data": {
      "width": 1920,
      "height": 1080,
      "image": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  }
}
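To turn that payload back into an image file, a client can decode the base64 string and write the bytes to disk. The sketch below uses only the standard library and checks the PNG signature before saving. `decode_screenshot` and `save_screenshot` are hypothetical helper names.

```python
import base64

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(response):
    """Return (width, height, png_bytes) from a screenshot response."""
    data = response["result"]["data"]
    png = base64.b64decode(data["image"])
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("image field did not decode to a PNG")
    return data["width"], data["height"], png


def save_screenshot(response, path):
    """Write the decoded PNG to disk and return the number of bytes written."""
    _, _, png = decode_screenshot(response)
    with open(path, "wb") as handle:
        return handle.write(png)
```

Calling `save_screenshot(response, "screenshot.png")` on a parsed screenshot response would then produce a file that any image viewer can open.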

Security considerations

Scope tightly: Only grant MCP clients you trust access to this server; it controls your desktop.
Desktop lock: The server cannot bypass a locked screen. If the display is locked, input operations will fail silently.
Failsafe: If automation misbehaves, press Ctrl+C in the hosting terminal to stop the server, or move the mouse into a corner of the screen to trigger pyautogui's built-in fail-safe (enabled by default in pyautogui).

Extending the agent

Add more input primitives (e.g., clipboard access)
Add contextual awareness (OCR, UI element recognition)
Layer in higher-level workflows (e.g., “open browser and navigate to URL”)

Pull requests and suggestions are welcome. With this foundation, you can iterate toward a more capable Computer Using Agent tailored to your workflow.
