This repository contains a lightweight Model Context Protocol (MCP) server that gives an AI agent the ability to operate a desktop just like a human: move and click the mouse, type on the keyboard, and capture screenshots. The server is designed to be launched from Cursor, VS Code, or any other MCP‑aware client.
- Mouse movement and clicks (left, right, middle)
- Drag-and-drop gestures
- Scroll wheel control (vertical and horizontal)
- Keyboard typing (including configurable delays)
- Key presses and hotkeys
- Fullscreen or region screenshots returned as base64‑encoded PNGs
- Health checks (ping) so clients can verify connectivity
├── cua_server.py # JSON-RPC server that exposes desktop control tools
├── requirements.txt # Python dependencies for desktop automation
└── README.md # This file
- Python 3.10 or newer
- A desktop session (the automation libraries need an active display)
- The following Python packages (install with pip install -r requirements.txt): pyautogui and pillow
Note: pyautogui depends on native tools (python3-xlib, scrot, or their equivalents) on some Linux distributions. Make sure those prerequisites are installed system-wide.
python cua_server.py
The server speaks JSON-RPC 2.0 over standard input/output. Once launched, it waits for requests on stdin and streams responses on stdout. An MCP client (such as Cursor or VS Code) should manage the process lifecycle and exchange messages over pipes.
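As a rough illustration of that exchange, the sketch below spawns a process, writes one request per line to its stdin, and reads one reply line from its stdout. It assumes newline-delimited framing (one JSON message per line), which this README does not state explicitly. The helper names build_request, call, and start_server are hypothetical and are not part of cua_server.py.

```python
import json
import subprocess
import sys


def build_request(method, params=None, request_id=1):
    """Serialize one JSON-RPC 2.0 request as a single line of text."""
    request = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        request["params"] = params
    return json.dumps(request)


def call(proc, method, params=None, request_id=1):
    """Send one request to the process and return its decoded reply.

    Assumption: the server answers each request with exactly one line.
    """
    proc.stdin.write(build_request(method, params, request_id) + "\n")
    proc.stdin.flush()
    line = proc.stdout.readline()
    if not line:
        raise RuntimeError("server closed stdout without replying")
    return json.loads(line)


def start_server(command=(sys.executable, "cua_server.py")):
    """Spawn the server with pipes attached to its stdin and stdout."""
    return subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
```

A client would typically call start_server() once, issue requests such as call(proc, "ping"), and close proc.stdin when it is done so the server can exit.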
This repo ships with a small driver (run_actions.py) that reads a JSON file and replays the actions through the server. First, ensure the dependencies are installed:
python3 -m pip install --user -r requirements.txt
Then execute the sample script (it spawns the server automatically):
python run_actions.py sample_actions.json
sample_actions.json demonstrates a short sequence: move the cursor, click, type text, and capture a screenshot (using the "command": "s" shorthand). Feel free to duplicate the file and adjust the coordinates, keystrokes, and screenshot path to suit your workflow.
For a more involved example, sample_vim_build.json opens a terminal (Ctrl+Alt+T), launches Vim, writes a tiny C program, saves it, compiles it with gcc, and runs the binary. To see drag-and-scroll gestures in action, run sample_drag_scroll.yaml.
Available JSON commands:
- move, click, drag, scroll, type, press, hotkey, s (screenshot)
- wait (or sleep) with seconds to pause between steps; useful while windows open.
drag actions expect start and end objects with x/y coordinates, plus optional button, duration, and moveDuration keys.
scroll actions require an amount (positive scrolls up/left, negative down/right), an optional axis (vertical by default), and optional x/y coordinates to move the cursor before scrolling.
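The drag, scroll, and wait shapes described above can be sketched as small Python builders that emit only the keys mentioned in this README. Several details are assumptions: that an action file is a top-level JSON list of steps, that each step names its action under a "command" key (as the "command": "s" shorthand suggests), and that the axis values are spelled "vertical" and "horizontal". The function names are hypothetical.

```python
import json


def point(x, y):
    """Return a coordinate object with x/y keys."""
    return {"x": x, "y": y}


def drag_action(start, end, button=None, duration=None, move_duration=None):
    """Build a drag step; optional keys are emitted only when provided."""
    action = {"command": "drag", "start": point(*start), "end": point(*end)}
    optional = {"button": button, "duration": duration, "moveDuration": move_duration}
    action.update({key: value for key, value in optional.items() if value is not None})
    return action


def scroll_action(amount, axis=None, x=None, y=None):
    """Build a scroll step; positive amounts scroll up/left, negative down/right."""
    if axis not in (None, "vertical", "horizontal"):  # assumed spellings
        raise ValueError(f"unknown axis: {axis!r}")
    action = {"command": "scroll", "amount": amount}
    optional = {"axis": axis, "x": x, "y": y}
    action.update({key: value for key, value in optional.items() if value is not None})
    return action


def wait_action(seconds):
    """Build a pause step, useful while windows open."""
    return {"command": "wait", "seconds": seconds}


# Assumed file layout: a plain list of steps, replayed in order.
steps = [
    drag_action((100, 200), (400, 200), button="left", duration=0.5),
    wait_action(1),
    scroll_action(-3, axis="vertical", x=400, y=300),
]
print(json.dumps(steps, indent=2))
```

Compare the printed output with sample_actions.json before relying on it, since the real driver may expect a different wrapper around the steps.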
You can exercise the server manually using another terminal:
printf '%s\n' \
'{"jsonrpc":"2.0","id":"1","method":"ping"}' \
'{"jsonrpc":"2.0","id":"2","method":"click","params":{"x":400,"y":400,"button":"left"}}' | \
python cua_server.py
Each JSON-RPC request must include:
- jsonrpc: "2.0"
- id: a unique string or number (mirrored in the response)
- method: one of the supported commands (ping, click, move, type, press, hotkey, screenshot)
- params: optional dictionary with method-specific parameters
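A client may want to check these fields before sending anything. The following is a minimal sketch of such a check, not the server's own validation logic: validate_request is a hypothetical helper, and its method set is copied from the list above.

```python
# Methods listed in this README; the server itself may accept others.
SUPPORTED_METHODS = {"ping", "click", "move", "type", "press", "hotkey", "screenshot"}


def validate_request(request):
    """Return a list of problems found in a decoded request (empty if none)."""
    problems = []
    if request.get("jsonrpc") != "2.0":
        problems.append('jsonrpc must be "2.0"')
    request_id = request.get("id")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(request_id, bool) or not isinstance(request_id, (str, int, float)):
        problems.append("id must be a string or number")
    if request.get("method") not in SUPPORTED_METHODS:
        problems.append(f"unsupported method: {request.get('method')!r}")
    if "params" in request and not isinstance(request["params"], dict):
        problems.append("params must be a dictionary when present")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem in a request at once.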
Successful responses follow:
{
"jsonrpc": "2.0",
"id": "2",
"result": {
"ok": true,
"data": { "...": "..." }
}
}
Errors follow JSON-RPC's standard shape:
{
"jsonrpc": "2.0",
"id": "2",
"error": {
"code": 400,
"message": "Invalid parameter: missing x coordinate",
"data": { "...": "..." }
}
}
For screenshot requests, result.data.image contains a base64-encoded PNG. Example:
{
"result": {
"ok": true,
"data": {
"width": 1920,
"height": 1080,
"image": "iVBORw0KGgoAAAANSUhEUgAA..."
}
}
}
- Scope tightly: Only grant MCP clients you trust access to this server; it controls your desktop.
- Desktop lock: The server cannot bypass a locked screen. If the display is locked, input operations will fail silently.
- Failsafe: If automation misbehaves, stop it by pressing Ctrl+C in the hosting terminal or by using your system's mouse failsafe gesture (for example, moving the mouse manually).
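On the client side, the success, error, and screenshot reply shapes documented earlier in this README can be handled with a few lines of Python. This is only a sketch under the assumption that replies look exactly like those examples. ServerError, unwrap, and screenshot_bytes are hypothetical names.

```python
import base64

# Every PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


class ServerError(Exception):
    """Raised when a reply carries a JSON-RPC error object."""


def unwrap(reply):
    """Return result.data from a successful reply, or raise ServerError."""
    if "error" in reply:
        error = reply["error"]
        raise ServerError(f"{error.get('code')}: {error.get('message')}")
    result = reply.get("result", {})
    if not result.get("ok"):
        raise ServerError("reply has no successful result")
    return result.get("data", {})


def screenshot_bytes(reply):
    """Decode result.data.image and confirm it looks like a PNG."""
    data = unwrap(reply)
    png = base64.b64decode(data["image"])
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("decoded image is not a PNG")
    return png
```

The returned bytes can be written straight to disk, for example with pathlib.Path("shot.png").write_bytes(...), where the file name is arbitrary.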
- Add more input primitives (e.g., clipboard access)
- Add contextual awareness (OCR, UI element recognition)
- Layer in higher-level workflows (e.g., “open browser and navigate to URL”)
Pull requests and suggestions are welcome. With this foundation, you can iterate toward a more capable Computer Using Agent tailored to your workflow.