SankirthGunnam/humanio

Computer Using Agent (CUA)

This repository contains a lightweight Model Context Protocol (MCP) server that gives an AI agent the ability to operate a desktop just like a human: move and click the mouse, type on the keyboard, and capture screenshots. The server is designed to be launched from Cursor, VS Code, or any other MCP‑aware client.

Capabilities

  • Mouse movement and clicks (left, right, middle)
  • Drag-and-drop gestures
  • Scroll wheel control (vertical and horizontal)
  • Keyboard typing (including configurable delays)
  • Key presses and hotkeys
  • Fullscreen or region screenshots returned as base64‑encoded PNGs
  • Health checks (ping) so clients can verify connectivity

Project layout

├── cua_server.py      # JSON-RPC server that exposes desktop control tools
├── requirements.txt   # Python dependencies for desktop automation
└── README.md          # This file

Prerequisites

  • Python 3.10 or newer
  • A desktop session (the automation libraries need an active display)
  • The following Python packages (install with pip install -r requirements.txt):
    • pyautogui
    • pillow

Note: pyautogui depends on native tools (python3-xlib, scrot, or their equivalents) on some Linux distributions. Make sure those prerequisites are installed system-wide.

Running the server

python cua_server.py

The server speaks JSON-RPC 2.0 over standard input/output. Once launched, it waits for requests on stdin and streams responses on stdout. An MCP client (such as Cursor or VS Code) should manage the process lifecycle and exchange messages over pipes.
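The pipe-based exchange described above can be sketched from the client side as follows. This is a minimal illustration, not the way Cursor or VS Code actually manage the process. It assumes the server reads one JSON object per line and answers with one JSON object per line, which the README does not state explicitly. The helper names `start_server`, `exchange`, and `stop_server` are hypothetical.

```python
import json
import subprocess
import sys


def start_server(command=None):
    """Spawn the server with stdin/stdout connected to pipes."""
    if command is None:
        # Assumption: cua_server.py is in the current working directory.
        command = [sys.executable, "cua_server.py"]
    return subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line-buffered so each request is flushed promptly
    )


def exchange(proc, request):
    """Write one request line and block until one response line arrives."""
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    line = proc.stdout.readline()
    if not line:
        raise RuntimeError("server closed stdout before replying")
    return json.loads(line)


def stop_server(proc):
    """Close stdin so the server sees EOF, then wait for it to exit."""
    proc.stdin.close()
    proc.wait(timeout=5)
    proc.stdout.close()


# Usage (needs a desktop session and the real server script):
#   server = start_server()
#   print(exchange(server, {"jsonrpc": "2.0", "id": 1, "method": "ping"}))
#   stop_server(server)
```

Closing stdin is used here as the shutdown signal. Whether the real server exits on end-of-file is an assumption, so a production client may need to fall back to `proc.terminate()`.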

Sample automation script

This repo ships with a small driver (run_actions.py) that reads a JSON file and replays the actions through the server. First, ensure the dependencies are installed:

python3 -m pip install --user -r requirements.txt

Then execute the sample script (it spawns the server automatically):

python run_actions.py sample_actions.json

sample_actions.json demonstrates a short sequence: move the cursor, click, type text, and capture a screenshot (using the "command": "s" shorthand). Feel free to duplicate the file and adjust the coordinates, keystrokes, and screenshot path to suit your workflow.

For a more involved example, sample_vim_build.json opens a terminal (Ctrl+Alt+T), launches Vim, writes a tiny C program, saves it, compiles it with gcc, and runs the binary. To see drag-and-scroll gestures in action, run sample_drag_scroll.yaml.

Available JSON commands:

  • move, click, drag, scroll, type, press, hotkey, s (screenshot)
  • wait (or sleep) with a seconds value to pause between steps, which is useful while windows are opening.

drag actions expect start and end objects with x/y coordinates, plus optional button, duration, and moveDuration keys.

scroll actions require an amount (positive scrolls up/left, negative down/right), an optional axis (vertical by default), and optional x/y coordinates to move the cursor before scrolling.
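To make the drag, scroll, and wait parameters concrete, the sketch below builds a small action list in Python and serializes it to JSON. The key names (`command`, `start`, `end`, `button`, `duration`, `amount`, `axis`, `x`, `y`, `seconds`) come from the descriptions above. Everything else is an assumption: that the action file is a top-level JSON array, that every action uses the `command` key, and the coordinate values themselves. Compare against `sample_actions.json` before relying on it. `missing_keys` is a hypothetical helper, not part of `run_actions.py`.

```python
import json

# Hypothetical action list; the file layout is assumed, not confirmed.
actions = [
    {
        "command": "drag",
        "start": {"x": 200, "y": 300},
        "end": {"x": 600, "y": 300},
        "button": "left",   # optional
        "duration": 0.5,    # optional
    },
    {"command": "wait", "seconds": 1.0},
    {
        "command": "scroll",
        "amount": -3,        # negative scrolls down (or right)
        "axis": "vertical",  # optional, vertical is the default
        "x": 600,            # optional: move here before scrolling
        "y": 300,
    },
    {"command": "s"},  # screenshot shorthand
]

# Keys the README describes as required for each command.
REQUIRED_KEYS = {
    "drag": ("start", "end"),
    "scroll": ("amount",),
    "wait": ("seconds",),
}


def missing_keys(action):
    """Return the required keys that an action dictionary lacks."""
    required = REQUIRED_KEYS.get(action.get("command"), ())
    return [key for key in required if key not in action]


# Serialized form that could be written to an action file.
actions_json = json.dumps(actions, indent=2)
```

Checking the actions with `missing_keys` before replaying them catches an incomplete drag or scroll step without moving the cursor.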

Example CLI usage

You can exercise the server manually from a terminal by piping one request per line into it:

printf '%s\n' \
  '{"jsonrpc":"2.0","id":"1","method":"ping"}' \
  '{"jsonrpc":"2.0","id":"2","method":"click","params":{"x":400,"y":400,"button":"left"}}' | \
python cua_server.py

Each JSON-RPC request must include:

  • jsonrpc: "2.0"
  • id: a unique string or number (mirrored in the response)
  • method: one of the supported commands (ping, click, move, type, press, hotkey, screenshot)
  • params: optional dictionary with method-specific parameters
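A small helper can assemble requests with the four fields listed above. This is a sketch under stated assumptions: `make_request` is a hypothetical name, the method set is copied from the list above and may not match every server version, and an incrementing integer is just one way to keep `id` unique.

```python
import itertools

# Method names as listed in this README; adjust to match your server.
SUPPORTED_METHODS = {
    "ping", "click", "move", "type", "press", "hotkey", "screenshot",
}

# Monotonic counter so every request in this process gets a distinct id.
_next_id = itertools.count(1)


def make_request(method, params=None):
    """Build a JSON-RPC 2.0 request dictionary for the given method."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported method: {method}")
    request = {"jsonrpc": "2.0", "id": next(_next_id), "method": method}
    if params is not None:
        # params is optional, so it is only included when supplied.
        request["params"] = params
    return request
```

For example, `make_request("click", {"x": 400, "y": 400, "button": "left"})` produces a dictionary equivalent to the second request in the CLI example, apart from the `id` value.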

Response schema

Successful responses have this shape:

{
  "jsonrpc": "2.0",
  "id": "2",
  "result": {
    "ok": true,
    "data": { "...": "..." }
  }
}

Errors follow JSON-RPC's standard shape:

{
  "jsonrpc": "2.0",
  "id": "2",
  "error": {
    "code": 400,
    "message": "Invalid parameter: missing x coordinate",
    "data": { "...": "..." }
  }
}
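On the client side, the two response shapes can be handled in one place. The following is a minimal sketch, assuming each response arrives as a single JSON string. `RpcError` and `unwrap` are hypothetical names, not part of this repository.

```python
import json


class RpcError(Exception):
    """Raised when a response carries an "error" object."""

    def __init__(self, code, message, data=None):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.message = message
        self.data = data


def unwrap(raw_response, expected_id):
    """Return result.data from a success response, or raise on an error."""
    response = json.loads(raw_response)
    if response.get("id") != expected_id:
        raise ValueError(f"response id {response.get('id')!r} does not match request")
    if "error" in response:
        err = response["error"]
        raise RpcError(err["code"], err["message"], err.get("data"))
    # Success responses nest the payload under result.data.
    return response["result"].get("data")
```

Checking `id` first matters when several requests are in flight, since the README states that the server mirrors the request `id` in its response.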

Screenshot payload

result.data.image contains a base64-encoded PNG. Example:

{
  "result": {
    "ok": true,
    "data": {
      "width": 1920,
      "height": 1080,
      "image": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  }
}

Security considerations

  • Scope tightly: Only grant MCP clients you trust access to this server; it controls your desktop.
  • Desktop lock: The server cannot bypass a locked screen. If the display is locked, input operations will fail silently.
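  • Failsafe: If automation misbehaves, stop the server with Ctrl+C in the hosting terminal. pyautogui also has a built-in fail-safe that, when left enabled (its default), aborts the current action once the cursor is moved into a corner of the screen.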

Extending the agent

  • Add more input primitives (e.g., clipboard access)
  • Add contextual awareness (OCR, UI element recognition)
  • Layer in higher-level workflows (e.g., “open browser and navigate to URL”)

Pull requests and suggestions are welcome. With this foundation, you can iterate toward a more capable Computer Using Agent tailored to your workflow.
