Phone Agent

An autonomous AI agent that controls your iPhone using Google's ADK (Agent Development Kit) and Gemini's spatial understanding capabilities. The agent can see, understand, and interact with your phone's interface through screen mirroring and computer vision.

Demo

https://github.com/user-attachments/assets/your-video-id-here.mp4

To update: Edit this README on GitHub and drag-drop your MP4 file here to replace this placeholder

Features

Autonomous Phone Control: AI-powered agent that can navigate and interact with iOS interfaces
Visual Understanding: Uses Gemini's spatial understanding to locate and identify UI elements
Natural Language Control: Give high-level instructions and let the agent figure out the steps
Smart Navigation: Automatic screenshot analysis, pointer movement, clicking, scrolling, and text entry
Loop Control: Built-in pause and human intervention capabilities for safety

How It Works

Screen Mirroring: Uses macOS's screen capture to monitor a mirrored iPhone display (via QuickTime or similar)
Visual Analysis: Gemini Pro analyzes screenshots to locate UI elements using spatial understanding
Action Execution: PyAutoGUI controls the mouse/keyboard to interact with the mirrored display
Agent Loop: Google ADK orchestrates the autonomous decision-making and tool usage
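
The capture step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the helper names build_screencapture_command and capture are hypothetical, and the crop box is assumed to be (left, top, right, bottom) in screen coordinates. The -x (no sound) and -R (capture rectangle) flags belong to the macOS screencapture command.

```python
import subprocess
from typing import List, Tuple

CropBox = Tuple[int, int, int, int]  # assumed order: (left, top, right, bottom)


def build_screencapture_command(crop_box: CropBox, out_path: str) -> List[str]:
    """Build a macOS `screencapture` command for one screen region."""
    left, top, right, bottom = crop_box
    width, height = right - left, bottom - top
    if width <= 0 or height <= 0:
        raise ValueError(f"empty crop box: {crop_box}")
    # -x suppresses the shutter sound; -R expects x,y,width,height.
    return ["screencapture", "-x", f"-R{left},{top},{width},{height}", out_path]


def capture(crop_box: CropBox, out_path: str) -> None:
    """Run the capture (macOS only; needs the screen recording permission)."""
    subprocess.run(build_screencapture_command(crop_box, out_path), check=True)
```

Keeping the command construction separate from the subprocess call makes the region arithmetic easy to check without a Mac or a mirrored phone.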

Prerequisites

Grant Cursor (or whichever app you launch the agent from) the following macOS permissions:

'Allow the application to control your computer'
'Screen & System Audio Recording'

Platform Support: Built for macOS; other platforms may not be supported.

Requirements:

macOS (for the screencapture command)
Python 3.12+
Google Cloud Project with Vertex AI enabled
iPhone with screen mirroring capability

Installation

Clone the repository

git clone <repository-url>
cd PhoneAgent

Set up environment

# Copy environment template
cp phone_agent/.env.local phone_agent/.env

Configure your environment variables in phone_agent/.env:
- GOOGLE_CLOUD_PROJECT: Your GCP project ID
- GOOGLE_CLOUD_LOCATION: GCP region (default: us-central1)
- GEMINI_PRO_MODEL: Model to use (e.g., gemini-2.5-pro-preview-03-25)
- PHONE_PASSWORD: Your phone's password (optional, for automated unlock)
- Screen bounds and crop settings (adjust based on your display setup)
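
A loader for these variables might look like the sketch below. The Settings class and load_settings function are hypothetical names, and the "left,top,right,bottom" string format for IMAGE_CROP_BOX is an assumption; check phone_agent/.env.local for the format the project actually expects.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    project: str
    location: str
    model: str
    phone_password: Optional[str]
    crop_box: Tuple[int, int, int, int]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables listed above; fail early if a required one is missing."""
    # Assumption: IMAGE_CROP_BOX is stored as "left,top,right,bottom".
    parts = [int(p) for p in env["IMAGE_CROP_BOX"].split(",")]
    if len(parts) != 4:
        raise ValueError("IMAGE_CROP_BOX needs four comma-separated integers")
    return Settings(
        project=env["GOOGLE_CLOUD_PROJECT"],
        location=env.get("GOOGLE_CLOUD_LOCATION", "us-central1"),
        model=env["GEMINI_PRO_MODEL"],
        # An empty or missing password means automated unlock is disabled.
        phone_password=env.get("PHONE_PASSWORD") or None,
        crop_box=(parts[0], parts[1], parts[2], parts[3]),
    )
```

Passing a plain dict instead of os.environ makes it straightforward to validate a configuration before starting the agent.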

Install dependencies

pip install -e .

Configuration

Screen Mirroring Setup

Connect your iPhone to your Mac
Open QuickTime Player > File > New Movie Recording
Select your iPhone as the camera source
Position the mirrored display on your screen

Display Calibration

Update these values in your .env file based on your screen setup:

MIRRORING_X_BOUND / MIRRORING_Y_BOUND: The width/height of your mirrored phone display
IMAGE_CROP_BOX: Screen region to capture (left, top, right, bottom)
HOME_BUTTON_X / HOME_BUTTON_Y: Position of the home button/gesture area
SCREEN_Y_INVERSION: Screen height for coordinate transformation
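
To show why these values matter, here is a minimal sketch of mapping a detected element to a click position. Gemini's spatial understanding reports boxes as [y_min, x_min, y_max, x_max] normalized to a 0-1000 range; the function name box_center_to_screen is hypothetical, and the crop box is assumed to be (left, top, right, bottom) as described for IMAGE_CROP_BOX. Any extra transformation the project applies with SCREEN_Y_INVERSION is not modeled here.

```python
from typing import Sequence, Tuple


def box_center_to_screen(
    box_2d: Sequence[float],
    crop_box: Tuple[int, int, int, int],
) -> Tuple[int, int]:
    """Map the centre of a 0-1000 normalized box to absolute screen coordinates."""
    y_min, x_min, y_max, x_max = box_2d
    left, top, right, bottom = crop_box
    # Centre of the box as a fraction of the screenshot's width and height.
    fx = (x_min + x_max) / 2 / 1000
    fy = (y_min + y_max) / 2 / 1000
    # Scale to the captured region, then offset by its top-left corner.
    return round(left + fx * (right - left)), round(top + fy * (bottom - top))
```

The resulting (x, y) pair is the kind of value that could be handed to pyautogui.click(x, y). If clicks land consistently off target, the crop box most likely does not match where the mirrored window sits on screen.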

Usage

from phone_agent.agent import root_agent

# Start the agent with a task
root_agent.run("Open Safari and search for weather")

The agent will:

Take a screenshot of the current state
Analyze the UI using Gemini's spatial understanding
Determine the next action needed
Execute pointer movements, clicks, scrolls, or text entry
Verify the result and continue until the task is complete
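
The steps above amount to an observe, decide, act loop. The sketch below is a simplified stand-in, not how Google ADK is implemented: Action, run_loop, and the three callables are hypothetical names, and in the real project the ADK agent and its tools fill these roles.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str            # e.g. "click", "scroll", "type", or "done"
    explanation: str     # why the agent chose this step
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


def run_loop(
    task: str,
    observe: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    act: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe, decide, and act until "done" or the step budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = observe()                      # capture the current state
        action = decide(task, screenshot, history)  # analyze and pick the next step
        history.append(action)
        if action.kind == "done":
            break
        act(action)                                 # execute; the next pass verifies
    return history
```

The max_steps budget is a design choice in this sketch: it keeps a confused agent from tapping around indefinitely.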

Project Structure

phone_agent/
├── agent.py           # Main agent configuration
├── tools/
│   ├── navigation.py  # Mouse/keyboard control (click, scroll, type)
│   ├── vision.py      # Screenshot capture and UI element detection
│   └── loop.py        # Agent loop control (pause, human intervention)
└── prompts/
    ├── agent.j2       # Main agent instructions
    └── vision.j2      # Vision model instructions

Known Issues

todo

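pip install google-adk[eval] fails in shells such as zsh, which treat the square brackets as a glob pattern. Quote the package spec instead: pip install 'google-adk[eval]'. The docs still need to be updated.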

Safety Features

Human Intervention: Agent can request human assistance when uncertain
Pause Loop: Ability to pause the autonomous loop
Explanation Required: All tool calls require explanation of intent
Password Protection: Prompts before entering sensitive information
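
Two of these safeguards can be illustrated with a short sketch. The names requires_explanation and enter_secret are hypothetical and the project's tools may enforce these rules differently (for example through the prompt templates); the code only shows the general idea of gating a tool call.

```python
import functools
from typing import Any, Callable


def requires_explanation(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Reject tool calls that do not state their intent."""

    @functools.wraps(tool)
    def wrapper(*args: Any, explanation: str = "", **kwargs: Any) -> Any:
        if not explanation.strip():
            raise ValueError(f"{tool.__name__} called without an explanation")
        return tool(*args, **kwargs)

    return wrapper


def enter_secret(
    secret: str,
    type_text: Callable[[str], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Type a secret only after a human approves; return whether it was typed."""
    if not confirm("Allow the agent to enter the phone password?"):
        return False
    type_text(secret)
    return True
```

Here confirm could be a terminal prompt and type_text a keyboard tool; both are passed in so the gate can be exercised without touching the phone.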

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

See LICENSE file for details.
