An autonomous AI agent that controls your iPhone using Google's ADK (Agent Development Kit) and Gemini's spatial understanding capabilities. The agent can see, understand, and interact with your phone's interface through screen mirroring and computer vision.
- Autonomous Phone Control: AI-powered agent that can navigate and interact with iOS interfaces
- Visual Understanding: Uses Gemini's spatial understanding to locate and identify UI elements
- Natural Language Control: Give high-level instructions and let the agent figure out the steps
- Smart Navigation: Automatic screenshot analysis, pointer movement, clicking, scrolling, and text entry
- Loop Control: Built-in pause and human intervention capabilities for safety
- Screen Mirroring: Uses macOS's screen capture to monitor a mirrored iPhone display (via QuickTime or similar)
- Visual Analysis: Gemini Pro analyzes screenshots to locate UI elements using spatial understanding
- Action Execution: PyAutoGUI controls the mouse/keyboard to interact with the mirrored display
- Agent Loop: Google ADK orchestrates the autonomous decision-making and tool usage
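
As a rough illustration of the screen-capture step, the sketch below builds a macOS screencapture invocation for a fixed screen region. The helper names and the idea of deriving the region from a (left, top, right, bottom) box are assumptions made for illustration; the project's actual capture code may work differently.

```python
import subprocess


def build_screencapture_command(crop_box, out_path):
    """Build a screencapture command for a (left, top, right, bottom) box.

    Hypothetical helper, not taken from the project.
    """
    left, top, right, bottom = crop_box
    width, height = right - left, bottom - top
    if width <= 0 or height <= 0:
        raise ValueError("crop box must have positive width and height")
    # -x suppresses the shutter sound; -R captures the rectangle x,y,w,h.
    return ["screencapture", "-x", f"-R{left},{top},{width},{height}", out_path]


def capture(crop_box, out_path="screen.png"):
    """Run the capture (macOS only; requires screen recording permission)."""
    subprocess.run(build_screencapture_command(crop_box, out_path), check=True)
    return out_path
```

Keeping command construction separate from execution makes the box-to-rectangle conversion easy to check on any platform, while capture() itself only works on macOS.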
Grant Cursor the following macOS permissions:
- 'Allow the application to control your computer'
- 'Screen & System Audio Recording'
Platform Support: Built for macOS; other platforms may not be supported.
Requirements:
- macOS (for the screencapture command)
- Python 3.12+
- Google Cloud Project with Vertex AI enabled
- iPhone with screen mirroring capability
- Clone the repository

      git clone <repository-url>
      cd PhoneAgent

- Set up environment

      # Copy environment template
      cp phone_agent/.env.local phone_agent/.env

- Configure your environment variables in phone_agent/.env:
  - GOOGLE_CLOUD_PROJECT: Your GCP project ID
  - GOOGLE_CLOUD_LOCATION: GCP region (default: us-central1)
  - GEMINI_PRO_MODEL: Model to use (e.g., gemini-2.5-pro-preview-03-25)
  - PHONE_PASSWORD: Your phone's password (optional, for automated unlock)
  - Screen bounds and crop settings (adjust based on your display setup)
- Install dependencies

      pip install -e .

- Connect your iPhone to your Mac
- Open QuickTime Player > File > New Movie Recording
- Select your iPhone as the camera source
- Position the mirrored display on your screen
Update these values in your .env file based on your screen setup:
- MIRRORING_X_BOUND / MIRRORING_Y_BOUND: The width/height of your mirrored phone display
- IMAGE_CROP_BOX: Screen region to capture (left, top, right, bottom)
- HOME_BUTTON_X / HOME_BUTTON_Y: Position of the home button/gesture area
- SCREEN_Y_INVERSION: Screen height for coordinate transformation
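
To make the role of the crop box concrete, here is a minimal sketch of how a point returned by the vision model could be mapped onto the screen. It assumes IMAGE_CROP_BOX is stored as a comma-separated "left,top,right,bottom" string and that the model reports points normalized to a 0-1000 range within the screenshot; both are assumptions, and the function names are hypothetical rather than taken from the project.

```python
import os


def parse_crop_box(raw):
    """Parse "left,top,right,bottom" into a tuple of four ints."""
    parts = [int(p.strip()) for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 values, got {len(parts)}")
    return tuple(parts)


def load_crop_box(env=os.environ, default="0,0,400,800"):
    """Read IMAGE_CROP_BOX, falling back to an illustrative default."""
    return parse_crop_box(env.get("IMAGE_CROP_BOX", default))


def to_screen_point(norm_x, norm_y, crop_box, scale=1000):
    """Map a point normalized to [0, scale] inside the crop box to screen pixels."""
    left, top, right, bottom = crop_box
    x = left + (norm_x / scale) * (right - left)
    y = top + (norm_y / scale) * (bottom - top)
    return round(x), round(y)
```

If clicks land consistently offset from their targets, the crop box most likely does not match the position of the mirrored display, so that is the first value to re-measure.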

    from phone_agent.agent import root_agent

    # Start the agent with a task
    root_agent.run("Open Safari and search for weather")

The agent will:
- Take a screenshot of the current state
- Analyze the UI using Gemini's spatial understanding
- Determine the next action needed
- Execute pointer movements, clicks, scrolls, or text entry
- Verify the result and continue until the task is complete
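
The steps above can be sketched as a plain Python loop. This is a simplified stand-in for illustration only: in the project the loop is orchestrated by Google ADK, and the Action type and the take_screenshot, decide, and execute callables below are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    kind: str                      # e.g. "click", "scroll", "type", or "done"
    explanation: str               # why the agent chose this action
    payload: Optional[dict] = None


def run_loop(task, take_screenshot, decide, execute, max_steps=20):
    """Screenshot -> decide -> execute, until "done" or max_steps is reached."""
    history = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = decide(task, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history
```

Because a fresh screenshot is taken at the start of every iteration, each decision sees the result of the previous action, which is how the verification step falls out of the loop. The max_steps cap keeps a confused agent from running indefinitely.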
phone_agent/
├── agent.py              # Main agent configuration
├── tools/
│   ├── navigation.py     # Mouse/keyboard control (click, scroll, type)
│   ├── vision.py         # Screenshot capture and UI element detection
│   └── loop.py           # Agent loop control (pause, human intervention)
└── prompts/
    ├── agent.j2          # Main agent instructions
    └── vision.j2         # Vision model instructions
Note: pip install google-adk[eval] fails in shells such as zsh (the macOS default), which treat the square brackets as a glob pattern. Quote the package spec instead: pip install 'google-adk[eval]'.
- Human Intervention: Agent can request human assistance when uncertain
- Pause Loop: Ability to pause the autonomous loop
- Explanation Required: All tool calls require explanation of intent
- Password Protection: Prompts before entering sensitive information
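
One way to picture how these safeguards fit together is a small gate that every tool call passes through. The sketch below is an assumption about the general shape, not the project's implementation; SafetyGate, LoopPaused, and the confirm callback are hypothetical names.

```python
class LoopPaused(Exception):
    """Raised when a tool is called while the loop is paused."""


class SafetyGate:
    def __init__(self, confirm):
        self.paused = False
        self._confirm = confirm  # callable(str) -> bool, e.g. a human prompt

    def call(self, tool, *args, explanation="", sensitive=False, **kwargs):
        # Refuse to act while a human has paused the loop.
        if self.paused:
            raise LoopPaused("loop is paused; resume before calling tools")
        # Every tool call must state its intent.
        if not explanation.strip():
            raise ValueError("every tool call needs an explanation of intent")
        # Sensitive actions (such as typing a password) need human approval.
        if sensitive and not self._confirm(explanation):
            raise PermissionError("human declined the sensitive action")
        return tool(*args, **kwargs)
```

For interactive use, confirm could be as simple as a function that prints the explanation and reads a yes/no answer from the terminal.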
Contributions are welcome! Please feel free to submit a Pull Request.
See LICENSE file for details.