A Python-based agentic AI assistant that uses Google's Gemini API to control a computer in real time from voice or text input. The application communicates with Gemini over a bidirectional WebSocket connection and combines multi-threaded audio processing, computer vision, and OCR to stay aware of what is on screen.
- 🗣️ Dual input modes: voice commands or text input
- 🖥️ Real-time screen analysis with OCR and UI element detection
- 🖱️ Precise computer control capabilities:
  - Mouse movement and click simulation
  - Keyboard input and hotkey combinations
  - Application launching and window management
- 🎯 Intelligent command interpretation using Gemini AI
- 🔄 Real-time audio processing with noise reduction
- 📊 Adaptive silence detection for better voice recognition
- 🤖 WebSocket-based real-time communication with Gemini AI
- 🔍 OCR-powered text recognition on screen
- 🎯 UI element detection and classification
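
The feature list mentions adaptive silence detection but does not say how it works. The following is only a minimal sketch of one common approach, not the project's actual implementation: it assumes 16-bit little-endian mono PCM frames and tracks a running noise floor, treating a frame as silent when its RMS level stays below a multiple of that floor. The names `rms`, `AdaptiveSilenceDetector`, `ratio`, `alpha`, and `initial_floor` are hypothetical.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class AdaptiveSilenceDetector:
    """Hypothetical helper: flags a frame as silent relative to a running noise floor."""

    def __init__(self, ratio: float = 2.0, alpha: float = 0.05, initial_floor: float = 100.0):
        self.ratio = ratio                # how far above the floor counts as speech
        self.alpha = alpha                # smoothing factor for the noise-floor estimate
        self.noise_floor = initial_floor  # assumed starting level, refined over time

    def is_silent(self, frame: bytes) -> bool:
        level = rms(frame)
        silent = level < self.noise_floor * self.ratio
        if silent:
            # Adapt only on quiet frames so speech does not inflate the floor.
            self.noise_floor = (1 - self.alpha) * self.noise_floor + self.alpha * level
        return silent
```

Updating the floor only on quiet frames is a design choice: in a noisy room the threshold can rise slowly, while a long spoken command does not push it up.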
- Python 3.8 or higher
- Tesseract OCR (per-platform installation is covered in the steps below)
- Working microphone (for voice input)
- Google Gemini API key
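
Some of the prerequisites above can be checked before running the assistant. The snippet below is a small sketch rather than part of the project: the function name `check_prerequisites` is hypothetical, and it assumes the Tesseract executable is named `tesseract` and is expected on PATH. It does not check the microphone or the API key.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 8)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.%d or higher is required" % min_version)
    # shutil.which returns None when the executable cannot be found on PATH.
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```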
- Clone the repository:
    git clone <repository-url>
    cd computer-use
- Create and activate a virtual environment:
    python -m venv venv
    # On Windows
    .\venv\Scripts\activate
    # On Linux/macOS
    source venv/bin/activate
- Install Python dependencies:
    pip install -r requirements.txt
- Install Tesseract OCR:
  - Windows: Download and install from UB-Mannheim's repository
  - Linux:
      sudo apt-get install tesseract-ocr
  - macOS:
      brew install tesseract
- Create a .env file in the project root:
    GOOGLE_API_KEY=your_api_key_here
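
To illustrate how a key stored in `.env` can reach the application, here is a minimal standard-library sketch. It is an assumption, not the project's code: the project may well use a package such as python-dotenv instead, and the helper names `load_env_file` and `get_api_key` are hypothetical. It handles only plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict; surrounding quotes are stripped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer a variable already in the environment, then fall back to the .env file."""
    key = os.environ.get("GOOGLE_API_KEY") or load_env_file(path).get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env")
    return key
```

Keeping the key in `.env` rather than in source code makes it easier to exclude from version control.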
- Activate the virtual environment if not already activated:
    # On Windows
    .\venv\Scripts\activate
    # On Linux/macOS
    source venv/bin/activate
- Run the application:
    python voice_control.py
- Choose your preferred input mode when prompted:
  - voice: Use voice commands
  - text: Use text input
- "Open Spotify and play my favorite playlist"
- "Check for the cheapest flights from LA to New York"
- "Open Chrome and search for the weather"
- "Find and click the WiFi icon"
- "Minimize all windows"
- "Type out an email response"
- Ensure Tesseract OCR is properly installed and its executable is on your PATH
- For voice mode, ensure your microphone is properly configured
- The assistant works best in a quiet environment for voice commands
- Some commands may require administrator privileges
- Screenshots are analyzed in real time for UI element detection
Contributions are welcome! Feel free to submit issues and pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.