A Python-based AI agentic assistant that uses Google's Gemini AI to provide natural language computer control through voice commands. This assistant can understand context from both voice and screen content to perform complex computer operations.

iamkhalid2/computer-use

🎙️ Computer - Use

Python 3.8+ License: MIT

A Python-based agentic AI assistant that uses Google's Gemini API for real-time computer control through voice or text input. The application communicates with Gemini over a bidirectional WebSocket connection and combines multi-threaded audio processing, computer vision, and OCR for contextual awareness.
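
To illustrate what "bidirectional" means here, the following is a minimal sketch in which one task sends messages while another receives replies concurrently. It does not use the project's code or the Gemini API: `FakeSocket`, `sender`, `receiver`, and `run_session` are hypothetical names, and the in-memory queue only stands in for a real WebSocket connection.

```python
import asyncio


class FakeSocket:
    """In-memory stand-in for a WebSocket connection (hypothetical, illustration only)."""

    def __init__(self):
        self._inbox = asyncio.Queue()

    async def send(self, message):
        # Pretend the remote side answers every message with an acknowledgement.
        await self._inbox.put(f"ack:{message}")

    async def recv(self):
        return await self._inbox.get()


async def sender(sock, outgoing):
    # Push each outgoing message without waiting for its reply.
    for message in outgoing:
        await sock.send(message)


async def receiver(sock, expected, replies):
    # Collect replies as they arrive, independently of the sender.
    for _ in range(expected):
        replies.append(await sock.recv())


async def run_session(outgoing):
    sock = FakeSocket()
    replies = []
    # Both directions run concurrently on the same connection.
    await asyncio.gather(
        sender(sock, outgoing),
        receiver(sock, len(outgoing), replies),
    )
    return replies


def demo():
    return asyncio.run(run_session(["open chrome", "search weather"]))
```

In a real client, the two coroutines would share an actual WebSocket object instead of `FakeSocket`; the concurrency pattern stays the same.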

✨ Features

  • 🗣️ Dual input modes: voice commands or text input
  • 🖥️ Real-time screen analysis with OCR and UI element detection
  • 🖱️ Precise computer control capabilities:
    • Mouse movement and click simulation
    • Keyboard input and hotkey combinations
    • Application launching and window management
  • 🎯 Intelligent command interpretation using Gemini AI
  • 🔄 Real-time audio processing with noise reduction
  • 📊 Adaptive silence detection for better voice recognition
  • 🤖 WebSocket-based real-time communication with Gemini AI
  • 🔍 OCR-powered text recognition on screen
  • 🎯 UI element detection and classification
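
As an illustration of the adaptive silence detection feature, here is a minimal sketch of one common approach: track a running noise floor and treat a frame as silence when its RMS level stays below a multiple of that floor. This is not taken from the project's source; the class name, the parameters (`margin`, `smoothing`, `initial_floor`), and their default values are assumptions chosen for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of a frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class AdaptiveSilenceDetector:
    """Hypothetical detector: the threshold follows the ambient noise level."""

    def __init__(self, margin=2.0, smoothing=0.1, initial_floor=100.0):
        self.margin = margin            # how far above the floor counts as speech
        self.smoothing = smoothing      # weight of the newest silent frame
        self.noise_floor = initial_floor

    def is_silence(self, frame):
        level = rms(frame)
        silent = level < self.noise_floor * self.margin
        if silent:
            # Adapt only on silent frames so speech does not raise the floor.
            self.noise_floor = (
                (1 - self.smoothing) * self.noise_floor + self.smoothing * level
            )
        return silent
```

A recorder could call `is_silence` on each captured frame and stop listening after a run of consecutive silent frames.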

🛠️ Prerequisites

  1. Python 3.8 or higher
  2. Tesseract OCR (see the installation steps below)
  3. Working microphone (for voice input)
  4. Google Gemini API key

🚀 Installation

  1. Clone the repository:
git clone <repository-url>
cd computer-use
  2. Create and activate a virtual environment:
python -m venv venv
# On Windows
.\venv\Scripts\activate
# On Linux/macOS
source venv/bin/activate
  3. Install Python dependencies:
pip install -r requirements.txt
  4. Install Tesseract OCR:

    • Windows: Download and install from UB-Mannheim's repository
    • Linux: sudo apt-get install tesseract-ocr
    • macOS: brew install tesseract
  5. Create a .env file in the project root:

GOOGLE_API_KEY=your_api_key_here
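
For reference, a minimal sketch of how a script might read `GOOGLE_API_KEY` from the `.env` file using only the standard library. How the project actually loads the key is not shown in this README, so `parse_env` and `load_api_key` are hypothetical helpers, and the precedence of an existing environment variable over the file is an assumption.

```python
import os


def parse_env(text):
    """Parse KEY=value lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_key(path=".env", environ=None):
    """Return GOOGLE_API_KEY from the environment or, failing that, from the file."""
    environ = os.environ if environ is None else environ
    if environ.get("GOOGLE_API_KEY"):
        return environ["GOOGLE_API_KEY"]
    try:
        with open(path, encoding="utf-8") as handle:
            values = parse_env(handle.read())
    except FileNotFoundError:
        values = {}
    key = values.get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to the .env file")
    return key
```

Failing early with a clear message avoids a less obvious authentication error later, when the first request is sent.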

🏃‍♂️ Usage

  1. Activate the virtual environment if not already activated:
# On Windows
.\venv\Scripts\activate
# On Linux/macOS
source venv/bin/activate
  2. Run the application:
python voice_control.py
  3. Choose your preferred input mode when prompted:
    • voice: Use voice commands
    • text: Use text input
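
The mode prompt in the last step can be sketched as follows. This is an illustration rather than the code in `voice_control.py`: the function names, the prompt wording, and the retry behaviour on invalid input are assumptions.

```python
def parse_mode(answer):
    """Normalize the user's answer; return 'voice', 'text', or None if invalid."""
    mode = answer.strip().lower()
    if mode in ("voice", "text"):
        return mode
    return None


def choose_mode(ask=input):
    """Keep asking until the user enters a supported mode."""
    while True:
        mode = parse_mode(ask("Input mode (voice/text): "))
        if mode is not None:
            return mode
        print("Please enter 'voice' or 'text'.")
```

Passing the prompt function as a parameter (`ask`) keeps the loop testable without a terminal.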

💡 Example Commands

  • "Open Spotify and play my favorite playlist"
  • "Check for the cheapest flights from LA to New York"
  • "Open Chrome and search for the weather"
  • "Find and click the WiFi icon"
  • "Minimize all windows"
  • "Type out an email response"

⚠️ Notes

  • Ensure Tesseract OCR is properly installed and on your PATH
  • For voice mode, ensure your microphone is properly configured
  • The assistant works best in a quiet environment for voice commands
  • Some commands may require administrator privileges
  • Screenshots are analyzed in real time for UI element detection
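
The first note can be checked before launching the assistant. The sketch below uses the standard library's `shutil.which` to look for the `tesseract` executable on PATH; the `preflight` function is a hypothetical helper and is not part of the project.

```python
import shutil


def preflight(which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # shutil.which returns the executable's full path, or None if it is not on PATH.
    if which("tesseract") is None:
        problems.append("Tesseract OCR not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(f"Warning: {problem}")
```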

🤝 Contributing

Contributions are welcome! Feel free to submit issues and pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
