An AI voice agent that can see the screen, press buttons, and type text, i.e. autonomously operate a computer on your behalf. It uses open-source local or self-hosted models, so it does not depend on external APIs and does not share personal data with third parties.
Key components:
- Voice Activity Detection (VAD)
- Speech to Text (STT)
- Language Model (LM)
- Text to Speech (TTS)
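
A minimal sketch of how these four components could be chained, assuming each one is a plain callable. All names here (`VoicePipeline`, `is_speech`, `transcribe`, `generate`, `synthesize`, `demo_pipeline`) are hypothetical placeholders for illustration, not the project's actual API, and the demo stubs stand in for real models:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VoicePipeline:
    """Hypothetical wiring of VAD -> STT -> LM -> TTS (illustrative only)."""
    is_speech: Callable[[bytes], bool]   # VAD: does this audio frame contain speech?
    transcribe: Callable[[bytes], str]   # STT: utterance audio -> text
    generate: Callable[[str], str]       # LM: user text -> reply text
    synthesize: Callable[[str], bytes]   # TTS: reply text -> audio

    def run(self, frames: List[bytes]) -> List[bytes]:
        """Group speech frames into utterances and produce one reply per utterance."""
        replies: List[bytes] = []
        buffer = bytearray()
        for frame in frames:
            if self.is_speech(frame):
                buffer.extend(frame)          # still inside an utterance
            elif buffer:
                replies.append(self._respond(bytes(buffer)))  # silence ends it
                buffer.clear()
        if buffer:                            # flush a trailing utterance
            replies.append(self._respond(bytes(buffer)))
        return replies

    def _respond(self, utterance: bytes) -> bytes:
        text = self.transcribe(utterance)
        reply = self.generate(text)
        return self.synthesize(reply)


def demo_pipeline() -> VoicePipeline:
    """Stub components: 'audio' is ASCII bytes, silence is all-zero bytes."""
    return VoicePipeline(
        is_speech=lambda frame: any(frame),
        transcribe=lambda audio: audio.decode("ascii"),
        generate=lambda prompt: prompt.upper(),
        synthesize=lambda text: text.encode("ascii"),
    )


if __name__ == "__main__":
    frames = [b"\x00\x00", b"hi", b" there", b"\x00", b"ok"]
    print(demo_pipeline().run(frames))
```

Keeping each stage behind a simple callable means any local or self-hosted model could be swapped in without touching the loop. A real agent would likely stream audio and run the stages concurrently rather than processing a finished list of frames.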
I got inspired by the following projects:
I am not forking them because my vision is slightly different, and I want to learn how to build such a system from scratch.