
Interesting VC ideas #4

Open
3 tasks
zavocc opened this issue Aug 6, 2024 · 3 comments
Labels
enhancement New feature or request

Comments


zavocc commented Aug 6, 2024

Looking at
https://guide.pycord.dev/voice/receiving

It appears it's also possible for the bot to receive audio. With this, it's possible to create a back-and-forth voice mode for Gemini models.

The outline of this implementation would be:

  1. Use TTS and STT engines, preferably as fast and cost-effective as possible if cloud-hosted, ideally natural-sounding and with minimal latency
  2. Use wavelink as the voice engine by streaming the TTS output, in a separate Cog
  3. Handle multiple requests per server if possible

The flow would be:

  1. Initiate, possibly through a slash command like /call, and lock the session to the specific user who initiated the command
  2. Record the voice conversation with a timeout
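The two steps above can be sketched with the recording API from the Pycord guide linked earlier. This is a minimal sketch, not a finished implementation: the 30-second timeout, the `active_caller` lock, and the `on_recorded` callback name are assumptions for illustration.

```python
import asyncio
import discord

bot = discord.Bot()
active_caller: dict[int, int] = {}  # guild_id -> user_id holding the session

@bot.slash_command(name="call", description="Start a locked voice session")
async def call(ctx: discord.ApplicationContext):
    # Lock the session to the invoking user (one caller per guild for now).
    if ctx.guild_id in active_caller:
        return await ctx.respond("A voice session is already active.")
    if ctx.author.voice is None:
        return await ctx.respond("Join a voice channel first.")
    active_caller[ctx.guild_id] = ctx.author.id

    vc = await ctx.author.voice.channel.connect()
    # WaveSink collects per-user WAV audio; the callback fires on stop.
    vc.start_recording(discord.sinks.WaveSink(), on_recorded, ctx)
    await ctx.respond("Recording...")
    await asyncio.sleep(30)  # assumed timeout: stop recording after 30 s
    vc.stop_recording()

async def on_recorded(sink: discord.sinks.WaveSink, ctx):
    # sink.audio_data maps user_id -> AudioData; hand it to the STT stage.
    active_caller.pop(ctx.guild_id, None)  # unlock for the next caller
    await sink.vc.disconnect()

# bot.run(...)  # token omitted; this is only a sketch
```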

In the callback function:

  1. The recorded voice is sent to a speech-to-text engine such as Whisper, either via the OpenAI API (paid, faster), Azure Speech Services (free in most cases, requires Azure dependencies), or Hugging Face Spaces (free, slow)... OR use Gemini's native multimodality
  2. The transcription is then used as a prompt to reason and engage (either with GPT or Gemini, with a different system prompt optimized for speech)
  3. Perform checks: if an error occurred in the model, still proceed but speak the error; if there's an error with the speech APIs, abort and ping the user
  4. The output is then sent through a dedicated TTS program and recorded
  5. When no errors occurred, stream it
  6. Unlock, so the command is ready to be used by anyone again
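The callback flow above, including the error-handling rule in step 3 (a model error is spoken, a speech-API error aborts), can be sketched backend-agnostically. The engine callables and function names here are hypothetical; any STT/LLM/TTS backend could be injected.

```python
import asyncio

class SpeechAPIError(Exception):
    """Raised when the STT or TTS engine fails; the session is aborted."""

async def voice_callback(audio, stt, llm, tts, notify_user):
    # 1. Transcribe the recording; a speech-API failure aborts and pings.
    try:
        transcript = await stt(audio)
    except Exception as exc:
        await notify_user(f"Speech recognition failed: {exc}")
        raise SpeechAPIError from exc

    # 2.-3. Reason over the transcript; a model error is spoken, not fatal.
    try:
        reply = await llm(transcript)
    except Exception as exc:
        reply = f"The model returned an error: {exc}"

    # 4.-5. Synthesize the reply and return the audio for streaming.
    try:
        return await tts(reply)
    except Exception as exc:
        await notify_user(f"Speech synthesis failed: {exc}")
        raise SpeechAPIError from exc
```

Injecting the engines keeps the Cog testable with stubs and makes swapping Whisper for Gemini's native multimodality a one-line change.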

Possible limitations and outcomes:

  1. Possible blocking and high latency if not using asynchronous tools
  2. This command may be limited to one person at a time globally, rather than per user or per guild, while the flow is being prototyped and tested
  3. Prone to errors
  4. Chat history/context handling would require code duplicated from the /ask command
  5. Multimodality: a parameter may be added to the slash command, but it would mean copying code from the /ask command with more lines of code
  6. It cannot be initiated through voice; it has to be invoked manually via slash command. This defeats the purpose of a voice mode, but it should serve as a building block for such an implementation

Can this be resolved? Yes, with an approximate 80% success rate.


Goals:

  • #11
  • #12
  • #13
@zavocc zavocc added the enhancement New feature or request label Aug 6, 2024

zavocc commented Aug 7, 2024

OpenAI TTS supports streaming:
https://platform.openai.com/docs/guides/text-to-speech/quickstart
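A minimal sketch based on the linked quickstart, using the streaming response helper from the official `openai` Python SDK; the model/voice choices and output filename are assumptions, and an `OPENAI_API_KEY` is required.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello from the voice session.",
) as response:
    # Chunks arrive as they are synthesized, so playback (or a Discord
    # audio source) can start before the full clip is rendered.
    response.stream_to_file("reply.mp3")
```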


zavocc commented Aug 8, 2024

Implement GuildVoiceMgmt class
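What `GuildVoiceMgmt` would contain isn't specified here; one possible sketch, using only stdlib asyncio, tracks one active session per guild and locks it to the initiating user, matching the /call flow described above. All method names are assumptions.

```python
import asyncio

class GuildVoiceMgmt:
    """Hypothetical sketch: one active voice session per guild,
    locked to the user who initiated /call."""

    def __init__(self):
        self._sessions: dict[int, int] = {}  # guild_id -> owning user_id
        self._lock = asyncio.Lock()

    async def acquire(self, guild_id: int, user_id: int) -> bool:
        """Claim the guild's session; False if someone else holds it."""
        async with self._lock:
            if guild_id in self._sessions:
                return False
            self._sessions[guild_id] = user_id
            return True

    def owns(self, guild_id: int, user_id: int) -> bool:
        """True only for the user who started the session."""
        return self._sessions.get(guild_id) == user_id

    async def release(self, guild_id: int, user_id: int) -> None:
        """Unlock the session so /call is usable by anyone again."""
        async with self._lock:
            if self._sessions.get(guild_id) == user_id:
                del self._sessions[guild_id]
```

The internal `asyncio.Lock` guards against two near-simultaneous /call invocations racing on the same guild entry.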

@zavocc zavocc mentioned this issue Aug 9, 2024

zavocc commented Oct 4, 2024

Realtime API
